
DESIGNING AND EVALUATING DISTRIBUTED POWER PROCESSING SYSTEMS FOR RELIABILITY AND MAINTAINABILITY

Sarah J. Gottlieb-Hecht

Bell Communications Research, Whippany, New Jersey 07981

ABSTRACT

A method for measuring the reliability and maintainability of distributed power-processing systems is presented. The method utilizes a set of figures of merit which faithfully reflect both the service availability of the system and the impact on service due to any possible system failures. The procedure used to calculate these figures of merit, along with their concomitant cost, is explained. The entire method has been incorporated in a new computer program, which automatically evaluates the cost, reliability, and maintainability performance of proposed distributed power-processing designs.

With the benefit of this program, enough experience has been obtained to suggest a number of design effects and design rules for various architectural features of distributed systems.

ADVANTAGES OF DISTRIBUTED POWER PROCESSING

Power processing systems are not an end in themselves. Switching, multiplexing, transmission, and distribution systems are all important telecommunications applications for power processing systems. Among the advantages that distributed powering systems have over bulk systems for these applications are the following:

1. The distributed nature of the application functions themselves, which are typically implemented by a hierarchy of subsystems. In most cases, the logic behind the distributed architecture of the application system suggests a distributed powering system as well.

2. The space, heat, and dc-bussing penalties associated with bulk powering strategies: a powering module which is large enough to serve the entire system will be heavy, hot, and far removed from most of the application subsystems which it serves. These disadvantages constrain the physical design of the application system, and adversely affect its cost, reliability, maintainability, and flexibility.

1. The research presented in this paper was performed by the author prior to 1/1/84, while an employee of Bell Laboratories, Inc.

3. The option for strategic redundancy at vulnerable points, rather than total redundancy: a distributed application system has its critical control functions located high up in its hierarchy of subsystems, where they may be duplicated most efficiently. When the power processing for these critical functions is provided by small collocated powering modules, the latter can be duplicated just as easily. This means of obtaining failure-free high-level system performance is far more convenient and cost effective than duplication of the traditional hot and heavy bulk powering module.

4. The option for "instant" repair:A small plug-in powering module iseasy to replace, and inexpensiveto hold in reserve as backup for aquantity of identical modulesthroughout the system. This candramatically reduce the system'smean-time-to-repair, and therebyreduce not only the costs ofrepair, but also the more costlytrauma of sustained systemfailure.3

5. The option to power down functions when idle: T. M. Taylor [1] has demonstrated how energy costs can be saved without loss of functionality when subsystems which operate only intermittently are powered up only as needed. This benefit is indispensable to applications with limited energy sources, such as battery-powered systems. It is also a critical consideration in view of the shortfall in electrical generating capacity which we must soon confront.4

2. See [2] for a lengthier discussion of some of these issues.

3. Ibid.



6. The option to build the power processing system from available off-the-shelf modules, rather than resorting to customized circuit-level design: this reduces the implementation time and cost, reduces the first cost, and increases the reliability and maintainability.

7. The desire to retain the functional modularity of the application system; the option for modular growth; etc.

Many of these benefits directly impact the reliability and maintainability of the system. Therefore it is necessary to quantify and evaluate the improvement in performance (if any) that may be expected from the choice of a particular distributed power processing system for a particular application.

COMPLEXITIES OF DISTRIBUTED DESIGN ANALYSIS

One may think of a bulk powering system either as a non-distributed system, or as a distributed system with only one level of distribution. By contrast, an extensively distributed powering system will not only perform power processing at any number of application levels, but will often do it by progressive processing steps.

In evaluating traditional bulk powering systems, which are essentially serial in their internal architecture, the set of component parts determines the reliability for all intents and purposes. However, in a distributed system, the architecture plays at least as great a role as the actual components.5 When we consider that the evaluation of a bulk system involves only one architecture, while that of a distributed system must accommodate any number of possibilities, we begin to appreciate the complexity of distributed system analysis.

4. See, inter alia, Sen. Jim McClure, "Forecasting Electrical Demand: Pennywise or Pound-Fuelish", Senate Committee on Energy and Natural Resources, January, 1984.

5. The presence of parallel or redundant elements in a system does not affect its gross architecture as far as reliability is concerned. The placement of redundant elements in parallel is simply equivalent to replacing a less reliable node by a more reliable one.

In addition, with distributed power processing systems, we can entertain important design goals which are not addressed by bulk designs: namely, small failure groups, ease of repair, modularity, flexibility, and space utilization. So, whereas downtime is the singular figure of merit for a bulk system's reliability, and frequency of failure for its maintainability, there is a more comprehensive set of figures of merit which have meaning for distributed systems. And these must be not only measured, but traded off as well.

Finally, there are no hard and fast rules for designing a distributed power processing system. Whereas the traditional bulk system was designed by choosing a bulk powering module which met the powering specs with the lowest cost and the lowest failure rate, the distributed design must consider a multitude of modules, whose individual specs, cost, and failure rate cannot be considered in isolation, or even in simple interrelationships.

EXAMPLE OF A DISTRIBUTED APPLICATION FOR POWER PROCESSING

Although the principles and methods described in this paper apply equally to central office or to customer premises applications, we will use an example from customer premises to illustrate both the opportunities and the complexities attendant to distributed power design.

Let us imagine how we might power a hypothetical PBX of 100 to 500 lines.6 Such a system typically would occupy one or more cabinets of equipment. Each cabinet would hold about six carriers (or shelves), some of which would be used for power equipment, cooling equipment, and system controls. The remainder of the carriers would each accommodate about twenty cards (or boards) of line equipment, with each card serving about 4 to 8 lines. Thus we have a natural hierarchy of "system, cabinet, carrier, card, and line", for both the logical and the physical design of the application system. [See Figure 1].

The powering system for this application must provide a number of dc voltages -- some for the common control processors and memories, some for the distributed control processors and memories, some for the line circuitry, and some for the lines themselves. See Table 1.

6. See [3] for a specific implementation of a distributed power processing architecture in an electronic PBX product.

FIGURE 1

Typical Single Cabinet Configuration for Small Communications System

[Figure 1 is a cabinet diagram; not reproducible in text.]

A bulk power processing system for this application would require almost two shelves in the bottom of the bay, and heavy copper bus bars for dc distribution to the rest of the cabinet. It can be made fairly reliable for a considerable price, but each failure that did occur would bring down the entire application system. [See Figure 2].

A distributed power processing system would perform dc conversion much closer to the sites of dc service, with no need for dc-bussing between shelves. Power processing modules would be small and interchangeable. Most failures would affect only a few lines. For functions affecting many lines, failures of their small powering modules would be less frequent than failures of the large bulk system, and duplication of these small modules is entirely feasible. [See Figure 3].

Now, how do we quantify and evaluate the benefits (if any) to be obtained from a distributed power processing design? With reliability, for example, the lower total failure rate of the bulk system in Figure 2 is misleading; the distributed system in Figure 3, with five times the failure rate, is two, four, or even six times as reliable. It is also more easily maintained. [See Table 2].

In addition, how can we "improve" a distributed design, when the failure rate of a single component is of minor effect compared to the architectural framework? How can we quantify any alleged improvement?

TABLE 1

Typical Power Processing Needs for a Small Communications System

------------------------------------

COMMON CONTROL (for total system):

  Bus interface & memory control
  Tape interface
  CPU & RAM
  Switch interface
  Time division bus
  Etc.

  TOTAL: +5v @ 25A, -5v @ 4A, -48v @ 6A

LINE GROUP CONTROL (for one line card):

  CPU & RAM
  Line circuitry
  Line power (pooled)

  TOTAL: +5v @ 1.5A, -5v @ .2A, -48v @ .8A

FIGURE 2

Typical Bulk Powering Configuration for Cabinet

[Figure 2 is a cabinet diagram; not reproducible in text.]

"improve" athe failure

is of minorarchitecturalquantify any




FIGURE 3

One Possible Distributed Powering Configuration for Cabinet

[Figure 3 is a cabinet diagram; not reproducible in text.]

TABLE 2

Service Outages for Sample Bulk and Distributed Power Processing Systems

------------------------------------

                                  "BULK"              "DIST"

Avg. Per-Line Outage Time:        16 to 26 min/yr     4 to 14 min/yr
Repair Time for Typical Outage:   3 to 5 hours        1/2 to 2 hours
Avg. Failure Rate:                once in ten years   once in two years

METHOD OF ANALYSIS FOR DISTRIBUTED POWERING SYSTEMS

In a bulk system, the most important parameter of a single system component is its failure rate: how many failures will it "add" to the total system failure rate? But in a distributed system, an even more important parameter of a system component is its impact on the system -- in our example, the number of lines which depend on it. This system impact is referred to as a "failure group", and is a fundamental concept in the reliability of distributed systems.

By modelling the distributed system as a "tree" with nodes and branches, we can take the failure groups of the system components into account in a natural way. Every component resides in some node, and all the subsystems which branch from a node are dependent on it. Hence the failure of any component in a node will result in the failure of all lines in all subsystems which branch from that node. Note that a node is a non-distributed subsystem (a single component, or a series of components), while a branch in the form of a subtree is a distributed subsystem. Figure 4 gives the tree model of the distributed system in Figure 3.
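To make the tree model concrete, here is a minimal sketch in Python. The paper describes the model but publishes no code, so all names and fields below are illustrative assumptions, not the author's implementation. A node carries its own component data, and its failure group is everything beneath it:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """One node of the tree model: a non-distributed (series) subsystem."""
    name: str                     # e.g. "rectifier", "dc-dc converter", "line card"
    failure_rate: float           # expected component failures per year in this node
    repair_time: float            # mean time to repair a failure here, in hours
    cost: float = 0.0             # first cost of the components in this node
    lines: int = 0                # lines served directly at this node
    children: List["Node"] = field(default_factory=list)  # branches (subtrees)

def failure_group_size(node: Node) -> int:
    """Lines lost when any component in this node fails: the node's own
    lines plus every line in every subtree branching from it."""
    return node.lines + sum(failure_group_size(c) for c in node.children)
```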

First Figure of Merit: Average Failure Group Size

We see that the failure of a node from which a large branch of the system depends is far more serious than that of a node from which only a small branch depends. In the case of a non-distributed, or bulk, system, the whole system depends on only one node: the failure of any component in the node will result in the failure of the whole system.

FIGURE 4

Tree Model of Distributed Powering Architecture

[Figure 4 is a tree diagram of the "DIST" powering architecture of Figure 3; not reproducible in text.]

Thus we can compare the reliability performances of these systems, if we count not only component failures, but their resulting line failures as well.


This is done by weighting the failure rate of each component subsystem by the size of its failure group. Note that the failure rate depends on the component, while the failure group size depends on the architecture.

When we add up the line failures expected in the system and divide by the number of expected component failures which cause them, we arrive at a weighted average or expected failure group size -- the number of lines which are "expected" to fail in the event of any failure. In a non-distributed system, this number is the same as the total number of system lines -- every system outage will cause failure of all system lines. But in a distributed system, this number is usually much smaller than the total number of system lines.
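In symbols (the notation here is mine, not the paper's): if component subsystem $i$ has failure rate $\lambda_i$ and failure group size $g_i$, the expected failure group size just described is the failure-rate-weighted average

$$\bar{g} = \frac{\sum_i \lambda_i \, g_i}{\sum_i \lambda_i}.$$

For a bulk system every $g_i$ equals the total line count $N$, so $\bar{g} = N$, as noted above.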

It should be noted that this number, the expected or average failure group size, is a global measure of system performance, computed from the complete system data. It is not to be confused with a list of possible failure groups; the latter may readily be discerned from a given system design, but cannot easily be compared between one architecture and another.

As a global measure, this number is an important figure of merit for all applications in which the extent of system failures is at least as serious as their frequency. Transmission and switching are examples of distributed applications where some failures can affect thousands of users, while others might affect only a handful. In general, applications where repairs can be made quickly, easily, and cheaply are not much bothered by frequency of failure, as long as few customers are affected each time.

Second Figure of Merit: Average Per-Line Outage

Another important figure of merit, for any system, is the interruption of service as experienced by a customer. This is usually measured by the average outage time for any line of the system in a given time period (e.g. a year). Outage times for series systems are a result both of failure rate and repair time, and are calculated according to a well-known formula.
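The well-known formula in question is presumably the standard series-system result: if the components have failure rates $\lambda_i$ (failures per year) and mean repair times $r_i$, the expected outage time per year of the series system is approximately

$$T \approx \sum_i \lambda_i \, r_i,$$

valid when each $\lambda_i r_i \ll 1$.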

With a bulk powering system, every line will be down exactly as often and as long as the whole system is down; that is, the average per-line outage time for a bulk system is the same as the total system outage time, and this in turn is the same as the total outage time for all failures in the system.

But with a distributed powering system, most failures will not affect the majority of lines. A line will go down if and only if some node on its path back up to the top of the system tree goes down. Thus its outage time is the same as that of the series system which comprises its path. The paths for different lines in the system will frequently be non-identical, because distributed powering systems are frequently non-symmetric. However, an average per-line outage time is easily obtained by weighting the outage time for each type of path by the number of lines on such a path.
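In the same (assumed) notation: if $n_p$ lines share path $p$, that path has series outage time $T_p$, and the system serves $N$ lines in all, then the average per-line outage time is

$$\bar{T} = \frac{1}{N} \sum_p n_p \, T_p.$$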

Other Figures of Merit

Severe Failures; Total System Failures

As mentioned, for many applications the failure of a few lines may be well tolerated, while the failure of many (simultaneously) may cause customer havoc. Switching and transmission systems are prime examples. For each system, a distinction may be made between the failure group size which it is cost-effective to avoid, and the failure group size which one can afford to ignore. There can be no such distinction with a bulk system, because it has only one failure group -- the whole system. But with a distributed system, there are many possible failure groups.

Therefore it is important to know how often a failure group of this critical size (or larger) will fail, and for how long. This figure of merit may well be the deciding factor between two competing designs whose performance is otherwise comparable. Once the critical size is defined, we can tell from the tree model which branches have at least so many lines, and from which nodes these branches depend. Then we may ignore the failures of all the other nodes, while calculating the system outages for these critical failures.
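Formally (again in my notation, not the paper's): given a critical failure group size $G$, only nodes whose failure group $g_i$ reaches that size contribute, so the severe outage time is approximately

$$T_{\text{severe}} \approx \sum_{i \,:\, g_i \ge G} \lambda_i \, r_i.$$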

The critical failure group size will vary with each application. We may therefore speak of "severe" failures. Total system outage is always severe, and is a significant figure of merit. But lesser outages, even down to a single line, may also be severe in given applications, and would serve as figures of merit in those cases.

However, a failure which will not result in the outage of even a single line cannot be called severe. For example, the failure of a redundant element is never severe; in addition, certain failure modes may be non-severe for certain applications.

The failure rate for a redundant subsystem (i.e., two or more identical elements in parallel) is calculated according to another well-known formula. A redundant subsystem typically has a very high non-severe failure rate, and a very low severe one.
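This is presumably the standard approximation for a duplicated, repairable element: if each of the two units has failure rate $\lambda$ and mean repair time $r$ (with $\lambda r \ll 1$), single-unit (non-severe) failures occur at a rate of about $2\lambda$, while both units are down together at a rate of only about

$$\lambda_{\text{severe}} \approx 2\lambda^2 r,$$

since the second unit must fail during the repair window of the first.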

120

As far as different failure modes for the same element are concerned, their failure rates as well as their effect on the application system are often known to the designer, who can then classify them as severe or non-severe in the system analysis.

In general, a "non-severe" element isone whose function in the system isnon-critical, at least for the length oftime it would take for its failure to bediscovered and repaired. Non-severefailures do not affect the serviceavailability of a system. Unless theirrepair is so difficult as to cause aninadvertent breakdown of the rest of thesystem, they cannot be said to affectthe reliability of the system at all.They do, however, affect themaintainability of the system.

Repairs; Mean-Time-Between-Failures

The failure of a non-severe system element must be repaired in most cases, just as a severe failure must, but perhaps at greater leisure. One would probably wish to replace a failed redundant element, or a failed signal light, or a failed battery cell. Thus the failure of any element in a system, severe or non-severe, will affect its ease of maintenance and repair.

It is the opinion of this author that the significance of mean-time-between-failures (MTBF), as applied to distributed systems, is in the area of maintenance. MTBF tells us how often a system will require repair; it does not tell us how much or how often the system is impaired functionally. This is in contradistinction to bulk systems, for which the MTBF is a plausible indication of functional performance.

The modularity of a distributed system is conducive to ease of maintenance and repair; in addition, the plug-in and self-alarming capabilities of a highly distributed power processing system make it accessible even to non-specialist repair. Single spares can be kept right on the premises for immediate backup of all identical modules in the distributed system. The convenience and economy of such easy repairs may more than compensate for the higher frequency of repair in a system which contains more modules.

When we also remember that easier repair implies better system availability (that is, lower down-time for the user), then we may see why the maintenance performance of a system could be the deciding factor between competing systems.

Therefore one wants to know how often the system will have to be repaired (that is, its frequency of failure), and how many of each system component will have to be replaced. This information is available from the system data via straightforward (but tedious) computations, which have long been standard. One also is interested in how easy these repairs might be: How long would they take? Could they possibly be done by the customer? What spares should be kept on hand? Should they be stocked on the customer premises, or at a local or regional distribution center? The answers to these questions rely on the wisdom of the designer.

Computational Method

All of the algorithms which compute the figures of merit above, from the tree model of the system (architecture plus component data), use a recursive "tree-search" technique which is easily computerized. Namely, each node in the tree is examined, top-to-bottom and left-to-right; a note is made of the failure group size for the node, along with its cost, its severe and non-severe failure rates, and other pertinent data. After all this data has been appropriately processed and stored, the next node is visited, etc. At the end of a path, the search backs up to the previous node and visits any unexplored branches from there. When all branches from that node are explored, it backs up again, etc. Figure 5 indicates the search route for the tree in Figure 4.
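A minimal sketch of such a recursive tree-search, reusing the hypothetical Node structure from the earlier sketch (the author's actual program, with its database records and severe/non-severe bookkeeping, is not published here, so this is only an illustration of the technique):

```python
def analyze(node: Node, totals=None) -> dict:
    """Depth-first, left-to-right walk of the tree: record each node's
    cost, failure rate, and failure group size, then visit its branches;
    when a branch is exhausted, the recursion 'backs up' automatically."""
    if totals is None:
        totals = {"cost": 0.0, "failures_per_yr": 0.0,
                  "line_failures_per_yr": 0.0, "line_outage_hrs_per_yr": 0.0}
    g = failure_group_size(node)            # lines that depend on this node
    totals["cost"] += node.cost
    totals["failures_per_yr"] += node.failure_rate
    totals["line_failures_per_yr"] += node.failure_rate * g
    # outage contributed by this node is felt by all g dependent lines:
    totals["line_outage_hrs_per_yr"] += node.failure_rate * node.repair_time * g
    for child in node.children:             # explore each unexplored branch
        analyze(child, totals)
    return totals

# Derived figures of merit, per the definitions above:
#   avg. failure group size = line_failures_per_yr / failures_per_yr
#   avg. per-line outage    = line_outage_hrs_per_yr / (total system lines)
```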

Of course, no computer search can be made unless the computer program first has the data which defines the system's architecture and its components. This is provided by an interactive computer program, which queries a knowledgeable designer. The designer may call on the program again and again, to alter the system data (either architecture or components or both) and then reanalyze the system. In this way, one may obtain many evaluations of many different systems in just a few hours. Due to the volume of both the data and the computations involved, it would be very difficult to perform even one of these evaluations manually. Moreover, present methods of distributed system evaluation cover only a subset of the relevant figures of merit presented here.

The author has created a computer program which can measure all the quantitative figures of merit discussed in this paper, for as many design alternatives and design iterations as may be necessary. The main features of this program are its command-menu and query-driven user interface; its private user databases maintained by an external database management system; and its recursive search routines through the database records.

RESULTS OF ANALYSES OF DISTRIBUTED POWER PROCESSING SYSTEMS

Table 3 demonstrates some of the performance measures that the computer program can project for competing system designs in order to compare them. One of these designs ("DIST") is the distributed architecture in Figure 3.


FIGURE 5

Order of Search Through Nodes of a Tree Model

[Figure 5 is a tree diagram with the visiting order marked; not reproducible in text.]

TABLE 3

Performance Projections

------------------------------------

                                  "DIST"      "DIST+"

Avg. Per-Line Outage Time:        7 min/yr    5 min/yr
Avg. Failure Group Size:          90 lines    75 lines
Outages by Failure Groups:
  Total System:                   3 min/yr    3 min/yr
  ≥ 150 lines:                    7 min/yr    none
  ≥ 75 lines:                     3 min/yr    6 min/yr
  ≥ 8 lines:                      19 min/yr   23 min/yr
Avg. Annual No. of Replacements
  (Failures/Yr):                  .5          .4
Mean-Time-To-Repair:              1 hour      1 hour
System Cost (For Power):          $4000       $3000

The other ("DIST+") is also adistributed design, but represents animprovement in performance, obtained bybetter symmetry and deeper distribution(see guidelines below).

These sample results have been selected from a collection of over a hundred computerized studies on a dozen different systems, analyzing their first cost, replacement cost, severe failures, repair incidence, average failure group size, and average per-line outage.

Experience in improving overall system performance through the use of distributed power processing architectures has proven the usefulness of computerized studies based on the figures of merit presented in this paper. In addition to comparing reliability performance of competing system designs, they can be used to perform sensitivity analyses to determine the most cost-effective placement of high-reliability components.

Table 4, for example, shows the effect on the system in Figure 3, when its dozens of board-mounted power modules (BMPM's) have a 10% longer MTBF, and when their MTBF is only half as long as originally assumed. The computerized analysis has shown that an improved BMPM is not worth the cost. On the contrary, one could buy cheaper BMPM's, and use the money saved to buy a redundant off-line-switcher (rectifier) for the common control circuitry. Such a tradeoff is financially feasible, and would result in a 15% net reduction of the average per-line outage time for the system.


TABLE 4

Sensitivity Analysis for System "DIST"

--------------------------------------

                                  Avg. Per-Line      Avg. Failure
                                  Outage (min/yr)    Rate (#/yr)

With Current BMPM:                     7                 .5
With 10% Better BMPM:                  7                 .5
With BMPM only 50% as Good:            7.4               .8
With Redund Cntrl Rect:                5.5               .5
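As a rough consistency check on the 15% figure quoted above (assuming, for illustration, that the two modifications combine approximately additively): Table 4 prices the cheaper BMPM's at +0.4 min/yr and the redundant control rectifier at -1.5 min/yr, so

$$7 + 0.4 - 1.5 = 5.9 \ \text{min/yr}, \qquad \frac{7 - 5.9}{7} \approx 15\%.$$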

The results of one design analysis after another have consistently confirmed the validity of a few simple guidelines for optimal design of distributed power processing systems. We illustrate them below by referring to the previous PBX examples, but it should be stressed that the principles of distributed power processing design presented in this paper are equally applicable to central office systems.



1. Observe functional modularity -- use separate power modules for separate functions. Follow the distribution scheme of the application system, which presumably observes the same logic, as closely as possible.

In our example, the logical functions are common control, line card control, and line power. Traffic considerations dictate the pooling of -48v line power, since only some of the lines require power at the same time. Therefore, the system illustrated in Figures 3 and 4 is providing -48v power in half-carrier lots. However, it is also providing +5v on the full carrier level, although this is intended only for the line card processors on the next level down. If this power were processed instead by the board-mounted power modules on each card, it would greatly enhance the reliability of the system. This is one reason for the improved performance of system "DIST+" in Table 3.

2. Collocate the powering modules with the powered modules. Each application subsystem should have its own private or semi-private powering source.

In the example of Figure 3, every powering module is located either on the same carrier or on the same card as the subsystem which it powers. Thus dc-bussing is either completely avoided, or confined to the carrier backplane. There it can be integrated into the manufacture of the piece, eliminating wire-wraps or wire-harnesses.

3. "Spread out" the power processingat each level of distribution, byusing multiple identical poweringmodules for the multiple identicalapplication subsystems on thatlevel. This allows independentsubsystems to surviveindependently of other failures.

It means no artificial pooling of application subsystems, just because they are identical. Unless their combined simultaneous power usage is considerably less than the sum of the individual power requirements, there is no reason to serve them from the same power pool. In our example, spreading out the +5v and -5v supplies across the line cards by using board-mounted power modules is a way of adhering to this principle. Providing all of the +5v for the on-board processors by a single carrier-mounted converter is a violation of the principle.

4. Use symmetric architectural design wherever possible. This means powering identical application subsystems from identical powering modules, with the same number of subsystems assigned to each power module.

This principle is violated by the system in Figures 3 and 4. From Figure 4, we see that the 12 line cards on the right depend on a DC-DC converter plus two rectifiers (actually off-line-switchers), while the 8 line cards on the left depend only on the DC-DC converter plus one rectifier. The first set of powering modules has 1.5 times the failure rate of the second, and supplies 1.5 times the number of lines. Although that may sound reasonable, this non-symmetric architecture will result in 20% more line failures than the symmetric arrangement with 10 line cards on each side, each depending on a DC-DC converter plus one rectifier (see the worked check following these guidelines).

The enhanced reliability of "DIST+" in Table 3 is also partly due to the removal of this asymmetry from "DIST".
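A worked check of the 20% figure in guideline 4 (the converter-to-rectifier failure-rate ratio below is an assumption chosen for illustration; the paper does not state the rates it used). Weighting each card group by the failure rate of its powering path, with $\lambda_c$ for the DC-DC converter and $\lambda_r$ per rectifier, the asymmetric arrangement expects line failures proportional to

$$12(\lambda_c + 2\lambda_r) + 8(\lambda_c + \lambda_r) = 20\lambda_c + 32\lambda_r,$$

while the symmetric 10-and-10 arrangement gives $20(\lambda_c + \lambda_r) = 20\lambda_c + 20\lambda_r$. If, say, the converter fails twice as often as a rectifier ($\lambda_c = 2\lambda_r$), the ratio is $72\lambda_r / 60\lambda_r = 1.2$, i.e. 20% more line failures.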

In concluding, we must note that these guidelines, like all theoretical principles, can only be implemented to the extent that cost and space constraints will allow.

No computer can design the optimal power processing system for any particular application, because the design decisions are not automatic, but highly judgemental. However, a computer program like the one we have described, along with the figures of merit and rules of thumb presented here for effective distributed system architectures, will greatly assist the system designer in exploring all avenues and achieving unexpectedly good overall performance from a distributed power processing system.

REFERENCES

1. T. M. Taylor, "Distributed Power Processing: the Systems Solution", INTELEC 83 PROCEEDINGS, 1983, p. 310.

2. M. E. Jacobs and F. F. Kunzinger, "Distributed Power Architecture Concepts", INTELEC 84 PROCEEDINGS, 1984.

3. P. R. Berkowitz, M. E. Jacobs, F. F. Kunzinger, and R. W. Michelet, "Distributed Power Architectures in Customer Premises Switching Equipment", INTELEC 84 PROCEEDINGS, 1984.
