EXPERT SYSTEMS: THE USER INTERFACE, edited by James A. Hendler, University of Maryland. ABLEX PUBLISHING CORPORATION, Norwood, New Jersey 07648


FOUR

DARN: Toward a Community Memory for Diagnosis and Repair Tasks

Sanjay Mittal, Daniel G. Bobrow, Johan de Kleer

Intelligent Systems Laboratory, Xerox Palo Alto Research Center, Palo Alto, CA

Abstract

DARN is a plan-based knowledge system designed to aid an inexperienced technician in a diagnosis and repair task. Our approach to building this knowledge system is contrasted with theory-based systems such as SOPHIE, or a classification-based system such as MDX, using a characterization of the tasks, and the leverage for the different approaches. DARN provides a compact and convenient representation for capturing knowledge available in the community of experts for this task. Graphic interfaces are used to guide rookie technicians in the repair task. DARN also provides an “expert’s interface” that allows members of the community to extend and modify the knowledge base without intervention of “knowledge engineers.”

I. Introduction

DARN is a knowledge-based system to aid in a diagnosis and repair task. In this task, the problem is to localize the cause of a malfunction sufficiently to allow an action that will repair the faulty artifact. Traditional approaches to the development of knowledge-based troubleshooting systems have been based on deep models of the artifact or on a classification of ways in which the artifact can malfunction. In this chapter we identify a third approach, called plan-based, which may be more suitable for some systems. The plan-based approach derives its power from plans for debugging developed by a set of experts, and from allowing them to formalize, extend, and propagate their community knowledge base. We have focused on the development of a representation to express this diagnostic and repair experience. We support the use of this representation with an interface to enable experts to create and modify the knowledge bases without the mediation of a knowledge engineer.

A spectrum of approaches is possible for the debugging task, depending on characteristics of the artifact being debugged. To understand when a plan-based approach is useful, we look at sources of leverage in the computer programs based on existing approaches, and the assumptions made by these approaches about the nature of the artifact being debugged.

Theory-based systems derive their leverage from use of a number of domain-independent reasoning mechanisms operating on large amounts of domain-specific knowledge. For these systems, one must build models of the artifact's structure and function. It is then possible to reason from differences between the modeled and actual behavior to locate the cause of the malfunction. SOPHIE (Brown, Burton, & de Kleer, 1982) and DART (Genesereth, 1982) are programs that typify this approach. These programs derive their power from general-purpose reasoning methods that include dependency-directed backtracking (de Kleer et al., 1979; Sussman & Stallman, 1975) and envisioning (de Kleer, 1984). These programs can reason about behavior based on descriptions of the artifact. In order for this approach to succeed, one must assume, of course, that the artifact can be completely described. This is true for circuits analyzed by SOPHIE but is often not true for electro-mechanical systems or hardware-software systems.

Classification-based systems derive their leverage by classifying malfunctions into equivalence classes based on effective actions to ameliorate the malfunction. Medicine is a typical domain in which this approach is used; although they may wish to know more, it is often sufficient for doctors to find a differential diagnosis on which suitable therapeutic measures can be devised. MDX (Chandrasekaran & Mittal, 1982), a computer program for medical diagnosis, organized knowledge about liver disorders as a hierarchy of diagnostic states. Szolovits and Pauker (1978) have argued that even rule-based diagnostic systems such as MYCIN are really performing such classification. These systems derive their power from appropriately structured diagnostic hierarchies and employ a more special-purpose problem-solving technique called heuristic classification (Chandrasekaran, 1984; Clancey, 1985).

A major cost in the development of classification-based diagnostic programs is the creation and maintenance of suitable hierarchies. One reason for this is that such hierarchies have often not been made explicit by practitioners, though the practitioner’s behaviors may be described that way. Even in medicine, where there are many extant ways to classify any given set of diseases, it is clear that none of the existing classifications are a suitable basis for organizing diagnostic knowledge in this way. A medical diagnosis hierarchy has to be crafted from pieces of anatomical, physiological, biochemical, and other views (see Mittal, 1980, for an extended discussion of this issue). It would seem, therefore, that a classification approach would be pragmatically useful only for systems which have a long lifespan of interest over which they stay relatively stable. Some other suitable examples would seem to be nuclear power plants, the space shuttle, automobiles, and Boeing 707 airplanes.

Plan-based systems for diagnosis and repair are based on capturing the plans of experienced technicians for debugging an artifact. They capitalize on the incremental nature of the knowledge acquisition process and the combination of modeling and testing knowledge that the technicians have. This approach seems most useful for complex systems containing a number of different kinds of subsystems, for example, electrical, mechanical, and software, that have alternative implementations. Examples are computers, printers, copiers, modern automobiles, and maybe even semiconductor fabrication processes. Complications often arise from the short lifespan of these systems, which is often caused by obsolescence. These systems can be further characterized by unavailability of complete underlying models, and complex interaction between the different subsystems. The rapid advancement in technology is causing a maintenance nightmare in many areas. The short lifespan of an artifact prevents the development of a sustained body of experience in troubleshooting these systems, often a precondition for the development of a classification of malfunctions.

We believe that the development of knowledge-based tools for assisting in the maintenance of this latter class of artifacts requires a shift in paradigm away from reasoning and deep knowledge structures. A more suitable paradigm is suggested by the metaphors of a knowledge medium (Stefik, 1986) or a community memory. The basic idea is that one should provide a framework in which the experts can themselves articulate their relevant knowledge precisely enough that it can then not only be used by the computer to aid in solving problems in the domain but also be made subject to peer review and revision. Such a community memory may, over time, integrate the knowledge from many experts.

In this paper we describe one experiment towards better understanding how to build such community memories. The problem domain for our study was the diagnosis and repair of a class of personal computer at Xerox. As part of this experiment we also developed a knowledge-based system called DARN (Diagnosis and Repair Network). This chapter is organized as follows: We start by reporting some observations about the problem domain and the current practice of repairing these computers. In the next section, we present a relatively simple representation which can be used to encode a large fraction of the experiential plans and knowledge of the computer technicians. Next, we discuss some of the user interface issues in enabling a community of users to themselves create and modify a knowledge base. We also describe some of the user interface tools that were implemented as part of the DARN system. The concluding section summarizes our experience in trying to use DARN, including a follow-up experiment in trying to adapt the DARN framework for the repair of copiers.

II. Observations About the Repair of Personal Computers

Most personal computers, regardless of size or price, are very complex electro-mechanical systems. While our observations here are based on the Xerox 8000 (Star or Dandelion) series machines, many of them are also relevant for other popular series such as the IBM PC or Apple II. A typical Dandelion is configured with a bit-mapped display, a pointing mouse device, a floppy drive, a high-capacity fixed disk, an ethernet connection, and a CPU. We talked to a number of people involved in servicing these machines: technicians in research labs, field service technicians, diagnostic manual writers, and people providing service advice over the phone. The problem of diagnosing and repairing the fixed disk emerged as one of the more challenging and time-consuming problems. We have focused primarily on the disk subsystem in the work reported here.

A fixed disk has electronic (control circuitry), mechanical (disk drive, platters), and electrical (power supply, cooling fan, cables) components which can interact in a complex way, especially when there are problems. The problem of how to diagnose and repair problems relating to a malfunctioning disk system in a personal computer has a number of interesting aspects. First, it is not just a diagnosis problem; one is required to find actions which restore the disk to a state where it can continue to be used. This requirement raises interesting issues because of the interplay of hardware and software models. The diagnostic procedures are often programs whose outputs do not precisely indicate a single hardware fault. Moreover, many of the diagnosed problems are soft in nature, e.g., garbled data fields on a disk, which have been caused by hardware malfunctions. However, these soft faults often cannot be diagnosed until the malfunctioning hardware has been fixed. This interplay presents an interesting challenge in representing the interaction between hardware and software during diagnosis and repair.

Second, there are situations in which there are multiple faults in the machine, and where repairs can fail because newly replaced parts are faulty. In general this is a hard problem. However, there seem to be interesting heuristics for dealing with the more common cases. Finally, the disk repair problem is complicated by the tension between competing constraints: one wants to minimize the amount of hardware replaced on the machine, minimize the amount of time the machine is unavailable to the user, and minimize loss of data on the current disk. The problem of loss of data is often critical when fixing disks.

Current Practice

We started out by taking some very detailed protocols from a couple of the expert technicians, recording their diagnosis and repair strategies when faced with a problem with a fixed head disk. It was clear that the experts knew a lot about the computer system; they were not approaching it as just a black box. They had a fairly complete model of the architecture (which subsystem was connected to which one and how) and knew how the various subsystems functioned, at least at some level. They also had some partial theories about how the various diagnostic programs worked.

Figure 1. Plan Fragment from Protocols. A fragment of a plan to repair one of the fixed-disk malfunctions. The plan is based on the initial protocols from the expert technicians before a computer system was implemented.

At the same time it was striking to observe how much their ability to troubleshoot the machines was limited by the availability of the diagnostic tools. For example, the diagnostic programs have limited coverage, perform tests in inflexible sequences, and often make unarticulated assumptions about the state of the computer system. Similarly, complex electro-mechanical systems, such as a disk, can malfunction in numerous ways, only a small fraction of which can be directly observed. The real issue, we believe, is not really one of devising better diagnostic procedures, which is always possible, but of using the existing ones to get the job done. In this and many other similar situations the primary objective is making the machine functional again under some of the constraints discussed earlier. We found that our experts were articulating their knowledge in the form of diagnosis and repair plans. These plans were organized around the typical problems that were encountered by the technicians and encoded their experience of how to go about isolating and fixing the problem. Figure 1 shows a plan fragment from our initial protocols. This figure encodes the plan for fixing one of the problems that shows up when a user boots the machine. The plan might be read as:

After booting the machine for a cold start, if the maintenance panel (MP) shows 151 or 149, then run El Disk diagnostic program. If the computer stops with MP code 1192, then execute the following plan. First, try replacing the HSIO (High-Speed I/O board) and rerunning the El Disk program. If the computer runs fine, then the problem is fixed. Else, check the cooling fan. If the fan is faulty, then fix the fan and rerun the El Disk program. ...

We will defer a discussion of these plans to the next section, but a few general comments are in order here. There were many such plans, with each plan being designed to cover a family of problems. As we will see, the plans were not completely independent and often shared subplans. Interestingly, there were variations even among the two experts that we probed in some depth. The plans seemed to compile the experience of and constraints under which the experts were working. It is easy to speculate that the plan variations reflected different experiences and attempts to cope with an imperfect diagnostic situation.

Finally, we had to confront the issue of who would be the users of a computer-based system such as DARN. The experts whose knowledge was encoded in the system would possibly benefit from a precise articulation of their expertise in a form that could aid them in later situations. In the longer term, the experts could share their expertise with other experts, possibly combining their knowledge to form a whole that was larger than the sum of its parts. However, the one clear class of users that could immediately benefit were the so-called “rookie” technicians, i.e., people just being trained to service a new machine.

Currently, inexperienced technicians are provided with manuals of repair that encode Fault Identification Procedures (FIPs). These FIPs suffer from some major problems. First, they are often prepared before any serious experience has accumulated about a machine. Thus, they are incomplete at best and grossly incorrect at worst. Second, while they are occasionally updated, they are usually obsolete before they become available. Furthermore, much of the experience of the expert technicians rarely (or too late) makes its way into the field service manuals. Finally, even though our experts seemed to have a relatively simple way of expressing their repair plans, these plans, nevertheless, could not be suitably expressed in or used from the medium of printed books. This last point should become clear as we move to a discussion of the plan representation.

III. Representation of Repair Plans

In this section we describe a language for representing repair plans. Our primary criterion in devising a representation was to ensure that there be a conceptual match between the representation in the machine and the apparent representations used by the technicians. This criterion is itself motivated by the requirement that the domain experts themselves create the knowledge base, modify parts of it, understand the representation well enough to know the implications of their representational choices, and have confidence in the advice provided by a system that uses the knowledge base. Some recent projects, such as PIES (Pan & Tenenbaum, 1986), are also exploring a similar set of issues.

Plan Elements

The representation in DARN is based on the following observations. Most of the repair plans such as the one discussed in the previous section (see Figure 1) have a fairly similar form. Typically, a plan includes starting with some initial test on the machine which indicates a malfunction, running some diagnostic tests to pinpoint the problem, applying a fix for the problem, verifying that the problem is indeed fixed, and otherwise iterating this process.

A plan can thus be viewed as a graph where the nodes can be classified into a fixed number of abstract classes. As a first approximation, we have found that three abstract types (Tests, Observations, and Actions) are sufficient to represent the different elements of the plans. Examples of these from the plan described in the previous section are: tests (booting the machine, running EIDisk, checking cooling fan); observations (MP151, fatal error microcode, processor voltage OK); and actions (replace HSIO board, replace cable).

Rules of Composition

The primitive plan elements (graph nodes) are not composed in arbitrary fashion. Instead, there are some well-defined rules that dictate how nodes of a certain class can be linked to nodes of other classes to form legal plans. The basic rules are:


A test can be connected to observations only, representing the results of running a test.

An observation can be connected to actions and/or further tests. The former case represents actions to be taken in fixing the problem manifested by the observation. The latter case enables further exploration to pinpoint the problem.

An action is implicitly connected to a test which is used to verify that the action indeed fixed the problem. More complex rules about representing actions are described later.

These rules not only define the plan structure but in effect describe the semantics of the structure as a plan for diagnosing and repairing the target machine. In other words, an element of the plan that is marked as a test node represents a test to be performed because it is followed by the observations that can be made from running the test. Similarly, an observation node derives its meaning from two structural properties. One, it follows a test and thus represents a possible result of running the test. Two, it is followed by further actions or tests which represent what should be done with this observed state of the machine in further identifying or fixing the malfunction. We hypothesize that such a structural transparency is crucial in enabling the experts to manipulate the knowledge base comfortably.
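The composition rules can be made concrete with a small sketch. The Python below is only an illustration of the three rules stated above; DARN itself was written in Loops, and the names `Node`, `link`, and `ALLOWED_SUCCESSORS` are hypothetical, as is the choice to reject an illegal link with an exception.

```python
from dataclasses import dataclass, field

# Which kinds of node may explicitly follow each kind (the three basic rules).
# An action has no explicit successor because its verifying test is implicit.
ALLOWED_SUCCESSORS = {
    "test": {"observation"},
    "observation": {"action", "test"},
    "action": set(),
}


@dataclass
class Node:
    kind: str                      # "test", "observation", or "action"
    label: str
    successors: list = field(default_factory=list)

    def link(self, other):
        """Attach `other` after this node, refusing links the rules forbid."""
        if other.kind not in ALLOWED_SUCCESSORS[self.kind]:
            raise ValueError(f"a {self.kind} cannot be followed by a {other.kind}")
        self.successors.append(other)
        return other


# A fragment loosely following the plan read out in the previous section.
boot = Node("test", "Boot machine for cold start")
mp151 = boot.link(Node("observation", "MP151"))
run_diag = mp151.link(Node("test", "Run disk diagnostic"))
fatal = run_diag.link(Node("observation", "Fatal error in microcode"))
fatal.link(Node("action", "Replace HSIO board"))
```

Attempting `boot.link(Node("action", "Replace cable"))` raises `ValueError`, which is the kind of structural check that lets an editor built on these rules offer only legal extensions.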

Extensions to the Plan Representation

The basic TAO (Test-Action-Observation) framework described above is rather abstract because it is only sufficient in a computational sense. It does not capture many nuances and specializations that the experts use in describing the repair plans. Here we discuss some of the extensions. Most of the extensions can be viewed as specializations of the three abstract classes. Each of the specialized plan elements may also specialize the structural rules associated with that class. Figure 2 shows some of the specializations as a class lattice.

Tests

A top-level test in DARN describes the context in which the user discovered a malfunction. For example, at the top level the system uses the following query as a test to discover the context:

Which situation are we in:
1) Testing a new disk
2) A problem found while booting the machine for a cold start
3) A problem found while running user software

Figure 2. Specializations of the plan elements shown as a class lattice (graphic not reproduced). Class names legible in the figure: DARNObject, ElectricalAdjustment, AdjustVoltage, MechanicalAdjustment, CorrectiveAction, OtherAction, ProblemSolved, ReplaceHardware, ReplaceBoard, ElectricalObservation, ObserveVoltage, MPCodeObservation, MechanicalObservation, OtherObservation, BackToUser, DataAnalysis, ElectricalTest, CheckVoltage, TestProcedure, CheckCable, MechanicalTest, CheckMechanicalPart, RunProgram.

Performing a test leads to observations, often indications of problems to be fixed. Often some action may be taken to try to “fix” the problem. If the action (or any of its subactions) changes the machine, a test indicating a problem must be run again to verify whether the change affected the problematic observation. For example, suppose we have the test:

Check voltage at Processor
    None -> ReplacePowerSupply
    Not24V -> CheckPowerCable
    OK -> CheckVoltageAtDrive

If no voltage is found at the processor, then after replacing the power supply, we must check that the new power supply is providing the correct voltage at the processor.
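The voltage check above can be read as a lookup from observation to follow-up, with the test rerun whenever the follow-up is an action. A minimal sketch follows; the node names on the right-hand side come from the example, while `CheckVoltageAtProcessor`, `next_steps`, and the assumption that every action in this table changes the machine are ours.

```python
# Follow-ups for the "check voltage at processor" test, as listed in the example.
# Each entry is (kind of follow-up, name of follow-up).
CHECK_VOLTAGE_AT_PROCESSOR = {
    "None": ("action", "ReplacePowerSupply"),
    "Not24V": ("test", "CheckPowerCable"),
    "OK": ("test", "CheckVoltageAtDrive"),
}


def next_steps(observation, table=CHECK_VOLTAGE_AT_PROCESSOR,
               test_name="CheckVoltageAtProcessor"):
    """Return the steps to carry out after `observation` is seen.

    Assumption for this sketch: an action always changes the machine, so the
    test that showed the problem is scheduled again right after the action.
    """
    kind, name = table[observation]
    steps = [name]
    if kind == "action":
        steps.append(test_name)  # rerun to verify the change had an effect
    return steps
```

With this table, `next_steps("None")` yields the replacement followed by the same voltage check, while `next_steps("OK")` simply moves on to the next test.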

Verifying correct behavior is more crucial when running a diagnostic program, and choosing one of a number of potential repairs for the problem. For example:

Select the result of running the ElDisk diagnostic program:
1) Fatal Error in Microcode
2) Maintenance Panel says 1611 - No interface signals
3) ...
4) No errors

If (1) Fatal Error in the Microcode is the symptom reported by ElDisk, the technicians can replace one of several printed circuit boards that could cause this error. After replacement of one, the diagnostic program is run again to test whether replacing that board fixed the problem. If not, the technician will try other possible replacements.

The generic notion of a test is further elaborated by the kinds of tests available in a particular domain. For example, in the disk domain, some of the elaborations of test are: Electrical Test, Mechanical Test, Diagnostic Program, Query User. Each of these may be further specialized. Similar elaborations exist for observations and actions. The importance of these domain-dependent elaborations lies in providing the system restrictions on the general rules of interconnections between nodes. For example, the observations of a diagnostic program, in our domain, are restricted to maintenance panel codes (MP Codes) and something which signals successful completion of the test. These restrictions are exploited in customizing the user interface as discussed in the next section. Eventually, these elaborations provide a means for attaching deeper models. For example, knowing that the MP Codes are generated in an ascending sequence allows the actual MP Code observed to be used to rule out problems associated with other codes lower in the sequence.
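The MP Code remark suggests a very small piece of deeper modeling. The sketch below assumes that "ascending sequence" means numeric order and that reaching a code implies every numerically lower code was passed without error; the function name and the particular set of codes passed in are illustrative only.

```python
def ruled_out_codes(observed, known_codes):
    """Return the MP codes that the observed code lets us rule out.

    Assumption: codes are emitted in ascending numeric order, so stopping at
    `observed` means the checks behind every lower code already succeeded.
    """
    return [code for code in sorted(known_codes) if code < observed]


# Codes mentioned in this chapter, used here purely as sample data.
print(ruled_out_codes(1192, [1611, 149, 151, 1192]))
```

For the sample call, the problems associated with codes 149 and 151 are ruled out, and 1611 remains open.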

Observations

There are two classes of observations: ones which indicate successful completion of the associated test and ones which indicate some problem. The former need not be connected to any further nodes, thereby modifying the default interpretation of the associated test. For example, if a test is used for further elucidation of the problem (discussed later in this section), then a successful completion of the test without any following action implies that a different action must be taken for fixing the original problem.

Actions

The kinds of nodes that can follow after an observation may be another Test, or actions such as SimpleRepair, FixList or CompoundAction. These nodes are ordered on the basis of desirability. Desirability is an implicit combination of likelihood of the action specified fixing the problem, the cost of trying the repair, etc. Some SimpleRepairs are:

Replacement of a part, e.g., Read Write Board
Adjustment within the system, e.g., 5-Volt Level
Software changes to data on disk, e.g., Rewrite Broken Headers

Fault Isolation

For some tests, the symptom found may not uniquely determine a repair for the problem found. At this point, one possibility is to try to isolate the problem. The simplest option is to just ask the user to narrow the choices (using a QueryUser test). Our experts often suggested this course in the intermediate stages of building the knowledge network. In this case, the original Test must be used to verify if any action taken fixed the problem. Sometimes a fault may be isolated by performing a secondary test; in this case both the primary and secondary test must be used to verify any fix. Finally, isolation can be done with a SubTest; in this new type of Test, verification of the SubTest is deemed sufficient to verify a fix for the primary Test.


Trying Alternatives

In some situations, no tests may be available to isolate the fault. In this case, experts (and the DARN system) suggest an ordered sequence of actions. If any of the actions on this FixList leads to a change in the machine, the last test indicating a fault symptom is tried to verify if the change had the desired effect. For example, in running the ElDisk diagnostic, when one gets a fatal error in the microcode, one tries in turn: Replacing the HSIO Board; Checking the Cooling Fan; Replacing the Control Board; . . . ; Replacing the Disk. After trying each action, the ElDisk diagnostic is run again.
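The FixList behavior can be sketched as a short loop: try each action in order and, whenever an action changed the machine, rerun the test that showed the symptom. This is a hedged illustration, not DARN's interpreter; `run_fix_list` and `SimulatedMachine` are hypothetical, the sample list omits the items elided in the text, and the rule that a "Check" action changes the machine only when it turns out to be the real fix is a simplification of ours.

```python
def run_fix_list(fix_list, apply_action, run_test, ok="OK"):
    """Try actions in order; return (action that fixed the symptom, actions tried)."""
    tried = []
    for action in fix_list:
        changed = apply_action(action)
        tried.append(action)
        # Only a change to the machine makes it worth rerunning the test.
        if changed and run_test() == ok:
            return action, tried
    return None, tried


class SimulatedMachine:
    """Toy stand-in for a broken machine; one action is secretly the real fix."""

    def __init__(self, real_fix):
        self.real_fix = real_fix
        self.fixed = False
        self.test_runs = 0

    def apply(self, action):
        if action == self.real_fix:
            self.fixed = True
        # Simplification: replacements always change the machine.
        return action.startswith("Replace") or action == self.real_fix

    def run_diagnostic(self):
        self.test_runs += 1
        return "OK" if self.fixed else "Fatal Error in Microcode"


FIX_LIST = ["Replace HSIO Board", "Check Cooling Fan",
            "Replace Control Board", "Replace Disk"]
machine = SimulatedMachine(real_fix="Replace Control Board")
fix, tried = run_fix_list(FIX_LIST, machine.apply, machine.run_diagnostic)
```

In this run the diagnostic is executed twice: once after the HSIO board is replaced and once after the control board is replaced. The fan check changes nothing, so no rerun follows it.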

The ordering of actions in a FixList is dependent on much information not explicit in the model. It depends on the frequency of occurrence of a particular action in fixing a particular symptom, the cost to the technicians of trying the repair, and the ease of performing an action after having taken other actions. We discuss later how such information might be explicitly embedded in the model as annotations for particular purposes.

Remembering the Context of Fixes

Sometimes the same sequence of potential fixes is applicable in response to different symptoms found by other tests run by the technicians. In each case, the verification condition is determined by the test which manifested the symptom to be fixed. This requires that the DARN interpreter remember the “caller” of the FixList in order to use the correct test for verification.

Expert technicians often use previously defined FixLists as models for later lists. For example, they would say things like “First replace the Control Board, then replace the VFO Board, and then do the rest of the actions taken for Fatal Error in the Microcode.” Implicit in this statement is the fact that any actions already taken should not be repeated. DoRest nodes in DARN directly model this expression of the technicians. This of course requires that the DARN interpreter keep a history of actions that have already been done.
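One way to picture DoRest and the remembered caller is the sketch below. It is an assumption-laden illustration: FixLists are plain Python lists, a DoRest node is written as the tuple `("DoRest", name)`, and the list names, the `OtherSymptom` entry, and both function names are invented for the example.

```python
def expand(fix_list, fix_lists, history):
    """Yield actions in order, expanding DoRest and skipping actions already done."""
    for step in fix_list:
        if isinstance(step, tuple) and step[0] == "DoRest":
            # Continue with the referenced list, sharing the same history.
            yield from expand(fix_lists[step[1]], fix_lists, history)
        elif step not in history:
            history.append(step)
            yield step


def steps_with_verification(caller_test, list_name, fix_lists):
    """Pair every action with the test that called the FixList."""
    return [(action, caller_test)
            for action in expand(fix_lists[list_name], fix_lists, [])]


FIX_LISTS = {
    "FatalErrorInMicrocode": ["Replace HSIO Board", "Check Cooling Fan",
                              "Replace Control Board", "Replace Disk"],
    # "First replace the Control Board, then the VFO Board, then do the rest..."
    "OtherSymptom": ["Replace Control Board", "Replace VFO Board",
                     ("DoRest", "FatalErrorInMicrocode")],
}

order = list(expand(FIX_LISTS["OtherSymptom"], FIX_LISTS, []))
```

Here `order` contains the control board replacement only once, because the shared history suppresses the repeat when the rest of the earlier list is expanded. `steps_with_verification` shows the other half of the idea: the same FixList is verified with whichever test invoked it.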

A special class of actions was created to express concisely another combination of activities. For some parts of the system, one could guarantee that if a symptom were found, then the associated repair would fix that symptom. For example, if the voltage level in the power supply was outside some specified tolerance, then adjusting it would ensure that it was then within tolerance. Similarly, replacing a broken cooling fan by an obviously working one would not require checking the cooling fan again. These actions are generically called VerifiedAction.


Fault Identification

How does DARN recognize when it has identified a problem? When it runs a test a second time after taking some action A, and the symptom has changed, there are two possibilities. First, if the symptom changes from an error indication to OK, then clearly the action A fixed the problem. If the action was the replacement of a part, then the replaced part was faulty and can be labeled as such. The action could also have been an adjustment, and the system can record that there was a need for such an adjustment on this particular machine.

The situation is more complicated if the symptom changes from one error indication, E1, to another, E2, when replacing a part P0. It could be that the original part, P0, had no fault and that the replacement part Pr1 has a fault. We have been told that this is not completely unlikely. To test this hypothesis, a second replacement part Pr2 is inserted in the machine. If the error indication returns to E1 then the original part, P0, is determined to be fault-free, and Pr1 is marked as being faulty. If the error indication stays at E2, then P0 is marked as being faulty, Pr1 as being all right, and there is a second fault in the machine.
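The case analysis above is compact enough to write down as a decision function. The sketch below covers only the outcomes the passage describes and returns an empty result for anything else; the function name, the dictionary labels, and the treatment of an unchanged symptom are our assumptions.

```python
OK = "OK"


def judge_replacement(e1, e2, e_with_pr2=None):
    """Label the parts after swapping P0 for Pr1 and, if tried, for Pr2.

    e1: symptom with the original part P0 installed
    e2: symptom after installing the first replacement Pr1
    e_with_pr2: symptom after installing a second replacement Pr2, if tried
    """
    if e2 == OK:
        return {"P0": "faulty"}                  # the replacement fixed the problem
    if e2 == e1:
        return {}                                # unchanged symptom: no conclusion drawn here
    if e_with_pr2 is None:
        return {"next": "insert Pr2 and rerun the test"}
    if e_with_pr2 == e1:
        return {"P0": "fault-free", "Pr1": "faulty"}
    if e_with_pr2 == e2:
        return {"P0": "faulty", "Pr1": "all right", "second fault": "present"}
    return {}                                    # outcome not covered by the passage
```

For example, `judge_replacement("E1", "E2", "E2")` marks P0 as faulty, clears Pr1, and flags a second fault in the machine.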

In some circumstances the technicians know that a symptom cannot be caused by a newly replaced board, and there is no need for the above majority verification procedure. There are two different cases. The rare case is when a second hardware fault is recognized, but the symptom precludes it from being the replaced part. In this case, the replaced part P0 is marked as faulty, and the repair process is continued. Much more frequently, the second error seen is a software fault, which may have been caused by the hardware fault (now fixed); for example, data have been garbled on the disk by the formerly malfunctioning hardware. In this case, P0 is marked as faulty, and a second procedure is used to try to restore the disk data. A similar diagnostic and repair network also represents the procedures to be followed in this case of servicing.

IV. User Interfaces

There are two distinct sets of users for this system, and the interfaces for the two reflect the differences in their needs. One type of user is a consumer of the knowledge base, say a less-experienced technician trying to use this knowledge as a guide through a service session. The second is an expert technician trying to examine and augment the knowledge base, i.e., a producer of the knowledge. Both kinds of users need the capability to interact with the system for solving cases, but the producers also need to have an overview of the entire knowledge base so that the knowledge base can be easily extended and modified. We have experimented with a variety of interfaces and in this section we describe some of these interfaces, paying particular attention to the different kinds of needs.

A Browser for Repair Plans

A basic need for all users is the ability to browse around in the plan structure. The graph-like structure of the plans makes it easy to build graphical browsers in languages such as Interlisp-D (Sannella, 1983). DARN was implemented in Loops (Bobrow & Stefik, 1982; Stefik & Bobrow, 1985), an object-oriented extension of Interlisp-D. Figure 3 shows a fragment of the plan being browsed inside a display window on the screen of a Dandelion lisp machine. The plan browser explicitly shows the plan elements as distinguished nodes. For example, the observation nodes are shown enclosed inside braces ({}), test nodes are enclosed inside square brackets ([]), and actions are enclosed inside angle brackets (<>). Some of the newer extensions to the browsers in Interlisp/Loops allow even more iconic display of the nodes.
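The delimiter convention is easy to mimic in a text-only rendering. The sketch below prints a plan as an indented outline using braces for observations, square brackets for tests, and angle brackets for actions; the tuple layout `(kind, label, children)` and both function names are hypothetical, and the real browser was graphical.

```python
DELIMITERS = {"observation": "{}", "test": "[]", "action": "<>"}


def render(kind, label):
    """Wrap a node label in the delimiters used for its kind."""
    left, right = DELIMITERS[kind]
    return f"{left}{label}{right}"


def render_plan(node, depth=0):
    """Return the plan rooted at `node` as a list of indented lines."""
    kind, label, children = node
    lines = ["  " * depth + render(kind, label)]
    for child in children:
        lines.extend(render_plan(child, depth + 1))
    return lines


plan = ("test", "RunDiagnostic", [
    ("observation", "FatalErrorInMicrocode", [
        ("action", "ReplaceHSIOBoard", []),
    ]),
    ("observation", "NoErrors", []),
])
print("\n".join(render_plan(plan)))
```

A subbrowser rooted at some node corresponds, in this toy form, to calling `render_plan` on that node's tuple instead of on the whole plan.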

The plans are usually too big to fit inside a window. Some of the basic scrolling mechanisms provided by Interlisp allow a user to traverse the graph structure easily. Sometimes, a user might want to restrict their attention to some subplan. This can be accomplished by creating a subbrowser rooted at any of the nodes in the original browser. We have also experimented with path browsers. These browsers grow as the user traverses them, keeping the branching factor low.

Plan Creation Interface

Figure 3. Typical Screen Layout During Interaction with DARN. The top half of the screen shows the plan browser. A node in the browser is either a Test, an Observation, or an Action. The browser window can be scrolled or reshaped. Selection of a node with left-button presents a menu of choices which allows addition of new information or modification of existing information. The menu on the bottom right of the screen shows an example of the menu typically presented to a user while the system is being used for servicing a broken machine. The other window on the bottom right shows a brief history of the current interaction with the system.

Figure 4. Plan Creation Interface. The plan browser is shown with a menu that comes up when a node is selected for making plan modifications. Complete plans can be created by interactively creating nodes and extending them.

The producers (including the knowledge engineers) of the knowledge base need a specialized interface that lets them easily create and modify the plan structures. Our working hypothesis was that it is easier for users to create the plans by directly manipulating the graph structure of the plan, rather than by describing the plan in some textual form. The structural transparency of the plan structure was an important consideration in this regard. In other words, given that the graphical representation of the plan adequately mirrored its structure, it should be possible for producers to enter new knowledge or modify existing knowledge by directly modifying the graphical representation. The plan browser described earlier was extended to allow plan creation and modification actions at any of the nodes in the plan. Each node in the browser was made “active,” i.e., it could be selected by a pointing device such as the mouse to bring up a menu of choices. Selection with the left button brought up a menu of plan-editing choices and selection with the middle button brought up a menu of plan execution choices (to be discussed later). Figure 4 shows a plan browser with the menu of choices for modifying the plan. Some of the basic options are:

NewAlternative
DeleteAlternative
MoveAlternative
Rename
Edit

Notice that these commands are generic commands which are interpreted differently by the different kinds of plan elements. Generically, NewAlternative allows a plan to be extended from any of its elements. However, the permissible extensions are dependent on the selected node. For example, selecting the NewAlternative command for an observation displays a browser menu showing all the generic tests and actions known to the system. Once the user selects a class from this menu, the system displays a second menu listing all the known instances of that class already in the knowledge base (in case the user wanted a previously created node) and an option to create a new one. For a new node, the same process is recursively followed to allow the user to fill in the description of this new node. A complete subplan can be created this way, with the system prompting the user for appropriate choices along the way. This is made possible by representing with each kind of node a complete description of the rules of interconnection, including the specializations for domain-dependent elaborations of the basic representation.
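The two-menu behavior of NewAlternative can be sketched as a table of interconnection rules consulted by a generic command. The sketch below is in Python and is not DARN's implementation (which was built on Interlisp and Loops); the table `ALLOWED_NEXT`, the dictionary `KNOWN_INSTANCES`, and every function name are hypothetical. Only the rule for observations (they may be extended with tests and actions) is stated in the text; the rules shown for tests and actions are assumptions made so that the example is complete. The recursive filling-in of a newly created node's description is not modeled.

```python
from typing import Dict, List

# Rules of interconnection: which kinds of node may follow each kind.
# Only the Observation row is stated in the text; the Test and Action
# rows are assumptions made for this sketch.
ALLOWED_NEXT: Dict[str, List[str]] = {
    "Observation": ["Test", "Action"],
    "Test": ["Observation"],   # assumption
    "Action": ["Test"],        # assumption
}

# Hypothetical knowledge base: known instances of each class.
KNOWN_INSTANCES: Dict[str, List[str]] = {
    "Test": ["EIDisk", "CheckVoltageAtProcessor"],
    "Action": ["ReplaceHSIOBoard", "ReplaceCable"],
    "Observation": ["MP1192", "VoltageOK"],
}

CREATE_NEW = "<create new>"


def class_menu(selected_kind: str) -> List[str]:
    """First menu: the classes that may extend a node of this kind."""
    return list(ALLOWED_NEXT[selected_kind])


def instance_menu(chosen_class: str) -> List[str]:
    """Second menu: existing instances of the class, plus a create option."""
    return KNOWN_INSTANCES.get(chosen_class, []) + [CREATE_NEW]


def new_alternative(plan: Dict[str, List[str]], parent: str, parent_kind: str,
                    chosen_class: str, chosen_instance: str) -> None:
    """Attach an alternative to `parent`, enforcing the interconnection rules."""
    if chosen_class not in ALLOWED_NEXT[parent_kind]:
        raise ValueError(f"{chosen_class} cannot follow {parent_kind}")
    if chosen_instance not in KNOWN_INSTANCES.setdefault(chosen_class, []):
        KNOWN_INSTANCES[chosen_class].append(chosen_instance)  # new node
    plan.setdefault(parent, []).append(chosen_instance)


plan: Dict[str, List[str]] = {}
print(class_menu("Observation"))
print(instance_menu("Action"))
new_alternative(plan, "MP1192", "Observation", "Action", "ReplaceHSIOBoard")
print(plan)
```

Keeping the rules in data rather than in the command itself is what lets one generic command behave differently for each kind of plan element.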

Our initial experience with DARN has been that our domain experts (knowledge producers) found the graphical representation to be a natural way to interact with the system with minimal training. We were very impressed with how quickly they could extend the knowledge base. We discuss our experiences in the last section.

Execution Interfaces

Consumers (trainee technicians or others trying to use the system as a consultant) of the knowledge base primarily need two kinds of interfaces. The first is a plan execution browser that graphically shows where the user is in the plan. The second is a history interface that contains a summary of the interaction with the system and can be used as a transcript of the consultation. We briefly describe both kinds of interfaces.

Plan Execution Interface

DARN can guide a user through a diagnosis and repair process by executing the plans in the knowledge base. The test and action elements are used to prompt the user to perform the relevant action. The observations are typically used to prompt the user to inform the system of the results of performing some test. The following is a simplified consultation session between a rookie and an expert (or between the rookie and DARN, if not taken literally). Lines marked Rookie are remarks by the rookie; lines marked Expert are advice from an expert technician:

Rookie: The machine was booted. Stopped with maintenance panel code (MP) showing 151.
Expert: Run EIDisk diagnostic program.

Rookie: The machine stopped with MP 1192 (Fatal Error Microcode).
Expert: Replace HSIO board. Rerun EIDisk.

Rookie: The machine stopped with MP 1192.
Expert: Check cooling fan.

Rookie: Cooling fan is OK.
Expert: Replace control board and rerun EIDisk.

Rookie: The machine stopped with MP 1192.
Expert: Check voltage at processor.

Rookie: Processor voltage is OK.
Expert: Check voltage at disk drive.

Rookie: Drive voltage is Not OK.
Expert: Replace the cable connecting the processor and the disk drive. Rerun EIDisk.

Rookie: EIDisk ran successfully.

What you just saw was fine as a transcript of the consultation, but is clearly not suitable as the primary interface to help the user during the consultation. It is insufficient because it does not let the user see the plan as it is being executed. It does not show the choices at each stage. It does not visually focus the user on what has happened and what else is possible. It also does not allow a user to ask questions about the purpose of the actions taken. We have extended the basic plan browser to provide a more active execution interface. Essentially, the plan browser can be made active at any test node by selecting the service command at that node. This starts a service consultation session, primarily driven by the system but allowing some override by the user. Figure 5 shows a snapshot of the screen in the middle of a consultation. Notice that in the browser, the nodes that have been traversed at any given time are highlighted, giving a visual summary of where the user is in the plan. The system uses the alternatives in the plan to prompt the user. Thus the user can be prompted to select the correct observation after running a test or taking an action. A user can also see what other actions are possible when they are asked to carry out a particular action from a FixList. A user can override the order of fixes by selecting an action that makes more sense in a given situation. As the plan execution proceeds, the browser automatically scrolls, always positioning the relevant part of the plan in the display window.
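The execution loop described above can be sketched in a few lines of Python. This is a deliberately simplified model and not DARN's interpreter: the `PlanTest` and `Session` classes, the `run_test` function, and the use of an "OK" observation to mean that the test passed are all assumptions made for illustration, and nested tests such as the voltage checks in the transcript are left out. The sketch shows three things from the text: the alternatives in the plan are used to prompt for an observation, the user chooses a fix from the remaining entries of the FixList (and so can override the plan's order), and both the traversed nodes and a chronological history are recorded.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PlanTest:
    """A test whose outcome selects one of several observations."""
    name: str
    # Maps each failure observation to the ordered fixes to try for it.
    fix_lists: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class Session:
    traversed: List[str] = field(default_factory=list)  # nodes to highlight
    history: List[str] = field(default_factory=list)    # chronological log


def run_test(test: PlanTest, observe: Callable[[str, List[str]], str],
             choose_fix: Callable[[List[str]], str], session: Session,
             max_rounds: int = 10) -> bool:
    """Repeat the test, trying one fix per round, until it passes."""
    tried: List[str] = []
    for _ in range(max_rounds):
        session.history.append(f"Run {test.name}")
        session.traversed.append(test.name)
        # Prompt with the alternatives recorded in the plan.
        seen = observe(test.name, list(test.fix_lists) + ["OK"])
        session.history.append(f"Observed {seen}")
        if seen == "OK":
            return True
        session.traversed.append(seen)
        remaining = [f for f in test.fix_lists[seen] if f not in tried]
        if not remaining:
            return False
        fix = choose_fix(remaining)  # the user may override the plan order
        tried.append(fix)
        session.traversed.append(fix)
        session.history.append(f"Do {fix}")
    return False


# Scripted answers stand in for the menus a technician would click on.
eidisk = PlanTest("EIDisk", {"MP1192": ["ReplaceHSIOBoard", "ReplaceControlBoard"]})
answers = iter(["MP1192", "MP1192", "OK"])
session = Session()
fixed = run_test(eidisk, lambda name, options: next(answers),
                 lambda remaining: remaining[0], session)
print(fixed)
print(session.history)
```

Passing `observe` and `choose_fix` as callables keeps the plan logic separate from the interface, so the same loop could be driven by menus, by a script as here, or by a test harness.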

Case History Interface

A complete history of the service consultation is kept in the browser as discussed above. It is also kept in a textual form in a separate window (shown in the bottom left of Figure 5). This allows the user to scan the chronological sequence of events, as opposed to the logical sequence presented by the plan browser.

Figure 5. Plan Execution Interface. The plan browser is shown with the path followed in the current consultation highlighted. The menu in the left middle of the window prompts the user to select the observation from running the EIDisk test. The user’s selection, Fatal Error Microcode (FEM), is also highlighted. The window in the bottom left provides a continually updated history of the dialogue.

Discussion

Experience with DARN

The initial prototype developed by us contained only a fraction of the knowledge needed to repair the fixed disk. The first experiment we conducted was to try to get our expert collaborators to extend the knowledge base. We were surprised by how quickly they were able to learn enough both to extend the knowledge base and to use the system as a consultant on test problems. In a matter of a few days we were able to double the coverage of the system. Our experts were also very excited about using the system as a medium for making their experience available to new technicians. However, for a variety of reasons the project was terminated. One key reason was that the particular fixed disk model that was our initial domain was proving to be too much of a maintenance headache and was phased out of the product line. The replacement disk had fewer replaceable parts and the new repair strategy was simply to replace the complete drive.

Version of DARN for Copier Repair

A second major experiment that we conducted was to try to represent the knowledge for diagnosing and repairing a Xerox copier. Two important classes of copier faults were selected for the experiment: copy quality and paper tray elevator faults. Our collaborator on this project, who is an expert at repairing printers, tried to represent the knowledge in the Fault Isolation Procedures (FIPs). He found that, while he could represent the knowledge using the plan language, there were some significant differences between the disk and printer domains that called for extensions to the DARN framework. For example, copier diagnostics are far more complicated than those of disks. Failures need to be isolated at different levels before a suitable repair can be made, and there is often a long series of steps that have to be taken to create the proper context for performing a diagnostic test or corrective action.

Some of these observations suggest a need for a richer representation that allows setup procedures to be described, including guiding a technician who might make a mistake during these procedures. Another extension that was clearly needed was some abstraction capability that would allow a plan element to be expanded into a more detailed subplan when needed. The size and complexity of the copier repair plans also severely taxed the browsing features in DARN. It was clear that more powerful browsers are needed that could, for example, suppress selective detail, provide a top-down view, or perhaps act like a fish-eye lens, where the details are blurred around the edges, allowing more information to be displayed in a window.

Other Shortcomings

There are both omissions in current capabilities of the system, and problems with the system by virtue of its structure. As an example of the former, we cannot handle in any easy way intermittent errors, or some types of dependent faults. The latter form of errors comes about because the repair plans contain no model of the function or structure of the system, and no real reasoning capability. Another problem is that the relatively simple interpreter in DARN forces an order in which data are gathered, precluding situations where a technician has already run some tests and fixes and needs assistance.


Implicit Information

In their current form, the plans embody different kinds of implicit information. For example, suppose we have a nested test which checks for the voltage at a disk after finding that the voltage at the processor is all right. If the voltage at the disk is not “all right,” then the conclusion is drawn that the cable between the two should be replaced. DARN does not have any explicit information about that connection, and this structural information is implicit in the network. Furthermore, if some other node was connected to the node which tested the voltage at the disk in another context, but it did not follow the measurement of the voltage at the processor, then the implicit context of the measurement would be violated, and the conclusion drawn would be invalid.

Annotations

Currently, we provide the users with the capability of annotating plan nodes with information to be used for construction and explanation purposes. Annotation of a node can indicate that it must appear as a subTest of a particular other Test. This kind of structural information can also be included in the descriptions attached to the nodes so that the system can itself check for violations and prevent them. Annotation can also indicate reasons for the ordering of actions. Information compiled into the order includes things like probability of fault causing a particular symptom, or the cost of trying a particular action (e.g., it takes 15 minutes to change this board, and only 5 for most other boards).
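A structural check of this kind might look roughly as follows. The Python sketch below is an illustration under stated assumptions, not a description of how DARN stores or checks annotations: the `Annotation` class, its `must_follow` field, the adjacency-list plan, and the node names are all hypothetical. It treats the plan as an acyclic graph and reports every path that reaches an annotated node without first passing through the test that the annotation requires, which is the situation in which an implicit context would be violated.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple

Graph = Dict[str, List[str]]


@dataclass
class Annotation:
    must_follow: str   # test that has to precede this node on every path
    reason: str = ""   # explanation text that could be shown to the user


def paths_to(graph: Graph, node: str, target: str,
             prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield every path from `node` down to `target` (graph must be acyclic)."""
    prefix = prefix + (node,)
    if node == target:
        yield prefix
        return
    for child in graph.get(node, []):
        yield from paths_to(graph, child, target, prefix)


def violations(graph: Graph, root: str,
               annotations: Dict[str, Annotation]) -> List[Tuple[str, ...]]:
    """Return each path that reaches an annotated node without its context."""
    bad = []
    for node, note in annotations.items():
        for path in paths_to(graph, root, node):
            if note.must_follow not in path[:-1]:
                bad.append(path)
    return bad


# Illustrative plan fragment; the node names are invented for this sketch.
graph: Graph = {
    "EIDisk": ["CheckVoltageAtProcessor", "MediaScan"],
    "CheckVoltageAtProcessor": ["CheckVoltageAtDisk"],
    "CheckVoltageAtDisk": ["ReplaceCable"],
}
notes = {
    "CheckVoltageAtDisk": Annotation(
        must_follow="CheckVoltageAtProcessor",
        reason="A bad cable is implied only if the processor voltage is good.",
    )
}

print(violations(graph, "EIDisk", notes))   # no violations in the original plan

# Reusing the disk-voltage test in another context breaks the implicit context.
graph["MediaScan"] = ["CheckVoltageAtDisk"]
print(violations(graph, "EIDisk", notes))
```

Running the check whenever an edit such as NewAlternative or MoveAlternative changes the graph would let the system refuse the edit instead of silently producing an invalid conclusion.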

Acknowledgements

We are very grateful to our domain experts, Milt Mallory, Ron Brown, and Ted Manley, for their time, patience, and energy. Julian Orr was primarily responsible for trying to adapt the DARN framework for the copier repair problem. We are also pleased to acknowledge the support and helpful criticisms of Mark Stefik. Clive Dym and some anonymous referees provided valuable comments on earlier drafts of this paper.

References

Bobrow, D. G., & Stefik, M. J. (1983). Loops Manual. Xerox PARC, December.

Brown, J. S., Burton, R., & de Kleer, J. (1982). Pedagogical, natural language and knowledge engineering techniques in SOPHIE I, II and III. In D. Sleeman & J. S. Brown (Eds.), Intelligent tutoring systems. New York: Academic Press.

Chandrasekaran, B. (1984). Expert systems: Matching techniques to tasks. In W. Reitman (Ed.), AI applications for business. Norwood, NJ: Ablex, pp. 116–132.

Chandrasekaran, B., & Mittal, S. (1983). Conceptual representation of medical knowledge for diagnosis by computer: MDX and related systems. In M. C. Yovits (Ed.), Advances in computers, Vol. 22.

Chandrasekaran, B., & Mittal, S. (1982). Deep versus compiled knowledge approaches to diagnostic problem-solving. Proceedings AAAI-82, Pittsburgh, August.

Clancey, W. J. (1985). Heuristic Classification. Artificial Intelligence, 27(3).

de Kleer, J., et al. (1979). Explicit control of reasoning. In P. H. Winston & R. H. Brown (Eds.), Artificial intelligence: An MIT perspective. Cambridge, MA: MIT Press.

de Kleer, J. (1984). How circuits work. Artificial Intelligence, 24(1–3).

Genesereth, M. (1984). The use of design descriptions in automated diagnosis. Artificial Intelligence, 24(1–3).

Mittal, S. (1980). Design of a distributed medical diagnosis and database system. Ph.D. dissertation, Department of Computer and Information Science, Ohio State University, Columbus.

Pan, J., & Tenenbaum, J. M. (1986). PIES: An engineer’s “Do it yourself” knowledge system for interpretation of parametric test data. Proceedings AAAI-86, Philadelphia, August.

Sanella, M. (Ed.) (1983). Interlisp reference manual. Xerox Corp., October.

Stefik, M. J. (1986). The next knowledge medium. AI Magazine, Spring.

Stefik, M. J., & Bobrow, D. G. (1986). Object-oriented programming: Themes and Variations. AI Magazine, Winter.

Sussman, G. J., & Stallman, R. (1975). Heuristic techniques in computer-aided circuit analysis. IEEE Transactions Circuits & Systems, CAS-22.

Szolovits, P., & Pauker, S. G. (1978). Categorical and probabilistic reasoning in medical diagnosis. Artificial Intelligence, 11, 115–144.