
A Method For Quantifying the Importance of Facts, Rules and Hypotheses

by

T.J. Parsons
British Aerospace, Hatfield

Abstract

A labelled digraph is used as a model of a simple database, nodes representing facts (or classes of facts) and arcs the relationships between these facts. An expression for the number of microstates in which such a data structure may exist is derived and used to calculate a measure of intrinsic entropy. This measure is fundamentally related to the information content of the structure, and its change, on adding or subtracting an item of information from the database, may be used to associate a value called Importance with the item.

An algorithm called the "Database Monitor Program (DMP)" is introduced. Its function is to guarantee that the digraph database exists in a 'least complex state' by replacing relationships between nodes by relationships pertaining to classes of nodes, but with the constraint that the information content of the structure remains unchanged. Once in this minimal condition, the Importance of an item is defined in terms of the change it induces on the minimum entropy state.

Like measures of entropy, the Importance of an item is a relative concept in that its value depends on the context of the database. It is argued that this is an intuitively appealing measure in that a system is only able to judge the importance of an item in the light of existing knowledge.

Two additional concepts termed confidence and significance are introduced and used to assess the formation of 'class concepts' within the database. The use of these three measures for conflict resolution is also discussed.

Finally, an example system developed within the Poplog environment is presented and extensions of the work are discussed.

1 Introduction

At least six different techniques exist which attempt to handle vagueness or incompleteness in the information that is to be presented to a knowledge-based system.

Bayesian Probability, Certainty Theory [6], the Theory of Evidence [5] and Possibility Theory [8] all seek to assign numerical values to assertions, and arithmetic functions to logical connectives, hence providing a mechanism for calculating and associating values with rules and goals. This mechanism may be called "Probabilistic Reasoning".

The work of Bundy [1] discusses the limitations of this purely numerical approach. In the case of non-independent laws, knowledge from all combinations of evidence and hypotheses has to be considered, and this is not feasible in many applications. Bundy also argues that the final value obtained for a goal assertion does not represent the probability of that assertion but rather is a measure whose exact meaning is ill-defined and which can only be used to rank goals in some order. Bundy proposed an Incidence Calculus to overcome these difficulties.

The Plausibility Theory of Rescher [4] is often used in combination with the other techniques in an attempt to deal with unreliable knowledge.

With the limitations of probabilistic reasoning in mind, this paper seeks to develop a different approach, termed "Importance Theory".

A significant feature of this method is that it neither assigns a priori values to assertions or rules nor depends on probabilistic reasoning to assess the significance of hypotheses given the current evidence. Hence the inherent weakness of Probabilistic Reasoning is avoided; instead, a value is derived which measures how 'important' the assertion, rule or hypothesis is to the database as a whole.

It is argued that importance, significance and confidence provide useful criteria by which the formation of class concepts within a database may be judged. The same criteria may also be used to judge the success of a model hypothesis which we wish to impose upon a database of facts. The preferred set of hypotheses, as reflected by our measure, is simply that which explains most of the database.

AVC 1987. doi:10.5244/C.1.7


The theory is developed initially for a database containing only certain information.

2 Knowledge Representation

The data structure capable of storing relational information, upon which the ideas presented in this paper are developed, is termed a directed graph (digraph). Digraphs are a simple and convenient way of representing relational information.

Digraphs may be formally defined by the following primitives and axioms. The primitives are:

p1 A set V of elements called nodes

p2 A set X of elements called arcs

p3 A function F whose domain is X and whose range is contained in V

p4 A function S whose domain is X and whose range is contained in V

The Axioms are:

A1 The set V is finite and not empty.

A2 The set X is finite.

A digraph may contain lines, parallel lines and loops, but two restrictions on the concept of a digraph are of interest. The first we term a 'restricted digraph', where the existence of parallel lines is forbidden. The second we term a 'simple digraph', where the existence of both parallel lines and loops is forbidden.

Strictly, the data structure should be termed a 'hyper-digraph', in analogy with the 'hyper-graph' structures of [2], in that we permit nodes of the digraph to consist of a hierarchy of nested digraph structures.
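To make the primitives concrete, the following is a minimal Python sketch (not from the paper) of a digraph satisfying p1 to p4, together with tests for the two restricted variants; all names are illustrative.

from dataclasses import dataclass, field

@dataclass
class Digraph:
    # p1: the set V of nodes; p2: the set X of arcs, here keyed by an id
    nodes: set = field(default_factory=set)
    arcs: dict = field(default_factory=dict)   # arc id -> (F(arc), S(arc))

    def add_arc(self, arc_id, source, target):
        # p3 and p4: the functions F and S map each arc into V;
        # parallel lines and loops are permitted in the general digraph
        self.nodes.update((source, target))
        self.arcs[arc_id] = (source, target)

    def is_restricted(self):
        # a 'restricted digraph' forbids parallel lines
        return len(set(self.arcs.values())) == len(self.arcs)

    def is_simple(self):
        # a 'simple digraph' additionally forbids loops
        return self.is_restricted() and all(s != t for s, t in self.arcs.values())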

3 Terms and Definitions

The concepts of a cluster and a class are now introduced.

1. A given subset of the nodes of the digraph database is termed a cluster. A cluster may thus contain just one node.

2. An Assertion, Rule or Hypothesis that is to be added to or removed from the database is termed an item.

Figure 1: Complexity change on forming a class w.r.t. X

3. A 'cluster' is termed a class if each member of the cluster behaves identically with respect to a second cluster within the database. This second cluster may be the same, partially the same, or completely different (as regards its members) from the first cluster. The act of replacing individual relationships between members of a cluster and the database by a relationship between a class and the database constitutes the act of class formation; this process is illustrated in Fig. 1.

4. Members of a 'class' may or may not behave identically with respect to each other, but those which do may be said to form a sub-class of that class.

5. The minimal complexity state of a digraph database is that form of the database containing the minimum number of microstates consistent with preserving the information within it.

The formation of a class α within a database may be considered to be the result of an operation on that database. If that operator is R_α acting on the database D then formally we write:

R_α D → D_α    (1)

By the definition of a class, introduced above, it can be seen that the creation of a certain class does not affect the ability of each member of that class to be considered a member of another class. Hence the formation of a set of classes in a database, whether they overlap or not, results in the formation of a simplified database which is independent of the order in which the classes are created; i.e., if C(D) is the complexity of a database, then:

C(D_αβγ) = C(D_βγα) = C(D_γβα) = ... etc.    (2)

Fig. 2 presents a small example database and illustrates the reduction in complexity incurred under the transformation R_αβγ, and Fig. 3 traces the changes in the number of arcs in the database on the formation of these classes, illustrating equation 2.

Figure 2: Unstructured database changing on creating classes α, β, γ

Figure 3: Tracing arc complexity changes on creating α, β, γ
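As a concrete illustration of class formation, the sketch below (Python; the triple representation and all names are assumptions, not the paper's code) replaces the individual arcs from a set of identically behaving nodes by a single arc from the class, as in Fig. 1.

def form_class(facts, members, label, target):
    # Replace each arc member -> target (all carrying the same label)
    # by one arc from the class to the target, reducing the arc count.
    cls = frozenset(members)
    reduced = {f for f in facts
               if not (f[0] in members and f[1] == label and f[2] == target)}
    reduced.add((cls, label, target))
    return reduced

facts = {(1, 'r', 7), (2, 'r', 7), (3, 'r', 7)}
print(form_class(facts, {1, 2, 3}, 'r', 7))   # three arcs collapse to one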

4 The Database Monitor Program (DMP)

Class formation is the technique by which the digraph database is iteratively simplified by the DMP until the 'minimal complexity state' is reached. The first step of the algorithm transforms a digraph into a restricted digraph by removing duplicated information in the form of parallel lines. The second part of the algorithm successively reduces the number of arcs and nodes in the database by class formation. The information content of the database is preserved throughout this transformation.

The DMP is central to the ideas presented in this paper because, before the importance of an item can be calculated with respect to a database by noting the change in entropy of that database on introduction of that item, it is necessary to ensure that the database is in its minimal state.

It will be shown later that running the DMP on a static database gives a measure of the initial disorder of that database. Adding an item at this point and again running the DMP enables a measure of importance for the item to be defined as a function of the difference in these minimal states before and after the addition of the item.

Equation 2 is extremely important in its implications for the design of the DMP: each class formation gives a reduction in the complexity of the database (or at least it can never give an increase) and, because the final measure of the complexity reduction is independent of the path taken to reach it, we can guarantee always to find the global minimum. Indeed, local minima do not exist. The DMP may be pseudo-coded as follows:

PROCEDURE importance(database: list, value: real)
{ Routine to reduce a labelled digraph database to its minimal
  complexity state. The value returned is a measure of the
  initial database's disorder. Classes or nodes are called
  clannodes. }
BEGIN
  FOR (each type of labelled arc) DO
    REPEAT
      FOR (each clannode) DO
      {
        merge two clannodes if they behave identically
        w.r.t. another clannode.
      }
      END;
    UNTIL (no further complexity reduction);
  END;
  value := initial_complexity - final_complexity;
END;
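The merge loop can also be given as a compact Python sketch, under the assumption that facts are (source, label, target) triples and clannodes are frozensets of base nodes. It merges on identical outgoing behaviour only (the pass over incoming arcs is symmetric) and is not the paper's Poplog implementation.

from collections import defaultdict

def dmp(facts):
    # facts: iterable of (source, label, target) with tuple endpoints,
    # e.g. ((1,), 'r', (7,)). Clannodes are held as frozensets.
    facts = {(frozenset(s), l, frozenset(t)) for (s, l, t) in facts}
    changed = True
    while changed:
        changed = False
        groups = defaultdict(list)
        for node in {f[0] for f in facts}:
            # a clannode's behaviour is its complete set of out-arcs
            out = frozenset((l, t) for (s, l, t) in facts if s == node)
            groups[out].append(node)
        for members in groups.values():
            if len(members) > 1:                 # identical behaviour: merge
                merged = frozenset().union(*members)
                facts = {(merged if s in members else s, l,
                          merged if t in members else t)
                         for (s, l, t) in facts}
                changed = True
                break                            # regroup after each merge
    return facts

print(dmp({((1,), 'r', (7,)), ((2,), 'r', (7,)), ((3,), 'r', (7,))}))
# -> the single fact (frozenset({1, 2, 3}), 'r', frozenset({7}))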

5 Unlabelled Digraphs: Derivation of Measures

The following analysis yields an expression for the state of entropy of a database modelled by the digraph structure. Consider a database capable of existing in a large number of states and changing its internal structure under the addition or removal of items (assertions, rules or hypotheses). The question is posed: 'How does the entropy of the database change under such transformations?'

Classical Physics defines the entropy of a system in terms of the number of microstates in which it can exist, that is:

S = K ln(Ω)    (3)

where Ω is the number of microstates. The task remains to develop a formula yielding the number of microstates in which a digraph of N nodes and m arcs may exist.

Given N nodes there are N(N - 1) possible places to insert an arc and N places to insert a loop. Hence the number of 'slots' P in which an arc may be placed is:

P = N^2    (4)

Some thought shows that the number of distinct ways of placing m arcs into P slots (the number of microstates) is:

Ω(m) = P! / (m!(P - m)!)    (5)

and hence the expression for entropy becomes:

S = K ln[P! / (m!(P - m)!)]    (6)
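Equation 6 is directly computable; in the Python sketch below, log-gamma is used so the factorials never need to be evaluated explicitly, and K = 1 is an assumption.

import math

def entropy(N, m, K=1.0):
    # Equation 6: S = K ln[P!/(m!(P - m)!)], with P = N^2 slots (equation 4).
    # math.lgamma(n + 1) equals ln(n!), avoiding enormous factorials.
    P = N * N
    return K * (math.lgamma(P + 1) - math.lgamma(m + 1) - math.lgamma(P - m + 1))

print(entropy(7, 8))   # e.g. a 7-node, 8-arc digraph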

When adding or removing items, the number of arcs and nodes is changed and a new value of the entropy measure results:

S_2 = K ln[P_2! / (m_2!(P_2 - m_2)!)]    (7)

where m_2 and P_2 are related to the number of arcs and nodes in the digraph after the addition of an item. If P_1 and m_1 pertain to the state of the database before the addition of these items, then the associated change in entropy is:

δS = K ln[P_1! m_2! (P_2 - m_2)! / (P_2! m_1! (P_1 - m_1)!)]    (8)

Stirling's approximation may be used to analyse the behaviour of equation 6 for large databases. If both P and m are large then:

ln(n!) ≈ n ln(n) - n    (9)

where n is either P or m. The expression for entropy becomes:

Figure 4: Diagram showing how entropy S changes with the number of arcs m in the database, for m between 0 and P

S = K[P ln(P) - m ln(m) - (P - m) ln(P - m)]    (10)

Note, by substitution in equation 10, that the function is symmetric: if either m → P or m → 0 then:

S → 0    (11)

and differentiating entropy with respect to the num-ber of arcs yields:

dS= -K[ln(m)ln(P - m)] (12)

dmThe function thus behaves as illustrated in

Fig. 4. The underlying symmetry of the functionreflects the essential duality of graph structures, inthat the presence or absence of an arc may representinformation.

The function illustrated in Fig. 4 warrants further discussion. At first inspection the symmetry about the point m = P/2 is worrying because it implies an ambiguity in our measure of entropy.

The number of arcs in our database is related to the number of nodes. The minimum number of arcs the database can have is N, whilst the maximum number it may have is N^2 (N is the number of nodes). Thus the point of symmetry N^2/2 lies between these bounds for all N > 2, and one might hence indeed expect the entropy measure to be ambiguous. However, when one more arc is added to the minimum number of arcs m_min = N, the DMP, as discussed in section 4, will recognise the existence of a class with two members and will simplify the knowledge base accordingly. The overall effect of the DMP is to ensure that the entropy measure returned lies in the monotonically increasing region of the curve below N^2/2, labelled 'a' in Fig. 4. An unambiguous measure of the state of entropy for the knowledge base is thus assured.
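The symmetry of equation 10 can be checked numerically against the exact expression of equation 6; a small self-contained Python sketch (K = 1 assumed):

import math

def S(N, m):
    # exact entropy of equation 6 with K = 1
    P = N * N
    return math.lgamma(P + 1) - math.lgamma(m + 1) - math.lgamma(P - m + 1)

N = 10
P = N * N
for m in (5, 20, 45):
    # S is symmetric about m = P/2, the ambiguity discussed above
    assert abs(S(N, m) - S(N, P - m)) < 1e-9
# the DMP keeps m on the increasing branch 'a' below P/2,
# so the reported entropy is unambiguous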

6 Normalised Measures

A normalised measure of the entropy change within a database may be developed by noting that the maximum possible importance that could be attributed to an item would be the one permitting the DMP to simplify the database to a 2-node, 1-arc structure.

Substituting P_2 = 4, m_2 = 1 into equation 8 gives:

δS_max = K ln[P_1! / (4 m_1! (P_1 - m_1)!)]    (13)

δS_max is related to the constant K of equation 8, and the normalised measure of entropy change, called importance, becomes:

I = δS / δS_max    (14)

This function ranges between the values 0 and 1, a value of 1 indicating that the maximum possible change to the database has occurred.

Note that, in practice, it is not necessary to fully evaluate all the factorials in these equations, because P and m are typically of the same magnitude, so that ratios such as P!/m! reduce to a product of only P - m terms.
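A Python sketch of the normalised importance of equations 8, 13 and 14; the constant K cancels in the ratio, and log-gamma exploits the cancellation just mentioned.

import math

def log_fact(n):
    return math.lgamma(n + 1)   # ln(n!)

def importance(m1, n1, m2, n2):
    # equations 8, 13 and 14, with P = n^2 slots; K cancels in the ratio
    P1, P2 = n1 * n1, n2 * n2
    dS = (log_fact(P1) + log_fact(m2) + log_fact(P2 - m2)
          - log_fact(P2) - log_fact(m1) - log_fact(P1 - m1))
    dS_max = log_fact(P1) - math.log(4) - log_fact(m1) - log_fact(P1 - m1)
    return dS / dS_max

print(importance(8, 7, 1, 2))   # maximal reduction to 2 nodes, 1 arc gives 1.0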

In section 10, this normalised measure is used to calculate the importance of forming three concept classes α, β and γ within the example database of Fig. 2.

7 Labelled Digraphs: Derivation of Measures

In the previous section an expression for the number of microstates in which an unlabelled digraph could exist was developed. The analysis may be extended to labelled digraphs by noting that such a structure may be effectively decomposed into a set of singly labelled digraphs, as illustrated in Fig. 5. If there are L labels then:

S = K ln ∏_{i=1..L} [P! / (m_i!(P - m_i)!)]    (15)

where m_i is the number of arcs of a given label.

Figure 5: The decomposition of a labelled digraph into a set of digraphs

A normalised expression for the change of entropy upon a reduction or increase in the number of arcs and nodes may again be developed by assuming that the maximum possible reduction of the labelled digraph is to a structure consisting of only two nodes and one arc:

I = δS / δS_max    (16)

This equation is similar to equation 14, except that δS is derived from equation 15 rather than equation 6.

The above analysis assumes that the existence of a label L_1 between two nodes does not preclude the existence of a label L_2 between those same two nodes; i.e., the labels are assumed to be independent. In the system described here, which deals only with certain information, contradictory information already within the database is ignored, and information already within the database which is contradicted by new information is updated.
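Equation 15 follows because the labelled digraph decomposes into one singly labelled digraph per label, the microstate counts multiply, and so the log terms add; a Python sketch with K = 1 assumed:

import math
from collections import Counter

def labelled_entropy(N, arcs, K=1.0):
    # arcs: iterable of (source, label, target); m_i counts arcs per label
    P = N * N
    m = Counter(label for (_, label, _) in arcs)
    return K * sum(math.lgamma(P + 1) - math.lgamma(mi + 1)
                   - math.lgamma(P - mi + 1)
                   for mi in m.values())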

8 Confidence and Significance Measures

Fig. 6 illustrates three different ways in which the entropy of a database in its minimal state of complexity might change on adding items at point A and running the DMP at point B. Point C indicates the final state of the database.

The importance of an item may be defined as a function of the difference between the arc-node count at point A and point C:

I = f(m_a, n_a) - f(m_c, n_c)    (17)

where f is the Importance function of equation 16.


Thus if I = 0, the information is of no importance to the database, perhaps because the item is already in the database. If I < 0 then the item has merely added to the complexity of the database. If I > 0 then the item is of importance to the database, and the more important it is, the closer the value is to unity.

Two other measures are notable; they are termed Significance and Confidence. Significance is a measure of the complexity of the items to be added to the database:

Significance = [f(m_b, n_b) - f(m_a, n_a)] / f(m_b, n_b)    (18)

Thus, the nearer this measure is to unity, the greater the significance of the items added to the database.

Our measure of confidence in the importance value obtained is defined simply as:

Confidence = Importance - Significance    (19)

A near-zero value indicates that little faith should be placed in the importance measure. A value near one indicates that the importance value is worth considering. This ability to quantify the extent of our confidence in the Importance value is extremely useful, and no analogous measure exists in Bayesian probability theory.

The greater the significance of the items added in order to achieve a given reduction in complexity, the less confidence we have in the results. This measure is especially interesting when considering the effect of a hypothesis on a database. The least complex hypothesis to produce a reduction in the database's complexity is preferred to a more complex hypothesis (Occam's Razor?).

The complexity increase associated with the database on addition of a single fact, before the DMP is run, is easily calculated using equations 15 and 17: the number of arcs is increased by one and the number of nodes by two.

However, when several facts are to be added to the database as a group, they must themselves be regarded as a digraph and transformed by the DMP before a measure of their significance may be calculated. After this transformation, the importance of the group of facts w.r.t. the database is calculated in the established manner. The Significance and Confidence values may also be calculated.

Figure 6: Diagram showing possible database entropy changes (entropy plotted against items added)
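Equations 17 to 19 may be read off the three snapshots A, B and C of Fig. 6. The Python sketch below assumes, for illustration only, that f is the K = 1 entropy of equation 6 applied to an (arcs, nodes) pair; the paper itself defines f via equation 16.

import math

def f(m, n):
    # assumed stand-in for the paper's f: equation 6 entropy with P = n^2
    P = n * n
    return math.lgamma(P + 1) - math.lgamma(m + 1) - math.lgamma(P - m + 1)

def measures(a, b, c):
    # a, b, c = (arcs, nodes) at points A (before the items), B (after
    # adding them) and C (after the DMP has run), as in Fig. 6
    imp = f(*a) - f(*c)              # equation 17
    sig = (f(*b) - f(*a)) / f(*b)    # equation 18
    conf = imp - sig                 # equation 19
    return imp, sig, conf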

9 The Minimal Complexity Database: Further Notes

When adding items to the database, their importance is judged by calculating the change in complexity that occurs, compared with the total change that could be possible (that is, reduction to a two-node, single-arc relation).

However, once the item is in the minimal complexity database, its importance is calculated by considering the displacement from this minimal state that its removal would induce on that database, the difference being that the importance value now returned is modified by the context of any items since added to the database.

The measure is now expressed as:

Importance = 1 - δS / δS_max    (20)

As each element is added to the database it becomes necessary to re-form its minimal state in order to calculate measures for the item added and for any future items to be added.

This housekeeping may be performed efficiently by tagging both the number of input arcs and output arcs associated with any class or node. To see how this is possible, remember that the DMP reduces a digraph to a restricted digraph (i.e. removes any parallel lines), and eventually reduces the graph to such a form that neither the number of inputs nor the number of outputs of a node can exceed one. Thus the most complex structure that can exist in the minimal complexity database is a cycle with N members. These reductions are summarised in Fig. 7.

Figure 7: Database transformations under the action of the DMP

On destruction of a class concept it is known, a priori, that only one arc related that class to the rest of the database; an exhaustive search is avoided. We can hence calculate immediately the increase in the number of arcs and nodes that would be needed if that class information were to be destroyed. Also, on adding information to the minimal complexity database, nodes having more than one input or output may be tagged, and only these need be addressed by the DMP. Exhaustive searching of the database is again unnecessary.
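The degree-tagging housekeeping can be sketched as follows (Python; all names assumed): every clannode carries its in- and out-arc counts, and only nodes where either count exceeds one need be revisited by the DMP.

from collections import Counter

def tag_degrees(arcs):
    # arcs: iterable of (source, label, target)
    indeg, outdeg = Counter(), Counter()
    for src, _, tgt in arcs:
        outdeg[src] += 1
        indeg[tgt] += 1
    return {n: (indeg[n], outdeg[n]) for n in set(indeg) | set(outdeg)}

def needs_attention(tags):
    # in the minimal state no node has more than one input or output,
    # so these are the only candidates for further class formation
    return [n for n, (i, o) in tags.items() if i > 1 or o > 1]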

The importance values associated with class concepts within the database are explored in the next section.

10 An Example System

The following system was developed within the Poplog environment [3], [7], particular use being made of 'Super Database', a Prolog-type library package permitting depth-first searching, and 'Sets', a package permitting the use of set operations.

The software was developed to enable a user to investigate the Importance, Significance and Complexity measures of classes formed automatically within a database by the DMP.

The system was constructed so that, on the addition of facts or rules to the database, the associated values were both reported to the user and used to place any rules in a 'rule priority list'.

Relationships between nodes a, b were represented as an entry in the database of the following form:

[fact "relationship [a] [b]]

Relationships between nodes and classes, or classes and classes, are of the same form:

[fact "relationship [a b c d] [e f g h]]

Backtracking through the database is initiated by commands such as:

Present([and [fact "relationship ?x ?y]
             [fact "relationship ?z ?w]]);

where x, y, z and w are variables which become instantiated on patterns within the database.

The database depicted in Fig. 2 is thus encoded as follows:

database ==>
[fact "relationship [1] [7]]
[fact "relationship [1] [6]]
[fact "relationship [2] [6]]
[fact "relationship [2] [7]]
[fact "relationship [3] [6]]
[fact "relationship [3] [7]]
[fact "relationship [2] [4]]
[fact "relationship [3] [4]]

Central to the system are the following routines.

1. status("relationship") -> arcs -> nodes;

This routine counts the number of arcs and nodes within the database.

2. dmp();

This routine, as described earlier, transforms a database of the above format into its minimal state of complexity, without loss of information.

3. importance(m_1, n_1, m_2, n_2) -> measure;

This routine returns a measure of importance for an item added to the database (fact, rule or hypothesis), based on both the initial and final counts of the number of arcs and nodes.

4. weight(class) -> measure;

This routine reports the measure of importance to be associated with any fact that already exists within the database. Its definition will be given later.

Thus a measure of the initial disorder of a static database would be provided by the following sequence of calls:

1. status("relationship") -> m_1 -> n_1;

2. dmp();

3. status("relationship") -> m_2 -> n_2;

4. importance(m_1, n_1, m_2, n_2) -> measure;
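The four calls combine as in the following Python sketch, reusing the dmp() and importance() sketches given earlier; status() here simply counts arcs and distinct endpoints, and all names are illustrative rather than the Poplog routines.

def status(facts):
    # count arcs and distinct clannodes in a fact set
    arcs = len(facts)
    nodes = len({f[0] for f in facts} | {f[2] for f in facts})
    return arcs, nodes

def initial_disorder(facts):
    m1, n1 = status(facts)                 # 1. count before
    reduced = dmp(facts)                   # 2. run the DMP
    m2, n2 = status(reduced)               # 3. count after
    return importance(m1, n1, m2, n2)      # 4. normalised disorder measure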

After running the DMP the database becomes:

database ==>
[fact "relationship [a b c] [f g]]
[fact "relationship [b c] [d]]

with a measure of disorder of 0.78.

The complexity changes on the formation of the three classes α, β and γ in the example database of Fig. 2 may be traced. Fig. 9 illustrates the database changes incurred on forming these classes separately and lists the importance value associated with each one.

Figure 9: Illustrating class formations within the Example Database

Let I(α|∅) be the importance to the database of forming a class α, and let I(α|β) be the importance to the database of forming a class α given that a class β already exists. Using equation 16 it can be calculated that:

I(α|∅) = 0.61
I(β|∅) = 0.4
I(γ|∅) = 0.4
I(α|β) = 0.68
I(γ|β) = 0.64

As might be expected, the formation of class α in the database is considered to be more important than the formation of β or γ. After class β is formed, the formation of class α is still more important than the formation of class γ.

Fig. 8 illustrates another database of interest. Three classes α, β, γ have again been formed by the DMP, and class α might be considered very important with respect to the database because of its size. The importance values, as calculated using equation 16, are 0.77, 0.31 and 0.19 for classes α, β and γ respectively. Thus class α is indeed more important than classes β and γ. However, once the classes are all formed within the database, equation 20 can be used to calculate the new Importance values: 0.56, 0.38 and 0.19 for α, β and γ respectively. The importance associated with class α has been reduced considerably because of the large degree of overlap with the class concept β.

Figure 8: Class concepts and associated Importance values

Early experiments indicated that a 50-node graph could be reduced to its minimal complexity state within 40 seconds real time on a VAX 11/785 under normal load conditions. Whilst this was not particularly fast, it was considered usable for developing the type of system presented in this paper. In addition, it was shown that this minimal state was indeed independent of any initial ordering of the database.

11 Rules and Hypotheses

The importance of a rule with respect to our database is evaluated by using the rule to create facts which are not already explicitly present.

Some rules lead to a profusion of new facts and relationships within the database and a dramatic increase in the number of arcs and nodes. It is these rules which are considered to be the most relevant and, in the current system, the importance value is used to place them in a 'rule priority list'. Hence the production rules fired first, upon modifications to the database, are the ones which seem most relevant.
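A Python sketch of this prioritisation, reusing status(), dmp() and importance() from the earlier sketches; rule.apply(facts), returning the facts a rule makes explicit, is an assumed interface, not the paper's.

def rule_importance(rule, facts):
    # fire the rule, then compare the minimal states before and after
    grown = set(facts) | set(rule.apply(facts))
    m1, n1 = status(dmp(facts))
    m2, n2 = status(dmp(grown))
    return importance(m1, n1, m2, n2)

def prioritise(rules, facts):
    # rules generating the greatest simplification fire first
    return sorted(rules, key=lambda r: rule_importance(r, facts), reverse=True)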

Hypotheses may also have a dramatic effect on the database and, for clarity, a particular example is given. Consider the situation of a relational database derived from a complex scene containing a car. The identification of a large sub-section of the relational graph containing the structural description of a car may be considered as a class concept named "car". The consideration of the single class "car", as opposed to the complexity it may hide, induces a large simplification in our database, and hence the hypothesis would be given a large importance value. Expressed in a slightly different way, the class concept "car" explains a large portion of our relational database and hence has a large value of Importance associated with it. If, on the other hand, only partial evidence existed for the presence of a car, then the size of the digraph required to complete the class concept car, quantified as significance, together with the importance of the class concept, enables a confidence value for the observed partial evidence to be obtained.

12 Future Work

The work presented in this paper is limited in that the theory deals only with a database containing certain information. However, this forms the basis of current efforts which attempt to deal with databases represented by weighted labelled digraphs. This is regarded as an adequate representation for uncertain relational information.

Digraphs are easily represented in terms of connectivity matrices, and the DMP operation may also be represented in terms of a set of matrix operations. There is thus a real possibility of dramatically increasing the current execution speed of the algorithms presented in this paper, on both serial computers and, to a much greater extent, on parallel architectures. This work will be presented in a future paper.

13 Acknowledgements

The work was conducted within the MMI 007 Alvey consortium. I would like to thank my ex-colleague D. F. Nuttall for many valuable conversations and my colleague Dr Clark for his help with LaTeX layout and typography.

References

[1] A. Bundy. Incidence calculus: a mechanism for probabilistic reasoning. Research Paper No. 216, Department of Artificial Intelligence, University of Edinburgh, 1984.

[2] C. Berge. Graphs and Hyper-Graphs. North-Holland Mathematical Library, 1979.

[3] J. Gibson. An A.I. Programming Language. Technical Report, Cognitive Studies Programme, University of Sussex, Brighton, 1980.

[4] N. Rescher. Plausible Reasoning. Van Gorcum, Amsterdam, 1976.

[5] G. Shafer. A Mathematical Theory of Evidence. Princeton University Press, Princeton, New Jersey, 1976.

[6] E.H. Shortliffe and B.G. Buchanan. A model of inexact reasoning in medicine. Mathematical Biosciences, 23:351-379, 1975.

[7] A.S. Sloman and J. Gibson. Poplog: A Multi-language Program Development Environment in Image Technology. Technical Report, University of Sussex, 1983.

[8] L.A. Zadeh. Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1978.
