
Machine Learning: Rule Learning


Learning Rules

• If-then rules in logic are a standard representation of knowledge that have proven useful in expert systems and other AI systems.
– In propositional logic, a set of rules for a concept is equivalent to DNF.

• Rules are fairly easy for people to understand and therefore can help provide insight and comprehensible results for human users.
– Frequently used in data mining applications where the goal is discovering understandable patterns in data.

• Methods for automatically inducing rules from data have been shown to build more accurate expert systems than human knowledge engineering for some applications.

• Rule-learning methods have been extended to first-order logic to handle relational (structural) representations.
– Inductive Logic Programming (ILP) for learning Prolog programs from I/O pairs.
– Allows moving beyond simple feature-vector representations of data.

Two types of rules

• Propositional: Sunny ∧ Warm → PlayTennis

• First-order predicate logic:
Father(x,y) ∧ female(y) → Daughter(y,x)
Parent(x,y) ∧ Parent(y,z) → Ancestor(x,z)

Propositional logic assumes the world contains facts, and its expressive power is much lower than FOL: you cannot state a general rule such as Daughter(y,x), only specific ground facts.

Terminology in FOL

• All expressions are composed of:
– Constants (Bob, Louise)
– Variables (x, y)
– Predicate symbols (Loves, Greater_Than)
– Function symbols (age)

• Term: any constant, any variable, or any function applied to any term (Bob, x, age(Bob))

• Literal: any predicate, or its negation, applied to any terms:
Loves(Bob, Louise) is a positive literal
¬Greater_Than(age(x), 20) is a negative literal

More terminology

• Ground literal: a literal that does not contain any variables, e.g. Loves(Bob, Louise)

• Clause: any disjunction of literals, where all variables are assumed to be universally quantified

• Horn clause: a clause containing at most one positive (non-negated) literal

H ∨ ¬L1 ∨ … ∨ ¬Ln

or equivalently: (L1 ∧ … ∧ Ln) → H

(L1 ∧ … ∧ Ln) is the body (the antecedents); H is the head (the consequent)

Li examples: Loves(x,y), Color(blue), or equivalently Color=blue


Rule Learning Approaches

• Translate decision trees into rules (C4.5)

• Sequential (set) covering algorithms
– General-to-specific (top-down) (CN2, FOIL)
– Specific-to-general (bottom-up) (GOLEM, CIGOL)
– Hybrid search (AQ, Chillin, Progol)

• Translate neural-nets into rules (TREPAN) (not in this course)


Decision-Trees to Rules

• For each path in a decision tree from the root to a leaf, create a rule with the conjunction of tests along the path as an antecedent and the leaf label as the consequent.

[Figure: decision tree. The root tests color (red / blue / green); the red branch tests shape (circle → A, square → B, triangle → C); blue → B; green → C.]

The extracted rules:
red ∧ circle → A
blue → B
red ∧ square → B
green → C
red ∧ triangle → C

Notice again, these are all equivalent:
Color(red) ∧ Shape(circle) → A
Color=red ∧ Shape=circle → A
red ∧ circle → A
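To make the path-to-rule translation concrete, here is a minimal Python sketch (our own illustration, not C4.5's code), assuming a hypothetical nested-dict encoding in which an internal node maps one feature name to a dict of value → subtree and a leaf is just a class label:

def tree_to_rules(tree, path=()):
    """Walk a decision tree; emit one rule (antecedents, label) per root-to-leaf path."""
    if not isinstance(tree, dict):          # a leaf: its label is the consequent
        return [(list(path), tree)]
    (feature, branches), = tree.items()     # internal node: one feature, a subtree per value
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, path + ((feature, value),)))
    return rules

# The color/shape tree from this slide:
tree = {"color": {"red": {"shape": {"circle": "A", "square": "B", "triangle": "C"}},
                  "blue": "B",
                  "green": "C"}}
for antecedents, label in tree_to_rules(tree):
    print(" ∧ ".join(f"{f}={v}" for f, v in antecedents), "→", label)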


Post-Processing Decision-Tree Rules

1. Resulting rules may contain unnecessary antecedents that are not needed to exclude negative examples and that lead to over-fitting.

2. Rules are post-pruned by greedily removing antecedents or rules until performance on training data or validation set is significantly harmed.

Ex: a∧b∧c∧d → A vs. a∧b∧c → A or a∧b → A, etc. (Remember that from root to leaves the support of test nodes decreases; the most informative features are tested first, according to e.g. information gain.)
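A minimal sketch of this greedy post-pruning, under simplifying assumptions (our own rule representation, and "significantly harmed" reduced to "validation accuracy drops"):

def rule_accuracy(antecedents, label, examples):
    """Accuracy over the examples the rule covers (its precision); 0 if it covers none."""
    hits = [ex for ex in examples if all(ex[f] == v for f, v in antecedents)]
    if not hits:
        return 0.0
    return sum(ex["class"] == label for ex in hits) / len(hits)

def prune_rule(antecedents, label, validation):
    """Greedily drop antecedents as long as validation accuracy does not fall."""
    current = rule_accuracy(antecedents, label, validation)
    improved = True
    while improved and antecedents:
        improved = False
        for a in list(antecedents):
            shorter = [x for x in antecedents if x != a]
            acc = rule_accuracy(shorter, label, validation)
            if acc >= current:           # removal does not hurt: keep the shorter rule
                antecedents, current, improved = shorter, acc, True
                break
    return antecedents, current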

Post-Processing Decision-Tree Rules

3. Resulting rules may lead to conflicting conclusions on some instances.
e.g. a∧b∧c∧d → A and a∧b∧c∧f → B

4. Sort the rules by training (or validation) accuracy to create an ordered decision list. The first rule in the list that applies is used to classify a test instance.


red ∧ circle → A (97% train accuracy)
red ∧ big → B (95% train accuracy)
…
Test case <big, red, circle> is assigned to class A.
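A small sketch of classification with such an ordered decision list (all names illustrative):

def classify(instance, decision_list, default="unknown"):
    """Return the label of the first rule whose antecedents all hold for the instance."""
    for antecedents, label in decision_list:
        if all(instance.get(f) == v for f, v in antecedents):
            return label
    return default

rules = [([("color", "red"), ("shape", "circle")], "A"),   # 97% train accuracy
         ([("color", "red"), ("size", "big")], "B")]       # 95% train accuracy
print(classify({"size": "big", "color": "red", "shape": "circle"}, rules))  # -> A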


Sequential Covering

• A set of rules is learned one at a time, each time finding a single rule that covers a large number of positive instances without covering any negatives, removing the positives that it covers, and learning additional rules to cover the rest.

Let P be the set of positive examples.
Until P is empty do:
  Learn a rule R that covers a large number of elements of P but no negatives.
  Add R to the list of rules.
  Remove the positives covered by R from P.

• This is an instance of a greedy algorithm for minimum set covering and does not guarantee a minimum number of learned rules.

• Minimum set covering is an NP-hard problem and the greedy algorithm is a standard approximation algorithm.

• Methods for learning individual rules vary (see next).
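The covering loop above can be sketched directly; learn_one_rule is a placeholder for any of the single-rule methods on the following slides, assumed to return a coverage predicate together with a printable rule:

def sequential_covering(positives, negatives, learn_one_rule):
    """Greedy set covering: learn one rule at a time, removing the positives it covers.
    learn_one_rule(P, N) must return (covers, rule), where covers(example) is a
    predicate that holds for some of P and (ideally) none of N."""
    P = set(positives)
    rules = []
    while P:
        covers, rule = learn_one_rule(P, negatives)
        covered = {p for p in P if covers(p)}
        if not covered:               # safeguard: no progress means no consistent rule
            break
        rules.append(rule)
        P -= covered                  # remove the positives covered by this rule
    return rules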

Greedy Sequential Covering Example

[Figure: positive (+) examples scattered in the X-Y plane; X and Y are the two features describing each instance.]

[Figure sequence: the first rule learned is the axis-parallel rectangle (A<X<B) ∧ (C<Y<D) around one cluster of positives; the positives it covers are removed, a further rule is learned for the remaining positives, and so on until no positives are left.]

Non-optimal Covering Example

[Figure: the same positive examples; this configuration illustrates that the greedy algorithm's rule choices need not yield the minimum number of rules, as the following sequence shows.]

Greedy Sequential Covering Example

[Figure sequence: a greedy covering run on this configuration; each learned rule again removes the positives it covers, but the greedy choice of the first rule forces additional rules, so more rules are learned than the minimum needed.]

Strategies for Learning a Single Rule in GSC

• Top Down (General to Specific):
– Start with the most-general (empty) rule.
– Repeatedly add antecedent constraints on features that eliminate negative examples while maintaining as many positives as possible.
– Stop when only positives are covered.

• Bottom Up (Specific to General):
– Start with a most-specific rule (e.g. the complete instance description of a random instance).
– Repeatedly remove antecedent constraints in order to cover more positives.
– Stop when further generalization results in covering negatives.
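A sketch of the top-down strategy on two numeric features, in the spirit of the figures that follow (thresholds like Y>C1 added one at a time); the candidate generation and scoring here are our own simplifications:

def covers(pt, rule):
    """rule: conjunction of threshold tests (axis, op, c); axis 0 is X, axis 1 is Y."""
    return all(pt[a] > c if op == ">" else pt[a] < c for a, op, c in rule)

def top_down_rule(P, N):
    """Start from the empty (most general) rule; greedily add the threshold test
    that best trades kept positives against covered negatives."""
    rule = []
    while any(covers(n, rule) for n in N):
        cands = [(a, op, v + d) for a in (0, 1)
                 for v in {p[a] for p in P + N}
                 for op, d in ((">", -0.5), ("<", 0.5))]
        score = lambda t: (sum(covers(p, rule + [t]) for p in P)
                           - sum(covers(n, rule + [t]) for n in N))
        best = max(cands, key=score)
        if sum(covers(n, rule + [best]) for n in N) >= sum(covers(n, rule) for n in N):
            break                    # no candidate removes a covered negative: give up
        rule.append(best)
    return rule

P = [(2.0, 2.0), (2.5, 2.2), (3.0, 2.6)]              # positives (+)
N = [(1.0, 1.0), (4.0, 1.0), (1.0, 3.0), (4.0, 3.5)]  # negatives
print(top_down_rule(P, N))   # e.g. [(0, '>', 1.5), (0, '<', 3.5)]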

Top-Down Rule Learning Example

[Figure sequence: starting from the empty rule, the constraints Y>C1, X>C2, Y<C3, and X<C4 are added one at a time; each new threshold eliminates covered negatives while keeping as many positives as possible, ending with the rectangle (C2<X<C4) ∧ (C1<Y<C3).]

Bottom-Up Rule Learning Example

[Figure sequence: starting from a maximally specific rule around a single seed positive instance, the rule is repeatedly generalized to absorb nearby positives, growing the covered region step by step; generalization stops when any further step would cover a negative.]

Learning First Order Rules

• First order predicate logic rules are much more expressive than propositions, as we noted before.

• Often referred to as inductive logic programming (ILP)


FOIL

• Top-down approach originally applied to first-order logic (Quinlan, 1990).

• Basic algorithm for instances with discrete-valued features:

Let A = {} (set of rule antecedents)
Let N be the set of negative examples
Let P be the current set of uncovered positive examples
Until N is empty do:
  For every feature-value pair (literal) (Fi = Vij), calculate Gain(Fi = Vij, P, N)
  Pick the literal L with the highest gain
  Add L to A
  Remove from N any examples that do not satisfy L
  Remove from P any examples that do not satisfy L
Return the rule: A1 ∧ A2 ∧ … ∧ An → Positive
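A sketch of this loop for dict-valued feature vectors (our own code; examples stand in for bindings, so t = p1, and the Foil_Gain measure itself is defined on a later slide):

from math import log2

def foil_gain(p0, n0, p1, n1, t):
    """Foil_Gain as defined later; here examples play the role of bindings."""
    if p1 == 0:
        return float("-inf")          # the specialized rule would cover no positives
    return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

def learn_rule(P, N):
    """One run of the loop above; P, N are lists of dicts mapping feature -> value."""
    A = []                            # rule antecedents
    P, N = list(P), list(N)
    while N and P:
        literals = {(f, v) for ex in P for f, v in ex.items()}
        def gain(lit):
            f, v = lit
            p1 = sum(ex.get(f) == v for ex in P)
            n1 = sum(ex.get(f) == v for ex in N)
            return foil_gain(len(P), len(N), p1, n1, p1)   # t = p1 in this setting
        L = max(literals, key=gain)
        f, v = L
        N2 = [ex for ex in N if ex.get(f) == v]
        if gain(L) == float("-inf") or len(N2) == len(N):
            break                     # no literal makes progress on the negatives
        A.append(L)
        P = [ex for ex in P if ex.get(f) == v]
        N = N2
    return A                          # read as: A1 ∧ A2 ∧ … ∧ An → Positive

P = [{"sky": "sunny", "temp": "warm"}, {"sky": "sunny", "temp": "cool"}]
N = [{"sky": "rainy", "temp": "warm"}]
print(learn_rule(P, N))               # [('sky', 'sunny')]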

Two Differences between Seq. Covering and FOIL

• FOIL uses a different approach for generating general-to-specific candidate specializations.

• FOIL employs a performance measure, Foil_Gain, that is different from entropy (cf. decision trees).

Rules for generating Candidate Specializations in FOIL

• FOIL generates new literals that can each be added to a rule of the form: L1 ∧ … ∧ Ln → P(x1, x2, …, xk)

• New Candidate Literals Ln+1 can be:

– A new literal Q(v1,…,vr) where:
• Q is any allowable predicate
• Each vi is either a new variable (e.g. xk+1) or a variable already present in the rule (one of x1,…,xk)
• At least one of the vi in the new literal must already be in the rule

– The literal: Equal(xj,xk) where xj, xk are variables already in the rule.

– The negation of either of the above forms of literals
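These generation rules are easy to enumerate; a sketch (our own code, with predicates supplied by hand along with their arities):

from itertools import product

def candidate_literals(predicates, rule_vars, new_var="z"):
    """Candidate specializations per the rules above: Q(v1..vr) with at least one
    variable already in the rule, Equal over existing variables, plus all negations."""
    old = list(rule_vars)
    cands = []
    for q, arity in predicates:                      # e.g. ("Father", 2)
        for args in product(old + [new_var], repeat=arity):
            if any(a in old for a in args):          # must reuse an "old" variable
                cands.append((q,) + args)
    cands += [("Equal", a, b) for i, a in enumerate(old) for b in old[i + 1:]]
    return cands + [("not",) + c for c in cands]     # the negation of each literal

# Reproduces the GrandDaughter(x,y) candidates on the next slide, except that this
# enumeration also yields reflexive forms such as Father(x,x), which the slide omits.
for lit in candidate_literals([("Father", 2), ("Female", 1)], ["x", "y"]):
    print(lit)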


An Example

• Goal is to learn rules to predict: GrandDaughter(x,y)

• Possible predicates:
– Father
– Female

• Initial rule: GrandDaughter(x,y)

1. Generate candidate literals (see the previous rules) for the rule with head GrandDaughter(x,y):

Equal(x,y), Female(x), Female(y), Father(x,y), Father(y,x), Father(x,z), Father(z,x), Father(y,z), Father(z,y), and the negation of each of these.

1. The new variable is z.
2. A new variable cannot be added unless the literal includes at least one "old" variable.
3. Equal applies only to the existing variables x and y.
4. For any literal, its negation is also added.

Consider all alternatives and select the best (see Foil_Gain, later).


Extending the best rule

• Suppose Father(y,z) is selected as the best of the candidate literals (according to Gain, see later)

• The rule that is generated is: Father(y,z) → GrandDaughter(x,y)

• New literals for the next iteration include:
– All remaining previous literals
– Plus new literals (when adding a new variable w):
Female(z), Equal(z,x), Equal(z,y), Father(z,x), Father(z,y), Father(z,w), Father(w,z), and the negations of all of these


Termination of Specialization

• When only positive examples are covered by the rule, the search for specializations of the rule terminates.

• All positive examples covered by the rule are removed from the training set.

• If additional positive examples remain, a new search for an additional rule is started.


Guiding the Search

• Must determine the performance of candidate rules over the training set at each step.

• Must consider all possible bindings of variables to the rule.

• In general, positive bindings are preferred over negative bindings.

Example

• Training data assertions:
GrandDaughter(Victor, Sharon)
Female(Sharon)
Father(Sharon, Bob)
Father(Tom, Bob)
Father(Bob, Victor)

• Use the closed world assumption: any literal involving the specified predicates and constants that is not listed is assumed to be false (what is not currently known to be true is false), e.g.:
GrandDaughter(Tom, Bob), GrandDaughter(Victor, Victor), GrandDaughter(Bob, Victor), Female(Tom), etc. are all false.

Possible Variable Bindings

• Initial rule GrandDaughter(x,y)

• Possible bindings from the training assertions (how many possible bindings of the 4 constants to the initial rule?):
Positive binding: {x/Victor, y/Sharon}
Negative bindings: {x/Victor, y/Victor}, {x/Tom, y/Bob}, etc.

• Positive bindings provide positive evidence and negative bindings provide negative evidence against the rule under consideration.

How many bindings?

• Order matters, so we must count permutations (e.g. GD(Sharon,Sue) ≠ GD(Sue,Sharon)).

• Repetitions are allowed (e.g. GD(Sharon,Sharon)).

• So how many? N^k (in our case, N = 4 persons and k = 2, so 4^2 = 16).
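The count can be checked by enumeration; a minimal sketch using the four constants of the running example, with the closed world assumption supplying the labels:

from itertools import product

constants = ["Victor", "Sharon", "Bob", "Tom"]
positives = {("Victor", "Sharon")}              # the one asserted GrandDaughter fact

bindings = list(product(constants, repeat=2))   # N^k = 4^2 = 16 bindings of (x, y)
print(len(bindings))                            # 16
for x, y in bindings:
    # Closed world assumption: every non-asserted binding counts as negative.
    print("+" if (x, y) in positives else "-", {"x": x, "y": y})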


Evaluation Function: Foil_Gain

• Estimate of utility of adding a new literal

• Based on the numbers of positive and negative bindings covered before and after adding the new literal.

FOIL Gain

R: a rule
L: a candidate literal
R′: the rule created by adding L to R
p0: number of positive bindings of rule R
p1: number of positive bindings of rule R′
n0: number of negative bindings of rule R
n1: number of negative bindings of rule R′
t: number of positive bindings of R still covered after adding L to yield R′

Foil_Gain(L, R) = t · ( log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) )

We have 16 bindings (4^2) for GrandDaughter(x,y), of which one is positive.

• Interpret in terms of information theory:
– The second log term is the minimum number of bits needed to encode the classification of an arbitrary positive binding among the bindings covered by R.
– The first log term is the number of bits required if the binding is one of those covered by rule R′.
– Because t is the number of bindings of R that remain covered by R′, Foil_Gain(L,R) is the reduction due to L in the total number of bits needed to encode the classification of all positive bindings of R.
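A direct transcription of the formula, applied to the running example; the counts for the candidate literal Father(y,z) are our own tally from the training assertions (3 Father facts, each extending to 4 choices of x, give 12 bindings: 1 positive, 11 negative, with t = 1):

from math import log2

def foil_gain(p0, n0, p1, n1, t):
    """Reduction in bits to encode the classification of the still-covered positives."""
    return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# Initial rule GrandDaughter(x,y): 16 bindings, 1 positive, 15 negative (slide above).
# Candidate rule Father(y,z) -> GrandDaughter(x,y): 12 bindings, 1 positive, 11 negative.
print(foil_gain(p0=1, n0=15, p1=1, n1=11, t=1))   # ~0.415 bits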


Ex: Summer school database

[Figure: tables of the relations subscription, course, company, and person, with the positive and negative examples of the target concept.]

Learning Recursive Rule Sets

• Allow new literals added to rule body to refer to the target predicate.

• Example:
edge(X,Y) → path(X,Y)
edge(X,Z) ∧ path(Z,Y) → path(X,Y)

• FOIL can accommodate recursive rules, but must take precautions to avoid infinite recursion


FOIL Example 2 (recursive)

• Example: Consider learning a definition for the target predicate path for finding a path in a directed acyclic graph.

edge(X,Y) → path(X,Y)
edge(X,Z) ∧ path(Z,Y) → path(X,Y)

[Figure: a directed acyclic graph on nodes 1-6 with edges 1→2, 1→3, 3→6, 4→2, 4→6, 6→5.]

edge: {<1,2>,<1,3>,<3,6>,<4,2>,<4,6>,<6,5>}
path: {<1,2>,<1,3>,<1,6>,<1,5>,<3,6>,<3,5>,<4,2>,<4,6>,<4,5>,<6,5>}
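The two target clauses can be executed against the edge facts to check that they reproduce exactly these path tuples (a sketch; the depth bound stands in for Prolog's proof search):

edge = {(1, 2), (1, 3), (3, 6), (4, 2), (4, 6), (6, 5)}

def path(x, y, depth=6):
    """edge(X,Y) -> path(X,Y);  edge(X,Z) ∧ path(Z,Y) -> path(X,Y)."""
    if (x, y) in edge:
        return True
    return depth > 0 and any(path(z, y, depth - 1) for a, z in edge if a == x)

print(sorted((x, y) for x in range(1, 7) for y in range(1, 7) if path(x, y)))
# [(1,2), (1,3), (1,5), (1,6), (3,5), (3,6), (4,2), (4,5), (4,6), (6,5)]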


FOIL Negative Training Data

• Negative examples of the target predicate can be provided directly, or (as in the previous example) generated indirectly by making a closed world assumption.
– Every pair of constants <X,Y> not among the positive tuples for the path predicate.


Negative path tuples:{<1,1>,<1,4>,<2,1>,<2,2>,<2,3>,<2,4>,<2,5>,<2,6>, <3,1>,<3,2>,<3,3>,<3,4>,<4,1>,<4,3>,<4,4>,<5,1>, <5,2>,<5,3>,<5,4>,<5,5>,<5,6>,<6,1>,<6,2>,<6,3>, <6,4>,<6,6>}


Sample FOIL Induction


Pos: {<1,2>,<1,3>,<1,6>,<1,5>,<3,6>,<3,5>, <4,2>,<4,6>,<4,5>,<6,5>}

Start with the maximally general clause: path(X,Y) (empty body)
Possible literals to add:
edge(X,X), edge(Y,Y), edge(X,Y), edge(Y,X), edge(X,Z), edge(Y,Z), edge(Z,X), edge(Z,Y), path(X,X), path(Y,Y), path(X,Y), path(Y,X), path(X,Z), path(Y,Z), path(Z,X), path(Z,Y), X=Y, plus negations of all of these.

Neg: {<1,1>,<1,4>,<2,1>,<2,2>,<2,3>,<2,4>,<2,5>,<2,6>, <3,1>,<3,2>,<3,3>,<3,4>,<4,1>,<4,3>,<4,4>,<5,1>, <5,2>,<5,3>,<5,4>,<5,5>,<5,6>,<6,1>,<6,2>,<6,3>, <6,4>,<6,6>}



Sample FOIL Induction


Pos: {<1,2>,<1,3>,<1,6>,<1,5>,<3,6>,<3,5>, <4,2>,<4,6>,<4,5>,<6,5>}

Test: edge(X,X) → path(X,Y)
Covers 0 positive examples

Covers 6 negative examples

Not a good literal.


Sample FOIL Induction


Pos: {<1,2>,<1,3>,<1,6>,<1,5>,<3,6>,<3,5>, <4,2>,<4,6>,<4,5>,<6,5>}

Test: edge(X,Y) → path(X,Y)
• Covers 6 positive examples
• Covers 0 negative examples

Chosen as best literal. Result is base clause.

Neg: {<1,1>,<1,4>,<2,1>,<2,2>,<2,3>,<2,4>,<2,5>,<2,6>, <3,1>,<3,2>,<3,3>,<3,4>,<4,1>,<4,3>,<4,4>,<5,1>, <5,2>,<5,3>,<5,4>,<5,5>,<5,6>,<6,1>,<6,2>,<6,3>, <6,4>,<6,6>}


Sample FOIL Induction


Pos: {<1,6>,<1,5>,<3,5>, <4,5>}

Test: edge(X,Y) → path(X,Y)
Covers 6 positive examples

Covers 0 negative examples

Chosen as best literal. Result is base clause.

Neg: {<1,1>,<1,4>,<2,1>,<2,2>,<2,3>,<2,4>,<2,5>,<2,6>, <3,1>,<3,2>,<3,3>,<3,4>,<4,1>,<4,3>,<4,4>,<5,1>, <5,2>,<5,3>,<5,4>,<5,5>,<5,6>,<6,1>,<6,2>,<6,3>, <6,4>,<6,6>}

Remove covered positive tuples.


Sample FOIL Induction


Pos: {<1,6>,<1,5>,<3,5>, <4,5>}

Start a new clause: path(X,Y) (empty body)

Neg: {<1,1>,<1,4>,<2,1>,<2,2>,<2,3>,<2,4>,<2,5>,<2,6>, <3,1>,<3,2>,<3,3>,<3,4>,<4,1>,<4,3>,<4,4>,<5,1>, <5,2>,<5,3>,<5,4>,<5,5>,<5,6>,<6,1>,<6,2>,<6,3>, <6,4>,<6,6>}


Sample FOIL Induction


Pos: {<1,6>,<1,5>,<3,5>, <4,5>}

Test: edge(X,Y) → path(X,Y)

Neg: {<1,1>,<1,4>,<2,1>,<2,2>,<2,3>,<2,4>,<2,5>,<2,6>, <3,1>,<3,2>,<3,3>,<3,4>,<4,1>,<4,3>,<4,4>,<5,1>, <5,2>,<5,3>,<5,4>,<5,5>,<5,6>,<6,1>,<6,2>,<6,3>, <6,4>,<6,6>}

Covers 0 positive examples
Covers 0 negative examples

Not a good literal.


Sample FOIL Induction


Pos: {<1,6>,<1,5>,<3,5>, <4,5>}

Test: edge(X,Z) → path(X,Y)

Neg: {<1,1>,<1,4>,<2,1>,<2,2>,<2,3>,<2,4>,<2,5>,<2,6>, <3,1>,<3,2>,<3,3>,<3,4>,<4,1>,<4,3>,<4,4>,<5,1>, <5,2>,<5,3>,<5,4>,<5,5>,<5,6>,<6,1>,<6,2>,<6,3>, <6,4>,<6,6>}

Covers all 4 positive examples
Covers 14 of 26 negative examples

Eventually chosen as best possible literal


Sample FOIL Induction


Pos: {<1,6>,<1,5>,<3,5>, <4,5>}

Test: edge(X,Z) → path(X,Y)

Neg: {<1,1>,<1,4>, <3,1>,<3,2>,<3,3>,<3,4>,<4,1>,<4,3>,<4,4>, <6,1>,<6,2>,<6,3>, <6,4>,<6,6>}

Covers all 4 positive examples
Covers 14 of 26 negative examples

Eventually chosen as best possible literal

Some negatives are still covered; remove the negative tuples that are no longer covered.


Sample FOIL Induction


Pos: {<1,6,2>,<1,6,3>,<1,5>,<3,5>, <4,5>}

Test: edge(X,Z) → path(X,Y)

Neg: {<1,1>,<1,4>, <3,1>,<3,2>,<3,3>,<3,4>,<4,1>,<4,3>,<4,4>, <6,1>,<6,2>,<6,3>, <6,4>,<6,6>}

Covers all 4 positive examples
Covers 14 of 26 negative examples

Eventually chosen as best possible literal

Some negatives are still covered; remove the negative tuples that are no longer covered. Expand the tuples to account for the possible values of Z.


Sample FOIL Induction


Pos: {<1,6,2>,<1,6,3>,<1,5,2>,<1,5,3>,<3,5,6>, <4,5,2>,<4,5,6>}

Test: edge(X,Z) → path(X,Y)

Neg: {<1,1>,<1,4>, <3,1>,<3,2>,<3,3>,<3,4>,<4,1>,<4,3>,<4,4>, <6,1>,<6,2>,<6,3>, <6,4>,<6,6>}

Covers all 4 positive examples
Covers 14 of 26 negative examples

Eventually chosen as best possible literal

Some negatives are still covered; remove the negative tuples that are no longer covered. Expand the tuples to account for the possible values of Z.


Sample FOIL Induction


Pos: {<1,6,2>,<1,6,3>,<1,5,2>,<1,5,3>,<3,5,6>, <4,5,2>,<4,5,6>}

Continue specializing the clause: edge(X,Z) → path(X,Y)

Neg: {<1,1,2>,<1,1,3>,<1,4,2>,<1,4,3>,<3,1,6>,<3,2,6>, <3,3,6>,<3,4,6>,<4,1,2>,<4,1,6>,<4,3,2>,<4,3,6> <4,4,2>,<4,4,6>,<6,1,5>,<6,2,5>,<6,3,5>, <6,4,5>,<6,6,5>}


Sample FOIL Induction


Pos: {<1,6,2>,<1,6,3>,<1,5,2>,<1,5,3>,<3,5,6>, <4,5,2>,<4,5,6>}

Test: edge(X,Z) ∧ edge(Z,Y) → path(X,Y)

Neg: {<1,1,2>,<1,1,3>,<1,4,2>,<1,4,3>,<3,1,6>,<3,2,6>, <3,3,6>,<3,4,6>,<4,1,2>,<4,1,6>,<4,3,2>,<4,3,6> <4,4,2>,<4,4,6>,<6,1,5>,<6,2,5>,<6,3,5>, <6,4,5>,<6,6,5>}

Covers 3 positive examples
Covers 0 negative examples


Sample FOIL Induction


Pos: {<1,6,2>,<1,6,3>,<1,5,2>,<1,5,3>,<3,5,6>, <4,5,2>,<4,5,6>}

Test: edge(X,Z) ∧ path(Z,Y) → path(X,Y)

Neg: {<1,1,2>,<1,1,3>,<1,4,2>,<1,4,3>,<3,1,6>,<3,2,6>, <3,3,6>,<3,4,6>,<4,1,2>,<4,1,6>,<4,3,2>,<4,3,6> <4,4,2>,<4,4,6>,<6,1,5>,<6,2,5>,<6,3,5>, <6,4,5>,<6,6,5>}

Covers 4 positive examples
Covers 0 negative examples

Eventually chosen as best literal; completes clause.

Definition complete, since all original <X,Y> tuples are covered (by way of covering some <X,Y,Z> tuple.)


Recursion Limitation

• Must not build a clause that results in an infinite regress:
– path(X,Y) → path(X,Y)
– path(X,Y) → path(Y,X)

• To guarantee termination of the learned clause, at least one argument must be "reduced" according to some well-founded partial ordering.

• A binary predicate R is a well-founded partial ordering if its transitive closure does not contain R(a,a) for any constant a.
– rest(A,B)
– edge(A,B) for an acyclic graph


Ensuring Termination in FOIL

• First empirically determines all binary-predicates in the background that form a well-founded partial ordering by computing their transitive closures.

• Only allows recursive calls in which one of the arguments is reduced according to a known well-founded partial ordering.

• edge(X,Z) ∧ path(Z,Y) → path(X,Y): X is reduced to Z by edge, so this recursive call is OK.

• May prevent legal recursive calls that terminate for some other more-complex reason.

• Due to halting problem, cannot determine if an arbitrary recursive definition is guaranteed to halt.
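A sketch of the empirical test described above (our own code): compute each background predicate's transitive closure and accept it as a well-founded ordering only if no R(a,a) appears.

def transitive_closure(pairs):
    """Closure of a binary relation given as a set of pairs (naive fixed point)."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

def is_well_founded_ordering(pairs):
    """FOIL's empirical test: no R(a, a) may appear in the transitive closure."""
    return all(a != b for (a, b) in transitive_closure(pairs))

edge = {(1, 2), (1, 3), (3, 6), (4, 2), (4, 6), (6, 5)}   # acyclic: usable for recursion
print(is_well_founded_ordering(edge))                       # True
print(is_well_founded_ordering(edge | {(5, 1)}))            # False: introduces a cycle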


Some Applications

• Classifying chemical compounds as mutagenic (cancer causing) based on their graphical molecular structure and chemical background knowledge.

• Classifying web documents based on both the content of the page and its links to and from other pages with particular content.
– A web page is a university faculty home page if:
• It contains the words "Professor" and "University", and
• It is pointed to by a page with the word "faculty", and
• It points to a page with the words "course" and "exam"


FOIL Limitations

• Search space of literals (branching factor) can become intractable.
– Use aspects of bottom-up search to limit search.

• Requires large extensional background definitions.
– Use intensional background via Prolog inference.

• Hill-climbing search gets stuck at local optima and may not even find a consistent clause.
– Use limited backtracking (beam search)
– Include determinate literals with zero gain.
– Use relational pathfinding or relational clichés.

• Requires complete examples to learn recursive definitions.
– Use intensional interpretation of learned recursive clauses.


FOIL Limitations (cont.)

• Requires a large set of closed-world negatives.
– Exploit "output completeness" to provide "implicit" negatives.
• past-tense([s,i,n,g], [s,a,n,g])

• Inability to handle logical functions.
– Use bottom-up methods that handle functions.

• Background predicates must be sufficient to construct the definition, e.g. cannot learn reverse unless given append.
– Predicate invention:
• Learn reverse by inventing append
• Learn sort by inventing insert


Rule Learning and ILP Summary

• There are effective methods for learning symbolic rules from data using greedy sequential covering and top-down or bottom-up search.

• These methods have been extended to first-order logic to learn relational rules and recursive Prolog programs.

• Knowledge represented by rules is generally more interpretable by people, allowing human insight into what is learned and possible human approval and correction of learned knowledge.


Literals and clauses


IF triangle(T1) ∧ in(T1,T2) ∧ triangle(T2) THEN Class = yes
ELSIF triangle(T1) ∧ in(T1,T2) THEN Class = no
ELSIF triangle(T1) THEN Class = no
ELSIF circle(C) THEN Class = no
ELSE Class = yes