28
Rule Generalization Strategies in Incremental Learning of Disjunctive Concepts Stefano Ferilli, Andrea Pazienza, Floriana Esposito [email protected] Dipartimento di Informatica Centro Interdipartimentale per la Logica e sue Applicazioni Università di Bari 9th International Web Rule Symposium (RuleML) August 2-5, 2015 – Berlin, Germany

RuleML2015: Rule Generalization Strategies in Incremental Learning of Disjunctive Concepts

  • Upload
    ruleml

  • View
    21

  • Download
    4

Embed Size (px)

Citation preview

Rule Generalization Strategiesin Incremental Learningof Disjunctive Concepts

Stefano Ferilli, Andrea Pazienza, Floriana [email protected]

Dipartimento di InformaticaCentro Interdipartimentale per la Logica e sue Applicazioni

Università di Bari

9th International Web Rule Symposium (RuleML)August 2-5, 2015 – Berlin, Germany

Overview

● Introduction & Motivation● Generalization Algorithm● Selection Strategies● Evaluation● Conclusions & Future Work

Introduction

● Symbolic knowledge representations● Mandatory for applications that

– Reproduce the human inferential behavior– May be required to explain their decisions in human-

understandable terms● 2 kinds of concept definitions

– Conjunctive: a single definition accounts for all instances of the concept

– Disjunctive: several alternate conjunctive definitions (components)

● Each covers part of the full concept extension● Psychological studies have established that capturing and

dealing with the latter is much harder for humans● Pervasive and fundamental in most real-world domains

Introduction

● Knowledge acquisition bottleneck● (Symbolic) Machine Learning (ML) systems

– Supervised setting: concept definitions inferred from descriptions of valid (positive) or invalid (negative) instances (examples)

● Batch: whole set of examples available (classical setting in ML)– Components learned by progressive coverage strategies– Definitions are immutable

● If additional examples are provided, learning must start from scratch considering the extended set of examples

● Incremental: new examples may be provided after a tentative definition is already available

– Components emerge as long as they are found– Definitions may be changed/refined if wrong

Motivations

● Incremental approach– If the available concept definition cannot properly account

for a new example, it must be refined (revised, changed, modified) so that the new version properly accounts for both the old and the new examples

– Progressive covering strategy not applicable● Issue of disjunctive definitions becomes particularly relevant

– When many components can be refined, there is no unique way for determining which one is most profitably refined

– Refining different components results in different updated definitions, that become implicit constraints on how the definition itself may evolve when additional examples will become available in the future

● The learned model depends– on the order in which the examples are provided– on the choice of the component to be refined at each step

Motivations

● Abstract Diagnosis: Identification of the part of the theory that misclassifies the example– Tricky in the case of disjunctive concepts

● Several alternate conjunctive definitions are available– If a positive example is not covered, no component of the

current definition accounts for it (omission error)● All candidate to generalization

– Generalizing all components, each component would account alone for all positive examples

● Contradiction: the concept is conjunctive● Over-generalization: the theory would be more prone to

covering forthcoming negative examples– Problem not present in batch learning

Motivations

● Problem– A single component of a disjunctive concept is to be

generalized● Guided solutions may improve effectiveness and efficiency of the

overall outcome compared to a random strategy

● Objective– Propose and evaluate different strategies for determining

the order in which the components should be considered for generalization

● If the generalization of a component fails, generalization of the next component in the ordering is attempted

Motivations

● Questions:– what sensible strategies can be defined?– what are their expected pros and cons?– what is their effect on the quality of the theory?– what about their consequences on the effectiveness and

efficiency of the learning process?● Starting point

– InTheLEx (Incremental Theory Learner from Examples)● fully incremental● can learn disjunctive concept definitions● refinement strategy can be tuned to suitably adapt its behavior

InTheLEx

● Learns hierarchical theories from positive and negative examples

● Fully incremental– May start from an empty theory and from the first

available example● Necessary in most real-world application domains

● DatalogOI – Concept definitions and examples expressed as rules

● Example: example :- observation● Rule: concept :- conjunctive definition

– Disjunctive concepts: several rules for the same concept

InTheLEx: Representation

– Theory that defines a disjunctive concept (4 components)● ball(A) :- weight_medium(A), air_filled(A), has_patches(A),

horizontal_diameter(A,B), vertical_diameter(A,C), equal(B,C). %generic● ball(A) :- weight_medium(A), air_filled(A), has_patches(A,B),

horizontal_diameter(A,B), vertical_diameter(A,C), larger(B,C). %rugby● ball(A) :- weight_heavy(A), has_holes(A), horizontal_diameter(A,B),

vertical_diameter(A,C), equal(B,C). %bowling● ball(A) :- weight_light(A), regular_shape(A), horizontal_diameter(A,B),

vertical_diameter(A,C), equal(B,C). %small

– Examples:● ball(e) :- weight_medium(e), has_patches(e), air_filled(e),

made_of_leather(e), horizontal_diameter(e,he), vertical_diameter(e,ve), equal(he,ve). %soccer Etrusco FIFA World Cup 1990

● neg(ball(s1)) :- weight_light(s1), made_of_snow(s1), irregular_shape(s1), horizontal_diameter(s1,hs1), vertical_diameter(s1,vs1), smaller(hs1,vs1). %snowball

● neg(ball(s2)) :- weight_light(s2), made_of_paper(s2), horizontal_diameter(s2,hs2), vertical_diameter(s2,vs2), larger(hs2,vs2). %spitball

InTheLEx: Theory Revision

● Logic theory revision process– Given a new example

● No effect on the theory if– negative and not covered

● not predicted by the theory to belong to the concept– positive and covered

● predicted by the theory to belong to the concept● In all the other cases, the theory needs to be revised

– Positive example not covered generalization of the theory– Negative example covered specialization of the theory

– Refinements (generalizations or specializations) must preserve correctness with respect to the entire set of currently available examples

● If no candidate refinement fulfills this requirement, the specific problematic example is stored as an exception

InTheLEx: Generalization

– Procedure Generalize(E: positive example, T: theory, M: negative examples);

● L := list of the rules in the definition of E's conceptwhile not generalized and L do

– Select from L a rule C for generalization– L' := generalize(C,E) (* list of generalizations *)– while not generalized and L' do

● Select next best generalization C' from L'● if (T \ {C} {C'} is consistent wrt M then

● Implement C' in T– Remove C from L

● if not generalized then– C' := E with constants turned into variables– if (T \ {C} {C'} is consistent wrt M then

● Implement C' in T– else

● Implement E in T as an exception

InTheLEx: Generalization

– Comments● Due to theoretical and implementation details, the generalization

operator used might return several incomparable generalizations● Implementation of the theoretical generalization operator would

be computationally infeasible, even for relatively small rules– Similarity-based approximation

● Experiments have shown that it comes very close, and often catches, least general generalizations

● First example of a new concept– First rule rule added for that concept

● Initial tentative conjunctive definition of the concept● Conjunctive definition turns out to be insufficient

– Second rule added for that concept● The concept becomes disjunctive

● Subsequent addition of rules for that concept– Extend the ‘disjunctiveness’ of the concept

Clause Selection Strategy

● 5 strategies for determining the order in which the components of a disjunctive concept definition are to be considered for generalization– Each component in the ordering considered only after

generalization attempts have failed on all previous components

● Initial components have more chances to be generalized

– No direct connection between age and length of a rule● Older rules might have had more chances of refinement● Whether this means that they are also shorter (i.e., more

general) mainly depends on– Ranking strategy– Specific examples that are encountered and on their order

● Not controllable in a real-world setting

Clause Selection Strategy

● O: Older elements first– Same order as they were added to the theory

● The most straightforward. A sort of baseline● Static: position of each component in the processing order fixed

when component is added– Generalizations are monotonic (progressively remove

constraints)● Strategy expected to yield very refined (short, more

human-readable and understandable) initial components and very raw (long) final ones

– After several refinements, it is likely that initial components have reached a nearly final and quite stable form

● All attempts to further refine them are likely to fail● The computational time spent in these attempts, albeit

presumably not huge, will be wasted● Runtime expected to grow as long as the life of the

theory proceeds

Clause Selection Strategy

● N: Newer elements first– Reverse order as they were added to the theory

● Also quite straightforward● Static, but not obvious to foresee what will be the shape and

evolution of the components– Immediately after the addition of a new component, it will

undergo a generalization attempt at the first non-covered positive example

● ‘average’ level of generality in the definition expected to be less than in the previous option

● There should be no completely raw components in the definition

● Many chances that such an attempt is successful for any example, but the resulting generalization might leverage features that might be not very related to the correct concept definition

Clause Selection Strategy

● L: Longer elements first– Decreasing number of conjuncts

● Specifically considers the level of refinement– Low variance in degree of refinement among components

expected● Can be considered as an evolution of N

– Not just the most recently added component is favored for generalization

● The more conjuncts in a component, the more specialized the component

– More room for generalization● Avoid waste of time trying to generalize very refined rules

that would hardly yield consistent generalizations● On the other hand, generalizing a longer rule is expected

to take more time than generalizing shorter ones

Clause Selection Strategy

● S: Shorter elements first– Increasing number of conjuncts

● Specifically considers the level of refinement– Opposite behavior than L

● May confirm the possible advantage of spending time in trying harder but more promising generalizations versus spending time in trying easier but less promising ones first

● Can be considered as an evolution of O– Tries to generalize first more refined components

● Largest variance in degree of refinement (number of conjuncts in rule premises) among components expected

Clause Selection Strategy

● ~: More similar elements first– Decreasing similarity with the uncovered example

● Only content-based strategy– Same similarity as in InTheLEx’s generalization operator

● Disjunctive components different actualizations of a concept– Small intersection expected between the sets of examples

covered by different components● Similarity assessment may allow to identify the appropriate

component to be generalized for a given example– Odd generalizations for mismatched component-example

● Coverage and generalization problems● Bad theory, inefficient refinements

– One might expect that over-generalization is avoided– Generalization more easily computed, but overhead to

compute the similarity● Does the improvement compensate the overhead?

Evaluation

● Real-world dataset: Scientific papers– 353 layout descriptions of first pages– 4 classes: Elsevier journals, SVLN, JMLR, MLJ

● Classification: learn definitions for these classes● Understanding: the significant components in the papers

– Title, Author, Abstract, Keywords● First-Order Logic representation needed to express spatial

relationships among the page components

● Complex dataset– Some layout styles are quite similar– 67920 atoms in observations

● avg per observation >192 atoms, some >400 atoms

– Much indeterminacy● Membership relation of layout components to pages

Evaluation: Parameters

● Qualitative– # comp in the disjunctive concept definition

● Less components yield a more compact theory, which does not necessarily provide for greater accuracy

– avg length of components● More conjuncts, more specific -and less refined- concept

– # exc negative exceptions● More exceptions, worse theory

● Quantitative– acc (accuracy)

● Prediction capabilities of the theory on test examples

– time needed to carry out the learning task● Efficiency (computational cost) for the different ranking strategies

Evaluation

● Incremental approach justified– New instances of documents continuously available in

time● Comparison of all strategies

– Random not tried● O or N are somehow random

– Just append definitions as long as they are generated, without any insight

● 10-fold cross-validation procedure– Classification task used to check in detail the behavior of

the different strategies– Understanding task used to assess the statistical

significance of the difference in performance between different strategies

Evaluation: Classification

● Useful results only for MLJ and SVLN– Elsevier and JMLR: always single-rule definitions

● Ranking approach not applicable– Examples for Elsevier and JMLR still played a role as

negative examples for the other classes– Some expectations confirmed

● Runtime: N always best (often the first generalization attempt succeeds); ~ worst (need for computing the similarity of each component and the example)

● Number of exceptions: ~ always best (improved selection of components for generalization); S also good

● Aggregated indicator: N always wins– Sum of ranking positions for the different parameters (the

smaller, the better)

Evaluation: Classification

Evaluation: Classification

– Rest of behavior somehow mixed● SVLN: much less variance than in MLJ

– Definition quite clear and examples quite significant?● MLJ: Impact of the ranking strategies much clearer

– Substantial agreement between quality-related indicators: for each approach, either all tend to be good (as in ~) or all tend to be bad (as in S and O)

● Interesting indications from single folds– 3: peak in runtime and number of exceptions for S and O

● runtime a consequence of the unsuccessful search for specializations, that in turn may have some connection with the quality of the theory

● S is a kind of evolution of O– 8: accuracy increases from 82% to 94% for ~ and L

● Content-based approach improves quality of the theory in difficult situations

Evaluation: Understanding

● Strategy on the row equal (=), better (+) or worse (–) than the one on the column

● Best strategy in bold

Evaluation: Understanding

● Quite different, albeit run on the same data– When the difference is significant, in general analogous

behavior of L and ~ for accuracy (which is greater), number of components and number of negative exceptions. As expected, longer runtime for ~

– Also O and S have in general an analogous behavior, but do not reach as good results as L and ~

– N not outstanding for performance, but better than O (the baseline) for all parameters except runtime

– On the classification task, average length of components using L significantly larger than all the others

● Returns more balanced components

but more accuracy and less negative exceptions

Conclusions & Future Work

● Disjunctive concept definitions tricky– Each component covers a subset of positive examples,

ensuring consistency with all negative examples● In incremental learning, when a new positive example is not

recognized by the current theory, one component must be generalized

– Omission error (no specific component responsible)● The system must decide the order in which the elements

are to be considered for trying a generalization– 5 strategies proposed

● The outcomes confirm some of the expectations for the various strategies, but we need

– More extensive experimentation to have confirmations and additional details

– Identification of further strategies and refinement of the proposed ones