EMNLP’02, 11/11/2002
ML: Classical methods from AI: Decision-Tree induction, Exemplar-based Learning, Rule Induction, TBEDL


Page 1

• ML: Classical methods from AI
  – Decision-Tree induction
  – Exemplar-based Learning
  – Rule Induction
  – TBEDL

Page 2

Decision Trees

• Decision trees are a way to represent rules underlying training data, with hierarchical sequential structures that recursively partition the data.

• They have been used by many research communities (Pattern Recognition, Statistics, ML, etc.) for data exploration with some of the following purposes: Description, Classification, and Generalization.

• From a machine-learning perspective: decision trees are n-ary branching trees that represent classification rules for classifying the objects of a certain domain into a set of mutually exclusive classes.

• Acquisition: Top-Down Induction of Decision Trees (TDIDT)

• Systems: CART (Breiman et al. 84), ID3, C4.5, C5.0 (Quinlan 86,93,98), ASSISTANT, ASSISTANT-R (Cestnik et al. 87; Kononenko et al. 95)


Page 3

An Example

[Figure: a schematic n-ary decision tree with attribute nodes A1, A2, A3, A5, branch values v1 to v7, and leaf classes C1, C2, C3, alongside a concrete decision tree over the attributes SIZE (small/big), SHAPE (circle/triangle) and COLOR (red/blue) whose leaves are labelled pos/neg.]
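Read as rules, the concrete tree in the figure amounts to a nested test on the three attributes. A minimal Python sketch of one plausible reading (the exact branching order in the original figure may differ) is:

    def classify(size, shape, color):
        # Illustrative only: one plausible reconstruction of the SIZE/SHAPE/COLOR
        # example tree from the slide; the original branching may differ.
        if size == "big":
            return "pos"
        # size == "small": look at the shape next
        if shape == "triangle":
            return "neg"
        # shape == "circle": the colour decides
        return "pos" if color == "red" else "neg"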

Page 4

Learning Decision Trees

Training:   Training Set + TDIDT = DT
Test:       Example + DT = Class

Page 5

General Induction Algorithm

function TDIDT (X: set-of-examples; A: set-of-features)
var: tree1, tree2: decision-tree;
     X': set-of-examples; A': set-of-features
end-var
if (stopping_criterion (X)) then
   tree1 := create_leaf_tree (X)
else
   amax := feature_selection (X, A);
   tree1 := create_tree (X, amax);
   for-all val in values (amax) do
      X' := select_examples (X, amax, val);
      A' := A \ {amax};
      tree2 := TDIDT (X', A');
      tree1 := add_branch (tree1, tree2, val)
   end-for
end-if
return (tree1)
end-function

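For concreteness, the pseudocode above corresponds roughly to the following runnable Python sketch (a minimal illustration assuming categorical features, information-gain feature selection and majority-class leaves; it is not the exact implementation of the systems cited earlier):

    from collections import Counter
    import math

    def entropy(examples):
        # examples: list of (feature_dict, class_label) pairs
        counts = Counter(label for _, label in examples)
        total = len(examples)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(examples, feature):
        total = len(examples)
        remainder = 0.0
        for value in {x[feature] for x, _ in examples}:
            subset = [(x, y) for x, y in examples if x[feature] == value]
            remainder += len(subset) / total * entropy(subset)
        return entropy(examples) - remainder

    def tdidt(examples, features):
        # features: set of feature names still available for splitting
        if not examples:
            return None
        labels = [y for _, y in examples]
        # stopping criterion: pure node or no features left -> majority-class leaf
        if len(set(labels)) == 1 or not features:
            return Counter(labels).most_common(1)[0][0]
        amax = max(features, key=lambda a: information_gain(examples, a))
        tree = {"feature": amax, "branches": {}}
        for value in {x[amax] for x, _ in examples}:
            subset = [(x, y) for x, y in examples if x[amax] == value]
            tree["branches"][value] = tdidt(subset, features - {amax})
        return tree

Calling tdidt(training_set, set(feature_names)) returns a nested dictionary that mirrors the n-ary tree built by the pseudocode above.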


Page 7

Feature Selection Criteria

• Functions derived from Information Theory:
  – Information Gain, Gain Ratio (Quinlan 86)
• Functions derived from Distance Measures:
  – Gini Diversity Index (Breiman et al. 84)
  – RLM (López de Mántaras 91)
• Statistically based:
  – Chi-square test (Sestito & Dillon 94)
  – Symmetrical Tau (Zhou & Dillon 91)
• RELIEFF-IG: a variant of RELIEFF (Kononenko 94)


Page 8

Information Gain (Quinlan 79)

[The slide shows the Information Gain formula; the equation itself was lost in extraction.]
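Although the slide's equation was lost, the standard definition of information gain for a feature a over an example set X, which is what this criterion refers to, is:

    H(X) = -\sum_{c \in C} P(c) \log_2 P(c)

    \mathrm{Gain}(X, a) = H(X) - \sum_{v \in \mathrm{values}(a)} \frac{|X_v|}{|X|} \, H(X_v)

where C is the set of classes and X_v is the subset of X whose value for feature a is v.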

Page 9

Information Gain (2) (Quinlan 79)

[The slide continues the Information Gain formulation; the equations were lost in extraction.]

Page 10

Gain Ratio (Quinlan 86)

[The slide shows the Gain Ratio formula; the equation itself was lost in extraction.]
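The standard definition, which is presumably what the slide showed, normalises the gain by the split information of the feature:

    \mathrm{SplitInfo}(X, a) = -\sum_{v \in \mathrm{values}(a)} \frac{|X_v|}{|X|} \log_2 \frac{|X_v|}{|X|}

    \mathrm{GainRatio}(X, a) = \frac{\mathrm{Gain}(X, a)}{\mathrm{SplitInfo}(X, a)}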

Page 11

RELIEF (Kira & Rendell, 1992)

[The slide presents the RELIEF attribute-weighting algorithm; the details were lost in extraction.]
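As a reminder of the method the slide refers to: RELIEF estimates a relevance weight for each attribute by repeatedly sampling an example and comparing it with its nearest neighbour of the same class (nearest hit) and of a different class (nearest miss). A minimal sketch for categorical attributes (illustrative, not the original formulation in full detail):

    import random

    def diff(a, x1, x2):
        # 0/1 difference for categorical attributes (numeric attributes would use
        # normalised absolute differences in the full algorithm)
        return 0.0 if x1[a] == x2[a] else 1.0

    def distance(x1, x2, attributes):
        return sum(diff(a, x1, x2) for a in attributes)

    def relief(examples, attributes, m):
        # examples: list of (features_dict, class_label); m: number of sampled instances
        w = {a: 0.0 for a in attributes}
        for _ in range(m):
            x, y = random.choice(examples)
            same = [(x2, y2) for x2, y2 in examples if y2 == y and x2 is not x]
            other = [(x2, y2) for x2, y2 in examples if y2 != y]
            hit = min(same, key=lambda e: distance(x, e[0], attributes))[0]
            miss = min(other, key=lambda e: distance(x, e[0], attributes))[0]
            for a in attributes:
                w[a] += (diff(a, x, miss) - diff(a, x, hit)) / m
        return w

RELIEFF (next slide) extends this scheme to k nearest hits/misses, multiclass problems and noisy data; RELIEFF-IG (two slides ahead) additionally weights the distance by Information Gain.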

Page 12

RELIEFF (Kononenko, 1994)

[The slide presents RELIEFF, the extension of RELIEF; the details were lost in extraction.]

Page 13

RELIEFF-IG (Màrquez, 1999)

• RELIEFF, except that the distance measure used for finding the nearest hits/misses does not treat all attributes equally: it weights each attribute according to its Information Gain.

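A sketch of that modified distance, reusing the diff helper from the RELIEF sketch above and assuming the per-attribute Information Gain values have been precomputed (hypothetical names, for illustration only):

    def ig_weighted_distance(x1, x2, attributes, ig_weight):
        # ig_weight: dict mapping each attribute to its Information Gain on the training set
        return sum(ig_weight[a] * diff(a, x1, x2) for a in attributes)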

Page 14

Extensions of DTs

• (Pre/post) pruning
• Minimizing the effect of the greedy approach: lookahead
• Non-linear splits
• Combination of multiple models
• etc.


(Murthy 95)

Page 15

Decision Trees and NLP

• Speech processing (Bahl et al. 89; Bakiri & Dietterich 99)

• POS Tagging (Cardie 93, Schmid 94b; Magerman 95; Màrquez & Rodríguez 95,97; Màrquez et al. 00)

• Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96)

• Parsing (Magerman 95,96; Haruno et al. 98,99)

• Text categorization (Lewis & Ringuette 94; Weiss et al. 99)

• Text summarization (Mani & Bloedorn 98)

• Dialogue act tagging (Samuel et al. 98)


Page 16

Decision Trees and NLP

• Noun phrase coreference (Aone & Bennett 95; McCarthy & Lehnert 95)

• Discourse analysis in information extraction (Soderland & Lehnert 94)

• Cue phrase identification in text and speech (Litman 94; Siegel & McKeown 94)

• Verb classification in Machine Translation (Tanaka 96; Siegel 97)

• More recent applications of DTs to NLP combine them in a boosting framework (covered in the following sessions)


Page 17

Example: POS Tagging using DTs

He was shot in the hand as he chased the robbers in the back street

[Figure: the ambiguous words in the sentence are highlighted with their candidate tags: NN/VB, JJ/VB, NN/VB.]

(The Wall Street Journal Corpus)

Page 18

POS Tagging using Decision Trees (Màrquez, PhD 1999)

[Figure: overall tagger architecture: raw text → morphological analysis → disambiguation algorithm, which consults a language model → tagged text.]

Page 19

POS Tagging using Decision Trees (Màrquez, PhD 1999)

[Figure: the same architecture, with decision trees shown as the core of the language model.]

Page 20

POS Tagging using Decision Trees (Màrquez, PhD 1999)

[Figure: the same architecture, with three taggers built on top of the tree-based language model: RTT, STT and RELAX.]

Page 21

DT-based Language Modelling

The "preposition-adverb" tree

[Figure: a decision tree for the IN/RB ambiguity. The root tests the word form with prior probabilities P(IN)=0.81, P(RB)=0.19; on the branch for "As"/"as" the next node (P(IN)=0.83, P(RB)=0.17) tests tag(+1); on the branch tag(+1)=RB the next node (P(IN)=0.13, P(RB)=0.87) tests tag(+2); the branch tag(+2)=IN reaches a leaf with P(IN)=0.013, P(RB)=0.987.]

Statistical interpretation (estimated probabilities):

P( RB | word="As"/"as" & tag(+1)=RB & tag(+2)=IN ) = 0.987
P( IN | word="As"/"as" & tag(+1)=RB & tag(+2)=IN ) = 0.013
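To make the statistical interpretation concrete, here is a toy Python encoding of such a probability tree and of how it is consulted at tagging time (an illustrative data structure, not the original implementation; the "default" children stand in for the "others" branches elided in the figure):

    prep_adv_tree = {
        "test": "word",
        "branches": {("As", "as"): {
            "test": "tag(+1)",
            "branches": {("RB",): {
                "test": "tag(+2)",
                "branches": {("IN",): {"leaf": {"IN": 0.013, "RB": 0.987}}},
                "default": {"leaf": {"IN": 0.13, "RB": 0.87}},
            }},
            "default": {"leaf": {"IN": 0.83, "RB": 0.17}},
        }},
        "default": {"leaf": {"IN": 0.81, "RB": 0.19}},
    }

    def tag_distribution(node, context):
        # context: e.g. {"word": "as", "tag(+1)": "RB", "tag(+2)": "IN"}
        while "leaf" not in node:
            value = context[node["test"]]
            for values, child in node["branches"].items():
                if value in values:
                    node = child
                    break
            else:
                node = node["default"]
        return node["leaf"]

    # tag_distribution(prep_adv_tree, {"word": "as", "tag(+1)": "RB", "tag(+2)": "IN"})
    # -> {"IN": 0.013, "RB": 0.987}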

Page 22

DT-based Language Modelling: the "preposition-adverb" tree

[Figure: the same preposition-adverb tree, annotated with the collocations it captures.]

Collocations:
  "as_RB much_RB as_IN"
  "as_RB well_RB as_IN"
  "as_RB soon_RB as_IN"

Page 23

Language Modelling using DTs

• Algorithm: Top-Down Induction of Decision Trees (TDIDT). Supervised learning
  – CART (Breiman et al. 84), C4.5 (Quinlan 95), etc.

• Attributes: local context of (-3,+2) tokens (see the sketch after this list)

• Particular implementation, minimizing the effect of over-fitting, data fragmentation and sparseness:
  – Branch merging
  – CART post-pruning
  – Smoothing
  – Attributes with many values
  – Several functions for attribute selection

• Granularity? Ambiguity-class level
  – adjective-noun, adjective-noun-verb, etc.
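As referenced above, a minimal sketch of what a (-3,+2) local-context feature vector could look like for one ambiguous word (the attribute names and the choice of word forms vs. tags at each position are illustrative assumptions, not the exact feature set of the thesis):

    def context_features(tokens, tags, i):
        # tokens: the words of the sentence; tags: the (possibly partial) tag sequence;
        # i: position of the word being disambiguated.
        def tok(j):
            return tokens[j] if 0 <= j < len(tokens) else "<PAD>"
        def tag(j):
            return tags[j] if 0 <= j < len(tags) else "<PAD>"
        return {
            "word": tok(i),
            "tag(-3)": tag(i - 3), "tag(-2)": tag(i - 2), "tag(-1)": tag(i - 1),
            "word(+1)": tok(i + 1), "tag(+1)": tag(i + 1),
            "word(+2)": tok(i + 2), "tag(+2)": tag(i + 2),
        }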

Page 24

Model Evaluation

The Wall Street Journal (WSJ) annotated corpus:

• 1,170,000 words
• Tagset size: 45 tags
• Noise: 2-3% of mistagged words
• 49,000 word-form frequency lexicon
  – Manual filtering of the 200 most frequent entries
  – 36.4% ambiguous words
  – 2.44 (1.52) average tags per word
• 243 ambiguity classes

Page 25

Model Evaluation: The Wall Street Journal (WSJ) annotated corpus

Number of ambiguity classes that cover x% of the training corpus:

  Coverage             50%   60%   70%   80%   90%   95%   99%   100%
  # ambiguity classes    8    11    14    19    37    58   113    243

Arity of the classification problems:

  Arity                2-tags   3-tags   4-tags   5-tags   6-tags
  # ambiguity classes     103       90       35       12        3

Page 26

12 Ambiguity Classes

They cover 57.90% of the ambiguous occurrences!

Experimental setting: 10-fold cross validation

Page 27

N-fold Cross Validation

Divide the training set S into a partition of N equal-size disjoint subsets: s1, s2, ..., sN

for i := 1 to N do
   learn and test a classifier using:
      training_set := union of all sj with j different from i
      validation_set := si
end-for
return: the average accuracy over the N experiments

Which is a good value for N? (2, 10, ...)
Extreme case (N = training set size): leave-one-out

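A compact Python rendering of the same procedure (a sketch; train_and_evaluate is a hypothetical helper that trains a classifier on its first argument and returns its accuracy on the second):

    def n_fold_cross_validation(examples, n, train_and_evaluate):
        # Split the data into n (nearly) equal-size disjoint folds.
        folds = [examples[i::n] for i in range(n)]
        accuracies = []
        for i in range(n):
            validation = folds[i]
            training = [x for j, fold in enumerate(folds) if j != i for x in fold]
            accuracies.append(train_and_evaluate(training, validation))
        return sum(accuracies) / n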

Page 28

Size: Number of Nodes

[Figure: bar chart of the number of nodes in the tree base: Basic algorithm 22,095; Merging 10,674; Pruning 5,715.]

Average size reduction: 51.7% (merging), 46.5% (pruning), 74.1% (total)

Page 29

Accuracy

[Figure: error-rate chart (% error rate) comparing Lower Bound, Basic Algorithm, Merging and Pruning; the plotted values include 28.83%, 8.49%, 8.36% and 8.3%.]

(At least) no loss in accuracy

Page 30

Feature Selection Criteria

[Figure: average error rate (%) for the different feature-selection functions; most criteria score between roughly 8.2% and 8.7%, with the worst options around 8.9%, 11.63% and 17.24%.]

Statistically equivalent

Page 31

DT-based POS Taggers

Tree Base = Statistical Component
  – RTT: Reductionistic Tree-based tagger (Màrquez & Rodríguez 97)
  – STT: Statistical Tree-based tagger (Màrquez & Rodríguez 99)

Tree Base = Compatibility Constraints
  – RELAX: Relaxation-Labelling based tagger (Màrquez & Padró 97)

Page 32

RTT (Màrquez & Rodríguez 97)

[Figure: RTT architecture: raw text → morphological analysis → iterative disambiguation (classify → update → filter, consulting the tree-based language model) until a stopping condition is met → tagged text.]

Page 33

STT (Màrquez & Rodríguez 99)

[Figure: the slide introduces the statistical tagging model in terms of N-grams (trigrams).]

Page 34

STT (Màrquez & Rodríguez 99)

Contextual probabilities: the true contextual probabilities P(t_k | C_k) are approximated by estimates P~(t_k | C_k) read off the decision tree T_AC_k(t_k; C_k) of the corresponding ambiguity class, i.e. the contextual probabilities are estimated using decision trees.

Page 35

STT (Màrquez & Rodríguez 99)

[Figure: STT architecture: raw text → morphological analysis → disambiguation with the Viterbi algorithm, using a language model of lexical probabilities plus tree-based contextual probabilities → tagged text.]
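Since the slide only names the Viterbi algorithm, here is a minimal generic sketch of Viterbi decoding that combines lexical and contextual scores (a standard textbook formulation, not the exact STT implementation; for brevity the contextual estimate here conditions only on the previous tag, whereas STT's tree-based estimates use a richer context):

    import math

    def viterbi(words, tagset, lexical_prob, contextual_prob):
        # lexical_prob(word, tag): lexical probability estimate
        # contextual_prob(tag, prev_tag): contextual probability estimate
        # Returns the most likely tag sequence under the product of the two scores.
        best = [{t: math.log(lexical_prob(words[0], t)) +
                    math.log(contextual_prob(t, "<START>")) for t in tagset}]
        back = [{t: None for t in tagset}]
        for i in range(1, len(words)):
            best.append({})
            back.append({})
            for t in tagset:
                prev = max(tagset,
                           key=lambda p: best[i - 1][p] + math.log(contextual_prob(t, p)))
                best[i][t] = (best[i - 1][prev] + math.log(contextual_prob(t, prev)) +
                              math.log(lexical_prob(words[i], t)))
                back[i][t] = prev
        # Recover the best path by backtracing from the best final tag.
        last = max(tagset, key=lambda t: best[-1][t])
        path = [last]
        for i in range(len(words) - 1, 0, -1):
            path.append(back[i][path[-1]])
        return list(reversed(path))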

Page 36

STT+ (Màrquez & Rodríguez 99)

[Figure: STT+ architecture: the same Viterbi-based disambiguation, with a language model that combines lexical probabilities, N-grams and tree-based contextual probabilities.]

Page 37

DT-based POS Taggers (recap)

Tree Base = Statistical Component
  – RTT: Reductionistic Tree-based tagger (Màrquez & Rodríguez 97)
  – STT: Statistical Tree-based tagger (Màrquez & Rodríguez 99)

Tree Base = Compatibility Constraints
  – RELAX: Relaxation-Labelling based tagger (Màrquez & Padró 97)

Page 38

RELAX (Màrquez & Padró 97)

[Figure: RELAX architecture: raw text → morphological analysis → relaxation-labelling disambiguation (Padró 96), driven by a language model expressed as a set of constraints: tree-derived constraints + N-grams + linguistic rules → tagged text.]

Page 39

RELAX (Màrquez & Padró 97)

Translating Trees into Constraints

[Figure: the "preposition-adverb" tree again; each root-to-leaf path is translated into a constraint.]

Compatibility values: estimated using Mutual Information

Negative constraint:  -5.81 (IN) (0 "as" "As") (1 RB) (2 IN)
Positive constraint:   2.37 (RB) (0 "as" "As") (1 RB) (2 IN)

Page 40

Experimental Evaluation

Using the WSJ annotated corpus:

• Training set: 1,121,776 words
• Test set: 51,990 words
• Closed vocabulary assumption
• Base of 194 trees
  – Covering 99.5% of the ambiguous occurrences
  – Storage requirement: 565 Kb
  – Acquisition time: 12 CPU-hours (Common LISP / Sparc10 workstation)

Page 41

Experimental Evaluation: RTT results

• 67.52% error reduction with respect to MFT
• Accuracy = 94.45% (ambiguous), 97.29% (overall)
• Comparable to the best state-of-the-art automatic POS taggers
• Recall = 98.22%, Precision = 95.73% (1.08 tags/word)

+ RTT allows a trade-off to be set between precision and recall

Page 42

Experimental Evaluation

STT results
• Comparable to those of RTT
+ STT allows the incorporation of N-gram information, so some problems of sparseness and of coherence of the resulting tag sequence can be alleviated

STT+ results
• Better than those of RTT and STT

Page 43

Experimental Evaluation: Including trees into RELAX

• Translation of 44 representative trees covering 84% of the examples = 8,473 constraints
• Addition of:
  – bigrams (2,808 binary constraints)
  – trigrams (52,161 ternary constraints)
  – linguistically motivated manual constraints (20)

Page 44

Accuracy of RELAX

            MFT      B      T     BT      C     BC     TC    BTC
  Ambig.  85.31  91.35  91.82  91.92  91.96  92.72  92.82  92.55
  Overall 94.66  96.86  97.03  97.06  97.08  97.36  97.39  97.29

  MFT = baseline, B = bigrams, T = trigrams, C = "tree constraints"

              H     BH     TH    BTH     CH    BCH    TCH   BTCH
  Ambig.  86.41  91.88  92.04  92.32  91.97  92.76  92.98  92.71
  Overall 95.06  97.05  97.11  97.21  97.08  97.37  97.45  97.35

  H = set of 20 hand-written linguistic rules

Page 45

Decision Trees: Summary

• Advantages
  – Acquire symbolic knowledge in an understandable way
  – Very well-studied ML algorithms and variants
  – Can be easily translated into rules
  – Off-the-shelf software is available: C4.5, C5.0, etc.
  – Can be easily integrated into an ensemble

Page 46

Decision Trees: Summary

• Drawbacks
  – Computationally expensive when scaling to large natural language domains: training examples, features, etc.
  – Data sparseness and data fragmentation: the problem of small disjuncts => careful probability estimation is needed
  – DTs are a high-variance (unstable) model
  – Tendency to overfit the training data: pruning is necessary
  – Considerable effort is required to tune the model