EMNLP’02, 11/11/2002
ML: Classical methods from AI: Decision-Tree induction, Exemplar-based Learning, Rule Induction, TBEDL


Page 1

• ML: Classical methods from AI
  – Decision-Tree induction
  – Exemplar-based Learning
  – Rule Induction
  – TBEDL

Page 2

Decision Trees

• Decision trees are a way to represent rules underlying training data, with hierarchical sequential structures that recursively partition the data.

• They have been used by many research communities (Pattern Recognition, Statistics, ML, etc.) for data exploration with some of the following purposes: Description, Classification, and Generalization.

• From a machine-learning perspective: decision trees are n-ary branching trees that represent classification rules for classifying the objects of a certain domain into a set of mutually exclusive classes.

• Acquisition: Top-Down Induction of Decision Trees (TDIDT)

• Systems: CART (Breiman et al. 84), ID3, C4.5, C5.0 (Quinlan 86,93,98), ASSISTANT, ASSISTANT-R (Cestnik et al. 87; Kononenko et al. 95)


Page 3

An Example

[Figure: a schematic n-ary decision tree with attribute nodes A1, A2, A3, A5, branch values v1 to v7, and leaf classes C1, C2, C3, alongside a concrete decision tree over the attributes SIZE (small/big), SHAPE (circle/triangle) and COLOR (red/blue) whose leaves are labelled pos/neg.]
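Read as rules, the concrete tree in the figure amounts to a nested test on the three attributes. A minimal Python sketch of one plausible reading (the exact branching order in the original figure may differ) is:

    def classify(size, shape, color):
        # Illustrative only: one plausible reconstruction of the SIZE/SHAPE/COLOR
        # example tree from the slide; the original branching may differ.
        if size == "big":
            return "pos"
        # size == "small": look at the shape next
        if shape == "triangle":
            return "neg"
        # shape == "circle": the colour decides
        return "pos" if color == "red" else "neg"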

Page 4

Learning Decision Trees

Training:   Training Set + TDIDT = DT
Test:       Example + DT = Class

Page 5

General Induction Algorithm

function TDIDT (X: set-of-examples; A: set-of-features)
var: tree1, tree2: decision-tree;
     X': set-of-examples; A': set-of-features
end-var
if (stopping_criterion (X)) then
   tree1 := create_leaf_tree (X)
else
   amax := feature_selection (X, A);
   tree1 := create_tree (X, amax);
   for-all val in values (amax) do
      X' := select_examples (X, amax, val);
      A' := A \ {amax};
      tree2 := TDIDT (X', A');
      tree1 := add_branch (tree1, tree2, val)
   end-for
end-if
return (tree1)
end-function

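For concreteness, the pseudocode above corresponds roughly to the following runnable Python sketch (a minimal illustration assuming categorical features, information-gain feature selection and majority-class leaves; it is not the exact implementation of the systems cited earlier):

    from collections import Counter
    import math

    def entropy(examples):
        # examples: list of (feature_dict, class_label) pairs
        counts = Counter(label for _, label in examples)
        total = len(examples)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(examples, feature):
        total = len(examples)
        remainder = 0.0
        for value in {x[feature] for x, _ in examples}:
            subset = [(x, y) for x, y in examples if x[feature] == value]
            remainder += len(subset) / total * entropy(subset)
        return entropy(examples) - remainder

    def tdidt(examples, features):
        # features: set of feature names still available for splitting
        if not examples:
            return None
        labels = [y for _, y in examples]
        # stopping criterion: pure node or no features left -> majority-class leaf
        if len(set(labels)) == 1 or not features:
            return Counter(labels).most_common(1)[0][0]
        amax = max(features, key=lambda a: information_gain(examples, a))
        tree = {"feature": amax, "branches": {}}
        for value in {x[amax] for x, _ in examples}:
            subset = [(x, y) for x, y in examples if x[amax] == value]
            tree["branches"][value] = tdidt(subset, features - {amax})
        return tree

Calling tdidt(training_set, set(feature_names)) returns a nested dictionary that mirrors the n-ary tree built by the pseudocode above.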


Page 7

Feature Selection Criteria

• Functions derived from Information Theory:
  – Information Gain, Gain Ratio (Quinlan 86)
• Functions derived from Distance Measures:
  – Gini Diversity Index (Breiman et al. 84)
  – RLM (López de Mántaras 91)
• Statistically based:
  – Chi-square test (Sestito & Dillon 94)
  – Symmetrical Tau (Zhou & Dillon 91)
• RELIEFF-IG: a variant of RELIEFF (Kononenko 94)


Page 8

Information Gain (Quinlan 79)

[The slide shows the Information Gain formula; the equation itself was lost in extraction.]
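Although the slide's equation was lost, the standard definition of information gain for a feature a over an example set X, which is what this criterion refers to, is:

    H(X) = -\sum_{c \in C} P(c) \log_2 P(c)

    \mathrm{Gain}(X, a) = H(X) - \sum_{v \in \mathrm{values}(a)} \frac{|X_v|}{|X|} \, H(X_v)

where C is the set of classes and X_v is the subset of X whose value for feature a is v.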

Page 9

Information Gain (2) (Quinlan 79)

[The slide continues the Information Gain formulation; the equations were lost in extraction.]

Page 10

Gain Ratio (Quinlan 86)

[The slide shows the Gain Ratio formula; the equation itself was lost in extraction.]
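The standard definition, which is presumably what the slide showed, normalises the gain by the split information of the feature:

    \mathrm{SplitInfo}(X, a) = -\sum_{v \in \mathrm{values}(a)} \frac{|X_v|}{|X|} \log_2 \frac{|X_v|}{|X|}

    \mathrm{GainRatio}(X, a) = \frac{\mathrm{Gain}(X, a)}{\mathrm{SplitInfo}(X, a)}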

Page 11

RELIEF (Kira & Rendell, 1992)

[The slide presents the RELIEF attribute-weighting algorithm; the details were lost in extraction.]
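As a reminder of the method the slide refers to: RELIEF estimates a relevance weight for each attribute by repeatedly sampling an example and comparing it with its nearest neighbour of the same class (nearest hit) and of a different class (nearest miss). A minimal sketch for categorical attributes (illustrative, not the original formulation in full detail):

    import random

    def diff(a, x1, x2):
        # 0/1 difference for categorical attributes (numeric attributes would use
        # normalised absolute differences in the full algorithm)
        return 0.0 if x1[a] == x2[a] else 1.0

    def distance(x1, x2, attributes):
        return sum(diff(a, x1, x2) for a in attributes)

    def relief(examples, attributes, m):
        # examples: list of (features_dict, class_label); m: number of sampled instances
        w = {a: 0.0 for a in attributes}
        for _ in range(m):
            x, y = random.choice(examples)
            same = [(x2, y2) for x2, y2 in examples if y2 == y and x2 is not x]
            other = [(x2, y2) for x2, y2 in examples if y2 != y]
            hit = min(same, key=lambda e: distance(x, e[0], attributes))[0]
            miss = min(other, key=lambda e: distance(x, e[0], attributes))[0]
            for a in attributes:
                w[a] += (diff(a, x, miss) - diff(a, x, hit)) / m
        return w

RELIEFF (next slide) extends this scheme to k nearest hits/misses, multiclass problems and noisy data; RELIEFF-IG (two slides ahead) additionally weights the distance by Information Gain.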

Page 12

RELIEFF (Kononenko, 1994)

[The slide presents RELIEFF, the extension of RELIEF; the details were lost in extraction.]

Page 13

RELIEFF-IG (Màrquez, 1999)

• RELIEFF, except that the distance measure used for finding the nearest hits/misses does not treat all attributes equally: it weights each attribute according to its Information Gain.

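A sketch of that modified distance, reusing the diff helper from the RELIEF sketch above and assuming the per-attribute Information Gain values have been precomputed (hypothetical names, for illustration only):

    def ig_weighted_distance(x1, x2, attributes, ig_weight):
        # ig_weight: dict mapping each attribute to its Information Gain on the training set
        return sum(ig_weight[a] * diff(a, x1, x2) for a in attributes)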

Page 14

Extensions of DTs

• (Pre/post) pruning
• Minimizing the effect of the greedy approach: lookahead
• Non-linear splits
• Combination of multiple models
• etc.


(Murthy 95)

Page 15

Decision Trees and NLP

• Speech processing (Bahl et al. 89; Bakiri & Dietterich 99)

• POS Tagging (Cardie 93, Schmid 94b; Magerman 95; Màrquez & Rodríguez 95,97; Màrquez et al. 00)

• Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96)

• Parsing (Magerman 95,96; Haruno et al. 98,99)

• Text categorization (Lewis & Ringuette 94; Weiss et al. 99)

• Text summarization (Mani & Bloedorn 98)

• Dialogue act tagging (Samuel et al. 98)


Page 16

Decision Trees and NLP

• Noun phrase coreference (Aone & Bennett 95; McCarthy & Lehnert 95)

• Discourse analysis in information extraction (Soderland & Lehnert 94)

• Cue phrase identification in text and speech (Litman 94; Siegel & McKeown 94)

• Verb classification in Machine Translation (Tanaka 96; Siegel 97)

• More recent applications of DTs to NLP combine them in a boosting framework (covered in the following sessions)


Page 17

Example: POS Tagging using DTs

He was shot in the hand as he chased the robbers in the back street

[Figure: the ambiguous words in the sentence are highlighted with their candidate tags: NN/VB, JJ/VB, NN/VB.]

(The Wall Street Journal Corpus)

Page 18

POS Tagging using Decision Trees (Màrquez, PhD 1999)

[Figure: overall tagger architecture: raw text → morphological analysis → disambiguation algorithm, which consults a language model → tagged text.]

Page 19

POS Tagging using Decision Trees (Màrquez, PhD 1999)

[Figure: the same architecture, with decision trees shown as the core of the language model.]

Page 20

POS Tagging using Decision Trees (Màrquez, PhD 1999)

[Figure: the same architecture, with three taggers built on top of the tree-based language model: RTT, STT and RELAX.]

Page 21

DT-based Language Modelling

The "preposition-adverb" tree

[Figure: a decision tree for the IN/RB ambiguity. The root tests the word form with prior probabilities P(IN)=0.81, P(RB)=0.19; on the branch for "As"/"as" the next node (P(IN)=0.83, P(RB)=0.17) tests tag(+1); on the branch tag(+1)=RB the next node (P(IN)=0.13, P(RB)=0.87) tests tag(+2); the branch tag(+2)=IN reaches a leaf with P(IN)=0.013, P(RB)=0.987.]

Statistical interpretation (estimated probabilities):

P( RB | word="As"/"as" & tag(+1)=RB & tag(+2)=IN ) = 0.987
P( IN | word="As"/"as" & tag(+1)=RB & tag(+2)=IN ) = 0.013
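To make the statistical interpretation concrete, here is a toy Python encoding of such a probability tree and of how it is consulted at tagging time (an illustrative data structure, not the original implementation; the "default" children stand in for the "others" branches elided in the figure):

    prep_adv_tree = {
        "test": "word",
        "branches": {("As", "as"): {
            "test": "tag(+1)",
            "branches": {("RB",): {
                "test": "tag(+2)",
                "branches": {("IN",): {"leaf": {"IN": 0.013, "RB": 0.987}}},
                "default": {"leaf": {"IN": 0.13, "RB": 0.87}},
            }},
            "default": {"leaf": {"IN": 0.83, "RB": 0.17}},
        }},
        "default": {"leaf": {"IN": 0.81, "RB": 0.19}},
    }

    def tag_distribution(node, context):
        # context: e.g. {"word": "as", "tag(+1)": "RB", "tag(+2)": "IN"}
        while "leaf" not in node:
            value = context[node["test"]]
            for values, child in node["branches"].items():
                if value in values:
                    node = child
                    break
            else:
                node = node["default"]
        return node["leaf"]

    # tag_distribution(prep_adv_tree, {"word": "as", "tag(+1)": "RB", "tag(+2)": "IN"})
    # -> {"IN": 0.013, "RB": 0.987}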

Page 22

DT-based Language Modelling: the "preposition-adverb" tree

[Figure: the same preposition-adverb tree, annotated with the collocations it captures.]

Collocations:
  "as_RB much_RB as_IN"
  "as_RB well_RB as_IN"
  "as_RB soon_RB as_IN"

Page 23

Language Modelling using DTs

• Algorithm: Top-Down Induction of Decision Trees (TDIDT). Supervised learning
  – CART (Breiman et al. 84), C4.5 (Quinlan 95), etc.

• Attributes: local context of (-3,+2) tokens (see the sketch after this list)

• Particular implementation, minimizing the effect of over-fitting, data fragmentation and sparseness:
  – Branch merging
  – CART post-pruning
  – Smoothing
  – Attributes with many values
  – Several functions for attribute selection

• Granularity? Ambiguity-class level
  – adjective-noun, adjective-noun-verb, etc.
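As referenced above, a minimal sketch of what a (-3,+2) local-context feature vector could look like for one ambiguous word (the attribute names and the choice of word forms vs. tags at each position are illustrative assumptions, not the exact feature set of the thesis):

    def context_features(tokens, tags, i):
        # tokens: the words of the sentence; tags: the (possibly partial) tag sequence;
        # i: position of the word being disambiguated.
        def tok(j):
            return tokens[j] if 0 <= j < len(tokens) else "<PAD>"
        def tag(j):
            return tags[j] if 0 <= j < len(tags) else "<PAD>"
        return {
            "word": tok(i),
            "tag(-3)": tag(i - 3), "tag(-2)": tag(i - 2), "tag(-1)": tag(i - 1),
            "word(+1)": tok(i + 1), "tag(+1)": tag(i + 1),
            "word(+2)": tok(i + 2), "tag(+2)": tag(i + 2),
        }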

Page 24

Model Evaluation

The Wall Street Journal (WSJ) annotated corpus:

• 1,170,000 words
• Tagset size: 45 tags
• Noise: 2-3% of mistagged words
• 49,000 word-form frequency lexicon
  – Manual filtering of the 200 most frequent entries
  – 36.4% ambiguous words
  – 2.44 (1.52) average tags per word
• 243 ambiguity classes

Page 25

Model Evaluation: The Wall Street Journal (WSJ) annotated corpus

Number of ambiguity classes that cover x% of the training corpus:

  Coverage             50%   60%   70%   80%   90%   95%   99%   100%
  # ambiguity classes    8    11    14    19    37    58   113    243

Arity of the classification problems:

  Arity                2-tags   3-tags   4-tags   5-tags   6-tags
  # ambiguity classes     103       90       35       12        3

Page 26

12 Ambiguity Classes

They cover 57.90% of the ambiguous occurrences!

Experimental setting: 10-fold cross validation

Page 27

N-fold Cross Validation

Divide the training set S into a partition of N equal-size disjoint subsets: s1, s2, ..., sN

for i := 1 to N do
   learn and test a classifier using:
      training_set := union of all sj with j different from i
      validation_set := si
end-for
return: the average accuracy over the N experiments

Which is a good value for N? (2, 10, ...)
Extreme case (N = training set size): leave-one-out

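A compact Python rendering of the same procedure (a sketch; train_and_evaluate is a hypothetical helper that trains a classifier on its first argument and returns its accuracy on the second):

    def n_fold_cross_validation(examples, n, train_and_evaluate):
        # Split the data into n (nearly) equal-size disjoint folds.
        folds = [examples[i::n] for i in range(n)]
        accuracies = []
        for i in range(n):
            validation = folds[i]
            training = [x for j, fold in enumerate(folds) if j != i for x in fold]
            accuracies.append(train_and_evaluate(training, validation))
        return sum(accuracies) / n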

Page 28

Size: Number of Nodes

[Figure: bar chart of the number of nodes in the tree base: Basic algorithm 22,095; Merging 10,674; Pruning 5,715.]

Average size reduction: 51.7% (merging), 46.5% (pruning), 74.1% (total)

Page 29

Accuracy

[Figure: error-rate chart (% error rate) comparing Lower Bound, Basic Algorithm, Merging and Pruning; the plotted values include 28.83%, 8.49%, 8.36% and 8.3%.]

(At least) no loss in accuracy

Page 30

Feature Selection Criteria

[Figure: average error rate (%) for the different feature-selection functions; most criteria score between roughly 8.2% and 8.7%, with the worst options around 8.9%, 11.63% and 17.24%.]

Statistically equivalent

Page 31

DT-based POS Taggers

Tree Base = Statistical Component
  – RTT: Reductionistic Tree-based tagger (Màrquez & Rodríguez 97)
  – STT: Statistical Tree-based tagger (Màrquez & Rodríguez 99)

Tree Base = Compatibility Constraints
  – RELAX: Relaxation-Labelling based tagger (Màrquez & Padró 97)

Page 32

RTT (Màrquez & Rodríguez 97)

[Figure: RTT architecture: raw text → morphological analysis → iterative disambiguation (classify → update → filter, consulting the tree-based language model) until a stopping condition is met → tagged text.]

Page 33

STT (Màrquez & Rodríguez 99)

[Figure: the slide introduces the statistical tagging model in terms of N-grams (trigrams).]

Page 34

STT (Màrquez & Rodríguez 99)

Contextual probabilities: the true contextual probabilities P(t_k | C_k) are approximated by estimates P~(t_k | C_k) read off the decision tree T_AC_k(t_k; C_k) of the corresponding ambiguity class, i.e. the contextual probabilities are estimated using decision trees.

Page 35

STT (Màrquez & Rodríguez 99)

[Figure: STT architecture: raw text → morphological analysis → disambiguation with the Viterbi algorithm, using a language model of lexical probabilities plus tree-based contextual probabilities → tagged text.]
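Since the slide only names the Viterbi algorithm, here is a minimal generic sketch of Viterbi decoding that combines lexical and contextual scores (a standard textbook formulation, not the exact STT implementation; for brevity the contextual estimate here conditions only on the previous tag, whereas STT's tree-based estimates use a richer context):

    import math

    def viterbi(words, tagset, lexical_prob, contextual_prob):
        # lexical_prob(word, tag): lexical probability estimate
        # contextual_prob(tag, prev_tag): contextual probability estimate
        # Returns the most likely tag sequence under the product of the two scores.
        best = [{t: math.log(lexical_prob(words[0], t)) +
                    math.log(contextual_prob(t, "<START>")) for t in tagset}]
        back = [{t: None for t in tagset}]
        for i in range(1, len(words)):
            best.append({})
            back.append({})
            for t in tagset:
                prev = max(tagset,
                           key=lambda p: best[i - 1][p] + math.log(contextual_prob(t, p)))
                best[i][t] = (best[i - 1][prev] + math.log(contextual_prob(t, prev)) +
                              math.log(lexical_prob(words[i], t)))
                back[i][t] = prev
        # Recover the best path by backtracing from the best final tag.
        last = max(tagset, key=lambda t: best[-1][t])
        path = [last]
        for i in range(len(words) - 1, 0, -1):
            path.append(back[i][path[-1]])
        return list(reversed(path))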

Page 36

STT+ (Màrquez & Rodríguez 99)

[Figure: STT+ architecture: the same Viterbi-based disambiguation, with a language model that combines lexical probabilities, N-grams and tree-based contextual probabilities.]

Page 37

DT-based POS Taggers (recap)

Tree Base = Statistical Component
  – RTT: Reductionistic Tree-based tagger (Màrquez & Rodríguez 97)
  – STT: Statistical Tree-based tagger (Màrquez & Rodríguez 99)

Tree Base = Compatibility Constraints
  – RELAX: Relaxation-Labelling based tagger (Màrquez & Padró 97)

Page 38

RELAX (Màrquez & Padró 97)

[Figure: RELAX architecture: raw text → morphological analysis → relaxation-labelling disambiguation (Padró 96), driven by a language model expressed as a set of constraints: tree-derived constraints + N-grams + linguistic rules → tagged text.]

Page 39

RELAX (Màrquez & Padró 97)

Translating Trees into Constraints

[Figure: the "preposition-adverb" tree again; each root-to-leaf path is translated into a constraint.]

Compatibility values: estimated using Mutual Information

Negative constraint:  -5.81 (IN) (0 "as" "As") (1 RB) (2 IN)
Positive constraint:   2.37 (RB) (0 "as" "As") (1 RB) (2 IN)

Page 40

Experimental Evaluation

Using the WSJ annotated corpus:

• Training set: 1,121,776 words
• Test set: 51,990 words
• Closed vocabulary assumption
• Base of 194 trees
  – Covering 99.5% of the ambiguous occurrences
  – Storage requirement: 565 Kb
  – Acquisition time: 12 CPU-hours (Common LISP / Sparc10 workstation)

Page 41

Experimental Evaluation: RTT results

• 67.52% error reduction with respect to MFT
• Accuracy = 94.45% (ambiguous), 97.29% (overall)
• Comparable to the best state-of-the-art automatic POS taggers
• Recall = 98.22%, Precision = 95.73% (1.08 tags/word)

+ RTT allows a trade-off to be set between precision and recall

Page 42

Experimental Evaluation

STT results
• Comparable to those of RTT
+ STT allows the incorporation of N-gram information, so some problems of sparseness and of coherence of the resulting tag sequence can be alleviated

STT+ results
• Better than those of RTT and STT

Page 43

Experimental Evaluation: Including trees into RELAX

• Translation of 44 representative trees covering 84% of the examples = 8,473 constraints
• Addition of:
  – bigrams (2,808 binary constraints)
  – trigrams (52,161 ternary constraints)
  – linguistically motivated manual constraints (20)

Page 44

Accuracy of RELAX

            MFT      B      T     BT      C     BC     TC    BTC
  Ambig.  85.31  91.35  91.82  91.92  91.96  92.72  92.82  92.55
  Overall 94.66  96.86  97.03  97.06  97.08  97.36  97.39  97.29

  MFT = baseline, B = bigrams, T = trigrams, C = "tree constraints"

              H     BH     TH    BTH     CH    BCH    TCH   BTCH
  Ambig.  86.41  91.88  92.04  92.32  91.97  92.76  92.98  92.71
  Overall 95.06  97.05  97.11  97.21  97.08  97.37  97.45  97.35

  H = set of 20 hand-written linguistic rules

Page 45

Decision Trees: Summary

• Advantages
  – Acquire symbolic knowledge in an understandable way
  – Very well-studied ML algorithms and variants
  – Can be easily translated into rules
  – Off-the-shelf software is available: C4.5, C5.0, etc.
  – Can be easily integrated into an ensemble

Page 46

Decision Trees: Summary

• Drawbacks
  – Computationally expensive when scaling to large natural language domains: training examples, features, etc.
  – Data sparseness and data fragmentation: the problem of small disjuncts => careful probability estimation is needed
  – DTs are a high-variance (unstable) model
  – Tendency to overfit the training data: pruning is necessary
  – Considerable effort is required to tune the model