Using Minimum Description Length to make Grammatical Generalizations
Mike Dowman, University of Tokyo




Page 1

Using Minimum Description Length to make Grammatical Generalizations

Mike Dowman

University of Tokyo

Page 2

What should Syntactic Theory Explain?

• Which sentences are grammatical and which are not

or

• How to transform observed sentences into a grammar

E → (learning) → I

Children transform observed sentences (E) into psychological knowledge of language (I)

Page 3

How should we study syntax?

Linguists’ Approach:
• Choose some sentences
• Decide on the grammaticality of each one
• Make a grammar that accounts for which of these sentences are grammatical and which are not

Informant → sentences → Linguist → grammar

Page 4

Computational Linguists’ Approach (Unsupervised Learning)

• Take a corpus
• Extract as much information from the corpus as accurately as possible
or
• Learn a grammar that describes the corpus as accurately as possible

corpus → grammar, lexical items, language model, etc.


Page 5

Which approach gives more insight into language?

Linguists tend to aim for high precision
• But only produce very limited and arbitrary coverage

Computational linguists tend to obtain much better coverage

• But don’t account for any body of data completely

• And tend to learn only simpler kinds of structure

The approaches seem to be largely complementary

Page 6

Which approach gives more insight into the human mind?

The huge size and complexity of languages is one of their key distinctive properties

The linguists’ approach doesn’t account for this

So should we apply our algorithms to large corpora of naturally occurring data?

This won’t directly address the kind of issue that syntacticians focus on

Page 7

Negative Evidence

• Some constructions seem impossible to learn without negative evidence

John hurt himself      Mary hurt John
John hated himself     Mary hated John
John behaved himself   * Mary behaved John

Page 8

Implicit Negative Evidence

If we never hear something, can’t we just assume it’s not grammatical?

Sentences we never heard?
Phrases we never heard?
Verb argument constructions we never heard?
Word-affix combinations we never heard?

How often does something have to not occur before we decide it’s not grammatical?

At what structural level do we make generalizations?

Page 9

Minimum Description Length (MDL)

MDL may be able to solve the ‘no negative evidence’ problem

Prefers the grammar that results in the simplest overall description of data

• So prefers simple grammars

• And grammars that allow simple descriptions of the data
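This trade-off can be sketched with a toy calculation; the grammar sizes and sentence probabilities below are illustrative numbers of my own, not figures from the model:

```python
import math

def description_length(grammar_bits, sentence_probs):
    # Total description length: the cost of stating the grammar plus the
    # cost of encoding each observed sentence under it (-log2 probability).
    data_bits = sum(-math.log2(p) for p in sentence_probs)
    return grammar_bits + data_bits

# A loose grammar: cheap to state (10 bits) but it spreads probability over
# a huge space of sentences, so each observed sentence is expensive.
loose = description_length(10, [0.0001] * 5)

# A tighter grammar: costs more to state (40 bits) but gives the observed
# sentences high probability, so the data are cheap to encode.
tight = description_length(40, [0.2] * 5)

# MDL prefers whichever total is smaller; with these numbers, the tighter
# grammar wins.
```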

Page 10

Observed sentences

Space of possible sentences

Page 11

Observed sentences

Grammar

Simple but non-constraining grammar

Space of possible sentences

Page 12

Observed sentences

Grammars

Simple but non-constraining grammar

Complex but constraining grammar

Space of possible sentences

Page 13

Observed sentences

Grammars

Grammar that is a good fit to the data

Simple but non-constraining grammar

Complex but constraining grammar

Space of possible sentences

Page 14

Why it has to be MDL

Many machine learning techniques have been applied in computational linguistics

MDL is very rarely used

And MDL has not been especially successful at learning grammatical structure from corpora

So why MDL?

Page 15

Maximum Likelihood

Maximum likelihood can be seen as a special case of MDL in which the a priori probability of all hypotheses P(h) is equal

But the hypothesis that only the observed sentences are grammatical will result in the maximum likelihood

So ML can only be applied if there are restrictions on how well the estimated parameters can fit the data

The degree of generality of the grammars is set externally, not determined by the Maximum Likelihood principle
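The point can be made with a toy calculation (the sentences come from the later example; the uniform sentence probabilities are my simplification):

```python
observed = ["John screamed", "John died", "Mary screamed"]

# Hypothesis A: only the three observed sentences are grammatical, each
# assigned probability 1/3.
likelihood_A = (1 / 3) ** len(observed)

# Hypothesis B: a generalizing grammar that also licenses "Mary died",
# spreading the probability mass over four sentences.
likelihood_B = (1 / 4) ** len(observed)

# Maximum likelihood always favors A, the grammar that memorizes the data.
```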

Page 16

Maximum Entropy

Make the grammar as unrestrictive as possible

But constraints must be used to prevent a grammar just allowing any combination of words to be a grammatical sentence

Again the degree of generality of grammars is determined externally

Neither Maximum Likelihood nor Maximum Entropy provide a principle that can decide when to make generalizations

Page 17

Learning Phrase Structure Grammars

Data:
John screamed
John died
Mary screamed

Grammar:
1 S → NP VP
2 NP → John
3 NP → Mary
4 VP → screamed
5 VP → died

Describing data in terms of the grammar: 1, 2, 4 = John screamed

There is a restricted range of choices at each stage of the derivation

Fewer choices = higher probability
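A minimal sketch of this, assuming each expansion of a nonterminal is chosen uniformly among its rules:

```python
# The grammar above, keyed by left-hand side.
rules = {
    "S": ["NP VP"],               # rule 1
    "NP": ["John", "Mary"],       # rules 2, 3
    "VP": ["screamed", "died"],   # rules 4, 5
}

def derivation_probability(lhs_sequence):
    # Each expansion is a choice among the rules for that left-hand side;
    # fewer choices at a step mean a higher probability for the derivation.
    p = 1.0
    for lhs in lhs_sequence:
        p *= 1 / len(rules[lhs])
    return p

# 'John screamed' = rules 1, 2, 4: one choice for S, two for NP, two for VP.
p = derivation_probability(["S", "NP", "VP"])  # 1 * 1/2 * 1/2 = 0.25
```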

Page 18

Encoding in My Model

1010100111010100101101010001100111100011010110

Symbol Frequencies

Rule Frequencies

Decoder

1 S → NP VP
2 NP → john
3 NP → mary
4 VP → screamed
5 VP → died

John screamed
John died
Mary screamed

Grammar Data

S (1), NP (3), VP (3), john (1), mary (1), screamed (1), died (1), null (4)

Rule 1: 3, Rule 2: 2, Rule 3: 1, Rule 4: 2, Rule 5: 1

Number of bits decoded = evaluation
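A simplified reading of how the data portion is costed (the real encoding also covers the symbol and rule frequency tables and the grammar itself):

```python
import math
from collections import defaultdict

# (rule number) -> (left-hand side, frequency), from the tables above.
rules = {1: ("S", 3), 2: ("NP", 2), 3: ("NP", 1), 4: ("VP", 2), 5: ("VP", 1)}

def data_bits(rule_uses):
    # Each rule use costs -log2 of the rule's relative frequency among
    # rules sharing its left-hand side; frequent rules are cheap.
    totals = defaultdict(int)
    for lhs, count in rules.values():
        totals[lhs] += count
    return sum(-math.log2(rules[r][1] / totals[rules[r][0]]) for r in rule_uses)

# 'John screamed' = rules 1, 2, 4: S has no alternatives (0 bits), while
# the NP and VP choices cost a fraction over half a bit each.
cost = data_bits([1, 2, 4])
```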

Page 19

John hit Mary
Mary hit Ethel
Ethel ran
John ran
Mary ran
Ethel hit John
Noam hit John
Ethel screamed
Mary kicked Ethel
John hopes Ethel thinks Mary hit Ethel
Ethel thinks John ran
John thinks Ethel ran
Mary ran
Ethel hit Mary
Mary thinks John hit Ethel
John screamed
Noam hopes John screamed
Mary hopes Ethel hit John
Noam kicked Mary

Example: English

Learned Grammar:

S → NP VP
VP → ran
VP → screamed
VP → Vt NP
VP → Vs S
Vt → hit
Vt → kicked
Vs → thinks
Vs → hopes
NP → John
NP → Ethel
NP → Mary
NP → Noam

Page 20

Real Language Data

Can the MDL metric also learn grammars from corpora of unrestricted natural language?

If it could, we’d largely have finished syntax

But search space is way too big

We need to simplify the task in some way: only learn verb subcategorization classes

Page 21

Switchboard Corpus

( (S
    (CC and)
    (PRN
      (, ,)
      (S
        (NP-SBJ (PRP you) )
        (VP (VBP know) ))
      (, ,) )
    (NP-SBJ-1 (PRP she) )
    (VP (VBD spent)
      (NP
        (NP (CD nine) (NNS months) )
        (PP (IN out)
          (PP (IN of)
            (NP (DT the) (NN year) ))))
      (S-ADV
        (NP-SBJ (-NONE- *-1) )
        (ADVP (RB just) )
        (VP (VBG visiting)
          (NP (PRP$ her) (NNS children) ))))
    (. .) (-DFL- E_S) ))

Extracted Information:

Verb: spent

Subcategorization frame: * NP S
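A sketch of the extraction step; my assumptions (not spelled out in the slides) are that '*' marks the verb's position in the frame and that a label's modifiers are whatever follows the first hyphen:

```python
def subcat_frame(vp_children):
    # Given the labels of a VP's children, verb first, build the verb's
    # subcategorization frame: '*' stands for the verb, and modifiers to
    # the basic labels (after '-') are ignored.
    return " ".join(["*"] + [label.split("-")[0] for label in vp_children[1:]])

# The VP for 'spent' has children VBD, NP, S-ADV:
frame = subcat_frame(["VBD", "NP", "S-ADV"])  # '* NP S'
```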

Page 22

Extracted Data

Only verbs tagged as VBD (past tense) extracted

Modifiers to basic labels ignored

21,759 training instances

704 different verbs

706 distinct subcategorization frames

25 different types of constituent appeared alongside the verbs (e.g. S, SBAR, NP, ADVP)

Page 23

Verb Class Grammars

S → Class1 Subcat1
S → Class1 Subcat2
S → Class2 Subcat1

Class1 → grew
Class1 → ended
Class2 → did

grew and ended can appear with subcats 1 and 2; did only with subcat 1

Grouping together verbs with similar subcategorizations should improve the evaluation

Page 24

A New Search Mechanism

We need a search mechanism that will only produce candidate grammars of the right form

• Start with all verbs in one class
• Move a randomly chosen verb to a new class (P=0.5) or a different class (P=0.5)
• Empty verb classes are deleted
• Redundant rules are removed

Page 25

A New Search Mechanism (2)

Annealing search:
• After no changes are accepted for 2,000 iterations, switch to merging phase
• Merge two randomly selected classes
• After no changes accepted for 2,000 iterations, switch back to moving phase
• Stop after no changes accepted for 20,000 iterations
• Multiple runs were conducted and the grammar with the overall lowest evaluation selected
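The search above can be sketched as follows; the evaluation function is left as a parameter (in the model it is the MDL evaluation of the induced verb-class grammar), and accepting only strict improvements is my simplification:

```python
import random

def search(verbs, evaluate, patience=2000, stop_patience=20000, seed=0):
    # Hill-climbing sketch of the move/merge search described above.
    # `evaluate` scores a partition of verbs into classes; lower is better.
    rng = random.Random(seed)
    classes = [list(verbs)]              # start with all verbs in one class
    best_score = evaluate(classes)
    phase, stalled, run_stalled = "move", 0, 0
    while run_stalled < stop_patience:
        candidate = [list(c) for c in classes]
        if phase == "move":
            # Move a randomly chosen verb to a new class (P=0.5)
            # or to a different existing class (P=0.5).
            src = rng.choice(candidate)
            verb = src.pop(rng.randrange(len(src)))
            if rng.random() < 0.5 or len(candidate) < 2:
                candidate.append([verb])
            else:
                rng.choice([c for c in candidate if c is not src]).append(verb)
        elif len(candidate) >= 2:
            # Merging phase: merge two randomly selected classes.
            a, b = rng.sample(range(len(candidate)), 2)
            candidate[a].extend(candidate[b])
            del candidate[b]
        candidate = [c for c in candidate if c]   # empty classes are deleted
        score = evaluate(candidate)
        if score < best_score:
            classes, best_score = candidate, score
            stalled = run_stalled = 0
        else:
            stalled += 1
            run_stalled += 1
            if stalled >= patience:      # no accepted changes: switch phase
                phase = "merge" if phase == "move" else "move"
                stalled = 0
    return classes, best_score
```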

Page 26

Grammar Evaluations

                      Best learned   Each verb in a    One verb
                      grammar        separate class    class
Data                  207,312.4      187,026.7         220,520.4
Grammar                37,885.5      111,036.5          29,915.1
Overall Evaluation    245,198.0      298,063.3         250,435.5

Page 27

Learned Classes

Class  Verbs in Class  Description

1 thought, vowed, prayed, decided, adjusted, wondered, wished, allowed, knew, suggested, claimed, believed, remarked, resented, detailed, misunderstood, assumed, competed, snowballed, smoked, said, struggled, determined, noted, understood, foresaw, expected, discovered, realized, negotiated, suspected, indicated

Usually take an S or SBAR complement (SBAR usually contains that or who etc. followed by an S)

2 enjoyed, canceled, liked, had, finished, traded, sold, ruined, needed, watched, loved, included, received, converted, rented, bred, deterred, increased, encouraged, made, swapped, shot, offered, spent, impressed, discussed, missed, carried, injured, presented, surprised…

Usually take an NP argument (often in conjunction with other arguments)

3 did did only

4 All other verbs miscellaneous

5 used, named, tried, considered, tended, refused, wanted, managed, let, forced, began, appeared

Typically take an S argument (but never just an SBAR)

6 wound, grew, ended, closed, backed Usually take a particle

Page 28

Did MDL make appropriate generalizations?

The learned verb classes are clearly linguistically coherent

But they don’t account for exactly which verbs can appear with which subcats

Linguists have proposed far more fine-grained classes

Data available for learning was limited (subcats had no internal structure, Penn Treebank labels may not be sufficient)

But linguists can’t explain which verbs appear with which subcats either

Page 29

Conclusions

• MDL (and only MDL) can determine when to make linguistic generalizations and when not to

• The same MDL metric can be used both on small sets of example sentences and on unrestricted corpora

• Work using corpora does not address the kind of issues that syntacticians are interested in