Statistical approaches to language learning John Goldsmith Department of Linguistics May 11, 2000


Statistical approaches to language learning

John Goldsmith

Department of Linguistics

May 11, 2000


Trends in the study of language acquisition

1 Chomsky-inspired: “principles and parameters” (since 1979)

2 Transcribe and write a grammar

3 Compute statistics, and develop a minimum description length (MDL) analysis


1 Principles and parameters

The variation across languages boils down to two things:

– alternate settings of a small set of “parameters” (a few hundred?), each of which has only a small number of possible settings (2? 3? 4?)

– arbitrary facts that have to be learned, like the pronunciation of words


What’s a “parameter”, for instance?

1. Pro-drop parameter: yes/no.

Yes? Spanish, Italian. Subject is optional; subject may appear before or after the verb; verb agrees with the subject (present or absent) with overt morphology.

No? English, French. Subject is obligatory. Dummy subjects (It is raining, There is a man at the door.)


Or, noun-adjective order…

Noun precedes adjective: French, Spanish: F. la voiture rouge “the red car” but literally “the car red”

Noun follows adjective: English


Criticisms:

1. This approach intentionally puts a lot of information into the innate language “faculty.” How can we be sure the linguist isn’t just cataloging a lot of differences between English and Spanish (e.g.) and proclaiming that this is a universal difference?

2. You don’t need an innate language faculty to realize that children have to learn whether nouns precede adjectives or not.


3. The theory is completely silent about the learning of morphemes and words. It implies (by the silence) that this stuff is easy to learn. But maybe it’s the hardest stuff to learn, requiring such a sophisticated learning apparatus that the grammar will be easy (by comparison) to learn.


2. Transcribe and write a grammar

Long tradition; landmark is Roger Brown’s work in the 1960s.

Value: extremely important empirical basis.

Criticism: tells us very little about the how or the process of language acquisition.


3. Statistics and minimum description length

Recent work -- probabilities in the lab:

Saffran, J., Aslin, R., & Newport, E. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926-1928.

The authors argue that even quite young children can extract information about the “chunking” of sounds into pieces on the basis of how frequently those sounds occur together.


The linguist’s acquisition problem:

What “must” happen in order for someone to end up knowing a particular language.

We (linguists) can map out models (and run them on computers) that show how easy (or hard) it is to arrive at a grammar of English (etc.) on the basis of various assumptions.


We can’t tell which kinds of information a child uses. But we can argue that learning X or Y is easier/harder/the same if you assume the child has access to certain kinds of data (e.g., semantic, grammatical).


Probabilistic and statistical approaches

The fundamental premise of probabilistic approaches to language is this:

Degrees of (un)certainty can be quantified.


Two problems of language acquisition that have been seriously tackled

2 closely related problems:

1. Segmenting an utterance into word-sized pieces (Brent, de Marcken, others)

2. Segmenting words into morphemes. (Goldsmith)


Minimum Description Length

Jorma Rissanen (1989)

Data Analyzer Analysis

Select the analyzer and analysis such that the sum of their lengths is a minimum.


[Diagram: the same Data paired with many alternative Analyzer + Analysis candidates, etc.]


The challenge

Is to find a means of quantifying the length of an analyzer, and the length of an analysis


“Compressed form of data?”

Think of data as a dense, rich, detailed description (evidence), and

Think of compressed form as: Description in high level language + Description of the particulars of the event in question (a.k.a. boundary conditions, etc.)...


Example:

Utterance:

“theolddogandthenotsooldcatgotintotheyardwithoutanybodynoticing”

62 letters as it stands.

Or:

1 = the 2=old 3=dog 4 = not

123and14so2catgotinto1yardwithoutanybody4icing.

46 symbols here, 12 above, total of 58 --
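The count can be checked mechanically. The sketch below is only an illustration: the codebook mirrors the four codes on the slide, and the helper name `encode` is hypothetical. As on the slide, the cost of the codebook is taken to be just the letters of the words it spells out.

```python
# Codebook from the slide: each digit stands for one frequent word.
CODEBOOK = {"1": "the", "2": "old", "3": "dog", "4": "not"}

UTTERANCE = "theolddogandthenotsooldcatgotintotheyardwithoutanybodynoticing"


def encode(text, codebook):
    """Replace every occurrence of each codebook word by its one-symbol code."""
    for code, word in codebook.items():
        text = text.replace(word, code)
    return text


encoded = encode(UTTERANCE, CODEBOOK)
# Letters needed to spell out the codebook words once.
codebook_cost = sum(len(word) for word in CODEBOOK.values())
total_cost = len(encoded) + codebook_cost

print(len(UTTERANCE), encoded, len(encoded), codebook_cost, total_cost)
```

Encoding "the" before "not" matters only for readability here; none of the four words overlaps another, so the replacement order does not change the result.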


Compare with Early Generative Grammar (EGG)

[Diagram: Data, Analysis 1 and Analysis 2 are submitted to the Linguistic Theory, which returns a preference: A1/A2]


[Diagram: three things a linguistic theory might be asked to do]

Data -> Linguistic theory -> Analysis

Data + Analysis -> Linguistic theory -> Yes/No

Data + Analysis 1 + Analysis 2 -> Linguistic theory -> 1 is better/2 is better


Implicit in EGG was the notion... that the best Linguistic Theory (LT) could be selected by...

getting a set of n candidate LTs;

submitting to each a set of corpora;

searching (using unknown heuristics) for the best analyses of each corpus within each LT.

The LT for which the sum total of all of the analyses is the smallest wins.


No cost to UG

In EGG, there was no cost associated with the size of UG -- in effect, no plausibility measure.


In MDL, in contrast….

we can argue for a grammar for a given corpus.

We can also argue at the Linguistic Theory level if we so choose...


Distinction between heuristics and “theory”

In the context of MDL, the heuristics are extratheoretical, but from the point of view of the (psycho-)linguist, they are very important.

The heuristics propose; the theory disposes.


The goal:

To produce a morphological analysis of a corpus from an “unknown” language automatically, that is, with no knowledge of the structure of that language built in;

To produce both generalizations about the language, and a correct analysis of each word in the corpus.


Linguistica


Implemented in Linguistica, a program that runs under Windows that you can download at:

humanities.uchicago.edu/faculty/goldsmith


Other work in this area

Derrick Higgins on Thursday; Michael Brent 1993; Zellig Harris 1955 and 1967, with follow-up by Hafer and Weiss 1974


Global approach

Focus on devising a method for evaluating a hypothesis, given the data.

Finding explicit methods of discovery is important, but those methods play no role in evaluating the analysis for a given corpus.

(Very similar in conception to Chomsky’s notion of an evaluation metric.)


Framework for evaluation:

Jorma Rissanen’s Minimum Description Length (“MDL”).

Quite intricate; but we can get a very good feel for the general idea with a naïve version of MDL...


Naive description length

Count the total number of letters in the list of stems and affixes:

the fewer, the better.


Intuition:

A word which is morphologically complex reveals that composite character by virtue of being composed of (one or more) strings of letters which have a relatively high frequency throughout the corpus.


Naive description length: 2

Lexicographers know what they are doing when they indicate the entry for the verb laugh as laugh, ~s, ~ed, ~ing --

They recognize that the tilde “~” allows them to exploit the regularities of the language in order to save space and specification, and implicitly to underscore the regularity of the pattern that the stem follows.


Morphological analysis is not merely a matter of frequency.

Not every word that ends in –ing is morphologically complex: string, sing, etc.


Naive Minimum Description Length:

Analyze the words of a corpus into stem + suffix with the requirement that every stem and every suffix must be used in at least 2 distinct words.

Tally up the total number of letters in (a) each of the proposed stems, (b) each of the proposed suffixes, and (c) each of the unanalyzed words, and call that total the “naive description length”.

[Diagram: stems jump, laugh; suffixes s, ing, ed]


Naive Minimum Description Length

Corpus:

jump, jumps, jumping

laugh, laughed, laughing

sing, sang, singing

the, dog, dogs

total: 61 letters

Analysis:

Stems: jump laugh sing sang dog (20 letters)

Suffixes: s ing ed (6 letters)

Unanalyzed: the (3 letters)

total: 29 letters.

Notice that the description length goes UP if we analyze sing into s+ing
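The tallies on this slide can be reproduced with a few lines of Python. This is a sketch only: the function name `naive_description_length` is hypothetical, and the "sing = s + ing" variant is one assumed reading of the remark above, in which the stem "sing" is still needed for "singing" and the new stem "s" is therefore added on top.

```python
def naive_description_length(stems, suffixes, unanalyzed):
    """Total number of letters across stems, suffixes and unanalyzed words."""
    return sum(len(item) for group in (stems, suffixes, unanalyzed) for item in group)


corpus = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
          "sing", "sang", "singing", "the", "dog", "dogs"]
raw_length = sum(len(word) for word in corpus)  # letters with no analysis at all

# The analysis given on the slide.
slide_analysis = naive_description_length(
    stems=["jump", "laugh", "sing", "sang", "dog"],
    suffixes=["s", "ing", "ed"],
    unanalyzed=["the"],
)

# Assumed variant: additionally treat "sing" as s + ing.
with_s_ing = naive_description_length(
    stems=["jump", "laugh", "sing", "s", "sang", "dog"],
    suffixes=["s", "ing", "ed"],
    unanalyzed=["the"],
)

print(raw_length, slide_analysis, with_s_ing)
```

By this count the twelve listed words contain 61 letters, the slide's analysis costs 29, and the variant costs 30, so the extra split buys nothing.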


Frequencies matter, but only in the overarching context of a total morphological analysis of all of the words of the language.


Let’s look at how the work is done, step by step...


[Diagram, built up over several slides: Corpus -> Bootstrap heuristic -> Morphology -> incremental heuristics -> modified morphology]

1. Pick a large corpus from a language -- 5,000 to 1,000,000 words.

2. Feed it into the “bootstrapping” heuristic...

3. Out of which comes a preliminary morphology, which need not be superb.

4. Feed it to the incremental heuristics...

5. Out comes a modified morphology.

6. Is the modification an improvement? Ask MDL!

7. If it is an improvement, replace the morphology (the old one goes to “Garbage”)...

8. Send it back to the incremental heuristics again...

9. Continue until there are no improvements to try.
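The propose-and-evaluate loop described on these slides can be sketched as a simple hill climb. This is a minimal illustration under stated assumptions, not Linguistica's code: `mdl_search`, the `heuristics` callables and the `description_length` function are all hypothetical names, and a "morphology" can be any value the caller chooses.

```python
def mdl_search(initial, heuristics, description_length):
    """Keep applying heuristics while they shorten the description length.

    initial            -- the morphology produced by the bootstrap heuristic
    heuristics         -- functions that take a morphology and propose a modified one
    description_length -- function scoring a morphology (smaller is better)
    """
    best = initial
    best_cost = description_length(best)
    improved = True
    while improved:                      # stop when a full pass finds no improvement
        improved = False
        for propose in heuristics:
            candidate = propose(best)    # the heuristics propose...
            cost = description_length(candidate)
            if cost < best_cost:         # ...and the description length disposes
                best, best_cost = candidate, cost
                improved = True
    return best


# Toy stand-in: "morphologies" are integers and the best one is 3.
result = mdl_search(10, [lambda m: m - 1, lambda m: m + 1], lambda m: abs(m - 3))
print(result)
```

The heuristics never need to justify their proposals; a rejected candidate is simply discarded, which is why they can stay outside the theory.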


Bootstrapping...initial hypothesis = initial morphology of the corpus


First: a set of candidate suffixes for the language

Using some interesting statistics.


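$$ p(l_1 l_2 \ldots l_n)\,\log \frac{p(l_1 l_2 \ldots l_n)}{p(l_1)\,p(l_2) \cdots p(l_n)} $$

1. Observed frequency of a string (e.g., ing): $p(l_1 l_2 \ldots l_n)$

2. Predicted frequency of the same string if there were no morphemes in the language: $p(l_1)\,p(l_2) \cdots p(l_n)$

3. The computed “stickiness” of that string: the log of the ratio of (1) to (2)

4. Weight the stickiness (3) by how often the string shows up in the corpus: the leading factor $p(l_1 l_2 \ldots l_n)$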


Rank all word-final sequences of letters (of length 1-4 letters);

This gives us an excellent first guess of the suffixes of the language.

See Handout for English, French, Spanish, and Latin.
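A rough sketch of this ranking follows. The slides do not say how the frequencies are normalized, so the choices here are assumptions: the observed frequency of an ending is its share of all words long enough to carry an ending of that length, the predicted frequency is the product of the corpus-wide letter frequencies, and the function name `rank_word_final_strings` is hypothetical.

```python
import math
from collections import Counter


def rank_word_final_strings(words, max_len=4):
    """Rank word-final strings of 1..max_len letters by frequency-weighted stickiness."""
    letter_counts = Counter(ch for w in words for ch in w)
    total_letters = sum(letter_counts.values())

    ending_counts = Counter()   # how often each word-final string occurs
    slots_by_len = Counter()    # how many words could carry an ending of length n
    for w in words:
        for n in range(1, max_len + 1):
            if len(w) > n:      # leave at least one letter for the stem
                ending_counts[w[-n:]] += 1
                slots_by_len[n] += 1

    scores = {}
    for ending, count in ending_counts.items():
        observed = count / slots_by_len[len(ending)]
        predicted = math.prod(letter_counts[ch] / total_letters for ch in ending)
        scores[ending] = observed * math.log(observed / predicted)
    return sorted(scores, key=scores.get, reverse=True)


toy_corpus = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
              "sing", "sang", "singing", "the", "dog", "dogs"]
ranking = rank_word_final_strings(toy_corpus)
print(ranking[:5])
```

On a corpus this small the ranking is noisy below the top few entries; the slides assume thousands of words at the least.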


English        French         Latin          Italian        Spanish (Quijote)
(e)s           (r)e           (((i)b)u)s     (((e)n)t)e     se
(t)e           ((l)e)s        ((t)u)m        ((a)t)o        ar
((t)e)d        (((m)e)n)t     (i)t           ((a)t)a        ó
(((t)i)o)n     (((t)i)o)n     (u)e           ((n)t)i        ado
(l)y           (e)r           (t)a           ((i)o)ne       le
(((t)i)n)g     ée             (t)i           (a)no          an


Given a candidate set of 100 suffixes...

It is not difficult to find the set of stems that gives us the largest number of analyses employing only those suffixes.

We use these to find the major signatures present in the corpus ...


Discovery of signatures:

Stems: attempt, assault, appeal, amount, alert, afford, add, accent

The first 8 stems in the largest signature in a 500,000 word corpus of English.

Suffixes: s, ing, ed, NULL

Set of suffixes that appears with all of these stems.
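A signature, in this sense, groups the stems that take exactly the same set of suffixes. The sketch below shows one simple way to collect them from a word list and a candidate suffix list; the function name `find_signatures` is hypothetical, every possible split is accepted without any scoring, and the dotted, alphabetically sorted labels merely imitate the notation used for signatures elsewhere in these slides.

```python
from collections import defaultdict


def find_signatures(words, suffixes):
    """Map each signature label (e.g. 'NULL.ed.ing') to its sorted list of stems."""
    vocabulary = set(words)
    stem_suffixes = defaultdict(set)
    for word in vocabulary:
        stem_suffixes[word].add("NULL")          # the bare word: stem + empty suffix
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) > len(suffix):
                stem_suffixes[word[: -len(suffix)]].add(suffix)

    signatures = defaultdict(list)
    for stem, found in stem_suffixes.items():
        if len(found) >= 2:                      # a stem must occur in at least 2 words
            signatures[".".join(sorted(found))].append(stem)
    return {label: sorted(stems) for label, stems in signatures.items()}


toy_corpus = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
              "sing", "sang", "singing", "the", "dog", "dogs"]
signatures = find_signatures(toy_corpus, ["s", "ing", "ed"])
for label, stems in sorted(signatures.items()):
    print(label, stems)
```

Note that the spurious stem "s" (from "sing") is dropped because it occurs with only one suffix.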


Minimum Description Length

The real thing, this time: Rissanen 1989.

Evaluate a morphology by:

1. How well the morphology extracts generalizations present in the data: how well it describes the data.

2. How concise the morphology is.

The “naïve MDL” we just looked at only covered the second point, and only crudely.


Measure how well the morphology fits the data:

1. Compute the predicted inverse log frequency of each word in the corpus, and sum:

$$ -\sum_{w \in W} \mathrm{Count}(w) \cdot \log \mathrm{freq}_{\mathrm{predicted}}(w) $$

This is a well-understood quantity in information theory, called the “optimal compressed length” of the corpus based on the probability distribution defined by the morphology.
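As a small illustration of this quantity, the sketch below sums Count(w) times the negative log of the predicted frequency. The base-2 logarithm (so that the result is in bits), the function name `compressed_length`, and the toy probabilities are all assumptions; in the real system the predicted frequencies would come from the morphology.

```python
import math


def compressed_length(counts, predicted_freq):
    """Sum over words of Count(w) * -log2(predicted frequency of w), in bits."""
    return sum(
        -count * math.log2(predicted_freq[word])
        for word, count in counts.items()
    )


# Toy numbers only: a four-token corpus and a made-up probability distribution.
counts = {"the": 2, "dog": 1, "dogs": 1}
predicted_freq = {"the": 0.5, "dog": 0.25, "dogs": 0.25}
bits = compressed_length(counts, predicted_freq)
print(bits)
```

A morphology that assigns higher probability to the words that actually occur yields a smaller total, which is what "fitting the data well" means here.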


Conciseness

Sum all the letters, plus all the structure inherent in the description, using information theory.


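(ii) Suffix list:

$$ \sum_{f \in \mathrm{Suffixes}} \Big( \underbrace{k \cdot |f|}_{\text{number of letters}} + \underbrace{\log \frac{[W_A]}{[f]}}_{\text{structure}} \Big) $$

(iii) Stem list:

$$ \sum_{t \in \mathrm{Stems}} \Big( \underbrace{k \cdot |t|}_{\text{number of letters}} + \underbrace{\log \frac{[W]}{[t]}}_{\text{structure}} \Big) $$

Here $k$ stands in for the per-letter factor that multiplies the length on the slide; its symbol is not legible in this copy.

+ Signatures, which we’ll get to shortly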
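One possible reading of these list costs: each stem or suffix pays a per-letter charge for its spelling (the "number of letters" part) plus a pointer cost of the form log(total count / count of the entry) (the "structure" part). The sketch below follows that reading. The base-2 logarithm, the default per-letter charge of log2 26, and the names `list_cost`, `entry_counts`, `total_count` and `bits_per_letter` are assumptions, not values stated on the slide.

```python
import math


def list_cost(entry_counts, total_count, bits_per_letter=math.log2(26)):
    """Cost in bits of a stem or suffix list: letters plus structure.

    entry_counts maps each stem or suffix to its corpus count;
    total_count is the count the pointer cost is measured against.
    """
    cost = 0.0
    for entry, count in entry_counts.items():
        letters = len(entry) * bits_per_letter       # "number of letters" term
        structure = math.log2(total_count / count)   # "structure" term
        cost += letters + structure
    return cost


# Toy numbers, not taken from any real corpus.
suffix_cost = list_cost({"s": 2, "ing": 2}, total_count=8, bits_per_letter=1.0)
print(suffix_cost)
```

Frequent entries get short pointers and rare ones long pointers, so a list full of one-off stems is expensive even when the stems themselves are short.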


Information contained in the Signature component

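List of pointers to signatures:

$$ \sum_{\sigma \in \mathrm{Signatures}} \log \frac{[W]}{[\sigma]} $$

$$ +\ \sum_{\sigma \in \mathrm{Signatures}} \Big( \log \langle \mathrm{stems}(\sigma) \rangle + \log \langle \mathrm{suffixes}(\sigma) \rangle \Big) $$

$$ +\ \sum_{\sigma \in \mathrm{Sigs}} \Big( \sum_{t \in \mathrm{Stems}(\sigma)} \log \frac{[W]}{[t]} + \sum_{f \in \mathrm{Suffixes}(\sigma)} \log \frac{[\sigma]}{[f \text{ in } \sigma]} \Big) $$

<X> indicates the number of distinct elements in X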


Results…


Suffixes of English

Look at your handout.


1. NULL.ed.ing.s: accent, add, administer, afford, alert, amount, appeal, assault, attempt

2. 's.NULL.s: adolescent, afternoon, airline, ambassador, amendment

4. NULL.s: abberation, abolitionist, abortion, absence, abstractionist, abutment, accolade, accommodation, accomodation

5. e.ed.es.ing: achiev, assum, brac, chang, charg

7. NULL.ed.ing: applaud, arrest, astound, blast, bless, bloom, boast, bolster, broaden, cater

8. NULL.er.ing.s: blow, bomb, broadcast, deal


French

1. NULL.e.es.s: abondant, abstrait, adjacent, approprié, atteint, bantou, bleu, brillant, byzantin

2. NULL.s: abandonnée, abbaye, abdication, abdominale, abélienne, aberration, abolitionniste, abordée, abrasif, abréviation

3. NULL.ment.s: administrative, agressive, anatomique, ancienne, annuelle

4. NULL.e.es: acquis, aéropostal, afghan, albanais, allongé, anglais, appelé, arrondi, bavarois, carthaginois

5. NULL.e.s: adhérent, adolescent, affilié, aîné, assigné, assistant, bovin, cinglant, colorant

6. NULL.ne.s: abélien, acheuléen, alsacien, amérindien, ancien

7. NULL.e: accueillant, acharné, admis, adsorbant, albigeois, alicant, aliénant, alléchant, amarant, ambiant

8. NULL.es.s: antioxydant, bassin, civil, craint, cristallin, cutané, descendant, doté, émulsifiant, ennemi

9. a.aient.ait.ant.e.ent.er.es.èrent.é.ée.és: contrôl, jou, laiss, rest


Spanish

1. a.as.o.os: abiert, aficionad, ajen, amig, antigu, compuest, cortesan, cubiert, cuy, delicad

2. NULL.s: aborrecido, abrasado, abundante, acaecimiento, accidente, achaque, acompañado

4. NULL.n: abría, abriría, acabase, acabe, acaece, acertaba, acometía, acompañaba, acordaba, aguardaba

5. NULL.n.s: caballero, cante, debía, dice, dijere, duerme, entiende

7. NULL.a.as.o.os: algun, buen, es, mí, primer, un

8. NULL.es: ángel, animal, árbol, azul, bachiller, belianis, bien, buey, calidad, cardenal

9. da.do.r


Latin

1. NULL.que: abierunt, acceperunt, accepit, accinctus, accipient, addidit, adiuvit, adoravit, adplicabis, adprehendens

2. NULL.m.s: acie, aquaeductu, byssina

4. NULL.m: abdia, abia, abira, abra, adonira, adsistente, adulescente, adulescentia, adustione, aetate

5. i.is.o.orum.os.um.us: angel, cubit, discipul

7. NULL.e.m: angustia, baptista, barachia, bethania, blasphemia, causa, conscientia, corona, ignorantia, lorica

8. a.ae.am.as.i.is.o.orum.os.um.us: ann, magn, mult


Future directions

Develop it to work with languages with greater complexity; and

Use it as an aid in the task of learning syntax in the same unsupervised fashion.