
Counting Words: Pre-Processing and Non-Randomness

Marco Baroni & Stefan Evert

Malaga, 11 August 2006


Outline

Pre-Processing

Non-Randomness

The End


Pre-processing

• IT IS IMPORTANT!!! (Evert and Lüdeling 2001)

• Automated pre-processing is often necessary (13,850 types begin with re- in the BNC; 103,941 types begin with ri- in itWaC)

• We can rely on:
    • POS tagging
    • Lemmatization
    • Pattern-matching heuristics (e.g., a candidate prefixed form must be analyzable as PRE+VERB, with the VERB independently attested in the corpus; see the sketch below)

• However...
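The pattern-matching heuristic can be made concrete as a small check on candidate ri- forms. A minimal sketch in R with a toy vocabulary; the helper name and the regular expression are illustrative, not the authors' actual pre-processing pipeline:

    ## Accept a candidate ri- form only if the remainder (the would-be base verb)
    ## is independently attested in the corpus vocabulary.
    is.ri.candidate <- function(word, vocab) {
      stem <- sub("^ri-?", "", word)      # strip the prefix, with or without a dash
      nchar(stem) > 0 && stem %in% vocab  # base form must be attested on its own
    }

    vocab <- c("cadere", "fare", "vedere")   # toy corpus vocabulary
    is.ri.candidate("ricadere", vocab)       # TRUE  ("cadere" is attested)
    is.ri.candidate("ri-cadere", vocab)      # TRUE  (dashed spelling)
    is.ri.candidate("ridicolo", vocab)       # FALSE ("dicolo" is not attested)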



The problem with low frequency words

• Correct analysis of low-frequency words is fundamental for measuring productivity and estimating LNRE models

• Automated tools tend to perform worst on low-frequency forms:
    • Statistical tools suffer from a lack of relevant training data
    • Manually crafted tools will probably lack the relevant resources

• Problems arise in both directions (under- and overestimation of hapax counts)

• Part of the more general “95% performance” problem



Underestimation of hapaxes

• The Italian TreeTagger lemmatizer is lexicon-based; out-of-lexicon words (e.g., productively formed words containing a prefix) are lemmatized as UNKNOWN

• No prefixed word with a dash (ri-cadere) is in the lexicon

• Writers are more likely to use a dash to mark transparent morphological structure


Productivity of ri- with and without an extended lexicon

[Figure: vocabulary growth curves E[V(N)] vs. N for ri-, post-cleaning vs. pre-cleaning]


Overestimation of hapaxes

• “Noise” generates hapax legomena

• The Italian TreeTagger thinks that dashed expressions containing pronoun-like strings are pronouns

• Dashed strings can be anything, including full sentences

• This creates a lot of pseudo-pronoun hapaxes: tu-tu, parapaponzi-ponzi-po, altri-da-lui-simili-a-lui


Productivity of the pronoun class before and after cleaning

[Figure: vocabulary growth curves E[V(N)] vs. N for the pronoun class, pre-cleaning vs. post-cleaning]


P (and V) with/without correct post-processing

• With:

    class      V      V1    N          P
    ri-        1098   346   1,399,898  0.00025
    pronouns   72     0     4,313,123  0

• Without:

    class      V      V1    N          P
    ri-        318    8     1,268,244  0.000006
    pronouns   348    206   4,314,381  0.000048
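The productivity figures above are simply P = V1/N, the hapax count divided by the sample size. A minimal zipfR sketch, assuming the ItaRi.spc example spectrum that, to my knowledge, ships with the package (its counts will differ from the table above):

    library(zipfR)

    data(ItaRi.spc)                      # frequency spectrum of Italian ri- forms
    V(ItaRi.spc)                         # number of types V
    Vm(ItaRi.spc, 1)                     # number of hapax legomena V1
    N(ItaRi.spc)                         # sample size N

    P <- Vm(ItaRi.spc, 1) / N(ItaRi.spc) # Baayen's productivity measure P = V1/N
    P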


A final word on pre-processing

• IT IS IMPORTANT

• Often the major roadblock in lexical statistics investigations



Outline

Pre-Processing

Non-Randomness

The End


Non-randomness

• LNRE modeling is based on the assumption that our corpora/datasets are random samples from the population

• This is obviously not the case

• Can we pretend that a corpus is random?

• What are the consequences of non-randomness?



A Brown-sized random sample from a ZM population estimated with the Brown

[Figure: vocabulary growth curve V(N) vs. N for a Brown-sized random sample drawn from the estimated ZM population]
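A curve like this can be simulated by drawing a random sample from the fitted population model and tracking its vocabulary growth. A rough sketch, assuming the Brown.spc example spectrum ships with zipfR and that rlnre() behaves as I recall from the package documentation; not the authors' exact procedure:

    library(zipfR)

    data(Brown.spc)
    zm  <- lnre("zm", Brown.spc)           # ZM population model estimated from the Brown
    sim <- rlnre(zm, N(Brown.spc))         # random sample of Brown size from that population

    vgc.sim <- vec2vgc(as.character(sim), steps = 200)  # V(N) of the simulated sample
    plot(vgc.sim)                                       # compare with the real Brown's V(N)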


The real Brown

[Figure: observed vocabulary growth curve V(N) vs. N for the Brown corpus]


Where does non-randomness come from?

• Syntax?

• If the corpus were a random sequence of words, the the should be the most frequent English bigram

• If the problem is due to syntax, randomizing by sentence will not get rid of it (Baayen 2001, ch. 5)



The Brown randomized by sentence

[Figure: vocabulary growth curve V(N) vs. N for the Brown corpus after randomizing the order of sentences]
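The sentence-randomization test behind this figure amounts to shuffling sentence order and recomputing the growth curve. A minimal sketch, assuming a hypothetical plain-text file brown-sentences.txt with one whitespace-tokenized sentence per line; not the authors' actual script:

    library(zipfR)

    sentences <- readLines("brown-sentences.txt")   # one tokenized sentence per line
    shuffled  <- sample(sentences)                  # randomize the order of sentences

    tok.orig <- unlist(strsplit(sentences, " ", fixed = TRUE))
    tok.rand <- unlist(strsplit(shuffled,  " ", fixed = TRUE))

    vgc.orig <- vec2vgc(tok.orig, steps = 200)      # observed V(N)
    vgc.rand <- vec2vgc(tok.rand, steps = 200)      # V(N) after sentence randomization

    plot(vgc.orig, vgc.rand, legend = c("original order", "randomized by sentence"))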


Where does non-randomness come from?

• Not syntax (syntax has a short-span effect; the counts for 10k-token intervals are OK)

• Underdispersion of content-rich words

• The chance of two Noriegas is closer to p/2 than p² (Church 2000)

• diethylstilbestrol occurs 3 times in the Brown, all in the same document (recommendations on feed additives)

• Underdispersion will lead to serious underestimation of the rare type count

• An fZM estimated on the Brown predicts S = 115,539 types in English
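Underdispersion is easy to see by comparing each type's total frequency with the number of documents it occurs in: a word like diethylstilbestrol has frequency 3 but document frequency 1. A minimal sketch with toy data (hypothetical, not the authors' code):

    ## `docs` stands in for a corpus split into documents, each a vector of tokens.
    docs <- list(
      c("the", "feed", "additive", "diethylstilbestrol", "was", "tested",
        "diethylstilbestrol", "residues", "and", "diethylstilbestrol", "levels"),
      c("the", "president", "said", "the", "vote", "was", "close"),
      c("the", "feed", "was", "stored")
    )

    types <- sort(unique(unlist(docs)))
    tf <- sapply(types, function(w) sum(unlist(docs) == w))                  # total frequency
    df <- sapply(types, function(w) sum(sapply(docs, function(d) w %in% d))) # document frequency

    ## Underdispersed candidates: repeated overall, but confined to a single document
    data.frame(type = types, f = tf, docs = df)[tf > 1 & df == 1, ]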



Underestimating types: extrapolating the Brown VGC with fZM

[Figure: expected vocabulary growth curve E[V(N)] for the fZM model, extrapolated far beyond the Brown's size]


Assessing extrapolation quality

• We have no way to assess the goodness of fit of an extrapolation from our corpus to a larger sample from the same population

• However, we can estimate models on a subset of the available data and extrapolate to the full corpus size (Evert and Baroni 2006)

• I.e., use the corpus as our population and sample from it (see the sketch below)
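In zipfR terms the procedure is: draw a random subsample of tokens, fit an LNRE model to its frequency spectrum, extrapolate the expected vocabulary growth to the full corpus size, and compare with the observed curve. A minimal sketch, assuming a hypothetical character vector brown.tokens holding the corpus's full token sequence:

    library(zipfR)

    N.full    <- length(brown.tokens)              # brown.tokens: hypothetical token vector
    subsample <- sample(brown.tokens, 250000)      # random 250k-token sample
    spc.sub   <- vec2spc(subsample)                # its observed frequency spectrum

    fzm     <- lnre("fzm", spc.sub)                # estimate an fZM model on the sample
    vgc.ext <- lnre.vgc(fzm, N = floor(seq(0, N.full, length.out = 100)))  # extrapolated E[V(N)]
    vgc.obs <- vec2vgc(brown.tokens, steps = 100)  # observed V(N) on the full corpus

    plot(vgc.obs, vgc.ext, legend = c("observed", "fZM extrapolation"))
    EV(fzm, N.full)                                # expected V at full corpus size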


Extrapolation from a random sample of 250k Brown tokens

[Figure: interpolated V(N) and ZM, fZM, GIGP extrapolations E[V(N)] up to full Brown size]


Goodness of fit to spectrum elements (based on a multivariate chi-squared statistic)

               estimation size          max. extrapolation size
    model      X2       df   p          X2       df   p
    ZM         7,856    14   ≪ 0.001    35,346   16   ≪ 0.001
    fZM        539      13   ≪ 0.001    4,525    16   ≪ 0.001
    GIGP       597      13   ≪ 0.001    3,449    16   ≪ 0.001

Compare to the V fit:

[Figure: interpolated V(N) and ZM, fZM, GIGP extrapolations E[V(N)]]
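In zipfR, a chi-squared comparison of expected and observed spectrum elements is reported when the fitted model is summarized, and the expected spectrum can also be computed explicitly. A short sketch, continuing the hypothetical objects from the extrapolation sketch above:

    summary(fzm)                                    # reports the chi-squared goodness-of-fit test
    E.spc <- lnre.spc(fzm, N(spc.sub), m.max = 15)  # expected spectrum at the estimation size
    E.spc                                           # compare element by element with spc.sub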


Extrapolation from the first 250k tokens in the corpus

[Figure: observed V(N) and ZM, fZM, GIGP extrapolations E[V(N)] up to full Brown size]


Goodness of fit to spectrum elements (based on a multivariate chi-squared statistic)

               estimation size          max. extrapolation size
    model      X2       df   p          X2        df   p
    ZM         8,066    14   ≪ 0.001    336,766   16   ≪ 0.001
    fZM        1,011    13   ≪ 0.001    17,559    16   ≪ 0.001
    GIGP       587      13   ≪ 0.001    7,815     16   ≪ 0.001

Compare to the V fit:

[Figure: observed V(N) and ZM, fZM, GIGP extrapolations E[V(N)]]


The corpus as a (non-)random sample

• In our experiment, we had access to the full population (the Brown) and could take a random sample from it

• In real life, the full corpus is our sample from the population (e.g., “English”, an author’s mental lexicon, all words generated by a word-formation process)

• If it is not random, there is nothing we can do about it (randomizing the sample will not help!)



What can we do?

• Abandon lexical statistics

• Live with it

• Re-define the population

• Try to account for underdispersion when computing the models (this gets mathematically very complicated, but see Baayen 2001, ch. 5)



Not always that bad: Our Mutual Friend

[Figure: observed V(N) and ZM, fZM, GIGP extrapolations E[V(N)] for Our Mutual Friend]


Outline

Pre-Processing

Non-Randomness

The End


What we have done

• Motivation: studying the distribution and V growth rate of type-rich populations (the sample captures only a small proportion of the types in the population)

• LNRE modeling:
    • A population model with a limited number of parameters (e.g., ZM), expressed in terms of a type density function
    • Equations to calculate the expected V and frequency spectrum in random samples of arbitrary size from the population model

• Estimation of population parameters via fit of the expected elements to the observed frequency spectrum

• The zipfR package to apply LNRE modeling

• Problems
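As a recap, the workflow summarized above fits into a few lines of zipfR. A minimal sketch, again assuming the ItaRi.spc example spectrum; not the exact analyses reported in the talk:

    library(zipfR)

    data(ItaRi.spc)
    zm <- lnre("zm", ItaRi.spc)   # estimate a Zipf-Mandelbrot population model from the spectrum
    summary(zm)                   # estimated parameters and goodness of fit

    N2 <- 2 * N(ItaRi.spc)
    EV(zm, N2)                    # expected vocabulary size V in a sample twice as large
    EVm(zm, 1, N2)                # expected number of hapax legomena at that sample size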



What we (and perhaps some of you?) would like to do next

• Study (and deal with) non-randomness

• Better parameter estimation

• Improve zipfR (any feature requests?)

• Use LNRE modeling in applications, e.g.:
    • Good-Turing-style estimation
    • Productivity beyond morphology
    • Better features for machine learning
    • Mixture models



That’s All, Folks!

THE END