Albert Gatt Corpora and Statistical Methods Lecture 3

Corpora and Statistical Methods Lecture 3staff.um.edu.mt/albert.gatt/teaching/dl/statLecture3a.pdf · Distributions of words Out of 51,000 tokens in the Maltese corpus: 8016 tokens


Albert Gatt

Corpora and Statistical Methods –

Lecture 3

Zipf’s law and the Zipfian distribution

Part 1

Identifying words

Words

Levels of identification:

Graphical word (a token)

Dependent on surface properties of text

Underlying word (stem, root…)

Dependent on some form of morphological analysis

Practical definition: A word…

is an indivisible (!) sequence of characters

carries elementary meaning

is reusable in different contexts

Indivisibility

Words can have compositional meaning from parts that are

either words themselves, or prefixes and suffixes

colour + -less = colourless (derivation)

data + base = database (compounding)

The notion of “atomicity” or “indivisibility” is a matter of

degree.

Problems with indivisibility

Definite article in Maltese:

il-kelb

DEF-dog, “the dog”

phonologically dependent on word

German compounding: Lebensversicherungsgesellschaftsangestellter

“life insurance company employee”

Arabic conjunctions: waliy

One possible gloss: “and I follow” (w- is “and”)

Reusability

Words become part of the lexicon of a language, and can be

reused.

But some words can be formed on the fly using productive

morphological processes.

Many words are used very rarely

A large majority of the lexicon is inaccessible to native speakers

Approximately 50% of the words in a novel will be used only

once within that novel (hapax legomena)

The graphic definition

Many corpora, starting with Brown, use a definition of a

graphic word:

sequence of letters/numbers

possibly some other symbols

separated by whitespace or punctuation

But even here, there are exceptions.

Not much use for tokenisation of languages like Arabic.

Non-alphanumeric characters

Numbers such as 22.5:

in word frequency counts, typically mapped to a single type ##

Other characters:

Abbreviations: U.S.A.

Apostrophes: O’Hara vs. John’s

Whitespace: New Delhi

A problem for tokenisation

Hyphenated compounds: so-called, A-1-plus vs. aluminum-export industry

How many words do we have here?

Tokenisation

Task of breaking up running text into component words.

Crucial for most NLP tasks, as parameters typically estimated based on words.

Can be statistical or rule-based.

Often, simple regular expressions will go a long way.

Some practical problems:

Whitespace: very useful in Indo-European languages. In others (e.g. East Asian languages, ancient Greek) no space is used.

Non-alphanumeric symbols: need to decide if these are part of a word or not.
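
To make the point about regular expressions concrete, here is a minimal rule-based tokeniser sketch. The pattern and the function name tokenise are assumptions made for this illustration, not something given in the lecture; the pattern only handles abbreviations, decimal numbers, and hyphen- or apostrophe-joined forms.

```python
import re

# Hypothetical pattern (not from the lecture): alternatives are tried in order,
# so abbreviations and decimal numbers are matched before plain words.
TOKEN_PATTERN = re.compile(
    r"""
    (?:[A-Za-z]\.){2,}      # abbreviations such as U.S.A.
    | \d+(?:\.\d+)?         # numbers such as 22.5
    | \w+(?:[-'’]\w+)*      # words, including so-called, O’Hara, John’s
    """,
    re.VERBOSE,
)

def tokenise(text):
    """Return the list of tokens matched by TOKEN_PATTERN, in text order."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenise("The so-called U.S.A. paid 22.5 to O’Hara in New Delhi.")
```

Note that New Delhi still comes out as two tokens: a purely graphic rule cannot tell that the whitespace here sits inside one name.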

Types and tokens

Running example

Throughout this lecture, data is taken from a corpus of

Maltese texts:

ca. 51,000 words

all from Maltese-language newspapers

various topics and article types

Compared to data from English corpora taken from Baroni

2007

Definitions (I)

token = any word in the corpus (also counting words that occur more than once)

type = all the individual, different words in the corpus (grouping words together as representatives of a single type)

Example: I spoke to the chap who spoke to the child

10 tokens

7 types (I, spoke, to, the, chap, who, child)
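
The counts for the example sentence can be checked with a short sketch. It assumes plain whitespace tokenisation, which is enough for this sentence; the variable names are chosen for this illustration only.

```python
from collections import Counter

sentence = "I spoke to the chap who spoke to the child"
tokens = sentence.split()      # whitespace tokenisation is enough here
counts = Counter(tokens)       # maps each type to its number of tokens

n_tokens = len(tokens)         # every occurrence counts
n_types = len(counts)          # each distinct word counts once
```

Here n_tokens is 10 and n_types is 7, matching the figures above.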

Definitions (II)

The number of tokens in the corpus is an estimate of overall corpus size

Maltese corpus: 51,000 tokens

The number of types is an estimate of vocabulary size

gives an idea of the lexical richness of the corpus

Maltese corpus: 8193 types

Relative measures of frequency

Relative frequency of a type:

no. occurrences of the type / corpus size

i.e. the share of all tokens that belong to that type

In very large corpora, this is typically multiplied by a

constant

e.g. multiplying by 1 million gives frequency per million
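
As a rough illustration of this ratio and of the scaling constant, the sketch below uses the count for aħħar (97) from the Maltese frequency list in this lecture and the approximate corpus size of 51,000 tokens. The function name and its per parameter are made up for this example.

```python
def relative_frequency(count, corpus_size, per=1):
    """Occurrences of a type divided by corpus size, optionally scaled."""
    return count / corpus_size * per

rel = relative_frequency(97, 51_000)                         # plain proportion
per_million = relative_frequency(97, 51_000, per=1_000_000)  # frequency per million
```

The per-million figure comes out at roughly 1,900, which is easier to compare across corpora of different sizes than the raw count of 97.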

Type/token ratio

TTR = no. of types / no. of tokens

Ratio varies enormously depending on corpus size!

If the corpus is 1000 words, it’s easy to see a TTR of, say,

40%.

With 4 million words, it’s more likely to be in the region of

2%.

Reasons:

vocab size grows with corpus size but

large corpora will contain a lot of types that occur many times
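
A small sketch of the size effect, taking the type/token ratio as number of types divided by number of tokens (the usual definition, assumed here). The Maltese figures (8193 types, ca. 51,000 tokens) and the ten-token example sentence come from this lecture; the function name is hypothetical.

```python
def type_token_ratio(n_types, n_tokens):
    """Number of distinct types divided by number of tokens."""
    return n_types / n_tokens

maltese_ttr = type_token_ratio(8193, 51_000)   # roughly 0.16
sentence_ttr = type_token_ratio(7, 10)         # the 10-token example sentence
```

The ten-token sentence has a ratio of 0.7, while the 51,000-token corpus is already down to about 0.16, so the two values say more about sample size than about lexical richness.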

Frequency lists (BNC)

type frequency

the 6054231

in 1931797

time 149487

year 73167

man 57699

monarch 744

cumin 51

prestidigitation 3

A simple list, pairing each word with its frequency

Frequency lists (MT)

type frequency

aħħar (“last”) 97

jkun (“be.IMPERF.3SG”) 96

ukoll (“also”) 93

bħala (“as”) 91

dak (“that.SGM”) 86

tat- (“of.DEF”) 86

Frequency ranks

Word counts can get very big.

most frequent word in the Maltese corpus occurs 2195 times (and the corpus is small)

Raw frequency lists can be hard to process.

Useful to represent words in terms of rank:

count the words

sort by frequency (most frequent first)

assign a rank to the words:

rank 1 = most frequent

rank 2 = next most frequent

…
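
The count, sort and rank steps can be sketched as follows. This is one possible implementation, not the procedure used to build the tables in the lecture; the function name is an assumption, and ties are broken by first occurrence in the text, which is one arbitrary choice among many.

```python
from collections import Counter

def rank_frequency_profile(tokens):
    """Return (rank, type, frequency) triples, most frequent type first."""
    counts = Counter(tokens)
    # most_common() sorts by descending frequency; types with equal counts
    # keep first-seen order, so tied types still get distinct ranks.
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

profile = rank_frequency_profile(
    "I spoke to the chap who spoke to the child".split()
)
```

For this toy sentence, spoke, to and the share frequency 2 and receive ranks 1 to 3 in order of first appearance.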

Rank/frequency profile (BNC)

rank 1 goes to the most frequent type

all ranks are unique

ties in frequency are given arbitrary rank

rank (r) freq (f)

1 6054231

2 1931797

3 149487

Note the large differences in frequency from one rank to another

Rank-frequency profile (MT)

Rank (r) Frequency (f)

1 2195

2 2080

3 1277

4 1264

Differences in frequency from one rank to another are smaller than in BNC.

Frequency spectrum (MT)

A representation that shows,

for each frequency value, the

number of different types

that occur with that

frequency.

frequency types

1 4382

2 1253

3 661

4 356
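
A frequency spectrum is a count of counts. The sketch below shows one way to compute it, using the ten-token example sentence from earlier in the lecture rather than the Maltese corpus; the function name is chosen for this illustration.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map each frequency value to the number of types with that frequency."""
    type_counts = Counter(tokens)              # type -> frequency
    spectrum = Counter(type_counts.values())   # frequency -> number of types
    return dict(sorted(spectrum.items()))

spectrum = frequency_spectrum(
    "I spoke to the chap who spoke to the child".split()
)
```

For the example sentence the result is {1: 4, 2: 3}: four types occur once and three types occur twice.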

Word distributions (few giants, many midgets)

Non-linguistic case study

Suppose we are interested in measuring people’s height.

population = adult, male/female, European

sample: N people from the relevant population

measure height of each person in the sample

Results:

person 1: 1.6 m

person 2: 1.5 m

Measures of central tendency

Given the height of individuals in our sample, we can

calculate some summary statistics:

mean (“average”): sum of all heights in sample, divided by N

mode: most frequent value

What are your expectations?

will most people be extremely tall?

extremely short?

more or less average?

Plotting height/frequency

Observations:

1. Extreme values are less frequent.

2. Most people fall close to the mean

3. Mode is approximately same as mean

4. Bell-shaped curve (“normal” distribution)

Distributions of words Out of 51,000 tokens in the Maltese corpus:

8016 tokens belong to just the 5 most frequent types (the types at ranks 1–5)

ca. 15% of our corpus size is made up of only 5 different words!

Out of 8193 types:

4382 are hapax legomena, occurring only once (bottom ranks)

1253 occur only twice

In this data, the mean won’t tell us very much.

it hides huge variations!
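
The arithmetic behind these observations, as a quick sketch. All figures are the ones quoted in the lecture; the corpus size is given only approximately (ca. 51,000), so the results are approximate too.

```python
n_tokens = 51_000     # approximate corpus size given in the lecture
n_types = 8193        # vocabulary size
top5_tokens = 8016    # tokens belonging to the 5 most frequent types
hapaxes = 4382        # types occurring exactly once
twice = 1253          # types occurring exactly twice

top5_share = top5_tokens / n_tokens         # share of the corpus covered by 5 types
hapax_share = hapaxes / n_types             # share of the vocabulary seen once
rare_share = (hapaxes + twice) / n_types    # seen once or twice
mean_frequency = n_tokens / n_types         # average tokens per type
```

The five top types cover a little under 16% of the corpus, over half of the vocabulary consists of hapax legomena, and the mean of about 6 tokens per type describes neither group well.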

Ranks and frequencies (MT)

1. 2195

2. 2080

3. 1277

2298. 1

2299. 1

Among top ranks, frequency drops very dramatically (but depends on corpus size)

Among bottom ranks, frequency drops very gradually

General observations

There are always a few very high-frequency words, and many

low-frequency words.

Among the top ranks, frequency differences are big.

Among bottom ranks, frequency differences are very small.

So what are the high-frequency words?

Top 5 ranked words in the Maltese data:

li (“that”), l- (DEF), il- (DEF), u (“and”), ta’ (“of”), tal- (“of the”)

Bottom ranked words:

żona (“zone”) f = 1

yankee f = 1

żwieten (“Zejtun residents”) f = 1

xortih (“luck.3SGM”) f = 1

widnejhom (“ear.POSS.3PL”) f = 1

Frequency distributions in corpora

The top few frequency ranks are taken up by function words.

In the Brown corpus, the 10 top-ranked words make up 23% of total

corpus size (Baroni, 2007)

Bottom-ranked words display lots of ties in frequency.

Lots of words occurring only once (hapax legomena)

In Brown, ca. ½ of vocabulary size is made up of words that occur

only once.

Implications

The mean or average frequency hides huge deviations.

In Brown, average frequency of a type is 19 tokens. But:

the mean is inflated by a few very frequent types

most words will have frequency well below the mean

Mean will therefore be higher than median (the middle value)

not a very meaningful indicator of central tendency

Mode (most frequent frequency value) is usually 1.

This is typical of most large corpora.

Same happens if we look at n-grams rather than words.
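
A toy illustration of how the three measures of central tendency come apart on skewed data. The frequency list below is invented for this sketch (a few very frequent types, many rare ones) and is not taken from Brown or from the Maltese corpus.

```python
import statistics

# Hypothetical per-type frequencies: a few "giants" and many rare types.
frequencies = [60, 30, 20, 5, 2, 1, 1, 1, 1, 1]

mean = statistics.mean(frequencies)       # pulled up by the top three types
median = statistics.median(frequencies)   # middle value of the sorted list
mode = statistics.mode(frequencies)       # most frequent frequency value
```

Here the mean is 12.2, the median 1.5 and the mode 1: seven of the ten types fall below the mean.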

Typical shape of a rank/frequency curve

Actual example (MT)

[Figure: rank/frequency plot for the MT corpus; x-axis: rank (0 to 9000), y-axis: frequency (0 to 2500)]

A few high frequency, low-rank words

Hundreds of low-frequency, high-rank words

Zipf’s law

Observation: Frequency decreases non-linearly with rank.

$$f(w) = \frac{C}{r(w)^{a}}$$

f(w) is the frequency of word w and r(w) is its rank

C: a constant, determined from data, roughly the frequency of the most frequent word

a: a constant, determined from data

Suppose a = 1, and C = 60,000.

The model predicts:

2nd most frequent word will be C/2 = 30,000

3rd most frequent: C/3 = 20,000

20th most frequent: C/20 = 3000

So frequency decreases very rapidly as rank increases (as an inverse power of rank, not exponentially).
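
The worked figures can be reproduced with a one-line function for the model f(w) = C / r(w)^a. The function name and default arguments are assumptions for this sketch; C = 60,000 and a = 1 are the illustrative values used on this slide, not fitted constants.

```python
def zipf_frequency(rank, C=60_000, a=1.0):
    """Frequency predicted by Zipf's law for a word at the given rank."""
    return C / rank ** a

predicted = {rank: zipf_frequency(rank) for rank in (1, 2, 3, 20)}
```

With these values the predictions for ranks 1, 2, 3 and 20 are 60,000, 30,000, 20,000 and 3,000.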

Things to note

The law doesn’t predict frequency ties:

there are no ties among ranks

The law is a power law:

frequency is a function of a negative power of rank

Taking the log of both sides gives us a linear function:

$$\log f(w) = \log C - a \log r(w)$$

Basically a straight line plot.
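
Because log f(w) = log C − a log r(w) is a straight line in log rank, the constants C and a can be estimated with ordinary least squares on the logged data. The sketch below is one simple way to do this, not necessarily how the plots in this lecture were produced; the function name is hypothetical, and the input is assumed to be a list of positive frequencies already sorted from rank 1 downwards. It is checked on synthetic, perfectly Zipfian data.

```python
import math

def fit_zipf(frequencies):
    """Least-squares fit of log f = log C - a * log r; returns (C, a)."""
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope   # the slope of the line is -a

# Synthetic data generated from the law itself with C = 60,000 and a = 1.
synthetic = [60_000 / rank for rank in range(1, 101)]
C_hat, a_hat = fit_zipf(synthetic)
```

On this synthetic data the fit recovers C = 60,000 and a = 1 up to rounding error; real corpus data will not lie exactly on the line.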

Log-log plot for MT data (a=1)

Deviation from prediction for high frequencies

Deviation from prediction for low frequencies

Log-log plot for data from Baroni 2007

Some observations

Empirical work has shown that the law doesn’t perfectly

predict frequencies:

at the bottom ranks (low frequencies), actual frequency drops

more rapidly than predicted

at the top ranks (high frequencies), the model predicts higher

frequencies than actually attested

Mandelbrot’s law

Mandelbrot proposed a version of Zipf’s law as follows:

$$f(w) = \frac{C}{(r(w) + b)^{a}}$$

(Note: Zipf’s original law is Mandelbrot’s law with b = 0)

If b is a small value, it will make frequency of items ranked at the top (rank 1, 2, etc) significantly smaller, but won’t affect the lower ranks.

Comparison

Let C = 60,000, a = 1 and b = 1

Then, for a word of rank 1:

Zipf’s law predicts f(w) = 60,000/1 = 60,000

Mandelbrot’s law predicts f(w) = 60,000/(1+1) = 30,000

For a word of rank 1000:

Zipf predicts: f(w) = 60,000/1000 = 60

Mandelbrot: f(w) = 60,000/1001 = 59.94

So differences are bigger at the top than at the bottom.
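
The same comparison as a sketch in code, using the model f(w) = C / (r(w) + b)^a with the illustrative values C = 60,000, a = 1 and b = 1 from this slide. The function name is an assumption; setting b = 0 gives Zipf's law.

```python
def mandelbrot_frequency(rank, C=60_000, a=1.0, b=1.0):
    """Frequency predicted by Mandelbrot's law; b = 0 reduces to Zipf's law."""
    return C / (rank + b) ** a

# Absolute gap between the two predictions at the top and further down.
gap_rank_1 = mandelbrot_frequency(1, b=0) - mandelbrot_frequency(1)
gap_rank_1000 = mandelbrot_frequency(1000, b=0) - mandelbrot_frequency(1000)
```

The gap is 30,000 at rank 1 but only about 0.06 at rank 1000.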

Log version of Mandelbrot

$$\log f(w) = \log C - a \log(r(w) + b)$$

Note: as a function of log rank, this is no longer a straight line, so it should fit our data better.

Consequences of the law

Data sparseness: no matter how big your corpus, most of the

words in it will be of very low frequency.

You can’t exhaust the vocabulary of a language: new words

will crop up as corpus size increases.

implication: you can’t compare vocabulary richness of corpora

of different sizes

Explanation for Zipfian distributions

Zipf’s own explanation (“least effort” principle):

Speaker’s goal is to minimise effort by using a few distinct

words as frequently as possible

Hearer’s goal is to maximise clarity by having as large a

vocabulary as possible

Other Zipfian distributions

Zipf’s law crops up in other domains (e.g. distribution of

incomes)

Even randomly generated character strings show the same

pattern!

short strings will be few, but likely to crop up by chance

more long strings, but each one less likely individually
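
The random-string observation can be tried out with a small simulation. This is a sketch under stated assumptions: a four-letter alphabet plus a space character, each symbol equally likely, with the space acting as word boundary. The function name, alphabet, text length and seed are all arbitrary choices for this illustration.

```python
import random
from collections import Counter

def random_typing(n_chars, alphabet="abcd", seed=0):
    """Type n_chars symbols uniformly at random from alphabet plus space."""
    rng = random.Random(seed)
    symbols = alphabet + " "
    text = "".join(rng.choice(symbols) for _ in range(n_chars))
    return Counter(text.split())   # split() drops the empty strings

counts = random_typing(200_000)
top4 = [word for word, _ in counts.most_common(4)]
types_by_length = Counter(len(word) for word in counts)
```

The top ranks go to the four one-letter strings, while longer strings supply many more distinct types, each of them individually rare.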