Albert Gatt
Corpora and Statistical Methods –
Lecture 3
Zipf’s law and the Zipfian distribution
Part 1
Identifying words
Words
Levels of identification:
Graphical word (a token)
Dependent on surface properties of text
Underlying word (stem, root…)
Dependent on some form of morphological analysis
Practical definition: A word…
is an indivisible (!) sequence of characters
carries elementary meaning
is reusable in different contexts
Indivisibility
Words can have compositional meaning from parts that are
either words themselves, or prefixes and suffixes
colour + -less = colourless (derivation)
data + base = database (compounding)
The notion of “atomicity” or “indivisibility” is a matter of
degree.
Problems with indivisibility
Definite article in Maltese: il-kelb
DEF-dog, “the dog”
phonologically dependent on the word it attaches to
German compounding: Lebensversicherungsgesellschaftsangestellter
“life insurance company employee”
Arabic conjunctions: waliy; one possible gloss: “and I follow” (w- is “and”)
Reusability
Words become part of the lexicon of a language, and can be
reused.
But some words can be formed on the fly using productive
morphological processes.
Many words are used very rarely
A large majority of the lexicon is inaccessible to native speakers
Approximately 50% of the words in a novel will be used only
once within that novel (hapax legomena)
The graphic definition
Many corpora, starting with Brown, use a definition of a
graphic word:
sequence of letters/numbers
possibly some other symbols
separated by whitespace or punctuation
But even here, there are exceptions.
Not much use for tokenisation of languages like Arabic.
Non-alphanumeric characters
Numbers such as 22.5: in word frequency counts, typically mapped to a single type (e.g. ##)
Other characters:
Abbreviations: U.S.A.
Apostrophes: O’Hara vs. John’s
Whitespace: New Delhi
A problem for tokenisation
Hyphenated compounds: so-called, A-1-plus vs. aluminum-export industry
How many words do we have here?
Tokenisation
The task of breaking up running text into component words.
Crucial for most NLP tasks, as parameters are typically estimated based on words.
Can be statistical or rule-based. Often, simple regular expressions will go a long way (see the sketch after this slide).
Some practical problems:
Whitespace: very useful in Indo-European languages. In others (e.g. East Asian languages, ancient Greek) no space is used.
Non-alphanumeric symbols: need to decide if these are part of a word or not.
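As a rough illustration of how far a single regular expression can take you (a minimal sketch; the pattern and the example sentence are assumptions, not the tokeniser behind the corpora discussed here):

```python
import re

def tokenise(text):
    # Keep alphanumeric runs, allowing internal apostrophes and hyphens
    # (so "so-called" and "O'Hara" survive as single tokens).
    return re.findall(r"[A-Za-z0-9]+(?:['\-][A-Za-z0-9]+)*", text)

print(tokenise("The so-called A-1-plus rating of the U.S.A. fell 22.5%"))
# ['The', 'so-called', 'A-1-plus', 'rating', 'of', 'the', 'U', 'S', 'A', 'fell', '22', '5']
# Note how the abbreviation and the decimal number get split: exactly the
# problem cases listed on the previous slides.
```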
Types and tokens
Running example
Throughout this lecture, data is taken from a corpus of
Maltese texts:
ca. 51,000 words
all from Maltese-language newspapers
various topics and article types
Compared to data from English corpora taken from Baroni
2007
Definitions (I)
token = any word in the corpus (also counting words that occur more than once)
type = all the individual, different words in the corpus (grouping repeated words together as representatives of a single type)
Example: I spoke to the chap who spoke to the child
10 tokens
7 types (I, spoke, to, the, chap, who, child)
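A minimal sketch of this count in code (the whitespace split is enough for this toy sentence):

```python
sentence = "I spoke to the chap who spoke to the child"
tokens = sentence.split()   # every running word, repeats included
types = set(tokens)         # each distinct word counted once

print(len(tokens))   # 10
print(len(types))    # 7
```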
Definitions (II)
The number of tokens in the corpus is an estimate of overall
corpus size
Maltese corpus: 51,000 tokens
The number of types is an estimate of vocabulary size
gives an idea of the lexical richness of the corpus
Maltese corpus: 8193 types
Relative measures of frequency
Relative frequency of a type:
no. of occurrences of the type / corpus size
In very large corpora, this is typically multiplied by a
constant
e.g. multiplying by 1 million gives frequency per million
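For example, a type occurring 744 times in a 100-million-token corpus (illustrative figures) comes out as follows:

```python
# Relative frequency and frequency per million (illustrative figures).
freq = 744                  # occurrences of the type
corpus_size = 100_000_000   # corpus size in tokens

relative_freq = freq / corpus_size
per_million = relative_freq * 1_000_000

print(relative_freq)   # 7.44e-06
print(per_million)     # 7.44 occurrences per million tokens
```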
Type/token ratio
TTR = no. of types / no. of tokens
The ratio varies enormously depending on corpus size!
If the corpus is 1000 words, it’s easy to see a TTR of, say,
40%.
With 4 million words, it’s more likely to be in the region of
2%.
Reasons:
vocabulary size grows with corpus size, but much more slowly
large corpora contain a lot of types that occur many times, so tokens accumulate faster than types
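A quick sketch of the ratio, using the Maltese corpus figures quoted earlier:

```python
# Type/token ratio for the Maltese corpus figures given above.
n_types, n_tokens = 8193, 51_000

ttr = n_types / n_tokens
print(f"{ttr:.3f} ({ttr:.1%})")   # 0.161 (16.1%)
```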
Frequency lists (BNC)
type frequency
the 6054231
in 1931797
time 149487
year 73167
man 57699
…
monarch 744
cumin 51
prestidigitation 3
A simple list, pairing each word with its frequency
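Such a list is easy to build with a counter over the token stream (a sketch; the token list is a placeholder for a real tokenised corpus):

```python
from collections import Counter

tokens = ["the", "cat", "sat", "on", "the", "mat"]   # placeholder token stream
freq_list = Counter(tokens)

for word, freq in freq_list.most_common():
    print(word, freq)
# the 2
# cat 1
# sat 1
# on 1
# mat 1
```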
Frequency lists (MT)
type frequency
aħħar (“last”) 97
jkun (“be.IMPERF.3SG”) 96
ukoll (“also”) 93
bħala (“as”) 91
dak (“that.SGM”) 86
tat- (“of.DEF”) 86
Frequency ranks
Word counts can get very big: the most frequent word in the Maltese corpus occurs 2195 times (and the corpus is small).
Raw frequency lists can be hard to process.
Useful to represent words in terms of rank (see the sketch after this list):
count the words
sort by frequency (most frequent first)
assign a rank to the words: rank 1 = most frequent, rank 2 = next most frequent, …
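The sketch below follows exactly these steps on a placeholder token list (ties would receive arbitrary but distinct ranks, as noted on the next slide):

```python
from collections import Counter

tokens = ["a", "b", "a", "c", "a", "b"]   # placeholder token stream
freqs = Counter(tokens)

# Sort types by frequency, most frequent first, and number them from 1.
ranked = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
for rank, (word, freq) in enumerate(ranked, start=1):
    print(rank, word, freq)
# 1 a 3
# 2 b 2
# 3 c 1
```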
Rank/frequency profile (BNC)
rank 1 goes to the most frequent type
all ranks are unique
ties in frequency are given arbitrary rank
rank (r) freq (f)
1 6054231
2 1931797
3 149487
…
Note the large differences in frequency from one rank to another
Rank-frequency profile (MT)
Rank (r) Frequency (f)
1 2195
2 2080
3 1277
4 1264
Differences in frequency from one rank to another are smaller than in BNC.
Frequency spectrum (MT)
A representation that shows,
for each frequency value, the
number of different types
that occur with that
frequency.
frequency types
1 4382
2 1253
3 661
4 356
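Computationally, the spectrum is just a count over the counts (a sketch, on placeholder data):

```python
from collections import Counter

freqs = Counter(["a", "b", "a", "c", "a", "b", "d"])   # placeholder corpus
spectrum = Counter(freqs.values())   # frequency value -> number of types

for f in sorted(spectrum):
    print(f, spectrum[f])
# 1 2   (c and d occur once)
# 2 1   (b occurs twice)
# 3 1   (a occurs three times)
```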
Word distributions (few giants, many midgets)
Non-linguistic case study
Suppose we are interested in measuring people’s height.
population = adult, male/female, European
sample: N people from the relevant population
measure height of each person in the sample
Results:
person 1: 1.6 m
person 2: 1.5 m
…
Measures of central tendency
Given the height of individuals in our sample, we can
calculate some summary statistics:
mean (“average”): sum of all heights in sample, divided by N
mode: most frequent value
What are your expectations?
will most people be extremely tall?
extremely short?
more or less average?
Plotting height/frequency
Observations:
1. Extreme values are less frequent.
2. Most people fall close to the mean.
3. The mode is approximately the same as the mean.
4. Bell-shaped curve (the “normal” distribution).
Distributions of words
Out of 51,000 tokens in the Maltese corpus:
8016 tokens belong to just the 5 most frequent types (the types at ranks 1 to 5)
ca. 15% of our corpus size is made up of only 5 different words!
Out of 8193 types:
4382 are hapax legomena, occurring only once (bottom ranks)
1253 occur only twice
…
In this data, the mean won’t tell us very much:
it hides huge variations!
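The coverage figure can be checked with a line of arithmetic (using the counts quoted above):

```python
# Share of the Maltese corpus covered by its 5 most frequent types.
top5_tokens = 8016
corpus_size = 51_000

print(f"{top5_tokens / corpus_size:.1%}")   # 15.7%
```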
Ranks and frequencies (MT)
1. 2195
2. 2080
3. 1277
…
2298. 1
2299. 1
…
Among the top ranks, frequency drops very dramatically (but depends on corpus size)
Among the bottom ranks, frequency drops very gradually
General observations
There are always a few very high-frequency words, and many
low-frequency words.
Among the top ranks, frequency differences are big.
Among bottom ranks, frequency differences are very small.
So what are the high-frequency words?
Top-ranked words in the Maltese data:
li (“that”), l- (DEF), il- (DEF), u (“and”), ta’ (“of”), tal- (“of the”)
Bottom ranked words:
żona (“zone”) f = 1
yankee f = 1
żwieten (“Zejtun residents”) f = 1
xortih (“luck.3SGM”) f = 1
widnejhom (“ear.POSS.3PL”) f = 1
Frequency distributions in corpora
The top few frequency ranks are taken up by function words.
In the Brown corpus, the 10 top-ranked words make up 23% of total
corpus size (Baroni, 2007)
Bottom-ranked words display lots of ties in frequency.
Lots of words occurring only once (hapax legomena)
In Brown, ca. ½ of vocabulary size is made up of words that occur
only once.
Implications
The mean or average frequency hides huge deviations.
In Brown, the average frequency of a type is 19 tokens. But the mean is inflated by a few very frequent types:
most words will have a frequency well below the mean
The mean will therefore be higher than the median (the middle value): not a very meaningful indicator of central tendency.
The mode (the most frequent frequency value) is usually 1 (see the sketch below).
This is typical of most large corpora. The same happens if we look at n-grams rather than words.
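A sketch of how the three statistics come apart on skewed data (the frequency values are toy numbers chosen to mimic the skew, not real counts):

```python
from statistics import mean, median, mode

# Toy type frequencies with a Zipf-like skew (illustrative only).
freqs = [2195, 2080, 1277, 50, 10, 3, 2, 1, 1, 1, 1, 1]

print(mean(freqs))    # 468.5 -> pulled up by a few very frequent types
print(median(freqs))  # 2.5   -> half the types are at or below this
print(mode(freqs))    # 1     -> the most common frequency value is 1
```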
Typical shape of a rank/frequency curve
Actual example (MT)
[Rank/frequency plot for the Maltese corpus: frequency (0 to 2500) on the y-axis against rank (0 to 9000) on the x-axis]
A few high frequency, low-rank words
Hundreds of low-frequency, high-rank words
Zipf’s law
Observation: frequency decreases non-linearly with rank.
The law: f(w) = C / r(w)^a
where r(w) is the rank of word w, C is a constant determined from the data (roughly the frequency of the most frequent word), and a is a constant determined from the data.
Suppose a = 1 and C = 60,000.
The model predicts:
2nd most frequent word: C/2 = 30,000
3rd most frequent: C/3 = 20,000
20th most frequent: C/20 = 3,000
So frequency decreases very rapidly (as a power law) as rank increases.
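The predictions above can be reproduced directly (a sketch using the same illustrative constants):

```python
# Zipf's law: f(w) = C / r(w)**a, with illustrative constants.
C, a = 60_000, 1

def zipf_freq(rank):
    return C / rank ** a

for r in (1, 2, 3, 20):
    print(r, zipf_freq(r))
# 1 60000.0
# 2 30000.0
# 3 20000.0
# 20 3000.0
```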
Things to note
The law doesn’t predict frequency ties: it assigns a distinct frequency to every rank.
The law is a power law: frequency is a function of a negative power of rank.
Taking the log of both sides gives a linear function:
log f(w) = log C - a log r(w)
i.e. a straight line when plotted on log-log axes.
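This is also how the constants are usually estimated in practice: fit a straight line to log frequency against log rank. A sketch (numpy assumed available; the rank/frequency values are synthetic, generated from the law itself):

```python
import numpy as np

# Synthetic data following f(w) = C / r(w) exactly, with C = 60,000.
ranks = np.arange(1, 101)
freqs = 60_000 / ranks

# log f(w) = log C - a * log r(w): a straight line in log-log space.
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
print("a =", -slope)             # ~1.0
print("C =", np.exp(intercept))  # ~60000
```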
Log-log plot for MT data (a=1)
Deviation from prediction for high frequencies
Deviation from prediction for low frequencies
Log-log plot for data from Baroni 2007
Some observations
Empirical work has shown that the law doesn’t perfectly
predict frequencies:
at the bottom ranks (low frequencies), actual frequency drops
more rapidly than predicted
at the top ranks (high frequencies), the model predicts higher
frequencies than actually attested
Mandelbrot’s law
Mandelbrot proposed a generalised version of Zipf’s law:
f(w) = C / (r(w) + b)^a
(Note: Zipf’s original law is Mandelbrot’s law with b = 0.)
If b is a small positive value, it makes the predicted frequency of the top-ranked items (rank 1, 2, etc.) significantly smaller, but barely affects the lower ranks.
Comparison
Let C = 60,000, a = 1 and b = 1.
Then, for a word of rank 1:
Zipf’s law predicts f(w) = 60,000/1 = 60,000
Mandelbrot’s law predicts f(w) = 60,000/(1+1) = 30,000
For a word of rank 1000:
Zipf predicts: f(w) = 60,000/1000 = 60
Mandelbrot: f(w) = 60,000/1001 = 59.94
So the differences are bigger at the top than at the bottom (see the sketch below).
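The same comparison in code (a sketch, same illustrative constants):

```python
# Zipf vs. Mandelbrot predictions with C = 60,000, a = 1, b = 1.
C, a, b = 60_000, 1, 1

def zipf(r):
    return C / r ** a

def mandelbrot(r):
    return C / (r + b) ** a

for r in (1, 1000):
    print(r, zipf(r), round(mandelbrot(r), 2))
# 1 60000.0 30000.0
# 1000 60.0 59.94
```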
Linear version of Mandelbrot
log f(w) = log C - a log(r(w) + b)
Note: as a function of log r(w) this is no longer a straight line, so it should fit our data better.
Consequences of the law
Data sparseness: no matter how big your corpus, most of the
words in it will be of very low frequency.
You can’t exhaust the vocabulary of a language: new words
will crop up as corpus size increases.
implication: you can’t compare vocabulary richness of corpora
of different sizes
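A small simulation makes the point about vocabulary growth (a sketch; the skewed random “corpus” is an assumption standing in for real text):

```python
import random

random.seed(0)

# Placeholder corpus: 50,000 tokens drawn from 5,000 types with Zipf-like weights.
vocab = [f"w{i}" for i in range(5_000)]
weights = [1 / (i + 1) for i in range(5_000)]
corpus = random.choices(vocab, weights=weights, k=50_000)

for n in (1_000, 10_000, 50_000):
    # Vocabulary size keeps growing as we read more of the corpus.
    print(n, len(set(corpus[:n])))
```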
Explanation for Zipfian distributions
Zipf’s own explanation (“least effort” principle):
Speaker’s goal is to minimise effort by using a few distinct
words as frequently as possible
Hearer’s goal is to maximise clarity by having as large a
vocabulary as possible
Other Zipfian distributions
Zipf’s law crops up in other domains (e.g. distribution of
incomes)
Even randomly generated character strings show the same
pattern!
there are few possible short strings, but each is very likely to crop up by chance
there are many more possible long strings, but each individual one is less likely
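A quick simulation of the random-string observation (a sketch): generate characters uniformly at random, treat the space as a word boundary, and look at the resulting “word” frequencies.

```python
import random
from collections import Counter

random.seed(0)

# Random text over the letters a-e plus space, all equally likely.
text = "".join(random.choices("abcde ", k=200_000))
words = text.split()

freqs = Counter(words)
for rank, (word, freq) in enumerate(freqs.most_common(5), start=1):
    print(rank, word, freq)
# The top ranks are dominated by the short strings, with a long tail of
# rare long strings: the same few-giants-many-midgets shape as real text.
```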