Counting Words: Introduction

Marco Baroni & Stefan Evert

Malaga, 7 August 2006

Contents

- Roadmap
- Lexical statistics: the basics
- Zipf's law
  - Typical frequency patterns
  - Zipf's law
  - Consequences
- Applications
  - Productivity in morphology
  - Productivity beyond morphology
  - Lexical richness
- Conclusion and outlook


Roadmap

- Introduction and motivation
- LNRE modeling: soft
- LNRE modeling: hard
- Playtime!
- The bad news and outlook



Outline

- Roadmap
- Lexical statistics: the basics
- Zipf's law
- Applications


Lexical statistics
(Zipf 1949/1961, Baayen 2001, Evert 2005)

- The statistical study of the distribution of types (words and other units) in texts
- Different from other categorical data because of the extreme richness of types


Basic terminology

- N: sample/corpus size, the number of tokens in the sample
- V: vocabulary size, the number of distinct types in the sample
- Vm: type count of spectrum element m, the number of types in the sample with token frequency m
- V1: hapax legomena count, the number of types that occur only once in the sample (for hapaxes, the number of types equals the number of tokens)
- A sample: a b b c a a b a
- N = 8; V = 3; V1 = 1
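The definitions above can be sketched in Python (the function name is my own, for illustration):

```python
from collections import Counter

def basic_stats(tokens):
    """Compute N (token count), V (type count), and V1 (hapax count)."""
    freqs = Counter(tokens)
    N = sum(freqs.values())                          # sample size
    V = len(freqs)                                   # vocabulary size
    V1 = sum(1 for f in freqs.values() if f == 1)    # hapax legomena count
    return N, V, V1

print(basic_stats("a b b c a a b a".split()))  # (8, 3, 1)
```

The sample from the slide gives exactly the N, V, and V1 values stated above.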



Rank/frequency profile

- The sample: a b b c a a b a d
- Frequency list ordered by decreasing frequency:

      t  f
      a  4
      b  3
      c  1
      d  1

- Replace type labels with ranks to obtain the rank/frequency profile:

      r  f
      1  4
      2  3
      3  1
      4  1

- Allows frequency to be expressed as a function of the rank of a type
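Discarding the type labels and sorting the frequencies in decreasing order yields the profile directly (a sketch; the function name is illustrative):

```python
from collections import Counter

def rank_frequency_profile(tokens):
    """Sort type frequencies in decreasing order; position + 1 is the rank."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))

print(rank_frequency_profile("a b b c a a b a d".split()))
# [(1, 4), (2, 3), (3, 1), (4, 1)]
```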



Rank/frequency profile of Brown corpus (figure)


Frequency spectrum

- The sample: a b b c a a b a d
- Frequency classes: 1 (c, d), 3 (b), 4 (a)
- Frequency spectrum:

      m  Vm
      1  2
      3  1
      4  1
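The spectrum is a count of counts: first count token frequencies per type, then count how many types share each frequency. A minimal sketch (function name is my own):

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map each frequency m to V_m, the number of types occurring m times."""
    type_freqs = Counter(tokens)             # type -> token frequency
    spectrum = Counter(type_freqs.values())  # frequency m -> V_m
    return dict(sorted(spectrum.items()))

print(frequency_spectrum("a b b c a a b a d".split()))  # {1: 2, 3: 1, 4: 1}
```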



Rank/frequency profiles and frequency spectra

- From rank/frequency profile to spectrum: count the occurrences of each f in the profile to obtain the Vf values of the corresponding spectrum elements
- From spectrum to rank/frequency profile: given the highest f in the spectrum, ranks 1 to Vf in the corresponding rank/frequency profile have frequency f; ranks Vf + 1 to Vf + Vg (where g is the second highest frequency in the spectrum) have frequency g; and so on
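Both conversions can be sketched directly from the descriptions above (function names are illustrative):

```python
from collections import Counter

def profile_to_spectrum(profile):
    """Count occurrences of each frequency f in the profile to get V_f."""
    return dict(sorted(Counter(f for _, f in profile).items()))

def spectrum_to_profile(spectrum):
    """Assign ranks 1..V_f to the highest f, the next V_g ranks to g, etc."""
    profile, rank = [], 1
    for f in sorted(spectrum, reverse=True):
        for _ in range(spectrum[f]):
            profile.append((rank, f))
            rank += 1
    return profile

spec = profile_to_spectrum([(1, 4), (2, 3), (3, 1), (4, 1)])
print(spec)                       # {1: 2, 3: 1, 4: 1}
print(spectrum_to_profile(spec))  # [(1, 4), (2, 3), (3, 1), (4, 1)]
```

Note that the two representations carry the same information: converting back and forth recovers the original profile.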


Frequency spectrum of Brown corpus
(Figure: bar plot of Vm for m = 1 to 15)


Vocabulary growth curve

- The sample: a b b c a a b a
- N = 1: V = 1, V1 = 1
- N = 3: V = 2, V1 = 1
- N = 5: V = 3, V1 = 1
- N = 8: V = 3, V1 = 1
- (Most VGCs on our slides are smoothed with binomial interpolation)
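The growth curve can be traced by recomputing V and V1 over successively longer prefixes of the sample (a sketch; the function name is my own, and no binomial smoothing is applied here):

```python
from collections import Counter

def vocabulary_growth(tokens, checkpoints):
    """For each sample size N in checkpoints, report (N, V, V1)."""
    curve = []
    for N in checkpoints:
        freqs = Counter(tokens[:N])
        V = len(freqs)
        V1 = sum(1 for f in freqs.values() if f == 1)
        curve.append((N, V, V1))
    return curve

print(vocabulary_growth("a b b c a a b a".split(), [1, 3, 5, 8]))
# [(1, 1, 1), (3, 2, 1), (5, 3, 1), (8, 3, 1)]
```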



Vocabulary growth curve of Brown corpus
(Figure: V and V1 plotted against N up to 1e+06 tokens, with V1 growth in red)



Typical frequency patterns
Top and bottom ranks in the Brown corpus

    top frequencies              bottom frequencies
    rank  fq     word            rank range   fq  randomly selected examples
    1     62642  the             7967-8522    10  recordings undergone privileges
    2     35971  of              8523-9236     9  Leonard indulge creativity
    3     27831  and             9237-10042    8  unnatural Lolotte authenticity
    4     25608  to              10043-11185   7  diffraction Augusta postpone
    5     21883  a               11186-12510   6  uniformly throttle agglutinin
    6     19474  in              12511-14369   5  Bud Councilman immoral
    7     10292  that            14370-16938   4  verification gleamed groin
    8     10026  is              16939-21076   3  Princes nonspecifically Arger
    9      9887  was             21077-28701   2  blitz pertinence arson
    10     8811  for             28702-53076   1  Salaries Evensen parentheses


Typical frequency patterns: BNC (figure)


Typical frequency patterns: other corpora (figure)


Typical frequency patterns: Brown bigrams and trigrams (figure)


Typical frequency patterns: the Italian prefix ri- in the la Repubblica corpus (figure)


Zipf's law

- Language after language, corpus after corpus, linguistic type after linguistic type...
- the same "few giants, many dwarves" pattern is encountered
- The similarity of the plots suggests that the relation between rank and frequency could be captured by a law
- The nature of the relation becomes clearer if we plot log f as a function of log r



Zipf's law

- A straight line in double-logarithmic space corresponds to a power law in the original variables
- This leads to Zipf's (1949, 1965) famous law:

      f(w) = C / r(w)^a

- With a = 1 and C = 60,000, Zipf's law predicts that the most frequent word has frequency 60,000; the second most frequent word, 30,000; the third, 20,000...
- and a long tail of 80,000 words with frequencies between 1.5 and 0.5
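The numerical predictions above follow directly from the law; a minimal sketch with the slide's parameters (a = 1, C = 60,000):

```python
def zipf_frequency(rank, C=60_000, a=1.0):
    """Predicted frequency of the word at the given rank under Zipf's law."""
    return C / rank ** a

for r in (1, 2, 3):
    print(r, zipf_frequency(r))  # 60000.0, 30000.0, 20000.0
```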


Page 47: Counting Words: Introductionzipfr.r-forge.r-project.org/.../ESSLLI/01_introduction.slides.pdf · Introduction Baroni & Evert Roadmap Lexical statistics: the basics Zipf’s law Typical

Introduction

Baroni & Evert

Roadmap

Lexical statistics:the basics

Zipf’s law

Typical frequencypatterns

Zipf’s law

Consequences

Applications

Productivity inmorphology

Productivitybeyond morphology

Lexical richness

Conclusion andoutlook

Zipf's law
Logarithmic version

I Zipf's power law:

f(w) = C / r(w)^a

I If we take the logarithm of both sides, we obtain:

log f(w) = log C − a log r(w)

I I.e., Zipf's law predicts that rank/frequency profiles are straight lines in double-logarithmic space, which, we saw, is a reasonable approximation

I Best-fit a and C can be found with the least-squares method

I Provides intuitive interpretation of a and C:

I a is the slope, determining how fast log frequency decreases with log rank

I log C is the intercept, i.e., the predicted log frequency of the word with rank 1 (log rank 0), i.e., the most frequent word
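The least-squares fit in log-log space can be sketched in a few lines. This is a toy check on synthetic, exactly Zipfian data (real rank/frequency profiles only fit approximately), using ordinary least squares worked out by hand:

```python
import math

# Synthetic rank/frequency profile from an exact Zipfian curve
# (a = 1, C = 60,000); with real corpus data the fit is only approximate.
C_true, a_true = 60_000.0, 1.0
ranks = range(1, 1001)
freqs = [C_true / r ** a_true for r in ranks]

# Ordinary least squares on (log r, log f): log f = log C - a * log r
xs = [math.log(r) for r in ranks]
ys = [math.log(f) for f in freqs]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))

a_hat = -slope                      # fitted slope is -a
C_hat = math.exp(my - slope * mx)   # fitted intercept is log C
print(round(a_hat, 3), round(C_hat, 1))  # 1.0 60000.0
```

On exact data the estimates recover a and C up to floating-point error; on a real profile the residuals at both edges are the deviations discussed on the following slides.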

Zipf's law
Fitting the Brown rank/frequency profile

[Figure: Zipf's law fitted to the Brown corpus rank/frequency profile]

Fit of Zipf's law

I At right edge (low frequencies):

I "Bell-bottom" pattern expected, as we are fitting a continuous model to discrete frequencies

I More worryingly, in large corpora frequency drops more rapidly than predicted by Zipf's law

I At left edge (high frequencies):

I Highest frequencies lower than predicted → Mandelbrot's correction

Zipf-Mandelbrot's law
Mandelbrot 1953

I Mandelbrot's extra parameter:

f(w) = C / (r(w) + b)^a

I Zipf's law is the special case with b = 0

I Assuming a = 1, C = 60,000, b = 1:

I For the word with rank 1, Zipf's law predicts frequency of 60,000; Mandelbrot's variation predicts frequency of 30,000

I For the word with rank 1,000, Zipf's law predicts frequency of 60; Mandelbrot's variation predicts frequency of 59.94

I No longer a straight line in double-logarithmic space; finding the best fit is harder than with least squares

I Zipf-Mandelbrot's law is the basis of the LNRE statistical models we will introduce
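The two predictions can be compared side by side; a short sketch with the slide's parameters (a = 1, C = 60,000, b = 1):

```python
# Zipf's law vs. the Zipf-Mandelbrot variant, using the slide's parameters.
C, a, b = 60_000, 1, 1

def zipf(r):
    """Zipf's law: C / r**a."""
    return C / r ** a

def zipf_mandelbrot(r):
    """Zipf-Mandelbrot: C / (r + b)**a; reduces to Zipf's law for b = 0."""
    return C / (r + b) ** a

# The extra parameter b matters a lot at the top of the ranking...
print(zipf(1), zipf_mandelbrot(1))                   # 60000.0 30000.0
# ...and almost not at all further down.
print(zipf(1000), round(zipf_mandelbrot(1000), 2))   # 60.0 59.94
```

This is why the correction mainly bends the high-frequency (left) edge of the log-log curve while leaving the tail essentially Zipfian.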

Mandelbrot's adjustment
Fitting the Brown rank/frequency profile

[Figure: Zipf-Mandelbrot fit to the Brown corpus rank/frequency profile]

More fits

A few mildly interesting things about Zipf(-Mandelbrot)'s law

I a is often close to 1 for word frequency distributions (hence the simplified version f = C/r, and a slope of −1 in log-log space)

I Zipf's law also provides a good fit to frequency spectra

I Monkey languages display Zipf's law (intuition: a few short words have very high chances of being generated; long tail of highly unlikely long words)

I Zipf's law is everywhere (Li 2002)

Consequences

I Data sparseness

I Standard statistics and the normal approximation are not appropriate for lexical type distributions

I V is not stable: it will grow with sample size, so we need special methods to estimate V and related quantities at arbitrary sizes (including V of the whole type population)

V, sample size and the Zipfian distribution

I A significant tail of hapax legomena indicates that the chances of encountering a new type if we keep sampling are high

I A Zipfian distribution implies a vocabulary growth curve that is still growing at the largest sample size
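The claim can be illustrated by simulation. This sketch assumes a hypothetical finite population of 50,000 types with probabilities proportional to 1/rank; the population size and random seed are illustrative choices, not from the slides:

```python
import random

random.seed(42)

# Hypothetical Zipfian population: 50,000 types, P(rank r) proportional to 1/r.
types = list(range(1, 50_001))
weights = [1.0 / r for r in types]

# Draw 100,000 tokens and record vocabulary size V at a few sample sizes N.
seen = set()
checkpoints = {}
for n, tok in enumerate(random.choices(types, weights=weights, k=100_000), 1):
    seen.add(tok)
    if n in (10_000, 50_000, 100_000):
        checkpoints[n] = len(seen)

# V keeps growing with N: new types are still being encountered
# even at the largest sample size.
print(checkpoints)
```

Even though the population here is finite, the heavy tail of rare types keeps the vocabulary growth curve rising throughout the sample, which is exactly why an observed V cannot be treated as a stable population estimate.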

Pronouns in Italian (la Repubblica)
Rank/frequency profile

[Figure: rank/frequency profile of Italian pronouns; x axis: rank (0–80), y axis: fq (log scale)]

Pronouns in Italian
Frequency spectrum

[Figure: frequency spectrum of Italian pronouns; x axis: m (log scale), y axis: V_m]

Pronouns in Italian
Vocabulary growth curve

[Figure: vocabulary growth curve; x axis: N (0 to 4e+06), y axis: V and V_1 (0 to 80)]

Pronouns in Italian
Vocabulary growth curve (zooming in)

[Figure: vocabulary growth curve, zoomed; x axis: N (0 to 10,000), y axis: V and V_1 (0 to 80)]

ri- in Italian (la Repubblica)
Rank/frequency profile

[Figure: rank/frequency profile of Italian ri- forms]

ri- in Italian
Frequency spectrum

[Figure: frequency spectrum of Italian ri- forms; x axis: m (1 to 15), y axis: V_m (0 to 350)]

ri- in Italian
Vocabulary growth curve

[Figure: vocabulary growth curve; x axis: N (0 to 1,000,000), y axis: V and V_1 (0 to 1,000)]

Outline

Roadmap

Lexical statistics: the basics

Zipf’s law

Applications

Applications

I Productivity (in morphology and elsewhere)

I Lexical richness (in stylometry, language acquisition/pathology and elsewhere)

I Extrapolation of type counts and the type frequency distribution for practical NLP purposes (e.g., estimating the proportion of OOV words, typos, etc.)

I . . . (e.g., Good-Turing smoothing, prior distributions for Bayesian language modeling)

Productivity

I In many linguistic problems, the rate of growth of the VGC is an interesting issue in itself

I Baayen (1989 and later) makes a link between the linguistic notion of productivity and the vocabulary growth rate

Productivity in morphology: the classic definition
Schultink (1961), translated by Booij

Productivity as a morphological phenomenon is the possibility which language users have to form an in principle uncountable number of new words unintentionally, by means of a morphological process which is the basis of the form-meaning correspondence of some words they know.

V as a measure of productivity

I Comparable for same N only!

I Good first approximation, but it is measuring attestedness, not potential:

I (According to rough BNC counts) de- verbs have V of 141, un- verbs have V of 119, contra our intuition

I We want the productivity index of pronouns to be 0, not 72!

Baayen’s P

- Operationalize the productivity of a process as the probability that the next token created by the process that we sample is a new word
- This is the same as the probability that the next token in the sample is a hapax legomenon
- Thus, we can estimate the probability of sampling a new word as the relative frequency of hapax legomena in our sample:

  P = V1 / N
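This estimator is straightforward to compute from raw data. A minimal sketch (not code from the slides; the token list is a made-up toy sample):

```python
from collections import Counter

def baayen_p(tokens):
    """Baayen's P = V1 / N: the relative frequency of hapax legomena
    (types occurring exactly once) among the N sampled tokens."""
    freq = Counter(tokens)
    v1 = sum(1 for count in freq.values() if count == 1)
    return v1 / len(tokens)

# Toy sample of 8 tokens; "redo" and "rewire" are the two hapaxes
sample = ["rerun", "rerun", "rewrite", "rewrite",
          "rewrite", "redo", "rewire", "rerun"]
print(baayen_p(sample))  # 0.25
```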

Baayen’s P

  P = V1 / N

- The probability of sampling a token representing a type we will never encounter again (a token labeled “hapax”) at the first stage of sampling (when we are at the beginning of the N-token sample) is given by the number of hapaxes in the whole N-token sample divided by the total number of tokens in the sample
- Thus, this must also be the probability that the last token sampled represents a new type
- P as a productivity measure matches the intuition that productivity should measure the potential of a process to generate new forms

P as vocabulary growth rate

- P measures the potential for growth of V in a very literal way, i.e., it is the growth rate of V, the rate at which vocabulary size increases
- P is (an approximation to) the derivative of V at N, i.e., the slope of the tangent to the vocabulary growth curve at N (Baayen 2001, pp. 49-50)
- Again, the “rate of growth” of the vocabulary generated by a word formation process seems a good match for intuitions about the productivity of that process
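The identification of P with the slope of the growth curve can be checked numerically under a simple multinomial sampling model, where E[V(N)] = sum over i of 1 - (1 - p_i)^N and E[V1(N)] = sum over i of N * p_i * (1 - p_i)^(N-1) (a standard sampling result, not code from the slides; the Zipf-like population is made up for illustration):

```python
# Expected V and V1 under multinomial sampling from a fixed population
# of type probabilities p_i (hypothetical Zipf-like population).
H = sum(1 / i for i in range(1, 1001))
probs = [1 / (i * H) for i in range(1, 1001)]  # p_i proportional to 1/i

def expected_v(n):
    """E[V(N)]: expected number of distinct types among n tokens."""
    return sum(1 - (1 - p) ** n for p in probs)

def expected_p(n):
    """Baayen's P in expectation: E[V1(N)] / N."""
    return sum(p * (1 - p) ** (n - 1) for p in probs)

n = 2000
slope = (expected_v(n + 1) - expected_v(n - 1)) / 2  # central difference
print(slope, expected_p(n))  # the two values nearly coincide
```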

ri- in Italian la Repubblica corpus

[Figure: vocabulary growth curve; x-axis N (0 to 1,400,000), y-axis V (200 to 1000)]

Pronouns in Italian la Repubblica corpus

[Figure: vocabulary growth curve; x-axis N (0 to 10,000), y-axis V (0 to 80)]

Baayen’s P and intuition

  class         V     V1   N          P
  it. ri-       1098  346  1,399,898  0.00025
  it. pronouns  72    0    4,313,123  0
  en. un-       119   25   7,618      .00328
  en. de-       141   16   86,130     .000185
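The P column can be recomputed directly from the V1 and N columns (figures from the table above; small discrepancies are rounding in the slide):

```python
# Recompute P = V1 / N for the four classes in the table
rows = [
    ("it. ri-",      346, 1_399_898),
    ("it. pronouns",   0, 4_313_123),
    ("en. un-",       25,     7_618),
    ("en. de-",       16,    86_130),
]
for name, v1, n in rows:
    print(f"{name:13s} P = {v1 / n:.6f}")
```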

V and N: English re- and mis-

[Figure: vocabulary growth curves for re- and mis-; x-axis N (0 to 50,000), y-axis V (0 to 250)]

P and sample size

- We saw that as N increases, V also increases (for at-least-mildly-productive processes)
- Thus, V cannot be compared at different Ns
- However, the growth rate also systematically decreases as N becomes larger
- At the beginning, any word will be a hapax legomenon; as the sample increases, hapaxes will make up an increasingly lower proportion of the sample
- A specific instance of the more general problem of “variable constants” (Tweedie and Baayen 1998) in lexical statistics (cf. the type/token ratio)
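The decline of P with sample size is easy to reproduce by sampling from a Zipfian population (a simulation sketch; the population and sample sizes are arbitrary choices, not from the slides):

```python
import random
from collections import Counter

random.seed(0)  # deterministic simulation

# Zipfian population: 10,000 types, type r gets weight 1/r
types = list(range(1, 10_001))
weights = [1 / r for r in types]
stream = random.choices(types, weights=weights, k=50_000)

def p_at(n):
    """Baayen's P = V1 / N computed on the first n tokens of the stream."""
    freq = Counter(stream[:n])
    return sum(1 for c in freq.values() if c == 1) / n

for n in (1_000, 10_000, 50_000):
    print(n, p_at(n))  # P drops steadily as N grows
```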

Growth rate of re- at different sample sizes

[Figure: vocabulary growth curve for re-; x-axis N (0 to 200,000), y-axis V (200 to 300)]

P as a function of N (re-)

[Figure: x-axis N (0 to 200,000), y-axis P (1e-04 to 2e-02, logarithmic scale)]

V and P at arbitrary Ns

- In order to compare the V and P of processes (and predict how a process will develop in larger samples). . .
- we need to be able to estimate V and V1 at arbitrary Ns
- Once we compare P at the same N, we might as well compare V1 directly (since P = V1/N and N will be constant across the compared processes)
- Most intuitive: VGC plot comparison
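An empirical VGC is just (N, V, V1) recorded at increasing sample sizes. A minimal sketch on a toy character-level "corpus" (the data is made up for illustration):

```python
from collections import Counter

def vgc(tokens, sizes):
    """Empirical vocabulary growth curve: (N, V, V1) at each size in sizes."""
    points = []
    for n in sizes:
        freq = Counter(tokens[:n])
        v = len(freq)                                 # types seen so far
        v1 = sum(1 for c in freq.values() if c == 1)  # hapaxes so far
        points.append((n, v, v1))
    return points

tokens = list("abracadabra" * 3)  # 33 single-character "tokens"
print(vgc(tokens, [5, 11, 33]))  # [(5, 4, 3), (11, 5, 2), (33, 5, 0)]
```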

Productivity beyond morphology

- Measuring the generative potential of a process/category is not limited to morphology
- Applications in lexicology, collocation and idiom studies, morphosyntax, syntax, language technology
- E.g., measure the growth of nouns, adjectives, loanwords, the relative productivity of two constructions, the growth of UNKNOWN lemmas as a dataset increases. . .
- An example: measuring the productivity of NP and PP expansions in the German TIGER treebank

TIGER expansions

- Types are non-terminal rewrite rules for NP and PP, e.g.:
  - NP → ART ADJA NN
  - PP → APPR ART NN
- Frequencies of occurrence of expansions collected from about 900,000 tokens (50,000 sentences) of German newspaper text from the Frankfurter Rundschau
- http://www.ims.uni-stuttgart.de/projekte/TIGER
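Treating each rewrite-rule occurrence as a token, the frequency spectrum V(m) shown on the next slides can be computed as follows (a sketch with made-up rule labels, not actual TIGER counts):

```python
from collections import Counter

def frequency_spectrum(tokens):
    """V(m): the number of types occurring exactly m times in the sample."""
    type_freq = Counter(tokens)             # type -> frequency
    spectrum = Counter(type_freq.values())  # frequency m -> number of types
    return dict(sorted(spectrum.items()))

# Hypothetical expansion tokens (one string per rule occurrence)
rules = (["NP -> ART NN"] * 4 + ["PP -> APPR ART NN"] * 2 +
         ["NP -> ART ADJA NN", "NP -> NN", "PP -> APPR NN"])
print(frequency_spectrum(rules))  # {1: 3, 2: 1, 4: 1}
```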

NP spectrum

[Figure: frequency spectrum; x-axis m (1 to 15), y-axis V(m) (0 to 1500)]

PP spectrum

[Figure: frequency spectrum; x-axis m (1 to 15), y-axis V(m) (0 to 1500)]

Growth curves of NP and PP

[Figure: growth curves for np and pp; x-axis N (0 to 100,000), y-axis V and V1 (0 to 3500)]

Lexical richness

- How many words did Shakespeare know? Are the later Harry Potters more lexically diverse than the early ones?
- Are advanced learners distinguishable from native speakers in terms of vocabulary richness? How many words do 5-year-old children know?
- Can changes in V detect the onset of Alzheimer’s disease? (Garrard et al. 2005)

The Dickens datasets

- Dickens corpus: a collection of 14 works by Dickens, about 2.8 million tokens
- Oliver Twist: an early work (1837-1839), about 160k tokens
- Great Expectations: a later work (1860-1861), considered one of Dickens’s masterpieces, about 190k tokens
- Our Mutual Friend: the last completed novel (1864-1865), about 330k tokens

Dickens’s V

[Figure: growth curves for dickens, omf, ge, and ot; x-axis N (0 to 2,500,000), y-axis V and V1 (0 to 40,000)]

The novels compared

[Figure: growth curves for omf, ge, and ot; x-axis N (0 to 300,000), y-axis V and V1 (0 to 15,000)]

Oliver vs. Great Expectations

[Figure: growth curves for ge and ot; x-axis N (0 to 150,000), y-axis V and V1 (0 to 10,000)]

Conclusion and outlook

- Productivity, lexical richness, extrapolation of type counts for language engineering purposes. . .
- all applications require a model of the larger population of types that our sample comes from
- Two reasons to construct a model of the type population distribution:
  - The population distribution is interesting by itself, for theoretical reasons or in NLP applications
  - We know how to simulate sampling from the population; thus once we have a population model we can obtain estimates of type-related quantities (e.g., V and V1) at arbitrary Ns
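Under the simplest (multinomial) sampling assumption, a population model directly yields such estimates: if type i has probability p_i, the expected number of distinct types in a sample of N tokens is E[V(N)] = sum over i of 1 - (1 - p_i)^N (a standard sampling result; the four-type population below is made up for illustration):

```python
def expected_v(probs, n):
    """Expected number of distinct types in a random sample of n tokens
    drawn from a population with the given type probabilities."""
    return sum(1 - (1 - p) ** n for p in probs)

probs = [0.5, 0.25, 0.125, 0.125]  # toy population, V = 4 types
print(expected_v(probs, 1))        # 1.0: a single token is always one type
print(expected_v(probs, 100))      # close to 4, the population V
```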

Modeling the population: Productivity

- The distribution of types of the category of interest is necessary to estimate V and V1 at arbitrary Ns, in order to compare the VGCs and P of different processes
- However, the type population distribution of a word formation process (or other category) might be of interest by itself, as a model of part of the mental lexicon of the speaker

Modeling the population: Lexical richness

I Lexical richness = V of the whole population (how many words did Shakespeare know? Was the lexical repertoire of young Dickens smaller than that of old Dickens? How many words do 5-year-old children know?)

I An accurate estimate of the population V would solve the “variable constant” problem

I Sampling from the population, in particular to compute a VGC, is also of interest
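One well-known way to estimate population V from an observed frequency spectrum (not discussed in these slides, but a natural point of comparison) is the Chao1 lower-bound estimator, which uses only V, V1, and V2. A minimal sketch:

```python
from collections import Counter

def chao1(freqs):
    """Chao1 lower-bound estimate of population V.

    freqs: iterable of per-type sample frequencies.
    Uses V (observed types), V1 (hapaxes), V2 (types seen twice).
    """
    spectrum = Counter(freqs)          # frequency spectrum: {m: Vm}
    v = sum(spectrum.values())
    v1, v2 = spectrum.get(1, 0), spectrum.get(2, 0)
    if v2 == 0:
        return v + v1 * (v1 - 1) / 2   # bias-corrected variant
    return v + v1 * v1 / (2 * v2)

# Toy sample: 4 hapaxes, 2 types seen twice, 1 type seen 5 times
print(chao1([1, 1, 1, 1, 2, 2, 5]))   # 7 + 16/4 = 11.0
```

Like the LNRE approach sketched in this course, such estimators lean heavily on the low end of the spectrum (V1, V2), which is exactly where LNRE distributions concentrate their mass.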


Modeling the population: Some NLP applications

I Estimate the number (and growth rate) of typos, UNKNOWNs (or other target tokens) in larger samples → estimate V and V1 at arbitrary sample sizes N

I Estimate the proportion of OOV words under the assumption that the lexicon contains the top n most frequent types (see the zipfR tutorial) → requires estimating V and the frequency spectrum at arbitrary sample sizes N (to find out how many tokens the top n types account for)

I Good-Turing estimation, Bayesian priors → require a full type population model
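The second bullet reduces, within an observed sample, to a simple coverage computation: if the lexicon holds the top n types, the OOV rate is the share of tokens not accounted for by those types. A minimal sketch (the function name and toy data are illustrative):

```python
from collections import Counter

def oov_rate(tokens, lexicon_size):
    """Proportion of tokens not covered by a lexicon of the top-n types."""
    freqs = Counter(tokens)
    covered = sum(f for _, f in freqs.most_common(lexicon_size))
    return 1 - covered / len(tokens)

tokens = "a a a b b c d".split()
print(oov_rate(tokens, 2))  # types a, b cover 5/7 tokens -> 2/7 OOV
```

The population model enters when we want this rate not for the sample at hand but for a larger, unseen N: then V and the frequency spectrum at that N must be extrapolated first.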


Outlook

I We need a model of the type population distribution

I We will use the Zipf(-Mandelbrot) law as the starting point for modeling what the population looks like
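As a preview of that starting point, here is a minimal sketch of a finite Zipf-Mandelbrot population, where the probability of the type at rank k is proportional to 1/(k + b)^a (the parameter values and cutoff kmax are arbitrary illustrative choices):

```python
# Zipf-Mandelbrot type probabilities: p(k) proportional to 1/(k + b)^a.
# a, b, and the finite cutoff kmax are illustrative, not fitted values.
a, b, kmax = 1.2, 2.0, 100_000

weights = [1.0 / (k + b) ** a for k in range(1, kmax + 1)]
z = sum(weights)                 # normalizing constant
probs = [w / z for w in weights]

# b > 0 flattens the top ranks relative to pure Zipf (b = 0)
print(probs[0] / probs[1])
```

The LNRE models introduced later in the course replace this finite truncation with a principled treatment of the (potentially unbounded) low-frequency tail.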

TO BE CONTINUED
