English Corpora and Words Defined in Learner's Dictionaries Yang Shouxun [email protected] Corpus Development Section, FLTRP


English Corpora

A corpus is a collection of written texts and/or transcripts of spoken language

The Brown Corpus

The British National Corpus

Special software to access and analyze the corpus

FLTRP English/Chinese Parallel Corpus

Frequency

A small number of high-frequency words cover a large proportion of the corpus.

A large number of low-frequency words cover only a disproportionately small part.

Share of each corpus covered by its top n most frequent words:

top n    BNC      Brown    FlecPara
10       22.51%   24.43%   23.19%
20       29.52%   31.35%   29.91%
50       40.03%   40.88%   40.03%
100      47.20%   47.71%   47.58%
200      53.63%   53.94%   54.16%
500      62.01%   62.40%   63.11%
1000     69.11%   69.37%   70.50%
10000    90.95%   92.44%   92.86%
40000    97.07%   99.40%   98.55%
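A coverage table of this kind can be reproduced from a tokenized corpus with a few lines of Python. The sketch below is only an illustration under stated assumptions: the function name `top_n_coverage` and the toy token list are hypothetical, and the tokenization actually used for the three corpora is not described in the slides.

```python
from collections import Counter

def top_n_coverage(tokens, ns):
    """Return {n: share of all tokens covered by the n most frequent word types}."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("the token list is empty")
    # Token counts sorted from the most frequent type to the least frequent.
    ranked = [count for _, count in counts.most_common()]
    return {n: sum(ranked[:n]) / total for n in ns}

# Toy input (hypothetical); a real run would pass every token of a corpus.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
cov = top_n_coverage(tokens, [1, 2, 3])
for n, share in cov.items():
    print(f"top {n}: {share:.2%}")
```

Sorting the counts once and summing prefixes keeps the computation simple; for very long lists of n values a running cumulative sum would avoid the repeated slicing.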

Frequency

High-frequency words are most useful to learners of the language.

High-frequency words should be defined in a learner's dictionary.

More frequently used senses should come before less frequently used senses.

Objectives

Are the words defined in a learner's dictionary really high-frequency words?

What high-frequency words are not defined in a learner's dictionary?

Which low-frequency words are included in a learner's dictionary, and which are not?

Research methods

Six learner's dictionaries by well-known international publishers

Three corpora for word frequency extraction

Brown Corpus

British National Corpus

FLTRP English/Chinese Parallel Corpus

Research methods

A frequency table is computed for each corpus, and frequencies are normalized to occurrences per million words.

Lists of defined words in the 6 dictionaries are extracted.

Multi-word entries, such as “a priori” and “according to”, are excluded.
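The two preparation steps on this slide might look as follows in Python. This is a minimal sketch, not the code used in the study: `per_million` and `single_word_entries` are hypothetical names, the counts are toy values, and "multi-word" is assumed here to mean "contains whitespace".

```python
from collections import Counter

def per_million(counts):
    """Normalize raw counts to occurrences per million running words."""
    total = sum(counts.values())
    return {word: count * 1_000_000 / total for word, count in counts.items()}

def single_word_entries(headwords):
    """Keep only entries without internal whitespace (assumed definition of single-word)."""
    return [h for h in headwords if " " not in h.strip()]

# Toy counts (hypothetical): a 10-token "corpus".
raw_counts = Counter({"the": 6, "of": 3, "corpus": 1})
freq = per_million(raw_counts)

entries = single_word_entries(["a priori", "according to", "corpus", "word"])
print(freq["corpus"], entries)
```

Normalizing to a per-million scale is what makes thresholds such as 128 or 0.5 comparable across corpora of very different sizes.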

Research methods

Words from corpora are reduced to their base forms

  dictionaries contain basically words in base forms

  corpora contain words in all possible forms, including inflected and capitalized forms

Some issues with this method

  ambiguous reductions such as thought/think

  case distinction cannot be kept: A/a, China/china

  lots of entries containing numbers in corpora, but only a few numbers are entries in dictionaries
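The reduction step and its side effects can be illustrated with a small sketch. Everything here is an assumption made for illustration: the slides do not say which lemmatizer was used, `reduce_counts` and the tiny `lemma_table` are hypothetical, and dropping tokens that contain digits is just one possible way to handle numbers.

```python
import re
from collections import Counter

HAS_DIGIT = re.compile(r"\d")

def reduce_counts(counts, lemma_table):
    """Merge counts of surface forms into counts of lowercased base forms."""
    merged = Counter()
    for token, n in counts.items():
        if HAS_DIGIT.search(token):
            continue  # assumption: tokens containing digits are left out
        word = token.lower()  # case is discarded, so "China" and "china" merge
        merged[lemma_table.get(word, word)] += n
    return merged

# Toy data (hypothetical). Mapping "thought" to "think" also swallows the noun "thought".
raw = Counter({"China": 5, "china": 2, "thought": 4, "think": 3, "thinks": 1, "1998": 7})
table = {"thought": "think", "thinks": "think"}
reduced = reduce_counts(raw, table)
print(reduced)
```

The example reproduces two of the issues listed above: the noun reading of "thought" is lost, and the country name can no longer be told apart from the common noun.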

Computation

Distribution of word frequency in a dictionary

What percentage of high-frequency words are defined in a dictionary?

Distribution of word frequency in dictionaries

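Brown Corpus

Percentage of the words defined in each dictionary whose corpus frequency (per million words) is at or above the threshold:

Frequency >=   A      B      C      D      E      F
128            4.6    4.2    2.3    2.2    2.3    2.2
64             9.2    8.3    4.7    4.4    4.5    4.3
32             15.6   14.3   8      7.5    7.7    7.4
16             24.3   22.3   12.6   11.9   12.1   11.7
8              35.2   32.3   18.6   17.5   17.8   17.3
4              52     48.2   29.5   27.5   28.1   27.3
2              64.6   61.8   41.4   38.5   39.4   38.5
1              73.9   72.3   54.5   51     51.7   50.9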
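Each cell of such a table is the share of a dictionary's defined words that reach a given corpus frequency. A minimal sketch of that computation, assuming a per-million frequency table keyed by base form and a list of single-word headwords (the function name and the toy data are hypothetical):

```python
def share_at_or_above(headwords, freq_per_million, thresholds):
    """Percentage of headwords whose corpus frequency is >= each threshold."""
    heads = set(headwords)
    if not heads:
        raise ValueError("no headwords given")
    result = {}
    for t in thresholds:
        # Headwords missing from the corpus count as frequency 0.
        hits = sum(1 for h in heads if freq_per_million.get(h, 0.0) >= t)
        result[t] = 100.0 * hits / len(heads)
    return result

# Toy data (hypothetical frequencies, not taken from any of the three corpora).
toy_freq = {"the": 60000.0, "corpus": 12.0, "lemma": 1.5}
toy_heads = ["the", "corpus", "lemma", "zymurgy"]
dist = share_at_or_above(toy_heads, toy_freq, [128, 8, 1])
print(dist)
```

Because the denominator is the number of headwords, a larger dictionary gets lower percentages at every threshold even if it defines exactly the same frequent words.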

Distribution of word frequency in dictionaries

More than 45% of the words defined in dictionaries C, D, E, and F are not found in the Brown Corpus.

More than 25% of the words defined in dictionaries A and B are not found in the Brown Corpus.

Still good dictionaries

Even learner's dictionaries include far more words than a learner could possibly need.

Distribution of word frequency in dictionaries

[Figure: Frequency Distribution of Words in Dictionaries (Brown Corpus). X-axis: Frequency >= (128 down to 1); y-axis: Percentage in Dictionaries (0 to 80); one curve per dictionary, A to F.]

Distribution of word frequency in dictionaries

The figure clearly shows that the dictionaries can be clustered into two categories: A and B (for learners), and C, D, E, and F (for advanced learners).

The denominator is the number of words defined in each dictionary.

Distribution of word frequency in dictionaries

BNC

Frequency >=   A      B      C      D      E      F
64             8.3    7.6    4.2    4      4      3.9
32             14.2   13     7.2    6.8    7      6.7
16             22.5   20.6   11.6   10.9   11.1   10.7
8              33.4   30.6   17.3   16.2   16.6   16
4              47.8   44     25.5   23.7   24.4   23.5
2              64.2   59.2   36.2   33.3   34.5   33.3
1              77     73.7   49.4   44.7   46.7   45.5
0.5            82.9   82.6   62.5   56.4   58.7   58

Distribution of word frequency in dictionaries

Similar trend

Just a little smaller

Distribution of word frequency in dictionaries

[Figure: Frequency Distribution of Words in Dictionaries (BNC). X-axis: Frequency >= (64 down to 0.5); y-axis: Percentage in Dictionary (0 to 90); one curve per dictionary, A to F.]

Distribution of word frequency in dictionaries

FlecPara

Frequency >=   A      B      C      D      E      F
128            4.7    4.2    2.4    2.2    2.3    2.2
64             8.7    7.9    4.4    4.2    4.2    4.1
32             14.8   13.5   7.5    7.1    7.2    7
16             23     20.9   11.8   11.1   11.3   10.9
8              34.1   31     17.8   16.6   17     16.4
4              47.1   43     25.4   23.5   24.2   23.4
2              61.6   57.3   35.7   32.6   33.9   32.9
1              72.2   68.8   46.3   42     43.7   42.7
0.5            80.4   78.8   59.1   53.7   55.4   54.9

Distribution of word frequency in dictionaries

[Figure: Frequency Distribution in Dictionaries (FlecPara). X-axis: Frequency >= (128 down to 0.5); y-axis: Percentage in Dictionaries (0 to 90); one curve per dictionary, A to F.]

How many high-frequency words are defined?

Brown Corpus

Percentage of corpus words at or above each frequency threshold (per million words) that are defined in each dictionary:

Frequency >=   A      B      C      D      E      F
128            94.5   96     95.3   96.1   96.5   97.1
64             92.2   93.8   93.3   94.4   94.8   95.3
32             89.5   91.6   90.9   91.9   92.3   92.6
16             85.1   87     87.6   88.2   88.6   89
8              80.1   82.1   84.4   84.5   85     85.6
4              79.5   82.4   89.9   89.1   90.3   91.1
2              70     74.7   89.3   88.3   89.6   90.8
1              60.6   66.3   89     88.6   89.1   91
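This second measure turns the question around: the denominator is now the set of corpus words at or above a threshold, and the numerator is how many of them a dictionary defines. A hedged sketch, with a hypothetical function name and toy frequencies that are not taken from the study:

```python
def defined_share(headwords, freq_per_million, thresholds):
    """Percentage of corpus words with frequency >= threshold that are headwords."""
    heads = set(headwords)
    result = {}
    for t in thresholds:
        frequent = [w for w, f in freq_per_million.items() if f >= t]
        if not frequent:
            result[t] = None  # no corpus word reaches this threshold
            continue
        defined = sum(1 for w in frequent if w in heads)
        result[t] = 100.0 * defined / len(frequent)
    return result

# Toy data (hypothetical): two frequent words are not headwords.
toy_freq = {"the": 60000.0, "john": 300.0, "corpus": 12.0, "unix": 2.0}
toy_heads = {"the", "corpus"}
share = defined_share(toy_heads, toy_freq, [128, 8, 1])
print(share)
```

Since the set of frequent corpus words does not depend on the dictionary, the resulting percentages can be compared directly across dictionaries of different sizes.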

How many high-frequency words are defined?

The denominator is constant across dictionaries.

Advanced dictionaries are rated higher, but the margin is very small.

The curves below frequency 8 are surprising and require an explanation.

How many high-frequency words are defined?

[Figure: High-frequency Words in Dictionaries (Brown Corpus). X-axis: Frequency >= (128 down to 1); y-axis: Percentage (0 to 120); one curve per dictionary, A to F.]

How many high-frequency words are defined?

BNC

Frequency >=   A      B      C      D      E      F
64             92.7   94.5   93.4   94.1   95     94.8
32             90.6   92.6   91.9   92.4   93.2   92.8
16             88.3   90.3   90.1   90.2   91     91.1
8              83.6   85.6   86.3   85.9   87.1   87.1
4              77.7   79.8   82.6   81.5   83.2   83.1
2              68.4   70.6   76.9   75.3   77.4   77.5
1              55.5   59.4   70.8   68.4   70.6   71.5
0.5            48.8   54.3   73.2   70.3   72.5   74.4

How many high-frequency words are defined?

[Figure: High-frequency Words in Dictionaries (BNC). X-axis: Frequency >= (64 down to 0.5); y-axis: Percentage (0 to 100); one curve per dictionary, A to F.]

How many high-frequency words are defined?

FlecPara

Frequency >=   A      B      C      D      E      F
128            95     96     96     96.4   96.6   97
64             92.8   94     93.8   94.1   94.7   94.7
32             90.6   92.1   91.9   92.5   93     93
16             87.7   89.1   89.8   89.8   90.5   90.8
8              83.7   85     87     86.1   87.3   87.8
4              77.4   78.9   83.2   81.7   83.5   83.9
2              69.1   71.9   79.8   77.6   79.9   80.4
1              56.4   60.1   72.1   69.6   71.8   72.7
0.5            53     58     77.5   75     76.7   78.8

How many high-frequency words are defined?

[Figure: High-frequency Words in Dictionaries (FlecPara). X-axis: Frequency >= (128 down to 0.5); y-axis: Percentage (0 to 120); one curve per dictionary, A to F.]

High-frequency words not defined in dictionaries

An increasing number of high-frequency words are not defined as the frequency threshold gets lower.

  Place names, such as “Asia”, “Europe”

  Personal names, such as “John”, “David”

  Other cases, such as “ii”, “na”, “ca”

  Words that probably should be included: “legislative” (>19), not defined in B

Some dictionaries extensively use words in definitions that are not defined.
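Lists like the one above can be produced by taking the frequent corpus words that are missing from a dictionary's headword list and sorting them by frequency. The sketch below assumes the same data shapes as before; the function name is hypothetical and the frequencies are made-up toy values, not the figures quoted on the slide.

```python
def undefined_frequent_words(headwords, freq_per_million, threshold):
    """Corpus words at or above the threshold that are not headwords, most frequent first."""
    heads = set(headwords)
    missing = [
        (word, f)
        for word, f in freq_per_million.items()
        if f >= threshold and word not in heads
    ]
    # Sort by descending frequency, then alphabetically for stable output.
    return sorted(missing, key=lambda item: (-item[1], item[0]))

# Toy data (hypothetical frequencies, chosen only to exercise the function).
toy_freq = {"the": 60000.0, "john": 300.0, "europe": 150.0,
            "legislative": 20.0, "zymurgy": 0.1}
toy_heads = {"the"}
gaps = undefined_frequent_words(toy_heads, toy_freq, 19)
print(gaps)
```

A manual pass over the output is still needed to separate proper names and abbreviations from ordinary words that were arguably overlooked.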

Why are not all high-frequency words defined?

Computational methods are good enough but not perfect:

  how to reduce words to their base forms

  spelling variations in the corpus

  numbers

Some words are presumably not considered important, or are simply left out by accident: “Soviet” (>119), “Unix” (>42)

Some vulgar words are probably avoided intentionally for elementary learners.

Low-frequency words defined in dictionaries

Some lower-frequency words have to be chosen if the dictionary is a big one.

Words and expressions that come into wider use after the corpus is built may find their way into new dictionaries or updated versions: “ISP”, “spammer”, “MP3”, and “e-commerce”

Affixes, e.g. “post-”, “-proof”

Why some low-frequency words are chosen and others are not is not so clear.

Concluding remarks

Take the numbers with a grain of salt.

The frequency principle is well observed in modern English dictionaries.

There may be occasional bugs.

The corpus should be kept up-to-date, or new words and expressions should be added from other sources if the dictionary is targeted at advanced learners.

Concluding remarks

A learner's dictionary does not really need to cover so many low-frequency words.

A better metric for evaluating learner's dictionaries would be the coverage of high-frequency words in the dictionary texts; this is a topic for further study.

It'll be interesting to include some dictionaries compiled without corpora in the study.

Thanks!