Comparable Corpora Within and Across Languages, Word Frequency Lists and the KELLY Project Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Universities of Leeds and Sussex


Malta, May 2010 Kilgarriff: BUCC 2

Two corpora are comparable iff

roughly the same text types, subject matter, proportions

• Applicable where
– Different languages
– Same language
• comparable = similar
– Any corpus is entirely similar to itself

Comparing Corpora

• Input
– Word freq list for c1
– Word freq list for c2
• For top 500 words, compute sum of (observed - expected)^2 / expected
• Chi-square-based
– Discriminates well
– Better than Spearman rank, cross-entropy
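A minimal sketch of the chi-square-based measure, for illustration only. The slide does not say how the top 500 words are chosen or how expected counts are derived; here both are assumptions: the top words come from the joint frequency list, and expected counts assume the two corpora are samples of the same population. The function name is hypothetical.

```python
from collections import Counter

def chi_square_distance(freq1, freq2, top_n=500):
    """Sum of (observed - expected)^2 / expected over the top_n joint words.

    freq1, freq2: dicts mapping word -> count for corpora c1 and c2.
    A lower value means the two corpora look more similar.
    """
    n1 = sum(freq1.values())
    n2 = sum(freq2.values())
    # Assumption: rank words by their combined frequency in both corpora.
    joint = Counter(freq1) + Counter(freq2)
    total = 0.0
    for word, joint_count in joint.most_common(top_n):
        for observed, size in ((freq1.get(word, 0), n1), (freq2.get(word, 0), n2)):
            # Assumption: expected count is proportional to corpus size.
            expected = joint_count * size / (n1 + n2)
            total += (observed - expected) ** 2 / expected
    return total

c1 = {"the": 30, "she": 10}
c2 = {"the": 10, "she": 30}
print(chi_square_distance(c1, c1))  # identical corpora
print(chi_square_distance(c1, c2))  # differing corpora
```

A corpus compared with itself scores zero, in line with the earlier point that any corpus is entirely similar to itself.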

1990s work

• Then
– Very few corpora
– Purely theoretical interest
• Now
– Web: lots of corpora, created to spec
– Compare…
– first question to ask about a new corpus

(Monolingual) Word Lists

• Define a syllabus
• Which words get used in
– Learning-to-read books (NS children)
– NNS language learner textbooks
– Dictionaries
– Language testing
• NS: educational psychologists
• NNS: proficiency levels

Should be corpus-based

• Most aren't
– Corpora are quite new
• Easy to do better
• People will use them
– Maybe also governments

How

• Take your corpus
• Count
• Voilà
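The naive recipe can be sketched in a few lines. This is an illustration only: it assumes a word is whatever sits between spaces after lower-casing, and the function name is hypothetical.

```python
from collections import Counter

def frequency_list(text):
    """Count whitespace-delimited, lower-cased tokens, most frequent first."""
    return Counter(text.lower().split()).most_common()

for word, count in frequency_list("the cat saw the dog and the bird"):
    print(word, count)
```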

Complications

• What is a word
• Words and lemmas
• Grammatical classes
• Numbers, names...
• Multiwords
• Homonymy

All are slightly different issues for each language

What is a word; delimiters

• Found between spaces
– Not for Chinese: segmentation
• English
– co-operate, widely-held, farmer's, can't
• Norwegian, Swedish
– Compounding, separable verbs
• Arabic, Italian
– Clitics, al, ...
• ...
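To see how much the delimiter policy matters even for English, the sketch below tokenises one sentence under two policies. Both policies and the example sentence are assumptions made for illustration; neither is presented as the policy used in the project.

```python
import re

text = "The farmer's view is widely-held but they can't co-operate."

# Policy A: a word is whatever sits between spaces (punctuation stays attached).
by_spaces = text.split()

# Policy B: a word is an unbroken run of ASCII letters, so hyphens and
# apostrophes split words apart.
letters_only = re.findall(r"[A-Za-z]+", text)

print(len(by_spaces), by_spaces)
print(len(letters_only), letters_only)
```

The two policies give different token counts for the same sentence, so any frequency list depends on which one was chosen.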

Words and lemmas

• Word form (in text)
– invading
• Lemma (dictionary headword)
– invade, for forms invade, invades, invaded, invading
• Lemmatisation
– Chinese: none; English: simple
– Middling: Swedish, Norwegian, Italian, Greek
– Tough: Russian, Polish, Arabic
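Moving from word forms to lemmas amounts to summing the counts of all forms that share a headword. The sketch below uses a hand-written toy lookup table (an assumption for illustration); a real pipeline would use a lemmatiser for the language in question.

```python
from collections import Counter

# Hypothetical toy table mapping word forms to lemmas.
LEMMA_OF = {
    "invade": "invade",
    "invades": "invade",
    "invaded": "invade",
    "invading": "invade",
}

def lemma_frequencies(form_counts, lemma_of=LEMMA_OF):
    """Collapse word-form counts into lemma counts; unknown forms stay as-is."""
    lemma_counts = Counter()
    for form, count in form_counts.items():
        lemma_counts[lemma_of.get(form, form)] += count
    return lemma_counts

forms = {"invade": 3, "invades": 1, "invaded": 5, "invading": 2, "army": 4}
print(lemma_frequencies(forms))
```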

Word Families

• Derivational morphology
– efficient/efficiently
– access/accessible/accessibility
– available/availability/unavailable
• ‘Word families’ tradition
– e.g. Coxhead, Academic Word List
• Pedagogy: one item to learn
• But
– Where do families end?
– Different meanings

Grammatical classes

• brush (verb) and brush (noun)
– Same item or different?
– (both in same word family)
• Required
– (short) list of word classes
– POS-tagger: will make mistakes

Marginal cases

• Numbers
– twelve, seventeenth, fifties
• Closed sets
– Days of week, months
• Countries
– Capitals, nationalities, currencies, adjectives, languages
• regional/dialects, political groups, religions
– easter, christmas, islam, republican
• policies always needed

Multiwords

• 'according to'
– Linguistically a word, but
• Multiword frequency list: top item is 'of the'
– Can't use freqs (alone) to select multiwords
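A small sketch of why raw frequency alone does not pick out multiwords: counting adjacent word pairs in a toy sentence (invented here for illustration) puts 'of the' at the top, well ahead of 'according to'.

```python
from collections import Counter

text = ("according to the head of the school the rest of the staff "
        "left at the end of the day")
tokens = text.split()

# Count every pair of adjacent tokens (bigrams).
bigrams = Counter(zip(tokens, tokens[1:]))

print(bigrams.most_common(3))
print(bigrams[("according", "to")])
```

Some further signal beyond frequency is needed to separate lexical units such as 'according to' from frequent but uninteresting sequences.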

Homonymy

• bank (river) and bank (money)
• Word sense disambiguation
– We can't do (with decent accuracy)
– We can't give freqs for senses
• Lists of words not meanings
– Sometimes disconcerting

Corpora

• A fairly arbitrary sample of a language
• To limit arbitrariness of wordlist
– Make it big and diverse
• WACKY corpora
– From web
– Can do for any language
– ??? Comparable ???
– Web language: less formal

Comparing corpora

• Corpora: new
– We are all beginners
• Best way to get sense of a corpus
– Compare with another
– Keywords of each vs. other
• Case studies
• Sketch Engine functions

Comparing frequency lists

• Web1T

– Present from Google

– All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English

• that’s 1,000,000,000,000

• Compare with BNC

– Take top 50,000 items of each

– 105 Web1T words not in BNC top50k

– 50 words with highest Web1T:BNC ratio

– 50 words with lowest ratio
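The comparison above can be sketched as follows. This is an illustration under assumptions: frequencies are normalised per million words before taking ratios (the slide does not state the normalisation), and all function names and toy counts are hypothetical.

```python
def missing_from(freq_a, freq_b):
    """Words in list A that do not appear in list B at all."""
    return freq_a.keys() - freq_b.keys()

def frequency_ratios(freq_a, freq_b, size_a, size_b):
    """Ratio A:B of per-million frequencies, for words present in both lists."""
    ratios = {}
    for word in freq_a.keys() & freq_b.keys():
        per_million_a = freq_a[word] * 1_000_000 / size_a
        per_million_b = freq_b[word] * 1_000_000 / size_b
        ratios[word] = per_million_a / per_million_b
    return ratios

def extremes(ratios, k=50):
    """Return the k words with the highest ratio and the k with the lowest."""
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    return ranked[:k], ranked[-k:]

# Toy counts, invented for illustration.
web = {"url": 500, "she": 100, "the": 6000, "spyware": 40}
bnc = {"url": 1, "she": 400, "the": 6000}
ratios = frequency_ratios(web, bnc, 100_000, 100_000)
print(missing_from(web, bnc))
print(extremes(ratios, k=1))
```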

Web-high (155 terms)

• 61 web and computing
– config browser spyware url www forum
• 38 porn
• 22 US English (incl Spanish influence: los)
• 18 business/products common on web
– poker viagra lingerie ringtone dvd casino rental collectible tiffany
– NB: BNC is old
• 4 legal
– trademarks pursuant accordance herein

Web-low

• Exclude British English, transcription/tokenisation anomalies

– herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him

Observations

• Pronouns and past tense verbs

– Fiction

• Masc vs fem

• Yesterday

– Probably daily newspapers

• Constancy of ratios:

– He/him/himself

– She/her/herself

Corpus Factory

• Many languages
• General corpus, 100m+ words
• Fast
• High quality
• Comparable across languages

Gather Seed words

• Wikipedia (Wiki) corpora
– many domains
– free
– 265 languages covered, more to come
• Extract text from Wiki
– Wikipedia 2 Text
• Tokenise the text
– Morphology of the language is important
– Can use the existing word tokeniser tools

Web Corpus Statistics

Language     Unique URLs collected   After filtering   After de-duplication   Web corpus size   Words
Dutch        97,584                  22,424            19,708                 739 MB            108.6 m
Hindi        71,613                  20,051            13,321                 424 MB            30.6 m
Telugu       37,864                  6,178             5,131                  107 MB            3.4 m
Thai         120,314                 23,320            20,998                 1.2 GB            81.8 m
Vietnamese   106,076                 27,728            19,646                 1.2 GB            149 m

Evaluation

• For each of the languages, two corpora available: Web and Wiki
– Dutch: also a carefully designed lexicographic corpus
• Hypothesis: Wiki corpora are ‘informational’
– Informational --> typical written
– Interactional --> typical spoken

Evaluation

• 1st, 2nd person pronouns: strong indicators of interactional language
– English: I me my mine you your yours we us our
• For each language
– Take ten commonest 1st and 2nd person pronouns
– For each, calculate ratio: web:wiki
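A sketch of the pronoun test, assuming the web:wiki ratio is taken over frequencies relative to corpus size (the slide only says "ratio"). Function names and the toy counts are hypothetical; the summary returns average, minimum and maximum per language.

```python
def pronoun_ratios(web_freq, wiki_freq, web_size, wiki_size, pronouns):
    """web:wiki ratio of relative frequency for each pronoun found in the Wiki corpus."""
    ratios = {}
    for pronoun in pronouns:
        web_count = web_freq.get(pronoun, 0)
        wiki_count = wiki_freq.get(pronoun, 0)
        if wiki_count > 0:
            # (web_count / web_size) / (wiki_count / wiki_size), rearranged
            # to avoid dividing two small fractions.
            ratios[pronoun] = (web_count * wiki_size) / (wiki_count * web_size)
    return ratios

def summarise(ratios):
    """Average, minimum and maximum of the ratios."""
    values = list(ratios.values())
    return sum(values) / len(values), min(values), max(values)

# Toy counts, invented for illustration.
web = {"i": 30, "you": 20}
wiki = {"i": 10, "you": 5}
print(summarise(pronoun_ratios(web, wiki, 1000, 1000, ["i", "you"])))
```

A ratio above 1 means the pronoun is relatively more frequent in the web corpus, which is what the hypothesis predicts.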

Results: ratios, web:wiki

Language Average Min Max

Dutch 2.98 2.03 10.03

Hindi 5.36 1.85 11.50

Telugu 4.96 0.54 7.34

Thai 2.40 0.63 7.87

Vietnamese 3.82 1.81 19.41

KELLY

• EU lifelong learning project
• Goal: wordcards
– Word in one language on one side, other on other
– Language learning
• 9 languages, 36 pairs
– Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian, Swedish
• Partners in 6 countries

Method

• Prepare monolingual lists
• Translate
– Each into 8 target languages
– Professional translation services
• Integrate, finalise
• Produce cards
• Goal for each set
– 9000 pairs at 6 levels
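The arithmetic behind the project's numbers can be checked directly: 9 languages give 36 unordered pairs, and translating each of the 9 lists into the 8 other languages gives 72 translation directions. The variable names are illustrative.

```python
from itertools import combinations, permutations

LANGUAGES = ["Arabic", "Chinese", "English", "Greek", "Italian",
             "Norwegian", "Polish", "Russian", "Swedish"]

# Unordered language pairs: one card set per pair.
pairs = list(combinations(LANGUAGES, 2))

# Ordered (source, target) pairs: each list translated into 8 targets.
directions = list(permutations(LANGUAGES, 2))

print(len(pairs))
print(len(directions))
```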

Stages

• Sort out corpora, tagging
• Automatically generate M1 lists
– names, numbers, countries ...
– keywords vis-a-vis other corpora
• Review, compare, prepare M2 lists
• Translate
• Use translations: M3 lists
• Finalise

review - how?

• points system
– 2 points for each of 6 levels
– 12 points for most freq words
• deduct points for words in over-represented areas
• add in words from other corpora

Translation database

• On the web
• All translations entered into it
• Queries like
– All Swedish words used as translations more than six times
– All 1:1:1:1... 'simple cases'
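The first example query can be sketched against an in-memory stand-in for the database. The row layout (source language, source word, target language, target word) and the function name are assumptions made for illustration; the real database schema is not described in the slides.

```python
from collections import Counter

def frequent_targets(rows, target_lang, min_uses):
    """Target-language words used as translations more than min_uses times.

    rows: iterable of (source_lang, source_word, target_lang, target_word).
    """
    counts = Counter(
        target_word
        for _, _, row_lang, target_word in rows
        if row_lang == target_lang
    )
    return {word for word, count in counts.items() if count > min_uses}

# Toy rows, invented for illustration: seven source words share one
# Swedish translation, one source word has its own.
rows = [("en", f"word{i}", "sv", "ta") for i in range(7)]
rows.append(("en", "dog", "sv", "hund"))
print(frequent_targets(rows, "sv", 6))
```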

Using the translations database

• Find words not in M2 lists, that need adding
• Multiwords
– English 'look for'
– Probably the translation of a high-freq word in several of the 8 other languages
– So: add it to English list
• Homonyms: could be similar

Monolingual master lists (M3)

• Based on a WAC corpus
• Input from other same-language corpora
• And from translations from 8 languages
– Useful words which might not be hi-freq
– added words/multiwords must be above a lower freq threshold
• Target 9000

Numbers

• Target: 9000 per list
• M2 lists
– Estimate: 5000-6000 needed
– We add 3000-4000 multiwords and other 'back-translations'

Current status

• M1 lists prepared
• Lists checked, compared with other lists
– Corpus-based and other
• M2 lists prepared
• Translation underway

Big problems

• Multiwords (as anticipated)
• Homonymy (as anticipated)
• orange banana alphabet elbow, Hello
– Worse than anticipated
– Lists from spoken corpora, learner corpora, needed
• Relation between
– Competence for communicating
– The corpora at our disposal

Word lists are useful, but

• ...are they scientific?
– A tiny bit, occasionally
• ...could they be scientific?
– Yes
– article of faith
– By the end of KELLY, we'll have a clearer idea how

And now for something completely different: DANTE Lexical database for English

• Detailed, accurate, extensive, of English
• Highly corpus-driven
• 3 yr project
• 18 expert lexicographers
• Led by Sue Atkins
– BNC, FrameNet, Euralex, COBUILD...
• English side, New English-Irish dictionary
• Available for NLP research imminently