GDEX: Automatically finding good dictionary examples in a corpus Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell, Pavel Rychlý Lexical Computing Ltd, UK Masaryk University, Czech Rep A&C Black Publishers Ltd., UK Macmillan Education, UK Lexicography MasterClass Ltd., UK


GDEX: Automatically finding good dictionary examples in a corpus

Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell, Pavel Rychlý
Lexical Computing Ltd, UK
Masaryk University, Czech Rep
A&C Black Publishers Ltd., UK
Macmillan Education, UK
Lexicography MasterClass Ltd., UK


Users appreciate examples

- Paper: space constraints
- Electronic: no space constraints
  - Give lots of examples
  - Constraint: cost of selection, editing


Project

- Macmillan English Dictionary
  - Licensing arrangement with A&C Black
- Already had 1000 collocation boxes
  - See collocationality paper, ELX 2006
  - Average 8 per box
- New electronic version
  - All 8000 collocations need examples
  - Authentic; from corpus


Old method

- Lexicographer
  - Gets concordance for collocation
  - Reads through until they find a good example
  - Cut, paste, edit


New method

- Lexicographer
  - Gets sorted concordance
    - 20 best examples in spreadsheet
  - Less reading through
  - Tick the first good one, edit


What makes a good example?

- Readable
  - EFL users
- Informative
  - Typical, for the collocation
  - Gives context which helps user understand the target word/phrase


Readability

- 70 years of research
- Not just (or mainly) EFL
  - Educational theory
    - Teaching children to read
  - Instruction manuals
  - Publishing


Readability tests

- Flesch Reading Ease test (1948)
  - Average sentence length, average word length
  - In some word processing software
- Many similar measures
- Recent work
  - Language modelling from training data
- Target levels
  - US grades
  - Common European Framework
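To make the Flesch test concrete, here is a minimal sketch of the standard Reading Ease formula, 206.835 - 1.015 × (words per sentence) - 84.6 × (syllables per word), where word length is measured in syllables. The function names and the vowel-run syllable counter are illustrative assumptions; real implementations usually count syllables more carefully.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (assumption): count runs of vowels, minimum 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease score; higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


easy = "The cat sat on the mat."
hard = "Institutional considerations necessitate comprehensive organisational restructuring."
print(flesch_reading_ease(easy) > flesch_reading_ease(hard))
```

Short sentences made of short words score high; long sentences with polysyllabic words score low, which is the intuition the GDEX heuristics also rely on.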


GDEX

- Get concordance for collocation
- For each sentence
  - Score it
- Sort
- Show best ones
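The score, sort, show loop above can be sketched in a few lines. This is only an illustration: `best_examples` and `toy_score` are hypothetical names, the scorer is a stand-in rather than the GDEX scoring function, and the default of 20 results simply mirrors the "20 best examples" the lexicographers received.

```python
from typing import Callable, List


def best_examples(
    sentences: List[str],
    score: Callable[[str], float],
    top_n: int = 20,
) -> List[str]:
    """Score every concordance sentence, sort best first, keep the top few."""
    ranked = sorted(sentences, key=score, reverse=True)
    return ranked[:top_n]


def toy_score(sentence: str) -> float:
    """Placeholder scorer (assumption): prefer sentences near 18 words long."""
    return -abs(len(sentence.split()) - 18)


concordance = [
    "Short one.",
    "This sentence has a moderate number of words and so it should rank "
    "above the very short one here.",
]
for sentence in best_examples(concordance, toy_score, top_n=1):
    print(sentence)
```

Passing the scorer in as a function keeps the ranking step independent of whichever heuristics are used to judge a sentence.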


GDEX heuristics

- Sentence length (10-26 words)
- Mostly common words: good
- Rare words: bad
- Sentences
  - Start with capital, end with one of .!?
  - No [, ], <, >, http, \
- Penalise:
  - Other punctuation, numbers
  - More than 2 or 3 capitals
- Typicality: third collocate is a plus
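Several of these heuristics are simple enough to sketch directly. The functions below are hypothetical illustrations, not the GDEX code: the 10-26 word range and the forbidden tokens come from the list above, while the capital-letter threshold of 3, the treatment of commas and apostrophes as acceptable punctuation, and the caller-supplied common-word list are assumptions. The typicality heuristic is left out.

```python
import re
from typing import Set

# Tokens that the heuristics list says a sentence must not contain.
FORBIDDEN = ("[", "]", "<", ">", "http", "\\")


def is_well_formed(sentence: str) -> bool:
    """Starts with a capital, ends with . ! or ?, and has no forbidden token."""
    s = sentence.strip()
    if not s or not s[0].isupper() or s[-1] not in ".!?":
        return False
    return not any(token in s for token in FORBIDDEN)


def length_ok(sentence: str, lo: int = 10, hi: int = 26) -> bool:
    """True if the sentence has between lo and hi words, inclusive."""
    return lo <= len(sentence.split()) <= hi


def common_word_ratio(sentence: str, common_words: Set[str]) -> float:
    """Share of words found in a caller-supplied list of common words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return 0.0
    return sum(w in common_words for w in words) / len(words)


def penalty_count(sentence: str, max_capitals: int = 3) -> int:
    """Count penalised features: other punctuation, numbers, excess capitals."""
    s = sentence.strip()
    # Assumption: . ! ? , and ' are ordinary; anything else is "other".
    other_punct = len(re.findall(r"[^\w\s.!?,']", s))
    numbers = len(re.findall(r"\d+", s))
    capitals = sum(ch.isupper() for ch in s)
    return other_punct + numbers + max(0, capitals - max_capitals)


print(is_well_formed("See <b>this</b> page."))    # False: contains < and >
print(penalty_count("It cost 40 (forty) pounds."))  # 3: two brackets, one number
```

Returning separate values per heuristic, rather than a single verdict, leaves room for the heuristics to be weighted against one another.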


Weighting

- For each sentence
  - Score on each heuristic
  - Weight scores
  - Add together weighted score
- How to set weights?
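The weighting step amounts to a weighted sum over the per-heuristic scores. A minimal sketch, assuming each heuristic yields a number; the heuristic names and weight values below are invented for illustration and are not the weights used in the project.

```python
from typing import Dict


def weighted_score(
    heuristic_scores: Dict[str, float],
    weights: Dict[str, float],
) -> float:
    """Multiply each heuristic score by its weight and add the results."""
    return sum(weights[name] * value for name, value in heuristic_scores.items())


# Hypothetical scores for one sentence and hypothetical weights.
scores = {"length": 1.0, "common_words": 0.5}
weights = {"length": 2.0, "common_words": 4.0}
print(weighted_score(scores, weights))  # 4.0
```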


Machine learning

- Two students:
  - Manually judged 1000 “good examples”
- Weights set to mimic students' choices
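The slides do not say which learning method was used. As one possible illustration only, the sketch below assumes each judged sentence is reduced to a vector of heuristic scores plus a good/bad label, and picks, by a coarse grid search, the weights under which judged-good sentences most often outscore the others. All names, the grid values, and the toy data are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple

# (heuristic scores for one sentence, was it judged a good example?)
Example = Tuple[Sequence[float], bool]


def score(features: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of heuristic scores."""
    return sum(w * f for w, f in zip(weights, features))


def agreement(weights: Sequence[float], examples: List[Example]) -> int:
    """Number of (good, bad) pairs in which the good sentence scores higher."""
    goods = [f for f, is_good in examples if is_good]
    bads = [f for f, is_good in examples if not is_good]
    return sum(score(g, weights) > score(b, weights) for g in goods for b in bads)


def fit_weights(
    examples: List[Example],
    grid: Sequence[float] = (0.0, 0.5, 1.0),
) -> Tuple[float, ...]:
    """Try every weight combination on the grid; keep the best-agreeing one."""
    n_features = len(examples[0][0])
    return max(product(grid, repeat=n_features), key=lambda w: agreement(w, examples))


# Toy data: features are (length in range?, share of common words).
judged = [
    ((1.0, 0.9), True),
    ((1.0, 0.8), True),
    ((0.0, 0.95), False),
    ((1.0, 0.2), False),
]
best = fit_weights(judged)
print(best, agreement(best, judged))
```

A grid search is easy to follow but scales poorly with the number of heuristics; any standard classifier or ranking method could stand in for it under the same assumptions.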


Was it successful?

- Did it save lexicographer time?
  - Definitely (says project manager)
- Corpus choice
  - Started with BNC but
    - Too old
    - Not enough examples
      - If no good examples in corpus, GDEX can’t help
  - Changed to UKWaC
    - 20 times bigger; from web; contemporary
    - Better
    - Most web junk filtered out
- Usually a good example in top twenty


GDEX and TALC

- TALC
  - Teaching and Language Corpora
  - Goal: bring corpora into language teaching
- Usual problem
  - Concordances are tough for learners to read
- Way forward
  - GDEX examples
  - Half way between dictionary and corpus


GDEX: Models for use

- More examples for dictionaries
  - Speed up, as with MED
  - or fully automatic “more examples”
- Corpus query tool
  - Sort concordances, best first
  - Now an option in the Sketch Engine
- Automatic collocations dictionary
  - http://forbetterenglish.com