GDEX: Automatically finding good dictionary examples in a corpus
Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell, Pavel Rychlý
Lexical Computing Ltd, UK; Masaryk University, Czech Rep; A&C Black Publishers Ltd., UK; Macmillan Education, UK; Lexicography MasterClass Ltd., UK
Users appreciate examples
Paper: space constraints
Electronic: no space constraints
Give lots of examples
Constraint: cost of selection, editing
Project
Macmillan English Dictionary
Licensing arrangement with A&C Black
Already had 1000 collocation boxes
See collocationality paper, ELX 2006
Average 8 per box
New electronic version
All 8000 collocations need examples
Authentic; from corpus
Old method
Lexicographer
Gets concordance for collocation
Reads through until they find a good example
Cut, paste, edit
New method
Lexicographer
Gets sorted concordance
20 best examples in spreadsheet
Less reading through
Tick the first good one, edit
What makes a good example?
Readable
EFL users
Informative
Typical, for the collocation
Gives context which helps user understand the target word/phrase
Readability
70 years of research
Not just (or mainly) EFL
Educational theory
Teaching children to read
Instruction manuals
Publishing
Readability tests
Flesch Reading Ease test (1948)
Ave sentence length, ave word length
In some word processing software
Many similar measures
Recent work
Language modelling from training data
Target levels
US grades
Common European Framework
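As an illustration of the kind of test listed above, here is a minimal sketch of the Flesch Reading Ease score, which combines average sentence length with average word length in syllables. The function names are hypothetical, and the syllable counter is only a rough vowel-group approximation assumed for this sketch, not part of the published test.

```python
import re


def count_syllables(word: str) -> int:
    """Rough estimate (an assumption): one syllable per run of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    # Naive sentence and word splitting, good enough for a sketch.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    avg_sentence_length = len(words) / len(sentences)
    avg_syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word
```

Short sentences made of short words score high; long sentences with polysyllabic words score low or even negative.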
GDEX
Get concordance for collocation
For each sentence
Score it
Sort
Show best ones
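The score, sort, show loop above can be sketched as follows. This is only an outline under stated assumptions: `best_examples` is a hypothetical name, the scoring function is passed in by the caller, and the default of 20 mirrors the "20 best examples" given to the lexicographer.

```python
from typing import Callable, List, Tuple


def best_examples(
    sentences: List[str],
    score: Callable[[str], float],
    n: int = 20,
) -> List[Tuple[float, str]]:
    """Score every concordance line, sort best first, keep the top n."""
    scored = [(score(s), s) for s in sentences]
    # Sort on the score only; ties keep their original corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]
```

Any function that maps a sentence to a number can be plugged in as `score`, which keeps the ranking step separate from the choice of heuristics.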
GDEX heuristics
Sentence length (10-26 words)
Mostly common words: good
Rare words: bad
Sentences
Start with capital, end with one of .!?
No [, ], <, >, http, \
Penalise:
Other punctuation, numbers
More than 2 or 3 capitals
Typicality: third collocate is a plus
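The heuristics listed above might be turned into per-sentence features roughly as below. This is a sketch, not the GDEX implementation: the function name, the feature names, the caller-supplied `common_words` and `third_collocates` sets, the cut-off of more than 3 capitals, and the decision to treat every symbol other than `.`, `!` and `?` as "other punctuation" are all assumptions made for illustration.

```python
import re
from typing import AbstractSet, Dict

# Strings the slides say a good example should not contain.
BANNED = ("[", "]", "<", ">", "http", "\\")


def heuristic_features(
    sentence: str,
    common_words: AbstractSet[str],
    third_collocates: AbstractSet[str] = frozenset(),
) -> Dict[str, float]:
    """Return one numeric value per heuristic (all names are hypothetical)."""
    stripped = sentence.strip()
    tokens = stripped.split()
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", stripped)]
    # Assumption: anything that is not a letter, digit, space or .!? is "other".
    other_punct = sum(
        1 for ch in stripped
        if not (ch.isalnum() or ch.isspace() or ch in ".!?")
    )
    return {
        "length_ok": float(10 <= len(tokens) <= 26),
        "common_ratio": (
            sum(w in common_words for w in words) / len(words) if words else 0.0
        ),
        "well_formed": float(
            bool(stripped) and stripped[0].isupper() and stripped[-1] in ".!?"
        ),
        "no_banned": float(not any(b in stripped for b in BANNED)),
        "other_punct": float(other_punct),
        "has_digits": float(any(ch.isdigit() for ch in stripped)),
        "many_capitals": float(sum(ch.isupper() for ch in stripped) > 3),
        "third_collocate": float(any(w in third_collocates for w in words)),
    }
```

Keeping each heuristic as a separate named value makes it straightforward to weight them individually later.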
Weighting
For each sentence
Score on each heuristic
Weight scores
Add together weighted scores
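In other words, the final score is a weighted sum of the per-heuristic scores. A minimal sketch, with hypothetical feature names and weights (the actual GDEX weights are not given in the slides):

```python
from typing import Dict


def weighted_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Multiply each heuristic score by its weight and add the results."""
    # A heuristic with no weight contributes nothing to the total.
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


# Illustrative values only: positive weights reward, negative weights penalise.
example_weights = {"length_ok": 2.0, "has_digits": -1.5}
example_features = {"length_ok": 1.0, "has_digits": 1.0}
example_total = weighted_score(example_features, example_weights)
```

Penalised heuristics, such as numbers or extra punctuation, simply carry negative weights.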
How to set weights?
Machine learning
Two students: manually judged 1000 “good examples”
Weights set to mimic students' choices
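The slides do not say which learning method was used. As one possible illustration, the sketch below fits weights by logistic regression with plain gradient steps, so that sentences the judges marked as good receive higher weighted scores. The function name, the toy data and the hyperparameters are all assumptions.

```python
import math
from typing import List, Sequence


def fit_weights(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    epochs: int = 500,
    lr: float = 0.1,
) -> List[float]:
    """Learn one weight per heuristic from judged examples (1 = good, 0 = not)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-z))
            # Move each weight towards agreeing with the human judgement.
            for j in range(n_features):
                w[j] += lr * (yi - p) * xi[j]
    return w


# Toy data (hypothetical): columns are [length_ok, has_digits].
toy_X = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_y = [1, 1, 0, 0]
toy_weights = fit_weights(toy_X, toy_y)
```

On this toy data the learned weight for the length heuristic comes out positive and the weight for digits negative, matching the direction of the hand-written heuristics.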
Was it successful?
Did it save lexicographer time?
Definitely (says project manager)
Corpus choice
Started with BNC but
Too old
Not enough examples
If no good examples in corpus, GDEX can't help
Changed to UKWaC
20 times bigger; from web; contemporary
Better
Most web junk filtered out
Usually a good example in top twenty
GDEX and TALC
TALC
Teaching and Language Corpora
Goal: bring corpora into language teaching
Usual problem
Concordances are tough for learners to read
Way forward
GDEX examples
Half way between dictionary and corpus
GDEX: Models for use
More examples for dictionaries
Speed up, as with MED or
Fully automatic “more examples”
Corpus query tool
Sort concordances, best first
Now an option in the Sketch Engine
Automatic collocations dictionary
http://forbetterenglish.com