The Cambridge Learner Corpus, English Profile, the Sketch Engine and the Kelly Project

Adam Kilgarriff
Lexical Computing Ltd

http://www.sketchengine.co.uk

The Cambridge Learner Corpus, English Profile, the Sketch Engine, “freely available”, HOO, DANTE and the Kelly Project

Cambridge Learner Corpus (CLC)

• Since 1993
  – Nearly as old as CECL
• Leading resource (like ICLE)
• CUP and Cambridge ESOL
  – For better dictionaries, ELT courses, tests
  – Material: all from exams (levels A1-C2)
• 45m words; 22m error-tagged
• 200,000 scripts, 138 L1s, 203 nationalities

English Profile

• From 2006
• Cambridge Univ, Univ Press, ESOL (+ others)
• Goal
  – For each CEFR level, find characteristic lexis and grammar
  – Main resource: CLC
  – Talk on Thursday
    • Theodora Alexopoulou, Helen Yannakoudakis

Flyers

Sketch Engine

• Leading corpus tool
• Word sketches
  – One-page summaries of a word’s grammatical and collocational behaviour
• In use at OUP, CUP, Collins, Macmillan, INL …
• 42 languages
  – Over 150 corpora
  – Since May including CHILDES: demo
  – Since last year including CLC

Error-coded corpus

• Challenge
  – Intuitive to search for x
    • anywhere
    • only where it is part of an error
    • only where it is part of a correction

where x can be a word, phrase, grammar pattern …

Requirement for CLC in Sketch Engine

Sample text

• We will only use those informations to take part of our guest survey
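
The three search scopes can be pictured by giving every token a layer label. This is a minimal sketch only: the `Token` class, the `/e` and `/c` toy notation, the layer names and the `search` function are all hypothetical, and are not the CLC markup or the Sketch Engine query language. The corrections attached to the sample sentence are assumed for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    text: str
    layer: str  # "plain", "error" or "correction" (hypothetical labels)


def annotate(spec: str) -> List[Token]:
    """Build tokens from a toy notation: word, word/e (error), word/c (correction)."""
    layers = {"e": "error", "c": "correction"}
    tokens = []
    for item in spec.split():
        word, _, flag = item.partition("/")
        tokens.append(Token(word, layers.get(flag, "plain")))
    return tokens


# Sample sentence from the slide; the corrections are assumed, not from the CLC.
SAMPLE = annotate(
    "We will only use those/e informations/e that/c information/c "
    "to take part of/e in/c our guest survey"
)


def search(tokens: List[Token], word: str, scope: str = "anywhere") -> List[int]:
    """Return positions of `word`; scope is "anywhere", "error" or "correction"."""
    if scope not in ("anywhere", "error", "correction"):
        raise ValueError(f"unknown scope: {scope}")
    return [
        i for i, tok in enumerate(tokens)
        if tok.text.lower() == word.lower()
        and (scope == "anywhere" or tok.layer == scope)
    ]
```

With this layout, `search(SAMPLE, "information", "correction")` finds the corrected form, while `search(SAMPLE, "information", "error")` finds nothing, because the learner wrote "informations".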

Error-coded corpora in SkE

• demo

freely available

Free (MED online)
• Sense 1: not costing anything
• Sense 4: not limited by rules … anyone can get hold of it??

Available
• To download onto your computer
• To use

Case studies

              ICLE       CLC
Money         225 EUR    No
To everyone   Yes        Cambridge author/collab
To download   ?          No
To use        Yes        Yes

Non-geeks

• Access is important, not download
• Web is beautiful

HOO / HOO+

• Helping Our Own
• HOO: English-NNS NLP researchers
  – Developer = user: motivation
  – Shared task/competitive evaluation
    • Organisers define task and prepare ‘gold standard’
    • Teams participate by running their software over test data
    • Six teams (incl Tübingen), workshop end Sept

HOO+ (2012)

• Probably
  – English: learner data from CLC
  – Other languages?
  – Tasks
    • Essay scoring
    • Determiner, preposition errors
    • ?
• http://www.clt.mq.edu.au/research/projects/hoo/

DANTE

Highlights of English lexicography

http://webdante.com

Flyers

The KELLY Project

• EU Lifelong Learning Project
• Word cards
  – 9 languages
    • Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian, Swedish
  – All 36 pairs
  – Words the learner should know (at A1 … C2)
• Partners
  – Stockholm Univ, Gothenburg Univ, Adam Mickiewicz Univ, ILSP Athens, CNR Pisa, Oslo Univ, Leeds Univ, Keewords A/S, Lexical Computing Ltd

Interesting question

• How close to purely corpus-based can a pedagogic list be?

Method

• Take a general corpus
• Count
• Review, add, delete using other lists and corpora
• Translate (72 directed language pairs)
• Words not in source list which occur in translations:
  – Review source list

• http://kelly.sketchengine.co.uk
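
The counting and review steps of the method can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual pipeline: the function names `frequency_list` and `review_candidates` are hypothetical, and the real lists were also reviewed by hand against other lists and corpora. The sketch also checks the arithmetic behind the 36 pairs and the 72 directed language pairs.

```python
from collections import Counter
from itertools import combinations, permutations

LANGUAGES = ["Arabic", "Chinese", "English", "Greek", "Italian",
             "Norwegian", "Polish", "Russian", "Swedish"]

# Unordered pairs of two different languages: 9 * 8 / 2 = 36.
PAIRS = list(combinations(LANGUAGES, 2))

# Ordered (source, target) pairs, one per translation direction: 9 * 8 = 72.
DIRECTED_PAIRS = list(permutations(LANGUAGES, 2))


def frequency_list(corpus_tokens, size):
    """Count tokens and keep the `size` most frequent as a first-draft list."""
    counts = Counter(token.lower() for token in corpus_tokens)
    return [word for word, _ in counts.most_common(size)]


def review_candidates(translations_into_lang, word_list):
    """Words that occur as translations into a language but are missing
    from that language's own list; these prompt a review of the list."""
    known = set(word_list)
    return sorted({w for w in translations_into_lang if w not in known})
```

For example, if "hospital" keeps turning up as the English translation of words from the other eight lists but is absent from the English list, `review_candidates` flags it for review.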

• Symmetrical pairs: <x,y> and <y,x>
• Cliques:
  – For x, y, z, … all pairs are symmetrical
  – 9-language cliques (English members)
    • hospital library music sun theory
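
Symmetrical pairs and cliques can be checked mechanically once translations are stored as directed links. The sketch below assumes each word is a `(language, word)` tuple and each link a `(source, target)` tuple; the function names and the toy links are invented for illustration and are not taken from the Kelly database.

```python
from itertools import combinations


def symmetrical_pairs(links):
    """Keep <x,y> only when <y,x> is also present; each pair appears once."""
    link_set = set(links)
    return {
        frozenset((x, y))
        for x, y in link_set
        if x != y and (y, x) in link_set
    }


def is_clique(items, links):
    """True when every pair of items is linked in both directions."""
    symmetrical = symmetrical_pairs(links)
    return all(frozenset(pair) in symmetrical for pair in combinations(items, 2))


# Toy data (invented): a three-language clique plus a one-way link.
EN_SUN, SV_SOL, IT_SOLE = ("en", "sun"), ("sv", "sol"), ("it", "sole")
EN_BIG, SV_STOR = ("en", "big"), ("sv", "stor")

LINKS = [
    (EN_SUN, SV_SOL), (SV_SOL, EN_SUN),
    (EN_SUN, IT_SOLE), (IT_SOLE, EN_SUN),
    (SV_SOL, IT_SOLE), (IT_SOLE, SV_SOL),
    (EN_BIG, SV_STOR),  # translated one way only, so not symmetrical
]
```

Here `is_clique([EN_SUN, SV_SOL, IT_SOLE], LINKS)` holds, while `is_clique([EN_BIG, SV_STOR], LINKS)` does not. A 9-language clique is the same test applied to one word from each of the nine lists.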

Homage

Recommended