English Corpora and Language Learning Tamás Váradi [email protected]


Page 1: English Corpora and Language Learning

English Corpora and Language Learning

Tamás Váradi

[email protected]

Page 2: English Corpora and Language Learning

Outline

What is a Corpus?

Compiling a corpus

First generation of corpora: BROWN, LOB

The Age of Mega Corpora

British National Corpus

International Corpus of English

International Corpus of Learner English

The Web as a corpus?

Availability

Page 3: English Corpora and Language Learning

Corpora?

(1) A collection of texts, especially if complete and self-contained; the corpus of Anglo-Saxon verse

(2) In linguistics and lexicography, a body of texts, utterances or other specimens considered more or less representative of a language and usually stored as an electronic database

(The Oxford Companion to the English Language 1992)

A collection of naturally occurring language text chosen to characterize a state or variety of a language

John Sinclair, Corpus, Concordance, Collocation, OUP 1991

Page 4: English Corpora and Language Learning

The pre-electronic era

Huge, painstaking manual effort

Covering a closed body of texts

Bible Concordance

Shakespeare Concordance

Attempt to capture the whole language

Page 5: English Corpora and Language Learning

Compiling a corpus

Aim

provide solid empirical evidence about language

Design

geographical and chronological bounds

speakers, genres

defined by future use

Representative corpora?

Annotation

Output

Page 6: English Corpora and Language Learning

Corpus Linguistics: the early phase

Early Sixties

BROWN Corpus

500 texts of 2000 words each

LOB corpus

British counterpart

Classic reference works

Part of speech tagged

Page 7: English Corpora and Language Learning

Survey of English Usage

A major undertaking at UCL led by Sidney Greenbaum

1 m word compilation

very careful annotation

500,000 words of spoken material

LONDON-LUND Corpus

Page 8: English Corpora and Language Learning

Structure of SEU

Page 9: English Corpora and Language Learning

LOB corpus: a sample

•A01 2 ^ *'_*' stop_VB electing_VBG life_NN peers_NNS **'_**' ._.

•A01 3 ^ by_IN Trevor_NP Williams_NP ._.

•A01 4 ^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN

•A01 4 nominating_VBG any_DTI more_AP labour_NN

•A01 5 life_NN peers_NNS is_BEZ to_TO be_BE made_VBN at_IN a_AT meeting_NN

•A01 5 of_IN labour_NN \0MPs_NPTS tomorrow_NR ._.
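The tagged lines above can be read mechanically. The following is a minimal Python sketch, assuming a layout inferred from this sample only (text id, line number, an optional "^" sentence-start marker, then word_TAG pairs); the function name `parse_lob_line` is hypothetical and not part of any LOB distribution.

```python
def parse_lob_line(line):
    """Parse one line of the tagged sample into (text_id, line_no, tokens).

    Assumed layout (inferred from the sample, not from the LOB manual):
    an optional bullet, a text id, a line number, an optional "^"
    sentence-start marker, then word_TAG pairs separated by spaces.
    """
    fields = line.lstrip("•").split()
    text_id, line_no = fields[0], int(fields[1])
    tokens = []
    for field in fields[2:]:
        if field == "^":  # assumed sentence-start marker, not a token
            continue
        # Split on the last underscore so the tag is always the final part.
        word, sep, tag = field.rpartition("_")
        if sep:  # skip anything that is not a word_TAG pair
            tokens.append((word, tag))
    return text_id, line_no, tokens


sample = "•A01 3 ^ by_IN Trevor_NP Williams_NP ._."
print(parse_lob_line(sample))
```

Splitting on the last underscore keeps punctuation tokens such as `._.` intact as the pair (".", ".").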

Page 10: English Corpora and Language Learning

Concordance output
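A concordance lists every occurrence of a search word together with its immediate context. As a rough illustration only, here is a keyword-in-context sketch in Python, assuming whitespace-separated tokens and a fixed window; the function name `kwic` and the window size are assumptions, and the output does not reproduce any particular concordancer.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each match."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Untagged words taken from the LOB sample sentence.
text = ("a move to stop Mr Gaitskell from nominating any more labour "
        "life peers is to be made at a meeting of labour MPs tomorrow")

for left, key, right in kwic(text.split(), "labour"):
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>25}  {key}  {right}")
```

Aligning the keyword in a single column is what makes recurring patterns to its left and right easy to spot.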

Page 11: English Corpora and Language Learning

The age of Mega Corpora

COBUILD

John Sinclair at University of Birmingham

originally 20 m words

now the Bank of English, over 300 m words

the more the better

no fixed size: the idea of a Monitor corpus

Page 12: English Corpora and Language Learning

British National Corpus

A major undertaking in the mid-nineties

Birmingham, Lancaster – OUP, Longman, Chambers

100 m words carefully compiled

10 m words spoken data !

up-to-date standard SGML encoding

still the paradigm example of a reference corpus

Page 13: English Corpora and Language Learning

Accessing the BNC

Page 14: English Corpora and Language Learning

BNC-Baby

Page 15: English Corpora and Language Learning

Searching LOB/BROWN

Page 16: English Corpora and Language Learning

International Corpus of English

A network of corpora covering regional varieties of English

Project organized by UCL London

Each containing c. 1 m words

GB, Hong Kong, Australia, East Africa; more in preparation

Page 17: English Corpora and Language Learning

ICE-HK

Page 18: English Corpora and Language Learning

ICE-GB: sociolinguistic variation

Page 19: English Corpora and Language Learning

ICE-GB: syntactic annotation

Page 20: English Corpora and Language Learning

Treebanks

Geoffrey Sampson

Meticulously hand-crafted syntactic annotation

SUSANNE

CHRISTINE

LUCY

Penn Treebank

University of Pennsylvania

Massive amounts of automatically annotated data intended for natural language processing work

Page 21: English Corpora and Language Learning

International Corpus of Learner English

International Centre of English Corpus Linguistics, Catholic University of Louvain, led by Sylviane Granger

collection of essays

student profiles

Hungarian-English in preparation

Page 22: English Corpora and Language Learning

SUSANNE Corpus

Aims of the Scheme

comprehensive: covering all features of surface and logical English grammar that are definite enough to be susceptible of formal annotation, and including all phenomena that occur in practice in modern English

explicit: if two researchers at separate sites are given the same sample of English and asked to annotate it according to the SUSANNE standards, their annotations should be identical

nonpartisan: where aspects of grammar are the subject of theoretical controversy, the SUSANNE scheme aims to embody a neutral analysis which rival theoreticians can interpret in their own preferred terms
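The "explicit" aim can be checked empirically by comparing two annotators' output on the same sample. The sketch below, in Python, computes simple token-level agreement; it is an illustrative assumption about how such a check might look, not a measure taken from the SUSANNE documentation, and the name `agreement_rate` and the example tags are hypothetical.

```python
def agreement_rate(tags_a, tags_b):
    """Share of positions where two annotators assigned the same label."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotations must cover the same tokens")
    if not tags_a:
        return 1.0  # nothing to disagree about
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


# Hypothetical tags from two annotators for the same four tokens.
annotator_a = ["AT", "NN", "TO", "VB"]
annotator_b = ["AT", "NN", "IN", "VB"]
print(agreement_rate(annotator_a, annotator_b))
```

Under the stated aim, a fully explicit scheme would give a rate of 1.0 for any pair of trained annotators.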

Page 23: English Corpora and Language Learning

The Web as a corpus

Why sample when you can access the whole?

Huge and ever changing

The ultimate in authenticity?

Not necessarily …

Page 24: English Corpora and Language Learning

The Webcorp project

Page 25: English Corpora and Language Learning

http://devoted.to/corpora