BTANT 129 w5 Introduction to corpus linguistics


Corpus

• The old school concept
– A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon verse (The Oxford Companion to the English Language)
• The modern view
– A collection of naturally occurring language text, chosen to characterize a state or variety of a language (John Sinclair: Corpus, Concordance, Collocation. OUP)


Corpus vs. archive

• Text archive
– Collection of texts in their original format (Oxford Text Archive: http://ota.ox.ac.uk/)
• Corpus
– Texts collected and processed in a unified, systematic manner (British National Corpus: http://www.natcorp.ox.ac.uk/)


Short history

Brief mention of just a select few!
• Brown Corpus (Brown University)
– 1 m words
– 15 genres
– 500 samples of 2000 words each
– Area: US
– Time: 1961
• LOB Corpus (Lancaster-Oslo/Bergen)
– British English counterpart of Brown


Cobuild

• Major corpus initiative by Collins and Birmingham University (John Sinclair)
• 1991: 20 m words
• -> Bank of English: currently 450 m words
• http://www.cobuild.collins.co.uk


British National Corpus

• 100 m words, careful selection
• 10% spoken material
• time span: from 1960 (fiction) and from 1975 (non-fiction)
• texts of 40-50,000 words
• TEI-compliant SGML coding
• http://www.comp.lancs.ac.uk/ucrel/bncindex/


International Corpus of English

• 20 corpora of 1 m words each, devoted to varieties of English around the world
• 500 texts (300 spoken, 200 written) of 2000 words each
• time span: 1990-1996
• ICE-GB available in demo version
• syntactic annotation, graphical tool: ICECUP


Corpus processing: tokenization

• Preprocessing
– Tokenization: segmenting the text into
  • sentences (sometimes tricky: sentence delimiters can occur in mid-sentence positions)
  • words (multi-word units are a problem)
– Normalization
  • restoring the full forms of clitics and abbreviations ("can't", "I've")

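The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the abbreviation list, the clitic table and all function names are assumptions made up for the example.

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc."}
CLITICS = {"can't": ["can", "not"], "i've": ["I", "have"]}


def split_sentences(text):
    """Split on . ! ? unless the delimiter ends a known abbreviation."""
    sentences, current = [], []
    for chunk in text.split():
        current.append(chunk)
        if chunk[-1] in ".!?" and chunk.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation marks."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|[^\sA-Za-z]", sentence)


def normalize(tokens):
    """Expand the clitic forms listed in CLITICS; leave other tokens alone."""
    out = []
    for tok in tokens:
        out.extend(CLITICS.get(tok.lower(), [tok]))
    return out


if __name__ == "__main__":
    for s in split_sentences("Dr. Smith can't come. I've seen him."):
        print(normalize(tokenize(s)))
```

The full stop after "Dr." is the kind of mid-sentence delimiter the slide warns about; here it is handled only because "dr." happens to be in the abbreviation list. Multi-word units are not handled at all, which is exactly the problem noted above.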

Corpus processing: tagging

• Tagging
– labelling every word with its part-of-speech (POS) category
– Problem: ambiguity
  • out of context, words can belong to different parts of speech or have different analyses within the same POS
    – set N vs. set V
    – Hungarian bánt: past tense of bánik 'to treat' (VBD) or present tense of bánt 'to hurt' (VBZ)
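One way to picture the ambiguity problem is a lexicon that lists, for each word form, every tag it can carry out of context. The sketch below is a toy illustration; the lexicon entries, the tag names and the function names are assumptions, not a real tagset or tool.

```python
# Toy lexicon: each word form maps to every tag it can carry out of context.
# Entries and tag names are illustrative assumptions only.
LEXICON = {
    "the": {"DET"},
    "set": {"N", "V"},
    "table": {"N", "V"},
    "they": {"PRON"},
}


def candidate_tags(word):
    """Return the set of possible tags for a word form (empty if unknown)."""
    return LEXICON.get(word.lower(), set())


def ambiguous_words(tokens):
    """List the tokens that have more than one candidate tag."""
    return [t for t in tokens if len(candidate_tags(t)) > 1]


if __name__ == "__main__":
    print(ambiguous_words(["They", "set", "the", "table"]))
```

Every token returned by `ambiguous_words` needs a decision that the lexicon alone cannot make; that decision is the disambiguation step.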


Corpus processing: disambiguation

• Disambiguation
– defining the correct analysis in context
• Two approaches (both need a manually corrected training corpus):
– statistical
  • Hidden Markov model
  • calculating probability within a span of usually one or two words
  • rate of success can be around 98%
– rule-based
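The statistical approach can be sketched as a bigram Hidden Markov model: tag-to-tag and tag-to-word probabilities are counted from a hand-tagged corpus, and the Viterbi algorithm picks the most probable tag sequence. The three training sentences, the tag names and the add-one smoothing are all assumptions chosen to keep the example small; a real tagger would train on a large corrected corpus.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged training corpus, invented for this example.
TRAINING = [
    [("they", "PRON"), ("set", "V"), ("the", "DET"), ("table", "N")],
    [("the", "DET"), ("set", "N"), ("is", "V"), ("complete", "ADJ")],
    [("they", "PRON"), ("bought", "V"), ("the", "DET"), ("set", "N")],
]


def train(corpus):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sentence in corpus:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit


def prob(counter, key, n_outcomes):
    """Add-one smoothed relative frequency."""
    return (counter[key] + 1) / (sum(counter.values()) + n_outcomes)


def viterbi(words, trans, emit):
    """Return the most probable tag sequence under a bigram HMM."""
    tags = sorted(emit)
    vocab = {w for c in emit.values() for w in c}
    n_words = len(vocab) + 1  # one extra slot for unseen words
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {
        t: (prob(trans["<s>"], t, len(tags)) * prob(emit[t], words[0], n_words), [t])
        for t in tags
    }
    for word in words[1:]:
        step = {}
        for t in tags:
            p, path = max(
                (best[s][0] * prob(trans[s], t, len(tags)), best[s][1])
                for s in tags
            )
            step[t] = (p * prob(emit[t], word, n_words), path + [t])
        best = step
    return max(best.values())[1]


if __name__ == "__main__":
    trans, emit = train(TRAINING)
    print(viterbi(["they", "set", "the", "table"], trans, emit))
    print(viterbi(["the", "set"], trans, emit))
```

The transition table looks only one tag back, which corresponds to the "span of one word" on the slide; a trigram model would condition on the two previous tags. With this toy data, "set" comes out as a verb after "they" and as a noun after "the".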


Syntactic annotation

• Difficult to do on such a scale
• Shallow parsing
• Treebank: collection of syntactically analyzed sentences
• Penn Treebank
• http://www.cis.upenn.edu/~treebank/
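Treebank sentences are commonly stored as labelled bracketings. As a rough sketch, assuming a simplified bracket notation in the style of the Penn Treebank, the following reads one such string into nested tuples. The example sentence and its labels are made up for illustration, and the function names are hypothetical.

```python
def parse_brackets(text):
    """Parse a bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1  # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())  # nested constituent
            else:
                children.append(tokens[pos])  # word at a leaf
                pos += 1
        pos += 1  # skip ")"
        return (label, children)

    return read()


def leaves(tree):
    """Return the words of a parsed tree in sentence order."""
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


if __name__ == "__main__":
    # Illustrative bracketing, not taken from an actual treebank file.
    tree = parse_brackets(
        "(S (NP (PRP They)) (VP (VBD set) (NP (DT the) (NN table))))"
    )
    print(tree[0], leaves(tree))
```

The sketch assumes well-formed input with balanced brackets; it does no error recovery.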


Recent trends

• Word sense disambiguation (SENSEVAL)
– http://www.itri.brighton.ac.uk/events/senseval/
• Message understanding
– http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html
• Semantic Web
– making information on the web understandable for machines
– a vision requiring a huge effort; not clear whether feasible at all


Representative sample?

• A corpus of any size is inevitably a sample
• Of what?
• Two approaches
– sampling speakers: demographic sampling
– sampling their output: text-type sampling


The notion of representativeness

• Sample vs. population
• The sample should be proportional to the population for a given feature
– Example of demographic sampling: if we know from census figures that 48% of people living in Budapest are male, we should compile our sample so that 48% of the informants are male
-> our sample is representative of Budapest residents for gender
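The proportional allocation in the example can be worked out mechanically. The sketch below assumes that group shares are given as whole percentages summing to 100 and uses largest-remainder rounding so the quotas always add up to the sample size; the function name and the rounding rule are assumptions, not part of the lecture.

```python
def proportional_quota(population_pct, sample_size):
    """Allocate sample slots so each group matches its population share.

    population_pct maps group names to whole percentages summing to 100.
    Largest-remainder rounding keeps the total equal to sample_size.
    """
    scaled = {g: pct * sample_size for g, pct in population_pct.items()}
    quota = {g: s // 100 for g, s in scaled.items()}
    leftover = sample_size - sum(quota.values())
    # Hand remaining slots to the groups with the largest remainders.
    by_remainder = sorted(scaled, key=lambda g: scaled[g] % 100, reverse=True)
    for g in by_remainder[:leftover]:
        quota[g] += 1
    return quota


if __name__ == "__main__":
    # 48% male, as in the Budapest example; 200 informants is an arbitrary size.
    print(proportional_quota({"male": 48, "female": 52}, 200))
```

With 200 informants this gives 96 male and 104 female, so the sample mirrors the census figure for gender. It says nothing about any other feature, such as age or education.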


Trouble with representativeness

• What should be the units of sampling?
• Registers, text types, genres etc.
• But there is no independent evidence about their ratio in the totality of language output
-> representativeness is an ideal, but impossible to implement


Approaches to Representativeness

• Douglas Biber:
– Rejects the notion of proportional sampling
– Sample should be as varied as possible
– Representativeness is measured in terms of the wide variety of text types included in the sample


The Web as a corpus?

• Pros:
– immense database
– dynamically growing
– ideal 'quick and dirty' method
• Cons:
– lots of rubbish, irrelevant data
– difficult to extract hits
– no language analysis
– only string query, which is crude


One quick example

• Representativity or representativeness?
• Throw the two words at Google and have a look at the figures
• Think about the conclusions
• There are special front-end sites
