Workshop: Corpus (1)
What might a corpus of spoken data tell us about language?

OLINCO 2014, Olomouc, Czech Republic, June 7
Sean Wallis
Survey of English Usage, University College London
[email protected]


Outline

• What can a corpus tell us?
• The 3A cycle
• What can a parsed corpus tell us?
• ICE-GB and DCPSE
• Diachronic changes
  – Modal shall/will over time
• Intra-structural priming
  – NP premodification
• The value of interaction evidence


What can a corpus tell us?

• Three kinds of evidence may be obtained from a corpus
  – Frequency (distribution) evidence of a particular known linguistic event
  – Coverage (discovery) evidence of new events
  – Interaction evidence of the relationship between events
• But if these ‘events’ are lexical, this evidence can only really tell us about lexis
  – So corpus linguistics has always involved annotation


The 3A cycle

• Plain text corpora
  – evidence of lexical phenomena
• Need to annotate
  – add knowledge of frameworks
  – classify and relate phenomena
  – general annotation scheme
    • not focused on particular research goals
• Corpus research = the ‘3A’ cycle
  – Annotation → Abstraction → Analysis

[Figure: the 3A cycle. Text → (Annotation) → Corpus → (Abstraction) → Dataset → (Analysis) → Hypotheses. Abstraction is a data transformation (“operationalisation”).]


Annotation → Abstraction

• Abstraction
  – selects data from annotated corpus
  – maps it to a regular dataset for statistical analysis
  – bi-directional (“concretisation”)
    • allows us to interpret statistically significant results
• Even ‘lexical’ questions need annotation:
  – 1st person declarative modal verb shall/will
  – abstraction relies on annotation
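As an illustration of the abstraction step, the sketch below selects cases from a toy annotated sample and maps them to a regular frequency table. The field names (lemma, person, clause, subcorpus), the six toy cases and the function name abstract are all assumptions made for this example. They are not the ICE-GB/DCPSE annotation scheme or the ICECUP interface.

```python
from collections import Counter

# Hypothetical annotated cases: one modal verb token each, carrying the
# annotation needed to decide whether it belongs in the dataset.
cases = [
    {"lemma": "shall", "person": 1, "clause": "declarative", "subcorpus": "LLC"},
    {"lemma": "will", "person": 1, "clause": "declarative", "subcorpus": "LLC"},
    {"lemma": "will", "person": 3, "clause": "declarative", "subcorpus": "LLC"},
    {"lemma": "shall", "person": 1, "clause": "interrogative", "subcorpus": "ICE-GB"},
    {"lemma": "will", "person": 1, "clause": "declarative", "subcorpus": "ICE-GB"},
    {"lemma": "will", "person": 1, "clause": "declarative", "subcorpus": "ICE-GB"},
]


def abstract(cases):
    """Select 1st person declarative shall/will and count by subcorpus."""
    table = Counter()
    for c in cases:
        if (
            c["lemma"] in ("shall", "will")
            and c["person"] == 1
            and c["clause"] == "declarative"
        ):
            table[(c["subcorpus"], c["lemma"])] += 1
    return table


dataset = abstract(cases)
for (subcorpus, lemma), freq in sorted(dataset.items()):
    print(subcorpus, lemma, freq)
```

Without the person and clause annotation, the 3rd person and interrogative cases could not be excluded, which is the sense in which even a ‘lexical’ question relies on annotation.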


What can a parsed corpus tell us?

• Three kinds of evidence may be obtained from a parsed corpus
  – Frequency evidence of a particular known rule, structure or linguistic event
  – Coverage evidence of new rules, etc.
  – Interaction evidence of the relationship between rules, structures and events
• BUT evidence is necessarily framed within a particular grammatical scheme
  – So… (an obvious question) how might we evaluate this grammar?


What can a parsed corpus tell us?

• Parsed corpora contain (lots of) trees
  – Use Fuzzy Tree Fragment (FTF) queries to get data

[Figure: an FTF, and a matching case in a tree, using ICECUP (Nelson et al., 2002)]


What can a parsed corpus tell us?

• Trees as handle on data
  – make useful distinctions
  – retrieve cases reliably
  – not necessary to “agree” with the framework used
    • provided distinctions are meaningful
• Trees as trace of language production process
  – interaction between decisions leaves a probabilistic effect on overall performance
    • not simple to distinguish between sources
  – depends on the framework
    • but may also validate it


Why spoken corpora?

• Speech predates writing
  – historically – literacy growth and spread
  – child development – internal speech during writing
• Scale
  – professional authors recommend 1,000 words/day
  – 1 hour of speech ≈ 8,000 words (DCPSE)
• Spontaneity
  – production process lost: many written sources edited
• Dialogue
  – interaction between speakers


ICE-GB and DCPSE

• British Component of the International Corpus of English (1990-92)
  – 1 million words (nominal)
  – 60% spoken, 40% written
  – speech component is orthographically transcribed
  – fully parsed
    • marked up, POS-tagged, parsed, hand-corrected
• Diachronic Corpus of Present-day Spoken English
  – 800,000 words (nominal)
  – orthographically transcribed and fully parsed
  – created from subsamples of LLC and ICE-GB
    • matching numbers of texts in text categories
    • not sampled over equal duration
      – LLC (1958-1977)
      – ICE-GB (1990-1992)

Modal shall vs. will over time

• Plotting modal shall/will over time (DCPSE)
• Small amounts of data / year
• Confidence intervals identify the degree of certainty in our results
• Highly skewed p in some cases
  – p = 0 or 1 (circled)
• We can now estimate an approximate downwards curve (Aarts et al., 2013)

[Figure: p(shall | {shall, will}) plotted by year, 1955 to 1995; vertical axis from 0.0 to 1.0, with confidence intervals]
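To make the point about small samples and skewed p concrete, here is a minimal sketch of a confidence interval on an observed proportion. It uses the Wilson score interval, which the deck names for its error bars in the NP premodification analysis; whether this exact formula was used for the shall/will plot is an assumption. The yearly counts are invented for illustration and are not DCPSE data.

```python
import math


def wilson_interval(p, n, z=1.959964):
    """Wilson score interval for an observed proportion p out of n cases."""
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


# Invented (shall, will) counts per year; NOT taken from DCPSE.
counts = {1961: (5, 5), 1975: (0, 5), 1991: (30, 70)}

for year, (shall, will) in sorted(counts.items()):
    n = shall + will
    p = shall / n
    lower, upper = wilson_interval(p, n)
    print(f"{year}: p(shall) = {p:.2f}, 95% interval = ({lower:.3f}, {upper:.3f})")
```

With only five cases and p = 0, the interval still has a non-zero width and stays within [0, 1], which a simple symmetric (Wald) interval would not manage.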


Intra-structural priming

• Priming effects within a structure
  – Study repeating an additive step in structures
• Consider
  – a phrase or clause that may (in principle) be extended ad infinitum
    • e.g. an NP with a noun head
  – a single additive step applied to this structure
    • e.g. add an attributive AJP before the head
  – Q. What is the effect of repeatedly applying this operation to the structure?

[Figure: NP tree with head noun ship (N), with attributive AJPs added one at a time: tall, very green, old]


NP premodification

• Sequential probability analysis
  – calculate probability of adding each AJP
  – error bars: Wilson intervals
  – probability falls
    • second < first
    • third < second
  – decisions interact
  – Every AJP added makes it harder to add another

[Figure: probability of adding each AJP, with error bars; vertical axis "probability" from 0.00 to 0.20, horizontal axis from 0 to 5]
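The sequential probability analysis above can be sketched as follows, assuming that the probability of adding the k-th AJP is estimated as the number of NPs with at least k attributive AJPs divided by the number with at least k-1. That reading, the function names and the frequencies are assumptions for illustration; the counts are invented and are not ICE-GB results.

```python
import math


def wilson_interval(p, n, z=1.959964):
    """Wilson score interval for an observed proportion p out of n cases."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def sequential_probabilities(freq):
    """freq[k] = number of NPs with at least k attributive AJPs.

    Returns (k, p, lower, upper) rows, where p = freq[k] / freq[k - 1]
    is the observed probability of adding the k-th AJP.
    """
    rows = []
    for k in range(1, len(freq)):
        n = freq[k - 1]
        if n == 0:
            break  # no cases left from which to estimate a probability
        p = freq[k] / n
        rows.append((k, p, *wilson_interval(p, n)))
    return rows


# Invented frequencies: all NPs, then NPs with >= 1, >= 2, >= 3 AJPs.
freq = [10000, 1900, 150, 6]
rows = sequential_probabilities(freq)
for k, p, lower, upper in rows:
    print(f"AJP {k}: p = {p:.4f}  ({lower:.4f}, {upper:.4f})")
```

If the decisions were independent, p would stay roughly constant from step to step; a fall at each step is the interaction evidence the slide describes.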


NP premodification: explanations?

• Feedback loop: for each successive AJP, it is more difficult to add a further AJP
• Possible explanations include:
  – logical and semantic constraints
    • tend to say the tall green ship
    • do not tend to say tall short ship or green tall ship
  – communicative economy
    • once speaker said tall green ship, tends to only say ship
  – memory/processing constraints
    • unlikely: this is a small structure, as are AJPs


NP premod’n: speech vs. writing

• Spoken vs. written subcorpora
  – Same overall pattern
  – Spoken data tends to have fewer attributive AJPs
    • Support for communicative economy or memory/processing hypotheses?
  – Significance tests
    • Paired 2x1 Wilson tests (Wallis 2011)
    • first and second observed spoken probabilities are significantly smaller than written

[Figure: probability of adding each AJP, written vs. spoken; vertical axis "probability" from 0.00 to 0.25, horizontal axis from 0 to 5]
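One way to compare a spoken and a written probability at the same step is sketched below. It uses Newcombe's Wilson-based interval for the difference between two independent proportions, a widely described general-purpose method. It is offered here as a stand-in and is not claimed to reproduce the paired tests cited on the slide. The function name and the counts are invented for illustration.

```python
import math


def wilson_interval(p, n, z=1.959964):
    """Wilson score interval for an observed proportion p out of n cases."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def newcombe_wilson(f1, n1, f2, n2, z=1.959964):
    """Interval for d = p1 - p2 built from the two Wilson intervals.

    Returns (d, lower, upper, significant); the difference is treated as
    significant when the interval for d excludes zero.
    """
    p1, p2 = f1 / n1, f2 / n2
    l1, u1 = wilson_interval(p1, n1, z)
    l2, u2 = wilson_interval(p2, n2, z)
    d = p1 - p2
    lower = d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return d, lower, upper, not (lower <= 0.0 <= upper)


# Invented counts: NPs with at least one AJP out of all NPs, per subcorpus.
d, lower, upper, significant = newcombe_wilson(1200, 5000, 900, 5000)
print(f"written - spoken = {d:.3f}, interval = ({lower:.3f}, {upper:.3f})")
print("significant" if significant else "not significant")
```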


Potential sources of interaction

• shared context
  – topic or ‘content words’ (Noriega)
• idiomatic conventions
  – semantic ordering of attributive adjectives (tall green ship)
• logical-semantic constraints
  – exclusion of incompatible adjectives (?tall short ship)
• communicative constraints
  – brevity on repetition (just say ship next time)
• psycholinguistic processing constraints
  – attention and memory of speakers


What use is interaction evidence?

• Corpus linguistics
  – Optimising existing grammar
    • e.g. co-ordination, compound nouns
• Theoretical linguistics
  – Comparing different grammars, same language
  – Comparing different languages or periods
• Psycholinguistics
  – Search for evidence of language production constraints in spontaneous speech corpora
    • speech and language therapy
    • language acquisition and development


What can a parsed corpus tell us?

• Trees as handle on data
  – make useful distinctions
  – retrieve cases reliably
  – not necessary to “agree” with the framework used
    • provided distinctions are meaningful
• Trees as trace of language production process
  – interaction between decisions leaves a probabilistic effect on overall performance
    • not simple to distinguish between sources
  – results enabled by the framework
    • but may also validate it


The importance of annotation

• Key element of a ‘3A cycle’
  – Annotation → Abstraction → Analysis
• Richer annotation
  – more effective abstraction
  – deeper research questions?
• Multiple layers of annotation
  – new research questions
  – studying interaction between layers
• Algorithmic vs. human annotation


More information

• References
  Aarts, B., Close, J. and Wallis, S.A. (2013) Choices over time: methodological issues in current change. In Aarts, Close, Leech and Wallis (eds) The Verb Phrase in English. Cambridge University Press.
  Nelson, G., Wallis, S.A. and Aarts, B. (2002) Exploring Natural Language. Amsterdam: John Benjamins.
  Wallis, S.A. (2011) Comparing χ² tests for separability. London: Survey of English Usage.

• Useful links
  – Survey of English Usage
    • www.ucl.ac.uk/english-usage
  – Fuzzy Tree Fragments
    • www.ucl.ac.uk/english-usage/resources/ftfs
  – Statistics and methodology research blog
    • http://corplingstats.wordpress.com