Corpus Linguistics and Multivariate Statistics Seminar 1 Dylan Glynn www.dsglynn.univ-paris8.fr [email protected]




What is a corpus?


You shall know a word by the company it keeps

J. R. Firth 1957


Three basic, but inter-related, approaches to corpora

1. Comparison corpora - Concordances

2. Formal patterns within a corpus - Collocations

3. Meaning patterns within a corpus - Correlations

These methods vary massively in complexity and application

They are typically used to answer questions in

Discourse Analysis

Critical Discourse Analysis

Sociolinguistics

Semantics and Pragmatics

Phonology

Morpho-syntax

Stylistics

...


1. Comparison corpora - Concordances

The most basic type of concordance is the list of words and their frequencies in a body of writing.

They have been used for hundreds of years, especially in theology and literature.

They are still important in stylistics and in many other areas of research.

They are very quick and easy to compile and often represent the first step of more advanced studies.

For example:

Take the parliamentary speeches of the Tories and Labour, compile a word list for each, and compare the frequency of certain key words.

Do the same for men and women in parliament.

What would the differences tell us?
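The word-list comparison described above can be sketched in a few lines of Python. This is only an illustration: the two "corpora" are invented one-line stand-ins for real parliamentary speeches, and the function names `word_frequencies` and `per_thousand` are hypothetical. Frequencies are expressed per 1,000 tokens so that corpora of different sizes can be compared.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a Counter of lower-cased word tokens in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


def per_thousand(counts, word):
    """Relative frequency of a word per 1,000 tokens (0.0 for an empty corpus)."""
    total = sum(counts.values())
    return 1000 * counts[word] / total if total else 0.0


# Invented stand-ins for two sets of speeches; real data would be read from files.
party_a = "We will cut tax and cut spending. Tax must fall."
party_b = "We will protect jobs and invest in schools. Jobs matter."

freq_a = word_frequencies(party_a)
freq_b = word_frequencies(party_b)

# Compare the relative frequency of a few key words across the two word lists.
for key_word in ("tax", "jobs"):
    print(key_word, per_thousand(freq_a, key_word), per_thousand(freq_b, key_word))
```

A difference in these relative frequencies is only a starting point; whether it is meaningful depends on corpus size and on the text-type questions raised later in the seminar.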


2. Formal patterns within a corpus - Collocations

This is the mainstay of contemporary corpus linguistics.

There are various types:

Frequency Analysis

Concordance Analysis: KWIC (keyword in context)

- Collocations (concordance of word co-occurrences)

- Collostructions (concordance of word – syntactic pattern)

Vector analysis / Word space analysis
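As a rough illustration of two of the techniques listed above, the sketch below builds a keyword-in-context (KWIC) concordance and a simple count of collocates within a fixed window. The sample sentences, the window size, and the function names `tokenize`, `kwic` and `collocates` are assumptions made for this example, not part of the seminar material.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-cased word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


def collocates(tokens, keyword, window=3):
    """Count the words occurring within `window` tokens of the keyword."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


# Invented sample text, for illustration only.
sample = "The children run to school. Managers run the company. They run the shop."
tokens = tokenize(sample)

for left, key, right in kwic(tokens, "run", window=2):
    print(f"{left:>20} | {key} | {right}")

print(collocates(tokens, "run", window=2).most_common(3))
```

Raw co-occurrence counts like these favour frequent function words; collocation studies normally go on to apply an association measure rather than stopping at the counts.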


Frequency Analysis

2008 US Presidential Election


Concordance Analysis


2. Collocation Analysis


Collostructional Analysis

Stefanie Wulff, Anatol Stefanowitsch, and Stefan Th. Gries

“a cultural model emerges of the buyer as a passive participant in the commercial transaction, exploited (and relatively easily so) by others for their own gain” (Gries and Stefanowitsch 2004b: 232). Another example is a strong association between verbs denoting coercion and what Gries and Stefanowitsch (2004b) refer to as a CONFESSION frame: people are preferably tortured, beaten, intimidated, trapped, and coerced into confessing. In addition, Gries and Stefanowitsch (2004b: 231–2) observe a more general semantic factor at work in British English, namely a weak but significant preference to conceptualize causation in the physical domain (as opposed to mental or communicative causation). This tendency is reflected by physical cause predicates being twice as frequent as mental cause predicates, and action result verbs being even 37 times more frequent than cognition result verbs. Given these culture-specific explanations, the present study sets out to answer the question to what extent we can expect to find the same patterns in another variety of English, or if the way the into-causative is put to use is actually variety-specific.

Table 2. Most significant Vcause-Vresult co-varying collexemes in the into-causative in the 1990–2000 volumes of The Guardian (cf. Gries and Stefanowitsch 2004: 230)

Vcause      Vresult         N    -log (pFisher-Yates)
bounce      accepting       29   14.074
torture     confessing      8    13.155
draw        commenting      6    10.581
shock       understanding   7    10.483
stimulate   producing       6    9.330
dupe        carrying        8    7.244
con         paying          16   7.019
hoodwink    leaving         8    6.982
mislead     buying          14   6.980
delude      supposing       3    6.792
terrorise   fleeing         4    6.762
talk        letting         12   6.743
dupe        leaving         13   6.609
force       making          51   6.546
pressure    having          14   6.505
bounce      announcing      6    6.100
shame       cleaning        4    5.953
dragoon     voting          7    5.899
swing       planning        2    5.518
fool        queuing         3    5.435
lock        using           5    5.406
guide       lending         2    5.372
rush        making          11   5.305
educate     understanding   3    5.296
fool        seeing          6    5.180
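The last column of the table is a negative log of a Fisher-Yates exact p-value. The sketch below shows one way such a score could be computed from a 2x2 table of counts. It makes several assumptions that the slide does not state: a one-tailed test, a base-10 logarithm, and the cell layout described in the comments. The counts in the example are invented and are not the Guardian figures.

```python
from math import comb, log10


def fisher_exact_greater(a, b, c, d):
    """One-tailed Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical, for illustration):
      a = cause verb with result verb      b = cause verb with other result verbs
      c = other cause verbs with result    d = all remaining cases
    Returns the probability of observing `a` or more co-occurrences
    when the row and column totals are held fixed.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    numerator = sum(comb(row1, k) * comb(n - row1, col1 - k)
                    for k in range(a, upper + 1))
    return numerator / comb(n, col1)


def association_strength(a, b, c, d):
    """Negative base-10 log of the one-tailed p-value (base 10 is an assumption)."""
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    numerator = sum(comb(row1, k) * comb(n - row1, col1 - k)
                    for k in range(a, upper + 1))
    # Taking logs of the exact integers avoids underflow for very small p-values.
    return log10(comb(n, col1)) - log10(numerator)


# Invented counts: 8 co-occurrences, 40 other uses of the cause verb,
# 12 other uses of the result verb, 5000 remaining tokens of the construction.
print(fisher_exact_greater(8, 40, 12, 5000))
print(association_strength(8, 40, 12, 5000))
```

The higher the score, the less likely the observed co-occurrence count is under independence, which is how the ranking in the table should be read.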


Vector Analysis: synonymy of run

[Figure: plot of the synonyms of run. Labels include run off, run away, escape, flow, feed, operate, work, function, race, campaign, lead, sequence, succession, succeed, release, free, melt, travel, speed, hurry.]

NOUNS

run, tally

a score in baseball made by a runner touching all four bases safely

"the Yankees scored 3 runs in the bottom of the 9th"

"their first tally came in the 3rd inning"

run, trial, test

the act of testing something

"in the experimental trials the amount of carbon was measured separately"


3. Meaning patterns within a corpus - Correlations

There is a continuum from counting occurrences of some meaning or use through to large-scale multivariate modelling of the behaviour of those uses.

Relative Frequencies

Counting uses is a simple and useful corpus-driven approach.

It is the mainstay of Discourse Analysis and Functional Linguistics.

Multidimensional correlations

This line of research is the newest; it began only in the 1990s, in Belgium and Germany.

It is an extremely popular technique in Cognitive Linguistics.


Relative Frequencies: memoranda vs. email in office communication


Multidimensional correlations: conceptualisation of HOME


Form-based vs. Meaning-based analysis

Problems with Meaning-based analysis

i. Low degree of representativity due to small sample size

ii. High degree of subjectivity due to manual analysis

Response to Problem of Representativity

i. Restrict studies to carefully controlled datasets

ii. Predictive statistical modelling is essential

Response to Problem of Subjectivity

i. Clearly operationalised usage-features

ii. Multiple annotators and Kappa scores for reliability
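To make the Kappa score mentioned above concrete, here is a minimal sketch of Cohen's kappa for two annotators. It assumes each annotator has assigned exactly one category to every example; the labels are invented, and the function name `cohen_kappa` is hypothetical. The slide does not say which Kappa variant is intended, so Cohen's two-annotator version is used here as an assumption.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Proportion of items on which the two annotators agree.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's category proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[cat] * freq_b[cat] for cat in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Invented annotations of six occurrences of "run" for one usage-feature.
annotator_1 = ["motion", "motion", "manage", "manage", "motion", "manage"]
annotator_2 = ["motion", "manage", "manage", "manage", "motion", "manage"]

print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance; thresholds for "acceptable" reliability vary between fields.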


Corpus Evidence in Language Science

In the late 1960s, the Functionalists were questioning the assumptions of the European and American Structuralists.

France: Martinet, Benveniste, Culioli

Britain: Firth, Halliday, Sinclair

Russia: Bondarko, Apresjan, Mel’čuk

America: Givón, Hopper, Fillmore, Lakoff

The debate remains one of the two key debates of linguistics today.

This debate is central to the theories of arguably the three greatest linguists in history.

What structures language?


Corpus Evidence in Language Science

Where is grammar?

Does langue come from parole or does parole come from langue?

Humboldt - ergon (product) vs. energeia (activity)

de Saussure - langue vs. parole

Chomsky - competence vs. performance

Theoretically, there are strong arguments for both.

Empirically, there are strong arguments for both.

Corpus linguistics necessarily assumes that the product is a result of the activity, that langue comes from parole, that competence is based on performance...

Although probably less than half, a very large group of linguists today think this is RUBBISH!!!


Corpus Evidence in Language Science

The main argument against using performance as an index of structure:

“I live in New York” versus “I live in Dayton, Ohio” (Chomsky 1964)

Frequency of performance tells us about the world language is used to describe, not the language structure in the mind.

Q. 1. Why does one assume that the language in the mind is different from the world it describes?

At some level, “I come from New York” is more important in language than “I come from Dayton”.

Q. 2. Why would one look at raw frequency to describe language, when frequency is always relative?

We could compare the frequency of these two utterances only if the same number of people lived in Dayton and New York.
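The point about relative frequency can be shown with a small calculation. The counts and population sizes below are invented purely for illustration and are not real figures; `rate_per_million` is a hypothetical helper that divides a raw utterance count by the number of people who could plausibly produce it.

```python
def rate_per_million(utterance_count, population):
    """Occurrences of an utterance per million potential speakers."""
    return 1_000_000 * utterance_count / population


# Invented raw counts and populations, for illustration only.
new_york = rate_per_million(utterance_count=800, population=8_000_000)
dayton = rate_per_million(utterance_count=14, population=140_000)

# The raw counts differ enormously (800 vs. 14), but the normalised rates do not.
print(new_york, dayton)
```

Under these made-up numbers the two utterances are equally frequent once population is taken into account, which is the sense in which raw frequency alone says little.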


Questions for Corpus Linguistics

In every corpus-based study, it is crucial you are aware of the practical limitations and theoretical assumptions of the method!!!

(this includes your mémoires)

1. Practical Questions

a. Representativity – Text type

b. Representativity – Hapax legomena

2. Theoretical Questions

a. Frequency – Linguistic structure

b. Frequency – Thematic bias

3. Analytical Questions

a. Negative Evidence

b. Objective Accuracy


Representativity - Practical Questions for Corpus-Based Research

1. Text type and Topic of Discourse

The type of text and what it is talking about can have a profound effect on your results.

The most common meaning of run will be fast pedestrian motion in a corpus of children’s books, but it will be management in a corpus of economic news.

2. Hapax legomena and rare events

The largest corpus in the world is but a fraction of language.

Something that is very rare in a corpus can, in fact, be quite common out there in the real world.

We are largely restricted to quite common events. Things like idioms etc. are relatively rare.

a. What are the implications of each of these questions for your own project?

b. If a particular question has implications for your project, what measures have you taken to respond to the question?


Frequency - Theoretical Questions for Corpus-Based Research

1. Linguistic structure

This is the langue – parole debate.

2. Thematic bias

This is the same as the issue of text type and is the basis of Chomsky’s criticism.

1. What are the implications of each of these questions for your own project?

2. If a particular question has implications for your project, what measures have you taken to respond to the question?


Object - Analytical Questions for Corpus-Based Research

1. Negative Evidence

We only have what people say, not what they don’t say. How can we disprove hypotheses?

2. Objective Accuracy

To increase representativity and objectivity, we necessarily increase inaccuracy.

If we increase accuracy, we necessarily decrease representativity and objectivity.

1. What are the implications of each of these questions for your own project?

2. If a particular question has implications for your project, what measures have you taken to respond to the question?


For next week

Have a look at each of the articles online.

Choose one and have a go at reading it.

Remember, your mémoire should look something like one of these articles... seriously.