Corpus Linguistics and Multivariate Statistics
Seminar 1
Dylan Glynn www.dsglynn.univ-paris8.fr
What is a corpus?
You shall know a word by the company it keeps
J. R. Firth 1957
Three basic, but inter-related, approaches to corpora
1. Comparison corpora - Concordances
2. Formal patterns with a corpus - Collocations
3. Meaning patterns within a corpus - Correlations
These methods vary massively in complexity and application
They are typically used to answer questions in
Discourse Analysis
Critical Discourse Analysis
Sociolinguistics
Semantics and Pragmatics
Phonology
Morpho-syntax
Stylistics
...
1. Comparison corpora - Concordances
The most basic type of concordance is the list of words and their frequencies in a body of writing.
They have been used for hundreds of years, especially in Theology and Literature.
They are still important in stylistics and in many other areas of research.
They are very quick and easy to compile and often represent the first step of more advanced studies.
For example –
Take the parliamentary speeches of the Tories and Labour, compile a word list for each, and compare the frequency of certain key words.
Do the same for men and women in the parliament.
What would the differences tell us?
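The word-list comparison described above can be sketched in a few lines of Python. This is only a minimal illustration: the two "speeches" are invented toy texts standing in for the two sub-corpora, and the helper names `word_list` and `per_thousand` are hypothetical, not part of any corpus tool.

```python
from collections import Counter
import re

def word_list(text):
    """Return a frequency list (word -> count) for a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

def per_thousand(counts, word):
    """Relative frequency of `word` per 1,000 tokens of the sub-corpus."""
    total = sum(counts.values())
    return 1000 * counts[word] / total if total else 0.0

# Invented toy texts (assumption): stand-ins for two sets of speeches.
speeches_a = "We will cut tax and cut spending. Tax must fall."
speeches_b = "We will protect jobs and protect services. Jobs matter."

freq_a = word_list(speeches_a)
freq_b = word_list(speeches_b)

# Compare the relative frequency of chosen key words across the two lists.
for key in ("tax", "jobs"):
    print(key, round(per_thousand(freq_a, key), 1), round(per_thousand(freq_b, key), 1))
```

Relative rather than raw counts are compared because the two sub-corpora will rarely be the same size.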
2. Formal patterns with a corpus - Collocations
This is the mainstay of contemporary corpus linguistics.
There are various types
Frequency Analysis
Concordance Analysis - KWIC (keyword in context)
- Collocations (concordance of word co-occurrences)
- Collostructions (concordance of word – syntactic pattern)
Vector analysis / Word space analysis
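As a rough illustration of a KWIC concordance, the sketch below lists every hit of a keyword with a few words of context on either side. The sample text is invented and the function name `kwic` is hypothetical; real concordancers offer far more (sorting, lemmatisation, regular-expression queries).

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, node, right context) for each hit of `keyword`."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sample text (assumption), tokenised without regard to sentence breaks.
sample = "The children run to school. Managers run the company. They run away."

for left, node, right in kwic(sample, "run"):
    print(f"{left:>25} [{node}] {right}")
```

Counting the words that appear in the left and right contexts is the starting point for collocation analysis.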
Frequency Analysis
2008 US Presidential Election
Concordance Analysis
2. Collocation Analysis
Collostructional Analysis
Stefanie Wulff, Anatol Stefanowitsch, and Stefan Th. Gries
“a cultural model emerges of the buyer as a passive participant in the commercial transaction, exploited (and relatively easily so) by others for their own gain” (Gries and Stefanowitsch 2004b: 232). Another example is a strong association between verbs denoting coercion and what Gries and Stefanowitsch (2004b) refer to as a CONFESSION frame: people are preferably tortured, beaten, intimidated, trapped, and coerced into confessing. In addition, Gries and Stefanowitsch (2004b: 231–2) observe a more general semantic factor at work in British English, namely a weak but significant preference to conceptualize causation in the physical domain (as opposed to mental or communicative causation). This tendency is reflected by physical cause predicates being twice as frequent as mental cause predicates, and action result verbs being even 37 times more frequent than cognition result verbs. Given these culture-specific explanations, the present study sets out to answer the question to what extent we can expect to find the same patterns in another variety of English, or if the way the into-causative is put to use is actually variety-specific.
Table 2. Most significant Vcause-Vresult co-varying collexemes in the into-causative in the 1990–2000 volumes of The Guardian (cf. Gries and Stefanowitsch 2004: 230)
Vcause      Vresult         N    -log (pFisher-Yates)
bounce      accepting       29   14.074
torture     confessing      8    13.155
draw        commenting      6    10.581
shock       understanding   7    10.483
stimulate   producing       6    9.330
dupe        carrying        8    7.244
con         paying          16   7.019
hoodwink    leaving         8    6.982
mislead     buying          14   6.980
delude      supposing       3    6.792
terrorise   fleeing         4    6.762
talk        letting         12   6.743
dupe        leaving         13   6.609
force       making          51   6.546
pressure    having          14   6.505
bounce      announcing      6    6.100
shame       cleaning        4    5.953
dragoon     voting          7    5.899
swing       planning        2    5.518
fool        queuing         3    5.435
lock        using           5    5.406
guide       lending         2    5.372
rush        making          11   5.305
educate     understanding   3    5.296
fool        seeing          6    5.180
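The association score in the last column of the table is a negative log of a Fisher-Yates exact p-value. A minimal sketch of how such a score could be computed for one verb pair follows. Two assumptions are made here and are not taken from the table: the logarithm is base 10, and the test is one-tailed. The four cell counts are invented for illustration, since the table reports only the co-occurrence frequency N and not the full contingency table.

```python
from math import comb, log10

def fisher_right_tail(a, b, c, d):
    """One-tailed Fisher exact p-value for a 2x2 table.

    a: cause verb with result verb     b: cause verb with other result verbs
    c: other cause verbs with result   d: other cause verbs with other results
    Returns the probability of `a` or more co-occurrences given the margins.
    """
    row1 = a + b          # all uses of the cause verb
    col1 = a + c          # all uses of the result verb
    n = a + b + c + d     # all tokens of the construction
    total = comb(n, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, min(row1, col1) + 1))
    return tail / total

# Hypothetical counts (assumption): not the data behind the table.
a, b, c, d = 8, 12, 10, 970
p = fisher_right_tail(a, b, c, d)
strength = -log10(p)
print(f"p = {p:.3g}, -log10(p) = {strength:.3f}")
```

The larger the score, the less likely it is that the two verbs co-occur this often by chance, which is why the table is sorted by it rather than by raw frequency N.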
Vector Analysis: synonymy of run
[Figure: network of synonyms and related senses of run, including run off, run away, escape, flow, operate, function, campaign, race, streak, sequence, succeed and release]
NOUNS
run, tally
a score in baseball made by a runner touching all four bases safely
"the Yankees scored 3 runs in the bottom of the 9th"
"their first tally came in the 3rd inning"
run, trial, test
the act of testing something
"in the experimental trials the amount of carbon was measured separately"
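The idea behind vector (word space) analysis can be sketched as follows: each word is represented by counts of the words that occur near it, and words with similar context vectors are treated as similar in meaning. The sketch below is only an illustration under stated assumptions: the token sequence is invented, the window size is arbitrary, and cosine similarity is one common choice of measure, not necessarily the one behind the figure above.

```python
from collections import Counter
from math import sqrt

def context_vector(tokens, target, window=2):
    """Count the words found within `window` tokens of each hit of `target`."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), i + window + 1
            vec.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Invented toy corpus (assumption), already tokenised.
tokens = ("the river will run to the sea . the river will flow to the sea . "
          "managers operate the firm").split()

run_vec = context_vector(tokens, "run")
flow_vec = context_vector(tokens, "flow")
operate_vec = context_vector(tokens, "operate")

print("run ~ flow:   ", cosine(run_vec, flow_vec))
print("run ~ operate:", cosine(run_vec, operate_vec))
```

In this toy data run and flow share all their contexts, while run and operate share only the function word "the", so the first pair scores much higher.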
3. Meaning patterns within a corpus - Correlations
There is a continuum from counting occurrences of some meaning or use through to large-scale multivariate modelling of the behaviour of those uses.
Relative Frequencies
Simply counting uses is a straightforward and useful corpus-driven approach.
It is the mainstay of Discourse Analysis and Functional Linguistics.
Multidimensional correlations
This line of research is the newest: it began only in the 1990s, in Belgium and Germany.
It is an extremely popular technique in Cognitive Linguistics.
Relative Frequencies: Memoranda vs. email in office communication
Multidimensional correlations: Conceptualisation of HOME
Form-based vs Meaning-based analysis
Problems with Meaning-based analysis
i. Low degree of representativity due to small sample size
ii. High degree of subjectivity due to manual analysis
Response to Problem of Representativity
i. Restrict studies to carefully controlled datasets
ii. Predictive statistical modelling is essential
Response to Problem of Subjectivity
i. Clearly operationalised usage-features
ii. Multiple annotators and Kappa scores for reliability
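A Kappa score measures how far two annotators agree beyond what chance alone would produce. The sketch below computes Cohen's kappa, one common variant for two annotators; the slides do not say which variant is intended, so this choice is an assumption. The labels are invented annotations of ten hypothetical occurrences of run.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled at random with their own label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)

# Invented annotations (assumption): sense of "run" in ten examples.
annotator_1 = ["motion", "motion", "manage", "manage", "motion",
               "manage", "motion", "manage", "motion", "motion"]
annotator_2 = ["motion", "motion", "manage", "motion", "motion",
               "manage", "motion", "manage", "manage", "motion"]

print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Here the annotators agree on 8 of 10 items, but because chance agreement is already 0.52, kappa is noticeably lower than the raw 80% agreement.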
Corpus Evidence in Language Science
In the late 1960s, the Functionalists were questioning the assumptions of the European and American Structuralists
France: Martinet, Benveniste, Culioli
Britain: Firth, Halliday, Sinclair
Russia: Bondarko, Apresjan, Mel’čuk
America: Givón, Hopper, Fillmore, Lakoff
The debate remains one of the two key debates in linguistics today.
This debate is central to the theories of arguably the three greatest linguists in history
What structures language?
Corpus Evidence in Language Science
Where is grammar?
Does langue come from parole or does parole come from langue?
Humboldt - ergon (product) vs. energeia (activity)
de Saussure - langue vs. parole
Chomsky - competence vs. performance
Theoretically, there are strong arguments for both
Empirically, there are strong arguments for both
Corpus linguistics necessarily assumes that the product is a result of the activity, that langue comes from parole, that competence is based on performance...
Although probably less than half, a very large group of linguists today think this is RUBBISH!!!
Corpus Evidence in Language Science
The main argument against using performance as an index of structure:
“I live in New York” versus “I live in Dayton, Ohio” (Chomsky 1964)
Frequency of performance tells us about the world language is used to describe, not the language structure in the mind.
Q. 1. Why does one assume that the language in the mind is different from the world it describes?
At some level, “I live in New York” is more important in language than “I live in Dayton”.
Q. 2. Why would one look at raw frequency to describe language? Frequency is always relative.
We could only compare the frequency of these two utterances if the same number of people lived in Dayton and New York.
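The point about relative frequency can be made concrete with a small calculation: a raw count only becomes comparable once it is divided by the number of opportunities for the utterance to occur. In the sketch below every figure is an invented round number (neither corpus counts nor census data), and `rate_per_100k` is a hypothetical helper.

```python
def rate_per_100k(count, base):
    """Occurrences per 100,000 units of the base (speakers, words, texts...)."""
    if base <= 0:
        raise ValueError("base must be positive")
    return 100_000 * count / base

# Invented figures (assumption): utterance counts and numbers of potential speakers.
observations = {
    "I live in New York": {"count": 800, "speakers": 8_000_000},
    "I live in Dayton, Ohio": {"count": 15, "speakers": 150_000},
}

for utterance, obs in observations.items():
    rate = rate_per_100k(obs["count"], obs["speakers"])
    print(f"{utterance}: raw = {obs['count']}, per 100,000 speakers = {rate}")
```

With these made-up numbers the raw counts differ by a factor of more than fifty, yet the relative rates are identical, which is exactly why raw frequency alone says little about linguistic structure.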
Questions for Corpus Linguistics
In every corpus-based study, it is crucial that you are aware of the practical limitations and theoretical assumptions of the method!!!
(this includes your mémoires)
1. Practical Questions
a. Representativity – Text type
b. Representativity – Hapax legomena
2. Theoretical Questions
a. Frequency – Linguistic structure
b. Frequency – Thematic bias
3. Analytical Questions
a. Negative Evidence
b. Objective Accuracy
Representativity - Practical Questions for Corpus-Based Research
1. Text type and Topic of Discourse
The type of text and what it is talking about can have a profound effect on your results.
The most common meaning of run will be fast pedestrian motion in a corpus of children’s books, but it will be management in a corpus of economic news.
2. Hapax legomena and rare events
The largest corpus in the world is but a fraction of language.
Something that is very rare in a corpus can, in fact, be quite common out there in the real world.
We are relatively restricted to quite common events. Things like idioms etc. are relatively rare.
a. What are the implications of each of these questions for your own project?
b. If a particular question has implications for your project, what measures have you taken to respond to it?
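Hapax legomena are the word forms that occur exactly once in a corpus. A quick way to see how much of a word list they make up is sketched below; the sample sentence is invented and the function name is hypothetical.

```python
from collections import Counter
import re

def hapax_legomena(text):
    """Return the sorted word forms that occur exactly once in `text`."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return sorted(word for word, n in counts.items() if n == 1)

# Invented sample (assumption), far too small to be a real corpus.
sample = "run the race and run the firm and kick the bucket"

hapaxes = hapax_legomena(sample)
types = set(re.findall(r"[a-z]+", sample.lower()))
print(hapaxes)
print(f"{len(hapaxes)} of {len(types)} word types occur only once")
```

If the item you are studying sits among the hapaxes, the corpus gives you a single observation of it, which is too little to generalise from.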
Frequency - Theoretical Questions for Corpus-Based Research
1. Linguistic structure
This is the langue – parole debate.
2. Thematic bias
This is the same as the issue of text type and is the basis of Chomsky’s criticism.
a. What are the implications of each of these questions for your own project?
b. If a particular question has implications for your project, what measures have you taken to respond to it?
Object - Analytical Questions for Corpus-Based Research
1. Negative Evidence
We only have what people say, not what they don’t say. How can we disprove hypotheses?
2. Objective Accuracy
To increase representativity and objectivity, we necessarily increase inaccuracy.
If we increase accuracy, we necessarily decrease representativity and objectivity.
a. What are the implications of each of these questions for your own project?
b. If a particular question has implications for your project, what measures have you taken to respond to it?
For next week
Have a look at each of the articles online.
Choose one and have a go at reading it.
Remember, your mémoire should look something like one of these articles... seriously.