Compiling a corpus I

“It’s a capital mistake to theorize before one has data”

(A. Conan Doyle, Sherlock Holmes, A Scandal in Bohemia)

CADS

• recognition and quantification of patterns

• systematic analysis of serendipitous discoveries

• going backwards and forwards between quantitative data (wordlists and keyword lists) and qualitative close reading (concordance lines and extended text)

Discourse(s)

• ‘a group of statements which provide a language for talking about - a way of representing the knowledge about - a particular topic at a particular historical moment. [...] Discourse is about the production of knowledge through language’ (Hall, 1992: 291).

• ‘It is the continuous reinforcement, through massive repetition and consistency in discourse, which is required to construct and maintain reality’ (Stubbs 1996: 92).

• the incremental effect of discourse (Baker 2006)

Corpus

• A finite-sized, non-random collection of naturally occurring language in computer-readable form.

• Non-random = representative of a language or text type and compiled for an intended functional purpose (see McEnery et al. 2006).

What can a corpus tell us?

• A corpus is one source of knowledge about language among several:
  – corpus
  – introspection/observation/elicitation
  – controlled laboratory experiment
  – computer simulation

• A corpus is a sample of language, varying by:
  – source (e.g. speech vs. writing, age...)
  – levels of annotation (e.g. parsing)
  – size (number of words)
  – sampling method (random sample?)

• How do these sources of knowledge differ in what they might tell us?

• How does this variation in sampling affect the types of knowledge we might obtain?

Quantitative breadth

• coverage (i.e. representative, generalisable results)

• statistical relevance

• descriptive power

• reliability (NOT objectivity)

• the principle of total accountability

• replicability

Qualitative depth

• contextualisation

• socio-cultural relevance

• explanatory power

Cumulative power

• Qualitative change cannot be understood, let alone achieved, without noting the accumulation of quantities (Gerbner, 1983: 361)

• Select a phenomenon for investigation

• Collect a relevant data set

• Look inside the data set for systematic patterns

• Formalize significant patterns as rules describing natural events

The research process

• finding a research question

• designing the appropriate corpus to answer it

• compiling the dataset

• analysing the corpus

• fine-tuning the research question / coming up with more questions

• finding answers (?)

The process - size

• Paul Baker’s manual Using Corpora in Discourse Analysis, Chapter 2, ‘Corpus building’

• “the size of the corpus should be related to its eventual uses”

• For the study of prosody 100,000 words of spontaneous speech are adequate

• An analysis of verb-form morphology would require half a million words

size

• For lexicography a million words is unlikely to be enough

• For discourse analysis it is possible to carry out corpus-based analyses on much smaller amounts of data

• For our purposes 100,000 words can be considered sufficient.

• It all depends on the type of language being investigated

Copyright issues

• You need to ask permission when you are reproducing and making available the texts contained in your corpus.

• You need to decide how much of your corpus will be made available to others

possible choices – who and where?

• Corpus used for private study & research within an educational institution

• Multiple copies accessible to students & colleagues for study or research within an educational institution

• Multiple copies accessible to staff and students for study or research outside the educational institution

How much and in what form?

• Users are able to see only very short concordance lines

• Users are able to see extended context (e.g. a few paragraphs)

• Users are able to view the entire text of the corpus

Who and how much?

• Research papers and articles read by a relatively small audience containing very limited citations of concordance lines

• Articles read by a wide audience containing extensive citations of concordance lines

Describing your corpus

• You must give the range of sources, giving as much information as possible (e.g. three British broadsheets, the Times, Telegraph and Guardian, and their Sunday editions, six months before and three months after the 2005 elections; or White House briefings for the first two months of each year after the 2008 economic crash)

• Justify your choice

• Indicate the number of words overall

• Indicate, where applicable, the number of texts and average length of texts for each part of the corpus
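
The counts asked for in this description can be produced with a few lines of code. The following is a minimal sketch in Python, assuming the corpus is held in memory as a mapping from the name of each part to a list of its texts. The part names, the sample sentences and the simple regular-expression tokeniser are all illustrative assumptions; word counts will differ slightly between tools depending on how each one defines a word.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9]+(?:['’-][a-z0-9]+)*", text.lower())

def describe_corpus(parts):
    """Return, for each part, the number of texts, number of words and mean text length."""
    summary = {}
    for name, texts in parts.items():
        lengths = [len(tokenize(t)) for t in texts]
        total = sum(lengths)
        summary[name] = {
            "texts": len(texts),
            "words": total,
            "mean_length": total / len(texts) if texts else 0.0,
        }
    return summary

# Hypothetical two-part corpus, invented for illustration only.
sample = {
    "Times": ["The election is close.", "Voters decide today."],
    "Guardian": ["A long campaign ends."],
}
summary = describe_corpus(sample)
for name, stats in summary.items():
    print(name, stats)
```

Reporting the figures per part, rather than only overall, makes it easier for a reader to judge whether one source dominates the corpus.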

Corpus procedures

• 1 Identification (word and cluster lists) to identify the objects of study (words, phrases, etc)

• 2 Frequency (wordlists and keyword lists) to know which objects are common and which are rare

• 3 Behaviour (concordance lines) to observe the behaviour of the objects in context

• 4 Analysis (collocations) to use the power of computer software to identify patterns

• 5 Interpretation and Generalisation on the basis of the observed behaviour
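
The first two procedures can be sketched in code. The example below is a rough illustration in Python, not a description of what any particular concordancer does internally: it builds a wordlist and a list of two-word clusters from a toy sentence. The tokenising rule, the function names and the sample text are assumptions made for the example.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and keep alphabetic word forms (a deliberately simple rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def wordlist(tokens):
    """Frequency of every word form."""
    return Counter(tokens)

def cluster_list(tokens, n=2):
    """Frequency of every sequence of n consecutive word forms."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy text, invented for illustration.
tokens = tokenize("The cat sat on the mat and the cat slept")
words = wordlist(tokens)
clusters = cluster_list(tokens, n=2)

print(words.most_common(3))
print(clusters.most_common(3))
```

Sorting both lists by frequency shows at a glance which words and clusters are common and which are rare, and so suggests which items to take forward to the concordance stage.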

Wordlists and keyword lists

• Make sure you distinguish between a wordlist and a keyword list

• Make sure you state which is the corpus under investigation and which is the reference corpus

• Usually the reference corpus is bigger
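
Because the two corpora usually differ in size, raw counts from their wordlists cannot be compared directly. One common way round this is to normalise each count to a shared base. A minimal sketch in Python follows; the function name, the base of one million words and the figures are assumptions chosen for illustration.

```python
def relative_frequency(freq, corpus_size, per=1_000_000):
    """Convert a raw count into a rate per `per` words."""
    return freq * per / corpus_size

# Invented figures: the same word in a small study corpus and a larger reference corpus.
study_rate = relative_frequency(50, 100_000)
reference_rate = relative_frequency(300, 1_000_000)

print(f"study corpus:     {study_rate:.0f} per million words")
print(f"reference corpus: {reference_rate:.0f} per million words")
```

In this invented case the raw count is higher in the reference corpus (300 against 50), but the rate is higher in the study corpus, which is the comparison a keyword list relies on.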

concordances

• ‘A concordance is a collection of the occurrences of a word form, each in its own textual environment.’ (Sinclair 1991:32)

• A concordance brings together a series of text fragments displaced from their original sequence. By juxtaposing them vertically, one after the other, it makes repetition visible and countable and lets patterns emerge, while the individual texts are eclipsed.

Concordance lines

• By looking at the environments in which a word occurs, we can observe common features in its context

• A concordance makes it possible to observe repeated events

• The co-occurrences are observable on the syntagmatic horizontal axis

• Repeated paradigmatic choices are observable on the vertical axis

• Repetitions made visible
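
The layout described here can be sketched as a small key-word-in-context (KWIC) routine. This is a minimal illustration in Python, assuming the text has already been split into tokens; the function names, the span of two words and the sample sentence are invented for the example.

```python
def concordance(tokens, node, span=4):
    """Return (left context, node, right context) for every occurrence of the node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, tok, right))
    return lines

def format_kwic(lines):
    """Right-align the left context so that the node words line up vertically."""
    width = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left:>{width}}  {node}  {right}" for left, node, right in lines]

# Toy text, invented for illustration.
tokens = "the cat sat on the mat while the dog sat by the door".split()
hits = concordance(tokens, "sat", span=2)
kwic = format_kwic(hits)
for line in kwic:
    print(line)
```

Reading each printed line from left to right follows the syntagmatic axis; reading down the aligned column of node words and their neighbours follows the paradigmatic axis.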

collocation

• The idea behind collocation is that a word is defined by the relationships it establishes with other words.

• ‘You shall know a word by the company it keeps’ (Firth 1957)

The collocation principle

• The ubiquity of collocation challenges current theories of language because it demands explanation, and the only explanation that seems to account for the existence of collocation is that each lexical item is primed for collocational use. By primed, I mean that as the word is learnt through encounters with it in speech and writing, it is loaded with the cumulative effects of those encounters such that it is part of our knowledge of the word that it co-occurs with other words. (Hoey 2002)

Collocates and statistics

• A collocate is an ‘item that appears with greater than random probability in its (textual) context’ (Hoey 1991: 7).

• Measures of statistical significance (e.g. log-likelihood, z-score, MI score ...)
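
Before any statistic can be applied, the raw co-occurrences have to be counted. The sketch below shows one simple way of doing this in Python: it counts every word that falls within a fixed span on either side of the node. The span of one word, the function name and the sample sentence are assumptions for illustration; corpus software typically lets the user set the span.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count the words occurring within `span` tokens to the left or right of the node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Toy text, invented for illustration.
tokens = "she likes strong tea and he likes strong coffee but strong tea wins".split()
counts = collocates(tokens, "tea", span=1)
print(counts.most_common())
```

These observed counts are the input to the significance measures: each one is compared with the count that would be expected if the words were distributed by chance.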

T-score and MI

• two measures of relative statistical significance

• T-score measures certainty of collocation, whereas MI score measures strength of collocation (Hunston 2002:73; McEnery & Wilson 2001:86).

• T-score directs our attention to high-frequency collocates such as grammatical words (and is thus likely to be more useful to the grammarian or lexicographer than to the sociolinguist or discourse analyst), whereas MI score highlights lexical items that are relatively infrequent by themselves but have a higher-than-random probability of co-occurring with the node word (Clear 1993:281).

• The two scores are useful, above all, in ranking collocations (Manning & Schütze 1999:166).
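
The contrast between the two scores can be seen in a small worked example. The sketch below is in Python and uses one common formulation, which is an assumption here rather than something stated on the slides: the expected number of co-occurrences is f(node) × f(collocate) × window / corpus size, MI is the base-2 logarithm of observed over expected, and the t-score is (observed − expected) divided by the square root of observed. Software packages differ in the details, and all figures are invented.

```python
import math

def expected_cooccurrences(freq_node, freq_collocate, corpus_size, window):
    """Co-occurrences expected by chance within `window` slots around each node."""
    return freq_node * freq_collocate * window / corpus_size

def mi_score(observed, expected):
    """Mutual information: log2 of observed over expected co-occurrences."""
    return math.log2(observed / expected)

def t_score(observed, expected):
    """T-score: observed minus expected, divided by the square root of observed."""
    return (observed - expected) / math.sqrt(observed)

# Invented figures: a 1,000,000-word corpus, a node occurring 1,000 times,
# and a window of 4 words around the node.
CORPUS_SIZE, FREQ_NODE, WINDOW = 1_000_000, 1_000, 4

# A frequent grammatical word: 50,000 occurrences, 400 of them near the node.
exp_gram = expected_cooccurrences(FREQ_NODE, 50_000, CORPUS_SIZE, WINDOW)
mi_gram, t_gram = mi_score(400, exp_gram), t_score(400, exp_gram)

# A rare lexical word: 50 occurrences, 20 of them near the node.
exp_lex = expected_cooccurrences(FREQ_NODE, 50, CORPUS_SIZE, WINDOW)
mi_lex, t_lex = mi_score(20, exp_lex), t_score(20, exp_lex)

print(f"grammatical word: MI={mi_gram:.2f}, t={t_gram:.2f}")
print(f"lexical word:     MI={mi_lex:.2f}, t={t_lex:.2f}")
```

With these invented figures the frequent grammatical word gets the higher t-score and the rare lexical word the higher MI score, so the two measures rank the same pair of collocates in opposite orders.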

measure of statistical significance

• z-score: the number of standard deviations from the mean frequency; it compares the observed frequency with the frequency expected if only chance were affecting the distribution.

• It does not measure the strength of the relationship, but its significance.
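
As a rough illustration, the z-score can be computed by treating every slot in the window around the node as an independent chance to meet the collocate. This binomial approximation is an assumption made for the sketch below, written in Python; the exact formula used by a given software package may differ, and the figures are invented.

```python
import math

def z_score(observed, freq_node, freq_collocate, corpus_size, window):
    """Standard deviations between observed and chance-expected co-occurrences."""
    slots = freq_node * window            # positions examined around the node
    p = freq_collocate / corpus_size      # chance of the collocate in any one slot
    expected = slots * p                  # mean under chance alone
    std_dev = math.sqrt(slots * p * (1 - p))
    return (observed - expected) / std_dev

# Invented figures: the collocate occurs 50 times in a 1,000,000-word corpus,
# and 20 of those occurrences fall within 4 words of a node occurring 1,000 times.
z = z_score(observed=20, freq_node=1_000, freq_collocate=50,
            corpus_size=1_000_000, window=4)
print(f"z = {z:.2f}")
```

A large positive value means the observed count lies far above what chance alone would produce; a value near zero means the co-occurrence is about as frequent as chance predicts.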

• Quantitative indicators highlight particularly promising entry points into the data.

• They identify key leads worth pursuing qualitatively, according to the tried and tested principle of corpus linguistics, “Decide on the ‘strongest’ pattern and start there” (Sinclair 2003:xvi).

What to do with collocates

• recurrent lexical patterns

• classification of collocates (semantic grouping)

• recurrent semantic patterns

• recurrent evaluative patterns

• concordancing of co-occurrences and 2nd level collocation

• analysis

Collocation and prosody

• The node’s property of being associated with a ‘semantically consistent set of collocates’ (Bublitz, 1996: 9).

• Semantic/evaluative (Morley and Partington 2009) prosody is an expression of evaluation (good/bad; desirable/undesirable; beneficial/dangerous; favourable/unfavourable ...); it can also be about control vs. lack of control.

keywords

• ‘A key word may be defined as a word which occurs with unusual frequency in a given text. This does not mean high frequency but unusual frequency, by comparison with a reference corpus of some kind’ (Scott, 1997: 236).
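
One widely used way of measuring this "unusual frequency" is the log-likelihood statistic, which compares a word's count in the study corpus and in the reference corpus with the counts expected if the word were equally common in both. The sketch below, in Python, is an illustration under that assumption and not the procedure of any specific tool; the function names and the two tiny frequency lists are invented.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one word across a study and a reference corpus."""
    total = size_study + size_ref
    expected_study = size_study * (freq_study + freq_ref) / total
    expected_ref = size_ref * (freq_study + freq_ref) / total
    ll = 0.0
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

def keyword_list(study_counts, ref_counts):
    """Rank words that are relatively more frequent in the study corpus by keyness."""
    size_study = sum(study_counts.values())
    size_ref = sum(ref_counts.values())
    scores = {}
    for word, freq in study_counts.items():
        ref_freq = ref_counts.get(word, 0)
        # Keep only words over-represented in the study corpus (positive keywords).
        if freq / size_study > ref_freq / size_ref:
            scores[word] = log_likelihood(freq, size_study, ref_freq, size_ref)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Invented frequency lists: a 1,000-word study corpus and a 10,000-word reference corpus.
study = {"tax": 50, "the": 950}
reference = {"tax": 100, "the": 9900}
keywords = keyword_list(study, reference)
print(keywords)
```

In this toy example "the" is by far the most frequent word in the study corpus but is not key, because it is just as common in the reference corpus; "tax" is much less frequent but is key, which is the distinction the definition draws between high frequency and unusual frequency.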

What you can do with keywords

• Identify the specificity, trends and the aboutness of the study corpus compared to a reference corpus.

• Keywords are a very good source of insights and help identify potentially interesting items for closer observation, but they must be treated with caution.

Working with keywords

• Keyword lists do not account for the textual position of words, do not allow polysemous meanings to be distinguished, and are independent of context. For these reasons keyword analysis does not reveal discourses, but it directs the researcher’s attention by highlighting patterns of difference that could otherwise go undetected.

• As with collocation analysis, the software makes the pattern visible; the human works on it.

Reading

• Apart from the work you have already seen presented in class or in the online materials, the practice in reading concordance lines, and the insights into particular aspects of the English language (e.g. evaluation and graduation, figurative language), the first part of Scott and Tribble, Patterns and Meanings in Discourse and the book by Paul Baker should help you with your corpus compilation and analysis.

• Use the resources available