
Compiling a corpus I

“It’s a capital mistake to theorize before one has data”

(A. Conan Doyle, Sherlock Holmes, ‘A Scandal in Bohemia’)

CADS

• recognition and quantification of patterns

• systematic analysis of serendipitous discoveries

• going backwards and forwards between quantitative data (wordlists and keyword lists) and qualitative close reading (concordance lines and extended text)

Discourse(s)

• ‘a group of statements which provide a language for talking about - a way of representing the knowledge about - a particular topic at a particular historical moment. [...] Discourse is about the production of knowledge through language’ (Hall, 1992: 291).

• ‘It is the continuous reinforcement, through massive repetition and consistency in discourse, which is required to construct and maintain reality’ (Stubbs 1996: 92).

• the incremental effect of discourse (Baker 2006)

Corpus

• A finite-size, non-random collection of naturally occurring language, in computer-readable form.

• Non-random = representative of a language or text type and compiled for an intended functional purpose.

• (see McEnery et al. 2006)

What can a corpus tell us?

• A corpus is a source of knowledge about language:
– corpus
– introspection/observation/elicitation
– controlled laboratory experiment
– computer simulation

• A corpus is a sample of language, varying by:
– source (e.g. speech vs. writing, age...)
– levels of annotation (e.g. parsing)
– size (number of words)
– sampling method (random sample?)


• How do these differ in what they might tell us?

• How does this affect the types of knowledge we might obtain?

Quantitative breadth

• coverage (i.e. representative, generalisable results)

• statistical relevance

• descriptive power

• reliability (NOT objectivity)

• the principle of total accountability

• replicability

Qualitative depth

• contextualisation

• socio-cultural relevance

• explanatory power

Cumulative power

• ‘Qualitative change cannot be understood, let alone achieved, without noting the accumulation of quantities’ (Gerbner, 1983: 361)

• Select a phenomenon for investigation

• Collect a relevant data set

• Look inside the data set for systematic patterns

• Formalize significant patterns as rules describing natural events

The research process

• finding a research question

• designing the appropriate corpus to answer it

• compiling the dataset

• analysing the corpus

• fine-tuning the RQ / coming up with more questions

• finding answers (?)

The process - size

• See Chapter 2, ‘Corpus Building’, of Paul Baker’s manual Using Corpora in Discourse Analysis

• “the size of the corpus should be related to its eventual uses”

• For the study of prosody, 100,000 words of spontaneous speech are adequate

• An analysis of verb-form morphology would require half a million words

size

• For lexicography a million words is unlikely to be enough

• For discourse analysis it is possible to carry out corpus-based analyses on much smaller amounts of data

• For our purposes 100,000 words can be considered sufficient.

• It all depends on the type of language being investigated

Copyright issues

• You need to ask permission when you are reproducing and making available the texts contained in your corpus.

• You need to decide how much of your corpus will be made available to others

possible choices – who and where?

• Corpus used for private study & research within an educational institution

• Multiple copies accessible to students & colleagues for study or research within an educational institution

• Multiple copies accessible to staff and students for study or research outside the educational institution

How much and in what form?

• Users are able to see only very short concordance lines

• Users are able to see extended context (e.g. a few paragraphs*)

• Users are able to view the entire text of the corpus

Who and how much?

• Research papers and articles read by a relatively small audience containing very limited citations of concordance lines

• Articles read by a wide audience containing extensive citations of concordance lines

Describing your corpus

• You must give the range of sources, giving as much information as possible (e.g. three British broadsheets, the Times, Telegraph and Guardian, plus their Sunday editions, six months before and three months after the 2005 elections; or White House briefings for the first two months of each year after the 2008 economic crash)

• Justify your choice

• Indicate the number of words overall

• Indicate, where applicable, the number of texts and average length of texts for each part of the corpus

Corpus procedures

• 1 Identification (word and cluster lists) to identify the objects of study (words, phrases, etc.)

• 2 Frequency (wordlists and keyword lists) to know which objects are common and which are rare

• 3 Behaviour (concordance lines) to observe the behaviour of the objects in context

• 4 Analysis (collocations) to use the power of computer software to identify patterns

• 5 Interpretation and Generalisation on the basis of the observed behaviour
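Steps 1 and 2 above can be sketched in a few lines of Python. The tokeniser and the toy corpus below are illustrative assumptions, not part of any particular corpus tool:

```python
import re
from collections import Counter

def wordlist(text):
    """Steps 1-2: identify word tokens and count their frequencies."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Toy corpus, for illustration only
corpus = "the data suggest that the patterns in the data repeat"
freq = wordlist(corpus)
print(freq.most_common(2))  # [('the', 3), ('data', 2)]
```

In practice a concordancer such as WordSmith or AntConc does this for you; the point is that a wordlist is nothing more than a frequency count over the identified tokens.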

Wordlists and keyword lists

• Make sure you distinguish between a wordlist and a keyword list

• Make sure you state which is the corpus under investigation and which is the reference corpus

• Usually the reference corpus is bigger

concordances

• ‘A concordance is a collection of the occurrences of a word form, each in its own textual environment.’ (Sinclair 1991:32)

• A concordance brings together a series of fragments of text displaced from their original sequence and by juxtaposing them vertically, one after the other, it makes repetition visible and countable and makes patterns emerge to the surface, while the individual texts are eclipsed.
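The vertical juxtaposition Sinclair describes can be illustrated with a minimal keyword-in-context (KWIC) sketch; the window size and the sample sentence are assumptions for illustration:

```python
def concordance(tokens, node, window=4):
    """One line per occurrence of `node`, with `window` words of context;
    the left context is right-aligned so the node forms a vertical column."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>28}  [{node}]  {right}")
    return lines

tokens = "the war on terror was a war of words and the war went on".split()
for line in concordance(tokens, "war"):
    print(line)
```

Because every hit is displaced from its original sequence and stacked vertically, repetition in the co-text becomes visible at a glance.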

Concordance lines

• By looking at the environments in which the word finds itself, we can observe common features in the context

• A concordance makes it possible to observe repeated events

• The co-occurrences are observable on the syntagmatic horizontal axis

• Repeated paradigmatic choices are observable on the vertical axis

• Repetitions made visible

collocation

• The idea behind collocation is that a word is defined by the relationships it establishes with other words.

• ‘you shall judge a word by the company it keeps’ (Firth 1957)

The collocation principle

• The ubiquity of collocation challenges current theories of language because it demands explanation, and the only explanation that seems to account for the existence of collocation is that each lexical item is primed for collocational use. By primed , I mean that as the word is learnt through encounters with it in speech and writing, it is loaded with the cumulative effects of those encounters such that it is part of our knowledge of the word that it co-occurs with other words. (Hoey 2002)

Collocates and statistics

• A collocate is an ‘item that appears with greater than random probability in its (textual) context’ (Hoey 1991: 7).

• Measures of statistical significance (e.g. log-likelihood, z-score, MI score ...)

T-score and MI

• two measures of relative statistical significance

• T-score measures certainty of collocation, whereas MI score measures strength of collocation (Hunston 2002:73; McEnery & Wilson 2001:86).

• T-score directs our attention to high-frequency collocates such as grammatical words (and is thus likely to be more useful to the grammarian or lexicographer than to the sociolinguist or discourse analyst), whereas MI score highlights lexical items that are relatively infrequent by themselves but have a higher-than-random probability of co-occurring with the node word (Clear 1993:281).

• The two scores are useful, above all, in ranking collocations (Manning & Schütze 1999:166).
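Under the common textbook definitions (observed vs. expected co-occurrence frequency), the two scores can be computed as below. The corpus figures are invented for illustration, and real tools may differ in details such as window handling:

```python
import math

def expected(f_node, f_coll, span, n):
    """Co-occurrences of node and collocate expected by chance
    within a window of `span` tokens, in a corpus of `n` tokens."""
    return f_node * f_coll * span / n

def mi_score(observed, expected_freq):
    """Mutual information: strength of the collocation."""
    return math.log2(observed / expected_freq)

def t_score(observed, expected_freq):
    """t-score: certainty that the collocation is not due to chance."""
    return (observed - expected_freq) / math.sqrt(observed)

# Invented figures: node occurs 50 times, collocate 200 times, and they
# co-occur 25 times within a ±4 window in a 100,000-word corpus.
E = expected(50, 200, 8, 100_000)   # 0.8 co-occurrences expected by chance
print(round(mi_score(25, E), 2))    # 4.97: strong collocation
print(round(t_score(25, E), 2))     # 4.84: well above the commonly cited threshold of 2
```

Note how the same observed/expected pair feeds both measures: MI rescales the ratio, t-score rescales the difference, which is why they rank collocates differently.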

measure of statistical significance

• z-score: the number of standard deviations from the mean frequency; it compares the observed frequency with the frequency expected if only chance were affecting the distribution.

• It does not measure the strength of the relationship, but its significance.

• Quantitative indicators highlight particularly promising entry points into the data.

• identifying key leads worth pursuing qualitatively, according to the tried and tested principle of corpus linguistics, “Decide on the ‘strongest’ pattern and start there” (Sinclair 2003:xvi).
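One simplified form of the z-score can be sketched as follows, reusing the invented figures from the t-score/MI example (25 observed vs. 0.8 expected co-occurrences); implementations differ in how they estimate the standard deviation, so treat this as a sketch rather than any tool's exact formula:

```python
import math

def z_score(observed, expected_freq):
    """Simplified z-score: observed minus expected co-occurrences,
    in units of the standard deviation (approximated here as sqrt(E);
    real tools may use other estimates)."""
    return (observed - expected_freq) / math.sqrt(expected_freq)

print(round(z_score(25, 0.8), 1))  # ≈ 27: far above what chance alone would produce
```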

What to do with collocates

• recurrent lexical patterns

• classification of collocates (semantic grouping)

• recurrent semantic patterns

• recurrent evaluative patterns

• concordancing of co-occurrences and 2nd-level collocation

• analysis

Collocation and prosody

• The node’s property of being associated with a ‘semantically consistent set of collocates’ (Bublitz, 1996: 9).

• Semantic/evaluative (Morley and Partington 2009) prosody is an expression of evaluation (good/bad; desirable/undesirable; beneficial/dangerous; favourable/unfavourable ...; it can also be about control vs. lack of control).

keywords

• ‘A key word may be defined as a word which occurs with unusual frequency in a given text. This does not mean high frequency but unusual frequency, by comparison with a reference corpus of some kind’ (Scott, 1997: 236).

What you can do with keywords

• Identify the specificity, trends and the aboutness of the study corpus compared to a reference corpus.

• Keywords are a very good source of insights and help identify potentially interesting items for closer observation, but they must be treated with caution.
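Scott's 'unusual frequency' relative to a reference corpus is usually operationalised with a keyness statistic. Below is a sketch of the widely used log-likelihood measure; the word and the frequencies are invented for illustration:

```python
import math

def log_likelihood(a, b, c, d):
    """Log-likelihood keyness score.
    a, b: frequency of the word in the study and reference corpus;
    c, d: total tokens in the study and reference corpus."""
    e1 = c * (a + b) / (c + d)   # frequency expected in the study corpus
    e2 = d * (a + b) / (c + d)   # frequency expected in the reference corpus
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

# Invented figures: a word occurs 120 times in a 100,000-word study corpus
# but only 30 times in a 1,000,000-word reference corpus.
print(round(log_likelihood(120, 30, 100_000, 1_000_000), 1))  # very high keyness
```

The higher the score, the less plausible it is that the frequency difference between the two corpora is due to chance, which is exactly what makes the word 'key'.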

Working with keywords

• Keyword lists do not account for the textual position of words, they do not allow a distinction to be made between polysemous meanings, and they are independent of context. For these reasons keyword analysis does not reveal discourses, but it directs the researcher’s attention by highlighting patterns of difference that could otherwise go undetected.

• As with collocation analysis, the software makes the pattern visible, the human works on it.

Reading

• Apart from the work you have already seen presented in class or in the materials online, the practice in reading concordance lines, and the insights into particular aspects of the English language (e.g. evaluation and graduation, figurative language):

• The first part of Scott and Tribble, Patterns and Meanings in Discourse, and the book by Paul Baker should help you with your corpus compilation and analysis.

• Use the resources available

Recommended