PALA Summer School, Maribor, 2014 Corpus Stylistics Brian Walker and Dan McIntyre University of Huddersfield


PALA Summer School, Maribor, 2014

Corpus Stylistics

Brian Walker and Dan McIntyre, University of Huddersfield


Summer School Schedule

Day 1
09:00 – Lectures: Introduction to corpus linguistic terminology and methodology; corpus linguistics + stylistics
11:00 – Practical session: Introduction to Wmatrix
12:30 – LUNCH
14:00 – Practical sessions: Wmatrix
17:30 – FINISH


Summer School Schedule

Day 2
09:00 – Practical: Introduction to AntConc
10:00 – Practical: AntConc – advanced features
11:30 – Lecture: Round up: Corpus stylistics – more than the sum of its parts?
12:30 – LUNCH
14:00 – Over to Willie


Introduction to Corpus Linguistics


What is a corpus?

• Latin corpus: ‘body’ (plural corpora)

• Put simply: a corpus is a ‘body’ of text


What is Corpus Linguistics?

• Corpus linguistics is the study of language using a corpus or corpora


Early Corpus Linguistics

• ‘Corpus Linguistics’ as an anachronism
• Field Linguistics
• Boas’s studies of Native American languages
• Bloomfield’s description of Tagalog
• Hockett’s work on Potawatomi
• Harris’s emphasis on the importance of results being derived from data

Pictured: Franz Boas, Leonard Bloomfield, Charles Hockett, Zellig Harris

While until about 1880 investigators confined themselves to the collection of vocabularies and brief grammatical notes, it has become more and more evident that large masses of texts are needed in order to elucidate the structure of languages. (Boas 1917: 1)


Principles of Chomskyan linguistics

• Homogeneous underlying system of language

• Describe the language of the ideal speaker/hearer

• Focus on linguistic competence as opposed to linguistic performance

Corpus linguistics doesn’t mean anything. It’s like saying suppose a physicist decides, suppose physics and chemistry decide that instead of relying on experiments, what they’re going to do is take videotapes of things happening in the world and they’ll collect huge videotapes of everything that’s happening and from that maybe they’ll come up with some generalizations or insights. (Chomsky, quoted in Andor 2004: 97)


Problems with intuition

Issue of acceptability
• I was 19 when I started university
• I were 19 when I started university

Impossibility of studying certain aspects of language without recourse to corpus data
• Historical linguistics
• Language change/variation
• Language acquisition

…this [intuition] is a very strange notion of data. Normally one expects a scientist to develop theories to describe and explain some phenomena which already exist, independently of the scientist. One does not expect a scientist to make up the data at the same time as the theory, or even to make up the data afterwards, in order to illustrate the theory. (Stubbs 1996: 29)


The Survey of English Usage

• Instigated in 1959 by Randolph Quirk at University College London

• One million words of written and spoken British English, made up of 200 text samples of 5000 words each

• Electronic version of the spoken data produced in collaboration with Lund University: the London-Lund Corpus

• Manually annotated for prosodic and paralinguistic features

• Grammatical structures for each text sample recorded on file cards

• Searching the corpus meant a trip to the Survey offices to search through filing cabinets of data!



Building the Brown corpus

• The Brown Corpus
• Built by Nelson Francis and Henry Kučera at Brown University, USA
• One million words of written American English (1961), made up of 500 text samples of 2000 words each
• Enabled frequency measures of words
• Confirmed Zipf’s law
• The most frequent word in a corpus is approximately twice as frequent as the second most frequent, and three times as frequent as the third most frequent, etc.

• Frequency is inversely proportional to rank
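
The rank-frequency relationship described above can be explored with a few lines of Python. This is only a minimal sketch: the function names `rank_frequency` and `zipf_expected` and the toy sentence are invented for illustration, and a text this small cannot confirm or refute Zipf's law. On a real corpus the same table could be compared against the prediction that frequency is roughly the top frequency divided by rank.

```python
from collections import Counter
import re


def rank_frequency(text):
    """Return (rank, word, frequency) tuples, most frequent word first."""
    # Very crude tokenisation (an assumption): lower-case alphabetic strings.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]


def zipf_expected(top_frequency, rank):
    """Frequency predicted for a rank if frequency is proportional to 1/rank."""
    return top_frequency / rank


# Toy text, invented for illustration only.
sample = "the cat and the dog and the bird saw the cat"
table = rank_frequency(sample)
top = table[0][2]

# Print observed frequency next to the Zipfian prediction for the top ranks.
for rank, word, freq in table[:3]:
    print(rank, word, freq, round(zipf_expected(top, rank), 2))
```

In practice the comparison is usually made by plotting log frequency against log rank and checking how close the points lie to a straight line.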


Extending the Brown family

• 1970-78: LOB
• Built by Geoffrey Leech and colleagues at Lancaster University
• One million words of written British English (1961), made up of 500 text samples of 2000 words each

• FROWN: Written American English from 1991

• FLOB: Written British English from 1991

• BE06: Written British English from early years of 21st century

• LOBalike: Written British English from 2011



Making sense of meaning

• COBUILD project initiated at Birmingham in 1980 - resulted in the Bank of English
• English Lexical Studies 1963: Sinclair, Susan Jones and Robert Daley analysed a small corpus of spoken and written English to investigate the relationship between words and meaning
• Meaning is best seen as a property of words in combination
• Builds on J. R. Firth’s concept of collocation
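
One simple way to make the idea of collocation concrete is to count which words occur within a fixed span of a node word. The sketch below assumes a span of tokens either side of the node and ignores sentence boundaries; the function name `collocates` and the sample text are invented for illustration. It returns raw co-occurrence counts only, whereas corpus tools typically rank collocates with an association measure such as mutual information or log-likelihood.

```python
from collections import Counter
import re


def collocates(text, node, span=4):
    """Count words occurring within `span` tokens either side of `node`."""
    # Crude tokenisation (an assumption): lower-case alphabetic strings.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts


# Toy text, invented for illustration only.
sample = ("He was afraid to commit to marriage. "
          "Those who commit crimes must be punished. "
          "They commit crimes while on bail.")
result = collocates(sample, "commit", span=2)
print(result.most_common(3))
```

Widening `span` picks up looser associations at the cost of more noise, so the span chosen should be reported alongside any results.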


Bart: You're up to something, aren't ya?

Homer: No! I'm just going out to commit certain deeds.


s. <p/> A39 57 A39 58 <h_><p_>The write way to commit murder<p/> A39 59 <p_><quote_>"Advice and inform
of God is manifested. <tf_>Kill, D03 78 rob and commit adultery<tf/> are all deeds forbidden in the D03
0 of a religious sect who orders his followers to commit suicide.<p/> D11 131 <p_>"God, permitting the mir
bility of episcopal ordination"<quote/> would not commit the D17 47 Methodist Church to the view that th
7 article. Take care though: don't let your words commit an editor to E10 98 using a specific picture, w
theory and deconstruction is such as to G67 189 commit the reasoner to defending certain values.<p/> G67
ithin the Service about offenders who continue to commit H09 191 crime while on bail<p/> H09 192 <p_>Whil
4m ($45m).<p/> H27 148 <p_>However, it would only commit itself to a forecast of H27 149 maintained sales
democracy from collapse, but this was to J41 142 commit <quote_>"a common fallacy in social thought which
45 163 the effort levels that they are willing to commit. Let contracts J45 164 with regard to effort be
2 <p_>Her cheeks flushed crimson and he strove to commit to memory P08 53 the lovely colour as the blood
222 never took the slot, although he did briefly commit to an ROTC A10 223 program before putting his na
1894.<p/> A26 13 <p_>Commissioners hesitated to commit themselves after one of the A26 14 monument's co
te/> <quote_>"Cold Feet A32 243 - Why Men Won't Commit"<quote/> and <quote_>"Letting Go and Moving A32
B13 92 addressed men who use drugs or those who commit adultery, and who B13 93 get AIDS and other ven
ceeds rational basis. Since urban blacks B17 61 commit more crime proportionately (although not numerica
us consequences. Mr. C12 185 Deng was hounded to commit suicide in 1966 and his criticism is now C12 186
nd Jodie squabble C13 199 because he's afraid to commit to marriage.<p/> C13 200 <p_>Social issues, too,
ue <quote_>"is to do something about it, i.e., to commit oneself D03 187 to a way of life ..."<quote/><p/
n objective theistic D03 192 statement; it is to commit oneself to living life and to D03 193 understand
to be silly and trivial, because I don't want to commit D06 180 an overt, nonrational act and I don't wa
WN:E28\><h_><p_>SANITATION<p/> E28 2 <p_>HOW TO COMMIT BIOCIDE<p/> E28 3 <p_>In the strictest sense, s
an <tf|>offensive F04 52 position. That is, to commit to an aggressive daily-action plan F04 53 desig
form drives 15 F11 31 percent of its victims to commit suicide. (For a list of symptoms, F11 32 see 'A
, artificial persons make decisions that F37 23 commit other people. At the same time, the power to spea
act G22 13 open to us now would be unjust is to commit ourselves to avoiding G22 14 it. But what of pa
H08 57 exploiting the Gulf war as a pretext to commit terrorism.<p/> H08 58 <p_>While we can be proud
p/> H09 52 <p_>First, we must get the people who commit crimes out of the H09 53 community, and we must
<p_>And, it increases penalties for criminals who commit gun H09 69 offenses.<p/> H09 70 <p_>We have no
rease the penalties on those who use such guns to commit H09 120 crimes.<p/> H09 121 <p_>Mr. President, I
y requiring grantees H26 155 in most programs to commit their own funds for a portion of the H26 156 cos
to do with its value; to think so is to J30 27 commit a genetic fallacy. After I wrote this, I came acr
ereas J43 34 disengaged delinquents are free to commit a variety of illegal J43 35 activities, such fr
on a particular illegal J43 38 possibility. Why commit anti-gay violence versus rape or armed J43 39 r
_>Hitler understandably regarded people who could commit such J56 150 acts against Britain as his natural
rt with this J58 131 their so natural Right, but commit onely<&|>sic! the Administration J58 132 of such
lives K23 172 were before us. Rarely did anyone commit suicide. Here, hundreds of K23 173 people sit, w
asked Michael. <quote_>"Did you want P17 102 to commit suicide?"<quote/><p/> P17 103 <p_><quote_>"Oh, no
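
Concordance lines like those above place every occurrence of a node word in the centre of a fixed-width window of context (keyword in context, or KWIC). The following is a rough sketch of how such lines can be produced, not a description of how any particular concordancer works; the function name `kwic`, the character-based window and the sample text are assumptions made for illustration.

```python
import re


def kwic(text, node, width=30):
    """Return keyword-in-context lines for every occurrence of `node`."""
    pattern = re.compile(r"\b%s\b" % re.escape(node), flags=re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Take up to `width` characters of context on each side of the node.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad the left context so the node word lines up in one column.
        lines.append(f"{left:>{width}}{match.group(0)}{right:<{width}}")
    return lines


# Toy text, adapted loosely from phrases in the concordance for illustration.
sample = ("Kill, rob and commit adultery are all deeds forbidden. "
          "Rarely did anyone commit suicide.")
result = kwic(sample, "commit", width=30)
for line in result:
    print(line)
```

Reading down the column to the right of the node makes recurrent collocates (here *adultery* and *suicide*) easy to spot, which is the point of the alignment.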


Advances in annotation: the British National Corpus (BNC)
• Currently, one of the best contemporary UK English corpora
• 100 million words from the early 1990s
• Represents a wide range of both spoken and written modern British English
• Written data
– 90 million words
– Includes extracts from newspapers, academic books, popular fiction, letters and university essays
• Spoken data
– 10 million words
– Includes demographic data and context-governed data
• The demographic part
– Transcripts of about 900 everyday unscripted spoken conversations
• The context-governed part
– Spoken language collected in public contexts – e.g. radio phone-ins, government meetings, classroom interactions


Advances in annotation: Wmatrix


Looking ahead

• Development of tools and technologies
• Corpus techniques increasingly used in other disciplines
• Interdisciplinarity
• Multimodal corpora (e.g. Headtalk, Knight et al. 2008)
• Corpus Linguistics and Geographical Information Systems. This involves extracting place-names from a corpus, searching for their semantic collocates and creating maps to allow users to visualise how concepts such as war and money are distributed geographically (Gregory and Hardie 2011)


References

• Andor, J. (2004) ‘The master and his performance: an interview with Noam Chomsky’, Intercultural Pragmatics 1(1): 93-111.

• Boas, F. (1917) ‘Introduction’, International Journal of American Linguistics 1(1): 1-8. [Reprinted in Boas, F. (1940) Race, Language and Culture, pp. 199-210. New York: The Free Press.]

• Gregory, I. and Hardie, A. (2011) ‘Visual GISting: bringing together corpus linguistics and Geographical Information Systems’, Literary and Linguistic Computing 26(3): 297-314.

• Knight, D., Adolphs, S., Tennent, P. and Carter, R. (2008) ‘The Nottingham Multi-Modal Corpus: a demonstration’, Proceedings of the 6th Language Resources and Evaluation Conference, Palais des Congrès Mansour Eddahbi, Marrakech, Morocco, 28-30 May.

• Stubbs, M. (1996) Text and Corpus Analysis. Oxford: Blackwell.