How can corpora help in language pedagogy?

Richard Xiao

Abstract

Corpus linguistics as a methodology of linguistic research has gained such

prominence over time that corpora have been used extensively in nearly all branches

of linguistics. This chapter explores the potential uses of corpus data in one of these

areas – language teaching and learning. We will first discuss a wide range of issues

related to using corpora in language pedagogy, including reference publishing,

syllabus design and materials development, language testing, teacher development,

data-driven learning (DDL), teaching language for specific purposes, as well as learner

corpus and interlanguage analysis. We will then demonstrate, via a case study of

passive constructions in Chinese learner English, how contrastive corpus linguistics

can inform second language acquisition research. The chapter concludes by

discussing the debate over the relevance of authenticity and frequency of corpora in

language education as well as the future of corpus-based language pedagogy.

Key words: corpora, language pedagogy, data-driven learning, learner corpus,

contrastive corpus linguistics, interlanguage, second language acquisition

1. Introduction

The corpus-based approach to linguistics and language education has gained

prominence over the past four decades, particularly since the mid-1980s. This is

because corpus analysis can be illuminating ‘in virtually all branches of linguistics or

language learning’ (Leech 1997: 9; cf. also Biber, Conrad and Reppen 1998: 11). One

of the strengths of corpus data lies in its empirical nature, which pools together the

intuitions of a great number of speakers and makes linguistic analysis more objective

(McEnery and Wilson 2001: 103). Unsurprisingly, corpora have been used

extensively in nearly all branches of linguistics including, for example, lexicographic

and lexical studies, grammatical studies, language variation studies, contrastive and

translation studies, diachronic studies, semantics, pragmatics, stylistics,

sociolinguistics, discourse analysis, forensic linguistics, and language pedagogy.

Corpora have won widespread popularity over time in spite of the fact that they still

occasionally attract hostile criticism (e.g. Widdowson 1990, 2000).

In this chapter, we will not be concerned with the debate over the use of corpus data

in linguistic analysis and language education. In our view, such a debate is over a

non-issue. Readers interested in the pros and cons of using corpus data should refer to

Sinclair (1991), Widdowson (1991, 2000), de Beaugrande (2001) and Stubbs (2001).

Robert de Beaugrande’s unpublished paper, ‘Large corpora and applied linguistics: H.

G. Widdowson versus J. McH. Sinclair’ (available online at

http://www.beaugrande.com/WiddowSincS.htm), provides an excellent summary of

the debate between Sinclair and Widdowson, at the Georgetown University Round

Table on Languages and Linguistics in 1991, over the use of corpora in language

teaching. While Widdowson, Sinclair and de Beaugrande characterize two extreme

attitudes towards corpora, there are many milder (positive or negative) reactions to

corpus data between the two extremes. Readers can refer to Nelson (2000: section

5.3.3.) for a good review. Nor will we discuss the use of corpora in a wide range of

language studies. Readers can refer to Hunston (2002) and McEnery, Xiao and Tono

(2006) for a further discussion of using corpora in applied linguistics. Instead, this

chapter focuses only on using corpora in language pedagogy.

The early 1990s saw an increasing interest in applying the findings of corpus-based

research to language pedagogy. The upsurge of interest is evidenced by the eight well-

received biennial international conferences on Teaching and Language Corpora

(TaLC) held in Lancaster (1994, 1996), Oxford (1998), Graz (2000), Bertinoro

(2002), Granada (2004), Paris (2006), and Lisbon (2008). This is also apparent when

one looks at the published literature. In addition to a large number of journal articles,

well over twenty authored or edited volumes have recently been produced on the topic

of teaching and language corpora: Wichmann et al (1997), Partington (1998),

Bernardini (2000), Burnard and McEnery (2000), Kettemann and Marko (2002,

2006), Aston (2001), Ghadessy, Henry, and Roseberry (2001), Hunston (2002),

Granger et al (2002), Connor and Upton (2002), Tan (2002), Sinclair (2003, 2004),

Aston et al (2004), Mishan (2005), Nesselhauf (2005), Römer (2005), Braun, Kohn

and Mukherjee (2006), Gavioli (2006), Scott and Tribble (2006), Hidalgo, Quereda

and Santana (2007), O’Keeffe, McCarthy and Carter (2007), Aijmer (2009), and

Campoy, Gea-Valor and Belles-Fortuno (2010). These works cover a wide range of

issues related to using corpora in language pedagogy, e.g. corpus-based language

description, corpus analysis in the classroom, and learner corpora (cf. Keck 2004).

In the opening chapter of Teaching and Language Corpora (Wichmann et al 1997),

Geoffrey Leech observed that a convergence between teaching and language corpora

was apparent. That convergence has three focuses, as noted by Leech (1997): the

direct use of corpora in teaching (teaching about, teaching to exploit, and exploiting to

teach), the indirect use of corpora in teaching (reference publishing, materials

development, and language testing), and further teaching-oriented corpus

development (LSP corpora, L1 developmental corpora and L2 learner corpora).

In the remainder of this chapter, we will explore the potential uses of corpora in

language pedagogy in line with Leech’s three focuses of convergence (sections 2-4),

which is followed by a case study demonstrating how contrastive corpus linguistics

can inform second language acquisition research (section 5). The chapter concludes

by discussing the debate over the relevance of authenticity and frequency of corpora

in language education as well as the future of corpus-based language pedagogy.

2. Indirect use of corpora

The use of corpora in language teaching and learning has been more indirect than

direct. This is perhaps because direct use of corpora in language pedagogy is

restricted by a number of factors including, for example, the level and experience of

learners, time constraints, curricular requirements, knowledge and skills required of

teachers for corpus analysis and result interpretation, and the access to resources such

as computers, and appropriate software tools and corpora, or a combination of these

(see section 6 for further discussion). This section explores how corpora have

impacted on language pedagogy indirectly.

2.1. Reference publishing

Corpora have revolutionized reference publishing (at least for English), be it a

dictionary or reference grammar, in such a way that it is now nearly unheard of for

new dictionaries and new editions of old dictionaries published from the 1990s

onwards not to be based on corpus data, and ‘even people who have never heard of a

corpus are using the product of corpus-based investigation’ (Hunston 2002: 96).

Corpora are useful in several ways for lexicographers. The greatest advantage of

using corpora in lexicography lies in their machine-readable nature, which allows

dictionary makers to extract all authentic, typical examples of the usage of a lexical

item from a large body of text in a few seconds. The second advantage of the corpus-

based approach, which is not available when using citation slips, is the frequency

information and quantification of collocation which a corpus can readily provide.

Some dictionaries, e.g. COBUILD 1995 and Longman 1995, include such frequency

information. Frequency data plays an even more important role in the so-called

frequency dictionaries, which define core vocabulary to help learners of different

modern languages, e.g. Davies (2005) for Spanish, Jones and Tschirner (2005) for

German, Davies and de Oliveira Preto-Bay (2007) for Portuguese, Lonsdale and Bras

(2009) for French, and Xiao, Rayson and McEnery (2009) for Chinese. Information

of this sort is particularly useful for materials writers and language learners alike. A

further benefit of using corpora is related to corpus markup and annotation. Many

available corpora (e.g. the BNC) are encoded with textual (e.g. register, genre and

domain) and sociolinguistic (e.g. user gender and age) metadata which allows

lexicographers to give a more accurate description of the usage of a lexical item.

Corpus annotations such as part-of-speech tagging and word sense disambiguation

also enable a more sensible grouping of words which are polysemous and

homographs. Furthermore, a monitor corpus allows lexicographers to track subtle

change in the meaning and usage of a lexical item so as to keep their dictionaries up-

to-date. Last but not least, corpus evidence can complement or refute the intuitions of

individual lexicographers, which are not always reliable (cf. Sinclair 1991a: 112;

Atkins and Levin 1995; Meijs 1996; Murison-Bowie 1996: 184) so that dictionary

entries are more accurate. The observations above are in line with Hunston (2002:

96), who summarizes the changes brought about by corpora to dictionaries and other

reference books in terms of five ‘emphases’: an emphasis on frequency, an emphasis

on collocation and phraseology, an emphasis on variation, an emphasis on lexis in

grammar and an emphasis on authenticity.
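
The frequency information and quantification of collocation mentioned above can be illustrated with a minimal Python sketch. It is not taken from any of the dictionaries or studies cited: the toy corpus is invented, the function name pmi is hypothetical, and pointwise mutual information is only one of several association measures that could be used to quantify collocation.

```python
import math
from collections import Counter

# Toy corpus invented for illustration; a real study would use a large corpus.
corpus = (
    "she made a strong argument and he made strong tea "
    "while they made a strong case for strong tea"
).split()

word_freq = Counter(corpus)                     # raw frequency of each word form
bigram_freq = Counter(zip(corpus, corpus[1:]))  # frequency of adjacent word pairs


def pmi(w1, w2):
    """Pointwise mutual information (base 2) for the adjacent pair (w1, w2)."""
    pair_count = bigram_freq[(w1, w2)]
    if pair_count == 0:
        return float("-inf")  # the pair never occurs in this corpus
    p_pair = pair_count / (len(corpus) - 1)
    p_w1 = word_freq[w1] / len(corpus)
    p_w2 = word_freq[w2] / len(corpus)
    return math.log2(p_pair / (p_w1 * p_w2))


print(word_freq.most_common(3))
print(round(pmi("strong", "tea"), 2))
```

A positive score suggests that the two words co-occur more often than their individual frequencies would predict; on a corpus this small the figures are purely illustrative.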

It has been noted that non-corpus-based grammars can contain biases while corpora

can help to improve grammatical descriptions (McEnery and Xiao 2005). The

Longman Grammar of Spoken and Written English (LGSWE, Biber et al 1999) can be

considered as a milestone in reference publishing. Based entirely on the 40-million-

word Longman Spoken and Written English Corpus, the grammar gives ‘a thorough

description of English grammar, which is illustrated throughout with real corpus

examples, and which gives equal attention to the ways speakers and writers actually

use these linguistic resources’ (Biber et al 1999: 45). The new corpus-based grammar

is unique in many different ways, for example, by taking register variations into

account and exploring the differences between written and spoken grammars.

While lexical information forms, to some extent, an integral part of the grammatical

description in Biber et al (1999), it is the Collins COBUILD series (Sinclair 1990,

1992; Francis et al 1996, 1997, 1998) that focuses on lexis in grammatical descriptions

(the so-called ‘pattern grammar’, Hunston and Francis 2002). In fact, Sinclair et al

(1990) flatly reject the distinction between lexis and grammar. While pattern

grammars focusing on the connection between pattern and meaning challenge the

traditional distinction between lexis and grammar, they are undoubtedly useful in

language learning as they provide ‘a resource for vocabulary building in which the

word is treated as part of a phrase rather than in isolation’ (Hunston 2002: 106).

In the dictionary family, perhaps the most important member as far as language

pedagogy is concerned is the learner dictionary. Yet corpus-based learner dictionaries

have quite a short history. It was only in 1987 that the Collins COBUILD English

Dictionary was published as the first ‘fully corpus-based’ dictionary. Yet the impact

of this corpus-based dictionary was such that most other publishers in the ELT market

followed Collins’ lead. By 1995, the new editions of major learner’s dictionaries such

as the Longman Dictionary of Contemporary English (LDOCE, 3rd edition), the

Oxford Advanced Learner’s Dictionary (OALD, 5th edition), and a newcomer, the

Cambridge International Dictionary of English (CIDE, 1st edition) all claimed to be

based on corpus evidence in one way or another.

One of the important features of corpus-based learner dictionaries is their

inclusion of quantitative data extracted from a corpus. Another important feature,

which is also related to frequency information, is that such dictionaries typically

select the vocabulary used from a controlled set when defining the entry for a word.

Producing definitions in an L2 that language learners can understand is a problem;

language learners may not have a very well developed L2 vocabulary. This makes it

necessary and desirable for dictionary makers to limit the vocabulary they use when

defining words in a dictionary. Nowadays, most learner dictionary makers prepare a

list of defining words, usually ranging from 2,000 to 2,500 words, based on the

frequency information extracted from corpora as well as on the lexicographers’

experience of defining words.

As noted earlier, an important use of corpus data for lexicography is in the area of

example selection so that nowadays most dictionaries of English use corpora as the

source of their examples. In the case of learner dictionaries, however, there was a

tradition of using examples invented by lexicographers, rather than authentic

materials, in dictionary production, because they believed that foreign language

learners have difficulty understanding authentic materials and therefore have to be

presented with simple, rewritten examples in which the use of a given word is

highlighted to show its syntactic and semantic properties. It was corpus-based learner

dictionary work which challenged this received wisdom. The COBUILD project broke

with tradition and used authentic data extracted from corpora to produce illustrative

examples for a learner dictionary. The use of authentic examples in learner

dictionaries is an area where corpus-based learner dictionaries have innovated.

2.2. Syllabus design and materials development

While corpora have been used extensively to provide more accurate descriptions of

language use, a number of scholars have also used corpus data directly to look

critically at existing TEFL (Teaching English as a Foreign Language) syllabuses and

teaching materials. Mindt (1996), for example, finds that the use of grammatical

structures in textbooks for teaching English differs considerably from the use of these

structures in L1 English. He observes that one common failure of English textbooks is

that they teach ‘a kind of school English which does not seem to exist outside the

foreign language classroom’ (Mindt 1996: 232). As such, learners often find it

difficult to communicate successfully with native speakers. A simple yet important

role of corpora in language education is to provide more realistic examples of

language usage. In addition, however, corpora may provide data, especially frequency

data, which may further alter what is taught. For example, on the basis of a

comparison of the frequencies of modal verbs, future time expressions and conditional

clauses in corpora and their grading in textbooks used widely in Germany, Mindt

(ibid) concludes that one problem with non-corpus-based syllabuses is that the order

in which those items are taught in syllabuses ‘very often does not correspond to what

one might reasonably expect from corpus data of spoken and written English’,

arguing that teaching syllabuses should be based on empirical evidence rather than

tradition and intuition with frequency of usage as a guide to priority for teaching

(Mindt 1996: 245-246). While frequency is certainly not the only determinant of what

to teach and in what order (see section 6 for further discussion), it can indeed help to

make learning more effective. For example, McCarthy, McCarten and Sandiford’s

(2005-2006) innovative Touchstone book series, which is based on the Cambridge

International Corpus, aims to present the vocabulary, grammar, and functions students

encounter most often in real life.

Hunston (2002: 189) echoes Mindt, suggesting that ‘the experience of using corpora

should lead to rather different views of syllabus design.’ The type of syllabus she

discusses extensively is a ‘lexical syllabus’, originally proposed by Sinclair and

Renouf (1988) and outlined fully by Willis (1990) and embodied in Willis, Willis and

Davids’ (1988-1989) three-part Collins COBUILD English Course. According to

Sinclair and Renouf (1988: 148), a lexical syllabus would focus on ‘(a) the

commonest word forms in a language; (b) the central patterns of usage; (c) the

combinations which they usually form.’ While the term may occasionally be

misinterpreted to indicate a syllabus consisting solely of vocabulary items, a lexical

syllabus actually covers ‘all aspects of language, differing from a conventional

syllabus only in that the central concept of organization is lexis’ (Hunston 2002: 189).

Sinclair (2000: 191) would say that the grammar covered in a lexical syllabus is

‘lexical grammar’, not ‘lexico-grammar’, which attempts to ‘build a grammar and

lexis on an equal basis.’ Indeed, as Murison-Bowie (1996: 185) observes, ‘in using

corpora in a teaching context, it is frequently difficult to distinguish what is a lexical

investigation and what is a syntactic one. One leads to the other, and this can be used

to advantage in a teaching/learning context.’ Sinclair and his colleagues’ proposal for

a lexical syllabus is echoed by Lewis (1993, 1997a, 1997b, 2000) who provides strong

support for the lexical approach to language teaching.

A focus of the lexical approach to language pedagogy is teaching collocations and the

related concept of prefabricated units. There is a consensus that collocational

knowledge is important for developing L1/L2 language skills (e.g. Bahns 1993;

Zhang 1993; Cowie 1994; Herbst 1996: 389-391; Kita and Ogata 1997: 230-231;

Partington 1998: 23-25; Hoey 2000, 2004; Shei and Pain 2000: 167-170; Sripicharn

2000: 169-170; Altenberg and Granger 2001; McEnery and Wilson 2001; McAlpine

and Myles 2003: 71-75; Nesselhauf 2003). Hoey (2004), for example, posits that

‘learning a lexical item entails learning what it occurs with and what grammar it tends

to have.’ Cowie (1994: 3168) observes that ‘native-like proficiency of a language

depends crucially on knowledge of a stock of prefabricated units.’ Aston (1995) also

notes that the use of prefabs can speed language processing in both comprehension

and production, thus creating native-like fluency. A powerful reason for the

employment of collocations, as Partington (1998: 20) suggests, ‘lies in the way it

facilitates communication processing on the part of hearer’, because ‘language

consisting of a relatively high number of fixed phrases is generally more predictable

than that which is not’ while ‘in real time language decoding, hearers need all the help

they can get.’ As such, competence in a language undoubtedly seems to involve

collocational knowledge (cf. Herbst 1996: 389). Collocational knowledge indicates

which lexical items co-occur frequently with others and how they combine within a

sentence. Such knowledge is evidently more important than individual words

themselves (cf. Kita and Ogata 1997: 230) and is needed for effective sentence

generation (cf. Smadja and McKeown 1990). Zhang (1993), for example, finds that

more proficient L2 writers use significantly more collocations, more accurately and in

more variety than less proficient learners. Collocational error is a common type of

error for learners (cf. McAlpine and Myles 2003: 75). Gui and Yang (2002: 48)

observe, on the basis of the Chinese Learner English corpus, that collocation error is

one of the major error types for Chinese learners of English. Altenberg and Granger

(2001) and Nesselhauf (2003) find that even advanced learners of English have

considerable difficulties with collocation. One possible explanation is that learners are

deficient in ‘automation of collocations’ (Kjellmer 1991). ‘As a result, learners need

detailed information about common collocational patterns and idioms; fixed and semi-

fixed lexical expressions and different degrees of variability; relative frequency and

currency of particular patterns; and formality level’ (McAlpine and Myles 2003: 75).

Corpora are useful in this respect, not only because collocations can only reliably be

measured quantitatively, but also because the KWIC (key word in context) view of

corpus data exposes learners to a great deal of authentic data in a structured way. Our

view is in line with Kennedy (2003), who discusses the relationship between corpus data

and the nature of language learning, focusing on the teaching of collocations. The

author argues that second or foreign language learning is a process of learning

‘explicit knowledge’ with awareness, which requires a great deal of exposure to

language data.
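
The KWIC view mentioned above can be sketched in a few lines of Python. This is a minimal illustration rather than the output of any particular concordancer: the function names kwic and format_kwic, the width parameter and the sample text are all assumptions made for the example.

```python
def kwic(tokens, keyword, width=4):
    """Collect (left context, keyword, right context) for each hit of `keyword`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def format_kwic(lines):
    """Right-align the left contexts so that the keyword forms one column."""
    pad = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left.rjust(pad)}  {kw}  {right}" for left, kw, right in lines]


# Invented sample text, pre-tokenised on whitespace for simplicity.
sample = ("The decision was made by the committee . A mistake was made "
          "in the report . They made progress .").split()

for line in format_kwic(kwic(sample, "made", width=3)):
    print(line)
```

Aligning the keyword in a single column is what lets learners scan the words to its left and right for recurrent patterns.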

In addition to the lexical focus, corpus-based teaching materials try to demonstrate

how the target language is actually used in different contexts, as exemplified in Biber

et al’s (2002) Longman Student Grammar of Spoken and Written English, which pays

special attention to how English is used differently in various spoken and written

registers.

2.3. Language testing

Another emerging area of language pedagogy which has started to use the corpus-

based approach is language testing. Alderson (1996) envisaged the possible uses of

corpora in this area: test construction, compilation and selection, test presentation,

response capture, test scoring, and calculation and delivery of results. He concludes

that ‘[t]he potential advantages of basing our tests on real language data, of making

data-based judgments about candidates’ abilities, knowledge and performance are

clear enough. A crucial question is whether the possible advantages are borne out in

practice’ (Alderson 1996: 258-259). The concern raised in Alderson’s conclusion

appears to have been addressed satisfactorily. Choi, Kim and Boo (2003), for

example, find that computer-based tests are comparable to paper-based tests. A

number of corpus-based studies of language testing have been reported. For example,

Coniam (1997) demonstrated how to use word frequency data extracted from corpora

to generate cloze tests automatically. Kaszubski and Wojnowska (2003) presented a

corpus-driven program for building sentence-based ELT exercises – TestBuilder. The

program can process raw and part-of-speech tagged corpora, tagged on the fly by a

built-in part-of-speech tagger, and uses this as input for test material selection. Indeed,

corpora have recently been used by major providers of test services for a number of

purposes: 1) as an archive of examination scripts; 2) to develop test materials; 3) to

optimize test procedures; 4) to improve the quality of test marking; 5) to validate

tests; and 6) to standardize tests (cf. Ball 2001; Hunston 2002: 205). For example, the

University of Cambridge Local Examinations Syndicate (UCLES) is active in both

corpus development (e.g. Cambridge Learner Corpus, Cambridge Corpus of Spoken

English, Business English Text Corpus and Corpus YLE Speaking Tests) and the

analysis of native English corpora and learner corpora. At UCLES, native English

corpora such as the British National Corpus (BNC) are used ‘to investigate

collocations, authentic stems and appropriate distractors which enable item writers to

base their examination tasks on real texts’ (Ball 2001: 7); the corpus-based approach

is used to explore ‘the distinguishing features in the writing performance of EFL/ESL

learners or users taking the Cambridge English examinations’ and how to incorporate

these into ‘a single scale of bands, that is, a common scale, describing different levels

of L2 writing proficiency’ (Hawkey 2001: 9); corpora are also used for the purpose of

speaking assessment (Ball and Wilson 2002; Taylor 2003) and to develop domain-

specific (e.g. business English) wordlists for use in test materials (Ball 2002; Horner

and Strutt 2004).
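
As a rough illustration of how word frequency data might drive automatic cloze generation, the following Python sketch blanks out words that are rare in a reference frequency list. It does not reproduce the procedure of Coniam (1997) or of any other study cited here: the function make_cloze, its parameters max_freq and min_gap, and the toy reference corpus are all hypothetical.

```python
import re
from collections import Counter


def make_cloze(passage, reference_freq, max_freq=1, min_gap=2):
    """Blank out words whose reference frequency is at most `max_freq`,
    keeping at least `min_gap` intact tokens between consecutive gaps.
    Words absent from the reference list count as frequency 0."""
    out, answers = [], []
    since_gap = min_gap  # lets the first eligible word become a gap
    for tok in passage.split():
        # Separate leading/trailing punctuation from the word itself.
        m = re.fullmatch(r"(\W*)(\w+)(\W*)", tok)
        rare = m is not None and reference_freq.get(m.group(2).lower(), 0) <= max_freq
        if rare and since_gap >= min_gap:
            answers.append(m.group(2))
            out.append(f"{m.group(1)}({len(answers)})____{m.group(3)}")
            since_gap = 0
        else:
            out.append(tok)
            since_gap += 1
    return " ".join(out), answers


# Toy reference corpus invented for illustration.
reference = Counter(
    "the cat sat on the mat the dog sat on the rug the cat saw the dog".split()
)
cloze, key = make_cloze("The cat saw the rug on the mat.", reference)
print(cloze)
print(key)
```

In practice the frequency threshold and the spacing between gaps would need to be tuned to the level of the test takers, and the reference list would come from a large corpus rather than a single sentence.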

2.4. Teacher development

For learners to benefit from the use of corpora, language teachers must first of all be

equipped with a sound knowledge of the corpus-based approach. It is unsurprising to

discover then that corpora have been used in training language teachers (e.g. Allan

1999, 2002; Conrad 1999; Seidlhofer 2000, 2002; O’Keeffe and Farr 2003). Allan

(1999), for example, demonstrates how to use corpus data to raise the language

awareness of English teachers in Hong Kong secondary schools. Conrad (1999)

presents a corpus-based study of linking adverbials (e.g. therefore and in other

words), on the basis of which she suggests that it is important that a language teacher

do more than use classroom concordancing and lexical or lexico-grammatical

analyses if language teaching is to take full advantage of the corpus-based approach.

Conrad’s concern with teacher education is echoed by O’Keeffe and Farr (2003), who

argue that corpus linguistics should be included in initial language teacher education

so as to enhance teachers’ research skills and language awareness.

3. Direct use of corpora

While indirect uses such as syllabus design and materials development are closely

associated with what to teach, corpora have also provided valuable insights into how

to teach. Of Leech’s (1997) three focuses, direct uses of corpora include ‘teaching

about’, ‘teaching to exploit’, and ‘exploiting to teach’, with the latter two relating to how

to use. Given a number of restricting factors as noted in section 2, direct uses have so

far been confined largely to learning at more advanced levels, for example, in tertiary

education, whereas in general English language teaching (not to mention other

foreign languages), especially in secondary education (see Braun 2007 for a rare example

of an empirical study of using corpora in secondary education), the direct use of

corpora is ‘still conspicuously absent’ (Kaltenböck and Mehlmauer-Larcher 2005).

‘Teaching about’ means teaching corpus linguistics as an academic subject like other

sub-disciplines of linguistics such as syntax and pragmatics. Corpus linguistics has

now found its way into the curricula for linguistic and language related degree

programmes at both postgraduate and undergraduate levels. ‘Teaching to exploit’

means providing students with ‘hands-on’ know-how, as emphasized in McEnery,

Xiao and Tono (2006), so that they can exploit corpora for their own purposes. Once

the student has acquired the necessary knowledge and techniques of corpus-based

language study, learning activity may become student centred. ‘Exploiting to teach’

means using a corpus-based approach to teaching language and linguistics courses

(e.g. sociolinguistics and discourse analysis), which would otherwise be taught using

non-corpus-based methods.

If the focuses of ‘teaching about’ and ‘exploiting to teach’ are viewed as being

associated typically with students of linguistics and language programmes, ‘teaching

to exploit’ relates to students of all subjects which involve language study and

learning, who are expected to benefit from the so-called data-driven learning (DDL)

or ‘discovery learning’.

The issue of how to use corpora in the language classroom has been discussed

extensively in the literature. With the corpus-based approach to language pedagogy,

the traditional ‘three P’s’ (Presentation – Practice – Production) approach to teaching

may not be entirely suitable. Instead, the more exploratory approach of ‘three I’s’

(Illustration – Interaction – Induction) may be more appropriate, where ‘illustration’

means looking at real data, ‘interaction’ means discussing and sharing opinions and

observations, and ‘induction’ means making one’s own rule for a particular feature,

which ‘will be refined and honed as more and more data is encountered’ (see Carter

and McCarthy 1995: 155). This progressive induction approach is what Murison-

Bowie (1996: 191) would call the interlanguage approach: namely, partial and

incomplete generalizations are drawn from limited data as a stage on the way towards

a fully satisfactory rule. While the ‘three I’s’ approach was originally proposed by

Carter and McCarthy (1995) to teach spoken grammar, it may also apply to language

education as a whole, in our view.

It is clear that the teaching approach focusing on ‘three I’s’ is in line with Johns’

(1991) concept of ‘data-driven learning (DDL)’. Johns was perhaps among the first to

realize the potential of corpora for language learners (e.g. Higgins and Johns 1984). In

his opinion, ‘research is too serious to be left to the researchers’ (Johns 1991: 2). As

such, he argues that the language learner should be encouraged to become ‘a research

worker whose learning needs to be driven by access to linguistic data’ (ibid). Johns’

web-based Kibbitzer (www.eisu2.bham.ac.uk/johnstf/timeap3.htm) gives some very

good examples of data-driven learning.

Data-driven learning can be either teacher-directed or learner-led (i.e. discovery

learning) to suit the needs of learners at different levels, but it is basically learner-

centred. This autonomous learning process ‘gives the student the realistic expectation

of breaking new ground as a “researcher”, doing something which is a unique and

individual contribution’ (Leech 1997: 10). It is important to note, however, that the

key to successful data-driven learning, even if it is student-centred, is the appropriate

level of teacher guidance or mediation depending on the learners’ age, experience,

and proficiency level, because ‘a corpus is not a simple object, and it is just as easy to

derive nonsensical conclusions from the evidence as insightful ones’ (Sinclair 2004:

2). In this sense, it is even more important for language teachers to be equipped with

the necessary training in corpus analysis (cf. section 6).

Johns (1991) identifies three stages of inductive reasoning with corpora in the DDL

approach: observation (of concordanced evidence), classification (of salient features)

and generalization (of rules). The three stages roughly correspond to Carter and

McCarthy’s (1995) ‘three I’s’. The DDL approach is fundamentally different from the

‘three P’s’ approach in that the former is bottom-up induction whereas the latter is

top-down deduction. The direct use of corpora and concordancing in the language

classroom has been discussed extensively in the literature (e.g. Tribble 1991, 1997a,

1997b, 2000, 2003; Tribble and Jones 1990, 1997; Flowerdew 1993; Karpati 1995;

Kettemann 1995, 1996; Wichmann 1995; Woolls 1998; Aston 2001; Osborne 2001;

Braun 2007), covering a wide range of issues including, for example, underlying

theories, methods and techniques, and problems and solutions.

4. Teaching-oriented corpora

Teaching-oriented corpora are particularly useful in teaching languages for specific

purposes (LSP corpora) and in research on L1 (developmental corpora) and L2

(learner corpora) language acquisition. Such corpora can be used directly or indirectly

in language pedagogy as discussed in previous sections.

4.1. Languages for specific purposes and professional communication

In addition to teaching English as a second or foreign language in general, a great deal

of attention has been paid to domain-specific language use and professional

communication (e.g. English for specific purposes and English for academic purposes).

For example, Thurstun and Candlin (1997, 1998) explore the use of concordancing in

teaching writing and vocabulary in academic English. Hyland (1999) compares the

features of the specific genres of metadiscourse in introductory course books and

research articles on the basis of a corpus consisting of extracts from 21 university

textbooks for different disciplines and a similar corpus of research articles. Upton and

Connor (2001) undertake a ‘moves analysis’ of business English using a business

learner corpus. The authors approach the cultural aspect of professional

communication by comparing the ‘politeness strategies’ used by learners from

different cultural backgrounds. Thompson and Tribble (2001) examine citation

practices in academic text. Koester (2002) argues, on the basis of an analysis of the

performance of speech acts in workplace conversations, for a discourse approach to

teaching communicative functions in spoken English. Yang and Allison (2003) study

the organizational structure in research articles in applied linguistics. Carter and

McCarthy (2004) explore, on the basis of the CANCODE corpus, a range of social

contexts in which creative uses of language are manifested. Hinkel (2004) compares

the use of tense, aspect and the passive in L1 and L2 academic texts. Xiao (2003)

reviews a number of case studies using domain-specialized multilingual corpora to

teach domain-specific translation. Studies such as these demonstrate that LSP corpora

are particularly useful in teaching language for specific purposes and professional

communication.

4.2. Learner corpora and interlanguage analysis

Two kinds of corpora that emerged in the 1990s have not only greatly contributed to

the vitality of corpus linguistics but have also revived contrastive analysis and

interlanguage research. They are learner corpora and multilingual corpora. This

section discusses learner corpora while the topic of multilingual corpora will be taken

up for further discussion in section 5.1.

The creation and use of learner corpora in language pedagogy and interlanguage

research has been welcomed as one of the most exciting recent developments in

corpus-based language studies. If native speaker corpora of the target language

provide a top-down approach to using corpora in language pedagogy, learner corpora

provide a bottom-up approach to language teaching (Osborne 2002).

A learner corpus, as opposed to a “developmental corpus” composed of data produced

by children acquiring their mother tongue (L1), comprises written or spoken data

produced by language learners who are acquiring a second or foreign language. Data

of this type has been particularly useful in language pedagogy and second language

acquisition (SLA) research, as demonstrated by the fruitful learner corpus studies

published over the past decade (see Pravec 2002; Keck 2004; and Myles 2005 for

recent reviews). SLA research is primarily concerned with ‘the mental representations

and developmental processes which shape and constrain second language (L2)

productions’ (Myles 2005: 374). Language acquisition occurs in the mind of the

learner, which cannot be observed directly and must be studied from a psychological

perspective. Nevertheless, if learner performance data is shaped and constrained by

such a mental process, it at least provides indirect, observable, and empirical evidence

for the language acquisition process. Note that using product as evidence for process

may not be less reliable; sometimes this is the only practical way of finding out about

process. Stubbs (2001) draws a parallel between corpora in corpus linguistics and

rocks in geology, ‘which both assume a relation between process and product. By and

large, the processes are invisible, and must be inferred from the products.’ Like

geologists who study rocks because they are interested in geological processes to

which they do not have direct access, SLA researchers can analyze learner

performance data to infer the inaccessible mental process of second language

acquisition. Learner corpora can also be used as an empirical basis that tests

hypotheses generated using the psycholinguistic approach, and to enable the findings

previously made on the basis of limited data of a small number of informants to be

generalised. Additionally, learner corpora have widened the scope of SLA research so

that, for example, interlanguage research nowadays treats learner performance data in

its own right rather than as decontextualised errors in traditional error analysis (cf.

Granger 1998: 6).

At the pre-conference workshop on learner corpora affiliated to the International

Symposium of Corpus Linguistics 2003 held at the University of Lancaster, the

workshop organizers Yukio Tono and Fanny Meunier observed that learner corpora

are no longer in their infancy but are going through their nominal teenage years – they

are full of promise but not yet fully developed. In language pedagogy, the

implications of learner corpora have been explored for curriculum design, materials

development and teaching methodology (cf. Keck 2004: 99). The interface between

L1 and L2 materials has been explored. Meunier (2002), for example, argues that

frequency information obtained from native speaker corpora alone is not sufficient to

inform curriculum and materials design. Rather, ‘it is important to strike a balance

between frequency, difficulty and pedagogical relevance. That is exactly where

learner corpus research comes into play to help weigh the importance of each of

these’ (Meunier 2002: 123). Meunier also advocates the use of learner data in the

classroom, suggesting that exercises such as comparing learner and native speaker

data and analyzing errors in learner language will help students to notice gaps

between their interlanguage and the language they are learning. Interlanguage studies

based on learner corpora which have been undertaken so far focus on what Granger

(2002) calls ‘Contrastive Interlanguage Analysis (CIA)’, which compares learner data

and native speaker data, or language produced by learners from different L1

backgrounds. The first type of comparison typically aims to identify under- or overuse

of particular linguistic features in learner language while the second type aims to

uncover L1 interference or transfer. In addition to CIA, learner corpora have also been

used to investigate the order of acquisition of particular morphemes. Readers can refer

to Granger et al (2002) for recent work in the use of learner corpora, and read Granger

(2003) for a more general discussion of the applications of learner corpora such as the

International Corpus of Learner English (ICLE).

In addition to SLA research, learner corpora can also be used directly in classroom

teaching. For example, Seidlhofer (2002) and Mukherjee and Rohrbach (2006)

demonstrate how a ‘local learner corpus’ containing students’ own writings can be

used directly for learning by addressing students’ questions about their own or

classmates’ writings, or analyzing and correcting errors in such familiar writings.

We have so far discussed how corpora, including teaching-oriented corpora such as

LSP corpora and learner corpora, can be used directly or indirectly in language

pedagogy. The section that follows seeks to demonstrate the predictive and diagnostic

power of the integrated approach that combines contrastive corpus linguistics with

interlanguage analysis in second language acquisition research as advocated in Römer

(2008), via a case study of passive constructions in Chinese learner English.

5. Using contrastive corpus linguistics to inform SLA research

In this section, we will first clarify the type of corpora used in contrastive corpus

linguistics, which will be followed by a summary of the findings from a published

contrastive study of passive constructions in English and Chinese based on

comparable corpora of the two languages (Xiao, McEnery and Qian 2006). These

findings will in turn be used to predict and diagnose the performance of Chinese

learners of English in their use of English passives as mirrored in a sizeable Chinese

learner English corpus in comparison with a comparable native English corpus.

5.1. Contrastive corpus linguistics

As noted in section 4.2, multilingual corpora have been an important development in

corpus research since the 1990s. A multilingual corpus involves two or more

languages. Data contained in this kind of corpus can be either source texts in one

language plus their translations in another language or other languages, or texts

collected from different native languages using comparable sampling techniques to

achieve similar coverage and balance. The two types of multilingual corpora are

usually referred to as parallel corpora and comparable corpora respectively and used

in translation and contrastive studies.

Contrastive studies can be theoretically oriented or geared towards applied research.

Theoretic contrastive studies are language independent and primarily concerned with

how a universal category is realised in two or more different languages, whilst applied

contrastive studies are preoccupied with how a common category in one language is

realised in another language. In its early stage, contrastive linguistics was

predominantly theoretic, though the applied aspect was not totally neglected.

Theoretically oriented contrastive studies were continued from the late 1920s all the

way into the 1960s by the Prague School. On the other hand, WWII aroused great

interest in foreign language teaching in the United States, and contrastive studies were

recognised as an important part of foreign language teaching methodology (cf. Fries

1945; Lado 1957). As a means of ‘predicting and/or explaining difficulties of second

language learners with a particular mother tongue in learning a particular target

language’ (Johansson 2003), applied contrastive studies were dominant throughout

the 1960s. However, it was soon realised that language learning could not be

accounted for by cross-linguistic contrast alone (see Sajavaara 1996 for a discussion

of some problems with contrastive linguistics), and as a result contrastive studies lost

ground to more learner-oriented approaches such as error analysis, performance

analysis and interlanguage analysis (cf. Johansson 2003). The revival of contrastive

studies in the 1990s has largely been attributed to the corpus methodology and the

availability of multilingual corpora (cf. Granger 1996: 37; Salkie 1999; Johansson

2003).

What kind of corpora can be used in contrastive analysis? To answer this question, we

will first need to have a general idea of the purposes of multilingual corpora of various

kinds.

While multilingual corpora, and especially comparable corpora, are designed and

created with the explicit aim of cross-linguistic contrast, all corpora have ‘always

been pre-eminently suited for comparative studies’ (Aarts 1998: i). For example, the

four English corpora of the Brown family (i.e. Brown, LOB, Frown and FLOB; see Xiao

2008: 395-297 for a comparison of these corpora) were created for synchronic and

diachronic comparisons of English as used in Britain and the US in the early 1960s

and the early 1990s, while the Lancaster Corpus of Mandarin Chinese (LCMC) was

designed as a Chinese match for FLOB and Frown to facilitate cross-linguistic

contrasts of English and Chinese (McEnery, Xiao and Mo 2003). The International

Corpus of English (ICE) project has used a common corpus design and the same

sampling criteria for each of its components to ensure their comparability (Nelson

1996); similarly, the International Corpus of Learner English (ICLE) is designed in

such a way that the subcorpora for learners of different L1 backgrounds are

comparable (Granger 1998). Even a corpus like the British National Corpus (BNC),

which was designed to be representative of modern British English (Aston and

Burnard 1998), also provides a useful basis for various intra-lingual comparisons (e.g.

genre-based variations and variations caused by sociolinguistic variables). Clearly,

corpora are intrinsically comparative, and so is the corpus linguistics methodology.

For example, collocations are extracted using statistical measures that compare the

probabilities of co-occurring words within a specified window span of the node word;

keywords are identified by comparing the target corpus with a reference corpus; what

Granger (1998: 12) referred to as Contrastive Interlanguage Analysis (CIA) is also

mainly concerned with comparison, e.g. comparing interlanguage with target native

language, and comparing different interlanguages (in terms of L1 background, age,

proficiency level, task type, learning setting, and medium etc). In short, it can be said

that the whole corpus research enterprise is based on comparison, for example, by

comparing the same linguistic feature in different corpora, comparing different

linguistic features in the same corpus, and comparing what is observed and what is

expected.
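The comparison of observed and expected co-occurrence that underlies collocation extraction can be illustrated with a minimal Python sketch. It scores the words found within a window span of a node word using pointwise mutual information. The choice of this particular measure, the function name `window_pmi` and the toy sentence are assumptions made for illustration only; no specific statistic is prescribed here.

```python
import math
from collections import Counter


def window_pmi(tokens, node, span=2):
    """Score words co-occurring with `node` within +/- `span` tokens.

    Returns a dict mapping each collocate to log2(observed / expected),
    where `expected` is the co-occurrence count predicted if the node
    and the collocate were distributed independently of each other.
    """
    n = len(tokens)
    freq = Counter(tokens)
    observed = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        # Collect every word inside the window, excluding the node itself.
        for j in range(max(0, i - span), min(n, i + span + 1)):
            if j != i:
                observed[tokens[j]] += 1
    scores = {}
    for word, count in observed.items():
        # Each node occurrence opens 2 * span slots; under independence
        # each slot holds `word` with probability freq[word] / n.
        expected = freq[node] * freq[word] * 2 * span / n
        scores[word] = math.log2(count / expected)
    return scores


# Hypothetical toy data, not drawn from any corpus discussed here.
toy = "the cake was eaten by the dog and the bone was eaten by the cat".split()
for word, score in sorted(window_pmi(toy, "eaten", span=1).items()):
    print(word, round(score, 2))
```

A positive score indicates that a word occurs near the node more often than chance would predict. In practice such scores are computed over full corpora and are usually combined with a minimum frequency threshold, since mutual information tends to favour rare words.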

While corpus linguistics is clearly comparative in nature, the technical terms for

corpora used in linguistic comparison are somewhat confusing, with the controversy

revolving around the issue of whether a parallel corpus should be a corpus composed

of source texts plus translations, or a corpus containing native language data collected

using comparable sampling criteria. As we have argued elsewhere (McEnery et al

2006: 47), a parallel corpus is composed of source texts and their translations, whilst a

comparable corpus contains L1 texts sampled from different languages which are

comparable in sampling criteria. A translation corpus, instead of referring to what is

actually a parallel corpus as suggested in the literature, comprises translated texts for

use in studies of translational language (e.g. the Translational English Corpus).

Corpora which are designed primarily for intra-lingual comparison or for comparing

different varieties of the same language (e.g. the ICE) are comparative corpora.

Having clarified the terminology, it is appropriate to discuss what types of corpora

are to be used in cross-linguistic contrasts. This is in fact an issue which is as

debatable as the terminological issue. It has been argued that parallel corpora provide

a sound basis for contrastive analysis, as demonstrated in the claims that ‘translation

equivalence is the best available basis of comparison’ (James 1980: 178), and that

‘studies based on real translations are the only sound method for contrastive analysis’

(Santos 1996: i). However, as has been widely observed (Baker 1993: 243-5;

Hartmann 1995; Gellerstam 1996; Teubert 1996: 247; Laviosa 1997: 315; McEnery

and Wilson 2001: 71-72; McEnery and Xiao 2002; Xiao and Yue 2009; Xiao, He and

Yue forthcoming), translational language is ‘an unrepresentative special variant of the

target language’ which is perceptibly influenced by the source language (McEnery,

Xiao and Tono 2006: 93). The source texts and translations in a parallel corpus are

certainly comparable in terms of sampling criteria such as genre (in fact, sampling

applies only to the selection of source texts and is not applied again to translations), but this

comparability is immediately undermined by so-called ‘translationese’ in translated

texts. For example, Laviosa (1998) finds that translational language has four core

patterns of lexical use: a relatively lower proportion of lexical words over function

words, a relatively higher proportion of high-frequency words over low-frequency

words, a relatively greater repetition of the most frequent words, and less variety in

the words that are most frequently used. Beyond the lexical level, translational

language is characterised by normalization, simplification (Baker 1993), explicitation

(i.e. increased cohesion, Øverås 1998), and sanitization (i.e. reduced connotational

meanings, Kenny 1998). In addition to these common features of translational

language, Granger (1996) has noted some similarity between translationese and what

she calls ‘learnerese’: ‘Both are situated somewhere between L1 and L2 and are likely

to contain examples of transfer’, and both ‘give evidence of what Gellerstam (1986:

94) calls “syntactic fingerprints”’ (Granger 1996: 48).

As observations resulting from parallel corpus analysis usually invite ‘further research

with monolingual corpora in both languages’ (Mauranen 2002: 182), parallel corpora

can be a useful starting point of contrastive analysis. Nevertheless, it is also clear from

the discussion above that while they are ideal resources for translation studies (see

McEnery and Xiao 2007 for further discussion), parallel corpora provide a poor basis

for cross-linguistic contrasts if relied upon alone.

In the section that follows, we will present the findings of a contrastive study of

passive constructions in English and Chinese on the basis of comparable written and

spoken corpora of the two languages, which will be used to predict and diagnose what

is observed in Chinese learner English.

5.2. Passive constructions in English and Chinese

This section summarises the results of a contrastive corpus analysis of passive

constructions on the basis of comparable corpora of English and Chinese, which was

published in Xiao, McEnery and Qian (2006). The primary corpus resources used in

that study included FLOB for written English and LCMC for written Chinese,

together with spoken corpora composed of transcripts for casual conversations in the

two languages. In addition, two spoken corpora with sampling periods similar to those of FLOB

and LCMC were used to compare speech and writing. For English we used the

demographically sampled component of the British National Corpus, amounting to

approximately four million words of conversational data sampled during 1985-1994.

For Chinese we used the Callhome Mandarin Chinese Transcript corpus, which

contains 120 transcripts of telephone conversations amounting to roughly 300,000

words (see McEnery and Xiao 2008).

Our corpus-based contrastive study yields a number of interesting findings. Below we

will only give a summary of the results that are most relevant to our discussion of the

performance of Chinese learners of English in the following section.

Firstly, passive constructions are nearly ten times as frequent in English as in Chinese,

with normalised frequencies of 1,026 and 110 instances per million words for the two

languages respectively. There are a number of reasons for this contrast. First, be-passives

can be used for both stative and dynamic situations whereas Chinese passives

can only occur in dynamic events; second, Chinese passives usually have a negative

pragmatic meaning while English passives (especially be-passives) do not; third,

English has a tendency to overuse passives, especially in formal writing whereas

Chinese tends to avoid syntactic passives wherever possible; Chinese has a number of

linguistic devices other than the syntactically marked passive constructions to express

a passive meaning, e.g. notional passives, lexical passives, topic sentences, subjectless

sentences, sentences with vague subjects (e.g. youren ‘someone’, renmen ‘people’,

dajia ‘all’), and special structures such as the disposal ba construction and the

predicative shi…de structure. Finally, syntactically unmarked notional passives are

more common in Chinese than in English because English is a subject-oriented

language whereas Chinese is topic oriented. Given that Chinese passives are much

more restricted in scope of use, their low frequency in relation to their English

counterparts is unsurprising. It can be predicted from this sharp contrast in frequency

of use that Chinese learners of English are very likely to underuse passives in their

interlanguage.

Secondly, passives are formed by an auxiliary (be, get) followed by a past participial

verb in English whilst in Chinese they can be marked syntactically by passive markers

such as bei, indicated lexically by verbs with an inherent passive meaning (e.g. zao

‘suffer’), or simply expressed by unmarked notional passives or special sentence

structures. Unlike English, which inflects the passivised verb morphologically,

Chinese is non-inflectional, which means that the same verb form is used for both

active and passive voices in Chinese. Also because of the non-inflectional Chinese

morphology, the concept of auxiliary is less salient or useful in Chinese. These

cross-linguistic differences seem to suggest that the choice of correct auxiliaries as well as

proper inflectional forms for passivised verbs can constitute an area of difficulty for

Chinese learners acquiring English passives.

Thirdly, short passives (i.e. passives without a by-phrase introducing an agent) are

typical of English, accounting for over 90% of total occurrences in both speech and

writing. Short passives are predominant in English simply because passives are often

used in English as a strategy that allows one to avoid mentioning the agent when it

cannot or must not be mentioned, while they are also used for stylistic and coherence

purposes (see Granger 1976 and 1983 for further discussion of uses of passives). In

contrast, three out of five syntactic passive markers in Chinese (wei…suo, jiao and

rang) only occur in long passives (i.e. passives with an explicit agent). For the two

remaining passive markers bei and gei, which allow both long and short passives, the

proportions of short passives (60.7% and 57.5% respectively) are significantly lower

than that for English passives. Early Chinese grammarians (e.g. Wang 1984; Lü and

Zhu 1979) noted that an agent must normally be spelt out in passive constructions,

though this constraint has become more relaxed nowadays. When it is difficult to spell

out the agent, passives are used in English, but an alternative device mentioned in the

preceding paragraph is often used in Chinese instead of using passives. This finding

can lead one to expect more long passives in the interlanguage of Chinese learners of

English.

Figure 1. Pragmatic meanings of be and bei passives

Finally, a major distinction between passives in English and Chinese is that Chinese

passives are more frequently used with an inflictive meaning than their English

counterparts. With the exception of the archaic passive form wei…suo, over half of

syntactically marked passives in Chinese occur in adversative situations, a proportion

considerably higher than that for English passives (see Figure 1). As the prototypical

passive marker bei was derived from a verb with an inflictive meaning (i.e. ‘suffer’),

Chinese passives were used at early stages primarily for unpleasant or undesirable

events. While this semantic constraint on the use of passives has become more

relaxed, especially in written Chinese, under the influence of western languages,

disyllabic words made up of bei and a single character verb as used in modern

Chinese typically refer to something undesirable, as in beibu ‘be arrested’, beifu ‘be

captured’, beigao ‘the accused’, beihai ‘be a victim’ and beipo ‘be forced’. In

contrast, marking negative pragmatic meanings is not a basic feature of English

passives, though get-passives often refer to undesirable events. An essential difference

between English and Chinese passives lies in how much negativity is coded in them,

which predicts that Chinese learners of English will use passives more frequently for

undesirable situations.

In the next section, we will analyze the use of passives in a Chinese learner English

corpus to ascertain how reliably the findings of our contrastive study as summarized

in this section can predict and diagnose learner behaviour in interlanguage.

5.3. Passive constructions in Chinese learner English

This section examines be passives in Chinese learner English. The corpus used is the

Chinese Learner English Corpus (CLEC), which contains one million words of essays

written by Chinese learners at five proficiency levels: high school students (ST2),

junior and senior non-English majors (ST3 and ST4), and junior and senior English

majors (ST5 and ST6). The five types of learners are equally represented in the

corpus. The corpus is fully annotated with learner errors using an error tagset that

consists of 61 error types clustered in 11 categories (see Gui and Yang 2002). In order

to compare Chinese learners’ interlanguage with native English, the Louvain Corpus

of Native English Essays (LOCNESS) is used as the control data, which is composed

of argumentative essays written by native British and American students on a great

variety of topics, totalling approximately 300,000 words (cf. Granger and Tyson

1996).

Table 1. Passives in CLEC and LOCNESS

Corpus     Words       Passives   Per 100,000 words   LL score
CLEC       1,070,602   9,711        907               1235.6 (p<0.001)
LOCNESS      324,304   5,465      1,685

A comparison of CLEC and LOCNESS shows that in relation to native English

writing, Chinese learners of English significantly underuse passives in their

interlanguage. Table 1 gives the raw frequencies of passive constructions in the two

corpora as well as the frequencies normalised to a common base of 100,000 words.

As can be seen, passives are nearly twice as frequent in native English as in Chinese

learner English. The log-likelihood test (LL) indicates that this difference is

statistically significant (LL=1235.6 for 1 degree of freedom, p<0.001). The significant

underuse of passives in Chinese learner English is hardly surprising in light of the

marked contrast in frequencies for passives in English and Chinese as noted in section

5.2. Granger (1996: 46) also expected French learners of English to underuse passives

in their writing as it was noted that passives were twice as frequent in English as in

French (see Granger 1976), but she did not verify this prediction against French

learner English data. While Chinese learners’ underuse of passives as mirrored in the

CLEC corpus is very likely to be caused by the influence of their native language,

more cross-linguistic contrasts and interlanguage studies involving learners from

other L1 backgrounds are required before we can be more confident that underuse of

passives is the result of L1 transfer rather than a common feature of interlanguages,

irrespective of the learner’s mother tongue, which would mean that learners underuse

passives for developmental reasons. As Granger (2007) observes, while native

English speakers mainly use the verb discuss in the passive, ‘learners show a

predilection for active structures with first person subjects.’
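The normalisation and the significance test used in this comparison can be sketched in a few lines of Python. The sketch below is illustrative only: the helper names `per_base` and `log_likelihood` are hypothetical, and the log-likelihood formulation shown (expected frequencies derived from the pooled rate of the two corpora) is one common variant that we assume here. The result therefore need not coincide exactly with the score reported above. With a base of 100,000 words, the raw counts in Table 1 yield the tabulated rates.

```python
import math


def per_base(count, corpus_size, base=100_000):
    """Frequency normalised to a common base (hypothetical helper)."""
    return count / corpus_size * base


def log_likelihood(count_a, size_a, count_b, size_b):
    """Two-corpus log-likelihood score for one linguistic feature.

    Expected frequencies assume that both corpora share the pooled rate.
    Both counts are assumed to be greater than zero.
    """
    pooled_rate = (count_a + count_b) / (size_a + size_b)
    expected_a = size_a * pooled_rate
    expected_b = size_b * pooled_rate
    return 2 * (count_a * math.log(count_a / expected_a)
                + count_b * math.log(count_b / expected_b))


# Raw figures taken from Table 1: (passives, corpus size in words).
clec = (9711, 1_070_602)
locness = (5465, 324_304)

print(round(per_base(*clec)), round(per_base(*locness)))  # 907 1685

ll = log_likelihood(*clec, *locness)
# 10.83 is the chi-square critical value for p < 0.001 at 1 degree of freedom.
print(ll > 10.83)
```

Because the score is compared against the chi-square distribution with one degree of freedom, any value above 10.83 supports the conclusion that the difference between the two corpora is significant at p<0.001.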

Figure 2. Long and short passives in CLEC and LOCNESS

The results of the contrastive analysis in section 5.2 predicted that Chinese learners

would use long passives more frequently than native English speakers. Figure 2

shows the proportions of long and short passives in CLEC and LOCNESS. It can be

seen that in comparison with native English writings, long passives are indeed slightly

more frequent in Chinese learner English (9.14% and 8.44% for CLEC and

LOCNESS respectively), though this difference is marginal and not statistically

significant (LL=2.18 for 1 degree of freedom, p=0.139).

Figure 3. Pragmatic meanings of passives in CLEC and LOCNESS

It was noted earlier that over 50% of passives in Chinese express an inflictive

meaning whereas the corresponding figure for be passives in English is merely 15%.

Such a contrast would reasonably lead one to expect more negative cases in Chinese

learner English than in native English. This expectation is in fact supported by

evidence from CLEC and LOCNESS. Figure 3 shows that 25.7% of passives in the

Chinese learner English data are negative whilst negative cases account for 16.8% in

native English writings. The log-likelihood test indicates that the differences between

CLEC and LOCNESS in the three meaning categories are statistically significant

(LL=7.4 for 2 degrees of freedom, p=0.025). A comparison of Figures 1 and 3

suggests that the proportions for the three meaning categories for the two types of

native English data (i.e. general English and students’ essays) are very close to each

other. In contrast, the proportions in Chinese learner English shift away from those for

L1 Chinese and move closer to the proportions for L2 English. Given that

interlanguage is ‘situated somewhere between L1 and L2’ (Granger 1996: 48), this

movement is only reasonable and as expected.

An inspection of the specific errors related to the use of passive constructions in

CLEC also demonstrates the value of contrastive corpus linguistics in SLA research.

There are mainly four types of passive-related learner errors: underuse, misuse,

misformation, and auxiliary errors. It can be considered as an advantage of the

corpus-based approach to be able to view underuse or overuse of a linguistic feature

in interlanguage as a type of learner error, as this was not possible in traditional error

analysis without corpus data. Misuse of passives means that learners use passive

constructions where they are not supposed to use them. Misformation errors are

associated with morphological inflections, while auxiliary errors relate to omission

and misuse of auxiliaries in passive constructions.

Figure 4. Passive-related errors in Chinese learner English

Figure 4 charts the distribution of four types of errors, as well as all error types as a

whole, across learner proficiency levels. Unsurprisingly, when all error types are

taken together, learners at higher levels generally make fewer errors related to

passives. Of the four types of learner errors, underuse is the most important type,

followed by misuse and misformation errors. Auxiliary errors are uncommon for

learner groups other than the lowest level ST2 (i.e. high school students). It is also

clear from the figure that learning curves are not straight lines. There can be relapses

in the language acquisition process, especially for difficult items.

It is of interest to note that while error types are associated with learner levels when

the dataset is taken as a whole (LL=51.77 for 12 degrees of freedom, p<0.001),

similar learner groups show similar error types. This means that the differences

between the two non-English-major learner groups (i.e. ST3 and ST4), and between

the two English-major learner groups (i.e. ST5 and ST6) are not statistically

significant, as indicated in Table 2. The table gives the log-likelihood test scores and

probability values (3 degrees of freedom for all pairs of data), with significant

differences highlighted. Hence, Chinese learners can be divided into three broad

groups in terms of their acquisition of English passives: ST2 – ST3/ST4 – ST5/ST6.

Table 2. Association between error types and learner levels

From   To    LL score (3 d.f.)   P value
ST2    ST3   27.303              <0.001
ST3    ST4    6.955               0.073
ST4    ST5   18.563              <0.001
ST5    ST6    6.987               0.072
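The probability values attached to the log-likelihood scores in this section follow from the chi-square distribution with the stated degrees of freedom. As a minimal sketch, assuming only the Python standard library, the upper-tail probability for an integer number of degrees of freedom can be computed with the finite series below. The function name `chi2_sf` is hypothetical; statistical packages provide equivalent routines.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate (integer df >= 1)."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)**i / i!.
        term = total = 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: two-sided normal tail plus a finite correction series.
    tail = math.erfc(math.sqrt(half))
    term = math.sqrt(2 * x / math.pi) * math.exp(-half)
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return tail + total


# Scores and degrees of freedom as reported in this section.
reported = [(2.18, 1), (7.4, 2), (6.955, 3), (6.987, 3), (27.303, 3)]
for score, df in reported:
    print(score, df, round(chi2_sf(score, df), 4))
```

Applied to the scores in Table 2, a check of this kind separates the pairs whose probability falls below the conventional 0.05 threshold (ST2 versus ST3, ST4 versus ST5) from those for which it does not (ST3 versus ST4, ST5 versus ST6).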

While we cannot be conclusive about whether the underuse of passives by Chinese

learners of English is a result of L1 transfer or a stage of the developmental path,

errors of this type in our learner data typically occur with verbs whose Chinese

equivalents are not normally used in passive constructions, as shown in (1).

(1) a. A birthday party will hold in Lily’s house. (ST2)
    b. The woman in white called Anne Catherick. (ST5)

(2) a. The supper had done. (ST2)
    b. wanfan zuo-hao    le
       supper cook-ready ASP
       ‘The supper is ready.’

Underuse errors also occur under the influence of topic sentences in Chinese, as

exemplified in (2a), which is expressed in Chinese as (2b). The Chinese example in

(2b) is an instance of a topic sentence, which is very common in this language. Here

wanfan ‘supper’ in the subject position is the topic and zuo-hao le ‘cook-ready ASP’

is the comment. Sentences like this cannot be used in the passive felicitously (e.g.

*wanfan bei zuo-hao le).

Misuse errors are mostly found in three contexts. Firstly, they occur when intransitive

verbs are passivised (e.g. 3); secondly, errors of this type are related to the misuse of

ergative verbs (e.g. 4); and finally, misuse errors can be a result of training transfer,

i.e. excessive passive training in classroom instruction, as shown in (5). In sentences

like these, the passivised verb is followed by an object, yet Chinese learners have

been taught that passive transformation involves moving the object to the subject

position. This can be taken as a symptom of the overdone passive training in English

classrooms in China.

(3) a. A very unhappy thing was happened in this week. (ST2)

b. I was graduated from Zhongshan University. (ST5)

(4) a. the secince <sic science> is developed quickly (ST4)

b. infant mortality was declined (ST4)

(5) a. Because they have been mastered everything of this job (ST4)

b. many machine and appliance are used electricity as power (ST5)

Misformation errors are a result of L1 interference. As noted in section 5.2, passivised

verbs do not inflect in Chinese. Consequently, Chinese learners of English tend to use

uninflected verbs or misspelled past participles in passive constructions, as

exemplified in (6).

(6) a. His relatives can not stop him, because his choice is protect by the

laws. (ST6)

b. Since the People’s Republic of china <sic China> was found on

October 1949, great changes <…> (ST2)

(7) a. In China, since the new China established, people’s life has goten <sic

gotten>

better and better. (ST3)

b. I am not a smoker, but why do we forced to be a second-hand smoker?

(ST5)

Auxiliary errors, the final type of passive errors in our annotation scheme, are also the

result of L1 interference. We noted earlier that while passives in Chinese can be

marked syntactically, lexical passives, unmarked notional passives and topic

sentences that express a passive meaning are abundant. As such, it is hardly surprising

that Chinese learners of English tend to omit or misuse auxiliaries, as shown in (7).

The discussion in this section suggests that the performance of Chinese learners of

English in their use of English passives is closely linked to their native language; and

most of the passive-related errors in their interlanguage can be accounted for from the

perspective of contrastive corpus linguistics. In the following section, we will discuss

the implications of this study in SLA research.

5.4. Modelling contrastive interlanguage analysis

We hope that the case study has demonstrated the predictive and explanatory power

of contrastive corpus linguistics in SLA research. Combining contrastive analysis

(CA) and contrastive interlanguage analysis (CIA) is undoubtedly a fruitful direction

to pursue in SLA research. This is not a new idea. As early as a decade ago, Granger

(1996: 46) proposed an ‘integrated contrastive model’:

The model involves constant to-ing and fro-ing between CA and CIA. CA

data helps analysts to formulate predictions about interlanguage which can

be checked against CIA data. […] Conversely, CIA results can only be

reliably interpreted as being evidence of transfer if supported by clear CA

descriptions.

Just as CIA has contributed significantly to SLA research by enabling and

foregrounding many areas of investigation which have traditionally been impossible

or marginalized (e.g. quantitatively distinctive features of interlanguage such as

overuse and underuse, the potential effects of learner parameters on interlanguage),

the integrated approach that combines CA and CIA will be an indispensable tool in

SLA research, because ‘if we want to be able to make firm pronouncements about

transfer-related phenomena, it is essential to combine CA and CIA approaches’

(Granger 1998: 14).

This emerging and promising area of research has recently become popular. For

example, Gilquin (2001) demonstrates, on the basis of a case study of causative

constructions in English and French, how the integrated contrastive model can help

explain some of the characteristics of learners’ interlanguage and thus throw new light

on the key notion of transfer, which turns out to be a more complex phenomenon than

has traditionally been assumed. Similarly, Borin and Prütz (2004) use the integrated

contrastive approach to explore L1 syntactic interference in advanced Swedish learner

English by investigating part-of-speech sequences. The increasing interest in the

integrated approach is also demonstrated by the specialised workshop ‘Linking up

Contrastive and Learner Corpus Research’, which was affiliated to the 4th

International Contrastive Linguistics Conference.

We entirely agree with Granger (1996, 1998) that a combination of corpus-based

contrastive study and interlanguage analysis can provide insights into language

acquisition research, but we take a different view of the role of parallel corpora (or

‘translation corpora’ in her words) in cross-linguistic contrasts, for the reasons

outlined earlier in section 5.1. While Granger (1996: 38, 48) is fully aware of the

drawback of using translated texts in contrastive analysis, her examples are largely

based on data of this kind. In our revised CIA model, therefore, contrastive corpus

linguistics interacts with interlanguage analysis on the basis of comparable native

language corpora as illustrated in Figure 5.

Figure 5. A revised model of contrastive interlanguage analysis

It is true that using a bidirectional parallel corpus can average out, to some extent at

least, the undesirable effects of translationese on contrastive analysis. To achieve this

aim, however, the same sampling criteria must apply to the selection of source texts in

both languages, because any mismatch of proportion, genre, or domain, for example,

may invalidate the findings derived from such a corpus (McEnery, Xiao and Tono

2006: 93). A well-matched bidirectional parallel corpus is in fact a mixture of parallel

corpus and comparable corpus, which can become a bridge that brings translation and

contrastive studies together. Yet the ideal bidirectional parallel-comparable corpus

will often not be easy, or even possible, to build because of the heterogeneous pattern

of translation between languages and genres. This is especially true if the corpus aims

to achieve sufficient coverage and balance to produce convincing findings (McEnery

and Xiao 2007). Hence, in our approach, comparable native language data is preferred

in contrastive corpus linguistics. Other kinds of corpora for comparative studies such

as parallel corpora, translational corpora, and comparative corpora are best suited for

their own different purposes. Nevertheless, in spite of some differences in the data types

used, there has been increasing consensus that contrastive corpus linguistics has

something to deliver in second language acquisition research.

6. Conclusions

Before we close the discussion of using corpora in language pedagogy, it is

appropriate to address some objections to the use of corpora in language learning and

teaching. While frequency and authenticity are often considered two of the most

important advantages of using corpora, they are also the locus of criticism from

language pedagogy researchers. For example, Cook (1998: 61) argues that corpus data

impoverishes language learning by giving undue prominence to what is simply

frequent at the expense of rarer but more effective or salient expressions. Widdowson

(1990, 2000) argues that corpus data is authentic only in a very limited sense in that it

is de-contextualized (i.e. traces of texts rather than discourse) and must be re-

contextualized in language teaching. It can also be argued that:

on the contrary, using corpus data not only increases the chances of learners

being confronted with relatively infrequent instances of language use, but

also of their being able to see in what way such uses are atypical, in what

contexts they do appear, and how they fit in with the pattern of more

prototypical uses. (Osborne 2001: 486)

This view is echoed by Goethals (2003: 424), who argues that ‘frequency ranking will

be a parameter for sequencing and grading learning materials’ because ‘frequency is a

measure of probability of usefulness’ and ‘high-frequency words constitute a core

vocabulary that is useful above the incidental choice of text of one teacher or textbook

author.’ Hunston (2002: 194-195) observes that ‘items which are important though

infrequent seem to be those that echo texts which have a high cultural value’, though

in many cases ‘cultural salience is not clearly at odds with frequency.’ While

frequency information is readily available from corpora, no corpus linguist has ever

argued that the most frequent is most important. On the contrary, Kennedy (1998:

290) argues that frequency ‘should be only one of the criteria used to influence

instruction’ and that ‘the facts about language and language use which emerge from

corpus analyses should never be allowed to become a burden for pedagogy’. As such,

raw frequency data is often adjusted for use in a syllabus, as reported in Renouf

(1987: 168). It would be inappropriate, therefore, for language teachers, syllabus

designers, and materials writers to ignore ‘compelling frequency evidence already

available’, as pointed out by Leech (1997: 16), who argues that:

Whatever the imperfections of the simple equation ‘most frequent’ = ‘most

important to learn’, it is difficult to deny that frequency information

becoming available from corpora has an important empirical input to

language learning materials.

Kaltenböck and Mehlmauer-Larcher (2005: 78) downplay the role of frequency in

language learning, arguing that ‘what is frequent in language will be picked up by

learners automatically, precisely because it is frequent, and therefore does not have to

be consciously learned.’ This is not true, however. Determiners such as a and the are

certainly very frequent in English, yet they are difficult for Chinese learners of

English because their mother tongue does not have such grammatical morphemes and

does not maintain a count-mass noun distinction.

Clearly, frequency is not ‘automatically pedagogically useful’ (Kaltenböck and

Mehlmauer-Larcher 2005: 78); decisions relating to teaching must also take account

of overall teaching objectives, learners’ concrete situations, cognitive salience,

learnability, generative value and of course teachers’ intuitions (cf. Kaltenböck and

Mehlmauer-Larcher 2005: 78). However, frequency can at least help syllabus

designers, materials writers and teachers alike to make better-informed and more

carefully motivated decisions (cf. Gavioli and Aston 2001: 239).

If we leave objections to frequency data to one side, Widdowson (1990, 2000) also

questions the use of authentic texts in language teaching. In his opinion, authenticity

of language in the classroom is ‘an illusion’ (1990: 44) because even though corpus

data may be authentic in one sense, its authenticity of purpose is destroyed by its use

with an unintended audience of language learners (see Murison-Bowie 1996: 189).

Widdowson (2003: 93) makes a distinction between ‘genuineness’ and ‘authenticity’,

which are claimed to be the features of text as a product and discourse as a process

respectively: corpora are genuine in that they comprise attested language use, but they

are not authentic for language teaching because they have been deprived of their

contexts (as opposed to co-texts). We will not engage in the debate here, but would like to

draw readers’ attention to Stubbs’ (2001) metaphor of product versus process as cited

in section 4.2. The implication of Widdowson’s argument is that only language

produced for imaginary situations in the classroom is ‘authentic’. Even if we do

follow Widdowson’s genuineness-authenticity distinction, it is not clear why such

imaginary situations are authentic because authenticity, as opposed to genuineness,

would mean real communicative context. Situations conjured up for classroom

teaching obviously do not take place in real communicative contexts, so how can they

be authentic if we choose to keep this distinction? When students learn and practise a

shopping ‘discourse’, they are by no means actually shopping! Furthermore, as

argued by Fox (1987), invented examples often do not reflect nuances of usage. That

is perhaps why, as Mindt (1996: 232) observes, students who have been taught

‘school English’ cannot readily cope with English used by native speakers in real life.

As such, Wichmann (1997: xvi) argues that in language teaching, ‘the preference for

“authentic” texts requires both learners and teachers to cope with language which the

textbooks do not predict.’

The discussions in sections 2-4 suggest that corpora appear to have played a more

important role in helping to decide what to teach (indirect uses) than how to teach

(direct uses). While indirect uses of corpora seem to be well established, direct uses of

corpora in teaching are largely confined to advanced levels like higher education.

Corpus-based learning activities are nearly absent from general TEFL classes at lower

levels like secondary education. Of the various causes for this absence mentioned

earlier, perhaps the most important are the access to appropriate corpus resources and

the necessary training of teachers, which we view as priorities for future tasks of

corpus linguists if corpora are to be popularised in general language teaching contexts.

While there is a wide range of existing corpora that are publicly available (see Xiao

2008 for a recent survey), the majority of those resources have been developed ‘as

tools for linguistic research and not with pedagogical goals in mind’ (Braun 2007). As

Cook (1998: 57) suggests, ‘the leap from linguistics to pedagogy is […] far from

straightforward.’ To bridge the gap between corpora and language pedagogy, the first

step would involve creating corpora that are pedagogically motivated, in both design

and contents, to meet pedagogical needs and curricular requirements so that corpus-

based learning activities become an integral part, rather than an additional option, of

the overall language curriculum. Such pedagogically motivated corpora ‘should not

only be more coherent than traditional corpora; they should, as far as possible, also be

complementary to school curricula, to facilitate both the contextualisation process and

the practical problems of integration’ (Braun 2007: 310). The design of such corpus-

based learning activities must also take account of learners’ age, experience and level

as well as their integration into the overall curriculum.

Given the situation of learners (e.g. their age, level of language competence, level of

expert knowledge, and attitude towards learning autonomy) in general language

education in relation to advanced learners in tertiary education, even such

pedagogically motivated corpus study activities must be mediated by teachers. This in

turn raises the issue of the current state of teachers’ knowledge and skills of corpus

analysis and data interpretation, which is another practical problem that has prevented

direct use of corpora in language pedagogy. As Kaltenböck and Mehlmauer-Larcher

(2005: 81) argue, ‘mediation by the teacher is a necessary prerequisite for successful

application of computer corpora in language teaching and should therefore be given

sufficient attention in teacher education courses’ (cf. also O’Keeffe and Farr 2003).

However, as the integration of corpus studies into language teacher training is only a quite

recent phenomenon (cf. Chambers 2007), ‘it will therefore at least take more time,

and perhaps a new generation of teachers, for corpora to find their way into the

language classroom’ (Braun 2007: 308).

In conclusion, it is our view that corpora will not only revolutionize the teaching of

subjects such as grammar in the 21st century (see Conrad 2000), they will also

fundamentally change the ways we approach language education, including both what

is taught and how it is taught. As Gavioli and Aston (2001) argue, corpora should not

only be viewed as resources which help teachers to decide what to teach, they should

also be viewed as resources from which learners may learn directly.

References:

Aarts, J. (1998) ‘Introduction’. In S. Johansson and S. Oksefjell (eds.) Corpora and

Cross-linguistic Research. Amsterdam: Rodopi. ix-xiv.

Aijmer, K. (2009) Corpora and Language Teaching. Amsterdam: John Benjamins.

Alderson, C. (1996) ‘Do corpora have a role in language assessment?’ in J. Thomas

and M. Short (eds.) Using Corpora for Language Research, pp. 248-259. London:

Longman.

Allan, Q. (1999) ‘Enhancing the language awareness of Hong Kong teachers through

corpus data’. Journal of Technology and Teacher Education 7/1: 57-74.

Allan, Q. (2002) ‘The TELEC secondary learner corpus: a resource for teacher

development’ in S. Granger, J. Hung and S. Petch-Tyson (eds.) Computer Learner

Corpora, Second Language Acquisition and Foreign Language Teaching, pp.

195–212. Philadelphia: John Benjamins.

Altenberg, B. and Granger, S. (2001) ‘The grammatical and lexical patterning of

MAKE in native and non-native student writing.’ Applied Linguistics 22/2: 173-

95.

Aston, G. (1995) ‘Corpora in language pedagogy: matching theory and practice’ in G.

Cook and B. Seidlhofer (eds.) Principle and Practice in Applied Linguistics:

Studies in Honour of H. G. Widdowson. Oxford: Oxford University Press.

Aston, G. (ed.) (2001) Learning with Corpora. Houston, TX: Athelstan.

Aston, G., Bernardini, S. and Stewart, D. (eds.) (2004) Corpora and Language

Learners. Amsterdam: John Benjamins.

Aston, G. and Burnard, L. (1998) The BNC Handbook: Exploring the British National

Corpus with SARA. Edinburgh: Edinburgh University Press.

Bahns, J. (1993) ‘Lexical collocations: a contrastive view’. ELT Journal 47/1: 56-63.

Baker, M. (1993) ‘Corpus linguistics and translation studies: implications and

applications’. In M. Baker, G. Francis and E. Tognini-Bonelli (eds.) Text and

Technology: in Honour of John Sinclair. Amsterdam: Benjamins. 233-352.

Ball, F. (2001) ‘Using corpora in language testing’. Research Notes 6: 6-8.

Ball, F. (2002) ‘Developing wordlists for BEC’. Research Notes 8: 10-13.

Ball, F. and Wilson, J. (2002) ‘Research projects relating to YLE Speaking Tests’.

Research Notes 7: 8-10.

Bernardini, S. (2000) Competence, Capacity, Corpora: A Study in Corpus-aided

Language Learning. Bologna: CLUEB.

Biber, D., Conrad, S. and Reppen, R. (1998) Corpus Linguistics: Investigating

Language Structure and Use. Cambridge: Cambridge University Press.

Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. (1999) Longman

Grammar of Spoken and Written English. London: Longman.

Biber, D., Leech, G. and Conrad, S. (2002) Longman Student Grammar of Spoken and

Written English. London: Longman.

Borin, L. and Prütz, K. (2004) ‘New wine in old skins? A corpus investigation of L1

syntactic transfer in learner language’. In G. Aston, S. Bernardini and D. Stewart

(eds.) Corpora and Language Learners. Amsterdam: John Benjamins. 67–87.

Braun, S. (2007) ‘Integrating corpus work into secondary education: From data-driven

learning to needs-driven corpora’. ReCALL 19(3): 307-328.

Braun, S., Kohn, K. and Mukherjee, J. (eds.) (2006) Corpus Technology and

Language Pedagogy. Frankfurt: Peter Lang.

Burnard, L. and McEnery, A. (eds.) (2000) Rethinking Language Pedagogy from a

Corpus Perspective. New York: Peter Lang.

Campoy, M., Gea-Valor, M. and Belles-Fortuno, B. (2010) Corpus-based

Approaches to English Language Teaching. London: Continuum.

Carter, R. and McCarthy, M. (1995) ‘Grammar and the spoken language’. Applied

Linguistics 16/2: 141-158.

Carter, R. and McCarthy, M. (2004) ‘Talking, creating: interactional language,

creativity, and context’. Applied Linguistics 25/1: 62-88.

Chambers, A. (2007) ‘Popularising corpus consultation by language learners and

teachers’. In E. Hidalgo, L. Quereda, and J. Santana (eds) Corpora in the Foreign

Language Classroom: Selected Papers from the Sixth International Conference on

Teaching and Language Corpora (TaLC 6), pp. 3–16. Amsterdam: Rodopi.

Choi, I., Kim, K. and Boo, J. (2003) ‘Comparability of a paper-based language test

and a computer-based language test’. Language Testing 20/3: 295–320.

Coniam, D. (1997) ‘A preliminary inquiry into using corpus word frequency data in

the automatic generation of English language cloze tests’. CALICO Journal 16/2-

4: 15-33.

Connor, U. and Upton, T. (eds) (2002) Applied Corpus Linguistics: A

Multidimensional Perspective. Amsterdam: Rodopi.

Conrad, S. (1999) ‘The importance of corpus-based research for language teachers’.

System 27: 1-18.

Conrad, S. (2000) ‘Will corpus linguistics revolutionize grammar teaching in the 21st

century?’. TESOL Quarterly 34: 548–60.

Cook, G. (1998) ‘The uses of reality: a reply to Ronald Carter.’ ELT Journal 52/1: 57-

64.

Cowie, A. (1994) ‘Phraseology’ in R. Asher (ed.) The Encyclopaedia of Language

and Linguistics Vol. 6, pp. 3168-3171. Oxford: Pergamon Press Ltd.

de Beaugrande, R. (2001) ‘Interpreting the discourse of H. G. Widdowson: a corpus-

based critical discourse analysis’. Applied Linguistics 22/1: 104-121.

Flowerdew, J. (1993) ‘Concordancing as a tool in course design’. System 21/3: 231-

243.

Fox, G. (1987) ‘The case for examples’ in J. Sinclair (ed.) Looking Up: An Account of

the COBUILD Project, pp. 137-149. London: HarperCollins.

Francis, G., Hunston, S. and Manning, E. (1996) Collins COBUILD Grammar Patterns

1: Verbs. London: HarperCollins.

Francis, G., Hunston, S. and Manning, E. (1998) Collins COBUILD Grammar Patterns

2: Nouns and Adjectives. London: HarperCollins.

Fries, C. (1945) Teaching and Learning English as a Foreign Language. Ann Arbor:

University of Michigan Press.

Gavioli, L. (2006) Exploring Corpora for ESP Learning. Amsterdam: John

Benjamins.

Gavioli, L. and Aston, G. (2001) ‘Enriching reality: language corpora in language

pedagogy’. ELT Journal 55/3: 238-246.

Gellerstam, M. (1986) ‘Translationese in Swedish novels translated from English’. In

L. Wollin and H. Lindquist (eds.) Translation Studies in Scandinavia. Lund:

CWK Gleerup. 88-95.

Gellerstam, M. (1996) ‘Translations as a source for cross-linguistic studies’. In K.

Aijmer, B. Altenberg and M. Johansson (eds.) Language in Contrast. Lund: Lund

University Press. 53-62.

Ghadessy, M., Henry, A. and Roseberry, R. (eds.) (2001) Small Corpus Studies and

ELT: Theory and Practice. Amsterdam: John Benjamins.

Gilquin, G. (2001) ‘The integrated contrastive model. Spicing up your data’.

Languages in Contrast 3(1): 95–123.

Goethals, M. (2003) ‘E.E.T.: the European English Teaching vocabulary-list’ in B.

Lewandowska-Tomaszczyk (ed.) Practical Applications in Language and

Computers, pp. 417-427. Frankfurt: Peter Lang.

Granger, S. (1976) ‘Why the passive?’. In J. Van Roey (ed.) English-French

Contrastive Analyses. Leuven: Acco. 23-57.

Granger, S. (1983) The Be + Past Participle Construction in Spoken English with

Special Emphasis on the Passive. Amsterdam: North-Holland.

Granger, S. (1996) ‘From CA to CIA and back: An integrated approach to

computerised bilingual and learner corpora’. In K. Aijmer, B. Altenberg and M.

Johansson (eds.) Language in Contrast. Lund: Lund University Press. 37-51.

Granger, S. (1998) ‘The computer learner corpus: a versatile new source of data for

SLA research’. In S. Granger (ed.) Learner English on Computer. London:

Longman. 3-18.

Granger, S. (2002) ‘A bird’s-eye view of learner corpus research’ in S. Granger, J.

Hung and S. Petch-Tyson (eds.) Computer Learner Corpora, Second Language

Acquisition and Foreign Language Teaching, pp. 3–33. Philadelphia: John

Benjamins.

Granger, S. (2003) ‘Practical applications of learner corpora’ in B. Lewandowska-

Tomaszczyk (ed.) Practical Applications in Language and Computers, pp. 291-

302. Frankfurt: Peter Lang.

Granger, S. (2007) ‘Sylviane Granger: Interview’. Mindbite 1.

Granger, S., Hung, J. and Petch-Tyson, S. (eds.) (2002) Computer Learner Corpora,

Second Language Acquisition, and Foreign Language Teaching. Philadelphia:

John Benjamins.

Granger, S. and Tyson, S. (1996) ‘Connector usage in the English essay writing of

native and non-native speakers of English’. World Englishes 15: 19-29.

Gui, S. and Yang, H. (2002) Zhongguo Xuexizhe Yingyu Yuliaoku (Chinese Learner

English Corpus). Shanghai: Shanghai Foreign Language Education Press.

Hartmann, R. (1995) ‘Contrastive textology’. Language and Communication 5: 25-

37.

Herbst, T. (1996) ‘What are collocations: sandy beaches or false teeth?’. English

Studies 04/1996: 379-393.

Hidalgo, E., Quereda, L. and Santana, J. (2007) Corpora in the Foreign Language

Classroom: Selected Papers from the Sixth International Conference on Teaching

and Language Corpora (TaLC 6). Amsterdam: Rodopi.

Higgins, J. and Johns, T. (1984) Computers in Language Learning. Oxford: Oxford

University Press.

Hinkel, E. (2004) ‘Tense, aspect and the passive voice in L1 and L2 academic texts’.

Language Teaching Research 8/1: 5-29.

Hoey, M. (2000) ‘A world beyond collocation: new perspectives on vocabulary

teaching’ in M. Lewis (ed.) Teaching Collocations, pp. 224-245. Hove: Language

Teaching Publications.

Hoey, M. (2004) ‘Lexical priming and the properties of text’. In A. Partington, J.

Morley and L. Haarman (eds.) Corpora and Discourse, pp. 385-412. Bern: Peter

Lang.

Horner, D. and Strutt, P. (2004) ‘Analyzing domain-specific lexical categories:

evidence from the BEC written corpus’. Research Notes 15: 6-8.

Hunston, S. (2002) Corpora in Applied Linguistics. Cambridge: Cambridge

University Press.

Hyland, K. (1999) ‘Talking to students: metadiscourse in introductory coursebooks’.

English for Specific Purposes 18/1: 3-26.

James, C. (1980) Contrastive Analysis. London: Longman.

Johansson, S. (2003) ‘Contrastive linguistics and corpora’. In S. Granger, J. Lerot and

S. Petch-Tyson (eds.) Corpus-Based Approaches to Contrastive Linguistics and

Translation Studies. Amsterdam: Rodopi. 31-44.

Johns, T. (1991) ‘“Should you be persuaded”: two samples of data-driven learning

materials’ in T. Johns and P. King (eds.) Classroom Concordancing ELR Journal

4. University of Birmingham.

Kaltenböck, G. and Mehlmauer-Larcher, B. (2005) ‘Computer corpora and the

language classroom: On the potential and limitations of computer corpora in

language teaching’. ReCALL 17: 65-84.

Karpati, I. (1995) Concordance in Language Learning and Teaching. Pecs:

University of Pecs.

Kaszubski, P. and Wojnowska, A. (2003) ‘Corpus-informed exercises for learners of

English: the TestBuilder program’ in E. Oleksy and B. Lewandowska-

Tomaszczyk (eds.) Research and Scholarship in Integration Processes: Poland -

USA – EU, pp. 337-354. Łódź: Łódź University Press.

Keck, C. (2004) ‘Corpus linguistics and language teaching research: bridging the

gap’. Language Teaching Research 8(1): 83-109.

Kennedy, G. (1998) An Introduction to Corpus Linguistics. London: Longman.

Kennedy, G. (2003) ‘Amplifier collocations in the British National Corpus:

implications for English language teaching’. TESOL Quarterly 37/3: 467-487.

Kenny, D. (1998) ‘Creatures of habit? What translators usually do with words?’. Meta

43(4).

Kettemann, B. (1995) ‘On the use of concordancing in ELT’. TELL&CALL 4: 4-15.

Kettemann, B. (1996) ‘Concordancing in English Language Teaching’ in S. Botley, J.

Glass, A. McEnery and A. Wilson (eds.) Proceedings of Teaching and Language

Corpora, pp. 4-16. Lancaster University.

Kettemann, B. and Marko, G. (2002) Teaching and Learning by Doing Corpus

Analysis. Amsterdam: Rodopi.

Kettemann, B. and Marko, G. (eds) (2006) Planning, Gluing and Painting Corpora:

Inside the Applied Corpus Linguist’s Workshop. Frankfurt: Peter Lang.

Kita, K. and Ogata, H. (1997) ‘Collocations in language learning: corpus-based

automatic compilation of collocations and bilingual collocation concordancer’.

Computer Assisted Language Learning 10/3: 229-238.

Kjellmer, G. (1991) ‘A mint of phrases’ in K. Aijmer and B. Altenberg (eds.) English

Corpus Linguistics: Studies in Honour of Jan Svartvik. London: Longman.

Koester, A. (2002) ‘The performance of speech acts in workplace conversations and

the teaching of communicative functions’. System 30: 167-184.

Lado, R. (1957) Linguistics across Cultures: Applied Linguistics for Language

Teachers. Ann Arbor: University of Michigan Press.

Laviosa, S. (1997) ‘How comparable can “comparable corpora” be?’. Target 9: 289-

319.

Laviosa, S. (1998) ‘Core patterns of lexical use in a comparable corpus of English

narrative prose’. Meta 43(4).

Leech, G. (1997) ‘Teaching and language corpora: a convergence’ in A. Wichmann,

S. Fligelstone, A. McEnery and G. Knowles (eds.) Teaching and Language

Corpora, pp. 1-23. London: Longman.

Lewis, M. (1993) The Lexical Approach: The State of ELT and the Way Forward.

Hove: Language Teaching Publications.

Lewis, M. (1997a) Implementing the Lexical Approach: Putting Theory into Practice.

Hove: Language Teaching Publications.

Lewis, M. (1997b) ‘Pedagogical implications of the lexical approach’ in J. Coady and

T. Huckin (eds.) Second Language Vocabulary Acquisition: A Rationale for

Pedagogy, pp. 255-270. Cambridge: Cambridge University Press.

Lewis, M. (ed.) (2000) Teaching Collocation: Further Developments in the Lexical

Approach. Hove: Language Teaching Publications.

Lü, S. and Zhu, D. (1979) Yufa Xiuci Jianghua (Talks on Grammar and Rhetoric).

Beijing: Chinese Youth Press.

Mauranen, A. (2002) ‘Will “translationese” ruin a contrastive study?’. Languages in

Contrast 2(2): 161-186.

McAlpine, J. and Myles, J. (2003) ‘Capturing phraseology in an online dictionary for

advanced users of English as a second language: a response to user needs’. System

31: 71-84.

McCarthy, M., McCarten, J. and Sandiford, H. (2005-2006) Touchstone (Books 1-4).

Cambridge: Cambridge University Press.

McEnery, A. and Wilson, A. (2001) Corpus Linguistics (1st ed. 1996). Edinburgh:

Edinburgh University Press.

McEnery, A. and Xiao, R. (2002) ‘Domains, text types, aspect marking and English-

Chinese translation’. Languages in Contrast 2(2): 211-229.

McEnery, A. and Xiao, R. (2005) ‘Help or help to: What do corpora have to say?’

English Studies 86(2): 161-187.

McEnery, A. and Xiao, R. (2007) ‘Parallel and comparable corpora: What is

happening?’. In M. Rogers and G. Anderman (eds.) Incorporating Corpora: The

Linguist and the Translator, pp. 18-31. Clevedon: Multilingual Matters.

McEnery, A. and Xiao, R. (2008) CALLHOME Mandarin Chinese Transcripts - XML

version. Pennsylvania: Linguistic Data Consortium.

McEnery, A., Xiao, R. and Mo, L. (2003) ‘Aspect marking in English and Chinese:

using the Lancaster Corpus of Mandarin Chinese for contrastive language study’.

Literary and Linguistic Computing 18(4): 361-378.

McEnery, T., Xiao, R. and Tono, Y. (2006) Corpus-Based Language Studies: An

Advanced Resource Book. London: Routledge.

Meunier, F. (2002) ‘The pedagogical value of native and learner corpora in EFL

grammar teaching’ in S. Granger, J. Hung and S. Petch-Tyson (eds.) Computer

Learner Corpora, Second Language Acquisition and Foreign Language Teaching,

pp. 119–142. Philadelphia: John Benjamins.

Mindt, D. (1996) ‘English corpus linguistics and the foreign language teaching

syllabus’ in J. Thomas and M. Short (eds.) Using Corpora for Language

Research, pp. 232-247. London: Longman.

Mishan, F. (2005) Designing Authenticity into Language Learning Materials.

Chicago: Chicago University Press.

Mukherjee, J. and Rohrbach, J. (2006) ‘Rethinking applied corpus linguistics from a

language-pedagogical perspective: New departures in learner corpus research’. In

B. Kettemann and G. Marko (eds) Planning, Gluing and Painting Corpora: Inside

the Applied Corpus Linguist’s Workshop, pp. 205-232. Frankfurt: Peter Lang.

Murison-Bowie, S. (1996) ‘Linguistic corpora and language teaching’. Annual Review

of Applied Linguistics 16: 182-199.

Myles, F. (2005) ‘Interlanguage corpora and second language acquisition research’.

Second Language Research 21(4): 373-391.

Nelson, G. (1996) ‘The design of the corpus’. In S. Greenbaum (ed.) Comparing

English Worldwide: The International Corpus of English, pp. 27-35. Oxford:

Clarendon Press.

Nelson, M. (2000) A Corpus-Based Study of Business English and Business English

Teaching Materials. PhD thesis, the University of Manchester, Manchester.

Available at http://users.utu.fi/micnel/thesis.html.

Nesselhauf, N. (2003) ‘The use of collocations by advanced learners of English and

some implications for teaching’. Applied Linguistics 24/2: 223-242.

Nesselhauf, N. (2005) Collocations in a Learner Corpus. Amsterdam: John

Benjamins.

O’Keeffe, A. and Farr, F. (2003) ‘Using language corpora in initial teacher education:

pedagogic issues and practical applications’. TESOL Quarterly 37/3: 389-418.

O’Keeffe, A., McCarthy, M. and Carter, R. (2007) From Corpus to Classroom:

Language Use and Language Teaching. Cambridge: Cambridge University Press.

Osborne, J. (2001) ‘Integrating corpora into a language-learning syllabus’ in B.

Lewandowska-Tomaszczyk (ed.) PALC 2001: Practical Applications in Language

Corpora, pp. 479-492. Frankfurt: Peter Lang.

Osborne, J. (2002) ‘Top-down and bottom-up approaches to corpora in language

teaching’. In U. Connor and T. Upton (eds) Applied Corpus Linguistics: A

Multidimensional Perspective, pp. 251-265. Amsterdam: Rodopi.

Øverås, S. (1998) ‘In search of the third code: an investigation of norms in literary

translation’. Meta 43(4).

Partington, A. (1998) Patterns and Meanings: Using Corpora for English Language

Research and Teaching. Amsterdam: John Benjamins.

Pravec, N. (2002) ‘Survey of learner corpora’. ICAME Journal 26: 81-114.

Renouf, A. (1987) ‘Moving on’ in J. Sinclair (ed.) Looking Up: An Account of the

COBUILD Project. London: HarperCollins.

Römer, U. (2005) Progressives, Patterns, Pedagogy: A Corpus-driven Approach to

English Progressive Forms, Functions, Contexts and Didactics. Amsterdam: John

Benjamins.

Römer, U. (2008) ‘Corpora and language teaching’. In A. Lüdeling and M. Kytö

(eds.) Corpus Linguistics: An International Handbook, pp. 112-131. Berlin:

Mouton de Gruyter.

Sajavaara, K. (1996) ‘New challenges for contrastive linguistics’. In K. Aijmer, B.

Altenberg and M. Johansson (eds.) Language in Contrast, pp. 17-36. Lund: Lund

University Press.

Salkie, R. (1999) ‘How can linguists profit from parallel corpora?’. Paper given at the

Symposium on Parallel Corpora. 22-23 April 1999, University of Uppsala.

Santos, D. (1996) Tense and Aspect in English and Portuguese: A Contrastive

Semantical Study. PhD thesis. Universidade Técnica de Lisboa.

Scott, M. and Tribble, C. (2006) Textual Patterns: Key Words and Corpus Analysis

in Language Education. Amsterdam: John Benjamins.

Seidlhofer, B. (2000) ‘Operationalizing intertextuality: using learner corpora for

learning’ in L. Burnard and A. McEnery (eds.) Rethinking Language Pedagogy

from a Corpus Perspective, pp. 207–24. New York: Peter Lang.

Seidlhofer, B. (2002) ‘Pedagogy and local learner corpora: working with learning

driven data’ in S. Granger, J. Hung and S. Petch-Tyson (eds.) Computer Learner

Corpora, Second Language Acquisition and Foreign Language Teaching, pp.

213–234. Philadelphia: John Benjamins.

Shei, C. and Pain, H. (2000) ‘An ESL writer’s collocational aid’. Computer Assisted

Language Learning 13/2: 167-182.

Sinclair, J. (1990) Collins COBUILD English Grammar. London: HarperCollins.

Sinclair, J. (1992) Collins COBUILD English Usage. London: HarperCollins.

Sinclair, J. (2000) ‘Lexical grammar’. Naujoji Metodologija 24: 191-203.

Sinclair, J. (2003) Reading Concordances. London: Longman.

Sinclair, J. (ed.) (2004) How to Use Corpora in Language Teaching. Amsterdam:

John Benjamins.

Sinclair, J., Bullon, S., Krishnamurthy, R., Manning, E. and Todd, J. (1990) Collins

COBUILD English Grammar. London: HarperCollins.

Sinclair, J. and Renouf, A. (1988) ‘A lexical syllabus for language learning’ in R.

Carter and M. McCarthy (eds.) Vocabulary and Language Teaching. London:

Longman.

Smadja, F. and McKeown, K. (1990) ‘Automatically extracting and representing

collocations for language generation’ in Proceedings of the 28th Annual Meeting of

the Association for Computational Linguistics, pp. 252-259.

Sripicharn, P. (2000) ‘Data-driven learning materials as a way to teach lexis in

context’ in C. Heffer, H. Sauntson and G. Fox (eds.) Words in Context: A tribute

to John Sinclair on his Retirement. Birmingham: University of Birmingham.

Stubbs, M. (2001) ‘Texts, corpora, and problems of interpretation: a response to

Widdowson’. Applied Linguistics 22(2): 149-172.

Tan, M. (2002) Corpus Studies in Language Education. Bangkok: IELE Press.

Taylor, L. (2003) ‘The Cambridge approach to speaking assessment’. Research Notes

13: 2-4.

Teubert, W. (1996) ‘Comparable or parallel corpora?’. International Journal of

Lexicography 9(3): 238-264.

Thompson, P. and Tribble, C. (2001) ‘Looking at citations: using corpora in English

for academic purposes’. Language Learning & Technology 5/3: 91-105.

Thurstun, J. and Candlin, C. (1997) Exploring Academic English: A Workbook for

Student Essay Writing. Sydney: NCELTR.

Thurstun, J. and Candlin, C. (1998) ‘Concordancing and the teaching of the

vocabulary of academic English’. English for Specific Purposes 17: 267-280.

Tribble, C. (1991) ‘Concordancing and an EAP writing program’. CAELL Journal

1/2: 10-15.

Tribble, C. (1997a) ‘Corpora, concordances and ELT’ in T. Boswood (ed.) New Ways

of Using Computers in Language Teaching. Alexandria VA: TESOL.

Tribble, C. (1997b) ‘Improving corpora for ELT: quick and dirty ways of developing

corpora for language teaching’ in B. Lewandowska-Tomaszczyk and P. Melia (eds.)

Practical Applications in Language Corpora – Proceedings of PALC ’97, pp.

107-117. Łódź: Łódź University Press.

Tribble, C. (2000) ‘Practical uses for language corpora in ELT’ in P. Brett and G.

Motteram (eds.) A Special Interest in Computers: Learning and Teaching with

Information and Communications Technologies, pp. 31-41. Kent: IATEFL.

Tribble, C. (2003) ‘The text, the whole text…or why large published corpora aren’t

much use to language learners and teachers’ in B. Lewandowska-Tomaszczyk

(ed.) Practical Applications in Language and Computers, pp. 303-318. Frankfurt:

Peter Lang.

Tribble, C. and Jones, G. (1990) Concordances in the Classroom: A Resource Book

for Teachers. London: Longman.

Tribble, C. and Jones, G. (1997) Concordances in the Classroom: Using Corpora in

Language Education. Houston TX: Athelstan.

Upton, T. and Connor, U. (2001) ‘Using computerized corpus analysis to investigate

the textlinguistic discourse move of a genre’. English for Specific Purposes 20:

313-329.

Wang, L. (1984) Zhongguo Jufa Lilun (Syntactic Theories in China). Qingdao:

Shandong Education Press.

Wichmann, A. (1995) ‘Using concordances for the teaching of modern languages in

higher education’. Language Learning Journal 11: 61-63.

Wichmann, A. (1997) ‘General introduction’ in A. Wichmann, S. Fligelstone, A.

McEnery and G. Knowles (eds.) Teaching and Language Corpora, pp. xvi-xvii.

London: Longman.

Wichmann, A., Fligelstone, S., McEnery, A. and Knowles, G. (eds.) (1997) Teaching

and Language Corpora. London: Longman.

Widdowson, H. (1990) Aspects of Language Teaching. Oxford: Oxford University

Press.

Widdowson, H. (1991) ‘The description and prescription of language’ in J. Alatis

(ed.) Georgetown University Round Table on Languages and Linguistics 1991,

pp. 11-24. Washington, D.C.: Georgetown University Press.

Widdowson, H. (2000) ‘The limitations of linguistics applied’. Applied Linguistics

21/1: 3-25.

Widdowson, H. (2003) Defining Issues in English Language Teaching. Oxford:

Oxford University Press.

Willis, D. (1990) The Lexical Syllabus: A New Approach to Language Teaching.

London: HarperCollins.

Willis, J., Willis, D. and Davids, J. (1988-1989) Collins COBUILD English Course

(Parts 1-3). London: HarperCollins.

Woolls, D. (1998) ‘Multilingual parallel concordancing for pedagogical use’ in

Teaching and Language Corpora, pp. 222-227. Keble College, Oxford, 24-27

July 1998.

Xiao, R. (2003) ‘Use of parallel and comparable corpora in language study’. English

Education in China 2003(1).

Xiao, R. (2008) ‘Well-known and influential corpora’. In A. Lüdeling and M. Kytö

(eds) Corpus Linguistics: An International Handbook, pp. 383-457. Berlin:

Mouton de Gruyter.

Xiao, R., He, L. and Yue, M. (forthcoming) ‘In pursuit of the third code: Using the

ZJU Corpus of Translational Chinese in Translation Studies’. In R. Xiao (ed.)

Using Corpora in Contrastive and Translation Studies. Newcastle upon Tyne:

Cambridge Scholars Publishing.

Xiao, R., McEnery, T. and Qian, Y. (2006) ‘Passive constructions in English and

Chinese: A corpus-based contrastive study’. Languages in Contrast 6(1): 109-149.

Xiao, R. and Yue, M. (2009) ‘Using corpora in Translation Studies: The state of the

art’. In P. Baker (ed.) Contemporary Corpus Linguistics. London: Continuum.

Yang, Y. and Allison, D. (2003) ‘Research articles in applied linguistics: moving

from results to conclusions’. English for Specific Purposes 22: 365-385.

Zhang, X. (1993) English Collocations and Their Effect on the Writing of Native and

Non-native College Freshmen. PhD thesis. Indiana University of Pennsylvania.
