Corpus linguistics and the application of new technologies
in the foreign language classroom:
Part 1: Corpus linguistics and
the foreign language classroom
Professor: Mick O’Donnell
1 Introduction to Corpus Analysis
Language corpora can be used to enhance the classroom experience in several ways. For the teacher,
corpora can be used to improve their understanding of classroom interactions, and of the language
learning process itself. For instance, teachers can collect transcripts of classroom sessions, and analyse
these to discover the recurrent patterns of interaction between teachers and students. Alternatively, a
corpus of student writings can be explored to identify what students do wrong, and thus target teaching
to their problem areas. For students, corpora can be used as part of the process of learning about
particular structures. (Corpus-aided discovery learning).
The first 3 weeks of this course will explore these aspects of corpus-use in language education. In this
first lecture, we will briefly revise what we mean by Corpus Linguistics, what a learner corpus is, and
some techniques used on Learner corpora.
Some of this first section repeats the material given for Advanced Research Methods. It is here for the
sake of completeness.
1.1 Corpora and Corpus Linguistics
A corpus is a set of texts, electronically stored. The plural of ‘corpus’ is ‘corpora’ (thus ‘a corpus’ vs.
‘several corpora’). Each text is usually associated with metadata (who wrote it, where it was taken from,
etc.). Segments of text can also be tagged (associated with labels of some kind), for instance, tagging
each word with its part-of-speech (POS).
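As a minimal sketch of how such a corpus text might be stored (the metadata fields and the POS tagset here are illustrative, not a standard):

```python
# A hypothetical representation of one corpus text: the metadata fields
# and the POS tagset are illustrative, not a standard.
text_entry = {
    "metadata": {
        "author": "student_017",    # who wrote it
        "source": "essay, week 3",  # where it was taken from
        "L1": "Spanish",
        "L2": "English",
    },
    # Each word is tagged with its part-of-speech (POS).
    "tokens": [
        ("She", "PRP"), ("writes", "VBZ"), ("essays", "NNS"), (".", "."),
    ],
}

# The tags make the text searchable by category rather than by word:
# e.g., find every word tagged as a plural noun (NNS).
plural_nouns = [word for word, tag in text_entry["tokens"] if tag == "NNS"]
print(plural_nouns)  # ['essays']
```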
Text corpora are often written texts, but can also consist of transcriptions of spoken interactions (e.g.,
news reporting, casual conversations, etc.). Corpora are generally assumed to be text, but in some
instances, people talk of spoken corpora (audio recordings of people talking), or even video corpora.
Corpus linguistics is the study of language based on evidence from corpora, using corpus software to
extract patterns from the corpus (e.g., showing all occurrences of a word or phrase in the corpus, or
counting how often particular tags occur in one corpus compared to another). The principle here is to
base conclusions on what people actually do, rather than on the introspections of linguists.
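The kind of tag-counting comparison mentioned above can be sketched as follows (the two toy "corpora" are invented for illustration):

```python
from collections import Counter

def tag_frequencies(tagged_corpus):
    """Relative frequency of each POS tag in a list of (word, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_corpus)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

# Two toy "corpora" of (word, tag) pairs, invented purely for illustration.
corpus_a = [("the", "DT"), ("cat", "NN"), ("sat", "VBD"), ("down", "RP")]
corpus_b = [("run", "VB"), ("fast", "RB"), ("now", "RB"), ("dog", "NN")]

freq_a = tag_frequencies(corpus_a)
freq_b = tag_frequencies(corpus_b)
# Compare how often adverbs (RB) occur in each corpus.
print(freq_a.get("RB", 0.0), freq_b.get("RB", 0.0))  # 0.0 0.5
```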
1.2 What is a Learner Corpus?
A learner corpus is a corpus of texts (whether written or spoken) produced by foreign language learners.
Being written by non-natives, these texts need to be regarded both in terms of the native language of the
writer (L1), and the language in which the text is written (L2). In some cases, a learner corpus will be
limited to a single L1/L2 combination (e.g., Spanish learners of English). In other cases, a learner corpus
may contain a comprehensive range of L1s but a single L2: for instance, the ICLE corpus
(International Corpus of Learner English, Granger 2003) contains texts written by learners of English,
from 16 language backgrounds.
Learner corpora are useful for exploring various aspects of the learning process, for instance, how the
mother tongue of the learner affects the acquisition of the L2. One would expect, for example, that
Spanish learners of English would have trouble with pronouns in Subject position, since these are in
general optional in Spanish but obligatory in English. One can explore a learner corpus of a particular L1
to find the typical errors for students of that L1/L2 combination.
Learner corpora, ranked by level of learner proficiency, also allow a researcher to see at what point in
development particular language phenomena can best be addressed. For instance, even low-level
learners of English can generally produce a passive sentence, but relative clauses are not produced often,
or correctly, until higher levels of proficiency.
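Tallying such constructions by proficiency band can be sketched as below; the observations and the level labels are invented, standing in for annotations extracted from a real proficiency-ranked learner corpus:

```python
from collections import defaultdict

# Invented observations of (proficiency level, construction), standing in
# for annotations extracted from a proficiency-ranked learner corpus.
observations = [
    ("A2", "passive"), ("A2", "passive"),
    ("B1", "passive"), ("B1", "relative-clause"),
    ("C1", "relative-clause"), ("C1", "relative-clause"),
]

# Tally how often each construction appears at each level.
by_level = defaultdict(lambda: defaultdict(int))
for level, construction in observations:
    by_level[level][construction] += 1

print(by_level["A2"]["passive"], by_level["C1"]["relative-clause"])  # 2 2
```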
1.3 Tools for Corpus Linguistics
The most typical software for corpus analysis is a concordancer. These tools extract instances of
words or tags (or sequences of words/tags), and present them to the user. Figure 1 below shows a
concordance for the word “however”. This concordance is shown in ‘KWIC’ format (‘Key Word In
Context’).
Figure 1: An example Concordance
By showing all occurrences of a word in context, a concordance lets one explore the grammatical contexts in which
the word is appropriate, or the different meanings it can carry.
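The core KWIC routine of a concordancer can be sketched in a few lines of Python. This is a toy version; real concordancers add corpus-wide indexing, tag search, and sorting:

```python
import re

def kwic(text, keyword, width=30):
    """Return 'Key Word In Context' lines for every occurrence of keyword.

    A minimal sketch of what a concordancer does with a single text.
    """
    lines = []
    for m in re.finditer(r"\b%s\b" % re.escape(keyword), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append("%s [%s] %s" % (left.rjust(width), m.group(), right.ljust(width)))
    return lines

sample = ("The results were clear. However, the method was slow. "
          "The costs, however, kept rising.")
for line in kwic(sample, "however"):
    print(line)
```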
Tools for concordancing your own texts:
• MicroConcord: http://www.lexically.net/downloads/_freebies/mconcord1.zip
• Wordsmith Tools (http://www.lexically.net/wordsmith/): may have automatic POS tagging. Not
free.
• Concordance: http://www.concordancesoftware.co.uk/
• UAM CorpusTool (free, automatic POS tagging for English):
http://www.wagsoft.com/CorpusTool/
The other main type of tool for corpus work is a corpus annotation tool. An annotation tool is used to
assign tags to segments of the corpus, so that the user can later search for occurrences of these tags (or
count them). See section 1.5 below for more details.
1.4 Collecting Learner Corpora
To get learner corpora, you can either use an existing one or make your own.
1.4.1 Using an existing learner corpus
There are numerous learner corpora available to researchers. Some are free, but usually there is a cost
involved.
Within the university, there are a number of learner corpora available (conditions may apply to use
them):
a. The WriCLE corpus: Essays written by Spanish learners of English, at 1st and 3rd year in the
English degree. Each essay is provided with metadata detailing proficiency in English, gender,
academic year, language background, etc. Contact Paul Rollinson or Amaya Mendikoetxea.
(700,000 words)
b. UAM Corpus de Interlenguas Escritas: A corpus of texts written by Spanish learners of English
of different genres, ranging from last year of High School to postgraduate. Contact Ana
Martin or Rachel Whittaker.
c. UAM Learner English Spoken Corpus (UAMLESC): a corpus of spoken classroom interactions
in English at Spanish schools, ranging from 5 to 12 year olds, in a range of language
environments (from almost no English curriculum to full curriculum in English). Some degree
of transcription already available. Contact Ana Llinares or Jesus Romero.
http://www.uam.es/departamentos/filoyletras/filoinglesa/bin/docs/investigacion/Corpus%20UAMLESC.pdf
d. AICLE-ORES (Aprendizaje Integrado de Contenido y Lengua Extranjera - Oral y Escrito): A set
of students’ oral and written productions in EFL in Secondary School CLIL classes where the
area of Social Sciences is taught through the medium of English. Contact: Ana Llinares.
In other places, the following learner corpora may be of interest:
a. The ICLE Corpus: (Granger 2003): The International Corpus of Learner English: contains texts
written by learners of English from 16 language backgrounds. The best known of the learner
corpora.
b. Longmans Learners’ Corpora: a 10 million word corpus of texts written by English Learners
from around the world. http://www.pearsonlongman.com/dictionaries/corpus/learners.html
c. Cambridge Learner Corpus (CLC): a large collection of exam scripts written by students
taking Cambridge ESOL English exams around the world. Currently over 135,000 scripts.
For a survey of learner corpora up to 2000 or so, see http://icame.uib.no/ij26/pravec.pdf
1.4.2 Making your own Learner Corpora
You can construct your own learner corpora by collecting written texts from your own students (or from
those of a teacher you know). See the notes from the Advanced Research Methods course on how to
construct a balanced corpus:
http://web.uam.es/departamentos/filoyletras/filoinglesa/Courses/ARM2010/ARM-Corpus-2010.pdf
Spoken Learner Corpora: If you are interested in spoken interaction, you will need to record the
interaction. Rather than work with the audio or video files that you record, many researchers transcribe
the speech into text, and then use text-based corpus tools to annotate the transcription. There are
however some tools which work directly with speech or video, and also some tools which are specifically
designed to work with speech transcripts.
• EXMARaLDA: tools for the computer-assisted transcription and annotation of spoken language, and for
the construction and analysis of spoken language corpora.
http://www.exmaralda.org/en_index.html
• Folker: A transcription annotation tool with links back to the audio files. See:
http://www.icca10.org/program/abstract/id/569
• Transcriber: A tool for segmenting, labelling and transcribing speech:
http://trans.sourceforge.net/en/presentation.php
• TASX: a general framework for creating and managing corpora of audio and video data.
• Anvil: video and audio annotation software
http://www.anvil-software.de/
Most commonly these systems have built-in annotation specifications, e.g., dialogue structure and
speech-act tagsets.
1.5 Annotating Learner Corpora
Assuming you have a written corpus, you may want to annotate the text, so that you can perform studies
on the corpus.
At a minimum, this means providing metadata for each text, e.g., information about the writer of the
text (age, proficiency, language background, etc.), the genre and register of the text (e.g., is it a recount,
narrative, blog entry, etc.).
You may also be interested in annotating segments of the text, e.g., marking up generic stages of the text
(e.g., stages of narratives such as Orientation, Complication, Resolution, Coda), speech acts (e.g.,
question, answer, statement, etc.), speaker turns (teacher/student, etc.).
For studies of grammar, you might make use of automatic tagging of part of speech (POS) or syntactic
structure. Consult Mick if you need to do this.
In the current context, error annotation may be of interest. Using an annotation tool, one manually tags
stretches of text which exhibit lexical, syntactic or pragmatic errors.
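Error annotations of this kind are often stored "stand-off", as labelled character spans over the unmodified learner text. A minimal sketch, with a purely illustrative error tagset:

```python
# The learner sentence, kept unmodified.
text = "She go to school every days."

# Stand-off error annotations: (start, end, error_tag), where start/end
# are character offsets into the text. The tagset is purely illustrative.
errors = [
    (4, 6, "verb-agreement"),    # "go" should be "goes"
    (23, 27, "noun-number"),     # "days" should be "day"
]

# Retrieve the erroneous stretches of text from the offsets.
for start, end, tag in errors:
    print(tag, "->", text[start:end])
```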
There are various alternatives for manual annotation of text:
• UAM CorpusTool offers manual annotation of text corpora, search facilities and statistical
reports on annotations. Also some forms of automatic annotation.
http://www.wagsoft.com/CorpusTool/
• SACODEYL: Another text annotator from University of Murcia.
http://wiki.tei-c.org/index.php/Sacodeyl_Annotator
• Knowtator: Reasonable text annotator.
http://knowtator.sourceforge.net/
• MMAX-2: text annotation of single files. Not much you can do with the annotated files.
http://mmax2.sourceforge.net/
• Gate: text processing tool, which amongst other tasks allows text annotation. Too complex to
use without an introductory class or two.
http://gate.ac.uk/
2 Corpus use for the learner
This section will focus on ways in which students can use a corpus to improve their language skills, or
ways in which a teacher can use corpora to prepare teaching materials for the students.
2.1 Data-Driven Learning (Discovery Learning)
Corpora can be used as part of the process of learning about particular words and structures. Tim Johns,
the author of one of the first concordance programs for PCs, promoted the use of concordance
software to aid the language learner (Johns 1988, 1991).
Traditional teaching is teacher-led: the teacher presents grammatical rules or patterns to the student and
later tests whether the student has acquired those rules. Johns proposed that a degree of independence
can be given to the language learner by providing them with a “computer assistant”, to answer whatever
questions the learner comes up with during their reading or writing.
He proposes that a concordancer can act as a viable assistant to the learner: by allowing the learner to
quickly encounter a range of real-text examples of a linguistic phenomenon, the learner can generalise
over the examples and derive their own usage rules.
According to Payne (2008), quoted by Koutropoulos (2009), the traditional pedagogical approach to
teaching grammar is a three-step process:
1. the teacher presents information to the student,
2. the learner practices with this information,
3. the learner produces new content.
In contrast, in a data-driven approach, the learner:
1. observes a grammatical phenomenon of the language,
2. hypothesizes as to how this grammatical phenomenon works, and then
3. experiments to see if their hypothesis is correct.
Data-Driven Learning is sometimes called “Corpus-aided Discovery Learning”, particularly in the work of
Silvia Bernardini (2000, 2002). She promotes the use of concordancing within the classroom as a teaching
technique, the basic concept being that students learn most effectively through their own discovery of
language patterns. The teacher sets the student a task, for instance, to explore the various patterns of
relative clauses in a native corpus, and gets them to formulate the usage rules for this form. For this, the
student will be put in front of a concordance program on a computer, and asked to do the search, with
some assistance from the teacher. The idea is that eventually students will be able to formulate their
own explorations of whatever language phenomenon they are working on at any moment.
Exercise: Using the BYU/BNC concordancer, explore the question whether “neither of them” is
singular or plural.
1. Connect to: http://corpus.byu.edu/bnc/
2. Enter search query: neither of them [vb*]
3. Examine the results and reach a conclusion.
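Step 3 of the exercise can be sketched offline: given concordance lines for "neither of them", count which verb form follows. The concordance lines and the verb lists below are invented for illustration, standing in for real BNC results:

```python
import re

# Invented concordance lines, standing in for real BNC results.
lines = [
    "and neither of them was ready to admit it",
    "but neither of them were aware of the change",
    "so neither of them is likely to object",
    "yet neither of them have replied so far",
    "neither of them was at home",
]

# Small, illustrative sets of singular vs. plural verb forms.
singular = {"is", "was", "has", "does"}
plural = {"are", "were", "have", "do"}

counts = {"singular": 0, "plural": 0}
for line in lines:
    m = re.search(r"neither of them (\w+)", line)
    if m:
        verb = m.group(1).lower()
        if verb in singular:
            counts["singular"] += 1
        elif verb in plural:
            counts["plural"] += 1

print(counts)  # {'singular': 3, 'plural': 2}
```

In real BNC data the split is similarly mixed, which is exactly the kind of conclusion the exercise aims to elicit: usage, not a single prescriptive rule.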
2.1.1 Data-Driven Vocabulary Learning
When a language learner encounters a new word in a text they are reading, often the context within the
text is not sufficient for the learner to identify the meaning of the word. While they can use a dictionary
to look up the meaning, use of an online concordancer provides an alternative resource, encouraging the
student to engage with real texts to discover the meaning.
Exercise: Provide the students with an advanced text, which contains vocabulary they might not be
familiar with. Ask them to read the text, and when they encounter a word they are not sure of, get
them to:
1. Write down what they think the word means.
2. Search for the word in a concordancer (e.g., the BNC online).
3. Examine the various contexts of use to determine the range of meanings of the word.
4. Write down the various meanings of the word they encounter.
5. Identify which meaning they think was intended in their text.
2.1.2 Data-Driven Learning without a computer?
Alex Boulton (2010) makes the point that:
“One of the most apparent obstacles to DDL is the use of the technology itself – the computer with its
query software and interfaces for accessing electronic corpora – which has repeatedly been found to
pose substantial problems for many learners as well as teachers. Where this is the case, the obvious
question is whether the computer can be successfully removed from the equation without losing the
benefits of the overall approach.” (Boulton 2010, Abstract)
A lot of class time is consumed by moving a class to a computer lab and getting the students up to speed
on using the equipment. The pace of the class is usually determined by the slowest student, so the better
students get frustrated.
His solution is to hand out printed pages of a concordance to the students, and get them to follow the data-
driven approach from the paper. One possibility is that each student has the complete concordance, and
the students work individually, or in pairs. Another possibility is that each student is given one page of
the concordance, they work out the grammatical rules based on their own page, and then the students
come together to integrate their hypotheses into an overall answer.
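Preparing such handouts amounts to chunking concordance lines into fixed-size pages, which can be sketched as follows (the page size and placeholder lines are arbitrary):

```python
def paginate(lines, per_page=25):
    """Split concordance lines into fixed-size pages for printed handouts."""
    return [lines[i:i + per_page] for i in range(0, len(lines), per_page)]

# 60 placeholder concordance lines, standing in for real KWIC output.
concordance = ["line %d" % i for i in range(60)]
pages = paginate(concordance, per_page=25)
print(len(pages))      # 3
print(len(pages[-1]))  # 10
```

For the second arrangement Boulton describes, each student would simply receive one element of `pages`.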
This is an interesting approach, particularly if your classroom does not have the facilities for computer-
based classes, or if the time spent training the students to use the software would cut substantially into
the total time available.
The benefit of data-driven learning is that the focus is on "the exploitation of authentic materials even
when dealing with tasks such as the acquisition of grammatical structures and lexical items [...], on real,
exploratory tasks and activities rather than traditional «drill & kill» exercises, [...] on learner-centred
activities," and on "the use and exploitation of tools rather than ready-made or off-the-shelf learnware."
(Rüschoff)
2.1.3 Evaluation of Data-Driven Learning
Rüschoff (20xx) argues that “the acquisition of language and linguistic competence as well as
language and language learning awareness can best be realised through tasks which
encourage the learner not to focus explicitly on the structure and the rules of the new
language. Learners will acquire the form of the foreign language because they are engaged
in exploring aspects of the target language on the basis of authentic content”.
One problem with DDL is that there are simply too many particular vocabulary items and syntactic
phenomena to explore. One DDL session can explore one phenomenon, and even 50 such sessions would
barely touch the learning needs of the average advanced language learner.
Boulton made the point that DDL sessions in a computer lab tend to waste lots of time on the sheer
mechanics of getting students familiar with the software. Not all teachers will have class hours sufficient
to spend on such things. So, DDL using printed concordances may be useful in this regard.
Critiques of DDL: see O’Keeffe et al. (2007, p. 24) for a list of articles critiquing DDL.