Corpus linguistics and the application of new technologies
in the foreign language classroom:
Part 1: Corpus linguistics and
the foreign language classroom
Professor: Mick O’Donnell
1 Introduction to Corpus Analysis
Language corpora can be used to enhance the classroom experience in several ways. For the teacher,
corpora can be used to improve their understanding of classroom interactions, and of the language
learning process itself. For instance, teachers can collect transcripts of classroom sessions, and analyse
these to discover the recurrent patterns of interaction between teachers and students. Alternatively, a
corpus of student writings can be explored to identify what students do wrong, and thus target teaching
to their problem areas. For students, corpora can be used as part of the process of learning about
particular structures. (Corpus-aided discovery learning).
The first 3 weeks of this course will explore these aspects of corpus-use in language education. In this
first lecture, we will briefly revise what we mean by Corpus Linguistics, what a learner corpus is, and
some techniques used on Learner corpora.
Some of this first section repeats the material given for Advanced Research Methods. It is here for the
sake of completeness.
1.1 Corpora and Corpus Linguistics
A corpus is a set of texts, electronically stored. The plural of ‘corpus’ is ‘corpora’ (thus ‘a corpus’ vs.
‘several corpora’). Each text is usually associated with metadata (who wrote it, where it was taken from,
etc.). Segments of text can also be tagged (associated with labels of some kind), for instance, tagging
each word with its part-of-speech (POS).
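As a minimal sketch of how such a corpus text might be stored (the metadata fields and the POS tagset here are illustrative, not a standard):

```python
# A hypothetical representation of one corpus text: the metadata fields
# and the POS tagset are illustrative, not a standard.
text_entry = {
    "metadata": {
        "author": "student_017",    # who wrote it
        "source": "essay, week 3",  # where it was taken from
        "L1": "Spanish",
        "L2": "English",
    },
    # Each word is tagged with its part-of-speech (POS).
    "tokens": [
        ("She", "PRP"), ("writes", "VBZ"), ("essays", "NNS"), (".", "."),
    ],
}

# The tags make the text searchable by category rather than by word:
# e.g., find every word tagged as a plural noun (NNS).
plural_nouns = [word for word, tag in text_entry["tokens"] if tag == "NNS"]
print(plural_nouns)  # ['essays']
```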
Text corpora are often written texts, but can also consist of transcriptions of spoken interactions (e.g.,
news reporting, casual conversations, etc.). Corpora are generally assumed to be text, but in some
instances, people talk of spoken corpora (audio recordings of people talking), or even video corpora.
Corpus linguistics is the study of language based on evidence from corpora, using corpus software to
extract patterns from the corpus (e.g., showing all occurrences of a word or phrase in the corpus, or
counting how often particular tags occur in one corpus compared to another). The principle here is to
base conclusions on what people actually do, rather than on the introspections of linguists.
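The kind of tag-counting comparison mentioned above can be sketched as follows (the two toy "corpora" are invented for illustration):

```python
from collections import Counter

def tag_frequencies(tagged_corpus):
    """Relative frequency of each POS tag in a list of (word, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_corpus)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

# Two toy "corpora" of (word, tag) pairs, invented purely for illustration.
corpus_a = [("the", "DT"), ("cat", "NN"), ("sat", "VBD"), ("down", "RP")]
corpus_b = [("run", "VB"), ("fast", "RB"), ("now", "RB"), ("dog", "NN")]

freq_a = tag_frequencies(corpus_a)
freq_b = tag_frequencies(corpus_b)
# Compare how often adverbs (RB) occur in each corpus.
print(freq_a.get("RB", 0.0), freq_b.get("RB", 0.0))  # 0.0 0.5
```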
1.2 What is a Learner Corpus?
A learner corpus is a corpus of texts (whether written or spoken) produced by foreign language learners.
Being written by non-natives, these texts need to be regarded both in terms of the native language of the
writer (L1), and the language in which the text is written (L2). In some cases, a learner corpus will be
limited to a single L1/L2 combination (e.g., Spanish learners of English). In other cases, a learner corpus
may contain a comprehensive range of L1s but a single L2: for instance, the ICLE corpus
(International Corpus of Learner English, Granger 2003) contains texts written by learners of English,
from 16 language backgrounds.
Learner corpora are useful for exploring various aspects of the learning process, for instance, how the
mother tongue of the learner affects the acquisition of the L2. One would expect, for example, that
Spanish learners of English would have trouble with pronouns in Subject position, since these are in
general optional in Spanish but obligatory in English. One can explore a learner corpus of a particular L1
to find the typical errors for students of that L1/L2 combination.
Learner corpora, ranked by level of learner proficiency, also allow a researcher to see at what point in
development particular language phenomena can best be addressed. For instance, even low-level
learners of English can generally produce a passive sentence, but relative clauses are not produced often,
or correctly, until higher levels of proficiency.
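Tallying such constructions by proficiency band can be sketched as below; the observations and the level labels are invented, standing in for annotations extracted from a real proficiency-ranked learner corpus:

```python
from collections import defaultdict

# Invented observations of (proficiency level, construction), standing in
# for annotations extracted from a proficiency-ranked learner corpus.
observations = [
    ("A2", "passive"), ("A2", "passive"),
    ("B1", "passive"), ("B1", "relative-clause"),
    ("C1", "relative-clause"), ("C1", "relative-clause"),
]

# Tally how often each construction appears at each level.
by_level = defaultdict(lambda: defaultdict(int))
for level, construction in observations:
    by_level[level][construction] += 1

print(by_level["A2"]["passive"], by_level["C1"]["relative-clause"])  # 2 2
```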
1.3 Tools for Corpus Linguistics
The most typical software for corpus analysis is a concordancer. These tools extract instances of
words or tags (or sequences of words/tags), and present them to the user. Figure 1 below shows a
concordance for the word “however”. This concordance is shown in ‘KWIC’ format (‘Key Word In
Context’).
Figure 1: An example Concordance
By showing all occurrences of a word in context, a concordance lets one explore the grammatical contexts in which
the word is appropriate, or the different meanings it can carry.
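The core KWIC routine of a concordancer can be sketched in a few lines of Python. This is a toy version; real concordancers add corpus-wide indexing, tag search, and sorting:

```python
import re

def kwic(text, keyword, width=30):
    """Return 'Key Word In Context' lines for every occurrence of keyword.

    A minimal sketch of what a concordancer does with a single text.
    """
    lines = []
    for m in re.finditer(r"\b%s\b" % re.escape(keyword), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append("%s [%s] %s" % (left.rjust(width), m.group(), right.ljust(width)))
    return lines

sample = ("The results were clear. However, the method was slow. "
          "The costs, however, kept rising.")
for line in kwic(sample, "however"):
    print(line)
```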
Tools for concordancing your own texts:
• MicroConcord: http://www.lexically.net/downloads/_freebies/mconcord1.zip
• Wordsmith Tools (http://www.lexically.net/wordsmith/): may have automatic POS tagging. Not
free.
• Concordance: http://www.concordancesoftware.co.uk/
• UAM CorpusTool (free, automatic POS tagging for English):
http://www.wagsoft.com/CorpusTool/
The other main type of tool for corpus work is a corpus annotation tool. An annotation tool is used to
assign tags to segments of the corpus, so that the user can later search for occurrences of these tags (or
count them). See section 1.5 below for more details.
1.4 Collecting Learner Corpora
To get learner corpora, you can either use an existing one or make your own.
1.4.1 Using an existing learner corpus
There are numerous learner corpora available to researchers. Some are free, but usually there is a cost
involved.
Within the university, there are a number of learner corpora available (conditions may apply to use
them):
a. The WriCLE corpus: Essays written by Spanish learners of English, at 1st and 3rd year in the
English degree. Each essay is provided with metadata detailing proficiency in English, gender,
academic year, language background, etc. Contact Paul Rollinson or Amaya Mendikoetxea.
(700,000 words)
b. UAM Corpus de Interlenguas Escritas: A corpus of texts written by Spanish learners of English
of different genres, ranging from last year of High School to postgraduate. Contact Ana
Martin or Rachel Whittaker.
c. UAM Learner English Spoken Corpus (UAMLESC): a corpus of spoken classroom interactions
in English at Spanish schools, ranging from 5 to 12 year olds, in a range of language
environments (from almost no English curriculum to full curriculum in English). Some degree
of transcription already available. Contact Ana Llinares or Jesus Romero.
http://www.uam.es/departamentos/filoyletras/filoinglesa/bin/docs/investigacion/Corpus%20UAMLESC.pdf
d. AICLE-ORES (Aprendizaje Integrado de Contenido y Lengua Extranjera - Oral y Escrito): A set
of students’ oral and written productions in EFL in Secondary School CLIL classes where the
area of Social Sciences is taught through the medium of English. Contact: Ana Llinares.
In other places, the following learner corpora may be of interest:
a. The ICLE Corpus: (Granger 2003): The International Corpus of Learner English: contains texts
written by learners of English from 16 language backgrounds. The best known of the learner
corpora.
b. Longmans Learners’ Corpora: a 10 million word corpus of texts written by English Learners
from around the world. http://www.pearsonlongman.com/dictionaries/corpus/learners.html
c. Cambridge Learner Corpus (CLC): a large collection of exam scripts written by students
taking Cambridge ESOL English exams around the world. Currently over 135,000 scripts.
For a survey of learner corpora up to 2000 or so, see http://icame.uib.no/ij26/pravec.pdf
1.4.2 Making your own Learner Corpora
You can construct your own learner corpora by collecting written texts from your own students (or from
those of a teacher you know). See the notes from the Advanced Research Methods course on how to
construct a balanced corpus:
http://web.uam.es/departamentos/filoyletras/filoinglesa/Courses/ARM2010/ARM-Corpus-2010.pdf
Spoken Learner Corpora: If you are interested in spoken interaction, you will need to record the
interaction. Rather than work with the audio or video files that you record, many researchers transcribe
the speech into text, and then use text-based corpus tools to annotate the transcription. There are
however some tools which work directly with speech or video, and also some tools which are specifically
designed to work with speech transcripts.
• EXMARaLDA: tools for the computer-assisted transcription and annotation of spoken language, and for
the construction and analysis of spoken language corpora.
http://www.exmaralda.org/en_index.html
• Folker: A transcription annotation tool with links back to the audio files. See:
http://www.icca10.org/program/abstract/id/569
• Transcriber: A tool for segmenting, labelling and transcribing speech:
http://trans.sourceforge.net/en/presentation.php
• TASX: a general framework for creating and managing corpora of audio and video data.
• Anvil: video and audio annotation software
http://www.anvil-software.de/
Most commonly these systems have built-in annotation specifications, e.g., dialogue structure and
speech-act tagsets.
1.5 Annotating Learner Corpora
Assuming you have a written corpus, you may want to annotate the text, so that you can perform studies
on the corpus.
At a minimum, this means providing metadata for each text, e.g., information about the writer of the
text (age, proficiency, language background, etc.), the genre and register of the text (e.g., is it a recount,
narrative, blog entry, etc.).
You may also be interested in annotating segments of the text, e.g., marking up generic stages of the text
(e.g., stages of narratives such as Orientation, Complication, Resolution, Coda), speech acts (e.g.,
question, answer, statement, etc.), speaker turns (teacher/student, etc.).
For studies of grammar, you might make use of automatic tagging of part of speech (POS) or syntactic
structure. Consult Mick if you need to do this.
In the current context, error annotation may be of interest. Using an annotation tool, one manually tags
stretches of text which exhibit lexical, syntactic or pragmatic errors.
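Error annotations of this kind are often stored "stand-off", as labelled character spans over the unmodified learner text. A minimal sketch, with a purely illustrative error tagset:

```python
# The learner sentence, kept unmodified.
text = "She go to school every days."

# Stand-off error annotations: (start, end, error_tag), where start/end
# are character offsets into the text. The tagset is purely illustrative.
errors = [
    (4, 6, "verb-agreement"),    # "go" should be "goes"
    (23, 27, "noun-number"),     # "days" should be "day"
]

# Retrieve the erroneous stretches of text from the offsets.
for start, end, tag in errors:
    print(tag, "->", text[start:end])
```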
There are various alternatives for manual annotation of text:
• UAM CorpusTool offers manual annotation of text corpora, search facilities and statistical
reports on annotations. Also some forms of automatic annotation.
http://www.wagsoft.com/CorpusTool/
• SACODEYL: Another text annotator from University of Murcia.
http://wiki.tei-c.org/index.php/Sacodeyl_Annotator
• Knowtator: Reasonable text annotator.
http://knowtator.sourceforge.net/
• MMAX-2: text annotation of single files. Not much you can do with the annotated files.
http://mmax2.sourceforge.net/
• Gate: text processing tool, which amongst other tasks allows text annotation. Too complex to
use without an introductory class or two.
http://gate.ac.uk/
2 Corpus use for the learner
This section will focus on ways in which students can use a corpus to improve their language skills, or
ways in which a teacher can use corpora to prepare teaching materials for the students.
2.1 Data-Driven Learning (Discovery Learning)
Corpora can be used as part of the process of learning about particular words and structures. Tim Johns,
the author of one of the first concordance programs for PCs, promoted the use of concordance
software to aid the language learner (Johns 1988, 1991).
Traditional teaching is teacher-led: the teacher presents grammatical rules or patterns to the student and
later tests whether the student has acquired those rules. Johns proposed that a degree of independence
can be given to the language learner by providing them with a “computer assistant”, to answer whatever
questions the learner comes up with during their reading or writing.
He proposes that a concordancer can act as a viable assistant to the learner: by allowing the learner to
quickly encounter a range of real-text examples of a linguistic phenomenon, the learner can generalise
over the examples and derive their own usage rules.
According to Payne (2008), quoted by Koutropoulos (2009), the traditional pedagogical approach to
teaching grammar is a three-step process:
1. the teacher presents information to the student,
2. the learner practices with this information,
3. the learner produces new content.
In contrast, in a data-driven approach, the learner:
1. observes a grammatical phenomenon of the language,
2. hypothesizes as to how this grammatical phenomenon works, and then
3. experiments to see if their hypothesis is correct.
Data-Driven Learning is sometimes called “Corpus-aided Discovery Learning”, particularly in the work of
Silvia Bernardini (2000, 2002). She promotes the use of concordancing within the classroom as a teaching
technique, the basic concept being that students learn most effectively through their own discovery of
language patterns. The teacher sets the student a task, for instance, to explore the various patterns of
relative clauses in a native corpus, and gets them to formulate the usage rules for this form. For this, the
student will be put in front of a concordance program on a computer, and asked to do the search, with
some assistance from the teacher. The idea is that eventually students will be able to formulate their
own explorations of whatever language phenomenon they are working on at any moment.
Exercise: Using the BYU/BNC concordancer, explore the question whether “neither of them” is
singular or plural.
1. Connect to: http://corpus.byu.edu/bnc/
2. Enter search query: neither of them [vb*]
3. Examine the results and reach a conclusion.
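Step 3 of the exercise can be sketched offline: given concordance lines for "neither of them", count which verb form follows. The concordance lines and the verb lists below are invented for illustration, standing in for real BNC results:

```python
import re

# Invented concordance lines, standing in for real BNC results.
lines = [
    "and neither of them was ready to admit it",
    "but neither of them were aware of the change",
    "so neither of them is likely to object",
    "yet neither of them have replied so far",
    "neither of them was at home",
]

# Small, illustrative sets of singular vs. plural verb forms.
singular = {"is", "was", "has", "does"}
plural = {"are", "were", "have", "do"}

counts = {"singular": 0, "plural": 0}
for line in lines:
    m = re.search(r"neither of them (\w+)", line)
    if m:
        verb = m.group(1).lower()
        if verb in singular:
            counts["singular"] += 1
        elif verb in plural:
            counts["plural"] += 1

print(counts)  # {'singular': 3, 'plural': 2}
```

In real BNC data the split is similarly mixed, which is exactly the kind of conclusion the exercise aims to elicit: usage, not a single prescriptive rule.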
2.1.1 Data-Driven Vocabulary Learning
When a language learner encounters a new word in a text they are reading, often the context within the
text is not sufficient for the learner to identify the meaning of the word. While they can use a dictionary
to look up the meaning, use of an online concordancer provides an alternative resource, encouraging the
student to engage with real texts to discover the meaning.
Exercise: Provide the students with an advanced text, which contains vocabulary they might not be
familiar with. Ask them to read the text, and when they encounter a word they are not sure of, get
them to:
1. Write down what they think the word means.
2. Search for the word in a concordancer (e.g., the BNC online).
3. Examine the various contexts of use to determine the range of meanings of the word.
4. Write down the various meanings of the word they encounter.
5. Identify which meaning they think was intended in their text.
2.1.2 Data-Driven Learning without a computer?
Alex Boulton (2010) makes the point that:
“One of the most apparent obstacles to DDL is the use of the technology itself – the computer with its
query software and interfaces for accessing electronic corpora – which has repeatedly been found to
pose substantial problems for many learners as well as teachers. Where this is the case, the obvious
question is whether the computer can be successfully removed from the equation without losing the
benefits of the overall approach.” (Boulton 2010, Abstract)
A lot of class time is consumed by moving a class to a computer lab and getting the students up to speed
on using the equipment. The pace of the class is usually determined by the slowest student, so the better
students get frustrated.
His solution is to hand out printed pages of a concordance to the students, and get them to follow the data-
driven approach from the paper. One possibility is that each student has the complete concordance, and
the students work individually, or in pairs. Another possibility is that each student is given one page of
the concordance, they work out the grammatical rules based on their own page, and then the students
come together to integrate their hypotheses into an overall answer.
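Preparing such handouts amounts to chunking concordance lines into fixed-size pages, which can be sketched as follows (the page size and placeholder lines are arbitrary):

```python
def paginate(lines, per_page=25):
    """Split concordance lines into fixed-size pages for printed handouts."""
    return [lines[i:i + per_page] for i in range(0, len(lines), per_page)]

# 60 placeholder concordance lines, standing in for real KWIC output.
concordance = ["line %d" % i for i in range(60)]
pages = paginate(concordance, per_page=25)
print(len(pages))      # 3
print(len(pages[-1]))  # 10
```

For the second arrangement Boulton describes, each student would simply receive one element of `pages`.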
This is an interesting approach, particularly if your classroom does not have the facilities for computer-
based classes, or if the time spent training the students to use the software would cut substantially into
the total time available.
The benefit of data-driven learning is that the focus is on "the exploitation of authentic materials even
when dealing with tasks such as the acquisition of grammatical structures and lexical items [...], on real,
exploratory tasks and activities rather than traditional «drill & kill» exercises, [...] on learner-centred
activities," and on "the use and exploitation of tools rather than ready-made or off-the-shelf learnware."
(Rüschoff)
2.1.3 Evaluation of Data-Driven Learning
Rüschoff (20xx) argues that “the acquisition of language and linguistic competence as well as
language and language learning awareness can best be realised through tasks which
encourage the learner not to focus explicitly on the structure and the rules of the new
language. Learners will acquire the form of the foreign language because they are engaged
in exploring aspects of the target language on the basis of authentic content”.
One problem with DDL is that there are simply too many particular vocabulary items and syntactic
phenomena to explore. One DDL session can explore one phenomenon, and even 50 such sessions would
barely touch the learning needs of the average advanced language learner.
Boulton made the point that DDL sessions in a computer lab tend to waste lots of time on the sheer
mechanics of getting students familiar with the software. Not all teachers will have class hours sufficient
to spend on such things. So, DDL using printed concordances may be useful in this regard.
Critiques of DDL: see O’Keeffe et al. (2007, p. 24) for a list of articles critiquing DDL.