Parallel corpora and contrastive studies Hilde Hasselgård University of Oslo


Page 1: Parallel corpora and contrastive studies

Parallel corpora and contrastive studies

Hilde Hasselgård

University of Oslo

Page 2: Parallel corpora and contrastive studies

2 >

From monolingual to multilingual corpus linguistics

Corpus linguistics – the study of language by means of large(ish), structured databases of text compiled and prepared for use in linguistic research.

Largely developed within English linguistics, with the Brown corpus as the first (1960s), followed by the Lancaster-Oslo/Bergen (LOB) corpus.

Greatly facilitated the access to material.

Opened up new possibilities for quantitative studies & variation studies.

Parallel corpora: a more recent development (1990s), requiring new technology and new research methods.

Page 3: Parallel corpora and contrastive studies

3 >

Structure of talk

•Multilingual corpus linguistics

– Multilingual corpora

– The English-Norwegian Parallel Corpus

– Contrastive analysis

•The use of parallel corpora in contrastive studies

– The contribution of parallel corpora

– Methodology

•The Oslo Multilingual Corpus and the work of ”Språk i Kontrast” (Languages in Contrast) in Oslo

•Case study: two future-referring expressions

•Summing up

Page 4: Parallel corpora and contrastive studies

4 >

What is a parallel corpus?

Translation corpus: original texts with translations into one or more other languages

Comparable corpus: comparable original texts in different languages

Parallel corpus: bi-directional translation corpus

Page 5: Parallel corpora and contrastive studies

5 >

Translation corpus

A corpus that contains the ‘same’ texts in more than one language, in other words a corpus with both original and translated texts.

Original text(s)

Translation, language 1

(Translation, language 2)

(Translation, language 3)

Page 6: Parallel corpora and contrastive studies

6 >

Comparable corpus

A corpus that contains original texts in more than one language and where the texts in each language have been selected according to the same criteria (genre, content, publication date etc.)

Language 1: criterion A, criterion B, criterion C, criterion D

Language 2: criterion A, criterion B, criterion C, criterion D

Language 3: criterion A, criterion B, criterion C, criterion D

Page 7: Parallel corpora and contrastive studies

7 >

Parallel corpus (ENPC model)

Combination of translation and comparable corpus

The original texts are comparable (genre, number of words)

The translations go in both directions – a bidirectional translation corpus

Page 8: Parallel corpora and contrastive studies

8 >

The English-Norwegian Parallel Corpus (ENPC) – Some facts

Started as a research project at the Department of British and American Studies in 1994 and completed in 1997. Prof. Stig Johansson initiated and directed the project.

Original texts with translations (English-Norwegian and Norwegian-English)

Fiction and non-fiction

Compiled for use in applied and theoretical linguistic research

Development of software for alignment of the texts (Knut Hofland, UiB) and for searching the corpus (Jarle Ebeling, UiO)

Sister projects: The English-Swedish Parallel Corpus (Lund/Göteborg), English-Finnish Parallel Corpus (Jyväskylä/Savonlinna/Tampere) – same principle of compilation; to some extent also shared texts.

Other corpora built on the ENPC model in Germany (Chemnitz), France/Belgium (Poitiers/Louvain-la-Neuve: the PLECI corpus), Spain (University of León).

Page 9: Parallel corpora and contrastive studies

9 >

Contrastive analysis

Contrastive analysis is the systematic comparison of two or more languages, with the aim of describing their similarities and differences. (Johansson 2007: 1)

CA [contrastive analysis] is a linguistic enterprise aimed at producing inverted (i.e. contrastive, not comparative) two-valued typologies (a CA is always concerned with a pair of languages), and founded on the assumption that languages can be compared. (James 1980: 3)

Executing a CA involves two steps: description and comparison; and the steps are taken in that order. (James 1980: 63)

Page 10: Parallel corpora and contrastive studies

10 >

Contrastive analysis

A CA presupposes a tertium comparationis, i.e. a measure by which we can be fairly certain we are comparing like with like.

The items to be compared across languages are selected on the basis of perceived similarity (Chesterman 1998), such as translation equivalence, semantic/etymological similarity, grammatical or functional categories.

A frequently suggested tertium comparationis is translation equivalence (e.g. James 1980, Chesterman 1998); which implies that the items in the two languages convey (more or less) the same meaning.

Page 11: Parallel corpora and contrastive studies

11 >

What can multilingual corpora contribute?

They give insights into the languages compared – insights that are likely to be unnoticed in studies of monolingual corpora.

They can be used for a range of comparative purposes and increase our understanding of language-specific, typological and cultural differences, as well as of universal features.

They illuminate differences between source texts and translations, and between native and non-native texts.

They can be used for a number of practical applications, e.g. in lexicography, language teaching, and translation.

(Aijmer & Altenberg 1996: 12)

Page 12: Parallel corpora and contrastive studies

12 >

Other benefits of a parallel corpus such as the ENPC

Ready access to (relatively) large quantities of bilingual data

Sentence alignment

Comparable original and translated texts in both languages

Control for translation bias

In-built tertium comparationis through translation equivalence and text comparability

“the paired texts reveal the interlingual identifications made by translators” (Johansson 1999: 117)

Page 13: Parallel corpora and contrastive studies

13 >

Methodology: Classifying correspondences

Correspondence

•expressed

– congruent: same realisation type

– divergent: different realisation type

•zero

Example: English correspondences of imidlertid (‘however’) in ENPC

Alle "innrømmelsene" hadde imidlertid en pris. (GL1)
However, all these "concessions" had a price.

Det endte imidlertid godt: (…) (UD1)
But it ended well (…)

Reguleringstiltakene har imidlertid gitt resultater (…). (ABJH1)
The regulations have shown results (…).
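
The classification of correspondences can be sketched in a few lines of code. This is only an illustration, not part of the ENPC software: the `Pair` record, the `classify` function and the word-class labels given to the three imidlertid examples are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pair:
    """One source item and its correspondence (hypothetical record layout)."""
    source_item: str
    source_class: str
    target_item: Optional[str]    # None when nothing corresponds to the source item
    target_class: Optional[str]


def classify(pair: Pair) -> str:
    """Return 'zero', 'congruent' or 'divergent' for one correspondence."""
    if pair.target_item is None:
        return "zero"
    if pair.source_class == pair.target_class:
        return "congruent"        # same realisation type
    return "divergent"            # different realisation type


# The three correspondences of imidlertid shown above; word classes are assumed.
pairs = [
    Pair("imidlertid", "adverb", "however", "adverb"),
    Pair("imidlertid", "adverb", "but", "conjunction"),
    Pair("imidlertid", "adverb", None, None),
]

for p in pairs:
    print(p.target_item or "Ø", "->", classify(p))
```

In practice the realisation type has to be decided by the analyst; the equality test on word-class labels only stands in for that judgement.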

Page 14: Parallel corpora and contrastive studies

14 >

Paradigms of correspondences

Swedish translations of however

emellertid (51 = 47%)

men (‘but’) (36 = 33%)

dock (14 = 13%)

ändå (2)

däremot (1)

i alla fall (1)

Ø (4)

English translations of emellertid

however (83 = 81%)

but (3)

yet (3)

anyway (1)

Ø (13)

(Altenberg 1999)
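
A paradigm of correspondences is a frequency list with percentages. The sketch below recomputes the percentages from the counts on this slide; the function name `paradigm` and the dictionary layout are assumptions made for illustration.

```python
def paradigm(counts):
    """Rank correspondences by frequency and add rounded percentages."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(item, n, round(100 * n / total)) for item, n in ranked]


# Counts as given above (Altenberg 1999); "Ø" marks zero correspondence.
however_sv = {"emellertid": 51, "men": 36, "dock": 14, "ändå": 2,
              "däremot": 1, "i alla fall": 1, "Ø": 4}
emellertid_en = {"however": 83, "but": 3, "yet": 3, "anyway": 1, "Ø": 13}

for name, counts in [("however", however_sv), ("emellertid", emellertid_en)]:
    print(name, "total:", sum(counts.values()))
    for item, n, pct in paradigm(counts):
        print(f"  {item}: {n} ({pct}%)")
```

The totals (109 occurrences of however, 103 of emellertid) are the source-text frequencies that enter the mutual correspondence calculation.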

Page 15: Parallel corpora and contrastive studies

15 >

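Mutual correspondence (MC) (Altenberg 1999)

The frequency with which different (grammatical, semantic and lexical) expressions are translated into each other.

Calculated and expressed as a percentage by means of the formula

$$MC = \frac{(A_t + B_t) \times 100}{A_s + B_s}$$

where $A_t$ and $B_t$ are the number of times items A and B are translated by each other, and $A_s$ and $B_s$ are the total number of occurrences of A and B in the source texts.

The MC of however and emellertid in the ESPC is thus

$$\frac{(51 + 83) \times 100}{109 + 103} = 63.2$$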
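
The calculation is easy to reproduce. The following is a minimal sketch; the function and parameter names are chosen here for illustration and are not taken from Altenberg (1999).

```python
def mutual_correspondence(a_t, b_t, a_s, b_s):
    """Mutual correspondence as a percentage.

    a_t, b_t: times A is translated by B, and B by A
    a_s, b_s: total occurrences of A and of B in the source texts
    """
    if a_s + b_s == 0:
        raise ValueError("no source occurrences to compare")
    return (a_t + b_t) * 100 / (a_s + b_s)


# however / emellertid in the ESPC, with the counts given above
mc = mutual_correspondence(a_t=51, b_t=83, a_s=109, b_s=103)
print(f"{mc:.1f}")  # 63.2
```

A value of 100 would mean that the two items are always translated by each other; 0 would mean that they never are.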

Page 16: Parallel corpora and contrastive studies

16 >

Lexicogrammar

Paradigms of correspondence highlight the fuzzy borderlines between lexis and grammar and between grammar and discourse.

Example: A modal verb will have a wide range of correspondences

Norwegian kan (‘can’)

Valget av tidspunkt kan også inneholde et stenk av egoisme. (KH1)

Maybe his choice of timing also contained a touch of egotism.

Modal aux: can, could, may, might, ‘ll, will, would, should

Other verbs: know, enable, have, have to, had better

Adjectives: possible, able, capable

Adverbs: maybe, perhaps

Suffix: -able

(Løken 2007)

Page 17: Parallel corpora and contrastive studies

17 >

From ENPC to OMC under the SPRIK umbrella (SPRåk I Kontrast)

New languages have been added, first (mainly) German, then French

Focus on English – Norwegian – German in the first phase of the SPRIK-project: original texts in each language with translations into the other two.

Same principles for text selection, text sampling and preparation as for the ENPC (exception: even more biased towards fiction because of the lack of translated non-fiction)

Same (or later versions of same) software for alignment, searching etc.

Expanded search facilities and research possibilities:

– Three-way comparison of translations and originals

– Possibilities of investigating two different translations of the same text (translation strategies, translationese)

Page 18: Parallel corpora and contrastive studies

18 >

Current stock of multilingual corpora at Oslo

OMC:

Parallel corpora: English-Norwegian, French-Norwegian, German-Norwegian; three-way English-German-Norwegian.

Translation corpora: Norwegian – English – French – German, Norwegian – French – German, English-Dutch, English-Portuguese.

Multiple translations corpus (English-Norwegian)

Outside OMC:

Russian – English – Norwegian (RuN)

Multilingual corpora of historical texts (two projects)

Page 19: Parallel corpora and contrastive studies

19 >

Trilingual parallel corpus model

Page 20: Parallel corpora and contrastive studies

22 >

Searching in No-En-Fr-Ge

Jeg kommer til å si det til ham likevel.” (KF1)

Ich werde es ihm sowieso sagen.” (KF1TD)

I 'll tell him about it anyway.” (KF1TE)

De toute façon, je le lui dirai.” (KF1TF)

"You're going to have a book reissued … (BHH1TE)

Du skal få en bok trykt opp igjen ... (BHH1)

"Ein Buch von dir wird neu aufgelegt, ... (BHH1TD)

Un de tes livres va être réédité ... (BHH1TF)

Page 21: Parallel corpora and contrastive studies

23 >

Using the ENPC/OMC for research

Particularly well suited for studies of lexis / lexico-grammar (or phenomena that can take lexis as their starting point)

A broad range of phenomena have been (are being) investigated, e.g. the use of individual verbs (bli, få, take, give, see), modality, particular syntactic constructions, connectives, sentence openings and other discourse phenomena.

The methodology is not tied to any particular theoretical approach

A range of theoretical approaches, e.g. SFL, cognitive linguistics, pattern grammar, lexis-based approach à la Sinclair + traditional grammar / basic linguistic theory.

Page 22: Parallel corpora and contrastive studies

24 >

Limitations

(As with corpus linguistics in general:) you can only search for something that is explicit in the text

Restricted to texts / text types that have been translated

The size of the corpus restricts studies of less frequent lexical/ grammatical constructions

Faulty and less successful translations

The corpus has been word-class tagged, but not parsed (syntactically annotated), i.e. it is not possible to search for grammatical constructions, patterns of word order etc.

Tagging errors

Page 23: Parallel corpora and contrastive studies

25 >

Ways around the limitations?

Identify typical (and searchable!) expressions of a grammatical construction, e.g. presentatives, clefting, phrasal verbs, inversion.

Use a combination of word class tagging, filters and wildcards. Example: tense / aspect, participle clauses. (e.g. BE +Ving)

In any case – a lot of work involved in tidying up the search results (precision).

Possibility of searching with regular expressions

Errors in the tagging: Never possible to make sure that you have found all the relevant instances (recall).

Errors/idiosyncrasies in the translation: Weed out? Ignore translations that occur only once, or in only one text?

Manual searches in running text, e.g. for Theme, subjects.

Supplement results of parallel corpus study with (larger) monolingual corpora.

Supplement corpus study with e.g. experimental data.
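
A regular-expression search for a searchable expression of a construction might look like the sketch below. It is an illustration only and does not reproduce the ENPC/OMC search interface: the pattern, the function `find_hits` and the last test sentence (invented to show a false hit) are assumptions.

```python
import re

# Forms of BE, including clitics after a straight or curly apostrophe.
BE = r"(?:\b(?:am|is|are|was|were|be|been|being)|['’](?:m|s|re))"
# BE, one optional intervening word (inverted subject, "not"), "going to", one more word.
GOING_TO = re.compile(BE + r"\s+(?:\w+\s+)?going\s+to\s+\w+", re.IGNORECASE)


def find_hits(sentences):
    """Return (sentence, matched string) for every sentence containing a hit."""
    hits = []
    for sentence in sentences:
        match = GOING_TO.search(sentence)
        if match:
            hits.append((sentence, match.group(0)))
    return hits


sentences = [
    "I know what he’s going to say even before he says it.",   # (FW1)
    "Neither of them knew what was going to happen.",          # (TTH1T)
    "Are you going to run a hotel?",                           # (DL1)
    "What are we going to do, says Ruth",                      # (BV2T)
    "She is going to the shop.",   # invented: motion verb, not future reference
]

for sentence, matched in find_hits(sentences):
    print(f"{matched!r} in: {sentence}")
```

All five sentences are returned, including the invented one where going to is a motion verb followed by a noun phrase. Such false hits have to be weeded out manually (precision), while future uses with more than one intervening word are missed by this pattern (recall).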

Page 24: Parallel corpora and contrastive studies

26 >

Examples of studies based on ENPC / OMC / ESPC

Bengt Altenberg: Work on adverbial connectors, sentence openings, subject selection etc. in English and Swedish.

Karin Aijmer: Work on modality and discourse markers in English and Swedish.

Åke Viberg: Work on verbs of motion and cognition in English and Swedish.

Helge Dyvik: Translations as semantic mirrors; ENPC as basis for bilingual wordnet.

Jarle Ebeling (2000): Presentative constructions in English and Norwegian : a corpus-based contrastive study (PhD, University of Oslo)

Mats Johansson (2002) Clefts in English and Swedish: A contrastive study of IT-clefts and WH-clefts in original texts and translations. (PhD, Lund University)

Signe Oksefjell Ebeling (2003): The Norwegian verbs bli and få and their correspondences in English: a corpus-based contrastive study (PhD, University of Oslo)

Page 25: Parallel corpora and contrastive studies

27 >

Berit Løken: Beyond modals: A corpus-based study of English and Norwegian expressions of possibility (PhD, Oslo, 2007)

Lene Nordrum: English lexical nominalizations in a Norwegian-Swedish contrastive perspective. (PhD, Göteborg, 2007)

Wiebke Ramm: Sentence boundary adjustments in translation (German / Norwegian): Consequences on information distribution and discourse structure (PhD, Oslo, ongoing)

Astrid Nome: Ongoing PhD work on connectors in Norwegian and French. (Oslo)

Cathrine Fabricius Hansen et al: Big Events, Small Clauses. The Grammar of Elaboration. (Forthcoming book with multiple authors and multiple languages)

Master theses (English, German, French) studying individual verbs, syntactic constructions, connectors, metaphor …

Page 26: Parallel corpora and contrastive studies

28 >

My own contrastive work

2009. A textual perspective on the pragmatic markers in fact and faktisk. In S. Slembrouck, M. Taverniers, M. Van Herreweghe (eds.) From will to well: Studies in Linguistics offered to Anne-Marie Simon-Vandenbergen. Ghent: Academia Press.

2007. Using the ENPC and the ESPC as a parallel translation corpus: adverbs of frequency and usuality. Nordic Journal of English Studies 6:1, http://ojs.ub.gu.se/ojs/index.php/njes/issue/view/6

2006. “Not now” – on non-correspondence between the cognate adverbs now and nå. In K. Aijmer & A.-M. Simon Vandenbergen (eds.) Pragmatic Markers in Contrast. Elsevier, 93-114.

2005. Theme in Norwegian. In K.L. Berge, & E. Maagerø (eds.). Semiotics from the North: Nordic Approaches to Systemic Functional Linguistics. Oslo: Novus, 35-48.

2004. Spatial linking in English and Norwegian. In K. Aijmer & H. Hasselgård (eds.). Translation and Corpora. Göteborg: Acta Universitatis Gothoburgensis, 163-188.

2004. Thematic choice in English and Norwegian. Functions of Language 11:2. 187-212.

2000. English multiple Themes in translation. In A. Klinge (ed.) Contrastive Studies in Syntax. Special issue of Copenhagen Studies in Language, Vol 25. Copenhagen: Samfundslitteratur, 11-38.

Page 27: Parallel corpora and contrastive studies

29 >

Case study: be going to and komme til å (‘come to’)

Future-referring expressions based on motion verb + infinitive

Both described in grammars as common expressions, though less common than expressions with English will, Norwegian skal

Page 28: Parallel corpora and contrastive studies

30 >

Meanings

be going to

– ‘future fulfilment of the present’; present intention or present cause (Quirk et al 1985)

– associated with present intention or arrangement; was going to quite often has ‘an implicature of non-actualisation’. (Huddleston & Pullum 2002)

– Two meanings: ‘futurish’, linked to a present situation, and ‘future tense’, simply expressing future time reference. (Declerck 2006)

komme til å

– the speaker predicts what will happen based on his knowledge at the moment of speaking (Faarlund et al 1997)

– Past tense kom til å V– also ‘accidentally V’ or ‘was led to V’/ ‘grew to V’ (Vannebo 1979 and Engelsk Stor Ordbok)

Page 29: Parallel corpora and contrastive studies

31 >

Examples

1. I know what he’s going to say even before he says it. (FW1)

2. Jeg vet hva han kommer til å si selv før han sier det. (FW1T)

3. "I was going to wait until another time we met, but I may as well tell you now. (AH1)

4. Meningen var å vente til en annen gang, men jeg kan like godt si det nå. (AH1T)

5. Ingen av dem visste hva som kom til å skje. (TTH1)

6. Neither of them knew what was going to happen. (TTH1T)

7. Kanskje hun kom til å svelge dem ved et uhell? (LSC1)

8. Maybe she happened to swallow them by accident? (LSC1T)

9. Og siden ble det jeg som kom til å se mest til henne. (EHA1)

10. And then I became the one who ended up seeing her most often. (EHA1T)

Page 30: Parallel corpora and contrastive studies

32 >

be going to and komme til å in ENPC fiction (raw frequencies)

[Bar chart, axis 0–250: going to and komme til å, each shown for original and translation]

Page 31: Parallel corpora and contrastive studies

33 >

Preliminary observations

Be going to is more common than komme til å in original texts

Be going to is more common in original texts than in translations

Komme til å is less common in original texts than in translations

– i.e. translations in both directions can be assumed to be coloured by the source texts.

The frequency differences between originals and translations (particularly with komme til å) indicate that the two expressions can often be used in the same contexts, but may tend not to be.

Page 32: Parallel corpora and contrastive studies

34 >

Correspondences of be going to (percentages)

[Bar chart, axis 0–35: komme til å; skal INF / skulle INF; vil INF / ville INF; ha tenkt; simple tense; other. Series: N translation, N original]

Page 33: Parallel corpora and contrastive studies

35 >

Correspondences of komme til å (percentages)

[Bar chart, axis 0–35. Series: E translation, E original]

Page 34: Parallel corpora and contrastive studies

36 >

Correspondences

The mutual correspondence between be going to and komme til å is surprisingly low: 12.6%

The correspondence is asymmetrical:

– 15% of be going to are translated as komme til å

– 7% of komme til å are translated as be going to

Komme til å has meanings not covered by be going to (‘accidentally’, ‘grow to’, ‘be led to’).

The ‘present cause/intention’ meaning works differently for the two expressions; apparently also speaker certainty/non-actualisation.

1. What are we going to do, says Ruth, … (BV2T)

2. Hva skal vi gjøre, sier Rut …(BV2)

3. Hun kommer bare til å bli redd." (THA1)

4. She 'll only be frightened." (THA1T)

5. "Are you going to run a hotel?" enquired Frederick reasonably, … (DL1)

6. "Har dere tenkt å drive hotell?" spurte Frederick fornuftig, … (DL1T)

Uncertain outcome, no intentionality

Confident prediction – speaker knowledge

Intention, but uncertain outcome

Page 35: Parallel corpora and contrastive studies

37 >

Thus, in spite of shared meanings, English be going to and Norwegian komme til å differ as to

– The frequency with which the item is chosen

– The extent to which they compete with other future-referring expressions

– The extent to which they convey confident predictions, ‘present intention’ and ‘actualised future in past’.

Some other explanations may be

– Translators in both directions tend to normalize be going to / komme til å into a more common future-referring expression (will/would INF and skal/skulle INF); will/would and skal/skulle are also the most common sources of komme til å / be going to

– Sometimes more lexically explicit forms have been used to translate be going to / komme til å: ha tenkt å / intend to (subject’s intention); was to (‘was led/destined to’)

– Be going to may be needed for syntactic reasons, as English modals lack non-finite forms and do not show tense clearly.

– Norwegian modal auxiliaries are more flexible, having non-finite and tensed forms: skal/skulle + INF fits into more syntactic environments than will/would + INF

Page 36: Parallel corpora and contrastive studies

38 >

The verb forms

[Bar chart, axis 0–100 %: komme til TT, komme til OT, going to TT, going to OT. Categories: present, past, modalised, other]

Page 37: Parallel corpora and contrastive studies

39 >

• The present tense be going to occurs to a great extent in direct speech.

• The meanings of ‘accidentally do’ and ‘grow to’/ ‘be led to’ of komme til å occur mainly with the past tense, the former also with modalisation.

1. Hun kjenner at hun er søvnig, at hun kan komme til å sovne mot fars jakke, hun vil ikke det. (BV2)

2. She feels that she is sleepy, that she might fall asleep against father's jacket, but she doesn't want to do that. (BV2T)

3. … og at den kvinnen jeg leter efter egentlig var et barn den gangen hun kom til å bety noe for meg.“ (FC1)

4. … and that the woman I'm searching for was really a child when she came to mean something to me. (FC1T)

Page 38: Parallel corpora and contrastive studies

40 >

Some reflections on findings and further work

The picture of correspondence is a complex one, in spite of the rather similar descriptions in grammars of be going to and komme til å.

Syntactic differences between will/skal-future expressions may go some way towards explaining the difference in distribution.

Correspondence types will have to be correlated with tense forms.

Subtle differences of meaning regarding speaker certainty and present cause/intention come to the surface when studying correspondences.

be going to is closer to a neutral future meaning than komme til å; further grammaticalized as a future tense.

Page 39: Parallel corpora and contrastive studies

41 >

Summing up

Parallel corpora enhance contrastive studies in a number of ways

– by ensuring that observations are based on authentic language use

– by yielding paradigms and patterns of correspondences

– thus often revealing meanings and nuances we might not have thought of

– and showing how the same meaning may be expressed by means of different linguistic categories

– by providing quantitative data

– … thus also giving insights into ‘preferred ways of putting things’

– (if the corpus is bidirectional) by providing control for translation bias

– (if the corpus is representative) by controlling for the idiosyncrasies of individual authors/translators

Page 40: Parallel corpora and contrastive studies

42 >

Why undertake corpus-based contrastive investigations?

The importance of multilingual corpora extends beyond contrastive studies. It is up to the user to define fruitful research questions and use the corpora creatively. In this process we learn not only about individual languages and their relationships, about translation and foreign-language acquisition, but also about language in general – provided that the study becomes truly multilingual. Seeing through corpora we can see through language.

Stig Johansson (2007: 316)

Page 41: Parallel corpora and contrastive studies

43 >

Information on the OMC / ENPC

About the corpora:

OMC: www.hf.uio.no/ilos/english/originalfiler/services/omc/

ENPC: www.hf.uio.no/ilos/english/originalfiler/services/omc/enpc/

www.helsinki.fi/varieng/CoRD/corpora/ENPC/

About publications based on the OMC (up to 2006):

www.hf.uio.no/ilos/forskning/prosjekter/sprik/english/publications/

Page 42: Parallel corpora and contrastive studies

44 >

References

Aijmer, K. & B. Altenberg. 1996. Introduction. In K. Aijmer, B. Altenberg, M. Johansson (eds.) Languages in Contrast. Lund University Press, 11-16.

Altenberg, B. 1999. Adverbial connectors in English and Swedish: Semantic and lexical correspondences. In Hasselgård & Oksefjell (eds.) Out of Corpora. Amsterdam: Rodopi, 249-268.

Berglund, Y. 2005. Expressions of Future in Present-day English. A Corpus-based Approach. Uppsala University.

Chesterman, A. 1998. Contrastive Functional Analysis. Amsterdam/Philadelphia: John Benjamins Publishing Company.

Declerck, R. 2006. The Grammar of the English Verb Phrase, Vol. 1. Berlin: Mouton de Gruyter.

Faarlund, J. T., S. Lie, K. I. Vannebo. 1997. Norsk Referansegrammatikk. Oslo: Universitetsforlaget.

Huddleston, R. & G. K. Pullum. 2002. The Cambridge Grammar of the English Language. Cambridge: Cambridge University Press.

James, C. 1980. Contrastive Analysis. London: Longman.

Johansson, S. 1999. Corpora and contrastive studies. In P. Pietilä & O-P. Salo (eds.) Multiple Languages – Multiple Perspectives. AFinLA Yearbook 1999 / No. 57, 116-125.

Johansson, S. 2007. Seeing through multilingual corpora. Amsterdam: Benjamins.

Quirk, R., S. Greenbaum, G. Leech, J. Svartvik. 1985. A Comprehensive Grammar of the English Language. London: Longman.

Vannebo, K. I. 1979. Tempus og tidsreferanse. Oslo: Novus.