From RuN to RuN-Euro: a multilingual parallel corpus at the University of Oslo

Atle Grønn, Kjetil Rå Hauge, Elizaveta Khachaturyan, Ljiljana Šarić

> Department of Literature, Area Studies and European Languages

Corpus history at Oslo University

At the University of Oslo, and notably at the Department of Literature, Area Studies and European Languages, there is a strong tradition of parallel corpus work going back to the late Stig Johansson’s English-Norwegian Parallel Corpus, initiated in 1994.

The English-Norwegian corpus has been continued as the Oslo Multilingual Corpus, with subcorpora in Norwegian, English, French, and German, and smaller sections for Dutch and Portuguese.

In addition, there are related parallel corpora for English-Swedish and English-Finnish, compiled in Sweden and Finland, which are accessible from the same site.

RuN history

Where Russian meets Norwegian - languages at the interfaces

This three-year project, led by Atle Grønn, was started in 2008 with funding from the Norwegian Centre for International Cooperation in Higher Education (SIU) through its cooperation programme with Russia.

Its objective is to bridge the gap between research and education in the field of advanced second language learning of Russian and Norwegian.

RuN activities

Activities at RuN include new bachelor's and master's courses, three major international conferences, teaching materials, research articles, MA and PhD theses, and the RuN Corpus, a parallel Norwegian-Russian-English corpus.

RuN web page

RuN-Euro

In 2010 the project expanded into Bulgarian, French, Italian, Croatian, and Polish, with additional members Kjetil Rå Hauge, Elizaveta Khachaturyan, and Ljiljana Šarić, and technical assistants Vladislav Dorochin and Boris Orechov.

The expansion was made possible through funding from the Faculty of Humanities at the University of Oslo.

Profile

The RuN-Euro corpus, like the OMC, is heavily biased towards fiction.

Priority is given to contemporary texts.

Older texts can nevertheless be included without difficulty, since the search interface allows searches to be restricted by date of publication, author, genre, and other parameters.

New acquisitions

The language distribution of this year’s additions leans heavily towards Russian and Bulgarian, as e-texts are more readily available in these languages. Also, texts have been exchanged through cooperation with the Institute for Bulgarian Language at the Bulgarian Academy of Sciences.

Next year will hopefully see more texts in the other languages, made available through OCR.

Production 2010

More than fifty additional texts will have been added by the end of 2010 (originals in red):

Antoine de Saint-Exupéry, Le petit prince (Bg-BKS-En-Fr-It-No-Ru)

Michail Bulgakov, Master i Margarita (Bg-En-It-No-Ru)

Michail Bulgakov, Sobač´e serdce (Bg-En-No-Ru)

Jostein Gaarder, Sofies verden (Bg-It-No-Ru)

Il´ja Il´f and Evgenij Petrov, Dvenadcat´ stul´ev (Bg-BKS-En-Ru)

Ernest Hemingway, The old man and the sea (Bg-BKS-En-Ru)

Vladimir Nabokov, Lolita (Bg-En-No-Ru)

Viktor Pelevin, Generation П (Bg-En-No-Ru)

Anton Čechov, Rasskazy (Bg-En-No-Ru)

Boris Akunin, Koronacija, ili Poslednij iz Romanov (En-No-Ru)

Michail Bulgakov, Rokovye jajca (En-No-Ru)

Ivo Andrić, Prokleta avlija (Bg-BKS-Ru)

Ivan Vazov, Pod igoto (Bg-En-Fr)

Free tool

A web-based sentence splitter has been built by project assistant Boris Orechov, based on Perl code by Jarle Ebeling of the OMC.

The splitter has built-in lists of non-splitting abbreviations (Mr., Dr., ...) for English, Russian, and Bulgarian (the latter courtesy of the Institute for Bulgarian Language, Bulgarian Academy of Sciences). Source code is available.

It splits into XML organised according to the specifications of the RuN-Euro project, or into plain return-delimited chunks that can be used as input to Hunalign or other aligners.
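The behaviour described above can be illustrated with a minimal Python sketch. It is not the project's Perl-based splitter: the abbreviation list is a small illustrative sample, the splitting rule is deliberately simple, and the XML element and attribute names (`text`, `s`, `id`) are hypothetical, since the RuN-Euro specification is not reproduced here.

```python
import re
from xml.sax.saxutils import escape

# Illustrative sample only; the project's real abbreviation lists are not shown here.
NON_SPLITTING = {"mr.", "mrs.", "dr.", "prof."}


def split_sentences(text, abbreviations=NON_SPLITTING):
    """Split after ., ! or ? followed by whitespace, unless the last token
    before the break is a listed non-splitting abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]+\s+", text):
        candidate = text[start:match.end()].strip()
        last_token = candidate.split()[-1].lower()
        if last_token in abbreviations:
            continue  # e.g. "Dr." does not end a sentence
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # final sentence without trailing whitespace
    return sentences


def to_lines(sentences):
    """Return-delimited chunks, one sentence per line, as plain aligner input."""
    return "\n".join(sentences)


def to_xml(sentences):
    """Wrap sentences in XML; element and attribute names are hypothetical."""
    rows = [f'<s id="s{i}">{escape(s)}</s>' for i, s in enumerate(sentences, 1)]
    return "<text>\n" + "\n".join(rows) + "\n</text>"


sample = "Dr. Smith met Mr. Jones. They talked! Did it help?"
print(to_lines(split_sentences(sample)))
```

Without the abbreviation check, the sample would be cut after "Dr." and "Mr.", which is the error the built-in lists are meant to prevent.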

http://nevmenandr.net/run/tools/

Aligner

[Screenshot of the aligner interface, showing input panes, testing panes, and output panes.]

The aligner, programmed in Java, uses language-specific information: bilingual word lists of frequent words with reasonably straightforward translations:

almost/nesten

alone, single/alene

already/allerede

....

Hofland, Knut and Stig Johansson. 1998. "The Translation Corpus Aligner: A program for automatic alignment of parallel texts." In Johansson, Stig and Signe Oksefjell (eds.). Corpora and Crosslinguistic Research: Theory, Method, and Case Studies. Amsterdam: Rodopi, 87-100.
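How a bilingual word list can help pair sentences may be sketched as follows. This is a simplified Python illustration, not the Java aligner's actual algorithm: the function names and the scoring rule (counting source words whose listed translation occurs in the candidate target sentence) are assumptions, and the word list contains only the three entries shown above.

```python
import re

# English-Norwegian entries from the sample list above; "alone, single/alene"
# is expanded into two source words sharing one translation.
ANCHORS = {
    "almost": {"nesten"},
    "alone": {"alene"},
    "single": {"alene"},
    "already": {"allerede"},
}


def tokens(sentence):
    """Lower-cased set of word tokens in a sentence."""
    return set(re.findall(r"\w+", sentence.lower()))


def anchor_score(source, target, anchors=ANCHORS):
    """Count source words whose listed translation occurs in the target."""
    target_tokens = tokens(target)
    return sum(
        1 for word in tokens(source)
        if anchors.get(word, set()) & target_tokens
    )


def best_match(source, candidates, anchors=ANCHORS):
    """Pick the candidate target sentence with the highest anchor score."""
    return max(candidates, key=lambda c: anchor_score(source, c, anchors))


english = "She was already almost alone."
norwegian = ["Det regnet hele dagen.", "Hun var allerede nesten alene."]
print(best_match(english, norwegian))
```

Frequent words with stable translations make useful anchors because they occur in many sentences and rarely mislead; a real aligner would combine such evidence with other cues, such as sentence length and position.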

Database of texts

Full list of texts (incompatible with Internet Explorer):

http://www.nevmenandr.net/run/

Glossa front end

RuN-Euro and the OMC share a common web interface, “Glossa”, developed by the Text Laboratory at the Department for Linguistics and Nordic Languages.

Glossa is a user-friendly graphical interface built on top of the IMS Corpus Workbench query system.

Morphosyntactic tagging of the texts is provided by the Text Laboratory.

http://www.hf.uio.no/iln/tjenester/sprak/glossa/index.html

Other parallel corpora

In our selection of texts, we try, for the time being, not to duplicate files that are already included in other parallel corpora.

What we would like to see in the future, however, is a "marketplace" for parallel corpora.

The parallel corpora bazaar?

• XML files

• XSL transformations (for transforming files from project A into files for project B)

• Lists of non-splitting abbreviations

• Bilingual word lists (for aligners)
