From RuN to RuN-Euro: a multilingual parallel corpus at the University of Oslo Atle Grønn, Kjetil Rå Hauge, Elizaveta Khachaturyan, Ljiljana Šarić 1

From RuN to RuN-Euro: a multilingual parallel corpus at the University of Oslo

Atle Grønn, Kjetil Rå Hauge, Elizaveta Khachaturyan, Ljiljana Šarić


> Department of Literature, Area Studies and European Languages

Corpus history at Oslo University

At the University of Oslo, and notably at the Department of Literature, Area Studies and European Languages, there is a strong tradition of parallel corpus work going back to the late Stig Johansson’s English-Norwegian Parallel Corpus, initiated in 1994.



Corpus history at Oslo University

The English-Norwegian corpus has been continued as the Oslo Multilingual Corpus, with subcorpora in Norwegian, English, French, and German, and smaller sections for Dutch and Portuguese.

In addition, there are related parallel corpora for English-Swedish and English-Finnish, compiled in Sweden and Finland, which are accessible from the same site.



RuN history

Where Russian meets Norwegian - languages at the interfaces

This three-year project, led by Atle Grønn, was started in 2008 with funding from the Norwegian Centre for International Cooperation in Higher Education (SIU) through its cooperation programme with Russia.

Its objective is to bridge the gap between research and education in the field of advanced second language learning of Russian and Norwegian.



RuN activities

Activities at RuN include new bachelor's and master's courses, three major international conferences, teaching materials, research articles, MA and PhD theses, and the RuN Corpus, a parallel Norwegian-Russian-English corpus.

RuN web page


http://www.hf.uio.no/ilos/english/research/projects/run/


RuN-Euro

In 2010 the project expanded to include Bulgarian, French, Italian, Croatian, and Polish, with additional members Kjetil Rå Hauge, Elizaveta Khachaturyan, and Ljiljana Šarić, and technical assistants Vladislav Dorochin and Boris Orechov.

The expansion was made possible through funding from the Faculty of Humanities at the University of Oslo.



Profile

The RuN-Euro corpus, like the OMC, is heavily biased towards fiction.

Priority is given to contemporary texts.

However, older texts can be included without difficulty, since the search interface allows searches to be restricted by date of publication, author, genre, and other parameters.



New acquisitions

The language distribution of this year’s additions leans heavily towards Russian and Bulgarian, as e-texts are more readily available in these languages. Also, texts have been exchanged through cooperation with the Institute for Bulgarian Language at the Bulgarian Academy of Sciences.

Next year will hopefully see more texts in the other languages, made available through OCR.



Production 2010

More than fifty additional texts will have been added by the end of 2010 (originals in red):

Antoine de Saint-Exupéry, Le petit prince (Bg-BKS-En-Fr-It-No-Ru)

Michail Bulgakov, Master i Margarita (Bg-En-It-No-Ru)

Michail Bulgakov, Sobač´e serdce (Bg-En-No-Ru)

Jostein Gaarder, Sofies verden (Bg-It-No-Ru)

Il´ja Il´f and Evgenij Petrov, Dvenadcat´ stul´ev (Bg-BKS-En-Ru)

Ernest Hemingway, The old man and the sea (Bg-BKS-En-Ru)

Vladimir Nabokov, Lolita (Bg-En-No-Ru)

Viktor Pelevin, Generation П (Bg-En-No-Ru)

Anton Čechov, Rasskazy (Bg-En-No-Ru)



Production 2010, cont.

Boris Akunin, Koronacija, ili Poslednij iz Romanov (En-No-Ru)

Michail Bulgakov, Rokovye jajca (En-No-Ru)

Ivo Andrić, Prokleta avlija (Bg-BKS-Ru)

Ivan Vazov, Pod igoto (Bg-En-Fr)



Free tool

A web-based sentence splitter has been built by project assistant Boris Orechov, based on Perl code by Jarle Ebeling of the OMC.

The splitter has built-in lists of non-splitting abbreviations (Mr., Dr., ...) for English, Russian, and Bulgarian (the latter courtesy of the Institute for Bulgarian Language, Bulgarian Academy of Sciences). Source code is available.

It outputs either XML organised according to the specifications of the RuN-Euro project or plain text with one segment per line, which can be used as input to Hunalign or other aligners.

http://nevmenandr.net/run/tools/
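The idea behind a splitter with a list of non-splitting abbreviations can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's Perl-based code: the function names `split_sentences` and `to_plain` and the short abbreviation set are hypothetical, and the real tool uses its own per-language lists.

```python
import re

# Illustrative abbreviation list (an assumption); the real tool ships
# separate lists for English, Russian, and Bulgarian.
NON_SPLITTING = {"mr.", "mrs.", "dr.", "prof."}

# Candidate boundary: whitespace that follows ., ! or ?
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text, abbreviations=NON_SPLITTING):
    """Split at sentence-final punctuation unless the chunk ends in a listed abbreviation."""
    sentences, current = [], []
    for chunk in _BOUNDARY.split(text.strip()):
        if not chunk:
            continue
        current.append(chunk)
        last_token = chunk.split()[-1].lower()
        if last_token in abbreviations:
            continue  # not a real boundary: keep accumulating
        sentences.append(" ".join(current))
        current = []
    if current:  # text ended on an abbreviation
        sentences.append(" ".join(current))
    return sentences


def to_plain(sentences):
    """One sentence per line, the kind of plain input an aligner can read."""
    return "\n".join(sentences)
```

For example, `split_sentences("Dr. Smith arrived. He met Mr. Jones!")` keeps "Dr." and "Mr." attached to the words that follow them instead of treating their periods as sentence ends.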



Aligner (1)

[Screenshot of the aligner interface: input panes, testing panes, output panes]


Aligner (2)

The aligner, programmed in Java, uses language-specific information: bilingual word lists of frequent words with reasonably straightforward translations:

almost/nesten

alone, single/alene

already/allerede

....

Hofland, Knut and Stig Johansson. 1998. "The Translation Corpus Aligner: A program for automatic alignment of parallel texts." In Johansson, Stig and Signe Oksefjell (eds.). Corpora and Crosslinguistic Research: Theory, Method, and Case Studies. Amsterdam: Rodopi, 87-100.
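How a bilingual word list of this kind can serve as alignment evidence may be illustrated with a small Python sketch. It is not the algorithm of the Java aligner or of the Translation Corpus Aligner; it only assumes the `source, source/target` line format shown above, and the names `parse_word_list` and `anchor_score` are hypothetical.

```python
import re

# The three sample entries shown above, in "English/Norwegian" format.
WORD_LIST = """\
almost/nesten
alone, single/alene
already/allerede
"""


def parse_word_list(raw):
    """Map each source-language word to the set of its listed translations."""
    pairs = {}
    for line in raw.splitlines():
        if "/" not in line:
            continue  # skip blank or malformed lines
        left, right = line.split("/", 1)
        targets = {w.strip().lower() for w in right.split(",") if w.strip()}
        for source in left.split(","):
            source = source.strip().lower()
            if source:
                pairs.setdefault(source, set()).update(targets)
    return pairs


def tokens(sentence):
    """Lower-cased set of word tokens in a sentence."""
    return set(re.findall(r"\w+", sentence.lower()))


def anchor_score(src_sentence, tgt_sentence, pairs):
    """Count source words with a listed translation present in the target sentence."""
    tgt = tokens(tgt_sentence)
    return sum(1 for w in tokens(src_sentence) if pairs.get(w, set()) & tgt)
```

A higher score suggests that two sentences are translations of each other; a real aligner would combine such evidence with other cues, for instance sentence length.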



Database of texts

Full list of texts (incompatible with Internet Explorer):

http://www.nevmenandr.net/run/



Glossa front end

RuN-Euro and the OMC share a common web interface: “Glossa”, developed by the Text Laboratory at the Department for Linguistics and Nordic Languages.

Glossa is a user-friendly graphic interface built on top of the IMS Corpus Workbench query system.

Morphosyntactic tagging of the texts is provided by the Text Laboratory.
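For readers unfamiliar with the Corpus Workbench, the following Python sketch builds query strings in its CQP syntax, the kind of query a graphical front end spares users from writing by hand. It does not show how Glossa itself constructs queries. The helper names are hypothetical, and the attribute names `word`, `lemma`, and `pos` are assumptions: which attributes exist depends on how a given corpus was encoded.

```python
def cqp_token(**attrs):
    """Build one CQP token expression, e.g. [word="nesten"].

    Values are inserted as-is, so they must not contain double quotes.
    """
    if not attrs:
        return "[]"  # in CQP, an empty pair of brackets matches any token
    conds = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"[{conds}]"


def cqp_query(*token_exprs):
    """Join token expressions into a sequence query terminated by a semicolon."""
    return " ".join(token_exprs) + ";"


# The word form "nesten" followed by any single token.
query = cqp_query(cqp_token(word="nesten"), cqp_token())
```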

http://www.hf.uio.no/iln/tjenester/sprak/glossa/index.html



Other parallel corpora

In our selection of texts, we try not to duplicate texts that are already included in other parallel corpora, at least for the time being.

What we would like to see in the future, however, is a "marketplace" for parallel corpora.



The parallel corpora bazaar?

•XML files

•XSL transformations (for transforming files from project A into files for project B)

•Lists of non-splitting abbreviations

•Bilingual word lists (for aligners)
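The kind of format conversion that shared XSL transformations would perform can be sketched in Python. Note that this uses the standard library's ElementTree instead of XSLT, and that both formats are invented for illustration: "project A" is assumed to use `<text>` and `<s id="...">`, "project B" `<doc>` and `<seg n="...">`. Neither is the actual RuN-Euro or OMC schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical element and attribute renamings from format A to format B.
TAG_MAP = {"text": "doc", "s": "seg"}
ATTR_MAP = {"id": "n"}


def convert(xml_a):
    """Rename elements and attributes of a format-A document to format B."""
    root = ET.fromstring(xml_a)
    for elem in root.iter():
        elem.tag = TAG_MAP.get(elem.tag, elem.tag)
        old_attrs = dict(elem.attrib)
        elem.attrib.clear()
        for name, value in old_attrs.items():
            elem.set(ATTR_MAP.get(name, name), value)
    return ET.tostring(root, encoding="unicode")


converted = convert('<text><s id="1">Hei.</s></text>')
```

Exchanging small mapping tables like `TAG_MAP`, or equivalent XSL stylesheets, would let one project reuse another project's files without re-encoding them by hand.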
