From RuN to RuN-Euro: a multilingual parallel corpus at the University of Oslo

Atle Grønn, Kjetil Rå Hauge, Elizaveta Khachaturyan, Ljiljana Šarić

> Department of Literature, Area Studies and European Languages

Corpus history at Oslo University

At the University of Oslo, and notably at the Department of Literature, Area Studies and European Languages, there is a strong tradition of parallel corpus work going back to the late Stig Johansson’s English-Norwegian Parallel Corpus, initiated in 1994.

Corpus history at Oslo University

The English-Norwegian corpus has been continued as the Oslo Multilingual Corpus, with subcorpora in Norwegian, English, French, and German, and smaller sections for Dutch and Portuguese.

In addition, there are related parallel corpora for English-Swedish and English-Finnish, compiled in Sweden and Finland, which are accessible from the same site.

RuN history

Where Russian meets Norwegian - languages at the interfaces

This three-year project, led by Atle Grønn, was started in 2008 with funding from the Norwegian Centre for International Cooperation in Higher Education (SIU) through its cooperation programme with Russia.

Its objective is to bridge the gap between research and education in the field of advanced second language learning of Russian and Norwegian.

RuN activities

Activities at RuN include new bachelor's and master's courses, three major international conferences, teaching materials, research articles, MA and PhD theses, and the RuN Corpus, a parallel Norwegian-Russian-English corpus.

RuN web page

RuN-Euro

In 2010 the project expanded to include Bulgarian, French, Italian, Croatian, and Polish, with additional members Kjetil Rå Hauge, Elizaveta Khachaturyan, and Ljiljana Šarić, and technical assistants Vladislav Dorochin and Boris Orechov.

The expansion was made possible through funding from the Faculty of Humanities at the University of Oslo.

Profile

The RuN-Euro corpus, like the OMC, is heavily biased towards fiction.

Priority is given to contemporary texts.

Older texts can nevertheless be included without difficulty, since the search interface allows searches to be restricted by date of publication, author, genre, and other parameters.

New acquisitions

The language distribution of this year’s additions leans heavily towards Russian and Bulgarian, as e-texts are more readily available in these languages. Also, texts have been exchanged through cooperation with the Institute for Bulgarian Language at the Bulgarian Academy of Sciences.

Next year will hopefully see more texts in the other languages, made available through OCR.

Production 2010

More than fifty additional texts will have been added by the end of 2010 (originals were marked in red on the slide):

Antoine de Saint-Exupéry, Le petit prince (Bg-BKS-En-Fr-It-No-Ru)

Michail Bulgakov, Master i Margarita (Bg-En-It-No-Ru)

Michail Bulgakov, Sobač´e serdce (Bg-En-No-Ru)

Jostein Gaarder, Sofies verden (Bg-It-No-Ru)

Il´ja Il´f and Evgenij Petrov, Dvenadcat´ stul´ev (Bg-BKS-En-Ru)

Ernest Hemingway, The old man and the sea (Bg-BKS-En-Ru)

Vladimir Nabokov, Lolita (Bg-En-No-Ru)

Viktor Pelevin, Generation П (Bg-En-No-Ru)

Anton Čechov, Rasskazy (Bg-En-No-Ru)

Production 2010, cont.

Boris Akunin, Koronacija, ili Poslednij iz Romanov (En-No-Ru)

Michail Bulgakov, Rokovye jajca (En-No-Ru)

Ivo Andrić, Prokleta avlija (Bg-BKS-Ru)

Ivan Vazov, Pod igoto (Bg-En-Fr)

Free tool

A web-based sentence splitter has been built by project assistant Boris Orechov, based on Perl code by Jarle Ebeling of the OMC.

The splitter has built-in lists of non-splitting abbreviations (Mr., Dr., ...) for English, Russian, and Bulgarian (the latter courtesy of the Institute for Bulgarian Language, Bulgarian Academy of Sciences). Source code is available.

It splits into XML organised according to the specifications of the RuN-Euro project, or into plain return-delimited chunks that can be used as input to Hunalign or other aligners.

http://nevmenandr.net/run/tools/
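
The behaviour described above can be illustrated with a minimal sketch. This is not the project's splitter (which is web-based and derived from Perl code); it is a Python illustration under stated assumptions. The abbreviation list, the function names, and the XML element names `<text>` and `<s>` are all hypothetical and do not reproduce the RuN-Euro specification.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical abbreviation list; the project's real lists are not reproduced here.
NON_SPLITTING = {"mr.", "mrs.", "dr.", "prof."}

# Candidate boundary: whitespace that follows sentence-final punctuation.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text, abbreviations=NON_SPLITTING):
    """Split at '.', '!' or '?' followed by whitespace, except after a listed abbreviation."""
    sentences = []
    current = []
    for chunk in _BOUNDARY.split(text.strip()):
        if not chunk:
            continue
        current.append(chunk)
        last_token = chunk.split()[-1].lower()
        if last_token in abbreviations:
            # The period belongs to an abbreviation, so keep accumulating.
            continue
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def to_plain(sentences):
    """One sentence per line, the kind of input a line-based aligner expects."""
    return "\n".join(sentences)


def to_xml(sentences):
    """Wrap sentences in placeholder XML (element names are assumptions)."""
    lines = ["<text>"]
    for i, sentence in enumerate(sentences, 1):
        lines.append(f'  <s id="s{i}">{escape(sentence)}</s>')
    lines.append("</text>")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "Dr. Watson met Mr. Holmes. They talked! Did it rain?"
    print(to_plain(split_sentences(sample)))
```

The abbreviation check is what keeps "Dr." and "Mr." from ending a sentence; a per-language list would simply be passed in as the `abbreviations` argument.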

Aligner (1)

[Screenshot of the aligner interface, showing input panes, testing panes, and output panes]

Aligner (2)

The aligner, programmed in Java, uses language-specific information in the form of bilingual word lists of frequent words with reasonably straightforward translations:

almost/nesten

alone, single/alene

already/allerede

....

Hofland, Knut and Stig Johansson. 1998. "The Translation Corpus Aligner: A program for automatic alignment of parallel texts." In Johansson, Stig and Signe Oksefjell (eds.), Corpora and Crosslinguistic Research: Theory, Method, and Case Studies. Amsterdam: Rodopi, 87-100.
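
As a rough illustration of how such a word list can guide alignment, the sketch below counts source words that have a listed translation in a candidate target sentence. It is written in Python for brevity and is not the Java aligner itself; the function names and the scoring rule are assumptions, and a real aligner would combine this cue with others. The three entries are the ones shown above.

```python
import re

# Entries taken from the slide; a real list would be far longer.
ANCHORS = {
    "almost": {"nesten"},
    "alone": {"alene"},
    "single": {"alene"},
    "already": {"allerede"},
}


def tokens(sentence):
    """Lower-cased set of word tokens in a sentence."""
    return set(re.findall(r"\w+", sentence.lower()))


def anchor_score(source, target, anchors=ANCHORS):
    """Count source words that have a listed translation present in the target."""
    target_tokens = tokens(target)
    return sum(1 for word in tokens(source) if anchors.get(word, set()) & target_tokens)


def best_match(source, candidates, anchors=ANCHORS):
    """Index of the candidate with the highest anchor score, or None if no anchor matches."""
    scores = [anchor_score(source, candidate, anchors) for candidate in candidates]
    best = max(range(len(scores)), key=scores.__getitem__, default=None)
    if best is None or scores[best] == 0:
        return None
    return best


if __name__ == "__main__":
    norwegian = ["Hun var nesten alene.", "Han hadde allerede gått."]
    print(best_match("He had already left.", norwegian))
```

Frequent words with stable translations are useful here because they occur in many sentence pairs, so even a short list gives the aligner regular points of support.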

Database of texts

Full list of texts (incompatible with Internet Explorer):

http://www.nevmenandr.net/run/

Glossa front end

RuN-Euro and the OMC share a common web interface: “Glossa”, developed by the Text Laboratory at the Department for Linguistics and Nordic Languages.

Glossa is a user-friendly graphical interface built on top of the IMS Corpus Workbench query system.

Morphosyntactic tagging of the texts is provided by the Text Laboratory.

http://www.hf.uio.no/iln/tjenester/sprak/glossa/index.html

Other parallel corpora

In our selection of texts, we try not to duplicate files that are already included in other parallel corpora, at least for the time being.

What we would like to see in the future, however, is a "marketplace" for parallel corpora.

The parallel corpora bazaar?

• XML files

• XSL transformations (for transforming files from project A into files for project B)

• Lists of non-splitting abbreviations

• Bilingual word lists (for aligners)
