Ivan Derganskyi


Bilingual and Multilingual Electronic Language Resources

Ivan A. Derzhanski (iad58@mail.ru)
Institute of Mathematics and Informatics, Bulgarian Academy of Sciences
Department of Mathematical Linguistics

2

Resources for language engineering

• lexical databases (LDBs)
• electronic dictionaries
  – monolingual
  – bilingual and multilingual
• corpora

3

Corpus annotation

• Def: the process of adding linguistic information in an electronic form to a text corpus.

• Most common types:
  – morphosyntactic (grammatical, PoS) annotation
  – lemma annotation

4

PoS tagging

• Def: the task of labelling each word in a sequence of words with its appropriate part-of-speech.

• Ambiguity:
  – вероятно ‘probable (sg. n.), probably’
    • вероятно → PoS: adjective, Gender: neuter, Number: singular, Definiteness: no
    • вероятно → PoS: adverb, Type: adjectival

• Def: a tagset is the set of PoS tags used for annotation.
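The ambiguity of вероятно can be made concrete with a small sketch. The Python fragment below is only an illustration, not any project's actual data model: it stores the two analyses listed above under one word form and reports whether a tagger would have to choose between them. The names `Analysis`, `ANALYSES`, `candidate_tags` and `is_ambiguous` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One possible reading of a word form (illustrative structure)."""
    pos: str
    features: tuple = ()  # (feature name, value) pairs


# The two readings of the example word, as listed on the slide.
ANALYSES = {
    "вероятно": [
        Analysis("adjective", (("Gender", "neuter"),
                               ("Number", "singular"),
                               ("Definiteness", "no"))),
        Analysis("adverb", (("Type", "adjectival"),)),
    ],
}


def candidate_tags(word_form):
    """Return every PoS reading recorded for a word form (empty if unknown)."""
    return [analysis.pos for analysis in ANALYSES.get(word_form, [])]


def is_ambiguous(word_form):
    """A form is PoS-ambiguous if it has more than one distinct PoS reading."""
    return len(set(candidate_tags(word_form))) > 1
```

A tagger's job is then to pick one of the candidates in context; this sketch only enumerates them.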

5

Electronic corpora of Bulgarian

The first two electronic corpora of the Bulgarian language were created in the framework of two EU projects on language technologies:

• MULTEXT-East (http://nl.ijs.si/IME);

• CONCEDE.

6

MULTEXT-East

The project MULTEXT-East (Multilingual Text Tools and Corpora for Eastern and Central European Languages, 1995–1997) produced resources for six Central and Eastern European languages:

• Bulgarian,
• Slovene,
• Czech,
• Romanian,
• Hungarian,
• Estonian,

as well as English (as the ‘hub language’ of the project).

7

MULTEXT-East (continued)

The extended results of the project were made available in 1998, first on CD-ROM and then via TRACTOR, the TELRI (Trans-European Language Resources Infrastructure) Research Archive of Computational Tools and Resources.

Version 3 (2004) includes material in five more languages (Croatian, Lithuanian, Resian, Russian, Serbian).

8

MULTEXT-East (continued)

The corpus of Bulgarian, developed according to the methodology and requirements of the project, contains three parts:

• Bulgarian Language-Specific Resources,

• a Parallel Annotated 1984 Corpus,
• a Comparative Corpus.

9

The Parallel Annotated 1984 Corpus

The Parallel Annotated 1984 Corpus consists of

• the Bulgarian translation of George Orwell’s novel Nineteen Eighty-Four (approximately 87,000 words);

• Bulgarian-English aligned texts.

10

The Parallel Annotated 1984 Corpus (continued)

The material was formatted as a well-structured, lemmatised, Corpus Encoding Standard (CES) corpus (Ide, 1998).

That is, each word form is accompanied by the corresponding lemma and grammatical information that constitute its standard lexical description.

11

The Parallel Annotated 1984 Corpus (continued)

The lexical descriptions for Bulgarian are in line with the terminology and the methodology used by MULTEXT.

The corpus was marked and validated for alignment and sentence boundaries.

12

The Comparative Corpus

The Comparative Corpus contains two subsets of about 100,000 words each: a fiction subset, comprising excerpts from two contemporary Bulgarian novels, and a newspaper subset, comprising excerpts from newspaper text.

The data was comparable across the six languages, in terms of the number and size of texts.

13

The Comparative Corpus (continued)

The entire multilingual Comparative Corpus was prepared in CES (Corpus Encoding Standard) format, manually or using ad-hoc tools, and was automatically annotated for tokenisation, sentence boundaries, and part of speech using the project tools.

14

Bulgarian Language-Specific Resources

The Bulgarian Language-Specific Resources are data required by the segmentation procedure, morphological analyser and disambiguator.

These include a lexical list and lists of special tokens (frequent abbreviations and names, titles, patterns for proper names, etc.) with their types.

15

The lexicon

The lexical list (lexicon) contains about 242,000 lemmata.

Each lemma in the lexicon is associated with its part(s) of speech and lexical characteristics.

156,000 morphosyntactic descriptions were provided for Bulgarian.

16

The lexicon (continued)

Each lexicon entry includes the following information:

• word form;
• lemma;
• part of speech;
• further morphological information (feature values).
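The four fields of a lexicon entry map naturally onto a record type. The sketch below assumes a tab-separated line layout with `Name=value` feature pairs; that layout, the sample lines and the names `LexiconEntry`, `parse_line` and `SAMPLE` are assumptions made for illustration and do not reproduce the project's actual file format.

```python
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    word_form: str
    lemma: str
    pos: str
    features: dict  # further morphological information (feature values)


def parse_line(line):
    """Parse one 'form<TAB>lemma<TAB>pos<TAB>Name=value;Name=value' line."""
    word_form, lemma, pos, feats = line.rstrip("\n").split("\t")
    features = dict(pair.split("=", 1) for pair in feats.split(";") if pair)
    return LexiconEntry(word_form, lemma, pos, features)


# Illustrative lines (assumed layout) for two forms of the noun цел.
SAMPLE = [
    "цел\tцел\tnoun\tGender=feminine;Number=singular;Definiteness=no",
    "целта\tцел\tnoun\tGender=feminine;Number=singular;Definiteness=yes",
]

# Index the entries by word form; one form may have several entries.
by_form = {}
for entry in map(parse_line, SAMPLE):
    by_form.setdefault(entry.word_form, []).append(entry)
```

Indexing by word form, with a list as the value, leaves room for ambiguous forms that carry more than one description.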

17

The lexicon (continued)

• part of speech
  – the traditional set of 10 parts of speech
  – punctuation
  – abbreviations
  – numbers written in digits
  – unidentified objects (residuals)

• same system for all languages of the project (though different interpretations)

18

Lexicography ↔ Linguistic Theory

• lexicography requires linguistic theory (analysis, methodology)
  – but also serves as a touchstone, because what can be represented must have been studied, understood, formalised to a sufficient extent

• lexicography supports linguistic theory (data for research)

19

Dictionary ↔ Grammar

• mutually complementary, mutually indispensable components of integrated linguistic description

• lexicographic type (unification)
• lexicographic portrait (individualisation)

20

Computational lexicography

digital (machine-readable) dictionaries:
• digital versions of traditional dictionaries for human use
• computer dictionaries as components of information systems

21

Advantages of digital dictionaries

• size not an issue
  – potential for infinite growth in depth and breadth (a dictionary needn’t be small, medium or large by design)
  – many purposes served (explanatory dictionary, grammatical dictionary, dictionary of synonyms, antonyms, phraseology, etymology, etc., all as one integrated system)

22

Advantages of digital dictionaries (continued)

• easy update possible, incl. by continued distributed collective effort (wiki-style)

• flexible search (incl. bidirectional) and presentation of results

• audio-, video- etc. material can be added

• requirement: definitions must be simpler, but at the same time more comprehensive

23

Dictionary (definition)

• an aggregate of linguistic units (forms)
  – established in the language system as represented by the usage of a certain language community,
  – put in a predetermined order and
  – accompanied by formal (orthographic, phonetic, grammatical, etymological, stylistic, etc.) and semantic information
    • on the linguistic units themselves or
    • on the denoted entities or phenomena,

24

Dictionary (definition, continued)

• an aggregate of linguistic units (forms)
  – put in a predetermined order and
  – accompanied by formal and semantic information,
  – arranged and ordered in a certain way within the entry,

• … almost always supplemented by auxiliary material
  – introduction, criteria, sources, list of abbreviations, structure of the dictionary entry, grammar tables

25

Structure of the dictionary entry

• register part (on the left)
• interpretation part (on the right)
• all the register parts together form the dictionary’s register
• the set of rules and methods used when composing the entries forms the metalanguage

26

The register

• designing the register (needn’t be a one-time event in the case of an electronic dictionary)
  – from other dictionaries
  – from a corpus of texts

• editing the register: eliminating obsolete words, arbitrary neologisms, suspected non-words

• automatic extension: productive derivation made into procedures

27

Structural aspects of lexicography

• macrostructure: nature and purpose of the dictionary, place within the typology of dictionaries, choice of register, choice of illustrations, order, metalanguage

• mediostructure: relations between language units, e.g., derivation, families of words

• microstructure: setup of the entry, hierarchy of meanings; requirements: standardisation, economy, simplicity, completeness

28

An example of a lexical entry: CONCEDE Bulgarian dictionary

<entry>
  <hw>цел</hw>
  <gen>ж.</gen>
  <struc type="Sense" n="1">
    <def>Това, към което е насочена някаква дейност, към което някой се стреми; умисъл, намерение.</def>
    <eg><q>С каква цел отиваш в града?</q></eg>
    <eg><q>Вървя без цел.</q></eg>
    <eg><q>Постигнах целта си.</q></eg>
    <eg><q>Целта оправдава средствата.</q></eg>
  </struc>
  <struc type="Sense" n="2">
    <def>Предмет или точка, в която някой стреля, към която е насочено определено действие, движение, удар и под.; прицел.</def>
    <eg><q>Улучих целта.</q></eg>
  </struc>
  <struc type="Phrases">
    <struc type="Phrase" n="1">
      <orth>Имам (нямам) [за] цел.</orth>
      <def>стремя се (не се стремя) към нещо.</def>
      <eg><q>Нямам за цел да му навредя.</q></eg>
    </struc>
    <struc type="Phrase" n="2">
      <orth>Попадам в целта.</orth>
      <def>улучвам, умервам.</def>
    </struc>
  </struc>
  <etym><lang>нем.</lang>&gt;<lang>рус.</lang></etym>
</entry>

29

An example of a lexical entry (zoom, part 1: head word, gender)

<entry>

<hw>цел</hw>

<gen>ж.</gen>

[…]

</entry>

30

An example of a lexical entry (zoom, part 2)

<struc type="Sense" n="1"><def>Това, към което е насочена някаква дейност, към което някой се стреми; умисъл, намерение.</def>

<eg><q>С каква цел отиваш в града?</q></eg>

<eg><q>Вървя без цел.</q></eg><eg><q>Постигнах целта си.</q></eg><eg><q>Целта оправдава средствата.</q></eg></struc>

31

An example of a lexical entry (zoom, part 3)

<struc type="Sense" n="2">

<def>Предмет или точка, в която някой стреля, към която е насочено определено действие, движение, удар и под.; прицел.</def>

<eg><q>Улучих целта.</q></eg></struc>

32

An example of a lexical entry (zoom, part 4)

<struc type="Phrases"><struc type="Phrase" n="1"><orth>Имам (нямам) [за] цел.</orth>

<def>стремя се (не се стремя) към нещо.</def>

<eg><q>Нямам за цел да му навредя.</q></eg></struc>

<struc type="Phrase" n="2"><orth>Попадам в целта.</orth>

<def>улучвам, умервам.</def></struc></struc>

33

An example of a lexical entry (zoom, part 5: etymology)

<entry>

[…]

<etym><lang>нем.</lang>&gt;<lang>рус.</lang></etym>

</entry>
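Because the entry is well-formed XML, its parts can be read out mechanically. The following sketch uses Python's standard `xml.etree.ElementTree` on an abridged copy of the entry (one example per sense, phrases left out); the variable names and the shape of the resulting dictionaries are illustrative choices, not part of the CONCEDE format.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the entry shown above (phrases omitted for brevity).
ENTRY_XML = """<entry>
  <hw>цел</hw>
  <gen>ж.</gen>
  <struc type="Sense" n="1">
    <def>Това, към което е насочена някаква дейност, към което някой се стреми; умисъл, намерение.</def>
    <eg><q>Вървя без цел.</q></eg>
  </struc>
  <struc type="Sense" n="2">
    <def>Предмет или точка, в която някой стреля, към която е насочено определено действие, движение, удар и под.; прицел.</def>
    <eg><q>Улучих целта.</q></eg>
  </struc>
  <etym><lang>нем.</lang>&gt;<lang>рус.</lang></etym>
</entry>"""

entry = ET.fromstring(ENTRY_XML)

headword = entry.findtext("hw")   # head word
gender = entry.findtext("gen")    # gender label as printed in the dictionary

# Collect the numbered senses with their definitions and examples.
senses = []
for struc in entry.findall("struc"):
    if struc.get("type") == "Sense":
        senses.append({
            "n": int(struc.get("n")),
            "definition": struc.findtext("def"),
            "examples": [q.text for q in struc.findall("eg/q")],
        })

# Languages named in the etymology, in document order.
etym_langs = [lang.text for lang in entry.findall("etym/lang")]
```

Filtering on the `type` attribute matters in the full entry, where a `struc` of type "Phrases" sits next to the senses at the same level.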

34

ABBYY Lingvo (Ru–It)

35

ABBYY Lingvo (Ru–Et)

цель

[m1][trn]eesmärk, märk, otstarve, siht[/trn][/m]

36

Why is order important?

37

38

Why is order important? (continued)

Ингредиенты: сахар, глюкоза, мука, милая, корица, какао, сода, маргарин
(‘Ingredients: sugar, glucose, flour, darling, cinnamon, cocoa, soda, margarine’)

39

Why is order important? (continued)

Ингредиенты: бикарбонат натрия, ароматы, студень, молочный порошок, эмульгатор
(‘Ingredients: sodium bicarbonate, fragrances, aspic, milk powder, emulsifier’)

40

wash (En–Ru)

41

honey (En–Ru)

42

jelly (En–Ru)

43

Digital grammatical dictionaries

• modelling of inflexion
  – (essential for inflecting languages)

• word form ↔ lemma + grammatical meaning
  – built upon a formal model of inflexion: a division of the set of words into inflexional paradigmatic classes (non-intersecting subsets with algorithmically described rules)
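A minimal sketch of such a model, under stated assumptions: each stem belongs to exactly one paradigmatic class, and each class lists the endings for its grammatical meanings. The single class below covers only the present tense of Bulgarian verbs of the претопявам type; the class name, the dictionaries and both functions are hypothetical and far simpler than a real grammatical dictionary.

```python
# One illustrative paradigm class: present tense of verbs like претопявам.
PARADIGMS = {
    "V-am-present": {
        "lemma_ending": "м",  # the citation form is the 1st person singular
        "endings": {
            ("1", "sg"): "м",
            ("2", "sg"): "ш",
            ("3", "sg"): "",
            ("1", "pl"): "ме",
            ("2", "pl"): "те",
            ("3", "pl"): "т",
        },
    },
}

# Each stem is assigned to exactly one class (non-intersecting subsets).
LEXICON = {"претопява": "V-am-present"}


def generate(stem):
    """Lemma + grammatical meaning -> word form, for every cell of the paradigm."""
    paradigm = PARADIGMS[LEXICON[stem]]
    return {gram: stem + ending for gram, ending in paradigm["endings"].items()}


def analyse(word_form):
    """Word form -> list of (lemma, grammatical meaning) pairs."""
    results = []
    for stem, class_name in LEXICON.items():
        paradigm = PARADIGMS[class_name]
        for gram, ending in paradigm["endings"].items():
            if word_form == stem + ending:
                results.append((stem + paradigm["lemma_ending"], gram))
    return results
```

The same tables drive both directions, which is the point of the ↔ in the slide: generation and analysis share one description of the class.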

44

Bi- and multilingual dictionaries

translation:
• most general member(s) of the corresponding synset
• grammatical semantics (incl. valency, subcategorisation)
• pragmatic context (sublanguage of most frequent usage)

45

Bi- and multilingual dictionaries (continued)

bilingual dictionary:
• two integrated linguistic systems (explanatory dictionary, grammatical dictionary, dictionary of synonyms, of antonyms, of phraseology)
• complemented by
  – comparable monolingual corpora and
  – a parallel bilingual corpus and
• linked by an interface

46

Bi- and multilingual dictionaries (continued)

• Integrating a synonym and a translation linguistic system: EuroWordNet (an assembly of WordNets using a common ontology and indexing)

47

Bi- and multilingual dictionaries (continued)

• multilingual dictionary:
  – a set of pairs of bilingual dictionaries
  – interlingua
    • one of the target languages
    • an external natural language
    • an artificial but speakable language (e.g., Esperanto)
    • a semantic interlingua (a digital concept dictionary)
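The interlingua option can be sketched as chaining two bilingual dictionaries through a shared pivot. In the Python fragment below, Russian serves as the pivot between Bulgarian and Estonian. The Russian–Estonian equivalents are those of the Lingvo entry for цель shown earlier; the Bulgarian–Russian pair цел → цель is an assumption added for illustration, as are the names `compose`, `BG_RU`, `RU_ET` and `BG_ET`.

```python
def compose(first, second):
    """Chain two bilingual dictionaries through their shared pivot language."""
    result = {}
    for source, pivots in first.items():
        targets = []
        for pivot in pivots:
            for target in second.get(pivot, []):
                if target not in targets:  # keep order, drop duplicates
                    targets.append(target)
        if targets:
            result[source] = targets
    return result


# Bulgarian -> Russian (assumed toy entry).
BG_RU = {"цел": ["цель"]}

# Russian -> Estonian (equivalents from the Lingvo Ru–Et entry above).
RU_ET = {"цель": ["eesmärk", "märk", "otstarve", "siht"]}

BG_ET = compose(BG_RU, RU_ET)
```

With a natural language as the pivot, every sense of the pivot word is carried over, so the composed entry may list equivalents that do not fit the source word; a semantic interlingua is meant to avoid exactly that.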

48

Plans

of the joint research project “Semantics and Contrastive linguistics with a focus on a bilingual electronic dictionary” between IMI—BAS and ISS—PAS:

• Bulgarian–Polish/Polish–Bulgarian dictionaries

• Bulgarian–Polish–Ukrainian dictionary
• Bulgarian–Polish–Ukrainian–Lithuanian …
• … more?

49

Bulgarian–Polish/Polish–Bulgarian dictionaries … on the basis of (1)

the most recent paper bilingual dictionaries (1987, 1988)
• volume ≈60 000 words
• already dated
• of questionable reliability to boot

50

Bulgarian–Polish/Polish–Bulgarian dictionaries … on the basis of (2)

a bilingual corpus (3 000 000 words envisaged) consisting of
• fiction
  – Polish to Bulgarian (easy to find)
  – Bulgarian to Polish (hard to find)
  – third-language original, translated into Bulgarian and Polish
• EU/EC documents
• texts in Bulgarian and Polish of similar sizes
  – excerpts from newspapers
  – literary works available on the Internet

51

Bulgarian–Polish dictionary (after OCR and proofreading)

претовар|я, -иш vp. v. претоварям
претоп|я, -иш vp. v. претапям, претопявам
претопява|м, -ш vi. przetapiać; przen. asymilować
претор, -и m hist. pretor m
преториан|ец, -ци m pretorianin m
преториански adi. pretoriański
I преточ|а, -иш vp. v. претакам
II преточ|а, -иш vp. v. II преточвам
I преточвам v. претакам
II преточва|м, -ш vi. ostrzyć nadmiernie
претрайва|м, -ш vi. v. претрая
претра|я, -еш vp. lud. przetrwać
претрива|м, -ш vi. przecierać, przecinać, przepiłowywać; ~м праговете wycieram (obijam) cudze progi
претри|я, -еш vp. v. претривам

52

Bulgarian–Polish dictionary (after first round of markup)

[b]претовар|я, -иш[/b] [i]vp.[/i] v. [b]претоварям[/b]
[b]претоп|я, -иш[/b] [i]vp.[/i] v. [b]претапям, претопявам[/b]
[b]претопява|м, -ш[/b] [i]vi.[/i] przetapiać; [i]przen.[/i] asymilować
[b]претор, -и[/b] [i]m[/i] [i]hist.[/i] [b]pretor[/b] [i]m[/i]
[b]преториан|ец, -ци[/b] [i]m[/i] pretorianin [i]m[/i]
[b]преториански[/b] [i]adi.[/i] pretoriański
[b]I преточ|а, -иш[/b] [i]vp.[/i] v. [b]претакам[/b]
[b]II преточ|а, -иш[/b] [i]vp.[/i] v. [b]II преточвам[/b]
[b]I преточвам[/b] v. [b]претакам[/b]
[b]II преточва|м, -ш[/b] [i]vi.[/i] [b]ostrzyć nadmiernie[/b]
[b]претрайва|м, -ш[/b] [i]vi.[/i] v. [b]претрая[/b]
[b]претра|я, -еш[/b] [i]vp.[/i] [i]lud.[/i] przetrwać
[b]претрива|м, -ш[/b] [i]vi.[/i] przecierać, przecinać, przepiłowywać; [b]~м праговете[/b] wycieram (obijam) cudze progi
[b]претри|я, -еш[/b] [i]vp.[/i] v. [b]претривам[/b]
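Once bold and italic are marked, a next step could be to turn each line into a record. The sketch below is one possible approach, not the project's actual pipeline: it treats the first bold span as the headword, collects italic spans as grammatical and usage labels, and keeps the remaining text with the tags stripped. The function name and the field names are hypothetical.

```python
import re

# A [b]...[/b] or [i]...[/i] span; the backreference matches the closing tag.
TAG = re.compile(r"\[(b|i)\](.*?)\[/\1\]")
HEAD = re.compile(r"\[b\](.*?)\[/b\]\s*")


def parse_entry(line):
    """Split one marked-up dictionary line into headword, labels and body."""
    head = HEAD.match(line)
    if head is None:
        raise ValueError("entry does not start with a bold headword")
    rest = line[head.end():]
    # Italic spans carry the grammatical and usage labels.
    labels = [text for kind, text in TAG.findall(rest) if kind == "i"]
    # Body: everything after the headword, with the tags themselves removed.
    body = re.sub(r"\[/?[bi]\]", "", rest)
    return {"headword": head.group(1), "labels": labels, "body": body}


record = parse_entry(
    "[b]претопява|м, -ш[/b] [i]vi.[/i] przetapiać; [i]przen.[/i] asymilować"
)
```

Lines where the first round of markup is inconsistent (for instance a translation set in bold) would need manual review before such a conversion is trusted.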

53

Adding procedurality?

погазва|м, -ш vi. deptać, brodzić (trochę)
погор|я, -иш vp. popalić się (trochę, krótko); […]
погъделичква|м, -ш vi. łaskotać, łechtać (trochę, lekko)
погълта|м, -ш vp. łyknąć trochę
погърмява|м, -ш vi. pogrzmiewać, grzmieć od czasu do czasu, […]
подадва|м, -ш vi. lud. dawać po trochę, od czasu do czasu

54

Polyprefixation

позагаз|я, -иш vp. zabrnąć, wpaść w ciężkie położenie (trochę)

позагатн|а, -еш vp. napomknąć, wspomnieć mimochodem

позагледа|м, -ш vp. spoglądnąć, spojrzeć, popatrzyć (trochę, od czasu do czasu)

понатежава|м, -ш vi. stawać się trochę cięższym, ciążyć trochę

понатисн|а, -еш vp. nacisnąć, przycisnąć trochę

понатовар|я, -иш vp. naładować trochę, obciążyć, obarczyć trochę

55

Adding procedurality? (continued)

претъркаля|м, -ш vp. przetoczyć, przesunąć tocząc

Likewise perhaps:
• evaluatives
• words for females
• abstract nouns
• … and other productive derivatives
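What "procedurality" might look like can be sketched for the attenuative prefix по-: instead of listing every derivative, a rule looks up the base verb and adds trochę ‘a little’ to each Polish equivalent. This is a rough illustration only. The base entries are assumptions inferred from the derived entries shown above, the rule ignores stem changes and lexicalised meanings, and `BASE` and `translate` are hypothetical names.

```python
# Assumed base entries (inferred from the derived entries on the slides).
BASE = {
    "натисна": ["nacisnąć", "przycisnąć"],
    "натоваря": ["naładować", "obciążyć", "obarczyć"],
}

PREFIX = "по"
ATTENUATIVE = "trochę"  # Polish: 'a little'


def translate(word):
    """Look a word up directly, else try the productive по- derivation."""
    if word in BASE:
        return list(BASE[word])
    if word.startswith(PREFIX) and word[len(PREFIX):] in BASE:
        base = word[len(PREFIX):]
        return [f"{equivalent} {ATTENUATIVE}" for equivalent in BASE[base]]
    return []
```

For понатисна the rule yields "nacisnąć trochę" and "przycisnąć trochę", close to the hand-written entry "nacisnąć, przycisnąć trochę"; a listed entry should still take precedence wherever the derivative has a meaning of its own.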

56

Applications of the electronic LDB

• lexicography:
  – creation of electronic bilingual dictionaries for research and teaching
  – specialised reference works, e.g., valency dictionaries

• education: training skills of independent investigation with the help of the computer