6th Intex Workshop, Sofia 28-30 May 2003


6th Intex Workshop & 10 years of (Silberztein, 1993)

Sofia, 28-30 May 2003


Conversion between Intex and MULTEXT-East Morphosyntactic Descriptions

Cvetana Krstev, Duško Vitas

University of Belgrade

Tomaž Erjavec

Jožef Stefan Institute, Ljubljana


Motivation

general
• use of different tools
• use of multilingual resources
• comparison of results in NLP

specific
• inclusion of Serbian language in MULTEXT-East
• specification and production of Slovenian Intex resources
• production of tagged Serbian translation of Orwell's 1984


MULTEXT-East morphosyntactic specification

aim:

exhaustive description of the morphological and morphosyntactic features of different languages, and establishment of unique codes for common features

scope:

English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Croatian (Concede), and Serbian


14 MULTEXT-East types or PoS; new types cannot be introduced

Nouns (N) Verbs (V) Adjectives (A) Pronouns (P) Determiners (D) Adpositions (S) Conjunctions (C)

Numerals (M) Interjections (I) Abbreviations (Y) Particles (Q) Adverbs (R) Articles (T) Residuals (X)


Type attributes

Each type has a set of attributes that are appropriate to it

Each attribute of a type has its own fixed position in the MSD

It is not recommended to add new attributes to a type


Attribute values

• a set of values is assigned to each attribute
• each value is coded by one alphanumeric character
• new values can be added to the attributes, if necessary



Adjective attribute values/1

Adjective (A), 13 positions

P  ATT     VAL            C  x per language using the value (EN RO SL CS BG ET HU HR SRP)
=  ======  =============  =  ================
1  Type    qualificative  f  x x x x x x x
           indefinite     i
           possessive     s  x x x x
           ordinal        o  x x
-  ------  -------------  -  ----------------
2  Degree  positive       p  x x x x x x x x
           comparative    c  x x x x x x x x
           superlative    s  x x x x x x x x
           elative        e  x x
-  ------  -------------  -  ----------------


Adjective attribute values/2

P  ATT     VAL         C  x per language using the value (EN RO SL CS BG ET HU HR SRP)
=  ======  ==========  =  ================
3  Gender  masculine   m  x x x x x x
           feminine    f  x x x x x x
           neuter      n  x x x x x x
-  ------  ----------  -  ----------------
4  Number  singular    s  x x x x x x x x
           plural      p  x x x x x x x x
           dual        d  x x
           paucal      c  x
-  ------  ----------  -  ----------------
5  Case    nominative  n  x x x x x x
           genitive    g  x x x x x x
           dative      d  x x x x x
           accusative  a  x x x x x

... (various more values) ...


Adjective attribute values/3

P  ATT           VAL        C  x per language using the value (EN RO SL CS BG ET HU HR SR)
=  ============  =========  =  ================
6  Definiteness  no         n  x x x x x
                 yes        y  x x x x x
                 short_art  s  x
                 full_art   f  x
-  ------------  ---------  -  ----------------
7  Clitic        no         n  x
                 yes        y  x
-  ------------  ---------  -  ----------------
8  Animate       no         n  x x x x x
                 yes        y  x x x x x
-  ------------  ---------  -  ----------------
9  Formation     nominal    n  x
                 compound   c  x
-  ------------  ---------  -  ----------------
... various Hungarian specific attributes ...


An example from the Slovenian MULTEXT-East dictionary

čistejši čist Afcfda

The word form čistejši belongs to the lemma čist (Engl. clean); the MSD describes it as a qualificative (f) adjective (A) in the comparative degree (c), feminine gender (f), dual number (d), and accusative case (a).

čistejši čist Afcmsa--n

The same word form čistejši of the lemma čist (Engl. clean) is here described as a qualificative (f) adjective (A) in the comparative degree (c), masculine gender (m), singular (s), accusative case (a), and not animate (n).
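The positional reading of these two MSDs can be sketched in Python. This is only an illustration: it covers the first eight adjective positions and just the values listed in the adjective tables of these slides, and the names `ADJ_POSITIONS`, `ADJ_VALUES` and `decode_adjective_msd` are hypothetical, not part of MULTEXT-East or Intex.

```python
# Adjective attributes in MSD order (positions 1-8 only; an assumption of this sketch).
ADJ_POSITIONS = ["Type", "Degree", "Gender", "Number", "Case",
                 "Definiteness", "Clitic", "Animate"]

# Only the values shown on the slides; the Case list is incomplete there.
ADJ_VALUES = {
    "Type": {"f": "qualificative", "i": "indefinite", "s": "possessive", "o": "ordinal"},
    "Degree": {"p": "positive", "c": "comparative", "s": "superlative", "e": "elative"},
    "Gender": {"m": "masculine", "f": "feminine", "n": "neuter"},
    "Number": {"s": "singular", "p": "plural", "d": "dual", "c": "paucal"},
    "Case": {"n": "nominative", "g": "genitive", "d": "dative", "a": "accusative"},
    "Definiteness": {"n": "no", "y": "yes", "s": "short_art", "f": "full_art"},
    "Clitic": {"n": "no", "y": "yes"},
    "Animate": {"n": "no", "y": "yes"},
}


def decode_adjective_msd(msd):
    """Turn an adjective MSD such as 'Afcfda' into an attribute -> value dict."""
    if not msd or msd[0] != "A":
        raise ValueError("not an adjective MSD: %r" % msd)
    decoded = {}
    # zip() stops at the shorter sequence, so short MSDs are handled naturally.
    for attribute, code in zip(ADJ_POSITIONS, msd[1:]):
        if code == "-":  # a hyphen means the attribute is not specified
            continue
        decoded[attribute] = ADJ_VALUES[attribute].get(code, "unknown code " + code)
    return decoded


print(decode_adjective_msd("Afcfda"))
print(decode_adjective_msd("Afcmsa--n"))
```

Because each attribute is identified only by its position, a missing value has to be written as a hyphen rather than dropped, which is why the second MSD contains `--` before the animacy code.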


The first sentence of the Slovene translation of Orwell's 1984, tagged

<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w>
<c>,</c>
<w lemma="mrzel" ana="Afpmsnn">mrzel</w>
<w lemma="aprilski" ana="Aopmsn">aprilski</w>
<w lemma="dan" ana="Ncmsn">dan</w>
<w lemma="in" ana="Ccs">in</w>
<w lemma="ura" ana="Ncfpn">ure</w>
<w lemma="biti" ana="Vcip3p--n">so</w>
<w lemma="biti" ana="Vmps-pfa">bile</w>
<w lemma="trinajst" ana="Mcnpnl">trinajst</w>
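Word elements of this shape can be read back into (form, lemma, MSD) triples with a few lines of Python. The sketch below uses a regular expression as a shortcut and assumes the attribute order `lemma`, `ana` seen in the sentence; for a complete corpus file an XML parser would be the safer choice. The function name `read_tagged_words` is hypothetical.

```python
import re

# Matches one <w> element with the attribute order used on the slide.
W_ELEMENT = re.compile(r'<w lemma="([^"]*)" ana="([^"]*)">([^<]*)</w>')


def read_tagged_words(text):
    """Return (form, lemma, msd) triples; punctuation in <c> elements is skipped."""
    return [(form, lemma, ana) for lemma, ana, form in W_ELEMENT.findall(text)]


sample = ('<w lemma="biti" ana="Vcps-sma">Bil</w>'
          '<w lemma="biti" ana="Vcip3s--n">je</w>'
          '<w lemma="jasen" ana="Afpmsnn">jasen</w><c>,</c>')

for form, lemma, msd in read_tagged_words(sample):
    print(form, lemma, msd)
```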


Intex MSD for Serbian

one DELAS entry

cyist,A17

one of its corresponding DELAF entries

cyistiji,cyist.A17:bems1g:bems4q:bems5g:bemp1g:bemp5g

produced by the regular expression A17.exp

..............
ijemu/:bems3g:bems7g:bens3g:bens7g +
iji/:bems1g:bems4q:bems5g:bemp1g:bemp5g +
o/:aens1g:aens4g:aens5g +
..............
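The structure of a DELAF line (form, lemma, inflectional class, optional `+` marks, and a list of inflection codes separated by colons) can be made explicit with a small parser. This is a minimal sketch, not part of Intex; `parse_delaf_entry` is a hypothetical name, and reading an empty lemma field as "same as the form" is an assumption based on entries such as `vedar,.A18` in the tagged text later in these slides.

```python
def parse_delaf_entry(entry):
    """Split 'form,lemma.Class+Mark:code:code' into its parts."""
    form, rest = entry.split(",", 1)
    lemma, rest = rest.split(".", 1)
    head, *inflections = rest.split(":")   # codes after the first colon
    infl_class, *marks = head.split("+")   # syntactic/semantic marks, if any
    return {
        "form": form,
        "lemma": lemma or form,            # assumption: empty lemma = same as form
        "class": infl_class,
        "marks": marks,
        "inflections": inflections,
    }


entry = parse_delaf_entry("cyistiji,cyist.A17:bems1g:bems4q:bems5g:bemp1g:bemp5g")
print(entry["lemma"], entry["class"], entry["inflections"])
```

The five inflection codes on one line show why such entries are ambiguous: the single form cyistiji stands for five different combinations of case, number and animacy.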


Attributes and their values for Serbian adjectives in DELAS/DELAF

Attribute     Value           Code   Attribute  Value                           Code
degree        positive        a      case       nominative                      1
              comparative     b                 genitive                        2
              superlative     c                 dative                          3
definiteness  no              k                 accusative                      4
              yes             d                 vocative                        5
              not applicable  e                 instrumental                    6
gender        masculine       m                 locative                        7
              feminine        f      animate    yes                             v
              neuter          n                 no                              q
number        singular        s                 not applicable (not important)  g
              plural          p
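With this table, a six-character DELAF adjective code such as `bems1g` can be spelled out mechanically. The sketch below assumes the field order degree, definiteness, gender, number, case, animate, which is inferred from the codes shown in the DELAF entry and is not stated explicitly on the slides; the names `INTEX_ADJ_FIELDS` and `decode_intex_adjective` are hypothetical.

```python
# (attribute, {code: value}) in the order the characters are assumed to appear.
INTEX_ADJ_FIELDS = [
    ("degree", {"a": "positive", "b": "comparative", "c": "superlative"}),
    ("definiteness", {"k": "no", "d": "yes", "e": "not applicable"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "p": "plural"}),
    ("case", {"1": "nominative", "2": "genitive", "3": "dative", "4": "accusative",
              "5": "vocative", "6": "instrumental", "7": "locative"}),
    ("animate", {"v": "yes", "q": "no", "g": "not applicable"}),
]


def decode_intex_adjective(code):
    """Decode a code such as 'bems1g' into an attribute -> value dict."""
    if len(code) != len(INTEX_ADJ_FIELDS):
        raise ValueError("expected %d characters: %r" % (len(INTEX_ADJ_FIELDS), code))
    decoded = {}
    for (attribute, values), char in zip(INTEX_ADJ_FIELDS, code):
        if char not in values:
            raise ValueError("unexpected %s code %r in %r" % (attribute, char, code))
        decoded[attribute] = values[char]
    return decoded


print(decode_intex_adjective("bems1g"))
```

Unlike the MULTEXT-East MSD, these codes use disjoint letters and digits for each attribute, so a value such as `e` or `g` is needed to say "not applicable" instead of a positional hyphen.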


Syntactic and semantic marks in Serbian DELAS

category      tag         applied to                explanation               example
syntactic     +p2         prepositions              noun is in genitive       bez,PREP+p2
              +Ref        verbs                     reflexive                 dicyiti,V551+Imperf+It+Ref
              +MG         nouns                     masculine natural gender  budala,N601+Hum+MG+FG
derivational  +VN         nouns                     verbal noun               kiselxenxe,N300+VN
              +Adj        adverbs                   derived from adjectives   fanaticyno,ADV+Adj
              +DerOvaIra  verbs, nouns, adjectives  derivational variant      dezinfikovati,V18+Imperf+...+DerOvaIra
semantic      +Col        adjectives                colors                    zelenkastosiv,A6+Col
              +Hum        nouns                     human                     lxubavnica,N601+Hum
              +Mat        adjectives                material                  kozxnat,A6+Mat
dialectal     +Ek         all                       Ekavian                   nedelxa,N600+Ek
              +Cr         all                       Croatism                  izopcxen,A1+PP+Cr


Problems of correspondence between MULTEXT-East MSD and Intex/1

The necessity to impose the existing coding scheme on a particular language

Example: How to encode the present and past active gerunds?

In Serbian, for the verb ići (Engl. to go) those gerunds are idući and išavši

The verb tables of the MULTEXT-East specification contain attributes that describe them. However, no Slavic language except Bulgarian uses them.


Problems/2

a common encoding scheme does not guarantee that true standardization is achieved

Example:

only in Bulgarian do we find the attribute value 'adjectival' for adverbs (with the examples 'umno, veselo, studeno'), although the other Slavic languages, at least, could also make use of that value of the attribute Type.


Problems/3

Encoding of verb tenses

P  ATT    VAL            C  x per language using the value (EN RO SL CS BG ET HU HR SRP)
=  =====  =============  =  ==================
2  VForm  indicative     i  x x x x x x x x x
          subjunctive    s  x
          imperative     m  x x x x x x x x
          conditional    c  x x x x x x x
          infinitive     n  x x x x x x x x
          participle     p  x x x x x x x x
          gerund         g  x x x
          supine         u  x x
          transgressive  t  x
          quotative      q  x
-  -----  -------------  -  ------------------
3  Tense  present        p  x x x x x x x x x
          imperfect      i  x x x x x
          future         f  x x x x
          past           s  x x x x x x x x x
          pluperfect     l  x x x
          aorist         a  x x x


Problems/3

The second attribute specifies the verb form, and the third the tense. However, because of compound tenses, some verb forms are used to construct several different tenses. In Slovenian, the verb form imel is the past participle of the verb imeti (Engl. to have). It produces the perfect tense when used with the present indicative of the copula verb biti (Engl. to be), and the conditional when used with the conditional form of the same copula verb.


Problems/3

<w lemma="Winston" ana="Npmsn">Winston</w>

<w lemma="Smith" ana="Npmsn">Smith</w>

<w lemma="biti" ana="Vcip3s--n">je</w>

<w lemma="imeti" ana="Vmps-sma">imel</w>

..........................................

<w lemma="da" ana="Css">da</w>

<w lemma="biti" ana="Vcc">bi</w>

<w lemma="on" ana="Pp3msa--y-n">ga</w>

<w lemma="imeti" ana="Vmps-sma">imel</w>
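In both fragments imel carries the same MSD, Vmps-sma, so the tense of the whole construction can only be read off the auxiliary. The Python sketch below illustrates that point with a deliberately naive heuristic: it pairs each participle with the nearest preceding form of biti and looks at the auxiliary's VForm and Tense positions. The function names and the "look back only" rule are assumptions of this sketch; real word order is freer (in "Bil je" the auxiliary follows the participle).

```python
def vform(msd):
    """VForm is the second attribute of a verb MSD (index 2 of the string)."""
    return msd[2] if len(msd) > 2 else "-"


def tense(msd):
    """Tense is the third attribute of a verb MSD (index 3 of the string)."""
    return msd[3] if len(msd) > 3 else "-"


def classify_participles(tokens):
    """tokens: (form, lemma, msd) triples. Returns (form, construction) pairs."""
    results = []
    last_aux = None
    for form, lemma, msd in tokens:
        if lemma == "biti" and msd.startswith("V") and vform(msd) in ("i", "c"):
            last_aux = msd                      # remember the finite auxiliary
        elif msd.startswith("V") and vform(msd) == "p":
            if last_aux is None:
                label = "unresolved"
            elif vform(last_aux) == "c":
                label = "conditional"
            elif vform(last_aux) == "i" and tense(last_aux) == "p":
                label = "perfect"
            else:
                label = "unresolved"
            results.append((form, label))
            last_aux = None                     # each auxiliary is used once
    return results


tokens = [
    ("Winston", "Winston", "Npmsn"), ("Smith", "Smith", "Npmsn"),
    ("je", "biti", "Vcip3s--n"), ("imel", "imeti", "Vmps-sma"),
    ("da", "da", "Css"), ("bi", "biti", "Vcc"),
    ("ga", "on", "Pp3msa--y-n"), ("imel", "imeti", "Vmps-sma"),
]
print(classify_participles(tokens))
```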


Problems/4

different interpretations of various grammatical categories across languages and the lack of a clear cross-linguistic correspondence are discussed in Przepiórkowski (EACL 2003), for example dual number in Slovene and paucal in Serbian.

certain morphosyntactic phenomena have not been taken into consideration, such as various problems of agreement (Vitas, Krstev, to appear).


Application of MSD-Intex mapping to Serbian 1984

{S}{Bio,biti.V77:Gsm}
({je,jesam.V575+Imperf+It+Iref+Aux:Pzsi} + {je,on.PRO+Prs:sz2fi:sz4fi})
{vedar,.A18:akms1g:akms4q}
({i,.CONJ} + {i,.PAR})
{hladan,.A18:akms1g:akms4q}
{aprilski,.A2+PosQ:adms1g:aems4q:aems5g:aemp1g:aemp5g}
({dan,.A1+PP:akms1g:aems4q} + {dan,dati.V103+Perf+Tr+Iref+Ref:Tms});
{S} ({na,.PREP+p4} + {na,.PREP+p7})
{cyasovnicima,.?}
({je,jesam.V575+Imperf+It+Iref+Aux:Pzsi} + {je,on.PRO+Prs:sz2fi:sz4fi})
{izbijalo,izbijati.V101+Perf+Tr+It+Iref:Gsn}
{trinaest,.?}.


Tool that facilitates the lemmatization and disambiguation


Tagged Serbian translation of 1984 after hand disambiguation and resolution of unknown words

{S}{Bio,biti.V77:Gsm}
{je,jesam.V575+Imperf+It+Iref+Aux:Pzsi}
{vedar,.A18:akms1g}
{i,.CONJ}
{hladan,.A18:akms1g}
{aprilski,.A2+PosQ:adms1g}
{dan,.N1:ms1q};
{S} {na,.PREP+p7}
{cyasovnicima,cyasovnik.N5:mp7q}
{je,jesam.V575+Imperf+It+Iref+Aux:Pzsi}
{izbijalo,izbijati.V101+Perf+Tr+It+Iref:Gsn}
{trinaest,.Num+Car}.
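Once every token carries exactly one analysis, the text can be read into (form, lemma, codes) triples before any conversion. The following Python sketch shows one way to do that with a regular expression; the name `read_intex_tokens` is hypothetical, the sentence delimiter `{S}` is simply skipped, and an empty lemma field is again assumed to mean "same as the form".

```python
import re

# {form,lemma.codes}; {S} has no comma and therefore never matches.
TOKEN = re.compile(r"\{([^{},]+),([^{}.]*)\.([^{}]*)\}")


def read_intex_tokens(text):
    """Return (form, lemma, codes) triples from disambiguated Intex-tagged text."""
    return [(form, lemma or form, codes)
            for form, lemma, codes in TOKEN.findall(text)]


text = ("{S}{Bio,biti.V77:Gsm}{je,jesam.V575+Imperf+It+Iref+Aux:Pzsi}"
        "{vedar,.A18:akms1g}{i,.CONJ}")

for form, lemma, codes in read_intex_tokens(text):
    print(form, lemma, codes)
```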


A simple Perl script maps Serbian Intex codes to MULTEXT-East MSD

    if (($POS eq "V") && ($kategorije !~ /[XS]/)) {    # it is a verb
        $glagol = "V" . "---------------";             # empty MSD: V plus 15 hyphens
        if ($semkat =~ /Aux/) {                        # type, attribute 1
            substr($glagol,1,1) = "a";
        } else {
            substr($glagol,1,1) = "m";
        }
        if ($kategorije =~ /([WYGTIFA])/ ) {           # form, attribute 2
            substr($glagol,2,1) = $1;
        }
        $glagol =~ tr/WYGTIFA/nmppiii/;                # Intex form code -> MSD VForm code
        if ( ($lema eq "biti") && ($kategorije =~ /A/) ) {
            substr($glagol,2,1) = "c";                 # this form of biti is coded as conditional
        }
        if ($kategorije =~ /([PIFAGY])/) {             # tense, attribute 3
            substr($glagol,3,1) = $1;
        }
        $glagol =~ tr/PIFAGY/pofasp/;                  # Intex tense code -> MSD Tense code
        if ($kategorije =~ /([xyz])/) {                # person, attribute 4
            substr($glagol,4,1) = $1;
        }
        $glagol =~ tr/xyz/123/;
        ........


Tagged Serbian 1984 using MULTEXT-East MSD

<w lemma="biti" ana="Vmps-sman-n---p">Bio</w>

<w lemma="jesam" ana="Va-p3s-an-y---p">je</w>

<w lemma="vedar" ana="Afpms1n">vedar</w>

<w lemma="i" ana="Ccs">i</w>

<w lemma="hladan" ana="Afpms1n">hladan</w>

<w lemma="aprilski" ana="Aopms1y">aprilski</w>

<w lemma="dan" ana="Ncmsn--n">dan</w>

<w lemma="na" ana="Sps-">na</w>

<w lemma="cyasovnik" ana="Ncmpl--n">cyasovnicima</w>

<w lemma="jesam" ana="Va-p3s-an-y---p">je</w>

<w lemma="izbijati" ana="Vmps-snan-n---e">izbijalo</w>

<w lemma="trinaest" ana="Mc---l">trinaest</w>
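The adjective tags in this output (Afpms1n for vedar and hladan, Aopms1y for aprilski) can be reproduced from their Intex codes akms1g and adms1g by a positional mapping. The Python sketch below is a reconstruction for illustration only and is not the authors' script: the treatment of the `+PosQ` mark as Type `o`, the default Type `f`, the mapping of "not applicable" to a hyphen, and the trimming of trailing hyphens are all assumptions inferred from these three examples.

```python
DEGREE = {"a": "p", "b": "c", "c": "s"}         # positive, comparative, superlative
DEFINITENESS = {"k": "n", "d": "y", "e": "-"}   # 'e' (not applicable) -> unspecified
ANIMATE = {"v": "y", "q": "n", "g": "-"}        # 'g' (not applicable) -> unspecified


def intex_adj_to_msd(code, marks=()):
    """Map an Intex adjective code such as 'akms1g' to a MULTEXT-East-style MSD."""
    if len(code) != 6:
        raise ValueError("expected a six-character adjective code: %r" % code)
    degree, definiteness, gender, number, case, animate = code
    adj_type = "o" if "PosQ" in marks else "f"  # assumption, see lead-in
    msd = ("A" + adj_type + DEGREE[degree] + gender + number + case
           + DEFINITENESS[definiteness]
           + "-"                                # Clitic: not used for Serbian here
           + ANIMATE[animate])
    return msd.rstrip("-")                      # drop unspecified trailing positions


print(intex_adj_to_msd("akms1g"))               # vedar, hladan
print(intex_adj_to_msd("adms1g", ("PosQ",)))    # aprilski
```

The mapping only runs in this direction without loss: the inflectional class (A18, A2) is not part of the MSD and cannot be recovered from it.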


Conclusion

It is possible to convert from Intex to MULTEXT-East

It is possible to convert from MULTEXT-East to Intex to a certain extent. Some information cannot be recovered, such as the inflectional class code


Noun attributes

1. Type
2. Gender
3. Number
4. Case
5. Definiteness
6. Clitic
7. Animate
8. Owner_Number
9. Owner_Person
10. Owned_Number


Verb Attributes

1. Type
2. VForm
3. Tense
4. Person
5. Number
6. Gender
7. Voice
8. Negative
9. Definiteness
10. Clitic
11. Case
12. Animate
13. Clitic_s
14. Aspect


Adjective attributes

1. Type
2. Degree
3. Gender
4. Number
5. Case
6. Definiteness
7. Clitic
8. Animate
9. Formation
10. Owner_Number
11. Owner_Person
12. Owned_Number


Adverb attributes

1. Type

2. Degree

3. Clitic

4. Number

5. Person

6. Wh_Type



Values of the attribute VForm of the type Verb

indicative (i) subjunctive (s) imperative (m) conditional (c) infinitive (n)

participle (p) gerund (g) supine (u) transgressive (t) quotative (q)


Values of the attribute Tense of the type Verb

present (p) imperfect (i) future (f) past (s) pluperfect (l) aorist (a)