6th Intex Workshop, Sofia 28-30 May 2003
6th Intex Workshop & 10 years of (Silberztein, 1993)
Sofia, 28-30 May 2003
Conversion between Intex and MULTEXT-East Morphosyntactic Descriptions
Cvetana Krstev, Duško Vitas
University of Belgrade
Tomaž Erjavec
Jožef Stefan Institute, Ljubljana
Motivation
general
• use of different tools
• use of multilingual resources
• comparison of results in NLP
specific
• inclusion of Serbian language in MULTEXT-East
• specification and production of Slovenian Intex resources
• production of tagged Serbian translation of Orwell's 1984
MULTEXT-East morphosyntactic specification
aim:
exhaustive description of the morphological and morphosyntactic features of different languages, and establishment of unique codes for common features
scope:
English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Croatian (Concede), and Serbian
14 MULTEXT-East types (PoS); new types cannot be introduced
Nouns (N) Verbs (V) Adjectives (A) Pronouns (P) Determiners (D) Adpositions (S) Conjunctions (C)
Numerals (M) Interjections (I) Abbreviations (Y) Particles (Q) Adverbs (R) Articles (T) Residuals (X)
Type attributes
Each type has a set of attributes that are appropriate to it
Each attribute of a type has a fixed position in the MSD
It is not recommended to add new attributes to a type
Attribute values
A set of values is assigned to each attribute
Each value is coded by one alphanumeric character
New values can be added to the attributes, if necessary
Adjective attribute values/1
Adjective (A), 13 positions
= ============== ============== = EN RO SL CS BG ET HU HR SR
P ATT            VAL            C x x x x x x x x x
= ============== ============== =
1 Type           qualificative  f x x x x x x x
                 indefinite     i
                 possessive     s x x x x
                 ordinal        o x x
- -------------- -------------- -
2 Degree         positive       p x x x x x x x x
                 comparative    c x x x x x x x x
                 superlative    s x x x x x x x x
                 elative        e x x
- -------------- -------------- -
Adjective attribute values/2
= ============== ============== = EN RO SL CS BG ET HU HR SR
P ATT            VAL            C x x x x x x x x x
= ============== ============== =
3 Gender         masculine      m x x x x x x
                 feminine       f x x x x x x
                 neuter         n x x x x x x
- -------------- -------------- -
4 Number         singular       s x x x x x x x x
                 plural         p x x x x x x x x
                 dual           d x x
                 paucal         c x
- -------------- -------------- -
5 Case           nominative     n x x x x x x
                 genitive       g x x x x x x
                 dative         d x x x x x
                 accusative     a x x x x x
... (various more values) ...
Adjective attribute values/3
6 Definiteness   no             n x x x x x
                 yes            y x x x x x
                 short_art      s x
                 full_art       f x
- -------------- -------------- -
7 Clitic         no             n x
                 yes            y x
- -------------- -------------- -
8 Animate        no             n x x x x x
                 yes            y x x x x x
- -------------- -------------- -
9 Formation      nominal        n x
                 compound       c x
- -------------- -------------- -
... various Hungarian specific attributes ...
= ============== ============== = EN RO SL CS BG ET HU HR SR
An example from the Slovenian MULTEXT-East dictionary
čistejši čist Afcfda
The word form čistejši belongs to the lemma čist (Engl. clean); it is a qualificative (f) adjective (A) in the comparative degree (c), feminine gender (f), dual number (d), and accusative case (a).
čistejši čist Afcmsa--n
The same word form čistejši of the lemma čist is here a qualificative (f) adjective (A) in the comparative degree (c), masculine gender (m), singular number (s), accusative case (a), and not animate (n).
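The positional reading used in these two dictionary entries can be sketched in a few lines of Python. This is only an illustration: the function name and the dictionaries are hypothetical, the value codes are copied from the adjective tables on the slides, and only the first eight adjective positions are covered.

```python
# Adjective attributes in positional order (positions 1-8 only; an assumption
# made for brevity, the specification lists 13 positions for adjectives).
ADJ_ATTRIBUTES = ["Type", "Degree", "Gender", "Number", "Case",
                  "Definiteness", "Clitic", "Animate"]

# Value codes copied from the adjective tables; values elided there are missing here.
ADJ_VALUES = {
    "Type": {"f": "qualificative", "i": "indefinite", "s": "possessive", "o": "ordinal"},
    "Degree": {"p": "positive", "c": "comparative", "s": "superlative", "e": "elative"},
    "Gender": {"m": "masculine", "f": "feminine", "n": "neuter"},
    "Number": {"s": "singular", "p": "plural", "d": "dual", "c": "paucal"},
    "Case": {"n": "nominative", "g": "genitive", "d": "dative", "a": "accusative"},
    "Definiteness": {"n": "no", "y": "yes", "s": "short_art", "f": "full_art"},
    "Clitic": {"n": "no", "y": "yes"},
    "Animate": {"n": "no", "y": "yes"},
}


def decode_adjective_msd(msd):
    """Turn an adjective MSD such as 'Afcmsa--n' into attribute/value pairs."""
    if not msd.startswith("A"):
        raise ValueError("not an adjective MSD: " + msd)
    decoded = {}
    for name, code in zip(ADJ_ATTRIBUTES, msd[1:]):
        if code == "-":  # '-' marks an attribute that does not apply
            continue
        # Unknown codes are passed through unchanged rather than guessed.
        decoded[name] = ADJ_VALUES[name].get(code, code)
    return decoded


print(decode_adjective_msd("Afcfda"))
print(decode_adjective_msd("Afcmsa--n"))
```

Trailing attributes that are absent from a shorter MSD (as in Afcfda) are simply left out of the result.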
The first sentence of the Slovene translation of Orwell's 1984, tagged
<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w>
<c>,</c>
<w lemma="mrzel" ana="Afpmsnn">mrzel</w>
<w lemma="aprilski" ana="Aopmsn">aprilski</w>
<w lemma="dan" ana="Ncmsn">dan</w>
<w lemma="in" ana="Ccs">in</w>
<w lemma="ura" ana="Ncfpn">ure</w>
<w lemma="biti" ana="Vcip3p--n">so</w>
<w lemma="biti" ana="Vmps-pfa">bile</w>
<w lemma="trinajst" ana="Mcnpnl">trinajst</w>
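A minimal sketch of how such a tagged fragment could be read back into (word form, lemma, MSD) triples with Python's standard library. The wrapping `<s>` element and the function name are assumptions made here so that the fragment parses as one XML document; they are not part of the corpus markup shown on the slide.

```python
import xml.etree.ElementTree as ET


def read_tagged_words(fragment):
    """Return (form, lemma, msd) for every <w> element in the fragment."""
    # The fragment has no single root element, so a hypothetical <s> wrapper is added.
    root = ET.fromstring("<s>" + fragment + "</s>")
    return [(w.text, w.get("lemma"), w.get("ana")) for w in root.iter("w")]


sample = (
    '<w lemma="biti" ana="Vcps-sma">Bil</w>'
    '<w lemma="biti" ana="Vcip3s--n">je</w>'
    '<w lemma="jasen" ana="Afpmsnn">jasen</w>'
    '<c>,</c>'
)

for form, lemma, msd in read_tagged_words(sample):
    print(form, lemma, msd)
```

Punctuation elements (`<c>`) are skipped because only `<w>` elements carry a lemma and an MSD.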
Intex MSD for Serbian
one DELAS entry
cyist,A17
one of its corresponding DELAF entries
cyistiji,cyist.A17:bems1g:bems4q:bems5g:bemp1g:bemp5g
produced by the regular expression A17.exp
..............
ijemu/:bems3g:bems7g:bens3g:bens7g +
iji/:bems1g:bems4q:bems5g:bemp1g:bemp5g +
o/:aens1g:aens4g:aens5g +
..............
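The structure of a DELAF line (inflected form, lemma, inflectional class, and a list of inflectional codes) can be made explicit with a small parser. This is a sketch with a hypothetical function name; it assumes that neither the form nor the lemma contains a comma or a period, and that an empty lemma means the lemma equals the form.

```python
def parse_delaf(line):
    """Split a DELAF entry 'form,lemma.Class+Mark:code:code' into its parts."""
    form, rest = line.split(",", 1)
    lemma, rest = rest.split(".", 1)
    head, *codes = rest.split(":")        # inflectional codes follow each ':'
    inflection_class, *marks = head.split("+")
    return {
        "form": form,
        "lemma": lemma or form,           # empty lemma: same as the form
        "class": inflection_class,
        "marks": marks,
        "codes": codes,
    }


entry = parse_delaf("cyistiji,cyist.A17:bems1g:bems4q:bems5g:bemp1g:bemp5g")
print(entry["lemma"], entry["class"], entry["codes"])
```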
Attributes and their values for Serbian adjectives in DELAS/DELAF
Attribute      Value                            Code
degree         positive                         a
               comparative                      b
               superlative                      c
definiteness   no                               k
               yes                              d
               not applicable                   e
gender         masculine                        m
               feminine                         f
               neuter                           n
number         singular                         s
               plural                           p
case           nominative                       1
               genitive                         2
               dative                           3
               accusative                       4
               vocative                         5
               instrumental                     6
               locative                         7
animate        yes                              v
               no                               q
               not applicable (not important)   g
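As an illustration, a six-character Intex code for a Serbian adjective can be decoded with the table above. The order of the positions (degree, definiteness, gender, number, case, animate) is an assumption inferred from example codes such as bems1g; the names used in the code are hypothetical.

```python
# (attribute, {code: value}) in the positional order assumed from 'bems1g'.
INTEX_ADJ_POSITIONS = [
    ("degree", {"a": "positive", "b": "comparative", "c": "superlative"}),
    ("definiteness", {"k": "no", "d": "yes", "e": "not applicable"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "p": "plural"}),
    ("case", {"1": "nominative", "2": "genitive", "3": "dative",
              "4": "accusative", "5": "vocative", "6": "instrumental",
              "7": "locative"}),
    ("animate", {"v": "yes", "q": "no", "g": "not applicable"}),
]


def decode_intex_adjective(code):
    """Decode a code such as 'bems1g' into attribute/value pairs."""
    if len(code) != len(INTEX_ADJ_POSITIONS):
        raise ValueError("expected six characters, got: " + code)
    # A KeyError here means the code uses a value that is not in the table.
    return {name: values[ch] for (name, values), ch in zip(INTEX_ADJ_POSITIONS, code)}


print(decode_intex_adjective("bems1g"))
print(decode_intex_adjective("akms4q"))
```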
Syntactic and semantic marks in Serbian DELAS
category      tag         applied to                explanation               example
syntactic     +p2         prepositions              noun is in genitive       bez,PREP+p2
syntactic     +Ref        verbs                     reflexive                 dicyiti,V551+Imperf+It+Ref
syntactic     +MG         nouns                     masculine natural gender  budala,N601+Hum+MG+FG
derivational  +VN         nouns                     verbal noun               kiselxenxe,N300+VN
derivational  +Adj        adverbs                   derived from adjectives   fanaticyno,ADV+Adj
derivational  +DerOvaIra  verbs, nouns, adjectives  derivational variety      dezinfikovati,V18+Imperf+...+DerOvaIra
semantic      +Col        adjectives                colors                    zelenkastosiv,A6+Col
semantic      +Hum        nouns                     human                     lxubavnica,N601+Hum
semantic      +Mat        adjectives                material                  kozxnat,A6+Mat
dialectal     +Ek         all                       Ekavian                   nedelxa,N600+Ek
dialectal     +Cr         all                       Croatism                  izopcxen,A1+PP+Cr
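A short sketch of how the marks of a DELAS entry could be separated from its inflectional class and grouped by the categories in this table. The function name and the label "not listed" are hypothetical; marks that appear in the examples but not in the tag column (such as Imperf, It, or FG) are reported as not listed rather than interpreted.

```python
# Category of each mark, copied from the table of marks.
MARK_CATEGORY = {
    "p2": "syntactic", "Ref": "syntactic", "MG": "syntactic",
    "VN": "derivational", "Adj": "derivational", "DerOvaIra": "derivational",
    "Col": "semantic", "Hum": "semantic", "Mat": "semantic",
    "Ek": "dialectal", "Cr": "dialectal",
}


def parse_delas(entry):
    """Split 'lemma,Class+Mark+Mark' and label each mark with its category."""
    lemma, code = entry.split(",", 1)
    inflection_class, *marks = code.split("+")
    return {
        "lemma": lemma,
        "class": inflection_class,
        "marks": {mark: MARK_CATEGORY.get(mark, "not listed") for mark in marks},
    }


print(parse_delas("budala,N601+Hum+MG+FG"))
print(parse_delas("bez,PREP+p2"))
```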
Problems of correspondence between MULTEXT-East MSD and Intex/1
The necessity to impose the existing coding schema on a particular language
Example: How to encode the present and past active gerunds?
In Serbian, for the verb ići (Engl. to go) those gerunds are idući and išavši
The verb tables of the MULTEXT-East specification contain attributes that describe them. However, no Slavic language except Bulgarian uses them.
Problems/2
A common encoding schema does not guarantee that true standardization will be achieved
Example:
Only in Bulgarian do we find the attribute value 'adjectival' for adverbs (with the examples 'umno, veselo, studeno'); at least the other Slavic languages could also make use of that value of the attribute Type.
Problems/3
Encoding of verb tenses
= ============== ============== = EN RO SL CS BG ET HU HR SR
P ATT            VAL            C x x x x x x x x x
= ============== ============== =
2 VForm          indicative     i x x x x x x x x x
                 subjunctive    s x
                 imperative     m x x x x x x x x
                 conditional    c x x x x x x x
                 infinitive     n x x x x x x x x
                 participle     p x x x x x x x x
                 gerund         g x x x
                 supine         u x x
                 transgressive  t x
                 quotative      q x
- -------------- -------------- -
3 Tense          present        p x x x x x x x x x
                 imperfect      i x x x x x
                 future         f x x x x
                 past           s x x x x x x x x x
                 pluperfect     l x x x
                 aorist         a x x x
Problems/3
The second attribute specifies the verb form and the third the tense. However, because of composite tenses, some verb forms are used to construct several different tenses. In Slovenian, the verb form imel is the past participle of the verb imeti (Engl. to have). It yields the perfect tense when combined with the present indicative of the copula verb biti (Engl. to be), and the conditional when combined with the conditional form of the same copula.
Problems/3
<w lemma="Winston" ana="Npmsn">Winston</w>
<w lemma="Smith" ana="Npmsn">Smith</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="imeti" ana="Vmps-sma">imel</w>
..........................................
<w lemma="da" ana="Css">da</w>
<w lemma="biti" ana="Vcc">bi</w>
<w lemma="on" ana="Pp3msa--y-n">ga</w>
<w lemma="imeti" ana="Vmps-sma">imel</w>
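The point of this example is that the tense of the whole construction depends on the copula, not on the MSD of the participle, which is Vmps-sma in both cases. A minimal Python sketch of that reasoning follows; the function name is hypothetical and only the two Slovenian combinations shown above are covered.

```python
def composite_tense(copula_msd, participle_msd):
    """Name the composite tense formed by a copula and a participle, if known."""
    # Position 1 is the verb type, position 2 the VForm, position 3 the tense.
    is_copula = copula_msd.startswith("Vc")
    is_participle = participle_msd.startswith("V") and participle_msd[2:3] == "p"
    if not (is_copula and is_participle):
        return None
    vform, tense = copula_msd[2:3], copula_msd[3:4]
    if vform == "c":                      # conditional copula, e.g. 'bi'
        return "conditional"
    if vform == "i" and tense == "p":     # present indicative copula, e.g. 'je'
        return "perfect"
    return None                           # other combinations are not covered here


print(composite_tense("Vcip3s--n", "Vmps-sma"))
print(composite_tense("Vcc", "Vmps-sma"))
```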
Problems/4
Different interpretations of various grammatical categories across languages and the lack of a clear cross-linguistic correspondence are discussed in Przepiórkowski (EACL 2003), for example the dual number in Slovene and the paucal in Serbian.
Certain morphosyntactic phenomena have not been taken into consideration, such as various problems of agreement (Vitas, Krstev, to appear).
Application of the MSD-Intex mapping to the Serbian 1984
{S}
{Bio,biti.V77:Gsm}
({je,jesam.V575+Imperf+It+Iref+Aux:Pzsi} + {je,on.PRO+Prs:sz2fi:sz4fi})
{vedar,.A18:akms1g:akms4q}
({i,.CONJ} + {i,.PAR})
{hladan,.A18:akms1g:akms4q}
{aprilski,.A2+PosQ:adms1g:aems4q:aems5g:aemp1g:aemp5g}
({dan,.A1+PP:akms1g:aems4q} + {dan,dati.V103+Perf+Tr+Iref+Ref:Tms});
{S}
({na,.PREP+p4} + {na,.PREP+p7})
{cyasovnicima,.?}
({je,jesam.V575+Imperf+It+Iref+Aux:Pzsi} + {je,on.PRO+Prs:sz2fi:sz4fi})
{izbijalo,izbijati.V101+Perf+Tr+It+Iref:Gsn}
{trinaest,.?}.
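The ambiguities and unknown words that remain in this output can be located mechanically. The sketch below is an assumption about how the notation on this slide can be read (alternatives grouped in parentheses and joined by `+`, unknown words tagged `?`); it is not the Intex tool itself, and it counts only lexical alternatives, not multiple inflectional codes inside one entry.

```python
import re

# One lexical reading: {form,lemma.tags}; the lemma may be empty.
ENTRY = re.compile(r"\{([^{},]+),([^{}.]*)\.([^{}]*)\}")
# One text unit: either a parenthesized group of readings or a single {...}.
UNIT = re.compile(r"\(([^()]*)\)|(\{[^{}]*\})")


def read_units(text):
    """Return a list of units; each unit is a list of (form, lemma, tags)."""
    units = []
    for match in UNIT.finditer(text):
        chunk = match.group(1) if match.group(1) is not None else match.group(2)
        readings = [(form, lemma or form, tags)
                    for form, lemma, tags in ENTRY.findall(chunk)]
        if readings:  # markers such as {S} contain no reading and are skipped
            units.append(readings)
    return units


sample = ("{S}{Bio,biti.V77:Gsm}"
          "({je,jesam.V575+Imperf+It+Iref+Aux:Pzsi} + {je,on.PRO+Prs:sz2fi:sz4fi})"
          "{vedar,.A18:akms1g:akms4q}{cyasovnicima,.?}")

units = read_units(sample)
ambiguous = [u[0][0] for u in units if len(u) > 1]
unknown = [u[0][0] for u in units if u[0][2] == "?"]
print(ambiguous, unknown)
```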
Tool that facilitates the lemmatization and disambiguation
Tagged Serbian translation of 1984 after hand disambiguation and resolution of unknown words
{S}
{Bio,biti.V77:Gsm}
{je,jesam.V575+Imperf+It+Iref+Aux:Pzsi}
{vedar,.A18:akms1g}
{i,.CONJ}
{hladan,.A18:akms1g}
{aprilski,.A2+PosQ:adms1g}
{dan,.N1:ms1q};
{S}
{na,.PREP+p7}
{cyasovnicima,cyasovnik.N5:mp7q}
{je,jesam.V575+Imperf+It+Iref+Aux:Pzsi}
{izbijalo,izbijati.V101+Perf+Tr+It+Iref:Gsn}
{trinaest,.Num+Car}.
A simple Perl script maps Serbian Intex codes to MULTEXT-East MSDs
if (($POS eq "V") && ($kategorije !~ /[XS]/)) { # it is a verb
    $glagol = "V" . "---------------";
    if ($semkat =~ /Aux/) { # type, attribute 1
        substr($glagol,1,1) = "a";
    } else {
        substr($glagol,1,1) = "m";
    }
    if ($kategorije =~ /([WYGTIFA])/ ) { # form, attribute 2
        substr($glagol,2,1) = $1;
    }
    $glagol =~ tr/WYGTIFA/nmppiii/;
    if ( ($lema eq "biti") && ($kategorije =~ /A/) ) {
        substr($glagol,2,1) = "c";
    }
    if ($kategorije =~ /([PIFAGY])/) { # tense, attribute 3
        substr($glagol,3,1) = $1;
    }
    $glagol =~ tr/PIFAGY/pofasp/;
    if ($kategorije =~ /([xyz])/) { # number (person 1/2/3), attribute 4
        substr($glagol,4,1) = $1;
    }
    $glagol =~ tr/xyz/123/;
    ........
Tagged Serbian 1984 using MULTEXT-East MSD
<w lemma="biti" ana="Vmps-sman-n---p">Bio</w>
<w lemma="jesam" ana="Va-p3s-an-y---p">je</w>
<w lemma="vedar" ana="Afpms1n">vedar</w>
<w lemma="i" ana="Ccs">i</w>
<w lemma="hladan" ana="Afpms1n">hladan</w>
<w lemma="aprilski" ana="Aopms1y">aprilski</w>
<w lemma="dan" ana="Ncmsn--n">dan</w>
<w lemma="na" ana="Sps-">na</w>
<w lemma="cyasovnik" ana="Ncmpl--n">cyasovnicima</w>
<w lemma="jesam" ana="Va-p3s-an-y---p">je</w>
<w lemma="izbijati" ana="Vmps-snan-n---e">izbijalo</w>
<w lemma="trinaest" ana="Mc---l">trinaest</w>
Conclusion
It is possible to convert from Intex to MULTEXT-East
It is possible to convert from MULTEXT-East to Intex to a certain extent. Some information cannot be recovered, such as the inflectional class code
Noun attributes
1. Type
2. Gender
3. Number
4. Case
5. Definiteness
6. Clitic
7. Animate
8. Owner_Number
9. Owner_Person
10. Owned_Number
Verb attributes
1. Type
2. VForm
3. Tense
4. Person
5. Number
6. Gender
7. Voice
8. Negative
9. Definiteness
10. Clitic
11. Case
12. Animate
13. Clitic_s
14. Aspect
Adjective attributes
1. Type
2. Degree
3. Gender
4. Number
5. Case
6. Definiteness
7. Clitic
8. Animate
9. Formation
10. Owner_Number
11. Owner_Person
12. Owned_Number
Adverb attributes
1. Type
2. Degree
3. Clitic
4. Number
5. Person
6. Wh_Type
Values of the attribute VForm of the type Verb
indicative (i) subjunctive (s) imperative (m) conditional (c) infinitive (n)
participle (p) gerund (g) supine (u) transgressive (t) quotative (q)