<pisze> : ['piʃɛ] : /#p’yše#/ : /#p’ys/ + /R3e#/

Morphophonological Annotation of Polish
Analyzing and Tagging Polish Morphological suffixes

17.5.2006 Amir Zeldes

Overview

1. Lemmatization
2. Polish morphology
3. Text-based approaches
4. Morphophonemic analysis
5. Benefits and applications
6. References

Lemmatization…

• …means:
  1. Analyzing the grammatical categorization of a word form (case, number, tense…)
  2. Finding the basic dictionary entry (or ‘lemma’)
  3. Finding the grammatical categorization of the lemma (part of speech, gender of a noun etc.)
• …is important for:
  1. Machine translation
  2. Information retrieval (search engines etc.)
  3. Building electronic corpora

Lemmatization

• Suffixal morphology:
  – Each word has a stem (n) and a suffix (m):

        n        m
    AAAAA…A   BB…B

  How do we know how to divide the word?
• Naïve definitions:
  – Stem: that part of the word which is common to all word forms belonging to the same lemma
  – Suffix: the remaining characters

Lemmatization

• The naïve algorithm:
  1. Divide input string into all possible stem-suffix pairs
  2. Look up each possible suffix
  3. If a suffix is found, get instructions to find the lemma
  4. Look up the lemma
  5. If a lemma is found, create an analysis containing the lemma, its categorization, and any grammatical information from the suffix
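The five steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the suffix table, the lexicon and the function name `naive_lemmatize` are invented for the example and are not taken from any existing tagger.

```python
# Toy suffix table: form suffix -> list of (base suffix, PoS, grammatical info).
# These entries are illustrative assumptions, not data from a real index.
SUFFIXES = {
    "s": [("", "Noun", {"num": "pl"})],
    "": [("", "Noun", {"num": "sg"})],
}

# Toy lexicon: lemma -> part of speech (also assumed for illustration).
LEXICON = {"book": "Noun"}


def naive_lemmatize(form):
    """Return every analysis licensed by the suffix table and the lexicon."""
    analyses = []
    # Step 1: try every split point, including the empty suffix.
    for i in range(1, len(form) + 1):
        stem, suffix = form[:i], form[i:]
        # Steps 2-3: look up the suffix and get its base (lemma) suffix.
        for base, pos, info in SUFFIXES.get(suffix, []):
            lemma = stem + base
            # Steps 4-5: keep the analysis only if the lemma is listed
            # in the lexicon with a matching part of speech.
            if LEXICON.get(lemma) == pos:
                analyses.append({"form": form, "lemma": lemma, "pos": pos, **info})
    return analyses


print(naive_lemmatize("books"))
```

Because every split is tried, an ambiguous form would simply return several analyses; choosing among them is outside the scope of this sketch.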

Lemmatization

An English example:
• Given that <s> is a suffix of plural nouns, analyze the form <books>:
  1. b-ooks, bo-oks, boo-ks… one possible division is book-s
  2. Look for the suffix <s>
  3. Its base suffix is Ø (just the stem), the category is plural noun
  4. Search the lexicon for a noun <book> + Ø = <book>
  5. Create an analysis: <books> is a plural noun with the lemma <book>

<books> =? <book> + <s>

Suffix table:
  suf   PoS    num   base
  s     Noun   pl    Ø

Lexicon:
  lemma   PoS
  book    Noun

<books> = pl. noun < book

English vs. Polish
• But… Polish has many more morphological forms than English*:

                          English                    Polish
  Cases                   2                          7
  Genders                 -                          5
  Non-finite verb forms   3 (go, going, gone)        4+2x5x7
  Finite verb forms       3 (walk, walks, walked)    22

*some forms can appear identical (cf. die Frau : die Frauen)

Polish Morphology

Polish stems and suffixes have many forms
• Orthographic suffix variation
    lampie – loc. sg. fem. < lampa “lamp”
    szkole – loc. sg. fem. < szkoła “school”
• Stem variation
  – Consonant mutation: (cf. Molkerei : Milch)
      ręce – loc. sg. fem. < nom. ręka “hand”
  – Vowel alternation: (cf. Hand : Hände)
      rąk – gen. pl. fem. < nom. ręka “hand”

Polish Morphology

Keeping the naïve definition means:
• The stem of <ręka> “hand” is <r>:
    ręce = r + ęce
    ręka = r + ęka
    rąk = r + ąk
• All of the following are suffixes for loc. sg. fem.:
    ręce = r + ęce
    szkole = szkol + e
    lampie = lamp + ie

Polish Morphology

• Sometimes the same suffix has two forms, or the same form represents different suffixes:
    piękny – beautiful (M sg.)    piękni – beautiful (M pl.)
    ciężki – heavy (M sg.)        ciężcy – heavy (M pl.)
• This means we need the entries:
  – -ny: nom sg M
  – -ki: nom sg M
  – -ni: nom pl M
  – -cy: nom pl M

The Tokarski Index…
…an a-tergo list of all possible Polish suffixes according to the naïve definition:
• Contains ca. 18,000 entries
• Each entry gives a form suffix, a lemma suffix, the grammatical categorization, and some examples
• The index was created manually!
• It is currently used in major taggers (SAM, Morfeusz)

[Index excerpt: entries for forms ending in -mam and -nam (e.g. mam < mieć, -mam < -mać, nam < my), each with a categorization code and example words (imam, omam, mniemam, dumam, trzymam, ignam, Uznam, dynam).]
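An a-tergo list is ordered by word endings rather than beginnings, so forms sharing a suffix end up next to each other. A minimal sketch of that ordering follows; the word list is a small sample chosen for illustration, and plain code-point sorting is used instead of proper Polish collation, which is a simplifying assumption.

```python
# A few word forms ending in -mam / -nam (sample chosen for illustration).
words = ["mam", "nam", "trzymam", "dumam", "omam", "dynam"]


def a_tergo_key(word):
    """Sort key that compares words from their last letter backwards."""
    return word[::-1]


# Sorting on the reversed string groups forms by shared endings.
a_tergo = sorted(words, key=a_tergo_key)

for word in a_tergo:
    # Right-align so the shared endings line up visually.
    print(word.rjust(10))
```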

The Tokarski Index

Pros:
• Very fast (simple text based search)
• Words inflected similarly to listed examples are also recognized
• Simple dictionary - can contain only 1 form per lemma

Cons:
• Very difficult to produce, maintain and expand
• Some unforeseen forms may still not be recognized
• Distortion of intuitive “suffixes”

Dictionary Based Approaches
• For each dictionary item, define what stem forms it has, and which suffixes they take:
    ręka:
    – ręk + {a, i, ami}
    – ręc + {e}
    – rąk + {Ø}

Pros:
• Can generate paradigms
• Irregularities easy to handle

Cons:
• Massive lexicographic work
• Forms not in dictionary can’t be analyzed
• Suffixes are still recognized at a textual level
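A dictionary entry of this kind can be sketched as a mapping from stem variants to the suffixes they take. The data structure and the names `ENTRY`, `generate_forms` and `build_form_index` are assumptions made for illustration; only the stems and suffixes of <ręka> come from the slide.

```python
# One dictionary entry: each stem variant lists the suffixes it combines with.
# The empty string stands for the zero suffix Ø.
ENTRY = {
    "lemma": "ręka",
    "stems": {
        "ręk": ["a", "i", "ami"],
        "ręc": ["e"],
        "rąk": [""],
    },
}


def generate_forms(entry):
    """Generate the listed word forms by concatenating stems and suffixes."""
    return [stem + suffix
            for stem, suffixes in entry["stems"].items()
            for suffix in suffixes]


def build_form_index(entries):
    """Map every generated form back to its lemma for text-level lookup."""
    return {form: entry["lemma"] for entry in entries for form in generate_forms(entry)}


print(generate_forms(ENTRY))
print(build_form_index([ENTRY]).get("ręce"))
```

The lookup only succeeds for forms that were listed, which is the limitation noted among the cons: a word missing from the dictionary cannot be analyzed.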

Goals

1. Recognize and correctly analyze all forms
2. Use a simple, mono-lemmatic dictionary
3. Use a simple, extensible suffix table
4. Distinguish between distinct homographic suffixes
5. Identify the same suffix no matter how it is spelled

Text based approaches cannot fully realize these goals

Polish Orthography

• In the best case: 1 letter = 1 phoneme

<tak> = /t;a;k;/

• In some cases: 2 letters = 1 phoneme (digraphs)

<czas> = /cz;a;s;/
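Splitting a spelling into phoneme-sized units can be sketched with a greedy match that prefers digraphs. The digraph list below is the standard Polish set; treating every such letter pair as one unit is a simplifying assumption (a pair such as <dz> can occasionally span a morpheme boundary), and the function names are hypothetical.

```python
# Polish digraphs; any of these letter pairs is treated as a single unit.
DIGRAPHS = ("ch", "cz", "dz", "dź", "dż", "rz", "sz")


def segment(word):
    """Split a spelling into units, preferring two-letter digraphs."""
    units = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            units.append(pair)
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


def notate(word):
    """Render the units in the slide's notation, e.g. /cz;a;s;/."""
    return "/" + "".join(unit + ";" for unit in segment(word)) + "/"


print(notate("tak"))
print(notate("czas"))
```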

Polish Orthography

• <i> can mark:
  – A vowel, allophone of <y>
      <i> = [i] = /y;/
  – Palatality of previous consonant
      <dzia> = [ʥa] = /dź;a;/
  – Palatality and a vowel
      <ci> = [ʨi] = /ć;y;/
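The three readings of <i> can be sketched as one pass over a word that has already been split into letters and digraphs. The softening map and the convention of marking other softened consonants with ’ (as in /p’/) are assumptions for illustration, as is the function name `interpret_i`.

```python
VOWELS = set("aąeęioóuy")

# Consonants with a dedicated soft counterpart; other consonants are
# marked with ’ (assumed convention, following /p’/ in the slides).
SOFT = {"c": "ć", "dz": "dź", "n": "ń", "s": "ś", "z": "ź"}


def soften(consonant):
    return SOFT.get(consonant, consonant + "’")


def interpret_i(units):
    """Turn a list of letters/digraphs into phonemes, resolving each <i>."""
    phonemes = []
    for index, unit in enumerate(units):
        if unit != "i":
            phonemes.append(unit)
            continue
        following = units[index + 1] if index + 1 < len(units) else None
        after_consonant = bool(phonemes) and phonemes[-1] not in VOWELS
        if after_consonant:
            # <i> marks palatality of the previous consonant.
            phonemes[-1] = soften(phonemes[-1])
        # Between a consonant and a vowel, <i> is only a softness marker;
        # everywhere else it also stands for the vowel /y/.
        if not (after_consonant and following in VOWELS):
            phonemes.append("y")
    return phonemes


print(interpret_i(["dz", "i", "a"]))
print(interpret_i(["c", "i"]))
print(interpret_i(["p", "i", "s", "a", "ć"]))
```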

Using a phonemic analysis…

• A common stem /#ręk-/ still cannot be reached:
    <ręka> : ['rɛ̃ka] : /#ręka#/
    <ręce> : ['rɛ̃ʦɛ] : /#ręce#/
• Some previously possible analyses are impossible:
    <pisać> : ['pisaʨ] : /#p’ysać#/ “to write”
    <pisze> : ['piʃɛ] : /#p’ysze#/ “(he/she) writes”
  pis-ać, pis-ze is not a valid division: /sz/ is 1 phoneme

Morphophonemics

• Morphophonemic analysis:

  /c/ is just an allophone of /k/ before /e/, in certain morphological environments:
    <ręce> = /#ręk/ + /e#/ (loc. sg. of <ręka> “hand”)
    <rzece> = /#rzek/ + /e#/ (loc. sg. of <rzeka> “river”)
  /sz/ is just an allophone of /s/ before /e/, in certain morphological environments…

Morphophonemics

• But /k/ and /c/ are not allophones: /e/ after /k/ can produce different changes:
    <krzyczeć> “to shout” = /#krzyk/ + /eć#/ (cf. perfective infinitive <krzyknąć>)
• We can define these as different morphophonemes, e1 and e2 (after Swan, 2002):
    <ręce> = /#ręk/ + /e1#/
    <krzyczeć> = /#krzyk/ + /e2ć#/

Morphophonemics

• But the same changes can happen with different vowels:
    <ciężcy> “heavy” (M pl.) = /#ciężk/ + /y1#/ (cf. singular <ciężki>)
• Operator morphophonemes are a more economic description:
    <ręce> = /#ręk/ + /R1e#/
    <ciężcy> = /#ciężk/ + /R1y#/
    <krzyczeć> = /#krzyk/ + /R2eć#/

Morphophonemics
• Each operator affects different phonemes in different ways:

          R1     R2     R3     R4
  /k/     /c/    /cz/   /cz/   /k’/
  /g/     /dz/   /ż/    /ż/    /g’/
  /t/     /ć/    /ć/    /c/    –

Phoneme Table

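• These effects and other information can be stored in a phoneme table:

  Phon  Vowel  Voiced  Air  Softness  Articulation  R1    R2    R3    R4
  k;    1      1       1    4         5             +c    +cz   +cz   +k’
  g;    1      2       1    4         5             +dz   +ż    +ż    +g’
  t;    1      1       1    1         2             +ć    +ć    +c    0
  ć;    1      1       2    3         3             -t    -t    0     0

Reading the /t/ row: consonant (Vowel=1), unvoiced (Voiced=1), plosive (Air=1), hard (Softness=1), dental (Articulation=2).

/ć/ can be derived from /t/ + /R1/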
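The operator columns of such a table can be sketched as a nested mapping with two helper functions, one applying an operator and one undoing it. The reading of the cell signs is an assumption based on the note that /ć/ can be derived from /t/ + /R1/: "+x" is taken to mean "the operator yields x", "-x" to mean "this phoneme is itself the product of the operator applied to x", and "0" to mean no entry. All names are hypothetical.

```python
# Operator cells for the four phonemes listed in the table.
OPERATORS = {
    "k": {"R1": "+c", "R2": "+cz", "R3": "+cz", "R4": "+k’"},
    "g": {"R1": "+dz", "R2": "+ż", "R3": "+ż", "R4": "+g’"},
    "t": {"R1": "+ć", "R2": "+ć", "R3": "+c", "R4": "0"},
    "ć": {"R1": "-t", "R2": "-t", "R3": "0", "R4": "0"},
}


def apply_operator(phoneme, op):
    """Return the phoneme produced by op, or the input if op has no effect."""
    cell = OPERATORS.get(phoneme, {}).get(op, "0")
    return cell[1:] if cell.startswith("+") else phoneme


def undo_operator(phoneme, op):
    """Return the underlying phonemes that op would turn into `phoneme`."""
    sources = [source for source, cells in OPERATORS.items()
               if cells.get(op) == "+" + phoneme]
    # A "-x" cell records the same relation from the product's side.
    cell = OPERATORS.get(phoneme, {}).get(op, "0")
    if cell.startswith("-") and cell[1:] not in sources:
        sources.append(cell[1:])
    return sources


print(apply_operator("k", "R1"))
print(undo_operator("ć", "R1"))
```

Undoing an operator is what analysis needs: given a surface consonant, it lists the stem-final consonants the lemma could contain.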

Phonotactic Rules

• Strings can be converted to phoneme arrays

• Arrays can be matched against rules describing contact between phonemes

• This rule describes the mutation in <ręce>:

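\[ C_{[-R_1]}\, V_{[+\text{front}]}\,\# \;=\; C_{[+R_1]} + R_1\, V_{[+\text{front}]}\,\# \]

\[ \#\text{rę}\,\text{c}_{[-R_1]}\,\text{e}_{[+\text{front}]}\# \;=\; \#\text{rę}\,\text{k}_{[+R_1]} + R_1\,\text{e}_{[+\text{front}]}\# \]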

Phonotactic Rules

• Once a rule has been matched, it produces a stem-suffix pair for lookup:

    #ręce# = #ręk + R1e#

• The suffix can be found in a table:

    suf    PoS    case   num   gend   pers   tense   asp   base   conditions
    R1e#   Noun   loc    sg    F      -      -       -     a#     -

• The same suffix and base suffix apply to:

    szkole = /#szkoł/ + /R1e#/    szkoła = /#szkoł/ + /a#/
    lampie = /#lamp/ + /R1e#/     lampa = /#lamp/ + /a#/
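Rule matching followed by suffix lookup can be sketched for this one suffix as follows. The reverse mappings are read off the three examples (c < k, l < ł, p’ < p); the single-row suffix table, the list-of-phonemes input format and the function name `analyze` are assumptions for illustration, not the actual rule engine.

```python
# Surface consonant -> underlying consonant when R1 is undone.
UNDO_R1 = {"c": "k", "l": "ł", "p’": "p"}

# One suffix-table row: form suffix -> categorization and base suffix.
SUFFIX_TABLE = {
    "R1e#": {"pos": "Noun", "case": "loc", "num": "sg", "gend": "F", "base": "a#"},
}


def analyze(phonemes):
    """Analyze a form given as a list of phonemes (without # boundaries)."""
    if len(phonemes) < 2:
        return None
    *initial, consonant, vowel = phonemes
    # The rule only matches an R1-product consonant before final /e/.
    if vowel != "e" or consonant not in UNDO_R1:
        return None
    stem = "".join(initial) + UNDO_R1[consonant]
    suffix = "R1" + vowel + "#"
    row = SUFFIX_TABLE.get(suffix)
    if row is None:
        return None
    # The base suffix rebuilds the lemma from the recovered stem.
    lemma = stem + row["base"].rstrip("#")
    return {"stem": stem, "suf": suffix, "bsuf": row["base"], "lemma": lemma,
            "pos": row["pos"], "case": row["case"], "num": row["num"],
            "gend": row["gend"]}


for form in (["r", "ę", "c", "e"], ["sz", "k", "o", "l", "e"], ["l", "a", "m", "p’", "e"]):
    print(analyze(form))
```

All three forms receive the same `suf` and `bsuf` values even though their spellings end in -ce, -le and -pie, which is the point of the slide.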

Phonotactic Rules

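• In many cases rules must be used to retrieve the lemma:

    <gryzł> : /#gryz-ł#/ “he bit”

    suf   PoS    case   num   gend   pers   tense   asp   base   conditions
    ł#    VFin   -      1     M      3      1       -     ć#     vowel=1

    /#gryz/ + /ć#/ = /#gryźć#/

\[
C\begin{bmatrix} +\text{hard} \\ +\text{dental} \\ +\text{sibilant} \\ +R_1 \end{bmatrix} + \text{ć}\#
\;=\;
C\begin{bmatrix} +\text{soft} \\ +\text{dental} \\ +\text{sibilant} \\ -R_1 \end{bmatrix} \text{ć}\#
\]

\[ \text{z} + \text{ć}\# \;\rightarrow\; \text{źć}\# \]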
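Retrieving the lemma of <gryzł> can be sketched as stripping the form suffix and attaching the base suffix with the assimilation rule applied. Only z > ź is shown on the slide; including s > ś is an assumption based on the rule's wording (hard dental sibilant), and the function names are hypothetical.

```python
# Hard dental sibilants and their soft counterparts before /ć/.
# z > ź is from the slide; s > ś is assumed by analogy with the rule.
SOFT_SIBILANT = {"z": "ź", "s": "ś"}


def attach_base_suffix(stem, base="ć"):
    """Join stem and base suffix, softening a stem-final sibilant before ć."""
    if base.startswith("ć") and stem and stem[-1] in SOFT_SIBILANT:
        stem = stem[:-1] + SOFT_SIBILANT[stem[-1]]
    return stem + base


def lemma_from_past(form):
    """Recover the infinitive from a 3rd sg. masc. past form in -ł."""
    if not form.endswith("ł"):
        return None
    return attach_base_suffix(form[:-1])


print(lemma_from_past("gryzł"))
```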

Phonotactic Rules
• Sometimes elements appear only on one side of the equation:
    <dworcom> dat. pl. of <dworzec> “station”

\[ C_{1[+R_2]}\, C_2\, V_3 \;=\; C_{1[-R_2]}\, e\, C_2 + V_3 \]

\[ \#\text{dwo}\,\text{r}_{[+R_2]}\,\text{c}\,\text{o}\,\text{m}\# \;=\; \#\text{dwo}\,\text{rz}_{[-R_2]}\,\text{e}\,\text{c} + \text{o}\,\text{m}\# \]

• Co-indexing keeps track of changes
• /om#/ = dat. pl. (regardless of ablaut and mutations)
• nom. sg. = /#/ (reconstructs the lemma from the recovered stem)

Applications

The morphophonemic approach allows:
1. A comprehensive description of Polish morphology
2. A simple dictionary: lemma + part of speech
3. Smaller, expandable suffix table
4. Distinguishing homographic, but morphophonemically distinct suffixes:
     <piękny> : /#piękn/ + /R4y#/    <piękni> : /#piękn/ + /R1y#/
     <ciężki> : /#ciężk/ + /R4y#/    <ciężcy> : /#ciężk/ + /R1y#/
5. Recognition of forms not in the dictionary:
     <biolog> “biologist”, plural:
     <biologowie> : /#biolog/ + /owie#/
     <biolodzy> : /#biolog/ + /R1y#/
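The analyses listed above can be checked in the generating direction: stem plus operator suffix should yield the attested spelling. The sketch below is an assumption-laden illustration. The R1 and R4 effects on /k/ and /g/ follow the operator table, n > ń under R1 is read off <piękni>, the spelling rule (soft consonant + /y/ is written as the plain letter + <i>) follows the orthography slides, and suffixes are passed without the # boundary.

```python
# Operator effects on the stem-final consonant (only what the examples need).
R1 = {"k": "c", "g": "dz", "n": "ń"}
R4 = {"k": "k’", "g": "g’"}
OPERATORS = {"R1": R1, "R4": R4}

# Soft consonant + /y/ is spelled as the plain letter + <i>.
SOFT_SPELLING = {"ń": "n", "k’": "k", "g’": "g"}


def realize(stem, suffix):
    """Spell out stem + suffix, e.g. realize('ciężk', 'R1y') -> 'ciężcy'."""
    operator = suffix[:2]
    if operator in OPERATORS:
        # Mutate the stem-final consonant and drop the operator symbol.
        final = OPERATORS[operator].get(stem[-1], stem[-1])
        stem, suffix = stem[:-1], suffix[2:]
    else:
        final = ""
    if final in SOFT_SPELLING and suffix.startswith("y"):
        return stem + SOFT_SPELLING[final] + "i" + suffix[1:]
    return stem + final + suffix


for stem, suffix in [("piękn", "R4y"), ("piękn", "R1y"), ("ciężk", "R4y"),
                     ("ciężk", "R1y"), ("biolog", "owie"), ("biolog", "R1y")]:
    print(stem, "+", suffix, "=", realize(stem, suffix))
```

The same suffix R1y surfaces as -ni, -cy and -dzy, while -ny and -ki both go back to R4y, so a single table row covers spellings that a text-based index must list separately.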

Morphophonemic Annotation
• Identified suffixes can be stored in corpora:

<t ID='w1c13v04s01t03' lemma='siać' pos='VFin' asp='impfv' pers='3' num='sg' gend='M' tense='past' suf='R2ał#' bsuf='R2ać#'>siał</t>
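An annotated token of this shape can be read back with Python's standard XML parser. This is a minimal sketch: the token string is the one shown above, while the variable names are chosen for illustration.

```python
import xml.etree.ElementTree as ET

# The annotated token from the slide, as a standalone XML fragment.
TOKEN = ("<t ID='w1c13v04s01t03' lemma='siać' pos='VFin' asp='impfv' "
         "pers='3' num='sg' gend='M' tense='past' suf='R2ał#' "
         "bsuf='R2ać#'>siał</t>")

element = ET.fromstring(TOKEN)

# Collect the attributes plus the word form into one flat record.
annotation = dict(element.attrib)
annotation["form"] = element.text

print(annotation["form"], "<", annotation["lemma"])
print("suffix:", annotation["suf"], "base suffix:", annotation["bsuf"])
```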

Morphophonemic Annotation
• Suffix fields can be used for morphological queries
  – Retrieve tokens having the same categorization but different suffixes:
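Such a query could be sketched over a list of token records as follows. The mini-corpus is hypothetical: it reuses word forms from the earlier examples, and the case, number and gender values given for the two plurals of <biolog> are assumed for illustration. The function name `competing_suffixes` is likewise invented.

```python
from collections import defaultdict

# Hypothetical token records built from forms used earlier in the slides.
TOKENS = [
    {"form": "biologowie", "pos": "Noun", "case": "nom", "num": "pl", "gend": "M", "suf": "owie#"},
    {"form": "biolodzy", "pos": "Noun", "case": "nom", "num": "pl", "gend": "M", "suf": "R1y#"},
    {"form": "ręce", "pos": "Noun", "case": "loc", "num": "sg", "gend": "F", "suf": "R1e#"},
    {"form": "szkole", "pos": "Noun", "case": "loc", "num": "sg", "gend": "F", "suf": "R1e#"},
]


def competing_suffixes(tokens, keys=("pos", "case", "num", "gend")):
    """Group tokens by categorization; keep groups with more than one suffix."""
    groups = defaultdict(set)
    for token in tokens:
        categorization = tuple(token[key] for key in keys)
        groups[categorization].add(token["suf"])
    return {categorization: sorted(suffixes)
            for categorization, suffixes in groups.items()
            if len(suffixes) > 1}


print(competing_suffixes(TOKENS))
```

The locative forms drop out of the result because both share the suffix R1e#, even though their spellings differ.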

Morphophonemic Annotation
• Pluralia tantum:
  – Retrieve nouns with a base suffix /R4y#/ and neuter nouns with a base suffix /a#/
• Imperfectives derived from perfectives
  – Retrieve verbs with a base suffix /R3ać#/ (distinct from simple imperfectives in homographic /R2ać#/)

Morphophonemic Annotation

• Morphological queries can help study:
  1. The distribution of morphological suffixes
  2. Changes in morphology through historical corpora
  3. How speakers disambiguate morphology:

Morphological data is more than a by-product of analysis: it’s another layer of information

References
• O.E. Swan. 2002. A Grammar of Contemporary Polish. Bloomington, Indiana: Slavica Publishers.
• K. Szafran. 1997. Automatic Lemmatisation of Texts in Polish – Is it Possible? In: Formale Slavistik, eds. U. Junghanns and G. Zybatow. Frankfurt am Main: Vervuert Verlag, pp. 437-441.
• J. Bień and K. Szafran. 2001. Analiza morfologiczna języka polskiego w praktyce. Bulletin de la société polonaise de linguistique, fasc. LVII, pp. 171-184.
• J. Tokarski. 1993. Schematyczny indeks a tergo polskich form wyrazowych, opracowania i redakcja Zygmunt Saloni. Warszawa: Wydawnictwo Naukowe PWN.