Suppletive Morphology: How Far Can You Go?

Jorge BAPTISTA
Universidade do Algarve – FCHS
Laboratório de Engenharia da Linguagem – CAUTL – IST
Campus de Gambelas, Faro, Portugal, P-8005-139
[email protected]



• For many NLP applications, fully tagged texts are required.

• Even if statistical methods may be used to tag texts, electronic dictionaries are essential tools for high-quality tagging of large texts.

• Large electronic dictionaries of both simple and multiword lexical units have been built for European Portuguese.


• In spite of their size, a non-trivial number of tokens in large corpora remain untagged.

• Suppletive morphological parsing rules can be used to cope with many lacunae, especially with regularly derived words.

• However, there are empirical limits to morphological parsers, so other methods for automatic lexical analysis must also be envisaged.


Introduction

• Automatic lexical analysis of texts can be carried out using different methods (see Ranchhod 2001, for an overview).

• Most systems, however, even if they use statistical methods predominantly, also use to some degree an electronic dictionary, where lexical information, idiosyncratic by nature, is stored.


• Large electronic dictionaries of both simple and compound words have been built for several languages, including Portuguese (Eleutério et al. 1995, Ranchhod et al. 1999).

• In spite of their size, when these lexical resources are applied to a large corpus, a non-trivial number of tokens remain to be tagged.

• Since the lexicon is an evolving object, one cannot expect dictionaries to be so comprehensive and exhaustive that they contain all possible words. This is particularly the case for regularly derived words (-ly adverbs, -ize verbs, for example).


• Morphological parsers have been built, which can be used with or without dictionaries.

• With such tools it is possible to complete the dictionary’s lacunae, that is, it is possible to formalize morphological rules so that the system may recognize (and tag) words that have not been previously included in the dictionaries.

• These rules may be used in a suppletive way (or in connection with the dictionary), and results from their application can then be manually checked by linguists and used to extend the coverage of the initial dictionary.


• In this paper, an attempt was made to estimate how many of the unknown, hence untagged, tokens of a large corpus can be adequately recognized by a morphological parser,

• using only regular derivational rules,

• trying to evaluate the precision and to determine empirically the limitations of this methodology.

• A set of morphological rules was built, focusing on a list of unknown tokens.

• Results from this morphological module are described here and its precision is evaluated.


Methods

• The CETEMPúblico corpus, http://www.linguateca.pt/CETEMPublico/, fragment 1 (text file of ~57.6 Mb, ~9.6 M simple word forms, 177,368 of them different)

• INTEX 4.33 (Silberztein 1993, 2004); http://www.nyu.edu/pages/linguistics/intex/

• Portuguese DELAF, public lexical resources built by LabEL (Eleutério et al. 1995, Ranchhod et al. 1999); http://label.ist.utl.pt


Table 1: Lexical Analysis of Training Corpus. WF = word forms (in millions, M); DWF = different word forms; DLF = simple word entries; ERR-0 = unknown word forms; NProp = candidates to the status of proper names; ERR-1 = remaining unknown word forms (the ERR list to be tested).

CETEMPúblico-1 (training corpus)

            Count      %DWF     %ERR-0
Size        57.6 Mb
WF          9.6 M
DWF         177,368
SW          105,074    59.24
ERR-0       72,294     40.76
NProp       40,957     23.09    56.65
ERR-1       31,337     17.67    43.35
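The percentages in Table 1 follow directly from the counts. As a cross-check, the short Python sketch below recomputes each share of the different word forms (%DWF) and of the unknown forms (%ERR-0). The variable names are ours and are not part of INTEX; the counts are copied from the table.

```python
# Counts copied from Table 1 (training corpus, CETEMPúblico fragment 1).
DWF = 177_368     # different word forms
KNOWN = 105_074   # forms found in the dictionary (row "SW")
ERR0 = 72_294     # unknown word forms
NPROP = 40_957    # candidates to the status of proper names
ERR1 = 31_337     # remaining unknown word forms


def pct(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# The rows partition each other: known + unknown = DWF, NProp + ERR-1 = ERR-0.
assert KNOWN + ERR0 == DWF
assert NPROP + ERR1 == ERR0

for label, count in [("SW", KNOWN), ("ERR-0", ERR0), ("NProp", NPROP), ("ERR-1", ERR1)]:
    print(f"{label:6} %DWF = {pct(count, DWF):5.2f}")
for label, count in [("NProp", NPROP), ("ERR-1", ERR1)]:
    print(f"{label:6} %ERR-0 = {pct(count, ERR0):5.2f}")
```

The printed values agree with the table: about 41% of the different word forms are unknown, and a little over half of those are proper-name candidates.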


NProp: 40,957 DWF; 23.09 %DWF; 56.65 %ERR-0

N+Sigla: 1,074 DWF; 0.6 %DWF; 1.49 %ERR-0


Overview of ERR list

• Many forms in ERR are perfectly ‘normal’ words that were simply missing from the dictionary.

• With the help of an inverse list of ERR (also obtained with INTEX dictionary tools), it was possible to determine some of the most productive derivational rules at work:


• adjectives formed from verbs (partially homographic with past participles): electrizada (electrified-fs) from electrizar (electrify);

• adverbs with –mente (-ly), formed on adjectives: controladamente (controlled-ly, in a controlled manner) from controlado (controlled);

• nouns derivationally related to verbs, with suffixes –ção (-ation/-ing) and –mento (-ment/-ing): agilização (‘agilization’), from agilizar (to ‘agilize’, make something more agile, swift); silenciamento (silencing), from silenciar (to silence);


• nouns formed with suffix –ismo (-ism): vegetarianismo (vegetarianism), from vegetariano (vegetarian);

• nouns and adjectives formed with suffixes -logia (-logy), -ólogo (-logue), -ologista (-logist), and -lógico (-logic), designating scientific/technological domains, the professionals in those domains and the relational adjective associated with them: paleontologia (paleontology), paleontólogo or paleontologista (paleontologist), paleontológico (‘paleontologic’, related to paleontology).


• nouns and adjectives formed with suffixes -mancia and -mância (-mancy), -mante (-mant), and –mântico (-mantic), designating divinatory arts, their professionals and the relational adjective associated with them: quiromancia / quiromância (chiromancy, palmistry), quiromante (‘chiromant’, palmist, psychic who reads palms to divine the future), quiromântico (‘chiromantic’, related to chiromancy, palmistry).


• Besides these, many derived words were found to be formed with prefixes (Pfx); for these, a list of the 170 most common prefixes was established, based on the lists available in grammars and on new prefixes found in the text, e.g.: anti-, auto-, bi-, contra-, des-, equi-, etno-, extra-, farmaco-, foto- (photo), geo-, hepta-, hidro-, hipo-, homo-, in- (and variants: i-, im-, ir-), inter-, macro-, mega-, micro-, mono-, neo-, opto-, pluri-, proto-, pseudo-, psico- (psych-), radio-, re-, retro-, semi-, socio-, super-, tele-, tetra-, trans-, tri-, ultra-, uni-, video-, xeno-, zoo-, etc.
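An inverse list sorts word forms by their reversed spelling, so that forms sharing an ending become neighbours and productive suffixes stand out. The sketch below is a plain-Python illustration of that idea, not the INTEX dictionary tool itself; the function names are hypothetical and the sample forms are taken from the examples above.

```python
from collections import Counter


def inverse_list(words):
    """Sort words by their reversed spelling so that shared endings cluster."""
    return sorted(words, key=lambda w: w[::-1])


def ending_counts(words, length):
    """Count how many words share each final substring of the given length."""
    return Counter(w[-length:] for w in words if len(w) > length)


# A few unknown forms quoted in the slides (illustrative sample only).
unknown = [
    "agilização", "controladamente", "silenciamento",
    "umbilicalmente", "vegetarianismo", "electrizada",
]

for word in inverse_list(unknown):
    print(f"{word:>20}")          # right-aligned, so endings line up

print(ending_counts(unknown, 5).most_common(3))
```

On a real ERR list, the most frequent endings reported by `ending_counts` would be the natural candidates for new derivational rules.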


• Obviously, many words can be polysynthetic, i.e. formed by simultaneous prefixation and suffixation:

descontroladamente <des- Pfx + controlar V + -ada Sfx-a + -mente Sfx-adv> (uncontrolledly).

• After a certain point, derivation rules have a very low productivity, i.e., the number of words regularly formed becomes negligible.
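The layered analysis of descontroladamente can be illustrated with a small affix-stripping routine. This is only a sketch under strong assumptions: the verb set and the prefix tuple are tiny hypothetical stand-ins for the dictionary and for the list of about 170 prefixes, and only the -ado/-ada and -mente patterns are handled. It is not the INTEX transducer.

```python
VERBS = {"controlar"}               # hypothetical stand-in for dictionary verb lemmas
PREFIXES = ("des", "re", "anti")    # tiny sample of the prefix list


def analyse(word):
    """Return the morph sequence for a prefixed/suffixed form, or None."""
    suffix_layers = []
    stem = word
    # Outer layer: adverbial suffix -mente.
    if stem.endswith("mente"):
        stem = stem[: -len("mente")]
        suffix_layers.append(("-mente", "Sfx-adv"))
    # Try each matching prefix, then the unprefixed reading.
    for prefix in [p for p in PREFIXES if stem.startswith(p)] + [""]:
        base = stem[len(prefix):]
        # Inner layer: participle-like adjective on a first-conjugation verb.
        for ending in ("ada", "ado"):
            verb = base[: -len(ending)] + "ar"
            if base.endswith(ending) and verb in VERBS:
                head = [(prefix + "-", "Pfx")] if prefix else []
                return head + [(verb, "V"), ("-" + ending, "Sfx-a")] + suffix_layers
    return None


print(analyse("descontroladamente"))
```

Each recognized layer must bottom out in a form attested in the lexicon, which is what keeps such a rule from matching arbitrary strings that merely begin with des- or end in -mente.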


Morphological rules

• The new morphological parser of INTEX was used (version 4.33, February 24, 2004; Silberztein 2004:130-142).

• A set of morphological rules was built.

• These rules are enhanced finite-state transducers (FSTs).


Example:

• Faced with the new, unknown word form umbilicalmente, formed from the adjective umbilical (idem, related to the navel), the system checks whether there is an adjective umbilical in the lexicon (in fact, there is) and, if so, produces the lexical entry:

umbilicalmente,umbilicalmente.ADV+A=umbilical+Sfx=mente

The context of this new word in the corpus is: “a imagem de um PS umbilicalmente ligado ao modelo jurídico-penal” (the image of a Socialist Party umbilically connected to the juridical-penal model). From this context, the meaning of this adverb should be something like ‘closely, intimately, or inextricably’.
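The lookup step in this example can be sketched as follows. The code assumes a hypothetical in-memory set of adjective forms in place of the DELAF dictionary and imitates the entry format shown above; it is an illustration of the rule's logic, not the enhanced FST used in INTEX.

```python
# Hypothetical sample of adjective forms; the real rule consults the dictionary.
ADJECTIVE_FORMS = {"umbilical", "controlada"}

SUFFIX = "mente"


def mente_rule(token):
    """Return a DELAF-style entry if `token` is adjective + -mente, else None."""
    if not token.endswith(SUFFIX):
        return None
    base = token[: -len(SUFFIX)]
    if base not in ADJECTIVE_FORMS:
        return None                 # the base is not attested: no entry produced
    return f"{token},{token}.ADV+A={base}+Sfx={SUFFIX}"


print(mente_rule("umbilicalmente"))
print(mente_rule("demente"))        # "de" is not an adjective, so no analysis
```

Requiring the base to be present in the lexicon is what separates a derivational analysis from mere string matching on the ending.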


Remark:

• We have built a small module of rules to deal with the derivation of diminutive, augmentative and superlative forms.

• The number of rules built by us is given here as a mere indication.

• C. Mota (2003) has built a larger and more complex module of rules for the same derivational processes. We did not use her work here. Therefore, results will ignore this module.

Table 2: Morphological rules

Rules        Total
DimAumSup       60
A_vel        1,026
ADV_mente    1,538
Pfx-X        2,040
N_logia          8
N_mancia        16
N_ção          342
N_mento      1,026
N_ismo         342
Vk_A         1,368
Total        7,766

• A set of morphological rules was built to analyze and tag the most frequent, derivationally well-formed, unrecognized word forms in ERR.

• These rules were grouped into different FSTs by ‘derivative families’.


• The number of prefixes used (approximately 170) significantly influences the number of rules in each family.

• As we will see below, some of these prefixes give rise to a significant number of erroneous analyses, so it is possible that in a future version some of them will be removed or used only in more constrained rules.

• As this is ongoing research, the number of rule families will surely increase.


• The morphological module that integrates all these rules is a 20 Kb FST with 595 states and 990 transitions.

• It takes 35 seconds to analyze the 31,337 forms of the training corpus ERR list and to produce the 3,533 entries of the resulting DLF.


Results

• First, an evaluation was made of the application of the set of FSTs to the training corpus.

• We will first present the lexical coverage of the morphological module and then assess its success rate.


Table 3: Lexical coverage of morphological rules: results from training corpus. WF = word forms; DLF = simple word entries; n-tuples = different entries produced for the same word forms; %DLF = n-tuples’ percentage of DLF entries; %ERR = percentage of ERR list.

FST-morph        WF      DLF     n-tuples   % DLF    % ERR
ADV-mente.fst    193     224     31         13.84    1.23
A-vel.fst        51      56      5          8.93     0.32
N_ção.fst        155     177     22         12.43    0.99
N_ismo.fst       26      27      1          3.70     0.17
N_logia.fst      13      13      0          0.00     0.08
N_mancia.fst     1       2       1          50.00    0.01
N_mento.fst      83      96      13         13.54    0.53
Pfx-X.fst        1574    2102    528        25.12    10.02
Vk_A.fst         453     513     60         11.70    2.88
Total            2549    3210    661        20.59    16.22


Table 4: Results from training corpus

FST-morph        correct entries   incorrect entries   success rate (%)   error rate (%)
ADJ-vel.fst      51                5                   91.07              8.93
ADV-mente.fst    198               26                  88.39              11.61
N_ção.fst        165               12                  93.22              6.78
N_ismo.fst       24                3                   88.89              11.11
N_logia.fst      13                0                   100                0.00
N_mancia.fst     2                 0                   100                0.00
N_mento.fst      96                0                   100                0.00
PFX-X.fst        1758              344                 83.63              16.37
Vk_ADJ.fst       456               57                  88.89              11.11
Total            2763              447                 92.68              7.32


Some comments

• As we can see, the global success rate is high (approx. 93 %). The most important cause of error consists of initial strings incorrectly analysed as prefixes. Some adverbs ending in –mente are analysed in spite of the fact that they present an (incorrectly spelled) accented vowel:

diáriamente,diáriamente.ADV+ADJ=diária+SFX=mente

(the correct form, diariamente, ‘daily’, does not have any accent)


Testing corpus: Lexical analysis

Table 5: Lexical Analysis of Test Corpus. WF = word forms (in millions, M); DWF = different word forms; DLF = simple word entries; ERR-0 = unknown word forms; NProp = candidates to the status of proper names; ERR-1 = remaining unknown word forms (the ERR list to be tested). CP2-CP1 = difference between the testing and the training corpus.

             CETEMPúblico-2   %DWF     %ERR-0   CP2-CP1
Size (Mb)    63.6                               6.2
WF (M)       10.9                               1.3
DWF          178,543                            1,175
DLF          106,254          59.91             1,180
ERR-0        72,289           40.76             -5
NProp        40,965           23.10    56.66    8
ERR-1        31,324           17.66    43.33    -13


Comparison CP1 vs. CP2

• The two fragments do not have exactly the same size: the testing corpus is 6.2 Mb larger and has 1.3 million more words (1,175 more different word forms) than the training corpus.

• The DLF is also 1,180 entries larger.

• However, the number of unknown word forms (ERR-0) and of proper names (NProp) is almost the same.

• The remaining ERR list (ERR-1), after the proper names have been discarded, is also of comparable size.


Testing corpus: Lexical Coverage

Table 6: Lexical coverage of morphological rules: results from testing corpus. WF = word forms; DLF = simple word entries; n-tuples = different entries produced for the same word forms; %DLF = n-tuples’ percentage of DLF entries; %ERR = percentage of ERR list.

FST-morph        WF      DLF     n-tuples   % DLF    % ERR
A_vel.fst        46      52      6          11.54    0.15
ADV_mente.fst    203     235     32         13.62    0.65
N_ção.fst        145     160     15         9.38     0.46
N_ismo.fst       68      92      24         26.09    0.22
N_logia.fst      25      25      0          0.00     0.08
N_mancia.fst     0       0       0          0.00     0.00
N_mento.fst      96      112     16         14.29    0.31
PFX-X.fst        1711    2287    576        25.19    5.46
Vk_ADJ.fst       434     490     56         11.43    1.39
Total            2728    3453    725        21.00    8.71
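The derived columns of Table 6 can be recomputed from the WF and DLF counts. The sketch below assumes, as the figures suggest, that n-tuples = DLF − WF, that %DLF = n-tuples / DLF, and that %ERR is taken over the 31,324 forms of the testing ERR list (ERR-1 in Table 5); the function and variable names are ours.

```python
ERR1_TEST = 31_324   # size of the testing ERR list (ERR-1, Table 5)

# (WF, DLF) per transducer, copied from Table 6.
TABLE6 = {
    "A_vel": (46, 52),
    "ADV_mente": (203, 235),
    "N_ção": (145, 160),
    "N_ismo": (68, 92),
    "N_logia": (25, 25),
    "N_mancia": (0, 0),
    "N_mento": (96, 112),
    "PFX-X": (1711, 2287),
    "Vk_ADJ": (434, 490),
    "Total": (2728, 3453),
}


def coverage_row(wf, dlf, err_size=ERR1_TEST):
    """Return (n-tuples, %DLF, %ERR) for one row of the coverage table."""
    n_tuples = dlf - wf                       # extra analyses for the same forms
    pct_dlf = round(100 * n_tuples / dlf, 2) if dlf else 0.0
    pct_err = round(100 * wf / err_size, 2)   # share of the ERR list matched
    return n_tuples, pct_dlf, pct_err


for name, (wf, dlf) in TABLE6.items():
    print(name, coverage_row(wf, dlf))
```

Read this way, %DLF measures how ambiguous a rule family is, while %ERR measures how much of the unknown vocabulary it reaches.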


Table 7: Results from testing corpus.

FST-morph        correct entries   incorrect entries   success rate (%)   error rate (%)
A_vel.fst        45                7                   86.54              13.46
ADV_mente.fst    176               59                  74.89              25.11
N_ção.fst        143               17                  89.38              10.63
N_ismo.fst       62                30                  67.39              32.61
N_logia.fst      24                1                   96.00              4.00
N_mancia.fst     0                 0                   -                  -
N_mento.fst      96                16                  85.71              14.29
PFX-X.fst        1711              576                 74.81              25.19
Vk_ADJ.fst       434               56                  88.57              11.43
Total            2691              762                 77.93              22.07


Comparing results


Table 8: Comparison of results from training and testing corpora. WF = word forms; DLF = simple word entries; n-tuples = different entries produced for the same word forms; %DLF = n-tuples’ percentage of DLF entries; %ERR = percentage of ERR list.

Results             Training Corpus   Testing Corpus   Test-Train
DWF                 2,549             2,728            179
DLF                 3,210             3,453            243
n-tuples            661               725              64
% DLF               20.59 %           21.00 %          0.40 %
% ERR               16.22 %           8.71 %           -7.51 %
correct entries     2,763             2,691            -72
incorrect entries   447               762              315
success rate        92.68 %           77.93 %          -14.75 %
error rate          7.32 %            22.07 %          14.75 %


• In spite of the different sizes of each corpus, the number of word forms in each ERR list is almost the same.

• The results from the morphological rules are also equivalent: both the number of different word forms recognized by the FSTs and the number of entries of the two DLFs are close.

• The combined DLF obtained by the application of the morphological rules to both corpora contains 6,058 different entries, corresponding to 4,253 different word forms (including diminutives, augmentatives and superlatives).


• There is a slightly greater number of n-tuples in the testing corpus, but the percentage of DLF is practically the same.

• The first major difference is the lexical coverage of the morphological module (%ERR), i.e. the percentage of matched word forms of each ERR list:

• while in the training corpus this was about 16 %, it drops to a little less than 9 % in the testing corpus.

• Secondly, the success rate diminishes significantly, from 92.68 % to 77.93 %.
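Since this comparison rests on two percentages, it may be useful to recompute them from the raw counts on a common basis. The sketch below is a cross-check under our own assumptions, not the author's procedure: it computes the success rate both pooled over all entries and as an unweighted mean of the per-FST rates, and it computes coverage against the ERR-1 sizes given in Tables 1 and 5. All names are ours; the counts are copied from Tables 3, 4, 6 and 7.

```python
# (correct, incorrect) entries per transducer, copied from Tables 4 and 7.
TRAIN = {
    "A_vel": (51, 5), "ADV_mente": (198, 26), "N_ção": (165, 12),
    "N_ismo": (24, 3), "N_logia": (13, 0), "N_mancia": (2, 0),
    "N_mento": (96, 0), "Pfx-X": (1758, 344), "Vk_A": (456, 57),
}
TEST = {
    "A_vel": (45, 7), "ADV_mente": (176, 59), "N_ção": (143, 17),
    "N_ismo": (62, 30), "N_logia": (24, 1), "N_mancia": (0, 0),
    "N_mento": (96, 16), "Pfx-X": (1711, 576), "Vk_A": (434, 56),
}
ERR1 = {"train": 31_337, "test": 31_324}     # ERR-1 sizes (Tables 1 and 5)
MATCHED = {"train": 2_549, "test": 2_728}    # matched word forms (Tables 3 and 6)


def pooled_rate(table):
    """Success rate over all entries taken together."""
    correct = sum(c for c, _ in table.values())
    total = sum(c + i for c, i in table.values())
    return round(100 * correct / total, 2)


def mean_rate(table):
    """Unweighted mean of the per-FST success rates (empty rows skipped)."""
    rates = [c / (c + i) for c, i in table.values() if c + i]
    return round(100 * sum(rates) / len(rates), 2)


def coverage(corpus):
    """Matched word forms as a percentage of the corpus ERR-1 list."""
    return round(100 * MATCHED[corpus] / ERR1[corpus], 2)


for name, table in (("train", TRAIN), ("test", TEST)):
    print(name, pooled_rate(table), mean_rate(table), coverage(name))
```

With these counts the pooled success rates are 86.07 % (training) and 77.93 % (testing), and the unweighted means are 92.68 % and 82.91 %; the training total in Table 4 coincides with the unweighted mean, while the testing total in Table 7 coincides with the pooled rate. Coverage against ERR-1 comes out at 8.13 % and 8.71 %. The slides do not state which base or averaging was intended, so this is only a cross-check; it suggests that part of the reported gap may depend on how the totals were computed.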


Remaining ERR

• Even if the morphological rules may constitute an effective tool to analyze candidate words, which can then be manually checked by linguists in order to extend the lexical coverage of electronic dictionaries, it is clear that a very substantial part of the text’s different words remains to be tagged:

• 28,841 in the training corpus and 28,808 in the testing corpus;

• if all uppercase words were ignored, these could be reduced to about half: 13,469 in the training corpus and 13,641 in the testing corpus.


• It is possible that new morphological rules, yet to be built, may help to increase the lexical coverage of unknown words.

• But as we include new rules, these apply to an increasingly small number of word forms.

• We estimate that the number of Portuguese, correctly formed but unknown words analyzable by the method of suppletive morphological rules could still be increased up to 20 %.

• Furthermore, as new rules interact with previously made rules, the number of words with multiple analyses increases, thus diminishing the precision of the results.


What is the nature of the remaining unknown words?

The most common cases found were:

• spelling errors: many unknown words are simply due to typing or spelling errors: abanadonaria (abandonaria, ‘would abandon’), abastenção (abstenção, ‘abstention’), abatecimento (abastecimento, ‘supplying’), etc.

Some errors are due to conversion between character sets: bonificaÁões (bonificações, ‘bonifications’), consÛrcios (consórcios, ‘consortia’).
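Both character-set examples are consistent with Latin-1 bytes having been read as Mac OS Roman for a few characters (ç became Á, ó became Û). The slides do not say which conversion was involved, so the sketch below is an assumption-based illustration: it re-reads suspicious characters through Python's standard `mac_roman` and `latin-1` codecs. The function names and the heuristic for what counts as suspicious are ours.

```python
def remap_char(ch):
    """Re-read one character as if its Mac OS Roman byte were Latin-1."""
    try:
        return ch.encode("mac_roman").decode("latin-1")
    except UnicodeEncodeError:
        return ch          # no Mac OS Roman byte for this character: keep it


def repair(word):
    """Remap non-ASCII capitals that are not word-initial (assumed damage)."""
    return "".join(
        remap_char(ch) if i > 0 and ord(ch) > 127 and ch.isupper() else ch
        for i, ch in enumerate(word)
    )


for damaged in ("bonificaÁões", "consÛrcios"):
    print(damaged, "->", repair(damaged))
```

A repaired form would still have to be checked against the dictionary before being accepted, since an accented capital inside a word is only a hint of encoding damage, not proof.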


• words derived from proper names (mainly adjectives):

aladinescos (from Aladin), balzaquiano (Balzaquian), hartleyano (from Hartley)

hitchcockiana, hitchcoquiana (from Hitchcock; notice the orthographic adaptation to Portuguese spelling rules: -ck- > -qu-)

deskhomeinização (des- Pfx + Khomeini NProp + -iz Sfx-v + -ation Sfx-n, from Khomeini, NProp)


• foreign words: in real texts, and in journalistic texts in particular, there are many foreign words. Most of these come from English and French, but other languages can also be found:

accelerating access […] brick bricoleur brie briefing briefings bright brit british britpop broadcast broadsheet broken broker brokers […] chief child childhood children chill chills […] destroyed destroyer […] dégradable déja déjà délégations délire déluge démarche démarches démocratie démodé déplacement désir désire désordre […] engineer engineering engines english englishman […]fatwa fatwas faune faut faute fauteuil fauves faux […]grillons grind griots grip grisaille grizzly groove grossen grotesk […] handicapés handicappées handicaps handling hands handy hankseana hantavirus hants happen happened happy harakiri harandjita hard hardback hardcover […]international internazzionale internet […]jazzman jazzmen jazzy […] killer killers killing kills kilobits kilohertz kilowatts […] laid laird laisser laissez lait […] mailing mailings maillots main mainframe mainframes mainland mains mainstream […] notebook notebooks nothing nothingness […] opinion opium opposed opting option options […] partenaire partenaires partenership partial […] queries quest question […] rappel rapping rapport […] sell seller sellers selling selon […] talk talkie talkies talking talks tall […] und under underacting underground underplaying understand understanding underworld unfinished […] yiddish yields yodel yodelling yoga yokozuna yop yorker yoruba yorubas you young youngboy your yourself yuppie yuppies yuppy […] zappeur zapping […]


• From the examples above, it is clear that no real corpus is free from many of these problems.

• For robust (non-statistical) lexical analysis, several strategies must be used in combination with dictionaries and a suppletive morphological analyzer:

1. error detection and correction, comparing unknown forms and lexicalized forms by letter changing, permutation, and so on;

2. development of morphologic rules based on dictionaries of proper names;

3. language identification procedures, enabling the system to work with texts with mixed languages.
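Strategy (1) can be illustrated with a common single-edit candidate generator: every string obtained from the unknown form by one deletion, transposition, replacement or insertion is produced, and only candidates found in the lexicon are kept. The lexicon below is a hypothetical three-word stand-in for the dictionary, and the approach is a generic spelling-correction sketch, not a component of the system described here.

```python
# Hypothetical stand-in for the set of lexicalized forms.
LEXICON = {"abandonaria", "abstenção", "abastecimento"}

# Lowercase letters used in Portuguese spelling.
ALPHABET = "abcdefghijklmnopqrstuvwxyzáàâãçéêíóôõú"


def edits1(word):
    """All strings at one edit (delete, transpose, replace, insert) from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, lexicon=LEXICON):
    """Lexicalized forms reachable from `word` by a single edit."""
    return sorted(edits1(word) & lexicon)


for typo in ("abanadonaria", "abastenção", "abatecimento"):
    print(typo, "->", suggest(typo))
```

The three misspellings quoted earlier are each one edit away from the intended form; with a full dictionary several candidates may be returned, so some ranking (by frequency, for instance) would be needed.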


• While strategies (1) and (3) have already been put in place independently in spelling checkers in text editors (MS-Word, for instance) and web browsers, strategy (2) has not received much attention from (Portuguese) lexicographers, especially with a view to automatic lexical analysis.

• Strategy (2) combines encyclopedic dictionaries with morphological analyzers, an approach similar to the one shown here. However, to our knowledge, the combination of these strategies (and possibly others) in the same system has not been done yet.
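A rule of the kind envisaged in strategy (2) might look like the sketch below: an adjectival suffix is stripped, the -qu- spelling adaptation is undone, and the result is looked up in a list of proper names. The name list, the suffix inventory and the function names are all hypothetical; the examples are the forms quoted earlier.

```python
PROPER_NAMES = {"Hitchcock", "Balzac", "Hartley"}   # hypothetical name dictionary
SUFFIXES = ("iana", "iano", "ana", "ano")           # small sample of adjectival suffixes


def candidate_bases(stem):
    """Yield the stem itself and variants undoing the c/ck > qu adaptation."""
    yield stem
    if stem.endswith("qu"):
        yield stem[:-2] + "c"
        yield stem[:-2] + "ck"


def proper_name_base(word):
    """Return (name, suffix) if `word` derives from a known proper name."""
    for suffix in SUFFIXES:
        if not word.endswith(suffix):
            continue
        stem = word[: -len(suffix)]
        for base in candidate_bases(stem):
            name = base.capitalize()
            if name in PROPER_NAMES:
                return name, suffix
    return None


for form in ("hitchcockiana", "hitchcoquiana", "balzaquiano", "hartleyano"):
    print(form, "->", proper_name_base(form))
```

As with the derivational rules, the analysis is accepted only when the recovered base is attested, here in a dictionary of proper names rather than in the general lexicon.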


Conclusion

• From the results obtained so far, the precision of the morphological rules is high (90 % on average).

• It is clear that the goal of zero unknown tokens is still far from being achieved:

• less than 20 % of ERR was matched by means of suppletive morphological rules.

• In real life, there is no such thing as a ‘clean’ corpus: typos, foreign words, and derivatives of proper names are the sets of unknown tokens most responsible for this insufficiency in automatic lexical analysis.


• For robust lexical analysis of these forms, other strategies must be found,

• these may involve not only language identification procedures (and use of the corresponding dictionaries) but also correction of deviating or erroneous forms.

• The combination of different strategies in a single system may constitute both a linguistic and a computational challenge in the near future.


Acknowledgements

Research for this paper was partially funded by Fundação para a Ciência e a Tecnologia (project grant POSI/PLP/34729/99). Thanks are due to C. Mota for making available her DimAum module.

References

Eleutério, S.; Ranchhod, E.; Freire, H.; Baptista, J. 1995. A system of electronic dictionaries of Portuguese. Linguisticae Investigationes 17-2: 57-82. Amsterdam: John Benjamins B. V.

Ranchhod, E.; Mota, C.; Baptista, J. 1999. A Computational Lexicon of Portuguese for Automatic Text Parsing. SIGLEX’99: Standardizing Lexical Resources. Proceedings of a Workshop Sponsored by the Special Interest Group on the Lexicon of the Association for Computational Linguistics and the National Science Foundation (June 21-22, 1999, University of Maryland, College Park, Maryland, USA), pp. 74-80. Maryland: University of Maryland.

Baptista, J.; Faísca, J. 2001. Um filtro para palavras exóticas frequentes do Português. Seminários de Linguística 4: 65-86. Faro: UALG-FCHS/CELL.

Baptista, J.; Faísca, J. 2003. Mapping, filtering and measuring impact of ambiguous words of Portuguese. 6th Intex Workshop, Sofia, Bulgaria (May 28-30, 2003).

Silberztein, M. 2004. Intex Manual. http://intex.univ-fcomte.fr/downloads/Manual.pdf

Mota, C. 2003. A Renewed Portuguese Module for Intex 4.3x. 6th Intex Workshop, Sofia, Bulgaria (May 28-30, 2003).

Mota, C. 2000. Analysis of Derivational Morphology by Finite State Transducers. In Dister, A. (ed.), Actes des Troisièmes Journées INTEX, Revue Informatique et Statistique dans les Sciences Humaines 36, pp. 273-287. Université de Liège.

Ranchhod, E. 2001. O uso de dicionários e de autómatos finitos na representação lexical das línguas naturais. In Ranchhod, E. (org.), Tratamento das línguas por computador. Uma introdução à Linguística Computacional e suas aplicações, pp. 13-47. Lisboa: Caminho.