

Introduction to Natural Language Processing

MORPHOLOGY – TRANSDUCERS

Martin Rajman

[email protected]

and

Jean-Cédric Chappelier

[email protected]

Artificial Intelligence Laboratory

LIA I&C

Introduction to Natural Language Processing (CS-431)

M. Rajman, J.-C. Chappelier


Objectives of this lecture

➥ Present morphology, important part of NLP

➥ Introduce transducers, tools for computational morphology


Contents

➥ Morphology

➥ Transducers

➥ Operations and Regular Expressions on Transducers


Morphology

Study of the internal structure and the variability of the words in a language:

✏ verb conjugation

✏ plurals

✏ nominalization (enjoy → enjoyment)

➜ inflectional morphology: preserves the grammatical category

give given gave gives ...

➜ derivational morphology: change in category

process processing processable processor processability


Morphology (2)

Interest: use a priori knowledge about word structure to decompose it into morphemes and

produce additional syntactic and semantic information (on the current word)

processable → process- -able ☞ 2 morphemes

                        process-    -able
meaning:                process     possible
role:                   root        suffix
semantic information:   main        less

The importance and complexity of morphology vary from language to language

Some information represented at the morphological level in English may be represented

differently in other languages (and vice versa). The paradigmatic/syntagmatic distribution

changes from one language to another

Example in Chinese: ate → expressed as “eat yesterday”


Stems – Affixes

Words are decomposed into morphemes: roots (or stems) and affixes.

There are several kinds of affixes:

➊ prefixes: in- -credible

➋ suffixes: incred- -ible

➌ infixes:

Example in Tagalog (Philippines):

hingi (to borrow) → humingi (agent of the action)

In English slang: “fucking” inserted in the middle of a word, e.g. Man-fucking-hattan

➍ circumfixes:

Example in German:

sagen (to say) → gesagt (said)


Stems – Affixes (2)

Several affixes may be combined:

Example in Turkish, where a word can carry up to 10 (!) affixes:

uygarlaştıramadıklarımızdanmışsınızcasına

uygar laş tır ama dık lar ımız dan mış sınız casına
civilized +BEC +CAUS +NEGABLE +PPART +PL +P1PL +ABL +PAST +2PL +ASIF

“as if you are among those whom we could not cause to become civilized”

When only prefixes and suffixes are involved: concatenative morphology

Some languages are not concatenative:

• infixes

• pattern-based morphology


Example of Semitic languages

Pattern-based morphology

In Hebrew, the verb morphology is based on the association of

• a root, often made of 3 consonants, which indicates the main meaning,

• and a vocalic structure (insertion of vowels) that refines the meaning.

Example: LMD (learn or teach)

LAMAD → he was learning

LUMAD → he was taught


Computational Morphology

Let us consider inflectional morphology, for instance for verbs and nouns

Noun inflections: plural

General rule: +s

but several exceptions (e.g. foxes, mice)

Verb inflections: conjugations

• tense, mood

• regular/irregular

☞ How to handle inflections (computationally)?


Computational Morphology

Example: surface form: is

canonical representation at the lexicon level (formalization): be+3+s+Ind+Pres

The objective of computational morphology tools is precisely to go from one to the other:

• Analysis: Find the canonical representation corresponding to the surface form

• Generation: Produce the surface form described by the canonical representation

Challenge: have a ”good” implementation of these two transformations

Tools: associations of strings → transducers


String associations

(X1, X′1), ..., (Xn, X′n)

Examples: (eaten, eat), (processed, process), ..., (thought, think)

Easy situation: ∀i, |Xi| = |X′i|   Example: (abc, ABC)

⇒ represented as a sequence of character transductions

(abc, ABC) = (a,A)(b,B)(c,C)

☞ strings on a new alphabet: strings of character couples

Not so easy: if ∃i, |Xi| ≠ |X′i| ⇒ requires the introduction of the empty string ε

Example: (ab, ABC) ≃ (εab, ABC) = (ε,A)(a,B)(b,C)


Dealing with ε

Where to put the ε?

Example: (ab, ABC) ≃ (εab, ABC)

but also (ab, ABC) ≃ (aεb, ABC)

or (ab, ABC) ≃ (abε, ABC)

General case: C(n, m) = n! / (m! (n − m)!) possible placements when aligning strings of lengths m and n (with m < n)

Hard problem in general → need for a convention
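As an illustration of why a convention is needed, the following minimal sketch enumerates every way of padding the shorter string with ε. The function names `pad_with_epsilon` and `alignments` are hypothetical, and the sketch assumes ε is only ever inserted into the shorter (first) string.

```python
from itertools import combinations

EPS = "ε"

def pad_with_epsilon(short, long_len):
    """Yield every way to pad `short` with ε up to length `long_len`."""
    # Choose which positions keep the original characters (in order);
    # every other position receives ε.
    for kept in combinations(range(long_len), len(short)):
        padded = [EPS] * long_len
        for pos, ch in zip(kept, short):
            padded[pos] = ch
        yield "".join(padded)

def alignments(x, y):
    """Pair each padded version of x with y, couple by couple (needs len(x) <= len(y))."""
    return [list(zip(p, y)) for p in pad_with_epsilon(x, len(y))]

for couples in alignments("ab", "ABC"):
    print(couples)
```

For (ab, ABC) this lists the three placements given above, in agreement with C(3, 2) = 3.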


Transducer (definition)

Let Σ1 and Σ2 be two enumerable sets (alphabets), and

Σ = ((Σ1 ∪ {ε}) × (Σ2 ∪ {ε})) \ {(ε, ε)}

A transducer is a DFSA on Σ

Σ1: “left” language, upper language, input language

Σ2: “right” language, lower language, output language
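The couple alphabet Σ of this definition can be built explicitly for small alphabets. This is only a sketch: the helper name `pair_alphabet` is an assumption, and alphabets are represented as Python strings of single characters.

```python
from itertools import product

EPS = "ε"

def pair_alphabet(sigma1, sigma2):
    """Return ((Σ1 ∪ {ε}) × (Σ2 ∪ {ε})) minus the couple (ε, ε)."""
    upper = set(sigma1) | {EPS}
    lower = set(sigma2) | {EPS}
    return set(product(upper, lower)) - {(EPS, EPS)}

sigma = pair_alphabet("ab", "ab")
print(len(sigma))  # (2 + 1) * (2 + 1) - 1 couples
```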


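Example

[Figure: state diagram of a transducer with three states (0, 1 and 2), with the initial state and the final state(s) marked, and arcs labelled b, a:b, a, b:a, b:ε]

Some transductions: (bb, b) [0,0,2]   (ababb, baab) [0,1,2,0,0,2]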


Different usages of a transducer

➊ association checking (abba, baaa)∈ Σ∗ ?

➋ Generation: string1 → string2 bbab→ ?

➌ Analysis: string2 → string1 ? → ba

➊: easy: (= FSA: nothing special)

What about ➋ and ➌ ?
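Usage ➊ amounts to running an ordinary deterministic automaton over a string of couples. The sketch below makes this concrete. The arc table is an assumption: it contains only the arcs that can be read off the state paths quoted on these slides (the original figure may contain more), with state 0 assumed initial and state 2 assumed final.

```python
EPS = "ε"

# Hypothetical arc table: (state, (upper, lower)) -> next state.
DELTA = {
    (0, ("b", "b")): 0,
    (0, ("a", "b")): 1,
    (0, ("b", EPS)): 2,
    (1, ("a", "a")): 2,
    (1, ("b", "a")): 2,
    (2, ("a", "a")): 0,
}
INITIAL, FINALS = 0, {2}

def accepts(couples):
    """Deterministic walk: one table lookup per couple, no backtracking."""
    state = INITIAL
    for couple in couples:
        state = DELTA.get((state, couple))
        if state is None:  # no arc for this couple: reject
            return False
    return state in FINALS

# (bb, b) written as the couple string (b,b)(b,ε), path [0,0,2]
print(accepts([("b", "b"), ("b", EPS)]))
```

Because the automaton is deterministic on couples, each step is a single lookup; the difficulty with ➋ and ➌ comes from knowing only one side of each couple.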


Transduction

Walk through the FSA following one or the other element of the couple (projections)

❢ ☞ not deterministic in general!

The fact that a transducer is a deterministic (couple-)FSA does not at all imply that the

automaton resulting from one projection or the other is also deterministic!

non-deterministic evaluation

backtracking on ”wrong” solutions

⇒ The projection is not constant time (in general)

When a transducer is deterministic with respect to one projection or the other, it is called a

sequential transducer

A transducer is not sequential in general. In particular, if one language or the other (upper

or lower) is not finite, there is no guarantee that a sequential transducer can be produced.


Transduction (2)

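Example: bbab → ?

[Figure: search tree over the states of the example transducer, starting from state 0 and following the arcs whose upper symbol matches the input (b:b, b:ε, a:b, b:a, a:a); several branches are explored]

bbab → bbba   bbab → ba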


Transduction (3)

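Example: ? → ba

[Figure: search tree over the states of the example transducer, starting from state 0 and following the arcs whose lower symbol matches the output (b:b, b:ε, a:a, a:b, b:a); some branches end in FAIL]

aa → ba   ab → ba   bbab → ba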
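The two searches above can be sketched as a single backtracking procedure that follows one projection of the arcs and collects what the other projection writes. As before, the arc list is an assumption reconstructed from the paths quoted on the slides, with state 0 initial and state 2 final; `transduce` is a hypothetical name. The sketch also assumes there is no cycle of arcs that all read ε on the known side, otherwise the recursion would not terminate.

```python
EPS = "ε"

# Hypothetical arcs: (source, upper, lower, target).
ARCS = [
    (0, "b", "b", 0), (0, "a", "b", 1), (0, "b", EPS, 2),
    (1, "a", "a", 2), (1, "b", "a", 2), (2, "a", "a", 0),
]
INITIAL, FINALS = 0, {2}

def transduce(known, side):
    """side=0: `known` is the upper string (generation);
    side=1: `known` is the lower string (analysis)."""
    results = set()

    def explore(state, pos, produced):
        if pos == len(known) and state in FINALS:
            results.add(produced)
        for arc in ARCS:
            if arc[0] != state:
                continue
            read, written = arc[1 + side], arc[2 - side]
            out = produced + ("" if written == EPS else written)
            if read == EPS:
                # Nothing consumed on the known side.
                explore(arc[3], pos, out)
            elif pos < len(known) and known[pos] == read:
                explore(arc[3], pos + 1, out)
        # Returning here is the backtracking step.

    explore(INITIAL, 0, "")
    return results

print(sorted(transduce("bbab", side=0)))  # generation
print(sorted(transduce("ba", side=1)))    # analysis
```

Several arcs leaving the same state can share the same symbol on one side (for example b:b and b:ε from state 0), which is why both directions need a search even though the couple automaton itself is deterministic.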


Operations and Regular Expressions

on Transducers

➮ All FSA regular expressions: concatenation, or, Kleene closure (*), ...

Example: (concatenation) “a:b c:a” recognizes ac and produces ba

➮ cross-product of regular languages: E1 ⊗ E2 recognizes L1 × L2

example: a+ ⊗ b+ → (a^n, b^m) ∀ n ≥ 1, m ≥ 1 !! this is ≠ (a ⊗ b)+

➮ Composition of transducers: T = T1 ◦ T2

(X1, X2) ∈ T ⇐⇒ ∃Y : (X1, Y ) ∈ T1 and (Y,X2) ∈ T2

➮ Reduction: extraction of the upper or the lower FSA
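To make the semantics of these operations concrete, here is a minimal sketch on finite relations, represented as Python sets of (upper, lower) string pairs. This is an assumption made for illustration: real transducers represent possibly infinite relations and implement these operations on automata. The names `compose`, `cross` and `project` are hypothetical.

```python
def compose(t1, t2):
    """(x, z) is in T1 ∘ T2 iff some y has (x, y) in T1 and (y, z) in T2."""
    return {(x, z) for (x, y1) in t1 for (y2, z) in t2 if y1 == y2}

def cross(l1, l2):
    """Cross-product: every string of L1 paired with every string of L2."""
    return {(x, y) for x in l1 for y in l2}

def project(t, side):
    """Reduction: keep only the upper (side=0) or lower (side=1) strings."""
    return {pair[side] for pair in t}

# a+ ⊗ b+ versus (a ⊗ b)+, both truncated to strings of length <= 2
free = cross({"a", "aa"}, {"b", "bb"})
paired = {("a" * n, "b" * n) for n in (1, 2)}
print(("a", "bb") in free, ("a", "bb") in paired)

print(compose({("ac", "ba")}, {("ba", "BA")}))
```

The pair (a, bb) belongs to the cross-product but not to (a ⊗ b)+, which only pairs strings of equal length.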


(Other) examples of applications

(morphology)

★ text-to-speech (grapheme to phoneme transduction)

★ specific lexicon representation (composition of some access and inverse functions)

★ filters (remove/add/modify marks; e.g. HTML)

★ text segmentation


Computational morphology using transducers

Use of composition:

➛ Identification of a paradigm (T1)

➛ Implementation of this paradigm (T2)

➛ Exception handling (T3)

Example: input: chat+NP, fox+NP, ... (”+NP” means ”noun plural”)

T1: ([a-z]+)(\+NP ⊗ \+1) paradigm identification: plural nouns (trivial here: only one paradigm (+1))

T2: ([a-z]+)(\+1 ⊗ \+Xs) plural inflection of nouns (regular part)

T3: ([a-z]+)(h\+Xs ⊗ hes | x\+Xs ⊗ xen | ... | [^hx...](\+X ⊗ ε)s) correction of exceptions

T1 ◦ T2 ◦ T3: plural for nouns


Computational morphology using transducers (2)

Detailed example on the plural of nouns:

general case: add a terminal ’s’

cat+NP → cats, dog+NP → dogs, ...

Exceptions (several kinds):

• fly → flies

• fox → foxes, but ox → oxen!

• ...

Method: find all the paradigms (linguists’ role) and implement a transducer for each of them

☞ add the paradigm identification in the lexical description


Keypoints

➟ Inflectional and derivational morphology, their roles

➟ Main functions of transducers: association checking, generation and analysis

➟ Deterministic and non-deterministic nature of transduction


References

E. Roche, Y. Schabes, Finite-State Language Processing, pp. 14-63, 67-96, A Bradford Book, 1997.


A quick reminder about noun plurals in English

Computational Linguistics

Martin Rajman

Artificial Intelligence Laboratory


Fully regular plurals

• default rule:

Add “s” to the end of the singular form

• Examples: (dog, dogs)

(arrow, arrows)

...


Semi-regular plurals

Some “regular” plurals need to be modified to be easy to pronounce (“euphonic rules”)

• Euphonic rule 1: if the singular noun ends in “s”, “x”, “z”, “ch”, or “sh”, add “es” instead of “s”

(guess, guesses)

(box, boxes)

(buzz, buzzes)

(catch, catches)

(dish, dishes)

... but (systematic exception) if the final “ch” is pronounced “k”, add “s” instead of “es”

(stomach, stomachs)

as well as some fully irregular exceptions

(ox, oxen)


Semi-regular plurals (2)

• Euphonic rule 2: if the singular noun ends in a consonant followed by “y”, change the “y” to “ies”

(baby, babies)

(fly, flies)

Note: there must be a consonant before the “y”...

(boy, boys)

(buy, buys)
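The default rule and the two euphonic rules can be sketched as a small spelling-based function. This is only an illustration under stated assumptions: the name `plural` is hypothetical, vowels are taken to be a, e, i, o, u, and the pronunciation-based exception (stomach, stomachs) and the irregular forms (ox, oxen) are deliberately not handled.

```python
import re

VOWELS = "aeiou"

def plural(noun):
    """Spelling-only sketch of the default rule and euphonic rules 1 and 2."""
    # Euphonic rule 1: endings s, x, z, ch, sh take "es".
    if re.search(r"(s|x|z|ch|sh)$", noun):
        return noun + "es"
    # Euphonic rule 2: consonant + y becomes consonant + ies.
    if len(noun) >= 2 and noun.endswith("y") and noun[-2] not in VOWELS:
        return noun[:-1] + "ies"
    # Default rule: add "s".
    return noun + "s"

for word in ["dog", "box", "catch", "baby", "boy"]:
    print(word, plural(word))
```

Since the sketch only looks at spelling, it would wrongly return “stomaches” for “stomach”; handling that case requires information about pronunciation that the written form does not provide.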


Irregular plurals

• Mass nouns (aka uncountable nouns) have no plural form

(hair, ---)

(mud, ---)

... but the regular plurals may also be acceptable in specific contexts:

“Her hair is black” ... but ... “I saw at least one grey hair, and there are probably more grey hairs there”

(hair, hairs)

“They throw mud at each other” ... but ... “These subterranean muds are being removed”

(mud, muds)


Irregular plurals (2)

• Invariant nouns (aka invariable nouns) do not change when inflected to the plural

“Deer have antlers”

Note that there is a (subtle) difference between a noun having no plural (i.e. being uncountable) and a noun having a plural form that is the same as the singular one

uncountable: “Her hair is black” is correct, while “Her hair are black” is not

invariable: “This deer is fast” and “Deer are fast” are both correct (but do not mean the same)


Other irregular plurals

• Case 1: For most nouns ending in “f” or “fe”, change the ending “f” or “fe” to “ves”

(half, halves)

(knife, knives)

... but

(belief, beliefs)

(if, ifs) “There are so many ifs and buts in this policy”


Other irregular plurals (2)

• Case 2: For most nouns ending in “is”, change the ending “is” to “es”

(crisis, crises)

(hypothesis, hypotheses)

... but

(vis, vires)

where “vis” is a Latin word meaning “power” that has been imported in English, while preserving its Latin plural (“vires”)

“An example of vis is the influence of the leader”


Other irregular plurals (3)

• Case 3: For many nouns ending in “o”, change the ending “o” to “oes”

(tomato, tomatoes)

(mosquito, mosquitoes)

(volcano, volcanoes)

... but

(photo, photos)

(video, videos)

(piano, pianos)


Fully irregular plurals

• For some (often very frequent) words, the plural corresponds to a much more complicated modification

(man, men)

(mouse, mice)

(foot, feet)

(tooth, teeth)

...


Computational morphology for English nouns

Computational Linguistics

Martin Rajman

Artificial Intelligence Laboratory


Fundamentals

• Goal: use transducers to represent associations between strings representing:

• surface forms, i.e. words as they appear in texts; and

• canonical representations, i.e. formal representations of the morphological analysis of these words

• Examples of surface forms:

cats, book, flies, ...

• Examples of canonical representations:

cat+N+p, book+N+s, fly+N+p, ...


Canonical representations

The typical format of a canonical representation is:

Lemma+GrammaticalCategory+MorphoSyntacticFeature1+MorphoSyntacticFeature2+...

where:

• Lemma (or Root) is the canonical form of an inflected word; i.e. the form usually found in dictionaries, e.g. the singular form for nouns, or the infinitive for verbs;

• GrammaticalCategory (or Part-of-Speech) is the tag used to represent the grammatical category of the word, e.g. N for a noun, Adj for an adjective, or V for a verb;

• MorphoSyntacticFeaturek (k=1, 2, 3, ...) are the tags used to represent the morphosyntactic features (e.g. the number, the gender, the tense, the person, etc.) that are relevant to identify a specific inflection of a word;

and

• "+" is a (conventional) separating character.
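The format above can be split back into its fields with a few lines of code. This is a minimal sketch: the `Canonical` type and the `parse_canonical` function are hypothetical names, and the sketch assumes that the lemma itself never contains the separator.

```python
from typing import NamedTuple, Tuple

class Canonical(NamedTuple):
    lemma: str
    category: str
    features: Tuple[str, ...]

def parse_canonical(text, sep="+"):
    """Split Lemma+GrammaticalCategory+Feature1+Feature2+... into fields."""
    # Raises ValueError if the lemma or the category is missing.
    lemma, category, *features = text.split(sep)
    return Canonical(lemma, category, tuple(features))

print(parse_canonical("cat+N+p"))
print(parse_canonical("turn+V+Ind+Pres+3+s"))
```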


Examples of canonical representations

• (cat+N+p, cats): associating the canonical representation "cat+N+p" with the surface form "cats" expresses formally that "cats" is the inflection of the noun "cat" corresponding to its plural form ("p" being the tag for the value "plural" of the morphosyntactic feature "number");

• (turn+V+Ind+Pres+3+s, turns): associating the canonical representation "turn+V+Ind+Pres+3+s" with the surface form "turns" expresses formally that "turns" is the inflection of the verb ("V") "to turn" in the 3rd person ("3") singular ("s") of the present ("Pres") indicative ("Ind").


In other words...

Implementing computational morphology for English nouns means finding an efficient way of representing a potentially very large set of (canonical representation, surface form) associations, such as:

(cat+N+s, cat) (cat+N+p, cats)
(book+N+s, book) (book+N+p, books)
(fly+N+s, fly) (fly+N+p, flies)
(fox+N+s, fox) (fox+N+p, foxes)
(deer+N+s, deer) (deer+N+p, deer)
(mouse+N+s, mouse) (mouse+N+p, mice)
(ox+N+s, ox) (ox+N+p, oxen)
...

By "efficient way", we mean a method that:

- allows all the targeted associations to be described without having to write them explicitly one by one;

- provides a computational mechanism with a low algorithmic complexity able to produce the surface form(s) associated with a given canonical representation ("generation"), or the canonical representation(s) associated with a given surface form ("analysis")


How to do this with transducers?

The idea is to use the composition T1 o T2 o T3 of 3 transducers:

1. a transducer T1 that identifies the morphological paradigm, i.e. the systematic transformation rule(s) to be implemented for regular forms

2. a transducer T2 that implements the identified systematic rule(s)

3. a transducer T3 that handles all the exceptions to the implemented rules


T1 : Identifying the morphological paradigm

• In English, the morphology of regular noun plurals is very simple, as it corresponds to a single systematic rule

• The morphological paradigm thus consists of only one rule, arbitrarily numbered here as rule 1

• T1 is therefore the transducer that associates a canonical representation of the form “root+N+p”, where root is any possible nominal root, to the intermediate string “root+1”:

T1 = ([a-z]+)((\+N\+p)x(\+1))

where “x” represents the “cross-product” operator, “\” is an escape character that prevents “+” from being interpreted as the Kleene plus operator, and “[a-z]” represents any lowercase letter
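The generation direction of T1 can be imitated with an ordinary regular expression. This is only a sketch: Python's `re` module is not a transducer library and the function works in one direction only, but it shows which strings T1 accepts and what it writes. The function name `t1` is an assumption.

```python
import re

# Upper side of T1: a lowercase root followed by the literal "+N+p".
T1_PATTERN = re.compile(r"([a-z]+)\+N\+p")

def t1(canonical):
    """Map "root+N+p" to "root+1"; return None when T1 does not apply."""
    match = T1_PATTERN.fullmatch(canonical)
    return match.group(1) + "+1" if match else None

print(t1("cat+N+p"))
```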


T1 : Example

When applied to the list

(cat+N+p, book+N+p, fly+N+p, fox+N+p, deer+N+p, mouse+N+p, ox+N+p)

T1 represents the following list of associations

(cat+N+p, cat+1), (book+N+p, book+1), (fly+N+p, fly+1), (fox+N+p, fox+1), (deer+N+p, deer+1), (mouse+N+p, mouse+1), (ox+N+p, ox+1)


T2 : Implementing the morphological paradigm

• The identified (single) systematic rule for English regular noun plurals is:

Add “s” to the end of the root (as, for nouns, the root corresponds to the singular form)

• T2 is therefore the transducer that associates an intermediate string of the form “root+1” to a new intermediate string of the form “rootXs”, where the character X (called the “trace”) identifies the “border” between the root and the suffix “s”:

T2 = ([a-z]+)((\+1)x(Xs))

Note: placing a trace X in the new intermediate string will make it easier to handle the various exceptions to be implemented in T3
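In the same spirit, the generation direction of T2 can be imitated with a regular expression. As before this is a one-way sketch rather than a real transducer, and the function name `t2` is an assumption.

```python
import re

# Upper side of T2: a lowercase root followed by the literal "+1".
T2_PATTERN = re.compile(r"([a-z]+)\+1")

def t2(intermediate):
    """Map "root+1" to "rootXs", keeping the trace X between root and suffix."""
    match = T2_PATTERN.fullmatch(intermediate)
    return match.group(1) + "Xs" if match else None

print(t2("fly+1"))
```

Keeping the trace X in the output is what later lets an exception rule tell the final “y” of the root apart from the suffix.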


T2 : Example

When applied to the list (resulting from T1)

(cat+1, book+1, fly+1, fox+1, deer+1, mouse+1, ox+1)

T2 represents the following list of associations

(cat+1, catXs), (book+1, bookXs), (fly+1, flyXs), (fox+1, foxXs), (deer+1, deerXs), (mouse+1, mouseXs), (ox+1, oxXs)


T3 : Handling the exceptions

• In this illustrative example, we will only consider 2 types of exceptions:

1. Euphonic rule 1 (simplified) :

If the root ends in “x”, change the ending “x” to “xes”

2. Euphonic rule 2 (simplified) :

If the root ends in “y”, change the ending “y” to “ies”

• T3 is therefore the transducer that associates an intermediate string of the form “rootxXs” (resp. “rootyXs”) to a new intermediate string of the form “rootxes” (resp. “rooties”), where “rootx” (resp. “rooty”) is any root ending in “x” (resp. “y”) :

T3 = ([a-z]+)(((xXs)x(xes))|((yXs)x(ies))|([^xy]((Xs)x(s))))

where “[^xy]” represents any character but “x” or “y”
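The three alternatives of T3 can be mirrored one by one with regular-expression rewrites. This is a sketch of the generation direction only, with hypothetical names `T3_RULES` and `t3`; each rule pairs the upper side of one alternative with a template for its lower side.

```python
import re

# One (pattern, template) pair per alternative of T3, in the order given above.
T3_RULES = [
    (re.compile(r"([a-z]+)xXs"), r"\1xes"),      # root ending in x
    (re.compile(r"([a-z]+)yXs"), r"\1ies"),      # root ending in y
    (re.compile(r"([a-z]+[^xy])Xs"), r"\1s"),    # any other ending: drop the trace
]

def t3(intermediate):
    """Apply the first alternative that matches the whole string."""
    for pattern, template in T3_RULES:
        match = pattern.fullmatch(intermediate)
        if match:
            return match.expand(template)
    return None

for s in ["catXs", "flyXs", "foxXs"]:
    print(s, t3(s))
```

The three patterns are mutually exclusive, since the character just before the trace is either “x”, “y”, or something else, so the order of the rules does not affect the result.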


T3 : Example

When applied to the list (resulting from T2)

(catXs, bookXs, flyXs, foxXs, deerXs, mouseXs, oxXs)

T3 represents the following list of associations

(catXs, cats), (bookXs, books), (flyXs, flies), (foxXs, foxes), (deerXs, deers), (mouseXs, mouses), (oxXs, oxes)


T1 o T2 o T3 : Example

When applied to the original list

(cat+N+p, book+N+p, fly+N+p, fox+N+p, deer+N+p, mouse+N+p, ox+N+p)

T1 o T2 o T3

represents the following list of associations

(cat+N+p, cats), (book+N+p, books), (fly+N+p, flies), (fox+N+p, foxes), (deer+N+p, deers), (mouse+N+p, mouses), (ox+N+p, oxes)

where the first 4 associations are correct, but the last 3 (deers, mouses, oxes) are erroneous and would require a more sophisticated definition of the transducer T3 responsible for handling the exceptions
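The whole chain can be sketched end to end by composing three one-way rewriting functions. This is an illustration under stated assumptions, not a transducer implementation: the helpers `rewrite`, `union` and `compose` are hypothetical, they only cover the generation direction, and regular expressions stand in for the transducers defined above.

```python
import re

def rewrite(pattern, template):
    """Build a function mapping a fully matching string to its rewritten form."""
    compiled = re.compile(pattern)
    def apply(s):
        match = compiled.fullmatch(s) if s is not None else None
        return match.expand(template) if match else None
    return apply

def union(*alternatives):
    """Return the result of the first alternative that applies."""
    def apply(s):
        for alt in alternatives:
            out = alt(s)
            if out is not None:
                return out
        return None
    return apply

def compose(*steps):
    """Feed the output of each step into the next one."""
    def apply(s):
        for step in steps:
            s = step(s)
        return s
    return apply

t1 = rewrite(r"([a-z]+)\+N\+p", r"\1+1")
t2 = rewrite(r"([a-z]+)\+1", r"\1Xs")
t3 = union(
    rewrite(r"([a-z]+)xXs", r"\1xes"),
    rewrite(r"([a-z]+)yXs", r"\1ies"),
    rewrite(r"([a-z]+[^xy])Xs", r"\1s"),
)
pluralize = compose(t1, t2, t3)

for root in ["cat", "book", "fly", "fox", "deer", "mouse", "ox"]:
    print(root, pluralize(root + "+N+p"))
```

The sketch reproduces the same three erroneous forms as the slide, since nothing in this simplified T3 knows about invariant or fully irregular nouns.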