Introduction to Natural Language Processing

MORPHOLOGY – TRANSDUCERS

Martin Rajman

Martin.Rajman@epfl.ch

and

Jean-Cedric Chappelier

Jean-Cedric.Chappelier@epfl.ch

Artificial Intelligence Laboratory

LIAI&C

Introduction to Natural Language Processing (CS-431)

M. Rajman, J.-C. Chappelier

Objectives of this lecture

➥ Present morphology, an important part of NLP

➥ Introduce transducers, tools for computational morphology


Contents

➥ Morphology

➥ Transducers

➥ Operations and Regular Expressions on Transducers


Morphology

Study of the internal structure and the variability of the words in a language:

✏ verb conjugation

✏ plurals

✏ nominalization (enjoy → enjoyment)

➜ inflectional morphology: preserves the grammatical category

give given gave gives ...

➜ derivational morphology: may change the grammatical category

process processing processable processor processability


Morphology (2)

Interest: use a priori knowledge about word structure to decompose a word into morphemes and produce additional syntactic and semantic information about it.

processable → process- + -able   ☞ 2 morphemes

                        process-    -able
meaning:                process     possible
role:                   root        suffix
semantic information:   main        less

The importance and complexity of morphology vary from language to language.

Some information represented at the morphological level in English may be represented differently in other languages (and vice versa). The paradigmatic/syntagmatic distribution changes from one language to another.

Example in Chinese: "ate" is expressed as "eat yesterday"


Stems – Affixes

Words are decomposed into morphemes: roots (or stems) and affixes.

There are several kinds of affixes:

➊ prefixes: in- -credible

➋ suffixes: incred- -ible

➌ infixes:

Example in Tagalog (Philippines):

hingi (to borrow) → humingi (agent of the action)

Also in English slang: "fucking" inserted in the middle of a word, as in Man-fucking-hattan

➍ circumfixes:

Example in German:

sagen (to say) → gesagt (said)


Stems – Affixes (2)

Several affixes may be combined. Example in Turkish, where a word can carry up to 10 (!) affixes:

uygarlaştıramadıklarımızdanmışsınızcasına

uygar      laş   tır    ama       dık     lar  ımız   dan   mış    sınız  casına
civilized  +BEC  +CAUS  +NEGABLE  +PPART  +PL  +P1PL  +ABL  +PAST  +2PL   +ASIF

"as if you are among those whom we could not cause to become civilized"

When only prefixes and suffixes are involved: concatenative morphology

Some languages are not concatenative:

• infixes

• pattern-based morphology


Example of Semitic languages

Pattern-based morphology

In Hebrew, the verb morphology is based on the association of

• a root, often made of 3 consonants, which indicates the main meaning,

• and a vocalic structure (insertion of vowels) that refines the meaning.

Example: LMD (learn or teach)

LAMAD → he was learning

LUMAD → he was taught
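The root-and-pattern idea above can be sketched in a few lines. This is only an illustration under a strong simplifying assumption: the pattern is reduced to one vowel between each pair of consecutive root consonants, and the function name `interdigitate` is hypothetical. Real Semitic templates are richer than this.

```python
def interdigitate(root, vowels):
    """Interleave the root consonants with the pattern vowels: C1 v1 C2 v2 C3."""
    if len(vowels) != len(root) - 1:
        raise ValueError("expected one vowel between each pair of consonants")
    out = []
    for i, consonant in enumerate(root):
        out.append(consonant)
        if i < len(vowels):
            out.append(vowels[i])  # vowel slot following this consonant
    return "".join(out)


print(interdigitate("LMD", "AA"))  # LAMAD
print(interdigitate("LMD", "UA"))  # LUMAD
```

The point of the sketch is that the root and the vowels are interleaved rather than concatenated, which is why such morphology is not concatenative.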


Computational Morphology

Let us consider inflectional morphology, for instance for verbs and nouns

Noun inflections: plural

General rule: +s

but several exceptions (e.g. foxes, mice)

Verb inflections: conjugations

• tense, mood

• regular/irregular

☞ How to handle inflections (computationally)?


Computational Morphology

Example: surface form: is

canonical representation at the lexicon level (formalization): be+3+s+Ind+Pres

The objective of computational morphology tools is precisely to go from one to the other:

• Analysis: Find the canonical representation corresponding to the surface form

• Generation: Produce the surface form described by the canonical representation

Challenge: have a "good" implementation of these two transformations

Tools: associations of strings → transducers


String associations

A set of string associations:

$(X_1, X'_1), \ldots, (X_n, X'_n)$

for instance:

(eaten, eat), (processed, process), ..., (thought, think)

Easy situation: $\forall i,\ |X_i| = |X'_i|$. Example: (abc, ABC)

⇒ represented as a sequence of character transductions

(abc, ABC) = (a,A)(b,B)(c,C)

☞ strings on a new alphabet: strings of character couples

Not so easy: if $\exists i,\ |X_i| \neq |X'_i|$ ⇒ requires the introduction of the empty string ε

Example: (ab, ABC) ≃ (εab, ABC) = (ε,A)(a,B)(b,C)
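As a minimal sketch of this encoding, the function below turns a string association into a string of character couples. It assumes one particular convention, padding the shorter string with ε at the front, which is the one used in the (ab, ABC) example. The names `to_pairs` and `from_pairs` are hypothetical, and ε is represented by the empty Python string.

```python
EPS = ""  # stands for the empty string ε


def to_pairs(upper, lower):
    """Pad the shorter string with ε at the front, then pair characters one by one."""
    width = max(len(upper), len(lower))
    up = [EPS] * (width - len(upper)) + list(upper)
    lo = [EPS] * (width - len(lower)) + list(lower)
    return list(zip(up, lo))


def from_pairs(pairs):
    """Recover the two strings by projecting on each side (ε contributes nothing)."""
    return "".join(u for u, _ in pairs), "".join(l for _, l in pairs)


print(to_pairs("abc", "ABC"))  # [('a', 'A'), ('b', 'B'), ('c', 'C')]
print(to_pairs("ab", "ABC"))   # [('', 'A'), ('a', 'B'), ('b', 'C')]
```

Because only the shorter side is padded, the couple (ε, ε) can never be produced.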


Dealing with ε

Where to put the ε?

Example: (ab, ABC) ≃ (εab, ABC)

but also (ab, ABC) ≃ (aεb, ABC)

or (ab, ABC) ≃ (abε, ABC)

General case: $\binom{n}{m}$ possible placements (with $m < n$ the lengths of the two strings)

Hard problem in general → need for a convention
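The count can be checked by brute force. The sketch below enumerates every way of stretching the shorter string (length m) to length n by inserting ε symbols, assuming ε is only ever inserted in the shorter string. The helper name `epsilon_placements` is hypothetical.

```python
from itertools import combinations
from math import comb

EPS = "ε"


def epsilon_placements(short, n):
    """All ways to stretch `short` (length m) to length n by inserting ε symbols."""
    m = len(short)
    results = []
    # Choose which of the n positions hold an ε; the rest receive the original characters in order.
    for eps_positions in combinations(range(n), n - m):
        chars = iter(short)
        padded = [EPS if i in eps_positions else next(chars) for i in range(n)]
        results.append("".join(padded))
    return results


placements = epsilon_placements("ab", 3)
print(placements)                     # ['εab', 'aεb', 'abε']
print(len(placements) == comb(3, 2))  # True
```

A fixed convention (for instance "always pad at the front") picks one of these placements once and for all.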


Transducer (definition)

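Let $\Sigma_1$ and $\Sigma_2$ be two enumerable sets (alphabets), and

$\Sigma = \big( (\Sigma_1 \cup \{\varepsilon\}) \times (\Sigma_2 \cup \{\varepsilon\}) \big) \setminus \{(\varepsilon, \varepsilon)\}$

A transducer is a DFSA on $\Sigma$

$\Sigma_1$: "left" language, also called upper language or input language

$\Sigma_2$: "right" language, also called lower language or output language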


Example

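[Figure: transducer with three states (0, 1, 2), with the initial state and the final state(s) marked; arc labels include b, a, a:b, b:a and b:ε (a single symbol x abbreviates x:x).]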

Some transductions: (bb,b) [0,0,2] (ababb,baab) [0,1,2,0,0,2]


Different usages of a transducer

➊ association checking (abba, baaa)∈ Σ∗ ?

➋ Generation: string1 → string2 bbab→ ?

➌ Analysis: string2 → string1 ? → ba

➊: easy: (= FSA: nothing special)

What about ➋ and ➌ ?
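Usage ➊ can be sketched as an ordinary deterministic automaton whose symbols are couples. The transition table below is an assumption: it is reconstructed from the state paths and transductions quoted in this lecture, and the original figure may contain arcs that are not listed here. The input is taken to be already aligned, i.e. given as a string of couples, with ε written as the empty string.

```python
EPS = ""  # stands for ε

# (state, (upper symbol, lower symbol)) -> next state; reconstructed, possibly incomplete
TRANSITIONS = {
    (0, ("b", "b")): 0,
    (0, ("b", EPS)): 2,
    (0, ("a", "b")): 1,
    (1, ("b", "a")): 2,
    (1, ("a", "a")): 2,
    (2, ("a", "a")): 0,
}
INITIAL, FINALS = 0, {2}


def check(pairs):
    """Return the state path if the string of couples is accepted, else None."""
    state, path = INITIAL, [INITIAL]
    for pair in pairs:
        state = TRANSITIONS.get((state, pair))
        if state is None:  # no arc for this couple: reject
            return None
        path.append(state)
    return path if state in FINALS else None


print(check([("b", "b"), ("b", EPS)]))  # (bb, b) -> [0, 0, 2]
print(check([("a", "b"), ("b", "a"), ("a", "a"), ("b", "b"), ("b", EPS)]))  # (ababb, baab) -> [0, 1, 2, 0, 0, 2]
```

Each couple selects at most one arc, so checking takes one table lookup per couple, exactly as for a plain FSA.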


Transduction

Walk through the FSA following one or the other element of the couple (projections)

❢ ☞ not deterministic in general!

The fact that a transducer is a deterministic (couple-)FSA does not at all imply that the automaton resulting from one projection or the other is also deterministic!

non-deterministic evaluation

backtracking on "wrong" solutions

⇒ The projection is not constant time (in general)

When a transducer is deterministic with respect to one projection or the other, it is called a sequential transducer.

A transducer is not sequential in general. In particular, if one language or the other (upper or lower) is not finite, there is no guarantee that a sequential transducer can be produced.


Transduction (2)

Example: bbab→ ?

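[Figure: search tree for the input bbab, starting from state 0 and branching on the arcs whose upper symbol matches the next input character (arcs such as b:b, b:ε, a:b, b:a, a:a); only two branches succeed.]

bbab → bbba    bbab → ba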


Transduction (3)

Example:? → ba

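[Figure: search tree for the output ba, starting from state 0 and branching on the arcs whose lower symbol matches the next output character or is ε; two branches end in FAIL.]

aa → ba    ab → ba    bbab → ba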
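Both directions can be sketched with a single backtracking search over the arcs. As before, the arc list is an assumption reconstructed from the paths and results quoted in this lecture (the original figure may contain more arcs), and the function name `transduce` is hypothetical.

```python
EPS = ""  # stands for ε

# (source state, upper symbol, lower symbol, target state); reconstructed, possibly incomplete
ARCS = [
    (0, "b", "b", 0), (0, "b", EPS, 2), (0, "a", "b", 1),
    (1, "b", "a", 2), (1, "a", "a", 2), (2, "a", "a", 0),
]
INITIAL, FINALS = 0, {2}


def transduce(text, side):
    """side=0: read `text` on the upper tape (generation).
    side=1: read `text` on the lower tape (analysis).
    Returns every string found on the other tape along an accepting path."""
    results = set()

    def explore(state, pos, produced):
        if pos == len(text) and state in FINALS:
            results.add(produced)
        for src, up, lo, dst in ARCS:
            if src != state:
                continue
            read, write = (up, lo) if side == 0 else (lo, up)
            if read == EPS:
                # Nothing is consumed; this would loop forever on an ε-cycle (none here).
                explore(dst, pos, produced + write)
            elif text.startswith(read, pos):
                explore(dst, pos + 1, produced + write)

    explore(INITIAL, 0, "")
    return results


print(sorted(transduce("bbab", 0)))  # ['ba', 'bbba']
print(sorted(transduce("ba", 1)))    # ['aa', 'ab', 'bbab']
```

Several arcs can match the same projected symbol (for instance b:b and b:ε from state 0), so the search has to try each of them and abandon the branches that fail.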


Operations and Regular Expressions on Transducers

➮ All FSA regular expressions: concatenation, or, Kleene closure (*), ...

Example (concatenation): "a:b c:a" recognizes ac and produces ba

➮ Cross-product of regular languages: $E_1 \otimes E_2$ recognizes $L_1 \times L_2$

Example: $a^+ \otimes b^+$ gives $(a^n, b^m)$, $\forall n \geq 1, m \geq 1$. Note that this is $\neq (a \otimes b)^+$, which only pairs strings of equal length.

➮ Composition of transducers: $T = T_1 \circ T_2$

$(X_1, X_2) \in T \iff \exists Y : (X_1, Y) \in T_1 \text{ and } (Y, X_2) \in T_2$

➮ Reduction: extraction of the upper or the lower FSA
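These operations can be illustrated on finite relations, i.e. plain sets of (upper, lower) string pairs standing in for transducers. This is only a sketch of the definitions; an actual finite-state toolkit works on the automata themselves. The function names `cross`, `compose` and `project` are hypothetical.

```python
from itertools import product


def cross(lang1, lang2):
    """Cross-product E1 ⊗ E2: every string of L1 paired with every string of L2."""
    return set(product(lang1, lang2))


def compose(t1, t2):
    """(x, z) is in T1 ∘ T2 iff some y has (x, y) in T1 and (y, z) in T2."""
    return {(x, z) for x, y1 in t1 for y2, z in t2 if y1 == y2}


def project(t, side):
    """Reduction: keep only the upper (side=0) or the lower (side=1) language."""
    return {pair[side] for pair in t}


# a+ ⊗ b+ versus (a ⊗ b)+, both truncated to at most 3 repetitions
a_plus = {"a" * n for n in range(1, 4)}
b_plus = {"b" * m for m in range(1, 4)}
free = cross(a_plus, b_plus)                        # (a^n, b^m), n and m independent
locked = {("a" * n, "b" * n) for n in range(1, 4)}  # (a^n, b^n) only

print(("aa", "b") in free, ("aa", "b") in locked)   # True False
print(compose({("cat+N+p", "cat+1")}, {("cat+1", "catXs")}))  # {('cat+N+p', 'catXs')}
```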


(Other) examples of applications

(morphology)

★ text-to-speech (grapheme to phoneme transduction)

★ specific lexicon representation (composition of some access and inverse functions)

★ filters (remove/add/modify marks; e.g. HTML)

★ text segmentation


Computational morphology using transducers

Use of composition:

➛ Identification of a paradigm (T1)

➛ Implementation of this paradigm (T2)

➛ Exception handling (T3)

Example: input: chat+NP, fox+NP, ... ("+NP" means "noun plural")

T1: ([a-z]+)(\+NP ⊗ \+1)   paradigm identification: plural nouns (trivial here: only one paradigm (+1))

T2: ([a-z]+)(\+1 ⊗ \+Xs)   plural inflection of nouns (regular part)

T3: ([a-z]+)(h\+Xs ⊗ hes | x\+Xs ⊗ xen | ... | [^hx...](\+X ⊗ ε)s)   correction of exceptions

T1 ◦ T2 ◦ T3: plural for nouns


Computational morphology using transducers (2)

Detailed example on the plural of nouns:

general case: add a terminal ’s’

cat+NP → cats, dog+NP → dogs, ...

Exceptions (several kinds):

• fly → flies

• fox → foxes, but ox → oxen!

• ...

Method: find all the paradigms (linguists’ role) and implement a transducer for each of them

☞ add the paradigm identification in the lexical description


Keypoints

➟ Inflectional and derivational morphologies, their roles

➟ Main functions of transducers: association checking, generation and analysis

➟ Deterministic and non-deterministic nature of transduction


References

E. Roche, Y. Schabes, Finite-State Language Processing, pp. 14-63, 67-96, A Bradford Book, 1997.


A quick reminder about noun plurals in English

Computational Linguistics

Martin Rajman

Artificial Intelligence Laboratory

Fully regular plurals

• default rule:

Add “s” to the end of the singular form

• Examples:(dog, dogs)

(arrow, arrows)

...

Semi-regular plurals

Some “regular” plurals need to be modified to be easy to pronounce (“euphonic rules”)

• Euphonic rule 1: if the singular noun ends in “s”, “x”, “z”, “ch”, or “sh”, add “es” instead of “s”

(guess, guesses)

(box, boxes)

(buzz, buzzes)

(catch, catches)

(dish, dishes)

... but (systematic exception) if the final “ch” is pronounced “k”, add “s” instead of “es”

(stomach, stomachs)

as well as some fully irregular exceptions

(ox, oxen)

Semi-regular plurals (2)

• Euphonic rule 2: if the singular noun ends in a consonant followed by “y”, change the “y” to “ies”

(baby, babies)

(fly, flies)

Note: there must be a consonant before the “y”...

(boy, boys)

(buy, buys)
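The default rule and the two euphonic rules can be sketched as a purely spelling-based function. It is a sketch under that assumption only: the function name `semi_regular_plural` is hypothetical, and since spelling does not reveal how “ch” is pronounced, exceptions such as (stomach, stomachs) or (ox, oxen) would need a separate exception list.

```python
VOWELS = set("aeiou")
SIBILANT_ENDINGS = ("s", "x", "z", "ch", "sh")


def semi_regular_plural(noun):
    """Apply euphonic rule 1, then euphonic rule 2, then the default rule."""
    if noun.endswith(SIBILANT_ENDINGS):
        return noun + "es"        # rule 1: guess -> guesses
    if len(noun) >= 2 and noun[-1] == "y" and noun[-2] not in VOWELS:
        return noun[:-1] + "ies"  # rule 2: baby -> babies
    return noun + "s"             # default rule: dog -> dogs


for word in ("dog", "guess", "box", "catch", "baby", "fly", "boy", "buy"):
    print(word, "->", semi_regular_plural(word))
```

The vowel test before the final “y” is what keeps (boy, boys) and (buy, buys) under the default rule.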

Irregular plurals

• Collective nouns (aka uncountable nouns) have no plural form

(hair, ---)

(mud, ---)

... but the regular plurals may also be acceptable in specific contexts:

“Her hair is black” ... but ... “I saw at least one grey hair, and there are probably more grey hairs there”

(hair, hairs)

“They throw mud at each other” ... but ... “These subterranean muds are being removed”

(mud, muds)

Irregular plurals (2)

• Invariant nouns (aka invariable nouns) do not change when inflected to the plural

“Deer have antlers”

Note that there is a (subtle) difference between a noun that has no plural (i.e. is uncountable) and a noun whose plural form is the same as its singular form

uncountable: “Her hair is black” is correct, while “Her hair are black” is not

invariable: “This deer is fast” and “Deer are fast” are both correct (but do not mean the same)

Other irregular plurals

• Case 1: For most nouns ending in “f” or “fe”, change the ending “f” or “fe” to “ves”

(half, halves)

(knife, knives)

... but

(belief, beliefs)

(if, ifs) “There are so many ifs and buts in this policy"

Other irregular plurals (2)

• Case 2: For most nouns ending in “is”, change the ending “is” to “es”

(crisis, crises)

(hypothesis, hypotheses)

... but

(vis, vires)

where “vis” is a Latin word meaning “power” that has been imported in English, while preserving its Latin plural (“vires”)

“An example of vis is the influence of the leader"

Other irregular plurals (3)

• Case 3: For many nouns ending in “o”, change the ending “o” to “oes”

(tomato, tomatoes)

(mosquito, mosquitoes)

(volcano, volcanoes)

... but

(photo, photos)

(video, videos)

(piano, pianos)

Fully irregular plurals

• For some (often very frequent) words, the plural corresponds to a much more complicated modification

(man, men)

(mouse, mice)

(foot, feet)

(tooth, teeth)

...

Computational morphology for English nouns

Computational Linguistics

Martin Rajman

Artificial Intelligence Laboratory

Fundamentals

• Goal: use transducers to represent associations between strings representing:

• surface forms, i.e. words as they appear in texts; and

• canonical representations, i.e. formal representations of the morphological analysis of these words

• Examples of surface forms:

cats, book, flies, ...

• Examples of canonical representations:

cat+N+p, book+N+s, fly+N+p, ...

Canonical representations

The typical format of a canonical representation is:

Lemma+GrammaticalCategory+MorphoSyntacticFeature1+MorphoSyntacticFeature2+...

where:

• Lemma (or Root) is the canonical form of an inflected word; i.e. the form usually found in dictionaries, e.g. the singular form for nouns, or the infinitive for verbs;

• GrammaticalCategory (or Part-of-Speech) is the tag used to represent the grammatical category of the word, e.g. N for a noun, Adj for an adjective, or V for a verb;

• MorphoSyntacticFeaturek (k=1, 2, 3, ...) are the tags used to represent the morphosyntactic features (e.g. the number, the gender, the tense, the person, etc.) that are relevant to identify a specific inflection of a word;

and

• "+" is a (conventional) separating character.

Examples of canonical representations

• (cat+N+p, cats): associating the canonical representation "cat+N+p" with the surface form "cats" expresses in a formal way that "cats" is the inflection of the noun "cat" corresponding to its plural form ("p" being the tag for the value "plural" of the morphosyntactic feature "number");

• (turn+V+Ind+Pres+3+s, turns): associating the canonical representation "turn+V+Ind+Pres+3+s" with the surface form "turns" expresses in a formal way that "turns" is the inflection of the verb ("V") "to turn" in the 3rd person ("3") singular ("s") of the present ("Pres") indicative ("Ind").
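To make the format concrete, here is a minimal sketch that splits a canonical representation into its lemma, grammatical category and morphosyntactic features. It assumes the Lemma+GrammaticalCategory+Feature1+... layout described above; the names `Canonical`, `parse_canonical` and `format_canonical` are hypothetical.

```python
from typing import NamedTuple, Tuple


class Canonical(NamedTuple):
    lemma: str
    category: str
    features: Tuple[str, ...]


def parse_canonical(text, sep="+"):
    """Split 'Lemma+Category+Feature1+...' on the separating character."""
    lemma, category, *features = text.split(sep)
    return Canonical(lemma, category, tuple(features))


def format_canonical(c, sep="+"):
    """Inverse of parse_canonical."""
    return sep.join((c.lemma, c.category, *c.features))


print(parse_canonical("cat+N+p"))
# Canonical(lemma='cat', category='N', features=('p',))
print(parse_canonical("turn+V+Ind+Pres+3+s"))
# Canonical(lemma='turn', category='V', features=('Ind', 'Pres', '3', 's'))
```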

In other words...

Implementing computational morphology for English nouns means finding an efficient way of representing a potentially very large set of (canonical representation, surface form) associations, such as:

(cat+N+s, cat) (cat+N+p, cats)
(book+N+s, book) (book+N+p, books)
(fly+N+s, fly) (fly+N+p, flies)
(fox+N+s, fox) (fox+N+p, foxes)
(deer+N+s, deer) (deer+N+p, deer)
(mouse+N+s, mouse) (mouse+N+p, mice)
(ox+N+s, ox) (ox+N+p, oxen)
...

By "efficient way", we mean a method that:

- makes it possible to describe all the targeted associations without having to write them explicitly one by one;

- provides a computational mechanism with a low algorithmic complexity able to produce the surface form(s) associated with a given canonical representation ("generation"), or the canonical representation(s) associated with a given surface form ("analysis")

How to do this with transducers?

The idea is to use the composition T1 o T2 o T3 of 3 transducers:

1. a transducer T1 that identifies the morphological paradigm, i.e. the systematic transformation rule(s) to be implemented for regular forms

2. a transducer T2 that implements the identified systematic rule(s)

3. a transducer T3 that handles all the exceptions to the implemented rules

T1 : Identifying the morphological paradigm

• In English, the morphology of regular noun plurals is very simple, as it corresponds to a single systematic rule

• The morphological paradigm thus consists of only one rule, arbitrarily numbered here as rule 1

• T1 is therefore the transducer that associates a canonical representation of the form “root+N+p”, where root is any possible nominal root, to the intermediate string “root+1”:

T1 = ([a-z]+)((\+N\+p)x(\+1))

where “x” represents the “cross-product” operator, “\” is an escape character that prevents the character “+” from being interpreted as the Kleene plus operator, and “[a-z]” represents any lowercase letter

T1 : Example

When applied to the list

(cat+N+p, book+N+p, fly+N+p, fox+N+p, deer+N+p, mouse+N+p, ox+N+p)

T1 represents the following list of associations

(cat+N+p, cat+1)
(book+N+p, book+1)
(fly+N+p, fly+1)
(fox+N+p, fox+1)
(deer+N+p, deer+1)
(mouse+N+p, mouse+1)
(ox+N+p, ox+1)

T2 : Implementing the morphological paradigm

• The identified (single) systematic rule for English regular noun plurals is:

Add “s” to the end of the root (as, for nouns, the root corresponds to the singular form)

• T2 is therefore the transducer that associates an intermediate string of the form “root+1” to a new intermediate string of the form “rootXs”, where the character X (called the “trace”) identifies the “border” between the root and the suffix “s”:

T2 = ([a-z]+)((\+1)x(Xs))

Note: placing a trace X in the new intermediate string will make it easier to handle the various exceptions to be implemented in T3

T2 : Example

When applied to the list (resulting from T1)

(cat+1, book+1, fly+1, fox+1, deer+1, mouse+1, ox+1)

T2 represents the following list of associations

(cat+1, catXs)
(book+1, bookXs)
(fly+1, flyXs)
(fox+1, foxXs)
(deer+1, deerXs)
(mouse+1, mouseXs)
(ox+1, oxXs)

T3 : Handling the exceptions

• In this illustrative example, we will only consider 2 types of exceptions:

1. Euphonic rule 1 (simplified) :

If the root ends in “x”, change the ending “x” to “xes”

2. Euphonic rule 2 (simplified) :

If the root ends in “y”, change the ending “y” to “ies”

• T3 is therefore the transducer that associates an intermediate string of the form “rootxXs” (resp. “rootyXs”) to a new intermediate string of the form “rootxes” (resp. “rooties”), where “rootx” (resp. “rooty”) is any root ending in “x” (resp. “y”) :

T3 = ([a-z]+)(((xXs)x(xes))|((yXs)x(ies))|([^xy]((Xs)x(s))))

where “[^xy]” represents any character but “x” or “y”

T3 : Example

When applied to the list (resulting from T2)

(catXs, bookXs, flyXs, foxXs, deerXs, mouseXs, oxXs)

T3 represents the following list of associations

(catXs, cats)
(bookXs, books)
(flyXs, flies)
(foxXs, foxes)
(deerXs, deers)
(mouseXs, mouses)
(oxXs, oxes)

T1 o T2 o T3 : Example

When applied to the original list

(cat+N+p, book+N+p, fly+N+p, fox+N+p, deer+N+p, mouse+N+p, ox+N+p)

T1 o T2 o T3 represents the following list of associations

(cat+N+p, cats)
(book+N+p, books)
(fly+N+p, flies)
(fox+N+p, foxes)
(deer+N+p, deers)
(mouse+N+p, mouses)
(ox+N+p, oxes)

where the first 4 associations are correct, but the last 3 (deers, mouses, oxes) are erroneous: the expected plurals are deer, mice and oxen. Producing them would require a more sophisticated definition of the transducer T3 responsible for handling the exceptions.
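The three stages can be imitated with ordinary regular expressions to reproduce this list. The sketch is a simplification: each function below handles the generation direction only, whereas a real transducer also supports analysis, and the names `t1`, `t2`, `t3` and `plural` are illustrative rather than taken from any library.

```python
import re


def t1(s):
    """root+N+p -> root+1 (paradigm identification)."""
    m = re.fullmatch(r"([a-z]+)\+N\+p", s)
    return m.group(1) + "+1" if m else None


def t2(s):
    """root+1 -> rootXs (regular rule; X marks the border between root and suffix)."""
    m = re.fullmatch(r"([a-z]+)\+1", s)
    return m.group(1) + "Xs" if m else None


def t3(s):
    """Simplified exceptions: xXs -> xes, yXs -> ies, otherwise just drop the trace X."""
    m = re.fullmatch(r"([a-z]+)([a-z])Xs", s)
    if not m:
        return None
    stem, last = m.groups()
    if last == "x":
        return stem + "xes"
    if last == "y":
        return stem + "ies"
    return stem + last + "s"


def plural(canonical):
    """Generation through T1 o T2 o T3; None if some stage rejects its input."""
    out = canonical
    for stage in (t1, t2, t3):
        if out is None:
            return None
        out = stage(out)
    return out


for root in ("cat", "book", "fly", "fox", "deer", "mouse", "ox"):
    print(root + "+N+p", "->", plural(root + "+N+p"))
```

Running it yields cats, books, flies, foxes, deers, mouses and oxes, i.e. the same four correct and three erroneous forms as in the list: the errors come from the simplified exception rules, not from the composition mechanism.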
