54
Natural Language Processing Finite State Transducers

Natural Language Processing

Embed Size (px)

DESCRIPTION

Natural Language Processing. Finite State Transducers. Acknowledgement. Material in this lecture derived/copied from Richard Sproat CL46 Lectures Lauri Karttunen LSA lectures 2005. Three Key Concepts. Regular Relations. Finite State Transducers. Computational Morphology. - PowerPoint PPT Presentation

Citation preview

Page 1: Natural Language Processing

Natural Language Processing

Finite State Transducers

Page 2: Natural Language Processing

May 2007 CLINT-CL 2

Acknowledgement

• Material in this lecture derived/copied from

• Richard Sproat CL46 Lectures

• Lauri Karttunen LSA lectures 2005

Page 3: Natural Language Processing

May 2007 CLINT-CL 3

Three Key Concepts

Finite StateTransducers

RegularRelations

ComputationalMorphology

Page 4: Natural Language Processing

May 2007 CLINT-CL 4

Three Key Concepts

Finite StateTransducers

RegularRelations

ComputationalMorphology

Page 5: Natural Language Processing

May 2007 CLINT-CL 5

A Regular Set

ababababababababababababababab..

L1

Page 6: Natural Language Processing

May 2007 CLINT-CL 6

Two Regular Sets

ababababababababababababababab..

bababababababababababababababa..

L1 L2

Page 7: Natural Language Processing

May 2007 CLINT-CL 7

A Regular Relation L1 x L2

ababababababababababababababab..

bababababababababababababababa..

L1 L2

or {("ab","ba"), ("abab","baba"),...}

Page 8: Natural Language Processing

May 2007 CLINT-CL 8

Concatenation and Power

ConcatenationR1 = {("a","b")}R2 = {("c","d")}[R1 R2] = {("ac","bd")}

PowerR1+ = {("a","b"),("aa","bb"), ("aaa","bbb"), ...}

Page 9: Natural Language Processing

May 2007 CLINT-CL 961

Composition

• R1 ○ R2 denotes the composition of relations R1 and R2.

• DefinitionIf R1 contains <x,y>

And R2 contains <y,z>

Then R1 ○ R2 contains <x,z>

• R1 and R2 and B must be relations. If either is just a language, it is assumed to abbreviate the identity relation.

• R1 ○ R2 is written [R1 .o. R2] in xfst

Page 10: Natural Language Processing

May 2007 CLINT-CL 10

Closure Properties of Regular Languages and Relations

Operation Regular Languages Regular RelationsUnion yes yesConcatenation yes yesIteration yes yes

Intersection yes noSubtraction yes noComplementation yes no

Composition n/a yes

Page 11: Natural Language Processing

May 2007 CLINT-CL 11

Three Key Concepts

Finite StateTransducers

RegularRelations

ComputationalMorphology

Page 12: Natural Language Processing

May 2007 CLINT-CL 12

Morphology

• Study of word structure and formation in terms of morphemes

• Morpheme: smallest independent unit in a word that bear some meaning, such as rabbit and s.

• Two kinds of morphology– Inflectional– Derivational

Page 13: Natural Language Processing

May 2007 CLINT-CL 13

Inflectional/DerivationalMorphology

• Inflectional+s plural+ed past

• category preserving• productive: always

applies (esp. new words, e.g. fax)

• systematic: same semantic effect

• Derivational+ment

• category changingallot+ment

• not completely productive: detractment*

• not completely systematic: department

Page 14: Natural Language Processing

May 2007 CLINT-CL 14

Inflectional Morphology ExampleEnglish Noun Inflections

Regular Irregular

Singular cat church mouse ox

child

Plural cats churches mice oxen

children

Page 15: Natural Language Processing

May 2007 CLINT-CL 15

Computational Morphology: Concrete Processes

Analysis

leaves

leaf N Pl leave N Pl leave V Sg3

Generation

hang V Past

hanged hung

Page 16: Natural Language Processing

May 2007 CLINT-CL 16

Two Challenges for Computational Morphology

• Handling Morphotactics– Words are composed of smaller elements that must

be combined in a certain order:• piti-less-ness is English• piti-ness-less is not English

• Handling Phonological Alternations– The shape of an element may vary depending on the

context• pity is realized as piti in pitilessness• die becomes dy in dying

Page 17: Natural Language Processing

May 2007 CLINT-CL 17

Morphological Parsing

• The goal of morphological parsing is to find out what morphemes a given word is built from.

cats cat Noun PL

mice mouse Noun PL

foxes fox N PL

foxes fox V 3SING PRES• Two kinds of information

– Lexeme (which dictionary word)– Morphosyntactic information

Page 18: Natural Language Processing

May 2007 CLINT-CL 18

Morphological Parsing

MorphologicalParser

Input Word

cats

OutputAnalysis

cat N PL

Issues• Non-determinism• Arbitrary choice of morpheme symbols

Page 19: Natural Language Processing

May 2007 CLINT-CL 19

Morphological Generation

MorphologicalGenerator

Output Word

cats

InputAnalysis

cat N PL

Issues• Reversibility: how much can be shared regardless of direction?

Page 20: Natural Language Processing

May 2007 CLINT-CL 20

Morphology as a Regular Relation

catcatsmicelives...

catcat+N+PLmouse+N+PLlife+N+PLlive+V+3SING..

surface language lexical language

or {("cat,cat"),("cats","cat+N+PL"),......}

Page 21: Natural Language Processing

May 2007 CLINT-CL 21

Three Key Concepts

Finite StateTransducers

RegularRelations

ComputationalMorphology

Page 22: Natural Language Processing

May 2007 CLINT-CL 22

FSA

a

Used for• Recognition• Generation

Page 23: Natural Language Processing

May 2007 CLINT-CL 23

Finite State Transducers

• A finite state transducer (FST) is essentially an FSA finite state automaton that works on two (or more) tapes.

• The most common way to think about transducers is as a kind of translating machine which works by reading from one tape and writing onto the other.

Page 24: Natural Language Processing

May 2007 CLINT-CL 24

FST Definition

• A 2 way FST is a quintuple (K,s,F,ixo,) where

i, o are input and output alphabets

• K is a finite set of states

• s K is an initial state

• FK are final states is a transition relation of type

K x i x o x K

Page 25: Natural Language Processing

May 2007 CLINT-CL 25

FST

a

Used for• Recognition• Generation• Translation

b

upper tape

lower tape

Page 26: Natural Language Processing

May 2007 CLINT-CL 26

A Very Simple Transducer

a

b

Relation { ("a","b") }

Notation a:b encodes the transition

Page 27: Natural Language Processing

May 2007 CLINT-CL 27

A Very Simple Transducer

a

b

a:b

also written as

Page 28: Natural Language Processing

May 2007 CLINT-CL 28

Symbol Pairs

• Symbols vs. symbol pairs– In general, no distinction is made in

xfst betweena the language {“a”}a:a the identity relation

{(“a”, “a”)}

a

a

Page 29: Natural Language Processing

May 2007 CLINT-CL 29

A Very Simple Transducer

a

b

a:b

upper side

lower side

Page 30: Natural Language Processing

May 2007 CLINT-CL 30

A (more interesting) Transducer

• Relation

{ ("a","b"), ("aa","bb"), ...}

• Notationa:b*

• N.B. with this notation a and b must be single symbols

1

a:b

Page 31: Natural Language Processing

May 2007 CLINT-CL 31

Transducer have SeveralModes of Operation

• generation mode: It writes on both tapes. A string of as on one tape and a string of bs on the other tape. Both strings have the same length.

• recognition mode: It accepts when the word on the first tape consists of exactly as many as as the word on the second tape consists of bs.

• translation mode (left to right): It reads as from the first tape and writes a b for every a that it reads onto the second tape.

• translation mode (right to left): It reads bs from the second tape and writes an a for every b that it reads onto the first tape.

Page 32: Natural Language Processing

May 2007 CLINT-CL 32

The Basic Idea

• Morphology is regular

• Morphology is finite state

Page 33: Natural Language Processing

May 2007 CLINT-CL 33

Morphology is Regular

• The relation between the surface forms of a language and the corresponding lexical forms can be described as a regular relation, e.g.

{ ("leaf+N+Pl","leaves"),("hang+V+Past","hung"),...}

• Regular relations are closed under operations such as concatenation, iteration, union, and composition.

• Complex regular relations can be derived from simpler relations.

Page 34: Natural Language Processing

May 2007 CLINT-CL 34

Morphology is finite-state

• A regular relation can be defined using the metalanguage of regular expressions.

• [{talk} | {walk} | {work}]• [%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}];

• A regular expression can be compiled into a finite-state transducer that implements the relation computationally.

Page 35: Natural Language Processing

May 2007 CLINT-CL 35

Compilation

• [{talk} | {walk} | {work}]• [%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:

{ed}];

Regular expression

k

t

a

a

wo

l

r

+Progr:i :g

+3rdSg:s

+Past:e :d

:n

+Base:

Finite-state transducer

finalstate

initialstate

Page 36: Natural Language Processing

May 2007 CLINT-CL 36

work+3rdSg --> works

k:k

t:t

a:a

a:a

w:wo:o

l:l

r:r

+Progr:i :g

+3rdSg:s

+Past:e :d

:n

+Base:

Generation

Page 37: Natural Language Processing

May 2007 CLINT-CL 37

talked --> talk+Past

k:k

t:t

a:a

a:a

w:wo:o

l:l

r:r

+Progr:i :g

+3rdSg:s

+Past:e :d

:n

+Base:

Analysis

Page 38: Natural Language Processing

May 2007 CLINT-CL 38

XFST Demo 2

• xfst[0]: regex • [{talk} | {walk} | {work}]• [% +Base:0 | %+SgGen3:s | %+Progr:{ing} | %

+Past:{ed}];

% xfstxfst[0]:

start xfst

compile a regular expression

apply the resultxfst[1]: apply up walkedwalk+Past

xfst[1]: apply down talk+SgGen3talks

Page 39: Natural Language Processing

May 2007 CLINT-CL 39

Lexical transducer

veut

vouloir +IndP +SG + P3

Finite-state transducer

inflected form

citation form inflection codes

v o u l o i r +IndP +SG +P3

v e u t

• Bidirectional: generation or analysis• Compact and fast• Comprehensive systems have been

built for over 40 languages:– English, German, Dutch, French,

Italian, Spanish, Portuguese, Finnish, Russian, Turkish, Japanese, Korean, Basque, Greek, Arabic, Hebrew, Bulgarian, …

Page 40: Natural Language Processing

May 2007 CLINT-CL 40

How lexical transducers are made

LexiconFST

RuleFSTs

Compiler

f a t +Adj

r

+Comp

f a t t e

Lexical Transducer(a single FST)composition

LexiconRegular Expression

RulesRegular Expressions

Morphotactics

Alternations

Page 41: Natural Language Processing

May 2007 CLINT-CL 41

Sequential Model

...

Surface form

Intermediate form

Lexical form

fst 1

fst 2

fst n

Ordered sequenceof rewrite rules(Chomsky & Halle ‘68)can be modeledby a cascade offinite-state transducersJohnson ‘72Kaplan & Kay ‘81

Page 42: Natural Language Processing

May 2007 CLINT-CL 42

Sequential application

N -> m / _ p

p -> m / m _

k a N p a n

k a m p a n

k a m m a n

Page 43: Natural Language Processing

May 2007 CLINT-CL 43

Sequential application in detail

N:m

N

?? 0

2

1

pN:m

m

pN

m

p:m

?? 0 1

mp

m

k a N p a n

k a m p a n

k a m m a n

0 0 0 2 0 0 0

0 0 0 1 0 0 0

Page 44: Natural Language Processing

May 2007 CLINT-CL 44

Composition of Two Stages

N:m

N

?? 0

3

1

N:m

m

p

N

?

m2

p:m

p:m

N m

N:mk a N p a n

k a m m a n

0 0 0 3 0 0 0

Page 45: Natural Language Processing

May 2007 CLINT-CL 45

Composition Example

• A .o. B The relation C such that if A maps x to y and B maps y to z, C maps x to z.

b:B c:Ca:A

b ca

a:A

b:B

c:C

d:D {abc} .o. [a:A | b:B | c:C | d:D]*= [a:A b:B c:C]

Page 46: Natural Language Processing

May 2007 CLINT-CL 46

Crossproduct

• A .x. B The relation that maps every string in A to every string in B, and vice versa

• A:B Same as [A .x. B].

b:y c:0a:x

a b c .x. x y [a b c] : [x y] {abc}:{xy}

Page 47: Natural Language Processing

May 2007 CLINT-CL 47

Xerox RE Operators

• $ containment• => restriction• -> replacement• @-> longest match replacement

– Make it easier to describe complex languages and relations without extending the formal power of finite-state systems.

Page 48: Natural Language Processing

May 2007 CLINT-CL 48

Containment

aa?? ?? aa$a$a

[?* a ?*][?* a ?*]

Page 49: Natural Language Processing

May 2007 CLINT-CL 49

Restriction

??cc

bb

bb

cc?? aa

cc

a => b _ ca => b _ c

““AnyAny aa must be preceded bymust be preceded by bband followed byand followed by cc.”.”

~[~[?* b] a ?*] & ~[?* a ~[c ?*]] ~[~[?* b] a ?*] & ~[?* a ~[c ?*]]

Equivalent expression Equivalent expression

Page 50: Natural Language Processing

May 2007 CLINT-CL 50

Replacement

a:ba:b

bb

aa

??

??

b:ab:a

aa

a:ba:b

a b -> b a

““Replace ‘ab’ by ‘ba’.”Replace ‘ab’ by ‘ba’.”

[[~$[a b] [[a b] .x. [b a]]]* ~$[a b]]

Equivalent expression Equivalent expression

Page 51: Natural Language Processing

May 2007 CLINT-CL 51

Marking

0:[0:[

[[

0:]0:]

??

aa

ee

iioo

uu]]

a|e|i|o|u -> %[ ... %]

p o t a t op o t a t op[o]t[a]t[o]p[o]t[a]t[o]

Page 52: Natural Language Processing

May 2007 CLINT-CL 52

a b | b | b a | a b a -> x

(a) b (a) -> x

applied to “aba”

a b a a b a a b a a b a

a x a a x x a x

Multiple Results

Four factorizations of the input string.

Page 53: Natural Language Processing

May 2007 CLINT-CL 53

@-> Left-to-right, Longest-match Replacement

(a) b (a) @-> x

applied to “aba”

a b a a b a a b a a b a

a x a a x x a x

Page 54: Natural Language Processing

May 2007 CLINT-CL 54

Conditional Replacement

The relation that replaces A by B between L and R leaving everything else unchanged.

A -> BA -> B

Replacement

L _ RL _ R

Context

Sources of complexity:

Replacements and contexts may overlap

Alternative ways of interpreting “between left and right.”A -> B || L _ R both contexts on the inputA -> B // L _ R left context on the outputA -> B \\ L _ R right context on the output