Morphology, Phonology & FSTs Shallow Processing Techniques for NLP Ling570 October 12, 2011


Roadmap

Motivation:
  Representing words
A little (mostly English) morphology
Stemming
FSTs & Morphology
  Stemming
  Morphological analysis
FSTs & Phonology

Words

Goal: Compact representation of all surface forms in a language

Lexicon

Goal: Compact representation of all surface forms in a language
Enumeration:
  Impractical for morphologically rich languages
  Descriptively unsatisfying for most languages
Orthographic variation: fly + er -> flier
Morphological variation: saw + s -> saws; fish + s -> fish; goose + s -> geese
Phonological variation: dog + s -> dog + /z/; fox + s -> fox + /IH Z/

Morphological Parsing

Goal: Take a surface word form and generate a linguistic structure of component morphemes
A morpheme is the minimal meaning-bearing unit in a language.
  Stem: the morpheme that forms the central meaning unit in a word
  Affix: prefix, suffix, infix, circumfix
    Prefix: e.g., possible -> impossible
    Suffix: e.g., walk -> walking
    Infix: e.g., hingi -> humingi (Tagalog)
    Circumfix: e.g., sagen -> gesagt (German)

Combining Morphemes

Inflection: Stem + grammatical morpheme -> same class
  E.g.: help + ed -> helped
Derivation: Stem + grammatical morpheme -> new class
  E.g.: walk + er -> walker (N)
Compounding: multiple stems -> new word
  E.g.: doghouse, catwalk, …
Clitics: stem + clitic
  E.g.: I + ll -> I’ll; he + is -> he’s

Inflectional Morphology (Mostly English)

Relatively simple inflectional system
  Nouns, verbs, some adjectives
Noun inflection: only plural, possessive
  Non-English?
Plural: mostly stem + ‘s’; ‘es’ after s, z, sh, ch, x
Possessive: singular and irregular plural take ’s; regular plural, and forms ending in s or z, take ’ alone

            Regular           Irregular
Singular    cat    thrush     goose  ox
Plural      cats   thrushes   geese  oxen
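The plural and possessive rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `IRREGULAR_PLURALS`, `pluralize`, and `possessive` are hypothetical names, the irregular table holds just the slide's two examples, and spelling changes such as y -> ie are ignored.

```python
# Hypothetical toy table of irregular plurals, taken from the slide's examples.
IRREGULAR_PLURALS = {"goose": "geese", "ox": "oxen"}


def pluralize(noun: str) -> str:
    """Regular plural: +es after s, z, sh, ch, x; otherwise +s."""
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(("s", "z", "sh", "ch", "x")):
        return noun + "es"
    return noun + "s"


def possessive(form: str, is_regular_plural: bool = False) -> str:
    """Singular and irregular plural take 's; regular plurals and
    forms ending in s or z take a bare apostrophe."""
    if is_regular_plural or form.endswith(("s", "z")):
        return form + "'"
    return form + "'s"


print(pluralize("thrush"), possessive("cats", is_regular_plural=True))
```

Checking the irregular table first matters: `ox` ends in x, so the regular rule alone would produce `oxes`.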

Verb Inflectional Morphology

Classes:
  Main (eat, hit), modal (can, should), primary (be, have)
  Only main, primary inflected
Regular verbs: Forms predictable from stem, productive
Irregular verbs: Only about 250, but very frequent

Form         Regular verbs
Stem         walk     merge    try     map
-s form      walks    merges   tries   maps
-ing part    walking  merging  trying  mapping
past (-ed)   walked   merged   tried   mapped

Irregular verbs (stem, -s form, -ing part, past, past part):
eat    eats     eating    ate     eaten
catch  catches  catching  caught  caught
cut    cuts     cutting   cut     cut

Derivational Morphology

Relatively complex, common in English
Nominalization: Verb or Adj + affix -> Noun
Adjectives: Verb or Noun + affix -> Adj

Suffix    Base          Derived Noun
-ation    computerize   computerization
-ee       appoint       appointee
-er       kill          killer
-ness     fuzzy         fuzziness

Suffix    Base          Derived Adjective
-al       computation   computational
-able     embrace       embraceable
-less     clue          clueless

Cliticization

Clitics: between affix and word
  Like an affix: short, reduced
  Like a word: act as pronouns, articles, conjunctions, verbs
In English:
  Presence is (mostly) unambiguous: marked by an apostrophe (’)
  Meaning is often ambiguous: e.g. he’s (he is / he has)
More complex in other languages: e.g. Arabic
  Can prefix (proclitic) article, preposition, conjunction
  No markers
Removal of such clitics is often referred to as light stemming

Stemming

Simple type of morphological analysis
Commonly used in information retrieval (IR)
  Supports matching using base form
  E.g.: television, televised, televising -> televise
  Typically improves retrieval of short documents. Why?
  Most popular: Porter stemmer (snowball.tartarus.org)
Task: Given surface form, produce base form
  Typically removes suffixes
Model: Rule cascade
  No lexicon!

Porter Stemmer

Rule cascade:
  Rule form: (condition) PATT1 -> PATT2
    E.g.: (stem contains vowel) ING -> ε
          ATIONAL -> ATE
  Rule partial order:
    Step 1a: -s
    Step 1b: -ed, -ing
    Steps 2-4: derivational suffixes
    Step 5: cleanup
Pros: Simple, fast, buildable for a variety of languages
Cons: Overaggressive and underaggressive; limited in application
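To make the rule-cascade idea concrete, here is a minimal Python sketch. It is not the Porter stemmer: it keeps only a handful of rules in the spirit of steps 1a, 1b, and one step-2 rule, drops Porter's measure conditions, and treats only a, e, i, o, u as vowels. `STEPS`, `has_vowel`, and `toy_stem` are hypothetical names for this illustration.

```python
import re

VOWEL = re.compile(r"[aeiou]")  # simplification: y is never treated as a vowel


def has_vowel(stem: str) -> bool:
    return bool(VOWEL.search(stem))


# Each step is an ordered list of (suffix, replacement, condition on the stem).
# Within a step, the first suffix that matches decides the outcome.
STEPS = [
    # step 1a: plural -s
    [("sses", "ss", None), ("ies", "i", None), ("ss", "ss", None), ("s", "", None)],
    # step 1b: -ing, -ed, only if the remaining stem contains a vowel
    [("ing", "", has_vowel), ("ed", "", has_vowel)],
    # a single derivational rule standing in for steps 2-4
    [("ational", "ate", None)],
]


def toy_stem(word: str) -> str:
    word = word.lower()
    for rules in STEPS:
        for suffix, replacement, condition in rules:
            if word.endswith(suffix):
                stem = word[: len(word) - len(suffix)]
                if condition is None or condition(stem):
                    word = stem + replacement
                break  # matched suffix decides this step, even if the condition failed
    return word


for w in ["walking", "sing", "relational", "caresses"]:
    print(w, "->", toy_stem(w))
```

The condition on `ING` is what keeps `sing` intact: stripping the suffix would leave `s`, which contains no vowel. No lexicon is consulted at any point, which is also why such a cascade can both over- and under-stem.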

FST Morphological Analysis

Focus on English morphology
FSA acceptor: cats -> yes; foxes -> yes; childs -> no
FST morphological analyzer: fox + N + pl -> fox^s#
FST for orthographic rules: fox^s# -> foxes#

Morphological Analysis

Components:
  Lexicon: List of stems and affixes
    E.g.: cat: N; -s: Pl
  Morphotactics: Model of morpheme ordering
    Association with classes, affix ordering
    E.g.: Pl follows N
  Orthographic rules: Spelling rules
    Changes when morphemes combine
    E.g.: y -> ie in try + s

Example

Goal: foxes -> fox + N + Pl

Surface: foxes
  (orthographic rules)
Intermediate: fox^s
  (lexicon + morphotactics)
Lexical: fox + N + Pl

Multiple Levels

Generation and Analysis
  Generation: fox + N + Pl -> fox^s#; fox^s# -> foxes#
  Analysis: foxes# -> fox^s#; fox^s# -> fox + N + Pl

The Lexicon

Repository for words:
  Simplest would be enumeration
    Impractical (at least) for many languages
  Includes stems, affixes, some morphotactics
    E.g.: cat: N, +sg; fly: V, +base
    What about: flies: V, +sg +3rd?
Common model of morphotactics: FSA

Basic Noun Lexicon (J&M, Ch. 3)

As an FSA

reg-noun    irreg-pl-noun    irreg-sg-noun    plural
fox         geese            goose            -s
cat         sheep            sheep
dog         mice             mouse
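The noun lexicon above can be turned into a small morphotactic acceptor. The sketch below is an assumption-laden illustration, not the figure from the slides: the state names (`start`, `reg`, `done`) are invented, and the input is a sequence of morphemes rather than a surface string.

```python
# Word classes from the lexicon table above.
LEXICON = {
    "reg-noun": {"fox", "cat", "dog"},
    "irreg-sg-noun": {"goose", "sheep", "mouse"},
    "irreg-pl-noun": {"geese", "sheep", "mice"},
    "plural": {"-s"},
}

# Morphotactic FSA over class labels: state -> {class: next state}.
# State names are made up for this sketch.
TRANSITIONS = {
    "start": {"reg-noun": "reg", "irreg-sg-noun": "done", "irreg-pl-noun": "done"},
    "reg": {"plural": "done"},
}
ACCEPTING = {"reg", "done"}


def classes_of(morpheme):
    """All classes a morpheme belongs to (sheep is in two of them)."""
    return [c for c, members in LEXICON.items() if morpheme in members]


def accepts(morphemes):
    """Track every reachable state, since class membership can be ambiguous."""
    states = {"start"}
    for m in morphemes:
        states = {
            TRANSITIONS.get(s, {}).get(c) for s in states for c in classes_of(m)
        } - {None}
    return bool(states & ACCEPTING)


print(accepts(["cat", "-s"]), accepts(["goose", "-s"]))
```

Only the `reg` state has an outgoing `plural` arc, which is what rules out `goose + -s` while still allowing `cat + -s`.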

FSA Lexicon with Words

What’s up with the ‘s’ arc?
  Orthographic rules will fix ‘es’

Lexicon for English Verbs

Verbs and classes:

reg-v-stem   irreg-v-stem   irreg-past-v-form   past   past-part   pres-part   3sg
walk         cut            caught              -ed    -ed         -ing        -s
fry          speak          ate
talk         sing           eaten
impeach                     sang

FSA for Derivational Morphology

Complex….

FSAs for Morphotactics

We have:
  Stems (and stem class identities)
    E.g.: cat: reg-noun; goose: irreg-noun
    E.g.: walk: reg-verb-stem; cut: irreg-verb-stem
  Affixes (by form and class)
    E.g.: -s: Plural
    E.g.: -ed: past, past-part
  Morphotactic FSAs:
    Accept combinations of stems & affixes in the language
    Reject otherwise

Recognition vs Analysis/Generation

Can validate a morphological sequence
Recognition not usually main goal
  Analysis: Given a surface form, produce component morphemes
  Generation: Given some morphological structure, produce full surface form
Requires translation from one form to another
  FSTs

Multilevel Tape Machines

FST1… Orthographic Rules …..FSTn

Lexicon FST

Noun Morphology FSA

Remember:

Schematic FST

cat + N + Pl -> cat^s#
Map morph features to empty string if there is no corresponding output

Updating the Lexicon

Need words, not just classes, as FST
  fox -> fox
  Need: geese -> goose + N + Pl
Assume f:f written as f

reg-noun    irreg-pl-noun       irreg-sg-noun
fox         g o:e o:e s e       g o o s e
cat         sheep               sheep
aardvark    m o:i u:ε s:c e     mouse
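The symbol-pair notation in the table can be read mechanically: each token is a lexical:surface pair, and a bare symbol such as `f` abbreviates `f:f`. The following Python sketch, with hypothetical helper names, decodes a pair string and reads off both sides.

```python
EPS = ""  # epsilon is represented as the empty string


def parse_pairs(spec: str):
    """Turn 'g o:e o:e s e' into [(lexical, surface), ...]; a bare 'f' means f:f."""
    pairs = []
    for token in spec.split():
        lexical, sep, surface = token.partition(":")
        if not sep:  # no colon: identity pair
            surface = lexical
        pairs.append((lexical.replace("ε", EPS), surface.replace("ε", EPS)))
    return pairs


def lexical_side(pairs) -> str:
    return "".join(lex for lex, _ in pairs)


def surface_side(pairs) -> str:
    return "".join(surf for _, surf in pairs)


goose = parse_pairs("g o:e o:e s e")
mouse = parse_pairs("m o:i u:ε s:c e")
print(lexical_side(goose), surface_side(goose))
print(lexical_side(mouse), surface_side(mouse))
```

Reading the first element of each pair gives the lexical stem (goose, mouse); reading the second gives the irregular plural (geese, mice). A path through the lexicon FST is exactly such a sequence of pairs.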

Integrating the Lexicon

Replace classes with stems

Adding Orthographic Rules

Current transducer concatenates morphemes
  Should work for cats, aardvarks, mice, ...
  foxs?
Problem: spelling changes at morpheme boundaries
  Many such rules:
    Consonant doubling before -ing, -ed
    E-deletion: silent e dropped before -ing, -ed
    Y replacement: y -> ie before -s, i before -ed
    etc.
Approach: Transducers for orthographic rules

Creating an Orthographic Rule

Goal: Correct e insertion in plurals
  E.g.: fox^s# -> foxes
Approach 1: ε -> e
  foxes, but also cates, doges, etc.
  Only apply in context: after s, z, x, etc., before s
Approach 2: ε -> e / (s|z|x) _ s
  Issue? glass -> glases
Approach 3: ε -> e / (s|z|x)^ _ s#
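The three approaches can be compared with regular-expression substitutions. This is one possible reading, written as a rough Python sketch: approach 1 is modelled as inserting e before every s, approach 2 ignores the boundary symbols, and approach 3 requires both the morpheme boundary `^` and the word boundary `#`. The function names are hypothetical.

```python
import re


def strip_boundaries(form: str) -> str:
    """Remove the morpheme (^) and word (#) boundary symbols."""
    return form.replace("^", "").replace("#", "")


def approach1(form: str) -> str:
    # No left context: e goes before every s.
    return strip_boundaries(re.sub(r"(?=s)", "e", form))


def approach2(form: str) -> str:
    # Context (s|z|x) _ s, but blind to morpheme boundaries.
    return re.sub(r"(?<=[szx])(?=s)", "e", strip_boundaries(form))


def approach3(form: str) -> str:
    # Context (s|z|x)^ _ s#: only at a morpheme boundary before a final s.
    return strip_boundaries(re.sub(r"(?<=[szx]\^)(?=s#)", "e", form))


for f in ["fox^s#", "cat^s#", "glass#"]:
    print(f, approach1(f), approach2(f), approach3(f))
```

Approach 1 turns `cat^s#` into `cates`, approach 2 repairs that but breaks `glass#` into `glases`, and approach 3 leaves both alone while still producing `foxes`.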

Rewrite Rules

Format: a -> b / c _ d
Rewrite rules can be optional or obligatory
Rewrite rules can be ordered to reduce ambiguity
Under some conditions, rewrite rules are equivalent to FSTs
  a is not allowed to match something introduced by a prior rule application

E-insertion Rule Transducer

ε -> e / (s|z|x)^ _ s#

Input: …(s|z|x)^s# (intermediate level)
Output: …(s|z|x)es# (surface level)

Using the E-insertion FST

(fox,fox): q0, q0,q0,q1, accept(fox#,fox#): q0.q0.q0.q1,q0, accept (fox^s#,foxes#): q0,q0,q0,q1,q2,q3,q4,q0,accept(fox^s,foxs): q0,q0,q0,q1 ,q2,q5,reject(fox^z#,foxz#) ?
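The traces above can be replayed with a small simulator. The transducer figure itself is not preserved in this text, so the transition table below is an assumption: it follows the e-insertion transducer as presented in Jurafsky & Martin, Ch. 3, and is consistent with the state sequences listed above, but entries those traces do not exercise should be checked against the textbook. `TABLE`, `ACCEPTING`, and `accepts` are names chosen for this sketch.

```python
# state -> {arc label: next state}. Labels: "s" and "xz" are identity pairs,
# "^:" is ^:ε (boundary deleted), ":e" is ε:e (e inserted), "#" is #:#,
# "other" is any other identity pair. ASSUMED table, see the note above.
TABLE = {
    0: {"s": 1, "xz": 1, "^:": 0, "#": 0, "other": 0},
    1: {"s": 1, "xz": 1, "^:": 2, "#": 0, "other": 0},
    2: {"s": 5, "xz": 1, "^:": 0, ":e": 3, "#": 0, "other": 0},
    3: {"s": 4},
    4: {"#": 0},
    5: {"s": 1, "xz": 1, "^:": 2, "other": 0},
}
ACCEPTING = {0, 1, 2}


def label(ch: str) -> str:
    if ch == "s":
        return "s"
    if ch in "xz":
        return "xz"
    return "#" if ch == "#" else "other"


def accepts(inter: str, surf: str, state: int = 0) -> bool:
    """Is (intermediate, surface) a pair accepted by the transducer?"""
    if not inter and not surf:
        return state in ACCEPTING
    arcs = TABLE[state]
    # ^:ε consumes one intermediate symbol and no surface symbol
    if inter[:1] == "^" and "^:" in arcs and accepts(inter[1:], surf, arcs["^:"]):
        return True
    # ε:e consumes one surface symbol and no intermediate symbol
    if surf[:1] == "e" and ":e" in arcs and accepts(inter, surf[1:], arcs[":e"]):
        return True
    # identity pair consumes one symbol on each side
    if inter and surf and inter[0] == surf[0] and inter[0] != "^":
        arc = label(inter[0])
        if arc in arcs:
            return accepts(inter[1:], surf[1:], arcs[arc])
    return False


for pair in [("fox", "fox"), ("fox^s#", "foxes#"), ("fox^s", "foxs"), ("fox^z#", "foxz#")]:
    print(pair, accepts(*pair))
```

Under this table, (fox^z#, foxz#) is accepted: the rule only constrains an s after the boundary, so everything else passes through unchanged, while (fox^s#, foxs#) is rejected because q5 has no # arc.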

What will it accept?

(f, f)
(fox#, fox#)
(fox^s#, foxes#)
(fox^z#, foxz#)

Goal: write rules that capture only those constraints
  Let all other input pass through

Combining FST Lexicon & Rules

Two-level morphological system: ‘Cascade’
  Transducer from Lexicon to Intermediate
  Rule transducers from Intermediate to Surface

Generation & Parsing

Generation:
  Given lexicon tape, cascade to produce surface form
  fox + N + PL -> foxes#
Parsing:
  Given surface form, generate analysis
  foxes# -> fox + N + PL or fox + V + 3Sg
  How can we disambiguate?
    We can’t here: need outside information
  What about ‘assess’?
    Need same sort of search as NFAs
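The ambiguity in parsing can be seen in a toy cascade. The sketch below is only an approximation of the idea: generation runs lexical -> intermediate -> surface with a regex standing in for the e-insertion transducer, and parsing is done by brute-force generate-and-test rather than by inverting a transducer. `STEMS`, `SUFFIX_TAGS`, `generate`, and `parse` are hypothetical names, and the lexicon is invented for the example.

```python
import re

STEMS = {"fox": ["N", "V"], "cat": ["N"]}  # hypothetical toy lexicon
SUFFIX_TAGS = {"N": "PL", "V": "3Sg"}      # both features surface as ^s


def generate(stem, pos, feat=None):
    """Lexical -> intermediate (stem^s#) -> surface (e-insertion, drop ^)."""
    intermediate = stem + ("^s" if feat else "") + "#"
    surface = re.sub(r"(?<=[szx]\^)(?=s#)", "e", intermediate)
    return surface.replace("^", "")


def parse(surface):
    """Try every lexical form; keep those whose generated surface matches."""
    analyses = []
    for stem, tags in STEMS.items():
        for pos in tags:
            for feat in (None, SUFFIX_TAGS[pos]):
                if generate(stem, pos, feat) == surface:
                    analyses.append(" + ".join(filter(None, [stem, pos, feat])))
    return analyses


print(generate("fox", "N", "PL"))
print(parse("foxes#"))
```

Both analyses of `foxes#` survive because the surface string carries no information that separates the noun plural from the 3Sg verb; choosing between them needs context from outside the morphological analyzer.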

FST Morphological Analysis

Summary:
  Main components
    Lexicon
    Morphotactics
    Orthographic rules
  Morphotactics as FSTs, expanded with FST Lexicon
  Orthographic rules as FSTs
  Combine FSTs, e.g. in cascade

Issues

What do you think of creating all the rules for a language by hand?
  Time-consuming, complicated
Proposed approach: Unsupervised morphology induction
  Potentially useful for many applications: IR, MT

Unsupervised Morphology

Start from tokenized text (or word frequencies)
  talk 60
  talked 120
  walked 40
  walk 30
Treat as coding/compression problem
  Find most compact representation of lexicon
  Popular model: MDL (Minimum Description Length)
    Smallest total encoding: weighted combination of lexicon size & ‘rules’

Approach

Generate initial model:
  Base set of words, compute MDL length
Iterate:
  Generate a new set of words + some model to create a smaller description size
E.g. for talk, talked, walk, walked:
  4 words
  2 words (talk, walk) + 1 affix (-ed) + combination info
  2 words (t, w) + 2 affixes (alk, -ed) + combination info
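The trade-off between lexicon size and combination info can be illustrated with a toy cost function. This is not a real MDL computation (which would measure bits under a probabilistic code): here the lexicon costs one unit per stored letter, and the combination info costs one pointer per morpheme used to spell out each word. `SEGMENTATIONS`, `toy_cost`, and `pointer_weight` are assumptions made for this sketch.

```python
# Three candidate models for the same four words, as segmentations.
SEGMENTATIONS = {
    "whole words": [["talk"], ["talked"], ["walk"], ["walked"]],
    "stem + -ed": [["talk"], ["talk", "ed"], ["walk"], ["walk", "ed"]],
    "t/w + alk + -ed": [
        ["t", "alk"], ["t", "alk", "ed"], ["w", "alk"], ["w", "alk", "ed"],
    ],
}


def toy_cost(segmented_words, pointer_weight=1.0):
    """Return (letters in lexicon, pointers, weighted total)."""
    morphemes = {m for word in segmented_words for m in word}
    letters = sum(len(m) for m in morphemes)             # lexicon size
    pointers = sum(len(word) for word in segmented_words)  # combination info
    return letters, pointers, letters + pointer_weight * pointers


for name, seg in SEGMENTATIONS.items():
    print(name, toy_cost(seg))
```

With equal weights the stem + -ed model has the smallest total: splitting further into t/w + alk saves three letters but needs four more pointers. Changing `pointer_weight` shifts where that balance falls, which is the sense in which the encoding is a weighted combination.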

Homework #3