CS 4705 Lecture 3 Morphology: Parsing Words. What is morphology? The study of how words are composed from smaller, meaning-bearing units (morphemes) –Stems:

CS 4705

Lecture 3

Morphology: Parsing Words

What is morphology?

• The study of how words are composed from smaller, meaning-bearing units (morphemes)

– Stems: children, undoubtedly

– Affixes (prefixes, suffixes, circumfixes, infixes)

• Immaterial

• Trying

• Gesagt

• Absobl**dylutely

– Concatenative vs. non-concatenative (e.g. Arabic root-and-pattern) morphological systems

Morphology Helps Define Word Classes

• AKA morphological classes, parts-of-speech

• Closed vs. open (function vs. content) class words

– Pronoun, preposition, conjunction, determiner,…

– Noun, verb, adverb, adjective,…

(English) Inflectional Morphology

• Word stem + grammatical morpheme

– Usually produces word of same class

– Usually serves a syntactic function (e.g. agreement)

like --> likes or liked

bird --> birds

• Nominal morphology

– Plural forms

• -s or -es

• Irregular forms (goose/geese)

• Mass vs. count nouns (fish/fish, email or emails?)

– Possessives (cat’s, cats’)

• Verbal inflection

– Main verbs (sleep, like, fear) relatively regular

• -s, -ing, -ed

• And productive: Emailed, instant-messaged, faxed, homered

• But some are not regular: eat/ate/eaten, catch/caught/caught

– Primary (be, have, do) and modal verbs (can, will, must) often irregular and not productive

• Be: am/is/are/were/was/been/being

– Irregular verbs few (~250) but frequently occurring

– So….English inflectional morphology is fairly easy to model….with some special cases...
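The "regular rule plus special cases" picture can be sketched in a few lines. This is only an illustration: the irregular table and the function name `past_tense` are made up for the example, and spelling changes such as consonant doubling are deliberately ignored.

```python
# Toy past-tense generator: a regular -ed rule plus a small exception table.
# The table is a hypothetical sample, not a complete list of irregular verbs.
IRREGULAR_PAST = {"eat": "ate", "catch": "caught"}

def past_tense(verb):
    """Return a past-tense form, checking the exception table first."""
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    if verb.endswith("e"):
        # like --> liked: the stem already ends in e, so only add d
        return verb + "d"
    # Productive default: email --> emailed, fax --> faxed
    return verb + "ed"

for v in ["like", "email", "fax", "homer", "eat", "catch"]:
    print(v, "-->", past_tense(v))
```

New verbs get the regular rule for free, which is what "productive" means here; only the short exception table has to be listed by hand.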

(English) Derivational Morphology

• Word stem + grammatical morpheme

– Usually produces word of different class

– More complicated than inflectional

• E.g. verbs --> nouns

– -ize verbs --> -ation nouns

– generalize, realize --> generalization, realization

• E.g.: verbs, nouns --> adjectives

– embrace, pity --> embraceable, pitiable

– care, wit --> careless, witless

• E.g.: adjective --> adverb

– happy --> happily

• But “rules” have many exceptions

– Less productive: *evidence-less, *concern-less, *go-able, *sleep-able

– Meanings of derived terms harder to predict by rule

• clueless, careless, nerveless

Parsing

• Taking a surface input and identifying its components and underlying structure

• Morphological parsing: parsing a word into stem and affixes, identifying its parts and their relationships

– Stem and features:

• goose --> goose +N +SG or goose +V

• geese --> goose +N +PL

• gooses --> goose +V +3SG

– Bracketing: indecipherable --> [in [[de [cipher]] able]]
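The stem-and-features output can be illustrated with a lookup-based sketch. The mini-lexicon and the function name `parse` are assumptions made for this example; a real parser would use the finite-state machinery introduced later in the lecture.

```python
# Toy morphological parser over a hypothetical mini-lexicon.
NOUN_STEMS = {"goose", "cat"}
VERB_STEMS = {"goose", "like"}
IRREGULAR_PLURALS = {"geese": "goose"}

def parse(word):
    """Return every stem-plus-features analysis the toy lexicon licenses."""
    analyses = []
    if word in NOUN_STEMS:
        analyses.append(f"{word} +N +SG")
    if word in VERB_STEMS:
        analyses.append(f"{word} +V")
    if word in IRREGULAR_PLURALS:
        analyses.append(f"{IRREGULAR_PLURALS[word]} +N +PL")
    if word.endswith("s"):
        stem = word[:-1]
        if stem in VERB_STEMS:
            analyses.append(f"{stem} +V +3SG")
        # Nouns with a listed irregular plural do not also take -s.
        if stem in NOUN_STEMS and stem not in IRREGULAR_PLURALS.values():
            analyses.append(f"{stem} +N +PL")
    return analyses

for w in ["goose", "geese", "gooses", "cats"]:
    print(w, "-->", parse(w))
```

Note that an ambiguous form such as goose yields more than one analysis, which is why the function returns a list.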

Why parse words?

• For spell-checking

– Is muncheble a legal word?

• To identify a word’s part-of-speech (pos)

– For sentence parsing, for machine translation, …

• To identify a word’s stem

– For information retrieval

• Why not just list all word forms in a lexicon?

How do people represent words?

• Hypotheses:

– Full listing hypothesis: words listed

– Minimum redundancy hypothesis: morphemes listed

• Experimental evidence:

– Priming experiments (Does seeing/hearing one word facilitate recognition of another?) suggest neither

– Regularly inflected forms prime their stems, but derived forms do not

– But spoken derived words can prime stems if they are semantically close (e.g. government/govern but not department/depart)

• Speech errors suggest affixes must be represented separately in the mental lexicon

– easy enoughly

What do we need to build a morphological parser?

• Lexicon: list of stems and affixes (w/ corresponding pos)

• Morphotactics of the language: model of how and which morphemes can be affixed to a stem

• Orthographic rules: spelling modifications that may occur when affixation occurs

– in- --> il- in context of l (in- + legal --> illegal)

Using FSAs to Represent English Plural Nouns

• English nominal inflection

[FSA diagram: states q0, q1, q2; arcs labeled reg-n, plural (-s), irreg-sg-n, irreg-pl-n]

• Inputs: cats, geese, goose
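A minimal sketch of such an FSA follows. The arc labels come from the slide, but the layout is an assumption (reg-n from q0 to q1, plural -s from q1 to q2, both irregular classes straight from q0 to q2, with q1 and q2 accepting), and the tiny lexicon and function names are hypothetical.

```python
# Morpheme-level FSA for English noun inflection (assumed arc layout).
TRANSITIONS = {
    ("q0", "reg-n"): "q1",
    ("q0", "irreg-sg-n"): "q2",
    ("q0", "irreg-pl-n"): "q2",
    ("q1", "plural"): "q2",
}
ACCEPTING = {"q1", "q2"}

# Hypothetical sub-lexicons mapping each stem to its class label.
LEXICON = {"cat": "reg-n", "dog": "reg-n",
           "goose": "irreg-sg-n", "geese": "irreg-pl-n"}

def run(labels):
    """Follow the arcs for a sequence of morpheme-class labels."""
    state = "q0"
    for label in labels:
        state = TRANSITIONS.get((state, label))
        if state is None:
            return False
    return state in ACCEPTING

def accepts_noun(word):
    """Try the word as a bare stem and, if it ends in s, as stem + -s."""
    candidates = [[word]]
    if word.endswith("s"):
        candidates.append([word[:-1], "-s"])
    for morphemes in candidates:
        labels = ["plural" if m == "-s" else LEXICON.get(m, "unknown")
                  for m in morphemes]
        if run(labels):
            return True
    return False

for w in ["cats", "geese", "goose", "gooses"]:
    print(w, accepts_noun(w))
```

The machine only answers yes or no; it rejects gooses as a noun because no plural arc leaves the state reached by an irregular singular.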

• Derivational morphology: adjective fragment

[FSA diagram: states q0–q5; arcs labeled un-, adj-root1 (twice), adj-root2, “-er, -ly, -est” and “-er, -est”]

• Adj-root1: clear, happy, real (clearly)

• Adj-root2: big, red (~bigly)

FSAs can also represent the Lexicon

• Expand each non-terminal arc in the previous FSA into a sub-lexicon FSA (e.g. adj_root2 = {big, red}) and then expand each of these stems into its letters (e.g. red --> r e d) to get a recognizer for adjectives

[FSA diagram: states q0–q7; arcs labeled un-, the letters r, e, d and b, i, g, and “-er, -est”]
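The letter-expansion step can be sketched as building a letter-by-letter FSA from a stem list. This sketch covers only the stems (adj_root2 = {big, red}); the function names are hypothetical, and the affix arcs would attach to the accepting states.

```python
def build_letter_fsa(stems):
    """Expand each stem into one arc per letter, sharing common prefixes."""
    transitions = {}   # (state, letter) -> next state
    accepting = set()
    next_state = 1     # state 0 is the start state
    for stem in stems:
        state = 0
        for letter in stem:
            key = (state, letter)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        accepting.add(state)
    return transitions, accepting

def recognize(word, transitions, accepting):
    """Return True if the letter sequence ends in an accepting state."""
    state = 0
    for letter in word:
        state = transitions.get((state, letter))
        if state is None:
            return False
    return state in accepting

transitions, accepting = build_letter_fsa(["big", "red"])
for w in ["red", "big", "re", "bid"]:
    print(w, recognize(w, transitions, accepting))
```

Every new stem adds states and arcs, which hints at how quickly such a machine grows when it has to cover a whole lexicon.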

But…..

• Covering the whole lexicon this way will require very large FSAs with consequent search and maintenance problems

– Adding new items to the lexicon means recomputing the whole FSA

– Non-determinism

• FSAs tell us whether a word is in the language or not – but usually we want to know more:

– What is the stem?

– What are the affixes and what sort are they?

– We used this information to recognize the word: can we get it back?

Parsing with Finite State Transducers

• cats --> cat +N +PL (a plural NP)

• Koskenniemi’s two-level morphology

– Idea: word is a relationship between lexical level (its morphemes) and surface level (its orthography)

– Morphological parsing : find the mapping (transduction) between lexical and surface levels

c a t +N +PL

c a t s

Finite State Transducers can represent this mapping

• FSTs map between one set of symbols and another using an FSA whose alphabet is composed of pairs of symbols from input and output alphabets

• In general, FSTs can be used for– Translators (Hello:Ciao)

– Parser/generators (Hello:How may I help you?)

– As well as Kimmo-style morphological parsing

• FST is a 5-tuple consisting of

– Q: set of states {q0,q1,q2,q3,q4}

– Σ: an alphabet of complex symbols, each an i/o pair s.t. i ∈ I (an input alphabet) and o ∈ O (an output alphabet), and Σ is in I × O

– q0: a start state

– F: a set of final states in Q {q4}

– δ(q,i:o): a transition function mapping Q × Σ to Q

– Emphatic Sheep --> Quizzical Cow

[FST diagram: states q0–q4; arcs labeled b:m, a:o, a:o, a:o, !:?]
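A minimal sketch of this transducer as a transition table follows. The arc labels are the ones on the slide; their arrangement (a chain q0 to q4 with an a:o self-loop on q3) is an assumption based on the usual sheep-language automaton, and the name `transduce` is hypothetical.

```python
# Each arc maps (state, input symbol) to (next state, output symbol).
ARCS = {
    ("q0", "b"): ("q1", "m"),
    ("q1", "a"): ("q2", "o"),
    ("q2", "a"): ("q3", "o"),
    ("q3", "a"): ("q3", "o"),   # assumed self-loop: any number of extra a's
    ("q3", "!"): ("q4", "?"),
}
FINAL = {"q4"}

def transduce(text):
    """Map sheep talk to cow talk, or return None if the input is rejected."""
    state, out = "q0", []
    for symbol in text:
        if (state, symbol) not in ARCS:
            return None
        state, emitted = ARCS[(state, symbol)]
        out.append(emitted)
    return "".join(out) if state in FINAL else None

print(transduce("baa!"))    # moo?
print(transduce("baaaa!"))  # moooo?
print(transduce("ba!"))     # None
```

Unlike a plain FSA, each step both consumes an input symbol and emits an output symbol, so a successful run yields a translation rather than a bare yes.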

FST for a 2-level Lexicon

• E.g.

Reg-n: c a t

Irreg-pl-n: g o:e o:e s e

Irreg-sg-n: g o o s e

[FST diagram: states q0–q7; arcs labeled c:c, a:a, t:t, g, e:o, e:o, s, e]

FST for English Nominal Inflection

[FST diagram: states q0–q7; arcs labeled reg-n, irreg-n-sg, irreg-n-pl, +N:, +SG:-#, +PL:-s#, +PL:^s#; example tapes c a t +N +PL (lexical) and c a t s (surface)]

Useful Operations on Transducers

• Cascade: running 2+ FSTs in sequence

• Intersection: represent the common transitions in FST1 and FST2 (ASR: finding pronunciations)

• Composition: apply FST2 transition function to result of FST1 transition function

• Inversion: exchanging the input and output alphabets (recognize and generate with same FST)

• cf. AT&T FSM Toolkit and papers by Mohri, Pereira, and Riley

http://www.research.att.com/sw/tools/fsm/

http://www.research.att.com/~mohri/pub.html

http://www.cis.upenn.edu/~pereira/bib/pubs.html
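Inversion and composition can be sketched on small dictionary-based transducers. This is an illustration under stated assumptions, not the FSM Toolkit's API: every arc consumes and emits exactly one symbol (no epsilons), the machines are deterministic in both directions, and the second transducer `SHOUT` is invented for the example.

```python
# Each FST is a dict: (state, input symbol) -> (next state, output symbol).
SHEEP_TO_COW = {
    ("q0", "b"): ("q1", "m"),
    ("q1", "a"): ("q2", "o"),
    ("q2", "a"): ("q3", "o"),
    ("q3", "a"): ("q3", "o"),
    ("q3", "!"): ("q4", "?"),
}
# Hypothetical one-state FST that upper-cases the cow alphabet.
SHOUT = {("u", "m"): ("u", "M"), ("u", "o"): ("u", "O"), ("u", "?"): ("u", "?")}

def invert(arcs):
    """Inversion: swap the input and output symbol on every arc."""
    return {(q, o): (r, i) for (q, i), (r, o) in arcs.items()}

def compose(arcs1, arcs2):
    """Composition: pair up arcs where FST1's output is FST2's input."""
    composed = {}
    for (q1, i), (r1, mid) in arcs1.items():
        for (q2, j), (r2, o) in arcs2.items():
            if j == mid:
                composed[((q1, q2), i)] = ((r1, r2), o)
    return composed

def run(arcs, start, finals, text):
    """Transduce text, returning None if the input is rejected."""
    state, out = start, []
    for symbol in text:
        if (state, symbol) not in arcs:
            return None
        state, emitted = arcs[(state, symbol)]
        out.append(emitted)
    return "".join(out) if state in finals else None

COW_TO_SHEEP = invert(SHEEP_TO_COW)
SHEEP_TO_SHOUT = compose(SHEEP_TO_COW, SHOUT)

print(run(COW_TO_SHEEP, "q0", {"q4"}, "moo?"))
print(run(SHEEP_TO_SHOUT, ("q0", "u"), {("q4", "u")}, "baa!"))
```

The composed machine does in a single pass what a cascade would do by running the two transducers one after the other.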

Orthographic Rules and FSTs

• Define additional FSTs to implement rules such as consonant doubling (beg --> begging), ‘e’ deletion (make --> making), ‘e’ insertion (watch --> watches), etc.

Lexical f o x +N +PL

Intermediate f o x ^ s #

Surface f o x e s
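The three levels can be sketched as a two-step cascade. This is a simplified illustration using string rewriting rather than real transducers; the function names are hypothetical, only regular nouns are handled, and the e-insertion context (after x, s, z, ch, sh) is the usual textbook statement of the rule rather than something given on the slide.

```python
import re

def lexical_to_intermediate(lexical):
    """Rewrite noun features as boundary symbols: +PL gives ^s#, +SG gives #."""
    stem, *features = lexical.split()
    if features == ["+N", "+PL"]:
        return stem + "^s#"
    if features == ["+N", "+SG"]:
        return stem + "#"
    raise ValueError(f"unsupported features: {features}")

def intermediate_to_surface(form):
    """Apply e-insertion, then delete the boundary symbols ^ and #."""
    # e-insertion: x^s#, s^s#, z^s#, ch^s#, sh^s# gain an e before the s
    form = re.sub(r"(x|s|z|ch|sh)\^s#", r"\1es#", form)
    return form.replace("^", "").replace("#", "")

for lexical in ["fox +N +PL", "cat +N +PL", "watch +N +PL", "fox +N +SG"]:
    intermediate = lexical_to_intermediate(lexical)
    print(lexical, "-->", intermediate, "-->", intermediate_to_surface(intermediate))
```

Keeping the spelling rule separate from the lexicon means the same rule applies to every stem, including ones added later.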

Porter Stemmer

• Used for tasks in which you only care about the stem

– IR, modeling given/new distinction, topic detection, document similarity

• Rewrite rules (e.g. misunderstanding --> misunderstand --> understand --> …)

• Not perfect …. But sometimes it doesn’t matter too much

• Fast and easy
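The flavor of rewrite-rule stemming can be shown with a few ordered suffix rules. This is not the Porter algorithm itself, which has several rule stages and conditions on the stem; the rule list, the minimum stem length, and the name `strip_suffix` are assumptions for illustration.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
RULES = [
    ("ational", "ate"),
    ("ization", "ize"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]

def strip_suffix(word, min_stem=3):
    """Apply the first rule whose suffix matches and leaves a long enough stem."""
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["misunderstanding", "relational", "realization", "cats", "sing"]:
    print(w, "-->", strip_suffix(w))
```

No lexicon is consulted, which is why this style of stemmer is fast and also why it sometimes produces stems that are not words.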

Summing Up

• FSTs provide a useful tool for implementing a standard model of morphological analysis, Kimmo’s two-level morphology

• But for many tasks (e.g. IR) much simpler approaches are still widely used, e.g. the rule-based Porter Stemmer

• Next time:

– Read Ch 4

– Read over HW1 and ask questions now

http://www.cs.columbia.edu/~julia/cs4705/HOMEWORK_1_CS4705.doc
