Natural Language Processing
Morphological Analysis
Joakim Nivre
Uppsala University
Department of Linguistics and Philology
joakim.nivre@lingfil.uu.se


What’s in a word?

• Word processing so far:
  • Tokenization – segmenting sentences into words
  • Part-of-speech tagging – classifying words grammatically
• Words have structure:
  • runs, ran and running are inflected forms of the verb run
  • unfriendly is derived from friendly, which is derived from friend
  • induce, product and reduction have the same root duc
• Morphological analysis – exploring the structure of words


Why does morphology matter?

• Information retrieval:
  • A query for phones should match both phone and phones
• Language modeling:
  • If we have seen scrutinize, we can predict scrutinized
• Machine translation:
  • Swedish bilen corresponds to English the car


Language Variation

[Figure: number of unique word forms in 10k sentences for English, French, Dutch, Italian, Portuguese, Spanish, Danish, Swedish, German, Greek and Finnish; axis ticks at 0, 15, 30, 45 and 60.]


Morphology

• Words are built up of minimal meaningful elements called morphemes:
  • played = play-ed
  • cats = cat-s
  • unfriendly = un-friend-ly
• Two types of morphemes:
  • Stems: play, cat, friend
  • Affixes: -ed, -s, un-, -ly
• Two main types of affixes:
  • Prefixes precede the stem: un-
  • Suffixes follow the stem: -ed, -s, -ly


Language Variation

[Figure legend: mainly prefixing; equal prefixing and suffixing; mainly suffixing.]


Quiz 1

• Which of the following are morphemes of unbelievable?
  1. un
  2. unbe
  3. evable
  4. able


Inflection

• Inflection relates different forms of the same word

  Lemma   Singular  Plural
  cat     cat       cats
  dog     dog       dogs
  knife   knife     knives
  sheep   sheep     sheep
  mouse   mouse     mice

• Note:
  • The lemma is the canonical form found in dictionaries
  • Affixation sometimes involves spelling changes (knife – knives)
  • Inflection does not always involve affixation (mouse – mice)


Word Formation

• Morphological processes can be used to form new words
• Derivation = stem + affix
  • friend + -ly = friendly
  • un- + friendly = unfriendly
  • unfriendly + -ness = unfriendliness
• Compounding = stem + stem
  • järn (iron) + väg (road) = järnväg (railway)
  • järnväg + korsning (crossing) = järnvägskorsning (railway crossing)
  • järnvägskorsning + olycka (accident) = järnvägskorsningsolycka (railway crossing accident)


Morphological Analysis

• Morphological analysis:
  • token → lemma + part of speech + grammatical features
• Examples:
  • cats → cat+N+plur
  • played → play+V+past
  • katternas → katt+N+plur+def+gen
• Often non-deterministic (more than one solution):
  • plays → play+N+plur
  • plays → play+V+3sg
• Lemmatization:
  • token → lemma
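The mapping from tokens to analyses can be illustrated with a minimal sketch. The tiny lexicon, the suffix table and the function names below are assumptions made for illustration only; they are not part of the lecture. The sketch covers only suffixed English forms and returns every analysis licensed by the lexicon, which makes the non-determinism of plays visible.

```python
# Toy lexicon: lemma -> set of parts of speech (hypothetical data).
LEXICON = {"cat": {"N"}, "play": {"N", "V"}}

# Toy suffix table: (surface suffix, part of speech, grammatical feature).
SUFFIXES = [("s", "N", "plur"), ("s", "V", "3sg"), ("ed", "V", "past")]


def analyze(token):
    """Return all analyses of the form lemma+POS+feature for a token."""
    analyses = []
    for suffix, pos, feature in SUFFIXES:
        if token.endswith(suffix):
            lemma = token[: -len(suffix)]
            # Keep the analysis only if the lexicon lists the lemma with this POS.
            if pos in LEXICON.get(lemma, ()):
                analyses.append(f"{lemma}+{pos}+{feature}")
    return analyses


def lemmatize(token):
    """Lemmatization keeps only the lemma of each analysis."""
    return sorted({analysis.split("+")[0] for analysis in analyze(token)})


print(analyze("cats"))
print(analyze("plays"))
print(lemmatize("plays"))
```

A real analyzer would also need to handle bare stems, spelling changes and irregular forms, which this sketch leaves out.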


Quiz 2

• Which of the following pairs are cases of inflection?
  1. play – played
  2. play – player
  3. play – playing
  4. play – playground


Finite State Morphology

• Morphological analysis:
  • token → lemma + part of speech + grammatical features
• Finite state morphology:
  • Efficient implementation using finite state automata
  • Start with recognition, add output later


Finite State Automata

[Figure: example finite state automaton with START and END states and transitions labeled a, b and c. Source: Sharon Goldwater, ANLP Lecture 3.]

• States: start, end, intermediate
• Transitions between states
• Can be viewed as emitting or recognizing strings


One Word

[Figure: basic finite state automaton with a start state S, an end state E and a single transition labeled walk. Source: Sharon Goldwater, ANLP Lecture 3.]

• Start state
• End state
• Transition that emits word/stem walk


One Word and One Inflection

[Figure: automaton with states S, 1 and E; the transition from S to 1 is labeled walk and the transition from 1 to E is labeled +ed, giving walked. Source: Sharon Goldwater, ANLP Lecture 3.]

• Intermediate state
• First transition emits stem walk
• Second transition emits -ed


One Word and Multiple Inflections

[Figure: automaton with states S, 1 and E; one transition labeled walk from S to 1 and three transitions labeled +ed, +ing and +s from 1 to E. Source: Sharon Goldwater, ANLP Lecture 3.]

• Multiple affix transitions
• Three paths: walks, walked, walking


Multiple Words and Multiple Inflections

[Figure: automaton with states S, 1 and E; three transitions labeled walk, report and laugh from S to 1 and three transitions labeled +ed, +ing and +s from 1 to E. Source: Sharon Goldwater, ANLP Lecture 3.]

• Multiple stems
• Multiple paths: laughs, ..., walked, ..., reporting
• Implements regular verb morphology
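The automaton with multiple stems and multiple inflections can be sketched as a small transition table. The state names S, 1 and E follow the figure; the data layout and the function names accepts and generate are assumptions chosen for this illustration. The same table is used both to recognize and to emit strings, matching the two views of an FSA.

```python
# Transition table: state -> list of (label, next state).
ARCS = {
    "S": [("walk", "1"), ("report", "1"), ("laugh", "1")],
    "1": [("ed", "E"), ("ing", "E"), ("s", "E")],
}
START, END = "S", "E"


def accepts(word, state=START):
    """Recognize: follow arcs whose label is a prefix of the remaining input."""
    if state == END:
        return word == ""
    return any(
        word.startswith(label) and accepts(word[len(label):], nxt)
        for label, nxt in ARCS.get(state, [])
    )


def generate(state=START):
    """Emit: list every string spelled out by a path from state to END."""
    if state == END:
        return [""]
    return [
        label + rest
        for label, nxt in ARCS.get(state, [])
        for rest in generate(nxt)
    ]


print(accepts("laughing"))
print(accepts("walk"))
print(generate())
```

Note that the bare stem walk is rejected, because the automaton in the figure has no path from S to E without an affix.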


Composition

• Constructing an FSA gets very complicated
• Build components as separate FSAs
  • L = FSA for lexicon (lemmas)
  • D = FSA for derivational morphology (optional)
  • I = FSA for inflectional morphology
• Compose L + D + I using standard algorithms
  • Each component can be composed in turn


Finite State Transducers

• FSAs can be used as morphological recognizers
• A morphological analyzer should produce output:
  • walked → walk+V+past
  • reporting → report+V+prog
• Use a finite-state transducer (FST)
• Replace symbols with input-output pairs x : y
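A minimal sketch of the step from recognizer to transducer: each arc now carries a pair of strings instead of a single label. The state names, the arc layout as (surface, lexical, next state) triples and the function name analyze are assumptions for illustration; the feature labels +V, +past, +prog and +3sg are taken from the analyses shown in these slides.

```python
# Transducer arcs: state -> list of (surface string, lexical string, next state).
# An empty surface string means the arc consumes no input (x : y with empty x).
ARCS = {
    "S": [("walk", "walk", "1"), ("report", "report", "1"), ("laugh", "laugh", "1")],
    "1": [("", "+V", "2")],
    "2": [("ed", "+past", "E"), ("ing", "+prog", "E"), ("s", "+3sg", "E")],
}


def analyze(word, state="S"):
    """Read the surface side of each arc and collect the lexical side."""
    if state == "E":
        return [""] if word == "" else []
    results = []
    for surface, lexical, nxt in ARCS.get(state, []):
        if word.startswith(surface):
            for rest in analyze(word[len(surface):], nxt):
                results.append(lexical + rest)
    return results


print(analyze("walked"))
print(analyze("reporting"))
```

The function returns a list because a transducer may relate one surface form to several analyses; here every accepted form has exactly one.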


FST for Verbs

[Figure: transducer for regular verbs with transitions labeled verb-reg, +V:, +3sg:s, +prog:ing and +past:ed, where x means x:x and x: means x:ε. Source: Sharon Goldwater, ANLP Lecture 3.]


Disambiguation

• FSTs often produce multiple analyses for a single form:
  • walks → walk+V+3sg
  • walks → walk+N+plur
• Can be combined with statistical taggers for disambiguation


Quiz

[Figure: the FST for verbs, repeated from the earlier slide. Source: Sharon Goldwater, ANLP Lecture 3.]

• What analysis does the FST above give for the word walking?
  1. walk+V+3sg
  2. walk+V+past
  3. walk+V+prog


Stemming

• Stemming = find the stem by stripping off affixes
  • play → play
  • replayed → re-play-ed
  • computerized → comput-er-ize-d
• Simplified morphological analysis
  • Group tokens that contain the same stem
  • Usually no distinction between inflection and derivation
  • Useful for certain types of application
• Not the same as lemmatization

  Word        Stem    Lemma
  played      play    play
  replayed    play    replay
  unfriendly  friend  unfriendly


The Porter Stemmer

• The Porter stemmer
  • Widely used stemming algorithm for English
  • Ported to other languages as well
• Methodology:
  • A sequence of steps strips off successive layers of affixes
  • Only the first matching rule in each step is applied
  • Later steps may “clean up” unfortunate side effects


Example: Step 1

  Rule                  Condition   Example             Exception
  1.1 (X)-sses → -ss                caresses → caress
  1.2 (X)-ies → -i                  ponies → poni
  1.3 (X)-ss → -ss                  caress → caress
  1.4 (X)-s → ε         if VC ∈ X   cats → cat          bus ↛ bu

• Rule 1.4 removes inflectional -s
• Stem is required to contain a VC sequence
• Rules 1.1–1.3 catch specific patterns that would lead to errors
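The Step 1 table can be read as an ordered rule list. The sketch below follows the simplified rules on this slide, not the full Porter algorithm. The names has_vc, STEP_1 and apply_step are assumptions for illustration, and treating only a, e, i, o, u as vowels is a further simplification.

```python
import re


def has_vc(stem):
    """Condition 'VC in X': the stem contains a vowel followed by a consonant."""
    return re.search(r"[aeiou][^aeiou]", stem) is not None


# (suffix, replacement, condition on the remaining stem X or None)
STEP_1 = [
    ("sses", "ss", None),  # 1.1
    ("ies", "i", None),    # 1.2
    ("ss", "ss", None),    # 1.3
    ("s", "", has_vc),     # 1.4
]


def apply_step(word, rules):
    """Select the first rule whose suffix matches; apply it only if its condition holds."""
    for suffix, replacement, condition in rules:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if condition is None or condition(stem):
                return stem + replacement
            return word  # the selected rule is blocked; no later rule is tried
    return word


for w in ["caresses", "ponies", "caress", "cats", "bus"]:
    print(w, "->", apply_step(w, STEP_1))
```

Rule 1.3 looks like a no-op, but because only the first matching rule is applied, it stops rule 1.4 from stripping the final s of caress.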


Example: Steps 2a and 2b

  Rule                     Condition            Example               Exception
  2a.1 (X)-eed → -ee       if V ∈ X             agreed → agree        feed ↛ fee
  2a.2 (X)-ed → ε          if V ∈ X             plastered → plaster   bled ↛ bl
  2a.3 (X)-ing → ε         if V ∈ X             motoring → motor      sing ↛ s
  2b.1 -at → -ate                               conflat(ed) → conflate
  2b.2 -bl → -ble                               troubl(ed) → trouble
  2b.3 -iz → -ize                               siz(ed) → size
  2b.4 -CC → C             if C ∉ {l, s, z}     putt(ing) → put       fall(ing) ↛ fal
  2b.5 -C1VC2 → C1VC2e     if C2 ∉ {w, x, y}    fil(ing) → file       fail(ing) ↛ faile

• Step 2a handles verb inflections (-ed, -ing)
• Step 2b “cleans up” exceptional cases
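Steps 2a and 2b can be sketched in the same style. The function names are hypothetical, and two assumptions go beyond the table: step 2b runs only after rule 2a.2 or 2a.3 has removed a suffix, and rule 2b.5 applies only to short stems with no vowel before the final C1VC2. Both follow the original Porter algorithm rather than this slide; without the second restriction the table as written would turn motor into motore.

```python
VOWELS = "aeiou"  # simplification: y is always treated as a consonant


def has_vowel(stem):
    """Condition 'V in X': the stem contains at least one vowel."""
    return any(ch in VOWELS for ch in stem)


def is_consonant(ch):
    return ch not in VOWELS


def step_2b(stem):
    """Clean-up rules 2b.1 to 2b.5; only the first matching rule applies."""
    if stem.endswith(("at", "bl", "iz")):  # 2b.1, 2b.2, 2b.3
        return stem + "e"
    if len(stem) >= 2 and stem[-1] == stem[-2] and is_consonant(stem[-1]):
        # 2b.4: undouble a final consonant unless it is l, s or z
        return stem if stem[-1] in "lsz" else stem[:-1]
    if (
        len(stem) >= 3
        and is_consonant(stem[-3])
        and stem[-2] in VOWELS
        and is_consonant(stem[-1])
        and stem[-1] not in "wxy"
        and not has_vowel(stem[:-3])  # assumed short-stem restriction
    ):
        return stem + "e"  # 2b.5
    return stem


def step_2(word):
    """Rules 2a.1 to 2a.3, followed by the 2b clean-up after -ed or -ing removal."""
    if word.endswith("eed"):  # 2a.1
        stem = word[:-3]
        return stem + "ee" if has_vowel(stem) else word
    for suffix in ("ed", "ing"):  # 2a.2, 2a.3
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return step_2b(stem) if has_vowel(stem) else word
    return word


for w in ["agreed", "feed", "plastered", "sized", "putting", "falling", "filing"]:
    print(w, "->", step_2(w))
```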


Example: Step 5

  Rule                    Condition   Example
  5.1 (X)-icate → -ic     if VC ∈ X   triplicate → triplic
  5.2 (X)-ative → ε       if VC ∈ X   formative → form
  5.3 (X)-alize → -al     if VC ∈ X   formalize → formal
  5.4 (X)-iciti → -ic     if VC ∈ X   electriciti → electric
  5.5 (X)-ical → -ic      if VC ∈ X   electrical → electric
  5.6 (X)-ful → ε         if VC ∈ X   hopeful → hope
  5.7 (X)-ness → ε        if VC ∈ X   goodness → good

• Step 5 handles (some) derivational endings
• Some rules presuppose earlier steps (electricity → electriciti)
• Rules for all steps can be found in Jurafsky & Martin
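The dependence of Step 5 on earlier steps can be made concrete with a short sketch. The function y_to_i stands in for the earlier step that turns electricity into electriciti; its exact condition (a vowel somewhere before the final y) is an assumption here, as are all function names. The Step 5 rules themselves are copied from the table above.

```python
import re


def has_vc(stem):
    """Condition 'VC in X': a vowel followed by a consonant somewhere in the stem."""
    return re.search(r"[aeiou][^aeiou]", stem) is not None


# (suffix, replacement); every rule carries the condition VC in X.
STEP_5 = [
    ("icate", "ic"),
    ("ative", ""),
    ("alize", "al"),
    ("iciti", "ic"),
    ("ical", "ic"),
    ("ful", ""),
    ("ness", ""),
]


def y_to_i(word):
    """Assumed earlier step: rewrite a final y as i when a vowel precedes it."""
    if word.endswith("y") and re.search(r"[aeiou]", word[:-1]):
        return word[:-1] + "i"
    return word


def step_5(word):
    """Apply the first Step 5 rule whose suffix matches, if its condition holds."""
    for suffix, replacement in STEP_5:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if has_vc(stem) else word
    return word


print(step_5("electricity"))          # no rule matches the raw form
print(step_5(y_to_i("electricity")))  # rule 5.4 applies after the earlier step
```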


Example Outputs

• Successful:
  • computers → computer → comput
  • computing → comput
  • singing → sing
  • controlling → controll → control
  • generalizations → generalization → generalize → general
• Unsuccessful:
  • elephants → elephant → eleph
  • doing → do → doe


From Stemming to Lemmatization

• Advantages of stemming:
  • Simple and efficient
  • No lexicon required
• Similar techniques can be used for lightweight lemmatization:
  • Porter style rules can handle regular inflection
  • Irregular inflection can be added as more specific special cases

  1. mice → mouse
  2. women → woman
  ...
  101. (X)-s → ε
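The ordered rule list above can be sketched with regular expressions: specific irregular forms come first and the general suffix rule comes last. Only the three rules shown on the slide are included; the elided rules in between are left out rather than guessed. The names RULES and lemmatize are assumptions for illustration.

```python
import re

# Ordered rules: (pattern for the whole token, replacement template).
# Irregular forms are listed before the general rule so that they win.
RULES = [
    (r"mice", "mouse"),
    (r"women", "woman"),
    (r"(.+)s", r"\1"),  # general rule (X)-s -> empty, without any condition
]


def lemmatize(token):
    """Apply the first rule whose pattern matches the whole token."""
    for pattern, replacement in RULES:
        match = re.fullmatch(pattern, token)
        if match:
            return match.expand(replacement)
    return token


for w in ["mice", "women", "cats", "dog"]:
    print(w, "->", lemmatize(w))
```

Because the general rule here has no condition on X, it over-applies to words that merely end in s; conditions like those in the Porter rules would be needed to block such cases.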


Quiz

  Rule               Condition
  (X)-sses → -ss
  (X)-ies → -i
  (X)-ss → -ss
  (X)-s → ε          if VC ∈ X

• Which tokens are lemmatized correctly by the rules above?
  1. dog
  2. dogs
  3. bus
  4. buses
