
Inducing the Morphological Lexicon of a Natural Language from Unannotated Text


Page 1: Inducing the Morphological Lexicon of a Natural Language from Unannotated Text

HELSINKI UNIVERSITY OF TECHNOLOGY

NEURAL NETWORKS RESEARCH CENTRE

Inducing the Morphological Lexicon of a Natural Language from Unannotated Text

{ Mathias.Creutz, Krista.Lagus }@hut.fi

International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR’05)
Espoo, 17 June 2005

kahvi + n + juo + ja + lle + kin

nyky + ratkaisu + i + sta + mme

tietä + isi + mme + kö + hän

open + mind + ed + ness

un + believ + able

Page 2: Inducing the Morphological Lexicon of a Natural Language from Unannotated Text

17 June 2005, Mathias Creutz


Challenge for NLP: too many words

• E.g., Finnish words often consist of lengthy sequences of morphemes (stems, suffixes and prefixes):
  – kahvi + n + juo + ja + lle + kin (coffee + of + drink + -er + for + also)
  – nyky + ratkaisu + i + sta + mme (current + solution + -s + from + our)
  – tietä + isi + mme + kö + hän (know + would + we + INTERR + indeed)
• Huge number of different possible word forms
• Important to know the inner structure of words
• The number of morphemes per word varies much

Page 3: Inducing the Morphological Lexicon of a Natural Language from Unannotated Text


Goal

• Learn representations of
  – the smallest individually meaningful units of language (morphemes)
  – and their interaction
  – in an unsupervised and data-driven manner from raw text
  – making as general and language-independent assumptions as possible.

Morfessor

Page 4: Inducing the Morphological Lexicon of a Natural Language from Unannotated Text


State of the art

• Rule-based systems
  – accurate, language-dependent, adaptivity issues
• Unsupervised word segmentation
  – sentences can be of different length
  – context-insensitive, so syntax is modeled poorly:
    • undersegmentation of frequent strings (“forthepurposeof”)
    • oversegmentation of rare strings (“in + s + an + e”)
    • no syntactic / morphotactic constraints (“s + can”)

Morfessor Baseline

Page 5: Inducing the Morphological Lexicon of a Natural Language from Unannotated Text


State of the art (cont’d)

• Morphology learning
  – Beyond segmentation: allomorphy (“foot – feet, goose – geese”)
  – Detection of semantic similarity (e.g., Yarowsky & Wicentowski) (“sing – sings – singe – singed”)
  – Learning of paradigms (e.g., John Goldsmith’s Linguistica)

[Figure: example paradigm with the stems believ, hop, liv, mov, us and the suffixes e, ed, es, ing]

Very restricted syntax / morphotactics in terms of number of morphemes per word form!

Page 6: Inducing the Morphological Lexicon of a Natural Language from Unannotated Text


Morfessor with morpheme categories

• Lexicon / Grammar dualism
  – Word structure captured by a regular expression: word = ( prefix* stem suffix* )+
  – Morph sequences (words) are generated by a Hidden Markov model:

[Figure: HMM generating the morph sequence # over simpl ific ation s #. Transition probabilities, e.g., P(STM | PRE) and P(SUF | SUF); emission probabilities, e.g., P(’over’ | PRE) and P(’s’ | SUF)]
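The word-structure expression and the HMM on this slide can be illustrated with a minimal Python sketch. The single-letter encoding of the categories, the function names and every probability value below are hypothetical and chosen only for illustration; they are not taken from Morfessor.

```python
import math
import re

# Word structure from the slide: word = ( prefix* stem suffix* )+
# Hypothetical single-letter encoding: P = prefix, S = stem, X = suffix.
WORD_PATTERN = re.compile(r"(?:P*SX*)+")
CODES = {"PRE": "P", "STM": "S", "SUF": "X"}


def is_valid_word(categories):
    """Check a sequence of category tags against the word-structure expression."""
    encoded = "".join(CODES[c] for c in categories)
    return WORD_PATTERN.fullmatch(encoded) is not None


def log_prob(tagged_morphs, transition, emission):
    """Log-probability of a tagged morph sequence under a first-order HMM.

    '#' stands for the word boundary at both ends of the word.
    """
    logp = 0.0
    previous = "#"
    for morph, category in tagged_morphs:
        logp += math.log(transition[(previous, category)])  # e.g. P(STM | PRE)
        logp += math.log(emission[(category, morph)])       # e.g. P('over' | PRE)
        previous = category
    return logp + math.log(transition[(previous, "#")])


# Toy parameters (made up for this example).
TRANSITION = {("#", "PRE"): 0.2, ("PRE", "STM"): 0.9, ("STM", "SUF"): 0.5,
              ("SUF", "SUF"): 0.3, ("SUF", "#"): 0.7}
EMISSION = {("PRE", "over"): 0.05, ("STM", "simpl"): 0.001, ("SUF", "ific"): 0.01,
            ("SUF", "ation"): 0.02, ("SUF", "s"): 0.1}
EXAMPLE = [("over", "PRE"), ("simpl", "STM"), ("ific", "SUF"),
           ("ation", "SUF"), ("s", "SUF")]

print(is_valid_word([category for _, category in EXAMPLE]))
print(log_prob(EXAMPLE, TRANSITION, EMISSION))
```

A sequence such as suffix followed by stem is rejected by the expression, while two stems in a row (a compound) are accepted because the whole group may repeat.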

Page 7: Inducing the Morphological Lexicon of a Natural Language from Unannotated Text


Lexicon

Each morph in the lexicon has a “Meaning” (frequency, right perplexity, left perplexity, length) and a “Form” (string):

Frequency   Right perplexity   Left perplexity   Length   String
14029       136                1                 4        over
41          4                  1                 5        simpl
17259       1                  4618              1        s
...
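The slide lists left and right perplexity as lexicon features without defining them. A minimal sketch, assuming that the right (left) perplexity of a morph is the perplexity of the distribution of the morphs that immediately follow (precede) it, with the word boundary "#" counted as a context; this definition and the function names are assumptions made for illustration.

```python
import math
from collections import Counter


def perplexity(context_counts):
    """Perplexity (exponentiated entropy) of an empirical context distribution."""
    total = sum(context_counts.values())
    entropy = -sum((n / total) * math.log(n / total)
                   for n in context_counts.values())
    return math.exp(entropy)


def left_right_perplexity(segmented_words, morph):
    """Left and right perplexity of `morph` in a list of segmented words."""
    left, right = Counter(), Counter()
    for morphs in segmented_words:
        for i, m in enumerate(morphs):
            if m == morph:
                # '#' marks the word boundary (an assumption of this sketch).
                left[morphs[i - 1] if i > 0 else "#"] += 1
                right[morphs[i + 1] if i + 1 < len(morphs) else "#"] += 1
    return perplexity(left), perplexity(right)
```

Under this reading, a morph that always starts a word has left perplexity 1 and a morph that always ends a word has right perplexity 1, which agrees with the rows for "over" and "s" in the lexicon table.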

Page 8: Inducing the Morphological Lexicon of a Natural Language from Unannotated Text


How meaning affects morphotactic role

• Prior probability distributions for category membership of a morph, e.g., P(PRE | ’over’)
• Assume asymmetries between the categories:

[Figure: three plots with the likeness value on the vertical axis: prefix-likeness as a function of right perplexity, suffix-likeness as a function of left perplexity, and stem-likeness as a function of morph length]
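The three likeness measures map one feature each to a value between 0 and 1. A minimal sketch, assuming a logistic (sigmoid) shape; the extracted slide text does not give the functional form, and the slope and threshold values below are hypothetical placeholders.

```python
import math


def sigmoid(x, slope, threshold):
    """Logistic curve that passes through 0.5 at x == threshold."""
    return 1.0 / (1.0 + math.exp(-slope * (x - threshold)))


def prefix_likeness(right_perplexity, slope=0.1, threshold=50.0):
    # Many different continuations to the right suggest a prefix.
    return sigmoid(right_perplexity, slope, threshold)


def suffix_likeness(left_perplexity, slope=0.1, threshold=50.0):
    # Many different contexts to the left suggest a suffix.
    return sigmoid(left_perplexity, slope, threshold)


def stem_likeness(length, slope=2.0, threshold=3.0):
    # Longer morphs are more likely to be stems.
    return sigmoid(length, slope, threshold)
```

With these placeholder parameters, the lexicon entry "over" (right perplexity 136) comes out strongly prefix-like and "s" (left perplexity 4618, length 1) comes out strongly suffix-like and not stem-like.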

Page 9: Inducing the Morphological Lexicon of a Natural Language from Unannotated Text


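How meaning affects role (cont’d)

• There is an additional non-morpheme category for cases where none of the proper classes is likely:

$$P(\mathrm{NON} \mid \text{'over'}) = [1 - \mathrm{Prefixlike}(\text{'over'})] \cdot [1 - \mathrm{Stemlike}(\text{'over'})] \cdot [1 - \mathrm{Suffixlike}(\text{'over'})]$$

• Distribute remaining probability mass proportionally, e.g.,

$$P(\mathrm{PRE} \mid \text{'over'}) = \frac{\mathrm{Prefixlike}(\text{'over'})^{q} \cdot [1 - P(\mathrm{NON} \mid \text{'over'})]}{\mathrm{Prefixlike}(\text{'over'})^{q} + \mathrm{Stemlike}(\text{'over'})^{q} + \mathrm{Suffixlike}(\text{'over'})^{q}}$$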
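The two formulas on this slide can be sketched in a few lines of Python. The function name and the default exponent q = 2 are assumptions; the slide does not state a value for q.

```python
def category_priors(prefix_like, stem_like, suffix_like, q=2.0):
    """Prior category probabilities of one morph from its three likeness values."""
    # Non-morpheme: none of the proper categories is likely.
    p_non = (1.0 - prefix_like) * (1.0 - stem_like) * (1.0 - suffix_like)
    weights = {"PRE": prefix_like ** q,
               "STM": stem_like ** q,
               "SUF": suffix_like ** q}
    norm = sum(weights.values())
    priors = {"NON": p_non}
    for category, weight in weights.items():
        # Share the remaining mass 1 - P(NON) in proportion to likeness^q.
        # If all likeness values are 0, P(NON) is 1 and nothing remains.
        priors[category] = weight * (1.0 - p_non) / norm if norm > 0 else 0.0
    return priors


print(category_priors(0.9, 0.1, 0.05))
```

The four probabilities sum to one by construction, and a larger q pushes more of the remaining mass to the category with the highest likeness.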

Page 10: Inducing the Morphological Lexicon of a Natural Language from Unannotated Text


Maximum a posteriori optimization

$$\arg\max_{\text{Lexicon}} P(\text{Lexicon} \mid \text{Corpus}) = \arg\max_{\text{Lexicon}} P(\text{Corpus} \mid \text{Lexicon}) \cdot P(\text{Lexicon})$$

• Morfessor Categories-MAP
• Older maximum-likelihood version: Categories-ML (lexicon controlled heuristically)

[Figure: the lexicon table (over, simpl, s, ...) and the HMM generating # over simpl ific ation s #, with transition probabilities such as P(STM | PRE) and P(SUF | SUF) and emission probabilities such as P(’over’ | PRE) and P(’s’ | SUF)]

Balance accuracy of representation of data against size of lexicon
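The trade-off between data accuracy and lexicon size can be illustrated with a deliberately simplified cost function. This is not the Categories-MAP model: it assumes a unigram morph model for P(Corpus | Lexicon) and a uniform letter-by-letter code for P(Lexicon), and all names and the alphabet size are hypothetical.

```python
import math
from collections import Counter

ALPHABET_SIZE = 27  # assumption: 26 letters plus an end-of-morph symbol


def lexicon_cost(lexicon):
    """-log P(Lexicon): every morph is spelled out with uniform letter probabilities."""
    return sum((len(morph) + 1) * math.log(ALPHABET_SIZE) for morph in lexicon)


def corpus_cost(segmented_corpus):
    """-log P(Corpus | Lexicon) with maximum-likelihood unigram morph probabilities."""
    counts = Counter(m for word in segmented_corpus for m in word)
    total = sum(counts.values())
    return -sum(n * math.log(n / total) for n in counts.values())


def map_cost(segmented_corpus):
    """Negative log of P(Corpus | Lexicon) * P(Lexicon); lower is better."""
    lexicon = {m for word in segmented_corpus for m in word}
    return corpus_cost(segmented_corpus) + lexicon_cost(lexicon)


unsplit = [["hand"], ["hands"], ["boot"], ["boots"], ["car"], ["cars"]]
split = [["hand"], ["hand", "s"], ["boot"], ["boot", "s"], ["car"], ["car", "s"]]
print(map_cost(unsplit), map_cost(split))
```

In this toy corpus the split analysis makes the corpus slightly more expensive to encode but shrinks the lexicon enough to give the lower total cost.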

Page 11: Inducing the Morphological Lexicon of a Natural Language from Unannotated Text


Over- and undersegmentation still a problem?

• Probability of adding an entry to the lexicon:

$$P(\text{'morgana'}) = P(\mathrm{Freq} = 1) \cdot P(\mathrm{RightPpl} = 1) \cdot P(\mathrm{LeftPpl} = 1) \cdot P(\mathrm{Length} = 7) \cdot P(\text{'m'}) \cdot P(\text{'o'}) \cdot P(\text{'r'}) \cdot P(\text{'g'}) \cdot P(\text{'a'}) \cdot P(\text{'n'}) \cdot P(\text{'a'})$$

Rare strings are split into smaller parts (e.g., morgan + a)

• Probability of sequences in the corpus:

# hands #   vs.   # hand s #

Frequent strings are left unsplit and their inner structure is “lost” (e.g., hands)

Page 12: Inducing the Morphological Lexicon of a Natural Language from Unannotated Text


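Solution: Hierarchical structures in lexicon

[Figure: tree for oppositio + kansanedustaja. oppositio splits into op + positio; kansanedustaja splits into kansan + edustaja; kansan splits into kansa + n; edustaja splits into edusta + ja. Nodes are marked as non-morpheme, stem or suffix.]

• Make morphs consist of submorphs.
• Expand the tree when performing morpheme segmentation.
• Do not expand morphs consisting of non-morphemes.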
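The expansion rule can be sketched as a small recursive function over a binary tree of morphs. The class and function names are hypothetical, and the category assigned to each node of the slide's example is an assumption, since the extracted text does not say which node carries which label.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Morph:
    string: str
    category: str  # "PRE", "STM", "SUF" or "NON"
    children: Optional[Tuple["Morph", "Morph"]] = None


def expand(morph: Morph) -> List[str]:
    """Expand a morph into submorphs, stopping before any non-morpheme appears."""
    if morph.children is None:
        return [morph.string]
    if any(child.category == "NON" for child in morph.children):
        return [morph.string]  # splitting would expose a non-morpheme
    return [s for child in morph.children for s in expand(child)]


# Example tree; the category labels are assumed for illustration.
oppositio = Morph("oppositio", "STM",
                  (Morph("op", "NON"), Morph("positio", "STM")))
kansanedustaja = Morph("kansanedustaja", "STM", (
    Morph("kansan", "STM", (Morph("kansa", "STM"), Morph("n", "SUF"))),
    Morph("edustaja", "STM", (Morph("edusta", "STM"), Morph("ja", "SUF"))),
))

segmentation = expand(oppositio) + expand(kansanedustaja)
print(" + ".join(segmentation))
```

Under these assumed labels, oppositio stays whole because one of its parts is a non-morpheme, while kansanedustaja is expanded down to its leaves.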

Page 13: Inducing the Morphological Lexicon of a Natural Language from Unannotated Text


Evaluation using Hutmegs (Helsinki University of Technology Morphological Evaluation Gold Standard)

• Evaluate the segmentation of Morfessor against a linguistic morpheme segmentation = Hutmegs
• Covers
  – 1.4 million Finnish word forms
  – 120 000 English word forms
• Publicly available and described in the technical report: M. Creutz and K. Lindén. 2004. Morpheme Segmentation Gold Standards for Finnish and English. Publications in Computer and Information Science, Report A77, Helsinki University of Technology.

Page 14: Inducing the Morphological Lexicon of a Natural Language from Unannotated Text


Evaluation against the Hutmegs Gold Standard

[Figure: two plots, one for Finnish and one for English, showing F-measure [%] as a function of corpus size [1000 words]. The corpus-size axis has ticks at 10, 50, 250 and 12000 in one plot and at 10, 50, 250 and 16000 in the other. Methods compared: context-insensitive (Baseline), paradigms (Linguistica), heuristic (Categories-ML), Categories-MAP.]
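The plots report an F-measure for each method. A minimal sketch of how such a score could be computed, assuming that precision and recall are counted over morph boundary positions inside each word; the slides only name the F-measure, so this boundary-based definition and the function names are assumptions.

```python
def boundaries(morphs):
    """Letter offsets of the morph boundaries inside one word."""
    positions, offset = set(), 0
    for morph in morphs[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def boundary_f_measure(proposed, gold):
    """Precision, recall and F-measure of proposed boundaries against a gold standard.

    Both arguments are lists of segmented words, aligned word by word.
    """
    hits = n_proposed = n_gold = 0
    for proposed_word, gold_word in zip(proposed, gold):
        p, g = boundaries(proposed_word), boundaries(gold_word)
        hits += len(p & g)
        n_proposed += len(p)
        n_gold += len(g)
    precision = hits / n_proposed if n_proposed else 0.0
    recall = hits / n_gold if n_gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


proposed = [["un", "believ", "able"], ["open", "minded", "ness"], ["hand", "s"]]
gold = [["un", "believ", "able"], ["open", "mind", "ed", "ness"], ["hand", "s"]]
print(boundary_f_measure(proposed, gold))
```

In this toy example every proposed boundary is correct, but the boundary in mind + ed is missed, so precision is perfect while recall is lower.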

Page 15: Inducing the Morphological Lexicon of a Natural Language from Unannotated Text


Example segmentations

Finnish                              English
[ aarre kammio ] issa                [ accomplish es ]
[ aarre kammio ] on                  [ accomplish ment ]
bahama laiset                        [ beautiful ly ]
bahama [ saari en ]                  [ insur ed ]
[ epä [ [ tasa paino ] inen ] ]      [ insure s ]
maclare n                            [ insur ing ]
[ nais [ autoili ja ] ] a            [ [ [ photo graph ] er ] s ]
[ sano ttiin ] ko                    [ present ly ] found
töhri ( mis istä )                   [ re siding ]
[ [ voi mme ] ko ]                   [ [ un [ expect ed ] ] ly ]

Page 16: Inducing the Morphological Lexicon of a Natural Language from Unannotated Text


Discussion

• Possibility to extend the model
  – rudimentary features used for “meaning”
  – more fine-grained categories
  – beyond concatenative phenomena (e.g., goose – geese)
  – allomorphy (e.g., beauty, beauty + ’s, beauti + es, beauti + ful)
• Already now useful in applications
  – automatic speech recognition (Finnish, Turkish)

Page 17: Inducing the Morphological Lexicon of a Natural Language from Unannotated Text


Morpho project page: http://www.cis.hut.fi/projects/morpho/

Page 18: Inducing the Morphological Lexicon of a Natural Language from Unannotated Text


Demo 6

http://www.cis.hut.fi/projects/morpho/

Page 19: Inducing the Morphological Lexicon of a Natural Language from Unannotated Text


Demo 7