45
Morphology Natural Language Processing: Lecture 6 12.10.2017 Kairit Sirts

Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

MorphologyNatural Language Processing: Lecture 6

12.10.2017

Kairit Sirts

Page 2: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Morphology

• Morphology studies the internal structure of words

2

availabilities avail +able +ity +es

availability NOUN +nom +pl

avail_abil_iti_es

esavail able ity

Page 3: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Morphemes

• Morphemes are the smallest units of language that carry a semantic meaning

3

availabilities avail +able +ity +es

Verbal root

Derivational suffixthat transforms averb into an adjective

Derivational suffixthat transforms anadjective into a noun

Nominative pluralInflectional suffix

Page 4: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Roots and stems

• Roots carry the basic indivisible meaning of a word

• A stem is the base of an inflected word

• Avail is a root – it cannot be divided further

• Available is a stem

• Availability is a stem

• Availabilities is an inflected word

4

availabilitiesavail +able +ity +es

Page 5: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Exercise

• What could be the morphological analysis of the following words? Identify the roots and stems.

5

readers

carefully

ate

read +er +s

care +ful +ly

ate

Root Stems

read read, reader

care care, careful

eat/ate ate

Page 6: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Lexical and grammatical morphemes

• Lexical morphemes carry themselves a semantic meaning. Most of them can stand on their own.• Boy, table, yellow, run, waste etc

• Grammatical morphemes cannot stand on their own. Their role is to modify the meaning of a lexical morpheme or specify the relationships between lexical morphemes• -s, -ing, -able, at, in, on

6

Page 7: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Derivational and inflectional morphemes

• Derivational morphemes change the semantic meaning of a word, they can also change the POS of the word

• Inflectional morphemes change:• The number, gender, case etc of nouns, adjectives etc

• The person, number, mood, tense etc of verbs

7

avail VERB +able available ADJ +ity availability NOUN

dog +pl +gen dog +s +’turn +past turn +ed

Page 8: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Affixes

Grammatical morphemes are divided according to their attachment location:

• Suffixes attach to the end of the word:• Avail+able, table+s, go+ing

• Prefixes attach to the beginning of the word:• re+animate, mis+understand, non+compliant

• Circumfixes have two parts, one attaches in the beginning of the word and the other to the end:• In German: ge+komm+en, ge+spiel+t

• Infixes attach in the middle of the word:• In German: an+zu+fangen, ab+ge+fahren

8

Page 9: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Compounding

• Compounding is using two or more words (typically nouns) together to form a new meaning.

• In English, compounds can be written together:• notebook, bookstore, fireman

• … or separately• Living room, dinner table, full moon

• In other languages, compounds are usually written together:• In Estonian: täis_kuu (full moon), elu_tuba (living room)

• In some languages compounding can be very productive• In German: Kraft_fahr_zeug-Haft_pflicht_versicherung (motor car liability insurance)

9

Page 10: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Tasks in computational morphology

• Text normalization: • Stemming

• lemmatization

• Morphological analysis/parsing

• Morphological tagging/disambiguation

• Morphological generation

10

Page 11: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Stemming

• Used in information retrieval to reduce sparsity• communication communicat• When searching for “communication” retrieve articles containing both

“communication”, “communications”, “communicate”, “communicated”, “communicating”, “communicates” etc• Conflation set

• The most well-known stemming system is Snowball• Small string-based language that can be used to define stemming rules• A compiler compiles the rules into C or Java program

• Course project idea: develop a Snowball stemmer (prototype) for your native language

11

Page 12: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

What are the stems of the following words

12

agreement

rotting

rolling

new

news

probe

prove

provable

probable

agreement

rot

roll

new

news

probe

prove

prove

probable

Page 13: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Lemmatization

• Linguistically motivated task

• The word swimming can have three lemmas: • swim (VERB)

• swimming (ADJ)

• swimming (NOUN)

13

Morphological parsing or stemming applies to many affixes other than plurals

Morphological/ADJ

parsing/NOUN

stemming/NOUN

apply/VERB

affix/NOUN plural/NOUN

Page 14: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Morphological parsing/analysis

• The task of finding all possible morphological tags for a word

• A morphological parse consists of:• Lemma/stem

• POS

• Morphological attributes/features

• Question:

Should morphological parsing be done in context or can you parse each word in isolation?

• Answer: Can be done in isolation

14

kestnudkest+nud //_V_ nud, //kest=nu+d //_S_ pl n, //kest=nud+0 //_A_ //kest=nud+0 //_A_ sg n, //kest=nud+d //_A_ pl n, //

(something) has lasted

Page 15: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Finite state morphological parsing

• The classical method for morphological parsing are finite state methods

• Two-level morphology (Koskenniemi, 1983)

• Resources necessary for developing a finite-state morphological parser:• Lexicon – a list of stems and affixes

• Morphotactics rules – which morphemes can occur with each POS and what is the ordering of the morphemes

• Orthographic rules – describing the stem changes: city+s cities

15

Page 16: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Morphotactics model as a finite state acceptor

16

How to accept words: caught, sing, walked, eating, speaks?Σ – alphabetQ – set of statesS – set of start statesE – set of end statesδ – state transition function

Page 17: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Morphological word recognition

• Does a word belong to the language?

• Lexicon + morphotactics

17Figure 3.7: https://nlp.stanford.edu/~manning/xyzzy/JurafskyMartinEd2book.pdf

Page 18: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Morphological parser as a finite state transducer

18

FST modeling English noun morphology

How can we analyse the words: boy, girls, woman’s, students’?

Σ – input alphabetΔ – output alphabetQ – set of statesS – set of start statesE – set of end statesδ – state transition function⍵ - output function

Page 19: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Operations with FSTs

• Inversion – switches the input and output labels

• What happens when we invert a morphological parser?

• Composition:• If T1 maps from I to O1

• And if T2 maps from O1 to O2

• Then the composition of T1 and T2 maps I to O2.

• Intersection:• Combine parallel FST’s with cartesian product

19

Page 20: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

FST for Morphological parsing

20Figure 3.12: https://nlp.stanford.edu/~manning/xyzzy/JurafskyMartinEd2book.pdf

• Parse the word cats into cat +N +Pl

• Input alphabet: the set of surface level symbols

• Output alphabet: the set of lexical level symbols

• How could we make this FST into an FSA?

Page 21: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

FST for morphological parsing

• Compose the lexicon and the parser

21Figure 3.13: https://nlp.stanford.edu/~manning/xyzzy/JurafskyMartinEd2book.pdf

Page 22: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

FST for morphological parsing

• How to handle irregular words? geese goose +N +Pl

22Figure 3.14: https://nlp.stanford.edu/~manning/xyzzy/JurafskyMartinEd2book.pdf

Page 23: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

FST for morphological parsing

• What would be the output if the input is fox +N +Pl?

23

Page 24: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

FST for morphological parsing

• What would be the output if the input is fox +N +Pl?

• We still need to get rid of morpheme boundaries (^) and word boundaries (#), and convert foxs to foxes

24Figure 3.15: https://nlp.stanford.edu/~manning/xyzzy/JurafskyMartinEd2book.pdf

Page 25: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

FSTs for spelling rules

• E insertion rule - add e after -s, -z, -x, -ch, -sh before –s

25Figure 3.16: https://nlp.stanford.edu/~manning/xyzzy/JurafskyMartinEd2book.pdf

Page 26: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

FSTs for spelling rules

• Can you think of any other spelling rules for English?

26https://nlp.stanford.edu/~manning/xyzzy/JurafskyMartinEd2book.pdf, page 63

Page 27: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Combining lexicon, parser and rules

• Each spelling rule has its own FST, they are applied in parallel

27Figure 3.19: https://nlp.stanford.edu/~manning/xyzzy/JurafskyMartinEd2book.pdf

Page 28: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Compiling the final model

• Combine with intersection and compose:

28Figure 3.21: https://nlp.stanford.edu/~manning/xyzzy/JurafskyMartinEd2book.pdf

Page 29: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Morphological tagging

1. Morphological parsing – find all possible analyses for each word

2. Morphological disambiguation – find the most likely analysis sequence

29

Teise koha sai seekord Jänes

O+adtO+sg_gP+adtP+sg_g

S+sg_gS+sg_nS+sg_pV+o

V+sS+sg_n

D S+sg_nH+sg_n

Page 30: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Sequence tagging with CRF

Tools

• CRFsuite

• CRF++

• CRFSharp

30Figure adopted from: https://www.researchgate.net/figure/263324470_fig20_Figure-1-Linear-chain-conditional-random-fields-modelBlack-nodes-represent-observable

Page 31: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Morphological tagging with CRF

• MarMot – Müller et al., 2013. Efficient higher-order CRFs for Morphological Tagging

• Features:• The current, preceding and next words as unigrams and bigrams

• Word prefixes and suffixes up to 10 characters for rare words

• The occurrence of capital characters, digits and special characters

• Rare word has frequency <= 10 in training set

• Combine all features with the POS tag, morphological tag and morphological feature

31

Page 32: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Construct word unigram features

• Старим +ADJ +Case=Ins|Degree=Pos|Gender=Masc

• Combine all features with the POS tag, morphological tag and all morphological features

• Старим+ADJ

• Старим+Case=Ins|Degree=Pos|Gender=Masc

• Старим+Case=Ins

• Старим+Degree=Pos

• Старим+Gender=Masc

32

Page 33: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Neural morphological tagging

• SyntaxNet – Andor et al., 2016. Globally normalized transition-based neural networks

• Pretrained models for 40 languages

33

Language Tokens Universal POS Language POS Morph

English 25096 90.48% 89.71% 91.30%

Estonian 23670 95.92% 96.76% 92.73%

Finnish 9140 94.78% 95.84% 92.42%

German 16268 91.79% - -

Kazakh 587 75.47% 75.13%

Chinese 12012 91.32% 90.89% 98.76%

Latvian 3985 80.95 66.60% 73.60%

Average 94.27% 92.93% 90.38%

Page 34: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Morphological segmentation

• Simplest morphological analysis

• Was very popular during the first decade of 2000s.

• Mostly unsupervised or weakly supervised methods• Minimum Description length principle• Bayesian models• Also sequence tagging with CRF

34

avail_abil_iti_es

Page 35: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Minimum Description Length

• Occam’s razor

• We want to maximise:

• Minimise the description length:

35

Page 36: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

MDL in morphological segmentation

• Model – the inventory of possible morphemes

• Data – the corpus to be segmented

• -log P(model) – the description length of a model

• -log P(Data|Model) – the description length of the data encoded with a model

36

Page 37: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Question

• Which model is better for morphological segmentation according to MDL?• ”Morphemes” consist of single characters: a, b, c, …

• “Morphemes” consist of whole words

• References• Goldsmith, 2001. Unsupervised learning of morphology of a natural language

• Morfessor – Creutz and Lagus, 2007. Unsupervised models for morpheme segmentation and morphology learning

37

Page 38: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Bayesian morphological segmentation

• Model the posterior probability:

• Sirts and Goldwater, 2012. Minimally-supervised morphological segmentation using Adaptor Grammars

38

Word

Morph Morph

avail

Morph Morph

abil ity es

P(Word --> Morph+) = …P(Morph --> Char+) = …P(Char --> a) = …P(Char --> v) = ……

Page 39: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Segmentation with sequence modeling

• Ruokolainen et al., 2013. Supervised morphological segmentation in a low-resource setting using Conditional Random Fields

• B – morpheme begins

• M – inside the morpheme

• E –morpheme ends

• Question: Could the labelling scheme be simpler?

39

drivers driv_er_s

Page 40: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Morphological paradigms

40http://wwwdrshadiabanjar.blogspot.com.ee/2012/05/lane-333-inflectional-paradigms.htmlhttp://wwwdrshadiabanjar.blogspot.com.ee/2012/05/lane-333-inflectional-paradigms.html

Page 41: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Tasks based on morphological paradigms

• Morphological synthesis or paradigm generation• Estonian morphological synthesiser

• Paradigm completion• Given a partially filled paradigm, fill in the remaining slots

• Morphological reinflection• Given a lemma or an inflected from generate another inflected form

• run (lemma) --> running (present participle)

• ran (past) --> running (present participle)

41

Page 42: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Neural model for morphological reinflection

• Kann and Schütze, 2016. MED: The LMU System for the SIGMORPHON 2016 Shared Task on Morphological Reinflection

• Bidirectional gated RNN

• Attention mechanism

• Input (in one sequence):

• Output:

42

<w> IN=pos=ADJ IN=case=GEN IN=num=PL OUT=pos=ADJ OUT=case=ACC OUT=num=PL i s o l i e r t e r </w>

<w> i s o l i e r t e </w>

Page 43: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

The role of computational morphology in NLP

• Still largely an unexplored area. Why?

• Intuitively, modeling morphology should help to reduce the vocabulary sparsity problems

• Morphological agreement:• For instance, verbs and nouns must agree in number• In German: der Mann geht vs die Männer gehen (the man goes vs men go)

• Potentially should be useful for many downstream tasks:• Machine translation• Natural language generation• Language modeling (for speech recognition)

43

Page 44: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

SIGMORPHON Shared Tasks

• 2016 – Morphological reinflection

• 2017 – Universal morphological reinflection in 52 languages

• 2018 - TBA

• UniMorph initiative• Collecting morphologically annotated lexicon data for many languages

• Currently features 51 languages

• Many more are being annotated – you’re welcome to contribute!

44

Page 45: Natural Language Processing - Arvutiteaduse instituut · 2017-10-13 · Natural Language Processing: Lecture 6 12.10.2017 ... Their role is to modify the meaning of a lexical morpheme

Recap

• Morphology is a study about internal structure of words

• It overlaps with syntax and semantics:• “I put the book on the table” vs “Panin raamatu lauale” (in Estonian)

• “Giraffe bit the zebra” vs “Kaelkirjak hammustas sebrat” or “Sebrathammustas kaelkirjak”

• Several computational morphological analysis tasks• Parsing, tagging/disambiguation, segmentation, generation, reinflection

• How to use morphological models in downstream tasks?

45