
Tokenization and Word Segmentation
Daniel Zeman, Rudolf Rosa

March 3, 2022

NPFL120 Multilingual Natural Language Processing

Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics


Tokenization and Word Segmentation

• Important because:
  • training tokenization should equal test tokenization
  • otherwise accuracy goes down

• Not always trivial
• May interact with morphology

• May include normalization (character-level)


Tokenization and Word Segmentation

• Issues of orthography of individual languages

• Issues caused by design decisions of individual corpora

• We will refer to the Universal Dependencies project (UD; https://universaldependencies.org/); more information in the following weeks

• Due to limited time, we will probably skip some slides at the end


Tokenization

“María, I love you!” Juan exclaimed.
«¡María, te amo!», exclamó Juan.

Untokenized (whitespace-delimited chunks):
«¡María,   te     amo!»,   exclamó   Juan.
X          PRON   X        VERB      X

Tokenized:
«       ¡       María   ,       te     amo    !       »       ,       …
PUNCT   PUNCT   PROPN   PUNCT   PRON   VERB   PUNCT   PUNCT   PUNCT   …

• Classic tokenization:
  • Separate punctuation from words
  • Recognize certain clusters of symbols like “...”
  • Perhaps keep together things like [email protected]


Using Unicode Character Categories

• https://perldoc.perl.org/perlunicode.html

$text =~ s/(\pP)/ $1 /g;   # put a space around every punctuation character (\pP)
$text =~ s/^\s+//;         # trim leading whitespace
$text =~ s/\s+$//;         # trim trailing whitespace

• $text =~ s/(\pP)/ $1 /g; — \pP matches any character in the Unicode Punctuation category
• Optionally recombine e-mail addresses, URLs etc.

• Some problems:
  • haven ’ t (English; should be have n’t)
  • instal · lació (Catalan; should be 1 token)
  • single quote (punctuation) misspelled as acute accent (modifier letter)
  • writing systems without spaces

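A slightly fuller tokenizer sketch in the same spirit, building on the substitutions above (my own illustration, not part of the slides): it first protects URLs and e-mail addresses with a placeholder so that the \pP substitution does not break them apart, then separates punctuation and splits on whitespace. The patterns and the sample address are deliberately crude, made-up examples.

use strict; use warnings; use utf8;
binmode(STDOUT, ':utf8');

sub tokenize {
    my ($text) = @_;
    my @protected;
    # crude, illustrative patterns for e-mail addresses and URLs
    $text =~ s{([\w.]+\@[\w.]+\w|https?://[\w./~-]+)}{push @protected, $1; "\x{FFFC}"}ge;
    $text =~ s/(\pP)/ $1 /g;          # space around every Unicode punctuation character
    $text =~ s/^\s+|\s+$//g;          # trim
    my @tokens = split /\s+/, $text;
    # put the protected strings back in place of the placeholders
    return map { $_ eq "\x{FFFC}" ? shift @protected : $_ } @tokens;
}

print join(' | ', tokenize('Write to jane.doe@example.org, or see https://example.org/faq!')), "\n";
# Write | to | jane.doe@example.org | , | or | see | https://example.org/faq | !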

Normalization

• Often part of tokenization

• Decimal comma to decimal point; separator of thousands

• Unicode directed quotes and long hyphens to undirected ASCII
  • “English” — ‘English’ — „česky“ — ‚česky‘ — « français » — ‹ français › — „magyar” — »magyar« — ’magyar’
  • Sometimes mistaken for ACUTE ACCENT, PRIME (math) etc.

• TeX-like ASCII directed quotes `` and '' and hyphens -- and ---
• English/ASCII punctuation in foreign writing systems

  • 「你看過《三國演義》嗎?」他問我。
  • “你看過‘三國演義’嗎?”他問我.

• European/ASCII digits in Arabic, Devanagari etc.
  • 0 1 2 3 4 5 6 7 8 9 (Western Arabic/European)
  • ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩ (Eastern Arabic)
  • ० १ २ ३ ४ ५ ६ ७ ८ ९ (Devanagari)
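A few of these normalizations can be expressed as simple Perl substitutions; this is my own sketch, not a tool from the course, and the sample input (including the year) is invented.

use strict; use warnings; use utf8;
binmode(STDOUT, ':utf8');

my $text = '«¡María, te amo!», exclamó Juan. — ٢٠٢٢ / २०२२';
$text =~ tr/٠١٢٣٤٥٦٧٨٩/0123456789/;    # Eastern Arabic digits  -> ASCII
$text =~ tr/०१२३४५६७८९/0123456789/;     # Devanagari digits      -> ASCII
$text =~ s/[“”„«»]/"/g;                # directed double quotes -> ASCII "
$text =~ s/[‘’‚‹›]/'/g;                # directed single quotes -> ASCII '
$text =~ s/—/--/g;                     # em dash                -> ASCII --
print "$text\n";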


Word Segmentation

Let’s go to the sea.

Vámonos   al   mar    .
VERB?     X    NOUN   PUNCT

Vamos   nos    a     el    mar    .
VERB    PRON   ADP   DET   NOUN   PUNCT

• Syntactic word vs. orthographic word
• Multi-word tokens
• Two-level scheme:
  • Tokenization (low level, punctuation, concatenative)
  • Word segmentation (higher level, not necessarily concatenative)


Word Segmentation

• Orthographic vs. syntactic word
  • Syntactically autonomous part of orthographic word
  • Contractions (al = a + el)
  • Clitics (vámonos = vamos + nos) — see the sketch below

• ¿A qué hora nos vamos mañana? “What time do we leave tomorrow?”
• Nos despertamos a las cinco. “We wake up at five.”
• Nuestro guía nos despierta a las cinco. “Our guide wakes us up at five.”
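A toy rule-based splitter for such Spanish multi-word tokens (a sketch of my own, with a hard-coded three-entry lexicon; a real system would use a full lexicon or a morphological analyzer):

use strict; use warnings; use utf8;
binmode(STDOUT, ':utf8');

# contractions and one clitic case, hard-coded for illustration only
my %split = (
    'al'      => ['a', 'el'],
    'del'     => ['de', 'el'],
    'vámonos' => ['vamos', 'nos'],    # imperative vamos + clitic nos
);

sub split_token {
    my ($form) = @_;
    my $key = lc $form;
    return exists $split{$key} ? @{ $split{$key} } : ($form);
}

for my $token ('Vámonos', 'al', 'mar', '.') {
    print join(' ', split_token($token)), "\n";
}
# vamos nos / a el / mar / .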


Contractions in Arabic

He abdicated in favour of his son Baudouin.
يتنازل عن العرش لابنه بودوان

yatanāzalu    ʿan   al-ʿarši     li+ibni+hi      būdūān
surrendered   on    the-throne   to+son+his      Baudouin
VERB          ADP   NOUN         ADP+NOUN+PRON   PROPN


Segmentation as Part of Morphological Analysis

• Arabic
  • ElixirFM: http://lindat.mff.cuni.cz/services/elixirfm/run.php
  • Select Resolve
  • Enter “لابنه” (labnh)

• Sanskrit
  • Sanskrit Reader Companion: https://sanskrit.inria.fr/DICO/reader.fr.html
  • Select Input convention = Devanagari
  • Enter “सकलार्थशास्त्रसारं जगति समालोक्य विष्णुशर्मेदम्” (sakalārthaśāstrasāram jagati samālokya viṣṇuśarmedam)

• German compound splitting (unsupervised) — see the sketch below
  • Not split in Universal Dependencies
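One very simple unsupervised heuristic for compound splitting (my own sketch, not the method used in UD or in the course tools): split at the point where the two parts’ corpus frequencies have the highest geometric mean, and keep the word whole if no split scores better than the word itself. The frequencies below are invented.

use strict; use warnings; use utf8;
binmode(STDOUT, ':utf8');

# invented corpus frequencies; a real system would count them from raw text
my %freq = (dampf => 120, schiff => 300, dampfschiff => 10,
            flug => 80, zeug => 90, flugzeug => 500);

sub best_split {
    my ($word) = @_;
    my @best = ($word);
    my $best_score = $freq{$word} // 0;
    for my $i (3 .. length($word) - 3) {                     # both parts at least 3 characters
        my ($left, $right) = (substr($word, 0, $i), substr($word, $i));
        next unless $freq{$left} && $freq{$right};
        my $score = sqrt($freq{$left} * $freq{$right});      # geometric mean of part frequencies
        ($best_score, @best) = ($score, $left, $right) if $score > $best_score;
    }
    return @best;
}

print join(' + ', best_split($_)), "\n" for qw(dampfschiff flugzeug);
# dampfschiff -> dampf + schiff; flugzeug stays whole (its own frequency wins)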


Chinese Word Segmentation

We are now in Valencia.
現在我們在瓦倫西亞。
Xiàn zài wǒ men zài wǎ lún xī yǎ.

現在       我們      在     瓦倫西亞      。
Xiànzài    wǒmen    zài    Wǎlúnxīyǎ    .
Now        we       in     Valencia     .
ADV        PRON     ADP    PROPN        PUNCT
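A minimal greedy longest-match (“maximum matching”) segmenter sketch, assuming a word list is available; the tiny dictionary below is my own toy example, not part of the slides.

use strict; use warnings; use utf8;
binmode(STDOUT, ':utf8');

my %dict = map { $_ => 1 } ('現在', '我們', '在', '瓦倫西亞', '。');
my $maxlen = 4;                          # length of the longest dictionary entry

sub segment {
    my ($s) = @_;
    my @words;
    my $i = 0;
    while ($i < length $s) {
        my $remaining = length($s) - $i;
        my $len = $maxlen < $remaining ? $maxlen : $remaining;
        # try the longest candidate first, back off to a single character
        $len-- while $len > 1 && !$dict{ substr($s, $i, $len) };
        push @words, substr($s, $i, $len);
        $i += $len;
    }
    return @words;
}

print join(' ', segment('現在我們在瓦倫西亞。')), "\n";
# 現在 我們 在 瓦倫西亞 。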


Words in Japanese

I went to the beauty salon of Kyōdō [, Beyond-R.]

経堂     の     美容室         に     行っ    て      き      まし    た
Kyōdō    no     miyōshitsu    ni     it      te      ki      mashi   ta
経堂     の     美容室         に     行く    て      来る    ます    た
Kyōdō    of     beauty-salon   to     go      CONV    come    will    PAST
PROPN    ADP    NOUN           ADP    VERB    SCONJ   AUX     AUX     AUX

[dependency tree omitted: nmod, case, obl, case, aux, aux, aux, mark]


Words in Japanese

I went to the beauty salon of Kyōdō [, Beyond-R.]

経堂     の     美容室         に     行って           きました
Kyōdō    no     miyōshitsu    ni     itte             kimashita
経堂     の     美容室         に     行く             来る
Kyōdō    of     beauty-salon   to     going            come
PROPN    ADP    NOUN           ADP    VERB             VERB
                                      VerbForm=Conv    VerbForm=Fin|Tense=Past|Polite=Form

[dependency tree omitted: nmod, case, obl, case, advcl]


Words in Japanese

I went to the beauty salon of Kyōdō [, Beyond-R.]

経堂の       美容室に           行って           きました
Kyōdōno     miyōshitsuni      itte             kimashita
経堂         美容室             行く             来る
of-Kyōdō    to-beauty-salon    going            come
PROPN       NOUN               VERB             VERB
Case=Gen    Case=Dat           VerbForm=Conv    VerbForm=Fin|Tense=Past|Polite=Form

[dependency tree omitted: nmod, obl, advcl]


Vietnamese: Words with Spaces

All the concrete country roads are the result of…

Tất cả   đường   bêtông     nội đồng   là    thành quả     …
All      road    concrete   country    is    achievement   …
PRON     NOUN    NOUN       NOUN       AUX   NOUN          PUNCT

• Spaces delimit monosyllabic morphemes, not words.
• Multiple syllables without space occur in loanwords (bêtông).
• Spaces are allowed to occur word-internally in Vietnamese UD.


Numbers with Spaces

# text = Il touche environ 100 000 sesterces par an.
1   Il          il         PRON    …   2   nsubj    _   _
2   touche      toucher    VERB    …   0   root     _   _
3   environ     environ    ADV     …   4   advmod   _   _
4   100 000     100 000    NUM     …   5   nummod   _   _
5   sesterces   sesterce   NOUN    …   2   obj      _   _
6   par         par        ADP     …   7   case     _   _
7   an          an         NOUN    …   2   obl      _   SpaceAfter=No
8   .           .          PUNCT   …   2   punct    _   _


Fixed Expressions

One syntactic word spans several orthographic words?

# text = Bin nach wie vor sehr zufrieden.
# text_en = I am still very satisfied.
1   Bin         sein        AUX     …   6   cop      _   _
2   nach        nach        ADP     …   6   obl      _   _
3   wie         wie         ADV     …   2   fixed    _   _
4   vor         vor         ADP     …   2   fixed    _   _
5   sehr        sehr        ADV     …   6   advmod   _   _
6   zufrieden   zufrieden   ADJ     …   0   root     _   SpaceAfter=No
7   .           .           PUNCT   …   6   punct    _   _


Fixed Expressions

One syntactic word spans several orthographic words?
I am still very satisfied.

Bin   nach    wie    vor      sehr   zufrieden   .
Am    after   like   before   very   satisfied   .
AUX   ADP     ADV    ADP      ADV    ADJ         PUNCT

[dependency tree omitted: cop, obl, fixed, fixed, advmod, punct]


Multi-Word Expressions outside UD

Some corpora use the underscore character to glue MWEs together.

I am still very satisfied.

Bin   nach_wie_vor        sehr   zufrieden   .
Am    after_like_before   very   satisfied   .
AUX   ADV                 ADV    ADJ         PUNCT

[dependency tree omitted: cop, advmod, advmod, punct]


Multi-Word Expressions outside UD

Some corpora use the underscore character to glue MWEs together.

• Durante la presentación del libro ”La_prosperidad_por_medio_de_la_investigación_._La_investigación_básica_en_EEUU” , editado por la Comunidad_de_Madrid , el secretario general de la Confederación_Empresarial_de_Madrid-CEOE ( CEIM ) , Alejandro_Couceiro , abogó por la formación de los investigadores en temas de innovación tecnológica .

• Lemmas?
• Tags?


Word Segmentation Summary

• When to split?
  • Only part of the token involved in a relation to something outside the token? Split!
  • Hard time finding the POS tag? Split!
  • Hard time finding the dependency relation? Don’t split!
    • Or it is not hard, but the relation would be compound, flat, fixed or goeswith.
  • Border case? Keep orthographic words (if they exist).
• Words with spaces
  • Vietnamese writing system
  • Very restricted set of exceptions (numbers)
  • Special relations elsewhere (fixed, compound)


Recoverability: CoNLL-U Format

# text = Vámonos al mar.
# text_en = Let’s go to the sea.
ID    FORM      LEMMA      UPOS    …   HEAD   DEPREL   DEPS   MISC
1-2   Vámonos   _          _       …   _      _        _      _
1     Vamos     ir         VERB    …   0      root     _      _
2     nos       nosotros   PRON    …   1      obj      _      _
3-4   al        _          _       …   _      _        _      _
3     a         a          ADP     …   5      case     _      _
4     el        el         DET     …   5      det      _      _
5     mar       mar        NOUN    …   1      obl      _      SpaceAfter=No
6     .         .          PUNCT   …   1      punct    _      _


Recoverability: CoNLL-U Format

# text = Vámonos al mar.
# text_en = Let’s go to the sea.
ID    FORM      LEMMA      UPOS    …   HEAD   DEPREL   DEPS   MISC
1-2   Vámonos   _          _       …   _      _        _      _
1     Vamos     ir         VERB    …   0      root     _      _
2     nos       nosotros   PRON    …   1      obj      _      _
3-4   al        _          _       …   _      _        _      _
3     a         a          ADP     …   5      case     _      _
4     el        el         DET     …   5      det      _      _
5-6   mar.      _          _       …   _      _        _      _
5     mar       mar        NOUN    …   1      obl      _      _
6     .         .          PUNCT   …   1      punct    _      _

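Recovering the raw text from such a file is mechanical; here is a sketch of my own (not an official UD tool): emit the surface form of each multi-word token line (e.g. 1-2), skip the syntactic-word lines it covers, and insert spaces unless MISC says SpaceAfter=No.

use strict; use warnings; use utf8;
binmode(STDOUT, ':utf8');

my @conllu = (
    "1-2\tVámonos\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tVamos\tir\tVERB\t_\t_\t0\troot\t_\t_",
    "2\tnos\tnosotros\tPRON\t_\t_\t1\tobj\t_\t_",
    "3-4\tal\t_\t_\t_\t_\t_\t_\t_\t_",
    "3\ta\ta\tADP\t_\t_\t5\tcase\t_\t_",
    "4\tel\tel\tDET\t_\t_\t5\tdet\t_\t_",
    "5\tmar\tmar\tNOUN\t_\t_\t1\tobl\t_\tSpaceAfter=No",
    "6\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_",
);

my ($text, $covered_up_to) = ('', 0);
for my $line (@conllu) {
    my @f = split /\t/, $line;                  # ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    my ($id, $form, $misc) = @f[0, 1, 9];
    if ($id =~ /^(\d+)-(\d+)$/) {               # multi-word token line: emit its surface form
        $text .= $form;
        $text .= ' ' unless $misc =~ /SpaceAfter=No/;
        $covered_up_to = $2;                    # and skip the syntactic words it covers
    } elsif ($id =~ /^\d+$/ && $id > $covered_up_to) {
        $text .= $form;
        $text .= ' ' unless $misc =~ /SpaceAfter=No/;
    }
}
$text =~ s/ $//;
print "$text\n";                                # Vámonos al mar.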

Tokenization vs. Multi-word Tokens

• Parallelism among closely related languages
  • ca: informar-se sobre el patrimoni cultural
  • es: informarse sobre el patrimonio cultural
  • en: learn about cultural heritage
  • ca: L’únic que veig és => L’ únic que veig és
  • en: don’t => do n’t

• No strict guidelines for tokenization (yet)
  • UD English: non-stop, post-war: single-word tokens
  • UD Czech: non-stop would be split into three tokens
  • Abbreviations: etc.
    • End of sentence…


Tokenization vs. Multi-word Tokens Summary

• Punctuation involved? Low level!
  • Exceptions: Spanish-Catalan parallelism.
• Boundary between two letters? Typically high level.
  • Exceptions: Chinese, Japanese.
• Non-concatenative? High level!


Errors in Underlying Text

• We do not want to hide errors (learning robust parsers!)
• But: reference corpora (linguistic research) may want to hide them.
• Possibilities:
  • Typo not involving a word boundary
    • FORM = anotation; LEMMA = annotation; FEATS: Typo=Yes; MISC: Correct=annotation
  • Wrongly split word: ann otation
    • Both parts tagged X; the second part attaches to the first via goeswith
  • Wrongly merged words: thecar
    • Fix tokenization (i.e. two lines); first line MISC: SpaceAfter=No | CorrectSpaceAfter=Yes
  • Sentence segmentation can be affected, too!


Errors in Underlying Text

• Wrong morphology: the cars is produced in Detroit

• Not like a normal typo (the car iss produced…)
• Not obvious what is correct
  • the car is
  • the cars are
• Suggestion: select which word to fix, e.g. cars to car
  • FORM = cars; FEATS: Number=Plur; MISC: Correct=car | CorrectNumber=Sing
• cs: viděl moři “he saw the sea”
  • Should be moře
  • Would be Case=Acc (disambiguated from Case=Acc,Gen,Nom,Voc)
  • This form is Case=Dat,Loc (but which one?)
    • cestoval k moři “he traveled to the sea” → Case=Dat
    • plavil se po moři “he sailed the sea” → Case=Loc


Tokenization Alignment

• If you need to match two different tokenizations
  • Use case: evaluation of end-to-end parsing systems
• Normalization involved? Bad luck…
  • Normalization rules needed
  • Or: Longest common subsequence (LCS) algorithm
• Otherwise easy
  • Non-whitespace character offsets (see the sketch below)

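A sketch of the easy case (mine, not the official evaluation script): give every token its span of non-whitespace character offsets; two tokenizations of the same text can then be compared span by span.

use strict; use warnings; use utf8;

# return [start, end, form] triples; offsets count non-whitespace characters only
sub spans {
    my @spans;
    my $pos = 0;
    for my $tok (@_) {
        my $len = length($tok =~ s/\s+//gr);    # e.g. '100 000' has length 6
        push @spans, [$pos, $pos + $len - 1, $tok];
        $pos += $len;
    }
    return @spans;
}

my @gold = ('Al', '-', 'Zaman', ':', 'American', 'forces');
printf "%-10s %d-%d\n", $_->[2], $_->[0], $_->[1] for spans(@gold);
# Al 0-1, - 2-2, Zaman 3-7, : 8-8, American 9-16, forces 17-22 (matching the offsets below)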

Evaluation Metrics

• Align system-output tokens to gold tokens

Al-Zaman : American forces killed Shaikh Abdullah al-Ani, the preacher at the mosque in the town of Qaim, near the Syrian border.

GOLD:     Al    -   Zaman   :   American   forces   killed   Shaikh
OFFSET:   0-1   2   3-7     8   9-16       17-22    23-28    29-34

• All characters except for whitespace match => easy align!

SYSTEM:   Al-Zaman   :   American   forces   killed   Shaikh
OFFSET:   0-7        8   9-16       17-22    23-28    29-34
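Token-level precision/recall/F1 then falls out of the span comparison; this is a sketch of my own, with spans written as start-end pairs over non-whitespace characters (single characters as e.g. 2-2).

use strict; use warnings;

my @gold   = qw(0-1 2-2 3-7 8-8 9-16 17-22 23-28 29-34);
my @system = qw(0-7 8-8 9-16 17-22 23-28 29-34);

my %gold_spans = map { $_ => 1 } @gold;
my $correct   = grep { $gold_spans{$_} } @system;   # a system token is correct iff its span is a gold span

my $p  = $correct / @system;
my $r  = $correct / @gold;
my $f1 = ($p + $r) ? 2 * $p * $r / ($p + $r) : 0;
printf "P=%.3f R=%.3f F1=%.3f\n", $p, $r, $f1;      # P=0.833 R=0.625 F1=0.714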


Evaluation Metrics

• Align system-output tokens to gold tokens

Die Kosten sind definitiv auch im Rahmen.

GOLD:     Die   Kosten   sind   definitiv   auch    im       Rahmen   .
SPLIT:    Die   Kosten   sind   definitiv   auch    in dem   Rahmen   .
OFFSET:   0-2   3-8      9-12   13-21       22-25   26-27    28-33    34

• Corresponding but not identical spans?
  • Find longest common subsequence

SYSTEM:   Kosten   sind   definitiv    auch    im      Rahmen   .
SPLIT:    Kosten   sind   de finitiv   auch    im      Rahmen   .
OFFSET:   3-8      9-12   13-21        22-25   26-27   28-33    34


Another system output aligned to the same gold standard:

SYSTEM:   auch    im                                Rahmen   .
SPLIT:    auch    in einem , dem alle zustimmen ,   Rahmen   .
OFFSET:   22-25   26-27                             28-33    34
