Morphology and Translation - MT class

Morphology and Translation

April 17, 2012

Today’s goals

• Have a basic understanding of morphology in languages, along with some of the complexities it introduces for MT

• Look at a few approaches to deal with the problem

• Stemming

• Splitting

• Decoding

• Leveraging ambiguity: Factored representations

• Preserving ambiguity: Translation from lattices

Motivation

• To this point, we have treated words as atomic white-space delimited units, with no relationships among them

• This hides a lot of information, since words are related

house ⟺ Haus

houses ⟺ Häuser

• ...but this relationship is hidden from the computer

Example

Das ist ein kleines Haus .

174 19 182 40626 991 50

That is a small house .

192 4 19 27 200 49
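A minimal Python sketch of this view of words, assuming a toy two-sentence corpus and a hypothetical helper `build_vocab` (the IDs it assigns will not match the numbers on the slide). Once tokens are mapped to integers, `house` and `houses` are as unrelated as any other pair of words.

```python
def build_vocab(sentences):
    """Assign each distinct whitespace-delimited token the next free integer ID."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.split():
            vocab.setdefault(token, len(vocab))
    return vocab

# Toy corpus, made up for illustration.
corpus = ["That is a small house .", "Those are small houses ."]
vocab = build_vocab(corpus)
ids = [[vocab[t] for t in s.split()] for s in corpus]

print(ids)                               # [[0, 1, 2, 3, 4, 5], [6, 7, 3, 8, 5]]
print(vocab["house"], vocab["houses"])   # 4 8
```

Nothing in the ID sequences records that token 8 is the plural of token 4.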

Before we go on

• Start today by discussing with your neighbor some ways in which this view of words loses information

Morphology

• Morphology: the study of the forms of words

• Inflectional – words change to reflect grammatical roles

• e.g., groß, große, großem, großen, großer, großes

• Derivational – shared semantics, often across parts of speech (POS)

• e.g., employ (V), employee, employer (N), employable (JJ)

• Lemma – the basic, canonical form of the word

• Stem – the shared prefix across inflectional variants

• e.g., corr- (Spanish, as in correr “to run”: corro, corres, corre)
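The "shared prefix" definition of a stem can be checked directly. The sketch below assumes a hand-picked list of present-tense forms of the Spanish verb correr and uses the standard-library function `os.path.commonprefix`, which compares strings character by character.

```python
import os

# Infinitive and present-tense forms of "correr" (to run), chosen for illustration.
forms = ["correr", "corro", "corres", "corre", "corremos", "corren"]

stem = os.path.commonprefix(forms)
print(stem)  # corr
```

This only works when inflection is purely suffixing; forms that change the word internally share a much shorter prefix.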

Related problem: tokenization

• Morphology is not the only means by which data are unnecessarily fragmented

• Tokenization is largely a task of splitting off punctuation

• e.g., house, becomes house ,

• “No,” he said. becomes “ No , ” he said .

• A related step, normalization, removes case distinctions, standardizes character sets (e.g., quotations, numerals)

• These are mostly deterministic processes that also matter for aggregating statistics, but they are artifacts of written language
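A rough sketch of both steps in Python, assuming a single regular expression is good enough for punctuation splitting (real tokenizers handle abbreviations, clitics, and numbers with many more rules). The names `tokenize`, `normalize`, and `QUOTES` are illustrative.

```python
import re

# Map curly quotes to straight ASCII quotes (an assumed normalization choice).
QUOTES = str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'"})

def tokenize(text):
    """Split off punctuation: runs of word characters, or single non-space symbols."""
    return re.findall(r"\w+|[^\w\s]", text)

def normalize(tokens):
    """Remove case distinctions and standardize quotation marks."""
    return [t.lower().translate(QUOTES) for t in tokens]

tokens = tokenize("“No,” he said.")
print(" ".join(tokens))             # “ No , ” he said .
print(" ".join(normalize(tokens)))  # " no , " he said .
```

Note that this pattern would also split "don't" into three tokens, which is one reason practical tokenizers are more elaborate.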

Simple morphology: English

• Words are inflected for

• case (nominative, accusative, genitive): I, me, my/mine, ’s

• tense (past, present, or future): -ed, -ing, will

• person (1st, 2nd, 3rd): I, you, he/she/they

• number (singular vs. plural): -s

Complex morphology: German

• Inflections of the English definite determiner the: the

• Inflections of the German masculine definite determiner der: der, den, dem, des (nominative, accusative, dative, genitive singular)

Really complex morphology: Arabic

• Concept defined by three consonants

• Example inflectional morphology:

• concept: ktb (to write)

form            gloss              pattern
kataba          he wrote           CaCaCa
katabna         we wrote           CaCaCna
katabuu         they wrote         CaCaCuu
yaktubu         he writes          yaCCuCu
naktubu         we write           naCCuCu
yaktabuuna      they write         yaCCaCuuna
sayaktubu       he will write      sayaCCuCu
sanaktubu       we will write      sanaCCuCu
sayaktabuuna    they will write    sayaCCaCuuna

Source: http://www.eskimo.com/~ram/arabic_morphology.html
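The root-and-pattern forms listed above can be generated mechanically: each C slot in a pattern is filled with the next consonant of the root. The function name `apply_pattern` is an assumption for illustration, and the transliterations are taken as given on the slide.

```python
def apply_pattern(root, pattern):
    """Fill each 'C' slot in the pattern with the next consonant of the root."""
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

for pattern in ["CaCaCa", "CaCaCna", "yaCCuCu", "sayaCCaCuuna"]:
    print(pattern, "->", apply_pattern("ktb", pattern))
# CaCaCa -> kataba
# CaCaCna -> katabna
# yaCCuCu -> yaktubu
# sayaCCaCuuna -> sayaktabuuna
```

Because the root consonants are interleaved with the pattern, there is no contiguous stem to split off, which is what makes this kind of morphology hard for prefix- or suffix-based methods.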

Problems caused by morphology

• In general

• Data sparsity: alignments to words in the other language are needlessly divided, fracturing statistics

Morphology creates sparsity

• Common relationships are hidden

houses = plural(house)

was = past-tense(is)

children = plural(child)

• Data is fragmented
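To make the fragmentation concrete, the sketch below counts a handful of made-up tokens twice: once as surface forms and once after mapping each form to its lemma. The token list and the `LEMMA` table are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: observed surface forms and a hand-written lemma table.
tokens = ["house", "houses", "house", "child", "children", "houses"]
LEMMA = {"houses": "house", "children": "child"}

surface_counts = Counter(tokens)
lemma_counts = Counter(LEMMA.get(t, t) for t in tokens)

print(dict(surface_counts))  # {'house': 2, 'houses': 2, 'child': 1, 'children': 1}
print(dict(lemma_counts))    # {'house': 4, 'child': 2}
```

The same six observations support four thinly populated entries in the first case and two better-estimated entries in the second.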

Problems caused by morphology


• Source side

• Unseen inflections: complex inflectional morphology may result in particular versions of a word not being seen

Hund dog

Hunde ?

Problems caused by morphology


• Target side

• The right form must be selected, but

• Richer morphology trades off with word order

• Morphology can encode long-distance dependencies

Target-side problems

• Inflection varies by case

I gave her the spoon for her birthday

Ich gab ihr den Löffel zum Geburtstag (accusative: den Löffel)

The spoon was old and rusty

Der Löffel war alt und rostig (nominative: der Löffel)

• Inflection frees up word order (in theory, anyway)

Addressing morphology

• There are a number of techniques used to address morphology

• Splitting

• Truncation and lemmatization

• And a number of techniques used to incorporate ambiguity and leverage diverse sources of information

• Decoding from confusion networks

• Factored translation

Splitting

• An obvious approach is to split up tokens, either manually or automatically

• Empirical Methods for Compound Splitting (Koehn & Knight, 2003)

Compound splitting

• German is known for long noun compounds

• Großeltern (grandparents)

• Waschmaschine (washing machine)

• Museenverwaltung (museum management)

• Sometimes this is fine, but sometimes this complicates learning word translations

German-English compound splitting

• Aktionsplan → action plan, plan of action

• Technique 1: break word into parts that occur elsewhere

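• Corpus frequency in parentheses; each candidate is scored by the geometric mean of its parts' frequencies

aktionsplan (852) = 852

aktion (960) + plan (710) = 825.6

aktions (5) + plan (710) = 59.6

akt (224) + ion (1) + plan (710) = 54.2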

• Problem:

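• freitag (556) vs. frei (885) + tag (1864): the split scores higher than the whole word, so Freitag (Friday) is wrongly split into “free” + “day”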
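The scores on this slide are geometric means of the parts' frequencies, so Technique 1 can be sketched as follows. The counts are copied from the slide; the function names and the `FILLERS` tuple (linking letters that may be dropped between parts, such as the s in Aktion-s-plan) are assumptions for illustration, not the paper's exact procedure.

```python
from math import prod

# Corpus frequencies copied from the slide.
COUNTS = {
    "aktionsplan": 852, "aktion": 960, "aktions": 5,
    "akt": 224, "ion": 1, "plan": 710,
    "freitag": 556, "frei": 885, "tag": 1864,
}
FILLERS = ("", "s", "es")  # assumed linking elements that may be skipped between parts

def geometric_mean(counts):
    return prod(counts) ** (1.0 / len(counts))

def segmentations(word, counts, min_len=3):
    """Yield every way to cover `word` with known parts, optionally skipping a filler."""
    for end in range(min_len, len(word) + 1):
        head = word[:end]
        if head not in counts:
            continue
        rest = word[end:]
        if not rest:
            yield [head]
            continue
        for filler in FILLERS:
            if rest.startswith(filler) and len(rest) > len(filler):
                for tail in segmentations(rest[len(filler):], counts, min_len):
                    yield [head] + tail

def best_split(word, counts=COUNTS):
    """Return the candidate with the highest geometric mean of part frequencies."""
    return max(segmentations(word, counts),
               key=lambda parts: geometric_mean([counts[p] for p in parts]))

print(best_split("aktionsplan"))  # ['aktionsplan']
print(best_split("freitag"))      # ['frei', 'tag']
```

With these counts the whole word aktionsplan (852) narrowly beats aktion + plan (825.6), while freitag is split because frei and tag are both frequent on their own.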

German-English compound splitting

• Technique 2: make sure parts have translations on the English side

• since Frei (free) and Tag (day) are unlikely to exist in the translation of the sentence, Freitag (Friday) would not be split

• Problem: ambiguity (the word translations might not always appear)

• Grundrechte (basic rights) = Grund (reason/foundation) + rechte (rights), but the English side has “basic”, not “reason” or “foundation”
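A sketch of the Technique 2 check, assuming a small hand-written German-English lexicon (`LEXICON`) and a hypothetical helper `parts_covered`. A split is accepted only if every part has at least one known translation in the English sentence.

```python
# Hypothetical toy lexicon; a real system would use a learned translation table.
LEXICON = {
    "frei": {"free"}, "tag": {"day"},
    "aktion": {"action"}, "plan": {"plan"},
    "grund": {"reason", "foundation"}, "rechte": {"rights"},
}

def parts_covered(parts, english_tokens, lexicon=LEXICON):
    """True if each part has some translation among the English tokens."""
    english = set(english_tokens)
    return all(lexicon.get(p, set()) & english for p in parts)

print(parts_covered(["aktion", "plan"], "an action plan".split()))    # True
print(parts_covered(["frei", "tag"], "see you on friday".split()))    # False
print(parts_covered(["grund", "rechte"], "the basic rights".split())) # False
```

The second result is the intended behavior (Freitag stays whole); the third shows the ambiguity problem, where a correct split is rejected because the expected word translation does not appear.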

German-English compound splitting

• Technique 3: build a separate translation table from the splits produced by Technique 1, and use it as a second-level check

• Further issue: common words result in splits

• folgenden (following) → folgen (consequences) + den (the)

• Solution: POS-tag the German text and limit splitting to certain word classes
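One way to read the POS restriction is as a filter on the parts of a candidate split. The sketch below assumes a hand-written tag table (`PART_TAGS`, with STTS-style labels such as NN for nouns and ART for articles) and an assumed set of allowed classes; the actual classes used in the paper may differ.

```python
# Hypothetical tags for candidate parts (NN = noun, ART = article).
PART_TAGS = {"folgen": "NN", "den": "ART", "aktion": "NN", "plan": "NN"}

# Assumed set of content-word classes that may appear as compound parts.
ALLOWED = {"NN", "ADJA", "VVFIN"}

def split_allowed(parts, part_tags=PART_TAGS, allowed=ALLOWED):
    """Accept a split only if every part is tagged with an allowed class."""
    return all(part_tags.get(p) in allowed for p in parts)

print(split_allowed(["folgen", "den"]))   # False: "den" is an article
print(split_allowed(["aktion", "plan"]))  # True
```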

German-English results

• BLEU score: 30.5 (raw), 34.4 (best splitting)

• Lessons

• Heuristic splitting is messy: a cascade of exceptions

• These approaches are also largely specific to German (assuming a particular kind of morphology, and requiring a tagger, for example)

Truncation

• If you don’t have a morphological analyzer, a poor man’s approximation is to simply truncate the word

• What are some limitations of this approach?

• Goldwater & McClosky (2005) applied this to Czech-English
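Truncation is simple enough to try directly. In the sketch below the cutoff of four characters and the German word list are arbitrary choices for illustration; they are not settings from the cited work.

```python
def truncate(word, n=4):
    """Keep only the first n characters as a crude stand-in for the stem."""
    return word[:n]

words = ["hund", "hunde", "hundert", "haus", "häuser"]
print({w: truncate(w) for w in words})
# {'hund': 'hund', 'hunde': 'hund', 'hundert': 'hund', 'haus': 'haus', 'häuser': 'häus'}
```

The output shows both failure modes: hundert (hundred) is merged with Hund (dog), while Haus and Häuser stay apart because the plural changes the vowel inside the stem.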

Czech-English

• Truncation isn’t as effective as a true lemmatizer, but it’s better than nothing

Translation from lattices

• In the German-English example, we committed to one split per word before learning phrase tables and before translating

• This is problematic if the segmentation contains mistakes

• Idea: preserve the ambiguity of splitting and let the decoder efficiently explore all splits

Confusion networks

• A simplified form of lattice

• Czech-English example from Dyer (WMT 2007)
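As a rough picture of the data structure, a confusion network can be held as a list of columns, one per position, where each column maps alternative words to probabilities. The German example and its probabilities below are made up for illustration (they are not from the cited paper), and the brute-force enumeration is only for inspection; a decoder would search the network with dynamic programming instead.

```python
from itertools import product

# Hypothetical confusion network: surface form vs. lemma at the second position.
network = [
    {"die": 1.0},
    {"hunde": 0.6, "hund": 0.4},
    {"bellen": 1.0},
]

def paths(network):
    """Enumerate every path with its probability (product of the arc probabilities)."""
    for choice in product(*(column.items() for column in network)):
        words = [word for word, _ in choice]
        prob = 1.0
        for _, p in choice:
            prob *= p
        yield words, prob

for words, prob in paths(network):
    print(" ".join(words), prob)
# die hunde bellen 0.6
# die hund bellen 0.4
```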

Results

• By themselves, lemmatization and truncation were not especially helpful

• A backoff model (in which lower-order models are consulted only when needed) showed some improvement

• The best model made use of a lemmatized confusion network
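The backoff idea can be sketched as a two-stage lookup: try the surface form first and consult the lemma table only when the surface form is unknown. All three tables below are hypothetical toy data, and `translate` is an illustrative name.

```python
# Hypothetical toy tables.
SURFACE_TABLE = {"hund": "dog", "haus": "house"}
LEMMA_TABLE = {"hund": "dog", "haus": "house"}
LEMMAS = {"hunde": "hund", "häuser": "haus"}

def translate(word):
    """Look up the surface form; back off to the lemma only when needed."""
    if word in SURFACE_TABLE:
        return SURFACE_TABLE[word]
    lemma = LEMMAS.get(word, word)
    return LEMMA_TABLE.get(lemma)  # None if the lemma is unknown too

print(translate("hund"))   # dog
print(translate("hunde"))  # dog
print(translate("katze"))  # None
```

The unseen form hunde now gets a translation, at the cost of losing the plural on the English side.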

Factored Translation

• Standard phrase-based model: translate sequences of whitespace-delimited tokens

• An alternative is factored translation (Koehn & Hoang, 2007), which simultaneously considers multiple sources of evidence

Factored translation

• Integrates a more complex representation of words directly into the decoder

• Contrast this with some of the other approaches we have considered

Factored translation

• Steps

• Translate input factors (phrases) into output factors

• Generate surface forms from the output factors (words)

• Example from paper {surface form | lemma | POS | infl}

• Map lemma: {häuser | haus | NN | pl-nom-neut} → {?|house|?|?, ?|home|?|?, ?|building|?|?}

• Map morphology: → {?|house|NN|pl, ?|home|NN|pl, ?|building|NN|pl, ?|house|NN|sg}

• Generate surface: → {houses|house|NN|pl, homes|home|NN|pl, buildings|building|NN|pl, house|house|NN|sg}
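The three steps above can be mimicked with three lookup tables. This is only a toy sketch: the table names are assumptions, the entries are restricted to the values shown in the example, and a real factored decoder scores and prunes these expansions rather than listing them.

```python
from itertools import product

# Hypothetical tables holding only the entries from the example.
LEMMA_MAP = {"haus": ["house", "home", "building"]}
MORPH_MAP = {("NN", "pl-nom-neut"): [("NN", "pl"), ("NN", "sg")]}
SURFACE = {
    ("house", "NN", "pl"): "houses",
    ("home", "NN", "pl"): "homes",
    ("building", "NN", "pl"): "buildings",
    ("house", "NN", "sg"): "house",
}

def translate_factored(lemma, pos, infl):
    """Map lemma and morphology separately, then generate known surface forms."""
    results = []
    for tgt_lemma, (tgt_pos, tgt_infl) in product(LEMMA_MAP[lemma], MORPH_MAP[(pos, infl)]):
        surface = SURFACE.get((tgt_lemma, tgt_pos, tgt_infl))
        if surface is not None:  # drop combinations with no known surface form
            results.append((surface, tgt_lemma, tgt_pos, tgt_infl))
    return results

for option in translate_factored("haus", "NN", "pl-nom-neut"):
    print("|".join(option))
# houses|house|NN|pl
# house|house|NN|sg
# homes|home|NN|pl
# buildings|building|NN|pl
```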

Results

Summary

• Morphology is a real problem in translation, especially for low-resource languages

• Linguistic approaches are useful (e.g., lemmatization), and even linguistic approximations (e.g., truncating) can do well

• Morphology is far from a solved problem

References

• Empirical Methods for Compound Splitting (Koehn & Knight, 2003)

• The ‘noisier channel’: translation from morphologically complex languages (Dyer, WMT 2007)

• Factored Translation Models (Koehn & Hoang, 2007)

• Improving Statistical MT through Morphological Analysis (Goldwater & McClosky, 2005)
