Corpora and Translation Parallel corpora Statistical MT (not to mention: Corpus of translated text, for translation studies)


Corpora and Translation

Parallel corpora

Statistical MT

(not to mention: Corpus of translated text, for translation studies)


Parallel corpora

• Corpora of texts and their translations

• Basic idea that such parallel corpora implicitly contain lots of information about translation equivalence

• Nowadays many such “bitexts” are available
  – bilingual countries have laws, parliamentary proceedings, and other documents
  – large multinational organizations (UN, EU [Europarl corpus], etc.)
  – multinational commercial organizations produce multilingual texts


Bilingual concordance

Source: TransSearch, Laboratoire de Recherche Appliquée en Linguistique Informatique, Université de Montréal

http://www-rali.iro.umontreal.ca


Parallel corpora

• Usually not corpora in the strict sense (planned, annotated, etc.)

• Usefulness may depend on
  – the quality of translation
  – the closeness of translation
  – whether we have a text and its translation, or a multilingually authored text
  – the language pair

• Parallel corpus needs to be aligned


Alignment

• Means annotating the bilingual corpus to show explicitly the correspondences
  – at sentence level
  – at word and phrase level

• Main difficulty for sentence alignment is that translations do not always keep sentence boundaries, or even sentence order

• In addition, translation may be “localized” and therefore not especially faithful


Sentence-level alignment

• If parallel corpus is quite a literal translation, this can be done using quite low-level information
  – sentence length
  – looking for anchors
    • proper names, dates, figures
    • e.g. in a parliamentary debate, speakers’ names
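Length-based sentence alignment can be sketched with a small dynamic program. This is only a crude illustration, not any particular published aligner: the cost is the plain difference in character length, the allowed groupings (1:1, 1:2, 2:1, 1:0, 0:1) and the `skip_penalty` value are assumptions, and the example sentences are invented.

```python
def align_by_length(src, tgt, skip_penalty=50):
    """Group sentences into aligned "beads" so that total length mismatch is minimal."""
    beads = [(1, 1), (1, 2), (2, 1), (1, 0), (0, 1)]  # (source sentences, target sentences)
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                step = abs(s_len - t_len)
                if di == 0 or dj == 0:
                    step += skip_penalty  # discourage leaving a sentence unaligned
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the beads.
    result, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        result.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return result[::-1]


# Invented example: the second English sentence is split in two in French.
english = ["The house is big.", "It was built in 1900 and it still stands."]
french = ["La maison est grande.", "Elle a été construite en 1900.",
          "Elle est toujours debout."]
alignment = align_by_length(english, french)
print(alignment)
```

Anchors such as the date "1900" could be added as a bonus in the cost for beads that share them; that refinement is left out here.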


Alignment tools


Corpus-based MT

• Translation memory (tool for translators)
  – database of previous translations
  – find close matching examples to current translation unit
  – translator decides what to do with it


Note that translator has to know/decide what bits of the target sentence to change
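The "close matching" lookup of a translation memory can be sketched with Python's standard `difflib`. The memory contents, the `best_match` helper and the 0.7 threshold are assumptions made for illustration; real tools use their own similarity measures.

```python
from difflib import SequenceMatcher


def best_match(memory, segment, threshold=0.7):
    """Return (score, stored_source, stored_target) for the closest entry, or None."""
    scored = [
        (SequenceMatcher(None, segment.lower(), src.lower()).ratio(), src, tgt)
        for src, tgt in memory
    ]
    score, src, tgt = max(scored)
    return (score, src, tgt) if score >= threshold else None


# Hypothetical memory of previously translated units.
memory = [
    ("Press the red button.", "Appuyez sur le bouton rouge."),
    ("Close the door.", "Fermez la porte."),
]
match = best_match(memory, "Press the green button.")
print(match)
```

The program only retrieves the stored pair; it does not indicate that "rouge" is the part of the target sentence to change, which is exactly the decision left to the translator.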


Corpus-based MT

• Translation memory (tool for translators)
  – database of previous translations
  – find close matching examples to current translation unit
  – translator decides what to do with it

• Example-based translation
  – similar idea, but computer program tries to manipulate example(s)
  – may involve “learning” general rules from multiple examples


Statistical MT

• Pioneered by IBM in early 1990s

• Spurred on by the greater success in speech recognition of statistical over linguistic rule-based approaches

• Idea that translation can be modelled as a statistical process

• Seems to work best in a limited domain, where the given data is a good model of future translations


Translation as a probabilistic problem

• For a given SL sentence Si, there are a number of “translations” T of varying probability

• Task is to find for Si the sentence Tj for which the probability P(Tj | Si) is the highest


Two models

• P(Tj | Si) is a function of two models:

– The probabilities of the individual words that make up Tj given the individual words in Si - the “translation model”

– The probability that the individual words that make up Tj are in the appropriate order – the “language model”


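Expressed in mathematical terms:

$$\arg\max_{T} P(T \mid S) = \arg\max_{T} \frac{P(T)\,P(S \mid T)}{P(S)}$$

Since S is a given, and constant, this can be simplified as

$$\arg\max_{T} P(T \mid S) = \arg\max_{T} P(T)\,P(S \mid T)$$

where P(T) is the language model and P(S | T) is the translation model.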
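As a toy illustration of the argmax, the sketch below scores a handful of candidate translations by the product of a language model probability P(T) and a translation model probability P(S|T). All the numbers and candidate sentences are invented for the example; a real system estimates them from corpora and cannot enumerate every candidate.

```python
def best_translation(candidates, lm, tm):
    """Pick the candidate T that maximizes P(T) * P(S|T)."""
    return max(candidates, key=lambda t: lm[t] * tm[t])


# Hypothetical scores for the source sentence "la bruja verde".
lm = {  # P(T): how plausible is T as an English sentence?
    "the green witch": 0.02,
    "the witch green": 0.0001,
    "the green which": 0.001,
}
tm = {  # P(S|T): how well do the words of T account for the words of S?
    "the green witch": 0.3,
    "the witch green": 0.3,
    "the green which": 0.01,
}
best = best_translation(list(lm), lm, tm)
print(best)
```

The two word orders get the same translation model score here; it is the language model that separates them.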


So how do we translate?

• For a given input sentence Si we have to have a practical way to find the Tj that maximizes the formula

• We have to start somewhere, so we start with the translation model: which words look most likely to help us?

• In a systematic way we can keep trying different combinations together with the language model until we stop getting improvements
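The "keep trying different combinations until we stop getting improvements" step is a form of hill climbing. A minimal sketch, assuming the translation model has already supplied a bag of target words and using an invented table of bigram log-probabilities as the language model:

```python
from itertools import combinations

# Hypothetical bigram log-probabilities; anything unlisted gets UNSEEN.
BIGRAM_LOGPROB = {
    ("<s>", "the"): -0.5,
    ("the", "green"): -1.0,
    ("green", "witch"): -0.7,
    ("witch", "</s>"): -0.5,
}
UNSEEN = -10.0


def lm_score(words):
    """Sum of bigram log-probabilities over the sentence, with boundary markers."""
    seq = ["<s>"] + list(words) + ["</s>"]
    return sum(BIGRAM_LOGPROB.get(bigram, UNSEEN) for bigram in zip(seq, seq[1:]))


def hill_climb(words):
    """Swap pairs of words for as long as a swap improves the language model score."""
    best = list(words)
    best_score = lm_score(best)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(best)), 2):
            candidate = best[:]
            candidate[i], candidate[j] = candidate[j], candidate[i]
            score = lm_score(candidate)
            if score > best_score:
                best, best_score, improved = candidate, score, True
    return best, best_score


# Start from the word-for-word order of "la bruja verde".
order, score = hill_climb(["the", "witch", "green"])
print(order, score)
```

Like any hill climber, this can stop at a local optimum; real decoders use more elaborate heuristic search.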


[Diagram: Input sentence → Translation model → Bag of possible words → Language model → Most probable translation, with a loop: seek improvement by trying other combinations]


Where do the models come from?

• All the statistical parameters are pre-computed (“learned”), based on a parallel corpus

• Language model is probabilities of word sequences (n-grams)

• Translation model is derived from aligned parallel corpus

• This approach is attractive to some as an example of “machine learning”
  – The computer learns to translate (just) from seeing previous examples of translation


The translation model

• Take sentence-aligned parallel corpus

• Extract entire vocabulary for both languages

• For every word-pair, calculate probability that they correspond – e.g. by comparing distributions
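One well-known way to estimate word-pair probabilities from a sentence-aligned corpus is expectation-maximization in the style of IBM Model 1. The slides do not say which estimation method is meant, so the sketch below is an illustrative stand-in; the three-sentence German-English corpus is a toy, and the NULL word used in the full model is omitted.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=20):
    """Estimate t[(tgt_word, src_word)] = P(tgt_word | src_word) by EM."""
    src_vocab = {s for src, _ in pairs for s in src}
    tgt_vocab = {t for _, tgt in pairs for t in tgt}
    # Start with no preference: every target word equally likely for every source word.
    t = {(tw, sw): 1.0 / len(tgt_vocab) for tw in tgt_vocab for sw in src_vocab}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: share each target word among the source words of its sentence.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: re-estimate the probabilities from the fractional counts.
        for tw, sw in t:
            t[(tw, sw)] = count[(tw, sw)] / total[sw]
    return t


corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
t = train_ibm1(corpus)
for src_word in ["das", "haus", "buch", "ein"]:
    best = max(["the", "house", "book", "a"], key=lambda e: t[(e, src_word)])
    print(src_word, "->", best, round(t[(best, src_word)], 3))
```

No word alignment is given as input: "das" ends up paired with "the" only because the two co-occur across sentences more consistently than any alternative.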


Problem: fertility

• “fertility”: not all word correspondences are 1:1
  – Some words have multiple possible translations, e.g. the → {le, la, l’, les}
  – Some words have no translation, e.g. se in il se rase ‘he shaves’
  – Some words are translated by several words, e.g. cheap → peu cher
  – Not always obvious how to align


Problem: distortion

• Notice that corresponding words do not appear in the same order.

• The translation model includes probabilities for “distortion”
  – e.g. P(5|2): the probability that a source word (ws) in position 2 will produce a target word (wt) in position 5
  – can be more complex: P(5|2,4,6): the probability that ws in position 2 will produce a wt in position 5 when S has 4 words and T has 6.


The language model

• Impractical to calculate probability of every word sequence:
  – Many will be very improbable …
  – Because they are ungrammatical
  – Or because they happen not to occur in the data

• Probabilities of sequences of n words (“n-grams”) more practical
  – Bigram model:

$$P(w_1, w_2, \ldots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$

where $P(w_i \mid w_{i-1}) = f(w_{i-1}, w_i) / f(w_{i-1})$, f is a frequency count in the training data, and $w_0$ is a sentence-start marker
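A bigram model of this kind can be estimated by simple counting. The sketch below uses an invented three-sentence corpus and adds `<s>` / `</s>` boundary markers (an assumption, though a common one); each bigram probability is the count of the pair divided by the count of the first word.

```python
from collections import Counter


def train_bigram(sentences):
    """Count context words and adjacent word pairs, with sentence boundary markers."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])  # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams


def bigram_prob(contexts, bigrams, prev, word):
    """P(word | prev) = f(prev, word) / f(prev); zero if prev was never seen."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]


def sentence_prob(contexts, bigrams, sentence):
    """Product of the bigram probabilities along the sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_prob(contexts, bigrams, prev, word)
    return p


corpus = ["the witch is green", "the house is big", "the witch sleeps"]
contexts, bigrams = train_bigram(corpus)
print(sentence_prob(contexts, bigrams, "the witch is big"))   # unseen sentence, seen bigrams
print(sentence_prob(contexts, bigrams, "witch the is big"))   # contains unseen bigrams
```

The first sentence never occurs in the corpus but still gets a non-zero probability because all of its bigrams do; the second gets zero as soon as one bigram is unseen.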


Sparse data

• Relying on n-grams with a large n risks 0-probabilities

• Bigrams are less risky but sometimes not discriminatory enough
  – e.g. I hire men who is good pilots

• 3- or 4-grams allow a nice compromise, and if a 3-gram is previously unseen, we can give it a score based on the component bigrams (“smoothing”)
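Scoring an unseen 3-gram from its component bigram can be sketched as a simple back-off. The scheme below multiplies the shorter-context estimate by a fixed factor `alpha` each time it backs off (in the style often called "stupid backoff"); the value 0.4 and the two-sentence corpus are assumptions for illustration, and the results are scores rather than true probabilities.

```python
from collections import Counter


def ngram_counts(sentences, max_n=3):
    """Count all n-grams up to max_n, padding the sentence start."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def backoff_score(counts, context, word, alpha=0.4):
    """Relative frequency of word after context; shorten the context if unseen."""
    context = tuple(context)
    if not context:
        total = sum(c for ngram, c in counts.items() if len(ngram) == 1)
        return counts[(word,)] / total
    if counts[context + (word,)] > 0:
        return counts[context + (word,)] / counts[context]
    return alpha * backoff_score(counts, context[1:], word, alpha)


corpus = ["I hire men who are good pilots", "he is a man who is good"]
counts = ngram_counts(corpus)
seen = backoff_score(counts, ("men", "who"), "are")    # 3-gram seen in the corpus
unseen = backoff_score(counts, ("men", "who"), "is")   # falls back to the bigram "who is"
print(seen, unseen)
```

The 3-gram context "men who" is what lets the model prefer "are" over "is" here; a bigram model alone sees "who is" and "who are" as equally likely in this corpus.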


Put it all together and …?

• To build a statistical MT system we need:
  – Aligned bilingual corpus
  – “Training programs” which will extract from the corpora all the statistical data for the models
  – A “decoder” which takes a given input, and seeks the output that maximizes the magic argmax formula, based on a heuristic search algorithm

• Software for this purpose is freely available
  – http://www.statmt.org/moses/, http://www.isi.edu/licensed-sw/pharaoh/

• Claim is that an MT system for a new language pair can be built in a matter of hours


SMT latest developments

• Nevertheless, quality is limited

• SMT researchers quickly learned that this crude approach can get them only so far (quite far actually), but that to go the extra distance you need linguistic knowledge (e.g. morphology, “phrases”, constituents)

• Latest developments aim to incorporate this

• Big difference is that it too can be LEARNED (automatically) from corpora

• So SMT still contrasts with traditional RBMT where rules are “hand coded” by linguists


Direct phrase alignment

(Wang & Waibel 1998, Och et al. 1999, Marcu & Wong 2002)

• Enhance word translation model by adding joint probabilities, i.e. probabilities for phrases

• Phrase probabilities compensate for missing lexical probabilities

• Easy to integrate probabilities from different sources/methods, allows for mutual compensation


Word alignment induced model

Koehn et al. 2003; example stolen from Knight & Koehn http://www.iccs.inf.ed.ac.uk/~pkoehn/publications/tutorial2003.pdf

Maria did not slap the green witch

Maria no daba una bofetada a la bruja verde

Start with all phrase pairs justified by the word alignment


Word alignment induced model

Koehn et al. 2003; example stolen from Knight & Koehn http://www.iccs.inf.ed.ac.uk/~pkoehn/publications/tutorial2003.pdf

(Maria, Maria), (no, did not), (daba una bofetada, slap), (a la, the), (verde, green), (bruja, witch)


Word alignment induced model

Koehn et al. 2003; example stolen from Knight & Koehn http://www.iccs.inf.ed.ac.uk/~pkoehn/publications/tutorial2003.pdf

(Maria, Maria), (no, did not), (daba una bofetada, slap), (a la, the), (verde, green), (bruja, witch), (Maria no, Maria did not), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the), (bruja verde, green witch)

etc.


Word alignment induced model

Koehn et al. 2003; example stolen from Knight & Koehn http://www.iccs.inf.ed.ac.uk/~pkoehn/publications/tutorial2003.pdf

(Maria, Maria), (no, did not), (daba una bofetada, slap), (a la, the), (bruja, witch), (verde, green), (Maria no, Maria did not), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the), (bruja verde, green witch), (Maria no daba una bofetada, Maria did not slap), (no daba una bofetada a la, did not slap the), (a la bruja verde, the green witch), (Maria no daba una bofetada a la, Maria did not slap the), (daba una bofetada a la bruja verde, slap the green witch), (no daba una bofetada a la bruja verde, did not slap the green witch), (Maria no daba una bofetada a la bruja verde, Maria did not slap the green witch)
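The list of phrase pairs "justified by the word alignment" can be reproduced with a short consistency check: a source span and its target span form a phrase pair only if no word inside either span is aligned to a word outside the other. The word alignment below is an assumption inferred from the smallest phrase pairs listed (e.g. both "did" and "not" linked to "no"); it is not copied from the cited tutorial.

```python
def extract_phrases(src, tgt, alignment):
    """Return all (source phrase, target phrase) pairs consistent with the alignment."""
    pairs = []
    for s1 in range(len(src)):
        for s2 in range(s1, len(src)):
            # Target positions linked to any word in the source span.
            tgt_pos = [t for s, t in alignment if s1 <= s <= s2]
            if not tgt_pos:
                continue
            t1, t2 = min(tgt_pos), max(tgt_pos)
            # Consistent only if nothing in the target span links outside the source span.
            if all(s1 <= s <= s2 for s, t in alignment if t1 <= t <= t2):
                pairs.append((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return pairs


spanish = "Maria no daba una bofetada a la bruja verde".split()
english = "Maria did not slap the green witch".split()
# Assumed alignment points as (Spanish index, English index).
alignment = [(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3),
             (5, 4), (6, 4), (7, 6), (8, 5)]
phrases = extract_phrases(spanish, english, alignment)
for pair in phrases:
    print(pair)
```

A span such as "a la bruja" is rejected: its target span would have to cover "the green witch", but "green" is aligned to "verde", which lies outside the source span.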


Alignment templates

(Och et al. 1999; further developed by Marcu and Wong 2002, Koehn and Knight 2003, Koehn et al. 2003)

• Problem of sparse data worse for phrases

• So use word classes instead of words
  – alignment templates instead of phrases
  – more reliable statistics for translation table
  – smaller translation table
  – more complex decoding

• Word classes are induced (by distributional statistics), so may not correspond to intuitive (linguistic) classes

• Takes context into account


Problems with phrase-based models

• Still do not handle very well ...
  – dependencies (especially long-distance)
  – distortion
  – discontinuities (e.g. bought = habe ... gekauft)

• More promising seems to be ...


Syntax-based SMT

• Better able to handle
  – Constituents
  – Function words
  – Grammatical context (e.g. case marking)

• Inversion Transduction Grammars

• Hierarchical transduction model

• Tree-to-string translation

• Tree-to-tree translation


Inversion transduction grammars

• Wu and colleagues (1997 onwards)

• Grammar generates two trees in parallel and mappings between them

• Rules can specify order changes

• Restriction to binary rules limits complexity
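The core idea, one binary tree read off in two orders, can be sketched very simply. In the toy below each internal node is either "straight" (children in the same order in both languages) or "inverted" (children swapped on the target side). The `Leaf` and `Node` classes and the example tree are hypothetical illustrations, not a grammar formalism taken from the cited work.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    src: str  # source-language word
    tgt: str  # its target-language counterpart


@dataclass
class Node:
    left: "Tree"
    right: "Tree"
    inverted: bool = False  # True: children appear in reverse order in the target


Tree = Union[Leaf, Node]


def yields(tree):
    """Return (source words, target words) generated in parallel by the tree."""
    if isinstance(tree, Leaf):
        return [tree.src], [tree.tgt]
    left_src, left_tgt = yields(tree.left)
    right_src, right_tgt = yields(tree.right)
    src = left_src + right_src
    tgt = right_tgt + left_tgt if tree.inverted else left_tgt + right_tgt
    return src, tgt


# "la bruja verde" / "the green witch": the noun-adjective node is inverted.
tree = Node(Leaf("la", "the"),
            Node(Leaf("bruja", "witch"), Leaf("verde", "green"), inverted=True))
source, target = yields(tree)
print(" ".join(source), "|", " ".join(target))
```

Because every node has exactly two children and only two possible orders, the space of reorderings is much smaller than the set of all permutations, which is what keeps parsing tractable.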


Inversion transduction grammars


Inversion transduction grammars

• Grammar is trained on word-aligned bilingual corpus: Note that all the rules are learned automatically

• Translation uses a decoder which effectively works like traditional RBMT:
  – Parser uses source side of transduction rules to build a parse tree
  – Transduction rules are applied to transform the tree
  – The target text is generated by linearizing the tree


Other approaches

• Other approaches use more and more “linguistic” information

• In each case automatically learned, especially from treebanks

• Traditional (“rule-based”) MT used (hand-written) grammars and lexicons

• State-of-the-art MT is moving back in this direction, except that linguistic rules are machine learned