1 COMP791A: Statistical Language Processing Machine Translation and Statistical Alignment Chap. 13


COMP791A: Statistical Language Processing

Machine Translation and Statistical Alignment

Chap. 13

Contents

1- Machine Translation
2- Statistical Machine Translation
3- Text Alignment
   Length-based methods
   Offset alignment by signal processing techniques
   Lexical methods of sentence alignment
4- Word Alignment

Goal of MT

Text1 in source language --> Text2 in target language

Where:
  meaning(text2) == meaning(text1), i.e. faithful
  text2 is perfectly grammatical and idiomatic, i.e. fluent

MT is very hard
  translation programs available today do not perform very well

Little history of MT

1950’s
  inspired by the code-breakers of WWII
  Russian is just an encoded version of English
  “We’ll have this up and running in a few years, it’ll be great, just give us lots of money”
1964-1966
  ALPAC report (Automatic Language Processing Advisory Committee, formed in 1964; report published in 1966)
  “…we do not have useful machine translation…”
  “…there is no immediate or predictable prospect of useful machine translation…”
  Nearly sank funding for all of AI.
1990’s
  DARPA funds research in MT
  2 “competitive” approaches
    Statistical MT (IBM at TJ Watson Research Center)
    Rule-based MT (CMU, ISI, NMSU)
  Regular competitions
  And the winner was… Systran!

Difficulties in MT

Different word order (SVO vs VSO vs SOV languages)
  “the black cat” (DT ADJ N) --> “le chat noir” (DT N ADJ)
Many-to-many mapping between words in different languages
  “John knows Bill.” --> “John connaît Bill.”
  “John knows Bill will be late.” --> “John sait que Bill sera en retard.”
Overlapping of word senses
  leg: jambe (human), patte (animal), pied (chair), étape (journey)
  foot: pied (human), patte (bird)
  paw: patte (animal)

The Transfer Metaphor

analysis --> transfer --> generation
Each arrow can be implemented with rule-based methods or probabilistically

Level        | English                          | French                           | Transfer
Interlingua  | attraction(NamedJohn, NamedMary, high) (shared by both languages)   | knowledge transfer
Semantics    | loves(John, Mary)                | aime(Jean, Marie)                | semantic transfer
Syntax       | S(NP(John) VP(loves, NP(Mary)))  | S(NP(Jean) VP(aime, NP(Marie)))  | syntactic transfer
Words        | John loves Mary                  | Jean aime Marie                  | word transfer (memory-based translation)

Syntactic transfer

Solves some problems…
  Word order
  Some cases of lexical choice
  Ex:
    Dictionary of analysis
      know: verb ; transitive ; subj: human ; obj: NP || Sentence
    Dictionary of transfer
      know + obj [NP] --> connaître
      know + obj [sentence] --> savoir
But syntax is not enough…
  No one-to-one correspondence between syntactic structures in different languages (syntactic mismatch)

2- Statistical MT: Being faithful & fluent

Often impossible to have a true translation; one that is:
  Faithful to the source language, and
  Fluent in the target language
Ex:
  Japanese: “fukaku hansei shite orimasu”
  Fluent translation: “we apologize”
  Faithful translation: “we are deeply reflecting (on our past behaviour, and what we did wrong, and how to avoid the problem next time)”
So need to compromise between faithfulness & fluency
Statistical MT tries to maximise some function that represents the importance of faithfulness and fluency

  Best-translation T* = argmax_T fluency(T) x faithfulness(T, S)

The Noisy Channel Model

Statistical MT is based on the noisy channel model
  Developed by Shannon to model communication (ex. over a phone line)
Noisy channel model in SMT (ex. en|fr):
  Assume that the true text is in English
  But when it was transmitted over the noisy channel, it somehow got corrupted and came out in French
  i.e. the noisy channel has deformed/corrupted the original English input into French
  So really… French is a form of noisy English
The task is to recover the original English sentence (or to decode the French into English)

Fundamental Equation for SMT

Assume we are translating from FR-->EN (en|fr)
Intuitively we saw that:

  e* = argmax_e fluency(e) x faithfulness(e, f)

More formally:

  e* = argmax_e P(e|f)

By Bayes' theorem:

$$P(e \mid f) = \frac{P(f \mid e) \times P(e)}{P(f)}$$

But P(f) is the same for all e, so

$$e^* = \arg\max_e P(f \mid e) \times P(e)$$

May seem circular… why not just P(e|f)?
  P(f|e) x P(e) allows us to have a sloppy translation model
  Hopefully P(e) will correct the mistakes of the translation model

Example of SMT (en|jp)

Source sentence (Japanese): “2000men taio”

Translation model

                 2000men      taio
  More probable  2000         correspondence
                 Year 2000    corresponding
                 Y2K          equivalent
                 200 years    tackle
                 200 year     deal with
  Less probable  …            …

From the Translation model: “2000 correspondence” is the best translation
But the Language model: “2000 correspondence” is not frequent at all
So overall: “dealing with Y2K” is the best translation! (maximizes their product)
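The selection in this example can be sketched in a few lines of Python. This is only an illustration: the slide gives no numbers, so every probability below is invented, and `decode` is a hypothetical helper name rather than part of any library.

```python
# All scores are hypothetical, chosen only to mirror the slide's story.
translation_model = {  # stands in for P(f | e)
    "2000 correspondence": 0.30,
    "Year 2000 corresponding": 0.20,
    "dealing with Y2K": 0.10,
}
language_model = {  # stands in for P(e)
    "2000 correspondence": 0.0001,
    "Year 2000 corresponding": 0.0005,
    "dealing with Y2K": 0.0100,
}


def decode(candidates, tm, lm):
    """Return the candidate e that maximises P(f|e) x P(e)."""
    return max(candidates, key=lambda e: tm[e] * lm[e])


best_by_tm = max(translation_model, key=translation_model.get)
best = decode(list(translation_model), translation_model, language_model)
print(best_by_tm)  # best according to the translation model alone
print(best)        # best according to the product
```

With these made-up numbers the translation model alone prefers "2000 correspondence", while the product prefers "dealing with Y2K", which is the behaviour the slide describes.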

We need 3 things (for en|fr):

1. A Language Model of English: P(e)
   Measures fluency
   Probability of an English sentence
   We can do this with an n-gram or PCFG
   ~ Provides the right word ordering and collocations
   ~ Provides a set of fluent sentences to test for potential translation
2. A Translation Model: P(f|e)
   Measures faithfulness
   Probability of an (French, English) pair
   We can do this with text (word) alignment of parallel corpora
   ~ Provides the right bag of words
   ~ Tests if a given fluent sentence is a translation
3. A Decoder: argmax
   An effective and efficient search technique to find e*
   Usually we use a heuristic search

We need a Language Model P(e)

seen in class…

We need 3 things (for en|fr):

1. A Language Model of English: P(e)
2. --> A Translation Model: P(f|e)
3. A Decoder: argmax

We need a translation model P(f|e)

ex: IBM model 3
Probability of an FR sentence being a translation of an EN sentence
  ~ the product of the probabilities that each FR word is the translation of some EN word

$$P(f \mid e) = \prod_{i=1}^{n} P(f_i \mid e_i)$$

unigram translation model
  ex: P(le chien est mort | the dog is dead) = P(le|the) x P(chien|dog) x P(est|is) x P(mort|dead)
So we need to know, for each FR word, the probability of it mapping to each possible EN word
But where do we get these probabilities?
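As a minimal sketch of the unigram translation model, the product above can be computed directly from a word-translation table. The table values are invented for illustration, and the function name `unigram_translation_prob` is an assumption; the sketch also assumes the two sentences are already aligned word by word (position i with position i), which is exactly the simplification that fertility and offsets later relax.

```python
import math

# Hypothetical values of P(french_word | english_word), for illustration only.
WORD_TABLE = {
    ("le", "the"): 0.4,
    ("chien", "dog"): 0.8,
    ("est", "is"): 0.7,
    ("mort", "dead"): 0.6,
}


def unigram_translation_prob(french, english, table):
    """P(f|e) as the product over positions i of P(f_i | e_i)."""
    if len(french) != len(english):
        raise ValueError("this sketch assumes a 1-to-1, same-order alignment")
    # Unseen word pairs get probability 0 (no smoothing in this sketch).
    return math.prod(table.get((f, e), 0.0) for f, e in zip(french, english))


p = unigram_translation_prob(
    "le chien est mort".split(), "the dog is dead".split(), WORD_TABLE
)
print(p)
```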

Parallel Texts

Parallel texts or bitexts
  Same content is available in several languages
  Official documents of countries with multiple official languages -> literal, consistent
Alignment
  Paragraph to paragraph, sentence to sentence, word to word

  Language1: Section i > Paragraph i > Sentence i > Phrase i > Word i … Word j
  Language2: Section k > Paragraph k > Sentence k > Phrase k > Word k … Word m

Problem 1: Fertility

word choice is not 1-to-1
  ex: Je mange à la maison. --> I eat home.
solution:
  a word with fertility n gets copied n times, and for each of these n times, gets translated independently
ex: à la maison --> home
  à --> fertility 0
  la --> fertility 0
  maison --> fertility 1
  use unigram translation model to translate maison --> home
ex: home --> à la maison
  home --> fertility 3
  home home home --> à la maison
note: the translation model will give the same probability to: home home home --> maison à la…
  it is up to the language model to select the correct word order

Problem 2: Word order

word order is not the same in both languages
  ex: le chien brun --> the brown dog
solution:
  assign an offset to move words from their original position to their final position
  ex: chien brun --> brown dog
    brown --> offset +1
    dog --> offset -1
Making the offset dependent on the words would be too costly…
so in IBM model 3, the offset only depends on:
  the position of the word within the sentence!!!
  the length of the sentences in both languages
  P(offset=o | Position = p, EngLen = m, FrLen = n)
ex: brown dog
  offset of brown = P(offset | 1,2,2)
  ex: P(+1 | 1,2,2) = .3   P(0 | 1,2,2) = .6   P(-1 | 1,2,2) = .1
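One way to picture the offset model is as a lookup table keyed by (position, English length, French length). The sketch below uses only the three probabilities given in the example; the table layout and the name `offset_prob` are assumptions made for illustration.

```python
# (position, eng_len, fr_len) -> {offset: probability}; values from the example.
OFFSET_MODEL = {
    (1, 2, 2): {+1: 0.3, 0: 0.6, -1: 0.1},
}


def offset_prob(offset, position, eng_len, fr_len, model=OFFSET_MODEL):
    """P(offset | position, eng_len, fr_len); 0.0 for unseen contexts or offsets."""
    return model.get((position, eng_len, fr_len), {}).get(offset, 0.0)


dist = OFFSET_MODEL[(1, 2, 2)]
most_likely = max(dist, key=dist.get)
print(offset_prob(+1, 1, 2, 2), most_likely)
```

Note that the word itself never appears in the key: the same distribution is used for any word in position 1 of a 2-word sentence, which is the cost-saving choice described above.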

An Example (en|fr)

Given the English:       The brown dog did not go home
Fertility Model:         1 1 1 1 2 1 3
Transformed English:     The brown dog did not not go home home home
Translation Model:       Le brun chien est n' pas allé à la maison
Offset Model:            0 +1 -1 +1 -1 0 0 0 0 0
A possible Translation:  Le chien brun n' est pas allé à la maison

Then use Language Model P(e) to evaluate fluency of all possible translations

$$e^* = \arg\max_e P(f \mid e) \times P(e)$$
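The fertility and offset steps of this example can be replayed mechanically. The sketch below takes the fertilities, the word-by-word translations and the offsets straight from the example; the helper names `apply_fertility` and `apply_offsets` are hypothetical, and the translation step is simply copied in rather than sampled from a model.

```python
def apply_fertility(words, fertilities):
    """Copy each word as many times as its fertility says."""
    return [w for w, n in zip(words, fertilities) for _ in range(n)]


def apply_offsets(words, offsets):
    """Move the word at position p to position p + offset."""
    result = [None] * len(words)
    for pos, (word, off) in enumerate(zip(words, offsets)):
        target = pos + off
        if not 0 <= target < len(words) or result[target] is not None:
            raise ValueError("offsets do not form a valid reordering")
        result[target] = word
    return result


english = "The brown dog did not go home".split()
fertilities = [1, 1, 1, 1, 2, 1, 3]
expanded = apply_fertility(english, fertilities)

# Word-by-word translations of the expanded sentence, as given in the example.
translated = "Le brun chien est n' pas allé à la maison".split()
offsets = [0, +1, -1, +1, -1, 0, 0, 0, 0, 0]
french = apply_offsets(translated, offsets)

print(" ".join(expanded))
print(" ".join(french))
```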

Summary: IBM-3 for (en|fr)

to find P(e|f), we need:
1. Language model for English P(e): P(wordE_i | wordE_i-1)
2. Translation model P(f|e):
   a. Translation model per se: P(wordF | wordE)
   b. Fertility model of English: P(Fertility=n | wordE)
   c. Offset model for French: P(Offset=o | pos, lenF, lenE)

We need 3 things (for en|fr):

1. A Language Model of English: P(e)
2. --> A Translation Model: P(f|e)
3. --> A Decoder: argmax

We needed a decoder

we can compute P(e|f) for any given pair of (en,fr) sentences… that's nice
but: what we really want is to find the English sentence that maximises P(e|f) given a French sentence
  assume a vocabulary of 100,000 (10^5) words in English
  there are (10^5)^n = 10^(5n) possible English sentences of length n
  and many alignments of each one, and many possible offsets
  … we need a search algorithm (ex. A*)

3- Text alignment

used to find P(f|e)
not a trivial task
Problems:
  not always one sentence to one sentence
    translators do not always translate one sentence in the input into one sentence in the output
    although true in 90% of the cases
  crossing dependencies
    the order of sentences is changed in the translation
  large pieces of material can disappear

The Rosetta Stone

Three scripts: Egyptian hieroglyphs, Egyptian Demotic, Greek
carved in 196 BC
found in 1799
decoded in 1822


A modern Rosetta Stone: TransSearch

Example

French:
  Quant aux eaux minérales et aux limonades, elles rencontrent toujours plus d’adeptes.
  En effet notre sondage fait ressortir des ventes nettement supérieures à celles de 1987, pour les boissons à base de cola notamment.

English:
  According to our survey, 1988 sales of mineral water and soft drinks were much higher than in 1987, reflecting the growing popularity of these products.
  Cola drink manufacturers in particular achieved above average growth rate.

Note:
  Re-ordering of phrases
  Disappearance of phrases (they are implied in the French version)

Aligning sentence and paragraph

BEAD is an n:m grouping
S, T: text in two languages
  S = (s1, s2, … , si)
  T = (t1, t2, … , tj)
Each sentence can occur in only one bead
Assume no crossing (but occurs in reality)
Most common (90%): 1:1
But also: 0:1, 1:0, 2:1, 1:2, 2:2, 2:3, 3:2 …

[Diagram: the sentences s1 … si of S and t1 … tj of T grouped into beads b1, b2, b3, b4, b5, …, bk]

Example (con’t)

2:2 alignment: the two French sentences and the two English sentences of the mineral water example form a single bead

Approaches to Text Alignment

Length-Based Methods
  short sentences will be translated with short sentences
  long sentences will be translated with long sentences
Offset Alignment by Signal Processing Techniques
  do not attempt to align beads of sentences
  just try to align position offsets in the two parallel texts
Lexical Methods
  use lexical information to align beads of sentences

Approaches to Text Alignment

--> Length-Based Methods
Offset Alignment by Signal Processing Techniques
Lexical Methods


Length-based method

Rationale: Short sentence -> short sentence / Long sentence -> long sentence
Length: nb of words or nb of characters
Advantages: Efficient (for similar languages) and fast!
Gale and Church (1993):
  Find alignment A with highest probability given the two parallel texts S and T
  Union Bank of Switzerland Corpus (English, French, German)
  Let D(i,j) be the lowest cost alignment (the distance) between sentences s1,…,si and t1,…,tj

$$
D(i,j) = \min
\begin{cases}
D(i, j-1) + \text{cost}(0{:}1\ \text{align}\ \emptyset,\ t_j) \\
D(i-1, j) + \text{cost}(1{:}0\ \text{align}\ s_i,\ \emptyset) \\
D(i-1, j-1) + \text{cost}(1{:}1\ \text{align}\ s_i,\ t_j) \\
D(i-1, j-2) + \text{cost}(1{:}2\ \text{align}\ s_i,\ [t_{j-1}, t_j]) \\
D(i-2, j-1) + \text{cost}(2{:}1\ \text{align}\ [s_{i-1}, s_i],\ t_j) \\
D(i-2, j-2) + \text{cost}(2{:}2\ \text{align}\ [s_{i-1}, s_i],\ [t_{j-1}, t_j])
\end{cases}
$$

$$D(0,0) = 0$$
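The recurrence can be turned into a small dynamic program. The sketch below is not Gale and Church's actual cost function (theirs is probabilistic): as a stand-in it uses the absolute difference of character lengths plus a hypothetical penalty for every bead that is not 1:1, so that merging sentences is not free. The names `BEAD_PENALTY`, `bead_cost` and `align` and all penalty values are assumptions.

```python
import math

# Allowed bead shapes (nb of source sentences, nb of target sentences) with
# hypothetical penalties standing in for the prior probability of each shape.
BEAD_PENALTY = {
    (0, 1): 10.0, (1, 0): 10.0, (1, 1): 0.0,
    (1, 2): 4.0, (2, 1): 4.0, (2, 2): 8.0,
}


def bead_cost(src_lens, tgt_lens):
    """Simplified cost: length difference plus the shape penalty."""
    shape = (len(src_lens), len(tgt_lens))
    return abs(sum(src_lens) - sum(tgt_lens)) + BEAD_PENALTY[shape]


def align(src_lens, tgt_lens):
    """Return (minimum cost, list of beads) using D(i,j) with D(0,0) = 0."""
    n, m = len(src_lens), len(tgt_lens)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for di, dj in BEAD_PENALTY:
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0 or D[pi][pj] == math.inf:
                    continue
                c = D[pi][pj] + bead_cost(src_lens[pi:i], tgt_lens[pj:j])
                if c < D[i][j]:
                    D[i][j], back[i][j] = c, (di, dj)
    # Follow the back-pointers from (n, m) to (0, 0) to recover the beads.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    return D[n][m], beads[::-1]


# Sentence lengths in characters (made up): s1 and s2 together match t1.
cost, beads = align([20, 22, 50, 30], [43, 51, 29])
print(cost, beads)
```

With these invented lengths the cheapest path groups the first two source sentences with the first target sentence, then uses two 1:1 beads. Indices in the returned beads are 0-based.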

Example

Aligning s1, s2, s3, s4 with t1, t2, t3:
  alignment 1: cost(align(s1, t1)) + cost(align(s2, t2)) + cost(align(s3, ∅)) + cost(align(s4, t3))
  alignment 2: cost(align([s1, s2], t1)) + cost(align(s3, t2)) + cost(align(s4, t3))

Cost of an alignment
  Calculate the difference (distance) between lengths of sentences in the beads
  So as to minimize this distance
  i.e. try to align beads so that the lengths of the sentences from the 2 languages in each bead are as similar as possible.

Mean length ratio L2/L1 of sentences (nb of characters) in bead is ~1
  German/English = 1.1
  French/English = 1.06

Results

Gale and Church (1993)
  use Dynamic Programming to efficiently consider all possible alignments and find the minimum cost alignment
  method performs well (at least on related languages)
    4% error rate
    only 2% error rate on 1:1 alignments
    higher error rate on more difficult alignments
  Assumes paragraph alignment
    Without a paragraph alignment, error rate triples

Approaches to Text Alignment

Length-Based Methods
--> Offset Alignment
Lexical Methods

Offset alignment

Length-based methods work well on clean texts but may break down in real-world situations
  Ex: noisy text (OCR output with no clear sentence or paragraph boundaries, …)
Church (1993)
  Goal: Showing roughly what offset in one text aligns with what offset in the other
  uses cognates (words that are similar across languages)
    Ex: proper names, numbers, common ancestors…
    Ex: Smith, 848-3000, superior/supérieur
  But: uses cognates at the level of character sequences, NOT at the word level
  Build a dot-plot

Sample Dot Plot

the source and translated text are concatenated
a square graph is made with this text on both axes
a dot is placed at (x,y) when there is a match [Unit = character 4-grams]

[Figure: dot plot with the concatenated text S T on both axes. The main diagonal is the perfect match of a text with itself; the two off-diagonal quadrants show the match of a text with its translation (cognates).]

The small diagonals provide an alignment in terms of offsets in the two texts
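The construction of the dot plot can be sketched as follows. This is a naive illustration of the three steps above, not Church's implementation (which adds weighting and signal-processing steps on top); the sample sentences and the names `char_ngrams` and `dot_plot` are invented for the example.

```python
def char_ngrams(text, n=4):
    """All overlapping character n-grams of the text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def dot_plot(source, target, n=4):
    """Set of (x, y) offsets in source+target whose character n-grams match."""
    combined = source + target
    positions = {}
    for idx, gram in enumerate(char_ngrams(combined, n)):
        positions.setdefault(gram, []).append(idx)
    dots = set()
    for idxs in positions.values():
        # Every pair of positions sharing the same n-gram gets a dot.
        for x in idxs:
            for y in idxs:
                dots.add((x, y))
    return dots


# Hypothetical noisy "parallel" snippets sharing a name and a phone number.
source_text = "Mr Smith called 848-3000"
target_text = "M. Smith a appelé le 848-3000"
dots = dot_plot(source_text, target_text)

x = source_text.index("Smit")
y = len(source_text) + target_text.index("Smit")
print((x, y) in dots)  # a cognate dot in the source/translation quadrant
```

Every point (x, x) is in the set, which is the main diagonal; dots such as the one for "Smit" fall in the off-diagonal quadrant and trace the short diagonals used to read off offsets.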

Approaches to Text Alignment

Length-Based Methods
Offset Alignment by Signal Processing Techniques
--> Lexical Methods

Lexical methods

Align beads of sentences using lexical information
Kay and Röscheisen (1993)
Idea:
  Use word alignment to help determine sentence alignment
  Then use sentence alignment to refine word alignment, …
Method:
1. Begin with start and end of text as anchors
2. Form an envelope of all possible alignments (no crossing of anchors) where:
   Possible alignments must be within a certain distance of the anchors
   The allowed distance increases as we get further away from the anchors
3. Choose pairs of words that co-occur in these potential alignments
4. Pick the best sentences involved in step 3 (having the most lexical correspondences) and use them as new anchors
5. Repeat steps 2-4

Example

[Figure, shown over five slides (Example, then Example (con’t) x 4): grid of sentences 1-16 of language 1 (rows) against sentences 1-19 of language 2 (columns); ● marks a candidate sentence alignment inside the envelope. From one slide to the next, the number of candidates per row shrinks as new anchors are chosen.]

4- Word Alignment

Usually done in two steps:
1. Do sentence/text alignment
2. Select words from aligned pairs and use frequency or chi-square to see if they co-occur more frequently than expected by chance

English: In the beginning God created the heavens and the earth.
Vietnamese: Ban dâu Ðúc Chúa Tròi dung nên tròi dât.

English: God called the expanse heaven.
Vietnamese: Ðúc Chúa Tròi dat tên khoang không la tròi.

English: … you are this day like the stars of heaven in number.
Vietnamese: … các nguoi dông nhu sao trên tròi.

Can also use an existing bilingual dictionary to start the word-alignment
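Step 2 can be sketched with a chi-square score computed from a 2x2 contingency table over aligned sentence pairs. The tiny English-French corpus below is invented (the three verses above are too few to give a meaningful table), and `chi_square` is a hypothetical helper name.

```python
def chi_square(pairs, e_word, f_word):
    """Chi-square association between e_word and f_word over aligned pairs."""
    a = b = c = d = 0
    for e_sent, f_sent in pairs:
        has_e, has_f = e_word in e_sent, f_word in f_sent
        if has_e and has_f:
            a += 1  # both words present in the aligned pair
        elif has_e:
            b += 1  # only the English word
        elif has_f:
            c += 1  # only the foreign word
        else:
            d += 1  # neither
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


# Hypothetical aligned sentence pairs, each sentence given as a set of words.
pairs = [
    ({"the", "dog"}, {"le", "chien"}),
    ({"the", "cat"}, {"le", "chat"}),
    ({"a", "dog"}, {"un", "chien"}),
    ({"a", "cat"}, {"un", "chat"}),
]
print(chi_square(pairs, "dog", "chien"))
print(chi_square(pairs, "dog", "le"))
```

In this toy corpus "dog" and "chien" always appear together and get a high score, while "dog" and "le" co-occur no more often than chance predicts and score zero. A word that occurs in every sentence (as "tròi" does in the three verses) makes the denominator zero, which the sketch maps to 0.0.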
