Statistical Machine Translation
Changbum Park
Seminar, Programming Research Lab., SNU
Oct. 1, 2008
ROPAS Seminar / Oct 1, 2008 / 45 · Changbum Park · Statistical Machine Translation
Contents
• Overview
• Rule-based vs. Statistical Machine Translation
• SMT System
• Modeling
• Parameter Estimation
• Decoding
• Evaluation of MT
• Syntax-based SMT
• Conclusion
2
Contents
• Overview
• Rule-based vs. Statistical Machine Translation
• SMT System
• Modeling
• Parameter Estimation
• Decoding
• Evaluation of MT
• Syntax-based SMT
• Conclusion
3
Machine Translation
• “Translation by computer”
• Two kinds of approaches
• Rule-Based Machine Translation (RBMT)
• Analysis, structure transfer and generation
• SMT (Statistical Machine Translation)
• Direct translation
4
Rule-based MT vs. Statistical MT
5
[Figure: the Vauquois triangle. Paths from Source Text to Target Text, shallowest to deepest: direct translation at the word level; syntactic analysis to a Syntactic Structure, syntactic transfer, then generation; semantic analysis to a Semantic Structure, semantic transfer, then generation; and, at the top, semantic decomposition into an Interlingua.]
SMT: “Translation without understanding”
Rule-based: Syntactic transfer
• 3 steps: parse the sentence ⇒ rearrange the constituents ⇒ translate the words
• Advantages
  - Deals with the word-order problem
• Disadvantages
  - Transfer rules must be constructed for each language pair
  - Syntactic mismatches can occur
6
Rule-based: Interlingua
• Assign a logical form to sentences
  - John must not go = OBLIGATORY(NOT(GO(JOHN)))
  - John may not go = NOT(PERMITTED(GO(JOHN)))
  - Use the logical form to generate a sentence in another language
• Advantage
  - A single logical form means we can translate between all languages, writing a parser/generator for each language only once
• Disadvantages
  - Difficult to define a single logical form; English words in all capital letters probably won't cut it
7
Statistical MT Concept
8
(Reference needed)
Rule-based MT vs. Statistical MT
9
                    RBMT                                SMT
Approach            Analytic                            Empirical
Based on            Transfer rules                      Statistical evidence
Analysis level      Various (morpheme ~ interlingua)    Generally, almost none
Translation speed   Fast                                (Relatively) slow
Required knowledge  Linguistic knowledge, dictionary,   Parallel texts
                    (ontology), (conceptual and         (morphology for spacing)
                    cultural differences)
Adaptability        Low                                 High
Contents
• Overview
• Rule-based vs. Statistical Machine Translation
• SMT System
• Modeling
• Parameter Estimation
• Decoding
• Evaluation of MT
• Syntax-based SMT
• Conclusion
10
SMT: Formal Description
• Source sentence: $f_1 f_2 \ldots f_J$, written $f_1^J \in V_F^J$
• Target sentence: $e_1 e_2 \ldots e_I$, written $e_1^I \in V_E^I$
• Problem: given the input sequence $f$, find a sequence $e$ that is translationally equivalent, in other words, find
$$\hat{e} = \arg\max_{e \in E^*} p(e \mid f)$$

11
SMT: Solution Approaches
• Problem: $\hat{e} = \arg\max_{e \in E^*} p(e \mid f)$
• Approach 1: direct relative-frequency estimate
$$p(e \mid f) = \frac{\text{freq}(e, f)}{\text{freq}(f)} \quad \Longrightarrow \quad \text{data sparseness}$$
• Approach 2: word-by-word translation
$$\hat{e} = \prod_{f_j} \max_{e_i} p(e_i \mid f_j) \quad \Longrightarrow \quad \text{might not result in a readable sentence}$$
• Proper solution: use Bayes' rule
$$\hat{e} = \arg\max_{e \in E^*} p(e \mid f) = \arg\max_{e \in E^*} p(f \mid e)\, p(e)$$

12
Noisy Channel Model
• Language model
  - Role: making fluent sentences
  - A model of the target language only
• Translation model
  - Role: making a correct translation
  - A model of both languages
• Decoder
  - Role: find the sentence with the best score
$$\hat{e} = \arg\max_{e \in E^*} p(e \mid f) = \arg\max_{e \in E^*} p(f \mid e)\, p(e)$$

[Diagram: source model $p(e)$ → noisy channel $p(f \mid e)$ → observed $f$ → decoder → best $e$; the decoder combines the translation model and the language model to turn the input into the output.]

13
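The decision rule above can be sketched in a few lines: given a small candidate set with hand-made probabilities (all numbers below are invented for illustration, not from a real model), the decoder is just an argmax over $p(f \mid e)\,p(e)$.

```python
# Toy sketch of the noisy-channel decision rule: score each candidate
# translation e by p(f|e) * p(e) and keep the argmax. The candidate
# list and all probabilities are made-up illustrative numbers.

candidates = {
    # e                      (p(f|e),  p(e))
    "Jon appeared on TV.":   (0.20,   0.30),
    "Jon appeared in TV.":   (0.20,   0.01),
    "Jon is happy today.":   (0.001,  0.40),
}

def best_translation(cands):
    # arg max_e p(f|e) * p(e)
    return max(cands, key=lambda e: cands[e][0] * cands[e][1])

print(best_translation(candidates))  # -> "Jon appeared on TV."
```

The fluent-but-unfaithful candidate loses on $p(f \mid e)$, and the faithful-but-disfluent one loses on $p(e)$; only a candidate that does well on both wins.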
Translation Model and Language Model
14

"On voit Jon à la télévision"

Sentence                good English? p(e)    good match to French? p(f|e)
Jon appeared in TV.                           ✓
Appeared on Jon TV.
In Jon appeared TV.                           ✓
Jon is happy today.     ✓
Jon appeared on TV.     ✓                     ✓
TV appeared on Jon.     ✓
TV in Jon appeared.
Jon was not happy.      ✓
Language Modeling
• Determine the probability of an English sequence $e$ of length $I$ (chain rule, exact)
$$p(e_1^I) = \prod_{i=1}^{I} p(e_i \mid e_1^{i-1})$$
• $p(e)$ is hard to estimate directly unless $I$ is very small
• n-gram language model: condition on only the previous $n-1$ words
$$p(e_1^I) = \prod_{i=1}^{I} p(e_i \mid e_1^{i-1}) \approx \prod_{i=1}^{I} p(e_i \mid e_{i-n+1}^{i-1})$$

15
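A minimal bigram model ($n = 2$) with maximum-likelihood estimates illustrates the approximation. The two-sentence corpus is invented, and there is no smoothing, so unseen bigrams get probability zero.

```python
from collections import Counter

# Minimal bigram language model with maximum-likelihood estimates,
# illustrating p(e_1^I) ~= prod_i p(e_i | e_{i-1}). Toy corpus; no
# smoothing, so any unseen bigram makes the sentence probability 0.

corpus = [["<s>", "jon", "appeared", "on", "tv", "</s>"],
          ["<s>", "jon", "is", "happy", "</s>"]]

bigrams = Counter()
unigrams = Counter()
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigrams[(prev, cur)] += 1
        unigrams[prev] += 1

def p_sentence(words):
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0
        p *= bigrams[(prev, cur)] / unigrams[prev]  # MLE p(cur|prev)
    return p

print(p_sentence(["<s>", "jon", "appeared", "on", "tv", "</s>"]))  # -> 0.5
```

Here the only uncertain step is "jon" → "appeared" vs. "jon" → "is" (each 1/2), so the first training sentence scores 0.5.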
Translation Modeling
• Decomposition without loss of generality, introducing hidden alignments $a_1^J$
$$p(f_1^J \mid e_1^I) = \sum_{a_1^J} p(f_1^J, a_1^J \mid e_1^I)$$
$$\begin{aligned}
p(f_1^J, a_1^J \mid e_1^I) &= p(J \mid e_1^I)\, p(f_1^J, a_1^J \mid J, e_1^I) \\
&= p(J \mid e_1^I)\, p(a_1^J \mid J, e_1^I)\, p(f_1^J \mid a_1^J, J, e_1^I) \\
&= \underbrace{p(J \mid e_1^I)}_{\text{length model}}\; \underbrace{\prod_{j=1}^{J} p(a_j \mid a_1^{j-1}, J, e_1^I)}_{\text{alignment model}}\; \underbrace{\prod_{j=1}^{J} p(f_j \mid f_1^{j-1}, a_1^J, J, e_1^I)}_{\text{lexicon model}}
\end{aligned}$$

16
Alignment Process
1. Choose a length $J$ for the French string $f$
2. For $j = 1$ to $J$:
   I. Decide which position in $e$ is connected to $f_j$
   II. Decide what the identity of $f_j$ is

$$p(f_1^J, a_1^J \mid e_1^I) = \underbrace{p(J \mid e_1^I)}_{\text{length model}}\; \underbrace{\prod_{j=1}^{J} p(a_j \mid a_1^{j-1}, J, e_1^I)}_{\text{alignment model}}\; \underbrace{\prod_{j=1}^{J} p(f_j \mid f_1^{j-1}, a_1^J, J, e_1^I)}_{\text{lexicon model}}$$

17
IBM Model 1
• An alignment in which each French word is generated independently from one English word (or NULL)
• n:1 alignment
• # of possible alignments: $(I+1)^J$ (with $I$ English words and $J$ French words)

  [Diagram: English words e1 e2 e3 e4 plus NULL, each French word f1 ... f5 linked to one of them]

• The exact equation is too complex; Model 1 approximates:
  - Length model: a constant $\varepsilon$
  - Alignment model: uniform over the $I+1$ positions, i.e. $1/(I+1)$
  - Lexicon model: $p_t(f_j \mid e_{a_j})$
$$p(f_1^J \mid e_1^I) = \frac{\varepsilon}{(I+1)^J} \prod_{j=1}^{J} p_t(f_j \mid e_{a_j})$$

18
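Because the alignment positions are independent, the sum over all $(I+1)^J$ alignments factorizes as $p(f_1^J \mid e_1^I) = \frac{\varepsilon}{(I+1)^J} \prod_{j} \sum_{i=0}^{I} p_t(f_j \mid e_i)$. A sketch with a hand-made toy translation table (all $t$ values invented):

```python
# IBM Model 1 likelihood with the alignment marginalized out:
#   p(f|e) = eps / (I+1)^J * prod_j sum_{i=0..I} t(f_j | e_i)
# The translation table t is hand-made for illustration.

EPS = 1.0  # length model treated as a constant

t = {  # t(f | e), toy values
    ("la", "the"): 0.8, ("la", "NULL"): 0.1,
    ("maison", "house"): 0.9, ("maison", "the"): 0.05,
}

def model1_likelihood(f_words, e_words):
    e_words = ["NULL"] + e_words          # position 0 is the empty word
    I, J = len(e_words) - 1, len(f_words)
    p = EPS / (I + 1) ** J
    for f in f_words:                      # independent French positions
        p *= sum(t.get((f, e), 0.0) for e in e_words)
    return p

print(model1_likelihood(["la", "maison"], ["the", "house"]))
```

With $I = 2$, $J = 2$ this evaluates to $\frac{1}{9} \cdot 0.9 \cdot 0.95 = 0.095$.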
IBM Model 4

• More advanced modeling, including fertility and distortion (word reordering)
• Example:
  - Maria did not slap the green witch
  - (fertility) Maria not slap slap slap the green witch
  - (NULL insertion) Maria not slap slap slap NULL the green witch
  - (word translation) Maria no dio una bofetada a la verde bruja
  - (distortion / reordering) Maria no dio una bofetada a la bruja verde

19
IBM Model 4
20
$$p(f_1^J \mid e_1^I) = \sum_{T,\pi}\; \underbrace{\prod_{i=0}^{I} p(\phi_i \mid e_i)}_{\text{fertility}} \times \underbrace{\prod_{i=0}^{I} \prod_{k=1}^{\phi_i} p(\tau_{i,k} \mid e_i)}_{\text{word translation}} \times \underbrace{\prod_{i=0}^{I} p(\pi_{i,1} - \odot_{i-1} \mid C_E(e_{i-1}), C_F(\tau_{i,1}))}_{\text{distortion for } \pi_{i,1}} \times \underbrace{\prod_{i=0}^{I} \prod_{k=2}^{\phi_i} p(\pi_{i,k} - \pi_{i,k-1} \mid C_F(\tau_{i,k}))}_{\text{distortion for } \pi_{i,k},\, k>1}$$

where
• $\tau_{i,k}$: the $k$th word generated as a translation of $e_i$
• $\pi_{i,k}$: the position of $\tau_{i,k}$ in the French string
• $\odot_{i-1}$: the average location of all translations of $e_{i-1}$
• $C_E : E \to [1, K]$, $C_F : F \to [1, K]$: partitions of the vocabularies into suitable small sets of $K$ classes
Discriminative Model: Log-linear Model
• Generative models need very strong independence assumptions for tractability. A discriminative model can compensate by combining many arbitrary features.
• Log-linear model
$$p(e \mid f) = \frac{\exp \sum_{k=1}^{K} \lambda_k h_k(e, f)}{\sum_{e'} \exp \sum_{k=1}^{K} \lambda_k h_k(e', f)}$$
where $\lambda_1^K$ are feature weights and $h_k$ are features

21
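A sketch of the model over a small enumerated candidate set. The two features (meant to suggest log translation-model and log language-model scores) and their values are invented:

```python
import math

# Log-linear model over a tiny enumerated candidate set:
#   p(e|f) = exp(sum_k w_k h_k(e,f)) / sum_e' exp(sum_k w_k h_k(e',f))
# Features and weights are made up for illustration.

def loglinear_probs(candidates, features, weights):
    scores = {e: math.exp(sum(w * h for w, h in zip(weights, features[e])))
              for e in candidates}
    z = sum(scores.values())                # partition function
    return {e: s / z for e, s in scores.items()}

cands = ["good translation", "bad translation"]
feats = {  # h_1 ~ log TM score, h_2 ~ log LM score (toy values)
    "good translation": [-1.0, -0.5],
    "bad translation":  [-3.0, -4.0],
}
probs = loglinear_probs(cands, feats, weights=[1.0, 1.0])
print(max(probs, key=probs.get))  # -> "good translation"
```

With both weights set to 1 and the features taken as log probabilities, this reduces exactly to the noisy-channel model; the point of the log-linear form is that we can add further features and reweight them.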
Contents
• Overview
• Rule-based vs. Statistical Machine Translation
• SMT System
• Modeling
• Parameter Estimation
• Decoding
• Evaluation of MT
• Syntax-based SMT
• Conclusion
22
Parameter Estimation
• Maximum Likelihood Estimation (MLE) of the complete set of parameters $\theta$
$$\hat{\theta} = \arg\max_{\theta} P_{\theta}(C)$$
where the training data $C \subseteq \{E^* \times F^*\}$ is a parallel corpus
• Parameter estimation process
  1. Estimate all underlying generative models
  2. Estimate the parameters of the log-linear model

23
Learning Word Translation Probabilities
• In a generative model, each probability is tied to a single decision taken by the model
  - p(blau | blue) = #(a(blau, blue)) / #(blue)
• Approach 1: automatic generation of alignments
  - HMM (generative), Conditional Random Field (discriminative)
  - Supervised learning: build a training set with human annotators
• Approach 2: parameter estimation by the EM algorithm
24
Parameter Estimation by EM
• Expectation-Maximization (EM) algorithm
  1. Assume a probability distribution over the hidden events (alignments)
  2. Take counts of events based on this distribution
  3. Use the counts to estimate new parameters
  4. Use the new parameters to re-weight the examples
  5. Rinse and repeat
• Expected count of an alignment link, e.g.:
$$E(a(\text{blau}, \text{blue})) = \frac{P_{\theta_0}(a(\text{blau}, \text{blue}), f_1^J \mid e_1^I)}{P_{\theta_0}(f_1^J \mid e_1^I)}$$

25
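The recipe above, specialized to IBM Model 1 lexicon probabilities, fits in a short loop. The two-sentence das/haus/buch corpus is a classic toy example; this is a sketch of the algorithm, not a production aligner such as GIZA++.

```python
from collections import defaultdict

# EM for IBM Model 1 word-translation probabilities t(f|e) on a toy
# two-sentence parallel corpus. Each iteration takes expected counts of
# alignment links under the current t (E-step), then renormalizes the
# counts into new probabilities (M-step).

corpus = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"])]

f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initialization

for _ in range(20):
    count = defaultdict(float)   # expected count of (f, e) links
    total = defaultdict(float)   # expected count of e
    for fs, es in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)    # normalizer for this f
            for e in es:
                c = t[(f, e)] / z             # posterior link probability
                count[(f, e)] += c
                total[e] += c
    for (f, e), c in count.items():           # M-step: renormalize
        t[(f, e)] = c / total[e]

print(round(t[("haus", "house")], 3))  # approaches 1 as EM converges
```

Although "das" co-occurs with everything, the unambiguous pairs (haus/house, buch/book) explain their sentences better and better each round, which in turn pushes "das" toward "the".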
Learning Phrase Translation Probabilities
• Example: P(to the conference | zur Konferenz), P(into the meeting | zur Konferenz)
• Solution
  - Generate a word alignment
  - Count all phrases that are consistent with the alignment
• Consistency
  - No word inside the phrase pair is aligned to any word outside it
  - The phrase pair contains the transitive closure of some set of nodes in the bipartite alignment graph
26
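The first consistency condition can be checked directly: a candidate phrase pair is rejected as soon as any alignment link crosses its boundary, and accepted only if it contains at least one internal link. A sketch (spans and links are illustrative):

```python
# Consistency check for phrase extraction: a phrase pair (e-span,
# f-span) is consistent with a word alignment iff no link connects a
# word inside one span to a word outside the other, and at least one
# link falls inside the pair.

def consistent(alignment, e_span, f_span):
    """alignment: set of (i, j) links; spans are inclusive (start, end)."""
    (e1, e2), (f1, f2) = e_span, f_span
    has_link = False
    for i, j in alignment:
        e_in, f_in = e1 <= i <= e2, f1 <= j <= f2
        if e_in != f_in:          # link crosses the phrase boundary
            return False
        if e_in and f_in:
            has_link = True
    return has_link

# e: "to the conference" (0..2), f: "zur Konferenz" (0..1)
links = {(0, 0), (1, 0), (2, 1)}
print(consistent(links, (0, 2), (0, 1)))  # True: all links internal
print(consistent(links, (1, 2), (0, 1)))  # False: link (0,0) crosses
```

Real extraction enumerates all spans up to a length limit and keeps every pair that passes this test, then estimates phrase probabilities by relative frequency.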
Phrases Consistent with Word Alignment
27
[Figure: word alignment matrix between the English sentence "I did not unfortunately receive an answer to this question" and the German sentence "Auf diese Frage habe ich leider keine Antwort bekommen", with the phrase pairs consistent with the alignment marked.]
Parameter Estimation in Log-linear Model
• MERT (Minimum Error-Rate Training)
28
$$\hat{\lambda}_1^K = \arg\min_{\lambda_1^K} \sum_{(e,f) \in C} E\!\left(\arg\max_{e'} P_{\lambda_1^K}(e' \mid f),\; e\right)$$

where $E(\cdot,\cdot)$ is an error function comparing the system output with the reference $e$
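The idea can be caricatured with a brute-force search over a tiny weight grid: pick the weight vector whose 1-best outputs make the fewest errors on a development set. Real MERT uses Och's efficient line search over n-best lists; the feature values and error labels below are invented.

```python
# Crude sketch of the objective behind MERT: search weight vectors and
# keep the one whose 1-best candidates minimize the error count on a
# dev set. Brute-force grid for illustration only; feature values and
# error labels (1 = wrong translation) are made up.

# For each source sentence: candidates as ((h1, h2), is_error)
dev = [
    [((2.0, 0.5), 0), ((1.0, 2.0), 1)],   # sentence 1
    [((0.5, 1.5), 1), ((1.5, 1.0), 0)],   # sentence 2
]

def error(weights):
    errs = 0
    for cands in dev:
        # 1-best candidate under the current weights
        best = max(cands,
                   key=lambda c: sum(w * h for w, h in zip(weights, c[0])))
        errs += best[1]
    return errs

grid = [(w1, w2) for w1 in (0.0, 0.5, 1.0) for w2 in (0.0, 0.5, 1.0)]
best_w = min(grid, key=error)
print(best_w, error(best_w))
```

Note that the objective is piecewise constant in the weights (the 1-best output only changes at discrete boundaries), which is exactly why MERT needs a specialized search rather than gradient descent.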
Contents
• Overview
• Rule-based vs. Statistical Machine Translation
• SMT System
• Modeling
• Parameter Estimation
• Decoding
• Evaluation of MT
• Syntax-based SMT
• Conclusion
29
Decoding
30
[Diagram: SMT system pipeline. The parallel (source + target) corpus feeds word alignment and MLE to produce the Translation Model; the target-language corpus feeds language modeling to produce the Language Model; the Decoder combines both models to turn an input source sentence into an output target sentence.]
Decoding
• How do we translate new sentences?
• A difficult optimization problem: the search ranges over $\{E^* \times D^* \times F^*\}$
$$\hat{e} = \arg\max_{e:\, Y(e,d)} P(e, d \mid f)$$
• The decoder uses the parameters learned on a parallel corpus
  - Translation probabilities
  - Fertilities
  - Distortions
• In combination with a language model, the decoder generates the most likely translation
• Similar to the traveling salesman problem
  - Standard search algorithms can be used to explore the search space (A*, greedy search, etc.)

31
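As a toy illustration, a greedy monotone decoder that scores each choice by translation probability times a bigram language-model score. Real decoders also search over reorderings (hence the TSP analogy) and keep many hypotheses; all tables below are invented.

```python
# Greedy monotone decoder sketch for a word-based model: translate left
# to right, at each step picking the target word that maximizes
# t(e|f) * p_LM(e | previous word). Toy tables, no reordering.

tm = {"la": {"the": 0.7}, "maison": {"house": 0.8, "home": 0.2}}
lm = {("<s>", "the"): 0.4, ("the", "house"): 0.5, ("the", "home"): 0.1}

def greedy_decode(f_words):
    out, prev = [], "<s>"
    for f in f_words:
        options = tm.get(f, {f: 1.0})     # pass unknown words through
        best = max(options,
                   key=lambda e: options[e] * lm.get((prev, e), 1e-4))
        out.append(best)
        prev = best
    return " ".join(out)

print(greedy_decode(["la", "maison"]))  # -> "the house"
```

Here "home" has lower translation probability and a worse bigram score after "the", so "house" wins; a beam or A* decoder would keep both hypotheses alive instead of committing greedily.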
Decoder Example: FST Decoding
32
Contents
• Overview
• Rule-based vs. Statistical Machine Translation
• SMT System
• Modeling
• Parameter Estimation
• Decoding
• Evaluation of MT
• Syntax-based SMT
• Conclusion
33
Evaluation of MT
• Ideal criterion: user satisfaction
• Problem
  - Expensive, slow, inconsistent, subjective
  - Problematic to use in system development
• Goal
  - Automatic and objective evaluation of machine translation quality
• Idea
  - Compute the similarity of MT output to good human translations (reference translations)
  - Good MT: similar to good human translations
  - Bad MT: very different from human translations
• Use a set of bilingual test sentences
34
Evaluation Example: BLEU score
• BLEU (Bilingual Evaluation Understudy)
  - Currently the most widely used automatic metric
• Modified n-gram precision
  - n-gram precision: fraction of the candidate's n-grams occurring in the references
  - Modified n-gram precision: the same part of a reference cannot be 'used' twice (counts are clipped)
• Brevity penalty
  - Penalizes too-short translations
  - BP = exp(min(1 - r/c, 0))
  - c: length of the MT output, r: length of the reference translation
• BLEU-4 score
$$\text{BLEU} = \text{BP} \cdot \exp\!\left(\tfrac{1}{4}\left(\log p_{\text{unigram}} + \log p_{\text{bigram}} + \log p_{\text{trigram}} + \log p_{\text{fourgram}}\right)\right)$$

35
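A sentence-level sketch of the formula. Real BLEU is computed over a whole corpus and needs smoothing at the sentence level; this toy version simply returns 0 when any n-gram precision is 0.

```python
import math
from collections import Counter

# Sentence-level BLEU-4 sketch: clipped (modified) n-gram precisions
# combined with the brevity penalty BP = exp(min(1 - r/c, 0)).

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(candidate, reference, max_n=4):
    c, r = len(candidate), len(reference)
    bp = math.exp(min(1 - r / c, 0))          # brevity penalty
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # clipping: an n-gram cannot be credited more times than it
        # appears in the reference
        clipped = sum(min(cnt, ref[g]) for g, cnt in cand.items())
        total = max(sum(cand.values()), 1)
        if clipped == 0:
            return 0.0                        # unsmoothed: zero precision
        log_prec += 0.25 * math.log(clipped / total)
    return bp * math.exp(log_prec)

cand = "the cat sat on the mat".split()
print(bleu(cand, cand))                        # identical -> 1.0
print(bleu("the the the the".split(), cand))   # clipping kills this -> 0.0
```

The second call shows why clipping matters: without it, "the the the the" would get unigram precision 1.0 against a reference containing "the".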
Contents
• Overview
• Rule-based vs. Statistical Machine Translation
• SMT System
• Modeling
• Parameter Estimation
• Decoding
• Evaluation of MT
• Syntax-based SMT
• Conclusion
36
Using Syntax Information in SMT
37
[Figure: the Vauquois triangle again, with the syntactic-transfer path emphasized: Source Text → syntactic analysis → Syntactic Structure → syntactic transfer → Syntactic Structure → Target Text; the direct, semantic-transfer, and interlingua paths are shown for comparison.]
Using Syntax Information in SMT
• Modified source-channel model [K. Yamada and K. Knight, 2001]
• Input
  - Sentences → parse trees: input sentences are preprocessed by a syntactic parser
• Grammar
  - SCFG (Synchronous Context-Free Grammar)
  - Tree-adjoining grammar (a subset of the context-sensitive languages)
  - Etc.
• Channel operations
  - Reordering
  - Insertion
  - Translation
38
SCFG Example
39
[Figure: Fig. 7 from Lopez (2008), "Visualization of the CFG derivation and SCFG derivations." Derivation happens in exactly the same way in CFG and SCFG; the difference is that each SCFG production specifies two outputs rather than one, separated by a slash, with boxed co-indexes aligning nonterminals across the outputs. The derived trees are therefore isomorphic, differing only in the order of the aligned nonterminal nodes, and a word alignment can be inferred between words generated by the same nonterminal. If the source-string dimension is ignored, the SCFG derivation is exactly the corresponding CFG derivation.]
Syntax-based SMT Process
1. Parse input sentence to syntax tree
2. Reordering
40
[Table (garbled in extraction): the reordering step's r-table. For each parse-tree node, it lists every permutation of the node's children (e.g. all 3! = 6 orderings of a three-child sequence) together with an estimated probability for that reordering.]
Syntax-based SMT Process
3. Insertion
• Two parameter tables
  - Insertion position
  - Word identity
41
[Tables (garbled in extraction): the insertion-position table gives, for each tree node, the probability of inserting a function word to its left, to its right, or not at all; the word-identity table gives the probability of each word that may be inserted.]
SCFG Decoding
42
[Figure: Fig. 16 from Lopez (2008), "An illustration of SCFG decoding, which is equivalent to parsing the source language. (1) Scan each source word and associate it with a span. (2) Apply SCFG rules that match the spans. (3) Recursively infer larger spans from smaller spans. (4) We can also infer target-language words with no matching span in the source language, if the grammar contains SCFG rules producing them. (5) Read off the tree in target-language order."]
Current Direction and Future Research
• Common elements of the current best systems
  - Phrase-based models
  - Log-linear models with a small set of generative features and discriminative training
• Hurdles
  - Corpora are biased: mostly news or government texts
  - Statistical models are sensitive to domain differences and noise
  - Hard to translate between morphologically very different languages
    - English → German is relatively easy, but English → Korean is hard
• Future research
  - Incorporation of linguistic knowledge (hybrid approaches)
43
References
• Adam Lopez. "Statistical Machine Translation." ACM Computing Surveys 40(3), Article 8, August 2008.
• Kenji Yamada and Kevin Knight. "A Syntax-Based Statistical Translation Model." Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, 2001.
44
Q&A
45