Statistical Machine Translation
Changbum Park
Seminar, Programming Research Lab., SNU
Oct. 1, 2008
ROPAS Seminar / Oct 1, 2008 / 45 · Changbum Park · Statistical Machine Translation
Contents
• Overview
• Rule-based vs. Statistical Machine Translation
• SMT System
• Modeling
• Parameter Estimation
• Decoding
• Evaluation of MT
• Syntax-based SMT
• Conclusion
2
Contents
• Overview
• Rule-based vs. Statistical Machine Translation
• SMT System
• Modeling
• Parameter Estimation
• Decoding
• Evaluation of MT
• Syntax-based SMT
• Conclusion
3
Machine Translation
• “Translation by computer”
• Two kinds of approaches
• Rule-Based Machine Translation (RBMT)
• Analysis, structure transfer and generation
• SMT (Statistical Machine Translation)
• Direct translation
4
Rule-based MT vs. Statistical MT
5
[Figure: the Vauquois triangle. Paths from Source Text to Target Text, shallowest to deepest: direct translation at the word level; syntactic analysis to a Syntactic Structure, syntactic transfer, then generation; semantic analysis to a Semantic Structure, semantic transfer, then generation; and, at the top, semantic decomposition into an Interlingua.]
SMT: “Translation without understanding”
Rule-based: Syntactic transfer
• 3 steps: parse the sentence ⇒ rearrange the constituents ⇒ translate the words
• Advantages
  - Deals with the word-order problem
• Disadvantages
  - Transfer rules must be constructed for each language pair
  - Syntactic mismatches can occur
6
Rule-based: Interlingua
• Assign a logical form to sentences
  - John must not go = OBLIGATORY(NOT(GO(JOHN)))
  - John may not go = NOT(PERMITTED(GO(JOHN)))
  - Use the logical form to generate a sentence in another language
• Advantage
  - A single logical form means we can translate between all languages, writing a parser/generator for each language only once
• Disadvantages
  - Difficult to define a single logical form; English words in all capital letters probably won't cut it
7
Statistical MT Concept
8
(Reference needed)
Rule-based MT vs. Statistical MT
9
                    RBMT                                SMT
Approach            Analytic                            Empirical
Based on            Transfer rules                      Statistical evidence
Analysis level      Various (morpheme ~ interlingua)    Generally, almost none
Translation speed   Fast                                (Relatively) slow
Required knowledge  Linguistic knowledge, dictionary,   Parallel texts
                    (ontology), (conceptual and         (morphology for spacing)
                    cultural differences)
Adaptability        Low                                 High
Contents
• Overview
• Rule-based vs. Statistical Machine Translation
• SMT System
• Modeling
• Parameter Estimation
• Decoding
• Evaluation of MT
• Syntax-based SMT
• Conclusion
10
SMT: Formal Description
• Source sentence: $f_1 f_2 \ldots f_J$, written $f_1^J \in V_F^J$
• Target sentence: $e_1 e_2 \ldots e_I$, written $e_1^I \in V_E^I$
• Problem: given the input sequence $f$, find a sequence $e$ that is translationally equivalent, in other words, find
$$\hat{e} = \arg\max_{e \in E^*} p(e \mid f)$$

11
SMT: Solution Approaches
• Problem: $\hat{e} = \arg\max_{e \in E^*} p(e \mid f)$
• Approach 1: direct relative-frequency estimate
$$p(e \mid f) = \frac{\text{freq}(e, f)}{\text{freq}(f)} \quad \Longrightarrow \quad \text{data sparseness}$$
• Approach 2: word-by-word translation
$$\hat{e} = \prod_{f_j} \max_{e_i} p(e_i \mid f_j) \quad \Longrightarrow \quad \text{might not result in a readable sentence}$$
• Proper solution: use Bayes' rule
$$\hat{e} = \arg\max_{e \in E^*} p(e \mid f) = \arg\max_{e \in E^*} p(f \mid e)\, p(e)$$

12
Noisy Channel Model
• Language model
  - Role: making fluent sentences
  - A model of the target language only
• Translation model
  - Role: making a correct translation
  - A model of both languages
• Decoder
  - Role: find the sentence with the best score
$$\hat{e} = \arg\max_{e \in E^*} p(e \mid f) = \arg\max_{e \in E^*} p(f \mid e)\, p(e)$$

[Diagram: source model $p(e)$ → noisy channel $p(f \mid e)$ → observed $f$ → decoder → best $e$; the decoder combines the translation model and the language model to turn the input into the output.]

13
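The decision rule above can be sketched in a few lines: given a small candidate set with hand-made probabilities (all numbers below are invented for illustration, not from a real model), the decoder is just an argmax over $p(f \mid e)\,p(e)$.

```python
# Toy sketch of the noisy-channel decision rule: score each candidate
# translation e by p(f|e) * p(e) and keep the argmax. The candidate
# list and all probabilities are made-up illustrative numbers.

candidates = {
    # e                      (p(f|e),  p(e))
    "Jon appeared on TV.":   (0.20,   0.30),
    "Jon appeared in TV.":   (0.20,   0.01),
    "Jon is happy today.":   (0.001,  0.40),
}

def best_translation(cands):
    # arg max_e p(f|e) * p(e)
    return max(cands, key=lambda e: cands[e][0] * cands[e][1])

print(best_translation(candidates))  # -> "Jon appeared on TV."
```

The fluent-but-unfaithful candidate loses on $p(f \mid e)$, and the faithful-but-disfluent one loses on $p(e)$; only a candidate that does well on both wins.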
Translation Model and Language Model
14

"On voit Jon à la télévision"

Sentence                good English? p(e)    good match to French? p(f|e)
Jon appeared in TV.                           ✓
Appeared on Jon TV.
In Jon appeared TV.                           ✓
Jon is happy today.     ✓
Jon appeared on TV.     ✓                     ✓
TV appeared on Jon.     ✓
TV in Jon appeared.
Jon was not happy.      ✓
Language Modeling
• Determine the probability of an English sequence $e$ of length $I$ (chain rule, exact)
$$p(e_1^I) = \prod_{i=1}^{I} p(e_i \mid e_1^{i-1})$$
• $p(e)$ is hard to estimate directly unless $I$ is very small
• n-gram language model: condition on only the previous $n-1$ words
$$p(e_1^I) = \prod_{i=1}^{I} p(e_i \mid e_1^{i-1}) \approx \prod_{i=1}^{I} p(e_i \mid e_{i-n+1}^{i-1})$$

15
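A minimal bigram model ($n = 2$) with maximum-likelihood estimates illustrates the approximation. The two-sentence corpus is invented, and there is no smoothing, so unseen bigrams get probability zero.

```python
from collections import Counter

# Minimal bigram language model with maximum-likelihood estimates,
# illustrating p(e_1^I) ~= prod_i p(e_i | e_{i-1}). Toy corpus; no
# smoothing, so any unseen bigram makes the sentence probability 0.

corpus = [["<s>", "jon", "appeared", "on", "tv", "</s>"],
          ["<s>", "jon", "is", "happy", "</s>"]]

bigrams = Counter()
unigrams = Counter()
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigrams[(prev, cur)] += 1
        unigrams[prev] += 1

def p_sentence(words):
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0
        p *= bigrams[(prev, cur)] / unigrams[prev]  # MLE p(cur|prev)
    return p

print(p_sentence(["<s>", "jon", "appeared", "on", "tv", "</s>"]))  # -> 0.5
```

Here the only uncertain step is "jon" → "appeared" vs. "jon" → "is" (each 1/2), so the first training sentence scores 0.5.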
Translation Modeling
• Decomposition without loss of generality, introducing hidden alignments $a_1^J$
$$p(f_1^J \mid e_1^I) = \sum_{a_1^J} p(f_1^J, a_1^J \mid e_1^I)$$
$$\begin{aligned}
p(f_1^J, a_1^J \mid e_1^I) &= p(J \mid e_1^I)\, p(f_1^J, a_1^J \mid J, e_1^I) \\
&= p(J \mid e_1^I)\, p(a_1^J \mid J, e_1^I)\, p(f_1^J \mid a_1^J, J, e_1^I) \\
&= \underbrace{p(J \mid e_1^I)}_{\text{length model}}\; \underbrace{\prod_{j=1}^{J} p(a_j \mid a_1^{j-1}, J, e_1^I)}_{\text{alignment model}}\; \underbrace{\prod_{j=1}^{J} p(f_j \mid f_1^{j-1}, a_1^J, J, e_1^I)}_{\text{lexicon model}}
\end{aligned}$$

16
Alignment Process
1. Choose a length $J$ for the French string $f$
2. For $j = 1$ to $J$:
   I. Decide which position in $e$ is connected to $f_j$
   II. Decide what the identity of $f_j$ is

$$p(f_1^J, a_1^J \mid e_1^I) = \underbrace{p(J \mid e_1^I)}_{\text{length model}}\; \underbrace{\prod_{j=1}^{J} p(a_j \mid a_1^{j-1}, J, e_1^I)}_{\text{alignment model}}\; \underbrace{\prod_{j=1}^{J} p(f_j \mid f_1^{j-1}, a_1^J, J, e_1^I)}_{\text{lexicon model}}$$

17
IBM Model 1
• An alignment in which each French word is generated independently from one English word (or NULL)
• n:1 alignment
• # of possible alignments: $(I+1)^J$ (with $I$ English words and $J$ French words)

  [Diagram: English words e1 e2 e3 e4 plus NULL, each French word f1 ... f5 linked to one of them]

• The exact equation is too complex; Model 1 approximates:
  - Length model: a constant $\varepsilon$
  - Alignment model: uniform over the $I+1$ positions, i.e. $1/(I+1)$
  - Lexicon model: $p_t(f_j \mid e_{a_j})$
$$p(f_1^J \mid e_1^I) = \frac{\varepsilon}{(I+1)^J} \prod_{j=1}^{J} p_t(f_j \mid e_{a_j})$$

18
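Because the alignment positions are independent, the sum over all $(I+1)^J$ alignments factorizes as $p(f_1^J \mid e_1^I) = \frac{\varepsilon}{(I+1)^J} \prod_{j} \sum_{i=0}^{I} p_t(f_j \mid e_i)$. A sketch with a hand-made toy translation table (all $t$ values invented):

```python
# IBM Model 1 likelihood with the alignment marginalized out:
#   p(f|e) = eps / (I+1)^J * prod_j sum_{i=0..I} t(f_j | e_i)
# The translation table t is hand-made for illustration.

EPS = 1.0  # length model treated as a constant

t = {  # t(f | e), toy values
    ("la", "the"): 0.8, ("la", "NULL"): 0.1,
    ("maison", "house"): 0.9, ("maison", "the"): 0.05,
}

def model1_likelihood(f_words, e_words):
    e_words = ["NULL"] + e_words          # position 0 is the empty word
    I, J = len(e_words) - 1, len(f_words)
    p = EPS / (I + 1) ** J
    for f in f_words:                      # independent French positions
        p *= sum(t.get((f, e), 0.0) for e in e_words)
    return p

print(model1_likelihood(["la", "maison"], ["the", "house"]))
```

With $I = 2$, $J = 2$ this evaluates to $\frac{1}{9} \cdot 0.9 \cdot 0.95 = 0.095$.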
IBM Model 4

• More advanced modeling, including fertility and distortion (word reordering)
• Example:
  - Maria did not slap the green witch
  - (fertility) Maria not slap slap slap the green witch
  - (NULL insertion) Maria not slap slap slap NULL the green witch
  - (word translation) Maria no dio una bofetada a la verde bruja
  - (distortion / reordering) Maria no dio una bofetada a la bruja verde

19
IBM Model 4
20
$$p(f_1^J \mid e_1^I) = \sum_{T,\pi}\; \underbrace{\prod_{i=0}^{I} p(\phi_i \mid e_i)}_{\text{fertility}} \times \underbrace{\prod_{i=0}^{I} \prod_{k=1}^{\phi_i} p(\tau_{i,k} \mid e_i)}_{\text{word translation}} \times \underbrace{\prod_{i=0}^{I} p(\pi_{i,1} - \odot_{i-1} \mid C_E(e_{i-1}), C_F(\tau_{i,1}))}_{\text{distortion for } \pi_{i,1}} \times \underbrace{\prod_{i=0}^{I} \prod_{k=2}^{\phi_i} p(\pi_{i,k} - \pi_{i,k-1} \mid C_F(\tau_{i,k}))}_{\text{distortion for } \pi_{i,k},\, k>1}$$

where
• $\tau_{i,k}$: the $k$th word generated as a translation of $e_i$
• $\pi_{i,k}$: the position of $\tau_{i,k}$ in the French string
• $\odot_{i-1}$: the average location of all translations of $e_{i-1}$
• $C_E : E \to [1, K]$, $C_F : F \to [1, K]$: partitions of the vocabularies into suitable small sets of $K$ classes
Discriminative Model: Log-linear Model
• Generative models need very strong independence assumptions for tractability. A discriminative model can compensate by combining many arbitrary features.
• Log-linear model
$$p(e \mid f) = \frac{\exp \sum_{k=1}^{K} \lambda_k h_k(e, f)}{\sum_{e'} \exp \sum_{k=1}^{K} \lambda_k h_k(e', f)}$$
where $\lambda_1^K$ are feature weights and $h_k$ are features

21
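A sketch of the model over a small enumerated candidate set. The two features (meant to suggest log translation-model and log language-model scores) and their values are invented:

```python
import math

# Log-linear model over a tiny enumerated candidate set:
#   p(e|f) = exp(sum_k w_k h_k(e,f)) / sum_e' exp(sum_k w_k h_k(e',f))
# Features and weights are made up for illustration.

def loglinear_probs(candidates, features, weights):
    scores = {e: math.exp(sum(w * h for w, h in zip(weights, features[e])))
              for e in candidates}
    z = sum(scores.values())                # partition function
    return {e: s / z for e, s in scores.items()}

cands = ["good translation", "bad translation"]
feats = {  # h_1 ~ log TM score, h_2 ~ log LM score (toy values)
    "good translation": [-1.0, -0.5],
    "bad translation":  [-3.0, -4.0],
}
probs = loglinear_probs(cands, feats, weights=[1.0, 1.0])
print(max(probs, key=probs.get))  # -> "good translation"
```

With both weights set to 1 and the features taken as log probabilities, this reduces exactly to the noisy-channel model; the point of the log-linear form is that we can add further features and reweight them.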
Contents
• Overview
• Rule-based vs. Statistical Machine Translation
• SMT System
• Modeling
• Parameter Estimation
• Decoding
• Evaluation of MT
• Syntax-based SMT
• Conclusion
22
Parameter Estimation
• Maximum Likelihood Estimation (MLE) of the complete set of parameters $\theta$
$$\hat{\theta} = \arg\max_{\theta} P_{\theta}(C)$$
where the training data $C \subseteq \{E^* \times F^*\}$ is a parallel corpus
• Parameter estimation process
  1. Estimate all underlying generative models
  2. Estimate the parameters of the log-linear model

23
Learning Word Translation Probabilities
• In a generative model, each probability is tied to a single decision taken by the model
  - p(blau | blue) = #(a(blau, blue)) / #(blue)
• Approach 1: automatic generation of alignments
  - HMM (generative), Conditional Random Field (discriminative)
  - Supervised learning: build a training set with human annotators
• Approach 2: parameter estimation by the EM algorithm
24
Parameter Estimation by EM
• Expectation-Maximization (EM) algorithm
  1. Assume a probability distribution over the hidden events (alignments)
  2. Take counts of events based on this distribution
  3. Use the counts to estimate new parameters
  4. Use the new parameters to re-weight the examples
  5. Rinse and repeat
• Expected count of an alignment link, e.g.:
$$E(a(\text{blau}, \text{blue})) = \frac{P_{\theta_0}(a(\text{blau}, \text{blue}), f_1^J \mid e_1^I)}{P_{\theta_0}(f_1^J \mid e_1^I)}$$

25
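The recipe above, specialized to IBM Model 1 lexicon probabilities, fits in a short loop. The two-sentence das/haus/buch corpus is a classic toy example; this is a sketch of the algorithm, not a production aligner such as GIZA++.

```python
from collections import defaultdict

# EM for IBM Model 1 word-translation probabilities t(f|e) on a toy
# two-sentence parallel corpus. Each iteration takes expected counts of
# alignment links under the current t (E-step), then renormalizes the
# counts into new probabilities (M-step).

corpus = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"])]

f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initialization

for _ in range(20):
    count = defaultdict(float)   # expected count of (f, e) links
    total = defaultdict(float)   # expected count of e
    for fs, es in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)    # normalizer for this f
            for e in es:
                c = t[(f, e)] / z             # posterior link probability
                count[(f, e)] += c
                total[e] += c
    for (f, e), c in count.items():           # M-step: renormalize
        t[(f, e)] = c / total[e]

print(round(t[("haus", "house")], 3))  # approaches 1 as EM converges
```

Although "das" co-occurs with everything, the unambiguous pairs (haus/house, buch/book) explain their sentences better and better each round, which in turn pushes "das" toward "the".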
Learning Phrase Translation Probabilities
• Example: P(to the conference | zur Konferenz), P(into the meeting | zur Konferenz)
• Solution
  - Generate a word alignment
  - Count all phrases that are consistent with the alignment
• Consistency
  - No word inside the phrase pair is aligned to any word outside it
  - The phrase pair contains the transitive closure of some set of nodes in the bipartite alignment graph
26
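The first consistency condition can be checked directly: a candidate phrase pair is rejected as soon as any alignment link crosses its boundary, and accepted only if it contains at least one internal link. A sketch (spans and links are illustrative):

```python
# Consistency check for phrase extraction: a phrase pair (e-span,
# f-span) is consistent with a word alignment iff no link connects a
# word inside one span to a word outside the other, and at least one
# link falls inside the pair.

def consistent(alignment, e_span, f_span):
    """alignment: set of (i, j) links; spans are inclusive (start, end)."""
    (e1, e2), (f1, f2) = e_span, f_span
    has_link = False
    for i, j in alignment:
        e_in, f_in = e1 <= i <= e2, f1 <= j <= f2
        if e_in != f_in:          # link crosses the phrase boundary
            return False
        if e_in and f_in:
            has_link = True
    return has_link

# e: "to the conference" (0..2), f: "zur Konferenz" (0..1)
links = {(0, 0), (1, 0), (2, 1)}
print(consistent(links, (0, 2), (0, 1)))  # True: all links internal
print(consistent(links, (1, 2), (0, 1)))  # False: link (0,0) crosses
```

Real extraction enumerates all spans up to a length limit and keeps every pair that passes this test, then estimates phrase probabilities by relative frequency.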
Phrases Consistent with Word Alignment
27
[Figure: word alignment matrix between the English sentence "I did not unfortunately receive an answer to this question" and the German sentence "Auf diese Frage habe ich leider keine Antwort bekommen", with the phrase pairs consistent with the alignment marked.]
Parameter Estimation in Log-linear Model
• MERT (Minimum Error-Rate Training)
28
$$\hat{\lambda}_1^K = \arg\min_{\lambda_1^K} \sum_{(e,f) \in C} E\!\left(\arg\max_{e'} P_{\lambda_1^K}(e' \mid f),\; e\right)$$

where $E(\cdot,\cdot)$ is an error function comparing the system output with the reference $e$
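The idea can be caricatured with a brute-force search over a tiny weight grid: pick the weight vector whose 1-best outputs make the fewest errors on a development set. Real MERT uses Och's efficient line search over n-best lists; the feature values and error labels below are invented.

```python
# Crude sketch of the objective behind MERT: search weight vectors and
# keep the one whose 1-best candidates minimize the error count on a
# dev set. Brute-force grid for illustration only; feature values and
# error labels (1 = wrong translation) are made up.

# For each source sentence: candidates as ((h1, h2), is_error)
dev = [
    [((2.0, 0.5), 0), ((1.0, 2.0), 1)],   # sentence 1
    [((0.5, 1.5), 1), ((1.5, 1.0), 0)],   # sentence 2
]

def error(weights):
    errs = 0
    for cands in dev:
        # 1-best candidate under the current weights
        best = max(cands,
                   key=lambda c: sum(w * h for w, h in zip(weights, c[0])))
        errs += best[1]
    return errs

grid = [(w1, w2) for w1 in (0.0, 0.5, 1.0) for w2 in (0.0, 0.5, 1.0)]
best_w = min(grid, key=error)
print(best_w, error(best_w))
```

Note that the objective is piecewise constant in the weights (the 1-best output only changes at discrete boundaries), which is exactly why MERT needs a specialized search rather than gradient descent.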
Contents
• Overview
• Rule-based vs. Statistical Machine Translation
• SMT System
• Modeling
• Parameter Estimation
• Decoding
• Evaluation of MT
• Syntax-based SMT
• Conclusion
29
Decoding
30
[Diagram: SMT system pipeline. The parallel (source + target) corpus feeds word alignment and MLE to produce the Translation Model; the target-language corpus feeds language modeling to produce the Language Model; the Decoder combines both models to turn an input source sentence into an output target sentence.]
Decoding
• How do we translate new sentences?
• A difficult optimization problem: the search ranges over $\{E^* \times D^* \times F^*\}$
$$\hat{e} = \arg\max_{e:\, Y(e,d)} P(e, d \mid f)$$
• The decoder uses the parameters learned on a parallel corpus
  - Translation probabilities
  - Fertilities
  - Distortions
• In combination with a language model, the decoder generates the most likely translation
• Similar to the traveling salesman problem
  - Standard search algorithms can be used to explore the search space (A*, greedy search, etc.)

31
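As a toy illustration, a greedy monotone decoder that scores each choice by translation probability times a bigram language-model score. Real decoders also search over reorderings (hence the TSP analogy) and keep many hypotheses; all tables below are invented.

```python
# Greedy monotone decoder sketch for a word-based model: translate left
# to right, at each step picking the target word that maximizes
# t(e|f) * p_LM(e | previous word). Toy tables, no reordering.

tm = {"la": {"the": 0.7}, "maison": {"house": 0.8, "home": 0.2}}
lm = {("<s>", "the"): 0.4, ("the", "house"): 0.5, ("the", "home"): 0.1}

def greedy_decode(f_words):
    out, prev = [], "<s>"
    for f in f_words:
        options = tm.get(f, {f: 1.0})     # pass unknown words through
        best = max(options,
                   key=lambda e: options[e] * lm.get((prev, e), 1e-4))
        out.append(best)
        prev = best
    return " ".join(out)

print(greedy_decode(["la", "maison"]))  # -> "the house"
```

Here "home" has lower translation probability and a worse bigram score after "the", so "house" wins; a beam or A* decoder would keep both hypotheses alive instead of committing greedily.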
Decoder Example: FST Decoding
32
Contents
• Overview
• Rule-based vs. Statistical Machine Translation
• SMT System
• Modeling
• Parameter Estimation
• Decoding
• Evaluation of MT
• Syntax-based SMT
• Conclusion
33
Evaluation of MT
• Ideal criterion: user satisfaction
• Problem
  - Expensive, slow, inconsistent, subjective
  - Problematic to use in system development
• Goal
  - Automatic and objective evaluation of machine translation quality
• Idea
  - Compute the similarity of MT output to good human translations (reference translations)
  - Good MT: similar to good human translations
  - Bad MT: very different from human translations
• Use a set of bilingual test sentences
34
Evaluation Example: BLEU score
• BLEU (Bilingual Evaluation Understudy)
  - Currently the most widely used automatic metric
• Modified n-gram precision
  - n-gram precision: fraction of the candidate's n-grams occurring in the references
  - Modified n-gram precision: the same part of a reference cannot be 'used' twice (counts are clipped)
• Brevity penalty
  - Penalizes too-short translations
  - BP = exp(min(1 - r/c, 0))
  - c: length of the MT output, r: length of the reference translation
• BLEU-4 score
$$\text{BLEU} = \text{BP} \cdot \exp\!\left(\tfrac{1}{4}\left(\log p_{\text{unigram}} + \log p_{\text{bigram}} + \log p_{\text{trigram}} + \log p_{\text{fourgram}}\right)\right)$$

35
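A sentence-level sketch of the formula. Real BLEU is computed over a whole corpus and needs smoothing at the sentence level; this toy version simply returns 0 when any n-gram precision is 0.

```python
import math
from collections import Counter

# Sentence-level BLEU-4 sketch: clipped (modified) n-gram precisions
# combined with the brevity penalty BP = exp(min(1 - r/c, 0)).

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(candidate, reference, max_n=4):
    c, r = len(candidate), len(reference)
    bp = math.exp(min(1 - r / c, 0))          # brevity penalty
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # clipping: an n-gram cannot be credited more times than it
        # appears in the reference
        clipped = sum(min(cnt, ref[g]) for g, cnt in cand.items())
        total = max(sum(cand.values()), 1)
        if clipped == 0:
            return 0.0                        # unsmoothed: zero precision
        log_prec += 0.25 * math.log(clipped / total)
    return bp * math.exp(log_prec)

cand = "the cat sat on the mat".split()
print(bleu(cand, cand))                        # identical -> 1.0
print(bleu("the the the the".split(), cand))   # clipping kills this -> 0.0
```

The second call shows why clipping matters: without it, "the the the the" would get unigram precision 1.0 against a reference containing "the".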
Contents
• Overview
• Rule-based vs. Statistical Machine Translation
• SMT System
• Modeling
• Parameter Estimation
• Decoding
• Evaluation of MT
• Syntax-based SMT
• Conclusion
36
Using Syntax Information in SMT
37
[Figure: the Vauquois triangle again, with the syntactic-transfer path emphasized: Source Text → syntactic analysis → Syntactic Structure → syntactic transfer → Syntactic Structure → Target Text; the direct, semantic-transfer, and interlingua paths are shown for comparison.]
Using Syntax Information in SMT
• Modified source-channel model [K. Yamada and K. Knight, 2001]
• Input
  - Sentences → parse trees: input sentences are preprocessed by a syntactic parser
• Grammar
  - SCFG (Synchronous Context-Free Grammar)
  - Tree-adjoining grammar (a subset of the context-sensitive languages)
  - Etc.
• Channel operations
  - Reordering
  - Insertion
  - Translation
38
SCFG Example
39
[Figure: Fig. 7 from Lopez (2008), "Visualization of the CFG derivation and SCFG derivations." Derivation happens in exactly the same way in CFG and SCFG; the difference is that each SCFG production specifies two outputs rather than one, separated by a slash, with boxed co-indexes aligning nonterminals across the outputs. The derived trees are therefore isomorphic, differing only in the order of the aligned nonterminal nodes, and a word alignment can be inferred between words generated by the same nonterminal. If the source-string dimension is ignored, the SCFG derivation is exactly the corresponding CFG derivation.]
Syntax-based SMT Process
1. Parse input sentence to syntax tree
2. Reordering
40
[Table (garbled in extraction): the reordering step's r-table. For each parse-tree node, it lists every permutation of the node's children (e.g. all 3! = 6 orderings of a three-child sequence) together with an estimated probability for that reordering.]
Syntax-based SMT Process
3. Insertion
• Two parameter tables
  - Insertion position
  - Word identity
41
[Tables (garbled in extraction): the insertion-position table gives, for each tree node, the probability of inserting a function word to its left, to its right, or not at all; the word-identity table gives the probability of each word that may be inserted.]
SCFG Decoding
42
[Figure: Fig. 16 from Lopez (2008), "An illustration of SCFG decoding, which is equivalent to parsing the source language. (1) Scan each source word and associate it with a span. (2) Apply SCFG rules that match the spans. (3) Recursively infer larger spans from smaller spans. (4) We can also infer target-language words with no matching span in the source language, if the grammar contains SCFG rules producing them. (5) Read off the tree in target-language order."]
Current Direction and Future Research
• Common elements of the current best systems
  - Phrase-based models
  - Log-linear models with a small set of generative features and discriminative training
• Hurdles
  - Corpora are biased: mostly news or government texts
  - Statistical models are sensitive to domain differences and noise
  - Hard to translate between morphologically very different languages
    - English → German is relatively easy, but English → Korean is hard
• Future research
  - Incorporation of linguistic knowledge (hybrid approaches)
43
References
• Adam Lopez. "Statistical Machine Translation." ACM Computing Surveys 40(3), Article 8, August 2008.
• Kenji Yamada and Kevin Knight. "A Syntax-Based Statistical Translation Model." Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, 2001.
44
Q&A
45