
Statistical machine translation in a few slides


Slides used in a tutorial at DCU, Ireland.


Page 1: Statistical machine translation in a few slides


Statistical machine translation in a few slides

Mikel L. Forcada 1,2

1 Departament de Llenguatges i Sistemes Informàtics, Universitat d’Alacant, E-03071 Alacant (Spain)

2 Prompsit Language Engineering, S.L., E-03690 St. Vicent del Raspeig (Spain)

April 14-16, 2009: Free/open-source MT tutorial at the CNGL


Page 2: Statistical machine translation in a few slides


Contents

1 Translation as probability

2 “Decoding”

3 Training

4 “Log-linear”

5 Ain’t got nothin’ but the BLEUs?

6 The SMT lifecycle


Page 3: Statistical machine translation in a few slides


The “canonical” model

Translation as probability/1

Instead of saying that a source-language (SL) sentence s in a SL text and a target-language (TL) sentence t, as found in a SL–TL bitext, are or are not a translation of each other, in SMT one says that they are a translation of each other with a probability p(s, t) = p(t, s) (a joint probability).

We’ll assume we have such a probability model available, or at least a reasonable estimate.


Page 4: Statistical machine translation in a few slides


The “canonical” model

Translation as probability/2

According to basic probability laws, we can write:

p(s, t) = p(t, s) = p(s|t) p(t) = p(t|s) p(s)   (1)

where p(x|y) is the conditional probability of x given y. We are interested in translating from SL to TL. That is, we want to find the most likely translation given the SL sentence s:

t* = arg max_t p(t|s)   (2)


Page 5: Statistical machine translation in a few slides


The “canonical” model

We can rewrite eq. (1) as

p(t|s) = p(s|t) p(t) / p(s)   (3)

and then combine it with (2) to get

t* = arg max_t p(s|t) p(t)   (4)

since p(s) does not depend on t and therefore cannot change which t wins the arg max.
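Written out as a single chain, the step from (2) and (3) to (4) is just Bayes’ rule plus dropping a t-independent factor; in LaTeX:

```latex
\begin{align*}
t^\ast &= \arg\max_t \; p(t \mid s)                      && \text{eq.~(2)} \\
       &= \arg\max_t \; \frac{p(s \mid t)\,p(t)}{p(s)}   && \text{eq.~(3)} \\
       &= \arg\max_t \; p(s \mid t)\,p(t)                && \text{$p(s)$ is constant in $t$; eq.~(4)}
\end{align*}
```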


Page 6: Statistical machine translation in a few slides


“Decoding”/1

t* = arg max_t p(s|t) p(t)

We have a product of two probability models:

a reverse translation model p(s|t), which tells us how likely it is that the SL sentence s is a translation of the candidate TL sentence t, and

a target-language model p(t), which tells us how likely the sentence t is in the TL side of bitexts.

These may be related (respectively) to the usual notions of

[reverse] adequacy: how much of the meaning of t is conveyed by s;

fluency: how fluent the candidate TL sentence is.

The arg max strikes a balance between the two.


Page 7: Statistical machine translation in a few slides


“Decoding”/2

In SMT parlance, the process of finding t* is called decoding.[1]

Obviously, it does not explore all possible translations t in the search space: there are infinitely many.

The search space is pruned.

Therefore, one just gets a reasonable approximation t'* instead of the ideal t*.

Pruning and search strategies are a very active research topic; a toy beam-search sketch follows below.

Free/open-source software: Moses.

[1] Reading SMT articles usually entails deciphering jargon which may be very obscure to outsiders or newcomers.
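Real decoders such as Moses perform stack-based beam search over partial phrase hypotheses with far more sophisticated pruning; the toy sketch below only illustrates the core idea of combining lexical scores with a language model while keeping a pruned beam, with no reordering. All words, tables and probabilities here are invented for illustration.

```python
import math

# Toy lexical translation log-probabilities, log p(s_word | t_word).
# In a real system these come from training (see the Training slides).
LEX = {
    ("el", "the"): math.log(0.9), ("el", "he"): math.log(0.1),
    ("perro", "dog"): math.log(0.9),
    ("ladra", "barks"): math.log(0.8),
}

# Toy bigram target-language model log-probabilities, log p(word | previous).
LM = {
    ("<s>", "the"): math.log(0.4), ("<s>", "he"): math.log(0.2),
    ("the", "dog"): math.log(0.3), ("dog", "barks"): math.log(0.4),
}

def options(s_word):
    """TL words that may translate the given SL word, with lexical scores."""
    return [(t, lp) for (s, t), lp in LEX.items() if s == s_word]

def decode(sl_words, beam_size=2):
    """Monotone beam search maximizing log p(s|t) + log p(t)."""
    beam = [(0.0, ["<s>"])]  # hypotheses: (score, TL words so far)
    for s_word in sl_words:
        expanded = []
        for score, hyp in beam:
            for t_word, lex_lp in options(s_word):
                lm_lp = LM.get((hyp[-1], t_word), math.log(1e-4))  # crude smoothing
                expanded.append((score + lex_lp + lm_lp, hyp + [t_word]))
        # Pruning: keep only the best few hypotheses (the "beam").
        beam = sorted(expanded, reverse=True)[:beam_size]
    return beam[0][1][1:]  # drop the <s> marker

print(decode(["el", "perro", "ladra"]))  # -> ['the', 'dog', 'barks']
```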

Page 8: Statistical machine translation in a few slides


Training/1

So where do these probabilities come from?

p(t) may easily be estimated from a large monolingual TL corpus (free/open-source software: irstlm); a toy sketch of the idea follows below.

The estimation of p(s|t) is more complex. It is usually made of:

a lexical model, describing the probability that the translation of a certain TL word or sequence of words (“phrase”[2]) is a certain SL word or sequence of words;

an alignment model, describing the reordering of words or “phrases”.

[2] A very unfortunate choice in SMT jargon.
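As a hedged illustration of what estimating p(t) involves, here is a minimal maximum-likelihood bigram model over a made-up corpus; real toolkits such as irstlm add smoothing, back-off and higher-order n-grams:

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Maximum-likelihood bigram model: p(w2 | w1) = count(w1 w2) / count(w1).
    No smoothing or back-off, unlike a real toolkit such as irstlm."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(words[:-1])          # contexts, including <s>
        bigrams.update(zip(words, words[1:]))
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

# Toy monolingual TL corpus (invented for illustration).
lm = train_bigram_lm(["the dog barks", "the dog sleeps", "a cat sleeps"])
print(lm[("the", "dog")])    # 1.0: here "the" is always followed by "dog"
print(lm[("dog", "barks")])  # 0.5
```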

Page 9: Statistical machine translation in a few slides


Training/2

The lexical model and the alignment model are estimated from a large sentence-aligned bilingual corpus through a complex iterative process.

An initial set of lexical probabilities is obtained by assuming, for instance, that any word in the TL sentence aligns with any word in its SL counterpart. And then:

alignment probabilities in accordance with the lexical probabilities are computed;

lexical probabilities are obtained in accordance with the alignment probabilities.

This process (“expectation maximization”) is repeated a fixed number of times or until some convergence is observed (free/open-source software: Giza++).
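The classic concrete instance of this procedure is IBM Model 1, which Giza++ uses in its first training stages. Here is a minimal sketch of its EM loop, with an invented two-sentence bitext, uniform initialization, and none of the refinements (NULL word, higher IBM models) of the real thing:

```python
from collections import defaultdict

def ibm_model1(bitext, iterations=10):
    """EM for IBM Model 1 lexical probabilities t(s_word | t_word)."""
    s_vocab = {w for s, _ in bitext for w in s.split()}
    prob = defaultdict(lambda: 1.0 / len(s_vocab))  # uniform initialization
    for _ in range(iterations):
        counts = defaultdict(float)   # expected counts c(s, t)
        totals = defaultdict(float)   # expected counts c(t)
        for s_sent, t_sent in bitext:
            t_words = t_sent.split()
            for s in s_sent.split():
                # E-step: each TL word's responsibility for this SL word is
                # proportional to the current lexical probability.
                norm = sum(prob[(s, t)] for t in t_words)
                for t in t_words:
                    frac = prob[(s, t)] / norm
                    counts[(s, t)] += frac
                    totals[t] += frac
        # M-step: re-estimate lexical probabilities from expected counts.
        for (s, t), c in counts.items():
            prob[(s, t)] = c / totals[t]
    return prob

# Toy sentence-aligned bitext (invented): "la" should align with "the".
bitext = [("la casa", "the house"), ("la flor", "the flower")]
prob = ibm_model1(bitext)
print(round(prob[("la", "the")], 2))  # rises toward 1.0 as EM reinforces the co-occurrence
```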


Page 10: Statistical machine translation in a few slides


Training/3

In “phrase-based” SMT, alignments may be used to extract (SL-phrase, TL-phrase) pairs of phrases, and their corresponding probabilities, for easier decoding and to avoid “word salad”.
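A sketch of the standard consistency criterion usually employed for this extraction: a phrase pair qualifies if no alignment link connects a word inside one side of the pair with a word outside the other. The sentences and alignment below are invented:

```python
def extract_phrases(s_words, t_words, alignment, max_len=3):
    """Extract (SL-phrase, TL-phrase) pairs consistent with the word alignment.
    `alignment` is a set of (i, j) pairs: SL position i links to TL position j."""
    pairs = []
    for i1 in range(len(s_words)):
        for i2 in range(i1, min(i1 + max_len, len(s_words))):
            # TL positions linked to the SL span [i1, i2].
            tps = {j for (i, j) in alignment if i1 <= i <= i2}
            if not tps:
                continue
            j1, j2 = min(tps), max(tps)
            # Consistency: nothing inside the TL span links outside the SL span.
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                pairs.append((" ".join(s_words[i1:i2 + 1]),
                              " ".join(t_words[j1:j2 + 1])))
    return pairs

# Toy example: "la casa blanca" / "the white house"; alignment made up.
print(extract_phrases(["la", "casa", "blanca"], ["the", "white", "house"],
                      {(0, 0), (1, 2), (2, 1)}))
# -> includes ('casa blanca', 'white house') despite the reordering
```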


Page 11: Statistical machine translation in a few slides


“Log-linear”/1

More SMT jargon!

It’s short for linear combination of logarithms of probabilities.

And, sometimes, even features that aren’t logarithms or probabilities of any kind.

OK, let’s take a look at the maths.


Page 12: Statistical machine translation in a few slides


“Log-linear”/2

One can write a more general formula:

p(t|s) = exp( Σ_{k=1}^{nF} λ_k f_k(t, s) ) / Z   (5)

with nF feature functions f_k(t, s) which can depend on s, t, or both.

Setting nF = 2, f_1(t, s) = log p(s|t), f_2(t, s) = log p(t), and Z = p(s), one recovers the canonical formula (3).

The best translation is then

t* = arg max_t Σ_{k=1}^{nF} λ_k f_k(t, s)   (6)

Most of the f_k(t, s) are logarithms, hence “log-linear”.
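As a sketch, scoring a candidate under eq. (6) is just a weighted sum of feature values; the feature functions, weights and candidates below are invented stand-ins:

```python
def loglinear_score(weights, features, t, s):
    """Score of candidate t for source s under the log-linear model, eq. (6)."""
    return sum(w * f(t, s) for w, f in zip(weights, features))

# Toy feature functions: two stand-ins for log-probability features, plus a
# length penalty that is not the logarithm of anything (all made up here).
def f_tm(t, s):   # stand-in for log p(s|t)
    return -2.0 if len(t) == len(s) else -5.0

def f_lm(t, s):   # stand-in for log p(t)
    return -1.5

def f_len(t, s):  # length penalty feature
    return -abs(len(t) - len(s))

features = [f_tm, f_lm, f_len]
weights = [1.0, 1.0, 0.5]  # the lambda_k, normally tuned (see next slide)

source = ["la", "casa", "blanca"]
candidates = [["the", "white", "house"], ["the", "house"]]
best = max(candidates,
           key=lambda t: loglinear_score(weights, features, t, source))
print(best)  # -> ['the', 'white', 'house']
```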


Page 13: Statistical machine translation in a few slides


“Log-linear”/3

“Feature selection is a very open problem in SMT” (Lopez 2008).

Other possible feature functions include length penalties (discouraging unreasonably short or long translations), “inverted” versions of p(s|t), etc.

Where do we get the λ_k’s from? They are usually tuned so as to optimize the results on a tuning set, according to a certain objective function that:

is taken to be an indicator that correlates with translation quality;

may be automatically computed from the output of the SMT system and the reference translations in the corpus.

This is sometimes called MERT (minimum error rate training) (free/open-source software: the Moses suite).
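Real MERT exploits the structure of the error surface (and the Moses suite implements it properly); as a deliberately crude stand-in, the sketch below tunes a single λ by brute-force search against a quality function on a tuning set. The decode and quality helpers here are hypothetical:

```python
def tune_weight(decode, quality, tuning_set, grid=None):
    """Brute-force line search over one lambda; real MERT is far smarter.
    `decode(s, lam)` and `quality(hyps, refs)` are hypothetical helpers."""
    grid = grid or [i / 10 for i in range(21)]  # 0.0, 0.1, ..., 2.0
    best_lam, best_score = None, float("-inf")
    for lam in grid:
        hyps = [decode(s, lam) for s, _ in tuning_set]
        refs = [r for _, r in tuning_set]
        score = quality(hyps, refs)  # e.g. BLEU on the tuning set
        if score > best_score:
            best_lam, best_score = lam, score
    return best_lam

# Stub usage with toy helpers, just to show the shapes involved.
best = tune_weight(
    decode=lambda s, lam: s.upper() if lam > 1 else s,
    quality=lambda hyps, refs: sum(h == r for h, r in zip(hyps, refs)),
    tuning_set=[("hello", "HELLO")])
print(best)  # first lambda on the grid above 1.0
```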


Page 14: Statistical machine translation in a few slides


Ain’t got nothin’ but the BLEUs?

The most famous “quality indicator” is called BLEU, but there are many others.

BLEU counts which fraction of the 1-word, 2-word, ..., n-word sequences in the output match the reference translation.

Correlation with subjective assessments of quality is still an open question.

A lot of SMT research is currently BLEU-driven and makes little contact with real applications of MT.
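A minimal sketch of BLEU’s core, clipped n-gram precision combined by a geometric mean, for a single sentence and single reference; full BLEU also applies a brevity penalty, smoothing, and operates over a whole test set:

```python
import math
from collections import Counter

def ngrams(words, n):
    """Multiset of the n-word sequences in a sentence."""
    return Counter(zip(*(words[i:] for i in range(n))))

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped 1..max_n-gram precisions (no brevity penalty)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipping: a candidate n-gram only counts as often as it occurs
        # in the reference.
        matches = sum(min(c, ref[g]) for g, c in cand.items())
        if matches == 0:
            return 0.0
        log_precisions.append(math.log(matches / sum(cand.values())))
    return math.exp(sum(log_precisions) / max_n)

cand = "the white house is big".split()
ref = "the white house is very big".split()
print(round(bleu(cand, ref), 3))  # 0.707: (1 * 3/4 * 2/3 * 1/2) ** (1/4)
```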


Page 15: Statistical machine translation in a few slides


The SMT lifecycle

Development:

Training: monolingual and sentence-aligned bilingual corpora are used to estimate probability models (features).

Tuning: a held-out portion of the sentence-aligned bilingual corpus is used to tune the coefficients λ_k.

Decoding: sentences s are fed into the SMT system and “decoded” into their translations t.

Evaluation: the system is evaluated against a reference corpus.


Page 16: Statistical machine translation in a few slides


License

This work may be distributed under the terms of

the Creative Commons Attribution–Share Alike license: http://creativecommons.org/licenses/by-sa/3.0/

the GNU GPL v. 3.0 license: http://www.gnu.org/licenses/gpl.html

Dual license! E-mail me to get the sources: [email protected]
