Toward Statistical Machine Translation without Parallel Corpora

Alexandre Klementiev, Ann Irvine, Chris Callison-Burch, David Yarowsky
Center for Language and Speech Processing

Johns Hopkins University

Abstract

We estimate the parameters of a phrase-based statistical machine translation system from monolingual corpora instead of a bilingual parallel corpus. We extend existing research on bilingual lexicon induction to estimate both lexical and phrasal translation probabilities for MT-scale phrase-tables. We propose a novel algorithm to estimate reordering probabilities from monolingual data. We report translation results for an end-to-end translation system using these monolingual features alone. Our method only requires monolingual corpora in source and target languages, a small bilingual dictionary, and a small bitext for tuning feature weights. In this paper, we examine an idealization where a phrase-table is given. We examine the degradation in translation performance when bilingually estimated translation probabilities are removed and show that 80%+ of the loss can be recovered with monolingually estimated features alone. We further show that our monolingual features add 1.5 BLEU points when combined with standard bilingually estimated phrase table features.

1 Introduction

The parameters of statistical models of translation are typically estimated from large bilingual parallel corpora (Brown et al., 1993). However, these resources are not available for most language pairs, and they are expensive to produce in quantities sufficient for building a good translation system (Germann, 2001). We attempt an entirely different approach; we use cheap and plentiful monolingual resources to induce an end-to-end statistical machine translation system. In particular, we extend the long line of work on inducing translation lexicons (beginning with Rapp (1995)) and propose to use multiple independent cues present in monolingual texts to estimate lexical and phrasal translation probabilities for large, MT-scale phrase-tables. We then introduce a novel algorithm to estimate reordering features from monolingual data alone, and we report the performance of a phrase-based statistical model (Koehn et al., 2003) estimated using these monolingual features.

Most of the prior work on lexicon induction is motivated by the idea that it could be applied to machine translation but stops short of actually doing so. Lexicon induction holds the potential to create machine translation systems for languages which do not have extensive parallel corpora. Training would only require two large monolingual corpora and a small bilingual dictionary, if one is available. The idea is that intrinsic properties of monolingual data (possibly along with a handful of bilingual pairs to act as example mappings) can provide independent but informative cues to learn translations because words (and phrases) behave similarly across languages. This work is the first attempt to extend and apply these ideas to an end-to-end machine translation pipeline. While we make an explicit assumption that a table of phrasal translations is given a priori, we induce every other parameter of a full phrase-based translation system from monolingual data alone. The contributions of this work are:

• In Section 2.2 we analyze the challenges of using bilingual lexicon induction for statistical MT (performance on low frequency items, and moving from words to phrases).

• In Sections 3.1 and 3.2 we use multiple cues present in monolingual data to estimate lexical and phrasal translation scores.

• In Section 3.3 we propose a novel algorithm for estimating phrase reordering features from monolingual texts.

• Finally, in Section 5 we systematically drop feature functions from a phrase table and then replace them with monolingually estimated equivalents, reporting end-to-end translation quality.


2 Background

We begin with a brief overview of the standard phrase-based statistical machine translation model. Here, we define the parameters which we later replace with monolingual alternatives. We continue with a discussion of bilingual lexicon induction; we extend these methods to estimate the monolingual parameters in Section 3. This approach allows us to replace expensive/rare bilingual parallel training data with two large monolingual corpora, a small bilingual dictionary, and a bilingual development set of roughly 2,000 sentences, all of which are comparatively plentiful/inexpensive.

2.1 Parameters of phrase-based SMT

Statistical machine translation (SMT) was first formulated as a series of probabilistic models that learn word-to-word correspondences from sentence-aligned bilingual parallel corpora (Brown et al., 1993). Current methods, including phrase-based (Och, 2002; Koehn et al., 2003) and hierarchical models (Chiang, 2005), typically start by word-aligning a bilingual parallel corpus (Och and Ney, 2003). They extract multi-word phrases that are consistent with the Viterbi word alignments and use these phrases to build new translations. A variety of parameters are estimated using the bitexts. Here we review the parameters of the standard phrase-based translation model (Koehn et al., 2007). Later we will show how to estimate them using monolingual texts instead. These parameters are:

• Phrase pairs. Phrase extraction heuristics (Venugopal et al., 2003; Tillmann, 2003; Och and Ney, 2004) produce a set of phrase pairs (e, f) that are consistent with the word alignments. In this paper we assume that the phrase pairs are given (without any scores), and we induce every other parameter of the phrase-based model from monolingual data.

• Phrase translation probabilities. Each phrase pair has a list of associated feature functions (FFs). These include phrase translation probabilities, φ(e|f) and φ(f|e), which are typically calculated via maximum likelihood estimation.

• Lexical weighting. Since MLE overestimates φ for phrase pairs with sparse counts, lexical weighting FFs are used to smooth. Average word translation probabilities, w(e_i|f_j), are calculated via phrase-pair-internal word alignments.

Figure 1: The reordering probabilities from the phrase-based models are estimated from bilingual data by calculating how often in the parallel corpus a phrase pair (f, e) is oriented with the preceding phrase pair in the 3 types of orientations (monotone, swapped, and discontinuous). [The figure pairs the English sentence "How much should you charge for your Facebook profile" with the German "Wieviel sollte man aufgrund seines Profils in Facebook verdienen", marking each phrase pair's orientation as m, s, or d.]

• Reordering model. Each phrase pair (e, f) also has associated reordering parameters, p_o(orientation|f, e), which indicate the distribution of its orientation with respect to the previously translated phrase. Orientations are monotone, swap, and discontinuous (Tillmann, 2004; Kumar and Byrne, 2004); see Figure 1.

• Other features. Other typical features are n-gram language model scores and a phrase penalty, which governs whether to use fewer longer phrases or more shorter phrases. These are not bilingually estimated, so we can re-use them directly without modification.

The features are combined in a log linear model, and their weights are set through minimum error rate training (Och, 2003). We use the same log linear formulation and MERT but propose alternatives derived directly from monolingual data for all parameters except for the phrase pairs themselves. Our pipeline still requires a small bitext of approximately 2,000 sentences to use as a development set for MERT parameter tuning.
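
To make the log linear formulation concrete, the following minimal sketch scores one translation hypothesis as a weighted sum of its feature function values. It is an illustration, not the Moses implementation; the feature names, values, and weights are invented, and in practice the weights would be set by MERT.

import math

# Sketch of log linear scoring: score(e) = sum_i lambda_i * h_i(e, f).
def loglinear_score(feature_values, weights):
    return sum(weights[name] * h for name, h in feature_values.items())

# Hypothetical feature values for one hypothesis (log probabilities and
# a distortion penalty); the numbers are toy values.
features = {"phrase_log_prob": math.log(0.4),
            "lm_log_prob": math.log(0.01),
            "distortion_penalty": -2.0}
weights = {"phrase_log_prob": 1.0, "lm_log_prob": 0.6, "distortion_penalty": 0.3}
print(loglinear_score(features, weights))

In decoding, the hypothesis with the highest such score is selected; replacing a bilingually estimated feature then amounts to swapping one entry of the feature dictionary for a monolingually estimated one.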


2.2 Bilingual lexicon induction for SMT

Bilingual lexicon induction describes the class of algorithms that attempt to learn translations from monolingual corpora. Rapp (1995) was the first to propose using non-parallel texts to learn the translations of words. Using large, unrelated English and German corpora (with 163m and 135m words) and a small German-English bilingual dictionary (with 22k entries), Rapp (1999) demonstrated that reasonably accurate translations could be learned for 100 German nouns that were not contained in the seed bilingual dictionary. His algorithm worked by (1) building a context vector representing an unknown German word by counting its co-occurrence with all the other words in the German monolingual corpus, (2) projecting this German vector onto the vector space of English using the seed bilingual dictionary, (3) calculating the similarity of this sparse projected vector to vectors for English words that were constructed using the English monolingual corpus, and (4) outputting the English words with the highest similarity as the most likely translations.
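
The following toy sketch mirrors the structure of Rapp's four steps; the corpora, seed dictionary, and candidate list are invented stand-ins, and the counting is deliberately simplified.

import math
from collections import Counter

# Step (1): count co-occurrences of `word` within a fixed window.
def context_vector(word, corpus, window=2):
    vec = Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            if w == word:
                for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                    if j != i:
                        vec[sent[j]] += 1
    return vec

# Step (2): project German counts into English positions via the seed dictionary.
def project(vec, seed_dict):
    return Counter({seed_dict[w]: c for w, c in vec.items() if w in seed_dict})

# Step (3): cosine similarity between sparse count vectors.
def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    norm = lambda x: math.sqrt(sum(c * c for c in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

# Step (4): rank English candidates by similarity to the projected vector.
de_corpus = [["der", "hund", "bellt", "laut"], ["der", "hund", "schläft"]]
en_corpus = [["the", "dog", "barks", "loudly"], ["the", "dog", "sleeps"],
             ["the", "cat", "sleeps"]]
seed = {"der": "the", "bellt": "barks", "laut": "loudly", "schläft": "sleeps"}
projected = project(context_vector("hund", de_corpus), seed)
print(sorted(["dog", "cat"],
             key=lambda e: -cosine(projected, context_vector(e, en_corpus))))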

A variety of subsequent work has extended the original idea either by exploring different measures of vector similarity (Fung and Yee, 1998) or by proposing other ways of measuring similarity beyond co-occurrence within a context window. For instance, Schafer and Yarowsky (2002) demonstrated that word translations tend to co-occur in time across languages. Koehn and Knight (2002) used similarity in spelling as another kind of cue that a pair of words may be translations of one another. Garera et al. (2009) defined context vectors using dependency relations rather than adjacent words. Bergsma and Van Durme (2011) used the visual similarity of labeled web images to learn translations of nouns. Additional related work on learning translations from monolingual corpora is discussed in Section 6.

In this paper, we apply bilingual lexicon induction methods to statistical machine translation. Given the obvious benefits of not having to rely on scarce bilingual parallel training data, it is surprising that bilingual lexicon induction has not been used for SMT before now. There are several open questions that make its applicability to SMT uncertain. Previous research on bilingual lexicon induction learned translations only for a small number of high frequency words (e.g. 100 nouns in Rapp (1995), 1,000 most frequent words in Koehn and Knight (2002), or 2,000 most frequent nouns in Haghighi et al. (2008)). Although previous work reported high translation accuracy, it may be misleading to extrapolate the results to SMT, where it is necessary to translate a much larger set of words and phrases, including many low frequency items.

Figure 2: Accuracy of single-word translations induced using contextual similarity as a function of the source word corpus frequency. Accuracy is the proportion of the source words with at least one correct (bilingual dictionary) translation in the top 1 and top 10 candidate lists.

In a preliminary study, we plotted the accuracy of translations against the frequency of the source words in the monolingual corpus. Figure 2 shows the result for translations induced using contextual similarity (defined in Section 3.1). Unsurprisingly, frequent terms have a substantially better chance of being paired with a correct translation, with words that only occur once having a low chance of being translated accurately.1 This problem is exacerbated when we move to multi-token phrases. As with phrase translation features estimated from parallel data, longer phrases are more sparse, making similarity scores less reliable than for single words.
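
For reference, the top-k accuracy plotted in Figure 2 can be computed as in this small sketch; the ranked candidate lists and gold dictionary below are toy stand-ins.

# Proportion of source words with at least one correct (bilingual
# dictionary) translation in the top-k candidate list.
def topk_accuracy(ranked_candidates, gold, k):
    hits = sum(1 for src, cands in ranked_candidates.items()
               if set(cands[:k]) & gold.get(src, set()))
    return hits / len(ranked_candidates)

ranked = {"casa": ["house", "home", "building"], "perro": ["cat", "dog"]}
gold = {"casa": {"house"}, "perro": {"dog"}}
print(topk_accuracy(ranked, gold, 1), topk_accuracy(ranked, gold, 10))  # 0.5 1.0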

Another impediment (not addressed in this paper) for using lexicon induction for SMT is the number of translations that must be learned. Learning translations for all words in the source language requires n^2 vector comparisons, since each word in the source language vocabulary must be compared against the vectors for all words in the target language vocabulary. The number of comparisons increases hugely if we compare vectors for multi-word phrases instead of just words. In this work, we avoid this problem by assuming that a limited set of phrase pairs is given a priori (but without scores). By limiting ourselves to phrases in a phrase table, we vastly limit the search space of possible translations. This is an idealization because high quality translations are guaranteed to be present. However, as our lesion experiments in Section 5.1 show, a phrase table without accurate translation probability estimates is insufficient to produce high quality translations. We show that lexicon induction methods can be used to replace bilingual estimation of phrase- and lexical-translation probabilities, making a significant step towards SMT without parallel corpora.

1 For a description of the experimental setup used to produce these translations, see Experiment 8 in Section 5.2.

Figure 3: Scoring contextual similarity of phrases: first, contextual vectors are projected using a small seed dictionary and then compared with the target language candidates.

3 Monolingual Parameter Estimation

We use bilingual lexicon induction methods to estimate the parameters of a phrase-based translation model from monolingual data. Instead of scores estimated from bilingual parallel data, we make use of cues present in monolingual data to provide multiple orthogonal estimates of similarity between a pair of phrases.

3.1 Phrasal similarity features

Contextual similarity. We extend the vector space approach of Rapp (1999) to compute similarity between phrases in the source and target languages. More formally, assume that (s_1, s_2, . . . , s_N) and (t_1, t_2, . . . , t_M) are (arbitrarily indexed) source and target vocabularies, respectively. A source phrase f is represented with an N-dimensional vector and a target phrase e with an M-dimensional vector (see Figure 3). The component values of the vector representing a phrase correspond to how often each of the words in that vocabulary appear within a two word window on either side of the phrase. These counts are collected using monolingual corpora. After the values have been computed, a contextual vector f is projected onto the English vector space using translations in a seed bilingual dictionary to map the component values into their appropriate English vector positions. This sparse projected vector is compared to the vectors representing all English phrases e. Each phrase pair in the phrase table is assigned a contextual similarity score c(f, e) based on the similarity between e and the projection of f.

Figure 4: Temporal histograms of the English phrase terrorist, its Spanish translation terrorista, and riqueza (wealth) collected from monolingual texts spanning a 13 year period. While the correct translation has a good temporal match, the non-translation riqueza has a distinctly different signature.

Various means of computing the component values and vector similarity measures have been proposed in the literature (e.g. Rapp (1999), Fung and Yee (1998)). Following Fung and Yee (1998), we compute the value of the k-th component of f's contextual vector as follows:

w_k = n_{f,k} × (log(n/n_k) + 1)

where n_{f,k} and n_k are the number of times s_k appears in the context of f and in the entire corpus, and n is the maximum number of occurrences of any word in the data. Intuitively, the more frequently s_k appears with f and the less common it is in the corpus in general, the higher its component value. Similarity between two vectors is measured as the cosine of the angle between them.
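
A small sketch of this weighting, assuming raw counts have already been collected from the monolingual corpus (the variable names follow the formula; the counts are toy values):

import math

# w_k = n_{f,k} * (log(n / n_k) + 1): large when s_k is frequent in the
# context of f but rare in the corpus overall.
def component_value(n_fk, n_k, n):
    return n_fk * (math.log(n / n_k) + 1)

# Toy counts: s_k occurs 12 times in the context of f and 200 times in
# the corpus; the most frequent word occurs 10,000 times.
print(component_value(12, 200, 10000))

The resulting source vector is projected through the seed dictionary and compared to each candidate's English vector with the cosine measure.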

Temporal similarity. In addition to contextual similarity, phrases in two languages may be scored in terms of their temporal similarity (Schafer and Yarowsky, 2002; Klementiev and Roth, 2006; Alfonseca et al., 2009). The intuition is that news stories in different languages will tend to discuss the same world events on the same day. The frequencies of translated phrases over time give them particular signatures that will tend to spike on the same dates. For instance, if the phrase asian tsunami is used frequently during a particular time span, the Spanish translation maremoto asiatico is likely to also be used frequently during that time. Figure 4 illustrates how the temporal distribution of terrorist is more similar to Spanish terrorista than to other Spanish phrases. We calculate the temporal similarity between a pair of phrases t(f, e) using the method defined by Klementiev and Roth (2006). We generate a temporal signature for each phrase by sorting the set of (time-stamped) documents in the monolingual corpus into a sequence of equally sized temporal bins and then counting the number of phrase occurrences in each bin. In our experiments, we set the window size to 1 day, so the size of temporal signatures is equal to the number of days spanned by our corpus. We use cosine distance to compare the normalized temporal signatures for a pair of phrases (f, e).

Topic similarity. Phrases and their translations are likely to appear in articles written about the same topic in two languages. Thus, topic or category information associated with monolingual data can also be used to indicate similarity between a phrase and its candidate translation. In order to score a pair of phrases, we collect their topic signatures by counting their occurrences in each topic and then comparing the resulting vectors. We again use the cosine similarity measure on the normalized topic signatures. In our experiments, we use interlingual links between Wikipedia articles to estimate topic similarity. We treat each linked article pair as a topic and collect counts for each phrase across all articles in its corresponding language. Thus, the size of a phrase topic signature is the number of article pairs with interlingual links in Wikipedia, and each component contains the number of times the phrase appears in (the appropriate side of) the corresponding pair. Our Wikipedia-based topic similarity feature, w(f, e), is similar in spirit to polylingual topic models (Mimno et al., 2009), but it is scalable to full bilingual lexicon induction.
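
The sketch below builds a temporal signature with 1-day bins and compares two signatures with the cosine measure; the dates and documents are invented. The same normalized-signature comparison applies unchanged to topic signatures, with Wikipedia article pairs in place of days.

import math
from datetime import date

# Count phrase occurrences per 1-day bin, then normalize.
def temporal_signature(dated_docs, phrase, start, num_days):
    sig = [0.0] * num_days
    for doc_date, text in dated_docs:
        sig[(doc_date - start).days] += text.count(phrase)
    total = sum(sig)
    return [c / total for c in sig] if total else sig

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v)) if any(u) and any(v) else 0.0

start = date(2004, 12, 25)
en_docs = [(date(2004, 12, 26), "asian tsunami strikes"),
           (date(2004, 12, 27), "asian tsunami aftermath")]
es_docs = [(date(2004, 12, 26), "maremoto asiatico golpea"),
           (date(2004, 12, 28), "otra noticia")]
print(cosine(temporal_signature(en_docs, "asian tsunami", start, 7),
             temporal_signature(es_docs, "maremoto asiatico", start, 7)))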

3.2 Lexical similarity features

In addition to the three phrase similarity features used in our model – c(f, e), t(f, e) and w(f, e) – we include four additional lexical similarity features for each phrase pair. The first three lexical features c_lex(f, e), t_lex(f, e) and w_lex(f, e) are the lexical equivalents of the phrase-level contextual, temporal and Wikipedia topic similarity scores. They score the similarity of individual words within the phrases. To compute these lexical similarity features, we average similarity scores over all possible word alignments across the two phrases. Because individual words are more frequent than multiword phrases, the accuracy of c_lex, t_lex, and w_lex tends to be higher than that of their phrasal equivalents (this is similar to the effect observed in Figure 2).
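
One simple reading of this averaging, with the word-level similarity function passed in as a parameter (the stand-in below is invented), is:

# Average a word-level similarity over all cross-phrase word pairs,
# i.e. over all possible one-to-one alignment links.
def lexical_similarity(f_words, e_words, sim):
    scores = [sim(f, e) for f in f_words for e in e_words]
    return sum(scores) / len(scores) if scores else 0.0

# Stand-in for a word-level contextual, temporal, or topic similarity.
toy_sim = lambda f, e: 1.0 if (f, e) in {("para", "to"), ("crecer", "expand")} else 0.0
print(lexical_similarity(["para", "crecer"], ["to", "expand"], toy_sim))  # 0.5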

Orthographic / phonetic similarity. The final lexical similarity feature that we incorporate is o(f, e), which measures the orthographic similarity between words in a phrase pair. Etymologically related words often retain similar spelling across languages with the same writing system, and low string edit distance sometimes signals translation equivalency. Berg-Kirkpatrick and Klein (2011) present methods for learning correspondences between the alphabets of two languages. We can also extend this idea to language pairs not sharing the same writing system since many cognates, borrowed words, and names remain phonetically similar. Transliterations can be generated for tokens in a source phrase (Knight and Graehl, 1997), with o(f, e) calculating phonetic similarity rather than orthographic.
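
As one plausible instantiation of the orthographic variant of o(f, e), the sketch below uses string edit distance normalized by word length (the normalization choice is ours, added so that scores are comparable across words of different lengths):

# Standard Levenshtein distance via dynamic programming.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

# Similarity in [0, 1]: 1 minus length-normalized edit distance.
def orthographic_similarity(f_word, e_word):
    denom = max(len(f_word), len(e_word))
    return 1.0 - edit_distance(f_word, e_word) / denom if denom else 1.0

print(orthographic_similarity("terrorista", "terrorist"))  # cognates score high (0.9)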

The three phrasal and four lexical similarity scores are incorporated into the log linear translation model as feature functions, replacing the bilingually estimated phrase translation probabilities φ and lexical weighting probabilities w. Our seven similarity scores are not the only ones that could be incorporated into the translation model. Various other similarity scores can be computed depending on the available monolingual data and its associated metadata (see, e.g., Schafer and Yarowsky (2002)).

3.3 Reordering

The remaining component of the phrase-based SMT model is the reordering model. We introduce a novel algorithm for estimating p_o(orientation|f, e) from two monolingual corpora instead of a bitext.

Input: source and target phrases f and e,
       source and target monolingual corpora C_f and C_e,
       phrase table pairs T = {(f^(i), e^(i))}, i = 1, ..., N.
Output: orientation features (p_m, p_s, p_d).

S_f ← sentences containing f in C_f
S_e ← sentences containing e in C_e
(B_f, −, −) ← CollectOccurs(f, ∪_i f^(i), S_f)
(B_e, A_e, D_e) ← CollectOccurs(e, ∪_i e^(i), S_e)
c_m = c_s = c_d = 0
foreach unique f′ in B_f do
    foreach translation e′ of f′ in T do
        c_m = c_m + #_{B_e}(e′)
        c_s = c_s + #_{A_e}(e′)
        c_d = c_d + #_{D_e}(e′)
c ← c_m + c_s + c_d
return (c_m/c, c_s/c, c_d/c)

CollectOccurs(r, R, S):
    B ← (); A ← (); D ← ()
    foreach sentence s ∈ S do
        foreach occurrence of phrase r in s do
            B ← B + (longest phrase preceding r and in R)
            A ← A + (longest phrase following r and in R)
            D ← D + (longest phrase discontinuous with r and in R)
    return (B, A, D)

Figure 5: Algorithm for estimating reordering probabilities from monolingual data.

Figure 1 illustrates how the phrase pair orientation statistics are estimated in the standard phrase-based SMT pipeline. For a phrase pair like (f = "Profils", e = "profile"), we count its orientation with the previously translated phrase pair (f′ = "in Facebook", e′ = "Facebook") across all translated sentence pairs in the bitext.

In our pipeline we do not have translated sentence pairs. Instead, we look for monolingual sentences in the source corpus which contain the source phrase that we are interested in, like f = "Profils", and at least one other phrase that we have a translation for, like f′ = "in Facebook". We then look for all target language sentences in the target monolingual corpus that contain the translation of f (here e = "profile") and any translation of f′. Figure 6 illustrates that it is possible to find evidence for p_o(swapped|Profils, profile), even from the non-parallel, non-translated sentences drawn from two independent monolingual corpora. By looking for foreign sentences containing pairs of adjacent foreign phrases (f, f′) and English sentences containing their corresponding translations (e, e′), we are able to increment orientation counts for (f, e) by looking at whether e and e′ are adjacent, swapped, or discontinuous. The orientations correspond directly to those shown in Figure 1.

Figure 6: Collecting phrase orientation statistics for an English-German phrase pair ("profile", "Profils") from non-parallel sentences (the German sentence, "Das Anlegen eines Profils in Facebook ist einfach", translates as "Creating a Facebook profile is easy"; the English sentence is "What does your Facebook profile reveal").

One subtlety of our method is that shorter and more frequent phrases (e.g. punctuation) are more likely to appear in multiple orientations with a given phrase, and therefore provide poor evidence of reordering. Therefore, we (a) collect the longest contextual phrases (which also appear in the phrase table) for reordering feature estimation, and (b) prune the set of sentences so that we only keep a small set of the least frequent contextual phrases (this has the effect of dropping many function words and punctuation marks and relying more heavily on multi-word content phrases to estimate the reordering).2
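
One way to realize this pruning is to rank a phrase's contextual phrases by their corpus frequency and keep only the rarest few; the counts and cutoff below are invented.

# Keep the `keep` least frequent contextual phrases; frequent items
# (function words, punctuation) appear in many orientations and give
# weak reordering evidence.
def prune_contexts(context_phrases, phrase_freq, keep=2):
    return sorted(context_phrases, key=lambda p: phrase_freq[p])[:keep]

freq = {("the",): 95000, (",",): 120000, ("economic", "growth"): 310}
print(prune_contexts([("the",), (",",), ("economic", "growth")], freq, keep=1))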

Our algorithm for learning the reordering parameters is given in Figure 5. The algorithm estimates a probability distribution over monotone, swap, and discontinuous orientations (p_m, p_s, p_d) for a phrase pair (f, e) from two monolingual corpora C_f and C_e. It begins by calling CollectOccurs to collect the longest matching phrase table phrases that precede f in the source monolingual data (B_f), as well as those that precede (B_e), follow (A_e), and are discontinuous (D_e) with e in the target language data. For each unique phrase f′ preceding f, we look up translations in the phrase table T. Next, we count3 how many translations e′ of f′ appeared before, after, or were discontinuous with e in the target language data. Finally, the counts are normalized and returned. These normalized counts are the values we use as estimates of p_o(orientation|f, e).

2 The pruning step has an additional benefit of minimizing the memory needed for orientation feature estimations.

3 #_L(x) returns the count of object x in list L.

Monolingual training corpora:

                     Europarl     Gigaword         Wikipedia
date range           4/96-10/09   5/94-12/08       n/a
uniq shared dates    829          5,249            n/a
Spanish articles     n/a          3,727,954        59,463
English articles     n/a          4,862,876        59,463
Spanish lines        1,307,339    22,862,835       2,598,269
English lines        1,307,339    67,341,030       3,630,041
Spanish words        28,248,930   774,813,847      39,738,084
English words        27,335,006   1,827,065,374    61,656,646

Spanish-English phrase table:

Phrase pairs         3,093,228
Spanish phrases      89,386
English phrases      926,138
Spanish unigrams     13,216    (avg # translations 98.7)
Spanish bigrams      41,426    (avg # translations 31.9)
Spanish trigrams     34,744    (avg # translations 13.5)

Table 1: Statistics about the monolingual training data and the phrase table that was used in all of the experiments.
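
For concreteness, here is a compact Python rendering of the algorithm in Figure 5. It is a sketch under simplifying assumptions: corpora are lists of tokenized sentences, phrases are tuples of tokens, the phrase table maps each source phrase to a set of translations, and CollectOccurs considers only inventory phrases immediately adjacent to an occurrence plus a crude notion of discontinuity; it also omits the pruning of frequent contextual phrases described above.

from collections import Counter

def collect_occurs(phrase, inventory, sentences, max_len=3):
    # For each occurrence of `phrase`, record the longest inventory phrase
    # immediately preceding it, immediately following it, and (crudely)
    # any non-adjacent inventory phrase in the same sentence.
    before, after, disc = Counter(), Counter(), Counter()
    k = len(phrase)
    for sent in sentences:
        for i in range(len(sent) - k + 1):
            if tuple(sent[i:i + k]) != phrase:
                continue
            for n in range(max_len, 0, -1):  # longest match first
                if i - n >= 0 and tuple(sent[i - n:i]) in inventory:
                    before[tuple(sent[i - n:i])] += 1
                    break
            for n in range(max_len, 0, -1):
                if i + k + n <= len(sent) and tuple(sent[i + k:i + k + n]) in inventory:
                    after[tuple(sent[i + k:i + k + n])] += 1
                    break
            for j in range(len(sent)):
                for n in range(max_len, 0, -1):
                    cand = tuple(sent[j:j + n])
                    if len(cand) == n and cand in inventory and not (i - n <= j <= i + k):
                        disc[cand] += 1
                        break
    return before, after, disc

def orientation_features(f, e, corpus_f, corpus_e, table):
    # Estimate (p_monotone, p_swap, p_discontinuous) for the pair (f, e).
    src_inventory = set(table)
    tgt_inventory = {t for ts in table.values() for t in ts}
    b_f, _, _ = collect_occurs(f, src_inventory, corpus_f)
    b_e, a_e, d_e = collect_occurs(e, tgt_inventory, corpus_e)
    cm = cs = cd = 0
    for f_prime in b_f:                      # phrases preceding f
        for e_prime in table.get(f_prime, ()):
            cm += b_e[e_prime]               # e' precedes e -> monotone
            cs += a_e[e_prime]               # e' follows e  -> swapped
            cd += d_e[e_prime]               # e' elsewhere  -> discontinuous
    c = cm + cs + cd
    return (cm / c, cs / c, cd / c) if c else (1 / 3, 1 / 3, 1 / 3)

# Toy usage modeled on Figure 6: "Profils" precedes "in Facebook" in the
# German sentence while "profile" follows "Facebook" in the English one,
# so the pair ("in facebook", "facebook") receives a swapped orientation.
table = {("profils",): {("profile",)}, ("in", "facebook"): {("facebook",)}}
de = [["das", "anlegen", "eines", "profils", "in", "facebook", "ist", "einfach"]]
en = [["what", "does", "your", "facebook", "profile", "reveal"]]
print(orientation_features(("in", "facebook"), ("facebook",), de, en, table))  # (0.0, 1.0, 0.0)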

4 Experimental Setup

We use the Spanish-English language pair to test our method for estimating the parameters of an SMT system from monolingual corpora. This allows us to compare our method against the normal bilingual training procedure. We expect bilingual training to result in higher translation quality because it is a more direct method for learning translation probabilities. We systematically remove different parameters from the standard phrase-based model, and then replace them with our monolingual equivalents. Our goal is to recover as much of the loss as possible for each of the deleted bilingual components.

The standard phrase-based model that we use as our top-line is the Moses system (Koehn et al., 2007) trained over the full Europarl v5 parallel corpus (Koehn, 2005). With the exception of maximum phrase length (set to 3 in our experiments), we used default values for all of the parameters. All experiments use a trigram language model trained on the English side of the Europarl corpus using SRILM with Kneser-Ney smoothing. To tune feature weights in minimum error rate training, we use a development bitext of 2,553 sentence pairs, and we evaluate performance on a test set of 2,525 single-reference translated newswire articles. These development and test datasets were distributed in the WMT shared task (Callison-Burch et al., 2010).4 MERT was re-run for every experiment.

4 Specifically, news-test2008 plus news-syscomb2009 for dev and newstest2009 for test.

We estimate the parameters of our model from two sets of monolingual data, detailed in Table 1:

• First, we treat the two sides of the Europarl parallel corpus as independent, monolingual corpora. Haghighi et al. (2008) also used this method to show how well translations could be learned from monolingual corpora under ideal conditions, where the contextual and temporal distribution of words in the two monolingual corpora are nearly identical.

• Next, we estimate the features from truly monolingual corpora. To estimate the contextual and temporal similarity features, we use the Spanish and English Gigaword corpora.5 These corpora are substantially larger than the Europarl corpora, providing 27x as much Spanish and 67x as much English for contextual similarity, and 6x as many paired dates for temporal similarity. Topical similarity is estimated using Spanish and English Wikipedia articles that are paired with interlanguage links.

To project context vectors from Spanish to English, we use a bilingual dictionary containing entries for 49,795 Spanish words. Note that end-to-end translation quality is robust to substantially reducing dictionary size, but we omit these experiments due to space constraints. The context vectors for words and phrases incorporate co-occurrence counts using a two-word window on either side.

The title of our paper uses the word toward because we assume that an inventory of phrase pairs is given. Future work will explore inducing the phrase table itself from monolingual texts.

5 We use the afp, apw and xin sections of the corpora.

Exp    Phrase scores / orientation scores           Europarl   Truly monolingual
1      B/B   bilingual / bilingual (Moses)          21.87      n/a
2      B/-   bilingual / distortion                 21.54      n/a
3      -/B   none / bilingual                       12.86      n/a
4      -/-   none / distortion                       4.00      n/a
5, 12  -/M   none / mono                            10.52      10.15
6, 13  t/-   temporal mono / distortion             15.35      13.13
7, 14  o/-   orthographic mono / distortion         14.02      14.02
8, 15  c/-   contextual mono / distortion           14.78      14.07
16     w/-   Wikipedia topical mono / distortion    n/a        17.00
9, 17  M/-   all mono / distortion                  16.85      17.92
10, 18 M/M   all mono / mono                        17.50      18.79
11, 19 BM/B  bilingual + all mono / bilingual       22.92      23.36

BLEU scores for experiments 1-11 (features estimated using Europarl) and experiments 12-19 (features estimated using truly monolingual corpora).

Figure 7: Much of the loss in BLEU score when bilingually estimated features are removed from a Spanish-English translation system (experiments 1-4) can be recovered when they are replaced with monolingual equivalents estimated from monolingual Europarl data (experiments 5-10). The labels indicate how the different types of parameters are estimated: the first part is for phrase-table features, the second for reordering probabilities.


Figure 8: Performance of monolingual features de-rived from truly monolingual corpora. Over 82% ofthe BLEU score loss can be recovered.

Across all of our experiments, we use the phrase table that the bilingual model learned from the Europarl parallel corpus. We keep its phrase pairs, but we drop all of its scores. Table 1 gives details of the phrase pairs. In our experiments, we estimated similarity and reordering scores for more than 3 million phrase pairs. For each source phrase, the set of possible translations was constrained and likely to contain good translations. However, the average number of possible translations was high (ranging from nearly 100 translations for each unigram to 14 for each trigram). These contain a lot of noise and result in low end-to-end translation quality without good estimates of translation quality, as the experiments in Section 5.1 show.

Software. Because many details of our estimation procedures must be omitted for space, we distribute our full set of code along with scripts for running our experiments and output translations. These may be downloaded from http://www.cs.jhu.edu/~anni/papers/lowresmt/

5 Experimental Results

Figures 7 and 8 give experimental results. Figure 7 shows the performance of the standard phrase-based model when each of the bilingually estimated features is removed. It shows how much of the performance loss can be recovered using our monolingual features when they are estimated from the Europarl training corpus, treating each side as an independent, monolingual corpus. Figure 8 shows the recovery when using truly monolingual corpora to estimate the parameters.

5.1 Lesion experiments

Experiments 1-4 remove bilingually estimated parameters from the standard model. For Spanish-English, the relative contribution of the phrase-table features (which include the phrase translation probabilities φ and the lexical weights w) is greater than that of the reordering probabilities. When the reordering probability p_o(orientation|f, e) is eliminated and replaced with a simple distance-based distortion feature that does not require a bitext to estimate, the score dips only marginally, since word order in English and Spanish is similar. However, when both the reordering and the phrase table features are dropped, leaving only the LM feature and the phrase penalty, the resulting translation quality is abysmal, with the score dropping a total of over 17 BLEU points.

5.2 Adding equivalent monolingual features estimated using Europarl

Experiments 5-10 show how much our monolingual equivalents could recover when the monolingual corpora are drawn from the two sides of the bitext. For instance, our algorithm for estimating reordering probabilities from monolingual data (-/M) adds 6.5 BLEU points, which is 73% of the potential recovery going from the model (-/-) to the model with bilingual reordering features (-/B).

Of the temporal, orthographic, and contextual monolingual features, the temporal feature performs the best. Together (M/-), they recover more than each does individually. Combining monolingually estimated reordering and phrase table features (M/M) yields a total gain of 13.5 BLEU points, or over 75% of the BLEU score loss that occurred when we dropped all features from the phrase table. However, these results use "monolingual" corpora which have practically identical phrasal and temporal distributions.

5.3 Estimating features using truly monolingual corpora

Experiments 12-18 estimate all of the features from truly monolingual corpora. Our novel algorithm for estimating reordering holds up well and recovers 69% of the loss, only 0.4 BLEU points less than when estimated from the Europarl monolingual texts. The temporal similarity feature does not perform as well as when it was estimated using Europarl data, but the contextual feature does. The topic similarity using Wikipedia performs the strongest of the individual features.

Combining the monolingually estimated reordering features with the monolingually estimated similarity features (M/M) yields a total gain of 14.8 BLEU points, or over 82% of the BLEU point loss that occurred when we dropped all features from the phrase table. This is equivalent to training the standard system on a bitext with roughly 60,000 lines or nearly 2 million words (learning curve omitted for space).

Finally, we supplement the standard bilingually estimated model parameters with our monolingual features (BM/B), and we see a 1.5 BLEU point increase over the standard model. Therefore, our monolingually estimated scores capture some novel information not contained in the standard feature set.

6 Additional Related Work

Carbonell et al. (2006) described a data-driven MT system that used no parallel text. It produced translation lattices using a bilingual dictionary and scored them using an n-gram language model. Their method has no notion of translation similarity aside from a bilingual dictionary. Similarly, Sanchez-Cartagena et al. (2011) supplement an SMT phrase table with translation pairs extracted from a bilingual dictionary and give each a frequency of one for computing translation scores.

Ravi and Knight (2011) treat MT without parallel training data as a decipherment task and learn a translation model from monolingual text. They translate corpora of Spanish time expressions and subtitles, which both have a limited vocabulary, into English. Their method has not been applied to broader domains of text.

Most work on learning translations from monolingual texts only examines small numbers of frequent words. Huang et al. (2005) and Daume and Jagarlamudi (2011) are exceptions that improve MT by mining translations for OOV items.

A variety of past research has focused on mining parallel or comparable corpora from the web (Munteanu and Marcu, 2006; Smith et al., 2010; Uszkoreit et al., 2010). Others use an existing SMT system to discover parallel sentences within independent monolingual texts, and use them to re-train and enhance the system (Schwenk, 2008; Chen et al., 2008; Schwenk and Senellart, 2009; Rauf and Schwenk, 2009; Lambert et al., 2011). These approaches are complementary but orthogonal to our research goals.

7 Conclusion

This paper has demonstrated a novel set of techniques for successfully estimating phrase-based SMT parameters from monolingual corpora, potentially circumventing the need for large bitexts, which are expensive to obtain for new languages and domains. We evaluated the performance of our algorithms in a full end-to-end translation system. Assuming that a bilingual-corpus-derived phrase table is available, we were able to utilize our monolingually estimated features to recover over 82% of the BLEU loss that resulted from removing the bilingual-corpus-derived phrase-table probabilities. We also showed that our monolingual features add 1.5 BLEU points when combined with standard bilingually estimated features. Thus our techniques have stand-alone efficacy when large bilingual corpora are not available, and they also make a significant contribution to combined ensemble performance when such corpora are.


References

Enrique Alfonseca, Massimiliano Ciaramita, and Keith Hall. 2009. Gazpacho and summer rash: lexical relationships from temporal patterns of web search queries. In Proceedings of EMNLP.

Taylor Berg-Kirkpatrick and Dan Klein. 2011. Simple effective decipherment via combinatorial optimization. In Proceedings of EMNLP-2011, Edinburgh, Scotland, UK.

Shane Bergsma and Benjamin Van Durme. 2011. Learning bilingual lexicons using the visual similarity of labeled web images. In Proceedings of the International Joint Conference on Artificial Intelligence.

Peter Brown, John Cocke, Stephen Della Pietra, Vincent Della Pietra, Frederick Jelinek, Robert Mercer, and Paul Roossin. 1988. A statistical approach to language translation. In 12th International Conference on Computational Linguistics (CoLing-1988).

Peter Brown, Stephen Della Pietra, Vincent Della Pietra, and Robert Mercer. 1993. The mathematics of machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Workshop on Statistical Machine Translation.

Jaime Carbonell, Steve Klein, David Miller, Michael Steinbaum, Tomer Grassiany, and Jochen Frey. 2006. Context-based machine translation. In Proceedings of AMTA.

Boxing Chen, Min Zhang, Aiti Aw, and Haizhou Li. 2008. Exploiting n-best hypotheses for SMT self-enhancement. In Proceedings of ACL/HLT, pages 157-160.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of ACL.

Hal Daume and Jagadeesh Jagarlamudi. 2011. Domain adaptation for machine translation by mining unseen words. In Proceedings of ACL/HLT.

Pascale Fung and Lo Yuen Yee. 1998. An IR approach for translating new words from nonparallel, comparable texts. In Proceedings of ACL/CoLing.

Nikesh Garera, Chris Callison-Burch, and David Yarowsky. 2009. Improving translation lexicon induction from monolingual corpora via dependency contexts and part-of-speech equivalences. In Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), Boulder, Colorado.

Ulrich Germann. 2001. Building a statistical machine translation system from scratch: How much bang for the buck can we expect? In ACL 2001 Workshop on Data-Driven Machine Translation, Toulouse, France.

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of ACL/HLT.

Fei Huang, Ying Zhang, and Stephan Vogel. 2005. Mining key phrase translations from web corpora. In Proceedings of EMNLP.

Alexandre Klementiev and Dan Roth. 2006. Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. In Proceedings of ACL/CoLing.

Kevin Knight and Jonathan Graehl. 1997. Machine transliteration. In Proceedings of ACL.

Philipp Koehn and Kevin Knight. 2002. Learning a translation lexicon from monolingual corpora. In ACL Workshop on Unsupervised Lexical Acquisition.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of HLT/NAACL.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the ACL-2007 Demo and Poster Sessions.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Machine Translation Summit.

Shankar Kumar and William Byrne. 2004. Local phrase reordering models for statistical machine translation. In Proceedings of HLT/NAACL.

Patrik Lambert, Holger Schwenk, Christophe Servan, and Sadaf Abdul-Rauf. 2011. Investigations on translation model adaptation using monolingual data. In Proceedings of the Workshop on Statistical Machine Translation, pages 284-293, Edinburgh, Scotland, UK.

David Mimno, Hanna Wallach, Jason Naradowsky, David Smith, and Andrew McCallum. 2009. Polylingual topic models. In Proceedings of EMNLP.

Dragos Stefan Munteanu and Daniel Marcu. 2006. Extracting parallel sub-sentential fragments from non-parallel corpora. In Proceedings of ACL/CoLing.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417-449.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. Ph.D. thesis, RWTH Aachen.

Franz Josef Och. 2003. Minimum error rate training for statistical machine translation. In Proceedings of ACL.

Reinhard Rapp. 1995. Identifying word translations in non-parallel texts. In Proceedings of ACL.

Reinhard Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of ACL.

Sadaf Abdul Rauf and Holger Schwenk. 2009. On the use of comparable corpora to improve SMT performance. In Proceedings of EACL.

Sujith Ravi and Kevin Knight. 2011. Deciphering foreign language. In Proceedings of ACL/HLT.

Víctor M. Sánchez-Cartagena, Felipe Sánchez-Martínez, and Juan Antonio Pérez-Ortiz. 2011. Integrating shallow-transfer rules into phrase-based statistical machine translation. In Proceedings of the XIII Machine Translation Summit.

Charles Schafer and David Yarowsky. 2002. Inducing translation lexicons via diverse similarity measures and bridge languages. In Proceedings of CoNLL.

Holger Schwenk and Jean Senellart. 2009. Translation model adaptation for an Arabic/French news translation system by lightly-supervised training. In MT Summit.

Holger Schwenk. 2008. Investigations on large-scale lightly-supervised training for statistical machine translation. In Proceedings of IWSLT.

Jason R. Smith, Chris Quirk, and Kristina Toutanova. 2010. Extracting parallel sentences from comparable corpora using document level alignment. In Proceedings of HLT/NAACL.

Christoph Tillmann. 2004. A unigram orientation model for statistical machine translation. In Proceedings of HLT/NAACL.

Christoph Tillmann. 2003. A projection extension algorithm for statistical machine translation. In Proceedings of EMNLP.

Jakob Uszkoreit, Jay M. Ponte, Ashok C. Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Proceedings of CoLing.

Ashish Venugopal, Stephan Vogel, and Alex Waibel. 2003. Effective phrase translation extraction from alignment models. In Proceedings of ACL.