Alvin Grissom II, He He, Jordan Boyd-Graber, John Morgan ...users.umiacs.umd.edu/~jbg/docs/2014_emnlp_simtrans.pdf · (sov ) language (e.g., German or Japanese) to a verb-medial (

Alvin Grissom II, He He, Jordan Boyd-Graber, John Morgan, and Hal Daumé III. Don’t Until the FinalVerb Wait: Reinforcement Learning for Simultaneous Machine Translation. Empirical Methods in NaturalLanguage Processing, 2014, 11 pages.

@inproceedings{Grissom-II:He:Boyd-Graber:Morgan:Daume-III-2014,Title = {Don’t Until the Final Verb Wait: Reinforcement Learning for Simultaneous Machine Translation},Author = {Alvin {Grissom II} and He He and Jordan Boyd-Graber and John Morgan and Hal {Daum\’{e} III}},Booktitle = {Empirical Methods in Natural Language Processing},Year = {2014},Location = {Doha, Qatar},Url = {http://umiacs.umd.edu/~jbg//docs/2014_emnlp_simtrans.pdf},}

Links:

• Talk [http://youtu.be/hVoxXO3F468]

Downloaded from http://umiacs.umd.edu/~jbg/docs/2014_emnlp_simtrans.pdf

Contact Jordan Boyd-Graber ([email protected]) for questions about this paper.

1

https://www.ursinus.edu/live/profiles/3125-alvin-grissom-iihttps://hhexiy.github.io/http://umiacs.umd.edu/~jbg//docs/2014_emnlp_simtrans.pdfhttp://umiacs.umd.edu/~jbg//docs/2014_emnlp_simtrans.pdfhttp://youtu.be/hVoxXO3F468http://youtu.be/hVoxXO3F468http://umiacs.umd.edu/~jbg/docs/2014_emnlp_simtrans.pdf

Don’t Until the Final Verb Wait:Reinforcement Learning for Simultaneous Machine Translation

Alvin C. Grissom IIand Jordan Boyd-Graber

Computer ScienceUniversity of Colorado

Boulder, [email protected]

[email protected]

He He, John Morgan,and Hal Daumé III

Computer Science and UMIACSUniversity of Maryland

College Park, MD{hhe,jjm,hal}@cs.umd.edu

Abstract

We introduce a reinforcement learning-based approach to simultaneous ma-chine translation—producing a trans-lation while receiving input words—between languages with drastically dif-ferent word orders: from verb-final lan-guages (e.g., German) to verb-mediallanguages (English). In traditional ma-chine translation, a translator must“wait” for source material to appear be-fore translation begins. We remove thisbottleneck by predicting the final verbin advance. We use reinforcement learn-ing to learn when to trust predictionsabout unseen, future portions of thesentence. We also introduce an evalua-tion metric to measure expeditiousnessand quality. We show that our newtranslation model outperforms batchand monotone translation strategies.

1 Introduction

We introduce a simultaneous machine transla-tion (MT) system that predicts unseen verbsand uses reinforcement learning to learn whento trust these predictions and when to wait formore input.

Simultaneous translation is producing a par-tial translation of a sentence before the inputsentence is complete, and is often used in im-portant diplomatic settings. One of the firstnoted uses of human simultaneous interpreta-tion was the Nuremberg trials after the Sec-ond World War. Siegfried Ramler (2009), theAustrian-American who organized the transla-tion teams, describes the linguistic predictions

and circumlocutions that translators would useto achieve a tradeoff between translation la-tency and accuracy. The audio recording tech-nology used by those interpreters sowed theseeds of technology-assisted interpretation atthe United Nations (Gaiba, 1998).

Performing real-time translation is especiallydifficult when information that comes early inthe target language (the language you’re trans-lating to) comes late in the source language (thelanguage you’re translating from). A commonexample is when translating from a verb-final(sov) language (e.g., German or Japanese) toa verb-medial (svo) language, (e.g., English).In the example in Figure 1, for instance, themain verb of the sentence (in bold) appearsat the end of the German sentence. An of-fline (or “batch”) translation system waits untilthe end of the sentence before translating any-thing. While this is a reasonable approach, ithas obvious limitations. Real-time, interactivescenarios—such as online multilingual videoconferences or diplomatic meetings—requirecomprehensible partial interpretations beforea sentence ends. Thus, a significant goal ininterpretation is to make the translation asexpeditious as possible.

We present three components for an sov-to-svo simultaneous mt system: a reinforcementlearning framework that uses predictions tocreate expeditious translations (Section 2), asystem to predict how a sentence will end (e.g.,predicting the main verb; Section 4), and a met-ric that balances quality and expeditiousness(Section 3). We combine these components ina framework that learns when to begin trans-lating sections of a sentence (Section 5).

Section 6 combines this framework with a

ich bin mit dem Zug nach Ulm gefahrenI am with the train to Ulm traveledI (. . . . . . waiting. . . . . . ) traveled by train to Ulm

Figure 1: An example of translating from averb-final language to English. The verb, inbold, appears at the end of the sentence, pre-venting coherent translations until the finalsource word is revealed.

translation system that produces simultaneoustranslations. We show that our data-drivensystem can successfully predict unseen partsof the sentence, learn when to trust them, andoutperform strong baselines (Section 7).

While some prior research has approachedthe problem of simultaneous translation—we re-view these systems in more detail in Section 8—no current model learns when to definitivelybegin translating chunks of an incomplete sen-tence. Finally, in Section 9, we discuss thelimitations of our system: it only uses the mostfrequent source language verbs, it only appliesto sentences with a single main verb, and ituses an idealized translation system. However,these limitations are not insurmountable; wedescribe how a more robust system can be as-sembled from these components.

2 Decision Process forSimultaneous Translation

Human interpreters learn strategies for theirprofession with experience and practice. Aswords in the source language are observed, atranslator—human or machine—must decidewhether and how to translate, while, for cer-tain language pairs, simultaneously predictingfuture words. We would like our system to dothe same. To this end, we model simultaneousmt as a Markov decision process (mdp) anduse reinforcement learning to effectively com-bine predicting, waiting, and translating intoa coherent strategy.

2.1 States: What is, what is to come

The state st represents the current view ofthe world given that we have seen t words ofa source language sentence.1 The state con-tains information both about what is knownand what is predicted based on what is known.

1We use t to evoke a discrete version of time. Weonly allow actions after observing a complete sourceword.

To compare the system to a human transla-tor in a decision-making process, the state isakin to the translator’s cognitive state. At anygiven time, we have knowledge (observations)and beliefs (predictions) with varying degreesof certainty: that is, the state contains the re-vealed words x1:t of a sentence; the state alsocontains predictions about the remainder ofthe sentence: we predict the next word in thesentence and the final verb.

More formally, we have a prediction at timet of the next source language word that will

appear, n(t)t+1, and for the final verb, v

(t). Forexample, given the partial observation “ichbin mit dem”, the state might contain a pre-

diction that the next word, n(t)t+1, will be “Zug”

and that the final verb v(t) will be “gefahren”.

We discuss the mechanics of next-word andverb prediction further in Section 4; for now,consider these black boxes which, after observ-ing every new source word xt, make predictionsof future words in the source language. Thisrepresentation of the state allows for a richer setof actions, described below, permitting simul-taneous translations that outpace the sourcelanguage input2 by predicting the future.

2.2 Actions: What our system can do

Given observed and hypothesized input, oursimultaneous translation system must decidewhen to translate them. This is expressedin the form of four actions: our system cancommit to a partial translation, predict thenext word and use it to update the transla-tion, predict the verb and use it to update thetranslation, or wait for more words.

We discuss each of these actions in turn be-fore describing how they come together to in-crementally translate an entire sentence:

Wait Waiting is the simplest action. It pro-duces no output and allows the system to re-ceive more input, biding its time, so that whenit does choose to translate, the translation isbased on more information.

Commit Committing produces translationoutput: given the observed source sentence,produce the best translation possible.

2Throughout, “input” refers to source language in-put, and “output” refers to target language translation.

STOP

Com

mit

Observation Observation (prediction) Observation Observation

Observation

1. Mit dem Zug 2. Mit dem Zug binich

3. Mit dem Zug binich nach

4. Mit dem Zug bin ich nach … gefahren ...

Observation (prediction) 5. Mit dem Zug bin ich

nach Ulm… gefahren ...

6. Mit dem Zug bin ichnach Ulm gefahren.

Output: I traveledby train

Output: I traveledby trainto UlmOutput: I traveled

by trainto Ulm.

Swith the

trainI am

with the train

by train

to Ulm

Sby train

Wait Wait Predict

S I traveledby train

with the train

to

Com

mit

Wait

Fixed output

Commit

stat

e

Figure 2: A simultaneous translation from source (German) to target (English). The agentchooses to wait until after (3). At this point, it is sufficiently confident to predict the final verbof the sentence (4). Given this additional information, it can now begin translating the sentenceinto English, constraining future translations (5). As the rest of the sentence is revealed, thesystem can translate the remainder of the sentence.

Next Word The next word action takesa prediction of the next source word and pro-duces an updated translation based on thatprediction, i.e., appending the predicted wordto the source sentence and translating the newsentence.

Verb Our system can also predict the sourcesentence’s final verb (the last word in the sen-tence). When our system takes the verb ac-tion, it uses its verb prediction to update thetranslation using the prediction, by placing itat the end of the source sentence.

We can recreate a traditional batch trans-lation system (interpreted temporally) by asequence of wait actions until all input is ob-served, followed by a commit to the completetranslation. Our system can commit to par-tial translations if it is confident, but producinga good translation early in the sentence oftendepends on missing information.

2.3 Translation Process

Having described the state, its components,and the possible actions at a state, we presentthe process in its entirety. In Figure 2, aftereach German word is received, the system ar-rives at a new state, which consists of the sourceinput, target translation so far, and predictionsof the unseen words. The translation system

must then take an action given informationabout the current state. The action will resultin receiving and translating more source words,transitioning the system to the next state. Inthe example, for the first few source-languagewords, the translator lacks the confidence toproduce any output due to insufficient informa-tion at the state. However, after State 3, thestate shows high confidence in the predictedverb “gefahren”. Combined with the Germaninput it has observed, the system is sufficientlyconfident to act on that prediction to produceEnglish translation.

2.4 Consensus Translations

Three straightforward actions—commit, nextword, and verb—all produce translations.These rely black box access to a translation(discussed in detail in Section 6): that is, givena source language sentence fragment, the trans-lation system produces a target language sen-tence fragment.

Because these actions can happen more thanonce in a sentence, we must form a single con-sensus translation from all of the translationsthat we might have seen. If we have only onetranslation or if translations are identical, form-ing the consensus translation is trivial. Buthow should we resolve conflicting translations?

Any time our system chooses an action that

produces output, the observed input (plus extrapredictions in the case of next-word or verb),is passed into the translation system. Thatsystem then produces a complete translationof its input fragment.

Any new words—i.e., words whose targetindex is greater than the length of any previ-ous translation—are appended to the previoustranslation.3 Table 1 shows an example offorming these consensus translations.

Now that we have defined how states evolvebased on our system’s actions, we need to knowhow to select which actions to take. Eventu-ally, we will formalize this as a learned policy(Section 5) that maps from states to actions.First, however, we need to define a reward thatmeasures how “good” an action is.

3 Objective: What is a goodsimultaneous translation?

Good simultaneous translations must optimizetwo objectives that are often at odds, i.e., pro-ducing translations that are, in the end, accu-rate, and producing them in pieces that arepresented expeditiously. While there are exist-ing automated metrics for assessing translationquality (Papineni et al., 2002; Banerjee andLavie, 2005; Snover et al., 2006), these mustbe modified to find the necessary compromisebetween translation quality and expeditious-ness. That is, a good metric for simultaneoustranslation must achieve a balance betweentranslating chunks early and translating accu-rately. All else being equal, maximizing eithergoal in isolation is trivial: for accurate transla-tions, use a batch system and wait until thesentence is complete, translating it all at once;for a maximally expeditious translation, cre-ate monotone translations, translating eachword as it appears, as in Tillmann et al. (1997)and Pytlik and Yarowsky (2006). The formeris not simultaneous at all; the latter is mereword-for-word replacement and results in awk-ward, often unintelligible translations of distantlanguage pairs.

Once we have predictions, we have an ex-panded array of possibilities, however. On oneextreme, we can imagine a psychic translator—

3Using constrained decoding to enforce consistenttranslation prefixes would complicate our method butis an appealing extension.

one that can completely translate an imaginedsentence after one word is uttered—as an un-obtainable system. On the other extreme is astandard batch translator, which waits untilit has access to the utterer’s complete sentencebefore translating anything.

Again, we argue that a system can improveon this by predicting unseen parts of the sen-tence to find a better tradeoff between theseconflicting goals. However, to evaluate and op-timize such a system, we must measure wherea system falls on the continuum of accuracyversus expeditiousness.

Consider partial translations in a two-dimensional space, with time (quantized bythe number of source words seen) increasingfrom left to right on the x axis and the bleuscore (including brevity penalty against thereference length) on the y axis. At each pointin time, the system may add to the consensustranslation, changing the precision (Figure 3).Like an roc curve, a good system will be highand to the left, optimizing the area under thecurve: the ideal system would produce pointsas high as possible immediately. A translationwhich is, in the end, accurate, but which is lessexpeditious, would accrue its score more slowlybut outperform a similarly expeditious systemwhich nevertheless translates poorly.

An idealized psychic system achieves this,claiming all of the area under the curve, as itwould have a perfect translation instantly, hav-ing no need of even waiting for future input.4

A batch system has only a narrow (but tall)sliver to the right, since it translates nothinguntil all of the words are observed.

Formally, let Q be the score function for apartial translation, x the sequentially revealedsource words x1, x2, . . . , xT from time step 1 toT , and y the partial translations y1, y2, . . . , yT ,where T is the length of the source languageinput. Each incremental translation yt has ableu-n score with respect to a reference r. Weapply the usual bleu brevity penalty to all theincremental translations (initially empty) to

4One could reasonably argue that this is not ideal:a fluid conversation requires the prosody and timingbetween source and target to match exactly. Thus, apsychic system would provide too much informationtoo quickly, making information exchange unnatural.However, we take the information-centric approach:more information faster is better.

Pos Input Intermediate Consensus12 Er He1 He13 Er wurde

gestaltetIt1 was2 designed3 He1 was2 designed3

4 It1 was2 designed3 He1 was2 designed35 Er wurde

gesternrenoviert

It1 was2 renovated3yesterday4

He1 was2 designed3yesterday4

Table 1: How intermediate translations are combined into a consensus translation. Incorrecttranslations (e.g., “he” for an inanimate object in position 3) and incorrect predictions (e.g.,incorrectly predicting the verb gestaltet in position 5) are kept in the consensus translation.When no translation is made, the consensus translation remains static.

Er ist zum Laden gegangen

He to the

Psychic

Monotone

Batch

PolicyPrediction

He went

TSource Sentence

He

He

He went to the store



He to the store went

He to the store

He went to the

Figure 3: Comparison of lbleu (the area underthe curve given by Equation 1) for an impossi-ble psychic system, a traditional batch system,a monotone (German word order) system, andour prediction-based system. By correctly pre-dicting the verb “gegangen” (to go), we achievea better overall translation more quickly.

obtain latency-bleu (lbleu),

Q(x,y) =1

T

∑

t

bleu(yt, r) (1)

+ T · bleu(yT , r)

The lbleu score is a word-by-word inte-gral across the input source sentence. As eachsource word is observed, the system receives areward based on the bleu score of the partialtranslation. lbleu, then, represents the sum ofthese T rewards at each point in the sentence.The score of a simultaneous translation is thesum of the scores of all individual segmentsthat contribute to the overall translation.

We multiply the final bleu score by T to en-

sure good final translations in learned systemsto compensate for the implicit bias toward lowlatency.5

4 Predicting Verbs and NextWords

The next and verb actions depend on predic-tions of the sentence’s next word and final verb;this section describes our process for predict-ing verbs and next words given a partial sourcelanguage sentence.

The prediction of the next word in the sourcelanguage sentence is modeled with a left-to-right language model. This is (näıvely) anal-ogous to how a human translator might usehis own “language model” to guess upcomingwords to gain some speed by completing, forexample, collocations before they are uttered.We use a simple bigram language model fornext-word prediction. We use Heafield et al.(2013).

For verb prediction, we use a generativemodel that combines the prior probability ofa particular verb v, p(v), with the likelihoodof the source context at time t given thatverb (namely, p(x1:t | v)), as estimated by asmoothed Kneser-Ney language model (Kneserand Ney, 1995). We use Pauls and Klein(2011). The prior probability p(v) is estimatedby simple relative frequency estimation. Thecontext, x1:t, consists of all words observed.We model p(x1:t | v) with verb-specific n-gramlanguage models. The predicted verb v(t) at

5One could replace T with a parameter, β, to biastowards different kinds of simultaneous translations. Asβ → ∞, we recover batch translation.

time t is then:

arg maxvp(v)

t∏

i=1

p(xi | v, xi−n+1:i−1) (2)

where xi−n+1:i−1 is the n−1-gram context. Tonarrow the search space, we consider only the100 most frequent final verbs, where a “finalverb” is defined as the sentence-final sequenceof verbs and particles as detected by a Germanpart-of-speech tagger (Toutanova et al., 2003).6

5 Learning a Policy

We have a framework (states and actions) forsimultaneous machine translation and a metricfor assessing simultaneous translations. Wenow describe the use of reinforcement learningto learn a policy, a mapping from states toactions, to maximize lbleu reward.

We use imitation learning (Abbeel and Ng,2004; Syed et al., 2008): given an optimal se-quence of actions, learn a generalized policythat maps states to actions. This can be viewedas a cost-sensitive classification (Langford andZadrozny, 2005): a state is represented as a fea-ture vector, the loss corresponds to the qualityof the action, and the output of the classifier isthe action that should be taken in that state.

In this section, we explain each of these com-ponents: generating an optimal policy, repre-senting states through features, and learning apolicy that can generalize to new sentences.

5.1 Optimal Policies

Because we will eventually learn policies viaa classifier, we must provide training exam-ples to our classifier. These training exam-ples come from an oracle policy π∗ thatdemonstrates the optimal sequence—i.e., withmaximal lbleu score—of actions for each se-quence. Using dynamic programming, we candetermine such actions for a fixed translationmodel.7 From this oracle policy, we generatetraining examples for a supervised classifier.

6This has the obvious disadvantage of ignoring mor-phology and occasionally creating duplicates of commonverbs that have may be associated with multiple parti-cles; nevertheless, it provides a straightforward verb topredict.

7This is possible for the limited class of consensustranslation schemes discussed in Section 2.4.

State st is represented as a tuple of the ob-served words x1:t, predicted verb v

(t), and the

predicted word n(t)t+1. We represent the state to

a classifier as a feature vector φ(x1:t, n(t)t+1, v

(t)).

5.2 Feature Representation

We want a feature representation that will al-low a classifier to generalize beyond the specificexamples on which it is trained. We use sev-eral general classes of features: features thatdescribe the input, features that describe thepossible translations, and features that describethe quality of the predictions.

Input We include both a bag of words rep-resentation of the input sentence as well asthe most recent word and bigram to modelword-specific effects. We also use a featurethat encodes the length of the source sentence.

Prediction We include the identity of thepredicted verb and next word as well as their re-spective probabilities under the language mod-els that generate the predictions. If the modelis confident in the prediction, the classifier canlearn to more so trust the predictions.

Translation In addition to the state, we in-clude features derived from the possible actionsthe system might take. This includes a bag ofwords representation of the target sentence, thescore of the translation (decreasing the score isundesirable), the score of the current consen-sus translation, and the difference between thecurrent and potential translation scores.

5.3 Policy Learning

Our goal is to learn a classifier that can accu-rately mimic the oracle’s choices on previouslyunseen data. However, at test time, when werun the learned policy classifier, the learnedpolicy’s state distribution may deviate fromthe optimal policy’s state distribution due toimperfect imitation, arriving in states not onthe oracle’s path. To address this, we usesearn (Daumé III et al., 2009), an iterativeimitation learning algorithm. We learn fromthe optimal policy in the first iteration, as instandard supervised learning; in the followingiterations, we run an interpolated policy

πk+1 = �πk + (1− �)π∗, (3)

with k as the iteration number and � the mixingprobability. We collect examples by askingthe policy to label states on its path. Theinterpolated policy will execute the optimalaction with probability 1− � and the learnedpolicy’s action with probability �. In the firstiteration, we have π0 = π

∗.Mixing in the learned policy allows the

learned policy to slowly change from the oraclepolicy. As it trains on these no-longer-perfectstate trajectories, the state distribution at testtime will be more consistent with the statesused in training.searn learns the policy by training a cost-

sensitive classifier. Besides providing the opti-mal action, the oracle must also assign a costto an action

C(at,x) ≡ Q(x, π∗(xt))−Q(x, at(xt)), (4)

where at(xt) represents the translation outcomeof taking action at. The cost is the regret ofnot taking the optimal action.

6 Translation System

The focus of this work is to show that given aneffective batch translation system and predic-tions, we can learn a policy that will turn thisinto a simultaneous translation system. Thus,to separate translation errors from policy er-rors, we perform experiments with a nearlyoptimal translation system we call an omni-scient translator.

More realistic translation systems will nat-urally lower the objective function, often inways that make it difficult to show that we caneffectively predict the verbs in verb-final sourcelanguages. For instance, German to Englishtranslation systems often drop the verb; thus,predicting a verb that will be ignored by thetranslation system will not be effective.

The omniscient translator translates a sourcesentence correctly once it has been fed the ap-propriate source words as input. There aretwo edge cases: empty input yields an emptyoutput, while a complete, correct source sen-tence returns the correct, complete translation.Intermediate cases—where the input is eitherincomplete or incorrect—require using an align-ment. The omniscient translator assumes asinput a reference translation r, a partial sourcelanguage input x1:t and a corresponding partial

output y. In addition, the omniscient transla-tor assumes access to an alignment between rand x. In practice, we use the hmm aligner (Vo-gel et al., 1996; Och and Ney, 2003).

We first consider incomplete but correct in-puts. Let y = τ(x1:t) be the translator’s outputgiven a partial source input x1:t with transla-tion y. Then, τ(x1:t) produces all target wordsyj if there is a source word xi in the inputaligned to those words—i.e., (i, j) ∈ ax,y—andall preceding target words can be translated.(That translations must be contiguous is a nat-ural requirement for human recipients of trans-lations). In the case where yj is unaligned, theclosest aligned target word to yj that has acorresponding alignment entry is aligned to xi;then, if xi is present in the input, yj appears inthe output. Thus, our omniscient translationsystem will always produce the correct outputgiven the correct input.

However, our learned policy can make wrongpredictions, which can produce partial trans-lations y that do not match the reference.In this event, an incorrect source word x̃iproduces incorrect target words ỹj , for allj : (i, j) ∈ ax,y. These ỹj are sampled fromthe ibm Model 1 lexical probability table mul-tiplied by the source language model ỹj ∼Mult(θx̃i)pLM (x̃).

8 Thus, even if we predictthe correct verb using a next word action, itwill be in the wrong position and thus gener-ate a translation from the lexical probabilities.Since translations based on Model 1 probabil-ities are generally inaccurate, the omniscienttranslator will do very well when given correctinput but will produce very poor translationsotherwise.

7 Experiments

In this section, we describe our experimentalframework and results from our experiments.From aligned data, we derive an omniscienttranslator. We use monolingual data in thesource language to train the verb predictor andthe next word predictor. From these features,we compute an optimal policy from which wetrain a learned policy.

8If a policy chooses an incorrect unaligned word, ithas no effect on the output. Alignments are position-specific, so “wrong” refers to position and type.

7.1 Data sets

For translation model and policy training, weuse data from the German-English Parallel “de-news” corpus of radio broadcast news (Koehn,2000), which we lower-cased and stripped ofpunctuation. A total of 48, 601 sentence pairsare randomly selected for building our system.Of these, we use 70% (34, 528 pairs) for trainingword alignments.

For training the translation policy, we re-strict ourselves to sentences that end with oneof the 100 most frequent verbs (see Section 4).This results in a data set of 4401 training sen-tences and 1832 test sentences from the de-newsdata. We did this to narrow the search space(from thousands of possible, but mostly veryinfrequent, verbs).

We used 1 million words of news text fromthe Leipzig Wortschatz (Quasthoff et al., 2006)German corpus to train 5-gram language mod-els to predict a verb from the 100 most frequentverbs.

For next-word prediction, we use the 18, 345most frequent German bigrams from the train-ing set to provide a set of candidates in a lan-guage model trained on the same set. We usefrequent bigrams to reduce the computationalcost of finding the completion probability ofthe next word.

7.2 Training Policies

In each iteration of searn, we learn amulti-class classifier to implement the pol-icy. The specific learning algorithm we useis arow (Crammer et al., 2013). In the com-plete version of searn, the cost of each actionis calculated as the highest expected rewardstarting at the current state minus the actualroll-out reward. However, computing the fullroll-out reward is computationally very expen-sive. We thus use a surrogate binary cost: ifthe predicted action is the same as the opti-mal action, the cost is 0; otherwise, the costis 1. We then run searn for five iterations.Results on the development data indicate thatcontinuing for more iterations yields no benefit.

7.3 Policy Rewards on Test Set

In Figure 4, we show performance of the opti-mal policy vis-à-vis the learned policy, as wellas the two baseline policies: the batch policy

●●●●●●●●●● ● ● ● ● ● ● ● ●

●

0.25

0.50

0.751.001.25

0.00 0.25 0.50 0.75 1.00% of Sentence

Sm

ooth

ed A

vera

ge

● Batch Monotone Optimal Searn

Figure 4: The final reward of policies on Ger-man data. Our policy outperforms all baselinesby the end of the sentence.

0250050007500

10000

0250050007500

10000

0250050007500

10000

0250050007500

10000

Batch

Monotone

Optim

alS

earn

COMMIT WAIT NEXTWORD VERBAction

Act

ion

Cou

nt

Policy Actions

Figure 5: Histogram of actions taken by thepolicies.

and the monotone policy. The x-axis is thepercentage of the source sentence seen by themodel, and the y-axis is a smoothed average ofthe reward as a function of the percentage ofthe sentence revealed. The monotone policy’sperformance is close to the optimal policy forthe first half of the sentence, as German andEnglish have similar word order, though theydiverge toward the end. Our learned policyoutperforms the monotone policy toward theend and of course outperforms the batch policythroughout the sentence.

Figure 5 shows counts of actions taken byeach policy. The batch policy always commitsat the end. The monotone policy commits ateach position. Our learned policy has an ac-tion distribution similar to that of the optimal

policy, but is slightly more cautious.

7.4 What Policies Do

VERB

federal minister of the environment angela merkel

shown the draft of an ecopolitical program

bundesumweltministerin merkel hat den entwurf

bundesumweltministerin

INPUT OUTPUT



shown the draft

Merkel

gezeigt

bundesumweltministerin merkel hat den entwurf eines umweltpolitischen programms vorgestellt

COMMIT

NEXT

Figure 6: An imperfect execution of a learnedpolicy. Despite choosing the wrong verb“gezeigt” (showed) instead of “vorgestellt” (pre-sented), the translation retains the meaning.

Figure 6 shows a policy that, predicting in-correctly, still produces sensible output. Thepolicy correctly intuits that the person dis-cussed is Angela Merkel, who was the environ-mental minister at the time, but the policy usesan incorrectly predicted verb. Because of ourpoor translation model (Section 6), it rendersthis word as “shown”, which is a poor transla-tion. However, it is still comprehensible, andthe overall policy is similar to what a humanwould do: intuit the subject of the sentencefrom early clues and use a more general verbto stand in for a more specific one.

8 Related Work

Just as mt was revolutionized by statisticallearning, we suspect that simultaneous mt willsimilarly benefit from this paradigm, both froma systematic system for simultaneous transla-tion and from a framework for learning how toincorporate predictions.

Simultaneous translation has beendominated by rule and parse-based ap-proaches (Mima et al., 1998a; Ryu et al., 2006).In contrast, although Verbmobil (Wahlster,2000) performs incremental translation using astatistical mt module, its incremental decision-making module is rule-based. Other recentapproaches in speech-based systems focus onwaiting until a pause to translate (Sakamotoet al., 2013) or using word alignments (Ryu

et al., 2012) between languages to determineoptimal translation units.

Unlike our work, which focuses on predic-tion and learning, previous strategies for deal-ing with sov-to-svo translation use rule-basedmethods (Mima et al., 1998b) (for instance,passivization) to buy time for the translator tohear more information in a spoken context—oruse phrase table and reordering probabilities todecide where to translate with less delay (Fu-jita et al., 2013). Oda et al. (2014) is themost similar to our work on the translationside. They frame word segmentation as anoptimization problem, using a greedy searchand dynamic programming to find segmenta-tion strategies that maximize an evaluationmeasure. However, unlike our work, the direc-tion of translation was from from svo to svo,obviating the need for verb prediction. Simul-taneous translation is more straightforward forlanguages with compatible word orders, suchas English and Spanish (Fügen, 2008).

To our knowledge, the only attempt tospecifically predict verbs or any late-occurringterms (Matsubara et al., 2000) uses patternmatching on what would today be considereda small data set to predict English verbs forJapanese to English simultaneous mt.

Incorporating verb predictions into the trans-lation process is a significant component ofour framework, though n-gram models stronglyprefer highly frequent verbs. Verb predictionmight be improved by applying the insightsfrom psycholinguistics. Ferreira (2000) arguesthat verb lemmas are required early in sentenceproduction—prior to the first noun phraseargument—and that multiple possible syntac-tic hypotheses are maintained in parallel as thesentence is produced. Schriefers et al. (1998)argues that, in simple German sentences, non-initial verbs do not need lemma planning atall. Momma et al. (2014), investigating theseprior claims, argues that the abstract relation-ship between the internal arguments and verbstriggers selective verb planning.

9 Conclusion and Future Work

Creating an effective simultaneous translationsystem for sov to svo languages requires notonly translating partial sentences, but also ef-fectively predicting a sentence’s verb. Both

elements of the system require substantial re-finement before they are usable in a real-worldsystem.

Replacing our idealized translation systemis the most challenging and most importantnext step. Supporting multiple translation hy-potheses and incremental decoding (Sankaranet al., 2010) would improve both the efficiencyand effectiveness of our system. Using datafrom human translators (Shimizu et al., 2014)could also add richer strategies for simultane-ous translation: passive constructions, reorder-ing, etc.

Verb prediction also can be substantially im-proved both in its scope in the system andhow we predict verbs. Verb-final languagesalso often place verbs at the end of clauses,and also predicting these verbs would improvesimultaneous translation, enabling its effectiveapplication to a wider range of sentences. In-stead predicting an exact verb early (which isvery difficult), predicting a semantically closeor a more general verb might yield interpretabletranslations.

A natural next step is expanding this workto other languages, such as Japanese, which notonly has sov word order but also requires tok-enization and morphological analysis, perhapsrequiring sub-word prediction.

Acknowledgments

We thank the anonymous reviewers, as well asYusuke Miyao, Naho Orita, Doug Oard, andSudha Rao for their insightful comments. Thiswork was supported by NSF Grant IIS-1320538.Boyd-Graber is also partially supported byNSF Grant CCF-1018625. Daumé III and Heare also partially supported by NSF Grant IIS-0964681. Any opinions, findings, conclusions,or recommendations expressed here are thoseof the authors and do not necessarily reflectthe view of the sponsor.

References

Pieter Abbeel and Andrew Y. Ng. 2004. Appren-ticeship learning via inverse reinforcement learn-ing. In Proceedings of the International Confer-ence of Machine Learning.

Satanjeev Banerjee and Alon Lavie. 2005. ME-TEOR: An automatic metric for MT evalua-tion with improved correlation with human judg-

ments. In Proceedings of the Association forComputational Linguistics. Association for Com-putational Linguistics.

Koby Crammer, Alex Kulesza, and Mark Dredze.2013. Adaptive regularization of weight vectors.Machine Learning, 91(2):155–187.

Hal Daumé III, John Langford, and Daniel Marcu.2009. Search-based structured prediction. Ma-chine Learning Journal (MLJ).

Fernanda Ferreira. 2000. Syntax in languageproduction: An approach using tree-adjoininggrammars. Aspects of language production,pages 291–330.

Christian Fügen. 2008. A system for simultane-ous translation of lectures and speeches. Ph.D.thesis, KIT-Bibliothek.

Tomoki Fujita, Graham Neubig, Sakriani Sakti,Tomoki Toda, and Satoshi Nakamura. 2013.Simple, lexicalized choice of translation timingfor simultaneous speech translation. INTER-SPEECH.

Francesca Gaiba. 1998. The origins of simultane-ous interpretation: The Nuremberg Trial. Uni-versity of Ottawa Press.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H.Clark, and Philipp Koehn. 2013. Scalablemodified Kneser-Ney language model estima-tion. In Proceedings of the Association for Com-putational Linguistics.

Reinhard Kneser and Hermann Ney. 1995. Im-proved backing-off for n-gram language model-ing. In Acoustics, Speech, and Signal Processing,1995. ICASSP-95., 1995 International Confer-ence on. IEEE.

Philipp Koehn. 2000. German-english parallel cor-pus “de-news”.

John Langford and Bianca Zadrozny. 2005. Relat-ing reinforcement learning performance to clas-sification performance. In Proceedings of the In-ternational Conference of Machine Learning.

Shigeki Matsubara, Keiichi Iwashima, NobuoKawaguchi, Katsuhiko Toyama, and YasuyoshiInagaki. 2000. Simultaneous Japanese-Englishinterpretation based on early predition of En-glish verb. In Symposium on Natural LanguageProcessing.

Hideki Mima, Hitoshi Iida, and Osamu Furuse.1998a. Simultaneous interpretation utilizingexample-based incremental transfer. In Pro-ceedings of the 17th international conference onComputational linguistics-Volume 2, pages 855–861. Association for Computational Linguistics.

Hideki Mima, Hitoshi Iida, and Osamu Furuse.1998b. Simultaneous interpretation utilizingexample-based incremental transfer. In Proceed-ings of the Association for Computational Lin-guistics.

Shota Momma, Robert Slevc, and Colin Phillips.2014. The timing of verb selection in englishactive and passive sentences. IEICE TechnicalReport.

Franz Josef Och and Hermann Ney. 2003. Asystematic comparison of various statisticalalignment models. Computational Linguistics,29(1):19–51.

Yusuke Oda, Graham Neubig, Sakriani Sakti,Tomoki Toda, and Satoshi Nakamura. 2014. Op-timizing segmentation strategies for simultane-ous speech translation. In Proceedings of theAssociation for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, andWei-Jing Zhu. 2002. BLEU: a method for auto-matic evaluation of machine translation. In Pro-ceedings of the Association for ComputationalLinguistics.

Adam Pauls and Dan Klein. 2011. Faster andsmaller n-gram language models. In Proceed-ings of the Association for Computational Lin-guistics.

Brock Pytlik and David Yarowsky. 2006. Machinetranslation for languages lacking bitext via mul-tilingual gloss transduction. In 5th Conferenceof the Association for Machine Translation inthe Americas (AMTA), August.

Uwe Quasthoff, Matthias Richter, and ChristianBiemann. 2006. Corpus portal for search inmonolingual corpora. In International LanguageResources and Evaluation, pages 1799–1802.

Siegfried Ramler and Paul Berry. 2009. Nurembergand Beyond: The Memoirs of Siegfried Ramlerfrom 20th Century Europe to Hawai’i. BooklinesHawaii Limited.

Koichiro Ryu, Shigeki Matsubara, and YasuyoshiInagaki. 2006. Simultaneous english-japanesespoken language translation based on incremen-tal dependency parsing and transfer. In Proceed-ings of the Association for Computational Lin-guistics.

Koichiro Ryu, Shigeki Matsubara, and YasuyoshiInagaki. 2012. Alignment-based translationunit for simultaneous Japanese-English spokendialogue translation. In Innovations in Intelli-gent Machines–2, pages 33–44. Springer.

Akiko Sakamoto, Nayuko Watanabe, Satoshi Ka-matani, and Kazuo Sumita. 2013. Developmentof a simultaneous interpretation system for face-to-face services and its evaluation experiment in

real situation. In Proceedings of the XIV Ma-chine Translation Summit.

Baskaran Sankaran, Ajeet Grewal, and AnoopSarkar. 2010. Incremental decoding for phrase-based statistical machine translation. In Pro-ceedings of the Joint Fifth Workshop on Statis-tical Machine Translation.

H Schriefers, E Teruel, and RM Meinshausen.1998. Producing simple sentences: Results frompicture–word interference experiments. Journalof Memory and Language, 39(4):609–632.

Hiroaki Shimizu, Graham Neubig, Sakriani Sakti,Tomoki Toda, and Satoshi Nakamura. 2014.Collection of a simultaneous translation corpusfor comparative analysis. In International Lan-guage Resources and Evaluation.

Matthew Snover, Bonnie Dorr, Richard Schwartz,Linnea Micciulla, and John Makhoul. 2006. Astudy of translation edit rate with targeted hu-man annotation. In In Proceedings of Associa-tion for Machine Translation in the Americas.

Umar Syed, Michael Bowling, and Robert E.Schapire. 2008. Apprenticeship learning usinglinear programming. In Proceedings of the Inter-national Conference of Machine Learning.

Christoph Tillmann, Stephan Vogel, Hermann Ney,and Alex Zubiaga. 1997. A dp-based search us-ing monotone alignments in statistical transla-tion. In Proceedings of the Association for Com-putational Linguistics.

Kristina Toutanova, Dan Klein, Christopher DManning, and Yoram Singer. 2003. Feature-richpart-of-speech tagging with a cyclic dependencynetwork. In Conference of the North AmericanChapter of the Association for ComputationalLinguistics, pages 173–180.

Stephan Vogel, Hermann Ney, and Christoph Till-mann. 1996. HMM-based word alignment instatistical translation. In Proceedings of Inter-national Conference on Computational Linguis-tics.

Wolfgang Wahlster. 2000. Verbmobil: foundationsof speech-to-speech translation. Springer.

Documents

Alvin Grissom II, He He, Jordan Boyd-Graber, John Morgan ...users.umiacs.umd.edu/~jbg/docs/2014_emnlp_simtrans.pdf · (sov ) language (e.g., German or Japanese) to a verb-medial (