

Proceedings of the Conference on Machine Translation (WMT), Volume 2: Shared Task Papers, pages 514–524, Copenhagen, Denmark, September 7–11, 2017. © 2017 Association for Computational Linguistics

A Shared Task on Bandit Learning for Machine Translation

Artem Sokolov*,† and Julia Kreutzer† and Kellen Sunderland* and Pavel Danchenko*
and Witold Szymaniak* and Hagen Fürstenau* and Stefan Riezler†
*Amazon Development Center Germany, Berlin and †Heidelberg University, Germany

Abstract

We introduce and describe the results of a novel shared task on bandit learning for machine translation. The task was organized jointly by Amazon and Heidelberg University for the first time at the Second Conference on Machine Translation (WMT 2017). The goal of the task is to encourage research on learning machine translation from weak user feedback instead of human references or post-edits. On each of a sequence of rounds, a machine translation system is required to propose a translation for an input, and receives a real-valued estimate of the quality of the proposed translation for learning. This paper describes the shared task's learning and evaluation setup, using services hosted on Amazon Web Services (AWS), the data and evaluation metrics, and the results of various machine translation architectures and learning protocols.

1 Introduction

Bandit learning for machine translation (MT) is a framework to train and improve MT systems by learning from weak or partial feedback: instead of a gold-standard human-generated translation, the learner only receives feedback on a single proposed translation (hence the term 'partial'), in the form of a translation quality judgement (a real number which can be as weak as a binary acceptance/rejection decision).

In the shared task, user feedback was simulated by a service hosted on Amazon Web Services (AWS). Participants could submit translations and receive feedback on translation quality. This feedback is used to adapt an out-of-domain MT model, pre-trained mostly on news texts, to a new domain (e-commerce), for the translation direction German (DE) to English (EN). While in our setup feedback was simulated by evaluating a reward function on the predicted translation against a gold standard reference, the reference translation itself was never revealed to the learner, neither at training nor at test time. This learning scenario has been investigated under the names of learning from bandit feedback¹ or reinforcement learning (RL)², and has important real-world applications such as online advertising (Chapelle et al., 2014). In the advertising application, the problem is to select the best advertisement for a user visiting a publisher page. A key element is to estimate the click-through rate (CTR), i.e., the probability that an ad will be clicked by a user so that the advertiser has to pay. This probability is modeled by features representing user, page, and ad, and is estimated by trading off exploration (a new ad needs to be displayed in order to learn its click-through rate) and exploitation (displaying the ad with the current best estimate is better in the short term) in displaying ads to users.

In analogy to the online advertising scenario, one could imagine a scenario of personalization in machine translation where translations have to be adapted to the user's specific purpose and domain. Similar to online advertising, where it is unrealistic to expect more detailed feedback than a user click on a displayed ad, the feedback in adaptive machine translation should be weaker than a reference translation or a post-edit created by a professional translator. Instead, the goal is to elicit binary or real-valued judgments of translation quality from laymen users (for example, Graham et al. (2016) show that consistent assessments of real-valued translation quality can be provided by crowdsourcing), or to infer feedback signals from user interactions with the translated content on a web page (for example, by interpreting a copy-paste action of the MT output as a positive quality signal, and a correction as a negative quality signal). The goal of this shared task is to evaluate existing algorithms for learning MT systems from weak feedback (Sokolov et al., 2015, 2016a; Kreutzer et al., 2017) on real-world data and compare them to new algorithms, with a focus on performing online learning efficiently and effectively from bandit feedback, i.e., the best algorithms are those that perform fast online learning and, simultaneously, achieve high translation quality.

¹ The name is inherited from a model where in each round a gambler pulls an arm of a different slot machine ('one-armed bandit'), with the goal of maximizing his reward relative to the maximal possible reward, without a priori knowledge of the optimal slot machine. See Bubeck and Cesa-Bianchi (2012) for an overview.

² See Sutton and Barto (1998) and Szepesvári (2009) for an overview of algorithms for reinforcement learning and their relation to bandit learning.

In the following, we present a description of the protocol and infrastructure of our online learning task, and of the data for pretraining, online training, and evaluation (Section 2). We introduce the online and batch evaluation metrics used in the shared task (Section 3), and describe static baseline systems (Section 4) and submitted online learning systems (Section 5). We present and discuss the results of the task (Sections 6–7), showing that NMT systems with batch domain adaptation provide very good baselines; however, online learning based on SMT or NMT can catch up over time by adapting to the provided feedback.

2 Task Description

Our shared task setup follows an online learning protocol, where on each iteration the learner receives a source sentence, proposes a translation, and is rewarded in the form of a sentence-level task metric evaluation of the proposed translation with respect to a hidden reference. The learner does not know what the correct translation (reference) looks like, nor what would have happened if it had proposed a different translation. Thus, we implemented two constraints to guarantee this scenario of online learning from weak feedback. First, sentences had to be translated one by one, i.e., the next source sentence could only be received after the translation of the previous sentence was sent off. Second, feedback could be obtained only for a single translation of any given source sentence.

In our shared task, the participant systems interact online with an AWS-hosted service as shown in Algorithm 1. The service provides a source sentence to the learner (line 3), and provides feedback (line 5) to the translation predicted by the learner (line 4). The learner updates its parameters using the feedback (line 6) and continues to the next example. We did not impose any restriction on how the learner could use the feedback to improve future translations.

Algorithm 1 WMT Online Bandit Learning
1: Input: MT model
2: for k = 0, . . . , K do
3:   Request source sentence s_k from service
4:   Propose a translation t_k
5:   Obtain feedback Δ(t_k) from service
6:   Improve MT model with ⟨s_k, t_k, Δ(t_k)⟩
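For concreteness, the interaction loop can be sketched as a few lines of client code. The endpoint paths and JSON fields below are illustrative assumptions (participants used the provided SDK), and `mt_model.translate`/`mt_model.improve` are placeholders for an arbitrary MT system and update rule:

```python
import requests

SERVICE = "https://bandit-task.example.aws"  # hypothetical service URL
HEADERS = {"x-api-key": "PARTICIPANT_KEY"}   # hypothetical participant credentials

def run_online_learning(mt_model, num_iterations):
    """Sketch of Algorithm 1: request a source, translate, receive feedback, update."""
    for k in range(num_iterations):
        # Line 3: request the next source sentence from the service.
        source = requests.get(f"{SERVICE}/source", headers=HEADERS).json()["sentence"]
        # Line 4: propose a translation with the current model.
        translation = mt_model.translate(source)
        # Line 5: obtain real-valued feedback for the proposed translation.
        feedback = requests.post(f"{SERVICE}/feedback", headers=HEADERS,
                                 json={"translation": translation}).json()["reward"]
        # Line 6: improve the model from the triple (s_k, t_k, Delta(t_k)).
        mt_model.improve(source, translation, feedback)
```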

Infrastructure. We provided three AWS-hosted environments that correspond to the three phases of the shared task:

1. Mock service, to test the client API (optional): hosted a tiny in-domain dataset (48 sentences).

2. Development service, to tune algorithms and hyperparameters (optional): ran on a larger in-domain dataset (40,000 sentences). Several passes were allowed and two evaluation metrics were communicated to the participants via the leaderboard.

3. Training service (mandatory): served sources from a large in-domain dataset (1,297,974 sentences). Participants had to consume a fixed number of samples during the allocated online learning period to be eligible for final evaluation.

We built the shared task around the following AWS services:

• API Gateway (authentication, rate limiting, client API SDK);

• Lambda (computation);
• DynamoDB (data storage);
• CloudWatch (logging and monitoring).

In more detail, service endpoints were implemented using API Gateway, which gave us access, on a participant level, to throttle request rates, manage accounts, etc. API Gateway enabled easy management of our public-facing endpoints and environments, and provided integrated metrics and notifications, which we monitored closely during the shared task. Data storage was implemented using DynamoDB, a NoSQL database which allows dynamic scaling of our back-end to match the varied requirements of the different shared task phases. State management (e.g., forbidding multiple requests), source sentence serving, feedback calculation, tracking of participants' progress, and result processing were implemented using Lambda, a serverless compute architecture that dispenses with setting up and monitoring a dedicated server infrastructure. The CloudWatch service was used to analyze logs in order to trace down errors, for general monitoring, and for sending alarms to the shared task API maintainers. In addition to the service development, we also developed a small SDK consisting of code samples and helper libraries in Python and Java to help participants develop their clients, as well as a leaderboard that showed the results during the development phase.

source | reference (PE) | PE direction | PE modification
schwarz gr.xxl / xxxl | black , size xxl / xxxl | DE-EN | fixed errors in source, expanded abbreviation
147 cm | 147 cm | DE-EN | fixed errors in source
für starke , glänzende nägel | great for strengthen your nails and enhance shine | EN-DE | poor quality source (EN) used as reference
seemless verarbeitung | seamless processing | DE-EN | source typo corrected in reference
brenndauer : mindestens 40 stunden | 40 hour minimum burn time | DE-EN | translation rewritten for readability
maschinenwaschbar bei 30 ° c | machine washable at 30 degrees . | DE-EN | literal expansion of the degree symbol
32 unzen volumen | 32-ounce capacity | DE-EN | language-specific typography
material : 1050 denier nylon . | material : 1050d nylon . | EN-DE | expanded source (EN) abbreviation used as reference
für e-gitarre entworfen | designed for electric guitar | DE-EN | abbreviation expanded

Table 1: Examples of non-literal PEs in the e-commerce data: the first two columns show examples³ of source sentences and PEs used as reference translations in the shared task. The last two columns show the direction of translation post-editing, and a description of the modifications applied by the editors.

Data. For training initial or seed MT systems (the input to Algorithm 1), out-of-domain parallel data was restricted to the DE-EN parts of Europarl v7, News Commentary v12, CommonCrawl and Rapid data from the WMT 2017 News Translation (constrained) task⁴. Furthermore, monolingual EN data from the constrained task was allowed. Tuning of the out-of-domain systems had to be done on the newstest2016-deen set.

The in-domain parallel data for online learning was taken from the e-commerce domain: the corpus was provided by Amazon and had been sampled from a large real-world collection of post-edited (PE'ed) translations of actual product descriptions. Since post-editors were following certain rules aimed at improving customer experience on the Amazon retail website (improving readability, correction of typos, rewriting of uncommon abbreviations, removing irrelevant information, etc.), the resulting PEs were naturally not always literal, sometimes adding or deleting a considerable number of tokens and resulting in low feedback BLEU scores for submitted literal translations (see Table 1 for examples). Consequently, the participants had to solve two difficult problems: domain adaptation and learning from bandit feedback. In addition, to simulate the level of noise normally encountered in real-world MT applications, and to test the noise-robustness of the bandit learning algorithms, approximately half of the parallel in-domain data was sourced from the EN-DE post-editing direction and reversed.

³ Examples selected by Khanh Nguyen.
⁴ statmt.org/wmt17/translation-task.html

All data was preprocessed with Moses scripts (removing non-printing characters, replacing and normalizing unicode punctuation, lowercasing, pretokenizing and tokenizing). No DE-side compound splitting was used, permitting custom participant decisions. Since the learning data came from a substantially different domain than the out-of-domain parallel texts, it had a large number of out-of-vocabulary (OOV) terms, aggravated by the high frequency of long product numbers and unique vendor names. To reduce the OOV rate we additionally filtered out all parallel sentences where the source contained more than one numeral (with whitespace in between) and normalized floating point delimiters in both languages to a period. The resulting average OOV token rate with respect to the out-of-domain parallel training data (assuming the above preprocessing) is ≈2% for the EN and ≈6% for the DE data side. Statistics on the length distribution of in-domain and out-of-domain data are given in Table 2.
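The numeral filter and delimiter normalization can be illustrated as follows; this is a sketch of the rules described above, with assumed regular expressions rather than the actual preprocessing script:

```python
import re

def normalize_float_delimiters(line):
    """Normalize floating point delimiters to a period, e.g. '1,5' -> '1.5'."""
    return re.sub(r"(\d),(\d)", r"\1.\2", line)

def keep_sentence_pair(source_line):
    """Filter out pairs whose source has more than one whitespace-separated numeral."""
    numerals = re.findall(r"(?:^|\s)\d+(?:\.\d+)?(?=\s|$)", source_line)
    return len(numerals) <= 1
```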

# tokens | out-of-domain | in-domain
mean | 23.0±14.1 | 6.6±4.8
median | 25 | 8
max | 150 | 25
# lines | 5.5M | 1.3M

Table 2: Data statistics for the source side of in-domain and out-of-domain parallel data.

For all services, the sequence of provided source sentences was the same for all participants, with no data intersection between the services beyond natural duplicates: about 11% of the data were duplicates on both (DE and EN) sides, and about 4% of DE sentences had more than one different EN side.

Feedback. Simulation of real-valued user feedback was done by calculating a smoothed sentence-level BLEU score (Lin and Och, 2004) (with additive n-gram count smoothing with offset 0.01, applied only if the n-gram count was zero) with respect to one human reference (preprocessed as described above).
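A sketch of this feedback function, assuming the usual BLEU definition with n-grams up to length 4 and a brevity penalty; only the zero-count smoothing with offset 0.01 is taken from the description above:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def smoothed_sentence_bleu(hyp, ref, max_n=4, offset=0.01):
    """Smoothed sentence-level BLEU: zero n-gram match counts are replaced by `offset`."""
    hyp, ref = hyp.split(), ref.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        matches = sum((ngrams(hyp, n) & ngrams(ref, n)).values())  # clipped matches
        total = max(len(hyp) - n + 1, 0)
        if total == 0:
            return 0.0  # degenerate case: hypothesis shorter than n tokens
        if matches == 0:
            matches = offset  # apply smoothing only to zero counts
        log_precisions.append(math.log(matches / total))
    bp = min(1.0, math.exp(1.0 - len(ref) / max(len(hyp), 1)))  # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)
```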

3 Evaluation Metrics

In our shared task, participants were allowed to use their favorite MT systems as starting points to integrate online bandit learning methods. This leads to the difficulty of separating the contributions of the underlying MT architecture and the online bandit learning method. We attempted to tackle this problem by using different evaluation metrics that focus on these respective aspects:

1. Online cumulative reward: This metric measures the cumulative sum $C = \sum_{k=1}^{K} \Delta(t_k)$ of the per-sentence BLEU score $\Delta$ against the number of iterations. This metric has been used in reinforcement learning competitions (Dimitrakakis et al., 2014). For systems with the same design, this metric favors those that do a good job at balancing exploration and exploitation to achieve high scores over the full data sequence. Unlike in these competitions, where environments (i.e., action spaces and context features) were fixed, in our task the environment is heterogeneous due to the use of different underlying MT architectures. Thus, systems that start out with a well-performing pretrained out-of-domain model have an advantage over systems that might improve more over worse starting points. Furthermore, even systems that do not perform online learning at all can achieve high cumulative rewards.

2. Online regret: In order to overcome the problems of the cumulative reward metric, we can use a metric from bandit learning that measures the regret $R = \frac{1}{K}\sum_{k=1}^{K} \left(\Delta(t^*_k) - \Delta(t_k)\right)$ that is suffered by the system when predicting translation $t_k$ instead of the optimal translation $t^*_k$ produced by an oracle system. Plotting a running average of regret against the number of iterations allows separating the gains due to the MT architecture from the gains due to the learning algorithm: systems that do learn will decrease regret; systems that do not learn will not. In our task, we use as oracle system a model that is trained on in-domain data.

3. Relative reward: A further way to separate the learning ability of systems from the contribution of the underlying MT architecture is to apply the standard corpus-BLEU score and/or an average of the per-sentence BLEU score $\Delta$ on a held-out set at regular intervals during training. Plotting these scores against the number of iterations, or alternatively, subtracting the performance of the starting point at each evaluation, allows us to discern systems that adapt to a new domain from systems that are good from the beginning and can achieve high cumulative rewards without learning. We performed this evaluation by embedding a small (relative to the whole sequence) fixed held-out set in the beginning (showing the performance of the initial out-of-domain model), and again at regular intervals including the very end of the learning sequence. In total, there were 4 insertions of 700 sentences in the development data and 12 insertions of 4,000 sentences in the final training phase, which constitutes ≈2% and ≈0.3% of the respective learning sequence lengths. Note that this metric measures the systems' performance while they were still exploring and learning, but the relative size of the embedded held-out set is small enough to consider the models static during such periodic evaluations.
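Given logged per-sentence rewards for a system and for the in-domain oracle, the first two metrics reduce to a few lines; a minimal sketch under the definitions above:

```python
def cumulative_reward(system_rewards):
    """C = sum over k of Delta(t_k) for the full learning sequence."""
    return sum(system_rewards)

def running_average_regret(system_rewards, oracle_rewards):
    """Running average of (1/k) * sum_{k'<=k} (Delta(t*_k') - Delta(t_k'))."""
    regrets, total = [], 0.0
    for k, (r_sys, r_oracle) in enumerate(zip(system_rewards, oracle_rewards), start=1):
        total += r_oracle - r_sys
        regrets.append(total / k)
    return regrets
```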


4 Baselines

As baseline systems, we used SMT and NMT models that were trained on out-of-domain data, but did not perform online learning on in-domain data. We further present oracle systems that were trained in batch on in-domain data.

4.1 Static SMT baselines.

SMT-static. We based our SMT submissions on the SCFG decoder cdec (Dyer et al., 2010) with on-the-fly grammar extraction with suffix arrays (Lopez, 2007). Training was done in batch on the parallel out-of-domain data; tuning was done on newstest2016-deen. During the development phase we evaluated MERT (on 14 default dense features) and MIRA (on additional lexicalized sparse features: rule-id features, rule source and target bigram features, and rule shape features), and found no significant difference in results. We chose MERT with dense features as the seed system for the training phase for its speed and smaller memory footprint.

4.2 Static NMT baselines.

WMT16-static. First of all, we are interested in how well the currently best (third-party) model on the news domain would perform on the e-commerce domain. Therefore, the Nematus (Sennrich et al., 2017) model that won the News Translation Shared Task at WMT 2016 (Bojar et al., 2016b)⁵ was used to translate the data from this shared task. It is an attentional, bi-directional, single-layered encoder-decoder model on sub-word units (BPE with 89,500 merge operations) with word embeddings of dimensionality 500, GRUs of size 1024, pervasive dropout, and right-to-left reranking (details in Sennrich et al. (2016a)). Final predictions are made with an ensemble formed of the four last training checkpoints and beam search with width 12. It was trained on a different corpus than allowed for this shared task: the WMT 2016 news training data (Europarl v7, News Commentary v11, CommonCrawl) and additional synthetic parallel data generated by translating the monolingual news crawl corpus with an EN-DE NMT model.

BNMT-static. The UNK replacement strategy of Jean et al. (2015) and Luong et al. (2015) is expected to work reasonably well for tokens that occur in the training data and those that are copied from source to target. However, the NMT model does not learn anything about these words as such, in contrast to BPE models (Sennrich et al., 2016b), where the decomposition by byte pair encoding (BPE) allows for a representation within the vocabulary. We generated a BNMT system using a BPE vocabulary from 30k merge operations on all tokens and all single characters of the training data, including the UNK token. If unknown characters occur, they are copied from source to target.

⁵ From data.statmt.org/rsennrich/wmt16_systems/de-en/

4.3 Oracle SMT and NMT systems

To simulate full-information systems (oracles) for regret calculation, we trained an SMT and an NMT system with the same architectures on the in-domain data that the other learning systems accessed only through the numerical feedback. The SMT oracle system was trained on combined in-domain and out-of-domain data, while the NMT oracle system continued training from the converged out-of-domain system on the in-domain data with the same BPE vocabulary.

5 Submitted Systems

5.1 Online bandit learners based on SMT.

Online bandit learners based on SMT followed the existing approaches to adapting an SMT model from weak user feedback (Sokolov et al., 2016b,a) by stochastically optimizing expected loss (EL) for a log-linear model. Furthermore, we present a model that implements stochastic zeroth-order (SZO) optimization for online bandit learning. The cube pruning limit (up to 600), learning rate adaptation schedules (constant vs. Adadelta (Zeiler, 2012) or Adam (Kingma and Ba, 2014)), as well as the initial learning rates (for Adam), were tuned during the development phase. The best configurations were selected for the training phase. The running average of rewards as an additive control variate (CV)⁶ was found helpful for stochastic policy gradient updates (Williams, 1992) for all online learning systems.

⁶ Called a baseline in the RL literature; here we use the term from statistics so as not to confuse it with baseline MT models.

SMT-EL-CV-ADADELTA. We used the EL minimization approach of Sokolov et al. (2016a), adding Adadelta's learning rate scheduling and a control variate (effectively, replacing the received feedback $\Delta(t_k)$ with $\Delta(t_k) - \frac{1}{k}\sum_{k'=1}^{k} \Delta(t_{k'})$). Sampling and computation of expectations on the hypergraph used the Inside-Outside algorithm (Li and Eisner, 2009).
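The corresponding update can be sketched for an enumerable hypothesis set; this schematic numpy version samples from an explicit list of feature vectors and applies the running-average control variate, whereas the actual system sampled and computed feature expectations over the full hypergraph:

```python
import numpy as np

def el_cv_update(w, features, get_feedback, learning_rate, reward_history):
    """One expected-loss (EL) stochastic gradient step for a log-linear model.

    `features` is a (num_hypotheses x dim) matrix of feature vectors for an
    enumerable hypothesis set; `get_feedback(i)` returns Delta for hypothesis i.
    """
    scores = features @ w
    scores -= scores.max()                          # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()   # p_w(y|x) over hypotheses
    i = np.random.choice(len(probs), p=probs)       # sample a translation to show
    reward = get_feedback(i)                        # bandit feedback Delta(t_k)
    baseline = np.mean(reward_history) if reward_history else 0.0
    reward_history.append(reward)
    # Stochastic gradient of expected reward: (Delta - CV) * (phi(y) - E[phi]).
    grad = (reward - baseline) * (features[i] - probs @ features)
    return w + learning_rate * grad                 # ascent on expected reward
```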

SMT-EL-CV-ADAM. This system uses the same approach as above, except for using Adam to adapt the learning rates, with tuning of the initial learning rate on the development service.

SMT-SZO-CV-ADAM. As a novel contribution, we adapted the two-point stochastic zeroth-order approach of Sokolov et al. (2015), which required two quality evaluations per iteration, to a one-point feedback scenario. In a nutshell, on each step of the SZO algorithm, the model parameters $w$ are perturbed with additive standard Gaussian noise $\varepsilon$, and the Viterbi translation is sent to the service. Such an algorithm can be shown to maximize a smoothed version of the task reward: $E_{\varepsilon \sim N(0,1)}[\Delta(y(w + \varepsilon))]$ (Flaxman et al., 2005). The advantages of such a black-box optimization method over model-based (e.g., EL) optimization, which requires sampling complete structures from the model distribution, are the simpler sampling of standard Gaussians, and the matching of the inference criterion to the learning objective (MAP inference for both), unlike the EL optimization of expected reward that is still evaluated at test time using MAP inference. For SZO models we found that the Adam scheduling consistently outperforms Adadelta.
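A minimal sketch of one SZO step with the running-average control variate, assuming reward maximization on log-linear weights; Adam's per-coordinate scaling is omitted for brevity:

```python
import numpy as np

def szo_step(w, viterbi_feedback, learning_rate, reward_history):
    """One-point stochastic zeroth-order update.

    `viterbi_feedback(w_perturbed)` translates with the perturbed weights
    and returns the service's feedback Delta for the Viterbi translation.
    """
    eps = np.random.normal(size=w.shape)      # additive standard Gaussian noise
    reward = viterbi_feedback(w + eps)
    baseline = np.mean(reward_history) if reward_history else 0.0
    reward_history.append(reward)
    # One-point gradient estimate of the smoothed reward E[Delta(y(w + eps))].
    return w + learning_rate * (reward - baseline) * eps
```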

5.2 Online bandit learners based on NMT.

Kreutzer et al. (2017) recently presented an algorithm for online expected loss minimization to adapt NMT models to unknown domains with bandit feedback. Exploration (i.e., sampling from the model) and exploitation (i.e., presenting the highest-scored translation) are controlled by the softmax distribution in the last layer of the network. Ideally, the model would converge towards a peaked distribution. In our online learning scenario this is not guaranteed, but we would like the model to gradually stop exploring in order to still achieve high cumulative per-sentence reward. To achieve such behavior, the temperature of the softmax over the outputs of the last layer of the network is annealed (Rose, 1998). More specifically, let $o$ be the scores of the output projection layer of the decoder; then $p_\theta(y_t = w_i \mid x, y_{<t}) = \frac{\exp(o_{w_i}/T)}{\sum_{v=1}^{V} \exp(o_{w_v}/T)}$ is the distribution that defines the probability of each word $w_i$ of the target vocabulary $V$ to be sampled in timestep $t$. The annealing schedule for this temperature $T$ is defined as $T_k = 0.99^{\max(k - k_{\text{START}}, 0)}$, i.e., it decreases from iteration $k_{\text{START}}$ on. The same decay is applied to the learning rate, such that $\gamma_k = \gamma_{k-1} \cdot T_k$. This schedule proved successful during tuning with the leaderboard.
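At the level of a single decoding timestep, the annealed sampling can be sketched as follows; `output_scores` stands for the decoder's output projection $o$ over the full vocabulary:

```python
import numpy as np

def sample_annealed(output_scores, k, k_start):
    """Sample a target word id from the temperature-annealed softmax."""
    T = 0.99 ** max(k - k_start, 0)   # T_k = 0.99^max(k - k_START, 0)
    logits = output_scores / T        # as T -> 0 the distribution peaks (exploitation)
    logits -= logits.max()            # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return np.random.choice(len(probs), p=probs)
```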

WNMT-EL. Using the implementation of Kreutzer et al. (2017), we built a word-based NMT system with NeuralMonkey (Libovický et al., 2016; Bojar et al., 2016a) and trained it with the EL algorithm. The vocabulary is limited to the 30k most frequent words in the out-of-domain training corpus. The architecture is similar to WMT16-static, with GRU size 1024 and embedding size 500. It was pretrained on the out-of-domain data with the standard maximum likelihood objective, Adam (α = 1 × 10⁻⁴, β₁ = 0.9, β₂ = 0.999) and dropout (Srivastava et al., 2014) with probability 0.2. Bandit learning starts from this pretrained model and continues with stochastic gradient descent (initial learning rate γ₀ = 1 × 10⁻⁵, annealing starting at k_START = 700,000, dropout with probability 0.5, gradient norm clipping when the norm exceeds 1.0 (Pascanu et al., 2013)), where the model was updated as soon as feedback was received. As described above, UNK replacement was applied to the output on the basis of an IBM2 lexical translation model built with fast_align (Dyer et al., 2013) on the out-of-domain training data. If the aligned source word for a generated UNK token is not in the dictionary of the lexical translation model, the UNK token was simply replaced by the source word.
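The UNK replacement step can be sketched as below, assuming one aligned source position per target token and a source-to-target lexicon extracted from the IBM2 model; both data structures are simplifications of the actual setup:

```python
def replace_unks(target_tokens, source_tokens, alignments, lexicon):
    """Replace generated UNK tokens using a lexical translation table.

    `alignments[t]` is assumed to map target position t to a source position;
    `lexicon` maps a source word to its most probable target translation.
    """
    out = []
    for t, token in enumerate(target_tokens):
        if token == "UNK":
            src_word = source_tokens[alignments[t]]
            # Fall back to copying the source word if it is not in the lexicon.
            out.append(lexicon.get(src_word, src_word))
        else:
            out.append(token)
    return out
```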

BNMT-EL. The pretrained BPE model is further trained on the bandit task data with the EL algorithm, as described above for WNMT-EL, with the only difference of using Adam (α = 1 × 10⁻⁵, β₁ = 0.9, β₂ = 0.999) instead of SGD. Again, annealing started at k_START = 700,000.

BNMT-EL-CV. BNMT-EL-CV is trained in the same manner as BNMT-EL, with the addition of the same control variate technique (running average of rewards) that has previously been found to improve both variance and generalization for NMT bandit training (Kreutzer et al., 2017).


5.3 Domain adaptation and reinforcement learning based on NMT (University of Maryland).

UMD-domain-adaptation. The UMD team's systems were based on an attention-based encoder-decoder translation model. The models use the BPE technique for subword encoding, which helps address the rare word problem and enlarges the vocabulary. A further addition is the domain adaptation approach of Axelrod et al. (2011): after receiving in-domain source-side data, the most similar out-of-domain data is selected from the WMT 2016 training set for re-training.
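The selection criterion of Axelrod et al. (2011) ranks out-of-domain data by the cross-entropy difference between an in-domain and an out-of-domain language model (shown here monolingually for brevity); a sketch with hypothetical LM objects exposing a `cross_entropy` method:

```python
def cross_entropy_difference(sentence, lm_in, lm_out):
    """Lower is better: the sentence looks in-domain and not generically out-of-domain."""
    return lm_in.cross_entropy(sentence) - lm_out.cross_entropy(sentence)

def select_pseudo_in_domain(corpus, lm_in, lm_out, top_k):
    """Keep the top_k out-of-domain sentences most similar to the new domain."""
    ranked = sorted(corpus, key=lambda s: cross_entropy_difference(s, lm_in, lm_out))
    return ranked[:top_k]
```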

UMD-reinforce. Another type of model submitted by UMD uses reinforcement learning techniques to learn from feedback and improve the updates of the translation model to optimize the reward, based on Bahdanau et al. (2016) and Ranzato et al. (2016).

5.4 Domain adaptation and bandit learning based on SMT (LIMSI).

LIMSI. The team from LIMSI tried to adapt a seed Moses system trained on out-of-domain data to a new, unknown domain, relying on two components, each of which addresses one of the challenges raised by the shared task: i) estimating the parameters of an MT system without knowing the reference translation and in a 'one-shot' way (each source sentence can only be translated once); ii) discovering the specificities of the target domain 'on-the-fly', as no information about it is available. First, a linear regression model was used to exploit the weak and partial feedback the system received, by learning to predict the reward a translation hypothesis will get. This model can then be used to score hypotheses in the search space and translate source sentences while taking into account the specificities of the in-domain data. Second, three variants of the UCB1 algorithm (Auer et al., 2002) (vanilla UCB1, a UCB1-sampling variant encouraging more exploration, and a UCB1 variant selecting only the examples not used to train the regression model) chose which of the 'adapted' or 'seed' systems should be used to translate a given source sentence in order to maximize the cumulative reward (Wisniewski, 2017).
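Vanilla UCB1 over the two 'arms' (seed vs. adapted system) can be sketched as follows; the sampling and example-selection variants modify the exploration term or the training data of the regression model:

```python
import math

def ucb1_choose(counts, mean_rewards, total_plays):
    """Pick the arm ('seed' or 'adapted' system) maximizing the UCB1 index."""
    for arm in range(len(counts)):
        if counts[arm] == 0:      # play every arm once before using the index
            return arm
    return max(range(len(counts)),
               key=lambda a: mean_rewards[a]
               + math.sqrt(2 * math.log(total_plays) / counts[a]))

def ucb1_update(counts, mean_rewards, arm, reward):
    """Incrementally update the chosen arm's empirical mean reward."""
    counts[arm] += 1
    mean_rewards[arm] += (reward - mean_rewards[arm]) / counts[arm]
```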

model | cumulative reward
'translate' by copying source | 64,481.8

SMT
SMT-oracle | 499,578.0
SMT-static | 229,621.7
SMT-EL-CV-ADADELTA | 214,398.8
SMT-EL-CV-ADAM | 225,535.3
SMT-SZO-CV-ADAM | 208,464.7

NMT
BNMT-oracle | 780,580.4
BNMT-static | 222,066.0
WMT16-static | 139,668.1
BNMT-EL-CV | 212,703.2
BNMT-EL | 237,663.0
WNMT-EL | 115,098.0
UMD-domain-adaptation | 248,333.2

Table 3: Cumulative rewards over the full training sequence. Only completely finished submissions are shown.

6 Results

Table 3 shows the evaluation results under the cumulative reward metric. Of the non-oracle systems, good results are obtained by the static SMT and BNMT systems, while the best performance is obtained by the UMD-domain-adaptation system, which is also basically a static system. This result is followed closely by the online bandit learner BNMT-EL, which is based on an NMT baseline and optimizes the EL objective; it outperforms the BNMT-static baseline. Cumulative rewards could not be computed for all submitted systems, since some training runs could not be fully finished.

The evolution of the online regret, plotted against the log-scaled number of iterations during training, is shown in Figure 1. Most of the learning happens in the first 100,000 iterations; however, online learning systems optimizing structured EL objectives or based on reinforcement learning eventually converge to the same result: BNMT-EL and UMD-reinforce2 get close to the regret of the static UMD-domain-adaptation system. Systems that optimize the EL objective do not start from strong out-of-domain systems with domain adaptation; however, due to a steeper learning curve they arrive at similar results.

Figures 2, 3a and 3b show the evolution of corpus- and sentence-BLEU on the held-out set that has been embedded in the development and the training sequences. While under corpus-BLEU static systems always outperform online learners on the held-out embedded set, online learning systems such as BNMT-EL can catch up under corpus-BLEU during development, and under a sentence-BLEU evaluation during training. The curves for corpus- and average sentence-BLEU (Figures 3a and 3b) show different dynamics, with the corpus-BLEU sometimes decreasing whereas the sentence-BLEU curve continues to increase. However, if the focus is online learning, the online task loss is per-sentence BLEU, and so should be the evaluation metric.

Figure 1: Evolution of regret plotted against the log-scaled number of iterations during training. The steeper the decrease of a curve, the better the learning capability of the corresponding algorithm.

Figure 2: Evolution of corpus-BLEU scores during development for the configurations selected for the training phase of the competition. Each checkpoint comprises the same 700 sentences, spaced at regular intervals of 12,400 sentences starting from the beginning of the development sequence.

Figure 3: Evolution of corpus-BLEU (a) and sentence-BLEU (b) scores during training for all participants and baselines. Each checkpoint comprises the same 4,000 sentences, spaced at regular intervals of 113,634 sentences starting from the beginning of the training sequence.



7 Conclusion

We presented the learning setup and infrastructure, data and evaluation metrics, and descriptions of baselines and submitted systems for a novel shared task on bandit learning for machine translation. The task implicitly involved domain adaptation from the news domain to e-commerce data (with the additional difficulty of non-literal post-editions as references), and online learning from simulated per-sentence feedback on translation quality (creating a mismatch between the per-sentence task loss and the corpus-based evaluation metric standardly used in evaluating batch-trained machine translation systems). Despite these challenges, we found promising results for both linear and non-linear online learners that could outperform their static SMT and NMT baselines, respectively. A desideratum for a future installment of this shared task is the option to perform offline learning from bandit feedback (Lawrence et al., 2017), thus allowing a more lightweight infrastructure, and opening the task to (mini)batch learning techniques that are more standard in the field of machine translation.

Acknowledgments

This research was supported in part by the German Research Foundation (DFG), and in part by a research cooperation grant with the Amazon Development Center Germany. We would like to thank Amazon for supplying data and engineering expertise, and for covering the running costs.

References

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3):235–256.

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In EMNLP. Edinburgh, Scotland.

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016. An actor-critic algorithm for sequence prediction. eprint arXiv:1607.07086.

Ondřej Bojar, Roman Sudarikov, Tom Kocmi, Jindřich Helcl, and Ondřej Cífka. 2016a. UFAL submissions to the IWSLT 2016 MT track. In IWSLT. Seattle, WA.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016b. Findings of the 2016 conference on machine translation. In WMT. Berlin, Germany.

Sébastien Bubeck and Nicolò Cesa-Bianchi. 2012. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5(1):1–122.

Olivier Chapelle, Eren Manavoglu, and Rómer Rosales. 2014. Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology 5(4).

Christos Dimitrakakis, Guangliang Li, and Nikolaos Tziortziotis. 2014. The reinforcement learning competition 2014. AI Magazine 35(3):61–65.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In HLT-NAACL. Atlanta, GA.

Chris Dyer, Adam Lopez, Juri Ganitkevitch, Jonathan Weese, Ferhan Ture, Phil Blunsom, Hendra Setiawan, Vladimir Eidelman, and Philip Resnik. 2010. cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In ACL Demo. Uppsala, Sweden.

Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. 2005. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA. Vancouver, Canada.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2016. Can machine translation systems be evaluated by the crowd alone? Natural Language Engineering 23(1):3–30.

Sébastien Jean, Orhan Firat, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. Montreal neural machine translation systems for WMT'15. In WMT. Lisbon, Portugal.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. eprint arXiv:1412.6980.

Julia Kreutzer, Artem Sokolov, and Stefan Riezler. 2017. Bandit structured prediction for neural sequence-to-sequence learning. In ACL. Vancouver, Canada.

Carolin Lawrence, Artem Sokolov, and Stefan Riezler. 2017. Counterfactual learning from bandit feedback under deterministic logging: A case study in statistical machine translation. In EMNLP. Copenhagen, Denmark.

Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In EMNLP. Singapore.

Jindřich Libovický, Jindřich Helcl, Marek Tlustý, Pavel Pecina, and Ondřej Bojar. 2016. CUNI system for WMT16 automatic post-editing and multimodal translation tasks. In WMT. Berlin, Germany.

Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In ACL. Barcelona, Spain.

Adam Lopez. 2007. Hierarchical phrase-based translation with suffix arrays. In EMNLP-CoNLL. Prague, Czech Republic.

Thang Luong, Ilya Sutskever, Quoc Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In ACL. Beijing, China.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In ICML. Atlanta, GA.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In ICLR. San Juan, Puerto Rico.

Kenneth Rose. 1998. Deterministic annealing for clustering, compression, classification, regression and related optimization problems. Proceedings of the IEEE 86(11).

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokrý, and Maria Nădejde. 2017. Nematus: a toolkit for neural machine translation. In EACL. Valencia, Spain.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh neural machine translation systems for WMT 16. In WMT. Berlin, Germany.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In ACL. Berlin, Germany.

Artem Sokolov, Julia Kreutzer, Christopher Lo, and Stefan Riezler. 2016a. Learning structured predictors from bandit feedback for interactive NLP. In ACL. Berlin, Germany.

Artem Sokolov, Julia Kreutzer, Christopher Lo, and Stefan Riezler. 2016b. Stochastic structured prediction under bandit feedback. In NIPS. Barcelona, Spain.

Artem Sokolov, Stefan Riezler, and Tanguy Urvoy. 2015. Bandit structured prediction for learning from partial feedback in statistical machine translation. In MT Summit. Miami, FL.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. JMLR 15(1):1929–1958.

Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. The MIT Press.

Csaba Szepesvári. 2009. Algorithms for Reinforcement Learning. Morgan & Claypool.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4):229–256.

Guillaume Wisniewski. 2017. LIMSI submission for WMT'17 shared task on bandit learning. In WMT. Copenhagen, Denmark.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. eprint arXiv:1212.5701.
