
  • ACL 2014

Ninth Workshop on Statistical Machine Translation

    Proceedings of the Workshop

June 26-27, 2014, Baltimore, Maryland, USA

• ©2014 The Association for Computational Linguistics

    Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
[email protected]

    ISBN 978-1-941643-17-4


  • Introduction

The ACL 2014 Workshop on Statistical Machine Translation (WMT 2014) took place on Thursday and Friday, June 26-27, 2014 in Baltimore, United States, immediately following the Conference of the Association for Computational Linguistics (ACL).

This is the ninth time this workshop has been held. It was first held at HLT-NAACL 2006 in New York City, USA. In the following years the Workshop on Statistical Machine Translation was held at ACL 2007 in Prague, Czech Republic, ACL 2008 in Columbus, Ohio, USA, EACL 2009 in Athens, Greece, ACL 2010 in Uppsala, Sweden, EMNLP 2011 in Edinburgh, Scotland, NAACL 2012 in Montreal, Canada, and ACL 2013 in Sofia, Bulgaria.

The focus of our workshop was the use of parallel corpora for machine translation. Recent experimentation has shown that the performance of SMT systems varies greatly with the source language. In this workshop we encouraged researchers to investigate ways to improve the performance of SMT systems for diverse languages, including morphologically more complex languages, languages with partially free word order, and low-resource languages.

Prior to the workshop, in addition to soliciting relevant papers for review and possible presentation, we conducted four shared tasks: a general translation task, a medical translation task, a quality estimation task, and a task to test automatic evaluation metrics. The medical translation task was introduced this year to address the important issue of domain adaptation within SMT. The results of the shared tasks were announced at the workshop, and these proceedings also include an overview paper for the shared tasks that summarizes the results and provides information about the data used and any procedures that were followed in conducting or scoring the tasks. In addition, there are short papers from each participating team that describe their underlying system in greater detail.

As in previous years, we received far more submissions than we could accept for presentation. This year we received 27 full paper submissions and 49 shared task submissions. In total, WMT 2014 featured 12 full paper oral presentations and 49 shared task poster presentations.

The invited talk was given by Alon Lavie (Carnegie Mellon University and Safaba Translation Solutions, Inc.), entitled "Machine Translation in Academia and in the Commercial World: a Contrastive Perspective".

We would like to thank the members of the Program Committee for their timely reviews. We also would like to thank the participants of the shared tasks and all the other volunteers who helped with the evaluations.

Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Matouš Macháček, Christof Monz, Pavel Pecina, Matt Post, Hervé Saint-Amand, Radu Soricut, and Lucia Specia

    Co-Organizers


  • Organizers:

Ondřej Bojar (Charles University Prague)
Christian Buck (University of Edinburgh)
Christian Federmann (Microsoft Research)
Barry Haddow (University of Edinburgh)
Philipp Koehn (University of Edinburgh / Johns Hopkins University)
Matouš Macháček (Charles University Prague)
Christof Monz (University of Amsterdam)
Pavel Pecina (Charles University Prague)
Matt Post (Johns Hopkins University)
Hervé Saint-Amand (University of Edinburgh)
Radu Soricut (Google)
Lucia Specia (University of Sheffield)

    Invited Talk:

Alon Lavie (Research Professor at Carnegie Mellon University / Co-founder, President and CTO, Safaba Translation Solutions, Inc.)

    Program Committee:

• Lars Ahrenberg (Linköping University)
Alexander Allauzen (Université Paris-Sud / LIMSI-CNRS)
Tim Anderson (Air Force Research Laboratory)
Eleftherios Avramidis (German Research Center for Artificial Intelligence)
Wilker Aziz (University of Sheffield)
Daniel Beck (University of Sheffield)
Jose Miguel Benedi (Universitat Politècnica de València)
Nicola Bertoldi (FBK)
Ergun Bicici (Centre for Next Generation Localisation, Dublin City University)
Alexandra Birch (University of Edinburgh)
Arianna Bisazza (University of Amsterdam)
Graeme Blackwood (IBM Research)
Phil Blunsom (University of Oxford)
Fabienne Braune (University of Stuttgart)
Chris Brockett (Microsoft Research)
Hailong Cao (Harbin Institute of Technology)
Michael Carl (Copenhagen Business School)
Marine Carpuat (National Research Council)
Francisco Casacuberta (Universitat Politècnica de València)
Daniel Cer (Google)
Boxing Chen (NRC)
Colin Cherry (NRC)
David Chiang (USC/ISI)
Vishal Chowdhary (Microsoft)


• Steve DeNeefe (SDL Language Weaver)
Michael Denkowski (Carnegie Mellon University)
Jacob Devlin (Raytheon BBN Technologies)
Markus Dreyer (SDL Language Weaver)
Kevin Duh (Nara Institute of Science and Technology)
Marcello Federico (FBK)
Yang Feng (USC/ISI)
Andrew Finch (NICT)
Mark Fishel (University of Zurich)
José A. R. Fonollosa (Universitat Politècnica de Catalunya)
George Foster (NRC)
Michel Galley (Microsoft Research)
Juri Ganitkevitch (Johns Hopkins University)
Katya Garmash (University of Amsterdam)
Josef van Genabith (Dublin City University)
Ulrich Germann (University of Edinburgh)
Daniel Gildea (University of Rochester)
Kevin Gimpel (Toyota Technological Institute at Chicago)
Jesús González-Rubio (Universitat Politècnica de València)
Yvette Graham (The University of Melbourne)
Spence Green (Stanford University)
Francisco Guzmán (Qatar Computing Research Institute)
Greg Hanneman (Carnegie Mellon University)
Christian Hardmeier (Uppsala universitet)
Eva Hasler (University of Edinburgh)
Yifan He (New York University)
Kenneth Heafield (Stanford)
John Henderson (MITRE)
Felix Hieber (Heidelberg University)
Hieu Hoang (University of Edinburgh)
Stéphane Huet (Université d'Avignon)
Young-Sook Hwang (SKPlanet)
Gonzalo Iglesias (University of Cambridge)
Ann Irvine (Johns Hopkins University)
Abe Ittycheriah (IBM)
Laura Jehl (Heidelberg University)
Doug Jones (MIT Lincoln Laboratory)
Maxim Khalilov (BMMT)
Alexander Koller (University of Potsdam)
Roland Kuhn (National Research Council of Canada)
Shankar Kumar (Google)
Mathias Lambert (Amazon.com)
Phillippe Langlais (Université de Montréal)
Alon Lavie (Carnegie Mellon University)
Gennadi Lembersky (NICE Systems)
William Lewis (Microsoft Research)
Lemao Liu (The City University of New York)


• Qun Liu (Dublin City University)
Wolfgang Macherey (Google)
Saab Mansour (RWTH Aachen University)
José B. Mariño (Universitat Politècnica de Catalunya)
Cettolo Mauro (FBK)
Arne Mauser (Google, Inc)
Jon May (SDL Language Weaver)
Wolfgang Menzel (Hamburg University)
Shachar Mirkin (Xerox Research Centre Europe)
Yusuke Miyao (National Institute of Informatics)
Dragos Munteanu (SDL Language Technologies)
Markos Mylonakis (Lexis Research)
Lluís Màrquez (Qatar Computing Research Institute)
Preslav Nakov (Qatar Computing Research Institute)
Graham Neubig (Nara Institute of Science and Technology)
Jan Niehues (Karlsruhe Institute of Technology)
Kemal Oflazer (Carnegie Mellon University - Qatar)
Daniel Ortiz-Martínez (Copenhagen Business School)
Stephan Peitz (RWTH Aachen University)
Sergio Penkale (Lingo24)
Maja Popović (DFKI)
Stefan Riezler (Heidelberg University)
Johann Roturier (Symantec)
Raphael Rubino (Prompsit Language Engineering)
Alexander M. Rush (MIT)
Anoop Sarkar (Simon Fraser University)
Hassan Sawaf (eBay Inc.)
Lane Schwartz (Air Force Research Laboratory)
Jean Senellart (SYSTRAN)
Rico Sennrich (University of Zurich)
Kashif Shah (University of Sheffield)
Wade Shen (MIT)
Patrick Simianer (Heidelberg University)
Linfeng Song (ICT/CAS)
Sara Stymne (Uppsala University)
Katsuhito Sudoh (NTT Communication Science Laboratories / Kyoto University)
Felipe Sánchez-Martínez (Universitat d'Alacant)
Jörg Tiedemann (Uppsala University)
Christoph Tillmann (TJ Watson IBM Research)
Antonio Toral (Dublin City University)
Hajime Tsukada (NTT Communication Science Laboratories)
Yulia Tsvetkov (Carnegie Mellon University)
Dan Tufis (Research Institute for Artificial Intelligence, Romanian Academy)
Marco Turchi (Fondazione Bruno Kessler)
Ferhan Ture (University of Maryland)
Masao Utiyama (NICT)
Ashish Vaswani (University of Southern California Information Sciences Institute)


• David Vilar (Pixformance GmbH)
Stephan Vogel (Qatar Computing Research Institute)
Haifeng Wang (Baidu)
Taro Watanabe (NICT)
Marion Weller (Universität Stuttgart)
Philip Williams (University of Edinburgh)
Guillaume Wisniewski (Univ. Paris Sud and LIMSI-CNRS)
Hua Wu (Baidu)
Joern Wuebker (RWTH Aachen University)
Peng Xu (Google Inc.)
Wenduan Xu (Cambridge University)
François Yvon (LIMSI/CNRS)
Richard Zens (Google)
Hao Zhang (Google)
Liu Zhanyi (Baidu)


  • Table of Contents

    Efficient Elicitation of Annotations for Human Evaluation of Machine TranslationKeisuke Sakaguchi, Matt Post and Benjamin Van Durme. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

    Findings of the 2014 Workshop on Statistical Machine TranslationOndrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Lev-

    eling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia and AleTamchyna . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    Parallel FDA5 for Fast Deployment of Accurate Statistical Machine Translation SystemsErgun Bicici, Qun Liu and Andy Way. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

    Yandex School of Data Analysis Russian-English Machine Translation System for WMT14Alexey Borisov and Irina Galinskaya . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

    CimS The CIS and IMS joint submission to WMT 2014 translating from English into GermanFabienne Cap, Marion Weller, Anita Ramm and Alexander Fraser . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

    English-to-Hindi system description for WMT 2014: Deep Source-Context Features for MosesMarta R. Costa-juss, Parth Gupta, Paolo Rosso and Rafael E. Banchs . . . . . . . . . . . . . . . . . . . . . . . 79

    The KIT-LIMSI Translation System for WMT 2014Quoc Khanh Do, Teresa Herrmann, Jan Niehues, Alexander Allauzen, Franois Yvon and Alex

    Waibel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

    The IIT Bombay Hindi-English Translation System at WMT 2014Piyush Dungarwal, Rajen Chatterjee, Abhijit Mishra, Anoop Kunchukuttan, Ritesh Shah and Push-

    pak Bhattacharyya . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

Edinburgh's Phrase-based Machine Translation Systems for WMT-14
Nadir Durrani, Barry Haddow, Philipp Koehn and Kenneth Heafield . . . . . . . . . . . . . . . . . . . . . . . . . 97

    EU-BRIDGE MT: Combined Machine TranslationMarkus Freitag, Stephan Peitz, Joern Wuebker, Hermann Ney, Matthias Huck, Rico Sennrich, Nadir

    Durrani, Maria Nadejde, Philip Williams, Philipp Koehn, Teresa Herrmann, Eunah Cho and Alex Waibel105

    Phrasal: A Toolkit for New Directions in Statistical Machine TranslationSpence Green, Daniel Cer and Christopher Manning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

    Anaphora Models and Reordering for Phrase-Based SMTChristian Hardmeier, Sara Stymne, Jrg Tiedemann, Aaron Smith and Joakim Nivre . . . . . . . . . 122

    The Karlsruhe Institute of Technology Translation Systems for the WMT 2014Teresa Herrmann, Mohammed Mediani, Eunah Cho, Thanh-Le Ha, Jan Niehues, Isabel Slawik,

    Yuqi Zhang and Alex Waibel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

    The DCU-ICTCAS MT system at WMT 2014 on German-English Translation TaskLiangyou Li, Xiaofeng Wu, Santiago Cortes Vaillo, Jun Xie, Andy Way and Qun Liu . . . . . . . . . 136

    The CMU Machine Translation Systems at WMT 2014Austin Matthews, Waleed Ammar, Archna Bhatia, Weston Feely, Greg Hanneman, Eva Schlinger,

    Swabha Swayamdipta, Yulia Tsvetkov, Alon Lavie and Chris Dyer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142


• Stanford University's Submissions to the WMT 2014 Translation Task
Julia Neidert, Sebastian Schuster, Spence Green, Kenneth Heafield and Christopher Manning . 150

    The RWTH Aachen German-English Machine Translation System for WMT 2014Stephan Peitz, Joern Wuebker, Markus Freitag and Hermann Ney . . . . . . . . . . . . . . . . . . . . . . . . . . 157

    Large-scale Exact Decoding: The IMS-TTT submission to WMT14Daniel Quernheim and Fabienne Cap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

    Abu-MaTran at WMT 2014 Translation Task: Two-step Data Selection and RBMT-Style Synthetic RulesRaphael Rubino, Antonio Toral, Vctor M. Snchez-Cartagena, Jorge Ferrndez-Tordera, Sergio

    Ortiz Rojas, Gema Ramrez-Snchez, Felipe Snchez-Martnez and Andy Way . . . . . . . . . . . . . . . . . . . 171

    The UA-Prompsit hybrid machine translation system for the 2014 Workshop on Statistical MachineTranslation

    Vctor M. Snchez-Cartagena, Juan Antonio Prez-Ortiz and Felipe Snchez-Martnez . . . . . . . 178

    Machine Translation and Monolingual Postediting: The AFRL WMT-14 SystemLane Schwartz, Timothy Anderson, Jeremy Gwinnup and Katherine Young . . . . . . . . . . . . . . . . . 186

    CUNI in WMT14: Chimera Still Awaits BellerophonAle Tamchyna, Martin Popel, Rudolf Rosa and Ondrej Bojar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

    Manawi: Using Multi-Word Expressions and Named Entities to Improve Machine TranslationLiling Tan and Santanu Pal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

Edinburgh's Syntax-Based Systems at WMT 2014
Philip Williams, Rico Sennrich, Maria Nadejde, Matthias Huck, Eva Hasler and Philipp Koehn . . . 207

    DCU-Lingo24 Participation in WMT 2014 Hindi-English Translation taskXiaofeng Wu, Rejwanul Haque, Tsuyoshi Okita, Piyush Arora, Andy Way and Qun Liu . . . . . . 215

    Machine Translation of Medical Texts in the Khresmoi ProjectOndrej Duek, Jan Hajic, Jaroslava Hlavcov, Michal Novk, Pavel Pecina, Rudolf Rosa, Ale

    Tamchyna, Zdenka Ureov and Daniel Zeman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221

Postech's System Description for Medical Text Translation Task
Jianri Li, Se-Jong Kim, Hwidong Na and Jong-Hyeok Lee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229

    Domain Adaptation for Medical Text Translation using Web ResourcesYi Lu, Longyue Wang, Derek F. Wong, Lidia S. Chao and Yiming Wang . . . . . . . . . . . . . . . . . . . . 233

    DCU Terminology Translation System for Medical Query Subtask at WMT14Tsuyoshi Okita, Ali Vahid, Andy Way and Qun Liu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239

    LIMSI @ WMT14 Medical Translation TaskNicolas Pcheux, Li Gong, Quoc Khanh Do, Benjamin Marie, Yulia Ivanishcheva, Alexander Al-

    lauzen, Thomas Lavergne, Jan Niehues, Aurlien Max and Franois Yvon . . . . . . . . . . . . . . . . . . . . . . . 246

    Combining Domain Adaptation Approaches for Medical Text TranslationLongyue Wang, Yi Lu, Derek F. Wong, Lidia S. Chao, Yiming Wang and Francisco Oliveira . . 254

    Experiments in Medical Translation Shared Task at WMT 2014Jian Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260


  • Randomized Significance Tests in Machine TranslationYvette Graham, Nitika Mathur and Timothy Baldwin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266

    Estimating Word Alignment Quality for SMT Reordering TasksSara Stymne, Jrg Tiedemann and Joakim Nivre . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275

    Dependency-based Automatic Enumeration of Semantically Equivalent Word Orders for Evaluating JapaneseTranslations

    Hideki Isozaki, Natsume Kouchi and Tsutomu Hirao . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287

    Results of the WMT14 Metrics Shared TaskMatous Machacek and Ondrej Bojar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293

    Efforts on Machine Learning over Human-mediated Translation Edit RateEleftherios Avramidis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302

    SHEF-Lite 2.0: Sparse Multi-task Gaussian Processes for Translation Quality EstimationDaniel Beck, Kashif Shah and Lucia Specia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307

    Referential Translation Machines for Predicting Translation QualityErgun Bicici and Andy Way . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313

    FBK-UPV-UEdin participation in the WMT14 Quality Estimation shared-taskJos Guilherme Camargo de Souza, Jess Gonzlez-Rubio, Christian Buck, Marco Turchi and

    Matteo Negri . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322

    Target-Centric Features for Translation Quality EstimationChris Hokamp, Iacer Calixto, Joachim Wagner and Jian Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329

    LIG System for Word Level QE task at WMT14Ngoc Quang Luong, Laurent Besacier and Benjamin Lecouteux . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335

    Exploring Consensus in Machine Translation for Quality EstimationCarolina Scarton and Lucia Specia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342

    LIMSI Submission for WMT14 QE TaskGuillaume Wisniewski, Nicolas Pcheux, Alexander Allauzen and Franois Yvon . . . . . . . . . . . . 348

    Parmesan: Meteor without Paraphrases with Paraphrased ReferencesPetra Barancikova . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355

    A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEUBoxing Chen and Colin Cherry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362

    VERTa participation in the WMT14 Metrics TaskElisabet Comelles and Jordi Atserias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368

    Meteor Universal: Language Specific Translation Evaluation for Any Target LanguageMichael Denkowski and Alon Lavie . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376

    Application of Prize based on Sentence Length in Chunk-based Automatic Evaluation of Machine Trans-lation

    Hiroshi Echizenya, Kenji Araki and Eduard Hovy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381

    LAYERED: Metric for Machine Translation EvaluationShubham Gautam and Pushpak Bhattacharyya . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387


  • IPA and STOUT: Leveraging Linguistic and Source-based Features for Machine Translation EvaluationMeritxell Gonzlez, Alberto Barrn-Cedeo and Llus Mrquez . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394

    DiscoTK: Using Discourse Structure for Machine Translation EvaluationShafiq Joty, Francisco Guzmn, Llus Mrquez and Preslav Nakov. . . . . . . . . . . . . . . . . . . . . . . . . .402

    Tolerant BLEU: a Submission to the WMT14 Metrics TaskJindrich Libovick and Pavel Pecina . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409

    BEER: BEtter Evaluation as RankingMilos Stanojevic and Khalil Simaan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414

    RED, The DCU-CASICT Submission of Metrics TasksXiaofeng Wu, Hui Yu and Qun Liu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420

    Crowdsourcing High-Quality Parallel Data Extraction from TwitterWang Ling, Luis Marujo, Chris Dyer, Alan W Black and Isabel Trancoso . . . . . . . . . . . . . . . . . . . 426

    Using Comparable Corpora to Adapt MT Models to New DomainsAnn Irvine and Chris Callison-Burch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437

    Dynamic Topic Adaptation for SMT using Distributional ProfilesEva Hasler, Barry Haddow and Philipp Koehn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445

    Unsupervised Adaptation for Statistical Machine TranslationSaab Mansour and Hermann Ney . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457

    An Empirical Comparison of Features and Tuning for Phrase-based Machine TranslationSpence Green, Daniel Cer and Christopher Manning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466

    Bayesian Reordering Model with Feature SelectionAbdullah Alrajeh and Mahesan Niranjan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477

    Augmenting String-to-Tree and Tree-to-String Translation with Non-Syntactic PhrasesMatthias Huck, Hieu Hoang and Philipp Koehn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486

    Linear Mixture Models for Robust Machine TranslationMarine Carpuat, Cyril Goutte and George Foster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499


  • Workshop Program

    Thursday, June 26, 2014

9:00-9:10 Opening Remarks

    Session 1: Shared Translation Tasks

9:10-9:30 Efficient Elicitation of Annotations for Human Evaluation of Machine Translation
Keisuke Sakaguchi, Matt Post and Benjamin Van Durme

9:30-10:00 Findings of the 2014 Workshop on Statistical Machine Translation
Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Hervé Saint-Amand, Radu Soricut, Lucia Specia and Aleš Tamchyna

    10:00-10:30 Panel Discussion

10:30-11:00 Coffee

    Session 2: Poster Session

    11:00-12:30 Shared Task: Translation

    Parallel FDA5 for Fast Deployment of Accurate Statistical Machine TranslationSystemsErgun Bicici, Qun Liu and Andy Way

    Yandex School of Data Analysis Russian-English Machine Translation System forWMT14Alexey Borisov and Irina Galinskaya

    CimS The CIS and IMS joint submission to WMT 2014 translating from Englishinto GermanFabienne Cap, Marion Weller, Anita Ramm and Alexander Fraser

    English-to-Hindi system description for WMT 2014: Deep Source-Context Featuresfor MosesMarta R. Costa-juss, Parth Gupta, Paolo Rosso and Rafael E. Banchs

    The KIT-LIMSI Translation System for WMT 2014Quoc Khanh Do, Teresa Herrmann, Jan Niehues, Alexander Allauzen, FranoisYvon and Alex Waibel

    The IIT Bombay Hindi-English Translation System at WMT 2014Piyush Dungarwal, Rajen Chatterjee, Abhijit Mishra, Anoop Kunchukuttan, RiteshShah and Pushpak Bhattacharyya


  • Thursday, June 26, 2014 (continued)

Edinburgh's Phrase-based Machine Translation Systems for WMT-14
Nadir Durrani, Barry Haddow, Philipp Koehn and Kenneth Heafield

    EU-BRIDGE MT: Combined Machine TranslationMarkus Freitag, Stephan Peitz, Joern Wuebker, Hermann Ney, Matthias Huck, Rico Sen-nrich, Nadir Durrani, Maria Nadejde, Philip Williams, Philipp Koehn, Teresa Herrmann,Eunah Cho and Alex Waibel

    Phrasal: A Toolkit for New Directions in Statistical Machine TranslationSpence Green, Daniel Cer and Christopher Manning

    Anaphora Models and Reordering for Phrase-Based SMTChristian Hardmeier, Sara Stymne, Jrg Tiedemann, Aaron Smith and Joakim Nivre

    The Karlsruhe Institute of Technology Translation Systems for the WMT 2014Teresa Herrmann, Mohammed Mediani, Eunah Cho, Thanh-Le Ha, Jan Niehues, IsabelSlawik, Yuqi Zhang and Alex Waibel

    The DCU-ICTCAS MT system at WMT 2014 on German-English Translation TaskLiangyou Li, Xiaofeng Wu, Santiago Cortes Vaillo, Jun Xie, Andy Way and Qun Liu

    The CMU Machine Translation Systems at WMT 2014Austin Matthews, Waleed Ammar, Archna Bhatia, Weston Feely, Greg Hanneman, EvaSchlinger, Swabha Swayamdipta, Yulia Tsvetkov, Alon Lavie and Chris Dyer

Stanford University's Submissions to the WMT 2014 Translation Task
Julia Neidert, Sebastian Schuster, Spence Green, Kenneth Heafield and Christopher Manning

    The RWTH Aachen German-English Machine Translation System for WMT 2014Stephan Peitz, Joern Wuebker, Markus Freitag and Hermann Ney

    Large-scale Exact Decoding: The IMS-TTT submission to WMT14Daniel Quernheim and Fabienne Cap

    Abu-MaTran at WMT 2014 Translation Task: Two-step Data Selection and RBMT-StyleSynthetic RulesRaphael Rubino, Antonio Toral, Vctor M. Snchez-Cartagena, Jorge Ferrndez-Tordera,Sergio Ortiz Rojas, Gema Ramrez-Snchez, Felipe Snchez-Martnez and Andy Way

    The UA-Prompsit hybrid machine translation system for the 2014 Workshop on StatisticalMachine TranslationVctor M. Snchez-Cartagena, Juan Antonio Prez-Ortiz and Felipe Snchez-Martnez


  • Thursday, June 26, 2014 (continued)

    Machine Translation and Monolingual Postediting: The AFRL WMT-14 SystemLane Schwartz, Timothy Anderson, Jeremy Gwinnup and Katherine Young

    CUNI in WMT14: Chimera Still Awaits BellerophonAle Tamchyna, Martin Popel, Rudolf Rosa and Ondrej Bojar

    Manawi: Using Multi-Word Expressions and Named Entities to Improve Machine Trans-lationLiling Tan and Santanu Pal

Edinburgh's Syntax-Based Systems at WMT 2014
Philip Williams, Rico Sennrich, Maria Nadejde, Matthias Huck, Eva Hasler and Philipp Koehn

    DCU-Lingo24 Participation in WMT 2014 Hindi-English Translation taskXiaofeng Wu, Rejwanul Haque, Tsuyoshi Okita, Piyush Arora, Andy Way and Qun Liu

    11:00-12:30 Shared Task: Medical Translation

    Machine Translation of Medical Texts in the Khresmoi ProjectOndrej Duek, Jan Hajic, Jaroslava Hlavcov, Michal Novk, Pavel Pecina, Rudolf Rosa,Ale Tamchyna, Zdenka Ureov and Daniel Zeman

Postech's System Description for Medical Text Translation Task
Jianri Li, Se-Jong Kim, Hwidong Na and Jong-Hyeok Lee

    Domain Adaptation for Medical Text Translation using Web ResourcesYi Lu, Longyue Wang, Derek F. Wong, Lidia S. Chao and Yiming Wang

    DCU Terminology Translation System for Medical Query Subtask at WMT14Tsuyoshi Okita, Ali Vahid, Andy Way and Qun Liu

    LIMSI @ WMT14 Medical Translation TaskNicolas Pcheux, Li Gong, Quoc Khanh Do, Benjamin Marie, Yulia Ivanishcheva, Alexan-der Allauzen, Thomas Lavergne, Jan Niehues, Aurlien Max and Franois Yvon

    Combining Domain Adaptation Approaches for Medical Text TranslationLongyue Wang, Yi Lu, Derek F. Wong, Lidia S. Chao, Yiming Wang and FranciscoOliveira

    Experiments in Medical Translation Shared Task at WMT 2014Jian Zhang


  • Thursday, June 26, 2014 (continued)

12:30-14:00 Lunch

    Session 3: Invited Talk

14:00-15:30 Machine Translation in Academia and in the Commercial World: a Contrastive Perspective. Alon Lavie, Research Professor, Carnegie Mellon University; Co-founder, President and CTO, Safaba Translation Solutions, Inc.

15:30-16:00 Coffee

    Session 4: Evaluation

16:00-16:20 Randomized Significance Tests in Machine Translation
Yvette Graham, Nitika Mathur and Timothy Baldwin

16:20-16:40 Estimating Word Alignment Quality for SMT Reordering Tasks
Sara Stymne, Jörg Tiedemann and Joakim Nivre

16:40-17:00 Dependency-based Automatic Enumeration of Semantically Equivalent Word Orders for Evaluating Japanese Translations
Hideki Isozaki, Natsume Kouchi and Tsutomu Hirao

    Friday, June 27, 2014

    Session 5: Shared Evaluation Metrics and Quality Estimation Tasks

9:00-9:30 Quality Estimation Shared Task

9:30-9:50 Results of the WMT14 Metrics Shared Task
Matouš Macháček and Ondřej Bojar

9:50-10:30 Panel

10:30-11:00 Coffee


  • Friday, June 27, 2014 (continued)

    Session 6: Poster Session

11:00-12:30 Shared Task: Quality Estimation

    Efforts on Machine Learning over Human-mediated Translation Edit RateEleftherios Avramidis

    SHEF-Lite 2.0: Sparse Multi-task Gaussian Processes for Translation Quality EstimationDaniel Beck, Kashif Shah and Lucia Specia

    Referential Translation Machines for Predicting Translation QualityErgun Bicici and Andy Way

    FBK-UPV-UEdin participation in the WMT14 Quality Estimation shared-taskJos Guilherme Camargo de Souza, Jess Gonzlez-Rubio, Christian Buck, Marco Turchiand Matteo Negri

    Target-Centric Features for Translation Quality EstimationChris Hokamp, Iacer Calixto, Joachim Wagner and Jian Zhang

    LIG System for Word Level QE task at WMT14Ngoc Quang Luong, Laurent Besacier and Benjamin Lecouteux

    Exploring Consensus in Machine Translation for Quality EstimationCarolina Scarton and Lucia Specia

    LIMSI Submission for WMT14 QE TaskGuillaume Wisniewski, Nicolas Pcheux, Alexander Allauzen and Franois Yvon

11:00-12:30 Shared Task: Evaluation Metrics

    Parmesan: Meteor without Paraphrases with Paraphrased ReferencesPetra Barancikova

    A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEUBoxing Chen and Colin Cherry


  • Friday, June 27, 2014 (continued)

    VERTa participation in the WMT14 Metrics TaskElisabet Comelles and Jordi Atserias

    Meteor Universal: Language Specific Translation Evaluation for Any Target LanguageMichael Denkowski and Alon Lavie

    Application of Prize based on Sentence Length in Chunk-based Automatic Evaluation ofMachine TranslationHiroshi Echizenya, Kenji Araki and Eduard Hovy

    LAYERED: Metric for Machine Translation EvaluationShubham Gautam and Pushpak Bhattacharyya

    IPA and STOUT: Leveraging Linguistic and Source-based Features for Machine Transla-tion EvaluationMeritxell Gonzlez, Alberto Barrn-Cedeo and Llus Mrquez

    DiscoTK: Using Discourse Structure for Machine Translation EvaluationShafiq Joty, Francisco Guzmn, Llus Mrquez and Preslav Nakov

    Tolerant BLEU: a Submission to the WMT14 Metrics TaskJindrich Libovick and Pavel Pecina

    BEER: BEtter Evaluation as RankingMilos Stanojevic and Khalil Simaan

    RED, The DCU-CASICT Submission of Metrics TasksXiaofeng Wu, Hui Yu and Qun Liu

12:30-14:00 Lunch


  • Friday, June 27, 2014 (continued)

    Session 7: Data and Adaptation

14:00-14:20 Crowdsourcing High-Quality Parallel Data Extraction from Twitter
Wang Ling, Luis Marujo, Chris Dyer, Alan W Black and Isabel Trancoso

14:20-14:40 Using Comparable Corpora to Adapt MT Models to New Domains
Ann Irvine and Chris Callison-Burch

14:40-15:00 Dynamic Topic Adaptation for SMT using Distributional Profiles
Eva Hasler, Barry Haddow and Philipp Koehn

15:00-15:20 Unsupervised Adaptation for Statistical Machine Translation
Saab Mansour and Hermann Ney

15:20-16:00 Coffee

    Session 8: Translation Models

16:00-16:20 An Empirical Comparison of Features and Tuning for Phrase-based Machine Translation
Spence Green, Daniel Cer and Christopher Manning

16:20-16:40 Bayesian Reordering Model with Feature Selection
Abdullah Alrajeh and Mahesan Niranjan

16:40-17:00 Augmenting String-to-Tree and Tree-to-String Translation with Non-Syntactic Phrases
Matthias Huck, Hieu Hoang and Philipp Koehn

17:00-17:20 Linear Mixture Models for Robust Machine Translation
Marine Carpuat, Cyril Goutte and George Foster


• Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 1-11, Baltimore, Maryland, USA, June 26-27, 2014. ©2014 Association for Computational Linguistics

Efficient Elicitation of Annotations for Human Evaluation of Machine Translation

Keisuke Sakaguchi, Matt Post, Benjamin Van Durme
Center for Language and Speech Processing

Human Language Technology Center of Excellence
Johns Hopkins University, Baltimore, Maryland
{keisuke,post,vandurme}@cs.jhu.edu

    Abstract

A main output of the annual Workshop on Statistical Machine Translation (WMT) is a ranking of the systems that participated in its shared translation tasks, produced by aggregating pairwise sentence-level comparisons collected from human judges. Over the past few years, there have been a number of tweaks to the aggregation formula in attempts to address issues arising from the inherent ambiguity and subjectivity of the task, as well as weaknesses in the proposed models and the manner of model selection.

We continue this line of work by adapting the TrueSkill™ algorithm (an online approach for modeling the relative skills of players in ongoing competitions, such as Microsoft's Xbox Live) to the human evaluation of machine translation output. Our experimental results show that TrueSkill outperforms other recently proposed models on accuracy, and also can significantly reduce the number of pairwise annotations that need to be collected by sampling non-uniformly from the space of system competitions.

    1 Introduction

The Workshop on Statistical Machine Translation (WMT) has long been a central event in the machine translation (MT) community for the evaluation of MT output. It hosts an annual set of shared translation tasks focused mostly on the translation of western European languages. One of its main functions is to publish a ranking of the systems for each task, which are produced by aggregating a large number of human judgments of sentence-level pairwise rankings of system outputs. While the performance on many automatic metrics is also

#    score   range   system
1    0.638   1       UEDIN-HEAFIELD
2    0.604   2-3     UEDIN
     0.591   2-3     ONLINE-B
4    0.571   4-5     LIMSI-SOUL
     0.562   4-5     KIT
     0.541   5-6     ONLINE-A
7    0.512   7       MES-SIMPLIFIED
8    0.486   8       DCU
9    0.439   9-10    RWTH
     0.429   9-11    CMU-T2T
     0.420   10-11   CU-ZEMAN
12   0.389   12      JHU
13   0.322   13      SHEF-WPROA

Table 1: System rankings presented as clusters (WMT13 French-English competition). The score column is the percentage of time each system was judged better across its comparisons (§2.1).

reported (e.g., BLEU (Papineni et al., 2002)), the human evaluation is considered primary, and is in fact used as the gold standard for its metrics task, where evaluation metrics are evaluated.

In machine translation, the longstanding disagreements about evaluation measures do not go away when moving from automatic metrics to human judges. This is due in no small part to the inherent ambiguity and subjectivity of the task, but also arises from the particular way that the WMT organizers produce the rankings. The system-level rankings are produced by collecting pairwise sentence-level comparisons between system outputs. These are then aggregated to produce a complete ordering of all systems, or, more recently, a partial ordering (Koehn, 2012), with systems clustered where they cannot be distinguished in a statistically significant way (Table 1, taken from Bojar et al. (2013)).

A number of problems have been noted with this approach. The first has to do with the nature of ranking itself. Over the past few years, the WMT organizers have introduced a number of minor tweaks to the ranking algorithm (§2) in reaction to largely intuitive arguments that have been


• raised about how the evaluation is conducted (Bojar et al., 2011; Lopez, 2012). While these tweaks have been sensible (and later corroborated), Hopkins and May (2013) point out that this is essentially a model selection task, and should properly be driven by empirical performance on held-out data according to some metric. Instead of intuition, they suggest perplexity, and show that a novel graphical model outperforms existing approaches on that metric, with less data.

A second problem is the deficiency of the models used to produce the ranking, which work by computing simple ratios of wins (and, optionally, ties) to losses. Such approaches do not consider the relative difficulty of system matchups, and thus leave open the possibility that a system is ranked highly from the luck of comparisons against poorer opponents.

Third, a large number of judgments need to be collected in order to separate the systems into clusters to produce a partial ranking. The sheer size of the space of possible comparisons (all pairs of systems times the number of segments in the test set) requires sampling from this space and distributing the annotations across a number of judges. Even still, the number of judgments needed to produce statistically significant rankings like those in Table 1 grows quadratically in the number of participating systems (Koehn, 2012), often forcing the use of paid, lower-quality annotators hired on Amazon's Mechanical Turk. Part of the problem is that the sampling strategy collects data uniformly across system pairings. Intuitively, we should need many fewer annotations between systems with divergent base performance levels, instead focusing the collection effort on system pairs whose performance is more closely matched, in order to tease out the gaps between similarly-performing systems. Why spend precious human time on redundantly affirming predictable outcomes?

To address these issues, we developed a variation of the TrueSkill model (Herbrich et al., 2006), an adaptive model of competitions originally developed for the Xbox Live online gaming community. It assumes that each player's skill level follows a Gaussian distribution N(μ, σ²), in which μ represents a player's mean performance, and σ² the system's uncertainty about its current estimate of this mean. These values are updated after each game (in our case, the value of a ternary judgment) in proportion to how surprising the outcome is. TrueSkill has been adapted to a number of areas, including chess, advertising, and academic conference management.

The rest of this paper provides an empirical comparison of a number of models of human evaluation (§2). We evaluate on perplexity and also on accuracy, showing that the two are not always correlated, and arguing for the primacy of the latter (§3). We find that TrueSkill outperforms other models (§4). Moreover, TrueSkill also allows us to drastically reduce the amount of data that needs to be collected by sampling non-uniformly from the space of all competitions (§5), which also allows for greater separation of the systems into ranked clusters (§6).

    2 Models

Before introducing our adaptation of the TrueSkill model for ranking translation systems with human judgments (§2.3), we describe two comparison models: the Expected Wins model used in recent evaluations (§2.1), and the Bayesian model proposed by Hopkins and May (§2.2).

As we described briefly in the introduction, WMT produces system rankings by aggregating sentence-level ternary judgments of the form:

(i, S1, S2, π)

where i is the source segment (id), S1 and S2 are the system pair drawn from a set of systems {S}, and π ∈ {<, >, =} denotes whether the first system was judged to be better than, worse than, or equivalent to the second. These ternary judgments are obtained by presenting judges with a randomly-selected input sentence and the reference, followed by five randomly-selected translations of that sentence. Annotators are asked to rank these systems from best (rank 1) to worst (rank 5), ties permitted, and with no meaning ascribed to the absolute values or differences between ranks. This is done to accelerate data collection, since it yields ten pairwise comparisons per ranking. Tens of thousands of judgments of this form constitute the raw data used to compute the system-level rankings. All the work described in this section is computed over these pairwise comparisons, which are treated as if they were collected independently.
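As an illustration of this expansion, the following sketch (our own hypothetical helper, not code from the WMT evaluation) converts one five-way ranking into its C(5,2) = 10 pairwise ternary judgments:

```python
from itertools import combinations

def ranking_to_pairwise(segment_id, ranked):
    """Expand one 5-way ranking into its ten pairwise ternary judgments.

    `ranked` is a list of (system_name, rank) pairs from one annotator;
    lower rank means better, ties allowed.  Returns tuples (i, S1, S2, pi)
    where pi is '<' (S1 better), '>' (S1 worse), or '=' (tie).
    """
    judgments = []
    for (sys_a, rank_a), (sys_b, rank_b) in combinations(ranked, 2):
        if rank_a < rank_b:
            pi = '<'
        elif rank_a > rank_b:
            pi = '>'
        else:
            pi = '='
        judgments.append((segment_id, sys_a, sys_b, pi))
    return judgments

# Example: one annotator's ranking of five systems for segment 17
print(ranking_to_pairwise(17, [('uedin', 1), ('online-b', 2), ('kit', 2),
                               ('jhu', 4), ('cu-zeman', 5)]))
```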

2.1 Expected Wins

The Expected Wins model computes the percentage of times that each system wins in its


• pairwise comparisons. Let A be the complete set of annotations or judgments of the form {i, S1, S2, R}. We assume these judgments have been converted into a normal form where S1 is either the winner or is tied with S2, and therefore R ∈ {<, =}.

2.2 Hopkins and May

Hopkins and May (2013) propose a generative model in which each system Sj has a relative ability μSj. For each annotation, a latent quality value qi is drawn for each of the two translations from a Gaussian centered on the corresponding system's mean, and the observed judgment is determined by a fixed decision radius d:

R = < if q1 − q2 > d
    > if q2 − q1 > d
    = otherwise

The task is then to infer the posterior parameters given the data: the system means μSj and, by necessity, the latent values {qi} for each of the pairwise comparison training instances. Hopkins and May do not publish code or describe details of this algorithm beyond mentioning Gibbs sampling, so we used our own implementation,3 and describe it here for completeness.

After initialization, we have training instances of the form (i, S1, S2, R, q1, q2), where all but the qi are observed. At a high level, the sampler iterates over the training data, inferring values of q1 and q2 for each annotation together in a single step of the sampler from the current values of the systems' means, {μj}.4 At the end of each iteration,

2 Note that better systems have higher relative abilities {μSj}. Better translations subsequently have on-average higher values {qi}, which translate into a lower ranking π.
3 github.com/keisks/wmt-trueskill
4 This worked better than a version of the sampler that changed one q at a time.


• these means are then recomputed by re-averaging all values of {qi} associated with that system. After the burn-in period, the μ's are stored as samples, which are averaged when the sampling concludes.

During each iteration, q1 and q2 are resampled from their corresponding system means:

q1 ~ N(μS1, σ²a)
q2 ~ N(μS2, σ²a)

We then update these values to respect the annotation as follows. Let t = q1 − q2 (S1 is the winner by human judgment), and ensure that the values are outside the decision radius d:

q1' = q1                  if t ≥ d
q1' = q1 + (d − t)/2      otherwise

q2' = q2                  if t ≥ d
q2' = q2 − (d − t)/2      otherwise

In the case of a tie:

q1' = q1 + (−d − t)/2     if t ≤ −d
q1' = q1                  if |t| < d
q1' = q1 + (d − t)/2      if t ≥ d

q2' = q2 − (−d − t)/2     if t ≤ −d
q2' = q2                  if |t| < d
q2' = q2 − (d − t)/2      if t ≥ d

These values are stored for the current iteration and averaged at its end to produce new estimates of the system means. The quantity d − t can be interpreted as a loss function, returning a high value when the observed outcome is unexpected and a low value otherwise (Figure 1).
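For concreteness, a single iteration of this sampler can be sketched as follows. This is a minimal, self-contained illustration with assumed hyperparameter values (sigma_a is a standard deviation and d a decision radius, both placeholders); it is not the released wmt-trueskill implementation:

```python
import random
from collections import defaultdict

def gibbs_iteration(annotations, mu, sigma_a=0.5, d=0.5):
    """One pass of the sampler described above.

    annotations: list of (seg_id, s1, s2, R) with R in {'<', '='} and s1 the
                 winner (or tied).  mu: dict mapping system name -> current mean.
    Returns new system means obtained by re-averaging the sampled q values.
    """
    samples = defaultdict(list)
    for _, s1, s2, R in annotations:
        # Resample latent translation qualities from the current system means.
        q1 = random.gauss(mu[s1], sigma_a)
        q2 = random.gauss(mu[s2], sigma_a)
        t = q1 - q2
        if R == '<':                      # s1 was judged better
            if t < d:                     # surprising outcome: push apart
                q1 += 0.5 * (d - t)
                q2 -= 0.5 * (d - t)
        else:                             # tie
            if t >= d:                    # too far apart: pull together
                q1 += 0.5 * (d - t)
                q2 -= 0.5 * (d - t)
            elif t <= -d:
                q1 += 0.5 * (-d - t)
                q2 -= 0.5 * (-d - t)
        samples[s1].append(q1)
        samples[s2].append(q2)
    # Re-estimate each system mean as the average of its sampled q values;
    # systems not seen in this batch keep their previous mean.
    new_mu = dict(mu)
    for s, vals in samples.items():
        new_mu[s] = sum(vals) / len(vals)
    return new_mu
```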

2.3 TrueSkill

Prior to 2012, the WMT organizers included reference translations among the system comparisons. These were used as a control against which the evaluators could be measured for consistency, on the assumption that the reference was almost always best. They were also included as data points in computing the system ranking. Another of Bojar et al. (2011)'s suggestions was to exclude this data, because systems compared more often against the references suffered unfairly. This can be further generalized to the observation that not all competitions are equal, and a good model should incorporate some notion of match difficulty when evaluating systems' abilities. The inference procedure above incorporates this notion implicitly, but the model itself does not include a notion of match difficulty or outcome surprisal.

A model that does is TrueSkill5 (Herbrich et al., 2006). TrueSkill is an adaptive, online system that also assumes that each system's skill level follows a Gaussian distribution, maintaining a mean μSj for each system Sj representing its current estimate of that system's native ability. However, it also maintains a per-system variance, σ²Sj, which represents TrueSkill's uncertainty about its estimate of each mean. After an outcome is observed (a game in which the result is a win, loss, or draw), the size of the updates is proportional to how surprising the outcome was, which is computed from the current system means and variances. If a translation from a system with a high mean is judged better than one from a system with a greatly lower mean, the result is not surprising, and the update size for the corresponding system means will be small. On the other hand, when an upset occurs in a competition, the means will receive larger updates.

Before defining the update equations, we need to be more concrete about how this notion of surprisal is incorporated. Let t = μS1 − μS2, the difference in system relative abilities, and let ε be a fixed hyper-parameter corresponding to the earlier decision radius. We then define two loss functions of this difference, one for wins and one for ties:

vwin(t, ε) = N(−ε + t) / Φ(−ε + t)

vtie(t, ε) = (N(−ε − t) − N(ε − t)) / (Φ(ε − t) − Φ(−ε − t))

where Φ(x) is the cumulative distribution function and the N's are Gaussians. Figures 1 and 2 display plots of these two functions compared to the Hopkins and May model. Note how vwin (Figure 1) increases exponentially as μS2 becomes greater than the (purportedly) better system, μS1.

As noted above, TrueSkill maintains not only estimates {μSj} of system abilities, but also system-specific confidences about those estimates

5 The goal of this section is to provide an intuitive description of TrueSkill as adapted for WMT manual evaluations, with enough detail to carry the main ideas. For more details, please see Herbrich et al. (2006).


• Figure 1: TrueSkill's vwin and the corresponding loss function in the Hopkins and May model, plotted as a function of the difference t = μS1 − μS2 of system means (ε = 0.5, c = 0.8 for TrueSkill, and d = 0.5 for the Hopkins and May model).

Figure 2: TrueSkill's vtie and the corresponding loss function in the Hopkins and May model, plotted as a function of the difference t = μS1 − μS2 of system means (ε = 0.5, c = 0.3, and d = 0.5).

{σSj}. These confidences also factor into the updates: while surprising outcomes result in larger updates to system means, higher confidences (represented by smaller variances) result in smaller updates. TrueSkill defines the following value:

c² = 2β² + σ²S1 + σ²S2

which accumulates the variances along with β, another free parameter. We can now define the update equations for the system means:

μS1 ← μS1 + (σ²S1 / c) · v(t/c, ε/c)

μS2 ← μS2 − (σ²S2 / c) · v(t/c, ε/c)

The second term in these equations captures the idea about balancing surprisal with confidence, described above.

In order to update the system-level confidences, TrueSkill defines another set of functions, w, for the cases of wins and ties. These functions are multiplicative factors that affect the amount of change in σ²:

wwin(t, ε) = vwin · (vwin + t − ε)

wtie(t, ε) = v²tie + ((ε − t) · N(ε − t) + (ε + t) · N(ε + t)) / (Φ(ε − t) − Φ(−ε − t))

The underlying idea is that these functions capture the outcome surprisal via v. This update always decreases the size of the variances σ², which means that the uncertainty about μ decreases as comparisons go on. With these defined, we can conclude by defining the updates for σ²S1 and σ²S2:

σ²S1 ← σ²S1 · [1 − (σ²S1 / c²) · w(t/c, ε/c)]

σ²S2 ← σ²S2 · [1 − (σ²S2 / c²) · w(t/c, ε/c)]

One final complication not presented here but relevant to adapting TrueSkill to the WMT setting: two additional parameters (one of them not discussed here) are incorporated into the update equations to give more weight to recent matches. This latest-oriented property is useful in the gaming setting for which TrueSkill was built, where players improve over time, but it is not applicable in the WMT competition setting. To cancel this property in TrueSkill, we set these parameters to 0 and 0.025 · |A| / 2, respectively, in order to lessen the impact of the order in which annotations are presented to the system.
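The v, w, and update equations above can be collected into a small sketch. This is an illustrative single-match update under assumed default values for eps and beta; it omits the extra recency parameters just discussed and is not the authors' implementation:

```python
import math

SQRT2PI = math.sqrt(2.0 * math.pi)

def pdf(x):   # standard normal density N(x)
    return math.exp(-0.5 * x * x) / SQRT2PI

def cdf(x):   # standard normal cumulative distribution Phi(x)
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def v_win(t, eps):
    return pdf(-eps + t) / cdf(-eps + t)

def w_win(t, eps):
    v = v_win(t, eps)
    return v * (v + t - eps)

def v_tie(t, eps):
    denom = cdf(eps - t) - cdf(-eps - t)
    return (pdf(-eps - t) - pdf(eps - t)) / denom

def w_tie(t, eps):
    v = v_tie(t, eps)
    denom = cdf(eps - t) - cdf(-eps - t)
    return v * v + ((eps - t) * pdf(eps - t) + (eps + t) * pdf(eps + t)) / denom

def trueskill_update(mu1, var1, mu2, var2, outcome, eps=0.5, beta=0.5):
    """Update (mean, variance) for two systems after one judgment.

    outcome is 'win' if system 1 was judged better, 'tie' otherwise; a loss
    is handled by calling the function with the arguments swapped.
    """
    c = math.sqrt(2.0 * beta ** 2 + var1 + var2)
    t = (mu1 - mu2) / c                    # difference scaled by c
    e = eps / c                            # decision radius scaled by c
    v = v_win(t, e) if outcome == 'win' else v_tie(t, e)
    w = w_win(t, e) if outcome == 'win' else w_tie(t, e)
    mu1_new = mu1 + (var1 / c) * v
    mu2_new = mu2 - (var2 / c) * v
    var1_new = var1 * (1.0 - (var1 / c ** 2) * w)
    var2_new = var2 * (1.0 - (var2 / c ** 2) * w)
    return (mu1_new, var1_new), (mu2_new, var2_new)
```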

2.4 Data selection with TrueSkill

A drawback of the standard WMT data collection method is that it samples uniformly from the space of pairwise system combinations. This is undesirable: systems with vastly divergent relative ability need not be compared as often as systems that are more evenly matched. Unfortunately, one cannot sample non-uniformly without knowing ahead of time which systems are better. TrueSkill provides a solution to this dilemma with its match-selection ability: systems with similar means and low variances can be confidently considered to be close matches. This presents a strong possibility of reducing the amount of data that needs to be


• collected in the WMT competitions. In fact, the TrueSkill formulation provides a way to compute the probability of a draw between two systems, which can be used to compute for a system Si a conditional distribution over matches with other systems {Sj≠i}.

Formally, in the TrueSkill model, the match-selection quality (chance to draw) between two players (systems in WMT) is computed as follows:

pdraw = √(2β² / c²) · exp(−(μa − μb)² / (2c²))

However, our setting for canceling the latest-oriented property affects this matching quality equation, making most systems appear almost equally competitive (pdraw ≈ 1). Therefore, we modify the equation in the following manner, which simply depends on the difference of the means:

pdraw = 1 / exp(|μa − μb|)

TrueSkill selects the matches it would like to create according to this selection criterion. We do this with the following process:

1. Select a system S1 (e.g., the one with the highest variance)

2. Compute a normalized distribution pdraw over matches with the other systems

3. Draw a system S2 from this distribution

4. Draw a source sentence, and present the pair to the judge for annotation
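Under the assumptions above, this selection process can be sketched as follows (an illustrative implementation of steps 1-3; the dictionary-based bookkeeping is our own, and the draw of the source sentence in step 4 is omitted):

```python
import math
import random

def select_match(systems, mu, var):
    """Pick the next pair of systems to compare, per the process above.

    systems: list of system names; mu, var: dicts of current TrueSkill
    means and variances.
    """
    # 1. Start from the system we are least certain about.
    s1 = max(systems, key=lambda s: var[s])
    # 2. Normalized distribution over opponents, using the simplified
    #    draw probability p_draw = exp(-|mu_a - mu_b|).
    others = [s for s in systems if s != s1]
    weights = [math.exp(-abs(mu[s1] - mu[s])) for s in others]
    total = sum(weights)
    probs = [w / total for w in weights]
    # 3. Draw the opponent from this distribution.
    s2 = random.choices(others, weights=probs, k=1)[0]
    return s1, s2
```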

    3 Experimental setup

    3.1 Datasets

    We used the evaluation data released by WMT13.6

The data contains (1) five-way system rankings made by either researchers or Turkers and (2) translation data consisting of source sentences, human reference translations, and submitted translations. Data exists for 10 language pairs. More details about the dataset can be found in the WMT 2013 overview paper (Bojar et al., 2013).

Each five-way system ranking was converted into ten pairwise judgments (§2). We trained the models using randomly selected sets of 400, 800, 1,600, 3,200, and 6,400 pairwise comparisons,

    6statmt.org/wmt13/results.html

each produced in two ways: selecting from all researchers, or split between researchers and Turkers. An important note is that the training data differs according to the model. For the Expected Wins and Hopkins and May models, we simply sample uniformly at random. The TrueSkill model, however, selects its own training data (with replacement) according to the description in Section 2.4.7

For tuning hyperparameters and reporting test results, we used development and test sets of 2,000 comparisons drawn entirely from the researcher judgments, and fixed across all experiments.

    3.2 Perplexity

We first compare the Hopkins and May model and TrueSkill using perplexity on the test data T, computed as follows:

ppl(p | T) = 2^( −(1/|T|) Σ_{(i,S1,S2,π)∈T} log2 p(π | S1, S2) )

where p is the model under consideration. The probability of each observed outcome between two systems S1 and S2 is computed by taking a difference of the Gaussian distributions associated with those systems:

N(μ∗, σ²∗) = N(μS1, σ²S1) − N(μS2, σ²S2) = N(μS1 − μS2, σ²S1 + σ²S2)

This Gaussian can then be carved into three pieces: the area where S1 loses, the middle area representing ties (defined by a decision radius, r, whose value is fit using development data), and a third area representing where S1 wins. By integrating over each of these regions, we have a probability distribution over these outcomes:

p(π | S1, S2) = ∫_{−∞}^{−r} N(x; μ∗, σ²∗) dx   if π is >
              = ∫_{−r}^{+r} N(x; μ∗, σ²∗) dx   if π is =
              = ∫_{+r}^{+∞} N(x; μ∗, σ²∗) dx   if π is <
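A sketch of this outcome distribution and the resulting perplexity, using the standard normal CDF; the per-judgment normalization of the exponent follows the usual perplexity convention and is an assumption here, as are the helper names:

```python
import math

def norm_cdf(x, mean, var):
    # Cumulative distribution of N(mean, var) evaluated at x.
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def outcome_probs(mu1, var1, mu2, var2, r):
    """p(pi | S1, S2) for pi in {'<', '=', '>'}, from the difference Gaussian."""
    mean, var = mu1 - mu2, var1 + var2
    p_lose = norm_cdf(-r, mean, var)             # S1 judged worse ('>')
    p_tie = norm_cdf(r, mean, var) - p_lose      # within the decision radius ('=')
    p_win = 1.0 - norm_cdf(r, mean, var)         # S1 judged better ('<')
    return {'>': p_lose, '=': p_tie, '<': p_win}

def perplexity(test_set, mu, var, r):
    """test_set: iterable of (i, s1, s2, pi); mu, var: per-system parameters."""
    total, n = 0.0, 0
    for _, s1, s2, pi in test_set:
        p = outcome_probs(mu[s1], var[s1], mu[s2], var[s2], r)[pi]
        total += math.log2(p)
        n += 1
    return 2.0 ** (-total / n)
```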

A > B, A = F, A > H, A < J
B < F, B < H, B < J
F > H, F < J
H < J

Here, A > B should be read as "A is ranked higher than (worse than) B". Note that by this procedure, the absolute value of ranks and the magnitude of their differences are discarded.

For WMT13, nearly a million pairwise annotations were collected from both researchers and paid workers on Amazon's Mechanical Turk, in a roughly 1:2 ratio. This year, we collected data from researchers only, an ability that was enabled by the use of a new technique for producing the partial ranking for each task (§3.3.3). Table 3 contains more detail.

    3.2 Annotator agreement

Each year we calculate annotator agreement scores for the human evaluation as a measure of the reliability of the rankings. We measured pairwise agreement among annotators using Cohen's kappa coefficient (κ) (Cohen, 1960). If P(A) is the proportion of times that the annotators agree, and P(E) is the proportion of time that they would


• LANGUAGE PAIR     Systems   Rankings   Average
Czech-English          5       21,130    4,226.0
English-Czech         10       55,900    5,590.0
German-English        13       25,260    1,943.0
English-German        18       54,660    3,036.6
French-English         8       26,090    3,261.2
English-French        13       33,350    2,565.3
Russian-English       13       34,460    2,650.7
English-Russian        9       28,960    3,217.7
Hindi-English          9       20,900    2,322.2
English-Hindi         12       28,120    2,343.3
TOTAL WMT14          110      328,830    2,989.3
WMT13                148      942,840    6,370.5
WMT12                103      101,969      999.6
WMT11                133       63,045      474.0

Table 3: Amount of data collected in the WMT14 manual evaluation. The final three rows report summary information from the previous three workshops.

agree by chance, then Cohen's kappa is:

κ = (P(A) − P(E)) / (1 − P(E))

Note that κ is basically a normalized version of P(A), one which takes into account how meaningful it is for annotators to agree with each other by incorporating P(E). The values for κ range from 0 to 1, with zero indicating no agreement and 1 perfect agreement.

We calculate P(A) by examining all pairs of systems which had been judged by two or more judges, and calculating the proportion of time that they agreed that A < B, A = B, or A > B. In other words, P(A) is the empirical, observed rate at which annotators agree, in the context of pairwise comparisons.

As for P(E), it captures the probability that two annotators would agree randomly. Therefore:

P(E) = P(A<B)² + P(A=B)² + P(A>B)²

Note that each of the three probabilities in P(E)'s definition are squared to reflect the fact that we are considering the chance that two annotators would agree by chance. Each of these probabilities is computed empirically, by observing how often annotators actually rank two systems as being tied.
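A small sketch of this computation (our own helper, not the WMT evaluation scripts; it estimates the outcome probabilities in P(E) from the empirical label distribution over the doubly-judged items, which is one reasonable reading of the definition above):

```python
from collections import Counter

def pairwise_kappa(judgments):
    """Cohen's kappa over pairwise comparisons, as defined above.

    judgments: list of (first, second) labels given by two annotators to the
    same (segment, system pair) item, each label in {'<', '=', '>'}.
    """
    n = len(judgments)
    # P(A): observed rate at which the two annotators agree.
    p_a = sum(1 for a, b in judgments if a == b) / n
    # P(E): chance agreement, summing squared empirical label probabilities.
    counts = Counter(label for pair in judgments for label in pair)
    total = sum(counts.values())
    p_e = sum((c / total) ** 2 for c in counts.values())
    return (p_a - p_e) / (1 - p_e)

# Toy example with three doubly-judged comparisons.
print(pairwise_kappa([('<', '<'), ('=', '>'), ('>', '>')]))
```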

Table 4 gives κ values for inter-annotator agreement for WMT11-WMT14, while Table 5 details intra-annotator agreement scores, including the division of researcher (WMT13r) and MTurk (WMT13m) data. The exact interpretation of the kappa coefficient is difficult, but according to Landis and Koch (1977), 0-0.2 is slight, 0.2-0.4 is fair, 0.4-0.6 is moderate, 0.6-0.8 is substantial, and 0.8-1.0 is almost perfect. The agreement rates are more or less in line with prior years: worse for some tasks, better for others, and on average, the best since WMT11 (where agreement scores were likely inflated due to inclusion of reference translations in the comparisons).

    3.3 Models of System Rankings

The collected pairwise rankings are used to produce a ranking of the systems. Machine translation evaluation has always been a subject of contention, and no exception to this rule exists for the WMT manual evaluation. While the precise metric has varied over the years, it has always shared a common idea of computing the average number of times each system was judged better than other systems, and ranking from highest to lowest. For example, in WMT11 (Callison-Burch et al., 2011), the metric computed the percentage of the time each system was ranked better than or equal to other systems, and included comparisons to human references. In WMT12 (Callison-Burch et al., 2012), comparisons to references were dropped. In WMT13, rankings were produced over 1,000 bootstrap-resampled sets of the training data. A rank range was collected for each system across these folds; the average value was used to order the systems, and a 95% confidence interval across these ranks was used to organize the systems into equivalence classes containing systems with over-


LANGUAGE PAIR      WMT11   WMT12   WMT13   WMT13r   WMT13m   WMT14
Czech→English      0.400   0.311   0.244   0.342    0.279    0.305
English→Czech      0.460   0.359   0.168   0.408    0.075    0.360
German→English     0.324   0.385   0.299   0.443    0.324    0.368
English→German     0.378   0.356   0.267   0.457    0.239    0.427
French→English     0.402   0.272   0.275   0.405    0.321    0.357
English→French     0.406   0.296   0.231   0.434    0.237    0.302
Hindi→English        --      --      --      --       --     0.400
English→Hindi        --      --      --      --       --     0.413
Russian→English      --      --    0.278   0.315    0.324    0.324
English→Russian      --      --    0.243   0.416    0.207    0.418
MEAN               0.395   0.330   0.260     --       --     0.367

Table 4: κ scores measuring inter-annotator agreement. See Table 5 for corresponding intra-annotator agreement scores.

LANGUAGE PAIR      WMT11   WMT12   WMT13   WMT13r   WMT13m   WMT14
Czech→English      0.597   0.454   0.479   0.483    0.478    0.382
English→Czech      0.601   0.390   0.290   0.547    0.242    0.448
German→English     0.576   0.392   0.535   0.643    0.515    0.344
English→German     0.528   0.433   0.498   0.649    0.452    0.576
French→English     0.673   0.360   0.578   0.585    0.565    0.629
English→French     0.524   0.414   0.495   0.630    0.486    0.507
Hindi→English        --      --      --      --       --     0.605
English→Hindi        --      --      --      --       --     0.535
Russian→English      --      --    0.450   0.363    0.477    0.629
English→Russian      --      --    0.513   0.582    0.500    0.570
MEAN               0.583   0.407   0.479     --       --     0.522

Table 5: κ scores measuring intra-annotator agreement, i.e., self-consistency of judges, across the past few years of the human evaluation.

This year, we introduce two changes. First, we pit the WMT13 method against two new approaches: that of Hopkins and May (2013; Section 3.3.2), and another based on TrueSkill (Sakaguchi et al., 2014; Section 3.3.3). Second, we compare these two methods against WMT13's Expected Wins approach, and select among them by determining which has the highest accuracy in predicting annotations on a held-out set of pairwise judgments.

3.3.1 Method 1: Expected Wins (EW)

Introduced for WMT13, the EXPECTED WINS score has an intuitive interpretation and has been demonstrated to be accurate in ranking systems according to an underlying model of relative ability (Koehn, 2012a). The idea is to gauge the probability that a system Si will be ranked better than another system randomly chosen from a pool of opponents {Sj : j ≠ i}. If we define the function win(A, B) as the number of times system A is ranked better than system B,

then we can define this as follows:

    score_EW(S_i) = 1/|{S_j}| · Σ_{j, j≠i}  win(S_i, S_j) / ( win(S_i, S_j) + win(S_j, S_i) )

    Note that this score ignores ties.
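A direct implementation of this score is short. The sketch below assumes a nested count table wins[a][b] (a hypothetical structure, not the official scoring code) and simply skips pairs that were never compared:

    def expected_wins(systems, wins):
        """Expected Wins score for each system.

        `wins[a][b]` is the number of pairwise judgments in which system
        `a` was ranked strictly better than system `b`; ties are simply
        not counted, as noted in the text."""
        scores = {}
        for si in systems:
            total = 0.0
            for sj in systems:
                if sj == si:
                    continue
                contested = wins[si][sj] + wins[sj][si]
                if contested:  # skip pairs that were never compared
                    total += wins[si][sj] / contested
            # divide by the size of the opponent pool {Sj : j != i}
            scores[si] = total / (len(systems) - 1)
        return scores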

    3.3.2 Method 2: Hopkins and May (HM)

Hopkins and May (2013) introduced a graphical model formulation of the task, which makes the notion of underlying system ability even more explicit. Each system Sj in the pool {Sj} is represented by an associated relative ability μ_Sj and a variance σ²_a (fixed across all systems), which serve as the parameters of a Gaussian distribution. Samples from this distribution represent the quality of sentence translations, with higher-quality samples having higher values. Pairwise annotations (S1, S2, π) are generated according to the following process:


1. Select two systems S1 and S2 from the pool of systems {Sj}.

2. Draw two translations, adding random Gaussian noise with variance σ²_obs to simulate the subjectivity of the task and the differences among annotators:

       q1 ∼ N(μ_S1, σ²_a) + N(0, σ²_obs)
       q2 ∼ N(μ_S2, σ²_a) + N(0, σ²_obs)

3. Let d be a nonzero real number that defines a fixed decision radius. Produce a rating π according to:

       π = '<'   if q1 − q2 > d
       π = '>'   if q2 − q1 > d
       π = '='   otherwise

Hopkins and May use Gibbs sampling to infer the set of system means from an annotated dataset. Details of this inference procedure can be found in Sakaguchi et al. (2014). The score used to produce the rankings is simply the inferred mean associated with each system:

    score_HM(S_i) = μ_{S_i}
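The generative story is easy to simulate. The sketch below draws one synthetic pairwise annotation from assumed system means; the parameter values are illustrative, and the Gibbs sampler that inverts this process (described in Sakaguchi et al. (2014)) is not shown.

    import random

    def simulate_annotation(mu, sigma_a=0.5, sigma_obs=1.0, d=0.5):
        """Draw one synthetic pairwise judgment from the Hopkins & May
        generative model.  `mu` maps system names to assumed relative
        abilities; sigma_a and sigma_obs are standard deviations and,
        like d, are illustrative values."""
        s1, s2 = random.sample(list(mu), 2)
        # Quality of each sampled translation plus annotator noise.
        q1 = random.gauss(mu[s1], sigma_a) + random.gauss(0.0, sigma_obs)
        q2 = random.gauss(mu[s2], sigma_a) + random.gauss(0.0, sigma_obs)
        if q1 - q2 > d:
            pi = '<'      # s1 judged better
        elif q2 - q1 > d:
            pi = '>'      # s2 judged better
        else:
            pi = '='      # tie within the decision radius
        return s1, s2, pi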

3.3.3 Method 3: TrueSkill (TS)

TrueSkill is an adaptive, online system that employs a similar model of relative ability (Herbrich et al., 2006). It was initially developed for Xbox Live's online player community, where it is used to model player ability, assign levels, and select competitive matches. Each player Sj is modeled by two parameters: TrueSkill's current estimate of the system's relative ability, μ_Sj, and a per-system measure of TrueSkill's uncertainty about that estimate, σ²_Sj. When the outcome of a match is observed, TrueSkill uses the relative status of the two systems to update these estimates. If a translation from a system with a high mean is judged better than one from a system with a much lower mean, the result is not surprising, and the updates to the corresponding system means will be small. On the other hand, when an upset occurs in a competition, the means receive larger updates. Sakaguchi et al. (2014) provide an adaptation of this approach to the WMT manual evaluation, and showed that it performed well on WMT13 data.

Similar to the Hopkins and May model, TrueSkill scores systems by their inferred means:

    score_TS(S_i) = μ_{S_i}

    This score is then used to sort the systems and pro-duce the ranking.
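For intuition, the following is a minimal sketch of the kind of update TrueSkill performs when one system beats another: a simplified two-player update in the spirit of Herbrich et al. (2006), ignoring ties, the draw margin, and the dynamics factor that the full model (and the WMT adaptation) includes.

    import math

    def phi(x):   # standard normal pdf
        return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

    def Phi(x):   # standard normal cdf
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    def trueskill_update(winner, loser, skills, beta=0.5):
        """Update (mu, sigma) for the winner and loser of one comparison.

        `skills` maps system names to [mu, sigma] lists, updated in place.
        A simplified version of the Herbrich et al. (2006) equations,
        without draws or per-game dynamics variance."""
        mu_w, sig_w = skills[winner]
        mu_l, sig_l = skills[loser]
        c = math.sqrt(sig_w**2 + sig_l**2 + 2.0 * beta**2)
        t = (mu_w - mu_l) / c
        v = phi(t) / Phi(t)          # mean-update factor: large for upsets
        w = v * (v + t)              # variance-update factor
        skills[winner][0] = mu_w + (sig_w**2 / c) * v
        skills[loser][0]  = mu_l - (sig_l**2 / c) * v
        skills[winner][1] = sig_w * math.sqrt(max(1.0 - (sig_w**2 / c**2) * w, 1e-9))
        skills[loser][1]  = sig_l * math.sqrt(max(1.0 - (sig_l**2 / c**2) * w, 1e-9))

Note how the surprise of the outcome enters only through t, the scaled difference of the current means, which matches the behaviour described above: expected wins barely move the estimates, upsets move them a lot.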

    3.4 Method Selection

We have three methods which, provided with the collected data, produce different rankings of the systems. Which of them is correct? More immediately, which one of them should we publish as the official ranking for the WMT14 manual evaluation? As discussed, the method used to compute the ranking has been tweaked a bit each year over the past few years in response to criticisms (e.g., Lopez (2012); Bojar et al. (2011)). While the changes were reasonable (and later corroborated), Hopkins and May (2013) pointed out that this task of model selection should be driven by empirical evaluation on held-out data, and suggested perplexity as the metric of choice.

We choose instead a more direct, gold-standard evaluation metric: the accuracy of the rankings produced by each method in predicting pairwise judgments. We use each method to produce a partial ordering of the systems, grouping them into equivalence classes. This partial ordering unambiguously assigns a predicted relation between any pair of systems (Si, Sj). By comparing the predicted relation to the actual annotation for each pairwise judgment in the test data (by token), we can compute an accuracy score for each model.

We estimate accuracy in this manner using 100-fold cross-validation. For each task, we split the data into a fixed set of 100 randomly selected folds. Each fold serves as a test set, with the remaining ninety-nine folds available as training data for each method. Note that the total ordering over systems provided by the score functions defined above does not predict ties. To enable the models to predict ties, we produce equivalence classes using the following procedure:

    Assign S1 to a cluster

For each subsequent system Si, assign it to the current cluster if score(S_{i−1}) − score(S_i) ≤ r; otherwise, assign it to a new cluster

The value of r (the decision radius for ties) is tuned for accuracy on the entire training data using grid search over the values r ∈ {0, 0.01, 0.02, ..., 0.25} (26 values in total). This value is tuned separately for each method on each fold. Table 6 contains an example partial ordering.
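The clustering step itself can be sketched as follows, where scores is a {system: score} dictionary produced by any of the three methods (how exactly equal scores are broken is not specified in the text, so this is one reasonable reading):

    def cluster_by_radius(scores, r):
        """Group systems into equivalence classes: walking down the list
        sorted by score, a system joins the current cluster if its score
        is within the decision radius `r` of the previous (better)
        system; otherwise it starts a new cluster.  Returns
        {system: cluster_index}, with clusters numbered from 1 (best)."""
        ranked = sorted(scores, key=scores.get, reverse=True)
        clusters = {ranked[0]: 1}
        current = 1
        for prev, curr in zip(ranked, ranked[1:]):
            if scores[prev] - scores[curr] > r:
                current += 1
            clusters[curr] = current
        return clusters

Applied to the scores in Table 6 with r = 0.15, this yields the three clusters shown there.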


System   Score   Rank
B         0.60      1
D         0.44      2
E         0.39      2
A         0.25      2
F        -0.09      3
C        -0.22      3

Table 6: The partial ordering computed with the provided scores when r = 0.15.

Task                 EW     HM     TS    Oracle
Czech→English       40.4   41.1   41.1    41.2
English→Czech       45.3   45.6   45.9    46.8
French→English      49.0   49.4   49.3    50.3
English→French      44.6   44.4   44.7    46.0
German→English      43.5   43.7   43.7    45.2
English→German      47.3   47.4   47.2    48.2
Hindi→English       62.5   62.2   62.5    62.6
English→Hindi       53.3   53.7   53.5    55.7
Russian→English     47.6   47.7   47.7    50.6
English→Russian     46.5   46.1   46.4    48.2
MEAN                48.0   48.1   48.2    49.2

Table 7: Accuracies for each method across 100 folds, for each translation task. The oracle uses the most frequent outcome between each pair of systems, and therefore might not constitute a feasible ranking.

After training, each model has defined a partial ordering over systems.6 This is then used to compute accuracy on all the pairwise judgments in the test fold. This process yields 100 accuracies for each method; the average accuracy across all the folds can then be used to select the best method.
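Scoring a method on a test fold then reduces to comparing its predicted relations against the annotated ones. A minimal sketch, assuming judgments are stored as (system A, system B, relation) tuples:

    def ranking_accuracy(clusters, test_judgments):
        """Accuracy of a partial ordering on held-out pairwise judgments.

        `clusters` maps each system to its cluster index (1 = best);
        systems in the same cluster are predicted to be tied.
        `test_judgments` is a list of (sys_a, sys_b, relation) tuples,
        with relation in {'<', '=', '>'} meaning a is better, tied, or
        worse."""
        correct = 0
        for sys_a, sys_b, relation in test_judgments:
            ca, cb = clusters[sys_a], clusters[sys_b]
            predicted = '<' if ca < cb else '>' if ca > cb else '='
            correct += (predicted == relation)
        return correct / len(test_judgments)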

Table 7 contains accuracy results for the three methods on the WMT14 tasks. On average, there is a small improvement in accuracy moving from Expected Wins to the H&M model, and again to the TrueSkill model; however, no single model is consistently best across tasks. The Oracle column is computed by selecting the most probable outcome (π ∈ {<, =, >}) for each system pair, and provides an upper bound on accuracy when predicting outcomes using only system-level information. Furthermore, this method of oracle computation might not represent a feasible ranking or clustering.7

The TrueSkill approach was best overall, so we used it to produce the official rankings for all language pairs.

6 It is a total ordering when r = 0, or when all the system scores are outside the decision radius.

7 For example, if there were a cycle of "better than" judgments among a set of systems.

    3.5 Rank Ranges and Clusters

Above we saw how to produce system scores for each method, which provides a total ordering of the systems. But we would also like to know whether the obtained system ranking is statistically significant. Given the large number of participating systems, and the similarity of the underlying systems resulting from the common training data condition and (often) shared toolsets, some systems will be very close in quality. These systems should be grouped together in equivalence classes.

To establish the reliability of the obtained system ranking, we use bootstrap resampling. We sample from the set of pairwise rankings an equal-sized set of pairwise rankings (allowing the same pairwise ranking to be drawn multiple times), compute a TrueSkill model score for each system based on this sample, and then rank the systems from 1 to |{Sj}|. By repeating this procedure 1,000 times, we can determine the range of ranks into which each system falls at least 95% of the time (i.e., at least 950 times), corresponding to a p-level of p ≤ 0.05. Furthermore, given the rank ranges for each system, we can cluster systems with overlapping rank ranges.8
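A sketch of this procedure is given below; trueskill_scores stands in for fitting the TrueSkill model to one resampled set and is assumed rather than shown, and trimming an equal number of folds from each tail is one simple way to obtain a range that covers at least 95% of the folds.

    import random

    def bootstrap_rank_ranges(judgments, trueskill_scores,
                              folds=1000, coverage=0.95):
        """Bootstrap rank ranges for each system.

        `judgments` is the list of collected pairwise rankings;
        `trueskill_scores(sample)` returns a {system: score} dict
        fitted to one resampled set.  Returns {system: (best, worst)}
        rank bounds covering at least `coverage` of the folds."""
        ranks = {}
        for _ in range(folds):
            # resample with replacement, same size as the original set
            sample = [random.choice(judgments) for _ in judgments]
            scores = trueskill_scores(sample)
            ordered = sorted(scores, key=scores.get, reverse=True)
            for rank, system in enumerate(ordered, start=1):
                ranks.setdefault(system, []).append(rank)
        drop = int(round((1.0 - coverage) * folds / 2.0))  # trim each tail
        ranges = {}
        for system, observed in ranks.items():
            observed.sort()
            trimmed = observed[drop:len(observed) - drop] if drop else observed
            ranges[system] = (trimmed[0], trimmed[-1])
        return ranges

Systems whose rank ranges overlap can then be merged into the clusters reported in Table 8.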

Table 8 reports all system scores, rank ranges, and clusters for all language pairs and all systems. The official interpretation of these results is that systems in the same cluster are considered tied. Given the large number of judgments that we collected, it was possible to group only about two systems per cluster on average, although the systems in the middle of a ranking typically fall into larger clusters.

    3.6 Cluster analysis

The official ranking results for English–German were produced with clusters computed at the 90% confidence level, due to the presence of a very large cluster (of nine systems). While there is always the possibility that this cluster reflects a true ambiguity, it is more likely due to the fact that we did not have enough data: English–German had the most systems

8 Formally, given ranges defined by start(S_i) and end(S_i), we seek the largest set of clusters {C_c} that satisfies:

    ∀S ∃C : S ∈ C
    S ∈ C_a ∧ S ∈ C_b → C_a = C_b
    C_a ≠ C_b → ∀S_i ∈ C_a, S_j ∈ C_b : start(S_i) > end(S_j) or start(S_j) > end(S_i)


Czech→English
#   score   range   system
1   0.591   1       ONLINE-B
2   0.290   2       UEDIN-PHRASE
3  -0.171   3-4     UEDIN-SYNTAX
   -0.243   3-4     ONLINE-A
4  -0.468   5       CU-MOSES

English→Czech
#   score   range   system
1   0.371   1-3     CU-DEPFIX
    0.356   1-3     UEDIN-UNCNSTR
    0.333   1-4     CU-BOJAR
    0.287   3-4     CU-FUNKY
2   0.169   5-6     ONLINE-B
    0.113   5-6     UEDIN-PHRASE
3   0.030   7       ONLINE-A
4  -0.175   8       CU-TECTO
5  -0.534   9       COMMERCIAL1
6  -0.950   10      COMMERCIAL2

Russian→English
#   score   range   system
1   0.583   1       AFRL-PE
2   0.299   2       ONLINE-B
3   0.190   3-5     ONLINE-A
    0.178   3-5     PROMT-HYBRID
    0.123   4-7     PROMT-RULE
    0.104   5-8     UEDIN-PHRASE
    0.069   5-8     Y-SDA
    0.066   5-8     ONLINE-G
4  -0.017   9       AFRL
5  -0.159   10      UEDIN-SYNTAX
6  -0.306   11      KAZNU
7  -0.487   12      RBMT1
8  -0.642   13      RBMT4

English→Russian
#   score   range   system
1   0.575   1-2     PROMT-RULE
    0.547   1-2     ONLINE-B
2   0.426   3       PROMT-HYBRID
3   0.305   4-5     UEDIN-UNCNSTR
    0.231   4-5     ONLINE-G
4   0.089   6-7     ONLINE-A
    0.031   6-7     UEDIN-PHRASE
5  -0.920   8       RBMT4
6  -1.284   9       RBMT1

German→English
#   score   range   system
1   0.451   1       ONLINE-B
2   0.267   2-3     UEDIN-SYNTAX
    0.258   2-3     ONLINE-A
3   0.147   4-6     LIMSI-KIT
    0.146   4-6     UEDIN-PHRASE
    0.138   4-6     EU-BRIDGE
4   0.026   7-8     KIT
   -0.049   7-8     RWTH
5  -0.125   9-11    DCU-ICTCAS
   -0.157   9-11    CMU
   -0.192   9-11    RBMT4
6  -0.306   12      RBMT1
7  -0.604   13      ONLINE-C

French→English
#   score   range   system
1   0.608   1       UEDIN-PHRASE
2   0.479   2-4     KIT
    0.475   2-4     ONLINE-B
    0.428   2-4     STANFORD
3   0.331   5       ONLINE-A
4  -0.389   6       RBMT1
5  -0.648   7       RBMT4
6  -1.284   8       ONLINE-C

English→French
#   score   range   system
1   0.327   1       ONLINE-B
2   0.232   2-4     UEDIN-PHRASE
    0.194   2-5     KIT
    0.185   2-5     MATRAN
    0.142   4-6     MATRAN-RULES
    0.120   4-6     ONLINE-A
3   0.003   7-9     UU-DOCENT
   -0.019   7-10    PROMT-HYBRID
   -0.033   7-10    UA
   -0.069   8-10    PROMT-RULE
4  -0.215   11      RBMT1
5  -0.328   12      RBMT4
6  -0.540   13      ONLINE-C

English→German
#   score   range   system
1   0.264   1-2     UEDIN-SYNTAX
    0.242   1-2     ONLINE-B
2   0.167   3-6     ONLINE-A
    0.156   3-6     PROMT-HYBRID
    0.155   3-6     PROMT-RULE
    0.155   3-6     UEDIN-STANFORD
3   0.094   7       EU-BRIDGE
4   0.033   8-10    RBMT4
    0.031   8-10    UEDIN-PHRASE
    0.012   8-10    RBMT1
5  -0.032   11-12   KIT
   -0.069   11-13   STANFORD-UNC
   -0.100   12-14   CIMS
   -0.126   13-15   STANFORD
   -0.158   14-16   UU
   -0.191   15-16   ONLINE-C
6  -0.307   17-18   IMS-TTT
   -0.325   17-18   UU-DOCENT

Hindi→English
#   score   range   system
1   1.326   1       ONLINE-B
2   0.559   2-3     ONLINE-A
    0.476   2-4     UEDIN-SYNTAX
    0.434   3-4     CMU
3   0.323   5       UEDIN-PHRASE
4  -0.198   6-7     AFRL
   -0.280   6-7     IIT-BOMBAY
5  -0.549   8       DCU-LINGO24
6  -2.092   9       IIIT-HYDERABAD

English→Hindi
#   score   range   system
1   1.008   1       ONLINE-B
2   0.915   2       ONLINE-A
3   0.214   3       UEDIN-UNCNSTR
4   0.120   4-5     UEDIN-PHRASE
    0.054   4-5     CU-MOSES
5  -0.111   6-7     IIT-BOMBAY
   -0.142   6-7     IPN-UPV-CNTXT
6  -0.233   8-9     DCU-LINGO24
   -0.261   8-9     IPN-UPV-NODEV
7  -0.449   10-11   MANAWI-H1
   -0.494   10-11   MANAWI
8  -0.622   12      MANAWI-RMOOV

Table 8: Official results for the WMT14 translation task. Systems are ordered by their inferred system means. Lines between systems indicate clusters according to bootstrap resampling at p-level p ≤ 0.05, except for English–German, where p ≤ 0.1. This method is also used to determine the range of ranks into which each system falls. Systems with grey background indicate use of resources that fall outside the constraints provided for the shared task.


(18, compared to 13 for the next-largest language pairs), yet only an average amount of per-system data. Here, we look at this language pair in more detail, in order to justify this decision and to shed light on the differences between the ranking methods.

Table 9 presents the 95% confidence-level clusterings for English–German computed with each of the three methods, along with lines that show the reorderings of the systems between them. Reorderings of this type have been used to argue against the reliability of the official WMT ranking (Lopez, 2012; Hopkins and May, 2013). The table shows that these reorderings are captured entirely by the clustering approach we used. The relative consensus of these independently computed and somewhat different models suggests that the published ranking approaches the true ambiguity underlying systems within the same cluster.

Looking across all language pairs, we find that the total ordering predicted by EW and TS is exactly the same for eight of the ten language pair tasks, and is constrained to reorderings within the official clusters for the other two (German–English, with just one adjacent swap, and English–German, depicted in Table 9).

    3.7 Conclusions

The official ranking method employed by WMT over the past few years has changed a few times as a result of error analysis and introspection. Until this year, these changes were largely based on the intuitions of the community and organizers about deficiencies in the models. In addition to their intuitive appeal, many of these changes (such as the decision to throw out comparisons against references) have been empirically validated (Hopkins and May, 2013). The actual effect of the refinements in the ranking metric has been minor perturbations in the permutation of systems. The clustering method of Koehn (2012b), in which the official rankings are presented as a partial (instead of total) ordering, alleviated many of the problems observed by Lopez (2012), and also captures all the variance across the new ranking methods introduced this year. In addition, presenting systems as clusters appeals to intuition. As such, we disagree with claims that there is a problem with irreproducibility of the results of the workshop evaluation task, and especially disagree that there is anything approaching a "crisis of confidence" (Hopkins and May, 2013). These claims seem to us to be overstated.

Conducting proper model selection by comparison on held-out data, however, is a welcome suggestion, and our inclusion of this process supports improved confidence in the ranking results. That said, it is notable that the different methods compute very similar orderings. This avoids hallucinating distinctions among systems that are not really there, and captures the intuition that some systems are basically equivalent. The chief benefit of the TrueSkill model is not that it outputs a better complete ranking of the systems, but that its reduced variance allows us to cluster the systems with less data. There is also the unexplored avenue of using TrueSkill to drive the data collection, steering the annotations of judges towards evenly matched systems during the collection phase, potentially allowing confident results to be presented while collecting even less data.

There is, of course, more work to be done. This year we produced statistically significant clusters with a third of the data required last year, which is an improvement. Models of relative ability are a natural fit for the manual evaluation, and the introduction of an online Bayesian approach to data collection presents further opportunities to reduce the amount of data needed. These methods also provide a framework for extending the models in a variety of potentially useful ways, including modeling annotator bias, incorporating sentence metadata (such as length, difficulty, or subtopic), and adding features of the sentence pairs.

    4 Quality Estimation Task

Machine translation quality estimation is the task of predicting a quality score for a machine-translated text without access to reference translations. The most common approach is to treat the problem as a supervised machine learning task, using standard regression or classification algorithms. The third edition of the WMT shared task on quality estimation builds on the previous editions of the task (Callison-Burch et al., 2012; Bojar et al., 2013), offering both sentence-level and word-level estimation, with new training and test datasets.

The goals of this year's shared task were:

To investigate the effectiveness of different quality labels.

To explore word-level quality prediction at different levels of granularity.


Expected Wins      Hopkins & May      TrueSkill
UEDIN-SYNTAX       UEDIN-SYNTAX       UEDIN-SYNTAX
ONLINE-B           ONLINE-B           ONLINE-B
ONLINE-A           UEDIN-STANFORD     ONLINE-A
UEDIN-STANFORD     PROMT-HYBRID       PROMT-HYBRID
PROMT-RULE         ONLINE-A           PROMT-RULE
PROMT-HYBRID       PROMT-RULE         UEDIN-STANFORD
EU-BRIDGE          EU-BRIDGE          EU-BRIDGE
RBMT4              UEDIN-PHRASE       RBMT4
UEDIN-PHRASE       RBMT4              UEDIN-PHRASE
RBMT1              RBMT1              RBMT1
KIT                KIT                KIT
STANFORD-UNC       STANFORD-UNC       STANFORD-UNC
CIMS               CIMS               CIMS
STANFORD           STANFORD           STANFORD
UU                 UU                 UU
ONLINE-C           ONLINE-C           ONLINE-C
IMS-TTT            UU-DOCENT          IMS-TTT
UU-DOCENT          IMS-TTT            UU-DOCENT

Table 9: A comparison of the rankings produced by Expected Wins, Hopkins & May, and TrueSkill for English–German (the task with the most systems and the largest cluster). The lines extending all the way across mark the official English–German clustering (computed from TrueSkill with 90% confidence intervals), while bold entries mark the start of new clusters within each method or column (computed at the 95% confidence level). The TrueSkill clusterings contain all the system reorderings across the other two ranking methods.


To study the effects of training and test datasets with mixed domains, language pairs and MT systems.

To examine the effectiveness of quality prediction methods on human translations.

Four tasks were proposed: Tasks 1.1, 1.2, and 1.3 are defined at the sentence level (Section 4.1), while Task 2 is defined at the word level (Section 4.2). Each task provides one or more datasets with up to four language pairs each: English-Spanish, English-German, German-English, and Spanish-English, and up to four alternative translations generated by a statistical MT system (SMT), a rule-based MT system (RBMT), a hybrid MT system, and a human. These datasets were annotated with different labels for quality by professional translators as part of the QTLaunchPad9 project. External resources (e.g., parallel corpora) were provided to participants. Any additional resources, including additional quality estimation training data, could be used by participants (no distinction between open and closed tracks is made).

9 http://www.qt21.eu/launchpad/

Participants were also provided with a software package to extract quality estimation features and perform model learning, with a suggested list of baseline features and a learning method for sentence-level prediction. Participants, described in Section 4.3, could submit up to two systems for each task.

Data used for building the specific MT systems, and internal system information (such as n-best lists), were not made available this year, as multiple MT systems were used to produce the datasets, including rule-based systems. In addition, part of the translations were produced by humans. Information on the sources of the translations was not provided either. Therefore, as a general rule, participants were only allowed to use black-box features.

4.1 Sentence-level Quality Estimation

For the sentence-level tasks, two variants of the results could be submitted for each task and language pair:

Scoring: An absolute quality score for each sentence translation according to the type of prediction, to be interpreted as an error metric: lower scores mean better translations.



Ranking: A ranking of sentence translations for all source test sentences, from best to worst. For this variant, it does not matter how the ranking is produced (from HTER predictions, Likert predictions, or even without machine learning).

Evaluation was performed against the true label and/or the HTER ranking using the same metrics as in previous years:

Scoring: Mean Absolute Error (MAE) (primary metric) and Root Mean Squared Error (RMSE).

Ranking: DeltaAvg (primary metric) (Bojar et al., 2013) and Spearman's rank correlation.

For all these sentence-level tasks, the same 17 features as in WMT12-13 were used to build baseline systems. The SVM regression algorithm within QUEST (Specia et al., 2013)10 was applied, with an RBF kernel and grid search for parameter optimisation.
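A rough equivalent of this baseline configuration using scikit-learn is sketched below; it illustrates the RBF-kernel SVR plus grid search setup rather than reproducing the actual QUEST code, the grid values are illustrative, and feature extraction is omitted.

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVR

    def train_baseline(train_features, train_labels):
        """Fit an RBF-kernel SVR with grid search, mirroring the baseline
        described in the text.  `train_features` is an (n, 17) array of
        black-box features; `train_labels` are the quality labels."""
        param_grid = {
            "C": [1, 10, 100],          # illustrative grid values
            "gamma": [0.01, 0.1, 1.0],
            "epsilon": [0.1, 0.2],
        }
        search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5,
                              scoring="neg_mean_absolute_error")
        search.fit(np.asarray(train_features), np.asarray(train_labels))
        return search.best_estimator_

    # predictions = train_baseline(X_train, y_train).predict(X_test)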

Task 1.1 Predicting post-editing effort

Data in this task is labelled with discrete and absolute scores for perceived post-editing effort, where:

1 = Perfect translation, no post-editing needed at all.

2 = Near-miss translation: the translation contains a maximum of 2-3 errors, and possibly additional errors that can be easily fixed (capitalisation, punctuation, etc.).

3 = Very low quality translation, cannot be easily fixed.

The datasets were annotated in a triage phase aimed at selecting translations of type 2 (near misses) that could be annotated for errors at the word level using the MQM metric (see Task 2, below) for a more fine-grained and systematic translation quality analysis. Word-level errors in translations of type 3 are too difficult, if not impossible, to annotate and classify, particularly as they often contain inter-related errors in contiguous or overlapping word spans.

10 http://www.quest.dcs.shef.ac.uk/

For the training of prediction models, we provide a new dataset consisting of source sentences and their human translations, as well as two to three versions of machine translations (by an SMT system, an RBMT system and, for English-Spanish/German only, a hybrid system), all in the news domain, extracted from test sets of various WMT years and from MT systems that participated in the translation shared task:

# Source sentences        # Target sentences
954 English               3,816 Spanish
350 English               1,400 German
350 German                1,050 English
350 Spanish               1,050 English

As test data, for each language pair and MT system (or human translation), we provide a new set of translations produced by the same MT systems (and humans) as those used for the training data:

# Source sentences        # Target sentences
150 English               600 Spanish
150 English               600 German
150 German                450 English
150 Spanish               450 English

The distribution of true scores in both training and test sets for each language pair is given in Figure 3.


Figure 3: Distribution of true 1-3 scores by language pair, for training and test sets.

Additionally, we provide some out-of-domain test data. These translations were annotated in the same way as above, each dataset by one Language Service Provider (LSP), i.e., one professional translator, with two LSPs producing data independently for English-Spanish. They were generated using the LSP's own source data (a different domain from news) and own MT system (different from the three used for the official datasets). The results on these datasets were not considered for the official ranking of the participating systems:



# Source sentences        # Target sentences
971 English               971 Spanish
297 English               297 German
388 Spanish               388 English

Task 1.2 Predicting percentage of edits

In this task we use HTER (Snover et al., 2006) as the quality score. This score is to be interpreted as the minimum edit distance between the machine translation and its manually post-edited version, and its range is [0, 1] (0 when no edits need to be made, and 1 when all words need to be edited). We used TERp (default settings: tokenised, case insensitive, etc., but capped to 1)11 to compute the HTER scores.

For practical reasons, the data is a subset of Task 1.1's dataset: only the English-Spanish translations produced by the SMT system. As training data, we provide 896 English-Spanish translation suggestions and their post-editions. As test data, we provide a new set of 208 English-Spanish translations produced by the same SMT system. Each of the training and test translations was post-edited by a professional translator using the CASMACAT12 web-based tool, which also collects post-editing time on a sentence basis.
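For illustration, an HTER-like label can be approximated with a word-level edit distance, as in the sketch below; note that TERp also counts block shifts, so this is only an approximation of the official labels.

    def hter_approx(mt_tokens, postedit_tokens):
        """Approximate HTER: word-level Levenshtein distance between the
        MT output and its post-edited version, divided by the length of
        the post-edit, capped at 1.  Ignores TERp shift operations."""
        n, m = len(mt_tokens), len(postedit_tokens)
        dist = list(range(m + 1))          # DP row for the edit distance
        for i in range(1, n + 1):
            prev, dist[0] = dist[0], i
            for j in range(1, m + 1):
                sub = prev + (mt_tokens[i - 1] != postedit_tokens[j - 1])
                prev, dist[j] = dist[j], min(sub, dist[j] + 1, dist[j - 1] + 1)
        return min(dist[m] / max(m, 1), 1.0)

    # hter_approx("la casa verde".split(), "la casa azul".split())  ->  0.333...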

Task 1.3 Predicting post-editing time

For this task systems are required to produce, for each translation, a real-valued estimate of the time (in milliseconds) it takes a translator to post-edit the translation. The training and test sets are a subset of those used in Task 1.2 (subject to filtering of outliers). The difference is that the labels are now the number of milliseconds that were necessary to post-edit each translation.

As training data, we provide 650 English-Spanish translation suggestions and their post-editions. As test data, we provide a new set of 208 English-Spanish translations (the same test data as for Task 1.2).

    4.2 Word-level Quality Estimation

    The data for this task is based on a subset of thedatasets used for Task 1.1, for all language pairs,

11 http://www.umiacs.umd.edu/snover/terp/
12 http://casmacat.eu/

human and machine translations: those translations labelled "2" (near misses), plus additional data provided by industry (either in the news domain or in other domains, such as technical documentation, produced using their own MT systems, and also pre-labelled as "2"). All segments were annotated with word-level labels by professional translators, using the core categories in MQM (Multidimensional Quality Metrics)13 as the error typology (see Figure 4). Each word or sequence of words was annotated with a single error. For (supposedly rare) cases where a decision between multiple fine-grained error types could not be made, annotators were requested to choose a coarser error category in the hierarchy.

Participants are asked to produce a label for each token that indicates quality at different levels of granularity:

Binary classification: an OK / bad label, where bad indicates the need for editing the token.

Level 1 classification: an OK / accuracy / fluency label, specifying coarser-level categories of errors for each token, or OK for tokens with no error.

Multi-class classification: one of the labels specifying the error type for the token (terminology, mistranslation, missing word, etc.) in Figure 4, or OK for tokens with no error.

As training data, we provide tokenised translation output for all language pairs, human and machine translations, with tokens annotated with the issue types listed above, or OK. The annotation was performed manually by professional translators as part of the QTLaunchPad project. For the coarser variants, fine-grained errors are generalised to Accuracy or Fluency, or to bad for the binary variant. The amount of available training data varies by language pair:

# Source sentences        # Target sentences
1,957 English             1,957 Spanish
715 English               715 German
350 German                350 English
900 Spanish               900 English

13 http://www.qt21.eu/launchpad/content/training


Figure 4: MQM metric as error typology.

As test data, we provide additional data points for all language pairs, human and machine translations:

# Source sentences        # Target sentences
382 English               382 Spanish
150 English               150 German
100 German                100 English
150 Spanish               150 English

In contrast to Tasks 1.1-1.3, no baseline feature set is provided to the participants.

Similar to last year (Bojar et al., 2013), the word-level task is primarily evaluated by macro-averaged F-measure (in %). Because the class distribution is skewed in the test data (about 78% of the tokens are marked as OK), we compute precision, recall, and F1 for each class individually, weighting the F1 scores by the frequency of the class in the test data. This avoids giving undue importance to less frequent classes. Consider the following confusion matrix for Level 1 annotation, i.e., the three classes (O)K, (F)luency, and (A)ccuracy:

                     reference
                     O       F       A
    predicted   O   4172    1482    193
                F   1819    1333    214
                A    198     133     69

For each of the three classes we assume a binary setting (one-vs-all) and derive true-positive (tp), false-positive (fp), and false-negative (fn) counts from the rows and columns of the confusion matrix as follows:

    tp_O = 4172
    fp_O = 1482 + 193 = 1675
    fn_O = 1819 + 198 = 2017

    tp_F = 1333
    fp_F = 1819 + 214 = 2033
    fn_F = 1482 + 133 = 1615

    tp_A = 69
    fp_A = 198 + 133 = 331
    fn_A = 193 + 214 = 407

We continue by computing F1 scores for each class c ∈ {O, F, A}:

    precision_c = tp_c / (tp_c + fp_c)
    recall_c    = tp_c / (tp_c + fn_c)
    F_{1,c}     = 2 · precision_c · recall_c / (precision_c + recall_c)

    yielding:

    precision_O = 4172 / (4172 + 1675) = 0.7135
    recall_O    = 4172 / (4172 + 2017) = 0.6741
    F_{1,O}     = (2 · 0.7135 · 0.6741) / (0.7135 + 0.6741) = 0.6932