20
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1138–1157 June 6–11, 2021. ©2021 Association for Computational Linguistics 1138 Macro-Average: Rare Types Are Important Too Thamme Gowda Information Sciences Institute University of Southern California [email protected] Weiqiu You Dept of Computer and Information Science University of Pennsylvania [email protected] Constantine Lignos Michtom School of Computer Science Brandeis University [email protected] Jonathan May Information Sciences Institute University of Southern California [email protected] Abstract While traditional corpus-level evaluation met- rics for machine translation (MT) correlate well with fluency, they struggle to reflect ad- equacy. Model-based MT metrics trained on segment-level human judgments have emerged as an attractive replacement due to strong cor- relation results. These models, however, re- quire potentially expensive re-training for new domains and languages. Furthermore, their decisions are inherently non-transparent and appear to reflect unwelcome biases. We ex- plore the simple type-based classifier metric, MACROF 1 , and study its applicability to MT evaluation. We find that MACROF 1 is com- petitive on direct assessment, and outperforms others in indicating downstream cross-lingual information retrieval task performance. Fur- ther, we show that MACROF 1 can be used to effectively compare supervised and unsuper- vised neural machine translation, and reveal significant qualitative differences in the meth- ods’ outputs. 1 1 Introduction Model-based metrics for evaluating machine trans- lation such as BLEURT (Sellam et al., 2020), ESIM (Mathur et al., 2019), and YiSi (Lo, 2019) have re- cently attracted attention due to their superior cor- relation with human judgments (Ma et al., 2019). However, BLEU (Papineni et al., 2002) remains the most widely used corpus-level MT metric. It corre- lates reasonably well with human judgments, and moreover is easy to understand and cheap to cal- culate, requiring only reference translations in the target language. By contrast, model-based metrics require tuning on thousands of examples of human evaluation for every new target language or domain 1 Tools and analysis are available at https://github. com/thammegowda/007-mt-eval-macro. MT evaluation met- rics are at https://github.com/isi-nlp/sacrebleu/tree/ macroavg-naacl21. (Sellam et al., 2020). Model-based metric scores are also opaque and can hide undesirable biases, as can be seen in Table 1. Reference: You must be a doctor. Hypothesis: must be a doctor. He -0.735 Joe -0.975 Sue -1.043 She -1.100 Reference: It is the greatest country in the world. Hypothesis: is the greatest country in the world. France -0.022 America -0.060 Russia -0.161 Canada -0.309 Table 1: A demonstration of BLEURT’s internal bi- ases; model-free metrics like BLEU would consider each of the errors above to be equally wrong. The source of model-based metrics’ (e.g. BLEURT) correlative superiority over model-free metrics (e.g. BLEU) appears to be the former’s ability to focus evaluation on adequacy, while the latter are overly focused on fluency. BLEU and most other generation metrics consider each output token equally. Since natural language is dominated by a few high-count types, an MT model that con- centrates on getting its if s, ands and buts right will benefit from BLEU in the long run more than one that gets its xylophones, peripatetics, and defen- estrates right. Can we derive a metric with the discriminating power of BLEURT that does not share its bias or expense and is as interpretable as BLEU? As it turns out, the metric may already exist and be in common use. Information extraction and other areas concerned with classification have long used both micro averaging, which treats each to- ken equally, and macro averaging, which instead treats each type equally, when evaluating. The lat- ter in particular is useful when seeking to avoid results dominated by overly frequent types. In this

Macro-Average: Rare Types Are Important Too

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Proceedings of the 2021 Conference of the North American Chapter of theAssociation for Computational Linguistics: Human Language Technologies, pages 1138–1157

June 6–11, 2021. ©2021 Association for Computational Linguistics

1138

Macro-Average: Rare Types Are Important TooThamme Gowda

Information Sciences InstituteUniversity of Southern California

[email protected]

Weiqiu YouDept of Computer and Information Science

University of [email protected]

Constantine LignosMichtom School of Computer Science

Brandeis [email protected]

Jonathan MayInformation Sciences Institute

University of Southern [email protected]

Abstract

While traditional corpus-level evaluation met-rics for machine translation (MT) correlatewell with fluency, they struggle to reflect ad-equacy. Model-based MT metrics trained onsegment-level human judgments have emergedas an attractive replacement due to strong cor-relation results. These models, however, re-quire potentially expensive re-training for newdomains and languages. Furthermore, theirdecisions are inherently non-transparent andappear to reflect unwelcome biases. We ex-plore the simple type-based classifier metric,MACROF1, and study its applicability to MTevaluation. We find that MACROF1 is com-petitive on direct assessment, and outperformsothers in indicating downstream cross-lingualinformation retrieval task performance. Fur-ther, we show that MACROF1 can be used toeffectively compare supervised and unsuper-vised neural machine translation, and revealsignificant qualitative differences in the meth-ods’ outputs.1

1 Introduction

Model-based metrics for evaluating machine trans-lation such as BLEURT (Sellam et al., 2020), ESIM(Mathur et al., 2019), and YiSi (Lo, 2019) have re-cently attracted attention due to their superior cor-relation with human judgments (Ma et al., 2019).However, BLEU (Papineni et al., 2002) remains themost widely used corpus-level MT metric. It corre-lates reasonably well with human judgments, andmoreover is easy to understand and cheap to cal-culate, requiring only reference translations in thetarget language. By contrast, model-based metricsrequire tuning on thousands of examples of humanevaluation for every new target language or domain

1Tools and analysis are available at https://github.com/thammegowda/007-mt-eval-macro. MT evaluation met-rics are at https://github.com/isi-nlp/sacrebleu/tree/macroavg-naacl21.

(Sellam et al., 2020). Model-based metric scoresare also opaque and can hide undesirable biases, ascan be seen in Table 1.

Reference: You must be a doctor.Hypothesis: must be a doctor.

He -0.735Joe -0.975Sue -1.043She -1.100

Reference: It is the greatest country in the world.Hypothesis: is the greatest country in the world.

France -0.022America -0.060Russia -0.161Canada -0.309

Table 1: A demonstration of BLEURT’s internal bi-ases; model-free metrics like BLEU would considereach of the errors above to be equally wrong.

The source of model-based metrics’ (e.g.BLEURT) correlative superiority over model-freemetrics (e.g. BLEU) appears to be the former’sability to focus evaluation on adequacy, while thelatter are overly focused on fluency. BLEU andmost other generation metrics consider each outputtoken equally. Since natural language is dominatedby a few high-count types, an MT model that con-centrates on getting its if s, ands and buts right willbenefit from BLEU in the long run more than onethat gets its xylophones, peripatetics, and defen-estrates right. Can we derive a metric with thediscriminating power of BLEURT that does notshare its bias or expense and is as interpretable asBLEU?

As it turns out, the metric may already exist andbe in common use. Information extraction andother areas concerned with classification have longused both micro averaging, which treats each to-ken equally, and macro averaging, which insteadtreats each type equally, when evaluating. The lat-ter in particular is useful when seeking to avoidresults dominated by overly frequent types. In this

1139

work we take a classification-based approach toevaluating machine translation in order to obtain aneasy-to-calculate metric that focuses on adequacyas much as BLEURT but does not have the ex-pensive overhead, opacity, or bias of model-basedmethods.

Our contributions are as follows: We con-sider MT as a classification task, and thus ad-mit MACROF1 as a legitimate approach to eval-uation (Section 2). We show that MACROF1 iscompetitive with other popular methods at track-ing human judgments in translation (Section 3.2).We offer an additional justification of MACROF1as a performance indicator on adequacy-focuseddownstream tasks such as cross-lingual informa-tion retrieval (Section 3.3). Finally, we demonstratethat MACROF1 is just as good as the expensiveBLEURT at discriminating between structurallydifferent MT approaches in a way BLEU cannot,especially regarding the adequacy of generated text,and provide a novel approach to qualitative analy-sis of the effect of metrics choice on quantitativeevaluation (Section 4).

2 NMT as Classification

Neural machine translation (NMT) models are of-ten viewed as pairs of encoder-decoder networks.Viewing NMT as such is useful in practice forimplementation; however, such a view is inade-quate for theoretical analysis. Gowda and May(2020) provide a high-level view of NMT as twofundamental ML components: an autoregressorand a classifier. Specifically, NMT is viewed as amulti-class classifier that operates on representa-tions from an autoregressor. We may thus considerclassifier-based evaluation metrics.

Consider a test corpus, T = {(x(i),h(i),y(i))∣i =1,2,3...m} where x(i), h(i), and y(i) are source, sys-tem hypothesis, and reference translation, respec-tively. Let x = {x(i)∀i} and similar for h and y. LetVh,Vy,Vh∩y, and V be the vocabulary of h, the vo-cabulary of y, Vh∩Vy, and Vh∪Vy, respectively. Foreach class c ∈V ,

PREDS(c) =m

∑i=1

C(c,h(i))

REFS(c) =m

∑i=1

C(c,y(i))

MATCH(c) =m

∑i=1

min{C(c,h(i)),C(c,y(i))}

where C(c,a) counts the number of tokens of typec in sequence a (Papineni et al., 2002). For eachclass c ∈ Vh∩y, precision (Pc), recall (Rc), and Fβ

measure (Fβ ;c) are computed as follows:2

Pc =MATCH(c)PREDS(c)

; Rc =MATCH(c)REFS(c)

Fβ ;c = (1+β2)

Pc×Rc

β 2×Pc+Rc

The macro-average consolidates individual per-formance by averaging by type, while the micro-average averages by token:

MACROFβ =∑c∈V Fβ ;c

∣V ∣

MICROFβ =∑c∈V f (c)×Fβ ;c

∑c′∈V f (c′)

where f (c) = REFS(c)+k for smoothing factor k.3

We scale MACROFβ and MICROFβ values to per-centile, similar to BLEU, for the sake of easierreadability.

3 Justification for MACROF1

In the following sections, we verify and justify theutility of MACROF1 while also offering a compar-ison with popular alternatives such as MICROF1,BLEU, CHRF1, and BLEURT.4 We use Kendall’srank correlation coefficient, τ , to compute the as-sociation between metrics and human judgments.Correlations with p-values smaller than α = 0.05are considered to be statistically significant.

3.1 Data-to-Text: WebNLG

We use the 2017 WebNLG Challenge dataset (Gar-dent et al., 2017; Shimorina, 2018)5 to analyzethe differences between micro- and macro- aver-aging. WebNLG is a task of generating Englishtext for sets of triples extracted from DBPedia. Hu-man annotations are available for a sample of 223records each from nine NLG systems. The human

2We consider Fβ ;c for c /∈Vh∩y to be 0.3We use k = 1. When k→∞,MICROFβ →MACROFβ .4BLEU and CHRF1 scores reported in this work are

computed with SACREBLEU; see the Appendix for details.BLEURT scores are from the base model (Sellam et al., 2020).We consider two varieties of averaging to obtain a corpus-levelmetric from the segment-level BLEURT: mean and median ofsegment-level scores per corpus.

5https://gitlab.com/webnlg/webnlg-human-evaluation

1140

Name Fluency & Grammar SemanticsBLEU ×.444 ×.500CHRF1

×.278 .778MACROF1

×.222 .722MICROF1

×.333 .611BLEURTmean ×.444 .833BLEURTmedian .611 .667

Table 2: WebNLG data-to-text task: Kendall’s τ betweensystem-level MT metric scores and human judgments. Fluencyand grammar are correlated identically by all metrics. Valuesthat are not significant at α = 0.05 are indicated by ×.

judgments provided have three linguistic aspects—fluency, grammar, and semantics6—which enableus to perform a fine grained analysis of our met-rics. We compute Kendall’s τ between metrics andhuman judgments, which are reported in Table 2.

As seen in Table 2, the metrics exhibit muchvariance in agreements with human judgments. Forinstance, BLEURTmedian is the best indicator offluency and grammar, however BLEURTmean isbest on semantics. BLEURT, being a model-basedmeasure that is directly trained on human judg-ments, scores relatively higher than others. Consid-ering the model-free metrics, CHRF1 does well onsemantics but poorly on fluency and grammar com-pared to BLEU. Not surprisingly, both MICROF1and MACROF1, which rely solely on unigrams, arepoor indicators of fluency and grammar comparedto BLEU, however MACROF1 is clearly a better in-dicator of semantics than BLEU. The discrepancybetween MICROF1 and MACROF1 regarding theiragreement with fluency, grammar, and semanticsis expected: micro-averaging pays more attentionto function words (as they are frequent types) thatcontribute to fluency and grammar whereas macro-averaging pays relatively more attention to the con-tent words that contribute to semantic adequacy.

The take away from this analysis is as follows:MACROF1 is a strong indicator of semantic ade-quacy, however, it is a poor indicator of fluency.We recommend using either MACROF1 or CHRF1when semantic adequacy and not fluency is a de-sired goal.

3.2 Machine Translation: WMT Metrics

In this section, we verify how well the metricsagree with human judgments using Workshop onMachine Translation (WMT) metrics task datasetsfor 2017–2019 (Bojar et al., 2017; Ma et al., 2018,

6Fluency and grammar, which are elicited with nearlyidentical directions (Gardent et al., 2017), are identically cor-related.

2019).7 We first compute scores from each MTmetric, and then calculate the correlation τ withhuman judgments.

As there are many language pairs and transla-tion directions in each year, we report only themean and median of τ , and number of wins permetric for each year in Table 3. We have excludedBLEURT from comparison in this section sincethe BLEURT models are fine-tuned on the samedatasets on which we are evaluating the other meth-ods.8 CHRF1 has the strongest mean and medianagreement with human judgments across the years.In 2018 and 2019, both MACROF1 and MICROF1mean and median agreements outperform BLEU

whereas in 2017 BLEU was better than MACROF1and MICROF1.

As seen in Section 3.1, MACROF1 weighs to-wards semantics whereas MICROF1 and BLEU

weigh towards fluency and grammar. This indi-cates that recent MT systems are mostly fluent, andadequacy is the key discriminating factor amongstthem. BLEU served well in the early era of sta-tistical MT when fluency was a harder objective.Recent advancements in neural MT models suchas Transformers (Vaswani et al., 2017) produce flu-ent outputs, and have brought us to an era wheresemantic adequacy is the focus.

3.3 Cross-Lingual Information Retrieval

In this section, we determine correlation betweenMT metrics and downstream cross-lingual infor-mation retrieval (CLIR) tasks. CLIR is a kind ofinformation retrieval (IR) task in which documentsin one language are retrieved given queries in an-other (Grefenstette, 2012). A practical solutionto CLIR is to translate source documents into thequery language using an MT model, then use amonolingual IR system to match queries with trans-lated documents. Correlation between MT and IRmetrics is accomplished in the following steps:

1. Build a set of MT models and measure theirperformance using MT metrics.2. Using each MT model in the set, translate allsource documents to the target language, buildan IR model, and measure IR performance ontranslated documents.3. For each MT metric, find the correlation be-tween the set of MT scores and their correspond-ing set of IR scores. The MT metric that has a

7http://www.statmt.org/wmt19/metrics-task.html8https://github.com/google-research/bleurt

1141

Year Pairs ⋆BLEU BLEU MACROF1 MICROF1 CHRF1

2019 18Mean .751 .771 .821 .818 .841Median .782 .752 .844 .844 .875Wins 3 3 6 3 5

2018 14Mean .858 .857 .875 .873 .902Median .868 .868 .901 .879 .919Wins 1 2 3 2 6

2017 13Mean .752 .713 .714 .742 .804Median .758 .733 .735 .728 .791Wins 5 4 2 2 6

Table 3: WMT 2017–19 Metrics task: Mean and median Kendall’s τ between MT metrics and human judgments.Correlations that are not significant at α = 0.05 are excluded from the calculation of mean, and median, and wins.See Appendix Tables 9, 10, and 11 for full details. ⋆BLEU is pre-computed scores available in the metrics packages.In 2018 and 2019, both MACROF1 and MICROF1 outperform BLEU, MACROF1 outperforms MICROF1. CHRF1has strongest mean and median agreements across the years. Judging based on the number of wins, MACROF1 hassteady progress over the years, and outperforms others in 2019.

stronger correlation with the IR metric(s) is moreuseful than the ones with weaker correlations.4. Repeat the above steps on many languages toverify the generalizability of findings.An essential resource of this analysis is a dataset

with human annotations for computing MT andIR performances. We conduct experiments on twodatasets: firstly, on data from the 2020 workshopon Cross-Language Search and Summarization ofText and Speech (CLSSTS) (Zavorin et al., 2020),and secondly, on data originally from Europarl,prepared by Lignos et al. (2019) (Europarl).

3.3.1 CLSSTS DatasetsCLSSTS datasets contain queries in English (EN),and documents in many source languages alongwith their human translations, as well as query-document relevance judgments. We use threesource languages: Lithuanian (LT), Pashto (PS),and Bulgarian (BG). The performance of this CLIRtask is evaluated using two IR measures: ActualQuery Weighted Value (AQWV) and Mean Av-erage Precision (MAP). AQWV9 is derived fromActual Term Weighted Value (ATWV) metric (Weg-mann et al., 2013).

We use a single CLIR system (Boschee et al.,2019) with the same IR settings for all MT modelsin the set, and measure Kendall’s τ between MTand IR measures. The results, in Table 4, show thatMACROF1 is the strongest indicator of CLIR down-stream task performance in five out of six settings.AQWV and MAP have a similar trend in agree-ment to the MT metrics. CHRF1 and BLEURT,which are strong contenders when generated textis directly evaluated by humans, do not indicate

9https://www.nist.gov/system/files/documents-/2017/10/26/aqwv_derivation.pdf

CLIR task performance as well as MACROF1, asCLIR tasks require faithful meaning equivalenceacross the language boundary, and human transla-tors can mistake fluent output for proper transla-tions (Callison-Burch et al., 2007).

3.3.2 Europarl DatasetsWe perform a similar analysis to Section 3.3.1 buton another cross-lingual task set up by Lignos et al.(2019) for Czech→ English (CS-EN) and German→ English (DE-EN), using publicly available datafrom the Europarl v7 corpus (Koehn, 2005). Thistask differs from the CLSSTS task (Section 3.3.1)in several ways. Firstly, MT metrics are computedon test sets from the news domain, whereas IRmetrics are from the Europarl domain. The do-mains are thus intentionally mismatched betweenMT and IR tests. Secondly, since there are noqueries specifically created for the Europarl do-main, GOV2 TREC topics 701–850 are used asdomain-relevant English queries. And lastly, sincethere are no query-document relevance human judg-ments for the chosen query and document sets, thedocuments retrieved by BM25 (Jones et al., 2000)on the English set for each query are treated asrelevant documents for computing the performanceof the CS-EN and DE-EN CLIR setup. As a result,IR metrics that rely on boolean query-documentrelevance judgments as ground truth are less infor-mative, and we use Rank-Based Overlap (RBO;p = 0.98) (Webber et al., 2010) as our IR metric.

We perform our analysis on the same experi-ments as Lignos et al. (2019).10 NMT modelsfor CS-EN and DE-EN translation are trained us-ing a convolutional NMT architecture (Gehring

10https://github.com/ConstantineLignos/mt-clir-emnlp-2019

1142

Domain IR Score BLEU MACROF1 MICROF1 CHRF1 BLEURTmean BLEURTmedian

LT-ENIn AQWV .429 ×.363 .508 ×.385 .451 .420

MAP .495 .429 .575 .451 .473 .486

In+Ext AQWV ×.345 .527 .491 .491 .491 .477MAP ×.273 ×.455 ×.418 ×.418 ×.418 ×.404

PS-ENIn AQWV .559 .653 .574 .581 .584 .581

MAP .493 .632 .487 .494 .558 .554

In+Ext AQWV .589 .682 .593 .583 .581 .571MAP .519 .637 .523 .482 .536 .526

BG-ENIn AQWV ×.455 .550 .527 ×.382 ×.418 .418

MAP .491 .661 .564 .491 .527 .527

In+ext AQWV ×.257 .500 ×.330 ×.404 ×.367 ×.367MAP ×.183 ×.426 ×.257 ×.330 ×.294 ×.294

Table 4: CLSSTS CLIR task: Kendall’s τ between IR and MT metrics under study. The rows with Domain=Inare where MT and IR scores are computed on the same set of documents, whereas Domain=In+Ext are where IRscores are computed on a larger set of documents that is a superset of segments on which MT scores are computed.Bold values are the best correlations achieved in a row-wise setting; values with × are not significant at α = 0.05.

BLEU MACROF1 MICROF1 CHRF1 BT BTCS-EN .850 .867 .850 .850 .900 .867DE-EN .900 .900 .900 .912 .917 .900

Table 5: Europarl CLIR task: Kendall’s τ betweenMT metrics and RBO. BT and BT are short forBLEURTmean and BLEURTmedian. All correlationsare significant at α = 0.05.

et al., 2017) implemented in the FAIRSeq (Ott et al.,2019) toolkit. For each of CS-EN and DE-EN, atotal of 16 NMT models that are based on differentquantities of training data and BPE hyperparame-ter values are used. The results in Table 5 showthat BLEURT has the highest correlation in bothcases. Apart from the trained BLEURTmedian met-ric, MACROF1 scores higher than the others onCS-EN, and is competitive on CS-EN. MACROF1is not the metric with highest IR task correlationin this setting, unlike in Section 3.3.1, however itis competitive with BLEU and CHRF1, and thusa safe choice as a downstream task performanceindicator.

4 Spotting Qualitative DifferencesBetween Supervised and UnsupervisedNMT with MACROF1

Unsupervised neural machine translation (UNMT)systems trained on massive monolingual datawithout parallel corpora have made significantprogress recently (Artetxe et al., 2018; Lampleet al., 2018a,b; Conneau and Lample, 2019; Songet al., 2019; Liu et al., 2020). In some cases,UNMT yields a BLEU score that is comparablewith strong11 supervised neural machine transla-

11though not, generally, the strongest

tion (SNMT) systems. In this section we leverageMACROF1 to investigate differences in the transla-tions from UNMT and SNMT systems that havesimilar BLEU.

We compare UNMT and SNMT for English↔German (EN-DE, DE-EN), English↔ French (EN-FR, FR-EN), and English↔ Romanian (EN-RO,RO-EN). All our UNMT models are based on XLM(Conneau and Lample, 2019), pretrained by Yang(2020). We choose SNMT models with similarBLEU on common test sets by either selecting fromsystems submitted to previous WMT News Trans-lation shared tasks (Bojar et al., 2014, 2016) or bybuilding such systems.12 Specific SNMT modelschosen are in the Appendix (Table 12).

Table 6 shows performance for these three lan-guage pairs using a variety of metrics. Despitecomparable scores in BLEU and only minor dif-ferences in MICROF1 and CHRF1, SNMT modelshave consistently higher MACROF1 and BLEURTthan the UNMT models for all six translation direc-tions.

In the following section, we use a pairwise maxi-mum difference discriminator approach to comparecorpus-level metrics BLEU and MACROF1 on a seg-ment level. Qualitatively, we take a closer look atthe behavior of the two metrics when comparinga translation with altered meaning to a translationwith differing word choices using the metric.

12We were unable to find EN-DE and DE-EN systems withcomparable BLEU in WMT submissions so we built standardTransformer-base (Vaswani et al., 2017) models for these usingappropriate quantity of training data to reach the desired BLEUperformance. We report EN-RO results with diacritic removedto match the output of UNMT.

1143

BLEU MACROF1 MICROF1 CHRF1 BLEURTmean BLEURTmedianSN UN ∆ SN UN ∆ SN UN ∆ SN UN ∆ SN UN ∆ SN UN ∆

DE-EN 32.7 33.9 -1.2 38.5 33.6 4.9 58.7 57.9 0.8 59.9 58.0 1.9 .211 -.026 .24 .285 .067 .22EN-DE 24.0 24.0 0.0 24.0 23.5 0.5 47.7 48.1 -0.4 53.3 52.0 1.3 -.134 -.204 .07 -.112 -.197 .09FR-EN 31.1 31.2 -0.1 41.6 33.6 8.0 60.5 58.3 2.2 59.1 57.3 1.8 .182 .066 .17 .243 .154 .09EN-FR 25.6 27.1 -1.5 31.9 27.3 4.6 53.0 52.3 0.7 56.0 57.7 -1.7 .104 .042 .06 .096 .063 .03RO-EN 30.8 29.6 1.2 40.3 33.0 7.3 59.8 56.5 3.3 58.0 54.7 3.3 .004 -.058 .06 .045 -.004 .04EN-RO 31.2 31.0 0.2 34.6 31.0 3.6 55.4 53.4 2.0 59.3 56.7 2.6 .030 -.046 .08 .027 -.038 .07

Table 6: For each language direction, UNMT (UN) models have similar BLEU to SNMT (SN) models, andCHRF1 and MICROF1 have small differences. However, MACROF1 scores differ significantly, consistently infavor of SNMT. Both corpus-level interpretations of BLEURT support the trend reflected by MACROF1, but thevalue differences are difficult to interpret.

δMACROF1 Fav Analysis δBLEU Fav Analysis.071 S S: synonym; U: untranslation, noun .048 S S: word order; U: word order, untranslation,

ending.064 S S: synonym; U: untranslation .046 S S: spelling variation; U: synonym, word order,

punctuation-.055 U U: no issues; S: untranslation .044 S S: extra determiner; U: paraphrase, synonym,

number, untranslation.052 S S: synonym; U: untranslation, noun .042 S S: synonym; U: synonym, punctuation, extra

adverb-.045 U U: no issues; S: untranslation -.039 U U: no issues; S: noun, verb.044 S S: synonym, word order; U: subject, trunca-

tion, word order-.037 U U: no issues; S: punctuation

.044 S S: synonym, tense; U: untranslation -.034 U U: no issues; S: symbol

.043 S S: inflection, word order; U: number -.032 U U: no issues; S: adjective, noun-.041 U U: adjective, verb; S: omitted verb, untransla-

tion-.032 U U: untranslation; S: tense, word order, mean-

ing, active/passive voice.041 S S: time, word order; U: time, nouns -.031 U U: untranslation; S: word order, synonym, ex-

tra_conj

Table 7: Analysis of the ten DE-EN test set segments with the most favoritism in SNMT (S) or UNMT (U),according to MACROF1 (left) and BLEU (right). Fav is the favored system by metrics. The complete text of thesentences is in the Appendix, Tables 15 and 16.

4.1 Pairwise Maximum DifferenceDiscriminator

We consider cases where a metric has a strongopinion of one translation system over another, andanalyze whether the opinion is well justified. Inorder to obtain this analysis, we employ a pairwisesegment-level discriminator from within a corpus-level metric, which we call favoritism.

We extend the definition of T from Section 2to T = {x,hS,hU ,y} where each of hS and hU is aseparate system’s hypothesis set for x.13 Let Mbe a corpus-level measure such that M(h,y) ∈ Rand a higher value implies better translation quality.M(h(−i),y(−i)) is the corpus-level score obtainedby excluding h(i) and y(i) from h and y, respectively.We define the benefit of segment i, δM(i;h):

δM(i;h) =M(h,y)−M(h(−i),y(−i))

If δM(i;h) > 0, then i is beneficial to h with respectto M, as the inclusion of h(i) increases the corpus-

13The subscripts represent SNMT and UNMT in this case,though the definition is general.

level score. We define the favoritism of M toward ias δM(i;hS,hU):

δM(i;hS,hU) = δM(i;hS)−δM(i;hU) (1)

If δM(i;hS,hU) > 0 then M favors the translation ofx(i) by system S over that in system U .

Table 7 reflects the results of a manual examina-tion of the ten sentences in the DE-EN test set withgreatest magnitude favoritism; complete resultsare in the Appendix, Tables 15 and 16. Meaning-altering changes such as ‘untranslation’, (wrong)

‘time’, and (wrong) ‘translation’ are marked in ital-ics, while changes that do not fundamentally alterthe meaning, such as ‘synonym,’ (different) ‘inflec-tion,’ and (different) ‘word order’ are marked inplain text.14

The results indicate that MACROF1 generallyfavors SNMT, and with good reasons, as the fa-vored translation does not generally alter sentencemeaning, while the disfavored translation does. On

14Some changes, such as ‘word order’ may change meaning;these are italicized or not on a case-by-case basis.

1144

6thδMACROF1(i,hS,hU): .044, δBLEU(i,hS,hU): -.00087, δBLEURT (i,hS,hU): .97

Ref Ever since I joined Labour 32 years ago as a school pupil, provoked by the Thatcher government’s neglect that had left mycomprehensive school classroom literally falling down, I’ve sought to champion better public services for those who need them most

- whether as a local councillor or government minister.SNMT 32 years ago, I joined Labour as a student because of the neglect of the Thatcher government, which had led to my classroom literally

collapsed, and as a result I tried to promote better public services for those who need it most, whether as a local council or ministers.UNMT Last 32 years ago, as a student, because of the disdain for the Thatcher-era government, Labour joined Labour.Problems SNMT: synonym, word_order UNMT: subject, truncation , word_order

Table 8: An example of favoritism that illustrates the differences between MACROF1 and BLEU. Translations ofthe DE-EN test sentence with sixth largest magnitude favoritism according to MACROF1, along with the favoritismaccording to BLEU (not in the top ten). UNMT’s translation does not include the second half of the sentence.MACROF1 favors SNMT, but BLEU favors UNMT.

the other hand, for the ten most favored sentencesaccording to BLEU, four do not contain meaning-altering divergences in the disfavored translation.Importantly, none of the sentences with greatestfavoritism according to MACROF1, all of whichhaving meaning altering changes in the disfavoredalternatives, appears in the list for BLEU. This indi-cates relatively bad judgment on the part of BLEU.One case of good judgment from MACROF1 andbad judgment from BLEU regarding truncation isshown in Table 8.

From our qualitative examinations, MACROF1is better than BLEU at discriminating against un-translations and trucations in UNMT. The case issimilar for FR-EN and RO-EN, except that RO-EN has more untranslations for both SNMT andUNMT, possibly due to the smaller training data.Complete tables and annotated sentences are in theAppendix, in Section C.

5 Related Work

5.1 MT Metrics

Many metrics have been proposed for MT eval-uation, which we broadly categorize into model-free or model-based. Model-free metrics computescores based on translations but have no signifi-cant parameters or hyperparameters that must betuned a priori; these include BLEU (Papineni et al.,2002), NIST (Doddington, 2002), TER (Snoveret al., 2006), and CHRF1 (Popovic, 2015). Model-based metrics have a significant number of pa-rameters and, sometimes, external resources thatmust be set prior to use. These include METEOR(Banerjee and Lavie, 2005), BLEURT (Sellamet al., 2020), YiSi (Lo, 2019), ESIM (Mathur et al.,2019), and BEER (Stanojevic and Sima’an, 2014).Model-based metrics require significant effort andresources when adapting to a new language or do-main, while model-free metrics require only a test

set with references.Mathur et al. (2020) have recently evaluated the

utility of popular metrics and recommend the useof either CHRF1 or a model-based metric instead ofBLEU. We compare our MACROF1 and MICROF1metrics with BLEU, CHRF1, and BLEURT (Sel-lam et al., 2020). While Mathur et al. (2020) usePearson’s correlation coefficient (r) to quantify thecorrelation between automatic evaluation metricsand human judgements, we instead use Kendall’srank coefficient (τ), since τ is more robust to out-liers than r (Croux and Dehon, 2010).

5.2 Rare Words are Important

That natural language word types roughly follow aZipfian distribution is a well known phenomenon(Zipf, 1949; Powers, 1998). The frequent types aremainly so-called “stop words,” function words, andother low-information types, while most contentwords are infrequent types. To counter this natu-ral frequency-based imbalance, statistics such asinverted document frequency (IDF) are commonlyused to weigh the input words in applications suchas information retrieval (Jones, 1972). In NLGtasks such as MT, where words are the output ofa classifier, there has been scant effort to addressthe imbalance. Doddington (2002) is the only workwe know of in which the ‘information’ of an n-gram is used as its weight, such that rare n-gramsattain relatively more importance than in BLEU.We abandon this direction for two reasons: Firstly,as noted in that work, large amounts of data arerequired to estimate n-gram statistics. Secondly,unequal weighing is a bias that is best suited todatasets where the weights are derived from, andsuch biases often do not generalize to other datasets.Therefore, unlike Doddington (2002), we assignequal weights to all n-gram classes, and in thiswork we limit our scope to unigrams only.

While BLEU is a precision-oriented measure,

1145

METEOR (Banerjee and Lavie, 2005) and CHRF(Popovic, 2015) include both precision and recall,similar to our methods. However, neither of thesemeasures try to address the natural imbalance ofclass distribution. BEER (Stanojevic and Sima’an,2014) and METEOR (Denkowski and Lavie, 2011)make an explicit distinction between function andcontent words; such a distinction inherently cap-tures frequency differences since function wordsare often frequent and content words are often in-frequent types. However, doing so requires theconstruction of potentially expensive linguistic re-sources. This work does not make any explicit dis-tinction and uses naturally occurring type counts toeffect a similar result.

5.3 F-measure as an Evaluation Metric

F-measure (Rijsbergen, 1979; Chinchor, 1992) isextensively used as an evaluation metric in classifi-cation tasks such as part-of-speech tagging, namedentity recognition, and sentiment analysis (Der-czynski, 2016). Viewing MT as a multi-class clas-sifier is a relatively new paradigm (Gowda andMay, 2020), and evaluating MT solely as a multi-class classifier as proposed in this work is not anestablished practice. However, we find that theF1 measure is sometimes used for various analy-ses when BLEU and others are inadequate: Thecompare-mt tool (Neubig et al., 2019) supportscomparison of MT models based on F1 measure ofindividual types. Gowda and May (2020) use F1 ofindividual types to uncover frequency-based biasin MT models. Sennrich et al. (2016) use corpus-level unigram F1 in addition to BLEU and CHRF,however, corpus-level F1 is computed as MICROF1.To the best of our knowledge, there is no previ-ous work that clearly formulates the differencesbetween micro- and macro- averages, and justifiesthe use of MACROF1 for MT evaluation.

6 Discussion and Conclusion

We have evaluated NLG in general and MT specifi-cally as a multi-class classifier, and illustrated thedifferences between micro- and macro- averagesusing MICROF1 and MACROF1 as examples (Sec-tion 2). MACROF1 captures semantic adequacybetter than MICROF1 (Section 3.1). BLEU, beinga micro-averaged measure, served well in an erawhen generating fluent text was at least as difficultas generating adequate text. Since we are now in anera in which fluency is taken for granted and seman-

tic adequacy is a key discriminating factor, macro-averaged measures such as MACROF1 are betterat judging the generation quality of MT models(Section 3.2). We have found that another popularmetric, CHRF1, also performs well on direct assess-ment, however, being an implicitly micro-averagedmeasure, it does not perform as well as MACROF1on downstream CLIR tasks (Section 3.3.1). Un-like BLEURT, which is also adequacy-oriented,MACROF1 is directly interpretable, does not re-quire retuning on expensive human evaluationswhen changing language or domain, and does notappear to have uncontrollable biases resulting fromdata effects. It is both easy to understand and tocalculate, and is inspectable, enabling fine-grainedanalysis at the level of individual word types. Theseattributes make it a useful metric for understand-ing and addressing the flaws of current models.For instance, we have used MACROF1 to comparesupervised and unsupervised NMT models at thesame operating point measured in BLEU, and deter-mined that supervised models have better adequacythan the current unsupervised models (Section 4).

Macro-average is a useful technique for address-ing the importance of the long tail of language, andMACROF1 is our first step in that direction; we an-ticipate the development of more advanced macro-averaged metrics that take advantage of higher-order and character n-grams in the future.

7 Ethical Consideration

Since many ML models including NMT are them-selves opaque and known to possess data-inducedbiases (Prates et al., 2019), using opaque and bi-ased evaluation metrics in concurrence makes iteven harder to discover and address the flaws inmodeling. Hence, we have raised concerns aboutthe opaque nature of the current model-based evalu-ation metrics, and demonstrated examples display-ing unwelcome biases in evaluation. We advocatethe use of the MACROF1 metric, as it is easily in-terpretable and offers the explanation of score asa composition of individual type performances. Inaddition, MACROF1 treats all types equally, andhas no parameters that are directly or indirectly esti-mated from data sets. Unlike MACROF1, MICROF1and other implicitly or explicitly micro-averagedmetrics assign lower importance to rare conceptsand their associated rare types. The use of micro-averaged metrics in real world evaluation couldlead to marginalization of rare types.

1146

Failure Modes: The proposed MACROF1 metricis not the best measure of fluency of text. Hencewe suggest caution while using MACROF1 to drawfluency related decisions. MACROF1 is inherentlyconcerned with words, and assumes the output lan-guage is easily segmentable into word tokens. Us-ing MACROF1 to evaluate translation into alphabet-ical languages such as Thai, Lao, and Khmer, thatdo not use white space to segment words, requiresan effective tokenizer. Absent this the method maybe ineffective; we have not tested it on languagesbeyond those listed in Section B.

Reproducibility: Our implementation ofMACROF1 and MICROF1 has the same userexperience as BLEU as implemented in SACRE-BLEU; signatures are provided in Section A. Inaddition, our implementation is computationallyefficient, and has the same (minimal) softwareand hardware requirements as BLEU. All datafor MT and NLG human correlation studies ispublicly available and documented. Data forreproducing the IR experiments in Section 3.3.2 isalso publicly available and documented. The datafor reproducing the IR experiments in Section 3.3.1is only available to participants in the CLSSTSshared task.

Climate Impact: Our proposed metrics are on parwith BLEU and such model-free methods, whichconsume significantly less energy than most model-based evaluation metrics.

Acknowledgements

The authors thank Shantanu Agarwal, Joel Barry,and Scott Miller for their help with CLSSTS CLIRexperiments, and Daniel Cohen for discussions onIR evaluation methods. This research is based uponwork supported by the Office of the Director of Na-tional Intelligence (ODNI), Intelligence AdvancedResearch Projects Activity (IARPA), via AFRLContract FA8650-17-C-9116. The views and con-clusions contained herein are those of the authorsand should not be interpreted as necessarily repre-senting the official policies or endorsements, eitherexpressed or implied, of the ODNI, IARPA, or theU.S. Government. The U.S. Government is autho-rized to reproduce and distribute reprints for Gov-ernmental purposes notwithstanding any copyrightannotation thereon.

ReferencesMikel Artetxe, Gorka Labaka, Eneko Agirre, and

Kyunghyun Cho. 2018. Unsupervised neural ma-chine translation. In 6th International Conferenceon Learning Representations, ICLR 2018, Vancou-ver, BC, Canada, April 30 - May 3, 2018, Confer-ence Track Proceedings. OpenReview.net.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR:An automatic metric for MT evaluation with im-proved correlation with human judgments. In Pro-ceedings of the ACL Workshop on Intrinsic and Ex-trinsic Evaluation Measures for Machine Transla-tion and/or Summarization, pages 65–72, Ann Ar-bor, Michigan. Association for Computational Lin-guistics.

Ondrej Bojar, Christian Buck, Christian Federmann,Barry Haddow, Philipp Koehn, Johannes Leveling,Christof Monz, Pavel Pecina, Matt Post, HerveSaint-Amand, Radu Soricut, Lucia Specia, and AlešTamchyna. 2014. Findings of the 2014 workshop onstatistical machine translation. In Proceedings of theNinth Workshop on Statistical Machine Translation,pages 12–58, Baltimore, Maryland, USA. Associa-tion for Computational Linguistics.

Ondrej Bojar, Yvette Graham, and Amir Kamran. 2017.Results of the WMT17 metrics shared task. InProceedings of the Second Conference on MachineTranslation, pages 489–513, Copenhagen, Denmark.Association for Computational Linguistics.

Ondrej Bojar, Rajen Chatterjee, Christian Federmann,Yvette Graham, Barry Haddow, Matthias Huck,Antonio Jimeno Yepes, Philipp Koehn, VarvaraLogacheva, Christof Monz, Matteo Negri, Aure-lie Neveol, Mariana Neves, Martin Popel, MattPost, Raphael Rubino, Carolina Scarton, Lucia Spe-cia, Marco Turchi, Karin Verspoor, and MarcosZampieri. 2016. Findings of the 2016 conferenceon machine translation. In Proceedings of the FirstConference on Machine Translation, pages 131–198,Berlin, Germany. Association for ComputationalLinguistics.

Elizabeth Boschee, Joel Barry, Jayadev Billa, Mar-jorie Freedman, Thamme Gowda, Constantine Lig-nos, Chester Palen-Michel, Michael Pust, Ban-riskhem Kayang Khonglah, Srikanth Madikeri,Jonathan May, and Scott Miller. 2019. SARAL: Alow-resource cross-lingual domain-focused informa-tion retrieval system for effective rapid documenttriage. In Proceedings of the 57th Annual Meeting ofthe Association for Computational Linguistics: Sys-tem Demonstrations, pages 19–24, Florence, Italy.Association for Computational Linguistics.

Chris Callison-Burch, Cameron Fordyce, PhilippKoehn, Christof Monz, and Josh Schroeder. 2007.(Meta-) evaluation of machine translation. In Pro-ceedings of the Second Workshop on Statistical Ma-chine Translation, pages 136–158, Prague, CzechRepublic. Association for Computational Linguis-tics.

1147

Nancy Chinchor. 1992. MUC-4 Evaluation Metrics. InProceedings of the 4th Conference on Message Un-derstanding, MUC4 ’92, page 22–29, USA. Associ-ation for Computational Linguistics.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In H. Wal-lach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc,E. Fox, and R. Garnett, editors, Advances in Neu-ral Information Processing Systems 32, pages 7059–7069. Curran Associates, Inc.

Christophe Croux and Catherine Dehon. 2010. Influ-ence functions of the spearman and kendall correla-tion measures. Statistical methods & applications,19(4):497–515.

Michael Denkowski and Alon Lavie. 2011. Meteor 1.3:Automatic metric for reliable optimization and eval-uation of machine translation systems. In Proceed-ings of the Sixth Workshop on Statistical MachineTranslation, pages 85–91, Edinburgh, Scotland. As-sociation for Computational Linguistics.

Leon Derczynski. 2016. Complementarity, F-score,and NLP evaluation. In Proceedings of the Tenth In-ternational Conference on Language Resources andEvaluation (LREC’16), pages 261–266, Portorož,Slovenia. European Language Resources Associa-tion (ELRA).

George Doddington. 2002. Automatic evaluationof machine translation quality using n-gram co-occurrence statistics. In Proceedings of the SecondInternational Conference on Human Language Tech-nology Research, HLT ’02, page 138–145, San Fran-cisco, CA, USA. Morgan Kaufmann Publishers Inc.

Claire Gardent, Anastasia Shimorina, Shashi Narayan,and Laura Perez-Beltrachini. 2017. Creating train-ing corpora for NLG micro-planners. In Proceed-ings of the 55th Annual Meeting of the Associa-tion for Computational Linguistics (Volume 1: LongPapers), pages 179–188. Association for Computa-tional Linguistics.

Jonas Gehring, Michael Auli, David Grangier, De-nis Yarats, and Yann N Dauphin. 2017. Convolu-tional sequence to sequence learning. In Proceed-ings of the 34th International Conference on Ma-chine Learning-Volume 70, pages 1243–1252.

Thamme Gowda and Jonathan May. 2020. Finding theoptimal vocabulary size for neural machine transla-tion. In Findings of the Association for Computa-tional Linguistics: EMNLP 2020, pages 3955–3964,Online. Association for Computational Linguistics.

Gregory Grefenstette. 2012. Cross-language informa-tion retrieval, volume 2. Springer Science & Busi-ness Media.

Karen Spärck Jones. 1972. A statistical interpretationof term specificity and its application in retrieval.Journal of Documentation, 28:11–21.

Karen Spärck Jones, Steve Walker, and Stephen E.Robertson. 2000. A probabilistic model of informa-tion retrieval: development and comparative exper-iments: Part 2. Information processing & manage-ment, 36(6):809–840.

Philipp Koehn. 2005. Europarl: A parallel corpus forstatistical machine translation. Proc. 10th MachineTranslation Summit (MT Summit), 2005, pages 79–86.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer,and Marc’Aurelio Ranzato. 2018a. Unsupervisedmachine translation using monolingual corpora only.In 6th International Conference on Learning Rep-resentations, ICLR 2018, Vancouver, BC, Canada,April 30 - May 3, 2018, Conference Track Proceed-ings. OpenReview.net.

Guillaume Lample, Myle Ott, Alexis Conneau, Lu-dovic Denoyer, and Marc’Aurelio Ranzato. 2018b.Phrase-based & neural unsupervised machine trans-lation. In Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processing,pages 5039–5049, Brussels, Belgium. Associationfor Computational Linguistics.

Constantine Lignos, Daniel Cohen, Yen-Chieh Lien,Pratik Mehta, W. Bruce Croft, and Scott Miller.2019. The challenges of optimizing machine trans-lation for low resource cross-language informationretrieval. In Proceedings of the 2019 Conference onEmpirical Methods in Natural Language Processingand the 9th International Joint Conference on Natu-ral Language Processing (EMNLP-IJCNLP), pages3497–3502, Hong Kong, China. Association forComputational Linguistics.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, SergeyEdunov, Marjan Ghazvininejad, Mike Lewis, andLuke Zettlemoyer. 2020. Multilingual denoisingpre-training for neural machine translation.

Chi-kiu Lo. 2019. YiSi - a unified semantic MT qualityevaluation and estimation metric for languages withdifferent levels of available resources. In Proceed-ings of the Fourth Conference on Machine Transla-tion (Volume 2: Shared Task Papers, Day 1), pages507–513, Florence, Italy. Association for Computa-tional Linguistics.

Qingsong Ma, Ondrej Bojar, and Yvette Graham. 2018.Results of the WMT18 metrics shared task: Bothcharacters and embeddings achieve good perfor-mance. In Proceedings of the Third Conference onMachine Translation: Shared Task Papers, pages671–688, Belgium, Brussels. Association for Com-putational Linguistics.

Qingsong Ma, Johnny Wei, Ondrej Bojar, and YvetteGraham. 2019. Results of the WMT19 metricsshared task: Segment-level and strong MT sys-tems pose big challenges. In Proceedings of theFourth Conference on Machine Translation (Volume

1148

2: Shared Task Papers, Day 1), pages 62–90, Flo-rence, Italy. Association for Computational Linguis-tics.

Nitika Mathur, Timothy Baldwin, and Trevor Cohn.2019. Putting evaluation in context: Contextualembeddings improve machine translation evaluation.In Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics, pages2799–2808, Florence, Italy. Association for Compu-tational Linguistics.

Nitika Mathur, Timothy Baldwin, and Trevor Cohn.2020. Tangled up in BLEU: Reevaluating the eval-uation of automatic machine translation evaluationmetrics. In Proceedings of the 58th Annual Meet-ing of the Association for Computational Linguistics,pages 4984–4997, Online. Association for Computa-tional Linguistics.

Graham Neubig, Zi-Yi Dou, Junjie Hu, Paul Michel,Danish Pruthi, and Xinyi Wang. 2019. compare-mt:A tool for holistic comparison of language genera-tion systems. In Proceedings of the 2019 Confer-ence of the North American Chapter of the Asso-ciation for Computational Linguistics (Demonstra-tions), pages 35–41, Minneapolis, Minnesota. Asso-ciation for Computational Linguistics.

Myle Ott, Sergey Edunov, Alexei Baevski, AngelaFan, Sam Gross, Nathan Ng, David Grangier, andMichael Auli. 2019. fairseq: A fast, extensibletoolkit for sequence modeling. In Proceedings ofNAACL-HLT 2019: Demonstrations.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic eval-uation of machine translation. In Proceedings ofthe 40th Annual Meeting of the Association for Com-putational Linguistics, pages 311–318, Philadelphia,Pennsylvania, USA. Association for ComputationalLinguistics.

Maja Popovic. 2015. chrF: character n-gram f-scorefor automatic MT evaluation. In Proceedings of theTenth Workshop on Statistical Machine Translation,pages 392–395, Lisbon, Portugal. Association forComputational Linguistics.

David M. W. Powers. 1998. Applications and explana-tions of Zipf’s law. In New Methods in LanguageProcessing and Computational Natural LanguageLearning.

Marcelo O.R. Prates, Pedro H. Avelar, and Luís C.Lamb. 2019. Assessing gender bias in machinetranslation: a case study with google translate. Neu-ral Computing and Applications, pages 1–19.

C. J. Van Rijsbergen. 1979. Information Retrieval, 2ndedition. Butterworth-Heinemann, USA.

Thibault Sellam, Dipanjan Das, and Ankur Parikh.2020. BLEURT: Learning robust metrics for textgeneration. In Proceedings of the 58th Annual Meet-ing of the Association for Computational Linguistics,

pages 7881–7892, Online. Association for Computa-tional Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch.2016. Neural machine translation of rare wordswith subword units. In Proceedings of the 54th An-nual Meeting of the Association for ComputationalLinguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computa-tional Linguistics.

Anastasia Shimorina. 2018. Human vs automatic met-rics: on the importance of correlation design. CoRR,abs/1805.11474.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Lin-nea Micciulla, and John Makhoul. 2006. A study oftranslation edit rate with targeted human annotation.In Proceedings of association for machine transla-tion in the Americas, volume 200. Cambridge, MA.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to se-quence pre-training for language generation. In Pro-ceedings of Machine Learning Research, volume 97,pages 5926–5936, Long Beach, California, USA.PMLR.

Miloš Stanojevic and Khalil Sima’an. 2014. BEER:BEtter evaluation as ranking. In Proceedings of theNinth Workshop on Statistical Machine Translation,pages 414–419, Baltimore, Maryland, USA. Associ-ation for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N Gomez, ŁukaszKaiser, and Illia Polosukhin. 2017. Attention is allyou need. In Advances in neural information pro-cessing systems, pages 5998–6008.

William Webber, Alistair Moffat, and Justin Zobel.2010. A similarity measure for indefinite rankings.ACM Transactions on Information Systems (TOIS),28(4):1–38.

Steven Wegmann, Arlo Faria, Adam Janin, KorbinianRiedhammer, and Nelson Morgan. 2013. The TAOof ATWV: Probing the mysteries of keyword searchperformance. In 2013 IEEE Workshop on AutomaticSpeech Recognition and Understanding, pages 192–197. IEEE.

Hans Yang. 2020. XLM-UNMT-Models. https://github.com/Hansxsourse/XLM-UNMT-Models.

Ilya Zavorin, Aric Bills, Cassian Corey, Michelle Mor-rison, Audrey Tong, and Richard Tong. 2020. Cor-pora for cross-language information retrieval in sixless-resourced languages. In Proceedings of theworkshop on Cross-Language Search and Summa-rization of Text and Speech (CLSSTS2020), pages7–13, Marseille, France. European Language Re-sources Association.

George Kingsley Zipf. 1949. Human behaviour andthe principle of least-effort. Cambridge MA edn.Addison-Wesley.

1149

A Metrics Reproducibility

BLEU scores reported in this work are com-puted with the SACREBLEU library and havesignature BLEU+case.mixed+lang.<xx>-<yy>+numrefs.1

+smooth.exp+tok.<TOK>+version.1.4.13, where <TOK>

is zh for Chinese, and 13a for all other languages.MACROF1 and MICROF1 use the same tokenizer asBLEU. CHRF1 is also obtained using SACREBLEU

and has signature chrF1+lang.<xx>-<yy>+numchars.6

+space.false +version.1.4.13. BLUERT scores arefrom the base model of Sellam et al. (2020),which is fine-tuned on WMT Metrics ratings datafrom 2015-2018. The BLEURT model is re-trieved from https://storage.googleapis.com/

bleurt-oss/bleurt-base-128.zip.MACROF1 and MICROF1 are computed using

our fork of SACREBLEU as:sacrebleu $REF -m macrof microf < $HYP.

B Agreement with WMT HumanJudgments

Tables 9, 10, and 11 provide τ between MT metricsand human judgments on WMT Metrics task 2017–2019. ⋆BLEU is based on pre-computed scores inWMT metrics package, whereas BLEU is basedon our recalculation using SACREBLEU. Valuesmarked with ×are not significant at α = 0.05, andhence corresponding rows are excluded from thecalculation of mean, median, and standard devia-tion.

Since MACROF1 is the only metric that does notachieve statistical significance in the WMT 2019EN-ZH setting, we carefully inspected it. Humanscores for this setting are obtained without look-ing at the references by bilingual speakers (Maet al., 2019), but the ZH references are found tohave a large number of bracketed EN phrases, es-pecially proper nouns that are rare types. Whenthe text inside these brackets is not generated by anMT system, MACROF1 naturally penalizes heavilydue to the poor recall. Since other metrics assignlower importance to poor recall of such rare types,they achieve relatively better correlation to humanscores than MACROF1. However, since the τ val-ues for EN-ZH are relatively lower than the otherlanguage pairs, we conclude that poor correlationof MACROF1 in EN-ZH is due to poor quality ref-erences. Some settings did not achieve statisticalsignificance due to a smaller sample set as therewere fewer MT systems submitted, e.g. 2017 CS-EN.

⋆BLEU BLEU MACROF1 MICROF1 CHRF1

DE-CS .855 .745 .964 .917 .982DE-EN .571 .655 .723 .695 .742DE-FR .782 .881 .927 .844 .915EN-CS .709 .954 .927 .927 .908EN-DE .540 .752 .741 .773 .824EN-FI .879 .818 .879 .848 .923EN-GU .709 .709 .600 .734 .709EN-KK .491 .527 .685 .636 .661EN-LT .879 .848 .970 .939 .881EN-RU .870 .848 .939 .879 .930FI-EN .788 .809 .909 .901 .875FR-DE .822 .733 .733 .764 .815GU-EN .782 .709 .855 .891 .945KK-EN .891 .844 .796 .844 .881LT-EN .818 .855 .844 .855 .833RU-EN .692 .729 .714 .780 .757ZH-EN .695 .695 .752 .676 .715Median .782 .752 .844 .844 .875Mean .751 .771 .821 .818 .841SD .124 .101 .112 .093 .095EN-ZH .606 .606 ×.424 .595 .594Wins 3 3 6 3 5

Table 9: WMT19 Metrics task: Kendall’s τ betweenmetrics and human judgments.

C UNMT and SNMT Models

The UNMT models follow XLM’s standard archi-tecture and are trained with 5 million monolin-gual sentences for each language using a vocab-ulary size of 60,000. We train SNMT models forEN↔DE and select models with the most similar(or a slightly lower) BLEU as their UNMT counter-parts on newstest2019. The DE-EN model selectedis trained with 1 million sentences of parallel dataand a vocabulary size of 64,000, and the EN-DEmodel selected is trained with 250,000 sentences ofparallel data and a vocabulary size of 48,000. ForEN↔FR and EN↔RO, we select SNMT modelsfrom submitted systems to WMT shared tasks thathave similar or slightly lower BLEU scores to corre-sponding UNMT models, based on NewsTest2014for EN↔FR and NewsTest2016 for EN↔RO.

Figure 1, which is a visualization of MACROF1for SNMT and UNMT models, shows that UNMTis generally better than SNMT on frequent types,however, SNMT outperforms UNMT on the restleading to a crossover point in MACROF1 curves.Since MACROF1 assigns relatively higher weightsto infrequent types than in BLEU, SNMT gainshigher MACROF1 than UNMT while both have ap-proximately the same BLEU, as reported in Table 6.

A complete comparison of UNMT vs SNMT indifferent languages is in Table 12. A manual analy-sis of the ten sentences with the largest magnitudefavoritism according to MACROF1 and BLEU in

1150

⋆BLEU BLEU MACROF1 MICROF1 CHRF1

DE-EN .828 .845 .917 .883 .919EN-DE .778 .750 .850 .783 .848EN-ET .868 .868 .934 .906 .949EN-FI .901 .848 .901 .879 .945EN-RU .889 .889 .944 .889 .930EN-ZH .736 .729 .685 .833 .827ET-EN .884 .900 .884 .878 .904FI-EN .944 .944 .889 .915 .957RU-EN .786 .786 .929 .857 .869ZH-EN .824 .872 .738 .780 .820EN-CS 1.000 1.000 .949 1.000 .949Median .868 .868 .901 .879 .919Mean .858 .857 .875 .873 .902SD .077 .080 .087 .062 .052TR-EN ×.200 ×.738 ×.400 ×.316 ×.632EN-TR ×.571 ×.400 .837 ×.571 .849CS-EN ×.800 ×.800 ×.600 ×.800 ×.738Wins 1 2 3 2 6

Table 10: WMT18 Metrics task: Kendall’s τ betweenmetrics and human judgments.

⋆BLEU BLEU MACROF1 MICROF1 CHRF1

DE-EN .564 .564 .734 .661 .744EN-CS .758 .751 .767 .758 .878EN-DE .714 .767 .562 .593 .720EN-FI .667 .697 .769 .718 .782EN-RU .556 .556 .778 .648 .669EN-ZH .911 .911 .600 .854 .899LV-EN .905 .714 .905 .905 .905RU-EN .778 .611 .611 .722 .800TR-EN .911 .778 .674 .733 .907ZH-EN .758 .780 .736 .824 .732Median .758 .733 .735 .728 .791Mean .752 .713 .714 .742 .804SD .132 .110 .103 .097 .088FI-EN .867 .867 ×.733 .867 .867EN-TR .857 .714 ×.571 .643 .849CS-EN ×1.000 ×1.000 ×.667 ×.667 ×.913Wins 5 4 2 2 6

Table 11: WMT17 Metrics task: Kendall’s τ betweenmetrics and human judgments.

Translation SNMT UNMT SNMT NameDE-EN NewsTest2019 32.7 33.9 Our TransformerEN-DE NewsTest2019 24.0 24.0 Our TransformerFR-EN NewsTest2014 31.1 31.2 OnlineA.0EN-FR NewsTest2014 25.6 27.1 PROMT-Rule-based.3083RO-EN NewsTest2016 30.8 29.6 Online-A.0EN-RO NewsTest2016 31.2 31.0 uedin-pbmt.4362

Table 12: SNMT systems are selected such that theirBLEU scores are approximately the same as the avail-able pretrained UNMT models.

the FR-EN and RO-EN test sets is in Table 13 andTable 14. The complete texts of these sentences,their reference translations, and the system transla-tions (including DE-EN mentioned in Sec 4), areshown in Tables 15, 16, 17, 18, 19, and 20.

the do too putinvestigation

Type

50

55

60

65

70

75

80

85

90

Mac

roF 1

Per

cent

ile

SNMT vs UNMT: FREN

SNMTUNMT

de lors ça 1 fillesType

50

55

60

65

70

75

80

85

90

Mac

roF 1

Per

cent

ile

SNMT vs UNMT: ENFR

SNMTUNMT

the those american attack virginia

Type

50

55

60

65

70

75

80

85

90

Mac

roF 1

Per

cent

ileSNMT vs UNMT: DEEN

SNMTUNMT

, seiner obersten aufgrund britische

Type

50

55

60

65

70

75

80

85

90

Mac

roF 1

Per

cent

ile

SNMT vs UNMT: ENDE

SNMTUNMT

the according state go early

Type

50

55

60

65

70

75

80

85

90

Mac

roF 1

Per

cent

ile

SNMT vs UNMT: ROEN

SNMTUNMT

Figure 1: SNMT vs UNMT MACROF1 on the mostfrequent 500 types. UNMT outperforms SNMT on fre-quent types that are weighed heavily by BLEU how-ever, SNMT is generally better than UNMT on raretypes; hence, SNMT has a higher MACROF1.

1151

δMACROF1 Fav Analysis δBLEU Fav Analysis.044 S S: synonym; U: untranslation, synonym -.026 U U: synonym; S: omitted adv, word order

-.038 U U: no issues; S: synonym .025 S S: no issues; U: determiner, word order.035 S S: synonym; U: untranslation, synonym .024 S S: no issues; U: repetition, form

-.034 U U: no issues; S: synonym; word_order .021 S S: verb, synonym; U: untranslation, noun,time, synonym

-.034 U U: synonym; S: word order, verb_ref .021 S S: synonym; U: synonym-.033 U U: no issues; S: synonym -.021 U U: omitted NER; S: synonym, word order.033 S S: word order; U: untranslation, NER, word

order-.021 U U: untranslation; S: verb, word order

.032 S S: synonym; U: number, omitted noun, un-translation, verb

-.021 U U: synonym; S: extra preposition, synonym,word order

.030 S S: adj; U: untranslation .021 S S: no issues; U: NER

.030 S S: noun, synonym; U: noun, synonym -.020 U U: synonym; S: synonym, word order

Table 13: Analysis of the ten FR-EN test set segments with the most favoritism in SNMT (S) or UNMT (U),according to MACROF1 (left) and BLEU (right). Fav is the favored system by metrics. Actual examples are shownin Appendix Tables 17 and 18.

δMACROF1 Fav Analysis δBLEU Fav Analysis.131 S S: word order; U: repetition, word order .114 S S: word order; U: repetition, word order.063 S S: noun, word order; U: repetition, untrans-

lation, noun.089 S S: no issues; U: omitted noun, omitted time,

NER.062 S S: extra, untranslation; U: untranslation,

copy-.072 U U: country, untranslation; S: noun, word

order-.052 U U: untranslation x 3, synonym; S: untrans-

lation, synonym-.045 U U: synonym; S: synonym, word order

-.052 U U: untranslation, NER, synonym; S: NER,synonym

-.041 U U: untranslation; S: word order, subject

-.052 U U: extra; S: untranslation -.040 U U: no issues; S: number, omitted preposition-.050 U U: adv; S: incoherent, adv .039 S S: extra, untranslation; U: untranslation,

copy-.050 U U: active/passive voice, name; S: name .036 S S: no issues; U: extra verb-.049 U U: untranslation; S: untranslation, word or-

der-.035 U U: repetition, untranslation, S: verb, syn-

onym, word order.048 S S: no issues; U: NER .034 S S: synonym; U: untranslation

Table 14: Analysis of the ten RO-EN test set segments with the most favoritism in SNMT (S) or UNMT (U),according to MACROF1 (left) and BLEU (right). Fav is the favored system by metrics. Actual examples are shownin Appendix Tables 19 and 20.

1152

δMACROF1 Source Reference SNMT UNMT.071 Es wird davon ausgegan-

gen, dass sie über eineleistungsstarke Kanone, eineReihe von Flugabwehr- undSchiffsabwehrraketen sowieeinige Stealth-Technologienverfügen, wie z. B. re-duzierte Radar-, Infrarot- undakustische Signaturen.

It is understood they willfeature a powerful cannon,an array of anti-aircraft andanti-ship missiles as wellas some stealth technologies,such as reduced radar, in-frared and acoustic signa-tures.

It is assumed that they havea powerful cannon, a seriesof anti-aircraft and anti-shipmissiles, as well as somesteam technologies, such asreduced radar, infrared andacoustic signatures.

It is understood they have apowerful canon, a number offluke and ship fire systemsand some stealth-controlledtechnologies, such as re-duced radar, infrarot andakustic signposts.

.064 Eine Gruppe maskierter Pro-Separatisten, die von der Bere-itschaftspolizei zurückgehal-ten wurden, bewarfen sie mitEiern und schleuderte Pulver-farbe und erzeugte in denStraßen, die normalerweisevon Touristen überfüllt waren,dunkle Staubwolken.

A group of masked pro-separatists held back by riotpolice pelted them with eggsand hurled powder paint,creating dark clouds of dustin streets that would usuallybe thronged with tourists.

A group of masked pro-separatists held hostage bythe riot police brought themto eggs and ignited pow-der paint and produced darkclouds in the streets thatwere usually crowded bytourists.

A group of masked pro-independence separforces,who were kept away bythe Bereitschaftpolice, be-headed them with evocativepaint and poured pulver-force paint and createddark Staubations in thestreets normally clogged bytourists.

-.055 Il faut bien le faire. Il faut bien le faire. La gentillesse du personnelet la disponibil.

Il faut bien faire.

.052 In einem Abschnitt gibt esein Bild eines Schlafsaals, indem die Studenten auf Kaf-feetassen, Vorhänge, Trainerund Bücher klicken, um überdie Auswirkungen von Kof-fein und Licht informiert zuwerden und darüber, wie sichdie sportliche Leistung durchSchlafmangel und die Bedeu-tung einer Schlafenszeitrou-tine beeinflusst.

In one section there is animage of a dorm room,where students click on cof-fee cups, curtains, trainersand books to be told aboutthe effects of caffeine andlight and how athletic perfor-mance is impacted by sleepdeficiency, and the impor-tance of a bedtime routine.

In a section there is an im-age of a bedroom where stu-dents click on coffee cups,curtains, trainers, and booksto be informed about the ef-fects of caffeine and light,and about how sporting per-formance is affected by lackof sleep and the importanceof sleeping time routine.

In one section, there is apicture of a sleeping saunawhere students click on cof-fee cups, forecourts, coachesand books to be educatedabout the impact of Koffeinand light and about how ath-letic performance is influ-enced by sleep loss and theimportance of a sleep dayroutine.

-.045 Nickelbergbau ist auch inder Provinz wichtig, wirdaber hauptsächlich in Mo-rowali betrieben, an dergegenüberliegenden Küstevon Sulawesi.

Nickel mining is also impor-tant in the province, but ismostly concentrated in Mo-rowali, on the opposite coastof Sulawesi.

Nickelergbau is also impor-tant in the province, but ismainly operated in More-wali, on the opposite coastof Sulawesi.

Nickel mining is also im-portant in the province, butis mostly operated in Mo-rowali, on the opposite coastof Sulawesi.

.044 Vor 32 Jahren schloss ichmich als Schüler, wegender Vernachlässigung derThatcher-Regierung, Labouran. Diese Vernachlässigunghatte dazu geführt, dassmein Klassenzimmer buch-stäblich zusammengebrochenwar. Infolgedessen habe ichversucht, mich für bessereöffentliche Dienstleistungenfür diejenigen einzusetzen,die sie am meisten brauchen.Egal ob als Gemeinderat oderMinister.

Ever since I joined Labour32 years ago as a schoolpupil, provoked by theThatcher government’sneglect that had left mycomprehensive school class-room literally falling down,I’ve sought to championbetter public services forthose who need them most -whether as a local councilloror government minister.

32 years ago, I joinedLabour as a student be-cause of the neglect of theThatcher government, whichhad led to my classroom lit-erally collapsed, and as a re-sult I tried to promote bet-ter public services for thosewho need it most, whether asa local council or ministers.

Last 32 years ago, as a stu-dent, because of the disdainfor the Thatcher-era govern-ment, Labour joined Labour.

.044 UN-Gesandter Staffan deMistura hofft, bald die er-sten Treffen eines neuenAusschusses aus Regierungs-und Oppositionsmitgliederneinzuberufen, um eineNachkriegsverfassung fürSyrien zu entwerfen und denWeg zu Wahlen zu ebnen.

UN envoy Staffan de Mis-tura is hoping to soon con-vene the first meetings ofa new committee comprisedof government and opposi-tion members to draft a post-war constitution for Syriaand pave the way to elec-tions.

UN envoy Staffan de Mis-tura hopes to convene soonthe first meetings of a newcommittee of governmentand opposition members todraw up a post-war constitu-tion for Syria and pave theway for elections.

U.N. Secretary GeneralStaffan de Mistura hopesto soon join the first meet-ings of a new committeeof government and oppo-sition leaders to designa Nachkriegsrewrite forSyria and clear the path toelections.

.043 CBS hatte 3,1 Millio-nen, NBC 2,94 Millionen,MSNBC 2,89 Millionen undCNN 2,52 Millionen, soNielsen.

CBS had 3.1 million, NBChad 2.94 million, MSNBChad 2.89 million and CNNhad 2.52 million, Nielsensaid.

CBS had 3.1 million, NBC2.94 million, MSNBC 2.89million and CNN 2.52 mil-lion, says Nielsen.

CBS had 3.8 million, NBC3.94 million, MSNBC 3.89million and CNN 3.52 mil-lion, Nielsen said.

-.041 Den Rangers gelangen nurzwei Schüsse in der erstenHälfte, aber der ehemaligeIbrox-Torhüter Liam Kellywar kaum von LassanaCoulibalys Kopfsprung unddem Treffer eines bisslosenOvie Ejaria aus der Ruhe zubringen.

Rangers managed just twofirst-half shots on targetbut former Ibrox goalkeeperLiam Kelly was barely trou-bled by Lassana Coulibaly’sheader and a tame OvieEjaria strike.

The Ranners only reach twoshots in the first half, butthe former Ibrox-TorkeeperLiam Kelly was hardly thehead of Lassanna Coulibysand the hit of a bissloze OviEjaria.

The Rangers managed onlytwo shots in the first halfbut former Ibrox goalkeeperLiam Kelly was unlikelyto be helped by LassanaCoulibaly’s headfirst tackleand the goal of a bisectedOvie Ejaria.

.041 Liverpool tritt am MIttwochum 15.00 Uhr im StadioSan Paolo in Neapel, Italien,gegen Napoli an.

Liverpool battles Napoli inthe group stage of the Cham-pions League at 3 p.m. onWednesday at Stadio SanPaolo in Naples, Italy.

Liverpool will take placeat 3 p.m. at the StadioSan Paolo in Naples, Italy,against Napoli.

Liverpool v Napoli at theMItch Stadium at 15.00 pmin Neapel, Italy, on MItch.

Table 15: Top 10 segments by ∣δMACROF1(i,hS,hU)∣ on DE-EN.

1153

δBLEU Source Reference SNMT UNMT.048 In der letzten Woche wurden mittlere

Konzentrationen in Küstennähe undauf offener See in Pinellas Countygemeldet, geringe bis hohe Konzen-trationen auf offener See in Hillsbor-ough County, Hintergrund- bis hoheKonzentrationen in Manatee County,Hintergrund- bis hohe Konzentratio-nen in Küstennähe und auf offenerSee in Sarasota County, Hintergrund-bis mittlere Konzentrationen in Char-lotte County, Hintergrund- bis hoheKonzentrationen in Küstennähe undauf hoher See in Lee County sowiegeringe Konzentrationen in CollierCounty.

Medium concentrations in oroffshore of Pinellas Countyhave been reported in thepast week, low to highconcentrations offshore ofHillsborough County, back-ground to high concentra-tions in Manatee County,background to high concen-trations in or offshore ofSarasota County, backgroundto medium concentrationsin Charlotte County, back-ground to high concentra-tions in or offshore of LeeCounty, and low concentra-tions in Collier County.

Last week, average concen-trations were reported on thecoast and open seas in Pinel-las County, low to high lev-els at open sea in Hillsbor-ough County, background tohigh concentrations in Man-tee County, high concentra-tions in coastal and openseas in Sarsota County, back-ground to medium concen-trations in Charlotte County,background to high shore andhigh sea levels in Lee County,and low concentrations inCollier County.

In the last week, moder-ate to high Konzentrof leadin Küstas County were re-ported in Pinellas County,low to high Konzentrof leadlevels on open water inHillsborough County, Hinter-grundto high levels in Man-atee County, Hintergrundtohigh to high Konzentrin Küs-tas and on open water in Sara-sota County and low Konzen-trationen in Charlotte County,Hintergrundto high to highKonzentrin Küstennähe andon open water in SarasotaCounty.

.046 Moskau hat wiederholt betont, dassdie 11-Milliarden-Dollar-PipelineNord Stream 2, die die bestehendePipeline-Kapazität auf 110 Milliar-den Kubikmeter verdoppeln soll, einrein wirtschaftliches Projekt ist.

Moscow has repeatedlystressed that the $11 billionNord Stream 2 pipeline,which is set to double theexisting pipeline capacity to110 billion cubic meters, is apurely economic project.

Moscow has repeatedlystressed that the $11 billionNord Stream 2 pipeline,which is supposed to doublethe existing pipeline capacityto 110 billion cubic metres,is a purely economic project.

Moscow has repeatedlyinsisted that the 11-billionpipeline, Nord Stream 2,which will double the exist-ing Pipeline-capacity to 110billion cubic feet, is a purelycommercial project.

.044 Der NTS, der für die Betreuung vonmehr als 270 historischen Gebäu-den, 38 wichtigen Gärten und 76.000Hektar Land rund um das Land ve-rantwortlich ist, nimmt die Fleder-mäuse sehr ernst.

The NTS, which is respon-sible for the care of morethan 270 historical build-ings, 38 important gardensand 76,000 hectares of landaround the country, takes batsvery seriously.

The NTS, which is responsi-ble for the care of more than270 historic buildings, 38 im-portant gardens and 76,000hectares of land around thecountry, takes the bats veryseriously.

The NTS, responsible formanaging more than 270 his-toric buildings, 38 key gar-dens and 74,000 acres of landaround the country, said theFledermäuse are very impor-tant.

.042 George W. Bush telefonierte mit Sen-atoren, um diese zu überreden, HerrnKavanaugh zu unterstützen, der imWeißen Haus für Herrn Bush gear-beitet hatte und durch ihn seineFrau Ashley traf, die die persönlicheSekretärin von Herrn Bush war.

George W. Bush has beenpicking up the phone to callSenators, lobbying them tosupport Mr Kavanaugh, whoworked in the White Housefor Mr Bush and through himmet his wife Ashley, whowas Mr Bush’s personal sec-retary.

George W. Bush contactedsenators to persuade them tosupport Mr Kavanaugh, whoworked in the White Housefor Mr Bush and met his wifeAshley, who was Mr Bush’spersonal secretary.

George W. Bush spoke to sen-ators to help him overtureto support Mr. Kavanaugh,who had worked in the WhiteHouse for Mr. Bush and metthrough him his wife, Ashley,who was the personal secre-tary to Mr. Bush.

-.039 Eine Woche nachdem eine of-fizielle chinesische Zeitung einevierseitige Anzeige in einer US-amerikanischen Tageszeitung aufden gegenseitigen Nutzen des US-China-Handels gestellt hatte, warfder US-amerikanische Botschafterin China Peking vor, die amerikanis-che Presse zur Verbreitung vonPropaganda zu verwenden.

A week after an official Chi-nese newspaper ran a four-page ad in a U.S. daily tout-ing the mutual benefits ofU.S.-China trade, the U.S.ambassador to China accusedBeijing of using the Amer-ican press to spread propa-ganda.

A week after an official Chi-nese newspaper published afour-page display in a USdaily on the mutual benefit ofUS China trade, the US am-bassador to China publishedin Beijing to use the Ameri-can press for propaganda.

A week after an official Chi-nese newspaper published afour-page ad on the mutualbenefit of the US-China trade,the U.S. ambassador to Chinaaccused Beijing of using theAmerican press to spread pro-paganda.

-.037 Sie kümmern sich nicht darum, wensie verletzen, wen sie überfahrenmüssen, um Macht und Kontrolle zubekommen, das ist, was sie wollen,Macht und Kontrolle, wir werden esihnen nicht geben.

They don’t care who theyhurt, who they have to runover in order to get powerand control, that’s what theywant is power and control,we’re not going to give it tothem."

They do not care about whothey hurt whom they mustpass over to gain power andcontrol, that is what theywant, power and control, wewill not give them.

They don’t care who theyhurt, who they have to pass toget power and control, that’swhat they want, power andcontrol, we won’t give it tothem.

-.034 Mayorga behauptet, Ronaldo seinach dem angeblichen Vorfall aufdie Knie gefallen und habe ihrgesagt, er sei „zu 99 Prozent“ ein„guter Kerl“, der von den „einProzent“ im Stich gelassen wurde.

Mayorga claims Ronaldo fellto his knees after the allegedincident and told her he was"99 percent" a "good guy" letdown by the "one percent."

Mayorga claims that Ronaldofell to the knees after the al-leged incident, saying that hewas “99% ” a“ good guy ”left in the lurch by the “onepercent ”.

Mayorga claims Ronaldo fellon his knee after the allegedincident and told her he was"to 99 percent" a "good guy"who was left in the dark bythe "one percent."

-.032 Palin, 29, aus Wasilla, Alaska,wurde wegen des Verdachts auf häus-liche Gewalt verhaftet. Gegen ihnliegt bereits ein Bericht über häus-liche Gewalt und Widerstand bei derFestnahme vor, so eine Meldung, dieam Samstag von den Alaska StateTroopers veröffentlicht wurde.

Palin, 29, of Wasilla, Alaska,was arrested on suspicion ofdomestic violence, interfer-ing with a report of domes-tic violence and resisting ar-rest, according to a reportreleased Saturday by AlaskaState Troopers.

Palin, 29, from Wasilla,Alaska, was arrested foralleged domestic violence,and a report on domesticviolence and oppositionto arrest has already beenpublished on Saturday by theAlaska State Trooperator.

Palin, 29, of Wasilla, Alaska,was arrested on charges of do-mestic violence. – Againsthim, a report of domestic vi-olence and resistance in ar-rest was already released Sat-urday, according to a reportreleased Saturday by AlaskaState Troopers.

-.032 "Ich habe [...] nicht verstecktFords Behauptungen, ich habe ihreGeschichte nicht geleakt“, erzählteFeinstein dem Komitee, berichteteThe Hill.

"I did not hide Dr. Ford’sallegations, I did not leakher story," Feinstein toldthe committee, The Hill re-ported.

"I have [...] not hidden Ford’sclaims that I have not livedtheir history," told Finesteinthe committee, reported TheHill.

"I did not hide [Forman’sclaims, I didn’t geleasther story," Feinstein toldthe committee, The Hillreported.

-.031 Briefings werden immer noch stat-tfinden, sagte Sanders, aber solltedie Presse die Chance haben, demPräsidenten der Vereinigten Staatendie Fragen direkt zu stellen, so seidas unendlich besser, als mit ihr zusprechen.

Briefings will still happen,Sanders said, but "if the presshas the chance to ask thepresident of the United Statesquestions directly, that’s in-finitely better than talking tome.

briefing is still going to takeplace, Sanders said, but thepress should have the oppor-tunity to put the questions di-rectly to the President of theUnited States, if that is in-finitely better than to talk toher.

Briefings will still take place,Sanders said, but if the presshas the chance to ask the pres-ident of the United States di-rectly, so that is unendlichbetter than talking to her.

Table 16: Top 10 segments by ∣δBLEU(i,hS,hU)∣ on DE-EN.

1154

δMACROF1 Source Reference SNMT UNMT.044 Il ne fallait qu’en déployer les ac-

cidents, et l’affaire, jacobinismeoblige, était confiée aux préfets etaux sous-préfets, interprètes au-torisés.

All it took was to highlight itsmistakes and, in keeping withJacobinism, the issue wouldbe entrusted to prefects andsub-prefects - the authorisedinterpreters.

It should deploy the accidents,and the case, Jacobinismobliges, was entrusted to theprefects and the sub-prefects,authorized interpreters.

It only took to deploy the ac-cidents, and the matter, jacobi-nite oblige, was handed to thepréfets and the sous-préfets,authorized interprètes.

-.038 Les spécialistes disent que lespersonnes sont systématique-ment contraintes à faire leursaveux, malgré un changementdans la loi qui a été voté plustôt dans l’année interdisant auxautorités de forcer quiconque às’incriminer lui-même.

Experts say confessions arestill routinely coerced, despitea change in the law earlierthis year banning the authori-ties from forcing anyone to in-criminate themselves.

The experts say that peopleare systematically forced tomake their confessions, de-spite a change in the lawwhich was passed earlier thisyear prohibiting the authori-ties to force anyone to incrim-inating himself.

Experts say people are rou-tinely forced to make theirconfessions, despite a changein the law that was passed ear-lier in the year banning offi-cials from forcing anyone toincriminate themselves.

.035 Ils sont intersexués,l’intersexualité faisant partiedu groupe de la soixantaine demaladies diagnostiquées commedésordres du développementsexuel, un terme génériquedésignant les personnes possé-dant des chromosomes ou desgonades (ovaires ou testicules)atypiques ou des organes sexuelsanormalement développés.

They are intersex, part of agroup of about 60 conditionsthat fall under the diagno-sis of disorders of sexual de-velopment, an umbrella termfor those with atypical chro-mosomes, gonads (ovaries ortestes), or unusually deve-loped genitalia.

They are intersex, intersexforming part of the Group of60 diseases diagnosed as dis-orders of sexual development,a generic term for people withchromosomes or atypical go-nads (ovaries or testes) or ab-normally developed sexual or-gans.

They are intersexuzed, withintersexuality making up thegroup of the soixantaine ofdiseases diagnosed as disor-dered sexual development, ageneric term dissignant peo-ple possessing chromosomesor gonades (ovaries or testic-ules) atypiques or anormallydeveloped sexual organs.

-.034 Ces violences sont de plusen plus meurtrières en dépitde mesures de sécurité renfor-cées et d’opérations militairesd’envergure lancées depuis desmois par le gouvernement deNouri Al Maliki, dominé par leschiites.

The violence is becomingmore and more deadly in spiteof reinforced security mea-sures and large-scale militaryoperations undertaken in re-cent months by Nouri Al Ma-liki’s government, which isdominated by Shiites.

Such violence are more lethaldespite measures enhanced se-curity and large-scale militaryoperations launched by theGovernment of Nouri Al Ma-liki, the Shia-dominated formonths.

Those violence is increasinglydeadly in the face of increasedsecurity measures and majormilitary operations launchedfor months by Nouri Al Ma-liki’s government, dominatedby Shiites.

-.034 Du côté du gouvernement, onestime que 29 954 membresdes forces armées du prési-dent Bachar el-Assad ont trouvéla mort, dont 18 678 étaientdes combattants des forces pro-gouvernementales et 187 des mil-itants du Hezbollah libanais.

On the government side, itsaid 29,954 are members ofPresident Bashar Assad’sarmed forces, 18,678 arepro-government fighters and187 are Lebanese Hezbollahmilitants.

On the side of the Govern-ment, it is estimated that 29954 members of the armedforces of president Bachar Al-Assad died, whose 18 678were 187 Lebanese Hezbollahmilitants and fighters of thepro-Government forces.

On the government side, oneestimate says 29,954 mem-bers of President Bachar al-Assad’s armed forces havefound their way, including18,678 were fighters from pro-government forces and 187from Lebanese Hezbollah mil-itants.

-.033 Mercredi, le Centre américain decontrôle et de prévention des mal-adies a publié une série de di-rectives indiquant comment gérerles allergies alimentaires des en-fants à l’école.

On Wednesday, the Centersfor Disease Control and Pre-vention released a set of guide-lines to manage children’sfood allergies at school.

Wednesday, the US Centre ofdisease prevention and controlissued a set of guidelines in-dicating how to manage foodallergies of children at theschool.

Wednesday, the U.S. Centersfor Disease Control and Pre-vention issued a series of di-rectives indicating how to han-dle children’s food allergies atschool.

.033 N’est-il pas surprenant de liredans les colonnes du Monde àquelques semaines d’intervalled’une part la reproduction dela correspondance diplomatiqueaméricaine et d’autre part unecondamnation des écoutes duQuai d’Orsay par la NSA ?

And is it not surprising to readin the pages of Le Monde,on the one hand, a reproduc-tion of diplomatic correspon-dence with the US and, onthe other, condemnation of theNSA’s spying on the Ministryof Foreign Affairs on the Quaid’Orsay, within a matter ofweeks?

Is it not surprising to readin the columns of the worlda few weeks apart on theone hand the reproductionof American diplomatic cor-respondence and on the otherhand a condemnation of theQuai d’Orsay by the NSA lis-tens?

Isn’t it surprising to read in theTimes’ pages just weeks apartof one side’s reproduction ofthe American diplomatic cor-respondance and of another acondamnation of the Quai d’Orsay’s écoutes by the NSA?

.032 Les ministères appellent àprésent les personnes qui au-raient été mordues, griffées,égratignées, ou léchées sur unemuqueuse ou sur une peau léséepar ce chaton ou dont l’animalaurait été en contact avec cechaton entre le 8 et le 28 octobreà contacter le 08.11.00.06.95entre 10 heures et 18 heures àpartir du 1er novembre.

The ministries are currentlyasking anyone who mighthave been bitten, clawed,scratched or licked on a mu-cous membrane or on dam-aged skin by the kitten, or whoown an animal that may havebeen in contact with the kit-ten between 08 to 28 October,to contact them on 08 11 0006 95 between 10am and 6pmfrom 01 November.

Departments now call peo-ple who have been bitten,scratched, scratched or lickedon mucous membranes or skininjured by this kitten or wherethe animal would have beenin contact with this kitten be-tween 8 and 28 October tocontact the 08.11.00.06.95 be-tween 10 a.m. and 6 p.m.from November 1.

The at present call on peoplewho may have been morbid,griffon, egregious or layed ona mug or on a skin léchéby this chateau or whose an-imal may have been in con-tact with that chateau between8 and 28 October to contact08.11.00.095 between 10 and18 November.

.030 A cette IIIe République, momentcentral et créateur, Pierre Noraa montré beaucoup d’intérêt etmême de tendresse: saluant ceuxqui se sont alors employés à ré-parer la fracture révolutionnaire,en enseignant aux écoliers toutce qui dans l’ancienne Francepréparait obscurément la Francemoderne et en leur proposant uneversion unifiée de leur histoire.

Pierre Nora has shown hasshown great interest and eventenderness for this Third Re-public: he salutes those whotried at the time to repair thedivide created by the Rev-olution by teaching studentsabout everything in the for-mer France that obscurelypaved the way for the modernFrance, and by offering thema unified version of their his-tory.

This third Republic, whilecentral and creator, PierreNora has shown great interestand even tenderness: salutingthose who then worked to re-pair the revolutionary divide,by teaching students what inthe former France preparingdarkly modern France and of-fering them a version unifiedin their history.

At this IIIe République,central and creator moment,Pierre Nora showed muchinterest and even tendresse:praising those who thenhelped to repair the revolu-tionary fracture, teachingschoolchildren everything inthe former French Republicthat obscurantly preparedmodern France and offeringthem a unifying version oftheir history.

.030 La théorie dominante sur la façonde traiter les enfants pourvusd’organes sexuels ambigus a étélancée par le Dr John Money, del’université Johns-Hopkins, quiconsidérait que le genre est mal-léable.

The prevailing theory on howto treat children with ambigu-ous genitalia was put forwardby Dr. John Money at JohnsHopkins University, who heldthat gender was malleable.

The prevailing theory abouthow to treat children withambiguous sexual organs waslaunched by Dr. John Moneyof the Johns Hopkins Univer-sity, who considered that thegenre is malleable.

The dominant theory abouthow to treat children armedwith ambigüous sex organswas launched by Dr. JohnMoney, of Johns-HopkinsUniversity, who consideredthe genre maudlin.

Table 17: Top 10 segments by ∣δMACROF1(i,hS,hU)∣ on FR-EN.

1155

δBLEU Source Reference SNMT UNMT-.026 Mais le représentant Bill Shuster

(R-Pa.), président du Comitédes transports de la Chambredes représentants, a déclaréqu’il le considérait aussi commel’alternative la plus viable à longterme.

But Rep. Bill Shuster (R-Pa.),chairman of the House Trans-portation Committee, has saidhe, too, sees it as the most vi-able long-term alternative.

But Congressman Bill Shus-ter (R - PA.), Chairmanof the House of representa-tives Transportation Commit-tee, said that he considered asthe most viable alternative inthe long term.

But Rep. Bill Shuster (R-Pa.),chairman of the House Trans-portation Committee, said healso considered it the most vi-able long-term alternative.

.025 Les neuf premiers épisodes deSheriff Callie’s Wild West serontdisponibles à partir du 24 novem-bre sur le site watchdisneyju-nior.com ou via son applicationpour téléphones et tablettes.

The first nine episodes of Sher-iff Callie’s Wild West willbe available from November24 on the site watchdisneyju-nior.com or via its applicationfor mobile phones and tablets.

The first nine episodes of Sher-iff Callie’s Wild West willbe available from November24 on the site watchdisneyju-nior.com or via its applicationfor phones and tablets.

Sheriff first nine episodes ofSheriff Callie’ s Wild Westwill be available as of Novem-ber 24 on the watchdisneyju-nior.com website or via its ap-plication for phones and com-puters.

.024 Le président Xi Jinping, qui apris ses fonctions en mars dernier,a fait de la lutte contre la cor-ruption une priorité nationale, es-timant que le phénomène consti-tuait une menace à l’existence-même du Parti communiste.

President Xi Jinping, whotook office last March, hasmade the fight against corrup-tion a national priority, believ-ing that the phenomenon is athreat to the very existence ofthe Communist Party.

President Xi Jinping, whotook office in March, hasmade the fight against corrup-tion a national priority, be-lieving that the phenomenonposed a threat to the very exis-tence of the Communist Party.

President Xi Jinping, whotook office in March, hasmade fighting corruption a na-tional priority, saying the phe-nomenon posed a threat to theCommunist Party’s existence-free existence.

.021 Un peu plus tôt, sur la routemenant à Bunagana, poste-frontière avec l’Ouganda,des militaires aidés de civilschargeaient un lance-roquettesmultiple monté sur un camionflambant neuf des FARDC, de-vant assurer la relève d’un autreengin pilonnant les positions duM23 sur les collines.

A little earlier, on the roadto Bunagana, the frontier postwith Uganda, soldiers assistedby civilians loaded up a mul-tiple rocket launcher mountedon a brand new truck belong-ing to the FARDC, intended totake over from another devicepounding the positions of theM23 in the hills.

Earlier, on the road leadingto Bunagana border post withUganda, soldiers helped civil-ians loaded a multiple rocketlauncher mounted on a truckbrand new FARDC, to ensuresuccession of another enginepounding the positions of theM23 in the hills.

A day earlier, on the roadleading to Bunagana, a post-code with Uganda, militarypersonnel aided by civilianswere loading a multiple rocketlance-roquettes fire on a flam-bant neuf FARDC truck, ex-pected to provide the lead foranother device pilfering M23positions on the hills.

.021 Il y a, avec la crémation, "une vi-olence faite au corps aimé", quiva être "réduit à un tas de cen-dres" en très peu de temps, etnon après un processus de décom-position, qui "accompagnerait lesphases du deuil".

With cremation, there is asense of "violence committedagainst the body of a lovedone", which will be "reducedto a pile of ashes" in a veryshort time instead of after aprocess of decomposition that"would accompany the stagesof grief".

There, with the cremation, "aviolence made to the belovedbody", which will be "reducedto a pile of ashes" in a veryshort time, and not after a pro-cess of decomposition, which"would accompany the phasesof mourning".

There is, with cremation, "vio-lence done to the loved one,"which is going to be "reducedto a tas of ashes" in very lit-tle time, and not after a pro-cess of disablement, whichwould "accompany the phasesof grief."

.021 Scott Brown, le capitaine duCeltic Glasgow, a vu son appelrejeté et sera bien suspendu pourles deux prochains matches deLigue des champions de son club,contre l’Ajax et l’AC Milan.

Scott Brown, Glasgow Celticcaptain, has had his appeal re-jected and will miss his club’snext two Champion’s Leaguematches, against Ajax and ACMilan.

Scott Brown, the captain ofthe Glasgow Celtic, saw hisappeal dismissed and will bewell suspended for the nexttwo matches of the championsLeague for his club againstAjax and AC Milan.

Scott Brown, the Celtic cap-tain, has had his appeal re-jected and will be well sus-pended for his club’s next twoChampions League matches,against Ajax and AC Milan.

-.021 Les irréductibles du M23, soitquelques centaines de combat-tants, étaient retranchés à prèsde 2000 mètres d’altitude surles collines agricoles de Chanzu,Runyonyi et Mbuzi, proches deBunagana et Jomba, deux local-ités situées à environ 80 km aunord de Goma, la capitale de laprovince du Nord-Kivu.

The diehards of the M23,who are several hundreds innumber, had entrenched them-selves at an altitude of almost2,000 metres in the farmlandhills of Chanzu, Runyonyiand Mbuzi, close to Buna-gana and Jomba, two townslocated around 80km north ofGoma, the capital of NorthKivu province.

The irreducible m23, or a fewhundred fighters, were cut offto nearly 2000 metres abovesea level on the hills agricul-tural Chanzu, Runyonyi andMbuzi, near Bunagana andJomba, located about 80 kmnorth of Goma, the capital ofthe province of North Kivu.

The irréductibles M23, orsome hundred fighters, wereretranchés at nearly 2000 feetof altitude on the agriculturalhills of Chanzu, Runyonyiand Mbuzi, close to Buna-gana and Jomba, two townslocated about 80 miles northof Goma, the capital of NorthKivu province.

-.021 Il a indiqué que le nouveau tri-bunal des médias « sera toujourspartial car il s’agit d’un prolonge-ment du gouvernement » et queles restrictions relatives au con-tenu et à la publicité nuiraient à laplace du Kenya dans l’économiemondiale.

He said the new media tri-bunal "will always be bi-ased because it’s an exten-sion of the government," andthat restrictions on contentand advertising would damageKenya’s place in the globaleconomy.

He said as the new media tri-bunal ’ will be always par-tial because it is an exten-sion of the Government "andcontent and advertising restric-tions hurt instead of the Kenyainto the world economy.

He said the new media tri-bunal "will always be partialbecause it is a extension ofthe government" and that re-strictions relating to contentand advertising would hurtKenya’s place in the globaleconomy.

.021 Dans "Les Fous de Benghazi",il avait été le premier à révélerl’existence d’un centre de com-mandement secret de la CIA danscette ville, berceau de la révoltelibyenne.

In "Les Fous de Benghazi", hewas the first to reveal the ex-istence of a secret CIA com-mand centre in the city, thecradle of the Libyan revolt.

In "Les Fous de Benghazi", hewas the first to reveal the ex-istence of a secret CIA com-mand center in this town, cra-dle of the Libyan revolt.

In "The Facts of Libya," hehad been the first to reveal theexistence of a secret CIA com-mand center in that city, thebirthplace of the Libyan upris-ing.

-.020 Le Sénat américain a approuvéun projet pilote de 90 M$ l’annéedernière qui aurait porté sur envi-ron 10 000 voitures.

The U.S. Senate approved a$90-million pilot project lastyear that would have involvedabout 10,000 cars.

The US Senate has approveda pilot project of 90 M$ lastyear which would have cov-ered about 10,000 cars.

The U.S. Senate approved a$90 million pilot project lastyear that would have focusedon about 10,000 cars.

Table 18: Top 10 segments by ∣δBLEU(i,hS,hU)∣ on FR-EN.

1156

δMACROF1 Source Reference SNMT UNMT.131 David Grimal are o cariera interna-

tionala de violonist solo, de-a lun-gul careia a sustinut regulat con-certe în ultimii 20 de ani pe princi-palele scene de muzica clasica alelumii s, i cu orchestre prestigioasecum ar fi Orchestre de Paris,Orchestre Philharmonique de Ra-dio France, Russian National Or-chestra, Orchestre National deLyon, Chamber Orchestra of Eu-rope, Berliner Symphoniker, NewJapan Philharmonic, Orchestre del’Opera de Lyon, Mozarteum Or-chestra Salzburg, Jerusalem Sym-phony Orchestra s, i Sinfonia Varso-via, sub conducerea unor diri-jori precum Christoph Eschen-bach, Michel Plasson, MichaelSchnwandt, Peter Csaba, Hein-rich Schiff, Lawrence Foster, Em-manuel Krivine, Mikhail Pletnev,Rafael Fruhbeck de Burgos s, i Pe-ter Eotvos, Andris Nelsons, Chris-tian Arming.

David Grimal has an internationalcareer as a solo violinist. Dur-ing the last 20 years he held reg-ular concerts on the main clas-sical music stages of the worldwith prestigious orchestras such asthe Orchestre de Paris, OrchestrePhilharmonique de Radio France,Russian National Orchestra, Or-chestre National de Lyon, Cham-ber Orchestra of Europe, BerlinerSymphoniker, New Japan Phil-harmonic, Orchestre de l’Operade Lyon, Salzburg MozarteumOrchestra, Jerusalem SymphonyOrchestra and Sinfonia Varsoviaunder the direction of conduc-tors such as Christoph Eschen-bach, Michel Plasson, MichaelSchnwandt, Peter Csaba, Hein-rich Schiff, Lawrence Foster, Em-manuel Krivine, Mikhail Pletnev,Rafael Fruhbeck de Burgos andPeter Eotvos, Andris Nelsons,Christian Arming.

David Grimal has an internationalcareer as a solo violinist, alongwhich claimed concerts regularlyin the past 20 years in key scenesof classical music of the world andwith prestigious orchestras suchas the Orchestre de Paris, Or-chestre Philharmonique de RadioFrance, Russian National Orches-tra, Orchestre National de Lyon,the Chamber Orchestra of Eu-rope, the Berliner Symphoniker,the New Japan Philharmonic, Or-chestre de l’Opera de Lyon TheSalzburg Mozarteum Orchestra,the Jerusalem Symphony Orches-tra, and Sinfonia Varsovia, un-der the direction of conductorssuch as Christoph Eschenbach,Michel Plasson, Michael Schn-wandt, Peter Csaba, HeinrichSchiff, Lawrence Foster, Em-manuel Krivine, Mikhail Pletnev,Rafael Fruhbeck de Burgos, andPeter Eotvos, Andris Nelsons,Christian Arming.

David Grimal has an internationalsolo violinist, Peter Csaba, HeinrichSchiff, Lawrence Foster, EmmanuelKrivine, Mikhail Pletnev, MikhailFruyhbeck Orchestra of Burgos andPeter Eotvos, Andris Philharmoniquede Radio France, Russian NationalOrchestra, the French National Or-chestra of Lyon, the Berliner Sym-phoniker, the New Japan Philharmonic,the Orchestre de l’Opera of Lyon, theMozarteum Orchestra of Lyon, theJerusalem Symphony Orchestra andSinfonia of Paris, under the directionof conductor like Orchestra: the Na-tional Orchestra, the National Orches-tra, the French Chamber Orchestra, theBerliner Symphoniology Orchestra andSinfonia Sinfonia and Sinfonia Sin-fonia Sinfonia, the Orchestra of theParis Symphony Orchestra, the RussianNational Orchestra, the National Or-chestra, the National Opera of Europe,the Berliner Symphoniker, the BerlinerSymphoniker Orchestra, the Orchestraof Lyon, the Mozarteum and SinfoniaSinfonia Sinfonia.

.063 O parte importanta a magazinuluieste dedicata raionului urias,de congelatoare, unde vet,i gasifrunze de curry, pepene s, i ghimbircongelate, rat,e întregi, pes, te,sânge s, i bila de vita, carcase deporc, chiftele de pes, te, cârnat,itradit,ionali, semi-preparate s, imulte altele.

A good portion of the storeis given over to the massivefreezer section, where you’ll findfrozen curry leaves, bitter melonand galangal, whole ducks, fish,beef blood and bile, pork cas-ings, fish balls, regional sausages,commercially-prepared foods andmore.

An important part of the storeis devoted to giant rayon offreezers, where you’ll find curryleaves, melon and Ginger frozenwhole fish, ducks, beef bile andblood, carcasses of pork, fish balls,sausages, traditional dishes andmore.

A significant part of the store is devotedto the huge frozen fried chicken county,where you will find curry leaves, peachand frozen ghrelish, whole fish, fish,blood and bone stock, pork carcass, tra-ditional fish carnouts, semi-cooked car-nacies and many others.

.062 Persoanele interesate vor putea saafle cum sa realizeze creat,ii sculp-turale florale armonioase (ike-bana), cum sa surprinda sufle-tul elementelor înconjuratoare încompozit,ii plastice folosind artajaponeza a pictarii în tus, (Sumie)sau cum sa îs, i exprime propriaindividualitate s, i creativitate prinuniversul plurivalent al artelor fru-moase (grafic, desen, pictura) , adeclarat Sorin Mazilu, profesorulacestor cursuri.

Those interested will find out howto make harmonious sculptural flo-ral creations (ikebana), how tocapture the soul of the surround-ing elements in plastic compo-sitions using the Japanese inkpainting art (Sumie) or how toexpress their own individualityand creativity through the mul-tifaceted universe of Fine Arts(graphic, drawing, painting) saidSorin Mazilu, the teacher of thesecourses.

Persons interested will be able tolearn how to achieve harmonioussculptural floral design (ikebana),how to capture the soul of the sur-rounding elements in plastic com-positions using the Japanese artof painting in India ink (Sumie)or how to express their own in-dividuality and creativity throughthe universe plurivalent of finearts (painting, drawing, graphics),said Sorin Mazilu, teacher of suchcourses.

People interested will be able to learnhow to complete armonious creativecontrasting (grafic, drawing, painting)Creations, how to capture the hearts ofsurrounding elements in artistic com-positions using the Japanese art ofpainting in tus (Sumie) or how to ex-press their own individualistic iden-tity through the plurivalent universe ofbeautiful arts (grafic, drawing, painting,painting), said Mr. Mazilu, the profes-sor of these courses.

-.052 Umorul îmbrat,is, arilor frecventeale lui Joey s, i Chandler, mo-mentele în care se uita la fotbalîn fotoliile comode s, i pasiunea luiRoss pentru Rachel au venit dinfaptul ca se s, tie ca barbat,ii se potasocia cu aceasta situat,ie, chiardaca adesea se feresc sa îs, i ex-ploreze complet sentimentele.

The humour of Joey and Chan-dler’s frequent hugs, momentswatching football on the comfychairs, and Ross’s pining forRachel, came from the knowledgethat yes, men can all relate to this,even if they often hold back fromfully exploring their feelings.

His humor frequently askedîmbrat,is, arilor of Joey and Chan-dler, the times when looking atfootball in comfortable armchairsand Ross’s passion for Rachelcame from the fact that it is knownthat men may be associated withthis case, even if often keep out toexplore fully the sentiments.

Ummaline frequent imbratisations ofJoey and Chandler, the moments whenhe looks at football in his comfy fo-toland Ross’s passion for Rachel camefrom knowing that men can assimilatewith the situation, even though oftenthey are shy about fully exploring theirfeelings.

-.052 În plus, zilele urmatoare vom facerecept,ia lucrarilor de la pasarelacare leaga UPU de clinicile decardiologie, hepatologie s, i gas-troenterologie pentru a asiguratransportul în condit,ii decente abolnavului din UPU în clinicilevecine, a mai explicat managerulunitat,ii medicale.

In addition, in the coming dayswe will accept the works fromthe bridge linking the ER to thecardiology, hepatology and gas-troenterology clinics to ensure theproper transfer of patients fromthe ER to the neighbouring clinicsexplained the manager of the med-ical unit.

In addition, the coming days wewill make the reception of theworks from the footbridge link-ing the UPU of Cardiology clin-ics, Hepatology and gastroenterol-ogy to ensure decent conditions oftransportation of the patient fromthe UPU in neighboring clinics,explained Manager medical unit.

In addition, next week we will makethe receptia work on the pasarela link-ing the UPU to the cardiology, hepatol-ogy and gastroenterology clinicians toensure the transport in decent mannerof the patient from the UPU to neigh-bouring clinicians, the hospital’s chiefexecutive explained.

-.052 Desi presa araba anunta ca despar-tirea lui Sanmartean de Ittihad eiminenta, impresarul antreoruluiLadislau Boloni, Arcadie Zaporo-janu, spune ca acest lucru nu se vamai intampla.

Although Arab media announcedthat the separation between San-martean and Ittihad is imminent,coach Ladislau Boloni’s agent, Ar-cadia Zaporojanu, says this will nolonger happen.

Although the Arabic press an-nounces that the breakup of hisimpending Sanmartean Ittihad e,impresario, Arcadie antreoruluiBaghery Isaac says that this willno longer happen.

While the Arab media say San-martean’s departure from Ittihad is im-minent, impressions of coach LadislauBoloni’s agent Arcadie Zaporojanu saythis will never happen.

-.05 Cei mai putini sunt cei care au fi-nalizat doctoratul - 0,67 - s, i tineriicare au terminat liceul ori scoalagenerala - 6,20%.

The least numerous are those whocompleted the doctorate courses- 0.67 - and young people whohave finished high school or mid-dle school - 6.20%.

Most are few people who havecompleted their Ph.d.-0.67-andyoung people who have completedgrade school-high school times6,20%.

Most are those who completed their de-gree - 0.67 - and young people who fin-ished college or high school - 6.20 per-cent.

-.05 Politica clubului a facut din Vi-itorul o echipa de urmarit pentruechipele din strainatate, astfel caHagi a vandut în aceasta vara de1,5 milioane de euro - Ianis Hagi(Fiorentina - 1 milion de euro) s, iAlexandru Mitrita (Bari - 500 miieuro).

The club’s policy made Viitorula team to be watched by foreignteams, and so Hagi has sold thissummer in the amount of 1.5 mil-lion Euros - Ianis Hagi (Fiorentina- 1 million euro) and AlexandruMitrita (Bari - 500 thousand euro).

The Club’s policy has made theteam watched for foreign teams,such as Hagi has sold in this sum-mer of 1.5 million euros-IbarraH (Fiorentina-EUR 1 million) andAlexander Duane (Bari-500 thou-sand euros).

The club’s policy made the Viitorul ateam to watch for overseas teams, soHagi sold him for £1.5 million - IanisHagi (Fiorentina - £1 million) and JohnMitrita (Bari - £500 million).

-.049 Au inclus amestecul de pungas, ineplatitori de taxe, straini carefraudau telefonic s, i pseudonon-domi care det,in majoritateaziarelor importante.

They included the medley of tax-shy rascals, phone-hacking for-eigners and pseudo non-doms whoown most of our great newspapers.

They included a mixture of tax rev-enue, pungas, i foreign fraudau tele-phone and non-pseudo domi whohold most major newspapers.

They included the mixture of unpaidtax pungas, foreigners who telephoneand pseudo non-doms who own mostof the key newspapers.

.048 Constantin Brâncus, i se nas, te la 19februarie 1876, în Hobit,a, un micsat din comuna Pes, tis,ani, judet,ulGorj, la poalele Carpat,ilor.

Constantin Brancusi was born on19 February 1876 in Hobit,a, asmall village from Pes, tis,ani, Gorjcounty, at the foot of the Carpathi-ans.

Constantin Brâncus, i February 19,was born in 1876, a small Certi-fied village in Pes, tis,ani commune,Gorj County, at the foot of theCarpathians.

Mr. Brancusi was born Feb. 19, 1876,in Hobson, a small village in the townof Peston, Pa., at the foot of the moun-tains.

Table 19: Top 10 segments by ∣δMACROF1(i,hS,hU)∣ on RO-EN.

1157

δBLEU Source Reference SNMT UNMT.114 David Grimal are o cariera interna-

tionala de violonist solo, de-a lun-gul careia a sustinut regulat concerteîn ultimii 20 de ani pe principalelescene de muzica clasica ale lumii s, icu orchestre prestigioase cum ar fiOrchestre de Paris, Orchestre Phil-harmonique de Radio France, Rus-sian National Orchestra, OrchestreNational de Lyon, Chamber Orches-tra of Europe, Berliner Symphoniker,New Japan Philharmonic, Orchestrede l’Opera de Lyon, Mozarteum Or-chestra Salzburg, Jerusalem Sym-phony Orchestra s, i Sinfonia Varso-via, sub conducerea unor dirijori pre-cum Christoph Eschenbach, MichelPlasson, Michael Schnwandt, PeterCsaba, Heinrich Schiff, LawrenceFoster, Emmanuel Krivine, MikhailPletnev, Rafael Fruhbeck de Bur-gos s, i Peter Eotvos, Andris Nelsons,Christian Arming.

David Grimal has an international ca-reer as a solo violinist. During thelast 20 years he held regular con-certs on the main classical musicstages of the world with prestigiousorchestras such as the Orchestrede Paris, Orchestre Philharmoniquede Radio France, Russian NationalOrchestra, Orchestre National deLyon, Chamber Orchestra of Europe,Berliner Symphoniker, New JapanPhilharmonic, Orchestre de l’Operade Lyon, Salzburg Mozarteum Or-chestra, Jerusalem Symphony Or-chestra and Sinfonia Varsovia un-der the direction of conductors suchas Christoph Eschenbach, MichelPlasson, Michael Schnwandt, PeterCsaba, Heinrich Schiff, LawrenceFoster, Emmanuel Krivine, MikhailPletnev, Rafael Fruhbeck de Burgosand Peter Eotvos, Andris Nelsons,Christian Arming.

David Grimal has an international ca-reer as a solo violinist, along whichclaimed concerts regularly in the past20 years in key scenes of classicalmusic of the world and with pres-tigious orchestras such as the Or-chestre de Paris, Orchestre Philhar-monique de Radio France, RussianNational Orchestra, Orchestre Na-tional de Lyon, the Chamber Or-chestra of Europe, the Berliner Sym-phoniker, the New Japan Philhar-monic, Orchestre de l’Opera de LyonThe Salzburg Mozarteum Orches-tra, the Jerusalem Symphony Or-chestra, and Sinfonia Varsovia, un-der the direction of conductors suchas Christoph Eschenbach, MichelPlasson, Michael Schnwandt, PeterCsaba, Heinrich Schiff, LawrenceFoster, Emmanuel Krivine, MikhailPletnev, Rafael Fruhbeck de Burgos,and Peter Eotvos, Andris Nelsons,Christian Arming.

David Grimal has an international soloviolinist, Peter Csaba, Heinrich Schiff,Lawrence Foster, Emmanuel Krivine,Mikhail Pletnev, Mikhail Fruyhbeck Or-chestra of Burgos and Peter Eotvos, An-dris Philharmonique de Radio France,Russian National Orchestra, the FrenchNational Orchestra of Lyon, the BerlinerSymphoniker, the New Japan Philhar-monic, the Orchestre de l’Opera of Lyon,the Mozarteum Orchestra of Lyon, theJerusalem Symphony Orchestra and Sinfo-nia of Paris, under the direction of conduc-tor like Orchestra: the National Orchestra,the National Orchestra, the French Cham-ber Orchestra, the Berliner Symphoniol-ogy Orchestra and Sinfonia Sinfonia andSinfonia Sinfonia Sinfonia, the Orchestraof the Paris Symphony Orchestra, the Rus-sian National Orchestra, the National Or-chestra, the National Opera of Europe, theBerliner Symphoniker, the Berliner Sym-phoniker Orchestra, the Orchestra of Lyon,the Mozarteum and Sinfonia Sinfonia Sin-fonia.

.089 Editia de toamna va avea loc laChisinau (Casa Armatei, 2-3 oc-tombrie), Iasi (Palas, 6-7 octombrie),Brasov (Casa Armatei, 14-15 oc-tombrie), Cluj Global (Sala Poli-valenta, 20-21 octombrie), Cluj IT(Sala Polivalenta, 22-23 octombrie),Sibiu (Centrul de Afaceri, 28-29 oc-tombrie) s, i Targu-Mures (Teatrul Na-tional, 5 noiembrie).

The autumn edition will take placein Chisinau (Army House 2-3 Octo-ber), Iasi (Palas 6-7 October), Brasov(Army House 14-15 October), ClujGlobal (Polyvalent Hall 20-21 Octo-ber) Cluj IT (Polyvalent Hall 22-23October), Sibiu (Business Centre 28-29 October) and Targu-Mures (Na-tional Theatre, November 5).

Autumn edition will take place inChisinau (Army House, 2-3 Oc-tober), Iasi (Palas, 6-7 October),Brasov (Army House, 14-16 Octo-ber), Global (Sala Polivalenta, 20-21October), Cluj (Sala Polivalenta, 22-23 October), Sibiu (Business Cen-tre, 28-29 October) and Targu-Mures(National Theatre, November 5).

The fall season will be in Washington(House, 2-3), Washington (Palas, Oct. 6-7), Brasov (House, Oct. 14-15), San Fran-cisco Global (East Coast, Oct. 20-21),San Francisco IT (East Coast, Oct. 22-23), Sibiu (Center for Business, Oct. 28-29) and Targu-Mures (the National Center,Nov. 5).

-.072 Realizatorii studiului mai transmitca "romanii simt nevoie de cevamai multa aventura în viata lor(24%), urmat de afectiune (21%),bani (21%), siguranta (20%), nou(19%), sex (19%), respect 18%, in-credere 17%, placere 17%, conectare17%, cunoastere 16%, protectie 14%,importanta 14%, invatare 12%, liber-tate 11%, autocunoastere 10% s, i con-trol 7%".

The study’s conductors transmit that"Romanians feel the need for a littlemore adventure in their lives (24%),followed by affection (21%), money(21%), safety (20%), new things(19%), sex ( 19%) respect 18%, con-fidence 17%, pleasure 17%, connec-tion 17%, knowledge 16%, protec-tion 14%, importance 14%, learning12%, freedom 11%, self-awareness10% and control 7% ".

The filmmakers may study trans-mitted as "Romanians feel in needof something more adventure intheir lives (24%), followed by affec-tion (21%), money (21%), security(20%), new (19%), sex (19%), re-spect, trust and 18% 17% 17% plea-sure, 17%, 16%, knowledge protec-tion 14%, 14%, 12%, liberty learning11%, self-awareness and control 10%7%".

Reporters also deliver that "Americansfeel need for some more adventure in theirlives (24%), followed by affection (21%),money (21%), safety (20%), new (19%),sex (19%), respect 18%, confidence 17%,pleasure 17%, connection 17%, knowl-edge 14%, protection 14%, importance14%, learning 12%, freedom 11%, auto-cunoastere 10% and control 7%."

-.045 Fed ar trebui sa puna problemastabilitat,ii financiare pe primul locdoar în cazul unei crize majore, cuma fost cutremurul de pe piat,a din2008, a declarat Adam S. Posen, fostmembru al comisiei de stabilire aratei dobânzii din cadrul Bank ofEngland.

The Fed should put financial stabil-ity concerns first only during a majorcrisis, such as the 2008 market melt-down, said Adam S. Posen, a formermember of the Bank of England’srate-setting committee.

The Fed should put the issue of finan-cial stability in the first place only inthe event of major crises such as theearthquake in the market since 2008,"said Adam s. Posen, a former mem-ber of the Commission for determin-ing the interest rate within the Bankof England.

The Fed should put financial stability firstonly in a major crisis, such as the 2008credit crunch, said Adam S. Posen, a for-mer member of the Bank of England’s sta-bilisation committee.

-.041 Atunci când sust,inatorilor lui Clintonli se adreseaza o întrebare deschisareferitor la motivul pentru care ardori ca ea sa câs, tige cursa, raspunsulmajoritar este ca det,ine experient,aadecvata (16%), urmat de faptul caa venit momentul ca o femeie sa fiepres, edinte (13%) s, i ca este cel maibun candidat pentru aceasta funct,ie(10%).

When Clinton’s supporters are askedin an open-ended question why theywant her to be the nominee, the topanswer is that she has the right expe-rience (16 percent), followed by it’stime for a woman president (13 per-cent), and that she is the best candi-date for the job (10 percent).

When Clinton supporters are askedto answer an open question regardingwhy you want her to win the race, theanswer is that majority holds appro-priate experience (16%), followed bythe fact that the time has come fora woman to be President (13%), andthat is the best candidate for this po-sition (10%).

When Clinton supporters are asking anopen question as to why she would like towin the race, the majoritar answer is thatshe holds the right experience (16 percent),followed by the fact that it has come timefor a woman to be president (13 percent)and that she is the best candidate for thejob (10 percent).

-.040 Clinton este acum sust,inuta de 47%din alegatorii democrat,i (în scaderede la 58%), în timp ce Sanders estepe locul doi, cu 27% (în urcare fat,ade 17%).

Clinton now has the backing of 47percent of Democratic primary vot-ers (down from 58 percent), whileSanders comes in second, with 27percent (up from 17 percent).

Clinton is now supported by 47% ofthe voters Democrats (down 62%),while Sanders is in second place with27% (in relation to climb 17%).

Clinton is now backed by 47 percent ofDemocratic voters (down from 58 per-cent), while Sanders is second with 27 per-cent (up from 17 percent).

.039 Persoanele interesate vor puteasa afle cum sa realizeze creat,iisculpturale florale armonioase(ikebana), cum sa surprinda sufle-tul elementelor înconjuratoare încompozit,ii plastice folosind artajaponeza a pictarii în tus, (Sumie)sau cum sa îs, i exprime propriaindividualitate s, i creativitate prinuniversul plurivalent al artelor fru-moase (grafic, desen, pictura) , adeclarat Sorin Mazilu, profesorulacestor cursuri.

Those interested will find out howto make harmonious sculptural flo-ral creations (ikebana), how to cap-ture the soul of the surrounding ele-ments in plastic compositions usingthe Japanese ink painting art (Sumie)or how to express their own indi-viduality and creativity through themultifaceted universe of Fine Arts(graphic, drawing, painting) saidSorin Mazilu, the teacher of thesecourses.

Persons interested will be able tolearn how to achieve harmonioussculptural floral design (ikebana),how to capture the soul of the sur-rounding elements in plastic com-positions using the Japanese art ofpainting in India ink (Sumie) orhow to express their own individu-ality and creativity through the uni-verse plurivalent of fine arts (paint-ing, drawing, graphics), said SorinMazilu, teacher of such courses.

People interested will be able to learn howto complete armonious creative contrast-ing (grafic, drawing, painting) Creations,how to capture the hearts of surroundingelements in artistic compositions using theJapanese art of painting in tus (Sumie) orhow to express their own individualisticidentity through the plurivalent universeof beautiful arts (grafic, drawing, painting,painting), said Mr. Mazilu, the professorof these courses.

.036 Companiile au vacante cateva miide locuri de munca, oportunitatide internship s, i stagii de prac-tica, iar o parte dintre ele suntdeja anuntate pe site-ul oficialwww.targuldecariere.ro.

Companies have thousands of vacantjobs, internship opportunities and in-ternships, and some of them are al-ready posted on the official websitewww.targuldecariere.ro.

Companies have thousands of vacantjobs, internship opportunities, and in-ternships, and some of them are al-ready announced on the official web-site www.targuldecariere.ro.

Companies have filled several thou-sand jobs, internships and trainingtrips, and a number of them are al-ready announced on the official websitewww.targuldecarier.com.

-.035 Antrenorul celor de la BrisbaneBroncos, Wayne Bennett, a facut ous, oara referire la Storm dupa victo-ria echipei sale în meciul de califi-care cu North Queensland Cowboys,jucat sâmbata seara, când a numitmeciul respectiv o „demonstrat,ie” aligii de rugby s, i a declarat ca celedoua echipe din Queensland nu seprea pricep la wrestling.

Brisbane Broncos coach Wayne Ben-nett made a thinly veiled referenceto the Storm after his side’s qualify-ing final win over North QueenslandCowboys on Saturday night whenhe called that game a "showcase" ofthe rugby league and said the twoQueensland weren’t "too big" intowrestling.

The coach of the Brisbane Broncos,Wayne Bennett, made a slight refer-ence to the Storm after his team’svictory in the qualifying match withNorth Queensland Cowboys, playedSaturday night, when he called thematch a "demonstration" of rugbyleague and said that the two teamsfrom Queensland is not too good atwrestling.

Brisbane Broncos coach Wayne Bennettmade a slight reference to the Storm af-ter their team’s victory in the qualifyingmatch against the North Queensland Cow-boys on Saturday night when he calledthe game a "demonstration" of the rugbyleague league league and said the twoQueensland sides were not really pricep towrestling.

.034 David Grimal recunoaste ca Les Dis-sonaces reprezinta "un lux", pre-cizand ca niciunul dintre muzicieniicare fac parte din orchestra nu de-pinde de succesul acestui proiect pen-tru a trai.

David Grimal admits that Les Dis-sonaces is "a luxury", adding thatnone of the musicians who are partof the orchestra does not depend onthe success of this project to make aliving.

David Grimal admits that Les Dis-sonaces represents "a luxury", stat-ing that none of the musicians whoare part of the orchestra does not de-pend on the success of this project forliving.

David Grimal admits Les Dissonaces rep-resent "a luxury," saying none of the musi-cians who make up the orchestra depindeof the success of this project to live.

Table 20: Top 10 segments by ∣δBLEU(i,hS,hU)∣ on RO-EN.