
Towards Quantifiable Dialogue Coherence Evaluation

Zheng Ye1, Liucun Lu1, Lishan Huang2, Liang Lin2,3, Xiaodan Liang1∗

1 Shenzhen Campus of Sun Yat-sen University, 2 Sun Yat-Sen University, 3 Dark Matter AI Inc.
{yezh7,lulc,huanglsh6}@mail2.sysu.edu.cn, [email protected], [email protected]

Abstract

Automatic dialogue coherence evaluation has attracted increasing attention and is crucial for developing promising dialogue systems. However, existing metrics have two major limitations: (a) they are mostly trained in a simplified two-level setting (coherent vs. incoherent), while humans give Likert-type multi-level coherence scores, dubbed as "quantifiable"; (b) their predicted coherence scores cannot align with the actual human rating standards due to the absence of human guidance during training. To address these limitations, we propose Quantifiable Dialogue Coherence Evaluation (QuantiDCE), a novel framework aiming to train a quantifiable dialogue coherence metric that can reflect the actual human rating standards. Specifically, QuantiDCE includes two training stages, Multi-Level Ranking (MLR) pre-training and Knowledge Distillation (KD) fine-tuning. During MLR pre-training, a new MLR loss is proposed to enable the model to learn a coarse judgement of coherence degrees. Then, during KD fine-tuning, the pretrained model is further finetuned to learn the actual human rating standards with only very few human-annotated data. To preserve generalizability even with limited fine-tuning data, a novel KD regularization is introduced to retain the knowledge learned at the pre-training stage. Experimental results show that the model trained by QuantiDCE presents stronger correlations with human judgements than the other state-of-the-art metrics.¹

1 Introduction

Dialogue coherence, which requires a response to be fluent, consistent and context-related, is an essential property for developing promising dialogue systems (Cervone et al., 2018). However, it is still challenging to evaluate the coherence of a response generated by a dialogue system. Although human evaluation is always considered as the most accurate way to evaluate coherence, it is expensive and high-latency, and thus cannot meet the evaluation demand of the frequent development of dialogue systems. Therefore, automatic evaluation metrics are developed to serve as human proxies that can rapidly compute dialogue coherence and return relatively accurate results.

∗ Corresponding Author.

¹ The code and trained checkpoints are available at https://github.com/James-Yip/QuantiDCE.

Figure 1: Likert-type multi-level human rating vs. two-level automatic evaluation. Human rating always considers multiple coherence degrees, while most of the existing automatic metrics only learn to distinguish the coherent dialogues from the incoherent ones and give relatively extreme coherence scores.

The current widely used metrics measure the lexical word-overlap between generated responses and reference responses, such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004). However, they have been demonstrated to be biased and correlate poorly with human judgements since no semantic information is considered (Liu et al., 2016; Novikova et al., 2017). To overcome this issue, researchers turned to develop learnable metrics based on neural networks that incorporate semantic information, such as RUBER (Tao et al., 2018), BERT-RUBER (Ghazarian et al., 2019) and GRADE (Huang et al., 2020). However, these metrics deviate from the actual human rating due to two limitations. First, they simplify the coherence evaluation task into a two-level setting, i.e., coherent or incoherent, by maximizing the differences between the positive coherent dialogues and the negative incoherent ones obtained by some negative sampling strategies. In contrast, humans usually adopt Likert scaling and give coherence scores from multiple levels like 1 to 5, as shown in Figure 1. Second, to avoid relying on large-scale human-annotated data, they are mostly trained in a purely unsupervised manner and cannot align with the human rating, because the actual human rating standards are never introduced during training.

To address the above limitations, we propose a novel dialogue coherence metric training framework, named Quantifiable Dialogue Coherence Evaluation (QuantiDCE). This framework consists of two training stages: Multi-Level Ranking (MLR) pre-training and Knowledge Distillation (KD) fine-tuning. At the MLR pre-training stage, a new multi-level ranking (MLR) loss is proposed for learning a coarse judgement of coherence degrees. Specifically, the MLR loss separates the context-response pairs with different coherence levels and compacts the pairs within the same level in one-dimensional score space. As a result, the pretrained model is able to distinguish dialogue responses of different coherence levels for a given context and predicts more accurate coherence scores. At the KD fine-tuning stage, the pretrained model is further fine-tuned to learn the actual human rating standards with only very few human-annotated coherence scores. To mitigate overfitting to the scarce annotated data during fine-tuning, a novel knowledge distillation regularization loss is introduced to retain the knowledge learned at the pre-training stage, where the pretrained model (teacher) provides the soft targets for the model during fine-tuning (student). Experimental results show that the metric trained by our QuantiDCE clearly outperforms the other state-of-the-art metrics in terms of the Pearson, Spearman and Kendall correlations with human judgements, by around 5% points on average. To summarize our contributions:

1) We propose QuantiDCE, a novel quantifiable training framework for dialogue coherence evaluation, which aims to align the automatic scores with the actual human rating standards via MLR pre-training and KD fine-tuning. To the best of our knowledge, it is the first attempt to consider the quantifiable problem for dialogue coherence evaluation.

2) Extensive experiments demonstrate the effectiveness of QuantiDCE, which enables the trained metric to have markedly stronger correlations with human judgements than the other state-of-the-art metrics.

2 Related Work

Automatic Coherence Evaluation. The widely used automatic metrics, such as BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005) and ROUGE (Lin, 2004), use statistical rules to measure the degree of lexical word-overlap between generated responses and reference responses. However, these metrics have been demonstrated to correlate poorly with human judgments due to the absence of semantic information (Liu et al., 2016; Novikova et al., 2017). Therefore, subsequent metrics were designed to incorporate semantic information. For instance, BERTScore (Zhang et al., 2020) measures the soft semantic word-overlap rather than the hard lexical word-overlap like BLEU. Moreover, learnable metrics encoding the semantic information have been attracting interest recently; they are trained either in a supervised manner with large-scale human-annotated data, such as ADEM (Lowe et al., 2017), or in an unsupervised manner with automatically constructed data, such as RUBER (Tao et al., 2018) and BERT-RUBER (Ghazarian et al., 2019). Besides, Mesgar et al. (2020) implicitly utilize dialogue act labels by introducing an auxiliary task, which improves the performance of coherence evaluation. Furthermore, the recently proposed coherence metric GRADE (Huang et al., 2020) introduces the graph information of dialogue topic transitions and achieves the current state-of-the-art results. Note that these learnable metrics are trained with a two-level objective that separates the coherent dialogues from the incoherent ones, while our QuantiDCE models the task in a multi-level setting which is closer to the actual human rating.

Knowledge Distillation. Knowledge distillation (KD) is a method that transfers the knowledge from a large trained teacher model to a smaller student model by using the soft targets provided by the teacher (Hinton et al., 2015). In recent years, KD has been applied to many specific tasks (Sun et al., 2020; Wei et al., 2019; Kim and Rush, 2016; Sourty et al., 2020). Unlike these previous works, we use KD to retain the knowledge learned at the pre-training stage during fine-tuning and do not compress the model size of the student model.

Figure 2: The overall pipeline of our QuantiDCE, consisting of two training stages which are marked by the blue and the black one-way arrows. Each input dialogue example contains one context with three-level candidate responses and five responses for each level, shown as red, orange and green rectangles respectively. The solid circle represents the centroid score for each level of the i-th dialogue. At the MLR pre-training stage, the context-response pairs are encoded with BERT and transformed into coherence scores through the MLP prediction network, and then the MLR loss is applied to optimize the network. The dotted two-way arrows indicate that both ends should be separated, while the solid two-way arrows indicate that both ends should be compact. At the KD fine-tuning stage, the student model is first initialized with the teacher model and optimized by the KD-MSE loss.

3 QuantiDCE Framework

In this section, we present QuantiDCE, a two-stage framework for dialogue coherence metric learning, consisting of Multi-Level Ranking (MLR) pre-training and Knowledge Distillation (KD) fine-tuning. As illustrated in Figure 2, given a metric model M (Section 3.1), QuantiDCE enables M to learn multi-level representations for context-response pairs with different coherence degrees during the pre-training stage (Section 3.2), and further to learn the rating standards of humans with only a fraction of data during the fine-tuning stage (Section 3.3). After these two training stages, the quantifiable gap between automatic metrics and humans can be substantially reduced.

3.1 Model Architecture

In our QuantiDCE framework, the metric model M is composed of: (1) an encoder network for encoding the input context-response pairs into features and (2) a predictor network for transforming the encoded features into coherence scores. Specifically, we adopt BERT (Devlin et al., 2019) as the encoder network and a multi-layer perceptron (MLP) as the predictor network.

Given a context c = {c_1, ..., c_m} and a response r = {r_1, ..., r_n}, where c_i and r_i are tokens of the context and the response respectively, c and r are concatenated as {[CLS], c_1, ..., c_m, [SEP], r_1, ..., r_n, [SEP]}, denoted as [c; r]. Then the coherence score s of the response r w.r.t. the context c is predicted by:

s = \mathrm{MLP}(\mathrm{BERT}([c; r])),   (1)

where MLP is a three-layer fully-connected network in which the activation functions of the three layers are two exponential linear units (Clevert et al., 2016) and a sigmoid function, respectively.
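For concreteness, the following is a minimal PyTorch sketch of this scoring model, assuming the HuggingFace transformers implementation of BERT. The MLP hidden size and the use of the [CLS] representation as the encoded feature are illustrative assumptions; the paper only specifies the encoder, the three-layer MLP and its activations.

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class CoherenceMetric(nn.Module):
    """Encoder (BERT) + predictor (3-layer MLP with ELU, ELU, sigmoid), cf. Eq. (1)."""

    def __init__(self, bert_name="bert-base-uncased", hidden=256):  # hidden size is an assumption
        super().__init__()
        self.encoder = BertModel.from_pretrained(bert_name)
        dim = self.encoder.config.hidden_size
        self.predictor = nn.Sequential(
            nn.Linear(dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        out = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask,
                           token_type_ids=token_type_ids)
        feature = out.last_hidden_state[:, 0]        # [CLS] feature of the pair [c; r]
        return self.predictor(feature).squeeze(-1)   # coherence score s in (0, 1)

# Usage: the tokenizer builds "[CLS] c [SEP] r [SEP]" from the context/response pair.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
pair = tokenizer("I like animals, especially dogs. How about you? Haha, I like cats more.",
                 "I like cats more too and I work in a pet store.",
                 return_tensors="pt")
score = CoherenceMetric()(**pair)                    # tensor of shape (1,)
```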

3.2 MLR Pre-Training

For learning a coarse judgement of coherence degrees without the direct supervision of score annotations, the model M is first pretrained by minimizing a new multi-level ranking (MLR) loss on a large-scale dialogue dataset. Concretely, the MLR loss is composed of a separation loss, a compactness loss and an ordering loss.


Formally, given a training dataset D_{pt} = \{(c_i, R_i)\}_{i=1}^{N_1}, where c_i is a dialogue context and R_i = \{(r_{i,1}^j, \cdots, r_{i,K}^j)\}_{j=1}^{L} is a response set with L coherence levels² and K responses for each level, the model M is trained by minimizing the following MLR loss:

\mathcal{L}_{mlr} = \frac{1}{N_1} \sum_{i=1}^{N_1} \left( \ell_i^{sep} + \ell_i^{com} + \ell_i^{ord} \right),   (2)

where \ell_i^{sep}, \ell_i^{com} and \ell_i^{ord} refer to the separation loss, the compactness loss and the ordering loss of the i-th example, respectively.

The separation loss aims to separate the features of context-response pairs with different coherence levels by separating the coherence scores of the different pairs³. Moreover, to efficiently compute the loss, we first compute the centroids of the context-response pairs belonging to the same coherence level for the i-th dialogue example, i.e., e_i = \{e_i^j = \frac{1}{K}\sum_{k=1}^{K} s_{i,k}^j \mid j \in [1, L], e_i^j \in \mathbb{R}\}, where s_{i,k}^j is the coherence score of the context-response pair (c_i, r_{i,k}^j). The separation loss between the centroids is then computed as follows:

\ell_i^{sep} = \sum_{j=1}^{L-1} \sum_{l=j+1}^{L} \max\left(0,\; w \cdot \lambda - d(e_i^j, e_i^l)\right),   (3)

where d(\cdot) is the L1 distance, λ is the lower bound for the distance between two centroids, and w = l - j is the distance weight used for amplifying the lower bound w.r.t. the coherence-level gap.

The compactness loss aims to compact the pairs within the same level, which serves as a regularization to avoid the occurrence of outliers for each coherence level. Specifically, the coherence score s_{i,k}^j is forced to be closer to the corresponding centroid e_i^j as follows:

\ell_i^{com} = \sum_{j=1}^{L} \sum_{k=1}^{K} \max\left(0,\; d(e_i^j, s_{i,k}^j) - \mu\right),   (4)

where µ is the upper bound for the distance between the centroid of a certain coherence level and the scores within this level.

² The coherence levels are in ascending order, i.e., a response in a higher level is more coherent than one in a lower level.

³ We also tried to directly restrict the features of different-level pairs to be separated, but the performance dropped compared with restricting the scores.

The ordering loss is finally introduced to ensure that the rank order of the predicted scores satisfies the pre-defined order of coherence degrees, i.e., s_{i,k}^j < s_{i,k}^{j+1}, j \in [1, L-1], k \in [1, K]. It is critical since the separation loss only restricts the scores of the pairs from different coherence levels to be separated, and this restriction is also satisfied when the scores of the highest level are lower than the scores of the lowest level. Similar to the separation loss, the ordering loss is also computed between each pair of centroids as follows:

\ell_i^{ord} = \sum_{j=1}^{L-1} \sum_{l=j+1}^{L} \max\left(0,\; e_i^j - e_i^l\right).   (5)
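To make the three terms concrete, here is a small PyTorch sketch of the per-example MLR loss under the notation above, assuming the scores of one dialogue example are arranged as an (L, K) tensor with levels in ascending order and the centroid taken as the mean score of each level; the default bounds mirror the values reported in Section 4.1.

```python
import torch

def mlr_loss(scores: torch.Tensor, lam: float = 0.3, mu: float = 0.1) -> torch.Tensor:
    """Per-example MLR loss (Eqs. 2-5): separation + compactness + ordering.

    scores: (L, K) predicted coherence scores, one row per coherence level
            (ascending), K responses per level.
    lam:    lower bound on the distance between two level centroids.
    mu:     upper bound on the within-level distance to the centroid.
    """
    L, _ = scores.shape
    centroids = scores.mean(dim=1)                   # e_i^j: centroid score of each level

    sep = scores.new_zeros(())
    order = scores.new_zeros(())
    for j in range(L - 1):
        for l in range(j + 1, L):
            w = l - j                                # distance weight for the level gap
            d = (centroids[j] - centroids[l]).abs()  # L1 distance between centroids
            sep = sep + torch.clamp(w * lam - d, min=0.0)                      # separation
            order = order + torch.clamp(centroids[j] - centroids[l], min=0.0)  # ordering

    # compactness: every score should stay within mu of its level centroid
    com = torch.clamp((scores - centroids.unsqueeze(1)).abs() - mu, min=0.0).sum()

    return sep + com + order

# Averaging mlr_loss over the N1 training examples gives L_mlr of Eq. (2).
```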

3.3 KD Fine-Tuning

The model M pretrained by the MLR loss is further trained at the KD fine-tuning stage to directly learn the actual human rating standards with only a fraction of annotated data.

Formally, given a training dataset D_{ft} = \{(c_i, r_i, s_i)\}_{i=1}^{N_2}, where c_i, r_i and s_i are the dialogue context, the corresponding response and the human-annotated coherence score of r_i w.r.t. c_i respectively, the previous fine-tuning approach for the scoring task usually optimizes the model M with an MSE loss between the predicted score \hat{s}_i and the human score s_i:

\ell_i^{mse} = (\hat{s}_i - s_i)^2.   (6)

However, by minimizing \ell_i^{mse} for each example, the model will easily overfit the very few annotated data, and thus its generalizability will be dramatically reduced. To overcome this issue, a novel knowledge distillation (KD) regularization is introduced for retaining the knowledge learned at the MLR pre-training stage. Concretely, the pretrained model M is treated as the teacher model that provides the soft targets for the student model \hat{M}, which is entirely copied from M. We adopt the distillation objectives of TinyBERT (Jiao et al., 2020), including the distillations of the embedding layer, the Transformer layers and the prediction layer. The KD loss is then formulated as:

\ell_i^{kd} = \sum_{t=0}^{T+1} \|O_i^t - \hat{O}_i^t\|_2^2 + \sum_{t=1}^{T} \|A_i^t - \hat{A}_i^t\|_2^2,   (7)

where \|\cdot\|_2^2 indicates the squared L2 norm, T is the number of Transformer layers, O_i^t and \hat{O}_i^t are the t-th layer outputs of M and \hat{M} respectively, and A_i^t and \hat{A}_i^t are the attention matrices of the t-th Transformer layer. Note that layer 0 and layer T+1 refer to the embedding layer and the prediction layer respectively.

Algorithm 1 Training Procedure of QuantiDCE
Input: training datasets D_{pt} and D_{ft}, metric model M
Output: student model \hat{M}
1: initialize M with BERT_BASE
2: for all (c_i, R_i) ∈ D_{pt} do
3:   S_i = M(c_i, R_i)
4:   compute the centroids e_i for S_i
5:   compute \ell_i^{sep} and \ell_i^{ord} for e_i
6:   compute \ell_i^{com} between e_i and S_i
7:   compute \mathcal{L}_{mlr}
8:   update M to minimize \mathcal{L}_{mlr}
9: end for
10: initialize \hat{M} with M
11: for all (c_i, r_i, s_i) ∈ D_{ft} do
12:   O_i, A_i = M(c_i, r_i)
13:   \hat{s}_i, \hat{O}_i, \hat{A}_i = \hat{M}(c_i, r_i)
14:   compute \ell_i^{mse} between \hat{s}_i and s_i
15:   compute \ell_i^{kd} between O_i, A_i and \hat{O}_i, \hat{A}_i
16:   compute \mathcal{L}_{kd\_mse}
17:   update \hat{M} to minimize \mathcal{L}_{kd\_mse}
18: end for
19: return student model \hat{M}

Overall, the loss function for KD fine-tuning, named the KD-MSE loss, is the weighted sum of \ell_i^{mse} and \ell_i^{kd} across the whole training dataset D_{ft}:

\mathcal{L}_{kd\_mse} = \frac{1}{N_2} \sum_{i=1}^{N_2} \left( \alpha \cdot \ell_i^{mse} + \beta \cdot \ell_i^{kd} \right),   (8)

where α and β are hyperparameters; we empirically found that α = 1 and β = 5 perform well.

The overall training procedure is summarized in Algorithm 1.
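As a rough sketch of this objective, the snippet below computes the KD-MSE loss for a batch, reusing the hypothetical CoherenceMetric class sketched in Section 3.1 and assuming the HuggingFace BERT outputs expose hidden states and attention matrices; the sum-reduction used for the squared L2 terms and the score rescaling are illustrative choices, not the paper's exact normalization.

```python
import torch
import torch.nn.functional as F

def kd_mse_loss(student, teacher, batch, human_scores, alpha=1.0, beta=5.0):
    """KD-MSE fine-tuning loss (a sketch of Eqs. 6-8) for one batch.

    student: trainable copy (M_hat) of the pretrained metric model.
    teacher: frozen pretrained metric model M providing soft targets.
    batch:   tokenized context-response pairs (input_ids, attention_mask, ...).
    human_scores: human-annotated coherence scores, rescaled to (0, 1).
    """
    s_enc = student.encoder(**batch, output_hidden_states=True, output_attentions=True)
    s_score = student.predictor(s_enc.last_hidden_state[:, 0]).squeeze(-1)

    with torch.no_grad():                            # the teacher is kept frozen
        t_enc = teacher.encoder(**batch, output_hidden_states=True, output_attentions=True)
        t_score = teacher.predictor(t_enc.last_hidden_state[:, 0]).squeeze(-1)

    # layer 0 (embeddings) and layers 1..T (Transformer outputs)
    l_kd = sum(F.mse_loss(s, t, reduction="sum")
               for s, t in zip(s_enc.hidden_states, t_enc.hidden_states))
    # attention matrices of layers 1..T
    l_kd = l_kd + sum(F.mse_loss(s, t, reduction="sum")
                      for s, t in zip(s_enc.attentions, t_enc.attentions))
    # prediction layer (layer T+1)
    l_kd = l_kd + F.mse_loss(s_score, t_score, reduction="sum")

    l_mse = F.mse_loss(s_score, human_scores)        # Eq. (6), averaged over the batch
    return alpha * l_mse + beta * l_kd               # Eq. (8) for this batch
```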

4 Experiments

4.1 Experimental Setup

Baseline Metrics. We compare the metric model trained by our QuantiDCE with eight popular automatic dialogue metrics, including three lexical word-overlap metrics: BLEU (Papineni et al., 2002), ROUGE (Lin, 2004) and METEOR (Banerjee and Lavie, 2005); one semantic word-overlap metric: BERTScore (Zhang et al., 2020); and four learnable metrics: ADEM (Lowe et al., 2017), BERT-RUBER (Ghazarian et al., 2019), BLEURT (Sellam et al., 2020) and GRADE (Huang et al., 2020).

Evaluation. Our QuantiDCE and the baselines are evaluated by computing the correlations between the model-predicted scores and the human-rated scores. Specifically, we adopt Pearson, Spearman and Kendall as the correlation measures and a large-scale human judgement benchmark (Huang et al., 2020) to provide the human-rated scores. This benchmark contains 1,200 unique (context, response, human-rated score) triplets for metric evaluation, where the contexts were randomly selected from the test sets of three chit-chat datasets, including DailyDialog (Li et al., 2017), ConvAI2 (Dinan et al., 2019) and EmpatheticDialogues (Rashkin et al., 2019), and the responses were produced by both retrieval-based and generation-based dialogue models to ensure response diversity.
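For reference, the correlation measures can be computed directly with scipy; the sketch below shows the evaluation step under the assumption that the metric scores and human ratings are given as plain lists (the scores here are toy values, not benchmark data).

```python
from scipy.stats import kendalltau, pearsonr, spearmanr

def metric_human_correlations(metric_scores, human_scores):
    """Pearson, Spearman and Kendall correlations (with p-values)."""
    return {
        "pearson": pearsonr(metric_scores, human_scores),
        "spearman": spearmanr(metric_scores, human_scores),
        "kendall": kendalltau(metric_scores, human_scores),
    }

# Toy example: four responses scored by the metric and by human raters (1-5 Likert).
print(metric_human_correlations([0.21, 0.48, 0.90, 0.35], [1, 3, 5, 2]))
```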

Training Datasets. We use two datasets, DailyDialog++⁴ and DailyDialogEVAL⁵, to support the pre-training and fine-tuning of QuantiDCE, respectively. The DailyDialog++ dataset (Sai et al., 2020) contains over 11K conversations and augments the original DailyDialog dataset with multiple responses of different quality levels, including five golden reference responses, five adversarial irrelevant responses and five randomly selected responses for each context. Therefore, in this work, we set the number of coherence levels L = 3, where the pairs containing the random responses, the adversarial responses and the reference responses belong to levels 1 to 3, respectively. As to the fine-tuning data, we use the DailyDialog human judgement dataset, denoted as DailyDialogEVAL, which is a subset of the adopted evaluation benchmark (Huang et al., 2020) with 300 human rating data in total, and randomly split the data into training (90%) and validation (10%) sets.

Implementation Details. We use BERT_BASE to initialize the encoder network, which is in line with the current SOTA metric, GRADE. For the MLR pre-training, we pretrain our model for 5 epochs with batch size 3 and learning rate 2e-5, where the lower bound for the separation loss is λ = 0.3 and the upper bound for the compactness loss is µ = 0.1. For the KD fine-tuning, we further finetune the pretrained model for 20 epochs with batch size 10 and learning rate 5e-6. For all the training, BERTAdam is used as the optimizer with β1 = 0.9 and β2 = 0.999. For the Transformer-layer distillation, we distill all the Transformer layers since the model architectures of the teacher and the student are exactly the same.

⁴ https://github.com/iitmnlp/Dialogue-Evaluation-with-BERT

⁵ https://github.com/li3cmz/GRADE

ConvAI2
Metric        Pearson   Spearman  Kendall   Average
BLEU          0.003*    0.128     0.088     0.073
ROUGE         0.136     0.140     0.097     0.124
METEOR        0.145     0.181     0.123     0.150
BERTScore     0.225     0.225     0.154     0.201
ADEM          0.026*    0.037*    0.049*    0.037
BERT-RUBER    0.266     0.266     0.185     0.239
BLEURT        0.152     0.149     0.103     0.135
GRADE         0.496     0.503     0.356     0.452
QuantiDCE     0.554     0.554     0.395     0.501

EmpatheticDialogues
Metric        Pearson   Spearman  Kendall   Average
BLEU          -0.051*   0.002*    0.005*    -0.015
ROUGE         0.029*    -0.013*   -0.010*   0.002
METEOR        0.118     0.055*    0.040*    0.071
BERTScore     0.046*    0.033*    0.021*    0.033
ADEM          0.007*    0.009*    0.040*    0.019
BERT-RUBER    -0.022*   -0.040*   -0.029*   -0.030
BLEURT        0.203     0.192     0.130     0.175
GRADE         0.350     0.344     0.243     0.312
QuantiDCE     0.412     0.393     0.274     0.360

Table 1: Correlations between automatic evaluation metrics and human judgements on two datasets (ConvAI2 and EmpatheticDialogues). The star * indicates results with p-value > 0.05, which are not statistically significant.

4.2 Experimental Results

Metric Performance. The correlation results of QuantiDCE and the other baseline metrics on the large-scale human judgement benchmark are presented in Table 1, covering the ConvAI2 and EmpatheticDialogues datasets.⁶ For a fair comparison, the learnable baseline metrics, ADEM, BERT-RUBER and GRADE, are trained on the training dataset we adopted, i.e., DailyDialog++.⁷ Overall, QuantiDCE achieves an absolute averaged correlation improvement of around 5% points over the current SOTA, GRADE. Besides, all the results of QuantiDCE are statistically significant with p-value < 0.01.

⁶ The DailyDialogEVAL dataset was not used for evaluation since we used it for fine-tuning.

⁷ BLEURT was not trained on DailyDialog++ since this dataset is not suitable for the BLEURT pre-training strategy. Instead, we trained BLEURT with the fine-tuning data we used. The training details of these baseline metrics are provided in Appendix A.

ConvAI2
Loss          Pearson   Spearman  Kendall   Average
BCE           0.505     0.505     0.361     0.457
Ranking       0.507     0.504     0.360     0.457
SupCon        0.495     0.523     0.367     0.462
FAT           0.516     0.521     0.371     0.469
Vanilla MLR   0.522     0.536     0.379     0.479
MLR (ours)    0.554     0.554     0.395     0.501

EmpatheticDialogues
Loss          Pearson   Spearman  Kendall   Average
BCE           0.354     0.353     0.243     0.317
Ranking       0.399     0.389     0.272     0.353
SupCon        0.332     0.315     0.220     0.289
FAT           0.381     0.358     0.245     0.328
Vanilla MLR   0.403     0.387     0.267     0.352
MLR (ours)    0.412     0.393     0.274     0.360

Table 2: Correlations between human judgements and the metric models trained with different losses during pre-training and the same KD-MSE loss during fine-tuning. Ranking denotes the margin ranking loss.

Pre-Training Objective. To verify the superiority of our pre-training objective, namely the MLR loss, we investigated the performance of several existing loss functions for pre-training compared with ours. Specifically, two categories of loss functions used for metric training are adopted: (a) the two-level setting and (b) the multi-level setting. The binary cross entropy (BCE) loss and the margin ranking loss are adopted for the two-level setting, while another three loss functions are adopted for the multi-level setting, including the supervised contrastive (SupCon) loss (Khosla et al., 2020), the fast-approximated triplet (FAT) loss (Yuan et al., 2019) and the vanilla MLR loss (Lin et al., 2020).⁸ As shown in Table 2, our MLR loss performs best among all the pre-training objectives. We also found that the multi-level losses perform better than the two-level ones, especially on the ConvAI2 dataset. Moreover, in order to more intuitively analyze the performance of these pre-training objectives, we also visualize the encoded features and the predicted scores of the model M after being pretrained by the above loss functions on the DailyDialog++ dataset without fine-tuning.⁹ As shown in Figure 3, (a) the BCE loss cannot separate the level-1 scores from the level-2 ones and the corresponding features are also mixed; (b) the FAT loss, on the other hand, separates the features of different levels well, but does not consider the relative gaps, where the distances between the level-1 and level-3 features are not larger than those between level-1 and level-2; (c) in contrast, our MLR loss separates both the features and the scores well and also considers the relative gaps between different levels.

⁸ The details of these pre-training loss functions are provided in Appendix B.

⁹ The visualization results of the ranking loss, SupCon loss and vanilla MLR loss are provided in Appendix C.

Figure 3: Visualizations of features (the scatter plots in the upper row) and scores (the violin plots in the lower row) on the DailyDialog++ dataset. The features and scores in each of the three columns are obtained from the metric model M only pretrained with the BCE loss, the FAT loss and our MLR loss, respectively.

ConvAI2 (best epoch)
Loss               Pearson   Spearman  Kendall   Average
MSE                0.272     0.369     0.255     0.299
MSE (fix encoder)  0.477     0.477     0.337     0.430
KD-MSE (ours)      0.554     0.554     0.395     0.501

EmpatheticDialogues (best epoch)
Loss               Pearson   Spearman  Kendall   Average
MSE                0.278     0.276     0.187     0.247
MSE (fix encoder)  0.384     0.367     0.253     0.335
KD-MSE (ours)      0.412     0.393     0.274     0.360

DailyDialogEVAL (last epoch)
Loss               Pearson   Spearman  Kendall   Average
MSE                0.934     0.945     0.867     0.915
MSE (fix encoder)  0.379     0.402     0.281     0.354
KD-MSE (ours)      0.804     0.832     0.678     0.771

Table 3: Correlations between human judgements and the metric model M further trained with different fine-tuning losses after MLR pre-training.

Fine-Tuning Objective. Furthermore, we also verified the effectiveness of our KD-MSE loss during fine-tuning by comparing it with other fine-tuning losses, including the pure MSE loss without KD regularization as shown in Equation 6, and the same MSE loss except for freezing the encoder network and only finetuning the predictor network, i.e. the MLP, denoted as MSE (fix encoder). As the results in Table 3 show, compared with the other two losses, the model finetuned by our KD-MSE loss has the highest correlation results on both ConvAI2 and EmpatheticDialogues. Moreover, by comparing the results of MSE and KD-MSE, we can find that introducing KD regularization leads to clear averaged correlation improvements of 20.2% points on ConvAI2 and 11.3% points on EmpatheticDialogues, which verifies the effectiveness of the KD loss. Besides, we also report the last-epoch correlation results on the training dataset, DailyDialogEVAL. The results of MSE and MSE (fix encoder) indicate over-fitting and under-fitting to DailyDialogEVAL respectively, which explains their low performance on the two evaluation datasets. In contrast, our KD-MSE loss enables the model to learn the actual human rating standards from the scarce annotated data while avoiding overfitting to it. Finally, in Figure 4, we present the visualization of the scores predicted by our QuantiDCE after KD fine-tuning. Compared with the score distributions before fine-tuning in Figure 3(c), the finetuned score distributions of level-1 and level-3 are wider and partly overlap with the level-2 distribution. This is expected, as judgements of coherence are always subjective and humans tend to give vague, middle scores instead of extremely high or low ones.

Figure 4: Score visualization on the DailyDialog++ dataset, where the scores are predicted by our QuantiDCE after KD fine-tuning.

Metric                   Pearson   Spearman  Kendall   Average
QuantiDCE                0.554     0.554     0.395     0.501
w/o MLR pre-training     0.373     0.357     0.246     0.325
  w/o separation loss    0.388     0.416     0.289     0.364
  w/o compactness loss   0.526     0.550     0.390     0.489
  w/o ordering loss      -0.494    -0.522    -0.371    -0.462
w/o KD fine-tuning       0.531     0.540     0.381     0.484

Table 4: Ablation studies on the ConvAI2 dataset by removing one of the components in QuantiDCE, including the MLR loss (w/o MLR pre-training), the KD-MSE loss (w/o KD fine-tuning), and the three secondary losses of the MLR loss.

Figure 5: The performance trends when changing the number of annotated data used for different fine-tuning objectives (MSE, MSE (fix encoder), and our KD-MSE). Each point in the line chart indicates the averaged correlation of Pearson, Spearman and Kendall.

4.3 Ablation Studies

Component Analysis. To verify the contributions of the core components in our QuantiDCE, we further conducted ablation studies on the ConvAI2 dataset. As shown in Table 4, both the MLR pre-training and the KD fine-tuning contribute to the better performance of QuantiDCE. Besides, we also conducted ablations by removing one of the secondary losses during MLR pre-training, namely the separation loss, the compactness loss and the ordering loss. The results show that the performance benefits from all these losses, among which the separation loss and the ordering loss are crucial for training a metric with strong and positive human correlations.

Example 1
U1: I need to book a plane ticket to London.
U2: Round-trip or one-way?
R: Round trip or one way trip?
Coherence Score (Human / QuantiDCE / GRADE): 2.10 / 2.85 / 4.52

Example 2
U1: Yum. You will find me in the kitchen and if not i am fishing.
U2: Wow that’s pretty cool what else you do for fun?
R: Probably fish it is great.
Coherence Score (Human / QuantiDCE / GRADE): 2.50 / 3.94 / 4.27

Table 5: Two representative examples showing the strength and weakness of our QuantiDCE, where U1 and U2 are two utterances of the context and R is the corresponding response.

Number of Data for Fine-Tuning. Moreover, we also investigated how the scale of the fine-tuning data affects the model performance by increasing the amount of fine-tuning data by 5% each time from zero. The trend of the model performance is presented in Figure 5. We observed that minimizing our KD-MSE loss made the correlation results increase gradually after an initial decrease.¹⁰ More specifically, the results reached the level attained before fine-tuning at around the 70% data scale and continued increasing until 100%, with a final improvement of around 2% points. For comparison, the performance trends of MSE and MSE (fix encoder) are also provided. These results show overall decreasing trends, which indicates that the model trained by MSE or MSE (fix encoder) cannot benefit from the increase of data scale, due to severe over-fitting or under-fitting. Therefore, to effectively utilize the limited data, it is important to enable the update of the entire network while adding constraints to avoid over-fitting, such as our proposed KD regularization.

4.4 Case Study

To illustrate the performance of QuantiDCE, two representative examples are shown in Table 5. The first example shows the strength of QuantiDCE: the coherence score given by ours is closer to the human rating score, compared with the extremely high score given by GRADE. However, in the second example, both our QuantiDCE and GRADE deviate from the human score, possibly because the number of coherence levels we adopted in this work (L = 3) is insufficient, as humans usually consider more levels of dialogue coherence.

¹⁰ The initial decrease is probably attributable to the randomness of data sampling: the smaller the sampling ratio, the higher the probability that noisy samples dominate the sampled data, and overfitting to these noisy samples leads to the performance decrease.


5 Conclusion

In this paper, we propose QuantiDCE, a novel training framework aiming to bridge the gap between the training objective and the actual human rating and to train a quantifiable dialogue coherence metric. In general, QuantiDCE includes two training stages: MLR pre-training for learning the coarse human judgements of dialogue coherence degrees, and KD fine-tuning for learning the actual human rating standards. Experimental results show that the metric trained by QuantiDCE presents strong correlations with human judgements. For future work, it would be interesting to investigate a more efficient way to obtain multi-level data and to extend the multi-level setting to the general evaluation of natural language generation.

Acknowledgments

We thank all anonymous reviewers for their constructive comments. This work was supported in part by National Key R&D Program of China under Grant No. 2020AAA0109700, National Natural Science Foundation of China (NSFC) under Grant No. U19A2073 and No. 61976233, Guangdong Province Basic and Applied Basic Research (Regional Joint Fund-Key) Grant No. 2019B1515120039, Shenzhen Fundamental Research Program (Project No. RCYX20200714114642083, No. JCYJ20190807154211365), Zhijiang Lab's Open Fund (No. 2020AA3AB14) and CSIG Young Fellow Support Fund.

References

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.

Alessandra Cervone, Evgeny Stepanov, and Giuseppe Riccardi. 2018. Coherence models for dialogue. In Proceedings of the 19th Annual Conference of the International Speech Communication Association, pages 1011–1015.

Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2016. Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander H. Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, Shrimai Prabhumoye, Alan W. Black, Alexander I. Rudnicky, Jason Williams, Joelle Pineau, Mikhail Burtsev, and Jason Weston. 2019. The second conversational intelligence challenge (ConvAI2). Computing Research Repository, arXiv:1902.00098. Version 1.

Sarik Ghazarian, Johnny Wei, Aram Galstyan, and Nanyun Peng. 2019. Better automatic evaluation of open-domain dialogue systems with contextualized embeddings. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 82–89, Minneapolis, Minnesota. Association for Computational Linguistics.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. Computing Research Repository, arXiv:1503.02531. Version 1.

Lishan Huang, Zheng Ye, Jinghui Qin, Liang Lin, and Xiaodan Liang. 2020. GRADE: Automatic graph-enhanced coherence metric for evaluating open-domain dialogue systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9230–9240, Online. Association for Computational Linguistics.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4163–4174, Online. Association for Computational Linguistics.

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2020. Supervised contrastive learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, Austin, Texas. Association for Computational Linguistics.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Zibo Lin, Deng Cai, Yan Wang, Xiaojiang Liu, Haitao Zheng, and Shuming Shi. 2020. The world is not binary: Learning to rank with grayscale data for dialogue response selection. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9220–9229, Online. Association for Computational Linguistics.

Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132, Austin, Texas. Association for Computational Linguistics.

Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic Turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1116–1126, Vancouver, Canada. Association for Computational Linguistics.

Mohsen Mesgar, Sebastian Bucker, and Iryna Gurevych. 2020. Dialogue coherence assessment without explicit dialogue act labels. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1439–1450, Online. Association for Computational Linguistics.

Jekaterina Novikova, Ondrej Dusek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241–2252, Copenhagen, Denmark. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards empathetic open-domain conversation models: A new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5370–5381, Florence, Italy. Association for Computational Linguistics.

Ananya B. Sai, Akash Kumar Mohankumar, Siddhartha Arora, and Mitesh M. Khapra. 2020. Improving dialog evaluation with a multi-reference adversarial dataset and large scale pretraining. Computing Research Repository, arXiv:2009.11321. Version 1.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.

Raphael Sourty, Jose G. Moreno, Francois-Paul Servant, and Lynda Tamine-Lechani. 2020. Knowledge base embedding by cooperative knowledge distillation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5579–5590, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Jingyuan Sun, Shaonan Wang, Jiajun Zhang, and Chengqing Zong. 2020. Distill and replay for continual language learning. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3569–3579, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2018. RUBER: An unsupervised method for automatic evaluation of open-domain dialog systems. In AAAI.

Hao-Ran Wei, Shujian Huang, Ran Wang, Xin-yu Dai, and Jiajun Chen. 2019. Online distilling from checkpoints for neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1932–1941, Minneapolis, Minnesota. Association for Computational Linguistics.

Ye Yuan, Wuyang Chen, Yang Yang, and Zhangyang Wang. 2019. In defense of the triplet loss again: Learning robust person re-identification with fast approximated triplet loss and label distillation. Computing Research Repository, arXiv:1912.07863. Version 2.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.


A Training Details of the Learnable Baseline Metrics

a) Following Sai et al. (2020), we trained ADEM by first initializing it with the official checkpoint and further finetuning it on DailyDialog++ with a target of 5 for level-3 pairs and 1 for level-1 pairs; b) BERT-RUBER and GRADE were both trained on DailyDialog++ with level-3 pairs as positive samples and both level-1 and level-2 pairs as negative samples, except that the former uses a cross-entropy loss while the latter uses a ranking loss; c) BLEURT was initialized with the officially recommended checkpoint BLEURT-Base and finetuned on DailyDialogEVAL by following the official guidelines¹¹.

¹¹ https://github.com/google-research/bleurt

B Details of the Pre-Training Losses

BCE Loss. The binary cross entropy (BCE) loss is adopted for the experiments of the two-level setting, where both the adversarial irrelevant responses and the randomly selected responses of the DailyDialog++ dataset (Sai et al., 2020) are treated as negative samples and labeled as 0, while the golden reference responses are treated as positive samples and labeled as 1.

Margin Ranking Loss. Similarly, the margin ranking loss simplifies the evaluation task into a two-level setting and maximizes the differences between the positive coherent dialogues and the negative incoherent ones. As the name suggests, the focus of the margin ranking loss is ranking: it aims to rank the scores of positive coherent dialogues ahead of the negative incoherent ones.

SupCon Loss. The supervised contrastive (SupCon) loss (Khosla et al., 2020), which pulls the positive anchors closer and pushes the negatives farther away in representation space, can be adopted for the multi-level setting. Here, for our multi-level setting, we consider the dialogues of level-1, level-2 and level-3 as positive anchors successively, and the remaining two levels as the corresponding negatives.

FAT Loss. The fast-approximated triplet (FAT) loss (Yuan et al., 2019) replaces the traditional point-to-point distances of the triplet loss with point-to-cluster distances through an upper-bound relaxation of the triplet form, which was first applied to the classification task and considerably reduces the computation cost. To use the FAT loss in our evaluation task, we consider the different coherence levels as different classes and apply the FAT loss to separate the context-response pairs with different coherence levels.

Vanilla MLR Loss. The vanilla MLR loss (Lin et al., 2020) extends the margin ranking loss to a multi-level version by repeatedly applying the original margin ranking loss between different levels, and can be directly applied to our evaluation task. A sketch is given below.
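As a point of comparison with the proposed MLR loss, the following is a small PyTorch sketch of this vanilla multi-level ranking baseline, assuming the same (L, K) score layout as before and applying a hinge between every pair of levels; the margin value and the pairwise (rather than adjacent-level-only) application are assumptions of this sketch.

```python
import torch

def vanilla_mlr_loss(scores: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    """Vanilla multi-level ranking baseline: margin ranking applied between levels.

    scores: (L, K) predicted coherence scores, one row per level (ascending).
    margin: ranking margin (illustrative value).
    """
    L, _ = scores.shape
    loss = scores.new_zeros(())
    for j in range(L - 1):
        for l in range(j + 1, L):
            # every score of the higher level l should exceed every score of level j by the margin
            diff = scores[j].unsqueeze(1) - scores[l].unsqueeze(0) + margin
            loss = loss + torch.clamp(diff, min=0.0).mean()
    return loss
```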

C Visualizations of the Pre-Training Losses

We have already compared the visualization results of the BCE loss and the FAT loss. As a supplement, here we introduce the visualizations of the margin ranking loss, the SupCon loss and the vanilla MLR loss in detail.

As we can see in Figure 6, (a) the margin ranking loss cannot separate the level-1 scores from the level-2 ones and the corresponding features are also mixed, which is similar to the BCE loss; (b) the SupCon loss, on the other hand, can distinguish the features and scores of the three levels to some extent, and the scores of different levels are also separated but do not follow the real rank order, i.e., level-1 < level-2 < level-3; (c) the vanilla MLR loss can separate the context-response pairs with different coherence levels in feature space, and the predicted scores also follow the actual rank order; however, its score distributions are not compact enough for level-1 and level-3.

Figure 6: Visualizations of features (the scatter plots in the upper row) and scores (the violin plots in the lower row) on the DailyDialog++ dataset. The features and scores in each of the three columns are obtained from the metric model M only pretrained with the margin ranking loss, the SupCon loss and the vanilla MLR loss, respectively.