
Evaluating Explanation Without Ground Truth in Interpretable Machine Learning

Fan Yang, Mengnan Du, Xia Hu
Department of Computer Science and Engineering

Texas A&M University
{nacoyang, dumengnan, xiahu}@tamu.edu

ABSTRACT
Interpretable Machine Learning (IML) has become increasingly important in many real-world applications, such as autonomous cars and medical diagnosis, where explanations are strongly preferred to help people better understand how machine learning systems work and to further enhance their trust in these systems. However, due to the diversified scenarios and the subjective nature of explanations, we rarely have ground truth for benchmark evaluation of the quality of generated explanations in IML. Having a sense of explanation quality not only matters for assessing system boundaries, but also helps to realize the true benefits to human users in practical settings. To benchmark evaluation in IML, in this article we rigorously define the problem of evaluating explanations and systematically review existing state-of-the-art efforts. Specifically, we summarize three general aspects of explanation (i.e., generalizability, fidelity and persuasibility) with formal definitions, and review the representative methodologies for each of them under different tasks. Further, a unified evaluation framework is designed according to the hierarchical needs of developers and end-users, which can be easily adopted in different practical scenarios. In the end, open problems are discussed, and several limitations of current evaluation techniques are raised for future exploration.

1. INTRODUCTION
Serving as one of the most significant driving forces behind the boom of artificial intelligence, machine learning plays a vital role in many real-world systems, ranging widely from spam filters to humanoid robots. To handle tasks that are increasingly complicated in practice, more and more sophisticated machine learning systems, such as deep learning models [21], are designed for accurate decision making. Despite their superior performance, these complex systems are typically hard for human users to interpret, which largely limits their application in high-stakes scenarios such as self-driving vehicles and medical treatment, where explanations are important and necessary for scrutable decisions [36].


[Diagram panels: "Machine Learning Process" — Input, ML Model, Prediction ("This is a husky (p = 0.93)"), User; "Interpretable Machine Learning Process" — Training Data, IML Model, Interface ("This is a husky because of ..."), User. Husky image © University of Toronto.]

Figure 1: Illustration of the IML techniques. We compare the two different pipelines of machine learning (ML) and IML. It is worth noting that an IML model is capable of providing specific reasons for particular machine decisions, while an ML model may simply provide the prediction results with probability scores. Here, we employ the image classification task as an example, where the IML model can tell which parts of the image contribute to classifying the animal as a husky, while the ML model may only tell the overall confidence of the husky classification result.

To this end, the concept of interpretable machine learning (IML) has been formally raised [6], aiming to help humans better understand machine learning decisions. We illustrate the core idea of IML techniques in Figure 1.

IML is a new branch of machine learning techniques that has attracted mounting attention in recent years (shown by Figure 2), focusing on decision explanation beyond instance prediction. IML is typically employed to extract useful information, from either the system structure or the prediction result, as explanations to interpret relevant decisions. Although IML techniques have been comprehensively discussed in terms of methodology and application [7], insights on the IML evaluation perspective are still rather limited, which significantly impedes the progress of IML toward a rigorous scientific field. To precisely reflect the boundaries of IML systems and measure the benefits that explanations bring to human users, effective evaluations are pivotal and indispensable. Different from conventional evaluation, which relies purely on model performance, IML evaluation also needs to pay attention to the quality of the generated explanations, which makes it hard to handle and benchmark.



[Bar chart "Trend of Research for IML": yearly publication counts ("# Publication for IML based on Google Scholar", y-axis from 0 to 800) with an exponential trendline ("Expon. Trendline").]

Figure 2: Tendency of the IML research in recent years. In particular, we present the number of research publications related to IML from 2010 to 2018, and plot the trendline according to the statistics. The relevant numbers are collected from Google Scholar with the keywords "interpretable machine learning". We believe the actual numbers are even larger than those provided, since some other terms closely related to IML, such as "explainable", are ignored during collection. From the results, we can see that IML-related publications have been increasing exponentially, and much more attention has been paid to this field.

Evaluating explanation in IML is a challenging problem, since we need to strike a balance between the objective and subjective perspectives when designing experiments. On one hand, different users could have different preferences for what a good explanation should be under different scenarios [6], so it is not practical to benchmark IML evaluation with a common set of ground truth for objective evaluations. For example, when deploying self-driving cars with IML, system engineers may consider sophisticated explanations to be good ones for safety reasons, while car drivers may prefer concise explanations, because complex ones could be too time-consuming for decision making during driving. On the other hand, there might be more criteria beyond human subjective satisfaction. Human-preferred explanations may not always represent the full working mechanism of a system, which could lead to poor generalization performance. It has been shown that subjective satisfaction with explanations largely depends on the response time of human users, and has no clear relation to accuracy performance [19]. This finding directly supports the fact that human satisfaction cannot be regarded as the sole standard when evaluating explanation. Besides, fully subjective evaluations would also raise ethical issues, because it is unethical to manipulate an explanation to better cater to human users [14]. Seeking human satisfaction excessively could cause explanations to persuade users instead of actually interpreting systems.

Considering the aforementioned challenges, in this article we aim to pave the way for benchmark evaluation of the explanations generated by IML techniques. First, we give an overview of the explanations in IML, and categorize them by a two-dimensional standard (i.e., interpretation scope and interpretation manner) with representative examples. Then, we summarize three general properties of explanation (i.e., generalizability, fidelity and persuasibility) with formal definitions, and rigorously define the problem of evaluating explanation in the IML context. Next, following those properties, we conduct a systematic review of existing work on explanation evaluation, with a focus on different techniques in various applications. Moreover, we also review some other special properties for explanation evaluation, which are typically considered under particular scenarios. Further, with the aid of those general properties, we design a unified evaluation framework aligned with the hierarchical needs of both model developers and end-users. At last, we raise several open problems for current evaluation techniques, and discuss some potential limitations for future exploration.

2. EXPLANATION AND EVALUATION
In this section, we first introduce the explanations we particularly focus on, and give an overview of the categories of explanations in IML. Then, three general properties of explanation are summarized for evaluation tasks according to different perspectives. Finally, we formally define the problem of evaluating explanations in IML with the aid of those general properties.

2.1 Explanation Overview
In the context of IML, explanations refer in particular to the information that can help human users interpret either the learning process or the prediction results of machine learning models. With different focuses, explanations in IML can take diversified forms with various characteristics, such as local heatmaps for instances and decision rules for models. In this article, we categorize the explanations in IML with a two-dimensional standard, covering both interpretation scope and interpretation manner. As for the scope dimension, explanations can be classified into global and local ones, where a global explanation indicates the overall working mechanism of a model by interpreting its structure or parameters, and a local explanation reflects the particular model behaviour for an individual instance by analyzing specific decisions. Regarding the manner dimension, we can divide explanations into intrinsic and posthoc (also written as post-hoc or post hoc) ones. An intrinsic explanation is typically achieved by a self-interpretable model that is transparent by design, while a posthoc one requires another independent interpretation model or technique to provide explanations for the target model. The two-dimensional taxonomy of explanations in IML is illustrated by Figure 3.


[Diagram: a 2x2 grid over Interpretation Scope (Global, Local) and Interpretation Manner (Intrinsic, Posthoc), illustrated by (a) a decision tree (root node, split conditions, leaf nodes as decisions); (b) an attention mechanism (weights between input and output sequences); (c) mimic learning (deep teacher model distilled into a shallow student model); (d) an instance heatmap (heavily and lightly heated regions).]

Figure 3: A two-dimensional categorization for explanations in IML, covering interpretation scope and interpretation manner. According to the two-dimensional standard, we can divide explanations into four different groups: (a) intrinsic-global; (b) intrinsic-local; (c) posthoc-global; (d) posthoc-local. For each category, we attach a representative example for illustration. In particular, we employ the decision tree as the example for intrinsic-global explanations, the attention mechanism for intrinsic-local ones, mimic learning for posthoc-global ones, and the instance heatmap for posthoc-local ones.

The first category is intrinsic-global explanation. This type of explanation is well represented by some conventional machine learning techniques, such as rule-based systems and decision trees, which are self-interpretable and capable of showing the overall working patterns for prediction. Take the decision tree for example: the intuitive structure, as well as the set of all decision branches, constitutes the corresponding intrinsic-global explanation. The second category is intrinsic-local explanation, which is associated with specific input instances. A typical example is the attention mechanism applied to sequential models, where the generated attention weights can help interpret particular predictions by indicating the important components. Attention models are widely used in both image captioning and machine translation tasks. Posthoc-global explanation serves as the third category, and a representative example is mimic learning for deep models. In mimic learning, the teacher is usually a deep model, while the student is typically a shallow model that is easier to interpret. The overall process of mimic learning can be regarded as a distillation process from the teacher to the student, where the interpretable student model provides a global view of the deep teacher model in a posthoc manner. The posthoc-local explanation fills the last part of the taxonomy. We introduce this category with the example of the instance heatmap, which is used to visualize input regions with attribution scores (i.e., quantified importance indicators). Instance heatmaps work well for both image and text, and are capable of showing the local behaviour of the target model. Since a heatmap depends on the particular input and does not involve the specific model design, it is a typical local explanation obtained in a posthoc way.
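For concreteness, the short Python sketch below is our own illustration rather than a method from the surveyed work; the tiny convolutional model and the random input are placeholders. It shows one common way such an instance heatmap can be produced, by taking per-pixel gradients of the class score for a single instance.

```python
import torch
import torch.nn as nn

# A tiny stand-in classifier; any differentiable image model would do here.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))
model.eval()

def saliency_heatmap(model, x, target_class):
    """Posthoc-local explanation: per-pixel |d score / d input| for one instance."""
    x = x.clone().requires_grad_(True)
    score = model(x)[0, target_class]           # class score for this input
    score.backward()                            # gradients w.r.t. the pixels
    return x.grad.abs().sum(dim=1).squeeze(0)   # (H, W) attribution map

heatmap = saliency_heatmap(model, torch.rand(1, 3, 32, 32), target_class=1)
print(heatmap.shape)                            # torch.Size([32, 32])
```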

2.2 General Properties of Explanation
To formally define the problem of evaluating explanations in IML, it is important to clarify the general properties of explanation for evaluation. In this article, we summarize three significant properties from different perspectives, i.e., generalizability, fidelity and persuasibility, where each property corresponds to one specific aspect of evaluation. The intuitions behind these properties are illustrated in Figure 4.

[Diagram: the three general properties. Generalizability — measures the generalization power of explanations ("Does the knowledge in the explanation generalize well?"). Fidelity — measures the faithfulness degree of explanations ("Does the explanation reflect the target system well?"). Persuasibility — measures the usefulness degree of explanations ("Do humans comprehend the explanation well and find it satisfying?").]

Figure 4: Three general properties for explanations in IML, including generalizability, fidelity and persuasibility. Each property essentially corresponds to one specific aspect in evaluation. Generalizability focuses on the generalization power of explanation. Fidelity focuses on the faithfulness degree of explanation. Persuasibility focuses on the usefulness degree of explanation.

The first general property is generalizability, which is used to reflect the generalization power of explanation. In real-world applications, human users employ explanations from IML techniques mainly to obtain insights into the target system, which naturally raises the demand for explanation generalization performance. If a set of explanations generalizes poorly, it can hardly be regarded as high quality, since the knowledge and guidance it provides would be rather limited in practice. One thing to clarify is that the explanation generalization mentioned here is not necessarily equal to the model's predictive power, unless the model itself is interpretable with self-explanations (e.g., a decision tree). By measuring the generalizability of explanations, users can get a sense of how accurate the generated explanations are for specific tasks.

Definition 1: We define the generalizability of explanation in IML as an indicator of generalization performance, with regard to the knowledge and guidance delivered by the corresponding explanation.

The second general property is fidelity, which is used to indicate how faithful explanations are to the target system. Faithful explanations are always preferred by humans, because they precisely capture the decision-making process of the target system and show the correct evidence for particular predictions. High-quality explanations need to be faithful, since they essentially serve as important tools for users to understand the target system. Without sufficient fidelity, explanations can only provide limited insights into the system, which degrades the usefulness of IML to human users. To guarantee the relevance of explanations, we need fidelity when conducting explanation evaluation in IML.

Definition 2: We define the fidelity of explanation in IML as the faithfulness degree with regard to the target system, aiming to measure the relevance of explanations in practical settings.



[Diagram: IML Evaluation branches into Model Evaluation ("to check the quality of prediction": predictability) and Explanation Evaluation ("to check the quality of explanation": generalizability, fidelity, persuasibility), with further properties entangled with both (robustness, capability, certainty, ...).]

Figure 5: Illustration of IML evaluation. Basically, IML evaluation can be divided into model evaluation and explanation evaluation. For model evaluation, we focus on the predictability of the system, and evaluate the quality of prediction. For explanation evaluation, we focus on generalizability, fidelity and persuasibility, and evaluate the quality of explanation. Besides, there are also some special properties that are entangled with both model and explanation; we list robustness, capability and certainty here for instance. In this paper, we specifically focus on the aspects which are related to explanation evaluation.

The third general property is persuasibility, which reflects the degree to which humans comprehend and respond to the generated explanations. This property handles the subjective aspect of explanation, and is typically measured with human involvement. Good explanations are most likely to be easily comprehended and to facilitate quick responses from human users. For different user groups or application scenarios, one specific set of explanations could have different persuasibility due to diversified preferences. Thus, persuasibility should only be discussed under the same setting of users and tasks.

Definition 3: We define the persuasibility of explanation in IML as the usefulness degree to human users, serving as the measure of subjective satisfaction or comprehensibility for the corresponding explanation.

2.3 Explanation Evaluation Problem
With the definitions of the three general properties of explanation, we further introduce and formally define the problem of evaluating explanations in IML. Technically, IML evaluation can be divided into two parts: model evaluation and explanation evaluation, as shown in Figure 5. As for the model evaluation part, the goal is to measure the predictive power of IML systems, which is identical to that of common machine learning systems and can be directly achieved with conventional metrics (e.g., accuracy, precision, recall and F1-score). Explanation evaluation, however, differs from model evaluation in both its objective and its methodology. Since an explanation typically involves more than one perspective and has no common ground truth across different scenarios, traditional model evaluation techniques cannot be directly applied. In this article, we specifically focus on the second part of IML evaluation, i.e., explanation evaluation, and rigorously define the problem as follows.

Definition 4: The explanation evaluation problem within the IML context is to assess the quality of the explanations generated by systems, where high-quality explanations correspond to large values of generalizability, fidelity and persuasibility under relevant measurements. In general, a good explanation ought to be well generalized, highly faithful and easily understood.

3. EVALUATION REVIEW
In this section, we conduct a systematic review of the explanation evaluation problem in IML, following the properties of explanation summarized above. For each property, we mainly review the primary evaluation methodologies for practical tasks, and shed light on why they are reasonable. After reviewing evaluations of generalizability, fidelity and persuasibility, we also cover some other special aspects that are typically entangled with model evaluation, including robustness, capability and certainty.

3.1 Evaluation on Generalizability
Existing work related to generalizability evaluation mainly focuses on IML systems with intrinsic-global explanations. Since intrinsic-global explanations are typically presented and organized in the form of prediction models, it is straightforward and convenient to evaluate generalizability by applying those explanations to test data and observing the corresponding generalization performance. Under this scenario, the generalizability evaluation task is essentially equivalent to model evaluation, where traditional model performance indicators (e.g., accuracy and F1-score) can be employed as metrics to quantify explanation generalizability. Conventional examples for this scenario include generalized linear models (with informative coefficients) [27], decision trees (with structured branches) [32], K-nearest neighbors (with significant instances) [2] and rule-based systems (with discriminative patterns) [3]. In general, the generalizability evaluation of intrinsic-global explanations can be easily converted into a model evaluation task, in which generalizability is positively correlated with model prediction performance. Take the recent work on decision sets [20] for example: the authors use AUC scores, a common metric for classification tasks in model evaluation, to indicate the generalizability of explanations in the form of decision rule sets. Similarly, in recent work [23], AUC scores are employed to reflect explanation generalizability indirectly.

Besides, there is another branch of work focusing on the generalizability of posthoc-global explanations. The relevant evaluation method is similar to that used for intrinsic-global ones, where model evaluation techniques can be employed to indicate explanation generalizability. The major difference lies in the fact that the explanations applied to test data are not directly associated with the target system, but are closely related to interpretable proxies extracted or learned from the target. Those proxies typically serve the role of interpreting the target system, which is either a black box or a sophisticated model. Representative examples for this scenario can be found in knowledge distillation [11] and mimic learning [5], where the common focus is to derive an interpretable proxy from the black-box neural model for providing explanations.


For example, in [5], the authors employ Gradient Boosted Trees (GBT) as the interpretable proxy to explain the working mechanism of neural networks. The constructed GBT is capable of providing feature importances as explanation, and is assessed by model evaluation techniques with AUC scores to show the generalizability of the corresponding explanations. The generalizability of a posthoc-global explanation typically correlates positively with the model performance of the derived interpretable proxy.
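To make both routes concrete, the following minimal sketch is an illustration with synthetic data, not the exact experimental setups of [5] or [20]; the gradient-boosted "teacher" and the shallow decision-tree surrogate are placeholders. It scores an intrinsic-global explanation and a posthoc-global proxy with held-out AUC using scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary task standing in for a real application.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Intrinsic-global: the interpretable model itself is the explanation, so
# generalizability evaluation reduces to ordinary model evaluation on test data.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
print("intrinsic-global AUC:", roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1]))

# Posthoc-global: a shallow surrogate is fit to the predictions of a black-box
# "teacher"; the surrogate's own test AUC proxies explanation generalizability.
teacher = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
surrogate = DecisionTreeClassifier(max_depth=4, random_state=0).fit(
    X_tr, teacher.predict(X_tr))
print("posthoc-global AUC:", roc_auc_score(y_te, surrogate.predict_proba(X_te)[:, 1]))
```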

3.2 Evaluation on Fidelity
Though quite important for explanation evaluation, fidelity may not be explicitly considered for intrinsic explanations. In fact, the intrinsic nature of such explanations is sufficient to guarantee that they capture the exact working mechanism of the target IML system, and the corresponding explanations can thus be treated as faithful ones with full fidelity. The interpretable decision set [20] is a good example here. The learned decision set is self-interpretable and explicitly presents the decision rules for the classification task at hand. In this example, the explanation rules faithfully reflect the model's prediction behaviour, and there is no inconsistency between the IML system's predictions and the relevant explanations. This kind of complete accordance between model and explanation is exactly what full fidelity indicates.

However, different from intrinsic ones, posthoc-global explanations in the form of interpretable proxies cannot be regarded as having full fidelity, since the derived proxies usually work differently from the target system. Although most proxies are derived to approximate the behaviour of the target system, a proxy is still constructed as a different model for the task. Existing work related to fidelity evaluation for interpretable proxies mainly uses the difference in prediction performance to indicate the degree of fidelity. For instance, in [5], the authors conducted experiments with several sets of teacher-student models, where the teacher is the target model and the student is the proxy model. During the evaluation, the prediction differences between corresponding teachers and students are used to reflect the fidelity of the derived proxies, and preferred faithful proxies are shown to have only minor losses in performance.

Moreover, due to their posthoc manner and local nature, no posthoc-local explanation is fully faithful to the target IML system. Among existing work, common ways to measure fidelity for posthoc-local explanations are ablation analysis [28] and meaningful perturbations [10], where the core idea is to check the prediction variation after adversarial changes are made according to the generated explanations. The philosophy behind this kind of methodology is simple: modifications to input instances that are in accordance with the generated explanations should bring about significant differences in model predictions if the explanations are faithful to the target system. A typical example can be found in the image classification task with deep neural networks [33], where the fidelity of generated posthoc-local explanations is evaluated by measuring the prediction difference between the original image and a perturbed image. The overall logic is to mask the attributed regions in images indicated by the explanations, and then check the extent of the prediction variation: the larger the difference, the more faithful the generated explanations are. In addition to image classification, ablation- and perturbation-based fidelity evaluation methods have also been effectively used in text classification [8], recommender systems [38] and adversarial detection [24]. Furthermore, for posthoc-local explanations in the form of training samples [18] and model components [26], ablation and perturbation operations can be applied as well to evaluate explanation fidelity.
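The ablation/perturbation idea can be sketched in a few lines; the illustrative function below (the toy linear "model", the masking baseline and the fraction k are our own assumptions, not taken from the cited work) reports the confidence drop after masking the top-attributed features of one instance.

```python
import numpy as np

def deletion_fidelity(predict, x, attribution, target, k=0.1, baseline=0.0):
    """Perturbation-based fidelity check for a posthoc-local explanation.

    predict:     callable mapping an input vector to class probabilities
    x:           a single input (1-D feature vector here, for simplicity)
    attribution: per-feature importance scores produced by the explainer
    target:      index of the class being explained
    k:           fraction of top-attributed features to ablate
    """
    p_orig = predict(x)[target]
    top = np.argsort(-np.abs(attribution))[: max(1, int(k * x.size))]
    x_pert = x.copy()
    x_pert[top] = baseline                     # mask the "important" features
    p_pert = predict(x_pert)[target]
    # A faithful explanation should cause a large confidence drop when its
    # highlighted evidence is removed.
    return p_orig - p_pert

# toy usage: a linear "model" whose true evidence is features 0 and 3
w = np.zeros(10)
w[[0, 3]] = 3.0
predict = lambda v: np.array([1 - 1 / (1 + np.exp(-w @ v)), 1 / (1 + np.exp(-w @ v))])
x = np.ones(10)
good_attr = np.abs(w)                          # faithful attribution
bad_attr = np.random.RandomState(0).rand(10)   # unfaithful attribution
print(deletion_fidelity(predict, x, good_attr, target=1, k=0.2))   # large drop
print(deletion_fidelity(predict, x, bad_attr, target=1, k=0.2))    # ~no drop
```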

3.3 Evaluation on Persuasibility
To effectively evaluate the persuasibility of generated explanations, human annotation is widely used, especially in uncontentious tasks such as object detection. Annotation-based evaluation is usually regarded as objective, since the relevant annotations do not change across different groups of users. In computer vision related tasks, the most common annotations for persuasibility evaluation are bounding boxes [34] and semantic segmentations [25]. An appropriate example can be found in recent work [33], which utilizes bounding boxes to evaluate the persuasibility of explanations and employs the Intersection over Union (IoU, or Jaccard index) metric to quantify persuasibility performance. As for annotations with semantic segmentation, recent work [40] employs the pixel-level difference as the metric to measure the corresponding persuasibility of explanations. Moreover, in natural language processing, a similar form of human annotation, named rationales [22], has been extensively used for evaluation; a rationale is a subset of features highlighted by annotators and regarded as important for prediction. Through these different forms of annotation, the persuasibility of explanation can be objectively evaluated against human-grounded truth, which typically stays consistent across different groups of users for one particular task. Due to the one-to-one correspondence between annotation and instance, annotation-based evaluation is usually applied to local explanations instead of global ones.

Conducting persuasibility evaluation with human annotation does not work well in complicated tasks, since the related annotations may not stay consistent across different user groups. Under those circumstances, employing users in human studies is the common way to evaluate the persuasibility of explanation. To appropriately design relevant human studies, both machine learning experts and human-computer interaction (HCI) researchers actively explore this area [1, 14], and have proposed several metrics for human evaluation of general explanations from IML techniques, such as mental model [20], human-machine performance [9], user satisfaction [19] and user trust [15]. Take the recent work [19] for instance: the authors focus on user satisfaction in evaluating persuasibility, and specifically employ human response time and decision accuracy as auxiliary metrics. The whole study is conducted on two different domains with three types of explanation variation, aiming to characterize the relation between explanation quality and human cognition. With the aid of human studies, the persuasibility of explanation can be evaluated under a more complicated and practical setting, with regard to specific user groups and application domains. By directly collecting measurements from human users, we can assess the usefulness of explanations in real-world applications when determining explanation quality.


Since human studies can be designed flexibly according to diversified needs and goals, this methodology is generally applicable to all kinds of explanations for persuasibility evaluation within the IML context.
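As a small illustration of annotation-based persuasibility evaluation, the sketch below computes the IoU between a binarized saliency map and a bounding-box annotation; the binarization threshold and the toy arrays are our own placeholder choices rather than the exact protocol of [33].

```python
import numpy as np

def explanation_iou(heatmap, bbox, threshold=0.5):
    """IoU between a saliency map and a human bounding-box annotation.

    heatmap: (H, W) array of attribution scores in [0, 1]
    bbox:    (top, left, bottom, right) annotated object region
    """
    pred = heatmap >= threshold                  # binarize the explanation
    truth = np.zeros_like(pred, dtype=bool)
    t, l, b, r = bbox
    truth[t:b, l:r] = True
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return inter / union if union else 0.0

# toy usage: an explanation concentrated inside the annotated box scores higher
h = np.zeros((8, 8))
h[2:5, 2:5] = 1.0
print(explanation_iou(h, (2, 2, 6, 6)))          # 9 / 16 = 0.5625
```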

3.4 Evaluation on Other Properties
Besides generalizability, fidelity and persuasibility, existing work also considers some other properties when evaluating explanations in IML. We introduce those properties separately for the following two reasons. First, those properties are not representative and general for explanation evaluation across IML systems, and are only considered under specific architectures or applications. Second, those properties relate to both the prediction model and the generated explanation, and typically need novel, special designs to evaluate. In this part, we particularly focus on the following three properties.

Robustness. Similar to machine learning models, the explanations generated by IML systems can also be fragile to adversarial perturbations, especially posthoc ones from neural architectures [12]. Explanation robustness is primarily designed to measure how similar the explanations are for similar instances. Recent work [33, 39] conducts robustness evaluation for explanations with sensitivity metrics, beyond the evaluation of the three general properties we summarize. Robust explanations are always preferred in building a trustworthy IML system for human users. To obtain explanations with high robustness, a stable prediction model and a reliable explanation generation algorithm are usually the two most important keys.

Capability. Another property for explanation evaluation is capability, which is used to indicate the extent to which the corresponding explanations can be generated. Commonly, this property is evaluated on explanations generated by search-based methodologies [37], instead of those obtained from gradient-based [33] or perturbation-based [31] methodologies. A typical example of capability evaluation can be found in work [38] applied to recommender systems, where the authors employ explainability precision and explainability recall as metrics to indicate the capability strength. Similar to robustness, capability is also related to the target prediction model, which essentially determines the upper bound of the ability to generate explanations.

Certainty. To further evaluate whether explanations reflect the uncertainty of the target IML system, existing work also focuses on the certainty aspect of explanation. Certainty is likewise a property related to both model and explanation, since an explanation can only provide an uncertainty interpretation as long as the corresponding IML system itself has a certainty measure. Recent work [29] gives an appropriate example of certainty evaluation. In this work, the authors consider IML systems under active learning settings, and propose a novel measure, named uncertainty bias, to evaluate the certainty of generated explanations. Specifically, explanation certainty is measured according to the discrepancy in the IML system's prediction confidence between one category and the others. In a similar vein, work [35] focuses on the certainty aspect of explanations as well, and provides insights on how confident users can be about particular outputs given the computed explanations in the form of flip sets (i.e., lists of actionable changes that users can make to flip the prediction of the target system). In essence, certainty evaluation and persuasibility evaluation can mutually support each other.
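A simple version of the sensitivity-style robustness check mentioned above, assuming the explainer is available as a callable and using random perturbations within a small L-infinity ball (our own simplification of the metrics in [33, 39]), is sketched below.

```python
import numpy as np

def explanation_sensitivity(explain, x, radius=0.05, n_samples=20, seed=0):
    """Local sensitivity of an explainer: how much do attributions move
    when the input is perturbed within a small L-infinity ball?

    explain: callable mapping an input vector to an attribution vector
    """
    rng = np.random.RandomState(seed)
    base = explain(x)
    worst = 0.0
    for _ in range(n_samples):
        noise = rng.uniform(-radius, radius, size=x.shape)
        worst = max(worst, np.linalg.norm(explain(x + noise) - base))
    return worst                               # smaller means more robust

# toy usage: gradient*input style attribution for a linear model with weights w
w = np.array([3.0, 0.0, -2.0, 1.0])
explain = lambda v: w * v
print(explanation_sensitivity(explain, np.ones(4)))
```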

4. DISCUSSION AND EXPLORATION
In this section, we first propose a unified framework to conduct general assessment of explanations in IML, according to the different levels of needs for evaluation. Then, several open problems in explanation evaluation are raised and discussed with regard to benchmarking issues. Further, we highlight some significant limitations of current evaluation techniques for future exploration.

4.1 Unified Framework for Evaluation
Despite the large body of work we have reviewed for explanation evaluation, different works typically have their own particular focus, depending on specific tasks, architectures, or applications. This makes it hard to benchmark the evaluation process for explanations in IML in the way that has been done for model evaluation. To pave the way toward benchmark evaluation of explanation, we construct a unified framework here by considering the general properties of explanations. To keep the framework general, we only take generalizability, fidelity and persuasibility into account, and do not consider the special properties that arise under particular scenarios.

4.1.1 Different levels of needs for evaluation
Although we review generalizability, fidelity and persuasibility separately, these three general properties are internally related to each other, and each represents a specific level of needs for evaluation. From the lower level to the higher level, we can order the properties as: generalizability, fidelity, persuasibility. Generalizability typically serves as the basic need in evaluation, since it forms the foundation for the other properties. In real-world applications, good generalizability is the precondition for human users to make accurate decisions with the generated explanations, guaranteeing that the explanations we employ are generalizable and reflect true knowledge for particular tasks. After that, a further demand of human users is to check whether the derived explanations at hand are reliable or not. This demand brings the fidelity property to the front. By assessing fidelity, better decisions can be made on whether to trust the IML system based on the relevance of its explanations. As for the higher demand of real effectiveness in practice, persuasibility is further considered to indicate the tangible impacts, directly bridging the gap between human users and machine explanations. For one specific task, explanation evaluation mainly depends on the corresponding applications and user groups, which determine the level of needs in evaluation design. Generally, model developers care more about the basic, lower-level properties, including generalizability and fidelity, while general end-users pay more attention to persuasibility at the higher level.

4.1.2 Hierarchical structure of the framework
The overall unified evaluation framework is designed hierarchically, according to the different levels of needs, as illustrated in Figure 6. In the bottom tier, the evaluation goal focuses on generalizability, where generated explanations are tested for their generalization power. In the middle tier, the goal is to evaluate fidelity with regard to the target IML system. The top tier aims to evaluate persuasibility, targeting specific applications and user groups.


[Diagram: a hierarchy of needs from lower to higher — Generalizability and Fidelity (focus on machine; targeted at model developers) form the lower tiers, and Persuasibility (focus on human; targeted at general end-users) forms the top tier.]

Figure 6: A unified hierarchical framework for explanation evaluation in IML. The whole framework consists of three tiers, corresponding to generalizability, fidelity and persuasibility, from the lower level to the higher level. Basically, the bottom and middle tiers focus on evaluation from the machine perspective, while the top tier concentrates on evaluation from the human perspective. To this end, the bottom and middle tiers are usually designed for model developers, and the top tier is designed for general end-users.

To have a unified evaluation for one particular task, each tier should have a consistent pipeline with a fixed set of data, users and metrics. The overall evaluation result can then be derived in an ensemble way, such as a weighted sum, where each tier is assigned an importance weight depending on the application and user group. This proposed hierarchical framework is generally applicable to most explanation evaluation problems, and could be extended with new components if necessary. With proper metrics, as well as a sensible ensemble scheme, the framework can effectively help human users measure the overall quality of explanations from IML techniques under given circumstances.
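One possible instantiation of the ensemble step, assuming each tier has already been scored in [0, 1] and the weights are chosen per application and user group, is sketched below; the specific weights are illustrative only.

```python
def overall_explanation_score(generalizability, fidelity, persuasibility,
                              weights=(0.3, 0.3, 0.4)):
    """Combine the three tier scores (each normalized to [0, 1]) into a single
    explanation-quality score; weights reflect the application and user group."""
    scores = (generalizability, fidelity, persuasibility)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# e.g., a developer-oriented setting might down-weight persuasibility:
print(overall_explanation_score(0.82, 0.75, 0.40, weights=(0.45, 0.45, 0.10)))
```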

4.2 Open Problems for Benchmark
To fully achieve a benchmark for explanation evaluation in real-world applications, there are still some open problems left to explore, which are listed and discussed as follows.

4.2.1 Generalizability for local explanations
Existing work on generalizability evaluation mainly focuses on global explanations, while limited effort has been devoted to local ones. The challenges in evaluating the generalizability of local explanations are two-fold: (1) local explanations cannot easily be organized into valid prediction models, which makes model evaluation techniques hard to apply directly; (2) local explanations only contain the partial knowledge learned by the target IML system, so special designs are required to ensure the evaluation has a specific local focus. Though there are no direct solutions, some insights from existing efforts may be inspiring. As for the first challenge, an approximated local classifier [31] could potentially be built to carry the local explanations, and the generalizability could then be assessed with model evaluation techniques on specified test instances. Moreover, for the second challenge, we could possibly employ local explanations, together with human simulated/augmented data, to train a separate classifier [16] for generalizability evaluation, where the task is essentially reduced from the original one and only involves the local knowledge under test.
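As a sketch of the first direction, the illustrative code below (the toy black box, the sampling radius and the neighbourhood set are all assumptions of ours) fits a LIME-style local surrogate around one instance and scores it on nearby held-out instances.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def local_generalizability(black_box, x0, X_neighbors, radius=1.0, seed=0):
    """One possible way to probe the generalizability of a *local* explanation:
    fit a simple surrogate around x0 (LIME-style) and score it on held-out
    instances that fall in the same neighbourhood."""
    rng = np.random.RandomState(seed)
    # perturb around the instance and label the perturbations with the black box
    Z = x0 + rng.normal(scale=radius, size=(500, x0.size))
    surrogate = LogisticRegression().fit(Z, black_box(Z))
    # the surrogate's coefficients act as the local explanation; its accuracy on
    # nearby real instances indicates how far that explanation generalizes
    return accuracy_score(black_box(X_neighbors), surrogate.predict(X_neighbors))

# toy usage with a hand-made decision function standing in for the target system
black_box = lambda X: (X[:, 0] + X[:, 1] > 0).astype(int)
x0 = np.array([0.2, -0.1, 0.5])
X_near = x0 + np.random.RandomState(1).normal(scale=1.0, size=(100, 3))
print(local_generalizability(black_box, x0, X_near))
```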

4.2.2 Fidelity for posthoc explanations
Among existing work, it is widely accepted that a good explanation should have high fidelity to the target IML system. However, in the posthoc setting, it might not be the case that faithful explanations are always the ones that human users prefer. During explanation evaluation, we typically assume that IML systems are well trained and capable of making reasonable decisions, but this assumption is hard to fully satisfy in practice. As a result, the generated posthoc explanations may not be of high quality due to inferior model performance, even though they are highly faithful to the target system. Thus, designing a novel methodology for posthoc fidelity evaluation that considers both the model and the explanation is of great importance. In general, how to utilize model performance to guide the measurement of posthoc explanation fidelity is the key problem in tackling this challenge, where the ultimate goal is to help human users select explanations of good quality from the fidelity perspective.

4.2.3 Persuasibility for global explanations
As for persuasibility, it is also challenging to conduct effective evaluations of global explanations, whether using annotation-based methods or human studies. The main reason lies in the fact that global explanations in real applications are very sophisticated, which makes it hard to create annotations or select appropriate users for studies. Essentially, the global nature requires the selected annotators or users to have a comprehensive understanding of the target task; otherwise the evaluation results would be less convincing or even misleading. Besides, global explanations in practice typically contain vast amounts of information, which can make persuasibility evaluation extremely time-consuming. One possible solution is to use simplified or proxy tasks to simulate the original one, as mentioned in [6], but this kind of substitution needs to preserve the original essence, which certainly requires non-trivial effort on task abstraction. Another potential solution is to simplify the explanations shown to users, such as only showing the top-k features, which, however, sacrifices the comprehensiveness of the generated explanations and impedes a full view of the target system.

4.3 Limitations of Current Evaluation
Although various methodologies for explanation evaluation exist in IML research, there are still some significant limitations of current evaluation techniques. We briefly introduce some of the most important ones below.

4.3.1 Causality insight for evaluation
The first limitation lies in the lack of a causal perspective [17] in explanation evaluation. Current evaluation techniques, regardless of the properties they focus on, mostly fail to incorporate causal analysis when evaluating explanation quality. This drawback means our selected explanations may not fully represent the true reasons behind a prediction, since the influence of confounders is not effectively blocked during interpretation. Take the two most common methodologies in IML, gradient-based and perturbation-based methods, as examples. Both of them can be viewed as special cases of Individual Causal Effect (ICE) analysis, where complicated inter-feature interactions could conceal the real importance


of some input features [4]. Thus, to derive better explanations with relevant causal guarantees, we need corresponding evaluation techniques that assess the causal perspective of the generated explanations. In this way, human users would be further enabled to gain a clearer understanding of cause-effect associations when interpreting the target system.

4.3.2 Completeness insight for evaluation
The second limitation is the neglect of completeness in explanation evaluation [13]. Existing efforts on IML evaluation cannot well reflect the degree of completeness of generated explanations, which makes it difficult for human users to further ascertain their real value in practice. Explanation completeness can be important in real applications, because it indicates the possibility of whether there could be additional explanations for certain prediction results. Questions such as "Do we get the full explanations from the target IML system?" and "Is it possible to generate better explanations than the current ones?" are not supported by current evaluation techniques. A completeness-aware evaluation for explanation would definitely be helpful in exploring the boundaries of the target IML system. Besides, having completeness insight in assessment would also be a significant supplement to persuasibility evaluation, since the need for explainability typically stems from incompleteness in problem formalization [6].

4.3.3 Novelty insight for evaluation
The third limitation concerns the novelty perspective of explanation [30]. Under the current infrastructure of explanation evaluation in IML, it is commonly assumed that high-quality explanations are those that can help human users make better decisions or obtain better understanding. Nevertheless, this view of good explanation is rather limited, since it somewhat overlooks the potential value of explanations that may not be well comprehended by users. Explanations that are not directly "useful" to human users may still have significant influence, due to their important role in extending the boundary of human knowledge. Medical diagnosis is a good example to illustrate this point. When diagnosing patients, doctors would typically check the generated explanations against their acquired domain knowledge, if they have access to IML systems. Since there is no way that domain knowledge can cover all aspects and contain the full pathological mechanism, especially for new diseases, we cannot casually discard the explanations that mismatch our knowledge. Those "novel" explanations could possibly point out valuable research directions in a reverse way. To this end, current evaluation techniques need to be further enhanced to properly cover the novelty issue when assessing the quality of generated explanations, so that novel explanations can be well distinguished from noisy ones.

5. CONCLUSIONS
With the booming development of IML techniques, how to effectively evaluate the generated explanations, typically without ground truth on quality, has become increasingly critical in recent years. In this article, we briefly introduce explanation in IML, as well as its three general properties, and formally define the explanation evaluation problem within the context of IML. Then, following those properties, we systematically review the existing efforts in evaluating explanation, covering various methodologies and application scenarios. Moreover, a potential unified evaluation framework is built according to the hierarchical needs of both model developers and general end-users. In the end, several open problems in benchmarking and limitations of current techniques are discussed for future exploration. Though numerous obstacles remain to be solved, explanation evaluation will keep playing a key role in enabling effective interpretation of IML systems.

6. REFERENCES
[1] A. Abdul, J. Vermeulen, D. Wang, B. Y. Lim, and M. Kankanhalli. Trends and trajectories for explainable, accountable and intelligible systems: An HCI research agenda. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, page 582. ACM, 2018.
[2] N. S. Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.
[3] B. G. Buchanan and R. O. Duda. Principles of rule-based expert systems. In Advances in Computers, volume 22, pages 163–216. Elsevier, 1983.
[4] A. Chattopadhyay, P. Manupriya, A. Sarkar, and V. N. Balasubramanian. Neural network attributions: A causal perspective. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 981–990, 2019.
[5] Z. Che, S. Purushotham, R. Khemani, and Y. Liu. Distilling knowledge from deep networks with applications to healthcare domain. arXiv preprint arXiv:1512.03542, 2015.
[6] F. Doshi-Velez and B. Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
[7] M. Du, N. Liu, and X. Hu. Techniques for interpretable machine learning. Communications of the ACM, 2019.
[8] M. Du, N. Liu, F. Yang, S. Ji, and X. Hu. On attribution of recurrent neural network predictions via additive decomposition. In Proceedings of The Web Conference 2019 (TheWebConf). ACM, 2019.
[9] S. Feng and J. Boyd-Graber. What can AI do for me: Evaluating machine learning interpretations in cooperative play. arXiv preprint arXiv:1810.09648, 2018.
[10] R. C. Fong and A. Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3429–3437, 2017.
[11] N. Frosst and G. Hinton. Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784, 2017.
[12] A. Ghorbani, A. Abid, and J. Zou. Interpretation of neural networks is fragile. arXiv preprint arXiv:1710.10547, 2017.
[13] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 80–89. IEEE, 2018.


[14] B. Herman. The promise and peril of human evaluation for model interpretability. arXiv preprint arXiv:1711.07414, 2017.
[15] D. Holliday, S. Wilson, and S. Stumpf. User trust in intelligent systems: A journey over time. In Proceedings of the 21st International Conference on Intelligent User Interfaces, pages 164–168. ACM, 2016.
[16] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, and R. Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). arXiv preprint arXiv:1711.11279, 2017.
[17] C. Kim and O. Bastani. Learning interpretable models with causal guarantees. arXiv preprint arXiv:1901.08576, 2019.
[18] P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1885–1894. JMLR, 2017.
[19] I. Lage, E. Chen, J. He, M. Narayanan, B. Kim, S. Gershman, and F. Doshi-Velez. An evaluation of the human-interpretability of explanation. arXiv preprint arXiv:1902.00006, 2019.
[20] H. Lakkaraju, S. H. Bach, and J. Leskovec. Interpretable decision sets: A joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1675–1684. ACM, 2016.
[21] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.
[22] T. Lei, R. Barzilay, and T. Jaakkola. Rationalizing neural predictions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. NIH Public Access, 2016.
[23] B. Letham, C. Rudin, T. H. McCormick, and D. Madigan. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics, 9(3):1350–1371, 2015.
[24] N. Liu, H. Yang, and X. Hu. Adversarial detection with model interpretation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1803–1811. ACM, 2018.
[25] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[26] T. Narendra, A. Sankaran, D. Vijaykeerthy, and S. Mani. Explaining deep learning models using causal inference. arXiv preprint arXiv:1811.04376, 2018.
[27] J. A. Nelder and R. W. Wedderburn. Generalized linear models. Journal of the Royal Statistical Society: Series A (General), 135(3):370–384, 1972.
[28] A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436, 2015.
[29] R. L. Phillips, K. H. Chang, and S. A. Friedler. Interpretable active learning. In FAT, 2017.
[30] K. Preuer, G. Klambauer, F. Rippmann, S. Hochreiter, and T. Unterthiner. Interpretable deep learning in drug discovery. arXiv preprint arXiv:1903.02788, 2019.
[31] M. T. Ribeiro, S. Singh, and C. Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
[32] S. R. Safavian and D. Landgrebe. A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics, 21(3):660–674, 1991.
[33] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626, 2017.
[34] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In Advances in Neural Information Processing Systems, pages 2553–2561, 2013.
[35] B. Ustun, A. Spangher, and Y. Liu. Actionable recourse in linear classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 10–19. ACM, 2019.
[36] S. Wachter, B. Mittelstadt, and L. Floridi. Transparent, explainable, and accountable AI for robotics. Science Robotics, 2(6), 2017.
[37] E. Wallace, S. Feng, and J. Boyd-Graber. Interpreting neural networks with nearest neighbors. arXiv preprint arXiv:1809.02847, 2018.
[38] F. Yang, N. Liu, S. Wang, and X. Hu. Towards interpretation of recommender systems with sorted explanation paths. In 2018 IEEE International Conference on Data Mining (ICDM), pages 667–676. IEEE, 2018.
[39] C.-K. Yeh, C.-Y. Hsieh, A. S. Suggala, D. Inouye, and P. Ravikumar. How sensitive are sensitivity-based explanations? arXiv preprint arXiv:1901.09392, 2019.
[40] B. Zhou, D. Bau, A. Oliva, and A. Torralba. Interpreting deep visual representations via network dissection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.