
Manipulating and Measuring Model Interpretability

Forough Poursabzi-Sangdeh
University of Colorado Boulder
[email protected]

Daniel G. Goldstein
Microsoft Research
[email protected]

Jake M. Hofman
Microsoft Research
[email protected]

Jennifer Wortman Vaughan
Microsoft Research
[email protected]

Hanna Wallach
Microsoft Research
[email protected]

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1 Introduction

Machine learning is increasingly used to make decisions that affect people's lives in critical domains like criminal justice, credit, lending, and medicine. Within the research community, machine learning models are often evaluated based on their predictive performance (measured, for example, in terms of accuracy, precision, or recall) on held-out data sets. However, good performance on a held-out data set may not be sufficient to convince users that a model is trustworthy or reliable in the wild.

To address this problem, a new line of research has emerged that focuses on developing interpretable machine learning methods. There are two common approaches. The first is to employ models in which the impact of each feature on the model's prediction is easy to understand. Examples include generalized additive models [1, 7, 8] and point systems [5, 11]. The second is to provide post-hoc explanations for (potentially complex) models. One thread of research in this direction looks at how to explain individual predictions by learning simple local approximations of the model around particular data points [10], while another focuses on visualizing model output [12].

Despite these contributions, there is still disagreement around what, exactly, interpretability means and how to quantify and measure the interpretability of a machine learning model [2]. Different notions of interpretability, such as simulatability, trustworthiness, and simplicity, are often conflated [6]. This problem is exacerbated by the fact that there are different types of users of machine learning systems and these users may have different needs in different scenarios. For example, the approach that works best for a regulator who wants to understand why a particular person was denied a loan may be different from the approach that works best for a data scientist trying to debug a model.

We believe that this confusion arises, in part, because interpretability is not something that can be directly manipulated or measured. Rather, we view interpretability as a latent property that can be influenced by different manipulable factors (such as the number of features, the complexity of the model, or even the user interface) and that impacts different measurable outcomes (such as an end user's ability to trust or debug the model). Different factors may influence outcomes in different ways. As such, we argue that to understand interpretability, it is necessary to directly manipulate and measure the influence that different factors have on real people's abilities to complete tasks.

In this work, we run a large-scale randomized experiment, varying factors that are thought to make models more or less interpretable [3, 6] and measuring how these changes impact people's decision making. We view this experiment as a first step toward a larger agenda aimed at quantifying and measuring the impact of factors that influence interpretability. For this preliminary experiment, we focus on two factors that are often assumed to influence interpretability, but rarely studied formally: the number of features and whether the model is clear or black box. We focus on lay people as opposed to domain experts, and ask which factors help them gain trust in a model, use a model to make more accurate predictions, and understand where a model will make mistakes.

Before running the experiment, we posited and pre-registered three primary hypotheses:1

1 The pre-registered document is available at https://aspredicted.org/xy5s6.pdf.




1. A clear model with a small number of features will be easiest for participants to simulate (i.e., guess what the model will predict).

2. Participants will follow the predictions of a clear model with a small number of features more than a black-box model with a large number of features.

3. When an unusual example leads a model to make a highly inaccurate prediction, the extent to which participants follow the model will vary under different conditions.

We intentionally refrain from hypothesizing about the direction of the effect of different factors on participants' abilities to correct a model's predictions on unusual examples. On the one hand, if a participant understands a model better, she may be better equipped to detect examples on which the model will make mistakes. On the other hand, a participant may place greater trust in a model she understands well, leading her to follow the model's predictions more closely.

2 Experimental Design

In our experiment, participants are asked to predict the prices of apartments in New York City with the help of a linear regression model. Each apartment is represented in terms of eight features (number of bedrooms, number of bathrooms, square footage, total rooms, days on the market, maintenance fee, distance from the subway, and distance from a school). All participants see the same set of apartments and, crucially, the same model prediction for each apartment. (This is possible because we round the models' predictions, masking the minor differences between them.) What varies between the experimental conditions is only the presentation of the model. As a result, any observed differences in participants' behavior between the conditions can be attributed to the model presentation alone.
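As a concrete illustration (not taken from the paper), the sketch below shows how rounding to the nearest $100,000 can mask the gap between the two models' raw predictions; the dollar amounts and the helper function are hypothetical.

```python
# Hypothetical illustration: rounding two nearby raw predictions to the nearest $100,000
# yields the same displayed price, so participants in all conditions see identical predictions.
def round_prediction(price: float, step: float = 100_000) -> float:
    """Round a model's predicted price to the nearest multiple of `step`."""
    return round(price / step) * step

two_feature_raw = 1_130_000    # hypothetical raw prediction from the two-feature model
eight_feature_raw = 1_090_000  # hypothetical raw prediction from the eight-feature model

assert round_prediction(two_feature_raw) == round_prediction(eight_feature_raw) == 1_100_000
```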

We consider four primary experimental conditions in a 2 × 2 design:

• Some participants see a model that uses only two features (number of bathrooms and square footage—the two most predictive features for our data set), while others see a model that uses all eight features. Note that all eight feature values are visible to participants in all conditions.

• Some participants see the model as a black box, while others are presented with a clear linear regression model with visible coefficients.

We additionally consider a baseline condition in which there is no model available. Screenshots from each of the four primary experimental conditions are shown in Figure 1.

Figure 1: Visualization of the four primary experimental conditions.


After participants view detailed instructions (which include a simple English description of the model in the clear conditions), the experiment proceeds in two phases. In the training phase, participants are shown a set of ten apartments in a random order. In the four primary experimental conditions, participants are shown the model's prediction of each apartment's price, asked to make their own prediction of the price, and then shown the apartment's actual price. In the baseline condition, participants are simply asked to guess the price of each apartment and are then shown the actual price.

In the testing phase, participants see another twelve apartments. The order of the first ten is randomized, while the remaining two always appear last, for reasons described below. In the four primary experimental conditions, participants are first asked to guess what the model will predict for each apartment (i.e., simulate the model) and to indicate how confident they are in this guess on a five-point scale. They are then told the model's prediction and asked to indicate how confident they are that the model is correct. Finally, they are asked to make their own prediction of the apartment's price and to indicate how confident they are in this prediction. In the baseline condition, participants are simply asked to make their own prediction of each apartment's price and to indicate their confidence in this prediction.

The apartments shown to participants were selected from a data set of actual Upper West Side apartments taken from StreetEasy (https://streeteasy.com/), a popular and reliable New York City real estate website, between 2013 and 2015. To generate the models that were presented to participants, we first trained a two-feature linear regression model on our data set using ordinary least squares with Python's scikit-learn library [9], rounding coefficients to "nice" numbers within a safe range (each coefficient's value plus or minus one quarter of its standard error). To keep the models as similar as possible, we fixed the coefficients for number of bathrooms and square footage and the intercept of the eight-feature model to match those of the two-feature model, and then trained a linear regression model with the remaining six features, following the same rounding procedure to obtain "nice" numbers. The resulting coefficients are shown in Figure 1. The root-mean-square errors of the final two- and eight-feature models are close—$265,000 and $259,000 respectively. When presenting model predictions to participants, we rounded all predictions to the nearest $100,000.
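The sketch below outlines this two-stage fitting-and-rounding procedure under stated assumptions; it is not the authors' code. The pandas column names are hypothetical, statsmodels is substituted for the scikit-learn fit described above purely because it exposes coefficient standard errors (needed for the safe range), and the $10,000 rounding step is an illustrative stand-in for choosing "nice" numbers.

```python
import pandas as pd
import statsmodels.api as sm  # provides coefficient standard errors via fit.bse

TWO_FEATURES = ["bathrooms", "sqft"]  # hypothetical column names
SIX_FEATURES = ["bedrooms", "rooms", "days_on_market",
                "maintenance_fee", "subway_dist", "school_dist"]

def round_nice(value, stderr, step=10_000):
    """Round a coefficient to a multiple of `step`, staying within value +/- stderr / 4."""
    candidate = round(value / step) * step
    return candidate if abs(candidate - value) <= stderr / 4 else value

def fit_rounded_ols(y, X, add_intercept=True):
    """Fit ordinary least squares and return a dict of rounded coefficients."""
    design = sm.add_constant(X) if add_intercept else X
    fit = sm.OLS(y, design).fit()
    return {name: round_nice(fit.params[name], fit.bse[name]) for name in fit.params.index}

def fit_paper_models(df: pd.DataFrame):
    """Return (two_feature_model, eight_feature_model) as dicts of rounded coefficients."""
    # Stage 1: the two-feature model (bathrooms, square footage, plus an intercept).
    two_feature_model = fit_rounded_ols(df["price"], df[TWO_FEATURES])

    # Stage 2: hold the two-feature coefficients and the intercept fixed, then fit the
    # remaining six features on the residual price so both models share the fixed terms.
    fixed_part = (two_feature_model["const"]
                  + two_feature_model["bathrooms"] * df["bathrooms"]
                  + two_feature_model["sqft"] * df["sqft"])
    six_feature_part = fit_rounded_ols(df["price"] - fixed_part, df[SIX_FEATURES],
                                       add_intercept=False)
    eight_feature_model = {**two_feature_model, **six_feature_part}
    return two_feature_model, eight_feature_model
```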

In order to allow for comparisons across experimental conditions, the ten apartments used in the training phase and the first ten apartments used in the testing phase were selected from those apartments in our data set for which the rounded predictions of the two- and eight-feature models agreed, and chosen to cover a wide range of deviations between the models' predictions and the apartments' prices.

The last two apartments used in the testing phase were chosen to test our third hypothesis—i.e., that participant behavior would vary across conditions for unusual examples on which the models make mistakes. Ideally, to test this hypothesis we would use an apartment with strange or misleading features that lead the two- and eight-feature models to make the same bad prediction. Unfortunately, there was no such apartment in our data, so we chose two examples to test different aspects of our hypothesis. Both of these examples exploit the models' large coefficient ($350,000) for number of bathrooms. The first (apartment 11) is a one-bedroom, two-bathroom apartment from our data set for which both models make high, but different, predictions. Comparisons between the two- and eight-feature conditions are therefore impossible, but we can examine differences in accuracy between the clear and black-box conditions. The second (apartment 12) is a synthetically generated one-bedroom, three-bathroom apartment for which both models make the same (very high) prediction, allowing comparisons between all conditions, but ruling out accuracy comparisons since there is no ground truth.

3 Results

We ran the experiment on Amazon Mechanical Turk using psiTurk [4], an open-source platform for designing online experiments. We recruited 1,250 participants, all located in the United States, with Mechanical Turk approval ratings greater than 97%. Participants were randomly assigned to the five experimental conditions. There were 247 participants in the black-box, two-feature condition (BB-2); 248 in the clear, two-feature condition (CLEAR-2); 256 in the black-box, eight-feature condition (BB-8); 247 in the clear, eight-feature condition (CLEAR-8); and 252 in the baseline condition (NO-MODEL). Each participant received a flat payment of $2.50. The experiment was IRB-approved.

Figure 2: Mean simulation error (a), trust (b), and prediction error (c) over all participants and over the first ten apartments in the testing phase. The error bars indicate standard errors.

Simulation error. One frequently used measure of interpretability is simulatability [2, 6]. We define simulation error to be the absolute deviation between the model's prediction, m, and the participant's guess for this prediction, u_m—that is, |m − u_m|. Figure 2a shows the mean simulation error over all participants and over the first ten apartments in the testing phase. As hypothesized, participants in the CLEAR-2 condition have lower simulation error, on average, than participants in the other conditions (p < .001 for the individual and interaction effects of the number of features and transparency under a 2 × 2 ANOVA). This suggests that, on average, participants in this condition have some understanding of how the model works. We note that participants in the BB-8 condition, who cannot see the model's internals, appear to have lower simulation error, on average, than participants in the CLEAR-8 condition (p = .045), although this difference is not significant at conventional levels once we adjust for multiple comparisons.
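As a sketch of this kind of analysis (not the authors' code), a 2 × 2 ANOVA on per-participant mean simulation error could be run with statsmodels as follows; the DataFrame and its column names (sim_error, num_features, transparency) are assumptions for the example.

```python
# Sketch: two-way ANOVA with main effects of number of features and transparency,
# plus their interaction, on per-participant mean simulation error.
import statsmodels.api as sm
import statsmodels.formula.api as smf

def simulation_error_anova(scores):
    """`scores`: one row per participant with columns sim_error, num_features, transparency."""
    model = smf.ols("sim_error ~ C(num_features) * C(transparency)", data=scores).fit()
    return sm.stats.anova_lm(model, typ=2)  # F statistics and p-values for each effect
```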

Trust. To measure trust, we calculate the absolute deviation between the model's prediction, m, and the participant's prediction of the apartment's price, u_a—that is, |m − u_a|; a smaller value indicates higher trust. Figure 2b shows the mean trust. Contrary to our second hypothesis, we find no significant difference in participants' trust between the conditions. (We do find statistically but not practically significant differences in participants' self-reported confidence in the models' predictions.)

Prediction error. We define prediction error to be the absolute deviation between the apartment's actual price, a, and the participant's prediction of the price, u_a—that is, |a − u_a|. Figure 2c shows the mean prediction error. As with trust, we do not find a significant difference between the four primary experimental conditions. However, participants in the baseline condition have significantly higher error than participants in the primary conditions (p < .001 with a one-way ANOVA).
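For concreteness, the three outcome measures defined above could be computed from per-trial responses as in the minimal sketch below; this is not the authors' code, and the DataFrame layout and column names (model_pred, user_guess_of_model, user_price_pred, actual_price, condition) are assumptions.

```python
import pandas as pd

def add_outcome_measures(responses: pd.DataFrame) -> pd.DataFrame:
    """Append simulation error, trust deviation, and prediction error to per-trial responses."""
    out = responses.copy()
    # Simulation error: |m - u_m|, deviation of the participant's guess from the model's prediction.
    out["simulation_error"] = (out["model_pred"] - out["user_guess_of_model"]).abs()
    # Trust: |m - u_a|, deviation of the participant's own prediction from the model's prediction
    # (smaller values indicate higher trust).
    out["trust_deviation"] = (out["model_pred"] - out["user_price_pred"]).abs()
    # Prediction error: |a - u_a|, deviation of the participant's prediction from the actual price.
    out["prediction_error"] = (out["actual_price"] - out["user_price_pred"]).abs()
    return out

# Per-condition means over the first ten testing-phase apartments, as plotted in Figure 2:
# add_outcome_measures(responses).groupby("condition")[
#     ["simulation_error", "trust_deviation", "prediction_error"]].mean()
```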

Detection of mistakes. We use the last two apartments in the testing phase (apartment 11 and apartment 12) to test our third hypothesis. The models make erroneously high predictions on these examples. For both apartments, we find that participants in the four primary experimental conditions overestimate the apartments' prices, compared to participants in the baseline condition. We suspect that this is due to an anchoring effect around the models' predictions. For apartment 11, we find no significant difference in participants' trust between the four primary conditions. For apartment 12, we find a significant difference between the clear and black-box conditions (p < .001). Furthermore, participants in the clear conditions appear to follow the model more, on average, than participants in the black-box conditions, resulting in even worse final predictions. We will explore this result in future work.

4 Discussion and Future Work

This experiment is a first step toward a larger goal of quantifying and measuring the impact of factors thought to make machine learning models more or less interpretable. As hypothesized, we find that participants who are presented with a clear, two-feature model are better able to simulate the model's predictions. However, we do not find significant differences in participants' trust or prediction error. We are currently running a variety of additional experiments. For example, following feedback from participants that it was difficult to reason about the astronomical prices of New York City apartments, we are scaling down the prices so that they are more in line with US averages. We suspect that this will help participants to detect examples on which the models will make mistakes. As another way to measure trust, we also plan to ask participants to guess the apartment's price before and after seeing the model's prediction. Beyond this, we plan to experiment with other models, other factors that are assumed to influence interpretability, other scenarios and tasks, and other types of users.


References

[1] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1721–1730, 2015.

[2] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.

[3] Alyssa Glass, Deborah L. McGuinness, and Michael Wolverton. Toward establishing trust in adaptive agents. In Proceedings of the 13th International Conference on Intelligent User Interfaces, pages 227–236, 2008.

[4] Todd M. Gureckis, Jay Martin, John McDonnell, Alexander S. Rich, Doug Markant, Anna Coenen, David Halpern, Jessica B. Hamrick, and Patricia Chan. psiTurk: An open-source framework for conducting replicable behavioral experiments online. Behavior Research Methods, 48(3):829–842, 2016.

[5] Jongbin Jung, Connor Concannon, Ravi Shroff, Sharad Goel, and Daniel G. Goldstein. Simple rules for complex decisions. arXiv preprint arXiv:1702.04690, 2017.

[6] Zachary C. Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.

[7] Yin Lou, Rich Caruana, and Johannes Gehrke. Intelligible models for classification and regression. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012.

[8] Yin Lou, Rich Caruana, Johannes Gehrke, and Giles Hooker. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013.

[9] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[10] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.

[11] Berk Ustun and Cynthia Rudin. Supersparse linear integer models for optimized medical scoring systems. Machine Learning Journal, 102(3):349–391, 2016.

[12] Martin Wattenberg, Fernanda Viégas, and Moritz Hardt. Attacking discrimination with smarter machine learning. Accessed at https://research.google.com/bigpicture/attacking-discrimination-in-ml/, 2016.
