
REVIEW

Improving the quality of discrete-choice experiments in health: how can we assessvalidity and reliability?Ellen M. Janssena, Deborah A. Marshallb, A. Brett Hauberc and John F. P. Bridgesa

(a) Department of Health Policy and Management, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA; (b) Department of Community Health Sciences, University of Calgary, Health Research Innovation Centre (HRIC), Calgary, AB, Canada; (c) Research Triangle, RTI Health Solutions, NC, USA

ABSTRACT
Introduction: The recent endorsement of discrete-choice experiments (DCEs) and other stated-preference methods by regulatory and health technology assessment (HTA) agencies has placed a greater focus on demonstrating the validity and reliability of preference results.
Areas covered: We present a practical overview of tests of validity and reliability that have been applied in the health DCE literature and explore other study qualities of DCEs. From the published literature, we identify a variety of methods to assess the validity and reliability of DCEs. We organize these methods into a conceptual model with four domains: measurement validity, measurement reliability, choice validity, and choice reliability. Each domain consists of three categories that can be assessed using one to four procedures (for a total of 24 tests). We present how these tests have been applied in the literature and direct readers to applications of these tests in the health DCE literature. Based on a stakeholder engagement exercise, we consider the importance of study characteristics beyond traditional concepts of validity and reliability.
Expert commentary: We discuss study design considerations to assess the validity and reliability of a DCE, consider limitations to the current application of tests, and discuss future work to consider the quality of DCEs in healthcare.

ARTICLE HISTORY
Received 30 August 2017; Accepted 5 October 2017

KEYWORDS
Benefit-risk assessment; discrete-choice experiment; patient preferences; reliability; stated-preference methods; validity

1. Introduction

In recent years, there has been a general trend toward patient involvement in treatment [1,2], research [3,4], and regulatory decision-making [5,6]. As regulatory and health technology assessment (HTA) decision-making [7,8] is becoming more patient centered, ways to measure preferences of healthcare stakeholders are being explored [9]. Patient preferences play an important role in many quantitative approaches to benefit-risk assessment [10], and stated-preference methods such as discrete-choice experiments (DCEs) are increasingly used to study acceptance of benefit-risk tradeoffs [11]. A growing number of HTA agencies are now formally incorporating patient preferences into their decision-making for regulatory approval, listing and reimbursement, and pricing [7,12,13], and the US FDA has already provided guidance on quantitative preference assessment for regulatory approval of medical devices [14].

Moving forward, preference studies will need to meet evidence standards consistent with standards for clinical evidence [9]. Guidance from the FDA on incorporating patient preference information in benefit-risk assessments calls for checks on the quality of stated-preference studies [14], including the logical soundness and validity of patient-preference results, but does not specify how to measure these concepts. Validating stated-preference assessments is difficult because of the hypothetical nature of the choice scenarios [15]. Whether participants would display the same preferences if they were presented with these choices and would subsequently experience the consequences of their choices is unknown.

The validity of stated-preference methods when compared with revealed preferences has been explored in transport, marketing, and environmental economics [16,17]. Opportunities to observe preferences through real-life choices in healthcare are limited [11] because healthcare options might not be available on the market yet [14] or choices are made on behalf of the patient by a physician [18] or a third-party payer [19,20]. A recent systematic review of the application of DCEs in environmental economics found that environmental DCEs generally provide limited evidence on the internal reliability and validity of DCEs [21]. The Medical Device Innovation Consortium (MDIC), a public–private partnership to advance medical device regulatory science, has called for studies that evaluate specific aspects of the validity of DCEs that are applied to healthcare [9]. While literature reviews on the application of DCEs in healthcare have discussed tests of internal validity, the tests discussed in these reviews have not been comprehensive [22–24]. Furthermore, these discussions often do not provide details on how to conduct tests for validity and reliability. Last, it is not clear which study qualities, including but not limited to validity and reliability, should be prioritized when conducting a DCE.

CONTACT Ellen M. Janssen [email protected] Department of Health Policy and Management, Johns Hopkins Bloomberg School of Public Health, 624 N Broadway, Baltimore, MD, USA

EXPERT REVIEW OF PHARMACOECONOMICS & OUTCOMES RESEARCH, 2017, VOL. 17, NO. 6, 531–542
https://doi.org/10.1080/14737167.2017.1389648

© 2017 Informa UK Limited, trading as Taylor & Francis Group


The MDIC report defines validity of a stated-preference study as 'the extent to which quantitative measures of relative importance or tradeoffs reflect the true preferences of patients' [9]. In our study, we expand on the MDIC definition to include both validity and reliability and apply this in the context of the healthcare DCE literature. We present a practical list of tests of validity and reliability that have been applied in the health DCE literature and describe how these tests have been conducted in published studies. We also consider what other desirable and actionable study characteristics, outside of tests for validity and reliability, influence the quality of a preference study. Last, we discuss concepts that should be considered when designing a DCE to allow researchers to assess validity and reliability, and we provide an overview of future work that should be carried out to ensure the study quality of DCEs applied in health. This review can help researchers consider questions of study quality when conducting a DCE and will help them identify and apply tests of validity and reliability.

2. Targeted literature search

We conducted a targeted literature review to identify articles that discussed the validity and reliability of DCEs as applied in the healthcare preference literature. The search was meant not to identify every published DCE that assessed validity or reliability but rather to identify studies that discussed concepts of validity and reliability and how these could be tested. We searched PubMed Central for DCEs that used the words validity and/or reliability and were published between January 2000 and January 2017. We also conducted a hand search to identify studies that reviewed DCEs or that assessed the reliability or validity of a DCE but were omitted in the targeted search. The literature search described was admittedly limited by including only one database and narrow search terms. It is likely that some applicable articles were missed in this review, but we believe that we identified the majority of tests of validity and reliability that have been utilized in the health DCE literature.

We included papers that gave an overview of DCEs as a stated-preference method, discussed methods issues related to conducting DCEs, and mentioned processes to assess validity or reliability of results. We excluded studies that did not discuss DCEs or that did not discuss tests that could be used to evaluate the validity or reliability of a DCE. We also excluded studies that were not related to healthcare or that only employed qualitative methods. From the included literature, we extracted tests of validity and reliability.

The PubMed search identified 139 studies, and the hand search identified an additional four articles. Title and abstract review, full-text review, and the extraction of tests of validity and reliability were conducted by one reviewer. After title and abstract review, 109 studies were excluded. Tests were extracted from 13 studies that were reviews or made a methodological contribution in describing and applying a test. Four of these studies were literature reviews [22–25], three provided overviews on how to conduct stated-preference studies [26–28], and seven provided a methodological explanation of a particular test with an empirical illustration [28–34].

3. Types of tests of validity and reliability for DCEs

To conceptualize the tests of validity and reliability, principles of concept mapping were adopted to sort and group tests [35]. A total of 24 tests of validity and reliability were identified from the literature. Table 1 presents a summary of each test and provides a reference to an application of each test. An iterative and collaborative process was then used to group the different tests based on the concept of validity or reliability they measured, as discussed in the literature. Through this process, 12 conceptual groups were identified, with each group containing between one and four tests of validity or reliability. The conceptual groups were sorted into concepts that measured validity and those that measured reliability. Based on this sorting, a further distinction was made between concepts that applied to most survey-based studies and those that applied only to preference-based studies.

The developed conceptual framework included four conceptual domains: measurement validity, measurement reliability, choice validity, and choice reliability (Figure 1). Measures of validity capture how accurately an instrument measures the outcome of interest, whereas measures of reliability capture how consistently it measures the outcome of interest. In this conceptual framework, measurement validity and reliability capture traditional measures of validity and reliability [49] that apply to most scientific studies and are not limited to stated-preference methods.

Choice validity and choice reliability still address issues of accuracy and consistency, but these concepts examine whether the results of the study adhere to assumptions specific to choice behavior [50]. These concepts do not apply to studies that do not elicit measures such as preferences or choices. The axioms of utility theory lay the groundwork for many of the assumptions on consumer choice behavior used in tests of validity and reliability of DCEs. Since a thorough discussion of utility theory is beyond the scope of this study, we direct readers interested in learning more about utility theory and DCEs to a paper published by Hougaard et al. (2012) [51].

4. Measurement validity

Measurement validity captures whether the results from a DCE meet face/content validity, convergent validity, and external validity. These concepts identify how accurately the instrument measures preferences and how generalizable these preferences are to other circumstances.

4.1. Face/content validity (four tests)

Face and content validity are largely assessed through qualitative methods. Face validity, also referred to as theoretical validity [22,23,25,27], assesses the extent to which the results are consistent with a priori expectations. We prefer the term face validity because theoretical validity can also refer to the axioms of welfare theory, which may or may not be an appropriate standard in evaluating medical decisions [28,33]. Content validity refers to the extent to which the DCE takes account of all things deemed important in the construct's domain [25]. In a DCE, this refers to the attributes and levels included.


Table 1. Summary of tests of validity and reliability of a discrete-choice experiment identified in the literature. Each test is listed under its category with the category definition (a), the study design adaptation it requires (b), and an application example.

Measurement validity (8 tests)

Face/content validity: Whether the choice task accounts for important preference attributes and whether results are consistent with a priori preference expectations.
● Direction of preferences: Compare study results with previously formulated hypotheses. Application: De Bekker et al. (2010) [36].
● Intuition: Compare study results with intuition on preferences. Application: Ryan et al. (2001) [28].
● Development process: Evaluate the instrument development process. Application: Mühlbacher et al. (2009) [37].
● Clinical expertise: Ask clinical experts/patients about the relevance of included and/or omitted attributes. Application: Kenny et al. (2003) [38].

Convergent validity: Whether the results are consistent with other measures that measure the same construct.
● Across surveys: Conduct multiple preference experiments using different preference-elicitation methods. Adaptation: conduct a second experiment. Application: Bijlenga et al. (2009) [39].
● Within surveys: Add a second preference experiment (different method) to an existing experiment. Adaptation: conduct a second experiment. Application: Hollin et al. (2016) [40].
● Research context: Examine study results in the context of previously conducted preference studies on the same topic. Application: Janssen et al. (2016) [41].

External validity: Whether the results can accurately predict preferences and choices outside of the study context.
● Revealed preferences: Compare study results with revealed-preference results (adjust for scaling effects). Applications: Mark and Swait (2004) [42]; Telser and Zweifel (2007) [34].

Measurement reliability (6 tests)

Test–retest: Whether the instrument measures preferences consistently when administered twice.
● Across-survey reliability: Repeat the same DCE with the same responders and compare responses. Adaptation: conduct a second experiment. Application: Bijlenga et al. (2009) [39].
● Within-survey stability: Include the same choice task (a repeat task) in the experiment twice and compare responses within individuals. Adaptation: add additional choice tasks. Application: Ozdemir et al. (2010) [33].

Version consistency: Whether different versions of the instrument result in consistent preference estimates.
● Fixed-choice tasks: Include one or more fixed-choice tasks that are the same across survey versions and compare responses across survey versions. Adaptation: create different instrument versions. Application: Janssen et al. (2016) [43].
● Small survey changes: Make small changes to survey versions and compare results between versions. Adaptation: create different instrument versions. Applications: Veldwijk et al. (2016) [44]; De Bekker et al. (2010) [36].

Holdout prediction: Whether the instrument can predict choices outside the choice model (within the instrument) accurately.
● Within survey: Include one or more holdout tasks in the instrument that are not used in the choice model estimation; use the choice model to predict choices for these holdout tasks and compare with observed choices. Adaptation: add additional choice tasks. Application: Rockers et al. (2012) [45].
● Across survey: Include a survey version that is not included in the estimation of the choice model; use the choice model to predict choices for these holdout tasks and compare with observed choices. Adaptation: create different instrument versions. Application: Kruk et al. (2010) [46].

Choice validity (5 tests)

Monotonicity: Whether participants choose 'worse' choice profiles over 'better' profiles.
● Within-set dominated pairs: Include a dominated choice task in the design and test whether participants choose the choice profile that has better attribute levels than the other profile in the choice task. Adaptation: add additional choice tasks. Applications: Ozdemir et al. (2010) [33]; Ryan et al. (2001) [28].
● Across-set dominated pairs: Test whether participants choose the preferred choice profile if one profile dominates another profile across choice tasks by including multiple related choice tasks (e.g. include a choice task between treatment A and treatment B and a choice task between treatment A and C, and ensure that treatment C has better attributes than treatment B). Adaptation: add additional choice tasks (c). Applications: Ho et al. (2015) [47]; Ozdemir et al. (2010) [33].

Compensatory choices: Whether participants trade between all attributes of choice profiles.
● Attribute dominance: Determine whether certain participants always chose the choice profile that has the better level of a particular attribute. Application: Ho et al. (2015) [47].
● Attribute nonattendance: Use a multi-step latent class analysis approach to determine whether participants ignore certain attributes in their decision-making. Application: Lagarde et al. (2013) [30].

Task nonattendance: Whether participants pay attention to the choice tasks.
● Profile positioning: Examine whether certain participants always choose the choice profile that occurs in a certain position in the choice task. Applications: Bridges et al. (2012) [48]; Ho et al. (2015) [47].

Choice reliability (5 tests)

Transitivity: Whether participants make choices that meet the transitive property (if a participant prefers A over B and B over C, they should prefer A over C).
● Transitivity test: Include three choice tasks that ask participants to choose between choice profiles A and B, between choice profiles B and C, and between choice profiles A and C. Adaptation: add additional choice tasks (d). Application: Ozdemir et al. (2010) [33].

Sen's consistency: Whether participants make choices that follow Sen's contraction and expansion principles.
● Contraction principle: If a choice set is narrowed, then no unchosen alternatives should be chosen now and no chosen alternatives should be unchosen now. Test by presenting respondents an initial choice task and a follow-on choice task; the follow-on contraction choice task should present fewer choice profiles than the initial choice task. Adaptation: add additional choice tasks. Application: Miguel et al. (2005) [32].
● Expansion principle: If a choice set is expanded, then no unchosen alternatives from the original set should be chosen now. Test by presenting respondents with an initial choice task and a follow-on choice task; the follow-on expansion choice task should present more choice profiles than the initial choice task. Adaptation: add additional choice tasks. Application: Miguel et al. (2005) [32].

Level recoding: Whether participants make choices using the absolute values of numeric attributes.
● Scope test: Use multiple survey versions with some overlapping numeric attribute levels but with levels that span different ranges to determine whether the overlapping numeric attributes are valued the same between survey versions. Adaptation: create different instrument versions. Application: Ho et al. (2015) [47].
● Simple scope test: Include a numeric attribute for which the differences between the attribute levels are not equal. Application: Johnson et al. (2011) [29].

DCE: discrete-choice experiment.
(a) Definitions of tests adapted to specifically apply to DCEs.
(b) Some tests require adaptations to be made to the study design; tests that require adaptations are noted. Generally, a test requires the addition of choice tasks, the creation of different survey versions, or a second experiment.
(c) Necessary choice tasks might occur naturally in the experimental design, but the researchers need to check the choice task properties.
(d) To assess transitivity, at least two, not at least one, additional choice tasks need to be included.


While not every attribute that is important to every possible respondent needs to be included, it is crucial that attributes important to a majority of participants are captured [26]. Face and content validity are most often evaluated using one of four tests.

Face validity of results can be examined by setting a priori hypotheses for the preference relation between the attribute levels. For example, it can be expected that a high clinical benefit will be valued more positively than a low benefit, and this hypothesis can be checked by examining the direction of the preference estimates [28,36]. Assessing face validity is a relatively simple test of validity; about 60% of DCEs published between 2009 and 2012 reported on it [22]. Ryan and Gerard [24] suggested that considering how results match 'intuition' can assess face validity. For example, it might be expected that patients with experience of an illness value treatment differently than do the general public [24]. We suggest researchers exercise caution when using a priori expectations and intuition; researchers' assumptions about the preferences of patients might not be accurate, and researchers need to be careful not to impose their own expectations on patients' preferences.
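As a concrete illustration of the direction-of-preferences check, the following minimal Python sketch compares the signs of estimated attribute coefficients with a priori hypotheses. The attribute names and coefficient values are hypothetical, and in practice the statistical uncertainty of each estimate should also be considered.

```python
# Minimal sketch (hypothetical data): checking the direction of estimated
# preference weights against a priori hypotheses about their signs.
expected_sign = {
    "high_clinical_benefit": +1,   # hypothesized to be preferred
    "serious_side_effect":   -1,   # hypothesized to be disliked
    "out_of_pocket_cost":    -1,
}

estimated_coef = {
    "high_clinical_benefit": 0.82,
    "serious_side_effect":  -1.10,
    "out_of_pocket_cost":   -0.04,
}

for attribute, sign in expected_sign.items():
    coef = estimated_coef[attribute]
    consistent = (coef > 0) == (sign > 0)
    print(f"{attribute}: coefficient {coef:+.2f}, "
          f"{'consistent with' if consistent else 'contrary to'} a priori hypothesis")
```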

Content validity can be assessed by examining the instrument development process [27,37], for example, whether experts were consulted and whether pretest or pilot tests were conducted [34]. In addition, clinical experts and study participants can shed light on whether important attributes were omitted and whether the instrument was deemed relevant [24,38].

4.2. Convergent validity (three tests)

Convergent validity measures the extent to which the results from the DCE are consistent with other measures believed to measure the same construct [25]. Preferences elicited for the same treatment from the same population are expected to be similar regardless of the method used to elicit these preferences. Convergent validity is generally assessed using one of three tests.

Comparisons can be drawn between DCE, visual analog scale, and time tradeoff [39] or direct elicitation of willingness to pay (WTP) [52] using different preference-elicitation instruments. Participants should either complete both types of instruments to allow for within-subject comparison or should be randomized to one instrument to minimize sample biases in across-subject comparisons.

Comparisons can also be drawn between different stated-preference studies in the same preference-elicitation instrument. For example, Hollin et al. [53] conducted a best–worst scaling (BWS) experiment to measure treatment preferences. They included a choice-based discrete-choice question after each BWS task that asked whether participants would accept the treatment [53].

If resources are not available to conduct two experiments, researchers can compare preference estimates they obtain against those reported in the literature [41,43]. Care should be taken that preference estimates are appropriately compared, as different studies might use different estimation methods and different attributes. In these cases, it can help to standardize preferences by, for example, calculating WTP estimates [34].

4.3. External validity (one test)

External validity, sometimes referred to as criterion validity, refers to the extent to which the preference results obtained from the instrument can be generalized to situations and people outside the study. In stated-preference studies, it is often described as the capability of a model to accurately predict preferences and choices outside the immediate model context [31]. For DCEs, this is most easily examined by studying the choices people make in the real world and their revealed preferences and comparing those with the preferences estimated in the DCE [34]. This can be done by jointly estimating revealed and stated-preference estimates [42]. In the healthcare sector, relatively few studies have examined the external validity of DCEs [22], partially because opportunities to observe preferences through real-life choices in healthcare might be limited [11].

5. Measurement reliability

Measurement reliability captures whether the DCE produces similar results under consistent conditions. Preference reliability includes concepts of test–retest, version consistency, and holdout tasks. It measures whether results can be reproduced or repeated over a given time. Measurement reliability will never be perfect in any situation because of simple variation in responses.

5.1. Test–retest (two tests)

Test–retest compares the consistency of participants' responses for the same instrument or choice tasks and can be assessed through test–retest reliability and test–retest stability.

Figure 1. Conceptual model of validity and reliability for a discrete-choice experiment.
● Measurement validity: face/content validity (4 tests); convergent validity (3 tests); external validity (1 test)
● Measurement reliability: test–retest (2 tests); version consistency (2 tests); holdout tasks (2 tests)
● Choice validity: monotonicity (2 tests); compensatory choices (2 tests); task non-attendance (1 test)
● Choice reliability: transitivity (1 test); Sen's consistency (2 tests); level recoding (2 tests)


Test–retest reliability refers to the extent to which the instrument results in consistent preference estimates over time. This is tested by asking a sample of respondents to complete the same DCE at least two different times. The retest results are then compared with those from the first time the DCE was administered [39]. Test–retest reliability can be examined by calculating the correlation coefficient (r) when using continuous variables and the Kappa coefficient (κ) when using categorical variables [25]. Test–retest reliability can be problematic if a long period is allowed to pass between the exercises, as people's circumstances and preferences can change. In that case, the study would not be assessing test–retest reliability but rather the temporal stability of preferences [54].

Test–retest stability refers to the extent to which preferences are consistent for the duration of the survey [33]. To assess test–retest stability, the same choice task needs to be presented to participants twice. The proportion of participants that answer this repeat-choice task the same way twice can then be established [33]. A Kappa coefficient can also be used to examine the difference in expected and observed consistency.
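As an illustration, the following sketch (hypothetical data, not from the article) computes the raw agreement proportion and Cohen's kappa for a repeat-choice task, where each respondent's answer to the same task is recorded on both presentations.

```python
# Minimal sketch: agreement and Cohen's kappa for a repeated choice task.
# 'first' and 'second' hold each respondent's chosen alternative ('A' or 'B')
# on the first and second presentation of the same task (hypothetical data).
from collections import Counter

first  = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "A"]
second = ["A", "B", "B", "B", "A", "B", "A", "A", "A", "A"]

n = len(first)
observed = sum(f == s for f, s in zip(first, second)) / n  # raw agreement

# Expected agreement under independence, from the marginal choice frequencies
count_first = Counter(first)
count_second = Counter(second)
categories = set(first) | set(second)
expected = sum((count_first[c] / n) * (count_second[c] / n) for c in categories)

kappa = (observed - expected) / (1 - expected)
print(f"Observed agreement: {observed:.2f}, expected: {expected:.2f}, kappa: {kappa:.2f}")
```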

Test–retest stability has the advantage that the instrument only needs to be administered once and that preferences are less likely to change over the course of a single instrument than over the period between consecutive survey administrations. However, test–retest stability only measures consistency for a few (at a minimum, one) choice tasks and might therefore give a less accurate view of the reliability of preferences. For example, if the repeat choice task is first presented at the beginning and then at the end of the instrument, learning effects or fatigue might affect the test–retest stability [33]. In addition, the options presented in the repeat task might affect test–retest stability. For example, a particular task might include options to which many people are indifferent. This increases the probability that they choose a choice profile randomly. In this case, test–retest stability can be expected to be low, even if test–retest reliability for the entire instrument could be high.

5.2. Version consistency (two tests)

Version consistency refers to the extent that different versions of the same DCE result in consistent preference estimates. Version consistency is generally tested in two ways. In experimental designs in which different respondents see different combinations of choice tasks, such as blocked or individualized designs, one or more fixed-choice tasks that are the same for each respondent can be included [27]. Responses to this fixed-choice task can then be compared across survey blocks [43] or across random splits in the sample using a simple t-test [55]. The more fixed-choice tasks are included across the survey versions, the more accurately version consistency can be established. However, adding extra tasks to the experimental design decreases statistical efficiency and increases response burden.
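A minimal sketch of the kind of comparison described above, assuming responses to one shared fixed-choice task are coded 1 if profile A was chosen; the data and block labels are hypothetical.

```python
# Minimal sketch: comparing responses to a shared fixed-choice task across two
# survey blocks with a two-sample t-test (hypothetical data; 1 = chose profile A).
from scipy import stats

block_1 = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
block_2 = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1]

t_stat, p_value = stats.ttest_ind(block_1, block_2)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A small p-value would suggest that responses to the same task differ by block,
# i.e. a potential version-consistency problem.
```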

Another way to assess version consistency is by making small changes to survey versions, for example by using decision scenarios in which one version includes a product label as one of the attributes and the other version does not [36], or in the framing of the attributes [44]. This approach has the disadvantage that changes to a survey might mean a different preference construct is measured and responses might not be comparable.

5.3. Holdout prediction (two tests)

Holdout prediction refers to the extent to which the choice model predicts choices outside the model, for holdout tasks, but within the DCE [27]. One or more choice tasks, holdout tasks, or fixed tasks that are not part of the experimental design of the DCE can be included in the instrument. Then, preference estimates obtained from the choice tasks that are part of the experimental design can be used to predict the probability that each choice profile in the additional choice task(s) is chosen [56]. These predictions can be compared with the observed proportion of participants that chose each profile [45].
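The comparison of predicted and observed holdout choices can be sketched as follows. The attribute coefficients, profile codings, and observed shares are hypothetical, and the probability formula assumes a standard conditional-logit model.

```python
# Minimal sketch: predicting holdout-task choice shares from estimated
# conditional-logit coefficients and comparing them with observed shares.
# Coefficients, profile codings, and observed shares are hypothetical.
import numpy as np

coef = np.array([0.9, -0.6, -0.02])           # benefit, risk, cost weights
profile_a = np.array([1.0, 0.0, 10.0])        # attribute levels of profile A
profile_b = np.array([0.0, 1.0, 5.0])         # attribute levels of profile B

utilities = np.array([coef @ profile_a, coef @ profile_b])
predicted = np.exp(utilities) / np.exp(utilities).sum()   # logit choice probabilities

observed = np.array([0.68, 0.32])             # observed shares in the holdout task
print("Predicted shares:", predicted.round(2))
print("Observed shares: ", observed)
print("Absolute error:  ", np.abs(predicted - observed).round(2))
```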

Another method to assess holdout prediction is to use multiple surveys. Preference estimates for one of the survey versions can be estimated and used to predict the choices for the other, holdout, survey version [46]. Mühlbacher and Johnson [28] noted that it might be easier to predict choices for one task than for another choice task, which could lead to misleading results for this test.

However, these types of holdout prediction tests capture both response reliability and the predictive properties of the preference model used to analyze the results of the remaining questions in the survey, and therefore may be of limited use in establishing a clear measure of response reliability.

6. Choice validity

Choice validity examines whether participants engage with the choice task as expected based on assumptions about consumer behavior. It includes concepts such as monotonicity, compensatory preferences, and task nonattendance. These concepts specifically apply to preference-based methods and measure whether choices comply with assumptions about how choices represent preferences, trading, and responsiveness to the choice instrument.

6.1. Monotonicity (two tests)

Monotonicity, also referred to as non-satiation, means that people will not prefer 'worse' levels of an attribute to 'better' levels. Two types of monotonicity can be tested: within-set monotonicity and across-set monotonicity [27,33].

Within-set monotonicity tests for non-satiation within a choice task and can be assessed by adding a dominant choice task to the instrument design. This means that, for one of the choice profiles in a choice task, all the attribute levels are just as good as the attribute levels of the other profile and at least one attribute level is better [28,32,33]. If uncertainty exists about how participants value attributes, these attributes should be held constant across the choice profiles in the choice task. To satisfy the within-set monotonicity requirement, a participant should not choose the dominated choice profile. In a two-profile case, there is a 50% probability that someone who does not pay attention to the choice tasks will pass the test for within-set monotonicity [33].
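A minimal sketch of tallying the within-set dominance test follows; respondent identifiers and choices are hypothetical.

```python
# Minimal sketch: tallying a within-set dominance (monotonicity) test.
# 'dominance_choice' records, per respondent, which profile was chosen in the
# dominated choice task, where profile "A" dominates profile "B" (hypothetical data).
dominance_choice = {"r01": "A", "r02": "A", "r03": "B", "r04": "A",
                    "r05": "A", "r06": "B", "r07": "A", "r08": "A"}

failed = [rid for rid, choice in dominance_choice.items() if choice == "B"]
fail_rate = len(failed) / len(dominance_choice)

print(f"{len(failed)} of {len(dominance_choice)} respondents "
      f"({fail_rate:.0%}) chose the dominated profile: {failed}")
# Note: in a two-profile task an inattentive respondent still has a 50% chance
# of passing, so a low failure rate is only weak evidence of choice validity.
```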


Across-set monotonicity tests for non-satiation across choice tasks; it checks whether people choose the preferred choice profile if one profile dominates across choice tasks. For example, participants might choose treatment B over A in one choice task (treatment A is dominated compared with B). If all the attribute levels of treatment C are better than those of B (treatment B is dominated compared with C), they should not choose treatment A over C in another choice task. Choice profiles that can be used to assess across-set monotonicity can occur spontaneously in an experimental design, especially if a constant opt-out or status quo option is included [33]. However, researchers might need to add one or more choice tasks to assess across-set monotonicity.

6.2. Compensatory preferences (two tests)

Compensatory preferences refer to the extent that participants trade between all attributes of the choice profiles. DCEs assume that participants consider all attributes in their decision-making, that there is always some improvement in one attribute that compensates for a reduction in another, and that the ranges of levels required to accomplish trading are included in the DCE. Non-compensatory preferences can be indicated by either attribute dominance or attribute nonattendance.

Attribute dominance is present if a participant makes choices based on one attribute only or if they choose the choice profile with the best level of a particular attribute for a majority of choice tasks [28,47]. Lack of clarity exists around the cut-off to define attribute dominance; different cut-off points, such as choosing 70% of choice tasks or all choice tasks with the best level for a particular attribute, have been used to define attribute dominance.
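The cut-off approach can be sketched as a simple screen; the counts, respondent identifiers, and 70% threshold below are hypothetical illustrations of one of the cut-offs mentioned above.

```python
# Minimal sketch: flagging possible attribute dominance for one attribute.
# 'chose_best_level' counts, per respondent, how many tasks they chose the
# profile with the better level of a given attribute (hypothetical data).
chose_best_level = {"r01": 10, "r02": 4, "r03": 9, "r04": 12}
n_tasks = 12
CUTOFF = 0.70   # one of several thresholds used in the literature

for rid, count in chose_best_level.items():
    share = count / n_tasks
    if share >= CUTOFF:
        print(f"{rid}: chose the best level of the attribute in {share:.0%} of tasks "
              f"-> possible attribute dominance")
```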

When attribute nonattendance is present, respondents make choices while ignoring one or more attributes. Attribute nonattendance can be examined through multi-step latent class analysis approaches [30]. Attribute nonattendance might be a more comprehensive approach to examining non-compensatory preferences in DCEs [30], but it is also more complicated to assess. Lagarde [31] argued that attribute dominance can also be conceptualized as attribute nonattendance, since, in attribute dominance, all attributes except for the dominant one are ignored.

6.3. Task nonattendance (one test)

Task nonattendance refers to the extent that participants pay attention to the choice tasks. It is assumed that participants actively engage in the choice task, but if the task is too cognitively burdensome or not realistic, they might not carefully consider their decisions [27]. Participants who always choose the choice profile in a particular position in the choice task, for example, the choice profile on the right, are likely not paying attention to the choice task [47,48]. The concept of task nonattendance is more easily measured this way in unlabeled DCEs. In a labeled DCE, a particular product label might have always appeared on one side of the choice task. In this case, if the participant always chooses the profile in a particular position, it is not clear whether they did not pay attention to the task or whether they based their choice on the label only (a sign of non-compensatory preferences).
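A minimal sketch of the profile-positioning screen for an unlabeled two-profile DCE follows; the respondent data are hypothetical.

```python
# Minimal sketch: screening for task nonattendance via profile positioning.
# 'positions' lists, per respondent, the position (left/right) of the profile
# chosen in each task (hypothetical data for an unlabeled two-profile DCE).
positions = {
    "r01": ["left", "right", "left", "left", "right", "left"],
    "r02": ["right"] * 6,                       # always the right-hand profile
    "r03": ["left", "left", "right", "right", "left", "right"],
}

for rid, chosen in positions.items():
    if len(set(chosen)) == 1:
        print(f"{rid}: always chose the {chosen[0]}-hand profile "
              f"-> possible task nonattendance")
```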

7. Choice reliability

Choice reliability examines whether participants make consistent choices that are in accordance with assumptions about consumer choices. This includes the consistency of choices across choice tasks or survey versions and includes concepts such as transitivity, Sen's expansion and contraction principles, and level recoding [23,33].

7.1. Transitivity (one test)

Transitivity refers to the preference relation between three or more choice profiles. In particular, if treatment A is preferred to B in one choice task, and treatment B is preferred to C in a second choice task, then the transitive property states that treatment A should be preferred to C in a third choice task. To assess transitivity, at least three related choice tasks need to be included in the DCE instrument [33]. The first choice task, the choice between treatments A and B, can be part of the regular experimental design. The other two choice tasks (treatment B vs. C and treatment A vs. C) will most likely need to be added to the instrument outside of the existing experimental design.
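The check described above can be sketched per respondent as follows; respondent identifiers and recorded choices are hypothetical, and only the A-over-B, B-over-C pattern named in the text is tested.

```python
# Minimal sketch: checking the transitive property across three related tasks.
# For each respondent we record the winner of A vs. B, B vs. C, and A vs. C
# (hypothetical data).
responses = {
    "r01": {"A_vs_B": "A", "B_vs_C": "B", "A_vs_C": "A"},   # transitive
    "r02": {"A_vs_B": "A", "B_vs_C": "B", "A_vs_C": "C"},   # violation
    "r03": {"A_vs_B": "B", "B_vs_C": "B", "A_vs_C": "A"},   # not the pattern checked here
}

for rid, r in responses.items():
    # The check below only applies when A was preferred to B and B to C.
    if r["A_vs_B"] == "A" and r["B_vs_C"] == "B":
        status = "satisfied" if r["A_vs_C"] == "A" else "violated"
        print(f"{rid}: transitivity {status}")
    else:
        print(f"{rid}: does not exhibit the A-over-B, B-over-C pattern checked here")
```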

7.2. Sen’s consistency (two tests)

Sen's consistency principles provide a more stringent test of rationality than the traditional monotonicity tests [28]. Sen's consistency principles [57] consist of both the contraction principle and the expansion principle. Sen's contraction principle states that if a choice set A is narrowed (to B) and some of the choices from A are still in B, then no unchosen alternatives should be chosen now and no chosen alternatives should be unchosen now. Sen's expansion principle states that if a choice set A is expanded (to C) and some of the chosen alternatives from A are still in C, then no unchosen alternatives from A should be chosen now. To assess the contraction or expansion property, respondents should be presented with an initial and a follow-on choice task that contain different numbers of choice profiles.

For the contraction principle, the initial choice task should contain at least three choice profiles (e.g. treatment A, treatment B, no treatment). The follow-on contraction choice task should present fewer choice profiles than the initial choice task (e.g. treatment A, treatment B). The contraction property is satisfied if a person who chose treatment A in the initial choice task still chooses treatment A in the follow-on contraction choice task [32].

For the expansion principle, the first choice task should contain at least two choice profiles (e.g. treatment A, treatment B). The follow-on expansion choice task should present more choice profiles than the initial choice task (e.g. treatment A, treatment B, no treatment). The expansion property is satisfied if a person who chose treatment A in the initial choice task does not choose treatment B in the follow-on expansion choice task [32].
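A minimal sketch of both checks for a single respondent, under the worked example above (choice sets {A, B, no treatment} and {A, B}); the recorded choices are hypothetical.

```python
# Minimal sketch: checking Sen's contraction and expansion principles for one
# respondent (hypothetical data), using the treatment A / treatment B /
# no-treatment example from the text.

# Contraction: initial task offers {A, B, no treatment}; follow-on offers {A, B}.
initial_3 = "A"        # choice in the initial three-profile task
followon_2 = "A"       # choice in the contracted two-profile task
if initial_3 == "no treatment":
    print("Contraction: initially chosen alternative was removed; no prediction.")
else:
    print(f"Contraction principle satisfied: {followon_2 == initial_3}")

# Expansion: initial task offers {A, B}; follow-on adds "no treatment".
initial_2 = "A"        # choice in the initial two-profile task
followon_3 = "A"       # choice in the expanded three-profile task
unchosen_initially = "B" if initial_2 == "A" else "A"
print(f"Expansion principle satisfied: {followon_3 != unchosen_initially}")
```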


7.3. Level recoding (two tests)

Level recoding refers to the extent that participants process the actual numbers presented in continuous or numeric attributes. Recoding may be a strategy for simplifying evaluations of a relatively unfamiliar but important attribute. Instead of interpreting the actual number and the numeric differences between attribute levels, participants might recode the levels to 'low,' 'medium,' and 'high' categories.

Recoding can be assessed using a scope test, which involves creating two survey versions. For the numeric attribute, these two survey versions need to include some of the same levels (overlapping levels) and at least one different level so the level range is different for the two surveys. If preference estimates are the same for overlapping numeric levels between the two survey versions, we can infer that respondents did not recode the levels but rather reacted to the absolute levels. If the preference estimates for the overlapping levels are different, this suggests participants recoded the numeric values as 'high,' 'medium,' or 'low' [47]. The scope test should be used with caution because if the range of the levels for one of the survey versions is too extreme, it could invoke non-compensatory preferences (discussed above).
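One way to operationalize the comparison of overlapping-level estimates is a simple z-test on the difference between the two versions' coefficients; the values below are hypothetical, and differences in error scale between separately estimated models can also contribute to apparent differences.

```python
# Minimal sketch: comparing the preference estimate for an overlapping numeric
# level between two survey versions with a z-test on the difference.
# Coefficients and standard errors are hypothetical.
import math

coef_v1, se_v1 = 0.45, 0.08    # estimate for the overlapping level, version 1
coef_v2, se_v2 = 0.62, 0.09    # estimate for the overlapping level, version 2

z = (coef_v1 - coef_v2) / math.sqrt(se_v1**2 + se_v2**2)
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))   # two-sided normal p-value
print(f"z = {z:.2f}, p = {p:.3f}")
# A clear difference would be consistent with level recoding rather than
# responses to the absolute numeric values.
```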

An approximation of the scope test, a simple scope test, can be done using just one survey version with a large difference and a small difference between the levels of the numeric attribute included. If the change in preference estimates across the large level difference is similar to the change across the small level difference, this might indicate that participants recoded the numeric levels as 'low,' 'medium,' and 'high' [29]. However, the simple scope test does not determine with certainty that recoding occurred, because the preference function might not be linear.

8. Study qualities beyond validity and reliability

Validity and reliability are just two issues that affect the study quality of a stated-preference study, and a strict focus on only tests of validity and reliability will not guarantee a high-quality preference study. Many recommendations and guidance documents on good research practices exist on the development, design, implementation, and analysis of stated-preference studies [6,9,14,22,24,26,27,40,41,56,58–67]. These recommendations discuss varying aspects of stated-preference studies, ranging from patient relevance to logical soundness, that researchers should consider when conducting and evaluating stated-preference studies. In many of these recommendations, validity and reliability play a role, but they are not the only study qualities that are discussed.

Stakeholders' views on the importance of different aspects of study quality, including the importance of ensuring validity and reliability, are largely unknown. It is not clear which study qualities stakeholders believe should be prioritized to advance the use of stated-preference studies. While it might be desirable to explore certain study qualities in terms of the scientific advancement of stated-preference studies, these qualities might not currently be actionable.

We engaged a diverse group of stakeholders to discuss and prioritize important characteristics of preference studies and where validity and reliability fell within this prioritization. Participants (n = 29) from industry, regulatory agencies, funding agencies, patient advocacy organizations, and academic institutions with experience conducting or evaluating stated-preference studies were invited to participate in an all-day workshop on stated-preference methods. As part of the workshop, we asked participants to form small groups and list the important characteristics of a stated-preference study. Then, each participant allocated 12 votes across the identified characteristics based on what they believed were the most desirable aspects for high scientific quality. Participants could allocate one vote to each of 12 characteristics, allocate 12 votes to one characteristic, or allocate their votes in any other combination. Next, participants repeated the voting process, but this time they allocated 12 votes to the characteristics that they thought were most actionable. After the workshop was completed, the study team categorized the characteristics identified by participants into aspects of a stated-preference study. These aspects were ranked for desirability and actionability according to the number of votes received from the workshop participants.

Through this engagement process, eight aspects of stated-preference studies that determined study quality were identified. These aspects are presented and described in Table 2. Aspects identified included respondents' understanding, appropriate research question, diverse samples, transparency of methods, patient/population centered, stakeholder relevance, internal validity, and external validity.

Understanding/interpretation by respondents was prioritized as the most desirable and actionable aspect. Internal validity was ranked as the seventh most desirable aspect and the sixth most actionable aspect of a stated-preference study. External validity was prioritized as the fourth most desirable but the eighth most actionable aspect. These results suggest that stated-preference stakeholders might be particularly interested in the conceptualization of quality as a process measure [31] rather than solely as an outcomes measure. Further work on the implementation of good research practices for instrument development to ensure process validity will be essential [63].

9. Expert commentary

To be able to conduct many of the discussed validity tests, adaptations to the study design of a DCE are necessary (Table 1). These adaptations need to be taken into account before the study commences and might also complicate data analysis. Design adaptations might increase study costs, decrease statistical efficiency, or increase response burden. Face validity, content validity, attribute dominance, and attribute nonattendance do not require special study design considerations and so can be relatively easily incorporated into almost any DCE if the entire study process, including instrument development, is transparently reported. While attribute nonattendance does not require special study design considerations, it does require the use of a relatively complex analytical process. In addition, external validity does not require changes to the study design, but revealed preferences might not exist and/or might require complex statistical analyses. Therefore, these tests might be more difficult to conduct.



To be able to assess monotonicity, transitivity, Sen's consistency principles, and test–retest stability, at least one additional choice task needs to be added to the study design. To be able to assess level recoding, different instrument versions need to be administered to different participants. To be able to assess test–retest reliability, the DCE needs to be administered twice to the same group of participants. Convergent validity can be tested by conducting a second choice experiment either as part of the DCE instrument or outside of it. Version consistency and holdout predictions can be assessed either by adding a fixed-choice or holdout task or by administering multiple survey versions to different participants.

In cases where validity tests can be performed using multiple study design adaptations, the type of adaptation that requires more resources generally presents a more stringent test. For example, to assess level recoding, a scope test requires that different survey versions are administered to different participants. The simple scope test only requires the adaptation of the numeric attribute levels, but this test does not conclusively show whether recoding occurred or whether preferences are not linear. Generally, more stringent tests require larger adaptations to the study design [28]. Within-set monotonicity is regarded as a weak test of choice validity because it can be relatively easily passed by chance [33], but assessing within-set monotonicity only requires one additional choice task. Transitivity and Sen's consistency theory are more stringent tests of choice reliability but require larger changes to study design. Assessing transitivity requires the addition of at least two choice tasks. Assessing Sen's consistency principle requires the addition of at least one choice task that does not follow the format of the other choice tasks, which might add complexity to the instrument.

Researchers need to be careful in the interpretation of tests of validity and reliability and the conclusions reached based on these tests. Apparent violations of face validity may reflect incorrect assumptions about patient preferences. Furthermore, failure of the dominance test might result from lack of understanding on the part of the researcher as to the interpretation of attribute levels by participants. In addition, the estimation of choice data assumes an element of randomness in the choices. This randomness needs to be considered when conducting and interpreting validity or reliability tests: what may appear to be an inconsistent response may just be noise and not a violation of the rules.

Observed violations in tests of validity and reliability can either be interpreted as an issue with the DCE or as an issue with the respondents. It is important to keep in mind that observed violations do not necessarily mean participants did not understand the tasks, did not use logic in their choices, or were inattentive to the choice tasks. Instead, unexpected results might signal problems in the design and implementation of the DCE [68,69]. It might signal that the researcher did not ask the question appropriately, did not understand the responses appropriately, or imposed their own assumptions on what preferences should be. Furthermore, different theoretical assumptions [50,70–73] might lead to different expectations about the way participants will complete DCE choices.

When interpreting preference results, researchers should start with the assumption that measured responses reflect participants' preferences. Violations of tests of validity and reliability or other unexpected results should not be interpreted as a failure of participants. Rather, they should be taken as an opportunity for the researcher to reflect on their preference instrument, assumptions, and a priori expectations.

10. Five-year review

Unanswered questions surrounding the validity and reliability of DCEs will need to be addressed. While a variety of tests exist to assess the validity of a DCE, important questions remain. It is unclear what to do with participants who fail tests for validity or reliability [28]. Excluding participants who fail the test for choice validity or reliability might increase internal validity but will decrease the external validity of results. Deleting responses can increase sample selection bias and decrease the statistical efficiency and power of the estimated choice models [69]. In addition, participants who fail these tests might have made their choices according to their actual preferences.

Table 2. Prioritization of study qualities by desirability and actionability.

Respondents' understanding (desirability rank 1; actionability rank 1)
● Development process ensures comprehension by participants
● Effective communication that minimizes choice task ambiguity

Appropriate research question (desirability rank 3; actionability rank 5)
● Research question can be answered by a stated-preference study
● Appropriate attributes and levels given the research question

Diverse samples (desirability rank 5; actionability rank 4)
● Sample is representative of the target population
● Enough variability of respondents for subgroup analyses

Transparency of methods (desirability rank 6; actionability rank 3)
● Transparent reporting of methods in all study phases
● Enough detail that the study can be reproduced

Patient/population centered (desirability rank 8; actionability rank 2)
● Focused on eliciting target population(s) perspective
● Considerations of population centeredness in all study phases

Stakeholder relevance (desirability rank 2; actionability rank 7)
● Relevant research question and attributes for stakeholders
● High degree of actionability and impact of results

Internal validity (desirability rank 7; actionability rank 6)
● Robust study design that includes tests of internal validity
● Strong analytic rigor with consideration of data heterogeneity

External validity (desirability rank 4; actionability rank 8)
● Consideration of external validity and quality of evidence
● Reliability if study is replicated or compared with existing evidence


Follow-up questions or debriefing interviews could shed light on the way participants answer choice tasks [32].

It remains unclear exactly when a test of validity or reliability is met. This is especially difficult for tests that require some qualitative or subjective assessment, such as face or content validity. However, ambiguous requirements for more quantitative tests also exist. For example, the acceptable frequency of choosing the dominated treatment profile, or the failure rates of rationality tests or test–retest stability in a study, are not well established. It is likely that ambiguity on when a test is met will remain even if more research is conducted. Such ambiguity can be incorporated into formal tests with, for example, the establishment of 'grey' zones in which conclusions on validity or reliability remain inconclusive [74]. To be able to answer questions of ambiguity, more research on the theoretical properties of DCEs and stated-preference methods is needed. A review of theoretical contributions to the questions of validity and reliability would be needed to identify gaps in the literature.

It is also unclear what set of tests needs to be met for a DCE study to be considered valid. It is unlikely that one study can incorporate each test of validity or reliability, especially because many tests require an adaptation to the study design. A standard set of tests to incorporate is unlikely to be applicable to every DCE. Different aspects of study quality, including issues of process validity [31] or the aspects identified by stakeholders presented in this study, might be more important to consider for certain purposes. Therefore, further work should examine which tests and study aspects might be most important to include given a particular research question and population.

11. Conclusion

The complexity of DCEs means most studies will neither include nor meet all tests of validity and reliability identified in this review. More studies are needed to investigate the determinants of validity for a DCE; this work might include studies on study quality and good research practices that go beyond the concepts of validity and reliability as outcome measures. Given unanswered questions regarding tests for validity and reliability, researchers need to remain cautious when applying these tests. Personal expectations and assumptions regarding the outcomes of these tests should be carefully examined so as not to lead to incorrect conclusions. While limitations in establishing the reliability and validity of a DCE remain, this review can help researchers gain insights into the available types of tests for validity and reliability and how to incorporate these tests into their studies.

Key issues

● The recent endorsement of stated-preference methods, including discrete-choice experiments (DCEs), by regulatory and health technology assessment (HTA) agencies has placed a greater focus on demonstrating the validity and reliability of preference results.

● From the published literature, we developed a framework to conceptualize tests of validity and reliability for DCEs. This conceptual model spans 24 tests and includes four conceptual domains: measurement validity, measurement reliability, choice validity, and choice reliability.

● Validity and reliability are just two important aspects of the quality of a stated-preference study. Other important aspects, as identified by stakeholders, include respondents’ understanding, appropriate research question, diverse samples, transparency of methods, patient/population centered, stakeholder relevance, internal validity, and external validity.

● In the next 5 years, unanswered questions regarding the validity and reliability of DCEs need to be addressed. These questions include treatment of participants who fail tests for validity and reliability, selecting cut-off points to determine when a test of validity and reliability is met, and providing guidance on when a stated-preference study can be considered valid.

● Caution needs to be taken when assessing reliability and validity for stated-preference studies as many questions regarding the application of these tests remain. Furthermore, personal expectations and assumptions regarding the outcomes of these tests should be carefully examined so as not to bias conclusions that are based on these tests.

Acknowledgments

The authors sincerely thank the Johns Hopkins Institute for Clinical and Translational Research (ICTR) Community Research Advisory Council (C-RAC) and members of the Diabetes Action Board (DAB) for their valuable contributions and engagement in the research study.

Funding

This work was supported by a Patient-Centered Outcomes Research Institute (PCORI) Methods Award (ME-1303-5946) and by the Center for Excellence in Regulatory Science and Innovation (CERSI) (1U01FD004977-01).

Declaration of interest

DA Marshall is funded through a Canada Research Chair, Health Systems and Services Research, and the Arthur J.E. Child Chair in Rheumatology Outcomes Research. AB Hauber is an employee of RTI Health Solutions. The funders had no role in the design and conduct of the study, interpretation of the data, or preparation of the manuscript. The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed. Peer reviewers on this manuscript have no relevant financial or other relationships to disclose.

References

Papers of special note have been highlighted as either of interest (•) or of considerable interest (••) to readers.

1. Institute of Medicine (US) Committee on Quality of Health Care in America. Crossing the quality chasm: a new health system for the 21st century. Washington (DC): The National Academy of Sciences; 2001.

2. Epstein RM, Street RL. The values and value of patient-centered care. Ann Fam Med. 2011 Mar;9(2):100–103.

3. Frank L, Basch E, Selby JV. The PCORI perspective on patient-centered outcomes research. JAMA. 2014 Oct 15;312(15):1513–1514. DOI:10.1001/jama.2014.11100.

4. Domecq JP, Prutsky G, Elraiyah T, et al. Patient engagement in research: a systematic review. BMC Health Serv Res. 2014 Feb 26;14:89.


5. Bridges JF, Jones C. Patient-based health technology assessment: a vision of the future. Int J Technol Assess Health Care. 2007 Winter;23(1):30–35. DOI:10.1017/s0266462307051549.

6. Ho M, Saha A, McCleary KK, et al. A framework for incorporating patient preferences regarding benefits and risks into regulatory assessment of medical technologies. Value Health. 2016 Sep–Oct;19(6):746–750. DOI:10.1016/j.jval.2016.02.019.

7. Muhlbacher AC, Bridges JF, Bethge S, et al. Preferences for antiviral therapy of chronic hepatitis C: a discrete choice experiment. Eur J Health Econ. 2016 Feb 04. DOI:10.1007/s10198-016-0763-8.

8. European Medicines Agency (EMA). Benefit-risk methodology project. Work package 4 report: benefit-risk tools and processes; 2012 [cited 2017 Oct 12]. Available from: http://www.ema.europa.eu/docs/en_GB/document_library/Report/2012/03/WC500123819.pdf

9. Medical Device Innovation Consortium. A Framework for Incorporating Information on Patient Preferences Regarding Benefit and Risk into Regulatory Assessments of New Medical Technology; 2015 [cited 2017]. Available from: http://mdic.org/wp-content/uploads/2015/05/MDIC_PCBR_Framework_Web1.pdf

•• The Framework Report discusses considerations for the use of patient preference information in the regulatory assessment, reimbursement, marketing, and shared decision-making. It specifically calls for the need to study what determines the validity and reliability of stated-preference studies.

10. Guo JJ, Pandey S, Doyle J, et al. A review of quantitative risk-benefit methodologies for assessing drug safety and efficacy-report of the ISPOR risk-benefit management working group. Value Health. 2010;13(5):657–666.

11. Hauber BA, Fairchild AO, Johnson FR. Quantifying benefit-risk preferences for medical interventions: an overview of a growing empirical literature. Appl Health Econ Health Policy. 2013;11(4):319–329.

12. Mühlbacher AC, Juhnke C, Beyer AR, et al. Patient-focused benefit-risk analysis to inform regulatory decisions: the European Union perspective. Value Health. 2016;19(6):734–740.

13. Australian Pharmaceutical Benefits Advisory Committee (PBAC). Appendix 6 - Including non health outcomes in a supplementary analysis; 2016 [cited 2017]. Available from: https://pbac.pbs.gov.au/appendixes/appendix-6-including-nonhealth-outcomes-in-a-supplementary-analysis.html

14. United States Food and Drug Administration (FDA). Patient Preference Information – Voluntary Submission, Review in Premarket Approval Applications, Humanitarian Device Exemption Applications, and De Novo Requests, and Inclusion in Decision Summaries and Device Labeling; 2016 [cited 2017 Aug 23]. Available from: http://www.fda.gov/downloads/medicaldevices/deviceregulationandguidance/guidancedocuments/ucm446680.pdf

•• Guidance put forth by the US FDA on patient preference information (PPI) that may be used by FDA staff in decision-making on medical devices. The guidance presents 11 study qualities to consider when conducting patient preference studies.

15. Ozdemir S. Improving the validity of stated-preference data in health research: the potential of the time-to-think approach. Patient. 2015 Jun;8(3):247–255. DOI:10.1007/s40271-014-0084-x.

16. Adamowicz W, Swait J, Boxall P, et al. Perceptions versus objective measures of environmental quality in combined revealed and stated preference models of environmental valuation. J Environ Econ Manage. 1997;32(1):65–84.

17. Adamowicz W, Louviere J, Williams M. Combining revealed and stated preference methods for valuing environmental amenities. J Environ Econ Manage. 1994;26(3):271–292.

18. Gurmankin AD, Baron J, Hershey JC, et al. The role of physicians’ recommendations in medical treatment decisions. Med Decis Making. 2002 May-Jun;22(3):262–271.

19. MacLeod TE, Harris AH, Mahal A. Stated and revealed preferences for funding new high-cost cancer drugs: a critical review of the evidence from patients, the public and payers. Patient. 2016 Jun;9(3):201–222. DOI:10.1007/s40271-015-0139-7.

20. Sacristan JA, Lizan L, Comellas M, et al. Perceptions of oncologists, healthcare policy makers, patients and the general population on the value of pharmaceutical treatments in oncology. Adv Ther. 2016 Nov;33(11):2059–2068.

21. Rakotonarivo OS, Schaafsma M, Hockley N. A systematic review of the reliability and validity of discrete choice experiments in valuing non-market environmental goods. J Environ Manage. 2016;183:98–109. DOI:10.1016/j.jenvman.2016.08.032.

• A systematic review on the validity and reliability of environmental DCEs.

22. Clark MD, Determann D, Petrou S, et al. Discrete choice experiments in health economics: a review of the literature. Pharmacoeconomics. 2014 Sep;32(9):883–902. DOI:10.1007/s40273-014-0170-x.

• Most recent systematic review on health-related DCEs published. Includes a count of studies that include certain tests of validity.

23. De Bekker-Grob EW, Ryan M, Gerard K. Discrete choice experiments in health economics: a review of the literature. Health Econ. 2012 Feb;21(2):145–172. DOI:10.1002/hec.1697.

24. Ryan M, Gerard K. Using discrete choice experiments to value health care programmes: current practice and future research reflections. Appl Health Econ Health Policy. 2003;2(1):55–64.

25. Ryan M, Scott DA, Reeves C, et al. Eliciting public preferences for healthcare: a systematic review of techniques. Health Technol Assess. 2001;5(5):1–186.

26. Lancsar E, Louviere J. Conducting discrete choice experiments to inform healthcare decision making: a user’s guide. Pharmacoeconomics. 2008;26(8):661–677.

27. Muhlbacher A, Johnson FR. Choice experiments to quantify preferences for health and healthcare: state of the practice. Appl Health Econ Health Policy. 2016 Jun;14(3):253–266. DOI:10.1007/s40258-016-0232-7.

28. Ryan M, Bate A, Eastmond CJ, et al. Use of discrete choice experiments to elicit preferences. Qual Health Care. 2001 Sep;10(Suppl 1):i55–i60.

29. Johnson FR, Mohamed AF, Ozdemir S, et al. How does cost matter in health-care discrete-choice experiments? Health Econ. 2011 Mar;20(3):323–330.

30. Lagarde M. Investigating attribute non-attendance and its consequences in choice experiments with latent class models. Health Econ. 2013 May;22(5):554–567. DOI:10.1002/hec.2824.

31. Lancsar E, Swait J. Reconceptualising the external validity of discrete choice experiments. Pharmacoeconomics. 2014 Oct;32(10):951–965. DOI:10.1007/s40273-014-0181-7.

32. Miguel FS, Ryan M, Amaya-Amaya M. ‘Irrational’ stated preferences: a quantitative and qualitative investigation. Health Econ. 2005 Mar;14(3):307–322. DOI:10.1002/hec.912.

33. Ozdemir S, Mohamed AF, Johnson FR, et al. Who pays attention in stated-choice surveys? Health Econ. 2010 Jan;19(1):111–118. DOI:10.1002/hec.1452.

34. Telser H, Zweifel P. Validity of discrete-choice experiments evidence for health risk reduction. Appl Econ. 2007;39(1):69–78.

35. Trochim W, Kane M. Concept mapping: an introduction to structured conceptualization in health care. Int J Qual Health Care. 2005 Jun;17(3):187–191. DOI:10.1093/intqhc/mzi038.

36. De Bekker-Grob EW, Hol L, Donkers B, et al. Labeled versus unlabeled discrete choice experiments in health economics: an application to colorectal cancer screening. Value Health. 2010 Mar-Apr;13(2):315–323. DOI:10.1111/j.1524-4733.2009.00670.x.

37. Muhlbacher AC, Rudolph I, Lincke HJ, et al. Preferences for treatment of attention-deficit/hyperactivity disorder (ADHD): a discrete choice experiment. BMC Health Serv Res. 2009;9:149. DOI:10.1186/1472-6963-9-149.

38. Kenny P, Hall J, Viney R, et al. Do participants understand a stated preference health survey? A qualitative approach to assessing validity. Int J Technol Assess Health Care. 2003 Fall;19(4):664–681.

39. Bijlenga D, Birnie E, Bonsel GJ. Feasibility, reliability, and validity of three health-state valuation methods using multiple-outcome vignettes on moderate-risk pregnancy at term. Value Health. 2009 Jul-Aug;12(5):821–827. DOI:10.1111/j.1524-4733.2009.00503.x.

40. Hollin IL, Caroline Y, Hanson C, et al. Developing a patient-centered benefit-risk survey: a community-engaged process. Value Health. 2016 Sep–Oct;19(6):751–757. DOI:10.1016/j.jval.2016.02.014.

41. Janssen EM, Segal JB, Bridges JF. A framework for instrument development of a choice experiment: an application to type 2 diabetes. Patient. 2016 Oct;9(5):465–479. DOI:10.1007/s40271-016-0170-3.

42. Mark TL, Swait J. Using stated preference and revealed preference modeling to evaluate prescribing decisions. Health Econ. 2004 Jun;13(6):563–573. DOI:10.1002/hec.845.

43. Janssen EM, Hauber AB, Bridges JFP. Conducting a discrete choice experiment study following recommendations for good research practices: an application for eliciting patient preferences for diabetes treatments. Value Health. 2017. DOI:10.1016/j.jval.2017.07.001.

44. Veldwijk J, Essers BA, Lambooij MS, et al. Survival or mortality: does risk attribute framing influence decision-making behavior in a discrete choice experiment? Value Health. 2016 Mar-Apr;19(2):202–209. DOI:10.1016/j.jval.2015.11.004.

45. Rockers PC, Jaskiewicz W, Wurts L, et al. Preferences for working in rural clinics among trainee health professionals in Uganda: a discrete choice experiment. BMC Health Serv Res. 2012 Jul 23;12:212.

46. Kruk ME, Paczkowski MM, Tegegn A, et al. Women’s preferences for obstetric care in rural Ethiopia: a population-based discrete choice experiment in a region with low rates of facility delivery. J Epidemiol Commun Health. 2010 Nov;64(11):984–988. DOI:10.1136/jech.2009.087973.

47. Ho MP, Gonzalez JM, Lerner HP, et al. Incorporating patient-preference evidence into regulatory decision making. Surg Endosc. 2015;29(10):2984–2993. DOI:10.1007/s00464-014-4044-2.

48. Bridges JF, Mohamed AF, Finnern HW, et al. Patients’ preferences for treatment outcomes for advanced non-small cell lung cancer: a conjoint analysis. Lung Cancer. 2012 Jul;77(1):224–231. DOI:10.1016/j.lungcan.2012.01.016.

49. Kimberlin CL, Winterstein AG. Validity and reliability of measurement instruments used in research. Am J Health Syst Pharm. 2008 Dec 01;65(23):2276–2284. DOI:10.2146/ajhp070364.

50. McFadden D. Conditional logit analysis of qualitative choice behavior. In: Zarembka P, editor. Frontiers in econometrics. New York (NY): Academic Press; 1974. p. 105–142.

51. Hougaard JL, Tjur T, Osterdal LP. On the meaningfulness of testing preference axioms in stated preference discrete choice experiments. Eur J Health Econ. 2012 Aug;13(4):409–417. DOI:10.1007/s10198-011-0312-4.

52. Bijlenga D, Bonsel GJ, Birnie E. Eliciting willingness to pay in obstetrics: comparing a direct and an indirect valuation method for complex health outcomes. Health Econ. 2011 Nov;20(11):1392–1406. DOI:10.1002/hec.1678.

53. Hollin IL, Peay HL, Bridges JF. Caregiver preferences for emerging Duchenne muscular dystrophy treatments: a comparison of best-worst scaling and conjoint analysis. Patient. 2015;8(1):19–27.

54. Doiron D, Yoo HI. Temporal stability of stated preferences: the case of junior nursing jobs. Health Econ. 2016 Apr;14:802–809. DOI:10.1002/hec.3350.

55. Orme B. Including holdout choice tasks in conjoint studies. Washington: Sawtooth Software Inc.; 2015.

56. Hauber AB, Gonzalez JM, Groothuis-Oudshoorn CG, et al. Statistical methods for the analysis of discrete choice experiments: a report of the ISPOR conjoint analysis good research practices task force. Value Health. 2016 Jun;19(4):300–315. DOI:10.1016/j.jval.2016.04.004.

57. Sen A. Internal consistency of choice. Econometrica. 1993;61:495–521.

58. Marshall D, Bridges JF, Hauber B, et al. Conjoint analysis applications in health - how are studies being designed and reported?: an update on current practice in the published literature between 2005 and 2008. Patient. 2010 Dec 1;3(4):249–256. DOI:10.2165/11539650-000000000-00000.

59. Joy SM, Little E, Maruthur NM, et al. Patient preferences for the treatment of type 2 diabetes: a scoping review. Pharmacoeconomics. 2013;31(10):877–892.

60. Bridges JF. Stated preference methods in health care evaluation: an emerging methodological paradigm in health economics. Appl Health Econ Health Policy. 2003;2(4):213–224.

61. Ryan M, Farrar S. Using conjoint analysis to elicit preferences for health care. BMJ. 2000 Jun 3;320(7248):1530–1533.

62. dosReis S, Castillo WC, Ross M, et al. Attribute development using continuous stakeholder engagement to prioritize treatment decisions: a framework for patient-centered research. Value Health. 2016 Sep–Oct;19(6):758–766. DOI:10.1016/j.jval.2016.02.013.

63. Coast J, Al-Janabi H, Sutton EJ, et al. Using qualitative methods for attribute development for discrete choice experiments: issues and recommendations. Health Econ. 2012;21(6):730–741.

64. De Bekker-Grob EW, Donkers B, Jonker MF, et al. Sample size requirements for discrete-choice experiments in healthcare: a practical guide. Patient. 2015 Oct;8(5):373–384.

65. Lancsar E, Louviere J, Flynn T. Several methods to investigate relative attribute impact in stated preference experiments. Soc Sci Med. 2007 Apr;64(8):1738–1753. DOI:10.1016/j.socscimed.2006.12.007.

66. Bridges JF, Hauber AB, Marshall D, et al. Conjoint analysis applications in health – a checklist: a report of the ISPOR good research practices for conjoint analysis task force. Value Health. 2011 Jun;14(4):403–413. DOI:10.1016/j.jval.2010.11.013.

67. Johnson FR, Lancsar E, Marshall D, et al. Constructing experimental designs for discrete-choice experiments: report of the ISPOR Conjoint Analysis Experimental Design Good Research Practices Task Force. Value Health. 2013;16(1):3–13.

68. Viney R, Lancsar E, Louviere J. Discrete choice experiments to measure consumer preferences for health and healthcare. Expert Rev Pharmacoecon Outcomes Res. 2002 Aug;2(4):319–326. DOI:10.1586/14737167.2.4.319.

69. Lancsar E, Louviere J. Deleting ‘irrational’ responses from discrete choice experiments: a case of investigating or imposing preferences? Health Econ. 2006 Aug;15(8):797–811. DOI:10.1002/hec.1104.

70. Thurstone LL. A law of comparative judgment. Psychol Rev. 1927;34(4):273–286.

71. Kahneman D, Tversky A. Prospect theory: an analysis of decision under risk. Econometrica. 1979;47(2):263–291.

72. Loomes G, Sugden R. A rationale for preference reversal. Am Econ Rev. 1983;73(3):428–432.

73. Loomes G, Sugden R. Testing for regret and disappointment in choice under uncertainty. Econ J. 1987;97:118–129.

74. Durbin J, Watson GS. Testing for serial correlation in least squares regression: I. Biometrika. 1950;37(3–4):409–428.
