Curtailment: A method to reduce the length of self-report questionnaires while maintaining diagnostic accuracy



Marjolein Fokkema a,*, Niels Smits a, Matthew D. Finkelman b, Henk Kelderman a, Pim Cuijpers a

a Vrije Universiteit Amsterdam, the Netherlands
b Tufts University School of Dental Medicine, USA

* Correspondence to: Vrije Universiteit Amsterdam, Van der Boechorststraat 1, 1081 BT Amsterdam, the Netherlands. Tel.: +31 6 248 92 741. E-mail address: [email protected] (M. Fokkema).

Article info

Article history:
Received 12 April 2013
Received in revised form 24 October 2013
Accepted 4 November 2013
Available online 12 November 2013

Keywords:
Curtailment
Stochastic curtailment
Respondent burden
Efficiency
Common mental health disorders
Screening

Abstract

Minimizing the respondent burden and maximizing the classification accuracy of tests is essential for efficacious screening for common mental health disorders. In previous studies, curtailment of tests has been shown to reduce average test length considerably, without loss of accuracy. In the current study, we simulate Deterministic (DC) and Stochastic (SC) Curtailment for three self-report questionnaires for common mental health disorders, to study the potential gains in efficiency that can be obtained in screening for these disorders. The curtailment algorithms were applied in an existing dataset of item scores of 502 help-seeking participants. Results indicate that DC reduces test length by up to 37%, and SC reduces test length by up to 46%, with only very slight decreases in diagnostic accuracy. Compared to an item response theory based adaptive test with similar test length, SC provided better diagnostic accuracy. Consequently, curtailment may be useful in improving the efficiency of mental health self-report questionnaires.

© 2013 Elsevier Ireland Ltd. All rights reserved.

1. Introduction

As noted by, for example, Gilbody et al. (2006) and the UK National Screening Committee (2003), tests used in screening for common mental health disorders should be simple, precise and acceptable to patients. Similarly, in discussing the costs and benefits of screening, Gray and Austoker (1998) noted “All screening programmes do harm; some also do good” (p. 983). Therefore, minimizing the respondent burden and maximizing the accuracy of tests is essential for efficacious screening.

To reduce the respondent burden of screeners for common mental health disorders, many efforts have been aimed at creating fixed-length short forms of existing self-report questionnaires (e.g., Donker et al., 2011; Cuijpers et al., 2010; Rost et al., 1993). However, for fixed-length short forms, the reduction in test length generally comes at the expense of diagnostic accuracy (Smith et al., 2000; Mitchell and Coyne, 2007).

Over the past few decades, adaptive testing algorithms have been developed, aimed at reducing test length without reducing accuracy (e.g., Weiss, 1982; Van der Linden and Glas, 2010). In every stage of adaptive testing, earlier item responses are used to select the item which is most informative for the current respondent, and items that do not provide additional information are not administered. In general, this results in considerable test length reductions, while the diagnostic accuracy or measurement precision of the original full-length instrument is preserved (e.g., Fliege et al., 2005; Fries et al., 2009; Gibbons et al., 2008; Smits et al., 2011; Walter et al., 2007). However, most adaptive testing algorithms are based on Item Response Theory (IRT), and assume the data to satisfy the conditions of a latent trait model. Often, a single latent trait underlying the data is assumed, which may be unrealistic for (mental) health self-report questionnaires (e.g., Fayers, 2007; Gardner et al., 2002; Petersen et al., 2006). In addition, the purpose of classification is prediction of an external criterion, so methods that do not depend on a latent trait model may be preferable for classification (Smits and Finkelman, 2013).

Recently, Finkelman et al. (2011) and Finkelman et al. (2012) introduced curtailment as a method for reducing the respondent burden of mental health self-report questionnaires, which does not assume latent traits underlying the item scores. Earlier, the statistical properties of curtailment were studied by Eisenberg and Simons (1978) and Eisenberg and Ghosh (1980), and curtailment has been used for early stopping in clinical trials (e.g., Lan et al., 1982). The application of curtailment in psychological testing results in variable-length tests, in which testing is halted when administration of the remaining items is unable or unlikely to change the final classification decision. In other words, item administration is continued only as long as the resulting diagnostic outcome is amenable to change.





Consider the example of using the seven-item anxiety symptom subscale of the Hospital Anxiety and Depression Scale (HADS-A; Zigmond and Snaith, 1983) to screen for anxiety disorders, using a cutoff score of eight (Bjelland et al., 2002). Self-evidently, item administration can be ceased whenever a respondent obtains a cumulative score of eight or higher before all items have been administered. Likewise, item administration can be ceased when it is no longer possible for a respondent to obtain a final score of eight or higher with the remaining items. In addition, for respondents endorsing the most severe response option (an item score of three) on the first two items, it seems likely that their final test score will exceed the cutoff, and therefore further item administration may not be necessary. On the other hand, for respondents endorsing the least severe response option (an item score of zero) on the first two items, it seems likely that their final test score will not exceed the cutoff, and further item administration may not be necessary, either. However, for respondents endorsing response options representing more moderate levels of anxiety on the first two items, administration of subsequent items may be necessary to determine whether their final test score will or will not exceed the cutoff value.

Curtailment provides a formalization of this idea, and Finkelman et al. (2011) have developed an algorithm for application of curtailment to health questionnaires. Their method depends on observed scores only, and makes no assumptions about underlying latent traits. In addition, curtailment can be applied deterministically or stochastically. For application of Deterministic Curtailment (DC), a cutoff value for classifying respondents as “at risk” is needed.¹ During testing, item administration for a respondent is halted and an “at risk” classification is made when the remaining items can no longer result in a final test score below the cutoff value. Item administration is halted and a “not at risk” classification is made when the remaining items can no longer result in a final test score at or above the cutoff value.
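The DC stopping rule depends only on the respondent's cumulative score, the number of items left, and the maximum item score. As a minimal sketch in R (the language the authors used for their implementation), the following function applies the rule; the function and argument names are illustrative, not taken from the authors' code.

dc_decision <- function(cum_score, n_remaining, max_score, cutoff) {
  # Lowest possible final score: no further points are gained
  min_final <- cum_score
  # Highest possible final score: all remaining items at the highest option
  max_final <- cum_score + n_remaining * max_score
  if (min_final >= cutoff) return("at risk")      # cutoff can no longer be missed
  if (max_final <  cutoff) return("not at risk")  # cutoff can no longer be reached
  "continue"                                      # classification still amenable to change
}

# HADS-A example: 7 items scored 0-3, cutoff 8; two items endorsed at 3
dc_decision(cum_score = 6, n_remaining = 5, max_score = 3, cutoff = 8)  # "continue"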

For application of Stochastic Curtailment (SC), a cutoff value for classifying respondents as “at risk” is required as well. In addition, for the stochastic part of the algorithm, an existing dataset of item scores is needed, and the user has to specify a value for γ: the threshold for the probability that the classification decision based on the stochastically curtailed version will match that of the full-length instrument. First, the algorithm is trained by splitting the complete dataset of item scores into two parts: the “at risk” and the “not at risk” datasets. The “at risk” dataset contains item scores of all respondents with a test score meeting or exceeding the cutoff value; the “not at risk” dataset contains item scores of all respondents with a test score below the cutoff value (Finkelman et al., 2012). Next, the algorithm is applied for shortening tests for new respondents: for every new respondent, after administration of every item, a cumulative score is calculated. In the “at risk” and “not at risk” datasets, all scores on the remaining items are appended to the cumulative score. When the proportion of resulting test scores meeting or exceeding the cutoff value is ≥ γ in both the “at risk” and “not at risk” datasets, item administration is halted, and an “at risk” classification is made for the new respondent. Similarly, when the proportion of resulting test scores meeting or exceeding the cutoff value is ≤ (1 − γ) in both the “at risk” and “not at risk” datasets, item administration is halted, and a “not at risk” classification is made for the new respondent (Finkelman et al., 2012).
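A minimal sketch of this SC rule follows, under the simplifying assumptions that the training data are a numeric matrix (rows are respondents, columns are items in administration order) and that both training strata are non-empty; names are illustrative, and this is not the authors' implementation.

sc_decision <- function(scores_sofar, train, cutoff, gamma = 0.95) {
  k <- length(scores_sofar)
  n_items <- ncol(train)
  if (k == n_items)  # all items administered: classify on the full score
    return(if (sum(scores_sofar) >= cutoff) "at risk" else "not at risk")
  full_scores <- rowSums(train)
  at_risk     <- train[full_scores >= cutoff, , drop = FALSE]  # "at risk" training dataset
  not_at_risk <- train[full_scores <  cutoff, , drop = FALSE]  # "not at risk" training dataset
  # Append each training respondent's remaining item scores to the cumulative
  # score, and compute the proportion of completed scores meeting the cutoff
  p_complete <- function(d)
    mean(sum(scores_sofar) + rowSums(d[, (k + 1):n_items, drop = FALSE]) >= cutoff)
  p_risk <- p_complete(at_risk)
  p_safe <- p_complete(not_at_risk)
  if (p_risk >= gamma && p_safe >= gamma)         return("at risk")
  if (p_risk <= 1 - gamma && p_safe <= 1 - gamma) return("not at risk")
  "continue"
}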

An illustration of DC and SC of administration of the HADS-A scale to two new respondents is provided in Appendix A. Although the curtailment algorithms can seem elaborate, for practical application of curtailment, look-up tables with stopping criteria for every item can be created, which are easy to use and implement.
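For DC, such a look-up table can even be derived without training data, because the stopping criteria follow directly from the cutoff and the score range. In the sketch below (column names are illustrative), testing halts with an “at risk” classification once the cumulative score reaches the cutoff, and with a “not at risk” classification once the cumulative score is so low that the cutoff is out of reach; negative entries in the last column mean that a “not at risk” stop is not yet possible at that position.

dc_lookup <- function(n_items, max_score, cutoff) {
  k <- seq_len(n_items - 1)  # decision points after items 1 .. n_items - 1
  data.frame(
    after_item       = k,
    stop_at_risk     = cutoff,                                 # halt if cumulative score >= this
    stop_not_at_risk = cutoff - 1 - (n_items - k) * max_score  # halt if cumulative score <= this
  )
}
dc_lookup(n_items = 7, max_score = 3, cutoff = 8)  # HADS-A example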

It should be noted that application of SC with γ = 1.00 will generally yield the same results as application of DC. However, SC with γ = 1.00 may halt testing earlier than DC, when some response patterns are theoretically possible, but not observed empirically. For example, when the highest response option for the last item is never observed in the training dataset, SC with γ = 1.00 may halt testing for some respondents before the last item, whereas testing for these respondents would be continued with DC.

The goal of the current article is to illustrate the potential gains in efficiency that may be obtained by application of curtailment in mental health care applications. In what follows, we will illustrate this with a post-hoc simulation of curtailment in an existing dataset of item responses on self-report questionnaires. To provide a benchmark for assessing the performance of curtailment, we will simulate an IRT-based adaptive test as well. In Section 2, the dataset, algorithms and simulation design will be described. In Section 3, the findings in terms of test length reduction and accuracy will be presented. In Section 4, implications of the current study and directions for further research will be discussed.

2. Method

2.1. Participants

The dataset used in the current study was collected for development of a fixed-length, web-based screener for common mental disorders (Donker et al., 2009, 2011). The total sample consisted of 502 participants, with a mean age of 43 years (S.D. = 13, range 18–80). A majority of the subjects (57%) was female. Detailed information about the sample is provided in Donker et al. (2009). Questionnaires were completed by all 502 participants, and because of computerized questionnaire administration, no data were missing. Diagnoses of DSM-IV disorders were obtained from a subsample of 157 participants (Donker et al., 2009, 2011). Of these participants, 29.29% were diagnosed with major depressive disorder, 19.11% were diagnosed with generalized anxiety disorder, and 59.87% were diagnosed with an anxiety disorder.

2.2. Measures

Center for Epidemiologic Studies Depression scale. The CES-D (Radloff, 1977) is a 20-item questionnaire about depressive symptomatology. Items are scored 0 to 3, indicating increasing frequency of symptom occurrence over the past week. Acceptable sensitivity and specificity have been reported for a cutoff value of 16 (Beekman et al., 1997; Wada et al., 2007; Whooley et al., 1997).

Hospital Anxiety and Depression Scale. The Anxiety subscale of the HADS (Zigmond and Snaith, 1983) is a 7-item questionnaire about symptoms of anxiety. Items are scored on a four-point scale, ranging from 0 (not at all) to 3 (very often indeed). Bjelland et al. (2002) reported Cronbach's α values ranging from 0.68 to 0.93 (mean 0.83). With a cutoff score of eight, sensitivity and specificity for the HADS-A were approximately 0.80.

Generalized Anxiety Disorder scale. The GAD scale (Spitzer et al., 2006) is a 7-item self-report scale. Items are scored 0–3, indicating increasing severity of symptoms over the last 2 weeks. Kroenke et al. (2007) reported a Cronbach's α of 0.92, sensitivities for several anxiety disorders ranging from 0.66 to 0.89, and specificities ranging from 0.80 to 0.82, with a cutoff value of ten.

Composite International Diagnostic Interview. To assess the presence of DSM-IV disorders, the CIDI version 2.1 (World Health Organization, 1997) was administered by telephone. CIDI diagnoses of depressive disorders were used as a ‘gold standard’ for assessment of the accuracy of the curtailed CES-D. Similarly, CIDI diagnoses of anxiety disorders were used for assessment of the accuracy of the curtailed HADS-A, and CIDI diagnoses of generalized anxiety disorder were used for assessment of the accuracy of the curtailed GAD scale.

2.3. Simulation design

DC and SC were simulated by application of the algorithms to the item score data of all 502 participants. The DC and SC algorithms were implemented in a custom function written in R (R Development Core Team, 2010), following the descriptions of Finkelman et al. (2012).

¹ Note that the curtailment algorithms described here make two implicit assumptions: first, that respondents scoring above, or equal to, the cutoff value are classified as “at risk”, and respondents scoring below the cutoff value are classified as “not at risk”; second, that all response options have values ≥ 0.


The code used for application of curtailment can be obtained from the first author. SC was simulated with a γ value of 0.95 (SC95), and a γ value of 0.75 (SC75).

The performance of DC and SC was contrasted with the performance of an IRT-based Computerized Classification Test (CCT), in which the item order was determined by maximum information at the cutoff value. The CCT algorithm as described in Smits and Finkelman (2013) was used. For training of the CCT, the Graded Response Model (GRM; Samejima, 1979) was used, because all questionnaires consisted of polytomous items. The parameters of the GRM were estimated using marginal ML estimation, as implemented in the ltm package (Rizopoulos, 2006). To determine the cutoff value for the latent trait variable θ, corresponding to the cutoff value for the test score, a test characteristic curve was computed using the lordif package (Choi et al., 2011).

For application of the CCT, items were selected using maximum information at the cutoff value of θ (Equation 4, Thompson, 2009). After administration of every item, an estimate of the respondent's θ value and the corresponding Standard Error (SE) were obtained. These were used to calculate a Confidence Interval (CI) for the respondent's estimated θ value (Equation 7, Thompson, 2009). When the CI no longer included the cutoff value for θ, testing was halted and a classification was made for the respondent. The value of the normal deviate corresponding to a 1 − ε confidence interval was set to yield similar average test lengths for CCT and SC95. This approach allowed for straightforward comparison of the two methods in terms of accuracy (Thompson, 2011).
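The CCT termination rule can be summarized in a few lines. In this sketch, theta_hat and se are assumed to come from an IRT scoring routine (e.g., empirical Bayes scores from the ltm package), and z is the normal deviate chosen to match the average test length of SC95; the function name is illustrative.

cct_decision <- function(theta_hat, se, theta_cutoff, z = 1.96) {
  lower <- theta_hat - z * se
  upper <- theta_hat + z * se
  if (lower > theta_cutoff) return("at risk")      # entire CI above the cutoff on theta
  if (upper < theta_cutoff) return("not at risk")  # entire CI below the cutoff on theta
  "continue"                                       # CI still contains the cutoff: administer next item
}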

2.4. Outcomes

Performance of SC and CCT was assessed by means of leave-one-out cross-validation, because using the same data for training and evaluation of an algorithm may result in overly optimistic estimates of performance (Stone, 1974; Hastie et al., 2008). For every respondent, the algorithms were trained using all data minus the response pattern of the current respondent. Then, the algorithm was applied to the response pattern of the current respondent. For every respondent, the number of items administered and the final classification decision were recorded.
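This design can be expressed as a simple loop. In the sketch below, `classify` stands for any of the procedures above (SC with a given γ, or the CCT) and is assumed to return the number of items administered and the classification decision; names are illustrative.

loocv <- function(data, classify, ...) {
  results <- vector("list", nrow(data))
  for (i in seq_len(nrow(data))) {
    train <- data[-i, , drop = FALSE]                # all respondents except respondent i
    results[[i]] <- classify(data[i, ], train, ...)  # apply the trained algorithm to respondent i
  }
  results
}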

We evaluated the efficiency of curtailment by calculating means and standard deviations of the test length distributions for DC, SC95, SC75, and CCT. In addition, the accuracy of the methods was evaluated by calculating the concordance rate (proportion of respondents with the same classification as the full-length instrument), sensitivity (true positives/[true positives + false negatives]), and specificity (true negatives/[true negatives + false positives]). Sensitivity and specificity were calculated with respect to DSM-IV diagnoses, as assessed by the CIDI interview in a subgroup of 157 respondents.
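In code, these three outcome measures reduce to one line each (a sketch; the 0/1 vectors of decisions and diagnoses are assumed to be available from the cross-validation step).

concordance <- function(curtailed, full) mean(curtailed == full)              # same decision as full-length test
sensitivity <- function(pred, dsm) sum(pred == 1 & dsm == 1) / sum(dsm == 1)  # true positive rate vs. CIDI
specificity <- function(pred, dsm) sum(pred == 0 & dsm == 0) / sum(dsm == 0)  # true negative rate vs. CIDI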

3. Results

3.1. Deterministic curtailment

First, DC was applied to all three questionnaires. This resulted in substantial reductions in average test lengths, without reductions in accuracy (Table 2). The average test length of the CES-D was reduced from 20 to 12.59 items: a test length reduction of 37.05%, on average. For the shorter scales, test length reductions were somewhat smaller: the average test length for the HADS-A scale was reduced by 24.00%, and for the GAD scale by 24.14% (Table 1). The test length distributions resulting from the application of DC are shown for “at risk” respondents in Fig. 1, and for “not at risk” respondents in Fig. 2. Overall, test length reductions were larger among the “at risk” respondents.

3.2. Stochastic curtailment

Application of SC95 resulted in further reductions in average test length, compared to administration of the full-length questionnaires, without reducing accuracy (Tables 1 and 2). Application of SC95 to the CES-D yielded an average test length reduction of 37.65%. Application of SC95 to the HADS-A resulted in an average test length reduction of 25.86%. The average test length of the GAD scale was reduced by 24.14% (Table 1).

Application of SC75 again resulted in substantial reductions in average test length, compared to administration of the full-length questionnaires (Table 1). Application of SC75 to the CES-D yielded an average test length reduction from 20 to 10.74 items (46.30%). Application of SC75 to the HADS-A resulted in an average test length reduction of 34.00% (Table 1). The average test length of the GAD scale was reduced by 35.86%, to 4.49 items. Notably, no reductions in diagnostic accuracy were found for SC75, with only one exception: concordance with the full-length instrument showed a small decrease for the GAD scale, from 1.000 to 0.998 (Table 2). For both SC95 and SC75, test length reductions were larger among “at risk” respondents (Fig. 1) than among “not at risk” respondents (Fig. 2).

3.3. Comparison of SC and CCT

The CCT, set to yield test lengths comparable to SC95 (Table 1), provided lower concordance with the classification of the full-length instrument than SC95 and SC75 (Table 2). Similarly, the overall diagnostic accuracy with respect to CIDI diagnoses was higher for SC than for CCT for the CES-D and HADS-A scales. For the GAD scale, however, CCT provided slightly higher sensitivity than SC (Table 2). Test length distributions resulting from application of CCT showed a pattern similar to that of DC and SC: test length reductions were larger for “at risk” than for “not at risk” respondents.

Fig. 1. Test length distributions for “at risk” (i.e., full-length test score equal to, or above, the cutoff value) respondents.


4. Discussion

The results of our study indicate that DC provided substantial reductions in test length, while diagnostic accuracy remained identical to that of the full-length test. Furthermore, application of SC resulted in further reductions of test length, with identical diagnostic accuracy for SC95, and only a very slight change in accuracy for SC75, compared to the classification of the full-length test. Test length reductions of 24–37% were obtained by applying DC, whereas SC resulted in average test length reductions of 24–46%, compared to administration of the full-length instrument. Compared to an IRT-based CCT with average test length comparable to SC95, the curtailment algorithms provided better diagnostic accuracy. Therefore, curtailment provides a powerful method for improving the efficiency of self-report questionnaires used in screening for common mental health disorders.

Compared to adaptive methods for test length reduction, an advantage of curtailment is that it involves only minimal assumptions about the distribution of the data, in contrast to, for example, IRT-based methods (e.g., Van der Linden and Glas, 2010). Another advantage of curtailment is that the original item order can be retained. As a result, unforeseen item order effects (e.g., McFarland, 1981), which may hamper the performance of adaptive testing routines, can be ruled out. At the same time, practical implementation is expected to be less elaborate for curtailment than for adaptive testing routines, because look-up tables can be created, with stopping criteria for every item (see also Finkelman et al., 2012). However, like adaptive testing routines, curtailment may require tests to be administered by computer, which can be an impediment in some applications. In the case of oral test administration, though, curtailment algorithms may be used for creating look-up tables, which provide interviewers with rules for early stopping.

Fig. 2. Test length distributions for “not at risk” (i.e., full-length test score below the cutoff value) respondents.

Table 1
Summary statistics for test length distributions of original and curtailed versions of several mental health self-report scales.

Scale     No. of items   DC (M, S.D.)    SC95 (M, S.D.)   SC75 (M, S.D.)   CCT (M, S.D.)
CES-D     20             12.59, 4.42     12.47, 4.31      10.74, 4.87      12.50, 7.47
HADS-A    7              5.32, 1.33      5.19, 1.27       4.62, 1.57       5.20, 2.15
GAD       7              5.31, 1.13      5.31, 1.13       4.49, 1.33       5.30, 2.05

Table 2
Accuracies for curtailed versions of several mental health self-report scales.

Scale     DC (Conc, Sens, Spec)    SC95 (Conc, Sens, Spec)   SC75 (Conc, Sens, Spec)   CCT (Conc, Sens, Spec)
CES-D     1.000, 1.000, 0.333      1.000, 1.000, 0.333       1.000, 1.000, 0.333       0.958, 1.000, 0.297
HADS-A    1.000, 0.819, 0.556      1.000, 0.819, 0.556       1.000, 0.819, 0.556       0.964, 0.798, 0.571
GAD       1.000, 0.867, 0.504      1.000, 0.867, 0.504       0.998, 0.867, 0.504       0.962, 0.900, 0.504

Note. Conc = concordance with classification of full-length instruments (N = 502); Sens = sensitivity, calculated with respect to CIDI diagnoses (N = 157); Spec = specificity, calculated with respect to CIDI diagnoses (N = 157).



Another advantage of SC is its direct applicability to criterion-keyed testing. When the goal of classification is an external criterion, this external criterion may be incorporated in the training of SC directly. For example, for curtailment of a questionnaire used for diagnosis of depressive disorder, the “at risk” training dataset would then be composed of the item scores of respondents with a diagnosis of depressive disorder. Similarly, the “not at risk” training dataset would then be composed of item scores of respondents without a diagnosis of depressive disorder. In the current study, we have not taken this approach, because mental disorder diagnoses were available for only a subsample of the respondents.

In contrast to fixed-length short forms, curtailment provides the advantage of allowing the user to control the certainty of classification decisions. With fixed-length short forms, the certainty of the resulting classification decision may differ across respondents. With DC, testing can be continued as long as the certainty of the resulting classification decision can be improved, for the current respondent. With SC, the certainty of the resulting classification decision for every respondent can be controlled by setting the value of γ.

Compared to the study of Finkelman et al. (2012), test length reductions obtained by DC and SC of the CES-D in the current study were about 10% higher. This may be due to differences in test score distributions, as the prevalence of depressive disorder was notably higher in the current study. However, in terms of accuracy, the findings in both studies were the same: the reduction of test length provided by DC and SC did not reduce diagnostic accuracy.

The current study was the first to use a value for γ of 0.75; for example, Finkelman et al. (2012) used values of 0.95 and 0.90. Although the results for a γ of 0.75 are encouraging, further research on diagnostic accuracy when γ values are set lower than 0.95 is warranted. Based on current research, a γ value of 0.95 would be advised for practical application of SC. Alternatively, when large training samples are available, a cross-validation approach may be taken to allow for determination of an optimal value for γ.

Lastly, it is important to note that application of SC within a population requires calibration of the algorithm in the same population. For example, application of SC in a clinical population requires an existing dataset of item scores from a representative clinical sample. Using an existing dataset of item scores from a non-clinical sample would reduce the accuracy of the resulting classification decisions. Also, clinicians should be aware that the accuracy of curtailed tests only relates to the classification problem at hand. If the assessment goal is measurement, instead of classification, adaptive testing algorithms aimed at measurement should be used.

Appendix A. Supplementary material

Supplementary data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.psychres.2013.11.003.

References

Beekman, A., Deeg, D., Van Limbeek, J., Braam, A., De Vries, M., Van Tilburg, W., 1997. Criterion validity of the Center for Epidemiologic Studies Depression scale (CES-D): results from a community-based sample of older subjects in The Netherlands. Psychological Medicine 27, 231–235.

Bjelland, I., Dahl, A., Haug, T., Neckelmann, D., 2002. The validity of the Hospital Anxiety and Depression Scale: an updated literature review. Journal of Psychosomatic Research 52, 69–78.

Choi, S.W., Gibbons, L.E., Crane, P.K., 2011. lordif: an R package for detecting differential item functioning using iterative hybrid ordinal logistic regression/item response theory and Monte Carlo simulations. Journal of Statistical Software 38, 1–30.

Cuijpers, P., Smits, N., Donker, T., Ten Have, M., Graaf, R., 2010. Screening for mood and anxiety disorders with the five-item, the three-item, and the two-item Mental Health Inventory. Psychiatry Research 168, 250–255.

Donker, T., Van Straten, A., Marks, I., Cuijpers, P., 2009. A brief Web-based screening questionnaire for common mental disorders: development and validation. Journal of Medical Internet Research, 11.

Donker, T., van Straten, A., Marks, I., Cuijpers, P., 2011. Quick and easy self-rating of Generalized Anxiety Disorder: validity of the Dutch web-based GAD-7, GAD-2 and GAD-SI. Psychiatry Research 188, 58–64.

Eisenberg, B., Ghosh, B., 1980. Curtailed and uniformly most powerful sequential tests. Annals of Statistics 8, 1123–1131.

Eisenberg, B., Simons, G., 1978. On weak admissibility of tests. Annals of Statistics 6, 319–332.

Fayers, P., 2007. Applying item response theory and computer adaptive testing: the challenges for health outcomes assessment. Quality of Life Research 16, 187–194.

Finkelman, M., He, Y., Kim, W., Lai, A., 2011. Stochastic curtailment of health questionnaires: a method to reduce respondent burden. Statistics in Medicine 30, 1989–2004.

Finkelman, M., Smits, N., Kim, W., Riley, B., 2012. Curtailment and stochastic curtailment to shorten the CES-D. Applied Psychological Measurement 36, 632–658.

Fliege, H., Becker, J., Walter, O., Bjorner, J., Klapp, B., Rose, M., 2005. Development of a computer-adaptive test for depression (D-CAT). Quality of Life Research 14, 2277–2291.

Fries, J., Cella, D., Rose, M., Krishnan, E., Bruce, B., 2009. Progress in assessing physical function in arthritis: PROMIS short forms and computerized adaptive testing. Journal of Rheumatology 36, 2061–2066.

Gardner, W., Kelleher, K., Pajer, K., 2002. Multidimensional adaptive testing for mental health problems in primary care. Medical Care 40, 812–823.

Gibbons, R., Weiss, D., Kupfer, D., Frank, E., Fagiolini, A., Grochocinski, V., Immekus, J., 2008. Using computerized adaptive testing to reduce the burden of mental health assessment. Psychiatric Services 59, 361–368.

Gilbody, S., Sheldon, T., Wessely, S., 2006. Health policy: should we screen for depression? British Medical Journal 332, 1027–1030.

Gray, J., Austoker, J., 1998. Quality assurance in screening programmes. British Medical Bulletin 54, 983–992.

Hastie, T., Tibshirani, R., Friedman, J., 2008. The Elements of Statistical Learning, 2nd ed. Springer, New York.

Kroenke, K., Spitzer, R., Williams, J., Monahan, P., Löwe, B., 2007. Anxiety disorders in primary care: prevalence, impairment, comorbidity, and detection. Annals of Internal Medicine 146, 317–325.

Lan, K., Simon, R., Halperin, M., 1982. Stochastically curtailed tests in long-term clinical trials. Sequential Analysis 1, 207–219.

McFarland, S.G., 1981. Effects of question order on survey responses. Public Opinion Quarterly 45, 208–215.

Mitchell, A., Coyne, J., 2007. Do ultra-short screening instruments accurately detect depression in primary care? A pooled analysis and meta-analysis of 22 studies. British Journal of General Practice 57, 144.

Petersen, M., Groenvold, M., Aaronson, N., Fayers, P., Sprangers, M., Bjorner, J., 2006. Multidimensional computerized adaptive testing of the EORTC QLQ-C30: basic developments and evaluations. Quality of Life Research 15, 315–329.

R Development Core Team, 2010. R: A Language and Environment for Statistical Computing [Computer software manual]. Vienna, Austria. Retrieved from: ⟨http://www.R-project.org⟩.

Radloff, L., 1977. The CES-D scale. A self-report depression scale for research in the general population. Applied Psychological Measurement 1, 385–401.

Rizopoulos, D., 2006. ltm: an R package for latent variable modelling and item response theory analyses. Journal of Statistical Software 17, 1–25.

Rost, K., Burnam, M., Smith, G., 1993. Development of screeners for depressive disorders and substance disorder history. Medical Care 31, 189–200.

Samejima, F., 1979. A New Family of Models for the Multiple-Choice Item. Research Report No. 79-4. Department of Psychology, University of Tennessee, Knoxville.

Smith, G., McCarthy, D., Anderson, K., 2000. On the sins of short-form development. Psychological Assessment 12, 102–111.

Smits, N., Cuijpers, P., van Straten, A., 2011. Applying computerized adaptive testing to the CES-D scale: a simulation study. Psychiatry Research 188, 147–155.

Smits, N., Finkelman, M., 2013. A comparison of computerized classification testing and computerized adaptive testing in clinical psychology. Journal of Computerized Adaptive Testing 1, 19–37.

Spitzer, R., Kroenke, K., Williams, J., Lowe, B., 2006. A brief measure for assessing generalized anxiety disorder: the GAD-7. Archives of Internal Medicine 166, 1092.

Stone, M., 1974. Cross-validatory choice and assessment of statistical predictions (with discussion). Journal of the Royal Statistical Society, B 36, 111–147.

Thompson, N.A., 2009. Item selection in computerized classification testing. Educational and Psychological Measurement 69, 778–793.

Thompson, N.A., 2011. Termination criteria for computerized classification testing. Practical Assessment, Research and Evaluation 16.

UK National Screening Committee, 2003. The UK National Screening Committee's Criteria for Appraising the Viability, Effectiveness and Appropriateness of a Screening Programme. HMSO, London.

Van der Linden, W., Glas, C., 2010. Elements of Adaptive Testing. Springer, New York, NY.

Wada, K., Tanaka, K., Theriault, G., Satoh, T., Mimura, M., Miyaoka, H., Aizawa, Y., 2007. Validity of the Center for Epidemiologic Studies Depression Scale as a screening instrument of major depressive disorder among Japanese workers. American Journal of Industrial Medicine 50, 8–12.

Walter, O., Becker, J., Bjorner, J., Fliege, H., Klapp, B., Rose, M., 2007. Development and evaluation of a computer adaptive test for anxiety (Anxiety-CAT). Quality of Life Research 16, 143–155.

Weiss, D., 1982. Improving measurement quality and efficiency with adaptive testing. Applied Psychological Measurement 6, 473–492.

Whooley, M., Avins, A., Miranda, J., Browner, W., 1997. Case-finding instruments for depression: two questions are as good as many. Journal of General Internal Medicine 12, 439–445.

World Health Organization, 1997. Composite International Diagnostic Interview (CIDI): Version 2.1. World Health Organization, Geneva.

Zigmond, A., Snaith, R., 1983. The Hospital Anxiety and Depression Scale. Acta Psychiatrica Scandinavica 67, 361–370.