An illustration of how a self-report diagnostic screening scale could improve the internal validity of antidepressant efficacy trials


Journal of Affective Disorders 80 (2004) 79–85


Brief report


Mark Zimmerman*, Iwona Chelminski, Michael Posternak

Department of Psychiatry and Human Behavior, Brown University School of Medicine, Rhode Island Hospital, Providence, RI, USA

Received 9 September 2002; accepted 22 January 2003

Abstract

Background: During the past 20 years semi-structured diagnostic interviews have been the standard for diagnostic

evaluations in research relying on reliable and valid psychiatric assessment and diagnosis; however, only a minority of

antidepressant efficacy trials (AETs) employ these interviews. This might be important insofar as several studies have found that

clinicians conducting unstructured clinical interviews underrecognize diagnostic comorbidity. Because of the financial

incentives to recruit patients into AETs quickly, there is little incentive to vigorously determine the presence of comorbid

conditions that should result in exclusion from the trial. In the present report we demonstrate how a self-report diagnostic

screening scale could be used to identify systematic differences in diagnostic practice across settings, and how such a scale

could be used to compare samples of patients who pass screening evaluations and are accepted into an AET. Methods:

Depressed patients completed the Psychiatric Diagnostic Screening Questionnaire (PDSQ), and were evaluated with either an

unstructured clinical interview or with the Structured Clinical Interview for DSM-IV (SCID). Results: The two samples were

clinically comparable based on their scores on the self-administered PDSQ. Consistent with the greater thoroughness of the

SCID, compared to unstructured diagnostic evaluations, more patients administered the SCID were diagnosed with comorbid

conditions. After excluding patients with disorders that might be the basis for exclusion from an AET, the two samples then

differed in their scores on the PDSQ. That is, more patients in the sample evaluated by an unstructured interview had ‘occult’

pathology than patients evaluated with the SCID. Conclusion: These findings demonstrate how systematic differences in

diagnostic practice might be detected across sites when conducting AETs. Limitations: The study was conducted with patients

in a single outpatient clinical practice rather than participants of a multi-site trial.

© 2003 Elsevier B.V. All rights reserved.

Keywords: Major depressive disorder (MDD); Antidepressant efficacy trials (AETs); Self-report scale; Semi-structured diagnostic interview

* Corresponding author. Present address: Bayside Medical Center, 235 Plain Street, Providence, RI 02905, USA. E-mail address: [email protected] (M. Zimmerman).

doi:10.1016/S0165-0327(03)00050-8

1. Introduction

Antidepressant efficacy trials (AETs) rely on accurate diagnostic determinations to select patients with the diagnosis of interest (usually major depressive disorder, MDD) and exclude patients with comorbid conditions. In discussing the reasons for failed AETs, Robinson and Rickels (2000) questioned pharmaceutical companies’ current practice of conducting multi-site studies involving many treatment centers because of difficulties maintaining control over the quality of the diagnostic and outcome evaluations. Variable competence and quality of diagnostic raters in outcome studies introduce error variance in the data collected, and this error variance may be one factor that contributes to the failure to detect differences in outcome between active medication and placebo. Methods to identify and reduce this error variance should improve the internal validity of AETs.

During the past 20 years semi-structured diagnostic

interviews have been the standard for diagnostic

evaluations in research relying on reliable and valid

psychiatric assessment and diagnosis. However, this

standard has not extended to AETs. We recently

reviewed 39 AETs to determine how many patients

in a routine clinical practice would have been exclud-

ed had the exclusion criteria from the trials been

applied (Zimmerman et al., submitted for publication).

Only eight (20.5%) of the 39 studies reported using

standardized diagnostic interviews to determine

whether patients had MDD, the comorbid diagnoses

requiring exclusion, or a bipolar or psychotic subtype

of depression requiring exclusion. The other 31 stud-

ies apparently relied on unstructured clinical evalua-

tions to diagnose patients.

During the past 3 years four independent reports

have questioned the accuracy of psychiatric diagnoses

made by clinicians using unstructured clinical inter-

views (Basco et al., 2000; Miller et al., 2001; Shear et

al., 2000; Zimmerman and Mattia, 1999). All four

research groups reported that clinicians conducting

unstructured diagnostic interviews underrecognize diagnostic comorbidity. Whether these findings extend to clinical evaluators in AETs is unknown.

However, because of the aforementioned financial

incentives to recruit patients into AETs quickly, there

is little incentive to vigorously determine the presence

of comorbid conditions that should result in exclusion

from the trial.

It can be difficult to demonstrate differences in

diagnostic practice between clinicians, or between

clinical/research sites. One method would be to video

or audiotape diagnostic interviews and have them

reviewed by an independent ‘expert’ clinician. A

disadvantage of such an approach is that it is time

consuming and expensive. Another, less costly, meth-

od to determine whether diagnosticians systematically

differ in their diagnostic practice is with the use of

self-administered questionnaires as a ‘paper-standard’.

Zimmerman and colleagues suggested that diagnostic

criteria might be differentially applied across diagnos-

tic centers, and illustrated how the differential appli-

cation of criteria might be responsible for the

difficulty in independently replicating research find-

ings (Zimmerman et al., 1990). In a separate report

they suggested that self-report questionnaires offered

an inexpensive, non-laborious, empirical, and easily

standardized method that can detect systematic differ-

ences between research groups in diagnostic practices

(Zimmerman et al., 1993). The results of a self-report

scale can be used as a benchmark to which interview-

er-derived diagnoses can be compared, and this would

provide a method of detecting systematic differences

in diagnostic practice.

As part of the Rhode Island Methods to Improve

Diagnostic Assessment and Services (MIDAS) proj-

ect, our research group has developed a broad-based

self-report scale that screens for several DSM-IV Axis

I disorders—the Psychiatric Diagnostic Screening

Questionnaire (PDSQ; Zimmerman and Mattia,

2001a,b). In the present report we illustrate how a

self-report scale such as the PDSQ can be used to

identify systematic differences in diagnostic practice

and how this might influence who gets included in

AETs.

During the 7 years of the MIDAS project, some

patients have been evaluated with the Structured

Clinical Interview for DSM-IV (SCID), whereas other

patients have been evaluated by clinicians using

unstructured interviews (non-SCID sample). We con-

ducted a series of three analyses to test the hypothesis

that systematic differences between the different di-

agnostic methods could be detected by a self-report

scale. First, we determined whether the depressed patients in the SCID and non-SCID samples were clinically similar by comparing the two groups on their PDSQ scores. We did this because, when searching for differences in diagnostic practice, it is important to establish the clinical equivalence of the comparison groups, or to control for true sample differences. Second, we

compared the SCID and non-SCID samples in diag-

nostic frequencies. Based on our prior work we

predicted that more depressed patients in the SCID

than the non-SCID sample would be diagnosed with

comorbid disorders (Zimmerman and Mattia, 1999). If

this were true, and the two groups were similar on the

self-report PDSQ, this would reflect a systematic

diagnostic bias that is due to diagnostic method rather


than true clinical differences between the samples.

And third, we compared patients in the SCID and

non-SCID samples on the PDSQ after excluding

patients based on the results of the diagnostic evalu-

ation. In other words, we excluded patients with

disorders that are often the basis for exclusion in

AETs. We predicted that patients in the non-SCID group would then score higher on the PDSQ than patients in the SCID group because more occult disorder would remain among the patients who were evaluated less thoroughly.

2. Method

2.1. Patients

More than 2000 patients have been evaluated in the

Rhode Island Hospital Department of Psychiatry out-

patient practice. This private practice group predom-

inantly treats individuals with medical insurance

(including Medicare but not Medicaid) on a fee-for-

service basis, and is distinct from the hospital’s

outpatient residency training clinic that predominantly

serves lower income, uninsured, and medical assis-

tance patients.

We examined psychiatric diagnoses in two non-

overlapping cohorts of patients who completed the

final version of the PDSQ—patients interviewed by

clinicians with an unstructured clinical interview

(non-SCID sample, n = 1352) and patients interviewed

with the SCID (n = 993). Not all patients were inter-

viewed with the SCID because of the lack of avail-

ability of diagnostic raters and patients’ preference for

the briefer clinical evaluation. Thus, assignment to

receiving a SCID or unstructured clinical diagnostic

evaluation was not random.

2.2. Assessment

Before the initial evaluation all patients completed

the PDSQ as part of their initial paperwork. The

PDSQ is a broad-based screening questionnaire

assessing the symptoms of mood, eating, anxiety,

substance use, and somatoform disorders (Zimmer-

man and Mattia, 2001a,b). Because the validity of the

PDSQ was under investigation, the clinicians were

kept blind to the patients’ responses on the question-

naire. The institutional review board approved the

evaluation protocol, and participants provided written

informed consent.

In the non-SCID sample, diagnostic evaluations

were conducted by attending psychiatrists. Diagnoses

were based on DSM-IV criteria. Clinicians complet-

ed a standardized intake form modeled on the Intake

Evaluation Form of Mezzich and colleagues (Mez-

zich et al., 1981). The intake form included space for

a narrative description of the chief complaint, history

of present illness, and past psychiatric history. In

addition, there was a checklist to record the presence

or absence of substance use problems, a history of

sexual or physical abuse, psychotic symptoms, panic

attacks, phobias, obsessions, compulsive behavior,

and all of the symptoms of major depression. On

the last page of the five-page form clinicians

recorded patients’ DSM-IV multiaxial diagnoses.

Research assistants recorded the results of the clini-

cian’s diagnostic evaluation written on the last page

of the intake form, and collected demographic infor-

mation from the narrative. To avoid underestimating

comorbidity detection by clinicians, we included as

cases patients whom the clinicians diagnosed with a

‘rule-out’ disorder.

When patients called to schedule their initial

appointment they were offered the opportunity to

receive a more comprehensive evaluation than the

usual clinical evaluation. The patients were told that

they would be interviewed by two people—first by

a diagnostic rater who would conduct a compre-

hensive evaluation, and then by a psychiatrist. After

the SCID, the rater presented the case to a psychi-

atrist who reviewed the findings of the evaluation

with the patient. The extensive training program of

the diagnostic raters has been described in prior

reports (Zimmerman and Mattia, 1999). During the

course of the study, joint-interview diagnostic reli-

ability information has been collected on 47

patients. For anxiety and substance use disorders the kappa coefficients were: panic disorder (κ = 1.0), social phobia (κ = 0.84), obsessive-compulsive disorder (κ = 1.0), generalized anxiety disorder (κ = 0.93), posttraumatic stress disorder (κ = 0.91), alcohol abuse/dependence (κ = 0.64), and drug abuse/dependence (κ = 0.73).
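As an illustration of what a kappa coefficient summarizes, the following is a minimal sketch that assumes the usual two-rater Cohen's kappa (the report does not state which variant was computed). The ratings are hypothetical and are not study data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical ratings of the same patients."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed proportion of patients on whom the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical joint-interview ratings (1 = disorder present, 0 = absent).
rater_1 = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
rater_2 = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
kappa = cohen_kappa(rater_1, rater_2)
print(round(kappa, 2))  # 0.74
```

In this made-up example the raters agree on 9 of 10 patients, but because chance agreement is already 0.62, kappa is about 0.74 rather than 0.90.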

The PDSQ has undergone several rounds of

study involving more than 3000 primary care and


psychiatric outpatients. After each large validation

study, the scale was revised based on a psychomet-

ric analysis of the subscales and items. The final

version of the PDSQ consists of 126 questions

assessing the symptoms of 13 DSM-IV disorders

in five areas: eating disorders (bulimia/binge eating

disorder), mood disorders (MDD), anxiety disorders

(panic disorder, agoraphobia, PTSD, OCD, GAD

and social phobia), substance use disorders (alcohol

abuse/dependence, drug abuse/dependence), and

somatoform disorders (somatization disorder, hypo-

chondriasis). In addition, there is a six-item psy-

chosis screen.

The reliability and validity of the PDSQ have been

described in detail elsewhere (Zimmerman and Mat-

tia, 2001a,b). Briefly, in the validity study of the final

version of the PDSQ, the 13 PDSQ subscales dem-

onstrated good to excellent levels of internal consis-

tency (Zimmerman and Mattia, 2001a). Cronbach’s

alpha was greater than 0.80 for 12 of the 13 subscales,

and the mean of the alpha coefficients was 0.86. Test–

retest reliability coefficients were greater than 0.80 for

nine subscales (mean 0.83). The convergent and

discriminant validity of the PDSQ subscales was

examined in 361 patients who completed a package

of questionnaires at home less than a week after

completing the PDSQ. The booklet included measures of symptoms related to each of the PDSQ symptom domains. Every PDSQ subscale was more highly correlated with the concordant validity scale assessing the same symptom domain than with scales assessing other symptom domains. Across all subscales, the mean correlation

between the PDSQ subscales and their respective

validity scale was 0.66, while the mean correlation

between PDSQ subscales and measures of other

symptom domains was 0.25. Finally, the diagnostic

performance of the PDSQ subscales was examined in

630 patients interviewed with the SCID. Sensitivity

and specificity varied according to the cut-off used

(Zimmerman and Mattia, 2001b). In the present report

we used the PDSQ cut-off scores associated with a

specificity of 90%.
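To make the idea of a cut-off "associated with a specificity of 90%" concrete, here is a hedged sketch. It assumes that a patient screens positive when the subscale score is at or above the cut-off, and that the chosen cut-off is the lowest one reaching the target specificity; both rules are illustrative assumptions, and the scores and diagnoses are hypothetical.

```python
def specificity(scores, has_disorder, cutoff):
    """Proportion of non-cases who screen negative (score below the cutoff)."""
    non_cases = [s for s, d in zip(scores, has_disorder) if not d]
    if not non_cases:
        raise ValueError("need at least one patient without the disorder")
    return sum(s < cutoff for s in non_cases) / len(non_cases)


def sensitivity(scores, has_disorder, cutoff):
    """Proportion of cases who screen positive (score at or above the cutoff)."""
    cases = [s for s, d in zip(scores, has_disorder) if d]
    if not cases:
        raise ValueError("need at least one patient with the disorder")
    return sum(s >= cutoff for s in cases) / len(cases)


def lowest_cutoff_meeting_specificity(scores, has_disorder, target=0.90):
    """Smallest candidate cutoff whose specificity reaches the target."""
    # A cutoff above every observed score always yields specificity 1.0.
    candidates = sorted(set(scores)) + [max(scores) + 1]
    for cutoff in candidates:
        if specificity(scores, has_disorder, cutoff) >= target:
            return cutoff


# Hypothetical subscale scores; the first ten patients have no interview diagnosis.
scores = [0, 1, 1, 2, 2, 3, 3, 4, 5, 6, 2, 4, 5, 6, 7]
has_disorder = [False] * 10 + [True] * 5

cutoff = lowest_cutoff_meeting_specificity(scores, has_disorder)
print(cutoff, sensitivity(scores, has_disorder, cutoff))  # 6 0.4
```

The example also shows the usual trade-off: fixing specificity at a high level can leave sensitivity modest, which is why the cut-off choice matters when a screening scale is used as a benchmark.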

2.3. Statistical analysis

In the present report we focused on anxiety and

substance use disorders because they are the most

frequent comorbidities used as exclusion criteria in

AETs (Zimmerman et al., submitted for publication).

We did not examine specific phobia or nicotine

dependence because these disorders are rarely used

as the basis for exclusion in an AET. We examined the

impact of different diagnostic methods on the appli-

cation of the anxiety and substance use disorder

exclusion criteria in the patients with non-psychotic,

unipolar MDD in the SCID and non-SCID samples.

First, we determined the similarity of the two samples

by comparing their demographic characteristics and

scores on the PDSQ. Next, we compared the two

samples on the percentage of patients that might be

excluded from an AET because the disorder is pres-

ent. Last, for each of the anxiety and substance

disorders we compared patients in the SCID and

non-SCID groups on the PDSQ after we excluded

patients with the diagnosis. Categorical variables were

compared by the chi-square statistic and continuous

variables were compared with t-tests.
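The last comparison described above can be sketched as follows. This is a minimal illustration on hypothetical records, not the study's analysis code. The `Patient` fields and the `occult_case_rate` helper are names invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Patient:
    group: str                     # "SCID" or "non-SCID"
    diagnosed: bool                # interviewer diagnosed the index disorder
    pdsq_positive: Optional[bool]  # PDSQ screen result; None if subscale missing


def occult_case_rate(patients, group):
    """Among patients in `group` who were not diagnosed with the index disorder
    and who have PDSQ data, return (n remaining, n PDSQ-positive, proportion)."""
    remaining = [
        p for p in patients
        if p.group == group and not p.diagnosed and p.pdsq_positive is not None
    ]
    positives = sum(p.pdsq_positive for p in remaining)
    rate = positives / len(remaining) if remaining else float("nan")
    return len(remaining), positives, rate


# Hypothetical records for one index disorder.
patients = (
    [Patient("non-SCID", True, True)] * 2
    + [Patient("non-SCID", False, True)] * 3
    + [Patient("non-SCID", False, False)] * 6
    + [Patient("non-SCID", False, None)] * 1
    + [Patient("SCID", True, True)] * 4
    + [Patient("SCID", False, True)] * 1
    + [Patient("SCID", False, False)] * 7
)

for group in ("non-SCID", "SCID"):
    print(group, occult_case_rate(patients, group))
```

In this made-up data the interviewers in the SCID group diagnose more cases, so fewer screen-positive patients remain after exclusion, which is the pattern the third analysis is designed to detect.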

3. Results

More than 900 patients received a principal diag-

nosis of current non-bipolar MDD, 579 patients in the

non-SCID sample and 339 patients in the SCID

sample. The data in Table 1 indicate that there were

modest, albeit statistically significant, differences be-

tween the non-SCID and SCID samples in age,

marital status, education, and race.

Patients in the non-SCID and SCID samples were

compared on the PDSQ subscale scores, controlling

for demographic differences. There were no signifi-

cant differences between the groups on any of the

PDSQ subscale scores. Thus, despite the significant

differences in demographic characteristics, the SCID

and non-SCID patient samples were clinically similar

as assessed by a reliable and valid self-report measure

of DSM-IV symptoms.

Each anxiety disorder except PTSD was significantly more frequently diagnosed in the SCID than the non-SCID sample (panic disorder: 16.2 vs. 8.8%, χ² = 11.51, P < 0.01; obsessive-compulsive disorder: 7.7 vs. 3.3%, χ² = 8.83, P < 0.01; social phobia: 32.4 vs. 2.4%, χ² = 165.0, P < 0.01; posttraumatic stress disorder: 10.6 vs. 9.0%, χ² = 0.66, NS; generalized anxiety disorder: 20.1 vs. 7.3%, χ² = 33.24, P < 0.01). There was no difference between SCID and non-SCID groups in rates of current drug abuse/dependence (6.2 vs. 3.5%, χ² = 3.76, NS). Alcohol abuse/dependence was significantly more frequently diagnosed in the SCID patients (9.4 vs. 5.5%, χ² = 5.05, P < 0.05).

Table 1
Demographic characteristics of depressed patients in the non-SCID and SCID samples

                                      non-SCID (n = 579)    SCID (n = 339)     Two-group test
                                        n        %            n        %         χ²       P
Gender                                                                          0.22     NS
  Females                              399      69.0         229      67.6
  Males                                179      31.0         110      32.4
Race                                                                           11.13     < 0.01
  White                                465      80.3         301      88.8
  Non-white                            114      19.7          38      11.2
Education                                                                      15.76     < 0.05
  Less than high school                 84      14.5          30       8.8
  High school graduate or GED          341      59.6         223      65.8
  College graduate                     154      26.9          86      25.4
Marital status                                                                 13.54     < 0.05
  Married                              245      43.4         140      41.3
  Living with someone as if married     33       5.8          20       5.9
  Widowed                               24       4.2           5       1.5
  Separated                             46       8.1          21       6.2
  Divorced                             101      17.9          54      15.9
  Single                               116      20.5          99      29.2

                                       Mean     S.D.         Mean     S.D.       t        P
Age (years)                           40.72     14.3        38.67     12.0      2.32     < 0.05

We next compared the SCID and non-SCID samples on individual PDSQ subscales after excluding patients with the index disorder. For example, the samples were compared on the PDSQ panic disorder subscale after patients diagnosed with panic disorder were excluded. This is analogous to comparing different samples included in an AET at different sites (or in different studies) using different diagnostic methodologies. After this exclusion, we compared the percentage of patients in each group who were positive on the PDSQ for each of the anxiety or substance use disorders used as the basis for exclusion (Table 2). As expected from the above results, after excluding the diagnosed cases, significantly more patients screened positive on the PDSQ for panic disorder, social phobia, and obsessive-compulsive disorder in the non-SCID group than the SCID group (because more of these cases were undetected by the unstructured interview in the non-SCID group).

Table 2
Prevalence of PDSQ cases in the non-SCID and SCID patients after excluding patients with the index disorder

                                  non-SCID sample (n = 579)          SCID sample (n = 339)              Two-group test
PDSQ cases (a)                    n (b)   No. of PDSQ cases   %      n (b)   No. of PDSQ cases   %       χ²      P
Panic disorder                    492      86                17.5    274     24                  8.8    10.88    0.01
Social phobia                     517      95                18.4    223     28                 12.6     3.81    0.05
Obsessive compulsive disorder     522      70                13.4    300     21                  7.0     7.95    0.01
Posttraumatic stress disorder     504      69                13.7    300     38                 12.7     0.17    NS
Generalized anxiety disorder      486     103                21.2    267     55                 20.6     0.04    NS
Any alcohol use disorder          492      43                 8.7    301     20                  6.6     1.12    NS
Any drug use disorder             504      22                 4.4    312     23                  7.4     3.34    NS

(a) In the non-SCID sample, the number of patients with missing data on the PDSQ was 36 on the panic disorder subscale, 48 on the social phobia subscale, 38 on the obsessive compulsive disorder subscale, 23 on the posttraumatic stress disorder subscale, 51 on the generalized anxiety disorder subscale, 55 on the alcohol use disorder subscale, and 55 on the drug use disorder subscale. In the SCID sample, the number of patients with missing data on the PDSQ was ten on the panic disorder subscale, six on the social phobia subscale, 13 on the obsessive compulsive disorder subscale, three on the posttraumatic stress disorder subscale, four on the generalized anxiety disorder subscale, six on the alcohol use disorder subscale, and six on the drug use disorder subscale.
(b) n indicates sample size after patients diagnosed with the index disorder were excluded. For example, 51 patients in the non-SCID group were diagnosed with panic disorder. Data were missing on the PDSQ panic disorder subscale in 36 patients. Thus, n = 492 (579 − 51 − 36).
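The panic disorder row of Table 2 can be checked by hand. The sketch below assumes an uncorrected Pearson chi-square on a 2×2 table. The report says only that the chi-square statistic was used, so the absence of a continuity correction is an inference from the fact that this formula reproduces the reported value for the row.

```python
def pearson_chi_square_2x2(a, b, c, d):
    """Uncorrected Pearson chi-square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denom


# Sample size after exclusion, as in footnote (b) of Table 2:
# 579 patients, minus 51 diagnosed with panic disorder, minus 36 with missing PDSQ data.
non_scid_n = 579 - 51 - 36  # 492

# Panic disorder row of Table 2: PDSQ cases among the remaining patients.
non_scid_cases = 86
scid_n, scid_cases = 274, 24

chi2 = pearson_chi_square_2x2(
    non_scid_cases, non_scid_n - non_scid_cases,
    scid_cases, scid_n - scid_cases,
)
print(non_scid_n, round(chi2, 2))  # 492 10.88
```

The same function applied to the obsessive compulsive disorder row (70 of 522 vs. 21 of 300) gives 7.95, again in line with the table.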

4. Discussion

In this report we illustrated how a self-report

questionnaire could be used to detect systematic

differences in diagnostic practices in research studies

such as AETs. Depressed patients drawn from the

same clinical practice were evaluated with an unstruc-

tured clinical interview or with the SCID. The non-

SCID and SCID samples were clinically comparable

(according to the PDSQ), though significantly more

patients were diagnosed with anxiety and alcohol

use disorders in the SCID sample. Without the PDSQ

data we would not have been able to determine that

the differences in diagnostic frequencies between the

samples were due to different diagnostic practices

rather than true sample differences.

Differences in diagnostic practices can influence

recruitment into an AET. Anecdotal conversations

with researchers and clinicians who have worked

on clinical trials at different sites suggest that

different levels of rigor are used in adhering to

the specified inclusion and exclusion criteria. To

illustrate how this might occur, we applied the

anxiety and substance use disorder exclusion crite-

ria frequently used in AETs to two samples eval-

uated with different degrees of diagnostic rigor. As

expected, a greater percentage of patients evaluated

more thoroughly with the SCID would have been

excluded from a clinical trial.

Of course, use of a semi-structured diagnostic inter-

view for recruitment into a study does not ensure that

recruitment and the application of exclusion criteria in

AETs will be done similarly across sites. Variability in

the interpretation of subjects’ responses to questions of

a semi-structured interview remains, and different

investigators may have different thresholds for diag-

nosing comorbid conditions. Consequently, it would be

helpful to be able to compare samples that are accepted

into an AET on a measure that is free of interviewer bias

in the application of diagnostic criteria. A self-report

questionnaire such as the PDSQ is one such approach.

We demonstrated that the patients who might have

passed through the diagnostic evaluation process as

part of an AET based on different methods of diagnosis

were not comparable. That is, when diagnoses were

based on an unstructured clinical evaluation signifi-

cantly more patients who might have been accepted

into the AET scored positive on the PDSQ than patients

who might have been accepted into the AET based on

the SCID interview. If these were the findings from an

actual multi-center AET we would interpret them as

indicating that there was a systematic difference be-

tween sites in applying the exclusion criteria. Theoret-

ically, when this happens it is not possible to know

which site is more appropriately applying the AET

exclusion criteria. A self-report paper-standard does

not indicate which site is more or less accurate in

evaluating patients. Rather, a self-report paper-standard

simply identifies a systematic difference in how

patients were evaluated. However, in light of the

financial incentives to overlook exclusion criteria and

recruit patients into an AET as quickly as possible, one

could infer which sites are less rigorously applying the

exclusion criteria.

We do not know what is actually done when subjects

are evaluated for participation in an AET. It is probable

that use of a semi-structured diagnostic interview

results in closer adherence to the stated exclusion

criteria; thus, this methodology should probably be

routinely used in recruiting patients for an AET. It is

surprising that semi-structured interviews, which are

the diagnostic standard in most areas of psychiatric

research, are so infrequently used in AETs. Even when

semi-structured interviews are used there is still room

for interpretation. Thus, we would also recommend that

self-report questionnaires be routinely used in AETs,

and that results on these measures be routinely reported

in the same way the samples’ demographic character-

istics are described.


Acknowledgements

This research was supported, in part, by grants

MH48732 and MH56404 from the National Institute

of Mental Health.

References

Basco, M.R., Bostic, J.Q., Davies, D., Rush, A.J., Witte, B., Hen-

drickse, W., Barnett, V., 2000. Methods to improve diagnostic

accuracy in a community mental health setting. Am. J. Psychia-

try 157, 1599–1605.

Mezzich, J.E., Dow, J.T., Rich, C.L., Costello, A.J., Himmelhoch,

J.M., 1981. Developing an efficient clinical information system

for comprehensive psychiatric institute. II: Initial evaluation

form. Behav. Res. Methods Instrum. Comput. 13, 464–478.

Miller, P.R., Dasher, R., Collins, R., Griffiths, P., Brown, F., 2001.

Inpatient diagnostic assessments: 1. Accuracy of structured ver-

sus unstructured interviews. Psychiatry Res. 105, 255–264.

Robinson, D., Rickels, K., 2000. Concerns about clinical drug tri-

als. J. Clin. Psychopharmacol. 20, 593–596, editorial.

Shear, M.K., Greeno, C., Kang, J., Ludewig, D., Frank, E.,

Swartz, H.A., Hanekamp, M., 2000. Diagnosis of non-psy-

chotic patients in community clinics. Am. J. Psychiatry 157,

581–587.

Zimmerman, M., Coryell, W., Black, D.W., 1990. Variability in the

application of contemporary diagnostic criteria: endogenous de-

pression as an example. Am. J. Psychiatry 147, 1173–1179.

Zimmerman, M., Coryell, W., Black, D.W., 1993. A method to

detect intercenter differences in the application of contemporary

diagnosis criteria. J. Nerv. Ment. Dis. 181, 130–134.

Zimmerman, M., Mattia, J.I., 1999. Psychiatric diagnosis in clinical

practice: is comorbidity being missed? Compr. Psychiatry 40,

182–191.

Zimmerman, M., Mattia, J.I., 2001a. The Psychiatric Diagnostic

Screening Questionnaire: development, reliability and validity.

Compr. Psychiatry 42, 175–189.

Zimmerman, M., Mattia, J.I., 2001b. A self-report scale to help

make psychiatric diagnoses: the Psychiatric Diagnostic

Screening Questionnaire (PDSQ). Arch. Gen. Psychiatry 58,

787–794.

Zimmerman, M., Chelminski, I., Posternak, M.A. Exclusion crite-

ria used in antidepressant efficacy trials: Consistency across

studies and representativeness of samples included. Submitted

for publication.
