Standardized clinical outcome rating scale for depression for use in clinical practice

Brief Report


Mark Zimmerman, M.D.,* Michael A. Posternak, M.D., Iwona Chelminski, Ph.D., and Michael Friedman, M.D.

The integration of research into clinical practice to conduct effectiveness studies faces multiple obstacles. One obstacle is the burden of completing research measures of outcome. A simple, reliable, and valid measure that could be rated at every visit, incorporated into a clinician’s progress note, and reflect the DSM-IV definition of a major depressive episode (including partial and full remission from the episode) would enhance the ability to conduct effectiveness research. The goal of the present study was to examine the reliability and validity of such a measure. Three hundred and three psychiatric outpatients who were being treated for a DSM-IV major depressive episode were rated on the Standardized Clinical Outcome Rating for Depression (SCOR-D), 17-item Hamilton Rating Scale for Depression, Montgomery-Asberg Depression Rating Scale, and the Global Assessment of Functioning. We examined the correlation between the SCOR-D and the other measures, and conducted analyses of variance to compare mean values on these measures for each rating point on the SCOR-D. The inter-rater reliability of the SCOR-D dimensional ratings and categorical determination of remission were high. The SCOR-D was highly correlated with the other scales, and there were significant differences on the other measures of depression severity between each adjacent rating level of the SCOR-D. The SCOR-D is a brief standardized outcome measure, linked to the DSM-IV approach toward defining remission and supported by some evidence of reliability and validity, that can be incorporated into routine clinical practice without adding undue burden to the treating clinician. This measure could make it more feasible to conduct effectiveness studies in clinical practice. Depression and Anxiety 22:36–40, 2005. © 2005 Wiley-Liss, Inc.

Key words: depression; assessment; remission; severity; outcome; effectiveness

Department of Psychiatry and Human Behavior, Brown University School of Medicine, Rhode Island Hospital, Providence, Rhode Island

*Correspondence to: Dr. Mark Zimmerman, Bayside Medical Center, 235 Plain Street, Providence, RI 02905. E-mail: [email protected].

Received for publication 21 August 2004; Revised 21 October 2004; Accepted 7 December 2004

DOI 10.1002/da.20046

Published online 29 March 2005 in Wiley InterScience (www.interscience.wiley.com).

INTRODUCTION

The optimal delivery of mental health treatment depends on evaluating the impact of treatment on patients treated in clinical practice. During the past few years there have been increasing calls for effectiveness research because of concerns about the generalizability of the results of tightly controlled efficacy trials to real-world clinical practice [Lebowitz and Rudorfer, 1998; Zimmerman et al., 2002]. Efficacy studies, conducted under highly controlled and standardized conditions, maximize internal validity, whereas effectiveness research is concerned with the external validity of treatment studies. In efficacy studies outcome is usually assessed with detailed symptom measures that require training to administer reliably and validly. In antidepressant efficacy studies the two symptom measures used most frequently to evaluate outcome are the Hamilton Rating Scale for Depression (HRSD) [Hamilton, 1960] and the Montgomery-Asberg Depression Rating Scale (MADRS) [Montgomery and Asberg, 1979].

The integration of research into clinical practice to conduct effectiveness studies faces multiple obstacles, one of which is the burden of completing research measures of outcome. The HRSD and MADRS do not readily lend themselves to clinical practice, in part, because clinicians generally do not inquire about symptom presence in sufficient detail to distinguish between severity levels as is required on these scales. Max Hamilton, in his initial description of his rating scale [Hamilton, 1960], suggested that it should take approximately 25 min to administer. The use of these scales would necessitate a more thorough assessment than is necessary, or at times possible, to conduct in clinical practice.

During the past 10 years we have established the Rhode Island Methods to Improve Diagnostic Assessment and Services (MIDAS) project in which we have integrated the assessment methods of researchers into routine clinical practice [Zimmerman, 2003]. One goal of the MIDAS project has been to develop tools to assist clinicians in making psychiatric diagnoses and outcome ratings [Zimmerman and Mattia, 2001; Zimmerman et al., 2004]. In the present report from the MIDAS project we describe the reliability and validity of the Standardized Clinical Outcome Rating for Depression (SCOR-D). Our goal was to develop a quantitative measure of depression that could be rated at every visit, incorporated into a clinician’s progress note, and reflect the DSM-IV definition of a major depressive episode (including partial and full remission from the episode). We described previously how the SCOR-D could be integrated into routine clinical practice, and illustrated how this enabled us to establish recovery rates in a sample of depressed psychiatric outpatients [Posternak et al., 2002]. In that study, 102 outpatients with major depressive disorder were followed for up to 2 years, and rated on the SCOR-D at each medication management visit. The overall rate of recovery and the mean time to recovery were nearly identical to those reported in the National Institute of Mental Health Collaborative Depression Study, thereby indirectly supporting the validity of the SCOR-D. In the present report, we more directly examine the reliability and validity of the scale.

PATIENTS AND METHODS

PATIENTS

Participants were 303 psychiatric outpatients who were being treated for a DSM-IV major depressive episode in the Rhode Island Hospital Department of Psychiatry outpatient practice. This private practice group predominantly treats individuals with medical insurance (including Medicare but not Medicaid) on a fee-for-service basis, and it is distinct from the hospital’s outpatient residency training clinic that predominantly serves lower income, uninsured, and medical assistance patients. The sample included 114 (37.6%) men and 189 (62.4%) women who ranged in age from 18 to 79 years (M = 42.9, sd = 12.7). Almost half of the subjects were married (47.9%, n = 145); the remainder were single (23.4%, n = 71), divorced (19.8%, n = 60), separated (5.6%, n = 17), widowed (2.0%, n = 6), or living with a partner (1.3%, n = 4). The racial composition of the sample was 86.8% (n = 263) Caucasian, 2.6% (n = 8) African-American, 4.3% (n = 13) Hispanic, 0.7% (n = 2) Asian, and 5.6% (n = 17) other. The Rhode Island Hospital institutional review committee approved the research protocol, and all patients provided informed, written consent.

ASSESSMENT

The patients were rated by their treating psychiatrist on the 17-item HRSD, MADRS, SCOR-D, and Global Assessment of Functioning (GAF). Ratings were made by three psychiatrists based on an unstructured clinical interview that covered all of the items on the measures. Two of the three psychiatrists are experienced clinical researchers, whereas the third is a full-time clinician who was trained by the other two to conduct the ratings. Diagnoses were based on the Structured Clinical Interview for DSM-IV [First et al., 1995]. Inter-rater reliability on the HRSD, MADRS, and SCOR-D was obtained in 30 patients, with one of the authors interviewing the patient while the other observed and made independent ratings.

The HRSD was published more than 40 years ago for the purpose of “quantifying the results of an interview” [Hamilton, 1960]. Although not designed for use in treatment studies, Hamilton anticipated that the scale would have value in evaluating the impact of treatment. The original rating form included 21 items, though Hamilton indicated that only the first 17 items should contribute to the total scale score because one of the last four items represented depressive type rather than depression severity (diurnal mood variation), and three other items did not occur with sufficient frequency (derealization, paranoia, and obsessional symptoms). During the past 40 years the HRSD has been the most widely used outcome measure in antidepressant efficacy trials [Prien et al., 1991], though during the past decade the MADRS has been used with increasing frequency [Khan et al., 2002].

The MADRS was developed almost 20 years after the HRSD because of the difficulty demonstrating differences between active drugs with putatively different mechanisms of action [Montgomery and Asberg, 1979]. Montgomery and Asberg speculated that one of the reasons for failing to find drug–drug differences was that existing rating scales at the time were not sufficiently sensitive to detect treatment effects. In contrast to the HRSD, in which item selection was based on clinical experience, the development of the MADRS followed an empirical approach. The MADRS has been reliably rated by different research groups [Davidson et al., 1986; Korner et al., 1990; Maier et al., 1988b], and it has been consistently found to correlate highly with the HRSD [Dratcu et al., 1987; Hawley et al., 1998; Korner et al., 1990; Maier et al., 1988a; Mulder et al., 2003]. Its superiority to the HRSD as a more sensitive indicator of change in antidepressant efficacy trials has been challenged recently [Khan et al., 2002].

There is some overlap in the content of the MADRS and 17-item HRSD. For example, both scales include items assessing depressed mood, insomnia, fatigue, loss of appetite, anxiety, guilt, and suicidality. We conducted a single interview with sufficient questions to allow us to make the ratings on each of the scales. This approach was less redundant than, and thus preferred to, the alternative of sequentially administering each measure. Similarly, because these measures include items corresponding to the DSM-IV diagnostic criteria, the SCOR-D was rated on the basis of this interview.

As part of the MIDAS project we have modified and expanded upon the psychiatric status outcome ratings used in the Longitudinal Interval Follow-up Evaluation (LIFE) [Keller et al., 1987], the instrument used in the Collaborative Depression Study. Standardized clinical outcome rating guidelines were written for some disorders not covered by the LIFE, and some of the definitions in the LIFE were modified to make it easier for clinicians to rate clinical status for the week preceding the visit. The SCOR-D is a 6-point rating scale based on the number of DSM-IV criteria for a major depressive episode and the level of psychosocial impairment present during the past week (Table 1). The ratings of the individual symptoms are based on the DSM-IV definitions of symptom presence. Thus, eight of nine criteria are rated at threshold if they are present every day, or nearly every day, of the past week. Symptoms that are present for five or fewer days are considered subthreshold. The exception is thoughts of death or suicidal ideation, which need only be recurrent. In tallying the number of criteria that are present, the SCOR-D rating guidelines follow DSM-IV’s diagnostic algorithm for criterion A of a major depressive episode. That is, to be rated five or six on the SCOR-D the subject must have depressed mood or anhedonia, and symptoms in at least five of nine symptom groupings. Consistent with DSM-IV’s algorithm, symptoms that are components of the same DSM-IV criterion (e.g., impaired concentration and indecisiveness) only count as one criterion. The guidelines for each rating point on the SCOR-D are listed in Table 1.

We did not develop a separate interview guide to determine symptom presence because the goal was to maintain, as much as possible, compatibility with standard clinical practice. Potentially this could reduce reliability and increase error variance. The time needed to evaluate patients to rate the SCOR-D depends, in part, on patients’ clinical status. In patients who are doing well, the information needed to rate the SCOR-D can usually be ascertained in approximately 3–4 min. Between 5 and 10 min is almost always sufficient to rate patients who are symptomatic. Rating the SCOR-D is easily incorporated into a standard 15-min medication visit.
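The criterion A tally described above can be sketched in code. This is a minimal illustration, not the authors' rating procedure: the symptom labels, the function names, and the reading of "every day, or nearly every day" as 6 or 7 days of the past week (inferred from the statement that five or fewer days is subthreshold) are assumptions. It covers only the symptom count that is a precondition for a rating of 5 or 6; the impairment and psychotic-symptom judgments in Table 1 remain clinical decisions.

```python
from typing import Mapping, Set

# Hypothetical labels for the nine DSM-IV criterion A symptom groupings.
# Components of one criterion (e.g., concentration and indecisiveness)
# share a single label so that they count only once.
CRITERIA = (
    "depressed_mood", "anhedonia", "appetite_or_weight", "sleep",
    "psychomotor", "fatigue", "worthlessness_or_guilt",
    "concentration_or_indecision", "death_or_suicidal_thoughts",
)

def criteria_at_threshold(days_present: Mapping[str, int],
                          recurrent_death_thoughts: bool) -> Set[str]:
    """Return the criteria rated at threshold for the past week."""
    met = set()
    for name in CRITERIA[:-1]:
        # Assumption: five or fewer days is subthreshold, so 6-7 days is threshold.
        if days_present.get(name, 0) >= 6:
            met.add(name)
    # Thoughts of death or suicidal ideation need only be recurrent.
    if recurrent_death_thoughts:
        met.add("death_or_suicidal_thoughts")
    return met

def meets_criterion_a(met: Set[str]) -> bool:
    """Depressed mood or anhedonia, plus at least five of nine groupings."""
    has_core = "depressed_mood" in met or "anhedonia" in met
    return has_core and len(met) >= 5

# Hypothetical patient: four symptoms on most days, appetite change on 3 days.
week = {"depressed_mood": 7, "sleep": 6, "fatigue": 7,
        "concentration_or_indecision": 6, "appetite_or_weight": 3}
met = criteria_at_threshold(week, recurrent_death_thoughts=True)
print(sorted(met), meets_criterion_a(met))
```

In this hypothetical case the appetite change is subthreshold and is not counted, but the recurrent thoughts of death bring the tally to five criteria including depressed mood, so the symptom requirement for a rating of 5 or 6 is satisfied.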

ANALYSES

Separate analyses of variance (ANOVA) were conducted on the HRSD, MADRS, and GAF scores. In each of these ANOVA the SCOR-D was the independent variable. Tukey HSD follow-up tests compared each adjacent level of SCOR-D to determine whether the ratings represented a valid gradient of severity. Pearson correlations were computed between the SCOR-D and each of the other measures. Intraclass correlation coefficients, which account for level of agreement due to chance, were computed to assess the reliability of the SCOR-D dimensional score. The Collaborative Depression Study has variously defined remission broadly (a rating of 1 or 2) and narrowly (a rating of 1) [Judd et al., 1998; Solomon et al., 1997]. We computed κ coefficients to assess the reliability of both definitions of remission.

TABLE 1. Ratings on the Standardized Clinical Outcome Rating Scale for Depression

SCOR-D Rating    Definition

6    Meets DSM-IV criteria for MDD, and either prominent psychotic symptoms or extreme impairment in functioning.

5    Meets DSM-IV criteria for MDD, but no prominent psychotic symptoms and no extreme impairment in functioning.

4    Does not meet DSM-IV criteria, but has major symptoms or impairment from this disorder (e.g., depressive episode with only four symptoms but still unable to work).

3    Considerably less psychopathology than full criteria with no more than moderate impairment in functioning, but still has obvious evidence of MDD (e.g., a depressive episode with only two or three symptoms of a moderate degree, one or two symptoms of a severe degree).

2    Either patient claims not to be completely back to “usual self”, or rater notes the presence of one or more symptoms of MDD of no more than a mild degree (e.g., only mild insomnia from the original episode).

1    Full remission with no symptoms of MDD. Significant symptomatology from another disorder may be present (and is coded for that disorder).

SCOR-D, Standardized Clinical Outcome Rating Scale for Depression.

RESULTS

The intraclass reliability coefficient for the SCOR-D dimensional rating was .95. For comparison, the reliabilities of the HRSD and MADRS were .97 and .96, respectively. Dichotomizing the ratings as remitted or not, the κ coefficient of agreement was .80 for the broad definition of remission and .81 for the narrow definition.
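The dichotomization and agreement statistic reported above can be illustrated with a short sketch. It assumes the two-rater (Cohen) form of κ, which the report does not name explicitly, and the paired ratings below are hypothetical values invented for illustration, not study data.

```python
from typing import Sequence

def is_remitted(rating: int, broad: bool = True) -> bool:
    """Broad remission: SCOR-D of 1 or 2. Narrow remission: SCOR-D of 1."""
    return rating <= 2 if broad else rating == 1

def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Chance-corrected agreement between two raters on a yes/no call."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("rating lists must be non-empty and of equal length")
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Agreement expected by chance from each rater's marginal rate.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both raters gave the same constant answer
    return (observed - expected) / (1 - expected)

# Hypothetical SCOR-D ratings from an interviewer and an observer.
rater_1 = [1, 2, 2, 3, 5, 5, 4, 1, 6, 3]
rater_2 = [1, 2, 3, 3, 5, 4, 4, 2, 6, 3]

kappa_broad = cohen_kappa([is_remitted(r) for r in rater_1],
                          [is_remitted(r) for r in rater_2])
kappa_narrow = cohen_kappa([is_remitted(r, broad=False) for r in rater_1],
                           [is_remitted(r, broad=False) for r in rater_2])
print(round(kappa_broad, 2), round(kappa_narrow, 2))
```

Note that a one-point disagreement on the dimensional rating affects κ only when it crosses the remission cut point, which is why the dimensional and categorical reliabilities are reported separately.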

Across the entire sample of 303 depressed patients the mean score on the HRSD was 11.4 (sd = 8.4), MADRS 17.2 (sd = 12.7), and GAF 59.1 (sd = 11.8). Pearson correlations between the SCOR-D and each of the other measures were as follows: r = .86 for HRSD, r = .91 for MADRS, and r = −.83 for GAF.

Separate analyses of variance were conducted on the HRSD, MADRS, and GAF, each of which was significant (Table 2). For all three variables, Tukey HSD follow-up tests found that the difference between each adjacent level on the SCOR-D was significant, except for the difference between Levels 4 and 5 on the GAF score (Table 2).
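As a worked check on the one-way ANOVA, the F statistic can be approximated from the group sizes, means, and standard deviations published in Table 2. The sketch below is not the authors' analysis, which was run on the raw scores; the function name is hypothetical, and because the inputs are rounded to one decimal place the result should only land near the reported F[5,297] = 179.4 for the HRSD rather than match it exactly.

```python
from typing import Sequence, Tuple

Group = Tuple[int, float, float]  # (n, mean, sd) for one SCOR-D level

def anova_from_summary(groups: Sequence[Group]) -> Tuple[float, int, int]:
    """One-way ANOVA F statistic computed from per-group summary statistics."""
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / total_n
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    # Within-group sum of squares recovered from each group's sd.
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# HRSD values from Table 2, SCOR-D levels 6 down to 1.
HRSD_BY_SCORD = [
    (11, 28.3, 4.7), (103, 18.1, 6.1), (35, 14.2, 3.4),
    (52, 9.7, 3.0), (50, 4.0, 2.4), (52, 1.6, 1.7),
]

f_stat, df_b, df_w = anova_from_summary(HRSD_BY_SCORD)
print(f"F[{df_b},{df_w}] = {f_stat:.1f}")
```

The same function can be applied to the MADRS and GAF columns. The pairwise Tukey HSD contrasts require the studentized range distribution and are not reproduced here.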

DISCUSSION

The present study presented preliminary evidence of the reliability and convergent validity of a brief outcome measure for major depression that can potentially be incorporated into routine clinical practice. Presumably, clinicians treating depressed patients routinely assess the presence of the defining features of the disorder as well as level of functioning. Therefore what was needed was a method of quantifying these observations in a meaningful way. The SCOR-D is linked to the DSM-IV approach toward defining remission, and in a previous report [Posternak et al., 2002] we used these ratings to determine the rate of remission in our patients and the results were comparable to the Collaborative Depression Study [Solomon et al., 1997]. The present study demonstrated that the SCOR-D is associated highly with the clinician rating scales of depression used most frequently in antidepressant efficacy trials.

During the past few years the results of antidepressant efficacy trials have increasingly reported rates of remission as well as rates of responders. Response is defined typically as a 50% or more improvement in scores on the HRSD or MADRS. This approach toward reporting outcome has been criticized because patients could be considered responders despite having clinically significant residual levels of depressive symptoms. As noted by others, it is possible for treatment responders who score sufficiently high on these scales to still meet inclusion criteria for participation. Thus, responders might still meet the diagnostic criteria for MDD.

TABLE 2. Other rating system mean values for each of the SCOR-D levels

SCOR-D Rating    n      HRSDᵃ         MADRSᵇ        GAFᶜ
6                11     28.3 (4.7)    43.2 (5.1)    37.9 (5.0)
5                103    18.1 (6.1)    28.4 (6.8)    50.5 (6.3)
4                35     14.2 (3.4)    21.3 (4.5)    53.2 (6.8)
3                52     9.7 (3.0)     13.8 (3.7)    60.7 (7.0)
2                50     4.0 (2.4)     5.3 (2.8)     68.9 (6.5)
1                52     1.6 (1.7)     1.5 (2.2)     73.6 (6.2)

Values are expressed as mean (sd), unless otherwise indicated.
ᵃHRSD: F[5,297] = 179.4, P < .001. All follow-up Tukey’s tests between each pair of adjacent SCOR-D levels were significant at P < .05.
ᵇMADRS: F[5,297] = 347.6, P < .001. All follow-up Tukey’s tests between each pair of adjacent SCOR-D levels were significant at P < .01.
ᶜGAF: F[5,297] = 140.6, P < .001. All follow-up Tukey’s tests between each pair of adjacent SCOR-D levels were significant at P < .01, except for the contrast between levels 4 and 5, which was not significant (P > .05).
HRSD, Hamilton Rating Scale for Depression; MADRS, Montgomery-Asberg Depression Rating Scale; GAF, Global Assessment of Functioning.

In contrast to how response is defined, remission is defined typically as a score below a threshold value on a measure such as the HRSD or MADRS. Even with this approach patients could have several symptoms of depression and still be considered in remission. Neither the HRSD nor the MADRS provides full coverage of the DSM-IV diagnostic criteria for major depression. The MADRS lacks items for reverse vegetative symptoms and indecisiveness. The HRSD also lacks items for increased appetite, increased sleep, and indecisiveness, as well as impaired concentration and worthlessness. It is possible for patients to be simultaneously classified as in remission by scoring at or below the HRSD remission threshold of 7, and also fully meet DSM-IV criteria for major depression [Nierenberg et al., 1999]. There is a significant conceptual problem with a definition of remission that includes any individuals who still have the disorder. A conceptual advantage of the SCOR-D is that it is linked to the DSM-IV definition of remission, and patients who are considered in remission according to this definition are without the defining features of the disorder. A practical advantage of the SCOR-D is that it can be included more easily in clinical practice.
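The scale-based definitions of response and remission discussed above reduce to simple arithmetic, sketched here for the HRSD. The function names and the patient scores are hypothetical; the 50% improvement rule and the remission threshold of 7 are the values cited in the text.

```python
def percent_improvement(baseline: float, endpoint: float) -> float:
    """Percentage reduction in scale score from baseline to endpoint."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return 100.0 * (baseline - endpoint) / baseline

def is_responder(baseline: float, endpoint: float) -> bool:
    """Response: a 50% or greater improvement in score."""
    return percent_improvement(baseline, endpoint) >= 50.0

def is_hrsd_remitted(endpoint: float, threshold: float = 7) -> bool:
    """Scale-based remission: endpoint at or below the HRSD threshold."""
    return endpoint <= threshold

# Hypothetical patient with severe depression at baseline.
baseline, endpoint = 30, 15
print(is_responder(baseline, endpoint), is_hrsd_remitted(endpoint))
```

A hypothetical patient who falls from 30 to 15 counts as a responder while remaining well above the remission threshold, which illustrates the residual-symptom concern. Neither function consults the DSM-IV criteria, so a score of 7 or less says nothing by itself about whether the criteria are still met.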

Despite the widespread availability of standardized rating scales of established reliability and validity they have not been widely embraced by practicing clinicians. One approach toward making these scales more clinician-friendly has been to shorten their length [Bech et al., 1975; Rush et al., 2003]. We believe that this approach does not address the fundamental problem of assessment burden due to more detailed evaluations of patient status than are ordinarily done. In the development of the SCOR-D we approached the problem of clinician acceptance from a different perspective. We developed a simple rating scale to capture in a quantifiable way what clinicians are already doing: assessing the presence or absence of the features of the disorder they are treating. It takes less time to make dichotomous decisions regarding symptom presence than to quantify the severity of a symptom on a 4- or 5-point Likert scale.

A limitation of the present study is that it was conducted in a single practice setting by clinicians who are also clinical researchers. Replication in other settings is warranted. In addition, symptom information to rate the HRSD, MADRS, and SCOR-D was collected in the same session; thus, shared method variance partially accounts for the high correlation between the measures. If separate interviewers were used to rate each of the scales the correlations would be lower. Similarly, the reliability ratings were based on a joint-interview design. Test–retest reliability coefficients are typically lower than those based on a joint-interview design.

REFERENCES

Bech P, Gram L, Dein E, Jacobsen O, Vitger J, Bolwig T. 1975. Quantitative rating of depressive states. Acta Psychiatr Scand 51:161–170.

Davidson J, Turnbull CD, Strickland R, Miller R, Graves K. 1986. The Montgomery-Asberg Depression Scale: Reliability and validity. Acta Psychiatr Scand 73:544–548.

Dratcu L, da Costa Ribeiro L, Calil HM. 1987. Depression assessment in Brazil. The first application of the Montgomery-Asberg Depression Rating Scale. Br J Psychiatry 150:797–800.

First MB, Spitzer RL, Gibbon M, Williams JBW. 1995. Structured clinical interview for DSM-IV Axis I disorders. Patient edition (SCID-I/P, version 2.0). New York: Biometrics Research Department, New York State Psychiatric Institute.

Hamilton M. 1960. A rating scale for depression. J Neurol Neurosurg Psychiatry 23:56–62.

Hawley CJ, Gale TM, Smith V, Sen P. 1998. Depression rating scales can be related to each other by simple equations. Int J Psychiatry Clin Practice 2:215–219.

Judd LL, Akiskal HS, Maser JD, Zeller PJ, Endicott J, Coryell W. 1998. A prospective 12-year study of subsyndromal and syndromal depressive symptoms in unipolar major depressive disorders. Arch Gen Psychiatry 55:694–700.

Keller MB, Lavori PW, Friedman B, Nielsen E, Endicott J, McDonald-Scott NC, Andreasen NC. 1987. The longitudinal interval follow-up evaluation: A comprehensive method for assessing outcome in prospective longitudinal studies. Arch Gen Psychiatry 44:540–548.

Khan A, Khan SR, Shankles EB, Polissar NL. 2002. Relative sensitivity of the Montgomery-Asberg Depression Rating Scale, the Hamilton Depression Rating Scale and the Clinical Global Impressions rating scale in antidepressant clinical trials. Int Clin Psychopharmacol 17:281–285.

Korner A, Nielsen BM, Eschen F, Moller-Madsen S, Stender A, Christensen EM, Aggernaes H, Kastrup M, Larsen JK. 1990. Quantifying depressive symptomatology: Inter-rater reliability and inter-item correlations. J Affect Disord 20:143–149.

Lebowitz B, Rudorfer M. 1998. Treatment research at the millennium: From efficacy to effectiveness. J Clin Psychopharmacol 18:1.

Maier W, Heuser I, Philipp M, Frommberger U, Demuth W. 1988a. Improving depression severity assessment–II. Content, concurrent and external validity of three observer depression scales. J Psychiatr Res 22:13–19.

Maier W, Philipp M, Heuser I, Schlegel S, Buller R, Wetzel H. 1988b. Improving depression severity assessment–I. Reliability, internal validity and sensitivity to change of three observer depression scales. J Psychiatr Res 22:3–12.

Montgomery SA, Asberg M. 1979. A new depression scale designed to be sensitive to change. Br J Psychiatry 134:382–389.

Mulder R, Joyce P, Frampton C. 2003. Relationships among measures of treatment outcome in depressed patients. J Affect Disord 76:127–135.

Nierenberg AA, Keefe BR, Leslie VC, Alpert JE, Pava JA, Worthington JJ, Rosenbaum JF, Fava M. 1999. Residual symptoms in depressed patients who respond acutely to fluoxetine. J Clin Psychiatry 60:221–225.

Posternak MA, Zimmerman M, Solomon DA. 2002. Integrating outcomes research into clinical practice: A pilot study. Psychiatric Services 53:335–336.

Prien RF, Carpenter LL, Kupfer DJ. 1991. The definition and operational criteria for treatment outcome of major depressive disorder. Arch Gen Psychiatry 48:796–800.

Rush A, Trivedi M, Ibrahim H, Carmody T, Arnow B, Klein D, Markowitz J, Ninan P, Kornstein S, Manber R, Thase M. 2003. The 16-Item Quick Inventory of Depressive Symptomatology (QIDS), clinician rating (QIDS-C), and self-report (QIDS-SR): A psychometric evaluation in patients with chronic major depression. Biol Psychiatry 54:573–583.

Solomon DA, Keller MB, Leon AC, Mueller TI, Shea TM, Warshaw M, Maser JD, Coryell W, Endicott J. 1997. Recovery from major depression: A 10-year prospective follow-up across multiple episodes. Arch Gen Psychiatry 54:1001–1006.

Zimmerman M. 2003. Integrating the assessment methods of researchers in routine clinical practice: The Rhode Island Methods to Improve Diagnostic Assessment and Services (MIDAS) project in standardized evaluation in clinical practice. Vol. 22. Washington, D.C.: American Psychiatric Publishing, Inc. p 29–74.

Zimmerman M, Mattia JI. 2001. The Psychiatric Diagnostic Screening Questionnaire: Development, reliability and validity. Compr Psychiatry 42:175–189.

Zimmerman M, Mattia JI, Posternak MA. 2002. Are subjects in pharmacological treatment trials of depression representative of patients in routine clinical practice? Am J Psychiatry 159:469–473.

Zimmerman M, Sheeran T, Young D. 2004. The diagnostic inventory for depression: a self-report scale to diagnose DSM-IV for major depressive disorder. J Clin Psychol 60:87–110.
