Assessing the validity of the PHQ-9, HADS, BDI-II and QIDS-SR16 in measuring severity of depression in a UK sample of primary care patients with a diagnosis of depression

Author: Applied Health Sciences (Mental Health), University of Aberdeen




© NHS Quality Improvement Scotland 2011
First published February 2011

This report was published by the sponsor, NHS Quality Improvement Scotland (NHS QIS), on behalf of the author, The University of Aberdeen. You can copy or reproduce the information in this document for use within NHSScotland and for educational purposes. You must not make a profit using information in this document. Commercial organisations must get our written permission before reproducing this document.

www.nhshealthquality.org


Grant holders
Professor Ian C Reid*
Dr Isobel M Cameron*
Professor John R Crawford**
Dr Kenneth Lawton***

* Applied Health Sciences (Mental Health), University of Aberdeen, Royal Cornhill Hospital, Aberdeen, AB25 2ZH
** School of Psychology, University of Aberdeen, Aberdeen, AB24 2UB
*** Centre of Academic Primary Care, University of Aberdeen, Foresterhill Health Centre, Westburn Road, Foresterhill, Aberdeen, AB25 2AY

Research Team
Isobel M Cameron, Research Fellow
Amanda Cardy, Scottish Primary Care Research Network North Coordinator
John R Crawford, Professor of Psychology
Schalk Du Toit, Specialist Registrar in Psychiatry
Kenneth Lawton, Clinical Senior Lecturer
Steven Hay, Staff Grade Psychiatrist
Kenneth Mitchell, Consultant Psychiatrist
Ian C Reid, Professor of Mental Health
Sumit Sharma, Specialist Registrar in Psychiatry
Shilpa Shivaprasad, Specialist Registrar in Psychiatry
Sally Winning, Staff Grade Psychiatrist

Ethical Approval
This research was conducted with the approval of the North of Scotland Research Ethics Committee (reference number: 07/S0802/40). Additionally it met the approval of the Research and Development Department of NHS Grampian and of the Scottish Primary Care Research Network.

Acknowledgements
We would like to thank the patients and staff of the nine practices in Grampian who kindly participated in this study and Kirsty Sykes for preparation of research materials.


Contents

1 Summary
    Background
    Why we did it
    Aims
    Methods
    Results
    Conclusion
    What it means for the service
2 Background
3 Aim
4 Methods/Design
    Study Design
    Setting
    Patients
    Depression Severity Measures
    Demographic data
    Procedure
    Statistical analyses
    Sample Calculation
5 Results
    Participating practices
    Patient participation
    Reliability Analysis
    Factor analysis
    Convergent and discriminant validity
    Convergence of severity banding
    Empirically derived cut offs
    Responsiveness to change
6 Discussion
    Summary of main findings
    Strengths and Weaknesses
    Comparisons with literature
7 Conclusion
8 Recommendations for NHS QIS
Appendix I Assessing the suitability of the QIDS-SR16 as a proxy for the HRSD-17
    Background
    Methods/Design
    Discussion
    Conclusion
Appendix II Assessing the inter-rater reliability of the clinician administered HRSD-17 (GRID version)
    Background
    Methods/Design
    Procedure
    Statistical analyses
    Results
    Discussion
    Conclusion
9 References


1 Summary

Background
The Integrated Care Pathway (ICP) for Depression and the Quality and Outcomes Framework (QOF) emphasise the importance of measuring severity of depression in primary care to target the condition with an appropriate intervention; however, there is an absence of psychometric comparison of endorsed measures (Patient Health Questionnaire (PHQ-9), Hospital Anxiety and Depression Scale (HADS) and Beck Depression Inventory (BDI-II)).

Why we did it
For people presenting with depression in primary care, it is considered important that the severity of their depressive symptoms be measured in order to facilitate the offer of appropriate, evidence based interventions. This is reflected in the ICP for depression and in the QOF of the nGMS contract. However, there is an absence of objective psychometric comparison between endorsed measures to enable GPs to choose the optimal assessment tool for severity measurement.

Aims
To assess psychometric properties of PHQ-9, HADS, BDI-II and Quick Inventory of Depressive Symptomatology (QIDS-SR16) relative to the clinician administered Hamilton Rating Scale for Depression (HRSD) in a sample of primary care patients with a depression diagnosis.

Methods
Patients (≥16 years) were recruited from nine general practices across Grampian (selected to yield participants with a mix of socio-economic and urban/rural status). Consenting participants completed the four severity measures and demographic questions, and were assessed by a psychiatrist with the 17-item HRSD (GRID-HAMD). Psychiatrists (n=6) were blind to the questionnaire responses. Order of administration of interview and booklet was randomly assigned, stratified by practice. Inter-rater reliability was assessed. A concurrent psychometric analysis was made of the four scales to assess: internal consistency, homogeneity, convergent and discriminant validity and responsiveness to change over time. Established severity cut-off scores for each scale were assessed for convergence with HRSD-17 cut offs. Receiver Operating Characteristic (ROC) curves were plotted to assess optimal severity cut off points. Analyses were conducted using SPSS (Version 17) and Clinimetrics Toolkit. Ethical approval was granted by the North of Scotland Research Ethics Committee.

Results
286 (25%) of 1134 invited patients participated: mean age=49.8 (s.d.=13.8), 69% female, mean HRSD-17=12.9 (s.d.=7.6). The HAD-D, PHQ-9, BDI-II and QIDS-SR16 all exhibited acceptable internal consistency (Cronbach’s alpha 0.86-0.92) and robust factor structures (variance extracted by the first factor 46.3%-60.7%), indicating one underlying construct in each measure. Additionally, all the measures correlated highly with the HRSD-17 (r=0.68-0.78). The HAD-D, PHQ-9 and QIDS-SR16 exhibited discriminant validity but the BDI-II did not. All of the scales differed significantly in how they categorised the severity of depression relative to the HRSD-17 (Wilcoxon Signed Rank Test p<0.05). The HAD-D tended to categorise participants in milder categories than the HRSD-17 whereas the PHQ-9, BDI-II and QIDS-SR16 tended to categorise participants in a more severe category than the HRSD-17. Best sensitivity and specificity were reached for detecting moderate severity of depressive symptoms where HAD-D≥9, PHQ-9≥12, BDI-II≥23 and QIDS-SR16≥13. Based on diagnostic odds ratios, QIDS-SR16 exhibited greatest discriminatory performance (diagnostic odds ratio=21.42 (95% CI 11.03, 41.60)). All four questionnaires exhibited a similar measurement of magnitude of change over time.

Conclusion
Commonly used scales in UK primary care do not align adequately with the HRSD-17. Current severity cut-offs for such scales are therefore invalid for the selection of treatments determined by depression severity, given that HRSD assessments generally inform the evidence base. The present study provides empirically derived cut off points which can be used with these measures to facilitate the assessment of severity of depressive symptoms. However, health care professionals should not rely on this interpretation of a score alone when assessing an individual who may have depression but should also consider other factors including degree of impairment, length of episode, history of depression, family history, other co-morbid disorders and specific circumstances pertaining to individuals.

What it means for the service
Our findings show that commonly used scales in UK primary care for measuring the severity of depressive symptoms are invalid for selecting treatments which have been determined by depression severity. This study provides alternative cut offs that have been empirically derived from a comparison with a Hamilton Depression Rating Scale interview administered by a psychiatrist. These new cut offs improve the validity of the measures; however, these scales do not assess the severity of depressive symptoms with sufficient accuracy to be used with confidence in clinical practice. For this reason, the service should emphasise that health care professionals should not rely on the scores of such scales alone when assessing an individual who may have depression but should consider other factors including the degree of impairment, length of episode, history of depression, family history, other comorbid disorders and specific circumstances pertaining to individuals.


2 Background

The Scottish Integrated Care Pathway for Depression Toolkit Standard 33 (1) advocates the use of an objective measure for people with depression, stating that “A validated measure of depression is used at initial assessment and repeated at regular intervals to monitor progress and outcome”. Similarly, the Quality and Outcomes Framework (QOF) of the new General Medical Services contract (nGMS) (2) provides incentives to general practices for making an assessment of severity at the outset of a new diagnosis of depression with a further assessment occurring within five to 12 weeks. In the context of the 2009 Updated National Institute for Health and Clinical Excellence (NICE) Guidelines (3) on the management of depression in primary and secondary care it would appear sensible to make such an assessment as different treatment options are advocated according to severity. Additionally, applying an objective measure of severity, rather than relying on GPs’ perceptions, is thought to be more reliable (4). Using such measures facilitates the monitoring of symptoms over time, allowing patients and clinicians to assess the impact of various interventions.

At the outset of a new episode of depression, general practitioners (GPs) are required to use one of the following QOF endorsed tools: the Patient Health Questionnaire (PHQ-9) (5); the Hospital Anxiety and Depression Scale (HADS) (6) Depression subscale (HAD-D); and the Beck Depression Inventory, Second Edition (BDI-II) (7). Practices are advised to choose one of these three measures. These measures are endorsed on the grounds of their validity. As their function in the context of the QOF is to measure severity, particular attention should be paid to how their severity cut-off bands were derived. The established severity cut-off scores for the PHQ-9 have been based on assessment of US based primary care patients (5).
Whilst questions arise as to how well findings from a US sample can be generalised to the UK setting, the severity cut-offs were based on a substantial sample (n=580) recruited across several primary care sites. In the case of BDI-II, the severity cut-off scores are based on a US sample of 127 patients, all from a University of Pennsylvania site (8). The original banding of ‘non-case’, ‘doubtful case’ and ‘case’ of the HADS was based on a sample of 100. The originators offered no empirical evidence to support endorsing a cut off of 15 and over as indicating a ‘severe’ case (9). Although the HADS has been extensively researched, its validity has tended to be assessed in terms of its case finding properties, rather than on how it differentiates between severities of depression (10-12).

Presently, there is an absence of objective psychometric comparison between the endorsed measures that would enable GPs to choose the optimal assessment tool for severity measurement. Furthermore, the severity-related interventions advocated in the NICE Guidelines tend to be based on trials where severity has been assessed using the Hamilton Rating Scale for Depression (HRSD) (13).

Two studies in Germany have compared the validity of sets of measures that included QOF depression severity scales (14,15). One compared HADS, PHQ-9 and the World Health Organisation Well Being Index (WHO-5) against the DSM-III-R overview for the Structured Clinical Interview (SCID) (16); the other compared the PHQ-9, General Health Questionnaire (GHQ-12) and WHO-5 against the SCID. However, these comparisons were of their validity as case finding measures. These studies did not examine the validity of the severity banding. A recent German study compared the PHQ-9 with the HRSD-17 in terms of severity measurement (17). The PHQ-9 accounted for just 27% of the variance in HRSD-17 scores. However, this analysis was conducted on a small sample (n=47) and included HRSD-17 and PHQ-9 data that were administered up to two weeks apart (mean = 6.7 days); this is a wide time interval from which to assess concurrence of instruments.

A study which described the first UK validation of the PHQ-9 (18) assessed the PHQ-9 and the Clinical Outcomes in Routine Evaluation – Outcome Measure (CORE-OM) as diagnostic tools. Here the focus was on case finding properties in a relatively small sample (n=97). Additionally, in recognising this was a small study, the authors concluded that replication of their findings would be helpful.

In the absence of UK validation of the severity banding of the QOF depression measures, GPs may be left to make their choice of measure based on convenience factors, such as licence-cost considerations or ease of scoring. These considerations are important but should always be preceded by considerations of psychometric robustness. Psychometric investigations of the PHQ-9 and HAD-D have found that the scales differ significantly in how they categorise the severity of depression in UK (19), Swedish (20) and Australian (21) samples. These findings suggest that at least one of these measures is categorising the severity of depression erroneously. Similarly, severity of depressive symptoms varied in accordance with choice of screening tool in a study of the QOF measures in an English primary care setting where practices that used PHQ-9 recorded greater prevalence of moderate and severe depressive symptoms compared with practices that used HAD-D (22).

In considering the ICP standard 33 and the QOF depression severity measures, the Quick Inventory of Depressive Symptomatology (Self Report) (QIDS-SR16) (23) is also a scale worthy of investigation. The QIDS-SR16 is a measure of depression severity, devised in the US, which has been assessed and developed to act as a proxy measure for the HRSD. This link with the HRSD makes its potential for use in a UK primary care context of particular interest, given the use of HRSD in studies assessing treatment effectiveness (3).


3 Aim

The principal aims of this study were as follows:

1. To assess the psychometric properties of the PHQ-9, HAD-D, and BDI-II in terms of their use as measures of severity of depression in a sample of primary care patients with a diagnosis of depression.

2. To establish empirically derived severity cut-off scores for these measures as they relate to the HRSD-17.

In meeting the above aims the following objectives were required to be met:

a) To assess the suitability of the QIDS-SR16 as a proxy for the HRSD-17 following the conversion tables of Rush et al., (www.ids-qids.org).

b) To assess the inter-rater reliability of the clinician administered HRSD-17 (GRID version).

The investigations relating to objective a) are reported in Appendix I and those relating to objective b) are reported in Appendix II. These secondary objectives had implications for the main study in that their results led us to conclude that the QIDS-SR16 (using the conversion tables of Rush et al., 2003) was not suitable to act as a proxy for the HRSD-17 and that the HRSD-17 (GRID version) demonstrated adequate inter-rater reliability. Although the QIDS-SR16 would not be used as a proxy for HRSD-17, it remained in the study and we assessed its psychometric properties along with the other self-complete measures.


4 Methods/Design

Study Design
This study is a psychometric assessment of four depression severity rating scales: PHQ-9, HAD-D, BDI-II and QIDS-SR16. The aim was to assess the relative reliability, validity and responsiveness to change of these measures and to empirically derive severity bands which can be applied to a UK primary care population.

Setting
Patients were recruited from nine general practices across Grampian, Scotland. Practices were selected with the intention of yielding participants with a mix of socio-economic and urban/rural status. Socio-economic status was based on the deprivation status of practices as derived from Scottish Index of Multiple Deprivation (SIMD) scores. The urban/rural status of practice populations was assessed by the Scottish Executive 2007/8 urban/rural classification whereby practices were coded according to their modal urban/rural category based on the number of patients in each category per practice. Practice populations by deprivation status and by urban/rural status were accessed at: www.isdscotland.org/isd/3793.html. The HRSD-17 was administered to patient participants within their general practice or within a local community based hospital. The questionnaire booklet, containing the four self-complete measures and demographic questions, was completed by patient participants within their own space, i.e. this could be done at home or within another location of the participant’s choice.

Patients

Inclusion criteria
Patients were eligible for inclusion if they were aged 16 years and over and their GP had diagnosed them as depressed. This reflects current QOF arrangements where GPs use their clinical judgement to identify depression. The current study included existing depressed patients in order to assess the scales across the spectrum of stages presenting in primary care. Both the Depression ICP toolkit Standard 33 and the QOF Depression 3 indicator advocate repetition of measurement; the inclusion of existing depressed patients was therefore relevant as it allowed the scales to be tested within greater presentation diversity.

Exclusion criteria
The vast majority of depressed patients were eligible to consider participating; however, patients were excluded if they did not have the necessary spoken or written language skills.

Depression Severity Measures
The depression severity measures to be assessed comprised the HADS, PHQ-9, BDI-II and the QIDS-SR16, each compared against the HRSD (17 item).


Hospital Anxiety and Depression Scale (HADS)
The HADS scale consists of 14 items each rated from 0 to 3 according to severity of difficulty experienced. Eight items require reversed scoring, after which a depression (HAD-D) and an anxiety (HAD-A) subscale total can be summed. Each subscale score can range from 0 to 21. The scores can then be interpreted as indicating either: mild, moderate or severe difficulty. Licence agreement was attained through NFER Nelson.

Patient Health Questionnaire – 9 (PHQ-9)
The PHQ-9 consists of 9 questions designed to correspond to the nine diagnostic criteria for major depressive disorder covered in the Diagnostic and Statistical Manual for Mental Disorders (DSM-IV) (24). Items are rated from 0 to 3 according to increased frequency of experiencing difficulties in each area covered. Scores are summed and can range from 0 to 27. The score can then be interpreted as indicating either: no depression, minimal, mild, moderate, moderately severe or severe depression. The scale is freely available from http://www.depression-primarycare.org/clinicians/toolkits/materials/forms/phq9/.

Beck Depression Inventory-II
The BDI-II consists of 21 items each rated from 0 to 3 according to severity of difficulty experienced. Scores are summed and can range from 0 to 63. The score can then be interpreted as being in the minimal, mild, moderate or severe range. Licence agreement was attained through Harcourt Assessment, Inc.

Quick Inventory of Depressive Symptomatology (Self-report) QIDS-SR16
QIDS-SR16 is a brief self-complete measure designed to measure the severity of depression. It assesses the nine diagnostic criteria for major depressive disorders covered in DSM-IV and consists of 16 items each ranging from 0 to 3. It is scored by summing the highest response in each of a set of questions relating to sleep, weight and psychomotor symptoms and then adding the remaining items. Scores can range from 0 to 27. Conversion tables have been generated which allow HRSD scores to be predicted from this measure; however, this has not been previously assessed in a UK primary care context. QIDS-SR16 is freely available from http://www.ids-qids.org/.
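The QIDS-SR16 scoring rule described above (highest response within each of the sleep, weight and psychomotor item sets, plus the remaining items) can be sketched in a few lines of Python. This is an illustrative sketch only, not the scale's official scoring software. The function name and the item-to-domain grouping below are assumptions (1-based item numbers believed to follow the published layout of the scale) and should be checked against the questionnaire itself before any use.

```python
# Assumed 1-based item numbers for each multi-item symptom domain.
# These groupings are an assumption; verify them against the published scale.
SLEEP_ITEMS = (1, 2, 3, 4)
APPETITE_WEIGHT_ITEMS = (6, 7, 8, 9)
PSYCHOMOTOR_ITEMS = (15, 16)
# Items assumed to be scored individually.
SINGLE_ITEMS = (5, 10, 11, 12, 13, 14)


def qids_sr16_total(responses):
    """Return a 0-27 total from 16 item responses, each scored 0-3."""
    if len(responses) != 16:
        raise ValueError("expected 16 item responses")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("each response must be 0, 1, 2 or 3")

    def pick(items):
        # Convert 1-based item numbers to list positions.
        return [responses[i - 1] for i in items]

    # Highest response within each multi-item domain, plus the single items.
    return (max(pick(SLEEP_ITEMS))
            + max(pick(APPETITE_WEIGHT_ITEMS))
            + max(pick(PSYCHOMOTOR_ITEMS))
            + sum(pick(SINGLE_ITEMS)))


# Hypothetical responses, for illustration only.
example = [2, 0, 1, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
print(qids_sr16_total(example))
```

Taking the maximum within a domain, rather than the sum, is what keeps the total on a 0 to 27 range (nine scored domains of 0 to 3) despite there being 16 items.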

Hamilton Rating Scale for Depression (17 item) HRSD-17
The 17-item HRSD structured interview is intended for use as the ‘gold standard’ of depression severity measurement. The HRSD-17 has been chosen for this purpose as it has been widely used in intervention studies which have taken into account depression severity (3). Overwhelmingly, what is known about treatment efficacy has been based on trials which have assessed depression severity using this method. Additionally, the appropriateness of using the HRSD-17 in a primary care context has long been established (25). Initially the standard HRSD-17 schedule was used; however, following discussions at a UK primary care psychiatry meeting, the GRID-HAMD was brought to our attention and was used thereafter. It is freely available from http://www.iscdd.org/. Both administration and scoring are standardised in this method, which helps maximise inter-rater reliability without altering the original intent of the measure. However, to ensure inter-rater reliability between the psychiatrist interviewers, an additional analysis was conducted and is reported in Appendix II.


Demographic data
Within the questionnaire booklet, demographic questions were included to allow an assessment of the sample’s representativeness in terms of age, sex, ethnicity and deprivation status based on Scottish Index of Multiple Deprivation (SIMD) as derived from participants’ home postal codes.

Procedure
All recruits were invited to complete the four questionnaires on two occasions, three months apart. Additionally, they were invited to participate in a clinical interview conducted by a consultant psychiatrist, a specialist registrar or a staff grade psychiatrist. Prospective participants were given the alternative of taking part in a telephone interview if attending would be difficult. In such cases the GRID-HAMD was still used; however, as two of the items required visual observations (regarding retardation and agitation), raters were given instructions for the telephone version of HRSD (26) in order to facilitate the assessment of these components. The interviewers were blind to the questionnaire responses. Once recruitment had commenced (n=18), and following feedback from a conference presentation, randomisation of order of administration of interview and booklet, stratified by practice, was introduced. This was done to reduce any confounding of order of completion. For those randomised to receive the questionnaire first, participants were encouraged to complete the booklet on the same day (before the interview), or the day before the interview. For those randomised to receive the questionnaire following the interview, participants were encouraged to complete the booklet on the same day (after the interview), or the following day. The process is outlined in Figure 1.


Figure 1: Recruitment and participation

Invitation to participate (n=1134)
GP gives/sends, to patients with a diagnosis of depression, information sheet, reply slip and freepost envelope.

Patient considers information in his/her own time, phoning research team to discuss any queries as necessary (n=306).

Patient declines (n=20) No further action

Patient accepts (n=286) Patient contacts University research team (by reply slip or telephone) indicating interest and is randomised to receive questionnaire either before or after the interview. They are allocated an appointment with a study doctor (to take place at practice, community hospital or by telephone).

Receives questionnaire before (n randomised=137; n not randomised=8) Three days prior to appointment researcher posts to patient: appointment confirmation letter, questionnaire, consent form and reply paid envelope.

Receives questionnaire after (n randomised =131; n not randomised=10) Researchers post to patient: appointment confirmation letter.

Appointment for HRSD-17 (n=141) Study doctor confirms patient has completed consent form and questionnaire and conducts HRSD-17 assessment.

Appointment for HRSD-17 (n=140) Study doctor completes consent form with participant and conducts HRSD-17 assessment. The participant is given the questionnaire and reply paid envelope to take away and complete.

Follow up (282 sent out) Patients sent three month follow up questionnaire. Up to two reminder letters were sent to non-responders.


Statistical analyses
A concurrent analysis was made of the self-complete scales. Internal consistency was assessed using Cronbach’s alpha and item-total correlations. The factor structure was examined to assess the homogeneity of each scale by observing the amount of variance extracted by the first factor, using principal components factor analysis. The robustness of the emerging factors in each scale was analysed by running coefficients of congruence (27) between the samples from first to second time point.

Convergent validity was examined by computing correlations of each of the self-complete scales with the HRSD-17. Discriminant validity was investigated by intercorrelating the scales with the HAD Anxiety subscale (HAD-A) to assess whether they exhibited greater convergence with the HRSD-17 than with the HAD-A. Both convergence and discrimination were assessed using Williams’ Test (28).

Convergence of the scales’ severity bandings was also investigated. The established severity cut-off scores for each scale were assessed for convergence using the Wilcoxon Signed Rank Test for related samples. Severity cut off bands for moderate depressive symptoms were assessed relative to HRSD-17≥14 using Receiver Operating Characteristic (ROC) curves (29). Sensitivity and specificity of the scales at detecting symptoms of moderate severity were calculated with accompanying confidence intervals (30). Positive and negative predictive values (PPVs and NPVs) were also calculated, as were Diagnostic Odds Ratios (DORs) with accompanying confidence intervals (Glas et al., 2003). DORs allow assessment of each scale’s discriminatory performance using a single indicator relative to HRSD-17 defined moderate severity.

Responsiveness to clinical change across the two time points was measured by running paired t-tests on the scales; the effect size of each scale was then calculated (31). Analyses were conducted using SPSS (Version 17) and Clinimetrics Toolkit.
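Two of the statistics named above, Cronbach's alpha and the diagnostic accuracy measures (sensitivity, specificity, PPV, NPV and DOR), are simple enough to sketch directly. The Python below is a minimal illustration and not the authors' analysis, which was run in SPSS and Clinimetrics Toolkit. The function names are hypothetical, the data are invented, and the confidence interval for the DOR assumes the commonly used log-scale standard error; the report does not spell out its interval method, although the interval quoted in the Summary (11.03 to 41.60 around 21.42) is symmetric on the log scale, which is consistent with an interval of this kind.

```python
import math
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of the summed scale score.
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def diagnostic_accuracy(tp, fp, fn, tn, z=1.96):
    """Accuracy measures from a 2x2 table of scale result vs reference standard.

    tp, fp, fn, tn are counts of true/false positives and negatives, where
    "positive" means at or above the scale cut-off and the reference standard
    is the clinician-rated threshold. All four counts must be non-zero.
    """
    dor = (tp * tn) / (fp * fn)
    # Assumed interval method: standard error of ln(DOR).
    se_log_dor = math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "dor": dor,
        "dor_ci": (math.exp(math.log(dor) - z * se_log_dor),
                   math.exp(math.log(dor) + z * se_log_dor)),
    }


# Hypothetical data, for illustration only (not taken from the study).
toy_items = [[0, 1, 1], [1, 1, 2], [2, 3, 3], [3, 3, 3]]
print(round(cronbach_alpha(toy_items), 3))
print(diagnostic_accuracy(tp=80, fp=20, fn=20, tn=80))
```

The DOR condenses sensitivity and specificity into one number, which is why it is convenient for ranking scales; it does not, however, show whether a scale errs towards false positives or false negatives, so it is best read alongside the separate measures.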
Sample Calculation
The original intention was to recruit a sample of 500. However, this was based on the assumption that the QIDS-SR16 would be used as a proxy for the HRSD-17 once its fitness for purpose was established. As this was not established, it was not considered feasible to recruit 500 participants to have the clinical interview. In light of the continued need for clinical interviews, the sample size was revised to n > 250. Larger samples are required to detect small differences between groups. Where a sample is too small, there is a risk of a type II error, in which one fails to find a significant difference between samples when a difference genuinely exists. In the current study, preliminary analysis revealed marked differences in depression severity categorisation; therefore, even with a sample size of half the original intent, the power remained adequate.


5 Results

Participating practices
Practice characteristics are presented in Table 1. The nine practices represent a range of urban and rural settings. In terms of predominant deprivation status of their catchments there was representation of four of the quintiles from Scotland. No practices in Grampian fall into the fifth quintile for Scotland, representing greatest deprivation.

Table 1 Characteristics of participating practices

| Practice | SIMD¹ quintile 2007 | Modal urban/rural category² |
|---|---|---|
| A | 1st | Other urban area – settlements of 10 000 – 125 000 people |
| B | 4th | Large urban area – settlements of over 125,000 people |
| C | 1st | Accessible small town – settlements of 3 000 – 10 000 people and within 30 minutes drive of settlements of 10 000 or more |
| D | 4th | Large urban area – settlements of over 125,000 people |
| E | 4th | Large urban area – settlements of over 125,000 people |
| F | 3rd | Other urban area – settlements of 10 000 – 125 000 people |
| G | 2nd | Remote small town – settlements of 3 000 – 10 000 people within 30-60 minutes drive to settlements of 10 000 or more |
| H | 3rd | Remote small town – settlements of 3 000 – 10 000 people within 30-60 minutes drive to settlements of 10 000 or more |
| I | 4th | Large urban area – settlements of over 125,000 people |

¹ Scottish Index of Multiple Deprivation quintile for Scotland where 1st represents least deprived and 5th most deprived (mapped to 2006 reference files).
² Urban/rural practice status according to Scottish Government 2007/2008 urban rural classification.

Patient participation
Two hundred and eighty-six (25%) of 1134 invited patients participated. Table 2 presents demographic characteristics of participants.


Table 2 Demographic characteristics of patient participants

| Characteristic | Summary statistic | N |
|---|---|---|
| Age (standard deviation) | 49.8 years (13.8) | 270 |
| Female (%) | 197 (69) | 286 |
| Ethnicity (%) | | 270 |
| Asian | 1 | |
| Black | 1 | |
| Other | 1 | |
| White | 267 (99) | |
| First language English n (%) | 261 (98) | 266 |
| Main activity (%) | | 266 |
| Employed/self employed | 136 (51) | |
| Housework | 45 (17) | |
| Seeking work | 10 (4) | |
| Retired | 48 (18) | |
| Student | 4 (2) | |
| Other | 23 (8) | |
| Educated beyond minimum school age (%) | 138 (52) | 266 |
| Educated to degree level (%) | 95 (36) | 266 |

Of the 286 participants, 269 (94%) both completed the questionnaire booklet and participated in the interview (Table 3). The remainder completed only one component: some interview participants did not subsequently complete and return the questionnaire, and some participants indicated they did not have time to take part in an interview but were willing to complete the questionnaire. Such participation was considered of value to the study as these data could be included in the analyses which did not require the HRSD-17, for example analyses assessing internal consistency and factor structure of the questionnaires. The first 19 (7%) participants were assessed with the standard version of the HRSD-17. Following this, 263 (93%) participants were assessed with the GRID version. Two hundred and forty (85%) interviews were conducted face-to-face and 42 (15%) by telephone.


Table 3 Participation in interview and questionnaire

| | Participated in HRSD-17 (%) | Did not participate in HRSD-17 (%) |
|---|---|---|
| Returned questionnaire (%) | 269 (94) | 4 (1) |
| Did not return questionnaire (%) | 13 (4) | - |

Three month follow up questionnaires were completed by 214 (76%) participants. The distribution of the HRSD-17 total scores did not deviate significantly from a Normal distribution, as confirmed by the one-sample Kolmogorov-Smirnov test of goodness of fit (D=0.064, p=0.199). The mean score was 12.9 (s.d. 7.57). This corresponds with the upper end of the mild category as defined by the Handbook of Psychiatric Measures (32). Table 4 presents the HRSD-17 scores according to the established categories.

Table 4 HRSD-17 severity categories (Handbook of Psychiatric Measures)

| HRSD-17 severity category | N (%) |
|---|---|
| None (0-7) | 76 (27.0) |
| Mild (8-13) | 71 (25.3) |
| Moderate (14-18) | 66 (23.5) |
| Severe (≥19) | 68 (24.2) |
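The banding in Table 4 can be expressed as a small lookup. The sketch below is illustrative only; the function name is hypothetical, and the counts are those reported in Table 4.

```python
def hrsd17_band(score):
    """Map an HRSD-17 total score to the severity bands used in Table 4."""
    if score < 0:
        raise ValueError("HRSD-17 total scores cannot be negative")
    if score < 8:
        return "None"       # 0-7
    if score < 14:
        return "Mild"       # 8-13
    if score < 19:
        return "Moderate"   # 14-18
    return "Severe"         # 19 and above


# Counts per band as reported in Table 4.
counts = {"None": 76, "Mild": 71, "Moderate": 66, "Severe": 68}
total = sum(counts.values())
percentages = {band: round(100 * n / total, 1) for band, n in counts.items()}
```

Applied to the sample mean of 12.9, the function returns the mild band, which agrees with the statement above that the mean lies at the upper end of that category.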

Total scores of the self-complete questionnaires were also found not to deviate significantly from a Normal distribution according to the Kolmogorov-Smirnov test at the first time point (HAD-D D=0.06, p=0.31; PHQ-9 D=0.09, p=0.05; BDI-II D=0.07, p=0.18; QIDS-SR16 D=0.07, p=0.18). At the second time point the HAD-D and BDI-II did not differ significantly from a Normal distribution (HAD-D D=0.09, p=0.08; BDI-II D=0.09, p=0.07); however, the PHQ-9 and QIDS-SR16 did differ (PHQ-9 D=0.11, p=0.01; QIDS-SR16 D=0.11, p=0.02). Summary statistics of the self-complete questionnaires are presented in Table 5.


Table 5 Scores of self-complete depression measures (both time points)

| Self-complete questionnaire (Time point 1) | Mean (s.d.) | Median (Quartiles) | N* |
|---|---|---|---|
| HAD-D | 8.39 (4.73) | 8 (5, 12) | 270 |
| PHQ-9 | 11.8 (7.25) | 11 (6, 18) | 252 |
| BDI-II | 23.7 (12.8) | 22 (14, 33) | 241 |
| QIDS-SR16 | 12.1 (5.73) | 12 (8, 16.5) | 265 |

| Self-complete questionnaire (Time point 2) | Mean (s.d.) | Median (Quartiles) | N* |
|---|---|---|---|
| HAD-D | 7.25 (4.73) | 7 (3.5, 10) | 209 |
| PHQ-9 | 10.6 (7.24) | 10 (4.75, 16) | 202 |
| BDI-II | 19.9 (14.0) | 17 (8, 28) | 198 |
| QIDS-SR16 | 10.8 (6.08) | 10 (6, 16) | 207 |

*Where data were complete and total scores could be summed.

Reliability Analysis
Cronbach’s alpha values for each of the self-complete measures at both time points are presented in Table 6. All scales exhibited acceptable internal consistency at both time points. Item-total correlations were also satisfactory for all the questionnaires at both time points with the exception of one item on the QIDS-SR16 at time point 1 which just fell below 0.4 (0.39). This was the item derived from the highest of the four weight questions.


Table 6 Cronbach’s alpha and item-total correlations of self-complete questionnaires at time point 1 and time point 2

| Questionnaire | Time point 1: Cronbach’s α (95% C.I.) | Time point 1: N | Time point 1: Item-total correlations (minimum – maximum) | Time point 2: Cronbach’s α (95% C.I.) | Time point 2: N | Time point 2: Item-total correlations (minimum – maximum) |
|---|---|---|---|---|---|---|
| HAD-D | 0.87 (0.84, 0.89) | 269 | 0.54 – 0.74 | 0.89 (0.86, 0.91) | 209 | 0.54 – 0.79 |
| PHQ-9 | 0.92 (0.90, 0.93) | 251 | 0.60 – 0.80 | 0.92 (0.90, 0.94) | 202 | 0.64 – 0.77 |
| BDI-II | 0.94 (0.93, 0.95) | 239 | 0.47 – 0.78 | 0.96 (0.95, 0.96) | 198 | 0.51 – 0.81 |
| QIDS-SR16 | 0.86 (0.83, 0.88) | 264 | 0.39 – 0.77 | 0.89 (0.86, 0.91) | 207 | 0.47 – 0.74 |


Factor analysis
Principal components analysis revealed a uni-dimensional factor structure in the HAD-D, PHQ-9 and QIDS-SR16 at both time points. For the BDI-II three factors emerged at time point one and two factors emerged at time point two; however, the scree plots suggested the presence of one underlying factor at both time points. Figure 2 presents the first time point scree plot to illustrate this.

Figure 2 Scree plot of BDI-II at first time point showing three factors with eigenvalues greater than one. One underlying factor is apparent.

The item variance accounted for by the one factor emerging in the HAD-D, PHQ-9 and QIDS-SR16 and by the first factor of the BDI-II is shown in Table 7. At both time points, most items within each scale had a substantial loading, indicating all of them to be factorially valid. For each scale, the coefficients of congruence between time point one and time point two tended towards one. This indicates each scale has a highly robust uni-dimensional factor structure regardless of the time point at which it was measured.


Table 7 Item variance accounted for by first factor, range of item loadings and coefficients of congruence

| Questionnaire | Percentage variance measured by 1st factor, Time point 1 | Range of item loadings of 1st factor, Time point 1 | Percentage variance measured by 1st factor, Time point 2 | Range of item loadings of 1st factor, Time point 2 | Coefficients of congruence (time point 1 to time point 2) |
|---|---|---|---|---|---|
| HAD-D | 56.7 | 0.652 – 0.834 | 61.1 | 0.648 – 0.868 | >0.999 |
| PHQ-9 | 60.7 | 0.682 – 0.857 | 61.4 | 0.714 – 0.822 | 0.999 |
| BDI-II | 46.3 | 0.513 – 0.819 | 54.8 | 0.552 – 0.842 | 0.998 |
| QIDS-SR16 | 48.8 | 0.481 – 0.844 | 53.2 | 0.553 – 0.818 | 0.998 |
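For readers unfamiliar with coefficients of congruence, a minimal sketch follows. It assumes the widely used Tucker form (the report cites reference 27 without giving the formula), and the loading vectors are hypothetical values chosen to fall within the ranges in Table 7, not the study’s actual loadings.

```python
from math import sqrt


def congruence(loadings_a, loadings_b):
    """Tucker's coefficient of congruence between two loading vectors.

    Assumed form: sum(a*b) / sqrt(sum(a^2) * sum(b^2)).
    """
    if len(loadings_a) != len(loadings_b):
        raise ValueError("loading vectors must cover the same items")
    cross = sum(a * b for a, b in zip(loadings_a, loadings_b))
    norm_a = sum(a * a for a in loadings_a)
    norm_b = sum(b * b for b in loadings_b)
    return cross / sqrt(norm_a * norm_b)


# Hypothetical first-factor loadings for a 7-item scale at two time points.
time1 = [0.65, 0.70, 0.72, 0.75, 0.78, 0.80, 0.83]
time2 = [0.66, 0.72, 0.74, 0.79, 0.80, 0.84, 0.86]
phi = congruence(time1, time2)
```

A value close to one indicates that the pattern of loadings is nearly proportional across the two time points, which is the sense in which the report describes the factor structures as robust.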


Convergent and discriminant validity
For assessment of convergent and discriminant validity, data were only included where the HRSD-17 and the self-complete measures were completed within three days of one another. It was considered that with a maximum time difference of three days, there would be sufficient overlap in reference points. The sample included in this analysis was therefore n=252. In fact 233 (92%) of these observations were within one day of each other, giving substantial overlap in points of reference.

Figure 3 Scatter plots of the HAD-D, PHQ-9, BDI-II and QIDS-SR16 against HRSD-17

The questionnaires demonstrated good convergent validity in that they all correlated highly with HRSD-17 (HAD-D and HRSD-17 r=0.68; PHQ-9 and HRSD-17 r=0.78; BDI-II and HRSD-17 r=0.74; and QIDS-SR16 and HRSD-17 r=0.78). Figure 3 demonstrates the convergence in scatter plots where it can be seen that the HRSD-17 forms a linear relationship with all the measures. For HAD-D and HRSD-17, r²=0.47, that is to say the HAD-D sum score accounted for 47% of the variance in the HRSD-17 scores. The PHQ-9, BDI-II and QIDS-SR16 accounted for 59%, 54% and 61% of the variance in HRSD-17 scores respectively.

The measures also correlated highly with the HAD-A. This is to be expected between such closely related constructs as anxiety and depression; however, apart from the BDI-II, correlations were significantly higher between the self-complete measures and the HRSD-17 than with the HAD-A (Table 8). Therefore, all but the BDI-II demonstrated discriminant validity.

Table 8 Correlation of the depression severity self-complete questionnaires with the HRSD-17 and with the HAD-A

| Questionnaire | Correlation with HRSD-17 | Correlation with HAD-A | Williams’ test p-value |
|---|---|---|---|
| HAD-D | 0.697 | 0.593 | 0.012 |
| PHQ-9 | 0.784 | 0.656 | 0.001 |
| BDI-II | 0.748 | 0.685 | 0.114 |
| QIDS-SR16 | 0.791 | 0.684 | 0.003 |
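The comparison in Table 8 is between two correlations that share one variable (the self-complete scale). A minimal sketch of the Williams test statistic for this situation is given below, using the formulation commonly attributed to Williams (1959) as presented by Steiger (1980); whether this matches the exact computation behind Table 8 is an assumption. The correlation between the HRSD-17 and the HAD-A and the sample size for each comparison are not reported in this section, so the values used for them here are hypothetical, and the resulting statistic should not be read as a reproduction of the tabulated p-values.

```python
from math import sqrt


def williams_t(r12, r13, r23, n):
    """Williams' t for H0: rho12 == rho13, where variable 1 is shared.

    r12: correlation of the scale with the HRSD-17
    r13: correlation of the scale with the HAD-A
    r23: correlation of the HRSD-17 with the HAD-A
    Returns (t, degrees_of_freedom); t is referred to a t distribution
    with n - 3 degrees of freedom.
    """
    # Determinant of the 3 x 3 correlation matrix.
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    numerator = (n - 1) * (1 + r23)
    denominator = 2 * ((n - 1) / (n - 3)) * det + r_bar**2 * (1 - r23) ** 3
    return (r12 - r13) * sqrt(numerator / denominator), n - 3


# r12 and r13 are the PHQ-9 values from Table 8.
# r23 = 0.60 and n = 252 are HYPOTHETICAL placeholders.
t_stat, df = williams_t(r12=0.784, r13=0.656, r23=0.60, n=252)
```

The p-value would then be obtained from a t distribution with the returned degrees of freedom, for example with a statistics package; that step is omitted here to keep the sketch dependency-free.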

Convergence of severity banding
The distribution of the measures across categories is shown in the bar chart in Figure 4. For the purpose of this observation, data were only included where the HRSD-17 and the self-complete measures were completed within three days of one another. Additionally, data were included only where they were complete and could be summed for all measures, so that exactly the same sample of participants was assessed for each measure (n=204). It can be seen that the HAD-D categorises a larger proportion of participants in the “none” category relative to the other measures and the PHQ-9 tends to categorise more participants in the moderate category. Although this bar chart provides an opportunity to assess the relative distribution of categorisation of depression severity, it is limited in that it does not inform with regard to agreement in categorisation.

Figure 4 Distribution of scores across categories according to each measure (n=204)
[Bar chart: None, Mild, Moderate and Severe categories for the HRSD-17, HAD-D, PHQ-9, BDI-II and QIDS-SR16]

Figure 5, however, shows the convergence in severity banding between each scale and the HRSD-17. The HAD-D tended to categorise participants in a milder category than the HRSD-17 whereas the PHQ-9, QIDS-SR16 and BDI-II tended to categorise participants in a more severe category. The Wilcoxon Signed Rank test for related samples showed these differences to be significant for each measure relative to the HRSD-17 (p<0.05).


Figure 5 Convergence of severity banding
[Bar chart: participants banded lower than (<HRSD-17), the same as (=HRSD-17) or higher than (>HRSD-17) the HRSD-17, for HAD-D (n=247), PHQ-9 (n=233), BDI-II (n=222) and QIDS-SR16 (n=242)]

Empirically derived cut offs
As NICE Guidelines on the management of depression advocate the use of antidepressants where depression is of at least moderate severity, the optimal cut-off points in the self-complete measures were assessed against HRSD-17≥14. Receiver Operating Characteristic (ROC) curves allow assessment of the best trade off between sensitivity and specificity.

Figure 6 ROC curves of self-complete depression severity measures relative to HRSD-17 cut off of 14

Table 9 presents the area under the ROC curve (AUROC) of each self-complete depression measure. The greater the area, the better the test is at discriminating. For example, an area of 1 would indicate the self-complete measure was perfect at discriminating between those above and below the HRSD-17≥14 threshold. An area of 0.5 would indicate the test did no better than chance at discriminating between those above and below the threshold. All four measures were shown to perform significantly better than chance at discriminating between those above and below the threshold (p<0.001). The largest area under the curve occurred with the QIDS-SR16.
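The interpretation of the area under the curve can be made concrete with a small sketch. It uses the standard equivalence between the AUROC and the probability that a randomly chosen case scores higher than a randomly chosen non-case (ties counted as one half). The scores below are hypothetical; the study’s own ROC analysis was run in statistical software, not with this code.

```python
def auroc(scores, is_case):
    """Area under the ROC curve via pairwise comparison.

    scores:  questionnaire totals, one per participant
    is_case: True where the reference standard is met
             (in this report, HRSD-17 >= 14)
    """
    cases = [s for s, c in zip(scores, is_case) if c]
    non_cases = [s for s, c in zip(scores, is_case) if not c]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for c in cases:
        for n in non_cases:
            if c > n:
                wins += 1.0      # case ranked above non-case
            elif c == n:
                wins += 0.5      # tie counts as half
    return wins / (len(cases) * len(non_cases))


# Hypothetical questionnaire totals and reference-standard status.
scores = [3, 5, 8, 9, 11, 12, 14, 15, 18, 21]
is_case = [False, False, False, True, False, True, True, False, True, True]
area = auroc(scores, is_case)
```

An area of 1 arises only when every case outscores every non-case, and an area of 0.5 when scores carry no information about case status, matching the two reference points described above.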

Table 9 Area under the Receiver Operating Characteristic (AUROC) curve of self-complete depression severity measures relative to a HRSD-17≥14 (moderate) cut off

| Questionnaire | N* | AUROC curve | 95% Confidence Intervals |
|---|---|---|---|
| HAD-D | 247 | 0.834 | 0.785 – 0.882 |
| PHQ-9 | 233 | 0.879 | 0.837 – 0.921 |
| BDI-II | 222 | 0.850 | 0.802 – 0.899 |
| QIDS-SR16 | 242 | 0.894 | 0.856 – 0.933 |

*Where data were complete for HRSD-17 and the self-complete questionnaire and where completion occurred within a 3 day interval.

When deciding upon a cut-off point on a scale, there is always a trade off between specificity and sensitivity, i.e. optimising sensitivity to ensure no cases are missed will inevitably result in an increased rate of individuals being rated above the threshold who do not meet the criteria. Table 10 shows, for each self-complete depression severity measure, the discriminatory properties at the moderate cut off defined by the scales’ developers relative to HRSD-17≥14. Following the ROC curve analysis, the optimal discriminatory properties are also shown. DORs provide a single indicator from which to assess a scale’s discriminatory performance. The DOR is independent of prevalence, and depression guidelines (3) state that a diagnostic odds ratio ≥20 demonstrates sufficient accuracy for use in clinical settings. On this criterion, the QIDS-SR16 appears superior to the other measures; however, the confidence intervals are wide.


Table 10 Self-complete depression severity measures: discriminatory performance of detecting moderate depression severity relative to a HRSD-17≥14 (moderate) cut off

Cut offs for moderate severity as defined by scales’ developers relative to HRSD-17≥14

| Cut off | % Sensitivity (95% Confidence intervals) | % Specificity (95% Confidence intervals) | Diagnostic odds ratio (95% Confidence intervals) | Positive predictive value (%) | Negative predictive value (%) |
|---|---|---|---|---|---|
| HAD-D≥11 | 52 (43 – 61) | 89 (83 – 94) | 8.51 (4.44 – 16.30) | 80 | 68 |
| PHQ-9≥10 | 87 (80 – 93) | 69 (60 – 77) | 14.29 (7.27 – 28.09) | 70 | 86 |
| BDI-II≥20 | 84 (78 – 91) | 68 (60 – 77) | 11.58 (6.03 – 22.20) | 72 | 82 |
| QIDS-SR16≥11 | 86 (79 – 92) | 68 (60 – 76) | 12.57 (6.60 – 23.95) | 70 | 85 |

Optimal cut offs for moderate severity derived from ROC curve analysis relative to HRSD-17≥14

| Cut off | % Sensitivity (95% Confidence intervals) | % Specificity (95% Confidence intervals) | Diagnostic odds ratio (95% Confidence intervals) | Positive predictive value (%) | Negative predictive value (%) |
|---|---|---|---|---|---|
| HAD-D≥9 | 73 (65 – 81) | 76 (68 – 83) | 8.47 (4.78 – 15.02) | 72 | 76 |
| PHQ-9≥12 | 76 (68 – 84) | 78 (71 – 85) | 11.46 (6.20 – 21.17) | 74 | 80 |
| BDI-II≥23 | 73 (65 – 82) | 74 (66 – 82) | 7.99 (4.39 – 14.54) | 73 | 74 |
| QIDS-SR16≥13 | 79 (71 – 86) | 85 (79 – 91) | 21.42 (11.03 – 41.60) | 82 | 82 |
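The relationship between the columns of Table 10 can be sketched as follows. The diagnostic odds ratio is the odds of a positive result among cases divided by the odds of a positive result among non-cases, and its confidence interval is conventionally built on the log scale from the four cell counts. The cell counts in the example are hypothetical, as the report does not give the underlying 2 x 2 tables, and the 1.96 multiplier assumes a 95% interval.

```python
from math import exp, log, sqrt


def dor_from_rates(sensitivity, specificity):
    """Diagnostic odds ratio from sensitivity and specificity (proportions)."""
    odds_positive_in_cases = sensitivity / (1 - sensitivity)
    odds_positive_in_non_cases = (1 - specificity) / specificity
    return odds_positive_in_cases / odds_positive_in_non_cases


def dor_with_ci(tp, fp, fn, tn, z=1.96):
    """DOR and log-scale confidence interval from 2 x 2 cell counts.

    All four cells must be non-zero for this simple form.
    """
    dor = (tp * tn) / (fp * fn)
    se_log = sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    lower = exp(log(dor) - z * se_log)
    upper = exp(log(dor) + z * se_log)
    return dor, lower, upper


# Rounded sensitivity and specificity for QIDS-SR16 >= 13 from Table 10.
approx_dor = dor_from_rates(0.79, 0.85)

# HYPOTHETICAL cell counts, for illustration of the interval only.
dor, lower, upper = dor_with_ci(tp=80, fp=20, fn=20, tn=80)
```

Because Table 10 reports sensitivity and specificity as rounded percentages, the value recovered this way is close to, but not identical with, the tabulated ratio of 21.42.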


Responsiveness to change
The three-month follow up questionnaire was completed by 214 (76%) participants. Time between completion one and two was not normally distributed (D=0.22; p<0.001). The median time between completion one and two was 98 days (IQR 93, 110). The HRSD-17 interviews were not conducted at the second time point; therefore observations of the self-complete measures in terms of responsiveness to change are a relative comparison. Additionally, time point one did not represent the start of a depressive episode, nor did it signify the beginning of an intervention; therefore this assessment is simply one of the relative responsiveness of the four measures over two points in time. Changes in mean scores from time point one to time point two are shown in Table 11. For all of the measures, there is a statistically significant difference in the mean score from time point one to time point two, with all showing a reduction in scores. The effect size is small for all the measures, indicating their measurement of magnitude of change over time is similar.

Table 11 Responsiveness to change over time of self-complete depression severity measures

| Questionnaire | Mean (s.d.) Time point 1 | Mean (s.d.) Time point 2 | 95% Confidence Intervals of difference of means | Effect size |
|---|---|---|---|---|
| HAD-D (n=202) | 8.09 (4.81) | 7.22 (4.78) | (0.37 – 1.37) | 0.18 |
| PHQ-9 (n=183) | 11.28 (7.27) | 10.39 (7.27) | (0.01 – 1.77) | 0.12 |
| BDI-II (n=180) | 22.76 (12.63) | 19.53 (13.90) | (1.85 – 4.61) | 0.26 |
| QIDS-SR16 (n=198) | 11.67 (5.64) | 10.72 (6.09) | (0.27 – 1.62) | 0.17 |
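The effect sizes in Table 11 can be approximated from the tabulated means and standard deviations. The report cites reference (31) without stating the formula; the sketch below assumes the common definition of mean change divided by the standard deviation at the first time point, which is consistent with the tabulated values to two decimal places.

```python
def effect_size(mean_t1, mean_t2, sd_t1):
    """Mean change divided by the baseline standard deviation
    (assumed formula; the report does not state it explicitly)."""
    return (mean_t1 - mean_t2) / sd_t1


# (mean at time point 1, mean at time point 2, s.d. at time point 1)
# taken from Table 11.
table_11 = {
    "HAD-D": (8.09, 7.22, 4.81),
    "PHQ-9": (11.28, 10.39, 7.27),
    "BDI-II": (22.76, 19.53, 12.63),
    "QIDS-SR16": (11.67, 10.72, 5.64),
}
effect_sizes = {
    name: round(effect_size(*values), 2) for name, values in table_11.items()
}
```

Dividing by the baseline standard deviation puts all four scales on a common footing despite their different score ranges, which is what allows the magnitudes of change to be compared across measures.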


6 Discussion

Summary of main findings
The HAD-D, PHQ-9, BDI-II and QIDS-SR16 all exhibited acceptable internal consistency and robust factor structures indicating one underlying construct in each measure. Additionally, all the measures correlated highly with the HRSD-17. The HAD-D, PHQ-9 and QIDS-SR16 exhibited discriminant validity but the BDI-II did not. All of the scales differed significantly in how they categorised the severity of depression relative to the HRSD-17. The HAD-D tended to categorise participants in milder categories than the HRSD-17 whereas the PHQ-9, BDI-II and QIDS-SR16 tended to categorise participants in a more severe category than the HRSD-17. All four questionnaires exhibited a similar measurement of magnitude of change over time.

A HRSD-17≥14 is indicative of depression symptoms of moderate severity. This threshold is of interest to clinicians as the NICE Guidelines on treatment of depression (2009) advocate the use of antidepressant therapy for patients with symptoms of at least moderate severity. On this basis, best sensitivity and specificity can be reached for detecting moderate severity of depressive symptoms where HAD-D≥9, PHQ-9≥12, BDI-II≥23 or QIDS-SR16≥13. The diagnostic odds ratios indicated the QIDS-SR16 to operate at a greater degree of accuracy than the other measures; however, the confidence intervals were wide, with the lower limit being outside the acceptable range for clinical practice.

Strengths and Weaknesses
To our knowledge, this has been the first study in the UK to assess these four commonly used depression severity measures in terms of their ability to measure severity of depression. The HRSD-17 is not a diagnostic interview and some may argue that the self-complete measures ought to be assessed against a diagnostic interview such as the SCID. However, the purpose of these scales in UK treatment of depression is not for them to be used as case finding tools but to be used for assessing severity of depression in patients already diagnosed by a clinician. The assessment of severity is to identify appropriate evidence-based treatment options. As such, the HRSD-17, administered by a clinician trained in mood assessment, represents the best standard to assess the self-complete measures against, as it is on the HRSD-17 that the evidence base with regard to severity is founded. Furthermore, we were able to establish inter-rater reliability between the clinicians that carried out the assessments.

Our strategy involved recruiting practices from a range of settings encompassing urban, rural and suburban locations and with a range of affluence/deprivation. By recruiting entirely in the Grampian area, our sample does not include the most deprived of practices. Similarly, Grampian does not represent the ethnic diversity of some parts of Scotland. It is uncertain how this may impact on the generalisability of our findings.

Only a quarter of patients invited to participate did so. This is perhaps not surprising. Depressive symptoms relating to motivation, concentration and co-occurring anxiety symptoms may have contributed to dissuading potential recruits from agreeing to complete several questionnaires and meet with an unfamiliar clinician. However, the priority for this sample, sought for psychometric assessment, was that it represented an even distribution of patients with differing severity of symptoms of depression. The sample met this criterion. Additionally, the sample was similar, in terms of sex, to patients consulting GPs regarding depression throughout Scotland in 2007/2008 (http://www.isdscotland.org/isd/3711.html).

Diagnostic odds ratios were used to assess the relative performance of each scale in identifying patients with moderately severe depressive symptoms. It is worth remembering that two scales with identical diagnostic odds ratios could have very different sensitivity and specificity; therefore choice of scale should consider which of these criteria is most important for the given purpose.

Although we assessed how the self-complete questionnaires measured change over time, there was no HRSD-17 assessment at the second time point. We could not therefore comment on the accuracy of the questionnaires at measuring change over time but could only comment on the relative magnitude of change the questionnaires observed.

This study compared the depression self-complete measures against the HRSD-17. In addition to assessing the accuracy of these measures in categorising the severity of depressive symptoms, it would have been of great interest to assess GPs’ accuracy in gauging the severity of their patients’ depressive symptoms. Whether GPs’ assessment of severity was any better or worse than the measures could then have been assessed. Adding this element to the study would have been difficult to achieve as it would have required synchronised timing of three components: psychiatric interview, GP consultation and completion of the questionnaire booklet. As it was, it was sometimes challenging to align timings when patients and interviewing psychiatrists could meet, even without the additional element of a GP appointment at a close time point.
Accuracy of GPs’ assessment should however be a focus of future research. Lowe et al. (2004) found GPs’ detection of major depression to be poor relative to measures including PHQ-9 and HAD-D. However, this was in a sample of patients attending practices rather than an assessment of GPs’ accuracy in gauging severity in patients already diagnosed with depression.

Comparisons with literature
Our earlier work raised concerns regarding the validity of the HAD-D and PHQ-9 in terms of their assessment of severity of depression (19). We observed that at least one of the two measures was erroneously categorising the severity of depression. Investigations in Sweden and Australia reached similar conclusions (20,21). The present psychometric assessment, where the HRSD-17 interview assessment was conducted, enables us to conclude that indeed both the HAD-D and the PHQ-9, used with their conventional scoring systems, categorise the severity of depression inaccurately. The HAD-D tends to place participants in a milder category of depression than the HRSD-17 and the PHQ-9 tends to place individuals in a more severe category. This latter tendency is also true of the BDI-II and QIDS-SR16.

Kendrick et al. (2009), in an observational study of the QOF depression severity measures in UK primary care, observed that practices which used the PHQ-9 tended to observe greater severity of symptoms than practices that used the HAD-D. Drawing on evidence of psychometric studies which had assessed the PHQ-9 (18) and the HAD-D (15) in terms of their case finding properties relative to the SCID, Kendrick et al. proposed a cut off of ≥9 on the HAD-D and ≥12 on the PHQ-9. Based on the same studies, a cut off of ≥10 on the HAD-D and ≥12 on the PHQ-9 became endorsed in the QOF guidance despite the SCID diagnostic interview not being a measure of severity of depression symptomatology (2). The cut-off points which we present have been empirically derived based on the HRSD-17, on which the evidence base of treatment of depression in regard to severity is founded. The cut offs we propose are the same as those suggested by Kendrick et al. (2009); however, they have been generated from an assessment of their performance relative to a cut off of HRSD-17≥14, in accordance with the cut offs outlined in the American Psychiatric Association’s Handbook of Psychiatric Measures (32).

Our assessment of the PHQ-9 found the measure to have better psychometric properties in terms of severity when compared to another European study (17). We found the PHQ-9 to account for 59% of the variance in the HRSD-17 scores whereas they found it only accounted for 27%. Our results are likely to be a more accurate reflection of the scale’s performance as Wittkampf et al’s sample was small (n=47) and the time delay between assessments with the PHQ-9 and HRSD-17 was up to two weeks. A meta-analysis of depression identification instruments in primary care, chronic physical health, and older populations found the PHQ-9 to have a diagnostic odds ratio of 21.38 (95% C.I. 11.87, 38.52) based on 11 studies (3). In this meta-analysis, assessment was being made of the PHQ-9 as a diagnostic tool whereas the current investigation assessed its discriminatory performance in identifying moderate severity of depression symptoms.
The rationale for introducing the use of depression severity measures into the QOF was partly informed by a study in which it was observed that GPs were inaccurate in their categorisation of severity of depression (4). However, the gold standard by which the GPs were assessed in this study was the HAD-D. Our current investigations have demonstrated the HAD-D to be inaccurate at categorising severity of depressive symptoms; therefore it is questionable whether the clinical judgement of GPs is any better or worse than such a measure at gauging severity.

Although our present study proposes new cut offs for possible use in assessing depression severity, even with reasonable sensitivity and specificity (over 70% for each) they are still not precise. Clinicians can improve on accuracy by using a different cut off depending on whether their priority is not to miss cases or not to misdiagnose cases. It is our view that clinicians should not over rely on these measures for establishing depression severity but that they should have the opportunity to develop or continue using skills of clinical assessment that will allow them to make these judgements themselves.

The use of these measures still has a place, however, in monitoring symptoms. Ideally scores should be assessed relative to percentile norms collected from the general population. This is in keeping with considering depression as a dimensional, rather than a categorical, entity (34). UK percentile norms for the HAD-D are available (12). If Scottish health services are committed to using the PHQ-9 in the assessment of depression, including the measure as part of the Scottish Household Survey would provide a suitable sample from which to generate percentile norms that would act as a useful reference point to allow the rarity of symptoms to be assessed.


7 Conclusion
The present psychometric assessment reports on the properties of the HAD-D, PHQ-9, BDI-II and QIDS-SR16 measures of depression severity. Their psychometric properties are generally strong in terms of internal consistency, factor structure and convergent validity; however, using their conventional scoring systems, none of the scales were found to accurately categorise the severity of depression. The present study provides empirically derived cut-off points which can be used with these measures to facilitate the assessment of severity of depressive symptoms. Further research is required to ascertain whether such measures are any better than GPs’ own judgement of severity of depressive symptoms in patients whose depression has been identified. It may be more appropriate to consider depression on a continuum and to interpret scores on severity measures relative to a normative sample. In this respect, collecting data to generate reference percentile norms for routinely used measures should be a priority.


8 Recommendations for NHS QIS
The Scottish Integrated Care Pathway for Depression Toolkit Standard 33 (1) advocates the use of an objective measure for people with depression. It is stated that “A validated measure of depression is used at initial assessment and repeated at regular intervals to monitor progress and outcome.” (p 74). This remains a valid standard for services in Scotland to aspire towards. The current study found that best sensitivity and specificity were reached for detecting moderate severity of depressive symptoms where HAD-D≥9, PHQ-9≥12, BDI-II≥23 or QIDS-SR16≥13. NHS QIS should incorporate this finding into their integrated care pathway guidance. However, it should still be emphasised that health care professionals should not rely on this interpretation of a score alone when assessing an individual who may have depression but should also consider other factors including degree of impairment, length of episode, history of depression, family history, other comorbid disorders and specific circumstances pertaining to individuals (3).

With regard to monitoring the impact of interventions, interpreting scores against percentiles derived from a Scottish general population sample would be a step forward. Such a method recognises depression as existing on a continuum rather than within discrete diagnostic categories. By considering scores alongside percentile norms, the rarity of an individual’s score can be observed. NHS QIS could facilitate this process by supporting research aimed at collating data and generating percentile norms from a representative Scottish sample.


Appendix I Assessing the suitability of the QIDS-SR16 as a proxy for the HRSD-17

Background
A principal aim of this research was to assess the validity of the PHQ-9, HAD-D and BDI-II measures against the HRSD-17. The HRSD-17 is intended for administration by individuals who have undergone adequate training in the assessment of mood. Assessing all participants by this method adds to the complexity of the study. It requires synchronised timing of participants and clinicians alongside availability of an interview room in the primary care settings. An alternative approach would be to use the QIDS-SR16. This is a self-complete measure of depression severity. It includes all DSM-IV criterion symptoms for major depressive disorder. It was devised in the US and has been assessed and developed to act as a proxy measure for the HRSD-17 (23). The developers of this scale have produced conversion tables which allow scores from the QIDS-SR16 to be converted to scores on the HRSD-17. If the QIDS-SR16 could replace the HRSD-17, the collection of data could be simplified as the QIDS-SR16 could be incorporated into the booklet with the other measures. This would dispense with the need for patients to attend an interview with a study clinician and would therefore allow a greater number of practices to participate, as the study would not be limited by the availability of research clinicians to cover participating practices. Furthermore, the developers of the QIDS-SR16 have highlighted the potential of the measure as a simple self-report screening instrument for use in primary care (23).

Whilst using the QIDS-SR16 as a proxy for the HRSD-17 is an attractive option (in terms of simplifying data collection and expanding practice participation), caution is required when considering the potential for its use in UK primary care. The QIDS-SR16 has not been validated in the UK; its fitness for purpose should therefore be assessed before considering its use as a proxy for the HRSD-17.

The aim of the present investigation was to assess the performance of the QIDS-SR16 as a proxy for the HRSD-17.

Methods/Design

Study Design
Psychometric assessment of convergence between the severity bands of the HRSD-17 and the QIDS-SR16.

Setting
Patients were recruited from four general practices in Grampian, Scotland. The HRSD-17 was administered to patient participants within their general practice. The QIDS-SR16 was completed by patient participants within their own space.

Patients

Inclusion criteria
Patients were eligible if they were aged 16 years and over and their GP had diagnosed them as depressed. The sample included patients at the outset and also during the course of an earlier identified depressive episode, including those receiving treatment.

Exclusion criteria
The vast majority of depressed patients were eligible to consider participating; however, patients were excluded if they did not have the necessary spoken or written language skills.

Measures

QIDS-SR16
The QIDS-SR16 is a brief self-complete measure designed to measure the severity of depression. It assesses the nine diagnostic criteria for major depressive disorder covered in DSM-IV and consists of 16 items, each scored from 0 to 3. It is scored by summing the highest response in each of a set of questions relating to sleep, weight and psychomotor symptoms and then adding the remaining items. Scores can range from 0 to 27. Conversion tables have been generated which allow HRSD-17 scores to be predicted from this measure; however, this has not previously been assessed in a UK primary care context. The QIDS-SR16 is freely available from http://www.ids-qids.org/.
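The scoring rule described above can be sketched in a few lines of Python. This is an illustrative sketch, not the instrument developers' code: the function name is hypothetical, and the item numbers assigned to each set (items 1-4 for sleep, 6-9 for appetite/weight, 15-16 for psychomotor symptoms) are an assumption based on the published questionnaire layout, since this report names only the three sets.

```python
# Illustrative QIDS-SR16 scoring sketch. Item groupings are an assumption
# (the report names the sets but not the item numbers).
SLEEP_ITEMS = (1, 2, 3, 4)
APPETITE_WEIGHT_ITEMS = (6, 7, 8, 9)
PSYCHOMOTOR_ITEMS = (15, 16)
SINGLE_ITEMS = (5, 10, 11, 12, 13, 14)
GROUPED_SETS = (SLEEP_ITEMS, APPETITE_WEIGHT_ITEMS, PSYCHOMOTOR_ITEMS)


def qids_sr16_total(responses):
    """Return the total score (0-27) from 16 item responses, each 0 to 3."""
    if len(responses) != 16:
        raise ValueError("expected 16 item responses")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("each response must be 0, 1, 2 or 3")
    by_item = dict(enumerate(responses, start=1))
    # Only the highest response within each grouped set counts.
    grouped = sum(max(by_item[i] for i in group) for group in GROUPED_SETS)
    # The remaining six items are added as they stand.
    return grouped + sum(by_item[i] for i in SINGLE_ITEMS)


example = [1, 3, 0, 2, 2, 0, 1, 0, 0, 1, 0, 0, 2, 1, 0, 2]
print(qids_sr16_total(example))  # 12
```

Three grouped sets plus six single items give nine scored domains, each with a maximum of 3, which is consistent with the stated range of 0 to 27.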

HRSD-17
The 17-item HRSD structured interview is intended for use as the ‘gold standard’ of depression severity measurement. The HRSD-17 was chosen for this purpose as it has been widely used in intervention studies which have taken depression severity into account (3). Overwhelmingly, what is known about treatment efficacy has been based on trials which have assessed depression severity using this method. Additionally, the appropriateness of using the HRSD-17 in a primary care context has long been established (25). It consists of seventeen items, nine of which are rated on a five-point scale (0-4) and eight on a three-point scale (0-2). Hamilton developed the scale to be administered by an experienced psychiatrist (13).

Demographic data
Demographic questions were included with the questionnaire in order to describe the sample.

Procedure
All recruits were invited to complete the questionnaire and participate in a face-to-face clinical interview conducted by one of three specialist registrars in psychiatry. The interviewers were blind to the questionnaire responses. Data were analysed once 50 participants had completed the QIDS-SR16 and been assessed in a face-to-face interview. The questionnaire and interview were required to be completed no more than one day apart.

Statistical analyses
A concurrent analysis was made of the HRSD-17 and the QIDS-SR16. The established severity cut-off scores for each scale were assessed for convergence using Cohen’s Kappa.


Analyses were conducted using SPSS (Version 17).

Results

Practice characteristics are shown in Table 12. They comprised a mix of urban settings and socio-economic catchments.

Table 12 Characteristics of participating practices

Practice   SIMD quintile 2007¹   Modal urban/rural category²
A          1st                   Other urban area: settlements of 10,000 to 125,000 people
B          4th                   Large urban area: settlements of over 125,000 people
C          1st                   Accessible small town: settlements of 3,000 to 10,000 people and within 30 minutes drive of settlements of 10,000 or more
D          4th                   Large urban area: settlements of over 125,000 people

¹ Scottish Index of Multiple Deprivation quintile for Scotland, where 1st represents least deprived and 5th most deprived (mapped to 2006 reference files).
² Urban/rural practice status according to Scottish Government 2007/2008 urban rural classification.

Twenty-four (48%) of the HRSD-17 assessments were conducted by one interviewer, 19 (38%) by a second and 7 (14%) by a third. Thirty-three (66%) of the interviews were conducted on the same day that the QIDS-SR16 was completed, six (12%) were conducted the day before the questionnaire was completed and 11 (22%) were conducted the day after the questionnaire was completed.

The demographic characteristics of participants are presented in Table 13. Whilst the sample represents a mix in terms of most characteristics, it is not ethnically diverse; however, neither is the population of the North East of Scotland.


Table 13 Demographic characteristics of participants

Characteristic                          N=50
Age, mean (s.d.)*                       48.1 (12.8)
Female (%)                              29 (58)
Ethnic origin white                     50 (100)
First language English                  50 (100)
Main activity (%)
  In employment/self employed           28 (57)
  Housework                             10 (21)
  Seeking work                          1 (2)
  Retired                               6 (12)
  Other                                 4 (8)
Educated beyond minimum school age      33 (66)
Educated to degree level                23 (46)

* Data were normally distributed: one-sample Kolmogorov-Smirnov test: D=0.049; exact p=0.999 (two-tailed).

The mean QIDS-SR16 score was 11.14 (s.d.=5.39) and the mean HRSD-17 score was 13.3 (s.d.=7.13). The one-sample Kolmogorov-Smirnov test of goodness of fit provided no evidence against the null hypothesis that the sample had been drawn from a normal population in terms of QIDS-SR16 score (D=0.084, exact p=0.846) and HRSD-17 score (D=0.112, exact p=0.526).

Table 14 presents the frequencies of categories of responses of the QIDS-SR16 and the HRSD-17 according to the categories developed by Rush et al. for the conversion tables.

Table 14 Categorical convergence of QIDS-SR16 and HRSD-17

                          HRSD-17 categories (N)
QIDS-SR16 categories (N)  None (0-7)   Mild (8-13)   Moderate (14-19)   Severe (≥20)
None (0-5)                9            0             0                  0
Mild (6-10)               4            6             6                  0
Moderate (11-15)          2            4             4                  3
Severe (≥16)              0            0             5                  7

Category agreement would be expected in the cells where both measures assign the same category (the diagonal of Table 14). For 26 (52%) participants, the HRSD-17 and QIDS-SR16 categories were in agreement; however, for 15 (30%) participants, depressive symptoms were categorised in a more severe grouping by the QIDS-SR16 than by the HRSD-17. For 9 (18%) participants, depressive symptoms were categorised in a less severe grouping by the QIDS-SR16 than by the HRSD-17. As category agreement occurred in just half of the assessments, agreement between the two measures was not adequate. This was confirmed by the value of kappa (0.365). Whilst this value was significant (p<0.001), it was still too low to indicate an appropriate level of agreement between methods purporting to measure the same construct.

Discussion

Main Finding
This investigation of agreement between the categories of severity of depression of the QIDS-SR16 and the HRSD-17 found that the two measures were not adequately matched. Category agreement was demonstrated for only half of the assessments. Where there was disagreement, the tendency was for the QIDS-SR16 to rate depression in a more severe category than the HRSD-17.
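The agreement figures reported for Table 14 can be re-derived from the cell counts. The sketch below is illustrative only (the analysis itself was run in SPSS) and the function names are hypothetical. It assumes the counts are read with QIDS-SR16 bands as rows and HRSD-17 bands as columns, a reading chosen because it reproduces the reported split of 26 agreements, 15 cases rated more severe by the QIDS-SR16 and 9 rated less severe.

```python
# Table 14 cell counts. Rows: QIDS-SR16 band; columns: HRSD-17 band.
# Band order in both directions: None, Mild, Moderate, Severe.
# The row/column orientation is an assumption (see the lead-in).
TABLE_14 = [
    [9, 0, 0, 0],
    [4, 6, 6, 0],
    [2, 4, 4, 3],
    [0, 0, 5, 7],
]


def agreement_summary(table):
    """Return (agreements, QIDS more severe, QIDS less severe)."""
    k = len(table)
    agree = sum(table[i][i] for i in range(k))
    qids_higher = sum(table[i][j] for i in range(k) for j in range(k) if i > j)
    qids_lower = sum(table[i][j] for i in range(k) for j in range(k) if i < j)
    return agree, qids_higher, qids_lower


def cohens_kappa(table):
    """Unweighted Cohen's kappa for a square contingency table."""
    k = len(table)
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1 - expected)


print(agreement_summary(TABLE_14))       # (26, 15, 9)
print(round(cohens_kappa(TABLE_14), 3))  # 0.365
```

Observed agreement is 26/50 = 0.52 and chance agreement from the marginals is 610/2500 = 0.244, so kappa = (0.52 - 0.244) / (1 - 0.244), which matches the reported value of 0.365.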

Interpretation
In this small study, where we assessed the feasibility of using the QIDS-SR16 as a proxy for the HRSD-17 in 50 primary care patients with a diagnosis of depression, our findings did not concur with those of Rush et al. (23). There are three factors which may explain this. Firstly, the sample from which the conversion tables were derived consisted of adult out-patients who had fulfilled the criteria for one of the following: a chronic major depressive illness of at least two years duration; a current major depressive disorder in addition to a pre-existing dysthymic disorder; or a recurrent major depressive disorder with a history of incomplete remission between episodes. Furthermore, they had all scored at least 20 on the HRSD-24 at the outset of the study (approximating to an HRSD-17 score of 16). As such, the Rush sample had more severe symptoms than our primary care sample, where the mean HRSD-17 score was 13.3. Symptom severity may affect categorical convergence, so that agreement between measures is greater with more severe symptoms. Rush et al. assert that the generalisability of their findings should be restricted to out-patients with chronic major depressive disorder with symptoms of a moderate to severe level. Secondly, the Rush sample was recruited in the US, and it may not be possible to generalise to a UK context. Thirdly, our sample of 50 was small and it may be that greater convergence would be demonstrated in a larger sample. It was decided that QIDS-SR16 data would continue to be collected to enable further assessment of agreement in a larger sample. The HRSD-17 would continue to be administered as the standard against which to assess the self-complete measures.

Conclusion
The QIDS-SR16 was found not to be fit for purpose as a proxy for the HRSD-17 in a UK primary care sample of patients diagnosed with depression by their GP. As a result, the HRSD-17 continued to be administered throughout the study to assess the psychometric properties of the self-complete depression severity measures.


Appendix II Assessing the inter-rater reliability of the clinician administered HRSD-17 (GRID version)

Background

Inter-rater reliability assesses the extent to which different raters generate the same results. In light of the analysis in section 3, where we discarded the idea of using the QIDS-SR16 as a proxy for the HRSD-17, it became clear that several psychiatrists would be required if all recruits to the study assessing the QOF severity measures were to have an interview assessment based on the HRSD-17. It was necessary to ensure that all the research psychiatrists were rating consistently. Previous investigations of the inter-rater reliability of the HRSD have yielded mixed results. The intraclass correlation (ICC) for the scale has ranged from 0.46 (34) to 0.99 (35). The GRID-HAMD was developed in order to standardise the administration of the HRSD without altering the original intent of the measure and thus improve its inter-rater reliability. We sought to maximise the inter-rater reliability of the HRSD-17 by using the GRID-HAMD. Nonetheless, it was necessary to test its performance amongst the six psychiatrists who conducted the assessments over the course of the study.

Methods/Design

Study Design
This study was an assessment of the inter-rater reliability of the HRSD-17 GRID-HAMD.

Setting
Patients were recruited from eight general practices across Grampian, Scotland. The HRSD-17 was administered to patient participants within their general practice.

Patients

Inclusion criteria
Patients were eligible for inclusion if they were aged 16 years and over and had been diagnosed as depressed by their GP. The sample included patients at the outset and also during the course of an earlier identified depressive episode, including those receiving treatment.

Exclusion criteria
The vast majority of depressed patients were eligible to consider participating; however, patients were excluded if they did not have the necessary spoken or written language skills.

HRSD-17 Interviewers
Six psychiatrists participated: one consultant, three specialist registrars and two staff grades. Between them, they administered the HRSD-17 to 282 primary care participants in the study who had been diagnosed with depression by their GP. Table 15 shows the number of patients assessed by each psychiatrist.

Table 15 HRSD-17 assessments carried out by psychiatrists


Psychiatrist   N (%)
1              58 (20.6)
2              97 (34.4)
3              74 (26.2)
4              8 (2.8)
5              35 (12.4)
6              10 (3.5)

Procedure
All recruits were invited to participate in a clinical interview. A subsample of patients was asked if they would agree to the assessment being audio recorded for the purpose of assessing inter-rater reliability. We did not invite all participants to have the assessment audio-recorded as we were concerned it might reduce the participation rate. Additionally, only a subsample of recordings was required, as the intention was to assess inter-rater reliability using five audio recordings from each psychiatrist. Participants were sampled across the recruitment period between 23 May 2008 and 7 May 2009. Each psychiatrist made only one recording per day so that their five recordings represented a range of occasions. (These recorded assessments will be referred to as the primary ratings.) Following this, each psychiatrist was given five recordings, one from each of the other raters, to listen to and make their own ratings. (These assessments will be referred to as the secondary ratings.) Two of the items in the HRSD-17 (concerning retardation and agitation) required visual observation. The psychiatrists were instructed not to attempt to rate these items from the audio recordings.

Statistical analyses
Distributions of scores of individual items from the primary raters were examined and the percentage agreement with the secondary raters calculated. The unweighted Kappa statistic was calculated for each item. The 15 items that were rated by both the primary and secondary rater were summed. The 15 items were summed, rather than values being imputed for the items on agitation and retardation and the full 17 items scored, because missingness was intentional and complete for these two variables. To assess whether there were any consistent differences between primary and secondary raters, a paired t-test was calculated. The intraclass correlation (ICC) was calculated with accompanying confidence intervals in order to express the between-pair variance as a proportion of the total variance. Analyses were conducted using SPSS (Version 17).

Results
Twenty-eight of 33 (85%) participants agreed to audio-recording of the HRSD-17. Five psychiatrists made five recordings each and one psychiatrist made three recordings. All six psychiatrists made secondary ratings of five colleagues’ primary recordings. As one psychiatrist made only three recordings, two of these were issued for secondary rating to more than one psychiatrist.


The mean age of patient participants was 49.7 years (s.d.=20). Twenty-five (83%) were female. The mean HRSD-17 score based on the primary assessment was 9.5 (s.d.=6.9, range 0-22). Table 16 shows the distribution of scores in the primary rating and the inter-rater agreement and kappa coefficients for all items (except agitation and retardation). Agreement can be considered as moderate if kappa>0.4, substantial if kappa>0.6 and outstanding if kappa>0.8.

Table 16 Inter-rater agreement for individual items on HRSD-17

                      Primary rater response (%) n=30
Item                  0     1     2     3     4     % Agreement   Kappa
Depressed mood        8     11    7     3     1     66            0.53
Guilt                 19    5     3     3     0     87            0.76
Suicidal tendencies   25    5     0     0     0     97            0.78
Insomnia (early)      17    4     9     -     -     90            0.84
Insomnia (middle)     14    7     9     -     -     90            0.84
Insomnia (late)       21    6     3     -     -     93            0.84
Work/activities       16    7     3     4     0     73            0.57
Anxiety (psycho)      14    8     3     5     0     66            0.53
Anxiety (somatic)     12    10    7     1     0     83            0.75
Somatic (GI)          22    7     1     -     -     80            0.59
Somatic (general)     13    11    6     -     -     80            0.69
Sexual/genital        16    8     6     -     -     87            0.78
Hypochondriasis       18    7     4     1     0     80            0.65
Weight                25    3     2     -     -     87            0.59
Insight               100   0     0     -     -     100           -
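As a reading aid, the kappa thresholds stated above can be applied to the item-level coefficients in Table 16. The sketch below is illustrative: the function name is hypothetical, and the label used for values at or below 0.4 is an assumption, since the report does not name that band. Insight is omitted because no kappa is reported for it.

```python
from collections import Counter

# Item-level kappa coefficients as reported in Table 16.
ITEM_KAPPAS = {
    "Depressed mood": 0.53, "Guilt": 0.76, "Suicidal tendencies": 0.78,
    "Insomnia (early)": 0.84, "Insomnia (middle)": 0.84,
    "Insomnia (late)": 0.84, "Work/activities": 0.57,
    "Anxiety (psycho)": 0.53, "Anxiety (somatic)": 0.75,
    "Somatic (GI)": 0.59, "Somatic (general)": 0.69,
    "Sexual/genital": 0.78, "Hypochondriasis": 0.65, "Weight": 0.59,
}


def describe_kappa(kappa):
    """Map a kappa value to the agreement bands used in this appendix."""
    if kappa > 0.8:
        return "outstanding"
    if kappa > 0.6:
        return "substantial"
    if kappa > 0.4:
        return "moderate"
    return "below moderate"  # label assumed; not defined in the report


band_counts = Counter(describe_kappa(k) for k in ITEM_KAPPAS.values())
for band, count in band_counts.most_common():
    print(f"{band}: {count} items")
```

On these thresholds every reported item reaches at least moderate agreement, with the three insomnia items in the highest band.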

The difference (primary value – secondary value) of the 15-item summed scale was approximately Normally distributed: Kolmogorov-Smirnov test of goodness of fit D=0.133, exact p=0.73. The mean difference between the pairs of scores was 0.033 (s.d.=2.11). The test statistic for the paired t-test=0.09 (degrees of freedom=29), p=0.93 indicating that there was no evidence of any systematic difference between the primary and secondary ratings. The intraclass correlation coefficient for the summed scale was 0.95 (95% CI 0.90, 0.98) demonstrating excellent agreement.
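The reported paired t statistic can be checked from the summary figures alone, since it is the mean difference divided by its standard error. The sketch below is a hand check under that standard formula, not the SPSS procedure used in the study; the function name is hypothetical, and n=30 is taken from the 29 degrees of freedom reported.

```python
import math


def paired_t_from_summary(mean_diff, sd_diff, n_pairs):
    """Return (t statistic, degrees of freedom) for a paired t-test."""
    standard_error = sd_diff / math.sqrt(n_pairs)
    return mean_diff / standard_error, n_pairs - 1


t_value, df = paired_t_from_summary(mean_diff=0.033, sd_diff=2.11, n_pairs=30)
print(f"t = {t_value:.2f}, df = {df}")  # t = 0.09, df = 29
```

A t value this close to zero is consistent with the reported absence of a systematic difference between primary and secondary ratings.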


Discussion
The GRID-HAMD 17 version of the HRSD-17 demonstrated very good inter-rater reliability, allowing confidence in its use in the larger study assessing the psychometric properties of the self-completed depression severity measures. Our findings are comparable with those of Tabuse et al., 2007 in their inter-rater analysis of the Japanese version of the GRID-HAMD 17 (36). In that study, intraclass coefficients ranged from 0.95 to 0.99 amongst inexperienced and experienced raters consisting of psychiatrists, psychologists and medical students.

Conclusion
The GRID-HAMD 17 has demonstrated good inter-rater agreement between the psychiatrists conducting assessments in this study. This adds to the external validity of the larger study where the psychometric properties of the self-completed depression severity measures are being assessed.


9 References

(1) NHS Quality Improvement Scotland. Standards for integrated care pathways for mental health December 2007. 2007. Available from http://www.nhshealthquality.org/nhsqis/files/mentalhealth_standardsforICP_DEC07.pdf
(2) NHS Employers and the General Practitioners' Committee. Quality and Outcome Frameworks: guidance for GMS contract 2009/10. 2009:1-162.
(3) NHS National Institute for Clinical Excellence. Depression: the treatment and management in adults (update). 2009;CG90:1-585.
(4) Kendrick T, King F, Albertella L, Smith P. GP treatment decisions for patients with depression. Br J Gen Pract 2005;55:280-286.
(5) Kroenke K, Spitzer RL, Williams JB. The PHQ-9: validity of a brief depression severity measure. J Gen Intern Med 2001;16:606-613.
(6) Zigmond AS, Snaith P. The Hospital Anxiety and Depression Scale (HAD). Acta Psychiatrica Scandinavica 1983;67:361-370.
(7) Beck AT, Steer RA, Ball R, Ranieri WF. Comparison of Beck Depression Inventories-IA and -II in psychiatric outpatients. Journal of Personality Assessment 1996;67(3):588-597.
(8) Beck AT, Steer RA, Brown GK. Manual for Beck Depression Inventory-II. San Antonio, TX: Psychological Corporation; 1996.
(9) Snaith P, Zigmond AS. Hospital Anxiety and Depression Scale. Windsor: NFER Nelson; 1984.
(10) Herrmann C. International experiences with the Hospital Anxiety and Depression Scale - A review of validation data and clinical results. J Psychosom Res 1997;42:17-41.
(11) Bjelland I, Dahl AA, Haug TT, Neckelmann D. The validity of the Hospital Anxiety and Depression Scale: An updated literature review. J Psychosom Res 2002;52:69-77.
(12) Crawford JR, Henry JD, Crombie C, Taylor EP. Normative data for the HADS from a large non-clinical sample. Br J Clin Psychol 2001;40:429-434.
(13) Hamilton M. A rating scale for depression. J Neurol Neurosurg Psychiatry 1960;23:56-62.
(14) Henkel V, Mergl R, Kohnen R, Maier W, Moller H-, Hegerl U. Identifying depression in primary care: A comparison of different methods in a prospective cohort study. Br Med J 2003 25 Jan;326(7382):200-201.
(15) Lowe B, Spitzer RL, Grafe K, Kroenke K, Quenter A, Zipfel S, et al. Comparative validity of three screening questionnaires for DSM-IV depressive disorders and physicians' diagnoses. J Affect Disord 2004;78:131-140.
(16) Spitzer RL, Williams JB, Gibbon M, First MB. The structured clinical interview for DSM-III-R (SCID) I: History, rationale, and description. Arch Gen Psychiatry 1992;49(8):624-629.
(17) Wittkampf K, van Ravesteijn H, Baas K, van de Hoogen H, Schene A, Bindels P, et al. The accuracy of Patient Health Questionnaire-9 in detecting depression and measuring depression severity in high-risk groups in primary care. Gen Hosp Psychiatry 2009 Sep-Oct;31(5):451-459.


(18) Gilbody S, Richards D, Barkham M. Diagnosing depression in primary care using self-completed instruments: UK validation of PHQ-9 and CORE-OM. Br J Gen Pract 2007;57:650-652.
(19) Cameron IM, Crawford JR, Lawton K, Reid IC. Psychometric comparison of the PHQ-9 and HADS for measuring depression severity in primary care. Br J Gen Pract 2008;58:32-36.
(20) Hansson M, Chotai J, Nordstom A, Bodlund O. A comparison of two self-rating scales to detect depression: HADS and PHQ-9. Br J Gen Pract 2009;59(566):650-654.
(21) Reddy P, Dunbar J, Ford D, Philpot B. Identification of depression in diabetes: the utility of the PHQ-9 and HADS-D. Br J Gen Pract 2010. DOI: 10.3399/bjgp10X502128.
(22) Kendrick T, Dowrick C, McBride A, Howe A, Clarke P, Maisey S, et al. Management of depression in UK general practice in relation to scores on depression severity questionnaires: analysis of medical record data. BMJ 2009;338:b750.
(23) Rush AJ, Trivedi MH, Ibrahim HM, Carmody TJ, Arnow B, Klein DN, et al. The 16-Item Quick Inventory of Depressive Symptomatology (QIDS), clinician rating (QIDS-C), and self-report (QIDS-SR): a psychometric evaluation in patients with chronic major depression. Biol Psychiatry 2003 Sep 1;54(5):573-583.
(24) American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders. Fourth ed. Washington DC: American Psychiatric Association; 1994.
(25) Paykel ES. Use of the Hamilton Depression Scale in General Practice. The Hamilton Scales 1990;9:40-47.
(26) Potts MK, Daniels M, Burnam MA, Wells KB. A structured interview version of the Hamilton Depression Rating Scale: evidence of reliability and versatility of administration. J Psychiat Res 1990;24(4):335-350.
(27) Wrigley C, Neuhaus JO. The matching of two sets of factors. Am Psychol 1955;10:418-419.
(28) Howell DC. Statistical methods for psychology. 5th ed. Belmont, CA: Duxbury Press; 2002.
(29) Murphy JM, Berwick DM, Weinstein MC, Borus JF, Budman SH, Klerman GL. Performance of screening and diagnostic tests. Application of receiver operating characteristic analysis. Arch Gen Psychiatry 1987;44:550-555.
(30) Harper R, Reeves B. Reporting of precision of estimates for diagnostic accuracy: a review. BMJ 1999;318:1322-1323.
(31) Deyo RA, Diehr P, Patrick DL. Reproducibility and responsiveness of health status measures: Statistics and strategies for evaluation. Controlled Clin Trials 1991;12:142S-158S.
(32) Rush AJ, First MB, Blacker D. Handbook of Psychiatric Measures. Second ed. USA: American Psychiatric Association; 2008.
(33) Thompson C, Ostler K, Peveler RC, Baker N, Kinmonth AL. Dimensional perspective on the recognition of depressive symptoms in primary care: The Hampshire Depression Project 3. Br J Psychiatry 2001;179:317-323.
(34) Cicchetti DV, Prusoff BA. Reliability of depression and associated clinical symptoms. Arch Gen Psychiatry 1983 Sep;40(9):987-990.


(35) Gottlieb GL, Gur RE, Gur RC. Reliability of psychiatric scales in patients with dementia of the Alzheimer type. Am J Psychiatry 1988 Jul;145(7):857-860.
(36) Tabuse H, Kalali A, Azuma H, Ozaki N, Iwata N, Naitoh H, et al. The new GRID Hamilton Rating Scale for Depression demonstrates excellent inter-rater reliability for inexperienced and experienced raters before and after training. Psychiatry Res 2007 Sep 30;153(1):61-67.