Clare Davenport Senior Clinical Lecturer Public Health, Epidemiology and Biostatistics University of Birmingham Chris Hyde Professor of Public Health and

TEST ACCURACY EVIDENCE & DIAGNOSTIC DECISION MAKING

Clare DavenportSenior Clinical Lecturer

Public Health, Epidemiology and BiostatisticsUniversity of Birmingham

Chris HydeProfessor of Public Health and Clinical EpidemiologyUniversity of Exeter

“I consider much less thinking has gone into the theory underlying diagnosis, or possibly one should say less energy has gone into constructing the correct model of diagnostic procedures, than into therapy or prevention where the concept of ‘altering the natural history of the disease’ has been generally accepted and a theory has been evolved for testing hypotheses concerning this.”

Archie Cochrane. Effectiveness and efficiency. Random reflections on health services. The Nuffield Provincial Hospitals Trusts, 1972.

Medical Test

Information

Decision

ActionPatient Outcome

Test harms/ placebo effects

Diagnostic

accuracy

Diagnostic yield

Management

Diagnostic and Treatment PathwaysAccuracy provides information

about the hypothetical value of a test in decision making

Phases of Test EvaluationTechnical PerformanceDoes CT produce good

quality images, reliably and reproducibly?

DIAGNOSTIC PERFORMANCE

Do CT images accurately differentiate diseased from non-diseased patients?

Diagnostic ImpactDoes CT change how diagnoses are made by doctors?

Therapeutic ImpactDoes CT change how treatment decisions are made?

Patient Health ImpactDoes CT ultimately reduce mortality or morbidity?

Fryback and Thornbury Med Dec Mak 1991;11:88-94

Why Emphasis on Test Accuracy?Lax regulatory system for tests: no

requirement for evidence about patient impact

Health Technology assessment traditionally concerned with treatments

-what if we are treating the wrong patients?-what if test use itself causes harm?Medical education: concept of EBM relatively

new and test evaluation even more so

Difficulties conducting test-treat trials: sample size; blinding; measurement and reporting of diagnostic and treatment decisions

Outcomes Framework: 14 Mechanisms

Test resultsproduced

Diagnostic decision made

Treatment decision made

Patient Outcome

Treatment implemented

Patient given test

6. Accuracy

8. Dx confidence

9. Dx Yield

10. Rx Yield

3. Feasibility

5. Interpretability

1. Test

Process

11. Rx

Confidence

2. Timing

Test

13.Timing

Treatment

4. Timing

Results

7. Timing Diagnosis

12. Adherence

14. Patient / Clinician

Perspective

Ferrante di Ruffano et al. BMJ 2012;344:e686

Diagnostic confidenceDo I know where to find trustworthy

information on the accuracy of this test?

Do I understand the test accuracy information available?

What are the implications of the accuracy of this test for my patients

Test Accuracy Measures

Disease (Reference test result)

Present (D+)

Absent (D-)

IndexTest

+ True Positives

False Positives TP+FP

- False Negatives

True Negatives FN+TN

TP+FN FP+TN TP+FP+FN+TN

SensitivityTP / (TP+FN)

SpecificityTN / (TN+FP)

Positive Predictive Value

TP / (TP+FP)

Negative Predictive Value

TN / (TN+FN)

Disease as Reference

Class

Test Result as Reference

ClassDOR = True Positives X True Negatives False Negatives X False Positives

Accessibility of test accuracy information for decision making

AUC

Summary ROC curves

SPECIF

ICITY

LR-

ROC Space

Summary specificity

ERROR RATE

NPV

Summary sensitivity

LR+

DOR

ROC plotRDOR

2x2 tableRelat

ive se

nsiti

vity

Relative specificity

In most testing situations either false negative or false positive test errors are more important

CA125 is a marker used to identify individuals who have an increased risk of ovarian cancer.

NICE endorse the use of the test in primary care to select who should undergo pelvic ultrasound testing with and those who can safely not be investigated further.

Sensitivity OR Specificity?

Ca125 testing for ovarian cancer

Ca125 testing for ovarian cancer

False Positives receive further investigation with US

False Negatives do not receive further investigation

Sensitivity : FNSpecificity : FP

Do not have ovarian cancer but

test positive

Have ovarian cancer but test

negative

Research rationale (1)Contextual variables cause variation in test accuracy estimates:-Characteristics of the population to be tested -Variation in the conduct of the test itself

The proposed role of a test determines the point in a care pathway that it should be evaluated and any comparator tests that should be considered.

Context determines the relative value placed on test errors (false positives and false negatives)

Systematic reviews of test accuracy offer the potential to mitigate the lack of contextual fit observed in primary studies of test accuracy… are they doing their job?

Research rationale (2)It is stated to be common knowledge that

decision makers have difficulty understanding and applying test accuracy

information

BUT

What is the EXTENT of the problem ?

WHY do decision makers have particular difficulty with test accuracy measures – what is the nature of the difficulty?

To what extent are contextual considerations

represented in Test Accuracy Reviews ?

Research methodsAims: -Assess the extent to which testing context is reflected at each stage of the review process:-when formulating review questions -synthesis, including investigation of heterogeneity-discussing results and making recommendations

Searches:-3 review databases chosen on the basis of an epidemiological mapping exercise (Bayliss & Davenport 2008 IJTAHC; 24(4):403-411) ARIF, (University of Birmingham) DARE (CRD York) Cochrane Database Systematic Reviews-Cochrane Diagnostic Test Accuracy Working Group & the UK National Research Register: unpublished reviews

Results: study flow

HITS:N=1215

Cochrane Database

of Systematic

Reviews

DAREARIF

Excluded at abstract stage:N=641

Excluded at full paper stage:N=34

DuplicatesN=303

Eligible test accuracy reviewsN=237

Random 100 reviews data extracted

Results: review characteristicsDate of publication ranged from 1990 to 2006; 23% of reviews before 2000 and 73% on or after 2000. A total of 16 disease topic areas were represented Between 1 and 50 index tests (median 3) were evaluated by a single review The majority of reviews (43/100) were conducted in the USA, 23 in the UK, 12 in the Netherlands and 8 in the rest of Europe, 6 in Australia, 4 in Canada, 2 in Peru and one each in Columbia and China. 94/100 reviews included a clinician as an authorUsing a modified checklist of 9 items taken from the QUORUM and AMSTAR checklists, study quality ranged from 0-9 (median 4.6; inter-quartile range 3 to 6)

Results: question formulationOnly 24/100 reviews clearly specified all of test application, test role and prior tests as part of question formulation; 26% of the 73 reviews published on or after 2000 and 22% of the 23 reviews published before 2000

0102030405060708090

100

Detail of Question Formulation in Included Reviews

% R

ev

iew

s

Only 9/100 reviews reported all of setting, (pesentation (symptomatic or asymptomatic) and prior tests.

Preva

lenc

e

Settin

g

Chron

icity

Sympt

omat

ic/as

ympt

om...

Age

Sever

ity

Co-m

orbi

dity

Prior t

ests

0102030405060708090

100

Reporting of study characteristics in test accuracy reviews

Limited by poor reporting in 1y studies

No

Unclear

Yes Nu

mb

er

of

rev

iew

s

Systematic review of evidence concerned with understanding and application of test accuracy metrics

Contextualisation of review findings

Applic

abilit

y (s

pect

rum

)

Applic

abilit

y (in

dex

test

)

Applic

abilit

y (p

reva

lenc

e)

Applic

abilit

y (h

ealth

care

set

ting)

Conse

quen

ces

of te

st re

sults

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%Contextualisation of review findings

Unclear / not reported

Stated

Pe

rce

nta

ge

of

Re

vie

ws

Characteristics of the literatureTheoretical perspective papers:- 1978 to 2010, (88% after 1990)-34 papers written by 30 unique authors-25/30 clinicians of which 16/25 affiliated to an academic institution

Empirical research:-1978 to 2010,(60% after 1995)-majority of health professional samples were self- selected, convenience samples from medical education courses

- only 3/26 health professional samples representative of primary care

Research Methods...Systematic review of evidence concerned with the understanding and application of test accuracy metrics.

Survey of use and understanding of test accuracy metrics by general practitioners

Results: study flowTotal Hits

N=16765

Included

papersN=67

Empirical research

N=33

Theoretical papers

N=34

Clinical

sampleN=26

OtherN=7

Excluded

N=14136

Duplicates

N=2508

Results: Theoretical papersUnderstanding and application of test accuracy information:-Accessibility of test accuracy metrics -Knowledge about accuracy of tests used in practice likely to be limited-Lack of appreciation of variation in pre-test probability across healthcare settings-Use of graphical aids and frequencies (rather than % or proportions) to facilitate probability revision

Factors affecting testing behaviour:-Testing viewed as a risk aversive behaviour-Testing context considered important modifier of attitudes to risk

Theoretical: understanding and applying test accuracy information

Sensitivity and specificity

“A clinician will not start from diseased or not diseased, but from a positive or negative test. Therefore sensitivity and specificity are intuitively not so evident” (Dujardin 1994)

Global test accuracy measures do not

distinguish between test accuracy in 2 dimensions“The problem that occurs in a

meta-analysis of diagnostic studies is the multi-directional performance of the diagnostic instrument regarding its ability to detect (specificity) or exclude (sensitivity) the characteristic of interest is not distinguished. Multi-dimensional outcomes cannot be summarised well by a single estimate.” (Stengel 2003)

Likelihood Ratios“Never in 20 years of teaching

clinical logic, have we found a clinician who used the word “positive likelihood ratio”.

(Van Den Ende 2005)

Test Accuracy Measures

Disease (Reference test result)

Present (D+)

Absent (D-)

IndexTest

+ True Positives

False Positives TP+FP

- False Negatives

True Negatives FN+TN

TP+FN FP+TN TP+FP+FN+TN

SensitivityTP / (TP+FN)

SpecificityTN / (TN+FP)

Positive Predictive Value

TP / (TP+FP)

Negative Predictive Value

TN / (TN+FN)

Disease as Reference

Class

Test Result as Reference

ClassDOR = True Positives X True Negatives False Negatives X False Positives

Theoretical: understanding and applying test accuracy information

Pre- test probability and test accuracy estimation “research has shown that

clinicians’ estimates of probability vary widely and are often inaccurate..by itself, clinical experience appears insufficient to guide accurate probability estimation” (Richardson 2003)

“We rarely know what the sensitivities, specificities or likelihood ratios are for tests. At best clinicians carry a general impression about their usefulness” (Gill 2005)

Contextual variation in test accuracy

Unfortunately, it is often not realised that there can be no generally valid estimates of a test’s sensitivity, specificity or likelihood ratio that apply to all patients of a particular population, nor should such values be sought” (Moons 2003)

Theoretical: factors affecting testing behaviour

Clinicians are uncomfortable with

uncertainty

Testing ‘risk’ is context dependent

“...some physicians order all the tests that may be even remotely applicable in a given clinical situation. Such a practice may comfort the patient and enhance the physician’s belief that all diagnostic avenues have been pursued, but more tests do not necessarily produce more certainty..... we continue to test excessively, partly because of our discomfort with uncertainty.” (Kassirer 1989)

“.. feelings of uncertainty regarding medical problems can differ depending on the situation, not only because one physician may be faced with more complicated diagnostic puzzles than the other, but also, and primarily because the consequences of a vague and uncertain diagnosis may vary in each situation.” (Zaat (1992)

Empirical research: understanding and applying test accuracy information

High levels of error observed for estimates of the accuracy of particular tests. Estimates were based on clinical experience of test use, rather than published evidence.

Between-person variation in estimation of disease prevalence (pre-test probability) for any one disease was considerable (25-100%)

Confusion with interpretation of test accuracy metrics (for example sensitivity and specificity confused with positive and negative predictive values).

Empirical research: understanding and applying test accuracy informationResearch based on premise that probability revision is necessary for diagnostic decision making... (32/33 empirical studies) investigated the ability of respondents to undertake probability revision

Average proportion of respondents able to undertake probability revision 46%, range 0% - 33% for practising clinicians, 33%-73% academic clinicians

Presentation of test accuracy as frequencies rather than proportions or percentages appears to facilitate probability revision

…. However only 3% of respondents reported using probability revision in clinical practice

Probability revision.....

A serum test screens pregnant women for babies with Down’s syndrome. The test is a very good one but not perfect. Roughly 1% of babies have Down’s

syndrome. If the baby has Down’s syndrome, there is a 90% chance that the result will be positive. If the

baby is unaffected, there is still a 1% chance that the result will be positive. A pregnant woman has been tested and the result is positive. What is the chance

that her baby actually has Down’s syndrome?

(0.9) x (0.01) (probability of a TP)(0.9 x 0.01)+(0.01 x 0.99) (probability of a TP or a

FP)

Bramwell 2006; BMJ 333(7562):284-286

Facilitation of probabilistic reasoning: frequencies

Probabilistic representation

Frequency representation

The prevalence of disease is 10% (0.1).

100 patien

ts

10 with disease

90 no disease

8 test +ve

2 test-ve

10 test +ve

80 test-ve

Probability of disease if test +ve

818

Probability of disease if test +ve

(0.8) x (0.1)(0.8 x 0.1)+(0.11 x 0.9)

The probability of testing positive if you have disease is 80% (0.8)

The probability of testing negative if you do not have disease is 89% (0.89)

The probability of testing positive even if you do not have disease is 11% (0.11)

Survey of use and understanding of test accuracy metrics by General Practitioners

“It would just confirm what we already know, doctors, on the whole,

struggle with these concepts”

Survey ObjectivesTo identify which sources of test accuracy

information are used by primary care clinicians and barriers to their use

To evaluate the utility of existing test accuracy metrics as measured by self-reported familiarity, perceived ability to define metrics and self-reported use of metrics in clinical practice

To investigate whether there is consistency in the application of different test accuracy metrics and graphics across a common scenario

Survey Methods: distribution Incentivised, electronic survey hosted

by a professional network of ~200,000 GMC registered doctors with access to approximately 27 000 of 41 000 general practitioners across the UK (doctors.net.org )

Sample size of 200 pre-specified

Survey Results: respondent characteristics224/215 participants accessing the survey

(95%) completed the survey in full Number of years since qualification in the

specialty ranged from 0-41 (median 14 years)

11% had work responsibilities that might result in greater knowledge about test accuracy (GP trainer; GP with an academic position; GP involved in policy)

13% of respondents had undertaken training that included test accuracy interpretation in the last 3 years.

Survey results: test accuracy information sources used by respondents

“Please estimate how often you use the following test accuracy information sources as part of your clinical work”

0

20

40

60

80

100

120

140

160

180

200

220

Nu

mb

er o

f res

po

nd

ents

Source of Test Accuracy Information

Sources of Test Accuracy Information Used by Survey Respondents

Cannot estimate how often use source

Never use source

Rarely use source

Often use source

Always use source

Survey results: familiarity with test accuracy metrics/graphics

Sensitivi

ty and S

pecifici

ty

PPV and N

PV

LR+ and L

R-

2x2 d

iagnostic t

able

ROC curv

eDOR

AUC

Pictogra

ph0

20

40

60

80

100

120

140

160

180

200

Heard of / Seen Test Accuracy Metric / Graphic

Test Accuracy Metric / Graphic

Nu

mb

er

of

res

po

nd

en

ts

Survey results: perceived ability to define metrics/graphics

Use of test accuracy metrics in practice

0

20

40

60

80

100

120

140

160

180

200

Use of Test Accuracy Metric / Graphic in Practice

Test Accuracy Metric / Graphic

Nu

mb

er

of

res

po

nd

en

ts

Survey Methods: Application of nine different test accuracy metrics to a common testing scenario

A new biological marker for ovarian cancer has been identified and is available as a blood test for use in primary care. A 57 year old asymptomatic woman presents to you concerned about her risk of ovarian cancer and you perform the blood test at her request.”

TEST ACCURACY INFORMATION PRESENTED IN ONE OF NINE DIFFERENT FORMATS

“If the test came back positive would you refer the woman for further investigation?

If the test came back negative would you be confident not to investigate further at this point in time?”


Sensitivity and SpecificitySensitivity and Specificity (frequencies)Predictive valuesPredictive Values (frequencies)Likelihood ratiosPre to post test probabilityDiagnostic Odds Ratio Annotated 2x2 Diagnostic contingency tableAnnotated pictogram


Sensitivity and Specificity: “The marker has a sensitivity of 76% and a specificity of 98%”

Sensitivity and Specificity (frequencies): “Of every 100 women with ovarian cancer, 76 would test positive (be detected by the test) but 24 would test negative (be missed). Of every 100 women without ovarian cancer, 98 would test negative (receive a correct diagnosis) but 2 would test positive (be falsely labelled as having cancer).”

Annotated 2x2 tableWomen with confirmed

ovarian cancer (based on surgery and long

term clinical follow up)

Women confirmed free of ovarian cancer

(based on surgery and long term clinical follow up)

New blood test for detecting ovarian

cancer:POSITIVE RESULT

31women with ovarian cancer

correctly test +ve with the new blood test

(TRUE +VES)

26 women without ovarian cancer

incorrectly test +ve with the new blood test

(FALSE +VES)

New blood test for detecting ovarian

cancer:NEGATIVE

RESULT

10women with ovarian cancer

incorrectly test –ve with the new blood test

(FALSE -VES)

1293women without ovarian cancer correctly test –ve with the new

blood test(TRUE -VES)

41 women, in total, with confirmed

ovarian cancer

1319 women, in total, confirmed

free of ovarian cancer

1360 women

tested in total

Pictograph

Survey RESULTS: Scenarios:“If the test came back POSITIVE would you refer the woman for further investigation?”

Yes

No

?

Yes

No ?

YesNo

?

Yes

No

?

Yes

No

?Yes

No

?

Yes

No

?

Yes

No

?

Yes

No

?

Sens & Spec (%)

Sens & Spec (normalised frequencies)

Predictive values

(%)

Predictive values

(normalised frequencies)

Likelihood Ratios

Pre-post test

probability

Annotated 2x2

Diagnostic Table

DOR explained

Annotated Pictograph

“If the test came back POSITIVE would you refer the woman for further investigation?”

Yes No Don’t know

Survey RESULTS: Scenarios:“If the test came back NEGATIVE would you be confident not to investigate further at this point in time?”

Yes

No

? Yes

No

?

Yes

No

?

YesNo

? Yes

No

?

Yes

No

?

Yes

No

?

Yes

No

?

Yes

No

?

“If the test came back NEGATIVE would you be confident not to investigate further at this point in time?”

Sens & Spec (%)

Sens & Spec


Predictive values (%)

Predictive values


Likelihood Ratios

Pre-post test

probability

Annotated 2x2

Diagnostic Table

DOR explained

Annotated Pictograph

Yes No Don’t know

YES – would not investigate further / would not referNO – would investigate further / would refer

Open responses to testing scenariosObligation to test further:-“Would probably investigate (on the basis of a negative test result) but aware all further tests may be negative”

-“I would refer -ve result here even - would be difficult to defend if subsequently turned out to have ovarian carcinoma.”

-“Patient choice as well - but if she wanted further referral I would do this”

Open responses to testing scenarios: Prominence of test errors: SENSITIVITY AND SPECIFICITY: RESPONDENT EMPHASIS ON FALSE NEGATIVES (more important test error in this testing context)-“will miss about 1/4 +ve cases ”-“24% ?false negative.....too high”-“there are a lot of falsely negative results ”

PREDICTIVE VALUES: RESPONDENT EMPHASIS ON FALSE POSITIVES (low prevalence testing context)-“A lot of healthy women would be investigated due to a positive result”-“Would have concerns over doing test in view of high false positive rate.”

Open responses to testing scenarios: Prominence of test errors

2X2 TABLE: FALSE NEGATIVES AND FALSE POSITIVES

-“Too many false positives-they nearly equal the true positives. It is much better at helping you predict who does not have ovarian cancer but still too many false negatives”

- “1303 women had negative tests. You cannot send all of these for further investigation; you would swamp the system. The 10 false -ves will just have to come in if they develop symptoms.”

Open responses to testing scenarios: metrics causing confusionPredictive Values :“Not familiar with terminology here, presume PPV and NPV correspond with sensitivity and specificity but I would need to check”

Diagnostic Odds Ratio:“DOR 190 - is this good or bad ?” “Would need guidelines to follow here because I have no experience of the DOR”

Likelihood Ratios:”I do not understand the LR terminology” “Not sure what LR - or + means” “I have no experience of using likelihood ratios so would have to research before deciding on next course of action”

Open responses to testing scenarios: effect of changing metric on referral decisions-“Is this the same data being presented with different indices? Scary how presenting the same data differently induces different behaviour!”

-“It’s slightly scary how the way this is presented can change the way you feel about the results.”

- “Interesting - when asked this question earlier I would have referred -ve result patient, but realising now I can confidently say she has only a 0.6% chance of having the disease I would explain this to her”

-“Too many false negatives for me to feel comfortable when presented in this way.”

Conclusions (1)The degree to which testing context is reflected in question formulation, conduct and reporting of systematic reviews appears limited

Inadequacies in contextualisation of review methods appear to reflect a deficiency in methodological approach rather than poor reporting of methods.

94 % of test accuracy reviews included a clinician as a co-author.

No relationship between review quality and contextualisation of review reporting was observed suggesting AMSTAR and QUORUM did not capture this dimension of review quality.

Conclusions (2)Ability to understand and apply test accuracy metrics and contextual factors are both likely to be contributing factors to the observed effect of test accuracy metric on diagnostic decision making.

Sensitivity and specificity are understood by a significant proportion of respondents, but it is unclear to what extent this is due to familiarity as opposed to their intuitive nature.

Predictive values and likelihood ratios do not appear to be well understood.

Conclusions (3)

Test errors (false negatives and false positives) appear prominent as part of the translational pathway from summary estimates of test accuracy to management decisions.

‘Raw data’ in the from of the 2x2 diagnostic table appears to reduce framing effects introduced by summary test accuracy metrics

ImplicationsFor practice: -Medical education-Dissemination of test evaluation research and guideline development

For research:-Further investigation in more diverse testing contexts-What do decision makers want / need

“It’s clinical medicine... not based on any form of probability.

That’s gambling with lives.”

Documents

Clare Davenport Senior Clinical Lecturer Public Health, Epidemiology and Biostatistics University of Birmingham Chris Hyde Professor of Public Health and