Upload
esther-clough
View
219
Download
0
Embed Size (px)
Citation preview
TEST ACCURACY EVIDENCE & DIAGNOSTIC DECISION MAKING
Clare DavenportSenior Clinical Lecturer
Public Health, Epidemiology and BiostatisticsUniversity of Birmingham
Chris HydeProfessor of Public Health and Clinical EpidemiologyUniversity of Exeter
“I consider much less thinking has gone into the theory underlying diagnosis, or possibly one should say less energy has gone into constructing the correct model of diagnostic procedures, than into therapy or prevention where the concept of ‘altering the natural history of the disease’ has been generally accepted and a theory has been evolved for testing hypotheses concerning this.”
Archie Cochrane. Effectiveness and efficiency. Random reflections on health services. The Nuffield Provincial Hospitals Trusts, 1972.
Medical Test
Information
Decision
ActionPatient Outcome
Test harms/ placebo effects
Diagnostic
accuracy
Diagnostic yield
Management
Diagnostic and Treatment PathwaysAccuracy provides information
about the hypothetical value of a test in decision making
Phases of Test EvaluationTechnical PerformanceDoes CT produce good
quality images, reliably and reproducibly?
DIAGNOSTIC PERFORMANCE
Do CT images accurately differentiate diseased from non-diseased patients?
Diagnostic ImpactDoes CT change how diagnoses are made by doctors?
Therapeutic ImpactDoes CT change how treatment decisions are made?
Patient Health ImpactDoes CT ultimately reduce mortality or morbidity?
Fryback and Thornbury Med Dec Mak 1991;11:88-94
Why Emphasis on Test Accuracy?Lax regulatory system for tests: no
requirement for evidence about patient impact
Health Technology assessment traditionally concerned with treatments
-what if we are treating the wrong patients?-what if test use itself causes harm?Medical education: concept of EBM relatively
new and test evaluation even more so
Difficulties conducting test-treat trials: sample size; blinding; measurement and reporting of diagnostic and treatment decisions
Outcomes Framework: 14 Mechanisms
Test resultsproduced
Diagnostic decision made
Treatment decision made
Patient Outcome
Treatment implemented
Patient given test
6. Accuracy
8. Dx confidence
9. Dx Yield
10. Rx Yield
3. Feasibility
5. Interpretability
1. Test
Process
11. Rx
Confidence
2. Timing
Test
13.Timing
Treatment
4. Timing
Results
7. Timing Diagnosis
12. Adherence
14. Patient / Clinician
Perspective
Ferrante di Ruffano et al. BMJ 2012;344:e686
Diagnostic confidenceDo I know where to find trustworthy
information on the accuracy of this test?
Do I understand the test accuracy information available?
What are the implications of the accuracy of this test for my patients
Test Accuracy Measures
Disease (Reference test result)
Present (D+)
Absent (D-)
IndexTest
+ True Positives
False Positives TP+FP
- False Negatives
True Negatives FN+TN
TP+FN FP+TN TP+FP+FN+TN
SensitivityTP / (TP+FN)
SpecificityTN / (TN+FP)
Positive Predictive Value
TP / (TP+FP)
Negative Predictive Value
TN / (TN+FN)
Disease as Reference
Class
Test Result as Reference
ClassDOR = True Positives X True Negatives False Negatives X False Positives
Accessibility of test accuracy information for decision making
AUC
Summary ROC curves
SPECIF
ICITY
LR-
ROC Space
Summary specificity
ERROR RATE
NPV
Summary sensitivity
LR+
DOR
ROC plotRDOR
2x2 tableRelat
ive se
nsiti
vity
Relative specificity
In most testing situations either false negative or false positive test errors are more important
CA125 is a marker used to identify individuals who have an increased risk of ovarian cancer.
NICE endorse the use of the test in primary care to select who should undergo pelvic ultrasound testing with and those who can safely not be investigated further.
Sensitivity OR Specificity?
Ca125 testing for ovarian cancer
Ca125 testing for ovarian cancer
False Positives receive further investigation with US
False Negatives do not receive further investigation
Sensitivity : FNSpecificity : FP
Do not have ovarian cancer but
test positive
Have ovarian cancer but test
negative
Research rationale (1)Contextual variables cause variation in test accuracy estimates:-Characteristics of the population to be tested -Variation in the conduct of the test itself
The proposed role of a test determines the point in a care pathway that it should be evaluated and any comparator tests that should be considered.
Context determines the relative value placed on test errors (false positives and false negatives)
Systematic reviews of test accuracy offer the potential to mitigate the lack of contextual fit observed in primary studies of test accuracy… are they doing their job?
Research rationale (2)It is stated to be common knowledge that
decision makers have difficulty understanding and applying test accuracy
information
BUT
What is the EXTENT of the problem ?
WHY do decision makers have particular difficulty with test accuracy measures – what is the nature of the difficulty?
To what extent are contextual considerations
represented in Test Accuracy Reviews ?
Research methodsAims: -Assess the extent to which testing context is reflected at each stage of the review process:-when formulating review questions -synthesis, including investigation of heterogeneity-discussing results and making recommendations
Searches:-3 review databases chosen on the basis of an epidemiological mapping exercise (Bayliss & Davenport 2008 IJTAHC; 24(4):403-411) ARIF, (University of Birmingham) DARE (CRD York) Cochrane Database Systematic Reviews-Cochrane Diagnostic Test Accuracy Working Group & the UK National Research Register: unpublished reviews
Results: study flow
HITS:N=1215
Cochrane Database
of Systematic
Reviews
DAREARIF
Excluded at abstract stage:N=641
Excluded at full paper stage:N=34
DuplicatesN=303
Eligible test accuracy reviewsN=237
Random 100 reviews data extracted
Results: review characteristicsDate of publication ranged from 1990 to 2006; 23% of reviews before 2000 and 73% on or after 2000. A total of 16 disease topic areas were represented Between 1 and 50 index tests (median 3) were evaluated by a single review The majority of reviews (43/100) were conducted in the USA, 23 in the UK, 12 in the Netherlands and 8 in the rest of Europe, 6 in Australia, 4 in Canada, 2 in Peru and one each in Columbia and China. 94/100 reviews included a clinician as an authorUsing a modified checklist of 9 items taken from the QUORUM and AMSTAR checklists, study quality ranged from 0-9 (median 4.6; inter-quartile range 3 to 6)
Results: question formulationOnly 24/100 reviews clearly specified all of test application, test role and prior tests as part of question formulation; 26% of the 73 reviews published on or after 2000 and 22% of the 23 reviews published before 2000
0102030405060708090
100
Detail of Question Formulation in Included Reviews
% R
ev
iew
s
Only 9/100 reviews reported all of setting, (pesentation (symptomatic or asymptomatic) and prior tests.
Preva
lenc
e
Settin
g
Chron
icity
Sympt
omat
ic/as
ympt
om...
Age
Sever
ity
Co-m
orbi
dity
Prior t
ests
0102030405060708090
100
Reporting of study characteristics in test accuracy reviews
Limited by poor reporting in 1y studies
No
Unclear
Yes Nu
mb
er
of
rev
iew
s
Systematic review of evidence concerned with understanding and application of test accuracy metrics
Contextualisation of review findings
Applic
abilit
y (s
pect
rum
)
Applic
abilit
y (in
dex
test
)
Applic
abilit
y (p
reva
lenc
e)
Applic
abilit
y (h
ealth
care
set
ting)
Conse
quen
ces
of te
st re
sults
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%Contextualisation of review findings
Unclear / not reported
Stated
Pe
rce
nta
ge
of
Re
vie
ws
Characteristics of the literatureTheoretical perspective papers:- 1978 to 2010, (88% after 1990)-34 papers written by 30 unique authors-25/30 clinicians of which 16/25 affiliated to an academic institution
Empirical research:-1978 to 2010,(60% after 1995)-majority of health professional samples were self- selected, convenience samples from medical education courses
- only 3/26 health professional samples representative of primary care
Research Methods...Systematic review of evidence concerned with the understanding and application of test accuracy metrics.
Survey of use and understanding of test accuracy metrics by general practitioners
Results: study flowTotal Hits
N=16765
Included
papersN=67
Empirical research
N=33
Theoretical papers
N=34
Clinical
sampleN=26
OtherN=7
Excluded
N=14136
Duplicates
N=2508
Results: Theoretical papersUnderstanding and application of test accuracy information:-Accessibility of test accuracy metrics -Knowledge about accuracy of tests used in practice likely to be limited-Lack of appreciation of variation in pre-test probability across healthcare settings-Use of graphical aids and frequencies (rather than % or proportions) to facilitate probability revision
Factors affecting testing behaviour:-Testing viewed as a risk aversive behaviour-Testing context considered important modifier of attitudes to risk
Theoretical: understanding and applying test accuracy information
Sensitivity and specificity
“A clinician will not start from diseased or not diseased, but from a positive or negative test. Therefore sensitivity and specificity are intuitively not so evident” (Dujardin 1994)
Global test accuracy measures do not
distinguish between test accuracy in 2 dimensions“The problem that occurs in a
meta-analysis of diagnostic studies is the multi-directional performance of the diagnostic instrument regarding its ability to detect (specificity) or exclude (sensitivity) the characteristic of interest is not distinguished. Multi-dimensional outcomes cannot be summarised well by a single estimate.” (Stengel 2003)
Likelihood Ratios“Never in 20 years of teaching
clinical logic, have we found a clinician who used the word “positive likelihood ratio”.
(Van Den Ende 2005)
Test Accuracy Measures
Disease (Reference test result)
Present (D+)
Absent (D-)
IndexTest
+ True Positives
False Positives TP+FP
- False Negatives
True Negatives FN+TN
TP+FN FP+TN TP+FP+FN+TN
SensitivityTP / (TP+FN)
SpecificityTN / (TN+FP)
Positive Predictive Value
TP / (TP+FP)
Negative Predictive Value
TN / (TN+FN)
Disease as Reference
Class
Test Result as Reference
ClassDOR = True Positives X True Negatives False Negatives X False Positives
Theoretical: understanding and applying test accuracy information
Pre- test probability and test accuracy estimation “research has shown that
clinicians’ estimates of probability vary widely and are often inaccurate..by itself, clinical experience appears insufficient to guide accurate probability estimation” (Richardson 2003)
“We rarely know what the sensitivities, specificities or likelihood ratios are for tests. At best clinicians carry a general impression about their usefulness” (Gill 2005)
Contextual variation in test accuracy
Unfortunately, it is often not realised that there can be no generally valid estimates of a test’s sensitivity, specificity or likelihood ratio that apply to all patients of a particular population, nor should such values be sought” (Moons 2003)
Theoretical: factors affecting testing behaviour
Clinicians are uncomfortable with
uncertainty
Testing ‘risk’ is context dependent
“...some physicians order all the tests that may be even remotely applicable in a given clinical situation. Such a practice may comfort the patient and enhance the physician’s belief that all diagnostic avenues have been pursued, but more tests do not necessarily produce more certainty..... we continue to test excessively, partly because of our discomfort with uncertainty.” (Kassirer 1989)
“.. feelings of uncertainty regarding medical problems can differ depending on the situation, not only because one physician may be faced with more complicated diagnostic puzzles than the other, but also, and primarily because the consequences of a vague and uncertain diagnosis may vary in each situation.” (Zaat (1992)
Empirical research: understanding and applying test accuracy information
High levels of error observed for estimates of the accuracy of particular tests. Estimates were based on clinical experience of test use, rather than published evidence.
Between-person variation in estimation of disease prevalence (pre-test probability) for any one disease was considerable (25-100%)
Confusion with interpretation of test accuracy metrics (for example sensitivity and specificity confused with positive and negative predictive values).
Empirical research: understanding and applying test accuracy informationResearch based on premise that probability revision is necessary for diagnostic decision making... (32/33 empirical studies) investigated the ability of respondents to undertake probability revision
Average proportion of respondents able to undertake probability revision 46%, range 0% - 33% for practising clinicians, 33%-73% academic clinicians
Presentation of test accuracy as frequencies rather than proportions or percentages appears to facilitate probability revision
…. However only 3% of respondents reported using probability revision in clinical practice
Probability revision.....
A serum test screens pregnant women for babies with Down’s syndrome. The test is a very good one but not perfect. Roughly 1% of babies have Down’s
syndrome. If the baby has Down’s syndrome, there is a 90% chance that the result will be positive. If the
baby is unaffected, there is still a 1% chance that the result will be positive. A pregnant woman has been tested and the result is positive. What is the chance
that her baby actually has Down’s syndrome?
(0.9) x (0.01) (probability of a TP)(0.9 x 0.01)+(0.01 x 0.99) (probability of a TP or a
FP)
Bramwell 2006; BMJ 333(7562):284-286
Facilitation of probabilistic reasoning: frequencies
Probabilistic representation
Frequency representation
The prevalence of disease is 10% (0.1).
100 patien
ts
10 with disease
90 no disease
8 test +ve
2 test-ve
10 test +ve
80 test-ve
Probability of disease if test +ve
818
Probability of disease if test +ve
(0.8) x (0.1)(0.8 x 0.1)+(0.11 x 0.9)
The probability of testing positive if you have disease is 80% (0.8)
The probability of testing negative if you do not have disease is 89% (0.89)
The probability of testing positive even if you do not have disease is 11% (0.11)
Survey of use and understanding of test accuracy metrics by General Practitioners
“It would just confirm what we already know, doctors, on the whole,
struggle with these concepts”
Survey ObjectivesTo identify which sources of test accuracy
information are used by primary care clinicians and barriers to their use
To evaluate the utility of existing test accuracy metrics as measured by self-reported familiarity, perceived ability to define metrics and self-reported use of metrics in clinical practice
To investigate whether there is consistency in the application of different test accuracy metrics and graphics across a common scenario
Survey Methods: distribution Incentivised, electronic survey hosted
by a professional network of ~200,000 GMC registered doctors with access to approximately 27 000 of 41 000 general practitioners across the UK (doctors.net.org )
Sample size of 200 pre-specified
Survey Results: respondent characteristics224/215 participants accessing the survey
(95%) completed the survey in full Number of years since qualification in the
specialty ranged from 0-41 (median 14 years)
11% had work responsibilities that might result in greater knowledge about test accuracy (GP trainer; GP with an academic position; GP involved in policy)
13% of respondents had undertaken training that included test accuracy interpretation in the last 3 years.
Survey results: test accuracy information sources used by respondents
“Please estimate how often you use the following test accuracy information sources as part of your clinical work”
0
20
40
60
80
100
120
140
160
180
200
220
Nu
mb
er o
f res
po
nd
ents
Source of Test Accuracy Information
Sources of Test Accuracy Information Used by Survey Respondents
Cannot estimate how often use source
Never use source
Rarely use source
Often use source
Always use source
Survey results: familiarity with test accuracy metrics/graphics
Sensitivi
ty and S
pecifici
ty
PPV and N
PV
LR+ and L
R-
2x2 d
iagnostic t
able
ROC curv
eDOR
AUC
Pictogra
ph0
20
40
60
80
100
120
140
160
180
200
Heard of / Seen Test Accuracy Metric / Graphic
Test Accuracy Metric / Graphic
Nu
mb
er
of
res
po
nd
en
ts
Survey results: perceived ability to define metrics/graphics
Use of test accuracy metrics in practice
0
20
40
60
80
100
120
140
160
180
200
Use of Test Accuracy Metric / Graphic in Practice
Test Accuracy Metric / Graphic
Nu
mb
er
of
res
po
nd
en
ts
Survey Methods: Application of nine different test accuracy metrics to a common testing scenario
A new biological marker for ovarian cancer has been identified and is available as a blood test for use in primary care. A 57 year old asymptomatic woman presents to you concerned about her risk of ovarian cancer and you perform the blood test at her request.”
TEST ACCURACY INFORMATION PRESENTED IN ONE OF NINE DIFFERENT FORMATS
“If the test came back positive would you refer the woman for further investigation?
If the test came back negative would you be confident not to investigate further at this point in time?”
Survey Methods: Application of nine different test accuracy metrics to a common testing scenario
Sensitivity and SpecificitySensitivity and Specificity (frequencies)Predictive valuesPredictive Values (frequencies)Likelihood ratiosPre to post test probabilityDiagnostic Odds Ratio Annotated 2x2 Diagnostic contingency tableAnnotated pictogram
Survey Methods: Application of nine different test accuracy metrics to a common testing scenario
Sensitivity and Specificity: “The marker has a sensitivity of 76% and a specificity of 98%”
Sensitivity and Specificity (frequencies): “Of every 100 women with ovarian cancer, 76 would test positive (be detected by the test) but 24 would test negative (be missed). Of every 100 women without ovarian cancer, 98 would test negative (receive a correct diagnosis) but 2 would test positive (be falsely labelled as having cancer).”
Annotated 2x2 tableWomen with confirmed
ovarian cancer (based on surgery and long
term clinical follow up)
Women confirmed free of ovarian cancer
(based on surgery and long term clinical follow up)
New blood test for detecting ovarian
cancer:POSITIVE RESULT
31women with ovarian cancer
correctly test +ve with the new blood test
(TRUE +VES)
26 women without ovarian cancer
incorrectly test +ve with the new blood test
(FALSE +VES)
New blood test for detecting ovarian
cancer:NEGATIVE
RESULT
10women with ovarian cancer
incorrectly test –ve with the new blood test
(FALSE -VES)
1293women without ovarian cancer correctly test –ve with the new
blood test(TRUE -VES)
41 women, in total, with confirmed
ovarian cancer
1319 women, in total, confirmed
free of ovarian cancer
1360 women
tested in total
Pictograph
Survey RESULTS: Scenarios:“If the test came back POSITIVE would you refer the woman for further investigation?”
Yes
No
?
Yes
No ?
YesNo
?
Yes
No
?
Yes
No
?Yes
No
?
Yes
No
?
Yes
No
?
Yes
No
?
Sens & Spec (%)
Sens & Spec (normalised frequencies)
Predictive values
(%)
Predictive values
(normalised frequencies)
Likelihood Ratios
Pre-post test
probability
Annotated 2x2
Diagnostic Table
DOR explained
Annotated Pictograph
“If the test came back POSITIVE would you refer the woman for further investigation?”
Yes No Don’t know
Survey RESULTS: Scenarios:“If the test came back NEGATIVE would you be confident not to investigate further at this point in time?”
Yes
No
? Yes
No
?
Yes
No
?
YesNo
? Yes
No
?
Yes
No
?
Yes
No
?
Yes
No
?
Yes
No
?
“If the test came back NEGATIVE would you be confident not to investigate further at this point in time?”
Sens & Spec (%)
Sens & Spec
(normalised frequencies)
Predictive values (%)
Predictive values
(normalised frequencies)
Likelihood Ratios
Pre-post test
probability
Annotated 2x2
Diagnostic Table
DOR explained
Annotated Pictograph
Yes No Don’t know
YES – would not investigate further / would not referNO – would investigate further / would refer
Open responses to testing scenariosObligation to test further:-“Would probably investigate (on the basis of a negative test result) but aware all further tests may be negative”
-“I would refer -ve result here even - would be difficult to defend if subsequently turned out to have ovarian carcinoma.”
-“Patient choice as well - but if she wanted further referral I would do this”
Open responses to testing scenarios: Prominence of test errors: SENSITIVITY AND SPECIFICITY: RESPONDENT EMPHASIS ON FALSE NEGATIVES (more important test error in this testing context)-“will miss about 1/4 +ve cases ”-“24% ?false negative.....too high”-“there are a lot of falsely negative results ”
PREDICTIVE VALUES: RESPONDENT EMPHASIS ON FALSE POSITIVES (low prevalence testing context)-“A lot of healthy women would be investigated due to a positive result”-“Would have concerns over doing test in view of high false positive rate.”
Open responses to testing scenarios: Prominence of test errors
2X2 TABLE: FALSE NEGATIVES AND FALSE POSITIVES
-“Too many false positives-they nearly equal the true positives. It is much better at helping you predict who does not have ovarian cancer but still too many false negatives”
- “1303 women had negative tests. You cannot send all of these for further investigation; you would swamp the system. The 10 false -ves will just have to come in if they develop symptoms.”
Open responses to testing scenarios: metrics causing confusionPredictive Values :“Not familiar with terminology here, presume PPV and NPV correspond with sensitivity and specificity but I would need to check”
Diagnostic Odds Ratio:“DOR 190 - is this good or bad ?” “Would need guidelines to follow here because I have no experience of the DOR”
Likelihood Ratios:”I do not understand the LR terminology” “Not sure what LR - or + means” “I have no experience of using likelihood ratios so would have to research before deciding on next course of action”
Open responses to testing scenarios: effect of changing metric on referral decisions-“Is this the same data being presented with different indices? Scary how presenting the same data differently induces different behaviour!”
-“It’s slightly scary how the way this is presented can change the way you feel about the results.”
- “Interesting - when asked this question earlier I would have referred -ve result patient, but realising now I can confidently say she has only a 0.6% chance of having the disease I would explain this to her”
-“Too many false negatives for me to feel comfortable when presented in this way.”
Conclusions (1)The degree to which testing context is reflected in question formulation, conduct and reporting of systematic reviews appears limited
Inadequacies in contextualisation of review methods appear to reflect a deficiency in methodological approach rather than poor reporting of methods.
94 % of test accuracy reviews included a clinician as a co-author.
No relationship between review quality and contextualisation of review reporting was observed suggesting AMSTAR and QUORUM did not capture this dimension of review quality.
Conclusions (2)Ability to understand and apply test accuracy metrics and contextual factors are both likely to be contributing factors to the observed effect of test accuracy metric on diagnostic decision making.
Sensitivity and specificity are understood by a significant proportion of respondents, but it is unclear to what extent this is due to familiarity as opposed to their intuitive nature.
Predictive values and likelihood ratios do not appear to be well understood.
Conclusions (3)
Test errors (false negatives and false positives) appear prominent as part of the translational pathway from summary estimates of test accuracy to management decisions.
‘Raw data’ in the from of the 2x2 diagnostic table appears to reduce framing effects introduced by summary test accuracy metrics
ImplicationsFor practice: -Medical education-Dissemination of test evaluation research and guideline development
For research:-Further investigation in more diverse testing contexts-What do decision makers want / need
“It’s clinical medicine... not based on any form of probability.
That’s gambling with lives.”