Imperfect Gold Standards for Biomarker Evaluation Rebecca A. Betensky Conference on Statistical Issues in Clinical Trials University of Pennsylvania April

Imperfect Gold Standards for Biomarker Evaluation

Rebecca A. Betensky

Conference on Statistical Issues in Clinical Trials

University of Pennsylvania

April 18, 2012

Outline

• Motivation: need for kidney injury biomarkers for diagnosis of acute kidney injury (AKI)

• Impact of imperfect gold standard on apparent sensitivity and specificity of perfect biomarker

• Examine conditional independence assumption: implicit restrictions

• Bounds on true sensitivity and specificity

Serum creatinine for AKI

• Clinicians have used SCr to diagnose AKI for decades.

• Acknowledged as inadequate gold standard:

– Poor specificity in some settings that are not associated with kidney injury

– Poor sensitivity in setting of adequate renal reserve

– Relatively slow kinetics after injury

• Considerable interest in identifying better biomarkers of tubular injury: potentially more accurate and earlier diagnosis.

How to evaluate new biomarkers?

• Studies have used changes in SCr as the gold standard against which to test novel tubular injury biomarkers.

• Aside from problems of specificity and sensitivity,

– SCr does not directly reflect tubular function or injury

– Based on a cutoff, which will impact its true spec and sens, and thus that of novel marker.

Conceptual framework

• Actual disease that is the target of the diagnostic test (AKI) is not synonymous with clinical conditions identified by imperfect gold standard (SCr).

• AKI is difficult to establish without invasive and risky histopathological assessment.

• Using imperfect gold standard (i.e., imperfect reference test) may distort apparent diagnostic performance of novel biomarker.

Idealized example of perfect novel biomarker

disease prevalence=20%

imperfect gold standard sensitivity=80%, specificity=80%

Relative to imperfect gold standard, a perfect novel biomarker will have apparent sensitivity of 50% and apparent specificity of 64/68=94%.
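The arithmetic behind these figures can be checked with a short sketch. It relies on the fact that a perfect biomarker always agrees with true disease status, so its apparent sensitivity and specificity are simply the positive and negative predictive values of the gold standard. The function and variable names are illustrative, not from the talk.

```python
from fractions import Fraction

def apparent_accuracy_of_perfect_marker(prevalence, sens_g, spec_g):
    """Apparent sensitivity/specificity of a perfect biomarker (B == D)
    when it is judged against an imperfect gold standard G."""
    diseased = prevalence
    healthy = 1 - prevalence
    # Cross-classify true disease status (= perfect biomarker) with G.
    b1_g1 = diseased * sens_g          # diseased, G positive
    b1_g0 = diseased * (1 - sens_g)    # diseased, G negative
    b0_g1 = healthy * (1 - spec_g)     # healthy, G positive
    b0_g0 = healthy * spec_g           # healthy, G negative
    apparent_sens = b1_g1 / (b1_g1 + b0_g1)   # P(B=1 | G=1)
    apparent_spec = b0_g0 / (b0_g0 + b1_g0)   # P(B=0 | G=0)
    return apparent_sens, apparent_spec

sens, spec = apparent_accuracy_of_perfect_marker(
    Fraction(20, 100), Fraction(80, 100), Fraction(80, 100))
print(sens, spec)  # 1/2 16/17  (16/17 equals 64/68)
```

Exact fractions are used so the result can be compared directly with the 64/68 quoted on the slide.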

At lower prevalence, dominant effect of imperfect gold standard is on perfect biomarker’s apparent sensitivity:

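\[
\text{apparent sens} = \frac{1}{1 + \dfrac{1-p}{p}\cdot\dfrac{1-\mathrm{spec}_G}{\mathrm{sens}_G}},
\qquad
\text{apparent spec} = \frac{1}{1 + \dfrac{p}{1-p}\cdot\dfrac{1-\mathrm{sens}_G}{\mathrm{spec}_G}}
\]

where p is the disease prevalence.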

This is similar to imperfect gold standard=“need for dialysis”.

At prevalence of 20%, apparent sensitivity of perfect biomarker is 100% and apparent specificity is 84%. The bounds of the apparent AUC are 0.84-1.00.

Even rare false positives (imperfect gold standard spec=99%) lead to apparent sensitivity of 86% and bounds of apparent AUC of 0.72-0.98.
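The quoted figures can be reproduced with a short check. This is a sketch, not the author's code. The operating characteristics of the gold standard in this example are not stated in the text; a sensitivity of 25% with a specificity of 100% (and then 99%) is an assumption, chosen because it reproduces the reported numbers. The AUC range is taken as sens × spec (lower bound) to sens + (1 − sens) × spec (upper bound), the extremes for an ROC curve passing through a single known point.

```python
def apparent_sens_spec(p, sens_g, spec_g):
    """Apparent accuracy of a perfect biomarker relative to gold standard G."""
    app_sens = p * sens_g / (p * sens_g + (1 - p) * (1 - spec_g))
    app_spec = (1 - p) * spec_g / ((1 - p) * spec_g + p * (1 - sens_g))
    return app_sens, app_spec

def auc_bounds(sens, spec):
    """Smallest and largest AUC consistent with one (sens, spec) point."""
    return sens * spec, sens + (1 - sens) * spec

ASSUMED_SENS_G = 0.25  # assumption: not given in the source text

for spec_g in (1.00, 0.99):
    s, c = apparent_sens_spec(0.20, ASSUMED_SENS_G, spec_g)
    lo, hi = auc_bounds(s, c)
    print(f"spec_G={spec_g:.2f}: apparent sens={s:.2f}, "
          f"apparent spec={c:.2f}, AUC in [{lo:.2f}, {hi:.2f}]")
```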

Cut-offs for SCr

• Recent clinical studies of novel AKI biomarkers have used a variety of SCr criteria to define AKI.

• These examples illustrate that different choices of cut-offs can lead to hugely different apparent properties of a novel biomarker.

What if new biomarker is not perfect?

• Need assumptions on relationship between new biomarker and imperfect gold standard and disease to evaluate new biomarker.

• Conditional independence is convenient; allows for latent class models.

• However, it introduces implicit restrictions.

What can we learn for imperfect novel biomarker?

• Previous illustration assumes perfect novel biomarker.

• Common assumption is conditional independence: P(B=b|G=g,D=d)=P(B=b|D=d)

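• Apparent sensitivity of B relative to G:

\[
P(B=1 \mid G=1) = \frac{p\,Se_G\,Se_B + (1-p)(1-Sp_G)(1-Sp_B)}{p\,Se_G + (1-p)(1-Sp_G)}
\]

• Apparent specificity of B relative to G:

\[
P(B=0 \mid G=0) = \frac{p\,(1-Se_G)(1-Se_B) + (1-p)\,Sp_G\,Sp_B}{p\,(1-Se_G) + (1-p)\,Sp_G}
\]

Here p = P(D=1), and Se and Sp denote sensitivity and specificity relative to true disease D, with subscripts G and B for the gold standard and the new biomarker.

• Use these to solve for “true” sensitivity and specificity of B relative to D

• Bounds on apparent AUC:

– Apparent AUC ≥ apparent sens × apparent spec

– Apparent AUC ≤ apparent sens + (1-apparent sens) × apparent spec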
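Under conditional independence, the apparent sensitivity and specificity are linear in the true sensitivity of B and in one minus its true specificity, so the "solve" step is a 2×2 linear system. The sketch below assumes that the prevalence p and the sensitivity and specificity of G are known; the function names are illustrative, and this is one way to carry out the inversion rather than the author's implementation.

```python
def apparent_from_true(p, se_g, sp_g, se_b, sp_b):
    """Apparent sens/spec of B relative to G under conditional independence."""
    g_pos = p * se_g + (1 - p) * (1 - sp_g)        # P(G=1)
    g_neg = p * (1 - se_g) + (1 - p) * sp_g        # P(G=0)
    app_sens = (p * se_g * se_b + (1 - p) * (1 - sp_g) * (1 - sp_b)) / g_pos
    app_spec = (p * (1 - se_g) * (1 - se_b) + (1 - p) * sp_g * sp_b) / g_neg
    return app_sens, app_spec

def true_from_apparent(p, se_g, sp_g, app_sens, app_spec):
    """Invert the relations above for (Se_B, Sp_B)."""
    g_pos = p * se_g + (1 - p) * (1 - sp_g)
    g_neg = p * (1 - se_g) + (1 - p) * sp_g
    b1_g1 = app_sens * g_pos                       # P(B=1, G=1)
    b1_g0 = (1 - app_spec) * g_neg                 # P(B=1, G=0)
    # Linear system in u = Se_B and v = 1 - Sp_B:
    #   p*se_g*u       + (1-p)*(1-sp_g)*v = P(B=1, G=1)
    #   p*(1-se_g)*u   + (1-p)*sp_g*v     = P(B=1, G=0)
    a11, a12 = p * se_g, (1 - p) * (1 - sp_g)
    a21, a22 = p * (1 - se_g), (1 - p) * sp_g
    det = a11 * a22 - a12 * a21                    # = p(1-p)(se_g + sp_g - 1)
    if abs(det) < 1e-12:
        raise ValueError("G is uninformative (se_g + sp_g = 1) or p is 0 or 1")
    u = (b1_g1 * a22 - a12 * b1_g0) / det          # Cramer's rule
    v = (a11 * b1_g0 - a21 * b1_g1) / det
    return u, 1 - v

app = apparent_from_true(0.3, 0.9, 0.95, 0.7, 0.85)
print(true_from_apparent(0.3, 0.9, 0.95, *app))
```

If the solved values fall outside [0, 1], the observed data are not compatible with conditional independence at the assumed prevalence.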

Problems with conditional independence

• May not be plausible from mechanistic or physiological perspective; the two tests measure related phenomena.

• May be association between disease severity and test results; two tests may be conditionally independent given disease severity, but not conditionally independent given presence or absence of disease.

• Assumption of conditional independence constrains the disease prevalence; may not be plausible.

Conditional Independence: disease severity

• Independence given disease severity:

P(G=1, B=1|D=1,X)=P(G=1|D=1,X)×P(B=1|D=1,X)

does not imply independence given disease:

P(G=1,B=1|D=1)=P(G=1|D=1)×P(B=1|D=1)
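A small numerical illustration of this point, using made-up numbers: diseased subjects are split evenly between a mild and a severe level of X, and both tests detect severe disease more often than mild disease. Within each severity level G and B are independent by construction, yet after averaging over severity they are positively associated.

```python
# Hypothetical values for diseased subjects (D=1); each tuple is
# (P(X=x | D=1), P(G=1 | D=1, x), P(B=1 | D=1, x)).
severity_levels = [
    (0.5, 0.5, 0.5),   # mild disease: both tests detect half the cases
    (0.5, 0.9, 0.9),   # severe disease: both tests detect most cases
]

# Marginal detection probabilities given D=1, averaging over severity.
p_g = sum(w * g for w, g, _ in severity_levels)
p_b = sum(w * b for w, _, b in severity_levels)

# Joint probability given D=1: independence holds only within each level.
p_gb = sum(w * g * b for w, g, b in severity_levels)

print(f"P(G=1,B=1|D=1) = {p_gb:.2f}")
print(f"P(G=1|D=1) x P(B=1|D=1) = {p_g * p_b:.2f}")
```

With these values the joint probability is 0.53 while the product of the marginals is 0.49, so conditional independence given D alone fails.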

Conditional Independence: disease prevalence

Conditional independence may not be possible at a given disease prevalence.

Bounds on prevalence under conditional independence

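Observed table:

        G=1    G=0
B=1     a      b
B=0     c      d

Under conditional independence, split into two tables, with some constraints:

D=1:
        G=1    G=0
B=1     αa     βb
B=0     γc     δd

D=0:
        G=1        G=0
B=1     (1-α)a     (1-β)b
B=0     (1-γ)c     (1-δ)d

Here α, β, γ, δ ∈ [0,1] denote the fraction of each observed cell that belongs to diseased subjects, so that

p = P(D=1) = αa + βb + γc + δd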

Example

Ignoring sampling variability, conditional independence is not possible for p outside the interval (0.285, 0.715).

        G=1    G=0
B=1     30%    5%
B=0     15%    50%
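One way to check which prevalences are compatible with conditional independence for this table is sketched below. It uses the identity that, under conditional independence, Cov(B, G) = p(1-p)·Δ_B·Δ_G, where Δ = Se - (1 - Sp) for each test, together with the requirement that each test's sensitivity and specificity lie in [0, 1] while matching its observed marginal. The function name, the grid, and the tolerance are illustrative choices, and the derivation is a reconstruction rather than the author's stated method.

```python
def ci_possible(a, b, c, d, p, tol=1e-9):
    """Can the table a=P(B=1,G=1), b=P(B=1,G=0), c=P(B=0,G=1), d=P(B=0,G=0)
    arise under conditional independence at prevalence p (0 < p < 1)?"""
    p_b, p_g = a + b, a + c
    cov = a - p_b * p_g

    def max_pos(m):
        # Largest delta > 0 with Se <= 1 and 1 - Sp >= 0, given marginal m.
        return min(1.0, (1 - m) / (1 - p), m / p)

    def max_neg(m):
        # Largest |delta| for a test negatively associated with disease.
        return min(1.0, m / (1 - p), (1 - m) / p)

    scale = p * (1 - p)
    # Largest attainable positive and negative covariances.
    same_sign = scale * max(max_pos(p_b) * max_pos(p_g),
                            max_neg(p_b) * max_neg(p_g))
    opp_sign = scale * max(max_pos(p_b) * max_neg(p_g),
                           max_neg(p_b) * max_pos(p_g))
    return -opp_sign - tol <= cov <= same_sign + tol

grid = [i / 1000 for i in range(1, 1000)]
feasible = [p for p in grid if ci_possible(0.30, 0.05, 0.15, 0.50, p)]
print(min(feasible), max(feasible))
```

On this grid the feasible prevalences run from about 0.285 to about 0.715. The upper part of that range is reached only if the tests are allowed to be negatively associated with the latent class, which amounts to relabeling D.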

Other dependence assumptions

• With more tests, some methods model relationships between some tests. This is arbitrary, and cannot be tested without a rich enough study.

• Discrepant resolution method; disfavored due to bias.

• Composite reference method; success depends on reliability of reference tests.

Bounds on true sensitivity and specificity of a new biomarker

• Explore information available from the comparison of B and G, when no assumptions are made regarding their dependence.

• Assume operating characteristics of G are known.

• Derive bounds for operating characteristics of B.

Idea

• Simply by bounding cells in cross tabulation of G and (B,D) to be between 0 and 1 we derive bounds for

– P(D=1, B=1|G=1)

– P(D=0, B=0|G=0)

• True sensitivity and specificity of B are maximized at the maxima of these and minimized at the minima of these.

Example

        G=1    G=0
B=1     25     5
B=0     10     60

• Apparent sens=25/35=71%

• Apparent spec=60/65=92%

• Suppose sens of G is 90% and spec of G is 95%

• True sens of B is (61%,81%)

• True spec of B is (87%,98%)

• These bounds are reasonably narrow.

Example

        G=1    G=0
B=1     10     20
B=0     10     60

• Apparent sens=50%

• Apparent spec=75%

• Suppose sens of G is 90% and spec of G is 95%

• The true sens of B is (33%,67%)

• True spec of B is (71%,78%)

• Bound for sens is quite wide, ranging from poor test to possibly adequate; bound for spec is narrow.
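The two examples above can be reproduced with the following sketch. It is a reconstruction of the bounding idea, not the author's code: the prevalence is recovered from P(G=1) and the known sensitivity and specificity of G, which fixes the four (D, G) cell probabilities; each observed (B, G) cell is then split between diseased and non-diseased subjects in the most and least favourable way. Sampling variability is ignored and all names are illustrative.

```python
def true_accuracy_bounds(b1g1, b1g0, b0g1, b0g0, se_g, sp_g):
    """Bounds on true sens/spec of B from a B-by-G table and known se_g, sp_g.
    No assumption is made about dependence between B and G given D."""
    if se_g + sp_g <= 1:
        raise ValueError("G must be informative: se_g + sp_g > 1")
    total = b1g1 + b1g0 + b0g1 + b0g0
    a, b, c, d = (x / total for x in (b1g1, b1g0, b0g1, b0g0))
    # Prevalence implied by P(G=1) = p*se_g + (1-p)*(1-sp_g).
    p = ((a + c) - (1 - sp_g)) / (se_g + sp_g - 1)
    if not 0 < p < 1:
        raise ValueError("table is incompatible with the stated se_g, sp_g")
    d1_g1, d1_g0 = p * se_g, p * (1 - se_g)              # P(D=1, G=g)
    d0_g1, d0_g0 = (1 - p) * (1 - sp_g), (1 - p) * sp_g  # P(D=0, G=g)
    # Diseased share of each B=1 cell lies between these limits.
    se_lo = (max(0.0, a - d0_g1) + max(0.0, b - d0_g0)) / p
    se_hi = (min(a, d1_g1) + min(b, d1_g0)) / p
    # Non-diseased share of each B=0 cell lies between these limits.
    sp_lo = (max(0.0, c - d1_g1) + max(0.0, d - d1_g0)) / (1 - p)
    sp_hi = (min(c, d0_g1) + min(d, d0_g0)) / (1 - p)
    return (se_lo, se_hi), (sp_lo, sp_hi)

for table in [(25, 5, 10, 60), (10, 20, 10, 60)]:
    (se_lo, se_hi), (sp_lo, sp_hi) = true_accuracy_bounds(*table, 0.90, 0.95)
    print(f"{table}: sens in ({se_lo:.3f}, {se_hi:.3f}), "
          f"spec in ({sp_lo:.3f}, {sp_hi:.3f})")
```

Up to rounding, the output agrees with the intervals quoted in the two examples.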

Conclusions

• Low sensitivity of a promising kidney injury biomarker when expected prevalence of disease is low (e.g., contrast nephropathy – NGAL sensitivity=78%) raises question of imperfect specificity of “gold standard”.

• Likewise, low specificity when expected prevalence is high (e.g., ICU with hypotension and sepsis – NGAL spec=76% when applied to critically ill patients) raises question of imperfect sensitivity of gold standard.

Conclusions

• Need “hard” clinical endpoints for use as gold standard, but even these have potential problems (e.g., long latency, confounding by other risk factors).

• Could use exposure status, such as to nephrotoxic drug, to avoid SCr.

• Amount of information in comparing new biomarker to imperfect gold standard may not be very high, even if imperfect gold standard is a good test itself.

• Conditional independence is problematic – physiologically and technically.

• Nonparametric bounds may or may not be useful; but certainly reflect true information content.

• Ultimate validation of a biomarker’s utility is demonstration in a randomized clinical trial that it alters clinical management and improves clinical outcomes.

Acknowledgments

• Sarah Emerson, PhD

• Sushrut Waikar, MD

• Joseph Bonventre, MD

Waikar SS, Betensky RA, Emerson SC, Bonventre JV (2012). Imperfect gold standards for kidney injury biomarker evaluation. J Am Soc Nephrol 23: 13-21.

Emerson SC, Waikar SS, Bonventre JV, Betensky RA (2012). Biomarker validation with an imperfect reference: issues and bounds. Unpublished manuscript.

With low prevalence, maintaining high specificity is more important than high sensitivity.
