
1

Propensity Scores Methodology for Receiver Operating Characteristic (ROC) Analysis

Marina Kondratovich, Ph.D.

U.S. Food and Drug Administration, Center for Devices and Radiological Health

No official support or endorsement by the Food and Drug Administration of this presentation is intended or should be inferred.

September, 2003

2

Outline

• Introduction
• Place for propensity scores
• Distributions of covariates (details)
• Distributions of New Test results (details)
• Bias of naïve AUC estimation
• Matching for one covariate
• Weighted ROC analysis
• Stratification for one covariate
• Relationship between AUC by matching and by stratification
• Propensity score – pre-test risk of disease
• Conjunction of a New Test with other diagnostic tests

3

ROC Analysis

New Test is quantitative.

New Test variable: X for the Diseased population, Y for the Non-Diseased population.

ROC curve = relationship between sensitivity and specificity of a New Test over all possible cut-off values. The AUC (area under the curve) is the most common measure of test performance.

AUC = sensitivity averaged over all values of specificity, or equivalently specificity averaged over all values of sensitivity.

AUC = P{X>Y}: the probability that a randomly selected Diseased subject has a test value larger than that of a randomly chosen Non-Diseased subject.
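As a minimal sketch of the probability interpretation, the empirical AUC can be computed by comparing every Diseased value with every Non-Diseased value. The function names and the sample values below are hypothetical, and ties are counted as one half, which is the usual convention for the Wilcoxon-Mann-Whitney statistic.

```python
def psi(a, b):
    """Comparison kernel: 1 if a > b, 1/2 if tied, 0 otherwise."""
    if a > b:
        return 1.0
    if a == b:
        return 0.5
    return 0.0


def empirical_auc(diseased, non_diseased):
    """Fraction of (Diseased, Non-Diseased) pairs in which X exceeds Y."""
    total = sum(psi(x, y) for x in diseased for y in non_diseased)
    return total / (len(diseased) * len(non_diseased))


# Made-up test values, for illustration only.
x = [2.1, 3.4, 1.9, 4.0]   # Diseased
y = [1.0, 2.1, 2.5]        # Non-Diseased
auc = empirical_auc(x, y)
print(auc)                 # 8.5 of 12 pairs, about 0.708
```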

4

• To estimate the diagnostic accuracy of a New Test correctly, we should compare the values of the New Test for Diseased subjects with the values of the New Test for the same subjects when Non-Diseased. Each subject has two potential values of the New Test: a value X that would be observed if the subject were Diseased and a value Y that would be observed if the subject were Non-Diseased. But X and Y cannot be observed jointly for the same subject.

Subject = {New Test, Covariates (e.g., C1=Age, C2=BMI)}

• If we were able to randomly assign subjects to the Diseased and Non-Diseased clinical states, then the Diseased and Non-Diseased groups would be comparable with respect to covariates, and the diagnostic accuracy of the Test would be evaluated correctly. But such a random assignment is impossible.

5

Biased estimators of AUC occur if

I. Distributions of covariates are different for the Diseased and Non-Diseased study groups;

and

II. Distributions of New Test results are different for different sets of covariates.

Problem: Consider M randomly selected Diseased subjects and N randomly selected Non-Diseased subjects. Naïve estimation of AUC is biased (usually overstated).

Consider these two situations in more detail for one covariate, Age.

6

I. Different Age distributions in Diseased and Non-Diseased study groups.

Target population Age distribution (t1, t2, t3): t1 = 0.5; t2 = 0.3; t3 = 0.2.

Pre-test risk of Disease (Age):

Age1   π1 = 0.1
Age2   π2 = 0.3
Age3   π3 = 0.5

πpopulation = π1·t1 + π2·t2 + π3·t3 = 0.24

7

Age distributions

I. Study Groups: M randomly selected Diseased subjects, N randomly selected Non-Diseased subjects.

M = m1 + m2 + m3,   N = n1 + n2 + n3

E[mi/M] = pi,   E[ni/N] = qi

Diseased:       p1=0.21; p2=0.38; p3=0.41
Non-Diseased:   q1=0.59; q2=0.28; q3=0.13
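The age distributions p and q are consistent with Bayes' rule applied to the target population of the previous slide (t = (0.5, 0.3, 0.2), π = (0.1, 0.3, 0.5)). The sketch below assumes that this is how they were obtained; the variable names are illustrative, and the results agree with the slide values to within rounding.

```python
# Target population: age distribution t and pre-test risk pi by age group.
t = [0.5, 0.3, 0.2]
pi = [0.1, 0.3, 0.5]

# Overall pre-test risk in the population.
pi_population = sum(pi_i * t_i for pi_i, t_i in zip(pi, t))

# Age distribution among Diseased: P(Age_i | Disease).
p = [pi_i * t_i / pi_population for pi_i, t_i in zip(pi, t)]

# Age distribution among Non-Diseased: P(Age_i | No Disease).
q = [(1 - pi_i) * t_i / (1 - pi_population) for pi_i, t_i in zip(pi, t)]

print(round(pi_population, 2))
print([round(v, 3) for v in p])
print([round(v, 3) for v in q])
```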

8

I. Study Groups: M randomly selected Diseased subjects, N randomly selected Non-Diseased subjects.

Pre-test risk of Disease in the study (Agei) = mi/(mi + ni), related to the pre-test risk of Disease in the population:

$$\frac{m_i}{m_i+n_i} \approx \frac{1}{1+\dfrac{N}{M}\cdot\dfrac{1-\pi_i}{\pi_i}\cdot\dfrac{\pi_{population}}{1-\pi_{population}}}$$

Monotonic function of πi; depends on πpopulation and on πstudy (through the ratio N/M).

Pre-test risk of Disease:

        Study (N=M)   Population
Age1    0.26          0.1
Age2    0.58          0.3
Age3    0.80          0.5
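A short sketch of how the study pre-test risk follows from the population pre-test risk, assuming the expected stratum counts are E[mi] = M·πi·ti/πpopulation and E[ni] = N·(1−πi)·ti/(1−πpopulation), so that the ratio ni/mi no longer depends on ti. The function name is hypothetical. Under these assumptions the Age1 and Age2 values reproduce the table (0.26 and 0.58), while the Age3 value comes out near 0.76, so the 0.80 in the table may reflect different rounding or inputs.

```python
def study_pretest_risk(pi_i, pi_population, M, N):
    """Approximate m_i / (m_i + n_i) from the population pre-test risk pi_i."""
    ratio = (N / M) * ((1 - pi_i) / pi_i) * (pi_population / (1 - pi_population))
    return 1.0 / (1.0 + ratio)


pi = [0.1, 0.3, 0.5]
pi_population = 0.24

# Equal numbers of Diseased and Non-Diseased subjects (N = M).
risks = [study_pretest_risk(pi_i, pi_population, M=100, N=100) for pi_i in pi]
print([round(r, 2) for r in risks])
```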

9

II. The distribution of the Test variable depends on Age.

The New Test variables of
Diseased subjects: X1, X2, X3 with c.d.f. F1(x), F2(x), F3(x);
Non-Diseased subjects: Y1, Y2, Y3 with c.d.f. G1(y), G2(y), G3(y).

Example. Disease = Fracture, Non-Diseased = No Fracture, New Test = ultrasound test for a body site. This is a hypothetical relationship between the average ultrasound test value and age. Usually, ultrasound values become lower with increasing age.

PSA test values (for prostate cancer) increase with increasing age; BNP test values (for congestive heart failure) increase with increasing age.

10

This is a typical picture of the data (ultrasound test for the bone status).

I. The age distributions for Diseased and Non-Diseased subjects are different.
II. The values of the New Test depend on age.

Prostate cancer is more prevalent in older men; congestive heart failure is more prevalent in older people.

11

PROBLEM: Naïve estimation of AUC is biased (usually overstated). Indeed, the Wilcoxon-Mann-Whitney statistic is

$$\widehat{AUC} = \frac{1}{MN}\sum_{k=1}^{3}\sum_{s=1}^{3}\sum_{i=1}^{m_k}\sum_{j=1}^{n_s}\Psi(X_{k,i},\,Y_{s,j}),$$

where Ψ(A,B) = 1 if A>B; ½ if A=B; 0 if A<B. Its expected value is

$$E[\widehat{AUC}] = \sum_{k=1}^{3}\sum_{s=1}^{3} p_k\,q_s\,AUC_{k,s} = p^T\,AUC\,q,$$

where

$$AUC_{k,s} = P\{X_k > Y_s\} = \int G_s(x)\,dF_k(x)$$

is the area under the ROC curve when the Diseased subjects are Agek years old and the Non-Diseased subjects are Ages years old.

12

Example.

X1, Y1 ~ N(1, 1/4)
X2, Y2 ~ N(2, 1/4)
X3, Y3 ~ N(3, 1/4)

New Test does not have diagnostic ability: New Test cannot discriminate Diseased and Non-Diseased subjects within any age group.

AUC matrix (rows: Diseased; columns: Non-Diseased):

                   Non-Diseased
Diseased    Age1    Age2    Age3
Age1        0.50    0.16    0.02
Age2        0.84    0.50    0.16
Age3        0.98    0.84    0.50

The age distribution of the Diseased subjects is pT=(0.21; 0.38; 0.41); the age distribution of the Non-Diseased subjects is qT=(0.59; 0.28; 0.13).

The two groups, Diseased and Non-Diseased, appear different with respect to the values of the New Test.

13

Example (continued).

AUC matrix (rows: Diseased; columns: Non-Diseased):

                   Non-Diseased
Diseased    Age1    Age2    Age3
Age1        0.50    0.16    0.02
Age2        0.84    0.50    0.16
Age3        0.98    0.84    0.50

If the age distribution of the Diseased subjects is pT=(0.21; 0.38; 0.41) and the age distribution of the Non-Diseased subjects is qT=(0.59; 0.28; 0.13), then the mean value of the Wilcoxon-Mann-Whitney statistic, pTAUCq, is 0.68.

The matrix element AUC3,1=0.98, which corresponds to the biggest age group of Diseased subjects (p3=0.41) and the biggest age group of Non-Diseased subjects (q1=0.59), makes the largest contribution to the bilinear form pTAUCq, computed for vectors p and q.
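The bilinear form pTAUCq can be checked directly from the AUC matrix and the two age distributions given in the example. This is a minimal sketch; the helper name `bilinear` is illustrative.

```python
# AUC matrix from the example: rows are Diseased age groups,
# columns are Non-Diseased age groups.
AUC = [
    [0.50, 0.16, 0.02],
    [0.84, 0.50, 0.16],
    [0.98, 0.84, 0.50],
]
p = [0.21, 0.38, 0.41]   # age distribution of Diseased subjects
q = [0.59, 0.28, 0.13]   # age distribution of Non-Diseased subjects


def bilinear(u, A, v):
    """Compute u^T A v for small lists."""
    return sum(u[k] * A[k][s] * v[s] for k in range(len(u)) for s in range(len(v)))


naive_mean = bilinear(p, AUC, q)
print(round(naive_mean, 2))   # about 0.68

# Contribution of each matrix element to the bilinear form.
contributions = {(k, s): p[k] * AUC[k][s] * q[s] for k in range(3) for s in range(3)}
largest = max(contributions, key=contributions.get)
print(largest)                # (2, 0), i.e. the element AUC3,1
```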

14

Adjustments for one covariate

Three common methods of adjusting for one confounding covariate:

– Matching

– Stratification

– Covariate adjustment through logistic regression

15

Matching

Matching of Diseased and Non-Diseased subjects means that the age distributions of these subjects are the same. Let the Diseased and Non-Diseased subjects be matched with common age distribution φT = (φ1, φ2, φ3). Then

$$E[\widehat{AUC}] = \sum_{k=1}^{3}\sum_{s=1}^{3}\varphi_k\,\varphi_s\,AUC_{k,s} = \varphi^T\,AUC\,\varphi.$$

Theorem. Suppose a New Test cannot discriminate the Diseased and Non-Diseased populations within any age group. Then the expected value of the Mann-Whitney statistic is 0.5 for any age distribution in the age-matched samples of Diseased and Non-Diseased subjects.

The Wilcoxon-Mann-Whitney statistic correctly evaluates the test performance (area under the ROC curve) only for age-matched samples.
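The theorem can be illustrated numerically with the AUC matrix of the earlier example, in which the New Test has no diagnostic ability within any age group. There the diagonal elements equal 0.5 and each off-diagonal pair satisfies AUCk,s + AUCs,k = 1, so φTAUCφ collapses to 0.5·(φ1+φ2+φ3)² = 0.5. The sketch below checks this for a few arbitrarily chosen age distributions φ (the particular φ vectors are assumptions for illustration).

```python
AUC = [
    [0.50, 0.16, 0.02],
    [0.84, 0.50, 0.16],
    [0.98, 0.84, 0.50],
]


def quadratic_form(phi, A):
    """Compute phi^T A phi for a common age distribution phi."""
    n = len(phi)
    return sum(phi[k] * A[k][s] * phi[s] for k in range(n) for s in range(n))


# Arbitrary common age distributions for the matched samples.
candidates = [
    [1 / 3, 1 / 3, 1 / 3],
    [0.2, 0.5, 0.3],
    [0.7, 0.2, 0.1],
    [0.05, 0.05, 0.9],
]
matched_means = [quadratic_form(phi, AUC) for phi in candidates]
print([round(v, 6) for v in matched_means])   # each value is 0.5
```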

16

Matching (continued)

By matching, we create a “quasi-randomized” experiment. That is, if we find two subjects, one in the Diseased group and one in the Non-Diseased group, with the same pre-test risk of Disease (same age), then we can imagine that there was one subject for whom the value of the New Test was observed both when this subject was Diseased and when this subject was Non-Diseased. The age-matched study groups are similar with respect to Age (AUC for the covariate Age is exactly 0.5). Then we are sure that the differences in the New Test distributions for the Diseased and Non-Diseased groups are not due to a difference in age.

Problem: The data of unmatched subjects are not used in the AUC estimate. In that case, weighted ROC analysis should be used.

17

Weighted ROC Analysis

Data set: Diseased and Non-Diseased subjects are not age-matched. We want these two samples to be age-matched with the common age distribution φ, where φk = dk/D, dk = min(mk, nk), and D = d1 + d2 + d3.

        Age distribution
        for matching       Diseased                        Non-Diseased
Age1    d1=3               m1=3: X1,1 X1,2 X1,3            n1=5: Y1,1 Y1,2 Y1,3 Y1,4 Y1,5
Age2    d2=3               m2=4: X2,1 X2,2 X2,3 X2,4       n2=3: Y2,1 Y2,2 Y2,3
Age3    d3=1               m3=2: X3,1 X3,2                 n3=1: Y3,1

18

Weighted ROC Analysis (continued)

For each age Agek, we can take:

• Some set of size dk of the mk Diseased subjects. There are $\binom{m_k}{d_k}$ different variants.
• Some set of size dk of the nk Non-Diseased subjects. There are $\binom{n_k}{d_k}$ different variants.

For Age1, 10 variants; for Age2, 4 variants; for Age3, 2 variants. Total number of different matched sets: 80 (= 10 x 4 x 2).

Using a particular age-matched set of D Diseased and D Non-Diseased subjects, we can estimate the age-matched AUC using the Wilcoxon statistic.

Then we consider all possible matched sets, estimate AUC for each set, and take the average of AUC over all these sets.

19


This is equivalent to the calculation of AUC with all M Diseased subjects with weights dk/mk and with all N Non-Diseased subjects with weights dk/nk:

$$\widehat{AUC}_{weighted} = \frac{1}{D^2}\sum_{k=1}^{K}\sum_{s=1}^{K}\sum_{i=1}^{m_k}\sum_{j=1}^{n_s}\frac{d_k}{m_k}\,\frac{d_s}{n_s}\,\Psi(X_{k,i},\,Y_{s,j})$$

The weighted ROC analysis is equivalent to consideration of all possible variants of age-matching with common age distribution φ.

Also, the weighted estimate of AUC can be obtained using the bootstrap technique.
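The equivalence between the weighted AUC and the average over all matched sets can be checked by brute force on a data set with the stratum sizes of the example (m = 3, 4, 2 and n = 5, 3, 1). The sketch below is illustrative only: the test values are simulated, and the function names are hypothetical.

```python
import random
from itertools import combinations, product


def psi(a, b):
    return 1.0 if a > b else 0.5 if a == b else 0.0


def wmw_auc(xs, ys):
    """Ordinary Wilcoxon-Mann-Whitney estimate of AUC."""
    return sum(psi(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def weighted_auc(diseased, non_diseased):
    """Weighted AUC with weights d_k/m_k (Diseased) and d_s/n_s (Non-Diseased)."""
    d = [min(len(xs), len(ys)) for xs, ys in zip(diseased, non_diseased)]
    D = sum(d)
    total = 0.0
    for k, xs in enumerate(diseased):
        for s, ys in enumerate(non_diseased):
            weight = (d[k] / len(xs)) * (d[s] / len(ys))
            total += weight * sum(psi(x, y) for x in xs for y in ys)
    return total / D ** 2


def average_over_matched_sets(diseased, non_diseased):
    """Enumerate every age-matched subset and average the ordinary AUC."""
    d = [min(len(xs), len(ys)) for xs, ys in zip(diseased, non_diseased)]
    x_choices = [list(combinations(xs, dk)) for xs, dk in zip(diseased, d)]
    y_choices = [list(combinations(ys, dk)) for ys, dk in zip(non_diseased, d)]
    aucs = []
    for x_sel in product(*x_choices):
        for y_sel in product(*y_choices):
            xs = [v for stratum in x_sel for v in stratum]
            ys = [v for stratum in y_sel for v in stratum]
            aucs.append(wmw_auc(xs, ys))
    return sum(aucs) / len(aucs), len(aucs)


# Simulated test values, one list per age group (sizes as in the example).
rng = random.Random(0)
diseased = [[rng.gauss(k, 1) for _ in range(m)] for k, m in enumerate([3, 4, 2])]
non_diseased = [[rng.gauss(k, 1) for _ in range(n)] for k, n in enumerate([5, 3, 1])]

weighted = weighted_auc(diseased, non_diseased)
averaged, n_sets = average_over_matched_sets(diseased, non_diseased)
print(n_sets)               # 80 matched sets
print(weighted, averaged)   # the two estimates agree up to rounding error
```

The agreement holds because each Diseased subject of Agek appears in a fraction dk/mk of the matched sets, each Non-Diseased subject of Ages appears in a fraction ds/ns of them, and the two selections are made independently.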

20


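Weights for the example data (Diseased subjects have weight dk/mk, Non-Diseased subjects have weight dk/nk):

Age1 (d1=3, m1=3, n1=5)
  Diseased:       X1,1 X1,2 X1,3              weights 1, 1, 1
  Non-Diseased:   Y1,1 Y1,2 Y1,3 Y1,4 Y1,5    weights 3/5, 3/5, 3/5, 3/5, 3/5

Age2 (d2=3, m2=4, n2=3)
  Diseased:       X2,1 X2,2 X2,3 X2,4         weights 3/4, 3/4, 3/4, 3/4
  Non-Diseased:   Y2,1 Y2,2 Y2,3              weights 1, 1, 1

Age3 (d3=1, m3=2, n3=1)
  Diseased:       X3,1 X3,2                   weights 1/2, 1/2
  Non-Diseased:   Y3,1                        weight 1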

21

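Weighted ROC Analysis (continued)

The weighted AUC is an unbiased estimate of the φ-age-matched AUC:

$$E[\widehat{AUC}_{weighted}] = E[\widehat{AUC}_{matched}] = \varphi^T\,AUC\,\varphi.$$

The variance of the weighted estimate is:

$$\operatorname{var}(\widehat{AUC}_{weighted}) = \frac{1}{D^4}\left[\sum_{k=1}^{K}\sum_{s=1}^{K}\sum_{t=1}^{K}\left(\frac{d_k^2\,d_s\,d_t}{m_k}\,\xi_{10}^{k,s,t} + \frac{d_k\,d_t\,d_s^2}{n_s}\,\xi_{01}^{k,t,s}\right) + \sum_{k=1}^{K}\sum_{s=1}^{K}\frac{d_k^2\,d_s^2}{m_k\,n_s}\left(\xi_{11}^{k,s} - \xi_{10}^{k,s,s} - \xi_{01}^{k,k,s}\right)\right],$$

where $\xi_{11}^{k,s}$ is the variance of Ψ(Xk, Ys); $\xi_{10}^{k,s,t}$ is the covariance of Ψ(Xk, Ys) and Ψ(Xk, Yt) for one Diseased subject of Agek and two different Non-Diseased subjects of Ages and Aget; and $\xi_{01}^{k,t,s}$ is the covariance of Ψ(Xk, Ys) and Ψ(Xt, Ys) for two different Diseased subjects of Agek and Aget and one Non-Diseased subject of Ages.

If dk ≤ min(mk, nk) (all weights are not more than 1), then this variance is smaller than the variance for one matching set.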

22

Stratification

The strata are defined, and Diseased and Non-Diseased subjects who are in the same stratum are compared.

        Diseased                        Non-Diseased                        Stratum AUC
Age1    m1=3: X1,1 X1,2 X1,3            n1=5: Y1,1 Y1,2 Y1,3 Y1,4 Y1,5      AUC1,1
Age2    m2=4: X2,1 X2,2 X2,3 X2,4       n2=3: Y2,1 Y2,2 Y2,3                AUC2,2
Age3    m3=2: X3,1 X3,2                 n3=1: Y3,1                          AUC3,3

23

Stratification (continued)

The overall diagnostic accuracy of the New Test can be the weighted average of AUC1,1, AUC2,2, and AUC3,3.

We can consider the linear combination

$$\sum_{k=1}^{3}\varphi_k\,AUC_{k,k},$$

where φ is the same as in matching, φk = dk/D (dk = min(mk, nk)).

If AUC1,1=AUC2,2=AUC3,3=AUC, then the weights φk are similar to the weights inversely proportional to the variances of the stratum AUCs.

Is there a relationship between AUC by matching,

$$\varphi^T\,AUC\,\varphi = \sum_{i=1}^{3}\sum_{j=1}^{3}\varphi_i\,\varphi_j\,AUC_{i,j},$$

and AUC by stratification,

$$\sum_{k=1}^{3}\varphi_k\,AUC_{k,k}\;?$$

24

Example. New Test = ultrasound test for bone status. The results of the ultrasound test are normal variables with means that differ by age and with the same standard deviation of 130 m/sec.

        Means for Diseased (m/sec)    Means for Non-Diseased (m/sec)
Age1    4,005                         4,027
Age2    3,904                         3,953
Age3    3,885                         3,942

Matrix AUC (rows: Diseased Age1 to Age3; columns: Non-Diseased Age1 to Age3):

0.55    0.39    0.37
0.75    0.65    0.58
0.83    0.70    0.68

φT = (0.2; 0.5; 0.3)

AUC by matching: φTAUCφ = 0.624

AUC by stratification: $\sum_{k=1}^{3}\varphi_k\,AUC_{k,k}$ = 0.639
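The two values can be reproduced by hand from the matrix AUC and φ = (0.2; 0.5; 0.3). The worked arithmetic below is only a check of the numbers in this example, using the matrix entries as given:

```latex
\begin{aligned}
\sum_{k=1}^{3}\varphi_k\,AUC_{k,k}
  &= 0.2(0.55) + 0.5(0.65) + 0.3(0.68) = 0.110 + 0.325 + 0.204 = 0.639,\\[4pt]
AUC\,\varphi
  &= (0.416,\; 0.649,\; 0.720)^T,\\[4pt]
\varphi^T AUC\,\varphi
  &= 0.2(0.416) + 0.5(0.649) + 0.3(0.720) = 0.0832 + 0.3245 + 0.2160 = 0.6237 \approx 0.624.
\end{aligned}
```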

25

Relationship between AUC by matching and AUC by stratification

Theorem. Let φT=(φ1, φ2, φ3) be the age distribution in the age-matched Diseased and Non-Diseased groups. Then

$$\sum_{k=1}^{3}\varphi_k\,AUC_{k,k} = \varphi^T\,AUC\,\varphi + \varphi^T\,\Delta\,\varphi,$$

where the matrix Δ is a symmetric matrix with elements

$$\Delta_{k,s} = \Delta_{s,k} = (AUC_{k,k} + AUC_{s,s} - AUC_{k,s} - AUC_{s,k})/2.$$

Matrix Δ from the previous Example:

0        0.030    0.015
0.030    0        0.025
0.015    0.025    0

For a broad class of distributions,

$$\varphi^T\,AUC\,\varphi \;\le\; \sum_{k=1}^{3}\varphi_k\,AUC_{k,k},$$

that is, AUC by matching ≤ AUC by stratification.
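The relationship can be verified numerically for the ultrasound example. The sketch below builds Δ from the AUC matrix given there and checks that the gap between AUC by stratification and AUC by matching equals φTΔφ; the helper names are illustrative.

```python
AUC = [
    [0.55, 0.39, 0.37],
    [0.75, 0.65, 0.58],
    [0.83, 0.70, 0.68],
]
phi = [0.2, 0.5, 0.3]
K = len(phi)


def quadratic_form(v, A):
    """Compute v^T A v."""
    return sum(v[k] * A[k][s] * v[s] for k in range(K) for s in range(K))


# Delta[k][s] = (AUC_kk + AUC_ss - AUC_ks - AUC_sk) / 2, symmetric with zero diagonal.
delta = [
    [(AUC[k][k] + AUC[s][s] - AUC[k][s] - AUC[s][k]) / 2 for s in range(K)]
    for k in range(K)
]

auc_stratification = sum(phi[k] * AUC[k][k] for k in range(K))
auc_matching = quadratic_form(phi, AUC)
gap = quadratic_form(phi, delta)

print([[round(v, 3) for v in row] for row in delta])
print(round(auc_stratification, 3), round(auc_matching, 3), round(gap, 4))
```

The identity relies only on φ1 + φ2 + φ3 = 1, so it holds for any AUC matrix; the sign of the gap is what depends on the distributions.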

26

Covariates (C1, C2, …, CL)

(Figure labels: mk = 3, nk = 5.)

Matching based on many covariates is difficult.

Stratification: As the number of covariates increases, the number of strata grows exponentially.

27

Propensity Scores

Replace the collection of confounding covariates with one scalar function of these covariates: the propensity score.

Propensity score (PS): the conditional probability of being in the Diseased group rather than the Non-Diseased group, given a collection of observed covariates.

PS (C1, C2, …, CL) = Pr (Disease| C1, C2, …, CL).

Propensity Score = Pre-test risk of Disease given a collection of covariates, C1, C2, …, CL.

28

Construction of propensity score (pre-test risk)

• Logistic regression or other methods (neural networks, ...).
• Outcome: Disease = 1, Non-Disease = 0.
• Predictors: all measured covariates, some interaction terms or squared terms, and so on.
• New Test is not included.
• AUC for the combined covariates is a measure of covariate imbalance.

The distributions of the X and Y variables, the values of a New Test for the Diseased and Non-Diseased groups, depend on the covariates, but this dependence is approximated well through the pre-test risk:

F (x, C1, C2, …, CL) = F (x, PS(C1, C2, …, CL));
G (y, C1, C2, …, CL) = G (y, PS(C1, C2, …, CL)).
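The construction can be sketched with a small logistic regression fitted by plain gradient ascent. Everything specific here is an assumption for illustration: the simulated Age and BMI covariates, the true risk model, the learning rate, and the function names. In practice a statistical package would be used for the fit. The New Test is deliberately left out of the predictors, and the AUC of the fitted propensity score between the two groups serves as the measure of covariate imbalance.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def standardize(column):
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / sd for v in column]


def fit_logistic(rows, labels, lr=0.5, epochs=500):
    """Gradient ascent on the mean log-likelihood; returns [intercept, coefficients...]."""
    w = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = y - sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w


def propensity_scores(rows, w):
    return [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))) for x in rows]


def auc(xs, ys):
    """Wilcoxon-Mann-Whitney AUC (ties count one half)."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
    return wins / (len(xs) * len(ys))


# Simulated covariates: disease risk rises with age, BMI is unrelated (assumption).
rng = random.Random(0)
age = [rng.uniform(40, 80) for _ in range(300)]
bmi = [rng.gauss(27, 4) for _ in range(300)]
disease = [1 if rng.random() < sigmoid(-4 + 0.07 * a) else 0 for a in age]

rows = list(zip(standardize(age), standardize(bmi)))
w = fit_logistic(rows, disease)
ps = propensity_scores(rows, w)

ps_diseased = [s for s, d in zip(ps, disease) if d == 1]
ps_non_diseased = [s for s, d in zip(ps, disease) if d == 0]
covariate_auc = auc(ps_diseased, ps_non_diseased)
print(round(covariate_auc, 3))   # above 0.5 when the covariates are unbalanced
```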

29

Propensity Scores (continued)

• Calculate estimated propensity scores (pre-test risk) for all subjects using the propensity score model.
• Sort all subjects by propensity score.
• Divide subjects into strata that have similar PS.
• Estimate AUC by matching (use weighted AUC) or AUC by stratification.

Figure: Age and BMI plane divided into five strata based on a logistic regression model of age and BMI (linear terms); stratum k contains mk Diseased and nk Non-Diseased subjects.
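The steps above can be sketched as follows, assuming the propensity scores have already been estimated. The data are simulated so that the New Test depends on the pre-test risk but has no diagnostic ability at a fixed risk; the simulation settings, the use of five equal-size strata, and all function names are assumptions for illustration.

```python
import random


def psi(a, b):
    return 1.0 if a > b else 0.5 if a == b else 0.0


def pair_sum(xs, ys):
    return sum(psi(x, y) for x in xs for y in ys)


def split_into_strata(records, n_strata=5):
    """records: (ps, test_value, diseased). Sort by PS and cut into equal-size strata."""
    ordered = sorted(records, key=lambda r: r[0])
    size = len(ordered) / n_strata
    strata = []
    for k in range(n_strata):
        chunk = ordered[round(k * size): round((k + 1) * size)]
        xs = [r[1] for r in chunk if r[2] == 1]   # Diseased test values
        ys = [r[1] for r in chunk if r[2] == 0]   # Non-Diseased test values
        strata.append((xs, ys))
    return strata


def auc_by_stratification(strata):
    """Sum of phi_k * AUC_kk with phi_k = d_k / D, d_k = min(m_k, n_k)."""
    d = [min(len(xs), len(ys)) for xs, ys in strata]
    D = sum(d)
    return sum(dk / D * pair_sum(xs, ys) / (len(xs) * len(ys))
               for dk, (xs, ys) in zip(d, strata) if dk > 0)


def auc_by_matching(strata):
    """Weighted AUC with weights d_k/m_k and d_s/n_s over all pairs of strata."""
    d = [min(len(xs), len(ys)) for xs, ys in strata]
    D = sum(d)
    total = 0.0
    for dk, (xs, _) in zip(d, strata):
        for ds, (_, ys) in zip(d, strata):
            if dk > 0 and ds > 0:
                total += dk * ds * pair_sum(xs, ys) / (len(xs) * len(ys))
    return total / D ** 2


# Simulated subjects: disease occurs with probability ps; the test value
# depends on ps only, not on the disease state itself.
rng = random.Random(1)
records = []
for _ in range(600):
    ps = rng.uniform(0.05, 0.95)
    diseased = 1 if rng.random() < ps else 0
    records.append((ps, 3.0 * ps + rng.gauss(0, 0.5), diseased))

all_x = [r[1] for r in records if r[2] == 1]
all_y = [r[1] for r in records if r[2] == 0]
naive = pair_sum(all_x, all_y) / (len(all_x) * len(all_y))

strata = split_into_strata(records)
stratified = auc_by_stratification(strata)
matched = auc_by_matching(strata)
print(round(naive, 3), round(stratified, 3), round(matched, 3))
```

In this simulation the naive AUC is well above 0.5 even though the test is uninformative at a fixed pre-test risk, while both adjusted estimates stay close to 0.5. Some residual confounding remains inside each stratum because subjects within a stratum still differ slightly in PS.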

30

Propensity Scores (continued). Example: conjunction of a New Test with other diagnostic tests

A New Test is used in conjunction with other clinical tests to detect the clinical state “Disease”. The propensity scores technique is a convenient tool for matching based on all available prior information (covariates) about the subjects.

Example: “Disease” = any stenosis during coronary angiography; New Test; C1 = Age; C2 = Gender; C3 = Total cholesterol; C4 = HDL (“good” cholesterol); C5 = LDL (“bad” cholesterol).

In order to correctly evaluate the diagnostic ability of a New Test, matched AUC analysis should be performed. Matching based on propensity score is recommended.

31

Use of matched ROC analysis when New Test results do not depend on the covariates.

Suppose the distribution of the New Test results is the same for each stratum (F1=F2=F3=F, G1=G2=G3=G), but we have no information about that and use the matched ROC analysis. How is the matched estimate of AUC related to the usual empirical estimate?

Theorem. The matched estimate of the AUC is an unbiased estimate of AUC, but the variance of the matched estimate is inflated.

The proof is based on Hölder’s inequality (see [1]).

32

Summary

• If the results of a New Test depend on covariates and the distributions of covariates in the Diseased and Non-Diseased groups are different, then only matched ROC analysis correctly evaluates the diagnostic accuracy of the New Test.
• Matching based on propensity scores (pre-test risk of Disease) reduces bias.
• The propensity score is seriously degraded when important covariates influencing pre-test risk have not been collected.
• Weighted ROC analysis makes more effective use of all the data.

33

References

1. Kondratovich, Marina V. (2000). Methodology of removing the effect of confounding variables in receiver operating characteristic (ROC) analysis. Proceedings of the 2000 Joint Statistical Meeting, Biopharmaceutical Section, Indianapolis, IN.

2. Kondratovich, Marina V. (2002). Matched receiver operating characteristic (ROC) analysis and propensity scores. Proceedings of the 2002 Joint Statistical Meeting, Biopharmaceutical Section, New York, NY.

3. Zweig, M.H. and Campbell, G. (1993). Receiver operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clinical Chemistry, 39, p. 561-577.

The propensity scores technique is well developed in the context of observational studies and studies of therapeutic devices. In the context of diagnostic studies, however, there have been few papers.

34

References for the propensity scores technique

• Rubin, DB, Estimating causal effects from large data sets using propensity scores. Ann Intern Med 1997; 127:757-763

• Grunkemeier, GL, et al., Propensity score analysis of stroke after off-pump coronary artery bypass grafting, Ann Thorac Surg 2002; 74:301-305

• Winkelmayer, WC, et al., Comparing mortality of elderly patients on hemodialysis versus peritoneal dialysis: A propensity score approach, J Am Soc Nephrol 2002; 13:2353-2362

• Rosenbaum, PR, Rubin, DB, Reducing bias in observational studies using subclassification on the propensity score. JASA 1984; 79:516-524

• Blackstone, EH, Comparing apples and oranges, J. Thoracic and Cardiovascular Surgery, January 2002; 1:8-15

• D’Agostino, RB, Jr., Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group, Statistics in Medicine 1998; 17:2265-2281
