

Scand J Clin Lab Invest 1997;57(Suppl 227):46-63

Mathematical tools for demonstrating the clinical usefulness of biochemical markers

JAMES C. BOYD, M.D., #214 Department of Pathology, University of Virginia, Health Sciences Center, Charlottesville, VA 22908, USA

Boyd JC. Mathematical tools for demonstrating the clinical usefulness of biochemical markers. Scand J Clin Lab Invest 1997;57(Suppl 227):46-63.

Various approaches have been proposed for evaluating the diagnostic value of biochemical markers. Careful design of experimental protocol is key in carrying out any evaluation of clinical diagnostic value. A prospective cohort study is the best clinical trial design and should include an appropriate reference (gold) standard applied in every patient, the results of which are assessed blindly. The spectrum of patients evaluated should reflect the population in which the test will be used, be appropriately broad to avoid bias, and include both symptomatic and asymptomatic patients. The handling of indeterminate results and the eligibility criteria for inclusion in the study should be carefully defined. Although sensitivity, specificity, and predictive value have long been used as indices of test accuracy, newer methods such as receiver operating characteristic curve (ROC) analysis, logistic regression analysis and likelihood ratios are more robust indicators that overcome many limitations of the traditional indices. The area under the ROC curve (AUC) is the best global indicator of test accuracy, but comparisons of AUC for different tests must take correlation between the tests into account if they have been performed in the same patients. Logistic regression analysis allows the diagnostic information from several tests to be evaluated multivariately, provides a probability estimate for a given outcome, and requires few assumptions regarding the underlying distributions of test data. Logistic regression also provides a straightforward method for calculating likelihood ratios. Likelihood ratios are useful for interpreting test results in the individual patient because they provide a convenient means to directly determine predictive value without having to calculate sensitivity and specificity for a given decision limit. Application of these methods is demonstrated using specific examples. Keywords: Evaluation studies, likelihood ratio, logistic regression analysis, predictive value, reference standard, ROC curve, sensitivity, specificity, spectrum, statistics.

James C. Boyd, M.D., #214 Department of Pathology, University of Virginia, Health Sciences Center, Charlottesville, VA 22908, USA


The diagnosis of disease and the assessment of disease prognosis in medical practice commonly involve the performance of a diagnostic test. Such diagnostic and prognostic assessments are always associated with a certain amount of uncertainty that can be divided into at least two components: 1) inaccuracy in the diagnostic test itself; and 2) inaccuracy in the clinical assessment of the patient's condition or in the interpretation of the diagnostic test. Uncertainty from the first source has been reduced by the evolution of new, more robust analytical methods and by increased attention to laboratory quality control. Uncertainty from the second source requires the development of more reliable and objective instruments for prediction and better methods for assessing the diagnostic value of those instruments. Although there is no question that careful studies are necessary to evaluate the clinical diagnostic utility of potential new biochemical markers, there is considerable uncertainty about how best to carry out such studies and how to optimize the information obtained from them [1]. This paper will focus on the statistical approaches that may be employed in studies of diagnostic or prognostic utility.

Usually, the goal of an evaluation study is to establish whether there is a relationship between the marker test result and some clinically defined endpoint (e.g., presence or absence of a disease, good or bad prognosis). The strength of any relationship discovered must be evaluated to establish the utility of the marker in predicting disease or outcome in the individual patient. Since tests are not always used for diagnostic or prognostic purposes alone (e.g., they may also be used in screening, monitoring, risk assessment, epidemiologic studies, or research), their evaluation must be tailored to their intended application. We will not consider here the specialized test evaluation approaches for these latter cases, but will instead focus upon the most common study designs and statistical approaches for test evaluation and their pitfalls.

COMMON STUDY DESIGNS

Zweig and Robertson [2,3] reviewed cases in the literature in which the initial evaluation of a diagnostic test led to overoptimistic views of its clinical utility and proposed a three-phase approach to test evaluations. A phase I trial evaluates the analytical performance of the diagnostic test (characterizing the analytical imprecision, accuracy, sensitivity, and specificity of the method) as well as the within-person and between-person variation of the test results obtained in actual clinical use. A phase II trial is a study in which results from selected patients having the disease in question are compared with results from controls without the disease. Even when carried out as a case-control study, a phase II trial can give results that are overoptimistic, because the patients selected are in easily recognized, advanced stages of the disease (biased disease spectrum) and the controls often consist only of healthy subjects. However, if a test does not show promise in a phase II trial, it is unlikely to have much diagnostic value in clinical application. A phase III trial investigates the clinical usefulness of a diagnostic test in the target population of intended use. A phase III trial is carried out as a blind, controlled, prospective study to answer a specific clinical question.

How a particular study of a diagnostic test is carried out can have a profound impact on the results obtained. Møller-Petersen [4] distinguished between "indirect" (retrospective case-control, equivalent to the phase II trial described above) and "direct" (prospective cohort follow-up, equivalent to the phase III trial described above) study designs and described how the design of the study can influence the derived characteristics of a diagnostic test. Important factors that must be considered in designing a study have been outlined by


several authors [5,6] and are briefly reviewed below.

APPROPRIATE REFERENCE STANDARD

Jaeschke et al. [5] emphasize the importance of performing an independent, blind comparison with a reference standard. The reference (or gold) standard must be as close to the "truth" as possible and performed in every patient along with the test under evaluation. Performance of the reference standard on every patient avoids the problem of verification (work-up) bias [6]. Verification bias occurs when patients with positive (or negative) diagnostic test results are preferentially referred to receive verification by the reference standard procedure. In a case-control study where the patients with positive test results have preferentially been evaluated using the reference standard, verification bias will automatically be embedded in the results. With prospective cohort studies, verification bias can be minimized by carrying out clinical follow-up when the reference standard is too expensive or risky to apply to all patients. It is additionally important that the reference standard be assessed independently (blindly) of the results of the test under evaluation; this guideline avoids the problem of review bias [6].

It will be assumed for this discussion that the clinical outcome category (disease vs. non-disease, good vs. bad prognosis) can be accurately assessed by an independent reference standard. Inaccuracy in the reference standard leads to an imperfect evaluation of the diagnostic test [7], and Begg has discussed this problem in relation to other sources of bias in diagnostic test assessment [8]. Methods have been derived for making unbiased estimates of test sensitivity and specificity even when the accuracy of the gold standard is unknown [9,10], and other methods for assessing test accuracy in the absence of any external reference test have been proposed [11,12], but these will not be covered here.

APPROPRIATE SPECTRUM OF PATIENTS

Unless the population of patients studied closely resembles the population in whom the test will ultimately be applied clinically, the true diagnostic utility of the test cannot be evaluated adequately. Ransohoff and Feinstein [13] pointed out the importance of selecting an appropriately broad spectrum of patients. When the spectrum of patients is confined to groups that are very narrowly defined, the estimates of diagnostic accuracy for a diagnostic test may be biased. For instance, in a case-control study, if only patients with advanced forms of the disease in question are compared with healthy controls, the diagnostic accuracy of the test will be overestimated relative to the accuracy determined from a study of patients whose clinical presentations range from mild to severe and of control subjects whose clinical presentations resemble those of patients with the disease. Reid et al. [6] urge the reporting of the age and sex distributions of the studied subjects, the clinical symptoms and/or disease stage of the subjects, and the eligibility criteria for inclusion of subjects in the study.

Common measures of test accuracy include sensitivity (the fraction of diseased patients with positive test results), specificity (the fraction of nondiseased patients with negative test results), and likelihood ratios (the odds of a particular test result being observed in diseased vs. nondiseased patients). Although these indices are often regarded as fixed properties of a test, Reid et al. [6] have noted that indices of test accuracy may vary in medical subgroups that differ in disease severity, disease duration, and comorbidity. Unless the subgroups are analyzed separately, these variations may be missed.

MEASURES OF REPRODUCIBILITY AND REPORTING OF RESULTS

Reid et al. [6] urge that the results of clinical evaluation studies include confidence intervals to bracket the precision of accuracy indices, that all indeterminate (nondiagnostic or equivocal) results be reported with an indication of whether they were included in or excluded from the calculations of test accuracy, and that measures of the reproducibility of the test (analytical imprecision or observer variability) be reported. In a survey of 112 published studies, they found that these parameters were reported in only 11, 21, and 25 percent of the studies, respectively.
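The recommendation to report confidence intervals around accuracy indices can be sketched in a few lines. The Wilson score interval below is one common choice for a binomial proportion such as an observed sensitivity; the choice of method and the example counts (90 of 100 diseased patients testing positive) are illustrative and are not taken from the paper.

```python
from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion,
    e.g. an observed sensitivity or specificity (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Illustrative example: sensitivity estimated as 90/100 = 0.90
lo, hi = wilson_ci(90, 100)
```

For 90/100 this gives an interval of roughly 0.83 to 0.94, a spread worth reporting alongside the point estimate.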

APPROPRIATE SAMPLE SIZE

To ensure adequate statistical power in testing the diagnostic accuracy of a test, an appropriately large sample of patients must be studied. Although there is no method for determining exactly how large the sample size should be, we will consider a useful approximation for estimating it.

When the diagnostic accuracy of a new test (Test B) is being compared with that of a currently known test (Test A), a McNemar test of correlated proportions is carried out. Results from this type of study are gathered into a table like the one shown in Table 1, tallying into the appropriate cells the results of pairwise test comparisons carried out in the same samples.

TABLE 1: Observed proportions when comparing two tests on the same patient samples.

                        Test A
Test B        Positive   Negative   Total
Positive      p11        p12        p1.
Negative      p21        p22        p2.
Total         p.1        p.2        1

The number of observations in each cell is obtained by multiplying pij by n, the number of individuals studied.

An appropriate sample size for demonstrating significantly different utility between the tests can be derived from such a table using a formula given by Dixon and Massey [14]:

φ² = n(p12 - p21)² / (p12 + p21)

where n is the required sample size and φ is a constant corresponding to a given statistical power, taken from Table 13b of Dixon and Massey [14]; p12 and p21 are derived (using formulae given below) from estimates of the prevalence of disease and the sensitivity and specificity of each test. Solving this formula for n we have:

n = φ²(p12 + p21) / (p12 - p21)²

Estimates of p12 and p21 in terms of test sensitivity, test specificity, and prevalence of disease may be obtained using formulae derived from Greenberg and Jekel [15]:

p12 = p(1 - Sna)Snb + (1 - p)Spa(1 - Spb)
p21 = p Sna(1 - Snb) + (1 - p)(1 - Spa)Spb

where p is the prevalence of disease in the population being studied, Sna and Spa are the sensitivity and specificity of Test A, and Snb and Spb are the sensitivity and specificity of Test B. Representative sample sizes for detecting a ten percent difference in sensitivity and specificity between Test A and Test B at several disease prevalences and combinations of sensitivity and specificity are displayed in Table 2.


TABLE 2: Estimated sample size to achieve 95 percent statistical power when using the McNemar test to compare two tests in the same group of individuals, for several disease prevalences and combinations of sensitivity and specificity for Test A. In all cases, the power is against the alternative that Test B has sensitivity and specificity at least 10 percent lower than Test A.

                                Specificity (Test A)
Sensitivity (Test A)      0.91    0.93    0.95    0.97    0.99

p = 0.2
  0.91                    1041     889     748     617     494
  0.93                    1031     878     736     604     480
  0.95                    1020     866     723     590     465
  0.97                    1007     852     708     574     449
  0.99                     992     836     691     557     431

p = 0.1
  0.91                     586     498     416     337     263
  0.93                     582     494     411     333     258
  0.95                     578     490     407     328     253
  0.97                     573     485     402     323     248
  0.99                     568     480     397     318     243

p = 0.05
  0.91                     463     392     325     261     200
  0.93                     461     391     323     259     198
  0.95                     459     389     322     257     196
  0.97                     457     387     320     255     194
  0.99                     455     385     317     253     192

p = 0.01
  0.91                     390     330     272     216     163
  0.93                     390     330     272     216     163
  0.95                     390     329     271     216     163
  0.97                     389     329     271     215     162
  0.99                     389     329     271     215     162
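The entries of Table 2 can be approximated with a short calculation. The sketch below combines the Dixon and Massey relation n = φ²(p12 + p21)/(p12 - p21)² with the Greenberg and Jekel formulae for p12 and p21. Two details are assumptions of this sketch rather than statements from the paper: the tabulated constant φ is replaced by the normal quantiles z(1-α/2) + z(power), and the "ten percent lower" alternative is read multiplicatively (0.91 becomes 0.819), so results may differ from Table 2 by a unit or two.

```python
from math import ceil
from statistics import NormalDist

def mcnemar_sample_size(prev, sn_a, sp_a, sn_b, sp_b, alpha=0.05, power=0.95):
    """Approximate sample size for comparing two tests in the same
    patients with the McNemar test (Dixon and Massey approximation)."""
    # Discordant-cell proportions (Greenberg and Jekel):
    # p12 = P(Test B positive, Test A negative)
    p12 = prev * (1 - sn_a) * sn_b + (1 - prev) * sp_a * (1 - sp_b)
    # p21 = P(Test B negative, Test A positive)
    p21 = prev * sn_a * (1 - sn_b) + (1 - prev) * (1 - sp_a) * sp_b
    nd = NormalDist()
    # Normal-quantile stand-in for the tabulated constant phi
    phi = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)
    return ceil(phi**2 * (p12 + p21) / (p12 - p21) ** 2)

# Test B assumed 10 percent worse than Test A (0.91 -> 0.819), prevalence 0.2
n = mcnemar_sample_size(0.2, 0.91, 0.91, 0.819, 0.819)
```

With these inputs the sketch returns a value within a few patients of the corresponding Table 2 entry (1041).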


FIGURE 1. Hypothetical Gaussian distributions of test values for diseased and non-diseased patients. Test values below decision limit A virtually rule out the presence of disease, whereas test values above B might be used to rule in a diagnosis of disease. Other decision limits (such as C) could be used for other medical purposes (see text).

MATHEMATICAL AND STATISTICAL TOOLS

Potential biochemical markers can be placed into two categories, quantitative or qualitative, depending upon the measurement scale used for a given marker. Qualitative markers frequently have only two results (present or absent), although some qualitative markers may be expressed on an ordinal scale with three or more possible results (e.g., 1+, 2+, 3+). Physicians often use laboratory tests to confirm or exclude diagnoses that they are presently considering.

Limiting values for continuous tests that aid the physician in ruling in or ruling out a particular diagnosis are called, variously, decision limits, cutoff points, critical values, discrimination values, and referent values (here, they will be called decision limits). Of the decision limits displayed in Figure 1, physicians may be interested in several values for a given test: 1) a test value that would exclude the possibility of a disease being present (a value to the left of A in Figure 1); 2) a test value that would confirm the presence of a disease (a value to the right of B); 3) a test value in the area to the right of C, the value beyond which treatment has been shown to do more good than harm; and 4) the optimal interpretation of a test result that falls between A and B. The information needed to derive these decision limits can only be determined in a carefully designed study of diseased and non-diseased patients carried out according to the protocols and guidelines outlined above. The derivation of optimized decision limits that help answer the physician's questions as outlined above is a major thrust of the field of decision analysis [16,17], a topic that is beyond the scope of this presentation.
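To make the use of the limits A, B, and C of Figure 1 concrete, the sketch below classifies a continuous result against them. The numeric limits and the wording of the categories are invented for illustration; they do not come from the paper or from any study.

```python
def interpret_result(value, rule_out, rule_in, treat):
    """Interpret a continuous test value against the three decision
    limits of Figure 1 (A = rule-out, B = rule-in, C = treatment
    threshold), assuming rule_out < rule_in <= treat."""
    if value < rule_out:
        return "disease virtually excluded"
    if value >= treat:
        return "disease confirmed; treatment likely beneficial"
    if value >= rule_in:
        return "disease confirmed"
    return "indeterminate; weigh other evidence"

# Hypothetical limits: A = 10, B = 25, C = 40 (arbitrary units)
print(interpret_result(7.0, 10, 25, 40))
```

Results between A and B fall in the indeterminate zone, where the probabilistic tools discussed in the remainder of the paper (predictive values, likelihood ratios) are most useful.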

SENSITIVITY AND SPECIFICITY

Several statistical tools have been developed to evaluate the diagnostic utility of a qualitative binary test or a continuous-valued test with a defined decision limit. The statistical indices of sensitivity and specificity of a diagnostic test were first introduced by Yerushalmy [18]. The sensitivity of a test refers to its capability to make a correct (positive) diagnosis in patients with confirmed disease, whereas the specificity of a test indicates its capability to make a correct (negative) diagnosis in confirmed negative cases. These concepts are often demonstrated with reference to a 2 x 2 table (Table 3).

TABLE 3: A 2 x 2 table for comparison of test results with the true disease classification of patients.

                       Test positive   Test negative   Total
Disease present        TP              FN              TP+FN
Disease absent         FP              TN              FP+TN

TP -- true positive test results
FP -- false positive test results
TN -- true negative test results
FN -- false negative test results
Sensitivity (Sn) = TP/(TP+FN)
Specificity (Sp) = TN/(TN+FP)
Predictive value of a positive test (PV+) = TP/(TP+FP)
Predictive value of a negative test (PV-) = TN/(TN+FN)

In this table, the row labels refer to the patients' true diagnosis and the column labels to the result of the test. The cells indicate whether the patients' diagnoses have been correctly or falsely predicted by the test. The sensitivity of the test is the number of true positive results divided by the total number of cases with confirmed presence of the disease, TP/(TP+FN). The specificity of the test is the number of true negative results divided by the total number of cases with confirmed absence of the disease, TN/(TN+FP).
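The indices defined beneath Table 3 translate directly into a short function; the counts in the example below are hypothetical, chosen only to exercise the formulas.

```python
def accuracy_indices(tp, fp, fn, tn):
    """Accuracy indices from the 2 x 2 table of Table 3 (raw counts)."""
    return {
        "sensitivity": tp / (tp + fn),  # Sn  = TP/(TP+FN)
        "specificity": tn / (tn + fp),  # Sp  = TN/(TN+FP)
        "pv_positive": tp / (tp + fp),  # PV+ = TP/(TP+FP)
        "pv_negative": tn / (tn + fn),  # PV- = TN/(TN+FN)
    }

# Hypothetical counts: 90 TP, 10 FN, 20 FP, 180 TN
indices = accuracy_indices(tp=90, fp=20, fn=10, tn=180)
```

Note that the first two indices condition on the true diagnosis (columns of Table 3 within a row), whereas the predictive values condition on the test result, which is why they behave so differently as prevalence changes, as the next section shows.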

PREDICTIVE VALUES

Although the sensitivity and specificity of a given diagnostic procedure can be characterized by carrying out a diagnostic test evaluation as outlined above, the physician is more interested in the probability that his or her patient has the disease given that the test result is positive (i.e., the post-test probability of disease, or the predictive value of a positive test result). The statistical index of predictive value of a diagnostic test was introduced by Vecchio [19] and popularized in laboratory medicine by Galen and Gambino [20]. The predictive value of a positive test is defined as the percentage of the time that a positive result indicates that a patient truly has the disease; it is calculated by dividing the number of true positive results by the total number of cases with a positive result, TP/(TP+FP). The predictive value of a negative test indicates the converse and is calculated by dividing the number of true negative results by the total number of cases with a negative result, TN/(TN+FN). It is well known that even when the sensitivity and specificity of a test are high, the positive predictive value may be poor when the disease prevalence is low in the population being tested. This phenomenon can be verified by use of Bayes theorem [21]. Bayes theorem allows the predictive value of


a positive test (PV+) to be expressed in terms of test sensitivity (SENS) and specificity (SPEC) and disease prevalence (PREV), all in percent, as follows:

PV+ = (SENS × PREV × 100) / [SENS × PREV + (100 - SPEC) × (100 - PREV)]

FIGURE 2. Predictive value of a positive test result (PV+) as a function of disease prevalence, for a test with a sensitivity of 99 percent and a specificity of 99 percent.

This equation is plotted for various disease prevalences in Figure 2 for a test with a sensitivity of 99 percent and a specificity of 99 percent. Note that the predictive value of the test falls precipitously when the disease prevalence drops below 5 percent. It follows that indiscriminate use of a test in a population with a low prevalence of the disease in question can lead to poor clinical results, as pointed out by Boyd and Marr for the acid-fast smear in detecting tuberculosis [22]. This dependence of predictive value on disease prevalence has made it a difficult measure to use clinically, because the prevalence of disease in the population being examined (or, equivalently, the probability of disease in the individual patient before the test is performed) must be estimated by the physician, a practice that has been criticized [23].
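The Bayes formula above is easy to evaluate numerically; the sketch below reproduces the behaviour described for Figure 2, with all quantities expressed in percent as in the text.

```python
def pv_positive(sens, spec, prev):
    """Predictive value of a positive result (in percent) via Bayes
    theorem, with sensitivity, specificity, and prevalence in percent."""
    return (sens * prev * 100) / (sens * prev + (100 - spec) * (100 - prev))

# A test with 99 percent sensitivity and 99 percent specificity, as in Figure 2
for prev in (50, 10, 5, 1, 0.1):
    print(prev, round(pv_positive(99, 99, prev), 1))
```

At 50 percent prevalence PV+ is 99 percent, but it falls to 50 percent at 1 percent prevalence and to about 9 percent at 0.1 percent prevalence, matching the precipitous drop described in the text.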

RECEIVER OPERATING CHARACTERISTIC (ROC) CURVES

When evaluating qualitative tests with only two outcomes (positive or negative), sensitivity, specificity, and predictive value are suitable measures for assessing diagnostic value; the likelihood ratio (discussed below) also provides a good assessment. However, in the case of continuous-valued tests or qualitative variables expressed on an ordinal scale, sensitivity and specificity must be derived in the context of a specific decision limit, and different decision limit choices will yield different sensitivity and specificity estimates. To better understand this concept, consider Table 4, where a series of results is listed for serum parathyroid hormone (S-PTH) in patients with and without primary hyperparathyroidism (HPT) [24]. In viewing


these results, it can be seen that only patients with HPT have S-PTH concentrations exceeding 13.9 μl-equivalents/dl, and that the higher the PTH concentration, the more likely it is that the patient has HPT.

TABLE 4: S-PTH concentrations in healthy controls and patients with proven primary hyperparathyroidism (1° HPT). From Boyd and Ladenson [24].

When a specific decision limit is used, information regarding the degree to which the test result deviates from the decision limit is thrown away by converting the result into "positive" or "negative" according to whether it lies above or below the limit. This loss of information restricts any overall evaluation of the diagnostic capabilities of a test. It is clear from Figure 1 that when the decision limit is set high (e.g., at B), few values will exceed it. This gives the test a high specificity, but it also leaves a larger proportion of values from the diseased population below the decision limit (giving a test with low sensitivity). As the decision limit is moved toward A, the specificity of the test falls and the sensitivity rises. This relationship between specificity and sensitivity as a function of the decision limit can be represented nicely in graphical form, as shown by Gerhardt [25].

In evaluating the diagnostic accuracy of a test, one of the best graphical tools is the receiver operating characteristic (ROC) curve. The typical ROC curve plots, for all possible decision limits, the true-positive rate on the ordinate and the false-positive rate (1 - specificity) on the abscissa. As such, the ROC curve provides a pure index of accuracy, reflecting the diagnostic information available from a test at all decision limits, from low to high [26]. The ROC plot usually does not show the actual decision thresholds, although they are known and used in generating the plot; likewise, a ROC plot does not show how many subjects were used in its generation.

A good diagnostic test is characterized by high true-positive and low false-positive rates at all possible decision limits. The ROC curve for a good test will tend to plot upward and to the left, as illustrated in Figure 3 by the ROC curve for PTH in differentiating patients with proven primary hyperparathyroidism from healthy controls. A poor diagnostic test, such as chloride (Cl), will have a flat ROC curve approaching the diagonal (plotted as a dotted line), where the TP and FP rates are equal at every decision threshold. Such a test will be positive with equal probability in individuals who have the disease and in those who do not, and thus shows total indifference to the diagnosis. Phosphate (PO4) falls between PTH and Cl in diagnostic value. The area under the curve (AUC) of an ROC

curve serves as a good overall measure of clinical accuracy because it reflects the ability of the test to differentiate patients with disease from those without disease over the entire spectrum of decision limits. Although it is easy to compare the relative positions of two or more ROC curves visually to get an idea of the relative utilities of the tests plotted, testing for the statistical significance of differences in the AUCs should be carried out. The AUC can be computed directly from the ROC plot using numerical integration approaches, but this area has also been shown to be directly related to the Mann-Whitney U statistic of the two-sample rank-sum test of Wilcoxon [27,28], a well-known algorithm implemented in many statistical packages.

FIGURE 3. Receiver operating characteristic curves for serum parathyroid hormone (PTH), phosphate (PO4), and chloride (Cl) in the differentiation of patients with primary hyperparathyroidism (confirmed by tissue examination of excised parathyroid glands) from healthy controls. The dashed line represents the line of identity. Data from Ladenson and Boyd [24].

The areas (±SD) under the three ROC curves in Figure 3, calculated by this approach, are: PTH, 0.983 ± 0.0043; PO4, 0.843 ± 0.0469; and Cl, 0.537 ± 0.0444. By constructing 95 percent confidence intervals around the area for each test, we can see that PTH is a significantly better global discriminator between patients with primary hyperparathyroidism and healthy controls than PO4 or Cl.

Although the AUC has the strength of being a global measure of test performance, this may also be a weakness, because information is lost in reducing the entire plot to a single number. It may be more suitable in certain circumstances to evaluate only a


portion of an ROC curve if a range of test decision limits, specificities, or sensitivities can be established as the only useful range for a given clinical application [29]. As a rule, AUC values of different diagnostic tests should not be compared without a visual inspection of the ROC plots themselves. Two ROC curves with similar areas can have very different shapes, and may cross over each other, indicating that one test will have better specificity for a given sensitivity while the other will have better sensitivity for a given specificity. The statistical comparison of ROC curves is

dependent upon whether each curve is derived from the same or different patient populations. When the ROC curves are derived from different patient samples, a test procedure employing a pooled estimate of the variance for the two ROC curves is easily carried out and usually adequate [30]. Another, more exact procedure employs bootstrap estimates of the standard errors for the curves and can be used for simultaneous comparisons of the AUCs for several curves using an F-test [31]. When ROC curves have been derived from the same patient sample, there is often a high correlation between the test results for the different tests being evaluated. This correlation cannot be seen by visually inspecting the ROC curves. However, when it is taken explicitly into account, the power of statistical tests to detect differences between ROC curves is improved. In most applications a simple procedure will suffice: measure the correlation between the two tests in the diseased and non-diseased groups separately, then use the averaged correlation coefficients obtained in each group and the averaged areas under each curve to provide an estimate of an adjusted correlation coefficient [32]; more exact methods are available [33]. Campbell [28] describes the construction of confidence regions around the ROC curve based on

Kolmogorov-Smirnov confidence bands for distribution functions or on estimates obtained by bootstrapping. Computer software is available from several sources for carrying out ROC analysis (RULEMAKER, Digital Medicine, Inc., Hanover, NH; ROCLAB, DeLeo JM, National Institutes of Health, Division of Computer Research and Technology, Bethesda, MD; ROC ANALYZER, Centor RM, Medical College of Virginia, Department of Internal Medicine, Richmond, VA; LABROC, CLABROC, C. Metz, Chicago, IL).
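The rank-sum relationship mentioned above can be sketched in a few lines of Python. This is an illustrative sketch with made-up marker values, not the PTH data; `auc_rank` and `auc_se` are hypothetical helper names, and the standard-error expression is the Hanley-McNeil formula [30]:

```python
import math
from itertools import product

def auc_rank(diseased, healthy):
    """AUC as the Mann-Whitney probability that a diseased value
    exceeds a healthy value (ties count one half)."""
    wins = sum(1.0 if d > h else 0.5 if d == h else 0.0
               for d, h in product(diseased, healthy))
    return wins / (len(diseased) * len(healthy))

def auc_se(auc, n_dis, n_hea):
    """Hanley-McNeil standard error of the AUC."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc * auc / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_dis - 1) * (q1 - auc * auc)
           + (n_hea - 1) * (q2 - auc * auc)) / (n_dis * n_hea)
    return math.sqrt(var)

# Synthetic example: marker values in diseased vs. healthy subjects
diseased = [8.1, 7.4, 9.0, 6.8, 7.9, 8.5]
healthy  = [5.2, 6.1, 4.8, 7.0, 5.9, 5.0]
a = auc_rank(diseased, healthy)
se = auc_se(a, len(diseased), len(healthy))
print(f"AUC = {a:.3f} +/- {se:.3f}")   # -> AUC = 0.972 +/- 0.052
```

An approximate 95 percent confidence interval is then a ± 1.96 × se, which is how the PTH, PO4, and Cl areas above can be compared when the curves come from independent samples.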

LOGISTIC REGRESSION ANALYSIS

When several tests or variables have been

measured simultaneously in a test evaluation study, their relative diagnostic values can be measured by receiver operating characteristic analysis as outlined above. In ROC analysis, however, the variables are evaluated in isolation from one another (i.e., univariately), and the comparison is carried out using statistical analysis of AUCs. Some tests carry diagnostic information that is revealed best only when the result of another test is available. The diagnostic usefulness of PTH measurement, for instance, is greatly improved by having a concurrent calcium measurement available. When PTH is elevated and calcium is also elevated, the likelihood of hyperparathyroidism (HPT) is much higher than it would be if the serum calcium concentration were low [34]. In such cases, multivariate logistic regression analysis can be a useful tool for measuring the relative value of the additional information supplied by multivariable assessment of a disease or prognosis. Logistic regression analysis is applied when

the dependent variable (e.g., presence or absence of the disease, good or bad outcome) is binary, restricted to only one of two values. Such a restriction is not handled well by classical linear regression techniques in which the dependent variable is assumed to have


unlimited range and errors that follow a Gaussian distribution with constant variance over the entire range of the dependent variable(s) [35]. The logistic equation may be used to map an unrestricted value into the restricted range of 0 to 1 as follows:

p = e^L / (1 + e^L) ... (1)

where L is a variable with unlimited range. Using this equation, p is restricted to the interval [0,1], and can be used to represent the probability of disease in a patient. L can be represented as a linear function

L = a0 + a1 x1 + a2 x2 + ... + an xn ... (2)

where the function L is based on n variables (diagnostic tests) xi, each multiplied by a weight (coefficient) ai. Using the relationship between odds and probability p,

p = odds/(1 + odds), odds = p/(1 - p) ... (3)

an alternative expression of Equation 2 can be derived,

ln(odds) = a0 + a1 x1 + a2 x2 + ... + an xn ... (4)

where ln represents the natural logarithm. Thus L represents a log-likelihood function for disease. Values for the coefficients ai can be derived

using various statistical packages such as BMDP [36] and SAS [37], by an iterative maximum-likelihood estimation fitting procedure. With the application of stepwise procedures for entry of variables into the model, the relative diagnostic utilities of the variables can be judged. In this process, the one variable that is next most informative (relative to the information supplied by the variables already in the model) for the discrimination of outcomes is added at each successive step until no more variables can be

found that contribute additional discriminatory information at some preselected level of statistical significance. Derivation of the regression coefficients and stepwise variable selection are described in much greater depth in Albert and Harris [38]. For evaluation of the overall discriminatory value of a logistic regression function, ROC analysis can be performed on the values given by evaluating the logistic regression function for each patient's profile of test results. The logistic approach offers advantages over

traditional linear discriminant analysis. It requires many fewer assumptions regarding the underlying distributions of the test variables and works well with categorical variables expressed on either binary or ordinal scales, and/or continuous variables having Gaussian, gamma or beta distributions [39]. An important advantage is that the logistic model supplies direct estimates of the probability for a given outcome (Equation 2) and is a primary tool for the estimation of likelihood ratios [39].
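The maximum-likelihood fit that packages such as BMDP and SAS perform iteratively can be sketched for a one-predictor model by plain gradient ascent on the log-likelihood. This is illustrative only: `fit_logistic`, the data, and the step-size settings are all invented for this example, and real packages use faster Newton-type iterations:

```python
import math

def fit_logistic(xs, ys, step=0.1, epochs=5000):
    """Fit p = e^(a0 + a1*x) / (1 + e^(a0 + a1*x)) for binary outcomes ys
    by maximizing the log-likelihood with gradient ascent."""
    a0 = a1 = 0.0
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a0 + a1 * x)))
            g0 += y - p          # gradient of the log-likelihood w.r.t. a0
            g1 += (y - p) * x    # ... and w.r.t. a1
        a0 += step * g0 / len(xs)
        a1 += step * g1 / len(xs)
    return a0, a1

def prob(a0, a1, x):
    """Equation 1: probability of disease for test value x."""
    return 1.0 / (1.0 + math.exp(-(a0 + a1 * x)))

# Hypothetical data: a test value per subject, 1 = diseased, 0 = not
xs = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
ys = [0,   0,   0,   1,   0,   1,   1,   1]
a0, a1 = fit_logistic(xs, ys)
print(prob(a0, a1, 4.0) > prob(a0, a1, 1.0))   # -> True
```

With overlapping groups such as these, the fitted probability rises smoothly with the test value rather than jumping at a single decision limit.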

LIKELIHOOD RATIOS

The use of decision limits in laboratory test

interpretation has a number of disadvantages. Use of a decision limit converts numeric test results into (in the simplest case) "positive" and "negative" and throws away information regarding the degree of deviation of a test result above or below the decision limit. Thus, a test result slightly above the limit will be regarded the same as one far above the limit. In addition, decision limits do not individualize the probability that a person with a certain test value has the target disorder, and assumptions made in deriving the decision limit regarding the utility of each outcome may not always hold for the individual patient. Likelihood ratios overcome many of the

disadvantages of decision limits by allowing direct determination of the predictive values of observed test values. Since the determination of predictive values can be carried out directly,


it is unnecessary to carry out intermediate calculations of test sensitivity and specificity. Thus, there is no loss of information caused by converting quantitative answers into "positive" and "negative". The likelihood ratio of a test value (x) for analyte X is written L(X=x), or simply L(x), and is defined [39] as:

L(x) = Prob(X = x) in patients with Disease / Prob(X = x) in patients without Disease

Considering the special case of a binary test, where x may take on only two values (positive or negative, 1 or 0), then

L(1) = Prob(positive result in patients with Disease) / Prob(positive result in patients without Disease) = SENS / (1 - SPEC)

L(0) = Prob(negative result in patients with Disease) / Prob(negative result in patients without Disease) = (1 - SENS) / SPEC

The probability that a positive result indicates the presence of disease (PV+) is

PV+ = pL(1) / [pL(1) + (1 - p)]

where p = the prevalence of the disease. This result can be generalized to continuous-valued tests, x:

PV(x) = pL(x) / [pL(x) + (1 - p)]

Thus, the predictive value of a test result with respect to a given disease can be determined from the likelihood ratio and prevalence of that disease. Albert [39] has shown that the likelihood ratio

for both continuous-valued and categorical variables (tests) can be modeled using logistic regression analysis as follows:

ln L(x) = a0 + a1 x1 + a2 x2 + ... + ak xk

where x is a vector of results for the k tests x1, x2, ..., xk, and the ai are coefficients derived by logistic regression analysis.

With Albert's model, likelihood ratios can be generalized to the multivariate case, where the information from several tests is combined. The derivation of such a model uses the data from a clinical evaluation study as outlined above. Using as a "training set" the test values from each person in a group of patients with and without the disease, logistic regression analysis can be applied to derive maximum likelihood estimates for the a's. Albert and Harris [40] give formulae for calculating the confidence intervals for likelihood ratios.
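The PV(x) relation above is easy to verify numerically. A minimal sketch, using a hypothetical binary test with sensitivity 0.93, specificity 0.92, and an assumed prevalence of 10 percent; `predictive_value` is an invented helper name:

```python
def predictive_value(prevalence, lr):
    """PV(x) = p*L(x) / (p*L(x) + (1 - p)): predictive value of a result
    with likelihood ratio L(x) at disease prevalence p."""
    return prevalence * lr / (prevalence * lr + (1.0 - prevalence))

# Binary-test check: L(1) = SENS / (1 - SPEC)
sens, spec, p = 0.93, 0.92, 0.10
l_pos = sens / (1.0 - spec)
pv_pos = predictive_value(p, l_pos)

# The same number from the classical Bayes formula for PV+
classic = p * sens / (p * sens + (1.0 - p) * (1.0 - spec))
print(round(pv_pos, 4), round(classic, 4))   # -> 0.5636 0.5636
```

The agreement shows that, once L(x) is known, the intermediate sensitivity/specificity step is indeed unnecessary, and PV can be read off for any prevalence.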


EXAMPLE OF LIKELIHOOD RATIO APPLICATION

Up to this point, we have only discussed evaluation of the diagnostic utility of a test as it relates to the overall population being tested. The example about to be presented shows the potential of likelihood ratios for aiding the interpretation of laboratory results in the individual patient.

Boyd and Herold [41] developed a likelihood ratio function to aid the interpretation of fetal lung maturity results as measured using the FLM test on the TDx analyzer (Abbott Laboratories, Chicago IL). This study was undertaken to provide a standardized approach for the interpretation of FLM results. The likelihood ratio for fetal maturity of a test

value (x) for FLM was written L(FLM=x), or simply L(x), and defined as:

L(x) = Prob(FLM = x) given fetal maturity / Prob(FLM = x) given fetal immaturity


[Figure 4: three-dimensional surface plot titled "Likelihood Ratio from FLM and Gestational Age"; GA = Gestational Age, LR = Likelihood Ratio.]

FIGURE 4. Three-dimensional "grid" plot of the likelihood ratio function derived for predicting fetal lung maturity on the basis of the FLM test (Abbott Diagnostics) and the estimated gestational age (GA). LR represents the likelihood ratio calculated by the following equation: LR = e^(0.5805 GA + 0.0428 FLM - 19.9877 + ln(28/263))

FLM measurements, gestational ages, and fetal outcomes were available for 421 of the 695 amniotic fluid samples from five major medical centers evaluated in the Abbott Laboratories multicenter FLM database [42]. Full measurement profiles within 3 days of delivery were available for 291 amniotic fluids. Of these cases, 28 infants developed documented respiratory distress. Using logistic regression, Boyd and Herold developed an equation to calculate the likelihood ratio for fetal maturity:

LR = e^(0.5805 GA + 0.0428 FLM - 19.9877 + ln(28/263))

where GA = gestational age in weeks, FLM = FLM assay value, and ln(28/263) was a factor to correct for the unequal group sizes in the training data. Gestational age was the most important predictive variable for fetal lung maturity, entering the logistic regression model with a


[Figure 5: contour plot of LR against FLM and gestational age; isocontour lines at LR = 1, 2, 5, 10, 20, and 50, with a region of LR < 1 at the lower left.]

FIGURE 5. A two-dimensional contour plot of the function shown in Figure 4. Logarithmic increases in LR are shown by the diagonal isocontour lines as one moves from the lower left part of the figure up the steep slope of the likelihood ratio function toward the upper right. A data point for the patient example given in the text is plotted at an FLM of 60 mg/g and a gestational age of 38 weeks. A likelihood ratio (LR) of 11 can be estimated for this point from the isocontour lines.

Chi-square for improvement of 15.25 (p < 0.001). FLM entered the model next with a Chi-square for improvement of 8.21 (p < 0.0043). Using a likelihood ratio of 1 as a cutoff, the multivariate likelihood function achieved a sensitivity of 93% and a specificity of 92%. A three-dimensional plot of this equation in Figure 4 shows the steep slope of the surface representing the likelihood ratio for fetal maturity. Boyd and Herold also applied logistic

regression analysis to develop a likelihood ratio function for fetal maturity using only the FLM result. FLM results within 3 days of delivery and fetal outcomes were available for 307 infants in the multicenter FLM database, of whom 31 developed documented respiratory distress. The resulting univariate likelihood ratio function was:

LR = e^(0.0726 FLM - 2.1275 + ln(31/276))

Again, ln(31/276) was a factor to correct for the unequal group sizes in the training data. FLM entered the model with a Chi-square for


improvement of 34.4 (p < 0.00005). Using a likelihood ratio of 1 as a cutoff, the univariate likelihood function achieved a sensitivity of 86% and a specificity of 90%. As an example of applying these equations,

assume that the FLM result from a woman who is 38 weeks into her pregnancy is 60 mg/g. The corresponding multivariate likelihood ratio calculation is:

LR = e^((0.5805)(38) + (0.0428)(60) - 19.9877 + ln(28/263)) = 11.02

As an alternative to doing this calculation, the values for FLM and gestational age can be plotted on a contour plot of the likelihood ratio equation, shown in Figure 5. Using this approach it can be seen that the likelihood ratio exceeds 10. With a calculated likelihood ratio of 11 and an estimate by the physician of the pre-test odds of fetal maturity (say, even odds, or 1/1), the post-test odds of fetal maturity can be calculated as

Post-test odds = Pre-test odds x LR

Using this formula, the fetus has post-test odds of 1/1 x 11/1 = 11 to 1 (probability = 92%) of being mature, and the physician can make a decision on whether to deliver the infant accordingly. If there is clinical uncertainty regarding the

gestational age (as is frequently the case), the likelihood ratio can easily be estimated for the gestational ages thought to bound the true gestational age. Alternatively, the likelihood ratio can be calculated by the second equation using only the information in the FLM result:

LR = e^((0.0726)(60) - 2.1275 + ln(31/276)) = 1.04

Comparison of this likelihood ratio with the one derived earlier using both the gestational age and the FLM result demonstrates the added certainty of fetal maturity conferred by the relatively advanced gestational age of 38

weeks.
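The worked example can be reproduced directly from the published coefficients. A sketch: `flm_lr`, `flm_lr_uni`, and `post_test_prob` are invented wrapper names around the two equations from the text, and the univariate constant is transcribed as 2.1275, the value consistent with the printed result of 1.04:

```python
import math

def flm_lr(ga_weeks, flm):
    """Multivariate likelihood ratio for fetal maturity (Boyd & Herold),
    with ln(28/263) correcting for unequal training-group sizes."""
    return math.exp(0.5805 * ga_weeks + 0.0428 * flm
                    - 19.9877 + math.log(28.0 / 263.0))

def flm_lr_uni(flm):
    """Univariate likelihood ratio using the FLM result alone,
    with ln(31/276) as the group-size correction."""
    return math.exp(0.0726 * flm - 2.1275 + math.log(31.0 / 276.0))

def post_test_prob(pre_test_odds, lr):
    """Post-test odds = pre-test odds x LR; convert odds to probability."""
    odds = pre_test_odds * lr
    return odds / (1.0 + odds)

# The patient example: gestational age 38 weeks, FLM = 60 mg/g
print(round(flm_lr(38, 60), 2), round(flm_lr_uni(60), 2))  # -> 11.02 1.04
print(round(post_test_prob(1.0, flm_lr(38, 60)) * 100))    # -> 92
```

Running both functions on the same FLM value makes the text's point concrete: the gestational age of 38 weeks raises the likelihood ratio from about 1 to about 11.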

SUMMARY

Clinical and laboratory tests have long been a routine part of the medical examination of the patient. As long as biomedical knowledge continues to evolve, new potential diagnostic markers will emerge and need careful evaluation. Attention must be given to the design and execution of clinical diagnostic studies of biochemical markers, avoiding the pitfalls of inadequate reference standards and inconsistent application and/or unblinded interpretation of those standards. Inadequacies in the spectrum of patients studied can easily bias study results, and subgroup analysis may be necessary to separate effects of the subgroups on indices of test accuracy. Various mathematical approaches are available for the analysis of data from such evaluation studies. The traditional indices of sensitivity, specificity, and predictive value can be derived using specific decision limits. By integrating test information over all possible decision limits, ROC analysis provides the best overall indication of univariate test accuracy. Logistic regression analysis enables integration of information from several tests into a single function that produces probability estimates of outcome and allows evaluation of the incremental diagnostic contribution of variables as they are added stepwise to the model. As a special application of logistic regression, likelihood ratios provide point estimates of the predictive value for continuously valued or discrete diagnostic tests that are useful in individual patient management and are not dependent upon the specification of decision thresholds. Test evaluation tools that are independent of decision limits appear to be the most useful and will see increasing application in future studies.


REFERENCES

1. Magid E. Some concepts and principles of clinical test evaluation. Introduction. Scand J Clin Lab Invest 1992;Suppl 208:9-12.
2. Zweig MH, Robertson EA. Why we need better test evaluations. Clin Chem 1982;28:1272-1276.
3. Robertson EA, Zweig MH, Van Steirteghem AC. Evaluating the clinical efficacy of laboratory tests. Am J Clin Pathol 1983;79:78-86.
4. Møller-Petersen J. Evaluation of diagnostic tests. Design and phases. Scand J Clin Lab Invest 1992;Suppl 208:35-50.
5. Jaeschke R, Guyatt G, Sackett DL. Users' guides to the medical literature. III. How to use an article about a diagnostic test. A. Are the results of the study valid? J Am Med Assoc 1994;271:389-391.
6. Reid MC, Lachs MS, Feinstein AR. Use of methodological standards in diagnostic test research. Getting better but still not good. J Am Med Assoc 1995;274:645-651.
7. Valenstein PN. Evaluating diagnostic tests with imperfect standards. Am J Clin Pathol 1990;93:252-258.
8. Begg CG. Biases in the assessment of diagnostic tests. Stat Med 1987;6:411-423.
9. Hui SL, Walter SD. Estimating the error rates of diagnostic tests. Biometrics 1980;36:167-171.
10. Walter SD, Irwig LM. Estimation of test error rates, disease prevalence and relative risk from misclassified data: a review. J Clin Epidemiol 1988;41:923-937.
11. Schulzer M, Anderson DR, Drance SM. Sensitivity and specificity of a diagnostic test determined by repeated observations in the absence of an external standard. J Clin Epidemiol 1991;44:1167-1179.
12. Staquet M, Rozencweig M, Lee YJ, Muggia FM. Methodology for the assessment of new dichotomous diagnostic tests. J Chronic Dis 1981;34:599-610.
13. Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med 1978;299:926-930.
14. Dixon WJ, Massey FJ Jr. Introduction to Statistical Analysis, fourth edition. New York: McGraw-Hill, 1983, pp 319-324.
15. Greenberg RA, Jekel JF. Some problems in the determination of the false positive and false negative rates of tuberculin tests. Am Rev Respir Dis 1969;100:645-650.
16. Weinstein MC. Clinical Decision Analysis. Philadelphia: Saunders, 1980.
17. Patrick EA. Decision Analysis in Medicine: Methods and Applications. Boca Raton, FL: CRC Press, 1979.
18. Yerushalmy J. Statistical problems in assessing methods of medical diagnosis, with special reference to X-ray techniques. Public Health Rep 1947;62:1432-1449.
19. Vecchio TJ. Predictive value of a single diagnostic test in unselected populations. N Engl J Med 1966;274:1171-1173.
20. Galen RS, Gambino SR. Beyond Normality: The Predictive Value and Efficiency of Medical Diagnoses. New York: Wiley, 1975.
21. Fleiss JL. Statistical Methods for Rates and Proportions, 2nd ed. New York: Wiley, 1981, pp 1-13.
22. Boyd JC, Marr JJ. Decreasing reliability of acid-fast smear techniques for detection of tuberculosis. Ann Intern Med 1975;82:489-492.
23. Feinstein AR. Clinical biostatistics XXXIX. The haze of Bayes, the aerial palaces of decision analysis and the computerized Ouija board. Clin Pharmacol Ther 1977;21:482-496.
24. Boyd JC, Ladenson JH. Value of laboratory tests in the differential diagnosis of hypercalcemia. Am J Med 1984;77:863-872.
25. Gerhardt W. The Bayes approach: systematic graphic evaluation of diagnostic tests. In: Keller H, Trendelenburg Ch (eds). Data Presentation/Interpretation. Berlin: Walter de Gruyter, 1989, pp 161-204.
26. Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin Chem 1993;39:561-577.
27. Bamber D. The area above the ordinal dominance graph and the area below the receiver operating characteristic curve. J Math Psychol 1975;12:387-415.
28. Campbell G. General methodology I. Advances in statistical methodology for the evaluation of diagnostic and laboratory tests. Stat Med 1994;13:499-508.
29. McClish DK. Analyzing a portion of the ROC curve. Med Decis Making 1989;9:190-195.
30. McNeil BJ, Hanley JA. Statistical approaches to the analysis of receiver operating characteristic (ROC) curves. Med Decis Making 1984;4:137-150.
31. McClish DK. Comparing areas under more than two independent ROC curves. Med Decis Making 1987;7:149-155.
32. Hanley JA, McNeil BJ. A method of comparing


the areas under receiver operating characteristic curves derived from the same cases. Radiology 1983;148:839-843.
33. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988;44:837-845.
34. Boyd JC, Lewis JW, Slatopolsky E, Ladenson JH. Parathyrin measured concurrently with free or total calcium in the differential diagnosis of hypercalcemia. Clin Chem 1981;27:574-579.
35. Rosner B. Fundamentals of Biostatistics, 2nd ed. Boston: PWS Publishers, 1986, p 402.
36. Dixon WJ (ed). BMDP Statistical Software. Berkeley, CA: UCLA Press, 1990, p 1013.
37. SAS Institute Inc. SAS/STAT User's Guide, Release 6.03 Edition. Cary, NC: SAS Institute Inc., 1988, p 189.
38. Albert A, Harris EK. Multivariate Interpretation of Clinical Laboratory Data. New York: Marcel Dekker, Inc., 1987, pp 116-122.
39. Albert A. On the use and computation of likelihood ratios in clinical chemistry. Clin Chem 1982;28:1113-1119.
40. Albert A, Harris EK. Op. cit., pp 93-94.
41. Boyd JC, Herold DA. A likelihood ratio function to aid in the interpretation of TDx FLM results. Clin Chem 1991;37(6):988.
42. Russell JC, Cooper CM, Ketchum CH, Torday JS, Richardson DK, Holt JA, Kaplan LA, Swanson JR, h i e WM. Multicenter evaluation of TDx test for assessing fetal lung maturity. Clin Chem 1989;35:1005-1010.
