Elizabeth Garrett-Mayer Division of Biostatistics The Sidney Kimmel Comprehensive Cancer Center

Methods for Evaluating the Performance of Diagnostic Tests in the Absence of a “Gold Standard”: A Latent Class Model Approach

Elizabeth Garrett-Mayer

Division of Biostatistics

The Sidney Kimmel Comprehensive Cancer Center

Johns Hopkins University

March 5, [email protected]

http://astor.som.jhmi.edu/~esg

Latent Variables

We call variables ‘latent’ if they cannot be directly measured or observed.

Examples of ‘latent’ variables:

• Major depression

• Quality of life

• Pain

• IQ

• Socio-economic status

Examples of ‘measured’ variables:

• Blood pressure

• Age

• Eye color

• CD4 cell count

• Number of symptoms on a symptom checklist

• Vital status

Latent Variables

• Some debate over ‘latent’

– Bollen: “All variables are latent”

• Approaches to latent variable situations:

– Find suitable measured variable

• Total number of symptoms

• Pain score on a scale of 1-10

• Sum of items on scale

Latent Variables

• Approaches to latent variable situations (cont.):

– Model the latent variable

• Factor analysis: continuous latent variables

• Latent class analysis: categorical latent variables

• Latent trait analysis / Rasch modeling / Item response theory: continuous latent variables

• Structural equation modeling

Latent Variables

• When modeling the latent variable:

– We need some observed variables to inform us about the latent variable

– These can be continuous or categorical

– Usually, we like to have at least 3 or 4 observed variables

– Examples:

• Depression: sadness, sleeping problems, guilty feelings, etc.

• Quality of Life: on a scale of 1-10, how much…

• IQ: multiple questions on ‘exam’

Latent variables in relation to gold-standards

• If we can measure a disease, disorder, or condition exactly, we call this a ‘gold-standard’ measure.

• The gold-standard provides a benchmark for evaluating diagnoses based on other approaches

• Examples:

– A mammogram is an imperfect measure of breast cancer. Tissue biopsy is a gold-standard measure of breast cancer.

– An ELISA test is an imperfect measure of HIV infection. The HIV PCR test is a gold-standard measure of HIV infection.

Evaluating Diagnostic Criteria

• However, relatively few areas of medicine have true “gold standard” tests, where the test is perfectly accurate.

– “Pathognomonic indicators”

– When indicator is present, disease is present

– When indicator is absent, disease is absent

• Other situations:

– Combination of signs and symptoms provides very accurate diagnosis.

– Disease process is not well understood: controversy exists about how to define diagnosis.

– Disease process is well understood but measuring disease via signs and symptoms is difficult.

Diagnostic Criteria in Psychiatry

• Currently, the DSM (Diagnostic and Statistical Manual of Mental Disorders) is the standard for defining mental disorders.

• Diagnostic algorithms are provided with which a determination of disorder absence or presence can be made

• Examples: major depression, schizophrenia, autism, alcoholism, generalized anxiety disorder.

• All of these diagnostic algorithms are constructed to measure latent variables.

Major Depressive Episode, as diagnosed by the DSM-IV (APA, 1994)

A. A person who suffers from major depressive disorder must have either depressed mood or a loss of interest or pleasure in daily activities for at least a 2-week period.

B. The disorder is characterized by the presence of five or more of the following nine symptoms:

1. depressed mood most of the day, nearly every day, as indicated by either subjective report or observation made by others.

2. markedly diminished interest or pleasure in all, or almost all, activities most of the day, nearly every day.

3. significant weight loss when not dieting or weight gain, or decrease or increase in appetite nearly every day.

4. insomnia or hypersomnia nearly every day.

5. psychomotor agitation or retardation nearly every day.

6. fatigue or loss of energy nearly every day.

7. feelings of worthlessness or excessive inappropriate guilt nearly every day.

8. diminished ability to think or concentrate, or indecisiveness, nearly every day.

9. recurrent thoughts of death, recurrent suicidal ideation without a specific plan, or a suicide attempt or a specific plan for committing suicide.

Symptoms are not better accounted for by bereavement; that is, the symptoms persist for longer than 2 months or are characterized by marked functional impairment, morbid preoccupation with worthlessness, suicidal ideation, psychotic symptoms, or psychomotor retardation.

How do we validate the DSM criteria?

• How can we be sure that these definitions are valid measures?

• How can we determine the sensitivity and specificity of these measures?

• Is there a gold standard?

• Is a psychiatrist’s diagnosis a gold standard?

• What types of individuals are the diagnostic criteria diagnosing as depressed?

• How often are individuals misdiagnosed?

• What are the implications of a positive or negative diagnosis?

Example: Major Depression

• Epidemiologic Catchment Area Study (ECA): Collected mental health data on individuals in 5 cities, beginning in 1981.

• Our sample: epidemiologic sample of 1322 individuals in the East Baltimore area collected in 1993 (wave 3).

• Depression questions are from the Diagnostic Interview Schedule, which has been shown to be valid and reliable (Robins et al., 1981)

• “Present” if the symptom occurred within two weeks of the interview

• Symptom groups: some questions ask about the same type of symptom:

– Have you had trouble sleeping?

– Have you had trouble waking?

– Do you sleep too much?

• Related symptoms are categorized into the same symptom group.

Distribution of Symptoms

Group  Symptoms                                                               Prevalence
1      Depressed mood                                                         0.06
2      Disinterest in sex; less fun; loss of enjoyment                        0.08
3      Reduced energy/fatigued                                                0.05
4      Reduced concentration; slow thoughts; indecisive                       0.04
5*     Feel inferior; lacking self-confidence                                 0.03
6      Guilty/sinful                                                          0.02
7      Ideas of self-harm; want to die; suicidal thoughts; suicide attempts   0.05
8      Trouble falling asleep; trouble waking; sleep too much                 0.09
9      Loss of appetite; weight loss; increased appetite; weight gain         0.08
10     Slow movement; fast movement; fidgety                                  0.04

ECA Wave 3, N = 1322

Evaluating the DSM Criteria

• Without an available gold standard, we resort to other methods

• Suppose that the proposed symptom (groups) define depression.

• Without relying on the DSM definition of depression but imposing model assumptions, what types of symptom patterns are observed in the data?

• Do individuals tend to “cluster” into categories based on symptom response patterns?

• We can evaluate this using a ‘Latent Class Model.’

• Categorical analog of factor analysis.

The Latent Class Model

• Assumes that each individual in the population is a member of one of M latent classes.

• Each of the classes is defined by a vector of symptom prevalences, p_m = (p_1m, p_2m, …, p_Km), where there are K symptoms and m = 1,…,M.

• The vector y_i = (y_i1, y_i2, …, y_iK) is individual i’s binary vector of symptom responses, i = 1,…,N.

• The proportion of individuals in class m is denoted by π_m.

• The true, yet unobserved, latent class of individual i is denoted by η_i, where η_i ∈ {1,2,…,M}.

• The symptoms “define” the latent variable of interest.

• M is fixed.

• Conditional Independence: Given class membership, symptoms are independent.

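Graphical Depiction of the Latent Class Model

[Figure: three latent classes, class 1 (η = 1), class 2 (η = 2), and class 3 (η = 3), with symptom prevalence vectors (p_11, p_21, …, p_K1), (p_12, p_22, …, p_K2), and (p_13, p_23, …, p_K3); the observed response vectors (y_i1, …, y_iK), (y_i’1, …, y_i’K), and (y_i’’1, …, y_i’’K) of individuals i, i’, and i’’ arise from these classes.]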

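Statistical Details

Probability distribution of Y_i:

$$P(Y_{i1} = y_{i1}, Y_{i2} = y_{i2}, \ldots, Y_{iK} = y_{iK}) = \sum_{m=1}^{M} \pi_m \prod_{k=1}^{K} p_{km}^{y_{ik}} (1 - p_{km})^{1 - y_{ik}}$$

Likelihood function:

$$L(p, \pi \mid Y) = \prod_{i=1}^{N} \sum_{m=1}^{M} \pi_m \prod_{k=1}^{K} p_{km}^{y_{ik}} (1 - p_{km})^{1 - y_{ik}}$$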
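The likelihood of the latent class model can be evaluated directly from its definition. The following is a minimal sketch, assuming binary responses and the conditional independence assumption stated earlier; the function names and the toy parameter values are made up for illustration and are not from the talk.

```python
import math

def pattern_prob(y, p_m):
    """P(Y_i = y | class m): product over symptoms, using conditional independence."""
    prob = 1.0
    for y_k, p_km in zip(y, p_m):
        prob *= p_km if y_k == 1 else (1.0 - p_km)
    return prob

def log_likelihood(Y, p, pi):
    """Log of prod_i sum_m pi_m prod_k p_km^y_ik (1 - p_km)^(1 - y_ik)."""
    total = 0.0
    for y in Y:
        mixture = sum(pi_m * pattern_prob(y, p_m) for pi_m, p_m in zip(pi, p))
        total += math.log(mixture)
    return total

# Toy example (hypothetical numbers): K = 3 symptoms, M = 2 classes.
p = [[0.05, 0.10, 0.02],   # symptom prevalences in class 1
     [0.60, 0.70, 0.40]]   # symptom prevalences in class 2
pi = [0.9, 0.1]            # class sizes
Y = [(0, 0, 0), (1, 1, 0), (0, 1, 0)]
print(log_likelihood(Y, p, pi))
```

A maximum likelihood routine would maximize this function over p and π; working on the log scale avoids numerical underflow when N is large.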

Table 2: Parameter estimates in the 2- and 3-class latent class models

                   2-Class Model       3-Class Model
Symptom            class 1   class 2   class 1   class 2   class 3
movement           0.01      0.43      0.01      0.17      0.73
appetite/weight    0.05      0.55      0.04      0.31      0.78
sleep              0.05      0.65      0.03      0.49      0.77
morbid thoughts    0.03      0.44      0.02      0.24      0.54
guilt/sin          <0.01     0.25      <0.01     0.08      0.37
self-esteem        0.01      0.32      <0.01     0.10      0.56
concentration      0.01      0.48      0.01      0.15      0.78
fatigue            0.02      0.48      0.01      0.25      0.64
loss of interest   0.04      0.65      0.02      0.32      0.91
depressed mood     0.02      0.60      0.02      0.30      0.78
Class size         0.93      0.07      0.88      0.09      0.03

Interpretation

• Two class model:

– A non-depressed class which reports on average no symptoms (93% of sample)

– A depressed class which reports on average 4 to 5 of the 10 symptoms

• Three class model:

– A non-depressed class which reports on average no symptoms (88% of sample)

– A mildly depressed class which reports on average 2 to 3 of the 10 symptoms (9% of sample)

– A severely depressed class which reports on average 6 to 7 of the 10 symptoms (3% of sample)

• The three class model is deemed more appropriate from a statistical standpoint (model fit, adherence to model assumptions) (Garrett and Zeger, 2000)
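The "average number of symptoms" in each class is the sum of that class's symptom prevalences in Table 2. The sketch below does this arithmetic for the depressed classes; the variable names are illustrative only.

```python
# Symptom prevalences from Table 2, in table order
# (movement, appetite/weight, sleep, morbid thoughts, guilt/sin,
#  self-esteem, concentration, fatigue, loss of interest, depressed mood).
two_class_depressed = [0.43, 0.55, 0.65, 0.44, 0.25, 0.32, 0.48, 0.48, 0.65, 0.60]
three_class_mild    = [0.17, 0.31, 0.49, 0.24, 0.08, 0.10, 0.15, 0.25, 0.32, 0.30]
three_class_severe  = [0.73, 0.78, 0.77, 0.54, 0.37, 0.56, 0.78, 0.64, 0.91, 0.78]

# Expected number of reported symptoms = sum of Bernoulli means.
expected_two_class = sum(two_class_depressed)   # about 4.85 ("4 to 5")
expected_mild = sum(three_class_mild)           # about 2.41 ("2 to 3")
expected_severe = sum(three_class_severe)       # about 6.86 ("6 to 7")

print(round(expected_two_class, 2), round(expected_mild, 2), round(expected_severe, 2))
```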

Results of Estimation

• p matrix

• π vector

• Posterior probability of class membership:

– Tells us the probability that individual i is in each of the classes, given his or her response pattern.

$$P(\eta_i = r \mid y_i) = \frac{P(y_i \mid \eta_i = r)\, P(\eta_i = r)}{P(y_i)} = \frac{P(y_i \mid \eta_i = r)\, P(\eta_i = r)}{\sum_{m=1}^{M} P(y_i \mid \eta_i = m)\, P(\eta_i = m)}$$

Examples: Assume M = 2

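• Individual reports absence of all symptoms:

$$y^* = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0)$$

$$P(\eta_i = 1 \mid Y_i = y^*) = \frac{P(Y_i = y^* \mid \eta_i = 1)\, P(\eta_i = 1)}{P(Y_i = y^* \mid \eta_i = 1)\, P(\eta_i = 1) + P(Y_i = y^* \mid \eta_i = 2)\, P(\eta_i = 2)}$$

$$= \frac{\pi_1 \prod_{k=1}^{K} p_{k1}^{y^*_k} (1 - p_{k1})^{1 - y^*_k}}{\pi_1 \prod_{k=1}^{K} p_{k1}^{y^*_k} (1 - p_{k1})^{1 - y^*_k} + \pi_2 \prod_{k=1}^{K} p_{k2}^{y^*_k} (1 - p_{k2})^{1 - y^*_k}}$$

$$= \frac{\pi_1 \prod_{k=1}^{K} (1 - p_{k1})}{\pi_1 \prod_{k=1}^{K} (1 - p_{k1}) + \pi_2 \prod_{k=1}^{K} (1 - p_{k2})} > 0.999$$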

Examples: Assume M = 2

• Individual reports only fatigue and sleep problems (symptoms ordered as in Table 2):

$$y^* = (0, 0, 1, 0, 0, 0, 0, 1, 0, 0)$$

$$P(\eta_i = 1 \mid Y_i = y^*) = 0.87$$

$$P(\eta_i = 2 \mid Y_i = y^*) = 0.13$$

• Individual reports all symptoms except self-esteem and guilt:

$$y^* = (1, 1, 1, 1, 0, 0, 1, 1, 1, 1)$$

$$P(\eta_i = 1 \mid Y_i = y^*) < 0.001$$

$$P(\eta_i = 2 \mid Y_i = y^*) > 0.999$$
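These posterior probabilities can be checked by plugging the two-class estimates from Table 2 into Bayes' rule. The sketch below assumes the symptom order of Table 2 and enters the "<0.01" guilt prevalence as 0.005, which is an assumption made only so that the arithmetic can run; the function name is illustrative.

```python
# Two-class estimates from Table 2 (table order: movement ... depressed mood).
# The "<0.01" entry for guilt in class 1 is entered as 0.005 (an assumption).
P = [
    [0.01, 0.05, 0.05, 0.03, 0.005, 0.01, 0.01, 0.02, 0.04, 0.02],  # class 1
    [0.43, 0.55, 0.65, 0.44, 0.25, 0.32, 0.48, 0.48, 0.65, 0.60],   # class 2
]
PI = [0.93, 0.07]  # class sizes

def posterior(y, p, pi):
    """Posterior class membership probabilities for response pattern y."""
    joint = []
    for pi_m, p_m in zip(pi, p):
        lik = pi_m
        for y_k, p_km in zip(y, p_m):
            lik *= p_km if y_k else (1.0 - p_km)
        joint.append(lik)
    total = sum(joint)
    return [j / total for j in joint]

no_symptoms = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
sleep_and_fatigue = (0, 0, 1, 0, 0, 0, 0, 1, 0, 0)
all_but_guilt_esteem = (1, 1, 1, 1, 0, 0, 1, 1, 1, 1)

for pattern in (no_symptoms, sleep_and_fatigue, all_but_guilt_esteem):
    print(pattern, [round(x, 3) for x in posterior(pattern, P, PI)])
```

Because Table 2 reports rounded estimates, the results agree with the slide values only up to rounding.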

Estimation Options

• Maximum Likelihood Approach

– Widely available

– Accepted approach

• Bayesian Approach

– Markov Chain Monte Carlo estimation

– Easily implemented in “WinBUGS” (Imperial College of Science, Technology and Medicine: http://www.mrc-bsu.cam.ac.uk/bugs/)

– Benefits:

• Model checking methods

• ‘Identifiability’ can be assessed (Garrett and Zeger, 2000)

MCMC approach allows estimation of ANY function of parameters and standard errors.

Bayesian Estimation Approach

The Gibbs Sampler is an iterative process used to estimate posterior distributions of parameters.

– We sample parameters from conditional distributions, e.g. P(π_1 | Y, p, η, π_2, π_3)

– At each iteration, we get ‘sampled’ values of p, π, and η.

– We use the samples from the iterations to estimate posterior distributions by averaging over other parameter values.

[Figure: estimated posterior density P(π_1 | y) plotted against π_1; horizontal axis from 0.10 to 0.18.]
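The sampling scheme described above can be sketched in a few lines. This is a minimal illustration, not the WinBUGS model used in the talk: it assumes uniform Beta(1, 1) priors on each prevalence and a uniform Dirichlet prior on the class sizes (the slides do not state the priors), and the function name `gibbs_lcm` and the simulated data are hypothetical.

```python
import random

def gibbs_lcm(Y, M, n_iter=100, seed=0):
    """Gibbs sampler for a latent class model with binary items (assumed uniform priors)."""
    rng = random.Random(seed)
    N, K = len(Y), len(Y[0])
    eta = [rng.randrange(M) for _ in range(N)]  # random starting class labels
    draws = []
    for _ in range(n_iter):
        # 1. pi | eta ~ Dirichlet(1 + class counts), drawn as normalized gammas.
        counts = [eta.count(m) for m in range(M)]
        g = [rng.gammavariate(1 + c, 1.0) for c in counts]
        pi = [x / sum(g) for x in g]
        # 2. p_km | eta, Y ~ Beta(1 + #present, 1 + #absent) within class m.
        p = []
        for m in range(M):
            members = [Y[i] for i in range(N) if eta[i] == m]
            row = []
            for k in range(K):
                present = sum(y[k] for y in members)
                row.append(rng.betavariate(1 + present, 1 + len(members) - present))
            p.append(row)
        # 3. eta_i | pi, p, y_i ~ categorical with the posterior class probabilities.
        for i in range(N):
            weights = []
            for m in range(M):
                lik = pi[m]
                for k in range(K):
                    lik *= p[m][k] if Y[i][k] else (1.0 - p[m][k])
                weights.append(lik)
            eta[i] = rng.choices(range(M), weights=weights)[0]
        draws.append({"pi": pi, "p": p, "eta": list(eta)})
    return draws

# Simulated data (hypothetical): 200 individuals, 6 symptoms, 2 classes.
sim = random.Random(1)
true_p = [[0.05] * 6, [0.70] * 6]
Y = []
for _ in range(200):
    m = 1 if sim.random() < 0.2 else 0
    Y.append([1 if sim.random() < q else 0 for q in true_p[m]])

draws = gibbs_lcm(Y, M=2, n_iter=100)
kept = draws[50:]  # discard the first half as burn-in
smaller_class = [min(d["pi"]) for d in kept]
print(sum(smaller_class) / len(smaller_class))
```

Taking the smaller class size in each draw sidesteps label switching in this toy example; a real analysis would need a more careful treatment of identifiability and convergence.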

Evaluating Depression Diagnosis

• Assumption: Treat the latent class model as our “gold standard” definition of depression.

• We can use the symptom responses to evaluate the DSM-IV diagnosis of depression

• Compare the DSM diagnosis to the latent class diagnosis using standard definitions:

• Assume two classes of depression

– Class 1 is non-depressed class

– Class 2 is depressed class

$$\text{Sensitivity} = P(\text{test} + \mid \text{truly depressed}) = P(\text{DSM-IV diagnosis} \mid \text{class 2})$$

$$\text{Specificity} = P(\text{test} - \mid \text{truly not depressed}) = P(\text{no DSM-IV diagnosis} \mid \text{class 1})$$

More specifically…

$$SE(2) = P(\text{diagnosed as depressed} \mid \eta_i = 2)$$

$$SP(1) = P(\text{diagnosed as not depressed} \mid \eta_i = 1)$$

$$SE(j) = \sum_{r \in R} P(y_i = y_r \mid \eta_i = j)$$

$$SP(j) = \sum_{r \in R^c} P(y_i = y_r \mid \eta_i = j)$$

where {y_r: r ∈ R} is the set of symptom patterns that are classified as a diagnosis by the DSM-IV, and R^c is its complement.
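With K = 10 binary symptoms there are only 2^10 = 1024 response patterns, so SE(j) and SP(j) can be computed by enumerating them. The sketch below uses the two-class estimates from Table 2 and a simplified, hypothetical diagnostic rule (at least five symptom groups present, including depressed mood or loss of interest) as a stand-in for the set R; it is not the full DSM-IV algorithm, and "<0.01" is again entered as 0.005.

```python
from itertools import product

# Two-class estimates from Table 2 (table order: movement ... depressed mood).
P = [
    [0.01, 0.05, 0.05, 0.03, 0.005, 0.01, 0.01, 0.02, 0.04, 0.02],  # class 1
    [0.43, 0.55, 0.65, 0.44, 0.25, 0.32, 0.48, 0.48, 0.65, 0.60],   # class 2
]
LOSS_OF_INTEREST, DEPRESSED_MOOD = 8, 9  # zero-based positions in table order

def pattern_prob(y, p_m):
    """P(y_i = y | class m) under conditional independence."""
    prob = 1.0
    for y_k, p_km in zip(y, p_m):
        prob *= p_km if y_k else (1.0 - p_km)
    return prob

def diagnosed(y):
    """Hypothetical stand-in for 'pattern is in R'; NOT the full DSM-IV rule."""
    core = y[DEPRESSED_MOOD] or y[LOSS_OF_INTEREST]
    return bool(core) and sum(y) >= 5

def se(j):
    """SE(j): total probability of diagnosed patterns given class index j."""
    return sum(pattern_prob(y, P[j]) for y in product((0, 1), repeat=10) if diagnosed(y))

def sp(j):
    """SP(j): total probability of non-diagnosed patterns given class index j."""
    return sum(pattern_prob(y, P[j]) for y in product((0, 1), repeat=10) if not diagnosed(y))

# Class indices are zero-based here: index 1 is class 2, index 0 is class 1.
print("SE(2) =", round(se(1), 3))
print("SP(1) =", round(sp(0), 3))
```

Because R and its complement partition the patterns, SE(j) + SP(j) = 1 for any fixed class j; the informative pair is SE for the depressed class and SP for the non-depressed class.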

Predictive Values

• Positive and Negative Predictive Values are simply transformations of SE and SP:

$$PPV(j) = P(\eta_i = j \mid \text{diagnosed as depressed}) = \frac{SE(j)\, \pi_j}{P(\text{diagnosed as depressed})}$$

$$NPV(j) = P(\eta_i = j \mid \text{diagnosed as not depressed}) = \frac{SP(j)\, \pi_j}{P(\text{diagnosed as not depressed})}$$
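The transformation from SE and SP to predictive values is Bayes' rule with the class sizes as prior probabilities. A minimal sketch follows; the class sizes are the three-class values from Table 2, but the SE values are made up purely for illustration and are not results from the talk.

```python
def predictive_values(se, pi):
    """PPV(j) and NPV(j) for each class j.

    se[j] = P(diagnosed as depressed | class j); pi[j] = size of class j.
    """
    sp = [1.0 - s for s in se]                    # P(not diagnosed | class j)
    p_pos = sum(s * w for s, w in zip(se, pi))    # P(diagnosed as depressed)
    p_neg = 1.0 - p_pos                           # P(diagnosed as not depressed)
    ppv = [s * w / p_pos for s, w in zip(se, pi)]
    npv = [s * w / p_neg for s, w in zip(sp, pi)]
    return ppv, npv

pi = [0.88, 0.09, 0.03]   # class sizes from Table 2 (three class model)
se = [0.001, 0.10, 0.90]  # hypothetical values, for illustration only

ppv, npv = predictive_values(se, pi)
print([round(x, 3) for x in ppv])
print([round(x, 3) for x in npv])
```

Within each diagnosis group the predictive values sum to one across classes, since every individual belongs to exactly one class.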

Class assignment?

• Complication: latent class model provides us with “posterior probabilities” of class membership. We don’t know the true latent classes, η, for individuals in the dataset.

• Example: M = 3

– Posterior probabilities of class membership for a particular symptom pattern are 0.48, 0.48, 0.04.

– To which class should this individual be assigned?

– How do we account for the uncertainty in the assignment?

One Approach to Class Assignment

• “Pseudo-classes” (Maximum Likelihood)

– assign individuals to “pseudo-classes” based on posterior probability of class membership (Bandeen-Roche et al., 1997)

– e.g. individual with posterior probabilities of 0.20, 0.05, 0.75

• better chance of being in class 3

• not necessarily in class 3

• Using class assignment, we can calculate sensitivity and specificity

• We can repeat assignment procedure T times, where T is large.

• On average, the sensitivity and specificity estimates will be correct.

• Drawback: we don’t get precision associated with estimates.

– Standard deviation of repeated estimates does not account for imprecision in estimates of p and π.

– Confidence intervals based on the T repeats will be too narrow.
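The pseudo-class procedure can be sketched as repeated random class assignment followed by a sensitivity calculation. The posterior probabilities and diagnosis flags below are toy values made up for illustration, and the function name is hypothetical.

```python
import random

def pseudo_class_sensitivity(post, diagnosed, target, T=1000, seed=0):
    """Average sensitivity for class `target` over T pseudo-class assignments.

    post[i] holds individual i's posterior class probabilities (treated as fixed);
    diagnosed[i] is 1 if the diagnostic criteria classify individual i as depressed.
    """
    rng = random.Random(seed)
    M = len(post[0])
    estimates = []
    for _ in range(T):
        # Draw one class per individual from his or her posterior probabilities.
        classes = [rng.choices(range(M), weights=w)[0] for w in post]
        flags = [d for c, d in zip(classes, diagnosed) if c == target]
        if flags:  # skip repeats in which nobody lands in the target class
            estimates.append(sum(flags) / len(flags))
    return sum(estimates) / len(estimates), estimates

# Toy data: 40 clearly non-depressed, 5 ambiguous, 5 clearly depressed individuals.
post = [[0.999, 0.001]] * 40 + [[0.87, 0.13]] * 5 + [[0.001, 0.999]] * 5
diagnosed = [0] * 40 + [0] * 5 + [1] * 5

mean_se, repeats = pseudo_class_sensitivity(post, diagnosed, target=1)
print(round(mean_se, 3))
```

The spread of `repeats` reflects only the assignment randomness, because the posterior probabilities are held fixed at their estimated values; this is the drawback noted above.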

MCMC Approach to Class Assignment

• η is a vector of parameters

• At each iteration in the Gibbs sampler, each parameter is drawn from its conditional distribution

– At each iteration in the Gibbs sampler, individuals are automatically assigned to classes, so there is no need to “manually” assign.

• For each of the W iterations of the chain, we can calculate sensitivity and specificity.

• Sensitivity and specificity are simply additional parameters.

– Due to the nature of the MCMC approach, the standard deviation of the posterior distribution of sensitivity represents its standard error.

– Precision estimates for sensitivity and specificity are valid.
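Treating sensitivity as an additional parameter amounts to computing it once per iteration from the sampled classes and then summarizing the resulting values. The sketch below assumes the sampled η vectors are already available as a list of lists; the three tiny "iterations" are made-up values used only to show the mechanics, and the crude index-based quantiles are a simplification.

```python
import statistics

def sensitivity_per_iteration(eta_draws, diagnosed, target):
    """One sensitivity value per MCMC iteration, using that iteration's sampled classes."""
    values = []
    for eta in eta_draws:
        flags = [d for c, d in zip(eta, diagnosed) if c == target]
        if flags:  # skip iterations in which the target class is empty
            values.append(sum(flags) / len(flags))
    return values

def summarize(values):
    """Posterior mean, standard deviation, and a crude 95% posterior interval."""
    s = sorted(values)
    n = len(s)
    lower = s[int(0.025 * (n - 1))]
    upper = s[int(round(0.975 * (n - 1)))]
    return {"mean": statistics.mean(s), "sd": statistics.stdev(s), "interval_95": (lower, upper)}

# Toy input: sampled classes for 4 individuals over 3 iterations (hypothetical).
eta_draws = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0]]
diagnosed = [1, 0, 0, 0]

sens = sensitivity_per_iteration(eta_draws, diagnosed, target=1)
print(summarize(sens))
```

In a real chain the list would contain W values after burn-in, and the same summary applies to specificity, PPV, and NPV.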

Operating Characteristics of Depression Diagnoses

• Several definitions of depression:

– DSM-III

– DSM-IV

– ICD-10a (mild)

– ICD-10b (moderate)

– ICD-10c (severe)

• We calculate sensitivity and specificity for each of five diagnoses (above) for models with M = 2 and M = 3.

• We do the same for PPV and NPV.

• Vertical lines represent 95% posterior intervals.

Interpreting results from three class model

• Diagnoses only have two possibilities: depressed or not depressed

• Two class model also has two possibilities.

• Three class model has a non-depressed class and two depression classes (mild and severe).

• Should we think of BOTH or just SEVERE as the “treatment class”?

• Why does it matter?

– Clinical decision making

– “Pre-clinical” depression?

• Which is better?

Misclassification probabilities for identifying “severe depression” using the DSM-IV criteria

                       Two-class model   Three-class model
P(false positive)      < 0.001           0.004
P(false negative)      0.035             0.002
P(misclassification)   0.035             0.006

Misclassification probabilities for identifying “any depression” using the DSM-IV criteria

                       Two-class model   Three-class model
P(false positive)      < 0.001           < 0.001
P(false negative)      0.035             0.078
P(misclassification)   0.035             0.078
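In the tables above, P(misclassification) is the sum of P(false positive) and P(false negative), which suggests that these are joint probabilities over the whole population. Under that reading (an assumption, since the slides do not spell out the definition), they can be computed from the class sizes and the per-class probability of a positive diagnosis. The SE values below are made up for illustration.

```python
def misclassification(se, pi, target_classes):
    """Joint false positive and false negative probabilities.

    se[m] = P(diagnosed as depressed | class m); pi[m] = size of class m.
    target_classes = indices of the classes the diagnosis is meant to identify.
    """
    fp = sum(w * s for m, (s, w) in enumerate(zip(se, pi)) if m not in target_classes)
    fn = sum(w * (1.0 - s) for m, (s, w) in enumerate(zip(se, pi)) if m in target_classes)
    return {"false_positive": fp, "false_negative": fn, "misclassification": fp + fn}

pi = [0.88, 0.09, 0.03]  # class sizes from Table 2 (three class model)
se = [0.0, 0.10, 0.90]   # hypothetical values, for illustration only

severe_only = misclassification(se, pi, target_classes={2})
any_depression = misclassification(se, pi, target_classes={1, 2})
print(severe_only)
print(any_depression)
```

Widening the target from the severe class to both depression classes moves the mild class's undiagnosed members from "correct" to "false negative", which is the pattern seen in the three-class column of the tables.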

Revisiting questions….

• Recall that the three class model was chosen over the two class model as more appropriate.

• We answer questions posed earlier by examining agreement of DSM-IV and the three class model.

What types of individuals are the diagnostic criteria diagnosing as depressed?

• DSM-IV tends to diagnose individuals who are in ‘class 3’ of the three class model (i.e. our severe depression class)

• The mildly depressed class tends to be ignored.

• Not necessarily a bad thing:

– DSM criteria are developed for deciding treatment.

– If mild depression does not require any treatment, then diagnosis of DSM-IV is adequate.

• But what if:

– Class 2 individuals (i.e., mildly depressed) would benefit from treatment.

– Class 2 is a “pre-clinical” class: intervention could prevent transition to severe depression

How often are individuals misdiagnosed?

• Assuming that diagnosis of severely depressed individuals is intent of DSM-IV, there is LOW probability of misclassification:

P(misclassification) = 0.006

• If intent is to diagnose ANY depression (i.e., mild or severe), then there is much higher probability of misclassification:

P(misclassification) = 0.078

(Note that of these 8%, almost all are false negatives)

What are the implications of a positive or negative diagnosis?

• The DSM-IV has high PPV for severe depression:

PPV(3) ≈ 0.90

• High NPV for no depression:

NPV(1) ≈ 0.90

• Essentially no information is provided as to an individual’s likelihood of mild depression given either a negative or a positive diagnosis:

PPV(2) ≈ 0.10

NPV(2) ≈ 0.10

Issues and Concerns

• Operating characteristics assume that two types of diagnosis being compared are determined independently.

– Methods of assessment are different

– But, large overlap of symptoms

– Possibly/probably not truly independent

Issues and Concerns

• Conditional independence of tests given simply presence or absence of disease is a common problem.

– Tests may be independent given “continuum” level of disease, but not when disease status is simply categorized.

– However, the latent class model does not definitively assign individuals to classes. Instead, posterior probability is estimated.

– Because individuals are assigned posterior probabilities, we can more easily think of a “continuum” of disease.

– This is true even in the case of classes which are not ordinal in nature, because the posterior probabilities for each class will be continuous.

Conclusions

• DSM-IV appears to be a valid approach for diagnoses of “severe” depression.

• There appears to be another class of milder depression that is not identified by any of the depression definitions.

• By using an MCMC approach to latent class model estimation, we can estimate operating characteristics of tests and their standard errors in a straightforward way.

• This approach can be used quite generally for other medical diagnoses

– Psychiatric diagnoses

– Arthritis

• More information?

– [email protected]

– http://astor.som.jhmi.edu/~esg/talks.html

– Garrett ES, Eaton WW, Zeger S. (2002) Methods for evaluating the performance of diagnostic tests in the absence of a gold standard: a latent class model approach. Statistics in Medicine, 21(9):1289-1307.
