4. Single Decision Treatment Regimes: Additional Methods
4.1 Optimal Regimes from a Classification Perspective
4.2 Outcome Weighted Learning
4.3 Interpretable Treatment Regimes via Decision Lists
4.4 Additional Approaches
4.5 Key References
Classification analogy
Premise:
• Estimation of an optimal regime can be viewed as a weighted classification problem
• This allows the vast literature on classification and machine learning to be exploited to define a restricted class of regimes and to estimate an optimal regime within it
• Rules characterizing regimes can be likened to classifiers, so they can involve high-dimensional parameterizations that are complex, flexible functions of individual characteristics
• Algorithms and software developed for classification problems can be exploited for implementation
• Demonstrated by Zhang et al. (2012) and Zhao et al. (2012)
Generic classification problem
• Z = outcome or class label; here Z takes values in {0, 1} (binary)
• X = vector of covariates or features taking values in X, the feature space
• d is a classifier: d : X → {0, 1}
• D is a family of classifiers; e.g., with X = (X1, X2)T:
  – Hyperplanes of the form d(X) = I(η11 + η12 X1 + η13 X2 > 0)
  – Rectangular regions of the form d(X) = I(X1 < η11) + I(X1 ≥ η11, X2 < η12)
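For concreteness, a minimal Python sketch of these two classifier families; the packaging into functions, the array shapes, and the argument names are illustrative assumptions, not part of the slides:

```python
import numpy as np

def hyperplane_classifier(X, eta):
    """d(X) = I(eta11 + eta12*X1 + eta13*X2 > 0), with eta = (eta11, eta12, eta13)."""
    eta = np.asarray(eta)
    return (eta[0] + X @ eta[1:] > 0).astype(int)

def rectangular_classifier(X, eta11, eta12):
    """d(X) = I(X1 < eta11) + I(X1 >= eta11, X2 < eta12); the two events are disjoint."""
    X1, X2 = X[:, 0], X[:, 1]
    return ((X1 < eta11) | ((X1 >= eta11) & (X2 < eta12))).astype(int)
```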
Generic classification problem
Implementation:
• Training set: (Xi, Zi), i = 1, ..., n
• Find a classifier d ∈ D that minimizes
  – Classification error
\[
\sum_{i=1}^{n} \{Z_i - d(X_i)\}^2 = \sum_{i=1}^{n} I\{Z_i \neq d(X_i)\}
\]
  – Weighted classification error
\[
\sum_{i=1}^{n} w_i \{Z_i - d(X_i)\}^2 = \sum_{i=1}^{n} w_i\, I\{Z_i \neq d(X_i)\}
\]
    for wi, i = 1, ..., n, fixed, known weights
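A minimal sketch of these two criteria, assuming Z, d(X), and w are NumPy arrays of {0, 1} labels, {0, 1} predictions, and nonnegative weights (names are illustrative):

```python
import numpy as np

def classification_error(Z, d_of_X):
    """sum_i I{Z_i != d(X_i)}; equals sum_i {Z_i - d(X_i)}^2 for labels in {0, 1}."""
    return int(np.sum(Z != d_of_X))

def weighted_classification_error(Z, d_of_X, w):
    """sum_i w_i * I{Z_i != d(X_i)} for fixed, known weights w_i."""
    return float(np.sum(w * (Z != d_of_X)))
```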
Generic classification problem
• This problem has been studied extensively by statisticians and computer scientists
• It is a form of supervised learning, an approach within the broad area of machine learning
• Many methods and software are available
• Recursive partitioning (CART): rectangular regions
• Support vector machines (SVM): hyperplanes (linear SVM), nonlinear SVM
Value search estimation, revisited
Zhang et al. (2012): A1 = {0, 1}, restricted class Dη
• Elements dη = {d1(h1; η1)}, optimal restricted regime
\[
d^{opt}_{\eta} = \{d_1(h_1;\, \eta_1^{opt})\}, \qquad \eta_1^{opt} = \arg\max_{\eta_1} \mathcal{V}(d_\eta)
\]
• AIPW estimator (3.43) for V(dη) for fixed η = η1
\[
\mathcal{V}_{AIPW}(d_\eta) = n^{-1} \sum_{i=1}^{n} \left[ \frac{\mathcal{C}_{d_\eta,i}\, Y_i}{\pi_{d_\eta,1}(H_{1i};\, \eta_1, \gamma_1)} - \frac{\mathcal{C}_{d_\eta,i} - \pi_{d_\eta,1}(H_{1i};\, \eta_1, \gamma_1)}{\pi_{d_\eta,1}(H_{1i};\, \eta_1, \gamma_1)}\, Q_{d_\eta,1}(H_{1i};\, \eta_1, \beta_1) \right]
\]
where
\[
\begin{aligned}
\mathcal{C}_{d_\eta} &= I\{A_1 = d_1(H_1;\eta_1)\} = A_1\, I\{d_1(H_1;\eta_1) = 1\} + (1 - A_1)\, I\{d_1(H_1;\eta_1) = 0\} \\
\pi_{d_\eta,1}(H_1;\, \eta_1, \gamma_1) &= \pi_1(H_1;\gamma_1)\, I\{d_1(H_1;\eta_1) = 1\} + \{1 - \pi_1(H_1;\gamma_1)\}\, I\{d_1(H_1;\eta_1) = 0\} \\
Q_{d_\eta,1}(H_1;\, \eta_1, \beta_1) &= Q_1(H_1, 1; \beta_1)\, I\{d_1(H_1;\eta_1) = 1\} + Q_1(H_1, 0; \beta_1)\, I\{d_1(H_1;\eta_1) = 0\}
\end{aligned}
\]
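As a point of reference, a minimal sketch of computing VAIPW(dη) for one fixed rule, assuming the rule values d1(H1i; η1), the fitted propensities π1(H1i; γ1), and the fitted outcome regressions Q1(H1i, 1; β1), Q1(H1i, 0; β1) have already been computed and stored as NumPy arrays; all variable names are illustrative assumptions:

```python
import numpy as np

def value_aipw(Y, A1, d1, pi1, Q1_1, Q1_0):
    """AIPW estimator of the value V(d_eta) for a fixed rule.

    d1         : array of d_1(H_1i; eta_1) in {0, 1}
    pi1        : fitted propensities pi_1(H_1i; gamma_1)
    Q1_1, Q1_0 : fitted regressions Q_1(H_1i, 1; beta_1), Q_1(H_1i, 0; beta_1)
    """
    C_d  = np.where(d1 == 1, A1, 1 - A1)     # C_{d_eta,i} = I{A_1i = d_1(H_1i; eta_1)}
    pi_d = np.where(d1 == 1, pi1, 1 - pi1)   # pi_{d_eta,1}(H_1i; eta_1, gamma_1)
    Q_d  = np.where(d1 == 1, Q1_1, Q1_0)     # Q_{d_eta,1}(H_1i; eta_1, beta_1)
    return float(np.mean(C_d * Y / pi_d - (C_d - pi_d) / pi_d * Q_d))
```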
Value search estimation, revisited
Estimator for $d^{opt}_{\eta}$: $d^{opt}_{\eta,AIPW} = \{d_1(h_1;\, \eta^{opt}_{1,AIPW})\}$
• $\eta^{opt}_{1,AIPW}$ maximizes $\mathcal{V}_{AIPW}(d_\eta)$ in $\eta_1$

Algebra:
\[
\begin{aligned}
\frac{\mathcal{C}_{d_\eta}\, Y}{\pi_{d_\eta,1}(H_1;\, \eta_1, \gamma_1)}
&= \frac{[A_1\, I\{d_1(H_1;\eta_1) = 1\} + (1 - A_1)\, I\{d_1(H_1;\eta_1) = 0\}]\, Y}{\pi_1(H_1;\gamma_1)\, I\{d_1(H_1;\eta_1) = 1\} + \{1 - \pi_1(H_1;\gamma_1)\}\, I\{d_1(H_1;\eta_1) = 0\}} \\[4pt]
&= \frac{A_1 Y}{\pi_1(H_1;\gamma_1)}\, I\{d_1(H_1;\eta_1) = 1\} + \frac{(1 - A_1) Y}{1 - \pi_1(H_1;\gamma_1)}\, I\{d_1(H_1;\eta_1) = 0\}
\end{aligned}
\]
\[
\begin{aligned}
\frac{\mathcal{C}_{d_\eta} - \pi_{d_\eta,1}(H_1;\, \eta_1, \gamma_1)}{\pi_{d_\eta,1}(H_1;\, \eta_1, \gamma_1)}\, Q_{d_\eta,1}(H_1;\, \eta_1, \beta_1)
&= \frac{A_1 - \pi_1(H_1;\gamma_1)}{\pi_1(H_1;\gamma_1)}\, Q_1(H_1, 1; \beta_1)\, I\{d_1(H_1;\eta_1) = 1\} \\
&\quad - \frac{A_1 - \pi_1(H_1;\gamma_1)}{1 - \pi_1(H_1;\gamma_1)}\, Q_1(H_1, 0; \beta_1)\, I\{d_1(H_1;\eta_1) = 0\}
\end{aligned}
\]
Classification analogy
Define:
\[
\psi_1(H_1, A_1, Y) = \frac{A_1 Y}{\pi_1(H_1)} - \frac{A_1 - \pi_1(H_1)}{\pi_1(H_1)}\, Q_1(H_1, 1), \tag{4.1}
\]
\[
\psi_0(H_1, A_1, Y) = \frac{(1 - A_1) Y}{1 - \pi_1(H_1)} + \frac{A_1 - \pi_1(H_1)}{1 - \pi_1(H_1)}\, Q_1(H_1, 0) \tag{4.2}
\]
• Under SUTVA, NUC, and positivity,
\[
E\{\psi_1(H_1, A_1, Y) \mid H_1\} = Q_1(H_1, 1), \qquad E\{\psi_0(H_1, A_1, Y) \mid H_1\} = Q_1(H_1, 0)
\]
• Thus
\[
E\{\psi_1(H_1, A_1, Y) - \psi_0(H_1, A_1, Y) \mid H_1\} = C_1(H_1) = Q_1(H_1, 1) - Q_1(H_1, 0),
\]
the contrast function (3.33)
Classification analogy
Thus, by all of this algebra, we can write
\[
\mathcal{V}_{AIPW}(d_\eta) = n^{-1} \sum_{i=1}^{n} \Big[ \psi_1(H_{1i}, A_{1i}, Y_i)\, I\{d_1(H_{1i};\eta_1) = 1\} + \psi_0(H_{1i}, A_{1i}, Y_i)\, I\{d_1(H_{1i};\eta_1) = 0\} \Big]
\]
• ψ1(H1i, A1i, Yi) and ψ0(H1i, A1i, Yi) are (4.1) and (4.2) evaluated at (H1i, A1i, Yi) with the fitted models Q1(H1, 1; β1), Q1(H1, 0; β1), and π1(H1; γ1) substituted
• Rewrite using I{d1(H1; η1) = 1} = d1(H1; η1) and I{d1(H1; η1) = 0} = 1 − d1(H1; η1)
Classification analogy
By further algebra, VAIPW(dη) can be expressed as
\[
\begin{aligned}
\mathcal{V}_{AIPW}(d_\eta)
&= n^{-1} \sum_{i=1}^{n} \Big[ \psi_1(H_{1i}, A_{1i}, Y_i)\, d_1(H_{1i};\eta_1) + \psi_0(H_{1i}, A_{1i}, Y_i)\, \{1 - d_1(H_{1i};\eta_1)\} \Big] \\
&= n^{-1} \sum_{i=1}^{n} \Big[ d_1(H_{1i};\eta_1)\, \{\psi_1(H_{1i}, A_{1i}, Y_i) - \psi_0(H_{1i}, A_{1i}, Y_i)\} + \psi_0(H_{1i}, A_{1i}, Y_i) \Big] \\
&= n^{-1} \sum_{i=1}^{n} \Big\{ d_1(H_{1i};\eta_1)\, C_1(H_{1i}, A_{1i}, Y_i) + \psi_0(H_{1i}, A_{1i}, Y_i) \Big\}
\end{aligned}
\]
• Predictor of the contrast function:
\[
C_1(H_{1i}, A_{1i}, Y_i) = \psi_1(H_{1i}, A_{1i}, Y_i) - \psi_0(H_{1i}, A_{1i}, Y_i)
\]
Classification analogy
Result: Maximizing VAIPW(dη) in η1 is equivalent to maximizing
\[
n^{-1} \sum_{i=1}^{n} d_1(H_{1i};\eta_1)\, C_1(H_{1i}, A_{1i}, Y_i)
\]
More algebra: Using a = I(a > 0)|a| − I(a ≤ 0)|a| for any a and writing dη1,1i = d1(H1i; η1), C1i = C1(H1i, A1i, Yi),
\[
\begin{aligned}
d_{\eta_1,1i}\, C_{1i} &= d_{\eta_1,1i}\, I(C_{1i} > 0)\, |C_{1i}| - d_{\eta_1,1i}\, I(C_{1i} \le 0)\, |C_{1i}| \\
&= I(C_{1i} > 0)\, |C_{1i}| - |C_{1i}|\, \{(1 - d_{\eta_1,1i})\, I(C_{1i} > 0) + d_{\eta_1,1i}\, I(C_{1i} \le 0)\} \\
&= I(C_{1i} > 0)\, |C_{1i}| - |C_{1i}|\, \{I(C_{1i} > 0) - d_{\eta_1,1i}\}^2,
\end{aligned}
\]
where the last equality follows because dη1,1i and the indicators take values in {0, 1}. Thus
\[
d_1(H_{1i};\eta_1)\, C_1(H_{1i}, A_{1i}, Y_i) = I\{C_1(H_{1i}, A_{1i}, Y_i) > 0\}\, |C_1(H_{1i}, A_{1i}, Y_i)| - |C_1(H_{1i}, A_{1i}, Y_i)|\, \big[ I\{C_1(H_{1i}, A_{1i}, Y_i) > 0\} - d_1(H_{1i};\eta_1) \big]^2
\]
Classification analogy
Final result: Maximizing VAIPW(dη) in η1 is equivalent to minimizing in η1
\[
\begin{aligned}
& n^{-1} \sum_{i=1}^{n} |C_1(H_{1i}, A_{1i}, Y_i)|\, \big[ I\{C_1(H_{1i}, A_{1i}, Y_i) > 0\} - d_1(H_{1i};\eta_1) \big]^2 \\
&\qquad = n^{-1} \sum_{i=1}^{n} |C_1(H_{1i}, A_{1i}, Y_i)|\, I\big[ I\{C_1(H_{1i}, A_{1i}, Y_i) > 0\} \ne d_1(H_{1i};\eta_1) \big] \qquad (4.3)
\end{aligned}
\]
• A weighted classification error with
  – "Label" I{C1(H1i, A1i, Yi) > 0} (Zi)
  – "Weight" |C1(H1i, A1i, Yi)| (wi)
  – "Classifier" d1(h1; η1) (d)
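A minimal sketch of this weighted-classification recipe, using scikit-learn's classification tree (CART) as the off-the-shelf classifier; the inputs are as in the earlier sketch, the choice of CART and all variable names are illustrative assumptions, and CART greedily builds a tree using the weights rather than exactly minimizing (4.3):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def classification_estimator(H1, A1, Y, pi1, Q1_1, Q1_0, max_depth=2):
    """Estimate an optimal regime by weighted classification, in the spirit of Zhang et al. (2012).

    Builds the AIPW contrast predictor C1 = psi1 - psi0 from (4.1)-(4.2), then fits a
    classifier to label I{C1 > 0} with weight |C1|, as in (4.3).
    """
    psi1 = A1 * Y / pi1 - (A1 - pi1) / pi1 * Q1_1
    psi0 = (1 - A1) * Y / (1 - pi1) + (A1 - pi1) / (1 - pi1) * Q1_0
    contrast = psi1 - psi0                      # predictor of the contrast C1(H1)
    label = (contrast > 0).astype(int)          # plays the role of Z_i
    weight = np.abs(contrast)                   # plays the role of w_i
    clf = DecisionTreeClassifier(max_depth=max_depth)
    clf.fit(H1, label, sample_weight=weight)    # weighted classification fit
    return clf                                  # clf.predict(h1) plays the role of d1(h1; eta1)
```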
Classification analogy
Intuitive interpretation of (4.3):
• From (3.34), dopt ∈ D satisfies
\[
d_1^{opt}(h_1) = I\{C_1(h_1) > 0\}
\]
• The second term in (4.3) compares a predictor of the option selected by the global dopt to that selected by a rule in Dη
• The "weight" |C1(H1i, A1i, Yi)| in (4.3) places greater importance on contributions from individuals for whom the absolute difference in expected outcomes for options 0 and 1 is large
Classification analogy
Similarly: An analogous argument applies to the IPW estimator
\[
\mathcal{V}_{IPW}(d_\eta) = n^{-1} \sum_{i=1}^{n} \frac{\mathcal{C}_{d_\eta,i}\, Y_i}{\pi_{d_\eta,1}(H_{1i};\, \eta_1, \gamma_1)}
\]
• It can be shown that the same formulation applies with
\[
\psi_1(H_1, A_1, Y) = \frac{A_1 Y}{\pi_1(H_1)}, \qquad \psi_0(H_1, A_1, Y) = \frac{(1 - A_1) Y}{1 - \pi_1(H_1)},
\]
\[
C_1(H_{1i}, A_{1i}, Y_i) = \psi_1(H_{1i}, A_{1i}, Y_i) - \psi_0(H_{1i}, A_{1i}, Y_i) = \frac{A_{1i} Y_i}{\pi_1(H_{1i};\gamma_1)} - \frac{(1 - A_{1i}) Y_i}{1 - \pi_1(H_{1i};\gamma_1)}
\]
Classification analogy
Summary: Value (policy) search estimation of an optimal restricted regime dηopt by maximizing VIPW(dη) or VAIPW(dη) is equivalent to minimizing a weighted classification error
• The choice of classification approach dictates the restricted class Dη
• Can be implemented using off-the-shelf software and algorithms for classification problems
• E.g., CART, SVM

However: This analogy does not circumvent the need to optimize a nonsmooth function of η1
Demonstration
Decision function: Write d1(h1; η1) = I{f1(h1; η1) > 0}
• E.g.,
\[
f_1(h_1; \eta_1) = \eta_{11} + \eta_{12}\, x_{11} + \eta_{13}\, x_{12}
\]
• By algebra, (4.3) can be written as
\[
n^{-1} \sum_{i=1}^{n} |C_1(H_{1i}, A_{1i}, Y_i)|\, I\big[ I\{C_1(H_{1i}, A_{1i}, Y_i) > 0\} \ne d_1(H_{1i};\eta_1) \big]
= n^{-1} \sum_{i=1}^{n} |C_1(H_{1i}, A_{1i}, Y_i)|\; \ell_{0\text{-}1}\Big( \big[ 2\, I\{C_1(H_{1i}, A_{1i}, Y_i) > 0\} - 1 \big]\, f_1(H_{1i};\eta_1) \Big)
\]
in terms of the 0-1 loss function
\[
\ell_{0\text{-}1}(x) = I(x \le 0)
\]
Demonstration
Source of difficulty: The 0-1 loss function
\[
\ell_{0\text{-}1}(x) = I(x \le 0)
\]
appearing in the objective above is nonconvex
• Optimization involving nonconvex loss functions is challenging; standard techniques cannot be used
• This problem has been well studied in the classification literature
• E.g., with SVM, replace ℓ0-1(x) by a convex "surrogate" such as the hinge loss function
\[
\ell_{hinge}(x) = (1 - x)_+, \qquad x_+ = \max(0, x)
\]
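Written out as a small sketch, where x plays the role of the margin argument, e.g., [2 I{C1(H1i, A1i, Yi) > 0} − 1] f1(H1i; η1) above (the function names are illustrative):

```python
import numpy as np

def loss_01(x):
    """0-1 loss: l(x) = I(x <= 0)."""
    return np.where(np.asarray(x) <= 0, 1.0, 0.0)

def loss_hinge(x):
    """Hinge loss: l(x) = (1 - x)_+ = max(0, 1 - x), a convex surrogate for the 0-1 loss."""
    return np.maximum(0.0, 1.0 - np.asarray(x))
```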
[Figure: Hinge loss ℓhinge(x) = (1 − x)+ vs. 0-1 loss ℓ0-1(x) = I(x ≤ 0), plotted as functions of x]
4.2 Outcome Weighted Learning
Original formulation
Zhao et al. (2012): An approach based on the IPW estimator
\[
\mathcal{V}_{IPW}(d_\eta) = n^{-1} \sum_{i=1}^{n} \frac{\mathcal{C}_{d_\eta,i}\, Y_i}{\pi_{d_\eta,1}(H_{1i};\, \eta_1, \gamma_1)}
\]
• Assume that Y is bounded and Y ≥ 0
• Can be developed as a special case of the classification formulation with
\[
\psi_1(H_1, A_1, Y) = \frac{A_1 Y}{\pi_1(H_1)}, \qquad \psi_0(H_1, A_1, Y) = \frac{(1 - A_1) Y}{1 - \pi_1(H_1)},
\]
\[
C_1(H_{1i}, A_{1i}, Y_i) = \psi_1(H_{1i}, A_{1i}, Y_i) - \psi_0(H_{1i}, A_{1i}, Y_i)
= \frac{A_{1i} Y_i}{\pi_1(H_{1i};\gamma_1)} - \frac{(1 - A_{1i}) Y_i}{1 - \pi_1(H_{1i};\gamma_1)}
= \frac{Y_i\, \{A_{1i} - \pi_1(H_{1i};\gamma_1)\}}{\pi_1(H_{1i};\gamma_1)\, \{1 - \pi_1(H_{1i};\gamma_1)\}}
\]
As a special case
• With Y ≥ 0 and the positivity assumption,
\[
I\{C_1(H_{1i}, A_{1i}, Y_i) > 0\} = I(A_{1i} = 1) = A_{1i}
\]
• By considering the cases A1i = 1 and A1i = 0,
\[
|C_1(H_{1i}, A_{1i}, Y_i)| = \frac{Y_i}{A_{1i}\, \pi_1(H_{1i};\gamma_1) + (1 - A_{1i})\, \{1 - \pi_1(H_{1i};\gamma_1)\}}
\]
• Substituting in (4.3), maximizing VIPW(dη) in η1 is equivalent to minimizing
\[
n^{-1} \sum_{i=1}^{n} \underbrace{\frac{Y_i}{A_{1i}\, \pi_1(H_{1i};\gamma_1) + (1 - A_{1i})\, \{1 - \pi_1(H_{1i};\gamma_1)\}}}_{\text{``Weight''}}\; I\{A_{1i} \ne d_1(H_{1i};\eta_1)\} \qquad (4.4)
\]
• "Label" A1i, "classifier" d1(h1; η1)
Original formulation
Randomized study: With known propensity
\[
\pi_1(H_1) = P(A_1 = 1 \mid H_1) = P(A_1 = 1) = \pi_1
\]
• Recode the treatment options as A1 = {−1, 1}
• Write d1(h1; η1) = sign{f1(h1; η1)} for a decision function f1(h1; η1)

Weighted classification error: (4.4) can be rewritten as
\[
n^{-1} \sum_{i=1}^{n} \frac{Y_i}{A_{1i}\, \pi_1 + (1 - A_{1i})/2}\; I\{A_{1i} \ne d_1(H_{1i};\eta_1)\}
= n^{-1} \sum_{i=1}^{n} \frac{Y_i}{A_{1i}\, \pi_1 + (1 - A_{1i})/2}\; I\big[ A_{1i} \ne \mathrm{sign}\{f_1(H_{1i};\eta_1)\} \big]
\]
• Involves the 0-1 loss function
\[
I\big[ A_{1i} \ne \mathrm{sign}\{f_1(H_{1i};\eta_1)\} \big] = I\{A_{1i}\, f_1(H_{1i};\eta_1) \le 0\} = \ell_{0\text{-}1}\{A_{1i}\, f_1(H_{1i};\eta_1)\}
\]
Outcome weighted learning (OWL)
Minimize:
\[
n^{-1} \sum_{i=1}^{n} \frac{Y_i}{A_{1i}\, \pi_1 + (1 - A_{1i})/2}\; \ell_{0\text{-}1}\{A_{1i}\, f_1(H_{1i};\eta_1)\}
\]
Original OWL (Zhao et al., 2012):
• Restricted class Dη induced by a linear or nonlinear SVM
• Replace the 0-1 loss by the convex surrogate hinge loss function
\[
\ell_{hinge}(x) = (1 - x)_+, \qquad x_+ = \max(0, x)
\]
• Minimize in η1 the penalized objective function
\[
n^{-1} \sum_{i=1}^{n} \frac{Y_i}{A_{1i}\, \pi_1 + (1 - A_{1i})/2}\; \{1 - A_{1i}\, f_1(H_{1i};\eta_1)\}_+ + \lambda_n \|f_1\|^2 \qquad (4.5)
\]
• Flexible, highly parameterized representation of f1(h1; η1), with a penalty to protect against overfitting
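A minimal sketch of OWL using scikit-learn's SVM as the weighted hinge-loss solver; the weighted, L2-penalized hinge criterion that SVC minimizes corresponds to the form of (4.5) up to how the penalty is parameterized, and the variable names and tuning choices are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

def owl_estimator(H1, A1, Y, pi1, kernel="linear", C=1.0):
    """Outcome weighted learning in the spirit of Zhao et al. (2012), via a weighted SVM.

    Labels are the treatments actually received (recoded to {-1, +1}); each subject
    is weighted by Y_i / {A_1i * pi1 + (1 - A_1i) * (1 - pi1)}, assuming Y >= 0.
    """
    labels = 2 * A1 - 1                                  # recode A1 in {0, 1} to {-1, +1}
    weights = Y / (A1 * pi1 + (1 - A1) * (1 - pi1))      # the "weight" in (4.4)
    svm = SVC(kernel=kernel, C=C)
    svm.fit(H1, labels, sample_weight=weights)           # weighted hinge loss + L2 penalty
    return svm                                           # svm.predict(h1) plays the role of sign{f1(h1)}
```

Per the remarks that follow, minimizing this hinge-loss surrogate is a different optimization problem from minimizing the original objective (4.4) directly.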
Outcome weighted learning
Remarks:
• A similar approach (flexible f1(h1; η1), penalization) can be taken with the full AIPW formulation
• Important: Minimizing the original objective (4.4) and minimizing (4.5) with the hinge loss substituted are different optimization problems and will lead to different η1opt and thus different estimated optimal regimes
• Similarly for the AIPW formulation
• Simulation evidence suggests this might not be such a big deal in practice; the resulting estimated optimal regimes perform well

Refinements and extensions of OWL: Zhou et al. (2017), Liu et al. (2018)
4.3 Interpretable Treatment Regimes via Decision Lists
Flexibility vs. interpretability
Classification approach: Flexible representation
• Complex, highly parameterized estimated decision rules
• Pro: Can synthesize high-dimensional patient information and achieve performance close to a true optimal regime dopt ∈ D
• Con: Difficult to interpret, "black box," hard to glean new scientific insights

Opposing view: Emphasize parsimony and interpretability
• Deliberately focus on Dη with rules that can be understood by clinicians and patients
• Pro: Accessibility, more readily accepted, can generate new scientific insights and hypotheses
• Con: Optimal such regimes may not approach the performance of dopt ∈ D
Decision rules as decision lists
Zhang et al. (2015): Focus on Dη with decision rules characterizing regimes in the form of a decision list
• Decision list: A sequence of if-then clauses
• "If" is a condition involving patient information that, if true, leads to selection of an option a1 ∈ A1
• Natural for A1 with m1 ≥ 2 options

Example: Acute leukemia, A1 = {C1, C2} = {0, 1}
• Rule d1(h1) = I(age < 50 and WBC < 10) as a list:

  If age < 50 and WBC < 10 then C2;
  else C1
Decision rules as decision lists
Fancier example: Acute leukemia, A1 = {C1, C2, C3}

  If age < 50 and WBC < 10 then C2;
  else if age ≥ 50 then C1;
  else C3                                                    (4.6)

[Figure: partition of the (age, WBC) plane induced by (4.6): C1 for age ≥ 50; C2 for age < 50 and WBC < 10; C3 for age < 50 and WBC ≥ 10]
Decision rules as decision lists
Fancier still:
If age < 50 and ECOG < 2 then C2;
else if WBC ≥ 20 then C1;
else if PLAT > 300 then C1;
else C3.
(4.7)
Decision rules as decision lists
In general: A1 with m1 ≥ 2 options; a rule in the form of a decision list of length L1 is

  If c11 then a11;
  else if c12 then a12;
  ...
  else if c1L1 then a1L1;
  else a10

• Summarized as {(c11, a11), ..., (c1L1, a1L1), a10}
• For example, in (4.7), L1 = 3, c11 = {age < 50 and ECOG < 2}, and a11 = C2
• Treatment options can be repeated in different clauses
• L1 = 0 corresponds to a static regime
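A minimal sketch of a decision list as a data structure, with clauses stored as (condition, option) pairs and h1 represented as a dict of patient variables; the representation and names are illustrative assumptions, not the authors' implementation:

```python
def make_decision_list(clauses, default):
    """Decision list {(c_11, a_11), ..., (c_1L1, a_1L1), a_10}.

    `clauses` is an ordered list of (condition, option) pairs, where each condition
    is a predicate on h1; `default` is the final 'else' option a_10.
    """
    def d1(h1):
        for condition, option in clauses:      # the first true condition determines the option
            if condition(h1):
                return option
        return default
    return d1

# The rule (4.7) written as a decision list:
d1 = make_decision_list(
    clauses=[
        (lambda h1: h1["age"] < 50 and h1["ECOG"] < 2, "C2"),
        (lambda h1: h1["WBC"] >= 20, "C1"),
        (lambda h1: h1["PLAT"] > 300, "C1"),
    ],
    default="C3",
)
print(d1({"age": 45, "ECOG": 1, "WBC": 8, "PLAT": 250}))   # -> C2
```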
Basic formulation
Can mathematize: Define
\[
\mathcal{T}_1(c_{1\ell}) = \{h_1 : c_{1\ell} \text{ is true}\}, \quad \ell = 1, \ldots, L_1,
\]
\[
\mathcal{R}_{11} = \mathcal{T}_1(c_{11}), \qquad
\mathcal{R}_{1\ell} = \Big\{ \bigcap_{j<\ell} \mathcal{T}_1(c_{1j})^c \Big\} \bigcap \mathcal{T}_1(c_{1\ell}), \quad \ell = 2, \ldots, L_1, \qquad
\mathcal{R}_{10} = \bigcap_{j=1}^{L_1} \mathcal{T}_1(c_{1j})^c
\]
• Each R1ℓ, ℓ = 0, ..., L1, represents the conditions that must be satisfied for an individual to receive option a1ℓ
• For the diligent student: Determine the sets R11, R12, R13, and R10 for the example in (4.7)
• Clearly, a given h1 can belong to at most one set R1ℓ, ℓ = 0, 1, ..., L1
Basic formulation
Treatment regime: Decision rules of the form
\[
d_1(h_1) = \sum_{\ell=0}^{L_1} a_{1\ell}\, I(h_1 \in \mathcal{R}_{1\ell}) \tag{4.8}
\]
• Characterized by {(c11, a11), ..., (c1L1, a1L1), a10}

Zhang et al. (2015): For parsimony and interpretability, restrict to c1ℓ involving at most 2 components of h1
• For h1 with p1 components and j1 < j2 ∈ {1, ..., p1}, restrict to T1(c1ℓ) of any of the forms
\[
\begin{array}{ll}
\{h_1 : h_{1j_1} \le \tau_{11}\} & \{h_1 : h_{1j_1} \le \tau_{11} \text{ or } h_{1j_2} \le \tau_{12}\}\\
\{h_1 : h_{1j_1} \le \tau_{11} \text{ and } h_{1j_2} \le \tau_{12}\} & \{h_1 : h_{1j_1} \le \tau_{11} \text{ or } h_{1j_2} > \tau_{12}\}\\
\{h_1 : h_{1j_1} \le \tau_{11} \text{ and } h_{1j_2} > \tau_{12}\} & \{h_1 : h_{1j_1} > \tau_{11} \text{ or } h_{1j_2} \le \tau_{12}\}\\
\{h_1 : h_{1j_1} > \tau_{11} \text{ and } h_{1j_2} \le \tau_{12}\} & \{h_1 : h_{1j_1} > \tau_{11} \text{ or } h_{1j_2} > \tau_{12}\}\\
\{h_1 : h_{1j_1} > \tau_{11} \text{ and } h_{1j_2} > \tau_{12}\} & \{h_1 : h_{1j_1} > \tau_{11}\}
\end{array} \tag{4.9}
\]
Regimes
Restricted class Dη: Define η1 to be a collection
\[
\{(c_{11}, a_{11}), \ldots, (c_{1L_1}, a_{1L_1}), a_{10}\}
\]
with conditions c1ℓ as in one of the T1(c1ℓ) in (4.9)
• A rule as in (4.8) can then be written as d1(h1; η1), and Dη comprises all regimes with rules of this form
• Feature: There is no need to collect all patient variables up front; they can be ascertained as needed, which is useful if some are expensive or burdensome to collect
Nonuniqueness
A decision rule can be represented by more than one list:
• For a decision list with η1 = {(c11, a11), ..., (c1L1, a1L1), a10} and decision rule d1(h1; η1), there may exist another decision list with η′1 = {(c′11, a′11), ..., (c′1L′1, a′1L′1), a′10} and decision rule d1(h1; η′1) such that d1(h1; η1) = d1(h1; η′1) for all h1, but L1 ≠ L′1, or L1 = L′1 but c1j ≠ c′1j or a1j ≠ a′1j for some j = 1, ..., L1
Nonuniqueness
Example: (4.6) and an equivalent alternative:

  If age < 50 and WBC < 10 then C2;        If age ≥ 50 then C1;
  else if age ≥ 50 then C1;                else if WBC < 10 then C2;
  else C3                                  else C3

[Figure: both lists induce the same partition of the (age, WBC) plane as in (4.6)]
Optimal regime
Value of a regime: For any regime dη ∈ Dη, V(dη) is the same regardless of which version of dη is considered
• Optimal regime dηopt:
\[
d_1(h_1; \eta_1^{opt}), \qquad \eta_1^{opt} = \arg\max_{\eta_1} \mathcal{V}(d_\eta)
\]
• Suggests: If there are equivalent versions of dηopt, estimate the version that is least costly/burdensome to implement
• Value search: Maximize VAIPW(dη) on Slide 185 subject to targeting the version of an optimal regime minimizing a measure of "cost"
Optimal regime
Value search: As on Slide 185, with A1 = {1, ..., m1},
\[
\pi_{d_\eta,1}(H_1;\, \eta_1, \gamma_1) = \sum_{a_1=1}^{m_1} I\{d_1(H_1;\eta_1) = a_1\}\, \omega_1(H_1, a_1; \gamma_1),
\]
\[
Q_{d_\eta,1}(H_1;\, \eta_1, \beta_1) = \sum_{a_1=1}^{m_1} I\{d_1(H_1;\eta_1) = a_1\}\, Q_1(H_1, a_1; \beta_1),
\]
\[
\begin{aligned}
\mathcal{V}_{AIPW}(d_\eta)
&= n^{-1} \sum_{i=1}^{n} \left[ \frac{\mathcal{C}_{d_\eta,i}\, Y_i}{\pi_{d_\eta,1}(H_{1i};\, \eta_1, \gamma_1)} - \frac{\mathcal{C}_{d_\eta,i} - \pi_{d_\eta,1}(H_{1i};\, \eta_1, \gamma_1)}{\pi_{d_\eta,1}(H_{1i};\, \eta_1, \gamma_1)}\, Q_{d_\eta,1}(H_{1i};\, \eta_1, \beta_1) \right] \\
&= n^{-1} \sum_{i=1}^{n} \sum_{a_1=1}^{m_1} \left( \left[ \frac{I(A_{1i} = a_1)}{\omega_1(H_{1i}, a_1; \gamma_1)}\, \{Y_i - Q_1(H_{1i}, a_1; \beta_1)\} + Q_1(H_{1i}, a_1; \beta_1) \right] I\{d_1(H_{1i};\eta_1) = a_1\} \right)
\end{aligned}
\]
as in (2) of Zhang et al. (2015)
Optimal regime
Cost: If N1ℓ = cost of measuring the components of h1 necessary to check c11, ..., c1ℓ (fewer comparisons to thresholds is better), the cost of implementing a regime dη with rule as in (4.8),
\[
d_1(h_1; \eta_1) = \sum_{\ell=0}^{L_1} a_{1\ell}\, I(h_1 \in \mathcal{R}_{1\ell}),
\]
in the population is
\[
\mathcal{N}_1(d_\eta) = \sum_{\ell=1}^{L_1} N_{1\ell}\, P(H_1 \in \mathcal{R}_{1\ell}) + N_{1L_1}\, P(H_1 \in \mathcal{R}_{10})
\]
• Zhang et al. (2015) describe a computational algorithm to maximize VAIPW(dη) while minimizing an estimator of the cost N1(dη)
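A minimal sketch of estimating N1(dη) from a sample, replacing each P(H1 ∈ R1ℓ) by the empirical proportion of subjects whose first true clause is clause ℓ; the inputs (clause predicates, per-clause costs N1ℓ, and patient records as dicts) are assumptions for illustration, not the authors' algorithm:

```python
import numpy as np

def estimated_cost(conditions, costs, patients):
    """Estimate N1(d_eta) = sum_l N_1l P(H1 in R_1l) + N_1L1 P(H1 in R_10).

    `conditions` is the ordered list of clause predicates c_11, ..., c_1L1;
    `costs` is [N_11, ..., N_1L1], the cost of checking clauses 1 through l;
    `patients` is a list of h1 dicts; the final 'else' group incurs cost N_1L1.
    """
    L1 = len(conditions)
    clause_counts = np.zeros(L1 + 1)                   # counts for R_11, ..., R_1L1, R_10
    for h1 in patients:
        for ell, condition in enumerate(conditions):
            if condition(h1):
                clause_counts[ell] += 1
                break
        else:                                          # no clause true: h1 falls in R_10
            clause_counts[L1] += 1
    probs = clause_counts / len(patients)
    per_group_costs = np.append(costs, costs[-1])      # R_10 requires checking all L1 clauses
    return float(np.sum(per_group_costs * probs))
```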
4.4 Additional Approaches
Extensive literature
Numerous approaches: We highlight two additional approaches to estimation of an optimal regime
• Regression-based estimation: To mitigate concern over parametric model misspecification, represent
\[
Q_1(h_1, a_1) = E(Y \mid H_1 = h_1, A_1 = a_1)
\]
nonparametrically, e.g., using generalized additive models, support vector regression, random forests, etc., to obtain a nonparametric estimator Q̂1(h1, a1)
• Use Q̂1(h1, a1) as the fitted model and thus obtain
\[
\widehat{d}^{\,opt}_{Q,1}(h_1) = \arg\max_{a_1 \in \mathcal{A}_1} \widehat{Q}_1(h_1, a_1)
\]
• E.g., Qian and Murphy (2011)
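A minimal sketch of this regression-based approach with a random forest standing in for the nonparametric regression; the model choice, tuning, and variable names are illustrative assumptions, and any flexible regression method could be substituted:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def regression_estimator(H1, A1, Y):
    """Nonparametrically estimate Q1(h1, a1) = E(Y | H1 = h1, A1 = a1) and return
    the rule d_Q,1(h1) = argmax over a1 in {0, 1} of the fitted Q1(h1, a1)."""
    X = np.column_stack([H1, A1])                        # include the treatment as a feature
    rf = RandomForestRegressor(n_estimators=500)
    rf.fit(X, Y)

    def d1_opt(h1):
        h1 = np.atleast_2d(h1)
        q0 = rf.predict(np.column_stack([h1, np.zeros(len(h1))]))
        q1 = rf.predict(np.column_stack([h1, np.ones(len(h1))]))
        return (q1 > q0).astype(int)                     # select the option with larger fitted Q1
    return d1_opt
```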
Extensive literature
• Alternative form of value search: Because for dη in a restricted class Dη,
\[
\mathcal{V}(d_\eta) = E\big[ Q_1(H_1, 1)\, I\{d_1(H_1;\eta_1) = 1\} + Q_1(H_1, 0)\, I\{d_1(H_1;\eta_1) = 0\} \big],
\]
maximize in η1
\[
\widehat{\mathcal{V}}(d_\eta) = n^{-1} \sum_{i=1}^{n} \big[ \widehat{Q}_1(H_{1i}, 1)\, I\{d_1(H_{1i};\eta_1) = 1\} + \widehat{Q}_1(H_{1i}, 0)\, I\{d_1(H_{1i};\eta_1) = 0\} \big]
\]
• As above, Q̂1(h1, a1) is a nonparametric estimator for Q1(h1, a1)
• But here Q̂1(h1, a1) is used only to ensure faithful representation of the value and not to define the optimal regime estimator directly (Taylor et al., 2015)
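For illustration, a minimal sketch of this alternative for a simple class of one-dimensional threshold rules d1(h1; η1) = I(h1 > η1), maximized over a grid of candidate thresholds; the rule class, the grid, and the inputs (nonparametric fits Q̂1(H1i, 1), Q̂1(H1i, 0) already evaluated at each subject) are assumptions for illustration:

```python
import numpy as np

def value_search_threshold(H1_feature, Q1_1, Q1_0, grid):
    """Maximize V(d_eta) = mean[Q1(H1, 1) d1 + Q1(H1, 0) (1 - d1)] over
    threshold rules d1(h1; eta1) = I(h1 > eta1), with eta1 ranging over `grid`."""
    best_eta, best_value = None, -np.inf
    for eta1 in grid:
        d1 = (H1_feature > eta1).astype(float)
        value = np.mean(Q1_1 * d1 + Q1_0 * (1 - d1))     # estimated value of this rule
        if value > best_value:
            best_eta, best_value = eta1, value
    return best_eta, best_value
```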
4.5 Key References
References
Liu, Y., Wang, Y., Kosorok, M. R., Zhao, Y., and Zeng, D. (2018). Augmented outcome-weighted learning for estimating optimal dynamic treatment regimens. Statistics in Medicine 37, 3776–3788.

Qian, M. and Murphy, S. (2011). Performance guarantees for individualized treatment rules. Annals of Statistics 39, 1180–1210.

Taylor, J. M. G., Cheng, W., and Foster, J. C. (2015). Reader reaction to "A robust method for estimating optimal treatment regimes" by Zhang et al. (2012). Biometrics 71, 267–273.

Zhang, B., Tsiatis, A. A., Davidian, M., Zhang, M., and Laber, E. B. (2012). Estimating optimal treatment regimes from a classification perspective. Stat 1, 103–114.
Zhang, Y., Laber, E. B., Tsiatis, A. A., and Davidian, M. (2015). Using decision lists to construct interpretable and parsimonious treatment regimes. Biometrics 71, 895–904.

Zhao, Y., Zeng, D., Rush, A. J., and Kosorok, M. R. (2012). Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association 107, 1106–1118.

Zhou, X., Mayer-Hamblett, N., Khan, U., and Kosorok, M. R. (2017). Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association 112, 169–187.