Continual Learning in
Sequential Decision Making
S.A. Murphy
In Honor of A. Wald, JSM
"Abraham Wald" by Konrad Jacobs, Erlangen. Copyright: MFO, Mathematisches Forschungsinstitut Oberwolfach
Father of Sequential Decision Making
Find the decision-making policy \pi that, at each time n, inputs the present state S_n and outputs one of three actions:

\{p_0\ \&\ \mathrm{stop},\quad p_1\ \&\ \mathrm{stop},\quad \mathrm{continue}\}

This policy should minimize

g\Big(w_0\, P_{p_0,\pi}\big[\pi(S_N) = p_1\ \&\ \mathrm{stop}\big] + c\, E_{p_0,\pi}[N]\Big) + (1-g)\Big(w_1\, P_{p_1,\pi}\big[\pi(S_N) = p_0\ \&\ \mathrm{stop}\big] + c\, E_{p_1,\pi}[N]\Big)

where g is a probability, w_0, w_1, c are positive weights, and N is the stopping time.
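Wald's objective can be made concrete with a small simulation. The sketch below assumes simple Bernoulli hypotheses (p0 = 0.3 vs. p1 = 0.7) and an SPRT-style stopping policy with illustrative log-likelihood-ratio boundaries; every number here is hypothetical, chosen only to show how the error probabilities and expected stopping times enter the weighted objective.

```python
import math
import random

def sprt(p_true, p0=0.3, p1=0.7, upper=2.94, lower=-2.94, rng=None):
    """One run of an SPRT-style policy: accumulate the log-likelihood
    ratio of p1 vs p0 on Bernoulli draws; continue while it stays
    between the boundaries, then stop and decide."""
    rng = rng or random
    llr, n = 0.0, 0
    while lower < llr < upper:
        x = 1 if rng.random() < p_true else 0
        llr += math.log((p1 if x else 1.0 - p1) / (p0 if x else 1.0 - p0))
        n += 1
    return ("p1" if llr >= upper else "p0"), n

def wald_objective(g=0.5, w0=1.0, w1=1.0, c=0.01, reps=2000, seed=0):
    """Monte Carlo estimate of
    g*(w0*P_{p0}[decide p1] + c*E_{p0}[N])
      + (1-g)*(w1*P_{p1}[decide p0] + c*E_{p1}[N])."""
    rng = random.Random(seed)
    err0 = n0 = err1 = n1 = 0.0
    for _ in range(reps):
        d, n = sprt(0.3, rng=rng); err0 += (d == "p1"); n0 += n
        d, n = sprt(0.7, rng=rng); err1 += (d == "p0"); n1 += n
    return (g * (w0 * err0 / reps + c * n0 / reps)
            + (1 - g) * (w1 * err1 / reps + c * n1 / reps))
```

Tightening the boundaries trades larger expected sample size E[N] against smaller error probabilities, which is exactly the trade-off the weighted objective formalizes.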
The Dream!
“Continually Learning Mobile Health Intervention”
• Help maintain healthy behaviors
• Help you achieve your health goals
– Help you better trade off long term benefit with short term momentary pleasure
• The ideal mHealth intervention
– will be there when you need it and will not intrude when you don’t need it.
– will adjust to unanticipated life challenges
Ideas
• Real-Time Treatment Policies
• “Personalized Continually Learning Treatment Policy”
– Warm-Start Policy
– Online Training Algorithm
• Clinical Trial
Treatment policies are individualized treatments delivered whenever and wherever the individual needs help. They are delivered by a wearable device (e.g., a smartphone).
Ot : Observations at tth decision time (high dimensional)
At : Action at tth decision time (treatment)
Yt+1 : Proximal Response (aka: Reward, Cost)
mHealth
HeartSteps Activity Coach
o Wearable band measures activity and sleep quality; phone sensors measure busyness of calendar, location, weather; utility ratings
o In which contexts should smartphone ping and deliver activity ideas?
Examples
1) Decision Times (times at which a treatment can be provided)
   1) Regular intervals in time (e.g., every 10 minutes)
   2) At user demand

HeartSteps: Approximately every 2-2.5 hours
Examples
2) Observations, Ot
   1) Passively collected (via sensors)
   2) Actively collected (via self-report)

HeartSteps observations include activity recognition, location, calendar, step count, usefulness ratings, adherence, self-efficacy, …
Examples
3) Actions, At
   1) Treatments that can be provided at decision time
   2) Whether to provide a treatment

HeartSteps: Activity Recommendations
[Screenshot: HeartSteps activity recommendation, or no message]
Examples
4) Proximal Response (reward), Yt+1

HeartSteps: Activity (step count) over the next 60 minutes.
Treatment Policy
A treatment policy is a mapping from the information available on an individual at each decision time t, say St, to a probability distribution on the action space; At is drawn from this distribution.
Example of a parameterized policy for a binary action:

\pi_\theta(1 \mid S_t) = \frac{e^{\theta^T f(S_t)}}{1 + e^{\theta^T f(S_t)}}
Desired Treatment Policy
• Individuals recovering from a heart attack
• Develop a coach app on the phone. The policy underlying the coach app will need to trade off
  • short-term improvement/maintenance of physical activity level
  with
  • burden and engagement
  to maintain activity over the long term.
Current State of the Art
• Expert-derived treatment policy: Scientists formulate the policy using behavioral/social theories, clinical experience, observational data analyses.
• Concerns that different individuals may do better with different policies (treatment effect heterogeneity)
  • Between individuals
  • Temporal, within individuals
Desired Online Algorithm
• How to best detect and use treatment effect heterogeneity to improve policy?
• The online algorithm should detect that an individual is responding differently from the “average person” to a treatment action
• The online algorithm should detect that an individual is responding differently to a treatment action at later times than at earlier times.
Summary of Goal
1) Take advantage of the existing approach of constructing a policy based on theory and clinical experience, e.g., the expert-derived policy.
2) The online algorithm should alter the treatment policy if doing so will improve the average reward.
Ideas
• Real-Time Treatment Policies & Example
• “Personalized Continually Learning Treatment Policy”
– Warm-Start Policy
– Online Training Algorithm
• Clinical Trial
Personalized Continually Learning Treatment Policy
1) Two parts:
1) Warm-Start Policy
2) Online Training Algorithm
Warm-Start Policy
• The warm-start policy is stochastic; it is based on an expert-derived policy or on a treatment policy developed via a micro-randomized trial.
• Variation in actions can help retard habituation and maintain engagement.
• A parameterized policy can be interpreted/vetted by domain experts.
Example Warm-Start
1) Clinician’s policy: If the calendar is not busy and the risk of burden is low, then provide an activity suggestion.

2) Warm-Start policy, \pi_\theta(a \mid s) = P[A = a \mid S = s]:

\pi_\theta(1 \mid S_t) = \frac{e^{\theta_0 + \theta_1 C_t + \theta_2 B_t + \theta_3^T g(S_t)}}{1 + e^{\theta_0 + \theta_1 C_t + \theta_2 B_t + \theta_3^T g(S_t)}}

where C_t, B_t \in \{0, 1\} and (C_t, B_t) \subset S_t.
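As a sketch, this warm-start probability can be computed directly from the logistic form. The function and variable names below are illustrative, and the feature map g(·) is left generic, as on the slide:

```python
import math

def warm_start_prob(theta, c_t, b_t, g_s=()):
    """pi_theta(1 | S_t): probability of delivering an activity suggestion.

    theta = (theta0, theta1, theta2, *theta3); c_t is the calendar-busy
    indicator, b_t the burden-risk indicator, and g_s the (optional)
    extra feature vector g(S_t)."""
    theta0, theta1, theta2, *theta3 = theta
    z = theta0 + theta1 * c_t + theta2 * b_t
    z += sum(tk * xk for tk, xk in zip(theta3, g_s))
    return 1.0 / (1.0 + math.exp(-z))

# With theta3 = 0, negative theta1 and theta2 lower the probability of a
# suggestion when the calendar is busy or burden risk is high.
p_free = warm_start_prob((0.0, -1.0, -1.0), c_t=0, b_t=0)   # sigmoid(0)  = 0.5
p_busy = warm_start_prob((0.0, -1.0, -1.0), c_t=1, b_t=1)   # sigmoid(-2) ≈ 0.12
```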
Example Warm-Start
Warm-Start policy:

\pi_\theta(1 \mid S_t) = \frac{e^{\theta_0 + \theta_1 C_t + \theta_2 B_t + \theta_3^T g(S_t)}}{1 + e^{\theta_0 + \theta_1 C_t + \theta_2 B_t + \theta_3^T g(S_t)}}

How to initialize θ0, …, θ3? Set θ3 = 0; set the remaining θ0, θ1, θ2 so that \pi_\theta(1 \mid S_t) is:

             C_t = 1   C_t = 0
  B_t = 1      .2        .4
  B_t = 0      .4       ~.84
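One way to carry out such an initialization (a sketch; which three cells to match exactly is a design choice I am assuming here) is to invert the logistic at three target cells. With θ3 = 0 the model has only three free parameters, so the fourth cell's probability is then implied by the additive form rather than set independently:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_theta(p_00, p_10, p_01):
    """Solve theta0, theta1, theta2 so that, with theta3 = 0,
    pi_theta(1 | C_t, B_t) matches the targets at
    (C_t, B_t) = (0,0), (1,0), (0,1)."""
    theta0 = logit(p_00)
    theta1 = logit(p_10) - theta0
    theta2 = logit(p_01) - theta0
    return theta0, theta1, theta2

# Hypothetical targets in the spirit of the table above.
theta0, theta1, theta2 = init_theta(p_00=0.84, p_10=0.4, p_01=0.4)
implied_p_11 = sigmoid(theta0 + theta1 + theta2)  # cell (C_t, B_t) = (1, 1)
```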
Personalized Continually Learning Treatment Policy
1) Two parts:
1) Warm-Start Policy
2) Online Training Algorithm
Policy is “control law;” online training algorithm is “adaptive control rule” (TL Lai, 2001)
Online Training Algorithm
The training algorithm is used to improve the Warm-Start policy for each person separately; the algorithm is a stochastic gradient ascent algorithm.

1) After each decision time, and on each individual separately, the training algorithm updates the θ parameters of the current policy, πθ.

2) At the next decision time, the treatment, At, is assigned using the individual’s updated policy.
Background
On each individual:
1) Probabilistic Model: Markov Decision Process
2) Optimality criterion for policy: Average Reward
Markov Decision Process
Markovian Assumptions:

P[S_{t+1} = s' \mid S_1, A_1, \ldots, S_t, A_t] = P[S_{t+1} = s' \mid S_t, A_t]
and
P[Y_{t+1} = r \mid S_1, A_1, \ldots, S_t, A_t] = P[Y_{t+1} = r \mid S_t, A_t]

Stationarity Assumptions:

P[S_{t+1} = s' \mid S_t = s, A_t = a] = p(s' \mid s, a)
and
E[Y_{t+1} \mid S_t = s, A_t = a] = r(s, a)
Optimality Criterion for Policy
Average Reward, ηθ, for policy πθ :
Eθ denotes expectation under the stationary distribution, dθ, associated with πθ.
\eta_\theta = \lim_{T \to \infty} \frac{1}{T}\, E_\theta\left[\sum_{t=0}^{T-1} Y_{t+1} \,\Big|\, S_0 = s\right] = \sum_s d_\theta(s) \sum_a \pi_\theta(a \mid s)\, r(s, a)
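For a small MDP this quantity can be computed exactly. The sketch below uses a made-up 2-state, 2-action MDP (none of the numbers come from the talk) and evaluates η via the stationary distribution of the policy-induced chain:

```python
import numpy as np

# Made-up 2-state, 2-action MDP (illustrative numbers only).
p = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.3, 0.7], [0.8, 0.2]]])   # p[s, a, s']
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # r[s, a]

def avg_reward(pi):
    """eta = sum_s d(s) sum_a pi(a|s) r(s, a), with d the stationary
    distribution of the policy-induced chain
    P(s'|s) = sum_a pi(a|s) p(s, a, s')."""
    P = np.einsum('sa,sat->st', pi, p)        # transition matrix under pi
    evals, evecs = np.linalg.eig(P.T)         # stationary dist = left eigenvector
    d = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    d /= d.sum()
    return float(d @ np.einsum('sa,sa->s', pi, r))

pi_uniform = np.array([[0.5, 0.5], [0.5, 0.5]])
eta = avg_reward(pi_uniform)   # ≈ 0.792 for these numbers
```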
Working Premise
1) Each individual follows a possibly different Markovian process. Thus the policy achieving max_θ η_θ may differ by individual.
2) The Markovian processes are sufficiently similar that, starting from the warm-start policy, a policy achieving max_θ η_θ for the individual can be learned quickly.
Nuisance Parameter: Differential Value
Vθ is the Differential Value function
Vθ(s) − Vθ(s′) reflects the difference in the sum of responses accrued when starting in state s as opposed to state s′.
(ηθ is the average reward)
V_\theta(s) = \lim_{T \to \infty} E_\theta\left[\sum_{t=0}^{T} \big(Y_{t+1} - \eta_\theta\big) \,\Big|\, S_0 = s\right]
Background: Bellman Equation
Oracle Temporal Difference:

\delta_t = Y_{t+1} - \eta_\theta + V_\theta(S_{t+1}) - V_\theta(S_t)

Bellman Equation:

E_\theta\big[\delta_t \mid S_t\big] = 0
Background: Bellman Equation
Bellman’s equation implies that

E_\theta\left[\Big(Y_{t+1} - \eta + V(S_{t+1}) - V(S_t)\Big)\begin{pmatrix} 1 \\ f(S_t) \end{pmatrix}\right]

will be, for all t and for any vector f(\cdot) of appropriately integrable functions, equal to 0 if \eta = \eta_\theta and V = V_\theta.
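A quick numerical check (made-up 2-state MDP and uniform policy; nothing here comes from the talk): solve the Bellman equation exactly for (η, V), then verify by Monte Carlo that the oracle temporal difference has conditional mean zero.

```python
import numpy as np

rng = np.random.default_rng(1)

p = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.3, 0.7], [0.8, 0.2]]])    # p[s, a, s'] (illustrative)
r = np.array([[1.0, 0.0], [0.5, 2.0]])      # r[s, a]
pi = np.array([[0.5, 0.5], [0.5, 0.5]])     # pi[s, a], uniform policy

P = np.einsum('sa,sat->st', pi, p)           # P(s'|s) under pi
r_pi = np.einsum('sa,sa->s', pi, r)          # expected reward by state

# Bellman: V(s) + eta - sum_{s'} P(s'|s) V(s') = r_pi(s), with the
# normalization V(0) = 0 (differential values are unique up to a constant).
# Unknowns x = (V(1), eta); one linear equation per state.
A = np.array([[-P[0, 1], 1.0],
              [1.0 - P[1, 1], 1.0]])
V1, eta = np.linalg.solve(A, r_pi)
V = np.array([0.0, V1])

# Monte Carlo check of E[delta_t | S_t = 0] = 0.
deltas = []
for _ in range(100_000):
    a = int(rng.integers(2))                 # draw from the uniform policy
    s_next = rng.choice(2, p=p[0, a])
    deltas.append(r[0, a] - eta + V[s_next] - V[0])
mean_delta = float(np.mean(deltas))          # should be ~0
```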
Background for Online Training Algorithm
Policy Gradient Theorem:

\nabla_\theta \eta_\theta = E_\theta\big[\delta_t \, \nabla_\theta \log \pi_\theta(A_t \mid S_t)\big]

where

\delta_t = Y_{t+1} - \eta_\theta + V_\theta(S_{t+1}) - V_\theta(S_t)
Stochastic Gradient Algorithm
At the tth decision time, draw the action A_t from the current policy \pi_{\theta_t}; then update the estimators \hat{\eta}_t, \hat{V}_t of the average reward \eta_{\theta_t} and the differential value V_{\theta_t}. Use these estimators to form the TD error

\hat{\delta}_t = Y_{t+1} - \hat{\eta}_t + \hat{V}_t(S_{t+1}) - \hat{V}_t(S_t)

Move in the direction of the empirical gradient,

\widehat{\nabla_\theta \eta(\pi_\theta)} = \hat{\delta}_t \, \nabla_\theta \log \pi_\theta(A_t \mid S_t)\Big|_{\theta = \theta_t}

to obtain \theta_{t+1}.
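The loop below is a minimal actor-critic sketch of this algorithm on a synthetic two-context environment. The environment, features, and step sizes are all made up for illustration (this is not HeartSteps): the critic tracks η̂_t and V̂_t, and the actor moves θ along δ̂_t ∇_θ log π_θ(A_t | S_t).

```python
import numpy as np

rng = np.random.default_rng(0)

def step(s, a):
    """Synthetic environment: action 1 ('send suggestion') has expected
    reward +1 in context 0 and -1 in context 1; contexts arrive uniformly."""
    s_next = int(rng.integers(2))
    y = float(a * (1.0 - 2.0 * s) + 0.1 * rng.standard_normal())
    return s_next, y

def features(s):                        # f(s): one-hot context features
    f = np.zeros(2)
    f[s] = 1.0
    return f

def pi1(theta, s):                      # pi_theta(1|s), logistic in f(s)
    return 1.0 / (1.0 + np.exp(-theta @ features(s)))

theta = np.zeros(2)                     # policy parameters (warm start)
v = np.zeros(2)                         # critic: V(s) ~ v^T f(s)
eta = 0.0                               # running average-reward estimate
a_lr, c_lr, e_lr = 0.01, 0.05, 0.01     # actor / critic / eta step sizes

s = 0
for t in range(20_000):
    prob = pi1(theta, s)
    a = int(rng.random() < prob)                # draw A_t ~ pi_theta(.|S_t)
    s_next, y = step(s, a)
    delta = y - eta + v @ features(s_next) - v @ features(s)   # TD error
    eta += e_lr * (y - eta)                     # update average-reward estimate
    v += c_lr * delta * features(s)             # critic (differential value)
    theta += a_lr * delta * (a - prob) * features(s)  # delta * grad log pi
    s = s_next
```

After training, the policy should send the suggestion mostly in context 0 and withhold it in context 1, i.e., it detects the (here, built-in) treatment effect heterogeneity between contexts.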
Estimating Equations for \hat{\eta}_t, \hat{V}_t

The estimators \hat{\eta}_t, \hat{V}_t are based on the fact that (\eta, V) = (\eta_{\theta_t}, V_{\theta_t}) is a solution to

0 = E_{\theta_t}\left[\Big(Y_{t+1} - \eta + V(S_{t+1}) - V(S_t)\Big)\begin{pmatrix} 1 \\ f(S_t) \end{pmatrix}\right]

for any vector f(\cdot) of appropriately integrable functions.

The idea: parameterize V_{\theta_t}, then solve empirical versions of these equations.
Estimating Equations for \hat{\eta}_t, \hat{V}_t

Parameterize V_{\theta_t}(s) as v^T f(s). The intuition is to solve

0 = \sum_{j=0}^{t} \Big(Y_{j+1} - \eta + v^T f(S_{j+1}) - v^T f(S_j)\Big)\begin{pmatrix} 1 \\ f(S_j) \end{pmatrix}

Use first (or second) order incremental solutions.
A Very Little Review!
An exact incremental solution to

0 = \sum_{j=0}^{t} \big(Y_{j+1} - \mu\big)

is

\mu_{j+1} = \mu_j + \sigma_j \big(Y_{j+1} - \mu_j\big)

where \sigma_j = \frac{1}{j+1}.
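In code, this is just the running-mean recursion (a direct transcription of the display above):

```python
def incremental_mean(ys):
    """Running mean: mu_{j+1} = mu_j + (1/(j+1)) * (y_{j+1} - mu_j),
    the exact incremental solution to 0 = sum_j (y_{j+1} - mu)."""
    mu = 0.0
    for j, y in enumerate(ys):
        mu += (1.0 / (j + 1)) * (y - mu)
    return mu
```

Each step needs only the previous estimate, not the whole history; the same first-order incremental idea is what the online training algorithm uses for η̂_t and V̂_t.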
Ideas
• Real-Time Treatment Policies & Example
• “Personalized Continually Learning Treatment Policy”
– Warm-Start Policy
– Online Training Algorithm
• Clinical Trial
Proposal
A Two-Group Randomized Trial of “Personalized Continually Learning Treatment Policy” versus “Standard Care”
Note
The personalized policy must be operationalized, i.e., “locked down,” to enhance replicability. Prior to the trial:

1) The warm-start policy is pre-specified.
2) The training algorithm must be completely pre-specified.
Standard RCT
Standard data analyses can be used to compare Personalized Policy group versus Standard Care group.
1) Survival outcomes, longitudinal outcomes
2) Subpopulation (moderator) analyses, e.g., estimating the treatment effect for more severely ill people
Interesting Challenges
• At the end of the trial different people will have different policies, e.g. different θ’s.
– Testing if the variation between people in their policies at the end of the study is greater than that due to noise.
• Provide error bounds on how much the personalized policy will improve over the warm start, at least on average.
Interesting Challenges
• Methods for using trial data to estimate the average reward for a variety of policies.
• Methods for using trial data to construct a new single “optimal” policy. This new policy could be used in forming a warm-start for a future trial.
Interesting Challenges
• Optimality criterion for the online training algorithm: minimize regret.
• Or should the regret be an averaged regret?
Interesting Challenges
• How to formulate “Bayesian” heuristics to design the stochastic gradient algorithm, balancing the use of information learned so far against the need to continue exploring?
• Combining information across individuals?
Challenges
• Any method should provide confidence intervals and permit scientists to test hypotheses.
• How to reduce the amount of self-report data? (How might you do this?)
• How to accommodate/utilize the vast amount of missing data, some of which will be informative…
Collaborators