Continual Learning in
Sequential Decision Making
S.A. Murphy
In Honor of A. Wald, JSM
"Abraham Wald" by Konrad Jacobs, Erlangen. Copyright: MFO, Mathematisches Forschungsinstitut Oberwolfach
Father of Sequential Decision Making
Find the decision-making policy \pi that, at each time n, inputs the present state S_n and outputs one of three actions:

\{p_0\ \&\ \mathrm{stop},\quad p_1\ \&\ \mathrm{stop},\quad \mathrm{continue}\}

This policy should minimize

g\Big(w_0\, P_{p_0,\pi}\big[\pi(S_N) = p_1\ \&\ \mathrm{stop}\big] + c\, E_{p_0,\pi}[N]\Big) + (1-g)\Big(w_1\, P_{p_1,\pi}\big[\pi(S_N) = p_0\ \&\ \mathrm{stop}\big] + c\, E_{p_1,\pi}[N]\Big)

where g is a probability, w_0, w_1, c are positive weights, and N is the stopping time.
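Wald's objective can be made concrete with a small simulation. The sketch below assumes simple Bernoulli hypotheses (p0 = 0.3 vs. p1 = 0.7) and an SPRT-style stopping policy with illustrative log-likelihood-ratio boundaries; every number here is hypothetical, chosen only to show how the error probabilities and expected stopping times enter the weighted objective.

```python
import math
import random

def sprt(p_true, p0=0.3, p1=0.7, upper=2.94, lower=-2.94, rng=None):
    """One run of an SPRT-style policy: accumulate the log-likelihood
    ratio of p1 vs p0 on Bernoulli draws; continue while it stays
    between the boundaries, then stop and decide."""
    rng = rng or random
    llr, n = 0.0, 0
    while lower < llr < upper:
        x = 1 if rng.random() < p_true else 0
        llr += math.log((p1 if x else 1.0 - p1) / (p0 if x else 1.0 - p0))
        n += 1
    return ("p1" if llr >= upper else "p0"), n

def wald_objective(g=0.5, w0=1.0, w1=1.0, c=0.01, reps=2000, seed=0):
    """Monte Carlo estimate of
    g*(w0*P_{p0}[decide p1] + c*E_{p0}[N])
      + (1-g)*(w1*P_{p1}[decide p0] + c*E_{p1}[N])."""
    rng = random.Random(seed)
    err0 = n0 = err1 = n1 = 0.0
    for _ in range(reps):
        d, n = sprt(0.3, rng=rng); err0 += (d == "p1"); n0 += n
        d, n = sprt(0.7, rng=rng); err1 += (d == "p0"); n1 += n
    return (g * (w0 * err0 / reps + c * n0 / reps)
            + (1 - g) * (w1 * err1 / reps + c * n1 / reps))
```

Tightening the boundaries trades larger expected sample size E[N] against smaller error probabilities, which is exactly the trade-off the weighted objective formalizes.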
The Dream!
“Continually Learning Mobile Health Intervention”
• Help maintain healthy behaviors
• Help you achieve your health goals
– Help you better trade off long term benefit with short term momentary pleasure
• The ideal mHealth intervention
– will be there when you need it and will not intrude when you don’t need it.
– will adjust to unanticipated life challenges
Ideas
• Real-Time Treatment Policies
• “Personalized Continually Learning Treatment Policy”
– Warm-Start Policy
– Online Training Algorithm
• Clinical Trial
Treatment policies are individualized treatments delivered whenever and wherever the individual needs help. They are delivered by a wearable device (e.g., a smartphone).
Ot : Observations at tth decision time (high dimensional)
At : Action at tth decision time (treatment)
Yt+1 : Proximal Response (aka: Reward, Cost)
mHealth
HeartSteps Activity Coach
o Wearable band measures activity and sleep quality; phone sensors measure busyness of calendar, location, weather; utility ratings
o In which contexts should smartphone ping and deliver activity ideas?
Examples
1) Decision Times (times at which a treatment can be provided)
   1) Regular intervals in time (e.g., every 10 minutes)
   2) At user demand

HeartSteps: Approximately every 2-2.5 hours
Examples
2) Observations, Ot
   1) Passively collected (via sensors)
   2) Actively collected (via self-report)

HeartSteps observations include activity recognition, location, calendar, step count, usefulness ratings, adherence, self-efficacy, …
Examples
3) Actions, At
   1) Treatments that can be provided at decision time
   2) Whether to provide a treatment

HeartSteps: Activity Recommendations
[Screenshot: HeartSteps activity recommendation, or no message]
Examples
4) Proximal Response (reward), Yt+1

HeartSteps: Activity (step count) over the next 60 minutes.
Treatment Policy
A treatment policy is a mapping from the information available on an individual at each decision time t, say St, to a probability distribution on the action space; At is drawn from this distribution.
Example of a parameterized policy for a binary action:

\pi_\theta(1 \mid S_t) = \frac{e^{\theta^T f(S_t)}}{1 + e^{\theta^T f(S_t)}}
Desired Treatment Policy
• Individuals recovering from a heart attack
• Develop a coach app on the phone. The policy underlying the coach app will need to trade off
  • short-term improvement/maintenance of physical activity level
  with
  • burden and engagement
  to maintain activity over the long term.
Current State of the Art
• Expert-derived treatment policy: Scientists formulate the policy using behavioral/social theories, clinical experience, observational data analyses.
• Concerns that different individuals may do better with different policies (treatment effect heterogeneity)
  • Between individuals
  • Temporal, within individuals
Desired Online Algorithm
• How to best detect and use treatment effect heterogeneity to improve policy?
• The online algorithm should detect that an individual is responding differently from the “average person” to a treatment action
• The online algorithm should detect that an individual is responding differently to a treatment action at later times than at earlier times.
Summary of Goal
1) Take advantage of the existing approach of constructing a policy based on theory and clinical experience, e.g., the expert-derived policy.
2) The online algorithm should alter the treatment policy if doing so will improve the average reward.
Ideas
• Real-Time Treatment Policies & Example
• “Personalized Continually Learning Treatment Policy”
– Warm-Start Policy
– Online Training Algorithm
• Clinical Trial
Personalized Continually Learning Treatment Policy
1) Two parts:
1) Warm-Start Policy
2) Online Training Algorithm
Warm-Start Policy
• The warm-start policy is stochastic; it is based on an expert-derived policy or on a treatment policy developed via a micro-randomized trial.
• Variation in actions can help retard habituation and maintain engagement.
• A parameterized policy can be interpreted/vetted by domain experts.
Example Warm-Start
1) Clinician’s policy: If the calendar is not busy and the risk of burden is low, then provide an activity suggestion.

2) Warm-Start policy, \pi_\theta(a \mid s) = P[A = a \mid S = s]:

\pi_\theta(1 \mid S_t) = \frac{e^{\theta_0 + \theta_1 C_t + \theta_2 B_t + \theta_3^T g(S_t)}}{1 + e^{\theta_0 + \theta_1 C_t + \theta_2 B_t + \theta_3^T g(S_t)}}

where C_t, B_t \in \{0, 1\} and (C_t, B_t) \subset S_t.
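As a sketch, this warm-start probability can be computed directly from the logistic form. The function and variable names below are illustrative, and the feature map g(·) is left generic, as on the slide:

```python
import math

def warm_start_prob(theta, c_t, b_t, g_s=()):
    """pi_theta(1 | S_t): probability of delivering an activity suggestion.

    theta = (theta0, theta1, theta2, *theta3); c_t is the calendar-busy
    indicator, b_t the burden-risk indicator, and g_s the (optional)
    extra feature vector g(S_t)."""
    theta0, theta1, theta2, *theta3 = theta
    z = theta0 + theta1 * c_t + theta2 * b_t
    z += sum(tk * xk for tk, xk in zip(theta3, g_s))
    return 1.0 / (1.0 + math.exp(-z))

# With theta3 = 0, negative theta1 and theta2 lower the probability of a
# suggestion when the calendar is busy or burden risk is high.
p_free = warm_start_prob((0.0, -1.0, -1.0), c_t=0, b_t=0)   # sigmoid(0)  = 0.5
p_busy = warm_start_prob((0.0, -1.0, -1.0), c_t=1, b_t=1)   # sigmoid(-2) ≈ 0.12
```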
Example Warm-Start
Warm-Start policy:

\pi_\theta(1 \mid S_t) = \frac{e^{\theta_0 + \theta_1 C_t + \theta_2 B_t + \theta_3^T g(S_t)}}{1 + e^{\theta_0 + \theta_1 C_t + \theta_2 B_t + \theta_3^T g(S_t)}}

How to initialize θ0, …, θ3? Set θ3 = 0; set the remaining θ0, θ1, θ2 so that \pi_\theta(1 \mid S_t) is:

             C_t = 1   C_t = 0
  B_t = 1      .2        .4
  B_t = 0      .4       ~.84
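One way to carry out such an initialization (a sketch; which three cells to match exactly is a design choice I am assuming here) is to invert the logistic at three target cells. With θ3 = 0 the model has only three free parameters, so the fourth cell's probability is then implied by the additive form rather than set independently:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_theta(p_00, p_10, p_01):
    """Solve theta0, theta1, theta2 so that, with theta3 = 0,
    pi_theta(1 | C_t, B_t) matches the targets at
    (C_t, B_t) = (0,0), (1,0), (0,1)."""
    theta0 = logit(p_00)
    theta1 = logit(p_10) - theta0
    theta2 = logit(p_01) - theta0
    return theta0, theta1, theta2

# Hypothetical targets in the spirit of the table above.
theta0, theta1, theta2 = init_theta(p_00=0.84, p_10=0.4, p_01=0.4)
implied_p_11 = sigmoid(theta0 + theta1 + theta2)  # cell (C_t, B_t) = (1, 1)
```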
Personalized Continually Learning Treatment Policy
1) Two parts:
1) Warm-Start Policy
2) Online Training Algorithm
Policy is “control law;” online training algorithm is “adaptive control rule” (TL Lai, 2001)
Online Training Algorithm
The training algorithm is used to improve the Warm-Start policy for each person separately; the algorithm is a stochastic gradient ascent algorithm.

1) After each decision time, and on each individual separately, the training algorithm updates the θ parameters of the current policy, πθ.

2) At the next decision time, the treatment, At, is assigned using the individual’s updated policy.
Background
On each individual:
1) Probabilistic Model: Markov Decision Process
2) Optimality criterion for policy: Average Reward
Markov Decision Process
Markovian Assumptions:

P[S_{t+1} = s' \mid S_1, A_1, \ldots, S_t, A_t] = P[S_{t+1} = s' \mid S_t, A_t]
and
P[Y_{t+1} = r \mid S_1, A_1, \ldots, S_t, A_t] = P[Y_{t+1} = r \mid S_t, A_t]

Stationarity Assumptions:

P[S_{t+1} = s' \mid S_t = s, A_t = a] = p(s' \mid s, a)
and
E[Y_{t+1} \mid S_t = s, A_t = a] = r(s, a)
Optimality Criterion for Policy
Average Reward, ηθ, for policy πθ :
Eθ denotes expectation under the stationary distribution, dθ, associated with πθ.
\eta_\theta = \lim_{T \to \infty} \frac{1}{T}\, E_\theta\left[\sum_{t=0}^{T-1} Y_{t+1} \,\Big|\, S_0 = s\right] = \sum_s d_\theta(s) \sum_a \pi_\theta(a \mid s)\, r(s, a)
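For a small MDP this quantity can be computed exactly. The sketch below uses a made-up 2-state, 2-action MDP (none of the numbers come from the talk) and evaluates η via the stationary distribution of the policy-induced chain:

```python
import numpy as np

# Made-up 2-state, 2-action MDP (illustrative numbers only).
p = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.3, 0.7], [0.8, 0.2]]])   # p[s, a, s']
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # r[s, a]

def avg_reward(pi):
    """eta = sum_s d(s) sum_a pi(a|s) r(s, a), with d the stationary
    distribution of the policy-induced chain
    P(s'|s) = sum_a pi(a|s) p(s, a, s')."""
    P = np.einsum('sa,sat->st', pi, p)        # transition matrix under pi
    evals, evecs = np.linalg.eig(P.T)         # stationary dist = left eigenvector
    d = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    d /= d.sum()
    return float(d @ np.einsum('sa,sa->s', pi, r))

pi_uniform = np.array([[0.5, 0.5], [0.5, 0.5]])
eta = avg_reward(pi_uniform)   # ≈ 0.792 for these numbers
```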
Working Premise
1) Each individual follows a possibly different Markovian process. Thus the policy achieving max_θ η_θ may differ by individual.
2) The Markovian processes are sufficiently similar that, starting from the warm-start policy, a policy achieving max_θ η_θ for the individual can be learned quickly.
Nuisance Parameter: Differential Value
Vθ is the Differential Value function
Vθ(s) − Vθ(s′) reflects the difference in the sum of responses accrued when starting in state s as opposed to state s′.
(ηθ is the average reward)
V_\theta(s) = \lim_{T \to \infty} E_\theta\left[\sum_{t=0}^{T} \big(Y_{t+1} - \eta_\theta\big) \,\Big|\, S_0 = s\right]
Background: Bellman Equation
Oracle Temporal Difference:

\delta_t = Y_{t+1} - \eta_\theta + V_\theta(S_{t+1}) - V_\theta(S_t)

Bellman Equation:

E_\theta\big[\delta_t \mid S_t\big] = 0
Background: Bellman Equation
Bellman’s equation implies that

E_\theta\left[\Big(Y_{t+1} - \eta + V(S_{t+1}) - V(S_t)\Big)\begin{pmatrix} 1 \\ f(S_t) \end{pmatrix}\right]

will be, for all t and for any vector f(\cdot) of appropriately integrable functions, equal to 0 if \eta = \eta_\theta and V = V_\theta.
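A quick numerical check (made-up 2-state MDP and uniform policy; nothing here comes from the talk): solve the Bellman equation exactly for (η, V), then verify by Monte Carlo that the oracle temporal difference has conditional mean zero.

```python
import numpy as np

rng = np.random.default_rng(1)

p = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.3, 0.7], [0.8, 0.2]]])    # p[s, a, s'] (illustrative)
r = np.array([[1.0, 0.0], [0.5, 2.0]])      # r[s, a]
pi = np.array([[0.5, 0.5], [0.5, 0.5]])     # pi[s, a], uniform policy

P = np.einsum('sa,sat->st', pi, p)           # P(s'|s) under pi
r_pi = np.einsum('sa,sa->s', pi, r)          # expected reward by state

# Bellman: V(s) + eta - sum_{s'} P(s'|s) V(s') = r_pi(s), with the
# normalization V(0) = 0 (differential values are unique up to a constant).
# Unknowns x = (V(1), eta); one linear equation per state.
A = np.array([[-P[0, 1], 1.0],
              [1.0 - P[1, 1], 1.0]])
V1, eta = np.linalg.solve(A, r_pi)
V = np.array([0.0, V1])

# Monte Carlo check of E[delta_t | S_t = 0] = 0.
deltas = []
for _ in range(100_000):
    a = int(rng.integers(2))                 # draw from the uniform policy
    s_next = rng.choice(2, p=p[0, a])
    deltas.append(r[0, a] - eta + V[s_next] - V[0])
mean_delta = float(np.mean(deltas))          # should be ~0
```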
Background for Online Training Algorithm
Policy Gradient Theorem:

\nabla_\theta \eta_\theta = E_\theta\big[\delta_t \, \nabla_\theta \log \pi_\theta(A_t \mid S_t)\big]

where

\delta_t = Y_{t+1} - \eta_\theta + V_\theta(S_{t+1}) - V_\theta(S_t)
Stochastic Gradient Algorithm
At the tth decision time, draw the action A_t from the current policy \pi_{\theta_t}; then update the estimators \hat{\eta}_t, \hat{V}_t of the average reward \eta_{\theta_t} and the differential value V_{\theta_t}. Use these estimators to form the TD error

\hat{\delta}_t = Y_{t+1} - \hat{\eta}_t + \hat{V}_t(S_{t+1}) - \hat{V}_t(S_t)

Move in the direction of the empirical gradient,

\widehat{\nabla_\theta \eta(\pi_\theta)} = \hat{\delta}_t \, \nabla_\theta \log \pi_\theta(A_t \mid S_t)\Big|_{\theta = \theta_t}

to obtain \theta_{t+1}.
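The loop below is a minimal actor-critic sketch of this algorithm on a synthetic two-context environment. The environment, features, and step sizes are all made up for illustration (this is not HeartSteps): the critic tracks η̂_t and V̂_t, and the actor moves θ along δ̂_t ∇_θ log π_θ(A_t | S_t).

```python
import numpy as np

rng = np.random.default_rng(0)

def step(s, a):
    """Synthetic environment: action 1 ('send suggestion') has expected
    reward +1 in context 0 and -1 in context 1; contexts arrive uniformly."""
    s_next = int(rng.integers(2))
    y = float(a * (1.0 - 2.0 * s) + 0.1 * rng.standard_normal())
    return s_next, y

def features(s):                        # f(s): one-hot context features
    f = np.zeros(2)
    f[s] = 1.0
    return f

def pi1(theta, s):                      # pi_theta(1|s), logistic in f(s)
    return 1.0 / (1.0 + np.exp(-theta @ features(s)))

theta = np.zeros(2)                     # policy parameters (warm start)
v = np.zeros(2)                         # critic: V(s) ~ v^T f(s)
eta = 0.0                               # running average-reward estimate
a_lr, c_lr, e_lr = 0.01, 0.05, 0.01     # actor / critic / eta step sizes

s = 0
for t in range(20_000):
    prob = pi1(theta, s)
    a = int(rng.random() < prob)                # draw A_t ~ pi_theta(.|S_t)
    s_next, y = step(s, a)
    delta = y - eta + v @ features(s_next) - v @ features(s)   # TD error
    eta += e_lr * (y - eta)                     # update average-reward estimate
    v += c_lr * delta * features(s)             # critic (differential value)
    theta += a_lr * delta * (a - prob) * features(s)  # delta * grad log pi
    s = s_next
```

After training, the policy should send the suggestion mostly in context 0 and withhold it in context 1, i.e., it detects the (here, built-in) treatment effect heterogeneity between contexts.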
Estimating Equations for \hat{\eta}_t, \hat{V}_t

The estimators \hat{\eta}_t, \hat{V}_t are based on the fact that (\eta, V) = (\eta_{\theta_t}, V_{\theta_t}) is a solution to

0 = E_{\theta_t}\left[\Big(Y_{t+1} - \eta + V(S_{t+1}) - V(S_t)\Big)\begin{pmatrix} 1 \\ f(S_t) \end{pmatrix}\right]

for any vector f(\cdot) of appropriately integrable functions.

The idea: parameterize V_{\theta_t}, then solve empirical versions of these equations.
Estimating Equations for \hat{\eta}_t, \hat{V}_t

Parameterize V_{\theta_t}(s) as v^T f(s). The intuition is to solve

0 = \sum_{j=0}^{t} \Big(Y_{j+1} - \eta + v^T f(S_{j+1}) - v^T f(S_j)\Big)\begin{pmatrix} 1 \\ f(S_j) \end{pmatrix}

Use first (or second) order incremental solutions.
A Very Little Review!
An exact incremental solution to

0 = \sum_{j=0}^{t} \big(Y_{j+1} - \mu\big)

is

\mu_{j+1} = \mu_j + \sigma_j \big(Y_{j+1} - \mu_j\big)

where \sigma_j = \frac{1}{j+1}.
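In code, this is just the running-mean recursion (a direct transcription of the display above):

```python
def incremental_mean(ys):
    """Running mean: mu_{j+1} = mu_j + (1/(j+1)) * (y_{j+1} - mu_j),
    the exact incremental solution to 0 = sum_j (y_{j+1} - mu)."""
    mu = 0.0
    for j, y in enumerate(ys):
        mu += (1.0 / (j + 1)) * (y - mu)
    return mu
```

Each step needs only the previous estimate, not the whole history; the same first-order incremental idea is what the online training algorithm uses for η̂_t and V̂_t.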
Ideas
• Real-Time Treatment Policies & Example
• “Personalized Continually Learning Treatment Policy”
– Warm-Start Policy
– Online Training Algorithm
• Clinical Trial
Proposal
A Two-Group Randomized Trial of “Personalized Continually Learning Treatment Policy” versus “Standard Care”
Note
The personalized policy must be operationalized, i.e., “locked down,” to enhance replicability. Prior to the trial:

1) The warm-start policy is pre-specified.
2) The training algorithm must be completely pre-specified.
Standard RCT
Standard data analyses can be used to compare Personalized Policy group versus Standard Care group.
1) Survival outcomes, longitudinal outcomes
2) Subpopulation (moderator) analyses, e.g., estimating the treatment effect for more severely ill people
Interesting Challenges
• At the end of the trial different people will have different policies, e.g. different θ’s.
– Testing if the variation between people in their policies at the end of the study is greater than that due to noise.
• Provide error bounds on how much the personalized policy will improve over the warm start, at least on average.
Interesting Challenges
• Methods for using trial data to estimate the average reward for a variety of policies.
• Methods for using trial data to construct a new single “optimal” policy. This new policy could be used in forming a warm-start for a future trial.
Interesting Challenges
• Optimality criterion for the online training algorithm: minimize regret.
• Or should the regret be an averaged regret?
Interesting Challenges
• How to formulate “Bayesian” heuristics to design the stochastic gradient algorithm, balancing the use of information learned so far against the need to continue exploring?
• Combining information across individuals?
Challenges
• Any method should provide confidence intervals and permit scientists to test hypotheses.
• How to reduce the amount of self-report data? (How might you do this?)
• How to accommodate/utilize the vast amount of missing data, some of which will be informative…
Collaborators