
Page 1:

Continual Learning in

Sequential Decision Making

S.A. Murphy

In Honor of A. Wald, JSM

Page 2:

"Abraham Wald" by Konrad Jacobs, Erlangen, Copyright is

MFO - Mathematisches Forschungsinstitut Oberwolfach

Father of Sequential Decision Making

Find the decision making policy, π, that at each time n, inputs the present state Sn and outputs one of three actions: {p0 & stop, p1 & stop, continue}.

This policy should minimize

$g\Big(w_0\,P_{p_0,\pi}\big[\pi(S_N)=p_1\,\&\,\mathrm{stop}\big]+c\,E_{p_0,\pi}[N]\Big)\;+\;(1-g)\Big(w_1\,P_{p_1,\pi}\big[\pi(S_N)=p_0\,\&\,\mathrm{stop}\big]+c\,E_{p_1,\pi}[N]\Big)$

where g is a probability, w0, w1, c are positive weights, and N is the stopping time.
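To make the objective concrete, here is a small Monte Carlo sketch (my own illustration, not from the talk) that evaluates the weighted risk above for an SPRT-style thresholding policy deciding between two Bernoulli hypotheses p0 and p1; the hypotheses, weights, and log-likelihood-ratio thresholds are all hypothetical choices.

```python
import numpy as np

# Hypothetical problem setup: decide between Bernoulli(p0) and Bernoulli(p1).
rng = np.random.default_rng(0)
p0, p1 = 0.3, 0.6
g, w0, w1, c = 0.5, 10.0, 10.0, 0.1          # prior weight g, error weights w0/w1, cost c per observation
log_a, log_b = np.log(9.0), np.log(1 / 9.0)  # stop-and-decide thresholds on the log-likelihood ratio

def run_policy(p_true, n_max=10_000):
    """Run the thresholding policy under p_true; return (terminal action, stopping time N)."""
    llr = 0.0
    for n in range(1, n_max + 1):
        x = rng.random() < p_true
        llr += np.log(p1 / p0) if x else np.log((1 - p1) / (1 - p0))
        if llr >= log_a:
            return "p1&stop", n
        if llr <= log_b:
            return "p0&stop", n
    return "p0&stop", n_max                  # truncation guard; rarely reached

def weighted_risk(n_sims=5_000):
    runs0 = [run_policy(p0) for _ in range(n_sims)]      # behaviour when p0 is true
    runs1 = [run_policy(p1) for _ in range(n_sims)]      # behaviour when p1 is true
    err0 = np.mean([d == "p1&stop" for d, _ in runs0])   # P_{p0,pi}[pi(S_N) = p1 & stop]
    err1 = np.mean([d == "p0&stop" for d, _ in runs1])   # P_{p1,pi}[pi(S_N) = p0 & stop]
    en0 = np.mean([n for _, n in runs0])                 # E_{p0,pi}[N]
    en1 = np.mean([n for _, n in runs1])                 # E_{p1,pi}[N]
    return g * (w0 * err0 + c * en0) + (1 - g) * (w1 * err1 + c * en1)

print(weighted_risk())
```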

Page 3:

The Dream!

“Continually Learning Mobile Health Intervention”

• Help maintain healthy behaviors

• Help you achieve your health goals

– Help you better trade off long term benefit with short term momentary pleasure

• The ideal mHealth intervention

– will be there when you need it and will not intrude when you don’t need it.

– will adjust to unanticipated life challenges

Page 4:

Ideas

• Real-Time Treatment Policies

• “Personalized Continually Learning Treatment Policy”

– Warm-Start Policy

– Online Training Algorithm

• Clinical Trial

Page 5:

Treatment Policies are individualized treatments delivered whenever and wherever the individual needs help. They are delivered by a wearable device (e.g., smartphone).

Ot: Observations at tth decision time (high dimensional)

At: Action at tth decision time (treatment)

Yt+1 : Proximal Response (aka: Reward, Cost)

Page 6:

mHealth

HeartSteps Activity Coach

o Wearable band measures activity and sleep quality; phone sensors measure busyness of calendar, location, weather; utility ratings

o In which contexts should smartphone ping and deliver activity ideas?


Page 7:

Examples

1) Decision Times (Times at which a treatment can be provided.)

1) Regular intervals in time (e.g. every 10 minutes)

2) At user demand

HeartSteps: Approximately every 2-2.5 hours

Page 8:

Examples

2) Observations, Ot:
   1) Passively collected (via sensors)
   2) Actively collected (via self-report)

HeartSteps observations include activity recognition, location, calendar, step count, usefulness ratings, adherence, self-efficacy, …

Page 9:

Examples

3) Actions, At:
   1) Treatments that can be provided at decision time
   2) Whether to provide a treatment

HeartSteps: Activity Recommendations

Page 10:

HeartSteps: Activity Recommendation or No Message

Page 11:

Examples

4) Proximal Response (reward), Yt+1

HeartSteps: Activity (step count) over the next 60 minutes.

Page 12:

Treatment Policy

A treatment policy is a mapping from the information on an individual at each decision time t, say St, to a probability distribution on the action space; this will be a distribution for At.

Example of a parameterized policy for a binary action: e.g., the logistic form πθ(1|St) used for the warm-start policy (Page 20).

Page 13:

Desired Treatment Policy

• Individuals recovering from a heart attack

• Develop a coach app on phone. Policy underlying coach app will need to trade off

• short term improvement/maintenance of physical activity level

with

• burden and engagement

to maintain activity over the long term.

Page 14:

Current State of the Art

• Expert-derived treatment policy: Scientists formulate the policy using behavioral/social theories, clinical experience, observational data analyses.

• Concerns that different individuals may do better with different policies (treatment effect heterogeneity)
  • Between individuals
  • Temporal, within individuals

Page 15:

Desired Online Algorithm

• How to best detect and use treatment effect heterogeneity to improve policy?

• Online algorithm should detect that an individual is responding differently from the “average person” to a treatment action

• Online algorithm should detect that an individual is responding differently at later times to a treatment action than at early times.

Page 16:

Summary of Goal

1) Take advantage of the existing approach of constructing a policy based on theory and clinical experience, e.g., the expert-derived policy.

2) Online algorithm should alter the treatment policy if doing so will improve the average reward.

Page 17:

Ideas

• Real-Time Treatment Policies & Example

• “Personalized Continually Learning Treatment Policy”

– Warm-Start Policy

– Online Training Algorithm

• Clinical Trial

Page 18:

Personalized Continually Learning Treatment Policy

1) Two parts:

1) Warm-Start Policy

2) Online Training Algorithm

Page 19:

Warm-Start Policy

• Warm-start policy is stochastic; it is based on an expert-derived policy or a treatment policy developed via a micro-randomized trial.

• Variation in actions can help retard habituation and maintain engagement.

• Parameterized policy can be interpreted/vetted by domain experts

Page 20:

Example Warm-Start

1) Clinician’s policy: If calendar is not busy and if risk of burden is low, then provide an activity suggestion.

2) Warm-Start policy, :

¼µ(1jSt) =eµ0+µ1Ct+µ2Bt+µ

T3 g(St)

1 + eµ0+µ1Ct+µ2Bt+µT3 g(St)

¼µ(ajs) = P [A = ajS = s]

Ct; Bt 2 f0; 1g; (Ct; Bt) ½ St
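A minimal sketch of this warm-start policy (illustration only; the feature map g(St) and the θ values below are hypothetical placeholders):

```python
import numpy as np

# pi_theta(1 | S_t): logistic in indicator C_t, indicator B_t,
# and extra state features g(S_t). All numbers here are hypothetical placeholders.
def pi_theta(C_t, B_t, g_S, theta0, theta1, theta2, theta3):
    z = theta0 + theta1 * C_t + theta2 * B_t + np.dot(theta3, g_S)
    return 1.0 / (1.0 + np.exp(-z))        # equals e^z / (1 + e^z)

rng = np.random.default_rng(1)
g_S = np.zeros(3)                          # g(S_t); with theta3 = 0 it plays no role at warm start
p_send = pi_theta(C_t=0, B_t=0, g_S=g_S,
                  theta0=1.66, theta1=-2.1, theta2=-2.1, theta3=np.zeros(3))
A_t = int(rng.random() < p_send)           # draw the stochastic action A_t
print(p_send, A_t)
```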

Page 21:

Example Warm-Start

Warm-Start policy:

$\pi_\theta(1\mid S_t) \;=\; \frac{e^{\theta_0+\theta_1 C_t+\theta_2 B_t+\theta_3^{T} g(S_t)}}{1+e^{\theta_0+\theta_1 C_t+\theta_2 B_t+\theta_3^{T} g(S_t)}}$

How to initialize θ0, ..., θ3? Set θ3 = 0; set the remaining θ0, θ1, θ2 so that

πθ(1|St)     Bt = 1    Bt = 0
  Ct = 1       .2        .4
  Ct = 0       .4       ~.84
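One way to back θ0, θ1, θ2 out of such a table (an assumption about the intended construction, not code from the talk): with θ3 = 0 and no interaction term, the logits of three of the cells pin down the three parameters, and the fourth cell is then implied by the additive model (here it comes out near .08 rather than the .2 shown, so matching all four cells exactly would require an interaction or a g(St) feature).

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

# Target send-probabilities taken from three cells of the table above.
p_00 = 0.84   # C_t = 0, B_t = 0
p_10 = 0.40   # C_t = 1, B_t = 0
p_01 = 0.40   # C_t = 0, B_t = 1

theta0 = logit(p_00)              # ~ 1.66
theta1 = logit(p_10) - theta0     # effect of C_t = 1 on the logit, ~ -2.06
theta2 = logit(p_01) - theta0     # effect of B_t = 1 on the logit, ~ -2.06

# Probability implied by the additive model when C_t = 1 and B_t = 1:
p_11 = 1.0 / (1.0 + np.exp(-(theta0 + theta1 + theta2)))
print(theta0, theta1, theta2, p_11)   # p_11 ~ 0.08
```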

Page 22:

Personalized Continually Learning Treatment Policy

1) Two parts:

1) Warm-Start Policy

2) Online Training Algorithm

Policy is the “control law”; online training algorithm is the “adaptive control rule” (T. L. Lai, 2001)

Page 23:

Online Training Algorithm

The training algorithm is used to improve the Warm-Start policy for each person separately; it is a stochastic gradient ascent algorithm.

1) After each decision time and on each individual separately, the training algorithm is used to update the θ parameters of the current policy, πθ.

2) At the next decision time, treatment, At, is assigned using the individual’s updated policy.

Page 24:

Background

1) On each individual:

2) Probabilistic Model: Markov Decision Process

3) Optimality criterion for policy: Average Reward

Page 25:

Markov Decision Process

Markovian Assumptions:

$P\big[S_{t+1}=s' \mid S_1,A_1,\ldots,S_t,A_t\big] \;=\; P\big[S_{t+1}=s' \mid S_t,A_t\big]$
and
$P\big[Y_{t+1}=r \mid S_1,A_1,\ldots,S_t,A_t\big] \;=\; P\big[Y_{t+1}=r \mid S_t,A_t\big]$

Stationarity Assumptions:

$P\big[S_{t+1}=s' \mid S_t=s,A_t=a\big] \;=\; p(s'\mid s,a)$
and
$E\big[Y_{t+1} \mid S_t=s,A_t=a\big] \;=\; r(s,a)$

Page 26:

Optimality Criterion for Policy

Average Reward, ηθ, for policy πθ:

$\eta_\theta \;=\; \lim_{T\to\infty}\frac{1}{T}\,E_\theta\!\left[\,\sum_{t=0}^{T-1} Y_{t+1}\,\Big|\,S_0=s\right] \;=\; \sum_{s} d_\theta(s)\sum_{a}\pi_\theta(a\mid s)\,r(s,a)$

Eθ denotes expectation under the stationary distribution, dθ, associated with πθ.
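A self-contained toy illustration (my own made-up numbers, not from the talk) of the right-hand side: for a small finite MDP and a fixed stochastic policy, compute the stationary distribution dθ of the policy-induced state chain and then the average reward.

```python
import numpy as np

# Toy 2-state, 2-action MDP with a fixed stochastic policy (all numbers invented).
P = np.array([                     # P[a, s, s'] : state transition probabilities
    [[0.9, 0.1], [0.3, 0.7]],      # under action 0
    [[0.5, 0.5], [0.2, 0.8]],      # under action 1
])
r = np.array([                     # r[s, a] : expected proximal response
    [0.0, 1.0],
    [2.0, 0.5],
])
pi = np.array([                    # pi[s, a] : probability of action a in state s
    [0.7, 0.3],
    [0.4, 0.6],
])

# Policy-induced chain: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
P_pi = np.einsum('sa,asp->sp', pi, P)

# Stationary distribution d: d P_pi = d, sum_s d(s) = 1
n = P_pi.shape[0]
A_mat = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
d, *_ = np.linalg.lstsq(A_mat, np.concatenate([np.zeros(n), [1.0]]), rcond=None)

# Average reward: eta = sum_s d(s) sum_a pi(a|s) r(s, a)
eta = np.sum(d[:, None] * pi * r)
print(d, eta)
```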

Page 27:

Working Premise

1) Each individual follows a possibly different Markovian process. Thus the policy achieving maxθ ηθ may differ by individual

2) The Markovian processes are sufficiently similar so that starting from the warm-start policy, a policy achieving maxθ ηθ for the individual can be learned quickly.

Page 28:

Nuisance Parameter: Differential Value

Vθ is the Differential Value function

Vθ(s) - Vθ(s′) reflects the difference in the sum of responses accrued when starting in state s as opposed to state s′.

(ηθ is the average reward)

$V_\theta(s) \;=\; \lim_{T\to\infty} E_\theta\!\left[\,\sum_{t=0}^{T}\big(Y_{t+1}-\eta_\theta\big)\,\Big|\,S_0=s\right].$
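A matching toy sketch for Vθ (again my own invented numbers): on a finite chain the differential value can be obtained from the Poisson/Bellman relation V(s) = r̄(s) − ηθ + Σ_{s′} Pθ(s, s′) V(s′), together with the normalization Σ_s dθ(s) V(s) = 0, which is consistent with the definition above because the centered sums average to zero when S0 is drawn from dθ.

```python
import numpy as np

# Policy-induced chain P_pi and mean responses r_bar(s) for a toy 2-state example.
P_pi = np.array([[0.8, 0.2],
                 [0.4, 0.6]])
r_bar = np.array([1.0, 3.0])

# Stationary distribution d and average reward eta.
n = 2
d, *_ = np.linalg.lstsq(np.vstack([P_pi.T - np.eye(n), np.ones(n)]),
                        np.array([0.0, 0.0, 1.0]), rcond=None)
eta = d @ r_bar

# Differential value: solve (I - P_pi) V = r_bar - eta subject to d @ V = 0.
A_v = np.vstack([np.eye(n) - P_pi, d])
b_v = np.concatenate([r_bar - eta, [0.0]])
V, *_ = np.linalg.lstsq(A_v, b_v, rcond=None)
print(eta, V)     # V[0] - V[1] compares starting in state 0 versus state 1
```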

Page 29:

Background: Bellman Equation

Oracle Temporal Difference:

$\delta_t \;=\; Y_{t+1}-\eta_\theta+V_\theta(S_{t+1})-V_\theta(S_t)$

Bellman Equation:

$E_\theta\big[\,\delta_t \mid S_t\,\big] \;=\; 0$

Page 30:

Background: Bellman Equation

Bellman’s equation implies that

$E_\theta\!\left[\Big(Y_{t+1}-\eta+V(S_{t+1})-V(S_t)\Big)\begin{pmatrix}1\\ f(S_t)\end{pmatrix}\right]$

will be, for all t, for any vector, f(·), of appropriately integrable functions, equal to 0 if $\eta=\eta_\theta$ and $V=V_\theta$.

Page 31:

Background for Online Training Algorithm

Policy Gradient Theorem:

$\nabla_\theta\,\eta_\theta \;=\; E_\theta\big[\,\delta_t\,\nabla_\theta\log\pi_\theta(A_t\mid S_t)\,\big]$

where

$\delta_t \;=\; Y_{t+1}-\eta_\theta+V_\theta(S_{t+1})-V_\theta(S_t)$

Page 32:

Stochastic Gradient Algorithm

At the tth decision time, draw action At from the current policy $\pi_{\theta_t}$; then update the estimators of the average reward and the differential value, $\hat\eta_t,\ \hat V_t$. Use these estimators to form the TD error,

$\delta_t \;=\; Y_{t+1}-\hat\eta_t+\hat V_t(S_{t+1})-\hat V_t(S_t)$

Move in the direction of the empirical gradient,

$\widehat{\nabla_\theta\,\eta(\pi_\theta)} \;=\; \delta_t\,\nabla_\theta\log\pi_\theta(A_t\mid S_t)\big|_{\theta=\theta_t}$

to obtain $\theta_{t+1}$.
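A minimal runnable sketch of this loop (my own toy environment, feature map, and step sizes stand in for the real mHealth setting; the policy is a logistic function of the features, as in the warm-start example):

```python
import numpy as np

# Sketch of the online actor-critic / stochastic-gradient update described above.
# The environment, features f, and step sizes are hypothetical placeholders.
rng = np.random.default_rng(0)

def f(s):                          # f(s): state features used by policy and critic
    return np.array([1.0, s])

def pi1(theta, s):                 # pi_theta(A_t = 1 | S_t = s), logistic in f(s)
    return 1.0 / (1.0 + np.exp(-(theta @ f(s))))

def grad_log_pi(theta, s, a):      # gradient of log pi_theta(a | s) for the logistic policy
    return (a - pi1(theta, s)) * f(s)

def env_step(s, a):                # toy person/environment: proximal response and next state
    y = a * (1.0 - s) + 0.1 * rng.standard_normal()
    return y, rng.random()

theta = np.zeros(2)                # policy parameters (the warm start)
v = np.zeros(2)                    # critic weights: V(s) ~ v^T f(s)
eta_hat = 0.0                      # running average-reward estimate
alpha, beta, gamma = 0.05, 0.05, 0.05   # step sizes

s = rng.random()
for t in range(2000):
    a = int(rng.random() < pi1(theta, s))              # draw A_t from the current policy
    y, s_next = env_step(s, a)
    delta = y - eta_hat + v @ f(s_next) - v @ f(s)     # TD error with current estimators
    eta_hat += gamma * (y - eta_hat)                   # update average-reward estimator
    v += beta * delta * f(s)                           # update differential-value estimator
    theta += alpha * delta * grad_log_pi(theta, s, a)  # move along the empirical gradient
    s = s_next

print(theta, eta_hat)
```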

Page 33:

Estimating Equations for $\eta_{\theta_t}$ and $V_{\theta_t}$

are based on the fact that $\eta=\eta_{\theta_t},\ V=V_{\theta_t}$ is a solution to

$0 \;=\; E_{\theta_t}\!\left[\Big(Y_{t+1}-\eta+V(S_{t+1})-V(S_t)\Big)\begin{pmatrix}1\\ f(S_t)\end{pmatrix}\right]$

for any appropriately integrable function f(·).

The idea: parameterize $V_{\theta_t}$, then solve empirical versions of these equations.

Page 34:

Estimating Equations for $\eta_{\theta_t}$ and $V_{\theta_t}$

Parameterize $V_{\theta_t}(s)$ as $v^{T}f(s)$.

Intuition is to solve

$0 \;=\; \sum_{j=0}^{t}\left[\Big(Y_{j+1}-\eta+v^{T}f(S_{j+1})-v^{T}f(S_j)\Big)\begin{pmatrix}1\\ f(S_j)\end{pmatrix}\right]$

Use first (or second) order incremental solutions.
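A sketch of a second-order incremental solution (my own illustration, under the linear parameterization V(s) ≈ v^T f(s)): writing x = (η, v), each summand above is linear in x, so the running root is x_t = A_t^{-1} b_t with A_t and b_t accumulated one observation at a time. The data-generating loop below is a toy placeholder.

```python
import numpy as np

# Incremental least-squares solution of the empirical estimating equation above.
rng = np.random.default_rng(0)

def f(s):                                   # features; no constant term, since the
    return np.array([s, s * s])             # differential value is defined up to a constant

A = 1e-3 * np.eye(1 + 2)                    # small ridge so early solves are well posed
b = np.zeros(1 + 2)

s = rng.random()
for j in range(500):
    a = rng.integers(2)                                   # action from some fixed policy
    y = a * (1.0 - s) + 0.1 * rng.standard_normal()       # toy proximal response Y_{j+1}
    s_next = rng.random()
    z = np.concatenate([[1.0], f(s)])                     # the "instrument" (1, f(S_j))
    u = np.concatenate([[1.0], f(s) - f(s_next)])         # coefficient vector of x = (eta, v)
    A += np.outer(z, u)                                   # A_t = sum_j z_j u_j^T
    b += z * y                                            # b_t = sum_j z_j Y_{j+1}
    s = s_next

x = np.linalg.solve(A, b)
eta_hat, v_hat = x[0], x[1:]
print(eta_hat, v_hat)
```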

Page 35:

A Very Little Review!

An exact incremental solution to

$0 \;=\; \sum_{j=0}^{t}\big[\,Y_{j+1}-\mu\,\big]$

is

$\mu_{j+1} \;=\; \mu_j+\sigma_j\big[\,Y_{j+1}-\mu_j\,\big]$

where $\sigma_j=\tfrac{1}{j+1}$.
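As a quick check (illustration only), this update reproduces the ordinary sample mean exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal(100)

mu = 0.0
for j, y in enumerate(Y):          # sigma_j = 1 / (j + 1)
    mu = mu + (y - mu) / (j + 1)

print(mu, Y.mean())                # identical up to floating-point error
```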

Page 36:

Page 37:

Ideas

• Real-Time Treatment Policies & Example

• “Personalized Continually Learning Treatment Policy”

– Warm-Start Policy

– Online Training Algorithm

• Clinical Trial

Page 38:

Proposal

A Two-group Randomized Trial of “Personalized Continually Learning Treatment Policy” versus “Standard Care”

Page 39:

Note

Personalized policy must be operationalized, e.g., “locked down” to enhance replicability. Prior to Trial:

1) Warm start policy is pre-specified.

2) Training algorithm must be completely pre-specified

Page 40:

Standard RCT

Standard data analyses can be used to compare Personalized Policy group versus Standard Care group.

1) Survival outcomes, longitudinal outcomes

2) Subpopulation (moderator) analyses, e.g., estimate the treatment effect for more severely ill people.

Page 41:

Interesting Challenges

• At the end of the trial different people will have different policies, e.g., different θ’s.

– Testing if the variation between people in their policies at the end of the study is greater than that due to noise.

• Provide error bounds on how much the personalized policy will improve over the warm start, at least on average.

Page 42:

Interesting Challenges

• Methods for using trial data to estimate the average reward for a variety of policies.

• Methods for using trial data to construct a new single “optimal” policy. This new policy could be used in forming a warm-start for a future trial.

Page 43:

Interesting Challenges

• Optimality criterion for online training algorithm: minimize regret (one standard form is sketched after this list).

• Or should the regret be an averaged regret?
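One standard average-reward notion of regret over T decision times (stated here as an assumption; the talk may intend a different display) is

$\mathrm{Regret}(T) \;=\; T\,\max_{\theta}\,\eta_\theta \;-\; E\!\left[\,\sum_{t=0}^{T-1} Y_{t+1}\right],$

and an averaged regret would divide this quantity by T.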

Page 44:

Interesting Challenges

• How to formulate “Bayesian” heuristics to design the stochastic gradient algorithm and facilitate the balance between using the information learned so far and the need to continue exploration?

• Combining information across individuals?

Page 45:

Challenges

• Any method should provide confidence intervals/permit scientists to test hypotheses.

• How to reduce the amount of self-report data (How might you do this?)

• How to accommodate/utilize the vast amount of missing data, some of which will be informative…

Page 46:

Collaborators