Dr. Dan Lizotte talks with WICI community, January 2013.
Possible Futures: Complexity in Sequential Decision-Making

Dan Lizotte
Assistant Professor, David R. Cheriton School of Computer Science
University of Waterloo

Waterloo Institute for Complexity and Innovation, 22 Jan 2013
Agents, Rationality, and Sequential Decision-Making
• Sequential decision-making from the AI tradition
• Supporting sequential clinical decision-making
• The assumption of rationality:
  • mitigates complexity ☺
  • reduces expressiveness ☹
• Alternatives to “simple rationality”
• Please interrupt at any time
Graphic: Leslie Pack Kaelbling, MIT
“An agent is just something that perceives and acts.” - Russell & Norvig

[Figure: agent-environment loop. At each step the agent observes state s_t, takes action a_t, and receives reward r_t+1 and next state s_t+1 from the environment.]
The “Reward Hypothesis”

“That all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward).”
— Rich Sutton

A.K.A. “Rationality.” (“reward” ≡ “utility”)
[Figure: the same agent-environment loop, relabelled for the clinical setting: state = Status, action = Treatment, reward = Outcome.]
Autonomous Agent Paradigm
• Good
• Goal is to maximize long term reward
• Makes context-dependent, individualized decisions
• Handles uncertain environments naturally
• Bad
• Need to prepare for every eventuality
• Relies on the reward hypothesis
• Relies on the assumption of rational* behaviour
*Rational in the non-judgemental, mathematical sense.
Decision Support Agent
[Figure: agent-environment loop, with a human decision-maker placed between the agent and the environment.]

Recommend rather than perform actions; still leverage the autonomous agent idea.
Place a human in between.
Should I Move or Stay?
Assume we have:
• Observation of the state of the world
• Rewards associated with each state
• A model of how the world changes
Rational Decision Making
Move: outcomes P = 0.1, R = 0; P = 0.1, R = 100; P = 0.7, R = 0; P = 0.1, R = -50
Q(s, Move) = 5

Stay: outcomes P = 0.1, R = -25; P = 0.1, R = 10; P = 0.7, R = 0; P = 0.1, R = 0
Q(s, Stay) = -1.5
• Often, “we” assume we want to maximize expected reward
• For each potential action
• Compute expected reward over all possible future states
• Choose the action with the highest expected reward
• “Q-function” Q(s,a) = E[R|s,a]
How do we decide what to do?
Move: outcomes P = 0.1, R = 0; P = 0.1, R = 100; P = 0.7, R = 0; P = 0.1, R = -50
Q(s, Move) = 5

Stay: outcomes P = 0.1, R = -25; P = 0.1, R = 10; P = 0.7, R = 0; P = 0.1, R = 0
Q(s, Stay) = -1.5

Since Q(s, Move) = 5 > Q(s, Stay) = -1.5:
V(s) = 5, π(s) = Move
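The expected-reward computation on this slide can be sketched in a few lines; the outcome probabilities and rewards are exactly those from the example above.

```python
# Sketch of the one-step expected-reward computation from the example above.
# Each action maps to a list of (probability, reward) outcomes.
outcomes = {
    "Move": [(0.1, 0), (0.1, 100), (0.7, 0), (0.1, -50)],
    "Stay": [(0.1, -25), (0.1, 10), (0.7, 0), (0.1, 0)],
}

def q_value(action):
    """Q(s, a) = E[R | s, a]: probability-weighted sum of rewards."""
    return sum(p * r for p, r in outcomes[action])

q = {a: q_value(a) for a in outcomes}   # Q(s,Move) ≈ 5, Q(s,Stay) ≈ -1.5
policy = max(q, key=q.get)              # action with the highest expected reward
value = q[policy]                       # V(s) = max_a Q(s, a)
print(policy)                           # Move
```

This mirrors the rule on the slide: compute Q for each action, then pick the argmax.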
• For each potential action
• Compute expected reward over all possible future states
• Feasible if reward is known for each state.
• What if reward is influenced by subsequent decision-making?
Sequential Decision-Making
Mosey: outcome P = 1.0, R = 10
Q(s′, Mosey) = 10

Scramble: outcomes P = 0.1, R = 0; P = 0.9, R = 100
Q(s′, Scramble) = 90

Since Q(s′, Scramble) = 90 > Q(s′, Mosey) = 10:
V(s′) = 90, π(s′) = Scramble
Stay: outcomes P = 0.1, R = -25; P = 0.1, R = 10 → V = 90; P = 0.7, R = 0; P = 0.1, R = 0
Q(s, Stay) = -1.5 → 6.5 (replacing the immediate reward 10 with the long-term value 90)

Q(s, Move) = 5

V(s) = 5 → 6.5, π(s) = Move → Stay
• Optimal decisions are influenced by future plans
• Not a problem, if we know what we will do when we get to any future state.
• Replace R (short-term) with V (long-term) by the law of total expectation
• Expectation is now over sequences of states
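The backup described above (replace R with V by the law of total expectation) can be sketched with the numbers from the Mosey/Scramble example.

```python
# Sketch of the two-stage backup from the example above: the outcome that
# previously paid R = 10 leads to a state whose long-term value is V = 90
# (the best of Mosey/Scramble), so we back up that value instead of R.

def expected_value(outcomes):
    """Probability-weighted sum over (probability, value) pairs."""
    return sum(p * v for p, v in outcomes)

# Value of the successor state: best of its two actions.
q_mosey = expected_value([(1.0, 10)])                  # 10
q_scramble = expected_value([(0.1, 0), (0.9, 100)])    # 90
v_next = max(q_mosey, q_scramble)                      # 90

# Current state: replace the immediate reward 10 with v_next = 90.
q_move = expected_value([(0.1, 0), (0.1, 100), (0.7, 0), (0.1, -50)])     # ≈ 5
q_stay = expected_value([(0.1, -25), (0.1, v_next), (0.7, 0), (0.1, 0)])  # ≈ 6.5

policy = "Stay" if q_stay > q_move else "Move"
print(policy)   # Stay — the optimal action flips once future plans are known
```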
Sequential Decision-Making
Policy Evaluation
Time to evaluate one policy: O(S²·T)
Number of policies: O(A^(S·T))
Exhaustive search: O(A^(S·T) · S²·T)
Policy Learning
Time to evaluate: O(A·S²·T)
Exhaustive search: O(A^(S·T) · S²·T)
Policy Learning
• Enormous complexity savings: from O(A^(S·T) · S²·T) to O(A·S²·T)
• What assumptions are required?
• Markov property
• Rationality (linearity of expectation)
• What if we want to avoid the rationality assumption?
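The complexity savings above come from solving backwards one stage at a time. A minimal sketch on a small synthetic MDP (the sizes S, A, T and the random P and R are illustrative, not from the talk):

```python
import numpy as np

# Illustrative finite-horizon policy learning by backward induction on a
# small random MDP. The loop structure makes the O(A·S²·T) cost visible:
# for each of T stages we sweep all A actions, and each backup R[a] + P[a] @ V
# is an S×S matrix-vector product.
rng = np.random.default_rng(0)
S, A, T = 4, 3, 5
P = rng.dirichlet(np.ones(S), size=(A, S))   # P[a, s, s'] transition probabilities
R = rng.normal(size=(A, S))                  # R[a, s] expected immediate reward

V = np.zeros(S)                              # terminal value
policy = np.zeros((T, S), dtype=int)
for t in reversed(range(T)):                 # T stages, newest stage uses old V
    Q = np.array([R[a] + P[a] @ V for a in range(A)])  # A backups, each O(S²)
    policy[t] = Q.argmax(axis=0)             # greedy action per state
    V = Q.max(axis=0)                        # long-term value replaces reward

print(policy[0], V)   # optimal first-stage action per state, and its value
```

Because we only ever look one stage ahead at the already-computed V, the exponential enumeration over policies never happens; the Markov property is what licenses this.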
Beyond Simple Rationality
• maximization of the expected value of the cumulative sum of a received scalar signal
• What is the “right” scalar signal?
  • Symptom relief?
  • Side-effects?
  • Cost?
  • Quality of life?
• Different for every decision-maker, every patient.
Multiple Reward Signals
• How can we reason with multiple reward signals?
• Make use of the human in the loop!
• Preference Revealing (DL,Bowling,Murphy)
• Multi-outcome Screening (DL,Ferguson,Laber)
• Each adds computational complexity
Preference Elicitation
Suppose D different rewards are important for decision-making,
r[1], r[2], ..., r[D]
Assume each person has a function f that maps these to utility: that person’s happiness given any configuration of the r[i], expressed as a scalar value. We could use this as our new reward!
Preference Elicitation attempts to figure out an individual’s f.
Preference Elicitation
1. Determine preferences of the decision-maker
2. Construct reward function from “basis rewards” (different outcomes)
3. Compute the recommended treatment by policy learning
• Cost: O(A·S²·T) per elicitation
Preference Elicitation
One way: Assume f has a nice form:
f(r[1], r[2], ..., r[D]) = δ[1]r[1] + δ[2]r[2] + ... + δ[D]r[D]
Then Preference Elicitation figures out the δ, or weights, an individual attaches to different rewards. How?
The values δ[i] and δ[j] define an exchange rate between r[i] and r[j].
Preference Elicitation
f(r[1], r[2], ..., r[D]) = δ[1]r[1] + δ[2]r[2] + ... + δ[D]r[D]
“If I lost δ[j] units of r[i], but I gained δ[i] units of r[j], I would be equally happy.”
Preference elicitation asks questions like:“If I took away 5 units of r[i], how many units of r[j] would you want?”
Once f is known, standard single-outcome methods can be applied.
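The exchange-rate identity stated above can be checked directly: with a linear f, trading away δ[j] units of r[i] while gaining δ[i] units of r[j] leaves utility unchanged. The weights and reward values below are hypothetical, chosen only to illustrate.

```python
# Sketch of the linear-utility assumption f(r) = δ·r and its exchange rate.
# delta and r are hypothetical values, not elicited from anyone.
delta = [0.7, 0.3]            # hypothetical weights for D = 2 basis rewards
r = [40.0, 10.0]              # hypothetical basis-reward values

def f(r, delta):
    """Linear utility: weighted sum of basis rewards."""
    return sum(d * x for d, x in zip(delta, r))

u0 = f(r, delta)
# Lose delta[1] units of r[0], gain delta[0] units of r[1]:
r_traded = [r[0] - delta[1], r[1] + delta[0]]
u1 = f(r_traded, delta)
print(u0, u1)   # equal up to float rounding: the trade is utility-neutral
```

This is exactly why elicitation questions of the form “if I took away X units of r[i], how many units of r[j] would you want?” identify the ratios of the δ.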
Preference Elicitation
• Are the questions based in reality?
• Even if they are, can the decision-maker answer them?
• How will the decision-maker respond to “I know what you want.” ?
Preference Revealing
1. Determine the preferences of the decision-maker
2. Compute the recommended treatment for all possible preferences (δ)
3. Show, for each action, what preferences are consistent with that action being recommended
(DL, S Murphy, M Bowling)
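Step 2 above (compute the recommendation for all possible preferences) can be sketched for D = 2 basis rewards by sweeping δ = (w, 1 − w) over a grid. The treatment names match the CATIE slide that follows, but the Q-values here are hypothetical, not estimates from the data.

```python
import numpy as np

# Sketch of preference revealing with D = 2 basis rewards: instead of
# eliciting δ, compute the recommended action for every preference
# δ = (w, 1 - w) and report which preferences select which action.
# The Q-values are hypothetical placeholders.
Q = {  # Q[a] = (value under basis reward 1, value under basis reward 2)
    "Ziprasidone": (6.0, 2.0),
    "Olanzapine": (2.0, 7.0),
}

def recommended(w):
    """Best action under the scalarized reward w*r[1] + (1-w)*r[2]."""
    return max(Q, key=lambda a: w * Q[a][0] + (1 - w) * Q[a][1])

for w in np.linspace(0, 1, 11):
    print(f"δ = ({w:.1f}, {1 - w:.1f}) → {recommended(w)}")
# Low weight on reward 1 selects Olanzapine; high weight selects Ziprasidone,
# with the switch at the preference where the scalarized Q-values cross.
```

The output is the kind of display the method produces: for each action, the region of preference space in which it would be recommended.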
Reward: PANSS, δ = (1, 0, 0)
Reward: BMI, δ = (0, 1, 0)
Reward: HQLS, δ = (0, 0, 1)

[Figure legend: Ziprasidone vs. Olanzapine, Phase 1]

Positive And Negative Syndrome Scale vs. Body Mass Index vs. Heinrichs Quality of Life Scale

Data: Clinical Antipsychotic Trials of Intervention Effectiveness (NIMH)
Preference Revealing
Benefits
No reliance on preference elicitation
Facilitates deliberation rather than imposing a single recommended treatment
Information still individualized through patient state
Treatments that are not suggested for any preference are implicitly screened
Costs
O(A^T · S^T) worst case
cf. O(A·S²·T) for single reward
cf. 2^ℵ₀ for “enumeration”
Feasible for short horizons
Value function/policy now a function of state and preference
Value functions not convex in preference, thus related methods for POMDPs do not apply
Multi-outcome Screening
1. Elicit “clinically meaningful difference” for each outcome
2. Screen out treatments that are “definitely bad”
3. Recommend the set of remaining treatments
Multi-outcome Screening
Suppose two* different rewards are important for decision making: r[1], r[2]
Screen out a treatment if it is much worse than another treatment for one reward and not any better for the other reward.
Do not screen if 1) treatments are not much different or 2) one treatment is much worse for one reward but much better for the other
Output: Set containing one or both treatments, possibly with a reason if both are included.
(DL, B Ferguson, E Laber)
(From “Non-deterministic multiple-reward fitted-Q for decision support”)

Algorithm 1 Non-deterministic fitted-Q
  Learn Q_T = (Q_T[1], ..., Q_T[D]) (Section 3); place in 𝒬_T
  for t = T−1, T−2, ..., 1 do
    for all s_t^i in the data do
      Map 𝒬_{t+1} to P9_{t+1}(s_t^i) (Section 3.2); generate C_f(P9_{t+1}(s_t^i)) (Section 3.1)
      for all π ∈ C_f(P9_{t+1}(s_t^i)) do
        for all Q_{t+1} ∈ 𝒬_{t+1} do
          Learn (Q[1]_t(·,·,π), ..., Q[D]_t(·,·,π)) using Q_{t+1} for the future value estimates
          Place the resulting (Q[1]_t(·,·,π), ..., Q[D]_t(·,·,π)) in 𝒬_t
[Figure 4: two panels plotting Q_T[2](s_T, a) against Q_T[1](s_T, a) for actions a1–a4; one panel eliminates actions using Practical Domination, the other using Strong Practical Domination.]

Fig. 4 Comparison of rules for eliminating actions. In this simple example, we suppose the Q-vectors (Q_T[1](s_T, a), Q_T[2](s_T, a)) are (4.9, 4.9), (3, 5.2), (1.8, 5.6), (4.6, 4.6) for a1, a2, a3, a4, respectively, and suppose Δ1 = Δ2 = 0.5. Figure 4(a): Using the Practical Domination rule, action a4 is not eliminated by a3 because it is not much worse according to either basis reward, as judged by Δ1 and Δ2. Action a2 is eliminated because although it is slightly better than a1 according to basis reward 2, it is much worse according to basis reward 1. Similarly, a3 is eliminated by a2. Note the small solid rectangle to the left of a2: points in this region (including a3) are dominated by a2, but not by a1. This illustrates the non-transitivity of the Practical Domination relation, and in turn shows that it is not a partial order. Figure 4(b): Using Strong Practical Domination, which is a partial order, no actions are eliminated, and there are no regions of non-transitivity.
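The Fig. 4 example can be reproduced in a few lines. The rule below is my reading of the screening criterion stated earlier in the deck (b is screened by a if b is much worse than a for some reward and not much better for any); the Q-vectors and Δ values are those given in the caption.

```python
# Sketch of the Practical Domination screen using the Q-vectors and
# Δ1 = Δ2 = 0.5 from the Fig. 4 caption. Rule as stated in the deck:
# a eliminates b if b is much worse than a (by more than Δ) for some
# basis reward, and not much better than a for any basis reward.
Q = {"a1": (4.9, 4.9), "a2": (3.0, 5.2), "a3": (1.8, 5.6), "a4": (4.6, 4.6)}
delta = (0.5, 0.5)

def practically_dominates(a, b):
    much_worse = any(Q[a][d] - Q[b][d] > delta[d] for d in range(2))
    much_better = any(Q[b][d] - Q[a][d] > delta[d] for d in range(2))
    return much_worse and not much_better

screened = {b for b in Q for a in Q if a != b and practically_dominates(a, b)}
print(sorted(screened))                   # ['a2', 'a3'] — a1 and a4 survive
print(practically_dominates("a1", "a3"))  # False: a3 falls only via a2,
                                          # showing the non-transitivity
```

As in the caption, a4 survives because its badness on reward 2 is offset by being much better on reward 1 (a tradeoff, so it is not screened), while a2 and a3 are screened.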
removing from consideration future policy sequences that are dominated no matter what current action is chosen. Again, this pruning has no impact on solution quality because we are only eliminating future policy sequences that will never be executed. Despite the exponential dependence on T, we will show that our method can be successfully applied to real data in Section 5, and we defer the development of approximations to future work.
4 Practical domination
So far we have presented our algorithm assuming we will use Pareto dominance to define ≻. However, there are two ways in which Pareto dominance does not reflect the reasoning of a physician when she determines whether one action is superior to another. First, an ac-
Multi-outcome Screening
Benefits
No model of preference required
Suggests a set rather than imposing a single recommended treatment
Information still individualized through patient state
Treatments with bad evidence are explicitly screened
Screening criterion is intuitive
Costs
Must consider all policies the user might follow in the future
Positive And Negative Syndrome Scale and
Body Mass Index
Possible Futures
Daniel J. Lizotte, Eric B. Laber
Fig. 2 Plot of a vector-valued Q-function (Q_{T−1}[1](s_{T−1}, a_{T−1}; π_T), Q_{T−1}[2](s_{T−1}, a_{T−1}; π_T)) for a fixed state s_{T−1}. Here, a_{T−1} ∈ A (five actions, each plotted with its own marker shape), and π_T ∈ Π, a collection of 20 possible future policies. Thus each marker represents a Q-vector that is achievable by first taking a particular action a_{T−1} and then following a particular policy π_T for the last time point. The markers for Q-pairs with the same action but different future policies have the same shape and colour, but are at different locations. Thus, each + marker near the top of the plot represents a Q-value that is achievable by taking the + action at the current time point and then following a particular future policy.
simplify presentation, we will take P_T(s_T) to be the set of actions that are Pareto-optimal according to Q_T(s_T, ·). However, an important advantage of our algorithm is that it does not rely on any specific definition of P_T; in Section 4 we will introduce a novel definition of P_T (and of P_t, t < T) that is more appropriate for clinical decision support applications.
Having defined our NDP P_T for the final timestep, we use it together with Q_T to derive Q_t and P_t for earlier time points. For time points t < T, define
Q_t[d](s_t, a_t; π_{t+1}, π_{t+2}, …) = f_t(s_t, a_t)ᵀ w_t[d]^{π_{t+1}, π_{t+2}, …},

w_t[d]^{π_{t+1}, π_{t+2}, …} = argmin_w Σ_i ( f_t(s_t^i, a_t^i)ᵀ w − ( r_t^i + Q_{t+1}[d](s_{t+1}^i, π_{t+1}(s_{t+1}^i); π_{t+2}, π_{t+3}, …) ) )².
Note that in this formulation we explicitly include the dependence of Q_t[d] on the choice of future policies π_j, t < j ≤ T. In fact, to get a truly complete picture of Q_t we must learn a weight vector for every basis reward and for every possible sequence of future policies. Figure 2 illustrates what a Q-function might look like for a fixed state s_{T−1} and D = 2 basis rewards. Each choice of a_{T−1} and π_T generates a pair of estimated Q-values, (Q_{T−1}[1](s_{T−1}, a_{T−1}, π_T), Q_{T−1}[2](s_{T−1}, a_{T−1}, π_T)), which we can plot as a point in the plane. We say that such a point represents a vector-valued expected return that is achievable by taking some immediate action and following it with a fixed sequence of policies. The Q-value vectors plotted in Figure 2 show the achievable values for five settings of a_{T−1} and 20 different future policies π_T.
Multi-outcome Screening
Costs
Must consider all policies the user might follow in the future
Enumeration: O(A^(S·T) · S²·T)
Restriction to policies that 1) follow recommendations and 2) are “not too complex” makes computation feasible
CATIE: 10^124 possible futures reduced to 12^13 possible futures
Multi-Outcome Screening
Fig. 7 CATIE NDP for Phase 1 made using P9; “warning” actions that would have been eliminated by Practical Domination but not by Strong Practical Domination have been removed.
Figure 7 shows the NDP learned for Phase 1 using our algorithm with Strong Practical Domination (Δ1 = Δ2 = 2.5) and P9, and actions that receive a “warning” according to Practical Domination have been removed. Here, choice is further increased by requiring an action to be practically better than another action in order to dominate it. Furthermore, although we have removed actions that were warned to have a bad tradeoff—those that were slightly better for one reward but practically worse for another—we still achieved increased choice over using the Pareto frontier alone. In this NDP, the mean number of choices per state is 4.18, and 67% of states have had one or more actions eliminated.
Finally, Figure 8 shows the NDP learned for Phase 1 using our algorithm with Strong Practical Domination (Δ1 = Δ2 = 2.5) and P8. Again, an action must be practically better than another action in order to dominate it. Recall that for P8, we only recommend actions that are not dominated by another action for any future policy. Hence, these actions are extremely “safe” in the sense that they achieve an expected value on the ≻_sp-frontier as long as the user selects from our recommended actions in the future. In this NDP, the mean number
Summary
• The Autonomous Agent model is for sequential decision-making, but we want decision support
• The Reward Hypothesis is useful computationally, but too restrictive for us
• Preference Revealing: Compute for all possible scalar rewards
• Multi-outcome Screening: Compute for all “reasonable” future policies
• In both cases, increase in complexity is tolerable
Thank you!
• Dan Lizotte, [email protected]

Thanks:
My colleagues Susan Murphy, Michael Bowling, Eric Laber
Natural Sciences and Engineering Research Council of Canada (NSERC)
Data were obtained from the limited access datasets distributed from the NIH-supported “Clinical Antipsychotic Trials of Intervention Effectiveness in Schizophrenia” (CATIE-Sz). The study was supported by NIMH Contract N01MH90001 to the University of North Carolina at Chapel Hill. The ClinicalTrials.gov identifier is NCT00014001. This talk reflects the view of the author and may not reflect the opinions or views of the CATIE-Sz Study Investigators or the NIH.