Dr. Dan Lizotte talks with WICI community, January 2013.
Possible Futures: Complexity in Sequential Decision-Making

Dan Lizotte
Assistant Professor, David R. Cheriton School of Computer Science
University of Waterloo

Waterloo Institute for Complexity and Innovation, 22 Jan 2013
Agents, Rationality, and Sequential Decision-Making
• Sequential decision-making from the AI tradition
• Supporting sequential clinical decision-making
• The assumption of rationality:
  • mitigates complexity ☺
  • reduces expressiveness ☹
• Alternatives to “simple rationality”
• Please interrupt at any time
Graphic: Leslie Pack Kaelbling, MIT
“An agent is just something that perceives and acts.” - Russell & Norvig

[Figure: agent-environment loop. At each step the agent observes state s_t, takes action a_t, and receives reward r_t+1 and next state s_t+1 from the environment.]
The “Reward Hypothesis”

“That all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward).”
— Rich Sutton

A.K.A. “Rationality.” (“reward” ≡ “utility”)
[Figure: the same agent-environment loop, relabelled for the clinical setting: state = Status, action = Treatment, reward = Outcome.]
Autonomous Agent Paradigm
• Good
• Goal is to maximize long term reward
• Makes context-dependent, individualized decisions
• Handles uncertain environments naturally
• Bad
• Need to prepare for every eventuality
• Relies on the reward hypothesis
• Relies on the assumption of rational* behaviour
*Rational in the non-judgemental, mathematical sense.
Decision Support Agent
[Figure: agent-environment loop, with a human decision-maker placed between the agent and the environment.]

Recommend rather than perform actions; still leverage the autonomous agent idea.
Place a human in between.
Should I Move or Stay?
Assume we have:
• Observation of the state of the world
• Rewards associated with each state
• A model of how the world changes
Rational Decision Making
Move: outcomes P = 0.1, R = 0; P = 0.1, R = 100; P = 0.7, R = 0; P = 0.1, R = -50
Q(s, Move) = 5

Stay: outcomes P = 0.1, R = -25; P = 0.1, R = 10; P = 0.7, R = 0; P = 0.1, R = 0
Q(s, Stay) = -1.5
• Often, “we” assume we want to maximize expected reward
• For each potential action
• Compute expected reward over all possible future states
• Choose the action with the highest expected reward
• “Q-function” Q(s,a) = E[R|s,a]
How do we decide what to do?
Move: outcomes P = 0.1, R = 0; P = 0.1, R = 100; P = 0.7, R = 0; P = 0.1, R = -50
Q(s, Move) = 5

Stay: outcomes P = 0.1, R = -25; P = 0.1, R = 10; P = 0.7, R = 0; P = 0.1, R = 0
Q(s, Stay) = -1.5

Since Q(s, Move) = 5 > Q(s, Stay) = -1.5:
V(s) = 5, π(s) = Move
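The expected-reward computation on this slide can be sketched in a few lines; the outcome probabilities and rewards are exactly those from the example above.

```python
# Sketch of the one-step expected-reward computation from the example above.
# Each action maps to a list of (probability, reward) outcomes.
outcomes = {
    "Move": [(0.1, 0), (0.1, 100), (0.7, 0), (0.1, -50)],
    "Stay": [(0.1, -25), (0.1, 10), (0.7, 0), (0.1, 0)],
}

def q_value(action):
    """Q(s, a) = E[R | s, a]: probability-weighted sum of rewards."""
    return sum(p * r for p, r in outcomes[action])

q = {a: q_value(a) for a in outcomes}   # Q(s,Move) ≈ 5, Q(s,Stay) ≈ -1.5
policy = max(q, key=q.get)              # action with the highest expected reward
value = q[policy]                       # V(s) = max_a Q(s, a)
print(policy)                           # Move
```

This mirrors the rule on the slide: compute Q for each action, then pick the argmax.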
• For each potential action
• Compute expected reward over all possible future states
• Feasible if reward is known for each state.
• What if reward is influenced by subsequent decision-making?
Sequential Decision-Making
Mosey: outcome P = 1.0, R = 10
Q(s′, Mosey) = 10

Scramble: outcomes P = 0.1, R = 0; P = 0.9, R = 100
Q(s′, Scramble) = 90

Since Q(s′, Scramble) = 90 > Q(s′, Mosey) = 10:
V(s′) = 90, π(s′) = Scramble
Stay: outcomes P = 0.1, R = -25; P = 0.1, R = 10 → V = 90; P = 0.7, R = 0; P = 0.1, R = 0
Q(s, Stay) = -1.5 → 6.5 (replacing the immediate reward 10 with the long-term value 90)

Q(s, Move) = 5

V(s) = 5 → 6.5, π(s) = Move → Stay
• Optimal decisions are influenced by future plans
• Not a problem, if we know what we will do when we get to any future state.
• Replace R (short-term) with V (long-term) by the law of total expectation
• Expectation is now over sequences of states
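The backup described above (replace R with V by the law of total expectation) can be sketched with the numbers from the Mosey/Scramble example.

```python
# Sketch of the two-stage backup from the example above: the outcome that
# previously paid R = 10 leads to a state whose long-term value is V = 90
# (the best of Mosey/Scramble), so we back up that value instead of R.

def expected_value(outcomes):
    """Probability-weighted sum over (probability, value) pairs."""
    return sum(p * v for p, v in outcomes)

# Value of the successor state: best of its two actions.
q_mosey = expected_value([(1.0, 10)])                  # 10
q_scramble = expected_value([(0.1, 0), (0.9, 100)])    # 90
v_next = max(q_mosey, q_scramble)                      # 90

# Current state: replace the immediate reward 10 with v_next = 90.
q_move = expected_value([(0.1, 0), (0.1, 100), (0.7, 0), (0.1, -50)])     # ≈ 5
q_stay = expected_value([(0.1, -25), (0.1, v_next), (0.7, 0), (0.1, 0)])  # ≈ 6.5

policy = "Stay" if q_stay > q_move else "Move"
print(policy)   # Stay — the optimal action flips once future plans are known
```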
Sequential Decision-Making
Policy Evaluation
Time to evaluate one policy: O(S²·T)
Number of policies: O(A^(S·T))
Exhaustive search: O(A^(S·T) · S²·T)
Policy Learning
Time to evaluate: O(A·S²·T)
Exhaustive search: O(A^(S·T) · S²·T)
Policy Learning
• Enormous complexity savings: from O(A^(S·T) · S²·T) to O(A·S²·T)
• What assumptions are required?
• Markov property
• Rationality (linearity of expectation)
• What if we want to avoid the rationality assumption?
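The complexity savings above come from solving backwards one stage at a time. A minimal sketch on a small synthetic MDP (the sizes S, A, T and the random P and R are illustrative, not from the talk):

```python
import numpy as np

# Illustrative finite-horizon policy learning by backward induction on a
# small random MDP. The loop structure makes the O(A·S²·T) cost visible:
# for each of T stages we sweep all A actions, and each backup R[a] + P[a] @ V
# is an S×S matrix-vector product.
rng = np.random.default_rng(0)
S, A, T = 4, 3, 5
P = rng.dirichlet(np.ones(S), size=(A, S))   # P[a, s, s'] transition probabilities
R = rng.normal(size=(A, S))                  # R[a, s] expected immediate reward

V = np.zeros(S)                              # terminal value
policy = np.zeros((T, S), dtype=int)
for t in reversed(range(T)):                 # T stages, newest stage uses old V
    Q = np.array([R[a] + P[a] @ V for a in range(A)])  # A backups, each O(S²)
    policy[t] = Q.argmax(axis=0)             # greedy action per state
    V = Q.max(axis=0)                        # long-term value replaces reward

print(policy[0], V)   # optimal first-stage action per state, and its value
```

Because we only ever look one stage ahead at the already-computed V, the exponential enumeration over policies never happens; the Markov property is what licenses this.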
Beyond Simple Rationality
• maximization of the expected value of the cumulative sum of a received scalar signal
• What is the “right” scalar signal?
  • Symptom relief?
  • Side-effects?
  • Cost?
  • Quality of life?
• Different for every decision-maker, every patient.
Multiple Reward Signals
• How can we reason with multiple reward signals?
• Make use of the human in the loop!
• Preference Revealing (DL,Bowling,Murphy)
• Multi-outcome Screening (DL,Ferguson,Laber)
• Each adds computational complexity
Preference Elicitation
Suppose D different rewards are important for decision-making,
r[1], r[2], ..., r[D]
Assume each person has a function f that maps these to utility: that person’s happiness given any configuration of the r[i], expressed as a scalar value. We could use this as our new reward!
Preference Elicitation attempts to figure out an individual’s f.
Preference Elicitation
1. Determine preferences of the decision-maker
2. Construct reward function from “basis rewards” (different outcomes)
3. Compute the recommended treatment by policy learning
• Cost: O(A·S²·T) per elicitation
Preference Elicitation
One way: Assume f has a nice form:
f(r[1], r[2], ..., r[D]) = δ[1]r[1] + δ[2]r[2] + ... + δ[D]r[D]
Then Preference Elicitation figures out the δ, or weights, an individual attaches to different rewards. How?
The values δ[i] and δ[j] define an exchange rate between r[i] and r[j].
Preference Elicitation
f(r[1], r[2], ..., r[D]) = δ[1]r[1] + δ[2]r[2] + ... + δ[D]r[D]
“If I lost δ[j] units of r[i], but I gained δ[i] units of r[j], I would be equally happy.”
Preference elicitation asks questions like:“If I took away 5 units of r[i], how many units of r[j] would you want?”
Once f is known, standard single-outcome methods can be applied.
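The exchange-rate identity stated above can be checked directly: with a linear f, trading away δ[j] units of r[i] while gaining δ[i] units of r[j] leaves utility unchanged. The weights and reward values below are hypothetical, chosen only to illustrate.

```python
# Sketch of the linear-utility assumption f(r) = δ·r and its exchange rate.
# delta and r are hypothetical values, not elicited from anyone.
delta = [0.7, 0.3]            # hypothetical weights for D = 2 basis rewards
r = [40.0, 10.0]              # hypothetical basis-reward values

def f(r, delta):
    """Linear utility: weighted sum of basis rewards."""
    return sum(d * x for d, x in zip(delta, r))

u0 = f(r, delta)
# Lose delta[1] units of r[0], gain delta[0] units of r[1]:
r_traded = [r[0] - delta[1], r[1] + delta[0]]
u1 = f(r_traded, delta)
print(u0, u1)   # equal up to float rounding: the trade is utility-neutral
```

This is exactly why elicitation questions of the form “if I took away X units of r[i], how many units of r[j] would you want?” identify the ratios of the δ.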
Preference Elicitation
• Are the questions based in reality?
• Even if they are, can the decision-maker answer them?
• How will the decision-maker respond to “I know what you want.” ?
Preference Revealing
1. Determine the preferences of the decision-maker
2. Compute the recommended treatment for all possible preferences (δ)
3. Show, for each action, what preferences are consistent with that action being recommended
(DL, S Murphy, M Bowling)
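Step 2 above (compute the recommendation for all possible preferences) can be sketched for D = 2 basis rewards by sweeping δ = (w, 1 − w) over a grid. The treatment names match the CATIE slide that follows, but the Q-values here are hypothetical, not estimates from the data.

```python
import numpy as np

# Sketch of preference revealing with D = 2 basis rewards: instead of
# eliciting δ, compute the recommended action for every preference
# δ = (w, 1 - w) and report which preferences select which action.
# The Q-values are hypothetical placeholders.
Q = {  # Q[a] = (value under basis reward 1, value under basis reward 2)
    "Ziprasidone": (6.0, 2.0),
    "Olanzapine": (2.0, 7.0),
}

def recommended(w):
    """Best action under the scalarized reward w*r[1] + (1-w)*r[2]."""
    return max(Q, key=lambda a: w * Q[a][0] + (1 - w) * Q[a][1])

for w in np.linspace(0, 1, 11):
    print(f"δ = ({w:.1f}, {1 - w:.1f}) → {recommended(w)}")
# Low weight on reward 1 selects Olanzapine; high weight selects Ziprasidone,
# with the switch at the preference where the scalarized Q-values cross.
```

The output is the kind of display the method produces: for each action, the region of preference space in which it would be recommended.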
Reward: PANSS, δ = (1, 0, 0)
Reward: BMI, δ = (0, 1, 0)
Reward: HQLS, δ = (0, 0, 1)

[Figure legend: Ziprasidone vs. Olanzapine, Phase 1]

Positive And Negative Syndrome Scale vs. Body Mass Index vs. Heinrichs Quality of Life Scale

Data: Clinical Antipsychotic Trials of Intervention Effectiveness (NIMH)
Preference Revealing
Benefits
No reliance on preference elicitation
Facilitates deliberation rather than imposing a single recommended treatment
Information still individualized through patient state
Treatments that are not suggested for any preference are implicitly screened
Costs
O(A^T · S^T) worst case
cf. O(A·S²·T) for single reward
cf. 2^ℵ₀ for “enumeration”
Feasible for short horizons
Value function/policy now a function of state and preference
Value functions not convex in preference, thus related methods for POMDPs do not apply
Multi-outcome Screening
1. Elicit “clinically meaningful difference” for each outcome
2. Screen out treatments that are “definitely bad”
3. Recommend the set of remaining treatments
Multi-outcome Screening
Suppose two* different rewards are important for decision making: r[1], r[2]
Screen out a treatment if it is much worse than another treatment for one reward and not any better for the other reward.
Do not screen if 1) treatments are not much different or 2) one treatment is much worse for one reward but much better for the other
Output: Set containing one or both treatments, possibly with a reason if both are included.
(DL, B Ferguson, E Laber)
(From “Non-deterministic multiple-reward fitted-Q for decision support”)

Algorithm 1 Non-deterministic fitted-Q
  Learn Q_T = (Q_T[1], ..., Q_T[D]) (Section 3); place in 𝒬_T
  for t = T−1, T−2, ..., 1 do
    for all s_t^i in the data do
      Map 𝒬_{t+1} to P9_{t+1}(s_t^i) (Section 3.2); generate C_f(P9_{t+1}(s_t^i)) (Section 3.1)
      for all π ∈ C_f(P9_{t+1}(s_t^i)) do
        for all Q_{t+1} ∈ 𝒬_{t+1} do
          Learn (Q[1]_t(·,·,π), ..., Q[D]_t(·,·,π)) using Q_{t+1} for the future value estimates
          Place the resulting (Q[1]_t(·,·,π), ..., Q[D]_t(·,·,π)) in 𝒬_t
[Figure 4: two panels plotting Q_T[2](s_T, a) against Q_T[1](s_T, a) for actions a1–a4; one panel eliminates actions using Practical Domination, the other using Strong Practical Domination.]

Fig. 4 Comparison of rules for eliminating actions. In this simple example, we suppose the Q-vectors (Q_T[1](s_T, a), Q_T[2](s_T, a)) are (4.9, 4.9), (3, 5.2), (1.8, 5.6), (4.6, 4.6) for a1, a2, a3, a4, respectively, and suppose Δ1 = Δ2 = 0.5. Figure 4(a): Using the Practical Domination rule, action a4 is not eliminated by a3 because it is not much worse according to either basis reward, as judged by Δ1 and Δ2. Action a2 is eliminated because although it is slightly better than a1 according to basis reward 2, it is much worse according to basis reward 1. Similarly, a3 is eliminated by a2. Note the small solid rectangle to the left of a2: points in this region (including a3) are dominated by a2, but not by a1. This illustrates the non-transitivity of the Practical Domination relation, and in turn shows that it is not a partial order. Figure 4(b): Using Strong Practical Domination, which is a partial order, no actions are eliminated, and there are no regions of non-transitivity.
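The Fig. 4 example can be reproduced in a few lines. The rule below is my reading of the screening criterion stated earlier in the deck (b is screened by a if b is much worse than a for some reward and not much better for any); the Q-vectors and Δ values are those given in the caption.

```python
# Sketch of the Practical Domination screen using the Q-vectors and
# Δ1 = Δ2 = 0.5 from the Fig. 4 caption. Rule as stated in the deck:
# a eliminates b if b is much worse than a (by more than Δ) for some
# basis reward, and not much better than a for any basis reward.
Q = {"a1": (4.9, 4.9), "a2": (3.0, 5.2), "a3": (1.8, 5.6), "a4": (4.6, 4.6)}
delta = (0.5, 0.5)

def practically_dominates(a, b):
    much_worse = any(Q[a][d] - Q[b][d] > delta[d] for d in range(2))
    much_better = any(Q[b][d] - Q[a][d] > delta[d] for d in range(2))
    return much_worse and not much_better

screened = {b for b in Q for a in Q if a != b and practically_dominates(a, b)}
print(sorted(screened))                   # ['a2', 'a3'] — a1 and a4 survive
print(practically_dominates("a1", "a3"))  # False: a3 falls only via a2,
                                          # showing the non-transitivity
```

As in the caption, a4 survives because its badness on reward 2 is offset by being much better on reward 1 (a tradeoff, so it is not screened), while a2 and a3 are screened.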
removing from consideration future policy sequences that are dominated no matter what current action is chosen. Again, this pruning has no impact on solution quality because we are only eliminating future policy sequences that will never be executed. Despite the exponential dependence on T, we will show that our method can be successfully applied to real data in Section 5, and we defer the development of approximations to future work.
4 Practical domination
So far we have presented our algorithm assuming we will use Pareto dominance to define ≻. However, there are two ways in which Pareto dominance does not reflect the reasoning of a physician when she determines whether one action is superior to another. First, an ac-
Multi-outcome Screening
Benefits
No model of preference required
Suggests a set rather than imposing a single recommended treatment
Information still individualized through patient state
Treatments with bad evidence are explicitly screened
Screening criterion is intuitive
Costs
Must consider all policies the user might follow in the future
Positive And Negative Syndrome Scale and
Body Mass Index
Possible Futures
Daniel J. Lizotte, Eric B. Laber
Fig. 2 Plot of a vector-valued Q-function (Q_{T−1}[1](s_{T−1}, a_{T−1}; π_T), Q_{T−1}[2](s_{T−1}, a_{T−1}; π_T)) for a fixed state s_{T−1}. Here, a_{T−1} ∈ A (five actions, each plotted with its own marker shape), and π_T ∈ Π, a collection of 20 possible future policies. Thus each marker represents a Q-vector that is achievable by first taking a particular action a_{T−1} and then following a particular policy π_T for the last time point. The markers for Q-pairs with the same action but different future policies have the same shape and colour, but are at different locations. Thus, each + marker near the top of the plot represents a Q-value that is achievable by taking the + action at the current time point and then following a particular future policy.
simplify presentation, we will take P_T(s_T) to be the set of actions that are Pareto-optimal according to Q_T(s_T, ·). However, an important advantage of our algorithm is that it does not rely on any specific definition of P_T; in Section 4 we will introduce a novel definition of P_T (and of P_t, t < T) that is more appropriate for clinical decision support applications.
Having defined our NDP P_T for the final timestep, we use it together with Q_T to derive Q_t and P_t for earlier time points. For time points t < T, define
Q_t[d](s_t, a_t; π_{t+1}, π_{t+2}, …) = f_t(s_t, a_t)ᵀ w_t[d]^{π_{t+1}, π_{t+2}, …},

w_t[d]^{π_{t+1}, π_{t+2}, …} = argmin_w Σ_i ( f_t(s_t^i, a_t^i)ᵀ w − ( r_t^i + Q_{t+1}[d](s_{t+1}^i, π_{t+1}(s_{t+1}^i); π_{t+2}, π_{t+3}, …) ) )².
Note that in this formulation we explicitly include the dependence of Q_t[d] on the choice of future policies π_j, t < j ≤ T. In fact, to get a truly complete picture of Q_t we must learn a weight vector for every basis reward and for every possible sequence of future policies. Figure 2 illustrates what a Q-function might look like for a fixed state s_{T−1} and D = 2 basis rewards. Each choice of a_{T−1} and π_T generates a pair of estimated Q-values, (Q_{T−1}[1](s_{T−1}, a_{T−1}, π_T), Q_{T−1}[2](s_{T−1}, a_{T−1}, π_T)), which we can plot as a point in the plane. We say that such a point represents a vector-valued expected return that is achievable by taking some immediate action and following it with a fixed sequence of policies. The Q-value vectors plotted in Figure 2 show the achievable values for five settings of a_{T−1} and 20 different future policies π_T.
Multi-outcome Screening
Costs
Must consider all policies the user might follow in the future
Enumeration: O(A^(S·T) · S²·T)
Restriction to policies that 1) follow recommendations and 2) are “not too complex” makes computation feasible
CATIE: 10^124 possible futures reduced to 12^13 possible futures
Multi-Outcome Screening
Fig. 7 CATIE NDP for Phase 1 made using P9; “warning” actions that would have been eliminated by Practical Domination but not by Strong Practical Domination have been removed.
Figure 7 shows the NDP learned for Phase 1 using our algorithm with Strong Practical Domination (Δ1 = Δ2 = 2.5) and P9, and actions that receive a “warning” according to Practical Domination have been removed. Here, choice is further increased by requiring an action to be practically better than another action in order to dominate it. Furthermore, although we have removed actions that were warned to have a bad tradeoff—those that were slightly better for one reward but practically worse for another—we still achieved increased choice over using the Pareto frontier alone. In this NDP, the mean number of choices per state is 4.18, and 67% of states have had one or more actions eliminated.
Finally, Figure 8 shows the NDP learned for Phase 1 using our algorithm with Strong Practical Domination (Δ1 = Δ2 = 2.5) and P8. Again, an action must be practically better than another action in order to dominate it. Recall that for P8, we only recommend actions that are not dominated by another action for any future policy. Hence, these actions are extremely “safe” in the sense that they achieve an expected value on the ≻_sp-frontier as long as the user selects from our recommended actions in the future. In this NDP, the mean number
Summary
• The Autonomous Agent model is for sequential decision-making, but we want decision support
• The Reward Hypothesis is useful computationally, but too restrictive for us
• Preference Revealing: Compute for all possible scalar rewards
• Multi-outcome Screening: Compute for all “reasonable” future policies
• In both cases, increase in complexity is tolerable
Thank you!
• Dan Lizotte, [email protected]

Thanks:
My colleagues Susan Murphy, Michael Bowling, Eric Laber
Natural Sciences and Engineering Research Council of Canada (NSERC)
Data were obtained from the limited access datasets distributed from the NIH-supported “Clinical Antipsychotic Trials of Intervention Effectiveness in Schizophrenia” (CATIE-Sz). The study was supported by NIMH Contract N01MH90001 to the University of North Carolina at Chapel Hill. The ClinicalTrials.gov identifier is NCT00014001. This talk reflects the view of the author and may not reflect the opinions or views of the CATIE-Sz Study Investigators or the NIH.