
Page 1: Decision Making Under Uncertainty

Decision Making Under Uncertainty

CMSC 471 – Spring 2014
Class #25 – Tuesday, April 29
R&N 17.1-17.3

Material from Lise Getoor, Jean-Claude Latombe, and Daphne Koller

Page 2: Sequential Decision Making Under Uncertainty

SEQUENTIAL DECISION MAKING UNDER UNCERTAINTY

Page 3: Sequential Decision Making

Sequential Decision Making

• Finite Horizon
• Infinite Horizon


Page 4: Simple Robot Navigation Problem

Simple Robot Navigation Problem

• In each state, the possible actions are U, D, R, and L


Page 5: Probabilistic Transition Model

Probabilistic Transition Model

• In each state, the possible actions are U, D, R, and L
• The effect of U is as follows (transition model):

• With probability 0.8, the robot moves up one square (if the robot is already in the top row, then it does not move)


Page 6: Probabilistic Transition Model

Probabilistic Transition Model

• In each state, the possible actions are U, D, R, and L
• The effect of U is as follows (transition model):

• With probability 0.8, the robot moves up one square (if the robot is already in the top row, then it does not move)
• With probability 0.1, the robot moves right one square (if the robot is already in the rightmost column, then it does not move)


Page 7: Probabilistic Transition Model

Probabilistic Transition Model

• In each state, the possible actions are U, D, R, and L
• The effect of U is as follows (transition model):

• With probability 0.8, the robot moves up one square (if the robot is already in the top row, then it does not move)
• With probability 0.1, the robot moves right one square (if the robot is already in the rightmost column, then it does not move)
• With probability 0.1, the robot moves left one square (if the robot is already in the leftmost column, then it does not move)


Page 8: Probabilistic Transition Model

Probabilistic Transition Model

• In each state, the possible actions are U, D, R, and L
• The effect of U is as follows (transition model):

• With probability 0.8, the robot moves up one square (if the robot is already in the top row, then it does not move)
• With probability 0.1, the robot moves right one square (if the robot is already in the rightmost column, then it does not move)
• With probability 0.1, the robot moves left one square (if the robot is already in the leftmost column, then it does not move)

• D, R, and L have similar probabilistic effects
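A minimal sketch of this transition model in Python (not from the slides; the grid layout, including the assumed obstacle at [2,2], follows the classic 4×3 example this lecture appears to use):

# 4x3 grid world; cells are (column, row) pairs, with [1,1] at bottom left.
COLS, ROWS = 4, 3
WALLS = {(2, 2)}                       # assumed obstacle, as in the R&N grid

def move(state, direction):
    # Deterministic effect of one direction; the robot stays put if it
    # would leave the grid or enter a wall.
    dx, dy = {'U': (0, 1), 'D': (0, -1), 'R': (1, 0), 'L': (-1, 0)}[direction]
    x, y = state[0] + dx, state[1] + dy
    if 1 <= x <= COLS and 1 <= y <= ROWS and (x, y) not in WALLS:
        return (x, y)
    return state

# Each action goes in its intended direction with probability 0.8 and
# slips to each perpendicular direction with probability 0.1.
PERP = {'U': ('L', 'R'), 'D': ('L', 'R'), 'L': ('U', 'D'), 'R': ('U', 'D')}

def transition(state, action):
    # Return {next_state: probability} for taking `action` in `state`.
    probs = {}
    for d, p in [(action, 0.8), (PERP[action][0], 0.1), (PERP[action][1], 0.1)]:
        s = move(state, d)
        probs[s] = probs.get(s, 0.0) + p
    return probs

For example, transition((3, 2), 'U') gives {(3, 3): 0.8, (4, 2): 0.1, (3, 2): 0.1}: the left slip runs into the assumed obstacle, so the robot stays at [3,2].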

Page 9: Markov Property

Markov Property

The transition properties depend only on the current state, not on the previous history (how that state was reached)
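In symbols (a standard formulation; the slide states the property only in words):

P(st+1 | st, at, st-1, at-1, …, s0) = P(st+1 | st, at)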


Page 10: Sequence of Actions

Sequence of Actions

• Planned sequence of actions: (U, R)

[Figure: 4×3 grid world; the robot starts at [3,2]]


Page 11: Sequence of Actions

Sequence of Actions

• Planned sequence of actions: (U, R)
• U is executed

[Figure: 4×3 grid; after U, the robot is in [3,2], [3,3], or [4,2]]


Page 12: Histories

Histories

• Planned sequence of actions: (U, R)
• U has been executed
• R is executed

• There are 9 possible sequences of states – called histories – and 6 possible final states for the robot!

[Figure: tree of histories rooted at [3,2]. After U: [3,2], [3,3], [4,2]; after R, the possible final states are [3,1], [3,2], [3,3], [4,1], [4,2], [4,3]]


Page 13: Probability of Reaching the Goal

Probability of Reaching the Goal

• P([4,3] | (U,R).[3,2]) = P([4,3] | R.[3,3]) × P([3,3] | U.[3,2]) + P([4,3] | R.[4,2]) × P([4,2] | U.[3,2])


Note importance of Markov property in this derivation

• P([3,3] | U.[3,2]) = 0.8
• P([4,2] | U.[3,2]) = 0.1

• P([4,3] | R.[3,3]) = 0.8
• P([4,3] | R.[4,2]) = 0.1

• P([4,3] | (U,R).[3,2]) = 0.8 × 0.8 + 0.1 × 0.1 = 0.65

Page 14: Utility Function

Utility Function

• [4,3] provides power supply
• [4,2] is a sand area from which the robot cannot escape

[Figure: 4×3 grid with terminal rewards +1 at [4,3] and -1 at [4,2]]

Page 15: Utility Function

Utility Function

• [4,3] provides power supply
• [4,2] is a sand area from which the robot cannot escape
• The robot needs to recharge its batteries


Page 16: Utility Function

Utility Function

• [4,3] provides power supply
• [4,2] is a sand area from which the robot cannot escape
• The robot needs to recharge its batteries
• [4,3] and [4,2] are terminal states


Page 17: Utility of a History

Utility of a History

• [4,3] provides power supply
• [4,2] is a sand area from which the robot cannot escape
• The robot needs to recharge its batteries
• [4,3] and [4,2] are terminal states
• The utility of a history is defined by the utility of the last state (+1 or –1) minus n/25, where n is the number of moves (e.g., a history that reaches [4,3] in 6 moves has utility 1 – 6/25 = 0.76)


Page 18: Utility of an Action Sequence

Utility of an Action Sequence

• Consider the action sequence (U,R) from [3,2]


Page 19: Utility of an Action Sequence

Utility of an Action Sequence


• Consider the action sequence (U,R) from [3,2]
• A run produces one of 7 possible histories, each with some probability (7 rather than the 9 counted earlier, because [4,2] is now terminal: a run that reaches it stops there)


Page 20: Utility of an Action Sequence

Utility of an Action Sequence


• Consider the action sequence (U,R) from [3,2]
• A run produces one among 7 possible histories, each with some probability
• The utility of the sequence is the expected utility of the histories:

U = Σh Uh P(h)

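A rough sketch of this computation in Python, reusing the hypothetical transition() function from the earlier slide. The utility convention (±1 at a terminal state minus 1/25 per move) is the one defined above; treating a non-terminal final state as contributing only the move penalty is an assumption:

TERMINAL = {(4, 3): 1.0, (4, 2): -1.0}

def history_utility(h):
    # Utility of a history: last-state reward (if terminal) minus n/25.
    return TERMINAL.get(h[-1], 0.0) - (len(h) - 1) / 25.0

def sequence_utility(start, actions):
    # U = sum over histories h of U(h) P(h), executing `actions` blindly.
    histories = {(start,): 1.0}
    for a in actions:
        nxt = {}
        for h, p in histories.items():
            if h[-1] in TERMINAL:              # terminal states end the run
                nxt[h] = nxt.get(h, 0.0) + p
                continue
            for s, q in transition(h[-1], a).items():
                nxt[h + (s,)] = nxt.get(h + (s,), 0.0) + p * q
        histories = nxt
    return sum(history_utility(h) * p for h, p in histories.items())

print(sequence_utility((3, 2), ['U', 'R']))   # enumerates the 7 histories

Under these assumptions the call prints roughly 0.38; most of the expected utility comes from the 0.64-probability history [3,2], [3,3], [4,3].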

Page 21: Optimal Action Sequence

Optimal Action Sequence


• Consider the action sequence (U,R) from [3,2]
• A run produces one among 7 possible histories, each with some probability
• The utility of the sequence is the expected utility of the histories
• The optimal sequence is the one with maximal utility


Page 22: Optimal Action Sequence

Optimal Action Sequence


• Consider the action sequence (U,R) from [3,2]
• A run produces one among 7 possible histories, each with some probability
• The utility of the sequence is the expected utility of the histories
• The optimal sequence is the one with maximal utility
• But is the optimal action sequence what we want to compute?


Only if the sequence is executed blindly!


Page 23: Reactive Agent Algorithm

Reactive Agent Algorithm

(assumes an accessible, i.e., observable, state)

Repeat:
  s ← sensed state
  If s is terminal then exit
  a ← choose action (given s)
  Perform a


Page 24: Policy (Reactive/Closed-Loop Strategy)

Policy (Reactive/Closed-Loop Strategy)

• A policy is a complete mapping from states to actions

[Figure: 4×3 grid with a policy shown as an action arrow in each state]

Page 25: Reactive Agent Algorithm

Reactive Agent Algorithm

Repeat:
  s ← sensed state
  If s is terminal then exit
  a ← π(s)
  Perform a
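A minimal sketch of this loop in Python (sense_state and perform are hypothetical stand-ins for the robot's sensing and actuation; the policy is a plain dict from states to actions, and TERMINAL is the set defined in the earlier sketch):

def run_policy(policy, sense_state, perform):
    # Reactive agent: sense, stop at a terminal state, otherwise act
    # according to the policy for the sensed state.
    while True:
        s = sense_state()
        if s in TERMINAL:
            return s
        perform(policy[s])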


Page 26: Optimal Policy

Optimal Policy


• A policy is a complete mapping from states to actions
• The optimal policy π* is the one that always yields a history (ending at a terminal state) with maximal expected utility


Makes sense because of Markov property

Note that [3,2] is a “dangerous” state that the optimal policy tries to avoid


Page 27: Optimal Policy

Optimal Policy


• A policy is a complete mapping from states to actions
• The optimal policy π* is the one that always yields a history with maximal expected utility


This problem is called a Markov Decision Problem (MDP)

How to compute π*?

Page 28: Additive Utility

Additive Utility

History H = (s0,s1,…,sn)

The utility of H is additive iff: U(s0,s1,…,sn) = R(0) + U(s1,…,sn) = Σi R(i)

(R(i) denotes the reward received in state si)


Page 29: Additive Utility

Additive Utility

History H = (s0,s1,…,sn)

The utility of H is additive iff: U(s0,s1,…,sn) = R(0) + U(s1,…,sn) = Σi R(i)

Robot navigation example:
  R(n) = +1 if sn = [4,3]
  R(n) = -1 if sn = [4,2]
  R(i) = -1/25 for i = 0, …, n-1

Page 30: Principle of Max Expected Utility

Principle of Max Expected Utility

History H = (s0,s1,…,sn)

Utility of H: U(s0,s1,…,sn) = Σi R(i)

First-step analysis

U(i) = R(i) + maxa Σk P(k | a.i) U(k)

π*(i) = arg maxa Σk P(k | a.i) U(k)


Page 31: Defining State Utility

Defining State Utility

Problem:
• When making a decision, we only know the reward so far, and the possible actions
• We’ve defined utility retroactively (i.e., the utility of a history is obvious once we finish it)
• What is the utility of a particular state in the middle of decision making?
• Need to compute the expected utility of possible future histories


Page 32: Value Iteration

Value Iteration

Initialize the utility of each non-terminal state si to U0(i) = 0
For t = 0, 1, 2, …, do:
    Ut+1(i) ← R(i) + maxa Σk P(k | a.i) Ut(k)

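A sketch of this update loop in Python, building on the hypothetical transition(), TERMINAL, COLS, ROWS, and WALLS definitions above and using R(i) = -1/25 for every non-terminal state:

STATES = [(x, y) for x in range(1, COLS + 1) for y in range(1, ROWS + 1)
          if (x, y) not in WALLS and (x, y) not in TERMINAL]
ACTIONS = ['U', 'D', 'L', 'R']
REWARD = -1.0 / 25.0

def value_iteration(eps=1e-6):
    U = {s: 0.0 for s in STATES}       # U0(i) = 0 for non-terminal states
    U.update(TERMINAL)                 # terminal utilities stay fixed
    while True:
        delta, new = 0.0, dict(U)
        for s in STATES:
            # Ut+1(i) <- R(i) + max_a sum_k P(k | a.i) Ut(k)
            new[s] = REWARD + max(
                sum(p * U[k] for k, p in transition(s, a).items())
                for a in ACTIONS)
            delta = max(delta, abs(new[s] - U[s]))
        U = new
        if delta < eps:
            return U

Under these assumptions the values converge to the utilities shown on the next slide, e.g. U([3,3]) ≈ 0.918.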

Page 33: Value Iteration

Value Iteration

Initialize the utility of each non-terminal state si to U0(i) = 0
For t = 0, 1, 2, …, do:
    Ut+1(i) ← R(i) + maxa Σk P(k | a.i) Ut(k)

[Plot: Ut([3,1]) versus t, rising from 0 at t = 0 to about 0.611 by t ≈ 30]

Converged utilities on the grid ([2,2] is the obstacle; [3,3] is left as the exercise below):

  3 |  0.812   0.868    ???     +1
  2 |  0.762  (wall)   0.660    -1
  1 |  0.705   0.655   0.611   0.388
        1       2       3       4

Note the importance of terminal states and connectivity of the state-transition graph

EXERCISE: What is U*([3,3]) (assuming that the other U* are as shown)?

Page 34: Value Iteration

Value Iteration

Initialize the utility of each non-terminal state si to U0(i) = 0
For t = 0, 1, 2, …, do:
    Ut+1(i) ← R(i) + maxa Σk P(k | a.i) Ut(k)

[Plot: Ut([3,1]) versus t, rising from 0 at t = 0 to about 0.611 by t ≈ 30]

Converged utilities on the grid:

  3 |  0.812   0.868   0.918    +1
  2 |  0.762  (wall)   0.660    -1
  1 |  0.705   0.655   0.611   0.388
        1       2       3       4

Note the importance of terminal states and connectivity of the state-transition graph

U*([3,3]) = R([3,3]) + [P([3,2] | R.[3,3]) U*([3,2]) + P([3,3] | R.[3,3]) U*([3,3]) + P([4,3] | R.[3,3]) U*([4,3])]
          = -1/25 + [0.1 × 0.660 + 0.1 × 0.918 + 0.8 × 1] ≈ 0.918

Page 35: Policy Iteration

Policy Iteration

Pick a policy π at random


Page 36: Policy Iteration

Policy Iteration

Pick a policy π at random
Repeat:
    Compute the utility of each state for π:
      Ut+1(i) ← R(i) + Σk P(k | π(i).i) Ut(k)


Page 37: Policy Iteration

Policy Iteration

Pick a policy π at random
Repeat:
    Compute the utility of each state for π:
      Ut+1(i) ← R(i) + Σk P(k | π(i).i) Ut(k)
    Compute the policy π’ given these utilities:
      π’(i) = arg maxa Σk P(k | a.i) U(k)


Page 38: Policy Iteration

Policy Iteration

Pick a policy π at random
Repeat:
    Compute the utility of each state for π:
      Ut+1(i) ← R(i) + Σk P(k | π(i).i) Ut(k)
    Compute the policy π’ given these utilities:
      π’(i) = arg maxa Σk P(k | a.i) U(k)
    If π’ = π then return π

Or solve the set of linear equations:
      U(i) = R(i) + Σk P(k | π(i).i) U(k)
(often a sparse system)

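A sketch of policy iteration under the same assumptions as the earlier snippets, using the slide's iterative evaluation (the fixed number of evaluation sweeps is an arbitrary cutoff; the exact alternative is to solve the linear system shown above):

def policy_iteration(eval_sweeps=100):
    pi = {s: 'U' for s in STATES}      # arbitrary initial policy
    U = {s: 0.0 for s in STATES}
    U.update(TERMINAL)
    while True:
        for _ in range(eval_sweeps):   # evaluate the current policy pi
            for s in STATES:
                U[s] = REWARD + sum(
                    p * U[k] for k, p in transition(s, pi[s]).items())
        # Greedy improvement: pi'(i) = argmax_a sum_k P(k | a.i) U(k)
        new_pi = {
            s: max(ACTIONS,
                   key=lambda a, s=s: sum(p * U[k]
                                          for k, p in transition(s, a).items()))
            for s in STATES}
        if new_pi == pi:
            return pi, U
        pi = new_pi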

Page 39: Infinite Horizon

Infinite Horizon


What if the robot lives forever?

In many problems, e.g., the robot navigation example, histories are potentially unbounded and the same state can be reached many times.

One trick: Use discounting to make an infinite-horizon problem mathematically tractable
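For reference, the standard discounted formulation (the discount factor γ, with 0 ≤ γ < 1, is not written out on these slides) weights future rewards geometrically:

U(s0, s1, …) = Σt γ^t R(t)
Ut+1(i) ← R(i) + γ maxa Σk P(k | a.i) Ut(k)

Because γ < 1, the infinite sum is bounded, so utilities remain well defined even if the robot lives forever.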


Page 40: Value Iteration

Value Iteration

Value iteration:
• Initialize state values (expected utilities) randomly
• Repeatedly update state values using the best action, according to the current approximation of state values
• Terminate when state values stabilize
• Resulting policy will be the best policy because it’s based on accurate state value estimation

Policy iteration:
• Initialize policy randomly
• Repeatedly update state values using the current policy’s action, according to the current approximation of state values
• Then update policy based on new state values
• Terminate when policy stabilizes
• Resulting policy is the best policy, but state values may not be accurate (may not have converged yet)
• Policy iteration is often faster (because we don’t have to get the state values right)

Both methods have a major weakness: They require us to know the transition function exactly in advance!
