KI2 - 10 Kunstmatige Intelligentie / RuG Markov Decision Processes AIMA, Chapter 17


Page 1:

KI2 - 10

Kunstmatige Intelligentie / RuG

Markov Decision Processes

AIMA, Chapter 17

Page 2:

Markov Decision Problem

How to use knowledge about the world to make decisions even when the outcomes of an action are uncertain and the payoffs will not be obtained until several (or many) actions have passed.

Page 3:

The Solution

Sequential decision problems in uncertain environments can be solved by calculating a policy that associates an optimal decision with every state that the agent might reach

=> Markov Decision Process (MDP)

Page 4:

Example

The world: a 4 x 3 grid. The agent starts in square (1,1); square (4,3) is a terminal state with reward +1, square (4,2) is a terminal state with reward -1, and square (2,2) is a wall.

Actions have uncertain consequences: the intended move succeeds with probability 0.8, and with probability 0.1 each the agent slips sideways, perpendicular to the intended direction.
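To make the transition model concrete, here is a minimal Python sketch of this world. The layout and probabilities follow the figure; the names (ACTIONS, move, transition_model) and data structures are illustrative assumptions, not part of the slides.

# Sketch of the 4 x 3 grid world. States are (column, row); (2,2) is a wall,
# (4,3) and (4,2) are terminal squares with rewards +1 and -1.
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
WALL = (2, 2)
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != WALL]

def move(state, delta):
    # Bumping into the border or the wall leaves the agent where it is.
    nxt = (state[0] + delta[0], state[1] + delta[1])
    return nxt if nxt in STATES else state

def transition_model(state, action):
    # Returns {next_state: probability}: 0.8 for the intended move,
    # 0.1 for each of the two perpendicular slips.
    if state in TERMINALS:
        return {state: 1.0}
    dx, dy = ACTIONS[action]
    probs = {}
    for p, delta in [(0.8, (dx, dy)), (0.1, (dy, dx)), (0.1, (-dy, -dx))]:
        nxt = move(state, delta)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

print(transition_model((1, 1), "up"))   # {(1, 2): 0.8, (2, 1): 0.1, (1, 1): 0.1}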


Page 11:

Utility of a State Sequence

Additive rewards:

U_h([s_0, s_1, s_2, ...]) = R(s_0) + R(s_1) + R(s_2) + ...

Discounted rewards (with discount factor 0 < γ ≤ 1):

U_h([s_0, s_1, s_2, ...]) = R(s_0) + γ R(s_1) + γ² R(s_2) + ...
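As a quick illustration, a couple of lines of Python compute the discounted utility of a given reward sequence; the reward values below are only an example.

def discounted_return(rewards, gamma):
    # U_h = R(s_0) + γ R(s_1) + γ² R(s_2) + ...
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [-0.04, -0.04, -0.04, 1.0]      # example sequence ending in the +1 state
print(discounted_return(rewards, 1.0))    # additive rewards are the special case γ = 1
print(discounted_return(rewards, 0.9))    # discounting makes later rewards count less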

Page 13:

Utility of a State

The utility of each state is the expected sum of discounted rewards if the agent executes the policy π.

The true utility of a state corresponds to the optimal policy π*.

U^π(s) = E[ Σ_{t≥0} γ^t R(s_t) | π, s_0 = s ]

Page 15:

Algorithms for Calculating the Optimal Policy

Value iteration

Policy iteration

Page 16:

Value Iteration

Calculate the utility of each state.

Then use the state utilities to select an optimal action in each state:

π*(s) = argmax_a Σ_{s'} T(s, a, s') U(s')
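The second step, reading an action out of the utilities, can be sketched as follows; it assumes a transition model stored as nested dicts T[s][a] = {s': probability}, and the names are illustrative.

def best_action(s, U, T, actions):
    # argmax over actions of the expected utility of the successor states
    return max(actions, key=lambda a: sum(p * U[s2] for s2, p in T[s][a].items()))

def extract_policy(U, T, actions, terminals):
    # One greedy action per non-terminal state, as in π*(s) = argmax_a Σ_s' T(s,a,s') U(s')
    return {s: best_action(s, U, T, actions) for s in T if s not in terminals}

Plugged into a transition model like the grid-world sketch earlier, this yields one action per non-terminal square.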

Page 17:

Value Iteration Algorithm

function value-iteration(MDP) returns a utility function
  local variables: U, U', initially identical to R
  repeat
    U ← U'
    for each state s do
      U'(s) ← R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')    (Bellman update)
    end
  until close-enough(U, U')
  return U
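A compact Python rendering of this loop, under assumed data structures (T[s][a] = {s': probability}, R as a dict of rewards, an explicit set of terminal states); the names and the convergence test are illustrative.

def value_iteration(states, actions, T, R, terminals, gamma=1.0, eps=1e-6):
    U = dict(R)                                  # "initially identical to R"
    while True:
        U_new = {}
        for s in states:
            if s in terminals:
                U_new[s] = R[s]                  # terminal utilities stay at their reward
            else:
                best = max(sum(p * U[s2] for s2, p in T[s][a].items()) for a in actions)
                U_new[s] = R[s] + gamma * best   # Bellman update
        if max(abs(U_new[s] - U[s]) for s in states) < eps:   # close-enough(U, U')
            return U_new
        U = U_new

With R(s) = -0.04 for the non-terminal squares of the grid world, this produces utilities of the kind shown on the next slide; the exact numbers depend on the reward and discount factor used.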

Page 18:

The Utilities of the States Obtained After Value Iteration

The utilities of the states computed by the value iteration algorithm:

  Column:    1       2       3       4
  Row 3:   0.812   0.868   0.912    +1
  Row 2:   0.762   (wall)  0.660    -1
  Row 1:   0.705   0.655   0.611   0.388

Page 19:

Policy Iteration

Pick a policy, then calculate the utility of each state given that policy (value determination step)

Update the policy at each state using the utilities of the successor states

Repeat until the policy stabilizes

Page 20:

Policy Iteration Algorithm

function policy-iteration(MDP) returns a policy
  local variables: U, a utility function; π, a policy
  repeat
    U ← value-determination(π, U, MDP, R)
    unchanged? ← true
    for each state s do
      if max_a Σ_{s'} T(s, a, s') U(s') > Σ_{s'} T(s, π(s), s') U(s') then
        π(s) ← argmax_a Σ_{s'} T(s, a, s') U(s')
        unchanged? ← false
    end
  until unchanged?
  return π
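The same loop in Python form; this sketch approximates the value-determination step with a fixed number of simplified Bellman sweeps instead of an exact solve, and all names are illustrative.

def policy_iteration(states, actions, T, R, terminals, gamma=1.0):
    pi = {s: actions[0] for s in states if s not in terminals}   # arbitrary initial policy
    U = dict(R)
    while True:
        # value determination: evaluate the current policy (approximately)
        for _ in range(50):
            U = {s: R[s] if s in terminals
                 else R[s] + gamma * sum(p * U[s2] for s2, p in T[s][pi[s]].items())
                 for s in states}
        # policy improvement: switch to a better action where one exists
        unchanged = True
        for s in pi:
            q = lambda a: sum(p * U[s2] for s2, p in T[s][a].items())
            best = max(actions, key=q)
            if q(best) > q(pi[s]):
                pi[s], unchanged = best, False
        if unchanged:
            return pi, U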

Page 21:

Value Determination

Simplification of the value iteration algorithm because the policy is fixed

Linear equations because the max() operator has been removed

Solve exactly for the utilities using standard linear algebra
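One possible NumPy sketch of this exact solve: with the policy fixed, the equations U(s) = R(s) + γ Σ_{s'} T(s, π(s), s') U(s') form the linear system (I - γ T_π) u = r. Function and variable names are illustrative, and the system is assumed non-singular (γ < 1, or a policy that eventually reaches a terminal state).

import numpy as np

def value_determination(states, pi, T, R, terminals, gamma=1.0):
    # Build (I - γ T_π) u = r and solve it exactly.
    idx = {s: i for i, s in enumerate(states)}
    A = np.eye(len(states))
    r = np.array([R[s] for s in states])
    for s in states:
        if s in terminals:
            continue                              # terminal rows reduce to U(s) = R(s)
        for s2, p in T[s][pi[s]].items():
            A[idx[s], idx[s2]] -= gamma * p
    return dict(zip(states, np.linalg.solve(A, r)))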

Page 22:

Optimal Policy (policy iteration with 11 linear equations)

For the 4 x 3 grid world, value determination for a fixed policy gives one linear equation per state, for example:

u(1,1) = 0.8 u(1,2) + 0.1 u(2,1) + 0.1 u(1,1)

u(1,2) = 0.8 u(1,3) + 0.2 u(1,2)

Page 23:

Partially Observable MDP (POMDP)

In an inaccessible environment, the percept does not provide enough information to determine the state or the transition probabilities.

POMDP:
– State transition function: P(s_{t+1} | s_t, a_t)
– Observation function: P(o_t | s_t, a_t)
– Reward function: E(r_t | s_t, a_t)

Approach:
– Calculate a probability distribution over the possible states given all previous percepts, and base decisions on this distribution.

Difficulty:
– Actions will cause the agent to obtain new percepts, which will cause the agent's beliefs to change in complex ways.
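The probability distribution mentioned above is usually maintained as a belief state. Below is a minimal sketch of the Bayesian update after taking an action and receiving a percept; it assumes the observation depends on the state that is reached (one common convention, slightly different from the slide's P(o_t | s_t, a_t)), and the dict-based models and names are illustrative.

def update_belief(belief, action, observation, transition, observe):
    # b'(s') ∝ P(o | s', a) * Σ_s P(s' | s, a) * b(s)
    new_belief = {}
    for s2 in belief:
        predicted = sum(transition[s][action].get(s2, 0.0) * b for s, b in belief.items())
        new_belief[s2] = observe[s2][action].get(observation, 0.0) * predicted
    total = sum(new_belief.values())
    return {s: p / total for s, p in new_belief.items()} if total > 0 else new_belief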