KI2 - 10
Artificial Intelligence / RuG
Markov Decision Processes
AIMA, Chapter 17
Markov Decision Problem
How to use knowledge about the world to make decisions even when the outcomes of an action are uncertain and the payoffs will not be obtained until several (or many) actions have passed.
The Solution
Sequential decision problems in uncertain environments can be solved by calculating a policy that associates an optimal decision with every state that the agent might reach
=> Markov Decision Process (MDP)
Example
[Figure: the 4x3 grid world. The agent starts in square (1,1); squares (4,3) and (4,2) are terminal, with rewards +1 and -1; square (2,2) is blocked. Each action moves in the intended direction with probability 0.8 and slips to either perpendicular direction with probability 0.1.]
The world: actions have uncertain consequences.
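To make the later sketches concrete, here is this world written out as plain Python data — a minimal sketch assuming the usual conventions for this example; the names (STATES, TERMINALS, move, transitions) are illustrative, not from the slides.

ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}
# states are (column, row) pairs; (2, 2) is blocked, two squares are terminal
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != (2, 2)]
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}

def move(state, action):
    # deterministic effect; bumping an outer wall or (2, 2) leaves the state unchanged
    dc, dr = ACTIONS[action]
    target = (state[0] + dc, state[1] + dr)
    return target if target in STATES else state

def transitions(state, action):
    # distribution over next states: intended direction with prob. 0.8,
    # each perpendicular direction with prob. 0.1
    if state in TERMINALS:
        return {state: 1.0}
    probs = {}
    for a, p in [(action, 0.8)] + [(a, 0.1) for a in PERPENDICULAR[action]]:
        s2 = move(state, a)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs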
Utility of a State Sequence
Additive rewards:
U_h([s0, s1, s2, ...]) = R(s0) + R(s1) + R(s2) + ...
Discounted rewards:
U_h([s0, s1, s2, ...]) = R(s0) + γ R(s1) + γ² R(s2) + ...
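As a quick illustration, the same two definitions in Python, assuming the rewards of a state sequence are given as a plain list; the -0.04 step reward and γ = 0.9 are illustrative values only.

def additive_utility(rewards):
    # U([s0, s1, ...]) = R(s0) + R(s1) + ...
    return sum(rewards)

def discounted_utility(rewards, gamma=0.9):
    # U([s0, s1, ...]) = R(s0) + gamma*R(s1) + gamma**2*R(s2) + ...
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(additive_utility([-0.04, -0.04, 1.0]))    # 0.92
print(discounted_utility([-0.04, -0.04, 1.0]))  # 0.734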
Utility of a State
The utility of each state is the expected sum of discounted rewards if the agent executes the policy π:
U^π(s) = E[ Σ_{t=0}^∞ γ^t R(s_t) | π, s_0 = s ]
The true utility of a state, U(s), corresponds to the optimal policy π*.
Algorithms for Calculating the Optimal Policy
Value iteration
Policy iteration
Value Iteration
Calculate the utility of each state.
Then use the state utilities to select an optimal action in each state:
π*(s) = argmax_a Σ_{s'} T(s, a, s') U(s')
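Once the utilities are known, this extraction rule is a one-liner — a sketch, assuming a transitions(s, a) function that returns {next_state: probability} as in the grid-world sketch above:

def best_action(state, U, actions, transitions):
    # pi*(s) = argmax_a sum_s' T(s, a, s') * U(s')
    return max(actions,
               key=lambda a: sum(p * U[s2] for s2, p in transitions(state, a).items()))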
Value Iteration Algorithm
function value-iteration(MDP) returns a utility function
    local variables: U, U', initially identical to R
    repeat
        U ← U'
        for each state s do
            U'(s) ← R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')    (Bellman update)
        end
    until close-enough(U, U')
    return U
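A runnable Python sketch of this loop, on a hypothetical two-state MDP rather than the grid world; the termination test δ < ε(1-γ)/γ is one standard choice of close-enough, and all names here are illustrative.

def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    # T[s][a] is a dict {next_state: probability}; R maps states to rewards
    U1 = {s: R[s] for s in states}              # U', initially identical to R
    while True:
        U, delta = dict(U1), 0.0                # U <- U'
        for s in states:
            # Bellman update: U'(s) = R(s) + gamma * max_a sum_s' T(s,a,s') U(s')
            U1[s] = R[s] + gamma * max(
                sum(p * U[s2] for s2, p in T[s][a].items()) for a in actions)
            delta = max(delta, abs(U1[s] - U[s]))
        if delta < eps * (1 - gamma) / gamma:   # close-enough(U, U')
            return U1

# toy MDP: 'go' reaches the rewarding state s1 with probability 0.8
T = {"s0": {"stay": {"s0": 1.0}, "go": {"s1": 0.8, "s0": 0.2}},
     "s1": {"stay": {"s1": 1.0}, "go": {"s1": 1.0}}}
R = {"s0": 0.0, "s1": 1.0}
print(value_iteration(["s0", "s1"], ["go", "stay"], T, R))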
The Utilities of the States Obtained After Value Iteration
[Figure: the 4x3 grid world annotated with the utility of each state.]

    row 3:   0.812   0.868   0.912    +1
    row 2:   0.762           0.660    -1
    row 1:   0.705   0.655   0.611   0.388
            col 1   col 2   col 3   col 4
Policy Iteration
Pick a policy, then calculate the utility of each state given that policy (value determination step)
Update the policy at each state using the utilities of the successor states
Repeat until the policy stabilizes
Policy Iteration Algorithm
function policy-iteration(MDP) returns a policy
    local variables: U, a utility function; π, a policy
    repeat
        U ← value-determination(π, U, MDP, R)
        unchanged? ← true
        for each state s do
            if max_a Σ_{s'} T(s, a, s') U(s') > Σ_{s'} T(s, π(s), s') U(s') then
                π(s) ← argmax_a Σ_{s'} T(s, a, s') U(s')
                unchanged? ← false
        end
    until unchanged?
    return π
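The same loop as a runnable Python sketch, reusing the toy two-state MDP from the value-iteration example; value determination is done here with a few iterative sweeps for brevity, whereas the next slides solve it exactly with linear algebra.

def policy_iteration(states, actions, T, R, gamma=0.9):
    pi = {s: actions[0] for s in states}        # arbitrary initial policy
    U = {s: 0.0 for s in states}
    while True:
        for _ in range(100):                    # value determination for fixed pi
            U = {s: R[s] + gamma * sum(p * U[s2]
                 for s2, p in T[s][pi[s]].items()) for s in states}
        unchanged = True
        for s in states:
            q = {a: sum(p * U[s2] for s2, p in T[s][a].items()) for a in actions}
            best = max(q, key=q.get)
            if q[best] > q[pi[s]] + 1e-12:      # strictly better action found
                pi[s], unchanged = best, False
        if unchanged:
            return pi, U

T = {"s0": {"stay": {"s0": 1.0}, "go": {"s1": 0.8, "s0": 0.2}},
     "s1": {"stay": {"s1": 1.0}, "go": {"s1": 1.0}}}
R = {"s0": 0.0, "s1": 1.0}
print(policy_iteration(["s0", "s1"], ["stay", "go"], T, R))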
Value Determination
Simplification of the value iteration algorithm because the policy is fixed
Linear equations because the max() operator has been removed
Solve exactly for the utilities using standard linear algebra
Optimal Policy (policy iteration with 11 linear equations)
[Figure: the 4x3 grid world annotated with the optimal action in each state.]
With the policy fixed (e.g. Up in column 1), the utilities satisfy linear equations such as:
u(1,1) = 0.8 u(1,2) + 0.1 u(2,1) + 0.1 u(1,1)
u(1,2) = 0.8 u(1,3) + 0.2 u(1,2)
…
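In matrix form the fixed-policy equations read U = R + γ T_π U, so (I − γ T_π) U = R can be solved directly — a minimal numpy sketch on the toy two-state MDP from earlier (γ = 0.9 assumed; the grid-world system on this slide would be 11x11):

import numpy as np

gamma = 0.9
# transition matrix under the fixed policy pi = {s0: "go", s1: "stay"}
T_pi = np.array([[0.2, 0.8],    # from s0 under "go"
                 [0.0, 1.0]])   # from s1 under "stay"
R = np.array([0.0, 1.0])

# value determination: solve (I - gamma * T_pi) U = R exactly
U = np.linalg.solve(np.eye(2) - gamma * T_pi, R)
print(U)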
Partially Observable MDP (POMDP)
In an inaccessible environment, the percept does not provide enough information to determine the state or the transition probabilities.
POMDP:
– State transition function: P(s_{t+1} | s_t, a_t)
– Observation function: P(o_t | s_t, a_t)
– Reward function: E(r_t | s_t, a_t)
Approach:
– Calculate a probability distribution over the possible states given all previous percepts, and base decisions on this distribution (sketched below).
Difficulty:
– Actions cause the agent to obtain new percepts, which cause the agent's beliefs to change in complex ways.
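One step of that distribution update is the standard Bayesian filter — a sketch, assuming hypothetical dict-shaped models T[s][a] = {next_state: prob} and O[s'][a] = {percept: prob}, none of which come from the slides:

def update_belief(belief, action, percept, T, O):
    # b'(s') is proportional to O(s', a, o) * sum_s T(s, a, s') * b(s)
    new_belief = {}
    for s2 in belief:
        prior = sum(T[s][action].get(s2, 0.0) * b for s, b in belief.items())
        new_belief[s2] = O[s2][action].get(percept, 0.0) * prior
    total = sum(new_belief.values())
    return {s: b / total for s, b in new_belief.items()}   # normalize

# toy example: a noisy 'ping' percept that is more likely in state a
T = {"a": {"move": {"a": 0.3, "b": 0.7}}, "b": {"move": {"a": 0.3, "b": 0.7}}}
O = {"a": {"move": {"ping": 0.9, "silence": 0.1}},
     "b": {"move": {"ping": 0.2, "silence": 0.8}}}
print(update_belief({"a": 0.5, "b": 0.5}, "move", "ping", T, O))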