A complexity analysis
Solving Markov Decision Processes using Policy Iteration
Romain Hollanders, UCLouvain
Joint work with: Balázs Gerencsér, Jean-Charles Delvenne and Raphaël Jungers
Seminar at Loria – Inria, Nancy, February 2015
Policy Iteration to solve Markov Decision Processes
Two powerful tools for the analysis
Acyclic Unique Sink Orientations and Order-Regular matrices
How much will we pay ?
Given a starting state, a cost vector and a horizon, three criteria measure how much we pay:
Total-cost criterion
Average-cost criterion
Discounted-cost criterion (with a discount factor)
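The three criteria can be written in a standard way; the notation below is generic (c is the cost vector, s_0 the starting state, T the horizon, gamma < 1 the discount factor) and the talk's exact notation may differ:

```latex
% Standard definitions of the three cost criteria (generic notation).
\begin{align*}
  \text{Total cost:}      \quad & \mathbb{E}\!\left[\sum_{t=0}^{T} c(s_t) \,\middle|\, s_0\right] \\
  \text{Average cost:}    \quad & \lim_{T\to\infty} \frac{1}{T}\,\mathbb{E}\!\left[\sum_{t=0}^{T-1} c(s_t) \,\middle|\, s_0\right] \\
  \text{Discounted cost:} \quad & \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t) \,\middle|\, s_0\right], \qquad 0 \le \gamma < 1
\end{align*}
```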
Markov chains
Markov Decision Processes
A Markov Decision Process with one action per state is simply a Markov chain; in general, each state offers several actions, each with its own action cost and transition probabilities.
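To make these objects concrete, here is a minimal sketch of how such an MDP could be represented; the instance is a toy two-state, two-action example and all numbers are made up for illustration:

```python
import numpy as np

# A tiny Markov Decision Process: n states, m actions per state.
# P[a][s, s'] = probability of moving from s to s' when playing action a in s.
# c[a][s]     = cost paid when playing action a in state s.
n_states, n_actions = 2, 2

P = np.array([
    [[0.9, 0.1],      # action 0
     [0.2, 0.8]],
    [[0.5, 0.5],      # action 1
     [0.0, 1.0]],
])
c = np.array([
    [1.0, 4.0],       # costs of action 0 in states 0 and 1
    [2.0, 0.5],       # costs of action 1 in states 0 and 1
])

# A (stationary, deterministic) policy picks one action per state.
policy = np.array([0, 1])

# The policy induces a Markov chain: its transition matrix and cost vector.
P_pi = np.array([P[policy[s], s] for s in range(n_states)])
c_pi = np.array([c[policy[s], s] for s in range(n_states)])
print(P_pi, c_pi)
```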
Goal: find the optimal policy
Evaluate a policy using an objective function: Total-cost, Average-cost or Discounted-cost.
Proposition: there always exists an optimal policy (what we aim for!)
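For instance, under the discounted criterion a fixed policy π can be evaluated in closed form; this is a standard identity, with P^π and c^π the transition matrix and cost vector induced by π:

```latex
% Bellman evaluation equation for a fixed policy \pi (discounted criterion).
v^{\pi} = c^{\pi} + \gamma P^{\pi} v^{\pi}
\qquad\Longrightarrow\qquad
v^{\pi} = (I - \gamma P^{\pi})^{-1} c^{\pi},
\qquad 0 \le \gamma < 1 .
```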
How do we solve a Markov Decision Process ?
Policy Iteration
POLICY ITERATION
Choose an initial policy π0.
while the policy keeps changing:
1. Evaluate the current policy πk.
2. Improve: πk+1 picks the best action in each state according to the value of πk.
end while
Stop! We found the optimal policy.
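A minimal sketch of this loop for the discounted criterion, written with numpy; the arrays P and c follow the toy layout shown earlier, and gamma is an illustrative discount factor. The evaluation step solves the linear system v = c_pi + gamma * P_pi v:

```python
import numpy as np

def policy_iteration(P, c, gamma=0.9):
    """Howard's policy iteration for a discounted-cost MDP.
    P[a][s, s'] are transition probabilities, c[a][s] action costs."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)           # arbitrary initial policy
    while True:
        # 1. Evaluate: solve v = c_pi + gamma * P_pi v for the current policy.
        P_pi = P[policy, np.arange(n_states)]
        c_pi = c[policy, np.arange(n_states)]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)
        # 2. Improve: in each state, pick the action with the smallest Q-value.
        Q = c + gamma * P @ v                         # Q[a, s]
        new_policy = Q.argmin(axis=0)
        if np.array_equal(new_policy, policy):        # no change: optimal policy
            return policy, v
        policy = new_policy

# Toy instance (same layout as before): two states, two actions.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
c = np.array([[1.0, 4.0],
              [2.0, 0.5]])
print(policy_iteration(P, c))
```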
From Markov Decision Processes (one player) to Turn-Based Stochastic Games (two players): minimizer versus maximizer.
STRATEGY ITERATION: minimizer versus maximizer
Find one player's best response against the other player's current strategy, using POLICY ITERATION.
Then find the other player's best response against this new strategy, again using POLICY ITERATION.
Repeat until nothing changes.
What is the complexity of Policy Iteration ?
Total-cost criterion: Exponential [Friedmann '09, Fearnley '10]
Average-cost criterion: Exponential [Friedmann '09, Fearnley '10]
Discounted-cost criterion: Exponential [H. et al. '12]
Exponential in general!
But… Fearnley's example is pathological.
Discounted-cost criterion with a fixed discount rate: Polynomial [Ye '10, Hansen et al. '11, Scherrer '13]
Deterministic MDPs: Polynomial for a close variant [Post & Ye '12, Scherrer '13]
MDPs with only positive costs: ???
Let us find upper bounds for the general case !
Acyclic Unique Sink Orientation
Every subcube has a unique sink.
The orientation is acyclic.
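To make the definition concrete, here is a small sketch that checks the unique-sink property on every face of a cube; the orientation used as an example simply points every edge toward the vertex with the smaller bit, a particularly simple AUSO since it also has no directed cycle:

```python
from itertools import product

def outmap(v):
    """Example orientation of the cube: every edge points toward the vertex
    with the smaller bit in that coordinate, so (0,...,0) is the global sink.
    Returns the dimensions of the edges leaving v (the 'improvement' edges)."""
    return {d for d, bit in enumerate(v) if bit == 1}

def has_unique_sinks(n, outmap):
    """Check that every face (subcube) of the n-cube has exactly one sink.
    A face fixes some coordinates to 0/1 and leaves the others free ('*')."""
    for face in product((0, 1, '*'), repeat=n):
        free = [d for d, x in enumerate(face) if x == '*']
        vertices = []
        for bits in product((0, 1), repeat=len(free)):
            v = list(face)
            for d, b in zip(free, bits):
                v[d] = b
            vertices.append(tuple(v))
        # A sink of the face has no outgoing edge along a free dimension.
        sinks = [v for v in vertices if not (outmap(v) & set(free))]
        if len(sinks) != 1:
            return False
    return True

print(has_unique_sinks(3, outmap))  # True: this orientation is an AUSO
```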
Let us find the sink with POLICY ITERATION
Start from an initial policy (a vertex of the cube). At each step, compute the set of dimensions of the improvement edges at the current vertex and switch all of them at once.
In this example, convergence takes 5 vertex evaluations; the sequence of visited policies is the PI-sequence.
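A minimal sketch of that walk, reusing the toy orientation from the previous snippet; the switch-all step, which flips every improvement dimension at once, is one common abstraction of Policy Iteration on an AUSO:

```python
def outmap(v):
    """Toy AUSO from the previous snippet: edges point toward the smaller bit."""
    return {d for d, bit in enumerate(v) if bit == 1}

def pi_sequence(outmap, start):
    """Walk the cube with the switch-all (Policy Iteration) rule:
    flip every improvement dimension at once, until the sink is reached.
    Returns the PI-sequence of visited vertices."""
    v, sequence = start, [start]
    while True:
        improving = outmap(v)              # dimensions of the improvement edges
        if not improving:                  # no outgoing edge: v is the sink
            return sequence
        v = tuple(b ^ (d in improving) for d, b in enumerate(v))
        sequence.append(v)

# On this simple orientation, one switch-all step already reaches the sink.
print(pi_sequence(outmap, (1, 0, 1)))      # [(1, 0, 1), (0, 0, 0)]
```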
Two properties to derive an upper bound
1. There exists a path connecting the policies of the PI-sequence.
2. The orientation is acyclic, so this path never visits the same vertex twice.
A new upper bound
The path is therefore no longer than the total number of policies, so we cannot have too many large improvement sets in a PI-sequence. From this we prove a bound on the length of the PI-sequence.
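One way to make that counting argument explicit, as a sketch under the reading of the two properties given above, with I_k the improvement set at step k and n the number of states (so that there are 2^n policies):

```latex
% The concatenated path has length at least \sum_k |I_k| and visits distinct
% vertices, hence it cannot exceed the number of vertices of the cube:
\sum_{k} |I_k| \;\le\; 2^{n}
\qquad\Longrightarrow\qquad
\#\{\, k : |I_k| \ge m \,\} \;\le\; \frac{2^{n}}{m} \quad \text{for every } m \ge 1 .
```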
Can we do even better?
Write the PI-sequence as a 0/1 matrix, with one row per visited policy and one column per state. The matrix is "Order-Regular".
How large are the largest Order-Regular matrices that we can build?
The answer of exhaustive search
Conjecture (Hansen & Zwick, 2012): the largest possible number of rows is a Fibonacci number, hence grows like a power of the golden ratio.
Theorem (H. et al., 2014): the exact answer is known for small sizes.
(Proof: a "smart" exhaustive search)
How large are the largest Order-Regular matrices that we can build?
A constructive approach
Iterate a construction to build ever larger Order-Regular matrices.
Can we do better? Yes! A second construction builds even larger matrices.
So, what do we know about Order-Regular matrices ?
Order-Regular matrices vs. Acyclic Unique Sink Orientations
Let's recap!
PART 1: Policy Iteration for Markov Decision Processes
Efficient in practice but not in the worst case
PART 2: The Acyclic Unique Sink Orientations point of view
Leads to a new upper bound
PART 3: Order-Regular matrices, towards new bounds
The Fibonacci conjecture fails