A complexity analysis
Solving Markov Decision Processes using Policy Iteration
Romain Hollanders, UCLouvain
Joint work with: Balázs Gerencsér, Jean-Charles Delvenne and Raphaël Jungers
Seminar at Loria – Inria, Nancy, February 2015
Policy Iteration to solve Markov Decision Processes
Two powerful tools for the analysis
Acyclic Unique Sink Orientations and Order-Regular matrices
How much will we pay ?
Given a starting state, a cost vector and a horizon, three criteria measure how much we pay:
Total-cost criterion
Average-cost criterion
Discounted-cost criterion (with a discount factor)
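The three criteria can be written in a standard way; the notation below is generic (c is the cost vector, s_0 the starting state, T the horizon, gamma < 1 the discount factor) and the talk's exact notation may differ:

```latex
% Standard definitions of the three cost criteria (generic notation).
\begin{align*}
  \text{Total cost:}      \quad & \mathbb{E}\!\left[\sum_{t=0}^{T} c(s_t) \,\middle|\, s_0\right] \\
  \text{Average cost:}    \quad & \lim_{T\to\infty} \frac{1}{T}\,\mathbb{E}\!\left[\sum_{t=0}^{T-1} c(s_t) \,\middle|\, s_0\right] \\
  \text{Discounted cost:} \quad & \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t) \,\middle|\, s_0\right], \qquad 0 \le \gamma < 1
\end{align*}
```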
Markov chains
Markov Decision Processes
A Markov Decision Process with one action per state is simply a Markov chain; in general, each state offers several actions, each with its own action cost and transition probabilities.
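To make these objects concrete, here is a minimal sketch of how such an MDP could be represented; the instance is a toy two-state, two-action example and all numbers are made up for illustration:

```python
import numpy as np

# A tiny Markov Decision Process: n states, m actions per state.
# P[a][s, s'] = probability of moving from s to s' when playing action a in s.
# c[a][s]     = cost paid when playing action a in state s.
n_states, n_actions = 2, 2

P = np.array([
    [[0.9, 0.1],      # action 0
     [0.2, 0.8]],
    [[0.5, 0.5],      # action 1
     [0.0, 1.0]],
])
c = np.array([
    [1.0, 4.0],       # costs of action 0 in states 0 and 1
    [2.0, 0.5],       # costs of action 1 in states 0 and 1
])

# A (stationary, deterministic) policy picks one action per state.
policy = np.array([0, 1])

# The policy induces a Markov chain: its transition matrix and cost vector.
P_pi = np.array([P[policy[s], s] for s in range(n_states)])
c_pi = np.array([c[policy[s], s] for s in range(n_states)])
print(P_pi, c_pi)
```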
Goal: find the optimal policy
Evaluate a policy using an objective function: Total-cost, Average-cost or Discounted-cost.
Proposition: there always exists an optimal policy (what we aim for!)
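For instance, under the discounted criterion a fixed policy π can be evaluated in closed form; this is a standard identity, with P^π and c^π the transition matrix and cost vector induced by π:

```latex
% Bellman evaluation equation for a fixed policy \pi (discounted criterion).
v^{\pi} = c^{\pi} + \gamma P^{\pi} v^{\pi}
\qquad\Longrightarrow\qquad
v^{\pi} = (I - \gamma P^{\pi})^{-1} c^{\pi},
\qquad 0 \le \gamma < 1 .
```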
How do we solve a Markov Decision Process ?
Policy Iteration
POLICY ITERATION
Choose an initial policy π0.
while the policy keeps changing:
1. Evaluate the current policy πk.
2. Improve: πk+1 picks the best action in each state according to the value of πk.
end while
Stop! We found the optimal policy.
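A minimal sketch of this loop for the discounted criterion, written with numpy; the arrays P and c follow the toy layout shown earlier, and gamma is an illustrative discount factor. The evaluation step solves the linear system v = c_pi + gamma * P_pi v:

```python
import numpy as np

def policy_iteration(P, c, gamma=0.9):
    """Howard's policy iteration for a discounted-cost MDP.
    P[a][s, s'] are transition probabilities, c[a][s] action costs."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)           # arbitrary initial policy
    while True:
        # 1. Evaluate: solve v = c_pi + gamma * P_pi v for the current policy.
        P_pi = P[policy, np.arange(n_states)]
        c_pi = c[policy, np.arange(n_states)]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)
        # 2. Improve: in each state, pick the action with the smallest Q-value.
        Q = c + gamma * P @ v                         # Q[a, s]
        new_policy = Q.argmin(axis=0)
        if np.array_equal(new_policy, policy):        # no change: optimal policy
            return policy, v
        policy = new_policy

# Toy instance (same layout as before): two states, two actions.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
c = np.array([[1.0, 4.0],
              [2.0, 0.5]])
print(policy_iteration(P, c))
```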
From Markov Decision Processes (one player) to Turn-Based Stochastic Games (two players): minimizer versus maximizer.
STRATEGY ITERATION: minimizer versus maximizer
Find one player's best response against the other player's current strategy, using POLICY ITERATION.
Then find the other player's best response against this new strategy, again using POLICY ITERATION.
Repeat until nothing changes.
What is the complexity of Policy Iteration ?
Total-cost criterion: Exponential [Friedmann '09, Fearnley '10]
Average-cost criterion: Exponential [Friedmann '09, Fearnley '10]
Discounted-cost criterion: Exponential [H. et al. '12]
Exponential in general!
But… Fearnley's example is pathological.
Discounted-cost criterion with a fixed discount rate: Polynomial [Ye '10, Hansen et al. '11, Scherrer '13]
Deterministic MDPs: Polynomial for a close variant [Post & Ye '12, Scherrer '13]
MDPs with only positive costs: ???
Let us find upper bounds for the general case !
Acyclic Unique Sink Orientation
Every subcube has a unique sink.
The orientation is acyclic.
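To make the definition concrete, here is a small sketch that checks the unique-sink property on every face of a cube; the orientation used as an example simply points every edge toward the vertex with the smaller bit, a particularly simple AUSO since it also has no directed cycle:

```python
from itertools import product

def outmap(v):
    """Example orientation of the cube: every edge points toward the vertex
    with the smaller bit in that coordinate, so (0,...,0) is the global sink.
    Returns the dimensions of the edges leaving v (the 'improvement' edges)."""
    return {d for d, bit in enumerate(v) if bit == 1}

def has_unique_sinks(n, outmap):
    """Check that every face (subcube) of the n-cube has exactly one sink.
    A face fixes some coordinates to 0/1 and leaves the others free ('*')."""
    for face in product((0, 1, '*'), repeat=n):
        free = [d for d, x in enumerate(face) if x == '*']
        vertices = []
        for bits in product((0, 1), repeat=len(free)):
            v = list(face)
            for d, b in zip(free, bits):
                v[d] = b
            vertices.append(tuple(v))
        # A sink of the face has no outgoing edge along a free dimension.
        sinks = [v for v in vertices if not (outmap(v) & set(free))]
        if len(sinks) != 1:
            return False
    return True

print(has_unique_sinks(3, outmap))  # True: this orientation is an AUSO
```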
Let us find the sink with POLICY ITERATION
Start from an initial policy (a vertex of the cube). At each step, compute the set of dimensions of the improvement edges at the current vertex and switch all of them at once.
In this example, convergence takes 5 vertex evaluations; the sequence of visited policies is the PI-sequence.
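A minimal sketch of that walk, reusing the toy orientation from the previous snippet; the switch-all step, which flips every improvement dimension at once, is one common abstraction of Policy Iteration on an AUSO:

```python
def outmap(v):
    """Toy AUSO from the previous snippet: edges point toward the smaller bit."""
    return {d for d, bit in enumerate(v) if bit == 1}

def pi_sequence(outmap, start):
    """Walk the cube with the switch-all (Policy Iteration) rule:
    flip every improvement dimension at once, until the sink is reached.
    Returns the PI-sequence of visited vertices."""
    v, sequence = start, [start]
    while True:
        improving = outmap(v)              # dimensions of the improvement edges
        if not improving:                  # no outgoing edge: v is the sink
            return sequence
        v = tuple(b ^ (d in improving) for d, b in enumerate(v))
        sequence.append(v)

# On this simple orientation, one switch-all step already reaches the sink.
print(pi_sequence(outmap, (1, 0, 1)))      # [(1, 0, 1), (0, 0, 0)]
```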
Two properties to derive an upper bound
1. There exists a path connecting the policies of the PI-sequence.
2. The orientation is acyclic, so this path never visits the same vertex twice.
A new upper bound
The path is therefore no longer than the total number of policies, so we cannot have too many large improvement sets in a PI-sequence. From this we prove a bound on the length of the PI-sequence.
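One way to make that counting argument explicit, as a sketch under the reading of the two properties given above, with I_k the improvement set at step k and n the number of states (so that there are 2^n policies):

```latex
% The concatenated path has length at least \sum_k |I_k| and visits distinct
% vertices, hence it cannot exceed the number of vertices of the cube:
\sum_{k} |I_k| \;\le\; 2^{n}
\qquad\Longrightarrow\qquad
\#\{\, k : |I_k| \ge m \,\} \;\le\; \frac{2^{n}}{m} \quad \text{for every } m \ge 1 .
```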
Can we do even better?
Write the PI-sequence as a 0/1 matrix, with one row per visited policy and one column per state. The matrix is "Order-Regular".
How large are the largest Order-Regular matrices that we can build?
The answer of exhaustive search
Conjecture (Hansen & Zwick, 2012): the largest possible number of rows is a Fibonacci number, hence grows like a power of the golden ratio.
Theorem (H. et al., 2014): the exact answer is known for small sizes.
(Proof: a "smart" exhaustive search)
How large are the largest Order-Regular matrices that we can build?
A constructive approach
Iterate a construction to build ever larger Order-Regular matrices.
Can we do better? Yes! A second construction builds even larger matrices.
So, what do we know about Order-Regular matrices ?
Order-Regular matrices vs. Acyclic Unique Sink Orientations
Let's recap!
PART 1: Policy Iteration for Markov Decision Processes
Efficient in practice but not in the worst case
PART 2: The Acyclic Unique Sink Orientations point of view
Leads to a new upper bound
PART 3: Order-Regular matrices, towards new bounds
The Fibonacci conjecture fails