The Stochastic Shortest Path Problem: A Polyhedral Perspective

Matthieu Guillot, Gautier Stauffer
G-SCOP, Univ. Grenoble Alpes, 38000 Grenoble, France

London School of Economics, January 2017
Guillot and Stauffer The Stochastic Shortest Path Problem LSE 2017 1 / 18
Outline of the talk

Infinite horizon total cost MDP
The Stochastic Shortest Path Problem
Contributions
Main proof technique: Generalized flow decomposition theorem
Open Questions
Infinite horizon Markov Decision Process

Entries:

S, a finite set of states
A = ∪_{s∈S} A(s), a finite set of actions
c : A → R, a cost function on the actions
P(·|a), conditional probabilities over the state space for each action a
An initial state s0.

[Figure: an example MDP with states 0–4 and actions a–g, annotated with action costs (3, 10, −5, 7, −1, 2, 4, 0) and transition probabilities (0.7/0.3, 1, 0.5/0.5, 0.9/0.1, 0.2/0.8, 1).]
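The five entries above can be written down directly as data. Below is an illustrative Python encoding; the state/action names and all numbers are assumptions loosely modeled on the example figure, not the exact instance from the slides.

```python
# A minimal encoding of the MDP entries (S, A, c, P, s0).
# All concrete numbers are illustrative assumptions.
S = {0, 1, 2, 3, 4}

# A(s): actions available in each state (state 0 is absorbing here).
A = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"], 4: ["f", "g"], 0: []}

# c: cost of each action.
c = {"a": 3, "b": 10, "c": -5, "d": 7, "e": -1, "f": 2, "g": 4}

# P(.|a): conditional probabilities over the state space for each action.
P = {
    "a": {2: 0.7, 3: 0.3},
    "b": {3: 1.0},
    "c": {4: 0.5, 1: 0.5},
    "d": {4: 0.9, 2: 0.1},
    "e": {4: 0.2, 0: 0.8},
    "f": {0: 1.0},
    "g": {0: 0.5, 3: 0.5},
}

s0 = 1  # initial state

# Sanity check: each action's transition distribution sums to 1.
for a, dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```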
Infinite horizon Markov Decision Process

Dynamics:

In each time period t ≥ 0, the system is in state st and we need to decide upon an action a available in A(st).
The system evolves to state st+1 according to P(·|a).

[Figure: one step of the dynamics on the example MDP. Legend: circles are states s, squares are actions a; an arc a → s carries the transition probability p(s|a), an arc s → a carries the cost c(a).]
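These dynamics amount to a simple sampling loop. Below is a minimal simulation sketch on a hypothetical three-state instance (all names and numbers are assumptions, not the slides' example):

```python
import random

# Tiny hypothetical MDP: state 0 is absorbing with no actions.
A = {1: ["a", "b"], 2: ["c"], 0: []}                      # A(s)
c = {"a": 3, "b": 10, "c": -5}                            # action costs
P = {"a": {2: 0.7, 0: 0.3}, "b": {0: 1.0}, "c": {0: 1.0}}  # P(.|a)

def simulate(s0, policy, horizon=100, seed=0):
    """In each period t: pick a = policy(s_t) in A(s_t), pay c(a),
    then sample s_{t+1} ~ P(.|a)."""
    rng = random.Random(seed)
    s, total = s0, 0.0
    for _ in range(horizon):
        if not A[s]:                      # no available action: stop
            break
        a = policy[s]
        total += c[a]
        nxt = list(P[a])
        s = rng.choices(nxt, [P[a][t] for t in nxt])[0]
    return s, total

final, cost = simulate(1, {1: "a", 2: "c"})
# From state 1, action a costs 3; with prob 0.7 we visit state 2 (action c,
# cost -5) before absorbing, so total is either -2.0 or 3.0.
```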
Infinite horizon (total cost) Markov Decision Process

Goal:

Find a policy π : S → A (it defines a Markov chain with transition matrix P_π), minimizing

∑_{k=0}^{+∞} 1_{s0}^t (P_π)^k c_π

[Figure: the example MDP restricted to the actions chosen by a pure policy (b, d, e, g), with the resulting expected total costs (7, 2, −5, 10, 0) attached to the states.]

NB: we might consider non-stationary and non-deterministic policies, BUT for most MDPs 'pure' policies are optimal.
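For a policy under which the series converges, the total cost need not be summed term by term: since (P_π)^k → 0 on the non-target states, the value vector v = ∑_k (P_π)^k c_π solves the linear system (I − P_π) v = c_π. A numpy sketch on hypothetical data:

```python
import numpy as np

# Markov chain induced by a pure policy on non-target states {1, 2};
# the target absorbs the remaining probability mass (hypothetical numbers).
P_pi = np.array([[0.0, 0.7],    # from state 1: 0.7 to state 2, 0.3 to target
                 [0.0, 0.0]])   # from state 2: all mass goes to the target
c_pi = np.array([3.0, -5.0])    # cost of the chosen action in each state

# v = sum_k (P_pi)^k c_pi solves (I - P_pi) v = c_pi, which is well defined
# because the policy reaches the target, i.e. (P_pi)^k -> 0.
v = np.linalg.solve(np.eye(2) - P_pi, c_pi)
# v[0] = 3 + 0.7 * (-5) = -0.5 and v[1] = -5
```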
Discounted Markov Decision Process

Issue: ∑_{k=0}^{+∞} 1_{s0}^t (P_π)^k c_π is not always defined.

[Figure: a two-state example cycling forever between states 1 and 2, via action b of cost 1 and action c of cost −1, each with probability 1.]

Discounted models: V*(s0) := min ∑_{k=0}^{+∞} α^k 1_{s0}^t (P_π)^k c_π for some 0 ≤ α < 1.

The optimal values satisfy the Bellman equations:

V*(s) = min_{a∈A(s)} { c(a) + α ∑_{s'} P(s'|a) · V*(s') }

Standard methods from the '50s:

Value Iteration: Bellman (1957), Dynamic Programming
Policy Iteration: Howard (1960), Block-Pivot Simplex algorithm
Linear Programming: Manne (1960)
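The Bellman equations suggest the Value Iteration update directly. Here is a minimal sketch on the two-state cycle from the "Issue" figure, made summable by α = 1/2 (the code is an illustration, not the original implementation):

```python
# Value Iteration: repeatedly apply the Bellman operator
#   V(s) <- min_{a in A(s)} { c(a) + alpha * sum_{s'} P(s'|a) V(s') }.
# Instance: the two-state cycle (b: cost 1, c: cost -1), discounted.
A = {1: ["b"], 2: ["c"]}
c = {"b": 1.0, "c": -1.0}
P = {"b": {2: 1.0}, "c": {1: 1.0}}
alpha = 0.5

V = {s: 0.0 for s in A}
for _ in range(200):  # alpha-contraction: 200 sweeps converge to machine precision
    V = {s: min(c[a] + alpha * sum(p * V[t] for t, p in P[a].items())
                for a in A[s])
         for s in A}

# Fixed point: V(1) = 1 + 0.5 V(2) and V(2) = -1 + 0.5 V(1),
# hence V(1) = 2/3 and V(2) = -2/3.
```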
The Stochastic Shortest Path Problem

Extension to undiscounted MDPs, i.e. α = 1 (the discounted case is a special case).

Bertsekas and Tsitsiklis 1991: Value Iteration, Policy Iteration and LP all work.

Hypotheses:

there is an identified target state T (from there, no way to escape)
there is a proper policy that leads to T with probability 1
'looping' in the system (outside T) is costly: +∞ cost

[Figure: the example MDP with state 0 relabeled as the target state T.]
The Stochastic Shortest Path Problem

Almost an extension of the standard deterministic shortest path:

there is an identified target state T (from there, no way to escape)
there is a proper policy that leads to T with probability 1
'looping' in the system (outside T) is costly: +∞ cost → this forbids zero-cost cycles

[Figure: a deterministic shortest path instance on states 1–4 and target T, with arc costs 3, 10, −1, 2, −5, 7, 2.]

NB: Bertsekas and Yu (2016) proved that perturbed versions of PI and VI converge in the presence of zero-cost cycles.
This is not only a technical problem!

Many applications with zero-cost cycles!

Maximizing the probability of reaching a target

Ex: Robot motion planning in turbulent water
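One standard way reach-probability maximization produces zero-cost cycles (an illustration of the phenomenon, not necessarily the authors' construction): all action costs are zero, and the reach probabilities are the fixed point p(s) = max_{a∈A(s)} ∑_{s'} P(s'|a) p(s') with p(T) = 1. A sketch on a hypothetical instance:

```python
# Maximizing the probability of reaching target "T":
#   p(s) = max_{a in A(s)} sum_{s'} P(s'|a) p(s'),   p(T) = 1.
# Every action has cost 0, hence cycles cost nothing. Hypothetical data.
A = {1: ["a", "b"], 2: ["c"], "T": [], "dead": []}
P = {"a": {2: 0.5, "dead": 0.5},
     "b": {"T": 0.4, "dead": 0.6},
     "c": {"T": 1.0}}

p = {s: 0.0 for s in A}
p["T"] = 1.0
for _ in range(100):  # value-iteration-style sweeps on the fixed point
    for s in (1, 2):
        p[s] = max(sum(pr * p[t] for t, pr in P[a].items()) for a in A[s])

# From state 1, action a reaches T with prob 0.5 * 1 = 0.5 > 0.4 (action b).
```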
Our Contribution

A generalization of the framework by Bertsekas and Tsitsiklis that encapsulates the deterministic version (i.e. zero-cost cycles)
A proof that we can actually restrict to 'pure' policies
Proof of convergence of Value Iteration by a simple analysis: a natural extension of Bellman-Ford
Proof that Policy Iteration converges
A generalization of Dijkstra's algorithm through primal-dual

→ Simplifies, improves and extends all previous results and analyses for infinite horizon total cost MDPs!
Our technique: polyhedral analysis

Observation: the (dual of the) linear programming formulation for SSP is a natural relaxation of a more general problem.
→ The corresponding polyhedron generalizes the network flow polyhedron.

The classical shortest path LP:

min cx
s.t. ∑_{a∈δ+(v)} x(a) − ∑_{a∈δ−(v)} x(a) = 1 if v = s, −1 if v = t, 0 otherwise, ∀v ∈ V
x ≥ 0

Its SSP analogue:

min cx
s.t. ∑_{a∈A(s)} x(a) − ∑_{a∈A} p(s|a) x(a) = 1 if s = s0, −1 if s = T, 0 otherwise, ∀s ∈ S
x ≥ 0

[Figure: a fractional feasible solution of the SSP system on the example MDP.]
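For a pure policy, the occupation measure x^π is the unique solution of the balance equations restricted to the chosen actions, and its cost c·x equals the policy's expected total cost. A numpy sketch on a hypothetical two-action instance:

```python
import numpy as np

# Pure policy on non-target states {1, 2} (target T's row is dropped);
# chosen actions: a in state 1, c in state 2. Hypothetical numbers:
#   a: from 1, goes to 2 w.p. 0.7 and to T w.p. 0.3; c: from 2, goes to T.
# One balance row per non-target state s:
#   sum_{a in A(s)} x(a) - sum_a p(s|a) x(a) = 1 if s = s0 else 0.
# Columns: x(a), x(c).
M = np.array([[1.0, 0.0],     # state 1: outflow x(a), no inflow
              [-0.7, 1.0]])   # state 2: inflow 0.7 x(a), outflow x(c)
b = np.array([1.0, 0.0])      # s0 = 1

x = np.linalg.solve(M, b)     # occupation measure: x(a) = 1, x(c) = 0.7
cost = np.array([3.0, -5.0]) @ x   # c.x = 3 - 3.5 = -0.5, the policy's cost
```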
Linear Programming relaxation: proof sketch

A policy π induces a probability distribution over all possible (s0, T)-walks.

y_k^π(s): probability of being in state s in period k following policy π
x_k^π(a): probability of taking action a in period k following policy π

We have, for all π and for all k ≥ 0:

∑_{a∈A(s)} x_k^π(a) = y_k^π(s)   and   y_{k+1}^π(s) = ∑_{a∈A} p(s|a) x_k^π(a)

It implies

∑_k ∑_{a∈A(s)} x_{k+1}^π(a) = ∑_k ∑_{a∈A} p(s|a) x_k^π(a)

Together with y_0^π = 1_{s0} = ∑_{a∈A(s)} x_0^π(a), this yields

∑_{a∈A(s)} x^π(a) − ∑_{a∈A} p(s|a) x^π(a) = 1_{s0}

as long as x^π(a) := ∑_k x_k^π(a) is well-defined for all a (this is our new definition of proper).
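The identities in the sketch can be checked numerically: unroll y_k and x_k for a proper pure policy and verify that the aggregated x satisfies the balance equation. A sketch on a hypothetical instance (the tail mass vanishes after two steps here):

```python
# Unroll y_k (state distribution) and x_k (action distribution), then check
#   sum_{a in A(s)} x(a) - sum_a p(s|a) x(a) = 1_{s0}(s).
# Hypothetical instance: states {1, 2}, target "T"; policy: a in 1, c in 2.
policy = {1: "a", 2: "c"}
P = {"a": {2: 0.7, "T": 0.3}, "c": {"T": 1.0}}
states = [1, 2]

y = {1: 1.0, 2: 0.0}              # y_0 = 1_{s0} with s0 = 1
x = {"a": 0.0, "c": 0.0}          # aggregated x(a) = sum_k x_k(a)
for _ in range(200):
    xk = {policy[s]: y[s] for s in states}   # x_k(a) = y_k(s) for a = pi(s)
    for a, mass in xk.items():
        x[a] += mass
    y = {s: sum(P[a].get(s, 0.0) * xk[a] for a in xk) for s in states}

balance = {}
for s in states:
    out_mass = x[policy[s]]                               # sum_{a in A(s)} x(a)
    in_mass = sum(P[a].get(s, 0.0) * x[a] for a in x)     # sum_a p(s|a) x(a)
    balance[s] = out_mass - in_mass
# balance[1] = 1 (= 1_{s0}) and balance[2] = 0, as the proof sketch claims.
```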
Our technique: polyhedral analysis

Proof that the extreme points of this relaxation are 'associated' with 'pure' policies (NB: the extreme points are NOT integral).
→ The proof relies on a generalization of the 'flow' decomposition theorem.

[Figures: a feasible solution on the example instance decomposed step by step into (s0, T)-walks, first in the deterministic case (integral flow values), then in the stochastic case (fractional values).]
Idea of contributions : framework
Decomposition theorem implies that extreme points are ‘pure’ strategies and extreme rays of the relaxation are ‘transition cycles’

A transition cycle is a solution x ≥ 0 to

∑_{a∈A(s)} x(a) − ∑_{a∈A} p(s|a) x(a) = 0 for all s ∈ S

The optimum of the relaxation and of the original problem coincide when there is no transition cycle of negative cost : this is our new framework

Assumptions

There exists a path from every node i to node 0 in the support graph

There is no negative cost transition cycle
Guillot and Stauffer The Stochastic Shortest Path Problem LSE 2017 15 / 18
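The balance condition defining transition cycles can be checked directly; a minimal sketch on a hypothetical two-state instance (all state and action names are made up for illustration):

```python
# Check the transition-cycle condition: x >= 0 is a transition cycle iff
# sum_{a in A(s)} x(a) - sum_{a in A} p(s|a) x(a) = 0 for every state s.

def is_transition_cycle(states, actions, p, x, tol=1e-9):
    """actions: dict state -> list of actions; p: dict (state, action) -> prob."""
    for s in states:
        outflow = sum(x[a] for a in actions[s])
        inflow = sum(p.get((s, a), 0.0) * x[a]
                     for acts in actions.values() for a in acts)
        if abs(outflow - inflow) > tol:
            return False
    return all(v >= -tol for v in x.values())

# two states that deterministically send all mass to each other:
# a balanced circulation, hence a transition cycle
states = ['u', 'v']
actions = {'u': ['go_v'], 'v': ['go_u']}
p = {('v', 'go_v'): 1.0, ('u', 'go_u'): 1.0}
x = {'go_v': 1.0, 'go_u': 1.0}
print(is_transition_cycle(states, actions, p, x))  # True
```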
Idea of contributions : algorithms
Value iteration is very similar to Bellman-Ford : we essentially prove that min lim = lim min

min_{Π∈P} lim_{K→∞} ∑_{k=0}^{K} c^T x_k^Π = lim_{K→∞} min_{Π∈P_K} ∑_{k=0}^{K} c^T x_k^Π

(P ∼ all proper policies, P_K ∼ all proper policies that terminate in K steps)

Policy iteration is a block-pivot simplex : we prove strict improvement to guarantee finiteness.

We can apply a primal-dual algorithm; the subproblem is a reachability question, solved by a Dijkstra-like algorithm (we fall into the same class, which was not the case before because of zero cost cycles !)
Guillot and Stauffer The Stochastic Shortest Path Problem LSE 2017 16 / 18
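The Bellman-Ford analogy can be made concrete with textbook value iteration, J_{k+1}(s) = min_{a∈A(s)} [c(a) + ∑_{s'} p(s'|a) J_k(s')] with J(T) = 0; a minimal sketch on a hypothetical deterministic instance, where the iteration reduces exactly to Bellman-Ford:

```python
# Value iteration for a stochastic shortest path instance:
#   J_{k+1}(s) = min_{a in A(s)} [ c(a) + sum_{s'} p(s'|a) J_k(s') ], J(T) = 0.
# On a deterministic instance this is Bellman-Ford on the underlying graph.

def value_iteration(states, actions, cost, p, target, iters=1000):
    J = {s: 0.0 if s == target else float('inf') for s in states}
    for _ in range(iters):
        new = dict(J)
        for s in states:
            if s == target:
                continue
            new[s] = min(cost[a] + sum(pr * J[s2] for s2, pr in p[a].items())
                         for a in actions[s])
        if new == J:  # fixed point of the Bellman operator reached
            break
        J = new
    return J

# hypothetical deterministic instance: shortest s -> T path has cost 2
states = ['s', 'm', 'T']
actions = {'s': ['s_to_m', 's_to_T'], 'm': ['m_to_T'], 'T': []}
cost = {'s_to_m': 1.0, 's_to_T': 5.0, 'm_to_T': 1.0}
p = {'s_to_m': {'m': 1.0}, 's_to_T': {'T': 1.0}, 'm_to_T': {'T': 1.0}}
J = value_iteration(states, actions, cost, p, 'T')
print(J['s'], J['m'])  # 2.0 1.0
```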
Main Open questions
The stochastic shortest path problem is solvable in polynomial time through LP

Is it strongly polynomial ?

Ye (2011) : true for discounted MDPs if the discount factor α is fixed

Is our generalization of Dijkstra’s algorithm strongly polynomial ?

Is the reachability subproblem strongly polynomial ?
Guillot and Stauffer The Stochastic Shortest Path Problem LSE 2017 17 / 18
Guillot and Stauffer The Stochastic Shortest Path Problem LSE 2017 18 / 18