The Stochastic Shortest Path Problem: A Polyhedral Perspective

Matthieu Guillot, Gautier Stauffer
G-SCOP, Univ. Grenoble Alpes, 38000 Grenoble, France

London School of Economics, January 2017
Guillot and Stauffer The Stochastic Shortest Path Problem LSE 2017 1 / 18
Outline of the talk

Infinite horizon total cost MDP
The Stochastic Shortest Path Problem
Contributions
Main proof technique: Generalized flow decomposition theorem
Open Questions
Infinite horizon Markov Decision Process

Entries:

S, a finite set of states
A = ∪_{s∈S} A(s), a finite set of actions
c : A → R, a cost function on the actions
P(·|a), conditional probabilities over the state space for each action a
An initial state s0.

[Figure: an example MDP with states 0–4 and actions a–g, annotated with action costs (3, 10, −5, 7, −1, 2, 4, 0) and transition probabilities (0.7/0.3, 1, 0.5/0.5, 0.9/0.1, 0.2/0.8, 1).]
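The five entries above can be written down directly as data. Below is an illustrative Python encoding; the state/action names and all numbers are assumptions loosely modeled on the example figure, not the exact instance from the slides.

```python
# A minimal encoding of the MDP entries (S, A, c, P, s0).
# All concrete numbers are illustrative assumptions.
S = {0, 1, 2, 3, 4}

# A(s): actions available in each state (state 0 is absorbing here).
A = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"], 4: ["f", "g"], 0: []}

# c: cost of each action.
c = {"a": 3, "b": 10, "c": -5, "d": 7, "e": -1, "f": 2, "g": 4}

# P(.|a): conditional probabilities over the state space for each action.
P = {
    "a": {2: 0.7, 3: 0.3},
    "b": {3: 1.0},
    "c": {4: 0.5, 1: 0.5},
    "d": {4: 0.9, 2: 0.1},
    "e": {4: 0.2, 0: 0.8},
    "f": {0: 1.0},
    "g": {0: 0.5, 3: 0.5},
}

s0 = 1  # initial state

# Sanity check: each action's transition distribution sums to 1.
for a, dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```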
Infinite horizon Markov Decision Process

Dynamics:

In each time period t ≥ 0, the system is in state st and we need to decide upon an action a available in A(st).
The system evolves to state st+1 according to P(·|a).

[Figure: one step of the dynamics on the example MDP. Legend: circles are states s, squares are actions a; an arc a → s carries the transition probability p(s|a), an arc s → a carries the cost c(a).]
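These dynamics amount to a simple sampling loop. Below is a minimal simulation sketch on a hypothetical three-state instance (all names and numbers are assumptions, not the slides' example):

```python
import random

# Tiny hypothetical MDP: state 0 is absorbing with no actions.
A = {1: ["a", "b"], 2: ["c"], 0: []}                      # A(s)
c = {"a": 3, "b": 10, "c": -5}                            # action costs
P = {"a": {2: 0.7, 0: 0.3}, "b": {0: 1.0}, "c": {0: 1.0}}  # P(.|a)

def simulate(s0, policy, horizon=100, seed=0):
    """In each period t: pick a = policy(s_t) in A(s_t), pay c(a),
    then sample s_{t+1} ~ P(.|a)."""
    rng = random.Random(seed)
    s, total = s0, 0.0
    for _ in range(horizon):
        if not A[s]:                      # no available action: stop
            break
        a = policy[s]
        total += c[a]
        nxt = list(P[a])
        s = rng.choices(nxt, [P[a][t] for t in nxt])[0]
    return s, total

final, cost = simulate(1, {1: "a", 2: "c"})
# From state 1, action a costs 3; with prob 0.7 we visit state 2 (action c,
# cost -5) before absorbing, so total is either -2.0 or 3.0.
```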
Infinite horizon (total cost) Markov Decision Process

Goal:

Find a policy π : S → A (it defines a Markov chain with transition matrix P_π), minimizing

∑_{k=0}^{+∞} 1_{s0}^t (P_π)^k c_π

[Figure: the example MDP restricted to the actions chosen by a pure policy (b, d, e, g), with the resulting expected total costs (7, 2, −5, 10, 0) attached to the states.]

NB: we might consider non-stationary and non-deterministic policies, BUT for most MDPs 'pure' policies are optimal.
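For a policy under which the series converges, the total cost need not be summed term by term: since (P_π)^k → 0 on the non-target states, the value vector v = ∑_k (P_π)^k c_π solves the linear system (I − P_π) v = c_π. A numpy sketch on hypothetical data:

```python
import numpy as np

# Markov chain induced by a pure policy on non-target states {1, 2};
# the target absorbs the remaining probability mass (hypothetical numbers).
P_pi = np.array([[0.0, 0.7],    # from state 1: 0.7 to state 2, 0.3 to target
                 [0.0, 0.0]])   # from state 2: all mass goes to the target
c_pi = np.array([3.0, -5.0])    # cost of the chosen action in each state

# v = sum_k (P_pi)^k c_pi solves (I - P_pi) v = c_pi, which is well defined
# because the policy reaches the target, i.e. (P_pi)^k -> 0.
v = np.linalg.solve(np.eye(2) - P_pi, c_pi)
# v[0] = 3 + 0.7 * (-5) = -0.5 and v[1] = -5
```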
Discounted Markov Decision Process

Issue: ∑_{k=0}^{+∞} 1_{s0}^t (P_π)^k c_π is not always defined.

[Figure: a two-state example cycling forever between states 1 and 2, via action b of cost 1 and action c of cost −1, each with probability 1.]

Discounted models: V*(s0) := min ∑_{k=0}^{+∞} α^k 1_{s0}^t (P_π)^k c_π for some 0 ≤ α < 1.

The optimal values satisfy the Bellman equations:

V*(s) = min_{a∈A(s)} { c(a) + α ∑_{s'} P(s'|a) · V*(s') }

Standard methods from the '50s:

Value Iteration: Bellman (1957), Dynamic Programming
Policy Iteration: Howard (1960), Block-Pivot Simplex algorithm
Linear Programming: Manne (1960)
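The Bellman equations suggest the Value Iteration update directly. Here is a minimal sketch on the two-state cycle from the "Issue" figure, made summable by α = 1/2 (the code is an illustration, not the original implementation):

```python
# Value Iteration: repeatedly apply the Bellman operator
#   V(s) <- min_{a in A(s)} { c(a) + alpha * sum_{s'} P(s'|a) V(s') }.
# Instance: the two-state cycle (b: cost 1, c: cost -1), discounted.
A = {1: ["b"], 2: ["c"]}
c = {"b": 1.0, "c": -1.0}
P = {"b": {2: 1.0}, "c": {1: 1.0}}
alpha = 0.5

V = {s: 0.0 for s in A}
for _ in range(200):  # alpha-contraction: 200 sweeps converge to machine precision
    V = {s: min(c[a] + alpha * sum(p * V[t] for t, p in P[a].items())
                for a in A[s])
         for s in A}

# Fixed point: V(1) = 1 + 0.5 V(2) and V(2) = -1 + 0.5 V(1),
# hence V(1) = 2/3 and V(2) = -2/3.
```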
The Stochastic Shortest Path Problem

Extension to undiscounted MDPs, i.e. α = 1 (the discounted case is a special case).

Bertsekas and Tsitsiklis 1991: Value Iteration, Policy Iteration and LP all work.

Hypotheses:

there is an identified target state T (from there, no way to escape)
there is a proper policy that leads to T with probability 1
'looping' in the system (outside T) is costly: +∞ cost

[Figure: the example MDP with state 0 relabeled as the target state T.]
The Stochastic Shortest Path Problem

Almost an extension of the standard deterministic shortest path:

there is an identified target state T (from there, no way to escape)
there is a proper policy that leads to T with probability 1
'looping' in the system (outside T) is costly: +∞ cost → this forbids zero-cost cycles

[Figure: a deterministic shortest path instance on states 1–4 and target T, with arc costs 3, 10, −1, 2, −5, 7, 2.]

NB: Bertsekas and Yu (2016) proved that perturbed versions of PI and VI converge in the presence of zero-cost cycles.
This is not only a technical problem!

Many applications with zero-cost cycles!

Maximizing the probability of reaching a target

Ex: Robot motion planning in turbulent water
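One standard way reach-probability maximization produces zero-cost cycles (an illustration of the phenomenon, not necessarily the authors' construction): all action costs are zero, and the reach probabilities are the fixed point p(s) = max_{a∈A(s)} ∑_{s'} P(s'|a) p(s') with p(T) = 1. A sketch on a hypothetical instance:

```python
# Maximizing the probability of reaching target "T":
#   p(s) = max_{a in A(s)} sum_{s'} P(s'|a) p(s'),   p(T) = 1.
# Every action has cost 0, hence cycles cost nothing. Hypothetical data.
A = {1: ["a", "b"], 2: ["c"], "T": [], "dead": []}
P = {"a": {2: 0.5, "dead": 0.5},
     "b": {"T": 0.4, "dead": 0.6},
     "c": {"T": 1.0}}

p = {s: 0.0 for s in A}
p["T"] = 1.0
for _ in range(100):  # value-iteration-style sweeps on the fixed point
    for s in (1, 2):
        p[s] = max(sum(pr * p[t] for t, pr in P[a].items()) for a in A[s])

# From state 1, action a reaches T with prob 0.5 * 1 = 0.5 > 0.4 (action b).
```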
Our Contribution

A generalization of the framework by Bertsekas and Tsitsiklis that encapsulates the deterministic version (i.e. zero-cost cycles)
A proof that we can actually restrict to 'pure' policies
Proof of convergence of Value Iteration by a simple analysis: a natural extension of Bellman-Ford
Proof that Policy Iteration converges
A generalization of Dijkstra's algorithm through primal-dual

→ Simplifies, improves and extends all previous results and analyses for infinite horizon total cost MDPs!
Our technique: polyhedral analysis

Observation: the (dual of the) linear programming formulation for SSP is a natural relaxation of a more general problem.
→ The corresponding polyhedron generalizes the network flow polyhedron.

The classical shortest path LP:

min cx
s.t. ∑_{a∈δ+(v)} x(a) − ∑_{a∈δ−(v)} x(a) = 1 if v = s, −1 if v = t, 0 otherwise, ∀v ∈ V
x ≥ 0

Its SSP analogue:

min cx
s.t. ∑_{a∈A(s)} x(a) − ∑_{a∈A} p(s|a) x(a) = 1 if s = s0, −1 if s = T, 0 otherwise, ∀s ∈ S
x ≥ 0

[Figure: a fractional feasible solution of the SSP system on the example MDP.]
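For a pure policy, the occupation measure x^π is the unique solution of the balance equations restricted to the chosen actions, and its cost c·x equals the policy's expected total cost. A numpy sketch on a hypothetical two-action instance:

```python
import numpy as np

# Pure policy on non-target states {1, 2} (target T's row is dropped);
# chosen actions: a in state 1, c in state 2. Hypothetical numbers:
#   a: from 1, goes to 2 w.p. 0.7 and to T w.p. 0.3; c: from 2, goes to T.
# One balance row per non-target state s:
#   sum_{a in A(s)} x(a) - sum_a p(s|a) x(a) = 1 if s = s0 else 0.
# Columns: x(a), x(c).
M = np.array([[1.0, 0.0],     # state 1: outflow x(a), no inflow
              [-0.7, 1.0]])   # state 2: inflow 0.7 x(a), outflow x(c)
b = np.array([1.0, 0.0])      # s0 = 1

x = np.linalg.solve(M, b)     # occupation measure: x(a) = 1, x(c) = 0.7
cost = np.array([3.0, -5.0]) @ x   # c.x = 3 - 3.5 = -0.5, the policy's cost
```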
Linear Programming relaxation: proof sketch

A policy π induces a probability distribution over all possible (s0, T)-walks.

y_k^π(s): probability of being in state s in period k following policy π
x_k^π(a): probability of taking action a in period k following policy π

We have, for all π and for all k ≥ 0:

∑_{a∈A(s)} x_k^π(a) = y_k^π(s)   and   y_{k+1}^π(s) = ∑_{a∈A} p(s|a) x_k^π(a)

It implies

∑_k ∑_{a∈A(s)} x_{k+1}^π(a) = ∑_k ∑_{a∈A} p(s|a) x_k^π(a)

Together with y_0^π = 1_{s0} = ∑_{a∈A(s)} x_0^π(a), this yields

∑_{a∈A(s)} x^π(a) − ∑_{a∈A} p(s|a) x^π(a) = 1_{s0}

as long as x^π(a) := ∑_k x_k^π(a) is well-defined for all a (this is our new definition of proper).
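The identities in the sketch can be checked numerically: unroll y_k and x_k for a proper pure policy and verify that the aggregated x satisfies the balance equation. A sketch on a hypothetical instance (the tail mass vanishes after two steps here):

```python
# Unroll y_k (state distribution) and x_k (action distribution), then check
#   sum_{a in A(s)} x(a) - sum_a p(s|a) x(a) = 1_{s0}(s).
# Hypothetical instance: states {1, 2}, target "T"; policy: a in 1, c in 2.
policy = {1: "a", 2: "c"}
P = {"a": {2: 0.7, "T": 0.3}, "c": {"T": 1.0}}
states = [1, 2]

y = {1: 1.0, 2: 0.0}              # y_0 = 1_{s0} with s0 = 1
x = {"a": 0.0, "c": 0.0}          # aggregated x(a) = sum_k x_k(a)
for _ in range(200):
    xk = {policy[s]: y[s] for s in states}   # x_k(a) = y_k(s) for a = pi(s)
    for a, mass in xk.items():
        x[a] += mass
    y = {s: sum(P[a].get(s, 0.0) * xk[a] for a in xk) for s in states}

balance = {}
for s in states:
    out_mass = x[policy[s]]                               # sum_{a in A(s)} x(a)
    in_mass = sum(P[a].get(s, 0.0) * x[a] for a in x)     # sum_a p(s|a) x(a)
    balance[s] = out_mass - in_mass
# balance[1] = 1 (= 1_{s0}) and balance[2] = 0, as the proof sketch claims.
```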
Our technique: polyhedral analysis

Proof that the extreme points of this relaxation are 'associated' with 'pure' policies (NB: the extreme points are NOT integral).
→ The proof relies on a generalization of the 'flow' decomposition theorem.

[Figures: a feasible solution on the example instance decomposed step by step into (s0, T)-walks, first in the deterministic case (integral flow values), then in the stochastic case (fractional values).]
Idea of contributions : framework
Decomposition theorem implies that extreme points are ‘pure’ strategies and extreme rays of the relaxation are ‘transition cycles’

A transition cycle is a solution x ≥ 0 to

∑_{a∈A(s)} x(a) − ∑_{a∈A} p(s|a) x(a) = 0 for all s ∈ S

The optimum of the relaxation and of the original problem coincide when there is no transition cycle of negative cost : this is our new framework

Assumptions

There exists a path from every node i to node 0 in the support graph

There is no negative cost transition cycle
Guillot and Stauffer The Stochastic Shortest Path Problem LSE 2017 15 / 18
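The balance condition defining transition cycles can be checked directly; a minimal sketch on a hypothetical two-state instance (all state and action names are made up for illustration):

```python
# Check the transition-cycle condition: x >= 0 is a transition cycle iff
# sum_{a in A(s)} x(a) - sum_{a in A} p(s|a) x(a) = 0 for every state s.

def is_transition_cycle(states, actions, p, x, tol=1e-9):
    """actions: dict state -> list of actions; p: dict (state, action) -> prob."""
    for s in states:
        outflow = sum(x[a] for a in actions[s])
        inflow = sum(p.get((s, a), 0.0) * x[a]
                     for acts in actions.values() for a in acts)
        if abs(outflow - inflow) > tol:
            return False
    return all(v >= -tol for v in x.values())

# two states that deterministically send all mass to each other:
# a balanced circulation, hence a transition cycle
states = ['u', 'v']
actions = {'u': ['go_v'], 'v': ['go_u']}
p = {('v', 'go_v'): 1.0, ('u', 'go_u'): 1.0}
x = {'go_v': 1.0, 'go_u': 1.0}
print(is_transition_cycle(states, actions, p, x))  # True
```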
Idea of contributions : algorithms
Value iteration is very similar to Bellman-Ford : we essentially prove that min lim = lim min

min_{Π∈P} lim_{K→∞} ∑_{k=0}^{K} c^T x_k^Π = lim_{K→∞} min_{Π∈P_K} ∑_{k=0}^{K} c^T x_k^Π

(P ∼ all proper policies, P_K ∼ all proper policies that terminate in K steps)

Policy iteration is a block-pivot simplex : we prove strict improvement to guarantee finiteness.

We can apply a primal-dual algorithm; the subproblem is a reachability question, solved by a Dijkstra-like algorithm (we fall into the same class, which was not the case before because of zero cost cycles !)
Guillot and Stauffer The Stochastic Shortest Path Problem LSE 2017 16 / 18
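The Bellman-Ford analogy can be made concrete with textbook value iteration, J_{k+1}(s) = min_{a∈A(s)} [c(a) + ∑_{s'} p(s'|a) J_k(s')] with J(T) = 0; a minimal sketch on a hypothetical deterministic instance, where the iteration reduces exactly to Bellman-Ford:

```python
# Value iteration for a stochastic shortest path instance:
#   J_{k+1}(s) = min_{a in A(s)} [ c(a) + sum_{s'} p(s'|a) J_k(s') ], J(T) = 0.
# On a deterministic instance this is Bellman-Ford on the underlying graph.

def value_iteration(states, actions, cost, p, target, iters=1000):
    J = {s: 0.0 if s == target else float('inf') for s in states}
    for _ in range(iters):
        new = dict(J)
        for s in states:
            if s == target:
                continue
            new[s] = min(cost[a] + sum(pr * J[s2] for s2, pr in p[a].items())
                         for a in actions[s])
        if new == J:  # fixed point of the Bellman operator reached
            break
        J = new
    return J

# hypothetical deterministic instance: shortest s -> T path has cost 2
states = ['s', 'm', 'T']
actions = {'s': ['s_to_m', 's_to_T'], 'm': ['m_to_T'], 'T': []}
cost = {'s_to_m': 1.0, 's_to_T': 5.0, 'm_to_T': 1.0}
p = {'s_to_m': {'m': 1.0}, 's_to_T': {'T': 1.0}, 'm_to_T': {'T': 1.0}}
J = value_iteration(states, actions, cost, p, 'T')
print(J['s'], J['m'])  # 2.0 1.0
```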
Main Open questions
The stochastic shortest path problem is solvable in polynomial time through LP

Is it strongly polynomial ?

Ye (2011) : true for discounted MDPs if the discount factor α is fixed

Is our generalization of Dijkstra’s algorithm strongly polynomial ?

Is the reachability subproblem strongly polynomial ?
Guillot and Stauffer The Stochastic Shortest Path Problem LSE 2017 17 / 18
Guillot and Stauffer The Stochastic Shortest Path Problem LSE 2017 18 / 18