Reachability in MDPs: Refining Convergence
of Value Iteration
Serge Haddad (LSV, ENS Cachan, CNRS & Inria) and
Benjamin Monmege (ULB)
RP 2014, Oxford
Markov Decision Processes
• What?
✦ Stochastic process with non-deterministic choices
✦ Non-determinism resolved by policies/strategies
• Where?
✦ Optimization
✦ Program verification: reachability as the basis of PCTL model checking
✦ Game theory: 1½ players
MDPs: definition and objective
[Example MDP diagram: actions a, b, c, d, e over a handful of states, with probabilistic transitions ½, ½, ⅓, ⅔, ½, ½]
• Finite number of states
• Probabilistic states
• Actions to be selected by the policy
• Reachability objective
MDP: M = (S, α, δ) with δ : S × α → Dist(S)
Policy: σ : (S·α)*·S → Dist(α)
Probability to reach F under σ: Pr_s^σ(◇F)
Maximal probability to reach F: Pr_s^max(◇F) = sup_σ Pr_s^σ(◇F)
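The tuple M = (S, α, δ) above can be sketched in code. This is a minimal Python encoding with illustrative state and action names (the labels of the slide's diagram are not fully recoverable, so the states and probabilities below are made up for the example):

```python
# Minimal sketch of an MDP M = (S, alpha, delta) as plain dictionaries.
# State and action names here are illustrative, not the slide's diagram.
S = ["init", "mid", "goal", "sink"]
alpha = ["a", "b"]

# delta : S x alpha -> Dist(S), encoded as {(state, action): {successor: prob}}.
# Not every action needs to be enabled in every state.
delta = {
    ("init", "a"): {"mid": 0.5, "sink": 0.5},
    ("init", "b"): {"goal": 2 / 3, "init": 1 / 3},
    ("mid", "a"): {"goal": 0.5, "sink": 0.5},
    ("goal", "a"): {"goal": 1.0},
    ("sink", "a"): {"sink": 1.0},
}

# Each delta(s, a) must be a probability distribution over S.
for (s, a), dist in delta.items():
    assert s in S and a in alpha
    assert all(t in S for t in dist)
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```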
Optimal reachability probabilities of MDPs
• How?
✦ Linear programming
✦ Policy iteration
✦ Value iteration: a numerical scheme that scales well and works in practice
   — used in the PRISM model checker [Kwiatkowska, Norman, Parker, 2011]
Value iteration
[Same example MDP diagram as before]
x_s^(n+1) = max_{a∈α} Σ_{s'∈S} δ(s,a)(s') · x_{s'}^(n)
x_s^(0) = 1 if s ∈ F, 0 otherwise
Successive value vectors (the action achieving the max is shown in parentheses):
n=0:  0        0            0        0
n=1:  0        2/3 (b)      0        0
n=2:  1/3      2/3 (b)      0        0
n=3:  1/2      2/3 (b)      1/6      0
n=4:  7/12     13/18 (b)    1/4      0
…
      0.7969   0.7988 (b)   0.3977   0
      0.7978   0.7992 (b)   0.3984   0   ← componentwise change ≤ 0.001: stop
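The fixed-point iteration on this slide is straightforward to implement. Here is a Python sketch; the dictionary encoding of the MDP and the "largest componentwise change ≤ ε" stopping test are assumptions that mirror the slide's run, and the example MDP at the bottom is made up, not the slide's diagram:

```python
def value_iteration(delta, target, eps=1e-3):
    """Maximal reachability values via the slide's recurrence.

    delta : dict state -> dict action -> {successor: probability}
    target: set of goal states F
    Iterates x_s^(n+1) = max_a sum_{s'} delta(s,a)(s') * x_{s'}^(n)
    from x_s^(0) = 1 if s in F else 0, stopping once the largest
    componentwise change drops to eps (the usual, but only
    heuristic, stopping criterion).
    """
    x = {s: 1.0 if s in target else 0.0 for s in delta}
    while True:
        y = {s: x[s] if s in target else
             max(sum(p * x[t] for t, p in dist.items())
                 for dist in delta[s].values())
             for s in delta}
        if max(abs(y[s] - x[s]) for s in delta) <= eps:
            return y
        x = y

# Tiny example (illustrative): from "init", action "b" retries with
# probability 1/3 and hits the goal with probability 2/3, so the optimal
# value of "init" is 1; action "a" alone would only give 0.5.
delta = {
    "init": {"a": {"goal": 0.5, "sink": 0.5},
             "b": {"goal": 2 / 3, "init": 1 / 3}},
    "goal": {"a": {"goal": 1.0}},
    "sink": {"a": {"sink": 1.0}},
}
vals = value_iteration(delta, target={"goal"})
print(round(vals["init"], 2))  # close to 1.0
```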
Value iteration: which guarantees?
[Chain MDP diagram: states 1, …, k-1, k, k+1, …, 2k-1, 2k; every transition with probability ½]
State:  0  1    2    3  …  k-1  k  k+1  …  2k
Step 1: 1  0    0    0  …  0    0  0    …  0
Step 2: 1  1/2  0    0  …  0    0  0    …  0
Step 3: 1  1/2  1/4  0  …  0    0  0    …  0
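The table above can be reproduced in code. The sketch below assumes, as the diagram suggests, that states 0 and 2k are absorbing and every interior state moves left or right with probability ½ (a Markov chain, i.e. an MDP with a single action). Since values propagate one state per step, x_k stays 0 for the first k iterations even though its true value on this symmetric chain is (2k − k)/2k = ½, so a small per-step change does not imply closeness to the limit:

```python
from fractions import Fraction

def chain_step(x, k):
    """One value-iteration step on the chain 0..2k (0 and 2k absorbing)."""
    half = Fraction(1, 2)
    return ([x[0]]
            + [half * x[i - 1] + half * x[i + 1] for i in range(1, 2 * k)]
            + [x[2 * k]])

k = 10
x = [Fraction(1)] + [Fraction(0)] * (2 * k)  # Step 1: 1 on state 0, else 0
x = chain_step(x, k)                          # Step 2: state 1 becomes 1/2
x = chain_step(x, k)                          # Step 3: state 2 becomes 1/4
print(x[1], x[2], x[3])                       # -> 1/2 1/4 0
```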