# Reachability in MDPs: Refining Convergence of Value Iteration

Serge Haddad (LSV, ENS Cachan, CNRS & Inria)

Benjamin Monmege (ULB)

RP 2014, Oxford

• Markov Decision Processes

• What?

✦ Stochastic process with non-deterministic choices

✦ Non-determinism solved by policies/strategies

• Where?

✦ Optimization

✦ Program verification: reachability as the basis of PCTL model-checking

✦ Game theory: 1+½ players

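• MDPs: definition and objective

[Figure: example MDP; edge labels a, b, c, d, e and probabilities ½]

Finite number of states

Probabilistic states

Actions to be selected by the policy

Reachability objective

MDP: $M = (S, \alpha, \delta)$ with $\delta : S \times \alpha \to \mathrm{Dist}(S)$

Policy: $\sigma : (S \cdot \alpha)^{*} \cdot S \to \mathrm{Dist}(\alpha)$

Probability to reach: $\Pr_s^{\sigma}(F)$

Maximal probability to reach: $\Pr_s^{\max}(F) = \sup_{\sigma} \Pr_s^{\sigma}(F)$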
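The definitions above can be made concrete with a small sketch. It assumes a hypothetical five-state MDP (not the one drawn on the slide) and restricts attention to memoryless deterministic policies, a special case of the general policies $\sigma$ defined above. The names `delta`, `policy_a`, `policy_b` and `reach_within` are illustrative only. The function computes the probability of reaching the target within a bounded number of steps under a fixed policy, which approximates $\Pr_s^{\sigma}(F)$ from below.

```python
from fractions import Fraction

half = Fraction(1, 2)

# Hypothetical MDP: delta[s][a] is a distribution over successor states.
# "goal" (the target) and "sink" are absorbing.
delta = {
    "s0": {"a": {"s1": half, "s2": half}, "b": {"goal": half, "sink": half}},
    "s1": {"c": {"goal": half, "s0": half}},
    "s2": {"d": {"sink": Fraction(1)}},
    "goal": {"e": {"goal": Fraction(1)}},
    "sink": {"e": {"sink": Fraction(1)}},
}

# Two memoryless deterministic policies: they differ only in state s0.
policy_a = {"s0": "a", "s1": "c", "s2": "d", "goal": "e", "sink": "e"}
policy_b = {"s0": "b", "s1": "c", "s2": "d", "goal": "e", "sink": "e"}


def reach_within(delta, policy, target, steps):
    """Probability, from each state, of reaching `target` within `steps` steps."""
    x = {s: Fraction(int(s == target)) for s in delta}
    for _ in range(steps):
        # One step of the Markov chain induced by the policy.
        x = {
            s: sum(p * x[t] for t, p in delta[s][policy[s]].items())
            for s in delta
        }
    return x


print(reach_within(delta, policy_b, "goal", 1)["s0"])  # 1/2
print(float(reach_within(delta, policy_a, "goal", 50)["s0"]))
```

In this hypothetical example, choosing `b` in `s0` reaches the target with probability 1/2, while choosing `a` yields a value that approaches 1/3, so the maximising policy picks `b`.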

• Optimal reachability probabilities of MDPs

• How?

✦ Linear programming

✦ Policy iteration

✦ Value iteration: numerical scheme that scales well and works in practice; used in the numerical PRISM model checker [Kwiatkowska, Norman, Parker, 2011]

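• Value iteration

[Figure: example MDP; edge labels a, b, c, d, e and probabilities ½]

$$x_s^{(n+1)} = \max_{a \in \alpha} \sum_{s' \in S} \delta(s,a)(s') \times x_{s'}^{(n)}$$

$$x_s^{(0)} = \begin{cases} 1 & \text{if } s \text{ is the target state} \\ 0 & \text{otherwise} \end{cases}$$

Successive iterates $x^{(n)}$, one row per step:

0, 0, 0, 0

0, 2/3 (b), 0, 0

1/3, 2/3 (b), 0, 0

1/2, 2/3 (b), 1/6, 0

7/12, 13/18 (b), 1/4, 0

…, …, …, …

0.7969, 0.7988 (b), 0.3977, 0

0.7978, 0.7992 (b), 0.3984, 0

Difference between the last two iterates: ≤ 0.001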
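The update rule above, together with the "difference ≤ 0.001" stopping test, can be sketched as follows. This is a minimal illustration on a hypothetical MDP (not the one from the slide); the names `delta`, `value_iteration` and `eps` are assumptions made for the example, and the target is assumed to be absorbing so that its value stays at 1.

```python
# Hypothetical MDP: delta[s][a] is a distribution over successor states.
# "goal" (the target) and "sink" are absorbing.
delta = {
    "s0": {"a": {"s1": 0.5, "s2": 0.5}, "b": {"goal": 0.5, "sink": 0.5}},
    "s1": {"c": {"goal": 0.5, "s0": 0.5}},
    "s2": {"d": {"sink": 1.0}},
    "goal": {"e": {"goal": 1.0}},
    "sink": {"e": {"sink": 1.0}},
}


def value_iteration(delta, target, eps=1e-3):
    """Apply x(n+1)_s = max_a sum_s' delta(s,a)(s') * x(n)_s' until two
    successive iterates differ by at most eps in every state."""
    x = {s: 1.0 if s == target else 0.0 for s in delta}
    steps = 0
    while True:
        nxt = {
            s: max(
                sum(p * x[t] for t, p in dist.items())
                for dist in delta[s].values()
            )
            for s in delta
        }
        steps += 1
        if max(abs(nxt[s] - x[s]) for s in delta) <= eps:
            return nxt, steps
        x = nxt


values, steps = value_iteration(delta, "goal")
print(values["s0"], values["s1"], steps)  # 0.5 0.75 3
```

The stopping test only compares two successive iterates; it says nothing by itself about the distance to the true maximal reachability probability.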

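• Value iteration: which guarantees?

[Figure: chain of states labelled 1, …, k-2, k-1, k, k+1, k+2, …, 2k-1, 2k; transitions of probability ½]

| State | 0 | 1 | 2 | 3 | … | k-1 | k | k+1 | … | 2k |
|---|---|---|---|---|---|---|---|---|---|---|
| Step 1 | 1 | 0 | 0 | 0 | … | 0 | 0 | 0 | … | 0 |
| Step 2 | 1 | 1/2 | 0 | 0 | … | 0 | 0 | 0 | … | 0 |
| Step 3 | 1 | 1/2 | 1/4 | 0 | … | 0 | 0 | 0 | … | 0 |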
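A minimal sketch of why a threshold on successive iterates may give no guarantee. It assumes a hypothetical chain structure chosen to be consistent with the three table rows above, and that structure is an assumption of this sketch: state 0 is the target, state 2k is a sink, each state i with 0 < i < k moves to i-1 or to k with probability ½ each, state k moves to k-1 or k+1, and each state i with k < i < 2k moves to i+1 or to k. The names `chain_step` and `iterate_until` are illustrative.

```python
def chain_step(x, k):
    """One value-iteration step on the assumed chain with states 0..2k."""
    nxt = list(x)  # state 0 (target) keeps value 1, state 2k (sink) keeps 0
    for i in range(1, 2 * k):
        if i < k:
            nxt[i] = 0.5 * x[i - 1] + 0.5 * x[k]
        elif i == k:
            nxt[i] = 0.5 * x[k - 1] + 0.5 * x[k + 1]
        else:
            nxt[i] = 0.5 * x[i + 1] + 0.5 * x[k]
    return nxt


def iterate_until(k, eps):
    """Iterate until two successive iterates differ by at most eps."""
    x = [1.0] + [0.0] * (2 * k)
    steps = 0
    while True:
        nxt = chain_step(x, k)
        steps += 1
        if max(abs(a - b) for a, b in zip(nxt, x)) <= eps:
            return nxt, steps
        x = nxt


x, steps = iterate_until(k=20, eps=1e-3)
print(steps, x[1], x[2], x[20])  # 10 0.5 0.25 0.0
```

Under this assumed structure, the chain is symmetric around state k, so the true probability of reaching state 0 from state k is ½. The iteration nevertheless stops after 10 steps with the value 0 for state k, because each step only changes one new state by 1/2^n and that change drops below 0.001 before any information reaches state k.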

