
  • Reachability in MDPs: Refining Convergence of Value Iteration

    Serge Haddad (LSV, ENS Cachan, CNRS & Inria) and
    Benjamin Monmege (ULB)

    RP 2014, Oxford

  • Markov Decision Processes

    • What?

    ✦ Stochastic processes with non-deterministic choices

    ✦ Non-determinism resolved by policies/strategies

    • Where?

    ✦ Optimization

    ✦ Program verification: reachability as the basis of PCTL model-checking

    ✦ Game theory: 1½ players

  • MDPs: definition and objective

    [Diagram: an example MDP; the labels a–e and the probability-½ branches are from the original figure]

    ✦ Finite number of states

    ✦ Probabilistic states

    ✦ Actions to be selected by the policy

    ✦ Reachability objective

    MDP: M = (S, α, δ) with δ : S × α → Dist(S)

    Policy: σ : (S·α)*·S → Dist(α)

    Probability to reach the target from s under σ: Pr_s^σ(F target)

    Maximal probability to reach the target: Pr_s^max(F target) = sup_σ Pr_s^σ(F target)
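    The formal definition translates directly into code. Below is a minimal sketch of the structure δ : S × α → Dist(S) in Python; the state names, action names, and the `toy` MDP itself are hypothetical illustrations, not the exact example from the slide's diagram (which did not survive extraction).

    ```python
    from typing import Dict

    State = str
    Action = str
    # delta : S × α → Dist(S), encoded as nested dictionaries
    Delta = Dict[State, Dict[Action, Dict[State, float]]]

    # Hypothetical toy MDP in the spirit of the slide's example:
    # probability-1/2 branches and a designated target state.
    toy: Delta = {
        "s0": {"a": {"s1": 0.5, "s2": 0.5}},
        "s1": {"b": {"target": 0.5, "s0": 0.5}, "c": {"s1": 1.0}},
        "s2": {"d": {"target": 0.5, "sink": 0.5}},
        "target": {"e": {"target": 1.0}},
        "sink": {"e": {"sink": 1.0}},
    }

    def check_distributions(delta: Delta) -> None:
        """Every delta(s, a) must sum to 1, i.e. be a distribution over S."""
        for s, actions in delta.items():
            for a, dist in actions.items():
                assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)

    check_distributions(toy)
    ```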

  • Optimal reachability probabilities of MDPs

    • How?

    ✦ Linear programming

    ✦ Policy iteration

    ✦ Value iteration: a numerical scheme that scales well and works in practice;
      used in the PRISM probabilistic model checker
      [Kwiatkowska, Norman, Parker, 2011]
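    For reference, the linear-programming route mentioned in the first bullet is classically set up as follows (after the usual preprocessing that fixes states with maximal reachability probability 0); this formulation is standard background, not something spelled out on the slide:

    ```latex
    \begin{align*}
    \text{minimize}\quad   & \sum_{s \in S} x_s \\
    \text{subject to}\quad & x_s \ge \sum_{s' \in S} \delta(s,a)(s')\, x_{s'}
                             && \text{for all } s \in S \setminus T,\ a \in \alpha,\\
                           & x_s = 1 && \text{for all } s \in T \text{ (target states)},\\
                           & 0 \le x_s \le 1 && \text{for all } s \in S.
    \end{align*}
    ```

    Its unique optimal solution assigns x_s = Pr_s^max(F target) to every state.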

  • Value iteration

    [Diagram: the example MDP from the previous slide]

    x_s^(n+1) = max_{a ∈ α} Σ_{s' ∈ S} δ(s,a)(s') × x_{s'}^(n)

    x_s^(0) = 1 if s is the target, 0 otherwise

    Successive iterates (one column per state; the action in parentheses is the maximizing one):

    x^(0):   0        0           0        0
    x^(1):   0        2/3 (b)     0        0
    x^(2):   1/3      2/3 (b)     0        0
    x^(3):   1/2      2/3 (b)     1/6      0
    x^(4):   7/12     13/18 (b)   1/4      0
     …       …        …           …        …
             0.7969   0.7988 (b)  0.3977   0
             0.7978   0.7992 (b)  0.3984   0

    The last two iterates differ by at most 0.001 in every component, so the iteration stops.
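    As a concrete rendering of the update rule above, here is a minimal value-iteration sketch in Python over the dictionary encoding from the earlier sketch; the 0.001 default threshold mirrors the slide, while the rest of the interface is an illustrative assumption.

    ```python
    def value_iteration(delta, target, eps=0.001):
        """Iterate x_s <- max_a sum_{s'} delta(s,a)(s') * x_{s'},
        starting from x = 1 on the target and 0 elsewhere, and stop
        when successive iterates differ by at most eps everywhere."""
        states = list(delta)
        x = {s: (1.0 if s == target else 0.0) for s in states}
        while True:
            x_next = {
                s: 1.0 if s == target else max(
                    sum(p * x[t] for t, p in dist.items())
                    for dist in delta[s].values()
                )
                for s in states
            }
            if max(abs(x_next[s] - x[s]) for s in states) <= eps:
                return x_next
            x = x_next

    # Usage with the hypothetical toy MDP defined earlier:
    # value_iteration(toy, "target")
    ```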

  • Value iteration: which guarantees?

    [Diagram: a chain of states 0, 1, …, k-1, k, k+1, …, 2k with probability-½ transitions; state 0 is the target]

    State    0   1    2    3   …   k-1   k   k+1   …   2k
    Step 1   1   0    0    0   …   0     0   0     …   0
    Step 2   1   1/2  0    0   …   0     0   0     …   0
    Step 3   1   1/2  1/4  0   …   0     0   0     …   0

    Values propagate only one state per step: after n iterations, every state at distance more than n from the target still has value 0.
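    Because values propagate so slowly, the difference between successive iterates can fall below the threshold long before the iterates are close to the true probabilities. The sketch below demonstrates this, assuming the chain suggested by the table is the standard gambler's-ruin walk (interior state i moves to i-1 or i+1 with probability ½ each; state 0 is the target, state 2k is absorbing); in that chain the exact value at state i is (2k - i)/(2k), a textbook fact used here only as the reference point.

    ```python
    def chain_delta(k):
        """Chain inferred from the table (an assumption): states 0..2k,
        state 0 is the target, state 2k is absorbing, and every interior
        state i moves to i-1 or i+1 with probability 1/2 (a single action,
        so this MDP is in fact a Markov chain)."""
        delta = {0: {"stay": {0: 1.0}}, 2 * k: {"stay": {2 * k: 1.0}}}
        for i in range(1, 2 * k):
            delta[i] = {"step": {i - 1: 0.5, i + 1: 0.5}}
        return delta

    k = 10
    # value_iteration from the sketch above, with the slide's 0.001 threshold.
    x = value_iteration(chain_delta(k), target=0, eps=0.001)
    exact = (2 * k - k) / (2 * k)  # gambler's ruin: Pr_i(reach 0) = (2k - i)/(2k)
    # The stopping test fires while the middle state is still noticeably below
    # its exact value 1/2: the remaining error far exceeds the 0.001 threshold.
    print(f"computed x[k] = {x[k]:.4f}, exact = {exact:.4f}")
    ```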