Monte-Carlo Graph Search: the Value of Merging Similar States

Edouard Leurent¹,², Odalric-Ambrym Maillard¹

¹ Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 – CRIStAL
² Renault Group
Motivation

Monte-Carlo Tree Search algorithms
• rely on a tree structure to represent their value estimates
• have performance independent of the size S of the state space

Tabular RL (UCBVI): $\sqrt{HSAn}$
MCTS (OPD): $n^{-\log(1/\gamma)/\log A}$
Limitation

• There can be several paths to the same state s
• s is represented several times in the tree
• No information is shared between these paths
Motivating example

Not accounting for state similarity hinders exploration.
Sparse gridworld: reward of 0 everywhere.
Planners' behaviours

OPD: uniform planning in the space of sequences of actions

[Figure: states visited by OPD, budget of n = 5460 moves]
Motivating example 2/2

Concentration
• Does not lead to uniform exploration of the state space
• 2D random walk ∼ Rayleigh distribution: $P(d) = \frac{2d}{H} e^{-d^2/H}$
[Figure: OPD visit counts (log colour scale), budget of 5460 samples, maximum distance reached d = 6]
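To see this concentration concretely, here is a minimal simulation sketch (an illustration, not from the slides); the grid dynamics, the horizon H and the number of walks are assumptions chosen for the example.

```python
# Minimal simulation sketch (illustration): endpoints of uniform random walks
# on a 2D grid concentrate near the origin, matching the Rayleigh-like density
# P(d) = (2d/H) exp(-d^2 / H).
import random
from collections import Counter

ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # right, left, up, down

def random_walk_endpoint(horizon):
    """Follow `horizon` uniformly random moves from the origin."""
    x, y = 0, 0
    for _ in range(horizon):
        dx, dy = random.choice(ACTIONS)
        x, y = x + dx, y + dy
    return x, y

H = 20           # assumed horizon for illustration
N_WALKS = 5000   # assumed number of simulated trajectories
distances = Counter()
for _ in range(N_WALKS):
    x, y = random_walk_endpoint(H)
    distances[round((x * x + y * y) ** 0.5)] += 1

# Most of the mass sits at small distances: sampling actions uniformly does
# not yield uniform coverage of the state space.
for d in sorted(distances):
    print(f"distance {d:2d}: {distances[d] / N_WALKS:.3f}")
```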
Goal

Better exploit this wasted information
• by merging similar states into a graph
Questions

• How to adapt MCTS algorithms to work on graphs?
• Can we quantify the benefit of using graphs over trees?
MCTS for Deterministic MDPs

[Figure: look-ahead tree, interior $T_n^\circ$ and boundary $\partial T_n$]

Optimism in the Face of Uncertainty: OPD
1. Build confidence bounds $L(a) \le V(a) \le U(a)$
2. Follow optimistic actions, from the root down to a leaf $b$
3. Expand the leaf $b \in \partial T_n$
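A minimal Python sketch of this optimistic planning loop (an illustration under assumptions, not the authors' implementation): it assumes a deterministic model `model(state, action) -> (next_state, reward)` with rewards in [0, 1], and stores one node per action sequence, as in a tree.

```python
# Sketch of OPD-style optimistic planning for a deterministic MDP
# (illustrative and simplified; not the authors' code). Assumes a model
# model(state, action) -> (next_state, reward) with rewards in [0, 1].

class Node:
    def __init__(self, state, path_return=0.0, depth=0):
        self.state = state
        self.path_return = path_return  # discounted return accumulated from the root
        self.depth = depth
        self.children = {}              # action -> Node

def upper_bound(node, gamma):
    """U-value of a node: leaves optimistically assume all future rewards are 1."""
    if not node.children:
        return node.path_return + gamma ** node.depth / (1 - gamma)
    return max(upper_bound(child, gamma) for child in node.children.values())

def opd(model, actions, root_state, budget, gamma=0.9):
    root = Node(root_state)
    for _ in range(budget):
        # 1-2. Follow optimistic actions from the root down to a leaf b.
        node = root
        while node.children:
            node = max(node.children.values(),
                       key=lambda child: upper_bound(child, gamma))
        # 3. Expand the leaf b: simulate every action once.
        for action in actions:
            next_state, reward = model(node.state, action)
            node.children[action] = Node(
                next_state,
                node.path_return + gamma ** node.depth * reward,
                node.depth + 1)
    # Recommend the root action with the highest optimistic value (one simple rule).
    return max(root.children, key=lambda a: upper_bound(root.children[a], gamma))
```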
On a graph...

[Figure: explored graph, interior $\mathcal{G}_n^\circ$ and boundary $\partial \mathcal{G}_n$]

Same principle: GBOP-D
1. Build confidence bounds $L(s) \le V(s) \le U(s)$
2. Follow optimistic actions until an external node $s$ is reached
3. Expand the external node $s \in \partial \mathcal{G}_n$

> Each state is guaranteed to be expanded at most once.
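The key difference from the tree version is state merging: a transition that reaches an already-discovered state reuses its node instead of creating a duplicate. A minimal sketch of this bookkeeping, assuming hashable states and the same `model(state, action)` interface as above (again an illustration, not the paper's implementation):

```python
# Sketch of the state-merging bookkeeping behind a graph-based planner
# (illustrative, simplified). Nodes are keyed by state, so each state appears
# once in the graph and is expanded at most once, however many paths reach it.

class GraphNode:
    def __init__(self, state):
        self.state = state
        self.edges = {}        # action -> (reward, next_state)
        self.expanded = False  # False: external (boundary) node

def expand(graph, node, model, actions):
    """Expand an external node: simulate each action once, merging successors."""
    for action in actions:
        next_state, reward = model(node.state, action)
        if next_state not in graph:                # first time this state is seen
            graph[next_state] = GraphNode(next_state)
        node.edges[action] = (reward, next_state)  # may point back to a known node
    node.expanded = True

# Usage sketch: the planner maintains graph = {root_state: GraphNode(root_state)},
# follows optimistic actions (using the bounds L, U of the next slide) until it
# reaches a node with expanded == False, and calls expand() on it.
```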
How to build the bounds L, U?

How to bound $V(s) = \sup \sum_{t=0}^{\infty} \gamma^t r_t$?
• Initialize with trivial bounds: $L = 0$, $U = \frac{1}{1-\gamma}$
• Apply the Bellman operator $B \colon V \mapsto \max_a R(s, a) + \gamma V(s')$

$L(a) \le B(L)(a) \le V(a) \le B(U)(a) \le U(a)$

• Trees: converges in $d_n$ steps
• Graphs: may take infinitely many steps to converge when there is a loop
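A minimal sketch of this bound computation on the explored graph, reusing the `GraphNode` structure above (an illustration; the paper's exact procedure and stopping rule may differ): starting from the trivial bounds, it keeps applying the Bellman operator until the bounds move by less than a tolerance, since loops prevent exact convergence in finitely many steps.

```python
# Sketch: compute value bounds L, U on the explored graph by iterating the
# Bellman operator (illustrative; not the paper's exact procedure). External
# (unexpanded) nodes keep the trivial bounds L = 0 and U = 1 / (1 - gamma).

def bellman_backup(values, node, gamma, default):
    """One application of B at a node; external nodes keep their default bound."""
    if not node.expanded:
        return default
    return max(reward + gamma * values[next_state]
               for reward, next_state in node.edges.values())

def compute_bounds(graph, gamma=0.9, tol=1e-6, max_iters=10_000):
    L = {s: 0.0 for s in graph}
    U = {s: 1.0 / (1.0 - gamma) for s in graph}
    for _ in range(max_iters):
        new_L = {s: bellman_backup(L, node, gamma, 0.0)
                 for s, node in graph.items()}
        new_U = {s: bellman_backup(U, node, gamma, 1.0 / (1.0 - gamma))
                 for s, node in graph.items()}
        delta = max(abs(new_L[s] - L[s]) + abs(new_U[s] - U[s]) for s in graph)
        L, U = new_L, new_U
        if delta < tol:  # with loops, the fixed point is only reached in the limit
            break
    return L, U
```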
Analysis

Is GBOP-D more efficient than OPD?
Performance: simple regret $r_n = V^\star - V(a_n)$

Theorem (Sample complexity of OPD, Hren and Munos, 2008)
$r_n = O\!\left(n^{-\log(1/\gamma)/\log \kappa}\right)$,
where $\kappa$ is a problem-dependent difficulty measure.

Theorem (Sample complexity of GBOP-D)
$r_n = O\!\left(n^{-\log(1/\gamma)/\log \kappa_\infty}\right)$,
where $\kappa_\infty$ is a tighter problem-dependent difficulty measure: $\kappa_\infty \le \kappa$.
Comparing difficulty measures

• $\kappa_\infty = \kappa$ if the MDP has a tree structure
• $\kappa_\infty < \kappa$ when trajectories intersect a lot
  > actions cancel each other out (e.g. moving left or right)
  > actions are commutative (e.g. placing pawns on a board)

Illustrative example: 3 states, $K > 2$ actions
[Figure: 3-state MDP with rewards $r^\star$, $r^+$, $r^-$ and 0]
$\kappa_\infty = 1 < \kappa = K - 1$
Experiment: sparse gridworld
Rewards in a ball around (10, 10) of radius 5, with quadratic decay
[Figure: visit counts (log colour scale), OPD (left) vs GBOP-D (right), n = 5640 samples]
Extension to stochastic MDPs

• Use state similarity to tighten the bounds $L \le V \le U$.
• We adapt MDP-GapE (Jonsson et al., 2020) to obtain GBOP.
Experiment: stochastic gridworld
Noisy transitions with probability p = 10%
[Figure: visit counts (log colour scale), UCT (left) vs GBOP (right), n = 5640 samples]
Exploration-Exploitation score
$S = \sum_{t=1}^{n} \underbrace{d(s_t, s_0)}_{\text{Exploration}} - \underbrace{d(s_t, s_g)}_{\text{Exploitation}}$

[Figure: exploration-exploitation scores of OPD, GBOP-D, MDP-GapE, BRUE, UCT, KL-OLOP and GBOP, under deterministic and stochastic dynamics, n = 5640 samples]
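For concreteness, a small sketch of how this score could be computed from a visited trajectory; the Euclidean distance and the gridworld coordinates are assumptions chosen for illustration.

```python
# Sketch of the exploration-exploitation score for a gridworld trajectory
# (illustrative; assumes Euclidean distance between (x, y) cells).

def score(trajectory, start, goal):
    """S = sum_t d(s_t, start) - d(s_t, goal): large when the agent both moves
    away from its starting point and spends time close to the goal region."""
    def d(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    return sum(d(s, start) - d(s, goal) for s in trajectory)

# Example: walk straight to the goal at (10, 10), then stay there.
trajectory = [(min(t, 10), min(t, 10)) for t in range(21)]
print(score(trajectory, start=(0, 0), goal=(10, 10)))  # positive score
```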
Sailing Domain (Vanderbei, 1996)

[Figure: simple regret $r_n$ vs budget $n$ (log-log scale) for BRUE, GBOP, KL-OLOP, MDP-GapE and UCT]

Effective branching factor $\kappa_e$:
• $\kappa_e \approx 3.6$ for BRUE, KL-OLOP, MDP-GapE, UCT
• $\kappa_e \approx 1.2$ for GBOP, which suggests our results may still hold
Thank You!