Monte-Carlo Graph Search: the Value of Merging Similar States

Edouard Leurent¹,², Odalric-Ambrym Maillard¹

¹ Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 – CRIStAL
² Renault Group
Motivation

Monte-Carlo Tree Search algorithms
• rely on a tree structure to represent their value estimates
• have performance independent of the size S of the state space

Tabular RL (UCBVI): $\sqrt{HSAn}$
MCTS (OPD): $n^{-\log(1/\gamma)/\log A}$
Limitation

• There can be several paths to the same state s
• s is represented several times in the tree
• No information is shared between these paths
Motivating example

Not accounting for state similarity hinders exploration.
Sparse gridworld: reward of 0 everywhere.
Planners' behaviours

OPD: uniform planning in the space of sequences of actions

[Figure: states visited by OPD, budget of n = 5460 moves]
Motivating example 2/2

Concentration
• Does not lead to uniform exploration of the state space
• 2D random walk ∼ Rayleigh distribution: $P(d) = \frac{2d}{H} e^{-d^2/H}$
[Figure: OPD visit counts (log colour scale), budget of 5460 samples, maximum distance reached d = 6]
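To see this concentration concretely, here is a minimal simulation sketch (an illustration, not from the slides); the grid dynamics, the horizon H and the number of walks are assumptions chosen for the example.

```python
# Minimal simulation sketch (illustration): endpoints of uniform random walks
# on a 2D grid concentrate near the origin, matching the Rayleigh-like density
# P(d) = (2d/H) exp(-d^2 / H).
import random
from collections import Counter

ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # right, left, up, down

def random_walk_endpoint(horizon):
    """Follow `horizon` uniformly random moves from the origin."""
    x, y = 0, 0
    for _ in range(horizon):
        dx, dy = random.choice(ACTIONS)
        x, y = x + dx, y + dy
    return x, y

H = 20           # assumed horizon for illustration
N_WALKS = 5000   # assumed number of simulated trajectories
distances = Counter()
for _ in range(N_WALKS):
    x, y = random_walk_endpoint(H)
    distances[round((x * x + y * y) ** 0.5)] += 1

# Most of the mass sits at small distances: sampling actions uniformly does
# not yield uniform coverage of the state space.
for d in sorted(distances):
    print(f"distance {d:2d}: {distances[d] / N_WALKS:.3f}")
```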
Goal

Better exploit this wasted information
• by merging similar states into a graph
Questions

• How to adapt MCTS algorithms to work on graphs?
• Can we quantify the benefit of using graphs over trees?
MCTS for Deterministic MDPs

[Figure: look-ahead tree, interior $T_n^\circ$ and boundary $\partial T_n$]

Optimism in the Face of Uncertainty: OPD
1. Build confidence bounds $L(a) \le V(a) \le U(a)$
2. Follow optimistic actions, from the root down to a leaf $b$
3. Expand the leaf $b \in \partial T_n$
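A minimal Python sketch of this optimistic planning loop (an illustration under assumptions, not the authors' implementation): it assumes a deterministic model `model(state, action) -> (next_state, reward)` with rewards in [0, 1], and stores one node per action sequence, as in a tree.

```python
# Sketch of OPD-style optimistic planning for a deterministic MDP
# (illustrative and simplified; not the authors' code). Assumes a model
# model(state, action) -> (next_state, reward) with rewards in [0, 1].

class Node:
    def __init__(self, state, path_return=0.0, depth=0):
        self.state = state
        self.path_return = path_return  # discounted return accumulated from the root
        self.depth = depth
        self.children = {}              # action -> Node

def upper_bound(node, gamma):
    """U-value of a node: leaves optimistically assume all future rewards are 1."""
    if not node.children:
        return node.path_return + gamma ** node.depth / (1 - gamma)
    return max(upper_bound(child, gamma) for child in node.children.values())

def opd(model, actions, root_state, budget, gamma=0.9):
    root = Node(root_state)
    for _ in range(budget):
        # 1-2. Follow optimistic actions from the root down to a leaf b.
        node = root
        while node.children:
            node = max(node.children.values(),
                       key=lambda child: upper_bound(child, gamma))
        # 3. Expand the leaf b: simulate every action once.
        for action in actions:
            next_state, reward = model(node.state, action)
            node.children[action] = Node(
                next_state,
                node.path_return + gamma ** node.depth * reward,
                node.depth + 1)
    # Recommend the root action with the highest optimistic value (one simple rule).
    return max(root.children, key=lambda a: upper_bound(root.children[a], gamma))
```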
On a graph...

[Figure: explored graph, interior $\mathcal{G}_n^\circ$ and boundary $\partial \mathcal{G}_n$]

Same principle: GBOP-D
1. Build confidence bounds $L(s) \le V(s) \le U(s)$
2. Follow optimistic actions until an external node $s$ is reached
3. Expand the external node $s \in \partial \mathcal{G}_n$

> Each state is guaranteed to be expanded at most once.
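The key difference from the tree version is state merging: a transition that reaches an already-discovered state reuses its node instead of creating a duplicate. A minimal sketch of this bookkeeping, assuming hashable states and the same `model(state, action)` interface as above (again an illustration, not the paper's implementation):

```python
# Sketch of the state-merging bookkeeping behind a graph-based planner
# (illustrative, simplified). Nodes are keyed by state, so each state appears
# once in the graph and is expanded at most once, however many paths reach it.

class GraphNode:
    def __init__(self, state):
        self.state = state
        self.edges = {}        # action -> (reward, next_state)
        self.expanded = False  # False: external (boundary) node

def expand(graph, node, model, actions):
    """Expand an external node: simulate each action once, merging successors."""
    for action in actions:
        next_state, reward = model(node.state, action)
        if next_state not in graph:                # first time this state is seen
            graph[next_state] = GraphNode(next_state)
        node.edges[action] = (reward, next_state)  # may point back to a known node
    node.expanded = True

# Usage sketch: the planner maintains graph = {root_state: GraphNode(root_state)},
# follows optimistic actions (using the bounds L, U of the next slide) until it
# reaches a node with expanded == False, and calls expand() on it.
```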
How to build the bounds L, U?

How to bound $V(s) = \sup \sum_{t=0}^{\infty} \gamma^t r_t$?
• Initialize with trivial bounds: $L = 0$, $U = \frac{1}{1-\gamma}$
• Apply the Bellman operator $B \colon V \mapsto \max_a R(s, a) + \gamma V(s')$

$L(a) \le B(L)(a) \le V(a) \le B(U)(a) \le U(a)$

• Trees: converges in $d_n$ steps
• Graphs: may take infinitely many steps to converge when there is a loop
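A minimal sketch of this bound computation on the explored graph, reusing the `GraphNode` structure above (an illustration; the paper's exact procedure and stopping rule may differ): starting from the trivial bounds, it keeps applying the Bellman operator until the bounds move by less than a tolerance, since loops prevent exact convergence in finitely many steps.

```python
# Sketch: compute value bounds L, U on the explored graph by iterating the
# Bellman operator (illustrative; not the paper's exact procedure). External
# (unexpanded) nodes keep the trivial bounds L = 0 and U = 1 / (1 - gamma).

def bellman_backup(values, node, gamma, default):
    """One application of B at a node; external nodes keep their default bound."""
    if not node.expanded:
        return default
    return max(reward + gamma * values[next_state]
               for reward, next_state in node.edges.values())

def compute_bounds(graph, gamma=0.9, tol=1e-6, max_iters=10_000):
    L = {s: 0.0 for s in graph}
    U = {s: 1.0 / (1.0 - gamma) for s in graph}
    for _ in range(max_iters):
        new_L = {s: bellman_backup(L, node, gamma, 0.0)
                 for s, node in graph.items()}
        new_U = {s: bellman_backup(U, node, gamma, 1.0 / (1.0 - gamma))
                 for s, node in graph.items()}
        delta = max(abs(new_L[s] - L[s]) + abs(new_U[s] - U[s]) for s in graph)
        L, U = new_L, new_U
        if delta < tol:  # with loops, the fixed point is only reached in the limit
            break
    return L, U
```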
Analysis

Is GBOP-D more efficient than OPD?
Performance: simple regret $r_n = V^\star - V(a_n)$

Theorem (Sample complexity of OPD, Hren and Munos, 2008)
$r_n = O\!\left(n^{-\log(1/\gamma)/\log \kappa}\right)$,
where $\kappa$ is a problem-dependent difficulty measure.

Theorem (Sample complexity of GBOP-D)
$r_n = O\!\left(n^{-\log(1/\gamma)/\log \kappa_\infty}\right)$,
where $\kappa_\infty$ is a tighter problem-dependent difficulty measure: $\kappa_\infty \le \kappa$.
Comparing difficulty measures

• $\kappa_\infty = \kappa$ if the MDP has a tree structure
• $\kappa_\infty < \kappa$ when trajectories intersect a lot
  > actions cancel each other out (e.g. moving left or right)
  > actions are commutative (e.g. placing pawns on a board)

Illustrative example: 3 states, $K > 2$ actions
[Figure: 3-state MDP with rewards $r^\star$, $r^+$, $r^-$ and 0]
$\kappa_\infty = 1 < \kappa = K - 1$
Experiment: sparse gridworld
Rewards in a ball around (10, 10) of radius 5, with quadratic decay
[Figure: visit counts (log colour scale), OPD (left) vs GBOP-D (right), n = 5640 samples]
Extension to stochastic MDPs

• Use state similarity to tighten the bounds $L \le V \le U$.
• We adapt MDP-GapE (Jonsson et al., 2020) to obtain GBOP.
Experiment: stochastic gridworld
Noisy transitions with probability p = 10%
[Figure: visit counts (log colour scale), UCT (left) vs GBOP (right), n = 5640 samples]
Exploration-Exploitation score
$S = \sum_{t=1}^{n} \underbrace{d(s_t, s_0)}_{\text{Exploration}} - \underbrace{d(s_t, s_g)}_{\text{Exploitation}}$

[Figure: exploration-exploitation scores of OPD, GBOP-D, MDP-GapE, BRUE, UCT, KL-OLOP and GBOP, under deterministic and stochastic dynamics, n = 5640 samples]
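For concreteness, a small sketch of how this score could be computed from a visited trajectory; the Euclidean distance and the gridworld coordinates are assumptions chosen for illustration.

```python
# Sketch of the exploration-exploitation score for a gridworld trajectory
# (illustrative; assumes Euclidean distance between (x, y) cells).

def score(trajectory, start, goal):
    """S = sum_t d(s_t, start) - d(s_t, goal): large when the agent both moves
    away from its starting point and spends time close to the goal region."""
    def d(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    return sum(d(s, start) - d(s, goal) for s in trajectory)

# Example: walk straight to the goal at (10, 10), then stay there.
trajectory = [(min(t, 10), min(t, 10)) for t in range(21)]
print(score(trajectory, start=(0, 0), goal=(10, 10)))  # positive score
```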
Sailing Domain (Vanderbei, 1996)

[Figure: simple regret $r_n$ vs budget $n$ (log-log scale) for BRUE, GBOP, KL-OLOP, MDP-GapE and UCT]

Effective branching factor $\kappa_e$:
• $\kappa_e \approx 3.6$ for BRUE, KL-OLOP, MDP-GapE, UCT
• $\kappa_e \approx 1.2$ for GBOP, which suggests our results may still hold
Thank You!