
A Bayesian Sampling Approach to Exploration in Reinforcement Learning

John Asmuth†  Lihong Li†  Michael L. Littman†
†Department of Computer Science
Rutgers University, Piscataway, NJ 08854

Ali Nouri†  David Wingate‡
‡Computational Cognitive Science Group
Massachusetts Institute of Technology, Cambridge, MA 02143

Abstract

We present a modular approach to reinforcement learning that uses a Bayesian representation of the uncertainty over models. The approach, BOSS (Best of Sampled Set), drives exploration by sampling multiple models from the posterior and selecting actions optimistically. It extends previous work by providing a rule for deciding when to resample and how to combine the models. We show that our algorithm achieves near-optimal reward with high probability with a sample complexity that is low relative to the speed at which the posterior distribution converges during learning. We demonstrate that BOSS performs quite favorably compared to state-of-the-art reinforcement-learning approaches and illustrate its flexibility by pairing it with a non-parametric model that generalizes across states.

1 INTRODUCTION

The exploration-exploitation dilemma is a defining problem in the field of reinforcement learning (RL). To behave in a way that attains high reward, an agent must acquire experience that reveals the structure of its environment, reducing its uncertainty about the dynamics. A broad spectrum of exploration approaches has been studied, which can be coarsely classified as belief-lookahead, myopic, and undirected approaches.

Belief-lookahead approaches are desirable because they make optimal decisions in the face of their uncertainty. However, they are generally intractable, forcing algorithm designers to create approximations that sacrifice optimality. A state-of-the-art belief-lookahead approach is BEETLE (Poupart et al., 2006), which plans in the continuous belief space defined by the agent's uncertainty.

Myopic (Wang et al., 2005) approaches make decisions to reduce uncertainty, but they do not explicitly consider how this reduced uncertainty will impact future reward. While myopic approaches can lay no claim to optimality in general, some include guarantees on their total regret or on the number of suboptimal decisions made during learning. An example of such an algorithm is RMAX (Brafman & Tennenholtz, 2002), which distinguishes "known" and "unknown" states based on how often they have been visited. It explores by acting to maximize reward under the assumption that unknown states deliver maximum reward.

Undirected (Thrun, 1992) approaches take exploratory actions, but without regard to what parts of their environment models remain uncertain. Classic approaches such as ε-greedy and Boltzmann exploration that choose random actions occasionally fall into this category. The guarantees possible for this class of algorithms are generally weaker—convergence to optimal behavior in the limit, for example. A sophisticated approach that falls into this category is Bayesian DP (Strens, 2000). It maintains a Bayesian posterior over models and periodically draws a sample from this distribution. It then acts optimally with respect to this sampled model.

The algorithm proposed in this paper (Section 2) is a myopic Bayesian approach that maintains its uncertainty in the form of a posterior over models. As new information becomes available, it draws a set of samples from this posterior and acts optimistically with respect to this collection—the best of sampled set (or BOSS). We show that, with high probability, it takes near-optimal actions on all but a small number of trials (Section 3). We have found that its behavior is quite promising, exploring better than undirected approaches and scaling better than belief-lookahead approaches (Section 4). We also demonstrate its compatibility with sophisticated Bayesian models, resulting in an approach that can generalize experience between states (Section 5).


Note that our analysis assumes a black box algorithm that can sample from a posterior in the appropriate model class. Although great strides have been made recently in representing and sampling from Bayesian posteriors, it remains a challenging and often intractable problem. The correctness of our algorithm also requires that the prior it uses is an accurate description of the space of models—as if the environment is chosen from the algorithm's prior. Some assumption of this form is necessary for a Bayesian approach to show any benefits over an algorithm that makes a worst-case assumption.

2 BOSS: BEST OF SAMPLED SET

The idea of sampling from the posterior for decision making has been around for decades (Thompson, 1933). Several recent algorithms have used this technique for Bayesian RL (Strens, 2000; Wilson et al., 2007). In this context, Bayesian posteriors are maintained over the space of Markov decision processes (MDPs), and sampling the posterior requires drawing a complete MDP from this distribution.

Any sampling approach must address a few key questions: 1) When to sample, 2) How many models to sample, and 3) How to combine models. A natural approach to the first question is to resample after every T timesteps, for some fixed T. There are challenges to selecting the right value of T, however. Small T can lead to "thrashing" behavior in which the agent rapidly switches exploration plans and ends up making little progress. Large T can lead to slow learning, as new information in the posterior is not exploited between samples. Strens (2000) advocates a T approximating the depth of exploratory planning required. He suggests several ways to address the third question, leaving their investigation for future work.

BOSS provides a novel answer to these questions. It samples multiple models (K) from the posterior whenever the number of transitions from a state–action pair reaches a pre-defined threshold (B). It then combines the results into an optimistic MDP for decision making—a process we call merging. Analogously to RMAX, once a state–action pair has been observed B times, we call it known.

In what follows, we use S to refer to the size of the state space, A the size of the action space, and γ the discount factor. All sampled MDPs share these quantities, but differ in their transition functions. For simplicity, we assume the reward function is known in advance; otherwise, it can be encoded in the transitions.

Given K sampled models from the posterior, m1, m2, · · · , mK, merging is the process of creating a new MDP, m#, with the same state space but an augmented action space of KA actions. Each action ai,j in m#, for i ∈ {1, · · · , K}, j ∈ {1, · · · , A}, corresponds to the jth action in mi. Transition and reward functions are formed straightforwardly—the transition function for ai,j is copied from the one for aj in mi, for example. Finally, for any state s, if a policy in m# is to take an action ai,j, then the actual action taken in the original MDP is aj. A complete description of BOSS is given in Algorithm 1.
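As a concrete illustration of the merging construction (our sketch, not the authors' code), the following Python fragment assumes each sampled model is stored as a pair of NumPy arrays T[a, s, s'] and R[a, s]:

```python
import numpy as np

def merge_models(sampled_models):
    """Merge K sampled tabular MDPs into one MDP with K*A actions.

    Each sampled model is a pair (T, R) with T[a, s, s'] the transition
    probability and R[a, s] the reward. Merged action index i*A + j
    corresponds to action j of sampled model i.
    """
    K = len(sampled_models)
    A, S, _ = sampled_models[0][0].shape
    T_merged = np.zeros((K * A, S, S))
    R_merged = np.zeros((K * A, S))
    for i, (T, R) in enumerate(sampled_models):
        T_merged[i * A:(i + 1) * A] = T   # copy model i's transitions
        R_merged[i * A:(i + 1) * A] = R   # copy model i's rewards
    return T_merged, R_merged

def original_action(merged_action, A):
    """Map a merged action a_{i,j} back to the action j executed in the environment."""
    return merged_action % A
```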

Algorithm 1 BOSS Algorithm
 0: Inputs: K, B
 1: Initialize the current state s1.
 2: do_sample ← TRUE
 3: q_{s,a} ← 0, ∀ s, a
 4: for all timesteps t = 1, 2, 3, . . . do
 5:   if do_sample then
 6:     Sample K models m1, m2, · · · , mK from the posterior (initially, the prior) distribution.
 7:     Merge the models into the mixed MDP m#.
 8:     Solve m# to obtain π_{m#}.
 9:     do_sample ← FALSE
10:   end if
11:   Use π_{m#} for action selection: at ← π_{m#}(st), and observe reward rt and next state st+1.
12:   q_{st,at} ← q_{st,at} + 1
13:   Update the posterior distribution based on the transition (st, at, rt, st+1).
14:   if q_{st,at} = B then
15:     do_sample ← TRUE
16:   end if
17: end for

BOSS solves no more than SA merged MDPs, requiring polynomial time for planning. It draws a maximum of KSA samples. Thus, in distributions in which sampling can be done efficiently, the overall computational demands are relatively low.
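A minimal Python sketch of the loop in Algorithm 1 is shown below. It reuses merge_models from the earlier sketch and assumes hypothetical helpers not specified in the paper: a posterior object with sample/update methods, solve_mdp (value iteration over the merged action space), and an env with reset/step:

```python
import numpy as np

def boss(env, posterior, K, B, S, A, gamma=0.95, horizon=10000):
    """Sketch of BOSS (Algorithm 1): resample K models whenever a
    state-action pair's visit count first reaches the knownness threshold B."""
    q = np.zeros((S, A), dtype=int)     # visit counts q_{s,a}
    do_sample = True
    s = env.reset()
    for t in range(horizon):
        if do_sample:
            models = [posterior.sample() for _ in range(K)]   # K posterior draws
            T_m, R_m = merge_models(models)                   # merged MDP m#
            policy = solve_mdp(T_m, R_m, gamma)               # pi_{m#}: state -> merged action
            do_sample = False
        a_merged = policy[s]
        a = a_merged % A                                       # execute the underlying action
        s_next, r = env.step(a)
        q[s, a] += 1
        posterior.update(s, a, r, s_next)
        if q[s, a] == B:                                       # (s, a) just became known
            do_sample = True
        s = s_next
```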

3 ANALYSIS

This section provides a formal analysis of BOSS's efficiency of exploration. We view the algorithm as a non-stationary policy, for which a value function can be defined. As such, the value of state s, when visited by algorithm A at time t, denoted by V^{A_t}(s_t), is the expected discounted sum of future rewards the algorithm will collect after visiting s at time t. Our goal is to show that, when parameters K and B are chosen appropriately, with high probability, V^{A_t}(s_t) will be ε-close to optimal except for a polynomial number of steps (Theorem 3.1). Our objective, and some of our techniques, closely follow work in the PAC-MDP framework (Kakade, 2003; Strehl et al., 2006).


3.1 A GENERAL SAMPLE COMPLEXITY BOUND FOR BOSS

Let m∗ be the true MDP. When possible, we denote quantities related to this MDP, such as V∗_{m∗}, by their shorthand versions, V∗. By assumption, the true MDP m∗ is drawn from the prior distribution, and so after observing a sequence of transitions, m∗ may be viewed as being drawn from the posterior distribution.

Lemma 3.1 Let s0 be a fixed state, p′ the posterior distribution over MDPs, and δ1 ∈ (0, 1). If the sample size K = Θ((1/δ1) ln(1/δ1)), then with probability at least 1 − δ1, a model among these K models is optimistic compared to m∗ in s0: max_i V∗_{mi}(s0) ≥ V∗(s0).

Proof (sketch). For any fixed, true model m∗, define P as the probability of sampling an optimistic model according to p′:

P = \sum_{m \in M} p'(m) \, I\left( V^{\pi_m}_{m}(s_0) \ge V^{\pi_{m^*}}_{m^*}(s_0) \right),

where I(·) is the set-indicator function and M is the set of MDPs. We consider two mutually exclusive cases. In the first case, where P ≥ δ1/2, the probability that none of the K sampled models is optimistic is (1 − P)^K, which is at most (1 − δ1/2)^K. Let this failure probability (1 − δ1/2)^K be δ1/2 and solve for K to get

K = \frac{\log(\delta_1/2)}{\log(1 - \delta_1/2)} = \Theta\!\left( \frac{1}{\delta_1} \log \frac{1}{\delta_1} \right).

The other case, where P < δ1/2, happens with small probability, since the chance of drawing any model, including m∗, from that part of the posterior is at most δ1/2. Combining these two cases, the probability that no optimistic model is included in the K samples is at most δ1/2 + δ1/2 = δ1. □
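As a quick numerical illustration of this bound (ours, not from the paper), take δ1 = 0.1 in the first case:

K = \frac{\ln(0.05)}{\ln(0.95)} \approx \frac{-3.00}{-0.0513} \approx 58.4,

so K = 59 posterior samples suffice to include an optimistic model with probability at least 1 − δ1 = 0.9.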

Lemma 3.2 The sample size K = Θ((S²A/δ) ln(SA/δ)) suffices to guarantee V∗_{m#}(s) ≥ V∗(s) for all s during the entire learning process with probability at least 1 − δ.

Proof (sketch). For each model-sampling step, the construction of m# implies V∗_{m#}(s) ≥ V∗_{mi}(s). By a union bound over all states and Lemma 3.1, we have V∗_{m#}(s) ≥ V∗(s) for all s with probability at least 1 − Sδ1. During the entire learning process, there are at most SA model-sampling steps. Applying a union bound again to these steps, we know V∗_{m#}(s) ≥ V∗(s) for all s in every K-sample set with probability at least 1 − S²Aδ1. Letting δ = S²Aδ1 completes the proof. □

To simplify analysis, we assume that samples in a state–action pair do not affect the posterior of transition probabilities in other state–action pairs. However, the result should hold more generally with respect to the posterior induced by the experience in the other states. Define the Bayesian concentration sample complexity, f(s, a, ε, δ, ρ), as the minimum number c such that, if c IID transitions from (s, a) are observed, then with probability 1 − δ the following holds true: an ε-ball (measured by ℓ1-distance) centered at the true model m∗ has at least 1 − ρ probability mass in the posterior distribution. Formally, with probability at least 1 − δ,

\Pr_{m \sim \text{posterior}} \left( \| T_m(s,a) - T_{m^*}(s,a) \|_1 < \epsilon \right) \ge 1 - \rho.

We call ρ the diffusion parameter.

Lemma 3.3 If the knownness parameter B = max_{s,a} f(s, a, ε, δ/(SA), ρ/(S²A²K)), then the transition functions of all the sampled models are ε-close (in the ℓ1 sense) to the true transition function for all the known state–action pairs during the entire learning process with probability at least 1 − δ − ρ.

Proof (sketch). The proof consists of several applications of the union bound. The first is applied to all state–action pairs, implying the posterior concentrates around the true model for all state–action pairs with diffusion ρ′ = ρ/(S²A²K) with probability at least 1 − δ.

Now, suppose the posterior concentrates around m∗ with diffusion ρ′. For any known (s, a), the probability that a sampled MDP's transition function in (s, a) is ε-accurate is at least 1 − ρ′, according to the definition of f. By the union bound, the sampled MDP's transition function is ε-accurate in all known state–action pairs with probability at least 1 − SAρ′. A union bound is applied a second time to the K sampled models, implying all K sampled MDPs' transition functions are ε-accurate in all known state–action pairs with probability at least 1 − SAKρ′. Finally, using a union bound a third time to all model-sampling steps in BOSS, we know that all sampled models have ε-accurate transitions in all known (s, a) with probability at least 1 − S²A²Kρ′ = 1 − ρ. Combining this result with the δ failure probability in the previous paragraph completes the proof. □

Theorem 3.1 When the knownness parameter B = max_{s,a} f(s, a, ε(1 − γ)², δ/(SA), δ/(S²A²K)), then with probability at least 1 − 4δ, V^{A_t}(s_t) ≥ V∗(s_t) − 4ε in all but

\zeta(\epsilon, \delta) = O\!\left( \frac{SAB}{\epsilon(1-\gamma)^2} \ln\frac{1}{\delta} \ln\frac{1}{\epsilon(1-\gamma)} \right)

steps.

Proof (sketch). The proof relies on a general PAC-MDP theorem by Strehl et al. (2006), by verifying that their three required conditions hold. First, the value function is optimistic, as guaranteed by Lemma 3.2. Second, the accuracy condition is satisfied since the ℓ1-error in the transition probabilities, ε(1 − γ)², translates into an ε error bound in the value function (Kearns & Singh, 2002). Lastly, the agent visits an unknown state–action pair at most SAB times, satisfying the learning complexity condition. The probability that any of the three conditions fails is, due to a union bound, at most 3δ: the first δ comes from Lemma 3.2, and the other two from Lemma 3.3. □

3.2 THE BAYESIAN CONCENTRATION SAMPLE COMPLEXITY

Theorem 3.1 depends on the Bayesian concentration sample complexity f. A full analysis of f is beyond the scope of this paper. In general, f depends on certain properties of the model space as well as the prior distribution. While it is likely that a more accurate estimate of f can be obtained in special cases, we make use of a fairly general result by Zhang (2006) to relate our sample complexity of exploration in Theorem 3.1 to certain characteristics of the Bayesian prior. Future work can instantiate this general result to special MDP classes and prior distributions.

We will need two key quantities introduced by Zhang (2006; Section 5.2). The first is the critical prior-mass radius, ε_{p,n}, which characterizes how dense the prior distribution p is around the true model (smaller values imply denser priors). The second is the critical upper-bracketing radius with coefficient 2/3, denoted ε_{upper,n}, whose decay rate (as n becomes large) controls the consistency of the Bayesian posterior distribution. When ε_{upper,n} = o(1), the posterior is consistent. Now, define ε_n = 4ε_{p,n} + (3/2)ε_{upper,n}. The next lemma states that as long as ε_n decreases sufficiently fast as n → ∞, we may upper bound the Bayesian concentration sample complexity.

Lemma 3.4 If there exists a constant c > 0 such that ε_n = O(n^{−c}), then f(s, a, ε, δ, ρ) = max{ O(ε^{−2/c} δ^{−1/c}), O(ε^{−2} δ^{−1} ln(1/ρ)) }.

Proof (sketch). We set ρ = 1/2 and γ = 2 as used in Corollary 5.2 of Zhang (2006) to solve for n. Zhang's corollary is stated using the Rényi entropy D^{RE}_{1/2} as the distance metric between distributions. But the same bound applies straightforwardly to ℓ1-distance because D^{RE}_{1/2}(q ‖ p) ≥ ‖p − q‖²_1 / 2. □

We may further simplify the result in Lemma 3.4 by assuming without loss of generality that c ≤ 1, resulting in a potentially looser bound of f(s, a, ε, δ, ρ) = O(ε^{−2/c} δ^{−1/c} ln(1/ρ)). A direct consequence of this simplified result, when combined with Theorem 3.1, is that BOSS behaves ε-optimally with probability at least 1 − δ in all but at most

\tilde{O}\!\left( \frac{S^{1 + \frac{1}{c}} A^{1 + \frac{1}{c}}}{\epsilon^{1 + \frac{2}{c}} \, \delta^{\frac{1}{c}} \, (1-\gamma)^{2 + \frac{4}{c}}} \right)

steps, where Õ(·) suppresses logarithmic dependence. This result formalizes the intuition that, if the problem-specific quantity ε_n decreases sufficiently fast, BOSS enjoys polynomial sample complexity of exploration with high probability.

When an uninformative Dirichlet prior is used, it can be shown that f is polynomial in all relevant quantities, and thus Theorem 3.1 provides a performance guarantee similar to the PAC-MDP result for RMAX (Kakade, 2003).

4 EXPERIMENTS

This section presents computational experiments with BOSS, evaluating its performance on a simple domain from the literature to allow for a comparison with other published approaches.

Consider the well-studied 5-state chain problem (Chain) (Strens, 2000; Poupart et al., 2006). The agent has two actions: Action 1 advances the agent along the chain, and Action 2 resets the agent to the first node. Action 1, when taken from the last node, leaves the agent where it is and gives a reward of 10—all other rewards are 0. Action 2 always has a reward of 2. With probability 0.2, however, the outcomes of the two actions are switched. Optimal behavior is to always choose Action 1 to reach the high reward at the end of the chain.

The slip probability 0.2 is the same for all state–action pairs. Poupart et al. (2006) consider the impact of encoding this constraint as a strong prior on the transition dynamics. That is, whereas under the Full prior the agent assumes each state–action pair corresponds to an independent multinomial distribution over next states, under the Tied prior the agent knows the underlying transition dynamics except for the value of a single slip probability that is shared between all state–action pairs. They also introduce a Semi prior in which the two actions have independent slip probabilities. Posteriors for Full can be maintained using a Dirichlet distribution (the conjugate of the multinomial), and Tied/Semi posteriors can be represented with a simple Beta distribution.
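To make these prior structures concrete, here is a small sketch (ours; variable names are illustrative) of the posterior-sampling step BOSS would use under the Tied and Full priors for Chain:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tied prior: one shared slip probability, Beta-Bernoulli conjugacy.
# slips / no_slips are counts of slipped vs. intended outcomes pooled over all (s, a).
def sample_tied_slip(slips, no_slips, a0=1.0, b0=1.0):
    return rng.beta(a0 + slips, b0 + no_slips)   # one draw of the shared slip probability

# Full prior: an independent Dirichlet over next states for each (s, a).
# counts[s, a, s'] are observed transition counts; alpha0 is the Dirichlet pseudo-count.
def sample_full_transitions(counts, alpha0=1.0):
    S, A, _ = counts.shape
    T = np.zeros_like(counts, dtype=float)
    for s in range(S):
        for a in range(A):
            T[s, a] = rng.dirichlet(alpha0 + counts[s, a])   # sampled row of T(s, a, .)
    return T
```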

In keeping with published results on this problem, Table 1 reports cumulative rewards in the first 1000 steps, averaged over 500 runs. Standard error is on the order of 20 to 50. The optimal policy for this problem scores 3677. The exploit algorithm is one that always acts optimally with respect to the average model weighted by the posterior.


Table 1: Cumulative reward in Chain

              Tied    Semi    Full
BEETLE        3650    3648    1754
exploit       3642    3257    3078
BOSS          3657    3651    3003
RAM-RMAX      3404    3383    2810

RAM-RMAX (Leffler et al., 2007) is a version of RMAX that can exploit the tied parameters of tasks like this one. Results for BEETLE and exploit are due to Poupart et al. (2006). All runs used a discount factor of γ = 0.95, and BOSS used B = 10 and K = 5.

All algorithms perform very well in the Tied scenario (although RAM-RMAX is a bit slower as it needs to estimate the slip probability very accurately to avoid finding a suboptimal policy). Poupart et al. (2006) point out that BEETLE (a belief-lookahead approach) is more effective than exploit (an undirected approach) in the Semi scenario, which requires more careful exploration to perform well. In Full, however, BEETLE falls behind because the larger parameter space makes it difficult for it to complete its belief-lookahead analysis.

BOSS, on the other hand, explores as effectively as BEETLE in Semi, but is also effective in Full. A similarly positive result (3158) in Full is obtained by Bayesian DP (Strens, 2000).

5 BAYESIAN MODELING OF STATE CLUSTERS

The idea of state clusters is implicit in the Tied prior. We say that two states are in the same cluster if their probability distributions over relative outcomes are the same given any action. In Chain, for example, the outcomes are advancing along the chain or resetting to the beginning. Both actions produce the same distribution on these two outcomes independent of state (Action 1 is 0.8/0.2 and Action 2 is 0.2/0.8), so Chain can be viewed as a one-cluster environment.

We introduce a variant of the chain example, the two-cluster Chain2, which includes an additional state cluster. Cluster 1—states 1, 3, and 5—behaves identically to the cluster in Chain. Cluster 2—states 2 and 4—has roughly the reverse distributions (Action 1 is 0.3/0.7, Action 2 is 0.7/0.3).

RAM-RMAX can take advantage of cluster structure, but only if it is known in advance. In this section, we show how BOSS with an appropriate prior can learn an unknown cluster structure and exploit it to speed up learning.

5.1 A NON-PARAMETRIC MODEL OF STATE CLUSTERING

We derive a non-parametric cluster model that can simultaneously use observed transition outcomes to discover which parameters to tie and estimate their values. We first assume that the observed outcomes for each state in a cluster c are generated independently, but from a shared multinomial parameter vector θ_c. We then place a Dirichlet prior over each θ_c and integrate them out. This process has the effect of coupling all of the states in a particular cluster together, implying that we can use all observed outcomes of states in a cluster to improve our estimates of the associated transition probabilities.

The generative model is

\kappa \sim \mathrm{CRP}(\alpha)
\theta_{\kappa(s)} \sim \mathrm{Dirichlet}(\eta)
o_{s,a} \sim \mathrm{Multinomial}(\theta_{\kappa(s)})

where κ is a clustering of states (κ(s) is the id of s's cluster), θ_{κ(s)} is a multinomial over outcomes associated with each cluster, and o_{s,a} is the observed outcome counts for state s and action a. Here, CRP is a Chinese Restaurant Process (Aldous, 1985), a flexible distribution that allows us to infer both the number of clusters and the assignment of states to clusters. The parameters of the model are α ∈ R, the concentration parameter of the CRP, and η ∈ N^N, a vector of N pseudo-counts parameterizing the Dirichlet.
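A forward-sampling sketch of this generative model (ours; the paper gives no code, and counts_per_pair, the number of observed transitions per state–action pair, is an illustrative parameter) using a sequential CRP draw:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_crp(num_states, alpha):
    """Sequentially assign states to clusters under a CRP(alpha) prior."""
    assignments, sizes = [], []
    for s in range(num_states):
        probs = np.array(sizes + [alpha], dtype=float)
        probs /= probs.sum()
        c = rng.choice(len(probs), p=probs)   # existing cluster or a new one
        if c == len(sizes):
            sizes.append(0)
        sizes[c] += 1
        assignments.append(c)
    return assignments

def sample_outcomes(assignments, eta, counts_per_pair, num_actions):
    """Draw per-cluster outcome distributions theta_c, then outcome counts o_{s,a}."""
    num_clusters = max(assignments) + 1
    theta = [rng.dirichlet(eta) for _ in range(num_clusters)]      # theta_c ~ Dirichlet(eta)
    o = {}
    for s, c in enumerate(assignments):
        for a in range(num_actions):
            o[(s, a)] = rng.multinomial(counts_per_pair, theta[c])  # o_{s,a} ~ Multinomial
    return theta, o
```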

The posterior distribution over clusters κ and multinomial vectors θ given our observations o_{s,a} (represented as "data" below) is

p(\kappa, \theta \mid \text{data}) \propto p(\text{data} \mid \theta)\, p(\theta \mid \eta)\, p(\kappa \mid \alpha)
= \prod_{s,a} p(o_{s,a} \mid \theta_{\kappa(s)})\, p(\theta_{\kappa(s)} \mid \eta)\, p(\kappa \mid \alpha)
= \prod_{c \in \kappa} \prod_{s \in c} \prod_{a \in A} p(o_{s,a} \mid \theta_c)\, p(\theta_c \mid \eta)\, p(\kappa \mid \alpha)

where c is the set of all states in a particular cluster. We now integrate out the multinomial parameter vector θ_c in closed form, resulting in a standard Dirichlet compound multinomial distribution (or multivariate Polya distribution):

p(\text{data} \mid \kappa) = \prod_{c \in \kappa} \int_{\theta_c} \prod_{s \in c} \prod_{a \in A} p(o_{s,a} \mid \theta_c)\, p(\theta_c \mid \eta)
= \prod_{c \in \kappa,\, a \in A} \frac{\Gamma\!\left(\sum_i \eta_i\right)}{\prod_i \Gamma(\eta_i)} \cdot \frac{\prod_s \Gamma\!\left(\sum_i o^{s,a}_i + 1\right)}{\prod_{i,s} \Gamma\!\left(o^{s,a}_i + 1\right)} \cdot \frac{\prod_i \Gamma\!\left(\sum_s o^{s,a}_i + \eta_i\right)}{\Gamma\!\left(\sum_{i,s} o^{s,a}_i + \sum_i \eta_i\right)}.   (1)

Because the θ_c parameters have been integrated out of the model, the posterior distribution over models is simply a distribution over κ. We can also sample transition probabilities for each state by examining the posterior predictive distribution of θ_c.
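Eq. 1 is most conveniently evaluated in log space. A sketch (ours; it assumes outcome counts stored as a NumPy array o[s, a, i] and a cluster assignment list kappa):

```python
import numpy as np
from scipy.special import gammaln

def log_dcm_likelihood(o, kappa, eta):
    """log p(data | kappa): log of Eq. 1, summed over clusters and actions.

    o[s, a, i]  -- observed count of outcome i for state s, action a
    kappa[s]    -- cluster id of state s
    eta         -- Dirichlet pseudo-count vector (length = number of outcomes)
    """
    S, A, N = o.shape
    total = 0.0
    for c in set(kappa):
        members = [s for s in range(S) if kappa[s] == c]
        for a in range(A):
            oc = o[members, a, :]                       # counts for this cluster and action
            total += gammaln(eta.sum()) - gammaln(eta).sum()
            total += gammaln(oc.sum(axis=1) + 1).sum() - gammaln(oc + 1).sum()
            total += gammaln(oc.sum(axis=0) + eta).sum() - gammaln(oc.sum() + eta.sum())
    return total
```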

To sample models from the posterior, we sample cluster assignments and transition probabilities in two stages, using repeated sweeps of Gibbs sampling. For each state s, we fix the cluster assignments of all other states and sample over the possible assignments of s (including a new cluster):

p(\kappa(s) \mid \kappa_{-s}, \text{data}) \propto p(\text{data} \mid \kappa)\, p(\kappa)

where κ(s) is the cluster assignment of state s and κ_{−s} is the cluster assignments of all other states. Here, p(data | κ) is given by Eq. 1 and p(κ) is the CRP prior

p(\kappa \mid \alpha) = \alpha^r \frac{\Gamma(\alpha)}{\Gamma\!\left(\alpha + \sum_i \kappa_i\right)} \prod_{i=1}^{r} \Gamma(\kappa_i)

with r the total number of clusters and κ_i the number of states in each cluster.

Given κ, we sample transition probabilities for each action from the posterior predictive distribution over θ_c, which, due to conjugacy, is a Dirichlet distribution:

\theta_c \mid \kappa, \eta, \alpha, a \sim \mathrm{Dirichlet}\!\left(\eta + \sum_{s \in c} o_{s,a}\right).
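Under these equations, a single Gibbs sweep might look as follows (our sketch; log_dcm_likelihood is the function sketched above, kappa is a Python list of cluster ids, and the unnormalized conditional is computed literally from p(data | κ) p(κ)):

```python
import numpy as np
from scipy.special import gammaln

def log_crp_prior(kappa, alpha):
    """log p(kappa | alpha) for the CRP prior, from the cluster sizes."""
    sizes = np.bincount(kappa)
    sizes = sizes[sizes > 0]
    r = len(sizes)
    return (r * np.log(alpha) + gammaln(alpha) - gammaln(alpha + sizes.sum())
            + gammaln(sizes).sum())

def gibbs_sweep(o, kappa, eta, alpha, rng):
    """Resample each state's cluster assignment given all the others."""
    S = o.shape[0]
    for s in range(S):
        candidates = sorted(set(kappa[:s] + kappa[s+1:])) + [max(kappa) + 1]  # existing + new
        log_scores = []
        for c in candidates:
            proposal = list(kappa)
            proposal[s] = c
            log_scores.append(log_dcm_likelihood(o, proposal, eta)
                              + log_crp_prior(np.array(proposal), alpha))
        log_scores = np.array(log_scores)
        probs = np.exp(log_scores - log_scores.max())
        probs /= probs.sum()
        kappa[s] = candidates[rng.choice(len(candidates), p=probs)]
    return kappa

def sample_transitions(o, kappa, eta, rng):
    """Posterior predictive: theta_c | kappa ~ Dirichlet(eta + sum_{s in c} o_{s,a})."""
    S, A, N = o.shape
    theta = np.zeros((S, A, N))
    for s in range(S):
        members = [t for t in range(S) if kappa[t] == kappa[s]]
        for a in range(A):
            theta[s, a] = rng.dirichlet(eta + o[members, a, :].sum(axis=0))
    return theta
```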

5.2 BOSS WITH CLUSTERING PRIOR

We ran BOSS in a factorial design where we varied the environment (Chain vs. Chain2) and the prior (Tied, Full, vs. Cluster, where Cluster is the model described in the previous subsection). For our experiments, BOSS used a discount factor of γ = 0.95, knownness parameter B = 10, and a sample size of K = 5. The Cluster CRP used α = 0.5 and, whenever a sample was required, the Gibbs sampler ran for a burn period of 500 sweeps with 50 sweeps between each sample.

Figure 1 displays the results of running BOSS with different priors in Chain and Chain2. The top line on the graph corresponds to the results for Chain. Moving from left to right, BOSS is run with weaker priors—Tied, Cluster, and Full. Not surprisingly, performance decreases with weaker priors. Interestingly, however, Cluster is not significantly worse than Tied—it is able to identify the single cluster and learn it quickly.

The second line on the plot is the results for Chain2, which has two clusters. Here, Tied's assumption of the existence of a single cluster is violated and performance suffers as a result. Cluster outperforms Full by a smaller margin here. Learning two independent clusters is still better than learning all states separately, but the gap is narrowing. On a larger example with more sharing, we'd expect the difference to be more dramatic. Nonetheless, the differences here are statistically significant (2 × 3 ANOVA, p < 0.001).

Figure 1: Varying priors and environments in BOSS.

Figure 2: Varying K in BOSS.

5.3 VARYING K

The experiments reported in the previous section used model samples of size K = 5. Our next experiment was intended to show the effect of varying the sample size. Note that Bayesian DP is very similar to BOSS with K = 1, so it is important to quantify the impact of this parameter to understand the relationship between these algorithms.

Figure 2 shows the result of running BOSS on Chain2 using the same parameters as in the previous section. Note that performance generally improves with K. The difference between K = 1 and K = 10 is statistically significant (t-test, p < 0.001).


Figure 3: Diagram of 6x6 Marble Maze.

5.4 6x6 MARBLE MAZE

To demonstrate the exploration behavior of our algorithm, we developed a 6x6 grid-world domain with standard dynamics (Russell & Norvig, 1994). In this environment, the four actions, N, S, E, and W, carry the agent through the maze on its way to the goal. Each action has its intended effect with probability 0.8, and the rest of the time the agent travels in one of the two perpendicular directions with equal likelihood. If there is a wall in the direction the agent tried to go, it will remain where it is. Each step has a cost of 0.001, and terminal rewards of −1 and +1 are received for falling into a pit or reaching the goal, respectively. The map of the domain, along with its optimal policy, is illustrated in Figure 3.
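These slip dynamics can be encoded roughly as follows (our sketch; the wall layout of Figure 3 is not reproduced here, so blocked(s, d) and move(s, d) are hypothetical maze helpers):

```python
# Standard grid-world slip dynamics: intended direction with probability 0.8,
# each perpendicular direction with probability 0.1; bumping a wall keeps the
# agent in place.
PERPENDICULAR = {'N': ('E', 'W'), 'S': ('E', 'W'), 'E': ('N', 'S'), 'W': ('N', 'S')}

def next_state_distribution(s, action, blocked, move):
    dist = {}
    for d, p in [(action, 0.8)] + [(d, 0.1) for d in PERPENDICULAR[action]]:
        s_next = s if blocked(s, d) else move(s, d)
        dist[s_next] = dist.get(s_next, 0.0) + p
    return dist   # maps successor state -> probability
```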

The dynamics of this environment are such that each local pattern of walls (at most 16) can be modeled as a separate cluster. In fact, fewer than 16 clusters appear in the grid, and fewer still are likely to be encountered along an optimal trajectory. Nonetheless, we expected BOSS to find and use a larger set of clusters than in the previous experiments.

For this domain, BOSS used a discount factor of γ = 0.95 and a CRP hyperparameter of α = 10. Whenever an MDP set was needed, the Gibbs sampler ran for a burn period of 100 sweeps with 50 sweeps between each sample. We also ran RMAX in this domain.

The cumulative rewards achieved by the BOSS variants that learned the cluster structure, shown in Figure 4, dominated those of RMAX, which did not know the cluster structure. The primary difference visible in the graph is the time needed to obtain the optimal policy. Remarkably, BOSS with B = 10 and K = 10 latches onto near-optimal behavior nearly instantaneously, whereas the RMAX variants required 50 to 250 trials before behaving as well. This finding can be partially explained by the choice of the clustering prior and the outcomes it drew from, which effectively put a lower bound on the number of steps to the goal from any state. This information made it easy for the agent to ignore longer paths when it had already found something that worked.

Figure 4: Comparison of algorithms on 6x6 Marble Maze.

Looking at the clustering performed by the algorithm, a number of interesting features emerge. Although it does not find a one-to-one mapping from states to patterns of walls, it gets very close. In particular, among the states that are visited often under the optimal policy, and for the actions chosen in those states, the algorithm groups them perfectly. The first, third, fourth, and fifth states in the top row of the grid are all assigned to the same cluster. These are the states in which there is a wall above and none below or to the right, impacting the success probability of N and E, the two actions chosen in these states. The first, second, third, and fifth states in the rightmost column are similarly grouped together. These are the states with a wall to the right, but none below or to the left, impacting the success probability of S and E, the two actions chosen in these states. Other, less commonly visited states are clustered somewhat more haphazardly, as it was not necessary to visit them often to obtain high reward in this grid. The sampled models used around 10 clusters to capture the dynamics.

5.5 COMPUTATIONAL COMPLEXITY

The computation time required by BOSS depends on two distinct factors. First, the time required for per-step planning using value iteration scales with the number of sampled MDPs, K. Second, the time required for sampling new MDPs depends linearly on K and on the type of prior used. For a simple prior, such as Full, samples can be drawn extremely quickly. For a more complex prior, such as Cluster, samples can take longer. In the 6x6 Marble Maze, samples were drawn at a rate of roughly one every ten seconds. It is worth noting that sampling can be carried out in parallel.

6 CONCLUSIONS

We presented a modular approach to exploration called BOSS that interfaces a Bayesian model learner to an algorithm that samples models and constructs exploring behavior that converges quickly to near optimality. We compared the algorithm to several state-of-the-art exploration approaches and showed it was as good as the best known algorithm in each scenario tested. We also derived a non-parametric Bayesian clustering model and showed how BOSS could use it to learn more quickly than could non-generalizing comparison algorithms.

In future work, we plan to analyze the more general setting in which priors are assumed to be only approximate indicators of the real distribution over environments. We are also interested in hierarchical approaches that can learn, in a transfer-like setting, more accurate priors. Highly related work in this direction was presented by Wilson et al. (2007).

An interesting direction for future research is to consider extensions of our clustered state model where the clustering is done in feature space, possibly using non-parametric models such as the Indian Buffet Process (Griffiths & Ghahramani, 2006). Such a model could simultaneously learn how to decompose states into features and also discover which observable features of a state (color, texture, position) are reliable indicators of the dynamics.

We feel that decomposing the details of the Bayesian model from the exploration and decision-making components allows for a very general RL approach. Newly developed languages for specifying Bayesian models (Goodman et al., 2008) could be integrated directly with BOSS to produce a flexible learning toolkit.

Acknowledgements

We thank Josh Tenenbaum, Tong Zhang, and the reviewers. This work was supported by DARPA IPTO FA8750-05-2-0249 and NSF IIS-0713435.

References

Aldous, D. (1985). Exchangeability and related topics. L'École d'été de probabilités de Saint-Flour, XIII-1983 (pp. 1–198).

Brafman, R. I., & Tennenholtz, M. (2002). R-MAX—a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3, 213–231.

Goodman, N. D., Mansinghka, V. K., Roy, D., Bonawitz, K., & Tenenbaum, J. B. (2008). Church: A language for generative models. Uncertainty in Artificial Intelligence.

Griffiths, T. L., & Ghahramani, Z. (2006). Infinite latent feature models and the Indian buffet process. Neural Information Processing Systems (NIPS).

Kakade, S. M. (2003). On the sample complexity of reinforcement learning. Doctoral dissertation, Gatsby Computational Neuroscience Unit, University College London.

Kearns, M. J., & Singh, S. P. (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning, 49, 209–232.

Leffler, B. R., Littman, M. L., & Edmunds, T. (2007). Efficient reinforcement learning with relocatable action models. Proceedings of the Twenty-Second Conference on Artificial Intelligence (AAAI-07).

Poupart, P., Vlassis, N., Hoey, J., & Regan, K. (2006). An analytic solution to discrete Bayesian reinforcement learning. Proceedings of the 23rd International Conference on Machine Learning (pp. 697–704).

Russell, S. J., & Norvig, P. (1994). Artificial intelligence: A modern approach. Englewood Cliffs, NJ: Prentice-Hall.

Strehl, A. L., Li, L., & Littman, M. L. (2006). Incremental model-based learners with formal learning-time guarantees. Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI 2006).

Strens, M. J. A. (2000). A Bayesian framework for reinforcement learning. Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000) (pp. 943–950).

Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25, 285–294.

Thrun, S. B. (1992). The role of exploration in learning control. In D. A. White and D. A. Sofge (Eds.), Handbook of intelligent control: Neural, fuzzy, and adaptive approaches, 527–559. New York, NY: Van Nostrand Reinhold.

Wang, T., Lizotte, D., Bowling, M., & Schuurmans, D. (2005). Bayesian sparse sampling for on-line reward optimization. ICML '05: Proceedings of the 22nd International Conference on Machine Learning (pp. 956–963). New York, NY, USA: ACM.

Wilson, A., Fern, A., Ray, S., & Tadepalli, P. (2007). Multi-task reinforcement learning: A hierarchical Bayesian approach. Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007) (pp. 1015–1022).

Zhang, T. (2006). From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation. The Annals of Statistics, 34, 2180–2210.