A Bayesian Approach to Finding Compact Representations for Reinforcement Learning




Stefanie Tellex

Alborz Geramifard

Jonathan How

David Wingate

Nicholas Roy


Vision: Solving large sequential decision-making problems formulated as MDPs.


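[Figure: Reinforcement learning loop. The agent follows a policy $\pi(s) : S \to A$, sends action $a_t$ to the environment, and receives state $s_t$ and reward $r_t$.]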

2. Background

Reinforcement learning (RL) is a powerful framework for sequential decision making in which an agent interacts with an environment on every time step. The environment is often modeled as a Markov Decision Process (MDP), defined by a tuple $(S, A, P^a_{ss'}, R^a_{ss'}, \gamma)$, where $S$ is the finite set of states, $A$ is the finite set of actions, $P^a_{ss'}$ is the probability of transitioning from state $s$ to state $s'$ when taking action $a$, $R^a_{ss'}$ is the corresponding reward, and $\gamma \in [0, 1]$ is a discount factor weighting the relative significance of immediate rewards versus future rewards.¹ A trajectory of experience is the sequence $s_0, a_0, r_1, s_1, a_1, r_2, \cdots$, where at time $i$ the agent in state $s_i$ took action $a_i$, received reward $r_{i+1}$, and transitioned to state $s_{i+1}$. The behavior of the agent is captured by a policy $\pi : S \times A \to [0, 1]$ giving the probability of taking each action in each state. We limit our attention to deterministic policies mapping each state to one action. The value of a state given policy $\pi$ is defined as the expected cumulative discounted reward obtained starting from $s$ and following $\pi$ thereafter:

$$V^\pi(s) = E_\pi\left[ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_0 = s \right]$$


Similarly, the value of a state-action pair is defined as:

$$Q^\pi(s, a) = E_\pi\left[ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_0 = s, a_0 = a \right]$$
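As a concrete reading of the value definitions above, the discounted return of a single trajectory can be computed as follows. This is a minimal sketch for illustration, not the authors' code; the function name `discounted_return` and the sample rewards are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**(t-1) * r_t for t = 1, 2, ... over one trajectory.

    `rewards` is the list [r_1, r_2, ...] observed after starting in s_0.
    """
    total = 0.0
    discount = 1.0  # gamma**0 multiplies the first reward r_1
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total


# Hypothetical rewards: two small step costs, then a goal reward.
print(discounted_return([-0.001, -0.001, 1.0], gamma=0.9))
```

Averaging this quantity over many trajectories that start in the same state and follow the same policy gives a Monte-Carlo estimate of the state value.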


The objective is to find the optimal policy, defined as:

$$\pi^*(s) = \arg\max_{a} Q^{\pi^*}(s, a).$$

One popular family of online reinforcement learning methods, including SARSA (Rummery and Niranjan, 1994) and Q-Learning (Watkins and Dayan, 1992), tackles the problem by updating the estimated value of a state based on the temporal difference error (TD-error), while acting mostly greedily with respect to the estimated values. The TD-error is defined as

$$\delta_t(Q^\pi) = r_t + \gamma Q^\pi(s_{t+1}, a_{t+1}) - Q^\pi(s_t, a_t).$$
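The TD-error can be computed directly from one observed transition. The sketch below applies a tabular SARSA-style update; the dictionary-based Q-table, the step size `alpha`, and the function names are illustrative assumptions rather than details from the paper.

```python
from collections import defaultdict


def td_error(q, s, a, r, s_next, a_next, gamma):
    """delta = r + gamma * Q(s', a') - Q(s, a) for one observed transition."""
    return r + gamma * q[(s_next, a_next)] - q[(s, a)]


def sarsa_update(q, s, a, r, s_next, a_next, gamma, alpha):
    """Move Q(s, a) a fraction alpha along the TD-error and return the error."""
    delta = td_error(q, s, a, r, s_next, a_next, gamma)
    q[(s, a)] += alpha * delta
    return delta


# Hypothetical two-state example; unseen pairs default to a value of 0.
q = defaultdict(float)
q[("s1", "right")] = 1.0
delta = sarsa_update(q, "s0", "right", r=0.0, s_next="s1", a_next="right",
                     gamma=0.9, alpha=0.5)
print(delta, q[("s0", "right")])
```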

One of the main challenges facing researchers is that most realistic domains have large state spaces and continuous state variables. Function approximators have been used to overcome these obstacles, enabling an agent to generalize its experience in order to act appropriately in states it may never have encountered during training. Linear function approximators, which are the focus of this paper, have been favored due to their theoretical properties and low computational complexity (Sutton, 1996; Tsitsiklis and Van Roy, 1997; Geramifard et al., 2006). With linear function approximation, $Q^\pi(s, a)$ is approximated by $w^\top \phi(s, a)$, where $\phi : S \times A \to \mathbb{R}^n$ is the mapping function and $w$ is the weight vector. For simplicity we call $\phi(s, a)$ the feature vector and each element of the vector a feature.

1. $\gamma = 1$ is only valid for episodic tasks.
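With a linear approximator, evaluating a state-action value and choosing a greedy action both reduce to dot products. A minimal sketch follows; the one-hot feature map `phi`, the three-state toy problem, and the weight values are made up for illustration.

```python
def q_value(w, phi_sa):
    """Linear estimate Q(s, a) = w^T phi(s, a)."""
    return sum(wi * fi for wi, fi in zip(w, phi_sa))


def greedy_action(w, phi, s, actions):
    """Pick the action whose feature vector scores highest under w."""
    return max(actions, key=lambda a: q_value(w, phi(s, a)))


# Hypothetical toy problem: 3 states, 2 actions, one-hot features.
ACTIONS = ["left", "right"]
N_STATES = 3


def phi(s, a):
    """One-hot feature vector with one entry per (action, state) pair."""
    vec = [0.0] * (N_STATES * len(ACTIONS))
    vec[ACTIONS.index(a) * N_STATES + s] = 1.0
    return vec


w = [0.1, 0.0, 0.3, 0.2, 0.5, 0.0]
print(q_value(w, phi(1, "right")))
print(greedy_action(w, phi, 2, ACTIONS))
```

The quality of the greedy policy depends entirely on which features `phi` exposes, which is why the choice of representation matters.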

Finding a suitable mapping function is one of the critical elements in obtaining an adept policy. Early studies on random feature generation methods showed promising directions for expanding the representation using a basic set of features (Sutton and Whitehead, 1993). Representation Policy Iteration (RPI) (Mahadevan, 2005) is another approach for discovering task-independent representations, fusing the theory of smooth functions on a Riemannian manifold with the least-squares method. Another popular line of methods carried the idea of Cascade-Correlation (Fahlman and Lebiere, 1991) over to reinforcement learning using temporal difference learning (Rivest and Precup, 2003), approximate dynamic programming (Girgin and Preux, 2007), and LSPI (Girgin and Preux, 2008). Geramifard et al. (2011) described a method for incrementally adding the feature that maximally reduces the TD-error. However, none of these techniques provides a regularization scheme through which the designer can incorporate prior knowledge over the set of hypotheses.

From the Bayesian cognitive science community, Goodman et al. (2008) used a grammar-based induction scheme to learn human concepts in a supervised learning setting. In their approach, new concepts (features) were derived from a limited set of initial propositions using a generative grammar. This work motivated us to revisit representational expansion in the RL community from a Bayesian perspective.

3. Our Approach

We adopt a Bayesian approach to find well-performing policies. The core idea is to find a representation



[Figure: Linear function approximation, $Q^\pi(s, a) \approx \phi(s, a)^\top \theta$. Diagram labels: state $s$, features $\phi_n$, weights $\theta_n$.]








Figure 2: Representation of primitive and extended features and the possible outcomes of the propose function. [Panels show primitive features numbered 1 to 6 and extended features numbered 7 and 8.]

3.3.2. The Transition Probability Function ($T$)

When the transition probability function is symmetric, it can be dropped from the MH algorithm, as it cancels out in the acceptance ratio. In our setting, however, $T$ is not symmetric. Given that hypothesis $\phi$ has $p$ primitive features, $e$ extended features, and $h$ header features, and that representation $\phi'$ is proposed by taking action $a$ from $\phi$, the transition probability function is defined as:

$$T(\phi' \mid \phi) = \begin{cases} \dfrac{2}{(p+e)(p+e-1)} & a = \text{add} \\[4pt] \dfrac{1}{h} & a = \text{remove} \\[4pt] \dfrac{1}{e} & a = \text{mutate} \end{cases}$$
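Because $T$ is asymmetric, the MH acceptance probability has to include the ratio of reverse to forward proposal probabilities. The sketch below shows one way to write that correction. It assumes the `add` move picks one of the $(p+e)(p+e-1)/2$ unordered feature pairs uniformly, which is an interpretation of the formula above rather than a confirmed detail, and it ignores the probability of selecting the action type itself. All function names and the example numbers are hypothetical.

```python
def transition_prob(action, p, e, h):
    """Proposal probability T(phi' | phi) for a representation with
    p primitive, e extended, and h header features.

    The 'add' case is an assumed reading: one unordered pair of existing
    features is chosen uniformly to be conjoined.
    """
    if action == "add":
        n = p + e
        return 2.0 / (n * (n - 1))
    if action == "remove":
        return 1.0 / h
    if action == "mutate":
        return 1.0 / e
    raise ValueError(f"unknown action: {action}")


def acceptance_prob(post_new, post_old, t_forward, t_reverse):
    """Metropolis-Hastings acceptance probability with an asymmetric proposal.

    post_* are unnormalized posterior values; t_forward = T(phi' | phi) and
    t_reverse = T(phi | phi').
    """
    ratio = (post_new * t_reverse) / (post_old * t_forward)
    return min(1.0, ratio)


# Hypothetical numbers: add a feature to a representation with p=4, e=0,
# then undo it by removing the single header feature (h=1 afterwards).
t_fwd = transition_prob("add", p=4, e=0, h=0)
t_rev = transition_prob("remove", p=4, e=1, h=1)
print(t_fwd, t_rev, acceptance_prob(0.02, 0.05, t_fwd, t_rev))
```

When forward and reverse probabilities are equal, the ratio reduces to the plain Metropolis rule, which is the symmetric case mentioned in the text.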

4. Empirical Results

In this section, we investigate the performance of MHPI in three domains: maze, BlocksWorld, and the inverted pendulum problem. For each domain, samples were gathered by the SARSA (Rummery and Niranjan, 1994) algorithm using the initial feature representation, with learning rates generated from the following series:

$$\alpha_t = \alpha_0 \frac{N_0 + 1}{N_0 + \text{Episode\#}^{1.1}},$$

where $N_0$ was set to 100 and $\alpha_0$ was initialized at 1 due to the short amount of interaction. For exploration, we chose the $\epsilon$-greedy approach with $\epsilon = 0.1$ (i.e., a 10% chance of taking a random action on each time step). The $\lambda$ parameter of the Poisson distribution was set to 0.01, while $\eta$ for the exponential distribution was set to 1. The initial representation used for MH included all basic features. Additionally, $\phi(s, a)$ was built by copying the $\phi(s)$ vector into the corresponding action slot; therefore $\phi(s, a)$ has $|A|$ times as many features as $\phi(s)$. For LSPI, we limited the number of policy iterations to 5, while the value of the initial state for each policy, $V^{\pi_i}(s_0)$, was evaluated by a single Monte-Carlo run.
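Two of the setup details above are easy to make concrete: the decaying learning rate and the construction of state-action features by copying the state features into an action slot. The sketch below is illustrative only; the function names are hypothetical, and counting episodes from 1 is an assumption.

```python
def learning_rate(episode, alpha0=1.0, n0=100):
    """alpha_t = alpha0 * (N0 + 1) / (N0 + episode**1.1)."""
    return alpha0 * (n0 + 1) / (n0 + episode ** 1.1)


def state_action_features(phi_s, action_index, num_actions):
    """Copy phi(s) into the slot of the chosen action; other slots stay zero."""
    n = len(phi_s)
    phi_sa = [0.0] * (n * num_actions)
    start = action_index * n
    phi_sa[start:start + n] = phi_s
    return phi_sa


# With episodes counted from 1, the first episode uses alpha0 itself.
print(learning_rate(1))
print(state_action_features([1.0, 0.0, 1.0], action_index=1, num_actions=2))
```

The resulting vector has `len(phi_s) * num_actions` entries, matching the statement that the state-action vector is $|A|$ times larger than the state vector.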

Maze: Figure 3-(a) shows a simple $11 \times 11$ navigation problem where the initial state is at the top left corner of the maze and the goal is at the bottom right corner. Light blue cells indicate blocked areas. The action set consists of one-step moves along the four cardinal directions. Actions are noiseless and are possible only if the destination cell is not blocked. The reward is $-0.001$ for all interactions except the move leading to the goal, which has a reward of $+1$. The episodic task terminates when the goal is reached or 100 steps have passed. There were 22 initial features used for $\phi(s)$, corresponding to the 11 rows and 11 columns of the maze. $\gamma$ was set to 1.

We used 200 samples gathered over 2 episodes in the domain using SARSA. The agent reached the goal in the first episode, passing the top right corner of the middle blocked square. The second episode failed, as the agent got stuck behind the blocked area at the bottom. Figure 3-(b) shows the distribution of the representation sizes sampled over 1,000 iterations of the MH algorithm, while Figure 3-(c) shows the corresponding performance of the sampled representations. The distribution together with the performance measure suggests that a desirable representation should have 3 extended features. After 100 iterations, all sampled hypotheses were expressive enough to solve the task. It is interesting to see how Occam's Razor is carried through the whole process: the MH algorithm spent most of its time exploring hypotheses with 3 extended features, while more complicated representations were of less interest because they provided the same performance (i.e., likelihood) yet had a lower prior.

Figure 3-(d) shows the value function (green indicates positive, white represents zero, and red stands for blocked areas) and the corresponding policy (arrows) for the best performing representation. This representation had 3 extended features: $(X = 2 \wedge Y = 11)$, $(X = 3 \wedge Y = 6)$, and $(X = 2 \wedge Y = 8)$, where $X$ is the row number and $Y$ is the column number. Notice that the policy guides the agent successfully from the starting point to the goal along the shortest path.
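To make the feature vocabulary concrete, the sketch below builds the 22 row and column indicators for a maze cell and one of the conjunctive extended features listed above. The index layout (rows first, then columns) and the helper names are assumptions made for illustration.

```python
SIZE = 11  # 11 x 11 maze


def primitive_features(row, col):
    """22 binary indicators: entries 0-10 flag the row, 11-21 flag the column.

    Rows and columns are numbered from 1, matching X and Y in the text.
    """
    phi = [0] * (2 * SIZE)
    phi[row - 1] = 1
    phi[SIZE + col - 1] = 1
    return phi


def conjunction(i, j):
    """Build an extended feature that fires only when features i and j both fire."""
    return lambda phi: phi[i] & phi[j]


# Extended feature (X = 2 AND Y = 11) from the best representation.
x2_and_y11 = conjunction(2 - 1, SIZE + 11 - 1)

print(x2_and_y11(primitive_features(2, 11)))
print(x2_and_y11(primitive_features(2, 8)))
```

Each conjunction singles out one cell of the grid, something no single row or column indicator can do on its own.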



Logical combinations of primitive features, such as $f_8 = f_4 \wedge f_6$.














Likelihood:
- Find the best policy $\pi$ given $\phi, D$ (we used LSPI [Lagoudakis et al. 2003]).
- Simulate trajectories for estimating $V^\pi(s_0)$.
- $P(G \mid \phi, D) \propto e^{\eta V^\pi(s_0)}$

A well performing policy is more likely to be a Good policy!



Prior:
- Representations with fewer features are more likely.
- Representations with simple features are more likely.

[Goodman et al. 2008]
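Combining the two pieces, the unnormalized log-posterior of a representation adds the log-likelihood $\eta V^\pi(s_0)$ to a log-prior that penalizes size. The sketch below assumes a Poisson prior on the number of extended features only. The exact prior used by the authors, including any term for feature complexity, is not given in this text, so the form here and all names are illustrative; the defaults reuse the reported settings $\eta = 1$ and $\lambda = 0.01$.

```python
import math


def log_poisson(k, lam):
    """Log of the Poisson pmf: k*log(lam) - lam - log(k!)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def log_posterior(v_s0, num_extended, eta=1.0, lam=0.01):
    """Unnormalized log-posterior: eta * V(s0) plus an assumed Poisson size prior."""
    log_likelihood = eta * v_s0
    log_prior = log_poisson(num_extended, lam)
    return log_likelihood + log_prior


# Two hypothetical representations with equal performance: compare their scores.
small = log_posterior(v_s0=0.9, num_extended=3)
large = log_posterior(v_s0=0.9, num_extended=5)
print(small > large)
```

With equal estimated value, the smaller representation scores higher, which is the Occam's Razor effect the prior is meant to produce.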







Use Metropolis-Hastings (MH) to sample from the posterior.










Markov Chain Monte-Carlo:
- Propose $\phi'$ from $\phi$.
- Accept probabilistically based on the posterior.
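The propose and accept cycle can be written as a short generic loop. This is a textbook Metropolis-Hastings sketch under stated assumptions, not the MHPI implementation: `log_post`, `propose`, and `log_t` are caller-supplied placeholders, and the toy target over non-negative integers at the end is invented purely to exercise the loop.

```python
import math
import random


def metropolis_hastings(init, log_post, propose, log_t, n_iter, rng):
    """Sample a chain of hypotheses.

    log_post(x): unnormalized log-posterior of hypothesis x.
    propose(x, rng): draw a candidate x' from T(. | x).
    log_t(x_new, x_old): log T(x_new | x_old), needed when T is asymmetric.
    """
    current = init
    chain = [current]
    for _ in range(n_iter):
        candidate = propose(current, rng)
        log_ratio = (log_post(candidate) + log_t(current, candidate)
                     - log_post(current) - log_t(candidate, current))
        # Accept with probability min(1, exp(log_ratio)).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = candidate
        chain.append(current)
    return chain


# Toy target: hypotheses are integers k >= 0 with posterior proportional to exp(-k).
def propose(k, rng):
    """Step to k - 1 or k + 1; from 0 the only legal move is to 1 (asymmetric)."""
    if k == 0:
        return 1
    return k + rng.choice([-1, 1])


def log_t(k_new, k_old):
    """Log proposal probability for the toy proposal above."""
    return 0.0 if k_old == 0 else math.log(0.5)


chain = metropolis_hastings(0, lambda k: -float(k), propose, log_t,
                            n_iter=5000, rng=random.Random(0))
print(sum(chain) / len(chain))
```

Rejected candidates repeat the current hypothesis in the chain, so simple hypotheses with high posterior mass are visited most often.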















... performance along each iteration. Unlike the other domains, where more features often helped performance early on, in this domain irrelevant features reduced performance, causing MH to reject them. This process took a while until interesting features started to emerge. This effect is usually avoided by setting a burn-in value that discards a limited number of initial samples in the MH setting. Yet we included this data to highlight the fact that expanding the representation arbitrarily does not necessarily improve performance in light of limited data. Figure 5-(d) shows the performance of the representations based on the number of extended features. In our experiments, while adding most extended features hurt performance, the extended feature $(-\frac{\pi}{21} \le \theta < 0) \wedge (0.4 \le \dot{\theta} < 0.6)$ enabled the agent to complete the task successfully. This feature identifies an intuitive situation where the pendulum is almost balanced with a velocity in the opposite direction, a situation that is visited often. This is a very interesting result because, out of all possible correlations among the initial features ($21 \times 21$), capturing one such intuitive feature made the task solvable with a very limited amount of data.

In our work, we found that the adjustment of priors played a critical role in the success of MHPI, as priors compete against the performance of the resulting policies. We also found MHPI to be robust in handling stochastic domains. For example, adding 20% noise to the movement of the agent in the maze domain did not change the performance noticeably.

5. Conclusion

This paper introduces a Bayesian approach for finding concise yet expressive representations for solving MDPs. We introduced MHPI, a new RL technique that builds well-performing representations from a limited number of simple features. Our approach uses a prior distribution that encourages representation simplicity, and a likelihood function based on LSPI that encourages representations leading to capable policies. MHPI samples representations from the resulting posterior distribution. Although the idea of MHPI is general, in our implementation we narrowed the representation space to DAG structures on primitive binary features. The empirical results show that MHPI finds simple yet effective representations for three classical RL problems.

There are immediate extensions to this work. In our implementation, we excluded the samples generated during the performance test in order to take advantage of caching old representation evaluations. One can use such samples along the way, while being aware of the increase in runtime complexity. Another extension is to relax the need for the simulation box in LSPI by measuring the performance using off-policy evaluation techniques such as importance sampling (Sutton and Barto, 1998) and model-free Monte Carlo (Fonteneau et al., 2010).

Maze: 200 initial samples. Initial features: row and column indicators. Noiseless actions: →, ←, ↓, ↑. [Plot x-axis: MHPI Iteration]

BlocksWorld: 1000 initial samples. Initial features: on(A,B). 20% noise of dropping the block. [Plot x-axis: MHPI Iteration]


Inverted Pendulum: 1000 initial samples. Initial features: $\theta$ and $\dot{\theta}$, each discretized into 21 buckets separately. Gaussian noise was added to torque values. [Plot x-axis: MHPI Iteration]





Many proposed representations were rejected initially



Contributions:
- Introduced a Bayesian approach for finding concise yet expressive representations for solving MDPs.
- Introduced MHPI as a new RL technique that expands the representation using limited samples.
- Empirically demonstrated the effectiveness of our approach in 3 domains.


Future Work:
- Reuse the data for estimating $V^\pi(s_0)$ for policy iteration.
- Relax the need for a simulator to generate trajectories: importance sampling [Sutton and Barto, 1998], model-free Monte Carlo [Fonteneau et al., 2010].
