
Self-Supervised Object-Level Deep Reinforcement Learning

William Agnew 1 Pedro Domingos 1

Abstract

Current deep reinforcement learning approaches incorporate minimal prior knowledge about the environment, limiting computational and sample efficiency. We incorporate a few object-based priors that humans are known to use: "Infants divide perceptual arrays into units that move as connected wholes, that move separately from one another, that tend to maintain their size and shape over motion, and that tend to act upon each other only on contact" Spelke [1990]. We propose a probabilistic object-based model of environments and use human object priors to develop an efficient self-supervised algorithm for maximum likelihood estimation of the model parameters from observations and for inferring objects directly from the perceptual stream. We then use object features and incorporate object-contact priors to improve the sample efficiency of our object-based RL agent. We evaluate our approach on a subset of the Atari benchmarks, and learn up to four orders of magnitude faster than the standard deep Q-learning network, rendering rapid desktop experiments in this domain feasible. To our knowledge, our system is the first to learn any Atari task in fewer environment interactions than humans.

1. Introduction

Deep reinforcement learning has achieved impressive performance in many environments Mnih et al. [2015, 2016], Silver et al. [2017]. However, the sample inefficiency of deep reinforcement learning algorithms limits applicability to real-world domains. These algorithms are very general, in theory capable of learning in all environments. We are only interested in real-world environments, though, suggesting that current deep reinforcement learning algorithms could be improved by incorporating additional priors or heuristics.

1 Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington, USA. Correspondence to: William Agnew <[email protected]>, Pedro Domingos <[email protected]>.

In most real-world environments, humans are able to achieve good performance after far less experience than machines, pointing to human-inspired priors as a means of improving sample efficiency. Past work has considered perceiving the world in terms of objects as humans do (Diuk et al. [2008], Scholz et al. [2014]), and recent work has focused on creating object state representations without supervision (Goel et al. [2018], Zhu et al. [2018], Greff et al. [2019], Lin et al. [2020]). In this work, we use human object priors to develop a probabilistic object model of environments and a tractable algorithm for inferring an explicit object state representation that jointly maximizes the likelihood of a dynamics model and a reward model.

The first object prior is that humans perceive the world in terms of objects, or groups of percepts that behave similarly across time and tend to maintain their shape and size Spelke [1990]. We incorporate this prior by developing an object-based probabilistic model of environments and tasks. We provide an algorithm for MLE inference on the parameters of this environment. If two low-level objects are part of the same high-level object and therefore have the same behavior, then their dynamics and reward behavior can be modelled by a single model just as well as by individual models for each object. Our inference algorithm formalizes this observation by attempting to combine similar objects together and evaluating the likelihood of a combination via the likelihood of reward and dynamics models trained on the resulting object representation.

Learning causality in sparse reward settings is a fundamental challenge in reinforcement learning. The second human object prior we use is that objects interact only when contacting (Spelke [1990]). We implement this prior by defining an object-contact state representation, reinforcement learning algorithm, and approximation architecture. We represent the state of each object by the object's position, velocity, acceleration, and the relative positions, velocities, and accelerations of contacting objects. This representation is both compact and sparse. Our reinforcement learning algorithm calculates discounted returns using a discount factor that is a function of the object state representation, specifically using a lower γ when objects contact, giving these states more credit for future rewards than states where no objects are contacting. Our approximation architecture, which learns to generalize the decisions of the reinforcement learning architecture, consists of simple, sample-efficient models which learn value and reward as a function of which objects are contacting, and richer models that learn value and reward as a function of both which objects are contacting and how those objects are contacting.

The object-contact focused reinforcement learning we propose enables rapid learning of which objects should contact and how, but offers no advantages in states where no objects are contacting. We solve this by using the learned object dynamics models to plan towards states in which contacts occur. However, this planning can be computationally expensive. Our final object prior is that the distance between two objects must decrease for those objects to contact. While simple, this prior allows us to develop planning heuristics which greatly improve the computational efficiency of existing planning algorithms.

We combine these priors into a new reinforcement learning algorithm to create an Object-Level Reinforcement Learner (OLRL). We compare OLRL with humans, deep reinforcement learners, and an object feature set, Blob-PROST (Liang et al. [2016]), on Atari environments, and show that it yields large improvements in sample and computational efficiency. Currently deep reinforcement learning research is often resource and time intensive due to the huge number of experiences needed. Our object detection and tracking module can be used as a general preprocessor for reinforcement learning, greatly increasing the pace and sample efficiency of experiments in this area.

Our contributions are using the following human object priors to create a novel OORL algorithm:

1. Prior of common destiny: percepts that behave the same–same dynamics, value, and reward behavior–are the same and may be treated as one object. We use this prior to build an algorithm that quickly learns a succinct object representation of the environment with no supervision.

2. Object contact prior: objects interact primarily when in contact. Therefore the value and reward functions may be modelled well using object contacts as features. This feature set is small and often has little variance; that is, object contact states change infrequently, enabling sample-efficient learning.

3. Object dynamics prior: the world can be modelled well using classical object dynamics–objects and their positions, velocities, accelerations, and contacts. Using this prior we rapidly learn a forward dynamics model accurate enough to enable effective planning.

4. Implementing these priors into an agent that is more sample efficient than humans on a subset of Atari environments.

2. Related Work

There has been much work on model-based deep reinforcement learning and predicting future states or values with deep networks. Oh et al. [2015] and Chiappa et al. [2017] use deep neural networks to accurately predict Atari frames hundreds of steps into the future. However, these works rely on hundreds of thousands of training frames and computationally intensive deep architectures, do not consider environment stochasticity, and predict at the pixel level rather than the more efficient object level. Kaiser et al. [2019] integrate similar pixel-level frame prediction for Atari into a deep RL agent. Garnelo et al. [2016] use reinforcement learning on a high-level symbolic world representation learned with no supervision, but must still learn what good world representations are and are only successful on very simple environments. Xue et al. [2016], Ebert et al. [2017], Oh et al. [2017], Higuera et al. [2018] all learn to predict future state, but still predict either very low-level features that are difficult to learn on or high-level features that require many samples to effectively model.

Object state representations for reinforcement learning were proposed by Diuk et al. [2008] with a formal framework for describing and reasoning about object interactions. Scholz et al. [2014] extended this framework with physics models of object dynamics. Li et al. [2017] investigate integrating object information into modern deep learning approaches. Kansky et al. [2017] learn more general probabilistic object models and demonstrate how to plan on such models using inference. By learning the parameters of physics-based environment models, Woof and Chen [2018] develop an object embedding network to achieve strong performance in complex environments. In contrast to our object recognition algorithm, all of these techniques require environment-specific object labels.

Liang et al. [2016], Machado et al. [2018], Naddaf [2010] begin to tackle the problem of recognizing objects without extensive supervision or hand-crafted, object-specific recognition algorithms by introducing Blob-PROST, BASS, and DISCO, large feature sets built by dividing the environment into grids and looking at monocolor blobs. Several recent works, MOREL Goel et al. [2018], OODP Zhu et al. [2018], IODINE Greff et al. [2019], OP3 Veerapaneni et al. [2019], ST-DIM Anand et al. [2019], Transporter Kulkarni et al. [2019], and SPACE Lin et al. [2020], present unsupervised techniques for segmenting and modeling objects using deep neural networks. These works use the prior that the observed changes between successive frames can be explained by translations of groups of pixels. Our object segmentation, tracking, and modelling incorporates more knowledge about objects, including object permanence and notions that the pixels composing each object are connected and visually similar. Another key contribution of our work is using not only object dynamics but also rewards to learn our object representation. For these reasons, our object representations are both more compact and are learned after hundreds, as opposed to tens or hundreds of thousands, of observations. Most critically, we develop an explicit object representation in terms of object positions, velocities, accelerations, and contacts and use this representation for our reinforcement learning. Finally, we also use our object models to plan and incorporate object priors into our value and reward estimators.

Reward shaping is another way to incorporate human priors into reinforcement learning Ng et al. [1999]. Hybrid Reward Architectures Van Seijen et al. [2017] show the potential of object-level reinforcement learning by using per-object reward functions to achieve up to 400x the score of Rainbow Hessel et al. [2018], a state-of-the-art model-free agent, but rely on humans to label objects in its domain, Ms. Pacman.

3. Background

Reinforcement learning solves the problem of taking actions in some environment to maximize some reward. Formally, let a ∈ A be actions, s ∈ S be environment states, t ∈ {0, 1, . . .} be time steps, G(s′, s, a) = P(s′|s, a) be a stochastic transition function, R(s, a) be a stochastic reward function, and γ ∈ [0, 1] be a discount factor. These define a Markov decision process where the goal at each time t is to take the action a_t that maximizes the expected discounted reward, $\sum_{t=0}^{\infty} R(s_t, a_t)\gamma^t$. Each observed (s, a, r, s′) tuple is called an experience. Reinforcement learning solves this problem by learning a policy π which specifies the probability of the agent choosing each action given the current state. Define the state value function to be $V_\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R(s_{t+k}, a_{t+k}) \mid s_t = s\right]$ and the state-action value function to be $Q_\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R(s_{t+k}, a_{t+k}) \mid s_t = s, a_t = a\right]$, where $\mathbb{E}_\pi[\cdot]$ denotes the expected value of the random variable given that the agent follows π. Then the optimal state value and state-action value functions are V∗(s) = maxπ Vπ(s) and Q∗(s, a) = maxπ Qπ(s, a) for all s ∈ S and a ∈ A, and an optimal policy π is one that is greedy with respect to the optimal state value or state-action value function.
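To make the discounted-return and greedy-policy definitions above concrete, the following minimal Python sketch computes Monte Carlo discounted returns for one episode and reads off a greedy action from a tabular Q estimate; the variable names and toy numbers are illustrative and not part of the paper's implementation.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * r_{t+k} for every step of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def greedy_action(Q, state):
    """Pick the action that maximizes the tabular state-action value Q[state][action]."""
    return max(Q[state], key=Q[state].get)

# Toy usage: a three-step episode with a terminal reward of 1.
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))  # [0.81, 0.9, 1.0]
print(greedy_action({"s0": {"left": 0.1, "right": 0.4}}, "s0"))  # "right"
```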

In model-based reinforcement learning the agent learns a model of the world, and uses this model to plan actions to take or to simulate an environment to learn in. Specifically, the agent learns an approximation of G(s′, s, a), the transition function, and R(s, a), the reward function. Intuitively, the better these models are, the better the resulting policy will be. One approach to improving these models is creating a better state representation. That is, rather than learn to predict the state and reward in terms of s ∈ S, learn a map into a smaller state space h : S → S′, and learn models in this smaller state space. Intuitively, if |S′| is much less than |S| while preserving important features, better models can be learned with the same amount of experience. One general approach to this work is state aggregation Bertsekas [2018], which examines when two states s1, s2 ∈ S may be combined into a single state s3, creating a smaller state representation. For tasks where the observations are images or video, h is often thought of as a perception algorithm, processing images into meaningful visual features. Work in this area includes encoder-decoders Oh et al. [2015], Xue et al. [2016], Ebert et al. [2017], Chiappa et al. [2017], Kaiser et al. [2019], object-like feature sets Liang et al. [2016], Machado et al. [2018], Naddaf [2010], object keypoints Kulkarni et al. [2019], and pixel masks of objects Goel et al. [2018], Zhu et al. [2018], Veerapaneni et al. [2019], Greff et al. [2019], Anand et al. [2019], Lin et al. [2020].

Figure 1: Diagram of OORL framework

4. Model

We extend this work by using object priors to build a powerful object state representation that enables sample-efficient and accurate dynamics and reward modelling. We learn our object state representation by jointly maximizing the likelihood of both the world model and the reward model, closing the loop and allowing reinforcement learning to guide perception. Guiding perception to optimize both reward and dynamics modelling has several intuitive advantages. Reward signal is typically very sparse, while predicting dynamics provides a rich signal sufficient for supervised learning. However, learning perception based on purely visual loss can result in distractors in the representation or ignoring objects that differ little from the background visually but provide reward. Considering the value and rewards of objects solves these problems.

We do this by proposing a probabilistic object-based model of environments in Table 1, Equation 1, describing how both observations and rewards are generated by a set of objects. Let I^t be an X × Y image observation of an environment at time t. Our agent divides images into objects, where each object o is a set of pixels. o^t = {p^t_0, p^t_1, ...} is the object o at time t, where the p are pixels. The existence of o at time t is given by the boolean e_{o,t}. The set of all objects the agent can perceive is O. The set of all objects the agent can perceive, as they have been most recently perceived, is O_c, and o_l is o at the most recent time it was present. F(O) are the features an agent observes about each object in the set O, specifically position, velocity, acceleration, and contacts. F(O)^T is the sequence of features observed about O during the set of times T. D is a learned environment model. Let D_{o,v}, D_{o,e}, D_{o,c}, D_{o,p} be the models for the velocity of o, when o exists, where o appears when it starts existing, and the pixels composing o, respectively. D_{o,v} and D_{o,p} take a_{t-1} and F(o^{t-1}) as inputs, and D_{o,e} and D_{o,c} take a_{t-1} and F(O^{t-1}) as inputs. Let D_{o,·}(o^t) be the prediction of D_{o,·} on F(o^t) and a_t. r^t is the reward the agent receives at time step t. R_O and V_O are a reinforcement learning algorithm which is trained on F(O)^{T_train} and r^{T_train} to predict the rewards and value, respectively, of a state at time t given F(O^t) and a_t.

We model the observations of the environment as being generated by an MDP over objects, their dynamics, and rendering objects into pixels. We are given a set of observations I = {I^0, I^1, ..., I^T} containing n objects {o_0, o_1, ..., o_n} = O with properties (shape, color, mass, etc.) {p_0, p_1, ..., p_n} = P and associated dynamics d. Each object o at each time t has a state F(o^t) = s_{o,t} comprising its properties and position, velocity, acceleration, contact status, and whether it currently exists. Let S^t = {s_{o,t} ∀o ∈ O}. In Table 1 we give P(I), the distribution of environment observations. Rendering is assumed to choose uniformly at random one of the pixels among those with the same coordinates to render, and object properties are domain specific.

We use the prior that objects provide a succinct and powerful model of the world and the prior that objects only interact when in contact to develop the object-contact state representation s with which we parameterize our dynamics and reward models (Equations 3-7). Given an object o, the corresponding object-contact feature set s_o is the position, velocity, and acceleration of o, and the relative positions, velocities, and accelerations of any objects o′ that o is in contact with. Using these two priors, we develop a small, sparse, but powerful feature set to model reward and dynamics with. We use the prior that simpler environments with fewer objects are more likely to derive Equation 2, modelling the number of objects with a geometric distribution. Finally, we use the prior that objects behave consistently across time by having our dynamics and reward model characterizations be only a function of the current state and action in Equations 3-7. While simple, this prior forms the core of our inference algorithm, where we search for objects with very similar dynamics and reward behavior and combine them together to create a more likely (and more succinct) object representation.
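As an illustration of the object-contact feature set just described, the sketch below (with hypothetical field names, not the paper's code) builds a feature vector for one object from its own kinematics plus the relative kinematics of whichever objects currently contact it.

```python
import numpy as np

def contact_features(obj, contacts):
    """Object-contact features: the object's own position/velocity/acceleration plus
    the relative position/velocity/acceleration of each contacting object.
    `obj` and each element of `contacts` are dicts with 2D 'pos', 'vel', 'acc' arrays."""
    feats = [obj["pos"], obj["vel"], obj["acc"]]
    for other in contacts:
        feats.append(other["pos"] - obj["pos"])   # relative position
        feats.append(other["vel"] - obj["vel"])   # relative velocity
        feats.append(other["acc"] - obj["acc"])   # relative acceleration
    return np.concatenate(feats)

paddle = {"pos": np.array([80.0, 190.0]), "vel": np.array([2.0, 0.0]), "acc": np.zeros(2)}
ball = {"pos": np.array([82.0, 188.0]), "vel": np.array([1.0, 3.0]), "acc": np.zeros(2)}
print(contact_features(paddle, [ball]))  # 12-dimensional vector when one contact exists
```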

5. Inference

In this section we describe how we compute the MLE set of parameters for this model. Our model is parameterized by O, the set of objects our agent divides the world into, with O^t being the objects and associated properties (pixels, position, velocity, acceleration, contact status) at time t, D, the learned dynamics model of the objects, and R_O, the learned value and reward model over object features. The likelihood of these parameters is:

$L(O, D, R_O) = b(1-b)^{|O|-1} \prod_{t \in T_t} P(R_O(s^t, a^t) = r^t) \times \prod_{c \in X,Y} \frac{\sum_{o \in O^{t-1}} \sum_{c' \in X,Y} P(D_{o,v}(o) + c' = c)\, P(D_{o,p}(o) = I^t(c))}{\sum_{o \in O^{t-1}} \sum_{c' \in X,Y} P(D_{o,v}(o) + \langle c' \rangle = \langle x, y \rangle)}$

Rather than compute likelihood based on the pixel observations, we may initialize with a segmentation algorithm S, tuned to always oversegment, so no two true objects are segmented together, and a tracking algorithm T, tuned to always undertrack, so T never produces false positive trackings. We use S to segment each image and T to greedily assign trackings to create an initial set of objects, or percepts, P. Doing so allows incorporation of prior knowledge and algorithms and further improves the sample efficiency of our algorithm. In this case, the likelihood is in terms of reconstructing the positions of the initial objects:

$L_S(O, D, R_O) = b(1-b)^{|O|-1} \prod_{t \in T_t} P(R_O(s^t, a^t) = r^t) \times \prod_{p \in P^t} P(D_{o_p,e}(o_p^{t-1}) = 1)\, P(D_{o_p,v}(o_p^{t-1}) = v(p^t))\, P(D_{o_p,c}(o_p^{t-1})(p) = p^t)$

where o_p is the object p was previously mapped to, and v(p^t) is the velocity of p from time t−1 to t. We approximate MLE object segmentation as follows. Given two objects, o_1, o_2 ∈ O, we combine them together to form a new object o_3 = {o_1, o_2} and object map O′ = (O \ {o_1, o_2}) ∪ o_3. If o_1 and o_2 appear at disjoint times, then each time our segmentation transitions between the two objects is an instance where our tracking algorithm failed. Using this observation, we obtain supervised data for when T fails, and we use this data to learn a better tracking algorithm. We compare the likelihoods of O and O′ and accept O′ as the new object representation if it is more likely. We may also attempt to split an object, o_3 = {o_1, o_2}, into two separate objects, o_1 and o_2, by similarly comparing the likelihoods of the corresponding object representations. We detail this procedure in Algorithm 1. Our agent processes each frame into an object representation by using S to oversegment the image, then T_L, a learned tracking algorithm, to track segments onto past objects, and finally O to combine these object primitives into high-level objects. We detail this in Algorithm 2 in the appendix.
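A minimal sketch of the merge step of this inference procedure, assuming hypothetical helper callables train_models and log_likelihood that stand in for fitting the dynamics/reward models and evaluating the likelihood defined above; it shows only the accept/reject logic, not the paper's implementation.

```python
import itertools

def try_merges(objects, train_models, log_likelihood):
    """Greedily merge pairs of objects whenever the merged representation is at
    least as likely under the dynamics and reward models as the current one.
    `objects` is a set of hashable object identifiers."""
    models = train_models(objects)
    best_ll = log_likelihood(objects, models)
    improved = True
    while improved:
        improved = False
        for o1, o2 in itertools.combinations(list(objects), 2):
            candidate = (objects - {o1, o2}) | {frozenset({o1, o2})}
            cand_models = train_models(candidate)
            cand_ll = log_likelihood(candidate, cand_models)
            if cand_ll >= best_ll:  # accept the more likely (and smaller) object set
                objects, models, best_ll = candidate, cand_models, cand_ll
                improved = True
                break               # restart the pair scan on the new object set
    return objects, models
```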

6. Learning

In this section we describe our model-based reinforcement learning algorithm.


Table 1: Probabilistic Object-Based Environment Model

$P(I) = P(\#\mathrm{Objs} = n)\, P(\mathrm{Properties} = p \mid n)\, P(\mathrm{Obj\ Dynamics} = d \mid p, n)\, P(\mathrm{Rewards} = r \mid d, p, n)\, P(\mathrm{Pixels} = I \mid r, d, p, n)$  (1)

$P(\mathrm{Number\ of\ Objects} = n) = b(1-b)^{n-1}$  (2)

$P(\mathrm{Object\ Dynamics} = d \mid p, n) = \prod_{t=0}^{T} \prod_{o \in O} E(e_{o,t})\, X(o, t), \qquad X(o, t) = \begin{cases} 1 & \neg e_{o,t} \\ A(p_{o,t}) & e_{o,t} \wedge \neg e_{o,t-1} \\ V(v_{o,t}) & \text{else} \end{cases}$  (3)

$E(e_{o,t}) = P(\mathrm{Object}\ o\ \mathrm{is\ present\ at}\ t) = \mathrm{Bernoulli}(p_{o, s_{o,t-1}, a_{t-1}})$  (4)

$A(p_{o,t+1}) = P(\mathrm{Object}\ o\ \mathrm{appears\ at\ position}\ p\ \mathrm{at}\ t+1 \mid e_{o,t+1}, \neg e_{o,t}) = \mathcal{N}(\mu_{o, s_{o,l}, a_t}, \sigma^2_{o, s_{o,l}, a_t})$  (5)

$V(v_{o,t+1}) = P(\mathrm{Object}\ o\ \mathrm{moves\ with\ velocity}\ v\ \mathrm{at}\ t+1 \mid e_{o,t+1}) = \mathcal{N}(\mu_{o, s_{o,t}, a_t}, \sigma^2_{o, s_{o,t}, a_t})$  (6)

$P(\mathrm{Rewards} = r) = \prod_{t=0}^{T} \mathrm{Bernoulli}(p_{O_t, S_t, a_t})$  (7)

Our agent learns a dynamics model D, value model V_O, reward model R_O, and tracking algorithm T_L from experience. Our agent uses T_L and O to track the percepts in the current observation and map them onto the object representation. Our agent then uses D, V_O, and R_O to plan the best action to take.

Tracking–Our tracking model tracks the set of all percepts as they were most recently observed, P_c, onto the set of percepts observed in the current observation P^t by assigning a score to all pairs of current and past percepts using a learned tracking algorithm, T_L(p_c, p^t), p_c ∈ P_c, p^t ∈ P^t. Let H = {(o_1, o_2), ...}, o_1, o_2 ∈ O be the set of all attempted trackings.

Let M be the set of all outcomes of attempted trackings. M((o_1, o_2)) is set to T(o_1, o_2) when o_1 is attempted to be tracked to o_2. T_L has the form

$T_L(p_c, p^t) = \begin{cases} T(p_c, p^t) & W(p_c, p^t) < 0.5 \\ L(p_c, p^t) & W(p_c, p^t) \geq 0.5 \end{cases}$

where T is a classical tracking algorithm we initialize our tracker with, W is a model trained to recognize trackings where T will be wrong, and L is a model trained on the correct trackings. W and L are trained on trackings where the inference procedure described in the previous section discovers T was incorrect; the discovery of these incorrect trackings is detailed in Algorithm 1. Pairs with tracking scores that exceed some threshold w_t are greedily assigned, highest scores first.
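The sketch below illustrates the gated form of T_L and the greedy assignment step: a classical tracking score T is used unless a learned error detector W flags the pair, in which case a learned scorer L takes over. The three callables and the threshold are placeholders standing in for the paper's base tracker and the W/L models trained from Algorithm 1's corrections.

```python
def make_learned_tracker(T, W, L, threshold=0.5):
    """Return T_L(p_c, p_t): use the classical score T unless the error model W
    flags the pair, in which case fall back to the learned score L."""
    def TL(p_c, p_t):
        return T(p_c, p_t) if W(p_c, p_t) < threshold else L(p_c, p_t)
    return TL

def greedy_assign(past, current, TL, w_t):
    """Greedily match past and current percepts, highest tracking score first,
    keeping only pairs whose score exceeds the threshold w_t."""
    pairs = sorted(((TL(p, q), i, j) for i, p in enumerate(past) for j, q in enumerate(current)),
                   key=lambda item: item[0], reverse=True)
    matches, used_past, used_current = [], set(), set()
    for score, i, j in pairs:
        if score >= w_t and i not in used_past and j not in used_current:
            matches.append((past[i], current[j]))
            used_past.add(i)
            used_current.add(j)
    return matches
```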

Dynamics Modelling–We model dynamics in terms of powerful object-level features: the relative and absolute positions, velocities, accelerations, and collisions of all objects. For each object in the set of known objects, O, a presence model, D_{o,e}, is trained to predict if o will be present in the next time step given the current positions, velocities, accelerations, and contacts of all objects.

Algorithm 1: Infer

Choose o1, o2 ∈ O ∪ S, or o1 ∈ O and o2 ∈ S s.t. o2 ∈ o1
if o2 ∈ o1 then
    o3 = o1 \ o2                          // Make split object
    O′ = O \ o1                           // Make new object set
    O′ = O′ ∪ o3
    Train D_{O′}                          // Train dynamics models for new objects
    Train R_{O′}                          // Train value model for new object set
    if log L(O′, D_{O′}, R_{O′}) ≥ log L(O, D_O, R_O) then
        Train T_L                         // Train new tracker
        O = O′                            // Split object
    end
else
    o3 = o1 ∪ o2                          // Make combined object
    O′ = O \ {o1, o2}                     // Make new object set
    O′ = O′ ∪ o3
    Train D_{O′}
    Train R_{O′}                          // Train value model for new object set
    if log L(O′, D_{O′}, R_{O′}) ≥ log L(O, D_O, R_O) then
        if {t | o1 ∈ O^t} ∩ {t | o2 ∈ O^t} = ∅ then
            foreach (o1^t, o2^{t′}) with t′ > t, o1, o2 ∉ O^{t+1:t′−1}, (o1^t, o2^{t′}) ∈ H do
                M((o1^t, o2^{t′})) = 1    // Correct tracker
            end
            foreach (o2^{t′}, o1^t) with t′ < t, o1, o2 ∉ O^{t′+1:t−1}, (o2^{t′}, o1^t) ∈ H do
                M((o2^{t′}, o1^t)) = 1    // Correct tracker
            end
            Train T_L                     // Train new tracker
        end
        O = O′                            // Combine objects
    end
end


For each dimension d ∈ D, an appearance model, D_{o,a,d}, is trained to output a probability vector over the possible locations of o in dimension d when D_{o,e} predicts o will appear, given the current positions, velocities, accelerations, and contacts of all objects. A velocity model, D_{o,v,d}, is trained to output a probability vector over the possible future velocities of o in dimension d in the next time step given the absolute and relative positions, velocities, accelerations, and contacts of o. We split timesteps into two sets, T_train and T_infer, for training dynamics and reward models and inferring the MLE object representation, respectively.

Value and Reward Estimation–We use on-policy Monte Carlo control (Sutton [1988]) to estimate Q values. We then train models V_O to predict Q value and R_O to predict reward given the current state, F(O^t), and next action, a_t. We incorporate the prior that objects interact when they contact into reinforcement learning to derive the contact discounted return. The contact discounted return is calculated using two decay rates: γ_nc for states where no objects are contacting, and γ_c for states where objects are contacting, with γ_nc > γ_c. Since γ_nc > γ_c, from the backward view the agent assigns more credit for reward to recent object contacts, and in the forward view only considers the next few object contacts when deciding how to act.
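A small sketch of one plausible reading of the contact discounted return, under our own illustrative assumptions: each trajectory step carries a reward and a boolean contact flag, the discount applied at a step is γ_c when objects are contacting and γ_nc otherwise, and the γ values shown are made up (the paper does not report them). Whether γ is indexed by the current or the successor state is our assumption.

```python
def contact_discounted_returns(rewards, contacts, gamma_nc=0.999, gamma_c=0.97):
    """Backward pass computing returns with a state-dependent discount:
    contact states (contacts[t] == True) discount the remaining future more heavily,
    so credit concentrates around the most recent contacts."""
    assert len(rewards) == len(contacts)
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        gamma = gamma_c if contacts[t] else gamma_nc  # assumed indexing by the current state
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Toy trajectory: a reward arrives two steps after a contact.
print(contact_discounted_returns([0, 0, 0, 1], [False, True, False, False]))
```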

We also use a contact-focused model architecture and feature set for approximating values and rewards. R and V are learned as an ensemble of models, one for each pair of objects o_1, o_2 ∈ O:

$R_{O,o_1,o_2}(F(O^t)) = \bar{r}_{c,o_1,o_2} + R_{M,o_1,o_2}(F(o_1), F(o_2), F(o_1, o_2))$

$V_{O,o_1,o_2}(F(O^t)) = \bar{v}_{c,o_1,o_2} + V_{M,o_1,o_2}(F(o_1), F(o_2), F(o_1, o_2))$

where $\bar{r}_{c,o_1,o_2}$ and $\bar{v}_{c,o_1,o_2}$ are the average reward and value, respectively, when o_1 and o_2 are contacting, F(o_1, o_2) are the relative position, velocity, and acceleration between o_1 and o_2, and R_M and V_M are learned models trained to minimize the error of R_{O,o_1,o_2} and V_{O,o_1,o_2} on states where o_1 and o_2 are in contact. By estimating value and reward as an ensemble of mean estimation and more complex models, R_M and V_M, these R_O and V_O quickly learn which objects are good and bad to contact, which is often sufficient to create a reasonably good policy. More complex interactions are learned by the richer models R_M and V_M.
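An illustrative sketch of the contact-pair ensemble: the prediction is the mean reward (or value) observed while a given object pair is in contact, plus a richer regressor fit to the residuals on those contact states. The regressor class here is a stand-in for convenience; the paper uses gradient-boosted trees and a neural network for the richer models.

```python
import numpy as np
from sklearn.linear_model import Ridge  # stand-in for the richer residual model

class ContactPairEstimator:
    """Reward/value estimator for one object pair: contact-state mean + learned residual."""
    def __init__(self):
        self.mean = 0.0
        self.residual_model = Ridge()
        self.fitted = False

    def fit(self, contact_features, targets):
        """contact_features: (n, d) array of pair features on contact states; targets: rewards or values."""
        targets = np.asarray(targets, dtype=float)
        self.mean = float(targets.mean()) if len(targets) else 0.0
        if len(targets) > 1:
            self.residual_model.fit(contact_features, targets - self.mean)
            self.fitted = True

    def predict(self, contact_features):
        pred = np.full(len(contact_features), self.mean)  # mean alone already ranks good/bad contacts
        if self.fitted:
            pred += self.residual_model.predict(contact_features)
        return pred
```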

Planning and Action Selection–Our estimators assume reward and value are zero unless objects are contacting. One difficulty this poses is determining which action to take when no objects are contacting. This may be solved by planning into the future until object contacts are encountered, but planning is often computationally expensive given that the number of possible action sequences grows exponentially. Closely following Ng's work on reward shaping (Ng et al. [1999]), we shape our reward and value functions by transforming them into potentials with respect to the distances between object pairs and adding these potentials to the original reward and value functions. This can be interpreted intuitively as incorporating the prior that to make two objects contact, the distance between those two objects should be decreased, and vice versa. Using this shaped reward, we plan using UCT tree search (Browne et al. [2012]).
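A sketch of distance-based potential shaping in the standard potential-based form F(s, s′) = γΦ(s′) − Φ(s) from Ng et al. [1999]; the choice of potential (negative distance between a chosen object pair) and the scaling constant are illustrative assumptions, not the paper's reported values.

```python
import numpy as np

def distance_potential(pos_a, pos_b, scale=0.01):
    """Potential that grows as the two objects approach each other."""
    return -scale * float(np.linalg.norm(np.asarray(pos_a) - np.asarray(pos_b)))

def shaped_reward(reward, pos_a, pos_b, next_pos_a, next_pos_b, gamma=0.99):
    """Potential-based shaping: F = gamma * Phi(s') - Phi(s), added to the environment reward.
    Closing the gap between the object pair yields a small positive bonus."""
    return reward + gamma * distance_potential(next_pos_a, next_pos_b) - distance_potential(pos_a, pos_b)

# Toy check: moving the pair closer produces a positive shaping bonus.
print(shaped_reward(0.0, [0, 0], [10, 0], [0, 0], [8, 0]))
```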

7. Implementation

We use XGBoost trees to learn our dynamics and value models, D and V_M, and a neural network to learn our reward model, R_M. We do not learn a model of object properties; instead we assume that object properties do not change, an approximation that we found does not impact performance. We provide more implementation details and hyperparameters in the appendices.

8. Experiments and Results

In this section we demonstrate that our object-level framework is capable of creating succinct and accurate object-level models on a variety of Atari games with very few samples. We then compare our OLRL agent against DQN, Blob-PROST, and humans. Our method of combining low-level percepts by searching for smaller yet still accurate models reduces the number of objects from dozens to a core few, as shown in Figure 3.

Figure 2: Model Error on Atari Environments. Distances are in pixels; for comparison, the Atari observation is 210x160 pixels.

World Model Analysis–We demonstrate the speed and accuracy of our object-level perception and modeling on a variety of Atari 2600 games. For each Atari game, we trained our model for only 2000 steps, where each step is 4 game frames, collected while playing random actions. We then evaluated our models by first playing n random actions, where n was sampled uniformly from [50, 150], and then predicting the positions of objects for the next 100 steps. Figure 2 shows prediction error as the average Euclidean distance between the predicted and actual object positions. Since many Atari objects do not move, we also show prediction error for the player-controlled object, which has complex dynamics. For the same reason, we use predicting that object velocity will remain constant as a natural baseline. Even with very little training data, our models have learned object dynamics with high accuracy tens of frames into the future, effective for short-term planning. In addition, model prediction error generally does not explode for longer prediction horizons, allowing the agent to consider the approximate positions of objects even 100 steps into the future.

Figure 3: Object-level representations of Atari games. Left: normal Atari game frames. Right: same frame with each high-level object our agent perceives colored differently.

Figure 4: Comparison of OLRL and human learning curves on Atari games. Each human learning curve shown in red. Median OLRL score and error bars shown in blue.

Comparison to Humans–We compare the maximum scores achieved by each time step for humans and OLRL in Figure 4. For all games tested, this value plateaus for humans, indicating they are not increasing their highest score and have saturated learning. Given just minutes of training data, OLRL is able to outperform humans learning for 15 minutes 65% of the time. We say that OLRL learns faster than a human if OLRL reaches this maximum score in fewer time steps than the human. OLRL is able to learn faster than humans 55% of the time, demonstrating the power of the object heuristics we have incorporated. This is despite humans having extensive prior knowledge, particularly for environments like Video Pinball.

Comparison to Deep RL–We compare OLRL to DQN and Blob-PROST in Figure 5. DQN and Blob-PROST data were drawn from Machado et al. [2018]. Using our object-level features, our agent reaches equivalent or greater performance than DQN and Blob-PROST with approximately 10,000x fewer experiences in three of the four games. Although DQN and Blob-PROST outperform OLRL in Pong, they require over 100x more experiences to reach the performance of OLRL. Furthermore, SimPLe Kaiser et al. [2019], the current state of the art for Atari in sample-limited regimes, achieves a mean of 54% of the human-normalized score across Pong, Asterix, and Breakout after 100,000 steps; our agent achieves a mean of 176% human-normalized score with fewer than 10,000 steps.
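For reference, human-normalized score is conventionally computed relative to random-play and expert-human baselines; the sketch below states that convention, with the baseline values left as inputs since the paper does not list them and the example numbers are made up.

```python
def human_normalized_score(agent_score, random_score, human_score):
    """Conventional human-normalized score: 0 corresponds to random play, 1 (100%) to the human baseline."""
    return (agent_score - random_score) / (human_score - random_score)

# Example with made-up numbers: an agent scoring 30 where random play scores 1 and a human scores 20.
print(100 * human_normalized_score(30.0, 1.0, 20.0))  # ~152.6%
```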


Figure 5: Comparison of OLRL, DQN, and Blob-PROST. Blue: OLRL, green: Blob-PROST, red: DQN. Note time steps are in log scale.

Robustness–Object-level representations are inherently robust to changes in object appearance within the limits of the object tracking algorithm used. Our OLRL framework further improves robustness in an additional way. Aggregating objects with similar appearances is often a good heuristic, but, as stated in Oh et al. [2015], preservation of dynamics and reward are sufficient conditions for aggregation. OLRL uses object appearance to help select candidate objects for aggregation, but chooses to aggregate solely based on preservation of dynamics and reward. Explicitly excluding object appearance from aggregation decisions makes OLRL highly robust to even dramatic changes in object appearance. In Figure 6, we train OLRL for 10,000 time steps and DQN for 50 million time steps on Breakout. Then we test robustness by changing the background color from black to white. Even though OLRL's tracking model fails to track the background after its color is changed and perceives the background as a new object, OLRL is able to quickly recover to previous performance levels by modelling the behavior of the new object and quickly combining it with the original background object.

Figure 6: Robustness analysis. Left: change in Breakout environment. Right: learning curve for OLRL (blue) and DQN (red) after change in environment.

Figure 7: Ablation for our contact discounted return on Breakout. OLRL with reward shaping in blue, OLRL without reward shaping in red.

Ablations–In Figure 7 we provide an ablation for our reward shaping for planning. Achieving a high score in Breakout requires planning many steps ahead to contact the ball. Without using reward shaping to extend the learned contact reward and value to adjacent states, the agent is unable to effectively plan action sequences to reach the ball.

Interpretability–Our learned policy is easily interpretable: the agent will try to contact objects with value and reward higher than the background, and avoid those objects with value and reward lower than the background. For example, in the game Breakout, the goal is to hit a ball with a paddle. After training an OLRL agent on Breakout, we find that contacts between the ball and paddle have an average value of 0.147 and average reward of 0, and contacts between the paddle and background have an average value of 0.139 and average reward of 0.007, providing a clear explanation for why the OLRL agent correctly favors contacting the ball with the paddle over the background.

9. Conclusion

This work provides an exciting demonstration of the potential of human object priors for reinforcement learning. Using these priors, we develop an object inference algorithm that closes the loop on perception and reinforcement learning by finding the set of objects that jointly maximizes the likelihood of the dynamics and the reward models. We also use object-contact priors to develop a novel discounting scheme that better captures causality, and motivate reward shaping to make planning more efficient. On the subset of Atari environments we evaluate on, our agent learns 10,000x faster than classical deep RL algorithms and better than state-of-the-art deep RL algorithms for sample-limited regimes. Our agent is even able to learn in fewer environment interactions than humans. Finally, our object representation gives promising robustness and interpretability results, two problems of growing interest for the community.

In this work we initialized our object distribution with simple classical tracking and segmentation algorithms. In future work we would like to use powerful deep segmentation and tracking algorithms to extend our results to realistic 3D environments. We believe a compact and explicit object representation paves the way for new results in robustness and new ideas in exploration, transfer learning (specifically learning relations between objects), and interpretability.

10. Acknowledgements

This work was supported by an NDSEG Fellowship and ONR grant N00014-18-1-2826. The GPU machine used for this research was donated by Nvidia.

References

Elizabeth S Spelke. Principles of object perception. Cognitive Science, 14(1):29–56, 1990.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.

Carlos Diuk, Andre Cohen, and Michael L Littman. An object-oriented representation for efficient reinforcement learning. In International Conference on Machine Learning, pages 240–247. ACM, 2008.

Jonathan Scholz, Martin Levihn, Charles Isbell, and David Wingate. A physics-based model prior for object-oriented MDPs. In International Conference on Machine Learning, pages 1089–1097, 2014.

Vikash Goel, Jameson Weng, and Pascal Poupart. Unsupervised video object segmentation for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 5688–5699, 2018.

Guangxiang Zhu, Zhiao Huang, and Chongjie Zhang. Object-oriented dynamics predictor. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 9804–9815. Curran Associates, Inc., 2018.

Klaus Greff, Raphaël Lopez Kaufmann, Rishab Kabra, Nick Watters, Chris Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation learning with iterative variational inference. arXiv preprint arXiv:1903.00450, 2019.

Zhixuan Lin, Yi-Fu Wu, Skand Vishwanath Peri, Weihao Sun, Gautam Singh, Fei Deng, Jindong Jiang, and Sungjin Ahn. SPACE: Unsupervised object-oriented scene representation via spatial attention and decomposition. arXiv preprint arXiv:2001.02407, 2020.

Yitao Liang, Marlos C Machado, Erik Talvitie, and Michael Bowling. State of the art control of Atari games using shallow reinforcement learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pages 485–493. International Foundation for Autonomous Agents and Multiagent Systems, 2016.

Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pages 2863–2871, 2015.

Silvia Chiappa, Sébastien Racaniere, Daan Wierstra, and Shakir Mohamed. Recurrent environment simulators. arXiv preprint arXiv:1704.02254, 2017.

Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model-based reinforcement learning for Atari. arXiv preprint arXiv:1903.00374, 2019.

Marta Garnelo, Kai Arulkumaran, and Murray Shanahan. Towards deep symbolic reinforcement learning. arXiv preprint arXiv:1609.05518, 2016.

Tianfan Xue, Jiajun Wu, Katherine Bouman, and Bill Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In Advances in Neural Information Processing Systems, pages 91–99, 2016.

Frederik Ebert, Chelsea Finn, Alex X Lee, and Sergey Levine. Self-supervised visual planning with temporal skip connections. In Conference on Robot Learning, pages 344–356, 2017.

Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. In Advances in Neural Information Processing Systems, pages 6120–6130, 2017.

Juan Camilo Gamboa Higuera, David Meger, and Gregory Dudek. Synthesizing neural network controllers with probabilistic model-based reinforcement learning. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2538–2544. IEEE, 2018.

Yuezhang Li, Katia Sycara, and Rahul Iyer. Object-sensitive deep reinforcement learning. EPiC Series in Computing, 50:20–35, 2017.

Ken Kansky, Tom Silver, David A Mely, Mohamed Eldawy, Miguel Lázaro-Gredilla, Xinghua Lou, Nimrod Dorfman, Szymon Sidor, Scott Phoenix, and Dileep George. Schema networks: Zero-shot transfer with a generative causal model of intuitive physics. In International Conference on Machine Learning, pages 1809–1818, 2017.

William Woof and Ke Chen. Learning to play general video-games via an object embedding network. In 2018 IEEE Conference on Computational Intelligence and Games (CIG), pages 1–8. IEEE, 2018.

Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.

Yavar Naddaf. Game-independent AI agents for playing Atari 2600 console games. Master's thesis, University of Alberta, 2010.

Rishi Veerapaneni, John D Co-Reyes, Michael Chang, Michael Janner, Chelsea Finn, Jiajun Wu, Joshua B Tenenbaum, and Sergey Levine. Entity abstraction in visual model-based reinforcement learning. arXiv preprint arXiv:1910.12827, 2019.

Ankesh Anand, Evan Racah, Sherjil Ozair, Yoshua Bengio, Marc-Alexandre Côté, and R Devon Hjelm. Unsupervised state representation learning in Atari. In Advances in Neural Information Processing Systems, pages 8766–8779, 2019.

Tejas D Kulkarni, Ankush Gupta, Catalin Ionescu, Sebastian Borgeaud, Malcolm Reynolds, Andrew Zisserman, and Volodymyr Mnih. Unsupervised learning of object keypoints for perception and control. In Advances in Neural Information Processing Systems, pages 10723–10733, 2019.

Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning, 1999.

Harm Van Seijen, Mehdi Fatemi, Joshua Romoff, Romain Laroche, Tavian Barnes, and Jeffrey Tsang. Hybrid reward architecture for reinforcement learning. In Advances in Neural Information Processing Systems, pages 5392–5402, 2017.

Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Dimitri P Bertsekas. Feature-based aggregation and deep reinforcement learning: A survey and some new implementations. IEEE/CAA Journal of Automatica Sinica, 6(1):1–31, 2018.

Richard S Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.

Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.

Alireza Khotanzad and Yaw Hua Hong. Invariant image recognition by Zernike moments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(5):489–497, 1990.

Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3146–3154, 2017.

Pedro A Tsividis, Thomas Pouncy, Jacqueline L Xu, Joshua B Tenenbaum, and Samuel J Gershman. Human learning in Atari. Association for the Advancement of Artificial Intelligence, 2017.


Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

A. Implementation Details

At a low level, our system works by, at each timestep, perceiving objects, tracking those objects onto past objects, and using the object dynamics, reward, and value models to plan and execute an action. As the agent gathers more experience, the object dynamics, reward, and value models are periodically retrained, and the inference procedure detailed in the main paper is also rerun to find the MLE set of objects. In the next sections we give the technical implementation details and hyperparameters.

A.1. Object Perception and Tracking

We perceive and track objects using Algorithm 2. For the base segmentation algorithm S, we use Felzenszwalb segmentation Felzenszwalb and Huttenlocher [2004] with a scale of 1000 and a sigma of 0. For the base tracking algorithm T, we use an ensemble of human-inspired tracking features detailed in Equation 1.

Let loc(o) for o ∈ O be the mean coordinates of the pixels composing object o. Consider two objects, o_p observed in a previous experience, and o_c observed in the current experience. We create a tracking score between the two objects as a linear combination of intuitive features:

$T(o_c, o_p) = w_0 F_{\mathrm{disp}}(o_c, o_p) + w_1 F_{\mathrm{shape}}(o_c, o_p) + w_2 F_{\mathrm{disp}}(o_c, o_p) F_{\mathrm{shape}}(o_c, o_p) + w_3 F_{\mathrm{perm}}(o_c, o_p) + w_4 F_{\mathrm{size}}(o_c, o_p) + w_5 F_{\mathrm{motion}}(o_c, o_p)$  (1)

F_disp(o_c, o_p) is the total change in relative distance between o_c and o_p and all objects contacting o_p, capturing the notion that over small changes in time the positions of objects generally do not change too much.

$F_{\mathrm{shape}}(o_c, o_p) = \sum_{i=1}^{25} |\log(Z_i(o_c)) - \log(Z_i(o_p))|$

where Z_i(o) is the ith Zernike moment (Khotanzad and Hong [1990]) of o, encoding object shape consistency.

F_perm(o_c, o_p) captures the intuition that objects generally do not appear or disappear and is the number of experiences since o_p was last seen.

$F_{\mathrm{size}}(o_c, o_p) = \max\left(\frac{|o_c|}{|o_p|}, \frac{|o_p|}{|o_c|}\right) - 1$

F_motion(o_c, o_p) encodes that many objects are background objects and are unlikely to move:

$F_{\mathrm{motion}}(o_c, o_p) = \log(\max(P_{\mathrm{motion}}(o_c, o_p), \epsilon)),$

where ε is some small positive constant and

$P_{\mathrm{motion}}(o_c, o_p) = \begin{cases} P_{dm}(\mathrm{Moves}(o_p) \mid e_1, \ldots, e_i) & \text{if } \mathrm{loc}(o_c) \neq \mathrm{loc}(o_p) \\ P_{dm}(\mathrm{NotMoves}(o_p) \mid e_1, \ldots, e_i) & \text{if } \mathrm{loc}(o_c) = \mathrm{loc}(o_p) \end{cases}$
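The sketch below assembles the tracking score of Equation 1 from precomputed feature values; the weights are illustrative (the paper does not report the w_i), and it is intended only to show how the six features combine linearly.

```python
def tracking_score(feats, w):
    """Linear tracking score from Equation 1.
    `feats` maps feature names ('disp', 'shape', 'perm', 'size', 'motion') to values;
    `w` is the weight list (w0..w5), with w2 weighing the disp*shape interaction term."""
    return (w[0] * feats["disp"]
            + w[1] * feats["shape"]
            + w[2] * feats["disp"] * feats["shape"]
            + w[3] * feats["perm"]
            + w[4] * feats["size"]
            + w[5] * feats["motion"])

example = {"disp": 2.0, "shape": 0.3, "perm": 1.0, "size": 0.1, "motion": -0.7}
print(tracking_score(example, w=[-0.5, -1.0, -0.2, -0.3, -0.4, 1.0]))  # illustrative weights; higher means a better match
```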

Algorithm 2: Perceive and Track Objects

Input: I^t, observation at current timestep
P^t = {}
O^t = {}
Q = S(I^t)                                // Segment image
foreach q ∈ Q do
    foreach p ∈ P_c do
        H = H ∪ (p, q)                    // Add all possible trackings to history
        M((p, q)) = T_L(p, q)
    end
end
while max_{q ∈ Q, p ∈ P_c} T_L(p, q) ≥ w_t do
    q, p = argmax_{q ∈ Q, p ∈ P_c} T_L(p, q)   // Greedily assign trackings
    p^t = q
    P^t = P^t ∪ {p^t}
    Q = Q \ {q}
end
O = O ∪ Q                                 // Add untracked percepts as new objects
foreach o ∈ O_c do
    if o ⊆ P^t then
        o^t = {p^t ∈ P^t | p^t ∈ o}       // Map percepts to objects
        O^t = O^t ∪ {o^t}
        P^t = P^t \ o
    end
end
return O^t

A.2. Planning

We plan using UCT tree search. We slightly modify the algorithm to evaluate action paths in batches; while this leads to slight suboptimality in action path selection, it leads to significant gains in evaluation speed. We plan with a batch size of 160 for 10 steps, up to plans of 8 actions in length. We use the standard exploration constant of 1/√2. We select the action to take by applying a softmax transform to the action values with a scale of 4000 and sampling an action from the resulting distribution.
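A sketch of this final action-selection step: a softmax over the planner's action values with a fixed scale (4000 above), then sampling from the resulting distribution. The action-value magnitudes in the usage example are made up.

```python
import numpy as np

def sample_action(action_values, scale=4000.0, rng=None):
    """Softmax over planner action values with a fixed scale, then sample an action index."""
    rng = rng or np.random.default_rng()
    logits = scale * np.asarray(action_values, dtype=float)
    logits -= logits.max()                  # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

idx, probs = sample_action([0.0012, 0.0010, 0.0011])  # tiny value differences become clear preferences
print(idx, probs.round(3))
```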

A.3. Learning Dynamics, Value, and Reward Models

Dynamics Models–We model the velocity of each object in each dimension with a separate XGBoost Chen and Guestrin [2016] tree. Each tree outputs a probability vector over velocities, discretized in steps of 1 between -30 and 30. Each tree is trained to minimize multiclass logloss and has a maximum depth of max(2, min(6, |D|/50)), where |D| is the size of the training data. Experiences are split into an 80:20 train/validation split, which is used for early stopping of training (5 rounds) and for computing the model likelihoods used in the Infer algorithm in the main paper. Where objects appear when they transition from not being present to being present is modelled with an XGBoost tree per object per dimension trained to output a probability vector over positions. Each such tree has a maximum depth of 6 and is trained to minimize multiclass logloss, with a similar 80:20 train/validation data split. Object presence is modelled with an XGBoost tree per object, minimizing multiclass logloss on an 80:20 split with a maximum tree depth of 6. For all models, mean prediction was also evaluated, and if mean prediction yields better validation error it was used instead (we did this primarily to prevent objects with very little associated data from causing noisy validation error behavior in the XGBoost trees and interfering with the Infer algorithm).
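A sketch of how one per-object, per-dimension velocity model could be trained with XGBoost under the settings just described (multiclass logloss over velocities discretized from -30 to 30, depth capped by data size, 80:20 split, early stopping after 5 rounds). The feature construction, data, and boosting-round count are placeholders, and exact API details may vary across XGBoost versions.

```python
import numpy as np
import xgboost as xgb

def train_velocity_model(X, velocities, num_class=61, seed=0):
    """Train one velocity classifier: integer velocities in [-30, 30] mapped to classes 0..60."""
    y = np.clip(np.round(velocities), -30, 30).astype(int) + 30
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    split = int(0.8 * n)                               # 80:20 train/validation split
    dtrain = xgb.DMatrix(X[idx[:split]], label=y[idx[:split]])
    dval = xgb.DMatrix(X[idx[split:]], label=y[idx[split:]])
    params = {
        "objective": "multi:softprob",
        "num_class": num_class,
        "max_depth": max(2, min(6, n // 50)),          # depth grows with the amount of data
        "eval_metric": "mlogloss",
    }
    booster = xgb.train(params, dtrain, num_boost_round=200,
                        evals=[(dval, "val")], early_stopping_rounds=5,
                        verbose_eval=False)
    return booster  # booster.predict(DMatrix) yields a probability vector over the 61 velocity classes
```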

Reward and Value Models–Our agent uses Monte Carlo control Sutton [1988] with a discount factor of 0.99 to estimate state-action values. We then learn to predict state-action reward and value with sets of ensemble models for each object pair. Each ensemble model has two parts: the first is simple mean estimation of the reward or value for when the objects are in contact. The second is an XGBoost tree trained to predict value or reward given the absolute and relative positions, velocities, accelerations, and contact states of the object pair. Each tree was trained to predict the mean of a Gaussian distribution with variance 1 describing the reward distribution for each state-action×object pair and trained to maximize likelihood. Data was split into 75:25 train/validation.

A.4. Tracking Model

We learn a tracking model consisting of two LightGBM Ke et al. [2017] trees. We used LightGBM rather than XGBoost trees because the large amount of tracking data made the faster training time of LightGBM advantageous. For each object, we trained a LightGBM tree on the negative tracking instances associated with that object and the tracking instances which were once negative but later learned to be positive via the Infer algorithm in the main paper. We used the tracking features described in Appendix A.1 as inputs. The LightGBM tree was trained to output a number between 0 and 1. We form the learned tracking algorithm T_L by taking the maximum of the base tracking algorithm T and the learned LightGBM tracking algorithm.

B. Human Atari Learning Data

One of the most exciting achievements of reinforcement learning is outperforming humans Mnih et al. [2015] in Atari games, and several years later, human-normalized score is one of the most important metrics for comparing deep learners Hessel et al. [2018]. Interestingly, this key metric is based on the performance of one expert human. However, as seen in the data of a recent study on human play in Atari Tsividis et al. [2017], some of these expert scores are beaten after only 15 minutes of play. In addition, the published data do not include learning curves, just the final scores after learning. Motivated by these facts and growing interest in sample-efficient reinforcement learning, we conducted our own study of human Atari game play for four games to obtain learning curves. Five participants were tasked with playing Asterix, Breakout, Pong, and Video Pinball for 15 minutes each. To ensure the environment the humans were tested on was identical to the environment we trained our agents on, we used OpenAI Gym Brockman et al. [2016] as the Atari emulator. We found that there is very high variance in score between participants, even on Asterix, which participants were unlikely to have played before. In every game at least one participant was able to beat the expert score in 15 minutes of play, and in all but Pong the average maximum participant score exceeded the expert score. We suspect this is because of differences in the Atari emulator used.