

A Geometric Perspective on Optimal Representations for Reinforcement Learning

Marc G. Bellemare1, Will Dabney2, Robert Dadashi1, Adrien Ali Taiga1,3, Pablo Samuel Castro1, Nicolas Le Roux1, Dale Schuurmans1,4, Tor Lattimore2, Clare Lyle5

Abstract

We propose a new perspective on representation learning in reinforcement learning based on geometric properties of the space of value functions. We leverage this perspective to provide formal evidence regarding the usefulness of value functions as auxiliary tasks. Our formulation considers adapting the representation to minimize the (linear) approximation of the value function of all stationary policies for a given environment. We show that this optimization reduces to making accurate predictions regarding a special class of value functions which we call adversarial value functions (AVFs). We demonstrate that using value functions as auxiliary tasks corresponds to an expected-error relaxation of our formulation, with AVFs a natural candidate, and identify a close relationship with proto-value functions (Mahadevan, 2005). We highlight characteristics of AVFs and their usefulness as auxiliary tasks in a series of experiments on the four-room domain.

1 Introduction

A good representation of state is key to practical success in reinforcement learning. While early applications used hand-engineered features (e.g. Samuel, 1959), these have proven onerous to generate and difficult to scale. As a result, methods in representation learning have flourished, ranging from basis adaptation (Menache et al., 2005; Keller et al., 2006), gradient-based learning (Yu and Bertsekas, 2009), proto-value functions (Mahadevan and Maggioni, 2007), feature generation schemes such as tile coding (Sutton, 1996) and the domain-independent features used in some Atari 2600 game-playing agents (Bellemare et al., 2013; Liang et al., 2016), and nonparametric methods (Ernst et al., 2005; Farahmand et al., 2016; Tosatto et al., 2017). Today, the method of choice is deep learning. Deep learning has made its mark by showing it can learn complex representations of relatively unprocessed inputs using gradient-based optimization (Tesauro, 1995; Mnih et al., 2015; Silver et al., 2016).

Most current deep reinforcement learning methods augment their main objective with additional losses called auxiliary tasks, typically with the aim of facilitating and regularizing the representation learning process. The UNREAL algorithm, for example, makes predictions about future pixel values (Jaderberg et al., 2017); recent work approximates a one-step transition model to achieve a similar effect (François-Lavet et al., 2018; Gelada et al., 2019). The good empirical performance of distributional reinforcement learning (Bellemare et al., 2017) has also been attributed to representation learning effects, with recent visualizations supporting this claim (Such et al., 2019). However, while there is now conclusive empirical evidence of the usefulness of auxiliary tasks, their design and justification remain on the whole ad hoc. One of our main contributions is to provide a formal framework in which to reason about auxiliary tasks in reinforcement learning.

We begin by formulating an optimization problem whose solution is a form of optimal representation. Specifically, we seek a state representation from which we can best approximate the value function of any stationary policy for a given Markov Decision Process. Simultaneously, the largest approximation

1Google Research 2DeepMind 3Mila, Université de Montréal 4University of Alberta 5University of Oxford

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.


error in that class serves as a measure of the quality of the representation. While our approach may appear naive – in real settings, most policies are uninteresting and hence may distract the representation learning process – we show that our representation learning problem can in fact be restricted to a special subset of value functions which we call adversarial value functions (AVFs). We then characterize these adversarial value functions and show they correspond to deterministic policies that either minimize or maximize the expected return at each state, based on the solution of a network-flow optimization derived from an interest function δ.

A consequence of our work is to formalize why predicting value function-like objects is helpful in learning representations, as has been argued in the past (Sutton et al., 2011, 2016). We show how using these predictions as auxiliary tasks can be interpreted as a relaxation of our optimization problem. From our analysis, we hypothesize that auxiliary tasks that resemble adversarial value functions should give rise to good representations in practice. We complement our theoretical results with an empirical study in a simple grid world environment, focusing on the use of deep learning techniques to learn representations. We find that predicting adversarial value functions as auxiliary tasks leads to rich representations.

2 Setting

We consider an environment described by a Markov Decision Process ⟨X, A, r, P, γ⟩ (Puterman, 1994); X and A are finite state and action spaces, P : X × A → P(X) is the transition function, γ the discount factor, and r : X → ℝ the reward function. For a finite set S, write P(S) for the probability simplex over S. A (stationary) policy π is a mapping X → P(A), also denoted π(a | x). We denote the set of policies by P = P(A)^X. We combine a policy π with the transition function P to obtain the state-to-state transition function P^π(x′ | x) := ∑_{a∈A} π(a | x) P(x′ | x, a). The value function V^π describes the expected discounted sum of rewards obtained by following π:

\[
V^\pi(x) = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t r(x_t) \;\Big|\; x_0 = x,\ x_{t+1} \sim P^\pi(\cdot \mid x_t) \Big].
\]

The value function satisfies Bellman's equation (Bellman, 1957): V^π(x) = r(x) + γ E_{P^π} V^π(x′). Assuming there are n = |X| states, we view r and V^π as vectors in ℝ^n and P^π ∈ ℝ^{n×n}, such that

\[
V^\pi = r + \gamma P^\pi V^\pi = (I - \gamma P^\pi)^{-1} r.
\]
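As a concrete illustration (not part of the original paper), the closed-form expression above can be evaluated directly for a small tabular MDP. The sketch below uses NumPy, with a randomly generated transition function and reward standing in for a real environment; all names are hypothetical.

```python
import numpy as np

def policy_transition_matrix(P, pi):
    """P: (n, A, n) transition tensor, pi: (n, A) policy; returns P_pi of shape (n, n)."""
    return np.einsum("xa,xay->xy", pi, P)

def value_function(P, r, pi, gamma=0.9):
    """Solve V^pi = (I - gamma P_pi)^{-1} r for a tabular MDP."""
    P_pi = policy_transition_matrix(P, pi)
    n = len(r)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r)

# Tiny synthetic example (hypothetical MDP): 5 states, 2 actions.
rng = np.random.default_rng(0)
n, A = 5, 2
P = rng.random((n, A, n)); P /= P.sum(axis=2, keepdims=True)
r = rng.random(n)
pi = np.full((n, A), 1.0 / A)          # uniformly random policy
print(value_function(P, r, pi))
```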

A d-dimensional representation is a mapping φ : X → ℝ^d; φ(x) is the feature vector for state x. We write Φ ∈ ℝ^{n×d} to denote the matrix whose rows are φ(x), x ∈ X, and with some abuse of notation denote the set of d-dimensional representations by R ≡ ℝ^{n×d}. For a given representation and weight vector θ ∈ ℝ^d, the linear approximation for a value function is

\[
V_{\phi,\theta}(x) := \phi(x)^\top \theta. \tag{1}
\]

We consider the approximation minimizing the uniformly weighted squared error

\[
\big\| V_{\phi,\theta} - V^\pi \big\|_2^2 = \sum_{x \in \mathcal{X}} \big( \phi(x)^\top \theta - V^\pi(x) \big)^2.
\]

We denote by V^π_φ the projection of V^π onto the linear subspace H = {Φθ : θ ∈ ℝ^d}.
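For a fixed representation the minimizing θ is an ordinary least-squares solution, so the projection V^π_φ can be computed directly. The snippet below is a small sketch; the matrix Phi and the vector v_pi are assumed to come from the example above or any tabular setup.

```python
import numpy as np

def project_value(Phi, v_pi):
    """Orthogonal projection of V^pi onto H = {Phi @ theta}: returns (theta*, V^pi_phi)."""
    theta, *_ = np.linalg.lstsq(Phi, v_pi, rcond=None)
    return theta, Phi @ theta

# Example with a random 2-dimensional representation of 5 states (hypothetical values).
rng = np.random.default_rng(1)
Phi = rng.normal(size=(5, 2))
v_pi = rng.normal(size=5)
theta_star, v_proj = project_value(Phi, v_pi)
approx_error = np.sum((v_proj - v_pi) ** 2)   # the squared error defined above
```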

2.1 Two-Part Networks

Most deep networks used in value-based reinforcement learning can be modelled as two interacting parts φ and θ which give rise to a linear approximation (Figure 1, left). Here, the representation φ can also be adjusted and is almost always nonlinear in x. Two-part networks are a simple framework in which to study the behaviour of representation learning in deep reinforcement learning. We will especially consider the use of φ(x) to make additional predictions, called auxiliary tasks following common usage, and whose purpose is to improve or stabilize the representation.
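To make the two-part decomposition concrete, here is a minimal sketch (not from the paper) of such a network in NumPy: a nonlinear representation followed by a linear value head, plus linear heads for auxiliary predictions. The weight shapes and the tanh nonlinearity are illustrative choices.

```python
import numpy as np

class TwoPartNetwork:
    """phi(x) = tanh(W x + b); value and auxiliary outputs are linear in phi(x)."""
    def __init__(self, input_dim, d, num_aux, rng):
        self.W = rng.normal(scale=0.1, size=(d, input_dim))
        self.b = np.zeros(d)
        self.theta = np.zeros(d)                 # main value head
        self.theta_aux = np.zeros((num_aux, d))  # auxiliary-task heads

    def phi(self, x):
        return np.tanh(self.W @ x + self.b)

    def forward(self, x):
        f = self.phi(x)
        return f @ self.theta, self.theta_aux @ f  # (value estimate, auxiliary predictions)
```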

We study two-part networks in an idealized setting where the length d of φ(x) is fixed and smaller than n, but the mapping is otherwise unconstrained. Even this idealized design offers interesting


Figure 1: Left. A deep network viewed as a composition of two parts, one linear and one not. Right. The optimal representation φ∗ is a linear subspace that cuts through the value polytope.

problems to study. We might be interested in sharing a representation across problems, as is often done in transfer or continual learning. In this context, auxiliary tasks may inform how the value function should generalize to these new problems. In many problems of interest, the weights θ can also be optimized more efficiently than the representation itself, warranting the view that the representation should be adapted using a different process (Levine et al., 2017; Chung et al., 2019).

Note that a trivial “value-as-feature” representation exists for the single-policy optimization problem

\[
\text{minimize } \big\| V^\pi_\phi - V^\pi \big\|_2^2 \quad \text{w.r.t. } \phi \in \mathcal{R};
\]

this approximation sets φ(x) = V^π(x) and θ = 1. In this paper we take the stance that this is not a satisfying representation, and that a good representation should be in the service of a broader goal (e.g. control, transfer, or fairness).
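The triviality is easy to see numerically: with a one-dimensional feature equal to V^π itself, the single-policy error is exactly zero, while nothing forces that feature to help with any other policy. A small hedged sketch, reusing the hypothetical MDP helpers defined above:

```python
# "Value-as-feature" representation for a single policy: zero error by construction.
v_pi = value_function(P, r, pi)            # from the earlier sketch
Phi_trivial = v_pi.reshape(-1, 1)          # phi(x) = V^pi(x), so d = 1
theta, v_hat = project_value(Phi_trivial, v_pi)
assert np.allclose(v_hat, v_pi)            # theta is 1 and the error is 0

# The same feature generally approximates a different policy's value poorly.
pi_other = np.eye(A)[rng.integers(A, size=n)]   # a random deterministic policy
v_other = value_function(P, r, pi_other)
_, v_other_hat = project_value(Phi_trivial, v_other)
```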

3 Representation Learning by Approximating Value Functions

We measure the quality of a representation φ in terms of how well it can approximate all possible value functions, formalized as the representation error

\[
L(\phi) := \max_{\pi \in \mathcal{P}} L(\phi; \pi), \qquad L(\phi; \pi) := \big\| V^\pi_\phi - V^\pi \big\|_2^2 .
\]

We consider the problem of finding the representation φ ∈ R minimizing L(φ):

\[
\text{minimize } \max_{\pi \in \mathcal{P}} \big\| V^\pi_\phi - V^\pi \big\|_2^2 \quad \text{w.r.t. } \phi \in \mathcal{R}. \tag{2}
\]

In the context of our work, we call this the representation learning problem (RLP) and say that a representation φ∗ is optimal when it minimizes the error in (2). Note that L(φ) (and hence φ∗) depends on characteristics of the environment, in particular on both reward and transition functions.
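In a small tabular MDP, L(φ) can be estimated by brute force: enumerate (or sample) deterministic policies, compute each value function exactly, project it onto the span of Φ, and keep the worst squared error. The sketch below samples deterministic policies, which only lower-bounds the true maximum; it reuses the hypothetical helpers defined earlier.

```python
def representation_error(Phi, P, r, gamma=0.9, num_policies=200, rng=None):
    """Monte-Carlo lower bound on L(phi): worst projection error over sampled
    deterministic policies (exhaustive enumeration is exponential in n)."""
    rng = rng or np.random.default_rng()
    n, A = P.shape[0], P.shape[1]
    worst = 0.0
    for _ in range(num_policies):
        pi = np.eye(A)[rng.integers(A, size=n)]       # random deterministic policy
        v = value_function(P, r, pi, gamma)
        _, v_hat = project_value(Phi, v)
        worst = max(worst, float(np.sum((v_hat - v) ** 2)))
    return worst
```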

We consider the RLP from a geometric perspective (Figure 1, right). Dadashi et al. (2019) showed that the set of value functions achieved by the set of policies P, denoted V := {V^π ∈ ℝ^n : π ∈ P}, forms a (possibly nonconvex) polytope. As previously noted, a representation φ defines a subspace H of possible value approximations. The maximal error is achieved by the value function in V which is furthest along the subspace normal to H, since V^π_φ is the orthogonal projection of V^π.

We say that V ∈ V is an extremal vertex if it is a vertex of the convex hull of V. We will make use of the relationship between directions δ ∈ ℝ^n, the set of extremal vertices, and the set of deterministic policies. The following lemma, based on a well-known notion of duality from convex analysis (Boyd and Vandenberghe, 2004), states this relationship formally.

Lemma 1. Let δ ∈ ℝ^n and define the functional f_δ(V) := δ^⊤V, with domain V. Then f_δ is maximized by an extremal vertex U ∈ V, and there is a deterministic policy π for which V^π = U. Furthermore, the set of directions δ ∈ ℝ^n for which the maximum of f_δ is achieved by multiple extremal vertices has Lebesgue measure zero in ℝ^n.

Denote by P_v the set of policies corresponding to extremal vertices of V. We next derive an equivalence between the RLP and an optimization problem which only considers policies in P_v.


Theorem 1. For any representation φ ∈ R, the maximal approximation error measured over all value functions is the same as the error measured over the set of extremal vertices:

\[
\max_{\pi \in \mathcal{P}} \big\| V^\pi_\phi - V^\pi \big\|_2^2 = \max_{\pi \in \mathcal{P}_v} \big\| V^\pi_\phi - V^\pi \big\|_2^2 .
\]

Theorem 1 indicates that we can find an optimal representation by considering a finite (albeit exponential) number of value functions: each extremal vertex corresponds to the value function of some deterministic policy, of which there are at most an exponential number. We will call these adversarial value functions (AVFs), because of the minimax flavour of the RLP.

Solving the RLP allows us to provide quantifiable guarantees on the performance of certain value-based learning algorithms. For example, in the context of least-squares policy iteration (LSPI; Lagoudakis and Parr, 2003), minimizing the representation error L directly improves the performance bound. By contrast, we cannot have the same guarantee if φ is learned by minimizing the approximation error for a single value function.

Corollary 1. Let φ∗ be an optimal representation in the RLP. Consider the sequence of policies π_0, π_1, . . . derived from LSPI using φ∗ to approximate V^{π_0}, V^{π_1}, . . . under a uniform sampling of the state space. Then there exists an MDP-dependent constant C ∈ ℝ such that

\[
\limsup_{k \to \infty} \big\| V^* - V^{\pi_k} \big\|_2^2 \le C \, L(\phi^*).
\]

This result is a direct application of the quadratic norm bounds given by Munos (2003), in whose work the constant is made explicit. We emphasize that the result is illustrative; our approach should enable similar guarantees in other contexts (e.g. Munos, 2007; Petrik and Zilberstein, 2011).

3.1 The Structure of Adversarial Value Functions

The RLP suggests that an agent trained to predict various value functions should develop a good state representation. Intuitively, one may worry that there are simply too many “uninteresting” policies, and that a representation learned from their value functions emphasizes the wrong quantities. However, the search for an optimal representation φ∗ is closely tied to the much smaller set of adversarial value functions (AVFs). The aim of this section is to characterize the structure of AVFs and show that they form an interesting subset of all value functions. From this, we argue that their use as auxiliary tasks should also produce structured representations.

From Lemma 1, recall that an AVF is geometrically defined using a vector δ ∈ ℝ^n and the functional f_δ(V) := δ^⊤V, which the AVF maximizes. Since f_δ is restricted to the value polytope, we can consider the equivalent policy-space functional g_δ : π ↦ δ^⊤V^π. Observe that

\[
\max_{\pi \in \mathcal{P}} g_\delta(\pi) = \max_{\pi \in \mathcal{P}} \delta^\top V^\pi = \max_{\pi \in \mathcal{P}} \sum_{x \in \mathcal{X}} \delta(x) V^\pi(x). \tag{3}
\]

In this optimization problem, the vector δ defines a weighting over the state space X; for this reason, we call δ an interest function in the context of AVFs. Whenever δ ≥ 0 componentwise, we recover the optimal value function, irrespective of the exact magnitude of δ (Bertsekas, 2012). If δ(x) < 0 for some x, however, the maximization becomes a minimization. As the next result shows, the policy maximizing g_δ(π) depends on a network flow d_π derived from δ and the transition function P.

Theorem 2. Maximizing the functional g_δ is equivalent to finding a network flow d_π that satisfies a reverse Bellman equation:

\[
\max_{\pi \in \mathcal{P}} \delta^\top V^\pi = \max_{\pi \in \mathcal{P}} d_\pi^\top r, \qquad d_\pi = \delta + \gamma (P^\pi)^\top d_\pi.
\]

For a policy π maximizing the above we have

\[
V^\pi(x) = r(x) + \gamma
\begin{cases}
\max_{a \in \mathcal{A}} \mathbb{E}_{x' \sim P} V^\pi(x') & d_\pi(x) > 0, \\
\min_{a \in \mathcal{A}} \mathbb{E}_{x' \sim P} V^\pi(x') & d_\pi(x) < 0.
\end{cases}
\]

Corollary 2. There are at most 2^n distinct adversarial value functions.

The vector d_π corresponds to the sum of discounted interest weights flowing through a state x, similar to the dual variables in the theory of linear programming for MDPs (Puterman, 1994). Theorem 2, by way of the corollary, implies that there are fewer AVFs (≤ 2^n) than deterministic policies (= |A|^n). It also implies that AVFs relate to a reward-driven purpose, similar to how the optimal value function describes the goal of maximizing return. We will illustrate this point empirically in Section 4.1.
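For a fixed policy the flow equation is linear, so d_π can be computed the same way as V^π but with the transposed transition matrix. The sketch below (again relying on the hypothetical helpers above) computes d_π for a given δ and checks the identity δ^⊤V^π = d_π^⊤ r from Theorem 2, which holds for any fixed policy.

```python
def network_flow(P, pi, delta, gamma=0.9):
    """Solve d_pi = delta + gamma * P_pi^T d_pi for a fixed policy pi."""
    P_pi = policy_transition_matrix(P, pi)
    n = len(delta)
    return np.linalg.solve(np.eye(n) - gamma * P_pi.T, delta)

delta = rng.choice([-1.0, 0.0, 1.0], size=n)     # a random interest function
pi_det = np.eye(A)[rng.integers(A, size=n)]      # a candidate deterministic policy
d_pi = network_flow(P, pi_det, delta)
v_pi_det = value_function(P, r, pi_det)
# Both sides of delta^T V^pi = d_pi^T r agree up to numerical error:
lhs, rhs = delta @ v_pi_det, d_pi @ r
```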


3.2 Relationship to Auxiliary Tasks

So far we have argued that solving the RLP leads to a representation which is optimal in a meaningful sense. However, solving the RLP seems computationally intractable: there are an exponential number of deterministic policies to consider (Prop. 1 in the appendix gives a quadratic formulation with quadratic constraints). Using interest functions does not mitigate this difficulty: the computational problem of finding the AVF for a single interest function is NP-hard, even when restricted to deterministic MDPs (Prop. 2 in the appendix).

Instead, in this section we consider a relaxation of the RLP and show that this relaxation describes existing representation learning methods, in particular those that use auxiliary tasks. Let ξ be some distribution over ℝ^n. We begin by replacing the maximum in (2) by an expectation:

\[
\text{minimize } \mathbb{E}_{V \sim \xi} \big\| V_\phi - V \big\|_2^2 \quad \text{w.r.t. } \phi \in \mathcal{R}. \tag{4}
\]

The use of the expectation offers three practical advantages over the use of the maximum. First, this leads to a differentiable objective which can be minimized using deep learning techniques. Second, the choice of ξ gives us an additional degree of freedom; in particular, ξ need not be restricted to the value polytope. Third, the minimizer in (4) is easily characterized, as the following theorem shows.

Theorem 3. Let u*_1, . . . , u*_d ∈ ℝ^n be the principal components of the distribution ξ, in the sense that

\[
u^*_i := \arg\max_{u \in B_i} \mathbb{E}_{V \sim \xi} (u^\top V)^2, \quad \text{where } B_i := \{ u \in \mathbb{R}^n : \|u\|_2^2 = 1,\ u^\top u^*_j = 0 \ \forall j < i \}.
\]

Equivalently, u*_1, . . . , u*_d are the eigenvectors of E_ξ[V V^⊤] ∈ ℝ^{n×n} with the d largest eigenvalues. Then the matrix [u*_1, . . . , u*_d] ∈ ℝ^{n×d}, viewed as a map X → ℝ^d, is a solution to (4). When the principal components are uniquely defined, any minimizer of (4) spans the same subspace as u*_1, . . . , u*_d.
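Theorem 3 says the relaxed problem is solved by a principal-component analysis of the task distribution. A minimal sketch, assuming we already have a matrix whose rows are sampled value functions (here, values of random deterministic policies from the earlier hypothetical helpers):

```python
def pca_representation(V_samples, d):
    """Top-d eigenvectors of E[V V^T], used as a representation Phi in R^{n x d}."""
    second_moment = V_samples.T @ V_samples / len(V_samples)   # approximates E[V V^T]
    eigvals, eigvecs = np.linalg.eigh(second_moment)           # ascending eigenvalues
    return eigvecs[:, -d:][:, ::-1]                            # d largest, in order

# Sample value functions of random deterministic policies as the distribution xi.
V_samples = np.stack([
    value_function(P, r, np.eye(A)[rng.integers(A, size=n)]) for _ in range(100)
])
Phi_pca = pca_representation(V_samples, d=2)
```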

One may expect the quality of the learned representation to depend on how closely the distribution ξ relates to the RLP. From an auxiliary tasks perspective, this corresponds to choosing tasks that are in some sense useful. For example, generating value functions from the uniform distribution over the set of policies P, while a natural choice, may put too much weight on “uninteresting” value functions.

In practice, we may further restrict ξ to a finite set V. Under a uniform weighting, this leads to a representation loss

\[
L(\phi; \mathcal{V}) := \sum_{V \in \mathcal{V}} \big\| V_\phi - V \big\|_2^2 \tag{5}
\]

which corresponds to the typical formulation of an auxiliary-task loss (e.g. Jaderberg et al., 2017). In a deep reinforcement learning setting, one typically minimizes (5) using stochastic gradient descent methods, which scale better than batch methods such as singular value decomposition (but see Wu et al. (2019) for further discussion).
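As an illustration (not the paper's implementation), the loss in (5) can be minimized with plain gradient descent on the two-part network from the earlier sketch: each auxiliary head regresses onto one target value function and gradients flow into the shared representation. The targets are assumed to be tabular value functions indexed by state, and states are one-hot encoded.

```python
def auxiliary_task_loss_step(net, targets, states_onehot, lr=0.1):
    """One gradient step on L(phi; V) = sum_k ||phi(X) w_k - V_k||^2 for the
    TwoPartNetwork sketch above, with one-hot state inputs (tabular setting)."""
    grad_W = np.zeros_like(net.W)
    grad_heads = np.zeros_like(net.theta_aux)
    loss = 0.0
    for k, v_target in enumerate(targets):             # targets: (num_aux, n)
        pre = states_onehot @ net.W.T + net.b           # (n, d) pre-activations
        F = np.tanh(pre)                                # phi(x) for every state
        err = F @ net.theta_aux[k] - v_target           # (n,) prediction error
        loss += np.sum(err ** 2)
        grad_heads[k] = 2.0 * F.T @ err
        dF = 2.0 * np.outer(err, net.theta_aux[k]) * (1.0 - F ** 2)  # backprop tanh
        grad_W += dF.T @ states_onehot
    net.theta_aux -= lr * grad_heads
    net.W -= lr * grad_W
    return loss

net = TwoPartNetwork(input_dim=n, d=2, num_aux=3, rng=rng)
targets = V_samples[:3]                # a few sampled value functions as auxiliary tasks
for _ in range(200):
    auxiliary_task_loss_step(net, targets, np.eye(n))
```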

Our analysis leads us to conclude that, in many cases of interest, the use of auxiliary tasks produces representations that are close to the principal components of the set of tasks under consideration. If V is well-aligned with the RLP, minimizing L(φ; V) should give rise to a reasonable representation. To demonstrate the power of this approach, in Section 4 we will study the case when the set V is constructed by sampling AVFs – emphasizing the policies that support the solution to the RLP.

3.3 Relationship to Proto-Value Functions

Proto-value functions (Mahadevan and Maggioni, 2007, PVF) are a family of representations which vary smoothly across the state space. Although the original formulation defines this representation as the largest-eigenvalue eigenvectors of the Laplacian of the transition function's graphical structure, recent formulations use the top singular vectors of (I − γP^π)^{−1}, where π is the uniformly random policy (Stachenfeld et al., 2014; Machado et al., 2017; Behzadian and Petrik, 2018).

In line with the analysis of the previous section, proto-value functions can also be interpreted as defining a set of value-based auxiliary tasks. Specifically, if we define an indicator reward function r_y(x) := I[x = y] and a set of value functions V = {(I − γP^π)^{−1} r_y}_{y∈X} with π the uniformly random policy, then any d-dimensional representation that minimizes (5) spans the same basis as the d-dimensional PVF (up to the bias term). This suggests a connection with hindsight experience replay (Andrychowicz et al., 2017), whose auxiliary tasks consist in reaching previously experienced states.
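This construction is easy to write down for a tabular MDP: each column of (I − γP^π)^{−1} is the value function of one indicator reward, and the top-d subspace of that collection recovers PVF-style features. A short sketch under the same hypothetical setup as before:

```python
def pvf_auxiliary_tasks(P, gamma=0.9):
    """Value functions of indicator rewards under the uniformly random policy:
    column y of the result is V_y = (I - gamma P_pi)^{-1} e_y."""
    n, A = P.shape[0], P.shape[1]
    pi_uniform = np.full((n, A), 1.0 / A)
    P_pi = policy_transition_matrix(P, pi_uniform)
    return np.linalg.inv(np.eye(n) - gamma * P_pi)    # (n, n); columns are the V_y

V_pvf_tasks = pvf_auxiliary_tasks(P)
Phi_pvf = pca_representation(V_pvf_tasks.T, d=2)      # rows as samples, as in (5)
```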


4 Empirical Studies

In this section we complement our theoretical analysis with an experimental study. In turn, we take a closer look at 1) the structure of adversarial value functions, 2) the shape of representations learned using AVFs, and 3) the performance profile of these representations in a control setting. Our eventual goal is to demonstrate that the RLP, which is based on approximating value functions, gives rise to representations that are both interesting and comparable to previously proposed schemes. Our concrete instantiation (Algorithm 1) uses the representation loss (5). As-is, this algorithm is of limited practical relevance (our AVFs are learned using a tabular representation) but we believe it provides an inspirational basis for further developments.

Algorithm 1 Representation learning using AVFs
input: k – desired number of AVFs, d – desired number of features.
  Sample δ_1, . . . , δ_k ∼ [−1, 1]^n
  Compute µ_i = arg max_π δ_i^⊤ V^π using a policy gradient method
  Find φ∗ = arg min_φ L(φ; {V^{µ_1}, . . . , V^{µ_k}}) (Equation 5)
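A hedged end-to-end sketch of Algorithm 1 for the tabular setup used throughout these snippets, with the expensive policy-gradient step replaced by a crude search over sampled deterministic policies (only an approximation of the arg max):

```python
def algorithm1_avf_representation(P, r, k, d, gamma=0.9, search_size=500, rng=None):
    """Sample k interest functions, approximate each AVF by searching over random
    deterministic policies, then take the top-d principal components of the AVFs."""
    rng = rng or np.random.default_rng()
    n, A = P.shape[0], P.shape[1]
    avfs = []
    for _ in range(k):
        delta = rng.uniform(-1.0, 1.0, size=n)
        best_v, best_score = None, -np.inf
        for _ in range(search_size):                  # stand-in for policy gradient
            pi = np.eye(A)[rng.integers(A, size=n)]
            v = value_function(P, r, pi, gamma)
            score = delta @ v
            if score > best_score:
                best_score, best_v = score, v
        avfs.append(best_v)
    return pca_representation(np.stack(avfs), d)      # Phi via Theorem 3 / loss (5)

Phi_avf = algorithm1_avf_representation(P, r, k=10, d=2, rng=rng)
```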

We perform all of our experiments within the four-room domain (Sutton et al., 1999; Solway et al., 2014; Machado et al., 2017, Figure 2; see also Appendix H.1).


Figure 2: Leftmost. The four-room domain. Other panels. An interest function δ, the network flow dπ, the corresponding adversarial value function (blue/red = low/high value), and its policy.

We consider a two-part network where we pretrain φ end-to-end to predict a set of value functions. Our aim here is to compare the effects of using different sets of value functions, including AVFs, on the learned representation. As our focus is on the efficient use of a d-dimensional representation (with d < n, the number of states), we encode individual states as one-hot vectors and map them into φ(x) without capacity constraints. Additional details may be found in Appendix H.
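For concreteness, the sketch below shows one way such a two-part network can be set up: a representation φ mapping one-hot states to d = 16 features, with a separate linear prediction head (plus bias unit) per value-function target. This is an illustrative NumPy sketch under assumed layer sizes, not the code used for the experiments.

```python
# Minimal sketch of a two-part network (representation phi + linear heads).
# Layer sizes and initialization are illustrative assumptions, not the paper's.
import numpy as np

n_states, d, k = 104, 16, 1000                     # states, feature dim, number of value targets
rng = np.random.default_rng(0)

W1 = rng.normal(scale=0.1, size=(n_states, 64))    # hypothetical hidden layer
W2 = rng.normal(scale=0.1, size=(64, d))           # maps to the d-dimensional representation phi(x)
theta = rng.normal(scale=0.1, size=(d + 1, k))     # one linear head per predicted value function (+ bias)

def phi(one_hot_states):
    """Representation phi(x): one-hot states -> d features."""
    h = np.maximum(one_hot_states @ W1, 0.0)       # ReLU hidden layer
    return np.maximum(h @ W2, 0.0)

def predict_values(one_hot_states):
    """Linear value predictions on top of phi, with an appended bias unit."""
    f = phi(one_hot_states)
    f = np.concatenate([f, np.ones((f.shape[0], 1))], axis=1)
    return f @ theta                               # shape (batch, k)

states = np.eye(n_states)                          # every state as a one-hot vector
print(predict_values(states).shape)                # (104, 1000)
```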

4.1 Adversarial Value Functions

Our first set of results studies the structure of adversarial value functions in the four-room domain. We generated interest functions by assigning a value δ(x) ∈ {−1, 0, 1} uniformly at random to each state x (Figure 2, left). We restricted δ to these discrete choices for illustrative purposes.
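A minimal sketch of this sampling step (the state count and random seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states = 104                                    # number of states in the four-room domain
delta = rng.choice([-1, 0, 1], size=n_states)     # interest function delta(x) in {-1, 0, 1}
```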

We then used model-based policy gradient (Sutton et al., 2000) to find the policy maximizing ∑_{x∈X} δ(x) V^π(x). We observed some local minima or accumulation points, but on the whole reasonable solutions were found. The resulting network flow and AVF for a particular sample are shown in Figure 2. For most states, the signs of δ and dπ agree; however, this is not true of all states (larger version and more examples in appendix, Figures 6 and 7). As expected, states for which dπ > 0 (respectively, dπ < 0) correspond to states maximizing (resp. minimizing) the value function. Finally, we remark on the "flow" nature of dπ: trajectories over minimizing states accumulate in corners or loops, while those over maximizing states flow to the goal. We conclude that AVFs exhibit interesting structure, and are generated by policies that are not random (Figure 2, right). As we will see next, this is a key differentiator in making AVFs good auxiliary tasks.

4.2 Representation Learning with AVFs

We next consider the representations that arise from training a deep network to predict AVFs (denoted AVF from here on). We sample k = 1000 interest functions and use Algorithm 1 to generate k AVFs.


We combine these AVFs into the representation loss (5) and adapt the parameters of the deep network using RMSProp (Tieleman and Hinton, 2012).
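For reference, RMSProp scales each gradient by a running average of its squared magnitude; a minimal sketch of one update step follows (the representation loss itself, Equation (5), is assumed here to be the squared prediction error against the k AVF targets and is not shown):

```python
import numpy as np

def rmsprop_update(param, grad, avg_sq, lr=1e-3, decay=0.9, eps=1e-8):
    """One RMSProp step: divide the gradient by a running average of its recent magnitude."""
    avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2
    return param - lr * grad / (np.sqrt(avg_sq) + eps), avg_sq
```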

We contrast the AVF-driven representation with one learned by predicting the value function of random deterministic policies (RP). Specifically, these policies are generated by assigning an action uniformly at random to each state. We also consider the value function of the uniformly random policy (VALUE). While we make these choices here for concreteness, other experiments yielded similar results (e.g. predicting the value of the optimal policy; appendix, Figure 8). In all cases, we learn a d = 16 dimensional representation, not including the bias unit.
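A sketch of how the two baselines' targets can be computed, assuming access to the MDP's transition tensor and a state-based reward vector (variable names are ours, and the MDP here is a small random one rather than the four-room domain):

```python
import numpy as np

def value_of_policy(pi, P, r, gamma=0.99):
    """V^pi = (I - gamma P_pi)^(-1) r for a stochastic policy pi of shape (n, A)."""
    n = P.shape[1]
    P_pi = np.einsum('xa,axy->xy', pi, P)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r)

def random_deterministic_policy(n, A, rng):
    """RP baseline: one action chosen uniformly at random in every state."""
    pi = np.zeros((n, A))
    pi[np.arange(n), rng.integers(A, size=n)] = 1.0
    return pi

rng = np.random.default_rng(2)
n, A, gamma = 10, 4, 0.9
P = rng.dirichlet(np.ones(n), size=(A, n))       # assumed transition tensor
r = rng.normal(size=n)                           # assumed state-based reward

rp_values = [value_of_policy(random_deterministic_policy(n, A, rng), P, r, gamma)
             for _ in range(1000)]               # targets for the RP representation
uniform_value = value_of_policy(np.full((n, A), 1.0 / A), P, r, gamma)   # VALUE baseline target
```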

Figure 3: 16-dimensional representations learned by predicting a single value function (left), the value functions of 1000 random policies (middle), or 1000 AVFs sampled using Algorithm 1 (right). Each panel element depicts the activation of a given feature across states, with blue/red indicating low/high activation.

Figure 3 shows the representations learned by the three methods. The features learned by VALUE resemble the value function itself (top left feature) or its negated image (bottom left feature). Coarsely speaking, these features capture the general distance to the goal but little else. The features learned by RP are of even worse quality. This is because almost all random deterministic policies cause the agent to avoid the goal (appendix, Figure 12). The representation learned by AVF, on the other hand, captures the structure of the domain, including paths between distal states and focal points corresponding to rooms or parts of rooms.

Although our focus is on the use of AVFs as auxiliary tasks to a deep network, we observe the same results when discovering a representation using singular value decomposition (Section 3.2), as described in Appendix I. All in all, our results illustrate that, among all value functions, AVFs are particularly useful auxiliary tasks for representation learning.
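As an illustration of the SVD alternative, one can stack the k sampled AVFs into an n × k matrix and keep its top-d left singular vectors as features; this is a hedged reading of the Section 3.2 recipe, not a verbatim reproduction:

```python
import numpy as np

def svd_features(value_matrix, d=16):
    """value_matrix: (n_states, k) stack of value functions; returns an (n_states, d) representation."""
    U, _, _ = np.linalg.svd(value_matrix, full_matrices=False)
    return U[:, :d]
```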

4.3 Learning the Optimal Policy

Figure 4: Average discounted return achieved by policies learned using a representation produced by VALUE, AVF, or PVF. Average is over 20 random seeds and shading gives standard deviation.

In a final set of experiments, we consider learning a reward-maximizing policy using a pretrained representation and a model-based version of the SARSA algorithm (Rummery and Niranjan, 1994; Sutton and Barto, 1998). We compare the value-based and AVF-based representations from the previous section (VALUE and AVF), and also proto-value functions (PVF; details in Appendix H.3).
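The sketch below shows standard linear SARSA(0) over a fixed, pretrained feature matrix φ; the paper's model-based variant and the four-room environment are not reproduced, so the environment, feature matrix, and hyperparameters here are illustrative stand-ins.

```python
# Linear SARSA(0) on a fixed representation phi; a simplified stand-in for the
# model-based variant used in the experiments.
import numpy as np

def linear_sarsa(phi, env_reset, env_step, n_actions, episodes=300,
                 alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """SARSA(0) with linear action values w[a] . phi[x] over fixed features phi."""
    rng = np.random.default_rng(seed)
    w = np.zeros((n_actions, phi.shape[1]))        # one weight vector per action

    def act(x):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        q = w @ phi[x]
        return int(rng.choice(np.flatnonzero(q == q.max())))  # greedy with random tie-breaking

    for _ in range(episodes):
        x, done = env_reset(), False
        a = act(x)
        for _ in range(10_000):                    # step cap keeps the sketch safe
            x2, r, done = env_step(x, a)
            a2 = act(x2)
            target = r + (0.0 if done else gamma * (w[a2] @ phi[x2]))
            w[a] += alpha * (target - w[a] @ phi[x]) * phi[x]
            x, a = x2, a2
            if done:
                break
    return w

# Toy chain environment and trivial one-hot "pretrained" features, for the demo only.
n_states, n_actions = 6, 2
phi = np.eye(n_states)

def env_reset():
    return 0

def env_step(x, a):
    x2 = min(x + 1, n_states - 1) if a == 1 else max(x - 1, 0)
    done = x2 == n_states - 1
    return x2, float(done), done

w = linear_sarsa(phi, env_reset, env_step, n_actions)
print(np.argmax(w @ phi.T, axis=0))               # learned greedy action per state
```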

We report the quality of the learned policies after training, as a function of d, the size of the representation. Our quality measure is the average return from the designated start state (bottom left). Results are provided in Figure 4 and Figure 13 (appendix). We observe a failure of the VALUE representation to provide a useful basis for learning a good policy, even as d increases; while the representation is not rank-deficient, the features do not help reduce the approximation error.


In comparison, our AVF representations perform similarly to PVFs. Increasing the number of auxiliary tasks also leads to better representations; recall that PVF implicitly uses n = 104 auxiliary tasks.

5 Related Work

Our work takes inspiration from research in basis or feature construction for reinforcement learning. Ratitch and Precup (2004), Foster and Dayan (2002), Menache et al. (2005), Yu and Bertsekas (2009), Bhatnagar et al. (2013), and Song et al. (2016) consider methods for adapting parametrized basis functions using iterative schemes. Including Mahadevan and Maggioni (2007)'s proto-value functions, a number of works (we note Dayan, 1993; Petrik, 2007; Mahadevan and Liu, 2010; Ruan et al., 2015; Barreto et al., 2017) have used characteristics of the transition structure of the MDP to generate representations; these are the closest in spirit to our approach, although none use the reward or consider the geometry of the space of value functions. Parr et al. (2007) proposed constructing a representation from successive Bellman errors, and Keller et al. (2006) used dimensionality reduction methods; finally, Hutter (2009) proposes a universal scheme for selecting representations.

Deep reinforcement learning algorithms have made extensive use of auxiliary tasks to improve agent performance, beginning perhaps with universal value function approximators (Schaul et al., 2015) and the UNREAL architecture (Jaderberg et al., 2017); see also Dosovitskiy and Koltun (2017), Francois-Lavet et al. (2018) and, more tangentially, van den Oord et al. (2018). Levine et al. (2017) and Chung et al. (2019) make explicit use of a two-part network to derive more sample-efficient deep reinforcement learning algorithms. Veeriah et al. (2019) use a meta-gradient approach to generate auxiliary tasks. The notion of augmenting an agent with side predictions is not new, with roots in TD models (Sutton, 1995), predictive state representations (Littman et al., 2002), and the Horde architecture (Sutton et al., 2011), itself inspired by the work of Selfridge (1959).

A number of works quantify or explain the usefulness of a representation. Parr et al. (2008) demonstrated that a good representation should support a good approximation of both the reward and the expected next state. We conjecture that the relaxed problem (4) trades these two quantities off in a principled fashion. Li et al. (2006) and Abel et al. (2016) consider the approximation error that arises from state abstraction. More recently, Nachum et al. (2019) provide some interesting guarantees in the context of hierarchical reinforcement learning, while Such et al. (2019) visualize the representations learned by Atari-playing agents. Finally, Bertsekas (2018) remarks on the two-part network we study here.

6 Conclusion

In this paper we studied the notion of an adversarial value function, derived from a geometric perspective on representation learning in RL. Our work shows that adversarial value functions exhibit interesting structure, and are good auxiliary tasks when learning a representation of an environment. We believe our work to be the first to provide formal evidence as to the usefulness of predicting value functions for shaping an agent's representation.

Our work opens up the possibility of automatically generating auxiliary tasks in deep reinforcement learning, analogous to how deep learning itself enabled a move away from hand-crafted features. To do so, we expect that a number of practical challenges will need to be overcome:

Off-policy learning. A practical implementation will require learning AVFs concurrently with the main task. Doing so results in off-policy learning, whose negative effects are well-documented even in recent applications (e.g. van Hasselt et al., 2018).

Policy parametrization. AVFs are the value functions of deterministic policies. While a natural choice is to look for policies that maximize representation error, this poses the problem of how to parametrize the policies themselves. In particular, a policy parametrized using the representation φ may not provide a sufficient degree of "adversariality".

Smoothness in the interest function. In continuous or large state spaces, it is desirable for interest functions to incorporate some degree of smoothness, rather than vary rapidly from state to state. It is not clear how to control this smoothness in a principled manner.

From a mathematical perspective, our formulation of the RLP was made with both convenience and geometry in mind. Conceptually, it may be interesting to consider our approach in other norms, including the weighted norms used in approximation results. Practically, this would translate into an emphasis on "interesting" value functions, for example by giving additional weight to the optimal value function and its neighbouring AVFs.

7 Acknowledgements

The authors thank the many people who helped shape this project through discussions and feedback on early and late drafts: Lihong Li, George Tucker, Doina Precup, Ofir Nachum, Csaba Szepesvari, Georg Ostrovski, Marek Petrik, Marlos Machado, Tim Lillicrap, Danny Tarlow, Hugo Larochelle, Saurabh Kumar, Carles Gelada, Remi Munos, David Silver, and Andre Barreto. Special thanks also to Philip Thomas and Scott Niekum, who gave this project its initial impetus.

8 Author Contributions

M.G.B., W.D., D.S., and N.L.R. conceptualized the representation learning problem. M.G.B., W.D., T.L., A.A.T., R.D., D.S., and N.L.R. contributed to the theoretical results. M.G.B., W.D., P.S.C., R.D., and C.L. performed experiments and collated results. All authors contributed to the writing.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. (2016). TensorFlow: A system for large-scale machine learning. In Symposium on Operating Systems Design and Implementation.

Abel, D., Hershkowitz, D. E., and Littman, M. L. (2016). Near optimal behavior via approximate state abstraction. In Proceedings of the International Conference on Machine Learning.

Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, O. P., and Zaremba, W. (2017). Hindsight experience replay. In Advances in Neural Information Processing Systems.

Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., and Silver, D. (2017). Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems.

Behzadian, B. and Petrik, M. (2018). Feature selection by singular value decomposition for reinforcement learning. In Proceedings of the ICML Prediction and Generative Modeling Workshop.

Bellemare, M. G., Dabney, W., and Munos, R. (2017). A distributional perspective on reinforcement learning. In Proceedings of the International Conference on Machine Learning.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279.

Bellman, R. E. (1957). Dynamic programming. Princeton University Press, Princeton, NJ.

Bernhard, K. and Vygen, J. (2008). Combinatorial optimization: Theory and algorithms. Springer, Third Edition.

Bertsekas, D. P. (2012). Dynamic Programming and Optimal Control, Vol. II: Approximate Dynamic Programming. Athena Scientific.

Bertsekas, D. P. (2018). Feature-based aggregation and deep reinforcement learning: A survey and some new implementations. Technical report, MIT/LIDS.

Bhatnagar, S., Borkar, V. S., and Prabuchandran, K. (2013). Feature search in the Grassmannian in online reinforcement learning. IEEE Journal of Selected Topics in Signal Processing.

Boyd, S. and Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.

Castro, P. S., Moitra, S., Gelada, C., Kumar, S., and Bellemare, M. G. (2018). Dopamine: A research framework for deep reinforcement learning. arXiv.


Chung, W., Nath, S., Joseph, A. G., and White, M. (2019). Two-timescale networks for nonlinear value function approximation. In International Conference on Learning Representations.

Dadashi, R., Taïga, A. A., Roux, N. L., Schuurmans, D., and Bellemare, M. G. (2019). The value function polytope in reinforcement learning.

Dayan, P. (1993). Improving generalisation for temporal difference learning: The successor representation. Neural Computation.

Dosovitskiy, A. and Koltun, V. (2017). Learning to act by predicting the future. In Proceedings of the International Conference on Learning Representations.

Ernst, D., Geurts, P., and Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556.

Farahmand, A., Ghavamzadeh, M., Szepesvari, C., and Mannor, S. (2016). Regularized policy iteration with nonparametric function spaces. Journal of Machine Learning Research.

Foster, D. and Dayan, P. (2002). Structure in the space of value functions. Machine Learning.

Francois-Lavet, V., Bengio, Y., Precup, D., and Pineau, J. (2018). Combined reinforcement learning via abstract representations. arXiv.

Gelada, C., Kumar, S., Buckman, J., Nachum, O., and Bellemare, M. G. (2019). DeepMDP: Learning continuous latent space models for representation learning. In Proceedings of the International Conference on Machine Learning.

Hutter, M. (2009). Feature reinforcement learning: Part I. Unstructured MDPs. Journal of Artificial General Intelligence.

Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. (2017). Reinforcement learning with unsupervised auxiliary tasks. In Proceedings of the International Conference on Learning Representations.

Keller, P. W., Mannor, S., and Precup, D. (2006). Automatic basis function construction for approximate dynamic programming and reinforcement learning. In Proceedings of the International Conference on Machine Learning.

Lagoudakis, M. and Parr, R. (2003). Least-squares policy iteration. The Journal of Machine Learning Research.

Levine, N., Zahavy, T., Mankowitz, D., Tamar, A., and Mannor, S. (2017). Shallow updates for deep reinforcement learning. In Advances in Neural Information Processing Systems.

Li, L., Walsh, T., and Littman, M. (2006). Towards a unified theory of state abstraction for MDPs. In Proceedings of the Ninth International Symposium on Artificial Intelligence and Mathematics.

Liang, Y., Machado, M. C., Talvitie, E., and Bowling, M. H. (2016). State of the art control of Atari games using shallow reinforcement learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems.

Littman, M. L., Sutton, R. S., and Singh, S. (2002). Predictive representations of state. In Advances in Neural Information Processing Systems.

Machado, M. C., Bellemare, M. G., and Bowling, M. (2017). A Laplacian framework for option discovery in reinforcement learning. In Proceedings of the International Conference on Machine Learning.

Machado, M. C., Rosenbaum, C., Guo, X., Liu, M., Tesauro, G., and Campbell, M. (2018). Eigenoption discovery through the deep successor representation. In Proceedings of the International Conference on Learning Representations.

Mahadevan, S. and Liu, B. (2010). Basis construction from power series expansions of value functions. In Advances in Neural Information Processing Systems.


Mahadevan, S. and Maggioni, M. (2007). Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research.

Menache, I., Mannor, S., and Shimkin, N. (2005). Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Munos, R. (2003). Error bounds for approximate policy iteration. In Proceedings of the International Conference on Machine Learning.

Munos, R. (2007). Performance bounds in Lp-norm for approximate value iteration. SIAM Journal on Control and Optimization.

Nachum, O., Gu, S., Lee, H., and Levine, S. (2019). Near-optimal representation learning for hierarchical reinforcement learning. In Proceedings of the International Conference on Learning Representations.

Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. (2008). An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the International Conference on Machine Learning.

Parr, R., Painter-Wakefield, C., Li, L., and Littman, M. (2007). Analyzing feature generation for value-function approximation. In Proceedings of the International Conference on Machine Learning.

Petrik, M. (2007). An analysis of Laplacian methods for value function approximation in MDPs. In Proceedings of the International Joint Conference on Artificial Intelligence.

Petrik, M. and Zilberstein, S. (2011). Robust approximate bilinear programming for value function approximation. Journal of Machine Learning Research.

Puterman, M. L. (1994). Markov Decision Processes: Discrete stochastic dynamic programming. John Wiley & Sons, Inc.

Ratitch, B. and Precup, D. (2004). Sparse distributed memories for on-line value-based reinforcement learning. In Proceedings of the European Conference on Machine Learning.

Rockafellar, R. T. and Wets, R. J.-B. (2009). Variational analysis. Springer Science & Business Media.

Ruan, S. S., Comanici, G., Panangaden, P., and Precup, D. (2015). Representation discovery for MDPs using bisimulation metrics. In Proceedings of the AAAI Conference on Artificial Intelligence.

Rummery, G. A. and Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical report, Cambridge University Engineering Department.

Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development.

Schaul, T., Horgan, D., Gregor, K., and Silver, D. (2015). Universal value function approximators. In Proceedings of the International Conference on Machine Learning.

Selfridge, O. (1959). Pandemonium: A paradigm for learning. In Symposium on the mechanization of thought processes.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.

Solway, A., Diuk, C., Cordova, N., Yee, D., Barto, A. G., Niv, Y., and Botvinick, M. M. (2014). Optimal behavioral hierarchy. PLOS Computational Biology.


Song, Z., Parr, R., Liao, X., and Carin, L. (2016). Linear feature encoding for reinforcement learning. In Advances in Neural Information Processing Systems.

Stachenfeld, K. L., Botvinick, M., and Gershman, S. J. (2014). Design principles of the hippocampal cognitive map. In Advances in Neural Information Processing Systems.

Such, F. P., Madhavan, V., Liu, R., Wang, R., Castro, P. S., Li, Y., Schubert, L., Bellemare, M. G., Clune, J., and Lehman, J. (2019). An Atari model zoo for analyzing, visualizing, and comparing deep reinforcement learning agents. In Proceedings of the International Joint Conference on Artificial Intelligence.

Sutton, R., Modayil, J., Delp, M., Degris, T., Pilarski, P., White, A., and Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems.

Sutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. In Proceedings of the International Conference on Machine Learning.

Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.

Sutton, R. S., Mahmood, A. R., and White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. Journal of Machine Learning Research.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.

Sutton, R. S., Precup, D., and Singh, S. P. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence.

Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3).

Tieleman, T. and Hinton, G. (2012). RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.

Tosatto, S., Pirotta, M., D'Eramo, C., and Restelli, M. (2017). Boosted fitted Q-iteration. In Proceedings of the International Conference on Machine Learning.

van den Oord, A., Li, Y., and Vinyals, O. (2018). Representation learning with contrastive predictive coding. In Advances in Neural Information Processing Systems.

van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., and Modayil, J. (2018). Deep reinforcement learning and the deadly triad. arXiv.

Veeriah, V., Hessel, M., Xu, Z., Lewis, R., Rajendran, J., Oh, J., van Hasselt, H., Silver, D., and Singh, S. (2019). Discovery of useful questions as auxiliary tasks. In Advances in Neural Information Processing Systems.

Wu, Y., Tucker, G., and Nachum, O. (2019). The Laplacian in RL: Learning representations with efficient approximations. In Proceedings of the International Conference on Learning Representations.

Yu, H. and Bertsekas, D. P. (2009). Basis function adaptation methods for cost approximation in MDP. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.
