



Mapping Instructions and Visual Observations to Actions with Reinforcement Learning

Dipendra Misra†, John Langford‡, and Yoav Artzi†

† Dept. of Computer Science and Cornell Tech, Cornell University, New York, NY 10044
{dkm, yoav}@cs.cornell.edu

‡ Microsoft Research, New York, NY 10011
[email protected]

Abstract

We propose to directly map raw visual observations and text input to actions for instruction execution. While existing approaches assume access to structured environment representations or use a pipeline of separately trained models, we learn a single model to jointly reason about linguistic and visual input. We use reinforcement learning in a contextual bandit setting to train a neural network agent. To guide the agent's exploration, we use reward shaping with different forms of supervision. Our approach does not require intermediate representations, planning procedures, or training different models. We evaluate in a simulated environment, and show significant improvements over supervised learning and common reinforcement learning variants.

1 Introduction

An agent executing natural language instructions requires robust understanding of language and its environment. Existing approaches addressing this problem assume structured environment representations (e.g., Chen and Mooney, 2011; Mei et al., 2016), or combine separately trained models (e.g., Matuszek et al., 2010; Tellex et al., 2011), including for language understanding and visual reasoning. We propose to directly map text and raw image input to actions with a single learned model. This approach offers multiple benefits, such as not requiring intermediate representations, planning procedures, or training multiple models.

Figure 1 illustrates the problem in the Blocks environment (Bisk et al., 2016). The agent observes the environment as an RGB image using a camera sensor. Given the RGB input, the agent must recognize the blocks and their layout. To understand the instruction, the agent must identify the block to move (Toyota block) and the destination (just right of the SRI block). This requires solving semantic and grounding problems. For example, consider the topmost instruction in the figure. The agent needs to identify the phrase referring to the block to move, Toyota block, and ground it. It must resolve and ground the phrase SRI block as a reference position, which is then modified by the spatial meaning recovered from the same row as or first open space to the right of, to identify the goal position. Finally, the agent needs to generate actions, for example moving the Toyota block around obstructing blocks.

Figure 1: Instructions in the Blocks environment. The instructions all describe the same task. Given the observed RGB image of the start state (large image), our goal is to execute such instructions. In this task, the direct-line path to the target position is blocked, and the agent must plan and move the Toyota block around. The small image marks the target and an example path, which includes 34 steps. The board is annotated with the compass directions north, south, east, and west. The example instructions shown in the figure are: "Put the Toyota block in the same row as the SRI block, in the first open space to the right of the SRI block"; "Move Toyota to the immediate right of SRI, evenly aligned and slightly separated"; "Move the Toyota block around the pile and place it just to the right of the SRI block"; "Place Toyota block just to the right of The SRI Block"; "Toyota, right side of SRI".

To address these challenges with a single model, we design a neural network agent. The agent executes instructions by generating a sequence of actions. At each step, the agent takes as input the instruction text, observes the world as an RGB image, and selects the next action. Action execution changes the state of the world. Given an observation of the new world state, the agent selects the next action. This process continues until the agent indicates execution completion. When selecting actions, the agent jointly reasons about its observations and the instruction text. This enables decisions based on close interaction between observations and linguistic input.

We train the agent with different levels of supervision, including complete demonstrations of the desired behavior and annotations of the goal state only. While the learning problem can be easily cast as a supervised learning problem, learning only from the states observed in the training data results in poor generalization and failure to recover from test errors. We use reinforcement learning (Sutton and Barto, 1998) to observe a broader set of states through exploration. Following recent work in robotics (e.g., Levine et al., 2016; Rusu et al., 2016), we assume the training environment, in contrast to the test environment, is instrumented and provides access to the state. This enables a simple problem reward function that uses the state and provides positive reward on task completion only. This type of reward offers two important advantages: (a) it is a simple way to express the ideal agent behavior we wish to achieve, and (b) it creates a platform to add training data information.

We use reward shaping (Ng et al., 1999) to exploit the training data and add additional information to the reward. The modularity of shaping allows varying the amount of supervision, for example by using complete demonstrations for only a fraction of the training examples. Shaping also naturally associates actions with immediate reward. This enables learning in a contextual bandit setting (Auer et al., 2002; Langford and Zhang, 2007), where optimizing the immediate reward is sufficient and has better sample complexity than unconstrained reinforcement learning (Agarwal et al., 2014).

We evaluate with the block world environment and data of Bisk et al. (2016), where each instruction moves one block (Figure 1). While the original task focused on source and target prediction only, we build an interactive simulator and formulate the task of predicting the complete sequence of actions. At each step, the agent must select between 81 actions, with 15.4 steps required to complete a task on average, significantly more than existing environments (e.g., Chen and Mooney, 2011). Our experiments demonstrate that our reinforcement learning approach effectively reduces execution error by 24% over standard supervised learning and 34-39% over common reinforcement learning techniques. Our simulator, code, models, and execution videos are available at: https://github.com/clic-lab/blocks.

2 Technical Overview

Task: Let $\mathcal{X}$ be the set of all instructions, $\mathcal{S}$ the set of all world states, and $\mathcal{A}$ the set of all actions. An instruction $x \in \mathcal{X}$ is a sequence $\langle x_1, \dots, x_n \rangle$, where each $x_i$ is a token. The agent executes instructions by generating a sequence of actions, and indicates execution completion with the special action STOP. Action execution modifies the world state following a transition function $T : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$. The execution $e$ of an instruction $x$ starting from $s_1$ is an $m$-length sequence $\langle (s_1, a_1), \dots, (s_m, a_m) \rangle$, where $s_j \in \mathcal{S}$, $a_j \in \mathcal{A}$, $T(s_j, a_j) = s_{j+1}$, and $a_m = \text{STOP}$. In Blocks (Figure 1), a state specifies the positions of all blocks. For each action, the agent moves a single block on the plane in one of four directions (north, south, east, or west). There are 20 blocks, and 81 possible actions at each step, including STOP. For example, to correctly execute the instructions in the figure, the agent's likely first action is TOYOTA-WEST, which moves the Toyota block one step west. Blocks can not move over or through other blocks.

Model: The agent observes the world state via a visual sensor (i.e., a camera). Given a world state $s$, the agent observes an RGB image $I$ generated by the function IMG($s$). We distinguish between the world state $s$ and the agent context¹ $\tilde{s}$, which includes the instruction, the observed image IMG($s$), images of previous states, and the previous action. To map instructions to actions, the agent reasons about the agent context $\tilde{s}$ to generate a sequence of actions. At each step, the agent generates a single action. We model the agent with a neural network policy. At each step $j$, the network takes as input the current agent context $\tilde{s}_j$, and predicts the next action to execute $a_j$. We formally define the agent context and model in Section 4.

¹ We use the term context similar to how it is used in the contextual bandit literature to refer to the information available for decision making. While agent contexts capture information about the world state, they do not include physical information, except as captured by observed images.

Learning: We assume access to training data with $N$ examples $\{(x^{(i)}, s_1^{(i)}, e^{(i)})\}_{i=1}^{N}$, where $x^{(i)}$ is an instruction, $s_1^{(i)}$ is a start state, and $e^{(i)}$ is an execution demonstration of $x^{(i)}$ starting at $s_1^{(i)}$. We use policy gradient (Section 5) with reward shaping derived from the training data to increase learning speed and exploration effectiveness (Section 6). Following work in robotics (e.g., Levine et al., 2016), we assume an instrumented environment with access to the world state to compute the reward during training only. We define our approach in general terms with demonstrations, but also experiment with training using goal states.

Evaluation: We evaluate task completion error on a test set $\{(x^{(i)}, s_1^{(i)}, s_g^{(i)})\}_{i=1}^{M}$, where $x^{(i)}$ is an instruction, $s_1^{(i)}$ is a start state, and $s_g^{(i)}$ is the goal state. We measure execution error as the distance between the final execution state and $s_g^{(i)}$.

3 Related Work

Learning to follow instructions was studied extensively with structured environment representations, including with semantic parsing (Chen and Mooney, 2011; Kim and Mooney, 2012, 2013; Artzi and Zettlemoyer, 2013; Artzi et al., 2014a,b; Misra et al., 2015, 2016), alignment models (Andreas and Klein, 2015), reinforcement learning (Branavan et al., 2009, 2010; Vogel and Jurafsky, 2010), and neural network models (Mei et al., 2016). In contrast, we study the problem of an agent that takes as input instructions and raw visual input. Instruction following with visual input was studied with pipeline approaches that use separately learned models for visual reasoning (Matuszek et al., 2010, 2012; Tellex et al., 2011; Paul et al., 2016). Rather than decomposing the problem, we adopt a single-model approach and learn from instructions paired with demonstrations or goal states. Our work is related to Sung et al. (2015). While they use sensory input to select and adjust a trajectory observed during training, we are not restricted to training sequences. Executing instructions in non-learning settings has also received significant attention (e.g., Winograd, 1972; Webber et al., 1995; MacMahon et al., 2006).

Our work is related to a growing interest in problems that combine language and vision, including visual question answering (e.g., Antol et al., 2015; Andreas et al., 2016b,a), caption generation (e.g., Chen et al., 2015, 2016; Xu et al., 2015), and visual reasoning (Johnson et al., 2016; Suhr et al., 2017). We address the prediction of the next action given a world image and an instruction.

Reinforcement learning with neural networks has been used for various NLP tasks, including text-based games (Narasimhan et al., 2015; He et al., 2016), information extraction (Narasimhan et al., 2016), co-reference resolution (Clark and Manning, 2016), and dialog (Li et al., 2016).

Neural network reinforcement learning techniques have been recently studied for behavior learning tasks, including playing games (Mnih et al., 2013, 2015, 2016; Silver et al., 2016) and solving memory puzzles (Oh et al., 2016). In contrast to this line of work, our data is limited. Observing new states in a computer game simply requires playing it. However, our agent also considers natural language instructions. As the set of instructions is limited to the training data, the set of agent contexts seen during learning is constrained. We address the data efficiency problem by learning in a contextual bandit setting, which is known to be more tractable (Agarwal et al., 2014), and using reward shaping to increase exploration effectiveness. Zhu et al. (2017) address generalization of reinforcement learning to new target goals in visual search by providing the agent an image of the goal state. We address a related problem. However, we provide natural language and the agent must learn to recognize the goal state.

Reinforcement learning is extensively used in robotics (Kober et al., 2013). Similar to recent work on learning neural network policies for robot control (Levine et al., 2016; Schulman et al., 2015; Rusu et al., 2016), we assume an instrumented training environment and use the state to compute rewards during learning. Our approach adds the ability to specify tasks using natural language.

4 Model

We model the agent policy $\pi$ with a neural network. The agent observes the instruction and an RGB image of the world. Given a world state $s$, the image $I$ is generated using the function IMG($s$). The instruction execution is generated one step at a time. At each step $j$, the agent observes an image $I_j$ of the current world state $s_j$ and the instruction $x$, predicts the action $a_j$, and executes it to transition to the next state $s_{j+1}$.

Figure 2: Illustration of the policy architecture showing the 10th step in the execution of the instruction Place the Toyota east of SRI in the state from Figure 1. The network takes as input the instruction $x$, the image of the current state $I_{10}$, images of previous states $I_8$ and $I_9$ (with $K = 2$), and the previous action $a_9$. The text and images are embedded with an LSTM and a CNN. The actions are selected with the task-specific multi-layer perceptron. (The figure labels the instruction representation $\mathbf{x}$, the visual state $\mathbf{v}_{10}$, the agent context $\tilde{s}_{10}$, the hidden layers $\mathbf{h}_1$, $\mathbf{h}^D$, and $\mathbf{h}^B$ feeding the softmax layers, and the selected action $a_{10}$ = TOYOTA-SOUTH.)

This process continues until STOP is predicted and the agent stops, indicating instruction completion. The agent also has access to $K$ images of previous states and the previous action to distinguish between different stages of the execution (Mnih et al., 2015). Figure 2 illustrates our architecture.

Formally,² at step $j$, the agent considers an agent context $\tilde{s}_j$, which is a tuple $(x, I_j, I_{j-1}, \dots, I_{j-K}, a_{j-1})$, where $x$ is the natural language instruction, $I_j$ is an image of the current world state, the images $I_{j-1}, \dots, I_{j-K}$ represent the $K$ previous states, and $a_{j-1}$ is the previous action. The agent context includes information about the current state and the execution. Considering the previous action $a_{j-1}$ allows the agent to avoid repeating failed actions, for example when trying to move in the direction of an obstacle. In Figure 2, the agent is given the instruction Place the Toyota east of SRI, is at the 10th execution step, and considers $K = 2$ previous images.
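Concretely, the agent context can be represented by a small record such as the following sketch; the class name and field layout are ours, and images are assumed to be stored as NumPy arrays.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
import numpy as np

@dataclass
class AgentContext:
    """Agent context at step j: the instruction x, the current image I_j,
    the K previous images I_{j-1}, ..., I_{j-K}, and the previous action a_{j-1}."""
    instruction: List[str]                        # tokenized instruction
    current_image: np.ndarray                     # RGB image of the current state
    previous_images: List[np.ndarray]             # K previous images (zero images before step 1)
    previous_action: Optional[Tuple[str, str]]    # e.g. ("TOYOTA", "SOUTH"); None stands for NONE
```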

We generate continuous vector representations for all inputs, and jointly reason about both text and image modalities to select the next action. We use a recurrent neural network (RNN; Elman, 1990) with a long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) recurrence to map the instruction $x = \langle x_1, \dots, x_n \rangle$ to a vector representation $\mathbf{x}$. Each token $x_i$ is mapped to a fixed dimensional vector with the learned embedding function $\psi(x_i)$. The instruction representation $\mathbf{x}$ is computed by applying the LSTM recurrence to generate a sequence of hidden states $\mathbf{l}_i = \text{LSTM}(\psi(x_i), \mathbf{l}_{i-1})$, and computing the mean $\mathbf{x} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{l}_i$ (Narasimhan et al., 2015). The current image $I_j$ and previous images $I_{j-1}, \dots, I_{j-K}$ are concatenated along the channel dimension and embedded with a convolutional neural network (CNN) to generate the visual state $\mathbf{v}_j$ (Mnih et al., 2013). The last action $a_{j-1}$ is embedded with the function $\psi_a(a_{j-1})$. The vectors $\mathbf{v}_j$, $\mathbf{x}$, and $\psi_a(a_{j-1})$ are concatenated to create the agent context vector representation $\tilde{\mathbf{s}}_j = [\mathbf{v}_j, \mathbf{x}, \psi_a(a_{j-1})]$.

² We use bold-face capital letters for matrices and bold-face lowercase letters for vectors. Computed input and state representations use bold versions of the symbols. For example, $\mathbf{x}$ is the computed representation of an instruction $x$.
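A PyTorch-style sketch of this joint encoding is shown below: the LSTM hidden states are averaged into the instruction representation, a CNN embeds the channel-wise concatenation of the current and K previous images, and the previous-action embedding is concatenated with both. The layer sizes, CNN shape, and module names here are placeholders rather than the paper's exact architecture (its hyperparameters are listed in Appendix C).

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Encodes (instruction, images, previous action) into the agent context
    vector [v_j, x, psi_a(a_{j-1})]. Dimensions are illustrative."""
    def __init__(self, vocab_size, num_blocks=20, num_dirs=4, K=4,
                 word_dim=32, lstm_dim=64, act_dim=16, img_channels=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)       # psi
        self.lstm = nn.LSTM(word_dim, lstm_dim, batch_first=True)
        self.block_emb = nn.Embedding(num_blocks + 1, act_dim)   # +1 for NONE
        self.dir_emb = nn.Embedding(num_dirs + 1, act_dim)
        self.cnn = nn.Sequential(                                # embeds (K+1) stacked frames
            nn.Conv2d(img_channels * (K + 1), 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 4 * 4, 128), nn.ReLU(),
        )

    def forward(self, tokens, frames, prev_block, prev_dir):
        # tokens: (B, n) token ids; frames: (B, 3*(K+1), H, W);
        # prev_block / prev_dir: (B,) indices of the previous action components.
        hidden, _ = self.lstm(self.word_emb(tokens))   # (B, n, lstm_dim)
        x = hidden.mean(dim=1)                         # mean of the LSTM hidden states
        v = self.cnn(frames)                           # visual state v_j
        a_prev = torch.cat([self.block_emb(prev_block), self.dir_emb(prev_dir)], dim=-1)
        return torch.cat([v, x, a_prev], dim=-1)       # agent context vector
```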

To compute the action to execute, we use a feed-forward perceptron that decomposes according to the domain actions. This computation selects the next action conditioned on the instruction text and observations from both the current world state and recent history. In the block world domain, where actions decompose to selecting the block to move and the direction, the network computes block and direction probabilities. Formally, we decompose an action $a$ to a direction $a^D$ and a block $a^B$. We compute the feedforward network:

$$\mathbf{h}_1 = \max(\mathbf{W}^{(1)} \tilde{\mathbf{s}}_j + \mathbf{b}^{(1)}, 0)$$
$$\mathbf{h}^D = \mathbf{W}^{(D)} \mathbf{h}_1 + \mathbf{b}^{(D)}$$
$$\mathbf{h}^B = \mathbf{W}^{(B)} \mathbf{h}_1 + \mathbf{b}^{(B)} \, ,$$

and the action probability is a product of the component probabilities:

$$P(a^D_j = d \mid x, \tilde{s}_j, a_{j-1}) \propto \exp(\mathbf{h}^D_d)$$
$$P(a^B_j = b \mid x, \tilde{s}_j, a_{j-1}) \propto \exp(\mathbf{h}^B_b) \, .$$

At the beginning of execution, the first action $a_0$ is set to the special value NONE, and previous images are zero matrices. The embedding function $\psi$ is a learned matrix. The function $\psi_a$ concatenates the embeddings of $a^D_{j-1}$ and $a^B_{j-1}$, which are obtained from learned matrices, to compute the embedding of $a_{j-1}$. The model parameters $\theta$ include $\mathbf{W}^{(1)}$, $\mathbf{b}^{(1)}$, $\mathbf{W}^{(D)}$, $\mathbf{b}^{(D)}$, $\mathbf{W}^{(B)}$, $\mathbf{b}^{(B)}$, the parameters of the LSTM recurrence, the parameters of the convolutional network CNN, and the embedding matrices. In our experiments (Section 7), all parameters are learned without external resources.
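The decomposed output layer can be sketched as follows: a shared ReLU layer over the agent context vector, followed by separate direction and block softmaxes whose product gives the action probability, matching the equations above. How STOP enters the factorization is not spelled out in the text, so treating it as an extra direction index here is our assumption, as are the layer sizes.

```python
import torch
import torch.nn as nn

class DecomposedActionHead(nn.Module):
    """h_1 = ReLU(W^(1) s + b^(1)); separate linear layers score directions and
    blocks, and the action probability factorizes as P(a) = P(a^D) * P(a^B)."""
    def __init__(self, context_dim: int, hidden_dim: int = 128,
                 num_directions: int = 5, num_blocks: int = 20):
        super().__init__()
        self.shared = nn.Linear(context_dim, hidden_dim)             # W^(1), b^(1)
        self.direction_head = nn.Linear(hidden_dim, num_directions)  # W^(D), b^(D); STOP treated as a 5th direction here
        self.block_head = nn.Linear(hidden_dim, num_blocks)          # W^(B), b^(B)

    def forward(self, context: torch.Tensor):
        h1 = torch.relu(self.shared(context))
        log_p_dir = torch.log_softmax(self.direction_head(h1), dim=-1)
        log_p_block = torch.log_softmax(self.block_head(h1), dim=-1)
        return log_p_dir, log_p_block

def action_log_prob(log_p_dir: torch.Tensor, log_p_block: torch.Tensor,
                    dir_idx: torch.Tensor, block_idx: torch.Tensor) -> torch.Tensor:
    """log P(a) = log P(a^D = d) + log P(a^B = b)."""
    return (log_p_dir.gather(-1, dir_idx.unsqueeze(-1)).squeeze(-1)
            + log_p_block.gather(-1, block_idx.unsqueeze(-1)).squeeze(-1))
```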

5 Learning

We use policy gradient for reinforcement learning (Williams, 1992) to estimate the parameters $\theta$ of the agent policy. We assume access to a training set of $N$ examples $\{(x^{(i)}, s_1^{(i)}, e^{(i)})\}_{i=1}^{N}$, where $x^{(i)}$ is an instruction, $s_1^{(i)}$ is a start state, and $e^{(i)}$ is an execution demonstration of instruction $x^{(i)}$ starting from $s_1^{(i)}$. The main learning challenge is learning how to execute instructions given raw visual input from relatively limited data. We learn in a contextual bandit setting, which provides theoretical advantages over general reinforcement learning. In Section 8, we verify this empirically.

Reward Function: The instruction execution problem defines a simple problem reward to measure task completion. The agent receives a positive reward when the task is completed, a negative reward for incorrect completion (i.e., STOP in the wrong state) and for actions that fail to execute (e.g., when the direction is blocked), and a small penalty otherwise, which induces a preference for shorter trajectories. To compute the reward, we assume access to the world state. This learning setup is inspired by work in robotics, where it is achieved by instrumenting the training environment (Section 3). The agent, on the other hand, only uses the agent context (Section 4). When deployed, the system relies on visual observations and natural language instructions only. The reward function $R^{(i)} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is defined for each training example $(x^{(i)}, s_1^{(i)}, e^{(i)})$, $i = 1 \dots N$:

$$R^{(i)}(s, a) = \begin{cases} 1.0 & \text{if } s = s_{m^{(i)}} \text{ and } a = \text{STOP} \\ -1.0 & \text{if } s \neq s_{m^{(i)}} \text{ and } a = \text{STOP} \\ -1.0 & \text{if } a \text{ fails to execute} \\ -\delta & \text{otherwise} \end{cases}$$

where $m^{(i)}$ is the length of $e^{(i)}$, so $s_{m^{(i)}}$ is the final state of the demonstration.
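Written as code, the problem reward is a few lines. This is a sketch under our assumptions: goal_state stands for the final demonstration state $s_{m^{(i)}}$, states_equal for a state comparison (the relaxed version used in practice is described in Section 7), and action_failed for the simulator's report that a move was blocked; the default value of delta is a placeholder.

```python
def problem_reward(state, action, goal_state, action_failed,
                   states_equal, delta=0.02):
    """Problem reward R^(i)(s, a): +1 for stopping in the goal state, -1 for
    stopping elsewhere or for an action that fails to execute, and a small
    penalty -delta otherwise (inducing a preference for shorter trajectories)."""
    if action == "STOP":
        return 1.0 if states_equal(state, goal_state) else -1.0
    if action_failed:
        return -1.0
    return -delta
```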

The reward function does not provide intermediate positive feedback to the agent for actions that bring it closer to its goal. When the agent explores randomly early during learning, it is unlikely to encounter the goal state due to the large number of steps required to execute tasks. As a result, the agent does not observe positive reward and fails to learn. In Section 6, we describe how reward shaping, a method to augment the reward with additional information, is used to take advantage of the training data and address this challenge.

Policy Gradient Objective: We adapt the policy gradient objective defined by Sutton et al. (1999) to multiple starting states and reward functions:

$$J = \frac{1}{N} \sum_{i=1}^{N} V^{(i)}_{\pi}(s_1^{(i)}) \, ,$$

where $V^{(i)}_{\pi}(s_1^{(i)})$ is the value given by $R^{(i)}$ starting from $s_1^{(i)}$ under the policy $\pi$. The summation expresses the goal of learning a behavior parameterized by natural language instructions.

Contextual Bandit Setting: In contrast to most policy gradient approaches, we apply the objective to a contextual bandit setting where immediate reward is optimized rather than total expected reward. The primary theoretical advantage of contextual bandits is much tighter sample complexity bounds when comparing upper bounds for contextual bandits (Langford and Zhang, 2007), even with an adversarial sequence of contexts (Auer et al., 2002), to lower bounds (Krishnamurthy et al., 2016) or upper bounds (Kearns et al., 1999) for total reward maximization. This property is particularly suitable for the few-sample regime common in natural language problems. While reinforcement learning with neural network policies is known to require large amounts of training data (Mnih et al., 2015), the limited number of training sentences constrains the diversity and volume of agent contexts we can observe during training. Empirically, this translates to poor results when optimizing the total reward (REINFORCE baseline in Section 8). To derive the approximate gradient, we use the likelihood ratio method:

$$\nabla_\theta J = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}\left[ \nabla_\theta \log \pi(\tilde{s}, a) \, R^{(i)}(s, a) \right] \, ,$$

where the reward is computed from the world state but the policy is learned on the agent context. We approximate the gradient using sampling.

This training regime, where immediate reward optimization is sufficient to optimize the policy parameters $\theta$, is enabled by the shaped reward we introduce in Section 6. While the objective is designed to work best with the shaped reward, the algorithm remains the same for any choice of reward definition, including the original problem reward or several possibilities formed by reward shaping.

Entropy Penalty: We observe that early in training, the agent is overwhelmed with negative reward and rarely completes the task. This results in the policy $\pi$ rapidly converging towards a suboptimal deterministic policy with an entropy of 0. To delay premature convergence we add an entropy term to the objective (Williams and Peng, 1991; Mnih et al., 2016). The entropy term encourages a uniform distribution policy, and in practice stimulates exploration early during training. The regularized gradient is:

$$\nabla_\theta J = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}\left[ \nabla_\theta \log \pi(\tilde{s}, a) \, R^{(i)}(s, a) + \lambda \nabla_\theta H(\pi(\tilde{s}, \cdot)) \right] \, ,$$

where $H(\pi(\tilde{s}, \cdot))$ is the entropy of $\pi$ given the agent context $\tilde{s}$, and $\lambda$ is a hyperparameter that controls the strength of the regularization. While the entropy term delays premature convergence, it does not eliminate it. Similar issues are observed for vanilla policy gradient (Mnih et al., 2016).

Algorithm 1: Policy gradient learning

Input: Training set $\{(x^{(i)}, s_1^{(i)}, e^{(i)})\}_{i=1}^{N}$, learning rate $\mu$, epochs $T$, horizon $J$, and entropy regularization term $\lambda$.

Definitions: IMG($s$) is a camera sensor that reports an RGB image of state $s$. $\pi$ is a probabilistic neural network policy parameterized by $\theta$, as described in Section 4. EXECUTE($s, a$) executes the action $a$ at the state $s$ and returns the new state. $R^{(i)}$ is the reward function for example $i$. ADAM($\Delta$) applies a per-feature learning rate to the gradient $\Delta$ (Kingma and Ba, 2014).

Output: Policy parameters $\theta$.

 1: » Iterate over the training data.
 2: for $t = 1$ to $T$, $i = 1$ to $N$ do
 3:   $I_{1-K}, \dots, I_0 = \vec{0}$
 4:   $a_0 = \text{NONE}$, $s_1 = s_1^{(i)}$
 5:   $j = 1$
 6:   » Rollout up to episode limit.
 7:   while $j \le J$ and $a_{j-1} \ne \text{STOP}$ do
 8:     » Observe world and construct agent context.
 9:     $I_j = \text{IMG}(s_j)$
10:     $\tilde{s}_j = (x^{(i)}, I_j, I_{j-1}, \dots, I_{j-K}, a_{j-1})$
11:     » Sample an action from the policy.
12:     $a_j \sim \pi(\tilde{s}_j, a)$
13:     $s_{j+1} = \text{EXECUTE}(s_j, a_j)$
14:     » Compute the approximate gradient.
15:     $\Delta_j \leftarrow \nabla_\theta \log \pi(\tilde{s}_j, a_j) \, R^{(i)}(s_j, a_j) + \lambda \nabla_\theta H(\pi(\tilde{s}_j, \cdot))$
16:     $j \mathrel{+}= 1$
17:   $\theta \leftarrow \theta + \mu \, \text{ADAM}\big(\frac{1}{j} \sum_{j'=1}^{j} \Delta_{j'}\big)$
18: return $\theta$

Algorithm: Algorithm 1 shows our learning algorithm. We iterate over the data $T$ times. In each epoch, for each training example $(x^{(i)}, s_1^{(i)}, e^{(i)})$, $i = 1 \dots N$, we perform a rollout using our policy to generate an execution (lines 7-16). The length of the rollout is bound by $J$, but may be shorter if the agent selected the STOP action. At each step $j$, the agent updates the agent context $\tilde{s}_j$ (lines 9-10), samples an action from the policy $\pi$ (line 12), and executes it to generate the new world state $s_{j+1}$ (line 13). The gradient is approximated using the sampled action with the computed reward $R^{(i)}(s_j, a_j)$ (line 15). Following each rollout, we update the parameters $\theta$ with the mean of the gradients using ADAM (Kingma and Ba, 2014).
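As a rough illustration, Algorithm 1 maps onto a training loop like the sketch below. It assumes a policy object exposing an action distribution over the 81 actions given the agent context, a simulator with image(s) and execute(s, a), and a shaped_reward function (problem reward plus the shaping terms of Section 6); all of these names are ours. Each step optimizes the immediate shaped reward plus the entropy bonus, in the contextual bandit spirit of the objective above, and the per-rollout gradients are averaged and applied with Adam.

```python
import torch

def train_policy(policy, optimizer, examples, simulator, shaped_reward,
                 epochs=1, horizon=40, lam=0.1, K=4):
    """Contextual-bandit policy gradient with reward shaping (cf. Algorithm 1).
    `examples` holds (instruction, start_state, demonstration) triples."""
    for _ in range(epochs):
        for instruction, start_state, demo in examples:
            state, prev_state, prev_action = start_state, None, None
            frames, losses = [], []
            for _ in range(horizon):
                frames = (frames + [simulator.image(state)])[-(K + 1):]
                # policy.action_distribution returns a torch.distributions.Categorical
                # over the 81 actions, built from the block/direction softmaxes.
                dist = policy.action_distribution(instruction, frames, prev_action)
                action_idx = dist.sample()
                action = policy.index_to_action(action_idx.item())
                next_state = simulator.execute(state, action)
                r = shaped_reward(demo, prev_state, prev_action, state, action, next_state)
                # Maximize log pi(a | context) * r + lambda * H(pi): minimize the negative.
                losses.append(-(dist.log_prob(action_idx) * r + lam * dist.entropy()))
                prev_state, prev_action, state = state, action, next_state
                if action == "STOP":
                    break
            optimizer.zero_grad()
            torch.stack(losses).mean().backward()
            optimizer.step()
```

Optimizing only the per-step (immediate) reward is what distinguishes this loop from REINFORCE-style total-reward optimization; plugging the unshaped problem reward into shaped_reward recovers learning without shaping.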

6 Reward Shaping

Reward shaping is a method for transforming a reward function by adding a shaping term to the problem reward. The goal is to generate more informative updates by adding information to the reward. We use this method to leverage the training demonstrations, a common form of supervision for training systems that map language to actions. Reward shaping allows us to fully use this type of supervision in a reinforcement learning framework, and effectively combine learning from demonstrations and exploration.

Figure 3: Visualization of the shaping potentials for two tasks. We show demonstrations (blue arrows), but omit instructions. To visualize the potentials' intensity, we assume only the target block can be moved, while rewards and potentials are computed for any block movement. We illustrate the sparse problem reward (left column) as a potential function and consider only its positive component, which is focused on the goal. The middle column adds the distance-based potential. The right adds both potentials.

Adding an arbitrary shaping term can change the optimality of policies and modify the original problem, for example by making policies that are bad according to the problem reward optimal according to the shaped function.³ Ng et al. (1999) and Wiewiora et al. (2003) outline potential-based terms that realize sufficient conditions for safe shaping.⁴ Adding a shaping term is safe if the order of policies according to the shaped reward is identical to the order according to the original problem reward. While safe shaping only applies to optimizing the total reward, we show empirically the effectiveness of the safe shaping terms we design in a contextual bandit setting.

³ For example, adding a shaping term $F = -R$ will result in a shaped reward that is always 0, and any policy will be trivially optimal with respect to it.

⁴ For convenience, we briefly overview the theorems of Ng et al. (1999) and Wiewiora et al. (2003) in Appendix A.

We introduce two shaping terms. The final shaped reward is the sum of the two terms and the problem reward. Similar to the problem reward, we define example-specific shaping terms. We modify the reward function signature as required.

Distance-based Shaping (F1): The first shaping term measures whether the agent moved closer to the goal state. We design it to be a safe potential-based term (Ng et al., 1999):

$$F_1^{(i)}(s_j, a_j, s_{j+1}) = \phi_1^{(i)}(s_{j+1}) - \phi_1^{(i)}(s_j) \, .$$

The potential $\phi_1^{(i)}(s)$ is proportional to the negative distance from the goal state $s_g^{(i)}$. Formally, $\phi_1^{(i)}(s) = -\eta \| s - s_g^{(i)} \|$, where $\eta$ is a constant scaling factor and $\| \cdot \|$ is a distance metric. In the block world, the distance between two states is the sum of the Euclidean distances between the positions of each block in the two states, and $\eta$ is the inverse of the block width. The middle column in Figure 3 visualizes the potential $\phi_1^{(i)}$.

Trajectory-based Shaping (F2): Distance-based shaping may lead the agent to sub-optimal states, for example when an obstacle blocks the direct path to the goal state and the agent must temporarily increase its distance from the goal to bypass it. We incorporate complete trajectories by using a simplification of the shaping term introduced by Brys et al. (2015). Unlike $F_1$, it requires access to the previous state and action. It is based on the look-back advice shaping term of Wiewiora et al. (2003), who introduced safe potential-based shaping that considers the previous state and action. The second term is:

$$F_2^{(i)}(s_{j-1}, a_{j-1}, s_j, a_j) = \phi_2^{(i)}(s_j, a_j) - \phi_2^{(i)}(s_{j-1}, a_{j-1}) \, .$$

Given $e^{(i)} = \langle (s_1, a_1), \dots, (s_m, a_m) \rangle$, to compute the potential $\phi_2^{(i)}(s, a)$, we identify the state $s_j$ in $e^{(i)}$ that is closest to $s$. If $\eta \| s_j - s \| < 1$ and $a_j = a$, then $\phi_2^{(i)}(s, a) = 1.0$; else $\phi_2^{(i)}(s, a) = -\delta_f$, where $\delta_f$ is a penalty parameter. We use the same distance computation and parameter $\eta$ as in $F_1$. When the agent is in a state close to a demonstration state, this term encourages taking the action taken in the related demonstration state. The right column in Figure 3 visualizes the effect of the potential $\phi_2^{(i)}$.
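The two potentials translate directly into code. The sketch below assumes states are dictionaries mapping block names to (x, y) positions, uses the summed per-block Euclidean distance scaled by eta (the inverse block width) described above, and leaves delta_f as a free penalty parameter; it illustrates the potentials rather than reproducing the released implementation.

```python
import math

def state_distance(s1, s2):
    """Sum of Euclidean distances between corresponding block positions."""
    return sum(math.dist(s1[b], s2[b]) for b in s1)

def phi1(state, goal_state, eta):
    """Distance-based potential: proportional to the negative distance to the goal."""
    return -eta * state_distance(state, goal_state)

def shaping_F1(s_j, s_next, goal_state, eta):
    """F1(s_j, a_j, s_{j+1}) = phi1(s_{j+1}) - phi1(s_j)."""
    return phi1(s_next, goal_state, eta) - phi1(s_j, goal_state, eta)

def phi2(state, action, demo, eta, delta_f=0.5):
    """Trajectory-based potential: +1 when close to the nearest demonstration
    state and taking its demonstrated action, -delta_f otherwise
    (the default delta_f is a placeholder)."""
    demo_state, demo_action = min(demo, key=lambda sa: state_distance(sa[0], state))
    if eta * state_distance(demo_state, state) < 1.0 and action == demo_action:
        return 1.0
    return -delta_f

def shaping_F2(s_prev, a_prev, s_j, a_j, demo, eta):
    """F2(s_{j-1}, a_{j-1}, s_j, a_j) = phi2(s_j, a_j) - phi2(s_{j-1}, a_{j-1})."""
    return phi2(s_j, a_j, demo, eta) - phi2(s_prev, a_prev, demo, eta)
```

The shaped reward used during training is then the problem reward plus these two terms.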

7 Experimental Setup

Environment: We use the environment of Bisk et al. (2016). The original task required predicting the source and target positions for a single block given an instruction. In contrast, we address the task of moving blocks on the plane to execute instructions given visual input. This requires generating the complete sequence of actions needed to complete the instruction. The environment contains up to 20 blocks marked with logos or digits. Each block can be moved in four directions. Including the STOP action, in each step, the agent selects between 81 actions. The set of actions is constant and is not limited to the blocks present.

The transition function is deterministic. The size of each block step is 0.04 of the board size. The agent observes the board from above. We adopt a relatively challenging setup with a large action space. While a simpler setup, for example decomposing the problem to source and target prediction and using a planner, is likely to perform better, we aim to minimize task-specific assumptions and engineering of separate modules. However, to better understand the problem, we also report results for the decomposed task with a planner.

Data: Bisk et al. (2016) collected a corpus of instructions paired with start and goal states. Figure 1 shows example instructions. The original data includes instructions for moving one block or multiple blocks. Single-block instructions are relatively similar to navigation instructions and referring expressions. While they present much of the complexity of natural language understanding and grounding, they rarely display the planning complexity of multi-block instructions, which are beyond the scope of this paper. Furthermore, the original data does not include demonstrations. While generating demonstrations for moving a single block is straightforward, disambiguating action ordering when multiple blocks are moved is challenging. Therefore, we focus on instructions where a single block changes its position between the start and goal states, and restrict demonstration generation to move the changed block. The remaining data, and the complexity it introduces, provide an important direction for future work.

To create demonstrations, we compute the shortest paths. While this process may introduce noise for instructions that specify particular trajectories (e.g., move SRI two steps north and . . . ) rather than only describing the goal state, analysis of the data shows this issue is limited. Out of 100 sampled instructions, 92 describe the goal state rather than the trajectory. A secondary source of noise is due to the discretization of the state space. As a result, the agent often can not reach the exact target position. The demonstration error illustrates this problem (Table 3). To provide task completion reward during learning, we relax the state comparison, and consider states to be equal if the sum of block distances is under the size of one block.
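Under the same dictionary-of-positions state representation assumed in the earlier sketches, this relaxed comparison is a one-liner (block_size is the block width in the same units as the positions):

```python
import math

def states_equal(s1, s2, block_size):
    """Relaxed equality: the summed per-block Euclidean distance is under one block size."""
    return sum(math.dist(s1[b], s2[b]) for b in s1) < block_size
```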

The corpus includes 11,871/1,719/3,177 instructions for training/development/testing. Table 1 shows corpus statistics compared to the commonly used SAIL navigation corpus (MacMahon et al., 2006; Chen and Mooney, 2011). While the SAIL agent only observes its immediate surroundings, overall the blocks domain provides more complex instructions. Furthermore, the SAIL environment includes only 400 states, which is insufficient for generalization with vision input. We compare to other data sets in Appendix D.

Table 1: Corpus statistics for the block environment we use and the SAIL navigation domain.

                            SAIL     Blocks
  Number of instructions    3,237    16,767
  Mean instruction length   7.96     15.27
  Vocabulary                563      1,426
  Mean trajectory length    3.12     15.4

Evaluation: We evaluate task completion error as the sum of Euclidean distances for each block between its position at the end of the execution and in the gold goal state. We divide distances by block size to normalize for the image size. In contrast, Bisk et al. (2016) evaluate the selection of the source and target positions independently.

Systems: We report performance of ablations, the upper bound of following the demonstrations (Demonstrations), and five baselines: (a) STOP: the agent immediately stops, (b) RANDOM: the agent takes random actions, (c) SUPERVISED: supervised learning with a maximum-likelihood estimate using demonstration state-action pairs, (d) DQN: deep Q-learning with both shaping terms (Mnih et al., 2015), and (e) REINFORCE: policy gradient with cumulative episodic reward with both shaping terms (Sutton et al., 1999). Full system details are given in Appendix B.

Parameters and Initialization: Full details are in Appendix C. We consider $K = 4$ previous images, and a horizon length of $J = 40$. We initialize our model with the SUPERVISED model.
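The evaluation metric described above is equally small; the sketch assumes the same state representation and normalizes by block size:

```python
import math

def execution_error(final_state, goal_state, block_size):
    """Task completion error: sum of per-block Euclidean distances between the
    final execution state and the gold goal state, divided by the block size."""
    return sum(math.dist(final_state[b], goal_state[b]) for b in final_state) / block_size
```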

8 Results

Table 2 shows development results. We run each experiment three times and report the best result. The RANDOM and STOP baselines illustrate the complexity of the task. Our approach, including both shaping terms in a contextual bandit setting, significantly outperforms the other methods. SUPERVISED learning demonstrates lower performance. A likely explanation is test-time execution errors leading to unfamiliar states with poor later performance (Kakade and Langford, 2002), a form of the covariate shift problem. The low performance of REINFORCE and DQN illustrates the challenge of general reinforcement learning with limited data due to relatively high sample complexity (Kearns et al., 1999; Krishnamurthy et al., 2016). We also report results using ensembles of the three models.

Table 2: Mean and median (Med.) development results.

                         Distance Error    Min. Distance
  Algorithm              Mean    Med.      Mean    Med.
  Demonstrations         0.35    0.30      0.35    0.30
  Baselines
    STOP                 5.95    5.71      5.95    5.71
    RANDOM               15.3    15.70     5.92    5.70
    SUPERVISED           4.65    4.45      3.72    3.26
    REINFORCE            5.57    5.29      4.50    4.25
    DQN                  6.04    5.78      5.63    5.49
  Our Approach           3.60    3.09      2.72    2.21
    w/o Sup. Init        3.78    3.13      2.79    2.21
    w/o Prev. Action     3.95    3.44      3.20    2.56
    w/o F1               4.33    3.74      3.29    2.64
    w/o F2               3.74    3.11      3.13    2.49
    w/ Distance Reward   8.36    7.82      5.91    5.70
  Ensembles
    SUPERVISED           4.64    4.27      3.69    3.22
    REINFORCE            5.28    5.23      4.75    4.67
    DQN                  5.85    5.59      5.60    5.46
    Our Approach         3.59    3.03      2.63    2.15

Table 3: Mean and median (Med.) test results.

                         Distance Error    Min. Distance
  Algorithm              Mean    Med.      Mean    Med.
  Demonstrations         0.37    0.31      0.37    0.31
  STOP                   6.23    6.12      6.23    6.12
  RANDOM                 15.11   15.35     6.21    6.09
  Ensembles
    SUPERVISED           4.95    4.53      3.82    3.33
    REINFORCE            5.69    5.57      5.11    4.99
    DQN                  6.15    5.97      5.86    5.77
    Our Approach         3.78    3.14      2.83    2.07

We ablate different parts of our approach. Ablations of supervised initialization (our approach w/o sup. init) or the previous action (our approach w/o prev. action) result in an increase in error. While the contribution of initialization is modest, it provides faster learning. On average, after two epochs, we observe an error of 3.94 with initialization and 6.01 without. We hypothesize that the F2 shaping term, which uses full demonstrations, helps to narrow the gap at the end of learning. Without supervised initialization and F2, the error increases to 5.45 (the 0% point in Figure 4). We observe the contribution of each shaping term and their combination. To study the benefit of potential-based shaping, we experiment with a negative distance-to-goal reward. This reward replaces the problem reward and encourages getting closer to the goal (our approach w/ distance reward). With this reward, learning fails to converge, leading to a relatively high error.

Figure 4 shows our approach with varying amounts of supervision. We remove demonstrations from both supervised initialization and the F2 shaping term. For example, when only 25% are available, only 25% of the data is available for initialization and the F2 term is only present for this part of the data. While some demonstrations are necessary for effective learning, we get most of the benefit with only 12.5%.

Figure 4: Mean distance error as a function of the ratio of training examples that include complete trajectories. The rest of the data includes the goal state only.

Table 3 provides test results, using the ensembles to decrease the risk of overfitting the development set. We observe similar trends to the development results, with our approach outperforming all baselines. The remaining gap to the demonstrations upper bound illustrates the need for future work.

To understand performance better, we measure the minimal distance (min. distance in Tables 2 and 3), the closest the agent got to the goal. We observe a strong trend: the agent often gets close to the goal and fails to stop. This behavior is also reflected in the number of steps the agent takes. While the mean number of steps in development demonstrations is 15.2, the agent generates on average 28.7 steps, and 55.2% of the time it takes the maximum number of allowed steps (40). Testing on the training data shows an average of 21.75 steps and exhausts the number of steps 29.3% of the time. The mean number of steps in training demonstrations is 15.5. This illustrates the challenge of learning how to behave at an absorbing state, which is observed relatively rarely during training. This behavior also shows in our video.⁵

We also evaluate a supervised learning variant that assumes a perfect planner.⁶ This setup is similar to Bisk et al. (2016), except using raw image input. It allows us to roughly understand how well the agent generates actions. We observe a mean error of 2.78 on the development set, an improvement of almost two points over supervised learning with our approach. This illustrates the complexity of the complete problem.

We conduct a shallow linguistic analysis to understand the agent behavior with regard to differences in the language input. As expected, the agent is sensitive to unknown words. For instructions without unknown words, the mean development error is 3.49. It increases to 3.97 for instructions with a single unknown word, and to 4.19 for two.⁷ We also study the agent behavior when observing new phrases composed of known words by looking at instructions with new n-grams and no unknown words. We observe no significant correlation between performance and new bi-grams and tri-grams. We also see no meaningful correlation between instruction length and performance. Although counterintuitive given the linguistic complexities of longer instructions, it aligns with results in machine translation (Luong et al., 2015).

⁵ https://github.com/clic-lab/blocks

⁶ As there is no sequence of decisions, our reinforcement approach is not appropriate for the planner experiment. The architecture details are described in Appendix B.

9 Conclusions

We study the problem of learning to execute instructions in a situated environment given only raw visual observations. Supervised approaches do not explore adequately to handle test-time errors, and reinforcement learning approaches require a large number of samples for good convergence. Our solution provides an effective combination of both approaches: reward shaping to create relatively stable optimization in a contextual bandit setting, which takes advantage of a signal similar to supervised learning, with a reinforcement basis that admits substantial exploration and easy avenues for smart initialization. This combination is designed for the few-samples regime we address. When the number of samples is unbounded, the drawbacks observed in this scenario for optimizing longer-term reward do not hold.

Acknowledgments

This research was supported by a Google Faculty Award, an Amazon Web Services Research Grant, and a Schmidt Sciences Research Award. We thank Alane Suhr, Luke Zettlemoyer, and the anonymous reviewers for their helpful feedback, and Claudia Yan for technical help. We also thank the Cornell NLP group and the Microsoft Research Machine Learning NYC group for their support and insightful comments.

⁷ This trend continues, although the number of instructions is too low (< 20) to be reliable.

References

Alekh Agarwal, Daniel J. Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. 2014. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the International Conference on Machine Learning.

Jacob Andreas and Dan Klein. 2015. Alignment-based compositional semantics for instruction following. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. https://doi.org/10.18653/v1/D15-1138.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016a. Learning to compose neural networks for question answering. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. https://doi.org/10.18653/v1/N16-1181.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016b. Neural module networks. In Conference on Computer Vision and Pattern Recognition.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In International Journal of Computer Vision.

Yoav Artzi, Dipanjan Das, and Slav Petrov. 2014a. Learning compact lexicons for CCG semantic parsing. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. https://doi.org/10.3115/v1/D14-1134.

Yoav Artzi, Maxwell Forbes, Kenton Lee, and Maya Cakmak. 2014b. Programming by demonstration with situated semantic parsing. In AAAI Fall Symposium Series.

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association of Computational Linguistics 1:49–62. http://aclweb.org/anthology/Q13-1005.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. 2002. The nonstochastic multi-armed bandit problem. SIAM J. Comput. 32(1):48–77.

Yonatan Bisk, Deniz Yuret, and Daniel Marcu. 2016. Natural language communication with robots. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. https://doi.org/10.18653/v1/N16-1089.

S.R.K. Branavan, Harr Chen, Luke Zettlemoyer, and Regina Barzilay. 2009. Reinforcement learning for mapping instructions to actions. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. http://aclweb.org/anthology/P09-1010.

S.R.K. Branavan, Luke Zettlemoyer, and Regina Barzilay. 2010. Reading between the lines: Learning to map high-level instructions to commands. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. http://aclweb.org/anthology/P10-1129.

Tim Brys, Anna Harutyunyan, Halit Bener Suay, Sonia Chernova, Matthew E. Taylor, and Ann Nowé. 2015. Reinforcement learning from demonstration through shaping. In Proceedings of the International Joint Conference on Artificial Intelligence.

David L. Chen and Raymond J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In Proceedings of the National Conference on Artificial Intelligence.

Wenhu Chen, Aurélien Lucchi, and Thomas Hofmann. 2016. Bootstrap, review, decode: Using out-of-domain textual data to improve image captioning. CoRR abs/1611.05321.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. CoRR abs/1504.00325.

Kevin Clark and D. Christopher Manning. 2016. Deep reinforcement learning for mention-ranking coreference models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. http://aclweb.org/anthology/D16-1245.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science 14:179–211.

Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. 2016. Deep reinforcement learning with a natural language action space. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). https://doi.org/10.18653/v1/P16-1153.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9.

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. 2016. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. CoRR abs/1612.06890.

Sham Kakade and John Langford. 2002. Approximately optimal approximate reinforcement learning. In Machine Learning, Proceedings of the Nineteenth International Conference (ICML 2002), University of New South Wales, Sydney, Australia, July 8-12, 2002.

Michael Kearns, Yishay Mansour, and Andrew Y. Ng. 1999. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. In Proceedings of the International Joint Conference on Artificial Intelligence.

Joohyun Kim and Raymond Mooney. 2012. Unsupervised PCFG induction for grounded language learning with highly ambiguous supervision. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. http://aclweb.org/anthology/D12-1040.

Joohyun Kim and Raymond Mooney. 2013. Adapting discriminative reranking to grounded language learning. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). http://aclweb.org/anthology/P13-1022.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.

Jens Kober, J. Andrew Bagnell, and Jan Peters. 2013. Reinforcement learning in robotics: A survey. International Journal of Robotics Research 32:1238–1274.

Akshay Krishnamurthy, Alekh Agarwal, and John Langford. 2016. PAC reinforcement learning with rich observations. In Advances in Neural Information Processing Systems.

John Langford and Tong Zhang. 2007. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007.

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. 2016. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research 17.

Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. http://aclweb.org/anthology/D16-1127.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. http://aclweb.org/anthology/D15-1166.

Matthew MacMahon, Brian Stankiewics, and Benjamin Kuipers. 2006. Walk the talk: Connecting language, knowledge, action in route instructions. In Proceedings of the National Conference on Artificial Intelligence.

Cynthia Matuszek, Dieter Fox, and Karl Koscher. 2010. Following directions using statistical machine translation. In Proceedings of the International Conference on Human-Robot Interaction.

Cynthia Matuszek, Evan Herbst, Luke S. Zettlemoyer, and Dieter Fox. 2012. Learning to parse natural language commands to a robot control system. In Proceedings of the International Symposium on Experimental Robotics.

Hongyuan Mei, Mohit Bansal, and R. Matthew Walter. 2016. What to talk about and how? Selective generation using LSTMs with coarse-to-fine alignment. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. https://doi.org/10.18653/v1/N16-1086.

Dipendra K. Misra, Jaeyong Sung, Kevin Lee, and Ashutosh Saxena. 2016. Tell me Dave: Context-sensitive grounding of natural language to manipulation instructions. The International Journal of Robotics Research 35(1-3):281–300.

Kumar Dipendra Misra, Kejia Tao, Percy Liang, and Ashutosh Saxena. 2015. Environment-driven lexicon induction for high-level instructions. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). https://doi.org/10.3115/v1/P15-1096.

Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. 2013. Playing Atari with deep reinforcement learning. In Advances in Neural Information Processing Systems.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, and Georg Ostrovski. 2015. Human-level control through deep reinforcement learning. Nature 518(7540).

Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. 2015. Language understanding for text-based games using deep reinforcement learning. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. https://doi.org/10.18653/v1/D15-1001.

Karthik Narasimhan, Adam Yala, and Regina Barzilay. 2016. Improving information extraction by acquiring external evidence with reinforcement learning. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. http://aclweb.org/anthology/D16-1261.

Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the International Conference on Machine Learning.

Junhyuk Oh, Valliappa Chockalingam, Satinder P. Singh, and Honglak Lee. 2016. Control of memory, active perception, and action in Minecraft. In Proceedings of the International Conference on Machine Learning.

Rohan Paul, Jacob Arkin, Nicholas Roy, and Thomas M. Howard. 2016. Efficient grounding of abstract spatial concepts for natural language interaction with robot manipulators. In Robotics: Science and Systems.

Andrei A. Rusu, Matej Vecerik, Thomas Rothörl, Nicolas Heess, Razvan Pascanu, and Raia Hadsell. 2016. Sim-to-real robot learning from pixels with progressive nets. CoRR.

John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. 2015. Trust region policy optimization.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.

Alane Suhr, Mike Lewis, James Yeh, and Yoav Artzi. 2017. A corpus of compositional language for visual reasoning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Jaeyong Sung, Seok Hyun Jin, and Ashutosh Saxena. 2015. Robobarista: Object part based transfer of manipulation trajectories from crowd-sourcing in 3d pointclouds. In International Symposium on Robotics Research.

Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement learning: An introduction. IEEE Trans. Neural Networks 9:1054–1054.

Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. 1999. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.

Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew Walter, Ashis G. Banerjee, Seth Teller, and Nicholas Roy. 2011. Understanding natural language commands for robotic navigation and mobile manipulation. In Proceedings of the National Conference on Artificial Intelligence.

Adam Vogel and Daniel Jurafsky. 2010. Learning to follow navigational directions. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. http://aclweb.org/anthology/P10-1083.

Bonnie Webber, Norman Badler, Barbara Di Eugenio, Christopher Geib, Libby Levison, and Michael Moore. 1995. Instructions, intentions and expectations. Artificial Intelligence 73(1):253–269.

Eric Wiewiora, Garrison W. Cottrell, and Charles Elkan. 2003. Principled methods for advising reinforcement learning agents. In Proceedings of the International Conference on Machine Learning.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8.

Ronald J. Williams and Jing Peng. 1991. Function optimization using connectionist reinforcement learning algorithms. Connection Science 3(3):241–268.

Terry Winograd. 1972. Understanding natural language. Cognitive Psychology 3(1):1–191.

Kelvin Xu, Jimmy Ba, Jamie Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning.

Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J. Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. 2017. Target-driven visual navigation in indoor scenes using deep reinforcement learning.

A Reward Shaping Theorems

In Section 6, we introduce two reward shaping terms. We follow the safe-shaping theorems of Ng et al. (1999) and Wiewiora et al. (2003). The theorems outline potential-based terms that realize sufficient conditions for safe shaping. Applying safe terms guarantees that the order of policies according to the original problem reward does not change. While the theory only applies when optimizing the total reward, we show empirically the effectiveness of the safe shaping terms in a contextual bandit setting. For convenience, we provide the definitions of potential-based shaping terms and the theorems introduced by Ng et al. (1999) and Wiewiora et al. (2003) using our notation. We refer the reader to the original papers for the full details and proofs.

The distance-based shaping term F1 is defined based on the theorem of Ng et al. (1999):

Definition. A shaping term F : S × A × S → R is potential-based if there exists a function φ : S → R such that, at time j, F(s_j, a_j, s_{j+1}) = γφ(s_{j+1}) − φ(s_j), ∀ s_j, s_{j+1} ∈ S and a_j ∈ A, where γ ∈ [0, 1] is a future reward discounting factor. The function φ is the potential function of the shaping term F.

Theorem. Given a reward function R(s_j, a_j), if the shaping term is potential-based, the shaped reward R_F(s_j, a_j, s_{j+1}) = R(s_j, a_j) + F(s_j, a_j, s_{j+1}) does not modify the total order of policies.

In the definition of F1, we set the discounting term γ to 1.0 and omit it.

The trajectory-based shaping term F2 follows the shaping term introduced by Brys et al. (2015). To define it, we use the look-back advice shaping term of Wiewiora et al. (2003), who extended the potential-based term of Ng et al. (1999) to terms that consider the previous state and action:

Definition. A shaping term F : S × A × S × A → R is potential-based if there exists a function φ : S × A → R such that, at time j, F(s_{j−1}, a_{j−1}, s_j, a_j) = γφ(s_j, a_j) − φ(s_{j−1}, a_{j−1}), ∀ s_j, s_{j−1} ∈ S and a_j, a_{j−1} ∈ A, where γ ∈ [0, 1] is a future reward discounting factor. The function φ is the potential function of the shaping term F.

Theorem. Given a reward function R(s_j, a_j), if the shaping term is potential-based, the shaped reward R_F(s_{j−1}, a_{j−1}, s_j, a_j) = R(s_j, a_j) + F(s_{j−1}, a_{j−1}, s_j, a_j) does not modify the total order of policies.

In the definition of F2 as well, we set the discounting term γ to 1.0 and omit it.
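To make the two terms concrete, the following Python sketch shows how a potential-based term (F1) and a look-back advice term (F2) combine with the problem reward, with γ = 1.0 as above. The specific potential functions used here (negative distance of the moved block to its target for F1, agreement with a demonstration action for F2) and all names below are illustrative assumptions only; the paper's exact potentials are defined in Section 6.

import math

def phi_state(state, target_pos):
    # Potential over states: negative Euclidean distance of the moved block to its
    # target position (an assumed stand-in for the paper's distance-based potential).
    (x, y), (tx, ty) = state["block_pos"], target_pos
    return -math.hypot(x - tx, y - ty)

def f1(s_j, s_next, target_pos):
    # Distance-based term: F1(s_j, a_j, s_{j+1}) = phi(s_{j+1}) - phi(s_j), with gamma = 1.
    return phi_state(s_next, target_pos) - phi_state(s_j, target_pos)

def phi_state_action(state, action, demo_actions):
    # Potential over state-action pairs: 1 if the action matches the demonstration
    # action at this step, else 0 (an assumed stand-in for the trajectory potential).
    return 1.0 if demo_actions.get(state["step"]) == action else 0.0

def f2(s_prev, a_prev, s_j, a_j, demo_actions):
    # Look-back advice term: F2 = phi(s_j, a_j) - phi(s_{j-1}, a_{j-1}), with gamma = 1.
    return (phi_state_action(s_j, a_j, demo_actions)
            - phi_state_action(s_prev, a_prev, demo_actions))

def shaped_reward(r, s_prev, a_prev, s_j, a_j, s_next, target_pos, demo_actions):
    # R_F = R(s_j, a_j) + F1(s_j, a_j, s_{j+1}) + F2(s_{j-1}, a_{j-1}, s_j, a_j).
    return r + f1(s_j, s_next, target_pos) + f2(s_prev, a_prev, s_j, a_j, demo_actions)

Because both terms are differences of potentials, the theorems above guarantee that adding them to the problem reward leaves the ordering of policies unchanged when optimizing the total reward.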

B Evaluation Systems

We implement multiple systems for evaluation.

STOP The agent performs the STOP action immediately at the beginning of execution.

RANDOM The agent samples actions uniformly until STOP is sampled or J actions have been sampled, where J is the execution horizon.

SUPERVISED Given the training data with N instruction-state-execution triplets, we generate training data of instruction-state-action triplets and optimize the log-likelihood of the data. Formally, we optimize the objective:

  J = (1/N) ∑_{i=1}^{N} ∑_{j=1}^{m^{(i)}} log π(s_j^{(i)}, a_j^{(i)}) ,

where m^{(i)} is the length of the execution e^{(i)}, s_j^{(i)} is the agent context at step j in sample i, and a_j^{(i)} is the demonstration action of step j in demonstration execution e^{(i)}. Agent contexts are generated with the annotated previous actions (i.e., to generate previous images and the previous action). We use minibatch gradient descent with ADAM updates (Kingma and Ba, 2014); a minimal sketch of this objective is given at the end of this appendix.

DQN We use deep Q-learning (Mnih et al., 2015) to train a Q-network. We use the architecture described in Section 4, except that we replace the task-specific part with a single 81-dimension layer. In contrast to our probabilistic model, we do not decompose block and direction selection. We use the shaped reward function, including both F1 and F2. We use a replay memory of size 2,000 and an ε-greedy behavior policy to generate rollouts. We anneal ε from 1 to 0.1 over 100,000 steps and use prioritized sweeping for sampling. We also use a target network that is synchronized after every epoch.

REINFORCE We use the REINFORCE algorithm (Sutton et al., 1999) to train our agent. REINFORCE performs policy gradient learning with the total reward accumulated over the rollout, as opposed to using immediate rewards as in our main approach. It estimates the total reward with Monte Carlo sampling by performing a rollout. We use the shaped reward function, including both the F1 and F2 terms. Similar to our approach, we initialize with a SUPERVISED model and regularize the objective with the entropy of the policy. We do not use a reward baseline.

SUPERVISED with Oracle Planner We use a variant of our model that assumes a perfect planner. We modify the architecture in Section 4 to predict the block to move and its target position as a pair of coordinates, and the sequence of actions is then inferred from the predicted target position using an oracle planner. We train using supervised learning by maximizing the likelihood of the block being moved and minimizing the squared distance between the predicted and annotated target positions.
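As a concrete illustration of the SUPERVISED objective defined earlier in this appendix, the sketch below computes the negative mean log-likelihood of demonstration actions under the policy. It is a minimal PyTorch-style sketch, assuming a flat action distribution; the paper's model factors the distribution over blocks and directions, and the policy network and tensor names are hypothetical.

import torch
import torch.nn.functional as F

def supervised_loss(policy, contexts, actions):
    # contexts: batched agent contexts s_j^(i); actions: demonstration actions a_j^(i).
    logits = policy(contexts)                   # (batch, num_actions)
    log_probs = F.log_softmax(logits, dim=-1)   # log pi(a | s)
    nll = -log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    return nll.mean()                           # minimizing this maximizes J

# Training uses minibatch gradient descent with ADAM updates, e.g.:
# optimizer = torch.optim.Adam(policy.parameters(), lr=0.001)
# supervised_loss(policy, contexts, actions).backward(); optimizer.step()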

C Parameters and Initialization

C.1 Architecture Parameters

We use an RGB image of 120 × 120 pixels and a convolutional neural network (CNN) with 4 layers. The first two layers apply 32 8 × 8 filters with a stride of 4, and the third applies 32 4 × 4 filters with a stride of 2. The last layer performs an affine transformation to create a 200-dimension vector. We linearly scale all images to have zero mean and unit norm. We use a single-layer RNN with 150-dimensional word embeddings and 250 LSTM units. The dimension of the action embedding ψ_a is 56, including 32 for embedding the block and 24 for embedding the directions. W^(1) is a 506 × 120 matrix and b^(1) is a 120-dimension vector. W^(B) is 120 × 20 for the 20 blocks, and W^(D) is 120 × 5 for the four directions (north, south, east, west) and the STOP action. We consider K = 4 previous images, and use horizon length J = 40.
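The sketch below collects these dimensions into a PyTorch-style module. It is only a sketch under stated assumptions: the exact wiring, activations, and the way the K previous images enter the CNN follow Section 4 of the paper, so the module and argument names, the ReLU choices, and the input channel count here are illustrative.

import torch
import torch.nn as nn

class AgentSketch(nn.Module):
    def __init__(self, vocab_size, num_blocks=20, num_directions=5, in_channels=3):
        super().__init__()
        # 4-layer CNN: two 8x8/stride-4 conv layers, one 4x4/stride-2 conv layer,
        # then an affine layer producing a 200-dimensional image vector.
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 2 * 2, 200),  # a 120x120 input shrinks to 2x2 with these strides
        )
        # Single-layer RNN over the instruction: 150-dim word embeddings, 250 LSTM units.
        self.word_emb = nn.Embedding(vocab_size, 150)
        self.lstm = nn.LSTM(input_size=150, hidden_size=250, batch_first=True)
        # Action embedding psi_a: 32 dims for the block, 24 for the direction (56 total).
        self.block_emb = nn.Embedding(num_blocks, 32)
        self.dir_emb = nn.Embedding(num_directions, 24)
        # Joint layer W(1): 506 -> 120 (200 image + 250 text + 56 action = 506).
        self.joint = nn.Linear(506, 120)
        # Output heads: 20 blocks, and 4 directions plus STOP.
        self.block_head = nn.Linear(120, num_blocks)
        self.dir_head = nn.Linear(120, num_directions)

    def forward(self, image, instruction_tokens, prev_block, prev_direction):
        img = self.cnn(image)
        outputs, _ = self.lstm(self.word_emb(instruction_tokens))
        text = outputs.mean(dim=1)  # mean of LSTM hidden states (see Appendix E)
        act = torch.cat([self.block_emb(prev_block), self.dir_emb(prev_direction)], dim=-1)
        h = torch.relu(self.joint(torch.cat([img, text, act], dim=-1)))
        return self.block_head(h), self.dir_head(h)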

C.2 Initialization

Embedding matrices are initialized with a zero-mean unit-variance Gaussian distribution. All biases are initialized to 0. We use a zero-mean truncated normal distribution to initialize the CNN filters (0.005 variance) and CNN weight matrices (0.004 variance). All other weight matrices are initialized with a normal distribution (mean = 0.0, standard deviation = 0.01). The matrices used in the word embedding function ψ are initialized with a zero-mean normal distribution with a standard deviation of 1.0. Action embedding matrices, which are used for ψ_a, are initialized with a zero-mean normal distribution with 0.001 standard deviation. We initialize policy gradient learning, including our approach, with parameters estimated using supervised learning for two epochs, except the direction parameters W^(D) and b^(D), which we learn from scratch. We found this initialization method to provide a good balance between strong initialization and not biasing the learning too much, which can result in limited exploration.
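A small NumPy sketch of this initialization scheme follows. Shapes are taken from Appendix C.1 where possible; the truncated-normal helper, the vocabulary size, and the exact shape of the CNN affine matrix are illustrative assumptions, and since the appendix reports variances we take square roots to obtain standard deviations.

import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 250  # hypothetical vocabulary size, for illustration only

def trunc_normal(shape, std):
    # Zero-mean truncated normal: resample entries falling beyond two standard deviations.
    x = rng.normal(0.0, std, size=shape)
    while np.any(np.abs(x) > 2 * std):
        mask = np.abs(x) > 2 * std
        x[mask] = rng.normal(0.0, std, size=mask.sum())
    return x

params = {
    "cnn_filters": trunc_normal((32, 3, 8, 8), std=np.sqrt(0.005)),
    "cnn_affine":  trunc_normal((128, 200), std=np.sqrt(0.004)),  # flattened CNN features -> 200 (shape assumed)
    "word_emb":    rng.normal(0.0, 1.0, size=(VOCAB_SIZE, 150)),
    "block_emb":   rng.normal(0.0, 0.001, size=(20, 32)),
    "dir_emb":     rng.normal(0.0, 0.001, size=(5, 24)),
    "W1":          rng.normal(0.0, 0.01, size=(506, 120)),
    "b1":          np.zeros(120),  # all biases start at 0
}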

C.3 Learning Parameters

We use the distance error on a small validation set as the stopping criterion. After each epoch, we save the model, and select the final model based on development set performance. While this method overfits the development set, we found it more reliable than using the small validation set alone. Our relatively modest performance degradation on the held-out set illustrates that our models generalize well. We set the reward and shaping penalties δ = δ_f = 0.02. The entropy regularization coefficient is λ = 0.1. The learning rate is µ = 0.001 for supervised learning and µ = 0.00025 for policy gradient. We clip the gradient at a norm of 5.0. All learning algorithms use a mini-batch of size 32 during training.
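For reference, these learning hyperparameters can be summarized in a single configuration dictionary; the key names are illustrative, while the values are those reported here and in C.1.

CONFIG = {
    "reward_penalty_delta": 0.02,    # delta = delta_f = 0.02
    "shaping_penalty_delta_f": 0.02,
    "entropy_coefficient": 0.1,      # lambda
    "lr_supervised": 0.001,          # mu for supervised learning
    "lr_policy_gradient": 0.00025,   # mu for policy gradient
    "gradient_clip_norm": 5.0,
    "batch_size": 32,
    "num_prev_images": 4,            # K (Appendix C.1)
    "horizon": 40,                   # J (Appendix C.1)
}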

D Dataset Comparisons

We briefly review instruction following datasets in Table 4, including Blocks (Bisk et al., 2016), SAIL (MacMahon et al., 2006; Chen and Mooney, 2011), Matuszek (Matuszek et al., 2012), and Misra (Misra et al., 2015). Overall, Blocks provides the largest training set and a relatively complex environment with well over 2.43 × 10^18 possible states.8 The most similar dataset is SAIL, which provides only partial observability of the environment (i.e., the agent observes only what is around it). However, SAIL is less complex on other dimensions related to the instructions, trajectories, and action space. In addition, while Blocks has a large number of possible states, SAIL includes only 400 states. The small number of states makes it difficult to learn vision models that generalize well. Misra (Misra et al., 2015) provides a parameterized action space (e.g., grasp(cup)), which leads to a large number of potential actions. However, the corpus is relatively small.

8 We compute this loose lower bound on the number of states in the block world as 20! ≈ 2.43 × 10^18 (the number of block permutations). This is a very loose lower bound.

Name       # Samples   Vocabulary Size   Mean Instruction Length   # Actions   Mean Trajectory Length   Partially Observed
Blocks     16,767      1,426             15.27                     81          15.4                     No
SAIL       3,237       563               7.96                      3           3.12                     Yes
Matuszek   217         39                6.65                      3           N/A                      No
Misra      469         775               48.7                      > 100       21.5                     No

Table 4: Comparison of several related natural language instruction corpora.

E Common Questions

This is a list of potential questions following various decisions that we made. While we ablated and discussed all the crucial decisions in the paper, we include this appendix to provide as much information as possible.

Is it possible to manually engineer a competitive reward function without shaping? Shaping is a principled approach to add information to a problem reward with relatively intuitive potential functions. Our experiments demonstrate its effectiveness. Investing engineering effort in designing a reward function tailored to the task is a potential alternative approach.

Are you using beam search? Why not? While using beam search would probably increase our performance, we chose to avoid it. We are motivated by robotic scenarios, where implementing beam search is a challenging task and often not possible. We distinguish between beam search and backtracking. Beam search is also incompatible with common assumptions of reinforcement learning, although it is often used at test time with reinforcement learning systems.

Why are you using the mean of the LSTM hidden states instead of just the final state? We empirically tested both options. Using the mean worked better. This was also observed by Narasimhan et al. (2015). Understanding in which scenarios one technique is better than the other is an important question for future work.
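A minimal PyTorch-style illustration of the two options, assuming the 250-unit LSTM from Appendix C.1 and dummy input, is:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=150, hidden_size=250, batch_first=True)
tokens = torch.randn(1, 12, 150)   # embeddings of a 12-word instruction (dummy data)
outputs, (h_n, c_n) = lstm(tokens)

text_mean = outputs.mean(dim=1)    # mean of the LSTM hidden states (used in the paper)
text_final = h_n[-1]               # final hidden state (the alternative we compared to)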

Can you provide more details about initialization? Please see Appendix C.

Does the agent in the block world learn to move obstacles and other blocks? While the agent can move any block at any step, in practice, it rarely happens. The agent prefers to move blocks around obstacles rather than moving other blocks and moving them back into place afterwards. This behavior is learned from the data and shows even when we use only a very limited amount of demonstrations. We hypothesize that in other tasks the agent is likely to learn that moving obstacles is advantageous, for example when demonstrations include moving obstacles.

Does the agent explicitly mark where it is in the instruction? We estimate that over 90% of the instructions describe the target position. Therefore, it is often not clear how much of the instruction was completed during the execution. The agent does not have an explicit mechanism to mark portions of the instruction that are complete. We briefly experimented with attention, but found that empirically it does not help in our domain. Designing an architecture that allows such considerations is an important direction for future work.

Does the agent know which blocks are present? Not all blocks are included in each task. The agent must infer which blocks are present from the image and instruction. The set of possible actions, which includes moving all possible blocks, does not change between tasks. If the agent chooses to move a block that is not present, the world state does not change.

Did you experiment with executing sequences of instructions? The Bisk et al. (2016) corpus includes such instructions, right? The majority of existing corpora, including SAIL (Chen and Mooney, 2011; Artzi and Zettlemoyer, 2013; Mei et al., 2016), provide segmented sequences of instructions. Existing approaches take advantage of this segmentation during training. For example, Chen and Mooney (2011), Artzi and Zettlemoyer (2013), and Mei et al. (2016) all train on segmented data and test on sequences of instructions by doing inference on one sentence at a time. We are also able to do this. Similar to these approaches, we will likely suffer from cascading errors. The multi-instruction paragraphs in the Bisk et al. (2016) data are an open problem and present new challenges beyond just instruction length. For example, they often merge multiple block placements in one instruction (e.g., put the SRI, HP, and Dell blocks in a row). Since the original corpus does not provide trajectories and our automatic generation procedure is not able to resolve which block to move first, we do not have demonstrations for this data. The instructions also present a significantly more complex task. This is an important direction for future work, which illustrates the complexity and potential of the domain.

Potential-based shaping was proven to be safe when maximizing the total expected reward. Does this apply to the contextual bandit setting, where you maximize the immediate reward? The safe shaping theorems (Appendix A) do not hold in our contextual bandit setting. We show empirically that shaping works in practice. However, whether and how it changes the order of policies is an open question.

How long does it take to train? How many frames does the agent observe? The agent observes about 2.5 million frames. It takes 16 hours using 50% capacity of an Nvidia Pascal Titan X GPU to train with our approach. DQN takes more than twice the time for the same number of epochs. Supervised learning takes about 9 hours to converge. We also trained DQN for around four days, but did not observe improvement.

Did you consider initializing DQN with supervised learning? Initializing DQN with the probabilistic supervised model is challenging. Since DQN is not probabilistic, it is not clear what this initialization means. Smart initialization of DQN is an important problem for future work.