
CS229 Final Project: Language Grounding in Minecraft with Gated-Attention Networks

Ali Zaidi
Microsoft, Stanford University

[email protected]

Abstract

A key question in language understanding is the problem of language grounding: how do symbols such as words get their meaning? We examine this question in the context of task-oriented language grounding in gameplay. In order to perform tasks and challenges specified by natural language instructions, agents need to extract semantically meaningful representations of language and map them to the visual elements of their scene and to actions in the environment. This is often referred to as task-oriented language grounding. In this project, we propose to directly map raw visual observations and text input into actions for instruction execution, using an end-to-end trainable neural architecture. The model synthesizes image and text representations using a Gated-Attention mechanism and learns a policy using Stein variational policy gradients to execute the natural language instruction. We evaluate our method in the Minecraft environment on the problem of retrieving items in rooms and mazes and show improvements over supervised and common reinforcement learning algorithms.

1 Introduction

Reinforcement learning has made incredible progress in the past few years, in large part due to the effectiveness of neural networks as value function approximators and their ability to be trained end-to-end using stochastic optimization. In this work, we examine the ability to define differentiable architectures that can be trained end-to-end using stochastic optimization for solving task-oriented language grounding problems. These are problems where the agent needs to solve specific tasks defined through natural language instructions. In order to succeed in such applications, the agent needs to extract semantically meaningful representations of language that it can map to visual elements and actions in the environment. In particular, to accomplish goals defined through instructions, the agent needs to draw semantic correspondences between the visual and linguistic modalities and learn a policy to perform the required task, an example of multi-modal learning.

To tackle this problem, we propose an architecture that consists of a state representation module, which provides a joint representation of the natural language instruction and the visual scene observed from the agent's view, and a policy learning module, which predicts the optimal action for the agent at the next timestep $t$. The state representation module is defined using a Gated-Attention network [CSP+17], and the policy learning module is built on Stein variational policy gradients. The main aspects of this project can be summarized as: 1) training end-to-end neural architectures that take raw pixel input of the current scene and natural language instructions, without assuming any intermediate representations or prior knowledge; 2) representation learning using Gated-Attention mechanisms for multimodal fusion of the linguistic and visual modalities; and 3) policy learning using Stein variational policy gradients. All of the experiments in this project are conducted in the popular Minecraft game with the open source framework Project Malmo [JHHB16]. Figure 1 provides an example of task-oriented gameplay in this environment.

2 Related Work

Learning semantic grounding is a challenging task, especially without prior knowledge. From a statistical perspective, the agent needs to learn correlations in a stream of low-level inputs (pixel-level visual input, in this scenario) and map those correlations to linguistic expressions and feasible actions in its environment. Symbolic approaches to language grounding have been attempted for nearly half a century [Win72]. However, such rule-based approaches are unable to scale to the full complexity of natural language and stochastic three-dimensional environments. Recent research [AAD+16, CSP+17, HHG+17, LHKOR17] has aimed to fuse and learn directly from these different modalities: vision, language, and action. Learning these concept representation spaces directly has been shown to provide more faithful human semantic representations of properties of text and objects [SL12].

Figure 1: Task-oriented grounding in a maze environment. Here the agent is provided with natural language instructions to pick up a certain item, but also needs contextual visual information to avoid the dangerous missing block.

3 The 3D Minecraft Learning Environment

3.1 Customizable Platform for AI Experimentation

Our environment is based on the customizable platform Project Malmo [JHHB16], which provides an environment for experimentation with autonomous AI agents in Minecraft. Minecraft is an enormously popular game in which players control agents in a 3D grid world and construct novel objects and architectures. Its rich and highly versatile structure makes it ripe for AI experimentation. Malmo enables experimentation through the use of "Missions", which define the parameters of the environment, including its size, the number of creatures it contains, the duration of each mission, and the weather and lighting within the environment.
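To make the Mission setup concrete, here is a minimal sketch of driving such a mission from Python, assuming the standard MalmoPython bindings; the specific MissionSpec calls (timeLimitInSeconds, requestVideo, startAt), the start coordinates, and the command strings follow the public Malmo tutorials rather than the exact configuration used in this project, so treat them as illustrative.

    import time
    import MalmoPython  # Project Malmo's Python bindings

    agent_host = MalmoPython.AgentHost()

    # Define a mission programmatically: duration, agent start position, and the
    # first-person video stream that provides the visual observations M_t.
    mission = MalmoPython.MissionSpec()
    mission.timeLimitInSeconds(60)
    mission.startAt(0.5, 227.0, 0.5)       # fixed spawn coordinate (placeholder values)
    mission.requestVideo(320, 240)

    agent_host.startMission(mission, MalmoPython.MissionRecordSpec())

    # Wait for the mission to begin, then step the agent.
    world_state = agent_host.getWorldState()
    while not world_state.has_mission_begun:
        time.sleep(0.1)
        world_state = agent_host.getWorldState()

    while world_state.is_mission_running:
        if world_state.video_frames:
            pixels = world_state.video_frames[-1].pixels  # raw pixels for the vision module
        agent_host.sendCommand("move 1")                  # one of the agent's movement commands
        time.sleep(0.1)
        world_state = agent_host.getWorldState()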

3.2 State, Action and Policy

In our missions, we spawn a single agent inside a maze, whose length and size can be modified, for a series of episodes. Each episode is defined as a single task, where the task is specified through a natural language instruction $I$. For each set of experiments, the agent is spawned at a fixed coordinate at the start of each episode, and every experiment consists of 100 episodes (each episode can be thought of as a single run of a given mission, as described in Section 3.1). Let $s_t$ denote the state at each time step $t$ in an episode, and $M_t$ the first-person view of the agent at time $t$. The agent's state at time $t$ is therefore given by the tuple $s_t = \{I, M_t\}$. Observe that the instructions are not indexed by time and are therefore constant throughout a given episode.

The agent's goal is to learn an optimal policy $\pi(a_t \mid s_t)$, mapping the observed states into optimal actions for the given task. Concretely, the agent aims to learn a policy $\pi$ that specifies a distribution over sequences of actions $a_t$ in states $s_t$ such that it maximizes its utility function:

$$J(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]. \qquad (1)$$

For our agent, we set the discount factor to $\gamma = 0.01$, which encourages the agent to find faster routes to the target object. In this project, we consider stochastic policies for $\pi$, so that the policy defines a distribution over actions given the current state $s_t$, and we work in a model-free setting. Under this specification, our state value function is given by

$$V^{\pi}(s_t) = \mathbb{E}_{a_t, s_{t+1}, \ldots}\left[\sum_{i=1}^{\infty} \gamma^{i}\, r(s_{t+i}, a_{t+i})\right],$$

which is precisely the expected return of policy $\pi$ from state $s_t$. Moreover, the state-action value function, which gives the expected return of policy $\pi$ after taking action $a_t$ in $s_t$, is

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\left[\sum_{i=0}^{\infty} \gamma^{i}\, r(s_{t+i}, a_{t+i})\right].$$
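To make these definitions concrete, the following hypothetical helper (not part of the project code) computes the discounted return of one rollout; averaging it over rollouts that follow $\pi$ from $s_t$ gives a Monte-Carlo estimate of $V^{\pi}(s_t)$, and conditioning on the first action $a_t$ gives an estimate of $Q^{\pi}(s_t, a_t)$.

    def discounted_return(rewards, gamma=0.01):
        """Return sum_i gamma**i * rewards[i] for one rollout.

        `rewards` holds r(s_t, a_t), r(s_{t+1}, a_{t+1}), ... collected while
        following the policy; gamma = 0.01 matches the discount used here.
        """
        return sum(gamma ** i * r for i, r in enumerate(rewards))

    # Example: a sparse reward received three steps into the rollout.
    print(discounted_return([0.0, 0.0, 0.0, 10.0]))  # 10 * 0.01**3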

4 Network Architecture

We use a single end-to-end architecture that combines the two sources of information, following [CSP+17]. These two information sources are handled by two modules: a state processing module, which we describe next, and a policy learning module, which we describe in the sequel.

Figure 2: Network Architecture


4.1 Convolutional Image Representations

As defined above, our states are given by $s_t = \{I, M_t\}$, where $M_t$ is the first-person view of the environment at iteration $t$. Each frame is fed through a convolutional neural network to create a convolutional representation of the state, $x_M = f(M_t; \theta_{\mathrm{conv}}) \in \mathbb{R}^{d \times H \times W}$, where $d$ denotes the number of feature maps, i.e., the intermediate representations in the convolutional network. This is shown in the upper module of Figure 2.
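A minimal PyTorch-style sketch of such a featurizer is given below; the number of layers and the filter sizes are illustrative assumptions, since the report does not pin down the exact convolutional architecture.

    import torch.nn as nn

    class ImageFeaturizer(nn.Module):
        """Maps a first-person frame M_t to x_M = f(M_t; theta_conv) of shape (d, H, W)."""

        def __init__(self, d=64):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, d, kernel_size=3, stride=1), nn.ReLU(),
            )

        def forward(self, frame):        # frame: (batch, 3, height, width), e.g. a Malmo video frame
            return self.conv(frame)      # x_M: (batch, d, H, W) feature maps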

4.2 Instruction Embeddings

Our natural language instructions, $I$, are processed using a Gated Recurrent Unit (GRU) [CvG+14, Ola15], $x_I = f(I; \theta_{\mathrm{GRU}})$, which we call the instruction embedding. This is shown in the bottom module of the network architecture in Figure 2. Prior to "fusing" with the visual feature maps, the instruction embeddings are transformed through an attention vector, $a_I$, which we describe further below. The attention vector serves as a gating mechanism to magnify or diminish the attention paid to a given set of features from the visual feature maps, according to how they relate to the instructions. This is the key component of the architecture, so we describe it more fully in the next section.

4.3 Multimodal Fusion – Hadamard Products

Our two networks described above featurize the two modalities into representations we can use for training our policy. However, we need an effective way to combine these features into a single representation. A straightforward, if somewhat naive, approach would be to concatenate the features from both sources by flattening their representations into a single long dense vector:

$$M_{\mathrm{concat}}(x_I, x_M) = \left[\mathrm{flatten}(x_I);\ \mathrm{flatten}(x_M)\right].$$

However, in our case the representation from the visual input is a function of time, whereas the language input is not, which means the first vector in $M_{\mathrm{concat}}$ is constant throughout the episode.

Instead of concatenation, we take insight from the statistical modeling perspective and aim for some form of "interaction" between our two feature maps. In particular, the instruction embedding is first passed through a fully-connected layer with a sigmoid activation function to create an attention vector, $a_I = h(x_I) \in \mathbb{R}^{d}$. Each element of the attention vector is then expanded into an $H \times W$ matrix to match the dimensions of our state image $x_M$, and the result is multiplied element-wise with the output of the CNN (i.e., we use the Hadamard product of the two embeddings):

$$M_{GA}(x_I, x_M) = M\!\left(h(x_I)\right) \odot x_M = M(a_I) \odot x_M.$$

Since $x_M$ is a function of time, this interaction of the two featurizations is also a function of time. As mentioned above, because the instruction embeddings are first sent through the attention gate, only certain parts of the instruction embeddings interact with the visual feature maps. During learning, the parameters of the attention gate focus the agent's attention on those components of the visual embeddings that relate to the instruction embeddings.
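The gating step itself is compact. The sketch below mirrors the description above (a sigmoid attention vector $a_I$, expansion of each element to an $H \times W$ matrix, and a Hadamard product with the feature maps); the tensor shapes are assumptions.

    import torch
    import torch.nn as nn

    class GatedAttention(nn.Module):
        """Computes M_GA(x_I, x_M) = M(a_I) * x_M (Hadamard product), with a_I = sigmoid(h(x_I))."""

        def __init__(self, instr_dim, d):
            super().__init__()
            self.fc = nn.Linear(instr_dim, d)             # h(.): fully-connected layer

        def forward(self, x_I, x_M):                      # x_I: (B, instr_dim); x_M: (B, d, H, W)
            a_I = torch.sigmoid(self.fc(x_I))             # attention vector in R^d
            gate = a_I[:, :, None, None].expand_as(x_M)   # each a_I[k] tiled to an H x W matrix
            return gate * x_M                             # element-wise (Hadamard) product

The fused output is what the policy learning module, described next, consumes.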

4.4 Policy Learning Module

The multimodal fusion unit $M_{GA}$ is provided to the policy learning module, which we train using reinforcement learning techniques. In particular, we utilize policy gradient methods, which aim to iteratively improve the policy so as to optimize the utility function $J(\theta)$ in (1) by using estimates of its gradient with respect to $\theta$.

4.4.1 Likelihood ratio policy gradient: REINFORCE

As a baseline, we can approximate the gradient of (1) using the standard REINFORCE algorithm [Wil92]. REINFORCE uses rolled-out samples generated by $\pi(a \mid s, \theta)$ together with a likelihood-ratio approximation. Concretely, REINFORCE uses the following approximation of the policy gradient:

$$\nabla_{\theta} J(\theta) \approx \sum_{t=0}^{\infty} \nabla_{\theta} \log \pi(a_t \mid s_t; \theta)\, R_t, \qquad R_t = \sum_{i=1}^{\infty} \gamma^{i}\, r(s_{t+i}, a_{t+i}).$$

REINFORCE is a provably unbiased estimator of $\nabla_{\theta} J(\theta)$, but it can yield unreliable estimates in small samples due to the high variance of its gradient estimates.
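In an automatic-differentiation framework, this estimator is usually implemented through a surrogate loss whose gradient matches the expression above. The sketch below does this for a single rollout, using the standard reward-to-go form of $R_t$ (an assumption about the exact indexing):

    import torch

    def reinforce_loss(log_probs, rewards, gamma=0.01):
        """Surrogate loss whose gradient is sum_t grad log pi(a_t|s_t; theta) * R_t.

        `log_probs[t]` is log pi(a_t | s_t; theta) with gradients attached;
        `rewards[t]` is r(s_t, a_t) for one rollout.
        """
        returns, running = [], 0.0
        for r in reversed(rewards):          # accumulate R_t backwards in time
            running = r + gamma * running
            returns.append(running)
        returns.reverse()
        returns = torch.tensor(returns)
        # Minimizing this loss ascends the policy-gradient estimate of J(theta).
        return -(torch.stack(log_probs) * returns).sum()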

As an alternative to our baseline, we consider variational methods for policy optimization. Rather than optimizing a single policy, we utilize the Stein variational policy gradient [LW16], which searches for a distribution $q(\theta)$ over policy parameters that optimizes the expected return using variational methods. Concretely, we formulate our variational optimization objective as

$$\max_{q}\ \left\{\,\mathbb{E}_{q}\left[J(\theta)\right] + \alpha H(q)\,\right\},$$

where the second term is the entropy of the variational distribution $q$, and $\alpha$ is a "temperature" parameter that controls the rate of exploration. The optimal $q$ is given by

$$q(\theta) \propto \exp\!\left(\frac{1}{\alpha} J(\theta)\right) q_{0}(\theta). \qquad (2)$$

This variational objective has the advantage of acting as a regularizer / control variate, and it admits a fast particle-based simulation mechanism. We can also view this as a Bayesian formulation of our estimate of $\theta$, where $\exp\left(J(\theta)/\alpha\right)$ serves as the likelihood and $q_0(\theta)$ as the prior. Stein variational gradient descent [LW16] proceeds by updating particles according to

$$\theta_i \leftarrow \theta_i + \epsilon\, \phi^{\star}(\theta_i),$$

where $\phi$ is a suitably chosen, kernel-based update direction selected to decrease the KL divergence between the particle distribution and the target distribution $q(\theta)$. In particular, the authors showed that a closed-form solution can be obtained for this optimization, which has the form

$$\phi^{\star}(\theta) = \mathbb{E}_{\mu \sim \eta}\left[\nabla_{\mu} \log q(\mu)\, k(\mu, \theta) + \nabla_{\mu} k(\mu, \theta)\right],$$

$$\hat{\phi}(\theta_i) = \frac{1}{n} \sum_{j=1}^{n}\left[\nabla_{\theta_j} \log q(\theta_j)\, k(\theta_j, \theta_i) + \nabla_{\theta_j} k(\theta_j, \theta_i)\right],$$

where $\mu \sim \eta$ is drawn from the particles' proposal distribution. The first term, $\nabla_{\theta} \log q(\theta)$, provides the momentum: it pushes particles towards high-probability regions of $q$ through the mean field of the particles (i.e., information sharing across particles [Mor13]). The second term is a diversity / potential-energy effect, pushing particles away from each other. Applying SVGD to our optimization (2), we obtain the updates

$$\hat{\phi}(\theta_i) = \frac{1}{n} \sum_{j=1}^{n}\left[\nabla_{\theta_j}\!\left(\frac{1}{\alpha} J(\theta_j) + \log q_{0}(\theta_j)\right) k(\theta_j, \theta_i) + \nabla_{\theta_j} k(\theta_j, \theta_i)\right].$$

5 Experiments and Results

For our experiments, we provide the agent with instructions to pick up specific items in a maze, with penalty states that terminate the agent's life (i.e., falling down a hole or picking up the wrong item). We examined our architecture at varying distances and with varying numbers of items, gradually increasing both the distance and the number of items while keeping the episode/mission length fixed. The reward function is

$$R(x, w) = 100 \cdot x \cdot w,$$

where $x$ denotes the number of paces between the target item and the agent, and $w$ denotes the number of items. Therefore, when there is one item and it is 10 steps away, the agent achieves a reward of 10 when it finds the item, and 0 otherwise. By increasing both $x$ and $w$ in $R(x, w)$, we develop a simple learning plan / curriculum of increasingly challenging tasks. Note, however, that this is not a proper curriculum, and no intermediate rewards are given to the agent if it does not find the correct object.

While implementing Stein variational gradient descent for our policy gradients, we used a Gaussian RBF kernel

$$k(\mu, \theta) = \exp\!\left(-\|\theta - \mu\|_{2}^{2} / h\right),$$

where the bandwidth $h$ is chosen to be the median of the pairwise distances between the particles, as in the original paper [LW16]. We initialize our policy as a Gaussian policy with a diagonal covariance structure, whose mean is parameterized by a three-hidden-layer (100-75-50 hidden units) neural network with tanh activation functions. The results in Figure 3 show that our Stein variational method performs significantly better than the REINFORCE method. In particular, the Stein variational method exhibits lower variance, due to the kernelized "control variate" in its iterations, and it also learns faster, most likely due to the improved diversity of its policies.

Figure 3: Learning Curves for SVPG and REINFORCE
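For concreteness, the diagonal-covariance Gaussian policy described above could be parameterized as follows; only the 100-75-50 tanh mean network and the diagonal covariance come from the text, while the input and action dimensions are placeholders.

    import torch
    import torch.nn as nn

    class GaussianPolicy(nn.Module):
        """Diagonal-covariance Gaussian policy with a 100-75-50 tanh mean network."""

        def __init__(self, state_dim, action_dim):
            super().__init__()
            self.mean = nn.Sequential(
                nn.Linear(state_dim, 100), nn.Tanh(),
                nn.Linear(100, 75), nn.Tanh(),
                nn.Linear(75, 50), nn.Tanh(),
                nn.Linear(50, action_dim),
            )
            self.log_std = nn.Parameter(torch.zeros(action_dim))   # diagonal covariance

        def forward(self, state):
            return torch.distributions.Normal(self.mean(state), self.log_std.exp())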

5.1 Evaluation

For evaluation, we examined zero-shot task generalization, which is the ability of our model to generalize to new combinations of object-attribute pairs that were not used during training. More precisely, we used the following lists of colors and items during training: $\{\text{colors}, \text{objects}\} = \{[\text{green}, \text{blue}, \text{gold}, \text{black}],\ [\text{diamond}, \text{brick}, \text{sword}, \text{tulip}]\}$. Not every color-item pair is valid; for example, diamond only appears in blue. However, for brick, tulip and sword there are many color combinations, and during training we drop one of their colors that is used by at least one other object, e.g., blue-sword. Zero-shot learning is successful (with reward 1) if the agent generalizes and completes the unseen instruction (in this case, "go pick up the blue sword"). For each evaluation episode, there are 4 other "occluding" objects that have either the same color or the same item description as the target object. Hence, an average reward of 0.2 would be attained simply by chance. Our average results over 100 evaluation episodes are shown in Table 1. They show that the policy module learned using Stein variational policy gradient generalizes much better to out-of-sample examples than the traditional REINFORCE method.

iterations    REINFORCE    SVPG
10^3          0.23         0.32
10^4          0.27         0.45
10^5          0.33         0.61

Table 1: Zero-shot learning (average reward over 100 evaluation episodes)

6 Conclusion / Future Work

The experiments above showed that Stein Variational Policy Gradients consistently learn faster and generalize better than the traditional likelihood ratio policy gradient estimator, REINFORCE. However, it would be useful to investigate whether the increased ability to generalize is simply due to the algorithm training faster, thanks to its fast variational particle-based optimization, or due to the diversity factor embedded in the learning objective. For further experiments, I hope to examine additional control-variate methods for policy gradients, such as the asynchronous advantage actor-critic method (A3C) [MBM+16] and trust region policy optimization (TRPO) [SLA+15], and see how they perform relative to Stein Variational Policy Gradient.

From a language grounding perspective, it would be interesting to examine higher-order referents and relational concepts, such as representations of size ("pick up the larger object") or hue ("pick up the object with the brighter shade"). Seeing how the agent generalizes with such relational concepts would shed a lot of light on relational grounding for language. I hope to tackle these experiments next!

References

[AAD+16] David Abel, Alekh Agarwal, Fernando Diaz, Akshay Krishnamurthy, and Robert Schapire, Exploratory gradient boosting for reinforcement learning in complex domains.

[CSP+17] Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi, Dheeraj Rajagopal, and Ruslan Salakhutdinov, Gated-attention architectures for task-oriented language grounding, CoRR abs/1706.07230 (2017).

[CvG+14] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, ArXiv e-prints (2014).

[HHG+17] Karl Moritz Hermann, Felix Hill, Simon Green, Fumin Wang, Ryan Faulkner, Hubert Soyer, David Szepesvari, Wojciech Marian Czarnecki, Max Jaderberg, Denis Teplyashin, Marcus Wainwright, Chris Apps, Demis Hassabis, and Phil Blunsom, Grounded language learning in a simulated 3D world, CoRR abs/1706.06551 (2017).

[JHHB16] Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell, The Malmo platform for artificial intelligence experimentation, Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI 2016), New York, NY, USA, 9-15 July 2016, pp. 4246–4247.

[LHKOR17] Zhiyu Lin, Brent Harrison, Aaron Keech, and Mark O. Riedl, Explore, exploit or listen: Combining human feedback and policy model to speed up deep reinforcement learning in 3D worlds.

[LW16] Qiang Liu and Dilin Wang, Stein variational gradient descent: A general purpose Bayesian inference algorithm, Advances in Neural Information Processing Systems 29 (D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, eds.), Curran Associates, Inc., 2016, pp. 2378–2386.

[MBM+16] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu, Asynchronous methods for deep reinforcement learning, CoRR abs/1602.01783 (2016).

[Mor13] P. D. Moral, Mean field simulation for Monte Carlo integration, Chapman & Hall/CRC Monographs on Statistics & Applied Probability, CRC Press, 2013.

[Ola15] Chris Olah, Understanding LSTM networks, http://colah.github.io/posts/2015-08-Understanding-LSTMs/, 2015.

[SL12] Carina Silberer and Mirella Lapata, Grounded models of semantic representation, Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '12), Association for Computational Linguistics, Stroudsburg, PA, USA, 2012, pp. 1423–1433.

[SLA+15] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz, Trust region policy optimization, Proceedings of the 32nd International Conference on Machine Learning (ICML 2015) (David Blei and Francis Bach, eds.), JMLR Workshop and Conference Proceedings, 2015, pp. 1889–1897.

[Wil92] Ronald J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning 8 (1992), no. 3-4, 229–256.

[Win72] T. Winograd, Understanding natural language, Academic Press, 1972.
