
NAACL-HLT 2012

SIAC 2012 Workshop on Semantic Interpretation in an Actionable Context

Proceedings of the Workshop

June 8, 2012
Montréal, Canada


Production and Manufacturing by
Omnipress, Inc.
2600 Anderson Street
Madison, WI 53707
USA

©2012 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: [email protected]

ISBN13: 978-1-937284-20-6 ISBN10: 1-937284-20-4


Introduction

Effective and seamless human-computer interaction using natural language is arguably one of the major challenges of natural language processing and of artificial intelligence in general. Making significant progress in developing natural language capabilities that support this level of interaction has countless applications and is bound to attract many researchers from several AI fields, from robotics to games to the social sciences.

From the natural language processing perspective, the problem is often formulated as a translation task: mapping between natural language input and a logical output language that can be executed in the domain of interest. Unlike shallow approaches to semantic interpretation, which provide an incomplete or underspecified interpretation of the natural language input, the output of a formal semantic interpreter is expected to provide a complete meaning representation that can be executed directly by a computer system. Examples of such systems include robotic control, database access, game playing, and more.

Current work takes a data-driven approach, in which a learning algorithm is given a set of natural language sentences paired with their corresponding logical meaning representations and learns a statistical semantic parser: a set of parameterized rules mapping lexical items and syntactic patterns to a logical formula.

In recent years this framework has been challenged by an exciting line of research advocating that semantic interpretation should not be studied in isolation, but rather in the context of the external environment (or computer system) that provides the semantic context for interpretation. This line of research comprises several directions, focusing on grounded semantic representations, flexible semantic interpretation models, and alternative learning protocols driven by indirect supervision signals. This progress has expanded the scope of semantic interpretation, introduced new domains and tasks, and shown that it is possible to make progress in this direction with reduced manual effort. In particular, it has produced a wide range of models, learning protocols, learning tasks, and semantic formalisms that, while clearly related, are not directly comparable or understood under a single framework.

The workshop consists mostly of invited talks, but also includes several novel contributions. The goal of this workshop is to provide researchers interested in the field with an opportunity to exchange ideas, discuss other perspectives, and formulate a shared vision for this research direction.


Organizers:

Dan Goldwasser, University of Illinois at Urbana-Champaign (USA)
Regina Barzilay, Massachusetts Institute of Technology (USA)
Dan Roth, University of Illinois at Urbana-Champaign (USA)

Program Committee:

Raymond Mooney, University of Texas at Austin (USA)
Luke Zettlemoyer, University of Washington (USA)
Jacob Eisenstein, Georgia Tech (USA)
Alexander Koller, University of Potsdam (Germany)
Percy Liang, Google (USA)
Wei Lu, University of Illinois at Urbana-Champaign (USA)
Nick Roy, Massachusetts Institute of Technology (USA)
Jeffrey Mark Siskind, Purdue University (USA)
Jason Weston, Google (USA)

Additional Reviewers:

Rajhans Samdani, University of Illinois at Urbana-Champaign (USA)

Invited Speakers:

Raymond Mooney, University of Texas at Austin (USA)
Luke Zettlemoyer, University of Washington (USA)
Yoshua Bengio, University of Montreal (Canada)
Julia Hockenmaier, University of Illinois at Urbana-Champaign (USA)
Dan Roth, University of Illinois at Urbana-Champaign (USA)
Mehdi Hafezi Manshadi, University of Rochester (USA)


Table of Contents

Learning to Interpret Natural Language Instructions
Monica Babes-Vroman, James MacGlashan, Ruoyuan Gao, Kevin Winner, Richard Adjogah, Marie desJardins, Michael Littman and Smaranda Muresan . . . 1

Toward Learning Perceptually Grounded Word Meanings from Unaligned Parallel Data
Stefanie Tellex, Pratiksha Thaker, Josh Joseph and Nicholas Roy . . . 7



Conference Program

June 8

9:00 Morning Session

Workshop Introduction
Workshop organizers

Learning to Interpret Natural Language Instructions
Monica Babes-Vroman, James MacGlashan, Ruoyuan Gao, Kevin Winner, Richard Adjogah, Marie desJardins, Michael Littman and Smaranda Muresan

Toward Learning Perceptually Grounded Word Meanings from Unaligned Parallel Data
Stefanie Tellex, Pratiksha Thaker, Josh Joseph, and Nicholas Roy

Invited Talk: Integrating Natural Language Programming and Programming by Demonstration
Mehdi Hafezi Manshadi

10:30 Coffee Break

11:00 Morning Session 2

Invited Talk: Learning to Interpret Natural Language Navigation Instructions from Observation
Raymond Mooney

12:00 Lunch

13:00 Afternoon Session

Invited Talk
Luke Zettlemoyer

Invited Talk
Yoshua Bengio

15:00 Coffee Break

15:30 Afternoon Session 2

Invited Talk
Julia Hockenmaier

Invited Talk
Dan Roth


Learning to Interpret Natural Language Instructions

Monica Babes-Vroman+, James MacGlashan∗, Ruoyuan Gao+, Kevin Winner∗, Richard Adjogah∗, Marie desJardins∗, Michael Littman+ and Smaranda Muresan++

∗ Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County
+ Computer Science Department, Rutgers University
++ School of Communication and Information, Rutgers University

Abstract

This paper addresses the problem of training an artificial agent to follow verbal instructions representing high-level tasks using a set of instructions paired with demonstration traces of appropriate behavior. From this data, a mapping from instructions to tasks is learned, enabling the agent to carry out new instructions in novel environments.

1 Introduction

Learning to interpret language from a situated context has become a topic of much interest in recent years (Branavan et al., 2009; Branavan et al., 2010; Branavan et al., 2011; Clarke et al., 2010; Chen and Mooney, 2011; Vogel and Jurafsky, 2010; Goldwasser and Roth, 2011; Liang et al., 2011; Artzi and Zettlemoyer, 2011; Tellex et al., 2011). Instead of using annotated training data consisting of sentences and their corresponding logical forms (Zettlemoyer and Collins, 2005; Kate and Mooney, 2006; Wong and Mooney, 2007; Zettlemoyer and Collins, 2009; Lu et al., 2008), most of these approaches leverage non-linguistic information from a situated context as their primary source of supervision. These approaches have been applied to various tasks such as following navigational instructions (Vogel and Jurafsky, 2010; Chen and Mooney, 2011; Tellex et al., 2011), software control (Branavan et al., 2009; Branavan et al., 2010), semantic parsing (Clarke et al., 2010; Liang et al., 2011) and learning to play games based on text (Branavan et al., 2011; Goldwasser and Roth, 2011).

In this paper, we present an approach to interpreting language instructions that describe complex multipart tasks by learning from pairs of instructions and behavioral traces containing a sequence of primitive actions that result in these instructions being properly followed. We do not assume a one-to-one mapping between instructions and primitive actions. Our approach uses three main subcomponents: (1) recognizing intentions from observed behavior using variations of Inverse Reinforcement Learning (IRL) methods; (2) translating instructions to task specifications using Semantic Parsing (SP) techniques; and (3) creating generalized task specifications to match user intentions using probabilistic Task Abstraction (TA) methods. We describe our system architecture and a learning scenario. We present preliminary results for a simplified version of our system that uses a unigram language model, minimal abstraction, and simple inverse reinforcement learning.

Early work on grounded language learning used features based on n-grams to represent the natural language input (Branavan et al., 2009; Vogel and Jurafsky, 2010). More recent methods have relied on richer representations of linguistic data, such as syntactic dependency trees (Branavan et al., 2011; Goldwasser and Roth, 2011) and semantic templates (Tellex et al., 2011), to address the complexity of the natural language input. Our approach uses a flexible framework that allows us to incorporate various degrees of linguistic knowledge available at different stages in the learning process (e.g., from dependency relations to a full-fledged semantic model of the domain learned during training).


2 System Architecture

We represent tasks using the Object-oriented Markov Decision Process (OO-MDP) formalism (Diuk et al., 2008), an extension of Markov Decision Processes (MDPs) that explicitly captures relationships between objects. Specifically, OO-MDPs add a set of classes C, each with a set of attributes T_C. Each OO-MDP state is defined by an unordered set of instantiated objects. In addition to these object definitions, an OO-MDP also defines a set of propositional functions that operate on objects. For instance, we might have a propositional function toyIn(toy, room) that operates on an object belonging to class "toy" and an object belonging to class "room," returning true if the specified "toy" object is in the specified "room" object. We extend OO-MDPs to include a set of propositional function classes (F) that group propositional functions describing similar properties. In the context of defining a task corresponding to a particular goal, an OO-MDP defines a subset of states β ⊂ S, called termination states, that end an action sequence and that need to be favored by the task's reward function.

Example Domain. To illustrate our approach, we present a simple domain called Cleanup World, a 2D grid world defined by various rooms that are connected by open doorways and can contain various objects (toys) that the agent can push around to different positions in the world. The Cleanup World domain can be represented as an OO-MDP with four object classes: agent, room, doorway, and toy, and a set of propositional functions that specify whether a toy is a specific shape (such as isStar(toy)), the color of a room (such as isGreen(room)), whether a toy is in a specific room (toyIn(toy, room)), and whether an agent is in a specific room (agentIn(room)). These functions belong to shape, color, toy-position, and agent-position function classes.
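To make the OO-MDP representation concrete, the following is a minimal Python sketch of Cleanup World objects and their propositional functions. The class layout, attribute names, and grid-cell representation are our own illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

# Illustrative OO-MDP objects for Cleanup World. Each object instantiates
# a class (agent, room, toy) with attributes; a state is a set of objects.
@dataclass
class Agent:
    x: int
    y: int

@dataclass
class Toy:
    name: str
    shape: str          # e.g., "star"
    x: int
    y: int

@dataclass
class Room:
    name: str
    color: str          # e.g., "green"
    cells: frozenset    # grid cells (x, y) belonging to the room

# Propositional functions operate on objects and return booleans.
def isStar(toy: Toy) -> bool:
    return toy.shape == "star"

def isGreen(room: Room) -> bool:
    return room.color == "green"

def toyIn(toy: Toy, room: Room) -> bool:
    return (toy.x, toy.y) in room.cells

def agentIn(agent: Agent, room: Room) -> bool:
    return (agent.x, agent.y) in room.cells

# Propositional function classes (F) group functions describing
# similar properties.
FUNCTION_CLASSES = {
    "shape": [isStar],
    "color": [isGreen],
    "toy_position": [toyIn],
    "agent_position": [agentIn],
}
```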

2.1 Interaction among IRL, SP and TA

The training data for the overall system is a set of pairs of verbal instructions and behavior. For example, one of these pairs could be the instruction Push the star to the green room with a demonstration of the task being accomplished in a specific environment containing various toys and rooms of different colors. We assume the availability of a set of features for each state, represented using the OO-MDP propositional functions described previously. These features play an important role in defining the tasks to be learned. For example, a robot being taught to move furniture around would have information about whether or not it is currently carrying a piece of furniture, what piece of furniture it needs to be moving, which room it is currently in, which room contains each piece of furniture, etc. We briefly present the three components of our system (IRL, SP and TA) and how they interact with each other during learning.

Inverse Reinforcement Learning. Inverse Reinforcement Learning (Abbeel and Ng, 2004) addresses the task of learning a reward function from demonstrations of expert behavior and information about the state-transition function. Recently, more data-efficient IRL methods have been proposed, including the Maximum Likelihood Inverse Reinforcement Learning (MLIRL) approach (Babes-Vroman et al., 2011), on which our system builds. Given even a small number of trajectories, MLIRL finds a weighting of the state features that (locally) maximizes the probability of these trajectories. In our system, these state features consist of one of the sets of propositional functions provided by the TA component. For a given task and a set of sets of state features, MLIRL evaluates the feature sets and returns to the TA component its assessment of the probabilities of the various sets.
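As a rough illustration of the MLIRL idea, here is a heavily simplified sketch: it models the demonstrator with a Boltzmann policy whose action values are linear in state-action features and ascends the trajectory log-likelihood by gradient. Full MLIRL re-solves the MDP for the action values under the current reward at every step; that planning step is omitted here, and all names are our own.

```python
import numpy as np

def boltzmann(theta, phi_sa, beta=1.0):
    """Boltzmann action distribution for one state.
    phi_sa: (num_actions, num_features) feature matrix; Q = phi_sa . theta."""
    q = phi_sa @ theta
    z = np.exp(beta * (q - q.max()))   # subtract max for numerical stability
    return z / z.sum()

def mlirl(trajectories, num_features, steps=200, lr=0.1, beta=1.0):
    """Gradient ascent on the log-likelihood of demonstrated trajectories.
    trajectories: list of trajectories, each a list of (phi_sa, action)."""
    theta = np.zeros(num_features)
    for _ in range(steps):
        grad = np.zeros(num_features)
        for traj in trajectories:
            for phi_sa, a in traj:
                pi = boltzmann(theta, phi_sa, beta)
                # d/dtheta log pi(a|s) = beta * (phi(s,a) - E_pi[phi(s,.)])
                grad += beta * (phi_sa[a] - pi @ phi_sa)
        theta += lr * grad             # (locally) maximizes the likelihood
    return theta
```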

Semantic Parsing. To address the problem of mapping instructions to semantic parses, we use a constraint-based grammar formalism, Lexicalized Well-Founded Grammar (LWFG), which has been shown to balance expressiveness with practical learnability results (Muresan and Rambow, 2007; Muresan, 2011). In LWFG, each string is associated with a syntactic-semantic representation, and the grammar rules have two types of constraints: one for semantic composition (Φc) and one for semantic interpretation (Φi). The semantic interpretation constraints, Φi, provide access to a semantic model (domain knowledge) during parsing. In the absence of a semantic model, however, the LWFG learnability result still holds. This fact is important if our agent is assumed to start with no knowledge of the task and domain. LWFG uses an ontology-based semantic representation, which is a logical form represented as a conjunction of atomic predicates. For example, the representation of the phrase green room is ⟨X1.is=green, X.P1=X1, X.isa=room⟩. The semantic representation specifies two concepts (green and room) connected through a property that can be uninstantiated in the absence of a semantic model, or instantiated via the Φi constraints to the property name (e.g., color) if such a model is present.

During the learning phase, the SP component, using an LWFG grammar that is learned offline, provides to TA the logical forms (i.e., the semantic parses, or the unlabeled dependency parses if no semantic model is given) for each verbal instruction. For example, for the instruction Move the chair into the green room, the semantic parser knows initially that move is a verb, chair and room are nouns, and green is an adjective. It also has grammar rules of the form S → Verb NP PP: Φc1, Φi1 (for readability, we show just the context-free backbone, without the augmented nonterminals or constraints), but it has no knowledge of what these words mean (that is, to which concepts they map in the domain model). For this instruction, the LWFG parser returns the logical form:

⟨(X1.isa=move, X1.Arg1=X2)move, (X2.det=the)the, (X2.isa=chair)chair, (X1.P1=X3, P2.isa=into)into, (X3.det=the)the, (X4.isa=green, X3.P2=X2)green, (X3.isa=room)room⟩.

The subscripts on each atomic predicate indicate the word to which that predicate corresponds. This logical form corresponds to the simplified logical form move(chair1, room1), P1(room1, green), where predicate P1 is uninstantiated. A key advantage of this framework is that the LWFG parser has access to the domain (semantic) model via the Φi constraints. As a result, when TA provides feedback about domain-specific meanings (i.e., groundings), the parser can incorporate those mappings via the Φi constraints (e.g., move might map to the predicate blockToRoom with a certain probability).

Task Abstraction. The termination conditions for an OO-MDP task can be defined in terms of the propositional functions. For example, the Cleanup World domain might include a task that requires the agent to put a specific toy (t1) in a specific room (r1). In this case, the termination states would be defined by states that satisfy toyIn(t1, r1), and the reward function would be defined as Ra(s, s′) = {1 : toyIn(t1^s′, r1^s′); −1 : otherwise}, where the superscript s′ denotes the object instances in state s′. However, such a task definition is overly specific and cannot be evaluated in a new environment that contains different objects. To remove this limitation, we define abstract task descriptions using parametric lifted reward and termination functions. A parametric lifted reward function is a first-order logic expression in which the propositional functions defining the reward can be selected as parameters. This representation allows much more general tasks to be defined; these tasks can be evaluated in any environment that contains the necessary object classes. For instance, the reward function for an abstract task that encourages an agent to take a toy of a certain shape to a room of a certain color (resulting in a reward of 1) would be represented as Ra(s, s′) = {1 : ∃t ∈ toy^s′ ∃r ∈ room^s′ P1(t) ∧ P2(r) ∧ toyIn(t, r); −1 : otherwise}, where P1 is a propositional function that operates on toy objects and P2 is a propositional function that operates on room objects. An analogous definition can be made for termination conditions. Given the logical forms provided by SP, TA finds candidate tasks that might match each logical form, along with a set of possible groundings of those tasks. A grounding of an abstract task is the set of propositional functions to be applied to the specific objects in a given training instance. TA then passes these grounded propositional functions as the features to use in IRL. (If there are no candidate tasks, then it passes all grounded propositional functions of the OO-MDP to IRL.) When IRL returns a reward function for these possible groundings and their likelihoods of representing the true reward function, TA determines whether any abstract tasks it has defined might match. If not, TA will either create a new abstract task that is consistent with the received reward functions or modify one of its existing definitions if doing so does not require significant changes. With IRL indicating the intended goal of a trace and the abstract task indicating the relevant parameters, TA can then inform SP of the task- and domain-specific meanings of the logical forms.


A Learning-from-Scratch Scenario. Our system is trained using a set of sentence–trajectory pairs ((S1, T1), ..., (SN, TN)). Initially, the system does not know what any of the words mean and there are no pre-existing abstract tasks. Let's assume that S1 is Push the star into the green room. This sentence is first processed by the SP component, yielding the following logical forms: L1 is push(star1, room1), amod(room1, green) and L2 is push(star1), amod(room1, green), into(star1, room1). These logical forms and their likelihoods are passed to the TA component, and TA induces incomplete abstract tasks, which define only the number and kinds of objects that are relevant to the corresponding reward function. TA can send to IRL a set of features involving these objects, together with T1, the demonstration attached to S1. This set of features might include: agentTouchToy(t1), toyIn(t1, r1), toyIn(t1, r2), agentIn(r1). IRL sends back a weighting of the features, and TA can select the subset of features that have the highest weights (e.g., (1.91, toyIn(t1, r1)), (1.12, agentTouchToy(t1)), (0.80, agentIn(r1))). Using information from SP and IRL, TA can now create a new abstract task, perhaps called blockToRoom, adjust the probabilities of the logical forms based on the relevant features obtained from IRL, and send these probabilities back to SP, enabling it to adjust its semantic model.

The entire system proceeds iteratively. While the full system has been designed, not all of its components are implemented yet, so we cannot report experimental results for it. In the next section, we present a simplified version of our system and show preliminary results.

3 A Simplified Model and Experiments

In this section, we present a simplified version of our system with a unigram language model, inverse reinforcement learning, and minimal abstraction. We call this version Model 0. The input to Model 0 is a set of verbal instructions paired with demonstrations of appropriate behavior. It uses an EM-style algorithm (Dempster et al., 1977) to estimate the probability distribution of words conditioned on reward functions (the parameters). With this information, when the system receives a new command, it can behave in a way that maximizes its reward given the posterior probabilities of the possible reward functions given the words.

Algorithm 1 shows our EM-style Model 0. For all possible reward–demonstration pairs, the E-step of EM estimates zji = Pr(Rj | (Si, Ti)), the probability that reward function Rj produced sentence–trajectory pair (Si, Ti). This estimate is given by the equation below:

zji = Pr(Rj | (Si, Ti)) = [Pr(Rj) / Pr(Si, Ti)] · Pr((Si, Ti) | Rj)
                        = [Pr(Rj) / Pr(Si, Ti)] · Pr(Ti | Rj) · ∏_{wk∈Si} Pr(wk | Rj),

where Si is the i-th sentence, Ti is the trajectory demonstrated for verbal command Si, and wk is an element of the set of all possible words (the vocabulary). If the reward functions Rj are known ahead of time, Pr(Ti | Rj) can be obtained directly by solving the MDP and estimating the probability of trajectory Ti under a Boltzmann policy with respect to Rj. If the Rj are not known, EM can estimate them by running IRL during the M-step (Babes-Vroman et al., 2011).

The M-step in Algorithm 1 uses the current estimates of zji to further refine the probabilities xkj = Pr(wk | Rj):

xkj = Pr(wk | Rj) = (1/X) · (Σ_{i : wk∈Si} Pr(Rj | Si) + ε) / (Σ_i N(Si) zji + ε),

where ε is a smoothing parameter, X is a normalizing factor, and N(Si) is the number of words in sentence Si.

To illustrate the performance of Model 0, we selected as training data six sentences for two tasks (three sentences for each task) from a dataset we have collected using Amazon Mechanical Turk for the Cleanup Domain. We show the training data in Figure 1. We obtained the reward function for each task using MLIRL, computed Pr(Ti | Rj), then ran Algorithm 1 and obtained the parameters Pr(wk | Rj). After this training process, we presented the agent with a new task. The agent is given the instruction SN: Go to green room, and a starting state somewhere in the same grid. Using the parameters Pr(wk | Rj), the agent can estimate:


Algorithm 1 EM-style Model 0

Input: Demonstrations {(S1, T1), ..., (SN, TN)}, number of reward functions J, size of vocabulary K.
Initialize: x11, ..., xJK randomly.
repeat
  E-step: Compute zji = [Pr(Rj) / Pr(Si, Ti)] · Pr(Ti | Rj) · ∏_{wk∈Si} xkj.
  M-step: Compute xkj = (1/X) · (Σ_{i : wk∈Si} Pr(Rj | Si) + ε) / (Σ_i N(Si) zji + ε).
until target number of iterations completed.
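A compact sketch of Algorithm 1, assuming the Pr(Ti | Rj) values have been precomputed (e.g., via MLIRL and a Boltzmann policy) and stored in a matrix. The uniform prior over rewards and the count-based form of the M-step are our reading of the update and may differ in detail from the authors' implementation. Scoring a new command SN then reduces to the product ∏_{wk∈SN} x[wk, j], as in the worked example that follows.

```python
import numpy as np

def em_model0(sentences, p_traj_given_r, vocab, iters=50, eps=1e-3):
    """EM-style Model 0 sketch.
    sentences: list of token lists S_i.
    p_traj_given_r: (N, J) array of precomputed Pr(T_i | R_j).
    Returns x with x[k, j] ~ Pr(w_k | R_j)."""
    N, J = p_traj_given_r.shape
    K = len(vocab)
    word_id = {w: k for k, w in enumerate(vocab)}
    rng = np.random.default_rng(0)
    x = rng.random((K, J))
    x /= x.sum(axis=0)                      # random initial Pr(w | R)

    for _ in range(iters):
        # E-step: z[i, j] proportional to Pr(R_j) Pr(T_i|R_j) prod_w Pr(w|R_j)
        z = np.full((N, J), 1.0 / J) * p_traj_given_r
        for i, S in enumerate(sentences):
            for w in S:
                z[i] *= x[word_id[w]]
        z /= z.sum(axis=1, keepdims=True)   # normalize over reward functions

        # M-step: smoothed, z-weighted word counts per reward function
        for j in range(J):
            counts = np.full(K, eps)
            for i, S in enumerate(sentences):
                for w in S:
                    counts[word_id[w]] += z[i, j]
            x[:, j] = counts / counts.sum() # the 1/X normalization
    return x
```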

Pr(SN | R1) = ∏_{wk∈SN} Pr(wk | R1) = 8.6 × 10^−7,
Pr(SN | R2) = ∏_{wk∈SN} Pr(wk | R2) = 4.1 × 10^−4,

and choose the optimal policy corresponding to reward R2, thus successfully carrying out the task. Note that R1 and R2 corresponded to the two target tasks, but this mapping was determined by EM. We illustrate the limitation of the unigram model by telling the trained agent to Go with the star to green (we label this sentence S′N). Using the learned parameters, the agent computes the following estimates:

Pr(S′N | R1) = ∏_{wk∈S′N} Pr(wk | R1) = 8.25 × 10^−7,
Pr(S′N | R2) = ∏_{wk∈S′N} Pr(wk | R2) = 2.10 × 10^−5.

The agent wrongly chooses reward R2 and goes to the green room instead of taking the star to the green room. The problem with the unigram model in this case is that it gives too much weight to word frequencies (in this case, go) without taking into account what the words mean or how they are used in the context of the sentence. Using the system described in Section 2, we can address these problems and also move toward more complex scenarios.

4 Conclusions and Future Work

We have presented a three-component architecture for interpreting natural language instructions, where the learner has access to natural language input and demonstrations of appropriate behavior. Our future work includes fully implementing the system so that it can build abstract tasks from language information and feature relevance.

Figure 1: Training data for two tasks: taking the star to the green room (left) and going to the green room (right).

Acknowledgments

The authors acknowledge the support of the National Science Foundation (collaborative grants IIS-00006577 and IIS-1065195). The authors thank the anonymous reviewers for their feedback. Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the authors, and do not necessarily reflect the views of the funding organization.

References

Pieter Abbeel and Andrew Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML 2004).

Yoav Artzi and Luke Zettlemoyer. 2011. Bootstrapping semantic parsers from conversations. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing.

Monica Babes-Vroman, Vukosi Marivate, Kaushik Subramanian, and Michael Littman. 2011. Apprenticeship learning about multiple intentions. In Proceedings of the Twenty-Eighth International Conference on Machine Learning (ICML 2011).

S. R. K. Branavan, Harr Chen, Luke S. Zettlemoyer, and Regina Barzilay. 2009. Reinforcement learning for mapping instructions to actions. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, ACL '09.

S. R. K. Branavan, Luke S. Zettlemoyer, and Regina Barzilay. 2010. Reading between the lines: Learning to map high-level instructions to commands. In Association for Computational Linguistics (ACL 2010).

S. R. K. Branavan, David Silver, and Regina Barzilay. 2011. Learning to win by reading manuals in a Monte-Carlo framework. In Association for Computational Linguistics (ACL 2011).

David L. Chen and Raymond J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI-2011), pages 859–865.

James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. 2010. Driving semantic parsing from the world's response. In Proceedings of the Association for Computational Linguistics (ACL 2010).

A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–38.

Carlos Diuk, Andre Cohen, and Michael Littman. 2008. An object-oriented representation for efficient reinforcement learning. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML-08).

Dan Goldwasser and Dan Roth. 2011. Learning from natural instructions. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence.

Rohit J. Kate and Raymond J. Mooney. 2006. Using string-kernels for learning semantic parsers. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ACL-44.

Percy Liang, Michael Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL 2011).

Wei Lu, Hwee Tou Ng, Wee Sun Lee, and Luke S. Zettlemoyer. 2008. A generative model for parsing natural language to meaning representations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08.

Smaranda Muresan and Owen Rambow. 2007. Grammar approximation by representative sublanguage: A new model for language learning. In Proceedings of ACL.

Smaranda Muresan. 2011. Learning for deep language understanding. In Proceedings of IJCAI-11.

Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew Walter, Ashis Gopal Banerjee, Seth Teller, and Nicholas Roy. 2011. Understanding natural language commands for robotic navigation and mobile manipulation. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence.

Adam Vogel and Dan Jurafsky. 2010. Learning to follow navigational directions. In Association for Computational Linguistics (ACL 2010).

Yuk Wah Wong and Raymond Mooney. 2007. Learning synchronous grammars for semantic parsing with lambda calculus. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL-2007).

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of UAI-05.

Luke Zettlemoyer and Michael Collins. 2009. Learning context-dependent mappings from sentences to logical form. In Proceedings of the Association for Computational Linguistics (ACL '09).


Toward Learning Perceptually Grounded Word Meanings from Unaligned Parallel Data

Stefanie Tellex, Pratiksha Thaker, Josh Joseph, Matthew R. Walter and Nicholas Roy
MIT Computer Science and Artificial Intelligence Laboratory

Abstract

In order for robots to effectively understand natural language commands, they must be able to acquire a large vocabulary of meaning representations that can be mapped to perceptual features in the external world. Previous approaches to learning these grounded meaning representations require detailed annotations at training time. In this paper, we present an approach that is capable of jointly learning a policy for following natural language commands such as "Pick up the tire pallet," as well as a mapping between specific phrases in the language and aspects of the external world; for example, the mapping between the words "the tire pallet" and a specific object in the environment. We assume the action policy takes a parametric form that factors based on the structure of the language, based on the G3 framework, and use stochastic gradient ascent to optimize policy parameters. Our preliminary evaluation demonstrates the effectiveness of the model on a corpus of "pick up" commands given to a robotic forklift by untrained users.

1 Introduction

In order for robots to robustly understand human language, they must have access to meaning representations capable of mapping between symbols in the language and aspects of the external world which are accessible via the robot's perception system. Previous approaches have represented word meanings as symbols in some specific symbolic language, either programmed by hand [Winograd, 1971, MacMahon et al., 2006] or learned [Matuszek et al., 2010, Chen and Mooney, 2011, Liang et al., 2011, Branavan et al., 2009]. Because word meanings are represented as symbols, rather than perceptually grounded features, the mapping between these symbols and the external world must still be defined. Furthermore, the uncertainty of the mapping between constituents in the language and aspects of the external world cannot be explicitly represented by the model.

Language grounding approaches, in contrast, map words in the language to groundings in the external world [Mavridis and Roy, 2006, Hsiao et al., 2008, Kollar et al., 2010, Tellex et al., 2011]. Groundings are the specific physical concepts that are referred to by the language and can be objects (e.g., a truck or a door), places (e.g., a particular location in the world), paths (e.g., a trajectory through the environment), or events (e.g., a sequence of robot actions). This symbol grounding approach [Harnad, 1990] represents word meanings as functions which take as input a perceptual representation of a grounding and return whether it matches words in the language. Recent work has demonstrated how to learn grounded word meanings from a parallel corpus of natural language commands paired with groundings in the external world [Tellex et al., 2011]. However, learning model parameters required that the parallel corpus be augmented with additional annotations specifying the alignment between specific phrases in the language and corresponding groundings in the external world. Figure 1 shows an example command from the training set paired with these alignment annotations, represented as arrows pointing from each linguistic constituent to a corresponding grounding.

Our approach in this paper relaxes these annotation requirements and learns perceptually grounded word meanings from an unaligned parallel corpus that provides supervision only for the top-level action that corresponds to a natural language command. Our system takes as input a state/action space for the robot, defining a space of possible groundings and available actions in the external world. In addition, it requires a corpus of natural language commands paired with the correct action executed in the environment. For example, an entry in the corpus consists of a natural language command such as "Pick up the tire pallet" given to a robotic forklift, paired with an action sequence in which the robot drives to the tire pallet, inserts its forks, raises the pallet off the ground, drives to the truck, and sets it down.

To learn from an unaligned corpus, we derive a new training algorithm that combines the Generalized Grounding Graph (G3) framework introduced by Tellex et al. [2011] with the policy gradient method described by Branavan et al. [2009]. We assume a specific parametric form for the action policy that is defined by the linguistic structure of the natural language command. The system learns policy parameters that maximize expected reward using stochastic gradient ascent. By factoring the policy according to the structure of language, we can propagate the error signal to each term, allowing the system to infer groundings for each linguistic constituent even without direct supervision. We evaluate our model using a corpus of natural language commands collected from untrained users on the internet, commanding the robot to pick up objects or drive to locations in the environment. The evaluation demonstrates that the model is able to predict both robot actions and noun phrase groundings with high accuracy, despite having no direct supervision for noun phrase groundings.

2 Background

We briefly review the G3 framework, introduced by Tellex et al. [2011]. In order for a robot to understand natural language, it must be able to map between words in the language and corresponding groundings in the external world.

Figure 1: Sample entry from an aligned corpus, where mappings between phrases in the language and groundings in the external world are explicitly specified as arrows. Learning the meaning of "the truck" and "the pallet" is challenging when alignment annotations are not known.

Figure 2: Grounding graph for "Pick up the tire pallet."


The aim is to find the most probable groundings γ1 . . . γN given the language Λ and the robot's model of the environment M:

argmax_{γ1...γN} p(γ1 . . . γN | Λ, M).    (1)

M consists of the robot's location; the locations, geometries, and perceptual tags of objects; and the available actions the robot can take. For brevity, we omit M from the remaining equations in this section.

To learn this distribution, one standard approach is to factor it based on certain independence assumptions, then train models for each factor. Natural language has a well-known compositional, hierarchical argument structure [Jackendoff, 1983], and a promising approach is to exploit this structure in order to factor the model. However, if we define a directed model over these variables, we must assume a possibly arbitrary order to the conditional γi factors. For example, for a phrase such as "the tire pallet near the other skid," we could factorize in either of the following ways:

p(γtires, γskid | Λ) = p(γskid | γtires, Λ) × p(γtires | Λ)    (2)

p(γtires, γskid | Λ) = p(γtires | γskid, Λ) × p(γskid | Λ)    (3)

Depending on the order of factorization, we will need different conditional probability tables that correspond to the meanings of words in the language. To resolve this issue, another approach is to use Bayes' Rule to estimate p(Λ | γ1 . . . γN), but this approach would require normalizing over all possible words in the language Λ. Another alternative is to use an undirected model, but this approach is intractable because it requires normalizing over all possible values of all γi variables in the model, including continuous attributes such as location and size.

To address these problems, the G3 framework introduced a correspondence vector Φ to capture the dependency between γ1 . . . γN and Λ. Each entry φi ∈ Φ corresponds to whether linguistic constituent λi ∈ Λ corresponds to grounding γi. We assume that γ1 . . . γN are independent of Λ unless Φ is known. Introducing Φ enables factorization according to the structure of language, with local normalization at each factor over a space of just the two possible values for φi.

2.1 Inference

In order to use the G3 framework for inference, we want to infer the groundings γ1 . . . γN that maximize the distribution

argmax_{γ1...γN} p(γ1 . . . γN | Φ, Λ),    (4)

which is equivalent to maximizing the joint distribution of all groundings γ1 . . . γN, Φ and Λ:

argmax_{γ1...γN} p(γ1 . . . γN, Φ, Λ).    (5)

We assume that Λ and γ1 . . . γN are independent when Φ is not known, yielding:

argmax_{γ1...γN} p(Φ | Λ, γ1 . . . γN) p(Λ) p(γ1 . . . γN).    (6)

This independence assumption may seem unintuitive, but it is justified because the correspondence variable Φ breaks the dependency between Λ and γ1 . . . γN. If we do not know whether γ1 . . . γN correspond to Λ, we assume that the language does not tell us anything about the groundings.

Finally, for simplicity, we assume that any object in the environment is equally likely to be referenced by the language, which amounts to a constant prior on γ1 . . . γN. In the future, we plan to incorporate models of attention and salience into this prior. We ignore p(Λ) since it does not depend on γ1 . . . γN, leading to:

argmax_{γ1...γN} p(Φ | Λ, γ1 . . . γN).    (7)

To compute the maximum value of the objective in Equation 7, the system performs beam search over γ1 . . . γN, computing the probability of each assignment from Equation 7 to find the maximum probability assignment. Although we are using p(Φ | Λ, γ1 . . . γN) as the objective function, Φ is fixed, and the γ1 . . . γN are unknown. This approach is valid because, given our independence assumptions, p(Φ | Λ, γ1 . . . γN) corresponds to the joint distribution over all the variables given in Equation 5.

In order to perform beam search, we factor the model according to the hierarchical, compositional linguistic structure of the command:

p(Φ | Λ, γ1 . . . γN) = ∏_i p(φi | λi, γi1 . . . γik).    (8)
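The factorization in Equation 8 is what makes a beam search over grounding assignments feasible: each factor can be multiplied in as soon as its grounding variables are assigned. The following generic sketch is ours; the paper does not specify the search order or scoring details.

```python
def search_groundings(factors, candidate_values, beam_width=10):
    """Beam search over gamma_1 ... gamma_N under Eq. (8).
    factors: list of (indices, f) where f maps a tuple of groundings for
             those indices to p(phi_i = True | lambda_i, groundings).
    candidate_values: list (one per variable) of candidate groundings.
    Returns the highest-scoring full assignment found and its score."""
    beam = [((), 1.0)]
    for n, candidates in enumerate(candidate_values):
        expanded = []
        for partial, score in beam:
            for g in candidates:
                assignment = partial + (g,)
                s = score
                # Multiply in each factor exactly once: when its last
                # (highest-indexed) variable has just been assigned.
                for idx, f in factors:
                    if max(idx) == n:
                        s *= f(tuple(assignment[i] for i in idx))
                expanded.append((assignment, s))
        expanded.sort(key=lambda t: -t[1])
        beam = expanded[:beam_width]        # prune to the beam width
    return beam[0]
```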


This factorization can be represented graphically; we call the resulting graphical model the grounding graph for a natural language command. The directed model for the command "Pick up the pallet" appears in Figure 2. The λ variables correspond to language; the γ variables correspond to groundings in the external world; and the φ variables are True if the groundings correspond to the language, and False otherwise.

In the fully supervised case, we fit model parameters Θ using an aligned parallel corpus of labeled positive and negative examples for each linguistic constituent. The G3 framework assumes a log-linear parametrization with feature functions fj and feature weights θj:

p(Φ | Λ, γ1 . . . γN) = ∏_i (1/Z) exp(Σ_j θj fj(φi, λi, γi1 . . . γik)).    (9)

This function is convex and can be optimized with gradient-based methods [McCallum, 2002].

Features correspond to the degree to which each Γ correctly grounds λi. For a relation such as "on," a natural feature is whether the grounding corresponding to the head noun phrase is supported by the grounding corresponding to the argument noun phrases. However, the feature supports(γi, γj) alone is not enough to enable the model to learn that "on" corresponds to supports(γi, γj). Instead we need a feature that also takes into account the word "on":

supports(γi, γj) ∧ ("on" ∈ λi).    (10)
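A minimal sketch of one locally normalized log-linear factor with an Equation (10)-style feature. The simplifying assumption that all features are zero when φ = False is ours, made so the factor normalizes over just the two values of φ.

```python
import math

def factor_prob(phi, weights, feature_values):
    """Locally normalized log-linear factor, as in Eq. (9): returns
    p(phi | lambda_i, gamma_i1 ... gamma_ik) for phi in {True, False}.
    feature_values: the f_j evaluated with phi = True."""
    score_true = math.exp(sum(w * f for w, f in zip(weights, feature_values)))
    score_false = math.exp(0.0)   # assumption: features are off when phi=False
    z = score_true + score_false
    return (score_true if phi else score_false) / z

# Eq. (10)-style feature: a perceptual test conjoined with a word test,
# e.g., supports(gamma_i, gamma_j) AND "on" appearing in lambda_i.
def feature_on(supports_ij: bool, lambda_words) -> float:
    return 1.0 if supports_ij and "on" in lambda_words else 0.0
```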

3 Approach

Our goal is to learn model parameters for the G3 framework from an unaligned corpus of natural language commands paired with robot actions. Previously, the system learned model parameters Θ using an aligned corpus in which values for all grounding variables are known at training time, and annotators provided both positive and negative examples for each factor. In this paper we describe how to relax this annotation requirement so that only the top-level action needs to be observed in order to train the model. It is easy and fast to collect data with these annotations, whereas annotating the values of all the variables, including negative examples, is time-consuming and error-prone. Once we know the model parameters, we can use existing inference to find groundings corresponding to word meanings, as in Equation 4.

We are given a corpus of D training examples. Each example d consists of a natural language command Λ^d with an associated grounding graph with grounding variables Γ^d. Values for the grounding variables are not known, except for an observed value g_a^d for the top-level action random variable, γ_a^d. Finally, an example has an associated environmental context or semantic map M^d. Each environmental context will have a different set of objects and available actions for the robot. For example, one training example might contain a command given in an environment with a single tire pallet; another might contain a command given in an environment with two box pallets and a truck.

We define a sampling distribution to choose values for the Γ variables in the model using the G3 framework with parameters Θ:

p(Γ^d | Φ^d, Λ^d, M^d, Θ).    (12)

Next, we define a reward function for choosing the correct grounding g_a for a training example:

r(Γ^d, g_a^d) = { 1 if γ_a^d = g_a^d; −1 otherwise }.    (13)

Here γ_a is a grounding variable for the top-level action corresponding to the command; it is one of the variables in the vector Γ. Our aim is to find model parameters that maximize expected reward when drawing values for Γ^d from p(Γ^d | Φ^d, Λ^d, M^d, Θ) over the training set:

argmax_Θ Σ_d E_{p(Γ^d | Φ^d, Λ^d, M^d, Θ)}[ r(Γ^d, g_a^d) ].    (14)

Expanding the expectation, we have:

argmax_Θ Σ_d Σ_{Γ^d} r(Γ^d, g_a^d) p(Γ^d | Φ^d, Λ^d, M^d, Θ).    (15)

We use stochastic gradient descent to find model parameters that maximize reward. First, we take the derivative of the expectation with respect to Θ (we drop the d superscripts for brevity):

∂/∂θk E_{p(Γ | Φ, Λ, M, Θ)}[ r(Γ, g_a) ] = Σ_Γ r(Γ, g_a) ∂/∂θk p(Γ | Φ, Λ, M, Θ).    (16)

Focusing on the inner term, we expand it with Bayes' rule:

∂/∂θk p(Γ | Φ, Λ, M, Θ) = ∂/∂θk [ p(Φ | Γ, Λ, M, Θ) p(Γ | Λ, M, Θ) / p(Φ | Λ, M, Θ) ].    (17)

We assume the priors do not depend on Θ:

[ p(Γ | M) / p(Φ | Λ) ] ∂/∂θk p(Φ | Γ, Λ, M, Θ).    (18)

For brevity, we compress Γ, Λ, and M into the variable X. Next, we take the partial derivative of the likelihood of Φ, first assuming each factor is independent:

∂/∂θk p(Φ | X, Θ) = ∂/∂θk ∏_i p(φi | X, Θ)    (19)

= ∏_i p(φi | X, Θ) × Σ_j [ (∂/∂θk p(φj | X, Θ)) / p(φj | X, Θ) ].    (20)

Finally, we assume the distribution over φj takes a log-linear form with feature functions fk and parameters θk, as in the G3 framework:

∂/∂θk p(φj | X, Θ) = p(φj | X, Θ) × ( fk(φj, X) − E_{p(φ′ | X, Θ)}[ fk(φ′, X) ] ).    (21)

We substitute back into the overall expression for the partial derivative of the expectation:

∂/∂θk E_{p(Γ | X, Θ)}[ r(Γ, g_a) ] = E_{p(Γ | X, Θ)}[ r(Γ, g_a) × Σ_j ( fk(φj, X) − E_{p(φ′ | X, Θ)}[ fk(φ′, X) ] ) ].    (22)

Input: Initial values for parameters Θ0; training dataset D; number of iterations T; step size α.

Θ ← Θ0
for t ∈ 1 . . . T do
    for d ∈ D do
        ∇Θ ← ∂/∂θk E_{p(Γ^d | Φ^d, Λ^d, M^d, Θ)}[ r(Γ^d, g_a^d) ]
    end for
    Θ ← Θ + α ∇Θ
end for
Output: Estimate of parameters Θ

Figure 3: Training algorithm.

We approximate the expectation over Γ with highly probable bindings for the Γ variables and update the gradient incrementally for each example. The training algorithm is given in Figure 3.
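A sketch of the training loop in Figure 3, with the gradient of Equation 22 estimated by sampling. The example interface (sample, expected_features, reward) is hypothetical scaffolding standing in for the grounding-graph machinery, which is not reproduced here.

```python
import numpy as np

def train(examples, num_weights, iters=100, alpha=0.01, num_samples=50):
    """Policy-gradient training (Figure 3). Each example is assumed to offer:
      sample(theta)            -> (groundings, summed feature vector)
                                  drawn from p(Gamma | Phi, Lambda, M, Theta)
      expected_features(theta) -> E[f] summed over factors, per Eq. (21)
      reward(groundings)       -> +1 if the top-level action matches, else -1
    """
    theta = np.zeros(num_weights)
    for _ in range(iters):
        grad = np.zeros(num_weights)
        for ex in examples:
            ex_grad = np.zeros(num_weights)
            for _ in range(num_samples):
                groundings, phi_sum = ex.sample(theta)
                # Eq. (22): reward-weighted difference between observed
                # and expected feature counts.
                ex_grad += ex.reward(groundings) * (
                    phi_sum - ex.expected_features(theta))
            grad += ex_grad / num_samples
        theta += alpha * grad      # ascend the expected reward
    return theta
```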

4 Results

We present preliminary results for the learning algorithm using a corpus of natural language commands given to a robotic forklift. We collected a corpus of natural language commands paired with robot actions by showing annotators on Amazon Mechanical Turk a video of the robot executing an action and asking them to describe, in the words they would use to command an expert human operator, how to carry out the actions in the video. Frames from a video in our corpus, together with commands for that video, appear in Figure 4. Since commands often spanned multiple parts of the video, we annotated the alignment between each top-level clause in the command and the robot's motion in the video. Our initial evaluation uses only commands from the corpus that contain the words "pick up," due to scaling issues when running on the entire corpus.

We report results using a random cost function as a baseline, as well as with the learned parameters, on a training set and a held-out test set. Table 1 shows performance on a training set using small environments (with one or two other objects) and a test set of small and large environments (with up to six other objects).


Figure 4: Frames from a video in our dataset (panels at t=0, t=20, t=30), paired with natural language commands:

"Pick up pallet with refridgerator [sic] and place on truck to the left."

"A distance away you should see a rectangular box. Approach it slowly and load it up onto your forklift. Slowly proceed to back out and then make a sharp turn and approach the truck. Raise your forklift and drop the rectangular box on the back of the truck."

"Go to the pallet with the refrigerator on it and pick it up. Move the pallet to the truck trailer. Place the pallet on the trailer."

"Pick up the pallet with the refrigerator and place it on the trailer."

As expected, on the training set the system learns a good policy, since it is directly rewarded for acting correctly. Because the environments are small, the chance of correctly grounding concrete noun phrases with a random cost function is high. However, after training, performance at grounding noun phrases increases to 92%, even though the system had no access to alignment annotations for noun phrases at training time; it only observes reward based on whether it has acted correctly.

Next, we report performance on a test set to assess generalization to novel commands given in novel environments. Since the test set includes larger environments with up to six objects, baseline performance is lower. However, the trained system is able to achieve high performance at both inferring correct actions and inferring correct object groundings, despite having no access to a reward signal of any kind during inference. This result shows the system has learned general word meanings that apply in novel contexts not seen at training time.

                  % Correct
                  Actions   Concrete Noun Phrases
Before Training   31%       61%
After Training    100%      92%

(a) Training (small environments)

                  % Correct
                  Actions   Concrete Noun Phrases
Before Training   3%        26%
After Training    84%       77%

(b) Testing (small and large environments)

Table 1: Results on the training set and test set.

5 Related Work

Beginning with SHRDLU [Winograd, 1971], many systems have exploited the compositional structure of language to statically generate a plan corresponding to a natural language command [Hsiao et al., 2008, MacMahon et al., 2006, Skubic et al., 2004, Dzifcak et al., 2009]. Our work moves beyond this framework by defining a probabilistic graphical model according to the structure of the natural language command, inducing a distribution over plans and groundings.

Models that learn word meanings [Tellex et al., 2011, Kollar et al., 2010] require detailed alignment annotations between constituents in the language and objects, places, paths, or events in the external world. Previous approaches capable of learning from unaligned data [Vogel and Jurafsky, 2010, Branavan et al., 2009] used sequential models that could not capture the hierarchical structure of language. Matuszek et al. [2010], Liang et al. [2011] and Chen and Mooney [2011] describe models that learn compositional semantics, but their word meanings are symbolic structures rather than patterns of features in the external world.

There has been a variety of work in transferring action policies between a human and a robot. In imitation learning, the goal is to create a system that can watch a teacher perform an action and then reproduce that action [Kruger et al., 2007, Chernova and Veloso, 2009, Schaal et al., 2003, Ekvall and Kragic, 2008]. Rybski et al. [2007] developed an imitation learning system that learns from a combination of imitation of the human teacher and natural language input. Our work differs in that the system must infer an action from the natural language commands, rather than from watching the teacher perform an action. The system is trained off-line, and the task of the robot is to respond on-line to the natural language command.

6 Conclusion

In this paper we described an approach for learning perceptually grounded word meanings from an unaligned parallel corpus of language paired with robot actions. The training algorithm jointly infers policies that correspond to natural language commands as well as alignments between noun phrases in the command and groundings in the external world. In addition, our approach learns grounded word meanings, or distributions corresponding to words in the language, that the system can use to follow novel commands that it may never have encountered during training. We presented a preliminary evaluation on a small corpus, demonstrating that the system is able to infer meanings for concrete noun phrases despite having no direct supervision for these values.

There are many directions for improvement. We plan to train our system using a large dataset of language paired with robot actions in more complex environments, and on more than one robotic platform. Our approach points the way toward a framework that can learn a large vocabulary of general grounded word meanings, enabling systems that flexibly respond to a wide variety of natural language commands given by untrained users.

References

S. R. K. Branavan, H. Chen, L. S. Zettlemoyer, and R. Barzilay. Reinforcement learning for mapping instructions to actions. In Proceedings of ACL, pages 82–90, 2009.

D. L. Chen and R. J. Mooney. Learning to interpret natural language navigation instructions from observations. In Proc. AAAI, 2011.

S. Chernova and M. Veloso. Interactive policy learning through confidence-based autonomy. JAIR, 34(1):1–25, 2009.

J. Dzifcak, M. Scheutz, C. Baral, and P. Schermerhorn. What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In Proc. IEEE Int'l Conf. on Robotics and Automation (ICRA), pages 4163–4168, 2009.

S. Ekvall and D. Kragic. Robot learning from demonstration: a task-level planning approach. International Journal of Advanced Robotic Systems, 5(3), 2008.

S. Harnad. The symbol grounding problem. Physica D, 43:335–346, 1990.

K. Hsiao, S. Tellex, S. Vosoughi, R. Kubat, and D. Roy. Object schemas for grounding language in a responsive robot. Connection Science, 20(4):253–276, 2008.

R. S. Jackendoff. Semantics and Cognition, pages 161–187. MIT Press, 1983.

T. Kollar, S. Tellex, D. Roy, and N. Roy. Toward understanding natural language directions. In Proc. ACM/IEEE Int'l Conf. on Human-Robot Interaction (HRI), pages 259–266, 2010.

V. Kruger, D. Kragic, A. Ude, and C. Geib. The meaning of action: A review on action recognition and mapping. Advanced Robotics, 21(13), 2007.

P. Liang, M. I. Jordan, and D. Klein. Learning dependency-based compositional semantics. In Proc. Association for Computational Linguistics (ACL), 2011.

M. MacMahon, B. Stankiewicz, and B. Kuipers. Walk the talk: Connecting language, knowledge, and action in route instructions. In Proc. Nat'l Conf. on Artificial Intelligence (AAAI), pages 1475–1482, 2006.

C. Matuszek, D. Fox, and K. Koscher. Following directions using statistical machine translation. In Proc. ACM/IEEE Int'l Conf. on Human-Robot Interaction (HRI), pages 251–258, 2010.

N. Mavridis and D. Roy. Grounded situation models for robots: Where words and percepts meet. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4690–4697. IEEE, Oct. 2006. ISBN 1-4244-0258-1.

A. K. McCallum. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.

P. Rybski, K. Yoon, J. Stolarz, and M. Veloso. Interactive robot task training through dialog and demonstration. In Proceedings of HRI, page 56. ACM, 2007.

S. Schaal, A. Ijspeert, and A. Billard. Computational approaches to motor learning by imitation. Phil. Trans. R. Soc. Lond. B, (358), 2003.

M. Skubic, D. Perzanowski, S. Blisard, A. Schultz, W. Adams, M. Bugajska, and D. Brock. Spatial language for human-robot dialogs. IEEE Trans. on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 34(2):154–167, 2004. ISSN 1094-6977.

S. Tellex, T. Kollar, S. Dickerson, M. Walter, A. Banerjee, S. Teller, and N. Roy. Understanding natural language commands for robotic navigation and mobile manipulation. In Proc. AAAI, 2011.

A. Vogel and D. Jurafsky. Learning to follow navigational directions. In Proc. Association for Computational Linguistics (ACL), pages 806–814, 2010.

T. Winograd. Procedures as a Representation for Data in a Computer Program for Understanding Natural Language. PhD thesis, Massachusetts Institute of Technology, 1971.


Author Index

Adjogah, Richard, 1

Babes-Vroman, Monica, 1

desJardins, Marie, 1

Gao, Ruoyuan, 1

Joseph, Josh, 3

Littman, Michael, 1

MacGlashan, James, 1

Muresan, Smaranda, 1

Roy, Nicholas, 3

Tellex, Stefanie, 3

Thaker, Pratiksha, 3

Winner, Kevin, 1
