Transactions of the Association for Computational Linguistics, 1 (2013) 49–62. Action Editor: Jason Eisner. Submitted 11/2012; Published 3/2013. © 2013 Association for Computational Linguistics.

Weakly Supervised Learning of Semantic Parsers for Mapping Instructions to Actions

Yoav Artzi and Luke Zettlemoyer
Computer Science & Engineering
University of Washington
Seattle, WA 98195

{yoav,lsz}@cs.washington.edu

Abstract

The context in which language is used provides a strong signal for learning to recover its meaning. In this paper, we show it can be used within a grounded CCG semantic parsing approach that learns a joint model of meaning and context for interpreting and executing natural language instructions, using various types of weak supervision. The joint nature provides crucial benefits by allowing situated cues, such as the set of visible objects, to directly influence learning. It also enables algorithms that learn while executing instructions, for example by trying to replicate human actions. Experiments on a benchmark navigational dataset demonstrate strong performance under differing forms of supervision, including correctly executing 60% more instruction sets relative to the previous state of the art.

1 Introduction

The context in which natural language is used provides a strong signal to reason about its meaning. However, using such a signal to automatically learn to understand unrestricted natural language remains a challenging, unsolved problem.

For example, consider the instructions in Figure 1. Correct interpretation requires us to solve many subproblems, such as resolving all referring expressions to specific objects in the environment (including "the corner" or "the third intersection"), disambiguating word sense based on context (e.g., "the chair" could refer to a chair or sofa), and finding executable action sequences that satisfy stated constraints (such as "twice" or "to face the blue hall").

move forward twice to the chair
λa.move(a) ∧ dir(a, forward) ∧ len(a, 2) ∧ to(a, ιx.chair(x))

at the corner turn left to face the blue hall
λa.pre(a, ιx.corner(x)) ∧ turn(a) ∧ dir(a, left) ∧ post(a, front(you, ιx.blue(x) ∧ hall(x)))

move to the chair in the third intersection
λa.move(a) ∧ to(a, ιx.sofa(x)) ∧ intersect(order(λy.junction(y), frontdist, 3), x)

Figure 1: A sample navigation instruction set, paired with lambda-calculus meaning representations.

We must also understand implicit requests, for example from the phrase "at the corner," that describe goals to be achieved without specifying the specific steps. Finally, to do all of this robustly without prohibitive engineering effort, we need grounded learning approaches that jointly reason about meaning and context to learn directly from their interplay, with as little human intervention as possible.

Although many of these challenges have been studied separately, as we will review in Section 3, this paper represents, to the best of our knowledge, the first attempt at a comprehensive model that addresses them all. Our approach induces a weighted Combinatory Categorial Grammar (CCG), including both the parameters of the linear model and a CCG lexicon. To model complex instructional language, we introduce a new semantic modeling approach that can represent a number of key linguistic constructs that are common in spatial and instructional language. To learn from indirect supervision, we define the notion of a validation function, for example one that tests the state of the agent after interpreting an instruction. We then show how this function can be used to drive online learning.


For that purpose, we adapt the loss-sensitive Perceptron algorithm (Singh-Miller & Collins, 2007; Artzi & Zettlemoyer, 2011) to use a validation function and coarse-to-fine inference for lexical induction.

The joint nature of this approach provides crucial benefits in that it allows situated cues, such as the set of visible objects, to directly influence parsing and learning. It also enables the model to be learned while executing instructions, for example by trying to replicate actions taken by humans. In particular, we show that, given only a small seed lexicon and a task-specific executor, we can induce high quality models for interpreting complex instructions.

We evaluate the method on a benchmark navigational instructions dataset (MacMahon et al., 2006; Chen & Mooney, 2011). Our joint approach successfully completes 60% more instruction sets relative to the previous state of the art. We also report experiments that vary supervision type, finding that observing the final position of an instruction execution is nearly as informative as observing the entire path. Finally, we present improved results on a new version of the MacMahon et al. (2006) corpus, which we filtered to include only executable instructions paired with correct traces.

2 Technical Overview

Task Let S be the set of possible environment states and A be the set of possible actions. Given a start state s ∈ S and a natural language instruction x, we aim to generate a sequence of actions ~a = 〈a_1, ..., a_n〉, with each a_i ∈ A, that performs the steps described in x.

For example, in the navigation domain (MacMahon et al., 2006), S is a set of positions on a map. Each state s = (x, y, o) is a triple, where x and y are integer grid coordinates and o ∈ {0, 90, 180, 270} is an orientation. Figure 2 shows an example map with 36 states; the ones we use in our experiments contain an average of 141. The space of possible actions A is {LEFT, RIGHT, MOVE, NULL}. Actions change the state of the world according to a transition function T : A × S → S. In our navigation example, moving forward can change the x or y coordinates while turning changes the orientation o.
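To make the state and action spaces concrete, here is a minimal sketch of the navigation domain in Python. The dataclass fields mirror the triple s = (x, y, o); the forward-offset convention is an assumption for illustration, not taken from the authors' implementation.

```python
from dataclasses import dataclass

# Orientations in degrees; turns move between them in 90-degree steps.
ORIENTATIONS = (0, 90, 180, 270)

@dataclass(frozen=True)
class State:
    x: int  # horizontal grid coordinate
    y: int  # vertical grid coordinate
    o: int  # orientation, one of ORIENTATIONS

# Unit offsets for moving forward under each orientation (assumed convention).
FORWARD = {0: (0, 1), 90: (1, 0), 180: (0, -1), 270: (-1, 0)}

def transition(action: str, s: State) -> State:
    """Transition function T : A x S -> S for the four navigation actions."""
    if action == "MOVE":
        dx, dy = FORWARD[s.o]
        return State(s.x + dx, s.y + dy, s.o)
    if action == "LEFT":
        return State(s.x, s.y, (s.o - 90) % 360)
    if action == "RIGHT":
        return State(s.x, s.y, (s.o + 90) % 360)
    return s  # NULL leaves the state unchanged
```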

Model To map instructions to actions, we jointly reason about linguistic meaning and action execution. We use a weighted CCG grammar to rank possible meanings z for each instruction x. Section 6 defines how to design such grammars for instructional language. Each logical form z is mapped to a sequence of actions ~a with a deterministic executor, as described in Section 7. The final grounded CCG model, detailed in Section 6.3, jointly constructs and scores z and ~a, allowing for robust situated reasoning during semantic interpretation.

Learning We assume access to a training set containing n examples {(x_i, s_i, V_i) : i = 1...n}, each containing a natural language sentence x_i, a start state s_i, and a validation function V_i. The validation function V_i : A → {0, 1} maps an action sequence ~a ∈ A to 1 if it is correct according to available supervision, or 0 otherwise. This training data contains no direct evidence about the logical form z_i for each x_i, or the grounded CCG analysis used to construct z_i. We model all these choices as latent variables. We experiment with two validation functions. The first, V_D(~a), has access to an observable demonstration of the execution ~a_i; a given ~a is valid iff ~a = ~a_i. The second, V_S_i(~a), only encodes the final state s′_i of the execution of x_i; therefore ~a is valid iff its final state is s′_i. Since numerous logical forms often execute identically, both functions provide highly ambiguous supervision.
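Both validation functions reduce to a few lines of code. A sketch under the definitions above; the helper names are not from the paper, and `transition` is the domain transition function (e.g., the sketch after the Task paragraph).

```python
def make_demonstration_validator(demo):
    """V_D: an action sequence is valid iff it equals the demonstrated one."""
    return lambda actions: 1 if list(actions) == list(demo) else 0

def make_final_state_validator(final_state, start_state, transition):
    """V_S: valid iff executing the actions from the start state ends in s'_i."""
    def validate(actions):
        s = start_state
        for a in actions:
            s = transition(a, s)
        return 1 if s == final_state else 0
    return validate
```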

Evaluation We evaluate task completion for single instructions on a test set {(x_i, s_i, s′_i) : i = 1...n}, where s′_i is the final state of an oracle agent following the execution of x_i starting at state s_i. We will also report accuracies for correctly interpreting instruction sequences ~x, where a single error can cause the entire sequence to fail. Finally, we report accuracy on recovering correct logical forms z_i on a manually annotated subset of the test set.

3 Related Work

Our learning is inspired by the reinforcement learning (RL) approach of Branavan et al. (2009), and related methods (Vogel & Jurafsky, 2010), but uses latent variable model updates within a semantic parser. Branavan et al. (2010) extended their RL approach to model high-level instructions, which correspond to implicit actions in our domain. Wei et al. (2009) and Kollar et al. (2010) used shallow linguistic representations for instructions. Recently, Tellex et al. (2011) used a graphical model semantics representation to learn from instructions paired with demonstrations. In contrast, we model significantly more complex linguistic phenomena than these approaches, as required for the navigation domain.

Other research has adopted expressive meaning representations, with differing learning approaches. Matuszek et al. (2010, 2012) describe supervised algorithms that learn semantic parsers for navigation instructions. Chen and Mooney (2011), Chen (2012) and Kim and Mooney (2012) present state-of-the-art algorithms for the navigation task, by training a supervised semantic parser from automatically induced labels. Our work differs in the use of joint learning and inference approaches.

Supervised approaches for learning semantic parsers have received significant attention, e.g. Kate and Mooney (2006), Wong and Mooney (2007), Muresan (2011) and Kwiatkowski et al. (2010, 2012). The algorithms we develop in this paper combine ideas from previous supervised CCG learning work (Zettlemoyer & Collins, 2005, 2007; Kwiatkowski et al., 2011), as we describe in Section 4. Recently, various alternative forms of supervision were introduced. Clarke et al. (2010), Goldwasser and Roth (2011) and Liang et al. (2011) describe approaches for learning semantic parsers from sentences paired with responses, Krishnamurthy and Mitchell (2012) describe using distant supervision, Artzi and Zettlemoyer (2011) use weak supervision from conversational logs and Goldwasser et al. (2011) present work on unsupervised learning. We discuss various forms of supervision that complement these approaches. There has also been work on learning for semantic analysis tasks from grounded data, including event streams (Liang et al., 2009; Chen et al., 2010) and language paired with visual perception (Matuszek et al., 2012).

Finally, the topic of executing instructions in non-learning settings has received significant attention (e.g., Winograd (1972), Di Eugenio and White (1992), Webber et al. (1995), Bugmann et al. (2004), MacMahon et al. (2006) and Dzifcak et al. (2009)).

4 Background

We use a weighted linear CCG grammar for semantic parsing, as briefly reviewed in this section.

Combinatory Categorial Grammars (CCGs) CCGs are a linguistically motivated formalism for modeling a wide range of language phenomena (Steedman, 1996, 2000). A CCG is defined by a lexicon and a set of combinators. The lexicon contains entries that pair words or phrases with categories. For example, the lexical entry chair ⊢ N : λx.chair(x) for the word "chair" in the parse in Figure 4 pairs it with a category that has syntactic type N and meaning λx.chair(x). Figure 4 shows how a CCG parse builds a logical form for a complete sentence in our example navigation domain. Starting from lexical entries, each intermediate parse node, including syntax and semantics, is constructed with one of a small set of CCG combinators (Steedman, 1996, 2000). We use the application, composition and coordination combinators, and three others described in Section 6.3.

Factored CCG Lexicons Recently, Kwiatkowski et al. (2011) introduced a factored CCG lexicon representation. Each lexical item is composed of a lexeme and a template. For example, the entry chair ⊢ N : λx.chair(x) would be constructed by combining the lexeme chair ⊢ [chair], which contains a word paired with logical constants, with the template λv.[N : λx.v(x)], which defines the rest of the category by abstracting over logical constants. This approach allows the reuse of common syntactic structures through a small set of templates. Section 8 describes how we learn such lexical entries.

Weighted Linear CCGs A weighted linear CCG (Clark & Curran, 2007) ranks the space of possible parses under the grammar, and is closely related to several other approaches (Lafferty et al., 2001; Collins, 2004; Taskar et al., 2004). Let x be a sentence, y be a CCG parse, and GEN(x; Λ) be the set of all possible CCG parses for x given the lexicon Λ. Define φ(x, y) ∈ R^d to be a d-dimensional feature-vector representation and θ ∈ R^d to be a parameter vector. The optimal parse for sentence x is

y*(x) = arg max_{y ∈ GEN(x;Λ)} θ · φ(x, y)

and the final output logical form z is the λ-calculus expression at the root of y*(x). Section 7.2 describes how we efficiently compute an approximation to y*(x) within the joint interpretation and execution model.
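In sketch form, the linear model is a dot product over sparse features and the parse decision is an argmax. The `gen` and `featurize` callables below stand in for GEN(x; Λ) and φ(x, y); exhaustive enumeration replaces the paper's beam search purely for clarity.

```python
def score(theta, features):
    """Linear score theta . phi(x, y) over a sparse feature vector (a dict)."""
    return sum(theta.get(f, 0.0) * v for f, v in features.items())

def best_parse(x, gen, featurize, theta):
    """y*(x) = argmax over GEN(x) of theta . phi(x, y), by brute force.
    The paper instead uses CKY parsing with a beam; this is only a sketch."""
    return max(gen(x), key=lambda y: score(theta, featurize(x, y)))
```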


Supervised learning with GENLEX Previous work (Zettlemoyer & Collins, 2005) introduced a function GENLEX(x, z) to map a sentence x and its meaning z to a large set of potential lexical entries. These entries are generated by rules that consider the logical form z and guess potential CCG categories. For example, the rule p → N : λx.p(x) introduces categories commonly used to model certain types of nouns. This rule would, for example, introduce the category N : λx.chair(x) for any logical form z that contains the constant chair. GENLEX uses a small set of such rules to generate categories that are paired with all possible substrings in x, to create a large set of lexical entries. The complete learning algorithm then simultaneously selects a small subset of these entries and estimates parameter values θ. In Section 8, we will introduce a new way of using GENLEX to learn from different signals that, crucially, do not require a labeled logical form z.
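A hypothetical rendering of the one rule named above, p → N : λx.p(x), with categories as strings only to keep the sketch short; a real implementation would build typed lambda-calculus categories.

```python
def genlex_noun_rule(sentence, logical_constants):
    """Illustrative GENLEX rule p -> N : lambda x.p(x): pair every substring
    of the sentence with a noun category for each constant appearing in the
    supervision signal. Categories are strings here purely for brevity."""
    tokens = sentence.split()
    substrings = [" ".join(tokens[i:j])
                  for i in range(len(tokens))
                  for j in range(i + 1, len(tokens) + 1)]
    return {(w, f"N : lambda x.{p}(x)")
            for w in substrings for p in logical_constants}

# e.g., genlex_noun_rule("the chair", {"chair"}) proposes, among others,
# ("chair", "N : lambda x.chair(x)")
```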

5 Spatial Environment Modeling

We will execute instructions in an environment (see Section 2), which has a set of positions. A position is a triple (x, y, o), where x and y are horizontal and vertical coordinates, and o ∈ O = {0, 90, 180, 270} is an orientation. A position also includes properties indicating the object it contains, its floor pattern and its wallpaper. For example, the square at (4, 3) in Figure 2 has four positions, one per orientation.

Because instructional language refers to objects and other structures in an environment, we introduce the notion of a position set. For example, in Figure 2, the position set D = {(5, 3, o) : o ∈ O} represents a chair, while B = {(x, 3, o) : o ∈ O, x ∈ [0...5]} represents the blue floor. Both sets contain all orientations for each (x, y) pair, thereby representing properties of regions. Position sets can have many properties. For example, E, in addition to being a chair, is also an intersection because it overlaps with the neighboring halls A and B. The set of possible entities includes all position sets and a few additional entries. For example, set C = {(4, 3, 90)} in Figure 2 represents the agent's position.

6 Modeling Instructional Language

We aim to design a semantic representation that is learnable, models grounded phenomena such as spatial relations and object reference, and is executable.

[Figure 2 shows a schematic map with position sets A and B (halls), C (the agent's position), and D and E (chairs), together with example spatial phrases and their semantics: (a) "chair" λx.chair(x), denoting {D, E}; (b) "hall" λx.hall(x), denoting {A, B}; (c) "the chair" ιx.chair(x), denoting E; (d) "you" you, denoting C; (e) "blue hall" λx.hall(x) ∧ blue(x), denoting {B}; (f) "chair in the intersection" λx.chair(x) ∧ intersect(ιy.junction(y), x), denoting {E}; (g) "in front of you" λx.in_front_of(you, x), denoting {A, B, E}.]
Figure 2: Schematic diagram of a map environment and examples of the semantics of spatial phrases.

Our semantic representation combines ideas from Carpenter (1997) and Neo-Davidsonian event semantics (Parsons, 1990) in a simply typed λ-calculus. There are four basic types: (1) entities e that are objects in the world, (2) events ev that specify actions in the world, (3) truth values t, and (4) meta-entities m, such as numbers or directions. We also allow functional types, which are defined by input and output types. For example, 〈e, t〉 is the type of functions from entities to truth values.

6.1 Spatial Language Modeling

Nouns and Noun Phrases Noun phrases are paired with e-type constants that name specific entities, and nouns are mapped to 〈e, t〉-type expressions that define a property. For example, the noun "chair" (Figure 2a) is paired with the expression λx.chair(x), which defines the set of objects for which the constant chair returns true. The denotation of this expression is the set {D, E} in Figure 2, and the denotation of λx.hall(x) (Figure 2b) is {A, B}. Also, the noun phrase "you" (Figure 2d), which names the agent, is represented by the constant you with denotation C, the agent's position.

Determiners Noun phrases can also be formed by combining nouns with determiners that pick out specific objects in the world. We consider both definite reference, which names contextually unique objects, and indefinites, which are less constrained.

The definite article is paired with a logical expression ι of type 〈〈e, t〉, e〉,¹ which will name a single object in the world. For example, the phrase "the chair" in Figure 2c will be represented by ιx.chair(x), which will denote the appropriate chair. However, computing this denotation is challenging when there is perceptual ambiguity, for positions where multiple chairs are visible. We adopt a simple heuristic approach that ranks referents based on a combination of their distance from the agent and whether they are in front of it. For our example, from position C our agent would pick the chair E in front of it as the denotation. The approach differs from previous, non-grounded models that fail to name objects when faced with such ambiguity (e.g., Carpenter (1997), Heim and Kratzer (1998)).

To model the meaning of indefinite articles, we depart from the Frege-Montague tradition of using existential quantifiers (Lewis, 1970; Montague, 1973; Barwise & Cooper, 1981), and instead introduce a new quantifier A that, like ι, has type 〈〈e, t〉, e〉. For example, the phrase "a chair" would be paired with Ax.chair(x), which denotes an arbitrary entry from the set of chairs in the world. Computing the denotation for such expressions in a world will require picking a specific object, without further restrictions. This approach is closely related to Steedman's generalized Skolem terms (2011).²

Meta Entities We use m-typed terms to represent non-physical entities, such as numbers (1, 2, etc.) and directions (left, right, etc.) whose denotations are fixed. The ability to refer to directions allows us to manipulate position sets. For example, the phrase "your left" is mapped to the logical expression orient(you, left), which denotes the position set containing the position to the left of the agent.

¹ Although quantifiers are logical constants with type 〈〈e, t〉, e〉 or 〈〈e, t〉, t〉, we use a notation similar to that used for first-order logic. For example, the notation ιx.f(x) represents the logical expression ι(λx.f(x)).

² Steedman (2011) uses generalized Skolem terms as a tool for resolving anaphoric pronouns, which we do not model.

Prepositions and Adjectives Noun phrases with modifiers, such as adjectives and prepositional phrases, are 〈e, t〉-type expressions that implement set intersection with logical conjunctions. For example, in Figure 2, the phrase "blue hall" is paired with λx.hall(x) ∧ blue(x) with denotation {B}, and the phrase "chair in the intersection" is paired with λx.chair(x) ∧ intersect(ιy.junction(y), x) with denotation {E}. Intuitively, the adjective "blue" introduces the constant blue and "in the" adds an intersect. We will describe the full details of how these expressions are constructed in Section 6.3.
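Since modifiers denote set intersection, their evaluation is direct. A sketch, assuming entities are position sets and properties are unary predicates over them; is_hall and is_blue below are hypothetical names.

```python
def denote_conjunction(predicates, entities):
    """Denotation of lambda x. p1(x) ^ ... ^ pn(x): the entities satisfying
    every conjunct, i.e. the intersection of the conjuncts' denotations."""
    return {e for e in entities if all(p(e) for p in predicates)}

# e.g., "blue hall" -> denote_conjunction([is_hall, is_blue], entities),
# which for the environment of Figure 2 would be the singleton {B}.
```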

Spatial Relations The semantic representation allows more complex reasoning over position sets and the relations between them. For example, the binary relation in_front_of (Figure 2g) tests if the first argument is in front of the second from the point of view of the agent. Additional relations are used to model set intersection, relative direction, relative distance, and relative position by distance.

6.2 Modeling Instructions

To model actions in the world, we adopt Neo-Davidsonian event semantics (Davidson, 1967; Parsons, 1990), which treats events as ev-type primitive objects. Such an approach allows for a compact lexicon where adverbial modifiers introduce predicates, which are linked by a shared event argument.

Instructional language is characterized by heavy usage of imperatives, which we model as functions from events to truth values.³ For example, an imperative such as "move" would have the meaning λa.move(a), which defines a set of events that match the specified constraints. Here, this set would include all events that involve moving actions.

The denotation of ev-type terms is a sequence of n instances of the same action. In this way, an event defines a function ev : s → s′, where s is the start state and s′ the end state. For example, the denotation of λa.move(a) is the set of move action sequences {〈MOVE_1, ..., MOVE_n〉 : n ≥ 1}. Although performing actions often requires performing additional ones (e.g., the agent might have to turn before being able to move), we treat such actions as implicit (Section 7.1), and don't model them explicitly within the logical form.

³ Imperatives are 〈ev, t〉-type, much like 〈e, t〉-type wh-interrogatives. Both define sets: the former includes actions to execute, the latter defines answers to a question.

Predicates such as move (seen above) and turn are introduced by verbs. Events can also be modified by adverbials, which are intersective, much like prepositional phrases. For example, in the imperative, logical form (LF) pair:

Imp.: move from the sofa to the chair
LF: λa.move(a) ∧ to(a, ιx.chair(x)) ∧ from(a, ιy.sofa(y))

each adverbial phrase provides a constraint, and changing their order will not change the LF.

6.3 Parsing Instructional Language with CCG

To compose logical expressions from sentences we use CCG, as described in Section 4. Figures 3 and 4 present a sample of lexical entries and how they are combined, as we will describe in this section. The basic syntactic categories are N (noun), NP (noun phrase), S (sentence), PP (prepositional phrase), AP (adverbial phrase), ADJ (adjective) and C (a special category for coordinators).

Type Raising To compactly model syntactic variations, we follow Carpenter (1997), who argues for polymorphic typing. We include the simpler, lower-type entry in the lexicon and introduce type-raising rules to reconstruct the other when necessary at parse time. We use four rules:

PP : g → N\N : λf.λx.f(x) ∧ g(x)
ADJ : g → N/N : λf.λx.f(x) ∧ g(x)
AP : g → S\S : λf.λa.f(a) ∧ g(a)
AP : g → S/S : λf.λa.f(a) ∧ g(a)

where the first three are for prepositional, adjectival and adverbial modifications, and the fourth models the fact that adverbials are often topicalized.⁴ Figures 3 and 4 show parses that use type-raising rules.

⁴ Using type-raising rules can be particularly useful when learning from sparse data. For example, it will no longer be necessary to learn three lexical entries for each adverbial phrase (with syntax AP, S\S, and S/S).

[Figure 3 shows the CCG parse of "chair in the corner": chair ⊢ N : λx.chair(x), in ⊢ PP/NP : λx.λy.intersect(x, y), the ⊢ NP/N : λf.ιx.f(x) and corner ⊢ N : λx.corner(x) combine, with the PP λy.intersect(ιx.corner(x), y) type-raised to N\N, to yield N : λy.chair(y) ∧ intersect(ιx.corner(x), y).]
Figure 3: A CCG parse with a prepositional phrase.

Indefinites As discussed in Section 6.1, we use a new syntactic analysis for indefinites, following Steedman (2011). Previous approaches would build parses such as the following for "with a lamp":

with ⊢ PP/NP : λx.λy.intersect(x, y)
a ⊢ PP\(PP/NP)/N : λf.λg.λy.∃x.g(x, y) ∧ f(x)
lamp ⊢ N : λx.lamp(x)

which combine to PP : λy.∃x.intersect(x, y) ∧ lamp(x), where "a" has the relatively complex syntactic category PP\(PP/NP)/N and where similar entries would be needed to quantify over different types of verbs (e.g., S\(S/NP)/N) and adverbials (e.g., AP\(AP/NP)/N). Instead, we include a single lexical entry a ⊢ NP/N : λf.Ax.f(x) which can be used to construct the correct meaning in all cases.

7 Joint Parsing and Execution

Our inference includes an execution component and a parser. The parser maps sentences to logical forms, and incorporates the grounded execution model. We first discuss how to execute logical forms, and then describe the joint model for execution and parsing.

7.1 Executing Logical Expressions

Dynamic Models In spatial environments, such as the ones in our task, the agent's ability to observe the world depends on its current state. Taking this aspect of spatial environments into account is challenging, but crucial for correct evaluation.

To represent the agent's point of view, for each state s ∈ S, as defined in Section 2, let M_s be the state-dependent logical model. A model M consists of a domain D_{M,T} of objects for each type T and an interpretation function I_{M,T} : O_T → D_{M,T}, where O_T is the set of T-type constants. I_{M,T} maps logical symbols to T-type objects; for example, it will map you to the agent's position. We have domains for position sets, actions and so on. Finally, let V_T be the set of variables of type T, and A_T : V_T → ∪_{s∈S} D_{M_s,T} be the assignment function, which maps variables to domain objects.

[Figure 4 shows the CCG parse of "facing the lamp go until you reach a chair". The topicalized adverbial "facing the lamp" is type-raised to S/S, "a chair" uses the entry a ⊢ NP/N : λf.Ax.f(x), and the final logical form is λa.move(a) ∧ post(a, intersect(Ax.chair(x), you)) ∧ pre(a, front(you, ιx.lamp(x))).]
Figure 4: A CCG parse showing adverbial phrases and topicalization.

For each model M_s, the domain D_{M_s,ev} is a set of action sequences {〈a_1, ..., a_n〉 : n ≥ 1}. Each ~a defines a sequence of states s_i, as defined in Section 6.2, and associated models M_{s_i}. The key challenge for execution is that modifiers of the event will need to be evaluated under different models from this sequence. For example, consider the sentence in Figure 4. To correctly execute the pre literal, introduced by the "facing" phrase, it must be evaluated in the model M_{s_0} for the initial state s_0. Similarly, the literal including post requires the final model M_{s_{n+1}}. Such state-dependent predicates, including pre and post, are called stateful. The list of stateful predicates is pre-defined and includes event modifiers, as well as the ι quantifier, which is evaluated under M_{s_0}, since definite determiners are assumed to name objects visible from the start position. In general, a logical expression is traversed depth first and the model is updated every time a stateful predicate is reached. For example, the two e-type you constants in Figure 4 will be evaluated under different models: the one within the pre literal under M_{s_0}, and the one inside the post literal under M_{s_{n+1}}.

Evaluation Given a logical expression l, we can compute the interpretation I_{M_{s_0},T}(l) by recursively mapping each subexpression to an entry in the appropriate model M.

To reflect the changing state of the agent during evaluation, we define the function update(~a, pred). Given an action sequence ~a and a stateful predicate pred, update returns a model M_s, where s is the state under which the literal containing pred should be interpreted, either the initial state or one visited while executing ~a. For example, given the predicate post and the action sequence 〈a_1, ..., a_n〉, update(〈a_1, ..., a_n〉, post) = M_{s_{n+1}}, where s_{n+1} is the state of the agent following action a_n. By convention, we place the event variable as the first argument in literals that include one.

Given a T-type logical expression l and a starting state s_0, we compute its interpretation I_{M_{s_0},T}(l) recursively, following these three base cases:

• If l is a λ operator of type 〈T_1, T_2〉 binding variable v and body b, I_{M_s,T}(l) is a set of pairs from D_{T_1} × D_{T_2}, where D_{T_1}, D_{T_2} ∈ M_s. For each object o ∈ D_{T_1}, we create a pair (o, i), where i is the interpretation I_{M_s,T_2}(b) computed under a variable assignment function extended to map A_{T_2}(v) = o.

• If l is a literal c(c_1, ..., c_n) with n arguments, where c has type P and each c_i has type P_i, I_{M_s,T}(l) is computed by first interpreting the predicate c to the function f = I_{M_s,P}(c). In most cases, I_{M_s,T}(l) = f(I_{M_s,P_1}(c_1), ..., I_{M_s,P_n}(c_n)). However, if c is a stateful predicate, such as pre or post, we instead first retrieve the appropriate new model M_{s′} = update(I_{M_s,P_1}(c_1), c), where c_1 is the event argument and I_{M_s,P_1}(c_1) is its interpretation. Then, the final result is I_{M_s,T}(l) = f(I_{M_{s′},P_1}(c_1), ..., I_{M_{s′},P_n}(c_n)).

• If l is a T-type constant, I_{M_s,T}(l) is given directly by the interpretation function; if l is a variable, its value is given by the assignment function A_T.

The worst case complexity of the process is exponential in the number of bound variables. Although in practice we observed tractable evaluation in the majority of development cases we considered, a more comprehensive and tractable evaluation procedure is an issue that we leave for future work.
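The model-switching logic is the subtle part of this procedure; the following sketch shows only that part, representing literals as nested tuples and leaving out λ terms and variable assignments (the Model.interpret interface is an assumption, not the paper's code).

```python
STATEFUL = {"pre", "post", "iota"}  # predicates evaluated under state-specific models

def evaluate(expr, model, update):
    """Interpret a constant or a literal (pred, arg1, ..., argn). When pred is
    stateful, the event argument (by convention the first one) is interpreted
    first, and update(event, pred) supplies the model for the whole literal."""
    if not isinstance(expr, tuple):       # base case: a constant
        return model.interpret(expr)
    pred, *args = expr
    if pred in STATEFUL:
        event = evaluate(args[0], model, update)
        model = update(event, pred)       # e.g., the final state's model for post
    f = model.interpret(pred)
    return f(*(evaluate(arg, model, update) for arg in args))
```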


Implicit Actions Instructional language rarely specifies every action required for execution; see MacMahon (2007) for a detailed discussion in the maps domain. For example, the sentence in Figure 4 can be said even if the agent is not facing a blue hallway, with the clear implicit request that it should turn to face such a hallway before moving. To allow our agent to perform implicit actions, we extend the domain of ev-type variables by allowing the agent to prefix up to k_I action sequences before each explicit event. For example, in the agent's position in Figure 2 (set C), the set of possible events includes 〈MOVE^I, MOVE^I, RIGHT^I, MOVE〉, which contains two implicit sequences (marked by I).

Resolving Action Ambiguity Logical forms often fail to determine a unique action sequence, due to instruction ambiguity. For example, consider the instruction "go forward" and the agent state as specified in Figure 2 (set C). The instruction, which maps to λa.move(a) ∧ forward(a), evaluates to the set containing 〈MOVE〉, 〈MOVE, MOVE〉 and 〈MOVE, MOVE, MOVE〉, as well as five other sequences that have implicit prefixes followed by explicit MOVE actions. To resolve such ambiguity, we prefer shorter sequences without implicit actions. In the example above, we will select 〈MOVE〉, which includes a single action and no implicit actions.
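This preference can be phrased as a minimum under a two-part sort key; a sketch assuming each action carries a marker identifying it as implicit.

```python
def select_execution(valid_sequences, is_implicit):
    """Among the action sequences satisfying the logical form, prefer the one
    with the fewest implicit actions, breaking ties by overall length."""
    return min(valid_sequences,
               key=lambda seq: (sum(map(is_implicit, seq)), len(seq)))

# For "go forward" in Figure 2 (set C), this picks <MOVE>: zero implicit
# actions and minimal length.
```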

7.2 Joint Inference

We incorporate the execution procedure described above with a linear weighted CCG parser, as described in Section 4, to create a joint model of parsing and execution. Specifically, we execute logical forms in the current state and observe the result of their execution. For example, the word "chair" can be used to refer to different types of objects, including chairs, sofas, and barstools, in the maps domains. Our CCG grammar would include a lexical item for each meaning, but execution might fail depending on the presence of objects in the world, influencing the final parse output. Similarly, allowing implicit actions provides robustness when resolving these and other ambiguities. For example, an instruction with the precondition phrase "from the chair" might require additional actions to reach the position with the named object.

To allow such joint reasoning, we define an execution e to include a parse tree e_y and trace e_~a, and define our feature function to be Φ(x_i, s_i, e), where x_i is an instruction and s_i is the start state. This approach allows joint dependencies: the state of the world influences how the agent interprets words, phrases and even complete sentences, while language understanding determines actions.

Finally, to execute sequences of instructions, we execute each starting from the end state of the previous one, using a beam of size k_s.

8 Learning

Figure 5 presents the complete learning algorithm. Our approach is online, considering each example in turn and performing two steps: expanding the lexicon and updating parameters. The algorithm assumes access to a training set {(x_i, s_i, V_i) : i = 1...n}, where each example includes an instruction x_i, a starting state s_i and a validation function V_i, as defined in Section 2. In addition, the algorithm takes a seed lexicon Λ_0. The output is a joint model that includes a lexicon Λ and parameters θ.

Coarse Lexical Generation To generate potential lexical entries we use the function GENLEX(x, s, V; Λ, θ), where x is an instruction, s is a state and V is a validation function. Λ is the current lexicon and θ is a parameter vector. In GENLEX we use coarse logical constants, as described below, to efficiently prune the set of potential lexical entries. This set is then pruned further using more precise inference in Step 1.

To compute GENLEX, we initially generate a large set of lexical entries and then prune most of them. The full set is generated by taking the cross product of a set of templates, computed by factoring out all templates in the seed lexicon Λ_0, and all logical constants. For example, if Λ_0 has a lexical item with the category AP/NP : λx.λa.to(a, x), we would create entries w ⊢ AP/NP : λx.λa.p(a, x) for every phrase w in x and all constants p with the same type as to.⁵

⁵ Generalizing previous work (Kwiatkowski et al., 2011), we allow templates that abstract subsets of the constants in a lexical item. For example, the seed entry facing ⊢ AP/NP : λx.λa.pre(a, front(you, x)) would create 7 templates.

In our development work, this approach often generated nearly 100k entries per sentence. To ease the cost of parsing at this scale, we developed a coarse-to-fine two-pass parsing approach that limits the number of new entries considered.

The algorithm first parses with coarse lexical entries that abstract the identities of the logical constants in their logical forms, thereby greatly reducing the search space. It then uses the highest scoring coarse parses to constrain the lexical entries for a final, fine parse.

Inputs: Training set {(x_i, s_i, V_i) : i = 1...n}, where x_i is a sentence, s_i is a state and V_i is a validation function, as described in Section 2. Initial lexicon Λ_0. Number of iterations T. Margin γ. Beam size k for lexicon generation.

Definitions: Let an execution e include a parse tree e_y and a trace e_~a. GEN(x, s; Λ) is the set of all possible executions for the instruction x and state s, given the lexicon Λ. LEX(y) is the set of lexical entries used in the parse tree y. Let Φ_i(e) be shorthand for the feature function Φ(x_i, s_i, e) defined in Section 7.2. Define ∆_i(e, e′) = |Φ_i(e) − Φ_i(e′)|_1. GENLEX(x, s, V; λ, θ) takes as input an instruction x, state s, validation function V, lexicon λ and model parameters θ, and returns a set of lexical entries, as defined in Section 8. Finally, for a set of executions E, let MAXV_i(E; θ) be {e | ∀e′ ∈ E, 〈θ, Φ_i(e′)〉 ≤ 〈θ, Φ_i(e)〉 ∧ V_i(e_~a) = 1}, the set of highest scoring valid executions.

Algorithm:
Initialize θ using Λ_0, Λ ← Λ_0
For t = 1...T, i = 1...n:
  Step 1: (Lexical generation)
    a. Set λ_G ← GENLEX(x_i, s_i, V_i; Λ, θ), λ ← Λ ∪ λ_G
    b. Let E be the k highest scoring executions from GEN(x_i, s_i; λ) which use at most one entry from λ_G
    c. Select lexical entries from the highest scoring valid parses: λ_i ← ∪_{e ∈ MAXV_i(E;θ)} LEX(e_y)
    d. Update lexicon: Λ ← Λ ∪ λ_i
  Step 2: (Update parameters)
    a. Set G_i ← MAXV_i(GEN(x_i, s_i; Λ); θ) and B_i ← {e | e ∈ GEN(x_i, s_i; Λ) ∧ V_i(e_~a) ≠ 1}
    b. Construct sets of margin violating good and bad parses:
       R_i ← {g | g ∈ G_i ∧ ∃b ∈ B_i s.t. 〈θ, Φ_i(g) − Φ_i(b)〉 < γ∆_i(g, b)}
       E_i ← {b | b ∈ B_i ∧ ∃g ∈ G_i s.t. 〈θ, Φ_i(g) − Φ_i(b)〉 < γ∆_i(g, b)}
    c. Apply the additive update:
       θ ← θ + (1/|R_i|) Σ_{r ∈ R_i} Φ_i(r) − (1/|E_i|) Σ_{e ∈ E_i} Φ_i(e)
Output: Parameters θ and lexicon Λ

Figure 5: The learning algorithm.

Formally, we construct the coarse lexicon λ_a by replacing all constants of the same type with a single newly created, temporary constant. We then parse to create a set of trees A, such that each y ∈ A:

1. is a parse for sentence x, given the world state s, with the combined lexicon Λ ∪ λ_a,

2. scored higher than e_y by at least a margin of δ_L, where e_y is the tree of e, the highest scoring execution of x at position s under the current model, s.t. V(e_~a) = 1,

3. contains at most one entry from λ_a.

Finally, from each entry l ∈ {l | l ∈ λ_a ∧ l ∈ y ∧ y ∈ A}, we create multiple lexical entries by replacing all temporary constants with all possible appropriately typed constants from the original set. GENLEX returns all these lexical entries, which will be used to form our final fine-level analysis.
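A sketch of the coarse abstraction and refinement steps around the coarse parse, with lexical entries as strings and at most one placeholder occurrence per type in an entry (both simplifying assumptions):

```python
from itertools import product

def coarsen(entry, constants_by_type):
    """Abstract an entry: collapse every logical constant to a per-type
    placeholder, so coarse parsing explores a much smaller lexicon."""
    for t, consts in constants_by_type.items():
        for c in consts:
            entry = entry.replace(c, f"<{t}>")
    return entry

def refine(coarse_entry, constants_by_type):
    """Expand a surviving coarse entry back into all typed instantiations,
    one per choice of constants for its placeholders."""
    slots = [t for t in constants_by_type if f"<{t}>" in coarse_entry]
    refined = []
    for choice in product(*(constants_by_type[t] for t in slots)):
        e = coarse_entry
        for t, c in zip(slots, choice):
            e = e.replace(f"<{t}>", c)
        refined.append(e)
    return refined
```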

Step 1: Lexical Induction To expand our model's lexicon, we use GENLEX to generate candidate lexical entries and then further refine this set by parsing with the current model. Step 1(a) in Figure 5 uses GENLEX to create a temporary set of potential lexical entries λ_G. Steps (b-d) select a small subset of these lexical entries to add to the current lexicon Λ: we find the k-best executions under the model which use at most one entry from λ_G, find the entries used in the best valid executions, and add them to the current lexicon.

Step 2: Parameter Update We use a variant of a loss-driven perceptron (Singh-Miller & Collins, 2007; Artzi & Zettlemoyer, 2011) for parameter updates. However, instead of taking advantage of a loss function, we use a validation signal. In step (a) we collect the highest scoring valid parses and all invalid parses. Then, in step (b) we construct the set R_i of valid analyses and E_i of invalid ones, such that their model scores are not separated by a margin δ scaled by the number of wrong features (Taskar et al., 2003). Finally, step (c) applies the update.
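Step 2 in executable form, a sketch with feature vectors as sparse dicts; the names and the dict representation are assumptions, but the margin test and the averaged additive update follow Figure 5.

```python
def perceptron_update(theta, executions, validate, featurize, gamma):
    """Validation-driven variant of the loss-sensitive perceptron (Figure 5,
    Step 2): shift weight toward the highest-scoring valid executions and
    away from invalid executions that violate the scaled margin."""
    def score(phi):
        return sum(theta.get(k, 0.0) * v for k, v in phi.items())

    def dist(p, q):  # Delta_i: L1 distance between sparse feature vectors
        return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

    feats = [(featurize(e), validate(e)) for e in executions]
    valid = [phi for phi, ok in feats if ok]
    invalid = [phi for phi, ok in feats if not ok]
    if not valid or not invalid:
        return theta
    top = max(score(phi) for phi in valid)
    good = [phi for phi in valid if score(phi) == top]        # MAXV_i
    R = [g for g in good
         if any(score(g) - score(b) < gamma * dist(g, b) for b in invalid)]
    E = [b for b in invalid
         if any(score(g) - score(b) < gamma * dist(g, b) for g in good)]
    if R and E:
        updates = [(r, 1.0 / len(R)) for r in R] + [(e, -1.0 / len(E)) for e in E]
        for phi, w in updates:
            for k, v in phi.items():
                theta[k] = theta.get(k, 0.0) + w * v
    return theta
```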

Discussion The algorithm uses the validation signal to drive both lexical induction and parameter updates. Unlike previous work (Zettlemoyer & Collins, 2005, 2007; Artzi & Zettlemoyer, 2011), we have no access to a set of logical constants, either through the labeled logical form or the weak supervision signal, to guide the GENLEX procedure. Therefore, to avoid over-generating lexical entries, thereby making parsing and learning intractable, we leverage typing for coarse parsing to prune the generated set. By allowing a single new entry per parse, we create a conservative, cascading effect, whereby each introduced lexical entry opens the way for many other sentences to be parsed and to introduce new lexical entries. Furthermore, grounded features improve parse selection, thereby generating higher quality lexical entries.

                                                  Oracle   SAIL
# of instruction sequences                           501    706
# of instruction sequences with implicit actions     431      –
Total # of sentences                                2679   3233
Avg. sentences per sequence                         5.35   4.61
Avg. tokens per sentence                             7.5   7.94
Vocabulary size                                      373    522

Table 1: Corpora statistics (lower-cased data).

9 Experimental Setup

Data For evaluation, we use the navigation task from MacMahon et al. (2006), which includes three environments and the SAIL corpus of instructions and follower traces. Chen and Mooney (2011) segmented the data, aligned traces to instructions, and merged traces created by different subjects. The corpus includes raw sentences, without any form of linguistic annotation. The original collection process (MacMahon et al., 2006) created many uninterpretable instructions and incorrect traces. To focus on the learning and interpretation tasks, we also created a new dataset that includes only accurate instructions labeled with a single, correct execution trace. From this oracle corpus, we randomly sampled 164 instruction sequences (816 sentences) for evaluation, leaving 337 (1863 sentences) for training. This simple effort will allow us to measure the effects of noise on the learning approach and provides a resource for building more accurate algorithms. Table 1 compares the two sets.

Features and Parser Following Zettlemoyer and Collins (2005), we use a CKY parser with a beam of k. To boost recall, we adopt a two-pass strategy, which allows for word skipping if the initial parse fails. We use features that indicate usage of lexical entries, templates, lexemes and type-raising rules, as described in Section 6.3, and repetitions in logical coordinations. Finally, during joint parsing, we consider only parses executable at s_i as complete.

Seed Lexicon To construct our seed lexicon we labeled 12 instruction sequences with 141 lexical entries. The sequences were randomly selected from the training set, so as to include two sequences for each participant in the original experiment. Figures 3 and 4 include a sample of our seed lexicon.

                          Single Sentence     Sequence
Final state validation
  Complete system           81.98 (2.33)   59.32 (6.66)
  No implicit actions        77.7 (3.7)    38.46 (1.12)
  No joint execution        73.27 (3.98)   31.51 (6.66)
Trace validation
  Complete system           82.74 (2.53)   58.95 (6.88)
  No implicit actions       77.64 (3.46)   38.34 (6.23)
  No joint execution        72.85 (4.73)   30.89 (6.08)

Table 2: Cross-validation development accuracy and standard deviation on the oracle corpus.

Initialization and Parameters We set the weight of each template indicator feature to the number of times it is used in the seed lexicon and each repetition feature to -10. Learning parameters were tuned using cross-validation on the training set: the margin δ is set to 1, the GENLEX margin δ_L is set to 2, we use 6 iterations (8 for experiments on SAIL) and take the 250 top parses during lexical generation (Step 1, Figure 5). For parameter update (Step 2, Figure 5) we use a parser with a beam of 100. GENLEX generates lexical entries for token sequences up to length 4. k_s, the instruction sequence execution beam, is set to 10. Finally, k_I is set to 2, allowing up to two implicit action sequences per explicit one.

Evaluation Metrics To evaluate single instructions x, we compare the agent's end state to a labeled state s′, as described in Section 2. We use a similar method to evaluate the execution of instruction sequences ~x, but disregard the orientation, since end goals in MacMahon et al. (2006) are defined without orientation. When evaluating logical forms, we measure exact match accuracy.
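The metrics themselves reduce to simple state comparisons; a sketch reusing the State triple from the earlier domain sketch (the attribute names are assumptions):

```python
def single_instruction_correct(end_state, gold_state):
    """Single-instruction metric: the resulting state must match exactly."""
    return end_state == gold_state

def sequence_correct(end_state, gold_state):
    """Instruction-sequence metric: compare position only, disregarding
    orientation, since end goals in the task are defined without one."""
    return (end_state.x, end_state.y) == (gold_state.x, gold_state.y)
```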

10 Results

We repeated each experiment five times, shuffling the training set between runs. For the development cross-validation runs, we also shuffled the folds. As our learning approach is online, this allows us to account for performance variations arising from training set ordering. We report mean accuracy and standard deviation across all runs (and all folds).


                          Single Sentence     Sequence
Chen and Mooney (2011)       54.4            16.18
Chen (2012)                  57.28           19.18
  + additional data          57.62           20.64
Kim and Mooney (2012)        57.22           20.17
Trace validation             65.28 (5.09)    31.93 (3.26)
Final state validation       64.25 (5.12)    30.9 (2.16)

Table 3: Cross-validation accuracy and standard deviation for the SAIL corpus.

Table 2 shows accuracy for 5-fold cross-validation on the oracle training data. We first varied the validation signal by providing the complete action sequence or the final state only, as described in Section 2. Although the final state signal is weaker, the results are similar. The relatively large difference between single sentence and sequence performance is due to (1) cascading errors in the more difficult task of sequential execution, and (2) corpus repetitions, where simple sentences are common (e.g., "turn left"). Next, we disabled the system's ability to introduce implicit actions, which was especially harmful to the full sequence performance. Finally, ablating the joint execution decreases performance, showing the benefit of the joint model.

Table 3 lists cross-validation results on the SAIL corpus. To compare to previous work (Chen & Mooney, 2011), we report cross-validation results over the three maps. The approach was able to correctly execute 60% more sequences than the previous state of the art (Kim & Mooney, 2012). We also outperform the results of Chen (2012), which used 30% more training data.⁶ Using the weaker validation signal creates a marginal decrease in performance. However, we still outperform all previous work, despite using weaker supervision. Interestingly, these increases were achieved with a relatively simple executor, while previous work used MARCO (MacMahon et al., 2006), which supports sophisticated recovery strategies.

Finally, we evaluate our approach on the held-out test set for the oracle corpus (Table 4). In contrast to experiments on the Chen and Mooney (2011) corpus, we use a held-out set for evaluation. Due to this discrepancy, all development was done on the training set only. The increase in accuracy over learning with the original corpus demonstrates the significant impact of noise on our performance. In addition to execution results, we also report exact match logical form (LF) accuracy results. For this purpose, we annotated 18 instruction sequences (105 sentences) with logical forms. The gap between execution and LF accuracy can be attributed to the complexity of the linguistic representation and redundancy in instructions. These results provide a new baseline for studying learning from cleaner supervision.

⁶ This additional training data isn't publicly available.

Validation    Single Sentence    Sequence        LF
Final state    77.6 (1.14)      54.63 (3.5)    44 (6.12)
Trace         78.63 (0.84)      58.05 (3.12)   51.05 (1.14)

Table 4: Oracle corpus test accuracy and standard deviation results.

11 Discussion

We showed how to do grounded learning of a CCG semantic parser that includes a joint model of meaning and context for executing natural language instructions. The joint nature allows situated cues to directly influence parsing and also enables algorithms that learn while executing instructions.

This style of algorithm, especially when using the weaker end state validation, is closely related to reinforcement learning approaches (Branavan et al., 2009, 2010). However, we differ on optimization and objective function, where we aim for minimal loss. We expect many RL techniques to be useful to scale to more complex environments, including sampling actions and using an exploration strategy.

We also designed a semantic representation to closely match the linguistic structure of instructional language, combining ideas from many semantic theories, including, for example, Neo-Davidsonian events (Parsons, 1990). This approach allowed us to learn a compact and executable grammar that generalized well. We expect, in future work, that such modeling can be reused for more general language.

Acknowledgments

The research was supported in part by DARPA under the DEFT program through the AFRL (FA8750-13-2-0019) and the CSSG (N11AP20020), the ARO (W911NF-12-1-0197), and the NSF (IIS-1115966). The authors thank Tom Kwiatkowski, Nicholas FitzGerald and Alan Ritter for helpful discussions, David Chen for providing the evaluation corpus, and the anonymous reviewers for helpful comments.


References

Artzi, Y., & Zettlemoyer, L. (2011). Bootstrapping Semantic Parsers from Conversations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Barwise, J., & Cooper, R. (1981). Generalized Quantifiers and Natural Language. Linguistics and Philosophy, 4(2), 159–219.

Branavan, S., Chen, H., Zettlemoyer, L., & Barzilay, R. (2009). Reinforcement learning for mapping instructions to actions. In Proceedings of the Joint Conference of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing.

Branavan, S., Zettlemoyer, L., & Barzilay, R. (2010). Reading between the lines: learning to map high-level instructions to commands. In Proceedings of the Conference of the Association for Computational Linguistics.

Bugmann, G., Klein, E., Lauria, S., & Kyriacou, T. (2004). Corpus-based robotics: A route instruction example. In Proceedings of Intelligent Autonomous Systems.

Carpenter, B. (1997). Type-Logical Semantics. The MIT Press.

Chen, D. L. (2012). Fast Online Lexicon Learning for Grounded Language Acquisition. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Chen, D., Kim, J., & Mooney, R. (2010). Training a multilingual sportscaster: using perceptual context to learn language. Journal of Artificial Intelligence Research, 37(1), 397–436.

Chen, D., & Mooney, R. (2011). Learning to Interpret Natural Language Navigation Instructions from Observations. In Proceedings of the National Conference on Artificial Intelligence.

Clark, S., & Curran, J. (2007). Wide-coverage efficient statistical parsing with CCG and log-linear models. Computational Linguistics, 33(4), 493–552.

Clarke, J., Goldwasser, D., Chang, M., & Roth, D. (2010). Driving Semantic Parsing from the World's Response. In Proceedings of the Conference on Computational Natural Language Learning.

Collins, M. (2004). Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods. In New Developments in Parsing Technology.

Davidson, D. (1967). The logical form of action sentences. Essays on Actions and Events, 105–148.

Di Eugenio, B., & White, M. (1992). On the Interpretation of Natural Language Instructions. In Proceedings of the Conference of the Association of Computational Linguistics.

Dzifcak, J., Scheutz, M., Baral, C., & Schermerhorn, P. (2009). What to Do and How to Do It: Translating Natural Language Directives Into Temporal and Dynamic Logic Representation for Goal Management and Action Execution. In Proceedings of the IEEE International Conference on Robotics and Automation.

Goldwasser, D., Reichart, R., Clarke, J., & Roth, D. (2011). Confidence Driven Unsupervised Semantic Parsing. In Proceedings of the Association of Computational Linguistics.

Goldwasser, D., & Roth, D. (2011). Learning from Natural Instructions. In Proceedings of the International Joint Conference on Artificial Intelligence.

Heim, I., & Kratzer, A. (1998). Semantics in Generative Grammar. Blackwell Oxford.

Kate, R., & Mooney, R. (2006). Using String-Kernels for Learning Semantic Parsers. In Proceedings of the Conference of the Association for Computational Linguistics.

Kim, J., & Mooney, R. J. (2012). Unsupervised PCFG Induction for Grounded Language Learning with Highly Ambiguous Supervision. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Kollar, T., Tellex, S., Roy, D., & Roy, N. (2010). Toward Understanding Natural Language Directions. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction.

Krishnamurthy, J., & Mitchell, T. (2012). Weakly Supervised Training of Semantic Parsers. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Kwiatkowski, T., Goldwater, S., Zettlemoyer, L., & Steedman, M. (2012). A probabilistic model of syntactic and semantic acquisition from child-directed utterances and their meanings. In Proceedings of the Conference of the European Chapter of the Association of Computational Linguistics.

Kwiatkowski, T., Zettlemoyer, L., Goldwater, S., & Steedman, M. (2010). Inducing probabilistic CCG grammars from logical form with higher-order unification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Kwiatkowski, T., Zettlemoyer, L., Goldwater, S., & Steedman, M. (2011). Lexical Generalization in CCG Grammar Induction for Semantic Parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the International Conference on Machine Learning.

Lewis, D. (1970). General Semantics. Synthese, 22(1), 18–67.

Liang, P., Jordan, M., & Klein, D. (2009). Learning semantic correspondences with less supervision. In Proceedings of the Joint Conference of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing.

Liang, P., Jordan, M., & Klein, D. (2011). Learning Dependency-Based Compositional Semantics. In Proceedings of the Conference of the Association for Computational Linguistics.

MacMahon, M. (2007). Following Natural Language Route Instructions. Ph.D. thesis, University of Texas at Austin.

MacMahon, M., Stankiewicz, B., & Kuipers, B. (2006). Walk the Talk: Connecting Language, Knowledge, and Action in Route Instructions. In Proceedings of the National Conference on Artificial Intelligence.

Matuszek, C., FitzGerald, N., Zettlemoyer, L., Bo, L., & Fox, D. (2012). A Joint Model of Language and Perception for Grounded Attribute Learning. In Proceedings of the International Conference on Machine Learning.

Matuszek, C., Fox, D., & Koscher, K. (2010). Following directions using statistical machine translation. In Proceedings of the International Conference on Human-Robot Interaction.

Matuszek, C., Herbst, E., Zettlemoyer, L., & Fox, D. (2012). Learning to Parse Natural Language Commands to a Robot Control System. In Proceedings of the International Symposium on Experimental Robotics.

Montague, R. (1973). The Proper Treatment of Quantification in Ordinary English. Approaches to Natural Language, 49, 221–242.

Muresan, S. (2011). Learning for Deep Language Understanding. In Proceedings of the International Joint Conference on Artificial Intelligence.

Parsons, T. (1990). Events in the Semantics of English. The MIT Press.

Singh-Miller, N., & Collins, M. (2007). Trigger-based language modeling using a loss-sensitive perceptron algorithm. In IEEE International Conference on Acoustics, Speech and Signal Processing.

Steedman, M. (1996). Surface Structure and Interpretation. The MIT Press.

Steedman, M. (2000). The Syntactic Process. The MIT Press.

Steedman, M. (2011). Taking Scope. The MIT Press.

Taskar, B., Guestrin, C., & Koller, D. (2003). Max-Margin Markov Networks. In Proceedings of the Conference on Neural Information Processing Systems.

Taskar, B., Klein, D., Collins, M., Koller, D., & Manning, C. (2004). Max-Margin Parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Tellex, S., Kollar, T., Dickerson, S., Walter, M., Banerjee, A., Teller, S., & Roy, N. (2011). Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation. In Proceedings of the National Conference on Artificial Intelligence.

Vogel, A., & Jurafsky, D. (2010). Learning to follow navigational directions. In Proceedings of the Conference of the Association for Computational Linguistics.

Webber, B., Badler, N., Di Eugenio, B., Geib, C., Levison, L., & Moore, M. (1995). Instructions, Intentions and Expectations. Artificial Intelligence, 73(1), 253–269.

Wei, Y., Brunskill, E., Kollar, T., & Roy, N. (2009). Where To Go: Interpreting Natural Directions Using Global Inference. In Proceedings of the IEEE International Conference on Robotics and Automation.

Winograd, T. (1972). Understanding Natural Language. Cognitive Psychology, 3(1), 1–191.

Wong, Y., & Mooney, R. (2007). Learning Synchronous Grammars for Semantic Parsing with Lambda Calculus. In Proceedings of the Conference of the Association for Computational Linguistics.

Zettlemoyer, L., & Collins, M. (2005). Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Zettlemoyer, L., & Collins, M. (2007). Online learning of relaxed CCG grammars for parsing to logical form. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.
