
Journal of Artificial Intelligence Research 25 (2006) pages 1-42 Submitted 3/06; published ?/06

Functional Combinatory Categorial Grammar

Daniel McMichael [email protected]

Simon Williams [email protected]

Geoff Jarrad [email protected]

CSIRO ICT Centre

Locked Bag 2,

Glen Osmond,

South Australia 5064

Abstract

Functional Combinatory Categorial Grammar (FCCG) advances the field of combinatory categorial grammars by enabling semantic dependencies to be determined directly from the syntactic derivation under the action of a small set of extraction rules. Predicates are extracted composably and can be used to apply semantic constraints during parsing. The approach is an alternative to that of classical CCG, which requires (i) mapping from categories to lambda expressions, (ii) a set of semantic transformation rules for unary combination, and (iii) an explicit β-reduction stage. GFCCG, a generalised form of the grammar, has previously been applied to situation assessment (McMichael, Jarrad, & Williams, 2006).

In FCCG, combinators are largely distinguished by their semantic purpose. Unary combination is only used for preterminal-terminal transitions. Replacing unary type-raising and type-changing by their binary counterparts R and P tends to reduce parse ambiguity. Four other binary combinators are introduced to model various semantic phenomena: functional composition (F), modification (M), apposition (A) and copular modification (Q). Of the combinators of classical CCG, only binary coordination (&) is retained.

The category is the natural feature structure for CCG, and we show how it may be extended to host semantic and parsing-related features compactly. The grammar is demonstrated by extracting an FCCG-annotated corpus from the Penn treebank using the combinator calculator described in (Foreman & McMichael, 2004). Only fifty categories and 140 productions cover 99.7% of the extracted corpus, a substantially more efficient representation than previous conversions.

We adopt a factored conditional statistical model, and provide an efficient iterative learning algorithm involving direct feedback of parsing errors that does not require enumeration of the parse forest. While the best parser performances on data derived from the Penn treebank have come from fully lexicalized parsers, we have been motivated by the need to provide good coverage outside that relatively narrow training domain. We have therefore used a semilexicalised feature model, which does not, for example, contain bilexicalised features. Spurious ambiguity is controlled via probabilistic scoring coupled with agenda-based A* parsing. The parser uses an auxiliary queue prioritized by multi-tag probabilities for terminal node introduction. The parser's syntactic dependency F-score of 79.2% compares well with 84.6%, the best labelled syntactic dependency F-score for a fully lexicalised CCG parser tested on data consistent with its training domain (Clark & Curran, 2004b). The parser obtained a semantic extraction F-score of 83.1%.

© 2006 AI Access Foundation. All rights reserved.


1. Introduction

Combinatory Categorial Grammar (CCG) elegantly represents both syntactic and semantic aspects of the derivation (Ades & Steedman, 1982), (Steedman, 1990), (Park, 1992). Semantically, the combination of constituents becomes a process of β-reduction. Predicates are introduced as semantic attachments to lexical categories and closed class words (Bos, Clark, Steedman, Curran, & Hockenmaier, 2004). While it provides a comparatively compact integration of syntax and semantics, it is possible to go a stage further and eliminate semantic attachments completely, deriving a quasi-logical form [QLF] (Alshawi & van Eijck, 1989) entirely from the categories and combinators. This is the aim of Functional Combinatory Categorial Grammar, and it motivates a collection of new combinators, each with a different semantic purpose. They allow the quasi-logical form to be calculated from the derivation using a small set of rules by a process of primary semantic extraction.

The raw QLF represents the logical semantics associated with the syntactic derivation. The interpretation of words in logical terms we include within the scope of secondary semantics, which comprises quantification and such phenomena as modal reasoning and anaphora. In our implementation, we use a procedural approach rather than explicit β-reduction because it enables efficient application of semantic constraints during parsing.

Modern statistical approaches, such as maximum entropy and max-margin parsing, require computationally intensive optimizations. In Section 8, we present an iterative error feedback algorithm that efficiently optimizes the probabilistic model at much lower computational cost. We use an agenda-based A* parser, in which the leaf nodes are introduced from an auxiliary queue, giving a substantial gain in efficiency over the block enumerative approach of Clark and Curran (Clark & Curran, 2004b). The parser is semilexicalised, in that it avoids direct use of features involving open class words extracted from the training corpus. Semilexicalised parsers suffer a significant drop in performance when compared to lexicalized parsers tested on sentences similar to the training set (Klein & Manning, 2003b). Our semilexicalised system has a labelled syntactic dependency F-score of 79.2% and a labelled semantic F-score of 83.1%, which compare well with the best comparable syntactic dependency score for a lexicalized CCG parser (84.6%) (Clark & Curran, 2004b) and the best published parser-driven semantic dependency score on Penn treebank sentences of 80.97% (Cahill, Burke, O'Donovan, van Genabith, & Way, 2004).

In the remainder of the introduction, we review recent progress in deep parsing for semantic extraction. We then give a formal definition of the grammar (Section 2), its syntax (Section 3) and semantics (Sections 4 and 5). By way of illustration, we show how various language phenomena are modelled by the grammar (Section 6), and describe the process for converting the Penn treebank (Section 7). The statistical framework, feature set and algorithms for training and parsing, together with our numerical results, are set out in Section 8.

1.1 Deep Parsing

The need for deep grammars capable of representing phrasal, dependency and semantic structure has been met by the creation of formalisms such as CCG, tree-adjoining grammar (Joshi, Levy, & Takahashi, 1975), lexical functional grammar [LFG] (Bresnan, 1982), generalised phrase structure grammar [GPSG] (Gazdar, Klein, Pullum, & Sag, 1985) and head-driven phrase structure grammar [HPSG] (Sag, Wasow, & Bender, 2003).


Of these, CCG offers scalable implementation and appears to be the cleanest approach. In Section 3.2, we show how the dependencies within CCG categories may be viewed as feature structures and thus how constituent combination under CCG is a form of bound unification. This technique can be used to extract syntactic dependencies of the kind described in (Clark, Hockenmaier, & Steedman, 2002). It gives semantically cleaner, more compact, but less detailed derivations than Hockenmaier and Steedman's approach (Hockenmaier & Steedman, 2002b), (Hockenmaier, 2003a).

Conversion of large treebanks of thoroughly annotated diverse corpora has not yet caught up with the development of grammatical formalisms. The Penn treebank (Marcus, Santorini, & Marcinkiewicz, 1993), which has facilitated the development of good surface parsers (Collins, 1999), (Charniak, 2000), provides some useful long range dependency annotation, but it is not comprehensive. It provides no semantic annotation and its markup of compound nouns is vestigial. The Redwoods treebank (Oepen, Toutanova, Shieber, Manning, Flickinger, & Brants, 2002) provides accurate dependencies and semantics, but is currently of small size. In the interim, before large modern treebanks become available or unsupervised techniques reach maturity, workers seeking large treebanks marked up with modern grammars have done their best to extract such annotation from the Penn treebank; attempts have been made for CCG (Hockenmaier, 2003a), LFG (Burke, Cahill, O'Donovan, van Genabith, & Way, 2004), and Lexical Tree Adjoining Grammar [LTAG] (Shen & Joshi, 2005).

The minimal semantics we derive deterministically and automatically under FCCG from syntactic derivations is inspired by considerations of the kind that motivated the creation of Minimal Recursion Semantics, the semantic counterpart to HPSG (Copestake, Flickinger, Sag, & Pollard, 1999), (Copestake, 2003), (Ritchie, 2004). Discourse Representation Theory has been applied in similar vein as a post-processor to explicit β-reduction for CCG (Bos, 2005).

The opportunity created by the availability of robust, accurate deep parsers is likely to be immense – but commercial and experimental systems offer nowhere near the levels of performance required for reliable logical reasoning. Molla and Hutchinson's 2003 study of the Conexor parser's dependency extraction performance using a 500 sentence corpus (Carroll, Minnen, & Briscoe, 1999) showed that average precision and recall of subject-verb dependencies was a modest 73.6% and 64.5% respectively. Average precision and recall across a variety of dependencies was 74.6% and 59.7% (Molla & Hutchinson, 2003).

In recent years, several studies on robust semantic extraction have been motivated by attempts to achieve robustness via hybrid schemes. Examples include the combination of constituency and dependency structure (Abney, 1995), cascaded finite state and probabilistic parsing (Briscoe & Carroll, 2002), and use of the Collins surface parser (Collins, 1999) to constrain the extraction of deep dependencies (Swift, Allen, & Gildea, 2004), (Schneider, Dowdall, & Rinaldi, 2004).

Toutanova et al.'s experiments with the Redwoods treebank on the fusion of constituency and dependency parsers indicated that of the two, constituency parsers performed better, and that combining them yielded only a small improvement (Toutanova, Manning, Flickinger, & Oepen, 2005b).


Experiments on the extraction of labelled semantic dependencies from complex sentences yield F-scores of about 80%; for example 79.6% on the PARC 700 dependency bank (Kaplan, Riezler, King, Maxwell, Vasserman, & Crouch, 2004) and 80.97% for f-structures extracted from section 23 of the WSJ portion of the Penn treebank (Cahill et al., 2004).

An alternative approach to extracting dependencies is semantic role labelling: the association of semantic roles with extracted constituents. The development of the Propbank data set has led to an emphasis on verb frame role labelling at the expense of compound nominal semantics and clause composition (Kingsbury, Palmer, & Marcus, 2002), (Kingsbury & Palmer, 2003). Gildea and Hockenmaier's experiments with CCG-based dependency extraction achieved a 66.8% head word labelled dependency F-score versus the Propbank gold standard (Gildea & Hockenmaier, 2003). Miyao et al. achieved a core argument labelling F-score of 74.2% by direct processing of the output of an HPSG parser (Miyao & Tsujii, 2004), while Pradhan et al. achieved an F-score of 78.8% from dependencies extracted from the Charniak parser (Pradhan, Ward, Hacioglu, Martin, & Jurafsky, 2005), (Charniak, 2000). Recently, role labelling has been attacked as an inference problem in its own right. Joint association of constituents with roles has been achieved with an F-score of 92.1% (Toutanova, Haghighi, & Manning, 2005a). Selecting and classifying constituents estimated from several parsers also gives significant improvement over single-parser systems, achieving an F-score of 85.2% (Pradhan et al., 2005). This work suggests that the application of semantic constraints can produce significant improvement in dependency extraction.

1.2 Semilexicalisation: An Approach to Robust Parsing

The success of lexicalized parsers trained and evaluated using resources such as the Penn treebank has tended to overshadow robust techniques which avoid features that depend on open class words extracted from the training corpus. Klein and Manning's study (Klein & Manning, 2003b) indicates that very good performance can be achieved against standard Penn treebank test sentences (86.36% bracketing F-score). Such a design could be expected to perform better outside the domain of financial journalism than fully lexicalized parsers. Relevant information from open class words can be administered cleanly via token features derived from preprocessing to provide markup of compounds such as numbers, times and proper names, much to the benefit of semantic dependency extraction (Grover, Lascarides, & Lapata, 2005). Such an approach could be used to adapt an unlexicalized parser to the peculiarities of specific domains, and inspires our use of such features.

An alternative method of domain tuning would be to extend the training corpus. Such an annotation process can be made orders of magnitude faster by using a parser trained on the original corpus to estimate reference parses for the new material, which then only have to be corrected (Clark, Steedman, & Curran, 2004).

2. Functional CCG

Combinatory categorial grammar, developed by Steedman and others, is a generalization of categorial grammar (Steedman, 1996), (Steedman, 2000). In CCG, categories are assigned to tokens, and the category of each multi-token constituent is derived from those of its children (predecessors) under a combinator. Lexical categories correspond to functors in the combinatory logic of Curry and Feys (Curry & Feys, 1958). However, in FCCG, the formation of logical functors is deferred until combination.

A Functional Combinatory Categorial Grammar G may be written as the tuple (N, Σ, P, S, V), the elements of which are:

• A finite set N of nonterminal nodes;

• A finite set Σ of terminal nodes disjoint from the nodes in N, where each node n ∈ Σ ∪ N contains a set of parameters Pn = {x, k, S, Π}:

  – x, the state, which consists of a featured category c,
  – k, the combinator that generated the node,
  – S, the semantics, and
  – Π, references to the children (predecessors) of n;

• A finite set P of production rules, where a rule is of the form

      mℓ(V, N) → mr(V),

  where mℓ(V, N) is any member of V containing a node in N, and mr(V) is a member of V;

• A base vocabulary V = {Σ ∪ N}*;

• A start node S ∈ N;

• x, a set of states (x ∈ x);

• K, a set of combinators (k ∈ K), where the non-combination event is assigned to knull;

• fU(·), a function that generates the parameters of derived nodes in unary combination: P′ = {x, fU(P, x)};

• fB(·), a function that generates the parameters of derived nodes in binary combination: P′ = {k, fB(Pn1, Pn2, k)};

• PU, a function assigning a probability P(x | Pn1) to the derived state under unary combination, given the parameters of predecessor node n1; and

• PB, a function assigning a probability P(k | Pn1, Pn2) to the combinator in a binary combination, given the parameters of predecessor nodes n1 and n2.

In our usage, tokens are annotated items of text from which leaf nodes are derived.
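For concreteness, the parameter set Pn = {x, k, S, Π} can be pictured as a small record type. The Python sketch below is our own illustration; the names and types are assumptions, not the authors' implementation.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Node:
        state: str                  # x: the featured category, e.g. "(S0\N1)/N2"
        combinator: Optional[str]   # k: combinator that generated the node (None for knull)
        semantics: List[tuple] = field(default_factory=list)   # S: is-/has-predicates
        children: List['Node'] = field(default_factory=list)   # Π: predecessor references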


3. Syntax

The syntax of FCCG is defined in terms of unary productions for the generation of terminals and binary productions for the derivation of nonterminals. Historically, CCG productions are written constructively, and take the form X [·Y] ⇒k Z, where X, and optionally Y, are the predecessor categories, k is the combinator, and Z is the derived category.

The guiding principles of FCCG are (i) minimize differences from classical CCG, (ii) distinguish combinators primarily by their semantic function, (iii) choose combinators that allow a terse description of the procedure for extracting the quasi-logical form from the derivation, (iv) eliminate unary combinations (to avoid uncontrolled chart growth), and (v) eliminate ternary combinations (for uniformity). These considerations led us to abandon the combinators for unary type-raising (T), ternary coordination (Φ), substitution (S) and composition (B). In their stead we introduced combinators for functional composition (F), modification (M), binary type-raising (R) and binary type-changing (P), together with a binary combinator for apposition (A), and a form of the binary coordination combinator (&). Both crossing and reversal variants are permitted; their application is controlled via the probabilistic model. The unary type-changing combinator U is only permitted in the derivation of leaf nodes from tokens.

3.1 Categories

The set of bare (unfeatured) categories {X} is generated by the following EBNF rules:

X ::= E | C
B ::= E | "(" C ")"
C ::= B ∣ B
∣ ::= / | \

where C is a complex category, E is an elementary category, B is a bracketed category and ∣ is a slash.

A complex category can be written in the form X∣Y, where Y is termed the argument and X the result. The inner category is the left-most elementary subcategory. The core subcategory is the outermost occurrence of a subcategory of the form X∣X; if no such pattern exists, it is the inner category. For example, X is both the core and inner subcategory of the categories ((X\Y)/Y)/Z and (X\X)/(X/X); but the core subcategory of ((Z\Z)/(Z\Z))/Y is (Z\Z)/(Z\Z), while its inner subcategory is Z.
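These definitions are straightforward to operationalise. The following Python sketch is our own illustration (the string representation and helper names are assumptions, not the authors' code); it splits a bare category at its outermost slash and descends through results to reach the inner subcategory.

    def unbracket(cat):
        """Remove redundant outer brackets, e.g. '(X/Y)' -> 'X/Y'."""
        while cat.startswith('(') and cat.endswith(')'):
            depth = 0
            for i, ch in enumerate(cat):
                depth += ch == '('
                depth -= ch == ')'
                if depth == 0 and i < len(cat) - 1:
                    return cat           # outer bracket closes early: not redundant
            cat = cat[1:-1]
        return cat

    def split_outer(cat):
        """Split 'X/Y' or 'X\\Y' at its outermost slash; None if elementary."""
        cat, depth = unbracket(cat), 0
        for i in range(len(cat) - 1, -1, -1):    # the outermost slash is rightmost
            if cat[i] == ')':
                depth += 1
            elif cat[i] == '(':
                depth -= 1
            elif cat[i] in '/\\' and depth == 0:
                return unbracket(cat[:i]), cat[i], unbracket(cat[i + 1:])
        return None

    def inner(cat):
        """The left-most elementary subcategory, reached by descending results."""
        parts = split_outer(cat)
        while parts:
            cat = parts[0]
            parts = split_outer(cat)
        return cat

    assert inner('((X\\Y)/Y)/Z') == 'X'
    assert inner('(X\\X)/(X/X)') == 'X'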

In the grammar we have extracted from the Penn treebank, described in Section 7, the permitted elementary categories are N (nouns and noun phrases), S (sentences and sentential phrases) and P (particles). In FCCG, all features are attached to subcategories within the category. The simplest of these are the index numbers given to elementary categories to differentiate the semantic objects to which they refer. Following (Hockenmaier, 2003b), elementary categories are numbered outwards from the inner subcategory, except where dependencies dictate otherwise. For example, in the verb phrase modifier category (S0\N1)/(S0\N1), both S's refer to the same verb, and both N's to the same noun phrase. However, in the phrasal category (S0\N2)/(S1\N2), S1 refers to a sentential complement of S0 sharing the same subject. The zero marker can be omitted without ambiguity.


This numbering scheme results in the canonical verb categories: S0, S0\N1, (S0\N1)/N2 and ((S0\N1)/N2)/N3, where the indexes are allocated: 0 for the core (S), 1 for the subject, 2 for the object, 3 for the indirect object.

Categories are divided into three mutually exclusive classes (a classification sketch in code follows the list):

• Modificational – if they have the form X∣X (including subcategory numbering), e.g. (S0\N1)\(S0\N1);

• Adjunctive – if they can be written in the form (X∣X)∣Z^n (including subcategory numbering), e.g. (S0\S0)/N1, (S0\S0)/(S1\S1), or ((S0\N1)\(S0\N1))/N2;

• Phrasal – the remainder, which have the form X∣Z^n and cannot be written as modificational or adjunctive categories, e.g. (S0\N1)/N2 or (S0\N1)/S2.
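Under the same string-based assumptions as the previous sketch, this classification can be computed by testing for an X∣X form and then stripping arguments from the right; split_outer is the helper defined above.

    def is_modificational(cat):
        r"""True for X|X categories such as (S0\N1)\(S0\N1)."""
        parts = split_outer(cat)
        return parts is not None and parts[0] == parts[2]

    def category_class(cat):
        """Classify a bare category as modificational, adjunctive or phrasal."""
        if is_modificational(cat):
            return 'modificational'
        parts = split_outer(cat)
        while parts:                         # strip arguments Z1..Zn from the right
            if is_modificational(parts[0]):
                return 'adjunctive'          # an (X|X)|Z^n category
            parts = split_outer(parts[0])
        return 'phrasal'

    assert category_class('(S0\\N1)\\(S0\\N1)') == 'modificational'
    assert category_class('(S0\\S0)/N1') == 'adjunctive'
    assert category_class('(S0\\N1)/N2') == 'phrasal'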

3.2 Featured Categories

Unification categorial grammar (Uszkoreit, 1986), (Calder, Klein, & Zeevat, 1988) sought to unite the functional aspects of categorial grammar with the rich feature structures available from unification grammar. In CCG, features have been applied within categories as attachments to subcategories to code such information as syntactic agreement (Hockenmaier, 2003a), subcategorization (Hockenmaier & Steedman, 2002a) and traces (Steedman, 2000). In Baldridge's multi-modal CCG, features are attached to the slashes to indicate restrictions on combination (Baldridge, 2002).

We have selectively applied these ideas: in FCCG a category is a feature structure, defined by the slashes, which contains elementary categories and all other features associated with a node. Subcategory index numbers determine its reentrancy structure. During combination, the targets of the predecessor categories are unified and annihilated, and the local dependencies and other information they contain are embedded within the derivation.

Figure 1: The adjunctive category (N\N)/(S\N) viewed as a feature structure.

In Figure 1, the bare category (N\N)/(S\N) is shown both as a bracketed feature structure and as a directed acyclic graph. Its structure is determined by the slash positions within the category, which progressively divide successive arguments and results down to the elementary categories which form the leaves. When the same index number appears more than once, for example in (N0\N0)/(S1\N0), the category feature structure becomes reentrant (Figure 2). Figure 3 shows features for the span of the node, its handle, and the semantic term types of each leaf (Section 4).


Figure 2: The feature structure for the adjunctive category (N0\N0)/(S1\N0).

Figure 3: The feature structure for the adjunctive category (N0\N0)/(S1\N0) with the span, handle, semantic term types and links between common objects.
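The reentrancy of Figure 2 can be mimicked with ordinary object sharing. In this illustrative sketch (ours, with assumed feature names), the single N0 object is referenced from three positions, so binding it once updates every occurrence, which is what dependency propagation during combination relies on.

    n0 = {'Ecat': 'N'}                       # the shared N0 object (tag 1 in Figure 2)
    s1 = {'Ecat': 'S'}

    category = {                             # (N0\N0)/(S1\N0) as a feature structure
        'slash': '/',
        'res': {'slash': '\\', 'res': n0, 'arg': n0},   # N0\N0
        'arg': {'slash': '\\', 'res': s1, 'arg': n0},   # S1\N0
    }

    n0['handle'] = 'e1'                      # unify a handle onto one occurrence...
    assert category['arg']['arg']['handle'] == 'e1'     # ...and all occurrences see it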

3.3 Combinators and Syntactic Combination

To enable functional semantic extraction, FCCG requires a richer set of combinators, each with a narrow, semantically defined scope. There are eight combinators grouped into six syntactic types, as set out in the table below.

Root   Name                     Type             Remark
F      Functional Composition   functional       absorbing complements, etc.
R      Binary Type-Raising      functional       absorbing internal complements
P      Binary Type-Changing     functional       combined functional composition and type-raising
M      Modification             modificational   absorbing adjuncts
Q      Copular Modification     copular          modifies copular subcategories
A      Apposition               appositional     parenthetic dependencies
&      Coordination             coordinational   binary coordination
U      Unary Type-Changing      unary            preterminals to terminals


In FCCG, the combinator determines all aspects of derivation, including head feature inheritance and the relationship between the parameters of the predecessor and derived nodes. This requires, for example, that different combinators be used for modification and complementation.

A derivation is determined by its leaf categories and its combinators. In binary combination, the predecessor categories limit which combinator can be applied. Parsing becomes a matter of repeated combinator selection (Section 8.1).

The set of FCCG combinator variants K (k ∈ K) is generated by the following EBNF:

k ::= direction root arity [variant] | U | & | A
direction ::= < | >
root ::= F | M | Q | R | P
arity ::= 0 | 1 | 2 | ...
variant ::= x | r | xr

Example combinators are >Fx and <M1xr. The arity may be omitted when it can be inferred from context. When two categories combine, they are regarded as functors: one is primary and the other secondary. The primary target, a subcategory of the primary category, absorbs, unifies with and annihilates a corresponding target subcategory of the secondary category, leaving the derived category. If the primary category is on the left [right], the combinator direction is forward (>) [backward (<)]. For example, in X/Y · Y ⇒>F X the primary category is X/Y, and in Y · X\Y ⇒<F X the primary category is X\Y; the target in both combinations is Y. Combinations may also be written in functional form; for example >F(X/Y, Y) = X and <F(Y, X\Y) = X. In combinations involving & and A, the target, result and argument of each predecessor category is the category itself.
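For illustration, combinator names of this form can be recognised with a single regular expression. This is our own reconstruction of the EBNF above, not code from the paper; it treats a combined xr suffix as legal, as the example <M1xr requires.

    import re

    COMBINATOR = re.compile(
        r'^(?:(?P<direction>[<>])(?P<root>[FMQRP])(?P<arity>\d*)(?P<variant>x?r?)'
        r'|(?P<other>[U&A]))$')

    def parse_combinator(k):
        """Split a combinator such as '<M1xr' into its EBNF components."""
        m = COMBINATOR.match(k)
        if not m:
            raise ValueError('not a valid FCCG combinator: ' + k)
        return {key: val for key, val in m.groupdict().items() if val}

    print(parse_combinator('>Fx'))    # {'direction': '>', 'root': 'F', 'variant': 'x'}
    print(parse_combinator('<M1xr'))  # {'direction': '<', 'root': 'M', 'arity': '1', 'variant': 'xr'}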

A primary category's outer slash determines the direction from which it canonically absorbs a target. For instance, in functional composition (F) the two outer slash directions are exemplified by the following combinations:

/ → canonical argument to the right, e.g. X/Y · Y ⇒>F X;
\ → canonical argument to the left, e.g. Y · X\Y ⇒<F X.

The category templates associated with the combinators are listed in Table 1. For each combinator type, there are entries for whether it is used in FCCG, templates for the primary, secondary and derived categories, together with permitted variations and the head inheritance rule.

In the templates, subcategories are represented by the symbols X, Y and Z. A small letter (i.e. y) indicates an elementary category. It is required that y ≠ X ≠ Y and that in the definition of Qn, X cannot be written X′∣y for some X′. Slashes are represented by the symbol ∣. The notation X∣Y^n (n ≥ 0) indicates a category of the form ((· · ·(X∣1Y1)∣2 · · · ∣n−1Yn−1)∣nYn, where ∣i has no relationship to ∣j (i ≠ j) and n is the target depth. The arity of the combinator must equal the target depth of the secondary category; for example, functional composition with a target depth of 2 requires F2. Naturally, X∣Y^0 = X. In unary combination, the function f(·) generates a set of derived categories. Heads may be inherited from the primary predecessor (Fn, Rn and &), from the secondary predecessor (Mn, Qn, Pn and A) or from the lone predecessor node (U).


Type   In FCCG   Primary category   Secondary category   Derived category   Variant   Head
Fn     ✓         X∣1Y               Y∣2Z^n               X∣2Z^n             x, r      primary
Rn     ✓         (X∣1y)∣2Z^n        y                    X∣2Z^n             x, r      primary
Pn     ✓         X∣1Y               Y∣2Z^n               Y∣2Z^n             x, r      secondary
Mn     ✓         X∣1X               X∣2Z^n               X∣2Z^n             x, r      secondary
Qn     ✓         y∣1y               (X∣0y)∣2Z^n          (X∣0y)∣2Z^n        x, r      secondary
A      ✓         X                  X                    X                  —         secondary
&      ✓         X                  X                    X                  —         primary
U      ✓         X                  —                    f(X)               —         predecessor
---------------------------------------------------------------------------------------------
Bn     ×         X∣1Y               Y∣2Z^n               X∣2Z^n             —         —
Sn     ×         (X∣1Y)∣2Z^n        Y∣2Z^n               X∣2Z^n             —         —
T      ×         X                  —                    Y∣1(Y∣1X)          —         —

Table 1: Combinator templates. The symbols X and Y stand for general subcategories; y stands for an elementary category; ∣ stands for any valid slash; all carry suffixes to differentiate them. A slash written with an overbar, ∣̄i, has the opposite sense to ∣i. Standard CCG combinators that are not used in FCCG are listed below the horizontal rule.


The directions of the combination slashes (/ and \) are determined according to the combinator variant. The x variant indicates crossing (reversal of ∣2), while the r variant indicates reversal (reversal of ∣1). The use of combinator reversal explicitly breaks Steedman's principles of directional consistency and inheritance (Steedman, 1990). However, doing so substantially reduces the complexity of the grammar. Over-generation in parsing is controlled via the probabilistic model.

Binary combinations are preferred to unary combinations because they have longer lists of features for use in parse discrimination. All of the new combinators, except for Q, are closely related to those of classical CCG. The functional composition and modification combinators (F and M) partition the space occupied by functional application and composition (>, < and B) so that the combinator distinguishes semantic purpose. The binary versions of type-raising (R) and type-changing (P) can be derived by compounding other combinators. The functional type-raising combinator R is formed by the application of unary type-raising T and then F. To show this, we write T(y) = X∣1(X∣1y), leave off the direction indicators and expand the derived template:

X∣2Z^n = Fn(X∣1(X∣1y), (X∣1y)∣2Z^n)
       = Fn(T(y), (X∣1y)∣2Z^n)
       = Rn(y, (X∣1y)∣2Z^n).


The binary type-changing combinator P is derived by the application of the B combi-nator followed by a form of unary type-changing (U) that performs the transformationU(X�2Z

n) = Y�2Zn:

Y�2Zn = U (X�2Z

n)

= U (Bn(X�1Y,Y�2Zn))

= Pn (X�1Y,Y�2Zn) .

4. Primary Semantics

In the classical CCG approach to semantics (Steedman, 2000), (Bos et al., 2004), the lexicon contains, for each word, a set of possible categories and, for each word-category pair, a semantic attachment. For open class words the semantic attachment is dependent on the category only. The semantic role of the combinator is (i) to administer rule-driven changes to logical expressions required by type-changing and type-raising and (ii) to apply the logical expressions of the predecessors to each other to steer the process of β-reduction.

The purpose of primary semantic extraction in FCCG is to extract a quasi-logical form (QLF) with the correct dependency structure. Instead of looking up lexical semantic attachments for each category or closed class word, a small number of simple rules determine the semantic term type of each token from its category. Terms may have one of three types: entity, relation and property. The semantic role of the combinator is to declare (i) terms via is-predicates and (ii) dependencies via has-predicates. The extraction of the logical semantics of closed class words is deferred, together with other semantic phenomena, to secondary semantic extraction, examined in Section 5.

The unary operations of classical CCG, type-raising and type-changing, allow the creation of predicates within the nonterminal nodes of the derivation. In FCCG, the creation of predicates is dispersed throughout the derivation tree, occurring when dependencies become bound. If arguments are not available, they are deferred. This approach couples the creation of logical functors with the binding of their arguments. Logical semantic extraction up to the QLF is purely a function of the FCCG syntax and the semantic extraction rules. The logical form can be obtained by expanding quantifiers (Section 4.4) and assigning their scope (Steedman, 1999).

The term type of an object has a one-to-one correspondence with the elementary category used to represent it. The grammar we extracted from the Penn treebank (Section 7) has the correspondences shown in the table below:

Elementary Category   Syntactic Type   Term Type
N                     noun             entity (e)
S                     sentence         relation (r)
P                     particle         property (p)

The scope of the primary semantics is constrained to the construction of predicates involving these term types.


4.1 Primary predicates

The is- and has-predicates are highly reified forms called primary predicates, which declare terms and dependencies. The is-predicate has the structure:

is[handle, sense, term type],

where handle is the semantic object being declared, sense is its word sense, and term type is the term type of the object determined from the lexical category and whether the token has a coordination sense (Table 2).

Term type                 Core of lexical category   Has coordination sense
entity (e)                Ni                         —
relation (r)              Si                         ×
property (p)              modificational             ×
deferred (d)              Si                         ✓
(no object declaration)   modificational             ✓

Table 2: Table of semantic term type assignment rules.

The has-predicate has the structure:

has[handle, slot, attribute, att type],

where handle is a reference to the semantic object receiving the attribute, attribute is a reference to the attributed semantic object, slot is the slot number of the attribute, and att type is the attribute's type, which can be either argument, property, parenthesis, or member. The attribute type depends on the combinator, as set out by Table 3. In has-predicates derived from functional composition, the slot number is the index number of the primary target. In other forms of combination, the slot number is just the current count of attributes of a given type attached to the same handle. The handle of a constituent is marked as a feature within its featured category. It is constructed from the underlying object's term type abbreviation, subscripted to differentiate object instances (e.g. the lexical category N could generate object references e1, e2 etc.).
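As an illustration, the two predicate forms and the handle convention might be encoded as below. The helper names and tuple encoding are our own assumptions, not the paper's implementation; the example rebuilds the "Daniel ate happily." formula given shortly.

    from itertools import count

    _counters = {'e': count(1), 'r': count(1), 'p': count(1)}

    def new_handle(term_type):
        """Allocate e1, e2, ... / r1, ... / p1, ... for entities, relations, properties."""
        return term_type + str(next(_counters[term_type]))

    def is_pred(handle, sense, term_type):
        return ('is', handle, sense, term_type)            # is[handle, sense, term type]

    def has_pred(handle, slot, attribute, att_type):
        return ('has', handle, slot, attribute, att_type)  # has[handle, slot, attribute, att type]

    # "Daniel ate happily." (see the worked example below) accumulates:
    e1, r1, p1 = new_handle('e'), new_handle('r'), new_handle('p')
    qlf = [is_pred(e1, "Daniel'", 'e'), is_pred(r1, "ate'", 'r'),
           has_pred(r1, 1, e1, 'a'),
           is_pred(p1, "happily'", 'p'), has_pred(r1, 1, p1, 'p')]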

During combination, the laws of unification suffice to ensure the correct marking of features, such as the handle, in the derived category (Figure 4).

The reason for separating the syntactic head from the semantic handle is that their purposes are different: the syntactic head is used to construct parse features, while the handle is the semantic object summarizing the constituent. Differentiation of the two allows, for example, the head of a conjunct to be its right-hand member, while its handle is the conjunct object itself.


Figure 4: A simple FCCG derivation tree showing syntax and semantics.

Attribute type     Combination type   Combinator type
argument (a)       functional         F, R, P
property (p)       modificational     M
property (p)       copular            Q
parenthesis (pa)   appositional       A
member (m)         coordinational     &

Table 3: Table of attribute types.

To fix these ideas we present the trivial example "Daniel ate happily.", which yields the formula:

is(e1, Daniel′, e) ∧ has(r1, 1, e1, a) ∧ is(r1, ate′, r) ∧ is(p1, happily′, p) ∧ has(r1, 1, p1, p),

which can be partially de-reified to make it more intelligible:

is_entity(e1, Daniel′) ∧ is_relation(r1, ate′) ∧ has_first_argument(r1, e1) ∧ is_property(p1, happily′) ∧ has_first_property(r1, p1).

Under the assumption of active SVO ordering, this expression can be further simplified:

is_entity(e1, Daniel′) ∧ ate′(r1) ∧ agent(r1, e1) ∧ is_property(p1, happily′) ∧ property(r1, p1);

finally yielding

ate′(r1) ∧ agent(r1, Daniel′) ∧ happily′(r1),

a form that is more pleasing to the eye, but less useful computationally, than the original.


4.2 Primary Semantic Extraction

The functional approach to primary semantic extraction is characterized by the Principle of Semantic Functionality, which states:

    The quasi-logical form of a derivation is a function of its categories, combinators, and a small set of rules.

It implies that, within a derivation, categories and combinators alone determine both syntax and semantics. Besides motivating FCCG, this idea is the implicit basis of the grammar induction algorithm of Zettlemoyer and Collins (Zettlemoyer & Collins, 2005), which uses semantic markup to induce a CCG.

The primary semantics of a sentence, which map directly onto the QLF (Section 4.4), are realized as a conjunction of is- and has-predicates. New predicates can be conjoined at each combination. The functional semantics of combination are described by the lambda expression (eqn. 1). Let the primary category be P and its target in the secondary category under combinator k be T. The target contains M differently indexed subcategories τ1, . . . , τM. There are M dependencies created; let the mth be from the initially unbound handle xm to the unbound attribute am, at slot sm with attribute type tm. Let the semantic extraction rules bind the free variables to objects as follows: xm → om and am → o′m. Let the combination declare J objects {oj} (j = 1, . . . , J), each with sense σj and type tj. The relationship between a syntactic combination summarised by P∣T(τ1, . . . , τM) and the derived semantics is

P∣T(τ1, . . . , τM) → λx1 λa1 . . . λxM λaM . ⋀_{j=1..J} is(oj, σj, tj) ∧ ⋀_{m=1..M} has(xm, sm, am, tm) @ {om, o′m}_{m=1..M}    (1)
                    = ⋀_{j=1..J} is(oj, σj, tj) ∧ ⋀_{m=1..M} has(om, sm, o′m, tm),

where sm and tm are functions of P, T, k, and m. Following binary combination the derived predicates are inserted within a bracket, which is later used in quantification. To operate this definition as a procedure either during or after parsing requires a set of rules for object and dependency declaration and binding – to which we now turn.

4.3 Semantic Extraction Rules for FCCG

While primary semantic extraction could be implemented via β-reduction after parsing, such an approach would prevent the application of semantic constraints during parsing. We therefore generate the required predicates on-the-fly and, wherever possible, bind their arguments immediately (Section 6).

The core of the method is the extraction of the handle and the argument from the predecessors. The handle Hsem and the attribute Asem in a combination are defined by the equations below:

Hsem(nprim, nsec, hchild(k)) = H(nprim),         if hchild(k) = primary;
                               H(target(nsec)),  if hchild(k) = secondary;

Asem(nprim, nsec, hchild(k)) = {primary, secondary} \ Hsem(nprim, nsec, hchild(k)),

where nprim and nsec are the primary and secondary predecessors; hchild(k) ∈ {primary, secondary} is the head child of the combination; and H(·) is the handle of its argument.
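A direct transcription of these selection equations (our sketch; the node representation and helper names are assumptions) might read:

    def H(node):
        """The handle feature of a node (H(.) above)."""
        return node['handle']

    def target(node):
        """The target subcategory of a node (stored here as a nested dict)."""
        return node['target']

    def h_sem(n_prim, n_sec, head_child):
        return H(n_prim) if head_child == 'primary' else H(target(n_sec))

    def a_sem(n_prim, n_sec, head_child):
        # the attribute is the handle contributed by the non-head predecessor
        return H(target(n_sec)) if head_child == 'primary' else H(n_prim)

    # "birds fly" under <F0: the verb (primary) supplies the handle r1 and the
    # subject's handle e1 becomes the attribute, yielding has(r1, 1, e1, a).
    verb = {'handle': 'r1'}
    noun = {'handle': 'e1', 'target': {'handle': 'e1'}}
    assert h_sem(verb, noun, 'primary') == 'r1'
    assert a_sem(verb, noun, 'primary') == 'e1'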

The rules governing the semantic extraction procedure are as follows:

Unary Combination

1. Declare a semantic object with the predicate is[handle, sense, object type], as required by Table 2.

2. Attach the handle (if there is one) as a feature of the core subcategory;

3. If there is no handle or it is deferred:

   • attach a [conj:di,sense,t] feature to the inner subcategory of the result.¹

Binary Combination

Only in coordinational combination (with &) when a conjunct feature exists:

1. Using the attributes of the conj feature, declare the conjunct with is[ci,sense,c].

2. Declare the handles of both predecessors to be members of the conjunct with predicates has[c,i,m,m], where c is the conjunct referenced in the conjunct feature, i is the member number of the conjuncted object m, and "m" indicates that the predicate refers to conjunction.

3. Attach the featured conjunct of the primary predecessor as the handle on the core of the derived category.

Only in copular combination (with Q):

1. Declare a dependency between the deferred handle di and the handle of the secondary predecessor as an attribute with the predicate has[di,1,Hsem,p].²

2. Attach a copular object feature [cop:Hsem] to the target;

Only in modificational combination (with M):

1. If the primary predecessor has a conj feature, attach a deferred handle to it from the secondary predecessor (to give [conj:di,sense,t,hj]).

Only in functional combination (with F, R or P) or copular combination (Q):

1. If a conj feature is present, declare the deferred property, is[di,sense,p], and the deferred dependency, has[hj,1,di,p].

2. For all non-core elementary categories in the target, except those where a copular object feature is attached: declare has[Asem,i,dj,a], where i is the index number of the elementary category in the secondary target.

1. The handle of the undeclared deferred object is di, where i is its identifying index, sense is the token's word sense, and t is the category class.

2. The deferred binding is di, where i is its identifying index.


Always:

1. For all combinators except &: if Hsem exists, declare an attribute with the predicate has[Hsem,slot,Asem,att type], where att type is defined in Table 3;

2. For each new deferred binding, add the feature [def:di] to the innermost corresponding elementary category within the derived category;

3. When a subcategory with a deferred binding feature becomes bound to an object, substitute its placeholder(s) with the bound object reference; however, if the binding takes place under &, create a separate instance of the original reference and substitute the placeholder with that.

4. When a quantifier is applied as a modifier, delete both its declaration (is-predicate) and its attribution (has-predicate) and replace them with the quantifier, whose argument is the set of predicates attached to the quantified constituent (Section 4.4).

5. Apply the predicates applied to a conjunct to its members [omit for collective reading only].

The procedure can extract both a collective reading via the conjuncts and a distributive reading via expansion. The collective reading can be removed by deleting the conjuncts. The distributive reading is not created if the conjuncts are not expanded.

This procedure also prevents the emergence of the second-order lambda expressions that appear in the treatment of Bos et al. (2004).

4.4 Quantification

Quantifiers occur syntactically as modifiers and semantically as properties. They can be extracted by replacing the relevant property declarations and dependencies with a quantifier. If a quantifier q is applied to entity e, and that entity is argumented to an object o, the substitution is:

(is(pi, q′, p) ∧ is(ej, e′, e) ∧ has(ej, 1, pi, p)) ∧ has(o, k, ej, a)
    → (q′ ej  is(ej, e′, e) ∧ has(o, k, ej, a)).

The bracket in which the quantifier lies absorbs predicates containing the quantified object.
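As a toy illustration of this substitution (our own sketch, reusing the tuple encoding from the Section 4.1 sketch; the bracket that limits the quantifier's scope is elided), the rewrite for a single quantified entity might look like:

    def quantify(qlf, prop_handle, quantifier):
        """Drop is(p, q', p) and has(e, 1, p, p); return (quantifier, e, remainder)."""
        remainder, entity = [], None
        for pred in qlf:
            if pred[:2] == ('is', prop_handle):
                continue                          # delete the property declaration
            if pred[0] == 'has' and pred[3] == prop_handle and pred[4] == 'p':
                entity = pred[1]                  # delete the attribution; remember e_j
                continue
            remainder.append(pred)
        return quantifier, entity, remainder

    qlf = [('is', 'p1', "every'", 'p'), ('is', 'e1', "girl'", 'e'),
           ('has', 'e1', 1, 'p1', 'p'), ('has', 'r1', 1, 'e1', 'a')]
    print(quantify(qlf, 'p1', 'forall'))
    # ('forall', 'e1', [('is', 'e1', "girl'", 'e'), ('has', 'r1', 1, 'e1', 'a')])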


In the sentence "every girl likes a boy", the quantifier substitution rule for nominals yields the QLF:

(is(p1, every′, p) ∧ is(e1, girl′, e) ∧ has(e1, 1, p1, p)) ∧ has(r1, 1, e1, a)
    ∧ (is(r1, likes′, r) ∧ has(r1, 2, e2, a) ∧ (is(p2, a′, p) ∧ is(e2, boy′, e) ∧ has(e2, 1, p2, p)))
= (∀ e1 is(e1, girl′, e) ∧ has(r1, 1, e1, a)) ∧ (is(r1, likes′, r) (∃ e2 has(r1, 2, e2, a) ∧ is(e2, boy′, e)))
= (∀ e1 is(e1, girl′, e) ∧ has(r1, 1, e1, a)) ∧ (∃ e2 is(r1, likes′, r) ∧ has(r1, 2, e2, a) ∧ is(e2, boy′, e));

whereas "every girl likes every boy" becomes

(∀ e1 is(e1, girl′, e) ∧ has(r1, 1, e1, a)) ∧ is(r1, likes′, r) ∧ (∀ e2 is(e2, boy′, e) ∧ has(r1, 2, e2, a)).

Negation of nominal phrases is similarly straightforward; for example, "no girl likes every boy" takes the QLF:

(¬∃ e1 is(e1, girl′, e) ∧ has(r1, 1, e1, a)) ∧ is(r1, likes′, r) ∧ (∀ e2 is(e2, boy′, e) ∧ has(r1, 2, e2, a)).

However, negation of verb phrases can be read either as simple negation or inversion of the verb sense. Under the former reading, "every girl doesn't like every boy" yields the semantics:

(∀ e1 is(e1, girl′, e) ∧ has(r1, 1, e1, a)) ∧ ¬(is(r1, likes′, r) ∧ ∀ e2 is(e2, boy′, e) ∧ has(r1, 2, e2, a)).

Selection of the appropriate quantifier interpretation and scope alternation (Steedman, 1999) lies beyond the scope of primary semantics.

4.5 Metrics

In evaluating FCCG parsing performance, the key metrics of interest are semantic. The spurious ambiguity (Section 7.1) inherent in the CCG formalism suggests that purely syntactic measures may not be useful. For example, despite building a preference for right-branching into the training set, multiple derivations can still generate the same semantics. Invariance to such semantically innocuous differences is provided by the syntactic dependency metric of Clark and Hockenmaier (Clark & Hockenmaier, 2002). However, this metric is still sensitive to semantically irrelevant phenomena, such as whether an adjunct is attached via a simple modifier category or a modifier-modifier. Purely semantic tagging and dependency metrics are required.

In the tagging metric we propose, a point is scored for each correct is-predicate; in the semantic dependency metric, a point is scored for each correct has-predicate. The full metric requires correct word senses, but in this application, we are only concerned with logical semantics, and word senses are disregarded. All variants of the dependency metrics can be provided in both precision and recall forms. The unlabelled form of the dependency metric is independent of the argument type and position. The dependency metric can be restricted to verb predicate-argument dependencies. Results are reported in terms of these metrics for the parsing experiments in Section 8.
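Both metric families reduce to precision and recall over sets of extracted predicates; the following is a minimal sketch (ours, ignoring word senses as described above):

    def prf(gold, test):
        """Precision, recall and F-score of a test predicate set against gold."""
        gold, test = set(gold), set(test)
        correct = len(gold & test)
        p = correct / len(test) if test else 0.0
        r = correct / len(gold) if gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    def tagging_metric(gold_qlf, test_qlf):
        """Score is-predicates only; filter on 'has' instead for the dependency metric."""
        return prf([q for q in gold_qlf if q[0] == 'is'],
                   [q for q in test_qlf if q[0] == 'is'])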

5. Secondary Semantics

Our purpose in functionally deriving a QLF was to infer a primary semantic form to which subsequent semantic detail can easily be added. Secondary processing only involves local modifications of the QLF, and comprises such aspects as complement identification, number, gender, voice, aspect, mood, quantifier scope, word sense disambiguation, frame semantics and the interpretation of text within an ontology.

In our implementation, identification of noun gender and number is accomplished straightforwardly from morphological analysis (Karp, Schabes, Zaidel, & Egedi, 1992) and agreement features. The extraction of tense, voice and aspect involves the application of deterministic rules, such as are to be found in (Quirk, Greenbaum, Leech, & Svartvik, 1985). Identification of voice allows the allocation of role labels to the core arguments of verbal categories.

6. A Functional CCG for English

The main requirement on a deep grammar is that it is able to represent the semantic dependencies within a sentence. Under the principle of parsimony (Jaynes, 2003), the grammar should only represent syntactic detail to the extent required by accurate semantic extraction – and no further.

Steedman's principles of Lexical Head Government and Categorial Uniqueness (Steedman, 2000) propose that when a word is used in a construction, its category should specify both the canonical word order and all varieties of extraction. The exact nature of a construction is left open, but the desirability of parsimonious modelling suggests that their number should be restricted. The design requirement is that the number of productions³ be minimized subject to the grammar being able to represent the required dependencies. We term this the principle of production parsimony.

Guided by this principle, the following sections describe FCCG derivations of a variety of constructions, showing both syntax and primary semantics.

6.1 Functional Composition and Modification

The sentence "I walk home quickly." involves both complementation and modification of the verb, and its FCCG derivation aligns with that of classical CCG (Figure 5). Under the F combinator, the head of the derived node is inherited from the primary category (the verb), while under M, the head is taken from the secondary category, which allows the verb to remain head after modification.

3. In the case of CCG, a production is the quadruple comprising the two predecessor categories, the combinator and the derived category, unfeatured except for index numbering.


Figure 5: A derivation tree involving verb modification.

Modifiers may themselves be modified, as in the sentence "James jumped high enough.". Here enough can be given one of two lexical categories: the simple modifier S\S or the modifier-modifier (S\S)\(S\S). The choice is entirely determined by which approach gives better parsing performance.

Prepositional phrases are regarded as adjuncts rather than complements and are attached as modifiers using adjunctive categories (Section 3.1). In the sentence "Joe went to bed." (Figure 6), to is given the category (S\S)/N1 to allow it to take the prepositional object as a nominal complement and then to modify the verb.

Verb phrase derivations are simplified by the extraction of compound verb sequences in which the final verbal element takes the role of canonical verb, and the previous elements are modifiers. This leads to such analyses as:

Joe   would             happily           have              eaten        bread.
N     (S0\N1)/(S0\N1)   (S0\N1)/(S0\N1)   (S0\N1)/(S0\N1)   (S0\N1)/N2   N

This approach enables auxiliary verbs to be assembled consistently and allows simple coordination structures.

In canonical conditional forms such as "I will be happy if I eat" the lexical category of if is (S0\S1)/S2. The same category is assigned in the reordered form "If I eat, then I will be happy", but the applying combinator is reversed. In the latter form, then takes the category S0/S1; whereas in "I eat then I go", it takes (S0\S1)/S2, and becomes the sentential head. However, in "I eat then go", then is treated as a conjunction, and go becomes the sentential head. For correlative conjunctions, the first element is given a phrasal category (e.g. S0/S1), and the second is treated as a coordinating conjunction.


Figure 6: A derivation tree with a prepositional clause.

Figure 7: A derivation tree involving copular modification.

6.2 Copular Constructions

The copular modification combinator Q is designed to enable functional derivation of copular structures. It allows modification of complements in place. Figure 7 shows the derivation of the sentence "The apple is red.". Similar constructions include "The apple was red." and "The apple looked red.". To support these, Q attributes the nominal modifier (e.g. red or happy) via the verb to which it attaches (e.g. is, was, looked or be). Resolution of the unbound subject is deferred. Normal subject attribution is inhibited by propagating a copular feature. Without Q, significant semantic rewiring of naïve derivations would be required to extract the correct dependencies.


Figure 8: A derivation tree involving apposition.


6.3 Apposition

In appositional constructions, the right-hand predecessor acts parenthetically to the left-hand predecessor; for example in “The butcher, a knave, enjoyed sporting his meat cleaver.” and “Achilles, the hero, attacked.” – the derivation of which is shown in Figure 8. Apposition occurs extensively between sentential clauses; for example, “That was the rub: the safe could not have been locked.”

6.4 Movement

Subordination frequently implies movement of a subject or object of the main clause into a subsidiary clause. In “Frank wanted to go home.” the subject of go is passed to the infinitival clause to go home via to, which takes the category ((S1\N1)\(S1\N1))/(S2\N1). In “Fred sculpted the marble that he bought.”, where the object of both verbs is marble, it is passed to the subordinate clause by giving that the category (N1\N1)/(S2/N1).

When there is no word to support an adjunctive category, the subordinated phrase is attached to the preceding constituent using the binary type-changing combinator P. In “George kissed the girl [that] he loved.” the omission of the conjunction that forces the use of P, which makes girl the head of the attaching phrase (Figure 9).

Figure 9: Use of the binary type changing combinator (P) for subordination without a conjunction.

To allow argument passing into parasitic gaps, Steedman uses the S combinator (Steedman, 1996). We prefer M, for example, in the fragment “articles that I file without reading” (Figure 10). The binary type-raising combinator R allows the subject to be absorbed before the object.

In noun phrases with relative clauses, such as “people that I saw running”, there are two verbs and the noun is the object of one and subject of the other. In such cases, we assign the burden of managing the argument matching to the perceptive verb (saw), giving it the category ((S0\N1)/N2)/(S3\N2).

6.5 Coordination

FCCG adopts a variant of the established binary coordination combinator &, in which the coordinating conjunction is given a proper category, a conj feature and is assigned a conjunct object. The initial absorption of the coordinating conjunction into a conjunct uses another combinator (usually M). The & combinator is only applied when two conjuncts are combined. The semantic conjunct object is inherited until & combination, when it is declared as the handle of the derived node. In simple applications, where there are no extraction phenomena, the natural category class for conjunctions is modificational, and most conjunctions are represented by forward facing modifiers. Figures 11 and 12 give examples of nominal and verbal coordination in the sentences “Bill and Ben eat.” and “Bill ate and left.” The distributive reading is provided by expanding the attributions applied to the conjunct. The conjunct rules operate recursively on nested conjuncts.

Figure 10: Handling the semantics of parasitic gaps with M.

In non-constituent coordination, the conjunction has to provide the complementation capabilities of the missing verb; for example, in the sentence “John ate bananas and Mary oranges.” the conjunction is given the quasi-verbal category (S/N1)/N2 (Figure 13). The semantic extraction rules require that deferred bindings are made to copies. In “Jane saw Phil today and Jim yesterday”, the conjunction and is given the category (S\N1)/N2, and today and yesterday can be assigned either S\S or (S\N1)\(S\N1). Unlike some previous approaches that use unary type-raising, the semantics are correct and the verb modifiers (today and yesterday) are applied to different instantiations of the verb. In terms of our example, Jane did two separate actions, one today and the other yesterday.

Phrasal coordination in such sentences as “Jack is happy and a rich man” and “Jack is a rich man and happy” is accomplished without the coordination combinator at all. The conjunction and is regarded as a simple forward modifier and <Qr does the copular modification of the subject.

Figure 11: A derivation tree involving nominal coordination.

7. Extracting FCCG Annotations from the Penn Treebank

To develop and demonstrate FCCG, we converted the Wall Street Journal section of the Penn treebank (Marcus et al., 1993), the largest available corpus of parsed English sentences. The context-free grammar in which it is annotated has 48 part-of-speech (POS) tags and 14 higher-level phrase tags, many of which are augmented with functional markers. The first CCG conversion, by Watkinson and Manandhar, made no attempt to construct realizable CCG derivations or handle any of the syntactic constructions in the Penn treebank that use null elements to encode long-range dependencies (Watkinson & Manandhar, 2001). The second conversion, the “CCGbank”, by Hockenmaier and Steedman, comprises coherent derivations exhibiting a wide range of phenomena (Hockenmaier & Steedman, 2002a).

The FCCG approach is sufficiently different from classical CCG to require a new annotated corpus. Our conversion has been guided by the principle of production parsimony (Section 6) and the need for correct semantics.

Compact derivation trees have low average root-to-leaf path lengths, and therefore tend to be well balanced. Characteristically, they are formed of independent subtrees, each spanning a sentence chunk. They tend to parse quickly.

In the grammar we extracted from the Penn treebank (release 2), only 140 productions suffice to cover 97.7% of the extracted corpus. This figure compares to the more than 12,400 rules in the original and ∼3000 in the CCGbank. Economy is achieved in numerous ways, for example, by giving verbs the same category whether they are used declaratively or as part of a subordinate clause. Treating prepositional phrases as adjuncts rather than complements eliminates large numbers of nugatory categories.

Figure 12: A derivation tree involving verbal coordination.

Figure 13: A derivation involving nonconstituent coordination.

The process for generating FCCG derivations from native Penn treebank annotations has four stages: (i) head identification, (ii) tree surgery, (iii) conversion to a binary tree, and (iv) assignment of categories and combinators. Each child of each production in a Penn treebank derivation is assigned a type (Head, Adjunct, Conjunct or Complement) using the production's tags and a set of head identification rules. Trees whose treebank structure leads to incorrect semantics under FCCG conversion are subjected to surgery. Each resulting tree is converted to a binary structure consistent with the types of its constituents and a set of constraint rules. Finally a top-down pass generates the FCCG categories and combinators.

The main difference between our extraction process and Hockenmaier's (Hockenmaier, 2003a) lies in head-finding. Accurate head-finding is a prerequisite for accurate semantic extraction. In conventional head-finding, the children of a production are traversed by head detection rules, one at a time; the process terminates when a head is found. Our procedure traverses the children in a specified direction, testing each child with a set of rules. If no head is found, a different set of rules and search direction is applied. While we still branch left (right) for nodes to the right (left) of the head node, we work inwards from the beginning of the sentence, changing to moving inwards from the end of the sentence and back again each time we hit a conjunct or the head. This enables us to analyse the relatively flat structures of the Penn treebank without recourse to extensive tree surgery; for example, in right-node raising and argument clustering.
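A minimal sketch of this inward-alternating traversal follows; the child interface (an is_conjunct flag) and the two rule sets are hypothetical stand-ins for the actual head identification rules, which are not enumerated here.

    # Hypothetical sketch of the inward-alternating head search.
    # Each rule is a predicate over a child node; children are assumed
    # to expose an is_conjunct flag.
    def find_head(children, primary_rules, fallback_rules):
        """Return the index of the head child, or None if no rule fires."""
        for rules in (primary_rules, fallback_rules):
            left, right = 0, len(children) - 1
            from_left = True                  # start inwards from the left end
            while left <= right:
                i = left if from_left else right
                child = children[i]
                if any(rule(child) for rule in rules):
                    return i                  # head found
                if from_left:                 # consume this child, move inwards
                    left += 1
                else:
                    right -= 1
                if child.is_conjunct:         # reverse direction at a conjunct
                    from_left = not from_left
        return None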

While we have tried to minimise the amount of tree surgery, it is often necessary even to convert simple sentences; for example, in the treatment of possessives and percent signs. Semantic transparency requires that in such sentences as “no-one got discouraged” the word get be treated as an auxiliary verb of discouraged, instead of taking the participle as a complement. To improve tree compactness, auxiliary verb sequences are extracted as independent subtrees and then applied to the participle, rather than being applied to it one at a time. We treat infinitival complements as modifiers of the main verb and give to an adjunctive category. The required tree surgery is shown below: the original Penn treebank markup on the left; our revision on the right.

(VP (VBP seem)                          (VP (VBP seem)
    (S (NP-SBJ-1 (-NONE- *-1) )             (VP (TO to)
       (VP (TO to)                              (S (NP-SBJ (-NONE- *-1) )
           (VP (VB disappear) ))))                 (VP (VB disappear) ))))

Key to conversion is combinator selection. As each CCG production is extracted, a list of possible combinators is formed and a score is allocated to each, depending on the tags, types and categories of the node and its predecessors. Combinators are ranked in order of decreasing preference:

F FN R M MN Q P & A

Scores associated with variations (x and r) are added to each combinator score; a minimal sketch of this selection step follows.
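The candidate representation, the numeric base scores and the variation adjustments below are hypothetical stand-ins, since only the preference ordering is specified here.

    # Hypothetical sketch of combinator selection during extraction.
    # Base scores encode the preference ranking F FN R M MN Q P & A;
    # variation adjustments (for the x and r variants) are then added.
    PREFERENCE = ["F", "FN", "R", "M", "MN", "Q", "P", "&", "A"]
    BASE_SCORE = {c: float(len(PREFERENCE) - i) for i, c in enumerate(PREFERENCE)}

    def best_combinator(candidates, variation_score):
        # candidates: iterable of (base, variations), e.g. ("F", ("x",)).
        def score(candidate):
            base, variations = candidate
            return BASE_SCORE[base] + sum(variation_score[v] for v in variations)
        return max(candidates, key=score)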

The Penn treebank contains long lists of adjuncts (modifiers). Using only one level of modifier category (i.e. X|X) produces deep trees that are difficult and time-consuming to parse. These problems can be alleviated by allowing modifier-modifiers of the form (X|1X)|2(X|1X) for such words as intensifiers and adverbs modifying adjectives.

The conversion does not yet cover the whole of the WSJ portion of the treebank. The remaining 35% fails to convert due to (i) failure to find a head, (ii) insufficiency of multi-clause constructions, (iii) absence of list constructions, and (iv) errors in the native treebank annotation.

7.1 Spurious Ambiguity

CCG permits multiple semantically equivalent derivations, a phenomenon known as spurious ambiguity. During parsing, it has the effect of bloating the chart with unnecessary derivations. Reducing spurious ambiguity has been addressed in three ways: restricting derivations to a normal form (Vijay-Shanker & Weir, 1993), (Eisner, 1996); eliminating superfluous semantically equivalent derivations (Karttunen, 1989), (Komagata, 1999); and penalizing dispreferred derivations by probabilistic scoring coupled with selective search (Hockenmaier, 2003a).

With the exception of R and Q, the combinators we use all permit spurious ambiguity. We have sought to reduce it by probabilistic scoring coupled with A* parsing. We distinguish combinators with zero arity from those with higher arities (by inserting an ‘N’ after the combinator type letter) to permit harsher scoring of higher-arity combinations.

Probabilistic scoring can reduce spurious ambiguity if preferences amongst semantically equivalent derivations are expressed in the training data. The list of combinators extracted from the Penn treebank data in order of declining frequency is:

>M, >F, <M, <F, <&, >MNx, <MNx, <A, >MN, <Qr, <R, <Fr, <Pr, <MN, <P, >Fr, >FNx, >Pr.

Non-variant forms and zero-arity combinators predominate.

The combinator scoring scheme described above, together with other restrictions, constrains FCCG extraction so that conjunct and appositive sequences are combined from the right. While modifier and modifier-modifier categories are permitted, higher order modifiers are not.

8. Parsing

Lexicalized CCG parsers have made extensive use of bilexical dependency features, and have applied both conditional (Clark et al., 2002) and generative models (Hockenmaier & Steedman, 2002b). Clark and Curran achieved good results using maximum entropy training (Clark & Curran, 2003), (Clark & Curran, 2004b). Recently, Zettlemoyer and Collins have described an algorithm for inducing a CCG based only on semantic markup (Zettlemoyer & Collins, 2005).

Our treatment begins with the presentation of a factored posterior distribution for derivations; we then introduce the feature set and give a training algorithm and an A* parsing procedure. This section concludes with a discussion of results.


8.1 A Probabilistic Model for the Syntactic Derivation

In probabilistic context-free grammar (PCFG), the probability of a derivation is the product of the conditional probabilities of the child nodes given their parents, enumerated over the whole tree. With PCFG, parsing quality suffers from the lack of contextual constraint. In an effort to ameliorate this, Abney introduced a novel but unscalable formulation that admitted contextual features and was trained by sampling the entire language generated by the grammar (Abney, 1997). Drawing on the ideas of history-based grammar (Black, Jelinek, Lafferty, Magerman, Mercer, & Roukos, 1993) and maximum entropy (Berger, Pietra, & Pietra, 1996), Ratnaparkhi introduced a staged constructive approach admitting contextual features (Ratnaparkhi, 1997), (Ratnaparkhi, 1999). Early maximum entropy approaches applied either improved or generalised iterative scaling (Ratnaparkhi, 1997), (Clark & Curran, 2003), but such algorithms have been superseded by quasi-Newton algorithms (Johnson, Geman, Canon, Chi, & Riezler, 1999), (Riezler, King, Kaplan, Crouch, Maxwell, & Johnson, 2002), (Clark & Curran, 2004a).

Like PCFGs, history-based grammars have a top-down generative structure, where the derivation probability is the product of probabilities of children given their parents. The derivation probabilities of maximum entropy models are built the other way, as the product of the probability of the token string and the posterior probability of the derivation given the token string. In (Ratnaparkhi, 1997), (Johnson et al., 1999), (Clark & Curran, 2003) and (Clark & Curran, 2004a) only the derivation posterior is optimized. Given appropriately augmented node states, the derivation posterior can be factorized into bottom-up terms of the form P(derived node | predecessor nodes). We have used a posterior model that exposes this factorization.

An interesting difference between generative and posterior models is that, in the former, each child must have a parent, so there is never a question of children not combining. However, in posterior modelling, non-combination is very much a possibility. Whereas maximum entropy techniques optimize this phenomenon implicitly, in factored posterior models it must be exposed and optimized explicitly.

Recalling the formal definition of FCCG (Section 2), we partition the derivation tree T between the set of tokens T, the set of leaf nodes U(T) derived by unary combination from the tokens, and the set of higher nodes B(T) derived by binary combination. In the factored posterior model, the posterior probability of T is the product of the conditional probabilities of the parameters of its derived nodes:

\[
P(\mathcal{T} \mid \mathbf{T}) = \prod_{i \in B(\mathcal{T})} P\!\left(\mathcal{P}_i \,\middle|\, \{\mathcal{P}_j\}_{j \in \Pi(\mathcal{T},i)}\right) \prod_{i \in U(\mathcal{T})} P\!\left(\mathcal{P}_i \,\middle|\, \mathcal{P}_{\Pi(\mathcal{T},i)}\right),
\]

where Π(T, i) is the set of parents of node i in derivation T. The set of parameters Pi of a binary-derived node can be split between the combinator ki and the remaining parameters P̄i; Pi = {ki, P̄i}, giving

\[
P\!\left(\mathcal{P}_i \,\middle|\, \{\mathcal{P}_j\}_{j \in \Pi(\mathcal{T},i)}\right)
= P\!\left(k_i, \bar{\mathcal{P}}_i \,\middle|\, \{\mathcal{P}_j\}_{j \in \Pi(\mathcal{T},i)}\right)
= P\!\left(k_i \,\middle|\, \{\mathcal{P}_j\}_{j \in \Pi(\mathcal{T},i)}\right) P\!\left(\bar{\mathcal{P}}_i \,\middle|\, \{\mathcal{P}_j\}_{j \in \Pi(\mathcal{T},i)}, k_i\right).
\]

28

Page 29: Functional Combinatory Categorial Grammardaniel_mcmichael/papers/FCCGJ06.pdf · 2006. 6. 24. · Functional Combinatory Categorial Grammar driven phrase structure grammar [HPSG] (Sag,

Functional Combinatory Categorial Grammar

The value of P̄i, given its predecessors, is completely determined by the combinator, so the second term may be written

\[
P\!\left(\bar{\mathcal{P}}_i \,\middle|\, \{\mathcal{P}_j\}_{j \in \Pi(\mathcal{T},i)}, k_i\right)
= \delta\!\left(\bar{\mathcal{P}}_i - f_B\!\left(\{\mathcal{P}_j\}_{j \in \Pi(\mathcal{T},i)}, k_i\right)\right).
\]

For leaf nodes, the model is similarly partitioned between the state xi and the remaining parameters P̄i; Pi = {xi, P̄i}, such that

\[
P\!\left(\mathcal{P}_i \,\middle|\, \mathcal{P}_{\Pi(\mathcal{T},i)}\right)
= P\!\left(x_i, \bar{\mathcal{P}}_i \,\middle|\, \mathcal{P}_{\Pi(\mathcal{T},i)}\right)
= P\!\left(x_i \,\middle|\, \mathcal{P}_{\Pi(\mathcal{T},i)}\right) \delta\!\left(\bar{\mathcal{P}}_i - f_U\!\left(\mathcal{P}_{\Pi(\mathcal{T},i)}, x_i\right)\right).
\]

Since the value of the δ-function terms is invariant across valid combinations, the tree probability has the simple form

\[
P(\mathcal{T} \mid \mathbf{T}) = \prod_{i \in B(\mathcal{T})} P\!\left(k_i \,\middle|\, \{\mathcal{P}_j\}_{j \in \Pi(\mathcal{T},i)}\right) \prod_{i \in U(\mathcal{T})} P\!\left(x_i \,\middle|\, \mathcal{P}_{\Pi(\mathcal{T},i)}\right).
\]

The probability of binary combination is entirely determined by the conditional probability of the combinator. The conditional combinator and state probabilities can be provided by probabilistic classifiers. For example, if logistic classifiers are used (with the attributes indexed by ℓ and the nodes by i), the binary combination attribute aiℓ is a function of ki and {Pj}j∈Π(T,i), and the unary combination attribute uiℓ is a function of xi and PΠ(T,i). Omitting references to the parameters, the derivation posterior becomes

\[
P(\mathcal{T} \mid \mathbf{T}) =
\frac{\exp \sum_{i \in B(\mathcal{T})} \sum_{\ell} \lambda_{i\ell}\, a_{i\ell}(k_i)}
     {\prod_{i \in B(\mathcal{T})} \sum_{\tilde{k}_i} \exp \sum_{\ell} \lambda_{i\ell}\, a_{i\ell}(\tilde{k}_i)}
\cdot
\frac{\exp \sum_{i \in U(\mathcal{T})} \sum_{\ell} \lambda_{i\ell}\, u_{i\ell}(x_i)}
     {\prod_{i \in U(\mathcal{T})} \sum_{\tilde{x}_i} \exp \sum_{\ell} \lambda_{i\ell}\, u_{i\ell}(\tilde{x}_i)}. \qquad (2)
\]
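As an illustration, a minimal sketch of one binary-combination factor follows, assuming the softmax (logistic) form of Eqn (2); the attribute function and weight vector are hypothetical stand-ins for the feature machinery of Section 8.3. The log-probability of a derivation is then the sum, over its binary and unary nodes, of the logarithms of such factors.

    import math

    # Hypothetical sketch: softmax over candidate combinators (including
    # the null combinator) for one binary combination, as in Eqn (2).
    def combinator_posterior(lambdas, attributes, combinators):
        # attributes(k) returns the attribute vector a(k) for candidate k;
        # lambdas is the trained weight vector.
        scores = {k: sum(w * a for w, a in zip(lambdas, attributes(k)))
                  for k in combinators}
        z = sum(math.exp(s) for s in scores.values())
        return {k: math.exp(s) / z for k, s in scores.items()}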

To represent non-combination, we insert a null combinator knull into the combinator set K. An interesting question then arises: if the training corpus contains only positive examples, how are we to estimate the conditional probability of knull?

In maximum entropy modelling, it is conventional to provide such negative data by enumerating the parse forest and then applying the inside-outside algorithm to calculate the node marginals – a demanding computation (Johnson et al., 1999), (Clark & Curran, 2003), (Zettlemoyer & Collins, 2005). We pursue a different approach, more directly linked to optimising parse quality.

8.2 Training

Training proceeds in two stages: firstly, positive data is used to train the leaf node state and higher node combinator classifiers; secondly, there is an iterative process in which the training set is parsed and errors are fed back to expand the training data. This process gives an iterative improvement in parsing performance. The logistic classifiers are optimized using the limited memory BFGS algorithm (Liu & Nocedal, 1989).

The leaf node classifier is restricted to attributes available at token level, while the higher node classifier utilizes all the features described in Section 8.3, 4039 attributes in all.


An ℓp regularizing prior (Johnson et al., 1999), (Riezler & Vasserman, 2004) was applied to prevent overfitting. It was modified so that its scale parameter σ becomes independent of the number of data N:

\[
p(\lambda \mid \sigma, p) = \frac{p\,\sqrt[p]{N}}{2\sigma\,\Gamma(1/p)}\; e^{-N\left|\lambda/\sigma\right|^{p}}.
\]

We set p = 2. The choice of σ has a significant effect on classifier error rates and on parsing efficiency (Figure 14).

Figure 14: The effect of varying the scale parameter σ in the prior distribution (p = 2).

Let the initial training data for the combinator classifier be Z0. It is extracted from the training corpus Γ and consists of a set of cases, each comprising the attributes z and the corresponding combinator k; i.e. Z0 = {(zi, ki)}i, where i ranges over the set of binary combinations in Γ. Subsequent training sets Z1, Z2, etc. are produced by successively augmenting Z0.

Let S(Γ, n) be a random sample of n sentences from Γ obtained without replacement. The sample sentences are parsed, and error combinations having predecessors in the gold standard data are assigned to the sets:

Enull(S) = the set of combinations in S that are not in the gold standard;
EWC(S) = the set of combinations in S combined using the wrong combinator.

For a combination zi, let the gold-standard combinator be kgs(zi) and the corresponding combinator in the maximum probability parse be kparse(zi); the sequence of training sets {Zj}j is obtained via the iteration:

Iterate j:

    ∆Z1j = {(zm, knull)}m,  zm ∈ Enull(S(Γ, n))
    ∆Z2j = {(zm, knull)}m,  zm ∈ EWC(S(Γ, n))
    ∆Z3j = {(zm, kgs(zm))}m,  zm ∈ EWC(S(Γ, n))
    Zj = {Zj−1, ∆Z1j, ∆Z2j, ∆Z3j}
    retrain the binary combinator classifier

Figure 15: Variation of parser scores over 25 iterations of feedback training (average syntactic dependency F1-score, average semantic dependency F1-score, average semantic tagging accuracy and average syntactic tagging accuracy).

The improvement generated by this iterative scheme is about 10% in labelled semantic dependency precision and recall. Figure 15 shows the variation of cross-validation scores (Section 4.5) using section 24 of the Penn treebank as holdout data. The scores are calculated for semantic dependency, syntactic dependency and labelled bracketing accuracy. They tend to improve as the number of iterations increases; the optimum in syntactic and semantic dependency F1-scores was reached at iteration 23. Parsing time progressively improves throughout the iteration process. Figure 16 shows the variation of parser speed measured in terms of the average number of tokens parsed per second and the average number of combinations per token. The zeroth iteration contains no negative data. At the first iteration negative data is introduced and there is a considerable increase in the number of unused candidate partial derivations. The progressive introduction of more negative data steadily reduces the number of wasted combinations and increases parsing speed.
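A minimal sketch of this feedback loop, under stated assumptions, is given below; the helpers namespace (positive-case extraction, best parse, gold-standard lookup) and the K_NULL label are hypothetical stand-ins for the machinery described above.

    import random

    # Hypothetical sketch of the error-feedback training iteration.
    def feedback_train(corpus, classifier, helpers, n, iterations, K_NULL="null"):
        Z = list(helpers.extract_positive_cases(corpus))      # Z0: cases (z, k)
        classifier.fit(Z)
        for _ in range(iterations):
            for sentence in random.sample(corpus, n):         # S(Gamma, n)
                parse = helpers.best_parse(sentence, classifier)
                for comb in helpers.binary_combinations(parse):
                    gold = helpers.gold_combinator(comb)      # None if not in gold standard
                    if gold is None:
                        Z.append((comb.attributes, K_NULL))   # Delta Z1
                    elif comb.combinator != gold:
                        Z.append((comb.attributes, K_NULL))   # Delta Z2
                        Z.append((comb.attributes, gold))     # Delta Z3
            classifier.fit(Z)                                 # retrain the classifier
        return classifier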

Figure 16: Variation of parser speed and effort over 25 iterations of feedback training.

8.3 Parsing Features

The category is the vehicle to which all features are attached (Section 3.2). In addition to those so far described, we use a number of features the sole purpose of which is to provide attributes for the statistical parser. They are not used to differentiate grammatical productions, but are applied via the mechanism of Eqn. (2).

One of the major limitations of features dependent on lexical occurrences within the training corpus is that there are bound to be too few examples to provide training of sufficiently wide coverage for accurate parsing. We have therefore hand-crafted about sixty features. Lexical dependency occurs largely through wide-coverage sources such as dictionaries. There are five classes of root feature:

head features These are computed from the lexical properties of each token, and are propagated to the derived node by inheritance from the head predecessor node. Examples include detectors for:

negations, quantifiers, determiners, subordinate and coordinate conjunctions, pronouns, question words, common verbs, auxiliary verbs, modal verbs, phrasal verbs, common adjectives, common adverbs, temporal nouns, intensifiers, adjectives of number, mode-changing verbs, mode-changing nouns; capitalization; internal punctuation; suffix endings; domain-specific terms and multi-word forms, e.g. people, places, countries, financial terms; FCCG-specific information, e.g. previously observed categories and combinators.

32

Page 33: Functional Combinatory Categorial Grammardaniel_mcmichael/papers/FCCGJ06.pdf · 2006. 6. 24. · Functional Combinatory Categorial Grammar driven phrase structure grammar [HPSG] (Sag,

Functional Combinatory Categorial Grammar

lexical variant features Also computed from the lexical properties of tokens, the lexical variant features are propagated to the derived node by combining the corresponding features from both predecessor nodes. Information about the type of combination, such as the combinator, direction of head inheritance, direction of functional application, etc., may be utilized. Examples include indicators of:

adverb to the left, noun and verb phrase prepositional attachment, auxiliary verb bracketing; prior absorption of negations, determiners, modal verbs, subordinate and coordinate conjunctions, and question words; etc.

span variant features These are computed from the context of each token within the sentence, including positional information and lexically-derived information about surrounding tokens. Examples include:

punctuation to the left and right; estimated part-of-speech tags; width of the derived span; position of the left-most token in the span; distance of the head token to its left-most and right-most spanned tokens; distance between the head tokens of the predecessor spans; etc.

The span variant features are propagated to the derived node via feature resolution.

state variant features These are computed from the combinatory and categorial information assigned to nodes after the transformation of tokens into leaf nodes, e.g. after supertagging. Examples include:

category, lexical head, position of head token; previously used combinators; etc.

combination variant features These are computed solely for the purpose of combinator classification. They are a non-linear transformation of pairs of any of the types of features discussed above. Examples include:

the set of combinators observed in the training set with the same given left and right predecessor categories; membership of the set of productions in the training set.

With the exception of the span variant features, all the feature groups can be cached. The root features are not used directly by the parser (Section 8), but are first coded into a list of attributes with values in [0, 1]. The coding schemes employed are as follows (a sketch of the coding appears after the list):

• Boolean – the attribute value is 1 if the feature is present, or 0 if the feature is absent.

• probability – the attribute value represents the probability that the feature is present.

• one-of-N – a feature having one of N discrete values is coded as a list of N attributes, with a 1 for the attribute corresponding to the feature value, and 0 for each remaining attribute.

• numerical – a numerically-valued feature, e.g. head distance, is transformed into an attribute value in the range 0 . . . 1.
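A minimal sketch of the four coding schemes follows; the squashing used for numerical attributes is an assumption, since the transformation into the range 0 . . . 1 is not specified here.

    # Sketch of attribute coding; the numerical squashing is hypothetical.
    def boolean_attr(present):
        return [1.0 if present else 0.0]

    def probability_attr(p):
        return [p]                                  # already a value in [0, 1]

    def one_of_n_attr(value, values):
        return [1.0 if value == v else 0.0 for v in values]

    def numerical_attr(x, scale=10.0):
        return [x / (x + scale)]                    # e.g. head distance -> (0, 1)

For example, one_of_n_attr("NP", ["N", "NP", "S"]) yields [0.0, 1.0, 0.0].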


We have excluded feature markings from the production rules to reduce the size of the grammar, while preserving its coverage of dependencies; this contrasts with the syntactically richer approach of Hockenmaier and Steedman (Hockenmaier & Steedman, 2002a), (Hockenmaier, 2003a). Hard blocking of combinations is triggered by such occurrences as unmatched brackets and & combination without a conj feature.

8.4 Auxiliary Queue A* Parsing

The goal of agenda-based parsing is to avoid enumerating the parse forest for the token sequence under the grammar, as the CYK algorithm does (Kasami, 1965). A design methodology for agenda-based parsers with a variety of control strategies has been provided by Klein and Manning (Klein & Manning, 2001b). Their proposals are based on Dijkstra's graph search algorithm (Dijkstra, 1959), the precursor of the A* algorithm (Hart, Nilsson, & Raphael, 1968), (Hart, Nilsson, & Raphael, 1972). A*, which incorporates expected completion cost estimates, has been applied to parsing (Klein & Manning, 2001a), (Klein & Manning, 2003a), (Klein & Manning, 2003c). Klein and Manning's algorithm (Klein & Manning, 2001b) provides for a variety of introduction strategies, but it simplifies if “discovery” and “finishing” coalesce into a single step. Our approach uses A* search in a bottom-up strategy with back-tracking via an auxiliary queue (Section 8.4.1).

Initially, the chart is empty and leaf nodes are introduced from the auxiliary queue to the agenda, a priority queue scored by the logarithm of node probability. The leaf node scores are the log conditional probabilities of the category given the token feature data. Parsing involves successive removal of the highest scoring node from the agenda (which needs occasional resorting). This node, the pivot, is discarded if it appears in the chart. If not, the parser retrieves its context – the set of chart nodes with which it can be combined. An attempt is made to combine them with the pivot, and the resulting derived nodes are placed on the agenda. The current pivot is then placed on the chart. The next pivot is drawn from the agenda. The chart only distinguishes nodes on the basis of span and state. The search terminates when a node with an admissible root state that spans the sentence is assigned as pivot.
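A minimal sketch of this agenda loop, under stated assumptions, is given below; initial_leaf_nodes, context, combine and the node interface (span and state attributes) are hypothetical stand-ins.

    import heapq
    import itertools

    # Hypothetical sketch of the agenda-based search described above.
    def parse(initial_leaf_nodes, context, combine, is_admissible_root):
        agenda, chart = [], {}
        tie = itertools.count()                     # tie-breaker for equal scores
        for score, node in initial_leaf_nodes:
            heapq.heappush(agenda, (-score, next(tie), node))   # max-priority via negation
        while agenda:
            _, _, pivot = heapq.heappop(agenda)
            key = (pivot.span, pivot.state)         # chart distinguishes span/state only
            if key in chart:
                continue                            # duplicate pivot: discard
            if is_admissible_root(pivot):
                return pivot                        # admissible spanning root: done
            for other in context(chart, pivot):     # chart nodes combinable with pivot
                for score, derived in combine(pivot, other):
                    heapq.heappush(agenda, (-score, next(tie), derived))
            chart[key] = pivot
        return None                                 # agenda exhausted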

We experimented with two forms of A* completion cost estimate: a leaf node normalizer νleaf and a derived node normalizer νderived. The former is the negative logarithm of the product of the probabilities of the most probable leaf nodes for each token in the constituent. It can be shown that the leaf node normalizer satisfies the admissibility criteria for A* heuristics; it also led to a speed-up of ∼5% (Matsumoto, Powers, & Jarrad, 2003).
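One plausible rendering of the leaf node normalizer, assuming the product runs over the tokens spanned by the constituent, is

\[
\nu_{\mathrm{leaf}}(s) = -\log \prod_{t \in s} \max_{x} P(x \mid t) = -\sum_{t \in s} \log \max_{x} P(x \mid t),
\]

where t ranges over the tokens of span s and x over the candidate leaf node states for t.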

The derived node normalizer is the negative logarithm of the probability of the most probable derived node with the same reduced state and span in the training data. A reduced state and span is a function of a node's state and span; in our experiment, we used the category and span width. However, the derived node normalizer is only admissible when the nodes in the current parse are less probable than the most probable nodes with the same reduced state and span in the training data. This simple derived node normalisation yielded little increase in parser speed, and because of its occasional non-admissibility it led to a slight reduction in performance and was discarded.


Table 4 shows the results for parsing with an auxiliary queue on the extracted sentences from the test set, section 23 of the Penn treebank. The auxiliary queue introduction strategy ensures that there are no failed parses.

Strategy                           # combinations   Syntactic      Semantic       Semantic
                                   per token        Dependency     Dependency     Type
                                                    F1-score       F1-score       Accuracy
Dijkstra                           14.3             78.88          81.95          96.93
+ Hard feature constraints         11.1             78.73          82.77          96.92
+ Leaf node normalisation (A*)     10.5             79.16          83.13          96.97

Table 4: Test results on section 23 of the Penn Treebank with an auxiliary queue parser.

Parser                   Leaf Categories   Lexicalization       F1-score
Hockenmaier 2003         estimated         bilexicalization     83.3%
Clark et al. 2004        estimated         bilexicalization     84.6%
McMichael et al. 2006    estimated         semilexicalization   79.2%

Table 5: Comparative syntactic dependency test scores on section 23 of the Penn Treebank.

8.4.1 The Auxiliary Queue

The parser’s queue-based framework inspired a robust form of leaf node introduction, also using a queue. The role of the auxiliary queue is to introduce leaf nodes to the agenda in order of decreasing auxiliary score.

Before parsing commences, the sentence is supertagged (Bangalore & Joshi, 1999) by attaching a probability to each possible category conditioned on a 5-word window centred on each token, using Ratnaparkhi's approach (Ratnaparkhi, 1996). Clark and Curran have termed this procedure multitagging (Clark, 2002), (Curran & Clark, 2003), (Clark & Curran, 2004a). The multitagger uses the features available at token level (Section 8.3).

The auxiliary queue contains all the possible leaf nodes (i.e. a node for each possible category for each token in the sentence). The auxiliary queue is ordered by the auxiliary score, which is the multitagging probability of each terminal node divided by the maximum multitag probability of any node derived from its predecessor token. Initially, the queue is top-sliced and all the leaf nodes with a score of more than βcrit (∼0.85) are transferred to the agenda.

If the agenda empties before an admissible root is found, successively lower scoring nodes are drawn from the auxiliary queue and placed on the agenda until the parser terminates. Figure 17 shows the variation in the average number of trial combinations and semantic dependency score as βcrit is varied over the range 0.1–1.0 on the sentences in section 24 of the Penn treebank. Provided βcrit is not too high, dependency accuracy is insensitive to its value.
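A minimal sketch of the auxiliary score and the top-slicing step, under stated assumptions, is given below; the multitag_probs representation (a mapping from token positions to category distributions) is hypothetical.

    # Hypothetical sketch of auxiliary scoring and top-slicing.
    def build_auxiliary_queue(multitag_probs):
        scored = []
        for token, dist in multitag_probs.items():
            best = max(dist.values())               # best multitag probability for token
            for category, p in dist.items():
                scored.append((p / best, token, category))   # auxiliary score
        scored.sort(reverse=True)                   # decreasing auxiliary score
        return scored

    def top_slice(queue, beta_crit=0.85):
        # Nodes scoring above beta_crit seed the agenda; the remainder stay
        # on the auxiliary queue and are released, highest score first,
        # whenever the agenda empties before an admissible root is found.
        seeds = [n for n in queue if n[0] > beta_crit]
        remainder = [n for n in queue if n[0] <= beta_crit]
        return seeds, remainder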

Figure 17: Variation of semantic dependency F1-score with change in the value of βcrit.

Table 6 compares the auxiliary queue method with three other strategies: (i) only introducing the most probable leaf node into the agenda and never drawing further from the auxiliary queue, (ii) placing all the nodes on the agenda at initialization (i.e. βcrit = 0), and (iii) using the staged reduction policy of Clark and Curran (Clark & Curran, 2004a), in which βcrit = 0.1 → 0.01 → 0.001, etc.

Strategy                % Failed    # Combinations   Semantic     Parser
                        Sentences                    Dependency   Tagging
                                                     Score        Accuracy
(i) Best nodes only     97.5        6262             94.88        96.11
(ii) Staged reduction   0           513527           74.76        89.06
(iii) Auxiliary queue   0           346278           83.13        89.62

Table 6: Results on section 23 of the Penn Treebank for various node injection strategies.

9. Conclusion

The principal contribution of CCG was the creation of a grammar that fitted the phenomena of language well and needed fewer productions for the same coverage (∼3,000 CCGbank productions, instead of >12,400 in the native Penn treebank annotation). This compression was obtained by allowing functional syntactic derivation under a small number of combinators.


The contribution of FCCG is to reinterpret combination as a semantic process, to redefine the combinators accordingly, and so to provide compressed functional semantic extraction. This reworking has also yielded a significant further reduction in syntactic complexity. Only 140 productions are required to cover 99.7% of the extracted training corpus. This improvement in efficiency has been achieved at the same time as gains in semantic and syntactic fidelity, such as better handling of non-constituent coordination.

Use of the A* algorithm for parsing provides significant speed improvements over CYK (Klein & Manning, 2003a). It has the estimable property that better performance tends to yield higher speed. Use of an auxiliary queue for leaf node introduction proved a significant improvement over existing procedures.

The iterative training algorithm requires an order of magnitude less computation than the maximum entropy technique and provides good performance.

In the near term, we will fully convert the Penn treebank and explore the application of semantic constraints to parsing and semantic extraction.

References

Abney, S. (1995). Chunks and dependencies: Bringing processing evidence to bear on syntax. In Cole, J., Green, G., & Morgan, J. (Eds.), Computational Linguistics and the Foundations of Linguistic Theory, pp. 145–164. CSLI.

Abney, S. P. (1997). Stochastic attribute-value grammars. Computational Linguistics, 23 (4), 597–618.

Ades, A., & Steedman, M. (1982). On the order of words. Linguistics and Philosophy, 5, 519–558.

Alshawi, H., & van Eijck, J. (1989). Logical forms in the core language engine. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pp. 25–32, Vancouver, Canada.

Baldridge, J. (2002). Lexically Specified Derivational Control in Combinatory Categorial Grammar. Ph.D. thesis, Edinburgh University.

Bangalore, S., & Joshi, A. (1999). Supertagging: An approach to almost parsing. Computational Linguistics, 25 (2), 237–265.

Berger, A. L., Pietra, V. J. D., & Pietra, S. A. D. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22 (1), 39–71.

Black, E., Jelinek, F., Lafferty, J., Magerman, D. M., Mercer, R., & Roukos, S. (1993). Towards history-based grammars: Using richer models for probabilistic parsing. In Proceedings of the 31st Annual Meeting of the ACL, Columbus, Ohio.

Bos, J. (2005). Towards wide-coverage semantic interpretation. In Proceedings of the Sixth International Workshop on Computational Semantics (IWCS-6), pp. 42–53.

Bos, J., Clark, S., Steedman, M., Curran, J. R., & Hockenmaier, J. (2004). Wide-coverage semantic representations from a CCG parser. In Proceedings of the 20th International Conference on Computational Linguistics (COLING '04), Geneva, Switzerland.


Bresnan, J. (Ed.). (1982). The Mental Representation of Grammatical Relations. MIT Press, Cambridge, MA.

Briscoe, T., & Carroll, J. (2002). Robust accurate statistical annotation of general text. In Proceedings of the 3rd LREC Conference, pp. 1499–1504, Las Palmas, Gran Canaria.

Burke, M., Cahill, A., O'Donovan, R., van Genabith, J., & Way, A. (2004). Evaluation of an automatic f-structure algorithm against the Parc 700 dependency bank. In Proceedings of LFG04, Canterbury, New Zealand.

Cahill, A., Burke, M., O'Donovan, R., van Genabith, J., & Way, A. (2004). Long-distance dependency resolution in automatically acquired wide-coverage PCFG-based LFG approximations. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pp. 320–327, Barcelona, Spain.

Calder, J., Klein, E., & Zeevat, H. (1988). Unification categorial grammar: A concise, extendable grammar for natural language processing. In Proceedings of the 12th International Conference on Computational Linguistics (COLING), pp. 83–87, Budapest, Hungary.

Carroll, J., Minnen, G., & Briscoe, T. (1999). Corpus annotation for parser evaluation. In Proceedings of the EACL workshop on Linguistically Interpreted Corpora (LINC), Bergen, Norway.

Charniak, E. (2000). A maximum-entropy-inspired parser. In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics, Seattle, WA.

Clark, S., & Hockenmaier, J. (2002). Evaluating a wide-coverage CCG parser. In Proceedings of the LREC Beyond PARSEVAL workshop, Las Palmas, Spain.

Clark, S. (2002). Supertagging for combinatory categorial grammar. In Proceedings of the 6th International Workshop on Tree Adjoining Grammars and Related Frameworks (TAG+6), pp. 19–24, Venice, Italy.

Clark, S., & Curran, J. R. (2003). Log-linear models for wide-coverage CCG parsing. In Proceedings of the SIGDAT Conference on Empirical Methods in Natural Language Processing (EMNLP '03), pp. 97–104, Sapporo, Japan.

Clark, S., & Curran, J. R. (2004a). The importance of supertagging for wide-coverage CCG parsing. In Proceedings of the 20th International Conference on Computational Linguistics (COLING-04), pp. 282–288, Geneva, Switzerland.

Clark, S., & Curran, J. R. (2004b). Parsing the WSJ using CCG and log-linear models. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL '04), Barcelona, Spain.

Clark, S., Hockenmaier, J., & Steedman, M. (2002). Building deep dependency structures with a wide-coverage CCG parser. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA.

Clark, S., Steedman, M., & Curran, J. R. (2004). Object-extraction and question-parsing using CCG. In Proceedings of the SIGDAT Conference on Empirical Methods in Natural Language Processing (EMNLP-04), pp. 111–118, Barcelona, Spain.


Collins, M. (1999). Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

Copestake, A. (2003). Report on the design of RMRS. Tech. rep. D1.1b, University of Cambridge.

Copestake, A., Flickinger, D. P., Sag, I. A., & Pollard, C. (1999). Minimal recursion semantics: An introduction. Unpublished.

Curran, J. R., & Clark, S. (2003). Investigating GIS and smoothing for maximum entropy taggers. In Proceedings of the 11th Annual Meeting of the European Chapter of the Association for Computational Linguistics (EACL '03), pp. 91–98, Budapest, Hungary.

Curry, H. B., & Feys, R. (1958). Combinatory Logic, Vol. I. North Holland, Amsterdam.

Dijkstra, E. (1959). A note on two problems in connexion with graphs. Numerische Mathematik, 1, 269–271.

Eisner, J. (1996). Efficient normal-form parsing for combinatory categorial grammar. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pp. 79–86, Santa Cruz, CA.

Foreman, M., & McMichael, D. (2004). A meta-grammar for CCG. In Proceedings of the Australasian Language Technology Workshop, Sydney, Australia.

Gazdar, G., Klein, E., Pullum, G. K., & Sag, I. A. (1985). Generalized Phrase Structure Grammar. Harvard University Press, Cambridge, MA.

Gildea, D., & Hockenmaier, J. (2003). Identifying semantic roles using combinatory categorial grammar. In Proceedings of the EMNLP, Sapporo, Japan.

Grover, C., Lascarides, A., & Lapata, M. (2005). A comparison of parsing technologies for the biomedical domain. Natural Language Engineering, 11 (1), 27–65.

Hart, P. E., Nilsson, N. J., & Raphael, B. (1968). A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, SSC-4 (2), 100–107.

Hart, P. E., Nilsson, N. J., & Raphael, B. (1972). Correction to “A formal basis for the heuristic determination of minimum cost paths”. SIGART Newsletter, 37, 28–29.

Hockenmaier, J. (2003a). Data and Models for Statistical Parsing with Combinatory Categorial Grammar. Ph.D. thesis, Edinburgh University.

Hockenmaier, J. (2003b). Parsing with generative models of predicate-argument structure. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan.

Hockenmaier, J., & Steedman, M. (2002a). Acquiring compact lexicalized grammars from a cleaner treebank. In Proceedings of the Third International Conference on Language Resources and Evaluation, Las Palmas, Spain.

Hockenmaier, J., & Steedman, M. (2002b). Generative models for statistical parsing with combinatory categorial grammar. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA.


Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press, Cambridge, UK.

Johnson, M., Geman, S., Canon, S., Chi, Z., & Riezler, S. (1999). Estimators for stochastic unification-based grammars. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL '99), College Park, MD.

Joshi, A. K., Levy, L. S., & Takahashi, M. (1975). Tree adjunct grammars. Journal of Computer and System Sciences, 10 (1).

Kaplan, R. M., Riezler, S., King, T. H., Maxwell, J. T., Vasserman, A., & Crouch, R. (2004). Speed and accuracy in shallow and deep stochastic parsing. In Proceedings of HLT/NAACL, Boston, MA.

Karp, D., Schabes, Y., Zaidel, M., & Egedi, D. (1992). A freely available wide coverage morphological analyzer for English. In Proceedings of the 15th International Conference on Computational Linguistics (COLING 1992).

Karttunen, L. (1989). Radical lexicalism. In Baltin, M., & Kroch, A. (Eds.), Alternative Conceptions of Phrase Structure. Chicago University Press, Chicago.

Kasami, T. (1965). An efficient recognition and syntax-analysis algorithm for context-free languages. Tech. rep. AFCRL-65-758, Air Force Cambridge Research Lab, Bedford, MA.

Kingsbury, P., Palmer, M., & Marcus, M. (2002). Adding semantic annotation to the Penn treebank. In Proceedings of the Human Language Technology Conference (HLT '02).

Kingsbury, P., & Palmer, M. (2003). PropBank: The next level of treebank. In Proceedings of Treebanks and Lexical Theories, Växjö, Sweden.

Klein, D., & Manning, C. D. (2001a). A* parsing: Fast exact Viterbi parse selection. Tech. rep. dbpubs/2002-16, Stanford University.

Klein, D., & Manning, C. D. (2001b). An O(n^3) agenda-based chart parser for arbitrary probabilistic context-free grammars. Tech. rep. dbpubs/2001-16, Stanford University.

Klein, D., & Manning, C. D. (2003a). A* parsing: Fast exact Viterbi parse selection. In Proceedings of HLT-NAACL 03.

Klein, D., & Manning, C. D. (2003b). Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan.

Klein, D., & Manning, C. D. (2003c). Factored A* search for models over sequences and trees. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence.

Komagata, N. (1999). Information Structure in Texts: A Computational Analysis of Contextual Appropriateness in English and Japanese. Ph.D. thesis, Computer and Information Science, University of Pennsylvania.

Liu, D. C., & Nocedal, J. (1989). On the limited memory BFGS method for large-scale optimization. Mathematical Programming, 45, 503–528.


Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19, 313–330.

Matsumoto, T., Powers, D., & Jarrad, G. (2003). Application of search algorithms to natural language processing. In Proceedings of the Australasian Language Technology Workshop, Melbourne.

McMichael, D., Jarrad, G., & Williams, S. (2006). Situation assessment with generalised grammar. In press.

Miyao, Y., & Tsujii, J. (2004). Deep linguistic analysis for the accurate identification of predicate-argument relations. In Proceedings of COLING 2004, pp. 1392–1397.

Molla, D., & Hutchinson, B. (2003). Intrinsic versus extrinsic evaluations of parsing systems. In Proceedings of the European Association for Computational Linguistics (EACL) workshop on Evaluation Initiatives in Natural Language Processing, pp. 43–50, Budapest.

Oepen, S., Toutanova, K., Shieber, S., Manning, C., Flickinger, D., & Brants, T. (2002). The LinGO Redwoods treebank: Motivation and preliminary applications. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), pp. 1253–1257, Taipei, Taiwan.

Park, J. C. (1992). A unification-based semantic interpretation for coordinate constructs. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, Newark, Delaware.

Pradhan, S., Ward, W., Hacioglu, K., Martin, J., & Jurafsky, D. (2005). Semantic role labeling using different syntactic views. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), pp. 581–588, Ann Arbor, Michigan.

Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J. (1985). A Comprehensive Grammar of the English Language. Longman.

Ratnaparkhi, A. (1996). A maximum entropy part-of-speech tagger. In Proceedings of the EMNLP Conference, pp. 133–142, Philadelphia, PA.

Ratnaparkhi, A. (1997). A linear observed time statistical parser based on maximum entropy models. In Cardie, C., & Weischedel, R. (Eds.), Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pp. 1–10, Somerset, New Jersey. Association for Computational Linguistics.

Ratnaparkhi, A. (1999). Learning to parse natural language with maximum entropy models. Machine Learning, 34 (1–3), 151–175.

Riezler, S., King, T. H., Kaplan, R. M., Crouch, R., Maxwell, J. T., & Johnson, M. (2002). Parsing the Wall Street Journal using a lexical-functional grammar and discriminative estimation techniques. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA.

Riezler, S., & Vasserman, A. (2004). Incremental feature selection and ℓ1 regularization for relaxed maximum-entropy modeling. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP '04), Barcelona, Spain.


Ritchie, A. (2004). Compatible RMRS representations from RASP. Tech. rep. D1.3, University of Cambridge, Computer Laboratory.

Sag, I. A., Wasow, T., & Bender, E. M. (2003). Syntactic Theory: A Formal Introduction (2nd edition). CSLI Publications, Stanford.

Schneider, G., Dowdall, J., & Rinaldi, F. (2004). A robust and hybrid deep-linguistic theory applied to large-scale parsing. In COLING-2004 workshop on Robust Methods in Analysis of Natural Language Data, Geneva, Switzerland.

Shen, L., & Joshi, A. K. (2005). Building an LTAG treebank. Tech. rep. MS-CIS-05-15, CIS Department, University of Pennsylvania.

Steedman, M. (1996). Surface Structure and Interpretation. No. 30 in Linguistic Inquiry Monographs. MIT Press, Cambridge, MA.

Steedman, M. (2000). The Syntactic Process. MIT Press.

Steedman, M. (1990). Gapping as constituent coordination. Linguistics and Philosophy, 13, 207–264.

Steedman, M. (1999). Alternating quantifier scope in CCG. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 301–308.

Swift, M., Allen, J., & Gildea, D. (2004). Skeletons in the parser: Using a shallow parser to improve deep parsing. In Proceedings of the 20th International Conference on Computational Linguistics (COLING '04), Vol. 1, pp. 383–389, Geneva, Switzerland.

Toutanova, K., Haghighi, A., & Manning, C. (2005a). Joint learning improves semantic role labeling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), pp. 589–596, Ann Arbor, Michigan.

Toutanova, K., Manning, C. D., Flickinger, D., & Oepen, S. (2005b). Stochastic HPSG parse disambiguation using the Redwoods corpus. Research on Language and Computation, 3 (1), 83–105.

Uszkoreit, H. (1986). Categorial unification grammars. In Proceedings of the 11th International Conference on Computational Linguistics (COLING), pp. 187–194.

Vijay-Shanker, K., & Weir, D. (1993). Parsing some constrained grammar formalisms. Computational Linguistics, 19, 591–636.

Watkinson, S., & Manandhar, S. (2001). Translating treebank annotation for evaluation. In Proceedings of the workshop on Computational Natural Language Learning (CoNLL-2001), Toulouse, France.

Zettlemoyer, L., & Collins, M. (2005). Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI-05).
