
Introduction to Natural Language Syntax and Parsing

Lecture 1: Automatic Linguistic Annotation

Stephen Clark

September 29, 2015

Automatic Linguistic Annotation We would like to automatically annotate linguistic units (typically sentences) with some linguistic structure, in order to facilitate various NLP tasks and applications, such as (semantic) search, question answering, information extraction, machine translation, and so on. We might also want to model some aspects of linguistic structure for linguistic, or cognitive science, reasons, but in this part of the course we’ll be using NLP — with more of an engineering focus — as the main motivation.

Sentence Segmentation One of the first tasks in any NLP pipeline is often sentence segmentation – breaking the document up into sentences. This may appear trivial — e.g. just split on periods — but the period-splitting heuristic is not going to work. In English, periods serve a number of functions, for example to mark abbreviations, as in the Dr. example. One approach to this problem is to manually write a number of rules, e.g. split on a period unless the period follows Dr. or Mr. or Mrs. or... The difficulty with this approach is that the rule set, in order to cover all the cases, soon becomes unwieldy and difficult to modify and maintain. Hence, as in the rest of NLP, a machine learning approach is often taken, using manually segmented sentences as training data.
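As a tiny illustration of the rule-based option (my own sketch, with an invented abbreviation list; this is exactly the kind of rule set that becomes unwieldy as the exceptions accumulate):

    # A small, hand-written abbreviation list (illustrative only).
    ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Prof.", "e.g.", "i.e."}

    def split_sentences(text):
        """Split on sentence-final periods, unless the period ends a known abbreviation."""
        sentences, current = [], []
        for token in text.split():
            current.append(token)
            if token.endswith(".") and token not in ABBREVIATIONS:
                sentences.append(" ".join(current))
                current = []
        if current:
            sentences.append(" ".join(current))
        return sentences

    print(split_sentences("Dr. Black saw the patient. She recovered quickly."))
    # ['Dr. Black saw the patient.', 'She recovered quickly.']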

One general comment, applying to all 8 lectures, is worth making here: in the examples we’ll often be using well-edited text such as newspaper text or Wikipedia articles. However, there is currently a lot of interest in processing less well-edited text, such as tweets or other postings on social media, for example to determine whether customers are saying positive things about a particular product, or to predict stock market prices or influenza outbreaks. Hence one question you should keep asking yourself is: would this proposed method work on Twitter data?

Tokenisation (What’s a Word?) The next task in the canonical NLP pipeline is often tokenisation, the task of breaking the sentence into tokens. This is useful because we may have learnt token-level translation rules, for example, or we may be searching for information about Dr. Black, in which case it’s useful to know that the sentence contains the token Black, as opposed to Black’s. Having said that, there is currently interest in processing sentences at the character level, using neural network models, and not segmenting at the word level at all (or even at the sentence level).1

A second general comment about the 8 lectures applies here: the default language we will assume is English. However, there are a number of features of English which are not representative of the world’s languages. First, there are many languages, such as Turkish, which are much more morphologically complex than English. A single word in these languages can be used to express a concept which requires many words in English. Hence the questions of how to do tokenisation and parsing are rather different for these languages. Second, there are a number of languages — the canonical example being Chinese — which do not use spaces to separate the words. In fact, the very notion of a word is controversial in Chinese, and native speakers do not exhibit high agreement on where to place the spaces if asked to perform word segmentation. Despite this lack of agreement, there is a large literature on the task of Chinese word segmentation; for a recent paper see [3].

But even in English the question of how to do tokenisation is not always clear-cut. Should medal-winning be one token or two? It perhaps depends on the application. If we have a translation for medal-winning, then it makes sense to keep it as a single token when doing translation. If we’re looking for information about what Dr. Black has won, then splitting it may make sense. These questions arise in particular when processing biomedical text, which uses a lot of characters, such as hyphens, outside of the standard alphabet. Hence two more questions you should keep asking yourself are: will it work for Chinese/Turkish/Swahili, and will it work for biomedical text?
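A sketch of how these choices surface in code (my own illustration; the regular expression and the split_hyphens switch are invented for the example, not taken from any particular tokeniser):

    import re

    def tokenise(sentence, split_hyphens=False):
        # Keep abbreviation-final periods with the word (Dr.), split off the
        # possessive 's and trailing punctuation, otherwise split on whitespace.
        pattern = r"[A-Z][a-z]{1,3}\.|'s|[\w-]+|[^\w\s]"
        tokens = re.findall(pattern, sentence)
        if split_hyphens:
            tokens = [part for tok in tokens for part in re.split(r"(-)", tok) if part]
        return tokens

    print(tokenise("Dr. Black's medal-winning run."))
    # ['Dr.', 'Black', "'s", 'medal-winning', 'run', '.']
    print(tokenise("Dr. Black's medal-winning run.", split_hyphens=True))
    # ['Dr.', 'Black', "'s", 'medal', '-', 'winning', 'run', '.']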

Part-of-Speech Tagging The next stage is part-of-speech tagging, where we begin to add grammatical structure which is not overtly realised in the sentence. You have learnt a lot about possible tagging schemes in the other half of the course. The task of assigning the tags, which can be thought of as a sequence labelling problem from machine learning, is a classic task in NLP. For well-edited English text for which there is plenty of manually annotated data to learn from, e.g. newspaper text, and for relatively small tag sets, e.g. the Penn Treebank tagset, POS tagging is close to being a solved problem (although not completely solved). Again, tagging for Twitter and biomedical text is harder. There are a number of freely available POS taggers. The Stanford tagger is one of the most widely used [2].
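For a sense of what the output looks like, here is a minimal example using NLTK’s off-the-shelf tagger rather than the Stanford tagger cited above (an assumption of mine: it requires NLTK and its tagger models to be installed, and the exact tags shown are only indicative):

    import nltk

    # One-off downloads of the NLTK tokeniser and tagger resources.
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    tokens = nltk.word_tokenize("Dr. Black saw the man with the telescope.")
    print(nltk.pos_tag(tokens))
    # e.g. [('Dr.', 'NNP'), ('Black', 'NNP'), ('saw', 'VBD'), ('the', 'DT'),
    #       ('man', 'NN'), ('with', 'IN'), ('the', 'DT'), ('telescope', 'NN'), ('.', '.')]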

Syntactic Parsing - Phrase Structure Now we begin to see some hierarchical structure. I will say more about phrase structure, and the resource typically used to build phrase-structure parsers, in the next lecture.

1 In order to keep the number of readings to a manageable level, I do not always include references when referring to the literature; but if you wanted to find a recent paper about character-level parsing, for example, a Google search for “character level chinese dependency parsing” will do the trick.


Semantic Parsing - Logical Form If we wanted to be really ambitious, we could try and construct a logical form. The example is from a semantic analysis tool called Boxer, which builds on the Combinatory Categorial Grammar parser we’ll hear more about later in the course. You’ll also learn more about semantic analysis and interpretation in the other half of the course. The details aren’t too important at this stage, except to say that the example is a pretty-print of a representation which is essentially a piece of first-order logic. The advantage of translating into logic is that there are ready-made inference procedures, and tools, available which are straightforward to implement on a computer. The disadvantage, as is known from decades of work in AI which attempts to translate natural language into some formal language, is that inferences in NLP typically require large amounts of linguistic and world knowledge, even if the translation into logic can be successfully automated.

Syntactic Parsing - Dependency Structure This is the representation that we’re going to focus on in this half of the course, for reasons I’ll give in the next lecture. One interesting feature of the example is that the dependency links cross – more on this later.

Why is Parsing Difficult? Natural languages exhibit many different structures. For phrase structure grammars in particular, many different grammatical rules are needed to cover all these structures. Obtaining those rules, either manually through an expert linguist writing the grammar, or (semi-)automatically through some learning procedure from corpus data, is a challenging task.

The second reason parsing is difficult is perhaps one of the more surprising properties of natural languages that NLP has uncovered in the last few decades. Natural languages exhibit large amounts of syntactic ambiguity. The reason we don’t see it as humans is because our own language processors are extremely effective at using context to perform the disambiguation. Note also that, as grammars become more comprehensive, solving the first problem, this only increases the level of ambiguity, making the second problem even worse.

Syntactic Ambiguity The classic textbook example of syntactic ambiguity is John saw the man with the telescope. Is John looking through the telescope at the man, or is the man holding the telescope? The two semantic readings result from different syntactic parse trees.

Syntactic Ambiguity: the problem is worse than you think The classic example is useful, because the ambiguity is easy for humans to see, but also misleading. To resolve it would require contextual representations, world knowledge, and general reasoning capabilities currently not available. It’s also the case that either reading could be possible. Natural language ambiguity is pernicious, because it’s precisely the cases we don’t easily perceive, such as John ate the pizza with a fork, which cause the problems for current parsing technology. Here only one of the readings is plausible, and it’s the job of the parser to decide which one.

Syntactic Ambiguity: the problem is even worse than that It’s not just the existence of “hidden” ambiguity which is the problem, it’s the fact there is so much of it. Many constructions, including PP attachment, coordination, and relative clause attachment, lead to alternative possibilities which multiply when chained together. In the example sentences, the number of possible analyses grows exponentially with the number of PPs, following the Catalan series. Of course natural languages don’t contain sentences quite like these, but they do exhibit chains of attachment decisions which have this property. A favourite moment from my own research occurred when calculating, with James Curran, the number of parses for a long newspaper sentence given by our CCG parser (using an efficient dynamic programming technique so the counting can be performed exactly). The number was close to the Avogadro constant you may have encountered in high school chemistry, 6.022 × 10^23.
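To make the growth concrete: the Catalan numbers C_n = (1/(n+1)) * C(2n, n) count binary bracketings, and a few lines of Python (my own illustration) show how quickly an attachment chain of this kind explodes:

    from math import comb

    def catalan(n):
        # n-th Catalan number: the number of binary bracketings of n+1 items,
        # which also counts the analyses of a chain of n attachment decisions.
        return comb(2 * n, n) // (n + 1)

    for n in range(1, 11):
        print(n, catalan(n))
    # 1, 2, 5, 14, 42, ... already 16796 analyses for a chain of 10 attachments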

Readings for Today’s Lecture In addition to the references below, Chapters 9 and 10 of Manning and Schütze, and Chapter 5 of Jurafsky and Martin (2nd Ed.), are useful readings for POS tagging. Only part of the Zhang and Clark reference below is relevant for today’s lecture, but the whole article will become relevant as the course progresses. A useful reference which covers much of the course is my book chapter on statistical parsing [1]. All papers are freely available on the web.

References

[1] Stephen Clark. Statistical parsing. In Clark, Fox, and Lappin, editors, Handbook of Computational Linguistics and Natural Language Processing, pages 333–363. Blackwell, 2010.

[2] Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the HLT/NAACL conference, pages 252–259, Edmonton, Canada, 2003.

[3] Yue Zhang and Stephen Clark. Syntactic processing using the generalized perceptron and beam search. Computational Linguistics, 37(1):105–151, 2011.


Introduction to Natural Language Syntax and Parsing

Lecture 2: Introduction to Statistical Parsing

Stephen Clark

September 30, 2015

Automatic Parsing One way to characterise the natural language parsing problem is in terms of these three questions:

• Where does the grammar come from?

• What’s the algorithm for generating possible parses?

• How do we decide between all the parses?

For the first question, there are three possibilities (and various combinations in between). One option is to have a linguist hand-code the grammar, typically using a particular linguistic formalism such as HPSG. The resulting grammars are often referred to as precision grammars, since the information contained in them is detailed and precise, resulting in rich parse representations. The downside with this approach, common to all rule-based approaches to NLP, is that it is difficult to manually create grammars in this way which are robust and can apply to a wide variety of textual input.

The second approach, which is currently dominant in the literature, is to have a linguist annotate sentences with the desired parse output, and then learn the grammar automatically from this annotation. The set of annotated parses is often referred to as a treebank. This is the approach we’ll be following in this course.

The third approach is to try and learn a grammar entirely from raw, or POS-tagged, text, with some biases encoded in the model (either implicitly or explicitly) to help with the learning process. In some respects the third option is the most desirable, because it does not require the costly human annotation associated with treebank creation and it is (perhaps) the closest to how humans learn language. However, learning grammars from raw or POS-tagged text has proven to be extremely difficult, and the performance of the resulting unsupervised parsers is way below that of their supervised counterparts which learn from manual annotation.

In terms of the parsing algorithm, there are a number of possibilities. For dependency parsing, which is the focus of these lectures, there are two dominant approaches: graph-based and transition-based. The graph-based method uses a data structure called a chart to record all ways in which the words in the input sentence can be linked together. The transition-based method uses a queue and a stack to combine words, processing them from left to right by shifting words off the queue, and combining words on the stack.

Because of the massive ambiguity problem that was described in Lecture 1, we need a way to select one of the parses (or perhaps rank them and output a scored subset). For scoring we use a parsing model, trained on the available annotated data. Again there are many possibilities here — both probabilistic and non-probabilistic — but in these lectures we will focus on a simple (non-probabilistic) linear model, which is easy to train and yet surprisingly effective.

The Penn Treebank The Penn Treebank was created in the early 90s [2] and immediately sparked a parsing “competition” that is still continuing today. The part that has been used in this competition contains around 1M words of newswire text, manually annotated with phrase-structure trees. The annotation also contains traces and empty elements, marking aspects of predicate-argument structure which are not overtly realised in the surface sentence; however, the majority of work using the treebank has ignored this extra annotation. The treebank took around 3 years to create, using a handful of annotators, based on a detailed set of annotation guidelines. See my book chapter [1] for more commentary on the history of statistical parsing and the effect the treebank has had on NLP research (demonstrating the importance of resources for the advancement of the field more generally).

The treebank can also be used to generate a “dependency bank”, consisting of pairs of sentences with dependency, rather than phrase-structure, trees, which can then be used to train a dependency parser. The dependencies are created using the notion of linguistic heads, which can be heuristically recovered from the phrase-structure annotation, and then used to generate head dependencies between words.

Problems with the PTB Parsing Task There is no doubting the influence that the PTB parsing task has had on NLP research, not just for parsing, but also for other tasks such as statistical machine translation where parsing models have been applied. However, some researchers and commentators became disillusioned with the central position that the PTB parsing task acquired in NLP research, arguing that the focus on English newswire text was detrimental to the field. There is also the problem that the same test set — roughly 2,400 sentences from Section 23 — has been used continually, not only by the same researchers, but by the field as a whole. Hence there is the possibility that the field has been implicitly fitting models to the test set, even if not explicitly “cheating” by directly observing it during training and development.

In contrast, the dependency parsing community has developed a number of dependency banks, for many different languages, and also different domains within the same language. It is often suggested that manually annotating dependency structure is easier than annotating phrase structure, and therefore dependency banks are easier to create (although I don’t know of any scientific studies demonstrating this empirically). One of the reasons that Google has adopted dependency parsing as its main parsing paradigm is undoubtedly the availability of training data in many languages.

Dependency Parsing Head dependencies of the sort shown in the examples have proven useful for a variety of NLP tasks, such as Information Extraction, Question Answering and Machine Translation. The reason is that they provide an approximation to the underlying predicate-argument structure, expressing roughly who did what to whom. Recovery of the dependencies can also be performed accurately and efficiently (although the goal of 100% accuracy is still a long way off, especially with the more difficult dependencies arising from e.g. PP attachment and coordination).

Another possible reason for the success of dependency parsing is that the formalism is easy to understand, unlike, say, CCG, which we’ll study later in the course. That’s not to say that dependency grammar isn’t a serious syntactic theory in linguistics — since it is — only that computer scientists with little linguistic training can easily understand it. Another possible advantage is that dependency parsers are almost entirely data-driven, in the sense that the knowledge required to parse unseen sentences is acquired entirely from a treebank, with little or no manual intervention.

Dependency Trees Dependency graphs are graphs in a mathematical sense: sets of edges and nodes, where the nodes are the words in the sentence. A dummy word — $ in the example — is often placed at the beginning of the sentence to act as a dummy root of the graph. The graph is directed, since the notion of dependency incorporates the notion of head, with each edge pointing from head to dependent. A lot of the evaluations in dependency parsing use unlabelled graphs, but in practice it is likely to be useful to have grammatical labels on the edges, such as subj (subject), mod (modifier), and so on. Dependency graphs are typically restricted to be trees, so that each node has only one parent. Whether this is desirable from a linguistic viewpoint is debatable, since some constructions, such as control, result in some arguments having more than one parent. However, performing computations with trees is generally easier than with graphs.
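Under the single-head restriction, a dependency tree can be stored as one head index per word; the sketch below is my own illustration, using a plausible analysis of the John hit the ball with the bat example discussed later in these notes (the labels are invented):

    # Words are indexed from 1; index 0 is the dummy root ($).
    words  = ["$", "John", "hit", "the", "ball", "with", "the", "bat"]

    # heads[i] is the index of the head of word i; labels[i] its grammatical relation.
    heads  = [None, 2, 0, 4, 2, 2, 7, 5]
    labels = [None, "subj", "root", "det", "obj", "mod", "det", "pobj"]

    for i in range(1, len(words)):
        print(f"{words[heads[i]]} -{labels[i]}-> {words[i]}")
    # e.g. "hit -subj-> John", "$ -root-> hit": one parent per word, so it is a tree.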

Dependency Trees more Formally As well as the restriction to trees, dependency graphs are typically restricted in other ways, again to ease any computation (such as finding the highest-scoring tree). Dependency trees are often defined to be connected (so it’s possible to reach any node from any other node by moving along the edges, ignoring the direction of the edges); acyclic, so that it’s not possible to return to any node by moving along the edges; single-head, so that each node only has one parent; and projective, so that none of the edges, when the graph is written in two dimensions, cross.


Crossing Dependencies The example shows a case in English with crossing dependencies, although cases such as these are relatively rare in English. In fact, the dependency banks for English typically consist entirely of projective dependencies. However, crossing dependencies are not rare in other languages, such as Czech, German and Dutch. The Wang and Zhang tutorial states that 23% of the sentences in the Prague Dependency Treebank of Czech contain at least one crossing dependency.

Graph-Based Models Graph-based models use scoring functions defined over the graphs (as opposed to shift-reduce models which score the parsing actions). Then a decoding algorithm is used to find the highest-scoring graph. Much of the research in dependency parsing is concerned with devising optimal, efficient decoding algorithms using dynamic programming over the graph (so that the decoder is guaranteed to return the highest-scoring graph). This is in contrast to shift-reduce approaches, which typically use heuristic search in combination with very rich feature sets.

Edge-Based Factorisation Model In order to define efficient dynamic programming decoders over the graph, it is crucial that the features used to define the model are kept sufficiently local to the edges. The extreme version of this idea, resulting in a first-order model, is to define local scoring functions which only look at a single edge (but which are allowed to look at any part of the sentence). Then the score for the whole graph is the sum of the scores for each edge.

Edge-Based Linear Model There are various ways of defining the local scoring function, both probabilistic and non-probabilistic. Here we will use a simple, non-probabilistic form for the scoring function: a linear model. The features are typically indicator functions, taking the value zero or one, picking out particular aspects of the edge. Very rich models are required for good performance, resulting in millions of different features, each with a corresponding weight which needs to be estimated. However, recent work using neural networks shows how to obtain good performance without the need for so many features.

Example Features A later lecture will look at the features in more detail, but I’d like to provide some intuition now in order to describe the decoder in the next lecture (which uses the local scores to find the highest-scoring graph). The key idea is that the features in a first-order model cannot span more than one edge, but they are allowed to look anywhere in the sentence. Mathematically features are binary-valued functions, but a more intuitive way to think of a feature is that it captures a particular pattern in the graph. For example, saw VBD duck NN captures the presence of a particular edge in the graph. VBD PRP$ NN captures a particular sequence of POS tags between the words making up the edge. Note the extensive use of POS tags in the features, which is designed to overcome the extreme sparsity that would result from only defining features in terms of words.

Readings for Today’s Lecture There are various freely available tutorials on dependency parsing. The one that I have been stealing pictures from is the following:

• Recent Advances in Dependency Parsing, Qin Iris Wang and Yue Zhang. NAACL Tutorial, Los Angeles, June 1, 2010. http://naaclhlt2010.isi.edu/tutorials/t7-slides.pdf

References

[1] Stephen Clark. Statistical parsing. In Clark, Fox, and Lappin, editors, Handbook of Computational Linguistics and Natural Language Processing, pages 333–363. Blackwell, 2010.

[2] Mitchell Marcus, Beatrice Santorini, and Mary Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.


Introduction to Natural Language Syntax and Parsing

Lecture 3: Graph-Based Dependency Parsing

Stephen Clark

October 13, 2015

Untyped Dependency Trees Much of the literature on dependency parsing is concerned with untyped dependency trees, where the edges between words are not labelled with grammatical relations. We’ll also consider the untyped case, although extending the various parsing algorithms to deal with typed edges is straightforward.

The example on the slide shows a projective dependency tree, with an alternative definition of projectivity. The definition given so far is that a tree is projective iff the tree can be drawn in two dimensions without any edges crossing. An equivalent definition is that a tree is projective iff an edge from word w to word u implies that w is an ancestor of all words between w and u. For example, consider the edge from hit to with: all the words in between can also be reached from hit (i.e. hit is an ancestor of all of them). Now imagine that there is an edge from hit to the second the, i.e. a crossing edge in the example, replacing the edge between bat and the. This ruins the projectivity, since there is an edge from with to bat, but the word the in between with and bat is no longer a descendant of with.
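This ancestor-based definition translates directly into a short check over a head-index representation (my own sketch; heads[i] is the head of word i, with 0 for the dummy root and a placeholder at index 0):

    def is_projective(heads):
        """heads[i] = head index of word i (1-based words, 0 = dummy root)."""
        def is_ancestor(a, d):
            # Follow head links upwards from d until we reach a or the root.
            while d != 0:
                d = heads[d]
                if d == a:
                    return True
            return False
        for dep in range(1, len(heads)):
            head = heads[dep]
            lo, hi = sorted((head, dep))
            # Every word strictly between head and dependent must have the
            # head as an ancestor.
            if any(not is_ancestor(head, k) for k in range(lo + 1, hi)):
                return False
        return True

    # John hit the ball with the bat (projective analysis)
    print(is_projective([0, 2, 0, 4, 2, 2, 7, 5]))   # True
    # Attach the second 'the' to 'hit' instead of 'bat': the edges now cross.
    print(is_projective([0, 2, 0, 4, 2, 2, 2, 5]))   # False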

Edge-Based Linear Model As a reminder, we’re considering first-order edge-based models where the score for a tree is the sum of individual scores for each edge; and the score for an edge is a linear sum defined as a dot product between a weight vector and a feature vector.

Dependency Parsing Formally This slide provides some notation for the edge-based linear model.

Maximum Spanning Trees The directed graph Gx, for sentence x, is a set of vertices (or nodes) Vx and a set of edges Ex. Vx is the set of words in x plus an additional dummy root node x0. Ex is the set of all possible directed edges between words in x, with the following exceptions: there are no reflexive edges (i.e. an edge from a word to itself), and x0 cannot be the child of an edge.


The reason for considering Gx is that finding the highest-scoring dependency tree for x is equivalent to a well-known problem in graph theory, namely finding the maximum spanning tree (MST) in Gx. Finding the MST is also known as the maximum arborescence problem. Restricting the tree to be projective results in finding the MST which is also projective.

Decoding: finding the MST There is a classic algorithm from the 60s — the Chu-Liu-Edmonds algorithm — for finding the MST for non-projective trees, with an O(n^2) implementation. The projective case is computationally harder, because now we have to find trees that satisfy a particular set of constraints (corresponding to the projectivity). We’ll consider a straightforward adaptation of the chart-based CKY algorithm, which runs in cubic time for CFGs, but in O(n^5) time for dependency grammars. Eisner [1] introduced a variant of the chart-based algorithm which runs in cubic time for dependency grammars, and this is the one that is typically implemented in practice, for example in McDonald’s MST parser.

CKY-style Dependency Parsing The CKY algorithm operates bottom-up, using CFG rules of the form A → B C, where A, B and C are non-terminals from the CFG. The complexity of the algorithm is O(G^2 n^3), where G is a grammar constant related to the number of non-terminals, and n is the length of the sentence. An informal analysis is as follows: there are O(n^2) cells in the chart; for each cell we have to consider a number of split points, of which there are O(n); and for each split point we have to consider O(G^2) combinations of non-terminals (B and C on the RHS of the rule above).

A useful perspective on the dependency parsing problem is to consider each edge in a dependency tree as a CFG rule. Consider the edge (hits → ball). We can consider this edge as having arisen from the application of the CFG rule (hits → hits ball). So now the number of combinations of non-terminals — O(G^2) above — is no longer a constant but O(n^2), resulting in an overall complexity of O(n^5).
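To see where the O(n^5) comes from, the loop skeleton below (my own schematic; the chart updates are omitted and only iterations are counted) has two indices for the span, one for the split point, and one candidate head index on each side of the split:

    def cky_iteration_count(n):
        # Count the loop iterations of a naive CKY-style dependency parser.
        count = 0
        for i in range(0, n):                    # start of span
            for j in range(i + 1, n + 1):        # end of span
                for k in range(i + 1, j):        # split point
                    for h1 in range(i, k):       # head word of the left part
                        for h2 in range(k, j):   # head word of the right part
                            # a real parser would try the edges h1 -> h2 and
                            # h2 -> h1 here and update the chart accordingly
                            count += 1
        return count

    for n in (10, 20, 40):
        print(n, cky_iteration_count(n))
    # The counts grow like n^5: doubling the sentence length multiplies the work
    # by a factor approaching 2^5 = 32 (Eisner's algorithm brings this to O(n^3)).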

Why CKY is O(n^5) and not O(n^3) The example on the slide is designed to show that all possible pairs of heads have to be considered when deciding which edges to add to the chart (giving the additional O(n^2) complexity). Consider the phrase visiting relatives. If the sentence is ... advocate visiting relatives, then the dependency link is between advocate and visiting (since visiting is a verb and is the head of visiting relatives in this case). But if the sentence is ... hug visiting relatives, then the dependency link is between hug and relatives (since visiting is an adjective and relatives is the head of visiting relatives in this case).

Dependency Parsing Algorithms The slide summarises the various algorithms available for dependency parsing. We’ll be focusing on graph-based algorithms, but there is an alternative, namely shift-reduce parsing. The linear-time complexity of shift-reduce algorithms makes them an attractive alternative to graph-based chart parsing.

Shift-Reduce Dependency Parsing The example on the slides demonstrates one way of implementing a shift-reduce parser, with a set of four possible transition actions: { shift, reduce, arcLeft, arcRight }. The key data structures are the stack and the queue. The queue contains a list of words yet to be processed, and the stack contains partial trees as the complete tree is being built.
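To make the transition system concrete, here is a minimal sketch with exactly these four actions, in the style of Nivre’s arc-eager parser (my own rendering; the statistical classifier that chooses the actions is replaced by a hand-supplied action sequence):

    def parse(words, actions):
        queue = list(range(len(words)))   # word indices still to be processed
        stack = []                        # partially processed word indices
        heads = {}                        # dependent index -> head index
        for action in actions:
            if action == "shift":         # move the next word onto the stack
                stack.append(queue.pop(0))
            elif action == "reduce":      # pop a word that already has a head
                stack.pop()
            elif action == "arcLeft":     # top of stack depends on front of queue
                heads[stack.pop()] = queue[0]
            elif action == "arcRight":    # front of queue depends on top of stack
                heads[queue[0]] = stack[-1]
                stack.append(queue.pop(0))
        return heads

    words = ["John", "hit", "the", "ball"]
    actions = ["shift", "arcLeft", "shift", "shift", "arcLeft", "arcRight"]
    print(parse(words, actions))   # {0: 1, 2: 3, 3: 1}: John<-hit, the<-ball, ball<-hit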

Greedy Local Search Given a sentence, there are many possible sequences of transitions leading to a dependency tree (each possible tree has a separate transition sequence). One way to handle the ambiguity is to use a statistical classifier to make a single decision at each point in the parsing process, and stick with that decision. This is a greedy algorithm which is linear-time in the length of the sentence and potentially results in a very fast parser, substantially faster than the graph-based chart parser.

Beam Search The downside of the greedy approach is that, if the classifier makes a mistake, there is no way for the parser to recover later in the parsing process. One way to mitigate this problem is to use beam search instead, where K possible decisions — the K with the highest scores according to the classifier — are retained at each parsing step. Using beam search in this way typically results in a significant improvement in accuracy, with beam sizes of around 32 leading to a good trade-off between improved accuracy and loss in speed.

Shift-reduce parsing with beam search is still linear in the length of the sentence, but now has a constant associated with the size of the beam. So using a beam size of 64, say, would result in a significantly slower parser than the greedy parser.
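A compact sketch of the beam at work (my own illustration; the trained classifier is replaced by a toy scoring function and the legal-action logic is left abstract): at each step every hypothesis is extended by every available action, and only the K highest-scoring extensions are kept.

    import heapq

    def beam_parse(n_words, legal_actions, score_action, K=8):
        # Each hypothesis is a pair (score, action_sequence_so_far).
        beam = [(0.0, [])]
        for _ in range(2 * n_words):                 # roughly 2n transitions per sentence
            candidates = []
            for score, actions in beam:
                for action in legal_actions(actions):
                    candidates.append((score + score_action(actions, action),
                                       actions + [action]))
            if not candidates:
                break
            beam = heapq.nlargest(K, candidates)     # keep only the K best hypotheses
        return max(beam)                             # best-scoring hypothesis at the end

    # Toy usage with a dummy scorer that always prefers "shift".
    acts = ["shift", "reduce", "arcLeft", "arcRight"]
    best = beam_parse(3,
                      legal_actions=lambda history: acts,
                      score_action=lambda history, a: 1.0 if a == "shift" else 0.5,
                      K=4)
    print(best)   # (6.0, ['shift', 'shift', 'shift', 'shift', 'shift', 'shift'])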

Readings for Today’s Lecture

• Spanning Tree Methods for Discriminative Training of Dependency Parsers. Ryan McDonald, Koby Crammer and Fernando Pereira. UPenn CIS Technical Report: MS-CIS-05-11.

• Characterizing the Errors of Data-Driven Dependency Parsing Models. R. McDonald and J. Nivre. Empirical Methods in Natural Language Processing and Natural Language Learning Conference (EMNLP-CoNLL), 2007.

References

[1] Jason Eisner. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of the 16th COLING Conference, pages 340–345, Copenhagen, Denmark, 1996.


Introduction to Natural Language Syntax and Parsing

Lecture 4: The Perceptron Parsing Model

Stephen Clark

October 14, 2015

Edge-Based Linear Model A minimal feature set for a dependency parsing model would simply look at the words in an edge, and score the edge based on those words; for example, how likely is an edge from runs to lion? However, such a feature set would be extremely sparse, in the sense that very few of the large number of possible word pairs would appear in the training data (which is typically of the order of 1M words). Hence, in practice, dependency parsing models are much richer, in particular making extensive use of part-of-speech (POS) tags, which are less sparse than words. Features are defined over the edges themselves, in terms of both words and POS tags (and combinations of the two); but also in terms of the POS tags in between dependent words, and outside of dependent words.

Features in the MST Parser The table on the slide shows the feature set from McDonald’s MST parser [2]. The basic unigram features effectively ask questions such as: how likely is the word runs to be the parent of an edge? how likely is the POS tag NN to be the child of an edge? The bigram features effectively ask questions such as: how likely is the word runs with POS tag VBZ to be the parent of the word lion? The in-between and surrounding POS features are designed to capture patterns of POS tags seen frequently between, and either side of, dependent words. For example, the VBD PRP$ NN feature effectively asks the question: how likely is a dependency edge where the POS tag sequence between the two dependent words (inclusive) is VBD PRP$ NN?
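A sketch of first-order feature extraction in this spirit (a simplification of mine, not the actual MST templates; the feature names are invented):

    def edge_features(words, tags, head, dep):
        # Features for the edge head -> dep: they may look anywhere in the
        # sentence, but never at any other edge.
        feats = [
            f"hw={words[head]}",                      # unigram: head word
            f"dt={tags[dep]}",                        # unigram: dependent tag
            f"hw,ht={words[head]},{tags[head]}",      # head word + head tag
            f"hw,dw={words[head]},{words[dep]}",      # bigram: word pair
            f"ht,dt={tags[head]},{tags[dep]}",        # bigram: tag pair
        ]
        lo, hi = sorted((head, dep))
        between = " ".join(tags[lo:hi + 1])           # in-between tag sequence (inclusive)
        feats.append(f"between={between}")
        return feats

    words = ["I", "saw", "her", "duck"]
    tags  = ["PRP", "VBD", "PRP$", "NN"]
    print(edge_features(words, tags, head=1, dep=3))
    # [..., 'ht,dt=VBD,NN', 'between=VBD PRP$ NN']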

Global Linear Model In the edge-based linear model, the score for a whole tree is defined as the sum of the scores for each edge; and the score for an edge is a dot product between a local feature vector and a weight vector. The derivation on the slide shows that, because of the properties of sums of products, the score for a tree can be written as a dot product between a global feature vector and a weight vector (the same weight vector used in the local dot products).

The function fk is an indicator function which takes an edge as argument and has the value 1 or 0, depending on whether the edge displays a particular pattern. Here we overload fk by also having it take a whole tree τ as an argument, and return a non-negative integer, so that it now counts the number of times the corresponding pattern appears in the whole tree. If we encode all the values of the fk(τ) counting functions as a global feature vector F(τ), then the score for a tree τ can be written as the dot product between the weight vector and the global feature vector.
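Written out (my reconstruction of the derivation, with f(e) the local feature vector of edge e and F(τ) the global feature vector of tree τ):

    \[
    \mathrm{score}(\tau)
      \;=\; \sum_{e \in \tau} \mathbf{w} \cdot \mathbf{f}(e)
      \;=\; \mathbf{w} \cdot \sum_{e \in \tau} \mathbf{f}(e)
      \;=\; \mathbf{w} \cdot \mathbf{F}(\tau),
    \qquad
    F_k(\tau) \;=\; \sum_{e \in \tau} f_k(e).
    \]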

Generic Online Learning The pseudocode for a generic online learning algorithm shows the simplicity of the idea: start with a zero weight vector, use the current weight vector to decode each training instance in turn, updating the weight vector at each instance. For dependency parsing, the training instances are pairs of sentences and gold-standard dependency trees, taken from a dependency bank.

The final line outputs averaged weights [1], as a method to avoid overfitting. So v is just an accumulation of all the weight vectors encountered during the training process (N passes over T training instances), and w is the averaged output.

The Perceptron Update The perceptron update uses the highest-scoring dependency tree zt given the current weight vector wt−1; F(xt, zt) is the global feature vector for tree zt (and sentence xt). The update to the current weight vector is simple: add the global feature vector for the gold-standard tree, and take away the global feature vector for the tree returned by the decoder. Note that the update is passive, meaning that no update takes place if the decoder returns the gold-standard tree.

A useful intuition for the perceptron update can be given in terms of the local indicator features. The local features for a training instance can be divided into three sets: 1) those that are in the gold standard and returned by the decoder; 2) those that are in the gold standard and not returned by the decoder; and 3) those that are returned by the decoder but not in the gold standard. For the features in 1), their weights remain the same (no update); for the features in 2), a value of 1 is added to their weights; and for the features in 3), a value of 1 is taken from their weights. Intuitively the update is attempting to force the decoder to return correct features, and prevent it from returning the incorrect ones.
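A minimal sketch of the averaged-perceptron loop under these definitions (my own rendering: the decoder and the global feature function, for example the MST decoder of the previous lecture, are assumed as inputs, and the sparse-dict representation of feature vectors is my own choice):

    from collections import defaultdict

    def train_perceptron(instances, decode, global_features, n_passes=5):
        """instances: list of (sentence, gold_tree) pairs from a dependency bank;
        decode(sentence, weights) returns the highest-scoring tree;
        global_features(sentence, tree) returns a sparse dict of feature counts."""
        w = defaultdict(float)      # current weight vector
        v = defaultdict(float)      # accumulated weights, for averaging
        seen = 0
        for _ in range(n_passes):
            for sentence, gold in instances:
                predicted = decode(sentence, w)
                if predicted != gold:           # passive: no update if the decoder is correct
                    for feat, count in global_features(sentence, gold).items():
                        w[feat] += count        # reward features in the gold tree
                    for feat, count in global_features(sentence, predicted).items():
                        w[feat] -= count        # penalise features in the predicted tree
                seen += 1
                for feat, weight in w.items():
                    v[feat] += weight           # accumulate for the averaged output
        return {feat: total / seen for feat, total in v.items()}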

CoNLL Shared Task Data The table on the slide is taken from [4]. The point of the table is to demonstrate the range of languages for which dependency parsers can be built and evaluated (Arabic, Basque, Catalan, Chinese, Czech, English, Greek, Hungarian, Italian, and Turkish). Note that the amount of training data varies considerably across languages, from 51,000 tokens for Basque to 447,000 tokens for English. Notice also the variation in the proportion of non-projective sentences, from 0% for Chinese to over 30% for Turkish.


Graph-based vs. Transition-based The table, from [3], compares the accuracy of the graph-based MST parser with the transition-based Malt parser (for the 2006 CoNLL shared task data). Interestingly the accuracies are similar across languages, although what the paper shows is that the different parsing architectures do lead to different parsing errors, paving the way for a fruitful ensemble (or combination).

The table also shows how the accuracies for some languages are much higher than for others. Whilst this may suggest that some languages are intrinsically harder to parse than others, this conclusion needs to be reached with care, since a number of other factors, such as the size of the corresponding dependency banks, need to be taken into account.

State-of-the-Art (2015) The New York Google parsing team published a paper in 2015 with the best reported results on English dependency parsing [5]. The parsing algorithm is shift-reduce, the training method is the perceptron, but what is interesting about the parser is that a neural network is used to automatically extract the features. One dissatisfying aspect of previous dependency parsers is the huge number of features required for top performance – as many as 30M in some cases! The neural network-based parsers use similar feature templates, extracting similar information, but because the information is distributed across dense feature vectors, rather than “one-hot” vectors, the effective number of features is greatly reduced.

Any empirical result from industrial labs such as Google has to be interpreted in the context of the vast resources available in such places, which can be used to great effect in, for example, tuning hyperparameters. However, there are currently many papers being published showing the benefits of the distributed representations in neural networks (NNs), and parsing accuracies are likely to continue to increase for a few years yet, using NN approaches.

Accuracy League Table (2015) The current best-performing parsers on English newspaper data are transition-based parsers, but the difference is relatively small. It’s possible that similar accuracies could be achieved with graph-based parsers using neural network models. The numbers may look impressively high — almost 94% for unlabelled parsing, and over 92% for labelled parsing — but bear in mind these are aggregate scores across all dependency types, including the easy ones such as determiner-noun edges. Accuracies for some individual dependency types, such as coordination and prepositional phrase attachment, are still much lower, and the overall parsing problem is still far from being solved.

One interesting feature of the latest transition-based parsing results is how the accuracies vary by beam size. For the neural network parsers, the fully greedy approach with a beam size of 1 is not far behind the scores with larger beam sizes, and the best reported results are for a beam size as low as 8. Recall that the beam size has a direct impact on the speed of the parser, which is an important consideration in the context of a web-search company.


Readings for Today’s Lecture The McDonald technical report is still the core reading, although the online update described there is more complicated than the simple perceptron update given in the lecture. For a description of the perceptron, see [1], which shows how to apply perceptron models to the POS tagging problem (although the same techniques can easily be adapted to dependency parsing).

References

[1] Michael Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of EMNLP, pages 1–8, Philadelphia, USA, 2002.

[2] Ryan McDonald, Koby Crammer, and Fernando Pereira. Spanning tree methods for discriminative training of dependency parsers. Technical report, University of Pennsylvania, 2005.

[3] Ryan McDonald and Joakim Nivre. Analyzing and integrating dependency parsers. Computational Linguistics, 37(1), 2011.

[4] J. Nivre, J. Hall, S. Kübler, R. McDonald, J. Nilsson, S. Riedel, and D. Yuret. The CoNLL 2007 shared task on dependency parsing. In Conference on Empirical Methods in Natural Language Processing and Natural Language Learning, 2007.

[5] David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. Structured training for neural network transition-based parsing. In Proceedings of the 53rd Annual Meeting of the ACL, Beijing, China, 2015.


Introduction to Natural Language Syntax and Parsing

Lecture 5: Categorial Grammar

Stephen Clark

October 15, 2015

Categorial Grammar (CG) Categorial Grammar is a linguistic theory in the lexicalist tradition, along with other grammar formalisms such as Tree Adjoining Grammar, Lexical Functional Grammar and Head Driven Phrase Structure Grammar, and in contrast to the earlier transformation-based theories of Chomsky. The key idea is that most of the information required for specifying legal syntactic structures within a language resides in the lexicon, and a small number of additional rules specify how to combine such structures.

This lexicon-centered perspective is attractive for a number of reasons. From a theoretical viewpoint, it helps explain how children are able to learn languages: the small number of language-independent combination rules could be innate, and the structures residing in the lexicon, which vary across languages, can be learnt by exposure to the language. From a practical viewpoint, the rich syntactic structures applying at the word level can be assigned to words using highly efficient sequence labelling methods (often referred to as supertagging).

Connection with Semantics The other half of this course deals with compositional semantics, but it’s worth pointing out that one of the attractions of (Combinatory) Categorial Grammar — and one of the reasons it is still an active research area in the ACL community — is the close connection to compositional semantics. The tight interface between the syntactic derivations and the underlying semantic (predicate-argument) structure holds the promise of building representations which can be used for “deep” natural language understanding tasks.

Contrast with Phrase-Structure Rules Categorial grammar captures similar information to that contained in traditional phrase-structure rules, but the information is encoded in the lexical categories which reside at the leaves of the derivation tree. (Informally, I like to think of the information being “pushed down” a traditional phrase-structure tree onto the complex types at the leaves.)

Categorial grammar is a relatively old linguistic formalism, pre-dating Chomskian linguistics and appearing as early as the 1930s in the work of Polish mathematicians. For the reader interested in theoretical computer science, the notion of grammatical type being used here relates to that used in theoretical CS; and the Combinatory in Combinatory Categorial Grammar is taken from Curry and Feys’ combinatory logic.

Lexical Categories Lexical categories assigned to words represent elementary syntactic structures. The idea is that lexical categories encode the combinatory potential of words to combine with other words, based on their types.

Lexical categories are either atomic or complex. The set of atomic categories is typically small, for example { S, N, NP, PP }. Complex categories are built recursively from atomic categories and slashes (forward or backward), where the slash indicates the direction of the argument. The key intuition with a complex category is to think of it as a function. For example, the transitive verb category (S\NP)/NP is to be thought of as a function that requires an NP to the right (its object), an NP to the left (its subject), and that returns a sentence S.

A Simple CG Derivation Since the transitive verb category is a function, it can combine with arguments using function application. The bracketing in the complex category means that it has to combine with (or apply to) the argument NP to its right first — using so-called forward application — and then the argument NP to its left — using backward application. The intermediate category — S\NP — is the type of a verb phrase in English, and can be thought of as a sentence missing an NP to its left.
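A toy sketch of categories and function application, with categories encoded as strings (my own simplification; a real implementation would parse the slash structure properly rather than splitting strings):

    def unwrap(cat):
        # Remove one layer of outer brackets, e.g. "(S\NP)" -> "S\NP".
        return cat[1:-1] if cat.startswith("(") and cat.endswith(")") else cat

    def forward_apply(left, right):
        # X/Y  Y  =>  X
        fn = unwrap(left)
        if "/" in fn and fn.rsplit("/", 1)[1] == right:
            return unwrap(fn.rsplit("/", 1)[0])
        return None

    def backward_apply(left, right):
        # Y  X\Y  =>  X
        fn = unwrap(right)
        if "\\" in fn and fn.rsplit("\\", 1)[1] == left:
            return unwrap(fn.rsplit("\\", 1)[0])
        return None

    # "Warren likes pizza": NP  (S\NP)/NP  NP
    vp = forward_apply("(S\\NP)/NP", "NP")   # forward application gives the verb phrase
    s  = backward_apply("NP", vp)            # backward application gives the sentence
    print(vp, s)                             # S\NP S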

Combination Rules in CG Another useful intuition is that, when a complex category applies to its argument, categories effectively “cancel”. The example on the previous slide demonstrates this, with the categories in blue cancelling. Earlier work in CG also thought of the combination rules as being akin to multiplication and division; for example, the combination of a verb phrase and a subject NP is analogous to the arithmetic expression NP × S/NP = S.

Classical Categorial Grammar Early, “classical” variants of categorial grammar only had forward and backward application as the combination rules. In fact, in terms of weak generative capacity, classical categorial grammar is context-free. So why not just use a context-free grammar? The key notion at this stage, and the difference compared to a CFG, is that of lexicalisation – the fact that lexical categories are so rich in terms of the syntactic information they encode (compare the categories we’ve seen so far with the typical POS tags used for verbs, for example). In the next lecture we’ll see Combinatory Categorial Grammar, which has additional combinatory rules and the potential for grammars with greater than context-free power.

Readings for Today’s Lecture


• Categorial Grammar, Mark Steedman, 1999. Short encyclopedia entry for the MIT Encyclopedia of Cognitive Sciences, R. Wilson and F. Keil (eds.). Available at: http://homepages.inf.ed.ac.uk/steedman/papers.html


Introduction to Natural Language Syntax and Parsing

Lecture 6: Combinatory Categorial Grammar

Stephen Clark

October 17, 2015

Long-Range Dependencies An interesting feature of natural languages is that they have syntactic constructions which allow unbounded amounts of intervening material between items which belong together in the semantic predicate-argument structure. If the job of a parser is to return such predicate-argument structure, then it needs an analysis of these constructions.

The obvious example for English is the (object) relative clause construction. In the examples on the slide, the direct object a woman of the verb likes has been extracted from the canonical object position, to the front of the noun phrase before the relative pronoun. The point of the examples is to demonstrate that, at least in principle, there is no limit to the number of words that can appear between the verb and the extracted object.

The Relative Clause Construction In CCG, the lexical category assigned to a transitive verb is always (S\NP)/NP, irrespective of the syntactic environment it finds itself in. (This is not true of other linguistic formalisms, for example TAG.) Hence the question regarding the relative clause construction is: what is the appropriate lexical category for whom?

I like to think of this question as akin to a jigsaw puzzle, where we start to fill in parts of the analysis and see what’s left. In this case, the type of whom Warren likes needs to be NP\NP, so that it can combine with the extracted object NP to the left, and return an NP for the whole noun phrase. So it looks as though the category for whom has to be (NP\NP)/X, for some X to be determined.

“Non-Constituents” in CCG If whom has the type (NP\NP)/X, then Warren likes is a constituent with type X. Can a subject-verb combination be a constituent? Most linguistic theories would say no, but at least one of the tests for constituenthood — the coordination test — suggests that it can be (since Warren likes can be coordinated with other similar phrases such as Dexter detests).


A natural type for Warren likes is S/NP: a sentence missing an object NP to the right (analogous to a verb phrase, which is a sentence missing a subject NP to the left); in which case the type for whom is (NP\NP)/(S/NP). The fact that a subject-verb combination is not typically considered a constituent is the reason we have non-constituents in the slide title, and the scare quotes are there to suggest that perhaps these “non-constituents” should be considered constituents after all.

Deriving “Non-Constituents” In order to derive a type for Warren likes, we somehow need to combine an NP to the left with (S\NP)/NP to the right. The bracketing in (S\NP)/NP means this can’t happen with function application (since the object NP to the right needs cancelling first with application). Two new rules, which take us beyond classical categorial grammar, will allow the combination: type-raising and composition.

Type-Raising Type-raising arises from the question: why should the verb be the function, and not the subject noun phrase? Assuming a subject NP can be a function, what would it naturally look for? The answer is a verb phrase (S\NP). Hence the type-raised category for a subject NP becomes a sentence missing a verb phrase to its right: S/(S\NP).

More generally, type-raising is represented by a unary rule schema: NP ⇒ T/(T\NP), where the variable T gets instantiated in a rule instance, in the current example with S. The way I like to describe type-raising is as follows: the type-raised NP looks to the right for a category looking to the left for it (T\NP), and when it’s found that category it returns the category which the category to the right would have returned, if the category to the right had found it (i.e. T). Got it?

This type-raising rule is known as forward type-raising, since the resulting category looks to the right for its argument. Later we’ll also encounter backward type-raising, where it looks to the left.

Forward Composition Type-raising has created a category which is looking to the right for a verb phrase (S\NP), and there is a verb phrase to the right, but the problem is that it’s embedded in the transitive verb category – (S\NP)/NP. The object NP, which has been extracted to the front of the noun phrase, has not yet been cancelled and hence is “getting in the way”.

The rule of forward composition allows a category to “get inside” an argument category, and hence effectively bypass the object NP. The general schema is X/Y Y/Z ⇒ X/Z. Intuitively the Ys in the centre are cancelling. The way I like to describe forward composition is as follows: we can return an X if only we can find a Y to the right; we have a Y to the right, but only if we can find a Z further to the right. So let’s just look for a Z to the right and immediately return an X, ignoring the Y.
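Continuing the same toy string encoding as the earlier sketch (repeated here so the snippet stands alone, and still only my own simplification), forward type-raising and forward composition derive the S/NP category for Warren likes:

    def wrap(cat):
        # Bracket a complex category so it can appear inside another category.
        return f"({cat})" if "/" in cat or "\\" in cat else cat

    def type_raise_forward(cat, T="S"):
        # X => T/(T\X), e.g. NP => S/(S\NP)
        return f"{T}/({T}\\{wrap(cat)})"

    def forward_compose(left, right):
        # X/Y  Y/Z  =>  X/Z
        x, y1 = left.rsplit("/", 1)
        y2, z = right.rsplit("/", 1)
        if y1.strip("()") == y2.strip("()"):   # the Y categories cancel
            return f"{x}/{z}"
        return None

    raised = type_raise_forward("NP")             # Warren: 'S/(S\NP)'
    print(forward_compose(raised, "(S\\NP)/NP"))  # likes: prints 'S/NP'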


CCG Derivation for a Relative Clause Once type-raising (>T) and composition (>B)1 have created the S/NP constituent, then the derivation is straightforward. The category for the relative pronoun — (NP\NP)/(S/NP) — effectively “knows” that it’s in the object extraction scenario, so it’s looking to the right for a category which fits that scenario (one where the object NP hasn’t yet been cancelled).

“Spurious” Ambiguity The use of type-raising and composition does not have to be confined to sentences with long-range dependencies. In practice, a parser will use whatever combinatory operations it has at its disposal. Hence, type-raising and composition can be used even for simple subject-verb-object sentences, as in the example on the slide, leading to additional syntactic ambiguity. This ambiguity has often been referred to as “spurious” ambiguity, since the resulting semantic interpretation remains the same. For example, if a logical form for the sentence were built using the combinatory operations, applying the techniques described in the other half of the course, the result would be the same for the derivation on the slide and the canonical derivation using only function application.

Generalised Forward Composition In the example on the slide, the type of offered is ((S\NP)/PP)/NP. It needs to be coordinated with may give, in which case may give also has to have this type. The types of may — (S\NP)/(S\NP) — and give — ((S\NP)/PP)/NP — look as though they ought to combine by forward composition, except that the (S\NP) which needs cancelling in the type for give is too far embedded into the category for forward composition to work.

The solution is to allow generalised forward composition, which allows the category on the left to get further “inside” the category on the right. The intuition is the same as for vanilla forward composition — the category in the middle is cancelling — but this time we’re ignoring an extra set of brackets.

The combinatory rule described above is referred to as >B2, where the 2 denotes the level of embedding of the argument category (in this case S\NP). The rule can be generalised further to >Bn, where n is greater than 2, so that the category on the left is able to penetrate further into categories on the right with more recursive structure.

Argument Cluster Coordination The example on the slide is an example of what is often referred to as “non-constituent coordination”. Since CCG has such a flexible notion of constituenthood, it turns out that even the required conjuncts in this example — a teacher an apple and a policeman a flower — can be built using combinatory rules.

1 The use of B for composition follows Steedman’s notation; Steedman in turn followed Curry (as in Curry and Feys’ combinatory logic).

Forward and Backward Type-Raising Backward type-raising is analogous to the forward case. If we again instantiate the T variable with S, then applying backward type-raising to an (object) NP results in S\(S/NP). Using a similar explanation to before, a (type-raised) NP can look to the left for a category looking for it to the right, and when it finds that category the result is whatever would have resulted if the category to the left had found the NP (in this case an S).2

Argument Cluster Coordination In the example on the slide, backward type-raising has been applied to all four arguments, with the T variable in the rule schema being instantiated in two different ways. (Exercise for the reader: determine T in the two cases.) Now we need a rule to combine the complicated-looking categories.

The derivation is much easier to understand if we use the abbreviations on the slide, replacing S\NP with VP (verb phrase), (S\NP)/NP with TV (transitive verb), and ((S\NP)/NP)/NP with DTV (ditransitive verb). Now it looks as though the categories could combine with some form of composition, and a new rule of backward composition does the job. The general schema is Y\Z X\Y ⇒ X\Z. Using a similar explanation to before, we can return an X if only we can find a Y to the left, and we have a Y to the left, if only we can find a Z further to the left; so let’s just look for a Z to the left and return an X, ignoring the Y.

Backward Crossed Composition Other linguistic phenomena suggest the need for additional rules. The phenomena often involve coordination, as in the buy today and cook tomorrow example. The use of backward-crossed composition allows the types of buy and today to combine in the required fashion. (Explaining this one is left as an exercise for the reader.)

Another Combinatory Rule One question you may be asking at this stage is: how do we decide which combinatory rules are allowed? From a linguistic theory perspective, the approach is usually to see which linguistic phenomena an additional rule could help explain, whilst at the same time not licensing analyses for ungrammatical sentences of the language in question. One rule which we would not want to add to the English grammar is forward-crossed composition. However, there are constructions in Dutch which appear to require this rule.

Cross-Serial Dependencies in Dutch There are some sentences involving subordinate clauses in Dutch which appear to have some level of crossing dependencies. The translation of the example on the slide is because I saw Cecilia help Henk feed the hippos. The indices on the NP arguments are not part of the atomic symbol, but are there to indicate where the dependencies are in the sentence. For example, NP4 is the hippos, which is also the thing being fed (object of voeren). Note that, in this Dutch construction, the arguments are listed before the verbs and, crucially, the respective orders of the arguments and verbs mean that the dependencies cross, rather than nest, as they do in the English translation. The dependencies are often referred to as cross-serial because the crossings also have a serial quality to them; i.e. the noun phrases and verbs have to line up in a particular order.

2 Note this allows the possibility of another derivation for a subject-verb-object sentence, again resulting in the same semantic interpretation.

It is left as an exercise to the reader to understand how the rules of forward-crossed composition and generalised forward-crossed composition can enable a derivation which captures this crossing.

Mild Context Sensitivity A long-standing question in theoretical linguistics concerns how much automata-theoretic power is required to process natural languages. Within the Chomskian paradigm, many arguments were given which purported to show that natural languages are not context-free. However, Pullum and Gazdar in 1982 [2] showed that these arguments were invalid. It was not until the mid-eighties that a number of researchers, including Stuart Shieber [3], noticed that there were phenomena in Dutch, and Swiss German, which appeared to exhibit the sort of crossing dependencies which cannot be handled with the stack-like architecture of a push-down automaton (the automata-theoretic equivalent of the context-free grammar).

The addition of the generalised composition rules leads to a CCG with greater than context-free power, but still much less powerful than a context-sensitive grammar.3 Amazingly, it was shown by Weir, Vijay-Shanker and Joshi in the late 80s that a number of formalisms, including CCG and TAG, are weakly equivalent in terms of the languages they can generate. I say “amazingly” because, on the face of it, these formalisms look rather different.

These formalisms have become known as “mildly context-sensitive”. The hypothesis is that mild context-sensitivity is just the right place on the Chomsky hierarchy to be describing natural languages: high enough up that the cross-serial dependency phenomena can be handled, but low enough down that efficient polynomial-time algorithms still exist for these formalisms.

The question of generative capacity for CCG has been revived recently with the work of Kuhlmann, Koller and Satta [1].

Readings for Today’s Lecture The following is not required reading, since these notes should be enough to understand the remaining, more practically-oriented, lectures. However, for those keen to learn more about the linguistic theory, the following is an excellent exposition:

• Combinatory Categorial Grammar (2011), (with Jason Baldridge) Draft 7.0, to appear in: R. Borsley and K. Borjars (eds.) Non-Transformational Syntax, 181-224, Blackwell. Available at http://homepages.inf.ed.ac.uk/steedman/papers.html.

3 The notion of “much less” can be made mathematically precise on the Chomsky hierarchy; see the work of Weir, Vijay-Shanker and Joshi.

References

[1] Marco Kuhlmann, Alexander Koller, and Giorgio Satta. Lexicalization and generative power in CCG. Computational Linguistics, 41(2), 2015.

[2] Geoffrey K. Pullum and Gerald Gazdar. Natural languages and context-free languages. Linguistics and Philosophy, 4, 1982.

[3] Stuart M. Shieber. Evidence against the context-freeness of natural language. Linguistics and Philosophy, 8:333–343, 1985.

Introduction to Natural Language Syntax and Parsing

Lecture 7: A CCG Grammar and Treebank for naturally occurring text

Stephen Clark

October 22, 2015

CCG Analyses for Real Text? The examples found in linguistic textbooks and papers can often appear artificial and unlike the sentences encountered in the real world. It is a reasonable question to ask whether the “neat” formal grammar we’ve seen so far can be applied to the “messy” sentences found on the web or social media, or to sentences which are less messy but contain technical jargon, e.g. from biomedical research papers.

We’ll look at examples from three different domains or genres: newspapers, biomedical research papers, and Wikipedia. It’s true that these still consist of reasonably well-edited text, so we’ll leave open the question of whether a CCG grammar could be developed for e.g. Twitter.

Newspaper Example The sentence on the slide is the famous first sentence from Section 00 of the Penn Treebank. It is immediately clear that, given the lexical categories assigned to the words, the CCG rules we’ve seen so far will not be able to assemble a spanning analysis.

The first problem is that Pierre Vinken is an N, but the verb phrase requires a subject NP. The distinction between N and NP is not clear-cut in CCG, and the two are often conflated, so we’ll effectively do the same by introducing a new unary type-changing rule which turns an N into an NP. In keeping with CCG convention, the rule is written bottom-up on the slide.

The phrase 61 years old has the type S[adj]\NP, the type of a predicative adjective (since I can say e.g. the man is 61 years old). However, in this example the phrase is acting as a post-nominal modifier of Pierre Vinken. Hence we’ll introduce another unary type-changing rule which turns S[adj]\NP into NP\NP.
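
A hypothetical way to picture these unary rules is as a small lookup table from one category to another; the two entries below are just the rules mentioned in this example:

    # Only the two type-changing rules used in the Vinken example; a real grammar has more.
    UNARY_RULES = {
        "N": "NP",                       # bare noun to noun phrase
        r"S[adj]\NP": r"NP\NP",          # predicative adjective phrase to post-nominal modifier
    }

    def type_change(category):
        return UNARY_RULES.get(category)  # None if no unary rule applies

    print(type_change("N"))               # NP
    print(type_change(r"S[adj]\NP"))      # NP\NP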

Punctuation is ubiquitous in natural language, and often carries important syntactic information, but is rarely discussed in the NLP literature.1 Here we’ll adopt a simple approach to analysing punctuation, by introducing rules which effectively merge the punctuation mark into a neighbouring constituent. For example, there is the following binary rule instance in the CCG parser: S . ⇒ S, meaning that an S followed by a period can be replaced with an S.2 Introducing a similar rule for commas and NPs will suffice for this example sentence.

1 One exception is Prof. Briscoe’s work on punctuation in the late 1990s.

With all these additional rules in place, the sentence can now be analysed. Note that, out of all the combinatory rule schemata, only forward and backward application are necessary for this example. It is possible to use the type-raising and composition rules and still arrive at the correct semantic interpretation — because of CCG’s “spurious” ambiguity — but they are not required.

Grammatical Features in CCGbank You may have noticed that many of the S categories in the examples carry grammatical features, such as dcl (declarative). The grammar in CCGbank does not make much use of feature structures (in a linguistic, rather than machine learning, sense), unlike, say, a full implementation of an HPSG grammar. However, there is a feature set which distinguishes between different types of sentence and verb phrase, and the CCG parser does contain a unification mechanism to deal with these features. For example, when a verb phrase (S[dcl]\NP) is modified by an adverb ((S\NP)\(S\NP)), the resulting verb phrase inherits the dcl feature. The S categories in the adverbial category effectively carry a variable grammatical feature [X] which gets instantiated when the full categories combine.

The slide lists some of the grammatical features. Julia Hockenmaier’s thesis [2] (p.47) contains a full list.

Biomedical Example The main difficulty with analysing biomedical text is the profusion of long and complicated noun phrases. Even linguistic experts have difficulty analysing such noun phrases, which often requires biomedical, as well as linguistic, expertise. For example, is T cell activation a kind of T activation, or cell activation? It’s probably cell activation, and that’s how it has been analysed on the slide. On my resources webpage there are a number of files that have been manually annotated with CCG lexical categories (by me and Laura Rimell), including 1,000 sentences from the Genia corpus. Note that two versions are provided: one where Laura and I made a best guess at the bracketing for noun phrase cases we weren’t sure about, and one where we didn’t even try and left the structure flat.

Continuing with the example sentence, the phrase resulting in enhanced production ... has the type S[ng]\NP; however, in this sentence it is acting as an adverbial modifier of the preceding verb phrase. (Intuitively it’s the providing of the signal which results in the enhanced production.) Hence we need another unary type-changing rule, similar to the one used for the newspaper sentence, which turns S[ng]\NP into (S\NP)\(S\NP).

Another common feature of biomedical text is the use of brackets, especially to delimit abbreviations. Similar to punctuation, the CCG parser has some rules which merge a bracket with a neighbouring constituent; for example, the right bracket after the noun IL-2 will merge with the noun, to give another noun. But other brackets receive lexical categories. For example, the left bracket before the noun IL-2 will receive the category (N\N)/N (not shown on the slide), allowing the phrase interleukin-2 (IL-2) to become a noun.

2 These rules are referred to as rule instances, rather than schemata, since they do not contain any variables.

Once all these additional rules are in place, and the noun phrases have been identified and analysed, the remaining structure is straightforward, again requiring only forward and backward application.

Wikipedia Example Aside from the punctuation, the notable aspects of the example sentence are the possessive s, which receives the category (NP/N)\NP, and the compound noun Alfriston Clergy House – is it the Alfriston Clergy or the Alfriston House? Otherwise the structure is straightforward, once the lexical categories have been assigned, not even requiring a unary rule in this example (except N changing to NP).

Unary Type-Changing Rules The unary type-changing rules are in some sense against the spirit of CCG, with its emphasis on lexicalisation, since these rules are not part of the lexicon and are language-specific. An alternative solution would be to effectively push these rules onto the lexical categories, retaining the fully lexicalised nature of the formalism. The first example on the slide shows what happens to the lexical categories for once and used when this approach is adopted. Note that we now require additional lexical categories for these words, whereas, with the application of unary type-changing rules, the lexical categories remain the same (i.e. the same as in the canonical construction Asbestos was once used ...). Hence the advantage of the unary rules is that, in practice, they lead to a more compact lexicon and reduce the number of possible lexical categories for some of the words.

Real Examples using Composition So far, the real examples we’ve seen only require function application, with no unbounded dependencies. Do such cases occur at all in real text? The slide shows two example sentences from natural language corpora which contain instances of object extraction, requiring function composition for their analysis. In Rimell et al. [5] we describe the creation of a corpus of naturally occurring sentences which contain unbounded dependencies, across a variety of syntactic constructions, and give statistics for how often such cases occur in corpora. My resources webpage has a link to the data described in the paper.

Creating a Treebank for CCG In order to build a statistical parser for CCG — following the standard supervised methodology — we need a CCG treebank: gold-standard pairs of sentences and CCG analyses. The sentence analyses are likely to be CCG derivations, but they could be predicate-argument dependencies (in addition to, or instead of, the derivations). The treebank fulfils two main roles: it provides data for inducing a grammar, and data for training a statistical disambiguation model.

Building a treebank is expensive, requiring significant time and expertise, so rather than build a CCG treebank from scratch it is more desirable to leverage the information in the existing Penn Treebank.

The Penn Treebank The Penn Treebank (PTB) contains analyses in the form of phrase-structure trees, so somehow we need to transform these into CCG analyses. You may think it is just a case of relabelling the nodes in the trees, but there are various reasons why the transduction problem is harder than that. One reason is that, for some constructions, such as various types of coordination, the PTB trees are not even isomorphic to the CCG derivations, and so it’s not just a case of relabelling – the tree structures themselves need changing. Hence it was a considerable effort to produce CCGbank, the CCG version of the Penn Treebank (which was achieved by Julia Hockenmaier and Mark Steedman as part of Julia’s PhD thesis [2]).

Three types of information are required from the PTB trees to produce CCG derivations: linguistic head information; the argument/adjunct distinction (since CCG lexical categories encode this explicitly); and information regarding traces and extracted arguments so that long-range dependencies can be analysed correctly.

Example PTB Tree (with traces) Most PTB parsers produce phrase-structure trees without the trace information and co-indexing present. However, this information, which can be used to extract the underlying predicate-argument structure, is an important part of the PTB annotation and crucial for deriving the CCG analyses. In the example on the slide, there are two “traces” or “empty elements”: NPs 01 and 02. The idea is that these are not overtly realised in the surface sentence, but in terms of the underlying structure there is both an object of the verb do and a subject of to do. The object is what, and the subject is I, encoded by the co-indexing shown in the diagram.

The Basic Transformation Algorithm If we ignore the more difficult long-range dependency examples, the basic translation algorithm from PTB to CCG, at an abstract level, is straightforward, consisting of the three methods given on the slide. Each one is now described in turn.

Determining Constituent Type Three types of constituent need distinguishing: head, complement and adjunct. In fact, this information is not explicitly encoded in the PTB trees, but rules for heuristically recovering it have been around at least since Collins’ thesis [1], whose statistical parsing models were defined in terms of heads and complements (e.g. Collins’ Model 2 explicitly uses subcategorisation frames, similar to CCG lexical categories).

Appendix A of Collins’ thesis gives a list of head-finding rules, and Appendix A of the CCGbank manual [3] also explains how the complement-adjunct distinction is made.

Binarizing the Tree Section 4 from Hockenmaier and Steedman [4] contains an instructive example showing the translation of a PTB tree to a CCG derivation. Section 4.2 shows how the tree is binarized. Binarization is necessary since the nodes in CCG derivation trees contain at most two children, whereas the trees in the PTB are relatively flat, with some nodes having significantly more than two children. In fact, for some constructions, such as compound noun phrases, the PTB doesn’t even contain the requisite information to produce the correct analysis, in which case the CCG (sub-)derivation assumes a default right-branching structure.

Assigning Categories Assigning categories can now be performed by distinguishing three cases. Assigning a CCG label to the root node of a derivation tree is performed by a manually-defined mapping; for example a PTB VP node is mapped to S\NP, and any of {S, SINV, SQ} gets mapped to S.

For heads and complements, the category of a complement child is given a CCG label from a manually-defined mapping, similar to the root node; e.g. a PTB PP node is also labelled PP in the CCG derivation. The category of the head can be determined from the category of the parent node and the relative position and category of the child. For example, if the parent node is S, and the child is an NP to the left, then the category of the head will be S\NP (corresponding to a VP).

Finally, for heads and adjuncts, the adjunct category essentially has two copies of the parent label, with the direction determined by the relative position of the adjunct. For example, if the parent is S\NP and the adjunct is to the left, then the adjunct category will be (S\NP)/(S\NP).
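
To make the three cases concrete, here is a hedged sketch (the mapping tables are small illustrative fragments, with my own function names, not the actual CCGbank translation code):

    # Fragments of the root and complement mappings described above.
    ROOT_MAP = {"S": "S", "SINV": "S", "SQ": "S", "VP": r"S\NP"}
    COMPLEMENT_MAP = {"NP": "NP", "PP": "PP", "S": "S"}

    def wrap(cat):
        """Bracket a complex category so the result is unambiguous."""
        return f"({cat})" if "/" in cat or "\\" in cat else cat

    def head_category(parent_cat, complement_cat, complement_side):
        """The head looks for its complement sibling on whichever side the sibling sits."""
        slash = "\\" if complement_side == "left" else "/"
        return f"{wrap(parent_cat)}{slash}{wrap(complement_cat)}"

    def adjunct_category(parent_cat, adjunct_side):
        """An adjunct is parent-over-parent, with the direction fixed by its position."""
        slash = "/" if adjunct_side == "left" else "\\"
        return f"{wrap(parent_cat)}{slash}{wrap(parent_cat)}"

    print(ROOT_MAP["VP"])                                       # S\NP
    print(head_category("S", COMPLEMENT_MAP["NP"], "left"))     # S\NP, i.e. a verb phrase
    print(adjunct_category(r"S\NP", "left"))                    # (S\NP)/(S\NP), a pre-verbal adverb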

Long Range Dependencies Perhaps the most interesting part of the translation procedure is how the trace information in the PTB is propagated around the tree, via the co-indexing, to create the correct CCG lexical categories for analysing long-range dependencies. The interested reader is referred to p.57 of the CCGbank manual for a detailed example.

Properties of CCGbank The coverage of the translation algorithm — in terms of how many PTB trees get turned into CCG derivations — is very high: over 99%. One of the striking features of the resulting CCGbank is how many lexical categories there are for some very common words; e.g. is and as are assigned over 100 different category types!

More Statistics The numbers on the slide are calculated for sections 2-21, traditionally used as training data. Another striking statistic is that, for word tokens, the average number of lexical categories is over 19. This number is high because of the large number of possible categories for many frequent words; for word types the average number is lower. There are over 1,200 lexical category types in total, although a large proportion of these occur only once or twice in the training data. Finally, perhaps the most important statistic on this slide is the coverage figure on unseen data. For section 00, 6% of the tokens do not have the correct lexical category in the lexicon: 3.8% because the token is not in the lexicon; and 2.2% because the token is there, but not with the appropriate category.

References

[1] Michael Collins. Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, University of Pennsylvania, 1999.

[2] Julia Hockenmaier. Data and Models for Statistical Parsing with Combinatory Categorial Grammar. PhD thesis, University of Edinburgh, 2003.

[3] Julia Hockenmaier and Mark Steedman. CCGbank: User’s manual. Technical Report MS-CIS-05-09, Department of Computer and Information Science, University of Pennsylvania, 2005.

[4] Julia Hockenmaier and Mark Steedman. CCGbank: a corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33(3):355–396, 2007.

[5] Laura Rimell, Stephen Clark, and Mark Steedman. Unbounded dependency recovery for parser evaluation. In Conference on Empirical Methods in Natural Language Processing (EMNLP-09), pages 813–821, Singapore, 2009.

Introduction to Natural Language Syntax and Parsing

Lecture 8: Parsing with CCG

Stephen Clark

October 23, 2015

Inducing a Grammar from CCGbank Since CCG is a lexicalised grammar, the grammar can be induced from the treebank by effectively reading the lexicon off the leaves of the derivation trees, where the lexicon is a set of word-category pairs. So in the example on the slide, we would learn that Marks can have the category NP, for example. In addition to the lexicon, the grammar also consists of the manually-defined combinatory rules.
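
As a minimal sketch (the toy leaves below are invented rather than taken from CCGbank), reading off the lexicon amounts to collecting word-category pairs:

    from collections import defaultdict

    def induce_lexicon(derivation_leaves):
        """derivation_leaves: iterable of (word, lexical_category) pairs."""
        lexicon = defaultdict(set)
        for word, category in derivation_leaves:
            lexicon[word].add(category)
        return lexicon

    # Invented leaves standing in for the derivations in the treebank:
    leaves = [("Marks", "NP"), ("likes", r"(S\NP)/NP"), ("Marks", "N")]
    print(sorted(induce_lexicon(leaves)["Marks"]))   # ['N', 'NP']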

In practice, inducing a grammar is not quite as neat as the description suggests, primarily because of all the additional rules that have to be extracted from the derivations. These include the unary type-changing rules and punctuation rules we encountered in the previous lecture. There are also a number of additional rule instances in CCGbank which do not conform to the combinatory rules, since the conversion of the PTB to CCG was a semi-automatic process, which did not always produce “perfect” CCG derivations. Whether and how to include these additional rule instances is a design choice for the parser developer. If the goal is to score highly on a CCGbank evaluation, then it makes sense to include at least the most frequent of these additional rule instances in the grammar.

Chart Parsing with CCG The first stage in the CCG parsing pipeline is to assign lexical categories to the words in the input sentence. One way to achieve this is to simply assign all possible lexical categories to each word, as determined by the lexicon (with some strategy for dealing with unknown and rare words), and let the parser resolve the lexical category ambiguity. However, as we saw in the previous lecture, this would result in a huge number of categories being assigned to many frequent words.

Hence the usual practice is to use a sequence labelling algorithm (a tagger) to assign lexical categories to the words. The second bullet on the slide suggests using a standard maximum entropy tagger, which is the approach we took in the C&C parser. However, recent work by Mike Lewis and Wenduan Xu shows that a tagger based on a recurrent neural network is more accurate, and in particular is more robust to domain changes between the training and test data [3].

The chart-parsing algorithm itself is similar to the one we saw applied to dependency parsing. CCG is also a good fit with shift-reduce parsing architectures, and recent work by Yue Zhang and Wenduan Xu shows that it is possible to build a competitive shift-reduce parser for CCG [4].

CCG Supertagging Following Bangalore and Joshi’s seminal work on tagging for lexicalised tree adjoining grammar, tagging when applied to lexicalised grammar formalisms is often referred to as supertagging. The examples on the slide are designed to demonstrate the difficulty of supertagging, compared to PTB POS tagging. The categories in blue are three different categories assigned to prepositions, which would all receive the IN PTB POS tag. Note that, for the two categories assigned to with, the supertagger is often making attachment decisions when choosing these categories. There is no choice in the second case, but in the first case with could attach to the road (in which case it would have the (NP\NP)/NP category).

A useful measure of the difficulty of a tagging task is to evaluate a simple baseline method which, given a word, assigns the tag most seen with that word in the training data. For the PTB POS tagging task, this baseline method has a per-word accuracy of around 90%; for CCG supertagging it’s around 72%.
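
The baseline itself is only a few lines; the tiny "training corpus" below is invented purely for illustration:

    from collections import Counter, defaultdict

    def train_baseline(tagged_corpus):
        """tagged_corpus: iterable of (word, category) pairs."""
        counts = defaultdict(Counter)
        for word, category in tagged_corpus:
            counts[word][category] += 1
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    corpus = [("with", r"(NP\NP)/NP"),
              ("with", r"((S\NP)\(S\NP))/NP"),
              ("with", r"(NP\NP)/NP")]
    baseline = train_baseline(corpus)
    print(baseline["with"])   # (NP\NP)/NP, the most frequent category for "with"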

CCG Multitagging A maximum entropy tagger which assigns a single tag to each word has a per-word accuracy of around 92%, which may not sound so bad compared to the 72% baseline. However, bear in mind that, for a dataset with sentences consisting of 20 to 25 words, this accuracy level equates to a couple of mistakes per sentence. Hence a better approach is to allow the tagger to assign multiple tags per word, using the probability distributions from the tagger as a measure of how confident the tagger is in its decisions. For words where there is little confidence, more tags can be assigned.

For a multitagger using the strategy described on the slide — which, as a measure of confidence, compares the probability of a tag for a word with the highest-probability tag for that word — the per-word accuracy is now over 97%. In addition, the lexical ambiguity level is still relatively low: only 1.4 categories per word.
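
One common way to implement such a comparison (a sketch under my own assumptions, with invented probabilities) is to keep every category whose probability is within some factor beta of the best category for that word:

    def multitag(category_probs, beta=0.1):
        """category_probs: category -> probability for a single word."""
        best = max(category_probs.values())
        return [c for c, p in category_probs.items() if p >= beta * best]

    probs = {r"(NP\NP)/NP": 0.55, r"((S\NP)\(S\NP))/NP": 0.40, "PP/NP": 0.05}
    print(multitag(probs, beta=0.1))   # keeps the first two categories, drops PP/NP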

CKY Algorithm A chart is a data structure consisting of cells (i, j), where i indicates the start of a constituent and j its span. Each cell contains sets of non-terminal symbols, in our case CCG categories. Charts enable efficient parsing by grouping together equivalent categories in the same cell. For a context-free grammar, non-terminals in the same cell with the same label are equivalent, and the parsing complexity is O(n^3), where n is the length of the sentence. For a lexicalised context-free grammar, where the non-terminal label also has a linguistic head, the parsing complexity is O(n^5) (as it was for the dependency parsing case).

The case for CCG is complicated somewhat, since there are variants of CCG, namely those with the generalised composition rules, where the CKY algorithm is not even polynomial: it’s exponential. Vijay-Shanker and Weir developed a polynomial-time parsing algorithm for CCG, albeit with a relatively large exponent, in the early 90s, and this problem has recently been revived with the work of Kuhlmann and Satta [2]. The C&C parser deals with this issue by restricting the use of the combinatory rules, in particular by only allowing categories to combine if they have been seen to combine in the training data, effectively making the grammar context-free.

C&C also appends linguistic heads to the category labels, both for use in the statistical parsing model, and to enable the recovery of predicate-argument dependencies, as described in Clark and Curran (2007). Hence the grammar used in the C&C parser is effectively a lexicalised context-free grammar extracted from CCGbank. What this means for the CKY algorithm is that, for two categories to be equivalent, they must be in the same cell, have the same category label, and have the same linguistic head.

The algorithm itself traverses the cells in the chart, bottom up, filling in the cells with all possible combinations of contiguous categories from those that have already been built.
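
The following is a compact sketch of that bottom-up cell-filling (my own simplified code: categories are plain strings, only forward and backward application are tried, and the string matching only handles atomic arguments, which is enough for the toy sentence below):

    from collections import defaultdict

    def strip_outer(cat):
        """Drop outer brackets when they span the whole category, e.g. (S\\NP) -> S\\NP."""
        if not (cat.startswith("(") and cat.endswith(")")):
            return cat
        depth = 0
        for i, ch in enumerate(cat):
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0 and i != len(cat) - 1:
                    return cat           # the outer brackets close early, so keep them
        return cat[1:-1]

    def combine(left, right):
        """Forward application X/Y Y => X and backward application Y X\\Y => X."""
        results = []
        if left.endswith("/" + right):
            results.append(strip_outer(left[: -len("/" + right)]))
        if right.endswith("\\" + left):
            results.append(strip_outer(right[: -len("\\" + left)]))
        return results

    def cky(lexical_cats):
        """lexical_cats: one set of lexical categories per word position."""
        n = len(lexical_cats)
        chart = defaultdict(set)                     # chart[(i, span)] = set of categories
        for i, cats in enumerate(lexical_cats):
            chart[(i, 1)] = set(cats)
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                for k in range(1, span):             # split into (i, k) and (i+k, span-k)
                    for left in chart[(i, k)]:
                        for right in chart[(i + k, span - k)]:
                            chart[(i, span)].update(combine(left, right))
        return chart[(0, n)]

    # "Warren likes Dexter" with a transitive-verb category for likes:
    print(cky([{"NP"}, {r"(S\NP)/NP"}, {"NP"}]))     # {'S'}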

Chart Parsing The example on the slide, which uses a CFG, demonstrates how charts naturally lead to a packed structure. For the two VPs spanning words 2 to 7, only one needs to be considered when performing further parsing higher in the chart (in particular when filling the “corner” cell). This packing occurs in all cells, and explains how an exponential number of parse trees can be represented efficiently in a polynomial-sized structure.

Furthermore, as long as the features of the statistical parsing model are restricted to be local to the rule instantiations — i.e. they are not allowed to “look outside” of a single rule instance — then dynamic programming in the form of the Viterbi algorithm applied to trees can be used to find the highest-scoring tree. A PCFG naturally has this property, since the probabilities making up the probability of a tree are by definition restricted to single rule applications.

A useful intuition for Viterbi is the following: suppose that the VP on the left in the example has a higher score than the one on the right. Can we safely discard the one on the right when doing further parsing? As long as the features of the statistical parsing model are sufficiently local as described above, then there is no way that the VP on the right can “overtake” the one on the left in any subsequent parsing; the VP on the right cannot be part of the highest-scoring parse, and can therefore be discarded.

Linear Parsing Model This is the same linear parsing model that we’ve already seen for dependency parsing. The difference is that, for CCG, we’re defining a model over derivations, d (for sentence S). The model is a global model, so the feature functions are integer-valued, counting the number of times that certain patterns are observed in the derivation tree. The features are fairly standard for a model of this type, picking out the category at the root of the derivation; the category-word pairs at the leaves; and the category triples (or pairs) defining a rule instantiation. Each one of the category-only features will have an additional version featuring the lexical head, and another version with the lexical head replaced by its POS tag (since the lexical features will be sparse). Finally, there is a set of features defined in terms of the word-word dependencies created by the derivation.
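
In miniature, the model scores a derivation by taking the dot product of the global weight vector with the derivation's feature counts; the feature names and weights below are invented for illustration:

    def score(feature_counts, weights):
        """Dot product of (sparse) feature counts with the weight vector."""
        return sum(weights.get(f, 0.0) * count for f, count in feature_counts.items())

    derivation_features = {              # integer-valued counts for one derivation
        "root=S[dcl]": 1,
        "leaf=likes+(S\\NP)/NP": 1,
        "rule=NP S[dcl]\\NP -> S[dcl]": 1,
    }
    weights = {"root=S[dcl]": 0.5, "leaf=likes+(S\\NP)/NP": 1.2}
    print(score(derivation_features, weights))   # 1.7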

Training Data from CCGbank CCGbank has around 40,000 newspaper sentences with gold-standard derivation trees, together with a set of word-word dependencies for each derivation, from which the features will be extracted.

Feature Representation Mathematically each feature is an integer-valued function on a derivation. The number of features can be very large, especially because of the word-word features, leading to feature sets in the order of millions. Remarkably, simple regularisation techniques mean that the resulting models are not overfitted to the training data, despite the large number of features and relatively small number of training instances. For the perceptron, averaging all the weight vectors created during training, and using the average weight vector as the final model, is an effective regulariser.

Linear Parsing Model In Clark and Curran (2007) we defined a number of probabilistic parsing models for CCG. Here we’ll use a (non-probabilistic) linear model, trained with the structured perceptron. The beauty of the perceptron is its simplicity, and its competitiveness against more complicated alternatives.

Perceptron Training We’ve already seen perceptron training for dependency parsing. It works the same way here: take each training instance one at a time; find the highest-scoring derivation using the current set of weights (decoding); and then update the weights using that derivation and the gold standard.

The example updates on the slides demonstrate this process nicely. First we start with a zero set of weights, and decode the first sentence in the training data, using chart-parsing with Viterbi. This will produce three types of feature: the ones in blue are the features which appear in the derivation returned by the parser and the gold standard; the ones in red are the ones returned by the parser, but not in the gold standard; and finally the ones in green are in the gold standard but not returned by the parser. Intuitively we’d like the parser to keep returning the blue features; stop returning the red features; and begin returning the green features. Hence we leave the weights of the blue features alone; the weights of the red features get decreased by one; and the weights of the green features increased by one. Then we move on to the next sentence and repeat. And we pass through the whole data set a number of times (around 10 passes will suffice for convergence).
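
A sketch of a single perceptron update (feature names invented; "blue" features appear in both derivations, "red" only in the parser output, "green" only in the gold standard):

    from collections import Counter

    def perceptron_update(weights, gold_features, predicted_features):
        """One update: gold-only features go up, parser-only features go down."""
        for f, count in gold_features.items():
            weights[f] = weights.get(f, 0.0) + count
        for f, count in predicted_features.items():
            weights[f] = weights.get(f, 0.0) - count
        return weights                                 # features in both derivations cancel out

    weights = {}
    gold = Counter({"blue": 1, "green": 1})            # blue = shared, green = gold only
    predicted = Counter({"blue": 1, "red": 1})         # red = parser only
    print(perceptron_update(weights, gold, predicted))
    # {'blue': 0.0, 'green': 1.0, 'red': -1.0}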

DP vs. Beam Search There is a simple modification to the chart-parsing algorithm which can result in higher accuracies, without compromising speed. The idea is to retain only the K highest-scoring items in each cell (where K is typically between 8 and 64). Here an item is a sub-derivation spanning the relevant part of the sentence. Hence there is no longer any packing, and the chart is simply being used to provide an order of combination of the constituents, and a means by which to compare items when applying the beam.
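
A minimal illustration of the per-cell beam (the item scores below are invented):

    import heapq

    def prune_cell(items, k=8):
        """items: list of (score, sub_derivation) pairs; keep only the k best."""
        return heapq.nlargest(k, items, key=lambda item: item[0])

    cell = [(2.1, "S/NP via composition"), (1.7, "S/NP via application"), (0.3, "NP\\NP ...")]
    print(prune_cell(cell, k=2))   # the two highest-scoring items survive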

Since we’re using heuristic beam search, the search is no longer optimal and not guaranteed to return the highest-scoring derivation. So why use it? The key point is that we are no longer confined to defining features which only cover a single rule instance. Since there is no longer the optimal sub-problem property required for dynamic programming, features can be defined over any part of the derivation. This increase in flexibility is enough to offset the lack of optimality, and increased accuracies can be obtained. The training process is still based on the perceptron, although a modified version — the so-called max-violation perceptron — has to be employed to accommodate the beam search.

A recent technical report describes some experiments using beam search, and can be obtained from my resources webpages, along with a new, Java version of the C&C parser.

Parser Evaluation Most of the CCG parsing evaluations use the CCG predicate-argument dependencies from CCGbank as the gold standard, adopting the standard split of sections 2-21 for grammar extraction and model training, 00 for development and 23 for testing. One problem with this approach is that it only allows the comparison of CCG parsers. What if we’d like to know how the C&C parser compares with the Stanford or Berkeley parsers?

A proposal was made in the late 90s (in fact originating from Cambridge and Sussex) to use generic grammatical relation dependencies as the gold standard, and compare all parsers against that. The idea was that a parser would only need some representation based on linguistic heads, and extracting the GRs would be straightforward. In fact it turned out to be a little more complicated than that [1], and the problem of how to effectively perform cross-formalism parser comparison is still unsolved. However, the idea of evaluating against GRs is still an attractive one.

Head-based GRs The examples on the slide use the Briscoe and Carroll (Cambridge-Sussex) GR scheme. Other dependency schemes include the Stanford dependencies — similar to Briscoe and Carroll but more fine-grained — and the ongoing Universal Dependencies framework for dependency parsing.

Mapping CCG Dependencies to GRs It turned out that mapping from the native CCG dependencies to GRs was a non-trivial task, requiring a fairly complex set of rules with a number of exceptions [1]. An interesting research question which, as far as I know, has not been investigated, is to machine learn a mapping between the two representations.

Test Suite: DepBank DepBank contains 700 newspaper sentences manually annotated with GRs. Standard precision, recall and F-score measures can be used for evaluation.

Parsing Accuracy The overall accuracy number is probably a little below the state-of-the-art1, although with the difficulties of mapping between formalisms it’s difficult to know what the state-of-the-art number is. Beware also any comparisons of these numbers with, say, the accuracy numbers obtained in dependency parser evaluations. We showed convincingly in [1] that any such comparisons are not meaningful.

Perhaps the most interesting feature of the results is the accuracy breakdown per grammatical relation. Some GRs, such as determiners and auxiliaries, are easy to get, whereas others, such as coordination, are still very difficult. The classic disambiguation problems we saw in the first lecture are still there and unsolved, despite decades of research on the problem. Recent parsing models based on neural networks have produced another incremental improvement, but have still not solved the problem.

Readings for Today

• Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models. Stephen Clark and James R. Curran. Computational Linguistics, 33(4), 493-552.

References

[1] Stephen Clark and James R. Curran. Formalism-independent parser evaluation with CCG and DepBank. In Proceedings of the 45th Meeting of the ACL, pages 248–255, Prague, Czech Republic, 2007.

[2] Marco Kuhlmann and Giorgio Satta. A new parsing algorithm for combinatory categorial grammar. Transactions of the Association for Computational Linguistics, 2, 2014.

[3] Wenduan Xu, Michael Auli, and Stephen Clark. CCG supertagging with a recurrent neural network. In Proceedings of the 53rd Annual Meeting of the ACL, Beijing, China, 2015.

[4] Yue Zhang and Stephen Clark. Shift-reduce CCG parsing. In Proceedings of the 49th Annual Meeting of the ACL, Portland, OR, 2011.

1 The new Java C&C parser will have higher accuracies, but hasn’t yet been evaluated on GRs.
