
  • Learning and Inference in Phrase

    Recognition: A Filtering-Ranking

    Architecture using Perceptron

    Doctoral Thesis

    submitted to obtain the degree of

    Doctor in Computer Science

    by

    Xavier Carreras Pérez

    under the supervision of Dr.

    Lluís Màrquez Villodre

    Departament de Llenguatges i Sistemes Informàtics

    Universitat Politècnica de Catalunya

    Barcelona, July 2005


  • Abstract

    This thesis takes a machine learning approach to the general problem of recognizing phrases in a sentence. This general problem is instantiated in many disambiguation tasks of Natural Language Processing, such as Shallow Syntactic Parsing, Clause Identification, Named Entity Extraction or Semantic Role Labeling. In all of them, a sentence has to be segmented into many labeled phrases that form a sequence or hierarchy of phrases.

    We study such problems under a unifying framework for recognizing a structure of phrases in a sentence. The methodology combines learning and inference techniques, and consists of decomposing the problem of recognizing a complex structure into many intermediate steps or local decisions, each recognizing a simple piece of the structure. Such decisions are solved with supervised learning, by training functions from data that predict outcomes for the decisions. Inference combines the outcomes of learning functions applied to different parts of a given sentence to build a phrase structure for it.

    In a phrase recognition architecture, two issues are of special interest: efficiency and learnability. By decomposing the general problem into lower-level problems, both properties can be achieved. On the one hand, the local decisions we deal with are simple enough to be learned with reasonable accuracy. On the other hand, the type of representation of a decomposed structure allows efficient inference algorithms that build a structure by combining many different pieces.

    Within this framework, we discuss a modeling choice related to the granularity at which the problem is decomposed: word-level or phrase-level. Word-level decompositions, used commonly in shallow parsing tasks, reduce the phrase recognition problem to a sequential tagging problem, for which many techniques exist. In this thesis, we concentrate on phrase-based models, which put learning in a context more expressive than word-based models, at the cost of increasing the complexity of the learning and inference processes. We describe incremental inference strategies for both types of models that go from greedy to robust, with respect to their ability to trade off local predictions to form a coherent phrase structure. Finally, we describe discriminative learning strategies for training the components of a phrase recognition architecture. We focus on large margin learning algorithms, and discuss the difference between training each predictor locally and independently, and training all predictors globally and interdependently.


    As a main contribution, we propose a phrase recognition architecture that we name Filtering-Ranking. Here, a filtering component is first used to substantially reduce the space of possible solutions, by applying learning at word level. On top of it, a ranking component applies learning at phrase level to discriminate the best structure among those that pass the filter. We also present a global learning algorithm based on Perceptron, which we name FR-Perceptron. The algorithm trains the filters and rankers of the architecture at the same time, and benefits from the interactions that these predictors exhibit within the architecture.

    We present exhaustive experimentation with FR-Perceptron in the context of several partial parsing problems proposed in the CoNLL Shared Tasks. We provide empirical evidence that our global learning algorithm is advantageous over a local learning strategy. Furthermore, the results we obtain are among the best results published on the tasks, and in some cases they improve the state-of-the-art.

  • Agraïments

    I feel fortunate that Lluís Màrquez has supervised this work. Doing research under his direction has been exciting, thrilling and fun, like a good game. In blurry moments he has given me advice and suggestions that, in an easy and simple way, have lit up the next steps. I give him my deepest thanks for the trust, patience and work he has invested in me.

    There are other people directly involved in this matter. Jordi Turmo and Horacio Rodríguez (the wise one) introduced me to the fascinating world of Natural Language Processing, and Lluís Padró and German Rigau have given me many examples of how things should be done.

    In addition, I am lucky to have landed in a research group where, without any doubt, a good atmosphere reigns. For their companionship, I thank the people of the Natural Language Processing Research Group and its regular collaborators. Besides the Lluíses, Horacio, Jordi and German, here comes a bunch of good colleagues: Alicia Ageno, Laura Alonso, Jordi Atserias, Victòria Arranz, Manu Bertran, Neus Català, Bernardino Casas, Núria Castell, Irene Castellón, Isaac Chao, Grzegorz Chrupała, Montserrat Civit, Pere R. Comas, Eli Comelles, Montse Cuadros, Jordi Daudé, Gerard Escudero, Javi Farreres, David Farwell, Dani Ferrés, Maria Fuentes, Marta Gatius, Jesús Giménez, Edgar González, Meritxell González, Àngels Hernández, Patrik Lambert, Toni Martí, Muntsa Padró, F.J. Raya, Francis Real, Enrique Romero, Mihai Surdeanu, Lluís Villarejo, et al. (just in case).

    I also want to thank the people of the LSI Department at UPC, who have made it a good place for me to work, very especially the people of the computing laboratory and the secretaries, all of them so generous.

    On the personal side, I want to thank my family for their love and support. The effort I have put into this work is dedicated to them.

    Acknowledgements

    In my doctoral research, I had the amazing opportunity of visiting research centers abroad. In 2002, I visited Dan Roth at the University of Illinois at Urbana-Champaign. An important part of my research started there, and many ideas in this thesis result from discussions with Dan and his students, especially Vasin Punyakanok. More recently, during the spring of 2004 I visited Michael Collins at the Massachusetts Institute of Technology, in the Boston area. Discussing machine learning topics with him was great, and gave me new interesting perspectives that are very valuable to me. I am very grateful to them and to the people I met there, who were so friendly.

    I express my gratitude to the co-authors of the papers I've been involved in during this thesis research. Here's the list: Lluís Màrquez, Lluís Padró, Jorge Castro, Enrique Romero, Vasin Punyakanok, Dan Roth, Grzegorz Chrupała, Adrià de Gispert, Toni Martí, Montse Arévalo and Maria José Simon. Also, I thank the two anonymous referees of this thesis for helpful and encouraging comments on a preliminary version of this document.

    From 2001 to 2004, the author was supported by a pre-doctoral grant from DURSI, the Ministry of Universities, Research and Information Society of the Catalan Government (grant reference: 2001FI 00663).

    The research in this thesis was developed in the context of several research projects and other initiatives, funded by the following institutions: the Catalan Ministry of Universities, Research and Information Society (Research Group of Quality, 2001 SGR 00254), the Spanish Ministry of Science and Technology (Hermes, TIC2000-0335-C03-02; Petra, TIC2000-1735-C02-02; Aliado, TIC2002-04447-C02), and the European Union Commission (NAMIC, IST-1999-12392; Meaning, IST-2001-34460; Chil, IP 506909; PASCAL Network of Excellence, IST-2002-506778).

  • Contents

    Abstract

    Agraïments / Acknowledgements

    1 Introduction
      1.1 The Phrase Recognition Problem
        1.1.1 From Full to Partial Syntactic Parsing
        1.1.2 Phrase Recognition in CoNLL Shared Task Series
        1.1.3 Problem Definition and Evaluation
        1.1.4 Generalities
        1.1.5 The Machine Learning Approach
      1.2 This Thesis
        1.2.1 Contributions
        1.2.2 Organization

    2 A Review of Supervised Natural Language Learning
      2.1 Learning to Disambiguate in Natural Language
        2.1.1 Probabilistic Learning
        2.1.2 Direct, Discriminative Learning
        2.1.3 Learning and Inference Paradigm
      2.2 Learning Linear Separators: A Margin Based Approach
        2.2.1 Theoretical Aspects of Distribution Free Learning
        2.2.2 Learning Algorithms: From Classification to Discrimination of Structures
      2.3 Learning Systems in Partial Parsing
        2.3.1 Typical Architectures
        2.3.2 A Review of Partial Parsing Systems

    3 A Framework for Phrase Recognition
      3.1 A Formal Definition of Phrase Structures
      3.2 Models
        3.2.1 Models at Word-Level
        3.2.2 Models at Phrase-Level
        3.2.3 Models in a Phrase Recognition Architecture
      3.3 Inference Algorithms
        3.3.1 Inference in Word-Based Models
        3.3.2 Inference in Phrase-Based Models
        3.3.3 Inference in a Phrase Recognition Architecture
      3.4 Learning Algorithms for Phrase Recognition
        3.4.1 Linear Functions for Supervised Classification and Ranking
        3.4.2 Perceptron Algorithms
        3.4.3 Learning in a Phrase Recognition Architecture
      3.5 Summary

    4 A Filtering-Ranking Learning Architecture
      4.1 Filtering-Ranking Architecture
        4.1.1 Model
        4.1.2 Inference
        4.1.3 Learning Components of the Architecture
      4.2 Filtering-Ranking Perceptron
        4.2.1 The Algorithm
        4.2.2 Filtering-Ranking Recognition Feedback
        4.2.3 Binary Classification Feedback
        4.2.4 Discussion on the FR-Perceptron Algorithm
        4.2.5 Convergence Analysis of FR-Perceptron
      4.3 Experiments on Partial Parsing
        4.3.1 Experimental Setting and Results
        4.3.2 Local vs. Global Learning
        4.3.3 A Closer Look at the Filtering Behavior
        4.3.4 A Closer Look at the Behavior of the Score Function
      4.4 Conclusion of this Chapter

    5 A Pipeline of Systems for Syntactic-Semantic Parsing
      5.1 A Pipeline of Analyzers
      5.2 General Details about the Systems
        5.2.1 On Representation and Feature Extraction
        5.2.2 Voted Perceptron (VP)
      5.3 Syntactic Chunking
        5.3.1 Strategy
        5.3.2 Features
        5.3.3 Results
        5.3.4 Comparison to Other Works
      5.4 Clause Identification
        5.4.1 Strategy
        5.4.2 Features
        5.4.3 Results
        5.4.4 Comparison to Other Works
      5.5 Semantic Role Labeling (SRL)
        5.5.1 Strategy
        5.5.2 Features
        5.5.3 Results
        5.5.4 Comparison to Other Works
      5.6 Conclusion of this Chapter

    6 Conclusion
      6.1 Summary and Results
      6.2 Future Directions
        6.2.1 From Greedy to Robust Inference
        6.2.2 Learning Issues for FR-Perceptron
        6.2.3 Natural Language Tasks
        6.2.4 Introducing Knowledge
        6.2.5 On Representations and Kernels

    Bibliography

    A Proof for FR-Perceptron

    B Author's Publications

  • Chapter 1

    Introduction

    Natural Language (NL) processing and understanding requires the general task of revealing the language structures of text, at the morphologic, syntactic and semantic levels [Jurafsky and Martin, 2000]. Broadly speaking, analyzing a sentence consists of segmenting it into words, recognizing syntactic elements and their relationships within a structure, and inducing a syntactico-semantic representation of the many concepts the sentence may express. In this line, a number of central NL tasks consist of recognizing some type of structure which represents linguistic elements of the analysis and their relations. Many language applications, such as Question-Answering, Information Extraction, Machine Translation, Summarization, etc., build on general tools which perform such tasks.

    A major problem is that natural language is ambiguous at all levels. Thus, the main concern of language analyzers is how to disambiguate the correct structure of a sentence from all possible structures for it.

    This thesis focuses on language structures based on phrases, at the syntactico-semantic level. In a sentence, phrases group words that together represent a linguistic element of some nature. For example, the sentence "the cat eats fresh fish" can be segmented into three basic syntactic phrases: a noun phrase (the cat), a verb phrase (eats), and another noun phrase (fresh fish). In the NL domain, several fundamental tasks consist of recognizing phrases of some type. Examples of these tasks include Named Entity Extraction, Syntactic Analysis, or Semantic Role Labeling, among others. Generally, in all cases the task consists of recognizing a set of phrases in a sentence, organized in a structure. The difference between tasks concerns the nature of the phrases and the relations they exhibit in a sentence.

    The aim of this thesis is to develop machine learning systems for these problems under a unified framework. Machine learning techniques are used to predict whether some words of a sentence form a phrase or not. In doing so, it is assumed that there exist some patterns that naturally explain the phrases that are to be recognized, which certainly exist in natural language. Thus, a crucial point is to choose a representation of linguistic elements that gives the learning components enough expressivity to capture the natural patterns.


    An important aspect of our framework is the structural properties of the problems we are dealing with, with two main issues. The first is computational. Structurally, a sentence is a sequence of words. The number of possible phrases in a sentence grows quadratically with the length of the sentence, and the space of possible phrase structures is of exponential size. In this scenario, an algorithmic scheme is required to efficiently explore the sentence to recognize phrases.
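    To make the quadratic growth concrete, the following minimal sketch (illustrative code, not taken from the thesis) enumerates every candidate word span of a sentence; a sentence of n words has n(n+1)/2 such spans, and the number of ways of combining them into structures grows exponentially.

```python
# Illustrative sketch: enumerate all candidate phrase spans of a sentence.
def candidate_spans(n_words):
    """All (start, end) word spans with end inclusive: n*(n+1)/2 of them."""
    return [(s, e) for s in range(n_words) for e in range(s, n_words)]

print(len(candidate_spans(10)))  # 55 candidate spans for a 10-word sentence
```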

    The second issue concerns the relations that different phrases in a structure exhibit. For example, in the syntactic domain it is commonly observed that a noun phrase is followed by a verb phrase, and that the former is the agent of the predicate expressed in the verb phrase, as in "the cat eats". A more complex level of dependencies is found when phrases appear under a recursive pattern in a structure. For example, syntactic clauses appear in a sentence an arbitrary number of times through coordination, subordination and other patterns. Thus, an important aspect is to put learning in a context in which these dependencies can be captured, so that a prediction can benefit from them.

    These observations motivate approaches which combine learning and exploration processes, each supporting the other. In this thesis, we refer to this general approach as the learning and inference paradigm. The role of learning is to capture statistics from training data that serve to accurately and robustly make predictions about which words form phrases on unseen data. The role of inference is to efficiently explore the sentence so as to recognize phrases with the learned predictors, ensuring that the recognized phrases form a coherent phrase structure for the sentence. The research presented in this thesis proposes techniques that follow this paradigm, in the context of several phrase recognition problems that arise in natural language disambiguation.

    The rest of the chapter is organized as follows. The next section describes the phrase recognition problem and identifies some goals of this thesis. Then, Section 1.2 presents the contents of this thesis, starting by overviewing the major contributions of this research, and then outlining the organization of the thesis into chapters.


    1.1 The Phrase Recognition Problem

    In this section we describe the phrase recognition problem. We first motivate the type of problems that we regard as phrase recognition, and introduce the family of techniques used to solve them. Next, in subsection 1.1.2 we present a number of natural language tasks that are addressed in this thesis, all of them belonging to the CoNLL Shared Task Series, and all of them being particular instances of the general problem we address. In subsection 1.1.3, we provide a formal definition of the phrase recognition problem and the standard method to evaluate a system. After that, in subsection 1.1.4 we discuss some general characteristics of the tasks in NL that instantiate the phrase recognition problem. Finally, we introduce the machine learning approach for this type of task.

    1.1.1 From Full to Partial Syntactic Parsing

    During the 90's, a number of statistical approaches applied to natural language demonstrated that, to some extent, non-restricted automatic syntactic analysis of language is possible. Part of the success can be attributed to the availability of large-scale treebanks, that is, resources containing a large collection of sentences annotated with syntactic structure (e.g., Brown Corpus and Penn Treebank [Marcus et al., 1993], 3LB [Palomar et al., 2003], PropBank [Palmer et al., 2005]). The annotation scheme of such resources, defining which linguistic elements and syntactic relations and dependencies are annotated, is usually fine-grained. To some extent, the annotations unambiguously and consistently represent all behaviors that syntactic elements of natural language exhibit in the treebank. With this, many statistical methods have been developed for the full parsing task, the problem of associating sentences with their syntactic tree. Broadly, probabilistic full parsers consist of fine-grained parametric language models which associate probabilities to trees, relying on a grammar of the language that is assumed to generate such syntactic trees and sentences. The parameters of the model are estimated from data, in particular from the large collection of sentence-tree pairs of the treebank. To date, the best state-of-the-art statistical full parsers working on the standard Wall Street Journal (WSJ) data achieve results slightly below 90% accuracy [Collins, 1999; Charniak, 2000].

    Current natural language applications make use of much less analysis than the one annotated in treebanks. For instance, a question-answering application (a complex application of interest in the NL community) makes use of many different types of analysis, each of which is much simpler than a full syntactic tree. Typical modules are syntactic part-of-speech taggers and chunkers, named entity recognizers, semantic disambiguators, analyzers of the relations expressed in verbs, etc. The question-answering system makes use of the output of each one to make inferences that answer a posed question.

    Therefore, in recent years there has been a lot of interest in the design of learning systems which perform only a partial analysis of the sentence [Abney, 1991, 1996b; Hammerton et al., 2002]. In contrast with tasks that reveal a full analysis, partial tasks are characterized by much coarser annotation schemes. The corresponding analysis does not identify and disambiguate all linguistic elements of a sentence, but only those specific to the task. Thus, partial tasks are much simpler than full tasks (for example, the cardinality of the output space is much lower) and, consequently, the resulting systems are much simpler and faster than those performing a full analysis of the sentence. This property has facilitated the use of general machine learning for partial tasks. Here, instead of designing a fine-grained, linguistically-motivated statistical model, the approach relies on reducing the problem to classification subproblems, and learning powerful classifiers that assign labels to the linguistic elements of interest. To predict such labels, classifiers operate in high-dimensional spaces of propositional features that represent linguistic elements and their relations, possibly with many different sources of information. This flexibility in representation constitutes an attractive property of classification-based techniques, in contrast to traditional probabilistic approaches that impose strong conditions on the type of representations.

    So, the general problem of analyzing natural language presents a trade-off on how to approach it. Focusing on syntactic analysis, one can think of a single complex task which disambiguates all syntactic elements of a sentence. On the other hand, one can break a full analysis into many intermediate steps or layers, define and learn a partial analyzer for each layer, and chain partial analyzers in a pipeline to obtain incrementally the full analysis. An advantage of the full approach is that the dependencies between syntactic elements at different levels can be exploited within the same task. However, it is generally accepted that a complete syntactic disambiguation cannot be performed in isolation from other linguistic knowledge, such as the semantics of syntactic constituents and of their relations, but rather at the same time. In contrast, the partial approach allows the structure at each layer to be built taking into account different annotations of earlier analysis, and incorporating knowledge resources specific to each layer (e.g., dictionaries, semantic taxonomies, etc.). A crucial aspect, then, is how to break down the whole analysis process into many partial tasks, and which dependencies are established between tasks. To succeed, each layer of partial analysis must be simple enough to be resolved independently of higher levels of analysis, and accurate enough to permit further processing depending on it. If this is possible, the general scheme for language analysis can be resolved with a simple pipeline which chains the partial processors from lower to higher levels, each benefiting from the structures recognized at lower levels. In practice, it is not possible to resolve any natural language task with no error, and it has been shown in many experiments that errors committed in early layers substantially degrade the performance of later layers. Thus, to overcome error propagation through a pipeline of processors, one can think of a flexible architecture for analyzing language. Here, several different types of analyzers detect partial structures in a sentence, and assign meanings to them. Above them, a reasoning process induces a global analysis for the sentence, taking into account dependencies and constraints over different types and levels of partial structures to form a global one.

    In the context of syntactic parsing, to date full approaches show superiority over partial approaches. First, a full analysis is typically richer than the result of several layers of partial processing, since partial approaches often consider only the main elements of a syntactic structure, and discard intermediate constituents of the structure to simplify the problem. Second, in terms of results, full parsers perform better in the context of general syntactic analysis. So, a full analysis is richer and more accurate than a few layers of partial analysis. However, empirical evidence has been found in favor of machine learning approaches detecting the basic syntactic chunks of a sentence. Their performance, focusing only on the basic analysis, is better than that of statistical full parsers [Li and Roth, 2001]. Nowadays, the study of learning algorithms for higher levels of analysis constitutes an active direction of research in natural language tagging and parsing, and machine learning. A complementary research direction concerns the semantic analysis of a sentence. The complexity of the semantic meanings of words and, more importantly, the relations between them, constitutes a challenging aspect of language understanding in which, clearly, learning takes an important role. Currently, semantic analyzers are not yet accurate and stable enough to take part in a global interdependent analysis.

    This thesis takes a machine learning approach to resolve partial analysis tasks, at different levels of analysis. The aim is to study mechanisms for recognizing phrase structures in a sentence. We concentrate on techniques for solving a single problem. In particular, in Chapter 3 we propose a framework to recognize phrases in a sentence, based on learning and incremental inference techniques. We study strategies that go from simple to complex, and experiment with partial parsing tasks of different characteristics. Out of the scope of this thesis are the mechanisms to integrate many different partial analyzers into a global analysis process. However, the tasks we experiment with can be pipelined to resolve a coarse-grained syntactic analysis of the sentence. The next section explains these tasks.

    1.1.2 Phrase Recognition in CoNLL Shared Task Series

    Since 1999, the Conference on Natural Language Learning (CoNLL) organizes a Shared Task each year (see the CoNLL website at http://www.cnts.ua.ac.be/conll). Each edition proposes a natural language problem to be solved with learning techniques, with the aim of comparing different learning approaches in a common problem setting. For a problem, training and test data derived from existing corpora are prepared and made public to develop systems. Then, systems can be compared by contrasting their approaches with the evaluation measures they obtain on the test set.

    An attractive aspect of the CoNLL Shared Task series is that the addressed problems can be thought of as a decomposition of the basic syntactico-semantic analysis of language into many separate tasks, each building on the analysis achieved in the previous task. Thus, by chaining a processor of each task in a pipeline, one obtains a basic language analyzer that is useful for building language applications.

    In this thesis, we concentrate on the following tasks for the experimental evaluation. Chapter 5 is devoted to describing systems for these tasks. It is assumed that the pipeline starts with a part-of-speech tagging process, which assigns to each word of a given sentence its part-of-speech tag. Then, three phrase recognition tasks are applied, layered in the following order:

    Syntactic Chunking

    Shallow Syntactic Parsing is the problem of recognizing syntactic base phrases in a sentence. Base phrases are syntactic phrases which do not contain any other phrase within them, that is, they are non-recursive. Such base phrases are also known as chunks, so the task is often referred to as Syntactic Chunking. Syntactically, words within a chunk act as a unit in a syntactic relation of a sentence. The chunking of a sentence (i.e., the set of chunks) constitutes the first syntactic generalization of the sentence. In the following examples, chunks are represented in brackets, with the label of the chunk as a subscript of the closing bracket:

    (The San Francisco Examiner)NP (issued)VP (a special edition)NP (around)PP (noon)NP (yesterday)NP (that)NP (was filled)VP (entirely)ADVP (with)PP (earthquake news and information)NP .

    Here, NP stands for noun phrase, VP for verb phrase, PP for prepositional phrase, and ADVP for adverbial phrase.
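    As an aside (an illustration added here, not part of the thesis text), the same chunking can be written word by word in the BIO encoding commonly used for shallow parsing data, where a B- tag opens a chunk, an I- tag continues it, and O marks words outside any chunk; this word-level encoding is what reduces chunking to a sequential tagging problem.

```python
# The example sentence above, re-encoded with BIO chunk tags.
bio_tags = [
    ("The", "B-NP"), ("San", "I-NP"), ("Francisco", "I-NP"), ("Examiner", "I-NP"),
    ("issued", "B-VP"), ("a", "B-NP"), ("special", "I-NP"), ("edition", "I-NP"),
    ("around", "B-PP"), ("noon", "B-NP"), ("yesterday", "B-NP"), ("that", "B-NP"),
    ("was", "B-VP"), ("filled", "I-VP"), ("entirely", "B-ADVP"), ("with", "B-PP"),
    ("earthquake", "B-NP"), ("news", "I-NP"), ("and", "I-NP"),
    ("information", "I-NP"), (".", "O"),
]
```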

    This task was addressed in CoNLL-2000 [Tjong Kim Sang and Buchholz, 2000]. Eleven types of syntactic chunks were considered, and data was derived from the Wall Street Journal portion of the Penn Treebank (WSJ) [Marcus et al., 1993]. The task generalizes that of the 1999 edition, which concerned the recognition of basic noun phrases and in turn reproduced the problem proposed by Ramshaw and Marcus [1995].

    The particular setting of this task, together with the public datasets, has become a benchmark for the evaluation of shallow parsing systems. Within the machine learning community, this task is also an attractive problem for testing sequential learning algorithms.

    Clause Identification

    This task consists of recognizing the syntactic clauses of a sentence. Clauses can be roughly defined as sequences of words with a verb and a subject, possibly implicit. The main challenge of this task is that clauses form a hierarchical structure in a sentence, so the task cannot be directly approached with sequential learning techniques, as in chunking. In a sentence, the hierarchical clause structure forms the skeleton of the full syntactic tree. In the example below, clauses are enclosed within brackets:


    (The San Francisco Examiner issued a special edition around noon yesterday (that (was filled entirely with earthquake news and information)S )S .)S

    This task was addressed in CoNLL-2001 [Tjong Kim Sang and Déjean, 2001], and data was derived from the WSJ corpus, as in syntactic chunking. All clauses were labeled with the same type (S), so in the data there is no differentiation among different types of clauses (e.g., main clause, relative clause, etc.).

    Semantic Role Labeling

    In this problem, the goal is to recognize the arguments of the predicates in a sentence. Arguments are phrases in the sentence which hold a relation with a predicate of the sentence. This relation is called a semantic role. The PropBank corpus [Palmer et al., 2005] defines which semantic roles each verb in English accepts, and annotates the predicate-argument relations of the verbs in the WSJ corpus with their semantic role. In a sentence, each verb has a set of labeled arguments. For the example sentence, the arguments of the verb "issue" are:

    (The San Francisco Examiner)A0 (issued)V (a special edition)A1 (around noon)TMP (yesterday)TMP (that was filled entirely with earthquake news and information)C-A1 .

    According to PropBank, A0 is the issuer of the verb predicate "issue", and A1 is the thing issued (in the example, this argument is broken down into two pieces, the second annotated as C-A1). V stands for the verb, and TMP is a general modifier expressing a temporal relation. For the verb "fill" the arguments are:

    The San Francisco Examiner issued (a special edition)A1 around noon yesterday (that)R-A1 was (filled)V (entirely)MNR with (earthquake news and information)A2 .

    A1 is the destination, R-A1 is a referent to A1, and A2 is the theme. MNR stands for a general modifier expressing a manner relation.

    In this thesis, we frame all these tasks under the general problem of recognizing phrases in a sentence. The difference between tasks lies then in the nature of the phrases, the scheme of phrase labels and the type of structures that phrases form in a sentence (i.e., shallow phrases or phrase hierarchies).

    Apart from these tasks, another task that can be directly cast as a phrase recognition problem is Named Entity Extraction. This task consists of recognizing the named entities in a sentence and classifying them according to their type. For instance, in the above example sentence, "San Francisco Examiner" is a named entity, denoting an organization. This problem was addressed in two editions of the CoNLL Shared Task, namely for Spanish and Dutch in 2002 [Tjong Kim Sang, 2002a], and for English and German in 2003 [Tjong Kim Sang and De Meulder, 2003]. While we mention this problem throughout the thesis, for conciseness we do not present experiments on it.


    1.1.3 Problem Definition and Evaluation

    In this section we give a definition of the general problem discussed in this thesis, in the context of a supervised learning problem, and we describe the standard evaluation method of a system.

    Let X be the space of sentences of a language. Let Y be the space of phrase structures for sentences. In particular, an element y ∈ Y is a set of phrases that constitute a well-formed phrase structure (i.e., the phrases in y form a sequence or a hierarchy of phrases).

    Given a training set S = {(x1, y1), . . . , (xm, ym)}, where the xi are sentences in X and the yi are phrase structures in Y, the goal is to learn a function R : X → Y which correctly recognizes phrases on unseen sentences.
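    As a concrete, purely illustrative reading of this definition, a phrase can be represented as a labeled span of word positions and a solution y as a set of such spans; the sketch below assumes this representation, which is not prescribed by the thesis at this point.

```python
# Minimal sketch of the objects in the problem definition (assumed types).
from typing import List, NamedTuple, Set

class Phrase(NamedTuple):
    start: int   # index of the first word of the phrase (inclusive)
    end: int     # index of the last word of the phrase (inclusive)
    label: str   # e.g. "NP", "S", "A0"

Sentence = List[str]            # an x in X: the words of the sentence
PhraseStructure = Set[Phrase]   # a y in Y: a well-formed set of phrases
```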

    In order to evaluate a phrase recognition system, the standard measures for recognition tasks are used: precision, recall and Fβ=1. Precision (p) is the proportion of recognized phrases that are correct. Recall (r) is the proportion of solution phrases that are correctly recognized. Finally, their harmonic mean, Fβ=1, is taken as the standard performance measure to compare systems.

    We consider that a predicted phrase is correctly recognized if it matches a correct phrase exactly. That is, the words that the phrase spans and the phrase label have to be correct. This criterion makes the evaluation of a system strict. In contrast, there are softer criteria that consider a phrase to be correctly recognized if its important words match the important words of a correct phrase (e.g., in the standard evaluation of full parsers, punctuation tokens are not considered).

    Recall that a solution y is a set, so the intersection with another solution y' gives the set of phrases that are in both sets, with exact matching. Let | · | be the number of elements in a set. The computation of the evaluation measures on a test set $\{(x_i, y_i)\}_{i=1}^{l}$ can be expressed as follows:

$$
p = \frac{\sum_{i=1}^{l} |y_i \cap R(x_i)|}{\sum_{i=1}^{l} |R(x_i)|}
\qquad
r = \frac{\sum_{i=1}^{l} |y_i \cap R(x_i)|}{\sum_{i=1}^{l} |y_i|}
\qquad
F_{\beta=1} = \frac{2\,p\,r}{p + r}
$$
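    The following short sketch (an illustration under the assumed (start, end, label) phrase representation, not code from the thesis) computes these measures by exact set intersection, with one phrase set per sentence.

```python
def evaluate(gold, predicted):
    """gold, predicted: lists of sets of phrases, aligned by sentence.
    Assumes at least one gold and one predicted phrase overall."""
    correct = sum(len(y & r) for y, r in zip(gold, predicted))
    p = correct / sum(len(r) for r in predicted)    # precision
    r = correct / sum(len(y) for y in gold)         # recall
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0  # F_{beta=1}
    return p, r, f1
```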

    1.1.4 Generalities

    In general, the goal of the tasks we focus on is to recognize a set of phrases in a sentence, organized in a structure. Here we characterize different tasks of phrase recognition.

    Origin

    Phrase recognition tasks arise in the NLP area with two different motivations. First, the syntactic analysis problem can be approached as a full task or be broken down into many separate tasks, each focusing on a part of the general structure. Full parsing can be seen as a phrase recognition task, in which phrases correspond to nodes of the full syntactic tree. However, this thesis focuses on partial parsing tasks, where the structures are much simpler than syntactic trees.


    Typical partial syntactic tasks are noun phrase recognition, shallow parsing, prepositional phrase attachment or clause identification. In a sentence, each of the phrase structures detected in these tasks corresponds to a piece of the full syntactic tree.

    Second, language applications often look for specific information in text, and consequently define tasks in which the goal is to recognize phrase structures containing such information. For instance, current text-retrieval, question-answering or summarization systems make use of named entity extractors, which detect and classify phrases in text denoting a named entity. In the most general case, a named entity can be a complex structure formed of basic named entities and other words. For example, "Goldsboro Express" is a named entity denoting a particular train related to a place named Goldsboro. Or, "The Complete Prestige Recordings of John Coltrane" denotes the title of a music compilation, and within it there are two related named entities, namely a company (Prestige) and a person (John Coltrane). Regardless of their complexity, named entity structures appear in specific parts of a sentence, and constitute only a piece of the global structure representing a sentence syntactico-semantically.

    Complexity of Phrase Structures

    Different tasks can be grouped according to properties of the phrase structures that are to be recognized. Structurally, the simplest task is to recognize basic phrases that do not contain other phrases within them. This problem is usually known as shallow analysis or chunking, since non-recursive phrases are often called chunks. The resulting phrase structure in a sentence is a sequence of phrases. Tasks in this category include syntactic chunking, or named entity extraction in its simpler and most typical setting, where the goal is to recognize maximal named entities without considering their internal structure. Also, recognizing the arguments of a predicate in a sentence can be modeled as a chunking task, where each chunk is an argument that plays some semantic role in the predicate.

    A more complex level of tasks is found when phrases admit embedding, usually under a recursive pattern. In this case, related phrases form trees and, in general, the phrase structure of a sentence is a forest of phrase hierarchies. A particular case is that of full parsing, where the structure is the syntactic tree of the sentence. Other hierarchical phrase recognition tasks are syntactic clause identification, or general versions of noun phrase recognition, recognizing structures of base noun phrases together with their modifiers, as in (a cup (of coffee)), or of named entity extraction, unraveling the internal structure of named entities, as in ((United Nations) Headquarters).

    Sparseness of Phrase Structures

    A different dimension for characterizing phrase structures, and their corresponding problems, is the sparseness of phrases in a sentence, that is, whether most of the words are covered by the phrase structure or not. For example, in shallow syntactic parsing the task may be to recognize a particular base chunk, such as noun or verb phrases, resulting in specific phrase structures which do not completely cover the level of analysis. On the other hand, the task may be to recognize all base syntactic chunks of the sentence. In this case, except for punctuation and other functional tokens, all words in the sentence belong to some base chunk, and the resulting phrase structure completes the shallow level of syntactic analysis. Recognizing all chunks together allows taking advantage of the dependencies between chunks and ensuring the coherence of the shallow analysis. In the syntactic domain, clause hierarchies or predicate-argument constructs are also partial syntactic structures which complete a piece of the global sentence analysis. In contrast, named entities appear in text sparsely. A sentence might contain no named entities or a few of them. And, in the context of a sentence, two named entities cannot be related without looking at some other piece of the sentence analysis, such as a verb expressing a relation between them, as in "Peter goes to Japan".

    1.1.5 The Machine Learning Approach

    Simple phrase recognition tasks, where the goal is to recognize non-recursive phrases with few dependencies among them, facilitate the use of general machine learning algorithms. The approach consists of reducing the global task of recognizing phrases to local classification subproblems, and learning a classifier for each of the subproblems using machine learning. Typical local subproblems include learning whether a word opens, closes, or is inside a phrase of some type. Recognizing phrases, then, consists of exploring the words of a sentence and applying classifications to them, while combining the outcome of the classifications to build the phrases of the sentence. Usually, not all combinations of the classifiers' outcomes are possible. For example, a word cannot be inside a certain phrase if such a phrase has not been opened at some preceding word. Thus, the exploration strategy not only visits the words of a sentence to classify them, but also checks that the classifications form coherent phrases in the sentence. In simple phrase recognition tasks, many state-of-the-art systems rely on powerful learning algorithms which provide accurate classifiers. The exploration strategy is then usually greedy at ensuring coherence, in that a classification at a certain word is assumed to be correct, and future classifications which would produce incoherence are not considered.
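    To make the greedy strategy concrete, here is a small illustrative sketch (not from the thesis) that combines hypothetical word-level "open" and "close" classifiers into non-overlapping phrases, discarding any classification that would break coherence with the phrase currently open.

```python
def greedy_recognize(sentence, labels, opens, closes):
    """opens(i, k) / closes(i, k): assumed learned classifiers saying whether
    word i starts / ends a phrase of type k."""
    phrases, current = [], None   # current = (start, label) of the open phrase
    for i, _ in enumerate(sentence):
        if current is None:
            # only try to open a phrase when none is open (greedy coherence)
            for k in labels:
                if opens(i, k):
                    current = (i, k)
                    break
        if current is not None:
            start, k = current
            if closes(i, k):
                phrases.append((start, i, k))
                current = None
    return phrases
```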

    As the complexity of phrase structures increases, and given that learned predictors are not error-free, it seems adequate to design exploration strategies that are more robust to local prediction errors than simple greedy strategies. In general, the learning components of a phrase recognition system compute different local predictions at different parts of the sentence. Whether a combination of predictions forms a coherent phrase structure can always be expressed with constraints over the different predictions. Thus, given a phrase recognition task decomposed into many local learning problems, and given learned local predictors, recognizing phrases in a sentence consists of finding the best combination of local predictions that satisfies the constraints. This way, the general approach is not only an exploration of a sentence to compute predictions, but rather a powerful inference process to find the best global coherent assignment given the local predictions [Roth and Yih, 2004]. The phrase recognition architecture can be divided into two layers: first, a learning layer, where local predictions are computed independently; then, an inference layer, which selects the best phrase structure considering both the local predictions and the structural properties of the solution, expressed in the constraints.

    Still, in some tasks the phrases in a structure exhibit many dependencies between them. For example, in the hierarchy of syntactic clauses of a sentence, top-level clauses group low-level simple clauses to form a complex sentence. In these domains, thus, it seems desirable that some predictions take into account part of the phrase structure that is being recognized. This idea motivates a third paradigm for the phrase recognition task, that of combining learning and inference in both directions. While in the previous paradigm inference was on top of the learners, here learners will depend on inference, and vice-versa, since a certain prediction in one part of the sentence will benefit from the structure recognized in other parts of the sentence. As will be shown, the type of learning here is global, in that it considers the global inference strategy to train the prediction functions. Actually, this paradigm corresponds to the standard approach in generative probabilistic models, such as HMMs or PCFGs. In these approaches, a certain model makes use of several interdependent probability distributions estimated from data, corresponding to learners. Inference consists of finding the solution with the highest probability, and, when searching for it, the dependencies of the model are taken into account. However, discriminative learning seems to be advantageous over generative learning, because it offers flexibility in terms of feature representations and theoretical guarantees on generalization. In this direction, the paradigm of globally training learners with respect to inference has recently been proposed in the literature for discriminative learners [Collins, 2004].

    This thesis studies phrase recognition architectures, focusing on discriminative learning methods, inference strategies, and their interaction. We present different architectures and experiment with them in the context of the tasks presented in Section 1.1.2.

    1.2 This Thesis

    1.2.1 Contributions

    The research work presented in this thesis can be framed in the field of Natural Language Learning, which is the field devoted to studying approaches that place machine learning as the central mechanism to understand and process natural language. As the name suggests, the field of Natural Language Learning borrows theories and problems from the area of Natural Language Processing, while concentrating on the results and techniques of the Machine Learning community to make them applicable to the Natural Language domain.


    The specific field we focus on is that of designing supervised learning systems for complex problems that arise when analyzing natural language. In particular, we study and propose methods that learn to recognize phrases in a sentence. Below, we overview the three major contributions of this thesis.

    A Framework for Phrase Recognition

    This thesis develops a framework for phrase recognition problems oriented to approaches based on discriminative learning, and inference mechanisms to deal with structured data. The starting point is the formalization of the general problem. In particular, the goal in a phrase recognition problem is to recognize in a sentence a structure of labeled phrases. Here, the basic element is a phrase, defined as a sequence of contiguous words which, together, assume some functionality in the sentence analysis. Such functionality is expressed by the label of the phrase. Then, the framework considers two types of phrase structures, which result in two particularizations of the general problem. The first is that of chunking, where phrases form a non-overlapping sequential structure in the sentence. Instances of this problem include shallow syntactic analysis, named entity extraction, or recognition of the semantic roles of a sentence predicate. The second problem, more general, consists of recognizing hierarchical phrase structures, that is, structures of phrases that form a tree or a forest in the sentence. This problem is instantiated in tasks such as the identification of syntactic clauses, more general versions of the above-mentioned problems, and, in general, syntactic parsing tasks.

    The framework develops techniques for phrase recognition architectures that make use of two central mechanisms: learning and inference. Learning serves to predict which words of a sentence form phrases. Inference combines the outcomes of the learned predictors to form a structure of phrases for the sentence, with two main concerns: efficiency and coherence of the phrase structure. In the framework, a phrase recognition architecture consists of three main components:

  • Model. We discuss models that decompose the global problem into many decisions, at two different granularities, which are directly solved with learning techniques. The first is at word level, which is the traditional approach of phrase recognition systems. The second is at phrase level, which puts learning in a context more expressive than word-level models, but also computationally more expensive.

  • Inference strategy. We comment on inference algorithms for phrase recognition models at word and phrase levels. In all cases, the type of inference is incremental, that is, as the sentence is processed in some order, the plausible solutions are built, incremented and contrasted with each other, so that when the end of the sentence is reached the best solution is ready. We present instantiations of this general processing scheme that go from approximate to exact strategies according to the model's optimality criterion.


  • Learning strategy. We discuss discriminative strategies for learning the predictors of a phrase recognition architecture. We focus on large margin algorithms for training linear separators, and discuss the difference between learning each predictor locally and learning all predictors of the architecture globally.

    Perceptron Learning for a Filtering-Ranking Architecture

    As a main contribution, this thesis proposes a phrase recognition architecture, which we name Filtering-Ranking, and a global learning strategy for it based on Perceptron. This work is published in [Carreras et al., 2005] and, in turn, builds on previous work [Carreras and Màrquez, 2003a,b; Carreras et al., 2002b].

    The Filtering-Ranking architecture faces the general problem of recognizing hierarchies of phrases, and can be particularized to look for sequential phrase structures. The strategy for recognizing phrases in a sentence can be sketched as follows. Given a sentence, learning is first applied at word level to identify phrase candidates of the solution. Then, learning is applied at phrase level to score phrase candidates and discriminate among competing ones. These two layers of learning predictors are controlled by the inference strategy, which explores the sentence and computes predictions at different parts of it, with the goal of building the optimal phrase structure for the sentence, according to the predictions and other coherency criteria. In the architecture, the phrase-level layer of predictors makes it possible to deal with complete candidate phrases and partially constructed phrase structures. An advantage of working at this level is that rich and informed feature representations can be used to exploit structural properties of the examples, possibly through the use of kernel functions. However, a disadvantage of working with high-level constructs is that the number of candidates to explore increases and the search space may become prohibitively expensive to explore. For this reason, the word-level layer of our architecture plays a crucial role by filtering out non-plausible phrase candidates and thus reducing the search space in which the high-level layer operates.
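    The following schematic sketch illustrates this two-layer strategy; the predictor interfaces (word_filter, phrase_scorer) and the explicit enumeration of coherent structures are simplifying assumptions for illustration only, whereas the architecture in the thesis uses incremental inference rather than enumeration.

```python
def filter_and_rank(sentence, labels, word_filter, phrase_scorer, coherent_structures):
    # 1) Filtering: word-level predictors propose start and end words per
    #    label, which drastically reduces the set of candidate phrases.
    n = len(sentence)
    starts = {k: [i for i in range(n) if word_filter(i, k, "start")] for k in labels}
    ends = {k: [i for i in range(n) if word_filter(i, k, "end")] for k in labels}
    candidates = [(s, e, k) for k in labels
                  for s in starts[k] for e in ends[k] if s <= e]
    # 2) Ranking: phrase-level scores select the best coherent structure
    #    assembled from the surviving candidates.
    return max(coherent_structures(candidates),
               key=lambda struct: sum(phrase_scorer(sentence, p) for p in struct))
```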

    We then propose the FR-Perceptron learning algorithm to globally train the learning functions of the system as linear separators, all in one go. The learning strategy is a generalization of Perceptron that follows [Collins, 2002] in two main aspects. First, it works online at sentence level: when visiting a sentence, the functions being trained are first used to recognize the set of phrases in it, and then updated according to the correctness of the solution. Second, the type of update is conservative, since it operates only on the mistakes of the global solution. The proposed algorithm extends Collins' in that it not only trains a score function for ranking potential output labelings, but also trains the filtering component which provides candidates to the ranker. The extension is in terms of the feedback rule for the Perceptron, which reflects back to each individual function the errors it committed when recognizing a set of phrases. As a result, the learned functions are automatically approximated to behave as word filters and phrase rankers, and thus become adapted to the recognition strategy.
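    For readers unfamiliar with this style of training, the sketch below shows the sentence-level, mistake-driven flavor of update being referred to, in the spirit of Collins' structured perceptron; the feature map phi and the recognizer are hypothetical placeholders, and the actual FR-Perceptron feedback rule (which also updates the filters) is given in Chapter 4.

```python
import numpy as np

def perceptron_epoch(training_set, w, recognize, phi):
    """training_set: (sentence, gold) pairs, gold being a set of phrases;
    recognize(sentence, w) returns the predicted set of phrases;
    phi(sentence, phrase) returns a feature vector (np.ndarray like w)."""
    for sentence, gold in training_set:
        predicted = recognize(sentence, w)
        # conservative update: only phrases involved in mistakes change w
        for p in gold - predicted:        # missed phrases: promote
            w += phi(sentence, p)
        for p in predicted - gold:        # over-predicted phrases: demote
            w -= phi(sentence, p)
    return w
```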

    Regarding the analysis of the algorithm, a convergence proof is presented.


    Regarding the empirical study, we provide extensive experimentation on two relevant problems of the Partial Parsing domain, namely base Chunking and Clause Identification. Moreover, we incorporate into the experimental architecture the results of Freund and Schapire [1999] on Voted Perceptrons, to produce robust predictions and allow the use of kernel functions. The performance achieved by the presented learning architecture is comparable to the state-of-the-art systems on base chunking and substantially better in the recognition of clauses. Besides, the experiments presented help in understanding the behavior of the different components of the architecture and give evidence of its advantages compared to other standard alternative learning strategies.

    State-of-the-art results in CoNLL Shared Tasks

    A central goal of this thesis research is to build competitive systems for the yearly-organized Shared Tasks of the CoNLL conference, which we view as phrase recognition problems. Namely, these tasks include Shallow Syntactic Parsing [Tjong Kim Sang and Buchholz, 2000], Clause Identification [Tjong Kim Sang and Déjean, 2001], Named Entity Extraction [Tjong Kim Sang, 2002a; Tjong Kim Sang and De Meulder, 2003] and Semantic Role Labeling [Carreras and Màrquez, 2004]. The series of CoNLL Shared Tasks is an interesting and motivating initiative of the Natural Language Learning community, since they propose relevant, real-sized Natural Language problems, and establish experimental settings that permit fair comparisons of systems implementing different learning approaches. In particular, all systems are developed under the same task specification and training data, and evaluated with standard performance measures on the same test data. Thus, systems can be ranked according to their final performance, and conclusions can be drawn about which elements influenced the good or bad performance of different approaches. In general, building a competitive system concerns issues related to: (1) the type of architecture and model for recognizing phrases; (2) the learning algorithm used to train the learning functions of the architecture; (3) the type of features for representing the data; and (4) practical techniques and tricks to develop and tune a learning system for a real-sized natural language problem.

    Under a unified framework, we have developed systems that perform among the best at each edition, and thus can be considered state-of-the-art. For Shallow Syntactic Parsing, the task with the most evaluated systems, our system obtains results that are very close to the top-performing system. On Clause Identification, our system is substantially better than any other system. For Named Entity Extraction (on 4 languages) and Semantic Role Labeling, our systems are among the top-performing ones.

    1.2.2 Organization

    The rest of the thesis is organized in the following chapters.

  • Chapter 2. A Review of Supervised Natural Language Learning.
    We review the major approaches to natural language disambiguation problems with supervised machine learning, focusing on discriminative learning and large margin learning algorithms, as they are the techniques used in this thesis. We also review the relevant literature on shallow and partial parsing systems.

  • Chapter 3. A Framework for Phrase Recognition.
    This chapter describes the framework to design phrase recognition systems. We comment on the major components of a learning-based phrase recognition architecture, namely the model, the inference algorithm, and the learning strategy. We propose a variety of options for each one, and discuss instantiations of phrase recognition architectures that go from simple to more complex.

• Chapter 4. A Filtering-Ranking Learning Architecture. This chapter presents the main contribution of this thesis, namely a filtering-ranking learning architecture for general phrase recognition problems. First, we describe the architecture. Then, we propose a Perceptron algorithm to train it globally, with an analysis of its convergence and extensive empirical experimentation that evinces its good performance.

• Chapter 5. A Pipeline of Systems for Syntactic-Semantic Parsing. In this chapter, we develop Natural Language analyzers for three CoNLL Shared Tasks, using the filtering-ranking learning architecture. We contrast our systems, in terms of results, with other systems in the literature developed for the same tasks, and show that the performance of our systems is among the best in the state-of-the-art.

• Chapter 6. Conclusion. This chapter gives conclusions and outlines future research directions.

How to read this document. The main contribution of this thesis is found in Chapter 4, which presents the filtering-ranking learning architecture and the Perceptron-based global algorithm for it. This chapter is mostly self-contained and, in fact, it is essentially the main part of the article published in [Carreras et al., 2005]. Readers who are familiar with machine learning and partial parsing techniques should be able to read the chapter straightforwardly.

For readers not familiar with these topics, Chapter 2 introduces machine learning techniques for disambiguating language. We contextualize and discuss the type of learning that we use. We also comment on partial parsing techniques and systems in the literature that are relevant to the area and influence our work.

Chapter 3 presents a framework for partial parsing and other phrase recognition problems. Essentially, the phrase recognition problem is studied in a formal way, identifying the major computational difficulties and discussing a family of techniques to solve it with discriminative learning. The filtering-ranking architecture proposed later in Chapter 4 can be thought of as a particular choice within this family of phrase recognition methods.


Chapter 5 provides a practical description of building partial parsers for natural language processing, for three different tasks. We comment on final adaptations of the general model, features used by the learners, training processes, and evaluation results.

Chapter 2

A Review of Supervised Natural Language Learning

This chapter reviews the main concepts and approaches of the areas of Machine Learning (ML) and Natural Language (NL) processing that serve as the basis of the work presented in this thesis.

The chapter is organized in three sections. The first section introduces supervised machine learning techniques which apply to natural language disambiguation problems, focusing on approaches for dealing with structured domains, as is the case for the problems discussed in this thesis. The second section reviews the specific type of learning algorithms that are used, a family known as large margin methods. We introduce the main theoretical concepts that serve as a starting point for the design and justification of the methods, and review the main algorithms. Finally, the last section reviews techniques and systems for shallow and partial parsing, which is the specific type of NL problem we concentrate on.

2.1 Learning to Disambiguate in Natural Language

In this section we review machine learning techniques for natural language disambiguation tasks, focusing on supervised methods which apply to phrase recognition tasks and other more general problems.

A natural language disambiguation problem can be thought of as inducing a function h : X → Y. In the problems discussed in this thesis (and in most NL disambiguation problems), X is the space of all sentences of a language. But in general X could be a space representing words, sentences, documents, or other textual domains. As for the output space, we can differentiate between classification problems and structured problems. In classification problems, Y is a fixed set of labels. For example, Y might be the set of possible part-of-speech tags of a language, and the associated task is to determine the tag in Y that a given word assumes in a sentence. Other examples of pure NL classification tasks include Word Sense Disambiguation —consisting of assigning the appropriate semantic label to a word in a sentence—, Spelling Correction —the task of fixing spelling errors, by classifying a certain string into the possible legitimate words of the language— or Text Categorization —where the goal is to identify the topic categories of a document.

In structured problems, the elements of the output space Y are structures such as sequences, trees or, in general, graphs. Such structures are labeled, in that each of the nodes forming the structure has a label that denotes the nature of the constituent represented by the node. In most of these problems, X is the space of sentences, so the problem is to induce a function mapping sequences of words (i.e., sentences) to labeled sequences or trees. For example, let T be the set of part-of-speech tags of a language, and Y be the space of sequences formed with tags in T. A sequential classification problem is that of determining the sequence of part-of-speech tags for the words of a given sentence —which corresponds to the standard problem known as Part-of-Speech (PoS) tagging. Another instance of structured-output problems is that of syntactic parsing, where the goal is to disambiguate the syntactic tree of a sentence (i.e., Y is the space of all possible syntactic trees). Closely related, in the problems discussed in this thesis Y is the set of all possible phrase structures for sentences, either sequential or hierarchical. Structured output domains exhibit two particular properties. First, the cardinality of the output space is of exponential size with respect to the length of an input sentence. Second, two different structures of Y can be very similar, while others can be radically different. For example, two trees can differ only in one node's label. These properties are crucial for the design of appropriate disambiguation learning techniques.

In supervised learning, the learning protocol is the following. It is assumed that there exists a distribution D over X × Y that is unknown, but generates a collection of training examples, independently and identically distributed.1 Each example is of the form (xi, yi), with xi ∈ X and yi ∈ Y.

In this framework, the task of a learning algorithm is to induce, from the training examples, a hypothesis h which accurately predicts the correct value y ∈ Y for an instance x ∈ X. In order to evaluate the quality of a learned function, an error function or loss function, denoted L(y, ŷ), measures the error or cost of proposing the output ŷ when the correct output is y. In classification, the most common error function is the 0-1 loss, which assigns a penalty of 1 when the output is not correct (y ≠ ŷ) and 0 when correct (y = ŷ). When dealing with structures, however, it is more appropriate to consider error functions based on the number of differing nodes between the correct and predicted structures, which yields error functions closely related to precision and recall measures.

1Note that since D is a joint distribution, it is possible to observe examples with different y's for the same x. While we assume that only one value of Y is correct for each instance of X, a joint distribution of examples is useful to study a number of interesting issues out of the scope of this thesis, for instance noisy sources of examples, or inherently ambiguous instances that have several correct solutions —the latter being especially relevant for processing language.


Conceptually, the ultimate goal is to learn the best hypothesis, the one minimizing the true error over the complete distribution D. However, D is unknown, so, in practice, a separate test set, assumed also to be drawn from D, is used to measure the error of the learned function on the test examples.
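
To make these loss functions concrete, the sketch below (with invented phrase structures; the helper names are not from the thesis) contrasts the 0-1 loss with a simple structured loss that counts the labeled phrases on which the correct and predicted structures disagree, the same counts from which precision and recall are computed.

    def zero_one_loss(y, y_hat):
        """0-1 loss: penalty of 1 if the whole prediction is wrong, 0 if correct."""
        return 0 if y == y_hat else 1

    def phrase_loss(y, y_hat):
        """Structured loss: number of labeled phrases present in exactly one of the
        two structures (missed gold phrases plus spurious predicted phrases)."""
        return len(y - y_hat) + len(y_hat - y)

    # A phrase structure encoded as a set of labeled spans (start, end, label).
    gold = {(0, 1, "NP"), (2, 4, "VP"), (5, 6, "NP")}
    pred = {(0, 1, "NP"), (2, 4, "VP"), (5, 7, "NP")}

    print(zero_one_loss(gold, pred))   # 1: the structures differ as wholes
    print(phrase_loss(gold, pred))     # 2: one missed phrase plus one spurious phrase

With the same sets, precision is |y ∩ ŷ| / |ŷ| and recall is |y ∩ ŷ| / |y|, so this loss penalizes exactly the disagreements that the evaluation measures count.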

In the following sections, we review three main families of machine learning techniques for resolving NL ambiguities. First, we comment on probabilistic learning, where hypotheses are probability distributions of the data, trying to estimate the densities of D in some way. The focus is to introduce the approach of generative models. Then, we review discriminative learning, the type of learning that, to date, provides the most powerful and general learning algorithms. Finally, we discuss the learning and inference paradigm, a family of methods for disambiguating in structured domains. As will become clear, the reviewed methods do not belong exclusively to the family of methods in which we present them.

    2.1.1 Probabilistic Learning

Probabilistic methods associate probabilities with input-output pairs of language [Charniak, 1993]. If x ∈ X is an instance, and Y(x) are the possible output values for x, probabilistic methods choose the value of Y(x) maximizing its probability given x, that is h(x) = arg max_{y ∈ Y(x)} p(y|x). Generative models estimate a joint probability distribution p(x, y) of the data, which is later used for disambiguation via the Bayes rule to compute p(y|x). Modeling alternatives exist to estimate directly the conditional distribution p(y|x).
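
As a minimal sketch of this setting (the probability table and tag names below are invented for illustration only), a generative model that stores the joint distribution p(x, y) can disambiguate by applying the Bayes rule and taking the arg max over the candidate outputs:

    # Toy joint model p(x, y) over (word, tag) pairs; the numbers are invented.
    p_joint = {
        ("saw", "VBD"): 0.006,
        ("saw", "NN"):  0.001,
        ("the", "DT"):  0.060,
    }

    def disambiguate(x, candidates):
        """h(x) = argmax_{y in Y(x)} p(y|x).  Since p(y|x) = p(x, y) / p(x) and
        p(x) is constant for a fixed input x, maximizing p(x, y) is equivalent."""
        return max(candidates, key=lambda y: p_joint.get((x, y), 0.0))

    print(disambiguate("saw", ["VBD", "NN"]))   # VBD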

    Generative Models

Generative models define a probability distribution of the data parametrized by a stochastic generation mechanism of the data. Under this mechanism, a pair (x, y) is uniquely defined by its derivation, that is, the sequence of steps or decisions taken for generating (x, y) in some canonical order. The probability of a pair (x, y) is then the probability of chaining the corresponding steps in the derivation. The model defines which steps are possible, and makes assumptions on which elements of the x and y variables take part in a generation step. With this, a probability distribution can be defined parametrically by associating parameters with the decisions of a derivation. Broadly, parameters determine the probability of each possible instantiation of each decision (i.e., an event), and are estimated from data. In particular, each parameter is the conditional probability of the part of (x, y) generated in a step, given a factored history of the generation.

We now describe two generative models, Hidden Markov Models and Probabilistic Context-Free Grammars, which are standard models for tagging and parsing problems.

Hidden Markov Models (HMM). A hidden Markov model is a standard probabilistic model for language modeling and sequential disambiguation tasks (see [Rabiner, 1989] for a classic tutorial oriented to speech recognition). In a tagging task, x is a sequence of tokens x1 · · · xn and y is a sequence of tags y1 · · · yn attached to x. An HMM is a stochastic finite-state automaton, generating sequences of tokens. To generate a sequence x, the process first chooses an initial state, from which it emits the first token x1. Then, it transitions to a new state, emitting the second token, and so on until a designated final state is reached. Three probability distributions are involved in this process, namely the initial state distribution p0(s), and two conditional distributions, the state transition probability p(s′|s) from state s to s′, and the emission probability p(x|s) of a token x from a state s. There is a correspondence between the states of the automaton and the possible labels, so knowing the states which generated x determines y. An HMM defines a joint probability distribution of input-output sequences, p(x, y), by means of the transition, emission and initial state probabilities, estimated from data.

Thus, the assumptions in an HMM dictate that a token depends only on the state that generated it, and that a state is determined only by its previous state in the process. For example, a trigram HMM part-of-speech tagger defines a state for each combination of two part-of-speech tags 〈t−, t〉, where t is the tag of the current word and t− is the tag of the previous word. The state-transition distribution controls the probability of assigning the tag t+ to the following word when the previous and current word are tagged with t− and t respectively, with a parameter for every trigram 〈t−, t, t+〉 —hence the name of the tagger. The emission distribution controls the probability of generating a word w given the current and previous tags, with a parameter for every triple of two tags and a word 〈t−, t, w〉. To disambiguate the labels of a given sequence x with a trained HMM, the Viterbi algorithm finds the most probable sequence of states (associated with labels) that generates x.
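
The following sketch shows Viterbi decoding for a simple bigram-state HMM tagger; the probability tables are placeholders that would be estimated from data (see the maximum likelihood estimates below), and the trigram tagger of the example is obtained by letting each state be a pair of tags.

    import math

    def viterbi(tokens, tags, p0, p_trans, p_emit):
        """Most probable tag sequence for `tokens` under an HMM with initial
        probabilities p0[t], transitions p_trans[(t_prev, t)] and emissions
        p_emit[(t, word)].  Log probabilities are used to avoid underflow."""
        def lg(p):
            return math.log(p) if p > 0 else float("-inf")

        # delta[i][t]: best log-probability of a state sequence ending in tag t at position i
        delta = [{t: lg(p0.get(t, 0)) + lg(p_emit.get((t, tokens[0]), 0)) for t in tags}]
        back = []
        for i in range(1, len(tokens)):
            delta.append({})
            back.append({})
            for t in tags:
                prev = max(tags, key=lambda tp: delta[i - 1][tp] + lg(p_trans.get((tp, t), 0)))
                back[-1][t] = prev
                delta[i][t] = (delta[i - 1][prev]
                               + lg(p_trans.get((prev, t), 0))
                               + lg(p_emit.get((t, tokens[i]), 0)))

        # Follow back-pointers from the best final state to recover the sequence.
        last = max(tags, key=lambda t: delta[-1][t])
        seq = [last]
        for bp in reversed(back):
            seq.append(bp[seq[-1]])
        return list(reversed(seq))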

Probabilistic Context-Free Grammars (PCFG). PCFG models associate probabilities with sentence-tree pairs (x, y). The sequence of decisions of the generative mechanism is defined as the expansions of the grammar rules in a derivation of the tree, under a fixed parsing strategy (e.g., top-down, left-most derivations). Each rule has an associated probability, and the probability of an (x, y) pair is the product of the rule probabilities in the derivation of (x, y). In doing so, the assumption in PCFG models is that generating the right-hand side of a rule depends only on the left-hand side of the rule, that is, the non-terminal being expanded. Given a PCFG, finding the most probable tree for a sentence can be done in polynomial time with, e.g., the CKY algorithm [Charniak, 1993]. Modern statistical parsers, such as those with the best results on the WSJ corpus [Collins, 1999; Charniak, 2000], are PCFG models where the rules are lexicalized. That is, the non-terminals of the grammar are augmented with lexical items and other information which can be considered linguistically relevant to the generation process. For example, Collins [1999] proposes head-based PCFG models, where the non-terminals are augmented with the lexical head governing the non-terminal constituent. The expansion of a rule (i.e., generating the right part given the left part) is broken down into a number of steps. Broadly, first the head non-terminal is chosen; then the head modifiers, to the left and to the right of it, are selected sequentially, given the lexical head of the rule and other information generated during the rule expansion. Lexicalization leads to fine-grained PCFG models with a very large number of rules, with the advantage that now the model is much more expressive, since the model parameters capture dependencies between the lexical information of the left and right parts of a rule.
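
A minimal sketch of probabilistic CKY for an unlexicalized PCFG in Chomsky normal form follows; it only computes the best log-probability per span and non-terminal, omitting the back-pointers needed to recover the tree and all the refinements of lexicalized parsers.

    from collections import defaultdict
    import math

    def pcky(words, lexicon, rules):
        """Best log-probability per (span, non-terminal) under a PCFG in CNF.
        lexicon: {(A, word): prob} for A -> word;  rules: {(A, B, C): prob} for A -> B C."""
        n = len(words)
        best = defaultdict(lambda: float("-inf"))   # (i, j, A) -> best log-prob of A over words[i:j]
        for i, w in enumerate(words):
            for (A, word), p in lexicon.items():
                if word == w:
                    best[i, i + 1, A] = math.log(p)
        for width in range(2, n + 1):                # span length
            for i in range(n - width + 1):
                j = i + width
                for k in range(i + 1, j):            # split point
                    for (A, B, C), p in rules.items():
                        score = math.log(p) + best[i, k, B] + best[k, j, C]
                        if score > best[i, j, A]:
                            best[i, j, A] = score
        return best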

    Maximum Likelihood Estimates

A standard technique in parametric joint probabilistic models is to set the parameters to optimize the joint likelihood of the data. In this framework, the goal is to estimate the unknown distribution D that generates the training and test examples. A crucial assumption of the framework is that D belongs to the class of distributions considered by the probabilistic model. That is, there is an assignment of weights α to the model parameters for which the probability distribution is D, pα(x, y) = D. Under this assumption, learning translates to finding such an optimal assignment. Maximum-likelihood estimation fits the parameters to maximize the joint likelihood of the training data, αMLE = arg maxα ∏i pα(xi, yi). It is well known that, under the assumption that D belongs to the class of parametric distributions of the model, maximum-likelihood values make the estimated distribution converge to the natural distribution D as the amount of data goes to infinity.

It turns out that in history-based processes, such as HMM, PCFG and other models, the maximum-likelihood estimates correspond to simple relative frequencies of the decisions observed in the derivations of the training collection. In particular, the weight for a parameter associated with an observation o given a history h, p(o|h), is the fraction of training decisions with history h where o is observed. Furthermore, there exist many smoothing techniques to robustly estimate parameters in the case of data sparseness, that is, when some observations occur infrequently in the training data and produce unreliable frequencies (e.g., see the empirical study of Chen and Goodman [1996] and their references).
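
For instance, a relative-frequency estimator for the conditional parameters p(o|h) amounts to a few lines of counting; the (history, observation) events below are invented and no smoothing is applied.

    from collections import Counter

    def mle_conditional(events):
        """Relative-frequency (maximum-likelihood) estimates p(o|h) from a list of
        (history, observation) decisions extracted from training derivations."""
        joint = Counter(events)                   # counts of (h, o)
        hist = Counter(h for h, _ in events)      # counts of h
        return {(h, o): c / hist[h] for (h, o), c in joint.items()}

    # Toy HMM transition events (previous tag, current tag).
    events = [("DT", "NN"), ("DT", "NN"), ("DT", "JJ"), ("NN", "VBD")]
    print(mle_conditional(events)[("DT", "NN")])   # 0.666... = 2/3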

A major limitation of maximum-likelihood estimation techniques (pointed out by many authors [Johnson et al., 1999; Collins, 2004]), which apply either to joint or conditional estimation, comes from the assumption that the true distribution D is actually an instantiation of the class of probability distributions parametrized by the model. Typical language disambiguation models, such as generative models, make strong independence assumptions which clearly do not hold in the natural data. In other words, the representation of input-output pairs adopted by the model (in terms of features, in direct relation with the model parameters) is very poor. This limitation conflicts with the ability of linguists to describe language units in many different ways and with many sources of information. Enriching a generative model to include more features in the representation is a complex task, since the steps of a derivation have to be redefined to generate the included features at some point. Indeed, features not tied to the derivations of solutions add dependencies in the model, and usually make maximum likelihood estimation intractable.

Furthermore, apart from the restriction to the model class of distributions, the theoretical guarantees that maximum likelihood estimation converges to the true distribution are asymptotic, as the training size goes to infinity. Collins [2004] gives more formal arguments for these limitations, and discusses the advantages of distribution-free learning methods, which, as the name indicates, do not make assumptions on the underlying real distribution of the data. Also, the theory behind them provides strategies that benefit generalization to unseen data given that the amount of actual training data is limited. Section 2.2.1 takes a closer look at distribution-free learning.

Alternatives exist that model the conditional probability distribution of the data, and choose the parameters that maximize the conditional likelihood of the data [Johnson, 2001; Klein and Manning, 2002]. Here, estimation methods such as probability-based decision trees [Magerman, 1996], Maximum Entropy estimation [Ratnaparkhi, 1998; McCallum et al., 2000] or Conditional Random Fields [Lafferty et al., 2001] do not make strong independence assumptions, and thus allow flexible representations where arbitrary features can be coded. These estimation techniques are reviewed in the following section, as general discriminative learning techniques.

    2.1.2 Direct, Discriminative Learning

Discriminative learning solves the disambiguation learning problem directly, learning a map from inputs in X to outputs in Y. This contrasts with generative or informative learning (see [Rubinstein and Hastie, 1997] for a comparison between the generative and discriminative paradigms), where the approach is to learn the distribution generating X and Y, and then use it to disambiguate by examining the value of Y that was most probably generated with a given input x ∈ X. In probabilistic learning, generative methods estimate a joint probability distribution of the data, while discriminative methods estimate directly a conditional probability distribution. The family of discriminative algorithms we focus on moves away from estimating a (conditional) probability distribution. Rather, the algorithms concentrate on learning the boundaries between the possible disambiguation outputs.

The main concern in discriminative techniques is to learn accurate hypotheses, considering the error function as the ultimate practical measure to optimize when testing the hypothesis. In this framework, training data serves to discover regularities in the data that might be useful in predicting the output variables given the input variables. Since training and test examples are assumed to be drawn from the same distribution, one hopes that a good predictor learned from training data will also perform well on test data.

In the 90s, many general machine learning methods of the Artificial Intelligence (AI) community [Mitchell, 1997] were successfully applied to natural language problems that can be framed as classification tasks [Roth, 1998; Daelemans et al., 1997]. Typical problems here include Word Sense Disambiguation, Spelling Correction, Text Categorization, etc. In all cases, there is a pre-specified scheme of classes or categories, and the problem consists of assigning the most suitable class to a given instance (i.e., a word in context, a sentence, or a document).

Broadly, the design of a learning system for such problems involves issues concerning the representation of the input instances, the type of policy adopted to predict the most suitable class, and the algorithm that learns a predictor given a collection of training examples.

In terms of representation, the most common way is to represent an instance with a collection of propositional features. These features capture properties, attributes or characteristics of the instance, and their design depends both on the domain of the data –in our case natural language– and on the specific task we aim to solve. From this point of view, x is represented as a vector of features. Each coordinate in the vector corresponds to one feature, and the value of the coordinate is the value of the feature for the instance. In general, a feature set has to be expressive enough to characterize properly any of the instances of the input space X, so that it is possible to learn a predictor that discriminates correctly the output value of an instance x ∈ X by looking only at the feature values of x.
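
As an illustrative sketch (the feature templates are invented, not the ones used in this thesis), a word-classification instance can be mapped to its set of active propositional features, which is the usual sparse encoding of the feature vector:

    def features(sentence, i):
        """Active propositional features for classifying the word at position i.
        Only non-null features are listed; all remaining coordinates are implicitly zero."""
        w = sentence[i]
        feats = {
            "word=" + w.lower(),
            "prev_word=" + (sentence[i - 1].lower() if i > 0 else "<S>"),
            "suffix3=" + w[-3:].lower(),
        }
        if w[0].isupper():
            feats.add("is_capitalized")
        return feats

    print(features(["The", "bank", "closed"], 1))
    # e.g. {'word=bank', 'prev_word=the', 'suffix3=ank'}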

As for the type of predictors and learners, there exist many options. Some of the most popular classical AI methods include Decision Tree Learning, Decision Lists, Memory-Based Learning and Nearest Neighbor classifiers, or Transformation Based Learning, among others. In the probabilistic family, the two most common techniques are Maximum Entropy Estimation and Naive Bayes.2 Finally, Artificial Neural Networks are also a common choice in NLP, although, as we explain below, recent research has popularized the use of simpler learners based on linear separators that optimize margins, resulting in algorithms such as Perceptron, Winnow and Support Vector Machines, or related algorithms such as AdaBoost.

With so many choices, it is desirable to understand the properties of the different problems and algorithms. From a learning point of view, natural language problems are characterized by feature spaces of very large dimensionality. For example, to achieve reasonable expressiveness, it is common to consider one or several dimensions for each word of the language, or even for combinations of words. In such spaces, the representation of instances is very sparse, that is, most of the features in the vector have a null value. For example, a representation where features are binary indicators usually has tens of thousands of dimensions, but only a few hundred of them have a non-trivial value in a particular instance. On the other hand, the size of a training collection varies from problem to problem, and goes from just hundreds of examples to tens or even hundreds of thousands of examples. With these properties, a learning algorithm for NLP has to be computationally efficient at dealing with big training sets of huge dimensionality, where instances occur sparsely. It also has to be robust at dealing with irrelevant features.

2Naive Bayes is a generative classifier, not a discriminative one. In practice, however, it is successfully used with representations in which features are not independent, and thus violate the assumptions of the learner. In this sense, the learner is used in a direct, discriminative way, rather than aiming at estimating a generative model of the data. See Roth [1999].

It turns out that many of the cited learning algorithms, although designed from different ideas, operate with linear decision surfaces [Roth, 1998, 1999]. That is, the shape of a learned model is a linear separator in the representation space that discriminates between the positive and the negative instances of the target concept. In other words, with linear separators the problem of not producing errors translates to the problem of separating well between positive and negative instances. Therefore, the difference between algorithms learning a linear separator lies in the criterion for choosing a particular separator among all possible separators. This observation facilitates the analysis of algorithms under a unified framework, that of learning linear separators. In Section 2.2.1 we give an overview of some theoretical concepts and results of this framework. Importantly, these results translate into clues about the relevant quantities to optimize during learning and, in turn, motivate algorithms such as Support Vector Machines, AdaBoost, or the well-known Perceptron, the latter being the core algorithm used in the systems that this thesis proposes.
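
As a reference point for the following chapters, here is a minimal sketch of the classic Perceptron learning a linear separator over sparse binary features such as those illustrated above; the hyperparameters and data format are illustrative, and the voted and structured variants used later in the thesis build on this same update rule.

    from collections import defaultdict

    def perceptron_train(examples, epochs=10):
        """Learn a linear separator with the Perceptron rule.  Each example is
        (active_features, label) with label in {-1, +1}; the weight vector is
        kept sparse as a feature -> weight mapping."""
        w = defaultdict(float)
        for _ in range(epochs):
            for feats, y in examples:
                score = sum(w[f] for f in feats)
                if y * score <= 0:            # mistake (or zero margin): additive update
                    for f in feats:
                        w[f] += y
        return w

    def predict(w, feats):
        return 1 if sum(w[f] for f in feats) > 0 else -1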

    2.1.3 Learning and Inference Paradigm

So far, we have reviewed two families of supervised learning methods: generative learning –aiming to explain the data– and discriminative learning –focusing directly on differentiating what is correct from what is not.

The generative approach offers well-understood techniques for problems with structured output –which is the case of the problem of recognizing phrase structures in sentences or, more generally, of tagging and parsing problems. Examples of such techniques are HMM or PCFG, discussed in Section 2.1.1, as well as other more general graphical models [Jordan, 2004]. However, as has been pointed out, a major limitation of generative models comes from the fact that features are tied to the derivations assumed to generate the data, which makes it difficult to incorporate arbitrary and dependent features for supporting predictions.

On the other hand, discriminative learning imposes no conditions on the features representing learning instances, and thus offers flexibility in combining different, possibly dependent knowledge sources and characteristics of the data. However, discriminative learning techniques are designed and analyzed as general algorithms for classification problems. That is, the learning algorithms of the machine learning community are developed to discriminate among a limited, fixed number of classes. When the nature of the problem has a structured, exponential-sized output space, such as the sequential or hierarchical solutions of tagging and parsing problems, these algorithms cannot be used “out of the box”, for several related reasons. First, because most of the discriminative methods rely on enumerating exhaustively all possible output values during training and prediction. Obviously, this is not feasible for exponential-sized output spaces. Even when the learning method is not sensitive to the number of output classes (such as, e.g., nearest neighbor learners), the learner will suffer from data sparseness, and will not be able to generalize. But, most importantly, treating each possible output structure as an atomic class does not reflect appropriately the nature of the structures: two different structures may have a lot of substructure in common, and differ only in specific parts. Thus, a structured-output space has to be compactly modeled, in a way that shared substructure is factored in the model.

For these reasons, there is a need to combine the strength of discriminative learning at arbitrarily representing instances with the strength of generative learning at compactly representing solutions with factored models.

We refer to learning and inference as the paradigm that studies learning techniques applying to complex domains where: (1) a global complex prediction is decomposed into many local simple predictions, and (2) there are dependencies and constraints that influence what combinations of local predictions form a correct global prediction. As a simple example, consider the Part-of-Speech (PoS) tagging problem, where the goal is to assign the correct PoS tag to each word of a given sentence. Here, the global solution is the sequence of PoS tags of a sentence, which can be computed by locally predicting a tag for each word. Clearly, there exist dependencies between the tags of a sentence, such as that a determiner often precedes a noun, or that usually a sentence contains at least one verb. In such scenarios, we differentiate two processes. First, a learning-based process, with learning functions as main components, that predicts values for local parts of a sentence. Second, an inference process, with an exploration algorithm as its main component, that combines local predictions to form a global solution. When dependencies and constraints exist, the role of the inference is to trade off values predicted at different positions to obtain a globally optimal solution, as in the sketch below.
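
The following toy sketch (invented tag set and scores, not the method proposed in this thesis) makes the two processes explicit for the PoS example: local scores come from learned predictors, and a tiny inference step revises them when the global constraint "the sentence contains at least one verb" is violated.

    def tag_with_constraint(scores, verb_tags=frozenset({"VBD", "VBZ"})):
        """`scores` holds, for each word, a {tag: local score} map produced by the
        learning-based process (every map is assumed to score all verb tags).
        Local predictions are first taken independently; if the global constraint
        (at least one verb) is violated, the word whose switch to a verb tag is
        cheapest is revised, trading off local predictions for global coherence."""
        tags = [max(s, key=s.get) for s in scores]
        if not any(t in verb_tags for t in tags):
            def switch_cost(i):
                s = scores[i]
                return s[tags[i]] - max(s[t] for t in verb_tags)
            i = min(range(len(scores)), key=switch_cost)
            tags[i] = max(verb_tags, key=lambda t: scores[i][t])
        return tags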

In fact, traditional generative models make use of such learning and inference processes, the former to estimate the distribution generating the data, and the latter to disambiguate the structure of a given sentence. As pointed out, though, discriminative models are currently preferred, and their design and development constitutes nowadays an active research direction in areas that deal with complex data, such as natural language processing, information retrieval, computer vision or computational biology.

In the literature, Dietterich [2002] reviews a specific scenario of learning and inference, that of supervised sequential learning, or tagging. The general problem consists of assigning labels to the components of a sequence, such as in PoS tagging. Below we describe in more detail families of inference algorithms and the role of learning in such systems.

Chained Classifiers. A simple approach to apply general discriminative learning to structured domains consists of learning classifiers which assign values to local parts. To predict the global structure of a sentence, the different parts of a sentence are explored in some canonical order (e.g., from left to right in sequences) and, at each part, the classifiers are used to assign the most plausible value. In the recurrent version of the strategy –the most common one– a prediction in one part makes use of the values assigned in previous parts, thus exploiting dependencies between neighboring output values.

Such a simple strategy is not able to trade off decisions at different points. That is, the local predictions at one part are final in the solution, with no attempt to globally optimize the outcome. Hence, the technique is merely an exploration, rather than an inference process. Despite this limitation, and maybe because of its simplicity, many systems in the literature make use of this strategy, especially for problems with simple constraints and few dependencies, and often combined with powerful learning algorithms that commit few errors. The sketch below illustrates the recurrent, left-to-right version of the strategy.
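
A minimal sketch of a recurrent chained tagger follows; the feature templates are invented and `classify` stands for any trained local classifier (e.g., the Perceptron above), so this is an illustration of the strategy rather than an implementation from this thesis.

    def chained_tagging(sentence, classify):
        """Recurrent chained classification: positions are visited left to right and
        each local prediction may condition on the tags already assigned (here, the
        previous predicted tag).  Decisions are final once made: no global trade-off."""
        tags = []
        for i, word in enumerate(sentence):
            feats = {
                "word=" + word.lower(),
                "prev_tag=" + (tags[i - 1] if i > 0 else "<S>"),
            }
            tags.append(classify(feats))
        return tags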

Probabilistic Inference. Probabilistic learning provides models for dealing with structured domains that exploit dependencies between different parts of a structure. In such models, the global (joint or conditional) probability distribution is decomposed into several local conditional probability distributions that are estimated from data with Maximum Likelihood techniques. Once the model is learned, there exist well-known inference algorithms for disambigua