
Using the semantics of prepositions for ontology learning

Master thesis of Vincent Jacobs

Utrecht University

Supervisor: Paola Monachesi

Date: July 31, 2006


Contents

Abstract

Acknowledgements

1 Introduction

2 Ontologies and ontology learning
   2.1 The Semantic Web
   2.2 Ontologies
   2.3 Semantic Web components
   2.4 Ontology learning
       2.4.1 The layer cake
   2.5 Ontology learning techniques
       2.5.1 Natural Language Processing
       2.5.2 Data mining methods
       2.5.3 Machine learning

3 Relation extraction
   3.1 Semantic roles
   3.2 Semantic relations
       3.2.1 From roles to relations using mappings
       3.2.2 From roles to relations using syntactic dependencies
   3.3 Prepositions
       3.3.1 Disambiguating prepositions
   3.4 Ambiguity of preposition syntax
   3.5 Extraction components
       3.5.1 Corpus
       3.5.2 The Alpino parser
       3.5.3 The Alpino Treebank
       3.5.4 XPath and XSLT
   3.6 Extraction methodology

4 Less ambiguous prepositions
   4.1 The BMO relation
       4.1.1 Syntactic aspects
       4.1.2 Extraction method and formats
   4.2 The CB relation
       4.2.1 Example extraction process

5 Ambiguous prepositions
   5.1 PropBank
       5.1.1 PropBank annotation on the Alpino Treebank
       5.1.2 Extraction using PropBank information
   5.2 FrameNet
       5.2.1 Pluriformity of annotation
       5.2.2 FrameNet and the Alpino Treebank
       5.2.3 Extraction using FrameNet information

6 Conclusion

A Extraction

B Extraction using PropBank

C Extraction using FrameNet


Abstract

To help automate the process of ontology creation, computational linguists have been working to develop methods that extract ontology elements from natural language. Their focus thus far has been mainly on relation extraction; linguistic elements frequently used for this purpose are nouns, verbs and adjectives. Prepositions have been fairly unpopular by comparison, probably due to their polysemous nature and elusive semantics. This thesis demonstrates that progress can be made in this field by selective application of prepositions in relation extraction and discusses ways in which semantic annotation projects like PropBank and FrameNet can be included in the process.

Keywords: ontology learning; non-taxonomic relations; prepositions; FrameNet; PropBank.


Acknowledgements


Chapter 1

Introduction

The word ontology has a very long history – it has been around in traditional philosophy since the Ancient Greeks. It is used to refer to a branch of metaphysics that is concerned with the concept of existence; ontology is an attempt to say what entities exist.

More recently, the word has taken on a new meaning within the field of artificial intelligence and computer science. Here, it is used to refer to a data model describing a specific domain and containing objects from that domain. An ontology can also be used to reason about the objects in the domain and the relations between them. Note that while one can speak of an ontology in a computer science context, it makes little sense to do so in a philosophical context.

A lot of the interest in ontologies was sparked by Berners-Lee et al. (2001); it presented a grand vision for structuring the information on the World Wide Web. In the Semantic Web (as this vision was called) ontologies would provide a machine-readable version of the Web's content. This would enable more or less independent computer programs (or agents) to autonomously reason with information, thus becoming an invaluable asset to users of the Web.

Computational linguists have since been trying to extract ontology elements from natural language to automate the process of ontology creation. This process is called ontology learning. Ontology learning is a process containing many tasks, each task building upon the results of other tasks. One of these tasks is the extraction of relations, which are formed between the objects in the ontology. This thesis concerns itself with this task.

Nouns, verbs and adjectives are the linguistic elements used most often for relation extraction (cf. Bannard and Baldwin, 2003; Schutz and Buitelaar, 2005). Prepositions on the other hand – while being an active research topic in other contexts – do not receive a lot of attention. A possible explanation for this lack of interest is the general perception of prepositions as having too little semantic content and being overly polysemous, thus providing very little useful information (Bannard and Baldwin, 2003). Along with those problems, the semantics of spatial prepositions (e.g. under, across and around) do not fit well in an ontology, and the semantics of phrasal verbs (e.g. act up, drop out) are hard to split into a verbal and a prepositional part.

While it is therefore true that using preposition semantics in ontology learning is rather difficult, considerable progress can be made by selective application. According to Bannard and Baldwin (2003), the "semantics of peripheral [prepositional phrases] is determined largely by the preposition (e.g. from March, in Toulouse, by the gate)", suggesting that these phrases are good candidates for relation extraction. Since this thesis is concerned with Dutch prepositions, it is good to note that this observation also holds for Dutch: 'bij de deur' (at the door) is very different from 'door de deur' (through the door).

Performance can be increased further with an additional restriction. Using only those prepositions expressing semantic relations that can easily be added to an ontology avoids having to untangle preposition semantics. These two restrictions combined take away most objections to using preposition semantics in ontology learning, leaving only the ambiguity problem. This brings us to the research question of this thesis: how can we use the semantics of prepositions for relation extraction in ontology learning?

I will start off with a more thorough introduction to the Semantic Web, the role of ontologies therein and the ingredients of ontology learning. The next chapter (Chapter 3) discusses relation extraction from corpora, what kind of relations can be extracted and what the link is between these relations and certain prepositions. It also describes a methodology for the extraction of relations from prepositions. Chapter 4 shows how this methodology can be applied to unambiguous prepositions and Chapter 5 extends it to ambiguous prepositions. The last chapter gives an answer to the research question, discusses the realization of objectives and suggests directions for further research.


Chapter 2

Ontologies and ontology learning

This chapter provides an introduction to the Semantic Web. It shows how ontologies play their part in the Semantic Web, and discusses Semantic Web components such as XML, RDF and OWL. Finally, it demonstrates how the information in an ontology can be 'learned' from free text and describes methods to do so.

2.1 The Semantic Web

While the World Wide Web contains an enormous quantity of information, almost all of this information is accessible only to humans. Software programs (including search engines) have difficulty interpreting the information that is on the Web because of the way it is structured.

The Semantic Web is seen by many as a solution to this problem, and as such as the future of the World Wide Web. Although it is not yet a reality, considerable progress has been made in the last couple of years. The vision of the Semantic Web was introduced by Tim Berners-Lee in his landmark article published in May 2001 (Berners-Lee et al., 2001).

The goal of the Semantic Web project is to make web pages and their contents understandable to computers. This is done by making the meaning of documents on the Web explicit, in a manner that is understandable to machines. This machine readability of web pages might allow for more accurate web search and communication among heterogeneous web services and devices accessible from the web.

(2.1) A practical example of this would be the automatic reservation of an airline ticket on an online booking service. A more or less autonomous program (an agent) could compare prices from various airline companies, check with the handheld personal organizer of the client for possible planning problems, strive to avoid long transfer times and then make the final booking. This kind of scenario will only be possible if the computer can understand the meaning of the pages, services and devices that can be found on the Web.

[Figure 2.1: A basic pizza ontology. 'Margherita Pizza' and 'Quattro Stagioni Pizza' are instances of 'Pizza'; 'Pizza' and 'Pint of Beer' are linked to 'Meal' via partOf; both pizzas share the attribute hasIngredient (Cheese).]

2.2 Ontologies

But how does one store the meaning of documents in a manner understandable to machines? For this, ontologies are used. An ontology is a data model which specifies a shared understanding of a domain and can be used to reason about the objects in that domain and the relations between them. It seeks to reduce or eliminate conceptual confusion when sharing information among (human or non-human) users.

Ontologies use four basic concepts to specify a conceptualization. All these elements combined form a formal description of the domain the ontology is concerned with. Consider Figure 2.1, which depicts a very small ontology.

Individuals Individuals are the basic objects or entities that populate the ontology. These entities may either be concrete (people, animals, automobiles, etc.) or abstract (numbers, words, feelings). In the classic pizza example (Horridge et al., 2004), a particular Quattro Stagioni Pizza can live inside the ontology.


Classes Classes are sets, types or collections of objects. Objects are 'instances of classes'; hence classes can be used to indicate relatedness between objects. In the pizza example, both 'Margherita Pizza' and 'Quattro Stagioni Pizza' belong to the 'Pizza' class, because both pizzas are instances of the pizza class.

Attributes Attributes are properties of objects that can also be shared among objects. Both the Margherita Pizza and the Quattro Stagioni Pizza share the property of containing Cheese. This attribute is thus a shared attribute for both pizza types.

Relations Relations define ways that objects relate to one another. If we again refer to the pizza example, a relation could define that a pizza is a part of an Italian-style meal. A pint of beer could have the same relationship with the Italian meal. The ontology then contains three concepts (a pizza, a pint of beer and a meal) that are interlinked by means of relationships.
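The four building blocks above lend themselves to a concrete sketch. Below is a minimal, hypothetical rendering of the pizza ontology of Figure 2.1 as plain Python data structures; the identifiers (MargheritaPizza, partOf, hasIngredient, and so on) are illustrative names, not part of any ontology standard:

```python
# A toy in-memory model of the pizza ontology from Figure 2.1.

classes = {"Pizza", "Beer", "Meal"}

# Individuals are instances of classes.
instance_of = {
    "MargheritaPizza": "Pizza",
    "QuattroStagioniPizza": "Pizza",
    "PintOfBeer": "Beer",
}

# Attributes are properties that can be shared among objects.
attributes = {
    "MargheritaPizza": {"hasIngredient": "Cheese"},
    "QuattroStagioniPizza": {"hasIngredient": "Cheese"},
}

# Relations link objects (and classes) to one another.
relations = [
    ("Pizza", "partOf", "Meal"),
    ("PintOfBeer", "partOf", "Meal"),
]

def siblings(individual):
    """All individuals that share a class with the given one."""
    cls = instance_of[individual]
    return {i for i, c in instance_of.items() if c == cls and i != individual}

print(siblings("MargheritaPizza"))  # the other instance of the Pizza class
```

Even this toy model supports simple reasoning: because both pizzas are instances of the same class, their relatedness can be read off the data rather than guessed from their names.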

[Figure 2.2: Three levels of generality in a domain ontology – Foundational Ontology, Core Ontology and Domain Ontology. Adapted from Navigli et al. (2003).]

In contrast to "lightweight" ontologies that comprise only a small taxonomy, larger ontologies can feature a layered structure similar to Figure 2.2. Each layer represents a different level of generality. The topmost layer or Foundational Ontology¹ contains only the most basic principles, supporting the model's generality to ensure reusability across different domains. The middle layer (or Core Ontology) typically defines several hundred key domain conceptualizations, which are embedded in the ontology in compliance with the structure defined in the Foundational Ontology. Finally, the bottom layer (or Domain Ontology) further refines the model with the most specific concepts and relations concerning the domain (Navigli et al., 2003). An example of an object that can be found in a Domain Ontology is the word 'card'; an ontology on the domain of poker would model the 'playing card' meaning, whereas an ontology on the computer domain would model meanings like 'network card' and 'memory card'.

Not all domain ontologies are built as an extension of a Foundational Ontology. This, and internal inconsistency, can lead to disagreement between domain ontologies, caused by factors like cultural background and language differences.

¹ Examples of Foundational Ontologies include SUMO, OpenCYC and DOLCE.


The process of re-syncing ontologies – resolving the semantic correspondences between heterogeneous models – is called ontology alignment and is the subject of many current research projects. Examples of such systems include CM (Magnini et al., 2004) and GLUE (Doan et al., 2004). Both systems try to interpret the meaning of ontology nodes; if nodes from different ontologies have matching semantics, they can be used as starting points for alignment or integration. Or, quoting Sceffer (2004): "One of the major challenges [. . . ] is the automatic exchange of data throughout distributed and heterogeneous systems. This can be considered as a problem of semantic coordination since it requires a sort of agreement on how to map and semantically connect a model to each other. It can be addressed by defining and using shared models and global schemas. [. . . ] The natural answer [however] is a form of peer-to-peer coordination to relate different models directly."

2.3 Semantic Web components

For the Semantic Web to function, it needs an infrastructure of language, conventions and shared vocabulary. The five most prominent components of this infrastructure (XML, XML Schema, RDF, RDF Schema and OWL) are discussed briefly below. This section is by no means intended as an introduction to the field, but rather as a very general overview to get a grasp of what is needed to make the Semantic Web work.

In its current form, the World Wide Web uses HTML (hypertext markup language) as its common language, meaning that all Web pages are written in some version of HTML. A fragment of my Web page containing information on this thesis might look something like this:

(2.2) Using the semantics of prepositions for ontology learning
Master Thesis of Vincent Jacobs
Utrecht University

The HTML code for this fragment would probably be quite similar to this:

(2.3) <h3>Using the semantics of prepositions for ontology learning</h3>

<i>Master Thesis of <b>Vincent Jacobs</b></i><br>

Utrecht University

This example illustrates why search engines and other web-enabled software programs have such difficulty understanding the meaning of Web pages. Because HTML was intended as a markup language, it does not provide any information about the content of the page. Compare this to the same information, this time structured in a language called XML (extensible markup language, cf. Bray et al., 2004):

(2.4) <thesis>
  <title>Using the semantics of prepositions for ontology learning</title>
  <author>Vincent Jacobs</author>
  <type>Master thesis</type>
  <school>Utrecht University</school>
</thesis>

If my Web page looked like this, the search engine would have explicit information that this piece of text describes a thesis, because the content is surrounded by thesis tags. Moreover, it would also know that <title>, <author> etc. specify the properties of thesis, because these elements are nested within the <thesis> element. Another characteristic of XML is that it does not tightly restrict what element tags are allowed. If I want to add information on my thesis, say a download link, a set of tags containing a URL is easily added:

(2.5) <thesis>
  <title> etc.
  ...
  <url>http://mydomain.org/thesis.pdf</url>
</thesis>

The lack of a predefined vocabulary for XML shows that XML is "a metalanguage for markup: it does not have a fixed set of tags but allows users to define tags of their own" (Antoniou and van Harmelen, 2004). Its extensible character makes it ideal to build further upon, creating specialized vocabularies for each application domain. Examples of these applications include MathML (mathematics) and AML (astronomy) but also RDF (resource description framework) and OWL (web ontology language), which are explicitly intended as data models for the Semantic Web. Both RDF and OWL are discussed in more detail further on.
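Because the tags make the content self-describing, a program can pull these fields out directly. A minimal sketch using Python's standard xml.etree module on the document of Example 2.4 (the extraction logic is an illustration, not part of the thesis):

```python
import xml.etree.ElementTree as ET

# The XML fragment from Example 2.4.
doc = """<thesis>
  <title>Using the semantics of prepositions for ontology learning</title>
  <author>Vincent Jacobs</author>
  <type>Master thesis</type>
  <school>Utrecht University</school>
</thesis>"""

root = ET.fromstring(doc)
# Element names tell the program what each piece of text means,
# so no guessing from visual layout (as with the HTML of Example 2.3).
author = root.findtext("author")
school = root.findtext("school")
print(author, "-", school)
```

The same fields are present in the HTML of Example 2.3, but there they can only be recovered by fragile heuristics over presentation tags.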

Two applications exchanging information by means of XML documents must be able to rely on the fact that they are both speaking the exact same 'dialect'. Not only must the vocabulary be coordinated, but also what values an attribute may take and what elements are or aren't allowed within other elements. There are two ways to formally define this structural information: DTDs (a strict and rather outdated method) and XML Schema, which offers more possibilities and extended ways to define data types. For brevity's sake, only XML Schema is discussed here.

XML Schema (Biron and Malhotra, 2004; Thompson et al., 2004), which defines the structure of an XML document, is itself written in XML. This has a distinct advantage: XML Schema can (unlike DTD) be parsed with existing tools. This reduces overhead by eliminating the need for additional parsers, editors, validity checkers, and so on. Another advantage is the possibility of extending and refining already existing schemas: one can build schemas from other schemas. XML Schema also provides an extended set of data types, where DTD is limited to string elements.

A very brief example shows how the XML document containing information on my thesis (Example 2.4) can be defined using XML Schema.

(2.6) <?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="thesis">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="author" type="xs:string"/>
        <xs:element name="type" type="xs:string" minOccurs="0"/>
        <xs:element name="school" type="xs:string" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Highlighting some features in this XML Schema, we can see that the <thesis> element is defined as a complexType (third and fourth line). A complexType is a user-defined data type containing elements and attributes. Inside of <xs:complexType>, there's a sequence element, indicating that the order of the elements that follow is important. Another point of interest is minOccurs="0", which is an attribute of the elements defining type and school. This attribute defines the minimum number of occurrences of these elements as zero, effectively making them optional.
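The effect of sequence and minOccurs can be sketched in code. The toy checker below re-states the constraints of Example 2.6 as a simple Python list; it only illustrates these two ideas and is in no way a real XSD validator:

```python
import xml.etree.ElementTree as ET

# Toy re-statement of Example 2.6: element name -> minimum occurrences.
# title and author are required; type and school have minOccurs="0".
schema = [("title", 1), ("author", 1), ("type", 0), ("school", 0)]

def conforms(xml_text):
    """Check a <thesis> document against the toy schema above."""
    root = ET.fromstring(xml_text)
    if root.tag != "thesis":
        return False
    children = [child.tag for child in root]
    pos = 0
    # xs:sequence: children must appear in schema order;
    # elements with minOccurs="0" may simply be absent.
    for name, min_occurs in schema:
        if pos < len(children) and children[pos] == name:
            pos += 1
        elif min_occurs == 0:
            continue  # optional element missing: still valid
        else:
            return False
    return pos == len(children)

# <type> and <school> are optional, so this document conforms:
print(conforms("<thesis><title>T</title><author>A</author></thesis>"))
```

A document that omits the required <author>, or reorders the children, would be rejected, mirroring what an XSD validator enforces for this schema.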

RDF (resource description framework, cf. Lassila and Swick, 1998) takes XML as a starting point and adds a layer of meaning (or semantics) to the data representation. It lays down "a foundation for processing metadata; it provides interoperability between applications that exchange machine-understandable information on the Web. Basically, RDF defines a data model for describing machine processable semantics of data primitives defined by RDF Schema." (Broekstra et al., 2000). RDF Schema will be discussed further on.

The building blocks of RDF are resources, properties and statements. Resources are the things we want to convey information on, for instance this thesis. Every resource has a URI, a Uniform Resource Identifier. Often, this URI comes in the form of a URL (Uniform Resource Locator, or Web address). For example, I can identify my thesis by its (fictitious) URL:

(2.7) http://mydomain.org/thesis.pdf

Properties then describe the relations between resources, for instance "published by", "age", "owned by", and so forth. As properties are essentially a special kind of resource, they are also identified by means of a URI. To state that my thesis has an author, we identify the property "isWrittenBy", again by means of a URL:

(2.8) http://mydomain.org/thesisinfo#isWrittenBy

Finally, the last of the three building blocks of RDF: statements. Statements assert the value of a property, taking the form of a triple. This triple contains a resource, a property and a value. Suppose we want to make a statement about my thesis, say, that I am the author. The corresponding triple would then be:

(2.9) {http://mydomain.org/thesis.pdf, http://mydomain.org/thesisinfo#isWrittenBy, "Vincent Jacobs"}

The next example shows how this triple can be embedded into RDF.

(2.10) <?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:ti="http://mydomain.org/thesisinfo#">
  <rdf:Description rdf:about="http://mydomain.org/thesis.pdf">
    <ti:isWrittenBy>Vincent Jacobs</ti:isWrittenBy>
  </rdf:Description>
</rdf:RDF>

Without going into too much technical detail, as this would go well beyond the scope of this thesis, I'll quickly describe how this fragment is built up. The two statements on the third and fourth line reference the 'namespaces' used in this RDF document. Namespaces are expected to be "RDF documents defining resources, which are then used in the importing RDF document" (Antoniou and van Harmelen, 2004). In this manner, RDF definitions can be referenced and re-used by others. The next line contains a reference to the RDF Description element, making a statement about the resource http://mydomain.org/thesis.pdf (which, as you may recall, is supposed to be my thesis). The ti:isWrittenBy element (from the ti namespace defined on line 4) defines the author of this resource, which in this case is me. So, having defined the resource (my thesis), the property (isWrittenBy) and the value ("Vincent Jacobs"), we can conclude this RDF to be semantically equal to the triple from Example 2.9.
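That equivalence can be checked mechanically. The sketch below parses the RDF/XML of Example 2.10 with Python's standard xml.etree and recovers the triple of Example 2.9; a real application would use a dedicated RDF library, so this is only an illustration:

```python
import xml.etree.ElementTree as ET

# The RDF fragment from Example 2.10.
doc = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ti="http://mydomain.org/thesisinfo#">
  <rdf:Description rdf:about="http://mydomain.org/thesis.pdf">
    <ti:isWrittenBy>Vincent Jacobs</ti:isWrittenBy>
  </rdf:Description>
</rdf:RDF>"""

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

triples = []
root = ET.fromstring(doc)
for desc in root.findall(f"{{{RDF}}}Description"):
    resource = desc.get(f"{{{RDF}}}about")  # the subject (my thesis)
    for prop in desc:
        # ElementTree stores tags as {namespace-URI}localname; dropping the
        # braces yields exactly the property URI: namespace + local name.
        prop_uri = prop.tag.replace("{", "").replace("}", "")
        triples.append((resource, prop_uri, prop.text))

print(triples)
```

The single recovered triple matches Example 2.9: the resource URL, the property URI http://mydomain.org/thesisinfo#isWrittenBy, and the literal value "Vincent Jacobs".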

RDF Schema (Brickley and Guha, 2000) builds on top of RDF. While RDF is a universal language that lets users "describe resources using their own vocabularies" (Antoniou and van Harmelen, 2004), RDF Schema provides the basic elements for ontologies, while structuring RDF resources. Or, said in the words of Broekstra et al. (2000): "The modeling primitives offered by RDF are very basic. Therefore, the RDF Schema specification defines further modeling primitives in RDF. Examples are class, subclass relationship, domain and range restrictions for property, and subproperty. With these extensions, RDF Schema comes closer to existing ontology languages".

[Figure 2.3: The relationship between RDF and RDFS. The RDF layer links the thesis ("Using the semantics of prepositions for ontology learning") to Vincent Jacobs via isWrittenBy; the RDF Schema layer types the thesis as Master thesis (a subClassOf Thesis) and me as MA student (a subClassOf Student, like PhD student), with Thesis and Student as the domain and range of isWrittenBy. Illustration adapted from Antoniou and van Harmelen (2004).]

Figure 2.3 provides a graphical representation of the interconnections between RDF and RDFS (a common abbreviation for RDF Schema). It is again based on the example regarding my thesis. The bottom half represents the part that is in RDF; a thesis is defined in RDF with a URL (http://mydomain.org/thesis.pdf, in this illustration with its title) which is connected to me by means of the isWrittenBy property. These three items are then structured in the upper, RDF Schema part.

The thesis is described as a Master thesis, which is a kind of (or subClassOf) Thesis. On the right, I am defined to be a Master student, which is a kind of Student. A PhD student is also a kind of student, and can thus be defined as a subClassOf Student. The range and domain of the isWrittenBy relation are defined in the middle as Student and Thesis, respectively. A more colloquial description could say that theses are written by students, and that more than one kind of student exists. Of course, this model is not complete: PhD theses are not defined in this model, along with a host of other entities from the 'thesis domain' that could be linked in.
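The RDF Schema layer of Figure 2.3 can be sketched as plain data. The toy code below is not an RDFS reasoner; it merely illustrates how subClassOf chains and the domain and range of isWrittenBy constrain which statements make sense:

```python
# subClassOf links and the isWrittenBy signature, taken from Figure 2.3.
subclass_of = {
    "Master thesis": "Thesis",
    "MA student": "Student",
    "PhD student": "Student",
}

# property -> (domain, range): theses are written by students.
property_signature = {"isWrittenBy": ("Thesis", "Student")}

def is_a(cls, ancestor):
    """Follow subClassOf links upward (a class is an is_a of itself)."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = subclass_of.get(cls)
    return False

def statement_ok(subject_class, prop, value_class):
    """Check a statement against the property's domain and range."""
    domain, rng = property_signature[prop]
    return is_a(subject_class, domain) and is_a(value_class, rng)

# A Master thesis may be written by an MA student:
print(statement_ok("Master thesis", "isWrittenBy", "MA student"))  # True
```

Swapping subject and value makes the check fail, since an MA student is not in the domain (Thesis) of isWrittenBy: exactly the kind of conceptual confusion the schema is meant to rule out.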

Concluding our small tour round the components of the Semantic Web, we arrive at OWL (web ontology language, cf. Smith et al., 2003), the semantically richest Semantic Web component of all that are discussed here. OWL can be seen as an extension of RDF Schema, although this brings about serious problems concerning expressive power and efficient reasoning. For this reason, three 'flavors' of OWL have been created: OWL Full, OWL DL and OWL Lite.

OWL Full is the entire language, free from any constraints. It is completely upward compatible with RDF/RDFS, both syntactically and semantically. The resulting power of expression comes at a price: OWL Full is so powerful that it is undecidable, rendering it useless for reasoning purposes. OWL DL² solves the decidability problem by restricting the application range of OWL constructors, bringing the language in line with Description Logics. Since DL is decidable, so is OWL DL, be it at the cost of upward compatibility with RDF: RDF documents will "in general have to be extended in some ways and restricted in others before [they are] legal OWL DL documents" (Antoniou and van Harmelen, 2004). Finally, OWL Lite provides a less complex starting point for ontology users and tool builders. In comparison to OWL DL and OWL Full, it lacks features like disjointness statements and arbitrary cardinality.
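To make this a little more concrete, a class hierarchy like the thesis example might be written in OWL's RDF/XML syntax roughly as follows (a hypothetical fragment using the standard owl:Class, rdfs:subClassOf and owl:disjointWith constructs, with namespace declarations omitted):

```xml
<owl:Class rdf:about="#Thesis"/>
<owl:Class rdf:about="#MasterThesis">
  <rdfs:subClassOf rdf:resource="#Thesis"/>
</owl:Class>
<owl:Class rdf:about="#PhdThesis">
  <rdfs:subClassOf rdf:resource="#Thesis"/>
  <!-- a disjointness statement: nothing is both kinds of thesis.
       Expressible in OWL DL and OWL Full, but not in OWL Lite. -->
  <owl:disjointWith rdf:resource="#MasterThesis"/>
</owl:Class>
```

The owl:disjointWith axiom illustrates the trade-off just described: it is available in OWL DL and OWL Full but is one of the features OWL Lite gives up in exchange for simplicity.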

2.4 Ontology learning

Ontologies can be hand-built by a human operator with the help of an ontology engineering tool such as Protege (Noy et al., 2001). While this is sufficient for small-scale domain ontologies and research purposes, it is not a viable option for the creation of large ontologies due to their complexity and size.

² DL is an initialism for Description Logics, a family of knowledge representation languages.


Because of this limitation, the automatic construction of ontologies – commonly referred to as ontology learning – has become the subject of intensive research over the last years. This research has come from various corners of the scientific community such as database research and artificial intelligence as well as computational linguistics.

While considerable progress has been made by computational linguists in tackling the problem of acquiring word semantics from natural language texts (Gomez-Perez and Manzano-Macho, 2003), the semantic richness of language and its intrinsic ambiguity still pose problems.

2.4.1 The layer cake

Buitelaar et al. (2005) discuss a so-called 'ontology learning layer cake' (Figure 2.4). It is a graphical representation of the elements that are needed for ontology learning.

[Figure 2.4: The ontology learning layer cake. From top to bottom: Axioms and Rules, e.g. ∀x, y (sufferFrom(x, y) → ill(x)); Relations, e.g. cure(dom:DOCTOR, range:DISEASE); Taxonomy, e.g. is-a(DOCTOR, PERSON); Concepts, e.g. DISEASE := <Int, Ext, Lex>; (Multilingual) Synonyms, e.g. {disease, illness, maladie}; Terms, e.g. disease, illness, hospital.]

Starting at the bottom layer we find Terms, which are at the basis of ontology learning. Terms express semantic units that form the building blocks of an ontology. It is relevant in the process of ontology learning to establish which phrases are most relevant for a specific document and thus can be made Terms. This helps in keeping the size of the ontology down, and increases the relevance of the concepts it contains. The process of Term Extraction aims to automatically process text and return the terms that are most relevant. Figure 2.4 shows ontology learning for the hospital domain, for which 'disease' etc. are highly relevant terms. Term Extraction can also be used to provide labels for unnamed non-taxonomic relations (see below; also cf. Kavalec and Svatek, 2004).

Next up is Synonym Extraction, or the identification of terms that share (a part of) their semantics. If terms share some of their semantics, it could be that they are referring to the same concept. Since ontologies are concerned with concepts and not with terms, this is a crucial step. Also, synonyms across languages can


be found with this method. Pairs in which a term in one language has the exact same meaning as a term in another language do not exist, but pairs with similar meanings do exist and are useful to extract. Their usefulness can be illustrated in the context of the Semantic Web: if a page on the Web contains information on a particular subject but is written in a foreign language, the cross-language synonym pairs can help to still retrieve that information.

Another level up the layer cake, we find a layer labeled Concepts. In Buitelaar et al. (2005), concepts are described using the triple Intension, Extension and Lexical Realizations. Concept Extraction tries to find all three for each concept. The intension of a concept is a definition of the set of objects the concept describes, e.g. for the concept 'Pizza' the intension would be something along the lines of 'an oven-baked, flat, usually circular bread covered with tomato sauce and cheese with optional toppings'. The extension of the concept Pizza is a set of objects or instances that conform to the definition of the concept Pizza, e.g. Margherita Pizza or Quattro Stagioni Pizza. The third part of the triple, Lexical Realizations, contains the term itself and its multilingual synonyms. In this way it is clear that Pizza Quattro Stagioni (Italian) actually refers to the same concept as Pizza Godisnja Doba (Croatian).

The next part of ontology learning is Taxonomy Extraction. A taxonomy can be defined as a hierarchically ordered system that indicates semantic relationships. A common example of a taxonomy is the hierarchical biological classification system for animals, dividing the living world into categories that are general at the top of the hierarchy and specific at the bottom. Taxonomy Extraction is an important step in ontology learning, as the taxonomy forms the 'skeleton' of the ontology.

Several keywords are used to designate the hierarchical relationships within the taxonomy: superordinate, hyponym, hypernym, and subordinate. A hyponym is a word whose extension is included within that of another word; a hypernym is the opposite. Subordinate and superordinate refer to the same notions, but in a taxonomical sense. This is best explained by means of an example.

(2.11) We take two concepts: 'cat' and 'dog'. Both are hyponyms of 'pet': both a cat and a dog can be called a pet. Consequently, 'pet' is a hypernym of both 'cat' and 'dog'. The related taxonomical structure puts the concept of 'pet' on a certain level, and the concepts 'cat' and 'dog' on a level below that. The associated terms in the taxonomical context are superordinate for 'pet' in relation to 'cat' and 'dog' and – vice versa – subordinate for 'cat' and 'dog' in relation to 'pet'.

We can define this more formally by defining 'hyponym' as: if every x is a y, or if every x is a type of y, the relation of x and y is a hyponym relation. Hyponym


relations are relatively easy to extract from text (IJzereef, 2004) and can easily be integrated in an ontology by means of subclassing the hyponym under its superordinate concept, e.g. yeti becomes a subclass of monster if the text states that yetis are monsters. Figure 2.4 shows that a doctor is recognized as a person in this step. Many of the methods described in section 2.5 (Ontology learning techniques) focus on hyponym extraction.

Ontologies also allow for non-taxonomic relations between concepts. A common relation of this kind is the meronym relation, which holds between two entities where one entity is a part of the other. More formally, if x is part of y, the relation between x and y is a meronym relation, also called a part-of or part-whole relation. For example:

(2.12) Finger is a meronym of hand, which in turn is a meronym of arm, which in turn is a meronym of body.

Although easily confused, meronyms are distinctly different from hyponyms and can therefore not be expressed in the ontology by means of subclassification of the subordinate concept. Referring to the example above, one cannot say that "a hand is an arm" or "a finger is a body".

The automatic discovery of relationships of this kind between entities in the ontology is called Non-Taxonomic Relation Extraction, Semantic Relation Extraction or just Relation Extraction. This is also the next level of the ontology learning layer cake. Semantic roles (see section 3.1 on page 26) found in the text can be very helpful for discovering semantic relations. These semantic roles are akin to semantic relations and can serve as a basis or as an additional resource for extraction or disambiguation.

The set of non-taxonomic semantic relations is open, as their relation labels form an exact description of the nature of the relationship. Compare Figure 2.4, where the relationship between a doctor and a disease is defined as cure: if all goes well, a doctor cures a disease. The labeling of a semantic relation is usually performed by a human ontology engineer, after which the relation becomes part of an ontology. However, labeling the relation between two more general concepts is not always easy, as multiple relations between instances of the same concepts are possible. For example, if the textual data suggests a relation between the concepts P and C, the ontology engineer must decide whether C is selling, consuming, producing or propagating P (Kavalec and Svatek, 2004).

Finally, we may want to be able to say something about the properties in an ontology. For example, one might want to capture the relationship between the composition of the 'parent' and 'brother' properties and the 'uncle' property, or


express constraints on (a set of) properties. Ontology rules help accomplish this task and are essentially a form of Horn clauses (e.g. p ∧ ¬q → u: if p is true and q is not true, then u). The example in Figure 2.4 for this part of the layer cake states that ∀x, y (sufferFrom(x, y) → ill(x)), expressing that if there is a sufferFrom relation between x and y, x can be considered ill. Colloquially: somebody suffering from something (presumably a disease) is ill. The automatic extraction of rules is, of course, called Rule Extraction.³
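To make the mechanics concrete, such a rule can be applied to a fact base by simple forward chaining. The following sketch is our own illustration (the tuple encoding of facts is invented), not part of any of the systems discussed:

```python
# Minimal forward-chaining sketch for the rule from Figure 2.4:
# sufferFrom(x, y) -> ill(x). Facts are encoded as plain tuples.
def apply_suffer_rule(facts):
    """Return the fact set extended with ill(x) for every sufferFrom(x, y)."""
    derived = set(facts)
    for fact in facts:
        if fact[0] == "sufferFrom":
            _, x, _ = fact
            derived.add(("ill", x))
    return derived

facts = {("sufferFrom", "john", "flu")}
print(("ill", "john") in apply_suffer_rule(facts))  # → True
```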

2.5 Ontology learning techniques

The application of linguistic methods in ontology learning is limited, both in the number of projects and in linguistic depth. Often, natural language processing (NLP) is used in conjunction with other techniques that have a longer tradition in computer science. Blomqvist (2005) as well as Chen and Wu (2005) acknowledge the cross-disciplinary character of ontology learning and identify the three main players in the field: NLP, Machine Learning and Data Mining. All are discussed below, but since this thesis is concerned with linguistic methods for ontology learning there is a strong focus on NLP. For every technique described, it is indicated where it belongs in Buitelaar et al.'s layer cake.

As said, most real-world systems approach the problem of ontology learning with some combination of techniques: the OntoLT system (Buitelaar et al., 2004) uses both linguistically motivated mapping rules and statistical preprocessing, and the Text-to-Onto system (Maedche and Staab, 2000) relies heavily on association rule mining (see page 22) for its extraction of non-taxonomic relations. It appears therefore that no single technique is powerful enough to yield satisfying results.

The following chapter should therefore not be regarded as a complete overview of all available methods, but as an exploration of the linguistic possibilities in the field of ontology learning from text. More complete (but less in-depth) reviews of ontology learning methods and projects can be found in Gomez-Perez and Manzano-Macho (2003), Blomqvist (2005) and Buitelaar et al. (2005).

2.5.1 Natural Language Processing

Methods based on techniques from the field of Natural Language Processing (NLP) often form a part of ontology learning tools, but some methods (i.e. pattern-based methods) are much more common than others. Not all methods require syntactic parses, and some benefit more from deep parses than others.

³ Rules are not (yet) part of the ontology standard OWL, but are implemented in an extension of OWL called SWRL: http://www.w3.org/Submission/SWRL/


Lexico-syntactic patterns

A very common method (and also one of the oldest) is the use of lexico-syntactic patterns, first described in Hearst (1992). These patterns match sequences of words in the text against general text patterns. Hearst uses these patterns to discover hyponymic relations for taxonomy extraction purposes. Other uses of lexico-syntactic patterns include term extraction from compound nouns in technical text. A compound noun is a word composed of more than one noun, like 'milk float' or 'car park'. Hearst identifies a total of six patterns, including the following:

(2.13)
    pattern             example
    X such as Y         "Mountain dwellers such as yetis"
    X is a Y            "A yeti is a furry creature"
    X like Y (and Z)    "Monsters like yetis and dwarves"

An example matching sequence using the last of these three patterns shows how a chunk of found text is matched against the pattern rule. Because the found text matches the pattern, two new hyponymic relations spring forth:

(2.14)
    Pattern rule:   if NP0 like {NP1, NP2, . . . , (and|or)} NPn
                    then {NPi | 1 ≤ i ≤ n} → hyponym(NPi, NP0)
    Found text:     "Monsters like yetis and dwarves aren't house-trained"
    Conclusions:    hyponym('yetis', 'monsters')
                    hyponym('dwarves', 'monsters')
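The 'X like Y (and Z)' pattern of (2.14) can be approximated with a regular expression. The following is a deliberately naive sketch of the idea (single words stand in for full noun phrases), not Hearst's implementation:

```python
import re

# Naive approximation of the "NP0 like NP1 (and NP2)" pattern:
# a single word stands in for each noun phrase.
PATTERN = re.compile(r"(\w+) like (\w+)(?: and (\w+))?")

def extract_hyponyms(text):
    """Return (hyponym, hypernym) pairs found by the pattern."""
    pairs = []
    for match in PATTERN.finditer(text):
        hypernym = match.group(1)
        for hyponym in match.groups()[1:]:
            if hyponym:
                pairs.append((hyponym, hypernym))
    return pairs

print(extract_hyponyms("Monsters like yetis and dwarves aren't house-trained"))
# → [('yetis', 'Monsters'), ('dwarves', 'Monsters')]
```

A real system would run the pattern over chunk-parsed text so that multi-word noun phrases are matched as units.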

It is assumed that more of these patterns can be identified, and to be able to discover them automatically, Hearst (1992) proposes a method to find syntactic relations between known hyponym pairs. These syntactic relations can then be converted to new lexico-syntactic patterns and in turn used for hyponym extraction. The effectiveness of this method is deeply questioned in Chen and Wu (2005): "the method [is] based on syntactic analysis and does not use semantic information. Only a very small number of relations can be extracted [with] these methods. [. . . ] a very small group of patterns can be learned [using this] learning method. The shortage of the patterns constrains the application of the pattern based method". A further discussion regarding the effectiveness of and subsequent improvements on this method can be found in IJzereef (2004).

Hearst (1992) also tries to apply the same method to meronymy, but without much success, as the patterns detected also express other semantic relations. Berland and Charniak (1999) also focus on meronymy but take the Hearst method further by introducing statistical measures for instance reliability evaluation and by using 'seed instances'.


In their system, sentences from the corpus are matched against known meronym pairs, checking if both parts of the pair occur close to each other in one sentence. If this is the case, a meronym pattern is extracted. When a set of meronym patterns has been extracted, the pattern is instantiated with a 'seed instance' (for instance car) and applied to the corpus. Each pattern of the set will yield other words that may or may not be a part of the seed instance: for car, examples include tailpipe, window and beauty. Statistical measures are then used to order the possible parts by the likelihood that they are true parts.

Pattern-based systems deal with some rather persistent problems, which are described by Berland and Charniak. Idiomatic phrases like "the son of a gun" ('son' is not a part of 'gun') are hard to weed out, and faulty tagging of phrases (e.g. 'crash' as a verb instead of a noun) leads to unjustified matches. Also, a tendency to extract qualities of objects was observed, e.g. 'drivability' was found to have a strong correlation with 'car'. The most persistent problem, however, was found to be sparse data, although Berland and Charniak were optimistic that this could be solved by increasing the data set. The overall accuracy obtained for the top 50 of proposed parts was 55%.

Head-modifier relations

Methods based on a head-modifier relation are – like lexico-syntactic patterns – used in many of the larger ontology learning systems. Several types of head-modifier pairs exist, the most common of which are the adjective-noun pair and the noun-noun pair or compound noun.

(2.15)
    type             example
    adjective-noun   "furry yeti"
    compound noun    "milk float"

The adjective-noun pair usually induces a hyponym relation convenient to taxonomy extraction for ontology learning (Bodenreider et al., 2001). In the case of our example, a 'furry yeti' can also be called a yeti. This allows the head noun to be mapped to a class and, in combination with its modifier, to a subclass (Buitelaar et al., 2004). Hence, 'yeti' is mapped to a class and 'furry yeti' to its subclass. An extensive report on extracting hyponymic relations from large corpora using adjective modifiers can be found in Bodenreider et al. (2001).

A compound noun can be less cooperative in this respect as some are exocentric, meaning that the head noun doesn't give a correct indication of the meaning of the compound. For example 'bird brain', which is exocentric "because it does not refer to a kind of brain, but rather to a kind of person – whose brain resembles that of a bird" (Barker and Szpakowicz, 1998).
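For endocentric pairs the class/subclass mapping is mechanical, as the following sketch shows (our own illustration, not the OntoLT implementation); exocentric compounds like 'bird brain' would have to be filtered out beforehand:

```python
# Minimal sketch: map a head noun to a class and the adjective-noun
# pair to its subclass, yielding (subclass, superclass) tuples.
def adjective_noun_taxonomy(pairs):
    """Turn adjective-noun pairs into hyponym (subclass) links."""
    return [(f"{adjective} {noun}", noun) for adjective, noun in pairs]

print(adjective_noun_taxonomy([("furry", "yeti")]))
# → [('furry yeti', 'yeti')]
```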


Verb preference

Verbs often have semantic restrictions on the concepts they can be associated with. A look at the associated verb of a subject or object can therefore yield useful information for classification of the concept it refers to. The application of this selectional preference of verbs (Hastings, 1994) can be of assistance in the process of taxonomy extraction. Because of the high volume of semantic verb information that has to be added by the ontology engineer, its application is limited to domain-specific texts.

(2.16) The verb to light has a preference for flammable objects. If the sentence "John lights the stove" is encountered, 'stove' can be added in the ontology to 'flammable objects'.
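A toy version of this idea can be sketched with a hand-made preference lexicon; the entries and names below are illustrative, not from Hastings (1994):

```python
# Illustrative selectional-preference lexicon: each verb maps to the
# class its direct object is assumed to belong to.
VERB_PREFERENCES = {"light": "flammable object", "drink": "liquid"}

def classify_by_verb(verb, noun):
    """Suggest an ontological class for `noun` based on the verb's preference."""
    preferred_class = VERB_PREFERENCES.get(verb)
    return (noun, preferred_class) if preferred_class else None

print(classify_by_verb("light", "stove"))
# → ('stove', 'flammable object')
```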

Reification of relations

A little-known method involves reification of relations. Reification is the process of making an abstract relation more concrete or 'real'. The reification of a 'Part-Of' relation for instance is to use "a concept named Part of W to subsume a concept P instead of using a Part-Of relationship between the concept P (the part) and W (the whole) – see Figure 2.5. From a linguistic perspective, the concept name Part of W reifies the Part-Of relationship from concept P to W" (Zhang and Bodenreider, 2003). Reification finds its application in taxonomy and relation extraction.

Figure 2.5: Left: the initial situation, with P linked to W by a Part-Of relation; right: after reification, P is linked by an Is-a relation to the concept 'Part of W'.
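The transformation of Figure 2.5 can be sketched as a simple rewrite of relation tuples; the encoding below is our own illustration:

```python
# Sketch of the reification in Figure 2.5: replace part_of(P, W) by
# an Is-a link from P to the reified concept 'Part of W'.
def reify_part_of(part, whole):
    """Rewrite a Part-Of relation as subsumption under a reified concept."""
    reified_concept = f"Part of {whole}"
    return ("is-a", part, reified_concept)

print(reify_part_of("P", "W"))  # → ('is-a', 'P', 'Part of W')
```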

Semantic interpretation

Semantic interpretation can be defined as the task of translating natural language into a formal Meaning Representation Language or MRL. This formal language has several distinct advantages over natural language: an MRL has well-defined semantics, is unambiguous and supports inference. A small example:


(2.17) We look at a classical example of ambiguity, the sentence "The man sees the girl with the binoculars". This might mean two things: either the man is looking at the girl through his binoculars, or the man sees a girl that is holding binoculars. Semantic interpretation can tease this ambiguity apart by just stating the two possible options in formal language: the first option can be represented as holding(man, binoculars1), the second as holding(girl, binoculars2). Both options are inserted into the Meaning Representation Language. Of course, either binoculars1 or binoculars2 is non-existent, but the model is unambiguous.

Semantic interpretation is used for relation extraction, which can be observed in the fairly complex example found in Bodenreider (2005, page 16).

Dependency structures

The dependency structure of a sentence is an acyclic, planar, undirected graph expressing the relations between the headwords of each phrase in a sentence. Typically, dependency structures are generated by a natural language parser like the Alpino parser (see section 3.5.2 on page 37). Although these structures can be used for relation extraction in myriad ways, Schutz and Buitelaar (2005) specifically use the dependency context of verbs to perform relation extraction. The dependency structure for the (German) sentence "Ballack schießt das Leder ins Netz" ("Ballack shoots the ball into the net") is shown in Figure 2.6.

Figure 2.6: Dependency structure, adapted from Schutz and Buitelaar (2005). The clause is divided into a subject (head lemma 'Ballack', tagged as FootballPlayer), a predicate ('schießen'), a direct object ('Leder', tagged as BallObject) and a PP adjunct ('in' with NP head 'Netz').

Next in the process is Named Entity Recognition, in order to automatically recognize phrases like 'Ballack', a midfielder of the German national football team, as an instance of the concept FootballPlayer. This is done using


gazetteer lists. Also, 'Leder' ('leather') needs to be recognized as something having roughly the same meaning as the word 'Fußball' ('football'), and thus as an instance of the concept BallObject. This last process is called Concept Tagging. The result of both processes is added to the dependency structure, as shown in Figure 2.6. The algorithm would then proceed as follows:

1. Compose a sub-unit consisting of a predicate and a highly ranked OBJ or NP-Head of a PP adjunct. This sub-unit could be {schießen, BallObject}.

2. Glue a highly ranked SUBJ to the left-hand side of the sub-unit. The sub-unit could then be augmented to {FootballPlayer, schießen, BallObject}. The resulting triple is a description of a relation.

This relation is then transformed to fit into the ontology: the left-hand side becomes the relation domain, the center part the relation name and the last part describes the range of the relation. The resulting relation could easily be inserted into a football domain ontology: Rel: schießen (Dom: FootballPlayer, Range: BallObject)
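The two composition steps above can be sketched as follows; the function name and dictionary keys are our own illustration, not the authors' code:

```python
# Illustrative sketch of composing a relation triple from dependency
# sub-units, following the two steps described in the text.
def compose_relation(subj_concept, predicate, obj_concept):
    # Step 1: sub-unit of predicate and object concept.
    sub_unit = (predicate, obj_concept)
    # Step 2: glue the subject concept to the left-hand side.
    domain, name, range_ = (subj_concept,) + sub_unit
    return {"Dom": domain, "Rel": name, "Range": range_}

print(compose_relation("FootballPlayer", "schießen", "BallObject"))
# → {'Dom': 'FootballPlayer', 'Rel': 'schießen', 'Range': 'BallObject'}
```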

2.5.2 Data mining methods

The field of data mining is a fairly recent topic in computer science and deals with the extraction of useful information from large volumes of data. Data mining is an 'umbrella term' for a set of methods and its meaning varies with the context in which it is used. The methods described in the following section are the most common data mining methods in the field of ontology learning.

Statistical analysis of co-occurrence

It is highly relevant for the process of ontology learning to know which words in a set of documents are most representative of that set. Ensuring these words end up as objects in the ontology increases relevance (see section 2.4.1, Term Extraction). For example:

(2.18) We compare two corpora, one from the general domain and one containing text specifically from the computer domain. The word 'byte' is probably quite specific for the computer domain corpus, but not so much for the general corpus. Hence, we want to be sure that 'byte' is included in the computer domain ontology because it helps to set it apart from the general domain ontology.

The general term for the set of statistical techniques that discover this above-chance juxtaposition of a particular word with another word or words is statistical


analysis of co-occurrence or collocation analysis. Two popular techniques for co-occurrence analysis are tf.idf and Pearson's χ².

The first method, tf.idf or Term Frequency-Inverse Document Frequency, is the most popular. It is built up from two main terms: 'term frequency' and 'document frequency'. While term frequency gives a measure of the importance of a term within a particular document, inverse document frequency indicates how prominent the term is across all documents in the document set. Several variations on the formula used to calculate tf.idf exist, but all are variations on the following formula. For a term i in document j:

w_i,j = tf_i,j · log(N / df_i)    (2.19)

with tf_i,j being the number of occurrences of i in j, N being the total number of documents and df_i being the number of documents containing i.

(2.20) Suppose we have a document containing a single chapter from a book. This chapter is about yetis, and has 10 occurrences of the word 'monster'. Now, this chapter can either be part of a book about animals rumored to exist ('cryptids') or part of a book about the Himalayas.

For demonstration purposes, we assume that both books contain 12 chapters (N = 12), that no chapters of the book about the Himalayas contain the word 'monster' except the chapter on yetis (df_i = 1) but that half of the chapters in the book about cryptids contain 'monster' (df_i = 6).

For the book about the Himalayas: 10 · log(12/1) = 10.8

But for the book about cryptids: 10 · log(12/6) = 3

Because the word 'monster' is a lot less specific in the book about cryptids, it receives a significantly lower score. Note that if all chapters in the book on cryptids mentioned the word 'monster', the score would drop down to zero (log(1) = 0).
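Formula (2.19) and the worked example translate directly into code; the logarithm is taken base 10, matching the numbers above:

```python
import math

# tf.idf as in formula (2.19): weight = tf * log10(N / df).
def tf_idf(tf, n_documents, df):
    """Compute the tf.idf weight of a term in a document."""
    return tf * math.log10(n_documents / df)

print(round(tf_idf(10, 12, 1), 1))  # Himalaya book → 10.8
print(round(tf_idf(10, 12, 6), 1))  # cryptid book  → 3.0
```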

Another popular method is the so-called chi square (χ²) test, of which Pearson's χ² test is the most widely used variant. It tests a null hypothesis that the relative frequencies of occurrence of observed events follow a specified frequency distribution.

Typically, the hypothesis tested with chi square is whether two samples of text are different enough that we can assume that the two populations from which these samples are drawn are also different. An application in the ontology learning domain can be found in OntoLT (Buitelaar et al., 2004), where


chi square "computes a relevance score by comparison of frequencies in a domain corpus under consideration with that of frequencies in a reference corpus. In this way, word use in a particular domain is contrasted with that of more general word use".
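A hand-rolled version of this comparison can be sketched as a Pearson chi-square computation over a 2×2 table of term counts in a domain corpus versus a reference corpus; the counts below are invented for illustration:

```python
# Pearson's chi-square over a 2x2 contingency table:
# rows = {domain corpus, reference corpus},
# columns = {term occurrences, other tokens}.
def chi_square(term_domain, size_domain, term_ref, size_ref):
    table = [
        [term_domain, size_domain - term_domain],
        [term_ref, size_ref - term_ref],
    ]
    total = size_domain + size_ref
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / total.
            expected = sum(table[i]) * (table[0][j] + table[1][j]) / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# 'byte' occurs 50 times in 10,000 domain tokens vs 5 times in 10,000
# reference tokens; a high score marks the term as domain-specific.
print(round(chi_square(50, 10_000, 5, 10_000), 1))  # → 36.9
```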

Clustering

Clustering relies on formulae that describe the semantic distance between concepts. Concepts are grouped according to this semantic distance. As this method provides the user with sets of semantically related terms, it is excellently suited for taxonomy and term extraction.

An example of the application of clustering can be found in Hahn and Schnattinger (1998). Their objective is to learn the correct ontological class for unknown words. They start off with a 'hypothesis space' for each concept the unknown word could actually belong to. Based on the semantic distance between each concept in the hypothesis space and the linguistic context of word occurrences, the hypothesis space is gradually reduced to a final concept.
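The grouping step itself can be sketched as naive single-link clustering over pairwise semantic distances; the distances below are invented and this is an illustration of the general idea, not Hahn and Schnattinger's algorithm:

```python
# Toy single-link clustering: merge clusters whenever any cross-cluster
# term pair lies under the distance threshold.
DISTANCES = {
    frozenset({"disease", "illness"}): 0.1,
    frozenset({"disease", "hospital"}): 0.7,
    frozenset({"illness", "hospital"}): 0.8,
}

def cluster(terms, distances, threshold=0.3):
    clusters = [{t} for t in terms]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(distances[frozenset({a, b})] < threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] |= clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

print([sorted(c) for c in cluster(["disease", "illness", "hospital"], DISTANCES)])
# → [['disease', 'illness'], ['hospital']]
```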

Association rule mining

Association rule mining is used to discover links between two sets of events or concepts of arbitrary size. A famous example of association rule mining is the discovered relation between diapers and beer.

(2.21) At the time the first large grocery store databases started to come into use, enough data became available to signal trends in shopping behavior. One of the more interesting observations was that men who came into the store later in the day and bought beer were also inclined to buy diapers (Agrawal et al., 1993).

One could describe this relation as LateInTheDay + MaleCustomer + BuysBeer ⇒ BuysDiapers. This is the canonical form of an association rule – rules are expressed in the form A ⇒ B, where B is the set of events or concepts that can be inferred from A. The application in ontology learning lies primarily in relation extraction. The following real-world case is related to the Text-to-Onto ontology learning system:

(2.22) The system starts off from the sentence "The festival on the island attracts tourists from all over the world". Syntactic analysis reveals 'on the island' to be a modifier of 'festival', and the pair {festival, island} is fed back to the algorithm. The algorithm contains a class hierarchy as background


knowledge, and generates candidate association rules while traversing this taxonomy. One of these candidates would have the word 'festival' generalized to 'event' and 'island' to 'area'. After a scoring mechanism has selected the best pairs, the pair {event, area} is shown to an ontology engineer, whose task it is to name the relation. In this example, the engineer could opt for a label like LocatedIn, resulting in the relation LocatedIn(event, area), viz. events are located in an area (Maedche and Staab, 2001).

The binary relation can then be inserted into the ontology, resulting in an extension of the set of non-taxonomic relations.
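The A ⇒ B form can be made concrete with the standard support and confidence measures of association rule mining; the transactions below are invented for illustration:

```python
# Support and confidence for the rule {beer} => {diapers} over a toy
# transaction database (each transaction is a set of items).
TRANSACTIONS = [
    {"beer", "diapers"},
    {"beer", "diapers", "chips"},
    {"beer"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing the whole itemset."""
    return sum(itemset <= t for t in TRANSACTIONS) / len(TRANSACTIONS)

def confidence(antecedent, consequent):
    """Of the transactions with the antecedent, how many also have the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"beer", "diapers"}))                 # → 0.5
print(round(confidence({"beer"}, {"diapers"}), 2))  # → 0.67
```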

2.5.3 Machine learning

Machine learning is a general term for the subfield of artificial intelligence that is concerned with the development of algorithms and techniques that allow computers to 'learn'. Although there is a large overlap with methods stemming from the field of statistics, machine learning focuses more on the computational tractability of methods. Subfields of machine learning include artificial neural networks, genetic programming and Bayesian networks. The latter has a recent claim to fame, since it is used in many modern spam email filtering techniques. Although machine learning can be very useful in ontology learning, it is too far outside the scope of this thesis to elaborate on further.


Chapter 3

Relation extraction

As we have seen in section 2.4.1, the determination of semantic relations is an important layer in Buitelaar's ontology learning layer cake. Semantic relations connect the entities expressed by sentences and describe the events these entities participate in.

As already stated in the introduction and the title, this thesis suggests a new method for relation extraction from Dutch texts. Contrary to the methods described in the previous chapter, the method described in this chapter uses prepositions in the corpus as the linguistic items from which relations are extracted.

Prepositions, especially when embedded in a prepositional phrase (like 'door een gelukkig toeval', 'because of a fortunate coincidence'), have enough semantic value to extract relations from. Because of their ambiguity and complex semantics, however, prepositions are hardly used in existing systems for relation extraction. The method proposed in this thesis aims to avoid these problems by selective application.

The method is focused on two relations in particular: Caused-By and By-Means-Of. These two relations can easily be fitted in an ontology and are expressed by certain prepositions in a rather straightforward fashion. By limiting both the set of extracted relations and the prepositions used for extraction, difficult deconstruction of preposition semantics and ontology integration problems can be avoided.

Dutch prepositions that express the By-Means-Of relation are 'middels' and 'door'; Caused-By relations are expressed by 'vanwege' and also 'door'. These prepositions will be used in this thesis to illustrate the extraction method. While 'middels' and 'vanwege' only express their respective relations, 'door' is much more ambiguous. Table 3.1 shows which relations can be expressed by 'door'.

I will start off by giving the linguistic perspective on relation extraction from prepositions. Semantic roles and the task of labeling them are strongly related


relation        example
Caused By       "De boom spleet in tweeën door de ingeslagen bliksem"
                "The tree split in two because lightning had struck"
By Means Of     "Hij liet zijn liefde blijken door het sturen van een kaartje"
                "He showed his love by sending a card"
Agent           "Een behandeling door een arts"
                "A treatment by a doctor"
Location        "Hij wandelde door de Amsterdamse binnenstad"
                "He strolled through Amsterdam's town center"

Table 3.1: Relations expressed by 'door'.

to semantic relations and hence are discussed first, after which two methods to link roles to relations are investigated. The next section deals with prepositions and their ambiguous semantics, followed by the results of a small experiment aimed at establishing (a lack of) relatedness between preposition syntax and preposition semantics. The establishment of a connection between preposition syntax and preposition semantics would allow for simple relation extraction.

This is followed by an extended description of the components in the proposed extraction method. Background information on all components of the process is given, along with a detailed account of their function in the entire process.

Finally, a general case of the proposed extraction methodology is discussed. This method is further refined and adapted to the circumstances in the two chapters that follow.

3.1 Semantic roles

To be able to extract semantic relations between entities in a sentence, the semantic arguments in the sentence need to be identified. Arguments complement the predicate, the predicate being that part of the sentence describing what event is taking place. In Example 3.1, the predicate is 'chases'; the sentence is describing a 'chasing event'. Its semantic arguments (noun phrases in the rest of the sentence) provide additional information about that event.

(3.1) The yeti chases the mountaineer through the frozen meadows.

We now assign labels to the sentence elements that constitute arguments, and link them to the predicate by means of a semantic role, also known as a thematic role. By assigning a semantic role, it becomes clear in what way the argument provides additional information about the predicate. Semantic roles were introduced around the late 1960s and 1970s (Fillmore, 1968; Jackendoff, 1974; Gruber, 1976) as a way of classifying predicate arguments into a closed set of participant types.

The exact nature and composition of the set of semantic roles is a matter of dispute, largely due to the arbitrary character of the matter (Jensen and Nilsson, 2003). This lack of agreement introduces (much unwanted) ambiguity into semantic annotations and scientific project comparison. Some help in solving this problem might come from the PropBank and FrameNet projects (Kingsbury and Palmer, 2003; Baker et al., 1998). Both are relatively new projects that aim to establish a link between syntactic structure and semantic information. For this purpose, both projects have devised ways to structure predicate arguments that might develop into a de facto standard. PropBank and FrameNet are discussed further in sections 5.1 and 5.2.

Role relation        Abbreviation   Description
Agent                AGT            Animate being acting intentionally
Patient (or Theme)   PNT            Affected entity; Effected entity
Cause                CAU            Inanimate force/actor
Caused-By            CBY            Inverse of CAU
By Means Of          BMO            Means to end; Instrument
Purpose              PRP            Purpose
Location             LOC            Place; Position

Table 3.2: Popular semantic roles (Jensen and Nilsson, 2003).

In the following sentence, all sentence elements are prefixed by a label indicating what semantic role is attached to them.

(3.2) [Agent The yeti ] [Predicate chases ] [Patient the mountaineer ] [Location through the frozen meadows ].

The act of semantic role labeling can be summarized as answering the question of “who” did “what” to “whom”, “when” and “where”. In response to this question, we see that this sentence gives a partial answer. The yeti (“who?” – an Agent) does the act of chasing (“what?” – the Predicate). This act is performed on the mountaineer (“whom?” – the Patient) in the meadows (“where?” – the Location). Finally, the question of “when?” remains unanswered in this example.
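The outcome of such a labeling pass can be represented as a small data structure. Below is a minimal sketch (the class and field names are invented for illustration and are not part of any system discussed in this thesis) holding the roles of Example 3.2, with the predicate as the hub:

```python
# Invented representation of a role-labeled sentence: the predicate
# acts as the hub, and each argument is linked to it by a role label.
from dataclasses import dataclass, field

@dataclass
class ThematicGrid:
    predicate: str
    arguments: dict = field(default_factory=dict)  # role label -> argument text

grid = ThematicGrid(
    predicate="chases",
    arguments={
        "Agent": "the yeti",                        # "who?"
        "Patient": "the mountaineer",               # "whom?"
        "Location": "through the frozen meadows",   # "where?"
    },
)
# No temporal argument: the "when?" question stays unanswered here.
```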

A graphical representation of the semantic roles in the sentence of Example 3.2 can be seen in Figure 3.1.

[Figure 3.1: Thematic grid with semantic roles – the sentence “The yeti chases the mountaineer through the frozen meadows” with its Agent, Patient and Location roles linked to the predicate.]

The predicate (in this case ‘chases’) acts as the hub for all semantic roles in the sentence. The predicate of a sentence is the answer to the question of “what”; it specifies what event is taking place. Most often the predicate is lexicalized by a verb, but verb nominalizations and adjectives can act as predicates as well. Together, these roles form the thematic grid or theta grid.

Some systems for automatic semantic role parsing have been or are being developed, cf. e.g. Pradhan et al. (2005); Stevens (2006).

3.2 Semantic relations

As discussed in section 2.4.1, semantic relations are part of the ontology learning layer cake, and also the subject of this thesis. Semantic roles should not be confused with semantic relations. Although the two are closely related, they are not exchangeable terms and there is no simple translation paradigm.

The difference yet relatedness between semantic roles and relations is best illustrated by means of an example. Figures 3.1 and 3.2 both show the same sentence, the first one annotated with semantic roles that together form a thematic grid, the second annotated with semantic relations.

[Figure 3.2: Semantic relations – the sentence “The yeti chases the mountaineer through the frozen meadows” annotated with the relations Rel:chase and Rel:location.]

It is fairly obvious that the two structures share common semantic information. But where semantic roles radiate out from the predicate, semantic relations are formed between the two entities involved. Semantic relations also have more descriptive labels. These characteristics enable semantic relations to live inside ontologies.

The relation between the thematic grid of a sentence and the semantic relations it contains is seldom as transparent as in this example. Also, not all semantic roles are easily translatable to semantic relations. The next sections discuss two translation techniques.

3.2.1 From roles to relations using mappings

Surdeanu et al. (2003) demonstrate a method in which both the predicate-argument structure and other forms of semantic information are used to get the right information into ‘slots’ – the latter being a familiar term from ontology development. This method uses mapping rules to specify under which conditions specific relations can be extracted.

Examples 3.3 and 3.4 come from the ‘death domain’ and show how semantic annotation, lexical information and verb class information are combined to extract information.

(3.3) (Person and Arg0 and DieVerb) or
      (Person and Arg1 and KillVerb) → Deceased

(3.4) (Arg0 and KillVerb) or
      (Arg1 and DieVerb) → Agent Of Death

Example 3.3 shows that a word (probably a noun) of type Person co-occurring with an Arg0 (Agent) role and with a verb describing the action of dying legitimates the extraction of a Deceased relation. The same holds for a noun of type Person, associated with an Arg1 (or Patient) role and with a verb describing the action of killing.
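The rule conditions of Examples 3.3 and 3.4 can be sketched directly as code. The conditions below follow the examples, but the concrete representation (verb sets, a single lookup function) is invented for illustration and is not Surdeanu et al.'s implementation:

```python
# Hedged sketch of template-based role-to-relation mapping in the
# 'death domain'; verb membership stands in for the verb classes.
DIE_VERBS = {"die", "decease", "perish"}
KILL_VERBS = {"kill", "murder", "assassinate"}

def map_role_to_relation(entity_type, role, verb):
    """Return the relation licensed by this (type, role, verb) triple, or None."""
    is_die, is_kill = verb in DIE_VERBS, verb in KILL_VERBS
    if entity_type == "Person" and ((role == "Arg0" and is_die) or
                                    (role == "Arg1" and is_kill)):
        return "Deceased"            # Example 3.3
    if (role == "Arg0" and is_kill) or (role == "Arg1" and is_die):
        return "Agent Of Death"      # Example 3.4
    return None
```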

An obvious downside of this approach is the amount of extra semantic information that is necessary: the verb needs to be embedded in some semantic category, and almost every situation imaginable needs a customized mapping rule, confining this method to domain-specific applications.

3.2.2 From roles to relations using syntactic dependencies

Another method, called RelExt (Schutz and Buitelaar, 2005), employs the dependency relations present in the syntactic parse and is applied to German phrases from the football domain. Extra semantic information is not required for this method.


The method models verbs as relations holding between concepts. Schutz and Buitelaar argue that, although modeling of verbs is generally not very popular, “from an ontology engineering point of view [. . . ] verbs express a relation between two classes that specify domain and range”.

The general idea of RelExt is that “verbs specify an action or event, whereas the semantic classes of their syntactic arguments account for the class of participants in that event. Exploiting this information could be very useful when it comes to restricting a relation to hold only between a small set of semantic classes” (Schutz and Buitelaar, 2005).

A triple constituting the elements of the semantic relation (relation, domain and range) is constructed from the predicate and head nouns. The subject head noun is chosen as the domain of the relation; the direct and indirect objects, as well as their adjuncts, are candidates for the range. Statistical processing is then used to select the most relevant triple from the candidate triples. A complete account of the process can be found in section 2.5.1 on page 19.
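The triple-building step can be sketched as follows. The data structures, example parses and frequency-based selection below are invented for illustration and simplify RelExt's actual statistical processing:

```python
# Simplified sketch of the RelExt triple idea: the verb becomes the
# relation, the subject head noun the domain, and the heads of
# objects/adjuncts the candidate ranges; frequency picks the winner.
from collections import Counter

def candidate_triples(parse):
    """parse: dict with 'verb', 'subject_head' and object/adjunct heads."""
    return [(parse["verb"], parse["subject_head"], head)
            for head in parse["range_candidates"]]

# Invented mini-corpus of flattened dependency parses (football domain).
parses = [
    {"verb": "scores", "subject_head": "striker", "range_candidates": ["goal", "minute"]},
    {"verb": "scores", "subject_head": "striker", "range_candidates": ["goal"]},
]

counts = Counter(t for p in parses for t in candidate_triples(p))
best = counts.most_common(1)[0][0]
```

Here the triple (scores, striker, goal) wins because it is attested twice, while (scores, striker, minute) occurs only once.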

3.3 Prepositions

A major problem in relation extraction from prepositions is the semantic ambiguity of the prepositions themselves. This section provides an overview of existing approaches to preposition semantics and disambiguation strategies.

The study of semantics in general can be divided into two main groups: philosophically oriented and cognition-based on the one hand (e.g. Tyler and Evans, 2003), and more practically oriented on the other. The fruits of the latter group are explicitly intended to be used in NLP systems. In this thesis, we are not concerned with the first group, so I will only elaborate on practical research in this area.

According to Bannard and Baldwin (2003), past research on preposition semantics can be divided into two categories: large-scale symbolic accounts of preposition semantics, and disambiguation of PP sense. I am not aware of any resources describing Dutch preposition semantics, but PP sense disambiguation has had some attention in the field: e.g. Kordoni (2006) in a general context, and van Halteren et al. (2006) specifically for Dutch.

Several new takes on the problem have been formulated in the past few years, driven by the need for a computationally attractive, high-performance disambiguation system that can be integrated into a knowledge acquisition system.

An ontology-based approach to preposition semantics is described in Jensen and Nilsson (2003). It is based on two assumptions: the existence of a formal ontology and a set of binary relations such as Agent, Patient and Cause. The polysemous nature of prepositions is captured in this model by linking one preposition to the ontology with multiple relations; e.g. the versatile Dutch preposition ‘door’ would be linked to the ontology by means of four relations: Caused By, Agent, By Means Of and Location. The notion that one relation can be expressed by multiple prepositions is conceptualized by linking all these prepositions to the ontology by means of the same relation. This approach is well-suited for practical application in the ontology domain, as it can take advantage of already-existing resources like Core Ontologies.

Bannard and Baldwin (2003) propose a method based on distributional similarity, i.e. based on the assumption that prepositions found in similar contexts have a higher probability of expressing the same semantics. This is expressed by the famous quote “you shall know a word by the company it keeps” (Firth, 1957). Although the use of distributional similarity is common, this method breaks new ground by applying it to closed-class words. The intended end product is an automatically derived preposition thesaurus, which in turn can act as the catalyst in the development of preposition ontologies.
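The core of distributional similarity can be illustrated in a few lines. The context features and counts below are invented (they are not Bannard and Baldwin's feature set); the sketch only shows the idea of representing each preposition by a vector of context counts and comparing vectors with cosine similarity:

```python
# Toy distributional-similarity sketch for closed-class words.
import math
from collections import Counter

# Invented context-feature counts per preposition.
contexts = {
    "middels": Counter({"head=vragen": 3, "obj=enquete": 2, "obj=brief": 1}),
    "door":    Counter({"head=vragen": 1, "obj=enquete": 1, "obj=storm": 4}),
    "tijdens": Counter({"head=slapen": 3, "obj=nacht": 5}),
}

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[f] * b[f] for f in a.keys() & b.keys())
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b)) if dot else 0.0

# 'middels' and 'door' share context features; 'tijdens' shares none,
# so under this (toy) model it is the least similar to 'middels'.
```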

3.3.1 Disambiguating prepositions

Bannard and Baldwin (2003) point out the “impact of dependency data on the semantic classification of prepositions” as a topic for further research. This is an interesting perspective and can be related to the aforementioned work of Schutz and Buitelaar (2005), who used dependency data to extract semantic relations from verbs.

Disambiguation is preferably done with a minimal amount of added information, both for computational reasons and complexity considerations. This minimum would – in the case of prepositions – probably be a syntactic parse of the entire sentence. A correlation between syntactic patterns and certain semantic relations would have to be established, as well as a formal method to link the one to the other.

The method described in Schutz and Buitelaar (2005) provides some starting points to do the same for prepositions. Schutz and Buitelaar use the dependency structure of a sentence to extract the triples that build up the semantic relation; the same idea could be applied to prepositions. To estimate feasibility, we first need to establish how strong the correlation between preposition syntax and preposition semantics actually is.


3.4 Ambiguity of preposition syntax

The linguistic community largely agrees that, in general, syntax alone is not enough to disambiguate semantics (cf. e.g. Kingsbury and Palmer, 2003). This is seldom proven, however, especially in the case of prepositions.

There are, however, good reasons to believe this common-sense belief is true. To name two: semantic role labeling is notoriously difficult, especially for prepositions (Stevens, 2006); and assuming that syntax will suffice for disambiguation ignores lexical semantics and world knowledge, both known to be crucial factors when pursuing full disambiguation.

To put this common-sense belief on more solid ground, small-scale corpus research can help demonstrate how thin the correlation between preposition syntax and semantics actually is. To cover as much ground as possible, the research is done both bottom-up and top-down:

Bottom-up: We describe the syntactic environments in which a preposition occurs in dependency structure patterns; these patterns take the form of extended XPath statements. Then, we extract all sentences from a corpus that match the pattern. If there were a correlation between a specific pattern and some semantic relationship, this would become apparent.

Top-down: Sentences containing a particular preposition are retrieved from the corpus, then filtered for one particular semantic situation. It can then be observed whether they share common syntactic ground.

Bottom-up

For the bottom-up approach, 577 sentences containing the Dutch preposition ‘door’ were extracted from the corpus. From these sentences, the thirteen most common dependency structures were isolated and converted to patterns. These thirteen patterns covered a little over 50% of the entire set of 577 sentences. An XPath query (see section 3.5.4) was created to match each pattern and applied to the corpus using a tool proprietary to the Alpino distribution called ‘dtsearch’.

We follow the procedure for creating an XPath query matching the syntactic structure shown in Figure 3.3. This is a modifier pattern containing the Dutch preposition ‘door’ and is quite common among sentences containing ‘door’; slightly less than 10% of these sentences contain this pattern. The syntactic structure contains relation, part-of-speech and category information and is created by the Alpino parser.

Starting from the top and working our way down, we encounter nodes of various categories and with different dependencies. We make sure all nodes are matched in our XPath query: the top node is matched with node[@cat="pp"], the node containing ‘door’ with node[@root="door"] and its sibling node with node[@rel="obj1" and @cat="np"]. Iteration of these steps results in the XPath query of Figure 3.4.

//node[@cat="pp" and ./node[1][@root="door"] and
  ./node[2][@rel="obj1" and @cat="np" and
    ./node[1][@rel="det" and @pos="det"] and
    ./node[2][@rel="hd" and @pos="noun"] and
    ./node[3][@rel="mod" and @cat="pp"]]]

Figure 3.4: The XPath expression matching the structure in Figure 3.3

[Figure 3.3: Modifier phrase of a syntactic structure (parse id 2356).]

This query defines the relation of one node to another in a rather rigid manner. XPath also allows for less strict pattern descriptions; the strict order in which the daughter nodes det/det, hd/noun and mod/pp are fixed in this query by means of index numbers (e.g. [2]) can be replaced by a more loosely defined XPath axis, such as following-sibling. For this experiment however, patterns are kept strict to avoid unforeseen or unwanted matches.
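The pattern-matching step can be approximated in a few lines of Python. Note the hedges: the XML below is an invented toy tree that only mimics Alpino's @rel/@cat/@root attributes, and since the XPath subset of Python's standard library cannot express the `and` predicates of Figure 3.4, the test is coded by hand (real runs in this thesis use a full XPath engine via dtsearch):

```python
# Stdlib-only sketch of matching a 'door' PP pattern over an
# Alpino-style dependency tree (toy data, simplified pattern).
import xml.etree.ElementTree as ET

ALPINO_LIKE = """<node rel="top" cat="smain">
  <node rel="mod" cat="pp">
    <node rel="hd" pos="prep" root="door"/>
    <node rel="obj1" cat="np">
      <node rel="det" pos="det" root="de"/>
      <node rel="hd" pos="noun" root="bliksem"/>
    </node>
  </node>
</node>"""

def is_door_pp(node):
    """PP whose first daughter is 'door' and second daughter an obj1/np."""
    kids = list(node)
    return (node.get("cat") == "pp" and len(kids) == 2
            and kids[0].get("root") == "door"
            and kids[1].get("rel") == "obj1" and kids[1].get("cat") == "np")

root = ET.fromstring(ALPINO_LIKE)
matches = [n for n in root.iter("node") if is_door_pp(n)]
```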

Execution of all queries for each pattern on the corpus resulted in a plain text file containing all matching sentences from the corpus. All sentences for each pattern were marked with one of the four semantic situations the preposition ‘door’ can introduce: Caused By, By Means Of, Location and Agent.

Table 3.3 shows, for every syntactic pattern, what semantic relations it was associated with. The three patterns with the most ‘hits’ in the corpus (nrs. 4, 7 and 10) all fail to designate a particular semantic situation: no real trend can be observed in the percentages. The patterns that seem to have more success in capturing semantics are more or less false positives. Some patterns have a rather low hit count (nrs. 5 and 2), which unjustifiably raises the score. The case of pattern nr. 9 is special, because this pattern searches for appositions. All appositions yielded by this pattern contain a proper name, which explains the semantics found. This should, however, be considered lexical semantics rather than the merit of a dependency pattern.

Attaining proper success rates with a syntactic disambiguation system requires a very strong correlation between syntax and semantics. We can conclude from this first part of the experiment that the correlation between preposition semantics and syntax looks rather thin.

semantics     corpus    syntactic patterns
              average   1     2     3     4     5
Caused By     38%       0%    100%  0%    33%   100%
Agent         45%       33%   0%    92%   33%   0%
By Means Of   11%       33%   0%    8%    28%   0%
Location      6%        33%   0%    0%    6%    0%

                        6     7     8     9     10
Caused By     38%       31%   19%   18%   0%    9%
Agent         45%       62%   56%   82%   100%  54%
By Means Of   11%       0%    19%   0%    0%    17%
Location      6%        8%    5%    0%    0%    20%

Table 3.3: Semantic relations for dependency patterns containing ‘door’. Percentages are rounded numbers.

Top-down

For the second (top-down) approach, ten sentences were selected at random out of a set of sentences retrieved from the corpus. This set only contained sentences with the preposition ‘door’, and it was verified by hand that each sentence would allow the extraction of a BMO relation. A lack of common syntactic structure among these sentences would confirm that the syntactic context of prepositions is insufficiently linked to semantics to base an extraction method on.

Table 3.4 shows the selection of ten sentences from the corpus, all containing the preposition ‘door’ and all candidates for the extraction of a BMO semantic relation. The syntactic structures rendered by the Alpino parser for these sentences show no correlation: no two sentences of this set have the same syntactic structure. For brevity’s sake I decided not to include all parse trees, but looking at the sentences shows this claim is intuitively true. We can therefore conclude that ambiguous prepositions like ‘door’ are difficult to disambiguate using only syntactic information; we need further semantic information to be able to use them for relation extraction.


parse id   sentence
1116       De tentoonstelling wordt begeleid door een uitvoerige en informatieve catalogus.
1205       Omwonenden, die waren gewekt door het glasgerinkel, alarmeerden de politie.
1540       Ze verweten hem niet dat hij zich door ontactisch optreden uit de regering had laten verdrijven.
1621       Het door “restauratie” zeker sterker geworden SVDPW gaat, na het verdienstelijke resultaat in de bekercompetitie, naar ’t Noorden.
1685       Gevraagd wordt of de minister de mening deelt dat deze commissie zeer goed werk doet door, waar ook ter wereld, de schending van mensenrechten te signaleren.
2104       Dit kan worden verklaard door aan te nemen, dat de quarks een fantastische bindingsenergie hebben als ze samen een deeltje vormen.
2356       Davids kwam fortuinlijk aan een winstpunt door een blunder van V. d. Lei.
2884       Het WSI meent nu dat die eenzijdigheid kan worden doorbroken door een democratisering van de verhoudingen tussen studenten en leiding van de sociale academies.
3262       Het partijblad van de Italiaanse communisten, “Unita”, heeft olie op het vuur gegooid [door] te verklaren, dat er “binnenkort een grote algemene nationale staking komt”.
4086       De radgravure ontstond door het glas te bewegen langs een sneldraaiend van diamantstof voorzien slijpwieltje.

Table 3.4: Selection of sentences from the corpus containing ‘door’, each a candidate for the extraction of a BMO relation.

Final example

Finally, one specific sentence from the corpus¹ provides a powerful illustration of the extent to which lexical semantics influences the overall meaning:

¹ See: http://www.let.rug.nl/~vannoord/trees/cdb/5240.xml


(3.5) De   strafzaak      is  daarna      geëindigd  door  een  kennisgeving   van  niet  verdere  vervolging.
      The  criminal-case  is  after-that  finished   by    a    notification   of   not   further  prosecution.

      ‘The criminal case was finished afterwards by a notification stating prosecution would cease.’

In this case, the preposition ‘door’ induces a Caused-By semantic role: the notification causes the criminal case to come to an end. We now change this sentence in the slightest way, by altering the verb. The verb ‘geëindigd’ can be replaced by ‘beëindigd’, the first being a more passive way of ‘ending something’, while in the latter the subject is more actively involved in the ending process. If we do in fact replace ‘geëindigd’ by ‘beëindigd’, the semantic role changes from Caused By to By Means Of: the sentence then describes a situation in which somebody or something causes the criminal case to end, doing so by means of a notification.

3.5 Extraction components

[Figure 3.5: Overview of the extraction process – Eindhoven Corpus → Alpino Parser → Alpino Treebank → Extraction Process (XPath, XSLT) → Ontologies.]

Before discussing the actual extraction methodology, I describe some already-existing components of the extraction process, visualized in Figure 3.5. This figure is taken as a guide to describe all process components further on in this chapter. The first three components of the process already exist: the corpus was parsed by the Alpino parser, forming a corpus of syntactically annotated sentences called the Alpino Treebank. The process described in this thesis, using XPath and XSLT technology, leads to the extraction of relations which can then be integrated into an ontology.

3.5.1 Corpus

For our experiments, a subcorpus of the Eindhoven (or Uit den Boogaart) Corpus (Uit den Boogaart, 1975) was chosen. This corpus is called the ‘cdbl’ corpus or, more generally, the ‘newspaper part’. With 7153 sentences, it is the largest subcorpus contained within the Alpino Treebank (see section 3.5.3). Its medium size helps achieve statistical relevance in experiments, while the total number of structures is modest enough to allow some tasks to be performed by humans. The latter is an important factor, since not all prerequisites for the process described in this thesis are available at this point in time.

The Max Planck Institute (Max Planck Institute for Psycholinguistics, 2001) provides some insight into the history of the Eindhoven Corpus: “In the early 1970s, a survey of contemporary, general-domain Dutch written and spoken word frequencies was conducted by a team of researchers cooperating in the Working Group Dutch Frequency Research. The idea was to emulate English-language efforts similar in method and scope, such as the Brown corpus.”

The Eindhoven Corpus contains written and (transcribed) spoken Dutch, from 1964–1971 and 1960–1973, respectively. The written Dutch part contains around 600K words, the spoken Dutch part about 120K. The sentences are encoded morpho-syntactically, i.e. with lexical categories and inflection information. It is uncertain whether the corpus is copyrighted, but it is assumed to be free to use for scientific purposes (Bouma and Schuurman, 1998).

3.5.2 The Alpino parser

The Alpino parser, a wide-coverage dependency parser for Dutch (Bouma et al., 2000), is an excellent candidate for conducting the syntactic disambiguation tests. Not only does it produce the required dependency structures in a convenient XML format, its wide-coverage character also ensures that the level of syntactic disambiguation is high enough to build another (semantic) disambiguation extension on top of it.

The Alpino grammar is built in the tradition of Head-driven Phrase Structure Grammar, or HPSG (Pollard and Sag, 1994; Sag, 1997). The grammar is built up from hand-written, linguistically motivated rules and lexical types. The output from the Alpino grammar consists of dependency structures compliant with the CGN guidelines (Schuurman et al., 2003) and is as such relatively independent of any particular grammatical framework.


In contrast with earlier work on HPSG grammars, the grammar rules in Alpino are rather detailed. The rules are organized in an inheritance hierarchy, which allows for retaining the relevant linguistic generalizations. Disambiguation is done both by hand-coded rules and by statistical methods. The grammar covers the basic constructions of Dutch, but more exotic constructs like verbal and nominal complementation and modification patterns, appositions and PPs including a particle are also covered.

3.5.3 The Alpino Treebank

As said earlier, the corpus used for the experiments is the ‘cdbl’ corpus. This corpus is embedded in the Alpino Treebank (van der Beek et al., 2001), originally conceived as a development and evaluation tool for the Alpino parser. It is an amalgam of several Dutch corpora – all pre-parsed by the Alpino parser. The Treebank contains a hand-corrected section of a little over 13K sentences and another, much larger but uncorrected section of over 5 million sentences.

Included with the Alpino Treebank is a number of scripts and programs. This software can help the end user view graphical representations of parse trees, search the corpus, and assist with annotation work. Structures in the Alpino Treebank are available in both SVG and XML format, for visual and machine inspection, respectively.

3.5.4 XPath and XSLT

All relation extraction and corpus research experiments in this thesis are done using XPath (Clark et al., 1999a) and XSLT (Clark et al., 1999b) technologies. Both XSLT and XPath (as well as XSL-FO, a unified document presentation language) were developed within the W3C². The following two sections provide an introduction and related background information.

Extensible Stylesheet Language Transformations, or XSLT (Clark et al., 1999b), is a declarative language: it states what the end result of a transformation process must look like³. Practically, this means that an XSLT stylesheet contains one or more template rules. Each template rule specifies what to add to the ‘result tree’ when the XSLT processor encounters a node in the ‘source tree’ that meets the conditions specified in the template rule. At first glance, functions in XSL templates might look like imperative functions. In reality, they are functional expressions, representing their evaluated results – nodes that will end up in the result tree.

² W3C is the acronym for ‘World Wide Web Consortium’, an international standards body for the World Wide Web. The W3C homepage can be found at http://www.w3c.org

³ Another famous example of a declarative language is SQL (Structured Query Language), the standard interface language for databases.

XSL uses the XPath (Clark et al., 1999a) language, a terse, non-XML language used to address portions of an XML document. XPath offers flexible ways of defining node sets or even unions of node sets and is as such a popular query language. Each XPath expression consists of a sequence of steps, describing the path to get from the start node (or ‘context node’, which is the current node of the XSL processor) to another node or set of nodes.

3.6 Extraction methodology

We have now arrived at the core of this chapter: the extraction process that distills semantic relations from prepositions. A semantic relation is extracted from a sample sentence to demonstrate the method. The structure for the example sentence “Zij vroeg middels een enquête 170 leerlingen uit 2-vmbo naar hun leesgedrag” (“She asked by means of an inquiry 170 students from 2-vmbo about their reading behavior.”) can be seen in Figure 3.6, in which the three elements of interest – the domain and range of the semantic relation plus the preposition – have been boxed.

This section describes the general methodology only. For a more in-depth review of technical, syntactic and semantic considerations, I refer to chapters 4 (“Less ambiguous prepositions”) and 5 (“Ambiguous prepositions”).

[Figure 3.6: A syntactic parse tree from the Alpino parser for a Dutch sentence containing the preposition ‘middels’, with the domain and range of the relation boxed.]

Figure 3.6 shows our preposition ‘middels’ (associated with the BMO relation) embedded in a mod/pp node. The domain (‘vraag’) and the range (‘enquête’) are one level higher and lower, respectively. We note that both domain and range are the head nouns on their respective levels; this is the basis of the extraction method. An XSL stylesheet parses the syntactic structure, and the occurrence of ‘middels’ triggers the creation of a new BMO relation. The domain is then retrieved by selecting the sibling head node of the preposition’s mother node; for the range, the head node of the obj1 node that is a sibling of the preposition is used. The resulting relation is in XML; for this example sentence, it would be as follows.

<?xml version="1.0" encoding="iso-8859-1"?>
<candidates>
  <relation label="ByMeansOf" domain="vraag" range="enquete"/>
</candidates>

If more than one sentence or an entire corpus is fed through this XSL stylesheet, relations are extracted when they are encountered. Sentences not containing any extractable information are silently ignored. Hence, the result of applying the extraction process to a corpus containing dozens of sentences but only three extractable relations using ‘middels’ would look something like this.

<?xml version="1.0" encoding="iso-8859-1"?>
<candidates>
  <relation label="ByMeansOf" domain="vraag" range="enquete"/>
  <relation label="ByMeansOf" domain="schrijven" range="werk"/>
  <relation label="ByMeansOf" domain="uitrusten" range="slaap"/>
</candidates>
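The traversal just described can be condensed into a short sketch. Caveats: the parse XML below is an invented toy that only mimics Alpino's @rel/@cat/@root attributes, and the thesis's actual implementation is an XSL stylesheet, not Python; this is merely the same domain/range selection logic in imperative form:

```python
# Sketch of the 'middels' extraction step: when a PP headed by
# 'middels' is found, take the head sibling of the PP as domain and
# the head of the PP's obj1 daughter as range.
import xml.etree.ElementTree as ET

PARSE = """<node rel="top" cat="smain">
  <node rel="hd" pos="verb" root="vraag"/>
  <node rel="mod" cat="pp">
    <node rel="hd" pos="prep" root="middels"/>
    <node rel="obj1" cat="np">
      <node rel="hd" pos="noun" root="enquete"/>
    </node>
  </node>
</node>"""

def extract_bmo(root):
    relations = []
    for mother in root.iter("node"):
        kids = list(mother)
        for pp in kids:
            if pp.get("cat") != "pp":
                continue
            pp_kids = list(pp)
            heads = [k for k in pp_kids if k.get("rel") == "hd"]
            if not heads or heads[0].get("root") != "middels":
                continue
            # Domain: head daughter of the PP's mother node.
            domain = next(k.get("root") for k in kids if k.get("rel") == "hd")
            # Range: head of the obj1 daughter of the PP.
            obj1 = next(k for k in pp_kids if k.get("rel") == "obj1")
            rng = next(k.get("root") for k in obj1 if k.get("rel") == "hd")
            relations.append({"label": "ByMeansOf", "domain": domain, "range": rng})
    return relations

rels = extract_bmo(ET.fromstring(PARSE))
```

For the toy parse above, this yields a single ByMeansOf relation between ‘vraag’ and ‘enquete’, mirroring the XML candidate shown earlier.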


Chapter 4

Less ambiguous prepositions

As remarked at the beginning of Chapter 3, some prepositions allow for relatively straightforward relation extraction. These prepositions can be considered to have an almost one-to-one relationship with a particular semantic relation. This chapter investigates our two semantic relations (BMO and CB) together with their associated unambiguous prepositions ('middels' and 'vanwege') and proposes a method for extracting these relations from a syntactic parse made by the Alpino parser (Bouma et al., 2000).

4.1 The BMO relation

A Dutch preposition that is closely related to the BMO relation is 'middels'. This preposition is usually followed by some phrase that indicates which object or action is used to reach the goal expressed by the sentence. Typically, sentences containing 'middels' look like this:

(4.1) Zij vroeg middels een enquete 170 leerlingen uit 2-vmbo naar hun leesgedrag.
      She asked by-means-of an inquiry 170 students from 2-vmbo about their reading behavior.
      'She enquired 170 students from 2-vmbo about their reading behavior.'

As can be seen in Example 4.1, 'middels' links the concept of 'enquete' ('enquiry') to the concept of 'vragen' ('ask'). Of course, it would be interesting to see whether this relation is also visible in the syntactic structure of the sentence. The Alpino parser can render a syntactic parse tree when the sentence in Example 4.1 is given as input, see Figure 4.1.


Figure 4.1: The parse tree as rendered by Alpino of the sentence in Ex. 4.1

4.1.1 Syntactic aspects

Inspection shows the range of the relation induced by 'middels' (in this case 'enquete') to be syntactically embedded as a daughter node of the mod/pp node containing the preposition. Here, the node is labeled as an obj1/np node.

The phrase expressing the concept that is on the other end of the semantic relation (i.e. the domain), in this case 'vragen', is a sibling head node of the mod/pp phrase containing our preposition. This makes sense: the modifier phrase gives additional information to the head of the entire phrase, in this case the means by which the action described is completed.

It could also be argued that not only the head noun, but the whole phrase should be considered for relation extraction. However, paraphrasing Schutz and Buitelaar (2005, page 5), this approach would introduce data sparseness, and it is therefore better to normalise a phrase to its head node. This eases concept tagging and consequently broadens coverage.

The assumption that (Dutch) prepositions usually connect to the head node of the phrase in which the modifier-pp phrase is embedded is supported by evidence from the Alpino Treebank. Parse trees from four months (January 1995 – April 1995) of the NRC Handelsblad newspaper containing the 'middels' preposition were examined for preposition connection statistics. As can be seen in Table 4.1, a large majority of the prepositions connect as predicted.

Some sentences proved too complex for the Alpino parser and yielded incorrect results (e.g. split sentences, parser errors). Because this can be considered a parser issue and as such separate from the present subject, sentences that could not be parsed correctly are omitted from Table 4.1.

normal   1 level up   2 levels up   3 levels up
73%      13%          7%            7%

Table 4.1: Connection distance for the 'middels' preposition
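Statistics like those in Table 4.1 can be tallied mechanically. The sketch below is a hypothetical helper (not the procedure actually used to produce the table): it climbs the parent chain of each mod/pp node containing the preposition and reports how many levels up the first sibling head node sits, with level 0 corresponding to the 'normal' column.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def connection_distance(tree, prep="middels"):
    """For each occurrence of `prep`, count how many levels above the
    preposition's mother (pp) node the nearest sibling hd node is found."""
    parent = {c: p for p in tree.iter() for c in p}
    counts = Counter()
    for node in tree.iter("node"):
        if node.get("root") != prep:
            continue
        current = parent.get(node)          # start at the mod/pp node
        levels = 0
        while current in parent:
            # look for an hd node among the siblings of the current node
            if any(n.get("rel") == "hd"
                   for n in parent[current] if n is not current):
                counts[levels] += 1
                break
            current = parent[current]
            levels += 1
    return counts

# A hypothetical parse where the pp attaches one level higher than usual:
PARSE = """
<node rel="top">
  <node rel="hd" root="zeg"/>
  <node rel="vc">
    <node rel="mod" cat="pp">
      <node rel="hd" root="middels"/>
      <node rel="obj1"><node rel="hd" root="wet"/></node>
    </node>
  </node>
</node>
"""
print(connection_distance(ET.fromstring(PARSE)))  # → Counter({1: 1})
```

Run over the four months of NRC Handelsblad parses, a tally like this would yield the distribution reported in the table.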

Figure 4.2: Part of a sentence in which 'middels' connects one level higher up the parse tree than usual.

A non-negligible number of the sentences examined show 'middels' connecting higher up the tree to find the domain. An example of a sentence in which 'middels' connects one level higher up can be seen in Figure 4.2. Here, the relation would be Rel:BMO (Dom:D, Range:O). The node that would be the first candidate for the domain ("Amerikaan") is actively involved in the action described, but is certainly not the domain of the BMO relation.

This incorrect parse is due to a very common (and very hard to tackle) parser problem: disambiguation of prepositional phrase attachments. Paraphrasing van der Beek et al. (2001), "a noun phrase and a prepositional phrase can form a constituent on different levels. [One option is] a dependency structure with a noun phrase and an internal prepositional modifier, [a second option is] a dependency tree in which the prepositional phrase is a modifier on the sentence level". Suffice it to say that in this case the first option was chosen, whereas the second option would have given the correct result. Prepositional attachment is by no means a solved problem, and it is still actively researched (cf. e.g. van Halteren et al., 2006).

From these investigations it follows that all prerequisites are satisfied to allow for automatic extraction of a semantic relationship: a native speaker of Dutch can assert that 'middels' is a preposition that induces a BMO relation and a BMO relation only, and the two concepts connected by this relation can be extracted because of their relatively fixed syntactic positions.

4.1.2 Extraction method and formats

As the Alpino parser exports its parses in XML format, the obvious choice for this extraction is XSLT (see page 38). The fact that XSLT is tailored for XML processing, the abundance of available implementations and its flexibility and power make it very well suited for the extraction method described in 4.1.1. XML is used as output format – both for reasons of coherence and for ease of use in later applications. The format of a semantic relation could be an empty element with three attributes, denoting the relation name, the domain and the range. For instance, one could expect a relation like

<relation name="ByMeansOf" domain="vragen" range="enquete" />

if the sentence from Figure 4.1 were to be parsed. Ideally, one would integrate an array of XSL-like transformations to form a plugin for a knowledge base editor like Protege (Gennari et al., 2003). The clear syntax of an XML element as suggested would facilitate this process. Extraction is done by means of an XSLT stylesheet, part of which is printed below.

 1 <candidates>
 2   <xsl:for-each select="//node[@root='middels']">
 3     <relation>
 4       <xsl:attribute name="label">ByMeansOf</xsl:attribute>
 5       <xsl:attribute name="domain">
 6         <xsl:for-each select="../../node[@rel='hd']">
 7           <xsl:value-of select="@root" />
 8         </xsl:for-each>
 9       </xsl:attribute>
10       <xsl:attribute name="range">
11         <xsl:value-of select="../node[@rel='obj1']/node[@rel='hd']/@root" />
12       </xsl:attribute>
13     </relation>
14   </xsl:for-each>
15 </candidates>


The XSL transformation starts its business on the second line with a for-each statement, in which all nodes for the word 'middels' are selected. Inside said statement, a relation element is specified, after which its attributes are extracted from the parse one after the other.

The first attribute is label, which is rather straightforward, as the match with 'middels' (along with the intuition of a native speaker) has already determined that this is a BMO relation. For the domain attribute, the context shifts up two levels, reaching the node that dominates the prepositional phrase. From this new context, the hd node is selected and its root attribute is used as the domain attribute for the new relation. Note that one could also opt for using the word attribute (which is the non-stemmed, original word from the sentence) as relation attribute. This depends on the integration characteristics of the existing ontology. Because non-stemmed versions are more likely to introduce data sparseness, the stemmed version is used as ontology concept label.

The range attribute is extracted last. The context is back in its original state, and this time shifts up one level; from there it goes down the tree via the obj1 node to reach the head noun, whose root attribute is selected and used as the range attribute.

Assuming this stylesheet is applied to the syntactic parse of the example sentence "Zij vroeg middels een enquete 170 leerlingen uit 2-vmbo naar hun leesgedrag" ("She asked 170 students from 2-vmbo about their reading behavior by means of an enquiry."), the end result of the process is a triple, contained in an XML document:

<?xml version="1.0" encoding="iso-8859-1"?>
<candidates>
  <relation label="ByMeansOf" domain="vraag" range="enquete"/>
</candidates>

4.2 The CB relation

A procedure comparable to the one described in section 4.1 can be applied to relatively unambiguous prepositions inducing the CB relation. An example of such a preposition in Dutch is 'vanwege'. The following will show that both extraction processes are very much alike. We start off with an example to show how 'vanwege' is typically embedded in a sentence.


(4.2) In 1933 bezette Frankrijk Andorra, vanwege sociale onrust voor de verkiezingen.
      In 1933 occupied France Andorra, because-of social unrest before the elections.
      'In 1933, France occupied Andorra because of social unrest before the elections.'

The parse tree in Figure 4.3 reveals a structure very similar to the one in Figure4.1. The head node of the modifier phrase holds the preposition, and its domainand range are in the embedded node and in the head sister node of the modifierphrase, respectively.

Figure 4.3: The parse tree as rendered by Alpino of the sentence in Ex. 4.2

4.2.1 Example extraction process

The extraction process goes along the exact same lines as the extraction for 'middels'. The XSL stylesheet is changed on only two points: the key preposition is changed from 'middels' to 'vanwege', and the name of the relation is altered to correspond with the preposition. Both the XSL stylesheet and the related XML example parse can be found in Appendix xsl-transformation-for-the-bmo-relation. Application of the stylesheet to the example parse yields the following relation triple in XML.


<?xml version="1.0" encoding="iso-8859-1"?>
<candidates>
  <relation label="CausedBy" domain="bezet" range="onrust"/>
</candidates>

This relation is in agreement with the meaning of the sentence: because of 'onrust' (unrest) the country of Andorra is occupied ('bezet').
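Since only the trigger preposition and the relation label differ between the two stylesheets, the pair can be viewed as one procedure parameterized by a preposition-to-relation table. The sketch below generalizes the extraction accordingly; the mapping comes from this chapter, while the function and the simplified node structure are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Unambiguous prepositions and the relation each one induces (Chapter 4).
RELATION_FOR_PREP = {
    "middels": "ByMeansOf",
    "vanwege": "CausedBy",
}

def extract_relations(tree):
    """Generalization of the per-preposition stylesheets: trigger on any
    preposition in the table, then read domain and range from the fixed
    syntactic positions described in section 4.1.1."""
    parent = {c: p for p in tree.iter() for c in p}
    for prep in tree.iter("node"):
        label = RELATION_FOR_PREP.get(prep.get("root", ""))
        if label is None:
            continue
        pp = parent[prep]                                # mod/pp node
        domain = next(n for n in parent[pp] if n.get("rel") == "hd")
        obj1 = next(n for n in pp if n.get("rel") == "obj1")
        rng = next(n for n in obj1 if n.get("rel") == "hd")
        yield (label, domain.get("root"), rng.get("root"))

# Simplified, hypothetical stand-in for the parse of Example 4.2:
PARSE = """
<node rel="top">
  <node rel="hd" root="bezet"/>
  <node rel="mod" cat="pp">
    <node rel="hd" root="vanwege"/>
    <node rel="obj1"><node rel="hd" root="onrust"/></node>
  </node>
</node>
"""
print(list(extract_relations(ET.fromstring(PARSE))))  # → [('CausedBy', 'bezet', 'onrust')]
```

Adding a further unambiguous preposition then amounts to one new table entry rather than a new stylesheet.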


Chapter 5

Ambiguous prepositions

As the previous chapter has shown, extracting semantic relations from prepositions using only syntactic parses is feasible – provided the preposition is unambiguous and denotes a relation that can be integrated in an ontology.

The majority of prepositions, however, are highly ambiguous, both in a semantic and a syntactic sense, as Chapter 3 has shown us. Also, preposition meaning can be hard to capture in a semantic relation label. This means the extraction method as described in Chapter 4 can be used in only a limited set of cases.

To widen the scope of the method, additional semantic information has to be added to the equation. This chapter investigates whether initiatives like PropBank (Kingsbury et al., 2002) or FrameNet (Baker et al., 1998) can provide the additional semantic information required.

5.1 PropBank

PropBank is "an annotation of syntactically parsed, or treebanked, structures with 'predicate-argument' structures" (Babko-Malaya, 2005). These predicate-argument structures come in the form of propositions, hence the name PropBank. Its objective is to provide the linguistic community with a body of training data to allow for the development of accurate tools for predicate-argument analysis. In its current form, PropBank is an addition 'on top of' the Penn Treebank (Marcus et al., 1993), although it covers only a small part of the Penn Treebank corpus.

PropBank uses coreference and predicate-argument structure for verbs, participial modifiers and nominalizations; these are the name-givers of the propositions. Propositions use tuples to describe the semantic relations between the predicate and its arguments. Consider the following sentence, taken from Palmer (2001).


(5.1) When Powell met Zhu Rongji on Thursday they discussed the return of the spy plane.

The following propositions describe roles in this sentence:

meet(Powell, Zhu)
discuss([Powell, Zhu], return(X, plane))

While verbs name the proposition, other phrases form the arguments. These arguments are labeled, numbered sequentially from Arg0 to (potentially) Arg5. The labels are usually verb-specific; for every verb there exists (or should exist, as PropBank is work in progress) a frame file that describes which argument receives which label. An example of such a file is the frame file for hit, cf. Kingsbury et al. (2002):

(5.2) HIT (sense: strike)
      Arg0: hitter
      Arg1: thing hit
      Arg2: instrument, hit with

Similar arguments do not always receive the same label. Also, the set of argument types that fall under one label is quite heterogeneous. Jurafsky (2006) does, however, identify "trends in argument numbering"; note that Arg2 and Arg3 show quite a bit of overlap:

Arg0 = agent
Arg1 = direct object / theme / patient
Arg2 = indirect object / benefactive / instrument / attribute / end state
Arg3 = start point / benefactive / instrument / attribute
Arg4 = end point

Figure 5.1: Trends in argument numbering

Modifiers fall outside of the Argn argument labeling sequence and are labelled ARGM-xxx, where xxx is a descriptor code indicating what function the clause has. A summary of the descriptor codes is given below; the full description of each descriptor code is in Babko-Malaya (2005).


DIR  directionals            LOC  locatives
MNR  manner                  TMP  temporal
EXT  extent                  REC  reciprocals
PRD  secondary predication   PNC  purpose (not cause)
CAU  cause                   DIS  discourse markers
ADV  adverbials              MOD  modals
NEG  negation

5.1.1 PropBank annotation on the Alpino Treebank

To apply the principles of the PropBank annotation method to the Alpino Treebank, one must consider how PropBank interacts with the Penn Treebank. Kingsbury et al. (2002) state that "all annotations are given in the form of a standoff notation referring to the syntactic terminals within the original TreeBank files". What this means in practice can be seen in the following sentence, which is used as an example. Here, the number after each word (i.e. syntactic terminal) represents its position in the sentence; the [*] is used in the annotation guidelines (Babko-Malaya, 2005) as the symbol for the 'trace' that is left behind by a word that has moved within the sentence.

(5.3) Mr.1 Fournier2 also3 noted4 that5 Navigation6 Mixte7 joined8 Paribas's9 core10 of11 shareholders12 when13 Paribas14 was15 denationalized16 [*]17 in18 1987,19 and20 said21 it22 now23 holds24 just25 under26 5%27 of28 Paribas's29 shares30

The argument labeling annotation that PropBank provides in this case for the relation 'denationalize' is:

denationalize ----- 16:0-rel 14:1*17:0-ARG1 18:1-ARGM-TMP

The numbers again refer to the sentence positions. This allows us to extract a human-readable version of this argument labeling:

argument   word                sentence position
rel        denationalized      16
ARG1       *trace* → Paribas   17 → 14
ARGM-TMP   in 1987             18:1 (':1' means 'plus following word')
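On this reading, each piece of the annotation has the shape position:height-label, with '*' chaining coreferent pieces such as the trace. A small parser for the notation – my own reconstruction of the format for illustration, not PropBank's official tooling – could look as follows.

```python
def parse_standoff(annotation):
    """Parse a PropBank-style standoff annotation string into a mapping
    from argument label to the (position, height) pairs it covers.
    '*' chains coreferent pieces, as with the trace in Example 5.3."""
    labeled = {}
    for piece in annotation.split():
        chain = []
        for part in piece.split("*"):
            pos_height, _, rest = part.partition("-")
            pos, _, height = pos_height.partition(":")
            chain.append((int(pos), int(height)))
            if rest:                       # the label sits on the labeled link
                label = rest
        labeled[label] = chain
    return labeled

ann = "16:0-rel 14:1*17:0-ARG1 18:1-ARGM-TMP"
print(parse_standoff(ann))
# → {'rel': [(16, 0)], 'ARG1': [(14, 1), (17, 0)], 'ARGM-TMP': [(18, 1)]}
```

The parsed mapping corresponds line by line to the human-readable table above.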


As expected, the label rel indicates the phrase describing the relation, which in this case is denationalization. Arg1 is the label for the clause that denotes the entity that is denationalized. As the Paribas bank is being denationalized, "Paribas" receives the Arg1 label. Finally, there is an extra argument to describe when this action took place. This argument is optional, i.e. not in the frame file for "denationalize", and thus labelled ArgM, more specifically ArgM-TMP.

This example shows that PropBank uses word position as the means of identifying which syntactic terminals correspond to which argument phrase, thus realizing the standoff notation. The case of the Alpino Treebank, however, is slightly different.

Alpino Treebank files are in XML format, where word position is of no use: since nodes can be either branch nodes or leaves, numbering them top-down would result in confusing gaps in the annotation. Conversely, numbering only leaf nodes would make it very inconvenient to label entire subtrees.

One could consider XPath (described on page 38) as an alternative position indicator. XPath statements could be used directly in place of the sentence position indicators that PropBank uses to interface with the Penn Treebank structures. But since XPath statements cannot live within the XML structure, this approach would imply that the argument labeling be stored separately, as is the case for the PropBank–Penn Treebank combination. However, using XSLT on non-XML input is much harder than on XML, and choosing this strategy would result in a cumbersome extraction process.

It seems therefore most logical to have a much tighter integration between the PropBank-like annotation and the Alpino Treebank syntactic information. This can be done by inserting argument labeling information within the syntactic tree, consequently leaving behind the concept of standoff notation.

This approach is also taken in the related work of Stevens (2006), which is concerned with the automatic semantic labeling of sentences. Although that work differs in that it is based on the semantic annotation guidelines and research corpus used in the D-COI project (Monachesi and Trapman, 2006), the syntactic parse document format is the same. Hence, a synchronization of annotation practices allows for future integration of the XSL stylesheets in section 3.6 with Stevens' work on semantic role labeling.

5.1.2 Extraction using PropBank information

To know whether the new Alpino-cum-PropBank parse tree enables us to extract semantic relations, we examine the two semantic relations seen earlier: the BMO relation and the CB relation.


The CB relation

Appendix B contains a syntactic parse tree augmented with PropBank-style semantic argument labeling, together with an XSL stylesheet that is able to deal with the additional information. The preposition seen in this example is 'door', and native speaker intuition dictates that the semantic relation seen here is CB. The example sentence is the following:

(5.4) De kerk is ontstaan door zendingswerk van de NG-kerk in Oranje Vrijstaat.
      The church arose through mission-work of the NG-church in Oranje Vrijstaat.
      'The church arose through mission work of the NG church in Oranje Vrijstaat.'

The entry in the English PropBank for 'arise', the English verb closest to the Dutch verb 'ontstaan', is:

Roleset arise.01 "come into existence (from)":
Roles:
  Arg1: thing (state) arising
  Arg2: source (from or in or of)

Following this guideline, we augment the syntactic parse from the Alpino parser by labeling 'De kerk' as Arg1, 'ontstaan' as Rel and finally 'door zendingswerk van de NG-kerk in Oranje Vrijstaat' as ArgM-CAU. This takes the disambiguation problem out of the XSL stylesheet: the stylesheet programmer can be sure that a clause starting with 'door' and annotated with an ArgM-CAU label is in fact providing causal information, thus making extraction of a CB relation possible.

<?xml version="1.0" encoding="iso-8859-1"?>
<candidates>
  <relation label="CausedBy" domain="ontsta" range="zendingswerk"/>
</candidates>

Figure 5.2: Extracted relation (Appendix B)
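With the argument labels inserted into the parse tree, the extraction test reduces to a conjunction: a pp node carries the ArgM-CAU label and its head preposition is 'door'. The sketch below renders that check in Python; the pb attribute name and the simplified node structure are assumptions for illustration (the actual annotated parse is in Appendix B).

```python
import xml.etree.ElementTree as ET

def extract_causedby(tree):
    """Emit CausedBy triples for pp nodes labeled ArgM-CAU whose head
    preposition is 'door'; the domain comes from the node labeled Rel,
    the range from the head of the pp's obj1 daughter."""
    triples = []
    rel = tree.find(".//node[@pb='Rel']")
    for pp in tree.iter("node"):
        if pp.get("pb") != "ArgM-CAU":
            continue
        hd = pp.find("node[@rel='hd']")
        if hd is None or hd.get("root") != "door":
            continue
        rng = pp.find("node[@rel='obj1']/node[@rel='hd']")
        triples.append(("CausedBy", rel.get("root"), rng.get("root")))
    return triples

# Simplified, hypothetically annotated stand-in for Example 5.4:
PARSE = """
<node rel="top">
  <node rel="su" pb="Arg1"><node rel="hd" root="kerk"/></node>
  <node rel="hd" root="ontsta" pb="Rel"/>
  <node rel="mod" cat="pp" pb="ArgM-CAU">
    <node rel="hd" root="door"/>
    <node rel="obj1"><node rel="hd" root="zendingswerk"/></node>
  </node>
</node>
"""
print(extract_causedby(ET.fromstring(PARSE)))  # → [('CausedBy', 'ontsta', 'zendingswerk')]
```

The resulting triple agrees with the relation shown in Figure 5.2.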

We observe that extraction of the CB relation is relatively easy, as PropBank provides an argument label that overlaps with the semantics of the CB relation: ArgM-CAU. To ensure there is enough overlap between the definition of the PropBank semantic argument and the definition of the CB semantic relation, we compare both to the guidelines in Babko-Malaya (2005). The annotation guidelines for ArgM-CAU state that clauses bearing this label "indicate the reason for an action". Comparison of this definition with the one given in Chapter 3 shows that the two overlap.

The BMO relation

The other semantic relation under examination – BMO – is significantly harder to extract than the CB relation. Several argument types overlap with the BMO relation to some degree. Referring back to Figure 5.1, we can see that both Arg2 and Arg3 relate to the BMO role, since both can label an 'instrument' clause. Jensen and Nilsson (2003) describe the BMO semantic role as "Means to end; Instrument", implying that Arg2 and Arg3 are probably used to label the clauses relating to BMO relations.

Although this labeling strategy is not necessarily flawed in a broader sense, it is unusable for semantic relation extraction: firstly because the argument numbering is a 'trend' rather than a division into distinct semantic categories, and secondly because the scope of semantic categories falling under one argument label is rather broad. This shifts the problem from disambiguating semantic roles to disambiguating the semantic annotation.

Also, there is no ArgM-xxx label corresponding to the BMO relation. If one were to relax constraints, the ArgM-MNR (manner) label could be considered. But while this label specifies "how an action is performed" (Babko-Malaya, 2005), it will also yield a lot of false positives. An example of this is the clause "works well with others"; this is also labeled ArgM-MNR, but is not a candidate for the BMO relation.

5.2 FrameNet

Acknowledging the mixed results of the previous section, we can conclude that the level of semantic annotation provided by PropBank-style argument labeling is insufficient to disambiguate some semantic roles.

We therefore turn to another semantic role annotation project, called FrameNet (Baker et al., 1998), developed at the University of California, Berkeley. FrameNet takes frame semantics (Fillmore, 1976) as its theoretical foundation, a theory that relates semantics to encyclopaedic knowledge. It is based on the assumption that "word meanings are best understood in terms of the semantic/conceptual structures which they presuppose" (Jurafsky, 2006). These structures are called 'frames', and words are said to 'evoke' or activate their corresponding frame and its frame elements (FEs).


(5.5) An example of this is the word 'baker'. This word evokes a frame of the preparation and baking of dough, consequently profiling the individual who does this work.

FrameNet aims to discover and describe frames for every word. These frames are ordered in an ontological structure, resulting in a more coherent annotation system and a deeper level of semantic annotation.

(5.6) An example of this layered annotation is the very general frame of Motion, in which the object moving (Theme), the starting point (Source), the itinerary (Path) and the destination (Goal) are elements. Many other frames are descendants of the Motion frame: 'leave', 'depart', 'arrive' and 'pass', to name a few.

The elements in a frame are structured using a 'script'. As an example, the script for the frame Cause_change_of_consistency reads: "In this frame an Agent or Cause changes the consistency of an Undergoer from its Initial_state to a Result. This may take place under certain Circumstances for a Purpose or Reason in a certain Manner. A Change_agent can also be used." Verbs are associated with this frame by means of a Levin class (Levin, 1993); example verbs include soften, coagulate and harden.

Since PropBank provided sufficient semantic annotation depth to disambiguate CB relations but failed for BMO relations, we review the latter relation only.

5.2.1 Pluriformity of annotation

While PropBank's annotation guidelines can be called somewhat vague, FrameNet on the other hand has a very fine-grained and precisely defined ontological structure. A brief investigation reveals, however, that the semantic properties of the BMO relation do not concur with the semantic buildup of FrameNet. In other words: FrameNet does not generalize the BMO relation in a single frame element, nor does it provide an 'umbrella frame element' subclassing all frame elements related to the BMO relation.

This problem clearly presents itself when looking at the various frame elements that express the BMO relation. As can be seen in the (possibly incomplete) table below, the frame element used for the BMO relation varies across different frames. Furthermore, frame elements 2 and 4 show that sometimes two frame elements within one frame should be joined to extract BMO: both the elements labeled Instrument and Means would be of interest for our extraction process. And lastly, frame element label abbreviations seem inconsistent and diffuse: frame elements 2-3 and 4-5 have slight variations in their names.

#  Frame              BMO FE                           Example or usage
1  Perception_active  Means [Mns]                      "You can observe distant galaxies with a good telescope."
2  Theft              Instrument [Inst]                "Chris stole it with a long bamboo pole through the fence."
3  Hostile_encounter  Instrument [Ins]                 The instrument with which an intentional act is performed.
4  Theft              Means [Mns]                      "Leslie stole the watch from Kim by following her home after work."
5  Grooming           Means [mea]                      The action performed by the Agent to accomplish the grooming.
6  Apply_heat         Heating_Instrument [Heat_instr]  "Kate cooked the rice in a rice-cooker."
7  Being_attached     Connector [Conn]                 "The tarp is tied down with a rope."
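If one were to press ahead despite the pluriformity, the table would translate into a lookup from (frame, frame element) pairs to the BMO relation. A hypothetical and, like the table itself, possibly incomplete sketch:

```python
# Frame elements that express the BMO relation, following the table above.
# The underscore frame names follow FrameNet's naming convention; the
# mapping is illustrative only and would grow with every frame considered.
BMO_FRAME_ELEMENTS = {
    ("Perception_active", "Means"),
    ("Theft", "Instrument"),
    ("Theft", "Means"),
    ("Hostile_encounter", "Instrument"),
    ("Grooming", "Means"),
    ("Apply_heat", "Heating_Instrument"),
    ("Being_attached", "Connector"),
}

def is_bmo(frame, frame_element):
    """True if this (frame, frame element) pair should trigger ByMeansOf."""
    return (frame, frame_element) in BMO_FRAME_ELEMENTS

print(is_bmo("Theft", "Instrument"), is_bmo("Theft", "Goods"))  # → True False
```

The size of such a table is exactly the burden the text warns about: every relevant frame contributes its own entry, and nothing in FrameNet's hierarchy lets the entries collapse into one.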

The next sections will disregard this problem and focus solely on integration with the Alpino parser and the extraction procedure. It must, however, be noted that a real-world implementation with the proposed means of extraction – XSL stylesheets – would become rather cumbersome due to the number of frame elements to be considered.

Another problem that must be taken into account when considering a real-world implementation is the incongruence between the constituents in the parse and the frame elements as defined by FrameNet. In Ahn et al. (2004) as much as 15% of the frame elements do not correspond to constituents, i.e. frame elements and syntactic constituents only partially overlap, resulting in mismatches. It is noted, however, that this percentage also includes parsing errors. A more in-depth review of this problem is difficult, since the distribution of this percentage among the possible causes is unspecified.

5.2.2 FrameNet and the Alpino Treebank

Adding FrameNet annotation to the Alpino Treebank can be done much along the same lines as the PropBank annotation. Frame element labels are inserted in the same location as PropBank labels. Appendix C shows both an example sentence from the corpus annotated with FrameNet semantic labels and the associated XSL transformation stylesheet.

Note the similarity between this method and the method with PropBank annotation from Appendix B. The XSL stylesheet, and consequently the extraction process, hardly changes – demonstrating that this extraction method can be used with hybrid or even unrelated annotation systems.

5.2.3 Extraction using FrameNet information

Similar to the extraction using PropBank annotation, we will be using a sentence containing the preposition 'door', this time expressing a BMO relation. The sentence is again taken from the corpus.¹

(5.7) Volgens G. Th. Rothuizen overtuigde hij "door een grote mate van weerloosheid" (Gereformeerd Weekblad).
      According-to G. Th. Rothuizen convinced he "by-means-of a large degree of defenselessness" (Reformed Weekly).
      'According to G. Th. Rothuizen, he convinced "by means of a large degree of defenselessness" (Reformed Weekly).'

For identification of the individual frame elements, we turn to the frame of Suasion², the frame evoked by the verb 'convince'. The relevant frame elements for this sentence are the following:

Target word: implicit for every frame.

Speaker [Spkr]: the person who intends, through the use of language, to get another person to act.

Means [Mns]: an act of the Speaker that accomplishes the persuasion.

Following the guidelines of this frame, we annotate the sentence. The annotated syntactic parse in XML format can be found in Appendix C; the human-friendly version is below:

(5.8) Volgens G. Th. Rothuizen [Target_Word overtuigde] [Speaker hij] [Means "door een grote mate van weerloosheid"] (Gereformeerd Weekblad).

The next step of the extraction process goes much along the same lines as the one using PropBank annotation. The result of the XSL transformation (also in Appendix C) is the following:

¹ Located here: http://www.let.rug.nl/~vannoord/trees/cdb/2143.xml
² See: http://framenet.icsi.berkeley.edu/index.php?option=com_wrapper&Itemid=118&frame=Suasion


<?xml version="1.0" encoding="iso-8859-1"?>
<candidates>
  <relation label="ByMeansOf" domain="overtuig" range="mate"/>
</candidates>

Note that the range of this relation is not 'weerloosheid' (defenselessness), as would probably have been the case if extraction had been done by a human ontology engineer. Instead, 'mate' (degree) is extracted, which is syntactically the head of the modifier phrase. Although the syntactic parse is correct, it fails to indicate that 'weerloosheid' is the semantic core of the phrase 'mate van weerloosheid' (degree of defenselessness), thus fooling the extraction process. Because the syntactic analysis is correct, this kind of information loss is hard to tackle.


Chapter 6

Conclusion

Extracting non-taxonomic semantic relations from preposition phrases can improve ontology learning and retrieve information that would otherwise be lost. This thesis has demonstrated that extraction of these relations is indeed possible. The process is, however, not without its pitfalls and shortcomings.

Some genuinely unambiguous prepositions allow semantic relations to be extracted using only the dependency structure generated by the parser. But these prepositions constitute only a small minority; most prepositions are (highly) polysemous.

Additionally, some prepositions do not express a kind of meaning that can be transformed into a semantic relation. An example of such a preposition is the Dutch ‘in’. This preposition can be used to indicate temporal information, e.g. “In drie uur was de klus geklaard” (“The job was done in three hours”). State-of-the-art ontology languages do not have the capacity to incorporate this type of semantics. It is argued in this thesis that this translation problem can be avoided by carefully selecting which prepositions are to be used for relation extraction. Relation extraction from prepositions that express ontology-friendly semantics should then suffer only from ambiguity problems.

It has been demonstrated that these selected prepositions indeed show potential for relation extraction, provided that the ambiguity problem is solved. Two options for preposition disambiguation have been studied (FrameNet and PropBank), both of which were found to have their merits, but also shortcomings that prevented full extraction.

FrameNet and PropBank are on opposite sides of the scale: FrameNet provides an ontological structure of meaning so diverse and fine-grained that one wonders where to find generalisation, while PropBank has a rather loosely defined annotation strategy that fails to really pin down any semantic constituent, consequently introducing new and much unwanted ambiguity. This is the main problem preventing the relation extraction process from attaining higher success rates.

Consequently, one direction for further research is to explore other options for semantic annotation. The D-COI project (Monachesi and Trapman, 2006) promises to bridge the gap between FrameNet and PropBank and might be at the right level of abstraction to pursue full disambiguation.

Another future topic might be the composition of preposition meaning. If two concepts are linked to each other multiple times by different prepositions, the composed meaning of these prepositions might be easier to add to the ontology than the meaning of a single preposition.

A major hurdle in the extraction process, and also a possible topic for further research, is the ‘syntactic obfuscation’ of key terms in a sentence. This problem (discussed at the end of Chapter 5) occurs when the semantically richest term is embedded in some substructure, for instance a modifier phrase (as in Chapter 5). This is not always a parser problem. One possible direction might be to introduce some ‘fuzziness’ into the extraction process, whereby semantically richer terms are preferred as part of the relation triple.


Appendix A

Extraction

The BMO relation

Example syntactic parse in XML format and XSL stylesheet, used for extraction of the BMO relation with unambiguous prepositions, in this case the Dutch preposition ‘middels’ (‘by means of’).

<?xml version="1.0" encoding="ISO-8859-1"?>
<alpino_ds version="1.1">
  <node begin="0" cat="top" end="12" id="0" rel="top">
    <node begin="0" cat="smain" end="12" id="1" rel="--">
      <node begin="0" end="1" id="2" pos="pron" rel="su" root="zij" word="Zij"/>
      <node begin="1" end="2" id="3" pos="verb" rel="hd" root="vraag" word="vroeg"/>
      <node begin="2" cat="pp" end="5" id="4" rel="mod">
        <node begin="2" end="3" id="5" pos="prep" rel="hd" root="middels" word="middels"/>
        <node begin="3" cat="np" end="5" id="6" rel="obj1">
          <node begin="3" end="4" id="7" pos="det" rel="det" root="een" word="een"/>
          <node begin="4" end="5" id="8" pos="noun" rel="hd" root="enquête" word="enquete"/>
        </node>
      </node>
      <node begin="5" cat="np" end="12" id="9" rel="obj1">
        <node begin="5" end="6" id="10" pos="num" rel="det" root="170" word="170"/>
        <node begin="6" end="7" id="11" pos="noun" rel="hd" root="leerling" word="leerlingen"/>
        <node begin="7" cat="pp" end="12" id="12" rel="mod">
          <node begin="7" end="8" id="13" pos="prep" rel="hd" root="uit" word="uit"/>
          <node begin="8" cat="np" end="12" id="14" rel="obj1">
            <node begin="8" end="9" id="15" pos="noun" rel="hd" root="2_vmbo" word="2-vmbo"/>
            <node begin="9" cat="pp" end="12" id="16" rel="mod">
              <node begin="9" end="10" id="17" pos="prep" rel="hd" root="naar" word="naar"/>
              <node begin="10" cat="np" end="12" id="18" rel="obj1">
                <node begin="10" end="11" id="19" pos="det" rel="det" root="hun" word="hun"/>
                <node begin="11" end="12" id="20" pos="noun" rel="hd" root="lees_gedrag" word="leesgedrag"/>
              </node>
            </node>
          </node>
        </node>
      </node>
    </node>
  </node>
  <sentence>Zij vroeg middels een enquete 170 leerlingen uit 2-vmbo naar hun leesgedrag</sentence>
  <comments>
    <comment>Q#0|Zij vroeg middels een enquete 170 leerlingen uit 2-vmbo naar hun leesgedrag|1|1|-0.14210569549999988</comment>
  </comments>
</alpino_ds>

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="iso-8859-1" indent="yes" />
  <xsl:template match="/alpino_ds">
    <candidates>
      <xsl:for-each select="//node[@root='middels']">
        <relation>
          <xsl:attribute name="label">ByMeansOf</xsl:attribute>
          <xsl:attribute name="domain">
            <xsl:for-each select="../../node[@rel='hd']">
              <xsl:value-of select="@root" />
            </xsl:for-each>
          </xsl:attribute>
          <xsl:attribute name="range">
            <xsl:value-of select="../node[@rel='obj1']/node[@rel='hd']/@root" />
          </xsl:attribute>
        </relation>
      </xsl:for-each>
    </candidates>
  </xsl:template>
</xsl:stylesheet>
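The extraction logic of the stylesheet above can also be sketched in Python. The code below is an illustration, not the thesis pipeline: it reimplements the same traversal (domain = the head of the clause two levels above the preposition, range = the head of the sister obj1-NP) with the standard library, over a reduced, hypothetical version of the Alpino parse.

```python
import xml.etree.ElementTree as ET

# Reduced Alpino-style parse of 'Zij vroeg middels een enquete ...'.
PARSE = ET.fromstring("""
<alpino_ds version="1.1">
  <node cat="top" rel="top">
    <node cat="smain" rel="--">
      <node pos="pron" rel="su" root="zij"/>
      <node pos="verb" rel="hd" root="vraag"/>
      <node cat="pp" rel="mod">
        <node pos="prep" rel="hd" root="middels"/>
        <node cat="np" rel="obj1">
          <node pos="det" rel="det" root="een"/>
          <node pos="noun" rel="hd" root="enquete"/>
        </node>
      </node>
    </node>
  </node>
</alpino_ds>
""")

def extract(tree, prep, label):
    """Mimic the XSLT: for each occurrence of the preposition, take the
    head two levels up (../../node[@rel='hd']) as domain and the head of
    the sister obj1-NP as range."""
    candidates = []
    # ElementTree has no parent axis, so build a child->parent map first.
    parent = {child: node for node in tree.iter() for child in node}
    for node in tree.iter("node"):
        if node.get("root") != prep:
            continue
        pp = parent[node]        # ../   (the PP headed by the preposition)
        clause = parent[pp]      # ../../ (the clause containing the PP)
        domain = clause.find("./node[@rel='hd']").get("root")
        rng = pp.find("./node[@rel='obj1']/node[@rel='hd']").get("root")
        candidates.append((label, domain, rng))
    return candidates

print(extract(PARSE, "middels", "ByMeansOf"))
# [('ByMeansOf', 'vraag', 'enquete')]
```

The child-to-parent map stands in for XSLT's `..` axis, which `xml.etree` does not provide directly.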


The CB relation

Example syntactic parse in XML format and XSL stylesheet, used for extraction of the CB relation with unambiguous prepositions, in this case the Dutch preposition ‘vanwege’ (‘because of’).

<?xml version="1.0" encoding="ISO-8859-1"?>
<alpino_ds version="1.1">
  <node begin="0" cat="top" end="13" id="0" rel="top">
    <node begin="5" end="6" id="1" pos="punct" rel="--" root="," word=","/>
    <node begin="0" cat="smain" end="12" id="2" rel="--">
      <node begin="0" cat="pp" end="2" id="3" rel="mod">
        <node begin="0" end="1" id="4" pos="prep" rel="hd" root="in" word="In"/>
        <node begin="1" end="2" id="5" pos="noun" rel="obj1" root="1933" word="1933"/>
      </node>
      <node begin="2" end="3" id="6" pos="verb" rel="hd" root="bezet" word="bezette"/>
      <node begin="3" end="4" id="7" pos="name" rel="su" root="Frankrijk" word="Frankrijk"/>
      <node begin="4" end="5" id="8" pos="name" rel="obj1" root="Andorra" word="Andorra"/>
      <node begin="6" cat="pp" end="12" id="9" rel="mod">
        <node begin="6" end="7" id="10" pos="prep" rel="hd" root="vanwege" word="vanwege"/>
        <node begin="7" cat="np" end="12" id="11" rel="obj1">
          <node begin="7" end="8" id="12" pos="adj" rel="mod" root="sociaal" word="sociale"/>
          <node begin="8" end="9" id="13" pos="noun" rel="hd" root="onrust" word="onrust"/>
          <node begin="9" cat="pp" end="12" id="14" rel="mod">
            <node begin="9" end="10" id="15" pos="prep" rel="hd" root="voor" word="voor"/>
            <node begin="10" cat="np" end="12" id="16" rel="obj1">
              <node begin="10" end="11" id="17" pos="det" rel="det" root="de" word="de"/>
              <node begin="11" end="12" id="18" pos="noun" rel="hd" root="verkiezing" word="verkiezingen"/>
            </node>
          </node>
        </node>
      </node>
    </node>
    <node begin="12" end="13" id="19" pos="punct" rel="--" root="." word="."/>
  </node>
  <sentence>In 1933 bezette Frankrijk Andorra , vanwege sociale onrust voor de verkiezingen .</sentence>
  <comments>
    <comment>Q#0|In 1933 bezette Frankrijk Andorra , vanwege sociale onrust voor de verkiezingen .|1|1|0.1404090389999999</comment>
  </comments>
</alpino_ds>

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="iso-8859-1" indent="yes" />
  <xsl:template match="/alpino_ds">
    <candidates>
      <xsl:for-each select="//node[@root='vanwege']">
        <relation>
          <xsl:attribute name="label">CausedBy</xsl:attribute>
          <xsl:attribute name="domain">
            <xsl:for-each select="../../node[@rel='hd']">
              <xsl:value-of select="@root" />
            </xsl:for-each>
          </xsl:attribute>
          <xsl:attribute name="range">
            <xsl:value-of select="../node[@rel='obj1']/node[@rel='hd']/@root" />
          </xsl:attribute>
        </relation>
      </xsl:for-each>
    </candidates>
  </xsl:template>
</xsl:stylesheet>


Appendix B

Extraction using PropBank

Example syntactic parse in XML format and accompanying XSL stylesheet, used for extraction of the CB relation with an ambiguous preposition, in this case the Dutch preposition ‘door’. The syntactic parse is enriched with semantic annotation according to the guidelines of PropBank (Babko-Malaya, 2005).

The example sentence can be found in the corpus under id #4665, or at URL http://www.let.rug.nl/~vannoord/trees/cdb/4665.xml.

<?xml version="1.0" encoding="ISO-8859-1"?>
<alpino_ds version="1.1">
  <node begin="0" cat="top" end="13" id="0" rel="top">
    <node begin="0" cat="smain" end="12" id="1" rel="--">
      <node begin="0" cat="np" end="2" id="2" index="1" rel="su" pb="Arg1">
        <node begin="0" end="1" id="3" pos="det" rel="det" root="de" word="De"/>
        <node begin="1" end="2" id="4" pos="noun" rel="hd" root="kerk" word="kerk"/>
      </node>
      <node begin="2" end="3" id="5" pos="verb" rel="hd" root="ben" word="is"/>
      <node begin="0" cat="ppart" end="12" id="6" rel="vc">
        <node begin="0" end="2" id="7" index="1" rel="su"/>
        <node begin="3" end="4" id="8" pos="verb" rel="hd" root="ontsta" word="ontstaan" pb="rel" />
        <node begin="4" cat="pp" end="12" id="9" rel="mod" pb="ArgM-CAU">
          <node begin="4" end="5" id="10" pos="prep" rel="hd" root="door" word="door"/>
          <node begin="5" cat="np" end="12" id="11" rel="obj1">
            <node begin="5" end="6" id="12" pos="noun" rel="hd" root="zendingswerk" word="zendingswerk"/>
            <node begin="6" cat="pp" end="9" id="13" rel="mod">
              <node begin="6" end="7" id="14" pos="prep" rel="hd" root="van" word="van"/>
              <node begin="7" cat="np" end="9" id="15" rel="obj1">
                <node begin="7" end="8" id="16" pos="det" rel="det" root="de" word="de"/>
                <node begin="8" end="9" id="17" pos="noun" rel="hd" root="kerk" word="NG-kerk"/>
              </node>
            </node>
            <node begin="9" cat="pp" end="12" id="18" rel="mod">
              <node begin="9" end="10" id="19" pos="prep" rel="hd" root="in" word="in"/>
              <node begin="10" cat="mwu" end="12" id="20" rel="obj1">
                <node begin="10" end="11" id="21" pos="noun" rel="mwp" root="Oranje" word="Oranje"/>
                <node begin="11" end="12" id="22" pos="noun" rel="mwp" root="Vrijstraat" word="Vrijstraat"/>
              </node>
            </node>
          </node>
        </node>
      </node>
    </node>
    <node begin="12" end="13" id="23" pos="punct" rel="--" root="." word="."/>
  </node>
  <sentence>De kerk is ontstaan door zendingswerk van de NG-kerk in Oranje Vrijstraat .</sentence>
</alpino_ds>

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="iso-8859-1" indent="yes" />
  <xsl:template match="/alpino_ds">
    <candidates>
      <xsl:for-each select="//node[@root='door']">
        <xsl:if test="../@pb='ArgM-CAU'">
          <relation>
            <xsl:attribute name="label">CausedBy</xsl:attribute>
            <xsl:attribute name="domain">
              <xsl:for-each select="../../node[@rel='hd']">
                <xsl:value-of select="@root" />
              </xsl:for-each>
            </xsl:attribute>
            <xsl:attribute name="range">
              <xsl:for-each select="../node[@rel='obj1']/node[@rel='hd']">
                <xsl:value-of select="@root" />
              </xsl:for-each>
            </xsl:attribute>
          </relation>
        </xsl:if>
      </xsl:for-each>
    </candidates>
  </xsl:template>
</xsl:stylesheet>
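The disambiguation step that the PropBank annotation adds can likewise be sketched in Python. The code below is an illustration, not the thesis pipeline: the ambiguous ‘door’ only yields a CausedBy candidate when its PP carries the PropBank label ArgM-CAU, mirroring the `xsl:if` guard in the stylesheet above. The parse fragment is a reduced, hypothetical version of the Appendix B example.

```python
import xml.etree.ElementTree as ET

# Reduced Alpino-style parse fragment of 'De kerk is ontstaan door
# zendingswerk ...', with a PropBank label on the causal PP.
PARSE = ET.fromstring("""
<alpino_ds version="1.1">
  <node cat="ppart" rel="vc">
    <node pos="verb" rel="hd" root="ontsta" pb="rel"/>
    <node cat="pp" rel="mod" pb="ArgM-CAU">
      <node pos="prep" rel="hd" root="door"/>
      <node cat="np" rel="obj1">
        <node pos="noun" rel="hd" root="zendingswerk"/>
      </node>
    </node>
  </node>
</alpino_ds>
""")

def extract_caused_by(tree):
    # Child->parent map stands in for XSLT's parent axis.
    parent = {child: node for node in tree.iter() for child in node}
    out = []
    for node in tree.iter("node"):
        # The ../@pb = 'ArgM-CAU' guard from the stylesheet: skip every
        # 'door' whose PP is not PropBank-labelled as a cause.
        if node.get("root") != "door" or parent[node].get("pb") != "ArgM-CAU":
            continue
        pp = parent[node]
        clause = parent[pp]
        domain = clause.find("./node[@rel='hd']").get("root")
        rng = pp.find("./node[@rel='obj1']/node[@rel='hd']").get("root")
        out.append(("CausedBy", domain, rng))
    return out

print(extract_caused_by(PARSE))
# [('CausedBy', 'ontsta', 'zendingswerk')]
```

Without the `pb` check, every occurrence of ‘door’ (agentive, spatial, causal) would produce a CausedBy candidate, which is precisely the ambiguity the annotation is meant to remove.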


Appendix C

Extraction using FrameNet

Example syntactic parse in XML format and accompanying XSL stylesheet, used for extraction of the BMO relation with an ambiguous preposition, in this case the Dutch preposition ‘door’. The syntactic parse is enriched with semantic annotation according to the guidelines of FrameNet (Baker et al., 1998).

The example sentence can be found in the corpus under id #2143, or at URL http://www.let.rug.nl/~vannoord/trees/cdb/2143.xml.

The related FrameNet frame is Suasion, to be found at URL http://framenet.icsi.berkeley.edu/index.php?option=com_wrapper&Itemid=118&frame=Suasion.

<?xml version="1.0" encoding="ISO-8859-1"?>
<alpino_ds version="1.1">
  <node begin="0" cat="top" end="19" id="0" rel="top">
    <node begin="6" end="7" id="1" pos="punct" rel="--" root="&quot;" word="&quot;"/>
    <node begin="13" end="14" id="2" pos="punct" rel="--" root="&quot;" word="&quot;"/>
    <node begin="14" end="15" id="3" pos="punct" rel="--" root="(" word="("/>
    <node begin="0" cat="du" end="17" id="4" rel="--">
      <node begin="0" cat="smain" end="13" id="5" rel="nucl">
        <node begin="0" cat="pp" end="4" id="6" rel="mod">
          <node begin="0" end="1" id="7" pos="prep" rel="hd" root="volgens" word="Volgens"/>
          <node begin="1" cat="mwu" end="4" id="8" rel="obj1">
            <node begin="1" end="2" id="9" pos="noun" rel="mwp" root="G." word="G."/>
            <node begin="2" end="3" id="10" pos="noun" rel="mwp" root="Th." word="Th."/>
            <node begin="3" end="4" id="11" pos="noun" rel="mwp" root="Rothuizen" word="Rothuizen"/>
          </node>
        </node>
        <node begin="4" end="5" id="12" pos="verb" rel="hd" root="overtuig" word="overtuigde" fn="Target_Word" />
        <node begin="5" end="6" id="13" pos="noun" rel="su" root="hij" word="hij" fn="Speaker" />
        <node begin="7" cat="pp" end="13" id="14" rel="mod" fn="Means">
          <node begin="7" end="8" id="15" pos="prep" rel="hd" root="door" word="door"/>
          <node begin="8" cat="np" end="13" id="16" rel="obj1">
            <node begin="8" end="9" id="17" pos="det" rel="det" root="een" word="een"/>
            <node begin="9" end="10" id="18" pos="adj" rel="mod" root="groot" word="grote"/>
            <node begin="10" end="11" id="19" pos="noun" rel="hd" root="mate" word="mate"/>
            <node begin="11" cat="pp" end="13" id="20" rel="mod">
              <node begin="11" end="12" id="21" pos="prep" rel="hd" root="van" word="van"/>
              <node begin="12" end="13" id="22" pos="noun" rel="obj1" root="weerloosheid" word="weerloosheid"/>
            </node>
          </node>
        </node>
      </node>
      <node begin="15" cat="mwu" end="17" id="23" rel="sat">
        <node begin="15" end="16" id="24" pos="noun" rel="mwp" root="Gereformeerde" word="Gereformeerde"/>
        <node begin="16" end="17" id="25" pos="noun" rel="mwp" root="Weekblad" word="Weekblad"/>
      </node>
    </node>
    <node begin="17" end="18" id="26" pos="punct" rel="--" root=")" word=")"/>
    <node begin="18" end="19" id="27" pos="punct" rel="--" root="." word="."/>
  </node>
  <sentence>Volgens G. Th. Rothuizen overtuigde hij &quot; door een grote mate van weerloosheid &quot; ( Gereformeerde Weekblad ) .</sentence>
</alpino_ds>

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="iso-8859-1" indent="yes" />
  <xsl:template match="/alpino_ds">
    <candidates>
      <xsl:for-each select="//node[@root='door']">
        <xsl:if test="../@fn='Means'">
          <relation>
            <xsl:attribute name="label">ByMeansOf</xsl:attribute>
            <xsl:attribute name="domain">
              <xsl:for-each select="../../node[@rel='hd']">
                <xsl:value-of select="@root" />
              </xsl:for-each>
            </xsl:attribute>
            <xsl:attribute name="range">
              <xsl:for-each select="../node[@rel='obj1']/node[@rel='hd']">
                <xsl:value-of select="@root" />
              </xsl:for-each>
            </xsl:attribute>
          </relation>
        </xsl:if>
      </xsl:for-each>
    </candidates>
  </xsl:template>
</xsl:stylesheet>


Bibliography

Agrawal, Rakesh, Tomasz Imielinski, and Arun Swami. “Mining association rules between sets of items in large databases.” In SIGMOD ’93: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM Press, 1993, 207–216.

Ahn, David, Sisay Fissaha, Valentin Jijkoun, and Maarten de Rijke. “The University of Amsterdam at Senseval-3: Semantic Roles and Logic Forms.” In Proceedings of SENSEVAL. 2004, volume 3.

Antoniou, Grigoris, and Frank van Harmelen. A Semantic Web Primer. Cambridge, Massachusetts: MIT Press, 2004.

Babko-Malaya, Olga. “PropBank Annotation Guidelines.” 2005.

Baker, Collin F., Charles J. Fillmore, and John B. Lowe. “The Berkeley FrameNet project.” In Proceedings of COLING-ACL 1998.

Bannard, Colin, and Timothy Baldwin. “Distributional Models of Preposition Semantics.” In Proceedings of the ACL-SIGSEM Workshop on the Linguistic Dimensions of Prepositions and their Use in Computational Linguistics Formalisms and Applications. Toulouse, France, 2003, 169–180.

Barker, Ken, and Stan Szpakowicz. “Semi-Automatic Recognition of Noun Modifier Relationships.” In Proceedings of the 36th Annual Meeting of the ACL and 17th International Conference on Computational Linguistics (COLING/ACL-98). Montreal, Canada, 1998, 96–102.

van der Beek, Leonoor, Gosse Bouma, Robert Malouf, and Gertjan van Noord. “The Alpino Dependency Treebank.” Computational Linguistics in the Netherlands, 8–22. URL http://www.let.rug.nl/~vannoord/trees/.

Berland, Matthew, and Eugene Charniak. “Finding parts in very large corpora.” In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. 1999, 57–64.

Berners-Lee, Tim, James Hendler, and Ora Lassila. “The Semantic Web.” Scientific American.

Biron, Paul V., and Ashok Malhotra. “XML Schema Part 2: Datatypes Second Edition.” W3C Recommendation. URL http://www.w3.org/TR/xmlschema-2/.

Blomqvist, Eva. “State of the Art: Patterns in Ontology Engineering.” Technical report, Jönköping University, School of Engineering, 2005.

Bodenreider, Olivier. “Lexical and Statistical Approaches to Acquiring Ontological Relations.” Technical report, Lister Hill National Center for Biomedical Communications, Bethesda, Maryland, USA, 2005.

Bodenreider, Olivier, Anita Burgun, and Thomas C. Rindflesch. “Lexically-suggested hyponymic relations among medical terms and their representation in the UMLS.” In Proceedings of TIA: “Terminology and Artificial Intelligence”. 2001, 11–21.

Uit den Boogaart, P. C. Woordfrequenties in geschreven en gesproken Nederlands. Werkgroep Frequentieonderzoek van het Nederlands. Utrecht: Oosterhoek, Scheltema & Holkema, 1975.

Bouma, G., G. van Noord, and R. Malouf. “Alpino: wide-coverage computational analysis of Dutch.” 2000.

Bouma, Gosse, and Ineke Schuurman. “De positie van het Nederlands in Taal- en Spraaktechnologie.” Technical report, De Nederlandse Taalunie, 1998. URL http://www.let.rug.nl/~gosse/taalunie/webrapport/.

Bray, Tim, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, and François Yergeau. “Extensible Markup Language (XML) 1.0 (Third Edition).” W3C Recommendation. URL http://www.w3.org/TR/2004/REC-xml-20040204/.

Brickley, Dan, and Ramanathan V. Guha. “Resource Description Framework (RDF) Schema Specification 1.0.” W3C Candidate Recommendation. URL http://www.w3.org/TR/2000/CR-rdf-schema-20000327.

Broekstra, Jeen, Michel Klein, Stefan Decker, Dieter Fensel, and Ian Horrocks. “Adding formal semantics to the Web: building on top of RDF Schema.” In Proceedings of the ECDL 2000 Workshop on the Semantic Web. 2000.

Buitelaar, Paul, Philipp Cimiano, Marko Grobelnik, and Michael Sintek. “Ontology Learning from Text.” Tutorial, ECML/PKDD, 2005.

Buitelaar, Paul, Daniel Olejnik, and Michael Sintek. “A Protégé Plug-In for Ontology Extraction from Text Based on Linguistic Analysis.” In Proceedings of the 1st European Semantic Web Symposium (ESWS), Heraklion, Greece. 2004.

Chen, Enhong, and Gaofeng Wu. “An Ontology Learning Method Enhanced by Frame Semantics.” In ISM ’05: Proceedings of the Seventh IEEE International Symposium on Multimedia. Washington, DC, USA: IEEE Computer Society, 2005, 374–382.

Clark, J., S. DeRose, et al. “XML Path Language (XPath) Version 1.0.” W3C Recommendation, 16 November 1999.

Clark, J., et al. “XSL Transformations (XSLT) Version 1.0.” W3C Recommendation, 16 November 1999.

Doan, AnHai, Jayant Madhavan, Pedro Domingos, and Alon Halevy. “Ontology matching: A machine learning approach.” Handbook on Ontologies in Information Systems, 397–416.

Fillmore, Charles J. “The case for case.” In Universals in Linguistic Theory, edited by Emmon Bach and Robert T. Harms, New York: Holt, Rinehart and Winston, 1968.

Fillmore, Charles J. “The need for a frame semantics within linguistics.” Statistical Methods in Linguistics 1: (1976) 37–41.

Firth, J. R. Papers in Linguistics. Oxford University Press, 1957.

Gennari, J. H., M. A. Musen, R. W. Fergerson, W. E. Grosso, M. Crubézy, H. Eriksson, N. F. Noy, and S. W. Tu. “The evolution of Protégé: an environment for knowledge-based systems development.” International Journal of Human-Computer Studies 58, 1: (2003) 89–123.

Gómez-Pérez, A., and D. Manzano-Macho. “A survey of ontology learning methods and techniques.” Technical report, OntoWeb Project Deliverable 1.5, 2003. URL http://ontoweb.org/Members/ruben/Deliverable%201.5.

Gruber, Jeff. Lexical Structure in Syntax and Semantics. New York: North Holland, 1976.

Hahn, Udo, and Klemens Schnattinger. “Ontology engineering via text understanding.” In Proceedings of the 15th World Computer Congress ‘The Global Information Society on the Way to the Next Millenium’ (IFIP’98). 1998.

van Halteren, Hans, Peter Arno Coppen, Bram Elffers, Dusan Bavcar, and Micha Hulsbosch. “Prepositional Phrase Attachment for Dutch: New attention for an old task.” In Proceedings of CLIN 2005. 2006.

Hastings, P. M. Automatic Acquisition of Word Meanings from Context. Ph.D. thesis, University of Michigan, 1994.

Hearst, M. A. “Automatic acquisition of hyponyms from large text corpora.” In Proceedings of the Fourteenth International Conference on Computational Linguistics, Nantes, France. 1992.

Horridge, Matthew, Holger Knublauch, Alan Rector, Robert Stevens, and Chris Wroe. “A Practical Guide To Building OWL Ontologies Using The Protégé-OWL Plugin and CO-ODE Tools.” The University of Manchester, Manchester, United Kingdom, 1st edition, August.

IJzereef, Leonie. Automatische extractie van hyponiemrelaties uit grote tekstcorpora. Master’s thesis, Rijksuniversiteit Groningen, 2004. URL http://www.let.rug.nl/alfa/scripties/LeonieIJzereef.html.

Jackendoff, Ray S. Semantic Interpretation in Generative Grammar. Cambridge, Massachusetts: The MIT Press, 1974.

Jensen, Per Anker, and Jørgen Fischer Nilsson. “Ontology-based semantics for prepositions.” ACL-SIGSEM Workshop on the Linguistic Dimension of Prepositions and their Use in Computational Linguistics, Institut de Recherche en Informatique de Toulouse, Toulouse, September 4–6.

Jurafsky, Daniel. “Natural Language Understanding, Background Lecture 4: Semantic Roles, Frames, PropBank, FrameNet.” 2006. URL http://www.stanford.edu/class/linguist288/presentations/.

Kingsbury, Paul, and Martha Palmer. “PropBank: the Next Level of TreeBank.” In Proceedings of The Second Workshop on Treebanks and Linguistic Theories. Sweden, 2003.

Kingsbury, Paul, Martha Palmer, and Mitch Marcus. “Adding Semantic Annotation to the Penn TreeBank.” In Proceedings of the Human Language Technology Conference (HLT ’02). 2002. URL citeseer.ist.psu.edu/572691.html.

Kordoni, Valia. “Prepositional Arguments in a Multilingual Context.” In Computational Linguistics Dimensions of the Syntax and Semantics of Prepositions, edited by Patrick Saint-Dizier, Springer-Verlag Berlin Heidelberg, 2006, volume 29 of Text, Speech and Language Technology, 131–145.

Lassila, Ora, and Ralph R. Swick. “Resource Description Framework (RDF): Model and Syntax Specification.” W3C Recommendation. URL http://www.w3.org/TR/REC-rdf-syntax/.

Levin, Beth. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, 1993.

Maedche, Alexander, and Steffen Staab. “Semi-automatic engineering of ontologies from text.” In Proceedings of the 12th International Conference on Software Engineering and Knowledge Engineering. Chicago, USA, 2000.

Maedche, Alexander, and Steffen Staab. “Ontology Learning for the Semantic Web.” IEEE Intelligent Systems 16, 2: (2001) 72–79.

Magnini, Bernardo, Luciano Serafini, and Manuela Speranza. “Semantic Coordination for Document Retrieval.” Künstliche Intelligenz 18, 4: (2004) 18–23.

Marcus, Mitchell P., Mary Ann Marcinkiewicz, and Beatrice Santorini. “Building a large annotated corpus of English: the Penn Treebank.” Computational Linguistics 19, 2: (1993) 313–330.

Kavalec, Martin, and Vojtech Svatek. “A Study on Automated Relation Labelling in Ontology Learning.” In Proceedings of SOFSEM. Springer, 2004.

Max Planck Institute for Psycholinguistics. “WebCelex.” 2001. URL http://www.mpi.nl/world/celex/help/dudb.html.

Monachesi, Paola, and Jantine Trapman. “Merging FrameNet and PropBank in a corpus of written Dutch.” In Proceedings of the workshop Merging and Layering Linguistic Information, held in conjunction with LREC 2006. Genoa, Italy, 2006.

Navigli, Roberto, Paola Velardi, and Aldo Gangemi. “Ontology Learning and Its Application to Automated Terminology Translation.” IEEE Intelligent Systems 18, 1: (2003) 22–31.

Noy, Natalya F., Michael Sintek, Stefan Decker, Monica Crubézy, Ray W. Fergerson, and Mark A. Musen. “Creating Semantic Web contents with Protégé-2000.” IEEE Intelligent Systems 16, 2: (2001) 60–71.

Palmer, Martha. “Proposition Bank: A resource of predicate-argument relations.” 2001. URL http://www1.cs.columbia.edu/nlp/sgd/palmer.ppt.

Pollard, Carl, and Ivan A. Sag. Head-Driven Phrase Structure Grammar. CSLI Lecture Notes, Chicago University Press.

Pradhan, Sameer, Kadri Hacioglu, Valerie Krugler, Wayne Ward, James H. Martin, and Daniel Jurafsky. “Support Vector Learning for Semantic Argument Classification.” Machine Learning 60, 1–3: (2005) 11–39.

Sag, Ivan A. “English Relative Clause Constructions.” Journal of Linguistics.

Sceffer, Simone. Semantic coordination of hierarchical classifications with attributes. Master’s thesis, Politecnico di Milano, 2004.

Schutz, Alexander, and Paul Buitelaar. “RelExt: A Tool for Relation Extraction from Text in Ontology Extension.” In International Semantic Web Conference. 2005, 593–606.

Schuurman, Ineke, Machteld Schouppe, Heleen Hoekstra, and Ton van der Wouden. “CGN, an Annotated Corpus of Spoken Dutch.” Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC-03). Budapest, Hungary.

Smith, Mike, Chris Welty, and Deborah L. McGuinness. “OWL Web Ontology Language Guide.” W3C Candidate Recommendation. URL http://www.w3.org/TR/owl-guide/.

Stevens, Gerwert. Master Thesis on Semantic Role Labeling (In Progress). Master’s thesis, Universiteit Utrecht, 2006.

Surdeanu, Mihai, Sanda Harabagiu, John Williams, and Paul Aarseth. “Using predicate-argument structures for information extraction.” In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. 2003, 8–15.

Thompson, Henry S., David Beech, Murray Maloney, and Noah Mendelsohn. “XML Schema Part 1: Structures Second Edition.” W3C Recommendation. URL http://www.w3.org/TR/xmlschema-1/.

Tyler, Andrea, and Vyvyan Evans. The Semantics of English Prepositions: Spatial Scenes, Embodied Meaning, and Cognition. Cambridge University Press, 2003.

Zhang, Songmao, and Olivier Bodenreider. “Knowledge Augmentation for Aligning Ontologies: An Evaluation in the Biomedical Domain.” In Proceedings of the Semantic Integration Workshop at the Second International Semantic Web Conference. 2003, 109–114.