Proceedings of the 2nd Workshop on NLP and XML (NLPXML-2002) Held at the 19th International Conference on Computational Linguistics (COLING-2002) Taipei, Taiwan September 1, 2002 Edited by Graham Wilcock, Nancy Ide and Laurent Romary

Proceedings of the - Entinen yleisen kielitieteen laitos ...gwilcock/NLPXML-2002/NLPXML-2002-Proceedin… · We are pleased to introduce the proceedings of the 2nd Workshop on NLP


Proceedings of the

2nd Workshop on NLP and XML

(NLPXML-2002)

Held at the 19th International Conference on

Computational Linguistics (COLING-2002)

Taipei, Taiwan

September 1, 2002

Edited by

Graham Wilcock, Nancy Ide and Laurent Romary


Preface

We are pleased to introduce the proceedings of the 2nd Workshop on NLP and XML (NLPXML-2002). Following the 1st NLP and XML Workshop, held in Tokyo in November 2001, this is the second in what we hope will be a regular series.

The goals of this workshop are: to provide a forum for presentation and discussion of XML use in NLP (including resource and software development, applications, tools, etc.); and to clarify the “big picture” for NLP applications and resources vis-a-vis the XML framework and development of the Semantic Web. As such, the workshop is intended not only for those already using XML, but also for members of the NLP community who seek a fuller understanding of the motivations and implications of XML and related standards for the field.

This volume of proceedings contains 10 full papers and 5 project notes covering a wide range of relevant topics, including: XML use for linguistic annotation; position statements concerning the implications of XML, RDF and the Semantic Web; XML-based document generation; XML in multimodal and speech technology; use of XML mechanisms (especially XSLT) in specific applications; and several XML-based NLP tools.

We would like to thank the program committee for their timely and constructive reviews of the large number of submitted papers. We would also like to thank the Workshops Chair and other organizers of COLING-2002 for their support. Finally, we take this opportunity to thank the organizers of the first NLPXML workshop, Naoyuki Nomura and Chieko Nakabasami, for initiating the NLPXML workshop series.

Nancy Ide
Laurent Romary
Graham Wilcock

(co-chairs)


Program Committee

Key-Sun Choi, KAIST, Korea
Hamish Cunningham, University of Sheffield, UK
Thierry Declerck, DFKI, Germany
Tomaz Erjavec, Jozef Stefan Institute, Slovenia
John Garofolo, NIST, USA
Jan Hajic, Prague University, Czech Republic
Nancy Ide, Vassar College, USA (co-chair)
Chieko Nakabasami, Toyo University, Japan
Naoyuki Nomura, Justsystem/Hosei University, Japan
Laurent Romary, Loria/CNRS, France (co-chair)
Henry Thompson, University of Edinburgh, UK
Naohiko Uramoto, IBM, Japan
Kuansan Wang, Microsoft, USA
Graham Wilcock, University of Helsinki, Finland (co-chair)


Contents

Cascading XSL Filters for Content Selection in Multilingual Document Generation
Guillermo Barrutieta, Joseba Abaitua and JosuKa Díaz

An Introduction to the GeM Annotation Schema for Complex Document Layout
John Bateman, Renate Henschel and Judy Delin

XML/XSL in the Dictionary: The Case of Discourse Markers
Daniela Berger, David Reitter and Manfred Stede

XML-based NLP Tools for Analysing and Annotating Medical Language
Claire Grover, Ewan Klein, Alex Lascarides and Maria Lapata

Annotating the Semantic Web Using Natural Language
Boris Katz and Jimmy Lin

The TASX Environment: an XML-based Toolset for the Creation of Multimodal Corpora
Jan-Torsten Milde

Cascaded Regular Grammars over XML Documents
Kiril Simov, Milen Kouylekov and Alexander Simov

XtraGen - A Natural Language Generation System Using XML- and Java-Technologies
Holger Stenzhorn

XiSTS - XML in Speech Technology Systems
Michael Walsh, Stephen Wilson and Julie Carson-Berndsen

SALT: An XML Application for Web-based Multimodal Dialog Management
Kuansan Wang

Posters and Project Notes

RDF(S)/XML Linguistic Annotation of Semantic Web Pages
Guadeloupe Aguado-de-Cea, Inmaculada Alvarez-de-Mon, Antonio Pareja-Lora and Rosario Plaza-Arteche

The Papillon Project: Cooperatively Building a Multilingual Lexical Database to Derive Open Source Dictionaries and Lexicons
Christian Boitet, Mathieu Mangeot-Lerebours and Gilles Serasset

Towards a Web-based Information Centre on Swedish Language Technology
Petter Karlstrom and Robin Cooper

XML in a Web-based Grammar Development Environment
Eugene Koontz

A Proposal for Screening Inconsistencies in Ontologies based on Query Languages using WSD
Chieko Nakabasami and Naoyuki Nomura


Cascading XSL filters for content selection in multilingual document generation

Guillermo BARRUTIETA
Mondragon Unibertsitatea
Loramendi, 4 Arrasate, Spain, 20500

Joseba ABAITUA
Universidad de Deusto
Avenida de las Universidades, 24 Bilbao, Spain, 48007

JosuKa DÍAZ
Universidad de Deusto
Avenida de las Universidades, 24 Bilbao, Spain, 48007

Abstract

Content selection is a key factor of any successful document generation system. This paper shows how a content selection algorithm has been implemented using an efficient combination of XML/XSL technology and the framework of RST for discourse modeling. The system generates multilingual documents adapted to user profiles in a learning environment for the web. This CourseViewGenerator applies simplified RST schemes to the elaboration of a master document in XML from which content segments are chosen to suit the user's needs. The personalisation of the document is achieved through the application of a sequence of filtering levels of text selection based on the user aspects given as input. These cascading filters are implemented in XSL.

Introduction

It is widely accepted that content selection plays a crucial role in text generation (Reiter and Dale 2000). This process is normally seen as a goal-directed activity in which text segments are fitted into the discourse structure of the text so as to convey a coherent communicative goal (Grosz and Sidner 1986). Content planning techniques, such as textual schemas (McKeown 1985) or plan operators (Moore and Paris 1993), have been successfully used as models of text generation. There are cases, though, in which these techniques may face some limitations, for example, when the structure of the discourse is difficult to anticipate (Mellish et al. 1998). Nevertheless, when a set of well-defined communicative goals exists, complex goals can be broken down into sequences of utterances and generation becomes an efficient “top-down” process (Marcu 1997).

This paper shows a macro-level content selection algorithm that applies user profiles to constrain and discriminate the contents of a text, whose discourse structure is represented using a simplified version of Rhetorical Structure Theory (Mann and Thompson 1988). The algorithm has been implemented using XML/XSL-based technology in a multilingual document generation system for educational purposes. The main objective of this CourseViewGenerator system (Barrutieta, 2001 and Barrutieta et al., 2001) is to automatically produce multilingual learning documents that suit the student's needs at each particular stage of the learning process. Figure 1 shows the overall architecture of the system.

[Figure 1 diagram. Labels: Inputs: course material (multilingual parallel corpus; xml-dtd) and user aspects. COURSE GENERATOR: generation engine (html-xml-dtd-xsl-javascript), which selects content and format in an “intelligent” way. Document generation; document view; Web browser.]

Figure 1: General scheme of the multilingual document generation system

We will begin by explaining the different parts of the system before addressing the content selection algorithm itself in more detail. The system starts by constructing a master document of the kind Hirst et al. (1997) proposed. This master document consists of a full-fledged text with references to all necessary multimedia elements (figures, tables, pictures, links, etc.). In our case, this master document takes the shape of a simple text file with all relevant information tagged in XML. Tags carry information about the logical composition of the text as well as metadata about its discourse structure. The text is seen as raw data, and tags encapsulate these raw data as metadata. The structure of the discourse is represented using a simplified version of RST. RST is simplified in the sense that the granularity of discourse segments does not transcend the boundaries of the sentence.

Table 1 illustrates this gross-grained version of RST, in which discourse relations are represented as XML tags.

<RST>
  <RST-S>
    <PREPARATION>
      <S> What is knowledge management? </S>
    </PREPARATION>
  </RST-S>
  <RST-N>
    <S> Knowledge, in a business context, is the organizational memory, which people know collectively and individually </S>
    <S> Management is the judicious use of means to accomplish an end </S>
    <S> Knowledge management is the combination of those concepts, KM = knowledge + management </S>
  </RST-N>
</RST>
<RST>
  <RST-S>
    <PREPARATION>
      <S> ¿Qué es gestión del conocimiento? </S>
    </PREPARATION>
  </RST-S>
  <RST-N>
    <S> Conocimiento, en el contexto de los negocios, es la memoria de la organización, lo que la gente sabe colectiva e individualmente </S>
    <S> Gestión es el uso juicioso de recursos para alcanzar un fin </S>
    <S> Gestión del conocimiento es la combinación de esos dos conceptos, GC = gestión + conocimiento </S>
  </RST-N>
</RST>
<RST>
  <RST-S>
    <PREPARATION>
      <S> Zer da ezagutzaren kudeaketa? </S>
    </PREPARATION>
  </RST-S>
  <RST-N>
    <S> Kudeaketa, negozioetan, erakundearen memoria da, jendeak bakarka eta taldeka dakiena </S>
    <S> Kudeaketak erabideen erabilera zuzena du helburu </S>
    <S> Ezagutzaren kudeaketa bi kontzeptu hauen nahasketa da, EK = ezagutza + kudeaketa </S>
  </RST-N>
</RST>

Table 1: Gross-grained RST in XML

Like any other standard RST discourse tree, this simplified RST contains a nucleus for each text paragraph, and one or several satellites linked by a discourse relation to the nucleus within the same paragraph. The nucleus is an essential segment of the text, as it carries the main message that the author wants to convey. Satellites play an important supporting role for the nucleus, but they can be replaced or erased without changing the overall message.

In our system, satellites are selected or discarded depending on the reader’s profile. The reader’s profile is defined through a set of user aspects. These take the form of multi-value parameters that were sketched after a number of surveys were conducted among teachers, students and other experts from the educational context. As a result of these surveys a user model was proposed (Barrutieta et al., 2002). Table 2 illustrates a simplified version of the model.

Specific User Aspects    Discrete values
Subject                  Language processors
Moment in time           Before the course / Period 1 / Period 2 / … / After the course (review)
Languages                EN / ES / EU

General User Aspects     Discrete values
Level of expertise       Null / Basic / Medium / High
Reason to read           To get an idea / To get deep into it
Background               Not related to the subject / Related to the subject
Opinion or motivation    Against / Without an opinion or motivation / In favour
Time available           A little bit of time / Quite some time / Enough time

Table 2: User model

Based on this user model, we will now discuss the content selection algorithm (henceforth CSA). The CSA determines which segments of the discourse are going to be used in order to make explicit the set of parameters that conform with the user’s profile. In principle, nuclei will always be chosen (as they convey the main message of the text); satellites, however, will be selected depending on their relation to the nucleus and the user aspects that are activated at the time of generation.


The selection algorithm works in three consecutive phases: parallel selection, horizontal filtering and vertical filtering. Vertical filtering is the most important phase of the three as it is here that the parts of the discourse tree are selected or discarded.

1 CSA - Parallel selection - Phase 1

In the parallel selection phase, two of the three specific user aspects are taken into account: subject and languages. These aspects identify the relevant XML master document in the chosen language (as illustrated in Figure 2). There is one master document for each subject covered by the system, and these documents contain parallel aligned versions of the texts in each language (English, Spanish and Basque, in our case).

Figure 2: CSA – Parallel selection

As a result of this first filtering phase, the appropriate language division of the master document is selected. This text division is the input for subsequent filtering phases in which the particular segments of the document will be discriminated.
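The parallel selection step can be sketched as a simple lookup. This is only an illustration under stated assumptions: the in-memory object `masterDocuments`, the subject key "knowledge management" and the function name `parallelSelection` are hypothetical stand-ins for the XML master documents, which the system actually stores as tagged text files.

```javascript
// Hypothetical in-memory stand-in for the XML master documents:
// one entry per subject, each holding parallel language divisions.
const masterDocuments = {
  "knowledge management": {
    EN: ["What is knowledge management?"],
    ES: ["¿Qué es gestión del conocimiento?"],
    EU: ["Zer da ezagutzaren kudeaketa?"],
  },
};

// Phase 1: pick the master document for the subject,
// then the division written in the requested language.
function parallelSelection(documents, subject, language) {
  const master = documents[subject];
  if (!master) {
    throw new Error(`No master document for subject: ${subject}`);
  }
  const division = master[language];
  if (!division) {
    throw new Error(`No ${language} division for subject: ${subject}`);
  }
  return division;
}
```

The returned division is what the later filtering phases would operate on; an unknown subject or language is reported as an error rather than silently producing an empty document.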

2 CSA - Horizontal filtering - Phase 2

The horizontal filtering phase concerns the remaining specific user aspect, moment in time, which is used to suit the generated text to the particular moment of the learning plan. This aspect cuts horizontally across the parallel selection of the previous section.

The master document is structured in accordance with a set of course scheduling parameters. Each day, and each learning unit within the day, is correlated with a corresponding set of learning entities in the XML master document. In this way, the generated document can be targeted at learning unit 1 of day 1, or any other day or unit. The XML master file also contains some informative elements that the reader may need to know even before the course starts or after it has finished. These are also generated when the corresponding user aspects are activated. Figure 3 shows a graphical representation of horizontal filtering.

Figure 3: CSA – Horizontal filtering
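A minimal sketch of horizontal filtering, assuming each learning entity carries a label for the moment in time it belongs to. The `moment` field, the sample entities and the function name `horizontalFilter` are hypothetical; the paper does not show the scheduling markup of the master document.

```javascript
// Hypothetical learning entities of one language division; the `moment`
// field stands in for the course-scheduling markup of the master document.
const languageDivision = [
  { moment: "before the course", text: "Practical information" },
  { moment: "period 1", text: "Unit for period 1" },
  { moment: "period 2", text: "Unit for period 2" },
  { moment: "after the course", text: "Review material" },
];

// Phase 2: keep only the entities scheduled for the requested moment in time.
function horizontalFilter(entities, momentInTime) {
  return entities.filter((entity) => entity.moment === momentInTime);
}
```

Calling `horizontalFilter(languageDivision, "period 1")` keeps only the entity labelled for period 1 and leaves the input array untouched, so the same division can be filtered again for another moment.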

3 CSA - Vertical filtering - Phase 3

The final phase, vertical filtering, involves the five general user aspects: level of expertise, reason to read, professional background, opinion or motivation, and time available. These five aspects determine which parts of the previously selected and filtered discourse tree are kept.

Nuclei are always maintained because they are, by definition, irreplaceable segments of the text and convey the main message. Satellites are the segments of the text that are subject to the algorithm's selection process. Figure 4 shows this filtering phase graphically.

Figure 4: Vertical filtering


The set of discrimination rules applied in this first version of the content selection algorithm is described below. These rules apply at successive levels of filtering, and therefore have a cascading effect. It is known that RST covers an indefinite number of relation-satellites (Knott, 1995), which have been classified by Hovy & Maier (1997), but we will only mention the set of relation-satellites used in the master document taken as an example.

3.1 Vertical filter – Level of expertise

If level_expertise = “null” or level_expertise = “basic”
Then no relation-satellite is discarded;
If level_expertise = “medium” or level_expertise = “high”
Then discard example, exercise, background and preparation relation-satellites;

Rationale for the rule: Any user with a null or basic level of expertise on the selected subject will need all the information available to understand the text. Alternatively, a user with a medium or high level of expertise will not require examples, exercises, background, preparation and similar relation-satellites.
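The level-of-expertise rule can be sketched as a function over a list of discourse segments. The segment representation (`role` and `relation` fields) and the names `EXPERT_DISCARDS` and `filterByExpertise` are hypothetical, chosen only for illustration; the system itself expresses this rule as an XSL filter.

```javascript
// Relation-satellites discarded for users with medium or high expertise.
const EXPERT_DISCARDS = new Set(["example", "exercise", "background", "preparation"]);

// Segments are modelled as { role: "nucleus" } or
// { role: "satellite", relation: "<relation name>" } (an assumed model).
function filterByExpertise(segments, levelOfExpertise) {
  const isExpert = levelOfExpertise === "medium" || levelOfExpertise === "high";
  if (!isExpert) {
    return segments.slice(); // null or basic: no relation-satellite is discarded
  }
  // Nuclei are always kept; satellites survive unless their relation is listed.
  return segments.filter(
    (segment) => segment.role === "nucleus" || !EXPERT_DISCARDS.has(segment.relation)
  );
}
```

With a nucleus, a preparation satellite and an elaboration satellite as input, a "high" level keeps the nucleus and the elaboration, while a "basic" level keeps all three.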

3.2 Vertical filter – Reason to read

If reason_to_read = “to get an idea”
Then discard exercise and elaboration (all the types of elaboration: textual elaboration, link elaboration and image elaboration) relation-satellites;
If reason_to_read = “to get deep into it”
Then no relation-satellite is discarded;

Rationale: Any user wishing to broaden his knowledge in the selected subject will need additional information. Conversely, a user with the intention of just getting an idea does not need any exercise, elaboration, or similar relation-satellites, which often require a more active role on the part of the user.

3.3 Vertical filter – Professional background

If job_studies = “not related subject”
Then no relation-satellite is discarded;
If job_studies = “related subject”
Then discard background and preparation relation-satellites;

Rationale: Any user whose professional background is not related to the subject will need all the additional supporting text to understand its meaning. Conversely, if the user's background is related to the selected subject, we may assume that background, preparation and similar relation-satellites will be unnecessary.

3.4 Vertical filter – Opinion or motivation

If opinion_motivation = “against” or opinion_motivation = “without an opinion or motivation”
Then no relation-satellite is discarded;
If opinion_motivation = “in favour”
Then discard motivate, antithesis, concession and justify relation-satellites;

Rationale: A motivated or favourable user will not require additional motivation and, therefore, the motivate, antithesis, concession, justify, and similar relation-satellites will be disregarded, since their role is to change the opinion of the user in favour of the course material.

3.5 Vertical filter – Time available

If time_available = “a little bit of time”
Then discard all the relation-satellites;
If time_available = “quite some time”
Then discard exercise relation-satellite;
If time_available = “enough time”
Then no relation-satellite is discarded;

Rationale: Time availability is a crucial user aspect. If the user is in a rush or has little time, the system has to provide only the most elementary information. In such a case only nuclei will be generated. If the user has a bit more time, but not much, exercises are not offered, since they are usually quite time-consuming and require active participation by the user. Finally, if the user has plenty of time, all the additional information is delivered.
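The five rules in sections 3.1 to 3.5 can be collected into one lookup table, sketched below. The table layout, the `ALL` marker and the function names are hypothetical; the three elaboration types are collapsed into a single "elaboration" label for brevity, which is a simplification of the rule in section 3.2.

```javascript
const ALL = "*"; // marker meaning "discard every relation-satellite"

// For each user aspect and value: the relation-satellites to discard.
const DISCARD_RULES = {
  level_expertise: {
    "null": [],
    "basic": [],
    "medium": ["example", "exercise", "background", "preparation"],
    "high": ["example", "exercise", "background", "preparation"],
  },
  reason_to_read: {
    "to get an idea": ["exercise", "elaboration"],
    "to get deep into it": [],
  },
  job_studies: {
    "not related subject": [],
    "related subject": ["background", "preparation"],
  },
  opinion_motivation: {
    "against": [],
    "without an opinion or motivation": [],
    "in favour": ["motivate", "antithesis", "concession", "justify"],
  },
  time_available: {
    "a little bit of time": [ALL],
    "quite some time": ["exercise"],
    "enough time": [],
  },
};

// Apply the rule for one user aspect; nuclei are never discarded.
function applyFilter(segments, aspect, value) {
  const byValue = DISCARD_RULES[aspect];
  if (!byValue || !Object.prototype.hasOwnProperty.call(byValue, value)) {
    throw new Error(`Unknown user aspect or value: ${aspect} = ${value}`);
  }
  const discard = byValue[value];
  return segments.filter(
    (segment) =>
      segment.role === "nucleus" ||
      !(discard.includes(ALL) || discard.includes(segment.relation))
  );
}

// Cascade: the output of one filter is the input of the next.
function verticalFiltering(segments, profile) {
  return Object.entries(profile).reduce(
    (remaining, [aspect, value]) => applyFilter(remaining, aspect, value),
    segments
  );
}
```

In this sketch, a new discrete value is one more table entry and a new user aspect is one more table key, so the filtering functions themselves stay unchanged.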

3.6 Final comments on vertical filters

Cascading filters apply to the relation-satellites that are still active after the previous phases in the generation process. When a vertical filter (say, filter 3) tries to discard a relation-satellite that was already removed by an earlier filter (2 or 1), there is nothing to act upon, but this has no consequence, since the CSA simply continues the filtering process on the remaining text. Thus, the order in which the vertical filters are applied is not relevant.
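The order-independence claim can be checked on a small example. The sketch assumes, as in the rules above, that each filter decides on a segment only from that segment's own relation; `makeFilter` and the sample segments are hypothetical.

```javascript
// A filter is modelled by the list of relations it discards (an assumed model).
const makeFilter = (discarded) => (segments) =>
  segments.filter(
    (segment) => segment.role === "nucleus" || !discarded.includes(segment.relation)
  );

const expertiseFilter = makeFilter(["example", "exercise", "background", "preparation"]);
const backgroundFilter = makeFilter(["background", "preparation"]);

const sampleSegments = [
  { role: "nucleus" },
  { role: "satellite", relation: "preparation" },
  { role: "satellite", relation: "background" },
  { role: "satellite", relation: "elaboration" },
  { role: "satellite", relation: "example" },
];

// Apply the two filters in both orders.
const expertiseFirst = backgroundFilter(expertiseFilter(sampleSegments));
const backgroundFirst = expertiseFilter(backgroundFilter(sampleSegments));

// Both orders keep the same segments: the nucleus and the elaboration satellite.
const sameResult = JSON.stringify(expertiseFirst) === JSON.stringify(backgroundFirst);
```

Because every filter only removes segments, and the decision for each segment does not depend on the others, applying a sequence of filters removes exactly the union of what each would remove alone, whatever the order.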

After the filtering process has been successfully completed, there is still a final presentation task. A good presentation is, in our opinion, one that will provide the student with an optimal version of the document to read, understand and fruitfully assimilate its content.

4 Implementation

The JavaScript code manages the user aspects (one of the inputs of the algorithm) and the application of the cascading filters (the CSA).

Depending on the user aspects given by the user, the variables sXSL1 to sXSL5 take the value of the filter to be applied for each user aspect (level of expertise, reason to read, background, opinion or motivation and time available).

The sResult variable contains the XML file, whose content changes after each filter is applied. Table 3 shows the code that executes a filter.

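// Load the current intermediate XML document.
objData.loadXML(sResult);
// Load the XSL filter selected for the first user aspect.
objStyle.load(sXSL1);
// Apply the filter; the result is the input of the next filter.
sResult = objData.transformNode(objStyle);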

Table 3: Javascript implementation
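Repeating the three statements of Table 3 for sXSL1 to sXSL5 amounts to a loop, sketched below. This is not the authors' code: the function name `applyCascade` and the `filterPaths` array are hypothetical, and `objData` and `objStyle` are only assumed to expose the `loadXML`, `load` and `transformNode` methods used in Table 3.

```javascript
// Apply each XSL filter in turn; the output of one is the input of the next.
// objData and objStyle are assumed to behave like the objects in Table 3.
function applyCascade(objData, objStyle, sXml, filterPaths) {
  let sResult = sXml;
  for (const sXSL of filterPaths) {
    objData.loadXML(sResult); // current intermediate document
    objStyle.load(sXSL); // filter for the next user aspect
    sResult = objData.transformNode(objStyle);
  }
  return sResult;
}
```

A call such as `applyCascade(objData, objStyle, sResult, [sXSL1, sXSL2, sXSL3, sXSL4, sXSL5])` would then run the whole vertical filtering phase in one step.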

XSL filters pass on (or not) one element to the following vertical filter depending on the rules described before. Table 4 shows how this is done with the relation-satellite BACKGROUND.

<xsl:template match="BACKGROUND">
  <xsl:copy>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>

Table 4: XSL implementation

5 Experimentation

The objective of the experiment is to validate the hypothesis expressed in the filtering rules and the actual filtering mechanism of the CSA.

Several ideas are taken into consideration in this respect, but we are aware that users (students, professors and other scholars) are the final judges. Their assessment of the system will depend on whether the generated documents meet their information requirements, providing them with just the right type and amount of information.

Conclusions

In the tests conducted so far, the CourseViewGenerator is functioning correctly. One feature worth considering is the scalability of the filtering mechanism. We anticipate two types of expansion of the system: (1) increasing the size of the corpus by including more subjects and master documents, and (2) augmenting the user model by adding user aspects or by adding more parameters to the existing user aspects.

The first type of expansion will not require any alteration of the CSA as long as the added document tokens conform to the existing DTD and our RST model. In order to increase the size of the corpus, it will be necessary to annotate XML discourse-tree metadata manually. This is a complex and time-consuming task (as has been noted by Carlson and Marcu, 2001). Future research activities should focus on helping automate the annotation process, for example using cue phrases à la Knott (Knott 1995; Alonso and Castellón, 2001).

The second type of expansion requires only the elaboration of additional XSL filters. Adding new values to existing user aspects requires only the modification of the corresponding XSL filter. Any of these last two operations can be incorporated easily. Therefore, adding a new user aspect or a new discrete value does not increase in any substantial way the complexity of the system.

Acknowledgements

This research was partly supported by the Basque Government (XML-Bi: multilingual document flow management procedures using XML/TEI-P3, PI1999-72 project).


References

Alonso, L. and Castellón, I. (2001) Towards a delimitation of discursive segment for Natural Language Processing applications. First International Workshop on Semantics, Pragmatics and Rhetoric. Donostia (Spain), pp. 45-52.

Barrutieta, G. (2001) Generador inteligente de documentos de formación. Virtual Educa 2001, Madrid (Spain), pp. 256-261.

Barrutieta, G., Abaitua, J. and Díaz, J. (2001) Gross-grained RST through XML metadata for multilingual document generation. MT Summit VIII. Santiago de Compostela (Spain), pp. 39-42.

Barrutieta, G., Abaitua, J. and Díaz, J. (2002) User modelling and content selection for multilingual document generation. Unpublished; currently being evaluated for publication.

Carlson, L. and Marcu, D. (2001) Discourse tagging manual. Technical report ISI-TR-545. ISI Marina del Rey (USA).

Grosz, B. and Sidner, C. (1986) Attention, intentions and the structure of discourse, Computational Linguistics, 12:175-204.

Hirst, G., DiMarco, C., Hovy E. & Parsons K. (1997) Authoring and Generating Health-Education Documents That Are Tailored to the Needs of the Individual Patient. Proceedings of the Sixth International Conference. UM97. Vienna (NY-USA), pp. 107-118.

Hovy, E. & Maier, E. (1997) Parsimonious or profligate: how many and which discourse structure relations? <http://citeseer.nj.nec.com/hovy97parsimonious.html>

Knott, A. (1995) A Data-Driven Methodology for Motivating a Set of Coherence Relations, Ph.D. thesis, University of Edinburgh, Edinburgh (UK).

Mann, W.C., and Thompson, S.A. (1988) Rhetorical Structure Theory: A theory of text organization. Tech. Rep. RS-87-190. Information Sciences Institute. Los Angeles, CA.

Marcu, D. (1997) From local to global coherence: a bottom-up approach to text planning, in Proceedings of AAAI-97, American Association for Artificial Intelligence, pp.629-635.

McKeown, K. (1985) Text Generation: Using Discourse Strategies and Focus Constraints to Generate Natural Language Text, Cambridge University Press.

Mellish, C., M. O'Donnell, J. Oberlander and A. Knott (1998) An architecture for opportunistic text generation. Proceedings of the Ninth International Workshop on Natural Language Generation, Niagara-on-the-Lake, Ontario, Canada, pp. 28-37.

Moore, J. and Paris, C. (1993) Planning texts for advisory dialogues: capturing intentional and rhetorical information, Computational Linguistics, 19.

Reiter, E. and Dale, R. (2000) Building applied natural language generation systems. Cambridge University Press (UK).


A brief introduction to the GeM annotation schema for complex document layout

John Bateman
University of Bremen
Bremen, Germany

Renate Henschel
University of Stirling
Stirling, Scotland

Judy Delin
University of Stirling and Enterprise IDU
Newport Pagnell

Abstract

In this paper we sketch the design, motivation and use of the GeM annotation scheme: an XML-based annotation framework for preparing corpora involving documents with complex layout of text, graphics, diagrams, layout and other navigational elements. We set out the basic organizational layers, contrast the technical approach with some other schemes for complex markup in the XML tradition, and indicate some of the applications we are pursuing.

1 Introduction

In the GeM project (“Genre and Multimodality”: http://www.purl.org/net/gem) [1] we are investigating the relationship between different document genres and their potential realizational forms in combinations of text, layout, graphics, pictures and diagrams. The central focus of the project is to develop a theory of visual and textual page layout in electronic and paper documents that includes adequate attention to local and expert knowledge in information design. By analysing resources across visual and verbal modes, we aim to reveal the purpose of each in contributing to the message and structure of the communicative artefact as a whole. We see it as crucial, however, that research of this nature be placed on as solid an empirical basis as has become common in other areas of linguistic inquiry. The data basis for many of the claims made in this area hitherto has been far too narrow: the provision of suitable corpus materials is therefore fundamental.

For such an enterprise to succeed, it is essential to obtain or construct a structured set of data on which to base the analysis; that is, in the words of Gunther Kress, a leading researcher in the area of multimodal meaning (cf., e.g., Kress and Van Leeuwen (2001)): we need “to turn stuff into data”. In our case, the “stuff” consists of raw page-based information presentations, such as illustrated books, newspapers (print and online versions), instructional texts, manuals, and so on; the “data” is then highly structured re-representations of these documents that bring out parallel but interrelated dimensions of organization crucial for the total effect, or meaning, of the ‘page’.

[1] The GeM project is funded by the British Economic and Social Research Council, whose support we gratefully acknowledge. We also thank the anonymous reviewers for this workshop for some very useful comments.

Although it is (a) widely accepted that data for designing and improving natural language processing can best be made available in the form of structured, standardized annotated corpora and (b) increasingly accepted that such data should stretch to include more than the traditional concerns of linguistics (i.e., speech and plain text data) and take in more visually challenging presentations, movements in this direction have to date been very limited. The GeM annotation scheme is being developed in order to support analyses of the broader range of layout-text-graphical interactions that is commonplace in professionally designed documents. We are currently annotating an exploratory corpus in order to bring out the complex interrelationships that can be observed within page-based information delivery.

2 Annotation content

The starting basis for our annotation draws on some detailed non-computational accounts of the organization of multimodal pages/documents, most specifically the seminal account of constraints on document design by Waller (1987), and on exploratory computational accounts such as the layout structures introduced by Bateman et al. (2001). This organization reflects both artefact-internal considerations such as the layout, text and graphics, as well as artefact-external considerations such as design decisions, production constraints (e.g., cost), and artefact constraints (i.e., the limited size of a piece of paper contrasted with the theoretically unbounded scrollable window on a computer screen). These external considerations are often connected. The ‘ideal’ layout of information on a page might as a consequence never occur: it must be ‘folded in’ to the structures afforded by the artefact, and labelled and arranged according to the structures required for access.

In order to pick apart and explicitly represent the strands of meaning that we believe play a crucial role in multimodal page-based document design, we require several orthogonal layers of annotation. We claim that these levels are the minimum necessary for revealing accounts of the operation of the kinds of visual artifacts being gathered in our corpus; we expect further layers to be added. Indeed, we consider it a crucial design feature that the annotation layers adopted be additive and open rather than excluding and closed. The layers at the focus of attention within the current phase of the GeM enterprise are:

• Rhetorical structure: the rhetorical relationships between content elements; how the content is ‘argued’;

• Layout structure: the nature, appearance and position of communicative elements on the page;

• Navigation structure: the ways in which the intended mode(s) of consumption of the document is/are supported.

In addition to these layers, we then need an explicit representation of constraints that range freely over the layers and which relate design decisions to document types, or genres. Further constraints that are known to determine document design include: canvas constraints, arising out of the physical nature of the object being produced (e.g., page or screen, fold-geometry in leaflets, and so on); production constraints, arising out of the production technology; and consumption constraints, arising out of the time, place, and manner of acquiring and consuming the document. Further details and background for our approach to document design and description are given in Delin et al. (2002).

Our corpus needs to contain information about each of these contributing sources of constraint in a way that supports empirical investigation. Our hypothesis, following Waller (1987), is that not only is it possible to find systematic correspondences between these layers, but also that those correspondences themselves will depend on specifiable aspects of their context of use. But to verify (or otherwise) this hypothesis, the data gathering and annotation must come first. And this leads directly to some important technical issues, since the structures induced by these layers of constraint can be highly divergent and need to be mapped onto one another with extreme flexibility. The well-known corpus annotation problem of intersecting hierarchies therefore arises here with considerable force.

3 Technical approach

Our approach to implementing the required multiple layer annotation scheme is to adopt multiple level ‘stand-off’ or ‘remote’ annotations similar to those suggested by Thompson and McKelvie (1997) or the Corpus Encoding Standard (e.g., CES, 1999: Annex 10). For each document to be included in the corpus, therefore, we create a ‘base level’ document whose purpose is to provide a common set of units to which all subsequent stand-off levels refer. These base level units range over textual, graphical and layout elements and give a comprehensive account of the material on the page, i.e., they comprise everything which can be seen on the page/pages of the document, including: orthographic sentences; sentence fragments initiating a list; headings, titles and headlines; photos, drawings, diagrams and figures (without caption); captions of photos, drawings, diagrams and tables; text in photos, drawings and diagrams; icons; table cells; list headers, list items and list labels (itemizers); items in a menu; page numbers; footnotes (without footnote label) and footnote labels; running heads; emphasized text; horizontal or vertical lines which function as delimiters between columns or rows; and lines, arrows and polylines which connect other base units. Each such element is marked as a base unit and receives a unique base unit identifier.

The more abstract annotation levels may then group base units as required; these groupings must again be very flexible. For example, it is quite possible that non-consecutive base units need to be grouped (and that differing non-consecutive base units will be grouped within differing annotation layers). Each of the more abstract layers is represented formally as a further structured XML specification whose precise informational content and form is in turn defined by an appropriate Document Type Definition (DTD; see footnote 2). Each layer defines a particular structural view of the original document. The markup for a single document then consists minimally of the following four inter-related layers:

Name              Content
GeM base          base units
RST base          rhetorical structure
Layout base       layout properties and structure
Navigation base   navigation elements and structure

All information apart from that of the base level is expressed in terms of pointers to the relevant units of the base level. This stand-off approach to annotation readily supports the necessary range of intersecting, overlapping hierarchical structures commonly found in even the simplest documents.
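Why pointers into a shared base level cope with intersecting hierarchies can be illustrated with a minimal Python sketch. The group names and base unit identifiers below are hypothetical and are not taken from the GeM corpus; the only idea borrowed from the scheme is that each layer groups base unit identifiers independently of the others.

```python
# Hypothetical stand-off layers: each group is just a set of base unit ids.
layout_groups = {
    "lay-a": {"u-1", "u-2", "u-3"},
    "lay-b": {"u-4", "u-5"},
}
rhetorical_groups = {
    "seg-x": {"u-1", "u-2"},   # nested inside lay-a
    "seg-y": {"u-3", "u-4"},   # straddles lay-a and lay-b
    "seg-z": {"u-5"},          # nested inside lay-b
}


def crossing_pairs(layer_a, layer_b):
    """Return pairs of groups that overlap without one containing the other."""
    pairs = []
    for name_a, units_a in sorted(layer_a.items()):
        for name_b, units_b in sorted(layer_b.items()):
            shared = units_a & units_b
            nested = units_a <= units_b or units_b <= units_a
            if shared and not nested:
                pairs.append((name_a, name_b))
    return pairs


crossings = crossing_pairs(layout_groups, rhetorical_groups)
print(crossings)
```

Groupings that cross in this way could not be nested inside one another in a single XML tree, whereas two separate layers that merely point at the same base units can hold them side by side.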

The relationship of the differing annotation levels to the base level units is depicted graphically in Figure 1. This shows that base units (the central column) provide the basic vocabulary for all other kinds of units and can, further, be cross-classified.

Space precludes a detailed account of the organization of all the levels of the annotation scheme. Instead we select some examples of an annotated document at each layer of annotation to give an indication of the annotation scheme in action. For further technical details and specifications of the annotation scheme, the interested reader is referred to the technical manual (Henschel, 2002). For ease of exposition, we will draw most of our examples from the annotation of the page shown in Figure 2. This page has the advantage that it is relatively straightforward; more complex examples can be found on the corpus webpages.

[Diagram labels: base units; layout units; RST segments; navigational elements; Layout; Semantic Content]

Figure 1: The distribution of base elements to layout, rhetorical and navigational elements

Figure 2: Example page: Flegg, J. (1999) Collins Gem Birds. Italy: Harper Collins. p21. Used by kind permission of the publisher.

2 For the DTDs themselves, as well as further information and examples, see the GeM corpus webpages.

The base unit annotation (GemBase) for an extract from the centre of the page is as follows:

<unit id="u-21.5">---------------</unit>
<unit id="u-21.6" src="gannet.jpg" alt="gannet-photo"/>
<unit id="u-21.7">
  Huge (90cm) unmistakable seabird.
</unit>
<unit id="u-21.8">
  Watch for white, cigar-shaped body and
  long straight, slender, black-tipped wings.
</unit>
<unit id="u-21.9">
  In summer, yellow head of
  adult inconspicuous.
</unit>
<unit id="u-21.10">
  Plunges spectacularly for fish.
</unit>
<unit id="u-21.11">Sexes similar.</unit>
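As a rough illustration of how such a base level might be consumed, the following Python sketch indexes base units by their identifier using the standard library's ElementTree. The `<gem-base>` wrapper element and the function name `index_base_units` are assumptions introduced here so that the fragment parses as one document; they are not part of the GeM specification, and the extract is abridged.

```python
import xml.etree.ElementTree as ET

# Abridged base level extract. The <gem-base> wrapper is an assumption added
# only so that the fragment forms a single well-formed document.
GEM_BASE = """
<gem-base>
  <unit id="u-21.6" src="gannet.jpg" alt="gannet-photo"/>
  <unit id="u-21.7">Huge (90cm) unmistakable seabird.</unit>
  <unit id="u-21.10">Plunges spectacularly for fish.</unit>
  <unit id="u-21.11">Sexes similar.</unit>
</gem-base>
"""


def index_base_units(xml_text):
    """Map each base unit id to its text, or to its src for graphical units."""
    index = {}
    for unit in ET.fromstring(xml_text.strip()).iter("unit"):
        # itertext() also collects text from any nested base units.
        text = " ".join("".join(unit.itertext()).split())
        index[unit.get("id")] = text or unit.get("src", "")
    return index


base_units = index_base_units(GEM_BASE)
print(base_units["u-21.7"])
```

An index of this kind is what the other layers would consult when following their pointers back to the page material.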

Although the base annotation generally has a flat structure, in certain cases we diverge from this and allow nested markup, i.e., base units inside base units. This is used in the following situations: emphasized text portions in a sentence/heading, icons or similar pictorial signs in a sentence, text pieces in a diagram or picture, arrows and other graphical signs in a diagram or picture, and document deictic expressions occurring in a sentence.

The layout base then consists of three main parts: (a) layout segmentation – identification of the minimal layout units, (b) realization information – typographical and other layout properties of the basic layout units, and (c) the layout structure information – the grouping of the layout units into more complex layout entities. Whereas in typography, the minimal layout element (in text) is the glyph, here we are concerned with groupings of base units that have a visual coherence and unity with respect to the organisation of the page: these groupings are termed layout units and, unlike the base units, are organized into a non-trivial hierarchical structure as required to describe the page. Again, each layout-unit has an id attribute, which carries an identifying symbol; in addition, however, the stand-off annotation is achieved via an attribute xref which points to the base units which belong to that layout unit. It is possible, but not necessary, to store the corresponding text portions of the original text file between the start and end tag of a layout-unit for mnemonic purposes; this text information is not used in any further processing. The following extract shows the layout unit corresponding to the main block of text underneath the gannet photo.

<layout-unit id="lay-flegg-text"
             xref="u-21.7 u-21.8 u-21.9 u-21.10 u-21.11">
  Huge (90cm) unmistakable seabird.
  Watch for white, cigar-shaped body
  and long straight, slender,
  black-tipped wings. In summer, yellow
  head of adult inconspicuous. Plunges
  spectacularly for fish. Sexes similar.
</layout-unit>
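The stand-off step itself, following xref from a layout unit to the base units it groups, might be sketched in Python as below. This is an illustration under stated assumptions rather than the project's own tooling: the xref list is abridged to three units, and the dictionary `BASE_TEXT` and the function `resolve` are names introduced here.

```python
import xml.etree.ElementTree as ET

# Base unit texts keyed by id (abridged to three units for brevity).
BASE_TEXT = {
    "u-21.7": "Huge (90cm) unmistakable seabird.",
    "u-21.10": "Plunges spectacularly for fish.",
    "u-21.11": "Sexes similar.",
}

# A layout unit without mnemonic text: only the pointers matter.
LAYOUT_UNIT = '<layout-unit id="lay-flegg-text" xref="u-21.7 u-21.10 u-21.11"/>'


def resolve(layout_unit_xml, base_text):
    """Follow the whitespace-separated xref pointers into the base level."""
    element = ET.fromstring(layout_unit_xml)
    return [base_text[ref] for ref in element.get("xref").split()]


resolved = resolve(LAYOUT_UNIT, BASE_TEXT)
print(" ".join(resolved))
```

Because the text is recovered through the pointers, any mnemonic copy stored between the layout-unit tags can be ignored, as the scheme intends.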

The second part of the layout base is the realization. Each layout unit specified in the layout segmentation has a visual realization. The most apparent difference is which mode has been used – the verbal or the visual mode. Following this distinction, the layout base differentiates between two kinds of elements: textual elements and graphical elements, marked with the tags <text> and <graphics> respectively. These two elements have differing sets of attributes describing their layout properties. The attributes are generally consistent with the layout attributes defined for the XSL formatting object and CSS layout models. The id of each layout unit of the segmentation part of the layout base has to occur exactly once under xref in the realization part: either in a <text> or a <graphics> element. In the following coding example, we have five layout units which share typographical characteristics. These correspond to the five table cells in the first column of the table on the page shown in Figure 2.

<text xref="lay-21.12 lay-21.14 lay-21.16 lay-21.18 lay-21.20"
      font-family="sans-serif"
      font-size="10" font-style="normal"
      font-weight="bold"
      case="mixed" justification="right"
      color="black"/>
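The requirement that every layout unit id occurs exactly once under xref in the realization part lends itself to a mechanical check. The Python sketch below is one possible way to do this; the wrapper and section element names (`<layout-base>`, `<segmentation>`, `<realization>`), the base unit ids in the xref attributes of the layout units, and the function name are all assumptions made for the sake of a self-contained example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature layout base; wrapper and section names are assumed.
LAYOUT_BASE = """
<layout-base>
  <segmentation>
    <layout-unit id="lay-21.12" xref="u-21.12"/>
    <layout-unit id="lay-21.14" xref="u-21.14"/>
    <layout-unit id="lay-21.6" xref="u-21.6"/>
  </segmentation>
  <realization>
    <text xref="lay-21.12 lay-21.14" font-family="sans-serif" font-weight="bold"/>
    <graphics xref="lay-21.6"/>
  </realization>
</layout-base>
"""


def realization_errors(xml_text):
    """List layout unit ids not realized exactly once by <text> or <graphics>."""
    root = ET.fromstring(xml_text.strip())
    counts = Counter()
    for tag in ("text", "graphics"):
        for element in root.iter(tag):
            counts.update(element.get("xref", "").split())
    unit_ids = [unit.get("id") for unit in root.iter("layout-unit")]
    return [uid for uid in unit_ids if counts[uid] != 1]


errors = realization_errors(LAYOUT_BASE)
print(errors)
```

An empty list means the constraint holds; a unit that is missing from the realization part, or realized twice, is reported by id.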

The third part of the layout base then serves to represent the hierarchical layout structure. Generally we assume that the layout structure of a document is tree-like with the entire document being the root; this will certainly be problematic for some document types but also has sufficient applicability to enable us to make considerable headway. Each layout chunk is a node in the tree, and the basic layout units, which have been identified in the segmentation part of the layout base, are the terminal nodes of that tree. In our annotation, we use several different tags for the nodes in the layout tree. The three most common are: <layout-root>, the element describing the entire document; <layout-chunk>, all non-terminal nodes in the layout tree except for the root; and <layout-leaf>, the terminal nodes. A slightly simplified (i.e., some further substructure is omitted) extract of the layout structure for our example page is depicted graphically in Figure 3; it is described by the following XML annotation:


page-21
  header-21
  body-21
    lay-21.2
    lay-21.3
  page-no-21

Figure 3: Example page layout structure shown graphically

<layout-root id="page-21">
  <layout-leaf xref="header-21"/>
  <layout-chunk id="body-21">
    <layout-leaf xref="lay-21.2"/>
    <layout-leaf xref="lay-21.3"/>
  </layout-chunk>
  <layout-leaf xref="page-no-21"/>
</layout-root>
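To show how the tree can be read back, the following Python sketch walks the layout structure above and lists each terminal node together with the nodes that enclose it. The function name `leaves_with_paths` is introduced here for illustration only.

```python
import xml.etree.ElementTree as ET

LAYOUT_STRUCTURE = """
<layout-root id="page-21">
  <layout-leaf xref="header-21"/>
  <layout-chunk id="body-21">
    <layout-leaf xref="lay-21.2"/>
    <layout-leaf xref="lay-21.3"/>
  </layout-chunk>
  <layout-leaf xref="page-no-21"/>
</layout-root>
"""


def leaves_with_paths(node, path=()):
    """Yield (leaf xref, ids of enclosing nodes) in document order."""
    if node.tag == "layout-leaf":
        yield node.get("xref"), path
        return
    here = path + (node.get("id"),)
    for child in node:
        yield from leaves_with_paths(child, here)


root = ET.fromstring(LAYOUT_STRUCTURE.strip())
leaf_paths = list(leaves_with_paths(root))
for xref, ancestors in leaf_paths:
    print(xref, "<-", " / ".join(ancestors))
```

The leaves carry only xref pointers, so the same traversal works however much material the referenced layout units contain.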

Whereas the annotations so far specify the hierarchical structuring of a page into visually distinguishable and groupable layout units, we need also to record specific information about how layout units are placed on their pages: the page or page segment layout is not fully determined by grouping layout units into a tree structure since further information is required about the actual position of each unit in the document (on or within its page). For this, we introduce an area model, which serves to determine the position of each layout-chunk/layout-leaf in an abstract, but fully explicit, way. The area model is a generalization of common notions of ‘page models’. Each page usually partitions its space into sub-areas and these can be used for positioning or aligning layout units or subtrees of the layout structure. For instance, a page is often designed in three rows – the area for the running head (row-1), the area for the page body (row-2), and the area for the page number (row-3) – which are arranged vertically. The page body space can itself consist of two columns arranged horizontally. These rows/columns need not be of equal size. For the present, we restrict ourselves to rectangular areas and sub-areas, and allow recursive area subdivision. The partitioning of the space of the entire document is defined in the area-root, which structures the document (page) into rectangular sub-areas in a table-like fashion.

The tag to represent the area root is <area-root>. The tag to represent the division of a sub-area into smaller rectangles is <sub-area>; this shares the attributes of the root but adds a location attribute so that sub-areas are positioned relative to their parent. Locations are indicated with respect to a logical grid defining rows and columns. The area model for our example page therefore contains a single column with 5 rows (the header, the photograph, the text block, the table, and the footer), in which the fourth row, the table, is itself made up of a sub-area corresponding to the rows and columns of a virtual table. This is captured by the following annotation:

<area-root id="page-frame" cols="1" rows="5"
           hspacing="100" vspacing="5 40 15 45 5"
           height="16cm" width="14cm">
  <sub-area id="table-frame" location="row-4"
            cols="2" rows="5" hspacing="10 90"
            vspacing="100"/>
</area-root>

The attribute vspacing="5 40 15 45 5" means that the area for the running head takes 5% of the entire page height, the area for the next row 40%, etc. The area model then provides logical names for the precise positioning of the layout units identified in the hierarchical layout structure. This makes it straightforward to indicate, for example, that collections of siblings in the layout structure share (or fail to share) some particular alignment properties within the page.
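How the logical grid translates into concrete row heights can be sketched in Python as follows. One caveat: the vspacing values as printed add up to 110 rather than 100, so this sketch treats them as relative weights and normalizes by their sum. That normalization, like the function name `row_heights_cm`, is an assumption made here and not something the annotation scheme is known to prescribe.

```python
import xml.etree.ElementTree as ET

AREA_ROOT = """
<area-root id="page-frame" cols="1" rows="5"
           hspacing="100" vspacing="5 40 15 45 5"
           height="16cm" width="14cm">
  <sub-area id="table-frame" location="row-4"
            cols="2" rows="5" hspacing="10 90" vspacing="100"/>
</area-root>
"""


def row_heights_cm(area):
    """Split the area height among its rows in proportion to vspacing."""
    shares = [float(value) for value in area.get("vspacing").split()]
    height_text = area.get("height")
    if not height_text.endswith("cm"):
        raise ValueError("this sketch only handles heights given in cm")
    height = float(height_text[:-2])
    total = sum(shares)  # assumption: shares are weights, normalized here
    return {
        f"row-{i}": round(height * share / total, 2)
        for i, share in enumerate(shares, start=1)
    }


page = ET.fromstring(AREA_ROOT.strip())
heights = row_heights_cm(page)
table = page.find("sub-area")
print(table.get("id"), "occupies", table.get("location"),
      "with height", heights[table.get("location")], "cm")
```

The location attribute of the sub-area then serves as a key into the computed rows, which is the sense in which the area model supplies logical names for positions.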

The RST base presents the rhetorical structure of the document. The rhetorical structure is annotated following the Rhetorical Structure Theory (RST) of Mann and Thompson (1988). The relation between rhetorical structure and layout is currently an important area of study in multimodal document description and so its inclusion in the GeM annotation scheme is essential. Several annotation schemes for RST have been proposed in the literature (cf. Daniel Marcu and Mick O’Donnell’s proposals, e.g., www.sil.org/~mannb/rst/toolnote.htm and the RAGS rhetorical level: Cahill et al. (2001)). The precise GeM notation differs from these in certain respects, but the main principles of representation remain similar. Since many of the details of the rhetorical annotation presume some familiarity with the decomposition of texts according to RST, we will note here simply that this annotation layer again represents a hierarchical decomposition in which the leaves of the tree can correspond to both textual and graphical base units.

Finally, the navigation base captures those parts of the document/page which tell the reader where the current text, or ‘document thread’, is continued or which point to an alternative continuation or continuations. These make up the navigation layer of annotation. The addresses used by such pointers are either names of RST spans or names of layout chunks. For long-distance navigation, typical nodes in the RST structure and in the layout structure have been established for use in pointers; in particular, chapter/section headings are names for RST spans and page numbers are names for page-sized layout-chunks. The structure imposed by the navigational elements is thus quite different from the preceding layers and can freely cross-cut the hierarchies expressed there. Again, for further details, the reader is referred to the documentation manual.

4 Comparison with other XML-based approaches

It is useful to consider other approaches to representing ‘overlapping hierarchies’ that have been proposed in the XML literature; since these are early days in the construction of such annotation schemes, it is likely that the experience gained with differing schemes will prove highly beneficial for further development.

The first examples of extensive overlapping hierarchies within markup for NLP are probably to be found in speech corpora. It is clear that intonational phenomena, for example, may or may not respect grammatical or other kinds of structure and so need to be maintained separately. Speech-oriented corpora generally use the time line as a basic reference method since speech events are necessarily strictly ordered in time. This is quite different in the GeM case, where we have found that the non-linearity and the non-consecutive nature of the units grouped within our annotation scheme present a major problem for annotation models that have been developed in the speech processing tradition, where contiguity of units is the expected case. Whereas the speech signal can be encoded by means of time-stamps, in the GeM model we need to use the layout structure (or even the area model within the layout structure) instead for placing elements within the physical document.

One of the most detailed general considerations of the range of XML-based solutions to the multiple, overlapping hierarchies problem, as well as an extensive listing of further literature, is given by Durusau and O’Donnell (submitted). After reviewing several requirements and approaches to the problem, Durusau and O’Donnell propose a Bottom-Up Virtual Hierarchy approach which (i) creates multiple instances of the source document, each with its own consistent XML-defined hierarchy and (ii) creates one further document instance called the ‘base file’, in which each basic element of the document is linked via XPath expressions to its position in each of the separately defined hierarchy documents. The position in each separate hierarchy is captured as the value of a distinct attribute to a base element tag. This means that the base file becomes extremely complex, although Durusau and O’Donnell envisage that this file could in future be automatically constructed and maintained.

The similarity of this approach to the independently developed GeM approach argues to a certain extent for the necessity of proceeding in this direction. The differences between the approaches arise from the different tasks that are being considered and the corresponding differences in emphasis. Durusau and O’Donnell are considering the task from the perspective of differing ‘interpretations’ of a text in the sense more commonly pursued in text corpus markup: it is, therefore, natural to consider each hierarchy as an autonomous marked-up document. We are concerned with linguistic analyses of documents, where the structure of the analyses themselves is itself a major focus of attention. It is then no longer so important that each level of analysis transparently represents a ‘view of the text’. When querying the corpus for structural and realizational regularities, we are free to do this across any set of the annotation layers present, using the full power of properly parsed XPath expressions, and without the need to always decompose such queries into the terms of the elements of the base file. This is the XML reflex of the linguistic strategy of stratification of linguistic information across the linguistic system.

Single structured text views can, of course, be created out of the GeM markup by following the indirection links present in any individual GeM annotation layer. This requires that that layer be interpreted with respect to the corresponding base level document and is not then ‘locally’ complete. This (formally slight) complexity is, we feel, more than balanced by the fact that we need neither any additional complexity in our base level markup nor the double coding of the position of nodes in hierarchies and base elements. It is also straightforward to introduce intermediate levels of structural analysis, as, for example, already done in our selection of base units as units relevant to layout rather than ‘words’ or other more straightforward formal units. Indeed, when the more traditional linguistic levels of annotation (syntax, semantics) are added into the complete scheme, many of the GeM annotation layers will continue not to access the leaves of these structures at all, and will continue to work with the base units illustrated here; this variable granularity may prove to be a general requirement for linguistically complex analyses. And, finally, we also need not recompile our base level document whenever an additional layer of annotation is added to the scheme, thereby simplifying maintenance. Further comparison of the approaches will, however, require more detailed evaluation in use.

Finally, we can consider the GeM approach as contrasted with directions within the XML community itself since there, too, there are proposals for capturing distinctions of content, layout (e.g., in terms of formatting objects: XSL-FO) and navigational elements (e.g., in terms of XLink). Whereas we are attempting to make the GeM description as compatible with these constructs as possible (for example, as noted above with respect to the realizational possibilities for the layout units), it is important to understand the very different aims involved. The purpose of the GeM project is to analyse the multimodal decisions made in a wide range of document types and it is not yet clear which theoretical levels and which theoretical constructs within those levels will prove appropriate. The formatting object description is only suited to a certain class of layout types (which excludes, for example, newspapers) and so is in many respects too specific for our more exploratory purposes. We are also searching for effective levels of abstraction at which to characterize our data: effective here meaning that these will be the constructs over which canvas constraints, production constraints and consumption constraints are most appropriately expressed. Perhaps, to offer an NLP analogy: whereas the XML modelling decisions correspond to a fine-scaled phonetic description of a language event, we are in the GeM project searching for the higher levels of abstraction corresponding to the grammar, semantics and pragmatics of the language events. We expect this to give us a substantially better theoretical grasp of the meaning-making potential of layout decisions and their control by external constraints.

5 Applications of the annotated corpus

A number of uses are currently being made of the annotated GeM corpus. While our empirical study will need considerably more data to be encoded before we can make reliable statements concerning the patterning of various constraints with document decisions, we have already been struck by the rather wide variation that exists within single documents between selected layout structures on the one hand and rhetorical organization on the other. In surprisingly many cases, this variation goes beyond what might be considered ‘good’ design: in fact, we would argue that most such designs are flawed and would be improved by a more explicit attention to the rhetorical force communicated by particular layout decisions. This represents the use of the corpus for document critique and improvement (cf. Delin and Bateman (2002)).

We are also using the data gathered so far to inform the design of a prototype automatic document generation system capable of producing the kinds of variation and layout forms seen in our corpus. In this work, the annotation scheme provides skeletal data structures that define target formats of various stages of the generation process. Thus, for example, layout planning needs to produce a structure that is an instantiated version of a layout structure as we have defined it above. Some first results are reported in Henschel et al. (to appear), which describes a prototype implemented as a set of XSLT transformations that convert a content representation into an XSL-FO document. The transformations are conditionalized so as to respond to various features of the content, the modes of the available material, and the rhetorical structure; so far pages generated include further examples of the kind used as an example in this paper as well as pages from instructional texts such as telephone user guides.

Conditionalization is expressed in terms of XPath specifications that check the presence or absence of particular configurations within any of the GeM annotation layers as required. Such specifications are, however, somewhat cumbersome for more complex queries. Whether further developments such as XQL or XQuery will bring benefits is not yet clear. Somewhat disappointing was the unsuitability of the previous generation of linguistically oriented corpus tools, which, despite considerable investment, seem to have been outstripped by the very rapid developments seen in the mainstream XML community. Most of our current work is done directly with XMLSpy and XSLT tools such as Xalan.
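The flavour of such presence/absence checks can be conveyed with a short Python sketch. Note that this swaps in the limited XPath subset of Python's ElementTree for the XSLT setting described in the paper, and that the `<gem-document>` wrapper, the condition names and the function `has_configuration` are hypothetical names introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# Two annotation fragments placed under an assumed <gem-document> wrapper.
LAYERS = """
<gem-document>
  <layout-root id="page-21">
    <layout-leaf xref="header-21"/>
    <layout-chunk id="body-21">
      <layout-leaf xref="lay-21.2"/>
      <layout-leaf xref="lay-21.3"/>
    </layout-chunk>
  </layout-root>
  <text xref="lay-21.12 lay-21.14" font-weight="bold" justification="right"/>
</gem-document>
"""


def has_configuration(root, path):
    """True if at least one element matches the (ElementTree-subset) XPath."""
    return root.find(path) is not None


doc = ET.fromstring(LAYERS.strip())
conditions = {
    "chunk_with_leaves": ".//layout-chunk[layout-leaf]",
    "bold_text": ".//text[@font-weight='bold']",
    "graphics_present": ".//graphics",
}
results = {name: has_configuration(doc, path) for name, path in conditions.items()}
print(results)
```

A generation rule could then be made to fire only when a given condition holds; more involved conditions, such as counting siblings or comparing across layers, would need a full XPath engine or extra code, which is the cumbersomeness noted above.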

The final goal of our corpus collection work and the prototype document generation systems is to place the commonly quoted aim of using XML markup for document ‘repurposing’ on a solid theoretical foundation. Thus, for example, the ability automatically to generate very different presentational forms for the instructional texts or bird pages mentioned above is an inherent feature of the GeM model. More important for us is to uncover as precisely as possible the conditions which make certain presentational selections more appropriate than others. In general, we relate the need for presenting information in different forms to the kinds of constraints we introduced above: very different canvas constraints are imposed, for example, depending on whether the delivery medium is across the telephone, on a palmtop, or as a display on a big screen. However, it is not simply a matter of the differing affordances of the display device. The selection of particular information and information display modes is also a matter of established document types. These document types change over time, due both to changing production constraints and to the establishment of new genres. It is not possible to deploy inappropriate realizations for established genres; a newspaper that changed its presentation style to that of birdbooks would quickly be out of business, and vice versa. Mapping out these possibilities and showing what larger patterns hold is then our eventual goal. And for this, extensive and detailed empirical studies of the kind we hope the GeM corpus and annotation scheme will support are crucial.

References

John A. Bateman, Thomas Kamps, Jörg Kleinz, and Klaus Reichenberger. 2001. Constructive text, diagram and layout generation for information presentation: the DArtbio system. Computational Linguistics, 27(3):409–449.

Lynne Cahill, Roger Evans, Chris Mellish, Daniel Paiva, Mike Reape, and Donia Scott. 2001. Introduction to the RAGS architecture. Available at http://www.itri.brighton.ac.uk/projects/rags.

Corpus Encoding Standard. 2000. Corpus Encoding Standard. Version 1.5. Available at: http://www.cs.vassar.edu/CES.

Judy L. Delin and John A. Bateman. 2002. Describing and critiquing multimodal documents. Document Design, 3(2). Amsterdam: John Benjamins.

Judy Delin, John Bateman, and Patrick Allen. 2002. A model of genre in document layout. Information Design Journal.

Patrick Durusau and Matthew Brook O’Donnell. submitted. Implementing concurrent markup in XML. Markup Languages: Theory and Practice.

Renate Henschel, John Bateman, and Judy Delin. to appear. Automatic genre-driven layout generation. In Proceedings of the 6. Konferenz zur Verarbeitung natürlicher Sprache (KONVENS 2002), University of the Saarland, Saarbrücken.

Renate Henschel. 2002. GeM annotation manual. GeM project report, University of Bremen and University of Stirling, Bremen and Stirling. Available at http://purl.org/net/gem.

Gunther Kress and Theo Van Leeuwen. 2001. Multimodal discourse: the modes and media of contemporary communication. Arnold, London.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281.

Henry S. Thompson and D. McKelvie. 1997. Hyperlink semantics for standoff markup of read-only documents. In Proceedings of SGML Europe ’97.

Robert Waller. 1987. The typographical contribution to language: towards a model of typographic genres and their underlying structures. Ph.D. thesis, Department of Typography and Graphic Communication, University of Reading, Reading, U.K.


XML/XSL in the Dictionary: The Case of Discourse Markers

Daniela Berger and David Reitter and Manfred Stede
University of Potsdam
Dept. of Linguistics / Applied Computational Linguistics
P.O. Box 601553 / D-14415 Potsdam / Germany
{berger|reitter|stede}@ling.uni-potsdam.de

Abstract

We describe our ongoing work on an application of XML/XSL technology to a dictionary, from whose source representation various views for the human reader as well as for automatic text generation and understanding are derived. Our case study is a dictionary of discourse markers, the words (often, but not always, conjunctions) that signal the presence of a discourse relation between adjacent spans of text.

1 Overview

Electronic dictionaries have made extensive use of SGML encoding in the past, but to our knowledge, the advantages of contemporary frameworks such as XML/XSL are only beginning to be explored. We are applying this framework to our work on a lexicon of ‘discourse markers’ and will outline the advantages of deriving a variety of views from the common underlying lexical resource: different views for different demands by the human eye, but also application-specific views that tailor the dictionary to either the parsing or the generation task (a third one would be machine translation but is not covered in this paper), and that respect the conventions of the specific underlying programming language. Using XSL style sheets for producing the views automatically is especially useful when the lexicon is still under development: the ramifications of particular modifications or extensions can be made visible easily by running the conversion and testing the applications in question.

Discourse markers are words (predominantly conjunctions) that signal the kind of semantic or rhetorical relationship between adjacent spans of text. In text generation, when given a representation of propositions and relations holding between them, the task is to select an appropriate discourse marker in the context. In text understanding, discourse markers are the most important clues for inferring the ‘rhetorical structure’ of the text, a task that has lately been called ‘rhetorical parsing’. While these discourse markers are a somewhat idiosyncratic class of lexical items, we believe that our general approach to applying XML/XSL can be fruitful to other branches of the dictionary as well (in particular to the open-class “content words”).

After reviewing some earlier work on XML-based dictionaries (Section 2) and discussing the notion of discourse markers (Section 3), we proceed to outline the particular requirements on a discourse marker lexicon from both the text generation and the text understanding perspective (Section 4). Then, Section 5 describes our XML/XSL encoding of the source lexicon and the views for the human eye, for automatic text generation, and for text understanding. Finally, Section 6 draws some conclusions.

2 Dictionaries and XML: Related work

Recent research in lexicology has been focused on two different goals: the mark-up process for existing print dictionaries, and the successful construction of machine-readable dictionaries from scratch.

The first approach has received more attention in the past. This is partly due to the fact that the transformation of existing print dictionaries into modules for NLP applications promises to be less time-consuming than the construction of a new machine-readable database. Lexicologists agree that a dictionary entry is inherently hierarchical, i.e., it consists of atomic elements grouped together within non-atomic elements in a tree-like hierarchy. Many approaches place orthographical and phonological information together in one group, while grammatical information is put in a different group. This hierarchical approach also makes it possible to denote scope by inserting information at different levels of the hierarchy. Again, information about orthography and phonology generally applies to every facet of the headword and is thus placed high in the hierarchy, while other information might only apply to single definitions and thus ranks lower hierarchically (Amsler and Tompa, 1988; Ide and Veronis, 1995; Ide et al., 2000).

A common problem for lexicologists working with print dictionaries is that entries vary between any two given dictionaries, and even within the same dictionary. This results in a necessary trade-off between the descriptive power and the generality of an approach: the aim is to design an SGML application that is both descriptive enough to be of practical value and general enough to accommodate the variation.

There has, on the other hand, been only little research on machine-readable dictionaries that are not based on print dictionaries. To our knowledge, only Ide et al. (1993) deal with this issue, reviewing several approaches to encoding machine-readable dictionaries. One of these is the use of text models that apply a rather flat hierarchy to mark up dictionary entries. These text models might chiefly use typographical or grammatical information. Another approach is using relational databases, in which the information contained in a dictionary entry is distributed over several databases. A third approach is based on feature structures that impose a rich hierarchical structure on the data. The authors finally describe an example application that uses feature structures encoded in SGML to set up a machine-readable dictionary.

The papers mentioned above agree on using SGML for the mark-up. We found that their SGML code is, however, in general XML-compliant.

3 Discourse markers

Several contemporary discourse theories posit that important aspects of a text’s coherence can be formally described (and represented) by means of discourse relations holding between adjacent spans of text (e.g. Asher, 1993; Mann and Thompson, 1988). We use the term discourse marker for those lexical items that (in addition to non-lexical means such as punctuation, aspectual and focus shifts, etc.) can signal the presence of such a relation at the linguistic surface. Typically, a discourse relation is associated with a wide range of such markers; consider, for instance, the following variety of Concessions, which all express the same underlying propositional content. The words that we treat as discourse markers are underlined.

We were in SoHo; {nevertheless | nonetheless | however | still | yet}, we found a cheap bar.

We were in SoHo, but we found a cheap bar anyway.

Despite the fact that we were in SoHo, we found a cheap bar.

Notwithstanding the fact that we were in SoHo, we found a cheap bar.

Although we were in SoHo, we found a cheap bar.

If one accepts these sentences as paraphrases, then the various discourse markers all need to be associated with the information that they signal a concessive relationship between the two propositions involved. Notice that the markers belong to different syntactic categories and thus impose quite different syntactic constraints on their environment in the sentence. Discourse markers do not form a homogeneous class from the syntactician’s viewpoint, but from a functional perspective they should nonetheless be treated as alternatives in a paradigmatic choice.

A detailed characterization of discourse markers, together with a test procedure for identifying them in text, has been provided for English by Knott (1996). Recently, Grote (to appear) adapted Knott’s procedure for the German language. Very briefly, to identify a discourse marker (e.g., because) in a text, isolate the clause containing a candidate from the text, resolve any anaphors and make elided items explicit; if the resulting text is incomplete (e.g., because the woman bought a Macintosh), then the candidate is indeed a ‘relational phrase’, or for our purposes, a two-place discourse marker.

In addition to the syntactic features, the differences in meaning and style between similar markers need to be discerned; one such difference is the degree of specificity: for example, but can mark a general Contrast or a more specific Concession. Another one is the notable difference in formality between, say, but ... anyway and notwithstanding.

From the perspective of text generation, not all paraphrases listed above are equally felicitous in specific contexts. In order to choose the most appropriate variant, a generator needs knowledge about the fine-grained differences between similar markers for the same relation. Furthermore, it needs to account for the interactions between marker choice and other generation decisions and hence needs knowledge about the syntagmatic constraints associated with different markers. We will discuss this perspective in Section 4.1.

From the perspective of text understanding, discourse markers can be used as one source of information for guessing the rhetorical structure of a text, or automatic rhetorical parsing. We will characterize this application in Section 4.2.

4 Requirements on a discourse marker lexicon

As the following two subsections will show, text generation and understanding place quite different demands on the information coded in a discourse marker lexicon, or “DiMLex” for short. In addition, different systems employ different programming languages, and the format of the lexicon has to be adapted accordingly. Yet we want to avoid coding different lexicons manually and thus seek a common “core representation” for DiMLex from which the various application-specific instantiations can be derived. Before proposing such a representation, though, we have to examine the different requirements in more detail.

4.1 The text generation perspective

Present text generation systems are typically not very good at choosing discourse markers. Even though a few systems have incorporated some more sophisticated mappings for specific relations (e.g., in DRAFTER (Paris et al., 1995)), there is still a general tendency to treat discourse marker selection as a task to be performed as a “side effect” by the grammar, much like for other function words such as prepositions.

To improve this situation, we propose to view discourse marker selection as one subtask of the general lexical choice process, so that (to continue the example given above) one or another form of Concession can be produced in the light of the specific utterance parameters and the context. Obviously, marker selection also includes the decision whether to use any marker at all or leave the relation implicit. When these decisions can be systematically controlled, the text can be tailored much better to the specific goals of the generation process.

The generation task imposes a particular view of the information coded in DiMLex: the entry point to the lexicon is the discourse relation to be realized, and the lookup yields the range of alternatives. But many markers have further semantic and pragmatic constraints associated with them, which have to be verified in the generator’s input representation for the marker to be a candidate. Then, discourse markers place (predominantly syntactic) constraints on their immediate context, which affects the interactions between marker choice and other realization decisions. And finally, markers that are still equivalent after evaluating these constraints are subject to a choice process that can utilize preferential (e.g. stylistic or length-based) criteria. Therefore, under the generation view, the information in DiMLex is grouped into the following three classes:

- Applicability conditions: The necessary conditions for using a discourse marker, i.e., the features or structural configurations that need to be present in the input specification.

- Syntagmatic constraints: The constraints regarding the combination of a marker and the neighbouring constituents; most of them are syntactic and appear at the beginning of the list given above (part of speech, linear order, etc.).

- Paradigmatic features: Features that label the differences between similar markers sharing the same applicability conditions, such as stylistic features and degrees of emphasis.
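The three classes can be read as successive filters applied after the lookup by discourse relation. The following Python sketch illustrates this reading under simplifying assumptions of our own: the Marker fields, the toy entries, the feature values and the three-step size scale are hypothetical and are not taken from DiMLex.

```python
from dataclasses import dataclass

@dataclass
class Marker:
    ortho: str
    relation: str                       # entry point for the generation view
    pos: str                            # syntagmatic: part of speech
    max_size: str                       # syntagmatic: largest unit the related material may fill
    requires: frozenset = frozenset()   # applicability conditions on the input
    style: str = "unmarked"             # paradigmatic feature

# Toy entries; every value here is illustrative only.
LEXICON = [
    Marker("although", "concession", "subconj", "clause"),
    Marker("nevertheless", "concession", "adverb", "sentence"),
    Marker("notwithstanding", "concession", "prep", "phrase", style="formal"),
]

SIZES = ["phrase", "clause", "sentence"]  # hypothetical scale, small to large

def candidates(relation, input_features, material_size, preferred_style):
    # 1. Lookup: the discourse relation is the entry point to the lexicon.
    found = [m for m in LEXICON if m.relation == relation]
    # 2. Applicability: all required features must be present in the input.
    found = [m for m in found if m.requires <= input_features]
    # 3. Syntagmatic: the material must fit the unit the marker allows.
    found = [m for m in found
             if SIZES.index(material_size) <= SIZES.index(m.max_size)]
    # 4. Paradigmatic: rank the remaining alternatives by stylistic preference.
    return sorted(found, key=lambda m: m.style != preferred_style)

print([m.ortho for m in candidates("concession", frozenset(), "phrase", "formal")])
```

A real generator would not apply these steps in a fixed order, since marker choice interacts with other sentence planning decisions; the sketch only shows what each class of information contributes.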

Very briefly, we see discourse marker choice as one aspect of the sentence planning task (e.g. Wanner and Hovy, 1996). In order to account for the intricate interactions between marker choice and other generation decisions, the idea is to employ DiMLex as a declarative resource supporting the sentence planning process, which comprises determining sentence boundaries and sentence structure, linear ordering of constituents (e.g. thematizations), and lexical choice. All these decisions are heavily interdependent, and in order to produce truly adequate text, the various realization options need to be weighed against each other (in contrast to a simple, fixed sequence of making the types of decisions), which presupposes a flexible computational mechanism based on resources as declarative as possible. This generation approach is described in more detail in (Grote and Stede, 1998).

4.2 The text understanding perspective

In text understanding, discourse markers serve as cues for inferring the rhetorical or semantic structure of the text. In the approach proposed in (Marcu, 1997), for example, the presence of discourse markers is used to hypothesize individual textual units and relations holding between them. Then, the overall discourse structure tree is built using constraint satisfaction techniques. Our analysis method uses the lexicon for an initial identification and disambiguation of discourse markers. They serve as one of several shallow features that determine the optimal rhetorical analysis through a statistical, learned language model.

In contrast to the use of markers in generation, the list of cues is significantly longer and includes phrasal items like aus diesem Grund (for this reason) or genauer genommen (more precisely).

5 Our XML/XSL solution

In the following we show some sample representations and style sheets that have been abridged for presentation purposes.

5.1 Source representation

In our hierarchical XML structure, the <dictionary> root tag encloses the entire file, and every single entry rests in an <entry> tag, which unambiguously identifies every entry with its id attribute. Within the <entry> tag there are four subordinate tags: <form>, <syn>, <sem>, and <examples>.

The <form> tag contains the orthographic form of the headword; at present this amounts to two slots for alternative orthographies. The <syn> area contains the syntactic information about the headword. In this shortened example, only the <init_field> tag is present, which shows whether the headword can be used in the initial field of a sentence. Correspondingly, <sem> contains semantic features such as the <function> tag, which contains the semantic/discourse relation expressed by the headword. Finally, <examples> contains one or more <example> tags, each of which may give an example sentence.

    <?xml version="1.0" ?>
    <?xml-stylesheet type="text/xsl"
      href="short_dictionary.xsl" ?>
    <!DOCTYPE dictionary SYSTEM "DTD.dtd">
    <dictionary>
      <entry id="05">
        <form>
          <orth>denn</orth>
          <alt_orth>none</alt_orth>
          <!-- . . . -->
        </form>
        <syn>
          <init_field>-</init_field>
          <!-- . . . -->
        </syn>
        <sem>
          <function>causal</function>
          <!-- . . . -->
        </sem>
        <examples>
          <example>Das Konzert muss ausfallen,
            *denn* die S&auml;ngerin ist erkrankt.
          </example>
          <example>Die Blumen auf dem Balkon sind
            erfroren, *denn* es hat heute nacht
            Frost gegeben.</example>
        </examples>
      </entry>
      <entry>
        <!-- more entries -->
      </entry>
    </dictionary>

Figure 1: The XML structure

We have shortened this presentation considerably; the full lexicon contains more fine-grained features for all three areas: within <form>, information on pronunciation, syllable structure, and hyphenation; within <syn>, information about syntactic subcategorization and possible positions in the clause; within <sem>, for example, the feature indicating whether the information subordinated by the marker is presupposed or not.
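To illustrate how a client application might read this source representation, the following Python sketch loads an abridged entry with the standard-library ElementTree parser. It is an illustration only and not part of DiMLex: the function name read_entries and the dictionary keys are our own choices, and the sample omits the DOCTYPE declaration and writes the umlaut directly, because ElementTree does not load an external DTD and would therefore reject the &auml; entity.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the entry in Figure 1 (no DOCTYPE, umlaut written directly).
SAMPLE = """<dictionary>
  <entry id="05">
    <form><orth>denn</orth><alt_orth>none</alt_orth></form>
    <syn><init_field>-</init_field></syn>
    <sem><function>causal</function></sem>
    <examples>
      <example>Das Konzert muss ausfallen, *denn* die Sängerin ist erkrankt.</example>
    </examples>
  </entry>
</dictionary>"""

def read_entries(xml_text):
    """Return a dict that maps each entry id to a flat record of its features."""
    root = ET.fromstring(xml_text)
    entries = {}
    for entry in root.findall("entry"):
        entries[entry.get("id")] = {
            "orth": entry.findtext("form/orth"),
            "alt_orth": entry.findtext("form/alt_orth"),
            "init_field": entry.findtext("syn/init_field"),
            "function": entry.findtext("sem/function"),
            "examples": [e.text for e in entry.findall("examples/example")],
        }
    return entries

entries = read_entries(SAMPLE)
print(entries["05"]["orth"], entries["05"]["function"])
```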

5.2 HTML views

The listing in Figure 4 shows a style sheet that provides an HTML view by listing the XML data in a format that roughly resembles a print dictionary. Figure 2 shows the output that results from applying this XSL file to the XML source in Figure 1.

    05: denn
    Occurrences: middle field / Nullstelle
    Semantics: kausal
    Related markers: weil da
    Examples:
    Das Konzert muss ausfallen, *denn* die Sängerin ist erkrankt.
    Die Blumen auf dem Balkon sind erfroren, *denn* es hat heute nacht Frost gegeben.

Figure 2: One HTML view of the data

We assume that the general structure of the formatting part of XSL is familiar to the reader. We would like to highlight some details.

The style sheet ensures that each entry contains an HTML anchor named after the headword (ll. 14-20). This way it is possible to link to a certain entry from the <rel> tag of a different entry (ll. 38-43).

We also employ the XSL equivalent to an if/then/else construct (ll. 24-31). The <xsl:choose> tag encloses the choices to be made. The <xsl:when> tag carries a condition that checks whether the <alt_orth> tag contains the data none; in that case nothing is printed. Every other case is covered by the <xsl:otherwise> tag, which prints out the <alt_orth> information if it is not none.

    Entry   alt_orth   init_field   mid_field   ...
    denn    none       -            +           ...
    da      none       +            +           ...
    zumal   none       -            -           ...
    weil    none       -            -           ...
    als     none       -            -           ...

Figure 3: Another possible HTML view of the data

Figure 3 shows another possible view of the data. In this case the data is presented in table form, ordered by the value of the mid_field tag. An <xsl:choose> construct like the one in the previous example can be used to print out only those entries that satisfy a certain condition.
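The ordering and filtering behind such a tabular view can be sketched outside XSL as well. The Python fragment below is an illustration under assumptions of our own: the entry ids other than 05 are invented, <mid_field> is assumed to be a daughter of <syn>, and table_rows is a hypothetical helper, not the authors' style sheet.

```python
import xml.etree.ElementTree as ET

# Feature values follow Figure 3; ids other than "05" are made up for the sketch.
SAMPLE = """<dictionary>
  <entry id="05"><form><orth>denn</orth></form>
    <syn><init_field>-</init_field><mid_field>+</mid_field></syn></entry>
  <entry id="06"><form><orth>da</orth></form>
    <syn><init_field>+</init_field><mid_field>+</mid_field></syn></entry>
  <entry id="07"><form><orth>zumal</orth></form>
    <syn><init_field>-</init_field><mid_field>-</mid_field></syn></entry>
  <entry id="08"><form><orth>weil</orth></form>
    <syn><init_field>-</init_field><mid_field>-</mid_field></syn></entry>
</dictionary>"""

def table_rows(xml_text, only_mid_field=None):
    """Return (orth, init_field, mid_field) rows ordered by mid_field.

    If only_mid_field is given, keep only the entries with that value,
    which plays the role of the conditional in the style sheet.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.findall("entry"):
        row = (entry.findtext("form/orth"),
               entry.findtext("syn/init_field"),
               entry.findtext("syn/mid_field"))
        if only_mid_field is None or row[2] == only_mid_field:
            rows.append(row)
    # "+" precedes "-" in character order; the sort is stable, so entries
    # with equal values keep their order from the source file.
    return sorted(rows, key=lambda row: row[2])

for row in table_rows(SAMPLE):
    print("{:<8}{:<4}{:<4}".format(*row))
```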

        <?xml version="1.0"?>
        <xsl:stylesheet version="1.0"
          xmlns:xsl=
            "http://www.w3.org/1999/XSL/Transform">
     5  <xsl:template match="/">
          <FONT SIZE="-2">
            <xsl:apply-templates/>
          </FONT>
        </xsl:template>
    10  <xsl:template match="dictionary">
          <xsl:apply-templates/>
        </xsl:template>
        <xsl:template match="entry">
          <P><font size="2"><B><A>
    15      <xsl:attribute name="NAME">
              <xsl:value-of select="form/orth"/>
            </xsl:attribute>
            <xsl:value-of select="./@id"/>:
            <xsl:value-of select="form/orth"/>
    20    </A></B></font>
          <xsl:apply-templates/></P>
        </xsl:template>
        <xsl:template match="form">
          <xsl:choose>
    25      <xsl:when test="alt_orth='none'">
            </xsl:when>
            <xsl:otherwise>
              <BR/><B>Alternative orthography:</B>
              <xsl:value-of select="alt_orth"/>
    30      </xsl:otherwise>
          </xsl:choose>
        </xsl:template>
        <xsl:template match="sem">
          <BR/><B>Semantics:</B>
    35    <xsl:value-of select="ko_sub"/>
          / <xsl:value-of select="function"/>
          <br/><b>Related markers:</b>
          <xsl:for-each select="rel">
            <A><xsl:attribute name="HREF">
    40        <xsl:text>#</xsl:text><xsl:value-of select="."/>
            </xsl:attribute>
            <xsl:value-of select="."/></A>
          </xsl:for-each>
        </xsl:template>
    45  <xsl:template match="syn">
          <BR/><B>Occurrences:</B>
          <xsl:choose>
            <xsl:when test="init_field='-'">
            </xsl:when>
    50      <xsl:otherwise>
              initial field /
            </xsl:otherwise>
          </xsl:choose>
        </xsl:template>
    55  <xsl:template match="examples">
          <BR/><B>Examples:</B>
          <xsl:for-each select="example">
            <xsl:value-of select="."/><BR/>
          </xsl:for-each>
    60  </xsl:template>
        </xsl:stylesheet>

Figure 4: The XSL file for the HTML view shown in Figure 2

5.3 The text generation view

For the lexicon to be applied in our text generation system ‘Polibox’ (Stede, 2002), we need a Lisp-based version of DiMLex. Using the (defstruct <name> <slot1> ... <slotn>) construct, we define a class of objects for discourse markers, where the features needed for generation are stored in the slots. Again, we abbreviate slightly:

    (defstruct DiscMarker
      Relation N-Complexity S-Complexity
      Ortho POS ... Style)

Now, a Lisp object for each individual discourse marker entry is created with the function make-Discmarker, which provides the values for the slots. Figure 5 shows the shape of the entry for denn, whose XML source was given in Figure 1.

Again, we aim at deriving these entries automatically via an XSL sheet (which we do not show here). Notice that the mapping task is now somewhat different from the HTML cases, since the transformation part of XSL (XSLT) comes into play here. Instead of merely displaying the data in a web browser as in the examples before, an XSLT processor may transform data for use in some XML-based client application.

As explained in Section 4.1, in the generation scenario we are given a tree fragment consisting of a discourse relation node and two daughters representing the related material, the nucleus and the satellite of the relation. In order to decide whether a particular marker can be used, one important constraint is the “size” of the daughter material, which can be a single proposition or an entire sub-tree. The generator needs to estimate whether it will fit into a single phrase, clause, sentence, or into a sequence of sentences; a subordinating conjunction, for instance, can only be used if the material can be expressed within a clause. Thus, the Lisp entry contains slots N-Complexity and S-Complexity, which are highly application-specific and thus do not have a simple corresponding feature in the XML source representation of the dictionary. The XSL sheet thus inspects certain combinations of daughter attributes of <syn> and maps them to new names for the fillers of the two Complexity slots in the Lisp structure. (Similar mappings occur in other places as well, which we do not show here.)
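Since the XSL sheet for this view is not shown, the following Python sketch only illustrates the kind of mapping described here; it is not the authors' transformation. The element names <pos> and <style>, both lookup tables and the function to_lisp are assumptions made for the example, chosen so that the output has the shape of the make-Discmarker call for denn.

```python
import xml.etree.ElementTree as ET

# Hypothetical source entry: <pos> and <style> are assumed element names.
ENTRY_XML = """<entry id="05">
  <form><orth>denn</orth></form>
  <syn><pos>coordconj</pos><init_field>-</init_field></syn>
  <sem><function>causal</function><style>unmarked</style></sem>
</entry>"""

# Assumed renaming of <function> values to relation names on the Lisp side.
RELATION_NAMES = {"causal": "cause"}

# Assumed mapping from a <syn> feature to the two Complexity fillers
# (nucleus, satellite); the real sheet inspects combinations of features.
COMPLEXITY_BY_POS = {
    "coordconj": ("sent", "sent"),
    "subconj": ("clause", "clause"),
}

def to_lisp(entry):
    """Render one <entry> element as a make-Discmarker form."""
    pos = entry.findtext("syn/pos")
    n_complexity, s_complexity = COMPLEXITY_BY_POS[pos]
    slots = [
        ("Relation", RELATION_NAMES[entry.findtext("sem/function")]),
        ("N-Complexity", n_complexity),
        ("S-Complexity", s_complexity),
        ("Ortho", entry.findtext("form/orth")),
        ("POS", pos),
        ("Style", entry.findtext("sem/style")),
    ]
    body = "\n".join(f"  :{name} {value}" for name, value in slots)
    return f"(make-Discmarker\n{body})"

print(to_lisp(ET.fromstring(ENTRY_XML)))
```

The point of the sketch is that the Complexity fillers are computed from syntactic features rather than copied, which is why they have no direct counterpart in the source representation.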

    (make-Discmarker
      :Relation cause
      :N-Complexity sent
      :S-Complexity sent
      :Ortho denn
      :POS coordconj
      :Style unmarked)

Figure 5: Lisp version of generation-oriented dictionary entry for denn (abridged)

5.4 The text understanding view

Our analysis method recasts rhetorical parsing as a set of classification decisions, where a parsing framework builds a tree-structured analysis. Each of the decisions is based on a set of features. Feature types range from syntactic configuration to the presence of a certain discourse marker. The mapping from a pattern of observed features to a rhetorical relation may be learned automatically by a classification learning algorithm.

Learning and analysis applications use a parsing framework that gives us a set of text span pairs. Each pair of text spans is passed to a classification learning algorithm (during training) or to the actual classifier. A rhetorical relation is thus assigned to the two spans of text along with a score, so that the parsing framework may decide which of several competing classifications to accept.

Learning and actual rhetorical analysis are accomplished by a set of distinct tools that add specific annotations to a given input text, before resulting relations are learned or guessed. These tools include a data mining component, a part-of-speech tagger and a segmenter. They all access data organized in an XML syntax. The central learning and parsing application makes use of a Document Object Model (DOM) representation of the corpus. This data structure is effectively used for information interchange between several components, because it allows us to easily visualize and modify the current data at each processing step during development.

With the present corpus data, the learning algorithm is theoretically able to identify rhetorical markers automatically and could thus compile a marker lexicon. However, markers are highly ambiguous. Even though many of them can be tagged as adverbials or conjunctions, markers often lack distinctive syntactic and/or positional properties; some of them are phrasal, some are discontinuous. To identify significant cue-relation correlations, a lot of annotated data is necessary: more than is usually available. In a sparse data situation, we want to ease the learning task for the rhetorical language model, so it makes sense to use a discourse marker lexicon.

On the other hand, we do not expect a hand-crafted lexicon to contain all contextual constraints that would enable us to assign a single rhetorical relation. These constraints can be very subtle; some of them should be represented as probabilistic scalar information.

Thus, DiMLex contributes to initial discourse marker disambiguation. From each entry, we interpret syntactic positioning information, morphosyntactic contextual information and a scope class (sentential, phrasal, discourse-level) as a conjunction of constraints. The presence of a certain discourse marker in a specified configuration is one of the features to be observed in the text.

Depending on the depth of the syntactic and semantic analysis carried out by the text understanding system, different features provided by DiMLex can be taken into account. Certain structural configurations can be tested without any deep understanding; for instance, the German marker während is generally ambiguous between a Contrast and a Temporal-Cooccurrence reading, but when followed by a noun phrase, only the latter reading is available (während corresponds not only to the English while but also to during).
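Such a structural test amounts to a table lookup once a shallow analysis has labelled the constituent that follows the marker. The Python sketch below illustrates this for während; the table layout, the category label "NP" and the function possible_readings are assumptions made for the example and do not reflect the actual DiMLex encoding.

```python
# Readings listed for each marker; only the während entry reflects the text.
READINGS = {"während": ("Contrast", "Temporal-Cooccurrence")}

# (marker, category of the following constituent) -> readings that remain.
# The category label is assumed to come from a chunker or shallow parser.
STRUCTURAL_CONSTRAINTS = {
    ("während", "NP"): ("Temporal-Cooccurrence",),
}

def possible_readings(marker, following_category):
    """Return the readings of a marker that survive the structural test."""
    allowed = STRUCTURAL_CONSTRAINTS.get((marker, following_category))
    if allowed is None:
        # No constraint applies: the marker stays ambiguous.
        return READINGS[marker]
    return tuple(r for r in READINGS[marker] if r in allowed)

print(possible_readings("während", "NP"))
print(possible_readings("während", "clause"))
```

The remaining ambiguity in the second case is what the learned classifier is expected to resolve from other features.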

In the parsing client application, DiMLex serves as a resource for the identification of cue phrases in specific structural configurations. Rhetorical information from the DiMLex entries may serve as one of several cues for the classification engine. The final linking from cue patterns to rhetorical relations is learned from a corpus annotated with rhetorical structures.

6 Summary

We have presented our ongoing work on constructing an XML-based dictionary of discourse markers, from which a variety of views are derived by XSL sheets: for the dictionary designer or application developer, we present the dictionary in tabular form or in a form resembling print dictionaries, but with hyperlinks included for easy cross-referencing. Similarly, text generation and understanding systems are on the one hand written in different programming languages and thus expect different dictionary formats; on the other hand, the information needed for generation and parsing is also not identical, which is accounted for by the XSL sheets. Evaluation of the approach will depend on the client applications. Their implementation will determine the final shape of DiMLex.

References

R.A. Amsler and F.W. Tompa. “An SGML-Based Standard for English Monolingual Dictionaries.” In: Information in Text: Fourth Annual Conference of the UW Center for the New Oxford English Dictionary, University of Waterloo Center for the New Oxford English Dictionary, Waterloo, Ontario, 1988, pp. 61-79.

N. Asher. Reference to Abstract Objects in Discourse. Dordrecht: Kluwer, 1993.

B. Grote, M. Stede. “Discourse marker choice in sentence planning.” In: Proceedings of the Ninth International Workshop on Natural Language Generation, Niagara-on-the-Lake, Canada, 1998.

B. Grote. “Signalling coherence relations: temporal markers and their role in text generation.” Doctoral dissertation, Universität Bremen, forthcoming.

N. M. Ide, J. Le Maitre, and J. Veronis. “Outline of a Model for Lexical Databases.” Information Processing and Management, 29, 2 (1993), 159-186.

N. M. Ide and J. Veronis. “Encoding Dictionaries.” Computers and the Humanities, 29:167-180, 1995.

N. M. Ide, A. Kilgarriff, and L. Romary. “A Formal Model of Dictionary Structure and Content.” In: Proceedings of EURALEX 2000, pp. 113-126, Stuttgart.

A. Knott. “A data-driven methodology for motivating a set of coherence relations.” Doctoral dissertation, University of Edinburgh, 1996.

W. Mann, S. Thompson. “Rhetorical structure theory: Towards a functional theory of text organization.” In: TEXT, 8:243-281, 1988.

D. Marcu. “The Rhetorical Parsing of Unrestricted Natural Language Texts.” Proceedings of the Thirty-Fifth Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, Somerset, New Jersey, 1997.

C. Paris, K. Van der Linden, M. Fischer, A. Hartley, L. Pemberton, R. Power and D.R. Scott. “Drafter: A Drafting Tool for Producing Multilingual Instructions.” Proceedings of the 5th European Workshop on Natural Language Generation, pp. 239-242, Leiden, the Netherlands, 1995.

M. Stede. “Polibox: Generating descriptions, comparisons, and recommendations from a database.” In: Proceedings of COLING-2002.

L. Wanner, E. Hovy. “The HealthDoc Sentence Planner.” Proceedings of the 8th International Workshop on Natural Language Generation, Herstmonceux Castle, 1996.

Web References

Document Object Model (DOM) Level 2 Core
W3C Recommendation, 13 November 2000
http://www.w3.org/TR/DOM-Level-2-Core

Extensible Stylesheet Language (XSL) 1.0
W3C Recommendation, 15 October 2001
http://www.w3.org/TR/xsl

XML Base
W3C Recommendation, 27 June 2001
http://www.w3.org/TR/xmlbase

XSL Transformations (XSLT) 1.0
W3C Recommendation, 16 November 1999
http://www.w3.org/TR/xslt


XML-Based NLP Tools for Analysing and Annotating Medical Language

Claire Grover, Ewan Klein, Mirella Lapata and Alex Lascarides
Division of Informatics
The University of Edinburgh
2 Buccleuch Place
Edinburgh EH8 9LW, UK
{C.Grover, E.Klein, M.Lapata, A.Lascarides}@ed.ac.uk

Abstract

We describe the use of a suite of highly flexible XML-based NLP tools in a project for processing and interpreting text in the medical domain. The main aim of the paper is to demonstrate the central role that XML mark-up and XML NLP tools have played in the analysis process and to describe the resultant annotated corpus of MEDLINE abstracts. In addition to the XML tools, we have succeeded in integrating a variety of non-XML ‘off the shelf’ NLP tools into our pipelines, so that their output is added into the mark-up. We demonstrate the utility of the annotations that result in two ways. First, we investigate how they can be used to improve parse coverage of a hand-crafted grammar that generates logical forms. And second, we investigate how they contribute to automatic lexical semantic acquisition processes.

1 Introduction

In this paper we describe our use of XML for an analysis of medical language which involves a number of complex linguistic processing stages. The ultimate aim of the project is to acquire lexical semantic information from MEDLINE through parsing; however, a fundamental tenet of our approach is that higher-level NLP activities benefit hugely from being based on a reliable and well-considered initial stage of tokenisation. This is particularly true for language tasks in the biomedical and other technical domains since general purpose NLP technology may stumble at the first hurdle when confronted with character strings that represent specialised technical vocabulary. Once firm foundations are laid then one can achieve better performance from e.g. chunkers and parsers than might otherwise be the case. We show how well-founded tools, especially XML-based ones, can enable a variety of NLP components to be bundled together in different ways to achieve different types of analysis. Note that in fields such as information extraction (IE) it is common to use statistical text classification methods for data analysis. Our more linguistic approach may be of assistance in IE: see Craven and Kumlien (1999) for discussion of methods for IE from MEDLINE.

Our processing paradigm is XML-based. As a mark-up language for NLP tasks, XML is expressive and flexible yet constrainable. Furthermore, there exists a wide range of XML-based tools for NLP applications which lend themselves to a modular, pipelined approach to processing whereby linguistic knowledge is computed and added as XML annotations in an incremental fashion. In processing MEDLINE abstracts we have built a number of such pipelines using as key components the programs distributed with the LT TTT and LT XML toolsets (Grover et al., 2000; Thompson et al., 1997). We have also successfully integrated non-XML public-domain tools into our pipelines and incorporated their output into the XML mark-up using the LT XML program xmlperl (McKelvie, 2000).

In Section 2 we describe our use of XML-based tokenisation tools and techniques and in Sections 3 and 4 we describe two different approaches to analysing MEDLINE data which are built on top of the tokenisation. The first approach uses a hand-coded grammar to give complete syntactic and semantic analyses of sentences. The second approach performs a shallower statistically-based analysis which yields ‘grammatical relations’ rather than full logical forms. This information about grammatical relations is used in a statistically-trained model which disambiguates the semantic relations in noun compounds headed by deverbal nominalisations. For this second approach we compare two separate methods of shallow analysis which require the use of two different part-of-speech taggers.

2 Pre-parsing of Medline Abstracts

For the work reported here, we have used the OHSUMED corpus of MEDLINE abstracts (Hersh et al., 1994) which contains 348,566 references taken from the years 1987–1991. Not every reference contains an abstract, thus the total number of abstracts in the corpus is 233,443. The total number of words in those abstracts is 38,708,745 and the abstracts contain approximately 1,691,383 sentences with an average length of 22.89 words.

    <RECORD>
    <ID>395</ID>
    <MEDLINE-ID>87052477</MEDLINE-ID>
    <SOURCE>Clin Pediatr (Phila) 8703; 25(12):617-9</SOURCE>
    <MESH>Adolescence; Alcoholic Intoxication/BL/*EP; Blood Glucose/AN; Canada; Child; Child, Preschool;
    Electrolytes/BL; Female; Human; Hypoglycemia/ET; Infant; Male; Retrospective Studies.</MESH>
    <TITLE>Ethyl alcohol ingestion in children. A 15-year review.</TITLE>
    <PTYPE>JOURNAL ARTICLE.</PTYPE>
    <ABSTRACT><SENT><W P='DT'>A</W> <W P='JJ'>retrospective</W>
    <W P='NN' LM='study'>study</W> <W P='VBD' LM='be'>was</W>
    <W P='VBN' LM='conduct'>conducted</W> <W P='IN'>by</W> <W P='NN' LM='chart'>chart</W>
    <W P='NNS' LM='review'>reviews</W> <W P='IN'>of</W> <W P='CD'>27</W>
    <W P='NNS' LM='patient'>patients</W> <W P='IN'>with</W> <W P='JJ'>documented</W>
    <W P='NN' LM='ethanol'>ethanol</W> <W P='NN' LM='ingestion'>ingestion</W>
    <W P='.'>.</W></SENT> <SENT> ... </SENT> <SENT> ... </SENT></ABSTRACT>
    <AUTHOR>Leung AK.</AUTHOR>
    </RECORD>

Figure 1: A sample from the XML-marked-up OHSUMED Corpus

By pre-parsing we mean identification of word tokens and sentence boundaries and other lower-level processing tasks such as part-of-speech (POS) tagging and lemmatisation. These initial stages of processing form the foundation of our NLP work with MEDLINE abstracts and our methods are flexible enough that the representation of pre-parsing can be easily tailored to suit the input needs of subsequent higher-level processors. We start by converting the OHSUMED corpus from its original format to an XML format (see Figure 1). From this point on we pass the data through pipelines which are composed of calls to a variety of XML-based tools from the LT TTT and LT XML toolsets. The core program in our pipelines is the LT TTT program fsgmatch, a general purpose transducer which processes an input stream and rewrites it using rules provided in a hand-written grammar file, where the rewrite usually takes the form of the addition of XML mark-up. Typically, fsgmatch rules specify patterns over sequences of XML elements and use a regular expression language to identify patterns inside the character strings (PCDATA) which are the content of elements. For example, the following rule for decimals such as “.25” searches for a sequence of two S elements where the first contains the string “.” as its PCDATA content and the second has been identified as a cardinal number (C=‘CD’, e.g. any sequence of digits). When these two S elements are found, they are wrapped in a W element with the attribute C=‘CD’ (targ_sg). (Here S elements encode character sequences, see below, and W elements encode words.)

<RULE name="decimal" targ_sg="W[C='CD']">
  <REL match="S/#~^[\.]$"></REL>
  <REL match="S[C='CD']"></REL>
</RULE>
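The effect of the decimal rule can be approximated outside fsgmatch. The following is a minimal Python sketch, not the fsgmatch implementation: the function name wrap_decimals, the enclosing P element and the class value CM for the full stop are assumptions made for illustration only.

```python
import re
import xml.etree.ElementTree as ET


def wrap_decimals(parent):
    """Wrap an S containing "." followed by an S with C='CD' in a W with C='CD'."""
    children = list(parent)  # snapshot of the original child order
    i = 0
    while i < len(children) - 1:
        first, second = children[i], children[i + 1]
        is_dot = first.tag == "S" and re.fullmatch(r"[\.]", first.text or "")
        is_number = second.tag == "S" and second.get("C") == "CD"
        if is_dot and is_number:
            word = ET.Element("W", {"C": "CD"})
            position = list(parent).index(first)
            parent.remove(first)
            parent.remove(second)
            word.extend([first, second])
            parent.insert(position, word)
            i += 2  # both S elements have been consumed
        else:
            i += 1
    return parent


# 'CM' is a made-up class name for punctuation, used only in this example.
sample = ET.fromstring("<P><S C='CM'>.</S><S C='CD'>25</S><S C='WS'> </S></P>")
wrap_decimals(sample)
print(ET.tostring(sample, encoding="unicode"))
```

The sketch only looks at direct children of one element and ignores text between elements, which is enough for S mark-up where every character sits inside an S element.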

Subparts of a pipeline can be thought of as distinct modules so that pipelines can be configured to different tasks. A typical pipeline starts with a two-stage process to identify word tokens within abstracts. First, sequences of characters are bundled into S (sequence) elements using fsgmatch. For each class of character a sequence of one or more instances is identified and the type is recorded as the value of the attribute C (UCA=upper case alphabetic, LCA=lower case alphabetic, WS=white space etc.). Figure 2 shows the string Arterial PaO2 as measured marked up for S elements (line breaks added for formatting purposes). Every single character including white space and newline is contained in S elements which become building blocks for the next call to fsgmatch where words are identified. An alternative approach would find words in a single step but our two-step method provides a cleaner set of word-level rules which are more easily modified and tailored to different purposes: modifiability is critical since the definition of what is a word can differ from one subsequent processing step to another.

<S C='UCA'>A</S><S C='LCA'>rterial</S><S C='WS'> </S>
<S C='UCA'>P</S><S C='LCA'>a</S><S C='UCA'>O</S><S C='CD'>2</S><S C='WS'> </S>
<S C='LCA'>as</S><S C='WS'> </S><S C='LCA'>measured</S>

Figure 2: Character Sequence (S) Mark-up
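The first tokenisation stage, bundling runs of same-class characters into S elements, can be illustrated with a short Python sketch. It is not the pretok.gr grammar: the function names are hypothetical, only the four classes mentioned in the text are modelled, and the OTHER class is an assumed catch-all.

```python
from itertools import groupby
from xml.sax.saxutils import escape


def char_class(ch):
    """Map a character to a class name (only classes named in the text)."""
    if ch.isdigit():
        return "CD"
    if ch.isalpha() and ch.isupper():
        return "UCA"
    if ch.isalpha() and ch.islower():
        return "LCA"
    if ch.isspace():
        return "WS"
    return "OTHER"  # assumed catch-all, not a class from the paper


def mark_sequences(text):
    """Wrap each maximal run of same-class characters in an S element."""
    return "".join(
        "<S C='{}'>{}</S>".format(cls, escape("".join(run)))
        for cls, run in groupby(text, key=char_class)
    )


print(mark_sequences("Arterial PaO2 as measured"))
```

Because every character falls into exactly one run, no input character is lost, which mirrors the property that all characters, including white space, end up inside S elements.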

A pipeline which first identifies words and then performs sentence boundary identification and POS tagging followed by lemmatisation is shown in Figure 3 (somewhat simplified and numbering added for ease of exposition). The Perl program in step 1 wraps the input inside an XML header and footer as a first step towards conversion to XML. Step 2 calls fsgmatch with the grammar file ohsumed.gr to identify the fields of an OHSUMED entry and convert them into XML mark-up: each abstract is put inside a RECORD element which contains sub-structure reflecting e.g. author, title, MESH code and the abstract itself. From this point on, all processing is directed at the ABSTRACT elements through the query “.*/ABSTRACT” (footnote 1). Steps 3 and 4 make calls to fsgmatch to identify S and W (word) elements as described above and after this point, in step 5, the S mark-up is discarded (using the LT TTT program sgdelmarkup) since it has now served its purpose.

Step 6 contains a call to the other main LT TTT program, ltpos (Mikheev, 1997), which performs both sentence identification and POS tagging. The subquery (-qs) option picks out ABSTRACTs as the elements within RECORDs (-q option) that are to be processed; the -qw option indicates that the input has already been segmented into words marked up as W elements; the -sent option indicates that sentences should be wrapped as SENT elements; the -tag option is an instruction to output POS tags and the -pos_attr option indicates that POS tags should be encoded as the value of the attribute P on W elements. The final resource.xml names the resource file that ltpos is to use. Note that the tagset used by ltpos is the Penn Treebank tagset (Marcus et al., 1994).

Footnote 1: The query language that the LT TTT and LT XML tools use is a specialised XML query language which pinpoints the part of the XML tree-structure that is to be processed at that point. This query language pre-dates XPath and in expressiveness it constitutes a subset of XPath except that it also allows regular expressions over text content. Future plans include modifying our tools to allow for the use of XPath as a query language.

1. ohs2xml.perl \
2. | fsgmatch -q ".*/TEXT" ohsumed.gr \
3. | fsgmatch -q ".*/ABSTRACT" pretok.gr \
4. | fsgmatch ".*/ABSTRACT" tok.gr \
5. | sgdelmarkup -q ".*/S" \
6. | ltpos -q ".*/RECORD" -qs ".*/ABSTRACT" \
         -qw ".*/W" -sent SENT \
         -tag -pos_attr P resource.xml \
7. | xmlperl lemma.rule

Figure 3: Basic Tokenisation Pipeline

Up to this point, each module in the pipeline has used one of the LT TTT or LT XML programs which are sensitive to XML structure. There are, however, a large number of tools available from the NLP community which could profitably be used but which are not XML-aware. We have integrated some of these tools into our pipelines using the LT XML program xmlperl. This is a program which makes underlying use of an XML parser so that rules defined in a rule file can be directed at particular parts of the XML tree-structure. The actions in the rules are defined using the full capabilities of Perl. This gives the potential for a much wider range of transformations of the input than fsgmatch allows and, in particular, we use Perl’s stream-handling capabilities to pass the content of XML elements out to a non-XML program, receive the result back and encode it back in the XML mark-up. Step 7 of the pipeline in Figure 3 shows a call to xmlperl with the rule file lemma.rule. This rule file invokes Minnen et al.’s (2000) morpha lemmatiser: the PCDATA content of each verbal or nominal W element is passed to the lemmatiser and the lemma that is returned is encoded as the value of the attribute LM. A sample of the output from the pipeline is shown in Figure 1.
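The pattern of handing element content to a non-XML tool and writing the result back as an attribute can be sketched in Python. This is not xmlperl or lemma.rule: toy_lemmatise is a dictionary stand-in for the external lemmatiser, and treating tags that begin with NN or VB as "nominal or verbal" is an assumption based on the Penn Treebank tags seen in Figure 1.

```python
import xml.etree.ElementTree as ET

# Toy stand-in for the external lemmatiser; a real pipeline would call the tool.
TOY_LEMMAS = {"reviews": "review", "patients": "patient",
              "was": "be", "conducted": "conduct"}


def toy_lemmatise(word, tag):
    """Return a lemma for the word (hypothetical lookup, ignores the tag)."""
    return TOY_LEMMAS.get(word.lower(), word.lower())


def add_lemmas(root, lemmatise=toy_lemmatise):
    """Set LM on every W element whose P tag marks a noun or a verb."""
    for w in root.iter("W"):
        tag = w.get("P", "")
        if tag.startswith("NN") or tag.startswith("VB"):
            w.set("LM", lemmatise(w.text or "", tag))
    return root


sent = ET.fromstring(
    "<SENT><W P='NNS'>reviews</W> <W P='IN'>of</W> "
    "<W P='CD'>27</W> <W P='NNS'>patients</W></SENT>"
)
add_lemmas(sent)
print(ET.tostring(sent, encoding="unicode"))
```

Passing the lemmatiser in as a function keeps the XML handling separate from the external tool, so a different lemmatiser could be substituted without touching the traversal.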

3 Deep Grammatical Analysis

As part of our work with OHSUMED, we have been attempting to improve the coverage of a hand-crafted, linguistically motivated grammar which provides full syntactic analysis paired with logical forms. The grammar and parsing system we use is the wide-coverage grammar, morphological analyser and lexicon provided by the Alvey Natural Language Tools (ANLT) system (Carroll et al. 1991, Grover et al. 1993). Our first aim was to increase coverage up to a reasonable level so that parse ranking techniques could then be applied.

The ANLT grammar is a feature-based unification grammar based on the GPSG formalism (Gazdar et al., 1985). In this framework, lexical entries carry a significant amount of information including subcategorisation information. Thus the practical parse success of the grammar is significantly dependent on the quality of the lexicon. The ANLT grammar is distributed with a large lexicon and, while this provides a core of commonly-occurring lexical entries, there remains a significant problem of inadequate lexical coverage. If we try to parse OHSUMED sentences using the ANLT lexicon and no other resources, we achieve very poor results (2% coverage) because most of the medical domain words are simply not in the lexicon and there is no ‘robustness’ strategy built into ANLT. Rather than pursue the labour-intensive course of augmenting the lexicon with domain-specific lexical resources, we have developed a solution which does not require that new lexicons be derived for each new domain type and which has robustness built into the strategy. Furthermore, this solution does not preclude the use of specialist lexical resources if these can be used to achieve further improvements in performance.

Our approach relies on the sophisticated XML-based tokenisation and POS tagging described in the previous section and it builds on this by combining POS tag information with the existing ANLT lexical resources. We preserve POS tag information for content words (nouns, verbs, adjectives, adverbs) since this is usually reliable and informative and we dispose of POS tags for function words (complementizers, determiners, particles, conjunctions, auxiliaries, pronouns, etc.) since the ANLT hand-written entries for these are more reliable and are tuned to the needs of the grammar. Furthermore, unknown words are far more likely to be content words, so knowledge of the POS tag will most often be needed for content words.

Having retained content word tags, we use them during lexical look-up in one of two ways. If the word exists in the lexicon with the same basic category as the POS tag then the POS tag plays a ‘disambiguating’ role, filtering out entries for the word with different categories. If, on the other hand, the word is not in the lexicon or it is not in the lexicon with the relevant category, then a basic underspecified entry for the POS tag is used as the lexical entry for the word, thereby allowing the parse to proceed. For example, if the following partially tagged sentence is input to the parser, it is successfully parsed.

We studied_VBD the value_NN of transcutaneous_JJ carbon_NN dioxide_NN monitoring_NN during transport_NN

Without the tags the parse would fail since the word transcutaneous is not in the ANLT lexicon. Furthermore, monitoring is present in the lexicon but as a verb and not as a noun. For both these words, ordinary lexical look-up fails and the entries for the tags have to be used instead. Note that the case of monitoring would be problematic for a strategy where tagging is used only in case lexical look-up fails, since here it is incomplete rather than failed. The implementation of our word-tag pair look-up method is specific to the ANLT system and uses its morphological analysis component to treat tags as a novel kind of affix. Space considerations preclude discussion of this topic here but see Grover and Lascarides (2001) for further details.
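The two roles of the tag during look-up, filtering and fall-back, can be made concrete with a small Python sketch. It does not reflect the ANLT implementation, which treats tags as affixes in the morphological analyser; the lexicon contents, the tag-to-category table and the entry names are all invented for illustration.

```python
# Hypothetical toy lexicon: word -> list of (category, entry) pairs.
LEXICON = {
    "value": [("N", "value-noun-entry"), ("V", "value-verb-entry")],
    "monitoring": [("V", "monitor-verb-entry")],
}

# Assumed mapping from a few Penn Treebank tags to basic categories.
TAG_TO_CATEGORY = {"NN": "N", "NNS": "N", "VB": "V", "VBD": "V", "JJ": "A"}


def look_up(word, tag=None):
    """Return lexical entries for a word, using the POS tag if one is given."""
    entries = LEXICON.get(word, [])
    if tag is None:
        return entries  # untagged (function) words rely on the lexicon alone
    category = TAG_TO_CATEGORY[tag]
    matching = [entry for entry in entries if entry[0] == category]
    if matching:
        return matching  # the tag disambiguates among existing entries
    # Word missing, or present only with another category: fall back to an
    # underspecified entry derived from the tag.
    return [(category, "underspecified-" + tag)]


print(look_up("value", "NN"))
print(look_up("monitoring", "NN"))
print(look_up("transcutaneous", "JJ"))
```

The monitoring case shows why the fall-back is triggered by a category mismatch rather than only by a missing word: the word is found, but not as a noun.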

Another impediment to parse coverage is the prevalence of technical expressions and formulae in biomedical and other technical language. For example, the following sentence has a straightforward overall syntactic structure but the ANLT grammar does not contain specialist rules for handling expressions such as 5.0+/-0.4 grams tension and thus the parse would fail.

Control tissues displayed a reproducible response to bethanechol stimulation at different calcium concentrations with an ED50 of 0.4 mM calcium and a peak response of 5.0+/-0.4 grams tension.

Our response to issues like these is to place a further layer of processing in between the output of the initial tokenisation pipeline in Figure 3 and the input to the parser. Since the ANLT system is not XML-based, we already use xmlperl to convert sentences to the ANLT input format of one sentence per line with tags appended to words using an underscore. We can add a number of other processes at this point to implement a strategy of using fsgmatch grammars to package up technical expressions so as to render them innocuous to the parser. Thus all of the following ‘words’ have been identified using fsgmatch rules and can be passed to the parser as unanalysable units. The classification of these examples as nouns reflects a hypothesis that they can slot into the correct parse as noun phrases but there is room for experimentation since the conversion to parser input format can rewrite the tag in any way.

<W P='NN'>P less than 0.001</W>
<W P='NN'>166 +/- 77 mg/dl</W>
<W P='NN'>2 to 5 cc/day</W>
<W P='NN'>2.5 mg i.v.</W>

In addition to these kinds of examples, we also package up other less technical expressions such as common multi-word words and spelled out numbers:

<W P='CD'>thirty-five</W>     thirty-five_CD
<W P='CD'>Twenty one</W>      Twenty~one_CD
<W P='IN'>In order to</W>     In~order~to_IN
<W P='JJ'>in vitro</W>        in~vitro_JJ
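The conversion from W mark-up to parser input, with tags appended by underscore and multi-word units joined into a single token, can be sketched as follows. This is an illustration in Python rather than the xmlperl rule file actually used; the function name is hypothetical and the ASCII tilde is assumed as the joining character.

```python
import xml.etree.ElementTree as ET


def to_parser_input(sent):
    """Render a SENT element as one line of word_TAG tokens."""
    tokens = []
    for w in sent.iter("W"):
        # Join the parts of a multi-word unit so the parser sees one token.
        form = "~".join((w.text or "").split())
        tag = w.get("P")
        # Words whose tag has been discarded are passed through untagged.
        tokens.append(form + "_" + tag if tag else form)
    return " ".join(tokens)


sent = ET.fromstring(
    "<SENT><W P='CD'>Twenty one</W> <W P='NNS'>patients</W> "
    "<W>were</W> <W P='VBN'>studied</W> <W P='JJ'>in vitro</W></SENT>"
)
print(to_parser_input(sent))
```

Since the tag is taken from the P attribute at conversion time, this is also the point at which a packaged expression could be given a different tag, as noted above.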

In order to measure the effectiveness of our attempts to improve coverage, we conducted an experiment where we parsed 200 sentences taken at random from OHSUMED. We processed the sentences in three different ways and gathered parse success rates for each of the three methods. Version 1 established a ‘no-intervention’ baseline by using the initial pipeline in Figure 3 to identify words and sentences but otherwise discarding all other mark-up. Version 2 addressed the lexical robustness issue by retaining POS tags to be used by the grammar in the way outlined above. Version 3 applied the full set of preprocessing techniques including the packaging-up of formulaic and other technical expressions. The parse results for these runs are as follows:

          Version 1    Version 2    Version 3
Parses    4 (2%)       32 (16%)     79 (39.5%)

Even in Version 3, coverage is still not very high but the difference between the three versions demonstrates that our approach has made significant inroads into the problem. Moreover, the increase in coverage was achieved without any significant alterations to the general-purpose grammar and the tokenisation of formulaic expressions was by no means comprehensive.

4 Shallow Analysis

In contrast to the full syntactic analysis experiments described in the previous section, here we describe two distinct methods of shallow analysis from which we acquire frequency information which is used to predict lexical semantic relations in a particular kind of noun compound.

4.1 The Task

The aim of the processing in this task is to predict the relationship between a deverbal nominalisation head and its modifier in noun-noun compounds such as tube placement, antibody response, pain response, helicopter transport. In these examples, the meaning of the head noun is closely related to the meaning of the verb from which it derives and the relationship between this noun and its modifier can typically be matched onto a relationship between the verb and one of its arguments. For example, there is a correspondence between the compound tube placement and the verb plus direct object string place the tube. When we interpret the compound we describe the role that the modifier plays in terms of the argument position it would fill in the corresponding verbal construction:

tube placement          object
antibody response       subject
pain response           to-object
helicopter transport    by-object

We can infer that tube in tube placement fills the object role in the place relation by gathering instances from the corpus of the verb place and discovering that tube occurs more frequently in object position than in other positions and that the object interpretation is therefore more probable.

To interpret such compounds in this way, we need access to information about the verbs from which the head nouns are derived. Specifically, for each verb, we need counts of the frequency with which it occurs with each noun in each of its argument slots. Ultimately, in fact, in view of the sparse data problem, we need to back off from specific noun instances to noun classes (see Section 4.4). The current state-of-the-art in NLP provides a number of routes to acquiring grammatical relations information about verbs, and for our experiment we chose two methods in order to be able to compare the techniques and assess their utility.
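The frequency-based inference can be illustrated with a minimal Python sketch. It shows only the basic idea of choosing the most frequent argument slot; it is not the classifier used in the experiments, the counts are invented, and no back-off to noun classes is attempted.

```python
from collections import Counter

# Invented counts for illustration: (verb, slot, noun) -> frequency.
COUNTS = Counter({
    ("place", "obj", "tube"): 12,
    ("place", "subj", "tube"): 1,
    ("respond", "subj", "antibody"): 7,
    ("respond", "to", "antibody"): 2,
})


def predict_relation(verb, modifier, counts=COUNTS):
    """Return the slot in which the modifier most often occurs with the verb."""
    by_slot = Counter()
    for (v, slot, noun), frequency in counts.items():
        if v == verb and noun == modifier:
            by_slot[slot] += frequency
    if not by_slot:
        return None  # unseen pair: this is where back-off would be needed
    return by_slot.most_common(1)[0][0]


print(predict_relation("place", "tube"))
print(predict_relation("respond", "antibody"))
print(predict_relation("transport", "helicopter"))
```

The None result for an unseen verb-noun pair is exactly the sparse data case that motivates backing off to noun classes.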

4.2 Chunking with Cass

Our first method of acquiring verb grammatical relations is that used by Lapata (2000) for a similar task on more general linguistic data. This method uses Abney’s (1996) Cass chunker which uses the finite-state cascade technique. A finite-state cascade is a sequence of non-recursive levels: phrases at one level are built on phrases at the previous level without containing same level or higher-level phrases. Two levels of particular importance are chunks and simplex clauses. A chunk is the non-recursive core of intra-clausal constituents extending from the beginning of the constituent to its head, excluding post-head dependents (i.e., NP, VP, PP), whereas a simplex clause is a sequence of non-recursive clauses (Abney, 1996). Cass recognizes chunks and simplex clauses using a regular expression grammar without attempting to resolve attachment ambiguities. The parser comes with a large-scale grammar for English and a built-in tool that extracts predicate-argument tuples out of the parse trees that Cass produces. Thus the tool identifies subjects and objects as well as PPs without however distinguishing arguments from adjuncts. We consider verbs followed by the preposition by and a head noun as instances of verb-subject relations. Our verb-object tuples also include prepositional objects even though these are not explicitly identified by Cass. We assume that PPs adjacent to the verb and headed by any of the prepositions in, to, for, with, on, at, from, of, into, through, upon are prepositional objects.

The input to the process is the entire OHSUMED corpus after it has been converted to XML, tokenised, split into sentences and POS tagged using ltpos as described in Section 2. The output of this tokenisation is converted to Cass’s input format which is a non-XML file with one word per line and tags separated by tab. We achieve this conversion using xmlperl with a simple rule file. The output of Cass and the grammatical relations processor is a list of each verb-argument pair in the corpus:

manage :obj refibrillation
respond :subj psoriasis
access :to system
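A list of verb-argument pairs in this three-field form can be turned into the frequency counts needed later. The following Python sketch assumes one pair per line with whitespace-separated fields; the function name is hypothetical and lines in any other shape are simply skipped.

```python
from collections import Counter


def count_pairs(lines):
    """Count (verb, slot, noun) triples from lines such as 'manage :obj x'."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or not parts[1].startswith(":"):
            continue  # ignore anything not in the three-field form
        verb, slot, noun = parts
        counts[(verb, slot[1:], noun)] += 1
    return counts


pairs = [
    "manage :obj refibrillation",
    "respond :subj psoriasis",
    "access :to system",
    "respond :subj psoriasis",
]
for triple, frequency in count_pairs(pairs).items():
    print(triple, frequency)
```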

4.3 Shallow Parsing with the Tag Sequence Grammar

Our second method of acquiring verb grammatical relations uses the statistical parser developed by Briscoe and Carroll (1993, 1997) which is an extension of the ANLT grammar development system which we used for our deep grammatical analysis as reported in Section 3 above. The statistical parser, known as the Tag Sequence Grammar (TSG), uses a hand-crafted grammar where the lexical entries are for POS tags rather than words themselves. Thus it is strings of tags that are parsed rather than strings of words. The statistical part of the system is the parse ranking component where probabilities are associated with transitions in an LR parse table. The grammar does not achieve full coverage but on the OHSUMED corpus we were able to obtain parses for 99.05% of sentences. The number of parses found per sentence ranges from zero into the thousands but the system returns the highest ranked parse according to the statistical ranking method. We do not have an accurate measure of how many of the highest ranked parses are actually correct but even a partially incorrect parse may still yield useful grammatical relations data.

In recent developments (Carroll and Briscoe, 2001), the TSG authors have developed an algorithm for mapping TSG parse trees to representations of grammatical relations within the sentence in the following format:

These centres are efficiently trapped in proteins at low temperatures
(|ncsubj| |trap| |centre| |obj|)
(|iobj| |in| |trap| |protein|)
(|detmod| |centre| |These|)
(|mod| |trap| |efficiently|)
(|aux| |trap| |be|)
(|ncmod| |temperature| |low|)
(|ncmod| |at| |trap| |temperature|)

This format can easily be mapped to the same format as described in Section 4.2 to give counts of the number of times a particular verb occurs with a particular noun as its subject, object or prepositional object.
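One way the mapping might look is sketched below in Python. It handles only the two relation shapes that appear in the example (ncsubj and iobj) and rests on an assumed reading of the fields: that the last field of ncsubj names the underlying relation (so the passive subject centre counts as an object of trap), and that iobj lists preposition, verb and noun in that order. The function names are hypothetical.

```python
import re


def parse_gr(line):
    """Return the |...|-delimited fields of one grammatical relation line."""
    return re.findall(r"\|([^|]*)\|", line)


def gr_to_pair(line):
    """Map a relation line to a (verb, slot, noun) triple, or None."""
    fields = parse_gr(line)
    if len(fields) < 3:
        return None
    relation, args = fields[0], fields[1:]
    if relation == "ncsubj":
        verb, noun = args[0], args[1]
        # Assumption: a third argument gives the underlying relation.
        slot = args[2] if len(args) > 2 else "subj"
        return (verb, slot, noun)
    if relation == "iobj" and len(args) == 3:
        preposition, verb, noun = args
        return (verb, preposition, noun)
    return None  # modifiers, auxiliaries etc. are not verb-argument pairs


for gr in ["(|ncsubj| |trap| |centre| |obj|)",
           "(|iobj| |in| |trap| |protein|)",
           "(|detmod| |centre| |These|)"]:
    print(gr_to_pair(gr))
```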

As explained above, the TSG parses sequences of tags; however, it requires a different tagset from that produced by ltpos, namely the CLAWS2 tagset (Garside, 1987). To prepare the corpus for parsing with the TSG we therefore tagged it with Elworthy’s (1994) tagger and since this is a non-XML tool we used xmlperl to invoke it and to incorporate its results back into the XML mark-up. Sentences were then prepared as input to the TSG, which involved using xmlperl to replace words by their lemmas and to convert to ANLT input format:

These_DD2 centre_NN2 be_VBR efficiently_RR trap_VVN in_II protein_NN2 at_II low_JJ temperature_NN2

The lemmas are needed in order that the TSG outputs them rather than inflected words in the grammatical relations output shown above.


4.4 Compound Interpretation

Having collected two different sets of frequency counts from the entire OHSUMED corpus for verbs and their arguments, we performed an experiment to discover (a) whether it is possible to reliably predict semantic relations in nominalisation-headed compounds and (b) whether the two methods of collecting frequency counts make any significant difference to the process.

To collect data for the experiment we needed to add to the mark-up already created by the basic pipeline in Figure 3, (a) to mark up deverbal nominalisations with information about their verbal stem to give nominalisation-verb equivalences and (b) to mark up compounds in order to collect samples of two-word compounds headed by deverbal nominalisations. For the first task we combined further use of the lemmatiser with the use of lexical resources. In a first pass we used the morpha lemmatiser to find the verbal stem for -ing nominalisations such as screening and then we looked up the remaining nouns in a nominalisation lexicon which we created by combining the nominalisation list which is provided by UMLS (2000) with the NOMLEX nominalisation lexicon (MacLeod et al., 1998). As a result of these stages, most of the deverbal nominalisations can be marked up with a VSTEM attribute whose value is the verbal stem:

<W P='NN' LM='reaction' VSTEM='react'>reaction</W>
<W P='NN' LM='growth' VSTEM='grow'>growth</W>
<W P='NN' LM='control' VSTEM='control'>control</W>
<W P='NN' LM='coding' VSTEM='code'>coding</W>
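The two-pass assignment of verbal stems can be sketched in Python. Both resources here are toy stand-ins: toy_verb_lemma imitates the lemmatiser with a small dictionary, and the nominalisation lexicon contains only the examples above, so neither reflects the real tools or their coverage.

```python
# Toy stand-in for running the lemmatiser on an -ing form as if it were a verb.
TOY_ING_LEMMAS = {"screening": "screen", "coding": "code"}

# Toy nominalisation lexicon (noun -> verbal stem), for illustration only.
TOY_NOMINALISATIONS = {"reaction": "react", "growth": "grow",
                       "control": "control"}


def toy_verb_lemma(word):
    """Return the verbal lemma of an -ing form, or None if unknown."""
    return TOY_ING_LEMMAS.get(word)


def verbal_stem(noun, lemmatise_as_verb=toy_verb_lemma,
                lexicon=TOY_NOMINALISATIONS):
    """First pass: lemmatise -ing forms. Second pass: lexicon look-up."""
    if noun.endswith("ing"):
        stem = lemmatise_as_verb(noun)
        if stem and stem != noun:
            return stem
    return lexicon.get(noun)  # None means no VSTEM attribute is added


for noun in ["coding", "growth", "patient"]:
    print(noun, verbal_stem(noun))
```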

To mark up compounds we developed an fsgmatch grammar for compounds of all lengths and kinds and we used this to process a subset of the first two years of the corpus.

We interpret nominalisations in the biomedical domain using a machine learning approach which combines syntactic, semantic, and contextual features. Using the LT XML program sggrep we extracted all sentences containing two-word compounds headed by deverbal nominalisations and from this we took a random sample of 1,000 nominalisations. These were manually disambiguated using the following categories which denote the argument relation between the deverbal head and its modifier: SUBJ (age distribution), OBJ (weight loss), WITH (graft replacement), FROM (blood elimination), AGAINST (seizure protection), FOR (nonstress test), IN (vessel obstruction), BY (aerosol administration), OF (water deprivation), ON (knee operation), and TO (treatment response). We also included the categories NA (non applicable) for nominalisations with relations other than the ones predicted by the underlying verb’s subcategorisation frame (e.g., death stroke) and NV (non deverbal) for compounds that were wrongly identified as nominalisations.

We treated the interpretation of nominalisations as a classification task and experimented with different features using the C4.5 decision tree learner (Quinlan, 1993). Some of the features we took into account were the context surrounding the candidate nominalisations (encoded as words or POS-tags), the number of times a modifier was attested as an argument of the verb corresponding to the nominalised head, and the nominalisation affix of the deverbal head (e.g., -ation, -ment). In the face of sparse data, linguistic resources such as WordNet (Miller and Charles, 1991) and UMLS were used to recreate distributional evidence absent from our corpus. We obtained several different classification models as a result of using different marked-up versions of the corpus, different parsers, and different linguistic resources. Full details of the results are described in Grover et al. (2002); we only have space for a brief summary here. Our best results achieved an accuracy of 73.6% (over a baseline of 58.5%) when using the type of affixation of the deverbal head, the TSG, and WordNet for recreating missing frequencies.

5 Conclusions

We have performed a number of different NLP tasks on the OHSUMED corpus of MEDLINE abstracts ranging from low-level tokenisation through shallow parsing to deep syntactic and semantic analysis. We have used XML as our processing paradigm and we believe that without the core XML tools the task would have become extremely hard. Furthermore, we have built fully-automatic pipelines and have not resorted to hand-coding at any point so that our output annotations are completely reproducible and our resources are reusable on new data. Our approach of building a firm foundation of low-level tokenisation has proved invaluable for a variety of higher-level tasks.

The XML-annotated OHSUMED corpus which has resulted from our project will be useful for a number of different tasks in the biomedical domain. For this reason we are developing a web-site from which many of our resources (including the pipelines described in this paper) are available: http://www.ltg.ed.ac.uk/disp/. In addition, we provide various marked-up and tokenised versions of OHSUMED, including the output of the parsers described here.

References

Steven Abney. 1996. Partial parsing via finite-state cascades. In John Carroll, editor, Proceedings of Workshop on Robust Parsing at Eighth Summer School in Logic, Language and Information, pages 8–15. University of Sussex.

Ted Briscoe and John Carroll. 1993. Generalised probabilistic LR parsing of natural language (corpora) with unification grammars. Computational Linguistics, 19(1):25–60.

Ted Briscoe and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth ACL Conference on Applied Natural Language Processing, pages 356–363.

John Carroll and Ted Briscoe. 2001. High precision extraction of grammatical relations. In Proceedings of the 7th ACL/SIGPARSE International Workshop on Parsing Technologies, pages 78–89, Beijing, China.

Mark Craven and Johan Kumlien. 1999. Constructing biological knowledge bases by extracting information from text sources. In Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology (ISMB-99).

David Elworthy. 1994. Does Baum-Welch re-estimation help taggers? In Proceedings of the 4th ACL Conference on Applied Natural Language Processing, pages 53–58, Stuttgart, Germany.

Roger Garside. 1987. The CLAWS word-tagging system. In Roger Garside, Geoffrey Leech, and Geoffrey Sampson, editors, The Computational Analysis of English. Longman, London.

Gerald Gazdar, Ewan Klein, Geoff Pullum, and Ivan Sag. 1985. Generalized Phrase Structure Grammar. Basil Blackwell, London.

Claire Grover, Colin Matheson, Andrei Mikheev, and Marc Moens. 2000. LT TTT - a flexible tokenisation tool. In LREC 2000: Proceedings of the Second International Conference on Language Resources and Evaluation, pages 1147–1154.

Claire Grover and Alex Lascarides. 2001. XML-based data preparation for robust deep parsing. In Proceedings of the Joint EACL-ACL Meeting (ACL-EACL 2001).

Claire Grover, Mirella Lapata, and Alex Lascarides. 2002. A Comparison of Parsing Technologies for the Biomedical Domain. Submitted to Journal of Natural Language Engineering.

William Hersh, Chris Buckley, TJ Leone, and David Hickam. 1994. OHSUMED: an interactive retrieval evaluation and new large test collection for research. In W. Bruce Croft and C. J. van Rijsbergen, editors, Proceedings of the 17th Annual International Conference on Research and Development in Information Retrieval, pages 192–201.

Maria Lapata. 2000. The automatic interpretation of nominalizations. In Proceedings of the 17th National Conference on Artificial Intelligence, pages 716–721, Austin, TX.

Catherine MacLeod, Ralph Grishman, Adam Meyers, Leslie Barrett, and Ruth Reeves. 1998. NOMLEX: a lexicon of nominalisations. In EURALEX’98, pages 187–194.

Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn treebank: annotating predicate argument structure. In ARPA Human Language Technologies Workshop.

David McKelvie. 2000. XMLPERL 1.7.2. A Rule Based XML Transformation Language. http://www.cogsci.ed.ac.uk/~dmck/xmlperl

Andrei Mikheev. 1997. Automatic rule induction for unknown word guessing. Computational Linguistics, 23(3):405–423.

George A. Miller and William G. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28.

Guido Minnen, John Carroll, and Darren Pearce. 2000. Robust, applied morphological generation. In Proceedings of 1st International Natural Language Conference (INLG ’2000).

Ross J. Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.

Henry S. Thompson, Richard Tobin, David McKelvie, and Chris Brew. 1997. LT XML. Software API and toolkit for XML processing. http://www.ltg.ed.ac.uk/software/

UMLS. 2000. Unified Medical Language System (UMLS) Knowledge Sources. National Library of Medicine, Bethesda (MD), 11th edition.


���������������� ���������������������� ����� �!��� �"#���$�%&�')(*��� +$�� ��

,.-0/2143*57698;: <=14>?>A@CB+14DEGFIH*J=KMLMNPORQSNIT4UVFXWYLMZSUPUPNP[\ZSW]QSZ_^&T4`ba\K�T4LMa\KMc

d4e\e HfZSQhgRW]a\UPa\[�cjilkYmnThKMZo T4pq`]KMNsr][\ZYtuEGJ e;d\v\w�x t]yzi�J{S|R}l~]���u���������������S�R�������f���b�����X�����

��� 3;8;/269�n8���������2 ��¢¡h£¤�+�2¥�¡h¦�§¨��¡h��©2�2ª�©u«] ¬�+«�­9¡h£2��®2��§¨�]¯R¡¬¦°��±²��³¦� �¡h«´£2��¥�©��2 ���ª� �³V��¡¬¡h��ªG¥�«R����¡¬�Rµb«]ª¬¶]��¯2¦�·��;µ;��¯¤¸�©2ª¬«;���� � ��«]¯;¡h��¯;¡�µ\¹º�=³u��¥°¦���»]�º¡¬£2��¡9¦�¡� �£2«]�2¥°¸�³u��¶]ª¬«]�2¯2¸2��¸¢¦°¯�¡h£¤�¦�¯2­¼«�ª�§¨��¡h¦�«�¯½�������� � �§���¡h£2«;¸)£;�2§¨��¯2 ¨��ª��¾§¨«] ¬¡¨��«]§¨¿­¼«�ª�¡h�]³2¥���¹_¦�¡h£2À)¯2�]¡h�2ª���¥9¥���¯2¶]�2��¶]�RÁGÂ�«�¹=��»]��ª�µ2¡¬£2�¨Ã���¿ �«��2ª�����ÄG�� ���ª�¦�©2¡h¦�«�¯.Åuª¬�]§���¹º«]ª¬ÆÈǼÃ�ÄGÅ�Éhµ�¡h£¤�´­¼«]�2¯2¸2�]¿¡h¦�«�¯Ê«]­Ë¡h£¤�Ì®2��§¨��¯;¡¬¦°�̱Í��³�µ�¹=�� z¸2�� �¦�¶�¯2��¸Í¡¬«j³u�Ì���� �¿¦�¥°Î´©2ª¬«;���� � ¬��¸´³;ξ��«�§¨©2�2¡¬��ª� �µR¯¤«�¡_£;�2§¨��¯2 �Á9Ï!«Ìª¬��¯2¸2��ªÃ�ÄGÅЧ¨«�ª���­¼ª¬¦���¯2¸¤¥°ÎÑ¡¬«j£R�¤§��]¯2 �µf¹º��©2ª�«�©V«� ���¡¬«j���2¶]¿§¨��¯;¡0¦�¡&¹_¦°¡¬£Ì¯2��¡¬�2ª���¥2¥���¯2¶]�2��¶]����¯2¯2«]¡h�]¡h¦�«�¯2 �µ�«�ª&§¨��¡h��¿¸2�]¡h��¹_ª�¦°¡¬¡h��¯Ê¦�¯Ê��»���ª¬Î;¸2��ÎÒ¥°�]¯2¶��2�]¶��;Á+±Í���]ª¬¶]�2�¨¡h£2�]¡¯2�]¡h�2ª���¥]¥°�]¯2¶��2�]¶��&�]¯2¯2«�¡¬��¡¬¦°«]¯2 �µ�©2�]ª¬ ���¸z¦�¯R¡¬«���«�§¨©2�2¡¬��ª�¿ª�����¸¤��³2¥��ª¬��©2ª��� ���¯;¡h�]¡h¦�«�¯2 �µ;��ª��¢¯2«]¡�«�¯¤¥°Îj¦�¯R¡¬�2¦�¡h¦�»��¢�]¯2¸��ÓV����¡h¦�»��;µ\³2�2¡!����¯���¥� ¬«���������¥���ª���¡¬�Ë¡h£¤�º©2�]���º¹_¦°¡¬£+¹_£2¦���£¡h£¤�=®¤��§¨��¯;¡h¦���±²��³�¦� &³V��¦�¯2¶z��¸2«]©2¡h��¸�ÁÔ±Í��³u��¥°¦���»]��¡h£2�]¡«��¤ª¡h����£¤¯2«�¥�«�¶]ÎÑ���]¯Í­¼�]��¦�¥°¦�¡h�]¡h�Ì��£2�]©2©;Îѧ¨��ª�ª¬¦���¶]��³V��¿¡M¹=����¯+¯2��¡¬�2ª¬�]¥�¥���¯2¶]�2��¶]�&¡h����£2¯2«�¥�«�¶]Î��]¯2¸¢¡h£¤�Ë®2��§��]¯;¡h¦��±Í��³j»;¦° �¦°«]¯�Á

1 Introduction

The vision of the Semantic Web (Berners-Lee et al., 2001) is to convert existing Web information into a more machine-readable form, with the goal of making the Web more effective for users. This goal grew out of the recognition that a wealth of information readily exists today in electronic form; however, since this information lacks any machine-understandable semantics, it cannot be easily processed by computer systems.

Fundamentally, Semantic Web research is attempting to address the problem of information access: building systems that help users locate, collate, compare, and cross-reference content. As such, we believe that the Semantic Web should be motivated by and grounded in the method of information access most comfortable to users: natural language. Natural language is intuitive, easy to use and rapidly deployable, and requires no specialized training. In our vision, the Semantic Web will be equally accessible by computers using specialized languages and interchange formats, and humans using natural language. The scenario of being able to ask a computer "when was the president of Taiwan born," or "Find me the cheapest vacation package in the Bahamas this month" and getting back "just the right information" is very appealing.

Because the first step to building the Semantic Web is to transform existing sources (stored as HTML pages, in legacy databases, etc.) into a machine-understandable form (i.e., XML/RDF), it is sometimes at odds with a human-based natural language view of the world. Although the general framework of the Semantic Web includes provisions for natural language technology, the actual deployment of such technology remains largely unexplored.

In an effort to exploit the synergistic opportunities between the Semantic Web and natural language techniques, we propose three mechanisms for seamlessly integrating natural language technology into the Resource Description Framework. The first involves augmenting RDF property definitions. The second involves creating information access schemata to bridge the gap between language and RDF. The third mechanism proposes further extensions that attempt to mirror human question answering behavior in the form of natural language "query plans." All these mechanisms are based on the concept of natural language annotations, a technique which we have pioneered in the last decade. This technology has already been successfully used in START, the first question answering system available on the World Wide Web, and we believe that it provides a simple mechanism to marry natural language technology and the Semantic Web.

2 Natural Language Annotations

Use of metadata is a common technique for rendering information fragments more amenable to processing by computer systems. Using natural language itself as metadata presents several additional advantages: it preserves human readability, allows for easy querying, and encourages non-expert users to engage in metadata creation. To this end, we have developed natural language annotations (Katz, 1997), which are machine-parsable sentences and phrases that describe the content of various information segments. These annotations serve as metadata to describe the kinds of questions that a particular piece of knowledge is capable of answering.

To illustrate how this technology works, consider the following paragraph about Joseph Brodsky:

    For an all-embracing authorship, imbued with clarity of thought and poetic intensity, Joseph Brodsky was awarded the 1987 Nobel Prize in Literature.

This paragraph may be annotated with the following:

    Joseph Brodsky was awarded the Nobel Prize for Literature in 1987.
    1987 Nobel Prize for Literature

A question answering system would parse these two annotations and store the parsed structures (e.g., ternary expressions (Katz, 1988)) with pointers back to the original information segment. To answer a question, the user query, parsed into the same type of structures, would be compared against the annotations stored in the knowledge base. Because this match would occur at the level of parsed representations, linguistically sophisticated machinery such as synonymy/hyponymy relations, ontologies, and structural transformation rules (e.g., "S-Rules" (Katz, 1997; Katz and Levin, 1988)) could be brought to bear on the matching process. If a match were found, the segment corresponding to the annotation would be returned to the user as the answer.

Because sophisticated natural language processing could be invoked in matching questions with annotations, precision far beyond that of standard keyword-based information retrieval techniques could be achieved. In addition, a linguistically-based system allows for variations in user queries, e.g., alternate formulations, active/passive voice, nominalizations, etc. To give a more concrete example, the annotations above would allow a question answering system to answer the following questions (see Figure 1 for an example):

    What prize did Brodsky receive in 1987?
    Who was awarded the Nobel Prize for Literature in 1987?
    Tell me about the winner of the 1987 Nobel Prize for Literature.
    Who was the Nobel Prize for Literature given to in 1987?

An important feature of the annotation concept is that any information segment can be annotated: not only text, but also images, multimedia, database queries, and even procedures.

We have implemented the above technology in START¹ (Katz, 1988; Katz, 1997), the first question answering system available on the World Wide Web. Since it came online in December 1993, START has engaged in exchanges with hundreds of thousands of users all over the world, supplying them with useful knowledge. Currently, our system can answer millions of natural language questions about places (e.g., cities, countries, lakes; coordinates, weather, maps, demographics, political and economic systems), movies (e.g., titles, actors, directors), people (e.g., birth dates, biographies), dictionary definitions, and much, much more.

In order to give START uniform access to semistructured resources on the Web, we have created Omnibase (Katz et al., 2002), a virtual database that integrates numerous heterogeneous Web sources under a single query interface. To actually answer user questions, however, the gap between natural language questions and structured Omnibase queries must be bridged. Natural language annotations serve as the enabling technology that allows the integration of START and Omnibase. Since annotations can describe arbitrary fragments of knowledge, there is no reason why they can't be employed to describe Omnibase queries. In fact, annotations can be parameterized, that is, they can contain symbols representative of an entire class of objects. For example, the annotation "a person wrote the screenplay for imdb-movie" can be attached to an Omnibase query that retrieves the writers for various movies from the Internet Movie Database (IMDb).² The symbol imdb-movie serves as a placeholder for any one of the hundreds of thousands of movies that IMDb contains information about; when the annotation matches the user question, the actual movie name is instantiated and passed along to Omnibase. After Omnibase fetches the correct answer, START performs additional postprocessing, e.g., natural language generation, to present the answer (see Figure 2).

¹ http://www.ai.mit.edu/projects/infolab

Figure 1: START answering the question "Who won the Nobel Prize for Literature in 1987?"

Figure 2: START answering the question "Who wrote the screenplay for Good Will Hunting?" by extracting information from the Internet Movie Database and generating an appropriate response.

    <rdfs:Class ID="Country">
      <rdfs:comment>A Country in the CIA Factbook</rdfs:comment>
    </rdfs:Class>

    <rdf:Property ID="population">
      <rdfs:domain rdf:resource="#Country"/>
      <rdfs:range rdf:resource="xsd:string"/>
      <nl:ann text="many people live in ?s"/>
      <nl:ann text="population of ?s"/>
      <nl:gen text="The population of ?s is ?o"/>
    </rdf:Property>

Figure 3: Augmenting an ontology about the CIA World Factbook with natural language annotations.
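The matching of a parameterized annotation against a user question can be sketched roughly as follows. This is a minimal illustration, not START's or Omnibase's actual implementation: the questions and the annotation are assumed to be already parsed into simple (subject, relation, object) triples, and the names `MOVIE_WRITERS`, `SYMBOL_CLASSES`, `match`, and `answer` are hypothetical stand-ins for the parser, the symbol lexicon, and the structured source.

```python
from typing import Optional

# Toy stand-in for a structured source such as Omnibase (hypothetical layout).
MOVIE_WRITERS = {"Good Will Hunting": ["Matt Damon", "Ben Affleck"]}

# Symbols used inside annotations and the lexical items they stand in for.
SYMBOL_CLASSES = {"imdb-movie": set(MOVIE_WRITERS)}

# "a person wrote the screenplay for imdb-movie", written by hand as a
# (subject, relation, object) triple; a real system would obtain it by parsing.
ANNOTATION = ("person", "write screenplay for", "imdb-movie")


def match(annotation, question) -> Optional[dict]:
    """Return symbol bindings if the parsed question matches the annotation."""
    a_subj, a_rel, a_obj = annotation
    q_subj, q_rel, q_obj = question
    if a_rel != q_rel:
        return None
    # "who" asks for the person slot of the annotation.
    if not (q_subj == "who" and a_subj == "person"):
        return None
    # The object must belong to the class of items the symbol stands for.
    if q_obj not in SYMBOL_CLASSES.get(a_obj, set()):
        return None
    return {a_obj: q_obj}


def answer(question) -> Optional[str]:
    """Instantiate the symbol, query the toy source, and generate a sentence."""
    bindings = match(ANNOTATION, question)
    if bindings is None:
        return None
    movie = bindings["imdb-movie"]
    writers = MOVIE_WRITERS[movie]
    return f"{' and '.join(writers)} wrote the screenplay for {movie}."


# "Who wrote the screenplay for Good Will Hunting?" after (assumed) parsing:
parsed = ("who", "write screenplay for", "Good Will Hunting")
print(answer(parsed))
```

The point of the sketch is only the division of labor: the annotation carries the linguistic pattern, the symbol is bound to a concrete item at match time, and the bound value parameterizes the query to the structured source.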

3 Towards Human-Friendly RDF

The Resource Description Framework (RDF) (Lassila and Swick, 1999; Brickley and Guha, 2002), the standardized Semantic Web language for describing metadata, was meant for consumption by computers, not humans. Given this philosophy, how can we be sure that we're creating useful metadata? How can we be sure that our ontologies mirror the way users organize and think about content? Since the final beneficiary of the Semantic Web should be the average user, we advocate a human-centered organization of metadata³ grounded in natural language.

² http://www.imdb.com/

³ It is true that many parts of the Semantic Web will never have any contact with humans, and may be created only for the benefit of software agents, e.g., inventory management systems communicating with warehouses. For these applications, natural language may not be necessary. Nevertheless, a large fraction of the Semantic Web involves end users, where we believe natural language forms the best information access medium.

In 1997, we proposed to attach natural language annotations to everything available on the Web (Katz, 1997). Furthermore, we described a distributed mechanism for knowledge gathering:

    By allowing thousands of people to build up knowledge about knowledge, we will create a knowledge base of an interesting form. The Web will continue to be built out of "opaque" information segments: text, maps, charts, audio, video, etc.; but attached to each of these will be natural language annotations that facilitate retrieval. By giving humans access to relevant information that humans can further interpret and understand, we will transform the Web into an intelligent, high performance knowledge base.

The Semantic Web provides many of the mechanisms required to realize this dream. In this paper, we describe three concrete proposals for leveraging Semantic Web research to accomplish our vision: First, we propose to embed natural language annotations directly in RDF property definitions to facilitate language-based querying. Second, we propose the use of information access schemata, an extension of the schemata currently being used by START, to capture patterns of user requests. Third, we propose even more "natural" (but somewhat less powerful) information access schemata that would allow ordinary users to become skillful knowledge engineers.

3.1 Simple Properties

The foundation of the Semantic Web rests on RDF statements, which are essentially triples denoting objects, properties, and values. An alternative and often used description of RDF statements is in terms of "subject," "relation," and "object," revealing a grammatical basis for RDF constructs. In fact, the RDF triples are very similar, both in spirit and in form, to our ternary expression representation of natural language (Katz and Winston, 1982; Katz, 1988).

We propose to make this connection more explicit by augmenting rdf:Property definitions with natural language annotations. Figure 3 illustrates our proposal, using a fragment of an ontology representing the CIA World Factbook.⁴ Intuitively, the population property is a relation connecting a country to its population value.

⁴ http://www.cia.gov/cia/publications/factbook/

Natural language annotations express this connection concretely in natural language sentences and phrases, via the nl:ann property. For example, the phrase "population of ?s" is linked to every RDF statement involving the population property; ?s is shorthand for indicating the subject (domain) of the relation. From this, a natural language-aware software agent could answer the following English questions without forcing the user to learn and use precisely defined ontological terms:

    How many people live in Kiribati?
    What is the population of the Bahamas?
    Tell me Guam's population.

In addition, the nl:gen property specifies a natural language rendition of the knowledge, allowing software agents to present meaningful, natural sounding responses to users.

By "hooking" natural language annotations directly into RDF property definitions, we can not only ensure that our ontologies "make sense" to a user, but also provide natural language question answering and generation capabilities with minimal additional knowledge engineering overhead.

3.2 Information Access Schemata

Despite the simplicity of adding natural language annotations to RDF properties directly, there is a significant restriction to the types of questions that this technique can answer, namely, only one RDF statement can be queried at once. We propose to overcome this limitation by creating schemata that capture similar patterns of information access. For example, consider this "family" of questions:

    What is the country in Africa with the largest area?
    Tell me what Asian country has the highest population density.
    What country in Europe has the lowest infant mortality rate?
    What is the most populated South American country?

We propose to capture this "pattern" of information requests in an information access schema, shown in Figure 4. Here, natural language annotations are employed to describe a pattern of RDF statements.⁵ More formally, an information access schema is a quadruple:

• Annotations: natural language sentences (either declarative or interrogative) or phrases that describe the types of user questions this schema can answer. These sentences and phrases can contain special symbols that stand in for whole classes of lexical items, e.g., $country might stand in for any country in the CIA World Factbook. This allows annotations to be parameterized for greater knowledge coverage.

⁵ Because annotations would be processed by linguistically-sophisticated systems, different adjectives such as "highest" and "largest" could be uniformly mapped onto the maximum operation.

    <nl:InformationAccessSchema>

      <nl:ann>What country in $region has the largest $attribute</nl:ann>

      <nl:pattern>?x a :Country</nl:pattern>
      <nl:pattern>?x map($attribute) ?val</nl:pattern>
      <nl:pattern>?x :location $region</nl:pattern>

      <nl:action>display(boundto(?x, max(?val)))</nl:action>

      <nl:mapping>
        <nl:hash variable="$attribute">
          <nl:map value="population">:population</nl:map>
          <nl:map value="area">:area</nl:map>
          ...
        </nl:hash>
      </nl:mapping>

    </nl:InformationAccessSchema>

Figure 4: An information access schema for superlatives (for simplicity, only one annotation is shown).

• Pattern: a declarative pattern of RDF triples (expressed in N3 (Berners-Lee, 2000) for sake of brevity) that references a pre-existing ontology. The patterns can contain the same special symbols used in the annotations, as well as introduce new unbound variables. Usually, the pattern is written in such a way that when it is satisfied, particular variables would be bound to the answer.

• Action: a set of operators to further process variables bound during the pattern matching process. The actions could be as simple as displaying the output, but also allow for more complex operations, e.g., aggregation (count, average, max, min, etc.) and comparison (<, >, etc.). In order to present users with natural sounding answers, natural language generation may be involved.

• Mapping: an optional hash for specifying the bindings between special symbols used in the natural language annotations and RDF resources. For example, the RDF property :area might be lexicalized as "land area," "area," or "size"; the mapping provides the mechanism for handling this disjunction between lexical and ontological terms.
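The four components listed above can be illustrated with a small sketch. It is not the authors' implementation: the triple store holds made-up facts about fictional countries, the question is assumed to be already matched against the annotation (so the symbol values are passed in directly), and the names `InformationAccessSchema`, `solve`, `bound_to_max`, and `answer` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Triple = Tuple[str, str, object]
Binding = Dict[str, object]

# Made-up facts about fictional countries (not taken from any real source).
TRIPLES: List[Triple] = [
    ("Alphaland", "a", ":Country"),
    ("Alphaland", ":location", "North"),
    ("Alphaland", ":area", 10.0),
    ("Alphaland", ":population", 500.0),
    ("Betaland", "a", ":Country"),
    ("Betaland", ":location", "North"),
    ("Betaland", ":area", 30.0),
    ("Betaland", ":population", 200.0),
    ("Gammaland", "a", ":Country"),
    ("Gammaland", ":location", "South"),
    ("Gammaland", ":area", 99.0),
    ("Gammaland", ":population", 900.0),
]


@dataclass
class InformationAccessSchema:
    annotations: List[str]                # parameterized natural language
    pattern: List[Tuple[str, str, str]]   # triple patterns with ?vars and $symbols
    action: Callable[[List[Binding]], Optional[object]]
    mapping: Dict[str, str]               # lexical attribute -> RDF property


def solve(pattern, triples, binding=None) -> List[Binding]:
    """Naive backtracking join of the triple patterns against the store."""
    binding = binding or {}
    if not pattern:
        return [binding]
    first, rest = pattern[0], pattern[1:]
    results: List[Binding] = []
    for triple in triples:
        candidate = dict(binding)
        matched = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    matched = False
                    break
            elif term != value:
                matched = False
                break
        if matched:
            results.extend(solve(rest, triples, candidate))
    return results


def bound_to_max(solutions: List[Binding]) -> Optional[object]:
    """The 'maximum' action: return the ?x whose ?val is largest."""
    if not solutions:
        return None
    return max(solutions, key=lambda b: b["?val"])["?x"]


SUPERLATIVE = InformationAccessSchema(
    annotations=["what country in $region has the largest $attribute"],
    pattern=[("?x", "a", ":Country"),
             ("?x", ":location", "$region"),
             ("?x", "$attribute", "?val")],
    action=bound_to_max,
    mapping={"land area": ":area", "area": ":area", "size": ":area",
             "population": ":population"},
)


def answer(schema: InformationAccessSchema, symbols: Dict[str, str]):
    """Substitute symbol values into the pattern, solve it, apply the action."""
    values = dict(symbols)
    # The lexical attribute goes through the mapping hash first.
    values["$attribute"] = schema.mapping[symbols["$attribute"]]
    pattern = [tuple(values.get(term, term) for term in tp) for tp in schema.pattern]
    return schema.action(solve(pattern, TRIPLES))


print(answer(SUPERLATIVE, {"$region": "North", "$attribute": "land area"}))  # Betaland
```

Keeping the mapping separate from the pattern means that adding a new lexicalization such as "size" touches only the hash, while the pattern and action stay untouched.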

Page 41: Proceedings of the - Entinen yleisen kielitieteen laitos ...gwilcock/NLPXML-2002/NLPXML-2002-Proceedin… · We are pleased to introduce the proceedings of the 2nd Workshop on NLP

The schema in Figure 4 could provide answers to questions that involve region-specific superlative comparison of countries in the world. The pattern binds to the value of the particular attribute for countries within the queried geographic region, and the action specifies an aggregate operation (maximum) over the values bound within the pattern. The country corresponding to that maximum value is returned as the answer.⁶ The mapping provides a translation from (lexicalized) language attributes to RDF properties. Note that information access schemata are written with respect to a particular pre-existing ontology; for this example, we assume that an appropriate ontology has been established (i.e., :Country is defined as a class, and :location is defined as a property).

In our vision of the Semantic Web, information access schemata grounded in natural language would co-exist alongside RDF metadata. These schemata could be distributed (e.g., embedded directly into Web pages) or centralized; either way, a software agent would compile these schemata into a question answering system capable of providing natural language information access to users.

Figure 5 provides another example of an information access schema that could allow a linguistically aware software agent to answer the following questions:

    Is Canada's coastline longer than Russia's coastline?
    Which country has the larger population, Germany or Japan?
    Is Nigeria's population bigger than that of South Africa?

h=¯2���Ê��¶]��¦�¯�µ_�) �¡MÎR¥�¦�·���¸7¥���¯2¶]�2��¶]�Ò��¯¤¯2«�¡¬��¡h¦�«�¯ô¸2��¿ ���ª�¦°³V�� ¨��¯½Ã�ÄGÅÙÆ;¯2«�¹_¥°��¸2¶��´­¼ª¬�]¶�§¨��¯;¡�¡h£2�]¡���«�¯;¡h�]¦°¯¤ ¡h£¤����¯2 �¹º��ª�Á*ç  ¬©V����¦°î¤�� ¬��éR�¤��¯2���â«]­���«�§¨§¨��¯2¸2 Ì��à;¿¡hª�����¡h _¡¬£2�¦°¯¤­¼«�ª�§��]¡h¦�«�¯â��¯2¸âª���¡h�¤ª¬¯2 _¦�¡_¡h«Ì¡¬£2��2 ���ª�ÁÝ�ÞoÝ ,:é�áo÷�%.-�/3èvêOý3æ0èf!«�¯¤ ¬¦�¸2��ª´¡h£2�Íé;�2�� �¡h¦�«�¯ ë ¹_£2��¡´¦° â¡h£¤�²¸2¦� �¡h��¯¤���Ñ­¼ª�«�§�R��©2�]¯+¡h«®¤«��2¡¬£=�«]ª¬����b¬ì�ç�ª¬���� �«�¯2�]³2¥��º�]¯2 ¬¹=��ª9¹=«��2¥�¸³V�Ò¡h«½��«]§�©¤�2¡h�²¡h£2�²¸2¦° �¡h�]¯2���²³u��¡M¹º����¯ï¡¬£2��¦�ªâª¬�� ¬©V����¿¡h¦�»��Ú����©¤¦°¡¬��¥� �Á ç ©u��ª¬ �«�¯A­¼�]����¸ñ¹_¦°¡¬£Ù¡h£2¦� Òé;�2�� ¬¡¬¦°«]¯§¨¦°¶]£;¡´î2ª� �¡´î2¯¤¸ô¡¬£2�²����©2¦�¡h�]¥° j«]­¢¡h£2�Ê¡M¹=«Ú��«��2¯;¡hª�¦��� �µ��¯¤¸�¡¬£2��¯¨��«�§¨©2�2¡¬�+Ǽ«�ª&¥�«;«�Æ��2©�ÉV¡h£¤�_¸2¦° �¡h�]¯2���_³V��¡M¹=����¯¡h£¤«� �����¦°¡¬¦°�� �Á*a&��«�©2¥��¢¶���¯2��ª���¥�¥�Î⣤��»��¢¯2«â¸2¦10¨���2¥�¡MÎ�¸2��¿ ���ª�¦°³2¦�¯2¶Ì¦�¯�¯2��¡¬�2ª���¥�¥���¯¤¶��2�]¶��z� ë ©2¥���¯¤ìÌ­¼«�ª��]¯2 �¹º��ª¬¦�¯2¶é;�2�� �¡h¦�«�¯¤ �¡h£2�]¡_ª¬��é;�2¦°ª��§+�¤¥°¡¬¦°©2¥��¢«]©u��ª¬�]¡h¦�«�¯2 _­¼ª�«�§C¸2¦°­¼¿­¼��ª���¯;¡� �«��¤ª¬���� �Á f!«��¤¥°¸È£;�2§¨��¯2  ë ¡¬������£¤ìÍ ��2��£È©2¥���¯2 ¡h«½�Ú��«�§¨©2�2¡¬��ª¾¸¤¦°ª�����¡h¥�Î�b f!�2ª�ª���¯;¡h¥�ÎRµ=¡¬£2�²�]¯2 �¹º��ª�¦° ¯2«fµ&³V�����]�2 ��â��à;¦° �¡h¦�¯2¶Ê§¨����£2�]¯2¦° �§¨ Ì«�­qÆR¯¤«�¹_¥°��¸2¶��â����¿é;�2¦� ¬¦�¡h¦�«�¯�ª���é;�2¦�ª¬�+­¼��§¨¦°¥�¦���ª�¦°¡Mδ¹_¦°¡¬£Ñ©2ª�����¦� ��¢«�¯;¡h«]¥°«]¶�¦��� �µ

2�33˶³o»vÂ�¹vÒ�»�³o»vº`¼�¾vµxÉAº\·R´¶¾2³o»v´¶µ�µ¸Å�»vº\½�¼9´¶µRµx´¶½�À�˶¿9³¸»vº`Å\Â�¹v¾O³¸·¸¿¾v¼�½�º1Í�´¶³2Å\Â�¹v˶ÎüÈ.º¦Å&Â�¹�Å�»vº\Î�´¶¾Ù¼Ç¾v¼�³¸¹�·x¼�Ë�˶¼�¾vÒ�¹v¼�Ò�º�Ò�º&¾�º&·¸¼�³¸º\εxº\¾O³¸º\¾vÅ\º�ÃÐÂ�·�È.º\³¸³oº\·�À�·xº\µxº\¾O³¸¼�³o´¶Â�¾d³¸Â2³¸»vº9¹�µxº\·�Ó

k �R�Am�pS�6lY}l~��;���]��}��.y�wvw��������.w"{R�S�;�vu

k �R��mX���\�0u � w\}S���Y��~���465874� � �����½�\�Í����~v�Y��~�1{R��� � w\}����Y�\~���4�9:7M� � ����� kv� �R��mX���\�0u

k �R��m �R�l������~���uQ�@�� monY}S���Y��~�� kv� �R��m �R�������l~��0uk �R��m �R�l������~���uQ�@��;����� � �����"���@�Y����465 k�� �R��m �]��������~��0uk �R��m �R�l������~���uQ�l�� monY}S���Y��~�� kv� �R��m �R�������l~��0uk �R��m �R�l������~���uQ�l��;����� � �����"���@�Y����4�9 k�� �R��m �]��������~��0u

k �R��mX�Qw��]�\}��0u������F�R�\�����x�����&�@������4;5b�\�@�]����4�9��<�k�� �R��mX�vwl�]�\}���u

k �R��m��;�F�v�;�S�.�.uk �R��m {]���F{�����~��\��|R���Orts � ���\��~]�S|�������sFuk �R�Am��;�������\���R�1rts��R}����R�\���]�\}��HsFum �R}F���R�������\}��kv� �]��m��;����uk �R�Am��;�������\���R�1rts���~�����s"umX��~Y���kv� �]��m��;����u�����

kv� �R��m {R���F{0uk�� �R��m��R���v�;�S�Q�6u

kv� �]��m�pS�6l�}�~��;�����\}��.y�w�w\�������QwF{R�S�;��u

Figure 5: An information access schema for comparisons

k �R�Am�pS�6lY}l~��;���]��}��.�����S���;�S�.�<�.wF{R�S�R�Qu

k �R��mX���\�0u�������������w\�²|R��� ° ����� � w�}����Y��~\�85����� � w�}����Y��~\��9 kv� �S���0u

k �R��m �R�\���0uk ~��vl`m�����>.uk ~���l`mX����u ° {R���È���Ñ�O{R�äw\���R�l�����.}1l � w\}��\�Y��~��85

mor$�vw\���;�������65 k�� ~��vl`mI����uk ~���l`mX����u ° {R���È���Ñ�O{R�äw\���R�l�����.}1l � w\}��\�Y��~���9

mor$�vw\���;��������9 k�� ~��vl`mI����uk ~���l`mX����u ° {R���È���Ñ�O{R�²������������w\�Ñ|]��� ° ���S�

�vw����;�l���\�65��������w\���;�l�Y����9mor$�������l�Y����w\� k�� ~��vl`mI����u

kv� ~��vl�m�����>.uk�� �R��m �]�����0u

k �R��mX�Qw��]�\}��0u������F�R�\�����&���Y���l�����0w���� kv� �R�AmX�Qw��]�\}S��u

kv� �]��m�pS�6l�}�~��;�����\}��.���\�����;�S�Q���.wF{R���b�vu

Figure 6: An information planning schema


 �«�§¨��¡¬£2¦°¯¤¶â¡¬£2��¡z����¯2¯2«]¡q³V�̪�����¥�¦� ¬¡¬¦°����¥�¥°Îj��à;©V����¡h��¸Ê­¼«]ª��¥�¥=�2 ¬��ª¬ �ÁâÄG�� �©2¦°¡¬�´£¤��»R¦�¯2¶Ñ©¤¥°��¯R¡MÎÍ«�­_��«�§¨§¨«�¯* ���¯¤ ¬�;µ§¨«� �¡��2 ¬��ª¬ �����¯2¯2«]¡º³V����«�§¨�z��ÓV����¡h¦�»��zÆ;¯2«�¹_¥°��¸2¶��G��¯¤¶�¦�¿¯2����ª� �Á�±Í�q©2ª�«�©V«� ���¡¬«+�2¡¬¦°¥�¦�·��z¯2��¡¬�2ª���¥V¥°�]¯2¶��2�]¶����]¯2¯2«]¿¡h�]¡h¦�«�¯2 Ë¡h«¢�]¸2¸2ª��� � &¡h£2¦� Ë¸2¦10¨���2¥�¡MÎ�¦°¯Ì¦�§¨©2��ª�¡h¦�¯2¶Æ;¯2«�¹_¥�¿��¸¤¶��z¡h«Ì��«�§¨©2�2¡¬��ª� �Á±²�¨¡h£¤¦°¯2Æ�¡¬£2��¡ ë ¦°¯2­¼«]ª¬§¨�]¡h¦�«�¯Ñ©2¥���¯¤¯2¦°¯¤¶´ ���£2��§¨�]¡h�Vµ ì��¯ã��àR¡¬��¯2 �¦�«�¯ã«�­â¡h£2�Ц°¯2­¼«]ª¬§¨�]¡h¦�«�¯ã�������� � . ���£2��§¨��¡¬�¡h����£2¯2«]¥°«]¶�Îï¸2�� ¬��ª¬¦�³V��¸ü¦°¯ü¡¬£2�*©2ª���»;¦�«��2 � �����¡h¦�«�¯�µq���]¯¸2ª���§¨��¡¬¦°����¥�¥°Î� �¦�§�©¤¥°¦�­¼Îï¡h£¤�.¡h�� �Æ�«�­�Æ;¯2«�¹_¥°��¸2¶��Í��¯¤¶�¦�¿¯2����ª�¦°¯2¶fÁ�çq¯���à;��§¨©2¥���«�­G«��2ª¾©2ª¬«]©u«] ¬�]¥_¦° ¾ ¬£2«�¹_¯È¦�¯Å9¦°¶]�2ª��A?VÁÑêM¯2 �¡h���]¸*«�­_¹_ª�¦�¡h¦�¯2¶ÑÃ�ÄGÅA©2��¡¬¡h��ª¬¯2 �µ!¹_£2¦���£¹=«��2¥�¸²ª���é;�2¦�ª���Æ;¯2«�¹_¥���¸2¶]�+«]­Ë¸2«]§��]¦°¯¤¿I �©u����¦�î2�¨«�¯;¡h«]¥°«]¿¶�¦��� �µ�¹=�_��«]�2¥�¸Ì�2 ¬�_¯¤��¡h�¤ª¬�]¥2¥°�]¯2¶��¤��¶��_¦�¡h ���¥�­f¡h«¸2�� ¬��ª¬¦�³V�¡h£¤�Ò©2ª�«;���� � ¾«]­��¯2 �¹=��ª�¦°¯2¶Ú�*é;�2�� �¡h¦�«�¯�ÁÙÏ_£2�Í��¯2 �¹º��ª©2¥���¯jÇ �R��m �R����� Éfª¬��B2����¡h 9¡h£¤�=�¤ ¬��ªv�  9¡h£2«]�2¶�£;¡!©2ª¬«;���� � !��à;¿©2ª��� � ¬��¸.¦°¯½¯2��¡¬�2ª���¥Ë¥���¯¤¶��2�]¶��Ræ+î2ª� ¬¡�î2¯2¸*¡¬£2�â����©¤¦°¡¬��¥� «�­�¡h£2�G��«]�2¯;¡hª�¦°�� �µ���¯2¸¾¡h£2��¯¾î2¯2¸â¡h£¤�q¸2¦� �¡h��¯¤����³V��¡M¹=����¯¡h£¤«� ��G��¦°¡¬¦°�� �ÁÏ_£2�*��³V«�»��.��à;��§¨©2¥��.��«��¤¥°¸A�]¯2 ¬¹=��ªÒ¡h£2�*­¼«]¥°¥�«�¹_¦�¯2¶é;�2�� �¡h¦�«�¯¤ �æ

ð=a � �MT4KËN��ËLMgRZ�yºWRNPLMZSrji�L�ThLMZ"���°KMa\pC*nm0���MNIT <8ÊgnT4LOô �jLMgRZ²rRN��ML�ThW]QSZÍ`bZSL � ZSZ�W�ó!ZSKMpÌThWYcÙThW]rþ�W][\UIThW]r <Ï_£2¦° Í§¨��¡¬£2«;¸ñ«�­Ì �©V����¦�­¼Î;¦°¯¤¶Ð ���£2��§¨��¡¬�ï�� � ���¯;¡h¦���¥�¥°Î ���ª�»��� �¡h«²����©¤¡h�2ª���¡h£¤�´¦�¯;¡h�2¦�¡h¦�»��´¡h£2«]�2¶�£;¡+©¤��¡h¡¬��ª�¯2 �«�­�)£R�¤§��]¯�µ_�]¯2¸Ð�]¥°¥�«�¹_ â«�ª�¸2¦°¯¤��ª�ν�2 ���ª� â¡h« ë ¡h������£2ìÚ���«]§¨©2�2¡h��ª_Æ;¯2«�¹_¥°��¸2¶��G�2 �¦°¯¤¶Ì¯2��¡¬�2ª¬�]¥n¥°�]¯2¶��¤��¶��;ÁÝ�Þ�D öO÷�êOæ"%Hèvùtê1áo÷�%$êOý3æFEÇý3èvæ0æ # æ�êOý3éHG3ëÏ_£2�G¡h£2ª����G©2ª�«�©V«� ���¸�§¨��¡h£2«;¸2 _­¼«]ªº¦�¯;¡h��¶]ª¬�]¡h¦�¯2¶+¯¤��¡h�¤ª¬�]¥¥���¯2¶]�2��¶]����¯2¸½Ã�ÄGÅÙ���]¯Ú³V�´�2 ���¸½¡h«]¶���¡¬£2��ª�¡h«Í��ÓV«�ª�¸¶�ª����]¡h��ªIB¤��à;¦°³¤¦°¥�¦°¡MÎ;Á�ç�¯2¯2«]¡h�]¡h¦�¯2¶jÃ�ÄqÅ7©2ª�«�©V��ª�¡h¦��� z¦� �¥�«�¹_¿I��«� �¡_Çõ­¼ª¬«]§ñ�zÆ;¯2«�¹_¥���¸2¶]�=��¯2¶�¦�¯2����ª�¦°¯2¶z©V��ª� ¬©V����¡h¦�»��;ɹ=��ÎÚ«�­z©2ª�«�»;¦°¸¤¦°¯2¶.¯¤��¡h�¤ª¬�]¥�¥°�]¯2¶��¤��¶��j�]������ ¬ ¾¡h«*Ã�ÄGÅ �¡h��¡¬��§¨��¯;¡¬ �Á êM¯2­¼«]ª¬§¨�]¡h¦�«�¯ �]������ � ½ ���£2��§¨�]¡h�Vµ´¹_£2¦�¥��³V��¦�¯2¶ñ§¨«�ª��Ð��«�§¨©2¥���àû��¯2¸ ª¬��éR�¤¦°ª�¦°¯¤¶ÙÆ;¯2«�¹_¥°��¸2¶��Ы�­¸2«]§��]¦°¯¤¿I �©u����¦�î2�¨«�¯;¡h«]¥°«]¶�¦��� �µu¶�¦�»�����à;©u��ª¬¦���¯¤����¸ÍÆ;¯2«�¹_¥�¿��¸¤¶��Ò��¯2¶�¦�¯2����ª� ´î2¯¤��¿X¶�ª���¦�¯2��¸�¡¬«R«]¥° â­¼«�ªâ§¨��¯2¦�©2�2¥���¡¬¦°¯¤¶Ã�ÄGÅÐ�]¯2¸Ê��«�¯;¡¬ª¬«]¥°¥�¦°¯¤¶´¡¬£2�Ì«��2¡¬©2�2¡�Á�Å!¦�¯2�]¥°¥�ÎRµÔ¦°¯2­¼«]ª¬§¨�]¿¡h¦�«�¯½©2¥���¯2¯2¦�¯2¶Í ¬��£¤��§¨��¡¬�Ò��¥�¥�«�¹ð�2 ¬��ª¬ ¨¡¬«²¸2�� ¬��ª¬¦�³V�Rµ9¦�¯¯2�]¡h�2ª���¥R¥°�]¯2¶��¤��¶��º¦°¡¬ ¬��¥°­hµ\£¤«�¹*¡h£2��ιº«]�2¥�¸+¶�«q��³V«��2¡!��¯¤¿ �¹º��ª¬¦�¯2¶Ì��©2�]ª¬¡¬¦°���2¥���ª���¥°�] ¬ �«]­9éR�¤�� �¡h¦�«�¯2 �ÁºÏ_£2�� ��¢¡h£¤ª¬���§¨��¡¬£2«;¸2 �����¯j��«]§+³2¦�¯2�¡h«Ì©¤ª¬«�»;¦�¸2�z¡h£2�¢­¼«]�2¯2¸2�]¡h¦�«�¯�­¼«]ªé;�2�� �¡h¦�«�¯â��¯¤ ¬¹=��ª�¦°¯¤¶�«]¯´¡¬£2�®2��§¨�]¯R¡¬¦°�±Í��³!Á�¥�¡h¦�§��]¡h��¥°Î;µb¡h£¤�����¡¬�2��¥n¸¤��¡h�]¦°¥� �«]­�¯2�]¡h�2ª���¥n¥���¯¤¶��2�]¶����¯¤¯2«�¡¬��¡h¦�«�¯¤ � ¬£¤«��2¥�¸�³u��£2¦�¸2¸2��¯Ò­¼ª¬«]§C¡¬£2�+�2 ���ª�³u��£2¦°¯¤¸c2�ên�]�2¡h£¤«�ª�¦°¯2¶�¡h«;«�¥� �µ� �«�¡h£¤��¡�£¤�Ë«�ª� 
¬£¤�˯2����¸¢¯2«]¡���«�§¨�¦�¯R¡¬«¸2¦�ª¬����¡0��«�¯;¡h�]��¡&¹_¦�¡h£Ìø�öÒÜÑ«�ª&Ã�ÄGÅ�ÁYçq¸2¸2¦�¡h¦�«�¯2�]¥°¥�ÎRµ��¯½���2¡¬£2«�ª�¦�¯2¶Ò¡¬«R«]¥º��«��2¥�¸Ú©¤ª¬��¿I©2�]ª¬ ���¡¬£2�´¯¤��¡h�¤ª¬�]¥=¥���¯¤¿¶��¤��¶����]¯2¯2«]¡h��¡¬¦°«]¯2 ��]¯2¸È ¬¡¬«�ª���¡h£2«] ¬�Ѫ���©¤ª¬�� ¬��¯R¡¬��¡¬¦°«]¯2 Ç¼�� ¬ ���¯;¡¬¦°�]¥°¥�Îq¡¬ª¬¦�©2¥��� !¡h£2��§¨ ¬��¥°»]�� �Éb��¥�«�¯¤¶� �¦°¸2�Ë¡h£2�º��¯2¯2«]¡h�]¿¡h¦�«�¯¤ �Á J´±�¦�¡h£â³V«�¡h£¾¯2��¡¬�2ª¬�]¥u¥���¯¤¶��2�]¶����]¯2¸�©2�]ª¬ ���¸�ª���©¤¿ª��� ���¯;¡h�]¡h¦�«�¯2 &�]¡0¡h£2��¦°ªº¸2¦� ¬©V«� ���¥Mµ� �«�­¼¡M¹=��ª�����¶���¯;¡h 0¹=«��2¥�¸

KxÆ�´¶³o»vÂ�¹v³�¼�¾d¼�¹�³¸»vÂ�·¸´¶¾vÒ¦³¸ÂOÂ�ËoÍvµx¹vÅ�»d¼4µ¸Å�»vº\½�º�ÉAÂ�¹v˶Îd¾vÂ�³�È6ºÃ>º&¼�µ¸´¶Èv˶º�È.º\Å&¼�¹�µxº`ÉAº�Å\¼�¾v¾vÂ�³�º LOÀ.º&Å\³H»1¹v½2¼�¾vµ�³oÂ�½�¼�¾O¹v¼�˶˶¿�Ò�º&¾�º&·¸Ö¼�³oº3Àv¼�·¸µxº�µx³o·x¹vÅ\³o¹v·xº\µFÓ�×�Â�³¸º`¼�˶µx³o»v¼�³�M�º\º\Àv´¶¾vÒ¾v¼�³o¹v·x¼�Ë.˶¼�¾vÒ�¹�¼�Ò�º

£2��»]����»���¯Í¶]ª¬����¡¬��ª@B2��à;¦�³2¦�¥°¦�¡Mβ¦�¯Ê§¨��¯2¦�©2�2¥���¡¬¦°¯2¶�§¨��¡h��¿¸2�]¡h�VÁ

N ªì�HO¦�h-�@�14D��û8"P��RQ��Ô>û6!D�8;14�RS � �±Í�²³V��¥�¦°��»��²¡h£2�]¡´¯2�]¡h�2ª���¥G¥���¯2¶]�2��¶]�Ñ��¯2¯¤«�¡h�]¡h¦�«�¯2 ´��ª��¯2«]¡Ì«�¯2¥�Î*��¯)¦�¯R¡¬�2¦�¡h¦�»�����¯¤¸)£¤��¥�©2­¼�2¥���àR¡¬��¯2 �¦�«�¯)¡h«.¡h£¤�®2��§��]¯;¡h¦��=±Í��³!µ�³2�2¡&¹_¦�¥°¥b�] ¬ �¦� ¬¡!¦°¯�¡h£¤�=¸¤��©2¥�«�Î;§���¯;¡��]¯2¸��¸¤«�©2¡¬¦°«]¯Í«]­º¡¬£2��®2��§¨��¯;¡h¦��̱²��³.¦°¡¬ ¬��¥°­hÁ¾Ï_£2�Ì©2ª�¦�§��]ª¬Î³2�]ª¬ª�¦°��ª�¡h«½¦°¡¬ ´��ª¬����¡¬¦°«]¯�¦° â�Ú��¥°�] ¬ �¦°�Ò��£2¦°��Æ]��¯2¿X��¯¤¸2¿I��¶�¶©2ª�«�³2¥���§Ñæu©V��«�©¤¥°��¹_¦°¥�¥9¯2«�¡G �©u��¯2¸²��à;¡hª��¾¡h¦�§¨��§¨�]ª¬Æ;¦�¯2¶�2©Í¡h£2��¦°ªz¸2�]¡h�â�2¯2¥��� � z¡h£2��Î�©u��ª¬����¦�»��+�â»���¥��2��­¼«�ªz¡h£2��¦°ª��ÓV«�ª�¡h �µ��]¯2¸�§¨��¡h��¸¤��¡h�z¹_¦�¥°¥¤¯2«�¡9³V�_�2 ¬��­¼�2¥2�2¯;¡¬¦°¥2� ë ��ª�¦°¡¬¿¦�����¥�§��] ¬ �쨣2�� _³V����¯�����£2¦���»]��¸�Á!çq¥°¡¬£2«��¤¶�£jª¬�� ¬����ª���£2��ª� £2��»]�´³V����¯½­¼«;���¤ ¬¦�¯2¶Ê«�¯)¡h����£2¯2«�¥�«�¶]ÎÊ¡h«Êª¬��¸2�2���ⳤ��ª�ª¬¦�¿��ª� =¡h«���¯;¡hª�ÎÒǼ»;¦����]�2¡h£2«]ª¬¦�¯2¶�¡h«;«�¥� �µ]­¼«]ªº��àR�]§¨©2¥°�;Éhµ] ��2��£¦�¯2¦°¡¬¦°�]¡h¦�»��q§���Î�¯2«]¡Ë³V�� ��T0¨��¦���¯;¡Ë¡¬«+«�»���ª¬��«�§¨�_¡h£¤��£;�2ª�¿¸2¥��� �Á�çq ¾Â���¯2¸2¥���ª²ÇõÂ���¯2¸2¥���ª�µ=Ý�Þ�Þfß�Éqª���§¨��ª�Æ; �µ0¥°«�¹=��ª�¿¦�¯2¶Ò§¨��ª�Æ;�2©.��«] ¬¡�¦° �¯�� ¡+��¯2«��2¶]£�á9­¼«]ª+§¨��¯;ÎÊ�2 ���ª� �µ�¡h£¤�³V��¯2��î2¡h z«�­Ë¡h£2�¨®2��§��]¯;¡h¦��+±Í��³Í ¬£2«]�2¥�¸Ñ��«]§���­¼«�ªG­¼ª����RÁ®2��§��]¯;¡h¦��q§¨��ª�Æ;�2©� �£2«��¤¥°¸´³u�z��³RÎ;¿X©2ª�«R¸¤�2��¡º«�­Ô¯2«]ª¬§¨�]¥��«]§¨©2�2¡h��ª��¤ ¬�â��¯2¸Ú¡h£2��ª¬�⦰ �¯2«Í©2ª�«R���� � +«]­�§¨��¡h�]¸2��¡¬���ª����]¡h¦�«�¯Ê¡h£¤��¡+¦� +���� �¦°��ª¢��¯¤¸.§�«]ª¬�¾¦°¯;¡h�¤¦°¡¬¦°»]��¡h£2�]¯*¡h£¤��2 ���«�­G¯2��¡¬�2ª���¥_¥���¯2¶]�2��¶]�RÁ½��Î*¸2¦�»�«�ª���¦�¯2¶Í¡h£2�Ò§�� í «�ª�¿¦�¡Mξ«�­!�2 ���ª� _­¼ª¬«]§ò¡h£¤�¯2����¸â¡¬«Ì�2¯2¸2��ª¬ �¡h�]¯2¸j­¼«�ª�§¨��¥n«�¯¤¿¡h«]¥°«]¶�¦��� z��¯¤¸Ò�j©2ª�����¦� ���¥�Î�¸2��î2¯¤��¸Í»�«;���]³2�2¥���ª�ÎRµu¹º�����]¯¸2ª���§¨��¡¬¦°����¥�¥°Î¾¥°«�¹=��ª�³2�]ª¬ª�¦°��ª¬¿X«�­¼¿X��¯;¡¬ª¬Î;µY���� �¦°¯2¶¨¡¬£2�¢¡hª���¯¤¿ �¦°¡¬¦°«]¯´¦�¯;¡h«Ì¡h£¤�®2��§¨��¯;¡¬¦°�z±Í��³j»;¦° 
�¦°«]¯�Áh=�2ªÑ¡h����£2¯2«]¥°«]¶�ÎЫ]­�¦�¯2­¼«]ª¬§¨��¡¬¦°«]¯A�������� � � ���£2��§¨��¡¬�©2ª�«�»;¦°¸2�� ���¯â��¯2¯2«]¡h�]¡h¦�«�¯â ¬Î; �¡h��§  ¬�2¦�¡h�]³2¥��­¼«�ª_¸2¦�Óu��ª¬��¯;¡¥���»���¥° �«�­_�2 ���ª���à;©V��ª�¦°��¯2���;Á�èq«�»R¦����� +¡¬«Ò¡h£¤�¾®2��§��]¯;¡h¦��±Í��³â§���ª¬��¥°ÎÌ£¤��»���¡¬«�ª���¡¬¦°«]¯2��¥�¥�Î���¥°�¤��¦�¸2��¡¬�q¡¬£2�z©2ª¬«;���� � ³;ι_£2¦���£�¡¬£2��Î��«�§¨�º�¤©�¹_¦�¡h£�¡h£2���]¯2 ¬¹=��ª!¡h«z�G©2��ª�¡h¦����¤¿¥���ª0 ���¡0«]­féR�¤�� �¡h¦�«�¯2 �µY�]¯2¸Ì¡h£¤��¯Ì¸2�� ¬��ª¬¦�³u��¡h£2�]¡0©2¥���¯Ì¦�¯Ì�­¼«�ª�§ã«�­f �¡MÎR¥�¦�·���¸¾¥°�]¯2¶��2�]¶��;ÁnŤ«]ªË§¨«�ª����]¸2»���¯2����¸Ì�2 ���ª� �µ¡h£¤����³2¦�¥�¦°¡MÎ�¡h«¢�]������ � &Ã�ÄqÅʸ2¦�ª�����¡¬¥°Î���¥�¥�«�¹_ &î2¯2��ª�¿X¡h�2¯2��¸��«]¯;¡hª�«�¥Mµ�¶�ª����]¡h��ªUB2��à;¦°³2¦�¥�¦°¡MÎ;µ���¯¤¸�§¨«]ª¬�_��«�¯2��¦° ���¸2�� ���ª�¦°©¤¿¡h¦�«�¯¤ �Á V�¥�¡h¦�§��]¡h��¥°Î;µÔ¥���¡¢�¤ ¢¯2«�¡+­¼«�ª�¶���¡z¡¬£2��¡¢¡¬£2�Ì©2�2ª�©V«� ��Ì«�­¡h£¤�⮤��§¨��¯;¡h¦��â±²��³½¦� ¨¡h«Ê³u��¯2���£;�2§¨�]¯2 �µ&¯2«�¡¨��«]§¨¿©2�2¡¬��ª� �Á¢Ï_£2��«�ª�¦°¶]¦°¯2�]¥�¦�¸2���´¹=�� z¡h£¤��¡z¦°¯¤ ¬¡¬����¸²«�­º¹º�]¦°¡¬¿¦�¯2¶�­¼«�ª���«�§¨©2�¤¡h��ª� �¡¬«�³u����«]§�� �§¨��ª�¡_��¯2«]�2¶�£�¡¬«Ì ¬«]¥°»]���¥�¥Ô¡¬£2�¢©2ª�«�³2¥���§¨ �«]­��¤¯2¸2��ª� �¡h��¯¤¸2¦°¯¤¶Ì£R�¤§��]¯�¥°�]¯2¶��¤��¶��;µ¹=�â �£2«]�2¥°¸)­¼«;���2 Ì«]¯½¡¬£2�j ¬¥�¦�¶�£;¡h¥�Î*¥��� � �¸¤¦W0¨���¤¥°¡Ì©2ª�«�³¤¿¥���§ «]­�§¨�]ÆR¦�¯2¶Ì£;�2§¨��¯j¸2�]¡h�¨§¨«�ª���2¯2¸2��ª¬ �¡h�]¯2¸2�]³2¥°�z¡¬«��«]§¨©2�2¡h��ª¬ �ÁGÏ!«¾¡h£¤¦° ��¯2¸�µV¡h£¤��­¼«]�2¯2¸2�]¡h¦�«�¯2 z«�­Ë¡h£2��®2��¿§¨��¯;¡h¦��±Í��³j��ª��z¶�ª�«��2¯2¸¤��¸¾¦�¯�¥���¯2¶]�2��¶]�RÁ�Â�«�¹=��»���ª�µ]¡¬«����£¤¦°��»��z¦°¯;¡¬��ª�«�©V��ª���³2¦�¥°¦�¡MÎ��]¯2¸j¡h«�­¼�]��¦�¥°¦�¡h�]¡h�¦�¯R¡¬��ª�����¡h¦�«�¯³V��¡M¹=����¯¾ ¬«]­¼¡M¹º�]ª¬���]¶���¯R¡¬ �µ�¹=�0� »���£2�]¸�¡h«� ¬�]��ª�¦°î¤������¥°«]¡«�­�£R�¤§��]¯¾�2¯¤¸2��ª� ¬¡¬��¯2¸¤��³2¦�¥°¦�¡MÎ;À½©¤ª¬����¦� ¬��«]¯;¡h«�¥�«�¶]¦°�� Ë�]¯2¸­¼«�ª�§¨��¥�¥°ÎÚ¸2��î2¯¤��¸È ¬��§¨��¯;¡h¦��� Ì�]ª¬�â­¼«�ª���¦�¶�¯)��«]¯2����©2¡h Ì¡¬«¡h£¤�_��»���ª���¶]�Ë�2 
���ª�Á��=Î+ª¬��¦°¯;¡hª�«;¸2�2��¦°¯2¶z¯2�]¡h�2ª���¥b¥���¯¤¶��2�]¶��¼�¾v¾vÂ�³o¼�³o´¶Â�¾vµA½�¼XM�º\µR´¶³RÀ.Â�µ¸µx´¶Èv˶ºAÃ>Â�·t³¸»�º&½�³¸Â�È.º�·¸º&Ö ¼�¾v¼�˶¿�Y\º\Î�˶¼�³oº\·¼�µ`½�Â�·xº�À.ÂKÉAº\·xÃ>¹vË0Àv¼�·¸µxº\·xµ�È.º\Å\Â�½�º¼"Ì"¼�´¶Ë¶¼�È�˶º�Ó

Z ÕQÂ�·Aº L1¼�½�À�˶º�Í�º L1À.º\·x³�¹�µxº\·xµH½�¼"¿�È.º�¼�Èv˶º�³oÂ�É`·¸´¶³¸º`µxÅx»vº\½�¼�³¸¼³o»v¼�³2Àv¼�·¸¼�½�º\³¸º\·¸´Y&º�¼�Å\·xÂ�µxµ\[^]3Õ�À�·xÂ�À.º\·x³o´¶º&µ@_�³o»vº¦º `1¹v´¶ÌK¼�˶º\¾1³�Â�ÃÌ�º\·xÈvµ�´¶¾�¾v¼�³o¹v·x¼�Ë6˶¼�¾vÒ�¹v¼�Ò�º�aoÓTb�»v´¶µH½�¼"¿ÁÈ6º`¹v¾v¾�¼�³¸¹v·¸¼�Ë�Ã>Â�·�¾�Â"ÌO´¶Å&º¹vµ¸º&·¸µ�Í@È6º\Å\¼�¹vµ¸º�´¶³�Å\Â�·x·¸º\µxÀ.Â�¾vÎvµ`³¸Â2»v´¶Ò�»vº\·�Â�·¸Îvº\·�˶Â�Ò�´¶Å&µFÓ


��¯¤¯2«�¡¬��¡h¦�«�¯¤ Ë��¯¤¸�ª¬��¯2¸2��ª¬¦�¯2¶¢¡¬£2�q��«�¯2¯¤����¡¬¦°«]¯Ì¡h«�£R�¤§��]¯¥���¯2¶]�2��¶]�º��àR©¤¥°¦���¦�¡\µY¹º�����]¯¨����£2¦���»]�º� ���¡¬¦° �­¼Î;¦°¯2¶z§¨¦�¸2¸2¥��¶�ª�«��¤¯2¸´³V��¡M¹=����¯´��«�§¨©2�2¡¬��ª_�]¯2¸j£R�¤§��]¯´¯¤����¸¤ �Á

c d 698;8��Ô/2Dq3*-§ Ö D¦§S-0/2>û698;14-0D ¨��He�ئ�Ô3;8;3±Í��£¤��»�����ª�¶��2��¸Ò¡h£¤��¡¯2�]¡h�2ª���¥9¥���¯2¶]�2��¶]�+��¯¤¯2«�¡¬��¡h¦�«�¯¤ ��ª��¨�j�2 ¬��­¼�2¥Ë�]¯2¸.B2��à;¦�³2¥��Ì��à;¡h��¯2 ¬¦�«�¯Í¡h«�¡¬£2�Ì®2��§��]¯;¡h¦��±Í��³�Á!���2¡ ë ¦� _¦�¡=��¯2«��¤¶�£�b¬ìÒ®¤©u����¦�î2���]¥°¥�ÎRµ;����¯´¦°¯2­¼«]ª¬§¨�]¿¡h¦�«�¯Ú�������� � � ���£2��§��]¡h�Ñ�]��£2¦���»��¾³2ª¬«]��¸.��¯2«��2¶]£*Æ;¯2«�¹_¥�¿��¸¤¶�����«�»���ª¬�]¶��¢¡¬«j³u�¨�2 ���­¼�2¥�b½±²�̳V��¥�¦���»���¡h£¤�Ì��¯2 �¹º��ª¦� _Î��� �Áè���¡¬�2ª���¥Ë¥���¯¤¶��2�]¶����]¯2¯2«�¡¬��¡¬¦°«]¯2 ¢���]¯* ���ª�»����] +§¨«�ª��¡h£¤��¯�§¨��¡¬��¸2�]¡h�Vá�¡h£2��ν����¯�����©2¡¬�2ª¬��¶]��¯2��ª¬�]¥°¦�·���¸½©¤��¡h¿¡h��ª¬¯¤ �«]­�¦�¯2­¼«�ª�§¨��¡h¦�«�¯½�������� � �ÁÍç� ¨ ¬£¤«�¹_¯Ú¦�¯Ú¡h£¤�⩤ª¬��¿»;¦°«]�2 Ì �����¡¬¦°«]¯2 �µ0«]�2ªÌ��¯2¯¤«�¡h�]¡h¦�«�¯2 Ì����¯È³V��©2��ª���§¨��¡¬��ª�¿¦�·���¸ü¡¬«ï��¯2��«�§¨©2�] ¬ ���¯R¡¬¦°ª��.��¥��� � ��� �«�­�é;�2�� �¡h¦�«�¯¤ �Áåç ���£2��§¨����³V«��2¡=¡h£2��f!êMçüŤ�]��¡¬³u«;«�Æ̦�¯â¹_£2¦°��£´¡h£2�z©2ª�«�©¤¿��ª�¡h¦��� Ì�]¯2¸)��«��¤¯R¡¬ª¬¦��� Ì�]ª¬�â©2�]ª¬�]§���¡h��ª¬¦�·���¸Ú���]¯È��¯2 �¹º��ª¡h��¯2 0«]­¤¡h£¤«��2 ���¯2¸¤ &«�­¤©V«�¡¬��¯;¡h¦���¥¤éR�¤�� �¡h¦�«�¯2 �ÁÔÏ_£¤�=��«� �¡9«�­¹_ª�¦°¡¬¦°¯2¶¨ ���£2��§¨�]¡h��¦° _¯2«]¡_©2ª�«�©V«�ª�¡h¦�«�¯2�]¥u¡¬«Ì¡h£2�z¯;�2§+³V��ª«�­Ë��¥��� � q¦�¯2 �¡h��¯¤���� �µV³2�2¡zª���¡¬£2��ªG¡h£2�¨��«�§¨©2¥���à;¦°¡MÎj«]­Ë¡h£¤���¥��� � G¦°¡¬ ¬��¥°­hÁqçû ¬¦�¯2¶�¥��+ ���£2��§��¾­¼«�ªz¡h£2��êM¯;¡h��ª�¯2��¡qöÑ«�»;¦��ÄG��¡h�]³2�� ��Rµ!­¼«]ª���à;��§¨©2¥��Rµ!��«]�2¥�¸*¶�ª���¯;¡¢¡¬£2�´�¤ ¬��ª�¯2�]¡h�2¿ª���¥Ô¥���¯2¶]�2��¶]�z�������� � �¡h«¾«�»���ª_¡¬£2ª����¢£;�2¯2¸¤ª¬��¸â¡¬£2«��¤ ¬�]¯2¸¡h¦�¡h¥��� <fêM¡�¦� �«]�2ª!��à;©u��ª¬¦���¯¤���Ë¡¬£2��¡�©u��«�©2¥��=�� �Æq¡¬£2�º ���§¨�º¡MÎR©V�� «�­zé;�2�� ¬¡¬¦°«]¯2 ¾­¼ª���é;�2��¯R¡¬¥°Î;Á7ç�¯2�]¥°Î; �¦° ´«�­z¡h£¤�Òé;�2�� �¡h¦�«�¯¤ ­¼ª�«�§ ¡h£2�²Ï_Ã4ï*f!¿x�ôÇhg&«;«]ª¬£2���� ��]¯2¸ïÏ_¦����;µ=Ý�Þ�Þ]ÞVÉG�]¯2¸Ï_Ã4ïÏf!¿IÝ]Þ�ÞVß�Çig&«R«]ª¬£¤���� �µ�Ý�Þ�Þfß�É�j=ç½Ï�ª�����ÆÔµ��� ¬¡¬��¯2¸2�]ª¬¸¤¿¦�·���¸Ò¡h�� �¡� ���¡�­¼«�ªGéR�¤�� �¡h¦�«�¯���¯¤ ¬¹=��ª�¦°¯¤¶Vµbª���»]���]¥° _�´éR�¤�� 
�¿¡h¦�«�¯Ì¸¤¦° �¡hª�¦°³¤�2¡h¦�«�¯�¡¬£2��¡ËéR�¤��¥�¦°¡¬��¡h¦�»���¥°Î�«�³V��Î; @k;¦°©¤­1�  =¥���¹+æ�­¼��¹�£¤¦°¶]£�­¼ª���é;�2��¯2��Î�é;�2��ª�Î+¡MÎ;©V�� º������«��2¯;¡&­¼«]ª0�¥°�]ª¬¶]�©V«�ª�¡h¦�«�¯Ú«]­���¥�¥=éR�¤�� �¡h¦�«�¯2 ÑǼÜ�¦°¯!µ9Ý�Þ]Þ�ÝVɬÁ²Å¤�2ª�¡h£2��ª¬§¨«]ª¬�;µé;�2�� �¡h¦�«�¯¤ ����]¯â«]­¼¡h��¯�³u�¢§¨«;¸2��¥°��¸��� �¸2�]¡h�]³2�� ��zéR�¤��ª�¦°�� Ç ��]¡h·´��¡���¥MÁ°µ9Ý]Þ�Þ]ÝVÉ�¡¬«Ò �¦°§¨©2¥�¦°­¼Î.�]������ � �ÁÒŤª�«�§ä¡h£¤�� ��«�³¤ ¬��ª¬»���¡¬¦°«]¯2 �µn¹=�����]¯Í��«�¯2��¥°�¤¸2��¡h£¤��¡¢�]¥°¡¬£2«��2¶]£.¦°¯¤­¼«�ª�¿§¨��¡¬¦°«]¯²�]������ ¬ z ���£2��§¨�]¡h�jª¬��é;�2¦°ª���£R�¤§��]¯²��Óu«]ª¬¡�µn¡¬£2��Î��ª��z¯2��»]��ª�¡h£2��¥°�� ¬ _�]¯j��ÓV����¡¬¦°»]�¹º��ÎÌ«]­��]��£2¦���»;¦°¯¤¶�³¤ª¬«]��¸Æ;¯2«�¹_¥���¸2¶]�q��«�»���ª���¶]���]¡=ª����] ¬«]¯2��³¤¥°�G��«] ¬¡¬ �Á

l ¬�Ø�8;Ø�/R�RS -0/Tm±Í�=£¤��»��˸¤�� ���ª�¦°³V��¸�¡h£2ª����=��«]¯2��ª���¡¬�Ë©2ª�«�©V«� ���¥� �­¼«]ª9§��]Æ;¿¦�¯2¶z¡h£2�_®2��§¨��¯;¡h¦��_±²��³¨­¼ª¬¦���¯2¸¤¥°Î�¡h«z��«]§�©¤�2¡h��ª¬ &�]¯2¸�£;�2¿§¨��¯2 =��¥�¦�Æ��RÁ�ç�¥�¥f¡h£2�q©2¦������� º­¼«]ªº¡¬£2�G¦°§¨©2¥���§¨��¯;¡¬��¡h¦�«�¯¾«�­«��¤ª�¦�¸2���] ���¥�ª�����¸¤Î���à;¦� ¬¡�Á�ýÔþRÿ��;þ!µ\«��2ª�¯2��¡¬�2ª���¥�¥���¯¤¶��2�]¶��é;�2�� �¡h¦�«�¯ñ��¯2 �¹º��ª¬¦�¯2¶� �Î; ¬¡¬��§Ñµ£2�] Ò¸2��§¨«�¯2 �¡hª���¡¬��¸Ù­¼«]ª¯2����ª�¥°Î²�Ѹ2������¸2�¾¡h£2�]¡+¯2�]¡h�2ª���¥0¥���¯¤¶��2�]¶��Ì��¯¤¯2«�¡¬��¡h¦�«�¯¤ ��ª���³V«�¡¬£*�2 ���­¼�2¥=��¯2¸.��Óu����¡¬¦°»]�RÁ$f!�2ª�ª���¯;¡h¥�ÎRµ� ���£2��§¨��¡¬�¦�¯²ýÔþ]ÿ`�RþÑ���]¯j��«�¯;¡¬��¦�¯â�]��¡¬�2��¥Ô��«�¯;¡h��¯;¡�Ǽ�;Á ¶fÁ°µ;¡¬��à;¡��]¯2¸¦�§��]¶��� �ɬµ�©2ª�«;����¸¤�2ª¬�� �Ǽ�;Á ¶fÁ°µ�¡h«�]������ ¬ &¡¬£2�� �Î; ¬¡¬��§û��¥�«R��Æ¡h«�¡h��¥°¥f¡h¦�§¨�RɬµR«�ªdh=§�¯¤¦°³2�] ¬��éR�¤��ª�¦°�� zÇõ¡h«��������� � Ë£2��¡h��ª¬«]¿¶���¯2��«]�2 9¸2��¡¬�q­¼ª�«�§ñ¡h£2��±Í��³�ɬÁ2c=��¯2��ª���¥�¦�·�¦�¯2¶�¡¬£2¦° &¡¬����£2¿¯2«]¥°«]¶�Î�¡¬«_¡h£2�0¦�¯2­¼«]ª¬§¨��¡¬¦°«]¯z�������� � n ¬��£¤��§¨��¡¬�_¹º��� »]�!©¤ª¬«]¿©V«� ���¸.¦�¯*®2����¡h¦�«�¯$efÁ°ß�¦� ��Ѫ¬��¥°�]¡h¦�»���¥°Î² ¬¡¬ª¬�]¦°¶]£R¡¬­¼«�ª�¹º�]ª¬¸¡h�] ¬ÆÔÁŤ�2ª�¡h£2��ª¬§¨«]ª¬�;µb��¥�¥!¡h£2��§��]��£2¦�¯2��ª�ξ¯2������ ¬ ���ª�ξ¡¬«�©u��ª¬¿­¼«�ª�§���«�§¨©2¥���à²éR�¤��ª�¦°�� «�»���ªÃ�ÄqÅ7 ¬¡¬«�ª��� £2�] ¢��¥�ª�����¸¤Î

³V����¯Ê¸2��»���¥°«]©u��¸Ò³;ÎÑ«�¡¬£2��ªª��� ����]ª¬��£¤��ª� �ÁGÅu«�ªz��à;��§¨©2¥��Rµ¡h£¤��Ã�ÄGÅnj=�2��ª¬Î+Ü���¯¤¶��2�]¶��ǼÃoj=Ü=É&Ç7���ª�»�«��¤¯2��ª���Æ;¦� ���¡��¥MÁ�µ_Ý�Þ]Þ�ÝfÉz©2ª�«�»R¦�¸2�� ¾�Ú®�j=Ü�¿X¥°¦�Æ��²éR�¤��ª�Î)¥���¯¤¶��2�]¶��Ñ­¼«]ª�������� ¬ �¦°¯¤¶�¥°�]ª¬¶]�¢��§¨«]�2¯;¡h G«�­&Ã�ÄGÅï¡hª�¦�©2¥°�� �Á�öÑ«]ª¬����«�¯¤¿��ª���¡¬��¥�ÎRµ9¡¬£2�j¡MÎ;©u�� �«]­�é;�2�� �¡h¦�«�¯¤ �£¤��¯2¸2¥���¸½³;Î.��¯2¯2«]¡h�]¿¡h¦�«�¯¤ �¦�¯)Å!¦�¶��¤ª¬� e²¹=«��¤¥°¸½�2¥°¡¬¦°§¨�]¡h��¥�Î*¡¬ª¬�]¯2 ¬¥���¡¬��¸½¦°¯½�¸2����¥���ª���¡h¦�»��Gé;�2��ª�Î�¥�¦°Æ]�Ræ

p"q"r"q;s"tvuw"x"y"z|{"};~"� � �"�;�"�"� �1�;� {"�"~�"� q x q }���� t � ��� � � �

êM¯+­¼����¡\µ\Ãoj=Üâ©2ª¬«�»;¦�¸2�� ���ª�¦°��£� ¬��¡�«]­2��«]§�©¤��ª�¦° �«�¯�]¯2¸��¶]¶�ª���¶]��¡h¦�«�¯+«�©V��ª���¡h«]ª¬ &¯¤������ ¬ ���ª�Ρh«+©u��ª¬­¼«]ª¬§?��«�§¨©2¥���àé;�2��ª�¦��� �µ í �2 ¬¡_¥�¦�Æ��z�¨ ¬¡¬��¯2¸2�]ª¬¸âÃ�ÄG��öÑ®�Á±²�z��ª������2ª�ª¬��¯R¡¬¥°Î¨��«�¥�¥°�]³u«]ª¬�]¡h¦�¯2¶¢¹_¦�¡h£â¡h£2�zÂ���Î; �¡h�]��Æ

a˪�« í ����¡�Çõç�¸2�]ªº��¡Ë��¥MÁ�µuß@�6�6�fá�Âq�2Î;¯2£¾��¡Ë��¥MÁ�µYÝ]Þ�Þ�ÝfÉhµY©2��ª�¡«�­=öÑêMÏ �  Ça˪�« í ����¡üh=à;ÎR¶]��¯�µ ��¡¬«�¦°§¨©2¥���§¨��¯;¡¡¬£2��¦�¸2���] ©2ª�«�©V«� ���¸�£2��ª¬�;Á�Â���Î; �¡h�]��ƾ¦� q��©V��ª� �«�¯2�]¥°¦�·���¸�¦°¯2­¼«]ª¬§¨�]¿¡h¦�«�¯jª���©V«� �¦°¡¬«�ª�Î̳2�2¦�¥°¡_«]¯´Ã�ÄqÅ�µb�]¯2¸â¹º�z©2¥���¯j¡h«¨�2 ��¦�¡�� �j¡¬�� �¡h³V��¸Ê­¼«�ª��àR©¤¥°«]ª¬¦�¯2¶j¡h£2�̦�¯;¡h��ª¬�]��¡h¦�«�¯¤ G³V��¡M¹=����¯¯2�]¡h�2ª���¥2¥���¯¤¶��2�]¶���¡h����£2¯2«�¥�«�¶]΢��¯¤¸Ì¡h£2��®¤��§¨��¯;¡h¦���±Í��³�ÁêM¯¾��¸¤¸2¦°¡¬¦°«]¯¾¡¬«�¸2��©2¥�«�ÎR¦�¯2¶¢¡¬£2�� ��G©2ª¬«]©u«] ¬��¸�¡h����£2¯2«]¥°«]¿¶�¦��� �µf¹=�¨��ª����]¥° �«´ª��� ����]ª¬��£¤¦°¯2¶´��¸2»��]¯2����¸²§¨��¡¬£2«;¸2 ¢­¼«]ª¸2����«�§¨©V«� �¦°¯¤¶���«�§¨©2¥���à�¯¤��¡h�¤ª¬�]¥n¥°�]¯2¶��2�]¶��Gé;�2��ª�¦°�� _¦°¯;¡¬«�. ���ª�¦°�� Ì«�­z ¬¦�§¨©2¥���ª�«]¯2�� �Á�Åu«�ª���àR�]§¨©2¥°�;µ0�.é;�2�� ¬¡¬¦°«]¯¥�¦°Æ]� ë ±�£2��¯�¹=�� &¡¬£2�_©2ª��� �¦°¸¤��¯;¡&«�­VÃ��2 � ¬¦��z³u«]ª¬¯!µ ì��«��2¥�¸³V�³2ª¬«]Æ���¯´¸2«�¹_¯j¦�¯;¡h«¨¡M¹º«Ì¦�¯2­¼«]ª¬§¨��¡¬¦°«]¯´ �����Æ;¦°¯2¶¨ �¡h��©2 �æî2ª� ¬¡�µ&î2¯¤¸)«]�2¡¨¹_£2«Ê¡h£2�j©2ª��� �¦�¸2��¯;¡¨«�­qÃ��2 � ¬¦��ʦ° �µ&�]¯2¸¡h£¤��¯Êî2¯2¸Ê«��¤¡G£¤¦° «]ª£2��ªz³2¦�ª¬¡¬£2¸2�]¡h�RÁ�±Í�Ì��«]�2¥°¸Í¹_ª�¦°¡¬���¯¤¯2«�¡¬��¡h¦�«�¯¤ !¡¬£2��¡9����©2¡¬�2ª¬��¡¬£2¦° &ª����] ¬«]¯2¦�¯2¶�¸2¦�ª�����¡¬¥°Î;µ�³2�2¡¡h£¤¦° ¹=«��2¥�¸Òª���é;�2¦�ª���­¼�]ªG¡¬«;«â§¨�]¯RÎ��]¯2¯2«]¡h��¡¬¦°«]¯2 G¡h«j����¿��«]§¨§�«;¸2�]¡h�²��»���ª¬ÎÈ©V«� � ¬¦�³2¥��Ò��«�§+³¤¦°¯2�]¡h¦�«�¯ï«]­©2ª�«�©V��ª�¿¡h¦��� �ÁjêM¯2 �¡h����¸�µ�¹=�¾¹=«��2¥�¸.¥�¦°Æ]��¡h«²³u�´��³2¥���¡h«²©u��ª¬­¼«]ª¬§¡h£¤¦° jé;�2�� �¡h¦�«�¯ô¸2����«�§¨©V«� �¦°¡¬¦°«]¯ï���2¡¬«�§¨��¡¬¦°����¥�¥°Î;µ�¶��¤¦°¸2��¸³;Îq¹_£¤��¡h��»���ªÔ«�¯;¡h«]¥°«]¶�¦��� n�]ª¬�&��»���¦�¥���³2¥��Rá�­¼«�ª���à;��§¨©2¥��Rµ�¹=�¹=«��2¥�¸ÌÆ;¯2«�¹È¡h£2�]¡0¡h£2�q©2ª��� �¦°¸2��¯;¡0«�­n�¢��«��2¯;¡¬ª¬Î�¦° º�¢©u��ª¬¿ �«�¯�µY��¯2¸�¡h£2�]¡0�©V��ª� ¬«]¯�£2�] 0�z³2¦�ª¬¡¬£2¸2��¡¬�RÁ�±�¦�¡h£Ì�z¥°¦�¡h¡¬¥°�³2¦�¡0«�­V �����ª���£�µ�©V��ª�£2�]©2 &¹º�_��«��2¥�¸¨¸2ª���¹Ú¡h£¤¦° 
º��«�¯¤¯2����¡h¦�«�¯¹_¦�¡h£2«]�2¡�§¨�]¯R�¤��¥=¦°¯;¡h��ª¬»]��¯;¡h¦�«�¯!ÁÒèq«�¡¬�Rµ9£2«�¹=��»]��ª�µ�¡h£2�]¡¡h£¤��ª����]ª¬�¢¥�¦�§�¦�¡h�]¡h¦�«�¯2 z¡¬«´¡¬£2¦° z�]©2©2ª�«�����£!á2¡¬£2�¨��à;��§¨©2¥��¶�¦�»���¯�¦°¯²Å9¦°¶]�2ª¬��?¾����¯2¯2«]¡�³V��³2ª�«�Æ]��¯�¸2«�¹_¯Ò���2¡¬«�§¨��¡¬¿¦�����¥�¥�Î�¦°¯Ê¡h£¤¦° §¨��¯¤¯2��ªz³V�����]�2 ���¦�¡z¦°§¨©2¥�¦���¦�¡h¥�ÎÒ���]©2¡h�¤ª¬�� ¡h£¤�_£2���2ª�¦� ¬¡¬¦°� ë  �¦°¯¤����«�¯2������¯2¯2«]¡�¸¤¦°ª�����¡h¥�΢���]¥°���2¥°�]¡h��¡h£¤�¸2¦� ¬¡¬��¯2���_³u��¡M¹º����¯¨¡M¹=«��«��¤¯R¡¬ª¬¦��� �µ��zª����� �«�¯¤��³2¥�����¯2 �¹º��ª¦� _¡h«Ì���]¥°���2¥���¡h�z¡¬£2�¸2¦� ¬¡¬��¯2���G³V��¡M¹=����¯j¡h£2��¦°ª_����©2¦�¡h�]¥° �Á ì

� � -0D����hØ�3;14-0D�R�2 ¬¡+¥°¦�Æ��Ì¡¬£2��¸2��»���¥�«�©¤§���¯;¡«�­�¡h£2�¾®2��§¨��¯;¡¬¦°��±Í��³.¦�¡h¿ ���¥�­hµ�����ª�¥°Î���ÓV«�ª�¡h =¡h«�¦°¯;¡¬��¶�ª���¡¬��¯2�]¡h�2ª���¥V¥°�]¯2¶��¤��¶��_¡¬����£2¿¯2«]¥°«]¶�ι_¦�¡h£�¡h£2�_®2��§¨��¯;¡h¦���±²��³�¹_¦�¥�¥b¯2«z¸2«��2³¤¡!³V�_ �¥°«�¹��¯¤¸�¦�¯2��ª���§¨��¯R¡¬��¥MÁÔÂq«�¹º��»���ª�µ\¹=��³u��¥°¦���»]�=¡¬£2��¡&¡¬£2�_¡h£¤ª¬���§¨����£2�]¯2¦� ¬§¨ ¹=�0� »��¨©2ª�«�©V«� ���¸Í��ª����� �¡h��©Í¦�¯Ê¡h£2�̪�¦�¶�£;¡¸2¦�ª¬����¡¬¦°«]¯�ÁGÏ_£2�¨î2¯2�]¥!¶]«��]¥�¦� z¡hª��2¥°Î��]¥°¥��2ª�¦°¯¤¶Væ��]¯Ò��¯¤«�ª�¿§¨«��2 �¯2��¡M¹º«]ª¬Æ�«�­�ÆR¯¤«�¹_¥°��¸2¶��q���� �¦�¥°Î¨�������� ¬ �¦°³¤¥°�q³RÎ̧¨�]¿��£2¦�¯2�� _�]¯2¸j£;�2§¨��¯2 _�]¥°¦�Æ��;Á

�KC@D�DFEHG I�I�W��X�X�OSK[HL ^OY�Z0L>PQN&DRL S@T"U1I


� � ��m�D�-3£$�\�Ô×���Ô> �ÔD�8;3Ï_£2�]¯2Æ; Ë¡h«+®2�2��Åu��¥� ¬£2¦�¯�µ�Äq��»;¦�¸ÌÂ��2Î;¯2£!µ�c=ª���¶öÒ��ª�¡h«�¯!µÄG��¯2¯2¦� �j=�¤��¯�µ;�]¯2¸�g�¦�¯2����¡_®2¦�¯2£2��­¼«]ª_¡h£2��¦°ª���«�§¨§¨��¯;¡h «�¯.����ª�¥°¦���ª¸¤ª¬�]­¼¡h +«]­º¡¬£2¦� �©2�]©u��ª�Á¾±Í��¹º«]�2¥�¸.��¥� ¬«j¥�¦°Æ]�¡h«¾¡h£2�]¯2Æ´¡M¹=«��]¯2«�¯;Î;§¨«��2 �ª���»;¦���¹=��ª� _­¼«�ªq¡h£2��¦°ªq£2��¥�©2­¼�2¥��«]§¨§���¯;¡h �Á�Ï_£2¦� Gª¬�� ¬����ª���£â¦� G­¼�2¯2¸2��¸Ò³;ÎjÄ�ç�Ã4ançû�2¯2¿¸2��ª��«�¯;¡¬ª¬�]��¡z¯;�2§+³V��ªÅ�e�Þ6?�Þ]Ý�¿XÞ�Þ�¿�ß�¿XÞ�(�ÿ�(¨��¯¤¸²�]¸2§¨¦°¯2¿¦� ¬¡¬��ª���¸Ò³;Î�¡¬£2�¨ç�¦�ªGÅu«�ª�����Ã��� ����]ª¬��£ÑÜ���³V«�ª���¡¬«�ª�ÎRÁqç�¸2¿¸2¦�¡h¦�«�¯2�]¥f­¼�2¯2¸2¦�¯2¶¨¦° �©2ª�«�»;¦°¸¤��¸�³;ÎÌ¡h£¤�üh=à;Î;¶���¯�a˪�« í ����¡�Á¨���§F�Ô/t�ÔDq���Ô3ïËÎ;¡¬��¯¨ç�¸2�]ª�µYÄq��»;¦�¸Ù���ª�¶���ª�µ\��¯¤¸�Ü�Î;¯2¯¨ç�¯2¸2ª����®¤¡h��¦�¯�Áß��6���VÁÚÂ���Î; �¡h����ÆÔæ a&��ª¬¿X�2 ���ªÑ¦�¯2­¼«�ª�§¨��¡¬¦°«]¯ñ��¯;»;¦�ª¬«]¯2¿§���¯;¡h �ÁËêM¯��)� ��Û"ØFØ����h���" ¡�£¢ÔÚ�¤�ئ¥�§�§6§©¨^����¢�Ø��\Ø��RÛFت�6�«���¢��6�X¬�­�Ú��h�6��­6���o®¦�&��¯±°�Ø��X��Ø ²�­���­³��Ø�¬òØ��HÚ ÁÏ_¦�§�����ª�¯2��ª� �¿IÜ����Rµ��R��§¨�� GÂq��¯2¸¤¥°��ª�µÔ��¯2¸�h=ª¬�âÜ��] ¬ �¦�¥°�fÁÝ�Þ�Þfß�ÁâÏ_£2�²®2��§¨�]¯R¡¬¦°�ͱÍ��³�Áµ´RÛ�� Ø��HÚ�� ¶4Û¸·¡¬=Ø��X� Û�­��2µÝ656ÿfÇh(fÉhæ e6ÿ�¹�ÿ6efÁÏ_¦�§Û����ª¬¯2��ª¬ �¿XÜ����;Á9Ý�Þ]Þ�ÞfÁ�a˪�¦°§¨��ª�æ*c=��¡¬¡h¦�¯2¶¾¦�¯;¡h«�Ã�ÄGÅ��¯2¸â®2��§¨�]¯R¡¬¦°�±Í��³j�2 �¦�¯2¶¨è¦eVÁÄG��¯Í�=ª�¦���ÆR¥���Î��]¯2¸ÍÃ�Á g¨Ác=�2£2�fÁ�Ý]Þ�Þ]ÝVÁ=Ã�ÄGÅл]«R����³2�¤¿¥°�]ª¬Îq¸2�� ���ª�¦°©¤¡h¦�«�¯¥���¯2¶]�2��¶]�_ß]Á ÞfæVÃ�ÄGÅ⮤��£2��§¨�VÁn±�e�f±²«]ª¬Æ;¦�¯2¶üÄqª���­¼¡�µ¨±²«]ª¬¥�¸û±�¦°¸2�б²��³ f!«]¯2 �«�ª�¡h¦��2§Ñµç�©2ª�¦°¥MÁ

�R��§¨�� ÊÂ���¯2¸2¥���ª�Á�Ý]Þ�ÞVß]ÁÈç�¶]��¯;¡h .�]¯2¸û¡h£2�È®2��§��]¯;¡h¦��±²��³�ÁU«»º¼º½º�«��HÚ¸Ø�°h°¾� ��Ø��HÚ\´&¿< FÚ¸Ø�¬� �µnß³?VǼÝfÉhæ e�Þ6¹6e��]ÁÄG��»R¦�¸ãÂ��2Î;¯2£!µÌÄG��»R¦�¸����ª�¶���ª�µ���¯¤¸ûÄG��¯¤¯2¦° Àj=�2��¯!ÁÝ�Þ�Þ]ÝVÁ¾Â���Î; �¡h����ÆÔæjç ©2¥���¡h­¼«]ª¬§ ­¼«�ª´��ª����]¡h¦�¯2¶Vµº«�ª�¶��]¿¯2¦°·�¦°¯¤¶Ù��¯2¸ð»;¦� ¬�¤��¥�¦°·�¦°¯2¶ñ¦°¯2­¼«]ª¬§¨�]¡h¦�«�¯ã�2 �¦°¯2¶ñÃ�ÄqÅ�ÁêM¯��)� ��Û"ØFØ����h���" ¦��¢ Ú£¤6Ø¡º)°�Ø�Á1Ø��RÚ£¤nÂÃ���X°Ä�nÂÃ�h�6Ø©Â�Ø�Ũ��6��¢�Ø��\Ø��RÛFØ ´RØ�¬�­6�RÚ£� Û¸Â�Ø�ÅFÂÆ�6�XÇ� �¤6�XÈuÁ

c=ª���¶ ���ª�»�«]�2¯2�]ª¬�]ÆR¦� �µ ®2«�î2� çq¥°��à;��Æ;¦�µ g&�] ¬ �¦°¥�¦� f!£2ª¬¦� �¡h«�©¤£2¦°¸¤�� �µ9Äq¦�§¨¦°¡¬ª¬¦� �aº¥°��àR«]�2 ���Æ;¦° �µ!��¯¤¸ÚöѦ���£2��¥®2��£2«�¥�¥MÁðÝ�Þ]Þ�ÝfÁûÃÉj=Ü�æ�ç ¸2����¥���ª���¡h¦�»��7é;�2��ª�Î ¥���¯¤¿¶��2�]¶��Ò­¼«]ª´Ã�ÄqÅ�ÁÒêM¯Ê�)� ��ÛFØFØ��6�h�<�8 Ã��¢$Ú£¤6ظº)°>Ø�Á�Ø��HÚ�¤«��RÚxØ��X��­6Ú£�h���&­�°ËÂÆ�6�X°¾� ÂÆ�h��Ø Â Ø�Å ¨^����¢�Ø��\Ø��RÛFØÌ ÂµÂµÂÆÍ6Î�Î�ͳÏ�Á��«�ª�¦° d��]¡h·¨��¯¤¸Í����¡h£ÊÜ���»R¦�¯�Á¨ß���565fÁÊïËà;©2¥�«�¦�¡h¦�¯2¶j¥���à;¦°¿����¥nª���¶]�2¥���ª�¦°¡¬¦°�� º¦�¯j¸2�� �¦°¶]¯2¦�¯2¶�¯¤��¡h�¤ª¬�]¥V¥���¯¤¶��2�]¶��G ¬Î; �¿¡h��§¨ �ÁÈêM¯Ð�)� ��Û"ØFØ����h���" Ñ�£¢­Ú�¤�Øv¥�Í�Ú�¤µ«��RÚxØ��X�&­6Ú£�h����­6°¨��6��¢�Ø��\Ø��RÛFØ����R¨^�6¬ÒÈ�Ó�Ú�­�Ú��h�6��­�°�Ô=�h���"Ó8�h FÚ£� Û�  Ì ¨±ÕHÔ=Ö«»×FØÐÙ Ú6ÚXÏ�Á��«�ª�¦° *���¡¬·+�]¯2¸ a&��¡¬ª¬¦���ÆâÂ�ÁV±�¦�¯2 ¬¡¬«�¯�Áß@�65]ÝVÁ�a&��ª� ¬¦�¯2¶��¯2¸j¶���¯2��ª���¡¬¦°¯¤¶=ﺯ2¶�¥�¦° �£â�2 ¬¦�¯2¶Ì��«�§¨§+�2¡¬��¡¬¦°»]�G¡¬ª¬�]¯2 �¿­¼«�ª�§��]¡h¦�«�¯2 �ÁËç�ê_öÑ��§�«Û?�����µ¤öÑêMÏAç�ª�¡h¦�î2��¦���¥!êM¯;¡h��¥°¥�¦°¿¶���¯¤���zÜ��]³u«]ª¬�]¡h«�ª�ÎRÁ��«�ª�¦° ä��]¡h·;µò�]¦�§¨§+ÎÙÜ�¦°¯�µÌ��¯2¸ã®2�2��Åu��¥� ¬£¤¦°¯�ÁôÝ]Þ�ÞVß]Ác=��¡h£¤��ª�¦°¯2¶_Æ;¯2«�¹_¥���¸¤¶��&­¼«�ª���é;�2�� �¡h¦�«�¯�]¯2 ¬¹=��ª�¦�¯2¶� ¬Î; �¿¡h��§ú­¼ª¬«]§ £¤��¡h��ª¬«]¶���¯2��«]�2 �¦�¯2­¼«]ª¬§¨��¡¬¦°«]¯* �«��2ª����� �Á+êM¯�)� ��Û"ØFØ����h���" ���¢}Ú�¤�ئ·¸¨TÔÜÍ6Î�Î6ÎÀÂÃ���XÇ� �¤��XÈÝ����Þ�Ó8Ö¬�­6�ßÔU­6�<�8Ó8­X��صàRØ"Û�¤��&��°¾�»�8¿$­����¸®ª�&��¯±°�Ø��X��Ø�²�­6��Ö­X��Ø�¬òØ��HÚ Á��«�ª�¦° Ï��]¡h·RµV®¤�2�¨Å¤��¥� �£2¦°¯!µnÄG��¯2¦�·¦á&�2ª���¡\µnç�¥�¦&êM³2ª¬�]£2¦�§�µ�]¦�§¨§+ÎÍÜ�¦�¯�µ�c=ª¬��¶�«�ª�ÎÑöÑ�]ª¬¡¬«�¯�µ�çq¥°¡¬«�¯ä�R��ª�«�§¨�ÌöÑ��¿Å¤��ª�¥���¯2¸�µY��¯2¸Ì���]ª¬¦� &Ï���§¨��¥�ÆR�¤ª¬�]¯�ÁÔÝ]Þ�Þ]ÝVÁ�h=§¨¯2¦�³2�] ¬�;æ�¯2¦�­¼«�ª�§ �]������ � ¢¡h«Ñ£¤��¡h��ª¬«]¶���¯2��«]�2 ¢¸2�]¡h���� ��Ñ��«]§¨¿©u«]¯2��¯;¡z«�­º�⯤��¡h�¤ª¬�]¥!¥���¯¤¶��2�]¶��+é;�2�� ¬¡¬¦°«]¯²�]¯2 �¹º��ª¬¦�¯2¶

 ¬Î; �¡h��§ÑÁÑêM¯Ê�)� ��ÛFØFØ��6�h�<�8 Ã��¢$Ú£¤6Ø$â�Ú�¤ß«��RÚxØ��X�&­6Ú£�h����­6°ÂÆ�6�XÇ� �¤��³ÈÆ���É·)È�È�°¾� Û�­6Ú£�h���� ã�£¢\× ­6Ú£Ó8� ­�°�Ôä­����"Ó8­³��ØÏÚ£�«���¢��6�X¬�­�Ú��h�6��´&¿< FÚ¸Ø�¬�  Ì ×ãÔ^å æRÍ�Î6Î�ÍXÏ�Á��«�ª�¦° ò���¡¬·RÁjß���565fÁd� �¦�¯2¶/ﺯ2¶�¥�¦� ¬£*­¼«]ª¨¦°¯2¸¤��à;¦°¯¤¶²�]¯2¸ª¬��¡hª�¦°��»R¦�¯2¶fÁ�êM¯��)�»�1ÛFØFØ����h�<�8 ¦�£¢:Ú�¤�ظ¥6 FÚ¼çÒ«è·¸Õé¨^�6��Ö¢�Ø��\Ø��RÛFØI����ê¼ FØ��XÖëÕ\�X� Ø��RÚxØ��¸¨��6�RÚxØ��HÚ�Öìæ=­� FØ��¸àRØ»í�Ú�­����«�¬�­X��ØÒÞ ­�����°¾�h�<� Ì çÒ«»·¸ÕîÙ Ú�Ú³Ï�Á��«�ª�¦° Á���¡¬·RÁ=ß��6�A��ÁÔç�¯2¯2«]¡h�]¡h¦�¯2¶¢¡h£¤�q±Í«�ª�¥°¸Ì±�¦�¸2�G±²��³�2 ¬¦�¯2¶�¯¤��¡h�¤ª¬�]¥Ô¥°�]¯2¶��¤��¶��;Á!êM¯Ã�)� ��ÛFØFØ��6�h�<�8 É�£¢�Ú�¤�ئï6Ú£¤çÒ«»·¸Õ|¨^�6��¢�Ø��\Ø��HÛFØã���è^�6¬ÒÈ�Ó�Ú¸Ø��H·  � ��h FÚxØ��=«���¢��6�X¬�­6ÖÚ��h�6�ôRØ�­6�\Û�¤6�h�<�F���/Ú�¤�ØÒ«��RÚxØ��X�HØ1Ú Ì çã«»·¸ÕðÙ §6âXÏ�Á

h=ª��.Ü��� � �¦°¥��.��¯2¸�Ã���¥�©2£�Ã�ÁË®2¹_¦°��ÆÔÁÊß����6�fÁÌÃ��� ¬«]�2ª¬���Äq�� ¬��ª¬¦�©2¡h¦�«�¯ñŤª���§¨��¹=«�ª�ÆãÇõÃ�ÄGÅ_É´§¨«;¸2��¥���¯¤¸ñ �ÎR¯¤¿¡h��ಠ¬©V����¦°î¤����¡¬¦°«]¯�Á�±�e�fÐÃ�����«�§¨§¨��¯2¸2�]¡h¦�«�¯�µn±²«]ª¬¥�¸±�¦°¸2�±Í��³$f!«]¯2 ¬«]ª¬¡¬¦°�¤§�µRÅu��³2ª��2�]ª¬Î;Á

�R¦°§¨§+ÎÏ�nÁ�Ü�¦°¯�ÁÔÝ�Þ]Þ�ÝVÁVÏ_£¤�˱Í��³+�] ���ª��� �«��¤ª¬���&­¼«�ª!éR�¤�� �¿¡h¦�«�¯È��¯2 �¹=��ª�¦°¯2¶fæ a&��ª� ¬©V����¡h¦�»��� ��]¯2¸È��£2��¥�¥���¯2¶]�� �Á¨êM¯�)� ��Û"ØFØ����h���" ñ��¢dÚ£¤6ئà^¤��h� �)«��HÚ¸Ø��X��­�Ú��h�6��­6°�¨^����¢�Ø��\Ø��RÛFØ���¦Ôä­����"Ó8­³��Øãç*Ø� ���Ó8�\ÛFØ� )­6���oº)Á�­6°¾Ó8­6Ú£�h���¤Á

ïË¥�¥���¯�ö½Á�g&«;«�ª�£2���� ñ�]¯2¸�ÄG��¹_¯�ö½Á²Ï_¦����RÁ Ý]Þ�Þ�ÞfÁh=»���ª�»;¦°��¹ «�­�¡h£¤�CÏ_Ã4ïÏf!¿¸� é;�2�� �¡h¦�«�¯��]¯2 �¹º��ª¬¦�¯2¶¡hª�����ÆÔÁ�êM¯¸�)�»�1ÛFØFØ����h�<�8 É�£¢òÚ£¤6Øã×��h�HÚ�¤ßàtØ»ítÚ�çÒº�Ú£�X� Ø�Á�­6°¨��6��¢�Ø��\Ø��RÛFØ Ì àTçÒº¦¨�Ö짳Ï�Á

ïË¥�¥���¯)ö½ÁUg&«;«�ª�£2���� �Á�Ý]Þ�Þfß�Á h=»���ª¬»;¦���¹ð«�­G¡h£¤�âÏ_æï*fÝ�Þ�Þfßzé;�2�� �¡h¦�«�¯²��¯2 �¹=��ª�¦°¯2¶¾¡hª�����ÆÔÁºêM¯��)� ��ÛFØFØ��6�h�<�8 ���¢Ú�¤�ØÉÍ�Î�Î6¥ÃàtØ»ítÚTçãº�Ú��X� Ø�Á�­�°ä¨��6��¢�Ø��\Ø��RÛFØ Ì àTçÒº ¨.Í6Î�ÎXÏ�Á


����������� �������������������������! "�#%$'&(�*)� ,+-�/.0�1�2��3�+-�4�657�8�9������:����/ ;��������85<�>=�3?���@���2.� "3A:��8�CB(�8��

DFE�G4HJI�KCLNMPORQSG T�UWVWX/QY�Z\[�]�^@_a`AZ\bc_2d�e,f�ghb1i*j1ghk@_aghl\km]�b1n<f�go_aZ\^a]�^@prqs_aj1n1ghZ\k

t�b1gou*Z\^vkvgo_wprd�eyx�ghZ\zhZPe{Z\zhn4|�}�Z\^v`�]�bcp~4�*���-�c�1�c�s�-�;�W�s�F���N�F���S�c�*�-�S�����w�-�

��� MPORL�E���O�"�c���c�7���������R�s�������S ��7���7�N¡��¢¡���£��9£R¤¥�R� ¦�§�¨�©ª �R�w�7�<«a£R¬��S­s���7�N®*��¬�£��s���7�N¡2¤h£R¬���­s �¡��� ��J®R�7 /�R�s�c£¢©¡��¢¡w�7����­s �¡�����£*�s�R /¯° ��R�c��­s�¢�R�\±"�s�¢¡��(���8�c�7��«a¬@� ª �7��²�"�c���F³m´�¦8©{�7�N®*��¬�£��s���7�N¡�¯°�F³m´�¦�µ1�"������³� ������c�7�´*�����s�R ¶�s�¢¡���J¦�«@�s�R�c�R�%¤h£R¬@�·�¢¡v±�«a£��s�w¡���¡�­c¡w�7�¸�¡w�7«@�s�s��«J�R  ª �R�����¹¤h£R¬º�R � ¥�R�w�1�7«a¡���£R¤�¡��c�»«a£R¬��S­s��w�J¡�­c�¥�s¬�£*«a�7�s­c¬��Rµ4¦�§�¨�© ª �R�w�7�¼�R�s�c£R¡��¢¡���£��¥£R¤½¡��c���­s �¡�����£*�s�R F�s�¢¡��*¾�¡w¬@�R�s�w¤h£R¬@�·�¢¡���£���£R¤;�c£���¦�§�¨�©�R�s�c£R¡��¢¡���£��s�J¾��R�s�9¡��c�À¿;� ª © ª �R�w�7� �R�s�R �Á*�������R�s��s�����w�7�·���s�¢¡���£��(£R¤F¡��c���s�¢¡��*²

 à G½ORL�K�X�Ä���O�U{K½GÅ �º¡��s���Æ�S�¢�1�J¬(¿;���c�7��«a¬@� ª �¼¡��c�¶�c�7�������º�R�s������©�S ��7���7�N¡��¢¡���£��¥£R¤F�R�¥¦�§�¨�© ª �R�w�7�Ç«a£R¬��S­s�y�7�N®*��¬�£��*©���7�N¡�¤h£R¬�«a£����S ��aÈ<�R�s�c£R¡��¢¡w�7�À��­s �¡�����£*�s�R /�s�¢¡��*²�"�c�!�F³m´�¦8©{�7�N®*��¬�£��s���7�N¡aɶ�s¬��7�w�7�N¡w�7�6�c�J¬��!��­c�c©�1£R¬�¡��Ê¡��c�6«a£����S ��J¡w�%«a£R¬��S­s�»�w�J¡�­c���s¬�£*«a�7�s­c¬��Rµ¦�§�¨�© ª �R�w�7�Æ�R�s�c£R¡��¢¡���£��·£R¤-¬@�7¿Ë�w�1�J�7«@�·�R�s��®*���c�J£�s�¢¡��*¾c¡��c�m¡w¬@�R�s�w¤h£R¬@�·�¢¡���£��¥£R¤F�c£��¥¦�§�¨�©W�s�¢¡��·�R�s�¡��c���R�s�R �Á*�����y�R�s�¼�s�����w�7�·���s�¢¡���£��(£R¤4¡��c�2«a£R¬��S­s�J²�"�c�·�c�J®R�7 �£R�S���7�N¡�£R¤y¡��c�Æ«a£R¬��S­s�m�7�N®*��¬�£��s���7�N¡

«a£����S ��7���7�N¡��6¡��c�̨C�7�RÍ�Î�s¬�£\Ïw�7«a¡J¾»¿8�s��«@�Ð�aÈ�©�S �£R¬��7��¡��c�¼�R«JÑ�­s������¡���£��<£R¤8�s¬�£��w£*�cÁ ª Á ª £R¡��À�w�7«v©£��s�r ��R�c��­s�¢�R�Æ ��7�¢¬@�c�J¬@��£R¤8Ò��J¬@�·�R�r�R�s�rÓ��c�� �������²Ô ¬�£��Õ¡��c�9«a£� � ��7«a¡w�7�Ö�s�¢¡��×�R�̦�§�¨�©W�R�s�c£R¡��¢¡w�7���­s �¡�����£*�s�R A«a£R¬��S­s�Ë���Ë«a¬��7�¢¡w�7��²Ø�"�c� �F³m´�¦8©�7�N®*��¬�£��s���7�N¡����/�R ��w£�­s�w�7�Æ���·��­s� ª �J¬/£R¤� ����c��­s���w¡���«�s¬�£\Ïw�7«a¡����¢¡/£�­c¬/¤°�R«J­s �¡WÁ��R�s���aÈ*¡w�J¬@�s�R � �ÁR¾N���s«J �­s�s���c�¿;£R¬�Ù�£��Ë ��R�c��­s�¢�R���c£*«J­s���7�N¡��¢¡���£���¾"��­s �¡��Ú©{�S�¢¬�¡WÁ«a£��N®R�J¬@�w¡���£��s�R s�R�s�R �Á*�����J¾R��­s �¡�����£*�s�R �«a£��s�w¡w¬@­s«a¡���£���s���R �£R��­c�7�J¾\�c£*«a¡w£R¬F�S�¢¡����7�N¡F���N¡w�J¬@�R«a¡���£����R�s�2£R¡��c�J¬@�J²�"�c�2�S�¢�1�J¬����"£R¬����R�s��ÛJ�7�����¼¤h£�­c¬��w�7«a¡���£��s�J²"�"�c�

­s�s�c�J¬@ �Á*���c�%¦�§�¨�© ª �R�w�7�Ü�F³m´�¦'¤h£R¬@�·�¢¡¸�����aÈ�©�S ��R���c�7�Ý�R�s��¡��c�׫a£����1£��c�7�N¡��¸£R¤�¡��c�×�F³m´�¦8©�7�N®*��¬�£��s���7�N¡��¢¬��¥�c�7��«a¬@� ª �7�<���¹��£R¬��¥�c�J¡��R�� Þ² Ô �Ú©�s�R � �ÁR¾c�·���c£R¬�¡"«a£��s�� �­s����£��Ç���y����®R�7��²ßÞàJáÞá{âNã ävä�å?ævç è�é ç è ç è�é ê\ë\è ìoí\è î?ç îðïÚî?ç ñ�é ñPî�ävòFó,è ç ñPî�äwáWô@õ°ö¢ä÷ àJáÞá{âNã äväwø�ø�ø�é õÞâRî?åðáÞù{ê\ó�é ê\ë\è ìoí\è î?ç îðïÚî?ç ñ�é ñPî�ä@úNîwôvû-ä

ü I(ý/Q�I �<þ4ÿ�� KCL��ËE-O³Ý«a�7�N¡w¬@�R ��R�w�1�7«a¡Æ£R¤�£�­c¬Æ¬��7�w�7�¢¬@«@������¡w£<�aÈ*�S �£R¬��­c�¼¡w£Æ¿8�s��«@�Ç�1£����N¡8«J­c¬�¬��7�N¡8�w¡��R�s�s�¢¬@�Ǧ�§�¨¹¡w�7«@�*©�c£� �£R�RÁA¯h¦�§�¨/¾�¦2´*¨�©W�m¾N¦2´*¨�© Ô�� ¾�¦�Í1³�����¾s´��Ò�¾¦� ��Ó�����±(«J�R� ª �¶­s�w�7�Ë¡w£!��£*�c�7 ���­s �¡�����£*�s�R «a£R¬��1£R¬@�*¾c¡w£�¡w¬@�R�s�w¤h£R¬@�ǾcÑ�­c�J¬�ÁÆ�R�s�Ç�s���w¡w¬@� ª ­c¡w�8¡��c�«a£��N¡w�7�N¡�£R¤·��­s«@�¸«a£R¬��1£R¬@�º�R�s�»¡w£º�1�J¬�¤h£R¬@� �R�c�a©Ñ�­s�¢¡w�� ����c��­s���w¡���«Ç�R�s�R �Á*�����J²»³��¥�<¬��7��­s �¡J¾,�R � � ����*©��­s���w¡���«8�s�¢¡������·£�­c¬y�wÁ*�w¡w�7������w¡w£R¬��7�¥���Æ�R�Ʀ�§�¨�©ª �R�w�7�º¤h£R¬@�·�¢¡(«J�R � ��7�º�F³m´�¦�µ�¡��c���-�������" ������c�7�� �����s�R ��s�¢¡������8«@�s�R�c�R��¤h£R¬@�·�¢¡J²³Ö�F³m´�¦8©W�R�s�c£R¡��¢¡w�7��«a£R¬��S­s�Æ«a£��s�����w¡��·£R¤2�r�w�J¡

£R¤4�w�7������£��s�J¾c�7�R«@�¼£��c���c£� ��s���c���R�¼�¢¬ ª ��¡w¬@�¢¬�ÁÆ��­s��©ª �J¬�£R¤-�c�7��«a¬@���s¡���®R�y¡����J¬@�J¾�«J�R � ��7�� ��7ÁR�J¬@�J²�Ó��R«@�· ��7ÁR�J¬«a£��s�����w¡��4£R¤1�m�w�J¡/£R¤1�w�J�S�¢¬@�¢¡w�7���J®R�7�N¡��J²/Ó��R«@���J®R�7�N¡�w¡w£R¬��7�8�w£����2¡w�aÈ*¡�­s�R ½���c¤h£R¬@�·�¢¡���£��r¯h�R² �c²;�·�wÁ* � �� ª  ��£R¬�¡��c�¼�c�7��«a¬@���s¡���£��<£R¤��R�7�w¡�­c¬��\±��R�s������ ����cÙR�7�¹¡w£¡��c���s¬@���·�¢¬�Á·�R­s�s��£��s�¢¡�� ª Á·¡W¿;£�¡��������w¡��R���S�y�c�a©�c£R¡����c�(¡��c�����N¡w�J¬�®¢�R F£R¤/¡��s���8�J®R�7�N¡J²��"�7 ��¢¡���£��s� ª �a©¡W¿;�J�7�¹�J®R�7�N¡���£��<�s���1�J¬��7�N¡2¡����J¬@��«J�R� ª �·�7�s«a£*�c�7�ª Ár�c���-�s���c�A ����cÙ*�2­s�����c�¼¡��c� Å��! PÅ�� �8Ó Ô ´<���7«@�*©�R�s������£R¤/¦�§�¨/²1�"�s���8���������·�� ��¢¬"¡w£(¡��c���¢�s�s¬�£��R«@�£R¤S�w¡��R�s�*©{£"�¥�·�¢¬�Ù*­c���R�4�s¬�£R�1£��w�7� ª Á2§�³��"Óº¯ � ÁN©ª Ù¢�¢�J¬4�J¡��R Þ²�¾$#�­s�c� %'&(&(&�±@¾N¬��7�w�1�7«a¡���®* �Á*) Å �"Ó!¯,+y�¢¬w© ��J¡w¡��·�J¡8�R Þ²�¾.-"/(/$-�±@² Ô ���s�R � �ÁR¾c�¢¬ ª ��¡w¬@�¢¬�Á¥���J¡��\©W�s�¢¡��«J�R� ª ���R���������c�7�!¡w£¹¡��c��«a£����S ��J¡w��«a£R¬��S­s�J¾,�7�R«@��w�7������£���¾F�7�R«@�¹ ��7ÁR�J¬��R�s�r�7�R«@�¹�J®R�7�N¡J²A§A�J¡����s�¢¡������¬��J�s¬��7�w�7�N¡w�7�Ç�R�y���s�R���7�¥�w�J¡y£R¤½�¢¡w¡w¬@� ª ­c¡w�a©{®¢�R �­c��S�R��¬@�J²0)8£¼¤°­c¬�¡��c�J¬m¬��7�w¡w¬@��«a¡���£��s�2�¢¬��Æ�·�R�c�·£��r¡��c�³��0�S�R��¬@�J² �"�c�À�c�7�������¸£R¤Æ�F³m´�¦0��� ª �R�w�7�9£��¡��c��«J�¢¬��J¤°­s F�R�s�R �Á*�����8£R¤�«a£��·��£����R�s�c£R¡��¢¡���£���¤h£R¬w©�·�¢¡��m�R�s��¡w£�£� ��J²!1¶��¡w¬@���7��¡w£¼�·�\Èc���·�R � �ÁǬ��7�s­s«a�¡��c�7�w�A¤h£R¬@�·�¢¡��Æ¿8�s�� ��Ç�w¡��� � ,�R � �£P¿8���c�r¡w£<¬��J�s¬��7�w�7�N¡�R � N£R¤S¡��c�,�w¡w£R¬��7�����c¤h£R¬@�·�¢¡���£����R�s�ǯh�J®R�7����£R¬��,����©�1£R¬�¡��R�N¡v±/¡��c�m¬��J�s¬��7�w�7�N¡w�7��243'576"8:9<;(=>2?6"3A@�=>8:BDC'=>BE8GFa²�"�c�¼�F³m´�¦8©{¤h£R¬@�·�¢¡(����¡���­s���1£P¿;�J¬�¤°­s ,�7�c£�­c����¡w£�7�s«a£*�c����£��w¡½£R¤*¡��c�;«a£R¬��S­s�C�R�s�c£R¡��¢¡���£���¤h£R¬@�·�¢¡��½���­s�w�R² Å �s�c�J�7�¥����­s� ª 
�J¬;£R¤½¤h£R¬@�·�¢¡,¡w¬@�R�s�w¤h£R¬@�·�¢¡���£���s¬�£R�R¬@�R�·�·�4�s�7®R� ª �J�7�������S ��7���7�N¡w�7��² Ô £R¬/�m��£R¬���c�J¡��R�� ��7���c�7��«a¬@���s¡���£��¹£R¤�¡��c�¼�F³m´�¦8©{�7�N®*��¬�£��s���7�N¡


�w�J�,�R ��w£�¯°§��� ��c�;�R�s�·Ò�­c¡J¾$-"/(/ %P±@¾1¯°§��� ��c�;�R�s�·Ò�­c¡J¾-"/(/$-¢�N±@¾F¯°§��� ��c���R�s�ÇÒ�­c¡J¾D-"/(/$- ª ±@²;�F� ª  �� %����c£P¿8�¡��c�/¤h£R¬@�·�R N�F³m´�¦8©{¤h£R¬@�·�¢¡½�w�1�7«J���-«J�¢¡���£���²F�,¿;£8 ��J®N©�7 ��,£R¤4�F³m´�¦8©W�R�s�c£R¡��¢¡���£��A«J�R� ª �2�s���w¡����c��­s�����c�7��µ

%¢²;�F³m´�¦� ��J®R�7 �%¢µr�R � 8�J®R�7�N¡��¥�c£À���·���7�s���¢¡w�7 �Á¬��J¤h�J¬y¡w£�¡��c��«a£��·��£��¥¡������7 ����c�"£R¤½¡��c��­s�s�c�J¬w© �Á*���c�2�������s�R Þ²4�"�s�����¢�s�s¬�£��R«@�(�s�R� ª �J�7�·¡��¢ÙR�7�ª ÁA��£��w¡�£R¤/¡��c��«J­c¬�¬��7�N¡� �Á¼�7®¢�R�� �� ª  ����R�s�c£R¡��\©¡���£��Ǥh£R¬@�·�¢¡��"�R�s�¼�R�s�c£R¡��¢¡���£��¼¡w£�£� ��J²

-�²;�F³m´�¦  ��J®R�7 *-�µÀ�J®R�7�N¡��¼«J�R���7��¡��c�J¬Ç¬��J¤h�J¬¥¡w£¡��c� «a£��·��£��>¡������7 ����c�»£R¬Ë�J®R�7�N¡��»«J�R�>¬��a©¤h�J¬·¡w£r�J®R�7�N¡��(���À£R¡��c�J¬Æ ��7ÁR�J¬@�J²º�"�s�����R � �£P¿8�¡w£¹«a£��s�w¡w¬@­s«a¡Æ�s���J¬@�¢¬@«@�s��«J�R ,�R�s�c£R¡��¢¡���£��Ë�w¡�­s«v©¡�­c¬��7�J² 1��s�� ��(«a£��s«a�J�s¡�­s�R � �Á<���N¡w�J�R¬@�¢¡w�7�!���N¡w£¡��c�y�F³m´�¦�¤h£R¬@�·�¢¡J¾R¡��c�J¬��;���F«J­c¬�¬��7�N¡� �Á2�c£8¡w£�£� �7®¢�R�� �� ª  ���¡w£��·�R�s���S­s ��¢¡w���F³m´�¦º ��J®R�7 E-8�R�s�c£¢©¡��¢¡w�7�A ��R�c��­s�¢�R���s�¢¡��*²

����� ������� ����� ���������� � ��� � ����

��� � � � � � � ����� �� ����� "! ����� ��������#�$�% & &��� � � � � � � �'����� �������� ����� "! ����(���#�$�%

��� � � � ��)�*��'����� �������

�&���,+�- � � ��.�/ ��0 12) / � -

� ��(3+�- � � ��.�/ ��0 12) / � -����4) - / �53.�) � 6 ��) � -������74+�- � � ��.�/ ��0 12) / � -(� ���3+�- � � ��.�/ ��0 12) / � -"%

��� � � � � � � �8����(���� ����� "! ��9���#�$�%��� � � � ��)�*��8����(���

�&���,+�- � � ��.�/ ��0 12) / � -

����4) - / �53.�) � 6 ��) � -"%��� � � � � � � �4��9���� .�6+�- � � �:! ����� �$�%

��� � � � ��)�*��4��9���

�&���,+�- � � ��.�/ ��0 12) / � -

� ��� �+�- � � ��.�/ ��0 12) / � -�����,+�- � � ��.�/ ��0 12) / � -����4) - / �53.�) � 6 ��) � -�2���,+�- � � ��.�) � 6 ��) � -������+�- � � ��.�) � 6 ��) � -"%

��� � � � � � � �,����� � � ����; $�% ��� � � � ��)�*��,�������� � � � � � � �4� ����; � ��� ���<! 9��$�% �

&���,+�- � � ��.�/ ��0 12) / � -

��� � � � � � � �3��� ��� � .�6+�- � � ��$�% ����4) - / �53.�) � 6 ��) � -��� � � � � � � ��9�� � .�6+�- � � ��$�% �; ;���� �=+�- � � ��.�) � 6 ��) � -

����9��,+�- � � ��.�) � 6 ��) � -"%

�F� ª  ��<%¢µ��"�c�2�F³m´�¦ � � � ²

>,?2@ ACB4DEAGFIH3JLKNM�O8O4PRQSMTQSP�ULM�O4VWQXB4DY P�UXZ'[4\LD]O4^=_�O4D

�"�c�2«a£����S ��J¡w���F³m´�¦8©{�7�N®*��¬�£��s���7�N¡"«a£��s�����w¡��,£R¤ðµ

` ¡w£�£� ��·¤h£R¬(¡��c���R�s�c£R¡��¢¡���£��º£R¤m�7���S��¬@��«J�R , ��R�*©��­s�¢�R���s�¢¡��A¯h®*���c�J£Æ�R�s�Ç�R­s�s��£·�·�¢¡w�J¬@���R h±@¾

` �·�������S ������J¡��\©W�s�¢¡��·�7�s��¡w£R¬` �s¬�£R�R¬@�R�·�À¤h£R¬�¡��c��¡w¬@�R�s�w¤h£R¬@�·�¢¡���£��£R¤�®¢�¢¬w©��£�­s��¤h£R¬@�·�¢¡��!£R¤¶ ����c��­s���w¡���«º�w¡��R�s�s�¢¬@���w£R¤h¡?©¿y�¢¬���¯°�½¬@�R�s��«a¬@� ª �J¬7¾2Í/¬@�R�¢¡J¾�Óy´*Íy´ ¿y�7®R�7�ba�¾´�Á*�s«�1¶¬@��¡w�J¬7¾sÓ�Èc�·�¢¬@�R ��s���J¡�«¢² ±

` �Ç�w�J¡m£R¤��s¬�£R�R¬@�R�·��¤h£R¬2 ����c��­s���w¡���«��R�s�R �Á*������£R¤¡��c�2�F³m´�¦8©W�R�s�c£R¡��¢¡w�7���s�¢¡��*¾S�R�s�

` ��«a£R¬��S­s�;�wÁ*�w¡w�7�פh£R¬;¡��c���s���w¡w¬@� ª ­c¡���£���£R¤C ��R�*©��­s�¢�R�;�s�¢¡��"®*���,¡��c�;���N¡w�J¬@�c�J¡J¾¢���s«J �­s�s���c�y���N¡w�J¬w©�R«a¡���®R��«a£R¬��S­s��Ñ�­c�J¬�ÁÀ�R�s�!��­s �¡�����£*�s�R y�s�¢¡���s���w�S ��7ÁÆ���Ç�·�w¡��R�s�s�¢¬@�Ç¿;� ª¼ª ¬�£P¿8�w�J¬7²

Å �Ë¡��c��¤h£� � �£P¿8���c�¹�w�7«a¡���£��s�¥¡��c�7�w�¶��£*�s­s ��7�(�¢¬���c�7��«a¬@� ª �7�¥���Ç��£R¬����c�J¡��R�� >,?> ACB4DEAGFIH3JLKNM�O8O4PRQSMTQSP�U�"�c���F³m´�¦8©W�R�s�c£R¡��¢¡w£R¬"���y��«a�7�N¡w¬@�R �«a£����1£��c�7�N¡y£R¤¡��c�¹�F³m´�¦8©{�7�N®*��¬�£��s���7�N¡J²Ü�"�c�r¡w£�£� ��R � �£P¿8�A¡��c���­s �¡��� ��J®R�7 ��R�s�c£R¡��¢¡���£����R�s��¡w¬@�R�s��«a¬@���s¡���£���£R¤S®*���c�J£¯°��­s �¡��Ú©W«@�s�R�s�c�7 h±��R�s�!�R­s�s��£¶�s�¢¡��º¯°�w�J� �S��­c¬�� %P±@²³Ë�w�J�S�¢¬@�¢¡w�8®*���c�J£m�S ��7Á ª �R«�Ù�¿8���s�c£P¿!���4£R�1�7�c�7�·­c�¤h£R¬��7�R«@�·®*���c�J£*�- ��"�·�¢Ù*���c�2��¡��1£������ ª  ��;¡w£��R² �c²/�s���?©�S ��7Á���­s �¡����S ����1�J¬@�w�1�7«a¡���®R�7�¥£R¤�¡��c�r���R���<��«a�7�c�R²�"�c�<®*���c�J£º�S ��7Á ª �R«�Ù¸�����wÁ*�s«@�c¬�£��s��ÛJ�7�»¿8��¡��¸¡��c�¡w¬@�R�s��«a¬@���s¡���£���² Ô £R¬(�R­s�s��£¶¡w¬@�R�s��«a¬@���s¡���£��s���R��£��?©Û7�� � �£R�R¬@�R�0���Æ«J�R �«J­s ��¢¡w�7�º�R�s�º���Æ�s���w�S ��7ÁR�7�!���s�����c�¡��c�Æ�·�R����¿8���s�c£P¿2²��"�c���s¬�£R�R¬@�R�·������®R�J¬�Á¶­s�w�J¬¤h¬@���7�s�s �Ár�R�s��«J�R� ª �Ç­s�w�7�¹¿8��¡��c£�­c¡��¶�s�����À ��J®R�7 £R¤y«a£����S­c¡w�J¬2�wÙ*�� � ��J² Å ¡2�����1£������ ª  ���¡w£¼«a£����S ��J¡w�7 �Á«a£��N¡w¬�£� s¡��c�"¡w£�£�  ª Á��7��¡��c�J¬���£�­s�w��6"8 ª Á�ÙR�JÁ ª £��¢¬@����c£R¬�¡�«J­c¡��J²� ���1�J¬��7�N¡��s�¢¡��¼®*���J¿8�2�¢¬��Æ�s¬�£R�R¬@�R�·���7�˯h¡������a©

�R ������c�7�6�S�¢¬�¡���¡�­c¬��R¾Æ¿;£R¬@�*©W�R ������c�7�6�S�¢¬�¡���¡�­c¬��R¾Æ�w�a©Ñ�­c�7�N¡����R ,¡w�aÈ*¡Æ®*���J¿�±�¡w£<�·�¢ÙR���R�s�c£R¡��¢¡���£��º�R�Æ�J¤o©¤h�7«a¡���®R���R�"�1£������ ª  ��R²�"�c� =>2490F ;dc 2�e(3 FgfEh�2,FjiË���2£R¬����R�s��ÛJ�7�¹�R���A¡W¿;£

�s�����7�s�����s�R ½�R¬@����£R¤y���E�-�s��¡w������ÛJ�R²�³× ��7ÁR�J¬������s¬��a©�w�7�N¡w�7�A�R�8�·�c£R¬@��ÛJ£��N¡��R C¡����J¬"£R¤4�J®R�7�N¡��J²y�"�c�m£R¬@�c�J¬£R¤m¡��c�� ��7ÁR�J¬@�¥���Æ�¢¬ ª ��¡w¬@�¢¬�ÁÀ�R�s�Ë«J�R� ª �A«@�s�R�c�R�7����s�w¡��R�N¡� �ÁR²¸�"�c�A­s�w�J¬¥���Æ� ª  ��A¡w£À�c���-�c�¼¡����������*©¡w�J¬�®¢�R �� ª Á¹�c¬@�¢�R�����c��¡��c�Ç��£�­s�w�R²rÓ��R«@��¡������¥���*©¡w�J¬�®¢�R m¬��J�s¬��7�w�7�N¡��Ç�R�Ê�J®R�7�N¡J²Ì�"�c���J®R�7�N¡A���Ç�s���?©�S ��7ÁR�7�(�R�y�2�R¬@�¢�S�s��«J�R  ª £7È·¿8�s��«@�Æ«J�R� ª �8�w�7 ��7«a¡w�7��R�s����£P®R�7��¿8��¡��A¡��c����£�­s�w�R²m�"�c��«a£��N¡w�7�N¡m£R¤��R�


������������ �������������������������! "�$#&%('*)��������)��+����,-.�/���102���+�+��34��)�5��768���93:)��;�/<=)����5>�;�@?A�;���0=5��6B ����C���D��;3@1)�5;�������EF�����:?A�� B ��)���02���/�+�'5�����+�EG,8H!�:�+��<I���G���(3:)��;� B �;��E�� B 6����(�J���K�8�+�'5����������:<=)����5��;��?A�;���0=5�MLN���� B �;���O�+��3@(-.P2�Q�R��)��+')����+��R�TSR,��(02�U?�V����6����DWX��E/�+�&��5Y�;�:���� B �G,Q-.����(��<�<2���5���J�Z�����R����Z���[?A�;E���ME��;�+<=5;)�\]��)��^02����+���G,

�?���K�>�;����K�+���EI�;�^)��I)�E�E����������)�5��+�_A�ZW=�5;EG,��Y�K\LN�����;���AE�`Sa�J���K�YLJ�, ��,�-.P2�b�J���K���6AcY)�3:d(�&#&\A���J���K������, S9)�?�)��;5;)�0=5�O�J���Y���]��<2��R)���;���I�+\A�+�+�3e��)��f02���+�EI�J���>���!�+�R)�������R��<�������G,� [��$���+��>��)��D�R���&���+�J���K�1)���EQ�J���K�+<=)���g�J���3h)i�)�0=5�jE��;�+<=5;)�\A�;���F)�5;5�R��)��R)����+��R�!���������+�5�����+�Ef�J���K��,Y-7�!�;�Y)�5;�+�^<2�����k'0=5�>�+�]E��WX��[)$?A������)�5�l��\&02��)��RE B ���;�R�@3:)�<=���������?���ml��\A�+�+���l���]�+�V)��0=���+�R)��\V�R��)��R)����+��R�����>����)������(�J���K��,-.�I���@n*op�nrqUsto�uf���$E�)��)���)��C02!3:)�����<=��5;)��+�E

�;�b)m�+�)���E�)��REv�+�_A�I�E����+���^<=)����5w,� [��9�����K�+��K����Z���]�E����+���!��<����+��K��!����5;)�\���M)���Ef�)��R�g5;�;����<����+��K��()��9�?���K��,(�x5;�;�+�[�+�5����������102��_9)�5;5�� B �� B ����R���;���102�� B ���iE��ky2����K�]5;)�\���R��,9-7���;�M<2�����k'0=5�M�+�I�+�R)����+�J��Y�+�_A�!�J���3z�+�)���E�)��RE9�+�_A�Y�E����+���R��6�, ��,|{f�;������+���J�C}j���REG6(0&\b�����C)���E~<=)��+�+f��<2��+')����������,~-.�v���RE���^�+�F)�E�E����������)�5;5�\m�+<2��Eb��<v����+�R)�������R��<�������f<���A������6") B ���REg����3@<=5��������j�N�����T'������~��)��^02���~�;3@<=5��3@��K�+�E/�J���C���9�+�_A�I?A�� B ,�Z�K�+��R�;���g���f�;������;)�5[5���+�+��I���M) B ���RE~)���Eb�����A'�+��������?��5�\F<������;���g�� [�(���!� B �;5;580��R�;���V��<v)�5;5B ���RE��Z�+�)����;��� B ����@����;�Z5���+�+���,8H!����[���(�+�_A���;��+�R)����J����E^�;�K�+�O���! "�$#&%('*)��������)��+����6A���Y�?���K���+��;5;5(��)�?�j�+�i02V)�5;�������E B ����b���f<��R�;3:)��\v)��A'E�����)���EC?A�;E���@E�)��)A,�# B ����R���;���O0=)��lI�+�@���!��;3@)�5;�������Eb?A�� B )���EQ3@�U?A�;���F���V�?���K�� B ����b���3@�����+M3:)�l���(����;���)��+lD�&�����+M��;3@<=5��,-.�������A����nNsNnN�A�o�qUsto�u�����E�)��)���)��������~02

�E����+�EG,$-.�f<��R)�����;�������;�Y3@�)����!���)��!���@E�)��)C�;��+�R)����+�J���R3@�ED�;�K�+�:)��9cY [{f�g�)�0=5�M)���EC�����1E��;�.'

<=5;)�\��EM�+�[���Z���+���,a�i�&��3O02��G����E��ky2����K��cY [{f��J���R3:)��+�+�E^?A�� B ����)�?�(02���:E����������EG,� [��[?A�� B ���)��v)�5;�+�j021�)�?��Ev�+�g�_A�+��R��)�5�WX5���@)���E/5���)�E��E�;�K�+�:�+�)���E�)��RE B �010��� B �+��R��,H!���<2���+��K��;)�5��+�+����������������� "�$#&%('

)��������)��+���v�;�F����F3:)�������m���9��)���E�5;�;�������~�_&'<2����T�`�;3@<2����F���9%!{f��0=)��+�E��;���J���R3:)�������G,���+�)���E�)��RE B )�\1���Z�+��5�?A�;���@����;�[<����0=5��3 B ����5;E102���1�;3@<=5��3@��K�)�������F���Y)j�+��@���(�J���R3:)��:�+<2����kWX�%!{f�/<=)��R�+��R� B ���;�R�f�������+�+�R�����!�����;�K�+��R��)�5a��<�'���+��K�)�������xLJ�, ��,����Y��3IS����$���f%!{f��WX5��,�-.����f "�$#&%('*)��������)��+��� B 9�J��5;5�� B )mE��ky2����K�^)�<�'<����)��R�G,D [��^�+\A�+�+�3��;�K�+����R)��+��])��g%M#A�G'* |<����'�����+���mLN�)`_A���XSR6!3:)�lA�;���/���I�)��+\b�+�/<2���J���R34������$�=\CE�)��):�+�R)����+�J���R3:)����������,Z [��O�;3@<2��������r)��%!{f�G'wWX5�$�;���+<=5;���8�;�K�+�@� B �:�+�+�<=�� �W=�R�+��)��D%M#A�G'* �+�*\A5��������Y�+�R)����+�J���R3:�[���O%!{f�FWX5�O�;�K�+�I "�$#&%]6�+�������E:)�����������%M#A�G'* /�+�*\A5�������� B �;5;5K�+�R)����+�J���R3���Z "�$#&%mWX5���;�K�+�()[��;3@<=5�"�+�_A�a���R����K�+�E$�J���R3:)���, [���;���J���R3:)��[��)��102M5���)�E��ED��I������K�5�\�,�����R�����;)�5I<����0=5��3 B ������+��+��;���x��<�5;)������

�����<2���R)C)��@�;�K�+��+'*)������)��+���M�+�R)�������R��<�������1������R��,}Q���;5�j���m "�$#&%('*)��������)��+���V�;�1E����������E��+�b02���+�EI0&\I)@��;����5�(<2��R�+���G6������+��;5;5=<���U?A�;E����)��&��3�'02��Z�����������;������+������3O0=�;���LN3@����`SR6������K�+���52)���E)�5;�����v)��������)���������I����)��+�E~0&\/)m5;)������^�+�)�3����<2���<=5��,�}j!E��O�������;�K�+����R)��+!3@���Y����3@<=5��_:�����A'�+���5��N���������������,� [���;�������K�+�R)�E��;����������Z)�<�<����)��R�:�����5��)��R5�\j�+�<=)��R)���;���f�����<=���]����)�������m�J���3������<=���)���)�5�\A��;��,��������K�5�\�)|�&��3O02��/���V��5;)��+�Ez)��������)�������

�+�&��5;�$)��:����E���!E��?��5���<=3@��K�^LJ�, ��,��Z5;)��/LN�8�R����'3:)��M)���E$}Q���+�+��K0=�����6� �¡�¡��USR6��(¢! [£�LN�����RE!��")�5w,�6 �¡�¡��USR6GLN�����RE@)���E:�a��02��R3:)��G6���¤�¤�¤�SR6A�Y�K?A�;5GLN£M��<�<�6 �¡�¡��USR6]�r_�3:)��R)�5;)�E�)�Lt#A�R��3:�;E���6O �¡�¡��US+SR6O�)��R���������3¥E����������E��J���j)~�+<2����kWX�g�)������j)���E���������,{9���+�@���Y���C�+�&��5;�:)��D����;���j��)�?�)g)��@)��v�;3@<=5��'3@��K�)�������F0=)��+I)���Eg������AE�I���I5;�;�������;�+��;��E�)��)�;�F)f����3@<=)��R)�0=5� B )�\F)��]<����<2���+�EF����jLJ%!{f�G'0=)��+�EG6Y����;���m��;3@V�+<=)����I�+�/3:)��lv�?���K���6$�+�<�')��R)���;���b3@��)`'*E�)��)~)���E������K�+��K�TSR,¦�Y�9)/�����5�����>02�����3@��>��5;)����?A5�\I��;3@<=5�Y�+�^�����K?����T�U�������R)��+ "�$#&%('*)��������)��+�E������<2���R)F�;�K�+�&�U�J���3§�����+V�J���+'3:)����,¨2©;ª «8¬`­=®a¯�°�±G²�³N®a´mµU±G±X¶J¯ [��·E��?��5���<=3@��K�¦�����+�&��5;�e�J�������¸ "�$#&%('��K?A�������3@��K�Y�;�>0=)��+�E9���9���]���������<��Y���)��Y)^��'�;3@<=5��3@��K�)�������C�����N�������������)�5;��������)�5���)�E�\C)�?�)��;5k')�0=5�g�;�~�������95;)�������)���m)���E��+<2���R��<���A������;����+���J� B )��:�;�Y�����$��������)��\�,M�Z�+�)�0=5;�;����E9�+���J� B )���+\A�+�+�3:�[����R�1)��(P��R)�)��[���(��#AP�#�� B )�?���T�xE��:�����


�c�J�7�Ç¡w£ ª �2�s­c�S ���«J�¢¡w�7��²�"�c� �F³m´�¦8©{�7�N®*��¬�£��s���7�N¡�¡��c�J¬��J¤h£R¬�� ¤h£*«J­s�w�7�

£�� ¡��c� �c�J®R�7 �£R�S���7�N¡Ð£R¤>¡w¬@�R�s��«a£*�s���c� �- �¡w�J¬@�¤h¬�£�� �R�s� ���N¡w£�®¢�¢¬@��£�­s�»¤h£R¬@�·�¢¡��J² �"�c�7�w�×���*©«J �­s�c�Rµ�Í/¬@�R�¢¡ ¤h¬��7ÑS¾½Í/¬@�R�¢¡  �� ª �7 Þ¾CÓy´*Íy´ ¿y�7®R�7�ba�¾Óy´*Íy´ \Ô /\©W�R�s�R �Á*�����J¾ �½¬@�R�s��«a¬@� ª �J¬7¾ �R�s�c£R¡��¢¡���£���R¬@�¢�S�s�Ç�w¡w£R¬��7�»���˦�§�¨/¾�´�Á*�s«�1¶¬@��¡w�J¬¼�R�s� ª �R����«¡w�aÈ*¡�¤h£R¬@�·�¢¡��¼¯°�w�J�¥¡�� ª  �� -�±@² Å �À�R�s�s��¡���£���¾��- �¡w�J¬@�¤h£R¬Æ�s�¢¡��r�����1£R¬�¡��R�s���aÈ*�1£R¬�¡�£R¤�¡��c�AÓ�Èc�·�¢¬@�R ��s��wÁ*�w¡w�7� ¯{´*«@�s�·���c¡J¾D-"/(/ %P±"�¢¬��2�7®¢�R�� �� ª  ��R²�F³m´�¦ � �������� �� ����������� ��������� � ��� ��������� ��� �! #" ��� �! #"$ � �%������&('�� ��� �! #"*)�+,�.-�� ��� �! #"*)�+�/0 "213�! 4����5�& 6 ��� �! #"0 "213�! 7�������� ��8��� 6 ��� �! #"9 "2: 6 ��� �! #"*)�+�/��� -;� & ��� �! #" ��� �! #"$ � : ��� �! #" ��� �! #"< ������� 7&(��5�=& < =��&>).��� �! #" ��� �! #""?��� � ��@���� 5�=�.A4� "21CB < =��&>).��� �! #" ��� �! #"D �.-�.�;8��E>=� < =��&>).��� �! #" ��� �! #"$ � < �, 7&(��5�=& < =��& ��� �! #"$ � < �, 4E>��GF < =��&>).��� �! #"*)�+�/ ��� �! #"�;H � @ D ��� ��=� < =��& 6

�F� ª  �� -�µ ¨½���w¡ £R¤Ð«J­c¬�¬��7�N¡� �ÁÕ�����S ��7���7�N¡w�7�¡w¬@�R�s��«a£*�s���c�!¡w£�£� ��J²��"�c�r¡�� ª  ��¹���c£P¿8�A¡��c�<�s¬�£¢©�R¬@�R�·�·���c�× ��R�c��­s�¢�R�7�Ê­s�w�7�Ì¡w£ �����S ��7���7�N¡º¡��c�¡w¬@�R�s��«a£*�c�J¬@�J²§A£��w¡(£R¤m¡��c�7�w��«a£����1£��c�7�N¡��(�¢¬��������S ��7���7�N¡w�7�

��� #��7®¢�*¾�¡w¬@�R�s�w¤h£R¬@�·�¢¡���£��s�A�¢¬��¹�c���-�c�7�¸���ʦ2´*¨�©��R�s�������·�R � ��J¬���­s� ª �J¬�£R¤��R�s�s��¡���£��s�R /¡w£�£� ������¿"¬@��¡w¡w�7�A���AÍF�J¬@ ;¯°�·�R���s �Á¥¡w£(¡w¬@�R�s�w¤h£R¬@�Ì�c£��*©{¦�§�¨�s�¢¡��N±@² 1¶�,¿8�� � N�R¬@�R�s­s�R � �Á����s«a¬��7�R�w�y¡��c�"��­s� ª �J¬4£R¤¡w¬@�R�s��«a£*�s���c�·�s¬�£R�R¬@�R�·�·�J²�³� � ½�s¬�£R�R¬@�R�·�·�8¤h£� � �£P¿��­s�s�ÚÈr ���ÙR�(�s¬�£*«a�7�������c�A�¢�s�s¬�£��R«@��µÆ���c�S­c¡�«J�R� ª �¬��7�R�!¤h¬�£��0���- ��Ç£R¬Æ�w¡��R�s�s�¢¬@�!����¾�¡��c�¼¡w¬@�R�s��«a£*�*©���c��¬��7��­s �¡;¿8�� �  ª ���s¬@���N¡w�7�(¡w£·�w¡��R�s�s�¢¬@�¥£�­c¡J²��"�s����R � �£P¿8�m¡w£�«a£�� ª ���c�·¡��c�·¡w¬@�R�s��«a£*�c�J¬@�2���r�¢¬ ª ��¡w¬@�¢¬�Á¿y�7Á*�J²×�F³m´�¦ «J�R��¡���­s� ª ���w�J�7�Ê�R�Ç��«a£��·��£�����N¡w�J¬@ ����c��­s��¤h£R¬�¡��c�8��­c�s�1£R¬�¡w�7�� ����c��­s���w¡���«�¤h£R¬@�·�¢¡��J²>,?JI K M�[4\ D QSUXM Y,L D U�½£Ê�w�1�J�7� ­c�9¡��c���R�s�c£R¡��¢¡���£�� �s¬�£*«a�7�����Ë�S�R­s�w�¡w¬@�R«�Ù*���c�À�s¬�£R�R¬@�R�·�#�s�R� ª �J�7�Ë�c�J®R�7 �£R�1�7��²��"�c��s¬�£R�R¬@�R�·� �w�J�S�¢¬@�¢¡w�7�¹�w�1�J�7«@��¤h¬�£�� �S�R­s�w�7�<�R�s��R�7�c�J¬@�¢¡w�7�¼�À�F³m´�¦ �R�s�c£R¡��¢¡w�7�ʦ�§�¨6�c£*«J­s���7�N¡¿8��¡���¡W¿;£�¡����J¬@�J¾N£��c���c£� ��s���c�2�R � c�S�R­s�w�8�J®R�7�N¡��J¾*¡��c�£R¡��c�J¬8£��c�2�c£� ��s���c���R � ��w�1�J�7«@�Ç�J®R�7�N¡��J²

�"�c��¡w¬@�R«�ÙR�J¬¼­s�w�7�¥Í/¬@�R�¢¡r¯7M;£��J¬@���·�*¾ -"/(/ %P±Æ¡w£�1�J¬�¤h£R¬@�С��c�¼�R«a¡�­s�R "�w�1�J�7«@�!�R�s�R �Á*�����J² Å ¡·�������S �Á«J�R �«J­s ��¢¡w�7�·¡��c�Ç�S��¡�«@�!«J­c¬�®R�¼£R¤�¡��c�¼�R­s�s��£r�������s�R Þ²Å ¤��c£��S��¡�«@�À�����c�J¡w�7«a¡w�7��¾�¡��c�7���c£��*©W�w�1�J�7«@�������R�?©��­s���7��¾s£R¡��c�J¬�¿8���w���w�1�J�7«@��² Å �A�Æ�w�7«a£��s���w¡w�J�C¾S¡��c�¬��7��­s �¡���£R¤c¡��s���C«J ��R�������-«J�¢¡���£����¢¬���«a£�� ª ���c�7�2¡w£�«a£��*©¡�����­c£�­s�8�w¡w¬��J¡�«@�c�7�8£R¤��S�R­s�w�7� �w�1�J�7«@��² Ô ���s�R � �Á(¡��c��F³m´�¦ «a£��c¤h£R¬@�·�R�N¡"£�­c¡w�S­c¡"���,�R�7�c�J¬@�¢¡w�7��²�"�c�¥�S�R­s�w�Ç¡w¬@�R«�ÙR�J¬¥�s�R�����c£P¿8��¡w£¶¿;£R¬�Ù�Ñ�­s��¡w�

¬��7 ���� ª  �Á<£��º�r�w�J¡Æ£R¤�¬��7«a£R¬@�s���c�r���!�s���1�J¬��7�N¡· ��R�*©��­s�¢�R�7�ݯ?#��¢�S�R�c�7�w�R¾9Ó��c�� �������¾»Ò��J¬@�·�R��¾ ´*�¢¡w�J¬w©¤h¬@���7������«@��¾ Ô ¬��7�s«@��¾/Ó/���N±@²�Ó/®R�7�¹��¤y¡w¬@�R«�Ù*���c�����2¤°�¢¬¤h¬�£�� �1�J¬�¤h�7«a¡J¾c¡��c����­s�·�R�¼�R�s�c£R¡��¢¡w£R¬8�R�J¡��8���R£�£*��s¬��a©W�w�J�����7�N¡��¢¡���£��¹£R¤,¡��c�(�������s�R Þ²Ç�"�s���2�R � �£P¿8�2¡w£��£P®R�y®R�J¬�Á�Ñ�­s��«�Ù* �Á�¡��c¬�£�­c����¡��c���- ��R¾\�1£������ ª  �Á��1�J¬w©¤h£R¬@�·���c�¹�·���c£R¬¥�R�\Ï�­s�w¡����7�N¡��(¡w£À¡��c� ª £�­s�s�s�¢¬@���7�£R¬Æ«a£�� ª ���s���c�¶�r�w�J¡·£R¤��w�J�S�¢¬@�¢¡w�7�!�J®R�7�N¡��Æ£R¤�£��c��w�1�7�¢ÙR�J¬7²1��s�� ��r¡��c�<�S�R­s�w�r¡w¬@�R«�ÙR�J¬¶����®R�7�A�R£�£*�¸¬��7��­s �¡��

¿8�c�7�r�c£����c�Ç«a£��N®R�J¬@���¢¡���£��s�R ;�R�s�R �Á*�����m��¡2���2�c£R¡2£R¤��­s«@�»�c�7 ���¤h£R¬ �-�c���R¬@�R���c�7���S�c£��c�J¡���«�¬��7�w�7�¢¬@«@��²�8�J¬����r¡w¬@�R«�Ù*���c�À�wÁ*�w¡w�7� ¤h£R¬(®R£P¿;�7 ��¥�R�s�º«a£��s�w£¢©�s�R�N¡��F¿;£�­s �� ª ��®R�J¬�Á�­s�w�J¤°­s Þ²�Ò��¢¬@«J�����J¡J² �R Þ²"¯w¯{Ò��¢¬w©«J�����J¡"�R Þ²�¾ -"/(/$-�±w±,�¢¬��m¿;£R¬�Ù*���c��£��¼��­s«@�Ç�·�wÁ*�w¡w�7�Dz

>,?ON H,QSMTQX_�\ QX_ Y M?P M�O4M?POQ�\S_�\

Å ��¡��c�·���s��¡��R F�c�7���������S�s�R�w��£R¤;¡��c���F³m´�¦×�wÁ*�w¡w�7�¿;�y�S ��R�s�c�7�2¡w£m�����S ��7���7�N¡C¡��c�,�w¡��¢¡����w¡���«J�R ��R�s�R �Á*��������m¦2´*¨�©W�r�R�s� #��7®¢�*² Å �s�c�J�7��¾\�"��­s� ª �J¬�£R¤c���·�R � ��J¬�s¬�£R�R¬@�R�·�2�s�7®R� ª �J�7��¬��7�R ���ÛJ�7�r����¡��s����¡w�7«@�s�s��Ñ�­c�R²���c¤h£R¬�¡�­s�c�7�¢¡� �Á<��¡�Ñ�­s��«�Ù* �Á ª �7«J�R���Ç�J®*���c�7�N¡J¾/¡��s�¢¡¦2´*¨�©W�>���¥�c£R¡¥��­s��¡w�7�º¡w£À�1�J¬�¤h£R¬@� ��­s«@��«J�R �«J­s ��\©¡���£��s�,£��¼ ��¢¬��R�J¬8�w�J¡��,£R¤4�s�¢¡��*² Å ¡" ��R«�Ù*�8�s�����¥�s¬��7«J�Ú©����£����¢¬@��¡��s���J¡���«/¤°­s�s«a¡���£��s�C�R�s��«a£��s��­s���7�½¡w£���­s«@����7��£R¬�ÁR² 1��c�7�¹­s�����c�Ç�aÈ*¡w�J¬@�s�R  #��7®¢�A¤°­s�s«a¡���£��s�J¾�� ��¢¬��R����­s� ª �J¬/£R¤��s�¢¡���«a£��N®R�J¬@����£��s���s�7®R��¡w£�¡��¢ÙR��S ��R«a�R²¶³� ��w£�¡��c�(¬��7��­s �¡����c�A«a£*�c�¥����®R�J¬�Á<�s�¢¬@�<¡w£¬��7�R�¼�R�s�¼�c� ª ­c�c²Å �s�w¡w�7�R��¿;�!�s�7®R�º«@�c£��w�7�6¡w£»­s�w��¡��c� � �wÁ*�?©

¡w�7�Ǿ¶�R�Ì£R�1�7�Ì�w£�­c¬@«a�9�����S ��7���7�N¡��¢¡���£�� £R¤¶¡��c�R�S TVU,WYX �w¡��¢¡����w¡���«J�· ��R�c��­s�¢�R�¹¯ Å �s�¢Ù¢�¶�R�s�ËÒ��7�N¡� ��a©�·�R��¾ %'&(& Z�±@¾(¯4���7�s� ª  ��7�¼�R�s���8���S ��JÁR¾ %'&(&(&�±@² ������S ��7���7�N¡����R � y�·�PÏw£R¬Æ�w¡��¢¡����w¡���«J�R y¡w�7�w¡��·�R�s��«J�R Ú©«J­s ��¢¡���£��s���R�s�¶�����7Ñ�­s���s�1�7�A¿8��¡����Ç ��¢¬��R����­s� ª �J¬£R¤��s�����» ��J®R�7 2�R¬@�¢�S�s��«¶¬�£�­c¡����c�7�Ç¡w£º�R�7�c�J¬@�¢¡w�<®*�Ú©��­s�R � �Á����c¤h£R¬@�·�¢¡���®R�Æ�s¬��7�w�7�N¡��¢¡���£��s�2£R¤"¡��c�Ƭ��7��­s �¡��J²Ó/®R�7� ��£R¬��������1£R¬�¡��R�N¡���¡����s«J �­s�c�7�¼�.[(«J���7�N¡¶���*©�S­c¡ £�­c¡w�S­c¡F¬�£�­c¡����c�7�F¡w£2 �£��R���R�s�����7®R�8�w�7�·���w¡w¬@­s«v©¡�­c¬��7���s�¢¡���¯h�7��¡��c�J¬4¦�§�¨�©W�R�s�c£R¡��¢¡w�7��£R¬F�S ��R�����R��«J���¡w�aÈ*¡v±@²


� I(ý/Q��*KCL��;Ä�MAM��/MPORQ ��"�c���·�R���º¤°­s�s«a¡���£��º£R¤2¡��c��«a£R¬��S­s�(�wÁ*�w¡w�7�#«a£��*©�w¡���¡�­c¡w�7�,¡��c�2���N¡w�J¬@�c�J¡?© ª �R�w�7�¼�s�����w�7�·���s�¢¡���£��(£R¤F¡��c�«a£R¬��S­s���s�¢¡��*² 1���¡��¹¡��c�¥«J­c¬�¬��7�N¡� �Ár�����S ��7���7�N¡w�7����N¡w�J¬�¤°�R«a����¡;���;�R ��w£��1£������ ª  ��,¡w£����s�w�1�7«a¡;�R�s�(Ñ�­c�J¬�Á¡��c���w�1�J�7«@�Ë«a£R¬��S­s�J¾y¡w£À ����w¡w�7�!¡w£<¡��c���R­s�s��£<�·�\©¡w�J¬@���R ��R�s�¥¡w£Æ�s���w�S ��7Á·¡��c�m�R¬@�¢�S�s��«�¬��J�s¬��7�w�7�N¡��¢¡���£��£R¤�¡��c��¿y�7®R�J¤h£R¬@�·�¼�����À�w¡��R�s�s�¢¬@�Ë¿;� ªÊª ¬�£P¿8�w�J¬7²1¶���·�¢ÙR�Ë­s�w�!£R¤Ç¡��c� ª ­s�� �¡?©W���9¤h�7�¢¡�­c¬��7�<£R¤Ç¡��c�¿;� ªAª ¬�£P¿8�w�J¬8�c�J¬��R² Ô ­c¬�¡��c�J¬@��£R¬��R¾c¡��c��Í1³8¦8©{¡w£�£� ��¯{Ò�� ªsª £��¹�R�s�À�½¬@���s�1�7 Þ¾ -"/(/ %P±�¤h£R¬��s���w�S ��7Á*���c�¼¡��c����N¡w£��s�¢¡���£���«a£��N¡w£�­c¬7¾�¡��c�����N¡w�7�s����¡WÁA�R�s��¡��c���w�1�7«v©¡w¬�£R�R¬@�R� £R¤F¡��c�m�w�7 ��7«a¡w�7�Ǭ��J����£��s�,���(¡��c�m�R­s�s��£��- ��«J�R� ª �m���N¡w�J�R¬@�¢¡w�7��²1��c�7�¹�S ��7Á*���c� ª �R«�Ù<¡��c�¥�w£�­s�s� �- ��R¾ ª £R¡��¹¡��c�

�R­s�s��£��S�¢¬�¡��/�R�s��¡��c�,¿y�7®R�J¤h£R¬@�%���·�¢�R�7�/�¢¬��,�R�7�c�J¬w©�¢¡w�7�A�R­c¡w£��·�¢¡���«J�R � �Á ª Á¥�����·�R �  #��7®¢�·�w�J¬�®* ��J¡,�s¬�£¢©�R¬@�R�Dz¹�"�c�Ç�w�J¬�®* ��J¡��S�¢¬@�w�7��¡��c�¥¦�§�¨�©W�R�s�c£R¡��¢¡w�7�«a£R¬��S­s�J¾��aÈ*¡w¬@�R«a¡��A¡��c�¶¡������r�w¡��R���S�¼£R¤�¡��c�r¬��7 ��a©®¢�R�N¡4�J®R�7�N¡����R�s��¡��c�7��«J­c¡��F£�­c¡4¡��c�y«a£R¬�¬��7�w�1£��s�s���c��S�¢¬�¡��,£R¤4¡��c�m£R¬@�������s�R ��w£�­s�s� �- ��R²�"�c��«a£R¬��S­s�½�wÁ*�w¡w�7�9���½�w�S ���¡C���N¡w£8¡W¿;£� ��¢¬��R�J¬4��­ ª ©

«a£����1£��c�7�N¡��Jµy¡��c�0243'576"8:9<;(=>2?6"3�� 6'6dc-�R�s�¼¡��c� C 6"8���.B@�F�3 e(243 F�¯°�w�J� �S��­c¬��*-�±@²��"�c�m���c¤h£R¬@�·�¢¡���£��Æ�1£�£� �w¡w£R¬��7�F¡��c�y�s¬@���·�¢¬�Ám�s�¢¡���¯h¬@�7¿!�R­s�s��£��s�¢¡��N±½�R�F¿;�7 � �R�/¡��c�8¦�§�¨�©W�R�s�c£R¡��¢¡w�7�(¡w¬@�R�s��«a¬@���s¡���£��s�4£R¤1¡��c���R­*©�s��£*�- ��7�J²��"�c�8«a£R¬��S­s�/�7�c�����c��«a£��s�����w¡��/£R¤ �S®R����­ ª ©�wÁ*�w¡w�7�·�Jµ

%¢²�1¶� ª ©W«J ����7�N¡Jµ4¡��c�8���N¡w�J¬@�R«a¡���®R�8­s�w�J¬;���N¡w�J¬�¤°�R«a�8���«a£����S ��J¡� �Á��c���-�c�7��¡w£�¬@­s�����·���w¡��R�s�s�¢¬@��¿;� ªª ¬�£P¿8�w�J¬7² 1¶���¢¬��2­s�����c� ���"§�¨�©WÑ�­c�J¬�Á(¤h£R¬@�·�¿8�s��«@���R«a¡���®¢�¢¡w�"�w�J¬�®*��«a�7�F£���¡��c�y�w�J¬�®R�J¬/�����c��¡w£�R�7�c�J¬@�¢¡w�"¦2´*¨�©W�y©?�- �¡w�J¬@�4�s¬�£*«a�7�������c��¡��c�"�s�¢¡��*²1r�7®R�J¤h£R¬@�·���¢¬��·�s���w�S ��7ÁR�7��­s�����c�A´��Ò�²C�"�s���¿8�� � ��R � �£P¿Ý¡��c�r­s�w�J¬Ç¡w£º�w�7 ��7«a¡Ç�S�¢¬�¡��Ç£R¤�¡��c��w£�­s�s���������s�R 4�R�s��¡w£Ç�1�J¬�¤h£��Ö��£R¬���«a£����S ��aÈ�S�c£��c�J¡���«m�R�s�R �Á*�w�7�J²

-�²�1¶� ª ©W�w�J¬�®R�J¬7µy¡��c�2¿;� ª �w�J¬�®R�J¬��s���w¡w¬@� ª ­c¡w�7�y¡��c�«a£R¬��S­s�;���c¤h£R¬@�·�¢¡���£��Æ���(�w�J®R�J¬@�R ��w¡��R�s�s�¢¬@�Ƥh£R¬w©�·�¢¡���¯h¦�§�¨/¾ ���"§�¨/¾sÍ ��Ô ¾�´��Ò�¾D1�³��2±@²

�²,´��J¬�®* ��J¡?©{�7�c�����c�Rµ4¡��c���w�J¬�®* ��J¡;�7�c�����c���R«a¡���®¢�¢¡w�7�¡��c� ��­s��¡�� ª  ��9�w�J¬�®*��«a�7�Ë£��>¡��c� �w�J¬�®R�J¬»�����c�¯h¡w¬@�R�s�w¤h£R¬@�·�¢¡���£��6£R¤¼¦�§�¨�©W�R�s�c£R¡��¢¡w�7��s�¢¡��*¾£��*©{¡��c�a©�SÁ¶�S�c£��c�J¡���«·�R�s�R �Á*�����J¾F�R�7�c�J¬@�¢¡���£��<£R¤�R¬@�¢�S�s��«J�v±@²

� ²,´��J¬�®* ��J¡��Jµ � �w�J¡ £R¤Ì�F³m´�¦ ¦�§�¨�©W�7¿y�¢¬���w�J¬�®* ��J¡��2�¢¬��(­s�w�7�¶¡w£¼¡w¬@�R�s�w¤h£R¬@��¡��c�Æ�s�¢¡��¼�����­s���J¬�£�­s�·¿y�7Á*�Jµr�R�7�c�J¬@�¢¡����c� ���"§�¨9¡w£ ª ��s���w�S ��7ÁR�7�<���r¡��c� ª ¬�£P¿8�w�J¬7¾4�R�7�c�J¬@�¢¡����c��Í ��Ô

[Figure: diagram labels recovered from the original image: Corpus-Engine; Information-Pool; HTML / XML / PDF; GIF / WAV / SVG; URL / CGI / SERVLET; embedded SQL; SQL (JDBC); Activates; ResultSet; Results; XML; ESPS; PRAAT-ASCII; comma-separated list; USES.]

���������1 & 1 [��D�+\A�+�+�3·)��R�R�����+�������1���(���1�����+'<=���Z�+\A�+�+�3D,r [��Y�����<=���Z�+\A�+�+�3��;�Z�+<=5;�����;�K�+�O� B ����0=�+\A�+�+�3:�� 8�����;���J���R3:)�������D<2�&��58LN5���J�TS[�+�+���R�;������9 "�$#&%('*)��������)��+�E~E�)��)m)���E/���f�����<=���@��A'���;��OLJ�R�����K�TS�E��;�+�+�R��0=����;���(���(E�)��)$�U?���8���(�;�K�+��+'�����,

�+��02Y<��R�;�K�+�E^������6��������R)���;��� B )�?��WX5���[)���E�;3:)�����r���X��� B )�?���J���R3:��,r%M#A�G'* v)���E�%M#A�G'�"Hv)������+�EM�+�(<2���J��3����Z�+�R)����+�J���R3:)����������, [��[�+��?A5�������)�?�Y)��������Z�+�M���(�;���J���R3:)�������<2�&��5G)���ED���$��5;)��������)�5GE�)��)�0=)��+�,

� ,8�[�5;)��������)�5�E�)��)�0=)��+� �;�����RE���e�+�h�;3�'<���U?�D���1�+\A�+�+�3·<2���J���R3:)������6Z���C%!{f�G')��������)��+�E������<=���9E�)��)b�;�9�+�+����E��;��)b��'5;)��������)�5ZE�)��)�0=)��+�,V [��IE�)��)�0=)��+C0=)���;��)�5;5�\��<=5;)�����])f�+�)���E�)��REjWX5�I�+\A�+�+�3D,9�Y�m%M#A�G' �'t<�������R)�3��+�R)����5;)��+��]���I%!{f�G'*)��������)��+�E�����<=���ME�)��)1�;�K�+�9)1������)�0=5���J���R3:)��M�J���O����!��{j#X,

[��b�;3@<=5��3@��K�)�������x���D���~�����<=���m�+\A�+�+�3 �;�0=)��+�E�������<2����+�����R��i�+���J� B )���,4 [��i "�$#&%(')��������)��+�����;�O)D<=���I��)�?�)9)�<�<=5;�;��)����������)�5;5r��������+�&��5;�Y)����3:)�5;5���(%M#A�G'* �)���E9<2��R5a����R��<����,(�Y�!)�����5���6`���8����3@<=5���+8 "�$#&%('t��K?A�������3@��K�a�R�����G���}Q�;��E�� B �$)���E��Y���k_9<=5;)��+�J���R3:��,� [��@�+���J� B )��^�;�E��;�+�+�R��0=���+�EI����E���(¢!PZ�m)���ED��)��102$E�� B ��5���)�E��E�J���3������ B �0=����+ �`,

! "$#&%('*),+(-/.0#&%(- [��i "�$#&%('t0=)��+�E�)�<�<����)��R�x��)��9<���U?��E��+�Q02��������5�\���I������K�i)���E|���)�0=5��E ���m�+��E��?��5���< )<2� B ���N��5�5;�;�������;�+��;�Y��K?A�������3@��K�Y�;�9):?���\1���������;3@�,"�!��Z�+�(���Z��������5�\Y�+�+�R���������EM�J���R3:)���������� "�$#&%('*)��������)��+�E�E�)��)v3@���g����3@<=5��_~���+�)��R�R��&����+��������D��)��Q02j�;�K?���+�����)��+�E��;�Q)/�+\A�+�+�3:)���;�B )�\�, � "�$#&% �����<=���Q��)��¦02��+�R)����+�J���R3@�E

13254�40687 9�9;:=<�> ?A@ > ? > ?A@ BDCD? E�FD? G=> G3HIG=> J�@ J�G;9�KMLN? > J�G;9�4,OQPSR/9


���N¡w£Ê ��¢¬��R�!��­s� ª �J¬�£R¤(�s���1�J¬��7�N¡�¤h£R¬@�·�¢¡��J¾·�·�¢ÙN©���c�¹��¡Æ�1£������ ª  ��Ç¡w£À���N¡w�J�R¬@�¢¡w�¶�F³m´�¦8©W�7¿y�¢¬��¶«a£���©�1£��c�7�N¡��,���N¡w£��w¡��R�s�s�¢¬@�¥ ����c��­s���w¡���«,¿;£R¬�Ù*���c���s¬�£*«a�a©�s­c¬��7�J²��"�s����¿8�� � 4�c£R�1�J¤°­s � �Á¼ ��7�R��¡w£¼¡��c�Æ«a¬��7�¢¡���£��£R¤4 ����c��­s���w¡���«,¬��7�w£�­c¬@«a�7�,¿8�s��«@�¥«J�R� ª �m­s�w�7�¥£P®R�J¬�� �£��c���1�J¬@��£*��£R¤-¡������ ª Á��s���1�J¬��7�N¡4¬��7�w�7�¢¬@«@�c�J¬@��¿8��¡����¿8���c��¬@�R�c�R�2£R¤4��«J���7�N¡����-«��R£��R ��J²

��Q � QsL�QSG��*QSM´-² My��¬@�6�R�s�%§<²�¨½� ª �J¬@�·�R��² %'&(&(&�²9³ Ô £R¬@�·�R Ô ¬@�R���J¿;£R¬�Ù�¤h£R¬2¨½���c��­s���w¡���«�³��s�c£R¡��¢¡���£���²��½�7«@�*©�s��«J�R !�"�J�1£R¬�¡�§¶´N©�+ Å ´N© &(&P© / %¢¾ � �J�S�¢¬�¡����7�N¡�£R¤+;£����S­c¡w�J¬��R�s� Å �c¤h£R¬@�·�¢¡���£��º´*«J���7�s«a�R¾����s��®R�J¬w©����¡WÁ(£R¤�ÍF�7�s�s�wÁ* �®¢�R�s���*²

´�¡w�J®R�7� My��¬@��¾��2�¢Û7­s�¢Ù*�,§��¢�7�s�*¾y�R�s�À¦����¢£PÁ*�,§��*²-"/(/ %¢²!³8�R¡wÙ-µ»¡��c�¹�R�s�c£R¡��¢¡���£��¸�R¬@�¢�S�¸¡w£�£� �Ù*��¡J²Å �×ÍF�J¡w�J¬ My­s�c�7�·�R� ´�¡w�J®R�7� My��¬@�×�R�s�ק��¢¬�Ù¨½� ª �J¬@�·�R��¾��7�s��¡w£R¬@�J¾������ � 6"8���@� E6Q� 6"3�� 243 �e(BE2 @�=>2?C���;(=,;��G; @'F:@�����3D2�h(F�8 @:2>=�� 6�5�� F�3D3 @��dc h";/�3D2?;����� �2�c ; f�Fjc �� �2?;�� � � �m²

ÍF² M;£��J¬@���·�*² -"/(/ %¢²AÍ/¬@�R�¢¡J¾"�r�wÁ*�w¡w�7� ¤h£R¬(�c£����c��S�c£��c�J¡���«J� ª Á�«a£����S­c¡w�J¬7²"!Gc 6(= � 3 = F�8:3.;(=>2?6"3.;dc ¾#c¯?& %7/N±@µ � %�$ � #�²

�8�7�s�s��� M;¬@­c���·�R�¶�R�s��ÍF�J¡w�J¬!1���¡w¡w�7� ª ­c¬��c² -"/(/ %¢²§A�S�m¡w£�£� ��A¤h£R¬� ����c��­s���w¡���«��R�s�c£R¡��¢¡���£���² Å �¸ÍF�a©¡w�J¬ My­s�c�7�·�R�Ê´�¡w�J®R�7� My��¬@�º�R�s�˧��¢¬�Ù˨½� ª �J¬w©�·�R��¾��7�s��¡w£R¬@�J¾%����� �& 6"8���@� E6Q� 6"3'� 243 e(BE2 @��=>2?C(��;(=,;��G; @'F:@��)��3D2�h(F�8 @:2>=�� 6�5*� F�3D3 @��dc h";"3D2?;���� �2�c ; f�Fjc �� �2?;�� � � �m²

#c² +y�¢¬@ ��J¡w¡��*¾ � ²;§�«+�m�7 �®*���R¾y�R�s� Å ���¢¬@�!³�² -"/(/$-�²´*­c�s�1£R¬�¡����c�º ����c��­s���w¡���«<�R�s�c£R¡��¢¡���£���­s�����c�ºÈc�· �R�s�r�w¡WÁ* ��7���c�J�J¡��J² Å �¹Ò�²4´*�R���S�w£��r�R�s� � ²½§�«v©+y�¢¬�¡��NÁR¾s�7�s��¡w£R¬@�J¾,��FG; f"243 e"@ 243-� 6"80�.B@.� 243 e(BE2 @��=>2?C/�0� 6"3 =>243DBEBE91� 3 = F�8:3.;(=>2?6"3.;dc ²

¨/² � Á ª Ù¢�¢�J¬7¾Ç§<² M�²�§A£��7 � ��J¬7¾ )�² � ² M;�J¬@�s�w�7��¾#c²D+y�¢¬@ ��J¡w¡��*¾s³�² Å ���¢¬@��¾*§<²2�2 ��7����¾ � ²c§�«+�m�7 �®*���R¾�R�s��³�²C§A�7�c�R�7 Þ² #�­s�c� %'&(&(&�²��"�c���·�¢¡w��¿;£R¬�ÙN©ª �7�s«@��² Å � � �7®*���À�½¬@�R­s�Ǿ4�7�s��¡w£R¬7¾3� 8 6'C:F Fgf"243 e"@6�5 �4�5�76 898���� F�9<6"3 @�=>8 ;(=>2?6"3 �7�:@�=>8 ;$C'=4@+:���3D2�h(F�8��@:2>=�� 6�5<; ;"8��dc ;"3=f¢¾S�S�¢�R�7� %'-=$�% �²

#c² Ó"²/Ò��¢¬@«J���*¾���²2M�²�Ò�­c¡J¾��R�s�À³�²/Ò��R �®R�7�J²0-"/(/$-�²��£*«J�R ��,©½�2�w�7�·�Ú©W�R­c¡w£��·�¢¡���«,�R�s�c£R¡��¢¡���£���¡w£�£� c¤h£R¬�s¬�£��w£*�s��«�¬��7�w�7�¢¬@«@��² Å � M�² M;�7 ��R�s� Å ²½§��¢¬@ ����7��¾�7�s��¡w£R¬@�J¾>� 8 6'C:F Fgf"243 e"@�6�5 =? F � �DF FGC/ '� 8 6 @76 f@�ACB9B�A C 6"3'5'F�8GF�3.C:F��ED9D/��DGF � �.8:2�c<ACB9B�AH: ��2JI �GF�3 �� 8 6dh(F�3.C:FLKM� ;��G6"8 ;(=,6"248GFN��;"8 6dc F F7=O� ;"3 e�; eFa¾�S�¢�R�7� (-9P=$ "/*²

� ²½Ò�� ªsª £��A�R�s���m²1�½¬@���s�1�7 Þ² -"/(/ %¢²�Í4�\ȼ©"�R���R�*©�c£R¡��¢¡���£�� ª �R�w�7�¼«a£��s«a£R¬@�s�R�s«J���c��¡w£�£� �Ù*��¡J² Å �¼ÍF�a©¡w�J¬ My­s�c�7�·�R�Ê´�¡w�J®R�7� My��¬@�º�R�s�˧��¢¬�Ù˨½� ª �J¬w©�·�R��¾��7�s��¡w£R¬@�J¾%����� �& 6"8���@� E6Q� 6"3'� 243 e(BE2 @��

=>2?C(��;(=,;��G; @'F:@��)��3D2�h(F�8 @:2>=�� 6�5*� F�3D3 @��dc h";"3D2?;���� �2�c ; f�Fjc �� �2?;�� � � �m²

�2² Å �s�¢Ù¢�¶�R�s� �2²;Ò��7�N¡� ��7�·�R��²A%'&(& Z�² �2µ/³Ì ��R�*©��­s�¢�R�¥¤h£R¬·�s�¢¡����R�s�R �Á*�������R�s�<�R¬@�¢�S�s��«J�J²%Q 6"BE8��3.;dc!6�5-� 6"9 �.B =,;(=>2?6"3.;dc ;"3=fR! 8 ;Q�� �2?C ;dc � =,;(=>2 @��=>2?C�@v¾S#c¯��±@µ -(&(&L$ E% � ²

§���«@�s�¢�7 ��2���s�C²�-"/(/ %¢²m³��N®*�� ½©8�¥�R�7�c�J¬@��«��R�s�c£R¡��\©¡���£��2¡w£�£� N¤h£R¬F��­s �¡�����£*�s�R ��s���R �£R��­c�R² Å � � 8 6'C:F Fgf/�243 e"@ 6�5 =? FUT BE8 6 @0�DF FGC/ -ACB9BGD���� ;dcV�G6"82eR¾y�S�¢�R�7�% Z9P=$�% 9P /*²

#c² ©W�m²C§��� ��c���R�s� ��² M�²FÒ�­c¡J²!-"/(/ %¢²2�"�c�·�F³m´�¦8©�7�c�����c�Rµ �R� ¦�§�¨�© ª �R�w�7� «a£R¬��S­s�>�s�¢¡�� ª �R�w�¤h£R¬ ¡������Ý�R ������c�7�# ��R�c��­s�¢�R���s�¢¡��*² Å � ÍF�a©¡w�J¬ My­s�c�7�·�R�Ê´�¡w�J®R�7� My��¬@�º�R�s�˧��¢¬�Ù˨½� ª �J¬w©�·�R��¾��7�s��¡w£R¬@�J¾%����� �& 6"8���@� E6Q� 6"3'� 243 e(BE2 @��=>2?C(��;(=,;��G; @'F:@��)��3D2�h(F�8 @:2>=�� 6�5*� F�3D3 @��dc h";"3D2?;���� �2�c ; f�Fjc �� �2?;�� � � �m²

#c² ©W�m²�§��� ��c���R�s����²YM�²½Ò�­c¡J²�-"/(/$-¢�*²m³��s¬�£��w£*�s��««a£R¬��S­s��£R¤m�c£��*©W�s�¢¡���®R�¼�w�1�J�7«@��² Å � M�²*M;�7 8�R�s�Å ²-§��¢¬@ ����7��¾S�7�s��¡w£R¬@�J¾W� 8 6'C:F Fgf"243 e"@<6�5 =? F � �DF FGC/ � 8 6 @76 f@�XACB9B�A C 6"3'5'F�8GF�3.C:F��YD9D/��DGF � �.8:2�c�ACB9B�AH:��2JI �GF�3 �Z� 8 6dh(F�3.C:FLK7� ;��G6"8 ;(=,6"248GF[��;"8 6dc F F7=\� ;"3 �e�; eFa¾c�S�¢�R�7�0#"/��$]#"/�Z�²

#c² ©W�m²4§��� ��c�(�R�s� ��²2M�²/Ò�­c¡J²<-"/(/$- ª ²·�"�c�(¡��R�?È�©�7�N®*��¬�£��s���7�N¡JµÇ�R�¹Èc�· Ú© ª �R�w�7�À¡w£�£� ��w�J¡�¤h£R¬�¡�������R ������c�7�(�w�1�J�7«@�Ç«a£R¬��1£R¬@�*² Å �E� 8 6'C:F Fgf"243 e"@�6�5!=? F=? �248 f 243 = F�8:3.;(=>2?6"3.;dc.C 6"3'5'F�8GF�3.C:F�6"3Lc ;"3 e(BD; eF�8GF �@76"BE8 C:F:@ ;"3=f Fjh";dc BD;(=>2?6"3_^��`�aT7�bACB9B�AG��! 8 ;"3� ;"3.;"8:2?;¢²

�m²9´*«@�s�·���c¡J² -"/(/ %¢² Ò��7�w�s¬�c�R«@�s�w¡w¬@�R�s�wÙ�¬@���s¡���£���R­c¤ �c�7� +;£����S­c¡w�J¬ © �s�R� ´�Á*�w¡w�7�Ó/¦�§�³��8�R¨ � ³�² ! F:@0�.8Wd;$C/ $@4576"8 @7C/ �BE3 e@� =?= �WK e9e�i i i3: eF:@0�.8 ;�FGC/ $@4576"8 @7C/ �BE3 e � 6gf:@+:�fFa¾-�²

19² )�²'���7�s� ª  ��7�F�R�s� M�² � ² �8���S ��JÁR²�%'&(&(&�²3; 6 f�F�8:3� ���=c 2,Fgf � =,;(=>2 @�=>2?C�@ i 2>=? � �Z�Gc B@+: �5 �248 f=T f"2>=>2?6"3s²´��s¬@���c�R�J¬7² Å ´ M ) /\©3Gh9P7© &GhGh(-G#P© � ²


Cascaded Regular Grammars over XML Documents*

Kiril Simov, Milen Kouylekov, Alexander Simov
CLaRK Programme & BulTreeBank Project
http://www.BulTreeBank.org
Linguistic Modelling Laboratory - CLPPI, Bulgarian Academy of Sciences
Acad. G. Bonchev Str. 25A, 1113 Sofia, Bulgaria
Tel: (+3592) 979 28 25, (+3592) 979 38 12, Fax: (+3592) 70 72 73

[email protected], [email protected], adis [email protected]

Abstract

The basic mechanism of CLaRK for linguistic processing of text corpora is the cascaded regular grammar processor. The main challenge for such grammars is how to apply them to an XML encoding of the linguistic information. The system offers a solution that uses the XPath language to construct the input word for the grammar and an XML encoding of the categories of the recognized words.

1 Introduction

This paper describes a mechanism for the definition and application of cascaded regular grammars over XML documents. The main problems are how to define the input words for a grammar and how to incorporate the grammar categories returned by its rules back into the document. The presented solutions are implemented within the CLaRK System, an XML-based System for Corpora Development (Simov et al., 2001).

The main goal behind the design of the system is the minimization of human intervention during the creation of corpora. Creation of corpora is still an important task for the majority of languages, such as Bulgarian, where the effort invested in such development is very modest in comparison with more intensively studied languages like English, German and French. We consider the corpora creation task as editing, manipulation, searching and transforming of documents. Some of these tasks will be done for a single document or a set of documents, others will be done on a part of a document. Besides the efficiency of the corresponding processing at each stage of the work, the most important investment is the human labor. Thus, in our view, the design of the system has to be directed to minimization of the human work.

* This work is funded by the Volkswagen Stiftung, Federal Republic of Germany, under the Programme "Cooperation with Natural and Engineering Scientists in Central and Eastern Europe", contract I/76 887. The authors are grateful to Tylman Ule from SfS, University of Tübingen, Germany, and the three anonymous reviewers for their comments on earlier drafts of the paper. Needless to say, all remaining errors in the text are ours.

For document management, storing and querying we have chosen the XML technology because of its popularity and its ease of understanding. XML has become a part of our lives and a predominant language for data description and exchange on the Internet. Moreover, many established standards for corpus description, such as (XCES, 2001) and (TEI, 2001), have already been adapted to the XML requirements. The core of the CLaRK System is an XML Editor, which is the main interface to the system. With the help of the editor the user can create, edit or browse XML documents. To facilitate corpus management, we enlarge the XML inventory with facilities that support linguistic work. We added the following basic language processing modules: a tokenizer with a module that supports a hierarchy of token types; a finite-state engine that supports the writing of cascaded regular grammars and facilities for regular pattern search; the XPath query language, which is able to support navigation over the whole set of mark-up of a document; and mechanisms for imposing constraints over XML documents which are applicable in the context of some events. We envisage several usages of our system:

1. Corpora markup. Here users work with the XML tools of the system in order to mark up texts with respect to an XML DTD. This task usually requires an enormous human effort and comprises both the mark-up itself and its validation afterwards. Using the available grammar resources, such as morphological analyzers or partial parsing, the system can state local constraints reflecting the characteristics of a particular kind of text or mark-up. One example of such a constraint is the following: according to a DTD, a PP can have an NP or a VP as parent, but if its left sister is a VP then the only possible parent is VP. The system can use constraints of this kind to support the user and minimize his/her work.

2. Dictionary compilation for human users. The system supports the creation of the actual lexical entries, whose structure is defined via a DTD. The XML tools will also be used for corpus investigation that provides appropriate examples of word usage in the available corpora. The constraints incorporated in the system are used for writing a grammar of the sub-languages of the definitions of the lexical items, and for imposing constraints over elements of lexical entries and over the dictionary as a whole.

3. Corpora investigation. The CLaRK System offers a set of tools for searching over tokens and mark-up in XML corpora, including cascaded grammars and the XPath language. Combinations of these tools are used for tasks such as: extraction of elements from a corpus, for example extraction of all NPs in the corpus; and concordance, for example listing all NPs in context, ordered by a user-defined set of criteria.
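As a rough illustration of the extraction and concordance tasks named in the last item, the following Python sketch lists every NP together with a few words of context. It is not CLaRK code: the element names NP and w, the function name concordance and the default sort key are assumptions made for this example, and the standard library's ElementTree stands in for the system's own search tools.

```python
import xml.etree.ElementTree as ET

def concordance(root, tag, width=2, key=lambda line: line[1].lower()):
    """Return (left context, phrase, right context) for every `tag` element."""
    words = list(root.iter("w"))                    # all word elements in document order
    position = {w: i for i, w in enumerate(words)}  # element -> index in that order
    lines = []
    for phrase in root.iter(tag):
        inside = list(phrase.iter("w"))
        if not inside:
            continue
        first, last = position[inside[0]], position[inside[-1]]
        left = " ".join(w.text for w in words[max(0, first - width):first])
        right = " ".join(w.text for w in words[last + 1:last + 1 + width])
        lines.append((left, " ".join(w.text for w in inside), right))
    # The ordering criterion is supplied by the caller (hypothetical default:
    # alphabetical order of the phrase itself).
    return sorted(lines, key=key)

corpus = ET.fromstring(
    "<s><NP><w>the</w><w>boy</w></NP><w>sees</w>"
    "<NP><w>a</w><w>telescope</w></NP></s>"
)
hits = concordance(corpus, "NP")
```

Passing a different key function changes the ordering, which loosely mirrors the user-defined criteria mentioned above.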

The structure of the paper is as follows: in the next section we give a short introduction to the main technologies on which the CLaRK System is built. These are: XML technology; cascaded regular grammars; Unicode-based tokenizers; and constraints over XML documents. The third section describes the definition of cascaded regular grammars. The fourth section presents an approach for applying cascaded regular grammars over XML documents. The last section concludes the paper.

2 The technologies behind the CLaRK System

CLaRK is an XML-based software system for corpora development implemented in Java. It incorporates several technologies:

• XML technology;

• Unicode;

• Regular Grammars (they are presented in the next sections);

• Constraints over XML Documents.

2.1 XML Technology

The XML technology is at the heart of the CLaRK system. It is implemented as a set of utilities for structuring, manipulation and management of data. We have chosen the XML technology because of its popularity, its ease of understanding and its already wide use in the description of linguistic information. Besides the XML language (see (XML, 2000)) processor itself, we have implemented an XPath language (see (XPath, 1999)) engine for navigation in documents and an XSLT language (see (XSLT, 1999)) engine for transformation of XML documents. The documents in the system are represented as DOM Level 1 trees (see (DOM, 1998)). We started with basic facilities for creation, editing, storing and querying of XML documents and developed this inventory further towards a powerful system for processing not only single XML documents but an integrated set of documents and constraints over them. The main goal of this development is to allow the user to add the desired semantics to the XML documents.

In the implementation of cascaded regular grammars within the CLaRK System, the XPath language plays a crucial role. XPath is a powerful language for selecting elements from an XML document. The XPath engine considers each XML document as a tree where the nodes of the tree represent the elements of the document, the document's outermost tag is the root of the tree and the children of a node represent the content of the corresponding element. The content nodes can be element nodes or text nodes. Attributes of each element and their values are represented in addition to the tree.

The XPath language uses tree-based terminology to point in some direction within the tree. One of the basic notions of the XPath language is the so-called context node, i.e. a chosen node in the tree. Each expression in the XPath language is evaluated with respect to some context node. The nodes in the tree are categorized with respect to the context node as follows: the nodes immediately under the context node are called children of the context node; all nodes under the context node are called descendant nodes of the context node; the node immediately above the context node is called the parent of the context node; all nodes above the context node are called ancestor nodes of the context node; the nodes that have the same parent as the context node are called sibling nodes of the context node. Siblings of the context node are divided into two types, preceding siblings and following siblings, depending on their order with respect to the context node in the content of their parent: if the sibling is before the context node, then it is a preceding sibling, otherwise it is a following sibling. Attribute nodes are attached to the context node as attribute nodes and they are not children, descendant, parent or ancestor nodes of the context node. The context node with respect to itself is called the self node.
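The node categories just described can be made concrete with a small Python sketch. It only illustrates the terminology and is not the CLaRK XPath engine: the function name axes and the sample document are assumptions, and since the standard library's ElementTree does not model text nodes as separate nodes, only element nodes and attributes appear.

```python
import xml.etree.ElementTree as ET

def axes(root, context):
    """Categorize the element nodes of the tree relative to `context`."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    ancestors = []
    node = context
    while node in parent_of:
        node = parent_of[node]
        ancestors.append(node)

    parent = parent_of.get(context)
    siblings = list(parent) if parent is not None else [context]
    position = siblings.index(context)

    return {
        "self": context,
        "children": list(context),
        "descendants": [n for n in context.iter() if n is not context],
        "parent": parent,
        "ancestors": ancestors,
        "preceding-siblings": siblings[:position],
        "following-siblings": siblings[position + 1:],
        # Attributes belong to the context node but are not part of the tree axes.
        "attributes": dict(context.attrib),
    }

doc = ET.fromstring(
    '<s><NP><w g="N">John</w></NP>'
    '<VP cat="fin"><w g="V">loves</w><NP><w g="N">Mary</w></NP></VP></s>'
)
result = axes(doc, doc.find("VP"))
tags = {name: [n.tag for n in nodes]
        for name, nodes in result.items() if isinstance(nodes, list)}
```

Here the VP element is the context node; its first-level content gives the children, everything below it the descendants, and the NP to its left is its only preceding sibling.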

2.2 Tokenization

XML considers the content of each text element a single string, which is usually unacceptable for corpus processing. For this reason the wordforms, punctuation and other tokens in the text need to be distinguished. In order to solve this problem, the CLaRK system supports a user-defined hierarchy of tokenizers. At the very basic level the user can define a tokenizer in terms of a set of token types. In this basic tokenizer each token type is defined by a set of Unicode symbols. Above these basic-level tokenizers the user can define other tokenizers for which the token types are defined as regular expressions over the tokens of some other tokenizer, the so-called parent tokenizer. For each tokenizer an alphabetical order over the token types is defined. This order is used for operations such as the comparison of two tokens, sorting and similar.
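A minimal Python sketch of a two-level tokenizer hierarchy may help to picture this. Everything specific in it is an assumption: the one-letter type codes U, L, S and O, the symbol sets, the rule that the basic tokenizer emits one token per symbol, and the choice of the longest match among the derived types. The paper does not specify these details.

```python
import re

# Basic tokenizer: each token type is defined by a set of Unicode symbols.
BASIC_TYPES = {
    "U": set("ABCDEFGHIJKLMNOPQRSTUVWXYZ"),
    "L": set("abcdefghijklmnopqrstuvwxyz"),
    "S": set(" \t\n"),
}

def basic_tokenize(text):
    """Return one (symbol, type code) pair per symbol; unknown symbols get 'O'."""
    tokens = []
    for ch in text:
        code = next((c for c, symbols in BASIC_TYPES.items() if ch in symbols), "O")
        tokens.append((ch, code))
    return tokens

# Derived tokenizer: token types are regular expressions over the token
# types of the parent (basic) tokenizer.
DERIVED_TYPES = [
    ("CAPITALFIRSTWORD", re.compile(r"UL+")),
    ("WORD", re.compile(r"L+")),
    ("SPACE", re.compile(r"S+")),
]

def derived_tokenize(text):
    """Group parent tokens into derived tokens, taking the longest match."""
    parent = basic_tokenize(text)
    codes = "".join(code for _, code in parent)
    tokens, pos = [], 0
    while pos < len(codes):
        best_name, best_end = None, pos
        for name, pattern in DERIVED_TYPES:
            m = pattern.match(codes, pos)
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:            # no derived type applies here
            best_name, best_end = "OTHER", pos + 1
        surface = "".join(ch for ch, _ in parent[pos:best_end])
        tokens.append((surface, best_name))
        pos = best_end
    return tokens
```

With these assumed definitions, derived_tokenize("John loves Mary") yields the tokens "John", " ", "loves", " " and "Mary" with the types CAPITALFIRSTWORD, SPACE, WORD, SPACE and CAPITALFIRSTWORD.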

2.3 Constraints on XML documents

Several mechanisms for imposing constraints over XML documents are available. These constraints cannot be stated by the standard XML technology (even by means of XML Schema (XML Schema, 2001)). The following types of constraints are implemented in CLaRK: 1) finite-state constraints: additional constraints over the content of given elements based on a document context; 2) number restriction constraints: cardinality constraints over the content of a document; 3) value constraints: restriction of the possible content or parent of an element in a document based on a context. The constraints are used in two modes: checking the validity of a document with respect to a set of constraints; and supporting the linguist in his/her work during the process of corpus building. The first mode allows the creation of constraints for the validation of a corpus according to given requirements. The second mode serves the underlying strategy of minimizing human labor.

The general syntax of the constraints in the CLaRK system is the following:

(Selector, Condition, Event, Action)

where the selector defines to which node(s) in the document the constraint is applicable; the condition defines the state of the document at the time when the constraint is applied. The condition is stated as an XPath expression which is evaluated with respect to each node selected by the selector. If the evaluation of the condition yields a non-empty list of nodes, then the constraint is applied; the event defines the conditions of the system under which the constraint is checked for application. Such events can be: the selection of a menu item, the pressing of a key shortcut, an editing command such as entering a child or a parent, and similar; the action defines the way the constraint is actually applied.

Here we present constraints of type "Some Children". Constraints of this kind deal with the content of some elements. They determine the existence of certain values within the content of these elements. A value can be a token or XML mark-up, and the actual value for an element can be determined by the context. Thus a constraint of this kind works in the following way: first it determines to which elements in the document it is applicable, then for each such element in turn it determines which values are allowed and checks whether some of these values are present in the content of the element as a token or as XML mark-up. If there is such a value, then the constraint proceeds to the next element. If there is no such value, then the constraint offers the user the possibility to choose one of the allowed values for this element, and the selected value is added to the content as a first child. Additionally, there is a mechanism for filtering the appropriate values on the basis of the context of the element.
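The working of a "Some Children" constraint can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the CLaRK implementation: allowed values are restricted to element tags, the condition and the event are left out (calling the function plays the role of the event), the selector uses the XPath subset of the standard library's ElementTree, and the callback choose is a hypothetical stand-in for the dialogue with the user.

```python
import xml.etree.ElementTree as ET

def some_children(root, selector, allowed_tags, choose):
    """Make sure every selected element has a child with an allowed tag.

    Returns the list of elements that had to be completed.
    """
    completed = []
    for element in root.findall(selector):
        if any(child.tag in allowed_tags for child in element):
            continue                          # an allowed value is present
        tag = choose(element, allowed_tags)   # stands in for asking the user
        element.insert(0, ET.Element(tag))    # added as the first child
        completed.append(element)
    return completed

dictionary = ET.fromstring(
    "<dict>"
    "<entry><form>move</form><pos/></entry>"
    "<entry><form>walk</form></entry>"
    "</dict>"
)
# Hypothetical policy: always pick the first allowed value.
fixed = some_children(dictionary, ".//entry", ["pos"],
                      lambda element, allowed: allowed[0])
```

In this example only the second entry lacks a pos child, so it is the only element that receives one, placed before its existing content.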

3 Cascaded Regular Grammars

The CLaRK System is equipped with a finite-state engine which is used for several tasks in the system, such as validity checking of XML documents, tokenizers, search and cascaded regular grammars. In this and the next section we present the use of this engine for cascaded regular grammars over XML documents along the lines described in (Abney, 1996). The general idea underlying cascaded regular grammars is that there is a set of regular grammars. The grammars in the set are in a particular order. The input of a given grammar in the set is either the input string, if the grammar is first in the order, or the output string of the previous grammar. Another specific feature of cascaded grammars is that each grammar tries to recognize only a particular category in the string but not the whole string. The parts of the input word that are not recognized by the grammar are copied to the output word. Before going into detail on how to apply grammars in the CLaRK System, some basic notions about regular expressions are given.

Regular grammars are standardly formalized as a set of rewriting rules of the following kinds:

A -> b C
A -> B
A -> b

where A, B, C stand for non-terminal symbols and b stands for terminal symbols. Such grammars are applied over a word of terminal symbols in order to parse it to a special goal symbol S. Each language accepted by such a grammar is called regular. Using such a formalization one can situate the regular languages within the families of other languages such as context-free languages, context-sensitive languages and so on. In practice this formalization is rarely used. Other formal devices for dealing with regular languages are regular expressions and finite-state automata, with the well-known correspondence between them. Although regular grammars are not expressive enough to be a good model of natural languages, they are widely used in NLP. They are used in the modelling of inflectional morphology (see (Koskenniemi, 1983)), tokenization and Named Entity recognition (Grover et al., 2000), and many other tasks.

In our work we modify the definition of regular grammars along the lines of (Abney, 1996). We use rewriting rules of the following kind:

C -> R


where R is a regular expression and C is a category of the words recognized by R. We can think of C as a name of the language recognized by R.

A regular grammar is a set of rules such that the regular expressions of the rules recognize pairwise disjoint languages. The disjointness condition is necessary in order for the grammar to assign a unique category to each word recognized by it.

A regular grammar works over a word of letters, called the input word. The grammar scans the input word from left to right, trying to find the first sub-word that belongs to the language of the regular expression of one of the rules of the grammar. If there is no such sub-word starting with the first letter of the input word, then the grammar outputs the first letter of the word and continues the scanning with the second letter. When the grammar recognizes a sub-word, it outputs the category of the corresponding rule and continues the scanning with the letter after the recognized sub-word. The grammar works deterministically over the input word. The result of the application of the grammar is a copy of the input word in which the recognized sub-words are substituted with the categories of the grammar. The resulting word is called the output word of the grammar. In this respect such regular grammars can be considered a kind of finite-state transducer.

An additional requirement suggested by (Abney, 1996) is the so-called longest match, which is a way to choose one of the possible analyses for a grammar. The longest match strategy requires that the recognized sub-words, from left to right, have the longest possible length. Thus the segmentation of the input word starts from the left and tries to find the first longest sub-word that can be recognized by the grammar, and so on to the end of the word.

An example of such a regular grammar is the grammar G1 for recognition of dates in the format dd.mm.yyyy (10.11.2002), defined by the following rule:

Date ->
  ( (0,(1|2|3|4|5|6|7|8|9)) |
    ((1|2),(0|1|2|3|4|5|6|7|8|9)) |
    (3,(0|1)) ),
  .,
  ( (0,(1|2|3|4|5|6|7|8|9)) | (1,(0|1|2)) ),
  .,
  ( ((1|2|3|4|5|6|7|8|9),(0|1|2|3|4|5|6|7|8|9)*) )

Application of this grammar to the following input word

The feast is from 12.03.2002 to 15.03.2002.

will produce the output word

The feast is from Date to Date.
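The left-to-right scan with longest match can be sketched in a few lines of Python. This is only an approximation of the behaviour described above, not the CLaRK finite-state engine: the function name apply_grammar is an assumption, and Python's re module is a backtracking matcher rather than a longest-match automaton, so the sketch relies on the date pattern being written in such a way that greedy matching coincides with the longest match.

```python
import re

# The date rule above, transcribed as a Python regular expression:
# day 01-31, month 01-12, year with a non-zero first digit.
DATE = re.compile(r"(0[1-9]|[12][0-9]|3[01])\.(0[1-9]|1[0-2])\.[1-9][0-9]*")

def apply_grammar(rules, word):
    """Scan `word` from left to right and replace each recognized
    sub-word by the category of the rule that recognized it."""
    output, pos = [], 0
    while pos < len(word):
        best_category, best_end = None, pos
        for category, pattern in rules:
            m = pattern.match(word, pos)
            if m and m.end() > best_end:      # keep the longest recognized sub-word
                best_category, best_end = category, m.end()
        if best_category is None:
            output.append(word[pos])          # unrecognized letter is copied
            pos += 1
        else:
            output.append(best_category)
            pos = best_end                    # continue after the sub-word
    return "".join(output)

output_word = apply_grammar(
    [("Date", DATE)], "The feast is from 12.03.2002 to 15.03.2002."
)
```

The final full stop of the sentence is not part of the second date, because the year part of the rule admits digits only; it is copied to the output word like every other unrecognized letter.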

A cascaded regular grammar (Abney, 1996) is a sequence of regular grammars defined in such a way that the first grammar works over the input word and produces an output word, the second grammar works over the output word of the first grammar and produces a new output word, and so on.

As one example of a cascaded regular grammar, let us consider the sequence of G1 as defined above and the grammar G2 defined by the following rule:

Period -> from, Date, to, Date

Application of the grammar G2 to the output of the grammar G1:

The feast is from Date to Date.

will produce the following output word

The feast is Period.
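The two-step cascade of this example can be sketched in Python as a chain of functions over token lists. The sketch makes assumptions that the paper leaves open: the input is first split into tokens with spaces discarded (the regular expression TOKEN is a hypothetical tokenizer), categories are represented as plain strings in the token list, and the helper names date_grammar, sequence_grammar and cascade are invented for the illustration.

```python
import re

TOKEN = re.compile(r"\d+(?:\.\d+)*|\w+|[^\w\s]")   # assumed tokenizer
DATE = re.compile(r"(0[1-9]|[12][0-9]|3[01])\.(0[1-9]|1[0-2])\.[1-9][0-9]*")

def date_grammar(tokens):
    """First grammar: replace every dd.mm.yyyy token by the category Date."""
    return ["Date" if DATE.fullmatch(t) else t for t in tokens]

def sequence_grammar(category, pattern):
    """Build a grammar that rewrites each occurrence of the token
    sequence `pattern` to `category` and copies everything else."""
    def apply(tokens):
        output, pos = [], 0
        while pos < len(tokens):
            if tokens[pos:pos + len(pattern)] == pattern:
                output.append(category)
                pos += len(pattern)
            else:
                output.append(tokens[pos])
                pos += 1
        return output
    return apply

def cascade(grammars, tokens):
    """Feed the output word of each grammar to the next one."""
    for grammar in grammars:
        tokens = grammar(tokens)
    return tokens

period_grammar = sequence_grammar("Period", ["from", "Date", "to", "Date"])
tokens = TOKEN.findall("The feast is from 12.03.2002 to 15.03.2002.")
result = cascade([date_grammar, period_grammar], tokens)
```

The second grammar never sees the original dates; it only works on the categories produced by the first one, which is the essential property of a cascade.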

In the next section we describe how cascaded regular grammars can be applied to XML documents.

4 Cascaded Regular Grammars over XML Documents

The application of regular grammars to XML documents raises the following problems:

• how to treat the XML document as an input word for a regular grammar;

• how the returned grammar category should be incorporated into the XML document; and

• what kind of 'letters' to use in the regular expressions so that they correspond to the 'letters' in the XML document.

The solutions to these problems are described in the next paragraphs.

First of all, we accept that each grammar works on the content of an element in an XML document. The content of each XML element¹ is either a sequence of XML elements, or text, or both (MIXED content). Thus, our first task is to define how to turn the content of an XML element into an input word of a grammar. We consider the two basic cases: when the content is text and when it is a sequence of elements.

When the content of the element to which the grammar will be applied is text, we have two choices:

1. we can accept that the 'letters' of the grammars are the codes of the symbols in the encoding of the text; or

2. we can segment the text into meaningful non-overlapping chunks (in usual terminology, tokens) and treat them as 'letters' of the grammars.

¹ Excluding EMPTY elements, to which a regular grammar cannot be applied.

We have adopted the second approach here. Each text content of an element is first tokenized by a tokenizer² and is then used as an input for grammars. Additionally, each token receives a unique type. For instance, the content of the following element

<s>John loves Mary who is in love with Peter</s>

can be segmented as follows:

"John" CAPITALFIRSTWORD
" " SPACE
"loves" WORD
" " SPACE
"Mary" CAPITALFIRSTWORD
" " SPACE
"who" WORD
" " SPACE
"is" WORD
" " SPACE
"in" WORD
" " SPACE
"love" WORD
" " SPACE
"with" WORD
" " SPACE
"Peter" CAPITALFIRSTWORD

Here each line presents one token from the text, in double quotes, followed by its token type.

Therefore, when a text is considered an input word for a grammar, it is represented as a sequence of tokens. How can we now refer to the tokens in the regular expressions in the grammars? The simplest way is by the tokens themselves. We decided to go further and to extend the means for describing tokens with so-called token descriptions, which correspond to the letter descriptions in the above section on regular expressions. In the token descriptions we use strings (sequences of characters), the wildcard symbols # for zero or more symbols and @ for zero or one symbol, and token categories. Each token description matches exactly one token in the input word.

We divide the token descriptions into two types: those that are interpreted directly as tokens, and others that are interpreted as token types first and then as tokens belonging to these token types.

² There are built-in tokenizers which are always available, and there is also a mechanism for the user to define new tokenizers.

The first kind of token description is represented as a string enclosed in double quotes. The string is interpreted as one token with respect to the current tokenizer. If the string does not contain a wildcard symbol, then it represents exactly one token. If the string contains the wildcard symbol #, then it denotes an infinite set of tokens depending on the symbols that are substituted for #. This is not a problem in the system because the token description is always matched against a token in the input word. The other wildcard symbol is treated in a similar way, but zero or one symbol is put in its place. One token description may contain more than one wildcard symbol. Examples:

"Peter" as a token description could be matched only by the last token in the above example.
"lov#" could be matched by the tokens "loves" and "love".
"lov@" is matched only by "love".
"@" is matched by the token corresponding to the spaces " ".
"#h#" is matched by "John", "who", and "with".
"#" is matched by any of the tokens, including the spaces.

The second kind of token description is represented by the dollar sign $ followed by a string. The string is interpreted as either a token type or, if it contains wildcard symbols, a set of token types. Then the type of the token in the input word is matched against the token types denoted by the string. If the token type of the token in the text is denoted by the token description, then the token is matched to the token description. Examples:

$WORD is matched to "loves", "who", "is", "in", "love", "with".
$CAP# is matched to "John", "Mary" and "Peter".
$#WORD is matched to "John", "loves", "Mary", "who", "is", "in", "love", "with", and "Peter".
$# is matched to any of the tokens, including the spaces.

Now we turn to the case when the content of a given element is a sequence of elements. For instance, the above sentence can be represented as:

<s><N>John</N><V>loves</V><N>Mary</N><Pron>who</Pron><V>is</V><P>in</P><N>love</N><P>with</P><N>Peter</N></s>

At first sight the natural choice for the input word is the sequence of the tags of the elements: <N> <V> <N> <Pron> <V> <P> <N> <P> <N>, but when the encoding of the grammatical features is more sophisticated (see below):

<s>
  <w g="N">John</w>
  <w g="V">loves</w>
  <w g="N">Mary</w>
  <w g="Pron">who</w>
  <w g="V">is</w>
  <w g="P">in</w>
  <w g="N">love</w>
  <w g="P">with</w>
  <w g="N">Peter</w>
</s>

then the sequence of tags is simply <w> <w> ..., which is not acceptable as an input word. In order to solve this problem we substitute each element with a sequence of values. This sequence is determined by an XPath expression that is evaluated taking the element node as context node. The sequence defined by an XPath expression is called the element value. Thus each element in the content of the element is replaced by a sequence of text segments of appropriate types, as explained below.

For the above example a possible element value for the tag w could be defined by the XPath expression "attribute::g". This XPath expression returns the value of the attribute g for each element with tag w. Therefore a grammar working on the content of the above sentence will receive as an input word the sequence: "<" "N" ">" "<" "V" ">" "<" "N" ">" "<" "Pron" ">" "<" "V" ">" "<" "P" ">" "<" "N" ">" "<" "P" ">" "<" "N" ">". The angle brackets "<" ">" determine the boundaries of the element value for each of the elements.

Besides such text values, by using XPath expressions one can point to arbitrary nodes in the document, so that the element value is determined differently. In fact, an XPath expression can be evaluated as a list of nodes, and then the element value will be a sequence of values.

For example, if the element values for the above elements are defined by the XPath expression "text() | attribute::g"³, then the input word will be the sequence: "<" "John" "N" ">" "<" "loves" "V" ">" "<" "Mary" "N" ">" "<" "who" "Pron" ">" "<" "is" "V" ">" "<" "in" "P" ">" "<" "love" "N" ">" "<" "with" "P" ">" "<" "Peter" "N" ">".

³ The meaning of this XPath expression is: point to the text child of the element and to the value of the attribute g.

Using predicates in the XPath expressions one can determine the element values on the basis of the context. For example, if in the above case we want to use the grammatical features for verbs, nouns and pronouns, but the actual words for prepositions, we can modify the XPath expression for the element value in the following way: "text()[../attribute::g="P"] | attribute::g[not(../attribute::g="P")]". In this case the XPath expression will select the textual content of the element if the attribute g of this element has the value "P". If the value of the attribute g for the element is not "P", then the XPath expression will select the value of the attribute g. The input word then is: "<" "N" ">" "<" "V" ">" "<" "N" ">" "<" "Pron" ">" "<" "V" ">" "<" "in" ">" "<" "N" ">" "<" "with" ">" "<" "N" ">".

The element value is a representation of the important information for an element. One can consider it conceptual information about the element. From another point of view it can be seen as a textual representation of the element tree structure.

At the moment in the CLaRK system the nodes selected by an XPath expression are processed in the following way:

1. if the returned node is an element node, then the tag of the node is returned with additional information confirming that this is a tag;

2. if the returned node is an attribute node, then the value of the attribute is:

(a) tokenized by an appropriate tokenizer if the attribute value is declared as CDATA in the DTD;

(b) treated as text if the value of the attribute is declared as an enumerated list or ID;

3. if the returned node is a text node, then the text is tokenized by an appropriate tokenizer.
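The construction of an input word from element values can be sketched in Python as follows. Because the standard library's ElementTree implements only a subset of XPath, the three XPath expressions discussed above are replaced here by plain Python functions with the same effect; these functions, the name input_word and the shortened sample sentence are assumptions of the sketch, and the tokenization of text and CDATA values is left out.

```python
import xml.etree.ElementTree as ET

def input_word(parent, element_value):
    """Replace each child element by its element value, delimited by
    "<" and ">", and concatenate the results into one input word."""
    word = []
    for child in parent:
        word.append("<")
        word.extend(element_value(child))
        word.append(">")
    return word

def by_attribute(child):
    # Counterpart of the XPath expression: attribute::g
    return [child.get("g")]

def text_and_attribute(child):
    # Counterpart of: text() | attribute::g (in the order shown in the paper)
    return [child.text, child.get("g")]

def words_for_prepositions(child):
    # Counterpart of:
    # text()[../attribute::g="P"] | attribute::g[not(../attribute::g="P")]
    return [child.text] if child.get("g") == "P" else [child.get("g")]

sentence = ET.fromstring(
    '<s><w g="N">John</w><w g="V">loves</w>'
    '<w g="P">with</w><w g="N">Peter</w></s>'
)
plain = input_word(sentence, by_attribute)
mixed = input_word(sentence, words_for_prepositions)
```

For this sentence the first call yields the grammatical features only, while the second one yields the word "with" in place of the feature "P", so that a grammar can refer to individual prepositions.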

Within the regular expressions we use angle brackets to denote the boundaries of the element values. Inside the angle brackets we can write a regular expression of arbitrary complexity in round brackets. As letters in these regular expressions we again use token descriptions for the values of textual elements and the values of attributes. For tag descriptions we use strings which are neither enclosed in double quotes nor preceded by a dollar sign. We can use wildcard symbols in the tag name. Thus:

<p> is matched with a tag p;
<@> is matched with all tags of length one;
<#> is matched with all tags.

The last problem when applying grammars to XML documents is how to incorporate the category assigned to a given rule. In general we can accept that the category has to be encoded as XML mark-up in the document and that this mark-up could be very different depending on the DTD we use. For instance, let us have a simple tagger (the example is based on (Abney, 1996)):

Det -> "the"|"a"
N -> "telescope"|"garden"|"boy"
Adj -> "slow"|"quick"|"lazy"
V -> "walks"|"see"|"sees"|"saw"
Prep -> "above"|"with"|"in"


Then one possibility for representing the categories as XML mark-up is by tags around the recognized words:

the boy with the telescope

becomes

<Det>the</Det><N>boy</N><Prep>with</Prep><Det>the</Det><N>telescope</N>

This encoding is straightforward but not very convenient when a given wordform is homonymous, as in:

V -> "move"
N -> "move"

In order to avoid such cases we decided that the category for each rule in the CLaRK System is a custom mark-up that substitutes the recognized word. Since in most cases we would also like to keep the recognized word, we use the variable \w for the recognized word. For instance, the above example becomes:

<Det>\w</Det> -> "the"|"a"
<N>\w</N> -> "telescope"|"garden"|"boy"
<Adj>\w</Adj> -> "slow"|"quick"|"lazy"
<V>\w</V> -> "walks"|"see"|"sees"|"saw"
<Prep>\w</Prep> -> "above"|"with"|"in"

The mark-up defining the category can be as complicated as necessary. The variable \w can be repeated as many times as necessary (it can also be omitted). For instance, for "move" the rule could be:

<w aa="V;N">\w</w> -> "move"
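The substitution of the variable \w in the category mark-up can be sketched in Python as a simple template replacement. The table RULES restates the tagger rules above, while the function name tag, the use of a lookup dictionary instead of a finite-state engine, and the decision to treat every run of ASCII letters as one word are assumptions of this sketch.

```python
import re

# Category mark-up (with the variable \w) and the words each rule recognizes.
RULES = [
    (r"<Det>\w</Det>", ["the", "a"]),
    (r"<N>\w</N>", ["telescope", "garden", "boy"]),
    (r"<Adj>\w</Adj>", ["slow", "quick", "lazy"]),
    (r"<V>\w</V>", ["walks", "see", "sees", "saw"]),
    (r"<Prep>\w</Prep>", ["above", "with", "in"]),
    (r'<w aa="V;N">\w</w>', ["move"]),
]

def tag(text):
    """Substitute each recognized word by the mark-up of its rule,
    with every occurrence of \\w replaced by the word itself."""
    lookup = {word: template for template, words in RULES for word in words}

    def substitute(match):
        word = match.group(0)
        template = lookup.get(word)
        # Unrecognized words are copied to the output unchanged.
        return template.replace(r"\w", word) if template else word

    return re.sub(r"[A-Za-z]+", substitute, text)

tagged = tag("the boy with the telescope")
```

Unlike the example output shown earlier, this sketch keeps the spaces between the words, since they are not recognized by any rule and are therefore copied.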

Let us now give an example in which element values are used. For instance, the following grammar recognizes prepositional phrases:

<PP>\w</PP> -> <"P"><"N#">

Generally it says that a prepositional phrase consists of an element with element value "P" followed by an element whose element value matches the token description "N#". The use of the token description "N#" ensures that the rule will also work when a preposition is followed by an element with element value "NP". The application of this grammar to the above example, with an appropriate definition of element values, will result in the following document:

<s>
  <w g="N">John</w>
  <w g="V">loves</w>
  <w g="N">Mary</w>
  <w g="Pron">who</w>
  <w g="V">is</w>
  <PP>
    <w g="P">in</w>
    <w g="N">love</w>
  </PP>
  <PP>
    <w g="P">with</w>
    <w g="N">Peter</w>
  </PP>
</s>
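The token descriptions on which such rules rely, for example "N#", follow the wildcard conventions introduced earlier for text content. A possible reading of these conventions is sketched below in Python. The translation of # and @ into the regular expression fragments .* and .? is an assumption about their intended meaning, the names to_regex, matches and matched are invented, and the token list repeats the distinct tokens of the example sentence.

```python
import re

def to_regex(pattern):
    """Compile a wildcard pattern: # is any run of symbols, @ is zero or one symbol."""
    parts = []
    for ch in pattern:
        if ch == "#":
            parts.append(".*")
        elif ch == "@":
            parts.append(".?")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def matches(description, token, token_type):
    """Quoted descriptions are tested on the token, $-descriptions on its type."""
    if description.startswith("$"):
        return to_regex(description[1:]).fullmatch(token_type) is not None
    return to_regex(description.strip('"')).fullmatch(token) is not None

# Distinct tokens of "John loves Mary who is in love with Peter".
TOKENS = [
    ("John", "CAPITALFIRSTWORD"), (" ", "SPACE"), ("loves", "WORD"),
    ("Mary", "CAPITALFIRSTWORD"), ("who", "WORD"), ("is", "WORD"),
    ("in", "WORD"), ("love", "WORD"), ("with", "WORD"),
    ("Peter", "CAPITALFIRSTWORD"),
]

def matched(description):
    """Return the tokens of TOKENS that the description matches."""
    return [tok for tok, typ in TOKENS if matches(description, tok, typ)]
```

Under this reading, matched('"lov#"') gives "loves" and "love", matched('"lov@"') gives only "love", and matched("$CAP#") gives "John", "Mary" and "Peter", in agreement with the examples listed in the text.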

Here is another example (based on a grammar developed by Petya Osenova for Bulgarian noun phrases) which demonstrates more complicated regular expressions inside element descriptions:

<np aa="NPsn">\w</np> ->

<("An#"|"Pd@@@sn")>,<("Pneo-sn"|"Pfeo-sn")>

Here "An#" matches all morphosyntactic tags for adjectives of neuter gender, "Pd@@@sn" matches all morphosyntactic tags for demonstrative pronouns of neuter gender, singular, "Pneo-sn" is the morphosyntactic tag for the negative pronoun, neuter gender, singular, and "Pfeo-sn" is the morphosyntactic tag for the indefinite pronoun, neuter gender, singular. This rule recognizes as a noun phrase each sequence of two elements where the first element has an element value corresponding to an adjective or demonstrative pronoun with appropriate grammatical features, followed by an element with an element value corresponding to a negative or an indefinite pronoun. Notice the attribute aa of the category of the rule. It represents the information that the resulting noun phrase is singular, neuter gender. Let us now suppose that the next grammar, for the determination of prepositional phrases, is defined as follows:

<pp>\w</pp> -> <"R"><"N#">

where "R" is the morphosyntactic tag for prepositions. Let us trace the application of the two grammars one after another to the following XML element:

<text>
  <w aa="R">s</w>
  <w aa="Ansd">golyamoto</w>
  <w aa="Pneo-sn">nisto</w>
</text>

First, we define the element value for the elements with tag w by the XPath expression "attribute::aa". Then the cascaded regular grammar processor calculates the input word for the first grammar: "<" "R" ">" "<" "Ansd" ">" "<" "Pneo-sn" ">". Then the first grammar is applied to this input word and it recognizes the last two elements as a noun phrase. This results in two actions. First, the markup of the rule is incorporated into the original XML document:

<text>
  <w aa="R">s</w>
  <np aa="NPsn">
    <w aa="Ansd">golyamoto</w>
    <w aa="Pneo-sn">nisto</w>
  </np>
</text>

Second, the element value for the new element <np> is calculated and substituted in the input word of the first grammar, and in this way the input word for the second grammar is constructed: "<" "R" ">" "<" "NPsn" ">". Then the second grammar is applied to this word and the result is incorporated in the XML document:

<text>
  <pp>
    <w aa="R">s</w>
    <np aa="NPsn">
      <w aa="Ansd">golyamoto</w>
      <w aa="Pneo-sn">nisto</w>
    </np>
  </pp>
</text>

Because the cascaded grammar consists of only these two grammars, the input word for the second grammar is not modified, but simply deleted.
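The trace above can be reproduced with a short Python sketch. It is a simplification under stated assumptions and not the CLaRK processor: the element value is fixed to the attribute aa rather than given by an XPath expression, each rule is encoded as a hypothetical triple (new tag, its attributes, alternative token descriptions per position), and instead of maintaining a separate input word the sketch recomputes the element values from the modified document before the next grammar runs.

```python
import re
import xml.etree.ElementTree as ET

def matches(description, value):
    """Token description match: # is any run of symbols, @ is zero or one symbol."""
    regex = "".join(".*" if c == "#" else ".?" if c == "@" else re.escape(c)
                    for c in description)
    return re.fullmatch(regex, value) is not None

def apply_grammar(parent, rule):
    """Wrap every run of children whose element values (attribute aa)
    match the rule in a new element carrying the rule's mark-up."""
    tag, attributes, positions = rule
    children, rebuilt, pos = list(parent), [], 0
    while pos < len(children):
        window = children[pos:pos + len(positions)]
        if len(window) == len(positions) and all(
            any(matches(d, child.get("aa", "")) for d in alternatives)
            for alternatives, child in zip(positions, window)
        ):
            group = ET.Element(tag, attributes)
            group.extend(window)          # the recognized elements become its content
            rebuilt.append(group)
            pos += len(positions)
        else:
            rebuilt.append(children[pos])
            pos += 1
    parent[:] = rebuilt

NP_RULE = ("np", {"aa": "NPsn"}, [["An#", "Pd@@@sn"], ["Pneo-sn", "Pfeo-sn"]])
PP_RULE = ("pp", {}, [["R"], ["N#"]])

text = ET.fromstring(
    '<text><w aa="R">s</w><w aa="Ansd">golyamoto</w>'
    '<w aa="Pneo-sn">nisto</w></text>'
)
for rule in (NP_RULE, PP_RULE):           # the cascade: one grammar after another
    apply_grammar(text, rule)
```

After the first grammar the np element carries the value "NPsn", which is exactly what allows the description "N#" of the second grammar to match it.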

5 Conclusion

In this paper we presented an approach to the application of cascaded regular grammars within XML. This mechanism is implemented within the CLaRK System (Simov et al., 2001). The general idea is that an XML document can be considered a "blackboard" on which different grammars work. Each grammar is applied to the content of some of the elements in the XML document. The content of each of these elements is converted into an input word. If the content is text, then it is converted into a sequence of tokens. If the content is a sequence of elements, then each element is substituted by an element value. Each element value is defined by an XPath expression. The XPath engine selects the appropriate pieces of information which determine the element value. The element value can be considered as the category of the element, or as a textual representation of the tree structure of the element and its context. The result of the grammar is incorporated in the XML document as XML markup.

In fact, more tools for the modification of an XML document are implemented in the CLaRK System, such as constraints, sort, remove and transformations. These tools can write some information, reorder it or delete it. The user can order the applications of the different tools in order to achieve the necessary processing. We call this possibility cascaded processing, after the cascaded regular grammars. In this case we can order not just different grammars but also other types of tools.

References

Steve Abney. 1996. Partial Parsing via Finite-State Cascades. In: Proceedings of the ESSLLI'96 Robust Parsing Workshop. Prague, Czech Republic.

DOM. 1998. Document Object Model (DOM) Level 1. Specification Version 1.0. W3C Recommendation. http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001

Claire Grover, Colin Matheson, Andrei Mikheev and Marc Moens. 2000. LT TTT - A Flexible Tokenisation Tool. In: Proceedings of Second International Conference on Language Resources and Evaluation (LREC 2000).

K. Koskenniemi. 1983. Two-level Model for Morphological Analysis. In: Proceedings of IJCAI-83, pages: 683-685, Karlsruhe, Germany.

Kiril Simov, Zdravko Peev, Milen Kouylekov, Alexander Simov, Marin Dimitrov, Atanas Kiryakov. 2001. CLaRK - an XML-based System for Corpora Development. In: Proc. of the Corpus Linguistics 2001 Conference, pages: 558-560. Lancaster, UK.

Text Encoding Initiative. 1997. Guidelines for Electronic Text Encoding and Interchange. Sperberg-McQueen C.M., Burnard L. (eds).

Corpus Encoding Standard. 2001. XCES: Corpus Encoding Standard for XML. Vassar College, New York, USA. http://www.cs.vassar.edu/XCES/

XML. 2000. Extensible Markup Language (XML) 1.0 (Second Edition). W3C Recommendation. http://www.w3.org/TR/REC-xml

XML Schema. 2001. XML Schema Part 1: Structures. W3C Recommendation. http://www.w3.org/TR/xmlschema-1/

XPath. 1999. XML Path Language (XPath) version 1.0. W3C Recommendation. http://www.w3.org/TR/xpath

XSLT. 1999. XSL Transformations (XSLT) version 1.0. W3C Recommendation. http://www.w3.org/TR/xslt


XtraGen: A Natural Language Generation System Using XML- and Java-Technologies

Holger Stenzhorn
XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
66123 Saarbrücken, Germany

[email protected]

Abstract

In this paper we present XtraGen, an XML- and Java-based software system for the flexible, real-time generation of natural language that is easily integrated and used in real-world applications. We describe its grammar formalism and implementation in detail, depict the context in which the system was evaluated and finally provide an outlook on future work with the system.

1 Introduction

In this paper we present XtraGen, a recently developed software system for the flexible, real-time generation of natural language that can be easily integrated into real-world, industrial application environments through its open XML- and Java-based implementation and interfaces.

Our motivation for developing a completely new generation system arose when we made the same observation as stated in the following quote:

There are thousands, if not millions, of application programs in everyday use that automatically generate texts; but probably fewer than ten of these programs use the linguistic and knowledge-based techniques that have been studied by the natural language generation community. (Reiter, 1995)

The goal of our company is to develop state-of-the-art software, and hence we wanted to change the portrayed situation, at least in our environment and for the applications we create.

Therefore we started to experiment with XSL (World Wide Web Consortium, 2001) to generate natural language as suggested in (Cawsey, 2000) and (Wilcock, 2001). But we found fairly soon that XSL did not satisfy our needs because with this mechanism

• we were not able to appropriately handle the issue of morphology,

• we could not parameterize the generation process at the desired level and

• we had no possibility to generate alternatives or recover from dead ends during generation since XSL lacks a backtracking mechanism.

Therefore we decided to develop our own natural language generation system that on the one hand incorporates many ideas found in XSL but on the other hand tries to solve the problems described above.

2 Generation Grammars

2.1 Formalism

The grammar formalism conceived for the XtraGen system has been developed from an application-oriented point of view. From our standpoint, real-world applications hardly ever require the full and complete linguistic coverage that linguistically motivated generation systems strive for. Therefore our formalism is based on extended templates that allow the inclusion of predefined and dynamically retrieved text, constraint-based inflection and a context-free selection mechanism. The development of this formalism was strongly influenced by the ideas found in the (Lisp-based) formalism of the TG/2 system (Busemann, 1996; Wein, 1996) and the YAG system (Channarukul, 1999).

A template has the overall form depicted in Backus-Naur Form in figure 1. Each part of the template is elaborated below.

<template id="String" category="String">
  <conditions>
    Condition*
  </conditions>
  <parameters>
    Parameter*
  </parameters>
  <actions>
    Action+
  </actions>
  <constraints>
    Constraint*
  </constraints>
</template>

Figure 1: Overview of an XtraGen template in Backus-Naur Form

2.2 Conditions

Conditions describe the exact circumstances under which a certain template can be applied and its actions executed. There are two distinct basic types of conditions: Simple-Conditions and Complex-Conditions. They in turn are the supertypes for more specific conditions:

Simple-Condition Simple-Conditions form the actual tests that are applied to the input structure. A set of commonly used conditions is already predefined, such as ones that test for equality or that test whether certain information exists in the input structure. If there is a need for some very specific conditional testing that cannot be realized with the existing means, a developer is free to implement and add their own condition types.

Complex-Condition This type of condition makes it possible to combine several conditions into a more complex one. Three predefined Complex-Conditions exist: the And-Condition, the Or-Condition and the Not-Condition. Additional Complex-Conditions can also be added by providing an implementation for them.
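How simple and complex conditions compose can be pictured with a small Java sketch. The names Condition and Conditions are hypothetical and are not the XtraGen API; only the javax.xml calls are standard. The sketch also assumes an input document with a root element called input, so the paths are written as /input/recall and so on rather than exactly as in figure 2.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical condition type: one test applied to the parsed input structure.
interface Condition {
    boolean test(Document input);
}

// Hypothetical factory methods mirroring the condition types used in figure 2.
final class Conditions {
    private static final XPath XPATH = XPathFactory.newInstance().newXPath();

    private Conditions() {}

    // Evaluates an XPath expression and converts the checked exception.
    private static Object eval(String path, Document input, QName type) {
        try {
            return XPATH.evaluate(path, input, type);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + path, e);
        }
    }

    // Simple-Condition: a node exists at the given path.
    static Condition exist(String path) {
        return input -> eval(path, input, XPathConstants.NODE) != null;
    }

    // Simple-Condition: the text at the given path equals the value.
    static Condition equal(String path, String value) {
        return input -> value.equals(eval(path, input, XPathConstants.STRING));
    }

    // Simple-Condition: the number at the given path is smaller than the value.
    static Condition less(String path, double value) {
        return input -> ((Double) eval(path, input, XPathConstants.NUMBER)) < value;
    }

    // Complex-Conditions combine other conditions.
    static Condition and(Condition... parts) {
        return input -> {
            for (Condition c : parts) {
                if (!c.test(input)) return false;
            }
            return true;
        };
    }

    static Condition or(Condition... parts) {
        return input -> {
            for (Condition c : parts) {
                if (c.test(input)) return true;
            }
            return false;
        };
    }

    static Condition not(Condition part) {
        return input -> !part.test(input);
    }

    // Parses an XML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // The nesting of figure 2: (recall = 95 and accuracy < 90) or not(exception exists).
    static Condition figure2() {
        return or(
                and(equal("/input/recall", "95"), less("/input/accuracy", 90)),
                not(exist("/input/exception")));
    }
}
```

A call such as Conditions.figure2().test(Conditions.parse(xml)) then decides whether a template guarded by these conditions would be applicable to the given input.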

2.3 Parameterization

Parameterization is an easy and flexible means to guide and control the generation process with regard to different linguistic preferences such as matters of style or rhetorical structures. Parameterization works by introducing a preference mechanism that provides the possibility of dynamically sorting the application of templates according to a given set of parameters.

<conditions>
  <or>
    <and>
      <condition type="equal">
        <get path="/recall"/>
        <value>95</value>
      </condition>
      <condition type="less">
        <get path="/accuracy"/>
        <value>90</value>
      </condition>
    </and>
    <not>
      <condition type="exist">
        <get path="/exception"/>
      </condition>
    </not>
  </or>
</conditions>

Figure 2: Example of some complex interleaved conditions

The way parameterization works in our system is a two-step process:

Adding of parameters to templates During the design of a generation grammar the writer adds one or more parameters to some templates as in the example in figure 3.

Here the upper template is intended to be used during the generation of text targeted at experts and the lower one in case text is to be produced for novices (level is expert in one template and novice in the other). Both of the templates are preferably used when a low verbosity level is desired (verbosity is low in both cases).

Setting of the parameters at runtime At runtime the parameters corresponding to the ones defined in the grammar are set to the desired values. To continue our example, we now set the value of the parameter level to expert (see figure 4) and hence the template in the upper box would be selected.

A particular feature of our system is that parameters can be assigned a weight and thus a priority. In our example we might want to give a higher priority to the parameter level than to the parameter verbosity, as shown in figure 5. The applicable templates are then first sorted according to their level of verbosity, and the result is further sorted according to the level of expertise.

<template id="explainExpert" category="explain">
  <parameters>
    <parameter name="level" value="expert"/>
    <parameter name="verbosity" value="low"/>
  </parameters>
  ...
</template>

<template id="explainNovice" category="explain">
  <parameters>
    <parameter name="level" value="novice"/>
    <parameter name="verbosity" value="low"/>
  </parameters>
  ...
</template>

Figure 3: Example of using parameters on the level of generation grammars

generator.addParameter("level", "expert");

Figure 4: Example of using parameters on the level of programming code

generator.addParameter("level", "novice", 0.75);
generator.addParameter("verbosity", "low", 0.5);

Figure 5: Example of using parameters with a weight specified on the level of programming code
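The weighted ordering described above can be illustrated with a minimal Java sketch. It is not the XtraGen implementation: the classes TemplateInfo, WeightedParameter and PreferenceSorter are hypothetical, and the sketch assumes that "priority" means a stable sort by the lower-weighted parameter first, followed by a stable sort by the higher-weighted one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a template: its id plus the parameters declared in the grammar.
final class TemplateInfo {
    final String id;
    final Map<String, String> parameters;

    TemplateInfo(String id, Map<String, String> parameters) {
        this.id = id;
        this.parameters = parameters;
    }
}

// Hypothetical runtime parameter with a weight, as passed to addParameter in figure 5.
final class WeightedParameter {
    final String name;
    final String value;
    final double weight;

    WeightedParameter(String name, String value, double weight) {
        this.name = name;
        this.value = value;
        this.weight = weight;
    }
}

final class PreferenceSorter {
    private PreferenceSorter() {}

    // Returns the template ids in order of preference.
    static List<String> order(List<TemplateInfo> templates, List<WeightedParameter> runtime) {
        // Lowest weight first, so that the highest weight is applied last and dominates.
        List<WeightedParameter> byWeight = new ArrayList<>(runtime);
        byWeight.sort(Comparator.comparingDouble((WeightedParameter p) -> p.weight));

        List<TemplateInfo> sorted = new ArrayList<>(templates);
        for (WeightedParameter p : byWeight) {
            // List.sort is stable, so the earlier ordering survives among ties.
            // Matching templates get the key false and therefore come first.
            sorted.sort(Comparator.comparing(
                    (TemplateInfo t) -> !p.value.equals(t.parameters.get(p.name))));
        }

        List<String> ids = new ArrayList<>();
        for (TemplateInfo t : sorted) {
            ids.add(t.id);
        }
        return ids;
    }

    // The two templates of figure 3 with the runtime settings of figure 5.
    static List<String> demo() {
        List<TemplateInfo> templates = List.of(
                new TemplateInfo("explainExpert", Map.of("level", "expert", "verbosity", "low")),
                new TemplateInfo("explainNovice", Map.of("level", "novice", "verbosity", "low")));
        List<WeightedParameter> runtime = List.of(
                new WeightedParameter("level", "novice", 0.75),
                new WeightedParameter("verbosity", "low", 0.5));
        return order(templates, runtime);
    }
}
```

With the settings of figure 5, both templates tie on verbosity, and the higher-weighted level parameter then moves explainNovice ahead of explainExpert.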

2.4 Actions

If all conditions of a given template have been tested successfully (see section 2.2), the actions contained in the actions-part of the template are executed.

There are four different types of actions that can appear: String-Action, Getter-Action, Inflection-Action and Selection-Action. The actual purpose of each of them is quite different, but all of them return a result string when executed successfully.

String-Action This type of action simply returns a statically specified string as its result, a so-called canned text.

Getter-Action With a Getter-Action it is possible to directly access and retrieve data from the input structure. The syntax used for specifying the path to the data conforms to the syntax of XPath (World Wide Web Consortium, 1999). No additional processing is done on the returned data.

<get path="/values/startTime"/>

Inflection-Action This action inflects a stem according to the defined morphological constraints and returns the result.

The stem can be stated statically in the grammar as in case (a) or can be dynamically retrieved from the input structure as in case (b).

The required morphological constraints are supplied by the constraints-part of the template, to which the given label provides a link (see section 2.5 below for details).

(a) <inflect stem="bring" label="X0"/>

(b) <inflect path="/action" label="X0"/>

Selection-Action The Selection-Action can be seen as the most important of the actions since it accounts for the context-free backbone of the system.

It allows another template to be selected directly via a specified identifier as in case (a) or via a given category as in case (b). In the second case several templates might have the given category and hence backtracking might be invoked at this point (see section 5.1).

(a) <select id="top"/>

(b) <select category="top" optional="true"/>

Selections can also be declared optional as in (b), which means that if the selection of the template fails, no backtracking is invoked and an empty string is simply returned.


2.5 Constraints and Morphology

The treatment of morphology is naturally one of the major issues in the context of a complete natural language generation system, especially when working with morphologically rich languages such as German, Russian or Finnish. Therefore we took great care to design and develop a morphological subsystem that is powerful and flexible yet easy to understand and use. The actual realization of the component is based on a constraint-based inheritance algorithm that follows the example of PATR-II (Shieber et al., 1989).

In the (overly simplified) example in figure 6 one can get a glimpse of how the morphology works: There are two Selection-Actions, the first one labelled X0, the second one labelled X1. The given constraint states that the attribute number of X0 is the same as the attribute number of X1 and sets it dynamically to the value retrieved by the Getter-Action.

<template ...>
  <actions>
    <select category="determiner" label="X0"/>
    <select category="noun" label="X1"/>
  </actions>
  <constraint>
    <place label="X0" attribute="number"/>
    <place label="X1" attribute="number"/>
    <get path="/categoryNumber"/>
  </constraint>
</template>

Figure 6: Example of using constraints

2.6 Compilation

In order to be able to work with a generation grammar, the generation engine requires the grammar (and its templates) to exist in the form of a Java object. But since the original format of the grammar is plain XML, this format must be transformed through a compilation process into the internally needed representation. Our system is capable of performing such a compilation in two different ways:

Just-in-time Compilation With this technique the required templates are compiled from their XML source into their corresponding Java objects at runtime of the generation engine, i.e. during the actual generation process. This type of compilation is advised only for smaller grammars or during the development and testing of a new grammar, since the constant interleaving of compilation on the one hand and the actual generation process on the other leads to quite noticeable overhead. This overhead is naturally not acceptable when XtraGen is used in real-time applications.

Pre-Compilation This type of compilation allows the whole grammar to be compiled before its actual deployment during the generation process. The pre-compilation of grammars can improve the performance of the generator-engine tremendously and is therefore to be preferred in most situations. (The pre-compilation of generation grammars is very similar to the Translets approach in XSL (Apache XML Project, 2002) where XSL stylesheets are compiled in advance into Java objects.)
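The difference between the two compilation modes can be sketched in Java as a cache that is filled either on demand or up front. This is an illustration under assumptions, not the XtraGen compiler: TemplateCache and CompilationDemo are hypothetical, and the "compiler" is an arbitrary function standing in for the real XML-to-object translation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical cache of compiled templates, keyed by template id.
final class TemplateCache<T> {
    private final Map<String, String> sources;   // template id -> XML source
    private final Function<String, T> compiler;  // XML source -> compiled object
    private final Map<String, T> compiled = new HashMap<>();
    private int compilations = 0;

    TemplateCache(Map<String, String> sources, Function<String, T> compiler) {
        this.sources = sources;
        this.compiler = compiler;
    }

    // Pre-compilation: compile every template before generation starts.
    void precompileAll() {
        for (String id : sources.keySet()) {
            get(id);
        }
    }

    // Just-in-time compilation: compile on first use, then reuse the cached object.
    T get(String id) {
        T template = compiled.get(id);
        if (template == null) {
            String source = sources.get(id);
            if (source == null) {
                throw new IllegalArgumentException("Unknown template: " + id);
            }
            template = compiler.apply(source);
            compilations++;
            compiled.put(id, template);
        }
        return template;
    }

    int compilations() {
        return compilations;
    }
}

final class CompilationDemo {
    private CompilationDemo() {}

    private static TemplateCache<Integer> newCache() {
        Map<String, String> sources = new HashMap<>();
        sources.put("top", "<template id=\"top\"/>");
        sources.put("explainExpert", "<template id=\"explainExpert\"/>");
        // String::length is only a placeholder for a real compiler.
        return new TemplateCache<>(sources, String::length);
    }

    // Just-in-time: only the template actually requested is compiled, during generation.
    static int lazyCount() {
        TemplateCache<Integer> cache = newCache();
        cache.get("top");
        cache.get("top");
        return cache.compilations();
    }

    // Pre-compilation: all templates are compiled up front, later lookups cost nothing extra.
    static int eagerCount() {
        TemplateCache<Integer> cache = newCache();
        cache.precompileAll();
        cache.get("top");
        return cache.compilations();
    }
}
```

In the just-in-time case the compilation cost is paid inside the first lookup, which is the interleaving the text warns about for real-time use.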

3 Input

In contrast to other generation systems that require their input to adhere to some particular (and most of the time proprietary) encoding format, the core engine of our system only demands its input to be a valid XML structure.

The actual restrictions on the input are imposed only at the level of the generation grammars in terms of their access to the input (see section 2.4 on Getter-Actions and Inflection-Actions). This can obviously lead to a severe drawback: if either the generation grammar or the input structure changes heavily, a complete mismatch might emerge between the XPath expressions specified in the actions and the actual structure of the input.

In circumstances where it is not feasible to change either the input structure or the grammar, we propose to introduce an additional mapping layer between input and generator that is based on an XSL stylesheet and that dynamically maps the input into the form needed by the grammar.
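Such a mapping layer could look roughly like the following Java sketch, which uses the standard javax.xml.transform API. The class InputMapper, the stylesheet and the element names stats and begin are assumptions made for illustration; the target path /values/startTime is the one used in the Getter-Action example of section 2.4.

```java
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

// Hypothetical mapping layer between the application's XML and the grammar's expectations.
final class InputMapper {
    // Illustrative stylesheet: rewrites <stats><begin>..</begin></stats>
    // into <values><startTime>..</startTime></values>.
    static final String STYLESHEET =
            "<xsl:stylesheet version=\"1.0\""
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:template match=\"/stats\">"
            + "<values><startTime><xsl:value-of select=\"begin\"/></startTime></values>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    private InputMapper() {}

    // Applies the stylesheet and returns the mapped input as a DOM document.
    static Document map(String inputXml, String stylesheetXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheetXml)));
            DOMResult result = new DOMResult();
            transformer.transform(new StreamSource(new StringReader(inputXml)), result);
            return (Document) result.getNode();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Mapping the input failed", e);
        }
    }
}
```

The returned Document is in the form that a grammar written against /values/startTime expects, so neither the application output nor the grammar has to change.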

4 Editor

We have stressed in the preceding sections that we believe our formalism to be powerful yet very straightforward to implement and use. But when developing larger grammars for real-world applications it becomes quite a demanding, non-trivial task to keep track of all the templates and especially of the relations between them (e.g. relations on the level of morphological constraints). Common XML editors are of no help at this point since they cannot show such relations at all.

Therefore the development of egram, a Java-based graphical editor for generation grammars, is under way at the site of our cooperation partner DFKI (German Research Center for Artificial Intelligence). Once completed, this tool will make it possible to comfortably edit all aspects of generation grammars and templates. Among many other things, the editor will be able to depict the whole generation grammar and process in a graphical tree format in which dependencies between templates are shown in an intuitive way.

5 Implementation

The realization and implementation of the XtraGen system is based entirely on the two cornerstones Java and XML.

XML was chosen because it has become the de facto standard language in many (if not most) scenarios where information transfer takes place. This in turn is due to its capability to encode information in a way that is easy to read, process, and generate (even for human beings, as in the case of our formalism).

Java was chosen because it provides many mechanisms that bolster the productivity of a programmer during the development of new software, such as an extensive programming interface and automatic memory management. An additional advantage of Java is the availability of many free and readily usable open-source packages that provide a host of diverse functionalities. The most important ones in our project were the different XML packages, in particular the XML parser Xerces and the XSLT engine Xalan (Apache XML Project, 2002).

5.1 Backtracking

During the generation process it is possible that two different templates are applicable at the same time (i.e. they have the same category and all of their conditions are satisfied). If one of the templates is selected, this leads to one of two different results:

• The application of the template was successful, which means that all of its actions could be successfully executed and a result was returned.

• The application of the template failed because the execution of one or more of its actions was not successful.

But the failure of a template described above does not mean that there exists no solution at all. Therefore we backtrack to the point where the unsuccessful template was selected and apply another template. This procedure is repeated until there are no more templates at this backtrack point.

The underlying implementation of the backtracking mechanism is quite elaborate since it has to take several important issues into account. The most important ones are:

Performance issues We implemented several different mechanisms that help to tremendously enhance the performance during the backtracking phase, such as the memoization of partial solutions.

Constraint issues We had to take great care of the constraint inheritance mechanism during the backtracking implementation so that an invocation of backtracking does not lead to a misguided percolation of constraints and hence a corrupted morphology.
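The basic backtracking step described in this section can be sketched in a few lines of Java. The class MiniGrammar is a hypothetical, heavily simplified model: it knows only canned text, failing actions and selections by category, and it leaves out conditions, constraints and the memoization of partial solutions mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, heavily simplified model of templates and backtracking.
final class MiniGrammar {
    // An action returns its result string, or null to signal failure.
    interface Action {
        String run(MiniGrammar grammar);
    }

    // Each category maps to its candidate templates; a template is a list of actions.
    private final Map<String, List<List<Action>>> byCategory = new HashMap<>();

    void add(String category, Action... actions) {
        byCategory.computeIfAbsent(category, k -> new ArrayList<>()).add(List.of(actions));
    }

    static Action text(String s) {
        return grammar -> s;
    }

    static Action fail() {
        return grammar -> null;
    }

    static Action select(String category) {
        return grammar -> grammar.generate(category);
    }

    // Tries the templates of a category in order. A failed template sends control
    // back to this choice point, where the next candidate is applied.
    String generate(String category) {
        for (List<Action> template : byCategory.getOrDefault(category, List.of())) {
            String result = apply(template);
            if (result != null) {
                return result;
            }
        }
        return null; // no more templates at this backtrack point
    }

    // A template succeeds only if every one of its actions succeeds.
    private String apply(List<Action> template) {
        StringBuilder out = new StringBuilder();
        for (Action action : template) {
            String part = action.run(this);
            if (part == null) {
                return null;
            }
            out.append(part);
        }
        return out.toString();
    }

    static String demo() {
        MiniGrammar g = new MiniGrammar();
        g.add("top", text("Result: "), select("detail"));
        g.add("detail", text("accuracy "), fail()); // first candidate fails
        g.add("detail", text("5 rules"));           // second candidate succeeds
        return g.generate("top");
    }
}
```

In the demo the first detail template fails in its second action, so the partial text it produced is discarded and the second detail template supplies the result.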

5.2 Programming Interface

So far we have talked about the deployment of XtraGen only on the level of generation grammars and their XML-based formalism. Now we turn to the description of the tasks that have to be undertaken on the level of programming code to make the system run.

The following shows the individual steps that have to be taken to generate some output with XtraGen:

Creating a new generator-engine The very first thing to do in order to get the whole system running is to create a new Generator object which represents the core generation engine:

Generator generator = new Generator();

By doing so one implicitly creates objects for the internal subcomponents such as the already mentioned morphological component and puts them under the control of the generation engine.


Setting the start-category or -id The generation engine needs to know which template it should start from. This is done by specifying either a start-category as in case (a) or a start-identifier as in case (b).

(a) generator.setStartCategory(String category);

(b) generator.setStartId(String id);

Setting the grammar Without a generation grammar it would naturally be impossible to generate any output at all. There are two different possibilities to pass a grammar to the generation engine:

(a) generator.setGrammarDocument(Document grammar);

(b) generator.setGrammar(Grammar grammar);

In the first case a Document object (World Wide Web Consortium, 2000) that contains the grammar in parsed XML format is passed; in the second case a pre-compiled Grammar object is passed (see section 2.6).

Setting the input In addition to the grammar the generation engine needs an input structure to generate from. This can be set as follows:

generator.setInputDocument(Document input);

Again, the parameter passed is a Document object that contains the input in parsed XML format. The input can be reassigned between two calls to the generation engine.

Setting parameters In subsection 2.3 we talked about the use of parameters to control and guide the generation process. The way parameterization works is explained in detail there. To set parameters at runtime one has to call the following methods:

(a) generator.addParameter(String name, String value);

(b) generator.addParameter(String name, String value, double weight);

This step is only needed if parameterization is desired. Otherwise these calls can be omitted and parameterization is automatically turned off.

Run the generation process To actually start the generation process and get some output, one of the following calls can be used:

(a) String result = generator.generate();

(b) Document result = generator.generateDocument();

The difference between the two calls is that in case (a) a simple String containing the result is returned whereas in case (b) a Document object is passed back.
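Since both the grammar and the input are handed over as parsed Document objects, client code first has to turn XML text into DOM documents. The following Java sketch shows one way to do that with the standard JAXP parser. The helper class XmlInputs is hypothetical. The generator calls appear only as comments, in the order described above, because the XtraGen classes themselves are not assumed to be available here; the start category top is likewise just an example value.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper that prepares the Document arguments for the generator.
final class XmlInputs {
    private XmlInputs() {}

    // Parses an XML string into a DOM document.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }
}

// Sequence of calls as described in this section (shown as comments only):
//
//   Document grammar = XmlInputs.parse(grammarXml);
//   Document input = XmlInputs.parse("<values><startTime>9:00</startTime></values>");
//
//   Generator generator = new Generator();
//   generator.setStartCategory("top");
//   generator.setGrammarDocument(grammar);
//   generator.setInputDocument(input);
//   generator.addParameter("level", "novice", 0.75);   // optional
//   String result = generator.generate();
```

Because the input can be reassigned between two calls to the generation engine, only the setInputDocument and generate steps need to be repeated for each new input.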

6 Evaluation

At the end of a software development phase any newly created system must prove in an evaluation phase whether it reaches its predefined goals (Mellish and Dale, 1998). This is especially true in an industrial context with commercial applications as in our case.

The context for the evaluation of XtraGen was provided by X-Booster (Beauregard et al., 2002), another project at our site that was developed concurrently with our system. X-Booster is an optimized implementation of Slipper (Cohen and Singer, 1999), a binary classifier that induces rules via boosting, and combines a set of those classifiers to solve multi-class problems. The goal was to successfully integrate XtraGen into this system.

The motivation behind this is that common classification systems are quite non-transparent with regard to their inner workings. It therefore becomes rather difficult to understand how and why certain classification decisions are made by those systems.

We started from exactly this point: XtraGen was to be used to automatically generate texts that explain the learning phase of X-Booster and hence make the classification more transparent. As an additional “gadget” we wanted to create the explanations in all the languages that are spoken at our company site: English, German, French, Italian, Russian, Bulgarian and Turkish.

6.1 Integration Tasks

As the very first task of the integration procedure we needed to decide what to actually output to the user and in which exact format. We decided on producing a description of the complete learning phase with two kinds of output texts: one targeted at experts and one for novice users. The format chosen for the final output was HTML.

We then needed to adapt the code of X-Booster slightly to make the meta data about the learning phase accessible from the outside. To do so we wrote some small methods that simply return in XML format the meta data which had only been stored in internal variables up to this point.

The next step was to add to X-Booster's own code the code for calling the generation engine and for transforming the result into the final output format. This was done as described in section 5.2.

Finally (and most importantly) the generation grammars for the different languages were developed. We first set up a prototypical grammar for English which we tested extensively. Then in a second step the grammars for the other languages were modelled on this exemplary one. For this we worked together with different native-speaking colleagues who translated the original English grammar into their language.

6.2 Sample Template and Output

Figure 7 below shows a small portion of the output returned after running X-Booster together with the integrated XtraGen on a given training set. This result was produced by using the English generation grammar and the parameters set for producing texts targeted at novice users.

The number of documents is 37, divided into 2 different categories. The results have been produced using 3 fold-cross-validation which means that the data-set is divided into 1/3 test-set and 2/3 training-set.

The learner is trained on the training-data only and evaluated on the test-set which has not been presented before. We repeat this process 3 times on non-intersecting test-data.

The overall result is then computed finally as the average of all performed tests. The average accuracy reaches 100.0% which is achieved by applying 5 rules.

Figure 7: Sample output from X-Booster

Figure 8 shows parts of the template that produced the second paragraph of this sample output. The template is complete in the sense that all different aspects of a template are exposed; it is simplified only in that similar parts of the template are left out.

<template id="foldNumberNovice" category="foldNumber">
  <conditions>
    <and>
      <condition type="exists">
        <get path="/foldNumber"/>
      </condition>
      ...
    </and>
  </conditions>
  <parameters>
    <parameter name="level" value="novice"/>
    ...
  </parameters>
  <actions>
    ...
    We repeat this process
    <get path="/foldNumber"/>
    <inflect stem="time" label="X0"/>
    ...
  </actions>
  <constraints>
    <constraint>
      <place label="X0" attribute="number"/>
      <get path="/foldNumber"/>
    </constraint>
    ...
  </constraints>
</template>

Figure 8: Sample template from X-Booster

7 Outlook and Future Work

Because the evaluation described above proved to be quite successful, it was decided to further deploy XtraGen at our site and to integrate it into new products and projects. One of the first of these projects is Mietta-II (Multilingual Information Environment for Travel and Tourism Applications), a European Commission funded industrial project with the goal of developing a comprehensive web search and information portal that specializes in the tourism sector. (Additional application scenarios are already envisaged for a possible later stage of the project.) In this environment we will apply natural language generation to produce texts and messages for various types of media such as dynamically generated web pages, paper leaflets, hand-held devices and in particular cellular phones. For the last of those media we are exploring the possibility of producing voice-enabled output with a dedicated voice server that is based on VoiceXML (World Wide Web Consortium, 2002) or JSML (Sun Microsystems, 1999). We are currently experimenting with the different possible outputs and with how these outputs can be encoded in a generation grammar enriched with VoiceXML or JSML tags.

8 Conclusion

In this contribution we presented XtraGen, a real-time, template-based natural language generation system designed for real-world applications and based on standard XML- and Java-technologies. We described in detail the different aspects of its generation grammars with an emphasis on their formalism. Then we covered architectural and implementational issues of the system. After depicting the evaluation done for XtraGen we concluded by presenting a new project where the system is used and where new ideas are experimented with.

Acknowledgements

We are grateful to Sven Schmeier, Huiyan Huang and Oliver Plaehn for their collaboration on X-Booster and the Mietta-II project and especially to Stephan Busemann for many fruitful discussions on TG/2. This work was carried out in the Mietta-II project funded by the European Commission under the Fifth Framework Programme IST-2000-30161.

References

Apache XML Project. 2002. Apache XML Project Website. http://xml.apache.org/, June.

Stephane Beauregard, Huiyan Huang, Karsten Konrad, and Sven Schmeier. 2002. Industrial-Strength Text Classifiers. In Proceedings of the Fifth International Conference on Business Information Systems (BIS2002), Poznan, Poland.

Stephan Busemann. 1996. Best-first surface realization. In Proceedings of the 8th International Workshop on Natural Language Generation (INLG '96), pages 101–110, Herstmonceux, England, June.

Alison Cawsey. 2000. Presenting tailored resource descriptions: will XSLT do the job? In Proceedings of the Ninth World-Wide Web conference (WWW9).

Songsak Channarukul. 1999. YAG: A Natural Language Generator for Real-Time Systems. Master's thesis, Department of Electrical Engineering and Computer Science, University of Wisconsin-Milwaukee, December.

William W. Cohen and Yoram Singer. 1999. A Simple, Fast, and Effective Rule Learner. In Proceedings of the Annual Conference of the American Association for Artificial Intelligence (AAAI-99), pages 335–342.

Chris Mellish and Robert Dale. 1998. Evaluation in the Context of Natural Language Generation. Computer Speech and Language, 12(1):349–373.

Ehud Reiter. 1995. NLG vs. Templates. In Proceedings of the Fifth European Workshop on Natural Language Generation, pages 95–105, Leiden, The Netherlands, May. Faculty of Social and Behavioural Sciences, University of Leiden.

Stuart M. Shieber, Gertjan van Noord, Robert C. Moore, and Fernando C. N. Pereira. 1989. A Semantic-Head-Driven Generation Algorithm for Unification-Based Formalisms. In Meeting of the Association for Computational Linguistics, pages 7–17.

Sun Microsystems. 1999. Java Speech API Markup Language Specification. http://java.sun.com/products/java-media/speech/forDevelopers/JSML/, October.

Michael Wein. 1996. Eine parametrisierbare Generierungskomponente mit generischem Backtracking. Diploma thesis, Department of Computer Science, University of the Saarland, Saarbrücken.

Graham Wilcock. 2001. Pipelines, Templates and Transformations: XML for Natural Language Generation. In Proceedings of the first NLP and XML Workshop; Workshop session of the 6th Natural Language Processing Pacific Rim Symposium, Tokyo, November.

World Wide Web Consortium. 1999. XML Path Language (XPath) Version 1.0. http://www.w3.org/TR/xpath, November.

World Wide Web Consortium. 2000. Document Object Model (DOM) Level 2 Core Specification Version 1.0. http://www.w3.org/TR/DOM-Level-2-Core/, November.

World Wide Web Consortium. 2001. Extensible Stylesheet Language (XSL) Version 1.0. http://www.w3.org/TR/xsl, October.

World Wide Web Consortium. 2002. Voice Extensible Markup Language (VoiceXML) Version 2.0. http://www.w3.org/TR/voicexml20, April.


XiSTS – XML in Speech Technology Systems

Michael Walsh, Stephen Wilson, Julie Carson-Berndsen
Department of Computer Science
University College Dublin
Ireland

{michael.j.walsh, stephen.m.wilson, julie.berndsen}@ucd.ie

Abstract: This paper describes the use of XML in three generic interacting speech technology systems. The first, a phonological syllable recognition system, generates feature-based finite-state automaton representations of phonotactic constraints in XML. It employs axioms of event logic to interpret multilinear representations of speech utterances and outputs candidate syllables to the second system, an XML syllable lexicon. This system enables users to generate their own lexicons and its default lexicon is used to accept or reject the candidate syllables output by the speech recognition system. Furthermore its XML representation facilitates its use by the third system which generates additional lexicons, based on different feature sets, by means of a transduction process. The applicability of these alternative feature sets in the generation of synthetic speech can then be tested using these new lexicons.

1. Introduction

The flexibility and portability provided by XML, and its related technologies, result in them being well suited to the development of robust, generic, Natural Language Processing applications. In this paper we describe the use of XML within the context of speech technology software, with a particular focus on speech recognition. We present a framework, based on the model of Time Map Phonology (Carson-Berndsen, 1998), for the development and testing of phonological well-formedness constraints for generic speech technology applications. Furthermore, we illustrate how the use of a syllable lexicon, specified in terms of phonological features, and marked-up in XML, contributes to both speech recognition and synthesis.

In the following sections three inter-connected systems are discussed. The first, the Language Independent Phonotactic System, LIPS, a syllable recognition application based on Time Map Phonology and a significant departure from current ASR technology, is described. The second system, Realising Enforced Feature-based Lexical Entries in XML, REFLEX, is outlined and finally, the third system, Transducing Recognised Entities via XML, T-REX, is discussed. All three systems build on earlier work on generic speech tools (Carson-Berndsen, 1999; Carson-Berndsen & Walsh, 2000a).

2. The Time Map Model

This paper focuses on representing speech utterances in terms of non-segmental phonology, such as autosegmental phonology (Goldsmith, 1990), where utterances are represented in terms of tiers of autonomous features (autosegments) which can spread across a number of sounds. The advantage of this approach is that coarticulation can be modelled by allowing features to overlap. The Time Map model (Carson-Berndsen, 1998, 2000) builds on this autosegmental approach by allowing multilinear representations of autonomous features to be interpreted by an event-based computational linguistic model


of phonology. The Time Map model employs a phonotactic automaton (a finite-state representation of the permissible combinations of sounds in a language) and axioms of event logic to interpret multilinear feature representations. Indeed, much recent research (e.g. Ali et al., 1999; Chang, Greenberg & Wester, 2001) has focused on extracting features similar to those used in our model. Figure 1 below illustrates a multilinear feature-based representation of the syllable [So:n].¹

Figure 1. Multilinear representation of [So:n]

Two temporal domains are distinguished by the Time Map model. The first, absolute (signal) time, considers features as events with temporal endpoints. The second, relative time, considers only the temporal relations of overlap and precedence as salient. Input to the model is in absolute time. Parsing, however, is performed in the relative time domain using only the overlap and precedence relations, and is guided by the phonotactic automaton which imposes top-down constraints on the relations that can occur in a particular language. The construction of the phonotactic automaton and the actual parsing process are carried out by LIPS.

¹ All phonemes are specified in SAMPA notation.

3. LIPS

LIPS is the generic framework for the Time Map model. It incorporates finite-state methodology which enables users to construct their own phonotactic automata for any language by means of a graphical user interface. Furthermore, LIPS employs an event logic, enabling it to map from absolute time to relative time and, in a novel approach to ASR, carry out parsing on the phonological feature level. The system comprises two principal components, the network generator and the parser, outlined in the following subsections.

3.1. The Network Generator

The network generator interface allows users to build their own phonotactic automata. Users input node values and select from a list of feature overlap relations those that a given arc is to represent. These relations can be selected from a default list of IPA-like features or the user can specify their own set. In this way LIPS is feature-set independent. The network generator constructs feature-based networks and parsing takes place at the feature level. Once the user has completed the network specification, the system generates an XML representation of the phonotactic automaton. An automaton representing a small subsection of the phonotactics of English is illustrated in Figure 2. It is clear from this automaton that English permits an [S] followed by a [r] in syllable-initial position, but not the other way around.

Figure 2. Phonotactic automaton
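As an illustration only, the kind of phonotactic fragment shown in Figure 2 can be written down as a small finite-state automaton. The Python sketch below is not the LIPS implementation. The state names, the vowel inventory and the phoneme-labelled arcs are assumptions chosen to accept the example syllables mentioned in this paper ([S], optionally followed by [r], then a vowel, then [m] or [n]); arcs in LIPS carry feature overlap relations rather than bare phoneme symbols.

```python
# Hypothetical fragment of an English phonotactic automaton.
# States, arcs and the vowel list are illustrative assumptions,
# not the network format generated by LIPS.
VOWELS = ["V", "I", "e", "o:", "{"]  # SAMPA vowels assumed for the examples

ARCS = {
    "start": {"S": "after_S"},
    "after_S": {"r": "after_r", **{v: "after_vowel" for v in VOWELS}},
    "after_r": {v: "after_vowel" for v in VOWELS},
    "after_vowel": {"m": "final", "n": "final"},
}
FINAL_STATES = {"final"}


def accepts(phonemes):
    """Return True if the phoneme sequence labels a path from start to a final state."""
    state = "start"
    for phoneme in phonemes:
        # A missing arc means the sequence violates the phonotactics.
        state = ARCS.get(state, {}).get(phoneme)
        if state is None:
            return False
    return state in FINAL_STATES


print(accepts(["S", "o:", "n"]))      # True: [So:n]
print(accepts(["S", "r", "V", "n"]))  # True: [S] followed by [r] is permitted
print(accepts(["r", "S", "V", "n"]))  # False: [r] followed by [S] is not
```

A dictionary of dictionaries keeps the sketch close to the picture of nodes and labelled arcs, and it would serialise naturally to an XML list of arcs.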


Figure 3. XML representation of subsection of phonotactic automaton for English.

Figure 3 illustrates a subsection of the XML representation of the English phonotactics output by the network generator. A single arc with a single phoneme, [S], and its overlap constraints, is shown. The motivation for generating an XML representation for our phonotactic automata is that XML enables us to specify a well-defined, easy to interpret, portable template, without compromising the generic nature of the network generator. That is to say the user can still specify a phonotactic automaton independent of any language or feature-set. The generated phonotactic automaton is then used to guide the second principal component of the system, the parser.

3.2 The Parser

LIPS employs a top-down and breadth-first parsing strategy and is best explained through exemplification. Purely for the purposes of describing how the parsing procedure takes place, we return to the phonotactic automaton of Figure 2, which of course represents only a very small subsection of English. This automaton will recognise such syllables as shum, shim, shem, shown, shrun, shran etc., some being actual lexicalised syllables of English and others being phonotactically well-formed, potential, syllables of English. For our example we take the multilinear representation of the utterance [So:n] as depicted in Figure 4 as our input to the parser.

Figure 4. Interaction between the input and the automaton.
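The two relations of the relative time domain, overlap and precedence, can be derived directly from event endpoints in absolute time. The following Python sketch is a minimal illustration under stated assumptions, not the LIPS parser: the `Event` class, the function names and the time stamps are hypothetical, and each feature is assumed to occur at most once per window. It checks the three overlap constraints associated with [S] (voiceless/fricative, palato/voiceless, fricative/palato) against one window of input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A feature with temporal endpoints in absolute (signal) time."""
    feature: str
    start: float  # e.g. milliseconds; the unit is an assumption
    end: float


def overlaps(a, b):
    """True if the two events share some stretch of absolute time."""
    return a.start < b.end and b.start < a.end


def precedes(a, b):
    """True if event a has ended by the time event b begins."""
    return a.end <= b.start


# Overlap constraints that an arc for [S] requires.
S_CONSTRAINTS = [
    ("voiceless", "fricative"),
    ("palato", "voiceless"),
    ("fricative", "palato"),
]


def satisfies(window, constraints):
    """Check that every required pair of features is present and overlaps."""
    by_feature = {event.feature: event for event in window}
    return all(
        f in by_feature and g in by_feature
        and overlaps(by_feature[f], by_feature[g])
        for f, g in constraints
    )


# Hypothetical time stamps for the first window of the input.
window = [
    Event("voiceless", 0, 120),
    Event("fricative", 5, 118),
    Event("palato", 10, 115),
]
print(satisfies(window, S_CONSTRAINTS))  # True
```

Once a window satisfies the constraints on an arc, only the fact that the relations hold is needed; the absolute endpoints play no further role in the parse.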

At the beginning of the parsing process the phonotactic automaton is anticipating a [S] sound; that is, it requires three temporal overlap constraints to be satisfied: the feature voiceless must overlap the feature fricative, the feature palato must overlap the feature voiceless, and the feature fricative must overlap the feature palato. A variable window is applied over the input utterance and the features within the window are examined to see if they satisfy the overlap constraints. As can be seen from Figure 4, the three features are indeed present and all overlap in time. Thus the [S] is recognised, the two arcs bearing the [S] symbol are traversed, and the window moves on. At this point the automaton is anticipating either an [r] or a vowel sound. In a similar fashion the contents of the new window are examined and, in the case of our example, the vowel [o:] is recognised (the [r] is rejected). The vowel transition is traversed, the window moves on, and the automaton is expecting an [n] or an [m]. For full details of the parsing process see Carson-Berndsen & Walsh (2000b). Output from LIPS is then fed through the REFLEX system to determine if actual or potential syllables have been found.

4. REFLEX

REFLEX is a generic, language independent application which allows for the rapid design and construction of syllable lexicons for any language. One of the main focuses of other research on broadening the scope of the lexicon across languages has been the development of multilingual lexicons. One such project, PolyLex (Cahill & Gazdar, 1999), captures commonalities across related languages using hierarchical inheritance mechanisms. One of the main concerns of the work presented here, however, is to provide generic, reusable tools which facilitate the development and testing of phonological systems, rather than the creation of such multilingual lexicons. Work on phonological features and lexical description has either been within this multilingual context (Tiberius & Evans, 2000) or has concentrated on using a feature-based lexicon for comparison with features extracted from a sound signal (Reetz, 2000). By removing reference to specific languages and concentrating on providing mechanisms for lexical generation, REFLEX can generate a syllable lexicon for any language that can be adequately represented in a phonetic notation. Furthermore, the decision to use XML to represent the output data means that it is readily available for use and manipulation by other outside systems with minimal effort. All background processing is completely hidden; one deals only with the marked-up output, from which idiosyncratic user-required structures can be rapidly generated.


The REFLEX system outputs a feature-based syllable lexicon. This lexicon is a valid XML document, meaning that it conforms to the given REFLEX Document Type Definition (DTD). The DTD stipulates the structure, order and number of XML element tags and attributes, modelling all potential syllable structures (e.g. V, CV, CVC etc). An example of a typical lexical entry, in this case corresponding to the multilinear representation specified in Figure 5, [So:n] is given below.

Figure 5. Typical lexical entry in XML

The syllable element shown has four children, described as follows:

1) A text child, in this case So:n, the SAMPA representation of the entire syllable.
2) An <onset> element whose attribute list denotes its position within the syllable, i.e. <onset type="first">, <onset type="second"> etc.
3) Nucleus and 4) coda elements are similarly defined.

Each of the syllable's elements, <onset>, <nucleus> and <coda>, may have only one child element, <segment>, which tags the given phoneme. Its attribute list describes the phoneme's specification in terms of phonological features. It also has a duration attribute, which is derived from corpus analysis.

<segment phonation="voiced" manner="nasal" place="apical" duration="null">n</segment>

REFLEX provides two methods by which syllables can be added to the lexicon. The first requires users to specify an input file of monosyllables represented in a phonetic notation, in this case SAMPA. The second enables the user to specify syllables, in terms of phonemes, position and, if desired, a typical duration, by means of a GUI illustrated below in Figure 6.

Figure 6. REFLEX lexicographer interface

Regardless of the input option chosen, new entries are added to the lexicon via a background process. REFLEX makes use of DATR, a non-monotonic inheritance based lexical representation language (Evans & Gazdar, 1996) to carry out this process. DATR is used to quickly and comprehensively define the phonological feature descriptions for a given language. For a greater understanding of how this can be achieved see Cahill, Carson-Berndsen & Gazdar (2000). Using DATR’s inference mechanisms, REFLEX manipulates the output into a valid XML document, creating a sophisticated phonological feature-based lexicon, shown in Figure 5.
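Because the lexicon is plain XML, entries of this kind can be inspected with standard tools. The sketch below uses Python's `xml.etree.ElementTree`, which implements only a subset of XPath, so it is an illustration of the idea rather than the queries REFLEX itself formulates. Element and attribute names follow the description of the lexical entry; the exact nesting, the vowel features, the second entry `bIn` and both helper functions are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-entry lexicon. Tag and attribute names follow the text;
# the vowel features and the entry "bIn" are invented for illustration.
LEXICON_XML = """\
<lexicon language="English">
  <syllable>So:n
    <onset type="first"><segment phonation="voiceless" manner="fricative" place="palato" duration="null">S</segment></onset>
    <nucleus type="first"><segment phonation="voiced" manner="vowel" duration="null">o:</segment></nucleus>
    <coda type="first"><segment phonation="voiced" manner="nasal" place="apical" duration="null">n</segment></coda>
  </syllable>
  <syllable>bIn
    <onset type="first"><segment phonation="voiced" manner="plosive" place="labial" duration="null">b</segment></onset>
    <nucleus type="first"><segment phonation="voiced" manner="vowel" duration="null">I</segment></nucleus>
    <coda type="first"><segment phonation="voiced" manner="nasal" place="apical" duration="null">n</segment></coda>
  </syllable>
</lexicon>
"""

lexicon = ET.fromstring(LEXICON_XML)


def syllable_text(syllable):
    """Return the SAMPA text child of a <syllable> element."""
    return (syllable.text or "").strip()


def actual_syllables(lexicon, candidates):
    """Keep only those candidate syllables that have an entry in the lexicon."""
    known = {syllable_text(s) for s in lexicon.findall("syllable")}
    return [c for c in candidates if c in known]


def syllables_with(lexicon, constituent, position, **features):
    """Find syllables whose given constituent carries all the given features."""
    predicates = "".join(f"[@{name}='{value}']" for name, value in features.items())
    path = f"{constituent}[@type='{position}']/segment{predicates}"
    return [
        syllable_text(s)
        for s in lexicon.findall("syllable")
        if s.find(path) is not None
    ]


# A well-formed candidate without a lexical entry is filtered out.
print(actual_syllables(lexicon, ["So:n", "So:m"]))  # ['So:n']

# Feature-level search: a voiced, labial plosive in the first onset.
print(syllables_with(lexicon, "onset", "first",
                     phonation="voiced", manner="plosive", place="labial"))  # ['bIn']
```

The text comparison is done in Python because the `ElementTree` path language cannot test only the leading text child of an element; a full XPath engine could express both lookups as single expressions.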


All syllable elements are enclosed within the root <lexicon> tag, whose sole attribute specifies the lexicon's language.

<lexicon language="English">
  <syllable>…</syllable>
  :
  <syllable>…</syllable>
</lexicon>

The REFLEX lexicon is a versatile tool that has a number of potential applications within the domain of speech technology systems. The following sub-sections illustrate how this syllable lexicon, by virtue of its being marked up in XML, can contribute to both speech recognition and synthesis.

4.1 LIPS and REFLEX

By allowing feature overlap constraints to be relaxed in the case of underspecified input, LIPS can produce a number of candidate syllables. In Figure 4 above, at the final transition, the automaton is expecting either an [m] or an [n]. The input, however, is underspecified: no feature distinguishing between [m] and [n], or indeed any voiced nasal, is present. By allowing the overlap constraints for the [m] and the [n] to be relaxed, LIPS can consider both [So:n] and [So:m] to be candidate syllables for the utterance. Both candidate syllables are well-formed, adhering to the phonotactics of English; however, only one, [So:n], is an actual syllable of English. Thus at this point a lexicon providing good coverage of the language should reject [So:m] and accept [So:n]. In order to achieve this, REFLEX makes use of the XPath specification (a means for locating nodes in an XML document) and formulates a query before applying it to the syllable lexicon.² In the example given, REFLEX searches the document, checking the value of the text child of each syllable element against each candidate syllable output by LIPS. Any successful matches returned are therefore not only well-formed, but are deemed to be actual syllables. Thus at this point, the lexicon is searched and the syllable [So:n] is recognised.

² The full W3C XPath specification can be found at http://www.w3c.org/TR/xpath

The granularity of the REFLEX search capability is such that it can be extended to the feature level. Users can search the lexicon for syllables that contain a number of specific features in certain positions, e.g. search for syllables that contain a voiced, labial, plosive in the first onset. Again, REFLEX forms an XPath expression and queries the lexicon, returning all matches. REFLEX also functions as a knowledge source for the T-REX system. This system is responsible for mapping output from the lexicon into syllable representations using different feature sets, e.g. features from other phonologies, and is discussed below in the context of speech synthesis.

5. T-REX

The role of this module is to enable lexicographers and speech scientists etc. to generate, via a transduction process, syllable lexicons based on different phonological feature sets. The default feature set employed by REFLEX is based on IPA-like features. However, T-REX provides a GUI that permits lexicographers to define phoneme to feature attribute mappings. Given this functionality T-REX operates as a testbed for investigating the merits of different feature sets in the context of speech synthesis. Different lexicons are generated by associating new feature sets with the same phonetic alphabet (SAMPA) via a GUI. The new lexicon is then transduced by T-REX which maps all syllable entries from the default lexicon (with IPA-like features) to the new lexicon,


applying the features input by the user to their associated phonemes. In order to exemplify this we return to our sample syllable, [So:n]. Figure 5 above shows the lexical representation, using IPA-like features, for [So:n]. Figure 7 below shows new features being associated with the phoneme [S].

Figure 7. GUI for T-REX

Similarly, new features are associated with the remaining phonemes, [o:] and [n], and indeed the rest of the SAMPA alphabet. On completion the user initiates the transduction process and a new lexicon is produced. The XML representation of the phoneme [S], in the new lexicon, is depicted in Figure 8. Note how the feature attributes differ from those in the default lexicon.

Figure 8. Phoneme with transduced features
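The transduction step can be pictured as rewriting the attribute list of every <segment> element from a phoneme-to-feature table. The Python sketch below is a hedged illustration, not the T-REX implementation: the `transduce` function, the decision to carry the duration attribute over unchanged, and the feature names and values in `NEW_FEATURES` are all assumptions invented for this example.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical mapping from SAMPA phonemes to a new feature set.
# Feature names and values are invented for illustration only.
NEW_FEATURES = {
    "S": {"consonantal": "+", "continuant": "+", "voice": "-"},
    "o:": {"consonantal": "-", "continuant": "+", "voice": "+"},
    "n": {"consonantal": "+", "continuant": "-", "voice": "+"},
}


def transduce(lexicon, mapping, keep=("duration",)):
    """Return a copy of the lexicon whose segments carry the new features.

    Attributes named in `keep` are carried over (an assumption); all other
    attributes are replaced by the features mapped to the segment's phoneme.
    A KeyError is raised if a phoneme has no entry in the mapping.
    """
    new_lexicon = copy.deepcopy(lexicon)
    for segment in new_lexicon.iter("segment"):
        phoneme = (segment.text or "").strip()
        kept = {k: v for k, v in segment.attrib.items() if k in keep}
        segment.attrib.clear()
        segment.attrib.update(kept)
        segment.attrib.update(mapping[phoneme])
    return new_lexicon


# Default lexicon with a single entry; the vowel features are assumed.
default = ET.fromstring(
    '<lexicon language="English"><syllable>So:n'
    '<onset type="first"><segment phonation="voiceless" manner="fricative" '
    'place="palato" duration="null">S</segment></onset>'
    '<nucleus type="first"><segment phonation="voiced" manner="vowel" '
    'duration="null">o:</segment></nucleus>'
    '<coda type="first"><segment phonation="voiced" manner="nasal" '
    'place="apical" duration="null">n</segment></coda>'
    '</syllable></lexicon>'
)

new = transduce(default, NEW_FEATURES)
print(ET.tostring(new.find("syllable/onset/segment"), encoding="unicode"))
```

Working on a deep copy leaves the default lexicon untouched, so several alternative lexicons can be derived from the same source and compared.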

The advantages of this transduction capability are that numerous lexicons can be rapidly developed and used to investigate the appropriateness of specific formal models of phonological representation for the purposes of speech synthesis. Furthermore, the same computational phonological model, i.e. the Time Map model, can be employed. Bohan et al. (2001) describe how the phonotactic automaton is used to generate a multilinear event representation of overlap and precedence constraints for an utterance, which is then mapped to control parameters of the HLsyn (Sensimetrics Corporation) synthesis engine. Different feature sets can be evaluated by assessing how they influence the various control parameters of the HLsyn engine and the quality of the synthesised speech.

6. Conclusion

This paper has described how the use of XML together with a computational phonological model can contribute significantly to the tasks of speech recognition, speech synthesis and lexicon development. Phonotactic automata and multilinear representations were introduced and the interpretation of these representations was discussed. Three robust, well-defined systems, LIPS, REFLEX, and T-REX, were outlined. These systems offer generic structures coupled with the portability of XML. In doing so, they enable users to recognise speech, synthesise speech, and develop lexicons for different languages using different feature sets while maintaining a common interface. The generic and portable nature of these systems means that languages with significantly different phonologies are supported. In addition, languages which, to date, have received little attention with respect to speech technology are equally provided for. Ongoing projects include work on Irish, which has a notably different phonology from English, and on developing phonotactic automata and phonological lexicons for other languages. Furthermore, the models are being extended to include phoneme-grapheme mappings based on the contexts defined by the phonotactic automata.

7. Bibliography

Ali, A.M.A.; J. Van der Spiegel; P. Mueller; G. Haentjaens & J. Berman (1999): An Acoustic-Phonetic Feature-Based System for Automatic Phoneme Recognition in Continuous Speech. In: IEEE International Symposium on Circuits and Systems (ISCAS-99), III-118-III-121, 1999.

Bohan, A.; E. Creedon; J. Carson-Berndsen & F. Cummins (2001): Application of a Computational Model of Phonology to Speech Synthesis. In: Proceedings of AICS2001, Maynooth, September 2001.

Cahill, L. & G. Gazdar (1999). The PolyLex architecture: multilingual lexicons for related languages. Traitement Automatique des Langues, 40(2), 5-23.

Cahill, L.; J. Carson-Berndsen & G. Gazdar (2000), Phonology-based Lexical Knowledge Representation. In: F. van Eynde & D. Gibbon (eds.) Lexicon Development for Speech and Language Processing, Kluwer Academic Publishers, Dordrecht.

Carson-Berndsen, J. (1998): Time Map Phonology: Finite State Models and Event Logics in Speech Recognition. Kluwer Academic Publishers, Dordrecht.

Carson-Berndsen, J. (1999): A Generic Lexicon Tool for Word Model Definition in Multimodal Applications. Proceedings of EUROSPEECH 99, 6th European Conference on Speech Communication and Technology, Budapest, September 1999.

Carson-Berndsen, J. (2000): Finite State Models, Event Logics and Statistics in Speech Recognition, In: Gazdar, G.; K. Sparck Jones & R. Needham (eds.): Computers, Language and Speech: Integrating formal theories and statistical data. Philosophical Transactions of the Royal Society, Series A, 358(1770), 1255-1266.

Carson-Berndsen, J. & M. Walsh (2000a): Generic techniques for multilingual speech technology applications, Proceedings of the 7th Conference on Automatic Natural Language Processing, Lausanne, Switzerland, 61-70.

Carson-Berndsen, J. & M. Walsh (2000b): Interpreting Multilinear Representations in Speech. In: Proceedings of the Eighth International Conference on Speech Science and Technology, Canberra, December 2000.

Chang, S.; S. Greenberg & M. Wester (2001): An Elitist Approach to Articulatory-Acoustic Feature Classification. In: Proceedings of Eurospeech 2001, Aalborg.

Evans, R. & G. Gazdar (1996): DATR: A language for lexical knowledge representation. In: Computational Linguistics 22(2), pp. 167-216.

Goldsmith, J. (1990): Autosegmental and Metrical Phonology. Basil Blackwell, Cambridge, MA.

Reetz, H. (2000) Underspecified Phonological Features for Lexical Access. In: PHONUS 5, pp. 161-173. Saarbrücken: Institute of Phonetics, University of the Saarland.

Tiberius, C. & R. Evans (2000): Phonological feature based Multilingual Lexical Description. In: Proceedings of TALN 2000, Geneva, Switzerland.


SALT: An XML Application for Web-based Multimodal Dialog Management

Kuansan Wang

Speech Technology Group, Microsoft Research One Microsoft Way, Microsoft Corporation

Redmond, WA, 98006, USA http://research.microsoft.com/stg

Abstract

This paper describes the Speech Application Language Tags, or SALT, an XML based spoken dialog standard for multimodal or speech-only applications. A key premise in SALT design is that speech-enabled user interface shares a lot of the design principles and computational requirements with the graphical user interface (GUI). As a result, it is logical to introduce into speech the object-oriented, event-driven model that is known to be flexible and powerful enough in meeting the requirements for realizing sophisticated GUIs. By reusing this rich infrastructure, dialog designers are relieved from having to develop the underlying computing infrastructure and can focus more on the core user interface design issues than on the computer and software engineering details. The paper focuses the discussion on the Web-based distributed computing environment and elaborates how SALT can be used to implement multimodal dialog systems. How advanced dialog effects (e.g., cross-modality reference resolution, implicit confirmation, multimedia synchronization) can be realized in SALT is also discussed.

Introduction

A multimodal interface allows a human user to interact with the computer using more than one input method. GUI, for example, is multimodal because a user can interact with the computer using keyboard, stylus, or pointing devices. GUI is an immensely successful concept, notably demonstrated by the World Wide Web. Although the relevant technologies for the Internet had long existed, it was not until the adoption of GUI for the Web that we witnessed a surge in its usage and rapid improvements in Web applications.

GUI applications have to address the issues commonly encountered in a goal-oriented dialog system. In other words, a GUI application can be viewed as conducting a dialog with its user in an iconic language. For example, it is very common for an application and its human user to undergo many exchanges before a task is completed. The application therefore must manage the interaction history in order to properly infer the user's intention. The interaction style is mostly system initiative because the user often has to follow the prescribed interaction flow where allowable branches are visualized in graphical icons. Many applications have introduced mixed initiative features such as a type-in help or search box. However, user-initiated digressions are often recoverable only if they are anticipated by the application designers. The plan-based dialog theory (Sadek et al 1997, Allen 1995, Cohen et al 1989) suggests that, in order for the mixed initiative dialog to function properly, the computer and the user should be collaborating partners that actively assist each other in planning the dialog flow. An application will be perceived as hard to use if the flow logic is obscure or unnatural to the user and, similarly, the user will feel frustrated if the methods to express intents are too limited. It is widely believed that spoken language can improve the user interface as it provides the user a natural and less restrictive way to express intents and receive feedback.


The Speech Application Language Tags (SALT 2002) is a proposed standard for implementing spoken language interfaces. The core of SALT is a collection of objects that enable a software program to listen, speak, and communicate with other components residing on the underlying platform (e.g., discourse manager, other input modalities, telephone interface, etc.). Like their predecessors in the Microsoft Speech Application Interface (SAPI), SALT objects are programming language independent. As a result, SALT objects can be embedded into an HTML or any XML document as the spoken language interface (Wang 2000). Introducing speech capabilities to the Web is not new (Aron 1991, Ly et al 1993, Lau et al 1997). However, it is the utmost design goal of SALT that advanced dialog management techniques (Sneff et al 1998, Rudnicky et al 1999, Lin et al 1999, Wang 1998) can be realized in a straightforward manner in SALT.

The rest of the paper is organized as follows. In Sec. 1, we first review the dialog architecture on which the SALT design is based. It is argued that advanced spoken dialog models can be realized using the Web infrastructure. Specifically, various stages of dialog goals can be modeled as Web pages that the user will navigate through. Considerations in flexible dialog designs have direct implications on the XML document structures. How SALT implements these document structures is outlined. In Sec. 2, the XML objects providing spoken language understanding and speech synthesis are described. These objects are designed using the event driven architecture so that they can be included in the GUI environment for multimodal interactions. Finally, in Sec. 3, we describe how SALT, which is based on XML, utilizes the extensibility of XML to allow new extensions without losing document portability.

1 Dialog Architecture Overview

With the advent of XML Web services, the Web has quickly evolved into a gigantic distributed computer where Web services, communicating in XML, play the role of reusable software components. Using the universal description, discovery, and integration (UDDI) standard, Web services can be discovered and linked up dynamically to collaborate on a task. In other words, Web services can be regarded as the software agents envisioned in the open agent architecture (Bradshaw 1996). Conceptually, the Web infrastructure provides a straightforward means to realize the agent-based approach suitable for modeling highly sophisticated dialog (Sadek et al 1997). This distributed model shares the same basis as the SALT dialog management architecture.

1.1 Page-based Dialog Management

An examination of human-to-human conversations on trip planning shows that experienced agents often guide the customers in dividing the trip planning into a series of more manageable and relatively untangled subtasks (Rudnicky et al 1999). Not only does this observation contribute to the formation of the plan-based dialog theory, but the same principle is also widely adopted in designing GUI-based transactions, where the subtasks are usually encapsulated in visual pages. Take a travel planning Web site for example. The first page usually gathers some basic information about the trip, such as the traveling dates and the originating and destination cities. All the possible travel plans are typically shown in another page, in which the user can negotiate on items such as the price and the departure and arrival times. To some extent, the user can alter the flow of interaction. If the user is more flexible about the flight than the hotel reservation, a well designed site will allow the user to digress and settle the hotel reservation before booking the flight. Necessary confirmations are usually conducted in separate pages before the transaction is executed. The designers of SALT believe that spoken dialog can be modeled by the page-based interaction as well, with each page designed to achieve a sub-goal of the task. There seems to be no reason why the planning of the dialog cannot utilize the same mechanism that dynamically synthesizes the Web pages today.

1.2 Separation of Data and Presentation

SALT preserves the tremendous amount of flexibility of a page-based dialog system in dynamically adapting the style and presentation of a dialog (Wang 2000). A SALT page is composed of three portions: (1) a data section corresponding to the information the system needs to acquire from the user in order to achieve the sub-goal of the page; (2) a presentation section that, in addition to GUI objects, contains the templates to generate speech prompts and the rules to recognize and parse the user's utterances; (3) a script section that includes inference logic for deriving the dialog flow in achieving the goal of the page. The script section also implements the necessary procedures to manipulate the presentation section.

This document structure is motivated by the following considerations. First, the separation of the presentation from the rest localizes the natural language dependencies. An application can be ported to another language by changing only the presentation section without affecting other sections. Also, a good dialog must dynamically strike a balance between system initiative and user initiative styles. However, the need to switch the interaction style does not necessitate changes in the dialog planning. The SALT document structure maintains this type of independence by separating the data section from the rest of the document, so that when the interaction style needs to change, the script and the presentation sections can be modified without affecting the data section. The same mechanism also enables the application to switch among various UI modes, such as in mobile environments where the interaction must be able to switch seamlessly between GUI and speech-only modes for hand-eye busy situations. The presentation section may vary significantly among the UI modes, but the rest of the document can remain largely intact.

1.3 Semantic Driven Multimodal Integration

SALT follows the common GUI practice and employs an object-oriented, event-driven model to integrate multiple input methods. The technique tracks the user's actions and reports them as events. An object is instantiated for each event to describe the causes. For example, when a user clicks on a graphical icon, a mouse click event is fired. The mouse-click event object contains information such as the coordinates where the click takes place. SALT extends the mechanism to speech input, in which the notion of semantic objects (Wang 2000, Wang 1998) is introduced to capture the meaning of spoken language. When the user says something, speech events, furnished with the corresponding semantic objects, are reported. The semantic objects are structured and categorized. For example, the utterance “Send mail to John” is composed of two nested semantic objects: “John” representing the semantic type “Person” and the whole utterance the semantic type “Email command.” SALT therefore enables a multimodal integration algorithm based on semantic type compatibility (Wang 2001). The same command can be manifested in a multimodal expression, as in “Send email to him [click]” where the email recipient is given by a point-and-click gesture. Here the semantic type provides a straightforward way to resolve the cross modality reference: the handler for the GUI mouse click event can be programmed to produce a semantic object of the type “Person” which can subsequently be identified as a constituent of the “Email command” semantic object. Because the notion of semantic objects is quite generic, dialog designers should find little difficulty employing other multimodal integration algorithms, such as the unification based approach described in (Johnston et al 1997), in SALT.
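The idea of integration by semantic type compatibility can be sketched in a few lines. The Python below is an illustration only and is not part of SALT or any SALT API: the `SemanticObject` class, its `slots` and `filled` fields, and the `integrate` function are hypothetical names introduced for this example. A speech event supplies an “Email command” object whose recipient is unresolved, and a mouse click handler supplies a “Person” object that fills it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SemanticObject:
    """A typed unit of meaning; hypothetical structure, not a SALT type."""
    type: str
    value: Optional[str] = None
    slots: dict = field(default_factory=dict)   # slot name -> required semantic type
    filled: dict = field(default_factory=dict)  # slot name -> SemanticObject

    def unfilled(self):
        """Names of the slots that still need a constituent."""
        return [name for name in self.slots if name not in self.filled]


def integrate(command, incoming):
    """Attach `incoming` to the first unfilled slot whose type it matches."""
    for name in command.unfilled():
        if command.slots[name] == incoming.type:
            command.filled[name] = incoming
            return True
    return False


# Speech event for "Send email to him": the recipient is not yet resolved.
command = SemanticObject(type="Email command", slots={"recipient": "Person"})

# The handler for the mouse click event produces a "Person" object.
clicked = SemanticObject(type="Person", value="John")

# An object of an incompatible type is not accepted as the recipient.
print(integrate(command, SemanticObject(type="Date", value="Monday")))  # False
print(integrate(command, clicked))                                      # True
print(command.filled["recipient"].value)                                # John
```

Because the match is made on semantic type alone, the command does not need to know whether the “Person” object came from speech or from a pointing gesture.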

2 Basic Speech Elements in SALT

SALT speech objects encapsulate speech functionality. They resemble GUI objects in many ways. Because they share the same high level abstraction, SALT speech objects interoperate with GUI objects in a seamless and consistent manner. Multimodal dialog designers can elect to ignore the modality of communication, in much the same way as they are insulated from having to distinguish whether a text string is entered into a field through a keyboard or cut and pasted with a pointing device.

2.1 The Listen Object

The “listen” object in SALT is the speech input object. The object must be initialized with a speech grammar that defines the language model and the lexicon relevant to the recognition task. The object has a start method that, upon invocation, collects the acoustic samples and performs speech recognition. If the language model is a probabilistic context-free grammar (PCFG), the object can return the parse tree of the recognized outcome. Optionally, dialog designers can embed XSLT templates or scripts in the grammar to shape the parse tree into any desired format. The most common usage is to transform the parse tree into a semantic tree composed of semantic objects.

A SALT object is instantiated in an XML document whenever a tag bearing the object name is encountered. For example, a listen object can be instantiated as follows:

    <listen id="foo" onreco="f()" onnoreco="g()" mode="automatic">
      <grammar src="../meeting.xml"/>
    </listen>

The object, named “foo,” is given a speech grammar whose uniform resource identifier (URI) is specified via a <grammar> constituent. As in the case of HTML, methods of an object are invoked via the object name. For example, the command to start the recognition is foo.start() in the ECMAScript syntax. Upon successful recognition and parsing, the listen object raises the event “onreco.” The event handler, f(), is associated in the HTML syntax as shown above. If the recognition result is rejected, the listen object raises the “onnoreco” event, which, in the above example, invokes function g(). As mentioned in Sec. 0, these event handlers reside in the script section of a SALT page that manages the within-page dialog flow. Note that SALT is designed to be agnostic to the syntax of the eventing mechanism. Although the examples throughout this article use HTML syntax, SALT can operate with other eventing standards, such as World Wide Web Consortium (W3C) XML Document Object Model (DOM) Level 2, ECMA Common Language Infrastructure (CLI), or the upcoming W3C proposal called XML Events.

The SALT listen object can operate in one of three modes designed to meet different UI requirements. The automatic mode, shown above, automatically detects the end of the utterance and cuts off the audio stream. This mode is most suitable for push-to-talk UI or telephony-based systems. As a counterpart to the start method, the listen object also has a stop method for forcing the recognizer to stop listening. The designer can explicitly invoke the stop method rather than rely on the recognizer’s default behavior. Invoking the stop method becomes necessary when the listen object operates in the single mode, where the recognizer is mandated to continue listening until the stop method is called. In the single mode, the recognizer is required to evaluate and return hypotheses based on the full length of the audio, even though some search paths may have reached a legitimate end-of-sentence token in the middle of the audio stream. In contrast, the third mode, multiple, allows the listen object to report hypotheses as soon as it sees fit. The single mode is designed for push-hold-and-talk types of UI, while the multiple mode is for real-time or dictation types of applications.

The listen object also has methods to modify the PCFG it contains. Rules can be dynamically activated and deactivated to control the perplexity of the language model. The semantic parsing templates in the grammar can be manipulated to perform simple reference resolution. For example, the grammar below (in SAPI format) demonstrates how a deictic reference can be resolved inside the SALT listen object:

    <rule propname="drink" …>
      <option> the </option>
      <list>
        <phrase propvalue="coffee"> left </phrase>
        <phrase propvalue="juice"> right </phrase>
      </list>
      <option> one </option>
    </rule>

In this example, the propname and propvalue attributes are used to generate the semantic objects. If the user says “the left one,” the above grammar directs the listen object to return the semantic object <drink text="the left one">coffee</drink>. This mechanism for composing semantic objects is particularly useful for processing expressions closely tied to how data are presented. The grammar above may be used when the computer asks the user to choose a drink by displaying pictures of the choices side by side. However, if the display is tiny, the choices may be rendered as a list, to which a user may say “the first one” or “the bottom one.” SALT allows dialog designers to approach this problem by dynamically adjusting the speech grammar.
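The following Python sketch illustrates the idea of switching the deictic grammar with the display layout. It does not use a speech recognizer or the SAPI grammar format: the phrase tables, the layout names, and the function resolve_drink are assumptions made for illustration, and the mapping for the list layout (first/top to coffee, second/bottom to juice) is invented to mirror the side-by-side example.

```python
# Hypothetical phrase tables standing in for two versions of the <rule> above:
# one for a side-by-side picture layout, one for a vertical list layout.
GRAMMARS = {
    "side_by_side": {"left": "coffee", "right": "juice"},
    "list": {"first": "coffee", "top": "coffee", "second": "juice", "bottom": "juice"},
}
OPTIONAL_WORDS = {"the", "one"}  # the <option> words, which carry no meaning


def resolve_drink(utterance, layout):
    """Return a <drink> semantic object as an XML string, or None if no phrase matches."""
    phrases = GRAMMARS[layout]
    content = [w for w in utterance.lower().split() if w not in OPTIONAL_WORDS]
    if len(content) == 1 and content[0] in phrases:
        return f'<drink text="{utterance}">{phrases[content[0]]}</drink>'
    return None


print(resolve_drink("the left one", "side_by_side"))
print(resolve_drink("the bottom one", "list"))
print(resolve_drink("the left one", "list"))  # "left" is not active in the list layout
```

Selecting the table by layout plays the role of activating and deactivating grammar rules: the same utterance is accepted or rejected depending on how the choices are currently displayed.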

2.2 The Prompt Object

The SALT “prompt” object is the speech output object. Like the listen object, the prompt object has a start method to begin the audio playback. The prompt object can perform text-to-speech synthesis (TTS) or play pre-recorded audio. For TTS, the prosody and other dialog effects can be controlled by marking up the text with synthesis directives.

Barge-in and bookmark are two events of the prompt object that are particularly useful for dialog design. The prompt object raises a barge-in event when the computer detects a user utterance during prompt playback. SALT provides a rich program interface for dialog designers to specify the appropriate behavior when barge-in occurs. Designers can choose whether to delegate to SALT the task of cutting off the outgoing audio stream as soon as speech is detected. Delegated cut-off minimizes the barge-in response time, and is close to the behavior expected by users who wish to expedite the progress of the dialog without waiting for the prompt to end. Non-delegated barge-in, on the other hand, lets the user change playback parameters without interrupting the output. For example, the user can adjust the speed and volume using speech commands while the audio playback is in progress. SALT will automatically turn on echo cancellation in this case so that the playback has minimal impact on the recognition.

The timing of a certain user action, or the lack thereof, often bears semantic implications. Implicit confirmation is a good example, where the absence of an explicit correction from the user is considered a confirmation. The prompt object introduces an event for reporting the landmarks of the playback. The typical way of catching the playback landmarks in SALT is as follows:

    <prompt id="bar" onbookmark="f()" …>
      Traveling to New York?
      <bookmark name="imp_confirm"/>
      There are <emph> 3 </emph> flights available …
    </prompt>

When the synthesizer reaches the TTS markup <bookmark>, the onbookmark event is raised and the event handler f() is invoked. When a barge-in is detected, the dialog designer can determine whether the barge-in occurred before or after the bookmark by inspecting whether the function f() has been called. Multimedia synchronization is another main use of TTS bookmarks. When the speech output is accompanied by, for example, graphical animations, TTS bookmarks are an effective mechanism for synchronizing these parallel outputs.

To include dynamic content in the prompt, SALT adopts a simple template-based approach to prompt generation. In other words, the carrier phrases can be either pre-recorded or hard-coded, while the key phrases can be inserted and synthesized dynamically. The prompt object that confirms a travel plan may appear as follows in HTML:

    <input name="origin" type="text" />
    <input name="destination" type="text" />
    <input name="date" type="text" />
    …
    <prompt …>
      Do you want to fly from <value targetElement="origin"/>
      to <value targetElement="destination"/>
      on <value targetElement="date"/>?
    </prompt>

As shown above, SALT uses a <value> tag inside a prompt object to refer to the data contained in other parts of the SALT page. In this example, the prompt object will insert the values of the HTML input objects when synthesizing the prompt.
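As a rough illustration of the template-based approach, the Python sketch below expands <value> references in a prompt before synthesis. It only mimics the substitution step with the standard xml.etree.ElementTree module; the function render_prompt and the sample field values are assumptions for illustration, not part of SALT.

```python
import xml.etree.ElementTree as ET

# The carrier phrase with <value> placeholders, as in the travel example above.
PROMPT = (
    '<prompt>Do you want to fly from <value targetElement="origin"/> '
    'to <value targetElement="destination"/> '
    'on <value targetElement="date"/>?</prompt>'
)


def render_prompt(prompt_xml, fields):
    """Replace each <value> element with the content of the field it points to."""
    root = ET.fromstring(prompt_xml)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "value":
            parts.append(fields[child.get("targetElement")])
        parts.append(child.tail or "")  # carrier text following the element
    return "".join(parts)


# Hypothetical contents of the three HTML input fields.
fields = {"origin": "Seattle", "destination": "New York", "date": "Friday"}
print(render_prompt(PROMPT, fields))
```

Keeping the carrier phrase and the field values separate means the same prompt can be reused for every itinerary, with only the key phrases synthesized dynamically.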


2.3 Declarative Rule-based Programming

Although the examples above manage the dialog flow with procedural programming, SALT designers can also practice inference programming in a declarative, rule-based fashion, in which rules are attached to the SALT objects that capture the user’s actions, e.g., the listen object. Instead of authoring procedural event handlers, designers can declare inside the listen object rules that will be evaluated and invoked when the semantic objects are returned. This is achieved through the SALT <bind> element, as demonstrated below:

    <listen …>
      <grammar …/>
      <bind test="/@confidence $lt$ 50"
            targetElement="prompt_confirm" targetMethod="start"
            targetElement="listen_confirm" targetMethod="start" />
      <bind test="/@confidence $ge$ 50"
            targetElement="origin" value="/city/origin"
            targetElement="destination" value="/city/destination"
            targetElement="date" value="/date" />
      …
    </listen>

The predicate of each rule is applied in turn against the result of the listen object. The predicates are expressed in the standard XML Pattern language in the “test” clause of the <bind> element. In this example, the first rule checks whether the confidence level falls below the threshold. If so, the rule activates a prompt object (prompt_confirm) for explicit confirmation, followed by a listen object (listen_confirm) to capture the user’s response. The speech objects are activated via the start method of the respective object. Object activations are specified in the targetElement and targetMethod clauses of the <bind> element. Similarly, the second rule applies when the confidence score reaches or exceeds the prescribed level. The rule extracts the relevant semantic objects from the parsed outcome and assigns them to the respective elements in the SALT page. As shown above, SALT reuses the W3C XPath language for extracting partial semantic objects from the parsed outcome.
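The two rules can be mimicked in a few lines of Python to show how declarative predicates replace hand-written event handlers. This is a sketch under stated assumptions: the shape of the recognition result (a root element with a confidence attribute and city/date children), the RULES table, and the function apply_rules are hypothetical, and the limited path syntax of xml.etree.ElementTree stands in for full XPath.

```python
import xml.etree.ElementTree as ET

# Each rule pairs a predicate on the confidence score with the objects to start
# and the page fields to fill (field name -> path into the recognition result).
RULES = [
    {"test": lambda c: c < 50,
     "start": ["prompt_confirm", "listen_confirm"],
     "assign": {}},
    {"test": lambda c: c >= 50,
     "start": [],
     "assign": {"origin": "city/origin",
                "destination": "city/destination",
                "date": "date"}},
]


def apply_rules(result_xml, page):
    """Evaluate every rule in turn; return the names of the objects to start."""
    result = ET.fromstring(result_xml)
    confidence = float(result.get("confidence"))
    started = []
    for rule in RULES:
        if rule["test"](confidence):
            started.extend(rule["start"])
            for target, path in rule["assign"].items():
                page[target] = result.findtext(path)
    return started


# Hypothetical recognition result with a high confidence score.
high = ('<result confidence="80"><city><origin>Seattle</origin>'
        '<destination>Boston</destination></city><date>Friday</date></result>')
page = {}
print(apply_rules(high, page), page)
```

With a low-confidence result the same function leaves the page untouched and returns the confirmation prompt and listen objects to be started, which is the behavior the first rule describes.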

3 SALT Extensibility

Naturally spoken language is a modality that can be used in widely diverse environments where user interface constraints and capabilities vary significantly. As a result, it is practical to build into SALT only the speech functions that are universally applicable and implementable. For example, the basic speech input function in SALT deals only with speaker-independent recognition and understanding, even though speaker-dependent recognition or speaker verification is in many cases very useful. Extensibility is therefore crucial to a natural language interface like SALT. SALT follows the XML standards that allow extensions to be introduced on demand without sacrificing document portability. Functions that are not already defined in SALT can be introduced at the component level, or as a new feature of the markup language. In addition, SALT requires that XML standards be followed so that extensions can be identified and the methods for processing them can be discovered and integrated.

3.1 Component extensibility

SALT components can be extended with new functions individually through the component configuration mechanism of the <param> element. For example, the <listen> element has an event to signal when speech is detected in the incoming audio stream, but the standard does not specify an algorithm for detecting speech. A SALT document author can nevertheless declare reference cues so that the document is rendered in a similar way across different processors. The <param> element can be used to set the reference algorithm and threshold for detecting speech in the <listen> object:

    <listen onspeechdetected="handler( )" …>
      <param xmlns:xyz="urn://xyz.edu/algo-1">
        <xyz:method>energy</xyz:method>
        <xyz:threshold>0.7</xyz:threshold>
      </param>
      …
    </listen>

Here the parameters are set using an algorithm whose uniform resource name (URN), xyz.edu/algo-1, is declared as an XML namespace attribute of <param>. The parameters for configuring this specific speech detection method are further specified in the child elements. A document processor can perform a schema translation from the URN namespace into any schema the processor understands. For example, if the processor implements a speech detection algorithm whose detection threshold has a different range, the adjustment can easily be made when the document is parsed.

The same mechanism is used to extend the functionality. For instance, the <listen> object can be used for speaker verification because the algorithm used for verification and its programmatic interfaces have a lot in common with recognition. In SALT, a <listen> object can be extended for speaker verification through configuration parameters:

    <listen onreco="success( )" onerror="na( )" onnoreco="failed( )">
      <param xmlns:v="urn:abc.com/spkrveri">
        <v:cohort>../../data</v:cohort>
        <v:threshold>0.85</v:threshold>
        …
      </param>
      …
    </listen>

In this example, the <listen> object is extended for speaker verification that compares a user’s voice against a cohort set. The events “onreco” and “onnoreco” are raised when the voice passes or fails the test, respectively. As in the previous case, the extension must be decorated with a URN that specifies the behavior intended by the document author. Because it is an extension, the document processor might not be able to discern natively the semantics implied by the URN. However, XML-based protocols allow the processor to query and employ Web services that can either (1) transform the extended document segment into an XML schema the processor understands, or (2) perform the function described by the URN. By closely following XML standards, SALT documents fully enjoy the extensibility and portability benefits of XML.
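The threshold adjustment mentioned above can be sketched in Python as a small translation step applied when the <param> extension is parsed. This is an illustration only: the function read_detection_config is hypothetical, and both the assumption that the document threshold lies in the range 0 to 1 and the processor's native range of 0 to 100 are invented for the example.

```python
import xml.etree.ElementTree as ET

NS = "urn://xyz.edu/algo-1"  # namespace URN from the <param> example above

PARAM = (
    f'<param xmlns:xyz="{NS}">'
    "<xyz:method>energy</xyz:method>"
    "<xyz:threshold>0.7</xyz:threshold>"
    "</param>"
)


def read_detection_config(param_xml, native_max=100.0):
    """Translate the extension parameters into the processor's own settings.

    Assumes the document threshold lies in [0, 1] and the processor expects
    a value in [0, native_max]; both ranges are illustrative assumptions.
    """
    root = ET.fromstring(param_xml)
    method = root.findtext(f"{{{NS}}}method")
    threshold = root.findtext(f"{{{NS}}}threshold")
    if method is None or threshold is None:
        return None  # the extension is not one this processor recognizes
    return {"method": method, "threshold": float(threshold) * native_max}


config = read_detection_config(PARAM)
print(config["method"], round(config["threshold"], 6))
```

Because the parameters are qualified by a namespace, a processor that does not recognize the URN can detect this (the None case here) and fall back to its default behavior or look for a translation service.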

3.2 Language extensibility

In addition to component extensibility, the SALT language as a whole can be enhanced with new functionality using XML. Communication with other modalities and input devices, and advanced discourse and context management, are just a few potential uses of language-level extension. The SALT message extension, the <smex> element, is the standard conduit between a SALT document and the outside world. The message element takes <param> as its child element to forge a connection to an external component. Once a link is established, the SALT document can communicate with the external component by exchanging text messages. The <smex> element has a “sent” attribute to which a text message can be assigned. When its value is changed, the new value is regarded as a message intended for the external component and is immediately dispatched. When an incoming message is received, the object places the message on a property called “received” and raises the “onreceive” event.
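The message exchange model just described can be approximated by a small Python class. This is a sketch of the behavior, not of any SALT implementation: the class name Smex, the transport callable, and the deliver method (which stands in for a message arriving from the external component) are assumptions introduced for illustration.

```python
class Smex:
    """Toy model of <smex>: assigning `sent` dispatches a message, and an
    incoming message is stored in `received` before `onreceive` is raised."""

    def __init__(self, transport, onreceive=None):
        self._transport = transport  # callable that carries text to the component
        self._sent = None
        self.received = None
        self.onreceive = onreceive   # handler invoked with this object

    @property
    def sent(self):
        return self._sent

    @sent.setter
    def sent(self, message):
        # A change of value is treated as an outgoing message and dispatched at once.
        self._sent = message
        self._transport(message)

    def deliver(self, message):
        """Called when the external component sends a message to the page."""
        self.received = message
        if self.onreceive is not None:
            self.onreceive(self)


outbox, inbox = [], []
link = Smex(transport=outbox.append, onreceive=lambda s: inbox.append(s.received))
link.sent = "<hello/>"   # dispatched immediately to the external component
link.deliver("<ack/>")   # simulated reply; raises onreceive
print(outbox, inbox)
```

Modelling "sent" as a property setter captures the key design point: the page never calls a send function, it only assigns a value, so the same declarative <bind> mechanism used for page fields can also drive external components.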

3.2.1 Telephony Interface

Telephones are among the most important access devices for spoken-language-enabled Web applications. Call control functions play a central role in a telephony SALT application. The <smex> element in SALT is a natural match for the telephony call control standard known as ECMA 323. ECMA 323 defines the standard XML schemas for messages of telephony functions, ranging from simple call answering, disconnecting, and transferring to switching functionality suitable for large call centers. ECMA 323 employs the same message exchange model as the design of <smex>. This allows a SALT application to tap into a rich set of telephony call controls without needing a complicated SALT processor. As shown in (SALT 2002), sophisticated telephony applications can be authored in SALT in a very straightforward manner.

3.2.2 Advanced Context Management

In addition to devices such as telephones, the SALT <smex> object may also be employed to connect to Web services or other software components to facilitate advanced discourse semantic analysis and context management. Such capabilities, as described in (Wang 2001), empower the user to realize the full potential of interacting with computers in natural language. For example, the user can simply say “Show driving directions to my next meeting” without having to explicitly and tediously instruct the computer to first look up the next meeting, obtain its location, and copy the location to another program that maps out a driving route. The customized semantic analysis can be achieved in SALT as follows:

    <listen id="first_pass" …>
      <grammar src="…" /> <!-- basic grammar -->
      <bind targetElement="contextManager" targetAttribute="sent" value="/" />
      …
    </listen>
    <smex id="contextManager" …>
      <param xmlns:ws="WebServices">
        <ws:url>http://personal.com</ws:url>
        <ws:user>…</ws:user>
        …
      </param>
      <bind targetElement="realTarget" … />
    </smex>

Here the <listen> element includes a rudimentary grammar to analyze the basic sentence structure of the user’s utterance. Instead of resolving every reference (e.g., “my next meeting”), the result is cascaded to the <smex> element linked to a Web service specializing in resolving personal references.
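The cascade can be pictured with the following Python sketch, in which a first pass recognizes only the command structure and a second stage resolves the personal reference. Everything here is hypothetical: the calendar data, the dictionary layout of the semantic object, and the functions first_pass and context_manager merely stand in for the grammar and the Web service, whose actual interfaces are not given in the text.

```python
# Hypothetical personal data held by the external service.
CALENDAR = {"my next meeting": {"title": "Budget review", "location": "Building 7"}}


def first_pass(utterance):
    """Rudimentary grammar: recognize the command, leave the reference unresolved."""
    prefix = "show driving directions to "
    text = utterance.lower()
    if text.startswith(prefix):
        return {"type": "DrivingDirections",
                "destination": {"ref": text[len(prefix):]}}
    return None


def context_manager(semantic_object):
    """Stand-in for the service behind <smex>: resolve personal references."""
    entry = CALENDAR.get(semantic_object["destination"].get("ref"))
    if entry is not None:
        # Return a copy whose destination is the resolved location.
        semantic_object = dict(semantic_object,
                               destination={"location": entry["location"]})
    return semantic_object


result = context_manager(first_pass("Show driving directions to my next meeting"))
print(result)
```

The division of labor keeps the speech grammar small and generic, while knowledge that is private to the user stays with the component that specializes in it.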

Summary

In this paper, we describe the design philosophy of SALT in using the existing Web architecture for distributed multimodal dialog. SALT allows flexible and powerful dialog management by taking full advantage of the well-publicized benefits of XML, such as the separation of data from presentation. In addition, XML extensibility allows new functions to be introduced as needed without sacrificing document portability. SALT uses this mechanism to accommodate diverse Web access devices and advanced dialog management.

References

Sadek M.D., Bretier P., Panaget F. (1997) ARTIMIS: Natural dialog meets rational agency, in Proc. IJCAI-97, Japan.

Allen J.F. (1995) Natural language understanding, 2nd Edition, Benjamin-Cummings, Redwood City, CA.

Cohen P.R., Morgan J., Pollack M.E. (1989) Intentions in communications, MIT Press, Cambridge, MA.

SALT Forum (2002) Speech Application Language Tags Specification, http://www.saltforum.org.

Wang K. (2000) Implementation of a multimodal dialog system using extensible markup language, in Proc. ICSLP-2000, Beijing, China.

Aron B. (1991) Hyperspeech: navigation in speech-only hypermedia, in Proc. Hypertext-91, San Antonio, TX.

Ly E., Schmandt C., Aron B. (1993) Speech recognition architectures for multimedia environments, in Proc. AVIOS-93, San Jose, CA.

Lau R., Flammia G., Pao C., Zue V. (1994) Webgalaxy: Integrating spoken language and hypertext navigation, in Proc. EuroSpeech-97, Rhodes, Greece.

Seneff S., Hurley E., Lau R., Pao C., Schmid P., Zue V. (1998) Galaxy-II: A reference architecture for conversational system development, in Proc. ICSLP-98, Sydney, Australia.

Rudnicky A., Xu W. (1999) An agenda-based dialog management architecture for spoken language systems, in Proc. ASRU-99, Keystone, CO.

Lin B.-S., Wang H.-M., Lee L.-S. (1999) A distributed architecture for cooperative spoken dialog agents with coherent dialog state and history, in Proc. ASRU-99, Keystone, CO.

Bradshaw J.M. (1996) Software Agents, AAAI/MIT Press, Cambridge, MA.

Wang K. (1998) An event driven dialog system, in Proc. ICSLP-98, Sydney, Australia.

Wang K. (2001) Semantic modeling for dialog systems in a pattern recognition framework, in Proc. ASRU 2001, Trento, Italy.

Johnston M., Cohen P.R., McGee D., Oviatt S.L., Pittman J.A., Smith I. (1997) Unification based multimodal integration, in Proc. 35th ACL, Madrid, Spain.

Wang K. (2001) Natural language enabled Web applications, in Proc. 1st NLP and XML Workshop, Tokyo, Japan.


Posters and Project Notes


RDF(S)/XML LINGUISTIC ANNOTATION OF SEMANTIC WEB PAGES

Guadalupe Aguado de Cea
DLACT¹, Fac. Informática, UPM
Madrid, Spain, 28660

Inmaculada Álvarez-de-Mon
DLACT, E.U.I.T. Informáticos, UPM
Madrid, Spain, 28031

Antonio Pareja-Lora
DSIP², Fac. Informática, UCM
Madrid, Spain, 28040

Rosario Plaza-Arteche
DLACT, Fac. Informática, UPM
Madrid, Spain, 28660

ABSTRACT

Although much research on the semantic annotation of web pages has already been done by AI researchers within the Semantic Web initiative, linguistic text annotation, including semantic annotation, was originally developed in Corpus Linguistics, and its results have been somewhat neglected by AI. The purpose of the research presented in this proposal is to show that integrating the results of both fields is not only possible but also highly useful for making Semantic Web pages more machine-readable. A multi-level (possibly multi-purpose and multi-language) annotation model, based on EAGLES standards and Ontological Semantics and implemented with the latest generation of Semantic Web languages (RDF(S)/XML), is being developed to fit the needs of both communities; the present paper focuses on its semantic level.

INTRODUCTION

All of us are by now accustomed to making extensive use of the so-called World Wide Web (WWW), which we might consider a great source of information, accessible through computers but, hitherto, understandable only to human beings. In the beginning, web pages were made by hand and were intended for the exchange of information among human beings. Due to the astonishing growth of Internet use, new technologies emerged and, with them, machine-aided web page generation appeared. Up to that point, the structure and the editing of these pages fitted only human needs, and even this only to some extent. All of these documents contained a huge amount of text, images and even sounds, meaningless to a computer. In this way, they put on the reader the burden of extracting and interpreting the relevant information in them.

¹ Dept. of Languages Applied to Science and Technology.
² Dept. of Computer Systems and Programming.

Currently, web page presentation in the WWW is being handled independently of its content, mainly through the use of XML (Bray et al., 1998) or other resource-oriented languages such as XOL (Karp et al., 1999), SHOE (Luke et al., 2000), OML (Kent, 1998), RDF (Lassila et al., 1999), RDF Schema (Brickley et al., 2000), OIL (Horrocks et al., 2000) or DAML+OIL (Horrocks et al., 2001). But even though the automatic processing of information is being eased, the above-mentioned tasks (relevant information access, extraction and interpretation) cannot yet be wholly performed by computers. Hence, the goal of making the meaning (the semantics) of written texts and web pages explicit to computers, so that they can understand it, is gaining relevance. That is the main pillar sustaining the development of what we understand by Semantic Web: "the conceptual structuring of the web in an explicit machine-readable way" (Berners-Lee et al., 1999). In this context, the semantic annotation of texts makes meaning explicit, and has become a key topic. Thus, great efforts are being devoted to the design and application of models and formalisms for the semantic annotation of web pages to make these documents more machine-readable.

Following the guidelines of the Semantic Web initiative, much research has already been carried out by AI researchers on the semantic annotation of web pages (Luke et al., 2000), (Benjamins et al., 1999), (Motta et al., 1999), (Staab et al., 2000). However, these researchers have somewhat neglected the decades of work and the results obtained in the field of Corpus Linguistics on corpus annotation, not only at the semantic level, but also at other linguistic levels. These other linguistic levels, whilst not being intrinsically semantic, can also add some semantic information and help a computer understand a text or, in our case, web pages.

This paper will show the results of our research on how linguistic annotation can help computers understand the text contained in a document – a Semantic Web document, for example. Special efforts are devoted to finding a way of bringing together and identifying complementarities between the semantic annotation models from AI and the annotations proposed by Corpus Linguistics. As stated in this paper, far from being irreconcilable, they are more than close and may be considered complementary.

This paper is organised as follows: firstly, an introduction to the state of the art in text semantic annotation in Corpus Linguistics is presented (section 1). Secondly, in section 2, some brief notes on the use of ontologies in semantic annotation are sketched. Thirdly, in section 3, an example of the integration of both paradigms (AI’s and Corpus Linguistics’) is presented in the scope of our project goals. The main advantages of this integration are analysed afterwards (section 4) and some conclusions are stated (section 5), followed by the acknowledgments section and, finally, the references.

1. TEXT ANNOTATION IN CORPUS LINGUISTICS

The idea of text annotation was originally developed in Corpus Linguistics. Traditionally, linguists have defined a corpus as "a body of naturally occurring (authentic) language data which can be used as a basis for linguistic research" (Leech, 1997). According to McEnery & Wilson (2001), Corpus Linguistics was first applied to research on language acquisition, to the teaching of a second language, to the elaboration of descriptive grammars, etc. With the arrival of computers, the number of potential studies to which corpora could be applied increased exponentially. So, nowadays, the term corpus is applied to "a body of language material which exists in electronic form, and which may be processed by computer for various purposes such as linguistic research and language engineering" (Leech, 1997). An annotated corpus "may be considered to be a repository of linguistic information [...] made explicit through concrete annotation" (McEnery & Wilson, 2001). The benefit of such annotation is clear: it makes retrieving and analysing information about what is contained in the corpus quicker and easier. In Leech (1997), a list of the different (possible) levels of linguistic annotation can be found. As Leech himself states, for the time being no corpus includes all of them, but only two or, at most, three. Some of them were only at an early stage of conception at the time he wrote his paper. A smaller but more realistic list of annotation levels is included in EAGLES (1996a), namely: lemma, morpho-syntactic, syntactic, semantic and discourse annotation. Standard recommendations on morpho-syntactic and syntactic annotation of corpora can be found in (EAGLES, 1996a) and (EAGLES, 1996b). A complementary list of general criteria that should be considered when elaborating an annotation scheme can be found in one of the results of the EAGLES project work, the Corpus Encoding Standard (CES, 2000); these criteria are being taken into account in the elaboration of our model (Aguado de Cea, 2002). With respect to the earlier and well-known standardization initiative TEI³, all the works mentioned are TEI-compliant. For the sake of brevity, we will focus on semantic annotation henceforth.

As asserted in McEnery & Wilson (2001), two broad types of semantic annotation may be identified:

A. The marking of semantic relationships between items in the text (for example, the agents or patients of particular actions). This type of annotation has scarcely begun to be applied.

B. The marking of semantic features of words in a text, essentially the annotation of word senses in one form or another. This trend has a considerably longer history, but there is no universal agreement in semantics about which features of words should be annotated⁴.

Although some preliminary recommendations on lexical semantic encoding have already been posited (EAGLES, 1999), no EAGLES semantic corpus annotation standard has yet been published; nevertheless, for the second type of semantic annotation mentioned, a set of reference criteria has been proposed by Schmidt and mentioned in Wilson & Thomas (1997) for choosing or devising a corpus semantic field⁵ annotation system. These criteria can be summarized as follows⁶:

1. It should make sense in linguistic or psycholinguistic terms.
2. It should be able to account exhaustively for the vocabulary in the corpus, not just for a part of it.
3. It should be sufficiently flexible.
4. It should operate at an appropriate level of granularity (or delicacy of detail).
5. It should, where appropriate, possess a hierarchical structure.
6. It should conform to a standard, if one exists⁷.

³ http://etext.virginia.edu/TEI.html
⁴ See, for example, the controversies within the SENSEVAL initiative meetings (Kilgarriff, 1998), (Kilgarriff & Rosenzweig, 2000).

2. ONTOLOGIES AND SEMANTIC WEB ANNOTATIONS

AI researchers have found in ontologies (Gruber, 1993), (Guarino et al., 1995), (Studer et al., 1998) the ideal knowledge model to formally describe web resources and their vocabulary and, hence, to make explicit in some way the underlying meaning of the concepts included in web pages. With Ontological Semantics (Niremburg & Raskin, 2001) as a support theory⁸, the annotation of these web resources with ontological information should allow intelligent access to them, ease searching and browsing within them, and enable new web inference approaches based on them. The influential WordNet and EuroWordNet (Fellbaum, 2001) ontologies should be mentioned as valuable resources for this purpose. Many systems and projects have been developed towards this aim so far: SHOE (Luke et al., 2000) proposes HTML page semantic annotation with a Horn clause-based language also called SHOE; the (KA)² initiative (Benjamins et al., 1999) seeks to annotate HTML documents with ontological information, taking Knowledge Acquisition Community ontologies as a basis; PlanetOnto (Motta et al., 1999) aims at automatically annotating the HTML news pages of an organisation by means of the information obtained from an event-ontology based knowledge base; finally, within the Semantic Community Web Portals project (Staab et al., 2000), an ontology-based architecture for editing and maintaining web portals more easily is being developed. Besides, a number of semantic annotation tools have also been developed so far: COHSE (COHSE, 2002), MnM (Vargas-Vera et al., 2001), OntoMat-Annotizer (OntoMat, 2002), SHOE Knowledge Annotator (SHOE, 2002) and AeroDAML (AeroDAML, 2002).

3. INTEGRATION OF PARADIGMS: AN EXAMPLE

The model here shown, OntoTag, is developed within ContentWeb, a Ministry funded project, which aims at creating an ontology-based platform to enable users to query e-commerce applications by using natural language, performing the automatic retrieval of information from web documents annotated with ontological and linguistic information. Besides, a prototype in the entertainment domain will be developed. ContentWeb objectives can be found in (Aguado de Cea, 2002).

As part of the elaboration of OntoTag, a first exploration phase has been carried out. A short example from this first phase is presented next. It has been implemented in RDF(S), but an XML version was also developed, and the possibility of using any other language has not been ruled out a priori. In the annotation example given below, two different morpho-syntactic tools were applied: Conexor (Conexor, 2002) and MBT (MBT, 2002). Some other tools are being evaluated for further use, and the XML and RDF(S) annotation tools and wrappers are currently being designed.

5 A semantic field (sometimes also called a conceptual field, a semantic domain or a lexical domain) is a theoretical construct which groups together words that are related by virtue of their being connected – at some level of generality – with the same mental concept (Wilson & Thomas, 1997).

6 For a more detailed explanation, see (Aguado de Cea, 2002).

7 Once again the SENSEVAL initiatives must be mentioned: they reveal the demand for semantic standardization in the field of word sense disambiguation (Kilgarriff, 1998), (Kilgarriff & Rosenzweig, 2000).

8 Ontological Semantics (Nirenburg & Raskin, 2001) is a theory of meaning in natural language and an approach to natural language processing (NLP) which uses a constructed world model – the ontology – as the central resource for extracting and representing the meaning of natural language texts, reasoning about knowledge derived from texts, and generating natural language texts based on representations of their meaning.


Figure 1: Morpho-Syntactic Annotation Excerpt.

3.1. RDF(S) EXAMPLE DESCRIPTION

In Figure 1, Figure 2 and Figure 3, we can see the annotation, at the first three levels, of the following Spanish sentence: “Tras cinco años de espera y después de muchas habladurías, llega a nuestras pantallas la película más esperada de los últimos tiempos.”9

At the morpho-syntactic level (Figure 1), every word or lexical token is given a different Uniform Resource Identifier (URI henceforth) and three possible categorisations are included, according to the three different tagsets and systems we want to evaluate. Each tagset has been assigned a different class in the morphAnnot namespace: TradAnnot (CRATER tagset), MBTAnnot (MBT tagset) and ConstrAnnot (Constraint Grammar – CONEXOR FDG tagset). To save space, only the annotation of the article “la” has been included in the figure.
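To make the comparison of the three tagsets concrete, the following minimal Python sketch reads the three annotation records given for the article “la” in Figure 1 and lists the tag and lemma recorded by each system. It is only an illustration: the namespace URIs (other than the standard rdf one), the wrapping rdf:RDF root and the helper names wrap and tags_by_tagset are assumptions, not part of OntoTag.

```python
import xml.etree.ElementTree as ET

# Only the rdf URI is standard; the other namespace URIs are placeholders,
# since the paper does not give the real ones.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "morphAnnot": "http://example.org/morphAnnot#",
    "trad": "http://example.org/trad#",
    "mbt": "http://example.org/mbt#",
    "constr": "http://example.org/constr#",
}

# Annotation records for the article "la", copied from Figure 1
# (the genus, numerus and synfunction elements are left out for brevity).
BODY = """
<morphAnnot:TradAnnot rdf:ID="trad_ann_info_1_16">
  <trad:tag> ARTDFS </trad:tag>
  <morphAnnot:lemma> el </morphAnnot:lemma>
</morphAnnot:TradAnnot>
<morphAnnot:MBTAnnot rdf:ID="mbt_ann_info_1_16">
  <mbt:tag> TDFS0 </mbt:tag>
  <morphAnnot:lemma> el </morphAnnot:lemma>
</morphAnnot:MBTAnnot>
<morphAnnot:ConstrAnnot rdf:ID="constr_ann_info_1_16">
  <constr:tag> DET </constr:tag>
  <morphAnnot:lemma>la</morphAnnot:lemma>
</morphAnnot:ConstrAnnot>
"""


def wrap(body):
    """Wrap a fragment in an rdf:RDF root that declares every prefix."""
    decls = " ".join(f'xmlns:{prefix}="{uri}"' for prefix, uri in NS.items())
    return f"<rdf:RDF {decls}>{body}</rdf:RDF>"


def tags_by_tagset(xml_text):
    """Map each annotation class to the (tag, lemma) pair it records."""
    root = ET.fromstring(xml_text)
    result = {}
    for cls, prefix in (("TradAnnot", "trad"), ("MBTAnnot", "mbt"),
                        ("ConstrAnnot", "constr")):
        for node in root.findall(f"morphAnnot:{cls}", NS):
            tag = node.find(f"{prefix}:tag", NS).text.strip()
            lemma = node.find("morphAnnot:lemma", NS).text.strip()
            result[cls] = (tag, lemma)
    return result


for cls, (tag, lemma) in tags_by_tagset(wrap(BODY)).items():
    print(cls, tag, lemma)
```

Keeping the three categorisations side by side makes disagreements between the tools easy to spot: in the excerpt, the CRATER and MBT records give the lemma “el”, while the Constraint Grammar record gives “la”.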

At the syntactic level (Figure 2), every syntactic relationship between morpho-syntactic items is given a new URI, so that it can be referenced in higher-level relationships or by other levels of the annotation model (e.g. <synAnnot:Chunk rdf:ID="1_510">). Again to save space, only the annotation of the phrase “la película más esperada de los últimos tiempos” has been included in the figure.

At the semantic level (see Figure 3), some components of the lower-level annotations are annotated with semantic references to the concepts, attributes and relationships determined by our (domain) ontology, implemented in DAML+OIL.

<contentWeb:FilmReview>
  <contentWeb:text>Tras cinco años de espera y después de muchas habladurías, llega a nuestras pantallas la película más esperada de los últimos tiempos.</contentWeb:text>
</contentWeb:FilmReview>

<!-- Morpho-syntactic annotation excerpt -->
<morphAnnot:Word rdf:ID="1_16">
  <morphAnnot:surface_form>la</morphAnnot:surface_form>
  <morphAnnot:TradAnnot rdf:about="#trad_ann_info_1_16"/>
  <morphAnnot:MBTAnnot rdf:about="#mbt_ann_info_1_16"/>
  <morphAnnot:ConstrAnnot rdf:about="#constr_ann_info_1_16"/>
</morphAnnot:Word>
<morphAnnot:TradAnnot rdf:ID="trad_ann_info_1_16">
  <trad:tag> ARTDFS </trad:tag>
  <morphAnnot:lemma> el </morphAnnot:lemma>
</morphAnnot:TradAnnot>
<morphAnnot:MBTAnnot rdf:ID="mbt_ann_info_1_16">
  <mbt:tag> TDFS0 </mbt:tag>
  <morphAnnot:lemma> el </morphAnnot:lemma>
</morphAnnot:MBTAnnot>
<morphAnnot:ConstrAnnot rdf:ID="constr_ann_info_1_16">
  <constr:tag> DET </constr:tag>
  <constr:genus>FEM</constr:genus>
  <constr:numerus>SG</constr:numerus>
  <morphAnnot:lemma>la</morphAnnot:lemma>
  <constr:synfunction>DN&gt;</constr:synfunction>
</morphAnnot:ConstrAnnot>

<!-- Syntactic annotation excerpt -->
<synAnnot:Chunk rdf:ID="1_510">
  <synAnnot:synfunction>NP</synAnnot:synfunction>
  <synAnnot:hasChild rdf:about="#1_21">los</synAnnot:hasChild>
  <synAnnot:hasChild rdf:about="#1_22">últimos</synAnnot:hasChild>
  <synAnnot:hasChild rdf:about="#1_23">tiempos</synAnnot:hasChild>
</synAnnot:Chunk>
<synAnnot:Chunk rdf:ID="1_511">
  <synAnnot:synfunction>PP</synAnnot:synfunction>
  <synAnnot:hasChild rdf:about="#1_20">de</synAnnot:hasChild>
  <synAnnot:hasChild rdf:about="#1_510"> los últimos tiempos </synAnnot:hasChild>
</synAnnot:Chunk>
<synAnnot:Chunk rdf:ID="1_512">
  <synAnnot:synfunction>AdjP</synAnnot:synfunction>
  <synAnnot:hasChild rdf:about="#1_18">más</synAnnot:hasChild>
  <synAnnot:hasChild rdf:about="#1_19">esperada</synAnnot:hasChild>
  <synAnnot:hasChild rdf:about="#1_511">de los últimos tiempos </synAnnot:hasChild>
</synAnnot:Chunk>
<synAnnot:Chunk rdf:ID="1_513">
  <synAnnot:synfunction>NP</synAnnot:synfunction>
  <synAnnot:hasChild rdf:about="#1_16">la</synAnnot:hasChild>
  <synAnnot:hasChild rdf:about="#1_17">película</synAnnot:hasChild>
  <synAnnot:hasChild rdf:about="#1_512">más esperada de los últimos tiempos </synAnnot:hasChild>
</synAnnot:Chunk>

Figure 2: Syntactic Annotation Excerpt.

9 After five years of waiting and much gossip, the most awaited film of recent times arrives on our screens.
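Because every chunk in the syntactic excerpt has its own identifier and refers to its children by identifier, a higher-level chunk can be expanded back into its tokens by following the hasChild references. The Python sketch below illustrates this with the four chunks of the syntactic excerpt; the synAnnot namespace URI, the wrapping rdf:RDF root and the function names are hypothetical, and the code simply treats any reference that is not itself a chunk as a token.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SYN = "http://example.org/synAnnot#"  # placeholder URI, not given in the paper

# The four chunks of the syntactic annotation excerpt.
DOC = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:synAnnot="{SYN}">
  <synAnnot:Chunk rdf:ID="1_510">
    <synAnnot:synfunction>NP</synAnnot:synfunction>
    <synAnnot:hasChild rdf:about="#1_21">los</synAnnot:hasChild>
    <synAnnot:hasChild rdf:about="#1_22">últimos</synAnnot:hasChild>
    <synAnnot:hasChild rdf:about="#1_23">tiempos</synAnnot:hasChild>
  </synAnnot:Chunk>
  <synAnnot:Chunk rdf:ID="1_511">
    <synAnnot:synfunction>PP</synAnnot:synfunction>
    <synAnnot:hasChild rdf:about="#1_20">de</synAnnot:hasChild>
    <synAnnot:hasChild rdf:about="#1_510"> los últimos tiempos </synAnnot:hasChild>
  </synAnnot:Chunk>
  <synAnnot:Chunk rdf:ID="1_512">
    <synAnnot:synfunction>AdjP</synAnnot:synfunction>
    <synAnnot:hasChild rdf:about="#1_18">más</synAnnot:hasChild>
    <synAnnot:hasChild rdf:about="#1_19">esperada</synAnnot:hasChild>
    <synAnnot:hasChild rdf:about="#1_511">de los últimos tiempos </synAnnot:hasChild>
  </synAnnot:Chunk>
  <synAnnot:Chunk rdf:ID="1_513">
    <synAnnot:synfunction>NP</synAnnot:synfunction>
    <synAnnot:hasChild rdf:about="#1_16">la</synAnnot:hasChild>
    <synAnnot:hasChild rdf:about="#1_17">película</synAnnot:hasChild>
    <synAnnot:hasChild rdf:about="#1_512">más esperada de los últimos tiempos </synAnnot:hasChild>
  </synAnnot:Chunk>
</rdf:RDF>"""


def load_chunks(xml_text):
    """Return {chunk id: [(child id, child text), ...]} in document order."""
    root = ET.fromstring(xml_text)
    chunks = {}
    for chunk in root.findall(f"{{{SYN}}}Chunk"):
        chunk_id = chunk.get(f"{{{RDF}}}ID")
        chunks[chunk_id] = [
            (child.get(f"{{{RDF}}}about").lstrip("#"), child.text.strip())
            for child in chunk.findall(f"{{{SYN}}}hasChild")
        ]
    return chunks


def flatten(chunks, chunk_id):
    """Expand a chunk into its (token id, surface form) leaves."""
    tokens = []
    for ref, text in chunks[chunk_id]:
        if ref in chunks:          # the child is itself a chunk: recurse
            tokens.extend(flatten(chunks, ref))
        else:                      # the child is a word token
            tokens.append((ref, text))
    return tokens


chunks = load_chunks(DOC)
print(" ".join(text for _, text in flatten(chunks, "1_513")))
```

Expanding chunk 1_513 in this way recovers the phrase “la película más esperada de los últimos tiempos” together with the identifiers of its tokens, which is what allows other annotation levels to refer to the whole phrase through a single URI.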


<!-- Semantic annotation excerpt -->
<onto:PremiereEvent rdf:ID="_anon27">
  <semSynAnnot:includes rdf:about="#1_13">llega</semSynAnnot:includes>
  <semSynAnnot:includes rdf:about="#1_509">a nuestras pantallas</semSynAnnot:includes>
  <onto:hasFilm rdf:about="#_anon30"/>
</onto:PremiereEvent>
<onto:Film rdf:ID="_anon30">
  <semAnnot:includes rdf:about="#1_18">película</semAnnot:includes>
  <onto:comment rdf:about="#_anon40"/>
  <onto:comment rdf:about="#_anon41"/>
</onto:Film>
<onto:ControversialFilm rdf:ID="_anon40">
  <semSynAnnot:includes rdf:about="#1_506">después de muchas habladurías</semSynAnnot:includes>
</onto:ControversialFilm>
<onto:AwaitedFilm rdf:ID="_anon41">
  <semSynAnnot:includes rdf:about="#1_503">Tras cinco años de espera</semSynAnnot:includes>
  <semSynAnnot:includes rdf:about="#1_512">más esperada de los últimos tiempos</semSynAnnot:includes>
</onto:AwaitedFilm>
<onto:Film rdf:about="#_anon30">
  <semSynAnnot:includes rdf:about="#3_507">El Señor de los Anillos</semSynAnnot:includes>
  <onto:filmTitle>El Señor de los Anillos</onto:filmTitle>
</onto:Film>

Figure 3: Semantic Annotation Excerpt.

Further elements susceptible of semantic annotation are being sought, and research is being done towards their determination by the linguist team in our project. The pragmatic counterpart of OntoTag has not yet been tackled at this phase of the project and, thus, this level is not included in the example.

3.2. THE XML DATA MODEL

In our XML data model, every token from the raw text is labelled with a <Word> tag and an RDF URI specified by the attribute rdf:ID. Nested immediately after it, a <surface_form> tag is inserted, introducing the token as it appeared in the source text; then come the morpho-syntactic, syntactic and semantic annotations for this token.

The tagset associated with our morpho-syntactic XML data model (namespace pos) includes the union of the three tagsets defined for CRATER10 (a Spanish POS tagset, TEI and EAGLES conformant, applied also in SonIsa, the tagger developed in our laboratory), MBT (a web-based tagger, also EAGLES conformant) and the morpho-syntactic part of the web-based version of the CONEXOR FDG parser. The tool that produced a particular annotation is tagged with <Traditional>, <MBT> and <Constraint>, respectively. Lemma information is annotated by means of an attribute lemma, associated with the tag of the namespace pos.

The syntactic counterpart of our XML data model (namespace syn) contains, in a TEI conformant manner, only the syntactic information given by FDG at the moment (more tags may be added as the model is refined). This syntactic information covers EAGLES syntactic layers (c) and (d): showing dependency relations and indicating functional labels. Thus, the attributes defined at this level are: dependent_on, which shows the token on which the present one depends (via its rdf:ID); dependency, which describes the kind of dependency between the two; and surface_syn_tag, which denotes the surface syntactic function of the token in a Constraint Grammar approach. We are now studying the best way to cover EAGLES syntactic layers (a) and (b) – bracketing and labelling of segments – from a Constraint Grammar perspective, which is not developed in the aforementioned EAGLES syntactic guidelines.

10 http://arxiv.org/ps/cmp-lg/9406023


The semantic counterpart of our XML data model (namespace sem) is ontology-based and defined by means of the tags given in the DAML+OIL implementation of our domain ontologies.
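As a rough illustration of how a single token might look in the XML data model just described, the Python sketch below builds a <Word> element for the article “la”. The element and attribute names Word, surface_form, Traditional, MBT, Constraint, lemma, dependent_on, dependency and surface_syn_tag come from the description above; everything else is assumed for the sake of the example, in particular the namespace URIs, the syn element name function, the placement of lemma on each pos element, the dependency label "det" and the choice of token 1_17 as head. The semantic part (namespace sem) is omitted.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
POS = "http://example.org/pos#"  # placeholder URI (assumption)
SYN = "http://example.org/syn#"  # placeholder URI (assumption)

# Ask ElementTree to serialise with readable prefixes.
for prefix, uri in (("rdf", RDF), ("pos", POS), ("syn", SYN)):
    ET.register_namespace(prefix, uri)


def make_word(word_id, surface, pos_annotations, head_id, dependency, syn_tag):
    """Build one <Word> element with nested surface form, pos and syn data.

    pos_annotations maps a tool element name (Traditional, MBT, Constraint)
    to a (tag, lemma) pair.
    """
    word = ET.Element("Word", {f"{{{RDF}}}ID": word_id})
    ET.SubElement(word, "surface_form").text = surface
    for tool, (tag, lemma) in pos_annotations.items():
        pos_el = ET.SubElement(word, f"{{{POS}}}{tool}", {"lemma": lemma})
        pos_el.text = tag
    # "function" is a hypothetical element name; only its three attributes
    # are named in the text.
    ET.SubElement(word, f"{{{SYN}}}function", {
        "dependent_on": head_id,
        "dependency": dependency,
        "surface_syn_tag": syn_tag,
    })
    return word


word = make_word(
    "1_16", "la",
    {"Traditional": ("ARTDFS", "el"),
     "MBT": ("TDFS0", "el"),
     "Constraint": ("DET", "la")},
    head_id="1_17", dependency="det", syn_tag="DN>",
)
print(ET.tostring(word, encoding="unicode"))
```

Nesting all levels under one <Word> element keeps every annotation of a token in one place, at the cost of expressing relations between tokens only through identifier references such as dependent_on.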

4. ADVANTAGES OF THE INTEGRATED MODEL

As the example in the previous section shows, AI and Corpus Linguistics, far from being irreconcilable, can be combined into an integrated annotation model. This joint annotation scheme would be valuable in the development of the Semantic Web and would benefit from the results of both disciplines in many ways, not restricted to the semantic level, which is analysed below. A separate subsection is dedicated to multi-functionality.

4.1. AT THE SEMANTIC LEVEL

Let us now see the benefits at the semantic level of a hybrid annotation model, first from a linguistic point of view and, then, from an ontological point of view.

4.1.1. Regarding ontologies from a linguistic point of view

Taking a closer look at sections 1 and 2, and comparing the proposals from both Corpus Linguistics and AI, we find that the use of ontologies as a basis for a semantic annotation scheme fits perfectly and meets the criteria posited by Schmidt. Clearly, its mostly hierarchical structure fulfils by itself criterion (5) and, as a side effect, criteria (2) and (4), since the former is related to the capacity of an ontology to grow horizontally (in breadth) and the latter to the capacity of an ontology to grow vertically (in depth or in specification). Hence, the end user can decide the level of specificity needed. Criterion (3) is also satisfied by an ontology-based semantic annotation scheme, since we can always specialise the concepts in the ontology according to specific periods, languages, registers and textbases. Ontologies are, by definition, consensual and, thus, are closer to becoming a standard than many other models and formalisms or, as criterion (6) requires, at least they lay down a framework of properties and axioms (principles) and major categories that can be modified to some extent to fit individual needs. Concerning criterion (1), quite a lot of groups developing ontologies are characterized by a strong interdisciplinary approach that combines Computer Science, Linguistics and (sometimes) Philosophy; thus, an ontology-based approach should also make sense in linguistic terms.

4.1.2. Regarding linguistic annotations from an ontological point of view

The main obstacle to AI researchers adopting a linguistically motivated annotation model lies in the statement in section 1 that “there is no universal agreement in semantics about which features of words should be annotated”, or in the statement in Schmidt’s criterion 1, in the same section, that “still an exhaustive set of categories is to be determined”.

But ontology researchers are trying to fill this gap with initiatives such as the UNSPSC (UNSPSC, 2002) or RosettaNet (RosettaNet, 2002) in specific domains (e.g. e-commerce). In any case, linguistic annotations at the semantic level are more ambitious and potentially wider than strictly ontology-based ones. Establishing a link between semantic annotation and discourse annotation and text construction following the RST approach, which has already been applied in text generation (Mann & Thomson, 1988), seems a fairly promising linguistic enhancement.

So far, we have seen how ontologies can fit into the semantic annotation of texts; let us see in the next subsection how linguistic annotations at all levels can improve the potential of Semantic Web pages.

4.2. MULTI-FUNCTIONALITY

The need for (shallow) parsing in semantic processing is found in Vargas-Vera et al. (2001) and also in Kietz et al. (2000): most information extraction systems (as well as other NLP applications) use some form of shallow parsing11 to recognise syntactic constructs or, in other words, to syntactically identify some fragments of the sentences. A chunker12 called Marmot is included in the annotation process presented in the former. Even though this need for lower levels of linguistic analysis applies to information extraction systems, it is not restricted to this kind of NLP application. Since the proposed annotation model adds overt linguistic information to any kind of document, it can be used for a wide range of purposes that require linguistic or semantic analysis or processing (e.g. machine-aided translation, information retrieval, etc.).

11 Without generating a complete parse tree for each sentence. Such partial parsing has the advantages of greater speed and robustness.

12 A chunker is a natural language (pre)processing tool that separates and segments sentences into their subconstituents, e.g. noun, verb and prepositional phrases.

5. CONCLUSIONS

We have seen that, even though AI researchers are devoting much effort to finding an optimal model for the semantic annotation of web pages, the decades of work and the results obtained in the field of Corpus Linguistics on corpus annotation have been somewhat neglected. This paper shows the results of the research carried out on how linguistic annotation can help computers understand the text contained in a document – a Semantic Web page – bringing together semantic annotation models from AI and the annotations proposed for every linguistic level from Corpus Linguistics.

The integration of these two approaches (Corpus Linguistics and AI) entails many advantages for language engineering and AI applications. First of all, language resources will be more reusable: many of the projects involving the use of semantically annotated (web) documents must also parse the information to some extent and, prior to that, must somehow determine the grammatical category associated with every word in the document. Introducing the annotation of these two levels into the document, re-using one of the tools already developed for this purpose, prevents the whole process of tokenisation and parsing or chunking from being unnecessarily repeated each time the document is processed, since the stored annotation is reused. Since parsing, for example, is a highly time-consuming task, this gives an additional advantage: reducing the overall Semantic Web page processing time. The second main advantage is that the meaning of a page with explicit semantic annotation can be reinforced by the contribution to meaning provided by all of the linguistic levels; semantic analysis can also benefit from the invaluable work done so far on the development of ontologies as conceptual and consensual models.

However, the main disadvantage lies in the limitations imposed by current technologies: automatically obtaining compact, readable and verifiable pages is a task that is hard to specify and delimit fully, but the work being done in our laboratory tries to shed some light on it.

ACKNOWLEDGEMENTS

The research described in this paper is supported by MCyT (Spanish Ministry of Science and Technology) under the project name: ContentWeb: “PLATAFORMA TECNOLÓGICA PARA LA WEB SEMÁNTICA: ONTOLOGÍAS, ANÁLISIS DE LENGUAJE NATURAL Y COMERCIO ELECTRÓNICO” – TIC2001-2745 ("ContentWeb: Semantic Web Technologic Platform: Ontologies, Natural Language Analysis and E-Business").

We would also like to thank Óscar Corcho, Socorro Bernardos and Mariano Fernández for their help with the ontological aspects of this paper.

REFERENCES

AeroDAML (2002) http://ubot.lockheedmartin.com/ubot/hotdaml/aerodaml.html

Aguado de Cea, G., Álvarez de Mon, I., Gómez-Pérez, A., Pareja-Lora, A., Plaza-Arteche, R. (2002) A Semantic Web Page Linguistic Annotation Model. “AAAI 2002 Workshop: Semantic Web Meets Language Resources”. Edmonton, Alberta, Canada. (To appear)

Benjamins, V.R., Fensel, D., Decker, S., Gómez-Pérez, A. (1999) (KA)2: Building Ontologies for the Internet: a Mid Term Report. IJHCS, International Journal of Human Computer Studies, 51, pp. 687–712.

Berners-Lee, T., Fischetti, M. (1999) Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by its Inventor. Harper. San Francisco.

Bray, T., Paoli, J., Sperberg, C. (1998) http://www.w3.org/TR/REC-xml.

Brickley, D., Guha, R.V. (2000) http://www.w3.org/TR/PR-rdf-schema.

CES (2000) http://www.cs.vassar.edu/CES/

COHSE (2002) http://cohse.semanticweb.org/

Conexor OY (2002) http://www.conexoroy.com/products.htm


EAGLES (1996a) ftp://ftp.ilc.pi.cnr.it/pub/eagles/corpora/annotate.ps.gz

EAGLES (1996b) ftp://ftp.ilc.pi.cnr.it/pub/eagles/corpora/sasg1.ps.gz

EAGLES (1999) http://www.ilc.pi.cnr.it/EAGLES/EAGLESLE.PDF

Fellbaum, C., Palmer, M., Trang Dang, H., Delfs, L., Wolff, S. (2001) Manual and Automatic Semantic Annotation with WordNet. In “Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources: Applications, Customizations”. Carnegie Mellon University, Pittsburg, PA.

Gruber, R. (1993) A Translation Approach To Portable Ontology Specification. Knowledge Acquisition, 5, pp. 199–220.

Guarino, N., Giaretta, P. (1995) Ontologies and Knowledge Bases: Towards a Terminological Clarification. In “Towards Very Large Knowledge Bases: Knowledge Building and Knowledge Sharing”, N. Mars, ed., IOS Press, Amsterdam, pp. 25–32.

Horrocks, I., Fensel, D., Harmelen, F., Decker, S., Erdmann, M., Klein, M. (2000) OIL in a Nutshell. In “12th International Conference in Knowledge Engineering and Knowledge Management, Lecture Notes in Artificial Intelligence”, Springer-Verlag, Berlin, Germany, pp. 1–16.

Horrocks, I., Van Harmelen, F. (2001) http://www.daml.org/2000/12/reference.html

Karp, R., Chaudhri, V., Thomere, J. (1999) http://www.ai.sri.com/~pkarp/xol/xol.html

Kent, R. (1998) Conceptual Knowledge Markup Language (version 0.2). http://sern.ucalgary.ca/KSI/KAW/KAW99/papers/Kent1/CKML.pdf

Kietz, J-U., Maedche, A., Volz, R. (2000) A Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet. In “Proceedings of the EKAW'00 Workshop on Ontologies and Text”, Juan-Les-Pines, France.

Kilgarriff, A. (1998) SENSEVAL: An Exercise in Evaluating Word Sense Disambiguation Programs. In “Proceedings of LREC”, Granada, Spain, pp. 581–588.

Kilgarriff, A. & Rosenzweig, J. (2000) English SENSEVAL: Report and Results. In “Proceedings of LREC”. Athens, Greece.

Lassila, O., Swick, R. (1999) http://www.w3.org/TR/PR-rdf-syntax

Leech, G. (1997) Introducing corpus annotation. In “Corpus Annotation: Linguistic Information from Computer Text Corpora”, R. Garside, G. Leech & A. M. McEnery, ed., Longman, London.

Luke, S., Heflin, J. (2000) http://www.cs.umd.edu/projects/plus/SHOE/spec1.01.htm

Mann, W & Thomson, S. (1988) Rhetorical Structure Theory: Toward a functional theory of text organization. Text Vol.18, 3, pp. 243–281.

MBT (2002) http://ilk.kub.nl/~zavrel/tagtest.html

McEnery, A. M., Wilson, A. (2001) Corpus Linguistics: An Introduction. Edinburgh University Press, Edinburgh.

Motta, E., Buckingham Shum, S., Domingue, J. (1999) Case Studies in Ontology-Driven Document Enrichment. In “Proceedings of the 12th Banff Knowledge Acquisition Workshop”, Banff, Alberta, Canada.

Nirenburg, S. and Raskin, V. (2001) http://crl.nmsu.edu/Staff.pages/Technical/sergei/book/index-book.html

OntoMat (2002) http://annotation.semanticweb.org/ontomat.html

RosettaNet (2002) http://www.rosettanet.org/

SHOE (2002) http://www.cs.umd.edu/projects/plus/SHOE/KnowledgeAnnotator.html

Staab, S., Angele, J., Decker, S., Erdmann, M., Hotho, A., Mädche, A., Schnurr, H.-P., Studer, R. (2000) Semantic Community Web Portals. WWW´9. Amsterdam.

Studer, R., Benjamins, R., Fensel, D. (1998) Knowledge Engineering: Principles and Methods. DKE 25(1-2), pp 161-197.

UNSPSC (2002) http://www.unspsc.org/

Vargas-Vera, M., Motta, E., Domingue, J., Shum, S. B., Lanzoni, M. (2001) Knowledge Extraction by Using an Ontology-based Annotation Tool. In “Proceedings of the K-CAP'01 Workshop on Knowledge Markup and Semantic Annotation”, Victoria B.C., Canada.

Wilson, A., Thomas, J. (1997) Semantic Annotation. In “Corpus Annotation: Linguistic Information from Computer Text Corpora”, R. Garside, G. Leech & A. M. McEnery, ed., Longman, London.


The PAPILLON project: cooperatively building a multilingual lexical data-base to derive open source dictionaries & lexicons

Christian BOITET(1), Mathieu MANGEOT(2) & Gilles SÉRASSET(1)

(1) GETA, CLIPS, IMAG, 385, av. de la bibliothèque, BP 53, F-38041 Grenoble cedex 9, France

[email protected]

(2) National Institute of Informatics (NII), 2-1-2-1314, Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, [email protected]

Abstract

The PAPILLON project aims at creating a cooperative, free, permanent, web-oriented and personalizable environment for the development and the consultation of a multilingual lexical database. The initial motivation is the lack of dictionaries, both for humans and machines, between French and many Asian languages. In particular, although there are large F-J paper usage dictionaries, they are usable only by readers literate in Japanese, as they never contain both original (kanji/kana) and romaji writing. This applies as well to Thai, Vietnamese, Lao, etc.

Introduction

The project was initiated in 2000 and launched with the support of the French Embassy and NII (Tokyo) in July 2000, and really took shape in 2001, with a technical seminar in July 2001 at Grenoble, and concrete work (data gathering, tool building, etc.).

The macrostructure of Papillon is a set of monolingual dictionaries (one for each language) of word senses, called "lexies", linked through a central set of interlingual links, called "axies". This pivot macrostructure has been defined by Sérasset (1994) and experimented with by Blanc (1994) in the PARAX mockup.

The microstructure of the monolingual dictionaries is the "DiCo" structure, which is a simplification of Mel'tchuk's (1981; 1987; 1995) DEC (Explanatory and Combinatory Dictionary) designed by Polguère (2000) & Mel'tchuk to make it possible to construct large, detailed and principled dictionaries in tractable time.

1. Languages included in the project

In 2000, the initial languages of the Papillon project were English, French, Japanese and Thai. Thai was included because there had been a successful project, SAIKAM (Ampornaramveth et al. 1998; 2000), supported by NII and NECTEC, of building a Japanese-Thai lexicon by volunteers on the web. Lao, Vietnamese, and Malay were added in 2001 because of the active interest of labs and individuals.

The star-like macrostructure of Papillon makes it easy to add a new language. Also, the DiCo microstructure of each monolingual dictionary is defined by an XML schema, containing a large common core and a small specialization part (morphosyntactic categories, language usage).

2. Interlingual links

Axies, also called "interlingual acceptions", are not concepts, but simply interlingual links between lexies, motivated by translations found in existing dictionaries or proposed by the contributors.

In case of discrepancies, one axie may be linked with lexies of some languages only, e.g. FR(mur#1), EN(wall#1), RU(stena#1), and to other axies by refinement links:

Example:

Axie#234 --lg--> FR(mur#1), EN(wall#1), RU(stena#1)
         --rf--> Axie#235, Axie#236

Axie#235 --lg--> DE(Wand#2), IT(muro#1), ES(muro#1)

Axie#236 --lg--> DE(Mauer#2), IT(parete#1), ES(pared#1)
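The example above can be read as a small graph: each axie has language links (--lg-->) to lexies and, optionally, refinement links (--rf-->) to other axies. The following Python sketch, which is only one possible reading and not the Papillon implementation, stores the three axies in a dictionary and collects candidate translations of a lexie by looking at its axie, the axies it refines into, and the axies that refine into it. The class name Axie, the field names and the function translations are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Axie:
    # language code -> lexie identifiers (the --lg--> links)
    lexies: dict
    # identifiers of refining axies (the --rf--> links)
    refinements: list = field(default_factory=list)


# The three axies of the example above.
AXIES = {
    234: Axie({"FR": ["mur#1"], "EN": ["wall#1"], "RU": ["stena#1"]}, [235, 236]),
    235: Axie({"DE": ["Wand#2"], "IT": ["muro#1"], "ES": ["muro#1"]}),
    236: Axie({"DE": ["Mauer#2"], "IT": ["parete#1"], "ES": ["pared#1"]}),
}


def translations(axies, lang, lexie, target):
    """List target-language lexies reachable from (lang, lexie).

    A lexie reaches the lexies of its own axie, of the axies that axie
    refines into, and of the axies that refine into it.
    """
    found = []
    for axie_id, axie in axies.items():
        if lexie not in axie.lexies.get(lang, []):
            continue
        related = [axie_id] + axie.refinements
        related += [i for i, a in axies.items() if axie_id in a.refinements]
        for related_id in related:
            for candidate in axies[related_id].lexies.get(target, []):
                if candidate not in found:
                    found.append(candidate)
    return found


print(translations(AXIES, "FR", "mur#1", "DE"))
print(translations(AXIES, "DE", "Mauer#2", "EN"))
```

Under this reading, the French lexie mur#1 yields both German candidates, since Axie#234 is refined by Axie#235 and Axie#236, whereas each German lexie leads back to wall#1 through the refinement link.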


It is also possible to have two axies for the same "concept" at a certain stage in the life of the database, because the monolingual information is not yet detailed enough.

Suppose the level of language (usual, specialized, vulgar, familiar…) is not yet given for FR(maladie#1), FR(affection#2), EN(disease#1), EN(affection#3).

Then we might have, for translational reasons:

Axie#456 --lg--> FR(maladie#1), EN(disease#1)

Axie#457 --lg--> FR(affection#2), EN(affection#3)

Once this information has been added to each of the above four monolingual entries, we may merge the two axies and get:

Axie#500 --lg--> FR(maladie#1, affection#2), EN(disease#1, affection#3)
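The merge just described can be sketched in a few lines of Python. The sketch assumes that an axie is represented as a mapping from language codes to lists of lexies, and that a merge is allowed only once the level of language is known for every lexie involved and is the same for all of them; this condition, the level value "usual" and the function names are illustrative assumptions, not the project's actual rule.

```python
def can_merge(first, second, levels):
    """True if every lexie of both axies has a known, identical level."""
    lexies = [lexie
              for axie in (first, second)
              for group in axie.values()
              for lexie in group]
    known = [levels.get(lexie) for lexie in lexies]
    return None not in known and len(set(known)) == 1


def merge_axies(first, second):
    """Union the language links of two axies, keeping their order."""
    merged = {}
    for axie in (first, second):
        for lang, lexies in axie.items():
            bucket = merged.setdefault(lang, [])
            for lexie in lexies:
                if lexie not in bucket:
                    bucket.append(lexie)
    return merged


axie_456 = {"FR": ["maladie#1"], "EN": ["disease#1"]}
axie_457 = {"FR": ["affection#2"], "EN": ["affection#3"]}

# Hypothetical level-of-language information, as it might be filled in later.
levels = {"maladie#1": "usual", "disease#1": "usual",
          "affection#2": "usual", "affection#3": "usual"}

if can_merge(axie_456, axie_457, levels):
    axie_500 = merge_axies(axie_456, axie_457)
    print(axie_500)
```

With the level information missing for any of the four lexies, can_merge returns False and the two axies stay separate, which corresponds to the intermediate stage of the database described above.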

Axies may also be linked to "external" systems of semantic description. Each axie contains a (possibly empty) list for each such system, and the list of systems is open. The following are included at this stage: UNL UWs (universal words), ONTOS concepts, WordNet synsets, NTT semantic categories.

3. Building the content

3.1. Recuperating existing resources

Building the content of the data base has several aspects. To initiate it, the project starts from open source computerized data, called "raw dictionaries", which may be monolingual (4,000 French DiCo entries from UdM, 10,000 Thai entries from Kasetsart Univ.), bilingual (70,000 Japanese-English entries and 10,000 Japanese-French entries in J. Breen's JDICT XML format, 8,000 Japanese-Thai entries in SAIKAM XML format, 120,000 English-Japanese entries in KDD-KATE LISP format), or multilingual (50,000 French-English-Malay entries in FEM XML format).

3.2. Integrating the data into Papillon

In the second phase, the "raw dictionaries" are transformed into a "lexical soup" in M. Mangeot's (2001) intermediary DML format (an XML schema and namespace). The transformation into almost empty DiCo entries and the creation of axies for the translational equivalences is semi-automatic. A tool has been programmed at NII for that task.
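The automatic part of this transformation can be pictured with the following Python sketch. It is not the NII tool and does not reproduce the DML format: it only assumes that each raw bilingual entry is a (source language, source word, target language, target word) tuple, creates one almost empty lexie per headword (naively numbered as sense #1), and creates one axie per translational equivalence. All names are hypothetical.

```python
def build_soup(raw_entries):
    """Turn raw bilingual entries into skeleton lexies and axies.

    raw_entries: iterable of (src_lang, src_word, tgt_lang, tgt_word).
    Returns (lexies, axies), where lexies maps (lang, word) to a lexie
    identifier and each axie maps language codes to lists of lexies.
    """
    lexies = {}
    axies = []

    def lexie_for(lang, word):
        # Assumption: a single sense per headword until contributors
        # split it, hence the fixed "#1".
        return lexies.setdefault((lang, word), f"{word}#1")

    for src_lang, src_word, tgt_lang, tgt_word in raw_entries:
        axies.append({
            src_lang: [lexie_for(src_lang, src_word)],
            tgt_lang: [lexie_for(tgt_lang, tgt_word)],
        })
    return lexies, axies


raw = [
    ("FR", "mur", "EN", "wall"),
    ("FR", "mur", "DE", "Mauer"),
]
lexies, axies = build_soup(raw)
print(len(lexies), "lexies,", len(axies), "axies")
```

Deciding whether two such axies should later be merged or linked by refinement requires human judgement, which is presumably why the overall process is described as semi-automatic.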

3.3. Enriching the data with contributions

After that, it is hoped that many contributors will "fill in" the missing information. The basis for that third and continuous phase is a server for cooperative building of the data base, where each contributor has his/her own space, so that contributions can be validated and integrated into the DB by a group of experts. Users can establish user groups with specific read and write access rights on their spaces.

4. Consultation of the resulting data

4.1. Online consultation

Consultation is meant to be free for anybody, and open source. Requests produce personalizable views of the data base, the most classical of which are fragments of bilingual dictionaries. However, it is also possible to produce multitarget entries, on the fly and offline. Users (consumers) are encouraged to become contributors. To contribute, one may propose a new word sense, a definition, an example of use, a translation, the translation of an example, a correction, etc., or an annotation on any accessible information: every user can contribute at his or her own level of knowledge.

4.2. Download of entire files

Users can also retrieve files, and can contribute to defining new output formats. The files retrieved can contain structural, content-oriented tags. This open source orientation contrasts with the current usage of allowing users to retrieve files containing only presentation-oriented tags.

4.3. Coverage of the dictionary

An interesting point is that the project wants to cover both general terms and terminological terms.

Another one is that it contains a translation subproject, because definitions, examples, citations, etc. have to be translated into all languages of the collection. For this, the notion of complex lexie, already present to account for lexical collocations such as compound predicates (e.g. "to kick the bucket"), is extended to cover full sentences. Axies relating them are special because they cannot, in general, relate them to external semantic systems such as WordNet. An exception is UNL: the UNL list for an axie may contain one UNL graph, produced automatically, manually, or semi-automatically. This graph may be automatically sent to available UNL "deconverters" to get draft translations.


5. Project organisation

At the current stage, the project has no legal implementation as a foundation, association, company, etc., although many participants have already established official MOUs and other types of agreements on which to base their cooperative work.

There is a steering committee of about 10-12 members, who represent Papillon where they are, and not the converse. There is a set of tasks, and for each task a working group and an advisory committee. One of the tasks is the management of the project. In between, there is a coordinating group containing the heads of the tasks and chaired by the head of the management task.

Sponsors may not donate money to the project, which has no bank account. Rather, they are encouraged to donate data, to assign personnel part-time to the project, and to fund participating organizations and persons as they see fit.

Conclusion

The theoretical frameworks for the whole database, the macrostructure and the microstructure are very well defined. They constitute a solid basis for the implementation.

A lot of open problems still have to be addressed for the Papillon project to be a success. In this respect, the Papillon project appears to be a very interesting experimentation platform for a lot of NLP research, such as data acquisition or human access to lexical data, among others.

All this research will increase the appeal of such a project to Internet users. This appeal is necessary for the project to go on, as it is highly dependent on its users' motivation.

This way, we will be able to provide a very interesting multilingual lexical database that we hope will be useful to many people.

References

Ampornaramveth V., Aizawa A. & Oyama K. (2000) An Internet-based Collaborative Dictionary Development Project: SAIKAM. Proc. of 7th Intl. Workshop on Academic Information Networks and Systems (WAINS'7), Bangkok, 7-8 December 2000, Kasetsart University.

Blanc É., Sérasset G. & Tchéou F. (1994) Designing an Acception-Based Multilingual Lexical Data Base under HyperCard: PARAX. Research Report, GETA, IMAG (UJF & CNRS), Aug. 1994, 10 p.

Connolly, Dan (1997) XML Principles, Tools and Techniques. World Wide Web Journal, Volume 2, Issue 4, Fall 1997, O'Reilly & Associates, 250 p.

Ide, N. & Veronis, J. (1995) Text Encoding Initiative, background and context. Kluwer Academic Publishers, 242 p.

Mangeot-Lerebours M. (2000) Papillon Lexical Database Project: Monolingual Dictionaries & Interlingual Links. Proc. of 7th Workshop on Advanced Information Network and System Pacific Association for Computational Linguistics 1997 Conference (WAINS'7), Bangkok, Thailand, 7-8 December 2000, Kasetsart University, 6 p.

Mangeot-Lerebours M. (2001) Environnements centralisés et distribués pour lexicographes et lexicologues en contexte multilingue. Nouvelle thèse, Université Joseph Fourier (Grenoble I), 27 September 2001, 280 p.

Mel’tchuk I., Clas A. & Polguère A. (1995) Introduction à la lexicologie explicative et combinatoire. AUPELF-UREF/Duculot, Louvain-la-Neuve, 256 p.

Polguère, A. (2000) Towards a theoretically-motivated general public dictionary of semantic derivations and collocations for French. Proc. EURALEX'2000, Stuttgart, pp. 517-527.

Sérasset G. (1994a) Interlingual Lexical Organisation for Multilingual Lexical Databases. Proc. of 15th International Conference on Computational Linguistics, COLING-94, 5-9 Aug. 1994, 6 p.

Sérasset G. (1994b) SUBLIM, un système universel de bases lexicales multilingues; et NADIA, sa spécialisation aux bases lexicales interlingues par acceptions. Nouvelle thèse, UJF (Grenoble 1), Dec. 1994.

Sérasset G. (1997) Le projet NADIA-DEC : vers un dictionnaire explicatif et combinatoire informatisé ? Proc. of La mémoire des mots, 5ème journées scientifiques du réseau LTT, Tunis, 25-27 September 1997, AUPELF•UREF, 7 p.

Sérasset G. & Mangeot-Lerebours M. (2001) Papillon Lexical Database Project: Monolingual Dictionaries & Interlingual Links. Proc. NLPRS'2001, Hitotsubashi Memorial Hall, National Center of Sciences, Tokyo, Japan, 27-30 November 2001, vol 1/1, pp. 119-125.

Tomokiyo M., Mangeot-Lerebours M. & Planas E. (2000) Papillon : a Project of Lexical Database for English, French and Japanese, using Interlingual Links. Proc. of Journées des Sciences et Techniques de l'Ambassade de France au Japon, Tokyo, Japan, 13-14 November 2000, Ambassade de France au Japon, 3 p.

-o-o-o-o-o-o-o-o-o-

Towards a web-based centre on Swedish Language Technology

Petter KARLSTRÖM
Graduate School of Language Technology,
Göteborg University
Box 200
405 30, Göteborg, [email protected]

Robin COOPER
Graduate School of Language Technology,
Göteborg University
Box 200
405 30, Göteborg, [email protected]

Abstract

We present an upcoming web-based centre for information and documentation on language technology (LT) in Sweden and/or in Swedish. Some ways to collect and represent data using data harvesting and XML are suggested. Users providing data may choose one of four input modes, two of which will allow the user to edit existing documents at her own site. Thus we hope to facilitate data entry and inter-site communication, and reduce the amount of stale data. These two entry modes use annotated HTML and an emerging XML application, respectively.

In conclusion, we propose that an XML application in the field of LT would make it easier for the LT community to share information, and that automatic data harvesting will emerge as an important technique for building information centres in the future.

Introduction

The website Slate is an information centre on language technology (LT) developed in Sweden and/or for the Swedish language. It is the technical/academic part of a collaborative effort between the National Graduate School of Language Technology in Sweden (GSLT) and Svenska språknämnden (the official organization for matters concerning the Swedish language). Språknämnden will provide the second part, containing popular information. Some material for the popular part will be taken over from the project Svenska.se at the Swedish Institute for Computer Science, SICS. The two parts will as far as possible have the same structure and cross-reference each other. We will provide information on the following matters, in both parts:

• Research projects.
• Industrial projects.
• Educational programmes.
• Software.
• LT developed in Sweden for other languages than Swedish.
• Swedish organizations and individuals who have an interest in LT.
• Employment opportunities in the area of LT.

Information will be provided in both English and Swedish, depending on whether translated documents exist.

We cooperate via NorDokNet with centres in Denmark, Finland, Iceland and Norway. We also have contact with COLLATE at DFKI and follow their way of organizing information on LT in LT-World. Several more language technologists and LT projects will contribute information, notably the Ph. D. students in GSLT who have agreed to participate in software tests.

1 Rationale

Since the breakthrough of the Internet, and more specifically the World Wide Web (WWW, web), an information centre is commonly viewed as some sort of web-based service. In its simplest form, such a centre consists of a collection of links, manually updated by a webmaster or editor. Larger, more sophisticated websites usually use some kind of database backend in order to separate content from display information, to facilitate searching and to provide simple, site-standardized means for data entry and editing.


One of the fundamental principles of the WWW is the idea of globally reachable documents, as the acronym WWW itself indicates. Given its global nature, the WWW naturally grows to contain a very large amount of data. To resolve the issue of actually finding anything in this vast network, a need for search engines, portals, and information centres has arisen. Our website is a response to such a need for an information centre on language technology in Sweden.

The success of an information centre largely depends on the perceived quality of its contents, and on how easily those contents are found. Textual quality is not what we will discuss in this paper. Suffice it to say that quality depends on, among other things, information being correct, up to date, and comprehensive. For our purposes, this is a data entry/edit issue, which we will address.

1.1 Data entry

As stated above, a common way to manage a website is by means of a database that generates HTML web pages which a user may view in her web browser. The input to the database may be provided by other users, sometimes by means of a web form (Fig. 1).

[Figure 1 diagram; node labels: Language Technologist, Interesting Work, Web Form, database, Website, User]

Fig. 1. An information centre/portal. Arrows symbolize data-flow.

Now, the Language Technologist in this picture is probably more interested in his Interesting Work than in filling out a web form (we hope!). He may already have filled out several such forms, may not see the use of the website in question, or may have done so already and be unwilling to update old information.

Therefore we suggest an alternative way to provide information (Fig. 2). The alternative builds on the notion that people, projects, companies, etc. already have or want to have websites or homepages of their own. Most of the information is thus already available there. We will try to harvest this information by asking the data-providers to do one of four things:

a) Annotate their own HTML website with tags for our data-harvester.

b) Provide information in an XML format compatible with ours.

c) Submit a web form.

d) Contact our web editor for manual entry.

The last two entry modes are a step back to older methods, and are provided as a cautionary measure for two reasons. First, not all data-providers will adopt the new way of data entry. One reason for not adopting it is that our XML application and attributed HTML tags will probably undergo several changes, requiring early adopters to change their documents accordingly. The second reason is pragmatic: we need to build our site before the data-harvesting application and the XML application are finished.

[Figure 2 diagram; node labels: Language Technologist, Interesting Work, Interesting Web page (Annotated HTML/XML), Slatebot, database, Website, Annotated file (XML), Other website, User]

Fig. 2. An information centre utilizing data harvesting. Arrows symbolize data-flow.

Entry modes a) and b) have three benefits:

1. Data entry is facilitated by making use of existing web pages.

2. The amount of stale data is reduced by updating the database as the data-provider updates her own website.

3. The ability to output in XML format. This will allow other websites, for example those in NorDokNet and LT-World, to use their own harvesters on Slate’s material.

2 Data harvesting

We will now describe what documents to be harvested should look like, and how our harvester will work. Understanding this part of the paper requires some knowledge of HTML and XML.

2.1 Document structure

In order for our harvesting engine to retrieve and store documents properly, they need to conform to some standards. Contributions may be in HTML or XML. In the case of HTML, the contributing site will need to annotate existing documents with tags that tell us what is what. We have chosen to use the META and SPAN tags since they may be used as containers for generic meta-information. Use of the META tag is uncontroversial, since it is commonly used for providing data for search engines. The SPAN tag is more commonly used for style information, but may be used to tailor HTML to one's own needs and tastes, according to W3C (1999). Some may still see the use of this tag as controversial, since we use it for semantic, not style, information. We argue that this is not an issue, because the tags will not affect the appearance of the website in question (presuming the information was already there), whether or not users implement a style sheet. Alternatives might have been HTML comments or scripting languages. The drawbacks of these methods are that they may force data-providers to use invalid HTML and that they make it harder to make use of existing information.

As for XML, documents will have to conform to a DTD under development by us in cooperation with NorDokNet. We also have contact with LT-World, in order to resolve compatibility issues with their information centre before they appear. Since websites using XML are not yet commonplace, documents will usually have to be written from scratch. No web browsers currently support XML satisfactorily, so the documents are not yet of much use outside of Slate. Therefore, this method is aimed at early adopters who see the future benefits of XML and wish to participate in that development.

2.2 Examples

An HTML file that we should retrieve and file under people should contain this META tag:

<meta http-equiv="slate" content="people">

somewhere in its head. The document body may contain constructs like this:

I am a <span class="title">Ph.D. student</span> in computational
linguistics at <span class="affiliation">the Department of Computer
and Information Science at Svedala University</span>.

Note that this is a real-life example (somewhat anonymized and edited). The Ph. D. student’s existing web page was annotated with SPAN tags whose class attributes correspond to classes in the Slate database/XML application. If the Ph. D. student changes affiliation, he will probably want to update his own web page. The updated information will be automatically retrieved by SlateBot.
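To make the annotation scheme concrete, the following is a minimal sketch of how a harvester might read the slate META tag and the annotated SPAN tags from such a page. It is written in Python with the standard html.parser module purely for illustration; SlateBot itself is developed in Perl, and this is not its code. The class name SlateAnnotationParser, the function harvest_html and the set KNOWN_CLASSES are hypothetical; only the META tag and the two SPAN classes come from the example above.

```python
from html.parser import HTMLParser

# Hypothetical list of SPAN classes the harvester recognises; the paper's
# example only shows "title" and "affiliation".
KNOWN_CLASSES = {"title", "affiliation"}


class SlateAnnotationParser(HTMLParser):
    """Collects the slate META category and the text of annotated SPANs."""

    def __init__(self):
        super().__init__()
        self.category = None     # content of <meta http-equiv="slate" ...>
        self.fields = {}         # SPAN class -> text found inside the SPAN
        self._open_class = None  # class of the SPAN currently being read
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("http-equiv") or "").lower() == "slate":
            self.category = attrs.get("content")
        elif tag == "span" and attrs.get("class") in KNOWN_CLASSES:
            self._open_class = attrs["class"]
            self._buffer = []

    def handle_data(self, data):
        if self._open_class is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._open_class is not None:
            # Collapse whitespace: page text is often wrapped across lines.
            text = " ".join("".join(self._buffer).split())
            self.fields[self._open_class] = text
            self._open_class = None


def harvest_html(page):
    """Return (category, fields) for one annotated HTML page."""
    parser = SlateAnnotationParser()
    parser.feed(page)
    parser.close()
    return parser.category, parser.fields


if __name__ == "__main__":
    page = """<html><head><meta http-equiv="slate" content="people"></head>
    <body>I am a <span class="title">Ph.D.
    student</span> in computational linguistics at
    <span class="affiliation">the Department of Computer and Information
    Science at Svedala University</span>.</body></html>"""
    print(harvest_html(page))
```

The sketch ignores nested annotated SPANs and keeps only the last occurrence of each class; a real harvester would need a policy for both cases.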

XML files are more rigid since they follow a DTD. Their structure should not be surprising. Being XML documents, they contain a declaration and a few containers:

<?xml version="1.0" encoding="ISO-8859-1" standalone="no" ?>
<!DOCTYPE people SYSTEM "http://slate.gslt.hum.gu.se/people/people.dtd">

<people>
<person>
<first_name>John</first_name>
<last_name>Doe</last_name>
…
</person>
</people>

The reader familiar with XML may note that our structure does not adhere to any standards such as RDF or OLAC. This will be attended to in later versions (see Discussion, below).
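A document of this shape can be read with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module and is an illustration only, not part of SlateBot. The function name people_from_xml is hypothetical, and only the element names shown in the example (people, person, first_name, last_name) are taken from the text. Note that ElementTree does not validate against the DTD and does not fetch it; the sketch only checks the root element.

```python
import xml.etree.ElementTree as ET


def people_from_xml(document):
    """Return one dict per <person>, mapping child element names to text.

    `document` is the raw bytes of the file, so that the encoding named in
    the XML declaration (ISO-8859-1 in the example) is honoured.
    """
    root = ET.fromstring(document)
    if root.tag != "people":
        raise ValueError("expected a <people> document, got <%s>" % root.tag)
    records = []
    for person in root.findall("person"):
        # Each direct child of <person> is treated as one field.
        records.append({child.tag: (child.text or "").strip() for child in person})
    return records


if __name__ == "__main__":
    sample = (
        b'<?xml version="1.0" encoding="ISO-8859-1" standalone="no" ?>\n'
        b'<!DOCTYPE people SYSTEM "http://slate.gslt.hum.gu.se/people/people.dtd">\n'
        b"<people><person><first_name>John</first_name>"
        b"<last_name>Doe</last_name></person></people>"
    )
    print(people_from_xml(sample))
```

Because every child of person is copied as-is, the sketch keeps working if the DTD later gains further elements, at the cost of not detecting misspelled ones.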


2.3 Harvesting engine

The core parts of our data harvesting engine, SlateBot, have been implemented, and alpha tests have commenced using a very small database and XML application listing people.

SlateBot is an application being developed in Perl, in order to accommodate entry modes a) and b) above. The application consists of five main parts, making use of corresponding Perl modules:

• HTML head-parser
• HTML parser
• Link extractor
• XML parser
• Database Interface

For input, it takes a list of links to participating sites. The list is maintained by our webmaster. For each entry in the list, it makes an HTTP GET request, just like a regular web browser. It then checks whether the document in question is written in HTML or XML, and runs either the HTML parsers and the link extractor or the XML parser. The information returned from the parsers is then translated into XML and passed to the database interface, which updates the database.
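The two steps that follow the HTTP request, deciding between HTML and XML and translating the harvested fields into XML, might be sketched as below. This is a Python illustration under stated assumptions, not SlateBot's Perl code. The text does not say how the document type is detected, so the sketch assumes the Content-Type header and a leading XML declaration are used. The function names detect_kind and record_to_xml and the category-to-element mapping are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a category (as in the META tag) to the element
# that wraps one record; the paper only shows people/person.
RECORD_ELEMENT = {"people": "person"}


def detect_kind(content_type, body):
    """Guess whether a fetched document should go to the XML or HTML parsers.

    Assumption: an XML media type or a leading XML declaration means XML;
    everything else is treated as HTML.
    """
    media_type = content_type.split(";")[0].strip().lower()
    if media_type in ("text/xml", "application/xml"):
        return "xml"
    if body.lstrip().startswith("<?xml"):
        return "xml"
    return "html"


def record_to_xml(category, fields):
    """Translate harvested fields into a small XML document (as a string)."""
    root = ET.Element(category)
    record = ET.SubElement(root, RECORD_ELEMENT.get(category, "item"))
    for name, value in fields.items():
        ET.SubElement(record, name).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(detect_kind("text/html; charset=ISO-8859-1", "<html></html>"))
    print(record_to_xml("people", {"title": "Ph.D. student"}))
```

One limitation of this heuristic is that XHTML pages served with an XML declaration would be routed to the XML parser; a production harvester would need an additional check, for example on the root element.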

The reason for having three different HTML parsers (including the link extractor) is, of course, that they do different things. The head parser only parses information in the HTML HEAD part of the document, in order to check whether the document belongs in our database at all, and if so, in what main category. The HTML parser looks for SPAN tags, and returns those that correspond to our categories. Finally, the link extractor searches the document for links to other documents. This is provided for future development, in case we need our harvester to follow links in documents.
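The link extractor is the simplest of the three to illustrate. The following Python sketch, using the standard html.parser and urllib.parse modules, collects the targets of anchor tags and resolves them against the address of the page. It is an illustration only; SlateBot is written in Perl, and the names LinkExtractor and extract_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs of all <a href="..."> links in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own address.
                self.links.append(urljoin(self.base_url, href))


def extract_links(base_url, page):
    """Return the links found in `page`, in document order."""
    extractor = LinkExtractor(base_url)
    extractor.feed(page)
    extractor.close()
    return extractor.links


if __name__ == "__main__":
    html = '<a href="cv.html">CV</a> <a href="http://slate.gslt.hum.gu.se/">Slate</a>'
    print(extract_links("http://example.org/staff/index.html", html))
```

If the harvester were later extended to follow links, the returned list would need filtering (for example by host) so that it does not wander off the participating site.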

The XML parser is simpler, because we can assume that the XML file conforms to our DTD. We simply parse the file to find out which of our categories are present in the document, and send those to the database interface.

2.4 Tests

We have made some preliminary tests of the engine. Ph. D. students from GSLT and a few others were asked to provide information on themselves in any of three of the four entry modes above. Web forms were left out of the first tests, because we needed to be certain that we could delete the database without consequences.

The Ph. D. students at GSLT come from a wide range of backgrounds, so it was expected that some would be more interested in XML than others. Three of the entries we received were in the form of links to HTML files and three were links to XML files. One reply was in the form of an e-mail request for entry. Naturally, the information base is too small to draw any real conclusions concerning user preferences, but it does seem that we are headed in the right direction in providing the different modes.

The tests have helped us iron out a few of the unforeseen bugs that come with all software development, and we are now ready to enlarge the database model.

3 Other related projects and standards

We are not alone in realizing the benefits of meta-data. In fact, there are quite a lot of buzzwords and hype surrounding XML development. These are some projects and standards that we should keep in mind, particularly with respect to data harvesting and future development. Please note that this is in no way a complete list, due to the novelty of the field.

3.1 Semantic Web

The Semantic Web is an attempt to give information on the web meaning, in order to facilitate searching, automation, etc. (W3C, 2001).

3.2 Resource Description Framework

The Resource Description Framework, RDF, is a language for providing metadata to support the Semantic Web. It can be used for describing resources that can be located via a URI or other identifier (W3C, 2001). RDF is developed side by side with the Dublin Core (below), and both standards may be used in one document.


3.3 Dublin Core

The Dublin Core Metadata Element Set (Dublin Core, DC) is, as its name implies, a set of elements for metadata. There are fifteen elements, strictly defined using a set of ten attributes (DCMI, 1999). The elements are mainly used for information or service resources, e.g. bibliographies and card catalogs.

3.4 DocBook

DocBook is an SGML or XML format for technical documentation. It is intended for authoring, and can be converted to other formats for reading (Harold and Means, 2001).

3.5 Open Archives Initiative

The Open Archives Initiative, OAI, is an experimental initiative for efficient dissemination of content. It uses the Dublin Core. Historically, the main intention of OAI was to provide a meta-data language for e-prints, but this has been expanded to related domains as well. There is an OAI protocol for data harvesting. Version 2.0 of that protocol is scheduled for release in June 2002 (OAI, 2001).

3.6 OLAC

OLAC, the Open Language Archives Community, is a community that develops methods for digital archiving of language resources and a network for housing and accessing such services (OLAC, 2001). They use methods very similar to ours, though aimed at language in general rather than language technology. The Dublin Core forms the basis of their meta-data set. Alpha tests have commenced, and the project has recently been launched in Europe (May 2002).

3.7 Web services and SOAP

Web services are a way to share application methods over the Internet, by means of some standard interface such as the Simple Object Access Protocol, SOAP (W3C, 2000). The methods implemented may for example provide access to a site’s data.

4 Discussion and future development

As stated above, our tests were carried out with the help of a rather small group of subjects, and using a small database. In order for us to expand the database and the XML application, we see a need for a more standardized XML for LT. A standard would be of use to all in the community, since it could be used for anything from printed matter (via translation to, for example, LaTeX) to web pages and information centres such as ours. The standard would ideally be developed as a collaborative effort in the LT community, perhaps building on other successful standards such as DocBook or Dublin Core. Coordinating something of that magnitude is a rather large task, and not within the scope of our project. We will take the approach of developing some XML in NorDokNet, while keeping our eyes open towards the world, and in particular staying in contact with LT-World. Informal discussions with representatives from DFKI indicate a growing interest in interchange of data and data harvesting, and a belief that the work of OLAC could prove very useful.

In order to reach data-providers unwilling or unable to edit XML files, an editing tool would probably prove helpful. We will not develop such a tool in the immediate future, since other parts of Slate have higher priority. With some luck, general XML editing tools may appear on the market or, better still, as open source projects. In addition, browser support for XML (with style sheets) would probably encourage users. However, our control of browser development is, of course, nonexistent.

As seen above, there are other, larger projects and standards we should keep in mind and probably adapt to. The XML given in the examples should thus be treated as just examples to demonstrate the capabilities of SlateBot. Emerging global standards will be incorporated as consensus dictates.

Conclusion

Data harvesting seems to be a feasible way of collecting information for an information centre, technically speaking. Our notion that different data-providers will make use of different entry modes seems to be true but needs to be tested more. We believe that XML together with harvesting engines and other access methods can and will change the way we view the web, and that the LT community could and should take advantage of this.

Acknowledgements

Our thanks go to all other participants in NorDokNet and the Ph. D. students of GSLT for providing feedback and testing.

References

Note: all URIs to web pages were tested on 01.07.2002. The documents may have been moved or removed after this date.

DCMI org. (1999) Dublin Core Metadata Element Set, Version 1.1: Reference Description, web-publication, http://dublincore.org/documents/dces/

Harold, E. R. and Means, S. W. (2001) XML In a Nutshell, O’Reilly & Associates

OAI org. (2001) OAI FAQ, web-publication, http://www.openarchives.org/documents/FAQ.html

OLAC org. (2001) OLAC Metadata Set, web-publication, http://www.language-archives.org/OLAC/olacms.html

W3C org. (1999) HTML 4.01 Specification, web-publication, http://www.w3c.org/TR/html401/

W3C org. (2000) Simple Object Access Protocol (SOAP) 1.1, web-publication, http://www.w3.org/TR/SOAP/

W3C org. (2001) Semantic Web Activity Statement, web-publication, http://www.w3.org/2001/sw/Activity

Other websites mentioned in the text

LT-World, http://www.lt-world.org/
NorDokNet, http://www.nordoknet.org/
Slate, http://slate.gslt.hum.gu.se/


��� � ����� � ���������������������������! ��#"$�&%('$)*���&�$+-,.�/"0�1��'2�*��� �$+

3*4�576�8�6:9<;=;>8@?BACED�CGF/H�IKJMLON�PRQTS0U�LGS0VXWZY[CB\RD] PR^XQ_W`H�abQ[cZabLedfF$Shg�ikj�iMCLOl�PmPRQkW`npo�qMabN�PsrtW`H�Q=uvPRNxw

y z 8@?B{E;#|&4&}�?E~�;>8�#�k�B�*�m���`���m�B����� �2���B�����B�m�*���s���B���$�1�B�����p�1�� �&����� ��� �0�m�p�( k�1�B�m�¡�B�k¢E _�p¢B�$�m�p�1�(���_¢��B�_��¢B���B������p�(��£��B�/¤R�k�e¥¦�§���k¢B�[�(���m�(�������s�(�p�(�§�E�¨��©R�������«ª¬������m�§�������­�����®���°¯±�E���*�E�³²@�����7ª_�(´m�p�2 _�����¦µ/¶®²·����_���(�m����©Z�§�(���E _���m k��¸#¹�´k����©R�������º�m�p�1����� �����­�����m£`����B�_�.�B���§�e¥¦�Z»s k���1©����_¢®�B�_�·���_�§�(�¡�k¢³�B�¦���(�Z�¡���k¢E _������(��£¼¤R�k�½¥¦�����k¢B�¾�m�B���B¸¿¹¦´k�¼�(©��������ÁÀÂ���(�����M�E�_���¼��� _�����=���k�� k�@���Ã����µ/¶®²�ªp¥¦´_�¡£1´Z���>�(´_���0�����B�_���b�B�1�*�����©.�B�ĵ¬Åk²�¹Æ�_�(�R£`���(���B�����¿�K�B���ÈÇ$¹¦¶³²&ª�¥¦´_�¡£1´���G�(´k���É�_����������©B���Ê�s©:�(´_�Ë m������ÀÂ�G�_�(�½¥2������¸Ì¹�´k�µ¬¶³²Ì�_���B����£`�� _�����Ä�s©.�(´k�°��©��(�����È���¿�(���_�(�������s������_¢E _�����(��£*�(���1 _£`�( k�(�¾���Z�k���(£`�1���X���7ªÃ���B¢B���(´_���*¥¦�§�(´�k���(�B���¡�¬�B��´k�½¥��(´k��µ�ÅR²R¹º���Í©������(´k�����������B�_���b�B�1����§�����s���<Ç$¹¦¶®²�¸�Î_ k�1�(´k���G���_�K�B�1���p�(�§�E�.�p�X�E _���(´k���©R�������«ª����_£��¡ _�_���k¢*�B�°���s�����1�B£`�(�§�B�*�k���*�E�_�(���1�p�(�§�E��B���(´k�<��©R�������«ª�����©É�M�[�K�E _�m�É�E�Ï�(´k�TÐ[���Ñ�p�ÒkÓsÓ­Ô�Õ�Ö�Ö­×sØsÙ±Ú�ÛOܱÚÞÝm×­Óß_àÞß áZâmãeä�å�æeçXè@é@ê�&��� ���É�B�ë�����m�§���*���s�(�p�(�§�E�h�B�¿Ç2���B�ë�2���§�B���ì&´_�1�B���ÆÅ����1 _£`�( _�(�í�/�1�B�����p� � Ç$ì±Åm���î�B�-�k�`��(£`�1���X������� � ìÃ�E�����p�1�f�B�_�ÆÅk�p¢kª<ï�ðBðOñs��¸ Ç2ìòÅ_��M�E�(�§�(�É�í _�_���K�B�1� �(���_�1�������­�(�p�(���E��B�·��©R�s�(�B£`�(��£�����1 m£`�( k�(�Ï�B�_�í�������B�­�(��£��Ê _�(���_¢î�b���p�( k�(�ó�(���1 _£x��( k�(����ªB¥¦´_��£�´Z�p�1�������(�=�B�m�b���p�( k�(�`���p�B�� k���m�B���1��¸@Î_���O��( k�(�$�����1 _£`�( _�(��� �p�1�$ _�(�������Z�(���_�(���(���­�±�B�����¡���k¢E _������(��£<¤��k�e¥¦�§���k¢B�Bª��X�B�(´î�§�`�k��£��B���B�_�î¢B�1�B�����p�(��£��B�Þª�B�_�!�(´_�����Ä�����1 m£`�( 
k�(���T�p�(�Ê�B�(¢E�B�_�§ô����º´_�§���1�p�1£�´_�õ�£��B���§©®���®�¼�Í©s�M�*´_�§���1�p��£1´s©B¸�¹&©��M���°�K���p�( k�1���(���1 _£x��( k�(���#�B�m�0�(´_�±�Í©s�M��´_�§�����p�1£1´s©��B�M��©0£`���(�(�B�¡�*���p�(´R������p�(��£��B� £`�E�_�����1�B�¡�­�(� � ¯ò�p�(�M���s������ª ï�ðBðBöE��¸Á÷±���§�e¥¥±���_���(£� _�1��´k�½¥¨�(´k���(���_�p�(�������� _£`�( k�(�����p�(�¬�1���_�(�`������s�����Ê�¡�:µ/¶®²Ñ�K�B���«ª��B�_�ø´k�e¥h�(´k�®µ/¶®²î�(���k��(�������s�(�p�(�§�E�_�¿�p�(�Ä�����B�_���b�B�1�*���-���s���óÇ$¹¦¶®²Æ�R���µ�ÅR²R¹¬¸ß_à¡ù ú«û�üsæ¾ý�éXþ½ü�æOÿ�â_ãEü� ¥±���k���m�B�(��� £��¡�§���­�t���(���(�B���¨¢B�1�B�����p�Ï�_���B���§�B�k��*���s�#�*���_���s´_�B�=�/�� _���M���Ã�B���p�����1�B£`�(���B���K���p�( k�1�����

��� �/���¦£`�(�E�1�t���m���p���b�B�1�������k���B�¡�§©¼�B�s©³£��������­�$�p�m�§����Z�1 _�¼�0¥ ���¼�_�(�e¥¦�������(´k�E _�¡�*�M�2�p�m�§�$���� _��������(´k����©R�������«¸

��� �º�B�����½¥¦�-�Z _�§�(���m�§�í _�����1�-��� £`�E�_�k��£`������(´_�³��©R�������È�(�¡�0 _���(�B�k���E _�(�§©B¸ø���R�¡���(���k¢T´s���������£1´_�B�_�¡�(���±�( _£1´¼�B�ò£`�s�B¤R�§���ò£��B�G�X�$ _�����G����1���B�� _�����¦���(�p�����B�_�Á�B�¡�§�½¥Ñ�0 _���(�§�m�§�¬ _��������������m�§�$�¼�(���_¢E�§��¢B�1�B�����p���B�_�°�§�`�R�¡£`�E�°£`�E�m£� k����1���­�(�§©B¸

� Ç$¹¦¶®² � ���Ë���p�(�(��£� _���p�½ªX�(´_�� Ó� ���s×�� �(�p¢s�����¥±�������1 _�§�����:�b�B�G�1���_�(�������s�(���k¢T£`�E�*���§�`�ø�_�p�(��(���1 _£`�( k�(���³�( _£1´î�B�Á�m�p�����[���(�����°�B�_�¨��©��M����b���p�( k�(�2�����1 _£`�( k�1����¸@µ�ÅR²R¹ø�B�����½¥¦���������1�B�§¢E´s�t��b�B�(¥ò�p�1�ë���p�_�����k¢f�M����¥±�������(´k�-���¡�k¢E _�����(�¡£�m�p�(� � ¢B�1�B�����p��ª��§�`�k��£`�E��ª����(£½�����³µ/¶®²�ªX�B�_��(´_���§�=¢B�1�p�m´_��£��B���(���m�(�������s�(�p�(�§�E�*���ZÇ2¹¦¶®²�¸ � ��(��£`�(�§�E��ñ0¥ �$�_���(£� _�1�=�(´_�������p�m�m���k¢������k���(�B���Þ¸

� ����� ?p6�� �Ë{­}�� ~Þ?B6m}�?E4�{­6Î=�§¢E k�(�*ï¬�(´_�½¥¦���(´_���_�½¥Ï�B�=���k�b�B�1���p�(���E�G�(´_�(�E k¢E´�(´k�/��©R�������«ªR����»s _���­�(���B�¡�§©��B�1�k���(���¼�B�±�K�E�����½¥¦��¸��$�t����k¢ �Ä¥±���Ì�_�(�e¥¦������ªZ�Ê _�(���Ë�(���_�_�°���k�b�B�1���p�(�§�E��R���¾´­�����T���Á�¾¥±���T�����(�B���½¸Z¹�´_�Z¥±���<�����1�B���¬�(´k������s�B�B¤B���<�� � � ¯ò�E���*�E���/�p����¥ò��© � �s�����(� �B£`�e��(£`�1���_�&�(´_�p�ò���1�B�m�(���p�����±�(´k�/´­�����¾�(��»� k�����±���*�Z²@������t���`�R�_�(���(�1�§�E��¸<¹�´m�����`�R�_�(���1�(�§�E�.���Z�m�B�(�����·�e�B���G����R£(¤B���$�����(´k��£`�B�1���B�=�(´k���&���-��©��(�����«ª_�(´k��²@���R�¢E _���(�(��£«�&�k¢E�¡�k� � ²>�ò��ª&¥¦´m��£1´ø�¡���T²@�����ø�_�1�B¢B�1�B��(´_�p�Ë�k�����°�B�¡�0���¡�k¢E _�����(�¡£Ë�_�(�R£`���(�(���k¢k¸ � �b�����Ë�(�`�£`���§�R���k¢Ë�(´k�³�����`���_�1���(�(�§�E�.�b�(�E� �(´k�«¥±���Ê�����1�B����ÀÂ�¯�� � �(£`�1�§�_��ª��(´k�¬²>�ø���O�B�� m�p�����ò�(´k�2�`�R�_�(���1�(�§�E�¾�B£x�£`�B�1�_�¡�k¢Ä���É�§�(�°�§�`�R�¡£`�E�ó�B�_�î¢B�1�B�����p�½¸ ¹�´k�·�t��`�R�_�(���(�(���E�Ä£`�E _���Ê�M�³�¿�(��»s _�����G���·�m�p�1�(�®�·�����R������_£`�Bª��:»� k���(©Ê�E�¨�(´k�T�§�`�k��£`�E�� e¢B�1�B�����p��ª/�B�³�£`�E�����B�m�É���. _�M�m�p���®�(´k�[�§�`�k��£`�E�� e¢B���B�����p��ª/�B��k���(£`���§�M���·�¡�:�k���(�B�������ÄÅ���£`�(���E� !�¸Ä¹�´k�«²>�º�(�`��( k�1�m�[�§�(�Ë���O�B�¡ _�p�(�§�E�-���Ì�(´_�:�K�B�1� �B�«�B�!µ¬¶®²�k�R£� _�*���s�����¼�(´k��¯�� � �1£`�1�§�_��ª_¥2´_��£1´³�(´k���Á�����B�_�t��b�B�1���É�(´_��� µ¬¶³²È�k��£� m�*���­�¨�B£�£`�B�1�m���k¢f��� �B�


[Figure diagram; labels: web browser, web server/XSL transformer, language engine, HTTP POST/GET, HTML, XML, s-expression]

ÎÃ��¢E k�(��ï"�±Å�©R������� � �1£�´_�§����£`�( k�(�

µ�ÅR²R¹ ���Í©������(´k������ª�¥¦´_��£�´Ï���1�B�m���K�B�����°�(´k�¿µ¬¶®²���s���ZÇ$¹¦¶®²�¸kÎ=���_�B����©Z�(´m��� Ç$¹¦¶®²T���±�_����������©B���G�E��(´k��£��¡�§���­��¥±���Á�_�1�½¥¦�����½¸

# $&%('*) 8&|,+.- %(' 6�8&}R;#|&~Í8�5 ;0/1 ) ? )2� ?B{­4 }�?B4&{E6 �

Ð[�T´k���(�[�k���1£`�1�§�M�®´k�e¥ë���:�(���_�(���(���­�«�_�p�(�.�( _£�´�B�®�K���p�( k�1�¿�����1 m£`�( k�(���³�B�m�Ï�m�p�1�(�<���(������¸�¹¦´k������_�p�(�*������ _£`�( k�(���ò�p�(�¬ _�����¾�����(���m�(�������s�±�(´_�$¢B���B������p�1��ª��§�`�k��£`�E�_��ª��B�_�«�m´k���B���������1 m£`�( k�(�����B�#���p�1������m´k���B������¸¹�´_��²>�Á�1���( k�1�_�M�(���m�(�������s�(�p�(�§�E�_���B���(´k�����&�(���1 _£x�

�( k�(���·�B�·µ/¶®²�k�R£� _�*���s�(�¿£`�E�k�b�B�1���¡�k¢ó���Ñ�(´k��/¹�� ���ëÎ=�§¢E k�1�ºö�¸ 32�B���-�(´_�p�Ï���B���(©ëµ¬¶®²�k�R£� _�*���s�Ê�M��¢E���m�ø¥¦�§�(´Æ�(´k�Ñ���B�k���§���B���°���§���*���s�4 ×_Û½Ô�5BÝ�Ûp× ¸03¦�B���±�B�������(´k�  _���#�B�R�(´k�7698/�/¹��.�Í©s�M��b�B���(´k� 4 ×sÜ �p�����1���m k���B¸7698R���p�(�¬ _�(���¾���*�(���_�(�������s����_�_�`�����:���p�(�(���B���B�ĵ/¶®²Ì�k��£� m�*���­��¸ÉÐ[�° _����(´k��� ���[�(���_�1�������­�������1 m£`�( k�(�¾�(´_�p�1���_¢®���¿�K���p�( _�(������1 m£`�( k�(���#�B�_�*�����(���_�(�������s���Z _�§�(�§�m��� ���_´_���1�§�(�B�_£`����Z��©��M��´_�§���1�p��£1´_�§����¸ � �b�1�p¢E�*���s�#�B�M�B��µ/¶®²³�k�R£x� _�*���s�¦£`�E�k�b�B�1���¡�k¢*�����(´_���¦�$¹¦�!����¢E�§�B���®���ÁÎ=�§¢p� k�(�;:�¸

<>=@?"A ?"B"?"C D�EGF�HJIGKMLNHMFPOQI�R9E�HMF"SUTVHJW"X�Y[Z]\[F9E^RME�_`Z^aGS�b[c<>=@?"A ?"B"?"C DdI�R9E�H[FeOfR9gGFML^h^RiTV_JZGR9E"jNbMc<>=@?"A ?"B"?"C D&R9gGFML^h^ReOkj"a IGF�l^bMc<>=@?"A ?"B"?"C D._JZ�R9E jmOkj"a I�FGl^bMc<>=@?"A ?"B"?"C D.HJW"XnY�Z]\[F9EGRME�_JZ^aoOkj"a IGFiTVHJW"X�Y[Z]\[F9E^R9En_JZ"a�bMc<>=@?"A ?"B"?"C D�j"a I�F2OfK[X�p"F�_�jnl^bMc<>=@?"A ?"B"?"C D&KMX�p"FG_�jmOrq^F"RMj W^EGF�l"bMc<>=@s D"D^A�t[uMD�j"a I�FdL�RwvNFyx z"s"D"s&{9|"?�}�~�tw|"?"z�c<>=@s D"D^A�t[uMD�j"a I�F�EGF9q�twz.{�twB"�^A�tJ? z�c<>=@?"A ?"B"?"C Dyq^F"R9j9W^EGFeOrx z s"D"s>TrK[X�p"F�_�jnl^b[c<>=@s D"D^A�t[uMDyq^F"R9j9W^EGF�L�RJv�F�x z"s"D"s�{ | ?�}�~�tJ| ?"z�c<>=@s D"D^A�t[uMDyq^F"R9j9W^EGFdEGF q�tJz�{�tJB �^A�tJ?"zGc

Î=�§¢E k�(��ö�� �$¹¦�Ì�K�B�¦²@���k¢E _�¡���(��£$�&�k¢E���_�$�E _���m k�<MEGFGHJI�KMLNH[F^c<GHJW"XnY�Z]\[F9EGRME�_JZ^a�c<Mj a I�FdL�RJvNF9�]�`�NtJ|G����c<9q^F"R9j9W^EGF�L�RwvNF �]�JuM�"CGu9?"B]��c<Mj a I�F�L�RJv�F �]�`DG�M�"����c��<"�[j"a I�F^c

<"�9q^F"RMj W^EGF^c<9q^F"R9j9W^EGF�L�RwvNF �]�`� ���MC���cn��Z]\JE"KN�J<"�Mq"F"R[j W"EGF"c

<"�[j"a I�F^c<"�GHJW"X Z]\[F9EGRME�_JZ^a�c

<"�ME^F�HJI�KML�HMF^c

Î=�§¢E k�(�;:��&µ/¶®²<�E _���m k���b�(�E� ²>�¡�k¢E _�����(�¡£2� �k¢E���k�

��àÞß �7���Ñû� �����1�§�� k�������B�� _��¶³�p������£`��� � � �¬¶³���ó�p�(��£`�E����*�E�_��©G _�����³�B����£`�E�*�m�B£`�¦�(���_�(�������s�(�p�(�§�E�³�B�@�b���O��( k�(�0�����1 _£`�( k�1����¸ � ²���¹ � µ/����©��m�§�0£��B���§��� G�"�2Ú�ÛeÓ��� ¶®�B�_�_���k¢kªRï�ðBðBöE�>£��B���M�� _���������¦���1�B�m���K�B���Ï�����`����( _�B�X���_£`�R�_���k¢0�B�>�B� � �¬¶��¡�­�����Z��©��X���(��� �B���1�1�§�E��( _���(�p�m�§�&�b�B�#�_�1�¡�­�(���_¢k¸=ÅR���������p�1�§©Bªe¥ �� _�(�ò�B�Zµ�ÅR²R¹���Í©��§���1´k�����ò���Z�����B�_���b�B�1�f�B�Gµ¬¶®²Ë���_£`���m���k¢Z�B�>�B�� �/¶<ª#�( _£1´:�B�*Î=�§¢E _�(��:�ª ���­���[Ç$¹¦¶®²É�K�B���R�§��¥�����k¢¼��©®�¼¥ ���[�m�(�½¥¦�(����¸0Î=�§¢E k�1�0ñÁ�(´_�½¥¦�/�( _£�´[�B�Ç$¹¦¶³²7���(���_�k���1��� � �¬¶<¸� �K���p�( k�1�������1 _£`�( k�1�±���½©Z£`�E�s�(�B�����(�`�����s���1�B�_£`����ª

�(´_�p������ªÃ�O�B�¡ k����¥¦´m��£1´¿�p�(�G�(´m�p�(���T��©T�*�B�(�G�(´_�B��E�k�0�K���p�( k�1�B¸2Å����1 _£`�( k�1�`���(´_�p�1���_¢*¥¦�§�(´_���«���K���p�( _�(������1 m£`�( k�(�¾�p�(�«���_�_��£��p�����·���·�(´k�¼µ/¶®²Ï�k�R£� _�*���s� _�(�¡�k¢G�(´_��� Ø �B�_� 4 ×­Ü �p�����1�§�m k������¸Z¹�´_��²@�Ï�E�_��©�����p�M�B�1�p�����³�(´k�¿£`�E�s�����­�(�®�B���ø�(´m�p�(���ó�( _�_¢B�1�p�m´�E�_�§©°�E�_£`�Bª#���p�(¤R���k¢Á����¥¦�§�(´T�&� Ø �p�����1���m k���^�=�(´k��B�(´k���«�R£�£� k�(�(���m£`���¾�B���(´k�T�( _�_¢B�1�p�m´ �p�(�[�(���*���§©���p�(¤B���É��©Ê�B� ���*�m��©Ä�������*���­�«¥2�§�(´Ä�(´_�Ë£`�B�1�(�`����M�E�_�_�¡�k¢ 4 ×sÜ �p�����1���m k���B¸�¹�´k�¬²@�ø�B�M��©��ò�(´_�¬�^�*�£����B��µ/¶®²Ë���M��£�� �m£��p�(�§�E�G�(´_�p�ò��¢E���B�����p�B�� k�2�B�7�(´k�


Î=�§¢E k�(�³ñ]� � �����1�§�� k�������B�� k�°¶³�p���1�§�.�1���_�(�������s�������©¼�k���(�����³Ç$¹¦¶³²<�(�p�m�§����¸

� Ø �p�����1�§�m k���/�p�_�M���p�1� �E�_�§©��E�_£`�¬���G�0�(�¡�k¢E�§�¦µ¬¶®²�k�R£� _�*���s��¸

��à¡ù ¡;¢>æOâ_û�ü�£Mþ½æpè>ãBþOè>æOü�¤±æOüsü�ûÇ$ì±Å_�h£`�E�_�(�¡�k���1���(©��s�(�B£`�(��£¾�����1 _£`�( _�(�����°�M�¦¥( _����B�k�B�(´k���*�Í©s�M�¾�B�/���k�b�B�1���p�(�§�E�·�����B�(���:���.�°��©��M����b���p�( k�(�������� _£`�( k�(�B¸�Å��M��£�� �m£��B���§©Bªk�(´_���_�B k¢E´s�����1���B��¼�m´k�1�B�1�B�Ã�1�§¢E�Ë�p�(���X�E�¡�­�����°���«��©®�¼�K���p�( _�(����©��k���£��B����©®£��B���§���.§¦¨;©«ª­¬¯®¦°²±´³@¸Ã¹�´_�*µ/¶®²Ä�1���_�(�`������s�(�p�(�§�E�:���·�(´k�«��©R������� �����(���������p�1�§©Ë�B�(¢E�B�m�§ô�����¸Ç2�½¥±���B����ª��K�B�G�(´k�³ _�(����ÀÂ�*¢B�1�p�m´m��£��B���_����������©Bª �(´_����b���p�( k�(���±���&´_�B�m�_�§�������M��£����B�¡�§©Z�(���(´_�p� �(´k�¦�m´_�1�B�(�B������1 m£`�( k�(�/���&�1���_��£`�����«���G�(´k�/�(�p�m _���p�ò�����1 _£`�( _�(�2�B��(´k��Ç$¹¦¶®²�¸>Î=�§¢E _�(�µ!«�����B�T�`�R�B�*���§���B���( _£�´T�B�Ç$¹¦¶³²7���(���_�(���(���­�����.�m�p�����G���1���¼�b�B���(´k�«�����s�����_£`�¶�·¹¸fº»w¼ ½`½k¾¿» ¸

��àÀ� ¤²Á0Â�ü�ÃdÄbüsæOâ_æOã ¢ÅÄbüsû²>�¡�k¢E _�����(�¡£:¤��_�½¥¦�§���_¢B� � �M�B�(´�¢B�1�B�����p�(��£��B�«�B�_��§�`�k��£��B�K�Z���:Ç$ì±Åm����*�(���m�(�������s�����:�s©.�Í©s�M���:�b���O��( k�(�¼�����1 _£`�( _�(����¸«¹�´k���Í©s�M���Z�p�(���B�1¢E�B�_�§ô����·´_�§������p�1£�´_��£��B���§©Bª=�B�_�·�0 m�§�(�§�m�§���¡�_´k���1�§�(�B�m£`�������B�¡�§�½¥±���i��(´_�p������ª=�®�1���k¢E�§���Í©s�M�¾´_���R���k¢®�E�k���B�*�*�B�1���M�E�t��(�§���§�«�( k�M���(��©��M����¸ � ��¥¦�§�(´.�m�p�1�(�Á���(������ª±¥ �® _����k���������[Ç$¹¦¶®²ø�(�p�m�����/���«�(���_�(�������s�¬�(´k������´m�§���1�p���£1´m�§����ªM _�(���k¢�£`�p���¡�_�_��£`���2���G�1���_�(�������s�2�Z _�§�(�§���§�����R�´k���1���(�B�_£`�B¸ � �Z¥¦�§�(´ � �/¶®��ª��(´k�¾²>�Ì�M���(�b�B�1���Z��k���_�(´k�Q�_�1���±���1���B�����(�B�M�B�>�(´k�/¢B�1�p��´G���­���_�G�1´_�p�(����( k�_��¢B�1�p�m´_���B�m�����_�m��£��p�����#�(´k�����2¥¦�§�(´�� Ø �B�_� 4 ×sÜ�p�����1�§�� k������¸�Î=�§¢E k�(�´Æ��(´_�½¥¦�$���(���B���7�Í©s�M�0´m�§���1�p���£1´s©«���1�B�_�(�K�B�1�����«�K�1�E�í�(´k��²@��ÀÂ��µ¬¶®²<�E k���� k������k���������³Ç2¹¦¶®²¿�(�p�m�§����¸

[Figure: feature-structure parse tree for "hiro sleeps". The root node vp-2 (PHON "hiro sleeps") has the daughters hiro (PHON "hiro") and sleeps (PHON "sleeps"). Each node carries SYNSEM, LOCAL, CONTENT, RETRIEVED (quantifier-list), QSTORE (quantifier-set) and PHON features. The CONTENT of vp-2 and of sleeps is shared (tag 1) and contains a NUCLEUS of type sleep-relation whose SLEEPER (tag 2: PERSON third-person, NUMBER singular, GENDER masc) is the INDEX of hiro.]

Î=�§¢E k�(�¯!��±ì=�p�1���0���(�����b�B� ¶n·Ç¸ÈºÉ»`¼À½�½r¾¿» �1���_�(�������s�������©¼�k���(�����³Ç$¹¦¶³²<�(�p�m�§���

[Figure: diagram with the nodes HEADED-COMP-PHRASE-2, MARKING-PRINCIPLE-3, T823, T822, VP-3, VP-1, T800, VP-2 and LINEAR-ORDER-1.]

Î=�§¢E k�(�.Æ��ÊÇ$¹¦¶³²��(�p�m m���p�Á�(���_�1�������­�(�p�(���E�Ï�B�����©��M��´_�§�����p�1£1´s©

Ê � ? ��Ë 6 � �&6_6k?�1T6k? ) ~ ËÌ�� �:�*���­�(���E�k���í�p�M�½�B�Bª®�B�fµ�ÅR²R¹ ���Í©��§���1´k�����ø��� _�����f���º���1�B�_���b�B�1� �(´k�ó�(©��������ÁÀÂ�:µ/¶®²��1���_�(�`������s�(�p�(�§�E�í�B�Ë�����k¢E m�����(��£ø�¡�k�K�B�����p�(�§�E�f���s���Ì m��������_�(�e¥¦�(�p�m���<Ç2¹2¶³²&¸2÷±���§�½¥ ¥±�¿�_�¡�(£� _�(�®´k�½¥ �(´k����Í©��§���1´k�������(���_�1�������­�(����©��X���ø�K���p�( k�1���������1 m£`�( k�(����B�_�¨���p�1���T���(������¸ ¹�´m���«����©R�§���(´_�����³���Á�½�O�B���¡�p�m�§�Bª�B�§�E�k¢Ë¥¦���(´¿�(´_�¼�1�������B�¦�(´k�Á���E _�1£`�«£`���k�¾�b�B�*�(´k��&��� ��©R�������«ª��p�¼�(´k���ÎÍ2²î¢E���B���Ê���Ä�(´k� � �s���(�p��_ _£`�(���E��¸Ï7àÞß Ð>ü�âkþeè@æeü&£Xþ½æpè@ãBþeè@æeü�ûÍ����m�(�������s�(���k¢¿�¿�b���p�( k�(�Ë�����1 _£`�( k�1�®���¼�k�E�k�®¥¦�§�(´��¥±�¨����©R�§���(´_�����°�������m���p������ªZ�E�k�·�K�B�Ë�(´k�·�K���p�( _�(������1 m£`�( k�(�Ë�B�N¥���£`�³�§�(�����§� � �(´k� 5^��Ñ�×]ÒpÓ �����*�����p���e��ª�B�_�Ë�E�k���b�B�¬���B£1´<�K���p�( k�1�`���O�B�� _�Z���B�§�¬���°�(´_�p�¬�B�k�¥���£`� � �(´k� ÜR×�EÓGÓ 4 × �������m���p���e��¸Ì÷ �B�(´ �����*�����p������p�(���(´_�½¥¦�¾�M���§�e¥Ï���¾ÎÃ��¢E k�(���ÕÔZ�B�_��Ö�¸&¹�´_�����¬��¥±������*�m�¡�p�������p�(���Z k�( _�B����©³�(��£� k���(�§�B�Bª7�(´_�p�0����ª7�(´k��©


<M×�HMØ0ÙÚjGFJvGInØ R9jGFµvNR9jn_JZG�]�JK[X�p"F�_�j]��c<M×�H9Ø0Ù¹Û"Z�FML

jGF�HJj^�]�wq^F R9j W^EGF>ÜQÝMLGRwvNF9�0ÞÚz"s9~^ß � D"? |^uiÞQà]�Jc�"�"�<"�M×�H9ØiÙ@Û"Z�FML�c<M×�H9Ø0ÙQK9j Z�F9E9Û]\ HMF^c<Mj^RMXnØ F�X�KME^h^F9E^�N�"á"��c<Mj"E�c<Mj"hGc<M×�H9ØiÙÚâGR^Ø[WGF^Y K qyHMF^Ø FG_�j^�]�JÝMj"a I�FN�J�"c

<"�[j^hGc<"�Mj"E�c<M×�H9Ø0ÙQRMI"InØMaGYMjGFJv�IGØ RMj^F�H

HMF"Ø F�_�j^�N�wq^F"R9j9W^EGFN���9c<"�[jGRMXnØ F"c

<"�M×�H9ØiÙÌK9j Z�FME Û]\ HMF"c<"�M×nH9Ø0ÙÚjGFwv�InØ R9j^F^c

Î=�§¢E k�(��Ôn�³µ�ÅR²R¹ �����*�m�¡�p���Á�K�B�G�(´k� 5^��Ñ�×�ÒOÓ ���§�`��*���s�

£��B���m���B£1´¼�B�(´k���&���Z�(��£� k���(�§�B���§©Z�E k���m k�&�(´k�2Ç$¹¦¶®²�b�B���(´k���b���p�( k�(���(���1 _£`�( k�(�B¸Î=�§�1�(��ª0£`�E�_�(�¡�k���®�(´_� 5^��Ñ�×�ÒOÓ �����*�m�¡�p���B¸ � Î_�B�

�k�e¥�ª��§¢E�k�B�(�Z�(´k���������¬�p�/�(´k�0�M��¢E���m�_���k¢]���(´_���$¥¦������M�«�_���(£� _�1�����<�M���§�e¥ ���.����£`�(�§�E�:ñk¸vö�¸ ��¹�´m���Z��������m���p��� � ¥¦���(´_���G�(´k��ã ÛG��Õf5­ÓEÒm× 4nä � ÛB× ���§���*���s���å�_�1���£`�(���p�����T� Ç$¹¦¶³² �(�p�m�§�Bª0�B�m�î�(´_����1�§�����®�E k��ª�B�¬�(´k�*�(�p���§�BÀÂ�æ�_�1���/�(�e¥�ª7�(´k�*�Í©s�M�*�B�±�(´k�*�B�N¥���£`��¸32�`����ª��§�¦�p�m�m���§�����(´_� ÜR×�EÓ�Ó 4 × �������m���p�������¼���B���(©ÜR×�EÓGÓ 4 × ���§���*���s�¦���¾�(´k���B�N¥t��£`��¸32�½¥Ñ¥ ���( k�1�«���G�(´k� ÜR×�EÓGÓ 4 × �����*�m�¡�p���B¸ç32�B���

�(´_�p�����B���1©Z�b���p�( k�(�¦���§���*���s�&���#¥����p�_�M���Z�¡�*�è� Ó 4 �� �(�p�m�§�¼�(�e¥$��Ç$¹¦¶®²É���§���*���s��¸®¹�´m����£`�B�(�1�����M�E�_�_����®�®�����m�p�1�p���¼�(�p�m�§���(�½¥�¥¦�§�(´_�¡�Ë�(´k��� Ó]"���­×N� �B��(´k�Ë�B�N¥���£`�Á��� ¥¦´_��£�´ �(´_���¼�K���p�( _�(�[���¾£`�E�­�(�B���_����¸�&�B£�´ó�(�p���§�T�(�e¥ £`�E�s�(�B���_�Á�Í¥ �É£`�������&�Ä�(´k���_�1��������(´k�«�b���p�( k�(�Á�m�B�*�^���(´k�Á����£`�E�_�Ä�����(´k�¾�K���p�( _�(��p�B�� k�B¸�¹�´_�¿����£`�E�_�Ñ£`�����0���³�M�B�m m���p�����¨��©Ï£��B���õ����k¢É�(´k� 5^�¿Ñs×�ÒOÓ �����*�m���p���:�E�!�(´_� 5"�¿Ñs×�ÒpÓ ���§�`��*���s�®¥2�§�(´_���É�(´_�¡�Á�b���p�( k�(�B¸ Î_�B�°�b���p�( k�(�·�p�B�� k����(´_�p�¬�p�(���p���E����£ � �k�B�¬£`�E�­�(�B�¡�_���k¢G�K���p�( k�1���/�(´k����������§�B���x��ª_�(´k���(�¬�p�(���������p�1�p���������*�m�¡�p�������k�B���(´k�e¥¦�´k���(�B¸ � �����¬�E�����������*�K�B�#�m�(��������©��p�(���(´k���(���m�k���1���k¢�B�#���_�k�`�G�(�p¢E��¸Ï7à¡ù ¡2â_æOû�üy¤±æeü�ü�û� �7�*���s�(�§�E�k�����p�M�½�B�BªBÇ$ì±Å_�: _�����7���b���p�( k�(� £��B���§���� � �¬�/Ç$¹��éÍ/Å°���¾�(���_�(�������s�/��©��s�(�B£`�(��£������1 m£`�( k�(�¥¦�§�(´m���*�b���p�( k�(�¬�����1 _£`�( _�(����ª�¥¦´_��£�´G��� � �B�§�(´k�  _���§©Z�(�`��_�(�R�_ _£`���Ä���ø�(´k�®µ/¶®²Ñ�(´_�p���(´_�®²>��¢B���k���1�p������¸¹�´k�·����©R�§���(´_�����Ë�¡�Á�(�����M�E�_�1�§�m�§�[�b�B�Ë���(���p�(���k¢ �(´k�� � �¬�/Ç$¹��éÍ/ż�K���p�( _�(�����M��£����B���§©��(�Z�(´m�p���0���p�1���

<M×�HMØ0ÙÚjGFJvGInØ R9jGFµvNR9jn_JZG�]�wq"F"R9j W^E^FN��c<Mj"E�c<Mj"hGc<M×�H9Ø0Ù@âGR^Ø[W�F"Y K qyHMF^Ø F�_Jj^�]�JÝMLGRJvNFN��� c

<"�[j^hGc<Mj"hGc<M×�H9Ø0ÙQRMI"InØMaGYMjGFJv�IGØ RMj^F�H

HMF"Ø F�_�j^�N�JKMX�p"FG_�j����"c<"�[j^hGc

<"�Mj"E�c<"�M×nH9Ø0ÙÚjGFwv�InØ R9j^F^c

Î=�§¢E k�(�µÖ��¬µ�ÅR²R¹!�����*�����p���*�K�B���(´k� ÜR×�EÓGÓ 4 × ���§�`��*���s�

���(��� ���[�(���_�_���(���-���-Ç$¹¦¶®² �(�p�m������¸�Î=�§¢E _�(�Êð�(´k�e¥¦���(´k���_�(�R£`���(�(���k¢��B�Ã�(´k��� � �¬�/Ç$¹��éÍ/ž�b���O��( k�(�°¥¦�§�(´m���ø�(´k�Ë�B�N¥���£`�¾�����*�m���p���Bª$�k�E�k�°�s©Ê�(´k�ã Û^�&Õ ä Òm×pÝ ���������(´_�p� ¥ �¦��¢E�k�B�(���������(´k�¦�p�M�½�B�$����£x��(�§�E��¸� �m�p�1���*���(�����¡�2�1���_�k���(���[�B�¬�¾�(�p�m���Z¥2�§�(´Ë��¥±�

�(�e¥¦��¸�¹�´k�����B�®�(�½¥Ì£`�E�_�(�¡���(���B�����1���k¢E�§��£`�����>£`�E�R��(�B���_�¡�k¢G�(´_�Z���p�1���Z���(���BÀÂ���*�B�(´k���½¸�¹�´_���2�(�e¥-£`�E�R��(�B���_�#�¬�1���k¢E�§�ò£`�������(´_�p�����m�B�m� �  _�(���k¢$�(´k� ÒG5��_Û½Ô¿pÝÇ$¹¦¶³²����§���*���s���·�B���¼�B�³�(´k�¨£`�������.�����(´k�É�M�B�t����E� �(�e¥�¸<¹�´k�¼�M�B�����E� �(�½¥ £`�E�s�(�B���_���(´_�¼���p�1������(���BÀÂ�°�_�B k¢E´s�����1��ª¬�E�k�¿�_�B k¢E´s�����³�M���³£`�����Þ¸h¹�´k�µ�ÅR²R¹ ÒG5"Ó�ÝRÓ�êMë �  _�_£`�(�§�E�¿����£��B���§���<���®�k�������������k��(´k���� _�0�X���¬�B���_�B _¢E´­�����1��ª>�B�_�[�(´� _�¬�(´k���_�����k���ÒG5��_Û�Ô�pÝ �O�B�¡ k�¬�K�B�¦�(´k���*�B�(´k���½ÀÂ��£`�����Þ¸Ï7àÀ� Ã�ÄbêÅÄ é@å�ì²í7ãEü�û�ûMÄÇî�ü�ï«âkþeâð �k�G _�����  _�&�B���M��£`�0�B��µ¬ÅR²R¹������(´_�p�Z�m�p�1�B���������1�£��B�Ë�M�0���B�(�����°���«�(´k��µ¬Åk²�¹Ñ���1�B�m���K�B�����p�(�§�E�°���R�¢E���k�/�p�ò�1 _�R���(�¡�*�B¸#¹�´_��� �B���§�½¥2�±£`�E�s���(�E�X�½�B���¦¥¦´_�p��B���M��£`�(�Z�B���(´k�« _�_�_���1�§©R���k¢¾µ/¶®²Ï�(´_�E _���T�M�¼�_���t��m���½©B����¸#¹�´_���&�������*�M�B�(�(�B�s�#¥¦´k�����(´k�¦µ¬¶³²°�k�R£� R��*���s�(�±�X��£`�E���/�(�Z���p�1¢B�$�B�ò���Z�M�$�_�À��£� _�§�������R�§��¥�B�³�Ä�(���k¢E�§�TÇ$¹¦¶®²º���p¢B�B¸ � �(���_¢E�§� � �¬¶ £��B�£`�E�s�(�B���«���*�0 m£1´«���k�b�B�1���p�(���E���(´_�p���§�(�±µ¬¶³²T�(���k��(�������s�(�p�(�§�E��ª@¥¦´k���Ë���1�B�m���K�B���*���[���s���¾Ç$¹¦¶®²ø�M�`�£`�E�*���&���s���¡�p�(¢B������£`�E�­�B���m�§���­�(��©Z�R�§��¥:�E���¬�1���k¢E�§�¥±���¿�m�p¢B���*�(´k�¼ _�������0 m���0�(£`�1�E���&�`�k£`���(�(���B���§©[����R�§��¥��(´_�G���s�(�§�(� � �¬¶<¸ÃÐ[�«£`�E�*�M���_�1�p���G�b�B���(´_�����©��_�(�e�����m���k¢¦�2�m�p�1�B�*���������=���2�(´k�����Í©������(´k�����Ã�(´_�p����M��£��§�b©0�_�B�������B� � �¬¶h�K���p�( k�1�������0´m���k�ò�K�1�E�!�(´k�Ç$¹¦¶³²Ä�E k���m k��¸�¹�´k�* _�(����£��B�T�����§��£`��¥2´_��£1´°�b���O��( k�(���������(´k�½¥ó�B�¦´_���k�$¥¦�§�(´«��¥±���Á�b�B�1�«¸ �$�1�E k�m��B���(���¡�p�����¿�b���p�( k�(����ª&�( _£�´·�B���B��� �b���p�( 
k�(�������_£`�R�R����k¢$�������B�s�(��£±���k�b�B�1���p�(�§�E�0£��B�0�M� ¢B�(�E k�M�����b�B�=�(´k� _�����½ÀÂ��£`�E�­�B���m�§���_£`�Ë�B�_�:�(���b���(�(���:���·�B��ñ��������B�R��(��£��Vò��E�«�(´k�� _�����½ÀÂ��¥ ���®�K�B���«¸


<M×�HMØ0ÙÚjGFJvGInØ R9jGFµvNR9jn_JZG�]�JK[X�p"F�_�j]��c<M×�H9Ø0Ù¹Û"Z�FML

jGF�HJj^�]�wq^F R9j W^EGFÜQÝML�RJv�F �0ÞÌz"sM~�ß �"D"?9|GuiÞQà]�Jc

<Mj^RMXnØ F�X�KME^h^F9E^�N�"á"��c<Mj"E�c<Mj"h�R"Ø�\Jg LG�N�M_MFML^j^F9E���c<M×�H9ØiÙÌR9j"j"E�\wX"W^jGF�L�RJv�F �]�M_MK"ØGHJI�RML���c<[×�H9Ø0ÙÚâ^R^Ø[W�F^Y9K q

HMF^Ø FG_�j^�]�_MK[W"L^j0Orq"F"R9j W^E^F�ÜQÝ[LGRJv�F �UÞÌz sM~�ß9�"D9?"|GuóÞkà^�

KMX�p F�_�j��"�q^F"RMj W^EGF�ÜÚÝML�RJvNF9�0ÞÌô�tw|GuMD>ÞQà�bG�w�"c

<"�M×�HMØ0ÙÌR9j"j EN\wX"W^j^F^c<M×�H9ØiÙÚâGR^Ø[WGF^Y K qyHMF^Ø FG_�j^�]�JÝMj"a I�FN�J�"c

<"�[j^hGc<"�Mj"E�c<Mj"E�c<M×nH9Ø0ÙÌRMI InØMa�YMj^FJv�I�Ø9R9j^F�H

HMF^Ø F�_Jj^��wq^F"R9j9W^EGF�ÜQÝ[L�RJvNF �UÞÌz"s[~�ß9�"D ? |GuóÞkàN��� c

<"�Mj"E�c<"�[jGRMXnØ F"c

<"�M×�H9ØiÙ@Û"Z�FML�c��<"�M×nH9Ø0ÙÚjGFwv�InØ R9j^F^c

Î=�§¢E k�(�°ð��³µ�ÅR²R¹ �����*�m�¡�p���Á�K�B�G�(´k� 5^��Ñ�×�ÒOÓ ���§�`��*���s��ª��1´k�½¥¦�¡�k¢�����£`�(�§�E�G�b�B�±´_�B�m�_�����k¢/�B�N¥���£`�(� �(´_�p��p�(����´k�1�B������¸

¹�´_�G²>�Ì�k�����.�k�B�Z¤R�k�½¥í�p�M�E k���(´_�¼�k���(�B�¡�����B��(´k�� _�����½ÀÂ�#�_�����m�¡��©¬�_�(���b���(���_£`����ª­´k�½¥±���B����ª��B�_�Z�(´_������k�b�B�1£`���$�*���m _���p�1�§�Í©G��©«�����m�p�1�p�(�¡�k¢��(´k�Z _�_�k�����§©­����k¢¦µ/¶®²¾¢B���k���1�p�(�§�E�0�b�(�E�î���(�>�_�1�������­�(�p�(���E�����$�(´k� _�����½¸õ z 8�ö±4#?�÷0ø³4#?^ö±4#? � ö�6m}k~Úùò} ) ?B~�;>8� � �*���s�(�§�E�k�����M���K�B�(�Bªs�(´k�2²@�¿�¡�&���m�(�B¢B�1�B��¥��1�§�t������[���°²>���(�³�(´_�p�¬�����(�����_�¦�E�Ë�¾¹2¯òì �M�B�(�/�B�_�°�(�`�£`���§�B���������`���_�1���(�(�§�E�_��¸ � ���(´k���.�_�(�R£`���(�����Z�(´k�¾�`����_�(���1�(�§�E�³�B�m�Á�(���( k���_�¦�B�®µ/¶®²:�k�R£� _�*���s�¦���¼�(´k�£����§���s��¸�¶Á�B�1�¦�k���(�B�������B�X�(´_���K�B�����p�&�B�M�(´k�����2���k�m k��B�_�Z�E k���m _�Ã�`�R�_�(���1�(�§�E�_�=�p�(�ò¢E�§�B���*���Z�(´_�¡�Ã�(��£`�(�§�E��¸� �B£1´��¡�k�m k�=�t���`�R�_�(���(�(���E�����#�¬�  _�_£`�(�§�E���(´m�p���(´k�

²>� ���p�B�� _�p������¸¬¹�´_���M�E�(�(�§���§�¬�b m�_£`�(�§�E�_�$�p�(�Z¢E�§�B������¾�(´k���K�E�����½¥¦���_¢0�(�p�m���B¸¹�´_���(���p�(���(´k�(�����¡�k�m k�=�`�R�_�(���(�(���E�_�Ã�(´_�p�&£��B���M�

�m�B�(�(���«�����(´k��²@�ú�

� êtÔ� 4 Ûp×üû[ý ºGþ̽VÿP»Jþ@¸`·���� ë ¸ ¹�´_���[£��B _�(���[�(´k�²@�f���:�m�p�1���Ë�(´k�Ë¢E�§�B���¨»� k�B�����Ï�(���1���k¢.�B�_��1���( k�1�_�°�(´k�:�(���1 _�§�(���k¢É�p¢B���_�_� � £`�E�*�m�§�������§©

���p�1�����Ì£`�E�_���(�§�( _���­�(���°�B�m�Ì�(´k�ø£1´_�p�1� � �m�p����(�¡�B���§©����p�1�����«£`�E�_���(���( k���­�(�x��¸

� ê�ÛMÓn���OÒ � × 4 ý ¾G¾]½J¸���Vº ý ��ÿ ¼ º�å½�¸���º ý ��ÿ ë ¸¹¦´k�·²@� �(���( k�1�_�®�(´k�:�����°�B���Í©s�M���°�(´_�p�[����M�B�(´Ê�*�B�(�®�(�X��£��À�m£¾�(´_�B�Ê�(´k�³¢E�§�B���  k�_�M������M�E _�m�Ñ�B�m�-�*�B�(�:¢B���k�����B�*�(´_�B�!�(´k�Ä¢E�§�B������½¥±�������M�E _�_��¸

� ê1ØR×]Ò��n 4 × �EÓn�­Ôm× �ó½���rþ��`¾]½ü» ý ¾]½J¸Jþ��`¾]½w» » ý ���þ��`¾]½J» ë ¸ø¹�´_�Á²>�- k���_�p�������(´_�«��©��X�Á´m�§���1�p���£�´­©³�s©Á£`�(���p�(�¡�k¢¼�¼�k��¥Ñ��©��M��¥¦�§�(´³�(´k�Z¢E�§�B����m�B�*�T�B�_�Ï�(���( _�p�����ó���É�(´_�T��©��M�T´_�§���1�p�1£�´­©�B£�£`�B���_���k¢T���¿�(´k�®�1 k�_�m�������·�1 k�M���(��©��M�����B�_��1 k�_��©��M����¸

¹�´_�·�b�E���§�e¥¦���k¢ �(�p�m�§�:�( m�����p�1�§ô����Ë�(´_�ø���k�m k�t��E k���m _���(�����p�(�§�E�_�1´_�§�m�ò�k���(£`���§�M���¼�¡�¾�(´_��������£`�(�§�E�7¸

����� ��������� ��"! º ý þ ¾ ý þú½�¼ ½�#%$êtÔ¿ 4 Ûp× ½&J¾�¸ ë � Ô� 4 Ûp×N�ê�Û�Ò¿5 ä �pÒ � × 4 þ��`¾]½��'��(�#«½ ë � ÛMÓn���OÒ � × 4 �ê1Ø�×�Ò��n 4 ×)�EÓn�­Ô_× þ��`¾]½*� ��(�#µ½ ë � ÛMÓn���OÒ � × 4 �

+Ë6�/t6_{E6m8 }R6 �÷±�B� ¯ò�p�1�X���s������¸ ï�ðBðBö�¸-, ¶N½/.>º�^·10 º32 , �`¾]½Vÿ

4 ½(^þ ý ¸V½ 5�þ@¸ ý 0[þ ý ¸V½w» �·¹þ¹¶ 6 ¾G¾�¼�·�07(^þ@·@º��]» þQº8 ��· 9:0;(^þ@·Úº��<��("»�½Vÿ>=ç¸3(�#?#@("¸`»;AB.0º�G·�0DC ¸Èº�^¸�(�#E�#´·����F(���ÿHG�º��]»Jþ@¸3(^·I�¿þKJÕ½w»Jº^¼ ý þ@·Úº�� ª��B�E�� _����:Bö�B� G�(�#L�J¸w·@ÿM�n½ , ¸3(N0Mþ¹»Õ·�� , ¶N½�º"¸V½�þ@·10;("¼�G�º�#ξ ý þ̽J¸5�0[·Ú½���0`½ ¸Ä¯ò�B���m�1���k¢B�.�2�_���B���1�(�§�Í©óì��(���1��ªè32��¥O��B�(¤X¸

¯ò´k����� ¶®�B�_�_�¡�k¢k¸ ï�ðBðBö�¸ �$�R£� R��*���s�(�p�(�§�E� �b�B� ���R�«¸Â����©B¸ ´s�����0� ^ e¥�¥¦¥�¸ ���(���`�M¸Â�B£p¸Â k¤� O�����k¢E _�¡���(��£��` £����0�` O���p���`�RñE�����k¢n O�½�����` s¸

¯ò�p�1�QPk¸=ìÃ�E�����p�1�¿�B�_� � �p�B� � ¸#Åk�p¢k¸¼ï�ðBðOñk¸LR ½(Gÿ��S ¸w·IT^½*�UC ¶�¸3("»�½V5¿þ@¸ ý 0Mþ ý ¸È½U=ç¸�(�#E#%(^¸ ¸��2�m�§�B������(����©¼�B��¯�´_��£��p¢B��ì��1���(��ª�¯ò´_�¡£��p¢B�k¸


A Proposal for Screening Inconsistencies in Ontologies based on Query Languages using WSD

Chieko NAKABASAMI
Toyo University

1-1-1 Izumino, Itakura, Oura, Gunma, Japan, 374-0193

[email protected]

Naoyuki NOMURA
Hosei University

2-17-1 Fujimi, Chiyoda-ku, Tokyo, Japan, 102-8160

[email protected]

Abstract

In this paper, we discuss a method for screening inconsistencies in ontologies by applying natural language processing (NLP) techniques, especially those used for word sense disambiguation (WSD). In the database research field, it is claimed that queries over target ontologies should play a significant role because they represent every aspect of the terms described in each ontology. According to (Calvanese et al., 2001), considering the global and the local ontologies, the terms in the global ontology can be viewed as queries over the local ontologies, and the mapping between the global and the local ontologies is given by associating each term in the global ontology with a view. On the other hand, ontology screening systems should be able to take advantage of popular WSD techniques, which decide the right sense of a target word used in a specific context. We present several examples of inconsistencies in ontologies with the aid of DAML+OIL notation (DAML+OIL, 2001), and propose that WSD can be one of the promising methods for screening such inconsistencies.

1 Introduction

In recent years, the semantic web (Berners-Lee et al., 2001) has been evolving as the next-generation web technology and has attracted the attention of many researchers in the database and knowledge engineering communities. In the semantic web, contributions from fields related to databases frequently concern ontology maintenance, reuse, and sharing. Based on results from database research, this paper proposes a method for screening inconsistencies in ontologies by applying natural language processing (NLP) techniques, especially those used for word sense disambiguation (WSD).

Many reference books on WSD are available (e.g., (Manning and Schütze, 1999)). As for ontology integration, several approaches have been proposed (Mitra et al., 2001) (Euzenat, 2001). In (Calvanese et al., 2001), global-centric (also known as global-as-view) and local-centric (also known as local-as-view) approaches to an ontology integration framework are proposed. In this paper, we support the global-as-view approach for screening inconsistencies in ontologies. In addition, (Calvanese et al., 2001) claims that queries over target ontologies should play a significant role because they represent every aspect of the terms described in each ontology. The terms in the global ontology can be viewed as queries over the local ontologies, each of which describes a concept definition. On the other hand, ontology screening systems should be able to take advantage of popular WSD techniques, which decide the right sense of a target word used in a specific context. We claim that WSD is a promising method for determining which local ontology should be used for forming concepts for a global ontology. Section 2 gives a brief introduction to the global-centric approach. There we also discuss the inconsistencies in ontologies caused by words with multiple definitions. As an example, we take the word “bass” and explain how queries based on two of its definitions form concepts in the global ontology. After new concepts are extracted by querying the global ontology, each local ontology is illustrated with DAML+OIL (DAML+OIL, 2001). Section 3 presents several inconsistencies that arise over the global ontology and discusses the use of WSD for resolving them. The final section is the conclusion.


2 Global-as-view approach

In (Calvanese et al., 2001), global-as-view and local-as-view approaches are proposed for ontology integration. According to the global-as-view approach, the mapping between the global and the local ontologies is given by associating each term in the global ontology with a view. Let C be a term in the global ontology G, V a query, expressed in a query language over the terms of the local ontologies, and M the mapping between the global and the local ontologies. Given that D is a local model for the ontology integration system and I a global interpretation for the system, the correspondence between C and V is specified as follows, following (Calvanese et al., 2001):

• 〈C, V, sound〉 if all tuples satisfying V in D satisfy C in I

• 〈C, V, complete〉 if no tuple other than those satisfying V in D satisfies C in I

• 〈C, V, exact〉 if the set of tuples that satisfy C in I is exactly the set of tuples satisfying V in D.
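The three correspondence types can be read as set relations between the tuples that satisfy V in D and the tuples that satisfy C in I. The following is a minimal sketch under that reading; the function name and the sample tuples are illustrative assumptions and do not come from (Calvanese et al., 2001).

```python
def classify_correspondence(view_tuples, concept_tuples):
    """Report which correspondence types hold between a view V and a term C.

    view_tuples:    tuples satisfying V in the local model D
    concept_tuples: tuples satisfying C in the global interpretation I
    """
    v, c = set(view_tuples), set(concept_tuples)
    return {
        "sound": v <= c,      # every tuple of V also satisfies C
        "complete": c <= v,   # no tuple outside V satisfies C
        "exact": v == c,      # both sets coincide
    }


# Hypothetical data: the view returns two individuals, the concept holds three.
view_in_D = {"ichiro", "taro"}
concept_in_I = {"ichiro", "taro", "jiro"}
result = classify_correspondence(view_in_D, concept_in_I)
```

With these sample sets the view is sound but neither complete nor exact, because "jiro" satisfies C without being returned by V.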

In the above notation, “I satisfies” means that I satisfies every correspondence in the mapping between the global ontology and the local ontologies with respect to D. These correspondences are valid if the global ontology is assumed to be consistent; however, inconsistencies might occur when a term in the global ontology has more than one definition. Consider, for example, the word “bass,” which has at least two definitions: one is “a man whose singing voice is very low,” and the other is “a kind of fish” (LDOCE, 1995). The two definitions are represented in that order with a first-order-language-like notation, as follows:

C(x) ← singer(x), voice(x, low).
C(x) ← aKindOf(x, fish).

Some concepts in the global ontology are described using the above two definitions of C(x). The following concept C1(x) is represented with the first definition because C1(x) is a member of an orchestra, and in the standard context a member of an orchestra cannot be a fish.

C1(x) ← C(x), isMemberOf(x, orchestra).   (1)

The concept C(x) in (1) can be represented with the DAML+OIL notation, as shown below. This concept is assumed to belong to one local ontology (for example, local-ont-1). Some parts of the representation, such as namespace prefixes and URIs, are omitted because they do not directly relate to the purpose of this paper. In addition, the referenced resources and properties whose definitions do not appear in the representation are assumed to be defined implicitly.

<rdf:RDF>
  <daml:Ontology rdf:about="">
    <rdfs:comment>local-ont-1</rdfs:comment>
  </daml:Ontology>
  <daml:Class rdf:ID="bass">
    <rdfs:subClassOf rdf:resource="#singer"/>
    <hasVoiceOf rdf:resource="#very_low"/>
  </daml:Class>
  <daml:ObjectProperty rdf:ID="hasVoiceOf">
    <rdfs:range rdf:resource="#voice"/>
  </daml:ObjectProperty>
  <daml:Class rdf:ID="voice">
    <daml:oneOf rdf:parseType="daml:collection">
      <voice rdf:ID="high"/>
      <voice rdf:ID="medium"/>
      <voice rdf:ID="low"/>
      <voice rdf:ID="very_low"/>
    </daml:oneOf>
  </daml:Class>
</rdf:RDF>

C1(x) is assumed to be the result of a query over the global ontology, “C(x), isMemberOf(x, orchestra).” The set of extensions of C1 is returned as the result of the query. Once C1 is formed, the values of x returned from the global ontology become the set of instances of a new class, and the local ontology is revised by adding a subclass of the existing class. If such a new subclass were named “sub-bass,” it would be represented in this manner:

<daml:Class rdf:ID="sub-bass">
  <rdfs:subClassOf rdf:resource="#bass"/>
</daml:Class>
<daml:Class rdf:about="#sub-bass">
  <isMemberOf rdf:resource="#orchestra"/>
</daml:Class>
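The query-then-revise step described above can be sketched over a small in-memory set of triples. This is only an illustration under simplifying assumptions: the triple layout, the function names and the individuals "ichiro", "taro" and "hanako" are hypothetical, and a real system would run the query through an RDF query engine rather than Python sets.

```python
def evaluate_query(facts, class_name, prop, value):
    """Evaluate C1(x) <- C(x), prop(x, value) over (subject, predicate, object) triples."""
    members = {s for (s, p, o) in facts if p == "type" and o == class_name}
    matching = {s for (s, p, o) in facts if p == prop and o == value}
    # The conjunction in the query body is the intersection of both sets.
    return members & matching


def revise_local_ontology(facts, instances, new_class, parent):
    """Add new_class as a subclass of parent and attach the query results to it."""
    revised = set(facts)
    revised.add((new_class, "subClassOf", parent))
    revised.update((x, "type", new_class) for x in instances)
    return revised


# Hypothetical extension data for local-ont-1.
facts = {
    ("ichiro", "type", "bass"),
    ("taro", "type", "bass"),
    ("ichiro", "isMemberOf", "orchestra"),
    ("hanako", "isMemberOf", "orchestra"),
}

c1 = evaluate_query(facts, "bass", "isMemberOf", "orchestra")
revised = revise_local_ontology(facts, c1, "sub-bass", "bass")
```

Only "ichiro" is both a bass and a member of the orchestra, so he becomes the single instance of the new class "sub-bass".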

Let O be an ontology screening system. In this case, O obtains one concept, “sub-bass,” i.e., “a bass singer working in an orchestra,” from the intersection of C(x) and isMemberOf(x, orchestra). The concept “sub-bass” is formed by combining the local and global ontologies. As this example shows, appropriate extensions of the intended concept in the global ontology are obtained and recognized only if the right sense of the concept in the local ontology is used. A different concept might be drawn from the senses in other local ontologies, but the high rate of co-occurrence between “singer” and “orchestra” detected by WSD would play a crucial role in preventing another sense from being considered. Let us consider another example:

C2(x) ← C(x), livesIn(x, river).   (2)

In (2), C(x) most likely takes the second sense of “bass”; that is, x is an extension of the class bass as a subclass of fish. WSD would justify this choice because the co-occurrence of “fish” and “river” is rather strong. The local ontology of “bass” in the second sense (local-ont-2) is represented with DAML+OIL, as shown below. It is also revised when a new subclass from query (2) is added.

<rdf:RDF>
  <daml:Ontology rdf:about="">
    <rdfs:comment>local-ont-2</rdfs:comment>
  </daml:Ontology>
  <daml:Class rdf:ID="bass">
    <rdfs:subClassOf rdf:resource="#fish"/>
  </daml:Class>
  <daml:Class rdf:ID="sub-bass">
    <rdfs:subClassOf rdf:resource="#bass"/>
  </daml:Class>
  <daml:Class rdf:about="#sub-bass">
    <livesIn rdf:resource="#river"/>
  </daml:Class>
</rdf:RDF>
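The role that WSD plays in choosing between local-ont-1 and local-ont-2 can be sketched as a lookup of co-occurrence strengths between each sense of “bass” and the context word in the query. The scores below are invented for illustration only; in practice they would be estimated from a corpus, and the paper does not prescribe a particular WSD algorithm.

```python
# Hypothetical co-occurrence strengths between a sense word and a context word.
COOCCURRENCE = {
    ("singer", "orchestra"): 0.8,
    ("fish", "orchestra"): 0.05,
    ("singer", "river"): 0.1,
    ("fish", "river"): 0.7,
}

# Each local ontology defines "bass" as a subclass of a different class.
SENSES = {"local-ont-1": "singer", "local-ont-2": "fish"}


def choose_local_ontology(context_word):
    """Pick the local ontology whose sense of "bass" co-occurs most with the context."""
    return max(
        SENSES,
        key=lambda ont: COOCCURRENCE.get((SENSES[ont], context_word), 0.0),
    )


choice_for_query_1 = choose_local_ontology("orchestra")
choice_for_query_2 = choose_local_ontology("river")
```

Under these assumed scores, query (1) is resolved against local-ont-1 and query (2) against local-ont-2.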

3 Screening inconsistencies in the global ontology

According to (Calvanese et al., 2001), the following situations involving inconsistency or ambiguity occur as a result of the mapping of global and local ontologies:

1. There are no global models. This occurs when the data in the extension of the local ontologies do not satisfy the constraints for functional attributes. For example, in (Calvanese et al., 2001), a property “age” is considered. Since “age” is a function, its range must supply exactly one value for each individual. The global ontology therefore no longer has a model concerning “age” if the constraints on the concepts in a local ontology give “age” more than one value.

2. There are several global models. This occurs when the data in the extension of the local ontologies do not satisfy the ISA relationships of the global ontology: i.e., several ways exist to add suitable instances to the elements of the global ontology in order to satisfy the constraints from a local ontology. For example, assume that a “student” must be enrolled in a certain university. More than one university can be considered when the concept definitions of a local ontology do not state which university each extension of “student” is enrolled in. Such ambiguities depend on how precisely the concepts of a local ontology are defined. As a result, all interpretations for an intended ontology integration system within which a valid concept exists must be accepted as models because of the potential ambiguities present in local ontologies.
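The first situation, a functional attribute such as “age” receiving more than one value, can be detected mechanically. The sketch below assumes that the extension data are available as (subject, property, value) triples; the function name and the sample individuals are hypothetical.

```python
from collections import defaultdict


def functional_violations(facts, functional_props):
    """Return the (subject, property) pairs that carry more than one value
    for a property declared to be functional."""
    values = defaultdict(set)
    for subject, prop, value in facts:
        if prop in functional_props:
            values[(subject, prop)].add(value)
    return {key: vals for key, vals in values.items() if len(vals) > 1}


# Hypothetical extension data merged from two local ontologies.
facts = [
    ("ichiro", "age", 28),
    ("ichiro", "age", 30),   # conflicting second value
    ("taro", "age", 41),
]

violations = functional_violations(facts, {"age"})
```

Any non-empty result signals that no global model exists for the affected individuals.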

In addition to the situations described above, we identify two further problematic situations: the first relates to yielding invalid extensions, and the second to forming wrong concept definitions as a result of WSD failure. For an explanation of the first situation, consider this example:

Example 1. Reconsider query (2) from Section 2:

C2(x) ← C(x), livesIn(x, river).   (3)


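[Figure 1 diagram: the global ontology holds the concept “bass who lives in a river,” produced by the query bass(x), livesIn(x, river). Two local ontologies define bass, one as a subclass of singer and one as a subclass of fish, and the link to the singer sense is marked as the wrong choice. The extension Ichiro has livesIn(x, river) and, as an extension of man, hasAddressOf(x, Tokyo), which raises the question of whether Tokyo = river, where “=” means “same category.”]

Figure 1: An inconsistency resulting from a wrong choice of sense for the word “bass”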

In (3), concept C2 is assumed to be formed from concept C with the meaning “bass as a singer.” The meaning of C2 is “a bass that lives in a certain river.” Provided that “Ichiro” is an extension of C2, it could be said that Ichiro lives in a river. In an ontology hierarchy, a term may inherit from multiple classes. Suppose “Ichiro” is also an extension of the class “man.” If the class “man” has a property called “hasAddressOf,” which is defined as the same property as “livesIn,” “Ichiro” must have an instantiated address. When his address is “Tokyo,” an inconsistency arises because “Tokyo” and “river” are not classified in the same category. In addition, this inconsistency, an inappropriate property value, is propagated to the lower subclasses of the target class. This situation is shown in Figure 1. We claim that a situation such as this should be resolved using WSD because of the strong relationship between “fish” and “river” mentioned in Section 2.
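The inconsistency in Example 1 can be screened for by treating the address property and “livesIn” as one property and then checking whether all values of that property for an individual fall into the same category. The following sketch makes several assumptions that are not part of the paper: the category table, the property-equivalence table and the triple layout are all illustrative.

```python
from collections import defaultdict

# Hypothetical category assignments for property values.
CATEGORY = {"Tokyo": "city", "river": "body_of_water"}

# The address property is assumed to be defined as the same property as livesIn.
EQUIVALENT = {"hasAddressOf": "livesIn"}


def find_category_conflicts(facts):
    """Return (subject, property) pairs whose values belong to different categories."""
    values = defaultdict(set)
    for subject, prop, value in facts:
        canonical = EQUIVALENT.get(prop, prop)
        values[(subject, canonical)].add(value)
    conflicts = {}
    for key, vals in values.items():
        categories = {CATEGORY.get(v) for v in vals}
        if len(categories) > 1:
            conflicts[key] = vals
    return conflicts


facts = [
    ("Ichiro", "livesIn", "river"),         # inherited from the wrongly chosen C2
    ("Ichiro", "hasAddressOf", "Tokyo"),    # inherited from the class "man"
]

conflicts = find_category_conflicts(facts)
```

A reported conflict marks the point at which the sense chosen for “bass” should be re-examined with WSD.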

Example 2. The second situation concernsWSD failure. Suppose the following query:

C3(x) ← C(x), hasHobbyOf(x, swimming).   (4)

In (4), C3 would be assumed, in the ordinary context, to be a concept referring to a subclass of person. However, the ontology screening system might classify the concept C3 into the wrong place in the global ontology if it takes the concept C in the second meaning of “bass,” i.e., a kind of fish. Deriving related concepts in the global ontology on the basis of a wrong concept whose meaning can never intersect with the right one causes an unexpected inconsistency over the global ontology. Once C3 is mistaken in this way, the local ontology ends up with a concept meaning “a bass whose hobby is swimming” even though that bass is a type of fish.

<rdf:RDF>
  <daml:Ontology rdf:about="">
    <rdfs:comment>local-ont-2</rdfs:comment>
  </daml:Ontology>
  <daml:Class rdf:ID="bass">
    <rdfs:subClassOf rdf:resource="#fish"/>
  </daml:Class>
  <daml:Class rdf:ID="sub-bass">
    <rdfs:subClassOf rdf:resource="#bass"/>
  </daml:Class>
  <daml:Class rdf:about="#sub-bass">
    <livesIn rdf:resource="#river"/>
    <hasHobbyOf rdf:resource="#swimming"/>
  </daml:Class>
</rdf:RDF>

The above situation has not yet been solved because the strength of the relationship between “fish” and “swimming” is rather high. In order to solve such problems, WSD should be applied to the property names, such as “hobby,” as well as to the property values.
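The effect of including the property name in the disambiguation can be sketched by scoring each sense of “bass” against the property value alone and then against both the value and the property name. The co-occurrence scores are invented for illustration, and summing the two scores is only one possible way of combining them.

```python
# Hypothetical co-occurrence strengths between a sense word and a context word.
COOCCURRENCE = {
    ("fish", "swimming"): 0.8,
    ("singer", "swimming"): 0.3,
    ("fish", "hobby"): 0.05,
    ("singer", "hobby"): 0.6,
}

SENSES = ("singer", "fish")


def sense_score(sense, prop_word, value_word, use_property_name=True):
    """Score a sense by its co-occurrence with the value and, optionally, the property name."""
    score = COOCCURRENCE.get((sense, value_word), 0.0)
    if use_property_name:
        score += COOCCURRENCE.get((sense, prop_word), 0.0)
    return score


def choose_sense(prop_word, value_word, use_property_name=True):
    """Return the sense of "bass" with the highest score."""
    return max(
        SENSES,
        key=lambda s: sense_score(s, prop_word, value_word, use_property_name),
    )


value_only = choose_sense("hobby", "swimming", use_property_name=False)
with_property = choose_sense("hobby", "swimming", use_property_name=True)
```

With these assumed scores, the value “swimming” alone selects the fish sense, while adding the property name “hobby” shifts the choice to the singer sense.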

4 Conclusion

In this paper, we described a perspective on the mapping between the global and the local ontologies based on (Calvanese et al., 2001). We support the global-as-view approach for maintaining the global ontology. This approach is thought to be promising, but it has some unsolved problems concerning inconsistencies in ontologies. Word sense disambiguation (WSD) can provide one clue for solving these problems because it identifies close relationships among the words that appear as property values or in concept definitions. In addition to the situations described in (Calvanese et al., 2001), we identified two situations in which an ontology becomes inconsistent because one word has multiple meanings: the first was that of a query based on the wrong choice of meaning for a word, and the second involved the inability of WSD to decide the meaning of a concept. To solve these problems, more precise disambiguation methods should be developed in WSD, or more elaborate representations should be provided for the constraints in ontology definitions. On the other hand, as one alternative, the ontology screening system proposed in this paper can be implemented using query engines (e.g., (RQL, 1994) (RDFQL, 2000) (Miller, 2001) (JENA, 2001)) based on RDF Query (RDFQuery, 1998). A query language must play a crucial role in the formation of concepts and the screening for inconsistencies over intended ontologies. In the semantic web world, we believe that screening for inconsistencies in ontologies will become critical and that the use of NLP methods like WSD to screen for inconsistencies will offer a significant contribution.

References

T. Berners-Lee, J. Hendler, and O. Lassila. 2001. The semantic web. Scientific American, 279:35–43, May.

D. Calvanese, G. De Giacomo, and M. Lenzerini. 2001. A Framework for Ontology Integration. In The First Semantic Web Working Symposium (SWWS01).

DAML+OIL. 2001. DAML+OIL language. http://www.daml.org/2001/03/daml+oil.

J. Euzenat. 2001. An Infrastructure for Formally Ensuring Interoperability in a Heterogeneous Semantic Web. In The First Semantic Web Working Symposium (SWWS01), pages 345–360.

JENA. 2001. The jena semantic web toolkit. Hewlett-Packard Company, http://www.hpl.hp.com/semweb/jena-top.html.

LDOCE. 1995. Longman Dictionary of Contemporary English. Longman.

C. D. Manning and H. Schütze. 1999. Foundations of statistical natural language processing, chapter Word Sense Disambiguation, pages 229–261. MIT Press.

L. Miller. 2001. RDF query using SquishQL. http://swordfish.rdfweb.org/rdfquery/.

P. Mitra, G. Wiederhold, and S. Decker. 2001. A Scalable Framework for the Interoperation of Information Sources. In The First Semantic Web Working Symposium (SWWS01), pages 317–329.

RDFQL. 2000. RDF Query Analyzer - RDFQL. Intellidimension, http://www.intellidimension.com/RDF-Gateway/beta3/.

RDFQuery. 1998. RDF Query Specification. The World Wide Web Consortium, http://www.w3.org/TandS/QL/QL98/pp/rdfquery.html.

RQL. 1994. The RDF Query Language. Institute of Computer Science, Foundation of Research Technology Hellas, http://139.91.183.30:9090/RDF/RQL/.