LIRICS

Deliverable D7.5

Final Public report

Project reference number e-Content-22236-LIRICS

Project acronym LIRICS

Project full title Linguistic Infrastructure for Interoperable Resource and Systems

Project contact point Laurent Romary, INRIA-Loria

615, rue du jardin botanique BP101.

54602 Villers lès Nancy (France)

[email protected]

Project web site http://lirics.loria.fr

EC project officer Erwin Valentini

Document title Final Public report

Deliverable ID D7.5

Document type Report

Dissemination level Public

Contractual date of delivery 30th April 2007

Actual date of delivery 30th April 2007

Status & version Draft

Work package, task & deliverable responsible INRIA

Author(s) & affiliation(s) Laurent Romary (INRIA) and Gil Francopoulo (INRIA) and all WP leaders

Additional contributor(s)

Keywords standards ISO LMF MAF SynAF registry DCR Gate Lexus TC37

Document evolution

version   date

1.0       30th April 2007

1.1       8 November 2007

1 Objectives

LIRICS addressed the needs of today's information and communication society, where globalization and localization necessitate multilingual communication. This creates an increasing need for standardization, for recognition of existing de facto standards, and for their transformation into de jure international standards.

The objectives of the LIRICS project were thus to:

- provide ISO ratified standards for language technology;

- facilitate the implementation of these standards by providing an open-source implementation platform;

- gain full industry support and input to the standards development.

A great number of deliverables have been produced and uploaded to the LIRICS web site [1] as scheduled in the LIRICS technical annex. This final report summarizes the work done by all the partners and gives pointers to the most important deliverables.

2 Summary of LIRICS Workpackage activities

2.1 Workpackage #1: Infrastructure for standard development & quality assurance

2.1.1 Presentation

This package provided the quality and technical framework within which the standards were developed in workpackages 2, 3 and 4.

ISO documents pass through a number of stages, most of which require some assessment (review and commentary) of the document. The documents reviewed and commented upon during the development life cycle of an ISO standard include the Working Draft (WD), Committee Draft (CD), Draft International Standard (DIS) and Final DIS (FDIS). Comments are generally provided through National Standards Bodies (NSBs), where they are expected to have been validated against the principles and methods used for the production of standards. The hidden effort in producing standards can be significant. Take these four documents and the possible commentary from twenty NSBs, then consider the number of people within each NSB who read the documents and submit comments to it: the potential scale of duplicated comments across NSBs becomes apparent. Comments from NSBs then have to be merged, filtered and dealt with by the editor of the standard.

Amongst other things, ISO standards have a specific structure and require a specific approach to style and vocabulary. Those who are knowledgeable about such matters largely pass this knowledge on to others in an ad hoc manner. Furthermore, while this is relatively easy to achieve through communication during the authoring process, the distance between author and standard-reader tends to lead to comments that standards are impenetrable.

In LIRICS, we explored mechanisms of content control that could be used in two ways: first, by the authors of standards to improve the quality and consistency of the language used in authoring; second, as a side effect of the first, to demonstrate this quality to the standard-reader. To achieve this, we implemented and evaluated a number of software components, including those of the University of Surrey Department of Computing’s content analysis applications (System Quirk). The work covered the integration and use of supporting resources and components for the standards development process, including a Plain English thesaurus, lookup of ISO TC 37 terminology provided from a terminology management system (TMS) via ISO 16642, automatic terminology discovery using statistical and linguistic techniques, and readability metrics. These components were integrated within an existing framework to demonstrate the potential for controlled authoring based on some of the very standards being used and produced within LIRICS. These efforts led to an assistive tool for authors of standards based around LIRICS work, and to a system that automatically annotates the text of standards to help readers understand them better.

[1] See http://LIRICS.loria.fr

On the basis of initial experiments, we provided additional commentary to ISO on standards documents at various stages of the ISO process. Fuller sets of commentary for the LIRICS standards have been produced and, like the software, are available on the Surrey LIRICS site [http://www.cs.surrey.ac.uk/BIMA/Projects/LIRICS/index.html]. Human interpretation of, and action upon, the results produced by these components is still required to varying extents. However, the analysis of language simplicity and consistency, the identification of known and unknown terms, and the generation of “understandability” metrics have all been trialed and show interesting and potentially highly valuable results.

2.1.2 Implementation

The University of Sheffield’s General Architecture for Text Engineering (GATE, Cunningham et al 2002) was selected as a basis and front-end for the implementation. GATE is established within the NLP community and has been used for numerous research projects. The GATE interface allows for different “processing resources” to be executed in sequences in what is referred to as a pipeline; the user can order the running of these processing resources. A set of reusable processing resources for common NLP tasks is provided with GATE, packaged together to form A Nearly-New Information Extraction (ANNIE) system. We used the existing plug-ins from ANNIE, the tokeniser, sentence splitter and POS tagger, for the preliminary tasks. We then developed, incrementally, a set of new processing resources that adopted techniques for improving the readability of documents by incorporating the use of terminologies, readability measurements and suggested improvements for writing from ASD Simplified Technical English and the Plain English Campaign. Eight new processing resources were developed, listed below:

1. Terminology Lookup

2. Linguistic Term Finder

3. Keyword Extractor

4. Statistical Term Finder

5. SimpleText Analyser

6. Annotation Controller

7. Readability Analyser

8. Replacer

The pipeline for these resources is shown in the figure below, with brief descriptions of each component following. It should be noted that the Readability Analyser can be run at two separate points in the pipeline.

Figure 1: Pipeline for the prototype document content management system

Further descriptions of these resources can be found in LIRICS project Deliverable 1.3.

When the processing resources are run against an ISO document, they will calculate values for the various readability metrics, and annotate any known terms. In addition, they will search (statistically and linguistically) for terms that are not known, including extensions to known terms that may indicate a new term – see figure below. Words and phrases for which there may be simpler alternatives are also suggested, and replacements can be selected from the pop-up. After making modifications to the document, the readability analysis can be run again, for which a history is kept, to assess the impact of these changes on the document overall.
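The processing order described above can be illustrated with a minimal Python sketch. It does not use the GATE API and is not the LIRICS implementation. The term list, the replacement table, the stage order and the use of average sentence length as a readability measure are all assumptions made for illustration, since this report does not specify the metrics actually used. The sketch shows the readability analysis running at two points in the pipeline, with a history kept so that the effect of replacements can be compared.

```python
import re

# Hypothetical resources, invented for this sketch.
KNOWN_TERMS = {"data category", "lexical entry"}
SIMPLER = {"in order to": "to", "utilise": "use"}


def readability_analyser(doc):
    """Append average words per sentence to the document's history."""
    text = doc["text"]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    score = len(words) / max(len(sentences), 1)
    doc.setdefault("readability_history", []).append(round(score, 2))
    return doc


def replacer(doc):
    """Replace phrases that have a simpler alternative."""
    text = doc["text"]
    for phrase, simple in SIMPLER.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, simple, text, flags=re.IGNORECASE)
    doc["text"] = text
    return doc


def terminology_lookup(doc):
    """Annotate known terms as (start, end, term) character spans."""
    spans = []
    for term in KNOWN_TERMS:
        for match in re.finditer(re.escape(term), doc["text"], re.IGNORECASE):
            spans.append((match.start(), match.end(), term))
    doc["known_terms"] = sorted(spans)
    return doc


def run_pipeline(doc, stages):
    """Run each processing resource in the order given."""
    for stage in stages:
        doc = stage(doc)
    return doc


doc = run_pipeline(
    {"text": "We utilise a data category in order to describe a lexical entry. It works."},
    [readability_analyser, replacer, terminology_lookup, readability_analyser],
)
print(doc["readability_history"])  # [7.0, 6.0]
```

In this sketch the term lookup runs after the replacer so that the recorded character offsets refer to the final text; a real pipeline would have to manage offsets across edits more carefully.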

Further evaluation efforts are needed to assess the results being produced, to improve the treatment provided and to improve the formulation of feedback on the document or documents being analysed. Outputs should ideally be fed directly to standards authors prior to the submission of a document into the ISO processes, potentially leading to a significant reduction in the quantity of comments relating to document syntax or terminology. Further work with standards authors to begin to embed the evaluation of the results into the authoring process is still required, and needed beyond LIRICS: efforts within ISO to provide a terminology database will provide a significant basis for this application in being able to annotate existing and emerging ISO documents. Greater consideration of the management of overlaps between annotations is needed beyond this deliverable, and likely beyond LIRICS, and a number of future functional opportunities have been identified. Amongst these is a readability measure that takes full account of the processing, analysis and annotation sets outlined here. These efforts demonstrate a major step towards the provision of an ISO content management system, though further efforts are clearly needed.

Figure 2: A screenshot of the ‘SynAF’ document in GATE displaying the KnownTerm and DiscoveredTerm annotations

2.2 Workpackage #2: NLP lexica

The main objectives of work package 2 can be summarised as follows: define a “family” of standards for creating, describing, using and sharing lexicons for NLP applications. The target of these standards is not limited to research institutions. Rather, they should be particularly intended for adoption within the industrial community and developed as a support to advanced language technologies for content access and sharing. Moreover, they should be designed on an international scale. The aim is to reach a two-level standard: on the one hand, to define the abstract conceptual data model which provides the structural elements for lexical description in terms of lexical classes and relations between them; on the other hand, to reach a set of standardized constants, the basis of common Data Categories used to “adorn” the lexical classes. As a proof-of-concept of the defined framework, work package 2 should provide a test suite of lexical entries encoded in the proposed format in order to show that it is able to achieve unification.

Some main methodological steps have been agreed upon between the so-called LIRICS topical work packages in order to meet the objectives harmoniously. Experience in harmonization efforts has shown us that the road to standardization is a long one and that the procedure should be as conservative as possible. We should build on the past, thus endorsing major standardization activities and best practices in the field. In order to attract an international public, we should stick to the well-consolidated ISO strategy (which has proved a winning strategy in other initiatives towards unification), i.e. the “structure-adornment” binomial, which neatly separates the standardization effort into high-level specification (the structure) and low-level specification (the adornment). The lexical information, that is, the data categories to combine with the lexical model, is even more crucial, since it allows implementation of the abstract model itself and development of standard-conformant lexical resources. Last (but not least), a set of examples that follow the model helps users to understand and apply it.

In such a project, aiming at defining standards for enhancing language technologies, interrelations and links with external realities are a crucial aspect of any methodology. This is true not only for creating consensus, but also in the phase of dissemination of results and of evangelization of prospective users. LIRICS has benefited from the policy of establishing synergies with (inter)national standardization bodies and exchanges with external initiatives, thus creating a stimulating forum for discussion. CNR-ILC, as expert of UNI, follows standards developments in Italy within the Terminology Commission and, as Italian delegate of UNI, world-wide within the ISO/TC37 groups. Within ISO, contacts were established with the various subgroups dealing with lexical resources (WG4) and with the thematic domain groups that aim to identify shared sets of information to be represented at the different layers of linguistic description: TDG2 Morpho-syntax, TDG4 Syntax and TDG3 Semantics. Through various international projects, CNR-ILC is transferring knowledge about standards to other areas, such as the biomedical community (BOOTStrep project), and is working to pilot adoption and development of standards in Asia (NEDO grant and Language Grid project at NICT).

In line with the objectives, one of the main results of work package 2 is the definition of a set of data categories for lexicons. The LIRICS partners, working in synergy with ISO TC37/SC4, quickly realized that most of the values for lexicons and annotations are the same, even if some values are specific to annotation: for instance, "punctuation" is mandatory for annotation but not for lexicons. A second aspect concerns interoperability: with one set for lexicons and another for annotation, there was a high risk of balkanization, that is, of ending up with two incompatible sets. A further point was that the number of values is rather high, at least 500. Thus, at TC37/SC4 level, it was decided to split the work into four sub-tasks along linguistic levels rather than by target object. Four ISO profiles (each one corresponding to a sub-task) have been created: meta-data, morpho-syntax, syntax and semantics, and all the values are to be shared by lexicons and annotations. Profiling is important because it gives the language industry manageable portions of data modelling categories, enabling companies to compose their own profiles on the basis of the thematically organized profile specifications.

A maximal unified set of candidate lexical data categories, subdivided along the layers of linguistic representation, has been isolated and contributed to ISO. This set is presented in D2.1 “Evaluation of existing standards for NLP lexica” (Monachini et al., 2005). This deliverable identifies lexical information that is reliable and harmonized enough to be recognized as crucial for the description of a computational lexical entry, and then proposes an inventory of information that is candidate to become Data Categories, with unified descriptors, short descriptions and exemplifications. Based on this set and on the work of EAGLES for West-European languages, MULTEXT-East for East-European languages, Sfax University for Semitic languages, IMDI for meta-data values, and different TC37/SC4 works such as LMF, MAF and SynAF for a small set of values, the data categories in the Registry today amount to 500. A further value of this deliverable is that it allows lexicon managers to specify declarative rules that combine data categories and describe constrained relations between them in the presence of a given feature or value; these dependencies are formalized as XML feature structures (e.g. GrammaticalGender [feminine, masculine, neuter]). CNR-ILC will continue contributing further data categories for all profiles as a result of standardization activities undertaken at the level of UNI (Italy) and ISO TC37/SC4. CNR-ILC is currently conducting corresponding work in Asia within the NEDO project and for the specialized biomedical domain within the BOOTStrep project.
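The idea of declarative rules over data categories can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the rule format, the `violations` function and the attribute names other than the GrammaticalGender values quoted above are hypothetical, and the XML feature-structure serialization used in the deliverable is not reproduced here. Each rule states which values an attribute may take once a triggering feature is present.

```python
from typing import Dict, List, Set, TypedDict


class Rule(TypedDict):
    when: Dict[str, str]        # features that trigger the rule
    allow: Dict[str, Set[str]]  # permitted values once the rule is triggered


# Hypothetical rule: nouns may only carry one of the three listed genders.
RULES: List[Rule] = [
    {
        "when": {"partOfSpeech": "noun"},
        "allow": {"grammaticalGender": {"feminine", "masculine", "neuter"}},
    },
]


def violations(features: Dict[str, str], rules: List[Rule]) -> List[str]:
    """Return a message for every feature value that breaks a triggered rule."""
    problems = []
    for rule in rules:
        triggered = all(features.get(k) == v for k, v in rule["when"].items())
        if not triggered:
            continue
        for attribute, allowed in rule["allow"].items():
            value = features.get(attribute)
            if value is not None and value not in allowed:
                problems.append(f"{attribute}={value} not allowed")
    return problems


print(violations({"partOfSpeech": "noun", "grammaticalGender": "feminine"}, RULES))
print(violations({"partOfSpeech": "noun", "grammaticalGender": "common"}, RULES))
```

Keeping the rules as data rather than code is what makes them declarative: a lexicon manager could extend `RULES` without touching the checking function.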

The central outcome of work package 2 is the definition of a specification dedicated to NLP lexicons, the Lexical Mark-up Framework (LMF), a high-level lexical meta-model designed as a flexible environment for user-defined mark-up languages. Activities in this work package started early: a draft document was presented in Berlin in April, in conjunction with an ISO TC37/SC4 meeting, and to the LIRICS Industrial Advisory Board, held in Barcelona (21-22 June). The importance of standards for lexicons was emphasized, since they give credibility to products and are relevant to high-quality lexical resources. Some industrial participants adhered to the standard framework emerging within WP2.

Two main revisions of LMF are available as a result of this work package. LMF revision-9 was submitted as a Committee Draft (CD) to ISO ballot in March 2006. The ballot lasted three months, and in June 2006 the SC 4 P-members voted in favour. Comments from voting delegations were taken into account and discussed during an ISO plenary meeting in Beijing. A more stable LMF revision-14 was submitted in November 2006 for the Draft International Standard (DIS) phase. Results of the ballot were known in June, and input from voting delegations was discussed and included in the document after a plenary ISO meeting held in Provo in August 2007. LMF is now in DIS status; the FDIS ballot started in August 2007. The national bodies are currently balloting, and LMF is scheduled to reach Final Draft International Standard status in February 2008. LMF will be published as an International Standard in September 2008.

LMF is intended to provide a common model for the creation and use of lexical resources, to manage the exchange of data between and among these resources, and to enable the merging of electronic resources to form extensive global resources. It covers monolingual, bilingual and multilingual lexical resources. Its scalability is not limited: the same specifications are to be used for both small and large lexicons. The same is true of its coverage: linguistic description ranges from morphology, syntax and semantics to multilingual representation. The languages covered are not restricted to European languages, and the range of targeted NLP applications is not narrow.

LMF is a high-level specification based on constants defined in other standards. It is a structural data model expressed by a set of Unified Modeling Language (UML) packages, each of them containing lexical classes. Each class is specified by a name, a description of its usage, and a UML specification for linking with other classes. Each class is to be adorned by a set of attribute/value pairs which are not defined in the LMF specification but are to be taken from the data category registry. Values can be either constants or free strings.
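The separation between structure and adornment can be sketched in Python as follows. This is an illustration only, not the LMF specification: the class names `LexicalObject`, `Lemma` and `LexicalEntry`, the `adorn` method and the registry excerpt are assumptions chosen for the example. The classes carry no linguistic attributes of their own; every attribute/value pair is checked against a (hypothetical) data category registry, where a value is either one of a set of constants or a free string.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

# Hypothetical excerpt of a data category registry: each attribute maps to
# its allowed constant values, or to None when any free string is accepted.
REGISTRY: Dict[str, Optional[Set[str]]] = {
    "partOfSpeech": {"noun", "verb", "adjective"},
    "grammaticalGender": {"feminine", "masculine", "neuter"},
    "writtenForm": None,
}


@dataclass
class LexicalObject:
    """Structural class that can be adorned with attribute/value pairs."""
    features: Dict[str, str] = field(default_factory=dict)

    def adorn(self, attribute: str, value: str) -> None:
        if attribute not in REGISTRY:
            raise KeyError(f"unknown data category: {attribute}")
        allowed = REGISTRY[attribute]
        if allowed is not None and value not in allowed:
            raise ValueError(f"{value!r} is not a registered value of {attribute}")
        self.features[attribute] = value


@dataclass
class Lemma(LexicalObject):
    pass


@dataclass
class LexicalEntry(LexicalObject):
    lemma: Lemma = field(default_factory=Lemma)


entry = LexicalEntry()
entry.adorn("partOfSpeech", "noun")      # constant taken from the registry
entry.lemma.adorn("writtenForm", "chat")  # free string
print(entry)
```

Because validation lives in the registry lookup, the same classes can serve different lexicons simply by selecting a different set of data categories.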

Figure 3: LMF packages (Core Package, Morphology, MRD, NLP Syntax, NLP Semantic, NLP Multilingual Notations, NLP MWE Pattern, NLP Paradigm Pattern, Constraint Expression)

The Lexical Test Suite is the last product of LIRICS work package 2. It is a set of resources in the form of practical examples associated with the international standards produced by the project, designed to test the applicability and usability of the proposed concepts. The objectives of developing test suites in conjunction with the delivery of a standard are to provide a guide for those who wish to apply the standard to their resources and, above all, to test its viability in NLP implementations and systems. Test suites accompany the standards, ensure both wide dissemination and demonstration, and facilitate their implementation and propagation during and after the project life cycle. Finally, the development of test suites allows implementers to combine a given standard proposal in the form of a meta-model with the relevant Data Categories taken from the Registry. They can thus be used as examples of the application of the data categories themselves and also act as a reference for best practices in the representation of those phenomena.

Test suites are designed according to a methodology based on the same principles as the specifications:

- relevance of linguistic concepts,
- conciseness with regard to the number of actual entries,
- precision in relation to what is actually represented in each entry of the test suite,
- conformity to TC 37/SC4 design principles.

They are not intended to provide a large amount of cases, but should rather focus on the quality and relevance of the examples they provide.

It should also be mentioned that a mapping of well-known NLP lexicon practices accompanies LMF revision-9 (ISO Auxiliary working document, http://lirics.loria.fr/doc_pub/). As previous standardization experiences teach, mapping what exists to a standard always has a positive impact, helping to show the potential and capability of the standard itself.

The high number of internal and external meetings and conferences attended by the LIRICS partners is evidence of the increased level of community involvement and dissemination activities. CNR-ILC has continued to promote and support the adoption of LMF in other settings at international level. The BOOTStrep project has adopted LMF as the basis for the development of the model of a large-scale lexico-terminological resource (the BioLexicon) especially designed for text-mining applications in the biomedical domain. The two lexical models, the ISO-LIRICS one and the BioLexicon one, influence each other: the ISO-LIRICS model strongly influences the architecture and the policies of the BioLexicon model; vice versa, the BioLexicon model constitutes both an extension and an implementation of the available standards, thus enhancing the lexical standards themselves. CNR-ILC has presented papers about LMF-conformant resources at many conferences and workshops. CNR-ILC has continued to push European lexical standards in Asia in the framework of the Japanese NEDO grant, a project aiming at developing harmonized lexical resources for Asian languages to support advanced industrial technologies. CNR-ILC has joined another Asian initiative, the NICT Language Grid initiative on an Infrastructure for Intercultural Collaboration, which involves the Osaka, Kyoto and Tsinghua Universities and another LIRICS partner (DFKI). This initiative aims at defining a language-grid service ontology and creating composite services that connect existing language services on users’ request. An OWL/RDF version of LMF is being developed as a basis for defining the lexical services. Finally, LMF has been adopted in the framework of Senso Comune, an Italian project involving CNR, La Sapienza University and IBM. The project aims at building, with new modalities of collaboration through the web, a linguistic knowledge base for Italian that will be openly available both to humans for consultation and to systems for processing.

2.3 Workpackage #3: Morpho-syntactic and syntactic annotations

Work on standardization in the field of morpho-syntactic annotation and, at a less advanced level, syntactic annotation had already been undertaken in the past, in close relation to research and development in NLP. Nevertheless, a generally accepted and stable standard had not been reached by the start of LIRICS, although valuable recommendations, best practices and guidelines for annotation had been proposed, on which WP3 based its work (EAGLES, MULTEXT-East, etc.). The reasons why a real standard had not been established lie partly in the fact that results were often deployed only at a local level (a group of project partners) and sometimes only within the time limits of projects. Some initiatives also lacked a pool of industrial partners that would use the “standards” and ensure their sustainability, in cooperation with national and international standardisation bodies.

Besides the past standardisation initiatives mentioned above (among others), LIRICS could benefit from ongoing work at the ISO level in the domain of morpho-syntax, and help centralise this process with the expert group and the industrial advisory board of LIRICS. Embedding LIRICS in the ISO procedure ensures the sustainability of the project results.

Goals and Results: WP3 in LIRICS has provided so far:

- A report on emerging morpho-syntactic and syntactic standards, with their strengths and weaknesses (D3.1, see lirics.loria.fr).

- A morpho-syntactic annotation meta-model standard, including a Data Category Selection (DCS) standard as an additional part of the 12620 series. This has now reached the level of an ISO DIS ballot, meaning that it is at a quite advanced stage, and it has already been approved by the majority of national standardisation bodies within ISO TC37/SC4. See D3.2a and D3.2b at lirics.loria.fr.

- A syntactic annotation meta-model standard, including a Data Category Selection. This work is now at the level of a CD ballot. Most of the points presented in this intermediate document have been discussed and agreed on at a number of ISO-LIRICS meetings and at two plenary meetings of ISO TC37/SC4. A final document is expected to be submitted for approval in the spring of 2008. See D3.3a and D3.3b at lirics.loria.fr.

- Test suites for morpho-syntactic and syntactic annotation, which build a small reference corpus of morpho-syntactically and syntactically annotated text for six languages (Bulgarian, English, French, German, Italian and Spanish). See D3.4 at lirics.loria.fr for a first version of this test suite.

- Embedding of the SynAF and MAF annotation meta-models in the ongoing LAF (Linguistic Annotation Framework), a graph model supporting multi-layered (linguistic) annotation.
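To make the notion of a graph model for multi-layered annotation more concrete, the following Python sketch stores annotations from two layers as nodes of one graph over the same text. It is a hedged illustration and not the LAF, MAF or SynAF data model or serialization: the `Node` and `AnnotationGraph` classes, the layer names and the edge labels are all hypothetical. Morpho-syntactic nodes are anchored to character spans, while syntactic nodes only point to other nodes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    layer: str                              # e.g. "morphosyntax" or "syntax"
    features: Dict[str, str]
    span: Optional[Tuple[int, int]] = None  # character offsets, if anchored


@dataclass
class AnnotationGraph:
    text: str
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, node_id: str, node: Node) -> None:
        self.nodes[node_id] = node

    def add_edge(self, source: str, target: str, label: str) -> None:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both ends of an edge must be existing nodes")
        self.edges.append((source, target, label))

    def extent(self, node_id: str) -> Tuple[int, int]:
        """Span of a node: its own anchor, or the extent of its children.

        Assumes an acyclic graph in which every unanchored node has a child.
        """
        node = self.nodes[node_id]
        if node.span is not None:
            return node.span
        spans = [self.extent(t) for s, t, _ in self.edges if s == node_id]
        return (min(s for s, _ in spans), max(e for _, e in spans))

    def covered_text(self, node_id: str) -> str:
        start, end = self.extent(node_id)
        return self.text[start:end]


g = AnnotationGraph("The cat sleeps")
g.add_node("t1", Node("morphosyntax", {"pos": "determiner"}, (0, 3)))
g.add_node("t2", Node("morphosyntax", {"pos": "noun"}, (4, 7)))
g.add_node("t3", Node("morphosyntax", {"pos": "verb"}, (8, 14)))
g.add_node("np", Node("syntax", {"cat": "NP"}))
g.add_edge("np", "t1", "det")
g.add_edge("np", "t2", "head")
print(g.covered_text("np"))  # The cat
```

Since syntactic nodes refer to morpho-syntactic nodes rather than to the text itself, each layer can be produced and revised independently, which is the practical benefit of a multi-layered graph.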

WP3 results have been presented and discussed at two workshops organized by LIRICS, which were dedicated to the relevance of language resource standards for industrial applications. In this way we obtained feedback on our current work from more than 20 companies.

Links to other bodies and initiatives:

During the lifetime of LIRICS, the following links have been established, ensuring that the WP3 work is disseminated to the relevant communities and can continue after the end of the funded project:

- The WP3 leader, DFKI, is a leading member of DIN NAAT6, the German mirror committee of TC37/SC4.

- WP3 members have participated intensively in the activities of ISO TC 37/SC 4 in general and in the activities of the Thematic Domain Groups (TDG): TDG2 on Morpho-syntax, TDG4 on Syntax and TDG3 on Semantics.

- Two members of WP3 (CNR-ILC and DFKI) are now members of the Language Grid initiative, led by NICT in Japan.

- The WP3 leader (DFKI) and members of DIN NAAT6 held a joint workshop with W3C in April 2007 to discuss the reuse of standardised linguistic annotations within the Semantic Web efforts of W3C.

- The WP3 leader (DFKI) and other members of LIRICS started very fruitful discussions with OMG representatives on the representation of language resources in business ontologies. The relevant group within OMG is the SIG on Semantics of Business Vocabulary and Business Rules (SBVR).

- A close collaboration was also established with the SIGSEM Working Group on the Representation of Multimodal Semantic Information.

2.4 Workpackage #4: Semantic content

2.4.1 Presentation

At a time when semantic content annotation is proving to be at the cutting edge of rapid and accurate information extraction of all varieties, there is some urgency in setting standards for reusability and interoperability of resources for wide application and distribution. The LIRICS project had as one of its aims to propose a common and well-defined set of descriptors for semantic annotation in the form of data categories in an on-line registry, maintained under the auspices of the International Organization for Standardization (ISO), in accordance with ISO standard 12620 (Terminology and other language resources – Data categories for electronic lexical resources; see Romary 2004).

The work in this part of the LIRICS project was performed in Work Package 4 in close collaboration with two related initiatives. First, the area of semantic content annotation has been recognized as important by the International Organization for Standardization (ISO), which has formed the Thematic Domain Group TDG 3, Semantic Content, as an international expert group devoted to work in this area, within Technical Committee 37, Subcommittee 4 (Terminology and Language Resources Management). In addition, an independent scientific peer group was formed within the ACL Special Interest Group on Computational Semantics (ACL/SIGSEM), the Working Group on the Representation of Multimodal Semantic Information (MMSemR). The work in WP 4 of the LIRICS project was carried out in close collaboration with this Working Group and with ISO TC 37/SC 4/TDG 3, as witnessed by joint workshops that took place in January 2005 in Tilburg (The Netherlands), in April 2006 in Marina del Rey (California), and in January 2007 in Tilburg (The Netherlands). (See also http://let.uvt.nl/research/ti/iso-tdg3 and http://let.uvt.nl/research/ti/sigsem/wg.)

2.4.2 Issues in semantic descriptor selection

At the start of the LIRICS project it was decided to focus the work in WP 4 on four areas of semantic information identified as potentially fruitful for proposing standard annotation concepts: temporal information; referential entities and links; semantic roles; and dialogue acts. For each of these areas, a number of different annotation schemes have been proposed, which come from different theoretical frameworks and are used on a variety of data types for extracting different information. The problem is which of these concepts should be included, which should not, and for what reason.

In previous ISO work on the development of standards for language resources, the use of metamodels has been advocated as a way to abstract away from the details of specific schemes in order to approach the key common concepts; this approach has been argued to be fruitful also for semantic annotation within each semantic area of concern (see Bunt and Romary, 2004). The use of metamodels works particularly well in the clarification of highly structured related concepts, such as those representing temporal information, or indeed the complex relationships between morpho-syntactic concepts. It works less well for the comparison of schemes whose concepts are more linearly organised, whose concepts are in essence category labels, and whose interrelations are less complex and structurally interdependent, for example reference relations. (See also Bunt and Schiffrin, 2007 for a discussion of relevant issues in the design of data categories for dialogue acts.)

It is for this reason that one cannot simply rely on the abstracted metamodel for finding the common concepts between differing schemes for all kinds of semantic information; one must also closely compare the individual concepts that may populate the categories in alternative models. Assuming that some general consensus does in fact exist between schemes purporting to annotate the same type of semantic information, our aim has been to pick exactly those concepts for which evidence of some broader consensus can be found (with some notable exceptions included here in order to stimulate discussion within the research community).
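As a rough illustration of selecting concepts backed by broader consensus, the Python sketch below counts how many schemes contain each concept and keeps those found in at least two. The scheme names and concept inventories are invented for the example, and the sketch assumes that concept labels have already been aligned across schemes, which in practice is exactly the hard part.

```python
from collections import Counter
from typing import Dict, List, Set


def consensus_concepts(schemes: Dict[str, Set[str]], min_schemes: int = 2) -> List[str]:
    """Return concepts that occur in at least `min_schemes` schemes."""
    counts = Counter(c for concepts in schemes.values() for c in concepts)
    return sorted(c for c, n in counts.items() if n >= min_schemes)


# Hypothetical schemes with already-aligned concept labels.
SCHEMES = {
    "scheme_a": {"agent", "patient", "instrument"},
    "scheme_b": {"agent", "theme", "instrument"},
    "scheme_c": {"agent", "patient", "beneficiary"},
}

print(consensus_concepts(SCHEMES))
```

Concepts that fall below the threshold are not necessarily excluded; they would simply need a separate, explicit justification for inclusion.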

The difficulties of producing a list of common core data categories for semantic annotation do not end with choosing the concepts that should be included. Precise definitions for such data categories are extremely difficult to come by. Definitions in different schemes will differ because different schemes will follow different frameworks. The issue can be further complicated by finding that the definitions even for a specific scheme are often imprecise or unclear, or that definitions for the same concept overlap with similar concepts even within the same scheme (which is often also a cause of poor inter-annotator agreement scores). A surprising number of definitions for core concepts are rather vague and rely on a ‘common sense’ interpretation by a user and are not rigorous or comprehensive as a result.

If we take the definition from one particular scheme, we may exclude some instances of the concept that are covered in another scheme by making it too specific; but if we broaden the definition to encompass all possible schemes, we risk the concept becoming so vague that it will encroach on instances that should be more properly covered by other concepts.

With this dilemma in mind, we have taken the approach of developing a set of data categories that are as far as possible independent from individual projects or schemes. The data categories are in this way intended to be abstract, yet also clearly related to a wide range of slightly different concepts. The idea is that one could then define the more specific, scheme-dependent concept with reference to the generic concept, or as a kind of sub-category, if one so desired.

Finally, once the concept and the definition are fixed, there is also the more minor, but equally elusive, choice of terminology and labelling. This problem, while seemingly trivial in comparison to the other two, may be more important for the acceptance of the proposed semantic data categories than anything else. This is because the choice of category label may advertently or inadvertently signal a bias towards one particular theoretical background, which could then alienate certain research groups rather than produce the consensus and acceptance being sought.

For the purposes of being able to justify the inclusion of any particular concept into the semantic data category registry, we define the following selection criteria for any candidate data category added:

(1) Concept: Concepts that are common to more than one approach should be given priority for inclusion. This criterion is best practice in general, although some exceptions to the rule have also been allowed for the sake of flexibility and where there were compelling reasons for doing so.

(2) Definition: Because accurate definitions are hard to find, the definitions that already exist in the literature should not be taken for granted, but should be adapted to make them more precise and less controversial where appropriate. Definitions should be as concise as possible without losing intelligibility, and should distinguish the concept clearly and uniquely from other related concepts by some feature or property.

(3) Term: The most common term for the concept should be selected, or a compromise should be found. Explanations should be added where appropriate.
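The three criteria can be read as fields of a registry record plus a simple admission check. The following is a minimal sketch under that reading; the `DataCategory` record, its field names and the `admissible` function are hypothetical and are not taken from the LIRICS registry, and the scheme names and definition in the usage lines are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataCategory:
    """Hypothetical record for a candidate semantic data category."""
    term: str                          # criterion (3): most common term or a compromise
    definition: str                    # criterion (2): concise, distinguishing definition
    supporting_schemes: List[str] = field(default_factory=list)  # criterion (1)
    exception_reason: str = ""         # compelling reason when criterion (1) is waived
    broader: Optional[str] = None      # generic category that this one specialises, if any


def admissible(cat: DataCategory) -> bool:
    """Apply the selection criteria in their simplest possible reading."""
    has_term = bool(cat.term.strip())
    has_definition = bool(cat.definition.strip())
    shared = len(set(cat.supporting_schemes)) > 1   # common to more than one approach
    return has_term and has_definition and (shared or bool(cat.exception_reason))


# Placeholder values, for illustration only.
candidate = DataCategory(
    term="agent",
    definition="Illustrative definition text.",
    supporting_schemes=["scheme A", "scheme B"],
)
print(admissible(candidate))
```

The `broader` field reflects the idea that a scheme-dependent concept can be defined with reference to a generic one; the check itself deliberately ignores it.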


The sets of semantic data categories that have been developed in the project have, in various stages of development, been the topic of extensive discussion with peers in the SIGSEM Working Group and in the ISO Thematic Domain Group on Semantic Content, at joint LIRICS/SIGSEM/ISO workshops. The final proposal has been endorsed by the ISO Thematic Domain Group at its meeting in April 2007 in Paris. We expect this set of data categories to play a part in continuing ISO initiatives that aim at the development of international standards for language resources annotation.

The proposed semantic data categories have been put to the test for their applicability and usability, by developing test suites of practical examples of semantic annotation using these semantic descriptors for five European languages (see LIRICS Deliverable D4.4). This has helped to refine these data categories, to identify flaws or omissions, and check the general viability of the set for use in NLP implementations and systems. The data categories for definiteness and specificity, relevant for coreference annotation, have for instance been redefined. The test suites have thus provided a feedback mechanism for establishing the consistency, reliability and comprehensiveness of the data categories for semantic annotation.

The construction of test suites in some cases led to suggestions to add certain data categories. In most of these cases we have refrained from doing so, since the proposed set of data categories cannot be exhaustive anyway and should be considered open to extension. It would, more generally, be overly presumptuous to think that the data categories proposed here are the last word in establishing data categories for semantic annotation. First, the field of natural language semantics is an active area of research, with unresolved issues whose further investigation should be expected to have repercussions for any proposed (set of) descriptors for semantic annotation. Some of these issues have been discussed in relation to dialogue act annotation by Bunt & Schiffrin (2007) and in relation to semantic roles by Petukhova, Schiffrin & Bunt (2007). Second, different linguistic research activities as well as different applications of natural language processing may require specific types of semantic information to be captured by annotations using quite specific descriptors, or may impose a particular level of granularity on annotations.

Therefore, no set of semantic data categories can be closed; instead, the current proposal presents a set of core semantic descriptors that can be refined to achieve finer granularity or extended to capture other or additional semantic distinctions according to the particular needs of specific research activities or language engineering applications.

2.4.3 Validation of semantic data categories

2.4.3.1 Dialogue act annotation

To test the usability and coverage of the semantic data categories for dialogue act annotation, test suites were annotated with these data categories for three languages: English, Dutch, and Italian.

For English, selected dialogues from two corpora were annotated: TRAINS2 (5 dialogues; 349 utterances) and MapTask3 (2 dialogues; 386 utterances). Dialogues from both corpora are two-agent human-human dialogues. TRAINS dialogues are information-seeking dialogues in which an information office assistant is supposed to help a client choose the optimal train connection. MapTask dialogues are so-called instructing dialogues in which one participant plays the role of an instruction-giver who guides another participant, the instruction-follower, through a map.

2 For more information about the TRAINS corpus please visit http://www.cs.rochester.edu/research/speech/trains.html

3 Detailed information about the MapTask project can be found at http://www.hcrc.ed.ac.uk/maptask/


For Dutch, selected dialogues from two corpora were annotated: DIAMOND4 (one extended dialogue; 301 utterances) and Schiphol (Amsterdam Airport) Information Office (6 dialogues; 202 utterances). Dialogues from both corpora are two-agent human-human dialogues. DIAMOND dialogues are assistance-seeking in nature, with one participant playing the role of an instructor who explains to the user how to configure and operate a fax machine. Schiphol Information Office dialogues are information-seeking dialogues in which an assistant is asked to provide a client with information about airport activities and facilities (e.g. timetable, security, etc.). The original DIAMOND dialogue is pre-segmented per dialogue utterance for each speaker, with utterance start and end times indicated. The original Schiphol dialogues are pre-segmented per speaker turn, without authentic turn timings.

For Italian, 6 selected dialogues (393 utterances) from the SITAL corpus were annotated. All dialogues are two-agent human-human information-seeking dialogues. The SITAL corpus contains dialogues between a travel agency's operator and a person seeking travel information or wishing to book a ticket, a hotel room or a flight.

For the dialogue act annotation the ANVIL tool was used (http://www.dfki.de/~kipp/ANVIL). The tool allows the multidimensional segmentation of dialogue units into functional segments and their annotation (labelling) in multiple dimensions simultaneously. ANVIL also lets the annotator mark up discontinuous segments and re-segment pre-segmented dialogue units. Some dialogues were presented in pre-segmented form, either per turn (as in the Dutch Schiphol Information Office corpus) or per utterance (as in the Dutch DIAMOND corpus), so with ANVIL annotators could cut larger units into smaller functional segments.
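To make the notions of functional segment, discontinuous segment and re-segmentation concrete, here is a minimal sketch of one possible in-memory representation. It does not use or reproduce ANVIL's data format; the class `FunctionalSegment`, the function `resegment`, and the dimension and tag names in the usage lines are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


@dataclass
class FunctionalSegment:
    """One functional segment; several spans make it discontinuous."""
    speaker: str
    spans: List[Span]
    labels: Dict[str, str] = field(default_factory=dict)  # dimension -> tag

    def is_discontinuous(self) -> bool:
        return len(self.spans) > 1


def resegment(unit: Span, cut_points: List[float]) -> List[Span]:
    """Cut one pre-segmented unit (turn or utterance) into smaller spans."""
    start, end = unit
    # Ignore cut points that fall outside (or on the edges of) the unit.
    inner = sorted(p for p in cut_points if start < p < end)
    bounds = [start, *inner, end]
    return list(zip(bounds[:-1], bounds[1:]))


# A pre-segmented turn of four seconds, cut into three smaller spans.
pieces = resegment((0.0, 4.0), [1.5, 3.0])

# The first and last piece together form one discontinuous segment that
# carries a tag in two dimensions at once (names are placeholders).
segment = FunctionalSegment(
    speaker="A",
    spans=[pieces[0], pieces[2]],
    labels={"dimension-1": "tag-x", "dimension-2": "tag-y"},
)
print(pieces, segment.is_discontinuous())
```

Keeping the labels in a dictionary keyed by dimension is one simple way to allow a segment to be tagged in several dimensions simultaneously.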

4 See Geertzen et al. 2004


Figure 4: annotator's interface of the ANVIL tool

2.4.3.2 Semantic role annotation

We define a semantic role as the type of relationship that a participant plays in some real or imagined situation; therefore the semantic role annotation task involved two main activities:

Identification and labeling of markables: expressions that represent the entities involved in semantic role relations. Markables come in two varieties:

anchors, which correspond to one of three situation (or ‘eventuality’) types: events, states and facts (every semantic role must be ‘anchored’ to a situation of one of these types). Anchors are realised mainly by verbs but sometimes also by nouns.

situation participants. They are realised mainly by nouns, noun phrases and pronouns (ignoring event coreference, temporal coreference, etc.).

Identification and labeling of links: referential relations between participant and anchor markables.
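The two activities above can be sketched as a small data model: markables (anchors and participants) and links between them. This is only an illustration under our own naming; `Anchor`, `Participant` and `RoleLink` are hypothetical classes, unrelated to the GATE annotation types, and the role labels in the usage lines are placeholders rather than the LIRICS data categories.

```python
from dataclasses import dataclass

# Every semantic role must be anchored to a situation of one of these types.
SITUATION_TYPES = {"event", "state", "fact"}


@dataclass(frozen=True)
class Anchor:
    """Markable denoting a situation; realised mainly by a verb or a noun."""
    id: str
    text: str
    situation_type: str

    def __post_init__(self):
        if self.situation_type not in SITUATION_TYPES:
            raise ValueError(f"unknown situation type: {self.situation_type!r}")


@dataclass(frozen=True)
class Participant:
    """Markable denoting a situation participant (noun, noun phrase, pronoun)."""
    id: str
    text: str


@dataclass(frozen=True)
class RoleLink:
    """Link from a participant markable to an anchor markable."""
    participant: Participant
    anchor: Anchor
    role: str  # placeholder label


# "The chef sliced the bread."
sliced = Anchor("a1", "sliced", "event")
chef = Participant("p1", "The chef")
bread = Participant("p2", "the bread")
links = [RoleLink(chef, sliced, "role-1"), RoleLink(bread, sliced, "role-2")]
print([(link.participant.text, link.role, link.anchor.text) for link in links])
```

Validating the situation type when an anchor is created enforces the constraint that a role can only be attached to an event, a state or a fact.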

Test suites of at least 500 sentences per language with semantic role annotations were constructed for four languages: English, Dutch, Italian, and Spanish.

For Dutch and English, all test suite material was annotated independently by at least three different annotators, in order to investigate the usability of the tagset in terms of inter-annotator agreement. For English, FrameNet and PropBank data were used. We selected three unbroken FrameNet texts (120 sentences) and separate sentences (83 sentences). The PropBank data consist of isolated sentences (355 sentences). For Dutch, 15 unbroken texts were selected from news articles, with a total of 260 sentences. News articles were also selected to construct the Italian test suite (101 sentences); all files were taken from the Italian Treebank corpus. For Spanish, the LIRICS test suite consists of 189 sentences taken from the Spanish FrameNet corpus.

The annotations were made using the GATE annotation tool from the University of Sheffield5. GATE provides annotators with a graphical interface for indicating which pieces of text denote relevant concepts (the ‘markables’). For the LIRICS annotation task, two types of annotation label were added to GATE: SemanticAnchor and SemanticRole (an updated gate.jar file was provided by UtiL).

Figure 5: GATE interface for annotators

2.4.3.3 Reference annotation

Reference annotation was performed on corpus material for four languages, English, Dutch, Italian, and German:

For English 177 sentences were selected from the FrameNet corpus6. In their annotation with respect to referential relations, 375 markables and 233 links were identified. In addition, 142 sentences were selected from the MUC-6 891102-0148 corpus. In the annotation of these sentences 331 markables and 221 links were identified and labeled.

5 See: http://gate.ac.uk for further details and http://gate.ac.uk/documentation.html for documentation.

6 See http://framenet.icsi.berkeley.edu/ for more information.


For Dutch, 274 sentences from news articles were selected for reference annotation. Annotators identified and labeled 494 markables and 327 coreferential links.

For Italian, 137 sentences from Italian newspaper articles were annotated, in which 736 markables and 265 coreferential links were identified and labeled.

The German test suite consisted of 232 sentences from newspaper articles (Handelsblatt, financial news), in which 98 markables and 175 coreferential links were identified.
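For comparison across languages, the figures reported above can be collected in one place. The short sketch below only restates those figures and sums them; the dictionary and function names are ours.

```python
# (sentences, markables, links) per test suite, as reported above.
REFERENCE_SUITES = {
    "English (FrameNet)": (177, 375, 233),
    "English (MUC-6)": (142, 331, 221),
    "Dutch": (274, 494, 327),
    "Italian": (137, 736, 265),
    "German": (232, 98, 175),
}


def totals(suites):
    """Sum sentences, markables and links over all test suites."""
    sentences = sum(s for s, _, _ in suites.values())
    markables = sum(m for _, m, _ in suites.values())
    links = sum(k for _, _, k in suites.values())
    return sentences, markables, links


print(f"{'suite':<20} {'sent.':>6} {'mark.':>6} {'links':>6}")
for name, (s, m, k) in REFERENCE_SUITES.items():
    print(f"{name:<20} {s:>6} {m:>6} {k:>6}")
print(f"{'total':<20} {totals(REFERENCE_SUITES)[0]:>6} "
      f"{totals(REFERENCE_SUITES)[1]:>6} {totals(REFERENCE_SUITES)[2]:>6}")
```

Over the five suites this gives 962 sentences, 2034 markables and 1221 links.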

The annotations were performed using the PALinkA annotation tool7, an XML-based tool that was originally designed for the purpose of referential relation annotation.

Figure 6: PALinkA annotation interface

For semantic role annotation, the state of the art in computational linguistics is such that there are widely diverging views on what may constitute a useful set of semantic roles, with the FrameNet and PropBank initiatives as two opposite extremes. We have proposed a set of data categories that corresponds roughly to the upper levels of the FrameNet hierarchy, but with a more strictly semantic orientation. In view of these circumstances, we have carried out an investigation into the usability of the proposed set of descriptors by having material in English and Dutch (partly taken from FrameNet and PropBank data) annotated independently by three annotators. It turns out that even previously untrained annotators, with no specific background in the area, were able to reach substantial agreement on the use of the LIRICS data categories. This is a welcome and very encouraging result. Outside of and after the LIRICS project, this will be investigated further, also by systematically relating LIRICS annotations to FrameNet and PropBank annotations, and will be reported at conferences and in the literature on semantic annotation.

7 Visit the PALinkA site http://clg.wlv.ac.uk/projects/PALinkA/ for more information and downloads.

For coreference annotation the situation is rather different. The computational linguistics community is less divided in this area, and the LIRICS data categories for reference annotation build on several related efforts in reference annotation. This part of the annotation work presented relatively little difficulty and did not warrant a separate investigation into the usability of the proposed data categories. However, annotators were asked to comment on a number of aspects of their work, and this has resulted in some suggestions for improving the set of data categories for reference, which have been taken into account in the final proposal of this set, as documented in Deliverable D4.3.

For dialogue act annotation the state of the art is such that different annotation schemes use a number of common core descriptors, but vary widely in the number of additional tags, as well as in their granularity, their naming, and the strictness of their definitions. The LIRICS proposal for this domain is based on taking the common core of a range of existing approaches and extending this core in a principled way, with the help of a formalized notion of ‘multidimensionality’ in dialogue act annotation, which has been around informally in this domain for some time. The usability of the LIRICS tagset was evaluated by having two experienced annotators independently annotate the test suites for English and Dutch. The results show near-perfect inter-annotator agreement.
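This report does not state which agreement statistic was used in these evaluations. As an illustration only, the sketch below computes Cohen's kappa, one common chance-corrected measure for two annotators who label the same items; choosing it here is our assumption, and the tag names in the usage lines are placeholders.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same sequence of items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1.0 - expected)


annotator_1 = ["inform", "question", "inform", "answer"]
annotator_2 = ["inform", "question", "answer", "answer"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

In this toy example the annotators agree on three of four items (observed agreement 0.75), chance agreement is 5/16, and kappa is therefore 7/11, about 0.636. With three annotators, as in the semantic role evaluation, a multi-annotator measure or pairwise averaging would be needed instead.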

2.5 Workpackage #5: LIRICS reference implementation platform

The objectives of this workpackage were:

to define web-service APIs following the LIRICS standards defined in workpackages 1 through 4;

to provide open-source reference implementations (web-service software) of these APIs;

to demonstrate improved NLP module integration from different partners in an end-to-end NLP system, with web-service clients forming a platform for users.

Re-use and integration of existing software was encouraged.

The APIs are described in detail in deliverables D5.1.B (LMF), D5.1.C (MAF) and D5.1.D (SynAF).

For lexical annotation (LMF), we provided a reference implementation using MPI's LEXUS service, and for the platform we developed a client that runs in the well-known GATE8 NLP platform. This client connects to LEXUS (or another server following the LMF standard), lets the user log in, and displays the available lexica (which vary according to the user's credentials). The user can then open a lexicon in the GATE client and browse it. (These tools are explained in further detail in D5.2B.)

The DCR platform has been integrated into LEXUS and has become a stable component of the LEXUS framework; testing has been completed.

For morpho-syntactic and syntactic annotation (MAF and SynAF), we provided open-source web-service reference implementations developed in GATE (as well as stand-alone GATE applications) for English, French and Bulgarian. (The services are summarized below but described in detail in D5.2.C.)

8 http://www.gate.ac.uk/


The English MAF service uses standard ANNIE9 components to annotate the input document with POS (part-of-speech) tags and lemmata and a customized component that translates those annotations into the MAF standard. The Bulgarian MAF service is similar but uses the TreeTagger10 (in addition to some of the ANNIE components) with a dataset trained on the Bulgarian Treebank11 for this project.
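The translation step described above, from tagger output to a standard-conformant document, can be sketched as follows. This is not the project's component and does not call GATE or ANNIE: the input is a hand-written list of (form, POS tag, lemma) triples standing in for tagger output, and the element and attribute names are illustrative, only loosely modelled on a token/word-form distinction and not checked against the MAF specification.

```python
import xml.etree.ElementTree as ET


def to_maf_like_xml(tagged):
    """Serialise (form, pos, lemma) triples as a small MAF-like XML string."""
    root = ET.Element("maf")
    for index, (form, pos, lemma) in enumerate(tagged, start=1):
        token_id = f"t{index}"
        # One element for the surface token ...
        token = ET.SubElement(root, "token", {"id": token_id})
        token.text = form
        # ... and one for the word form that refers back to it.
        ET.SubElement(
            root,
            "wordForm",
            {"tokens": token_id, "lemma": lemma, "tag": pos},
        )
    return ET.tostring(root, encoding="unicode")


# Stand-in for tagger output; the POS tags are placeholders.
tagged_sentence = [("Dogs", "NNS", "dog"), ("bark", "VBP", "bark")]
print(to_maf_like_xml(tagged_sentence))
```

Separating tokens from word forms lets one word form point to several tokens, or several word forms to one token, without changing the token layer.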

The French MAF and SynAF services use Tagmatica's TagParser12 for POS-tagging and parsing, with output converted into the MAF and SynAF annotations.

The English and Bulgarian SynAF services use the ANNIE tokenizer and sentence splitter (which work well for Bulgarian as well as English) and a custom component which acts as a GATE wrapper around the Stanford Parser13, using either that parser's standard English dataset (derived from the Penn Treebank14) or a dataset we produced for this project by training the parser on the Bulgarian Treebank (mentioned above), and translates the resulting tags into SynAF annotations. (The University of Sheffield intends to continue developing this component and eventually provide it as a plug-in15 in the GATE distribution.)

We also demonstrated the usability of the standards for German, Spanish and Italian, but the software developed for those languages re-uses tools whose licences prevent us from releasing our implementations as open-source software.

Figure 7: MAF document in the client

9 http://gate.ac.uk/ie/annie.html

10 http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/

11 http://www.bultreebank.org/

12 http://www.tagmatica.com/

13 http://nlp.stanford.edu/downloads/lex-parser.shtml

14 http://www.cis.upenn.edu/~treebank/

15 http://gate.ac.uk/gate/doc/plugins.html


Figure 8: SynAF document in the client

The open-source platform for MAF and SynAF consists of GATE client applications which use the APIs defined in this workpackage to communicate with the servers described above. These also allow the user to save the documents as MAF and SynAF XML and to view the constituency trees of documents parsed with the English and Bulgarian SynAF services, as shown in Figures 9 and 10. The clients are compatible with other GATE resources and applications to allow integration and composability. (The clients are described in further detail, with instructions for installation and use, in D5.3.B.)


Figure 9: SynAF XML displayed in the client

Figure 10: Constituency tree displayed in the SynAF client

Because the state of the art in automatic semantic annotation is not yet good enough to demonstrate the SemAF standards, the project did not define software APIs or provide reference implementations; instead we provided a corpus annotated according to the standard in order to demonstrate its usability.

2.6 Workpackage #6: Dissemination and exploitation

The objective of this workpackage was to disseminate the LIRICS standards widely, to ensure broad adoption by industry, and to put in place the infrastructure needed for the post-project exploitation phase.

The LIRICS partners published more than 20 scientific papers at various conferences. LIRICS and eContent are always mentioned in the acknowledgements section of these articles.

LIRICS organized two Industry Advisory Group meetings: one in Barcelona in 2005 and one in Paris in 2007 where the ISO-TC37/SC4 standards were presented and discussed.

LIRICS partners made sure that the standardization efforts in LIRICS were made visible and useful to other EC projects. For example, DFKI introduced the main results of LIRICS in the K-Space NoE project (6th Framework), establishing connections between standards for language resources and standards for multimedia content (MPEG-7). Liaison activities have also been conducted with representatives of W3C (among others at the GLDV 2007 workshop in Tuebingen in April 2007). DFKI also proposed to include results from other projects in the LIRICS work on semantic annotation; as a result, a Time Ontology developed within the 6th Framework project MUSING is being considered for future ISO standardization work, as part 2 or part 3 of SemAF.

Many LIRICS partners agreed to continue and further develop LIRICS activities within the CLARIN proposal for a European Infrastructure for Language Resources. Very fruitful contacts have also been established with the newly launched Japanese (and in fact Asian) Language Grid initiative (see http://langrid.nict.go.jp/). The aim here is to ontologize LIRICS/ISO standards in the context of linguistic web services, based on a common ontology for language data, tools and web services. As one can see, results of LIRICS have a huge potential for being included in new initiatives relevant for the language industry.

2.7 Workpackage #7: Management

INRIA, as project manager, managed the financial coordination between all project partners in order to ensure consolidation of cost statements and follow-up of EC payments.

The project manager wrote progress reports and provided internal information exchange through a discussion list ([email protected]) and a web site (http://LIRICS.loria.fr).

Regular meetings were organized mainly in conjunction with ISO meetings and NLP conferences.

INRIA monitored the project progress by means of a document called "LIRICS management board", a snapshot of the partners' milestones and deliveries. This document was maintained by INRIA and sent on a regular basis through the LIRICS discussion list. It not only showed the deliveries as specified in the LIRICS technical annex, but also presented the dependencies among the sub-parts of a specific delivery. For instance, if a tagger for French is to be produced by the French partner, the deliverable that collects and integrates the taggers for all languages depends on it.

INRIA also verified the synchronization between the LIRICS schedule and the ISO work, especially concerning the international ballots and meetings.

3 Contribution to EC policies

3.1 Policy on standardization

Standardization activities could not be addressed on a purely European level and strong articulation with international efforts is essential. This is why LIRICS focused on ISO processes.

The LIRICS project situated itself in the following perspectives:

it is essential to develop a European language industry that competes with equivalent structures in the US or Asia;


it is necessary to ensure that the European linguistic variety is represented;

it is important to cover non European needs in order to address potential EU export markets.

The standards issued within LIRICS are strong candidates for regulation, since they could be part of a future policy on the open and wide diffusion of linguistic knowledge in Europe. These standards are also likely to help the Commission deal with multilingualism in its own administration.

3.2 Contribution to economic and social objectives

ISO standards foster innovation and competition. Selective and appropriate adoption of standards assures researchers, content producers and other organizations that their effort is not wasted for lack of a common reference. Consumers are assured that the products and services have been developed according to specifications which they are also able to access, if desired, as opposed to proprietary non-standard development methods.

LIRICS standards will allow better access to linguistic content and lexical data for educational purposes. The citizens will have better access to knowledge via improved technology for multilingual document processing and enhancing communication between multilingual communities.

4 Conclusion

Since its start in January 2005, the LIRICS project has steadily gained influence on the international standardization scene. Most of the projects carried out in the last two years have been either led or strongly influenced by one (if not several) LIRICS partners. From the point of view of the LIRICS project (and DOW) proper, and despite the high risk of tightly connecting the technical work of the project to the actual international ISO agenda, the whole team is quite proud that all the standardization endeavors that we thought the project should be pushing forward have reached a stage where the impact on the NLP field can already be evaluated as very high. Moreover, some new standardization activities which we had hardly thought mature enough at the time the project was designed have become new ISO projects, namely SemAF/Time (on temporal annotation) and MLIF (on the representation of multilingual information). To take the latter as an example, the technical competence and the strong connectivity of the LIRICS partnership have not only led to a technically sound document that is about to go to a CD (Committee Draft) ballot in ISO, but have also attracted reference players in LISA/OSCAR (in the domain of translation memories, TMX) and OASIS (for localisation data, XLIFF), which have now expressed their wish to be associated with this initiative.

In the same way, the recent plenary meetings of ISO/TC37 in Provo have shown that the methodology defended within the LIRICS project for the design of new standards in Language Engineering has found its way into all other committees, and we now see the possibility of moving towards an integrated Data Category Registry for language resources, under the auspices of the Max Planck Institute in Nijmegen. A strong example of this could be the development of ISO 639-4, which is about to provide a generic framework for language description to encompass the other ISO 639 standards on language coding. Here, INRIA and the University of Surrey, with the help of Max Planck researchers, have produced all the relevant technical content.

The LIRICS partners are aware that most of the effort still lies ahead of us if we are to secure the achievements reached so far, both in continuing the technical work (with the GATE reference implementation) and in spreading the gospel to a wider audience. The various technical developments and sample data will of course help to improve the dissemination activity, but there should be a real strategy that has to be pushed internationally in the ISO realm rather than strictly within the EU arena. This process has already started with the LMF activity, where the last ISO meeting in August led to a global decision that the next step should be to put together a forum where samples, reference implementations such as Lexus and template LMF-compliant schemas would be gathered. This should become a systematic direction to follow for all projects in ISO/TC37/SC4.

As a word of conclusion, we think that the LIRICS project could play a seminal role in showing that the EU has a central role to play in fostering standardization activities in strategic domains like language technology. Moreover, and beyond sheer funding issues, it is essential that a stronger policy be designed to make sure that future projects in this domain will not be supported unless they show a strong awareness of international standards. This is particularly true in the context of the new research infrastructures that have to be set up in the humanities (cf. e.g. the ESFRI roadmap) and that should lead to coherent technical schemes to be implemented in the various EU countries.
