NLP Lexicon Requirements ... & LMF

Nicoletta Calzolari, Istituto di Linguistica Computazionale - CNR - Pisa

Nijmegen, August 2010


Looking into the past

It all started with the situation we had in the late ‘80s and early ‘90s

With all the Xxx-LEX projects: MultiLex, GeneLex, AcquiLex, ... Xxx-Lex

A. Zampolli: “Let’s be coherent”

After the “Grosseto Workshop” (1985): a turning point

EAGLES, ISLE: Standards, Best Practices, ...

Reusability as key concept (true also today)

• To avoid duplication of efforts, costs, etc.
• To allow synergies, integration, exchange of data, ...
• To provide a model for new data creation & acquisition

Decide on “feasible” areas & state priorities (this is changing over time)

The feasibility of formulating consensual standards is a strong sign of maturity in the field: we can’t propose standards if there are not enough results on which to base them

EAGLES was launched in ‘93

Key issue: do conditions exist for a standardisation effort?

Main Results in Lexicon & Corpus WGs

First Phase (www.ilc.pi.cnr.it/EAGLES96/home.html)


Standard for morphosyntactic encoding of lexical entries, in a multi-layered structure, with applications for all the EU languages

Standard for subcategorisation in the lexicon: a set of standardised basic notions using a frame-based structure

Proposal for a basic set of notions in lexical semantics: focus on requirements of Information Systems and MT

Corpus Encoding Standard (CES) from TEI

Standard for morphosyntactic annotation of corpora, to ensure compatibility/interchangeability of concrete annotation schemata

Preliminary recommendations for syntactic annotation of corpora

Dialogue annotation, for integration of written and spoken annotation


Content vs. Format/Representation

Work on lexical description deals with two aspects

• Linguistic description of lexical items (content)
• Formal representation of lexical descriptions (format)

EAGLES concentrated on linguistic content, not disregarding the formal representation of the proposal

TEI: more on format/representation issues

LMF: on the abstract meta-model

Flexibility in the Recommendations, e.g. Morphosyntax

Level   Information Type                           Recommendation
L-0     Part-of-Speech                             Obligatory
L-1     Morphosyntactic agreement features         Recommended
L-2     Language-specific (or refined) features    Optional
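The three levels can be read as a simple validation policy for a lexical entry. The following is a minimal sketch under stated assumptions: the feature names (`pos`, `gender`, `number`, `person`, `clitic_type`) and their assignment to levels are hypothetical illustrations, not the EAGLES inventory.

```python
# Illustrative only: feature names and the level each one belongs to are
# assumptions, not taken from the EAGLES recommendations.
LEVELS = {
    "L-0": {"status": "obligatory", "features": {"pos"}},
    "L-1": {"status": "recommended", "features": {"gender", "number", "person"}},
    "L-2": {"status": "optional", "features": {"clitic_type"}},
}


def check_entry(entry):
    """Return (errors, warnings) for a morphosyntactic description.

    Missing obligatory features are errors, missing recommended features
    are warnings, and missing optional features are never reported.
    """
    errors, warnings = [], []
    for level, spec in LEVELS.items():
        missing = sorted(spec["features"] - set(entry))
        if spec["status"] == "obligatory":
            errors += [f"{level}: missing {name}" for name in missing]
        elif spec["status"] == "recommended":
            warnings += [f"{level}: missing {name}" for name in missing]
    return errors, warnings


noun = {"pos": "noun", "gender": "feminine", "number": "singular"}
errors, warnings = check_entry(noun)
print(errors)    # nothing obligatory is missing
print(warnings)  # 'person' is recommended but absent
```

A stricter or looser policy (for instance, treating L-1 as obligatory for a given language) would only change the `status` values, which is one way to picture the flexibility the slide refers to.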


MERITS: Strengths (from EAGLES-ISLE)

• Standardisation as a necessary component of any strategic programme to create a coherent market
• Leading industrials & academics participated (> 150 EU groups)
• Bottom-up, community-created standards
• To avoid wasting time reinventing basic/consolidated knowledge
  May be true also for many “humanities” users, not interested in debates on specific lexical approaches
• Work otherwise duplicated among many projects, done just once in a collaborative manner (overall cost-effectiveness)
• Allows the field to be more competitive:
  Concentrate efforts on innovative areas
  Engage in new/advanced technology

Why Standards for Language Resources? (from EAGLES-ISLE)

To ensure:

interoperability of systems (& data), through compatible interfaces

reusability and integrability of components

training based on consensual technical specifications and models (“gold standards”)

evaluation & validation based on agreed criteria

transition from prototypes to HLT products

important for workflows

essential for a LR Infrastructure

for evaluation campaigns


Applications: requirements for systems & enabling technologies

• Machine Translation
• Information Extraction
• Information Retrieval
• Summarisation
• Natural Language Generation
• Word Clustering
• Multiword Recognition + Extraction
• Word Sense Disambiguation
• Proper Noun Recognition
• Parsing
• Coreference
• …

For HLT, knowledge of applications’ requirements is essential

The Multilingual ISLE Lexical Entry (MILE)

General methodological principles (from EAGLES)

Basic requirements for the design of the MILE:

Discover and list the (maximal) set of basic notions needed to describe the MILE (up to which level standardisation is feasible?)

Granularity

The leading principle: the edited union of existing lexicons/models (redundancy is not a problem)

Modular & layered

Allow for under-specification (& hierarchical structure)


MILE: Modularity. The building-block model

[Figure: lexical objects (syntactic frame, phrase, slot, syntactic feature, semantic feature) used as building blocks of Lexical entry 1, Lexical entry 2 and Lexical entry 3]

• Allows different dimensions of lexical entries to be expressed
• Enables modular specification of lexical entries
• Creates ready-to-use packages to be combined in different ways

Lexical Classes as the main building blocks of the lexical architecture

Done in LMF
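The building-block idea can be pictured as lexical objects that are defined once and then referenced by several entries. This is a minimal sketch, not the MILE or LMF class inventory: the class names, attributes and example words are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SyntacticFrame:
    """A reusable lexical object: a frame made of named slots."""
    name: str
    slots: tuple


@dataclass(frozen=True)
class SemanticFeature:
    """A reusable lexical object: a single semantic feature."""
    name: str


@dataclass
class LexicalEntry:
    """An entry assembled from shared building blocks."""
    lemma: str
    frames: list = field(default_factory=list)
    features: list = field(default_factory=list)


# Shared objects are defined once ...
transitive = SyntacticFrame("transitive", ("subject", "object"))
human = SemanticFeature("HUMAN")

# ... and combined in different ways by different entries.
entries = [
    LexicalEntry("teacher", features=[human]),
    LexicalEntry("teach", frames=[transitive]),
    LexicalEntry("hire", frames=[transitive]),
]

# Both verbs point at the very same frame object, not at copies.
print(entries[1].frames[0] is entries[2].frames[0])
```

Because the frame is shared rather than copied, a correction to the frame would be visible from every entry that uses it, which is the practical benefit of the modular design.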


The MILE Data Categories: user-adaptability and extensibility

[Figure: instances (instance_of) of MLC:SemanticFeature, grouped as Core and UserDefined: HUMAN, ARTIFACT, EVENT, ANIMAL, GROUP; AGE, MAMMAL]

OK in ISOCat

MILE Lexical Data Category Registry: a library of pre-instantiated objects

Define (an ontology of) lexical objects
• represent lexical notions such as semantic unit, syntactic feature, syntactic frame, semantic predicate, semantic relation, synset, etc.
• specify the relevant attributes
• define the relations with other classes
• hierarchically structured

Can be used “off the shelf” or as a departure point for the definition of new or modified categories

DC Selections

To be done … in ISOCat
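A registry that can be used “off the shelf” or as a departure point for new categories might look like the sketch below. It is a toy under stated assumptions: the class `DataCategoryRegistry`, its methods, and the split of the example values into core (HUMAN, ARTIFACT, EVENT, ANIMAL, GROUP) and user-defined (AGE, MAMMAL) are illustrative guesses, not the MILE registry or the ISOCat API.

```python
class DataCategoryRegistry:
    """Toy registry: core categories are fixed, user categories can be added."""

    def __init__(self, core):
        # core maps a category name to its set of admissible values
        self._core = {name: frozenset(values) for name, values in core.items()}
        self._user = {}

    def define(self, name, values, based_on=None):
        """Register a user-defined category, optionally extending another one."""
        if name in self._core:
            raise ValueError(f"{name} is a core category and cannot be redefined")
        inherited = self.values(based_on) if based_on else frozenset()
        self._user[name] = inherited | frozenset(values)

    def values(self, name):
        """Return the admissible values of a core or user-defined category."""
        if name in self._core:
            return self._core[name]
        return self._user[name]


registry = DataCategoryRegistry(
    {"SemanticFeature": {"HUMAN", "ARTIFACT", "EVENT", "ANIMAL", "GROUP"}}
)

# "Off the shelf" use of a core category
print("HUMAN" in registry.values("SemanticFeature"))

# Departure point for a modified category (assumed user-defined values)
registry.define("MySemanticFeature", {"AGE", "MAMMAL"}, based_on="SemanticFeature")
print(sorted(registry.values("MySemanticFeature")))
```

Keeping core categories immutable while allowing derived ones is one simple way to reconcile a shared standard with user extensibility.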


ISO - LMF: Lexical Markup Framework

Designed to accommodate as many models of lexical representation as possible

Its pros:
• Meta-model: abstract high-level specification (ISO 24613)
• Based on constants defined in the Data Category Registry: low-level specifications (ISO 12620)
• Not a monolithic model, rather a modular framework
• The LMF library provides the hierarchy of lexical objects (with structural relations among them)
• The Data Category Registry provides a library of descriptors to encode linguistic information associated with lexical objects (N.B. Data Categories can also be user-defined)

ISO LMF

Core Package: structural skeleton, with the basic hierarchy of information in a lexical entry

+ various extensions: Morphology, MRD, NLP Syntax, NLP Semantic, NLP Multilingual notations, NLP MWE pattern, NLP Paradigm class, Constraint Expression

Modular framework

LMF specs comply with UML modelling principles; an XML DTD allows implementation

Builds on EAGLES/ISLE

The field is mature

[Figure labels: LIRICS, NEDO Asian Lang., NICT Language-Grid Service Ontology, ICT KYOTO, LexInfo, new initiatives…]

from Monica Monachini (Barcelona, IEC, 7-8 July 2009)

Principles of LMF: from very simple lexicons …

[Figure: UML class diagram with the classes Lexicon, Lexical Entry, Lemma, Form, Word Form, StemOrRoot, Derived Form, Related Form, Referred Root, List Of Components, Component, Form Representation, Morphological Features and Sense; associations carry multiplicities such as 1, 0..1, 0..* and {ordered} 2..*]

(Note: insert a PAROLE entry in LMF-compliant XML)
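A very simple lexicon in the spirit of this diagram can be sketched with a handful of classes. The sketch below is an illustration, not the ISO 24613 model: it keeps only Lexicon, Lexical Entry, Lemma, Word Form and Sense, and the attribute names (`written_form`, `part_of_speech`, `gloss`) and the multiplicities noted in comments are assumptions, since the extracted diagram does not say which multiplicity belongs to which association.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Form:
    written_form: str


@dataclass
class Lemma(Form):
    pass


@dataclass
class WordForm(Form):
    # Morphological features as attribute/value pairs, e.g. {"number": "plural"}
    features: dict = field(default_factory=dict)


@dataclass
class Sense:
    id: str
    gloss: Optional[str] = None


@dataclass
class LexicalEntry:
    lemma: Lemma                                               # assumed: exactly one
    part_of_speech: str
    word_forms: List[WordForm] = field(default_factory=list)   # assumed: 0..*
    senses: List[Sense] = field(default_factory=list)          # assumed: 0..*


@dataclass
class Lexicon:
    language: str
    entries: List[LexicalEntry] = field(default_factory=list)

    def lookup(self, written_form):
        """Return entries whose lemma or any word form matches exactly."""
        return [
            e for e in self.entries
            if e.lemma.written_form == written_form
            or any(w.written_form == written_form for w in e.word_forms)
        ]


lexicon = Lexicon("en", [
    LexicalEntry(
        Lemma("child"), "noun",
        word_forms=[WordForm("children", {"number": "plural"})],
        senses=[Sense("child_1", "a young person")],
    ),
])
print([e.lemma.written_form for e in lexicon.lookup("children")])
```

Richer lexicons would add further classes around the same skeleton rather than change it.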


DCR

to very rich ones …


Mapping experiment

Major best practices:
• OLIF
• PAROLE/SIMPLE
• LC-Star (Speech Lexicon)
• WordNet - EuroWordNet
• FrameNet
• BDef: formal database of lexicographic definitions derived from the Explanatory Dictionary of Contemporary French

Entries from major existing lexicons mapped to LMF:
• To prove that the model is able to represent many best practices
• To test the expressive potential and the adequacy of the architectural model & linguistic objects

from Monica Monachini

BioLexicon: SIMPLE model & ISO-LMF standard

• A unique large-scale computational lexicon in the biomedical domain in terms of coverage & typology of information
• Populated with info from available biomedical resources
• Semi-automatically populated from corpora: population toolkit available
• Including both domain-specific & general language words
• Rich linguistic information ranging over different levels of linguistic description
• Conformant to international lexical representation standards
• Designed to meet Bio-Text Mining requirements

from Monica Monachini

The BioLexicon: why

LMF proved able to provide Text Mining systems in the biomedical domain with a substantial lexicon covering:
• Biomedical term variants (orthographic, semantic, geographical, …), for better information retrieval
• Terminological verbs and their combinatorial properties (subcategorization frames and predicate-argument structure), for better information extraction and question answering
• Word derivations, to reach similar meaning expressed in different ways (e.g. activation vs activate)
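How term variants and derivations help retrieval can be shown with a small normalisation step: every known surface form is mapped to one canonical entry before matching. This is a sketch with made-up data. The entries and variant lists are illustrative and not taken from the BioLexicon; only the pair activation/activate comes from the slide.

```python
# Illustrative data only; these entries are not taken from the BioLexicon.
ENTRIES = {
    "NF-kappaB": {"variants": {"NF-kB", "NF kappa B", "nuclear factor kappa B"}},
    "activate": {"derived": {"activation", "activator"}},
}


def build_index(entries):
    """Map every surface form (lower-cased) to its canonical entry."""
    index = {}
    for canonical, info in entries.items():
        forms = {canonical} | info.get("variants", set()) | info.get("derived", set())
        for form in forms:
            index[form.lower()] = canonical
    return index


def normalise(tokens, index):
    """Replace each known surface form by its canonical entry."""
    return [index.get(tok.lower(), tok) for tok in tokens]


index = build_index(ENTRIES)
print(normalise(["activation", "of", "NF-kB"], index))
# → ['activate', 'of', 'NF-kappaB']
```

A query and a document normalised this way can match even when one says "activation of NF-kB" and the other "activate NF-kappaB".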

ICT-211423


KYOTO: the lexical resource perspective

KYOTO objectives:
• “… facilitating the exchange of information across languages, domains and cultures”
• “… allow definition of word meaning in a shared Wiki platform”

From the point of view of linguistic resources:
• … needs to share lexical & knowledge bases, both general & domain-related, in the form of lexical repositories and ontologies

KYOTO SYSTEM

[Figure: KYOTO system architecture. Components shown: Source Documents; Linear MAF/SYNAF; Linear SEMAF (semantic annotation); Linear Generic FACTAF; term extraction (Tybot); Generic TMF; fact extraction (Kybot); domain editing (Wikyoto); Wordnet and Domain Wordnet (LMF API); Ontology and Domain ontology (OWL API); Concept User; Fact User]

from Piek Vossen

A common representation format for WordNets

Seven WordNets (WnIT, WnEN, WnEU, WnNL, WnJP, WnCH, WnES): similar but not identical, which hampered interoperability

• to be accessed both intra- and inter-linguistically
• to support easier integration

Endow WordNet with a representation format allowing easy access, integration & interoperability among resources

A common representation format: WordNet-LMF

[Figure: WordNet-LMF class diagram with the classes LexicalResource, GlobalInformation, Lexicon, LexicalEntry, Lemma, Sense, MonolingualExternalRefs / MonolingualExternalRef, Synset, Definition, Statement, SynsetRelations / SynsetRelation, SenseAxes / SenseAxis, InterlingualExternalRefs / InterlingualExternalRef and optional Meta elements; multiplicities 1..1, 1..*, 0..1, 0..*]

Data Categories

from Monica Monachini

Centralized WordNet DC Registry

A list of 85 semantic relations as a result of a mapping of the KYOTO WordNet grid

[Figure labels: Intra-WN, Inter-WN]

WordNet-LMF Multilingual level - Cross-lingual Relations

[Figure: aligned synsets: SWN <fuego_3, llama_1> 09686541-n; IWN <fuoco_1, fiamma_1> 00001251-n; WN3.0 <fire_1 flame_1 flaming_1> 13480848-n]

<!ELEMENT SenseAxes (SenseAxis+)>
<!ELEMENT SenseAxis (Meta?, Target+, InterlingualExternalRefs?)>
<!ATTLIST SenseAxis
    id ID #REQUIRED
    relType CDATA #REQUIRED>
<!ELEMENT Target EMPTY>
<!ATTLIST Target
    ID CDATA #REQUIRED>
<!ELEMENT InterlingualExternalRefs (InterlingualExternalRef+)>
<!ELEMENT InterlingualExternalRef (Meta?)>
<!ATTLIST InterlingualExternalRef
    externalSystem CDATA #REQUIRED
    externalReference CDATA #REQUIRED
    relType (at|plus|equal) #IMPLIED>

Callouts on the DTD:
• groups monolingual synsets corresponding to each other and sharing the same relations to English
• specifies the type of correspondence
• link to ontology/(ies)

from Monica Monachini
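To make the DTD concrete, the sketch below reads a hand-written instance and checks the constraints a validating parser would enforce. It rests on several assumptions: the instance document is invented for illustration, the `id` and `relType` values, the language prefixes on the Target IDs and the SUMO reference `Fire` are all hypothetical, and the standard-library parser used here does not validate against a DTD, so the checks are done by hand.

```python
import xml.etree.ElementTree as ET

# Instance written by hand to follow the DTD; the id, relType value,
# Target ID spellings and the SUMO reference are assumptions.
SENSE_AXES = """
<SenseAxes>
  <SenseAxis id="sa_fire" relType="eq_synonym">
    <Target ID="spa-09686541-n"/>
    <Target ID="ita-00001251-n"/>
    <Target ID="eng-13480848-n"/>
    <InterlingualExternalRefs>
      <InterlingualExternalRef externalSystem="SUMO"
                               externalReference="Fire" relType="at"/>
    </InterlingualExternalRefs>
  </SenseAxis>
</SenseAxes>
"""

ALLOWED_REF_TYPES = {"at", "plus", "equal"}  # enumerated in the DTD


def read_axes(xml_text):
    """Return one dict per SenseAxis, checking a few DTD constraints by hand."""
    axes = []
    for axis in ET.fromstring(xml_text).findall("SenseAxis"):
        targets = [t.get("ID") for t in axis.findall("Target")]
        if not targets:
            raise ValueError("SenseAxis needs at least one Target")
        refs = []
        for ref in axis.findall("InterlingualExternalRefs/InterlingualExternalRef"):
            rel = ref.get("relType")  # #IMPLIED, so it may be absent
            if rel is not None and rel not in ALLOWED_REF_TYPES:
                raise ValueError(f"unexpected relType {rel!r}")
            refs.append((ref.get("externalSystem"), ref.get("externalReference"), rel))
        axes.append({"id": axis.get("id"), "relType": axis.get("relType"),
                     "targets": targets, "refs": refs})
    return axes


for axis in read_axes(SENSE_AXES):
    print(axis["id"], axis["relType"], axis["targets"], axis["refs"])
```

Each SenseAxis thus acts as a small record: a group of synset identifiers from different wordnets plus optional pointers into one or more ontologies.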

Kyoto Knowledge Base

[Figure: the seven wordnets (WnIT, WnEN, WnEU, WnNL, WnJP, WnCH, WnES), each with a Domain extension, together with the Ontology and the Domain Ontology]

LMF and Named Entity Lexicon

LRs enriched with NEs can be useful within QA to:
• Find answers
• Validate answers

Construction of a multilingual NE lexicon, automatically acquired:
• Source: Wikipedia (dynamic source, huge amount of NEs, some degree of structure)
• NEs extracted from Wikipedia and linked to entries of LRs and ontologies

from Monica Monachini

Named Entity Lexicon

<Sense id="en_s_city_1">
  <MonolingualExternalRef>
    <feat att="external_system" val="EnWordNet"/>
    <feat att="external_reference" val="noun.loc:city0"/>
  </MonolingualExternalRef>
</Sense>

<SenseAxis id="sa_001" senses="en_s_Florence it_s_Firenze">
  <feat att="type" val="eq_syn"/>
  <InterlingualExternalRef>
    <feat att="external_system" val="SUMO"/>
    <feat att="external_reference" val="City"/>
    <feat att="external_reltype" val="at"/>
  </InterlingualExternalRef>
  <InterlingualExternalRef>
    <feat att="external_system" val="SIMPLE"/>
    <feat att="external_reference" val="Geopolitical_location"/>
    <feat att="external_reltype" val="at"/>
  </InterlingualExternalRef>
</SenseAxis>

<Sense id="en_s_Florence">
  <SenseRelation targets="en_s_city_1">
    <feat att="semanticrelation" val="instance_of"/>
  </SenseRelation>
  <MonolingualExternalRef>
    <feat att="external_system" val="EnWikipedia"/>
    <feat att="external_reference" val="11525"/>
  </MonolingualExternalRef>
</Sense>

[Figure labels: Wikip(edia), LR, Onto(logy)]

from Monica Monachini
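The links encoded in these fragments can be followed programmatically. The sketch below parses the two Sense elements and reports, for the Florence sense, its Wikipedia reference and its instance_of relation. The `<Lexicon>` wrapper element and the helper names `feats` and `describe` are assumptions added so that the fragments parse as one document; they are not part of the slide.

```python
import xml.etree.ElementTree as ET

# The two Sense elements from the slide, wrapped in a hypothetical <Lexicon>
# root so that the fragment parses as a single document.
FRAGMENT = """
<Lexicon>
  <Sense id="en_s_city_1">
    <MonolingualExternalRef>
      <feat att="external_system" val="EnWordNet"/>
      <feat att="external_reference" val="noun.loc:city0"/>
    </MonolingualExternalRef>
  </Sense>
  <Sense id="en_s_Florence">
    <SenseRelation targets="en_s_city_1">
      <feat att="semanticrelation" val="instance_of"/>
    </SenseRelation>
    <MonolingualExternalRef>
      <feat att="external_system" val="EnWikipedia"/>
      <feat att="external_reference" val="11525"/>
    </MonolingualExternalRef>
  </Sense>
</Lexicon>
"""


def feats(element):
    """Collect the direct <feat att=... val=...> children into a dict."""
    return {f.get("att"): f.get("val") for f in element.findall("feat")}


def describe(sense):
    """Return the external reference and the outgoing relations of a sense."""
    ref = feats(sense.find("MonolingualExternalRef"))
    relations = [
        (feats(rel)["semanticrelation"], rel.get("targets"))
        for rel in sense.findall("SenseRelation")
    ]
    return ref, relations


root = ET.fromstring(FRAGMENT)
senses = {s.get("id"): s for s in root.findall("Sense")}
ref, relations = describe(senses["en_s_Florence"])
print(ref["external_system"], ref["external_reference"])  # EnWikipedia 11525
print(relations)  # [('instance_of', 'en_s_city_1')]
```

Following the `targets` value into `senses` then gives the WordNet reference of the class the named entity is an instance of.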

LexInfo & Previous Models

LingInfo: modeling morphosyntactic decomposition of (complex) terms [Buitelaar et al. 2006]

LexOnto: capturing syntactic behaviour and syntax-semantics links [Cimiano et al. 2007]

Lexical Markup Framework (LMF): ISO standardised model for representing machine readable lexica (agnostic about connection with ontology) [Francopoulo et al. 2007]

LexInfo: building on LMF as a core, develop a model which “subsumes” LingInfo and LexOnto for flexibly associating linguistic information to ontologies [Buitelaar, Cimiano, Haase, Sintek 2009]

from Paul Buitelaar

LMF: ILC infrastructure

Desiderata for Semantic Roles

First step:
• What are semantic roles?
• Why do we need standards?
• Start with Lirics

• Consistently recognizable
• Clarify sense distinctions
• Generalizability
• Learnable
• Potential for inferencing

Martha Palmer

Some steps for a “new generation” of LRs

From huge efforts in building static, large-scale, general-purpose LRs
To dynamic LRs rapidly built on-demand, tailored to specific user needs

From closed, locally developed and centralized resources
To LRs residing over distributed places, accessible on the web, choreographed by agents acting over them

From Language Resources
To Language Services

BUT

• Need of tools to make this vision operational & concrete

Interoperability

Lexical WEB & Content Interoperability

As a critical step for semantic mark-up in the SemWeb

[Figure: lexicons (ComLex, SIMPLE, WordNets, FrameNet, NomLex, Lex_x, Lex_y) connected through LMF, with intelligent agents; also shown: Global WordNet GRID, BioLexicon, SIMPLE-WEB]

Standards for Interoperability: enough??

A new paradigm of R&D in LRs & LT: Distributed Language Services

Open & distributed infrastructures for LRs & LT

Adopting the paradigm of accumulation of knowledge so successful in more mature disciplines, based on sharing LRs & LTs
• Ability to build on previous achievements, allowing effective cooperation of many groups on common tasks
• Exchange and integrate information across repositories
• Create new resources on the basis of existing ones
• Compose new services on demand

A new scenario implying:
• content interoperability standards
• development of architectures enabling accessibility
• supra-national cooperation

A few issues for discussion: “content”, guidelines, tools, priorities, ...

• For Semantic Web & “content” interoperability: is the field ‘mature’ enough to converge also at the semantic/conceptual level (e.g. to automatically establish links among different languages)?
• For the standards to have impact, ensure their usability & gain industry support, focusing on requirements of industrial applications
• To have Guidelines which are a “usable product” (to assist in creation or adaptation of lexicons, …)
• Facilitate acceptance of the standards by providing an open-source reference implementation platform & tools, related web services and test suites
• Relation with the spoken language community
• Define further steps necessary to converge on common priorities

Limits observed & needs of further work

For usability & operability of LMF: Data Categories (DC) & others:
• From Japanese NEDO: DC not defined in LMF & LMF non operational
• Asian, African DCs
• Need of DCs organised (easy to use): IsoCat & DC Selections/Profiles
• Need of an ontology of DCs with structure/dependencies, and constraints
• Otherwise the model remains too abstract, and doesn’t say anything on how to implement concretely the different layers
• Link with Ontologies: relations Lexicons-Ontologies
• Need of easy, user-friendly guidelines
• Need of tools to make it operational, also for creating standard-compliant resources: more important than the model!

More dissemination, also with industry
• Linguists may be (rightly for certain purposes) not interested
• Younger colleagues not aware of the past work on standards

Need of operational definitions of interoperability

Need of stimuli also from EC to produce standard-compliant resources (unless differently motivated)

Strengths

• Good set of methodological principles: granularity of basic notions, …
• Many languages already compliant with EAGLES morpho-syntax, etc.
• Many projects today using LMF
• Unified Lexicon experiment between Speechdat & Parole, at ELRA (possible because EAGLES compliant)
• Web services to access LRs based on standards
• Web-based platforms for LR integration
• An open infrastructure of LRT needs standards
• New topics being constantly added: Time, Space, …

Future requirements & planning

To make LMF usable and operational
• LMF User Guidelines with examples
• Mapping of commonly used lexicons into LMF
• Converters
• Data categories for LMF lexicons
• Tools related to LMF, with particular reference to the Lexus tool

Need to address another layer
• The ontological layer in a lexicon
• How lexicons and ontologies are linked and information mapped from each other

An open space in a wiki environment
• to store/link to guidelines, examples
• to allow broad discussion on these topics
• to ease dissemination of LMF

FLaReNet

Mission: structure the area of LR & LT of the future

Worldwide Forum for LRs & LTs
• Consolidate methods, approaches, common practices, architectures
• Integrate so far partial solutions into broader infrastructures

A “roadmap”: a plan of coherent actions as input to policy development
• For the EU, national organisations & industry
• As a model for the LRs/LTs of the next years
• Strengthening the language product market, e.g. for new products & innovative services
• Identifying areas where consensus is achieved/emerging vs. areas where more discussion & testing is required
• Indicating priorities

333 Individual Subscribers, 88 Institutional Members from 31 countries

Standards & Interoperability: topics for cooperation

• A metadata catalogue should involve every party
• Common repositories for LRT, universally & easily accessible
• Try to connect ongoing work done by many groups
• A shared repository of data formats, annotations (where to find the most frequently used and preferred schemes): a major help to achieve standardisation

For a new world-wide language infrastructure
• Create the means to plug together different LR & LT, in a web-based resource and technology grid
• Access to LRT is critical: it involves, and has impact on, all the community
• With the possibility to easily create new workflows
• Create conditions to easily share and re-use technologies, to have more open (source) tools available for use also to under-funded groups

Some results from FLaReNet Vienna Forum: International Cooperation

Special Highlight: Contribute to building the LREC2010 Map!

Time is ripe to launch an important initiative, the LREC2010 Map of Language Resources, Technologies and Evaluation.

The Map will be a collective enterprise of the LREC community, as a first step towards the creation of a very broad, community-built, Open Resource Infrastructure.

First in a series, it will become an essential instrument to monitor the field and to identify shifts in the production, use and evaluation of LRs and LTs over the years.

When submitting a paper (< 900!), from the START page fill in a very simple template to provide essential information about resources (in a broad sense: also technologies, standards, evaluation kits) either used for the work described or newly resulting from your research.

Go to http://www.resourcebook.eu/LreMap !

FLaReNet & the LRE MAP… at LREC & COLING