NLP Lexicon Requirements
Nicoletta Calzolari
Istituto di Linguistica Computazionale - CNR
N. Calzolari Nijmegen, August 2010 1
... & LMF
Looking into the past
All started with the situation we had in the late ‘80s – early ‘90s
With all the Xxx-LEX projects
MultiLex, GeneLex, AcquiLex, ... Xxx-Lex
A. Zampolli: “Let’s be coherent”
After the “Grosseto Workshop” (1985): a turning point
EAGLES, ISLE: Standards, Best Practices, ...
Reusability as key concept (true also today)
To avoid duplication of efforts, costs, etc.
To allow synergies, integration, exchange of data, ...
To provide a model for new data creation & acquisition
Decide on “feasible” areas & state priorities (this is changing over time)
The feasibility of formulating consensual standards is a strong sign of maturity in the field: we can’t propose standards if there are not enough results on which to base them
EAGLES was launched in ‘93
Key issues: do conditions exist for a standardisation effort?
Main Results in Lexicon & Corpus WGs
First Phase (www.ilc.pi.cnr.it/EAGLES96/home.html)
Standard for morphosyntactic encoding of lexical entries, in a multi-layered structure, with applications for all the EU languages
Standard for subcategorisation in the lexicon: a set of standardised basic notions using a frame-based structure
Proposal for a basic set of notions in lexical semantics: focus on requirements of Information Systems and MT
Corpus Encoding Standard (CES) from TEI
Standard for morphosyntactic annotation of corpora, to ensure compatibility/ interchangeability of concrete annotation schemata
Preliminary recommendations for syntactic annotation of corpora
Dialogue annotation, for integration of written and spoken annotation
Content vs. Format/Representation
Work on lexical description deals with two aspects
Linguistic description of lexical items (content)
Formal representation of lexical descriptions (format)
EAGLES concentrated on linguistic content, without disregarding the formal representation of the proposal
TEI: more on format/representation issues
In LMF: on the abstract meta-model
Flexibility in the Recommendations
e.g. Morphosyntax
Level  Information Type                          Recommendation
L-0    Part-of-Speech                            Obligatory
L-1    Morphosyntactic agreement features        Recommended
L-2    Language-specific (or refined) features   Optional
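The three recommendation levels can be read as a validation policy: a missing L-0 feature is an error, a missing L-1 feature only a warning, and L-2 is left open. A minimal Python sketch follows; the feature names (`pos`, `gender`, `number`) and the `check_entry` helper are assumptions made for illustration, not the EAGLES feature inventory.

```python
# Illustrative only: the level names follow the table above, but the
# feature names are assumptions, not taken from the EAGLES recommendations.
LEVELS = {
    "L-0": {"status": "obligatory", "features": {"pos"}},
    "L-1": {"status": "recommended", "features": {"gender", "number"}},
    "L-2": {"status": "optional", "features": set()},  # language-specific, left open
}


def check_entry(entry):
    """Return (errors, warnings) for one morphosyntactic description."""
    errors, warnings = [], []
    for level, spec in LEVELS.items():
        missing = sorted(spec["features"] - set(entry))
        if not missing:
            continue
        message = f"{level}: missing {', '.join(missing)}"
        if spec["status"] == "obligatory":
            errors.append(message)
        elif spec["status"] == "recommended":
            warnings.append(message)
    return errors, warnings


full = {"pos": "noun", "gender": "feminine", "number": "singular"}
partial = {"pos": "noun"}
no_pos = {"gender": "feminine"}

print(check_entry(full))
print(check_entry(partial))
print(check_entry(no_pos))
```

Under these assumptions an entry with only a part of speech still passes, which is the flexibility the layered recommendation aims at.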
MERITS Strengths (from EAGLES-ISLE)
Standardisation as a necessary component of any strategic programme to create a coherent market
Leading industrials & academics participated (> 150 EU groups)
Bottom-up, community-created standards
To avoid wasting time reinventing basic/consolidated knowledge
May be true also for many “humanities” users, not interested in debates on specific lexical approaches
Work otherwise duplicated among many projects, done just once in a collaborative manner (overall cost-effectiveness)
Allows the field to be more competitive:
Concentrate efforts on innovative areas
Engage in new/advanced technology
Why Standards for Language Resources? (from EAGLES-ISLE)
To ensure:
interoperability of systems (& data), through compatible interfaces
reusability and integrability of components
training based on consensual technical specifications and models (“gold standards”)
evaluation & validation based on agreed criteria
transition from prototypes to HLT products
important for workflows
essential for a LR Infrastructure
for evaluation campaigns
Applications: requirements for systems & enabling technologies
Machine Translation
Information Extraction
Information Retrieval
Summarisation
Natural Language Generation
Word Clustering
Multiword Recognition + Extraction
Word Sense Disambiguation
Proper Noun Recognition
Parsing
Coreference
…
For HLT, knowledge of applications’ requirements is essential
The Multilingual ISLE Lexical Entry (MILE)
General methodological principles (from EAGLES)
Basic requirements for the design of the MILE:
Discover and list the (maximal) set of basic notions needed to describe the MILE (up to which level standardisation is feasible?)
Granularity
The leading principle: the edited union of existing lexicons/models (redundancy is not a problem)
Modular & layered
Allow for under-specification (& hierarchical structure)
MILE – Modularity: the building-block model
[Figure: lexical objects (syntactic frame, phrase, slot, Synfeature, Semfeature) combined as building blocks into Lexical entry 1, Lexical entry 2 and Lexical entry 3]
Allow different dimensions of lexical entries to be expressed
Enable modular specification of lexical entries
Create ready-to-use packages to be combined in different ways
Lexical Classes as the main building blocks of the lexical architecture
Done in LMF
The MILE Data Categories: user-adaptability and extensibility
[Figure: data categories as instances (instance_of) of MLC:SemanticFeature. Core: HUMAN, ARTIFACT, EVENT, ANIMAL, GROUP. UserDefined: AGE, MAMMAL]
OK in ISOCat
MILE Lexical Data Category Registry
A library of pre-instantiated objects
Define (an ontology of) lexical objects:
represent lexical notions such as semantic unit, syntactic feature, syntactic frame, semantic predicate, semantic relation, synset, etc.
specify the relevant attributes
define the relations with other classes
hierarchically structured
Can be used “off the shelf” or as a departure point for the definition of new or modified categories
DC Selections
To be done … in ISOCat
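The idea of a registry that can be used “off the shelf” or extended with user-defined categories can be sketched in a few lines of Python. This is a toy illustration, not the MILE or ISOCat implementation: the `DataCategoryRegistry` class and its methods are hypothetical, and treating HUMAN, ARTIFACT, EVENT, ANIMAL and GROUP as core and MAMMAL as user-defined is an assumption based on the labels of the Data Categories slide.

```python
class DataCategoryRegistry:
    """Toy registry: core categories are fixed, users may add their own."""

    def __init__(self, core):
        # Map each category name to the class it instantiates and to its origin.
        self._cats = {
            name: {"instance_of": parent, "origin": "core"}
            for name, parent in core.items()
        }

    def define(self, name, instance_of):
        """Register a user-defined category; existing names cannot be redefined."""
        if name in self._cats:
            raise ValueError(f"{name} is already defined")
        self._cats[name] = {"instance_of": instance_of, "origin": "user"}

    def select(self, origin=None):
        """Return category names, optionally restricted to one origin (a simple 'DC selection')."""
        return sorted(
            name for name, cat in self._cats.items()
            if origin is None or cat["origin"] == origin
        )


registry = DataCategoryRegistry(
    {name: "MLC:SemanticFeature"
     for name in ("HUMAN", "ARTIFACT", "EVENT", "ANIMAL", "GROUP")}
)
registry.define("MAMMAL", "MLC:SemanticFeature")

print(registry.select("core"))
print(registry.select("user"))
```

Refusing to redefine an existing name is one possible design choice; a real registry would more likely version or namespace its categories.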
ISO - LMF
Lexical Markup Framework
Designed to accommodate as many models of lexical representation as possible
Its pros:
Meta-model: abstract high-level specification (ISO 24613)
Based on constants defined in the Data Category Registry: low-level specifications (ISO 12620)
Not a monolithic model, rather a modular framework
LMF library provides the hierarchy of lexical objects (with structural relations among them)
Data Category Registry provides a library of descriptors to encode linguistic information associated with lexical objects (N.B. Data Categories can also be user-defined)
ISO LMF
Core Package: structural skeleton, with the basic hierarchy of information in a lexical entry
+ various extensions: Morphology, MRD, NLP Syntax, NLP Semantic, NLP Multilingual notations, NLP Paradigm class, NLP MWE pattern, Constraint Expression
Modular framework
LMF specs comply with UML modelling principles
An XML DTD allows implementation
Builds on EAGLES/ISLE
The field is mature
NEDO (Asian Lang.), NICT Language-Grid Service Ontology, ICT KYOTO, LIRICS, LexInfo, new initiatives…
from Monica Monachini (Barcelona, IEC, 7-8 July 2009)
Principles of LMF: from very simple lexicons …
[Figure: UML class diagram with multiplicities. Classes: Lexicon, Lexical Entry, Lemma, Form, Word Form, Derived Form, StemOrRoot, Referred Root, Related Form, Form Representation, Morphological Features, List Of Components, Component, Sense]
Insert a PAROLE entry in LMF-compliant XML
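To make the “very simple lexicon” case concrete, here is a hedged Python sketch that reads a small LMF-style XML fragment built from the classes named in the diagram (Lexicon, Lexical Entry, Lemma, Word Form, Sense). The fragment is invented for illustration: the identifiers and the `feat` attribute names (`writtenForm`, `partOfSpeech`, `grammaticalNumber`) are assumptions, and it is not the PAROLE entry mentioned above.

```python
import xml.etree.ElementTree as ET

# Invented example entry; not taken from any existing lexicon.
LMF_XML = """
<Lexicon>
  <LexicalEntry id="le_city">
    <feat att="partOfSpeech" val="noun"/>
    <Lemma>
      <feat att="writtenForm" val="city"/>
    </Lemma>
    <WordForm>
      <feat att="writtenForm" val="city"/>
      <feat att="grammaticalNumber" val="singular"/>
    </WordForm>
    <WordForm>
      <feat att="writtenForm" val="cities"/>
      <feat att="grammaticalNumber" val="plural"/>
    </WordForm>
    <Sense id="s_city_1"/>
  </LexicalEntry>
</Lexicon>
"""


def feats(element):
    """Collect the direct <feat att=... val=.../> children of an element."""
    return {f.get("att"): f.get("val") for f in element.findall("feat")}


def read_lexicon(xml_text):
    """Turn each LexicalEntry into a plain dictionary."""
    entries = []
    for entry in ET.fromstring(xml_text).findall("LexicalEntry"):
        entries.append({
            "id": entry.get("id"),
            "pos": feats(entry).get("partOfSpeech"),
            "lemma": feats(entry.find("Lemma")).get("writtenForm"),
            "forms": [feats(form) for form in entry.findall("WordForm")],
            "senses": [sense.get("id") for sense in entry.findall("Sense")],
        })
    return entries


entries = read_lexicon(LMF_XML)
print(entries[0]["lemma"], [form["writtenForm"] for form in entries[0]["forms"]])
```

The generic `feat` element keeps the structure fixed while the attribute names come from a data category registry, which is the split between meta-model and data categories that LMF relies on.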
Mapping experiment
Major best practices:
OLIF
PAROLE/SIMPLE
LC-Star (Speech Lexicon)
WordNet - EuroWordNet
FrameNet
BDef: formal database of lexicographic definitions derived from the Explanatory Dictionary of Contemporary French
Entries from major existing lexicons mapped to LMF
To prove that the model is able to represent many best practices
To test the expressive potential and the adequacy of the architectural model & linguistic objects
from Monica Monachini
BioLexicon SIMPLE model & ISO-LMF standard
A unique large-scale computational lexicon in the biomedical domain in terms of coverage & typology of information
Populated with info from available biomedical resources
Semi-automatically populated from corpora: population toolkit available
Including both domain-specific & general language words
Rich linguistic information ranging over different levels of linguistic description
Conformant to international lexical representation standards
Designed to meet Bio-Text Mining requirements
from Monica Monachini
The BioLexicon: why
LMF proved to be able to provide Text Mining systems in the biomedical domain with a substantial lexicon covering:
Biomedical term variants (orthographic, semantic, geographical, …): better information retrieval
Terminological verbs and their combinatorial properties (subcategorization frames and predicate-argument structure): better information extraction and question answering
Word derivations, to reach similar meaning expressed in different ways (e.g. activation vs activate)
ICT-211423
KYOTO: the lexical resource perspective
KYOTO objectives
“ … facilitating the exchange of information across languages, domains and cultures”
“ … allow definition of word meaning in a shared Wiki platform”
From the point of view of linguistic resources
… needs to share lexical & knowledge bases, both general & domain-related, in the form of lexical repositories and ontologies
KYOTO SYSTEM
[Figure: KYOTO system architecture. Components: Source Documents; Linear MAF/SYNAF; Linear SEMAF; Term extraction (Tybot); Generic TMF; Semantic annotation; Linear Generic FACTAF; Fact extraction (Kybot); Domain editing (Wikyoto); Wordnet and Domain Wordnet (LMF API); Ontology and Domain ontology (OWL API); Concept User; Fact User]
from Piek Vossen
A common representation format for WordNets
Seven WordNets (WnIT, WnEN, WnEU, WnNL, WnJP, WnCH, WnES), similar but not identical: hampered interoperability
To be accessed both intra- and inter-linguistically, to support easier integration
Endow WordNet with a representation format allowing easy access, integration & interoperability among resources
A common representation format: WordNet-LMF Data Categories
[Figure: WordNet-LMF class diagram with multiplicities. Classes: LexicalResource, GlobalInformation, Lexicon, LexicalEntry, Lemma, Sense, MonolingualExternalRefs / MonolingualExternalRef, Synset, Definition, Statement, SynsetRelations / SynsetRelation, SenseAxes / SenseAxis, InterlingualExternalRefs / InterlingualExternalRef; several classes carry an optional Meta element]
from Monica Monachini
Centralized WordNet DC Registry
A list of 85 semantic relations, as a result of a mapping of the KYOTO WordNet grid
Inter-WN, Intra-WN
WordNet-LMF Multilingual level - Cross-lingual Relations
WN3.0 <fire_1 flame_1 flaming_1> 13480848-n
SWN <fuego_3, llama_1> 09686541-n
IWN <fuoco_1, fiamma_1> 00001251-n
<!ELEMENT SenseAxes (SenseAxis+)>
<!ELEMENT SenseAxis (Meta?, Target+, InterlingualExternalRefs?)>
<!ATTLIST SenseAxis
    id ID #REQUIRED
    relType CDATA #REQUIRED>
<!ELEMENT Target EMPTY>
<!ATTLIST Target
    ID CDATA #REQUIRED>
<!ELEMENT InterlingualExternalRefs (InterlingualExternalRef+)>
<!ELEMENT InterlingualExternalRef (Meta?)>
<!ATTLIST InterlingualExternalRef
    externalSystem CDATA #REQUIRED
    externalReference CDATA #REQUIRED
    relType (at|plus|equal) #IMPLIED>
groups monolingual synsets corresponding to each other and sharing the same relations to English
link to ontology/(ies)
specifies the type of correspondence
from Monica Monachini
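As an illustration of how such a multilingual level might be consumed, the following Python sketch parses one SenseAxis instance shaped after the DTD on this slide. The instance itself is invented: grouping the three synset identifiers shown on the slide under a single axis, the axis id `sa_fire`, the `relType` value `eq_synonym` and the SUMO reference `Fire` are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Invented instance following the element and attribute names of the DTD above.
SENSE_AXES_XML = """
<SenseAxes>
  <SenseAxis id="sa_fire" relType="eq_synonym">
    <Target ID="13480848-n"/>
    <Target ID="09686541-n"/>
    <Target ID="00001251-n"/>
    <InterlingualExternalRefs>
      <InterlingualExternalRef externalSystem="SUMO"
                               externalReference="Fire" relType="at"/>
    </InterlingualExternalRefs>
  </SenseAxis>
</SenseAxes>
"""


def read_sense_axes(xml_text):
    """Map each SenseAxis id to its relation type, target synsets and ontology links."""
    axes = {}
    for axis in ET.fromstring(xml_text).findall("SenseAxis"):
        axes[axis.get("id")] = {
            "relType": axis.get("relType"),
            "targets": [target.get("ID") for target in axis.findall("Target")],
            "ontology": [
                (ref.get("externalSystem"), ref.get("externalReference"), ref.get("relType"))
                for ref in axis.findall("InterlingualExternalRefs/InterlingualExternalRef")
            ],
        }
    return axes


axes = read_sense_axes(SENSE_AXES_XML)
print(axes["sa_fire"]["targets"])
print(axes["sa_fire"]["ontology"])
```

Note that `xml.etree.ElementTree` does not validate against a DTD, so this sketch only reads the structure; checking conformance would need a validating parser.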
Kyoto Knowledge Base
[Figure: each wordnet (WnIT, WnEN, WnEU, WnNL, WnJP, WnCH, WnES) with its Domain extension, together with the Ontology and the Domain Ontology]
LMF and Named Entity Lexicon
LRs enriched with NEs can be useful within QA to:
Find answers
Validate answers
Construction of a multilingual NE lexicon, automatically acquired
Source: Wikipedia → dynamic source, huge amount of NEs, some degree of structure
NEs extracted from Wikipedia and linked to entries of LRs and ontologies
from Monica Monachini
Named Entity Lexicon
<Sense id="en_s_city_1">
  <MonolingualExternalRef>
    <feat att="external_system" val="EnWordNet"/>
    <feat att="external_reference" val="noun.loc:city0"/>
  </MonolingualExternalRef>
</Sense>
<SenseAxis id="sa_001" senses="en_s_Florence it_s_Firenze">
  <feat att="type" val="eq_syn"/>
  <InterlingualExternalRef>
    <feat att="external_system" val="SUMO"/>
    <feat att="external_reference" val="City"/>
    <feat att="external_reltype" val="at"/>
  </InterlingualExternalRef>
  <InterlingualExternalRef>
    <feat att="external_system" val="SIMPLE"/>
    <feat att="external_reference" val="Geopolitical_location"/>
    <feat att="external_reltype" val="at"/>
  </InterlingualExternalRef>
</SenseAxis>
[Figure labels: Wikipedia, LR, Ontology]
<Sense id="en_s_Florence">
  <SenseRelation targets="en_s_city_1">
    <feat att="semanticrelation" val="instance_of"/>
  </SenseRelation>
  <MonolingualExternalRef>
    <feat att="external_system" val="EnWikipedia"/>
    <feat att="external_reference" val="11525"/>
  </MonolingualExternalRef>
</Sense>
from Monica Monachini
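A short Python sketch shows how the named-entity links in these fragments could be followed, for example to check in a QA setting that Florence is an instance of city. The two Sense elements are copied from the slide; the enclosing `<Lexicon>` root and the `describe_senses` helper are assumptions added so that the fragment parses as one document.

```python
import xml.etree.ElementTree as ET

# Sense elements as on the slide; the <Lexicon> wrapper is an assumption.
NE_XML = """
<Lexicon>
  <Sense id="en_s_city_1">
    <MonolingualExternalRef>
      <feat att="external_system" val="EnWordNet"/>
      <feat att="external_reference" val="noun.loc:city0"/>
    </MonolingualExternalRef>
  </Sense>
  <Sense id="en_s_Florence">
    <SenseRelation targets="en_s_city_1">
      <feat att="semanticrelation" val="instance_of"/>
    </SenseRelation>
    <MonolingualExternalRef>
      <feat att="external_system" val="EnWikipedia"/>
      <feat att="external_reference" val="11525"/>
    </MonolingualExternalRef>
  </Sense>
</Lexicon>
"""


def feats(element):
    """Collect the direct <feat att=... val=.../> children of an element."""
    return {f.get("att"): f.get("val") for f in element.findall("feat")}


def describe_senses(xml_text):
    """For each Sense, list its external references and its instance_of targets."""
    senses = {}
    for sense in ET.fromstring(xml_text).findall("Sense"):
        refs = [feats(ref) for ref in sense.findall("MonolingualExternalRef")]
        classes = [
            rel.get("targets")
            for rel in sense.findall("SenseRelation")
            if feats(rel).get("semanticrelation") == "instance_of"
        ]
        senses[sense.get("id")] = {
            "external": [(r["external_system"], r["external_reference"]) for r in refs],
            "instance_of": classes,
        }
    return senses


senses = describe_senses(NE_XML)
print(senses["en_s_Florence"])
```

The same lookup could be extended to the SenseAxis element to reach the Italian sense and the SUMO and SIMPLE ontology classes.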
LexInfo & Previous Models
LingInfo: modeling morphosyntactic decomposition of (complex) terms [Buitelaar et al. 2006]
LexOnto: capturing syntactic behaviour and syntax-semantics links [Cimiano et al. 2007]
Lexical Markup Framework (LMF): ISO standardised model for representing machine-readable lexica (agnostic about connection with ontology) [Francopoulo et al. 2007]
LexInfo: building on LMF as a core, develop a model which “subsumes” LingInfo and LexOnto for flexibly associating linguistic information to ontologies [Buitelaar, Cimiano, Haase, Sintek 2009]
From Paul Buitelaar
Desiderata for Semantic Roles
First step:
What are semantic roles?
Why do we need standards?
Start with Lirics
Consistently recognizable
Clarify sense distinctions
Generalizability
Learnable
Potential for inferencing
Martha Palmer
Some steps for a “new generation” of LRs
From huge efforts in building static, large-scale, general-purpose LRs
To dynamic LRs rapidly built on demand, tailored to specific user needs
From closed, locally developed and centralized resources
To LRs residing in distributed places, accessible on the web, choreographed by agents acting over them
From Language Resources
To Language Services
BUT
• Need of tools to make this vision operational & concrete
Interoperability
Lexical WEB & Content Interoperability
As a critical step for semantic mark-up in the SemWeb
[Figure: lexicons (ComLex, NomLex, SIMPLE, WordNets, FrameNet, Lex_x, Lex_y, BioLexicon, SIMPLE-WEB, Global WordNet GRID) connected through LMF, with intelligent agents]
Standards for Interoperability: enough??
A new paradigm of R&D in LRs & LT
Distributed Language Services
Open & distributed infrastructures for LRs & LT
Adopting the paradigm of accumulation of knowledge so successful in more mature disciplines, based on sharing LRs & LTs
Ability to build on previous achievements, allowing effective cooperation of many groups on common tasks
Exchange and integrate information across repositories
Create new resources on the basis of existing ones
Compose new services on demand
A new scenario implying:
content interoperability standards
development of architectures enabling accessibility
supra-national cooperation
A few issues for discussion: “content”, guidelines, tools, priorities, ...
For Semantic Web & “content” interoperability: is the field ‘mature’ enough to converge also for the semantic/conceptual level (e.g. to automatically establish links among different languages)?
For the standards to have impact, ensure their usability & gain industry support, focusing on requirements of industrial applications
To have Guidelines which are a “usable product” (to assist in creation or adaptation of lexicons, …)
Facilitate acceptance of the standards by providing an open-source reference implementation platform & tools, related web services and test suites
Relation with Spoken language community
Define further steps necessary to converge on common priorities
Limits observed & needs of further work
For usability & operability of LMF: Data Categories (DC) & others:
From Japanese NEDO: DC not defined in LMF & LMF non operational
Asian, African DCs
Need of DC organised (easy to use): IsoCat & DC Selections/Profiles
Need of an ontology of DCs with structure/dependencies, and constraints
Otherwise the model remains too abstract, and doesn’t say anything on how to implement concretely the different layers
Link with Ontologies: relations Lexicons-Ontologies
Need of easy, user-friendly guidelines
Need of tools to make it operational, also for creating standard-compliant resources: more important than the model!
More dissemination, also with industry
Linguists may be (rightly for certain purposes) not interested
Younger colleagues not aware of the past work on standards
Need of operational definitions of interoperability
Need of stimuli also from EC to produce standard-compliant resources (unless differently motivated)
Strengths
Good set of methodological principles: granularity of basic notions, …
Many languages already compliant with EAGLES morpho-syntax, etc.
Many projects today using LMF
Unified Lexicon experiment between Speechdat & Parole, at ELRA (possible because EAGLES compliant)
Web services to access LRs based on standards
Web-based platforms for LR integration
An open infrastructure of LRT needs standards
New topics being constantly added: Time, Space, …
Future requirements & planning
To make LMF usable and operational
LMF User Guidelines with examples
Mapping of commonly used lexicons into LMF
Converters
Data categories for LMF lexicons
Tools related to LMF, with particular reference to the Lexus tool
Need to address another layer
The ontological layer in a lexicon
How lexicons and ontologies are linked and information mapped from each other
An open space in a wiki environment
to store/link to guidelines, examples
to allow broad discussion on these topics
to ease dissemination of LMF
FLaReNet
Mission: structure the area of LR & LT of the future
Worldwide Forum for LRs & LTs
Consolidate methods, approaches, common practices, architectures
Integrate so far partial solutions into broader infrastructures
A “roadmap”: a plan of coherent actions as input to policy development
For the EU, national organisations & industry
As a model for the LRs/LTs of the next years
Strengthening the language product market, e.g. for new products & innovative services
Identifying areas where consensus is achieved/emerging vs. areas where more discussion & testing is required
Indicating priorities
333 Individual Subscribers, 88 Institutional Members, from 31 countries
Standards & Interoperability: topics for cooperation
A metadata catalogue should involve every party
Common repositories for LRT, universally & easily accessible
Try to connect ongoing work done by many groups
A shared repository of data formats, annotations (where to find the most frequently used and preferred schemes): major help to achieve standardisation
For a new world-wide language infrastructure
Create the means to plug together different LR & LT, in a web-based resource and technology grid
Access to LRT is critical: it involves, and has impact on, all the community
With the possibility to easily create new workflows
Create conditions to easily share and re-use technologies, to have more open (source) tools available for use also to under-funded groups
Some results from FLaReNet Vienna Forum: International Cooperation
Special Highlight: Contribute to building the LREC2010 Map!
Time is ripe to launch an important initiative, the LREC2010 Map of Language Resources, Technologies and Evaluation.
The Map will be a collective enterprise of the LREC community, as a first step towards the creation of a very broad, community-built, Open Resource Infrastructure.
First in a series, it will become an essential instrument to monitor the field and to identify shifts in the production, use and evaluation of LRs and LTs over the years.
When submitting a paper (< 900!), from the START page fill in a very simple template to provide essential information about resources (in a broad sense, also technologies, standards, evaluation kits.) either used for the work described or a new result of your research
Go to http://www.resourcebook.eu/LreMap !
FLaReNet & the LRE MAP… at LREC & COLING