A centralized approach to language resources
Piek Vossen
S&T Forum on Multilingualism, Luxembourg, June 6th 2005
Overview
What has been achieved? What has not been achieved? What are the major challenges?
What has been achieved? Research and technology development:
Lexical representations Large-scale and medium-scale lexical acquisition:
Machine Readable Dictionaries Corpora
Acquilex, Multilex, Parole, Simple, EuroWordNet, BalkaNet, MEANING, etc..
Standardization: early initiatives EAGLES, ISLE best practices and descriptions
Medium-scale shallow resources for a number of languages, e.g. Parole lexicons and wordnets for about 15 languages.
Small-scale deep resources for a few languages, i.e. Acquilex, Simple
What has not been achieved (1)? Evaluation and benchmarking:
No well-defined and commonly accepted criteria No benchmark data to validate language resources
Insufficient concerage: 100K entries and 200K concepts per languages is
needed for realistic applications, only half is achieved Many European languages still do not have the basic
resources Insufficiently rich in data coverage:
Language coverage: mainly English Size: e.g. Simple, FrameNet 10,000 concepts
What has not been achieved (2)? Most resources are developed in a distributive way, i.e.
common project but national groups with different approaches: Insufficient conceptual overlap and matching across languages:
very low intersection of concepts (all Wordnets about 10,000 concepts) diversing interpretations and definitions of relations and concepts
Insufficient overlap and consensus in the representation of lexical knowledge
Not enough progress to integrate and merge different types of resources: Ontological resources (Semantic Web) Lexical semantic resources (Wordnets) Morpho-syntactic & semantic (Simple, Acquilex) Morpho-syntactic (Parole)
What has not been achieved (3)? Integration in real applications:
Evidence of added value, i.e. scientific proof that language technology and resources help -> more deep-thought applications
More acceptance by the general public (show cases): The positive effects of language technology should be visible to
the general public Be aware of the language myth! The negative effects and
limitations should be clear too... More awareness by the general public on limitations:
create realization how bad the current systems are (precision and recall)
explain the undemocratic limitations of the current Internet
What is the major challenge (1)? Critical issues:
Languages that are not well-supported: lower economic value less speakers
Divergence of resources and lack of semantic and conceptual intersection
Integration of semantic-conceptual knowledge (more language neutral and sharable) with morpho-syntactic knowledge (language-specific)
What is the major challenge (2)? Centralized development of a semantic conceptual backbone:
Maximizes sharing and re-use of lexical knowledge and tools across languages;
Maximizes intersection of concepts and this interlinking of languages;
Stimulates the standardization of lexical knowledge representation;
Enables the early development of impressive Europe-wide applications on a short term: Good show cases (Information retrieval or dialogues in all European
languages) Application-based evaluation and benchmarking
What is the major challenge (3)? Interlinking and developing morpho-syntactic
lexicons on top of the semantic backbone: Captures the valuable non-sharable, idiosyncratic
properties of languages (also has cultural value) Enables long-term high-quality applications such as
Machine Translations Should be corpus-based but is also necessary to develop
large-scale comparable corpora Can be achieved gradually (phase-by-phase) with
intermediate results
T
M
D DD
DD
D
Semantic Backbone
Wordnets Corpora Morpho-syntactic Lexicons
bank
violin
violist
play
Sharable
Language neutral Language specific
Non-Sharable
Semantic Web