Transcript
Page 1: A centralized approach to language resources Piek Vossen S&T Forum on Multilingualism, Luxembourg, June 6th 2005

A centralized approach to language resources

Piek Vossen

S&T Forum on Multilingualism, Luxembourg, June 6th 2005

Page 2:

Overview

What has been achieved? What has not been achieved? What are the major challenges?

Page 3:

What has been achieved?

- Research and technology development:
  - Lexical representations
  - Large-scale and medium-scale lexical acquisition:
    - Machine Readable Dictionaries
    - Corpora
  - Acquilex, Multilex, Parole, Simple, EuroWordNet, BalkaNet, MEANING, etc.
- Standardization: early initiatives (EAGLES, ISLE), best practices and descriptions
- Medium-scale shallow resources for a number of languages, e.g. Parole lexicons and wordnets for about 15 languages
- Small-scale deep resources for a few languages, i.e. Acquilex, Simple

Page 4:

What has not been achieved (1)?

- Evaluation and benchmarking:
  - No well-defined and commonly accepted criteria
  - No benchmark data to validate language resources
- Insufficient coverage: 100K entries and 200K concepts per language are needed for realistic applications; only half of that has been achieved
- Many European languages still do not have the basic resources
- Insufficiently rich in data coverage:
  - Language coverage: mainly English
  - Size: e.g. Simple, FrameNet 10,000 concepts

Page 5:

What has not been achieved (2)?

- Most resources are developed in a distributed way, i.e. a common project but national groups with different approaches:
  - Insufficient conceptual overlap and matching across languages:
    - very low intersection of concepts (all Wordnets about 10,000 concepts)
    - diverging interpretations and definitions of relations and concepts
  - Insufficient overlap and consensus in the representation of lexical knowledge
- Not enough progress to integrate and merge different types of resources:
  - Ontological resources (Semantic Web)
  - Lexical semantic resources (Wordnets)
  - Morpho-syntactic & semantic (Simple, Acquilex)
  - Morpho-syntactic (Parole)
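
The "very low intersection of concepts" point can be made concrete with a small sketch. It assumes each wordnet is reduced to the set of shared concept identifiers it links to. The identifiers, languages, and function names below are invented for illustration and are not taken from any of the projects named on this page.

```python
from functools import reduce

# Hypothetical data: each language's wordnet as a set of shared concept IDs.
wordnets = {
    "en": {"c1", "c2", "c3", "c4", "c5"},
    "nl": {"c1", "c2", "c3", "c6"},
    "es": {"c1", "c3", "c4", "c7"},
}

def concept_intersection(nets):
    """Return the concepts that every wordnet covers."""
    if not nets:
        return set()
    return reduce(set.intersection, nets.values())

def overlap_ratio(nets):
    """Shared concepts as a fraction of all concepts found in any wordnet."""
    union = set().union(*nets.values())
    return len(concept_intersection(nets)) / len(union) if union else 0.0

shared = concept_intersection(wordnets)   # concepts present in all three
ratio = overlap_ratio(wordnets)           # 2 shared out of 7 distinct
```

A low ratio on real data would correspond to the situation described here: each language has many concepts, but few are linked in all languages at once.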

Page 6:

What has not been achieved (3)?

- Integration in real applications:
  - Evidence of added value, i.e. scientific proof that language technology and resources help -> more deep-thought applications
- More acceptance by the general public (showcases):
  - The positive effects of language technology should be visible to the general public
  - Be aware of the language myth! The negative effects and limitations should be clear too...
- More awareness by the general public of limitations:
  - make people realize how poorly the current systems perform (precision and recall)
  - explain the undemocratic limitations of the current Internet
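
Precision and recall, mentioned above as the measures of how well current systems perform, can be computed as follows. This is a minimal sketch using the standard set-based definitions; the document identifiers and the function name are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query.

    precision = correct results / all results returned
    recall    = correct results / all results that should have been returned
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: the system returns 4 documents, 5 are actually relevant,
# and only 2 of the returned documents are among the relevant ones.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        ["d2", "d4", "d5", "d6", "d7"])
```

In this toy case half of what the user sees is noise and more than half of the relevant material is never shown.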

Page 7:

What is the major challenge (1)?

- Critical issues:
  - Languages that are not well-supported:
    - lower economic value
    - fewer speakers
  - Divergence of resources and lack of semantic and conceptual intersection
  - Integration of semantic-conceptual knowledge (more language neutral and sharable) with morpho-syntactic knowledge (language-specific)

Page 8:

What is the major challenge (2)?

- Centralized development of a semantic conceptual backbone:
  - Maximizes sharing and re-use of lexical knowledge and tools across languages;
  - Maximizes intersection of concepts and thus interlinking of languages;
  - Stimulates the standardization of lexical knowledge representation;
  - Enables the early development of impressive Europe-wide applications in the short term:
    - Good showcases (Information retrieval or dialogues in all European languages)
    - Application-based evaluation and benchmarking
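
One way to picture how a shared backbone interlinks languages is a table keyed by concept, with each language attaching its own words to the same concept identifier. The sketch below is an assumption for illustration only: the class name, method names, concept identifiers, and the tiny English/Dutch/Spanish word lists are invented here and do not describe any existing resource.

```python
from dataclasses import dataclass, field

@dataclass
class Backbone:
    # concept ID -> {language code -> set of words expressing that concept}
    links: dict = field(default_factory=dict)

    def add(self, concept_id, lang, word):
        """Attach a word of one language to a shared concept."""
        self.links.setdefault(concept_id, {}).setdefault(lang, set()).add(word)

    def equivalents(self, word, src, tgt):
        """Words in tgt linked to any concept that `word` expresses in src."""
        found = set()
        for by_lang in self.links.values():
            if word in by_lang.get(src, set()):
                found |= by_lang.get(tgt, set())
        return found

bb = Backbone()
bb.add("instrument:violin", "en", "violin")
bb.add("instrument:violin", "nl", "viool")
bb.add("instrument:violin", "es", "violín")
bb.add("institution:bank", "en", "bank")
bb.add("institution:bank", "nl", "bank")
bb.add("riverside:bank", "en", "bank")
bb.add("riverside:bank", "nl", "oever")

# An ambiguous English word maps to one Dutch word per concept it can express.
nl_for_bank = bb.equivalents("bank", "en", "nl")
```

Because every language links to the same concept identifiers, adding one more language makes it reachable from all the others without pairwise mappings.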

Page 9:

What is the major challenge (3)?

- Interlinking and developing morpho-syntactic lexicons on top of the semantic backbone:
  - Captures the valuable non-sharable, idiosyncratic properties of languages (also has cultural value)
  - Enables long-term high-quality applications such as Machine Translation
  - Should be corpus-based but is also necessary to develop large-scale comparable corpora
  - Can be achieved gradually (phase-by-phase) with intermediate results

Page 10:

[Diagram. Labels: Semantic Backbone; Semantic Web; Wordnets; Corpora; Morpho-syntactic Lexicons; Sharable / Non-Sharable; Language neutral / Language specific. Example words: bank, violin, violist, play.]

