A centralized approach to language resources Piek Vossen S&T Forum on Multilingualism, Luxembourg, June 6th 2005.

Slide 1
A centralized approach to language resources
Piek Vossen
S&T Forum on Multilingualism, Luxembourg, June 6th 2005

Slide 2
Overview
  • What has been achieved?
  • What has not been achieved?
  • What are the major challenges?

Slide 3
What has been achieved?
  • Research and technology development:
      • Lexical representations
      • Large-scale and medium-scale lexical acquisition: Machine Readable Dictionaries, Corpora
      • Acquilex, Multilex, Parole, Simple, EuroWordNet, BalkaNet, MEANING, etc.
  • Standardization: early initiatives
      • EAGLES, ISLE best practices and descriptions
  • Medium-scale shallow resources for a number of languages, e.g. Parole lexicons and wordnets for about 15 languages
  • Small-scale deep resources for a few languages, i.e. Acquilex, Simple

Slide 4
What has not been achieved (1)?
  • Evaluation and benchmarking:
      • No well-defined and commonly accepted criteria
      • No benchmark data to validate language resources
  • Insufficient coverage:
      • 100K entries and 200K concepts per language are needed for realistic applications; only half of that has been achieved
      • Many European languages still do not have the basic resources
  • Insufficiently rich in data coverage:
      • Language coverage: mainly English
      • Size: e.g. Simple, FrameNet 10,000 concepts

Slide 5
What has not been achieved (2)?
  • Most resources are developed in a distributed way, i.e. a common project but national groups with different approaches:
      • Insufficient conceptual overlap and matching across languages:
          • very low intersection of concepts (all wordnets: about 10,000 concepts)
          • diverging interpretations and definitions of relations and concepts
      • Insufficient overlap and consensus in the representation of lexical knowledge
  • Not enough progress in integrating and merging different types of resources:
      • Ontological resources (Semantic Web)
      • Lexical semantic resources (Wordnets)
      • Morpho-syntactic & semantic (Simple, Acquilex)
      • Morpho-syntactic (Parole)

Slide 6
What has not been achieved (3)?
  • Integration in real applications:
      • Evidence of added value, i.e. scientific proof that language technology and resources help -> more deep-thought applications
  • More acceptance by the general public (show cases):
      • The positive effects of language technology should be visible to the general public
      • Be aware of the language myth! The negative effects and limitations should be clear too...
  • More awareness among the general public of limitations:
      • create a realization of how bad the current systems are (precision and recall)
      • explain the undemocratic limitations of the current Internet

Slide 7
What is the major challenge (1)?
  • Critical issues:
      • Languages that are not well supported: lower economic value, fewer speakers
      • Divergence of resources and lack of semantic and conceptual intersection
      • Integration of semantic-conceptual knowledge (more language-neutral and sharable) with morpho-syntactic knowledge (language-specific)

Slide 8
What is the major challenge (2)?
  • Centralized development of a semantic conceptual backbone:
      • Maximizes sharing and re-use of lexical knowledge and tools across languages;
      • Maximizes intersection of concepts and thus interlinking of languages;
      • Stimulates the standardization of lexical knowledge representation;
      • Enables the early development of impressive Europe-wide applications in the short term:
          • Good show cases (information retrieval or dialogues in all European languages)
          • Application-based evaluation and benchmarking

Slide 9
What is the major challenge (3)?
  • Interlinking and developing morpho-syntactic lexicons on top of the semantic backbone:
      • Captures the valuable non-sharable, idiosyncratic properties of languages (also has cultural value)
      • Enables long-term high-quality applications such as Machine Translation
      • Should be corpus-based but is also necessary to develop large-scale comparable corpora
      • Can be achieved gradually (phase-by-phase) with intermediate results

Slide 10
[Diagram. Labels: Semantic Backbone (sharable, language-neutral); Semantic Web; Wordnets, Corpora, Morpho-syntactic Lexicons (language-specific, non-sharable); example words: bank, violin, violist, play]
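As a rough illustration of the backbone idea on Slides 8 to 10, the sketch below links two small language-specific lexicons to a shared set of language-neutral concepts. It is only a toy under stated assumptions: the concept identifiers, the English and Dutch entries, and the function names are hypothetical and do not come from the talk or from any existing resource.

```python
# Toy sketch of a shared semantic backbone (all data and names hypothetical).
# Each language-specific lexicon maps a word to the set of language-neutral
# concept identifiers it can express.
LEXICONS = {
    "en": {
        "bank": {"c:bank_finance", "c:bank_river"},
        "violin": {"c:violin"},
        "play": {"c:play_music"},
    },
    "nl": {
        "bank": {"c:bank_finance", "c:sofa"},
        "oever": {"c:bank_river"},
        "viool": {"c:violin"},
        "spelen": {"c:play_music"},
    },
}


def concepts_covered(lexicon):
    """Return every backbone concept that at least one word in the lexicon expresses."""
    covered = set()
    for concepts in lexicon.values():
        covered |= concepts
    return covered


def shared_concepts(lexicons):
    """Return the concepts covered by all languages (the cross-language intersection)."""
    return set.intersection(*(concepts_covered(lex) for lex in lexicons.values()))


def translate(word, source, target, lexicons):
    """Return target-language words that share at least one concept with the source word."""
    concepts = lexicons[source].get(word, set())
    return sorted(w for w, cs in lexicons[target].items() if cs & concepts)


if __name__ == "__main__":
    print(sorted(shared_concepts(LEXICONS)))
    print(translate("bank", "en", "nl", LEXICONS))
    print(translate("viool", "nl", "en", LEXICONS))
```

In this toy setup the words stay language-specific while the concept identifiers are the only shared layer, so the size of `shared_concepts` gives a simple measure of the cross-language intersection that Slide 5 describes as too low.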

