Hamburg, 22-11-2004 [email protected]

The Basic Language Resources Kit (BLARK)

Steven Krauwer

Utrecht Institute of Linguistics UiL OTS / ELSNET

Overview

• The BLARK Enterprise

• How to arrive at it

• The Dutch Language Union approach

• Refining the concept

• Defining a BLARK

• Main beneficiaries

• References

• Concluding remarks

The BLARK Enterprise

• Define the minimal set of language resources that is necessary to do any precompetitive R&D and professional education at all for a language (the Basic Language Resource Kit or BLARK)

• Determine for each language which components are already available

• Make a priority plan to complete the BLARK for each language

• Ensure funding to get the work done

What are the components of a BLARK?

• Lexicons (monolingual, multilingual, …)

• Corpora (language, speech; annotated, unannotated; mono- and multilingual; mono- and multimodal; …)

• Tools (annotation, exploration, …)

• Modules (lemmatizers, parsers, speech recognizers, TTS, transcribers, translation, …)

• …

What makes the BLARK Enterprise special?

• The idea is to make a common generic BLARK definition, in principle applicable to all languages

• The common definition will be based on the experience with different languages, and will prevent reinvention of wheels

• The common definition will ensure interoperability and interconnectivity (especially for multilingual or cross-lingual applications)

Other benefits

• Experience from other languages will help in making cost estimates

• Adoption of a BLARK common to all languages may help in persuading funders to support the creation of the BLARK

• Adoption of a common BLARK may facilitate porting of knowledge and expertise between languages

Words of caution

• A BLARK definition will evolve over time, as new applications, application environments and technologies emerge

• A BLARK definition should be seen as a template rather than a dictate, as different languages may have different specific requirements

• BLARK completion priorities may differ from language to language (on e.g. economic, social or political grounds)

How to define a BLARK and assign priorities

• Methodology proposed by the Dutch Language Union [DLU] (Binnenpoorte et al., LREC 2002):

– Identify a number of typical applications

– Determine for each of them which technologies (modules) are needed to build it, rating each module's importance as -, +, ++ or +++

– Identify for each module which resources it requires, rated on the same scale (-, +, ++, +++)

– Assign the highest priority to the resources that support most applications
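
The four steps above can be sketched in a few lines of Python. This is only an illustration of the counting idea, not the DLU's actual procedure: the application, module and resource names, the ratings, and the helper `rank_resources` are all invented for the example, and it assumes that "-" means "not needed" while any "+" rating counts as support.

```python
# Hypothetical illustration of the applications => modules => resources ranking.
# All names and ratings below are invented for the example.

# Rating scale from the slide; "-" is assumed to mean "not needed".
WEIGHT = {"-": 0, "+": 1, "++": 2, "+++": 3}

# Step 2: how strongly each application needs each module.
app_modules = {
    "dictation": {"speech recognizer": "+++", "parser": "+"},
    "information retrieval": {"parser": "++", "named entity recognizer": "+++"},
}

# Step 3: how strongly each module needs each resource.
module_resources = {
    "speech recognizer": {"speech corpus": "+++", "lexicon": "++"},
    "parser": {"treebank": "+++", "lexicon": "++"},
    "named entity recognizer": {"annotated corpus": "+++", "lexicon": "+"},
}


def rank_resources(app_modules, module_resources):
    """Step 4: sort resources by how many applications they support."""
    supported = {}  # resource -> set of applications that depend on it
    for app, modules in app_modules.items():
        for module, need in modules.items():
            if WEIGHT[need] == 0:
                continue  # the application does not need this module
            for resource, req in module_resources.get(module, {}).items():
                if WEIGHT[req] > 0:
                    supported.setdefault(resource, set()).add(app)
    # Most widely used resources first; ties broken alphabetically.
    return sorted(supported, key=lambda r: (-len(supported[r]), r))


print(rank_resources(app_modules, module_resources))
```

A weighted variant could sum the ratings instead of counting applications; the slide only says that resources supporting the most applications come first, so the simple count is used here.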

Proposed DLU priorities for NLP

1. treebank

2. robust parsers

3. tokenisation and named entity recognition

4. semantic annotations for the treebank

5. translation equivalents

6. evaluation benchmarks

Proposed DLU priorities for speech

1. automatic speech recognition

2. application-specific speech corpora

3. multi-media speech corpora

4. tools for transcription of speech data

5. speech synthesis

6. benchmarks for evaluation

Next steps by DLU

• Make a survey of what exists and to what extent it is available (0-9 availability score)

• Assign priorities (not just resources but also an infrastructure for maintenance and distribution)

• Secure funding from the Dutch and Flemish governments for a national programme

• Issue calls for proposals for collaborative resources projects (1st call closed Nov 2 2004)

Refining the concept

• Items not really covered by the DLU teams:

– definition vs specification

– availability

– quality

– quantity

– standards

– support

• Addressed in the NEMLAR project

Definition / specification

• It is not enough to say ‘a written language corpus’; what about:

– size (types, tokens)

– encoding

– annotation

– text types

– representativity

– domains

• i.e. we need full specs
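
As a rough sketch of what a "full spec" record could look like, the attributes listed above can be collected in one structure that also reports which of them are still missing. The class `CorpusSpec`, its field names and the example values are assumptions made for illustration; the slides do not prescribe any particular format.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class CorpusSpec:
    """Hypothetical specification record for a written corpus."""
    tokens: Optional[int] = None            # size in running words
    types: Optional[int] = None             # size in distinct word forms
    encoding: Optional[str] = None          # e.g. a character encoding name
    annotation: Optional[str] = None        # e.g. "POS", "none"
    text_types: List[str] = field(default_factory=list)
    representativity: Optional[str] = None  # free-text statement
    domains: List[str] = field(default_factory=list)

    def unspecified(self) -> List[str]:
        """Names of attributes that have not been filled in yet."""
        return [f.name for f in fields(self)
                if getattr(self, f.name) in (None, [])]


# "A written language corpus" with only size and encoding given:
spec = CorpusSpec(tokens=1_000_000, encoding="UTF-8")
print(spec.unspecified())
```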

Availability

• DLU: 0-9 scale, very impressionistic

• Our proposal: 3 dimensions

– accessibility

– cost

– modifiability

• to each we assign a penalty score (0 is best)

Accessibility

• 3 classes, with associated penalties:

– (3) existing, but only company-internal

– (2) existing and freely usable for precompetitive research

– (1) existing and freely usable for all R&D

Cost

• 4 cost categories:

– (4) price over 10 000 euro

– (3) price between 1 000 and 10 000 euro

– (2) price between 100 and 1 000 euro

– (1) price less than 100 euro

Modifiability

• 3 categories:

– (3) black box: you get the resource as it is, and you cannot change or even inspect its internals

– (2) glass box: you cannot change the resource, but you can see what is inside

– (1) open resources: freely manipulable

Comments on availability

• we can now express availability as a three-digit score (accessibility, cost, modifiability), which should be fairly easy to assign objectively

• the lowest scores are the best

• if the accessibility score is 3, the other scores don’t mean very much
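
A minimal sketch of the three-digit availability score follows. The class `Availability` and the helper `cost_penalty` are hypothetical names; the penalty ranges are taken from the accessibility, cost and modifiability slides, and the handling of prices that fall exactly on a category boundary (100, 1 000 or 10 000 euro) is an assumption, since the slides do not say which side they belong to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Availability:
    """Penalty scores per dimension; lower is better."""
    accessibility: int  # 1..3
    cost: int           # 1..4
    modifiability: int  # 1..3

    def __post_init__(self):
        limits = {"accessibility": 3, "cost": 4, "modifiability": 3}
        for name, top in limits.items():
            value = getattr(self, name)
            if not 1 <= value <= top:
                raise ValueError(f"{name} must be in 1..{top}, got {value}")

    def code(self) -> str:
        """Three-digit score: accessibility, cost, modifiability."""
        return f"{self.accessibility}{self.cost}{self.modifiability}"

    def usable_outside_owner(self) -> bool:
        """Accessibility 3 means company-internal only, so the
        remaining two digits carry little information."""
        return self.accessibility < 3


def cost_penalty(price_eur: float) -> int:
    """Map a price in euro to the cost penalty (boundaries assumed)."""
    if price_eur > 10_000:
        return 4
    if price_eur >= 1_000:
        return 3
    if price_eur >= 100:
        return 2
    return 1


# A glass-box lexicon, free for precompetitive research, priced at 500 euro:
score = Availability(accessibility=2, cost=cost_penalty(500), modifiability=2)
print(score.code(), score.usable_outside_owner())
```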

Quality

• We distinguish two types of quality: absolute (i.e. an inherent property of the resource) and relative (i.e. in relation to how you want to use it):

• Absolute: standard-compliance and soundness

• Relative: task-relevance and environment-relevance

Standard-compliance

• criterion: to what extent is the resource based on a common standard (formal or de facto)

• possible values (penalty based):

– (3) no standard

– (2) standard, but not fully compliant

– (1) standard and fully compliant

Soundness

• criterion: to what extent is the resource based on well-defined specifications

• values:

– (3) no specifications provided

– (2) specs provided, but not fully compliant

– (1) specs provided, fully compliant

Task-relevance

• criterion (relative): to what extent is the resource suited for a specific task X

• values (3 binary values):

– contains all information needed for X (yes/no)

– has the proper size for X (yes/no)

– based on a relevant selection of items for X (yes/no)

Environment-relevance

• criterion: to what extent is the resource interoperable with its environment (other resources)

• values (3 binary values):

– information matches (yes/no)

– size matches (yes/no)

– selection matches (yes/no)

Comments on quality

• We can now express absolute quality objectively in terms of a pair of scores (standard-compliance, soundness); this score can be assigned by the provider

• and relative quality (for our own purposes) in terms of two triples of yes/no answers (task-relevance, environment-relevance); this score can only be assigned by the user

• other attributes may be added as long as they can be objectively assigned
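
The split between provider-assigned and user-assigned quality can be sketched as two small records. The class names `AbsoluteQuality` and `Relevance` and the `as_text` helper are invented for this illustration; only the score ranges and the yes/no questions come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbsoluteQuality:
    """Assigned by the provider; penalties from 1 (best) to 3 (worst)."""
    standard_compliance: int
    soundness: int

    def __post_init__(self):
        for name in ("standard_compliance", "soundness"):
            if getattr(self, name) not in (1, 2, 3):
                raise ValueError(f"{name} must be 1, 2 or 3")


@dataclass(frozen=True)
class Relevance:
    """Assigned by the user: three yes/no answers, used once for the
    task and once for the environment."""
    information: bool
    size: bool
    selection: bool

    def as_text(self) -> str:
        answers = (self.information, self.size, self.selection)
        return "/".join("yes" if a else "no" for a in answers)


# Provider side: based on a standard but not fully compliant, sound specs.
absolute = AbsoluteQuality(standard_compliance=2, soundness=1)

# User side: judged separately for the task and for the environment.
task = Relevance(information=True, size=False, selection=True)
environment = Relevance(information=True, size=True, selection=True)

print(absolute, task.as_text(), environment.as_text())
```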

Quantity

• The DLU team did not try to formulate any quantitative requirements

• We have tried to do this in the context of the NEMLAR project, see below for our tentative figures

• Statistical approaches can swallow any amount of resources, and minimal figures are very hard to find

• Our figure finding exercise has been very much example driven

Standards

• Very few formal standards exist, although there are some (cf. Romary & Ide at the LREC 2004 workshop; Monachini et al., 2003)

• Evolving de facto standards include:

– Bottom-up work by committees (TEI)

– Top-down actions:

  • Projects aiming at standards (e.g. EAGLES, ISLE)

  • Example-setting R&D projects (e.g. WordNet, SpeechDat, Multext)

• Our position: any standard is better than no standard at all

Defining a BLARK

• Work carried out in the context of the NEMLAR project (www.nemlar.org), aimed at Arabic resources

• The work described here is based on project deliverables (see site), summarized in an article by Maegaard, Krauwer, Choukri and Damsgaard presented at the NEMLAR conference in Cairo (Sep 2004)

Approach adopted

• Same strategy as Dutch Language Union (applications => modules => resources)

• But with different results because of differences in social/economic situation and in language structure

• Results follow, in terms of global definitions and tentative size indications (no specs provided at this stage, but project is still ongoing)

• Feedback is welcome!

Written resources (1)

• Lexicon:

– For all components: 40 000 stems with POS & morphology

– For sentence boundary detection: list of conjunctions and other sentence starters/stoppers

– For named entity recognition: 50 000 human proper names

– For semantic analysis: same 40 000, with subcategorization, shallow lexical semantic info; possibly a WordNet

Written resources (2)

• Bi-/Multilingual lexicon:

– Same size as monolingual

• Thesauri, ontologies, wordnets:

– Thesaurus subtree with ca 200-300 nodes for each domain

– Ontologies and wordnets ideally same size as lexicon

Written resources (3)

• Corpora:

– For term extraction: 100 million words unannotated

– For small applications: 0.5 million words annotated

– For statistical POS tagger: 1-3 million (ann)

– Sentence boundary: 0.5-1.5 million (ann)

– Named entity (stat based): 1.5 million (ann)

– Term extraction: 100 million (ann)

– Co-reference resolution: 1 million (ann)

– WSD: 2-3 million (ann)

Written resources (4)

• Multilingual corpora:

– For alignment: 0.5 million (tagged)

• Multimodal corpora:

– For OCR (printed): ??

– For OCR (hand-written): ??

Spoken resources (1)

• Acoustic data:

– For dictation: 50-100 speakers, 20 min each, fully transcribed, plus 10 speakers for testing

– For telephony: 500 speakers uttering 50 different sentences (SpeechDat, OrienTel based)

– For embedded speech recognition: data similar to Speecon

– For broadcast news transcription: 50-100 hours well-annotated, plus 1000 hours of non-transcribed data; should come with 300 million words of non-annotated written text

Spoken resources (2)

• Acoustic data (cont’d):

– For conversational speech: data similar to CallHome/CallFriend from LDC

– For speaker recognition: 500 speakers for training, 3 minutes each, transcribed, plus 100 speakers for testing

– For language/dialect identification: data similar to CallFriend, or from Broadcast News (esp. for variants of Arabic)

– For speech synthesis: male and female speakers, 15 hours, using a read text, phonetically balanced

– For formant synthesis: same as above, with hand-labelled formants

Spoken resources (3)

• Multimodal corpora:

– For lip-movement reading: similar to M2VTS, with some 50 faces

• Written corpora for speech technologies:

– General: 300 million words unannotated, preferably broadcast news or other press and media sources

– For phonetic lexicon and language models: 1-5 million words, annotated

– For Arabic: vowelized and non-vowelized corpus

What next? (1)

• Check the definition and quantification for completeness and consistency, and correct them where needed

• Try to provide specs for every single item

• Try to differentiate between general and Arabic in definitions and specs

What next? (2)

• For each language:

– Take the BLARK definition and specs

– Adapt to local conditions

– Make a survey of what exists and what has to be made

– Find the funds and build the BLARK for your language
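
The survey step can be sketched as a simple gap calculation. The sketch below uses the lower bounds of the tentative annotated-corpus figures from the "Written resources (3)" slide as requirements; the dictionary `REQUIRED_WORDS`, the function `survey_gaps` and the example inventory are assumptions made for illustration only.

```python
# Lower bounds (in words) of the tentative annotated-corpus sizes
# given on the "Written resources (3)" slide.
REQUIRED_WORDS = {
    "statistical POS tagger": 1_000_000,
    "sentence boundary detection": 500_000,
    "named entity recognition": 1_500_000,
    "co-reference resolution": 1_000_000,
    "word sense disambiguation": 2_000_000,
}


def survey_gaps(available):
    """Return, per task, how many annotated words are still missing.

    `available` maps a task name to the number of annotated words
    that already exist for the language being surveyed.
    """
    gaps = {}
    for task, needed in REQUIRED_WORDS.items():
        missing = needed - available.get(task, 0)
        if missing > 0:
            gaps[task] = missing
    return gaps


# Invented inventory for some language:
inventory = {
    "statistical POS tagger": 1_200_000,
    "named entity recognition": 400_000,
}
for task, missing in survey_gaps(inventory).items():
    print(f"{task}: {missing:,} words still to be made")
```

The result of such a survey feeds directly into the priority plan: tasks with a gap are the ones that still need funding.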

Prescriptive / descriptive

• Prescriptive:

– the BLARK definition tells you which ingredients you need

– the specification tells you what they should look like

• Descriptive:

– a BLARK instantiation comes with a description of its components

Main beneficiaries (1)

• academic and industrial researchers: material to try out ideas and conduct pilot studies

• industrial developers: only for generic activities, since specific applications require more user and domain orientation

• educators: material for experimental work by students in labs

Main beneficiaries (2)

• probably not the main languages in Europe (EN, FR, DE) as they are pretty well covered anyway

• mostly the languages that are not supported by a strong market (because of small size or poor economy)

References

• Binnenpoorte et al. at LREC 2002 (see also www.elsnet.org/dox/lrec2002-binnenpoorte.pdf)

• ELRA Newsletter vol 3, n 2, 1998 (see also www.elsnet.org/blark.html)

• NEMLAR: see www.nemlar.org for:

– Arabic BLARK Report

– NEMLAR presentation at Cairo conference

• Romary & Ide at LREC 2004 (see also www.elsnet.org/lrec2004-roadmap/Romary-Ide.ppt)

Concluding remarks

• The BLARK aims at providing a common definition of the notion ‘minimal set of resources’

• It should help language communities to come closer to the idea of creating an equal playing field, in spite of market forces

• It should facilitate porting of expertise

• It is necessarily dynamic, as technologies evolve rapidly

Thanks!

Contact:

[email protected]