Corpus-based Terminology Extraction applied to Information Access
Anselmo Peñas, Felisa Verdejo and Julio Gonzalo
NLP Group, Dpto. Lenguajes y Sistemas Informáticos, UNED, Spain
Corpus Linguistics 2001, Lancaster, UK

Page 1: Corpus-based Terminology Extraction applied to Information Access

Corpus-based Terminology Extraction applied to Information Access

Anselmo Peñas, Felisa Verdejo and Julio Gonzalo

NLP Group, Dpto. Lenguajes y Sistemas Informáticos,

UNED, Spain

Corpus Linguistics 2001, Lancaster, UK

Page 2: Corpus-based Terminology Extraction applied to Information Access

Content

• Introduction
• Resources, Tools and Corpora
• Terminology Extraction (TE)
• Evaluation of the TE procedure
• Terminology-based Information Access
• Conclusions

Page 3: Corpus-based Terminology Extraction applied to Information Access

Introduction: Framework

The European Treasury Browser (ETB) project
• Web site of Educational Resources (primary and secondary school)
• Context of New Technologies
• Objective: to build the structures to organise and retrieve educational resources

Similar systems
• The Educational Resources Information Centre
• The British Education Index

Page 4: Corpus-based Terminology Extraction applied to Information Access
Page 5: Corpus-based Terminology Extraction applied to Information Access
Page 6: Corpus-based Terminology Extraction applied to Information Access

Introduction: use of Thesauri

Thesauri
• Definition: controlled vocabulary, structured in relations
• Structure: descriptors and relations (NT: narrower term, BT: broader term, RT: related term)

Existing educational thesauri
• Don’t cover primary and secondary school vocabulary within the new technologies context

Construction of a multilingual thesaurus is needed for the ETB project purposes

Terminology Lists

Page 7: Corpus-based Terminology Extraction applied to Information Access

Objectives of the work

To build the Spanish list of candidate terms for the ETB multilingual thesaurus.

To develop a general procedure to obtain terminology lists
• In an automatic way
• Independently of the application domain

To explore effective ways of Information Retrieval
• using the terminology lists instead of a thesaurus
• to bridge the gap between users’ and collection languages

Page 8: Corpus-based Terminology Extraction applied to Information Access

Page 9: Corpus-based Terminology Extraction applied to Information Access

Resources and Tools

Resources
• Semantic network: EuroWordNet
• Monolingual dictionary (VOX)
• Bilingual dictionary (VOX)

Tools
• Tokeniser
• Morphological analyser
• POS tagger
• Shallow parser (based on syntactic patterns)

Page 10: Corpus-based Terminology Extraction applied to Information Access

Corpora

Corpus of educational resources
1,075 documents (670,646 words) from
– Programa de Nuevas Tecnologías (http://www.pntic.mec.es/main_recursos.html)
– Aldea Global (http://sauce.pntic.mec.es/~alglobal)

Corpus of international news
7,364 documents (2.9 million words)
– (http://www.elpais.es/internac)

Pre-processing
(HTML tag treatment, language detection, detection of repeated pages and chunks, etc.)

Page 11: Corpus-based Terminology Extraction applied to Information Access

Page 12: Corpus-based Terminology Extraction applied to Information Access

Terminology Extraction (TE)

Terminology List: list of mono-lexical and poly-lexical terms that are in common use in a specific domain

Steps of Terminology Extraction
1. Term detection
2. Term weighting
3. Term selection

Page 13: Corpus-based Terminology Extraction applied to Information Access

1. Term Detection (mono-lexical)

(Over both corpora, Educational Resources and International News)

Processing
• Tokenising, Lemmatising, Tagging
• Removal of erroneous strings, abbreviations and words from other languages
• Extraction of nouns, verbs and adjectives

Result
List of candidate lemmas, each with its:
• Term frequency (any form) in both collections
• Document frequency in both collections
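The two counts attached to each candidate lemma can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each document has already been tokenised, lemmatised and filtered to nouns, verbs and adjectives, and the function name `frequency_tables` and the toy data are hypothetical.

```python
from collections import Counter


def frequency_tables(documents):
    """Return (term_frequency, document_frequency) for lists of lemmas.

    `documents` is an iterable of lemma lists, one list per document.
    """
    term_frequency = Counter()
    document_frequency = Counter()
    for lemmas in documents:
        term_frequency.update(lemmas)           # every occurrence counts
        document_frequency.update(set(lemmas))  # at most once per document
    return term_frequency, document_frequency


# Toy lemmatised documents (hypothetical data).
educational = [["alumno", "aprender", "alumno"], ["profesor", "alumno"]]
tf, df = frequency_tables(educational)
print(tf["alumno"], df["alumno"])
```

Running the same function over the general corpus would give the second pair of counts that the weighting step compares against.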

Page 14: Corpus-based Terminology Extraction applied to Information Access

1. Term Detection (poly-lexical)

(Over Educational Resources corpus)

Processing
• Tokenising, Lemmatising, Tagging
• Shallow parsing (Syntactic pattern recognition)

Result
List of candidate terminological phrases:
• Term frequency in the collection
• Document frequency in the collection

... como/CS en/Prep la/Art educación/N a/Prep distancia/N ,/Punc el/Art ministerio/N ...
Pattern: N Prep N
Detected term: educación a distancia (distance education)

Syntactic Patterns for Spanish terminological phrases
• N N
• N A
• N [A] Prep N [A]
• N [A] Prep Art N [A]
• N [A] Prep V
• N [A] Prep V N [A]
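A rough sketch of how such syntactic patterns could be matched against POS-tagged text is given below. It assumes the pattern list reads as N N, N A, N [A] Prep N [A], N [A] Prep Art N [A], N [A] Prep V and N [A] Prep V N [A], with square brackets marking optional elements. The helper names `expand` and `detect_phrases` are hypothetical, and a real shallow parser would work on lemmas and handle overlapping matches more carefully.

```python
from itertools import product

PATTERNS = [
    "N N",
    "N A",
    "N [A] Prep N [A]",
    "N [A] Prep Art N [A]",
    "N [A] Prep V",
    "N [A] Prep V N [A]",
]


def expand(pattern):
    """Expand optional [X] elements into every concrete tag tuple."""
    choices = []
    for element in pattern.split():
        if element.startswith("["):
            choices.append([(), (element[1:-1],)])  # absent or present
        else:
            choices.append([(element,)])
    return {sum(combo, ()) for combo in product(*choices)}


TAG_SEQUENCES = set().union(*(expand(p) for p in PATTERNS))


def detect_phrases(tagged):
    """Return every token span whose tag sequence matches a pattern."""
    found = []
    for start in range(len(tagged)):
        for seq in TAG_SEQUENCES:
            window = tagged[start:start + len(seq)]
            if tuple(tag for _, tag in window) == seq:
                found.append(" ".join(word for word, _ in window))
    return found


# The tagged fragment shown on the slide.
sentence = [
    ("como", "CS"), ("en", "Prep"), ("la", "Art"), ("educación", "N"),
    ("a", "Prep"), ("distancia", "N"), (",", "Punc"), ("el", "Art"),
    ("ministerio", "N"),
]
print(detect_phrases(sentence))
```

On the slide's fragment the only matching span is the N Prep N sequence, which yields "educación a distancia".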

Page 15: Corpus-based Terminology Extraction applied to Information Access

2. Term weighting

Empirical measure
• Proportional to, in the domain corpus:
– term frequency
– document frequency
• Inversely proportional to
– term frequency in other domain
• Normalisation

\[
\mathrm{Relevance}(t, sc, gc) = 1 - \frac{1}{\log_2\left(2 + \dfrac{F_{t,sc} \cdot D_{t,sc}}{F_{t,gc}}\right)}
\]

where
$F_{t,sc}$: relative frequency of the term $t$ in the specific corpus $sc$
$F_{t,gc}$: relative frequency of the term $t$ in the general corpus $gc$
$D_{t,sc}$: relative number of documents in $sc$ where $t$ appears
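The weighting can be tried out numerically with the sketch below, which reads the slide's measure as Relevance = 1 − 1 / log2(2 + F_sc · D_sc / F_gc). The slide does not say what happens when a term never occurs in the general corpus, so the zero-frequency branch is an assumption of this sketch (it returns the limit value 1), and the function and argument names are hypothetical.

```python
import math


def relevance(f_sc, d_sc, f_gc):
    """Relevance of a term as 1 - 1 / log2(2 + f_sc * d_sc / f_gc).

    f_sc: relative frequency of the term in the specific corpus
    d_sc: relative number of specific-corpus documents containing the term
    f_gc: relative frequency of the term in the general corpus
    """
    if f_gc == 0:
        # Assumption (not stated on the slide): a term unseen in the
        # general corpus takes the limit value of the measure.
        return 1.0 if f_sc * d_sc > 0 else 0.0
    return 1.0 - 1.0 / math.log2(2.0 + f_sc * d_sc / f_gc)


# A term that is rare in the domain scores near 0; a term far more
# frequent in the domain than in general text approaches 1.
print(relevance(0.0, 0.5, 0.1))
print(relevance(0.2, 0.5, 0.05))
print(relevance(0.2, 0.5, 0.0001))
```

The constant 2 inside the logarithm keeps the result in the range from 0 to 1: with a zero ratio the logarithm equals 1 and the relevance is 0.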

Page 16: Corpus-based Terminology Extraction applied to Information Access

3. Term Selection

• Removal of infrequent terms in the study domain
• Removal of very frequent terms in other domains
• Ranking of terms according to their weight
• Selection of top terms in the terminology list (thresholds to obtain 2,000 / 3,000 terms from the 75,000 detected terms)
• Addition of phrases with relevant components
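The selection steps can be sketched as a filter, a sort and a cut-off. Everything specific here is an assumption for illustration: the threshold values, the dictionary keys, the toy candidates, and the reading of "phrases with relevant components" as phrases containing at least one selected lemma.

```python
def select_terms(candidates, min_domain_df, max_general_f, top_n):
    """Filter, rank by weight and keep the top_n terms."""
    kept = [
        c for c in candidates
        if c["df_sc"] >= min_domain_df    # drop terms infrequent in the domain
        and c["f_gc"] <= max_general_f    # drop terms very frequent elsewhere
    ]
    kept.sort(key=lambda c: c["weight"], reverse=True)
    return [c["term"] for c in kept[:top_n]]


def add_phrases(selected, phrases):
    """Keep phrases sharing at least one word with the selected terms."""
    chosen = set(selected)
    return [p for p in phrases if chosen & set(p.split())]


# Hypothetical candidates with made-up statistics.
candidates = [
    {"term": "alumno", "df_sc": 40, "f_gc": 0.0001, "weight": 0.9},
    {"term": "educación", "df_sc": 55, "f_gc": 0.0002, "weight": 0.8},
    {"term": "gobierno", "df_sc": 30, "f_gc": 0.0100, "weight": 0.4},
    {"term": "escandallo", "df_sc": 1, "f_gc": 0.0000, "weight": 0.7},
]
selected = select_terms(candidates, min_domain_df=5, max_general_f=0.005, top_n=2)
phrases = add_phrases(selected, ["educación a distancia", "política exterior"])
print(selected, phrases)
```

In practice the thresholds would be tuned until the list reaches the desired size, as the slide's figures of 2,000 to 3,000 terms suggest.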

Page 17: Corpus-based Terminology Extraction applied to Information Access

Page 18: Corpus-based Terminology Extraction applied to Information Access

Evaluation: Visual exploration

Automatic generation of result pages in HTML

Purpose
• To help in the decisions of the prototype development
• To evaluate the measures and techniques and to suggest improvements or modifications
• To give further information to documentalists in order to assist final decisions in thesaurus construction

Page 19: Corpus-based Terminology Extraction applied to Information Access
Page 20: Corpus-based Terminology Extraction applied to Information Access
Page 21: Corpus-based Terminology Extraction applied to Information Access

Evaluation: Visual exploration

Page 22: Corpus-based Terminology Extraction applied to Information Access

Evaluation: Precision

Manual classification of the 2,856 selected terms

Category            Terms   %        Example
Adequate            1235    43.24%   Proyecto curricular
Specific domain     513     17.96%   Ciencias sociales
Computers domain    59      2.07%    Sistema operativo
Variants            78      2.73%    Proyectos curriculares (Proyecto curricular)
Incorrect           151     5.29%    Profesorado materiales ¿?
Not lexicalised     515     18.03%   Alumnos ingleses
Not domain          305     10.68%   Biblioteca nacional
Total of terms      2856    100%

66% of terms are appropriate

With little effort, a large number of accurate terms are proposed to documentalists
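The percentages in the classification can be checked with a few lines of arithmetic. The counts are those given on the slide. Treating the first four categories (Adequate, Specific domain, Computers domain, Variants) as the "appropriate" group is an assumption of this sketch, chosen because it reproduces the 66% figure.

```python
counts = {
    "Adequate": 1235,
    "Specific domain": 513,
    "Computers domain": 59,
    "Variants": 78,
    "Incorrect": 151,
    "Not lexicalised": 515,
    "Not domain": 305,
}

# Assumption: these four categories make up the "appropriate" terms.
APPROPRIATE = ("Adequate", "Specific domain", "Computers domain", "Variants")

total = sum(counts.values())
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}
appropriate = sum(counts[name] for name in APPROPRIATE)

print(total)
print(shares)
print(round(100 * appropriate / total))
```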

Page 23: Corpus-based Terminology Extraction applied to Information Access

Evaluation: Precision

[Figure: precision plotted against the number of selected candidates]

Precision: % of selected terms which are appropriate terms

• Higher precision at the top of the ranking
• With a lower number of candidates, the precision increases
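The relation between cut-off and precision can be illustrated with a small precision-at-k computation. The judgement list below is made up for the example, and `precision_at` is a hypothetical helper: it only shows the kind of curve described, where precision falls as more candidates are taken from a ranking that places good terms first.

```python
def precision_at(ranked_labels, k):
    """Share of appropriate terms among the first k ranked candidates."""
    top = ranked_labels[:k]
    if not top:
        raise ValueError("k must select at least one candidate")
    return sum(top) / len(top)


# Hypothetical manual judgements, in ranking order (True = appropriate).
judgements = [True, True, True, False, True, False, False, True]

for k in (2, 4, 8):
    print(k, precision_at(judgements, k))
```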

Page 24: Corpus-based Terminology Extraction applied to Information Access

Page 25: Corpus-based Terminology Extraction applied to Information Access

Terminology-based Information Access

Terminology Extraction in Information Retrieval provides:
• At indexing: to add poly-lexical terms to the indexes without the explosion of n-grams
• Term browsing: to navigate through the terminology and access the documents from the terms (without the use of thesauri)

Page 26: Corpus-based Terminology Extraction applied to Information Access

Terminology-based Information Access

A difference from TE: terminology list truncation
(since the query supplies the relevant terms, the task is now concerned with recall rather than precision of terms)

A new task: to retrieve terminology
• Poly-lexical terms are retrieved from mono-lexical ones

Indexing Levels: Lemma, Phrase, Document
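One way to picture the three indexing levels is a pair of mappings, lemma to phrases and phrase to documents, so that mono-lexical query terms first retrieve poly-lexical terms and those in turn lead to documents. The class below is a hypothetical sketch of that idea, not the system described in the talk. Ranking phrases by how many query lemmas they contain is an assumption, and function words are indexed here only for simplicity.

```python
from collections import defaultdict


class TerminologyIndex:
    """Lemma -> phrase -> document index (illustrative only)."""

    def __init__(self):
        self.lemma_to_phrases = defaultdict(set)
        self.phrase_to_docs = defaultdict(set)

    def add(self, doc_id, phrase_lemmas):
        """Register one phrase occurrence found in a document."""
        phrase = " ".join(phrase_lemmas)
        self.phrase_to_docs[phrase].add(doc_id)
        for lemma in phrase_lemmas:
            self.lemma_to_phrases[lemma].add(phrase)

    def phrases_for(self, query_lemmas):
        """Phrases sharing a lemma with the query, best overlap first."""
        overlap = defaultdict(int)
        for lemma in set(query_lemmas):
            for phrase in self.lemma_to_phrases.get(lemma, ()):
                overlap[phrase] += 1
        return sorted(overlap, key=lambda p: (-overlap[p], p))

    def documents_for(self, phrase):
        """Documents in which the phrase was detected."""
        return sorted(self.phrase_to_docs.get(phrase, ()))


index = TerminologyIndex()
index.add("d1", ["educación", "a", "distancia"])
index.add("d2", ["educación", "a", "distancia"])
index.add("d2", ["educación", "especial"])

found = index.phrases_for(["educación", "distancia"])
print(found)
print(index.documents_for(found[0]))
```

Only phrases detected by the extraction step are stored, which is how such an index avoids enumerating every n-gram in the collection.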

Page 27: Corpus-based Terminology Extraction applied to Information Access
Page 28: Corpus-based Terminology Extraction applied to Information Access
Page 29: Corpus-based Terminology Extraction applied to Information Access

Terminology-based Information Access

Terminology retrieval

To bridge the gap between
• Collection terminology
• Query terms

Requires
• Query expansion
• Query translation

But produces noise in the retrieval

However, phrases provide an excellent way to reduce ambiguity (Ballesteros & Croft, 1998)

Page 30: Corpus-based Terminology Extraction applied to Information Access
Page 31: Corpus-based Terminology Extraction applied to Information Access

Terminology-based Information Access

[Figure: expansion and translation of the Spanish query words Tratados, de, Prohibición, de, Pruebas, Nucleares]

Tratados
• Expansion: acuerdo, capitulación, concertación, convenio, cuidar, pacto, manejar, procesar
• Translation: accord, discourse, handle, manage, pact, process, treat, treatise, treaty

Prohibición
• Expansion: embargo, entredicho, interdicción, interdicto, proscripción
• Translation: ban, interdiction, prohibition, proscription

Pruebas
• Expansion: cata, catadura, degustación, ensayo, escandallo, experimento, gustación, muestreo, tanteo
• Translation: demonstrate, establish, exhibit, experiment, experimentation, fall, fitting, indicate, point, present, proof, prove, run, sample, sampling, shew, show, taste, test, trial, try

Nucleares
• Expansion: nuclear
• Translation: nuclear

Nuclear test ban treaty? Nuclear fitting interdiction manage? Nuclear taste proscription process?
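The noise produced by word-by-word translation, and the way phrases cut it down, can be made concrete with the candidate lists from this slide. The sketch below enumerates every combination of one English candidate per Spanish query word and then keeps only combinations that correspond to a phrase known in the collection. The translation lists are copied from the slide; the set of known phrases is hypothetical, and comparing phrases as unordered sets of words is a simplification.

```python
from itertools import product

TRANSLATIONS = {
    "tratados": ["accord", "discourse", "handle", "manage", "pact",
                 "process", "treat", "treatise", "treaty"],
    "prohibición": ["ban", "interdiction", "prohibition", "proscription"],
    "pruebas": ["demonstrate", "establish", "exhibit", "experiment",
                "experimentation", "fall", "fitting", "indicate", "point",
                "present", "proof", "prove", "run", "sample", "sampling",
                "shew", "show", "taste", "test", "trial", "try"],
    "nucleares": ["nuclear"],
}


def candidate_combinations(query_words):
    """One English candidate per query word, in every possible way."""
    return list(product(*(TRANSLATIONS[w] for w in query_words)))


def filter_by_terminology(combinations, known_phrases):
    """Keep combinations whose words form a phrase seen in the collection."""
    return [c for c in combinations if frozenset(c) in known_phrases]


query = ["tratados", "prohibición", "pruebas", "nucleares"]
combos = candidate_combinations(query)

# Hypothetical collection terminology, stored as unordered word sets.
known = {frozenset({"nuclear", "test", "ban", "treaty"})}
surviving = filter_by_terminology(combos, known)

print(len(combos))
print(surviving)
```

With these lists there are 9 × 4 × 21 × 1 = 756 word combinations, and only the one matching a known phrase survives the filter.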

Page 32: Corpus-based Terminology Extraction applied to Information Access
Page 33: Corpus-based Terminology Extraction applied to Information Access

Page 34: Corpus-based Terminology Extraction applied to Information Access

Conclusions

Extraction of relevant terms in Spanish for the ETB project domain (primary and secondary school / new technologies)
– Automatic process from free resources such as web pages
– Exploring contexts and statistical data via Internet

Development of a search engine based on terminology extraction
– Using terminology lists as an intermediate way between free searching and thesaurus-guided searching
– Without the need for thesaurus construction
– Bridging the distance between the terms used in the query and the terminology used in the collection (even in different languages)

Page 35: Corpus-based Terminology Extraction applied to Information Access

Thanks for your attention