Corpus-based Terminology Extraction applied to Information Access


Anselmo Peñas, Felisa Verdejo and Julio Gonzalo

NLP Group, Dpto. Lenguajes y Sistemas Informáticos,

UNED, Spain

Corpus Linguistics 2001, Lancaster, UK

Content

• Introduction
• Resources, Tools and Corpora
• Terminology Extraction (TE)
• Evaluation of the TE procedure
• Terminology-based Information Access
• Conclusions

Introduction: Framework

The European Treasury Browser (ETB) project

• Web site of Educational Resources (primary and secondary school)

• Context of New Technologies

• Objective: to build the structures to organise and retrieve educational resources

Similar systems

• The Educational Resources Information Centre

• The British Education Index

Introduction: use of Thesauri

Thesauri

Definition: controlled vocabulary, structured by relations

Structure: descriptors and relations (NT: narrower term, BT: broader term, RT: related term)

Existing educational thesauri

• Do not cover primary and secondary school vocabulary within the new technologies context

Construction of a multilingual thesaurus is needed for the ETB project purposes

Terminology Lists

Objectives of the work

To build the Spanish list of candidate terms for the ETB multilingual thesaurus.

To develop a general procedure to obtain terminology lists

• In an automatic way

• Independently of the application domain

To explore effective ways of Information Retrieval

• using the terminology lists instead of a thesaurus

• to bridge the gap between users’ and collection languages


Resources and Tools

Resources
• Semantic network: EuroWordNet
• Monolingual dictionary (VOX)
• Bilingual dictionary (VOX)

Tools
• Tokeniser
• Morphological analyser
• POS tagger
• Shallow parser (based on syntactic patterns)

Corpora

Corpus of educational resources
• 1,075 documents (670,646 words) from
– Programa de Nuevas Tecnologías (http://www.pntic.mec.es/main_recursos.html)
– Aldea Global (http://sauce.pntic.mec.es/~alglobal)

Corpus of international news
• 7,364 documents (2.9 million words)
– (http://www.elpais.es/internac)

Pre-processing (HTML tag treatment, language detection, detection of repeated pages and chunks, etc.)


Terminology Extraction (TE)

Terminology List:

List of mono-lexical and poly-lexical terms which are usual in a specific domain

Steps of Terminology Extraction

1. Term detection

2. Term weighting

3. Term selection

1. Term Detection (mono-lexical)

(Over both corpora, Educational Resources and International News)

Processing
• Tokenising
• Lemmatising, tagging
• Removal of erroneous strings, abbreviations and words from other languages
• Extraction of nouns, verbs and adjectives

Result: list of candidate lemmas, each with its

• Term frequency (any form) in both collections

• Document frequency in both collections
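The two counts attached to each candidate lemma can be sketched as follows. This is only a minimal illustration, assuming every document has already been tokenised and lemmatised into a list of lemmas; the function name `frequency_tables` and the sample documents are hypothetical and not taken from the ETB prototype.

```python
from collections import Counter

def frequency_tables(documents):
    """Return (term frequency, document frequency) for a list of lemmatised documents."""
    tf = Counter()  # total occurrences of each lemma in the collection
    df = Counter()  # number of documents in which each lemma occurs
    for lemmas in documents:
        tf.update(lemmas)
        df.update(set(lemmas))  # count each lemma once per document
    return tf, df

# Hypothetical toy collection of two lemmatised documents.
docs = [
    ["educación", "distancia", "educación"],
    ["ministerio", "educación"],
]
tf, df = frequency_tables(docs)
print(tf["educación"], df["educación"])  # 3 2
```

Running the same function over each corpus separately would give the per-collection figures listed above.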

1. Term Detection (poly-lexical)

(Over Educational Resources corpus)

Processing
• Tokenising, lemmatising, tagging
• Shallow parsing (syntactic pattern recognition)

Result: list of candidate terminological phrases, each with its

• Term frequency in the collection

• Document frequency in the collection

... como/CS en/Prep la/Art educación/N a/Prep distancia/N ,/Punc el/Art ministerio/N ...

Pattern: N Prep N

Detected term: educación a distancia

Syntactic Patterns for Spanish terminological phrases

• N N
• N A
• N [A] Prep N [A]
• N [A] Prep Art N [A]
• N [A] Prep V
• N [A] Prep V N [A]
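Pattern-based detection of this kind can be sketched as a sliding window over the tagged tokens. The sketch below is an assumption-laden illustration rather than the shallow parser used in the project: it reads the pattern list as six patterns in which square brackets mark an optional adjective, and the names `PATTERNS`, `expand` and `detect` are hypothetical.

```python
from itertools import product

# Assumed reading of the slide: six patterns, [A] marks an optional adjective.
PATTERNS = [
    "N N",
    "N A",
    "N [A] Prep N [A]",
    "N [A] Prep Art N [A]",
    "N [A] Prep V",
    "N [A] Prep V N [A]",
]

def expand(pattern):
    """Expand optional elements such as [A] into every concrete tag sequence."""
    slots = []
    for item in pattern.split():
        if item.startswith("["):
            slots.append([(), (item[1:-1],)])  # element absent or present
        else:
            slots.append([(item,)])
    return {sum(choice, ()) for choice in product(*slots)}

SEQUENCES = set().union(*(expand(p) for p in PATTERNS))

def detect(tagged):
    """Return every word span whose tag sequence matches one of the patterns."""
    found = []
    for start in range(len(tagged)):
        for seq in SEQUENCES:
            window = tagged[start:start + len(seq)]
            if tuple(tag for _, tag in window) == seq:
                found.append(" ".join(word for word, _ in window))
    return found

# The tagged fragment shown on the slide.
sentence = [
    ("como", "CS"), ("en", "Prep"), ("la", "Art"), ("educación", "N"),
    ("a", "Prep"), ("distancia", "N"), (",", "Punc"), ("el", "Art"),
    ("ministerio", "N"),
]
print(detect(sentence))  # ['educación a distancia']
```

Expanding the optional elements up front keeps the matching step a plain tuple comparison; a production parser would more likely work on lemmas and compile the patterns into an automaton.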

2. Term weighting

Empirical measure
• Proportional to (in the domain corpus)
– term frequency
– document frequency
• Inversely proportional to
– term frequency in other domain
• Normalisation

$$\mathrm{Relevance}(t, sc, gc) = 1 - \frac{1}{\log_2\left(2 + \dfrac{F_{t,sc} \cdot D_{t,sc}}{F_{t,gc}}\right)}$$

where
• $F_{t,sc}$: relative frequency of the term t in the specific corpus sc
• $F_{t,gc}$: relative frequency of the term t in the general corpus gc
• $D_{t,sc}$: relative number of documents in sc where t appears
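The measure can be tried out numerically with a few lines of Python. The sketch reads the formula as Relevance = 1 - 1 / log2(2 + F(t,sc) · D(t,sc) / F(t,gc)); the function name `relevance`, the sample values and the handling of a term that never occurs in the general corpus are assumptions, since the slides do not say how that case is treated.

```python
import math

def relevance(f_sc, d_sc, f_gc):
    """Relevance of a term given its relative frequencies.

    f_sc: relative frequency in the specific corpus
    d_sc: relative number of specific-corpus documents containing the term
    f_gc: relative frequency in the general corpus
    """
    if f_gc == 0:
        # Assumption: a term absent from the general corpus gets the limit
        # value of the formula (the ratio grows without bound).
        return 1.0
    return 1.0 - 1.0 / math.log2(2.0 + (f_sc * d_sc) / f_gc)

# Hypothetical values: a term absent from the domain scores 0,
# and the score grows towards 1 as the ratio increases.
print(relevance(0.0, 0.0, 0.001))                 # 0.0
print(round(relevance(0.004, 0.5, 0.001), 3))     # 0.5
print(round(relevance(0.02, 0.5, 0.0001), 3))
```

The `2 +` inside the logarithm keeps the denominator at 1 or above, so the value stays in the range from 0 to 1, which is the normalisation mentioned in the list above.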

3. Term Selection

• Removal of infrequent terms in the study domain
• Removal of very frequent terms in other domains
• Ranking of terms according to their weight
• Selection of top terms in the terminology list (thresholds to obtain 2,000 / 3,000 terms from the 75,000 detected terms)
• Addition of phrases with relevant components
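The selection steps can be sketched as a filter, a sort and a cut. Everything specific here is an assumption for illustration: the threshold values, the sample candidates, the function names `select_terms` and `add_phrases`, and the reading of "phrases with relevant components" as phrases containing at least one selected term.

```python
def select_terms(candidates, min_domain_freq, max_general_freq, top_n):
    """Filter by frequency thresholds, rank by relevance, keep the top terms."""
    kept = [
        c for c in candidates
        if c["f_sc"] >= min_domain_freq      # drop terms that are rare in the domain
        and c["f_gc"] <= max_general_freq    # drop terms that are very frequent elsewhere
    ]
    kept.sort(key=lambda c: c["relevance"], reverse=True)
    return [c["term"] for c in kept[:top_n]]

def add_phrases(selected, phrases):
    """Keep the phrases that contain at least one selected term (assumed reading)."""
    chosen = set(selected)
    return [p for p in phrases if any(word in chosen for word in p.split())]

# Hypothetical candidates with absolute frequencies and precomputed weights.
candidates = [
    {"term": "currículo", "f_sc": 40, "f_gc": 1, "relevance": 0.90},
    {"term": "gobierno", "f_sc": 30, "f_gc": 900, "relevance": 0.20},
    {"term": "escandallo", "f_sc": 1, "f_gc": 0, "relevance": 0.95},
    {"term": "alumno", "f_sc": 120, "f_gc": 15, "relevance": 0.80},
]
selected = select_terms(candidates, min_domain_freq=5, max_general_freq=100, top_n=2)
print(selected)  # ['currículo', 'alumno']
print(add_phrases(selected, ["proyecto curricular", "alumno inglés", "tratado nuclear"]))
# ['alumno inglés']
```

In practice the thresholds would be tuned until the list reaches the desired size, as with the 2,000 / 3,000 figure quoted above.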


Evaluation: Visual exploration

Automatic generation of result pages in HTML

Purpose

• To help in the decisions of the prototype development

• To evaluate the measures and techniques and to suggest improvements or modifications

• To give further information to documentalists in order to assist final decisions in thesaurus construction


Evaluation: Precision

Manual classification of the 2,856 selected terms

Category            Terms   %        Example
Adequate            1235    43.24%   Proyecto curricular
Specific domain     513     17.96%   Ciencias sociales
Computers domain    59      2.07%    Sistema operativo
Variants            78      2.73%    Proyectos curriculares (Proyecto curricular)
Incorrect           151     5.29%    Profesorado materiales ¿?
Not lexicalised     515     18.03%   Alumnos ingleses
Not domain          305     10.68%   Biblioteca nacional
Total of terms      2856    100%

66% of the terms are appropriate

With little effort, a large number of accurate terms is proposed to documentalists
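The 66% figure can be checked against the counts on this slide. The short calculation below assumes the seven counts map, in order, to the categories Adequate, Specific domain, Computers domain, Variants, Incorrect, Not lexicalised and Not domain, and that the first four are the ones counted as appropriate; that grouping is inferred from the totals, not stated on the slide.

```python
# Counts as read from the slide, in the assumed category order.
counts = {
    "Adequate": 1235,
    "Specific domain": 513,
    "Computers domain": 59,
    "Variants": 78,
    "Incorrect": 151,
    "Not lexicalised": 515,
    "Not domain": 305,
}
total = sum(counts.values())

# Assumed grouping: the first four categories are the appropriate terms.
appropriate = sum(
    counts[name]
    for name in ("Adequate", "Specific domain", "Computers domain", "Variants")
)
share = 100 * appropriate / total
print(total, appropriate, round(share, 1))  # 2856 1885 66.0
```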

Evaluation: Precision

[Figure: precision plotted against the number of selected candidates]

Precision: % of selected terms which are appropriate terms

Higher precision at the top of the ranking

With a lower number of candidates, the precision increases
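A precision curve of this kind can be computed from the ranked list and the manual judgements. The sketch below is a generic precision-at-cut-off calculation, not the evaluation script behind the slide; the function name `precision_at` and the ten judgements are made up for illustration.

```python
def precision_at(ranked_labels, cutoffs):
    """Precision of the top-k candidates for each cut-off k.

    ranked_labels: booleans ordered from best-ranked to worst-ranked,
    True meaning the candidate was judged an appropriate term.
    """
    result = {}
    for k in cutoffs:
        top = ranked_labels[:k]
        if top:  # skip cut-offs that select nothing
            result[k] = sum(top) / len(top)
    return result

# Hypothetical judgements for ten ranked candidates.
labels = [True, True, True, False, True, False, False, True, False, False]
print(precision_at(labels, [2, 5, 10]))  # {2: 1.0, 5: 0.8, 10: 0.5}
```

With judgements like these, precision falls as more candidates are selected, which is the behaviour described above.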


Terminology-based Information Access

Terminology Extraction in Information Retrieval provides:

At Indexing: to add poly-lexical terms to the indexes without the explosion of n-grams

Term browsing: to navigate through the terminology and access the documents from the terms (without the use of thesauri)


A difference with TE: terminology list truncation

(since the query supplies the relevant terms, the task is now concerned with recall rather than precision of terms)

A new task: to retrieve terminology

• Poly-lexical terms are retrieved from mono-lexical ones

Indexing levels: lemma, phrase, document
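Retrieving poly-lexical terms from mono-lexical ones can be sketched with a lemma-to-phrase index, the middle link of the lemma, phrase, document chain. This is a minimal illustration under assumptions: the function names `build_phrase_index` and `retrieve_phrases` are hypothetical, ranking by the number of shared lemmas is a design choice of the sketch, and the sample phrases mix terms from the slides with an invented one ("educación infantil").

```python
from collections import defaultdict

def build_phrase_index(phrases):
    """Map each content lemma to the phrases that contain it."""
    index = defaultdict(set)
    for phrase, lemmas in phrases.items():
        for lemma in lemmas:
            index[lemma].add(phrase)
    return index

def retrieve_phrases(index, query_lemmas):
    """Return phrases sharing lemmas with the query, best match first."""
    hits = defaultdict(int)
    for lemma in set(query_lemmas):
        for phrase in index.get(lemma, ()):
            hits[phrase] += 1
    # More shared lemmas first; alphabetical order breaks ties.
    return sorted(hits, key=lambda p: (-hits[p], p))

# Hypothetical phrase list with the content lemmas of each phrase.
phrases = {
    "educación a distancia": ["educación", "distancia"],
    "educación infantil": ["educación", "infantil"],
    "sistema operativo": ["sistema", "operativo"],
}
index = build_phrase_index(phrases)
print(retrieve_phrases(index, ["educación", "distancia"]))
# ['educación a distancia', 'educación infantil']
```

A second index from phrases to documents would complete the chain, so that a user can move from a query lemma to a phrase and from the phrase to the documents that contain it.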


Terminology retrieval

To bridge the gap between
• Collection terminology
• Query terms

Requires
• Query expansion
• Query translation

But this produces noise in the retrieval

However, phrases provide an excellent way of reducing ambiguity (Ballesteros & Croft, 1998)

Terminology-based Information Access

Query words: Tratados, Prohibición, Pruebas, Nucleares (linked by the preposition "de")

Tratados
• Expansion: acuerdo, capitulación, concertación, convenio, cuidar, pacto, manejar, procesar
• Translation: accord, discourse, handle, manage, pact, process, treat, treatise, treaty

Prohibición
• Expansion: embargo, entredicho, interdicción, interdicto, proscripción
• Translation: ban, interdiction, prohibition, proscription

Pruebas
• Expansion: cata, catadura, degustación, ensayo, escandallo, experimento, gustación, muestreo, tanteo
• Translation: demonstrate, establish, exhibit, experiment, experimentation, fall, fitting, indicate, point, present, proof, prove, run, sample, sampling, shew, show, taste, test, trial, try

Nucleares
• Expansion: nuclear
• Translation: nuclear

Candidate phrases: Nuclear test ban treaty? Nuclear fitting interdiction manage? Nuclear taste proscription process?
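The noise produced by word-by-word translation, and the way phrases cut it down, can be illustrated with a small combinatorial sketch. It uses only a subset of the translations shown on this slide; the English word order, the variable names and the one-item set of phrases assumed to have been detected in the target collection are hypothetical.

```python
from itertools import product

# Subset of the translations listed on the slide for each query word.
translations = {
    "tratados": ["accord", "pact", "treaty", "manage", "process"],
    "prohibición": ["ban", "interdiction", "prohibition", "proscription"],
    "pruebas": ["test", "trial", "taste", "fitting", "proof"],
    "nucleares": ["nuclear"],
}

# Assumption: the English phrase reverses the Spanish word order.
order = ["nucleares", "pruebas", "prohibición", "tratados"]
candidates = [
    " ".join(words)
    for words in product(*(translations[word] for word in order))
]
print(len(candidates))  # 100 combinations from this small subset alone

# Hypothetical set of phrases detected in the target collection.
collection_phrases = {"nuclear test ban treaty"}
attested = [c for c in candidates if c in collection_phrases]
print(attested)  # ['nuclear test ban treaty']
```

Most of the combinations are noise of the kind shown above ("nuclear fitting interdiction manage"); checking them against phrases that occur in the collection keeps the one reading that makes sense.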


Conclusions

Extraction of relevant terms in Spanish for the ETB project domain (primary and secondary school / new technologies)
– Automatic process from free resources such as web pages
– Exploring contexts and statistical data via Internet

Development of a search engine based on terminology extraction
– Using terminology lists as an intermediate way between free searching and thesaurus-guided searching
– Without requiring thesaurus construction
– Bridging the distance between the terms used in the query and the terminology used in the collection (even in different languages)

Thanks for your attention
