Terminology Retrieval: towards a synergy between thesaurus and free text searching
Anselmo Peñas, Felisa Verdejo and Julio Gonzalo
Dpto. Lenguajes y Sistemas Informáticos, UNED
VIII Iberoamerican Conference on Artificial Intelligence, Sevilla, 2002

Page 1: Terminology Retrieval: towards a synergy between thesaurus and free text searching

Terminology Retrieval: towards a synergy between thesaurus and free text searching

Anselmo Peñas, Felisa Verdejo and Julio Gonzalo

Dpto. Lenguajes y Sistemas Informáticos

UNED

VIII Iberoamerican Conference on Artificial Intelligence

Sevilla, 2002

Page 2: Terminology Retrieval: towards a synergy between thesaurus and free text searching

Overview

• Motivation
• Objectives
• Proposed approach: Terminology Retrieval
• Website Term Browser
• Evaluation
• Conclusions

Page 3: Terminology Retrieval: towards a synergy between thesaurus and free text searching

Multilingual Thesaurus

Designed for

Indexing and searching in a specific subject area
  • Vocabulary control
  • Promoting consistency
  • Cross-language

Guiding users about which terms to use
  • Navigate the thesaurus

Example entry:

60. EDUCATIONAL SYSTEM
Education
  NT1 adult education
    RT adult (10)
    RT lifelong learning
  NT1 basic education
    RT* transition from basic to secondary education
    RT didactic continuity (50)
  NT1 distance education
    UF distance learning
    UF distance study
    UF distance training
    UF ODL
    UF open and distance learning
  NT1 informal education
  NT1 lifelong learning
    UF continuing education
    UF lifelong education
    UF recurrent education
    RT adult education
  (…)
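The entry above can be read as a small data structure. The following is a minimal sketch, assuming the usual thesaurus reading of the codes (NT = narrower term, RT = related term, UF = "used for", i.e. a non-preferred synonym). The `Descriptor` class and `build_entry_index` function are hypothetical names introduced for illustration; they are not part of the system presented in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class Descriptor:
    """One preferred term with its thesaurus relations (hypothetical model)."""
    term: str
    narrower: list = field(default_factory=list)  # NT: narrower terms
    related: list = field(default_factory=list)   # RT: related terms
    used_for: list = field(default_factory=list)  # UF: non-preferred synonyms


def build_entry_index(descriptors):
    """Map every preferred or non-preferred term to its preferred descriptor."""
    index = {}
    for descriptor in descriptors:
        index[descriptor.term.lower()] = descriptor.term
        for synonym in descriptor.used_for:
            index[synonym.lower()] = descriptor.term
    return index


# Part of the "Education" entry shown on the slide.
descriptors = [
    Descriptor("education", narrower=["adult education", "basic education",
                                      "distance education", "informal education",
                                      "lifelong learning"]),
    Descriptor("adult education", related=["adult", "lifelong learning"]),
    Descriptor("distance education",
               used_for=["distance learning", "distance study",
                         "distance training", "ODL",
                         "open and distance learning"]),
    Descriptor("lifelong learning",
               used_for=["continuing education", "lifelong education",
                         "recurrent education"],
               related=["adult education"]),
]

entry_index = build_entry_index(descriptors)
print(entry_index["odl"])                   # preferred descriptor for "ODL"
print(entry_index["continuing education"])  # preferred descriptor for a UF term
```

Looking up a non-preferred term in such an index is how a thesaurus guides a user from their own wording to the descriptor used for indexing.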

Page 4: Terminology Retrieval: towards a synergy between thesaurus and free text searching

Multilingual Thesaurus

Problems

Construction & management (high cost)

Indexing
  • Manual keyword assessment
  • Errors in automatic keyword assessment

Domain specific
  • A new domain needs a new thesaurus

Specialist oriented (users must know the preferred descriptors)
  • Less specialized audiences get poorer results

Page 5: Terminology Retrieval: towards a synergy between thesaurus and free text searching

Objectives

Develop a model
  – to help users express and refine their information needs
  – to help users overcome language barriers
    • Bringing the collection terminology to users
    • Morpho-syntactic, semantic & translingual variations
    • Without the need for thesaurus construction

Establish an appropriate evaluation framework

Page 6: Terminology Retrieval: towards a synergy between thesaurus and free text searching

Proposed approach

[Diagram. Recoverable labels: Information Retrieval; Controlled Vocabulary Searching; Free Text Searching; NLP Techniques; Automatic Terminology Extraction; Terminology Retrieval & Term Browsing (Website Term Browser)]

Page 7: Terminology Retrieval: towards a synergy between thesaurus and free text searching

Terminology Retrieval

From Automatic Terminology Extraction...

Obtain lists of terms relevant for a specific domain
  • Term Extraction
  • Term Weighting
  • Term Selection

... to Terminology Retrieval

Retrieve terms relevant for an information need
  • The user query points to the relevant terms
  • No truncation of terminology lists
  • Favor recall by relaxing term extraction patterns

... & Browsing
  • Navigate through relevant terminology
  • Access information from retrieved terms
  • Bridge the gap between query and collection vocabularies
  • Cross-Language

Page 8: Terminology Retrieval: towards a synergy between thesaurus and free text searching

Terminology Retrieval

Requires

Phrase indexing and retrieval

Query expansion and translation
  • To retrieve terminology variations
    – Morpho-syntactic variations
    – Semantic variations
    – Translingual variations
  • Noise in retrieval
  • Ambiguity reduction
    – Co-occurrence of expansion words in the same phrase

Page 9: Terminology Retrieval: towards a synergy between thesaurus and free text searching

Indexing

Steps

1. Text pre-processing and listing of words

2. Word tagging (oriented to phrase detection)

3. Phrase detection & lemmatization of components

4. Document indexing & statistics (document frequency)

5. Phrase selection (Subsumption & Lexicalization degree)

6. Phrase indexing

[Diagram. Recoverable labels: Lemma, Phrase, Document]
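The indexing steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the input is assumed to be already tokenised, tagged and lemmatised (steps 1 to 3), the phrase pattern (maximal runs of two or more nouns and adjectives) is a simplification chosen for this example, and phrase selection (step 5) is omitted. The names `detect_phrases`, `build_indexes` and the tag labels are hypothetical.

```python
from collections import defaultdict

CONTENT_TAGS = {"NOUN", "ADJ"}  # assumed tag labels, not from the slides


def detect_phrases(tagged_tokens):
    """Return lemmatised phrases: maximal runs of >= 2 noun/adjective tokens."""
    phrases, run = [], []
    for _word, lemma, tag in tagged_tokens:
        if tag in CONTENT_TAGS:
            run.append(lemma)
        else:
            if len(run) >= 2:
                phrases.append(tuple(run))
            run = []
    if len(run) >= 2:  # flush a run that reaches the end of the text
        phrases.append(tuple(run))
    return phrases


def build_indexes(documents):
    """Build lemma -> phrases and phrase -> documents indexes."""
    lemma_to_phrases = defaultdict(set)
    phrase_to_docs = defaultdict(set)
    for doc_id, tagged_tokens in documents.items():
        for phrase in detect_phrases(tagged_tokens):
            phrase_to_docs[phrase].add(doc_id)
            for lemma in phrase:
                lemma_to_phrases[lemma].add(phrase)
    return lemma_to_phrases, phrase_to_docs


# Toy collection: each token is (word, lemma, tag).
documents = {
    "d1": [("Nuclear", "nuclear", "ADJ"), ("tests", "test", "NOUN"),
           ("were", "be", "VERB"), ("banned", "ban", "VERB")],
    "d2": [("The", "the", "DET"), ("nuclear", "nuclear", "ADJ"),
           ("test", "test", "NOUN"), ("ban", "ban", "NOUN"),
           ("treaty", "treaty", "NOUN")],
}

lemma_to_phrases, phrase_to_docs = build_indexes(documents)

# Document frequency of a phrase is the size of its posting set.
document_frequency = {p: len(docs) for p, docs in phrase_to_docs.items()}
```

Indexing phrases by the lemmas of their components is what lets a single query lemma reach every phrase that contains it, regardless of inflection.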

Page 10: Terminology Retrieval: towards a synergy between thesaurus and free text searching

Query expansion and translation

Query words: Tratados / de / Prohibición / de / Pruebas / Nucleares

Tratados
  Expansion: acuerdo, capitulación, concertación, convenio, cuidar, pacto, manejar, procesar
  Translation: accord, discourse, handle, manage, pact, process, treat, treatise, treaty

Prohibición
  Expansion: embargo, entredicho, interdicción, interdicto, proscripción
  Translation: ban, interdiction, prohibition, proscription

Pruebas
  Expansion: cata, catadura, degustación, ensayo, escandallo, experimento, gustación, muestreo, tanteo
  Translation: demonstrate, establish, exhibit, experiment, experimentation, fall, fitting, indicate, point, present, proof, prove, run, sample, sampling, shew, show, taste, test, trial, try

Nucleares
  Expansion: nuclear
  Translation: nuclear

Ambiguity Reduction

Nuclear taste proscription process? Nuclear test ban treaty?
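The ambiguity-reduction idea in this example (keep only candidate combinations whose words co-occur in the same indexed phrase) can be sketched as below. This is a hedged illustration: the function name `retrieve_phrases`, the `min_matches` threshold and the ordering by number of matched query words are assumptions made for the example, and the phrase list is invented toy data rather than the actual index.

```python
def retrieve_phrases(candidates_per_word, indexed_phrases, min_matches=2):
    """Keep phrases in which candidates of >= min_matches query words co-occur.

    candidates_per_word maps each query word to the set of its expansion and
    translation candidates; indexed_phrases is an iterable of lemma tuples.
    """
    results = []
    for phrase in indexed_phrases:
        lemmas = set(phrase)
        matched_words = [word for word, candidates in candidates_per_word.items()
                         if lemmas & candidates]
        if len(matched_words) >= min_matches:
            results.append((phrase, len(matched_words)))
    # Assumed ordering: phrases covering more query words come first.
    results.sort(key=lambda item: (-item[1], item[0]))
    return results


# A subset of the English translation candidates listed on the slide.
candidates_per_word = {
    "tratados": {"accord", "pact", "process", "treat", "treatise", "treaty"},
    "prohibición": {"ban", "interdiction", "prohibition", "proscription"},
    "pruebas": {"proof", "sample", "taste", "test", "trial"},
    "nucleares": {"nuclear"},
}

# Hypothetical phrases standing in for the collection's phrase index.
indexed_phrases = [
    ("nuclear", "test", "ban", "treaty"),
    ("wine", "taste"),
    ("peace", "process"),
    ("nuclear", "test"),
]

ranked = retrieve_phrases(candidates_per_word, indexed_phrases)
for phrase, matches in ranked:
    print(" ".join(phrase), matches)
```

Because "taste", "proscription" and "process" never appear together in a phrase of the toy collection, the unintended reading is never produced, while "nuclear test ban treaty" is retrieved.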

Page 11: Terminology Retrieval: towards a synergy between thesaurus and free text searching

Retrieval

[Diagram: retrieval pipeline. Recoverable labels:
  query → Tokenising (tok1 tok2 tok3) → Lemmatising, with the Lexicon (lem11 lem12 ... lem31 lem32 ...) → Expansion / Translation, with EWN & Dic. (exp11 exp12 ... tran11 tran12 ...)
  Phrase retrieval, over the Phrase index → Term ranking → terms
  Document retrieval, over the Document index → Document ranking → documents]

Page 12: Terminology Retrieval: towards a synergy between thesaurus and free text searching

[Screenshot. Callouts: Query in Spanish; Hierarchy of terms; Catalan; English; Spanish; Ranking of documents]

Page 13: Terminology Retrieval: towards a synergy between thesaurus and free text searching

- Translingual
- Morpho-syntactic variations (permutation, insertion)
- Semantic variations

Page 14: Terminology Retrieval: towards a synergy between thesaurus and free text searching

Evaluation of Terminology Retrieval

Compare
  • Terminology Retrieval over 42,406 web pages (200 Mb)
  • Hand-crafted Multilingual Thesaurus (1051 descriptors)

Page 15: Terminology Retrieval: towards a synergy between thesaurus and free text searching


Page 16: Terminology Retrieval: towards a synergy between thesaurus and free text searching

Evaluation of Terminology Retrieval

Recall of mono-lexical terms (lemmas)
  • Monolingual: 85% - 95%
  • Translingual: 55% - 65%

Recall of poly-lexical terms (phrases)
  • Monolingual: 40% - 65%
  • Translingual: 10% - 45%

Loss of recall due to
  • Phrase extraction (mainly POS tagging): 3% - 17%
  • Phrase indexing (mainly lemmatization): 2% - 34%
  • Phrase selection: 12% - 37%
  • Lack of connections between different languages in EWN
  • Lack of adjective hierarchies in EWN
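For reference, recall here can be understood as the fraction of reference terms (for instance, thesaurus descriptors) that the system manages to retrieve. The sketch below shows that computation only; the `recall` function and the term sets are hypothetical illustrations and do not reproduce the evaluation data reported above.

```python
def recall(retrieved_terms, reference_terms):
    """Fraction of reference terms that appear among the retrieved terms."""
    reference = set(reference_terms)
    if not reference:
        raise ValueError("reference_terms must not be empty")
    return len(reference & set(retrieved_terms)) / len(reference)


# Invented example data, not taken from the evaluation.
reference_terms = {"adult education", "basic education",
                   "distance education", "lifelong learning"}
retrieved_terms = {"adult education", "distance education", "open university"}

score = recall(retrieved_terms, reference_terms)
print(f"recall = {score:.0%}")
```

Retrieved terms outside the reference set (such as "open university" here) do not affect recall; they would only matter for a precision measure.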

Page 17: Terminology Retrieval: towards a synergy between thesaurus and free text searching

Conclusions

A search model based on extraction, retrieval and browsing of terminology has been developed

• User oriented

• Interaction over terminological information
  – Intermediate way between free-text searching and thesaurus-guided searching
  – Without the need for thesaurus construction

• Bringing the collection terminology to users
  – Morpho-syntactic & semantic variations
  – Translinguality

Page 18: Terminology Retrieval: towards a synergy between thesaurus and free text searching

Conclusions

An evaluation framework for Terminology Retrieval and Term Browsing has been established

• Points the way to improving Terminology Retrieval

• Users appreciate Term Browsing

• WTB phrasal information can substantially complement the document ranking provided by search engines