VIII Iberoamerican Conference on Artificial Intelligence, Sevilla, 2002. Terminology Retrieval: towards a synergy between thesaurus and free text searching. Anselmo Peñas, Felisa Verdejo and Julio Gonzalo, Dpto. Lenguajes y Sistemas Informáticos, UNED.
Terminology Retrieval: towards a synergy between thesaurus and free text searching
Anselmo Peñas, Felisa Verdejo and Julio Gonzalo
Dpto. Lenguajes y Sistemas Informáticos
UNED
VIII Iberoamerican Conference on Artificial Intelligence
Sevilla, 2002
Overview
• Motivation
• Objectives
• Proposed approach: Terminology Retrieval
• Website Term Browser
• Evaluation
• Conclusions
Multilingual Thesaurus
Designed for indexing and searching in a specific subject area
• Vocabulary control
• Promoting consistency
• Cross-language
Guiding users about which terms to use
• Navigate the thesaurus
60. EDUCATIONAL SYSTEM
Education
  NT1 adult education
    RT adult (10)
    RT lifelong learning
  NT1 basic education
    RT* transition from basic to secondary education
    RT didactic continuity (50)
  NT1 distance education
    UF distance learning
    UF distance study
    UF distance training
    UF ODL
    UF open and distance learning
  NT1 informal education
  NT1 lifelong learning
    UF continuing education
    UF lifelong education
    UF recurrent education
    RT adult education
  (…)
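One way to picture the thesaurus excerpt above as a data structure is the minimal sketch below. The class name `Descriptor`, its fields and the helper `preferred_descriptor` are illustrative assumptions, not part of any thesaurus software mentioned in the slides; only the NT/UF/RT entries are taken from the excerpt.

```python
from dataclasses import dataclass, field


@dataclass
class Descriptor:
    """A preferred term with its thesaurus relations (hypothetical structure)."""
    term: str
    narrower: list = field(default_factory=list)  # NT: narrower terms
    used_for: list = field(default_factory=list)  # UF: non-preferred synonyms
    related: list = field(default_factory=list)   # RT: related terms


# Entries copied from the excerpt; the dictionary layout is an assumption.
THESAURUS = {
    "education": Descriptor(
        "education",
        narrower=["adult education", "basic education", "distance education",
                  "informal education", "lifelong learning"],
    ),
    "distance education": Descriptor(
        "distance education",
        used_for=["distance learning", "distance study", "distance training",
                  "ODL", "open and distance learning"],
    ),
    "lifelong learning": Descriptor(
        "lifelong learning",
        used_for=["continuing education", "lifelong education",
                  "recurrent education"],
        related=["adult education"],
    ),
}


def preferred_descriptor(user_term):
    """Map a user term to its preferred descriptor via UF links, or None."""
    wanted = user_term.lower()
    for descriptor in THESAURUS.values():
        candidates = [descriptor.term] + descriptor.used_for
        if wanted in (c.lower() for c in candidates):
            return descriptor.term
    return None
```

A user who types "ODL" is led to the descriptor "distance education", which is the kind of guidance the thesaurus provides only to users whose wording happens to match a listed entry.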
Multilingual Thesaurus
Problems
Construction & management (high cost)
Indexing
• Manual keyword assessment
• Errors in automatic keyword assessment
Domain specific
• A new domain needs a new thesaurus
Specialist oriented (users must know the preferred descriptors)
• Less specialized audiences get poorer results
Objectives
Develop a model
– to help users express and refine their information needs
– to help users overcome language barriers
• Bringing the collection terminology to users
• Morpho-syntactic, semantic & translingual variations
• Without the need for thesaurus construction
Establish an appropriate evaluation framework
Proposed approach
[Diagram: Terminology Retrieval & Term Browsing (Website Term Browser) shown alongside Controlled Vocabulary Searching and Free Text Searching within Information Retrieval, supported by NLP Techniques and Automatic Terminology Extraction]
Terminology Retrieval
From Automatic Terminology Extraction...
Obtain lists of terms relevant for a specific domain
• Term Extraction
• Term Weighting
• Term Selection
... to Terminology Retrieval
Retrieve terms relevant for an information need
• The user query points to the relevant terms
• No truncation of terminology lists
• Favor recall by relaxing term extraction patterns
... & Browsing
• Navigate through relevant terminology
• Access information from retrieved terms
• Bridge the gap between query and collection vocabularies
• Cross-language
Terminology Retrieval
Requires
Phrase indexing and retrieval
Query expansion and translation
• To retrieve terminology variations
– Morpho-syntactic variations
– Semantic variations
– Translingual variations
• Noise in retrieval
• Ambiguity reduction
– Co-occurrence of expansion words in the same phrase
Indexing
Steps
1. Text pre-processing and listing of words
2. Word tagging (oriented to phrase detection)
3. Phrase detection & lemmatization of components
4. Document indexing & statistics (document frequency)
5. Phrase selection (Subsumption & Lexicalization degree)
6. Phrase indexing
[Diagram labels: Lemma, Phrase, Document]
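The indexing steps can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the input is assumed to be already tokenised, tagged and lemmatised (steps 1 to 3) by some external tool, the phrase pattern (runs of adjectives and nouns ending in a noun) is a deliberately simple stand-in, and phrase selection (step 5) is omitted. All names (`DOCS`, `detect_phrases`, `build_indexes`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical input: documents given as (word, lemma, tag) triples.
DOCS = {
    "d1": [("Distance", "distance", "NOUN"), ("education", "education", "NOUN"),
           ("helps", "help", "VERB"), ("adults", "adult", "NOUN")],
    "d2": [("Adult", "adult", "ADJ"), ("education", "education", "NOUN"),
           ("programmes", "programme", "NOUN")],
}

PHRASE_TAGS = {"ADJ", "NOUN"}  # assumed phrase pattern, for illustration only


def detect_phrases(tagged):
    """Yield lemmatised phrases: runs of 2+ ADJ/NOUN tokens ending in a NOUN."""
    run = []
    # The sentinel triple flushes the last run at the end of the document.
    for _word, lemma, tag in tagged + [("", "", "END")]:
        if tag in PHRASE_TAGS:
            run.append((lemma, tag))
            continue
        while run and run[-1][1] != "NOUN":
            run.pop()  # trim trailing adjectives
        if len(run) >= 2:
            yield " ".join(lemma for lemma, _ in run)
        run = []


def build_indexes(docs):
    """Return (lemma -> phrases, phrase -> documents) inverted indexes."""
    lemma_to_phrases = defaultdict(set)
    phrase_to_docs = defaultdict(set)
    for doc_id, tagged in docs.items():
        for phrase in detect_phrases(tagged):
            phrase_to_docs[phrase].add(doc_id)
            for lemma in phrase.split():
                lemma_to_phrases[lemma].add(phrase)
    return lemma_to_phrases, phrase_to_docs


lemma_to_phrases, phrase_to_docs = build_indexes(DOCS)
```

In this sketch the document frequency of a phrase (step 4) is simply `len(phrase_to_docs[phrase])`, and looking up a lemma returns every indexed phrase that contains it.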
Query expansion and translation
Query: Tratados de Prohibición de Pruebas Nucleares
Tratados
  Expansion: acuerdo, capitulación, concertación, convenio, cuidar, pacto, manejar, procesar
  Translation: accord, discourse, handle, manage, pact, process, treat, treatise, treaty
Prohibición
  Expansion: embargo, entredicho, interdicción, interdicto, proscripción
  Translation: ban, interdiction, prohibition, proscription
Pruebas
  Expansion: cata, catadura, degustación, ensayo, escandallo, experimento, gustación, muestreo, tanteo
  Translation: demonstrate, establish, exhibit, experiment, experimentation, fall, fitting, indicate, point, present, proof, prove, run, sample, sampling, shew, show, taste, test, trial, try
Nucleares
  Expansion: nuclear
  Translation: nuclear
Ambiguity Reduction
Nuclear taste proscription process? Nuclear test ban treaty?
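The ambiguity reduction idea (keep only candidate words that co-occur in the same indexed phrase) can be illustrated with a minimal sketch. The candidate translations are a subset of those on the slide; the phrase list, the function names and the `min_words` threshold are assumptions made for this example.

```python
# Candidate translations per query word (subset of the slide's lists).
CANDIDATES = {
    "tratados": ["accord", "pact", "process", "treat", "treatise", "treaty"],
    "prohibición": ["ban", "interdiction", "prohibition", "proscription"],
    "pruebas": ["proof", "sample", "taste", "test", "trial"],
    "nucleares": ["nuclear"],
}

# Hypothetical phrases found in the collection's phrase index.
PHRASES = ["nuclear test ban treaty", "nuclear test", "wine taste",
           "peace treaty", "peace process"]


def matched_query_words(phrase, candidates):
    """Query words that have at least one candidate inside the phrase."""
    words = set(phrase.split())
    return {q for q, cands in candidates.items() if words & set(cands)}


def retrieve_terms(phrases, candidates, min_words=2):
    """Keep phrases covering >= min_words query words, best coverage first."""
    scored = [(len(matched_query_words(p, candidates)), p) for p in phrases]
    kept = [(n, p) for n, p in scored if n >= min_words]
    return [p for _n, p in sorted(kept, key=lambda pair: (-pair[0], pair[1]))]


terms = retrieve_terms(PHRASES, CANDIDATES)
```

A combination such as "nuclear taste proscription process" is never proposed because no indexed phrase contains those words together, while "nuclear test ban treaty" covers all four query words and ranks first.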
Retrieval
[Diagram: retrieval pipeline, with the following labelled components]
• query
• Tokenising: tok1 tok2 tok3
• Lemmatising (Lexicon): lem11 lem12 ..., lem21 lem22 ..., lem31 lem32 ...
• Expansion / Translation (EWN & Dic.): exp11 exp12 ... tran11 tran12 ..., exp21 exp22 ... tran21 tran22 ..., exp31 exp32 ... tran31 tran32 ...
• Phrase retrieval (Phrase index): terms, Term ranking
• Document retrieval (Document index): documents, Document ranking
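The stages named in the retrieval diagram can be strung together as in the sketch below. Every resource here is a tiny hypothetical stand-in (for the Lexicon, for EWN and the dictionaries, and for the phrase index), and the two ranking criteria (document frequency for terms, number of retrieved terms per document) are assumptions chosen for simplicity; the slides do not specify the actual ranking functions.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the Lexicon, EWN & dictionaries, and phrase index.
LEXICON = {"pruebas": ["prueba", "probar"], "nucleares": ["nuclear"]}
EXPANSIONS = {"prueba": ["ensayo", "test", "trial"],
              "probar": ["prove", "taste"],
              "nuclear": ["nuclear"]}
PHRASE_INDEX = {"nuclear test": {"d1", "d3"},
                "ensayo nuclear": {"d2"},
                "wine taste": {"d4"}}


def tokenise(query):
    return re.findall(r"\w+", query.lower())


def lemmatise(token):
    # A token may have several candidate lemmas (lem11, lem12, ...).
    return LEXICON.get(token, [token])


def expand(token):
    """All lemmas, expansions and translations for one query token."""
    words = set()
    for lemma in lemmatise(token):
        words.add(lemma)
        words.update(EXPANSIONS.get(lemma, []))
    return words


def retrieve(query):
    """Return (ranked terms, ranked documents) for a query."""
    expanded = [expand(tok) for tok in tokenise(query)]
    # Keep phrases in which every query token is represented.
    terms = [p for p in PHRASE_INDEX
             if all(set(p.split()) & options for options in expanded)]
    # Assumed term ranking: higher document frequency first.
    terms.sort(key=lambda p: (-len(PHRASE_INDEX[p]), p))
    # Assumed document ranking: documents containing more retrieved terms first.
    votes = Counter(d for p in terms for d in PHRASE_INDEX[p])
    docs = sorted(votes, key=lambda d: (-votes[d], d))
    return terms, docs


terms, docs = retrieve("Pruebas nucleares")
```

The Spanish query retrieves both a Spanish term and an English one, which is how the ranked terms can be grouped by language for browsing.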
[Screenshot: Query in Spanish; Hierarchy of terms (Catalan, English, Spanish); Ranking of documents]
[Screenshot annotations]
• Translingual
• Morpho-syntactic variations (permutation, insertion)
• Semantic variations
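The two morpho-syntactic variations named above, permutation and insertion, can be told apart with a small check over lemma lists. This is a simplified sketch: the function names are hypothetical, and it compares bare lemma sequences rather than the syntactic patterns a full system would use.

```python
from collections import Counter


def is_subsequence(short, long):
    """True if the items of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(item in remaining for item in short)


def classify_variation(term, phrase):
    """Classify `phrase` as a variation of `term` (both are lemma lists)."""
    if term == phrase:
        return "identical"
    if Counter(term) == Counter(phrase):
        return "permutation"  # same lemmas, different order
    if len(phrase) > len(term) and is_subsequence(term, phrase):
        return "insertion"    # extra lemmas inserted, order preserved
    return None
```

For example, `["education", "distance"]` is a permutation of `["distance", "education"]`, and `["adult", "basic", "education"]` is an insertion variant of `["adult", "education"]`.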
Evaluation of Terminology Retrieval
Compare
• Terminology Retrieval over 42,406 web pages (200 Mb)
• Hand-crafted Multilingual Thesaurus (1051 descriptors)
Evaluation of Terminology Retrieval
Recall of mono-lexical terms (lemmas)
• Monolingual: 85% - 95%
• Translingual: 55% - 65%
Recall of poly-lexical terms (phrases)
• Monolingual: 40% - 65%
• Translingual: 10% - 45%
Loss of recall due to
• Phrase extraction (mainly POS tagging): 3% - 17%
• Phrase indexing (mainly lemmatization): 2% - 34%
• Phrase selection: 12% - 37%
• Lack of connections between different languages in EWN
• Gaps in EWN adjective hierarchies
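Recall in this kind of comparison can be read as the share of thesaurus descriptors that also appear among the retrieved terms. The sketch below shows that calculation on made-up data; the function name and the example sets are assumptions, and the numbers it produces are unrelated to the figures reported above.

```python
def recall(descriptors, retrieved_terms):
    """Fraction of thesaurus descriptors found among the retrieved terms."""
    descriptors = set(descriptors)
    if not descriptors:
        return 0.0
    return len(descriptors & set(retrieved_terms)) / len(descriptors)


# Illustrative data only, not taken from the evaluation.
thesaurus_descriptors = {"adult education", "basic education",
                         "distance education", "lifelong learning"}
retrieved = {"adult education", "distance education", "education system"}

score = recall(thesaurus_descriptors, retrieved)
```

Running the same measure separately for single lemmas and for phrases, and for monolingual and translingual queries, gives the four kinds of figures listed on the slide.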
Conclusions
A search model based on extraction, retrieval and browsing of terminology has been developed
• User oriented
• Interaction over terminological information
– An intermediate way between free-text searching and thesaurus-guided searching
– Without the need for thesaurus construction
• Bringing the collection terminology to users
– Morpho-syntactic & semantic variations
– Translinguality
Conclusions
An evaluation framework for Terminology Retrieval and Term Browsing has been established
• Points the way to improve Terminology Retrieval
• Users appreciate Term Browsing
• WTB phrasal information can substantially complement the document ranking provided by the search engines