Terminological aspects of text retrieval. Paul Nieuwenhuysen, Vrije Universiteit Brussel, and Universitaire Instelling Antwerpen, Belgium, pnieuwen@vub.ac.be. Invited lecture at the University of Amsterdam, October 29, 1999


Page 1: Terminological aspects of text retrieval

Paul Nieuwenhuysen
Vrije Universiteit Brussel, and Universitaire Instelling Antwerpen
Belgium
pnieuwen@vub.ac.be

Invited lecture at the University of Amsterdam, October 29, 1999

Page 2:

Presented at the study day on “Interdisciplinary aspects of corpus use”, 29 October 1999, at the University of Amsterdam, organised by the Stichting Tekstcorpora en Database in de Humaniora (STDH) and the Vereniging voor Nederlandstalige Terminologie (NL-TERM).

Page 3:

The slides of this presentation show text in English, so that they can also be used with and by people who do not know Dutch.

Page 4: Overview of this presentation

• A few words about
» text retrieval and databases
» recall and precision in information retrieval
» knowledge organisation: classification and thesaurus systems

• Terminological aspects of text retrieval:
» problems, and attempts to solve these
» conclusions

Page 5: Information retrieval and related activities: figure

[Figure with the labels: Information management, Information retrieval, Text retrieval, Image retrieval, Presentation of information]

Page 6: Information retrieval and related activities: explanation

• “Text retrieval” can be considered as a part of the larger concepts “information retrieval” and “information management”.

• There is a great overlap between “text retrieval” and “image retrieval”, because image retrieval is in most cases based on text retrieval: in most cases retrieval of images is not based on computerised investigation of the images themselves, but on searches in the text that accompanies each image.

Page 7: The terminology of “searching databases”

Several words are used with similar or related meanings:

» database / databank / corpus / collection / catalog / site / archive / file / web / ...
» contents of a database / records / documents / (web) pages / items / ...
» search / query / filter / ...
» thesaurus / (controlled) vocabulary / dictionary / lexicon / term bank / ontology / categories and categorisation / ...
» results / selection / retrieved documents / retrieved items / ...

Page 8: Types of databases to search: some examples

The databases that form the basis for

» catalogues of books or other types of documents
» computerized bibliographies
» address directories
» a full text newspaper, newsletter, magazine, journal + collections of these
» WWW and Internet search engines
» intranet search engines
» ...

Page 9: A simple database model: all records together form a database

The salami model:

» the salami is a “database”
» each slice of salami is a “record”
» there are no relations between records
» the retrieval system tries to offer the appropriate slices to the user

Page 10: Information retrieval: via a database to the user

[Figure with the labels: Information content, Database (Linear file, Inverted file), Search engine, Search interface, User]
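The slide does not explain the “linear file” and the “inverted file” shown in the figure. As a rough sketch, assuming the usual meaning (the records stored one after another, and an index from each word to the records that contain it), the two could look like this in Python. The example records and the function names are hypothetical.

```python
from collections import defaultdict

# Linear file: the records in storage order (hypothetical example records).
linear_file = [
    "time flies like an arrow",
    "fruit flies like a banana",
    "the arrow hit the banana",
]

def build_inverted_file(records):
    """Map each word to the sorted list of record numbers that contain it."""
    index = defaultdict(set)
    for number, text in enumerate(records):
        for word in text.lower().split():
            index[word].add(number)
    return {word: sorted(numbers) for word, numbers in index.items()}

inverted_file = build_inverted_file(linear_file)

def search(word):
    """Look the word up in the inverted file, then fetch the records from the linear file."""
    return [linear_file[n] for n in inverted_file.get(word.lower(), [])]

print(search("flies"))
```

The search never scans the linear file: it consults the inverted file first and only then reads the matching records.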

Page 11: Information retrieval: the basic processes in search systems

[Figure with the labels: Information problem, Representation, Query, Comparison, Indexed documents, Representation, Text documents, Retrieved, sorted documents, Evaluation and feedback]

Page 12: Evaluations in information retrieval: introduction

• The quality of the results of any search using any retrieval system depends on many components / factors.

• These components can be evaluated and modified, more or less independently, to increase the quality of the results.

Page 13: Evaluations in information retrieval: important factors

• The information retrieval system ( = contents + system)
• The user of the retrieval system and the search strategy applied to the system

Result of a search

Page 14: Evaluations in information retrieval: the simple Boolean model

Boolean model:

# items in database = # items selected + # items not selected

# items selected = # relevant items + # irrelevant items

[Figure with the paired labels: Relevant / Irrelevant, Yes / No, 1 / 0, In / Out]

Page 15: Recall: definition and meaning

Definition:

“Recall” = (# of selected relevant items) / (total # of relevant items in database) * 100%

• Aim: high recall

• Problem: in most practical cases, the total # of relevant items in a database cannot be measured.

Page 16: Precision: definition and meaning

Definition:

“Precision” = (# of selected relevant items) / (total # of selected items) * 100%

Aim: high precision
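As a minimal sketch of the two definitions, the following Python function computes recall and precision for one Boolean search. It assumes that the full set of relevant items is known, which, as noted for recall, is rarely the case in practice. The item numbers are hypothetical.

```python
def recall_and_precision(selected, relevant):
    """Return (recall, precision) in percent for one Boolean search.

    selected: set of item identifiers returned by the search
    relevant: set of all relevant item identifiers in the database
    """
    selected_relevant = selected & relevant
    recall = 100.0 * len(selected_relevant) / len(relevant) if relevant else 0.0
    precision = 100.0 * len(selected_relevant) / len(selected) if selected else 0.0
    return recall, precision

# Hypothetical database in which items 1 to 4 are relevant.
relevant = {1, 2, 3, 4}
selected = {2, 3, 7, 8, 9}

recall, precision = recall_and_precision(selected, relevant)
print(f"recall = {recall:.0f}%, precision = {precision:.0f}%")
# recall = 50%, precision = 40%
```

Two of the four relevant items are selected (recall 50%), and two of the five selected items are relevant (precision 40%).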

Page 17: Relation between recall and precision of searches

[Figure: plot with the axes Recall (0 to 100%) and Precision (0 to 100%); the points are marked “Search (results)”; one point is marked “Ideal = impossible to reach in most systems”]

Page 18: Evaluation in the case of systems offering relevance ranking

• Many modern information retrieval systems offer output with relevance ranking.

• This is more complicated than simple Boolean retrieval, and the simple concepts of recall and precision cannot be applied.

• To compare retrieval systems or search strategies, one can decide to consider for comparison a particular number of items ranked highest in each output. This brings us to, for instance, “first-20 precision”.
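A minimal sketch of “first-20 precision” in Python, assuming that a relevance judgement is available for each of the highest-ranked items. The function name, the ranked list and the set of relevant items are hypothetical. When an output holds fewer than k items, this sketch divides by the number of items actually returned, which is one possible convention, not the only one.

```python
def first_k_precision(ranked_items, relevant, k=20):
    """Percentage of relevant items among the k highest-ranked items of an output."""
    top = ranked_items[:k]
    if not top:
        return 0.0
    return 100.0 * sum(1 for item in top if item in relevant) / len(top)

# Hypothetical ranked output of 30 items; item 21 is relevant but ranked too low to count.
ranked = list(range(1, 31))
relevant = {1, 2, 3, 5, 8, 13, 21}

print(first_k_precision(ranked, relevant))
# 30.0
```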

Page 19: Thesaurus: description

• Thesaurus (contents) =
» system to control a vocabulary (= words and phrases + their relations)
» the contents of this vocabulary

• Thesaurus program =
program to create, manage, modify and/or search a thesaurus using a computer

Page 20: Thesaurus relations

[Figure: a central “Term” with four relations]

» BT (= Broader Term): term(s) with broader meaning
» NT (= Narrower Term): term(s) with narrower meaning
» RT (= Related Term): other term(s)
» UF (= Use(d) For): synonym(s)
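The four relations can be sketched as a small data structure. This is an illustration only, not the format of any particular thesaurus program; the class names, method names and example terms are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ThesaurusEntry:
    """One preferred term with its BT / NT / RT / UF relations."""
    term: str
    broader: set = field(default_factory=set)    # BT
    narrower: set = field(default_factory=set)   # NT
    related: set = field(default_factory=set)    # RT
    used_for: set = field(default_factory=set)   # UF: synonyms / non-preferred terms

class Thesaurus:
    def __init__(self):
        self.entries = {}

    def entry(self, term):
        """Return the entry for a term, creating it on first use."""
        return self.entries.setdefault(term, ThesaurusEntry(term))

    def add_narrower(self, broad, narrow):
        """Record NT and the reciprocal BT in one step."""
        self.entry(broad).narrower.add(narrow)
        self.entry(narrow).broader.add(broad)

    def add_related(self, a, b):
        """RT is stored in both directions."""
        self.entry(a).related.add(b)
        self.entry(b).related.add(a)

    def add_used_for(self, preferred, synonym):
        """Record a synonym that the preferred term is used for."""
        self.entry(preferred).used_for.add(synonym)

# Hypothetical example entries.
thesaurus = Thesaurus()
thesaurus.add_narrower("viruses", "rotaviruses")
thesaurus.add_related("viruses", "vaccines")
thesaurus.add_used_for("influenza", "flu")

print(thesaurus.entry("rotaviruses").broader)
# {'viruses'}
```

Storing BT together with NT keeps the hierarchy consistent in both directions, so a search can walk up or down from any term.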

Page 21: Thesaurus systems focused on a particular subject: examples

• Focused on a particular subject domain = narrow and deep, vertical systems

• Examples: the thesaurus for
» the Aquatic Sciences and Fisheries Information System
» ERIC: education, information science, ...
» INSPEC: physics, electronics, information technology
» Medline (the Medical Subject Headings = MeSH)
» Psychological Abstracts / PsycInfo
» Sociological Abstracts / SocioFile; ...

Page 22:

Time flies like an arrow.

Fruit flies like a banana.

Page 23: !? Question !? Task !? Problem !?

Which problems in text retrieval are illustrated by those sentences?

Page 24: Text retrieval and language: an overview

Problems related to language / terminology occur

1. even when the same language is used in searching and in the searched databases
2. in the case of “multi-linguality”: “cross-language information retrieval”, that is, when more than 1 language is used
» in the search terms
» in the contents of the searched database(s) and/or in the subject descriptors of the searched database(s)

Page 25: Text retrieval and language: enhancing retrieval

• Retrieval can be enhanced by coping with the problems caused by the use of natural language.

• Contributions to this enhancement of retrieval can be made by
» the database producer
» the computerized retrieval system
» the searcher / user of the database

• (The distinction between these is not very sharp and clear in all cases.)

Page 26: Text retrieval and terminology (1a)

• Problem: A word or phrase is not the same as a concept: so, to ‘cover’ a concept in a search, to increase the recall of a search, the user of a retrieval system should also include
» synonyms

Page 27: Text retrieval and terminology (1a’)

» narrower terms, more specific terms (such as particular brand names); including terms with prefixes (for instance: viruses, rotaviruses, …)
» spelling variations (such as UK English versus US English); possible variations after transliteration
» singular or plural forms of a noun (when this is used as a search term)

Page 28: Text retrieval and terminology (1a’’)

» (relevant) related terms
» various forms of a verb (when this is used in the query)
» broader terms (perhaps)

Page 29: Text retrieval and terminology (1b)

• Method to solve the problem at the time of database production:
» adding to each database record codes from a classification system or terms from a thesaurus system, and providing the user with knowledge about the system used; in some cases, this process is computerized (with intellectual intervention or completely automatic)

Page 30: Text retrieval and terminology (1b’)

» However, this solution is not perfect:
- Addition of terms by humans from a controlled vocabulary / from a thesaurus is not easy and is time-consuming. Consequences:
• the added value lags behind the availability of the document
• the process can delay access to the document
• the process is expensive
- Moreover, in practice, most users do not exploit this method when it is offered.

Page 31: Text retrieval and terminology (1c)

• Method to solve the problem, provided by the computerized retrieval system:
» offering to the user a partly computerized access to the particular subject description system used by the database producer, and then linking to the database for searching
» computerized, automatic, transparent ‘mapping’ of the ‘free text’ search terms used by the user, to the corresponding particular classification codes, categories, or thesaurus terms used by the database producer
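The ‘mapping’ of free-text search terms to the producer's thesaurus terms can be sketched as a simple lookup. This is a minimal sketch under the assumption that the retrieval system holds a table of entry terms; the table, the descriptors and the function name are hypothetical, and real systems use far larger vocabularies and more elaborate matching.

```python
# Hypothetical mapping from free-text words (including non-preferred synonyms)
# to the descriptors a database producer might have assigned.
FREE_TEXT_TO_DESCRIPTOR = {
    "flu": "influenza",
    "grippe": "influenza",
    "influenza": "influenza",
    "heart attack": "myocardial infarction",
}

def map_to_descriptors(free_text_terms):
    """Return (descriptors, unmapped) for the user's free-text search terms."""
    descriptors, unmapped = [], []
    for term in free_text_terms:
        key = term.strip().lower()
        descriptor = FREE_TEXT_TO_DESCRIPTOR.get(key)
        if descriptor is None:
            unmapped.append(term)          # left for a plain free-text search
        elif descriptor not in descriptors:
            descriptors.append(descriptor)
    return descriptors, unmapped

print(map_to_descriptors(["Flu", "grippe", "Heart attack", "salami"]))
# (['influenza', 'myocardial infarction'], ['salami'])
```

The mapping is ‘transparent’ in the sense that the user types free text and never needs to know the controlled vocabulary.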

Page 32: Text retrieval and terminology (1c’)

» offering the searching user access to a (general) thesaurus system, even when the database producer has not categorised the database contents; in this way, the user can refine his/her query
» better, and more generally: computerized, automatic expansion of the query terms introduced by the user, based on a general thesaurus! (however, not many retrieval systems offer this feature)
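Automatic expansion of query terms can be sketched as follows. This is an illustration under stated assumptions: the small “general thesaurus”, the Boolean OR / AND query syntax and the function name are all hypothetical and do not describe any specific retrieval system.

```python
# Hypothetical general thesaurus: term -> synonyms and narrower terms.
GENERAL_THESAURUS = {
    "car": {"synonyms": ["automobile"], "narrower": ["taxi", "convertible"]},
    "film": {"synonyms": ["movie"], "narrower": []},
}

def expand_query(terms, thesaurus=GENERAL_THESAURUS):
    """Return a Boolean query string in which each term is ORed with its expansions."""
    groups = []
    for term in terms:
        entry = thesaurus.get(term.lower(), {})
        variants = [term.lower()] + entry.get("synonyms", []) + entry.get("narrower", [])
        groups.append("(" + " OR ".join(variants) + ")")
    return " AND ".join(groups)

print(expand_query(["car", "film", "review"]))
# (car OR automobile OR taxi OR convertible) AND (film OR movie) AND (review)
```

Adding synonyms and narrower terms with OR aims at higher recall; terms that are unknown to the thesaurus pass through unchanged.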

Page 33: Text retrieval and terminology (1c’’)

» to avoid the problems of possible variations at the end of search terms:
- offering the possibility to the user to truncate a search term explicitly
- computerized, automatic, transparent truncation without explicit user action
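Explicit truncation can be sketched as a prefix match. The sketch assumes that a trailing `*` is the truncation symbol, which varies between retrieval systems; the function names and example records are hypothetical.

```python
def matches(search_term, word):
    """Match a word against a search term; a trailing '*' means explicit truncation."""
    word = word.lower()
    search_term = search_term.lower()
    if search_term.endswith("*"):
        return word.startswith(search_term[:-1])
    return word == search_term

def search(search_term, records):
    """Return the records in which at least one word matches the search term."""
    return [r for r in records if any(matches(search_term, w) for w in r.split())]

records = ["computer networks", "computing history", "compost heaps", "the computers"]

print(search("comput*", records))
# ['computer networks', 'computing history', 'the computers']
print(search("computer", records))
# ['computer networks']
```

Without truncation the plural “computers” is missed; with it, recall increases, at the risk of matching unrelated words that share the same beginning.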

Page 34: Text retrieval and terminology (1c’’’)

» to avoid the problems of possible prefixes and suffixes:
- computerized, automatic, transparent, intelligent morphological analysis of the query terms: ‘stemming’ of the ‘free text’ search terms used by the user; however, this does not work perfectly and has not (yet) been implemented in most retrieval systems
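A deliberately crude suffix stripper illustrates both the idea of stemming and why it “does not work perfectly”. This is a toy sketch, not a published stemming algorithm; the suffix list and the minimum stem length are arbitrary assumptions, and prefixes are not handled at all.

```python
# Longer suffixes are listed first so that they are tried before shorter ones.
SUFFIXES = ("ations", "ation", "ings", "ing", "ies", "es", "ed", "s")

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["searching", "searches", "searched"]])
# ['search', 'search', 'search']
print(crude_stem("flies"), crude_stem("fly"))
# fli fly
print(crude_stem("news"))
# new
```

The three forms of “search” are conflated as intended, but “flies” and “fly” end up with different stems, and “news” is wrongly reduced to “new”.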

Page 35: Text retrieval and terminology (2a)

• Problem: A word or phrase can have more than 1 meaning: ambiguity of the meaning of a word. This decreases the precision of many searches. The meaning can depend on the context. The meaning may depend on the region where the term is used.
» Example:
- Pascal the philosopher
- Pascal the computer language

Page 36: Text retrieval and terminology (2b)

• Method to solve the problem at the time of database production:
» adding to each database record codes from a classification system or terms from a thesaurus system, and providing the user with knowledge about the system used; in some cases, this process is computerized (completely automatic or with intellectual intervention)

Page 37: Text retrieval and terminology (2c)

• Method to solve the problem, provided by the computerized retrieval system:
» offering to the user a partly computerized access to the subject description system and then linking to the database for searching
» searching normally (without added value), but adding value by categorizing the retrieved items in the presentation phase to assist in the ‘disambiguation’ (for instance the Internet search engine Northern Light offers this feature)
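Categorizing the retrieved items in the presentation phase can be sketched as a simple grouping step. The sketch assumes that each retrieved item already carries a category label; the titles, the categories and the function name are hypothetical, and this is not a description of how Northern Light works internally.

```python
from collections import defaultdict

# Hypothetical retrieved items for the query "pascal": (title, category).
retrieved = [
    ("Pascal and the wager argument", "Philosophy"),
    ("Pascal programming for beginners", "Computing"),
    ("The religious thought of Pascal", "Philosophy"),
    ("Compiling Pascal programs", "Computing"),
]

def group_by_category(items):
    """Group retrieved titles under their category, keeping the retrieval order."""
    groups = defaultdict(list)
    for title, category in items:
        groups[category].append(title)
    return dict(groups)

for category, titles in group_by_category(retrieved).items():
    print(category)
    for title in titles:
        print("  " + title)
```

The search itself is unchanged; the user resolves the ambiguity afterwards by choosing the group that fits the intended meaning.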

Page 38: Text retrieval and terminology (2c’)

» Natural language processing of both the documents and the queries: linguistic analysis to determine possible meanings of a sentence, which includes disambiguation of words in their context:
“lexical” analysis = at the level of the word
“semantic” analysis = at the level of the sentence
However, most queries are short and therefore it is difficult to apply semantic analysis for disambiguation.

Text retrieval and terminology (3a)

• Problem: The meaning of a word or phrase can change over time.

Text retrieval and terminology (3b)

• Method to solve the problem at the time of database production:

» using a categorization system and also adapting this continuously to the changing reality and meanings of terms

Text retrieval and terminology (4a)

• Problem: Most retrieval systems can search for words, but they do not directly recognize or ‘know’ phrases / terms composed of more than 1 word.

Text retrieval and terminology (4b)

• Methods to solve the problem, provided by the computerized retrieval system:

» the user can and should indicate explicitly that a few words should be considered together by the retrieval system as forming a phrase/term (for instance in many Internet search engines by putting the phrase in quotes like “two word phrase”)
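To illustrate what a quoted phrase asks of the retrieval system, here is a minimal sketch that assumes a positional index (each word mapped to its positions in each document). The sample documents and all function names are hypothetical; real search engines differ in tokenization and index layout.

```python
from collections import defaultdict

# Hypothetical sample collection.
DOCS = {
    1: "information retrieval systems search text",
    2: "retrieval of information from text",
    3: "text retrieval and information science",
}

def build_positional_index(docs):
    """Map each word to {doc_id: [positions of the word in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def and_search(index, query):
    """Unquoted query: documents containing all words, in any order."""
    words = query.lower().split()
    if not words or any(w not in index for w in words):
        return set()
    return set.intersection(*(set(index[w]) for w in words))

def phrase_search(index, phrase):
    """Quoted query: documents containing the words at consecutive positions."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return set()
    hits = set()
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            if all(start + i in index[w].get(doc_id, []) for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits

index = build_positional_index(DOCS)
print(and_search(index, "information retrieval"))
print(phrase_search(index, "information retrieval"))
```

In this toy collection the unquoted query matches all three documents, while the phrase query matches only the document in which the two words are adjacent and in the given order.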

Text retrieval and terminology (4b’)

» better: the retrieval system automatically recognizes a phrase/term relying on a term bank that has been created in advance; example: the search engine AltaVista works in this way
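A minimal sketch of automatic phrase recognition, assuming the term bank is simply a set of known multi-word terms and that the query is scanned with a greedy longest match. The term bank contents and the function name are invented for illustration; nothing here is claimed about AltaVista's actual implementation.

```python
# Hypothetical term bank, created in advance: multi-word terms as word tuples.
TERM_BANK = {
    ("information", "retrieval"),
    ("natural", "language", "processing"),
    ("term", "bank"),
}
MAX_TERM_LENGTH = max(len(term) for term in TERM_BANK)

def recognize_terms(query):
    """Split a query into search units, preferring the longest known term."""
    words = query.lower().split()
    units = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, down to two words.
        for n in range(min(MAX_TERM_LENGTH, len(words) - i), 1, -1):
            candidate = tuple(words[i:i + n])
            if candidate in TERM_BANK:
                units.append(" ".join(candidate))
                i += n
                break
        else:
            # No known multi-word term starts here: keep the single word.
            units.append(words[i])
            i += 1
    return units

units = recognize_terms("natural language processing for information retrieval")
print(units)
```

The recognized units can then be searched as phrases without the user having to add quotes.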

Text retrieval and terminology (5a)

• Problem: Searching various databases at the same time, or merging databases for searching, suffers from the problem that these databases may use categorization systems to make the problem of terminology smaller, but in most cases these systems are different and incompatible.

Text retrieval and terminology (5b)

• Method to solve the problem, provided by the computerized retrieval system:

» mapping of the search term chosen by the user to the various thesaurus terms used by the various databases; only a few retrieval systems try to accomplish this (for instance KnowledgeCite)
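The mapping idea can be sketched as a lookup table from a user-level concept to the preferred descriptor in each database. The database names, the descriptors and the synonym list below are all hypothetical, and the fallback to the user's own word is an assumption made for this sketch; it does not describe how KnowledgeCite works.

```python
# Hypothetical mapping: user-level concept -> preferred descriptor per database.
CONCEPT_MAP = {
    "heart attack": {
        "db_medical": "Myocardial Infarction",
        "db_general": "Heart attacks",
    },
}

# Hypothetical synonyms that lead to the same user-level concept.
SYNONYMS = {
    "cardiac infarction": "heart attack",
    "myocardial infarction": "heart attack",
}

def map_term(user_term, databases):
    """Return, for each database, the descriptor to search with."""
    key = user_term.strip().lower()
    key = SYNONYMS.get(key, key)
    descriptors = CONCEPT_MAP.get(key, {})
    # Databases without a mapped descriptor are searched with the user's own term.
    return {db: descriptors.get(db, user_term) for db in databases}

mapped = map_term("Cardiac infarction", ["db_medical", "db_general", "db_other"])
print(mapped)
```

Building and maintaining such a table across incompatible thesauri is the hard part, which may explain why only a few retrieval systems attempt it.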

Text retrieval and terminology (6a)

• Problem: In many cases, when the user combines several concepts in 1 search, the user cannot adequately communicate the intended relations among these concepts to the retrieval system.

Text retrieval and terminology (6a’)

» Example:

concept 1 = children/sons/daughters/...

concept 2 = parents/fathers/mothers/...

concept 3 = beating/violence/...

How to find documents on “children beating their parents” while avoiding documents on “parents beating their children”?

Text retrieval and terminology (6a’’)

» Example:

concept 1 = computers

concept 2 = architecture

How to find documents on “(the application/role/importance of) computers in architecture”, while avoiding documents on “the architecture of computers”?
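The difficulty can be shown with a small sketch: a Boolean AND over the words treats both texts as equally good matches, because it ignores how the concepts relate. The ordered-proximity check that follows is only a crude workaround assumed for this illustration, not a method proposed in the lecture; the sample texts and function names are hypothetical.

```python
WANTED = "the role of computers in architecture"
UNWANTED = "the architecture of computers"

def boolean_and(query_words, text):
    """True if every query word occurs somewhere in the text."""
    return set(query_words) <= set(text.lower().split())

def ordered_within(first, second, text, window):
    """True if `first` is followed by `second` within `window` words."""
    words = text.lower().split()
    return any(
        word == first and second in words[i + 1:i + 1 + window]
        for i, word in enumerate(words)
    )

query = ["computers", "architecture"]
print(boolean_and(query, WANTED), boolean_and(query, UNWANTED))
print(ordered_within("computers", "architecture", WANTED, 3),
      ordered_within("computers", "architecture", UNWANTED, 3))
```

Word order separates the two texts here, but it remains a surface heuristic: a sentence such as "parents beaten by their children" reverses the order without reversing the meaning, so the relation itself is still not expressed.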

Text retrieval and terminology (6b)

• Method to solve the problem, provided by the database producer:

» offering facilities to the user for disambiguation, as in the simpler case of single terms that are not combined with other terms

Text retrieval and terminology (6c)

• Method to solve the problem, provided by the computerized retrieval system:

» natural language analysis of both the documents and the natural language query to interpret their structure and meaning

Text retrieval and terminology (7a)

• Problem: Classical queries and retrieval systems work with terms to match the subject, the “aboutness” expressed in the query with the documents, but do not try to express and to understand the purpose, aim and context of the search.

Text retrieval and multi-linguality (1a)

• Problem: When the user does not know the language of a (monolingual) database well, searching is not efficient.

Text retrieval and multi-linguality (1b)

• Methods to solve the problem, at the time of database production:

» adding subject descriptors in various languages (for instance in Pascal and Francis made by INIST)

» adding abstracts in various languages (for instance the abstracts in English in INSPEC)

» translation of the complete contents of the database

These processes can be partly computerized, but they are still time consuming and expensive.

Text retrieval and multi-linguality (1c)

• Method to solve the problem, provided by the computerized retrieval system:

» translating the query of the user, by using a general multilingual thesaurus; however, most free text queries are quite short, which makes it difficult to use the context to limit possible ambiguity; disambiguation by user-computer interaction, offered by the query interface, can increase the effectiveness here.
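A minimal sketch of query translation with interactive disambiguation, assuming a tiny Dutch-to-English thesaurus fragment in which an ambiguous word lists several concepts. The thesaurus entries and function names are invented for illustration, and the `choose` callback stands in for the selection the user would make in the query interface.

```python
# Hypothetical fragment of a Dutch -> English multilingual thesaurus.
THESAURUS_NL_EN = {
    "bank": [
        {"gloss": "financial institution", "en": "bank"},
        {"gloss": "piece of furniture", "en": "sofa"},
    ],
    "tekst": [{"gloss": "written words", "en": "text"}],
}

def translate(query, choose):
    """Translate word by word; `choose(word, candidates)` resolves ambiguity."""
    translated = []
    for word in query.lower().split():
        candidates = THESAURUS_NL_EN.get(word, [])
        if not candidates:
            translated.append(word)  # unknown word: keep it untranslated
        elif len(candidates) == 1:
            translated.append(candidates[0]["en"])
        else:
            # The short query gives no context, so the user is asked to pick.
            translated.append(choose(word, candidates)["en"])
    return " ".join(translated)

def pick_furniture(word, candidates):
    """Stand-in for a user who selects the furniture sense."""
    return next(c for c in candidates if c["gloss"] == "piece of furniture")

print(translate("bank", pick_furniture))
```

With a one-word query there is no context to exploit, so presenting the glosses and letting the user choose is the only reliable way to pick the intended translation in this sketch.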

Text retrieval and multi-linguality (2a)

• Problem: When documents in a database are written in more than 1 language, searching that database in a single language may not be sufficient to retrieve all interesting, relevant documents.

Text retrieval and multi-linguality (2b)

• Method to solve the problem:

» extensions of the methods that apply when only 1 language is used in the documents

Text retrieval and multi-linguality (3)

• Problem: When more than 1 database is searched at the same time, the mechanisms that solve language-related problems in each separate database can no longer be applied as well.

Text retrieval and multi-linguality (4a)

• Problem: Of course, the user should ideally be able to understand the contents of all the retrieved documents, even when various languages are used in those documents.

Text retrieval and multi-linguality (4b)

• Methods to solve the problem, at the time of database production:

» adding abstracts in various languages (for instance the abstracts in English in INSPEC)

» translation of the complete contents of the database

These processes can be partly computerized, but they are still time consuming and expensive.

Text retrieval and multi-linguality (4c)

• Methods to solve the problem, provided by the computerized retrieval system:

» rapid automated translation

- of the titles of retrieved records/documents (for instance offered by the Internet search engine AltaVista)

- of the abstracts of retrieved records/documents (for instance offered by the Internet search engine AltaVista)

- of the complete retrieved records/documents

A good text retrieval system solves some problems due to language

• accepts words / terms / phrases in the query of the user

• maps the words to corresponding concepts

• presents these concepts to the user who can then select the appropriate, relevant concept (“disambiguation”)

• searches for this concept, even in documents written in another language

• presents the resulting, retrieved documents in the language preferred by the user

Enhanced text retrieval using natural language processing

[Diagram. Labels: Information problem; Query; Representation; Text documents; Indexed documents; Representation; Natural language processing of the documents AND of the query; Comparison and matching of both; Retrieved, sorted documents; Evaluation and feedback]

Terminological aspects of text retrieval: conclusions

• The use of terms and language to retrieve information from databases/collections/corpora causes many problems.

• These problems are not recognized or underestimated by many users of search/retrieval systems = The power of retrieval systems is overestimated by many users.

• Much research and development is still needed to enhance text retrieval.

Thank you for your interest

Any questions?