LIS618 lecture 2
Thomas Krichel
2004-02-08
Structure
• Theory: information retrieval performance
• Practice: more advanced dialog.
retrieval performance evaluation
• "Recall" and "Precision" are two classic measures of the performance of information retrieval in a single query.
• Both assume that there is an answer set of documents that contain the answer to the query.
• Performance is optimal if
– the database returns all the documents in the answer set
– the database returns only documents in the answer set
• Recall is the fraction of the relevant documents that the query result has captured.
• Precision is the fraction of the retrieved documents that is relevant.
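The two definitions can be sketched in a few lines of Python. This is only an illustration: the function name `recall_precision` and the toy document identifiers are assumptions, not part of the lecture.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for a single query.

    retrieved: documents returned by the query
    relevant:  the answer set for the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical toy data: 4 relevant documents, 5 retrieved, 2 in common.
answer_set = {"d1", "d2", "d3", "d4"}
result = ["d1", "d3", "x1", "x2", "x3"]
recall, precision = recall_precision(result, answer_set)
print(f"recall={recall:.0%} precision={precision:.0%}")
```

Here recall is 2 out of 4 relevant documents and precision is 2 out of 5 retrieved documents.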
recall and precision curves
• Assume that all the retrieved documents arrive at once and are being examined.
• During that process, the user discovers more and more relevant documents. Recall increases.
• During the same process, at least eventually, there will be fewer and fewer useful documents. Precision declines (usually).
• This can be represented as a curve.
Example
• Let the answer set be {0,1,2,3,4,5,6,7,8,9} and let non-relevant documents be represented by letters.
• A query reveals the following result:
7,a,3,b,c,9,n,j,l,5,r,o,s,e,4.
• For the first document, (recall, precision) is (10%,100%), for the third (20%,66%), for the sixth (30%,50%), for the tenth (40%,40%), and for the last (50%,33%).
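The (recall, precision) pairs of this example can be reproduced with a short Python sketch that walks down the result list and records a point each time a relevant document is seen. The function name `recall_precision_curve` is a hypothetical choice for illustration.

```python
def recall_precision_curve(ranking, relevant):
    """Return (rank, recall, precision) at each relevant document in the ranking."""
    relevant = set(relevant)
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((rank, hits / len(relevant), hits / rank))
    return points


# Data from the example: digits are relevant, letters are not.
answer_set = set("0123456789")
ranking = "7,a,3,b,c,9,n,j,l,5,r,o,s,e,4".split(",")
points = recall_precision_curve(ranking, answer_set)
for rank, recall, precision in points:
    print(f"rank {rank:2d}: recall {recall:.0%}, precision {precision:.0%}")
```

The exact precision at the third document is 2/3, which rounds to 67% where the slide shows 66%.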
recall/precision curves
• Such curves can be formed for each query.
• An average curve, for each recall level, can be calculated for several queries.
• Recall and precision levels can also be used to calculate two single-valued summaries.
– average precision at seen document
– R-precision
R-precision
• This is a pretty ad-hoc measure.
• Let R be the size of the answer set.
• Take the first R results of the query.
• Find the number of relevant documents among them.
• Divide by R.
• In our example, the R-precision is 40%.
• An average can be calculated for a number of queries.
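The steps for R-precision translate directly into code. The following is a minimal Python sketch using the example data; the function name `r_precision` is an assumption made for illustration.

```python
def r_precision(ranking, relevant):
    """Precision over the first R results, where R is the size of the answer set."""
    relevant = set(relevant)
    r = len(relevant)
    if r == 0:
        raise ValueError("the answer set must not be empty")
    top_r = ranking[:r]  # first R results of the query
    found = sum(1 for doc in top_r if doc in relevant)
    return found / r


# Data from the example: R = 10, and the first 10 results contain 7, 3, 9 and 5.
answer_set = set("0123456789")
ranking = "7,a,3,b,c,9,n,j,l,5,r,o,s,e,4".split(",")
print(f"R-precision: {r_precision(ranking, answer_set):.0%}")
```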
average precision at seen document
• To find it, sum the precision levels at each new relevant document discovered by the user and divide by the number of relevant documents seen in the result.
• In our example, it is (100+66+50+40+33)/5 = 57.8%
• This measure favors retrieval methods that get the relevant documents to the top.
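A minimal Python sketch of this summary follows. It divides by the number of relevant documents found in the result, which matches the division by 5 in the example; the function name `average_precision_at_seen` is hypothetical.

```python
def average_precision_at_seen(ranking, relevant):
    """Average of the precision values taken at each relevant document seen."""
    relevant = set(relevant)
    precisions = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this relevant document
    return sum(precisions) / len(precisions) if precisions else 0.0


# Data from the example.
answer_set = set("0123456789")
ranking = "7,a,3,b,c,9,n,j,l,5,r,o,s,e,4".split(",")
score = average_precision_at_seen(ranking, answer_set)
print(f"average precision at seen documents: {score:.1%}")
```

With exact fractions the result is 2.9/5 = 58%; the 57.8% on the slide comes from summing percentages that were truncated to whole numbers. Moving the same five relevant documents to the top of the list would raise the score to 100%, which is what "favors retrieval methods that get the relevant documents to the top" means in practice.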
critique of recall & precision
• Recall has to be estimated by an expert.
• Recall is very difficult to estimate in a large collection.
• They focus on one query only. No serious user works like this.
• There are some other measures, but that is more for an advanced course in IR.
Looking at database structure
• Up until now, we have looked at commands that take a full-text view of the database.
• Such commands can be executed for every database.
• If we want to make more precise queries, we have to take account of database structure.
bluesheet
• Each database name is linked to a bluish pop-up window for the database.
• This is called the bluesheet.
• It contains the details of the database.
closer look at the bluesheet
• file description
• subject coverage (free vocabulary)
• format options, lists all formats
– by number (internal)
– by dialog web format (external, i.e. cross-database)
• search options
– basic index, i.e. subject contents
– additional index, i.e. non-subject
basic vs additional index
• the basic index
– has information that is relevant to the substantive contents of the data
– usually is indexed by word, i.e. connectors are required
• the additional index
– has data that is not relevant to the substantive matter
– usually indexed by phrase, i.e. connectors are not required
search options: basic index
• select without qualifiers searches in all fields in the basic index
• bluesheet lists field indicators available for a database
• also note if field is indexed by word or phrase. proximity searching only works with word indices. when phrases are indexed you don't need proximity indicators
search in basic index
• a field in the basic index is queried through term/IN, where term is a search term and IN is a field indicator
• Thomas calls this an appending indicator
• several field indicators can be ORed by giving a comma separated list
• for example mate/ti,de searches for mate in the title or descriptor fields
limiters and sorting
• Some databases allow restricting the search using limiters. For example
– /ABS requires an abstract to be present
– /ENG English language publication
• Some fields are sortable with the sort command, i.e. records can be sorted by the values in the fields. Example: sort s1/all/ti.
• Such features are database specific.
additional indices
• additional indices lists those terms that can lead a query. Often, these are phrase indexed.
• Such fields are queried by prefix IN=term where IN is the field abbreviator and term is the search term
• Thomas calls this a pre-pending indicator
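The difference between the appending form (term/IN for the basic index) and the pre-pending form (IN=term for additional indices) can be made concrete with two small string helpers. This is only a sketch for building query terms as text; the function names `suffix_query` and `prefix_query` are hypothetical and are not part of Dialog.

```python
def suffix_query(term, *fields):
    """Basic-index form: term/IN, or term/IN1,IN2 to OR several fields."""
    if not fields:
        return term  # no qualifier: all basic-index fields are searched
    return f"{term}/{','.join(fields)}"


def prefix_query(field, term):
    """Additional-index form: IN=term."""
    return f"{field}={term}"


basic = suffix_query("mate", "ti", "de")
additional = prefix_query("au", "barrueco?")
print(basic)       # mate/ti,de
print(additional)  # au=barrueco?
```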
expanding queries
• names have to be entered as they appear in the database.
• The "expand" command can be used to see varieties of spelling of a value
• It has to be used in conjunction with a field identifier, for example
– expand au=cruz, b?
– expand au=barrueco?
to search for misspellings of José Manuel Barrueco Cruz
expanding queries II
• expand produces results of the form
Ref Items Index-term
– Ref is a reference number
– Items is the number of items where the index term appears
– Index-term is the index term
• "s Ref" searches for the reference term.
expand topics
• You can also expand a topic in a database to see what index terms are available that start with the term. Example “b 155 ; e cold”
• If you expand an entry in the expansion list again, you can see a list of related terms to the term, if such a list is available.
Example
• How many domain names are currently registered in Novosibirsk, Russia?
• Hint: use domain name database file 225.
• Note that this database also covers non-current domains.
ranking
• The rank command can be used to show the most frequent values of a phrase indexed field in a search set.
• Example
– rank au s1 shows the most frequent authors
– rank de s1 shows most frequent descriptors
• read the screens following rank command for instructions.
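Conceptually, ranking counts how often each value of a field occurs across the records of a search set and lists the most frequent ones first. The Python sketch below imitates that idea on made-up records; it is not how Dialog implements rank, and the record layout, the author names and the function name `rank_field` are all assumptions.

```python
from collections import Counter


def rank_field(records, field, top=5):
    """Return the `top` most frequent values of `field` across the records."""
    counts = Counter()
    for record in records:
        counts.update(record.get(field, []))  # each record may carry several values
    return counts.most_common(top)


# Hypothetical search set with author (au) and descriptor (de) fields.
records = [
    {"au": ["Smith, J.", "Lee, K."], "de": ["interest rates"]},
    {"au": ["Smith, J."], "de": ["interest rates", "growth"]},
    {"au": ["Lee, K.", "Smith, J."], "de": ["growth"]},
]
top_authors = rank_field(records, "au")
print(top_authors)
```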
example
• Who wrote on interest rates and growth rates? Use EconLit “b 139”
• “s interest(n)rate? and growth(n)rate?”
• “rank au s1”
• You can then set some authors you are interested in, “1-5” for example
• “exit” to leave rank, confirm with “yes”.
• “exs” to search for those authors.
topic searches
• Often we want to know what literature is available on a certain topic.
• Many times authors do not use obvious words that occur to the searcher.
• Using descriptors can be very helpful.
– Conduct a search
– Look for descriptors
– Use those in other searches
Initial file selection
• On the main menu, go to the database menu.
• After the principal menu, you get a search box
• There you can enter full-text queries for all the databases
• You can then select the database you want
• And get to the begin databases stage.
database categories
• In order to help people to find databases (files), DIALOG have grouped databases by categories.
• categories are listed at http://library.dialog.com/bluesheets/html/blo.html
• 'b category' will select databases from the category category at the start.
• 'sf category' selects files belonging to a category category at other times.
add/repeat
• add number, number
adds databases by files to the last query
• example "add 297" to see what the bible says about it
• repeat
repeats previous query with database added
to find publications
• Sometimes, you want to find out if a certain publication, say, a serial, is available on Dialog
• http://library.dialog.com/bluesheets/
has a search box specifically for journal data.
http://openlib.org/home/krichel
Thank you for your attention!