CS5286 Search Engine Technology and Algorithms/Xiaotie Deng
Lecture 4: Indexing Files
Inverted File
Lexical Analysis
Stop Lists
Indexing
Arrangement of data (a data structure) to permit fast searching
Which list is easier to search?
sow fox pig eel yak hen ant cat dog hog
ant cat dog eel fox hen hog pig sow yak
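As a rough illustration of why the sorted list is easier to search, the sketch below counts how many words each strategy inspects. The function names are illustrative only and are not part of the lecture.

```python
def linear_search(items, target):
    """Scan left to right; return (index, comparisons), index -1 if absent."""
    comparisons = 0
    for i, item in enumerate(items):
        comparisons += 1
        if item == target:
            return i, comparisons
    return -1, comparisons


def binary_search(sorted_items, target):
    """Halve the search range at each probe; the input must be sorted."""
    lo, hi = 0, len(sorted_items) - 1
    comparisons = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1  # one probe of the list per iteration
        if sorted_items[mid] == target:
            return mid, comparisons
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, comparisons


unsorted_words = "sow fox pig eel yak hen ant cat dog hog".split()
sorted_words = "ant cat dog eel fox hen hog pig sow yak".split()

print(linear_search(unsorted_words, "hog"))  # inspects all 10 words
print(binary_search(sorted_words, "hog"))    # inspects 4 words
```

On ten words the gap is small, but the number of probes for binary search grows only logarithmically with the size of the list.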
Creating inverted files
Original Documents
Document IDs
Word Extraction
Word IDs
Inverted Files
W1: d1, d2, d3
W2: d2, d4, d7, d9
Wn: di, …, dn
Creating Inverted file
Map the file names to file IDs
Consider the following original documents:
D1: The Department of Computer Science was established in 1984.
D2: The Department launched its first BSc(Hons) in Computer Studies in 1987.
D3: followed by the MSc in Computer Science which was started in 1991.
D4: The Department also produced its first PhD graduate in 1994.
D5: Our staff have contributed intellectually and professionally to the advancements in these fields.
Creating Inverted file
[ ]: stop word (shown in red on the slide)
D1: [The] Department [of] Computer Science [was] established [in] 1984.
D2: [The] Department launched [its] [first] BSc(Hons) [in] Computer Studies [in] 1987.
D3: followed [by] [the] MSc [in] Computer Science [which] [was] started [in] 1991.
D4: [The] Department [also] produced [its] [first] PhD graduate [in] 1994.
D5: [Our] staff [have] contributed intellectually [and] professionally [to] [the] advancements [in] [these] fields.
Creating Inverted file
After stemming, lowercasing (optional), and deleting numbers (optional):
D1: depart comput scienc establish
D2: depart launch bsc hons comput studi
D3: follow msc comput scienc start
D4: depart produc phd graduat
D5: staff contribut intellectu profession advanc field
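The preprocessing steps can be sketched as follows. Everything here is an assumption made for illustration: the stop list is simply the set of words that disappear from the example documents, and `toy_stem` is a hand-picked suffix stripper that happens to reproduce the stems shown for this example. It is not a real stemming algorithm such as Porter's, and it will behave poorly on other text.

```python
import re

# Hypothetical stop list, inferred from the words missing in the example.
STOP_WORDS = {"our", "have", "and", "to", "the", "in", "these", "also",
              "its", "first", "by", "which", "was", "of"}

# Toy suffix list, checked in this order; NOT a real stemming algorithm.
SUFFIXES = ("ements", "ally", "ment", "ed", "es", "er", "e", "s")


def toy_stem(word, min_stem=4):
    """Strip the first matching suffix if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def index_terms(text, lowercase=True, delete_numbers=True):
    """Extract words, drop stop words, then stem what is left."""
    terms = []
    for token in re.findall(r"[A-Za-z]+|\d+", text):
        if delete_numbers and token.isdigit():
            continue  # optional step: delete numbers
        if token.lower() in STOP_WORDS:
            continue  # stop word elimination
        word = token.lower() if lowercase else token  # optional lowercasing
        terms.append(toy_stem(word))
    return terms


print(index_terms("The Department of Computer Science was established in 1984."))
```

Passing `delete_numbers=False` keeps "1984" as a term, which shows how the optional steps change what ends up in the index.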
Creating Inverted file (unsorted)
Words        Documents
depart       d1,d2,d4
comput       d1,d2,d3
scienc       d1,d3
establish    d1
launch       d2
bsc          d2
hons         d2
studi        d2
follow       d3
msc          d3
start        d3
produc       d4
phd          d4
graduat      d4
staff        d5
contribut    d5
intellectu   d5
profession   d5
advanc       d5
field        d5
Creating Inverted file (sorted)
Words        Documents
advanc       d5
bsc          d2
comput       d1,d2,d3
contribut    d5
depart       d1,d2,d4
establish    d1
field        d5
follow       d3
graduat      d4
hons         d2
intellectu   d5
launch       d2
msc          d3
phd          d4
produc       d4
profession   d5
scienc       d1,d3
staff        d5
start        d3
studi        d2
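A minimal sketch of building the sorted inverted file from the stemmed documents. The names `STEMMED_DOCS` and `build_inverted_file` are hypothetical, and sorting document IDs as strings only works here because every ID has a single digit.

```python
from collections import defaultdict

# Stemmed documents from the example, keyed by document ID.
STEMMED_DOCS = {
    "d1": "depart comput scienc establish",
    "d2": "depart launch bsc hons comput studi",
    "d3": "follow msc comput scienc start",
    "d4": "depart produc phd graduat",
    "d5": "staff contribut intellectu profession advanc field",
}


def build_inverted_file(docs):
    """Map each word to the sorted list of documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            postings[word].add(doc_id)
    # Sort the words alphabetically and each document list by ID.
    return {word: sorted(ids) for word, ids in sorted(postings.items())}


inverted_file = build_inverted_file(STEMMED_DOCS)
for word, doc_ids in inverted_file.items():
    print(f"{word:<12} {','.join(doc_ids)}")
```

Skipping the outer `sorted` call would give the unsorted variant, with words in the order they are first encountered.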
Searching on Inverted File
Binary search: used at small scale
Create a thesaurus and combine it with techniques such as:
  Hashing
  B+tree
  Pointer to the address in the indexed file
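Two of the lookup strategies can be sketched on a handful of entries from the example. This is only an illustration: the sorted word list is searched with the standard `bisect` module, and a Python `dict` stands in for a hash table. A B+tree is not shown. The names `WORDS`, `POSTINGS`, `lookup_binary`, and `lookup_hash` are hypothetical.

```python
from bisect import bisect_left

# A few entries of the sorted inverted file: words and parallel posting lists.
WORDS = ["comput", "depart", "establish", "launch", "scienc"]
POSTINGS = [["d1", "d2", "d3"], ["d1", "d2", "d4"], ["d1"], ["d2"], ["d1", "d3"]]


def lookup_binary(word):
    """Binary search over the sorted word list."""
    i = bisect_left(WORDS, word)
    if i < len(WORDS) and WORDS[i] == word:
        return POSTINGS[i]
    return []


# Hashing: a dict gives expected constant-time lookup and needs no sorting.
HASH_INDEX = dict(zip(WORDS, POSTINGS))


def lookup_hash(word):
    return HASH_INDEX.get(word, [])


print(lookup_binary("scienc"))
print(lookup_hash("depart"))
```

In a disk-based index the posting list would typically be replaced by a pointer (file offset) to where the list is stored, as the last item above suggests.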
Lexical Analysis for indexing
Word extraction
  Spaces as English word boundaries
  Chinese word segmentation
Stop word elimination
  "a", "an", "the", "about", "etc", "every", "you", etc.
Word stemming
Lexical Analysis
Lexical analysis is the process of converting an input stream of characters into a stream of words or tokens
Lexical analysis is the first stage of:
  Automatic indexing
  Query processing
Lexical Analysis for Automatic Indexing
What counts as a word or token in the indexing scheme? (an easy problem?)
  Digits: "Year 2000", "Y2K"
  Hyphens: "F-16", "MS-DOS"
  Other punctuation: "COMMAND.COM", "max_size" (often in C code)
  Case: IBM or ibm
Lexical Analysis for Automatic Indexing (cont.)
No technical difficulty in solving any of these problems
Must think about them carefully
Tradeoff between recall and precision:
  Breaking up hyphenated terms increases recall but decreases precision
  Preserving case distinctions enhances precision but decreases recall
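The two design choices can be made concrete with a small tokenizer whose behaviour is controlled by two flags. This is a sketch under simplifying assumptions: tokens are runs of ASCII letters and digits, and the flag names `split_hyphens` and `fold_case` are invented for the example.

```python
import re


def tokenize(text, split_hyphens=False, fold_case=False):
    """Extract tokens; optionally break hyphenated terms and fold case."""
    if split_hyphens:
        pattern = r"[A-Za-z0-9]+"
    else:
        # Keep hyphens that join two runs of letters or digits.
        pattern = r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*"
    tokens = re.findall(pattern, text)
    return [t.lower() for t in tokens] if fold_case else tokens


text = "IBM shipped MS-DOS with the F-16 simulator"
print(tokenize(text))
print(tokenize(text, split_hyphens=True, fold_case=True))
```

With both flags on, a query for "dos" or "ibm" matches this text (higher recall), but so would any document that mentions "16" or "ms" in an unrelated sense (lower precision).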
Lexical Analysis for Query Processing
Depends on the design of the lexical analyzer for automatic indexing
Distinguish operators (Boolean operators, weighting function operators, etc.)
Process certain characters:
  Control characters: "" for phrase search, {} for priority
  Disallowed punctuation characters (error)
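A query scanner along these lines might look like the sketch below. The token labels, the choice of uppercase `AND`/`OR`/`NOT` as Boolean operators, and the decision to lowercase plain terms are all assumptions for illustration; a real system would follow whatever the indexing analyzer does.

```python
import re

# Assumed operator set; the lecture does not fix a concrete syntax.
OPERATORS = {"AND", "OR", "NOT"}

# A quoted phrase, a priority brace, or a run of letters and digits.
TOKEN_RE = re.compile(r'"([^"]*)"|([{}])|([A-Za-z0-9]+)')


def scan_query(query):
    """Turn a query string into (kind, value) tokens; reject other characters."""
    tokens, pos = [], 0
    while pos < len(query):
        if query[pos].isspace():
            pos += 1
            continue
        match = TOKEN_RE.match(query, pos)
        if match is None:
            raise ValueError(f"disallowed character {query[pos]!r} at position {pos}")
        phrase, brace, word = match.groups()
        if phrase is not None:
            tokens.append(("PHRASE", phrase))
        elif brace is not None:
            tokens.append(("PRIORITY", brace))
        elif word in OPERATORS:
            tokens.append(("OP", word))
        else:
            tokens.append(("TERM", word.lower()))
        pos = match.end()
    return tokens


print(scan_query('"search engine" AND {index OR file}'))
```

A query such as `index; file` raises `ValueError`, which corresponds to the "disallowed punctuation characters (error)" case.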
STOPLISTS
Many of the most frequently occurring words in English ("the", "of", etc.) are worthless as index terms
Eliminating such words:
  Speeds processing
  Saves huge amounts of space in indexes
  Does not damage retrieval effectiveness
Stoplists are used to eliminate such words. E.g.,
  http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
  http://bll.epnet.com/help/ehost/Stop_Words.htm
  http://www.syger.com/jsc/docs/stopwords/english.htm
STOPLISTS
Choices of words in stop list may vary from person to person.
The general idea is to find words that occur so often that they are not good terms for information retrieval.
How can the vector space model be used to find a list of stop words?
How to find stop words in Chinese?
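One simple way to act on the general idea, offered as a sketch and not as an answer to the questions above, is to count in how many documents each word occurs and flag the words that occur in most of them. The threshold `min_docs` and the function name are assumptions; the five example documents are reused as a tiny collection.

```python
import re
from collections import Counter

DOCS = [
    "The Department of Computer Science was established in 1984.",
    "The Department launched its first BSc(Hons) in Computer Studies in 1987.",
    "followed by the MSc in Computer Science which was started in 1991.",
    "The Department also produced its first PhD graduate in 1994.",
    "Our staff have contributed intellectually and professionally "
    "to the advancements in these fields.",
]


def stop_word_candidates(docs, min_docs):
    """Return words that occur in at least min_docs documents."""
    doc_freq = Counter()
    for text in docs:
        # Count each word once per document (document frequency).
        doc_freq.update(set(re.findall(r"[a-z]+", text.lower())))
    return sorted(w for w, n in doc_freq.items() if n >= min_docs)


print(stop_word_candidates(DOCS, min_docs=4))  # → ['in', 'the']
print(stop_word_candidates(DOCS, min_docs=3))
```

Lowering the threshold to 3 also flags "computer" and "department", which are common in this collection but clearly carry content. This is one reason the choice of stop words varies from person to person and from collection to collection.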