Today’s Topics
• Boolean IR
• Signature files
• Inverted files
• PAT trees
• Suffix arrays
Boolean IR
• Documents composed of TERMS (words, stems)
• Express result in set-theoretic terms

[Venn diagrams: documents containing term A, term B, term C, illustrating A AND B and (A AND B) OR C]

- Pre-1970s
- Dominant industrial model through 1994 (Lexis-Nexis, DIALOG)
Boolean Operators
• A AND B, A OR B
• (A AND B) OR C, A AND (NOT B)

[Venn diagram: documents containing term A]

Proximity Operators (extended ANDs, within +/- K words):
• Adjacent AND: "A B", e.g. "Johns Hopkins", "The Who"
• Proximity window: A w/10 B, i.e. A and B within +/- 10 words
• A w/sent B, i.e. A and B in the same sentence
Boolean IR (implementation)
• Bit vectors
• Inverted files (a.k.a. index)
• PAT trees (a more powerful index)

Bit vectors (one bit per term_i):
V1: 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0
V2: 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0

Impractical: very sparse (wastefully big) and costly to compare.
Problems with Boolean IR
• Does not effectively support relevance ranking of returned documents
• Base model: expression satisfaction is Boolean; a document matches the expression or it doesn't
• Extensions to permit ordering, e.g. for (A AND B) OR C:
  – Supermatches (5 terms/doc > 3 terms/doc)
  – Partial matches (expression incompletely satisfied: give partial credit)
  – Importance weighting (10A OR 5B, where the coefficients are term weights)
Boolean IR
• Advantages: can directly control the search
  – Good for precise queries over structured data (e.g. database search or a legal index)
• Disadvantages: must directly control the search
  – Users must be familiar with the domain and term space (know what to ask for and what to exclude)
  – Poor at relevance ranking
  – Poor at weighted query expansion, user modeling, etc.
Signature Files

Superimposed coding: using some mapping/hash function f( ), each document bit vector is mapped to a signature with fewer bits:

Document bit vector: 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0
Signature:           0 0 1 0 0 1 0 0 0 0 0

Problem: several different document bit vectors (i.e. different words) get mapped to the same signature. (Use a stoplist to help keep common words from overwhelming the signatures.)
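A minimal sketch of superimposed coding: each word hashes to a few bit positions, which are OR-ed into the document signature. The parameters `SIG_BITS` and `BITS_PER_WORD` are illustrative choices, not values from the slides:

```python
import hashlib

SIG_BITS = 64       # signature width (illustrative)
BITS_PER_WORD = 3   # bits set per word (illustrative)

def word_bits(word):
    """Hash a word to BITS_PER_WORD positions in the signature."""
    h = hashlib.md5(word.encode()).digest()
    return [h[k] % SIG_BITS for k in range(BITS_PER_WORD)]

def signature(words):
    """Superimpose (OR together) the bit patterns of all words."""
    sig = 0
    for w in words:
        for b in word_bits(w):
            sig |= 1 << b
    return sig

def may_contain(sig, word):
    """True if every bit for `word` is set in `sig` (may be a false drop)."""
    return all(sig & (1 << b) for b in word_bits(word))

s = signature("johns hopkins university".split())
assert may_contain(s, "hopkins")   # an indexed word always qualifies
```

Note that `may_contain` can also return True for words that were never indexed, which is exactly the false drop problem discussed next.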
False Drop Problem
• On retrieval, all documents/bit vectors mapped to the same signature are retrieved (returned)
• Only a portion are relevant
• Need a secondary validation step to make sure the target words actually match

Prob(false drop) = Prob(signature qualifies AND text does not)
Efficiency Problem
Testing for a signature match may require a linear scan through all document signatures.

Vertical Partitioning
• Improves sig1 vs. sig2 comparison speed, but still requires an O(N) linear search of all signatures
• Options:
  – Bit-slice the signatures onto different devices for parallel comparison
  – AND together the matches from each segment
Horizontal Partitioning
• Goal: avoid sequential scanning of the signature file
• A hash function or index over the signature database maps an input signature to specific candidates to try
Inverted Files
• Like an index to a book: for each term, store the list of documents it occurs in

[Figure: index mapping each term (Baum, Bayes, Viterbi) to its posting list of document numbers]
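The book-index analogy maps directly onto a dictionary of posting lists; a minimal sketch (the document texts below are made up for illustration):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["Baum Welch training", "Bayes rule", "Viterbi decoding with Bayes"]
index = build_inverted_index(docs)
# index["bayes"] -> [1, 2]
```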
Inverted Files
• Very efficient for single-word queries: just enumerate the documents pointed to by the index, O(|A|) = O(S_A)
• Efficient for OR's: just enumerate both lists and remove duplicates, O(S_A + S_B)
AND's using Inverted Files

Method 1 (merge, a.k.a. "meet" search):
• Begin with two pointers (i, j) into the two posting lists A, B in the index (e.g. the indexes for Bayes and Viterbi)
• If A[i] = B[j], write A[i] to output and advance both pointers
• If A[i] < B[j], i++ else j++

Cost: O(S_A + S_B), the same as OR, but with smaller output.
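Method 1 as runnable code (the posting-list values are illustrative):

```python
def and_merge(a, b):
    """Intersect two sorted posting lists with the two-pointer merge.

    Cost O(S_A + S_B): the same scan as an OR, but smaller output."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

assert and_merge([14, 39], [15, 39, 45, 96]) == [39]
```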
AND's using Inverted Files

Method 2: useful if one index is much smaller than the other (S_A << S_B), e.g. A = Johns, B = Hopkins:
• For all members of the smaller index A, do bsearch(A[i], B), a binary search into the larger index
• For A AND B AND C, order by smaller list pairwise
• Cost: S_A * log2(S_B); can achieve S_A * log log(S_B)
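Method 2 as runnable code, using the standard-library `bisect` for the binary search (posting-list values are illustrative):

```python
from bisect import bisect_left

def and_bsearch(small, large):
    """For each posting in the smaller list, binary-search the larger one.

    Cost: S_A * log2(S_B), a win when S_A << S_B."""
    out = []
    for x in small:
        pos = bisect_left(large, x)
        if pos < len(large) and large[pos] == x:
            out.append(x)
    return out

assert and_bsearch([15, 39], [25, 28, 39, 45, 58, 96]) == [39]
```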
Proximity Search

Document-level indexes are not adequate for proximity queries such as "Anthony", "Johns", "Hopkins" as a phrase.

Option 1: index every word position in the corpus (position-offset index). Size of index = size of corpus.

Before: match if ptrA = ptrB (same document)
Now:
• "A B" = match if ptrA = ptrB - 1
• A w/10 B = match if |ptrA - ptrB| <= 10
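The position-offset tests above can be sketched directly; `pos_a` and `pos_b` are the sorted word-offset lists for the two terms:

```python
def within_window(pos_a, pos_b, k):
    """A w/k B: True if some occurrence of A and B lie within +/- k words."""
    i = j = 0
    while i < len(pos_a) and j < len(pos_b):
        if abs(pos_a[i] - pos_b[j]) <= k:
            return True
        if pos_a[i] < pos_b[j]:   # advance whichever pointer lags
            i += 1
        else:
            j += 1
    return False

def adjacent(pos_a, pos_b):
    """Phrase match "A B": some ptrA = ptrB - 1."""
    b_set = set(pos_b)
    return any(p + 1 in b_set for p in pos_a)

assert adjacent([7], [8])              # e.g. "Johns Hopkins"
assert within_window([3, 40], [12], 10)
```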
Variation 1: Don't index function words
• e.g. index "Johns" and "Hopkins" but not "The"; the word list marks function words as skipped
• Do a linear match search in the corpus to verify phrases containing function words
• Savings of roughly 50% on index size; potential speed improvement given data-access costs
Variation 2: Multilevel Indexes
• Index at both the document level and the position level (e.g. for Anthony, Johns, Hopkins and compounds such as Johns Hopkins, Hopkins Anthony)
• Supports parallel search
• May have a paging cost advantage
• Cost: large index, N + dV (d, V = average document and vocabulary size)
Interpolation Search

Useful when data are numeric and uniformly distributed.

Example: an index of 100 cells holding sorted values that range over 0 … 1000. Goal: looking for the value 211.
• Binary search: begin looking at cell 50
• Interpolation search: make a better guess for the 1st cell to examine, a fraction 211/1000 of the way through the 100 cells, i.e. near cell 21

[Figure: cells 17-23 holding values 174, 195, *211*, 226, 230, 231, 246; cells 48-51 holding 483, 496, 521, 526; cell 100 holding 995]
Binary Search

Bsearch(low, high, key):
    mid = (low + high) / 2
    if key = A[mid]:
        return mid
    else if key < A[mid]:
        return Bsearch(low, mid - 1, key)
    else:
        return Bsearch(mid + 1, high, key)
Interpolation Search

Isearch(low, high, key):
    mid = best estimate of position
        = low + (high - low) * (expected % of way through range)
        = low + (high - low) * (key - A[low]) / (A[high] - A[low])
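The pseudocode above, made runnable as an iterative sketch (returns -1 on a miss; the guard against a zero denominator is an added detail, not from the slides):

```python
def isearch(a, key):
    """Interpolation search on a sorted, roughly uniform array.

    Expected O(log log N) probes on uniform data."""
    low, high = 0, len(a) - 1
    while low <= high and a[low] <= key <= a[high]:
        if a[high] == a[low]:            # avoid division by zero
            break
        # probe the expected fraction of the way through the range
        mid = low + (high - low) * (key - a[low]) // (a[high] - a[low])
        if a[mid] == key:
            return mid
        if a[mid] < key:
            low = mid + 1
        else:
            high = mid - 1
    return low if low <= high and a[low] == key else -1

a = list(range(0, 1000, 10))   # 100 uniformly spaced values
assert isearch(a, 210) == 21   # first probe lands directly on cell 21
```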
Comparison

Typical sequence of cells tested:
• Binary search: 50, 25, 12, 18, 22, 21, 19, …
• Interpolation search: 21, 19, … (goes directly to the expected region)

Interpolation search takes O(log log N) probes on average.
Cost of Computing an Inverted Index

1. Simple: create (word, position) pairs and sort them. Cost: O(N log N) for corpus size N.
2. If N >> memory size:
   1) Tokenize (map words to integers)
   2) Create a histogram of token counts
   3) Allocate space in the index
   4) Do a multipass (K-pass) sweep through the corpus, on pass k adding only the tokens in bin k
K-pass Indexing

[Figure: index slots for words W1 … W4 are filled block by block; pass K = 1 fills block 1, K = 2 fills block 2, and so on]

Time = KN + 1, but a big win over N log N on paging.
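A minimal sketch of the K-pass scheme, keeping everything in memory for illustration (a real implementation would write each bin's region of the index to disk as the pass completes):

```python
from collections import Counter

def kpass_index(tokens, k_passes):
    """Positional index built in K sweeps: histogram first, pre-allocate
    slots, then each pass fills postings only for tokens in its bin."""
    vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
    hist = Counter(vocab[w] for w in tokens)            # step 2: histogram
    index = {t: [0] * hist[t] for t in vocab.values()}  # step 3: allocate
    fill = {t: 0 for t in vocab.values()}
    bin_size = (len(vocab) + k_passes - 1) // k_passes
    for k in range(k_passes):                           # step 4: K sweeps
        lo, hi = k * bin_size, (k + 1) * bin_size
        for pos, w in enumerate(tokens):
            t = vocab[w]
            if lo <= t < hi:                            # only this bin's tokens
                index[t][fill[t]] = pos
                fill[t] += 1
    return vocab, index

tokens = "a b a c b a".split()
vocab, index = kpass_index(tokens, k_passes=2)
assert index[vocab["a"]] == [0, 2, 5]
```

Each sweep touches only one contiguous region of the index, which is the source of the paging win over a global sort.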
Vector Models for IR
• Gerard Salton, Cornell (Salton + Lesk, 68; Salton, 71; Salton + McGill, 83)
• SMART system: Chris Buckley, Cornell, is the current keeper of the flame
  (SMART = Salton's Magical Automatic Retrieval Tool?)
Vector Models for IR

Boolean model (one bit per term_i, where a term may be a word, stem, or special compound):
Doc V1: 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0
Doc V2: 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0

SMART vector model:
Doc V1: 1.0 3.5 4.6 0.1 0.0 0.0
Doc V2: 0.0 0.0 0.0 0.1 4.0 0.0

SMART vectors are composed of real-valued term weights, NOT simply Boolean term present/absent.
Example

Terms (columns): Comput*, Compiler, C++, Sparc, genome, DNA, Biolog*, protein
Doc V1: 3 5 4 1 0 1 0 0
Doc V2: 1 0 0 0 5 3 1 4
Doc V3: 2 8 0 1 0 1 0 0

Issues
• How are weights determined? (Simple option: raw frequency, weighted by region, titles, keywords)
• Which terms to include? Stoplists
• Stem or not?
QUERIES and documents share the same vector representation.

Given a query Q, map it to a vector V_Q and find the document D_i for which sim(V_i, V_Q) is greatest.

[Figure: documents D1, D2, D3 and query Q as vectors in term space]
Similarity Functions
• Many other options available (Dice, Jaccard)
• Cosine similarity is self-normalizing

V1: 100 200 300 50
V2:   1   2   3  0.5
V3:  10  20  30  5

Can use arbitrary values (they don't need to be probabilities): V1, V2, V3 above are scalar multiples of one another, so cosine similarity scores them identically.
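The self-normalizing property is easy to verify with a short sketch:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product over the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

v1 = [100, 200, 300, 50]
v2 = [1, 2, 3, 0.5]
v3 = [10, 20, 30, 5]
# all three point in the same direction, so each pair has similarity 1
assert abs(cosine(v1, v2) - 1.0) < 1e-9
assert abs(cosine(v2, v3) - 1.0) < 1e-9
```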