COMP3410 DB32:Technologies for Knowledge Management
Lecture 4:
Inverted Files and Signature Files for IR
By Eric Atwell, School of Computing, University of Leeds
(including re-use of teaching resources from other sources, esp. Stuart Roberts, School of Computing, Univ of Leeds)
Module Objectives
“On completion of this module, students should be able to:
… describe classical and emerging information retrieval techniques, and their relevance to knowledge management; …”
Today’s objectivesBy the end of this lecture you should understand:
• why relational databases techniques eg BTree indexing are no use for ‘IR’ queries;
• how ‘inverted file’ structures work to provide efficient query processing.
• An alternative approach provided by signature files
The relational problem
• the simple approach to searching for a keyword uses leading (and trailing) wildcards: eg ‘%graphics%’
• there is no way other than ‘brute force scan’ to match such a condition with the data records held in a traditional relational database.
The relational problem• Rather than hold full text, why not do content
analysis, extract the index terms (keywords), and hold these in a relational database?
Module Index termIndexed by
module(mod_code, title, semester, …)
term(term_id, value)
index(mod_code, term_id)
Sample SQL query: ORfind all modules matching:
‘database’ or ‘AI’ or ‘knowledge base’
select distinct m.* from module m inner join index i on m.code = i.mod_code inner join term t on t.term_id = i.term_id where t.value = ‘database’ OR t.value = ‘AI’ OR t.value = ‘knowledge base’;
Another sample query: AND (?)find all modules matching:
‘database’ and ‘AI’ and ‘knowledge base’
select distinct m.* from module m inner join index i on m.code = i.mod_code inner join term t on t.term_id = i.term_id where t.value = ‘database’ AND t.value = ‘AI’ AND t.value = ‘knowledge base’;
This SQL query will not match any record; t.value cannot be simultaneously equal to ‘database’, ‘AI’ and ‘knowledge base’.
We cannot simply replace the ‘OR’s of the last SQL query with ‘AND’s.
Corrected sample query: AND
find all modules matching: ‘database’ and ‘AI’
select distinct m.* from module m inner join index i1 on m.code = i1.mod_code inner join term t1 on t1.term_id = i1.term_idinner join index i2 on m.code = i2.mod_code inner join term t2 on t2.term_id = i2.term_id where t1.value = ‘database’ and t2.value = ‘AI’;
Both tables ‘index’ and ‘term’ must be searched twice in order to establish whether, for each module, it is attached to both terms ‘database’ and ‘AI’.
If the query is a conjunction of N terms, the SQL would have 2N inner joins. AND is more complicated than OR (but common in IR)
Inverted file• Non-DB structure, so not suitable for standard SQL
• each index term entry ‘points’ to a list of document record identifiers (RIDs)
• standard indexing method for IR systems
• widely used for search engines
• can be extended to allow for positional (context) searches
Inverted file structureThe idea of an inverted file is, as well as storing a
document with its list of terms that are used to index it, we store the list of terms used in the whole collection of documents, and for each term point to the list of documents that are indexed by the term. So we have ‘inverted’ the structure:
D1: T11, T12, …, T1k
D2: T21, T22, …, T2l
…to give:T1: D11, D12, …, D1m
T2: D21, D22, …, D2n
…
Inverted file structure
Term 1 (2)
Term 2 (3)
Term 3 (1)
Term 4 (3)
Term 5 (4)..
1
2
1
2
3
2
2
3
4..
Doc 1
Doc2
Doc3
Doc4
Doc5
Doc6..
1
3
6
7
9..
dictionary Inverted or postings file Data file
Inverted file structure
Term 1 (2)
Term 2 (3)
Term 3 (1)
Term 4 (3)
Term 5 (4)..
1
2
1
2
3
2
2
3
4..
Doc 1
Doc2
Doc3
Doc4
Doc5
Doc6..
1
3
6
7
9..
dictionary Inverted or postings file Data file
Inverted file structure
Term 1 (2)
Term 2 (3)
Term 3 (1)
Term 4 (3)
Term 5 (4)..
1
2
1
2
3
2
2
3
4..
Doc 1
Doc2
Doc3
Doc4
Doc5
Doc6..
1
3
6
7
9..
dictionary Inverted or postings file Data file
Inverted file structure
Term 1 (2)
Term 2 (3)
Term 3 (1)
Term 4 (3)
Term 5 (4)..
1
2
1
2
3
2
2
3
4..
Doc 1
Doc2
Doc3
Doc4
Doc5
Doc6..
1
3
6
7
9..
dictionary Inverted or postings file Data file
Dictionary (in IR)• list of terms including ‘normalised’ keywords or
stems plus object descriptors (eg author name)
• frequency with which that term occurs in the collection
• pointer to the inverted file
• access to dictionary is by standard file access method (binary search or Btree or hashing algorithm; DB21)
Inverted file• for each entry in the dictionary:
– a list of pointers into the data file (or object-ids, or URLs..)
– identifying those objects indexed by the dictionary term
• inverted file may also contain:– positional information within each document– term frequency (or weight) within each document
Use of inverted file• Boolean query: (A or B) and C
– disjunctive normal form: (A and C) or (B and C) or (A and B and C)(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A, B and C
doc1doc3doc4doc7doc8doc10
doc2doc3doc5doc6doc8doc12
doc1doc2doc4doc9doc11doc12
Use of inverted file• Boolean query: (A or B) and C
– disjunctive normal form:(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A, B and C
doc1doc3doc4doc7doc8doc10
doc2doc3doc5doc6doc8doc12
doc1doc2doc4doc9doc11doc12
doc1: (1, 0, 1)
Use of inverted file• Boolean query: (A or B) and C
– disjunctive normal form:(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A, B and C
doc3doc4doc7doc8doc10
doc2doc3doc5doc6doc8doc12
doc2doc4doc9doc11doc12
doc1: (1, 0, 1)doc2: (0, 1, 1)
Use of inverted file• Boolean query: (A or B) and C
– disjunctive normal form:(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A, B and C
doc3doc4doc7doc8doc10
doc3doc5doc6doc8doc12
doc4doc9doc11doc12
doc1: (1, 0, 1)doc2: (0, 1, 1)doc3: (1, 1, 0)
Use of inverted file• Boolean query: (A or B) and C
– disjunctive normal form:(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A, B and C
doc4doc7doc8doc10
doc5doc6doc8doc12
doc4doc9doc11doc12
doc1: (1, 0, 1)doc2: (0, 1, 1)doc3: (1, 1, 0)doc4: (1, 0, 1)
Use of inverted file• Boolean query: (A or B) and C
– disjunctive normal form:(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A, B and C
doc7doc8doc10
doc5doc6doc8doc12
doc9doc11doc12
doc1: (1, 0, 1)doc2: (0, 1, 1)doc3: (1, 1, 0)doc4: (1, 0, 1)doc5: (0, 1, 0)
Use of inverted file• Boolean query: (A or B) and C
– disjunctive normal form:(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A, B and C
doc7doc8doc10
doc6doc8doc12
doc9doc11doc12
doc1: (1, 0, 1)doc2: (0, 1, 1)doc3: (1, 1, 0)doc4: (1, 0, 1)doc5: (0, 1, 0)doc6: (0, 1, 0)
Use of inverted file• Boolean query: (A or B) and C
– disjunctive normal form:(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A, B and C
doc7doc8doc10
doc8doc12
doc9doc11doc12
doc1: (1, 0, 1)doc2: (0, 1, 1)doc3: (1, 1, 0)doc4: (1, 0, 1)doc5: (0, 1, 0)doc6: (0, 1, 0)doc7: (1, 0, 0)
Use of inverted file• Boolean query: (A or B) and C
– disjunctive normal form:(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A, B and C
doc8doc10
doc8doc12
doc9doc11doc12
doc1: (1, 0, 1)doc2: (0, 1, 1)doc3: (1, 1, 0)doc4: (1, 0, 1)doc5: (0, 1, 0)doc6: (0, 1, 0)doc7: (1, 0, 0)doc8: (1, 1, 0)
Use of inverted file• Boolean query: (A or B) and C
– disjunctive normal form:(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A, B and C
doc10 doc12 doc9doc11doc12
doc1: (1, 0, 1)doc2: (0, 1, 1)doc3: (1, 1, 0)doc4: (1, 0, 1)doc5: (0, 1, 0)doc6: (0, 1, 0)doc7: (1, 0, 0)doc8: (1, 1, 0)doc9: (0, 0, 1)
Use of inverted file• Boolean query: (A or B) and C
– disjunctive normal form:(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A, B and C
doc10 doc12 doc11doc12
doc1: (1, 0, 1)doc2: (0, 1, 1)doc3: (1, 1, 0)doc4: (1, 0, 1)doc5: (0, 1, 0)doc6: (0, 1, 0)doc7: (1, 0, 0)doc8: (1, 1, 0)doc9: (0, 0, 1)doc10: (1, 0, 0)
Use of inverted file• Boolean query: (A or B) and C
– disjunctive normal form:(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A, B and C
doc12 doc11doc12
doc1: (1, 0, 1)doc2: (0, 1, 1)doc3: (1, 1, 0)doc4: (1, 0, 1)doc5: (0, 1, 0)doc6: (0, 1, 0)doc7: (1, 0, 0)doc8: (1, 1, 0)doc9: (0, 0, 1)doc10: (1, 0, 0)doc11: (0, 0, 1)
Use of inverted file• Boolean query: (A or B) and C
– disjunctive normal form:(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
– retrieve lists of document ids from inverted file corresponding to A, B and C
doc12 doc12
doc1: (1, 0, 1)doc2: (0, 1, 1)doc3: (1, 1, 0)doc4: (1, 0, 1)doc5: (0, 1, 0)doc6: (0, 1, 0)doc7: (1, 0, 0)doc8: (1, 1, 0)doc9: (0, 0, 1)doc10: (1, 0, 0)doc11: (0, 0, 1)doc12: (0, 1, 1)
doc12 doc12
Use of inverted file• Boolean query: (A or B) and C
– disjunctive normal form:(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
doc1: (1, 0, 1)doc2: (0, 1, 1)doc3: (1, 1, 0)doc4: (1, 0, 1)doc5: (0, 1, 0)doc6: (0, 1, 0)doc7: (1, 0, 0)doc8: (1, 1, 0)doc9: (0, 0, 1)doc10: (1, 0, 0)doc11: (0, 0, 1)doc12: (0, 1, 1)
• Boolean query: (A or B) and C– disjunctive normal form:
(1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)
doc1: (1, 0, 1)doc2: (0, 1, 1)
doc4: (1, 0, 1)
doc12: (0, 1, 1)
doc1: (1, 0, 1)doc2: (0, 1, 1)doc4: (1, 0, 1)doc12: (0, 1, 1)
report number of hits to user (4)(Note: can be done before any ‘hits’ are retrieved
retrieve all objects using ‘pointers’:doc1, doc2, doc4 and doc12
Use of inverted file
weighted query: A0.5, B0.7, C1.0
form weighted vector:(0.5, 0.7, 1.0)
retrieve lists of document ids from inverted file corresponding to A, B and C with weights
doc1 (.2)doc3 (.6)doc4 (.7) doc7 (.3)doc8 (.5) doc10 (.5)
doc2 (.6)doc3 (.8)doc5 (.9)doc6 (.3)doc8 (.5)doc12 (.2)
doc1 (.4)doc2 (.4)doc4 (.7)doc9 (.6)doc11 (.3)doc12 (.6)
Use of inverted file with weighted terms
weighted query: A0.5, B0.7, C1.0
form weighted vector:(0.5, 0.7, 1.0)
retrieve lists of document ids from inverted file corresponding to A, B and C with weights
doc1 (.2)doc3 (.6)doc4 (.7) doc7 (.3)doc8 (.5) doc10 (.5)
doc2 (.6)doc3 (.8)doc5 (.9)doc6 (.3)doc8 (.5)doc12 (.2)
doc1 (.4)doc2 (.4)doc4 (.7)doc9 (.6)doc11 (.3)doc12 (.6)
sim((0.5, 0.7, 1.0), (0.2, 0.0, 0.4)) = 0.85
doc1: 0.85
Use of inverted file
weighted query: A0.5, B0.7, C1.0
form weighted vector:(0.5, 0.7, 1.0)
retrieve lists of document ids from inverted file corresponding to A, B and C with weights
doc3 (.6)doc4 (.7) doc7 (.3)doc8 (.5) doc10 (.5)
doc2 (.6)doc3 (.8)doc5 (.9)doc6 (.3)doc8 (.5)doc12 (.2)
doc2 (.4)doc4 (.7)doc9 (.6)doc11 (.3)doc12 (.6)
sim((0.5, 0.7, 1.0), (0.0, 0.6, 0.4)) = 0.86
doc1: 0.85doc2: 0.86
Use of inverted file
• sort (rank) list according to similarity coefficient.
• retrieve first ‘N’ ranked objects.
• present ranked list to user.
• offer to retrieve next ‘N’.
• Note that so far we have not retrieved any documents; this is particularly important if the ids are URLS - we don’t need to start downloading web pages in order to rank them.
Use of inverted file
• proximity queries eg Q1: “A B” Q2: “A(3)B” (A…B)– postings file holds positional information– proceed as for ‘A and B’– keep positional information in (AB) list– filter (AB) list:
for Q1 pos(A) = pos(B) -1 for Q2 |pos(B) - pos(A)| < 3
• now we can distinguish ‘Venetian blind’ from ‘blind Venetian’
• in principle this should help precision without affecting recall too much
Use of inverted file
Pros and cons of inverted file• can be used for Boolean, weighted and positional
queries
• query processing can be completed without accessing data file
• number of hits for single term is available from dictionary
• expensive to update if information objects change content.
• demanding storage requirements (dictionary+inverted file approx same size as original data)
An alternative: Text signatures
• use hash algorithm to map a keyword onto one or more bits in a bit string: like Hashing (DB21)
• Simplest example: use one bit:– ‘Bath’ = [66, 97, 116, 104]
mod32 (66+97+116+104) = mod32(383) = 31
so represent ‘Bath’ by setting bit 31 in 0-31 bits:000000000000000000000000000000000001
Text signatures• Or use several bits:
– ‘Bath’ = [66, 97, 116, 104]‘ Ba’ mod32(66+97) = 3,
‘Bat’ mod32(66+97+116) = 23
‘ath’ mod32(97+116+104) = 29
‘th ’ mod32(116+104) = 28
represent ‘Bath’ by:001000000000000000000000000100001100
• This may allow wildcards, eg Bat?
Document signatures• superimpose keyword signatures
Bath 0000000000000000000000000000001
tub 0000000000100000000000000000000
0000000000100000000000000000001
• if each document has 6 keywords, there would be comb(32, 6) = 906192 different document signatures.
• Document signatures can be mapped onto numbers between 1 and 906192
Using signature file• Boolean query: (A or B) and C
– superimpose signatures of A and C– superimpose signatures of B and C– for each signature, S, in the file:
if either all bits of A&C are set in S or all bits in B&C are set, retrieve the document with signature S.
– check document to see if it is a ‘hit’– bit comparisons are very fast compared to string
comparisons.
Pros and cons of signature files
• Needs less space than inverted
• easier to update as documents change
• fast for queries with many keywords
• probabilistic - will return false hits
• cannot filter on positional information
• cannot hold keyword weights (or other weights)
these last three points imply that furtherprocessing is required to filter retrieved documents.
Summary of key points
• standard relational databases do not provide suitable indexing for handling index terms.
• standard SQL is not good at expressing ‘search-engine’ type queries
• inverted file structures are purpose made for these types of system
• storing frequencies/weights in the dictionary and inverted file allows for vector model queries
• storing positional information allows proximity queries, “Knowledge Management” v “MK”
• Signature files give faster matches but with limitations
Questions to think about
• Explain why the relational model is not good for IR.• How is it that, using an inverted file, the number of
hits can be reported without retrieving anything from the data file?
• Could this be achieved using signature files?• What are proximity queries, and how can inverted
file technology be used to deal with them?• How can signature files be used for proximity
queries?