
COMP3410 DB32: Technologies for Knowledge Management

Lecture 4:

Inverted Files and Signature Files for IR

By Eric Atwell, School of Computing, University of Leeds

(including re-use of teaching resources from other sources, esp. Stuart Roberts, School of Computing, Univ of Leeds)

Module Objectives

“On completion of this module, students should be able to:

… describe classical and emerging information retrieval techniques, and their relevance to knowledge management; …”

Today’s objectives

By the end of this lecture you should understand:

• why relational database techniques, eg B-tree indexing, are of no use for ‘IR’ queries;

• how ‘inverted file’ structures work to provide efficient query processing;

• an alternative approach provided by signature files.

The relational problem

• the simple approach to searching for a keyword uses leading (and trailing) wildcards: eg ‘%graphics%’

• there is no way other than ‘brute force scan’ to match such a condition with the data records held in a traditional relational database.

The relational problem

• Rather than hold full text, why not do content analysis, extract the index terms (keywords), and hold these in a relational database?

Module --(Indexed by)-- Index term

module(mod_code, title, semester, …)

term(term_id, value)

index(mod_code, term_id)

Sample SQL query: OR

find all modules matching: ‘database’ or ‘AI’ or ‘knowledge base’

-- "index" is quoted because it is a reserved word in many SQL dialects
select distinct m.*
from module m
inner join "index" i on m.mod_code = i.mod_code
inner join term t on t.term_id = i.term_id
where t.value = 'database' OR t.value = 'AI' OR t.value = 'knowledge base';

Another sample query: AND (?)

find all modules matching: ‘database’ and ‘AI’ and ‘knowledge base’

select distinct m.*
from module m
inner join "index" i on m.mod_code = i.mod_code
inner join term t on t.term_id = i.term_id
where t.value = 'database' AND t.value = 'AI' AND t.value = 'knowledge base';

This SQL query will not match any record; t.value cannot be simultaneously equal to ‘database’, ‘AI’ and ‘knowledge base’.

We cannot simply replace the ‘OR’s of the last SQL query with ‘AND’s.

Corrected sample query: AND

find all modules matching: ‘database’ and ‘AI’

select distinct m.*
from module m
inner join "index" i1 on m.mod_code = i1.mod_code
inner join term t1 on t1.term_id = i1.term_id
inner join "index" i2 on m.mod_code = i2.mod_code
inner join term t2 on t2.term_id = i2.term_id
where t1.value = 'database' and t2.value = 'AI';

Both tables ‘index’ and ‘term’ must be searched twice in order to establish whether, for each module, it is attached to both terms ‘database’ and ‘AI’.

If the query is a conjunction of N terms, the SQL would have 2N inner joins. AND is more complicated than OR (but common in IR)
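The three queries above can be tried out with a small sketch like the one below. It uses Python's built-in sqlite3 module with an in-memory database; the module codes, titles and term assignments are made-up sample data, and only two terms are used to keep it short.

```python
import sqlite3

# In-memory database with the three tables from the slides (sample rows are invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE module (mod_code TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE term (term_id INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE "index" (mod_code TEXT, term_id INTEGER);
    INSERT INTO module VALUES ('M1', 'Databases'), ('M2', 'AI and Databases'), ('M3', 'Graphics');
    INSERT INTO term VALUES (1, 'database'), (2, 'AI'), (3, 'graphics');
    INSERT INTO "index" VALUES ('M1', 1), ('M2', 1), ('M2', 2), ('M3', 3);
""")

def codes(sql):
    """Run a query and return the matching module codes in sorted order."""
    return sorted(row[0] for row in conn.execute(sql))

OR_SQL = """
    SELECT DISTINCT m.mod_code FROM module m
    JOIN "index" i ON m.mod_code = i.mod_code
    JOIN term t ON t.term_id = i.term_id
    WHERE t.value = 'database' OR t.value = 'AI'
"""

# Naive AND: one row of term cannot hold two different values at once.
NAIVE_AND_SQL = """
    SELECT DISTINCT m.mod_code FROM module m
    JOIN "index" i ON m.mod_code = i.mod_code
    JOIN term t ON t.term_id = i.term_id
    WHERE t.value = 'database' AND t.value = 'AI'
"""

# Corrected AND: join "index" and term once per query term.
AND_SQL = """
    SELECT DISTINCT m.mod_code FROM module m
    JOIN "index" i1 ON m.mod_code = i1.mod_code
    JOIN term t1 ON t1.term_id = i1.term_id
    JOIN "index" i2 ON m.mod_code = i2.mod_code
    JOIN term t2 ON t2.term_id = i2.term_id
    WHERE t1.value = 'database' AND t2.value = 'AI'
"""

or_hits = codes(OR_SQL)            # ['M1', 'M2']
naive_hits = codes(NAIVE_AND_SQL)  # []
and_hits = codes(AND_SQL)          # ['M2']
print(or_hits, naive_hits, and_hits)
```

Each extra term in a conjunction adds another pair of joins, which is the 2N growth noted above.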

Inverted file

• Non-DB structure, so not suitable for standard SQL

• each index term entry ‘points’ to a list of document record identifiers (RIDs)

• standard indexing method for IR systems

• widely used for search engines

• can be extended to allow for positional (context) searches

Inverted file structure

The idea of an inverted file is that, as well as storing each document with the list of terms used to index it, we store the list of terms used in the whole collection of documents and, for each term, point to the list of documents that are indexed by that term. So we have ‘inverted’ the structure:

D1: T11, T12, …, T1k

D2: T21, T22, …, T2l

…

to give:

T1: D11, D12, …, D1m

T2: D21, D22, …, D2n

…
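The inversion can be sketched in a few lines of Python. The document ids and terms below are invented for illustration, and `invert` is a hypothetical helper name, not something defined in the lecture.

```python
from collections import defaultdict

# Forward structure: each document with the terms that index it (sample data).
docs = {
    "D1": ["database", "AI"],
    "D2": ["AI", "graphics"],
    "D3": ["database"],
}

def invert(doc_terms):
    """Turn {document: [terms]} into {term: [documents]}, with sorted document lists."""
    inverted = defaultdict(list)
    for doc_id in sorted(doc_terms):
        for term in sorted(set(doc_terms[doc_id])):
            inverted[term].append(doc_id)
    return dict(inverted)

inverted = invert(docs)
print(inverted)
```

Visiting the documents in sorted order keeps every term's document list sorted, which matters later when lists are merged.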

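Inverted file structure

[Figure: three linked files. Each dictionary entry points into the inverted (postings) file, whose entries point to documents in the data file.]

dictionary: Term 1 (2), Term 2 (3), Term 3 (1), Term 4 (3), Term 5 (4), ..

inverted or postings file: 1, 2, 1, 2, 3, 2, 2, 3, 4, ..

data file: Doc 1, Doc 2, Doc 3, Doc 4, Doc 5, Doc 6, ..

further numeric column in the figure: 1, 3, 6, 7, 9, ..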



Dictionary (in IR)

• list of terms including ‘normalised’ keywords or stems plus object descriptors (eg author name)

• frequency with which that term occurs in the collection

• pointer to the inverted file

• access to dictionary is by standard file access method (binary search or B-tree or hashing algorithm; DB21)

Inverted file

• for each entry in the dictionary:

– a list of pointers into the data file (or object-ids, or URLs..)

– identifying those objects indexed by the dictionary term

• inverted file may also contain:

– positional information within each document

– term frequency (or weight) within each document
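A minimal sketch of how the dictionary and the inverted file fit together is given below. It assumes that each dictionary entry holds a frequency and a 0-based offset into one flat postings array; the term names and the grouping of the postings are assumptions loosely modelled on the figure, not a layout stated in the lecture.

```python
# Dictionary: term -> (frequency, offset of the term's first posting).
# Offsets are 0-based here; this layout is an assumption for illustration.
dictionary = {
    "term1": (2, 0),
    "term2": (3, 2),
    "term3": (1, 5),
    "term4": (3, 6),
}

# Inverted (postings) file: document ids for all terms, stored back to back.
postings = [1, 2, 1, 2, 3, 2, 2, 3, 4]

def lookup(term):
    """Return the document ids indexed by a term."""
    frequency, start = dictionary[term]
    return postings[start:start + frequency]

def hit_count(term):
    """Number of hits for a single term, read from the dictionary alone."""
    return dictionary[term][0]

print(lookup("term2"), hit_count("term2"))
```

Note that `hit_count` never touches the postings array or the data file.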

Use of inverted file

• Boolean query: (A or B) and C

– disjunctive normal form: (A and C) or (B and C) or (A and B and C); as (A, B, C) membership vectors: (1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)

– retrieve lists of document ids from inverted file corresponding to A, B and C

A: doc1, doc3, doc4, doc7, doc8, doc10

B: doc2, doc3, doc5, doc6, doc8, doc12

C: doc1, doc2, doc4, doc9, doc11, doc12

– merge the three lists in document id order, building an (A, B, C) vector for each document:

doc1: (1, 0, 1)
doc2: (0, 1, 1)
doc3: (1, 1, 0)
doc4: (1, 0, 1)
doc5: (0, 1, 0)
doc6: (0, 1, 0)
doc7: (1, 0, 0)
doc8: (1, 1, 0)
doc9: (0, 0, 1)
doc10: (1, 0, 0)
doc11: (0, 0, 1)
doc12: (0, 1, 1)

• Boolean query: (A or B) and C

– disjunctive normal form: (1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)

– documents whose vector matches one of these terms:

doc1: (1, 0, 1)
doc2: (0, 1, 1)
doc4: (1, 0, 1)
doc12: (0, 1, 1)

– report number of hits to user (4). (Note: this can be done before any ‘hits’ are retrieved.)

– retrieve all objects using ‘pointers’: doc1, doc2, doc4 and doc12
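The merge for (A or B) and C can be sketched as follows, using the three posting lists from the slides with documents written as plain integers. The function name `merge_vectors` is a hypothetical choice, and this is one simple way to do the merge rather than the lecture's own code.

```python
# Posting lists for A, B and C, sorted by document id.
A = [1, 3, 4, 7, 8, 10]
B = [2, 3, 5, 6, 8, 12]
C = [1, 2, 4, 9, 11, 12]

def merge_vectors(*lists):
    """Merge sorted posting lists into {doc_id: membership vector}."""
    heads = [0] * len(lists)  # current read position in each list
    vectors = {}
    while any(h < len(lst) for h, lst in zip(heads, lists)):
        # The smallest unread document id across all lists comes next.
        current = min(lst[h] for h, lst in zip(heads, lists) if h < len(lst))
        vector = []
        for i, lst in enumerate(lists):
            if heads[i] < len(lst) and lst[heads[i]] == current:
                vector.append(1)
                heads[i] += 1
            else:
                vector.append(0)
        vectors[current] = tuple(vector)
    return vectors

# Disjunctive normal form of (A or B) and C as (A, B, C) vectors.
DNF = {(1, 0, 1), (0, 1, 1), (1, 1, 1)}

vectors = merge_vectors(A, B, C)
hits = [doc for doc, vec in vectors.items() if vec in DNF]
print(len(hits), hits)  # 4 [1, 2, 4, 12]
```

Only document ids are handled here, so the hit count is known before anything is read from the data file.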

Use of inverted file with weighted terms

• weighted query: A (0.5), B (0.7), C (1.0)

• form weighted vector: (0.5, 0.7, 1.0)

• retrieve lists of document ids from inverted file corresponding to A, B and C with weights

A: doc1 (.2), doc3 (.6), doc4 (.7), doc7 (.3), doc8 (.5), doc10 (.5)

B: doc2 (.6), doc3 (.8), doc5 (.9), doc6 (.3), doc8 (.5), doc12 (.2)

C: doc1 (.4), doc2 (.4), doc4 (.7), doc9 (.6), doc11 (.3), doc12 (.6)

• merge the lists in document id order, computing the similarity of each document vector to the query vector:

sim((0.5, 0.7, 1.0), (0.2, 0.0, 0.4)) = 0.85

sim((0.5, 0.7, 1.0), (0.0, 0.6, 0.4)) = 0.86

doc1: 0.85
doc2: 0.86
…


• sort (rank) list according to similarity coefficient.

• retrieve first ‘N’ ranked objects.

• present ranked list to user.

• offer to retrieve next ‘N’.

• Note that so far we have not retrieved any documents; this is particularly important if the ids are URLs: we don’t need to start downloading web pages in order to rank them.
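The ranking steps can be sketched as below. The slides do not name the similarity coefficient; cosine similarity is assumed here because it reproduces the two values shown (0.85 for doc1 and 0.86 for doc2). The C list is pieced together from the slide frames, and the function names are hypothetical.

```python
import math

query = (0.5, 0.7, 1.0)  # weights for terms A, B, C

# Weighted posting lists: term -> {doc_id: weight of the term in that document}.
postings = {
    "A": {1: 0.2, 3: 0.6, 4: 0.7, 7: 0.3, 8: 0.5, 10: 0.5},
    "B": {2: 0.6, 3: 0.8, 5: 0.9, 6: 0.3, 8: 0.5, 12: 0.2},
    "C": {1: 0.4, 2: 0.4, 4: 0.7, 9: 0.6, 11: 0.3, 12: 0.6},
}

def cosine(q, d):
    """Cosine similarity between a query vector and a document vector."""
    dot = sum(x * y for x, y in zip(q, d))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(y * y for y in d))
    return dot / norm if norm else 0.0

def rank(q, index, terms=("A", "B", "C")):
    """Score every document found in any posting list, best score first."""
    doc_ids = sorted(set().union(*(index[t] for t in terms)))
    scored = []
    for doc in doc_ids:
        vector = tuple(index[t].get(doc, 0.0) for t in terms)
        scored.append((cosine(q, vector), doc))
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

ranking = rank(query, postings)
top_n = ranking[:3]  # first 'N' ranked objects, here N = 3
for score, doc in top_n:
    print(f"doc{doc}: {score:.2f}")
```

As in the Boolean case, the whole ranking is computed from the inverted file; documents are fetched only once the user asks for the top of the list.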


• proximity queries, eg Q1: “A B”, Q2: “A(3)B” (A…B)

– postings file holds positional information

– proceed as for ‘A and B’

– keep positional information in (AB) list

– filter (AB) list:

for Q1: pos(A) = pos(B) - 1

for Q2: |pos(B) - pos(A)| < 3

• now we can distinguish ‘Venetian blind’ from ‘blind Venetian’

• in principle this should help precision without affecting recall too much
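A small sketch of the positional filter follows. The positional postings are invented sample data (document 1 contains "venetian blind", document 2 contains "blind venetian"), and `adjacent` and `within` are hypothetical helper names.

```python
# Positional postings: term -> {doc_id: [word positions]} (sample data).
positions = {
    "venetian": {1: [4], 2: [8]},
    "blind": {1: [5], 2: [7]},
}

def adjacent(a, b, index):
    """Q1 "A B": documents where some pos(A) = pos(B) - 1."""
    hits = []
    for doc in sorted(index[a].keys() & index[b].keys()):  # proceed as for 'A and B'
        if any(pb - pa == 1 for pa in index[a][doc] for pb in index[b][doc]):
            hits.append(doc)
    return hits

def within(a, b, k, index):
    """Q2 "A(k)B": documents where some |pos(B) - pos(A)| < k."""
    hits = []
    for doc in sorted(index[a].keys() & index[b].keys()):
        if any(abs(pb - pa) < k for pa in index[a][doc] for pb in index[b][doc]):
            hits.append(doc)
    return hits

print(adjacent("venetian", "blind", positions))   # [1]
print(adjacent("blind", "venetian", positions))   # [2]
print(within("venetian", "blind", 3, positions))  # [1, 2]
```

The phrase query is order-sensitive, while the windowed query as written on the slide uses an absolute difference and so accepts either order.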


Pros and cons of inverted file

• can be used for Boolean, weighted and positional queries

• query processing can be completed without accessing data file

• number of hits for single term is available from dictionary

• expensive to update if information objects change content.

• demanding storage requirements (dictionary+inverted file approx same size as original data)

An alternative: Text signatures

• use a hash algorithm to map a keyword onto one or more bits in a bit string: like hashing (DB21)

• Simplest example: use one bit:

– ‘Bath’ = [66, 97, 116, 104]

– mod32(66+97+116+104) = mod32(383) = 31

– so represent ‘Bath’ by setting bit 31 of bits 0-31:

00000000000000000000000000000001

Text signatures

• Or use several bits:

– ‘Bath’ = [66, 97, 116, 104]

‘ Ba’ mod32(66+97) = 3

‘Bat’ mod32(66+97+116) = 23

‘ath’ mod32(97+116+104) = 29

‘th ’ mod32(116+104) = 28

– represent ‘Bath’ by setting bits 3, 23, 28 and 29:

00010000000000000000000100001100

• This may allow wildcards, eg Bat?
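The two hashing schemes can be sketched as follows. The code assumes ASCII keywords, a 32-bit signature with bit 0 written leftmost (so that bit 31 is the last character, as in the one-bit example), and padding spaces that contribute nothing to the sum, mirroring the slide arithmetic. The function names are hypothetical.

```python
BITS = 32

def one_bit(word):
    """Single-bit signature: sum of character codes mod 32."""
    return sum(word.encode("ascii")) % BITS

def trigram_bits(word):
    """Several bits: hash each 3-character window of the space-padded word."""
    padded = f" {word} "
    bits = set()
    for i in range(len(padded) - 2):
        window = padded[i:i + 3]
        # Padding spaces are skipped, as in the slide's sums.
        bits.add(sum(ord(c) for c in window if c != " ") % BITS)
    return bits

def bit_string(bits):
    """Render a set of bit numbers as a 32-character string, bit 0 first."""
    return "".join("1" if i in bits else "0" for i in range(BITS))

print(one_bit("Bath"))                   # 31
print(sorted(trigram_bits("Bath")))      # [3, 23, 28, 29]
print(bit_string(trigram_bits("Bath")))
```

With trigram bits, a pattern such as Bat? could be matched on the bits of the windows it fixes (‘ Ba’ and ‘Bat’), which is one reading of the wildcard remark above.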

Document signatures

• superimpose keyword signatures (bitwise OR):

Bath 00000000000000000000000000000001

tub 00000000000100000000000000000000

00000000000100000000000000000001

• if each document has 6 keywords, there would be comb(32, 6) = 906192 different document signatures.

• Document signatures can be mapped onto numbers between 1 and 906192
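Superimposing signatures and the count of possible signatures can be checked with a short sketch. It assumes the one-bit hash from the earlier slide (sum of character codes mod 32) and stores a signature as a Python integer bitmask; the function names are hypothetical.

```python
import math

BITS = 32

def keyword_bit(word):
    """One-bit keyword hash: sum of character codes mod 32."""
    return sum(word.encode("ascii")) % BITS

def signature(keywords):
    """Superimpose (bitwise OR) the one-bit signatures of all keywords."""
    sig = 0
    for word in keywords:
        sig |= 1 << keyword_bit(word)
    return sig

def set_bits(sig):
    """List the bit numbers that are set in a signature."""
    return [i for i in range(BITS) if (sig >> i) & 1]

print(set_bits(signature(["Bath", "tub"])))  # [11, 31]
print(math.comb(32, 6))                      # 906192
```

The figure of 906192 counts the ways of choosing 6 distinct bits out of 32, so it assumes each of the 6 keywords sets a different bit.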

Using signature file

• Boolean query: (A or B) and C

– superimpose signatures of A and C

– superimpose signatures of B and C

– for each signature, S, in the file: if either all bits of A&C are set in S or all bits of B&C are set in S, retrieve the document with signature S.

– check document to see if it is a ‘hit’

– bit comparisons are very fast compared to string comparisons.
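The query procedure can be sketched as below, again assuming the one-bit hash (sum of character codes mod 32). The documents and keywords are invented; ‘but’ is chosen because it hashes to the same bit as ‘tub’, which produces a false hit that the final check removes.

```python
BITS = 32

def signature(keywords):
    """Superimposed one-bit signatures, held as an integer bitmask."""
    sig = 0
    for word in keywords:
        sig |= 1 << (sum(word.encode("ascii")) % BITS)
    return sig

# Data file (sample keyword sets) and the signature file built from it.
documents = {
    1: {"Bath", "tub"},
    2: {"Bath", "but"},   # 'but' collides with 'tub' under this hash
    3: {"tap", "soap"},
}
signature_file = {doc: signature(words) for doc, words in documents.items()}

def covers(doc_sig, query_sig):
    """True if every bit of the query signature is set in the document signature."""
    return (doc_sig & query_sig) == query_sig

def search(a, b, c):
    """Evaluate (a or b) and c: filter by signature, then check each candidate."""
    sig_ac, sig_bc = signature([a, c]), signature([b, c])
    candidates = [doc for doc, s in sorted(signature_file.items())
                  if covers(s, sig_ac) or covers(s, sig_bc)]
    hits = [doc for doc in candidates
            if (a in documents[doc] or b in documents[doc]) and c in documents[doc]]
    return candidates, hits

candidates, hits = search("tub", "tap", "Bath")
print(candidates, hits)  # [1, 2] [1]
```

Document 2 passes the bit test but fails the check against the document itself, which is why the retrieved documents have to be examined.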

Pros and cons of signature files

• needs less space than an inverted file

• easier to update as documents change

• fast for queries with many keywords

• probabilistic - will return false hits

• cannot filter on positional information

• cannot hold keyword weights (or other weights)

These last three points imply that further processing is required to filter retrieved documents.

Summary of key points

• standard relational databases do not provide suitable indexing for handling index terms.

• standard SQL is not good at expressing ‘search-engine’ type queries

• inverted file structures are purpose made for these types of system

• storing frequencies/weights in the dictionary and inverted file allows for vector model queries

• storing positional information allows proximity queries, “Knowledge Management” v “MK”

• Signature files give faster matches but with limitations

Questions to think about

• Explain why the relational model is not good for IR.

• How is it that, using an inverted file, the number of hits can be reported without retrieving anything from the data file?

• Could this be achieved using signature files?

• What are proximity queries, and how can inverted file technology be used to deal with them?

• How can signature files be used for proximity queries?