www.monash.edu.au
CSE3201/CSE4500 Information Retrieval Systems
Signature Based Text Retrieval Systems
Signature File for Text Retrieval
• A “signature” is created as an abstraction of a document.
• The signatures that represent all the documents in the collection are kept in a file called the “signature file”.
Word Signature (WS)
• A word signature
  – is a fixed-length bit string that represents a word.
  – is described by
    > its length (N)
    > the number of bits set to 1 (k)
• Example (N = 24, k = 7):
  1 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0
Word Signature Generation
• Use a hash function to find the position of the bit(s) that will be set to 1.
• Use triplets of characters to generate the word signature:
  – Divide the word into overlapping triplets.
  – For each triplet of characters:
    > Convert the characters to numeric values (for example, their ASCII codes).
    > Use the numbers as the input to the hash function.
    > The hash function produces a number that represents the bit position of the triplet in the word signature.
Signature Generator Algorithm
Set hash_value to 0
for each character in the triplet do
    hash_value := (hash_value * 137 + ASCII value of character) mod 256
K values
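The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the course's reference implementation. The function names, the use of "-" as the padding character, and the final reduction of the 0..255 hash value to a bit position with `mod n_bits` are all assumptions, since the slides do not say how the hash value is mapped into a signature of length N.

```python
def triplet_hash(triplet):
    """Hash a character triplet as in the pseudocode; the result is in 0..255."""
    hash_value = 0
    for ch in triplet:
        hash_value = (hash_value * 137 + ord(ch)) % 256
    return hash_value


def word_triplets(word, pad="-"):
    """Overlapping triplets of the word, padded with `pad` at both ends (assumed)."""
    padded = pad + word + pad
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def word_signature(word, n_bits=24):
    """Set one bit per triplet. Assumption: position = hash mod n_bits (0-based)."""
    signature = [0] * n_bits
    for triplet in word_triplets(word):
        signature[triplet_hash(triplet) % n_bits] = 1
    return signature


print(word_triplets("signature"))
print(word_signature("signature"))
```

Because several triplets may hash to the same position, the number of bits actually set can be smaller than the number of triplets.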
Word Signature Generation – simplified example
• Example:
  – The signature 111000111001 is generated for the word “signature”.
• Bit positions are read from left to right, starting at position 1.
• Each triplet is passed through the hash function, which gives the position of the bit set to 1:

Triplet          -si  sig  ign  gna  nat  atu  tur  ure  re-
Bit position       1    2    7    3    2    3    9   12    8

Word signature   1 1 1 0 0 0 1 1 1 0 0 1
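The simplified example can be checked mechanically. The sketch below reads the hash outputs on the slide as nine bit positions, one per triplet (1, 2, 7, 3, 2, 3, 9, 12, 8), and sets those bits in a 12-bit signature. The function name is hypothetical.

```python
def signature_from_positions(positions, n_bits):
    """Build a bit string with a 1 at each listed position (1-based, left to right)."""
    bits = [0] * n_bits
    for p in positions:
        bits[p - 1] = 1  # repeated positions simply set the same bit again
    return "".join(str(b) for b in bits)


# One position per triplet of "signature": -si sig ign gna nat atu tur ure re-
positions = [1, 2, 7, 3, 2, 3, 9, 12, 8]
print(signature_from_positions(positions, 12))  # 111000111001
```

Nine triplets set only seven distinct bits here, because positions 2 and 3 are each produced twice.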
Document Signature (DS)
• A document signature can be created using two methods:
  – concatenation of word signatures.
  – superimposed coding.
Document Signature – Concatenation of WS
• The length of a document signature (DS) can vary.
• A fixed number of bits may precede the DS to indicate its length.
• It is also possible to fix the length of the DS.
  – The length can be set to that of the signature of the longest document in the collection.
  – The signatures of shorter documents are padded with extra “0” bits.
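Both options for handling the variable length can be sketched as follows. This is an illustration only: signatures are held as strings of "0" and "1" for readability, and the function names and the 16-bit default prefix width are assumptions rather than values from the slides.

```python
def concatenate(word_signatures):
    """Document signature as the concatenation of its word signatures."""
    return "".join(word_signatures)


def with_length_prefix(doc_signature, prefix_bits=16):
    """Variable-length option: prepend the DS length as a fixed-width binary number."""
    return format(len(doc_signature), "0{}b".format(prefix_bits)) + doc_signature


def pad_to_longest(doc_signatures):
    """Fixed-length option: pad every DS with trailing 0 bits up to the longest one."""
    longest = max(len(ds) for ds in doc_signatures)
    return [ds.ljust(longest, "0") for ds in doc_signatures]


short_doc = concatenate(["1010", "0110"])
long_doc = concatenate(["1010", "0110", "0011"])
print(with_length_prefix(short_doc, prefix_bits=8))
print(pad_to_longest([short_doc, long_doc]))
```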
Document Signature – Superimposed Coding
• Each document is divided into blocks containing a constant number of distinct words.
• To create a block signature, OR together the signatures of all the words in the block.
• Example:

free              001 000 110 010
text              000 010 101 001
Block signature   001 010 111 011
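The OR step can be reproduced with the two word signatures from the slide. This is a small sketch; the function name `superimpose` is hypothetical, and signatures are written as bit strings without the spacing used on the slide.

```python
def superimpose(signatures):
    """Bitwise OR of equal-length bit strings, returned as a bit string."""
    n_bits = len(signatures[0])
    block = 0
    for s in signatures:
        if len(s) != n_bits:
            raise ValueError("all signatures must have the same length")
        block |= int(s, 2)
    return format(block, "0{}b".format(n_bits))  # keep leading zeros


free = "001000110010"
text = "000010101001"
print(superimpose([free, text]))  # 001010111011
```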
Document Signature – Superimposed Coding
• To create the document signature, all the block signatures are superimposed.
Query Signature
• The query is converted to a block signature in the same way as a document block.
• Example:

free          0 0 1 0 0 0 1 1 0 0 1 0
text          0 0 0 0 1 0 1 0 1 0 0 1
Block/Query   0 0 1 0 1 0 1 1 1 0 1 1
Matching the Query and Document Signature
• Premise:
  – The positions of the bits set to 1 represent the existence of particular words in the query or document.
• A relevant document is a document whose signature has bits set to 1 in all the positions where the query’s signature has bits set to 1.
• The relevant document’s signature does not have to be an exact match of the query’s signature.
• Example:
  – Query: 0100
  – Matching document signatures: 1111, 0111, 0110, 0100.
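The matching rule amounts to a subset test on the set bits, which can be sketched as below. The function name is hypothetical, and the non-matching signature 1011 is added here only for contrast.

```python
def matches(query, signature):
    """True if every bit set in the query is also set in the signature."""
    q = int(query, 2)
    s = int(signature, 2)
    return (s & q) == q


query = "0100"
for signature in ["1111", "0111", "0110", "0100", "1011"]:
    print(signature, matches(query, signature))
```

The first four signatures match because each has a 1 in the second position. 1011 does not, even though it has more bits set than the query.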
Query on Signature File
Query: 001 010 111 011

Block signature            Match?
0 0 1 0 0 0 1 1 1 0 1 1    No
0 0 1 1 1 1 1 1 1 0 1 1    Yes
0 0 1 0 1 0 1 0 1 0 1 1    No
0 0 1 0 1 0 1 1 1 0 1 0    No
1 1 1 0 1 0 1 1 1 0 1 1    Yes
0 0 1 1 0 0 1 1 1 0 1 1    No
0 0 1 0 1 0 1 1 1 1 1 1    Yes

Match? Perform an AND operation between the query and the block signature. If the result equals the query, that is (result – query) = 0, they match.
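A sequential scan over the seven block signatures on this slide can be sketched as follows, applying the AND test to each one in turn. The variable and function names are illustrative.

```python
QUERY = "001 010 111 011"

SIGNATURE_FILE = [
    "0 0 1 0 0 0 1 1 1 0 1 1",
    "0 0 1 1 1 1 1 1 1 0 1 1",
    "0 0 1 0 1 0 1 0 1 0 1 1",
    "0 0 1 0 1 0 1 1 1 0 1 0",
    "1 1 1 0 1 0 1 1 1 0 1 1",
    "0 0 1 1 0 0 1 1 1 0 1 1",
    "0 0 1 0 1 0 1 1 1 1 1 1",
]


def to_int(bits):
    """Parse a bit string, ignoring the spaces used for readability."""
    return int(bits.replace(" ", ""), 2)


def sequential_scan(query, signatures):
    """Compare every signature with the query; True where (sig AND query) == query."""
    q = to_int(query)
    return [(to_int(s) & q) == q for s in signatures]


results = sequential_scan(QUERY, SIGNATURE_FILE)
for signature, hit in zip(SIGNATURE_FILE, results):
    print(signature, "Yes" if hit else "No")
```

The second, fifth and seventh signatures match. Each of the others is missing at least one bit that is set in the query.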
Signature File Structure
• Sequential
  – During searching, each signature is compared to the query signature.
  – Time consuming because:
    > Memory size is limited, so all signatures cannot be loaded into memory at once.
    > Searching may therefore require many I/O operations.
• We need a file structure for the signature file that minimises the number of I/O operations.
• Bit-Sliced Signature
  – At most N records (N being the size of the signature) need to be retrieved.
Matrix Transposed
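$$
\begin{pmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \end{pmatrix}^{T}
=
\begin{pmatrix} x_{11} & x_{21} \\ x_{12} & x_{22} \\ x_{13} & x_{23} \end{pmatrix},
\qquad x_{ij} \to x_{ji}
$$

$$
\begin{pmatrix} a & b & c \\ d & e & f \end{pmatrix}^{T}
=
\begin{pmatrix} a & d \\ b & e \\ c & f \end{pmatrix}
$$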
Bit-Sliced
Sequential: one record of N bits per document

d1    0 0 1 0 0 0 1 1 1 0 1 1
d2    0 0 1 1 1 1 1 1 1 0 1 1
d3    0 0 1 0 1 0 1 0 1 0 1 1
d4    0 0 1 0 1 0 1 1 1 0 1 0
...
dn

Bit-sliced: N records, one per bit position

          d1  d2  d3  d4  ...  dn
bit 1      0   0   0   0
bit 2      0   0   0   0
bit 3      1   1   1   1
bit 4      0   1   0   0
bit 5      0   1   1   1
bit 6      0   1   0   0
bit 7      1   1   1   1
bit 8      1   1   0   1
bit 9      1   1   1   1
bit 10     0   0   0   0
bit 11     1   1   1   1
bit 12     1   1   1   0

Query: 001 010 111 011
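The bit-sliced layout is the transpose of the sequential one, so it can be derived from the four document signatures on this slide with a single `zip`. This is a sketch with hypothetical names; a real signature file would store the slices as records on disk rather than as strings in memory.

```python
def bit_slice(signatures):
    """Transpose: slice i holds bit i of every document signature, in document order."""
    return ["".join(column) for column in zip(*signatures)]


# Sequential file: one 12-bit signature per document (d1..d4 from the slide).
sequential = [
    "001000111011",  # d1
    "001111111011",  # d2
    "001010101011",  # d3
    "001010111010",  # d4
]

slices = bit_slice(sequential)
for position, record in enumerate(slices, start=1):
    print(position, record)
```

With N = 12 and four documents, the result is 12 records of 4 bits each.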
Bit Sliced Signature File
• Retrieval
  – If the ith bit in the query signature is set to 1, retrieve the ith signature slice (record).
  – If n bits are set to 1 in the query, only n records need to be retrieved.
Bit Sliced Signature File
Query: 001 010 111 011

Bit   Query   d1  d2  d3  d4   Retrieved?
 1      0      0   0   0   0
 2      0      0   0   0   0
 3      1      1   1   1   1   yes
 4      0      0   1   0   0
 5      1      0   1   1   1   yes
 6      0      0   1   0   0
 7      1      1   1   1   1   yes
 8      1      1   1   0   1   yes
 9      1      1   1   1   1   yes
10      0      0   0   0   0
11      1      1   1   1   1   yes
12      1      1   1   1   0   yes

Match: the 2nd block (d2), because all bits in its column of the retrieved records are set to 1.
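Retrieval on the bit-sliced file can be sketched as follows: only the slices at positions where the query has a 1 are read, and they are ANDed together. A document matches if its column is still 1 at the end. The slice records are copied from the slide, and the function name is hypothetical.

```python
def search_bit_sliced(slices, query):
    """Return the 0-based indexes of documents whose signatures cover the query."""
    n_docs = len(slices[0])
    result = (1 << n_docs) - 1          # start with every document as a candidate
    retrieved = 0
    for position, query_bit in enumerate(query):
        if query_bit == "1":            # only these slice records are read
            result &= int(slices[position], 2)
            retrieved += 1
    matching = [i for i in range(n_docs) if (result >> (n_docs - 1 - i)) & 1]
    return matching, retrieved


slices = ["0000", "0000", "1111", "0100", "0111", "0100",
          "1111", "1101", "1111", "0000", "1111", "1110"]
matching, retrieved = search_bit_sliced(slices, "001010111011")
print(matching, retrieved)  # [1] 7
```

Index 1 is the second document (d2), and 7 of the 12 slice records were read, one per bit set in the query.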
Bit Sliced Signature File
• Advantages:
  – Fewer records are retrieved -> faster retrieval.
• Disadvantages:
  – An update operation becomes very costly, because adding or changing one document’s signature touches every one of the N slice records.
False Drop
• A false drop occurs when a document’s signature matches a query’s signature but the query’s word does not match any word in the document.
• It is possible because two distinct blocks may have the same signature due to:
  – the hashing algorithm
  – superimposed coding
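A small constructed case shows how superimposed coding alone can produce a false drop. The three 8-bit word signatures below are invented for illustration and do not come from the hash function in the slides. The block contains only "cat" and "dog", yet the query "bird" matches it because its bits are covered by the bits of the other two words.

```python
# Hypothetical word signatures, chosen by hand to force a collision of set bits.
WORD_SIGNATURES = {
    "cat":  0b10100000,
    "dog":  0b00001010,
    "bird": 0b10001000,
}


def block_signature(words):
    """Superimpose (OR) the signatures of the words in a block."""
    signature = 0
    for word in words:
        signature |= WORD_SIGNATURES[word]
    return signature


def signature_match(query_word, block_words):
    """Signature-level test: all query bits are set in the block signature."""
    q = WORD_SIGNATURES[query_word]
    return (block_signature(block_words) & q) == q


block = ["cat", "dog"]
is_false_drop = signature_match("bird", block) and "bird" not in block
print(is_false_drop)  # True
```

This is why documents returned by the signature test still have to be checked against the actual text.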
Relation Between the Signature Properties and False Drop
• The rate of false drops depends on:
  – The size of the signature (N bits)
    > Increasing N decreases the false drop rate.
  – The number of bits set to 1 (k bits)
    > Increasing k up to a certain level decreases the false drop rate.
  – The number of unique words per block
    > Decreasing the number of unique words per block decreases the false drop rate.
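These three trends can be illustrated numerically. The sketch below uses the usual Bloom-filter style estimate, which is not given in the slides and rests on a simplifying assumption: each word sets k bit positions chosen independently and uniformly at random. Under that assumption, a block of D distinct words leaves a given bit set with probability about 1 - e^(-kD/N), and a single-word query causes a false drop with probability about that value raised to the power k. The function and parameter names are hypothetical.

```python
import math


def false_drop_estimate(n_bits, k, words_per_block):
    """Approximate false drop probability for a single-word query.

    Assumes each word sets k bits chosen independently and uniformly at random.
    """
    fraction_set = 1.0 - math.exp(-k * words_per_block / n_bits)
    return fraction_set ** k


# Larger N lowers the estimate.
print(false_drop_estimate(512, 4, 40), false_drop_estimate(1024, 4, 40))
# Fewer words per block lowers the estimate.
print(false_drop_estimate(512, 4, 40), false_drop_estimate(512, 4, 20))
# Raising k helps only up to a point, then makes things worse.
for k in (1, 9, 40):
    print(k, false_drop_estimate(512, k, 40))
```

In this model the estimate is lowest near k = N ln 2 / D, which is about 9 for N = 512 and D = 40. Beyond that point the block signature fills up with 1s and false drops become more likely again.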