Hsin-Hsi Chen 8-1
Chapter 8: Indexing and Searching
Hsin-Hsi Chen
Department of Computer Science and Information Engineering
National Taiwan University
Hsin-Hsi Chen 8-2
Introduction
• searching
  – Online text searching
    • Scan the text sequentially
  – Indexed searching
    • Build data structures over the text to speed up the search
    • Semi-static collections: updated at reasonably regular intervals
• indexing techniques
  – inverted files
  – suffix (PAT) arrays
  – signature files
Hsin-Hsi Chen 8-3
Assumptions
• n: the size of the text database
• m: the length of the search pattern (m < n)
• M: the amount of memory available
• n’: the size of the texts that are modified (n’ < n)
• Experiments
  – 32-bit Sun UltraSparc-1 at 167 MHz with 64 MB of RAM
  – TREC-2 (WSJ, DOE, FR, ZIFF, AP)
Hsin-Hsi Chen 8-4
File Structures for IR
• lexicographical indices
  – indices that are sorted
  – inverted files
  – Patricia (PAT) trees (suffix trees and arrays)
• cluster file structures (see Chapter 7 in document clustering)
• indices based on hashing– signature files
Hsin-Hsi Chen 8-5
Inverted Files
Hsin-Hsi Chen 8-6
Inverted Files
• Each document is assigned a list of keywords or attributes.
• Each keyword (attribute) is associated with operational relevance weights.
• An inverted file is the sorted list of keywords (attributes), with each keyword having links to the documents containing that keyword.
• Penalty
  – the size of inverted files ranges from 10% to 100% or more of the size of the text itself
  – need to update the index as the data set changes
Hsin-Hsi Chen 8-7
1 6 9 11 17 19 24 28 33 40 46 50 55 60
This is a text. A text has many words. Words are made from letters.
letters 60 …made 50 …many 28 …text 11, 19 …words 33, 40 …
Vocabulary Occurrences
Heaps’ law: the vocabulary grows as O(n^β), β: 0.4~0.6. Vocabulary for 1GB of the TREC-2 collection: 5MB (before stemming and normalization). Occurrences: the extra space is O(n), 30%~40% of the text size.
Text
addressing granularity:
(1) inverted list – word positions, character positions
(2) inverted file – documents
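Heaps' law above can be tried numerically. The constant K and exponent beta below are illustrative choices, not values measured on TREC-2:

```python
# Heaps' law sketch: vocabulary size V grows as K * n**beta (beta ~ 0.4-0.6).
# K = 10 and beta = 0.5 are illustrative values, not TREC-2 measurements.
def heaps_vocabulary(n_words, K=10.0, beta=0.5):
    """Estimate the number of distinct words in a text of n_words words."""
    return K * n_words ** beta

# With beta = 0.5, a 100x larger text grows the vocabulary only ~10x.
small = heaps_vocabulary(10_000)
large = heaps_vocabulary(1_000_000)
```

This sublinear growth is why the vocabulary of even a 1GB collection fits comfortably in main memory.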
Hsin-Hsi Chen 8-8
Block Addressing
• Full inverted indices
  – point to exact occurrences
• Block addressing
  – point to the blocks where the word appears
  – pointers are smaller
  – 5% overhead over the text size
Block 1  Block 2  Block 3  Block 4
This is a text. A text has many words. Words are made from letters.
letters 4 …made 4 …many 2 …text 1, 2 …words 3 …
Vocabulary Occurrences Text
Inverted index
block: fixed-size blocks, files, documents, Web pages, …
block = retrieval units?
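A minimal sketch of block addressing (the function name is mine, and blocks here are runs of words rather than the byte-sized blocks of the figure, so the block numbers need not match the slide exactly):

```python
# Block addressing sketch: the occurrence list stores block numbers rather
# than exact positions, so pointers are smaller and repeats inside one
# block collapse into a single pointer.
def block_index(words, block_size):
    """Map each word to the sorted list of blocks (1-based) where it appears."""
    index = {}
    for pos, word in enumerate(words):
        block = pos // block_size + 1
        blocks = index.setdefault(word, [])
        if not blocks or blocks[-1] != block:   # same word twice in a block -> one pointer
            blocks.append(block)
    return index

words = "this is a text a text has many words words are made from letters".split()
idx = block_index(words, block_size=4)
```

Note how "words", which occurs twice in block 3, gets a single pointer; finding its exact positions would require scanning that block.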
Hsin-Hsi Chen 8-9
Sorted array implementation of an inverted filethe documents in which keyword occurs
Hsin-Hsi Chen 8-10
Indexing

Index size as a percentage of the text size:

                        Small (1MB)    Medium (200MB)   Large (2GB)
Addressing words        45%   73%      36%   64%        35%   63%
Addressing documents    19%   26%      18%   32%        26%   47%
Addressing 64K blocks   27%   41%      18%   32%         5%    9%
Addressing 256 blocks   18%   25%      1.7%  2.4%       0.5%  0.7%

In each cell, the smaller figure assumes stop words are not indexed; the larger assumes all words are indexed.
Notes:
- words: full inversion (all words, exact positions, 4-byte pointers)
- blocks: 2 or 1 byte(s) per pointer, independent of the text size
- documents: document size 10KB; 1, 2, or 3 bytes per pointer, depending on the text size
Hsin-Hsi Chen 8-11
Searching
• Vocabulary search– Identify the words and patterns in the query
– Search them in the vocabulary
• Retrieval of occurrences– Retrieve the lists of occurrences of all the words
• Manipulation of occurrences– Solve phrases, proximity, or Boolean operations
– Find the exact word positions when block addressing is used
Three general steps
Hsin-Hsi Chen 8-12
Structures used in Inverted Files
• Sorted Arrays
  – store the list of keywords in a sorted array
  – use a standard binary search
  – advantage: easy to implement
  – disadvantage: updating the index is expensive
• B-Trees
• Tries
• Hashing Structures
• Combinations of these structures
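The sorted-array structure can be sketched with the running example's vocabulary; the parallel-list layout and function name are illustrative choices:

```python
import bisect

# Sorted-array vocabulary: keywords in one sorted list, with a parallel
# list of postings (word positions from the running example text).
vocabulary = ["letters", "made", "many", "text", "words"]
postings = [[60], [50], [28], [11, 19], [33, 40]]

def lookup(word):
    """Standard binary search over the sorted keyword array."""
    i = bisect.bisect_left(vocabulary, word)
    if i < len(vocabulary) and vocabulary[i] == word:
        return postings[i]
    return []
```

Insertion into such an array means shifting entries, which is why updating this index is expensive.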
Hsin-Hsi Chen 8-13
Trie
1 6 9 11 17 19 24 28 33 40 46 50 55 60
This is a text. A text has many words. Words are made from letters.
letters: 60
text: 11, 19
words: 33, 40
made: 50
many: 28
(Figure: the vocabulary trie for the text, with edges labeled ‘l’; ‘m’→‘a’→{‘d’, ‘n’}; ‘t’; ‘w’.)
Hsin-Hsi Chen 8-14
B-trees
F M
Al Br E Gr H Ja LRut Uni
Afgan 2 … Russian 9 … Ruthenian 1 …
Hsin-Hsi Chen 8-15
Sorted Arrays
1. The input text is parsed into a list of words along with their location in the text. (time- and storage-consuming operation)
2. This list is inverted from a list of terms in location order to a list of terms in alphabetical order.
3. Add term weights, or reorganize or compress the files.
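Steps 1 and 2 can be sketched as a toy inverter (the whitespace tokenizer and function name are assumptions):

```python
# Step 1: parse into (term, location) pairs in location order.
# Step 2: invert by sorting into alphabetical order; Python's stable sort
# keeps locations ascending within each term.
def invert(text):
    word_list = []
    for loc, term in enumerate(text.lower().split(), start=1):
        word_list.append((term, loc))
    word_list.sort()                    # alphabetical; locations stay ordered
    inverted = {}
    for term, loc in word_list:
        inverted.setdefault(term, []).append(loc)
    return inverted

inv = invert("a text has many words and a text has letters")
```

Step 3 (weights, compression) would then post-process the resulting lists.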
Hsin-Hsi Chen 8-16
Inversion of Word List
“report” appears in two records
Hsin-Hsi Chen 8-17
Dictionary and postings file
Idea: the file to be searched should be as short as possible, so split a single file into two pieces.
e.g. data set: 38,304 records, 250,000 unique terms, 88 postings/record
(document #, frequency)
(vocabulary) (occurrences)
Hsin-Hsi Chen 8-18
Producing an Inverted File for Large Data Sets without Sorting
Idea: avoid the use of an explicit sort by using a right-threaded binary tree
current number of term postings &the storage location of postings list
traverse the binary tree and thelinked postings list
Hsin-Hsi Chen 8-19
Indexing Statistics
Final index: only 8% of the input text size for the 50MB database; 14% of the input text size for the larger (2GB) database.
Working storage: not much larger than the size of the final index for the new indexing method.
(Figure: the storage needed to build the index, comparing the methods of pp. 17-18 and p. 20.)
Hsin-Hsi Chen 8-20
A Fast Inversion Algorithm
• Principle 1: large primary memories are available
  If databases can be split into memory loads that can be rapidly processed and then combined, the overall cost will be minimized.
• Principle 2: exploit the inherent order of the input data
  It is very expensive to use polynomial or even n log n sorting algorithms for large files.
Hsin-Hsi Chen 8-21
FAST-INV algorithm
See p. 22.
concept postings/pointers
Hsin-Hsi Chen 8-22
document number
concept number (one concept numberfor each unique word)
Sample document vector
Similar to the document-word list shown in p. 16.
The concept numbers aresorted within documentnumbers, and document numbers aresorted within collection
Hsin-Hsi Chen 8-23
HCN = highest concept number in the dictionary (total number of concepts in the dictionary)
L = number of <concept, document> (or <document, concept>) pairs in the collection
M = available primary memory size, in bytes
M >> HCN, M < L
Choose j parts so that L/j < M; then each part will fit into primary memory, and approximately HCN/j concepts are associated with each part.
Let LL = length of the current load, i.e., the number of concept-weight pairs (8 bytes each)
    S = spread of concept numbers in the current load (4 bytes for each count of postings)
constraint: 8*LL + 4*S < M
Hsin-Hsi Chen 8-24
Preparation
1. Allocate an array, con_entries_cnt, of size HCN.
2. For each <doc#, con#> entry in the document vector file: increment con_entries_cnt[con#].
Running example (cumulative counts, starting from 0):
  (1,2), (1,4)        2
  (2,3)               3
  (3,1), (3,2), (3,5) 6
  (4,2), (4,3)        8
  ...
Hsin-Hsi Chen 8-25
Preparation (continued)
5. For each <con#, count> pair obtained from con_entries_cnt: if there is no room for documents with this concept to fit in the current load, then create an entry in the load table and initialize the next load entry; otherwise update the information for the current load table entry.
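The preparation pass can be roughly sketched under the constraint 8*LL + 4*S < M; the greedy load-cutting policy below is my simplification of the algorithm's bookkeeping, not the exact FAST-INV procedure:

```python
# Sketch: count postings per concept (con_entries_cnt), then greedily cut
# the concept range [1..hcn] into loads satisfying 8*LL + 4*S < M, where
# LL = postings in the load (8 bytes each) and S = spread of concepts (4 bytes each).
def make_load_table(pairs, hcn, M):
    counts = [0] * (hcn + 1)            # con_entries_cnt, 1-based concepts
    for _doc, con in pairs:
        counts[con] += 1
    loads, start, ll, s = [], 1, 0, 0
    for con in range(1, hcn + 1):
        if ll > 0 and 8 * (ll + counts[con]) + 4 * (s + 1) > M:
            loads.append((start, con - 1))   # close the current load
            start, ll, s = con, 0, 0
        ll += counts[con]
        s += 1
    loads.append((start, hcn))
    return loads                        # list of (first concept, last concept) per load

pairs = [(1, 2), (1, 4), (2, 3), (3, 1), (3, 2), (3, 5), (4, 2), (4, 3)]
```

With a large M everything fits in one load; shrinking M splits the concept range into more loads.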
Hsin-Hsi Chen 8-26
: the range of concepts for each primary load
Read each (Doc, Con) pair and look up Con in the load table to determine which load the pair belongs to.
Invert each load file in turn; the offsets in the CONPTR table give the position where each entry should be placed.
How is the load table generated? LL: length of the current load; S: end concept - start concept + 1
space for a concept/weight pair * LL + space per concept to store the count of postings * S < M
Note: S is slightly enlarged (SS) to leave room for appended data.
copy rather than sort
Hsin-Hsi Chen 8-27
PAT Tress and PAT Arrays(Suffix Trees and Suffix Arrays)
Hsin-Hsi Chen 8-28
PAT Trees and PAT Arrays
• Problems of traditional IR models
  – Documents and words are assumed.
  – Keywords must be extracted from the text (indexing).
  – Queries are restricted to keywords.
• New indices for text
  – A text is regarded as a long string.
  – Each position corresponds to a semi-infinite string (sistring).
  – suffix: a string that goes from a text position to the end of the text
  – Each suffix is uniquely identified by its position.
  – no structures and no keywords
Hsin-Hsi Chen 8-29
This is a text. A text has many words. Words are made from letters.
text. A text has many words. Words are made from letters.
text has many words. Words are made from letters.
many words. Words are made from letters.
Words are made from letters.
made from letters.
letters.
Text
Suffixes
Index points are selected from the text, whichpoint to the beginning of the text positions whichare retrievable.
different
Hsin-Hsi Chen 8-30
PATRICIA
• trie
  – branch decision node: search decision-markers
  – element node: real data
  – if branch decisions are made on each bit, a complete binary tree is formed whose depth equals the number of bits of the longest string
  – many element nodes and branch nodes are null
Hsin-Hsi Chen 8-31
PATRICIA (Continued)
• compressed digital search trie
  – the null element nodes and branch nodes are removed
  – an additional field to denote the comparing bit for the branching decision is included in each decision node
  – a match between the searched results and their search keys is required, because only some of the bits are compared during the search process
Hsin-Hsi Chen 8-32
PATRICIA (Continued)
• Practical Algorithm to Retrieve Information Coded in Alphanumeric
  – augmented branch node: an additional field for storing elements is included in the branch node
  – each element is stored in an upper node or in itself
  – an additional root node: note that the number of leaf nodes is always greater than that of internal nodes by one
Hsin-Hsi Chen 8-33
PAT-tree
• PATRICIA + semi-infinite strings– a text T with n basic units u1 u2 … un
– u1 u2 … un …, u2 u3 … un …, u3 u4 … un …, …
– an end to the left but none to the right– store the starting positions of semi-infinite
strings in a text using PATRICIA
Hsin-Hsi Chen 8-34
semi-infinite strings
• Example
  Text        Once upon a time, in a far away land …
  sistring 1  Once upon a time …
  sistring 2  nce upon a time …
  sistring 8  on a time, in a …
  sistring 11 a time, in a far …
  sistring 22 a far away land …
• Compare sistrings
  22 < 11 < 2 < 8 < 1
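Comparing sistrings is just comparing the text from their starting positions onward; a sketch reproducing the ordering above (the comparison is case-insensitive, matching the slide's alphabetical order):

```python
# Sistrings are identified by their 1-based starting position; comparing two
# sistrings compares the text from those positions to the end.
text = "Once upon a time, in a far away land"

def sistring_less(i, j):
    """True if sistring i precedes sistring j (case-insensitive, as on the slide)."""
    return text[i - 1:].lower() < text[j - 1:].lower()

order = sorted([1, 2, 8, 11, 22], key=lambda p: text[p - 1:].lower())
```

No sistring is a prefix of another (they all end at the same text position), so the ordering is always well defined.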
Hsin-Hsi Chen 8-35
PAT Tree
• PAT Tree
  A Patricia tree constructed over all the possible sistrings of a text
• Patricia tree
  – a digital tree where the individual bits of the keys are used to decide on the branching
  – each internal node indicates which bit of the query is used for branching
    • absolute bit position
    • a count of the number of bits to skip
  – each external node is a sistring, i.e., the integer displacement
Hsin-Hsi Chen 8-36
Example
Text       01100100010111 …
sistring 1 01100100010111 …
sistring 2 1100100010111 …
sistring 3 100100010111 …
sistring 4 00100010111 …
sistring 5 0100010111 …
sistring 6 100010111 …
sistring 7 00010111 …
sistring 8 0010111 ...
(Figure: step-by-step insertion of the sistrings into the PAT tree. External nodes hold the sistring's integer displacement; internal nodes hold a skip counter and pointer giving the total displacement of the bit to be inspected.)
Hsin-Hsi Chen 8-37
(Figure: continuing the construction with sistrings 6, 7, and 8. Note: sistrings 3 and 6 need 4 bits to be distinguished. The final tree is then used to search for 00101.)
Hsin-Hsi Chen 8-38
1 6 9 11 17 19 24 28 33 40 46 50 55 60
This is a text. A text has many words. Words are made from letters.
Suffix Trie
(Figure: the suffix trie over the index points 11, 19, 28, 33, 40, 50, 60, and the suffix tree obtained by compacting its unary paths.)
space overhead: 120%~240% over the text size
Text
Hsin-Hsi Chen 8-39
PAT Trees Represented as Arrays
• indirect binary search vs. sequential search
  Keep the external nodes in the bucket in the same relative order as they would be in the tree.
(Figure: the PAT tree of the previous example; reading its external nodes left to right yields the PAT array below.)
7 4 8 5 1 6 3 2PAT array
0 1 1 0 0 1 0 0 0 1 0 1 1 1 ...Text
Hsin-Hsi Chen 8-40
1 6 9 11 17 19 24 28 33 40 46 50 55 60
This is a text. A text has many words. Words are made from letters.
Text
60 50 28 19 11 40 33
(2) Suffix Array
(Figure: (1) the suffix tree, with 120%~240% space overhead, versus (2) the suffix array above, with 40% overhead.)
(3) Supra-Index
60 50 28 19 11 40 33
Suffix Array
lett text word
Hsin-Hsi Chen 8-41
difference between suffix array and inverted list
• suffix array: the occurrences of each word are sorted lexicographically by the text following the word
• inverted list: the occurrences of each word are sorted by text position
1 6 9 11 17 19 24 28 33 40 46 50 55 60
This is a text. A text has many words. Words are made from letters.
60 50 28 19 11 40 33
Suffix Array
60 50 28 11 19 33 40
Inverted list
letters made many text words
VocabularySupra-Index
Hsin-Hsi Chen 8-42
Indexing Points
• The above example assumes that every position in the text is indexed: n external nodes, one for each position in the text.
• word and phrase searches
  only sistrings that are at the beginning of words are necessary
• trade-off between size of the index and search requirements
Hsin-Hsi Chen 8-43
Prefix searching
• idea
  every subtree of the PAT tree has all the sistrings with a given prefix
• search: proportional to the query length
  exhaust the prefix or reach an external node
Search for the prefix“10100” and its answer
Hsin-Hsi Chen 8-44
Searching PAT Trees as Arrays
• Prefix searching and range searching
  do an indirect binary search over the array, with the results of the comparisons being less than, equal, and greater than
• example
  Search for the prefix 100 and its answer.
7 4 8 5 1 6 3 2PAT array
0 1 1 0 0 1 0 0 0 1 0 1 1 1 ...Text
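The indirect binary search over the PAT array can be sketched with Python's bisect; for clarity the suffixes are materialized as strings, whereas a real implementation would compare against the text lazily at each probe:

```python
import bisect

text = "01100100010111"
# PAT array from the slide: sistring start positions (1-based), sorted
# lexicographically by the text they point to.
pat = [7, 4, 8, 5, 1, 6, 3, 2]

def prefix_search(prefix):
    """All sistrings with the given prefix form one contiguous run in the array."""
    keys = [text[p - 1:] for p in pat]
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_right(keys, prefix + "\uffff")  # just past every extension
    return pat[lo:hi]
```

Each probe costs one comparison of at most |prefix| characters, so the search is O(|prefix| log n) overall.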
Hsin-Hsi Chen 8-45
Proximity Searching
• Find all places where s1 is at most a fixed (given by a user) number of characters away from s2.
in 4 ation ==> insulation, international, information
• Algorithm
  1. Search for s1 and s2.
  2. Select the smaller answer set of the two and sort it by position.
  3. Traverse the unsorted answer set, searching every position in the sorted set and checking whether the distance between positions satisfies the proximity condition.
sort + traverse time: (m1+m2) log m1 (assume m1 < m2)
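The three steps can be sketched as follows (the function returns the qualifying positions of the larger set; pairing them up is omitted, and the names are mine):

```python
import bisect

# Proximity sketch: sort the smaller occurrence set, then traverse the larger
# one, binary-searching for a neighbor within distance k.
def proximity(occ1, occ2, k):
    if len(occ1) <= len(occ2):
        small, large = sorted(occ1), occ2
    else:
        small, large = sorted(occ2), occ1
    hits = []
    for p in large:
        i = bisect.bisect_left(small, p - k)      # first candidate >= p - k
        if i < len(small) and small[i] <= p + k:  # within [p - k, p + k]?
            hits.append(p)
    return hits
```

Sorting the smaller set and probing it once per element of the larger set gives the (m1+m2) log m1 cost quoted above.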
Hsin-Hsi Chen 8-46
Range Searching
• Search for all the strings within a certain lexicographical range.
• the range “abc” .. “acc” includes “abracadabra” and “acacia”, but not “abacus” or “acrimonious”
• Algorithm
– Search each end of the defining intervals.
– Collect all the sub-trees between (and including) them.
Hsin-Hsi Chen 8-47
Searching Suffix Array
• find all suffixes S with P1 ≤ S < P2
– Binary search both limiting patterns in the suffix array.
– Find all the elements lying between both positions
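The range search can be sketched on a sorted array, with plain words standing in for the lexicographically sorted suffixes:

```python
import bisect

# Range search P1 <= S < P2 over a lexicographically sorted array:
# binary-search both limiting patterns, then take the slice between them.
words = sorted(["abracadabra", "abacus", "acacia", "acrimonious"])

def range_search(p1, p2):
    lo = bisect.bisect_left(words, p1)
    hi = bisect.bisect_left(words, p2)
    return words[lo:hi]
```

With P1 = "abc" and P2 = "acc" this reproduces the example above: "abracadabra" and "acacia" qualify, "abacus" and "acrimonious" do not.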
Hsin-Hsi Chen 8-48
Longest Repetition Searching
• the match between two different positions of a text where this match is the longest in the entire text, e.g.,
0 1 1 0 0 1 0 0 0 1 0 1 1 1
(Figure: the PAT tree of the example text.)
Text       01100100010111
sistring 1 01100100010111
sistring 2 1100100010111
sistring 3 100100010111
sistring 4 00100010111
sistring 5 0100010111
sistring 6 100010111
sistring 7 00010111
sistring 8 0010111
the tallest internal node gives a pair of sistrings that match for the greatest number of characters
Hsin-Hsi Chen 8-49
“Most Significant” or “Most Frequent” Matching
• the most frequently occurring strings within the text database, e.g., the most frequent trigram
• find the most frequent trigramfind the largest subtree at a distance 3 characters from root
(Figure: the same PAT tree; the largest subtree at a distance of 3 characters from the root gives the most frequent trigram, i.e., the first three bits are the same for sistrings 100100010111 and 100010111.)
Hsin-Hsi Chen 8-50
Building PAT Trees as Patricia Trees
• bucketing of external nodes
  – collect more than one external node per bucket
  – a bucket replaces any subtree with size less than a certain constant b, saving a significant number of internal nodes
  – the external nodes inside a bucket do not have any structure associated with them, which increases the number of comparisons for each search
Hsin-Hsi Chen 8-51
Building PAT Trees as Patricia Trees(Continued)
• mapping the tree onto the disk using super-nodes
  – Allocate as much of the tree as possible in a disk page.
  – Every disk page has a single entry point, contains as much of the tree as possible, and terminates either in external nodes or in pointers to other disk pages.
– The pointers in internal nodes address either a disk page or another node inside the same page
– disk pages contain on the order of 1,000 internal/external nodes
– on the average, each disk page contains about 10 steps of a root-to-leaf path
Hsin-Hsi Chen 8-52
Suffix array construction (in MM)
• The suffix array and the text must be in main memory.
  – The suffix array is the set of pointers, lexicographically sorted.
  – The pointers are collected in ascending text order.
  – The pointers are sorted by the text they point to (accessing the text at random positions).
Hsin-Hsi Chen 8-53
Suffix array construction (in MM)
• Algorithm
  – All the suffixes are bucket-sorted according to the first letter only.
  – At iteration i, the suffixes are already sorted by their first 2^(i-1) letters and end up sorted by their first 2^i letters:
    • to sort the text positions Ta… and Tb… in the suffix array,
    • determine the relative order between the text positions T(a+2^(i-1))… and T(b+2^(i-1))… in the current stage of the sort.
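The doubling algorithm can be sketched as follows. This is an O(n log² n) version that leans on library sorting rather than the radix passes a production version would use:

```python
# Prefix-doubling sketch: after iteration i the suffixes are sorted by their
# first 2**i characters, comparing pairs of ranks from the previous pass
# instead of re-reading the text.
def suffix_array(text):
    n = len(text)
    sa = sorted(range(n), key=lambda p: text[p])        # bucket sort, first letter
    rank = [0] * n
    for i in range(1, n):
        rank[sa[i]] = rank[sa[i - 1]] + (text[sa[i]] != text[sa[i - 1]])
    h = 1
    while h < n:
        # key: (rank of the first h chars, rank of the next h chars)
        key = lambda p: (rank[p], rank[p + h] if p + h < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for i in range(1, n):
            new_rank[sa[i]] = new_rank[sa[i - 1]] + (key(sa[i]) != key(sa[i - 1]))
        rank = new_rank
        h *= 2
    return [p + 1 for p in sa]                          # 1-based, as in the slides
```

Each pass doubles the sorted prefix length, so log n passes suffice.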
Hsin-Hsi Chen 8-54
Construction of Suffix Arraysfor Large Text
• Split the text blocks that can be sorted in MM.– Build the suffix array for the first block
– Build the suffix array for the second block
– Merge both suffix arrays
– Build the suffix array for the third block
– Merge the suffix array with the previous one
– Build the suffix array for the fourth block
– Merge the new suffix array with previous one
– …
Hsin-Hsi Chen 8-55
Merge Step
• How to merge a large suffix array with a small suffix array?
  – Determine how many elements of the large array are to be placed between each pair of elements in the small array:
    • Read the large array sequentially into main memory.
    • Each suffix of that text is searched in the small suffix array.
    • Increment the appropriate counter.
  – Use this information to merge the arrays without accessing the text.
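The essence of the counter-based merge can be sketched with plain strings standing in for suffixes: first compute the counters with binary searches, then merge using the counters alone, never touching the text again:

```python
import bisect

# counters[i] = how many elements of the large array fall before the i-th
# element of the small array (the last slot collects the tail).
def compute_counters(small_keys, large_keys):
    counters = [0] * (len(small_keys) + 1)
    for k in large_keys:                          # sequential pass over the large array
        counters[bisect.bisect_left(small_keys, k)] += 1
    return counters

def merge_by_counters(small_keys, large_keys, counters):
    out, j = [], 0
    for i, s in enumerate(small_keys):
        out.extend(large_keys[j:j + counters[i]]) # large elements preceding small_keys[i]
        j += counters[i]
        out.append(s)
    out.extend(large_keys[j:])
    return out
```

This is what makes the large-text construction feasible: the expensive random text accesses happen only while counting, and the merge itself is a pure sequential pass.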
Hsin-Hsi Chen 8-56
(Figure: (a) the local suffix array for the small text is built; (b) the long text is read and the counters are computed against the small suffix array; (c) the long and small suffix arrays are merged into the final suffix array.)
Hsin-Hsi Chen 8-57
Signature Files
Hsin-Hsi Chen 8-58
Signature Files
• basic idea: inexact filter
  – discard many of the nonqualifying items
  – qualifying items definitely pass the test
  – “false hits” or “false drops” may also pass accidentally
• procedure
  – Documents are stored sequentially in the “text file”.
  – Their signatures (hash-coded bit patterns) are stored in the “signature file”.
  – When a query arrives, scan the signature file, discard nonqualifying documents, and check the rest.
Hsin-Hsi Chen 8-59
Merits of Signature Files
• faster than full text scanning– 1 or 2 orders of magnitude faster
• modest space overhead– 10-15% vs. 50-300% (inversion)
• insertions can be handled more easily than inversion– append only
– no reorganization or rewriting
Hsin-Hsi Chen 8-60
Basic Concepts
• Use superimposed coding to create signatures.
• Each document is divided into logical blocks.
• A block contains D distinct non-common words.
• Each word yields a “word signature”.
• A word signature is an F-bit pattern, with m bits set to 1.
  – Each word is divided into successive, overlapping triplets, e.g. free --> fr, fre, ree, ee
  – Each such triplet is hashed to a bit position.
• The word signatures are OR’ed to form the block signature.
• Block signatures are concatenated to form the document signature.
Hsin-Hsi Chen 8-61
Basic Concepts (Continued)
• Example (D=2, F=12, m=4)
  word   signature
  free   001 000 110 010
  text   000 010 101 001
  block signature 001 010 111 011
• Search
  – Use the hash functions to determine the m 1-bit positions of the query word.
  – Examine each block signature for 1s in the bit positions where the signature of the search word has a 1.
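Superimposed coding can be sketched as below; the triplet hash is a toy of my own, so the resulting bit patterns differ from the slide's example:

```python
F = 12  # signature length in bits

def h(s):
    # toy deterministic hash mapping a triplet to a bit position
    # (an assumption, not the hash behind the slide's signatures)
    v = 0
    for c in s:
        v = (v * 31 + ord(c)) % F
    return v

def word_signature(word):
    sig = 0
    for i in range(max(1, len(word) - 2)):        # successive overlapping triplets
        sig |= 1 << h(word[i:i + 3])
    return sig

def block_signature(words):
    sig = 0
    for w in words:
        sig |= word_signature(w)                  # superimpose: bitwise OR
    return sig

def may_contain(block_sig, word):
    ws = word_signature(word)
    return block_sig & ws == ws                   # all query bits set -> candidate (or false drop)

blk = block_signature(["free", "text"])
```

Every indexed word passes the test by construction; a word that was never indexed can still pass when its bits happen to be covered, which is exactly the false drop.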
Hsin-Hsi Chen 8-62
A Signature File
This is a text. A text has many words. Words are made from letters.Block 1 Block 2 Block 3 Block 4
000101 110101 100100 101101
Text
Text Signature
h(text) =000101h(many) =110000h(words) =100100h(made) =001100h(letters) =100001
Hsin-Hsi Chen 8-63
Basic Concepts (Continued)
• false alarm (false hit, or false drop) Fd: the probability that a block signature seems to qualify, given that the block does not actually qualify.
  Fd = Prob{signature qualifies | block does not}
• Ensure the probability of a false alarm is low enough while keeping the signature file as short as possible.
• For a given value of F, the value of m that minimizes the false drop probability is such that each row of the matrix contains 1s with probability 0.5:
  F ln2 = m D,  i.e.,  m = ln2 * (F/D),  and  Fd = 2^(-m)
  F: signature size in bits
  m: number of bits per word
  D: number of distinct noncommon words per document
  Fd: false drop probability
(N*F binary matrix)
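The formulas above can be evaluated directly, e.g. for the 10%-overhead case (F/D = 8) worked out on the next slide:

```python
import math

# Optimal bits-per-word and resulting false drop probability, from
# F*ln2 = m*D and Fd = 2**-m.
def optimal_m(F, D):
    return math.log(2) * F / D

def false_drop(F, D):
    return 2 ** (-optimal_m(F, D))
```

For F/D = 8 this gives m ≈ 5.545 and Fd ≈ 2%, matching the worked example; doubling F/D to 16 squares the false drop probability down to about 0.046%.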
Hsin-Hsi Chen 8-64
space overhead of index: (1/80)*(F/D), where F is measured in bits and D in words
(on average, a word consists of 10 characters and a character has 8 bits, so D words occupy 80D bits)
10% overhead --> false drop probability close to 2%:
  10% = (1/80)*(F/D)  =>  F/D = 8, m = 8 ln2 = 5.545, Fd = 2^(-5.545) ≈ 2%
20% overhead --> false drop probability close to 0.046%:
  20% = (1/80)*(F/D)  =>  F/D = 16, m = 16 ln2 = 11.09, Fd = 2^(-11.09) ≈ 0.046%
Hsin-Hsi Chen 8-65
Sequential Signature File (SSF)
documents
the size of document signature= the size of block signature=F
assume documents span exactly one logical block
Hsin-Hsi Chen 8-66
Classification of Signature-Based Methods
• Compression
  If the signature matrix is deliberately sparse, it can be compressed.
• Vertical partitioning
  Storing the signature matrix columnwise improves the response time at the expense of insertion time.
• Horizontal partitioning
  Grouping similar signatures together and/or providing an index on the signature matrix may result in better-than-linear search.
Hsin-Hsi Chen 8-67
Classification of Signature-Based Methods
• Sequential storage of the signature matrix
  – without compression: sequential signature files (SSF)
  – with compression: bit-block compression (BC), variable bit-block compression (VBC)
• Vertical partitioning
  – without compression: bit-sliced signature files (BSSF, B’SSF), frame-sliced (FSSF), generalized frame-sliced (GFSSF)
Hsin-Hsi Chen 8-68
Classification of Signature-Based Methods(Continued)
  – with compression: compressed bit slices (CBS), doubly compressed bit slices (DCBS), no-false-drop method (NFD)
• Horizontal partitioning
  – data independent partitioning: Gustafson’s method, partitioned signature files
  – data dependent partitioning: 2-level signature files, S-trees
Hsin-Hsi Chen 8-69
Criteria
• the storage overhead
• the response time on single word queries
• the performance on insertion, as well as whether the insertion maintains the “append-only” property
Hsin-Hsi Chen 8-70
Compression
• idea– Create sparse document signatures on purpose.
– Compress them before storing them sequentially.
• Method– Use B-bit vector, where B is large.
– Hash each word into one (n) bit position(s).
– Use run-length encoding.
Hsin-Hsi Chen 8-71
Compression using run-length encoding
data            0000 0000 0000 0010 0000
base            0000 0001 0000 0000 0000
management      0000 1000 0000 0000 0000
system          0000 0000 0000 0000 1000
block signature 0000 1001 0000 0010 1000
L1 L2 L3 L4 L5: the lengths of the runs of 0s between successive 1s
stored as [L1] [L2] [L3] [L4] [L5], where [x] is the encoded value of x.
search: decode the encoded lengths of all the preceding intervals
example: search “data”: (1) data ==> 0000 0000 0000 0010 0000; (2) decode [L1]=0000, decode [L2]=00, decode [L3]=000000
disadvantage: search becomes slow
Hsin-Hsi Chen 8-72
Bit-block Compression (BC)
Data structure:
(1) The sparse vector is divided into groups of consecutive bits (bit-blocks).
(2) Each bit-block is encoded individually.
Algorithm:
Part I. One bit long; it indicates whether there are any 1s in the bit-block (1) or the bit-block is all 0s (0). In the latter case, the bit-block signature stops here.
  0000 1001 0000 0010 1000  -->  0 1 0 1 1
Part II. It indicates the number s of 1s in the bit-block; it consists of s-1 “1”s and a terminating zero.
  10 0 0
Part III. It contains the offsets of the 1s from the beginning of the bit-block.
  00 11 10 00
  (Note: with b=4, offsets 0, 1, 2, 3 are encoded as 00, 01, 10, 11.)
block signature: 01011 | 10 0 0 | 00 11 10 00
Hsin-Hsi Chen 8-73
Bit-block Compression (BC)(Continued)
Search “data”
(1) data ==> 0000 0000 0000 0010 0000
(2) the 4th bit-block
(3) signature 01011 | 10 0 0 | 00 11 10 00
(4) OK, there is at least one 1 in the 4th bit-block.
(5) Check further: the “0” tells us there is only one 1 in the 4th bit-block. Is it the 3rd bit?
(6) Yes, the “10” confirms the result.
Discussion:
(1) Bit-block compression requires less space than the sequential signature file for the same false drop probability.
(2) The response time of bit-block compression is slightly less than that of the sequential signature file.
Hsin-Hsi Chen 8-74
Vertical Partitioning
• idea
  avoid bringing useless portions of the document signature into main memory
• methods
  – store the signature file in a bit-sliced form or in a frame-sliced form
  – store the signature matrix column-wise to improve the response time at the expense of insertion time
Hsin-Hsi Chen 8-75
Bit-Sliced Signature Files (BSSF)
Transposed bit matrix
transpose
represent
documents
documents(document signature)
Hsin-Hsi Chen 8-76
bit-files
search:
(1) Retrieve the m bit vectors (instead of F bit vectors).
    e.g., the word signature of “free” is 001 000 110 010; a document containing “free” has its 3rd, 7th, 8th, and 11th bits set, so only the 3rd, 7th, 8th, and 11th bit-files are examined.
(2) AND these vectors. The 1s in the resulting N-bit vector denote the qualifying logical blocks (documents).
(3) Retrieve the text file through the pointer file.
insertion: requires F disk accesses for a new logical block (document), one for each bit-file, but no rewriting
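The bit-sliced search can be sketched with each bit-file held as an integer bitmask over N documents; the dict layout is an illustrative stand-in for F on-disk files:

```python
# Bit-sliced search sketch: the signature matrix is stored column-wise, one
# bit-file per signature bit; a query touches only the m files where its
# word signature has 1s, then ANDs them.
def bit_sliced_search(bit_files, query_bits, n_docs):
    """Return the candidate document numbers (1s of the ANDed slices)."""
    result = (1 << n_docs) - 1            # N-bit vector: all documents start as candidates
    for b in query_bits:                  # only m slices are read, not all F
        result &= bit_files[b]
    return [d for d in range(n_docs) if result >> d & 1]
```

With 3 documents whose signatures set bits {0,2}, {1,2}, {0,1} respectively, the bit-files are {0: 0b101, 1: 0b110, 2: 0b011, 3: 0}, and a query on bits [0, 2] narrows the candidates to document 0 alone.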
documents
Hsin-Hsi Chen 8-77
Frame-Sliced Signature File (FSSF)
• Ideas
  – Random disk accesses are more expensive than sequential ones.
  – Force each word to hash into bit positions that are close to each other in the document signature.
  – These bit-files are stored together and can be retrieved with a few random accesses.
• Procedures
  – The document signature (F bits long) is divided into k frames of s consecutive bits each.
  – For each word in the document, one of the k frames is chosen by a hash function.
  – Using another hash function, the word sets m bits in that frame.
Hsin-Hsi Chen 8-78
documents
frames
Each frame will be kept in consecutive disk blocks.
Hsin-Hsi Chen 8-79
FSSF (Continued)
• Example (D=2, F=12, s=6, k=2, m=3)
  word   signature
  free   000000 110010
  text   010110 000000
  doc. signature 010110 110010
• Search
  – Only one frame has to be retrieved for a single-word query, i.e., only one random disk access is required.
    e.g., to search for documents containing the word “free”: the word signature of “free” is placed in the 2nd frame, so only the 2nd frame has to be examined.
  – At most n frames have to be scanned for an n-word query.
• Insertion
  Only k frames have to be accessed instead of F bit-slices.
Hsin-Hsi Chen 8-80
Vertical Partitioning and Compression
• idea– create a very sparse signature matrix– store it in a bit-sliced form– compress each bit slice by storing the position
of the 1s in the slice.
Hsin-Hsi Chen 8-81
Compressed Bit Slices (CBS)
• Room for improvement over the bit-sliced method
  – Searching
    • Each search word requires the retrieval of m bit-files.
    • The search time could be improved if m were forced to be 1.
  – Insertion
    • Requires too many disk accesses (equal to F, which is typically 600-1000).
Hsin-Hsi Chen 8-82
Compressed Bit Slices (CBS)(Continued)
• Let m=1. To maintain the same false drop probability, F (S) has to be increased.
one bit set for each word; only one row has to be read
documents
size of a signature
Hsin-Hsi Chen 8-83
h(“base”)=30
Hash a word to obtain its bucket address.
Obtain the pointers to the relevant documents from the buckets (the document collection).
Synonyms are not distinguished.
representation for a word: take its row of the matrix, squeeze out the 0s and keep only the 1s of that row, i.e., chain the 1-entries together
Hsin-Hsi Chen 8-84
Doubly Compressed Bit Slices
h1(“base”)=30, h2(“base”)=011
Follow the pointers of the posting buckets to retrieve the qualifying documents.
Distinguishes synonyms partially.
Idea: compress the sparse directory. As S shrinks, collisions become more likely, so intermediate buckets are used; to tell true matches from false ones caused by collisions, a second hash function is added.
Hsin-Hsi Chen 8-85
No False Drops Method
Use a pointer to the word in the text file.
Distinguishes synonyms completely (in the previous method, h2 could still produce collisions).
Fixed-length pointers save space.
Hsin-Hsi Chen 8-86
Horizontal Partitioning
documents
1. Goal: group the signatures into sets, partitioning the signature matrix horizontally.
2. Grouping criterion
Hsin-Hsi Chen 8-87
Partitioned Signature Files
• Using a portion of a document signature as a signature key to partition the signature file.
• All signatures with the same key will be grouped into a so-called “module”.
• When a query signature arrives,
  – examine its signature key and look for the corresponding modules
  – scan all the signatures within those modules that have been selected
Hsin-Hsi Chen 8-88
Comparisons
• signature files
  – use hashing techniques to produce an index
  – advantage
    • storage overhead is small (10%-20%)
  – disadvantages
    • the search time on the index is linear
    • some answers may not match the query, so filtering must be done
Hsin-Hsi Chen 8-89
Comparisons (Continued)
• inverted files
  – storage overhead (30% ~ 100%)
  – search time for word searches is logarithmic
• PAT arrays
  – potential use in other kinds of searches
    • phrases
    • regular expression searching
    • approximate string searching
    • longest repetitions
    • most frequent searching