Hsin-Hsi Chen 8-1 Chapter 8 Indexing and Searching Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University


Hsin-Hsi Chen 8-1

Chapter 8: Indexing and Searching

Hsin-Hsi Chen

Department of Computer Science and Information Engineering

National Taiwan University


Hsin-Hsi Chen 8-2

Introduction

• Searching
  – Online text searching
    • Scan the text sequentially
  – Indexed searching
    • Build data structures over the text to speed up the search
    • Semi-static collections: updated at reasonably regular intervals
• Indexing techniques
  – inverted files
  – suffix (PAT) arrays
  – signature files


Hsin-Hsi Chen 8-3

Assumptions

• n: the size of text databases
• m: the length of the search patterns (m<n)
• M: the amount of memory available
• n’: the size of texts that are modified (n’<n)
• Experiments
  – 32-bit Sun UltraSparc-1 of 167 MHz with 64 MB of RAM
  – TREC-2 (WSJ, DOE, FR, ZIFF, AP)


Hsin-Hsi Chen 8-4

File Structures for IR

• lexicographical indices
  – indices that are sorted
  – inverted files
  – Patricia (PAT) trees (suffix trees and arrays)
• cluster file structures (see Chapter 7 on document clustering)
• indices based on hashing
  – signature files


Hsin-Hsi Chen 8-5

Inverted Files


Hsin-Hsi Chen 8-6

Inverted Files
• Each document is assigned a list of keywords or attributes.
• Each keyword (attribute) is associated with operational relevance weights.
• An inverted file is the sorted list of keywords (attributes), with each keyword having links to the documents containing that keyword.
• Penalty
  – the size of inverted files ranges from 10% to 100% or more of the size of the text itself
  – need to update the index as the data set changes


Hsin-Hsi Chen 8-7

Text (the numbers give the character position at which each word starts):
1 6 9 11 17 19 24 28 33 40 46 50 55 60
This is a text. A text has many words. Words are made from letters.

Inverted index:
Vocabulary    Occurrences
letters       60 …
made          50 …
many          28 …
text          11, 19 …
words         33, 40 …

Heaps’ law: the vocabulary grows as O(n^β), with β between 0.4 and 0.6.
Vocabulary for 1GB of the TREC-2 collection: 5MB (before stemming and normalization).
Occurrences: the extra space is O(n), about 30% ~ 40% of the text size.

Addressing granularity:
(1) inverted list: word positions or character positions
(2) inverted file: document
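The figure above can be reproduced with a few lines of Python. This is only a sketch: the stop-word list and the function name `build_inverted_index` are assumptions made for illustration, chosen so that the output matches the vocabulary shown on the slide.

```python
import re
from collections import defaultdict

# Assumed stop-word list: the words of the sample text that the slide does not index.
STOPWORDS = {"this", "is", "a", "has", "are", "from"}

def build_inverted_index(text, stopwords=STOPWORDS):
    """Map each indexed word to the 1-based character positions where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word not in stopwords:
            index[word].append(match.start() + 1)
    # Keep the vocabulary in lexicographical order, as in the figure.
    return dict(sorted(index.items()))

TEXT = "This is a text. A text has many words. Words are made from letters."
INDEX = build_inverted_index(TEXT)
for word, positions in INDEX.items():
    print(word, positions)
```

Each occurrence list is in ascending text position because the text is scanned once from left to right.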


Hsin-Hsi Chen 8-8

Block Addressing
• Full inverted indices
  – Point to exact occurrences
• Block addressing
  – Point to the blocks where the word appears
  – Pointers are smaller
  – 5% overhead over the text size

Text, divided into Block 1, Block 2, Block 3, Block 4:
This is a text. A text has many words. Words are made from letters.

Inverted index:
Vocabulary    Occurrences (block numbers)
letters       4 …
made          4 …
many          2 …
text          1, 2 …
words         3 …

block: fixed size blocks, files, documents, Web pages, …
block = retrieval units?
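A minimal sketch of how a full inverted index can be turned into a block-addressing index. The block start positions in `BLOCK_STARTS` are an assumption: the slide does not give the exact boundaries, so they were chosen here to reproduce the block numbers in the figure.

```python
from bisect import bisect_right

def to_block_index(word_index, block_starts):
    """Replace exact positions by the (1-based) numbers of the blocks containing them."""
    block_index = {}
    for word, positions in word_index.items():
        # bisect_right counts how many blocks start at or before the position.
        blocks = sorted({bisect_right(block_starts, p) for p in positions})
        block_index[word] = blocks
    return block_index

# Full inverted index of the sample text (character positions).
WORD_INDEX = {"letters": [60], "made": [50], "many": [28], "text": [11, 19], "words": [33, 40]}
# Assumed starting character position of each of the four blocks.
BLOCK_STARTS = [1, 16, 32, 49]

BLOCK_INDEX = to_block_index(WORD_INDEX, BLOCK_STARTS)
print(BLOCK_INDEX)
```

Note how the two occurrences of "words" collapse into a single block pointer, which is where the space saving comes from.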


Hsin-Hsi Chen 8-9

Sorted array implementation of an inverted file

[Figure: a sorted array of keywords; each entry links to the documents in which the keyword occurs]


Hsin-Hsi Chen 8-10

Index size as a percentage of the text size. Each collection has two columns: one where all words are indexed and one where stop words are not indexed.

Indexing                Small collection (1 MB)   Medium collection (200MB)   Large collection (2GB)
Addressing words        45%    73%                36%    64%                  35%    63%
Addressing documents    19%    26%                18%    32%                  26%    47%
Addressing 64k blocks   27%    41%                18%    32%                  5%     9%
Addressing 256 blocks   18%    25%                1.7%   2.4%                 0.5%   0.7%

Notes:
– Addressing words: full inversion (all words, exact positions, 4-byte pointers)
– Addressing documents: document size (10KB); 1, 2, 3 bytes per pointer, depending on text size
– Addressing blocks: 2 or 1 byte(s) per pointer, independent of the text size


Hsin-Hsi Chen 8-11

Searching

Three general steps:
• Vocabulary search
  – Identify the words and patterns in the query
  – Search them in the vocabulary
• Retrieval of occurrences
  – Retrieve the lists of occurrences of all the words
• Manipulation of occurrences
  – Solve phrases, proximity, or Boolean operations
  – Find the exact word positions when block addressing is used
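The three steps can be sketched on the block-addressing index of the sample text. The names `lookup` and `boolean_and` are hypothetical, and the sketch covers only a Boolean AND, not phrases or proximity.

```python
from bisect import bisect_left

# Sorted vocabulary and, in parallel, the block numbers where each word occurs.
VOCABULARY = ["letters", "made", "many", "text", "words"]
OCCURRENCES = [[4], [4], [2], [1, 2], [3]]

def lookup(word):
    """Steps 1 and 2: binary search in the vocabulary, then fetch the occurrence list."""
    i = bisect_left(VOCABULARY, word)
    if i < len(VOCABULARY) and VOCABULARY[i] == word:
        return OCCURRENCES[i]
    return []

def boolean_and(words):
    """Step 3: intersect the occurrence lists of all query words."""
    lists = [set(lookup(w)) for w in words]
    return sorted(set.intersection(*lists)) if lists else []

print(boolean_and(["text", "many"]))
```

With block addressing the result is only a list of candidate blocks; the exact word positions would still have to be found by scanning those blocks.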


Hsin-Hsi Chen 8-12

Structures used in Inverted Files

• Sorted Arrays
  – store the list of keywords in a sorted array
  – using a standard binary search
  – advantage: easy to implement
  – disadvantage: updating the index is expensive
• B-Trees
• Tries
• Hashing Structures
• Combinations of these structures


Hsin-Hsi Chen 8-13

Trie

Text:
1 6 9 11 17 19 24 28 33 40 46 50 55 60
This is a text. A text has many words. Words are made from letters.

[Figure: vocabulary trie over the text]
‘l’ → letters: 60
‘m’ → ‘a’ → ‘d’ → made: 50
‘m’ → ‘a’ → ‘n’ → many: 28
‘t’ → text: 11, 19
‘w’ → words: 33, 40


Hsin-Hsi Chen 8-14

B-trees

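[Figure: a B-tree over the vocabulary. Root keys: F, M. Second-level keys: Al, Br, E, Gr, H, Ja, L, Rut, Uni. Leaf entries hold terms with counts: Afgan 2 … Russian 9 … Ruthenian 1 …]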


Hsin-Hsi Chen 8-15

Sorted Arrays

1. The input text is parsed into a list of words along with their location in the text. (time and storage consuming operation)
2. This list is inverted from a list of terms in location order to a list of terms in alphabetical order.
3. Add term weights, or reorganize or compress the files.
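Steps 1 and 2 can be illustrated as follows. This is a sketch under simple assumptions: the word list is already parsed, it holds only the indexed words of the sample text, and the function name `invert` is made up for the example.

```python
def invert(word_list):
    """word_list holds (term, location) pairs in location order (the output of step 1)."""
    # Step 2: sorting the pairs puts the terms in alphabetical order,
    # and the locations of one term in ascending order.
    inverted = sorted(word_list)
    postings = {}
    for term, location in inverted:
        postings.setdefault(term, []).append(location)
    return postings

# Step 1 output for the sample text (stop words already dropped).
WORD_LIST = [("text", 11), ("text", 19), ("many", 28), ("words", 33),
             ("words", 40), ("made", 50), ("letters", 60)]
POSTINGS = invert(WORD_LIST)
print(POSTINGS)
```

The explicit sort is the expensive part for large collections.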


Hsin-Hsi Chen 8-16

Inversion of Word List

[Figure: a word list before and after inversion; “report” appears in two records]


Hsin-Hsi Chen 8-17

Dictionary and postings file
Idea: the file to be searched should be as short as possible
– split a single file into two pieces: the dictionary (vocabulary) and the postings file (occurrences)
– postings entries: (document #, frequency)
e.g. data set: 38,304 records, 250,000 unique terms, 88 postings/record


Hsin-Hsi Chen 8-18

Producing an Inverted File for Large Data Sets without Sorting

Idea: avoid the use of an explicit sort by using a right-threaded binary tree
– stored with each term: the current number of term postings & the storage location of the postings list
– traverse the binary tree and the linked postings list


Hsin-Hsi Chen 8-19

Indexing Statistics
Final index: only 8% of the input text size for the 50MB database; 14% of the input text size for the larger database
Working storage (the storage needed to build the index): not much larger than the size of the final index for the new indexing method

[Table not recovered. Surviving labels: columns “p.17&18” and “p.20”; entries “933”, “2GB”, “the same”.]


Hsin-Hsi Chen 8-20

A Fast Inversion Algorithm

• Principle 1: large primary memories are available
  If databases can be split into memory loads that can be rapidly processed and then combined, the overall cost will be minimized.
• Principle 2: the inherent order of the input data
  It is very expensive to use polynomial or even n log n sorting algorithms for large files.


Hsin-Hsi Chen 8-21

FAST-INV algorithm

See p. 22.

concept postings/pointers


Hsin-Hsi Chen 8-22

Sample document vector

[Figure: a document vector file; each row pairs a document number with a concept number (one concept number for each unique word)]

Similar to the document-word list shown in p. 16.
The concept numbers are sorted within document numbers, and document numbers are sorted within the collection.


Hsin-Hsi Chen 8-23

HCN = highest concept number in dictionary (total number of concepts in dictionary)
L = number of concepts/document (documents/concept) pairs in the collection
M = available primary memory size, in bytes

M >> HCN, M < L

Split the collection into j parts with L/j < M, so that each part will fit into primary memory. Approximately HCN/j concepts are associated with each part.

Let
LL = length of current load, i.e. the number of concept-weight pairs (8 bytes for each concept-weight pair)
S = spread of concept numbers in current load (4 bytes for each count of postings)

8*LL + 4*S < M


Hsin-Hsi Chen 8-24

Preparation

1. Allocate an array, con_entries_cnt, of size HCN.
2. For each <doc#, con#> entry in the document vector file: increment con_entries_cnt[con#]

[Figure: <doc#, con#> entries of the document vector file, with the running entry count after each document]
                         0
(1,2), (1,4)             2
(2,3)                    3
(3,1), (3,2), (3,5)      6
(4,2), (4,3)             8
...


Hsin-Hsi Chen 8-25

Preparation (continued)

5. For each <con#,count> pair obtained from con_entries_cnt: if there is no room for documents with this concept to fit in the current load, then create an entry in the load table and initialize the next load entry; otherwise update information for the current load table entry.
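Step 5 can be sketched as a greedy pass over the per-concept counts, using the memory condition 8*LL + 4*S < M from slide 8-23. The function name `plan_loads`, the tuple layout of a load table entry, and the sample counts are assumptions made for illustration; the slides do not show the actual data structures, and steps 3 and 4 are not reproduced here.

```python
def plan_loads(con_entries_cnt, memory_bytes):
    """Greedily group consecutive concept numbers into loads that fit in memory.

    con_entries_cnt[i] is the number of postings of concept i + 1.
    Returns a list of (start_concept, end_concept, LL) entries.
    """
    loads = []
    start, ll = 1, 0
    for con, count in enumerate(con_entries_cnt, start=1):
        spread = con - start + 1                      # S if this concept joins the load
        if ll > 0 and 8 * (ll + count) + 4 * spread >= memory_bytes:
            loads.append((start, con - 1, ll))        # close the current load
            start, ll = con, 0                        # initialize the next load entry
        ll += count                                   # update the current load entry
    loads.append((start, len(con_entries_cnt), ll))
    return loads

# Hypothetical counts for five concepts and a tiny memory budget of 64 bytes.
LOADS = plan_loads([2, 3, 1, 4, 2], memory_bytes=64)
print(LOADS)
```

A concept whose postings alone exceed the budget still gets a load of its own in this sketch; a real implementation would need to handle that case separately.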


Hsin-Hsi Chen 8-26

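The range of concepts for each primary load

– Read in each (Doc, Con) pair and look up Con in the Load table to determine which Load the pair falls into.
– Invert each LoadFile in turn. The Offset in the CONPTR table shows the position where each entry should be written.
– How is the Load table produced?
  LL: length of current load
  S: end concept - start concept + 1
  (space for each concept/weight pair)*LL + (space for each concept to store its count of postings)*S < M
– Note: SS is kept for bookkeeping when data are added.
– copy rather than sort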


Hsin-Hsi Chen 8-27

PAT Trees and PAT Arrays (Suffix Trees and Suffix Arrays)


Hsin-Hsi Chen 8-28

PAT Trees and PAT Arrays

• Problems of traditional IR models
  – Documents and words are assumed.
  – Keywords must be extracted from the text (indexing).
  – Queries are restricted to keywords.
• New indices for text
  – A text is regarded as a long string.
  – Each position corresponds to a semi-infinite string (sistring).
  – suffix: a string that goes from a text position to the end of the text
  – Each suffix is uniquely identified by its position
  – no structures and no keywords


Hsin-Hsi Chen 8-29

Text:
This is a text. A text has many words. Words are made from letters.

Suffixes:
text. A text has many words. Words are made from letters.
text has many words. Words are made from letters.
many words. Words are made from letters.
Words are made from letters.
made from letters.
letters.

Index points are selected from the text; they point to the beginning of the text positions which are retrievable.

different


Hsin-Hsi Chen 8-30

PATRICIA

• trie
  – branch decision node: search decision-markers
  – element node: real data
  – if branch decisions are made on each bit, a complete binary tree is formed where the depth is equal to the number of bits of the longest strings
  – many element nodes and branch nodes are null


Hsin-Hsi Chen 8-31

PATRICIA (Continued)

• compressed digital search trie
  – the null element nodes and branch nodes are removed
  – an additional field to denote the comparing bit for branching decision is included in each decision node
  – a matching between the searched results and their search keys is required because only some of the bits are compared during the search process


Hsin-Hsi Chen 8-32

PATRICIA (Continued)

• Practical Algorithm to Retrieve Information Coded in Alphanumeric
  – augmented branch node: an additional field for storing elements is included in each branch node
  – each element is stored in an upper node or in itself
  – an additional root node: note that the number of leaf nodes is always greater than that of internal nodes by one


Hsin-Hsi Chen 8-33

PAT-tree

• PATRICIA + semi-infinite strings
  – a text T with n basic units $u_1 u_2 \ldots u_n$
  – semi-infinite strings: $u_1 u_2 \ldots u_n \ldots$, $u_2 u_3 \ldots u_n \ldots$, $u_3 u_4 \ldots u_n \ldots$, …
  – an end to the left but none to the right
  – store the starting positions of semi-infinite strings in a text using PATRICIA


Hsin-Hsi Chen 8-34

semi-infinite strings

• Example
  Text          Once upon a time, in a far away land …
  sistring 1    Once upon a time …
  sistring 2    nce upon a time …
  sistring 8    on a time, in a …
  sistring 11   a time, in a far …
  sistring 22   a far away land …
• Compare sistrings
  22 < 11 < 2 < 8 < 1
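The ordering above can be checked with a short sketch. One assumption is needed to reproduce it: the comparison ignores letter case, since with plain ASCII ordering the capital "O" of sistring 1 would sort before every lowercase letter. The text is also cut at "land", where the slide has an ellipsis.

```python
TEXT = "Once upon a time, in a far away land"

def sistring(text, pos):
    """Semi-infinite string starting at the 1-based position pos."""
    return text[pos - 1:]

def sort_sistrings(text, positions):
    # Assumption: case-insensitive lexicographical comparison.
    return sorted(positions, key=lambda p: sistring(text, p).lower())

ORDER = sort_sistrings(TEXT, [1, 2, 8, 11, 22])
print(" < ".join(str(p) for p in ORDER))
```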


Hsin-Hsi Chen 8-35

PAT Tree

• PAT Tree
  A Patricia tree constructed over all the possible sistrings of a text
• Patricia tree
  – a digital tree where the individual bits of the keys are used to decide on the branching
  – each internal node indicates which bit of the query is used for branching
    • absolute bit position
    • a count of the number of bits to skip
  – each external node is a sistring, i.e., the integer displacement


Hsin-Hsi Chen 8-36

Example

Text         01100100010111 …
sistring 1   01100100010111 …
sistring 2   1100100010111 …
sistring 3   100100010111 …
sistring 4   00100010111 …
sistring 5   0100010111 …
sistring 6   100010111 …
sistring 7   00010111 …
sistring 8   0010111 ...

[Figure: the PAT tree built step by step as the first sistrings are inserted; branches are labelled 0 and 1. Legend: external node = sistring (integer displacement); internal node = skip counter & pointer. Annotation: total displacement of the bit to be inspected.]


Hsin-Hsi Chen 8-37

[Figure: the PAT tree as further sistrings, up to sistring 8, are inserted]

Text         01100100010111 …
sistring 1   01100100010111 …
sistring 2   1100100010111 …
sistring 3   100100010111 …
sistring 4   00100010111 …
sistring 5   0100010111 …
sistring 6   100010111 …
sistring 7   00010111 …
sistring 8   0010111 ...

Note: sistrings 3 and 6 need 4 bits to be told apart.

Search 00101


Hsin-Hsi Chen 8-38

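Text:
1 6 9 11 17 19 24 28 33 40 46 50 55 60
This is a text. A text has many words. Words are made from letters.

[Figure: Suffix Trie. Branches ‘l’; ‘m’ ‘a’ then ‘d’ or ‘n’; ‘t’ ‘e’ ‘x’ ‘t’; ‘w’ ‘o’ ‘r’ ‘d’ ‘s’. Leaves: 60, 50, 28, 19, 11, 40, 33.]

[Figure: Suffix Tree. The same leaves (60, 50, 28, 19, 11, 40, 33), with unary paths compressed; internal nodes carry the numbers 1, 3, 5, 6 and branch on ‘l’, ‘m’, ‘d’, ‘n’, ‘t’, ‘w’.]

space overhead: 120%~240% over the text size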


Hsin-Hsi Chen 8-39

PAT Trees Represented as Arrays

• indirect binary search vs. sequential search
  Keep the external nodes in the bucket in the same relative order as they would be in the tree

[Figure: the PAT tree for sistrings 1 to 8; its external nodes, read in order, give the PAT array]

PAT array: 7 4 8 5 1 6 3 2
Text:      0 1 1 0 0 1 0 0 0 1 0 1 1 1 ...
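The PAT array on this slide is simply the list of sistring positions sorted by the text they point to. A small sketch that reproduces it, assuming the text is cut after the 14 bits shown (the slide continues with an ellipsis) and that only the first 8 positions are indexed:

```python
BITS = "01100100010111"

def pat_array(text, n_points):
    """Sort the 1-based positions 1..n_points by the sistring that starts there."""
    return sorted(range(1, n_points + 1), key=lambda p: text[p - 1:])

PAT = pat_array(BITS, 8)
print(PAT)
```

This direct sort is only meant to show what the array contains; it is not an efficient construction for large texts.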


Hsin-Hsi Chen 8-40

Text:
1 6 9 11 17 19 24 28 33 40 46 50 55 60
This is a text. A text has many words. Words are made from letters.

(1) Suffix Tree: 120%~240% overhead
[Figure: suffix tree with leaves 60, 50, 28, 19, 11, 40, 33]

(2) Suffix Array: 40% overhead
60 50 28 19 11 40 33

(3) Supra-Index
lett  text  word
[Figure: the sampled entries lett, text, word point into the suffix array 60 50 28 19 11 40 33]


Hsin-Hsi Chen 8-41

difference between suffix array and inverted list

• suffix array: the occurrences of each word are sorted lexicographically by the text following the word
• inverted list: the occurrences of each word are sorted by text position

Text:
1 6 9 11 17 19 24 28 33 40 46 50 55 60
This is a text. A text has many words. Words are made from letters.

Vocabulary / Supra-Index:  letters  made  many  text  words
Suffix Array:              60 50 28 19 11 40 33
Inverted list:             60 50 28 11 19 33 40


Hsin-Hsi Chen 8-42

Indexing Points

• The above example assumes every position in the text is indexed.
  n external nodes, one for each position in the text
• word and phrase searches
  sistrings that are at the beginning of words are necessary
• trade-off between size of the index and search requirements


Hsin-Hsi Chen 8-43

Prefix searching
• idea
  every subtree of the PAT tree has all the sistrings with a given prefix.
• Search: proportional to the query length
  exhaust the prefix or go down to an external node.

[Figure: search for the prefix “10100” and its answer]


Hsin-Hsi Chen 8-44

Searching PAT Trees as Arrays
• Prefix searching and range searching
  doing an indirect binary search over the array, with the results of the comparisons being less than, equal, and greater than.
• example
  Search for the prefix 100 and its answer.

PAT array: 7 4 8 5 1 6 3 2
Text:      0 1 1 0 0 1 0 0 0 1 0 1 1 1 ...
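A sketch of the indirect binary search for the example on this slide. "Indirect" means each comparison follows a PAT array entry into the text. The function name `prefix_search` is an assumption, and the text is cut after the 14 bits shown.

```python
BITS = "01100100010111"
PAT = [7, 4, 8, 5, 1, 6, 3, 2]   # PAT array of the first 8 sistrings (1-based positions)

def prefix_search(text, pat, prefix):
    """Return the PAT array entries whose sistrings start with prefix."""
    def first_index(strictly_greater):
        # Binary search for the first entry whose leading characters are
        # >= prefix (or > prefix when strictly_greater is True).
        lo, hi = 0, len(pat)
        while lo < hi:
            mid = (lo + hi) // 2
            head = text[pat[mid] - 1:][:len(prefix)]
            if head < prefix or (strictly_greater and head == prefix):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return pat[first_index(False):first_index(True)]

ANSWER = prefix_search(BITS, PAT, "100")
print(ANSWER)
```

All matches form one contiguous run of the array, so two binary searches are enough to delimit the answer.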


Hsin-Hsi Chen 8-45

Proximity Searching

• Find all places where s1 is at most a fixed number of characters (given by the user) away from s2.
  e.g. in 4 ation ==> insulation, international, information
• Algorithm
  1. Search for s1 and s2.
  2. Select the smaller answer set from these two sets and sort it by position.
  3. Traverse the unsorted answer set, searching every position in the sorted set and checking whether the distance between positions satisfies the proximity condition.
• sort + traverse time: (m1+m2) log m1 (assume m1 < m2)
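Steps 2 and 3 can be sketched as below, taking the two answer sets of step 1 as given lists of text positions. The function name `proximity` and the returned pair layout are assumptions made for the example.

```python
from bisect import bisect_left

def proximity(positions1, positions2, max_dist):
    """Return (a, b) pairs with a from the smaller set, b from the larger set, |a - b| <= max_dist."""
    small, large = sorted((positions1, positions2), key=len)
    small = sorted(small)                       # step 2: sort the smaller answer set
    hits = []
    for p in large:                             # step 3: traverse the other set as it comes
        i = bisect_left(small, p - max_dist)    # binary search costs O(log m1)
        while i < len(small) and small[i] <= p + max_dist:
            hits.append((small[i], p))
            i += 1
    return hits

# Occurrences of "text" (11, 19) and of "many" and "letters" (28, 60) in the sample text.
PAIRS = proximity([11, 19], [28, 60], max_dist=10)
print(PAIRS)
```

Sorting only the smaller set is what gives the (m1+m2) log m1 bound quoted on the slide.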


Hsin-Hsi Chen 8-46

Range Searching

• Search for all the strings within a certain lexicographical range.
• Example: the range “abc” .. “acc”
  – in the range (○): “abracadabra”, “acacia”
  – outside the range (X): “abacus”, “acrimonious”
• Algorithm
  – Search each end of the defining intervals.
  – Collect all the sub-trees between (and including) them.


Hsin-Hsi Chen 8-47

Searching Suffix Array

• P1 ≤ S < P2

– Binary search both limiting patterns in the suffix array.

– Find all the elements lying between both positions


Hsin-Hsi Chen 8-48

Longest Repetition Searching

• the match between two different positions of a text where this match is the longest in the entire text, e.g.,

Text: 0 1 1 0 0 1 0 0 0 1 0 1 1 1

[Figure: the PAT tree for sistrings 1 to 8 of this text]

Text         01100100010111
sistring 1   01100100010111
sistring 2   1100100010111
sistring 3   100100010111
sistring 4   00100010111
sistring 5   0100010111
sistring 6   100010111
sistring 7   00010111
sistring 8   0010111

The tallest internal node gives a pair of sistrings that match for the greatest number of characters.


Hsin-Hsi Chen 8-49

“Most Significant” or “Most Frequent” Matching

• the most frequently occurring strings within the text database, e.g., the most frequent trigram

• find the most frequent trigram
  find the largest subtree at a distance of 3 characters from the root

[Figure: the PAT tree for sistrings 1 to 8 of the text 01100100010111]

The tallest internal node gives a pair of sistrings that match for the greatest number of characters,
i.e., bits 1, 2, 3 are the same for sistrings 100100010111 and 100010111.


Hsin-Hsi Chen 8-50

Building PAT Trees as Patricia Trees

• bucketing of external nodes
  – collect more than one external node
  – a bucket replaces any subtree with size less than a certain constraint (b); this saves a significant number of internal nodes
  – the external nodes inside a bucket do not have any structure associated with them; this increases the number of comparisons for each search


Hsin-Hsi Chen 8-51

Building PAT Trees as Patricia Trees(Continued)

• mapping the tree onto the disk using super-nodes
  – Allocate as much as possible of the tree in a disk page
  – Every disk page has a single entry point, contains as much of the tree as possible, and terminates either in external nodes or in pointers to other disk pages
  – The pointers in internal nodes address either a disk page or another node inside the same page
  – disk pages contain on the order of 1,000 internal/external nodes
  – on the average, each disk page contains about 10 steps of a root-to-leaf path


Hsin-Hsi Chen 8-52

Suffix array construction (in MM)

• The suffix array and the text must be in main memory
  – The suffix array is the set of pointers, lexicographically sorted
  – The pointers are collected in ascending text order
  – The pointers are sorted by the text they point to (accessing the text at random positions)


Hsin-Hsi Chen 8-53

Suffix array construction (in MM)

• Algorithm
  – All the suffixes are bucket-sorted according to the first letter only
  – At iteration i, the suffixes begin already sorted by their first $2^{i-1}$ letters and end up sorted by their first $2^i$ letters.
    • Sort the text positions $T_a$ … and $T_b$ … in the suffix array
    • Determine the relative order between text positions $T_{a+2^{i-1}}$ … and $T_{b+2^{i-1}}$ … in the current stage of search
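The doubling idea can be sketched in a few lines. This is a simplified illustration, not the exact algorithm of the slide: it uses Python's comparison sort at every iteration instead of a bucket sort, positions are 0-based, and the function name is made up.

```python
def suffix_array_doubling(text):
    """Build the suffix array (0-based positions) by prefix doubling."""
    n = len(text)
    rank = [ord(c) for c in text]      # order by the first letter only
    sa = list(range(n))
    k = 1                              # suffixes are currently ranked by their first k letters
    while True:
        # The order of suffix i on 2k letters is given by its rank on the first k
        # letters and the rank of suffix i + k on the next k letters.
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for prev, cur in zip(sa, sa[1:]):
            new_rank[cur] = new_rank[prev] + (key(prev) != key(cur))
        rank = new_rank
        if n == 0 or rank[sa[-1]] == n - 1:   # all ranks distinct: fully sorted
            return sa
        k *= 2

SA = suffix_array_doubling("banana")
print(SA)
```

The loop runs at most about log2(n) times, because the number of letters taken into account doubles at each iteration.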


Hsin-Hsi Chen 8-54

Construction of Suffix Arrays for Large Text

• Split the text into blocks that can be sorted in main memory (MM).
– Build the suffix array for the first block

– Build the suffix array for the second block

– Merge both suffix arrays

– Build the suffix array for the third block

– Merge the suffix array with the previous one

– Build the suffix array for the fourth block

– Merge the new suffix array with previous one

– …


Hsin-Hsi Chen 8-55

Merge Step

• How to merge a large suffix array with the small suffix array?
  – Determine how many elements of the large array are to be placed between each pair of elements in the small array
    • Read the large array sequentially into main memory
    • Each suffix of that text is searched in the small suffix array
    • Increment the appropriate counter
  – Use the information to merge the arrays without accessing the text
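A toy version of the merge step, under simplifying assumptions: everything fits in memory, positions are 0-based, the small block is the tail of the text, and the long suffix array is already ordered with respect to the full text. The function names are hypothetical.

```python
def suffix_rank(full, small_sa, suffix):
    """Number of suffixes in small_sa that are lexicographically smaller than suffix."""
    lo, hi = 0, len(small_sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if full[small_sa[mid]:] < suffix:
            lo = mid + 1
        else:
            hi = mid
    return lo

def merge_suffix_arrays(full, long_sa, small_sa):
    # Counter j counts the long-text suffixes that fall just before small_sa[j].
    counters = [0] * (len(small_sa) + 1)
    for p in sorted(long_sa):                          # scan the long text sequentially
        counters[suffix_rank(full, small_sa, full[p:])] += 1
    # The merge itself uses only the counters, not the text.
    merged, taken = [], 0
    for j, count in enumerate(counters):
        merged.extend(long_sa[taken:taken + count])
        taken += count
        if j < len(small_sa):
            merged.append(small_sa[j])
    return merged

FULL = "abracadabra"
SPLIT = 7                                              # long text = FULL[:7], small text = FULL[7:]
LONG_SA = sorted(range(SPLIT), key=lambda i: FULL[i:])
SMALL_SA = sorted(range(SPLIT, len(FULL)), key=lambda i: FULL[i:])
MERGED = merge_suffix_arrays(FULL, LONG_SA, SMALL_SA)
print(MERGED)
```

In a disk-based setting the point of the counters is that the final merge reads both arrays sequentially and never touches the text at random positions.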


Hsin-Hsi Chen 8-56

[Figure: the three stages of the merge]
(a) The local suffix array is built: small text → small suffix array
(b) Counters are computed: long text, small text and small suffix array → counters
(c) The suffix arrays are merged: small suffix array, counters and long suffix array → final suffix array


Hsin-Hsi Chen 8-57

Signature Files


Hsin-Hsi Chen 8-58

Signature Files

• basic idea: inexact filter
  – discard many of the nonqualifying items
  – qualifying items definitely pass the test
  – “false hits” or “false drops” may also pass accidentally
• procedure
  – Documents are stored sequentially in the “text file”.
  – Their signatures (hash-coded bit patterns) are stored in the “signature file”.
  – When a query arrives, scan the signature file, discard nonqualifying documents, and check the rest.


Hsin-Hsi Chen 8-59

Merits of Signature Files

• faster than full text scanning
  – 1 or 2 orders of magnitude faster
• modest space overhead
  – 10-15% vs. 50-300% (inversion)
• insertions can be handled more easily than inversion
  – append only
  – no reorganization or rewriting


Hsin-Hsi Chen 8-60

Basic Concepts
• Use superimposed coding to create the signature.
• Each document is divided into logical blocks.
• A block contains D distinct non-common words.
• Each word yields a “word signature”.
• A word signature is an F-bit pattern with m bits set to 1.
  – Each word is divided into successive, overlapping triplets, e.g. free --> fr, fre, ree, ee
  – Each such triplet is hashed to a bit position.
• The word signatures are OR’ed to form the block signature.
• Block signatures are concatenated to form the document signature.

(Slide annotation: F is written B in the textbook, and m is written l.)


Hsin-Hsi Chen 8-61

Basic Concepts (Continued)

• Example (D=2, F=12, m=4)
  word              signature
  free              001 000 110 010
  text              000 010 101 001
  block signature   001 010 111 011
• Search
  – Use the hash function to determine the m 1-bit positions.
  – Examine each block signature for 1’s in the bit positions where the signature of the search word has a 1.

(Slide annotation: F and m are written B and l in the textbook.)
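The example can be checked with integer bit operations. The word signatures are taken from the slide as given (no hash function is computed here), and the function names are hypothetical.

```python
def block_signature(word_signatures):
    """Superimposed coding: OR the word signatures together."""
    sig = 0
    for w in word_signatures:
        sig |= w
    return sig

def may_contain(block_sig, word_sig):
    """True if every 1 bit of the word signature is also set in the block signature."""
    return (block_sig & word_sig) == word_sig

FREE = 0b001000110010      # signature of "free" from the slide (F=12, m=4)
TEXT = 0b000010101001      # signature of "text" from the slide
BLOCK = block_signature([FREE, TEXT])
print(format(BLOCK, "012b"))
```

`may_contain` can return True for a word that is not in the block; that is the false drop discussed on the following slides.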


Hsin-Hsi Chen 8-62

A Signature File

Text, divided into Block 1, Block 2, Block 3, Block 4:
This is a text. A text has many words. Words are made from letters.

Text signature (one signature per block):
Block 1   Block 2   Block 3   Block 4
000101    110101    100100    101101

Word signatures:
h(text)    = 000101
h(many)    = 110000
h(words)   = 100100
h(made)    = 001100
h(letters) = 100001
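A sketch of searching this signature file, using the hash values from the slide as a lookup table. The assignment of words to blocks in `BLOCKS` is an assumption, chosen so that the block signatures come out as in the figure.

```python
# Word signatures as given on the slide.
H = {"text": 0b000101, "many": 0b110000, "words": 0b100100,
     "made": 0b001100, "letters": 0b100001}
# Assumed indexed words of each block.
BLOCKS = [["text"], ["text", "many"], ["words"], ["made", "letters"]]

def signature(words):
    sig = 0
    for w in words:
        sig |= H[w]
    return sig

SIGNATURES = [signature(block) for block in BLOCKS]

def search(word):
    """Return (candidate blocks from the signature scan, blocks that really contain the word)."""
    candidates = [i + 1 for i, s in enumerate(SIGNATURES) if (s & H[word]) == H[word]]
    hits = [b for b in candidates if word in BLOCKS[b - 1]]
    return candidates, hits

print([format(s, "06b") for s in SIGNATURES])
print(search("letters"))
```

Searching for "letters" selects blocks 2 and 4, but only block 4 contains the word: block 2 is a false drop, which is why the candidates must be checked against the text.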


Hsin-Hsi Chen 8-63

Basic Concepts (Continued)

• false alarm (false hit, or false drop) $F_d$
  the probability that a block signature seems to qualify, given that the block does not actually qualify.
  $F_d = \mathrm{Prob}\{\text{signature qualifies} \mid \text{block does not}\}$
• Ensure the probability of a false alarm is low enough while keeping the signature file as short as possible
• For a given value of F, the value of m that minimizes the false drop probability is such that each row of the matrix (the N*F binary matrix of signatures) contains “1”s with probability 0.5.
  $F_d = 2^{-m}$
  $F \ln 2 = m D$, i.e. $m = \ln 2 \cdot F / D$

  F: signature size in bits
  m: number of bits per word
  D: number of distinct noncommon words per document
  $F_d$: false drop probability


Hsin-Hsi Chen 8-64

space overhead of index: (1/80)*(F/D), where F is measured in bits and D in words. (On average, a word consists of 10 characters and a character has 8 bits, so a word occupies 80 bits.)

10% overhead: false drop probability close to 2%
10% = (1/80)*(F/D), so F/D = 8
m = 8*ln2 = 5.545
$F_d = 2^{-5.545} \approx 2\%$

20% overhead: false drop probability close to 0.046%
20% = (1/80)*(F/D), so F/D = 16
m = 16*ln2 = 11.09
$F_d = 2^{-11.09} \approx 0.046\%$
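The arithmetic on this slide can be checked with a short script. The function name `false_drop_for_overhead` is made up for this illustration; the constants are the slide's own assumptions (10 characters per word, 8 bits per character).

```python
import math

BITS_PER_WORD = 10 * 8  # slide assumption: 10 characters per word, 8 bits each

def false_drop_for_overhead(overhead: float):
    """Return (F/D, optimal m, Fd) for a given space overhead, e.g. 0.10 for 10%."""
    f_over_d = overhead * BITS_PER_WORD   # from overhead = (1/80) * (F/D)
    m = f_over_d * math.log(2)            # m = (F/D) * ln 2
    fd = 2.0 ** (-m)                      # Fd = 2^(-m)
    return f_over_d, m, fd

for overhead in (0.10, 0.20):
    f_over_d, m, fd = false_drop_for_overhead(overhead)
    print(f"overhead={overhead:.0%}  F/D={f_over_d:.0f}  m={m:.3f}  Fd={fd:.4%}")
```

Note that m comes out fractional here; in practice it would have to be rounded to a whole number of bits, which shifts Fd slightly.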


Sequential Signature File (SSF)

[Figure: signature matrix; rows are documents]

• The size of a document signature = the size of a block signature = F.

• Assume each document spans exactly one logical block.


Classification of Signature-Based Methods

• Compression: If the signature matrix is deliberately sparse, it can be compressed.

• Vertical partitioning: Storing the signature matrix column-wise improves the response time at the expense of insertion time.

• Horizontal partitioning: Grouping similar signatures together and/or providing an index on the signature matrix may result in better-than-linear search.


Classification of Signature-Based Methods

• Sequential storage of the signature matrix
– without compression
    sequential signature files (SSF)
– with compression
    bit-block compression (BC)
    variable bit-block compression (VBC)

• Vertical partitioning
– without compression
    bit-sliced signature files (BSSF, B’SSF)
    frame-sliced (FSSF)
    generalized frame-sliced (GFSSF)


Classification of Signature-Based Methods (Continued)

• Vertical partitioning (continued)
– with compression
    compressed bit slices (CBS)
    doubly compressed bit slices (DCBS)
    no-false-drop method (NFD)

• Horizontal partitioning
– data independent partitioning
    Gustafson’s method
    partitioned signature files
– data dependent partitioning
    2-level signature files
    S-trees


Criteria

• the storage overhead

• the response time on single word queries

• the performance on insertion, as well as whether the insertion maintains the “append-only” property


Compression

• Idea
– Create sparse document signatures on purpose.
– Compress them before storing them sequentially.

• Method
– Use a B-bit vector, where B is large.
– Hash each word into one (or n) bit position(s).
– Use run-length encoding.


Compression using run-length encoding

word              signature
data              0000 0000 0000 0010 0000
base              0000 0001 0000 0000 0000
management        0000 1000 0000 0000 0000
system            0000 0000 0000 0000 1000
block signature   0000 1001 0000 0010 1000

The intervals (runs of 0s) in the block signature have lengths L1, L2, L3, L4, L5.

The block signature is stored as [L1] [L2] [L3] [L4] [L5], where [x] is the encoded value of x.

search: Decode the encoded lengths of all the preceding intervals.
example: search “data”
(1) data ==> 0000 0000 0000 0010 0000
(2) decode [L1]=0000, decode [L2]=00, decode [L3]=000000
disadvantage: search becomes slow
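The interval lengths and the left-to-right decoding during search can be sketched as follows. The slide does not say how each length [Li] is encoded, so this sketch simply keeps the lengths as plain integers; the function names are illustrative.

```python
def run_lengths(bits: str) -> list[int]:
    """Lengths of the runs of 0s: one before each 1, plus the trailing run."""
    bits = bits.replace(" ", "")
    runs, current = [], 0
    for b in bits:
        if b == "1":
            runs.append(current)
            current = 0
        else:
            current += 1
    runs.append(current)  # zeros after the last 1
    return runs

def is_bit_set(runs: list[int], position: int) -> bool:
    """Check a 0-based bit position by decoding the intervals from the left."""
    cursor = -1
    for length in runs[:-1]:      # the last run is trailing zeros; no 1 follows it
        cursor += length + 1      # skip `length` zeros and land on the next 1
        if cursor == position:
            return True
        if cursor > position:
            return False
    return False

runs = run_lengths("0000 1001 0000 0010 1000")
print(runs)                  # [4, 2, 6, 1, 3], i.e. L1..L5
print(is_bit_set(runs, 14))  # True: "data" sets bit 14 (0-based), found after decoding L1..L3
```

Every lookup has to walk the intervals from the start of the signature, which is the reason the slide lists slow search as the disadvantage.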


Bit-block Compression (BC)

Data structure:
(1) The sparse vector is divided into groups of consecutive bits (bit-blocks).
(2) Each bit-block is encoded individually.

Algorithm (example vector: 0000 1001 0000 0010 1000, bit-block size b=4):
Part I. It is one bit long, and it indicates whether there are any “1”s in the bit-block (1) or the bit-block is empty (0). In the latter case, the bit-block signature stops here.
    0000 1001 0000 0010 1000  ==>  0 1 0 1 1
Part II. It indicates the number s of “1”s in the bit-block. It consists of s-1 “1”s and a terminating zero.
    ==>  10 0 0
Part III. It contains the offsets of the “1”s from the beginning of the bit-block.
    ==>  00 11 10 00
    Note: with b=4, the offsets 0, 1, 2, 3 are encoded as 00, 01, 10, 11.

block signature: 01011 | 10 0 0 | 00 11 10 00
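A small encoder makes the three parts concrete. This is a sketch that follows the slide's example layout (all Part I bits, then all Part II codes, then all Part III offsets); the function name `bc_encode` and the fixed-width offset encoding for general b are assumptions for illustration.

```python
def bc_encode(bits: str, b: int = 4) -> tuple[str, str, str]:
    """Encode a sparse bit vector with bit-block compression; returns (Part I, II, III)."""
    bits = bits.replace(" ", "")
    offset_width = max(1, (b - 1).bit_length())  # bits needed for an offset 0..b-1
    part1, part2, part3 = [], [], []
    for start in range(0, len(bits), b):
        block = bits[start:start + b]
        offsets = [i for i, bit in enumerate(block) if bit == "1"]
        if not offsets:
            part1.append("0")                    # empty bit-block: nothing else stored
            continue
        part1.append("1")
        part2.append("1" * (len(offsets) - 1) + "0")  # s-1 ones and a terminating zero
        part3.extend(format(o, f"0{offset_width}b") for o in offsets)
    return "".join(part1), "".join(part2), "".join(part3)

print(bc_encode("0000 1001 0000 0010 1000"))
# ('01011', '1000', '00111000'), i.e. 01011 | 10 0 0 | 00 11 10 00
```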


Bit-block Compression (BC) (Continued)

Search “data”
(1) data ==> 0000 0000 0000 0010 0000
(2) The set bit falls in the 4th bit-block.
(3) signature: 01011 | 10 0 0 | 00 11 10 00
(4) Part I says there is at least one bit set in the 4th bit-block.
(5) Check further. In Part II, “0” tells us there is only one bit set in the 4th bit-block. Is it the 3rd bit?
(6) Yes, the offset “10” in Part III confirms the result.

Discussion:
(1) Bit-block compression requires less space than the Sequential Signature File for the same false drop probability.
(2) The response time of bit-block compression is slightly less than that of the Sequential Signature File.


Vertical Partitioning

• Idea
– Avoid bringing useless portions of the document signature into main memory.

• Methods
– Store the signature file in a bit-sliced form or in a frame-sliced form.
– Store the signature matrix column-wise to improve the response time at the expense of insertion time.


Bit-Sliced Signature Files (BSSF)

[Figure: the signature matrix (one row per document signature) is transposed; the transposed bit matrix is stored column-wise, one bit-file per bit position.]


bit-files

search:
(1) Retrieve m bit vectors (instead of F bit vectors).
    e.g., the word signature of “free” is 001 000 110 010. In a document that contains “free”, the 3rd, 7th, 8th, and 11th bits are set, i.e., only the 3rd, 7th, 8th, and 11th bit-files are examined.
(2) “AND” these vectors. The 1s in the resulting N-bit vector denote the qualifying logical blocks (documents).
(3) Retrieve the text file through the pointer file.

insertion: requires F disk accesses for a new logical block (document), one for each bit-file, but no rewriting
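The search steps can be sketched in memory as follows. Each bit-file is modelled as one Python integer whose bit i belongs to document i; that representation, the function names, and the three sample document signatures are assumptions made for this illustration.

```python
def to_bit_slices(signatures: list[str]) -> list[int]:
    """Transpose N signatures of F bits into F slices; bit i of a slice is document i."""
    F = len(signatures[0])
    slices = [0] * F
    for doc_id, sig in enumerate(signatures):
        for pos, bit in enumerate(sig):
            if bit == "1":
                slices[pos] |= 1 << doc_id
    return slices

def search(slices: list[int], word_signature: str, n_docs: int) -> list[int]:
    """AND together only the slices where the word signature has a 1."""
    result = (1 << n_docs) - 1                 # start with every document qualifying
    for pos, bit in enumerate(word_signature):
        if bit == "1":
            result &= slices[pos]              # m slices are read, not all F
    return [d for d in range(n_docs) if (result >> d) & 1]

docs = [
    "001010111011",  # document 0: free OR text (hypothetical contents)
    "000010101001",  # document 1: text only
    "001000110010",  # document 2: free only
]
slices = to_bit_slices(docs)
print(search(slices, "001000110010", len(docs)))  # [0, 2]: candidates for "free"
```

Inserting a document touches all F slices, which mirrors the F disk accesses noted on the slide.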



Frame-Sliced Signature File (FSSF)

• Ideas
– Random disk accesses are more expensive than sequential ones.
– Force each word to hash into bit positions that are close to each other in the document signature.
– These bit files are stored together and can be retrieved with a few random accesses.

• Procedures
– The document signature (F bits long) is divided into k frames of s consecutive bits each.
– For each word in the document, one of the k frames will be chosen by a hash function.
– Using another hash function, the word sets m bits in that frame.


[Figure: signature matrix divided into frames; rows are documents]

Each frame will be kept in consecutive disk blocks.


FSSF (Continued)

• Example (D=2, F=12, s=6, k=2, m=3)

word             signature
free             000000 110010
text             010110 000000
doc. signature   010110 110010

• Search
– Only one frame has to be retrieved for a single-word query, i.e., only one random disk access is required.
  e.g., search for documents that contain the word “free”: because the word signature of “free” is placed in the 2nd frame, only the 2nd frame has to be examined.
– At most n frames have to be scanned for an n-word query.

• Insertion
– Only k frames have to be accessed instead of F bit-slices.
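The two-hash procedure (one hash picks the frame, another sets m bits inside it) can be sketched as below. SHA-256 with different salts stands in for the two hash functions, which is an assumption; the resulting bit patterns therefore differ from the ones in the slide's example.

```python
import hashlib

def _hash(word: str, salt: str) -> int:
    """Hypothetical helper: a salted hash, so that different salts act as different hash functions."""
    return int.from_bytes(hashlib.sha256(f"{salt}:{word}".encode()).digest()[:8], "big")

def fssf_word_signature(word: str, k: int = 2, s: int = 6, m: int = 3) -> tuple[int, str]:
    """Return (frame index, F-bit signature) where F = k*s and all m set bits lie in one frame."""
    if m > s:
        raise ValueError("m cannot exceed the frame size s")
    frame = _hash(word, "frame") % k            # first hash: choose the frame
    positions = set()
    counter = 0
    while len(positions) < m:                   # second hash: m distinct bits in the frame
        positions.add(_hash(word, f"bit{counter}") % s)
        counter += 1
    bits = ["0"] * (k * s)
    for p in positions:
        bits[frame * s + p] = "1"
    return frame, "".join(bits)

def fssf_doc_signature(words, k: int = 2, s: int = 6, m: int = 3) -> str:
    """OR the word signatures of a document."""
    sig = 0
    for w in words:
        sig |= int(fssf_word_signature(w, k, s, m)[1], 2)
    return format(sig, f"0{k * s}b")

frame, sig = fssf_word_signature("free")
print(frame, sig[:6], sig[6:])
```

For a single-word query only the frame returned by the first hash needs to be read, which is what makes one random disk access sufficient.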


Vertical Partitioning and Compression

• Idea
– Create a very sparse signature matrix.
– Store it in a bit-sliced form.
– Compress each bit slice by storing the positions of the 1s in the slice.


Compressed Bit Slices (CBS)

• Room for improvement in the bit-sliced method
– Searching
    • Each search word requires the retrieval of m bit files.
    • The search time could be improved if m were forced to be 1.
– Insertion
    • Requires too many disk accesses (equal to F, which is typically 600-1000).


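Compressed Bit Slices (CBS) (Continued)

• Let m=1. To maintain the same false drop probability, the signature size F (written S for this method) has to be increased.

[Figure: sparse bit matrix; one axis is the documents, the other is the size of a signature. Annotations: one bit is set for each word; only one row has to be read.]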


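[Figure: compressed bit slices. Annotations:]

h(“base”) = 30

Hash a word to obtain the bucket address.

Obtain the pointers to the relevant documents (in the document collection) from the buckets.

Synonyms (different words that hash to the same address) are not distinguished.

Representation for a word: take one row, squeeze out the 0s, and keep only the 1s of that row, i.e., chain the 1 entries together.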
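With m=1 the structure behaves like a hash directory whose buckets hold document pointers. The sketch below models that in memory; `S`, the SHA-256 based `h`, and the sample documents are assumptions for illustration only.

```python
import hashlib
from collections import defaultdict

S = 1024  # hypothetical signature size (number of rows in the sparse matrix)

def h(word: str) -> int:
    """Hash a word to a single row address (m = 1)."""
    return int.from_bytes(hashlib.sha256(word.encode()).digest()[:4], "big") % S

def build_cbs(docs: dict[int, list[str]]) -> dict[int, list[int]]:
    """For each row keep only the positions of its 1s, i.e. a list of document ids."""
    buckets = defaultdict(list)
    for doc_id, words in sorted(docs.items()):
        for address in {h(w) for w in words}:   # one bit per distinct address
            buckets[address].append(doc_id)
    return buckets

def cbs_search(buckets: dict[int, list[int]], word: str) -> list[int]:
    """Read one bucket; words that collide under h cannot be told apart."""
    return buckets.get(h(word), [])

docs = {0: ["data", "base"], 1: ["management", "system"], 2: ["data", "system"]}
buckets = build_cbs(docs)
print(cbs_search(buckets, "data"))
```

The returned list always includes every document that contains the word, but it may also include documents that only contain a colliding word, so the text still has to be verified.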


Doubly Compressed Bit Slices

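[Figure: doubly compressed bit slices. Annotations:]

h1(“base”) = 30, h2(“base”) = 011

Follow the pointers of the posting buckets to retrieve the qualifying documents.

Synonyms are distinguished partially.

Idea: compress the sparse directory. When S becomes smaller, the chance of collisions grows, so intermediate buckets are used. To tell true collisions from false ones, a second hash function is added.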
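A rough in-memory model of the idea: a small directory indexed by h1, with intermediate buckets keyed by a short h2 code that lead to the postings. The sizes `S` and `H2_BITS`, the hash construction, and the nested-dictionary layout are all assumptions for this sketch, not the on-disk layout from the lecture.

```python
import hashlib
from collections import defaultdict

S = 8         # deliberately small directory, so h1 collisions are common
H2_BITS = 3   # length of the second hash code, as in h2("base") = 011

def _digest(word: str) -> bytes:
    return hashlib.sha256(word.encode()).digest()

def h1(word: str) -> int:
    return _digest(word)[0] % S

def h2(word: str) -> int:
    return _digest(word)[1] % (1 << H2_BITS)

def build_dcbs(docs: dict[int, list[str]]):
    """directory[h1][h2] -> posting list of document ids."""
    directory = defaultdict(lambda: defaultdict(list))
    for doc_id, words in sorted(docs.items()):
        for a, code in sorted({(h1(w), h2(w)) for w in words}):
            directory[a][code].append(doc_id)
    return directory

def dcbs_search(directory, word: str) -> list[int]:
    """Words colliding under h1 are separated by h2, unless h2 collides as well."""
    intermediate = directory.get(h1(word))
    if intermediate is None:
        return []
    return intermediate.get(h2(word), [])

docs = {0: ["data", "base"], 1: ["management", "system"], 2: ["data", "system"]}
directory = build_dcbs(docs)
print(dcbs_search(directory, "data"))
```

Two words are confused only when both h1 and h2 agree, which is why synonyms are distinguished only partially.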


No False Drops Method

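[Figure: no-false-drops method. Annotations:]

Use a pointer to the word in the text file.

Synonyms are distinguished completely. (In the previous figure, h2 may still produce collisions.)

The pointer has a fixed length and saves space.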


Horizontal Partitioning

[Figure: signature matrix partitioned horizontally; rows are documents]

1. Goal: group the signatures into sets, partitioning the signature matrix horizontally.
2. Grouping criterion


Partitioned Signature Files

• Using a portion of a document signature as a signature key to partition the signature file.

• All signatures with the same key will be grouped into a so-called “module”.

• When a query signature arrives,
– examine its signature key and look for the corresponding modules
– scan all the signatures within the modules that have been selected
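The module lookup can be sketched as follows. Here the signature key is assumed to be the first `KEY_BITS` bits of a signature, and a module is selected when its key has a 1 wherever the query key has one; both choices are assumptions for illustration, since the slide does not fix how the key is extracted or matched.

```python
from collections import defaultdict

KEY_BITS = 3  # hypothetical: the first 3 bits of a signature form its key

def covers(sig: str, query: str) -> bool:
    """True if sig has a 1 in every position where query has a 1."""
    return all(s == "1" for s, q in zip(sig, query) if q == "1")

def build_modules(signatures: dict[int, str]) -> dict[str, list[tuple[int, str]]]:
    """Group document signatures that share the same key into one module."""
    modules = defaultdict(list)
    for doc_id, sig in signatures.items():
        modules[sig[:KEY_BITS]].append((doc_id, sig))
    return modules

def search(modules, query_sig: str) -> list[int]:
    query_key = query_sig[:KEY_BITS]
    hits = []
    for key, module in modules.items():
        if not covers(key, query_key):
            continue                      # whole module skipped without scanning it
        hits.extend(doc_id for doc_id, sig in module if covers(sig, query_sig))
    return sorted(hits)

signatures = {0: "001010111011", 1: "000010101001", 2: "001000110010"}
modules = build_modules(signatures)
print(search(modules, "001000110010"))  # [0, 2]; the module with key 000 is never scanned
```

The saving comes from the skipped modules; a query whose key is all 0s still forces a scan of every module.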


Comparisons

• signature files
– Use hashing techniques to produce an index
– advantage
    • storage overhead is small (10%-20%)
– disadvantages
    • the search time on the index is linear
    • some answers may not match the query, thus filtering must be done


Comparisons (Continued)

• inverted files
– storage overhead (30% ~ 100%)
– search time for word searches is logarithmic

• PAT arrays
– potential use in other kinds of searches
    • phrases
    • regular expression searching
    • approximate string searching
    • longest repetitions
    • most frequent searching