4

Click here to load reader

Heuristics for partial-match retrieval data base design

Embed Size (px)

Citation preview

Page 1: Heuristics for partial-match retrieval data base design

uuy 1976

&wived 23 §ept+rht~ 1975, wAsed version received ‘9 November 1975

Page 2: Heuristics for partial-match retrieval data base design

Volume 4, numb S ATION PROCESSING LETI’ERS February 1976

key takes as ue either 0 of 1 (the con of rest&tion to biruuy keys are discus<

to exacQ 3ne record

an attribute position

I 0 in attribute j and

case, then the 1~~ rcie~ uswlfy correspond to more than one fef; XG. ALg ,re referred to as h.d, 61:s and ilre typically assigned enough records to OCCUPY a sub-

stantial part of a physical record of the secondary

*mw We have assumed in this disccussion that the keys

of the records will be bina~. ‘I’?& is a general assump lion, as any other keys can be encoded in binary. Sometime?, however, ‘Aat is not the proper approach to take. Bentley discusses a trie-like str&ure for non-binary partial match ql:eries in ref. [ 11. Many of the ideas we discuss 111 this paper can be easily ex- tended to non-binary keys.

The stcqe costs of tries is fairly small; this is es- pecially true when buckets are used. For tries stored entirely in main memory, the cmf of a search is usual- ly *he sum of internal and external nodes visited dur- ing the search. When the records are stored in a sec- ondary memory the cost of transfer from secondary to main memory usuaily don&ates the search time and the cost of the search is th(srt&re taken to be the number of external nodes visited. We w2! not always mention which of the particular cost functions we are assum.@ in the following discussion since minimizing one usually minimizes the other.

3. HeuriPtics

in this section we present five heuristics for use in files s!ored as partial-match search tries*.

JV& descriptors

The contents of P subtrie we nlecessarily charac- terized by the path from the root of the tree to the root of thp, subtrie - this is the defnGng property of binary search tries. This characterkzCQ,n is not neces- sarily the most complete possible due to the mterde-

of the keys of the records in the file. For example, in a file containing re;ords of people, if the “over six feet tail”’ bit were on aqd the “husky build” bit were on, then the “under hundred pounds weight” bit would not be on. One approach to using this in-

I33

Page 3: Heuristics for partial-match retrieval data base design

Volw 4, number S Febmuy 1976

32. Ua frie schemes t&en from a relatively small subset of all pcmiiile queries. It is desirable to take the actual queries into

-se there arek: pos&ions in a query, any one xcount as the trie is being constructed. We therefore of which could be either a 0,l olr *, there are 3k pos- propose the following heuristic algorithm to tranp

t&-quIerieir. Given a collection of reC- form 8 set of queries into q skeletal trie into rrtimljtion that any one of these c@e-’ dp recopis of a fUe could be inserted to form a bi-

r+s & +quaBy bkely to be posed,, the optimai trie is nary trfe as de&red in section 2. &fmed to be that which nGn&nizes the,avenge nw A

’ noda to be v&&d &i& p&b@ exception if the record descri&rs&e&ion 3;l are employ sons wi&be ~visitedl if&d only if the is an asterisk it is d#rable to use as tion!: tit have the sn@est number o

In &&ri 3.2 we a&,&&d $bat~ gu&$ were. combined to pr~lduee a

equahy hkel:y drQ be posed. in host “‘&d life” &&i ’ 1 toboththereccrdstruc

ti0ns thrs is not the CdSS: - the queries art3 usually

13

Page 4: Heuristics for partial-match retrieval data base design

INFORMATION PROCESSING ‘UZTFRS February 1976

countam with each nc- thetriecouldbeumd

The anaiy!#w for Gi%A biristic5 are not ClJJTently

avail&ii and nffzi p-_ e mean% prnjecta. ly offer practical approaches es@. Thr? authors would cer-

te learning of other ex

Refere

J.L. Bentley, MultirUmonstona) Nary search trees used for associative sear&w to appear in Commun. ACM WA Burkhard, Pardakmatch queries and file design, to appear En Proc. Conf. Very La.rge Data Bases, 1975. W.A. Burkhsrd, Hashing and trie a@rithms for partial- match rMewd, armted to Commun. ACM. W.A. Rurkhard, Partiahnatch retrieval: performance bounds. UtXD Computer Science Division TR #3, 3975,. D. Comer and R. Sethi, NP~mpleteness of tree struc- tured hrdex minimizat~. Techn. rep. 167, Computer Scknae Dep., The Pennsyiw~nia State 1Jniv. (wlay 1975) 43 pp. P. Dubost and J.-M. Trotrsse, Software implementation of a new mathod of combinatorial hashing. TR# STAN- cS_7SSll, Computer Science Dept., Stanford Univ.,

35 PP* D. H&o and F. Harary, A formal system for informa- tion retrieval from files. Commun. ACM 13 (1970) 67- 73.

(8) R.L Rive& On hash-coding a@orithms for partiaknatch retrieval, Rot. 15th Ann. Symp. Switching and Auto- mata Theory (Octoter 1974) ZG-303.

19) R.L west, Analy&s of associative ic’,‘kval &gorithms. Techn. rcrp. STANCS-74415, Computer Science Dept., Sfauford Univ. (1974) 102 pp.

[ItI] R.L Rkat, Partiahnatch retrieval algorithms, to ap m is SUIM Computing J.

135