[ACM Press the sixth international conference - Las Vegas, Nevada, United States (1997.11.10-1997.11.14)] Proceedings of the sixth international conference on Information and knowledge

Matching and Indexing Sequences of Different Lengths *

Tolga Bozkaya, Nasser Yazdani, Meral Özsoyoğlu

Department of Computer Engineering and Science

Case Western Reserve University

Cleveland, OH 44106

{bozkaya, yazdani, ozsoy}@ces.cwru.edu

Abstract

In this paper, we consider the problem of efficient matching and retrieval of sequences of different lengths. Most of the previous research has concentrated on similarity matching and retrieval of sequences of the same length using the Euclidean distance metric. For similarity matching of sequences, we use a modified version of the edit distance function, and consider two sequences matching if a majority of the elements in the sequences match. In the matching process a mapping among non-matching elements is created to check if there are unacceptable deviations among them. This means that two matching sequences should have lengths that are comparable. For efficient retrieval of matching sequences, we propose an indexing scheme which is based entirely on lengths and relative distances between sequences. We use vp-trees as the underlying distance-based index structures in our method.

1 Introduction

The problem of matching sequences with respect to a similarity measure is encountered in a variety of applications such as text and information processing, genetics, time series analysis, scientific databases, etc. It is important for these applications to efficiently identify and retrieve data items (sequences) similar to a given query item. The results of these queries can be used for different purposes such as information retrieval, data mining, classification, etc.

In this paper, we address the general problem of indexing and matching sequences of different lengths while putting more emphasis on numerical sequences from scientific experiments since that was our motivating application. Still, our methods are general and can be applied to other data domains with sequences.

In our work, we use a modified version of the edit distance function to compute the distance between two sequences, and to find the correspondence among the elements of the sequences. To answer similarity match queries, we certainly need an efficient index structure. We cannot use conventional index structures, since we do not assume any geometry on the domain of sequences. Instead, we designed an indexing scheme which is based on the lengths of sequences and relative distances between sequences. A distance-based index structure, the vp-tree [Uhl91], is used as the underlying index structure in our method. The indexing scheme is used as a major filtering mechanism to eliminate distant sequences in processing a similarity match query.

*This research is partially supported by the National Science Foundation grant IRI 92-2%60, and the National Science Foundation FAW award IRI-90-24152

The rest of the paper is organized as follows. In section 2 we provide a brief overview of the previous work on the sequence matching problem. In section 3, we give the motivation for our work, and present the definition for matching sequences. Section 4 contains our methods for computing the similarity between sequences, and the matching process. In section 5, we propose a general method for indexing sequences (of different lengths) for similarity match queries with respect to the matching process of section 4. Section 6 concludes.

2 Related Work

To our knowledge, [AFS93] is the first work which proposes a solution for similarity matching of sequences. In [AFS93], it is assumed that all sequences are of the same length, and each sequence is considered as a point in an N-dimensional space. Then, two sequences are considered similar when the Euclidean distance between them is less than a threshold value ε. The authors use the R*-tree [BKSS90] as the index structure. Sequences are represented as K-dimensional points using K features for each sequence. The Discrete Fourier Transform (DFT) is used for feature extraction since it preserves the Euclidean distance.

Faloutsos et al. extend the method proposed in [AFS93] to locate subsequences that match a query sequence or a subsequence of it [FRM94]. DFT is used for feature extraction, which preserves the Euclidean distance between sequences. The fact that it is a distance-preserving transformation makes DFT attractive for indexing. However, it can be used only for sequences of the same length. Also, it is not very effective for sequences with mostly uncorrelated elements (such as random vectors). Another feature extraction method for sequences is the Eigenvector method [Fuk72]. The Eigenvector method also preserves Euclidean distance; however, it also works only with sequences of the same length. Agrawal et al. [ALSS95] give a method to retrieve similar sequences in the presence of noise, scaling and translation in time series.

In [LT94], a modified version of the edit distance function is used to find the matching text to a hand-written text using pen-stroke data. This approach is similar to ours in the sense that we also use a modified version of the edit distance function to compute similarity between sequences. However, we use different cost functions and we also create a one-to-one mapping between matching and non-matching elements of sequences, making deletions and insertions (via interpolation) if necessary.

A sequence matching method for sequences of different lengths based on dynamic programming is proposed in [YO96]. A modified version of the Longest Common Subsequence (LCS) algorithm [CLR90] was used for actual matching of the elements of two sequences. For filtering, a feature-based indexing mechanism is used, where the length, mean, and variance (first two moments) of each sequence are used as features.

Indexing sequences to efficiently handle similarity matching queries is also an important problem. In terms of indexing, we see that a good portion of the previous work [AFS93, FRM94] concentrates on matching sequences where the similarity between sequences is assumed to be Euclidean. However, note that for many domains and applications (such as text databases and genetic sequences) the Euclidean distance function cannot be used directly as a similarity metric because the domain is non-numeric and sequences may be of different lengths. In these cases, indexing methods for Euclidean spaces are usually not applicable. Still, there are other distance-based index structures [Uhl91, Bri95] that do not assume any geometry of the application domain, but only depend on the fact that the distance function is metric (see section 5 for the definition). The simple idea behind these structures is to use some reference object(s) to partition the search space with respect to the distances of data objects to the reference object(s). At the time of search, some of these partitions are not searched any further depending on their relative distances to the reference point(s) and the distance(s) between the query point and the reference point(s). The vp-tree [Uhl91], a distance-based index structure, is explained in more detail in section 5, since it is used as part of our indexing scheme.

3 Motivation and Problem Definition

In an earlier work [YOO94], we proposed a feature-based indexing method for exact as well as similarity matching of damage zone shapes (areas) in polypropylene that result under high stress at low temperatures [SHB92]. Such queries are important to study the suitability of different materials for many engineering applications. The indexing method extracts some features such as the number of vertices in the polygonal approximation of the shapes and the number of components, and indexes them with respect to these features. Extracted features are represented as a feature vector for each image. [YO96] introduces a method for matching sequences of damage zone shapes in particular, and images from lab experiments in general. This method also uses the idea of feature extraction and representation of images by their feature vectors, where each feature is represented by a numerical value. Therefore, each image sequence can be represented by an N × M matrix where N is the number of features for each image and M is the number of images in the sequence. The extracted features can also be the amount of change of some features like area, width, etc., in successive frames. Using such features would be helpful in studying change patterns by identifying sequences with similar patterns via similarity match queries.

For the feature extraction method to be effective, the following conditions must hold. Let R and S be two image sequences and FR and FS be their feature vectors. Then,

If R and S match ⇒ FR and FS match.
If R is similar to S ⇒ FR is similar to FS.

Using the feature vector representation, the problem of matching image sequences is transformed into the problem of matching sequences of numerical values in an N-dimensional space. The number of images is not guaranteed to be the same for different sequences, even for sequences obtained by repeating one experiment (since sampling rates may be different, and some data elements might be lost in the sampling period). We should be able to compare these sequences since such comparisons are important to verify the results of previous experiments and identify common patterns in these experiments.

In this paper, we use the definition for matching two sequences of different lengths given in [YO96], which uses the same idea of transforming the sequence matching problem to numeric sequence matching in the way discussed above.

Before we generalize and give Definition 3.1 for matching sequences in scientific experiments, we would like to address the following points for motivation.

• The relative times at which the corresponding samples are taken are almost the same in both sequences. This means that the lengths of sequences should be close to each other to be matched.

• The elements of both sequences are taken from the lifetime of the experiment in a rather uniform manner. Two sequences can be considered matching (or similar) if a majority of their elements match.

• In numeric sequences from scientific domains, since the elements are real numbers obtained during the experiment with a limited precision, elements from different sequences should be matched based on proximity. In non-numeric sequences, matching is usually done based on equality.

Definition 3.1 Assume two sequences S and Q have lengths M and N respectively, and s_{i_1}, s_{i_2}, ..., s_{i_K} are matched with q_{j_1}, q_{j_2}, ..., q_{j_K}. The sequences S and Q match each other if

1. min{N, M} ≥ ρ·max{N, M} where ρ ≤ 1. Here, ρ is referred to as the length aspect ratio, and it is preferably close to 1.

2. distance(s_{i_k}, q_{j_k}) ≤ δ for all matching elements, 1 ≤ k ≤ K, where K is the number of matching elements. δ is referred to as the matching distance.

3. K ≥ β·min{N, M} where β ≤ 1. β is referred to as the matching coefficient, and it is also preferably close to 1 for higher selectivity.

4. A one-to-one mapping can be found between all unmatched elements (making insertions or deletions if necessary) of the two sequences such that for all mapped elements s_i and q_j, the relation distance(s_i, q_j) ≤ γ holds (γ ≥ 0). (Most likely, γ will be a function of δ if δ > 0.)

Note that the distance between two elements s_i and q_j can be defined in a different way for each domain. For numeric sequences, distance(s_i, q_j) is simply |s_i - q_j|. This simple distance function cannot be used for every domain, such as multi-dimensional vector sequences or non-numeric sequences. As mentioned before, in most applications, the elements of non-numeric sequences can be matched based on equality (δ = 0). In that case, the distance between any two elements is defined to be 0 if they are equal, and a positive number if they are not.
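The first three conditions of Definition 3.1 can be checked mechanically once a set of matched pairs is known. The following is a minimal sketch, assuming the matched pairs are supplied as 0-based index tuples; the function name `satisfies_conditions` and its parameters are illustrative and not taken from the paper, and condition 4 is deliberately left out.

```python
def satisfies_conditions(s, q, pairs, rho, beta, delta,
                         dist=lambda a, b: abs(a - b)):
    """Check conditions 1-3 of Definition 3.1 for given matched pairs.

    pairs holds 0-based (i, j) tuples meaning s[i] is matched with q[j].
    Condition 4 (mapping among unmatched elements) is not checked here.
    """
    m, n = len(s), len(q)
    k = len(pairs)
    lengths_ok = min(n, m) >= rho * max(n, m)                    # condition 1
    close_ok = all(dist(s[i], q[j]) <= delta for i, j in pairs)  # condition 2
    enough_ok = k >= beta * min(n, m)                            # condition 3
    return lengths_ok and close_ok and enough_ok


# Hypothetical data: two of the three elements of q are matched.
s = [1.0, 2.0, 3.0, 4.0]
q = [1.1, 2.1, 9.0]
print(satisfies_conditions(s, q, [(0, 0), (1, 1)], rho=0.7, beta=0.6, delta=0.2))
```

Raising any one of the three parameters far enough (for example `beta=0.9`) makes the same pair of sequences fail the check, which is how the parameters control selectivity.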

The method proposed in [YO96], which is based on the Longest Common Subsequence problem, does not provide any mapping for the unmatched elements and does not check condition 4 of the definition above. Note that mapping unmatched elements is important for some domains (e.g., scientific experiments) where it is not desirable for matching sequences to have large deviations along the unmatched elements.

Most of the sequence matching methods, in an environment with a large set of data sequences, work in two phases. In the first phase, a finite number of data sequences are filtered out by searching in an index structure. These sequences are hypothesized as matching candidates with the given query sequence. In the second phase, all hypothesized sequences are verified for actual matching. Our method also works in two phases. In section 4, we explain our method for matching two sequences. Indexing and filtering will be discussed in section 5.

4 Matching Process

Since our method uses a modified version of edit distance for approximate text matching, first, we briefly review the edit distance and approximate text matching problem. Next, we discuss our method for similarity matching of numeric sequences of different lengths.

4.1 Edit Distance and Approximate Text Matching

The edit distance between two alphanumeric sequences (referred to as text and pattern) is defined [CR94] as the minimum number of operations that are needed to change a query text into a pattern. Three operations, delete, insert and change, are allowed. Assuming the cost of each of these operations is 1, the edit distance is the minimum number of operations needed to obtain a pattern from a text. A dynamic programming solution for this problem is given in [CR94]. Let D[i, j] denote the minimum edit distance between two texts Q[1...i] and S[1...j]. Then, the minimum distance is defined recursively as follows:

\[
D[i,j] =
\begin{cases}
j & \text{if } i = 0; \\
i & \text{if } j = 0; \\
D[i-1,j-1] & \text{if } i, j > 0 \text{ and } q_i \text{ is equal to } s_j; \\
\min\{D[i-1,j-1]+1,\ D[i-1,j]+1,\ D[i,j-1]+1\} & \text{otherwise.}
\end{cases}
\tag{1}
\]

The values in the last part of this formula correspond to change, delete and insert respectively. Formula (1) also gives an algorithm to find the edit distance between a text Q and a pattern S with lengths N and M in O(MN) time. A particular case of the edit distance algorithm gives the longest common subsequence (LCS) of two sequences [CR94]. Let us assume D[i, j] is the minimal number of deletions and insertions, not changes, necessary to transform a text Q[1...i] into S[1...j]. Evaluation of D[i, j] is equivalent to computation of the length of the LCS between Q[1...i] and S[1...j]. Indeed, this is a restricted version of edit distance where changes are not allowed. However, we will refer to it as the edit distance in this paper.

Assume C[i, j] is the length of the LCS of Q[1...i] and S[1...j]. The following lemma shows the relationship between C[i, j] and D[i, j] for two sequences Q and S with lengths N and M. Due to space limitations, for the proof of the following lemma and the other proofs, please refer to [BYO97].

Lemma 4.1 [CR94] 2C[Q, S] = M + N - D[Q, S] and 2C[i, j] = i + j - D[i, j] for 0 ≤ i ≤ N and 0 ≤ j ≤ M.

This is an interesting result. It indicates that finding the restricted edit distance is a dual problem of finding the LCS; therefore, any solution to it automatically gives the answer to the LCS of Q and S.

4.2 Edit Distance and Sequence Matching

We now explain how the edit distance can be applied to find matching sequences of different lengths.
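The unit-cost recurrence of Formula (1), and the restricted (insert and delete only) variant that Lemma 4.1 relates to the LCS, can be sketched as below. This is an illustrative implementation rather than the authors' code; the names `edit_distance` and `lcs_length` and the `allow_change` flag are assumptions introduced here.

```python
def edit_distance(q, s, allow_change=True):
    """Dynamic program for Formula (1).

    With allow_change=False only deletions and insertions are permitted,
    which gives the restricted edit distance discussed in the text.
    """
    n, m = len(q), len(s)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # delete all of Q[1..i]
    for j in range(m + 1):
        d[0][j] = j                      # insert all of S[1..j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if q[i - 1] == s[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                best = min(d[i - 1][j], d[i][j - 1]) + 1   # delete, insert
                if allow_change:
                    best = min(best, d[i - 1][j - 1] + 1)  # change
                d[i][j] = best
    return d[n][m]


def lcs_length(q, s):
    """Length of the longest common subsequence, computed directly."""
    n, m = len(q), len(s)
    c = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if q[i - 1] == s[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c[n][m]


q, s = "kitten", "sitting"
restricted = edit_distance(q, s, allow_change=False)
# Duality of Lemma 4.1: 2 * C = M + N - D
print(2 * lcs_length(q, s) == len(q) + len(s) - restricted)
```

Both tables have (N + 1)(M + 1) cells, which is where the O(MN) running time comes from.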

1. Instead of checking equality between two elements q_i and s_j from two sequences Q and S (respectively), we check if the elements are within a matching distance from each other. In other words, it is checked whether the relation distance(s_j, q_i) ≤ δ holds.

2. Change is not allowed in computing the edit distance.

3. For numeric sequences, in insertion, the new elements are calculated by interpolation instead of just inserting elements from other sequences. For instance, the value (q_i + q_{i+1})/2 would be inserted between q_i and q_{i+1}.

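Formula 2 below calculates the edit distance between two sequences Q and S.

\[
D[i,j] =
\begin{cases}
j & \text{if } i = 0; \\
i & \text{if } j = 0; \\
D[i-1,j-1] & \text{if } i, j > 0 \text{ and } q_i \text{ matches } s_j; \\
\min\{D[i-1,j]+1,\ D[i,j-1]+1\} & \text{otherwise.}
\end{cases}
\tag{2}
\]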

The mapping between matched and unmatched elements can be found easily from this formula.

Example 4.1 Consider two sequences Q = <2.2, 3.9, 2.9, 1.9, 4.5, 3.2> and S = <3.2, 4.2, 2.1, 3.3, 4.1, 4.3, 3.1>, and let the matching distance be δ = 0.5 [BYO97]. Applying Formula 2 to sequences Q and S gives the alignment shown in Figure 1. As Figure 1 illustrates, some elements of Q are deleted and some new elements are inserted by interpolation. The matching elements from the two sequences are shown by vertical lines. The method extends and projects Q in such a way that both sequences have the same length in the end.

[Figure 1: An alignment for sequences Q and S. Deleted elements and new inserted elements are marked.]

Besides a bad alignment, there is a subtle flaw in this method. It does not use the unmatched elements of the query sequence. Hence, Formula 2 is modified as follows:

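\[
D[i,j] =
\begin{cases}
j & \text{if } i = 0; \\
i & \text{if } j = 0; \\
D[i-1,j-1] & \text{if } i, j > 0 \text{ and } q_i \text{ matches } s_j; \\
\min\{D[i-1,j-1]+2,\ D[i-1,j]+1,\ D[i,j-1]+1\} & \text{otherwise.}
\end{cases}
\tag{3}
\]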

This means changes are allowed; however, weight 2 is given to changes, while weight 1 is given to each deletion and insertion. The following lemma shows that Formula 3 finds the same edit distance as Formula 2 does.

Lemma 4.2 Formulas (3) and (2) are equivalent in the sense that they compute the same edit distance.
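The equivalence stated in Lemma 4.2 can be checked numerically on the sequences of Example 4.1. The sketch below is an assumption-laden illustration: the function name `restricted_edit_distance` and the `weighted_change` flag are introduced here, elements are matched when their absolute difference is at most δ, and interpolation of inserted values is not modelled.

```python
def restricted_edit_distance(q, s, delta, weighted_change=False):
    """Formula 2 (weighted_change=False) or Formula 3 (weighted_change=True).

    Two elements match when they are within the matching distance delta.
    """
    n, m = len(q), len(s)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(q[i - 1] - s[j - 1]) <= delta:
                d[i][j] = d[i - 1][j - 1]
            else:
                best = min(d[i - 1][j] + 1, d[i][j - 1] + 1)
                if weighted_change:
                    # A change costs 2, i.e. one deletion plus one insertion.
                    best = min(best, d[i - 1][j - 1] + 2)
                d[i][j] = best
    return d[n][m]


Q = [2.2, 3.9, 2.9, 1.9, 4.5, 3.2]
S = [3.2, 4.2, 2.1, 3.3, 4.1, 4.3, 3.1]
d2 = restricted_edit_distance(Q, S, 0.5)
d3 = restricted_edit_distance(Q, S, 0.5, weighted_change=True)
print(d2, d3)
```

The two variants differ only in which alignment they record when ties occur, not in the distance they return, because a weight-2 change can never beat a deletion followed by an insertion.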


In computing the values of D[i, j]s in Formula 3, we give the preference from left to right. Therefore, whenever two or three expressions give the same value for the minimum, we choose change or insertion first. Deletion will be the last choice. Nevertheless, in order to apply Formula 3 to the sequence matching problem, the original values are kept in case of changes. Applying Formula 3 to the sequences of Example 4.1 gives the alignment shown in Figure 2. The dashed boxes show the mapping among unmatched elements. As it is shown in Figure 2, it gives a better mapping and a better alignment than the one obtained using Formula 2.

[Figure 2: An alignment obtained by applying Formula 3. The new inserted elements are marked.]

The running time for finding the edit distance between two sequences Q and S by applying Formulas 2 and 3 is O(MN), where N and M are the lengths of Q and S respectively. However, since the definition for matching of two sequences (Definition 3.1) puts a restriction on the number of non-matching elements, the formulas, and consequently the algorithm, can be modified in order to make it more efficient. The following lemma states this fact.

Lemma 4.3 Assume Q and S with lengths N and M are two sequences which match each other with a matching coefficient β, and D[S, Q] denotes the edit distance between S and Q. Then, D[S, Q] ≤ M + N - 2β·min{M, N}.

Lemma 4.3 bounds the value of the edit distance between two matching sequences, and this can be used to expedite the matching process by giving a criterion for exiting the algorithm when two sequences do not match. It also gives a powerful tool to limit the range of operations in finding the edit distance.
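One informal way to see where the bound of Lemma 4.3 comes from (a sketch, not the proof given in [BYO97]) is to count operations in an alignment with K matched pairs: every unmatched element of S and every unmatched element of Q costs one deletion or insertion.

```latex
\begin{align*}
D[S,Q] &\le (M - K) + (N - K) = M + N - 2K, \\
K \ge \beta \min\{M,N\} \;&\Longrightarrow\; D[S,Q] \le M + N - 2\beta \min\{M,N\}.
\end{align*}
```

For instance, with M = 7, N = 6 and β = 0.6, any pair of sequences whose edit distance exceeds 13 - 7.2 = 5.8 can be rejected without completing the alignment.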

Let us assume l_i and r_i are two lines which identify the boundary of the range and are defined as follows:

l_i = max{0, i - N + β·min{M, N}}

r_i = min{M, i + M - β·min{M, N}}

Then the edit distance of two matching sequences Q and S and, consequently, the correspondence between their elements can be found using the recursive formula below. The formula is defined for D[i, j] in the range l_i ≤ j ≤ r_i.

\[
D[i,j] =
\begin{cases}
j & \text{if } i = 0 \text{ and } 0 \le j \le r_i; \\
i & \text{if } j = 0 \text{ and } i \le N - \beta\min\{M,N\}; \\
D[i-1,j-1] & \text{if } i > 0,\ l_i \le j \le r_i \text{ and } q_i \text{ matches } s_j; \\
\min\{D[i-1,j-1]+2,\ D[i,j-1]+1,\ D[i-1,j]+1\} & \text{if } i > 0,\ l_i < j < r_i \text{ and } q_i \text{ does not match } s_j; \\
\min\{D[i-1,j-1]+2,\ D[i-1,j]+1\} & \text{if } i > 0,\ j = l_i > 0; \\
\min\{D[i-1,j-1]+2,\ D[i,j-1]+1\} & \text{if } i > 0,\ j = r_i > 0.
\end{cases}
\tag{4}
\]

This formula computes the modified restricted edit distance function. However, for simplicity, we will call it the modified edit distance.
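The band restriction can be sketched as follows. This is not a literal transcription of Formula 4: as a simplifying assumption, cells outside the band l_i ≤ j ≤ r_i are treated as unreachable (infinite cost), so the boundary cases need no separate branches. The function name `banded_edit_distance` is illustrative.

```python
import math

INF = float("inf")


def banded_edit_distance(q, s, delta, beta):
    """Restricted edit distance computed only inside the band l_i <= j <= r_i.

    Returns the exact distance when the sequences satisfy the matching
    coefficient beta, and a value that is never smaller than the exact
    distance (possibly infinity) otherwise.
    """
    n, m = len(q), len(s)
    k_min = beta * min(n, m)
    prev = []
    for i in range(n + 1):
        lo = max(0, math.ceil(i - n + k_min))     # l_i, rounded inward
        hi = min(m, math.floor(i + m - k_min))    # r_i, rounded inward
        cur = [INF] * (m + 1)
        for j in range(lo, hi + 1):
            if i == 0:
                cur[j] = j
            elif j == 0:
                cur[j] = i
            elif abs(q[i - 1] - s[j - 1]) <= delta:
                cur[j] = prev[j - 1]
            else:
                cur[j] = min(prev[j - 1] + 2,   # change
                             prev[j] + 1,       # deletion
                             cur[j - 1] + 1)    # insertion
        prev = cur
    return prev[m]


Q = [2.2, 3.9, 2.9, 1.9, 4.5, 3.2]
S = [3.2, 4.2, 2.1, 3.3, 4.1, 4.3, 3.1]
print(banded_edit_distance(Q, S, delta=0.5, beta=0.6))
```

With β = 0 the band covers the whole table and the function reduces to the unrestricted recurrence; larger β narrows the band and skips more cells.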

Theorem 4.1 Assume that we have two sequences Q and S with lengths N and M respectively. Without loss of generality, let M ≥ N, and N ≥ ρM so that their lengths satisfy the first condition of Definition 3.1. Let D[Q, S] and ED[Q, S] denote the modified edit distance (Formula 4) and the edit distance (Formula 3) respectively between Q and S.

1. If the sequences S and Q match each other with matching coefficient β, then D[Q, S] = ED[Q, S].

2. If the sequences S and Q do not match each other, then D[Q, S] ≥ ED[Q, S].

Before analyzing the running time of the algorithm for the edit distance, we mention the following lemma from [YO96], where ρ is the length aspect ratio (LAR) for sequences Q and S.

Lemma 4.4 Assume two sequences, Q and S with lengths N and M, respectively, are matching. Then, ρN ≤ M ≤ N/ρ.

Theorem 4.2 The edit distance of two sequences Q and S with lengths N and M, length aspect ratio ρ and matching coefficient β can be found in O(MN[1 - ρβ²]) in the worst case. Furthermore, the correspondence between the elements of S and Q can be constructed in O(M + N) from the table used to find the edit distance.

The values of ρ and β determine the running time of the algorithm. For small values of ρ and β, the running time approaches O(MN). If both ρ and β approach 1, the running time approaches O(N), which is expected since the problem becomes the whole matching of two sequences with the same length.
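An informal area argument (a sketch under the assumption M ≥ N, not the proof from [BYO97]) suggests where the factor in Theorem 4.2 comes from. The band l_i ≤ j ≤ r_i removes two triangular corners from the (N + 1) × (M + 1) table, each with legs of length about β·min{M, N} = βN:

```latex
\begin{align*}
\text{cells in the band} &\approx MN - 2 \cdot \tfrac{1}{2}(\beta N)^2
  = MN\left(1 - \beta^2 \tfrac{N}{M}\right) \\
&\le MN\,(1 - \rho\beta^2) \qquad \text{since } N/M \ge \rho .
\end{align*}
```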

5 Indexing and Filtering for Similarity Search Queries on Sequences

Applying a sequence matching algorithm to a large set of sequences in a sequential manner is very time consuming. The main challenge is to efficiently hypothesize a small set of sequences as matching candidates for a given query sequence. Here, we propose an indexing scheme for efficient processing of similarity search queries when the restricted edit distance between the sequences is used as the similarity measure. We will present this indexing scheme as a general solution for the problem of matching sequences of different lengths. The matching of elements is based on equality, that is, the distance between any two sequence elements will be zero if they are equal, and a positive number if they are not (δ = 0). Note that edit distance is not metric if elements of sequences are matched based on proximity with a non-zero δ (see Example 5.1). A metric distance function is required for an index structure to be used. Also, note the difference between the distance between sequence elements and the distance between sequences. Below, the distance function refers to the distance function for the sequences if not specified otherwise.

A distance function d(x, y) is metric if it satisfies the following simple conditions:

(i) d(x, y) = d(y, x)
(ii) 0 < d(x, y) for x ≠ y
(iii) d(x, x) = 0
(iv) d(x, y) ≤ d(x, z) + d(y, z) (triangle inequality)

The edit distance function, when elements are matched with respect to their equality (δ = 0), is a metric distance function. The following example shows that edit distance is not metric if elements are matched based on proximity for a non-zero δ.

Example 5.1 Consider three sequences of length 2, x: <0.5, 0.5>, y: <0.55, 0.55>, z: <0.6, 0.6>, and δ = 0.09. Note that although y matches both z and x, x and z do not match at all, causing the violation of the triangle inequality (d(x, z) > d(x, y) + d(y, z)).
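The violation in Example 5.1 can be reproduced with a few lines of code. This sketch assumes the example's values read x = <0.5, 0.5>, y = <0.55, 0.55>, z = <0.6, 0.6> with δ = 0.09, and uses an illustrative helper `proximity_edit_distance` (insert/delete only, elements matching when within δ).

```python
def proximity_edit_distance(a, b, delta):
    """Insert/delete-only edit distance with proximity-based matching."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            if abs(x - y) <= delta:
                cur.append(prev[j - 1])
            else:
                cur.append(1 + min(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


x, y, z = [0.5, 0.5], [0.55, 0.55], [0.6, 0.6]
delta = 0.09
d_xy = proximity_edit_distance(x, y, delta)
d_yz = proximity_edit_distance(y, z, delta)
d_xz = proximity_edit_distance(x, z, delta)
# The triangle inequality would require d_xz <= d_xy + d_yz.
print(d_xy, d_yz, d_xz, d_xz <= d_xy + d_yz)
```

Because proximity matching is not transitive, y acts as a zero-cost bridge between two sequences that are far apart from each other, which is exactly what a metric forbids.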

We have previously proposed [YO96] a technique for indexing sequences based on some of their features, namely, lengths and the first two moments (mean and variance). Each data sequence is represented by a point in the feature space. Any multi-dimensional point access method can be used as the index structure. Similarity match queries are transformed into range queries in the feature space.

However, this method does not guarantee that the search process will not miss any probable matching sequence in the similarity matching process. This problem (referred to as the false dismissal problem) originates from the fact that estimated values for mean and variance may have errors. Therefore, the search bounds which are computed based on these parameters may contain errors that may cause false dismissal of some matching sequences.

Assuming the sequence elements are matched based on equality, we present here an indexing method which guarantees that there will not be any false dismissals. A framework and an analysis for indexing for similarity search when sequence elements are matched approximately can be found in [BYO97]. Our indexing structure is constructed as follows. First, we group the data sequences with respect to their lengths. Each group accommodates sequences that are lengthwise close to each other. Second, the data sequences in each group are indexed by a vp-tree [Uhl91]. Having constructed the index structure, a similarity search proceeds in two phases. First, we identify the groups that accommodate data sequences that may match the query sequence. This is done based on the length of the query sequence. Then, in the second phase, the vp-trees for these identified groups are searched to filter out the data sequences that are distant from the query sequence, and to find the ones that are within matching distance to the query sequence.

In the following subsections, we explain how grouping and indexing is done.

5.1 Grouping Data Sequences by Length

For a similarity search query, the simplest way to filter out many distant sequences is to discard all the data sequences whose lengths do not satisfy the first condition of Definition 3.1 with respect to a given LAR ρ. For this purpose, we classify the sequences into classes c_1, c_2, ..., c_m with respect to their lengths. Then, all sequences in any given class c_j have the same length, which we refer to as l(c_j). Next, we can index the sequences in each class separately, ending up with an index structure for each class. We do not favor this approach for several reasons. First, there will be too many different indices and many of them will be visited during a similarity search query. Consider a query sequence of length N. We would have to search through the indices of all the classes c_j where ρN ≤ l(c_j) ≤ N/ρ (Lemma 4.4). Although the main objective is to minimize the number of distance computations, searching through many shallow index structures would increase the I/O considerably.

We propose to group these classes with respect to the length of the sequences they have, and the LAR ρ. By grouping a number of these classes and building the indices on these groups we hope to increase the efficiency. The classes c_1, ..., c_m are further grouped into sets g_1, ..., g_n (n << m) using the following procedure.

Grouping classes by lengths:
input: classes c_1, ..., c_m
output: groups g_1, ..., g_n

1. Let i = 1, j = 1.

2. c_i will be in g_j. Let c_k be the class such that l(c_k) is the smallest (among other classes) that is greater than or equal to l(c_i)/ρ (i.e., l(c_k) ≥ ⌈l(c_i)/ρ⌉). For all t, k ≥ t > i, c_t will also be in g_j. We will refer to the l(c_i) and l(c_k) values as min(g_j) and max(g_j) respectively.

3. i = k + 1; j = j + 1.

4. If i ≤ m, go to step 2.

Example 5.2 Assume that we have c_1, ..., c_35 where l(c_i) = 89 + i for i = 1, ..., 35. Let ρ be equal to 0.9. Since l(c_1)/ρ = 90/ρ = 100 = l(c_11), c_1, ..., c_11 are put into g_1. c_12, ..., c_24 go into g_2 since ⌈(l(c_12) = 101)/ρ⌉ = 113 = l(c_24). The last group g_3 takes c_25, ..., c_35.
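The grouping procedure and Example 5.2 can be sketched in code. The following is an illustration under stated assumptions: classes are represented only by their lengths, given in increasing order; ρ is passed as an exact fraction to avoid floating-point rounding in the ceiling; and the names `group_classes` and `groups_to_search` are introduced here, not taken from the paper. When no class is long enough to close a group, the remaining classes all go into the last group, as in the example.

```python
from fractions import Fraction
from math import ceil


def group_classes(lengths, rho):
    """Partition sorted class lengths l(c_1) < ... < l(c_m) into groups."""
    groups, i, m = [], 0, len(lengths)
    while i < m:
        target = ceil(Fraction(lengths[i]) / rho)   # ceil(l(c_i) / rho)
        k = i
        # Advance to the smallest class whose length reaches the target,
        # or to the last class if none does.
        while k < m - 1 and lengths[k] < target:
            k += 1
        groups.append(lengths[i:k + 1])
        i = k + 1
    return groups


def groups_to_search(groups, n, rho):
    """Groups whose length range overlaps [rho * n, n / rho] (Lemma 4.4)."""
    lo, hi = rho * n, Fraction(n) / rho
    return [g for g in groups if g[0] <= hi and g[-1] >= lo]


rho = Fraction(9, 10)
lengths = [89 + i for i in range(1, 36)]            # Example 5.2
groups = group_classes(lengths, rho)
print([(g[0], g[-1]) for g in groups])
print(len(groups_to_search(groups, 110, rho)))
```

Each group records only its smallest and largest class length, so deciding which groups a query must visit is a constant-time interval overlap test per group.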

Note that, because of the way we do the grouping, all sequences in a given group g_j have comparable lengths with respect to each other, as the length aspect ratio of any two sequences in g_j is roughly ρ or higher.

Each of these groups will be indexed separately by a vp-tree. Our objective in forming the groups g_1, ..., g_n was to limit the number of index structures visited during a similarity search query. The following theorem formally states the upper bound for the number of groups to be searched for a similarity search query. Its proof follows from Lemma 4.4.

Theorem 5.1 There will be at most 3 groups visited for a similarity search query.

We made the implicit assumption that all the queries are specified with respect to the same length aspect ratio ρ, the value which we used for grouping. Note that it is also possible to specify queries using different values for ρ. This certainly affects the number of groups searched in the query, and we may no longer be able to guarantee that at most 3 groups will be visited. The following theorem elaborates on these bounds.

Theorem 5.2 Let ρ be the LAR for the application on which the grouping is based and ρ' be the LAR specified for the query.

1. If ρ' ≥ ρ^{1/2}, then there will be at most 2 groups visited for the query.

2. If ρ^{i+1} ≤ ρ' ≤ ρ^i, then the maximum number of groups to be visited is 2i + 3.

5.2 Indexing the Groups by VP-trees

We use vp-trees (vantage point trees) [Uhl91] as the index structure for the groups of sequences. The vantage point tree is a balanced distance-based index structure that can be used in any metric data space. At each level of the tree, a vantage point is picked among the data points that will be indexed below that level. After that, the distances between that vantage point and the other points are computed, and the points are sorted with respect to their distances from the vantage point. This sorted list is divided into m groups of equal cardinality, where m is the order of the tree. Each of these subgroups of points is indexed at the next level in the same way. The vp-tree does not make any assumption about the geometry of the data space, but only assumes that the distance function used is metric. That is why we made the assumption that the sequence elements could be matched only if they are equal.

The similarity search algorithm for vp-trees is also simple, and is based only on filtering out branches that index distant points using the triangle inequality. The search proceeds as follows. Assume we have a query item Q and we want all data items that are within distance r of Q. We start from the root of the vp-tree. The vantage point of the root node (vp_root) partitions the data space into spherical cuts, where each branch below that node (root) accommodates the points that fall into one of these spherical cuts. These branches are searched with respect to the inner and outer radii of the spherical cut they correspond to. So, if a branch has the inner radius R_i and the outer radius R_o, it is searched only if:

dist(vp,,,t, Q) + T 2 & and dis~(w,,t, Q) - T I R,

otherwise the branch is not searched. Tbe search proceeds the same way continuing from the root of the branches that qualify.
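The pruning rule can be sketched on a small tree. The code below is an illustration under several assumptions: the unit-cost Levenshtein distance stands in for the paper's edit distance, every point eventually becomes a vantage point (there are no leaf buckets), the vantage point is the first element rather than a random pick, and the names `build` and `range_search` are hypothetical. A branch is visited only if the distance from the query to the vantage point, plus r, reaches its inner radius and, minus r, does not exceed its outer radius.

```python
def edit_distance(a, b):
    """Unit-cost Levenshtein distance by dynamic programming (a metric)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def build(points, m, dist):
    """Build a vp-tree of order m as nested (vp, [(inner, outer, child)])."""
    if not points:
        return None
    vp, rest = points[0], points[1:]  # assumption: first point as vantage point
    ranked = sorted(((dist(vp, x), x) for x in rest), key=lambda t: t[0])
    size = max(1, -(-len(ranked) // m))
    branches = []
    for s in range(0, len(ranked), size):
        chunk = ranked[s:s + size]
        child = build([x for _, x in chunk], m, dist)
        branches.append((chunk[0][0], chunk[-1][0], child))
    return (vp, branches)


def range_search(node, q, r, dist, out=None):
    """Collect all indexed points within distance r of q."""
    if out is None:
        out = []
    if node is None:
        return out
    vp, branches = node
    d = dist(vp, q)
    if d <= r:
        out.append(vp)
    for inner, outer, child in branches:
        # Triangle inequality: skip cuts that cannot contain a point within r.
        if d + r >= inner and d - r <= outer:
            range_search(child, q, r, dist, out)
    return out


data = [(1, 2, 3, 4), (1, 2, 3), (1, 2, 4, 4), (7, 8, 9), (7, 8, 9, 9),
        (1, 3, 3, 4), (5, 5, 5, 5), (1, 2, 3, 4, 5), (7, 8)]
query = (1, 2, 3, 4)
tree = build(data, 2, edit_distance)
hits = range_search(tree, query, 1, edit_distance)
```

Because the filter only discards branches that the triangle inequality rules out, the result equals that of a sequential scan with the same distance function.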

The edit distance function is used when constructing the vp-trees since it is a metric distance function (with the assumption we made). At search time, the distances between the query point and the vantage points are calculated using the edit distance function to direct the search properly. However, the distances between the query point and the data points (sequences) in the leaves (the points that are not vantage points) are calculated using the modified edit distance function. Note that the modified edit distance function overestimates the actual edit distance function, as was shown in Theorem 4.1. Part 1 of Theorem 4.1 guarantees that we do not dismiss any of the matching sequences, since the modified edit distance function computes the exact distance for such sequences. The overestimation of the actual distances for the other sequences does not hurt, as we would dismiss them in the end anyway since they do not match. In conclusion, using the modified edit distance function in the verification phase does not violate our correctness guarantee (i.e., no false dismissals) while providing a faster way of identifying matching sequences. Note that we cannot use the modified edit distance for construction of the vp-trees and throughout the full search process since it is not metric (it may overestimate). We only use the modified edit distance function to check the data points in the leaves to see if they match the query sequence. Also note that the vantage points that match the query point with respect to the specified similarity measure are reported early in the search, since the exact edit distance between those vantage points and the query point has already been calculated.

The vp-tree is a static index structure, i.e., it is built on a static set of data items. Insertions are possible, but they can be handled only at the expense of violating the balance of the vp-tree. Therefore, we will assume that we are given a fixed set of data sequences that won't change (or change very little) throughout the application. This means that once the groups $g_1, \ldots, g_n$ are formed, there will not be any insertions to or deletions from them.

The process to answer a similarity match query is rather simple. First, we find all the groups that may have data sequences which could possibly match the query sequence. This is strictly done based on the value of p (LAR) and the length of Q. Next, we search the vp-trees for these groups to find data sequences that may match Q. When searching the vp-trees, the search tree is pruned with respect to a threshold value. The search looks for data sequences whose (edit) distances from the query sequence are less than or equal to that threshold. The following lemma states the value for that threshold for a given query sequence.

Lemma 5.1 For any given query sequence Q with length N, the distance between Q and any matching sequence is less than or equal to $N(1 + 1/p - 2pP)$.

Note that the value given above for the threshold is an upper bound which is simply used to direct the search. A data sequence S actually matches a query sequence Q if and only if the edit distance between them satisfies the condition given in Lemma 4.3.

We can actually come up with a better (smaller) threshold value for searching the vp-tree of each group since we know the minimum and the maximum lengths of the data sequences accommodated in each group. The following theorem presents these better threshold values in searching for matching sequences in a specific group.

Theorem 5.3 For any given query sequence Q with length N, the distance between Q and any matching sequence in a group $g_i$ is less than or equal to

$$
\begin{cases}
N + \max(g_i) - 2NpP, & \text{if } \max(g_i) < N \\
N + N/p - 2NP, & \text{if } \min(g_i) \ge N \\
N + \max(g_i) - 2\min(g_i)P, & \text{if } \min(g_i) < N < \max(g_i)
\end{cases}
$$
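The thresholds of Lemma 5.1 and Theorem 5.3 are plain arithmetic, sketched below for illustration. Several things here are assumptions rather than statements from the text: the formulas are read as N(1 + 1/p - 2pP) for the lemma and as the three cases N + max(g) - 2NpP, N + N/p - 2NP, and N + max(g) - 2 min(g) P for the theorem; the second parameter is written `P` because its definition belongs to the matching condition of Lemma 4.3 and is not restated here; the function names and the numeric values are hypothetical.

```python
def lemma_threshold(n: float, p: float, P: float) -> float:
    """Group-independent search threshold, read as N(1 + 1/p - 2pP)."""
    return n * (1 + 1 / p - 2 * p * P)


def group_threshold(n: float, g_min: float, g_max: float,
                    p: float, P: float) -> float:
    """Per-group search threshold following the three cases as read above.

    g_min and g_max are the minimum and maximum sequence lengths in the group.
    """
    if g_max < n:
        # Every sequence in the group is shorter than the query.
        return n + g_max - 2 * n * p * P
    if g_min >= n:
        # Every sequence in the group is at least as long as the query.
        return n + n / p - 2 * n * P
    # The query length falls inside the group's length range.
    # Assumption: the boundary g_max == n with g_min < n is also handled here.
    return n + g_max - 2 * g_min * P


# Hypothetical values: query length 30, p = 0.8, P = 0.9.
overall = lemma_threshold(30, 0.8, 0.9)
shorter = group_threshold(30, 24, 29, 0.8, 0.9)
longer = group_threshold(30, 31, 37, 0.8, 0.9)
straddling = group_threshold(30, 28, 33, 0.8, 0.9)
```

For these values each per-group threshold is smaller than the group-independent one, which is what lets the vp-tree search prune more branches.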

5.3 Experimental Results In the experiments we ran to test our indexing scheme, the data set consists of around 10000 integer sequences with lengths ranging from 27 to 40. These integer sequences are obtained by rounding real number sequences taken from time series of stock data and from stock data simulated with statistical formulas. We compared our indexing scheme with the sequential search method in terms of the number of distance computations required. In Figure 3, we display the results for four different cases where we varied the order of the vp-trees used in indexing the groups. The terms vpt2, vpt3, vpt4, vpt5 refer to the cases where vp-trees of order 2, 3, 4, and 5 are used, respectively. We show the ratio of exact edit distance computations and of modified edit distance computations for each of these vp-trees. These ratios are obtained by taking the average over 100 queries. In this particular application, with the indexing scheme only 21-23 percent of the distance computations are done as compared to using the sequential method. In terms of the number of distance computations, the sequential method makes on average 3153 distance computations, while with the indexing method the average number of distance computations varies between 658 and 713. Depending on the distance distribution among the sequences, the gain in percentage can be much higher in different applications for other data domains such as genetic sequences or text sequences. Another observation is that if higher order vp-trees are used, we end up doing more modified edit distance computations and fewer exact edit distance computations, because more data sequences are accommodated in the leaves if the order is high. In Figure 3, the results for vpt4 do not seem to conform to this trend; however, this can be considered an exception, since the performance of vp-trees depends very much on the random function used to pick the vantage points. Making more modified edit distance computations may be desirable since an exact edit distance computation is more costly than a modified edit distance computation. On the other hand, it is not very desirable to end up with shallow high order vp-trees since that would increase the total number of distance computations.

[Bar chart: ratio of distance computations with the indexing scheme versus without the indexing scheme (sequential search), for vpt2, vpt3, vpt4, and vpt5. Series: total distance computations, edit distance computations, modified edit distance computations. Vertical axis from 0 to 0.25.]

Figure 3: Efficiency of the indexing scheme vs sequential search

6 Conclusion We propose a method for similarity matching of sequences with different lengths. The method uses a modified version of the edit distance algorithm, which is used for approximate text matching and is based on dynamic programming. For numerical sequences, our method also includes the mapping process of unmatched elements of sequences. We also provide an indexing mechanism which is used for efficient filtering of distant (in terms of similarity) sequences in a similarity match query. Our indexing method avoids false dismissals when the distance function used to compute the similarity distance between sequences is metric. This is not the case for edit distance when it is used for numerical sequences and the sequence elements are approximately matched (i.e., if they are within a matching distance δ of each other). In this case, as discussed in [BYO97], we can still make use of the indexing scheme by rounding the numerical values and matching sequence elements based on equality after rounding. However, this introduces the possibility of false dismissals into the filtering process. Designing an indexing mechanism for efficient filtering of numerical sequences for similarity matching queries without any false dismissals is a challenging future research problem.

References [AFS93] R. Agrawal, C. Faloutsos and A. Swami, "Efficient Similarity Search In Sequence Databases", FODO Conference, Evanston, Illinois, Oct. 1993.

[ALSS95] R. Agrawal, K.I. Lin, H.S. Sawhney and K. Shim, "Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases", Proc. of the 21st VLDB Conf., 1995.

[BKSS90] N. Beckmann, H.P. Kriegel, R. Schneider, B. Seeger, "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles", Proc. of the ACM SIGMOD Conf., 1990.

[Bri95] S. Brin, "Near Neighbor Search in Large Metric Spaces", VLDB Conf., 1995.

[BYO97] T. Bozkaya, N. Yazdani, Z.M. Ozsoyoglu, "Matching and Indexing Sequences of Different Lengths", Tech. Report, CES, CWRU, 1997.

[CLR90] T.H. Cormen, C.E. Leiserson and R.L. Rivest, "Introduction to Algorithms", MIT Press, 1990.

[CR94] M. Crochemore and W. Rytter, "Text Algorithms", Oxford Univ. Press, 1994.

[Fuk72] K. Fukunaga, "Introduction to Statistical Pattern Recognition", Academic Press, New York, 1972.

[FRM94] C. Faloutsos, M. Ranganathan and Y. Manolopoulos, "Fast Subsequence Matching in Time-Series Databases", ACM SIGMOD Conf., 1994.

[LT94] D. Lopresti, A. Tomkins, "On the Searchability of Electronic Ink", in IWFHR 94.

[LT95] D. Lopresti, A. Tomkins, "Block Edit Models for Approximate String Matching", in SSAWSP 95.

[SHB92] J. Snyder, A. Hiltner, E. Baer, "Analysis of the wedge-shaped damage zone in edge-notched polypropylene", Jour. of Materials Sci. (27), 1992.

[Uhl91] J.K. Uhlmann, "Satisfying General Proximity/Similarity Queries with Metric Trees", Information Processing Letters, v40, p175-179, 1991.

[YOO94] N. Yazdani, M. Ozsoyoglu and G. Ozsoyoglu, "A Framework For Feature-Based Indexing for Spatial Databases", Proceeding of 7th Int. Conf. on Statistical and Scientific Database, 1994.

[YO96] N. Yazdani and M. Ozsoyoglu, "Sequence Matching of Images", Proceeding of 8th Int. Conf. on Statistical and Scientific Database, 1996.
