
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. PAMI-4, NO. 3, MAY 1982

logarithm. Furthermore, it was conjectured that any simple null hypothesis will eventually be rejected, and that the probability of eventual rejection of a composite null hypothesis will be less than unity for all points where the unconditioned false alarm rate is less than α.

REFERENCES

[1] P. Hartman and A. Wintner, "On the law of the iterated logarithm," Amer. J. Math., vol. 63, pp. 169-176, 1941.

Similarity Measures Between Strings Extended to Sets of Strings

KAREN A. LEMONE

Abstract—Similarity measures between strings (finite-length sequences of symbols) are extended to apply to sets of strings in an intuitive way which also preserves some of the desired properties of the initial similarity measure. Two quite different measures are used in the examples: the first for applications where numerical computation of pointwise similarity is needed; the second for applications which depend more heavily on substring similarity. It is presumed throughout that alphabets may be nondenumerable; in particular, the unit interval [0, 1] is used as the alphabet in the examples.

Index Terms—Nearest neighbor rule, nondenumerable alphabet, similarity measures between strings, substring.

I. INTRODUCTION

This correspondence extends similarity measures between strings to their counterparts for sets of strings. The motivations for this research are threefold: 1) to continue the study of properties of similarity measures between strings, 2) to study the properties of similarity measures between sets of strings, and 3) to study the property-preserving characteristics of suitable methods of extending similarity measures between strings to similarity measures between sets of strings.

A string w is defined to be a finite-length sequence of symbols from some (possibly infinite) alphabet. The alphabet utilized in the examples is the unit interval [0, 1]. Five properties of similarity will be considered. The first four of these, c0-c3, are suggested in [6] as "universal criteria which any reasonable similarity measure must satisfy." The last, c4, is essentially a "triangle law" for similarity measures, useful for "pointwise" similarity measures such as Similarity Measure 1 defined below. In the following:

w = w1 w2 ... wn,  v = v1 v2 ... vm,  u = u1 u2 ... ul

0 ≤ wi, vi, ui ≤ 1.

c0: 0 ≤ s(w, v) ≤ 1

c1: s(w, v) = 0 if w and v are "totally dissimilar"; s(w, v) = 1 if w and v are "the same"

c2: s(w, v) = s(v, w).

Manuscript received August 23, 1979; revised November 20, 1981.
The author is with the Department of Computer Science, Worcester Polytechnic Institute, Worcester, MA 01609.

c3: s(w, v) = s(wR, vR), where wR, vR denote w and v reversed

c4: s(w, v) + s(v, u) ≤ 1 + s(w, u).

Two similarity measures will be presented here. The first is based on a normalized city-block metric, the second on the number of occurrences of identical substrings.

Similarity Measure 1:

s1(w, v) = (1/n) Σ_{i=1}^{n} (1 - |wi - vi|)

where

w = w1 w2 ... wn,  v = v1 v2 ... vn,  0 ≤ wi, vi ≤ 1.

Note that this measure is defined only between strings of equal length and would not be meaningful in applications such as nonencoded text where wi - vi cannot be evaluated.
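In a programming setting s1 is nearly a one-liner. A minimal sketch, representing strings over [0, 1] as Python sequences of floats (the function name is illustrative, not the author's notation):

```python
def s1(w, v):
    # Similarity Measure 1: normalized city-block similarity between
    # equal-length strings over the alphabet [0, 1]
    if len(w) != len(v):
        raise ValueError("s1 is defined only between strings of equal length")
    return sum(1 - abs(a - b) for a, b in zip(w, v)) / len(w)

print(s1([0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0]))  # 0.75
```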

Similarity Measure 2:

s2(w, v) = ( Σ_{i=1}^{n} p(i) ) / ( Σ_{i=1}^{n} i )

where

w = w1 w2 ... wk,  v = v1 v2 ... vm,  n = max(k, m)

and p(i) is the number of substrings of length i which w and v have in common.

Here, the two strings to be compared need not be of equal length.
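A sketch of s2 follows. One detail the prose leaves open is whether p(i) counts distinct common substrings or matching occurrences; the version below counts occurrences (for each substring, the smaller of its occurrence counts in w and v), a reading consistent with values in Example 1 such as s2(0, 00) = 1/3:

```python
from collections import Counter

def s2(w, v):
    # Similarity Measure 2 (a sketch): p(i) is taken here as the number of
    # matching occurrences of length-i substrings shared by w and v
    n = max(len(w), len(v))

    def occurrences(x, i):
        return Counter(tuple(x[j:j + i]) for j in range(len(x) - i + 1))

    p_total = 0
    for i in range(1, n + 1):
        cw, cv = occurrences(w, i), occurrences(v, i)
        p_total += sum(min(cnt, cv[sub]) for sub, cnt in cw.items())
    # denominator: sum of i for i = 1..n
    return p_total / (n * (n + 1) // 2)

print(s2([0], [0, 0]))  # 1/3
```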

It can be shown that s1 satisfies c0-c4 [11] and that s2 satisfies c0-c3.

Example 1:

s1(1, 1) = 1
s1(0, 1) = 0
s1(00, 11) = 0
s1(00, ½ ½) = 1/2
s1(001, 010) = 1/3
s1(0010, 0110) = 3/4
s1(½ ½ ¼ ¼, 0110) = 1/2

s2(1, 1) = 1
s2(0, 1) = 0
s2(00, 11) = 0
s2(00, ½ ½) = 0
s2(001, 010) = 2/3
s2(0010, 0110) = 1/2
s2(½ ½ ¼ ¼, 0110) = 0
s2(0, 00) = 1/3
s2(001, 00100) = 2/5.

II. EXTENSION TO SETS OF STRINGS

In this section the similarity between a single string w and a set of strings S is defined according to the usual "nearest neighbor" rule from classification theory. Then this definition is in turn used to define the similarity between two sets of strings:

s(w, S) = sup_{v∈S} s(w, v).



Intuitively, s(w, S) is the similarity between w and its "nearest neighbor" in S. (The nearest neighbor rule is not the only method for measuring the similarity between a string and a set of strings. Another frequently used method is to take an average if this seems more appropriate. This increases the mathematical complexity, however, since an integral may be needed.)

Example 2: Let

S1 = {w1 w2 ... wn | wi = wn-i+1; 0 ≤ wi ≤ 1}.

s1(1, S1) = 1
s1(0, S1) = 1
s1(00, S1) = 1
s1(01, S1) = 1/2
s1(001, S1) = 2/3
s1(0010, S1) = 3/4

s2(1, S1) = 1
s2(0, S1) = 1
s2(00, S1) = 1
s2(01, S1) = 1/2
s2(001, S1) = 3/5
s2(0010, S1) = 2/3

s1(¼ ¼ ¼ ¼, S1) = 1    s2(½ ½ ½, S1) = 1.
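For a finite set the supremum in the nearest-neighbor rule is just a maximum, so s(w, S) is directly computable. A minimal sketch using s1 and a finite grid of palindromes as a stand-in for the palindrome set S1 of Example 2 (the true S1 is uncountable, so a grid only approximates the sup; here the grid happens to attain it):

```python
def s1(w, v):
    # Similarity Measure 1 on equal-length sequences over [0, 1]
    return sum(1 - abs(a - b) for a, b in zip(w, v)) / len(w)

def sim_string_to_set(s, w, S):
    # nearest-neighbor rule: s(w, S) = sup over v in S of s(w, v);
    # for a finite S the sup is a max
    return max(s(w, v) for v in S)

# length-3 palindromes (a, b, a) over a coarse grid of [0, 1],
# a finite stand-in for the palindrome set S1
grid = [k / 4 for k in range(5)]
S1_3 = [(a, b, a) for a in grid for b in grid]

print(sim_string_to_set(s1, (0.0, 0.0, 1.0), S1_3))  # 2/3
```

The best palindrome approximation of 001 has the form (a, 0, a), giving similarity 2/3 for any a, which the grid attains.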

Before defining s*, the similarity measure between two sets of strings, some of its desired properties will be considered. It is reasonable to expect a similarity measure s* between sets of strings S1 and S2 to satisfy properties C0-C4 analogous to c0-c4:

C0: 0 ≤ s*(S1, S2) ≤ 1

C1: s*(S1, S2) = 0 if S1 and S2 are "totally dissimilar"; s*(S1, S2) = 1 if S1 = S2

C2: s*(S1, S2) = s*(S2, S1)

C3: s*(S1R, S2R) = s*(S1, S2), where SR is S with all its strings reversed

C4: s*(S1, S2) + s*(S2, S3) ≤ 1 + s*(S1, S3).

Since our similarity s* between sets of strings is to be derived from the similarity measure s between the individual strings comprising the sets, the concept of "totally dissimilar" will also be an inherited quality. Thus, totally dissimilar sets for Similarity Measure 1 will be those sets S1 and S2 such that all the strings in S1 have 0's where all the strings in S2 have 1's, and vice versa. Totally dissimilar sets for Similarity Measure 2 will be those sets none of whose strings have substrings in common.

It was noted previously that s(w, S) is the similarity between w and its "nearest neighbor" in S. In the same intuitive fashion,

inf_{w∈S1} s(w, S2)

may be thought of as the similarity between the "furthest neighbor" in S1 and its nearest neighbor in S2. Since

inf_{w∈S1} s(w, S2) and inf_{w∈S2} s(w, S1)

are not necessarily equal, it is usual to consider some combination of them to define the similarity between two sets of strings. Here their minimum is taken:

s*(S1, S2) = min { inf_{w∈S1} s(w, S2), inf_{w∈S2} s(w, S1) }.
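For finite sets, inf and sup reduce to min and max, and s* can be computed directly. A sketch, using s1 as the underlying measure (any s satisfying c0-c4 could be plugged in):

```python
def s1(w, v):
    # Similarity Measure 1 on equal-length sequences over [0, 1]
    return sum(1 - abs(a - b) for a, b in zip(w, v)) / len(w)

def s_star(s, S1, S2):
    # s*(S1, S2) = min{ inf over S1 of s(w, S2), inf over S2 of s(w, S1) };
    # for finite sets, inf -> min and the inner sup -> max
    def furthest_nearest(A, B):
        return min(max(s(w, v) for v in B) for w in A)
    return min(furthest_nearest(S1, S2), furthest_nearest(S2, S1))

# two sets identical except for one "differing" string: s* reduces to the
# similarity between that string and the other set
A = [(0.0, 0.0), (1.0, 1.0)]
B = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(s_star(s1, A, B))  # 0.5
```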

It can be shown that if s satisfies condition ci, then s* satisfies condition Ci, for 0 ≤ i ≤ 4. (C0, C1, C2, and C4 are shown in [11] in a slightly different form.) Note that if the two sets are identical except for one "differing" string, then the similarity between the two sets reduces to the similarity between this string and the other set.

Example 3: Let

S1 = {w1 w2 ... wn | wi = wn-i+1; 0 ≤ wi ≤ 1}

and

S2 = {v1 v2 ... vn | v1 = v2; 0 ≤ vi ≤ 1}.

A. Using Similarity Measure 1

Let Si^N denote all strings in Si (i = 1 or 2) of length N. (Remember that s1 is defined only between strings of the same length.) Then

s1*(S1^1, S2^1) = s1*(S1^2, S2^2) = 1

and

inf_{w∈S2} s1(w, S1^N) = (N - 1)/N

inf_{v∈S1} s1(v, S2^N) = (N - 1)/N

for N even, N > 2. Thus

s1*(S1^N, S2^N) = min { inf_{w∈S2} s1(w, S1^N), inf_{v∈S1} s1(v, S2^N) } = (N - 1)/N for N even, N > 2.

Also

inf_{w∈S2} s1(w, S1^N) = (N - [N/2])/N

inf_{v∈S1} s1(v, S2^N) = (N - 1)/N

for N odd, N > 1. Thus

s1*(S1^N, S2^N) = (N - [N/2])/N for N odd, N > 1.

([x] denotes the greatest integer less than x.) The details of the example can be found in [11].

B. Using Similarity Measure 2

It can be seen, for n odd, n > 1, that a "closest" string in S1 to

v = v1 v1 v2 ... vn-1 ∈ S2

is the palindrome

w = v1 v1 v3 ... v(n-1)/2 v(n-1)/2+1 v(n-1)/2 ... v3 v1 v1

and

s2(v, w) = ((n + 1)(n + 3) + 8) / (4n(n + 1))

i.e.,

sup_{w∈S1} s2(v, w) = ((n + 1)(n + 3) + 8) / (4n(n + 1)).

Similarly, for n even,

sup_{w∈S1} s2(v, w) = ((n + 2)(n + 4)) / (4n(n + 1)).

Both of these approach 1/4 (from above) as n → ∞. Thus,

inf_{v∈S2} s2(v, S1) = 1/4

and

inf_{w∈S1} s2(w, S2) = 1/4

so that

s2*(S1, S2) = 1/4.


III. CONCLUSIONS

The similarity measure s* described here is, of course, not limited to the similarity measures s1 and s2 used in the examples. The choice of a similarity measure continues to depend on the task at hand, since "good" or "bad" strings can always be found which support or contradict our intuitive ideas of what is plausible. The important fact to note, however, is that once an appropriate measure s is chosen, it need only be ascertained that s satisfies ci (0 ≤ i ≤ 4); it is then known that s* satisfies Ci (0 ≤ i ≤ 4).

Using this procedure it is thus possible to characterize in a mathematically meaningful way a set of strings; e.g., S might be described as the set of strings whose similarity to another set of strings, say T, is 1/2. Applications include any processes that require a distinction to be made between two groups of strings rather than between just two single strings. Visual shape recognition, for example, can be achieved by measuring the similarity between the shape and various prototypes. (The elements of the shape must be encoded and a suitable similarity measure chosen.)

Similarity measures should be computable, and they should satisfy the semantics of the application. This correspondence recommends the following.

1) Determine the important properties that a similarity measure between elements of two patterns should satisfy. Find a measure s with these properties. This is not itself a trivial matter [1], [3], [4], [14].

2) Derive a suitable extension s* of this measure that preserves the properties as they apply to sets of elements. For encoded patterns, the author believes that the extension process described here is particularly good since it preserves the "universal" properties c0-c3 [6].

To obtain a computer interpretation of the derived s*, it will often be necessary to make certain "compromises," e.g., replace elements of [0, 1] with computer-sized numbers, replace inf with min, throw away truly deviant strings, etc.

REFERENCES

[1] H. C. Andrews, Introduction to Mathematical Techniques in Pattern Recognition. New York: Wiley, 1972.

[2] F. W. Blackwell, "Combining mathematical and structural pattern recognition," in Proc. 2nd Int. Joint Conf. Pattern Recog., Copenhagen, Denmark, Aug. 13-15, 1974, pp. 534-539.

[3] T. M. Cover and J. Van Campenhout, "On the possible orderings in the measurement selection problem," in Proc. 3rd Int. Joint Conf. Pattern Recog. New York: IEEE, 1976, pp. 245-248.

[4] E. Diday, "Recent progress in distance and similarity measures in pattern recognition," in Proc. 2nd Int. Joint Conf. Pattern Recog., Copenhagen, Denmark, Aug. 13-15, 1974, pp. 534-539.

[5] E. Diday and J. C. Simon, "Clustering analysis," in Digital Pattern Recognition, K. S. Fu, Ed. New York: Springer-Verlag, 1976, pp. 47-92.

[6] N. V. Findler and J. Van Leeuwen, "A family of similarity measures between two strings," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-1, pp. 116-118, 1979.

[7] K. S. Fu, Syntactic Methods in Pattern Recognition. New York: Academic, 1974.

[8] K. S. Fu and S. Y. Lu, "A clustering procedure for syntactic patterns," IEEE Trans. Syst., Man, Cybern., vol. SMC-7, pp. 734-742, 1977.

[9] M. Kaliski and T. Johnson, "Binary classification of real sequences by discrete-time systems," presented at the 17th IEEE Conf. on Decision and Contr., 1979.

[10] D. Langridge, "On the computation of shape," in Frontiers of Pattern Recognition, S. Watanabe, Ed. New York: Academic, 1962, pp. 347-365.

[11] K. Lemone, "Languages over the real numbers," Ph.D. dissertation, Northeastern Univ., Boston, MA, June 1979.

[12] T. Pavlidis, "Algorithms for shape analysis," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-2, pp. 301-312, July 1980.

[13] L. G. Shapiro, "A structural model of shape," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-2, pp. 111-126, 1980.

[14] J. C. Simon, "Recent progress to a formal approach of pattern recognition and scene analysis," in Proc. 2nd Int. Joint Conf. Pattern Recog., Copenhagen, Denmark, Aug. 13-15, 1974, pp. 489-495.

On the Chain Code of a Line

LI-DE WU

Abstract—In 1970 Freeman suggested the following criteria which the chain code of a line must meet [1], [2]:

1) at most two basic directions are present and these can differ only by unity, modulo eight;
2) one of these values always occurs singly;
3) successive occurrences of the principal direction occurring singly are as uniformly spaced as possible.

In this correspondence we give the following:

1) an algorithmic presentation of Freeman's three properties of the chain code of a line and the proof that it is also the algorithm recognizing whether a chain code is the chain code of a line;
2) the proof of the equivalence of the above presentation and Rosenfeld's chord property [3].

Index Terms—Chain code, chord property, generating algorithm, recognition algorithm, straight line.

I. INTRODUCTION

In 1970 Freeman suggested the following criteria which the chain code of a line must meet [1], [2]:

1) at most two basic directions are present and these can differ only by unity, modulo eight;
2) one of these values always occurs singly;
3) successive occurrences of the principal direction occurring singly are as uniformly spaced as possible.

As Pavlidis indicated in [2], the third criterion is somewhat fuzzy, and all these criteria need a formal proof.

In 1974 Rosenfeld [3] proved that the sufficient and necessary condition for a chain code being the chain code of a line is the chord property. Based on it, he gave the proof of the first two criteria and a number of regularity properties of the chain code of a line which are essentially equivalent to a description of the third criterion.

Also in 1974, Brons [4] gave an algorithm generating the chain code with Freeman's three properties when the slope is rational.

Later, in 1975 and 1978, Arcelli and Massarotti [5], [6] proved a number of properties that the chain code of a line has. Among them they proved that the chain code Brons' algorithm generates really has the chord property, so it is really the chain code of a line. Both [4] and [5], [6] are limited to the case when the slope of the line is rational.
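Freeman's first two criteria are straightforward to check mechanically; the subtlety lies entirely in the third. A sketch of such a check (not the recognition algorithm developed in this correspondence, which must also handle the uniform-spacing condition):

```python
def meets_first_two_criteria(chain):
    # chain: list of Freeman direction codes 0..7
    dirs = sorted(set(chain))
    # criterion 1: at most two basic directions, differing by unity mod 8
    if len(dirs) > 2:
        return False
    if len(dirs) == 2 and (dirs[1] - dirs[0]) % 8 not in (1, 7):
        return False
    # criterion 2: one of the two directions always occurs singly,
    # i.e. never in two adjacent positions
    if len(dirs) == 2:
        occurs_singly = [d for d in dirs
                         if all(not (a == d and b == d)
                                for a, b in zip(chain, chain[1:]))]
        if not occurs_singly:
            return False
    return True

print(meets_first_two_criteria([0, 0, 1, 0, 0, 1, 0]))  # True
print(meets_first_two_criteria([0, 0, 1, 1, 0]))        # False
```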

Besides, in 1977 Gaafar [7] gave a different proof of Freeman's first two criteria.

In this correspondence we give a complete description of Freeman's three criteria. Since they are too complicated to

Manuscript received March 17, 1980; revised November 30, 1981.
The author was on leave at the Division of Applied Mathematics, Brown University, Providence, RI 02912. He is with the Department of Computer Science, Fudan University, Shanghai, China.
