Document Similarity Measures
Content:
- Precision, Recall and F-measure
- Dice Coefficient
- Jaccard Coefficient
- Cosine Similarity
- Asymmetric Similarity
- Euclidean Distance
- Manhattan Block Distance
Similarity Measures: Requirements
Similarity Measure: A similarity measure is a function that computes the degree of similarity between a pair of text objects. Similarity measures are central to text mining. Their major uses can be summarized as:
1. Similarity between two documents, or between a document and query terms: a similarity measure can be used to calculate the similarity between two documents, two queries, or a document and a query.
2. Document ranking: the similarity score can be used to rank documents against a query.
Some important issues:
1. A large number of similarity measures exist, because the single best similarity measure doesn't exist (yet!).
2. Selection of a similarity measure: in practice, the choice of measure depends on the application and the data.
Precision Recall and F-measure
Let A be the set of retrieved documents and B the set of relevant documents.

Precision = returned relevant documents / total retrieved documents:

P = |A ∩ B| / |A|

Recall = returned relevant documents / total relevant documents:

R = |A ∩ B| / |B|

F-measure (harmonic mean of P and R):

F = 2PR / (P + R)

Example: calculate P and R when Retrieved = A and Relevant = B.
[Figure: Venn diagram of the retrieved documents A and the relevant documents B; the overlap A ∩ B is the set of documents that are both relevant and retrieved.]
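As a worked sketch of the definitions above (document IDs are hypothetical), precision, recall and F-measure can be computed directly on sets:

```python
def precision_recall_f(retrieved, relevant):
    """Compute P, R and F-measure from a retrieved set A and a relevant set B."""
    hits = len(retrieved & relevant)            # |A ∩ B|: relevant documents retrieved
    p = hits / len(retrieved)                   # P = |A ∩ B| / |A|
    r = hits / len(relevant)                    # R = |A ∩ B| / |B|
    f = 2 * p * r / (p + r) if p + r else 0.0   # F = 2PR / (P + R)
    return p, r, f

# Hypothetical example: 3 of 4 retrieved documents are relevant,
# out of 6 relevant documents overall.
A = {"d1", "d2", "d3", "d4"}
B = {"d1", "d2", "d3", "d5", "d6", "d7"}
print(precision_recall_f(A, B))  # (0.75, 0.5, 0.6)
```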
Dice Coefficient
Dice coefficient: the Dice coefficient is the harmonic mean of precision and recall:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|) = 2PR / (P + R)

The harmonic mean (H.M.) of numbers x_1, x_2, x_3, ..., x_n is defined as:

H.M. = n / (1/x_1 + 1/x_2 + ... + 1/x_n) = n / Σ_{i=1}^{n} (1/x_i)
Important Points
1. The function ranges between zero and one.
2. The corresponding difference function (1 − Dice) is not a proper distance metric, as it does not satisfy the triangle inequality.
e.g. The simplest counterexample is given by the three sets {a}, {b} and {a, b}: the distance between the first two is 1, while the distance between the third and each of the others is one-third, and 1 > 1/3 + 1/3.
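A minimal sketch of the set form of the Dice coefficient, checking the triangle-inequality counterexample above:

```python
def dice(a, b):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))

def dice_distance(a, b):
    """The corresponding difference function, 1 - Dice."""
    return 1 - dice(a, b)

# Counterexample with {a}, {b} and {a, b}:
print(dice_distance({"a"}, {"b"}))       # 1.0
print(dice_distance({"a"}, {"a", "b"}))  # 1 - 2/3 = one-third
print(dice_distance({"b"}, {"a", "b"}))  # also one-third
# 1.0 > 1/3 + 1/3, so the triangle inequality fails.
```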
Denotation of Dice Coefficient
Denotation of the Dice coefficient using a weighting constant α ∈ (0, 1):

D(A, B) = |A ∩ B| / (α|A| + (1 − α)|B|)

or, in term-weight form for a query q and document d_j:

sim(q, d_j) = Σ_{k=1}^{n} w_qk·w_jk / (α·Σ_{k=1}^{n} w_qk² + (1 − α)·Σ_{k=1}^{n} w_jk²)

Case 1: if α > 0.5, the measure gives more importance to precision.
Case 2: if α < 0.5, the measure gives more importance to recall.
Note: we generally take α = 0.5.

A simple case: if α = 0.5, then

D(A, B) = |A ∩ B| / (0.5|A| + 0.5|B|) = 2|A ∩ B| / (|A| + |B|) = H.M. of P and R
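A sketch of the α-weighted set form above, with hypothetical sets chosen to show how α shifts the balance between precision and recall:

```python
def weighted_dice(a, b, alpha=0.5):
    """Weighted Dice: |A ∩ B| / (alpha*|A| + (1 - alpha)*|B|)."""
    return len(a & b) / (alpha * len(a) + (1 - alpha) * len(b))

# Hypothetical retrieved set A (small and precise) vs relevant set B (large):
A = {1, 2}               # both retrieved documents are relevant: P = 1.0
B = {1, 2, 3, 4, 5, 6}   # but only 2 of 6 relevant documents found: R = 1/3
print(weighted_dice(A, B, alpha=0.9))  # ~0.83: rewards the high precision
print(weighted_dice(A, B, alpha=0.1))  # ~0.36: penalizes the low recall
print(weighted_dice(A, B, alpha=0.5))  # 0.5: the plain Dice coefficient
```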
Ex. Calculate the bigram-based similarity between the two strings (1) night and (2) nacht using the Dice coefficient.
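The exercise can be sketched as follows: the bigrams of "night" are {ni, ig, gh, ht} and those of "nacht" are {na, ac, ch, ht}, so only "ht" is shared:

```python
def bigrams(s):
    """Set of character bigrams of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice_bigrams(s1, s2):
    """Dice coefficient over the bigram sets of two strings."""
    a, b = bigrams(s1), bigrams(s2)
    return 2 * len(a & b) / (len(a) + len(b))

print(dice_bigrams("night", "nacht"))  # 2*1 / (4 + 4) = 0.25
```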
Jaccard Coefficient
Definition: The Jaccard coefficient is used to measure similarity between sets, and it can be calculated by dividing the size of the intersection by the size of the union of the sets:

J(A, B) = |A ∩ B| / |A ∪ B|

Jaccard distance: it measures the dissimilarity between sample sets, is complementary to the Jaccard coefficient, and is obtained by subtracting the Jaccard coefficient from 1 or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union:

J'(A, B) = 1 − J(A, B) = (|A ∪ B| − |A ∩ B|) / |A ∪ B|
Calculating Similarity between query and given document by using Jaccard Coefficient
sim(q, d_j) = Σ_{k=1}^{n} w_qk·w_jk / (Σ_{k=1}^{n} w_qk² + Σ_{k=1}^{n} w_jk² − Σ_{k=1}^{n} w_qk·w_jk)
Example: Rank the following documents w.r.t. the query Q = 0T1 + 0T2 + 2T3,
where (1) D1 = 2T1 + 3T2 + 5T3 and (2) D2 = 3T1 + 7T2 + 1T3.
Ans: J(D1, Q) = 10 / (38 + 4 − 10) = 10/32 ≈ 0.31; J(D2, Q) = 2 / (59 + 4 − 2) = 2/61 ≈ 0.03, so D1 ranks above D2.
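The ranking example can be checked with a short implementation of the term-weight form of the Jaccard coefficient (vectors taken from the example above):

```python
def jaccard_sim(q, d):
    """Generalized Jaccard over term-weight vectors: q.d / (|q|^2 + |d|^2 - q.d)."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    return dot / (sum(w * w for w in q) + sum(w * w for w in d) - dot)

Q  = [0, 0, 2]   # Q  = 0T1 + 0T2 + 2T3
D1 = [2, 3, 5]   # D1 = 2T1 + 3T2 + 5T3
D2 = [3, 7, 1]   # D2 = 3T1 + 7T2 + 1T3
print(jaccard_sim(Q, D1))  # 10 / (4 + 38 - 10) = 0.3125
print(jaccard_sim(Q, D2))  # 2 / (4 + 59 - 2) ~ 0.033
```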
Cosine Similarity
Cosine similarity: Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them.
Calculation of Cosine similarity: The cosine similarity between query ‘q’ and document dj can be calculated as:
sim(q, d_j) = (q · d_j) / (‖q‖ · ‖d_j‖) = Σ_{k=1}^{n} w_qk·w_jk / (√(Σ_{k=1}^{n} w_qk²) · √(Σ_{k=1}^{n} w_jk²))

For binary (set) data this becomes:

C(A, B) = |A ∩ B| / (|A|^½ · |B|^½) = (P · R)^½

i.e. the geometric mean of precision and recall.
Calculating Cosine Similarity
Calculation of cosine similarity: calculate the cosine similarity between binary fingerprint data ‘A’ and ‘B’:
struct A: 00010100010101000101010011110100 13 bits on (A)
struct B: 00000000100101001001000011100000 8 bits on (B)
A AND B: 00000000000101000001000011100000 6 bits on (C)
Answer: C = 6 / √(13 × 8) = 6 / √104 ≈ 0.588
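The fingerprint example can be verified directly (bit strings copied from above):

```python
from math import sqrt

def cosine_bits(a, b):
    """Cosine similarity of two binary fingerprints given as '0'/'1' strings."""
    c = sum(x == y == "1" for x, y in zip(a, b))  # bits on in A AND B
    return c / sqrt(a.count("1") * b.count("1"))

A = "00010100010101000101010011110100"  # 13 bits on
B = "00000000100101001001000011100000"  #  8 bits on
print(round(cosine_bits(A, B), 3))  # 6 / sqrt(13*8) ~ 0.588
```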
Asymmetric Similarity
Definition: With a traditional similarity measure, if A is similar to B then we find B is equally similar to A. Some coefficients have been defined for which this does not hold, i.e. S(A, B) ≠ S(B, A), e.g. Tversky similarity.

Tversky similarity (with A and B the bit counts of the two structures, and C the count of bits common to both):

T = C / (α(A − C) + β(B − C) + C)

where α and β are user-defined parameters:
- if α = β = 1, the equation reduces to the Tanimoto coefficient
- if α = β = ½, the equation reduces to the Dice coefficient
- if α ≠ β, T becomes asymmetric
• where α = 1 and β = 0, T = C / A
i.e. the fraction of A which it has in common with B
o when T = 1.0, it indicates that A is a substructure of B (at the level of fingerprint matching)
o when T ≈ 1.0, it indicates that A is almost a substructure of B
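A sketch of the Tversky index on bit counts, checking the reductions listed above (counts reused from the fingerprint example):

```python
def tversky(a, b, c, alpha, beta):
    """Tversky: c / (alpha*(a - c) + beta*(b - c) + c), with a = |A|, b = |B|, c = |A AND B|."""
    return c / (alpha * (a - c) + beta * (b - c) + c)

a, b, c = 13, 8, 6  # bit counts from the fingerprint example above
print(tversky(a, b, c, 1, 1))      # Tanimoto: 6 / (13 + 8 - 6) = 0.4
print(tversky(a, b, c, 0.5, 0.5))  # Dice: 2*6 / (13 + 8) ~ 0.571
print(tversky(a, b, c, 1, 0))      # C / A = 6/13: asymmetric
print(tversky(b, a, c, 1, 0))      # C / B = 6/8: swapping arguments changes the value
```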
Distance Based Similarity Measures
Euclidean distance: It is a standard metric for geometrical problems. It is the ordinary distance between two points and can be easily measured with a ruler in two- or three-dimensional space.
Euclidean distance is widely used in clustering problems, including clustering text.
It is also the default distance measure used with the K-means algorithm.
Measuring distance between text documents: given two documents d_a and d_b represented by their term vectors t_a and t_b respectively, the Euclidean distance of the two documents is defined as:

D_E(t_a, t_b) = (Σ_{t=1}^{m} |w_{t,a} − w_{t,b}|²)^½

where the term set is T = {t_1, t_2, ..., t_m}. In this calculation, w_{t,a} = tf(t, a) × idf(t).
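A sketch of the Euclidean distance on term-weight vectors (the weight values are hypothetical; in practice they would be tf-idf values over a shared term set):

```python
from math import sqrt

def euclidean(wa, wb):
    """D_E = sqrt(sum over t of |w_{t,a} - w_{t,b}|^2) over a shared term set."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(wa, wb)))

# Hypothetical tf-idf weight vectors over the same three terms:
d_a = [0.5, 0.0, 2.0]
d_b = [0.5, 3.0, 2.0]
print(euclidean(d_a, d_b))  # sqrt(0 + 9 + 0) = 3.0
```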
Manhattan block distance:

dis_M(q, d_j) = Σ_{k=1}^{n} |w_qk − w_jk|
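The Manhattan block distance on the same kind of term-weight vectors can be sketched as (vectors reused from the Jaccard ranking example):

```python
def manhattan(wq, wd):
    """dis_M = sum over k of |w_qk - w_jk|."""
    return sum(abs(x - y) for x, y in zip(wq, wd))

Q  = [0, 0, 2]   # query vector from the Jaccard example
D1 = [2, 3, 5]
print(manhattan(Q, D1))  # |0-2| + |0-3| + |2-5| = 8
```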
Important Points:
In chess, the distance between squares on the chessboard for rooks is measured in Manhattan distance;
kings and queens use Chebyshev distance, and bishops use the Manhattan distance (between squares of the same color) on the chessboard rotated 45 degrees, i.e., with its diagonals as coordinate axes.
To reach from one square to another, only kings require the number of moves equal to the distance; rooks, queens and bishops require one or two moves (on an empty board, and assuming that the move is possible at all in the bishop's case).