Document Similarity Measures
Content:
- Precision, Recall and F-measure
- Dice Coefficient
- Jaccard Coefficient
- Cosine Similarity
- Asymmetric Similarity
- Euclidean Distance
- Manhattan Block Distance
Similarity Measures: Requirements
Similarity Measure: A similarity measure is a function that computes the degree of similarity between a pair of text objects. Similarity measures are central to text mining. Their major uses can be summarized as:
1. Similarity between two documents, or between a document and query terms: a similarity measure can be used to calculate the similarity between two documents, two queries, or a document and a query.
2. Document ranking: the similarity score can be used to rank documents against a query.
Some important issues:
1. A large number of similarity measures exist, because the single best similarity measure doesn't exist (yet!).
2. Selection of a similarity measure: in practice, the choice of measure depends on the application and the data.
Precision Recall and F-measure
Let A be the set of retrieved documents and B the set of relevant documents.

Precision = returned relevant documents / total retrieved documents:

P = |A ∩ B| / |A|

Recall = returned relevant documents / total relevant documents:

R = |A ∩ B| / |B|

F-measure (harmonic mean of P and R):

F = 2PR / (P + R)

Example: calculate P and R when Retrieved = A and Relevant = B.
[Figure: Venn diagram of the retrieved documents A and the relevant documents B; the overlap A ∩ B is the set of documents that are both relevant and retrieved.]
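As a worked sketch of the definitions above (document IDs are hypothetical), precision, recall and F-measure can be computed directly on sets:

```python
def precision_recall_f(retrieved, relevant):
    """Compute P, R and F-measure from a retrieved set A and a relevant set B."""
    hits = len(retrieved & relevant)            # |A ∩ B|: relevant documents retrieved
    p = hits / len(retrieved)                   # P = |A ∩ B| / |A|
    r = hits / len(relevant)                    # R = |A ∩ B| / |B|
    f = 2 * p * r / (p + r) if p + r else 0.0   # F = 2PR / (P + R)
    return p, r, f

# Hypothetical example: 3 of 4 retrieved documents are relevant,
# out of 6 relevant documents overall.
A = {"d1", "d2", "d3", "d4"}
B = {"d1", "d2", "d3", "d5", "d6", "d7"}
print(precision_recall_f(A, B))  # (0.75, 0.5, 0.6)
```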
Dice Coefficient
Dice coefficient: the Dice coefficient is the harmonic mean of precision and recall:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|) = 2PR / (P + R)

The harmonic mean (H.M.) of numbers x_1, x_2, x_3, ..., x_n is defined as:

H.M. = n / (1/x_1 + 1/x_2 + ... + 1/x_n) = n / Σ_{i=1}^{n} (1/x_i)
Important Points
1. The function ranges between zero and one.
2. The corresponding difference function (1 − Dice) is not a proper distance metric, as it does not satisfy the triangle inequality.
e.g. The simplest counterexample is given by the three sets {a}, {b} and {a, b}: the distance between the first two is 1, while the distance between the third and each of the others is one-third, and 1 > 1/3 + 1/3.
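A minimal sketch of the set form of the Dice coefficient, checking the triangle-inequality counterexample above:

```python
def dice(a, b):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))

def dice_distance(a, b):
    """The corresponding difference function, 1 - Dice."""
    return 1 - dice(a, b)

# Counterexample with {a}, {b} and {a, b}:
print(dice_distance({"a"}, {"b"}))       # 1.0
print(dice_distance({"a"}, {"a", "b"}))  # 1 - 2/3 = one-third
print(dice_distance({"b"}, {"a", "b"}))  # also one-third
# 1.0 > 1/3 + 1/3, so the triangle inequality fails.
```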
Denotation of Dice Coefficient
Denotation of the Dice coefficient using a weighting constant α ∈ (0, 1):

D(A, B) = |A ∩ B| / (α|A| + (1 − α)|B|)

or, in term-weight form for a query q and document d_j:

sim(q, d_j) = Σ_{k=1}^{n} w_qk·w_jk / (α·Σ_{k=1}^{n} w_qk² + (1 − α)·Σ_{k=1}^{n} w_jk²)

Case 1: if α > 0.5, the measure gives more importance to precision.
Case 2: if α < 0.5, the measure gives more importance to recall.
Note: we generally take α = 0.5.

A simple case: if α = 0.5, then

D(A, B) = |A ∩ B| / (0.5|A| + 0.5|B|) = 2|A ∩ B| / (|A| + |B|) = H.M. of P and R
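A sketch of the α-weighted set form above, with hypothetical sets chosen to show how α shifts the balance between precision and recall:

```python
def weighted_dice(a, b, alpha=0.5):
    """Weighted Dice: |A ∩ B| / (alpha*|A| + (1 - alpha)*|B|)."""
    return len(a & b) / (alpha * len(a) + (1 - alpha) * len(b))

# Hypothetical retrieved set A (small and precise) vs relevant set B (large):
A = {1, 2}               # both retrieved documents are relevant: P = 1.0
B = {1, 2, 3, 4, 5, 6}   # but only 2 of 6 relevant documents found: R = 1/3
print(weighted_dice(A, B, alpha=0.9))  # ~0.83: rewards the high precision
print(weighted_dice(A, B, alpha=0.1))  # ~0.36: penalizes the low recall
print(weighted_dice(A, B, alpha=0.5))  # 0.5: the plain Dice coefficient
```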
Ex. Calculate the bigram-based similarity between the two strings (1) night and (2) nacht using the Dice coefficient.
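The exercise can be sketched as follows: the bigrams of "night" are {ni, ig, gh, ht} and those of "nacht" are {na, ac, ch, ht}, so only "ht" is shared:

```python
def bigrams(s):
    """Set of character bigrams of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice_bigrams(s1, s2):
    """Dice coefficient over the bigram sets of two strings."""
    a, b = bigrams(s1), bigrams(s2)
    return 2 * len(a & b) / (len(a) + len(b))

print(dice_bigrams("night", "nacht"))  # 2*1 / (4 + 4) = 0.25
```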
Jaccard Coefficient
Definition: The Jaccard coefficient is used to measure similarity between sets, and it can be calculated by dividing the size of the intersection by the size of the union of the sets:

J(A, B) = |A ∩ B| / |A ∪ B|

Jaccard distance: it measures the dissimilarity between sample sets, is complementary to the Jaccard coefficient, and is obtained by subtracting the Jaccard coefficient from 1 or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union:

J'(A, B) = 1 − J(A, B) = (|A ∪ B| − |A ∩ B|) / |A ∪ B|
Calculating Similarity between query and given document by using Jaccard Coefficient
sim(q, d_j) = Σ_{k=1}^{n} w_qk·w_jk / (Σ_{k=1}^{n} w_qk² + Σ_{k=1}^{n} w_jk² − Σ_{k=1}^{n} w_qk·w_jk)
Example: Rank the following documents w.r.t. the query Q = 0T1 + 0T2 + 2T3,
where (1) D1 = 2T1 + 3T2 + 5T3 and (2) D2 = 3T1 + 7T2 + 1T3.
Ans: J(D1, Q) = 10 / (38 + 4 − 10) = 10/32 ≈ 0.31; J(D2, Q) = 2 / (59 + 4 − 2) = 2/61 ≈ 0.03, so D1 ranks above D2.
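The ranking example can be checked with a short implementation of the term-weight form of the Jaccard coefficient (vectors taken from the example above):

```python
def jaccard_sim(q, d):
    """Generalized Jaccard over term-weight vectors: q.d / (|q|^2 + |d|^2 - q.d)."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    return dot / (sum(w * w for w in q) + sum(w * w for w in d) - dot)

Q  = [0, 0, 2]   # Q  = 0T1 + 0T2 + 2T3
D1 = [2, 3, 5]   # D1 = 2T1 + 3T2 + 5T3
D2 = [3, 7, 1]   # D2 = 3T1 + 7T2 + 1T3
print(jaccard_sim(Q, D1))  # 10 / (4 + 38 - 10) = 0.3125
print(jaccard_sim(Q, D2))  # 2 / (4 + 59 - 2) ~ 0.033
```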
Cosine Similarity
Cosine similarity: Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them.
Calculation of Cosine similarity: The cosine similarity between query ‘q’ and document dj can be calculated as:
sim(q, d_j) = (q · d_j) / (‖q‖ · ‖d_j‖) = Σ_{k=1}^{n} w_qk·w_jk / (√(Σ_{k=1}^{n} w_qk²) · √(Σ_{k=1}^{n} w_jk²))

For binary (set) data this becomes:

C(A, B) = |A ∩ B| / (|A|^½ · |B|^½) = (P · R)^½

i.e. the geometric mean of precision and recall.
Calculating Cosine Similarity
Calculation of cosine similarity: calculate the cosine similarity between binary fingerprint data ‘A’ and ‘B’:
struct A: 00010100010101000101010011110100 13 bits on (A)
struct B: 00000000100101001001000011100000 8 bits on (B)
A AND B: 00000000000101000001000011100000 6 bits on (C)
Answer: C = 6 / √(13 × 8) = 6 / √104 ≈ 0.588
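The fingerprint example can be verified directly (bit strings copied from above):

```python
from math import sqrt

def cosine_bits(a, b):
    """Cosine similarity of two binary fingerprints given as '0'/'1' strings."""
    c = sum(x == y == "1" for x, y in zip(a, b))  # bits on in A AND B
    return c / sqrt(a.count("1") * b.count("1"))

A = "00010100010101000101010011110100"  # 13 bits on
B = "00000000100101001001000011100000"  #  8 bits on
print(round(cosine_bits(A, B), 3))  # 6 / sqrt(13*8) ~ 0.588
```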
Asymmetric Similarity
Definition: With a traditional similarity measure, if A is similar to B then we find B is equally similar to A. Some coefficients have been defined for which this does not hold, i.e. S(A, B) ≠ S(B, A), e.g. Tversky similarity.

Tversky similarity (with A and B the bit counts of the two structures, and C the count of bits common to both):

T = C / (α(A − C) + β(B − C) + C)

where α and β are user-defined parameters:
- if α = β = 1, the equation reduces to the Tanimoto coefficient
- if α = β = ½, the equation reduces to the Dice coefficient
- if α ≠ β, T becomes asymmetric
• where α = 1 and β = 0, T = C / A
i.e. the fraction of A which it has in common with B
o when T = 1.0, it indicates that A is a substructure of B (at the level of fingerprint matching)
o when T ≈ 1.0, it indicates that A is almost a substructure of B
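A sketch of the Tversky index on bit counts, checking the reductions listed above (counts reused from the fingerprint example):

```python
def tversky(a, b, c, alpha, beta):
    """Tversky: c / (alpha*(a - c) + beta*(b - c) + c), with a = |A|, b = |B|, c = |A AND B|."""
    return c / (alpha * (a - c) + beta * (b - c) + c)

a, b, c = 13, 8, 6  # bit counts from the fingerprint example above
print(tversky(a, b, c, 1, 1))      # Tanimoto: 6 / (13 + 8 - 6) = 0.4
print(tversky(a, b, c, 0.5, 0.5))  # Dice: 2*6 / (13 + 8) ~ 0.571
print(tversky(a, b, c, 1, 0))      # C / A = 6/13: asymmetric
print(tversky(b, a, c, 1, 0))      # C / B = 6/8: swapping arguments changes the value
```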
Distance Based Similarity Measures
Euclidean distance: It is a standard metric for geometrical problems. It is the ordinary distance between two points and can be easily measured with a ruler in two- or three-dimensional space.
Euclidean distance is widely used in clustering problems, including clustering text.
It is also the default distance measure used with the K-means algorithm.
Measuring distance between text documents: given two documents d_a and d_b represented by their term vectors t_a and t_b respectively, the Euclidean distance of the two documents is defined as:

D_E(t_a, t_b) = (Σ_{t=1}^{m} |w_{t,a} − w_{t,b}|²)^½

where the term set is T = {t_1, t_2, ..., t_m}. In this calculation, w_{t,a} = tf(t, a) × idf(t).
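A sketch of the Euclidean distance on term-weight vectors (the weight values are hypothetical; in practice they would be tf-idf values over a shared term set):

```python
from math import sqrt

def euclidean(wa, wb):
    """D_E = sqrt(sum over t of |w_{t,a} - w_{t,b}|^2) over a shared term set."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(wa, wb)))

# Hypothetical tf-idf weight vectors over the same three terms:
d_a = [0.5, 0.0, 2.0]
d_b = [0.5, 3.0, 2.0]
print(euclidean(d_a, d_b))  # sqrt(0 + 9 + 0) = 3.0
```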
Manhattan block distance:

dis_M(q, d_j) = Σ_{k=1}^{n} |w_qk − w_jk|
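The Manhattan block distance on the same kind of term-weight vectors can be sketched as (vectors reused from the Jaccard ranking example):

```python
def manhattan(wq, wd):
    """dis_M = sum over k of |w_qk - w_jk|."""
    return sum(abs(x - y) for x, y in zip(wq, wd))

Q  = [0, 0, 2]   # query vector from the Jaccard example
D1 = [2, 3, 5]
print(manhattan(Q, D1))  # |0-2| + |0-3| + |2-5| = 8
```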
Important Points:
In chess, the distance between squares on the chessboard for rooks is measured in Manhattan distance;
kings and queens use Chebyshev distance, and bishops use the Manhattan distance (between squares of the same color) on the chessboard rotated 45 degrees, i.e., with its diagonals as coordinate axes.
To reach from one square to another, only kings require the number of moves equal to the distance; rooks, queens and bishops require one or two moves (on an empty board, and assuming that the move is possible at all in the bishop's case).