Vector Space Model:
Similarity Measure & Term Weighting
Vector Space Model: Postulate
Search Engine
Documents that are “close together” in vector space “talk about” the same things.

[Figure: documents d1–d5 plotted in a vector space with axes t1, t2, t3; the angle θ between vectors indicates closeness]

Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”).
Similarity Measures: Set-based
Simple matching function
Dice’s coefficient
Jaccard’s coefficient
A = (wd1, wd2, wd3, wd4, wd5)    B = (wd2, wd4, wd6)

A ∩ B: intersection of A and B
• the set of elements that belong to both A and B
• A ∩ B = (wd2, wd4)

A ∪ B: union of A and B
• the set of elements that belong to either A or B
• A ∪ B = (wd1, wd2, wd3, wd4, wd5, wd6)

|A|: cardinality of A
• the number of elements in A
• |A| = 5, |B| = 3, |A ∩ B| = 2, |A ∪ B| = 6

Similarity Scores
• Simple: |A ∩ B| = 2
• Dice: 2·|A ∩ B| / (|A| + |B|) = 2·2/8 = 1/2
• Jaccard: |A ∩ B| / |A ∪ B| = 2/6 = 1/3
SIM(A, B) = |A ∩ B|                      (simple matching)
SIM(A, B) = 2·|A ∩ B| / (|A| + |B|)      (Dice)
SIM(A, B) = |A ∩ B| / |A ∪ B|            (Jaccard)
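The three set-based measures can be sketched in a few lines of Python, using the example sets A and B from the slide:

```python
# Set-based similarity measures: simple matching, Dice, and Jaccard,
# applied to the example term sets A and B from the slide.
A = {"wd1", "wd2", "wd3", "wd4", "wd5"}
B = {"wd2", "wd4", "wd6"}

def simple(a, b):
    # size of the intersection
    return len(a & b)

def dice(a, b):
    # twice the intersection size over the sum of set sizes
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    # intersection size over union size
    return len(a & b) / len(a | b)

print(simple(A, B))   # 2
print(dice(A, B))     # 0.5
print(jaccard(A, B))  # 2/6 ≈ 0.333
```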
Similarity Measures: Set-based Example
Compute the similarity of O1 to O2–O5 by the simple matching function, Dice’s coefficient, and Jaccard’s coefficient.
Object – Attribute (feature) array
O1 = (1, 0, 1, 1, 0, 0, 0, 1)   |O1| = 4
O2 = (1, 0, 0, 0, 1, 1, 0, 0)   |O2| = 3
O3 = (1, 0, 0, 1, 1, 1, 0, 0)   |O3| = 4
O4 = (1, 1, 0, 1, 0, 1, 1, 0)   |O4| = 5
O5 = (1, 1, 1, 1, 0, 0, 1, 1)   |O5| = 6
|O1 ∩ O2| = |(A1)| = 1
|O1 ∩ O3| = |(A1, A4)| = 2
|O1 ∩ O4| = |(A1, A4)| = 2
|O1 ∩ O5| = |(A1, A3, A4, A8)| = 4

|O1 ∪ O2| = |(A1, A3, A4, A5, A6, A8)| = 6
|O1 ∪ O3| = |(A1, A3, A4, A5, A6, A8)| = 6
|O1 ∪ O4| = |(A1, A2, A3, A4, A6, A7, A8)| = 7
|O1 ∪ O5| = |(A1, A2, A3, A4, A7, A8)| = 6
Simple matching:
        O2   O3   O4   O5
SIM     1    2    2    4
Rank    4    2    2    1

Dice’s coefficient:
        O2              O3              O4              O5
SIM     2·1/(4+3)=2/7   2·2/(4+4)=4/8   2·2/(4+5)=4/9   2·4/(4+6)=8/10
Rank    4               2               3               1

Jaccard’s coefficient:
        O2   O3   O4   O5
SIM     1/6  2/6  2/7  4/6
Rank    4    2    3    1
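The table values above can be recomputed programmatically, viewing each binary attribute vector as the set of attribute positions holding a 1 (a small Python sketch, not part of the original slides):

```python
# Similarity of O1 to O2–O5 under the three set-based measures.
# Each binary vector is converted to the set of indices where it is 1.
objects = {
    "O1": (1, 0, 1, 1, 0, 0, 0, 1),
    "O2": (1, 0, 0, 0, 1, 1, 0, 0),
    "O3": (1, 0, 0, 1, 1, 1, 0, 0),
    "O4": (1, 1, 0, 1, 0, 1, 1, 0),
    "O5": (1, 1, 1, 1, 0, 0, 1, 1),
}
sets = {k: {i for i, v in enumerate(vec) if v} for k, vec in objects.items()}

o1 = sets["O1"]
for name in ("O2", "O3", "O4", "O5"):
    o = sets[name]
    simple = len(o1 & o)
    dice = 2 * len(o1 & o) / (len(o1) + len(o))
    jaccard = len(o1 & o) / len(o1 | o)
    print(name, simple, round(dice, 3), round(jaccard, 3))
```

Dice and Jaccard both rank O5 first and O2 last, while simple matching ties O3 and O4 because it ignores object sizes.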
Vector Space Model: Example
[Figure: query q and documents d1, d2, d3 plotted in term space t1, t2, t3]
Query: What is information retrieval?
Q: information 1, retrieval 1
Index Term         d1  d2  d3  q
t1 (information)    1   1   1   1
t2 (retrieval)      1   2   0   1
t3 (seminar)        1   1   1   0

D1: Information retrieval seminars
D2: Retrieval seminars and Information Retrieval
D3: Information seminar
Triangle q–d3: sides √2, √2, √2 → θ = 60°
Triangle q–d1: sides √2, √3, 1 → θ ≈ 35°
Triangle q–d2: sides √2, √6, √2 → θ ≈ 30°
VSM: Vector Length
[Figure: vector A = (Ax, Ay, Az) in 3-dimensional space with axes x, y, z]

|A| = √(Ax² + Ay² + Az²)

In n dimensions: |A| = √(Σᵢ₌₁ⁿ Aᵢ²)
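The vector-length formula translates directly into code (a minimal sketch; the example vectors are d1 and d2 from the earlier slide):

```python
# Vector length (Euclidean norm): square root of the sum of squared
# components, as in |A| = sqrt(Ax^2 + Ay^2 + Az^2).
import math

def vector_length(a):
    return math.sqrt(sum(x * x for x in a))

print(vector_length((1, 1, 1)))  # |d1| = sqrt(3) ≈ 1.732
print(vector_length((1, 2, 1)))  # |d2| = sqrt(6) ≈ 2.449
```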
VSM: Angle between two vectors
Triangle q–d3 in term space: a = |d3| = √2, b = |q| = √2, c = |d3 − q| = √2

cos θ = (2 + 2 − 2) / (2·√2·√2) = 2/4 = 0.5
cos⁻¹(0.5) = 60°
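The Law-of-Cosines computation on these slides can be sketched as a short Python function (a and b are the vector lengths, c the length of their difference):

```python
# Angle between two vectors via the Law of Cosines:
# cos(theta) = (a^2 + b^2 - c^2) / (2ab)
import math

def angle_deg(u, v):
    a = math.sqrt(sum(x * x for x in u))           # |u|
    b = math.sqrt(sum(x * x for x in v))           # |v|
    c = math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))  # |u - v|
    cos_theta = (a * a + b * b - c * c) / (2 * a * b)
    return math.degrees(math.acos(cos_theta))

q, d1, d2, d3 = (1, 1, 0), (1, 1, 1), (1, 2, 1), (1, 0, 1)
print(round(angle_deg(q, d3), 1))  # 60.0
print(round(angle_deg(q, d1), 1))  # 35.3
print(round(angle_deg(q, d2), 1))  # 30.0
```

These are exactly the θ values worked out on the following slides for q–d1 and q–d2.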
VSM: Angle Computation q-d1
Law of Cosines: cos θ = (a² + b² − c²) / (2ab)

Triangle q–d1: a = |q| = √2, b = |d1| = √3, c = |d1 − q| = 1

cos θ = (2 + 3 − 1) / (2·√2·√3) = 4 / (2·√6) = 4/4.9 ≈ 0.82
cos⁻¹(0.82) ≈ 34.9°, so θ ≈ 35°

Sin, Cosine & Tangent (Math is Fun)
VSM: Angle Computation q-d2
Law of Cosines: cos θ = (a² + b² − c²) / (2ab)

Triangle q–d2: a = |q| = √2, b = |d2| = √6, c = |d2 − q| = √2

cos θ = (2 + 6 − 2) / (2·√2·√6) = 6 / (2·√12) = 6/6.92 ≈ 0.87
cos⁻¹(0.87) ≈ 29.5°, so θ ≈ 30°

Sin, Cosine & Tangent (Math is Fun)
Similarity Measures: Vector-based Cosine Similarity (n-dimensional space)
Dot/Scalar product of vectors ÷ product of vector lengths
• Dot product = sum of products of corresponding components
  A = (A1, A2, A3, A4), B = (B1, B2, B3, B4)
  A•B = A1B1 + A2B2 + A3B3 + A4B4
• Vector length = square root of the sum of squared components
  |A| = sqrt[(A1)² + (A2)² + (A3)² + (A4)²],  |B| = sqrt[(B1)² + (B2)² + (B3)² + (B4)²]

Cosine Similarity (3-dimensional space)
A = (Ax, Ay, Az), B = (Bx, By, Bz)
cos θ = (AxBx + AyBy + AzBz) / (√(Ax² + Ay² + Az²) · √(Bx² + By² + Bz²))

cos θ(A, B) = A•B / (|A| |B|) = Σᵢ₌₁ⁿ AᵢBᵢ / (√(Σᵢ₌₁ⁿ Aᵢ²) · √(Σᵢ₌₁ⁿ Bᵢ²))
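The n-dimensional cosine formula is a one-to-one translation into code (a minimal sketch; the example pair is q and d3 from the earlier slide, whose angle is 60°):

```python
# Cosine similarity: dot product of two vectors divided by the
# product of their lengths.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    len_a = math.sqrt(sum(x * x for x in a))
    len_b = math.sqrt(sum(y * y for y in b))
    return dot / (len_a * len_b)

# q = (1, 1, 0), d3 = (1, 0, 1): cos 60° = 0.5
print(cosine_similarity((1, 1, 0), (1, 0, 1)))
```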
Similarity Measures: Vector-based Example
Compute cosine similarities
Rank objects O2 through O5 by descending order of similarity to O1
Object – Attribute (feature) array
O1 = (1, 0, 1, 1, 0, 0, 0, 1)
O2 = (1, 0, 0, 0, 1, 1, 0, 0)
O3 = (1, 0, 0, 1, 1, 1, 0, 0)
O4 = (1, 1, 0, 1, 0, 1, 1, 0)
O5 = (1, 1, 1, 1, 0, 0, 1, 1)

|O1| = sqrt(1²+0²+1²+1²+0²+0²+0²+1²) = sqrt(4)
|O2| = sqrt(1²+0²+0²+0²+1²+1²+0²+0²) = sqrt(3)
|O3| = sqrt(1²+0²+0²+1²+1²+1²+0²+0²) = sqrt(4)
|O4| = sqrt(1²+1²+0²+1²+0²+1²+1²+0²) = sqrt(5)
|O5| = sqrt(1²+1²+1²+1²+0²+0²+1²+1²) = sqrt(6)

O1•O2 = 1·1+0·0+1·0+1·0+0·1+0·1+0·0+1·0 = 1
O1•O3 = 1·1+0·0+1·0+1·1+0·1+0·1+0·0+1·0 = 2
O1•O4 = 1·1+0·1+1·0+1·1+0·0+0·1+0·1+1·0 = 2
O1•O5 = 1·1+0·1+1·1+1·1+0·0+0·0+0·1+1·1 = 4
SIM(O1, O2) = 1 / (√4·√3) = 1/√12 ≈ 0.29
SIM(O1, O3) = 2 / (√4·√4) = 2/√16 = 0.50
SIM(O1, O4) = 2 / (√4·√5) = 2/√20 ≈ 0.45
SIM(O1, O5) = 4 / (√4·√6) = 4/√24 ≈ 0.82

Rank: O5 (0.82) > O3 (0.50) > O4 (0.45) > O2 (0.29)
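These four values can be verified with a short script (a sketch added for checking the arithmetic, not part of the original slides):

```python
# Cosine similarity of O1 to O2–O5, printed in descending order.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

O1 = (1, 0, 1, 1, 0, 0, 0, 1)
others = {
    "O2": (1, 0, 0, 0, 1, 1, 0, 0),
    "O3": (1, 0, 0, 1, 1, 1, 0, 0),
    "O4": (1, 1, 0, 1, 0, 1, 1, 0),
    "O5": (1, 1, 1, 1, 0, 0, 1, 1),
}
for name, vec in sorted(others.items(), key=lambda kv: -cosine(O1, kv[1])):
    print(name, round(cosine(O1, vec), 3))
# Ranks O5, O3, O4, O2 — the same ordering Dice and Jaccard give on this data.
```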
Text Analysis: Word Frequency (TREC Volume 3 Corpus)
Number of documents: 336,310
Total word occurrences: 125,720,891
Unique words: 508,209

Zipf Distribution: Rank × Frequency = constant
• Also observed for population, wealth, and popularity
A few words are very common; most words are very rare.

Term Weights
Represent the ability of terms to identify relevant items and to distinguish them from non-relevant material.
Very common and very rare words are not very useful for indexing (Luhn, 1958).
Good: smaller index → faster retrieval
Bad: lost gems & broken phrases

B. Croft (UMass)
Text Analysis: Term Weighting
Term Weighting Factors

Term frequency (tf)
• Number of times a term occurs in a given document
→ tf(dog, d1) = 2, tf(dog, d2) = 1
→ tf(fox, d1) = 3, tf(fox, d2) = 0
→ tf(party, d1) = 0, tf(party, d2) = 1

Inverse document frequency (idf)
• (Simple) 1 / number of documents in which the term occurs
→ idf(dog) = 1/2, idf(fox) = 1/1, idf(party) = 1/1
• (Default) log(Nd / number of documents in which the term occurs)
→ Nd = number of documents in the collection
→ idf(dog) = log(2/2) = 0, idf(fox) = log(2/1) = 0.3, idf(party) = log(2/1) = 0.3

Document length (dlen)
• Number of tokens in a document
→ Token = an instance/occurrence of a word (not a unique word)
→ dlen(d1) = 11, dlen(d2) = 10
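The default (log-based) idf values above come out directly from the document frequencies (a small sketch assuming base-10 logs, which match the 0.3 value on the slide):

```python
# idf(t) = log10(Nd / df(t)) with Nd = 2 documents.
import math

Nd = 2
df = {"dog": 2, "fox": 1, "party": 1}  # documents containing each term

idf = {t: math.log10(Nd / n) for t, n in df.items()}
print(round(idf["dog"], 1))    # 0.0
print(round(idf["fox"], 1))    # 0.3
print(round(idf["party"], 1))  # 0.3
```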
tf⋅idf formula
Term     d1  d2
quick     1   0
brown     1   0
fox       3   0
over      1   0
lazy      1   0
dog       2   1
back      1   0
now       0   1
time      0   1
all       0   1
good      0   1
men       0   1
come      0   1
jump      1   0
aid       0   1
their     0   1
party     0   1
w_ki = tf × idf = f_ki × log(Nd / d_k)

w_ki = weight of term k in document i
f_ki = frequency of term k in document i (tf)
Nd   = number of documents in the collection
d_k  = number of documents in which term k appears (postings)
Similarity Measures: using Term Weights
1. Compute term weights (e.g., tf·idf) with w_ki = f_ki × log(Nd / d_k)
• Nd = 5
• d_k (postings) for terms t1–t8: d1 = 5, d2 = 4, d3 = 1, d4 = 3, d5 = 2, d6 = 3, d7 = 4, d8 = 2

[Document – Term array]
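The idf factor for each of the eight terms follows from Nd and the d_k values above (a sketch assuming base-10 logs, consistent with the earlier idf example):

```python
# idf factor log10(Nd / dk) for terms t1..t8, with Nd = 5 documents
# and the per-term postings counts dk from the slide.
import math

Nd = 5
dk = [5, 4, 1, 3, 2, 3, 4, 2]

idf = [round(math.log10(Nd / d), 3) for d in dk]
print(idf)  # [0.0, 0.097, 0.699, 0.222, 0.398, 0.222, 0.097, 0.398]
```

Each document's weight for term k is then its term frequency f_ki multiplied by the corresponding idf entry.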
Similarity Measures: using Term Weights
2. Compute query–document cosine similarity with the tf·idf weights:

cos θ(A, B) = A•B / (|A| |B|) = Σᵢ₌₁ⁿ AᵢBᵢ / (√(Σᵢ₌₁ⁿ Aᵢ²) · √(Σᵢ₌₁ⁿ Bᵢ²))
Appendix
[Figure: geometric construction of the q–d3 triangle in the t1–t2–t3 space: d3 = (1, 0, 1) and q = (1, 1, 0) each have length √(1² + 1²) = √2]
[Figure: decomposition of vector A into components Ax, Ay, Az along the X, Y, Z axes, illustrating |A| = √(Ax² + Ay² + Az²)]