Vector Space Model:
Similarity Measure & Term Weighting
Vector Space Model: Postulate
Search Engine
Documents that are “close together” in vector space “talk about” the same things.

[Figure: documents d1–d5 plotted in a vector space with axes t1, t2, t3; the angle θ between vectors indicates closeness]

Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”).
Similarity Measures: Set-based
Simple matching function
Dice’s coefficient
Jaccard’s coefficient
A = (wd1, wd2, wd3, wd4, wd5)    B = (wd2, wd4, wd6)

A ∩ B: intersection of A and B
• the set of elements that belong to both A and B
• A ∩ B = (wd2, wd4)

A ∪ B: union of A and B
• the set of elements that belong to either A or B
• A ∪ B = (wd1, wd2, wd3, wd4, wd5, wd6)

|A|: cardinality of A
• the number of elements in A
• |A| = 5, |B| = 3, |A ∩ B| = 2, |A ∪ B| = 6

Similarity Scores
• Simple: |A ∩ B| = 2
• Dice: 2·|A ∩ B| / (|A| + |B|) = 2·2/8 = 1/2
• Jaccard: |A ∩ B| / |A ∪ B| = 2/6 = 1/3
SIM(A, B) = |A ∩ B|                      (simple matching)
SIM(A, B) = 2·|A ∩ B| / (|A| + |B|)      (Dice)
SIM(A, B) = |A ∩ B| / |A ∪ B|            (Jaccard)
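The three set-based measures can be sketched in a few lines of Python, using the example sets A and B from the slide:

```python
# Set-based similarity measures: simple matching, Dice, and Jaccard,
# applied to the example term sets A and B from the slide.
A = {"wd1", "wd2", "wd3", "wd4", "wd5"}
B = {"wd2", "wd4", "wd6"}

def simple(a, b):
    # size of the intersection
    return len(a & b)

def dice(a, b):
    # twice the intersection size over the sum of set sizes
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    # intersection size over union size
    return len(a & b) / len(a | b)

print(simple(A, B))   # 2
print(dice(A, B))     # 0.5
print(jaccard(A, B))  # 2/6 ≈ 0.333
```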
Similarity Measures: Set-based Example
Compute the similarity of O1 to O2–O5 by the simple matching function, Dice’s coefficient, and Jaccard’s coefficient.
Object – Attribute (feature) array
O1 = (1, 0, 1, 1, 0, 0, 0, 1)   |O1| = 4
O2 = (1, 0, 0, 0, 1, 1, 0, 0)   |O2| = 3
O3 = (1, 0, 0, 1, 1, 1, 0, 0)   |O3| = 4
O4 = (1, 1, 0, 1, 0, 1, 1, 0)   |O4| = 5
O5 = (1, 1, 1, 1, 0, 0, 1, 1)   |O5| = 6
|O1 ∩ O2| = |(A1)| = 1
|O1 ∩ O3| = |(A1, A4)| = 2
|O1 ∩ O4| = |(A1, A4)| = 2
|O1 ∩ O5| = |(A1, A3, A4, A8)| = 4

|O1 ∪ O2| = |(A1, A3, A4, A5, A6, A8)| = 6
|O1 ∪ O3| = |(A1, A3, A4, A5, A6, A8)| = 6
|O1 ∪ O4| = |(A1, A2, A3, A4, A6, A7, A8)| = 7
|O1 ∪ O5| = |(A1, A2, A3, A4, A7, A8)| = 6
Simple matching:
        O2   O3   O4   O5
SIM     1    2    2    4
Rank    4    2    2    1

Dice’s coefficient:
        O2              O3              O4              O5
SIM     2·1/(4+3)=2/7   2·2/(4+4)=4/8   2·2/(4+5)=4/9   2·4/(4+6)=8/10
Rank    4               2               3               1

Jaccard’s coefficient:
        O2   O3   O4   O5
SIM     1/6  2/6  2/7  4/6
Rank    4    2    3    1
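The table values above can be recomputed programmatically, viewing each binary attribute vector as the set of attribute positions holding a 1 (a small Python sketch, not part of the original slides):

```python
# Similarity of O1 to O2–O5 under the three set-based measures.
# Each binary vector is converted to the set of indices where it is 1.
objects = {
    "O1": (1, 0, 1, 1, 0, 0, 0, 1),
    "O2": (1, 0, 0, 0, 1, 1, 0, 0),
    "O3": (1, 0, 0, 1, 1, 1, 0, 0),
    "O4": (1, 1, 0, 1, 0, 1, 1, 0),
    "O5": (1, 1, 1, 1, 0, 0, 1, 1),
}
sets = {k: {i for i, v in enumerate(vec) if v} for k, vec in objects.items()}

o1 = sets["O1"]
for name in ("O2", "O3", "O4", "O5"):
    o = sets[name]
    simple = len(o1 & o)
    dice = 2 * len(o1 & o) / (len(o1) + len(o))
    jaccard = len(o1 & o) / len(o1 | o)
    print(name, simple, round(dice, 3), round(jaccard, 3))
```

Dice and Jaccard both rank O5 first and O2 last, while simple matching ties O3 and O4 because it ignores object sizes.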
Vector Space Model: Example
[Figure: query q and documents d1, d2, d3 plotted in term space t1, t2, t3]
Query: What is information retrieval?
Q: information 1, retrieval 1
Index Term         d1  d2  d3  q
t1 (information)    1   1   1   1
t2 (retrieval)      1   2   0   1
t3 (seminar)        1   1   1   0

D1: Information retrieval seminars
D2: Retrieval seminars and Information Retrieval
D3: Information seminar
Triangle q–d3: sides √2, √2, √2 → θ = 60°
Triangle q–d1: sides √2, √3, 1 → θ ≈ 35°
Triangle q–d2: sides √2, √6, √2 → θ ≈ 30°
VSM: Vector Length
[Figure: vector A = (Ax, Ay, Az) in 3-dimensional space with axes x, y, z]

|A| = √(Ax² + Ay² + Az²)

In n dimensions: |A| = √(Σᵢ₌₁ⁿ Aᵢ²)
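The vector-length formula translates directly into code (a minimal sketch; the example vectors are d1 and d2 from the earlier slide):

```python
# Vector length (Euclidean norm): square root of the sum of squared
# components, as in |A| = sqrt(Ax^2 + Ay^2 + Az^2).
import math

def vector_length(a):
    return math.sqrt(sum(x * x for x in a))

print(vector_length((1, 1, 1)))  # |d1| = sqrt(3) ≈ 1.732
print(vector_length((1, 2, 1)))  # |d2| = sqrt(6) ≈ 2.449
```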
VSM: Angle between two vectors
Triangle q–d3 in term space: a = |d3| = √2, b = |q| = √2, c = |d3 − q| = √2

cos θ = (2 + 2 − 2) / (2·√2·√2) = 2/4 = 0.5
cos⁻¹(0.5) = 60°
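The Law-of-Cosines computation on these slides can be sketched as a short Python function (a and b are the vector lengths, c the length of their difference):

```python
# Angle between two vectors via the Law of Cosines:
# cos(theta) = (a^2 + b^2 - c^2) / (2ab)
import math

def angle_deg(u, v):
    a = math.sqrt(sum(x * x for x in u))           # |u|
    b = math.sqrt(sum(x * x for x in v))           # |v|
    c = math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))  # |u - v|
    cos_theta = (a * a + b * b - c * c) / (2 * a * b)
    return math.degrees(math.acos(cos_theta))

q, d1, d2, d3 = (1, 1, 0), (1, 1, 1), (1, 2, 1), (1, 0, 1)
print(round(angle_deg(q, d3), 1))  # 60.0
print(round(angle_deg(q, d1), 1))  # 35.3
print(round(angle_deg(q, d2), 1))  # 30.0
```

These are exactly the θ values worked out on the following slides for q–d1 and q–d2.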
VSM: Angle Computation q-d1
Law of Cosines: cos θ = (a² + b² − c²) / (2ab)

Triangle q–d1: a = |q| = √2, b = |d1| = √3, c = |d1 − q| = 1

cos θ = (2 + 3 − 1) / (2·√2·√3) = 4 / (2·√6) = 4/4.9 ≈ 0.82
cos⁻¹(0.82) ≈ 34.9°, so θ ≈ 35°

Sin, Cosine & Tangent (Math is Fun)
VSM: Angle Computation q-d2
Law of Cosines: cos θ = (a² + b² − c²) / (2ab)

Triangle q–d2: a = |q| = √2, b = |d2| = √6, c = |d2 − q| = √2

cos θ = (2 + 6 − 2) / (2·√2·√6) = 6 / (2·√12) = 6/6.92 ≈ 0.87
cos⁻¹(0.87) ≈ 29.5°, so θ ≈ 30°

Sin, Cosine & Tangent (Math is Fun)
Similarity Measures: Vector-based Cosine Similarity (n-dimensional space)
Dot/Scalar product of vectors ÷ product of vector lengths
• Dot product = sum of products of corresponding components
  A = (A1, A2, A3, A4), B = (B1, B2, B3, B4)
  A•B = A1B1 + A2B2 + A3B3 + A4B4
• Vector length = square root of the sum of squared components
  |A| = sqrt[(A1)² + (A2)² + (A3)² + (A4)²],  |B| = sqrt[(B1)² + (B2)² + (B3)² + (B4)²]

Cosine Similarity (3-dimensional space)
A = (Ax, Ay, Az), B = (Bx, By, Bz)
cos θ = (AxBx + AyBy + AzBz) / (√(Ax² + Ay² + Az²) · √(Bx² + By² + Bz²))

cos θ(A, B) = A•B / (|A| |B|) = Σᵢ₌₁ⁿ AᵢBᵢ / (√(Σᵢ₌₁ⁿ Aᵢ²) · √(Σᵢ₌₁ⁿ Bᵢ²))
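The n-dimensional cosine formula is a one-to-one translation into code (a minimal sketch; the example pair is q and d3 from the earlier slide, whose angle is 60°):

```python
# Cosine similarity: dot product of two vectors divided by the
# product of their lengths.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    len_a = math.sqrt(sum(x * x for x in a))
    len_b = math.sqrt(sum(y * y for y in b))
    return dot / (len_a * len_b)

# q = (1, 1, 0), d3 = (1, 0, 1): cos 60° = 0.5
print(cosine_similarity((1, 1, 0), (1, 0, 1)))
```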
Similarity Measures: Vector-based Example
Compute cosine similarities
Rank objects O2 through O5 by descending order of similarity to O1
Object – Attribute (feature) array
O1 = (1, 0, 1, 1, 0, 0, 0, 1)
O2 = (1, 0, 0, 0, 1, 1, 0, 0)
O3 = (1, 0, 0, 1, 1, 1, 0, 0)
O4 = (1, 1, 0, 1, 0, 1, 1, 0)
O5 = (1, 1, 1, 1, 0, 0, 1, 1)

|O1| = sqrt(1²+0²+1²+1²+0²+0²+0²+1²) = sqrt(4)
|O2| = sqrt(1²+0²+0²+0²+1²+1²+0²+0²) = sqrt(3)
|O3| = sqrt(1²+0²+0²+1²+1²+1²+0²+0²) = sqrt(4)
|O4| = sqrt(1²+1²+0²+1²+0²+1²+1²+0²) = sqrt(5)
|O5| = sqrt(1²+1²+1²+1²+0²+0²+1²+1²) = sqrt(6)

O1•O2 = 1·1+0·0+1·0+1·0+0·1+0·1+0·0+1·0 = 1
O1•O3 = 1·1+0·0+1·0+1·1+0·1+0·1+0·0+1·0 = 2
O1•O4 = 1·1+0·1+1·0+1·1+0·0+0·1+0·1+1·0 = 2
O1•O5 = 1·1+0·1+1·1+1·1+0·0+0·0+0·1+1·1 = 4
SIM(O1, O2) = 1 / (√4·√3) = 1/√12 ≈ 0.29
SIM(O1, O3) = 2 / (√4·√4) = 2/√16 = 0.50
SIM(O1, O4) = 2 / (√4·√5) = 2/√20 ≈ 0.45
SIM(O1, O5) = 4 / (√4·√6) = 4/√24 ≈ 0.82

Rank: O5 (0.82) > O3 (0.50) > O4 (0.45) > O2 (0.29)
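These four values can be verified with a short script (a sketch added for checking the arithmetic, not part of the original slides):

```python
# Cosine similarity of O1 to O2–O5, printed in descending order.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

O1 = (1, 0, 1, 1, 0, 0, 0, 1)
others = {
    "O2": (1, 0, 0, 0, 1, 1, 0, 0),
    "O3": (1, 0, 0, 1, 1, 1, 0, 0),
    "O4": (1, 1, 0, 1, 0, 1, 1, 0),
    "O5": (1, 1, 1, 1, 0, 0, 1, 1),
}
for name, vec in sorted(others.items(), key=lambda kv: -cosine(O1, kv[1])):
    print(name, round(cosine(O1, vec), 3))
# Ranks O5, O3, O4, O2 — the same ordering Dice and Jaccard give on this data.
```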
Text Analysis: Word Frequency (TREC Volume 3 Corpus)
Number of documents: 336,310
Total word occurrences: 125,720,891
Unique words: 508,209

Zipf Distribution: Rank × Frequency = constant
• Also observed for population, wealth, and popularity
A few words are very common; most words are very rare.

Term Weights
Represent the ability of terms to identify relevant items and to distinguish them from non-relevant material.
Very common and very rare words are not very useful for indexing (Luhn, 1958).
Good: smaller index → faster retrieval
Bad: lost gems & broken phrases

B. Croft (UMass)
Text Analysis: Term Weighting
Term Weighting Factors

Term frequency (tf)
• Number of times a term occurs in a given document
→ tf(dog, d1) = 2, tf(dog, d2) = 1
→ tf(fox, d1) = 3, tf(fox, d2) = 0
→ tf(party, d1) = 0, tf(party, d2) = 1

Inverse document frequency (idf)
• (Simple) 1 / number of documents in which the term occurs
→ idf(dog) = 1/2, idf(fox) = 1/1, idf(party) = 1/1
• (Default) log(Nd / number of documents in which the term occurs)
→ Nd = number of documents in the collection
→ idf(dog) = log(2/2) = 0, idf(fox) = log(2/1) = 0.3, idf(party) = log(2/1) = 0.3

Document length (dlen)
• Number of tokens in a document
→ Token = an instance/occurrence of a word (not a unique word)
→ dlen(d1) = 11, dlen(d2) = 10
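The default (log-based) idf values above come out directly from the document frequencies (a small sketch assuming base-10 logs, which match the 0.3 value on the slide):

```python
# idf(t) = log10(Nd / df(t)) with Nd = 2 documents.
import math

Nd = 2
df = {"dog": 2, "fox": 1, "party": 1}  # documents containing each term

idf = {t: math.log10(Nd / n) for t, n in df.items()}
print(round(idf["dog"], 1))    # 0.0
print(round(idf["fox"], 1))    # 0.3
print(round(idf["party"], 1))  # 0.3
```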
tf⋅idf formula
Term     d1  d2
quick     1   0
brown     1   0
fox       3   0
over      1   0
lazy      1   0
dog       2   1
back      1   0
now       0   1
time      0   1
all       0   1
good      0   1
men       0   1
come      0   1
jump      1   0
aid       0   1
their     0   1
party     0   1
w_ki = tf × idf = f_ki × log(Nd / d_k)

w_ki = weight of term k in document i
f_ki = frequency of term k in document i (tf)
Nd   = number of documents in the collection
d_k  = number of documents in which term k appears (postings)
Similarity Measures: using Term Weights
1. Compute term weights (e.g., tf·idf) with w_ki = f_ki × log(Nd / d_k)
• Nd = 5
• d_k (postings) for terms t1–t8: d1 = 5, d2 = 4, d3 = 1, d4 = 3, d5 = 2, d6 = 3, d7 = 4, d8 = 2

[Document – Term array]
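The idf factor for each of the eight terms follows from Nd and the d_k values above (a sketch assuming base-10 logs, consistent with the earlier idf example):

```python
# idf factor log10(Nd / dk) for terms t1..t8, with Nd = 5 documents
# and the per-term postings counts dk from the slide.
import math

Nd = 5
dk = [5, 4, 1, 3, 2, 3, 4, 2]

idf = [round(math.log10(Nd / d), 3) for d in dk]
print(idf)  # [0.0, 0.097, 0.699, 0.222, 0.398, 0.222, 0.097, 0.398]
```

Each document's weight for term k is then its term frequency f_ki multiplied by the corresponding idf entry.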
Similarity Measures: using Term Weights
2. Compute query–document cosine similarity with the tf·idf weights:

cos θ(A, B) = A•B / (|A| |B|) = Σᵢ₌₁ⁿ AᵢBᵢ / (√(Σᵢ₌₁ⁿ Aᵢ²) · √(Σᵢ₌₁ⁿ Bᵢ²))
Appendix
[Figure: geometric construction of the q–d3 triangle in the t1–t2–t3 space: d3 = (1, 0, 1) and q = (1, 1, 0) each have length √(1² + 1²) = √2]
[Figure: decomposition of vector A into components Ax, Ay, Az along the X, Y, Z axes, illustrating |A| = √(Ax² + Ay² + Az²)]