Latent Semantic Indexing (LSI)
CE-324: Modern Information Retrieval
Sharif University of Technology
M. Soleymani
Fall 2016
Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
Vector space model: pros
Partial matching of queries and docs
dealing with the case where no doc contains all search terms
Ranking according to similarity score
Term weighting schemes
improve retrieval performance
Various extensions
Relevance feedback (modifying query vector)
Doc clustering and classification
Problems with lexical semantics
Ambiguity and association in natural language
Polysemy: Words often have a multitude of meanings and
different types of usage
More severe in very heterogeneous collections.
The vector space model is unable to discriminate between
different meanings of the same word.
Problems with lexical semantics
Synonymy: Different terms may have identical or similar
meanings (weaker: words indicating the same topic).
No associations between words are made in the vector
space representation.
Polysemy and context
Doc similarity on single word level: polysemy and context
[Figure: meaning 1 — ring, jupiter, space, voyager, saturn, planet, ...; meaning 2 — car, company, dodge, ford, ... A word contributes to similarity if it is used in the 1st meaning in both docs, but not if it is used in the 2nd.]
SVD
Latent Semantic Indexing (LSI)
Perform a low-rank approximation of the doc-term matrix (typical rank 100-300).
Term-doc matrices are very large, but the number of topics that people talk about is small (in some sense).
General idea: Map docs (and terms) to a low-dimensional space
Design a mapping such that the low-dimensional space reflects semantic associations (the latent semantic space)
Compute doc similarity based on the inner product in this latent semantic space
Goals of LSI
Similar terms map to similar locations in the low-dimensional space
Noise reduction by dimension reduction
Term-document matrix
This matrix is the basis for computing similarity between docs and queries.
Can we transform this matrix so that we get a better measure of similarity between docs and queries?
Singular Value Decomposition (SVD)
For an M×N matrix A of rank r there exists a factorization:
A = U Σ V^T
where U is M×M, Σ is M×N, and V^T is N×N.
The columns of U are orthogonal eigenvectors of AA^T.
The columns of V are orthogonal eigenvectors of A^T A.
Singular values: Σ = diag(σ1, …, σr) with σ_i = √λ_i, where the eigenvalues λ1, …, λr of AA^T are also the eigenvalues of A^T A.
Typically, the singular values are arranged in decreasing order.
Truncated SVD
A = U Σ V^T
where U is M × min(M,N), Σ is min(M,N) × min(M,N), and V^T is min(M,N) × N.
SVD example
M = 3, N = 2:
A =
[ 1 −1 ]
[ 0  1 ]
[ 1  0 ]

Its thin SVD A = U Σ V^T is
U =
[  0      2/√6 ]
[ 1/√2   −1/√6 ]
[ 1/√2    1/√6 ]
Σ =
[ 1  0  ]
[ 0  √3 ]
V^T =
[ 1/√2    1/√2 ]
[ 1/√2   −1/√2 ]

Or equivalently, with a full 3×3 U and a 3×2 Σ:
U =
[  0      2/√6    1/√3 ]
[ 1/√2   −1/√6    1/√3 ]
[ 1/√2    1/√6   −1/√3 ]
Σ =
[ 1  0  ]
[ 0  √3 ]
[ 0  0  ]
with the same V^T.
(Note that the singular values in this example are not yet arranged in decreasing order.)
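A quick numerical check of this example, written as a minimal sketch in NumPy (the library choice is an assumption; the slides do not prescribe one):

```python
import numpy as np

A = np.array([[1., -1.],
              [0.,  1.],
              [1.,  0.]])

# Thin SVD: U is 3x2, s holds the singular values, Vt is 2x2.
# NumPy returns the singular values in decreasing order (sqrt(3), 1),
# i.e., the columns appear in the opposite order to the slide's example.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(s)                                                      # approx [1.732, 1.0]
print(np.allclose(U @ np.diag(s) @ Vt, A))                    # True: A = U Sigma V^T
print(np.allclose(s**2, np.linalg.eigvalsh(A.T @ A)[::-1]))   # sigma_i = sqrt(lambda_i)
```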
Example
We use a non-weighted matrix here to simplify the example.
Example of 𝐶 = 𝑈Σ𝑉𝑇: All four matrices
𝐶 = 𝑈Σ𝑉𝑇
Example of 𝐶 = 𝑈Σ𝑉𝑇: matrix 𝑈
One row per term, one column per min(M,N) dimension
Columns: "semantic" dimensions (distinct topics like politics, sports, ...)
u_ij: how strongly related term i is to the topic in column j.
Example of 𝐶 = 𝑈Σ𝑉𝑇: The matrix Σ
Σ is a square, diagonal matrix of size min(M,N) × min(M,N).
Each singular value measures the importance of the corresponding semantic dimension.
We'll make use of this by omitting unimportant dimensions.
Example of 𝐶 = 𝑈Σ𝑉𝑇: The matrix 𝑉𝑇
One column per doc, one row per min(M,N) dimension
Columns of V: "semantic" dimensions
v_ij: how strongly related doc i is to the topic in column j.
Matrix decomposition: Summary
We’ve decomposed the term-doc matrix 𝐶 into a
product of three matrices.
𝑈: consists of one (row) vector for each term
𝑉𝑇: consists of one (column) vector for each doc
Σ: diagonal matrix with singular values, reflecting importance of
each dimension
Next:Why are we doing this?
LSI: Overview
Decompose term-doc matrix 𝐶 into a product of
matrices using SVD
𝐶 = 𝑈Σ𝑉𝑇
We use columns of matrices 𝑈 and 𝑉 that correspond to the
largest values in the diagonal matrix Σ as term and document
dimensions in the new space
Using the SVD for this purpose is called LSI.
Low-rank approximation
Approximation problem: Given matrix A, find matrix A_k of rank k (i.e., a matrix with k linearly independent rows or columns) such that
A_k = argmin_{X: rank(X) = k} ‖A − X‖_F (Frobenius norm)
A_k and X are both M × N matrices.
Typically, we want k ≪ r.

Low-rank approximation: Solution via SVD
Set the smallest r − k singular values to zero (we retain only k singular values):
A_k = U diag(σ1, …, σk, 0, …, 0) V^T
Dimensions: M × N ≈ (M × k)(k × k)(k × N) once the zeroed rows and columns are dropped.
Column notation (sum of rank-1 matrices):
A_k = ∑_{i=1}^{k} σ_i u_i v_i^T
SVD can be used to compute optimal low-rank approximations:
Keeping the k largest singular values and setting all others to zero gives the optimal approximation [Eckart-Young].
No matrix of rank k can approximate A better than A_k.
Approximation error
How good (bad) is this approximation? It's the best possible, measured by the Frobenius norm of the error:
min_{X: rank(X) = k} ‖A − X‖_F = ‖A − A_k‖_F = √(σ_{k+1}² + ⋯ + σ_r²)
where A_k = U diag(σ1, …, σk, 0, …, 0) V^T and the σ_i are ordered such that σ_i ≥ σ_{i+1}.
This suggests why the Frobenius error drops as k increases.
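A minimal sketch of this in NumPy (the random matrix and the library choice are assumptions, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((60, 50))                       # any M x N matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 10
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # keep only the k largest singular values

# Eckart-Young: the Frobenius error equals the norm of the discarded singular values.
print(np.linalg.norm(A - A_k, 'fro'))
print(np.sqrt(np.sum(s[k:] ** 2)))             # the two printed values agree
```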
SVD Low-rank approximation
A term-doc matrix C may have M = 50,000 and N = 10^6, with rank close to 50,000.
Construct an approximation C_100 with rank 100.
Of all rank 100 matrices, it would have the lowest Frobenius
error.
Great … but why would we??
Answer: Latent Semantic Indexing
C. Eckart, G. Young, The approximation of a matrix by another of lower rank.
Psychometrika, 1, 211-218, 1936.
Recall unreduced decomposition 𝐶 = 𝑈Σ𝑉𝑇
Reducing the dimensionality to 2
Original matrix 𝐶 vs. reduced 𝐶2 = 𝑈Σ2𝑉𝑇
We can view C2 as a two-dimensional representation of C: we have performed a dimensionality reduction to two dimensions.
Why is the reduced matrix “better”?
Similarity of d2 and d3 in the original space: 0.
Similarity of d2 and d3 in the reduced space:
0.52 * 0.28 + 0.36 * 0.16 + 0.72 * 0.36 + 0.12 * 0.20 + (−0.39) * (−0.08) ≈ 0.52
Why is the reduced matrix "better"?
“boat” and “ship” are semantically similar.
The “reduced” similarity measure reflects this.
What property of the SVD reduction is responsible for improved similarity?
Example
[Example from Dumais et al.]
Example
[Example from Dumais et al.]
Example (k=2)
[Example from Dumais et al.: the truncated matrices U_k and Σ_k V_k^T]
33
graph
tree
minor
survey
time
responseuser
computer
interface
humanEPS
system
Squares: terms
Circles: docs
34 [Example from Dumais et. al]
How we use the SVD in LSI
Key property of SVD: Each singular value tells us how
important its dimension is.
By setting less important dimensions to zero, we keep the
important information, but get rid of the “details”.
These details may:
be noise ⇒ the reduced LSI matrix is a better representation
make things dissimilar that should be similar ⇒ the reduced LSI matrix is a better representation because it represents similarity better.
How does LSI address synonymy and semantic relatedness?
Docs may be semantically similar but are not similar in the
vector space (when we talk about the same topics but use
different words).
Desired effect of LSI: Synonyms contribute strongly to doc similarity.
Standard vector space: Synonyms contribute nothing to doc similarity.
LSI (via SVD) selects the “least costly” mapping:
different words (= different dimensions of the full space) are
mapped to the same dimension in the reduced space.
Thus, it maps synonyms or semantically related words to the same dimension.
The "cost" of mapping synonyms to the same dimension is much less than the cost of collapsing unrelated words.
Thus, LSI will avoid doing that for unrelated words.
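A minimal sketch of this effect in NumPy (the tiny term-doc matrix below is a made-up illustration in the spirit of the ship/boat example; all values are assumptions):

```python
import numpy as np

# Rows: ship, boat, ocean, wood, tree.  Columns: docs d1..d6 (hypothetical counts).
C = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)

d2, d3 = C[:, 1], C[:, 2]
print(d2 @ d3)                               # 0.0: d2 (boat, ocean) and d3 (ship) share no terms

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
C2 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-2 approximation

# In the reduced matrix the two docs overlap: "ship" and "boat" have been
# pulled onto the same latent dimension via their shared "ocean" contexts.
print(C2[:, 1] @ C2[:, 2])                   # clearly nonzero
```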
Performing the maps
Each row and column of C gets mapped into the k-dimensional LSI space by the SVD.
A query q is also mapped into this space:
q_k = q^T U_k Σ_k^−1
Since V_k = C_k^T U_k Σ_k^−1, we should transform the query q in the same way to obtain q_k.
The mapped query is NOT a sparse vector.
Claim: this is not only the mapping with the best (Frobenius error) approximation to C, but it also improves retrieval.
Implementation
Compute SVD of term-doc matrix
Map docs to the reduced space
Map the query into the reduced space: q_k = q^T U_k Σ_k^−1
Compute similarity of 𝑞𝑘 with all reduced docs in 𝑉𝑘 .
Output ranked list of docs as usual
What is the fundamental problem with this approach?
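A minimal sketch of these steps in NumPy (function and variable names are placeholders, not from the slides):

```python
import numpy as np

def lsi_retrieval(C, q, k):
    """C: M x N term-doc matrix, q: length-M query term vector, k: reduced rank."""
    # 1. Compute the SVD of the term-doc matrix and truncate to rank k.
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

    # 2. Docs in the reduced space: one row of V_k per doc.
    docs_k = Vt_k.T

    # 3. Map the query into the reduced space: q_k = q^T U_k Sigma_k^{-1}.
    q_k = q @ U_k / s_k

    # 4. Cosine similarity of q_k with every reduced doc, then rank.
    sims = (docs_k @ q_k) / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k) + 1e-12)
    return np.argsort(-sims), sims
```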
Empirical evidence
Experiments on TREC 1/2/3 – Dumais
Lanczos SVD code (available on netlib) due to Berry used
in these experiments
Running times of ~ one day on tens of thousands of docs [still an
obstacle to use]
Dimensions – various values 250-350 reported.
Reducing k improves recall.
Under 200 reported unsatisfactory
Generally expect recall to improve – what about precision?
Empirical evidence
Precision at or above median TREC precision
Top scorer on almost 20% of TREC topics
Slightly better on average than straight vector spaces
Effect of dimensionality:
Dimensions Precision
250 0.367
300 0.371
346 0.374
But why is this clustering?
We’ve talked about docs, queries, retrieval and
precision here.
What does this have to do with clustering?
Intuition: Dimension reduction through LSI brings
together “related” axes in the vector space.
Simplistic picture
Topic 1
Topic 2
Topic 3
Reference
Chapter 18 of the IIR book (Introduction to Information Retrieval)