Speeding up cosine computation
What if we could take our vectors and “pack” them into fewer dimensions (say from 50,000 down to 100) while preserving distances?
Two methods: “Latent semantic indexing” (LSI) and random projection.
Two approaches
LSI is data-dependent: it creates a k-dim subspace by eliminating redundant axes and pulling together “related” axes – hopefully car and automobile.
Random projection is data-independent: it chooses a k-dim subspace that guarantees, with high probability, bounded stretching of the distances between pairs of points.
Notions from linear algebra
Matrix A, vector v
Matrix transpose (At)
Matrix product
Rank
Eigenvalue λ and eigenvector v: Av = λv
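A minimal NumPy sketch of these notions (the small symmetric matrix is just an illustration, not from the slides): transpose, product, rank, and the eigenvalue relation Av = λv.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])             # small symmetric matrix (illustrative)
v = np.array([1.0, 1.0])

At = A.T                               # matrix transpose
Av = A @ v                             # matrix-vector product

eigvals, eigvecs = np.linalg.eigh(A)   # eigendecomposition of a symmetric matrix
lam, u = eigvals[0], eigvecs[:, 0]     # an eigenvalue and its eigenvector
print(np.allclose(A @ u, lam * u))     # True: A u = λ u
print(np.linalg.matrix_rank(A))        # rank of A (here 2)
```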
Overview of LSI
Pre-process docs using a technique from linear algebra called Singular Value Decomposition
Create a new (smaller) vector space
Queries handled in this new vector space
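A toy sketch of this pre-processing step, assuming a tiny made-up term-doc matrix: NumPy's SVD returns the factors that the next slides call P, Σ and Rt.

```python
import numpy as np

# Toy term-doc matrix A: rows = terms, columns = docs (counts are invented).
A = np.array([
    [2, 0, 1],   # "car"
    [1, 0, 2],   # "automobile"
    [0, 3, 0],   # "ship"
    [0, 1, 1],   # "boat"
], dtype=float)

P, sigma, Rt = np.linalg.svd(A, full_matrices=False)   # A = P @ diag(sigma) @ Rt
print(P.shape, sigma.shape, Rt.shape)                   # (4, 3) (3,) (3, 3)
print(np.allclose(A, P @ np.diag(sigma) @ Rt))          # True
```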
Intuition (contd)
More than dimension reduction: derive a set of new uncorrelated features (roughly, artificial concepts), one per dimension.
Docs with lots of overlapping terms stay together.
Terms also get pulled together onto the same dimension.
Each term or document is then characterized by a vector of weights indicating its strength of association with each of these underlying concepts.
Ex. car and automobile get pulled together, since they co-occur in docs with tires, radiator, cylinder, …
Here comes the “semantic”!!!
Singular-Value Decomposition
Recall the m × n matrix of terms × docs, A. A has rank r ≤ m, n.
Define the term-term correlation matrix T = A At.
T is a square, symmetric m × m matrix. Let P be the m × r matrix of eigenvectors of T.
Define the doc-doc correlation matrix D = At A.
D is a square, symmetric n × n matrix. Let R be the n × r matrix of eigenvectors of D.
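A quick numerical check of these definitions (random toy data, assumed shapes): the eigenvectors of T and D give the columns of P and R, and the nonzero eigenvalues of T and D coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 7))                 # toy 5 x 7 term-doc matrix

T = A @ A.T                            # term-term correlation, 5 x 5
D = A.T @ A                            # doc-doc correlation, 7 x 7

evals_T, P = np.linalg.eigh(T)         # columns of P: eigenvectors of T
evals_D, R = np.linalg.eigh(D)         # columns of R: eigenvectors of D

# The nonzero eigenvalues of T and D coincide (they are the squared singular values of A).
print(np.allclose(evals_T, evals_D[-5:]))   # True
```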
A’s decomposition
There exist matrices P (for T, m × r) and R (for D, n × r) formed by orthonormal columns (unit dot-product).
It turns out that A = P Σ Rt, where Σ is an r × r diagonal matrix whose entries are the singular values of A (the square roots of the eigenvalues of T = A At), in decreasing order.

A = P Σ Rt
(m × n) = (m × r)(r × r)(r × n)
For some k << r, zero out all but the k biggest singular values in Σ [the choice of k is crucial].
Denote by Σk this new version of Σ, having rank k.
Typically k is about 100, while r (A’s rank) is > 10,000.

Ak = P Σk Rt
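A minimal truncation sketch (toy sizes, k = 10 chosen arbitrarily): np.linalg.svd returns the singular values already in decreasing order, so zeroing all but the first k yields the rank-k matrix Ak.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((50, 80))               # toy m x n term-doc matrix

P, sigma, Rt = np.linalg.svd(A, full_matrices=False)   # sigma sorted in decreasing order

k = 10
sigma_k = np.zeros_like(sigma)
sigma_k[:k] = sigma[:k]                # zero out all but the k biggest singular values

A_k = P @ np.diag(sigma_k) @ Rt        # rank-k version of A
print(np.linalg.matrix_rank(A_k))      # 10
```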
Dimensionality reduction
Figure: Ak = P Σk Rt. Because of the 0-columns/0-rows of Σk, the last r−k columns of P and the last r−k rows of Rt are useless, so the product effectively involves matrices of size m × k, k × k, and k × n (instead of m × r, r × r, r × n); each column of Ak is the representation of a document.
Guarantee
Ak is a pretty good approximation to A: relative distances are (approximately) preserved.
Of all m × n matrices of rank k, Ak is the best approximation to A w.r.t. the following measures:
min{B : rank(B)=k} ||A − B||2 = ||A − Ak||2 = σ_{k+1}
min{B : rank(B)=k} ||A − B||F² = ||A − Ak||F² = σ_{k+1}² + σ_{k+2}² + … + σ_r²
where the Frobenius norm is ||A||F² = σ_1² + σ_2² + … + σ_r²
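These bounds are easy to verify numerically; a toy check (random matrix, k = 5 arbitrary) of both error measures:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((30, 40))

P, sigma, Rt = np.linalg.svd(A, full_matrices=False)
k = 5
A_k = P[:, :k] @ np.diag(sigma[:k]) @ Rt[:k, :]        # best rank-k approximation

spec_err = np.linalg.norm(A - A_k, 2)                  # spectral norm of the error
frob_err = np.linalg.norm(A - A_k, 'fro')              # Frobenius norm of the error

print(np.isclose(spec_err, sigma[k]))                           # True: equals sigma_{k+1}
print(np.isclose(frob_err, np.sqrt(np.sum(sigma[k:] ** 2))))    # True: sqrt of the discarded sigma^2
```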
Reduction
Xk = Σk Rt is the doc-matrix reduced to k < n dimensions.
Take the doc-correlation matrix: it is D = At A = (P Σ Rt)t (P Σ Rt) = (Σ Rt)t (Σ Rt), since Pt P = I.
Approximating Σ with Σk, we thus get At A ≈ Xkt Xk.
We use Xk to approximate A: Xk = Σk Rt = Pkt A.
This means that to reduce a doc/query vector it is enough to multiply it by Pkt (i.e. a k × m matrix).
The cost of sim(q,d), for all d, is O(kn + km) instead of O(mn).
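A small end-to-end sketch of this reduction (sizes and data are made up): docs and the query are folded into k dimensions by multiplying with Pkt, and cosine similarities are then computed in the reduced space.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 1000, 200, 20
A = rng.random((m, n))                  # toy term-doc matrix

P, sigma, Rt = np.linalg.svd(A, full_matrices=False)
Pk = P[:, :k]                           # m x k

Xk = Pk.T @ A                           # k x n reduced doc-matrix (= the top-k rows of Sigma_k Rt)
q = rng.random(m)                       # a query vector in term space
qk = Pk.T @ q                           # reduced query, k components

# Cosine similarity of the query against all n docs, computed in k dimensions.
sims = (Xk.T @ qk) / (np.linalg.norm(Xk, axis=0) * np.linalg.norm(qk))
print(sims.shape)                       # (200,)
```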
R, P are formed by the orthonormal eigenvectors of the matrices D, T.
What are the concepts?
The c-th concept = the c-th row of Pkt (which is k × m).
Denote it by Pkt[c]; note that its size is m = #terms.
Pkt[c][i] = strength of association between the c-th concept and the i-th term.
Projected document: d’j = Pkt dj
d’j[c] = strength of concept c in dj
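An illustrative look at the concepts (the tiny vocabulary and counts below are invented): the rows of Pkt separate the car-related terms from the boat-related ones, and Pkt dj gives the concept strengths of a document.

```python
import numpy as np

terms = ["car", "automobile", "tires", "ship", "boat", "ocean"]
A = np.array([
    [2, 1, 0, 0],   # car
    [1, 2, 0, 0],   # automobile
    [1, 1, 0, 0],   # tires
    [0, 0, 3, 1],   # ship
    [0, 0, 1, 2],   # boat
    [0, 0, 2, 2],   # ocean
], dtype=float)

P, sigma, Rt = np.linalg.svd(A, full_matrices=False)
k = 2
Pk_t = P[:, :k].T                       # k x m: row c is the c-th concept

for c in range(k):
    top = np.argsort(-np.abs(Pk_t[c]))[:3]          # terms most associated with concept c
    print(f"concept {c}:", [terms[i] for i in top])

d0 = A[:, 0]                            # first document
print("projected doc:", Pk_t @ d0)      # strength of each concept in d0
```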
Gaussians are good!!
NOTE: every column of R has unit norm and is uniformly distributed over the unit sphere; moreover, the k columns of R are orthonormal on average.
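For contrast with LSI, a random projection sketch (the dimensions and the √(m/k) rescaling follow the standard Gaussian construction and are assumptions here): a data-independent matrix with unit-norm Gaussian columns approximately preserves the distance between two vectors.

```python
import numpy as np

rng = np.random.default_rng(4)
m, k = 10000, 400

# m x k matrix: each column is a Gaussian vector normalized to unit length,
# hence uniformly distributed over the unit sphere (see the NOTE above).
R = rng.normal(size=(m, k))
R /= np.linalg.norm(R, axis=0)

x = rng.random(m)
y = rng.random(m)

d_original  = np.linalg.norm(x - y)
d_projected = np.sqrt(m / k) * np.linalg.norm(R.T @ (x - y))   # rescale to match the original scale
print(d_original, d_projected)          # close to each other with high probability
```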