TF-IDF Space
An obvious way to combine TF and IDF: the coordinate of document $d$ in axis $t$ is given by
$d_t = \mathrm{TF}(d, t) \cdot \mathrm{IDF}(t)$
The general form of $d_t$ consists of three parts:
$d_t = L_{td} \, G_t \, D_d$
$L_{td}$: local weight for term $t$ occurring in document $d$
$G_t$: global weight for term $t$ occurring in the corpus
$D_d$: document normalization factor
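As a concrete illustration, here is a minimal numpy sketch of this weighting scheme, assuming raw term counts for $L_{td}$, $\log(n/\mathrm{df}_t)$ for $G_t$, and unit-length (L2) column normalization for $D_d$; the toy count matrix is made up:

```python
import numpy as np

# Toy term-count matrix: rows = terms, columns = documents.
counts = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 3.0, 1.0]])

n_docs = counts.shape[1]
df = np.count_nonzero(counts, axis=1)     # document frequency of each term
G = np.log(n_docs / df)                   # global weight G_t (IDF)
W = counts * G[:, np.newaxis]             # local weight L_td times G_t
W = W / np.linalg.norm(W, axis=0)         # normalization D_d: unit-length columns
print(W)                                  # TF-IDF coordinates of each document
```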
Term-by-Document Matrix
A document collection (corpus) composed of $n$ documents that are indexed by $m$ terms (tokens) can be represented as an $m \times n$ matrix $A$.
Summary
Tokenization
Removing stopwords
Stemming
Term Weighting
TF: local weight; IDF: global weight; normalization
TF-IDF Vector Space
Term-by-Document Matrix
Problems with Vector Space Model
How to define/select the 'basic concepts'? The VS model treats each term as a basis vector, e.g., q = ('microsoft', 'software'), d = ('windows_xp')
How to assign weights to different terms? We need to distinguish common words from uninformative words; a weight in the query indicates the importance of a term, while a weight in a document indicates how well the term characterizes that document
How to define a similarity/distance function?
How to store the term-by-document matrix?
Choice of ‘Basic Concepts’
(Figure: a document D1 plotted in a term space whose axes are the terms 'Java', 'Microsoft', and 'Starbucks'.)
Short Review of Linear Algebra
The Terms that You Have to Know!
Basis, linear independence, orthogonality
Column space, row space, rank
Linear combination
Linear transformation
Inner product
Eigenvalue, eigenvector
Projection
Least Squares Problem: $Ax \approx b$
The normal equation for the LS problem: $A^T A x = A^T b$
Geometrically, we are finding the projection of $b$ onto $\mathrm{col}(A)$.
The projection matrix: $P = A (A^T A)^{-1} A^T \in \mathbb{R}^{m \times m}$
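A small numpy sketch (random toy data) checking that the normal-equation solution and the projection matrix behave as stated:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))   # full column rank with probability 1
b = rng.standard_normal(6)

x = np.linalg.solve(A.T @ A, A.T @ b)   # solve the normal equation A^T A x = A^T b
P = A @ np.linalg.inv(A.T @ A) @ A.T    # projection matrix onto col(A)

assert np.allclose(A @ x, P @ b)                             # Ax is the projection of b
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])  # matches a direct LS solver
```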
Let $A \in \mathbb{R}^{m \times n}$ be a matrix with full column rank.
If $A$ has orthonormal columns, then the LS problem becomes easy:
$Pb = AA^T b = \sum_{i=1}^{n} A_{\cdot i} A_{\cdot i}^T \, b = \sum_{i=1}^{n} (A_{\cdot i}^T \, b) \, A_{\cdot i}$
Think of it as an orthonormal axis system.
Matrix Factorization
LU-Factorization: $A = LU$
Very useful for solving systems of linear equations; some row exchanges may be required
QR-Factorization: $A = QR$, where $A \in \mathbb{R}^{m \times n}$, $Q \in \mathbb{R}^{m \times n}$, $R \in \mathbb{R}^{n \times n}$
Every matrix $A \in \mathbb{R}^{m \times n}$ with linearly independent columns can be factored into $A = QR$. The columns of $Q$ are orthonormal, and $R$ is upper triangular and invertible. When $m = n$ and all matrices are square, $Q$ becomes an orthogonal matrix ($Q^T Q = I$).
QR Factorization Simplifies the Least Squares Problem
The normal equation for the LS problem: $A^T A x = A^T b$
$A^T A x = R^T Q^T Q R x = R^T R x = R^T Q^T b$
$\Leftrightarrow \; R x = Q^T b$ (since $R^T$ is invertible)
$A_{\cdot j} = Q \cdot R_{\cdot j} = \sum_{k=1}^{n} R_{kj} \, Q_{\cdot k}$
Note: the orthogonal matrix $Q$ constructs the column space of the matrix $A$.
LS problem: finding the projection of $b$ onto $\mathrm{col}(A)$.
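A brief numpy sketch of the simplification (toy data again): solve $Rx = Q^T b$ by back-substitution and compare with a direct LS solver:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)

Q, R = np.linalg.qr(A)            # thin QR: Q is 6x3 with orthonormal columns, R is 3x3
x = solve_triangular(R, Q.T @ b)  # back-substitution on the triangular system R x = Q^T b

assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])
```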
Motivation for Computing QR of the Term-by-Doc Matrix
The basis vectors of the column space of $A$ can be used to describe the semantic content of the corresponding text collection.
Let $\theta_k$ be the angle between a query $q$ and the document vector $A_{\cdot k}$:
$\cos\theta_k = \dfrac{A_{\cdot k}^T \, q}{\|A_{\cdot k}\|_2 \, \|q\|_2} = \dfrac{(Q R_{\cdot k})^T \, q}{\|Q R_{\cdot k}\|_2 \, \|q\|_2} = \dfrac{R_{\cdot k}^T \, (Q^T q)}{\|R_{\cdot k}\|_2 \, \|q\|_2}$
(The last step uses the fact that $Q$ has orthonormal columns, so $\|Q R_{\cdot k}\|_2 = \|R_{\cdot k}\|_2$.)
That means we can keep $Q$ and $R$ instead of $A$.
QR can also be applied to dimension reduction.
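A quick numpy check that the cosine scores computed from $A$ match those computed from $R$ and $Q^T q$ alone (toy matrix and query):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((5, 4))   # toy term-by-document matrix (5 terms, 4 documents)
q = rng.random(5)        # toy query vector in term space

Q, R = np.linalg.qr(A)

# Cosine scores computed directly from A ...
direct = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
# ... and from the factors only: R^T (Q^T q) / (||R_col||_2 * ||q||_2).
via_qr = (R.T @ (Q.T @ q)) / (np.linalg.norm(R, axis=0) * np.linalg.norm(q))

assert np.allclose(direct, via_qr)
```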
Singular Value Decomposition (SVD)
$A = U\Sigma V^T$, where $A \in \mathbb{R}^{m \times n}$, $U \in \mathbb{R}^{m \times m}$, $V \in \mathbb{R}^{n \times n}$, $\Sigma \in \mathbb{R}^{m \times n}$
The columns of $U$ are eigenvectors of $AA^T$, and the columns of $V$ are eigenvectors of $A^T A$.
$\Sigma = \begin{bmatrix} \sigma_1 & \cdots & 0 & 0 \\ \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & \sigma_r & 0 \\ 0 & \cdots & 0 & 0 \end{bmatrix} \in \mathbb{R}^{m \times n}, \quad r = \min(m, n), \quad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r$
The singular values $\sigma_i$ are the square roots of the nonzero eigenvalues of both $AA^T$ and $A^T A$.
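In numpy, the full SVD and its eigenvalue connection can be checked directly (toy matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))

U, s, Vt = np.linalg.svd(A)   # full SVD: U is 4x4, Vt is 3x3, s holds sigma_1..sigma_3
Sigma = np.zeros((4, 3))
Sigma[:3, :3] = np.diag(s)    # singular values on the diagonal, zero rows below
assert np.allclose(A, U @ Sigma @ Vt)

# The singular values are square roots of the eigenvalues of A^T A.
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]   # reversed into descending order
assert np.allclose(s, np.sqrt(eigvals))
```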
Singular Value Decomposition (SVD)
$\begin{bmatrix} -1 & 1 & 0 \\ 0 & -1 & 1 \end{bmatrix} = \begin{bmatrix} -\frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \end{bmatrix} \begin{bmatrix} \sqrt{3} & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} \frac{\sqrt{6}}{6} & -\frac{\sqrt{6}}{3} & \frac{\sqrt{6}}{6} \\ -\frac{\sqrt{2}}{2} & 0 & \frac{\sqrt{2}}{2} \\ \frac{\sqrt{3}}{3} & \frac{\sqrt{3}}{3} & \frac{\sqrt{3}}{3} \end{bmatrix}$
$A = U\Sigma V^T$
$AA^T = U\Sigma V^T V \Sigma^T U^T = U \Sigma \Sigma^T U^T \;\Rightarrow\; \mathrm{col}(A) = \mathrm{col}(U)$
$A^T A = V \Sigma^T \Sigma V^T \;\Rightarrow\; \mathrm{row}(A) = \mathrm{col}(V)$
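A quick numpy verification of the worked $2 \times 3$ example above, entering the three factors exactly as written:

```python
import numpy as np

A = np.array([[-1.0, 1.0, 0.0],
              [0.0, -1.0, 1.0]])

s2, s3, s6 = np.sqrt([2.0, 3.0, 6.0])
U = np.array([[-s2/2, s2/2],
              [ s2/2, s2/2]])
Sigma = np.array([[s3, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
Vt = np.array([[ s6/6, -s6/3, s6/6],
               [-s2/2,  0.0,  s2/2],
               [ s3/3,  s3/3, s3/3]])

assert np.allclose(A, U @ Sigma @ Vt)   # the factorization reproduces A
```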
Latent Semantic Indexing (LSI)
Basic idea: explore the correlation between words and documents
Two words are correlated when they co-occur many times.
Two documents are correlated when they share many words.
Latent Semantic Indexing (LSI)
Computation: using singular value decomposition (SVD)
Concept Space
(Figure: the SVD factors interpreted as a concept space — the left factor gives the representation of concepts in term space, the right factor gives the representation of concepts in document space, and m is the number of concepts/topics.)
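A minimal LSI sketch via truncated SVD (toy matrix; the number of concepts k is chosen by hand here, which, as noted below, is usually the hard part):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.random((8, 5))                    # toy term-by-document matrix

k = 2                                     # number of latent concepts, chosen by hand
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

term_space_concepts = U_k                 # concepts represented in term space
doc_space_concepts = np.diag(s_k) @ Vt_k  # documents represented in concept space

X_k = U_k @ doc_space_concepts            # rank-k approximation of X
print(np.linalg.norm(X - X_k))            # reconstruction error (Frobenius norm)
```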
SVD: Example: m = 2
(Figure, repeated across four slides: X approximated factor by factor as a product of three matrices, keeping the two largest singular values, 3.34 and 2.54, on the diagonal of the middle matrix.)
SVD: Eigenvalues
(Figure: plot of the singular values in decreasing order.)
Determining m is usually difficult.
SVD: Orthogonality
(Figure: the same factorization, highlighting the columns $u_1, u_2$ of $U$ and the rows $v_1, v_2$ of $V^T$.)
$u_1 \cdot u_2 = 0$
$v_1 \cdot v_2 = 0$
SVD: Properties
rank(S): the maximum number of linearly independent row (or column) vectors of a matrix S.
SVD produces the best low-rank approximation (in the Frobenius-norm sense, by the Eckart–Young theorem).
X: rank(X) = 9; X′: rank(X′) = 2
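A numpy sketch of this low-rank property (a random toy matrix standing in for the rank-9 X of the example):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.random((12, 9))                       # toy matrix; almost surely rank 9

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X2 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-2 approximation of X

print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(X2))   # 9 and 2
# The Frobenius error equals the energy in the discarded singular values.
assert np.isclose(np.linalg.norm(X - X2), np.sqrt(np.sum(s[k:] ** 2)))
```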
SVD: Visualization
(Figure: X written out as the product of the three SVD factors.)
SVD tries to preserve the Euclidean distances between document vectors.
Principal Components Analysis
An unsupervised method for dimension reduction.
The principal component is the direction such that the projections of all data points onto this direction are most spread out.
An important fact: if $x \sim N(\mu, \Sigma)$ with $\mu \in \mathbb{R}^d$, $\Sigma \in \mathbb{R}^{d \times d}$, and $w \in \mathbb{R}^d$, then
$w^T x \sim N(w^T \mu, \; w^T \Sigma w)$
We are looking for the direction $w$ with $\|w\|_2 = 1$ such that the projected variance $w^T \Sigma w$ is maximized.
Principal Components Analysis
An unsupervised method for dimension reduction.
$\max_{w} \; w^T \Sigma w \quad \text{subject to} \quad \|w\|_2 = 1$
$\Rightarrow \; L(w, \lambda) = w^T \Sigma w - \lambda \, (w^T w - 1)$
$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; \Sigma w = \lambda w$
Don't forget the purpose: maximize $w^T \Sigma w$. At a stationary point, $w^T \Sigma w = \lambda \, w^T w = \lambda$, so the eigenvector with the largest eigenvalue is the right choice!
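A minimal PCA sketch in numpy (made-up anisotropic data), finding the first principal component as the top eigenvector of the sample covariance:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((200, 3)) @ np.diag([3.0, 1.0, 0.3])   # anisotropic toy data

Sigma = np.cov(X, rowvar=False)            # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh returns eigenvalues in ascending order

w = eigvecs[:, -1]                         # first PC: eigenvector of the largest eigenvalue
assert np.isclose(w @ Sigma @ w, eigvals[-1])   # projected variance equals the eigenvalue
print(w, eigvals[-1])
```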
The Second Principal Component
Add one more constraint for the second PC: it should be orthogonal to the first PC $w^*$.
$\max_{w} \; w^T \Sigma w \quad \text{subject to} \quad \|w\|_2 = 1, \;\; w^T w^* = 0$
$\Rightarrow \; L(w, \lambda, \mu) = w^T \Sigma w - \lambda \, (w^T w - 1) - \mu \, w^T w^*$
$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; 2\Sigma w - 2\lambda w - \mu w^* = 0$
Left-multiplying by $(w^*)^T$ (with $(w^*)^T w^* = 1$) and using $\Sigma w^* = \lambda_1 w^*$: $2\lambda_1 (w^*)^T w - 2\lambda \, (w^*)^T w = \mu$, so the constraint $(w^*)^T w = 0$ gives $\mu = 0$.
$\Rightarrow \; \Sigma w = \lambda w, \quad w^T w^* = 0$: the second PC is the eigenvector with the second-largest eigenvalue.
Singular Value Decomposition (SVD)
Assume $m > n$.
$A = U\Sigma V^T$, where $A \in \mathbb{R}^{m \times n}$, $U \in \mathbb{R}^{m \times n}$, $V \in \mathbb{R}^{n \times n}$, $\Sigma \in \mathbb{R}^{n \times n}$
The columns of $U$ are eigenvectors of $AA^T$, and the columns of $V$ are eigenvectors of $A^T A$.
$\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$, with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n$
The singular values $\sigma_i$ are the square roots of the nonzero eigenvalues of both $AA^T$ and $A^T A$.
How to Compute SVD?
$A = U\Sigma V^T$, where $A \in \mathbb{R}^{m \times n}$, $U \in \mathbb{R}^{m \times n}$, $V \in \mathbb{R}^{n \times n}$, $\Sigma \in \mathbb{R}^{n \times n}$
$AA^T \in \mathbb{R}^{m \times m}$ and $A^T A \in \mathbb{R}^{n \times n}$
Q1: Which one, $U$ or $V$, is easier to compute?
Q2: Is there any relation between $U$ and $V$?
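One way to answer both questions at once, as a minimal numpy sketch (assuming $m > n$ and full column rank; production SVD routines avoid forming $A^T A$ explicitly for numerical-stability reasons): since $A^T A$ is the smaller Gram matrix, compute $V$ and the singular values from it, then recover $U$ from $AV = U\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((6, 3))     # m > n, so A^T A (3x3) is the smaller eigenproblem

# Q1: V and the singular values come from the smaller matrix A^T A.
eigvals, V = np.linalg.eigh(A.T @ A)
order = np.argsort(eigvals)[::-1]   # sort eigenpairs into descending order
sigma = np.sqrt(eigvals[order])
V = V[:, order]

# Q2: U follows from V, since A V = U Sigma  =>  U = A V Sigma^{-1}
# (this assumes full column rank, so every sigma_i > 0).
U = A @ V / sigma

assert np.allclose(A, U @ np.diag(sigma) @ V.T)
```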