An Introduction to Matrix Decomposition
Lei Zhang, Lead Researcher
Microsoft Research Asia
2012-04-17
Outline
• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.
What Is Matrix Decomposition?
• We wish to decompose the matrix A by writing it as a product of two or more matrices:
  A_{n×m} = B_{n×k} C_{k×m}
• Suppose A, B, C are column matrices:
  – A_{n×m} = (a_1, a_2, …, a_m): each a_i is an n-dim data sample
  – B_{n×k} = (b_1, b_2, …, b_k): each b_j is an n-dim basis vector, and the space B consists of k bases
  – C_{k×m} = (c_1, c_2, …, c_m): each c_i gives the k-dim coordinates of a_i projected onto the space B
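To make the shapes concrete, here is a minimal NumPy sketch (the basis B and coordinates C are random placeholders, not produced by any particular decomposition method):

```python
import numpy as np

n, m, k = 5, 8, 3                 # data dim, number of samples, number of bases
rng = np.random.default_rng(0)

B = rng.standard_normal((n, k))   # k basis vectors, one per column
C = rng.standard_normal((k, m))   # coordinates of each sample in basis B

A = B @ C                         # n x m data matrix, one sample per column
# Each column a_i of A is a linear combination of the columns of B:
assert np.allclose(A[:, 0], B @ C[:, 0])
```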
Why We Need Matrix Decomposition
• Given one data sample: a_1 = B_{n×k} c_1, i.e.
  (a_11, a_12, …, a_1n)^T = (b_1, b_2, …, b_k)(c_11, c_12, …, c_1k)^T
• Another data sample: a_2 = B_{n×k} c_2
• More data samples: a_m = B_{n×k} c_m
• Together (m data samples): (a_1, a_2, …, a_m) = B_{n×k}(c_1, c_2, …, c_m), i.e. A_{n×m} = B_{n×k} C_{k×m}
Why We Need Matrix Decomposition
(a_1, a_2, …, a_m) = B_{n×k}(c_1, c_2, …, c_m), i.e. A_{n×m} = B_{n×k} C_{k×m}
• We wish to find a new set of bases B to represent the data samples A; A then becomes C in the new space.
• In general, B captures the common features in A, while C carries the specific characteristics of the original samples.
• In PCA, B is the eigenvectors; in SVD, B is the left (column) singular vectors; in LDA, B is the discriminant directions; in NMF, B is the local features.
PRINCIPAL COMPONENT ANALYSIS
Definition – Eigenvalue & Eigenvector
Given an m × m matrix C, for any λ and any nonzero vector w, if Cw = λw,
then λ is called an eigenvalue and w an eigenvector of C.
Definition – Principal Component Analysis
– Principal Component Analysis (PCA)
– Karhunen-Loève transformation (KL transformation)
• Let A be an n × m data matrix in which the rows represent data samples.
• Each row is a data vector; each column represents a variable.
• A is centered: the estimated mean is subtracted from each column, so each column has zero mean.
• Covariance matrix C (m × m): C = A^T A
Principal Component Analysis
• C can be decomposed as follows: C = UΛU^T
• Λ is a diagonal matrix diag(λ_1, λ_2, …, λ_m); each λ_i is an eigenvalue
• U is an orthogonal matrix, each column an eigenvector: U^T U = I, so U^{-1} = U^T
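A minimal NumPy sketch of this eigendecomposition on made-up data (note the deck's convention C = A^T A, without the usual 1/(n−1) scaling); it also verifies the data-decomposition property a = UU^T a used a few slides below:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 4))     # 100 samples (rows) x 4 variables
A = A - A.mean(axis=0)                # center each column

C = A.T @ A                           # covariance matrix (deck's convention)
lam, U = np.linalg.eigh(C)            # eigh: ascending eigenvalues, orthonormal U
lam, U = lam[::-1], U[:, ::-1]        # reorder so lambda_1 >= lambda_2 >= ...

assert np.allclose(C, U @ np.diag(lam) @ U.T)   # C = U Lambda U^T
assert np.allclose(U.T @ U, np.eye(4))          # U^T U = I

# Data decomposition property: any sample a satisfies a = U U^T a
a = A[0]
assert np.allclose(a, U @ (U.T @ a))
```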
Maximizing Variance
• The objective of the rotation transformation is to find the direction of maximal variance.
• The projection of the data along w is Aw.
• Variance: σ_w^2 = (Aw)^T(Aw) = w^T A^T A w = w^T C w, where C = A^T A is the covariance matrix of the data (A is centered).
• Task: maximize the variance subject to the constraint w^T w = 1.
Optimization Problem
• Maximize L(w) = w^T C w − λ(w^T w − 1), where λ is the Lagrange multiplier.
• Differentiating with respect to w yields 2Cw − 2λw = 0.
• Eigenvalue equation: Cw = λw, where C = A^T A.
• Once the first principal component is found, we continue in the same fashion to look for the next one, which is orthogonal to (all) the principal component(s) already found.
Property: Data Decomposition
• PCA can be treated as data decomposition:
  a = UU^T a
    = (u_1, u_2, …, u_n)(u_1, u_2, …, u_n)^T a
    = (u_1, u_2, …, u_n)(⟨u_1, a⟩, ⟨u_2, a⟩, …, ⟨u_n, a⟩)^T
    = (u_1, u_2, …, u_n)(b_1, b_2, …, b_n)^T
    = Σ_i b_i u_i
Face Recognition – Eigenface
• Turk, M.A., Pentland, A.P., "Face recognition using eigenfaces", CVPR 1991. (Citations: 2654)
• The eigenface approach:
  – images are points in a vector space
  – use PCA to reduce dimensionality
  – the result is a "face space"
  – compare projections onto the face space to recognize faces
PageRank – Power Iteration
• Define the link matrix Q: Q_ij = 1/N_j if page j links to page i, and 0 otherwise.
• Column j has nonzero elements in the positions corresponding to the outlinks of page j (N_j in total).
• Row i has nonzero elements in the positions corresponding to the inlinks I_i of page i.
Column-Stochastic & Irreducible
• Column-stochastic: every column of Q sums to one, i.e. e^T Q = e^T, where e = (1, 1, …, 1)^T.
• Irreducible: every page must be reachable from every other page; this is enforced by adding a uniform random-jump term, A = αQ + (1 − α)(1/n) e e^T.
Iterative PageRank Calculation
• For k = 1, 2, …: r_k = A r_{k−1}, normalized so that ‖r_k‖_1 = 1.
• Equivalently, the PageRank vector solves Ar = r, the eigenvalue equation with λ = 1 (A is a Markov chain transition matrix).
• Why can we use power iteration to find the first eigenvector?
Convergence of the power iteration
• Expand the initial approximation r_0 in terms of the eigenvectors: r_0 = c_1 v_1 + c_2 v_2 + … + c_n v_n. Then A^k r_0 = c_1 λ_1^k v_1 + … + c_n λ_n^k v_n; since λ_1 = 1 dominates all other |λ_i|, the iterates converge to the first eigenvector v_1.
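A toy power-iteration sketch (the 4-page graph and the damping factor α = 0.85 are invented for illustration; Q is column-stochastic and A adds the random-jump term discussed above):

```python
import numpy as np

# Toy web: column j holds 1/N_j for each outlink of page j.
Q = np.array([[0,   0,   1/2, 0],
              [1/2, 0,   0,   1],
              [1/2, 1/2, 0,   0],
              [0,   1/2, 1/2, 0]])
n = Q.shape[0]
alpha = 0.85
A = alpha * Q + (1 - alpha) * np.ones((n, n)) / n   # irreducible, column-stochastic

r = np.ones(n) / n
for _ in range(200):                 # power iteration: r_k = A r_{k-1}
    r = A @ r
    r /= r.sum()                     # keep ||r||_1 = 1

print(r)                             # PageRank vector: eigenvector for lambda = 1
assert np.allclose(A @ r, r)
```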
SINGULAR VALUE DECOMPOSITION
SVD – Definition
• Any m × n matrix A, with m ≥ n, can be factorized as
  A = UΣV^T, where U and V have orthonormal columns and Σ = diag(σ_1, σ_2, …, σ_n) with σ_1 ≥ σ_2 ≥ … ≥ σ_n ≥ 0.
Singular Values And Singular Vectors
• The diagonal elements σ_j of Σ are the singular values of the matrix A.
• The columns of U and V are the left singular vectors and right singular vectors, respectively.
• Equivalent form of SVD: A = Σ_{j=1..n} σ_j u_j v_j^T, a sum of rank-one matrices.
Matrix Approximation
• Theorem: Let U_k = (u_1, u_2, …, u_k), V_k = (v_1, v_2, …, v_k), and Σ_k = diag(σ_1, σ_2, …, σ_k), and define A_k = U_k Σ_k V_k^T.
• Then ‖A − A_k‖_2 = min{‖A − B‖_2 : rank(B) ≤ k} = σ_{k+1}.
• This means the best rank-k approximation of the matrix A is A_k = Σ_{j=1..k} σ_j u_j v_j^T.
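A quick NumPy check of the theorem on a random matrix (hypothetical data; numpy.linalg.svd returns the factors with singular values in descending order):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) V^T

k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # best rank-k approximation

# ||A - A_k||_2 equals sigma_{k+1}, the first discarded singular value
err = np.linalg.norm(A - Ak, ord=2)
assert np.isclose(err, s[k])
```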
SVD and PCA
• We can write A^T A = VΣ^2 V^T.
• Remember that in PCA we treat A as a row matrix (one sample per row).
• V is just the eigenvectors of A^T A:
  – each column in V is an eigenvector of the row matrix A
  – we use V to approximate a row in A
• Equivalently, we can write AA^T = UΣ^2 U^T; U is just the eigenvectors of AA^T:
  – each column in U is an eigenvector of the column matrix A
  – we use U to approximate a column in A
Example – LSI
• Build a term-by-document matrix A.
• Compute the SVD of A: A = UΣV^T.
• Approximate A by A_k = U_k (Σ_k V_k^T) = U_k D_k:
  – U_k: orthogonal basis that we use to approximate all the documents
  – D_k: column j holds the coordinates of document j in the new basis
  – D_k is the projection of A onto the subspace spanned by U_k
SVD and PCA
• For symmetric A, SVD is closely related to PCA.
• PCA: A = UΛU^T
  – U and Λ are the eigenvectors and eigenvalues
• SVD: A = UΛV^T
  – U is the left (column) eigenvectors
  – V is the right (row) eigenvectors
  – Λ is the same eigenvalues
• For symmetric A, the column eigenvectors equal the row eigenvectors.
• Note the difference of A in PCA and SVD:
  – SVD: A is directly the data, e.g. a term-by-document matrix
  – PCA: A is a covariance matrix, A = X^T X, where each row in X is a sample
Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing:
   – indexing: collecting terms
   – use a stop list: eliminate "meaningless" words
   – stemming
2. Construction: term-by-document matrix, sparse matrix storage.
3. Query matching: distance measures.
4. Data compression by low-rank approximation: SVD (steps 2–4 are sketched in code below).
5. Ranking and relevance feedback.
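A minimal sketch of steps 2–4 on a tiny invented corpus (the term counts, documents, and query are made up for illustration):

```python
import numpy as np

# Step 2: term-by-document matrix (terms: car, automobile, cow, sheep)
A = np.array([[2, 1, 0],    # car
              [1, 2, 0],    # automobile
              [0, 0, 2],    # cow
              [0, 0, 1]],   # sheep
             dtype=float)   # 4 terms x 3 documents

# Step 4: rank-k approximation via SVD
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk = U[:, :k]
Dk = np.diag(s[:k]) @ Vt[:k, :]    # coordinates of each document in basis Uk

# Step 3: query matching -- project the query onto the Uk subspace,
# then compare with the documents by cosine similarity.
q = np.array([1, 0, 0, 0], dtype=float)   # query: "car"
qk = Uk.T @ q
cos = (Dk.T @ qk) / (np.linalg.norm(Dk, axis=0) * np.linalg.norm(qk))
print(cos)   # docs 1 and 2 (the car/automobile cluster) score ~1.0, doc 3 ~0.0
```

In the rank-2 space, documents 1 and 2 collapse onto the same latent direction, so the "car" query matches the document dominated by "automobile" just as strongly.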
Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data.
• E.g., "car" and "automobile" occur in similar documents, as do "cows" and "sheep".
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using SVD.
Similarity Measures
• Term to term: AA^T = UΣ^2 U^T = (UΣ)(UΣ)^T
  UΣ gives the coordinates of the rows of A projected to the space V.
• Document to document: A^T A = VΣ^2 V^T = (VΣ)(VΣ)^T
  VΣ gives the coordinates of the columns of A projected to the space U.
Similarity Measures
• Term to document: A = UΣV^T = (UΣ^{1/2})(VΣ^{1/2})^T
  UΣ^{1/2} gives the coordinates of the rows of A projected to the space V; VΣ^{1/2} gives the coordinates of the columns of A projected to the space U.
HITS (Hyperlink-Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages:
  – authorities contain high-quality information
  – hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it.
• A page is a good hub if it points to many authorities.
• Good authorities are pointed to by good hubs, and good hubs point to good authorities.
[Diagram: hubs on one side pointing to authorities on the other]
Power Iteration
• Each page i has both a hub score h_i and an authority score a_i.
• HITS successively refines these scores: the authority score of a page is the sum of the hub scores of the pages linking to it, and the hub score is the sum of the authority scores of the pages it links to.
• Define the adjacency matrix L of the directed web graph: L_ij = 1 if page i links to page j, and 0 otherwise.
• Now a^(k) = L^T h^(k−1) and h^(k) = L a^(k), normalized at each step.
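A small sketch of the HITS iteration on an invented 4-page graph; the final check previews the SVD connection drawn on the next slide:

```python
import numpy as np

# Adjacency matrix L of a toy directed graph: L[i, j] = 1 if page i links to page j.
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)

h = np.ones(L.shape[0])
for _ in range(100):
    a = L.T @ h                  # authority: sum of hub scores of in-linking pages
    a /= np.linalg.norm(a)
    h = L @ a                    # hub: sum of authority scores of out-linked pages
    h /= np.linalg.norm(h)

# a and h are the dominant eigenvectors of L^T L and L L^T,
# i.e. the first right and left singular vectors of L (up to sign).
U, s, Vt = np.linalg.svd(L)
print(np.allclose(np.abs(a), np.abs(Vt[0])), np.allclose(np.abs(h), np.abs(U[:, 0])))
```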
HITS and SVD
• In L, rows are outlinks and columns are inlinks.
• a will be the dominant eigenvector of the authority matrix L^T L.
• h will be the dominant eigenvector of the hub matrix LL^T.
• They are in fact the first right and left singular vectors of L; we are in fact running SVD on the adjacency matrix.
HITS vs. PageRank
• PageRank may be computed once; HITS is computed per query.
• HITS takes the query into account; PageRank doesn't.
• PageRank has no concept of hubs.
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot.
• PageRank is more stable, because of its random-jump step.
NMF – NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix V_{n×m}, find non-negative matrix factors W_{n×k} and H_{k×m} such that
  V_{n×m} ≈ W_{n×k} H_{k×m}
• V: column matrix; each column is a data sample (n-dimensional)
• W: k bases; each column w_i represents one basis
• H: coordinates of V projected onto W, i.e. v_j ≈ W_{n×k} h_j
Motivation
• Non-negativity is natural in many applications, e.g. pixel intensities and word counts.
• Probabilities are also non-negative.
• An additive model captures local structure.
Multiplicative Update Algorithm
• Cost function: Euclidean distance ‖V − WH‖^2
• Multiplicative update:
  H ← H ⊙ (W^T V) ⊘ (W^T W H)
  W ← W ⊙ (V H^T) ⊘ (W H H^T)
  (⊙ and ⊘ denote element-wise multiplication and division)
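A direct sketch of these update rules in NumPy (random non-negative data; the small eps guard against division by zero is an implementation detail, not part of the original algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((6, 10))            # non-negative data, one sample per column

k = 3
W = rng.random((6, k)) + 1e-3      # non-negative initialization
H = rng.random((k, 10)) + 1e-3
eps = 1e-12                        # guard against division by zero

for _ in range(500):
    H *= (W.T @ V) / (W.T @ W @ H + eps)    # H <- H * (W^T V) / (W^T W H)
    W *= (V @ H.T) / (W @ H @ H.T + eps)    # W <- W * (V H^T) / (W H H^T)

print(np.linalg.norm(V - W @ H))   # Euclidean cost is non-increasing under these updates
```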
Multiplicative Update Algorithm
• Cost function: divergence D(A‖B) = Σ_{ij} (A_ij log(A_ij/B_ij) − A_ij + B_ij)
  – reduces to the Kullback-Leibler divergence when Σ_ij A_ij = Σ_ij B_ij = 1
  – A and B can then be regarded as normalized probability distributions
• Multiplicative update:
  H_aμ ← H_aμ · (Σ_i W_ia V_iμ/(WH)_iμ) / (Σ_i W_ia)
  W_ia ← W_ia · (Σ_μ H_aμ V_iμ/(WH)_iμ) / (Σ_μ H_aμ)
• PLSA is NMF with the KL divergence.
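And a sketch of the divergence-based updates from the same Lee & Seung paper (again on random data, with an eps guard as an implementation detail):

```python
import numpy as np

rng = np.random.default_rng(1)
V = rng.random((6, 10))
k, eps = 3, 1e-12
W = rng.random((6, k)) + 1e-3
H = rng.random((k, 10)) + 1e-3

for _ in range(500):
    # H_ak <- H_ak * sum_i W_ia V_ik/(WH)_ik / sum_i W_ia
    H *= (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
    # W_ia <- W_ia * sum_k H_ak V_ik/(WH)_ik / sum_k H_ak
    W *= ((V / (W @ H + eps)) @ H.T) / (H.sum(axis=1)[None, :] + eps)

WH = W @ H
print(np.sum(V * np.log(V / (WH + eps) + eps) - V + WH))   # divergence D(V||WH)
```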
NMF vs. PCA
• n = 2429 faces, each m = 19×19 pixels.
• Positive values are illustrated with black pixels and negative values with red pixels.
• NMF: parts-based representation. PCA: holistic representation.
Reference
• D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization", NIPS 2001.
• D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization", Nature 401, 788–791 (1999).
Major Reference
• Saara Hyvönen, "Linear Algebra Methods for Data Mining", Spring 2007, University of Helsinki. (Highly recommended)
Outline
bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc
What Is Matrix Decompositionbull We wish to decompose the matrix A by writing it as a product
of two or more matricesAntimesm = BntimeskCktimesm
bull Suppose A B C are column matricesndash Antimesm = (a1 a2 hellip am) each ai is a n-dim data samplendash Bntimesk = (b1 b2 hellip bk) each bj is a n-dim basis and space B consists of k
basesndash Cktimesm = (c1 c2 hellip cm) each ci is the k-dim coordinates of ai projected to
space B
Why We Need Matrix Decomposition
bull Given one data samplea1 = Bntimeskc1
(a11 a12 hellip a1n)T = (b1 b2 hellip bk) (c11 c12 hellip c1k)T
bull Another data sample a2 = Bntimeskc2
bull More data sample am = Bntimeskcm
bull Together (m data samples) (a1 a2 hellip am) = Bntimesk (c1 c2 hellip cm) Antimesm = BntimeskCktimesm
Why We Need Matrix Decomposition
(a1 a2 hellip am) = Bntimesk (c1 c2 hellip cm) Antimesm = BntimeskCktimesm
bull We wish to find a set of new basis B to represent data samples A and A will become C in the new space
bull In general B captures the common features in A while C carries specific characteristics of the original samples
bull In PCA B is eigenvectorsbull In SVD B is right (column) eigenvectorsbull In LDA B is discriminant directionsbull In NMF B is local features
PRINCIPLE COMPONENT ANALYSIS
Definition ndash Eigenvalue amp Eigenvector
Given a m x m matrix C for any λ and w if
Then λ is called eigenvalue and w is called eigenvector
Definition ndash Principle Component Analysis
ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)
bull Let A be a n times m data matrix in which the rows represent data samples
bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each
column so each column has zero meanbull Covariance matrix C (m x m)
Principle Component Analysisbull C can be decomposed as follows C=UΛUT
bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector
UTU=I U-1=UT
Maximizing Variancebull The objective of the rotation transformation is to find the
maximal variancebull Projection of data along w is Awbull Variance σ2
w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)
bull Task maximize variance subject to constraint wTw=1
Optimization Problembull Maximize
λ is the Lagrange multiplierbull Differentiating with respect to w yields
bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same
fashion to look for the next one which is orthogonal to (all) the principal component(s) already found
Property Data Decompositionbull PCA can be treated as data decomposition
a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T
=(u1u2hellipun) (b1 b2 hellip bn)T
= Σ biui
Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces
CVPR 1991 (Citation 2654)bull The eigenface approach
ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces
PageRank ndash Power Iteration
bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)
bull Row i has nonzero element in positions corresponding to inlinks Ii
Column-Stochastic amp Irreduciblebull Column-Stochastic
bull where
bull Irreducible
Iterative PageRank Calculationbull For k=12hellip
bull Equivalently (λ=1 A is a Markov chain transition matrix)
bull Why can we use power iteration to find the first eigenvector
Convergence of the power iteration
bull Expand the initial approximation r0 in terms of the eigenvectors
SINGULAR VALUE DECOMPOSITION
SVD - Definitionbull Any m x n matrix A with m ge n can be factorized
bull
Singular Values And Singular Vectors
bull The diagonal elements σj of are the singular values of the matrix A
bull The columns of U and V are the left singular vectors and right singular vectors respectively
bull Equivalent form of SVD
Matrix approximation
bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define
bull Then
bull It means that the best approximation of rank k for the matrix A is
SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A
ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A
bull Equivalently we can writebull U is just eigenvectors for AT
ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A
Example - LSI bull Build a term-by-document matrix A
bull Compute the SVD of A A = UΣVT
bull Approximate A by
ndash Uk Orthogonal basis that we use to approximate all the documents
ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk
SVD and PCAbull For symmetric A SVD is closely related to PCA
bull PCA A = UΛUT
ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT
ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues
bull For symmetric A column eigenvectors equal to row eigenvectors
bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample
Latent Semantic Indexing (LSI)1 Document file preparation preprocessing
ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming
2 Construction term-by-document matrix sparse matrix storage
3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback
Latent Semantic Indexingbull Assumption there is some underlying latent semantic
structure in the data
bull Eg car and automobile occur in similar documents as do cows and sheep
bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T
UΣ are the coordinates of A (rows) projected to space V
bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T
VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T
UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities
bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs
point to good authorities
Hubs Authorities
Power Iteration
bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing
bull Define the adjacency matrix L of the directed web graph
bull Now
HITS and SVD
bull L rows are outlinks columns are inlinks
bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT
bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix
HITS vs PageRankbull PageRank may be computed once HITS is computed per
query
bull HITS takes query into account PageRank doesnrsquot
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivationbull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithmbull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithmbull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
What Is Matrix Decompositionbull We wish to decompose the matrix A by writing it as a product
of two or more matricesAntimesm = BntimeskCktimesm
bull Suppose A B C are column matricesndash Antimesm = (a1 a2 hellip am) each ai is a n-dim data samplendash Bntimesk = (b1 b2 hellip bk) each bj is a n-dim basis and space B consists of k
basesndash Cktimesm = (c1 c2 hellip cm) each ci is the k-dim coordinates of ai projected to
space B
Why We Need Matrix Decomposition
bull Given one data samplea1 = Bntimeskc1
(a11 a12 hellip a1n)T = (b1 b2 hellip bk) (c11 c12 hellip c1k)T
bull Another data sample a2 = Bntimeskc2
bull More data sample am = Bntimeskcm
bull Together (m data samples) (a1 a2 hellip am) = Bntimesk (c1 c2 hellip cm) Antimesm = BntimeskCktimesm
Why We Need Matrix Decomposition
(a1 a2 hellip am) = Bntimesk (c1 c2 hellip cm) Antimesm = BntimeskCktimesm
bull We wish to find a set of new basis B to represent data samples A and A will become C in the new space
bull In general B captures the common features in A while C carries specific characteristics of the original samples
bull In PCA B is eigenvectorsbull In SVD B is right (column) eigenvectorsbull In LDA B is discriminant directionsbull In NMF B is local features
PRINCIPLE COMPONENT ANALYSIS
Definition ndash Eigenvalue amp Eigenvector
Given a m x m matrix C for any λ and w if
Then λ is called eigenvalue and w is called eigenvector
Definition ndash Principle Component Analysis
ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)
bull Let A be a n times m data matrix in which the rows represent data samples
bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each
column so each column has zero meanbull Covariance matrix C (m x m)
Principle Component Analysisbull C can be decomposed as follows C=UΛUT
bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector
UTU=I U-1=UT
Maximizing Variancebull The objective of the rotation transformation is to find the
maximal variancebull Projection of data along w is Awbull Variance σ2
w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)
bull Task maximize variance subject to constraint wTw=1
Optimization Problembull Maximize
λ is the Lagrange multiplierbull Differentiating with respect to w yields
bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same
fashion to look for the next one which is orthogonal to (all) the principal component(s) already found
Property Data Decompositionbull PCA can be treated as data decomposition
a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T
=(u1u2hellipun) (b1 b2 hellip bn)T
= Σ biui
Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces
CVPR 1991 (Citation 2654)bull The eigenface approach
ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces
PageRank ndash Power Iteration
bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)
bull Row i has nonzero element in positions corresponding to inlinks Ii
Column-Stochastic amp Irreduciblebull Column-Stochastic
bull where
bull Irreducible
Iterative PageRank Calculationbull For k=12hellip
bull Equivalently (λ=1 A is a Markov chain transition matrix)
bull Why can we use power iteration to find the first eigenvector
Convergence of the power iteration
bull Expand the initial approximation r0 in terms of the eigenvectors
SINGULAR VALUE DECOMPOSITION
SVD - Definitionbull Any m x n matrix A with m ge n can be factorized
bull
Singular Values And Singular Vectors
bull The diagonal elements σj of are the singular values of the matrix A
bull The columns of U and V are the left singular vectors and right singular vectors respectively
bull Equivalent form of SVD
Matrix approximation
bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define
bull Then
bull It means that the best approximation of rank k for the matrix A is
SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A
ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A
bull Equivalently we can writebull U is just eigenvectors for AT
ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A
Example - LSI bull Build a term-by-document matrix A
bull Compute the SVD of A A = UΣVT
bull Approximate A by
ndash Uk Orthogonal basis that we use to approximate all the documents
ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk
SVD and PCAbull For symmetric A SVD is closely related to PCA
bull PCA A = UΛUT
ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT
ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues
bull For symmetric A column eigenvectors equal to row eigenvectors
bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample
Latent Semantic Indexing (LSI)1 Document file preparation preprocessing
ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming
2 Construction term-by-document matrix sparse matrix storage
3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback
Latent Semantic Indexingbull Assumption there is some underlying latent semantic
structure in the data
bull Eg car and automobile occur in similar documents as do cows and sheep
bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T
UΣ are the coordinates of A (rows) projected to space V
bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T
VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T
UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities
bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs
point to good authorities
Hubs Authorities
Power Iteration
bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing
bull Define the adjacency matrix L of the directed web graph
bull Now
HITS and SVD
bull L rows are outlinks columns are inlinks
bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT
bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix
HITS vs PageRankbull PageRank may be computed once HITS is computed per
query
bull HITS takes query into account PageRank doesnrsquot
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivationbull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithmbull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithmbull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Why We Need Matrix Decomposition
bull Given one data samplea1 = Bntimeskc1
(a11 a12 hellip a1n)T = (b1 b2 hellip bk) (c11 c12 hellip c1k)T
bull Another data sample a2 = Bntimeskc2
bull More data sample am = Bntimeskcm
bull Together (m data samples) (a1 a2 hellip am) = Bntimesk (c1 c2 hellip cm) Antimesm = BntimeskCktimesm
Why We Need Matrix Decomposition
(a1 a2 hellip am) = Bntimesk (c1 c2 hellip cm) Antimesm = BntimeskCktimesm
bull We wish to find a set of new basis B to represent data samples A and A will become C in the new space
bull In general B captures the common features in A while C carries specific characteristics of the original samples
bull In PCA B is eigenvectorsbull In SVD B is right (column) eigenvectorsbull In LDA B is discriminant directionsbull In NMF B is local features
PRINCIPLE COMPONENT ANALYSIS
Definition ndash Eigenvalue amp Eigenvector
Given a m x m matrix C for any λ and w if
Then λ is called eigenvalue and w is called eigenvector
Definition ndash Principle Component Analysis
ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)
bull Let A be a n times m data matrix in which the rows represent data samples
bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each
column so each column has zero meanbull Covariance matrix C (m x m)
Principle Component Analysisbull C can be decomposed as follows C=UΛUT
bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector
UTU=I U-1=UT
Maximizing Variancebull The objective of the rotation transformation is to find the
maximal variancebull Projection of data along w is Awbull Variance σ2
w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)
bull Task maximize variance subject to constraint wTw=1
Optimization Problembull Maximize
λ is the Lagrange multiplierbull Differentiating with respect to w yields
bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same
fashion to look for the next one which is orthogonal to (all) the principal component(s) already found
Property Data Decompositionbull PCA can be treated as data decomposition
a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T
=(u1u2hellipun) (b1 b2 hellip bn)T
= Σ biui
Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces
CVPR 1991 (Citation 2654)bull The eigenface approach
ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces
PageRank ndash Power Iteration
bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)
bull Row i has nonzero element in positions corresponding to inlinks Ii
Column-Stochastic amp Irreduciblebull Column-Stochastic
bull where
bull Irreducible
Iterative PageRank Calculationbull For k=12hellip
bull Equivalently (λ=1 A is a Markov chain transition matrix)
bull Why can we use power iteration to find the first eigenvector
Convergence of the power iteration
bull Expand the initial approximation r0 in terms of the eigenvectors
SINGULAR VALUE DECOMPOSITION
SVD - Definitionbull Any m x n matrix A with m ge n can be factorized
bull
Singular Values And Singular Vectors
bull The diagonal elements σj of are the singular values of the matrix A
bull The columns of U and V are the left singular vectors and right singular vectors respectively
bull Equivalent form of SVD
Matrix approximation
bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define
bull Then
bull It means that the best approximation of rank k for the matrix A is
SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A
ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A
bull Equivalently we can writebull U is just eigenvectors for AT
ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A
Example - LSI bull Build a term-by-document matrix A
bull Compute the SVD of A A = UΣVT
bull Approximate A by
ndash Uk Orthogonal basis that we use to approximate all the documents
ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk
SVD and PCAbull For symmetric A SVD is closely related to PCA
bull PCA A = UΛUT
ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT
ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues
bull For symmetric A column eigenvectors equal to row eigenvectors
bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample
Latent Semantic Indexing (LSI)1 Document file preparation preprocessing
ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming
2 Construction term-by-document matrix sparse matrix storage
3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback
Latent Semantic Indexingbull Assumption there is some underlying latent semantic
structure in the data
bull Eg car and automobile occur in similar documents as do cows and sheep
bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T
UΣ are the coordinates of A (rows) projected to space V
bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T
VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T
UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities
bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs
point to good authorities
Hubs Authorities
Power Iteration
bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing
bull Define the adjacency matrix L of the directed web graph
bull Now
HITS and SVD
bull L rows are outlinks columns are inlinks
bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT
bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix
HITS vs PageRankbull PageRank may be computed once HITS is computed per
query
bull HITS takes query into account PageRank doesnrsquot
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivationbull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithmbull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithmbull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Why We Need Matrix Decomposition
(a1 a2 hellip am) = Bntimesk (c1 c2 hellip cm) Antimesm = BntimeskCktimesm
bull We wish to find a set of new basis B to represent data samples A and A will become C in the new space
bull In general B captures the common features in A while C carries specific characteristics of the original samples
bull In PCA B is eigenvectorsbull In SVD B is right (column) eigenvectorsbull In LDA B is discriminant directionsbull In NMF B is local features
PRINCIPLE COMPONENT ANALYSIS
Definition ndash Eigenvalue amp Eigenvector
Given a m x m matrix C for any λ and w if
Then λ is called eigenvalue and w is called eigenvector
Definition ndash Principle Component Analysis
ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)
bull Let A be a n times m data matrix in which the rows represent data samples
bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each
column so each column has zero meanbull Covariance matrix C (m x m)
Principle Component Analysisbull C can be decomposed as follows C=UΛUT
bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector
UTU=I U-1=UT
Maximizing Variancebull The objective of the rotation transformation is to find the
maximal variancebull Projection of data along w is Awbull Variance σ2
w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)
bull Task maximize variance subject to constraint wTw=1
Optimization Problembull Maximize
λ is the Lagrange multiplierbull Differentiating with respect to w yields
bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same
fashion to look for the next one which is orthogonal to (all) the principal component(s) already found
Property Data Decompositionbull PCA can be treated as data decomposition
a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T
=(u1u2hellipun) (b1 b2 hellip bn)T
= Σ biui
Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces
CVPR 1991 (Citation 2654)bull The eigenface approach
ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces
PageRank ndash Power Iteration
bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)
bull Row i has nonzero element in positions corresponding to inlinks Ii
Column-Stochastic amp Irreduciblebull Column-Stochastic
bull where
bull Irreducible
Iterative PageRank Calculationbull For k=12hellip
bull Equivalently (λ=1 A is a Markov chain transition matrix)
bull Why can we use power iteration to find the first eigenvector
Convergence of the power iteration
bull Expand the initial approximation r0 in terms of the eigenvectors
SINGULAR VALUE DECOMPOSITION
SVD - Definitionbull Any m x n matrix A with m ge n can be factorized
bull
Singular Values And Singular Vectors
bull The diagonal elements σj of are the singular values of the matrix A
bull The columns of U and V are the left singular vectors and right singular vectors respectively
bull Equivalent form of SVD
Matrix approximation
bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define
bull Then
bull It means that the best approximation of rank k for the matrix A is
SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A
ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A
bull Equivalently we can writebull U is just eigenvectors for AT
ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A
Example - LSI bull Build a term-by-document matrix A
bull Compute the SVD of A A = UΣVT
bull Approximate A by
ndash Uk Orthogonal basis that we use to approximate all the documents
ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk
SVD and PCAbull For symmetric A SVD is closely related to PCA
bull PCA A = UΛUT
ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT
ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues
bull For symmetric A column eigenvectors equal to row eigenvectors
bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample
Latent Semantic Indexing (LSI)1 Document file preparation preprocessing
ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming
2 Construction term-by-document matrix sparse matrix storage
3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback
Latent Semantic Indexingbull Assumption there is some underlying latent semantic
structure in the data
bull Eg car and automobile occur in similar documents as do cows and sheep
bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T
UΣ are the coordinates of A (rows) projected to space V
bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T
VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T
UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities
bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs
point to good authorities
Hubs Authorities
Power Iteration
bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing
bull Define the adjacency matrix L of the directed web graph
bull Now
HITS and SVD
bull L rows are outlinks columns are inlinks
bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT
bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix
HITS vs PageRankbull PageRank may be computed once HITS is computed per
query
bull HITS takes query into account PageRank doesnrsquot
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivationbull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithmbull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithmbull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
PRINCIPLE COMPONENT ANALYSIS
Definition ndash Eigenvalue amp Eigenvector
Given a m x m matrix C for any λ and w if
Then λ is called eigenvalue and w is called eigenvector
Definition ndash Principle Component Analysis
ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)
bull Let A be a n times m data matrix in which the rows represent data samples
bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each
column so each column has zero meanbull Covariance matrix C (m x m)
Principle Component Analysisbull C can be decomposed as follows C=UΛUT
bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector
UTU=I U-1=UT
Maximizing Variancebull The objective of the rotation transformation is to find the
maximal variancebull Projection of data along w is Awbull Variance σ2
w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)
bull Task maximize variance subject to constraint wTw=1
Optimization Problembull Maximize
λ is the Lagrange multiplierbull Differentiating with respect to w yields
bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same
fashion to look for the next one which is orthogonal to (all) the principal component(s) already found
Property Data Decompositionbull PCA can be treated as data decomposition
a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T
=(u1u2hellipun) (b1 b2 hellip bn)T
= Σ biui
Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces
CVPR 1991 (Citation 2654)bull The eigenface approach
ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces
PageRank ndash Power Iteration
bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)
bull Row i has nonzero element in positions corresponding to inlinks Ii
Column-Stochastic amp Irreduciblebull Column-Stochastic
bull where
bull Irreducible
Iterative PageRank Calculationbull For k=12hellip
bull Equivalently (λ=1 A is a Markov chain transition matrix)
bull Why can we use power iteration to find the first eigenvector
Convergence of the power iteration
bull Expand the initial approximation r0 in terms of the eigenvectors
SINGULAR VALUE DECOMPOSITION
SVD - Definitionbull Any m x n matrix A with m ge n can be factorized
bull
Singular Values And Singular Vectors
bull The diagonal elements σj of are the singular values of the matrix A
bull The columns of U and V are the left singular vectors and right singular vectors respectively
bull Equivalent form of SVD
Matrix approximation
bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define
bull Then
bull It means that the best approximation of rank k for the matrix A is
SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A
ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A
bull Equivalently we can writebull U is just eigenvectors for AT
ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A
Example - LSI bull Build a term-by-document matrix A
bull Compute the SVD of A A = UΣVT
bull Approximate A by
ndash Uk Orthogonal basis that we use to approximate all the documents
ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk
SVD and PCAbull For symmetric A SVD is closely related to PCA
bull PCA A = UΛUT
ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT
ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues
bull For symmetric A column eigenvectors equal to row eigenvectors
bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample
Latent Semantic Indexing (LSI)1 Document file preparation preprocessing
ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming
2 Construction term-by-document matrix sparse matrix storage
3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback
Latent Semantic Indexingbull Assumption there is some underlying latent semantic
structure in the data
bull Eg car and automobile occur in similar documents as do cows and sheep
bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T
UΣ are the coordinates of A (rows) projected to space V
bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T
VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T
UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities
bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs
point to good authorities
Hubs Authorities
Power Iteration
bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing
bull Define the adjacency matrix L of the directed web graph
bull Now
HITS and SVD
bull L rows are outlinks columns are inlinks
bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT
bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix
HITS vs PageRankbull PageRank may be computed once HITS is computed per
query
bull HITS takes query into account PageRank doesnrsquot
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivationbull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithmbull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithmbull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Definition ndash Eigenvalue amp Eigenvector
Given a m x m matrix C for any λ and w if
Then λ is called eigenvalue and w is called eigenvector
Definition ndash Principle Component Analysis
ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)
bull Let A be a n times m data matrix in which the rows represent data samples
bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each
column so each column has zero meanbull Covariance matrix C (m x m)
Principle Component Analysisbull C can be decomposed as follows C=UΛUT
bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector
UTU=I U-1=UT
Maximizing Variancebull The objective of the rotation transformation is to find the
maximal variancebull Projection of data along w is Awbull Variance σ2
w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)
bull Task maximize variance subject to constraint wTw=1
Optimization Problembull Maximize
λ is the Lagrange multiplierbull Differentiating with respect to w yields
bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same
fashion to look for the next one which is orthogonal to (all) the principal component(s) already found
Property Data Decompositionbull PCA can be treated as data decomposition
a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T
=(u1u2hellipun) (b1 b2 hellip bn)T
= Σ biui
Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces
CVPR 1991 (Citation 2654)bull The eigenface approach
ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces
PageRank ndash Power Iteration
bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)
bull Row i has nonzero element in positions corresponding to inlinks Ii
Column-Stochastic amp Irreduciblebull Column-Stochastic
bull where
bull Irreducible
Iterative PageRank Calculationbull For k=12hellip
bull Equivalently (λ=1 A is a Markov chain transition matrix)
bull Why can we use power iteration to find the first eigenvector
Convergence of the power iteration
bull Expand the initial approximation r0 in terms of the eigenvectors
SINGULAR VALUE DECOMPOSITION
SVD - Definitionbull Any m x n matrix A with m ge n can be factorized
bull
Singular Values And Singular Vectors
bull The diagonal elements σj of are the singular values of the matrix A
bull The columns of U and V are the left singular vectors and right singular vectors respectively
bull Equivalent form of SVD
Matrix approximation
bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define
bull Then
bull It means that the best approximation of rank k for the matrix A is
SVD and PCA
• We can write A = (UΣ)Vᵀ.
• Remember that in PCA we treat A as a row matrix: each row is a data sample.
• V is then the eigenvector matrix of AᵀA:
– each column of V is an eigenvector of the row matrix A;
– we use V to approximate a row in A.
• Equivalently, we can write Aᵀ = (VΣ)Uᵀ, so U is the eigenvector matrix of AAᵀ:
– each column of U is an eigenvector of the column matrix A;
– we use U to approximate a column in A.
Example – LSI
• Build a term-by-document matrix A.
• Compute the SVD of A: A = UΣVᵀ.
• Approximate A by Ak = Uk(ΣkVkᵀ) = UkDk:
– Uk: orthogonal basis that we use to approximate all the documents;
– Dk: column j holds the coordinates of document j in the new basis;
– Dk is the projection of A onto the subspace spanned by Uk.
SVD and PCA
• For symmetric A, SVD is closely related to PCA.
• PCA: A = UΛUᵀ
– U and Λ hold the eigenvectors and eigenvalues.
• SVD: A = UΛVᵀ
– U holds the left (column) eigenvectors;
– V holds the right (row) eigenvectors;
– Λ holds the same eigenvalues.
• For symmetric A, the column eigenvectors equal the row eigenvectors.
• Note the difference in what A is in PCA and SVD:
– SVD: A is directly the data, e.g., a term-by-document matrix;
– PCA: A is a covariance matrix, A = XᵀX, where each row of X is a sample.
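The coincidence of SVD and the eigendecomposition for symmetric matrices can be verified in a few lines; a small sketch, where the random data matrix X is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
A = X.T @ X                      # symmetric positive semi-definite, covariance-style

U, s, Vt = np.linalg.svd(A)      # SVD: A = U diag(s) V^T, descending
lam, Q = np.linalg.eigh(A)       # eigendecomposition: A = Q diag(lam) Q^T, ascending

print(np.allclose(s, lam[::-1]))             # singular values = eigenvalues
print(np.allclose(np.abs(U), np.abs(Vt.T)))  # left = right singular vectors, up to sign
```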
Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing:
– indexing: collecting terms;
– use a stop list: eliminate "meaningless" words;
– stemming.
2. Construction: term-by-document matrix, sparse matrix storage.
3. Query matching: distance measures.
4. Data compression by low-rank approximation: SVD.
5. Ranking and relevance feedback.
Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data.
• E.g., car and automobile occur in similar documents, as do cows and sheep.
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using SVD.
Similarity Measures
• Term to term: AAᵀ = UΣ²Uᵀ = (UΣ)(UΣ)ᵀ.
The rows of UΣ are the coordinates of the rows of A (terms) projected into space V.
• Document to document: AᵀA = VΣ²Vᵀ = (VΣ)(VΣ)ᵀ.
The rows of VΣ are the coordinates of the columns of A (documents) projected into space U.
Similarity Measures
• Term to document: A = UΣVᵀ = (UΣ½)(VΣ½)ᵀ.
UΣ½ gives the coordinates of the rows of A (terms) projected into space V; VΣ½ gives the coordinates of the columns of A (documents) projected into space U.
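The following sketch runs the whole LSI pipeline above on a toy term-by-document matrix; the four terms, four documents, and counts are invented for illustration. Documents d0 and d1 share no terms, so their raw cosine similarity is zero, but because d2 uses car and automobile together, the rank-2 latent space merges the two words and the documents become nearly identical.

```python
import numpy as np

# Toy term-by-document matrix (terms x documents); counts invented for illustration.
#             d0   d1   d2   d3
A = np.array([[2., 0., 1., 0.],   # car
              [0., 2., 1., 0.],   # automobile
              [0., 0., 0., 2.],   # cow
              [0., 0., 0., 1.]])  # sheep

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Dk = np.diag(s[:k]) @ Vt[:k, :]   # D_k = Sigma_k V_k^T: document coordinates in the U_k basis

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine(A[:, 0], A[:, 1]))    # raw similarity of d0 and d1: 0.0
print(cosine(Dk[:, 0], Dk[:, 1]))  # latent similarity: ~1.0
```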
HITS (Hyperlink-Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages:
– authorities contain high-quality information;
– hubs are comprehensive lists of links to authorities.
• A page is a good authority if many hubs point to it.
• A page is a good hub if it points to many authorities.
• Good authorities are pointed to by good hubs, and good hubs point to good authorities.
[Figure: hubs on the left pointing to authorities on the right]
Power Iteration
• Each page i has both a hub score hi and an authority score ai.
• HITS successively refines these scores by computing
ai ← Σ hj over all pages j that link to i,  hi ← Σ aj over all pages j that i links to.
• Define the adjacency matrix L of the directed web graph: Lij = 1 if page i links to page j, and 0 otherwise.
• Now, in matrix form: a = Lᵀh and h = La.
HITS and SVD
• In L, rows are outlinks and columns are inlinks.
• a will be the dominant eigenvector of the authority matrix LᵀL.
• h will be the dominant eigenvector of the hub matrix LLᵀ.
• h and a are in fact the first left and right singular vectors of L: we are in fact running SVD on the adjacency matrix.
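A minimal sketch of the HITS iteration with the normalization folded in; the 4-page graph is an invented example in which pages 0–1 act as hubs and pages 2–3 as authorities.

```python
import numpy as np

def hits(L, iters=100):
    """HITS on adjacency matrix L (L[i, j] = 1 if page i links to page j)."""
    h = np.ones(L.shape[0])
    for _ in range(iters):
        a = L.T @ h                # authority: sum of hub scores of in-linking pages
        a /= np.linalg.norm(a)
        h = L @ a                  # hub: sum of authority scores of out-linked pages
        h /= np.linalg.norm(h)
    return h, a                    # converge to the top left/right singular vectors of L

# Pages 0 and 1 both link to pages 2 and 3.
L = np.array([[0., 0., 1., 1.],
              [0., 0., 1., 1.],
              [0., 0., 0., 0.],
              [0., 0., 0., 0.]])
h, a = hits(L)
print(np.round(h, 3))  # high for pages 0 and 1 (hubs)
print(np.round(a, 3))  # high for pages 2 and 3 (authorities)
```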
HITS vs. PageRank
• PageRank may be computed once; HITS is computed per query.
• HITS takes the query into account; PageRank doesn't.
• PageRank has no concept of hubs.
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot.
• PageRank is more stable, because of its random-jump step.
NMF – NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix Vn×m, find non-negative matrix factors Wn×k and Hk×m such that
Vn×m ≈ Wn×kHk×m
• V: column matrix; each column is an n-dimensional data sample.
• W: each of its k columns is one basis vector.
• H: the coordinates of V projected onto W; column by column,
vj ≈ Wn×khj
Motivation
• Non-negativity is natural in many applications (pixel intensities, word counts, etc.).
• Probabilities are also non-negative.
• An additive model captures local structure: parts can only be added, never subtracted.
Multiplicative Update Algorithm
• Cost function: Euclidean distance, ‖V − WH‖² = Σij (Vij − (WH)ij)².
• Multiplicative update (Lee & Seung, NIPS 2001):
H ← H ⊙ (WᵀV) ⊘ (WᵀWH),  W ← W ⊙ (VHᵀ) ⊘ (WHHᵀ),
where ⊙ and ⊘ denote element-wise multiplication and division; the cost is non-increasing under these updates, and non-negativity is preserved automatically.
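The Euclidean-cost updates above, sketched in NumPy. The small matrix, rank k = 2, iteration count, and the eps guard against division by zero are illustrative assumptions.

```python
import numpy as np

def nmf(V, k, iters=500, eps=1e-9):
    """Lee-Seung multiplicative updates for min ||V - WH||^2 with W, H >= 0."""
    rng = np.random.default_rng(0)
    n, m = V.shape
    W = rng.random((n, k))                      # non-negative random init
    H = rng.random((k, m))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)    # H <- H .* (W^T V) ./ (W^T W H)
        W *= (V @ H.T) / (W @ H @ H.T + eps)    # W <- W .* (V H^T) ./ (W H H^T)
    return W, H

V = np.array([[1., 0., 2.],
              [0., 3., 1.],
              [2., 1., 0.]])
W, H = nmf(V, k=2)
print(np.round(W @ H, 2))  # non-negative rank-2 reconstruction of V
```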
Multiplicative Update Algorithm
• Cost function: divergence
D(A‖B) = Σij (Aij log(Aij/Bij) − Aij + Bij)
– it reduces to the Kullback–Leibler divergence when Σij Aij = Σij Bij = 1;
– A and B can then be regarded as normalized probability distributions.
• Multiplicative update:
Haμ ← Haμ · [Σi Wia Viμ/(WH)iμ] / [Σi Wia],  Wia ← Wia · [Σμ Haμ Viμ/(WH)iμ] / [Σμ Haμ]
• PLSA is NMF with KL divergence.
NMF vs. PCA
• n = 2,429 face images, each m = 19 × 19 pixels.
• In the basis images, positive values are illustrated with black pixels and negative values with red pixels.
• NMF yields a parts-based representation (localized facial features).
• PCA yields a holistic representation (each eigenface spans the whole image).
Definition ndash Principle Component Analysis
ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)
bull Let A be a n times m data matrix in which the rows represent data samples
bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each
column so each column has zero meanbull Covariance matrix C (m x m)
Principle Component Analysisbull C can be decomposed as follows C=UΛUT
bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector
UTU=I U-1=UT
Maximizing Variancebull The objective of the rotation transformation is to find the
maximal variancebull Projection of data along w is Awbull Variance σ2
w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)
bull Task maximize variance subject to constraint wTw=1
Optimization Problembull Maximize
λ is the Lagrange multiplierbull Differentiating with respect to w yields
bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same
fashion to look for the next one which is orthogonal to (all) the principal component(s) already found
Property Data Decompositionbull PCA can be treated as data decomposition
a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T
=(u1u2hellipun) (b1 b2 hellip bn)T
= Σ biui
Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces
CVPR 1991 (Citation 2654)bull The eigenface approach
ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces
PageRank ndash Power Iteration
bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)
bull Row i has nonzero element in positions corresponding to inlinks Ii
Column-Stochastic amp Irreduciblebull Column-Stochastic
bull where
bull Irreducible
Iterative PageRank Calculationbull For k=12hellip
bull Equivalently (λ=1 A is a Markov chain transition matrix)
bull Why can we use power iteration to find the first eigenvector
Convergence of the power iteration
bull Expand the initial approximation r0 in terms of the eigenvectors
SINGULAR VALUE DECOMPOSITION
SVD - Definitionbull Any m x n matrix A with m ge n can be factorized
bull
Singular Values And Singular Vectors
bull The diagonal elements σj of are the singular values of the matrix A
bull The columns of U and V are the left singular vectors and right singular vectors respectively
bull Equivalent form of SVD
Matrix approximation
bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define
bull Then
bull It means that the best approximation of rank k for the matrix A is
SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A
ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A
bull Equivalently we can writebull U is just eigenvectors for AT
ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A
Example - LSI bull Build a term-by-document matrix A
bull Compute the SVD of A A = UΣVT
bull Approximate A by
ndash Uk Orthogonal basis that we use to approximate all the documents
ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk
SVD and PCAbull For symmetric A SVD is closely related to PCA
bull PCA A = UΛUT
ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT
ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues
bull For symmetric A column eigenvectors equal to row eigenvectors
bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample
Latent Semantic Indexing (LSI)1 Document file preparation preprocessing
ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming
2 Construction term-by-document matrix sparse matrix storage
3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback
Latent Semantic Indexingbull Assumption there is some underlying latent semantic
structure in the data
bull Eg car and automobile occur in similar documents as do cows and sheep
bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T
UΣ are the coordinates of A (rows) projected to space V
bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T
VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T
UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities
bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs
point to good authorities
Hubs Authorities
Power Iteration
bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing
bull Define the adjacency matrix L of the directed web graph
bull Now
HITS and SVD
bull L rows are outlinks columns are inlinks
bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT
bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix
HITS vs PageRankbull PageRank may be computed once HITS is computed per
query
bull HITS takes query into account PageRank doesnrsquot
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivationbull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithmbull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithmbull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Principle Component Analysisbull C can be decomposed as follows C=UΛUT
bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector
UTU=I U-1=UT
Maximizing Variancebull The objective of the rotation transformation is to find the
maximal variancebull Projection of data along w is Awbull Variance σ2
w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)
bull Task maximize variance subject to constraint wTw=1
Optimization Problembull Maximize
λ is the Lagrange multiplierbull Differentiating with respect to w yields
bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same
fashion to look for the next one which is orthogonal to (all) the principal component(s) already found
Property Data Decompositionbull PCA can be treated as data decomposition
a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T
=(u1u2hellipun) (b1 b2 hellip bn)T
= Σ biui
Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces
CVPR 1991 (Citation 2654)bull The eigenface approach
ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces
PageRank ndash Power Iteration
bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)
bull Row i has nonzero element in positions corresponding to inlinks Ii
Column-Stochastic amp Irreduciblebull Column-Stochastic
bull where
bull Irreducible
Iterative PageRank Calculationbull For k=12hellip
bull Equivalently (λ=1 A is a Markov chain transition matrix)
bull Why can we use power iteration to find the first eigenvector
Convergence of the power iteration
bull Expand the initial approximation r0 in terms of the eigenvectors
SINGULAR VALUE DECOMPOSITION
SVD - Definitionbull Any m x n matrix A with m ge n can be factorized
bull
Singular Values And Singular Vectors
bull The diagonal elements σj of are the singular values of the matrix A
bull The columns of U and V are the left singular vectors and right singular vectors respectively
bull Equivalent form of SVD
Matrix approximation
bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define
bull Then
bull It means that the best approximation of rank k for the matrix A is
SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A
ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A
bull Equivalently we can writebull U is just eigenvectors for AT
ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A
Example - LSI bull Build a term-by-document matrix A
bull Compute the SVD of A A = UΣVT
bull Approximate A by
ndash Uk Orthogonal basis that we use to approximate all the documents
ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk
SVD and PCAbull For symmetric A SVD is closely related to PCA
bull PCA A = UΛUT
ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT
ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues
bull For symmetric A column eigenvectors equal to row eigenvectors
bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample
Latent Semantic Indexing (LSI)1 Document file preparation preprocessing
ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming
2 Construction term-by-document matrix sparse matrix storage
3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback
Latent Semantic Indexingbull Assumption there is some underlying latent semantic
structure in the data
bull Eg car and automobile occur in similar documents as do cows and sheep
bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T
UΣ are the coordinates of A (rows) projected to space V
bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T
VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T
UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities
bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs
point to good authorities
Hubs Authorities
Power Iteration
bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing
bull Define the adjacency matrix L of the directed web graph
bull Now
HITS and SVD
bull L rows are outlinks columns are inlinks
bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT
bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix
HITS vs PageRankbull PageRank may be computed once HITS is computed per
query
bull HITS takes query into account PageRank doesnrsquot
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivationbull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithmbull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithmbull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Maximizing Variancebull The objective of the rotation transformation is to find the
maximal variancebull Projection of data along w is Awbull Variance σ2
w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)
bull Task maximize variance subject to constraint wTw=1
Optimization Problembull Maximize
λ is the Lagrange multiplierbull Differentiating with respect to w yields
bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same
fashion to look for the next one which is orthogonal to (all) the principal component(s) already found
Property Data Decompositionbull PCA can be treated as data decomposition
a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T
=(u1u2hellipun) (b1 b2 hellip bn)T
= Σ biui
Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces
CVPR 1991 (Citation 2654)bull The eigenface approach
ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces
PageRank ndash Power Iteration
bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)
bull Row i has nonzero element in positions corresponding to inlinks Ii
Column-Stochastic amp Irreduciblebull Column-Stochastic
bull where
bull Irreducible
Iterative PageRank Calculationbull For k=12hellip
bull Equivalently (λ=1 A is a Markov chain transition matrix)
bull Why can we use power iteration to find the first eigenvector
Convergence of the power iteration
bull Expand the initial approximation r0 in terms of the eigenvectors
SINGULAR VALUE DECOMPOSITION
SVD - Definitionbull Any m x n matrix A with m ge n can be factorized
bull
Singular Values And Singular Vectors
bull The diagonal elements σj of are the singular values of the matrix A
bull The columns of U and V are the left singular vectors and right singular vectors respectively
bull Equivalent form of SVD
Matrix approximation
bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define
bull Then
bull It means that the best approximation of rank k for the matrix A is
SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A
ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A
bull Equivalently we can writebull U is just eigenvectors for AT
ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A
Example - LSI bull Build a term-by-document matrix A
bull Compute the SVD of A A = UΣVT
bull Approximate A by
ndash Uk Orthogonal basis that we use to approximate all the documents
ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk
SVD and PCAbull For symmetric A SVD is closely related to PCA
bull PCA A = UΛUT
ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT
ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues
bull For symmetric A column eigenvectors equal to row eigenvectors
bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample
Latent Semantic Indexing (LSI)1 Document file preparation preprocessing
ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming
2 Construction term-by-document matrix sparse matrix storage
3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback
Latent Semantic Indexingbull Assumption there is some underlying latent semantic
structure in the data
bull Eg car and automobile occur in similar documents as do cows and sheep
bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T
UΣ are the coordinates of A (rows) projected to space V
bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T
VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T
UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities
bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs
point to good authorities
Hubs Authorities
Power Iteration
bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing
bull Define the adjacency matrix L of the directed web graph
bull Now
HITS and SVD
bull L rows are outlinks columns are inlinks
bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT
bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix
HITS vs PageRankbull PageRank may be computed once HITS is computed per
query
bull HITS takes query into account PageRank doesnrsquot
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivationbull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithmbull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithmbull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Optimization Problembull Maximize
λ is the Lagrange multiplierbull Differentiating with respect to w yields
bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same
fashion to look for the next one which is orthogonal to (all) the principal component(s) already found
Property Data Decompositionbull PCA can be treated as data decomposition
a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T
=(u1u2hellipun) (b1 b2 hellip bn)T
= Σ biui
Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces
CVPR 1991 (Citation 2654)bull The eigenface approach
ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces
PageRank ndash Power Iteration
bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)
bull Row i has nonzero element in positions corresponding to inlinks Ii
Column-Stochastic amp Irreduciblebull Column-Stochastic
bull where
bull Irreducible
Iterative PageRank Calculationbull For k=12hellip
bull Equivalently (λ=1 A is a Markov chain transition matrix)
bull Why can we use power iteration to find the first eigenvector
Convergence of the power iteration
bull Expand the initial approximation r0 in terms of the eigenvectors
SINGULAR VALUE DECOMPOSITION
SVD - Definitionbull Any m x n matrix A with m ge n can be factorized
bull
Singular Values And Singular Vectors
bull The diagonal elements σj of are the singular values of the matrix A
bull The columns of U and V are the left singular vectors and right singular vectors respectively
bull Equivalent form of SVD
Matrix approximation
bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define
bull Then
bull It means that the best approximation of rank k for the matrix A is
SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A
ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A
bull Equivalently we can writebull U is just eigenvectors for AT
ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A
Example - LSI bull Build a term-by-document matrix A
bull Compute the SVD of A A = UΣVT
bull Approximate A by
ndash Uk Orthogonal basis that we use to approximate all the documents
ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk
SVD and PCAbull For symmetric A SVD is closely related to PCA
bull PCA A = UΛUT
ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT
ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues
bull For symmetric A column eigenvectors equal to row eigenvectors
bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample
Latent Semantic Indexing (LSI)1 Document file preparation preprocessing
ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming
2 Construction term-by-document matrix sparse matrix storage
3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback
Latent Semantic Indexingbull Assumption there is some underlying latent semantic
structure in the data
bull Eg car and automobile occur in similar documents as do cows and sheep
bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T
UΣ are the coordinates of A (rows) projected to space V
bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T
VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T
UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities
bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs
point to good authorities
Hubs Authorities
Power Iteration
bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing
bull Define the adjacency matrix L of the directed web graph
bull Now
HITS and SVD
bull L rows are outlinks columns are inlinks
bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT
bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix
HITS vs PageRankbull PageRank may be computed once HITS is computed per
query
bull HITS takes query into account PageRank doesnrsquot
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivationbull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithmbull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithmbull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Property Data Decompositionbull PCA can be treated as data decomposition
a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T
=(u1u2hellipun) (b1 b2 hellip bn)T
= Σ biui
Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces
CVPR 1991 (Citation 2654)bull The eigenface approach
ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces
PageRank ndash Power Iteration
bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)
bull Row i has nonzero element in positions corresponding to inlinks Ii
Column-Stochastic amp Irreduciblebull Column-Stochastic
bull where
bull Irreducible
Iterative PageRank Calculationbull For k=12hellip
bull Equivalently (λ=1 A is a Markov chain transition matrix)
bull Why can we use power iteration to find the first eigenvector
Convergence of the power iteration
bull Expand the initial approximation r0 in terms of the eigenvectors
SINGULAR VALUE DECOMPOSITION
SVD - Definitionbull Any m x n matrix A with m ge n can be factorized
bull
Singular Values And Singular Vectors
bull The diagonal elements σj of are the singular values of the matrix A
bull The columns of U and V are the left singular vectors and right singular vectors respectively
bull Equivalent form of SVD
Matrix approximation
bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define
bull Then
bull It means that the best approximation of rank k for the matrix A is
SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A
ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A
bull Equivalently we can writebull U is just eigenvectors for AT
ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A
Example - LSI bull Build a term-by-document matrix A
bull Compute the SVD of A A = UΣVT
bull Approximate A by
ndash Uk Orthogonal basis that we use to approximate all the documents
ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk
SVD and PCAbull For symmetric A SVD is closely related to PCA
bull PCA A = UΛUT
ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT
ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues
bull For symmetric A column eigenvectors equal to row eigenvectors
bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample
Latent Semantic Indexing (LSI)1 Document file preparation preprocessing
ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming
2 Construction term-by-document matrix sparse matrix storage
3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback
Latent Semantic Indexingbull Assumption there is some underlying latent semantic
structure in the data
bull Eg car and automobile occur in similar documents as do cows and sheep
bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T
UΣ are the coordinates of A (rows) projected to space V
bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T
VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T
UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities
bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs
point to good authorities
Hubs Authorities
Power Iteration
bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing
bull Define the adjacency matrix L of the directed web graph
bull Now
HITS and SVD
bull L rows are outlinks columns are inlinks
bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT
bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix
HITS vs PageRankbull PageRank may be computed once HITS is computed per
query
bull HITS takes query into account PageRank doesnrsquot
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivationbull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithmbull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithmbull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces
CVPR 1991 (Citation 2654)bull The eigenface approach
ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces
PageRank ndash Power Iteration
bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)
bull Row i has nonzero element in positions corresponding to inlinks Ii
Column-Stochastic amp Irreduciblebull Column-Stochastic
bull where
bull Irreducible
Iterative PageRank Calculationbull For k=12hellip
bull Equivalently (λ=1 A is a Markov chain transition matrix)
bull Why can we use power iteration to find the first eigenvector
Convergence of the power iteration
bull Expand the initial approximation r0 in terms of the eigenvectors
SINGULAR VALUE DECOMPOSITION
SVD - Definitionbull Any m x n matrix A with m ge n can be factorized
bull
Singular Values And Singular Vectors
bull The diagonal elements σj of are the singular values of the matrix A
bull The columns of U and V are the left singular vectors and right singular vectors respectively
bull Equivalent form of SVD
Matrix approximation
bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define
bull Then
bull It means that the best approximation of rank k for the matrix A is
SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A
ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A
bull Equivalently we can writebull U is just eigenvectors for AT
ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A
Example - LSI bull Build a term-by-document matrix A
bull Compute the SVD of A A = UΣVT
bull Approximate A by
ndash Uk Orthogonal basis that we use to approximate all the documents
ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk
SVD and PCAbull For symmetric A SVD is closely related to PCA
bull PCA A = UΛUT
ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT
ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues
bull For symmetric A column eigenvectors equal to row eigenvectors
bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample
Latent Semantic Indexing (LSI)1 Document file preparation preprocessing
ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming
2 Construction term-by-document matrix sparse matrix storage
3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback
Latent Semantic Indexingbull Assumption there is some underlying latent semantic
structure in the data
bull Eg car and automobile occur in similar documents as do cows and sheep
bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T
UΣ are the coordinates of A (rows) projected to space V
bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T
VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T
UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities
bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs
point to good authorities
Hubs Authorities
Power Iteration
bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing
bull Define the adjacency matrix L of the directed web graph
bull Now
HITS and SVD
bull L rows are outlinks columns are inlinks
bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT
bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix
HITS vs PageRankbull PageRank may be computed once HITS is computed per
query
bull HITS takes query into account PageRank doesnrsquot
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivationbull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithmbull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithmbull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
PageRank ndash Power Iteration
bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)
bull Row i has nonzero element in positions corresponding to inlinks Ii
Column-Stochastic amp Irreduciblebull Column-Stochastic
bull where
bull Irreducible
Iterative PageRank Calculationbull For k=12hellip
bull Equivalently (λ=1 A is a Markov chain transition matrix)
bull Why can we use power iteration to find the first eigenvector
Convergence of the power iteration
bull Expand the initial approximation r0 in terms of the eigenvectors
SINGULAR VALUE DECOMPOSITION
SVD - Definitionbull Any m x n matrix A with m ge n can be factorized
bull
Singular Values And Singular Vectors
bull The diagonal elements σj of are the singular values of the matrix A
bull The columns of U and V are the left singular vectors and right singular vectors respectively
bull Equivalent form of SVD
Matrix approximation
bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define
bull Then
bull It means that the best approximation of rank k for the matrix A is
SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A
ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A
bull Equivalently we can writebull U is just eigenvectors for AT
ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A
Example - LSI bull Build a term-by-document matrix A
bull Compute the SVD of A A = UΣVT
bull Approximate A by
ndash Uk Orthogonal basis that we use to approximate all the documents
ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk
SVD and PCAbull For symmetric A SVD is closely related to PCA
bull PCA A = UΛUT
ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT
ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues
bull For symmetric A column eigenvectors equal to row eigenvectors
bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample
Latent Semantic Indexing (LSI)1 Document file preparation preprocessing
ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming
2 Construction term-by-document matrix sparse matrix storage
3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback
Latent Semantic Indexingbull Assumption there is some underlying latent semantic
structure in the data
bull Eg car and automobile occur in similar documents as do cows and sheep
bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T
UΣ are the coordinates of A (rows) projected to space V
bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T
VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T
UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities
bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs
point to good authorities
Hubs Authorities
Power Iteration
bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing
bull Define the adjacency matrix L of the directed web graph
bull Now
HITS and SVD
bull L rows are outlinks columns are inlinks
bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT
bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix
HITS vs PageRankbull PageRank may be computed once HITS is computed per
query
bull HITS takes query into account PageRank doesnrsquot
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivationbull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithmbull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithmbull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Column-Stochastic amp Irreduciblebull Column-Stochastic
bull where
bull Irreducible
Iterative PageRank Calculationbull For k=12hellip
bull Equivalently (λ=1 A is a Markov chain transition matrix)
bull Why can we use power iteration to find the first eigenvector
Convergence of the power iteration
bull Expand the initial approximation r0 in terms of the eigenvectors
SINGULAR VALUE DECOMPOSITION
SVD - Definitionbull Any m x n matrix A with m ge n can be factorized
bull
Singular Values And Singular Vectors
bull The diagonal elements σj of are the singular values of the matrix A
bull The columns of U and V are the left singular vectors and right singular vectors respectively
bull Equivalent form of SVD
Matrix approximation
bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define
bull Then
bull It means that the best approximation of rank k for the matrix A is
SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A
ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A
bull Equivalently we can writebull U is just eigenvectors for AT
ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A
Example - LSI bull Build a term-by-document matrix A
bull Compute the SVD of A A = UΣVT
bull Approximate A by
ndash Uk Orthogonal basis that we use to approximate all the documents
ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk
SVD and PCAbull For symmetric A SVD is closely related to PCA
bull PCA A = UΛUT
ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT
ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues
bull For symmetric A column eigenvectors equal to row eigenvectors
bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample
Latent Semantic Indexing (LSI)1 Document file preparation preprocessing
ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming
2 Construction term-by-document matrix sparse matrix storage
3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback
Latent Semantic Indexingbull Assumption there is some underlying latent semantic
structure in the data
bull Eg car and automobile occur in similar documents as do cows and sheep
bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD
Similarity Measures
bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T
UΣ are the coordinates of A (rows) projected to space V
bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T
VΣ are the coordinates of A (columns) projected to space U
Similarity Measures
bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T
UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U
HITS (Hyperlink Induced Topic Search)
bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities
bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs
point to good authorities
Hubs Authorities
Power Iteration
• Each page i has both a hub score hi and an authority score ai.
• HITS successively refines these scores by computing
ai ← Σ(j : j→i) hj   and   hi ← Σ(j : i→j) aj.
• Define the adjacency matrix L of the directed web graph: Lij = 1 if page i links to page j, and 0 otherwise.
• Now the updates become a = Lᵀh and h = La, so a ← LᵀLa and h ← LLᵀh (with normalization at each step).
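A compact numpy sketch of this iteration on a made-up four-page graph; it also confirms the point of the next slide, that the converged scores match the first singular vectors of L.

```python
import numpy as np

# Adjacency matrix of a tiny directed web graph: L[i, j] = 1 if page i
# links to page j. (The graph is invented for illustration.)
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 1, 0]], dtype=float)

n = L.shape[0]
a = np.ones(n)                      # authority scores
h = np.ones(n)                      # hub scores
for _ in range(100):
    a = L.T @ h                     # authorities gather from hubs
    h = L @ a                       # hubs gather from authorities
    a /= np.linalg.norm(a)
    h /= np.linalg.norm(h)

# a and h are the dominant eigenvectors of L^T L and L L^T, i.e. the
# first right and left singular vectors of L (up to sign).
U, s, Vt = np.linalg.svd(L)
print(np.allclose(a, np.abs(Vt[0]), atol=1e-6))    # True
print(np.allclose(h, np.abs(U[:, 0]), atol=1e-6))  # True
```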
HITS and SVD
• L: rows are outlinks, columns are inlinks.
• a will be the dominant eigenvector of the authority matrix LᵀL.
• h will be the dominant eigenvector of the hub matrix LLᵀ.
• In fact, h and a are the first left and right singular vectors of L.
• We are in fact running SVD on the adjacency matrix.
HITS vs. PageRank
• PageRank may be computed once; HITS is computed per query.
• HITS takes the query into account; PageRank doesn't.
• PageRank has no concept of hubs.
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot.
• PageRank is more stable because of its random-jump step.
NMF – NON-NEGATIVE MATRIX FACTORIZATION
Definition
• Given a non-negative matrix Vn×m, find non-negative matrix factors Wn×k and Hk×m such that
Vn×m ≈ Wn×kHk×m
• V: column matrix; each column is a data sample (n-dimensional).
• W: k basis vectors; each column wi represents one basis.
• H: coordinates of V projected onto W; column by column,
vj ≈ Wn×khj
Motivation
• Non-negativity is natural in many applications.
• Probability is also non-negative.
• Additive model to capture local structure.
Multiplicative Update Algorithm
• Cost function: Euclidean distance, ‖V − WH‖² = Σij (Vij − (WH)ij)²
• Multiplicative update (Lee & Seung):
Haμ ← Haμ · (WᵀV)aμ / (WᵀWH)aμ,   Wia ← Wia · (VHᵀ)ia / (WHHᵀ)ia
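A minimal numpy implementation of these updates; the random data, k = 5, the fixed iteration count, and the small eps guard against division by zero are all assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((20, 30))            # non-negative data, columns = samples
n, m = V.shape
k = 5

W = rng.random((n, k))              # non-negative initialization
H = rng.random((k, m))

eps = 1e-12                         # guard against division by zero
for _ in range(500):
    # Lee-Seung updates for the Euclidean cost ||V - WH||^2:
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

print(np.linalg.norm(V - W @ H))    # reconstruction error has decreased
```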
Multiplicative Update Algorithm
• Cost function: divergence
D(A‖B) = Σij (Aij log(Aij/Bij) − Aij + Bij)
– Reduces to the Kullback–Leibler divergence when Σij Aij = Σij Bij = 1, so that A and B can be regarded as normalized probability distributions.
• Multiplicative update (Lee & Seung):
Haμ ← Haμ · [Σi Wia Viμ/(WH)iμ] / [Σk Wka],   Wia ← Wia · [Σμ Haμ Viμ/(WH)iμ] / [Σν Haν]
• PLSA is NMF with KL divergence.
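For completeness, the same sketch with the divergence cost and the corresponding Lee–Seung updates; again, the positive random data and the parameters are arbitrary demo choices.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((20, 30)) + 1e-3     # strictly positive data
n, m = V.shape
k = 5
W = rng.random((n, k))
H = rng.random((k, m))

eps = 1e-12
for _ in range(500):
    # Lee-Seung updates for the divergence D(V || WH):
    WH = W @ H + eps
    H *= (W.T @ (V / WH)) / W.sum(axis=0)[:, None]
    WH = W @ H + eps
    W *= ((V / WH) @ H.T) / H.sum(axis=1)[None, :]

kl = np.sum(V * np.log(V / (W @ H)) - V + W @ H)
print(kl)                           # divergence has shrunk over the run
```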
NMF vs. PCA
• n = 2429 faces, m = 19×19 pixels.
• Positive values are illustrated with black pixels and negative values with red pixels.
• NMF: parts-based representation.
• PCA: holistic representation.
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS, 2001.
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.
Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki. (Highly recommended.)
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivationbull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithmbull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithmbull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
HITS (Hyperlink Induced Topic Search)
bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities
bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs
point to good authorities
Hubs Authorities
Power Iteration
bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing
bull Define the adjacency matrix L of the directed web graph
bull Now
HITS and SVD
bull L rows are outlinks columns are inlinks
bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT
bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix
HITS vs PageRankbull PageRank may be computed once HITS is computed per
query
bull HITS takes query into account PageRank doesnrsquot
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivationbull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithmbull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithmbull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Power Iteration
bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing
bull Define the adjacency matrix L of the directed web graph
bull Now
HITS and SVD
bull L rows are outlinks columns are inlinks
bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT
bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix
HITS vs PageRankbull PageRank may be computed once HITS is computed per
query
bull HITS takes query into account PageRank doesnrsquot
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivationbull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithmbull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithmbull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
HITS and SVD
bull L rows are outlinks columns are inlinks
bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT
bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix
HITS vs PageRankbull PageRank may be computed once HITS is computed per
query
bull HITS takes query into account PageRank doesnrsquot
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivationbull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithmbull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithmbull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
HITS vs PageRankbull PageRank may be computed once HITS is computed per
query
bull HITS takes query into account PageRank doesnrsquot
bull PageRank has no concept of hubs
bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot
bull PageRank more stable because of its random jump step
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivationbull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithmbull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithmbull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
NMF ndash NON-NEGATIVE MATRIX FACTORIZATION
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivationbull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithmbull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithmbull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Definition
bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that
VntimesmasympWntimeskHktimesm
bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W
vj asymp Wntimeskhj
Motivationbull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithmbull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithmbull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Motivationbull Non-negativity is natural in many applications
bull Probability is also non-negative
bull Additive model to capture local structure
Multiplicative Update Algorithmbull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithmbull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Multiplicative Update Algorithmbull Cost function Euclidean distance
bull Multiplicative Update
Multiplicative Update Algorithmbull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Multiplicative Update Algorithmbull Cost function Divergence
ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions
bull Multiplicative update
bull PLSA is NMF with KL divergence
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
NMF vs PCA
bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values
with red pixels
bull NMF Parts-based representationbull PCA Holistic representations
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Referencebull D D Lee and H S Seung Algorithms for non-negative matrix
factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative
matrix factorization (pdf) Nature 401 788-791 (1999)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)
Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007
University of Helsinki (Highly recommend)