An Introduction To Matrix Decomposition
Lei Zhang, Lead Researcher

Microsoft Research Asia

2012-04-17

Outline

• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.

What Is Matrix Decomposition?
• We wish to decompose the matrix A by writing it as a product of two or more matrices:
  A_{n×m} = B_{n×k} C_{k×m}
• Suppose A, B, C are column matrices:
  – A_{n×m} = (a1, a2, …, am); each ai is an n-dim data sample
  – B_{n×k} = (b1, b2, …, bk); each bj is an n-dim basis, and the space B consists of the k bases
  – C_{k×m} = (c1, c2, …, cm); each ci holds the k-dim coordinates of ai projected onto the space B

Why We Need Matrix Decomposition

• Given one data sample: a1 = B_{n×k} c1, i.e.
  (a11, a12, …, a1n)^T = (b1, b2, …, bk)(c11, c12, …, c1k)^T
• Another data sample: a2 = B_{n×k} c2
• More data samples: am = B_{n×k} cm
• Together (m data samples): (a1, a2, …, am) = B_{n×k} (c1, c2, …, cm), i.e. A_{n×m} = B_{n×k} C_{k×m}

Why We Need Matrix Decomposition

(a1, a2, …, am) = B_{n×k} (c1, c2, …, cm), i.e. A_{n×m} = B_{n×k} C_{k×m}

• We wish to find a new set of bases B to represent the data samples A; in the new space, A becomes C.
• In general, B captures the common features in A, while C carries the specific characteristics of the original samples.
• In PCA, B is the eigenvectors; in SVD, B is the right (column) eigenvectors; in LDA, B is the discriminant directions; in NMF, B is the local features.
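To make the A = BC picture concrete, here is a minimal numpy sketch (not from the slides; the sizes and variable names simply mirror the notation above). It builds a basis B, forms A = BC, and recovers the coordinates C of each sample by least-squares projection onto the space spanned by B.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 50, 200, 5          # sample dimension, number of samples, number of bases

B = rng.normal(size=(n, k))   # n-dim bases, one per column
C_true = rng.normal(size=(k, m))
A = B @ C_true                # data matrix: each column a_i is an n-dim sample

# Recover the k-dim coordinates of each sample in the space spanned by B
# (least-squares projection; exact here because A lies in span(B)).
C, *_ = np.linalg.lstsq(B, A, rcond=None)

print(A.shape, B.shape, C.shape)   # (50, 200) (50, 5) (5, 200)
print(np.allclose(B @ C, A))       # True
```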

PRINCIPAL COMPONENT ANALYSIS

Definition – Eigenvalue & Eigenvector

Given an m × m matrix C, for any λ and any nonzero vector w, if

  Cw = λw

then λ is called an eigenvalue of C and w is called the corresponding eigenvector.

Definition ndash Principle Component Analysis

– Principal Component Analysis (PCA)
– Karhunen–Loève transform (KL transform)

• Let A be an n × m data matrix in which the rows represent data samples.
• Each row is a data vector; each column represents a variable.
• A is centered: the estimated mean is subtracted from each column, so each column has zero mean.
• Covariance matrix C (m × m): C = A^T A

Principal Component Analysis
• C can be decomposed as follows: C = UΛU^T
• Λ is a diagonal matrix, diag(λ1, λ2, …, λm); each λi is an eigenvalue.
• U is an orthogonal matrix; each column is an eigenvector: U^T U = I, U^{-1} = U^T
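A minimal numpy sketch of this eigendecomposition step, assuming the centering convention above (the synthetic data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.1])   # 3 variables with different spread

A = X - X.mean(axis=0)        # center each column (variable)
C = A.T @ A                   # covariance matrix as on the slide (up to a 1/(n-1) factor)

lam, U = np.linalg.eigh(C)    # eigh: C is symmetric; eigenvalues returned in ascending order
order = np.argsort(lam)[::-1] # sort eigenvalues (variances) in decreasing order
lam, U = lam[order], U[:, order]

print(np.allclose(U @ np.diag(lam) @ U.T, C))   # C = U Lambda U^T
print(np.allclose(U.T @ U, np.eye(3)))          # U^T U = I
```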

Maximizing Variance
• The objective of the rotation transformation is to find the direction of maximal variance.
• The projection of the data along w is Aw.
• Variance: σ²_w = (Aw)^T (Aw) = w^T A^T A w = w^T C w, where C = A^T A is the covariance matrix of the data (A is centered).
• Task: maximize the variance subject to the constraint w^T w = 1.

Optimization Problem
• Maximize J(w) = w^T C w - λ(w^T w - 1), where λ is the Lagrange multiplier.
• Differentiating with respect to w yields 2Cw - 2λw = 0.
• Eigenvalue equation: Cw = λw, where C = A^T A.
• Once the first principal component is found, we continue in the same fashion to look for the next one, which is orthogonal to (all) the principal component(s) already found.

Property: Data Decomposition
• PCA can be treated as data decomposition:
  a = UU^T a
    = (u1, u2, …, un)(u1, u2, …, un)^T a
    = (u1, u2, …, un)(<u1, a>, <u2, a>, …, <un, a>)^T
    = (u1, u2, …, un)(b1, b2, …, bn)^T
    = Σi bi ui
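A quick numerical check of this identity, using an arbitrary orthonormal U rather than the slides' data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
U, _ = np.linalg.qr(rng.normal(size=(n, n)))   # orthonormal columns u_1, ..., u_n
a = rng.normal(size=n)

b = U.T @ a                                    # b_i = <u_i, a>
a_rebuilt = sum(b[i] * U[:, i] for i in range(n))   # a = sum_i b_i u_i

print(np.allclose(a_rebuilt, a))               # True: a = U U^T a
```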

Face Recognition – Eigenface
• Turk, M.A., Pentland, A.P., "Face recognition using eigenfaces", CVPR 1991. (Citations: 2654)
• The eigenface approach:
  – images are points in a vector space
  – use PCA to reduce dimensionality
  – face space
  – compare projections onto the face space to recognize faces

PageRank – Power Iteration
• In the link matrix, column j has nonzero elements in the positions corresponding to the outlinks of j (N_j in total).
• Row i has nonzero elements in the positions corresponding to the inlinks I_i.

Column-Stochastic & Irreducible
• Column-stochastic: every column of the link matrix sums to one, i.e. the nonzero entries of column j are 1/N_j, where N_j is the number of outlinks of page j.
• Irreducible: every page must be reachable from every other page; this is ensured by mixing in a uniform random-jump term, which makes the resulting matrix A both column-stochastic and irreducible.

Iterative PageRank Calculation
• For k = 1, 2, …: compute r_k = A r_{k-1} and normalize.
• Equivalently, we are solving Ar = r (λ = 1; A is a Markov chain transition matrix).
• Why can we use power iteration to find the first eigenvector?

Convergence of the power iteration
• Expand the initial approximation r_0 in terms of the eigenvectors: r_0 = α1 v1 + α2 v2 + … + αn vn.
• Then A^k r_0 = α1 λ1^k v1 + α2 λ2^k v2 + … + αn λn^k vn; since λ1 = 1 and |λi| < 1 for i > 1, the iterate converges to the direction of the first eigenvector v1.
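A toy power-iteration sketch for PageRank (the four-page graph, the damping value 0.85, and the matrix name Q are illustrative assumptions, not taken from the slides):

```python
import numpy as np

# Toy web graph: links[j] lists the pages that page j points to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = 4

Q = np.zeros((n, n))
for j, outs in links.items():
    for i in outs:
        Q[i, j] = 1.0 / len(outs)      # column j spreads weight 1/N_j over its outlinks

alpha = 0.85                            # random-jump (damping) factor, assumed value
A = alpha * Q + (1 - alpha) / n         # column-stochastic and irreducible

r = np.full(n, 1.0 / n)
for _ in range(100):                    # r_k = A r_{k-1}; converges to the eigenvector for lambda = 1
    r_new = A @ r
    r_new /= r_new.sum()                # keep it a probability vector
    if np.linalg.norm(r_new - r, 1) < 1e-12:
        break
    r = r_new

print(r)                                # PageRank scores
```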

SINGULAR VALUE DECOMPOSITION

SVD – Definition
• Any m × n matrix A, with m ≥ n, can be factorized as
  A = UΣV^T
  where U has orthonormal columns, Σ = diag(σ1, σ2, …, σn) with σ1 ≥ σ2 ≥ … ≥ σn ≥ 0, and V is orthogonal.

Singular Values And Singular Vectors
• The diagonal elements σj of Σ are the singular values of the matrix A.
• The columns of U and V are the left singular vectors and right singular vectors, respectively.
• Equivalent (outer-product) form of SVD: A = Σj σj uj vj^T.

Matrix approximation
• Theorem: Let Uk = (u1, u2, …, uk), Vk = (v1, v2, …, vk), and Σk = diag(σ1, σ2, …, σk), and define Ak = Uk Σk Vk^T.
• Then ||A - Ak||_2 = min over all matrices B of rank at most k of ||A - B||_2 = σ_{k+1}.
• It means that the best rank-k approximation of the matrix A is Ak.
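A short numpy sketch of the rank-k truncation Ak = Uk Σk Vk^T and of the ||A - Ak||_2 = σ_{k+1} property, on a random test matrix (the size and k are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(8, 5))
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) V^T, s in descending order
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # best rank-k approximation

print(np.linalg.matrix_rank(Ak))                   # k
print(np.linalg.norm(A - Ak, 2), s[k])             # both equal sigma_{k+1}
```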

SVD and PCA
• We can write A = UΣV^T, so A^T A = VΣ²V^T.
• Remember that in PCA we treat A as a row matrix (each row is a sample).
• V is just the eigenvectors for A (i.e., of A^T A):
  – each column of V is an eigenvector of the row matrix A
  – we use V to approximate a row of A
• Equivalently, we can write AA^T = UΣ²U^T, so U is just the eigenvectors for A^T:
  – each column of U is an eigenvector of the column matrix A
  – we use U to approximate a column of A

Example – LSI
• Build a term-by-document matrix A.
• Compute the SVD of A: A = UΣV^T.
• Approximate A by Ak = Uk (Σk Vk^T) = Uk Dk:
  – Uk: the orthogonal basis that we use to approximate all the documents
  – Dk: column j holds the coordinates of document j in the new basis
  – Dk is the projection of A onto the subspace spanned by Uk
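A toy LSI sketch under the layout above, with terms as rows and documents as columns; the tiny count matrix, k = 2, and the cosine ranking are illustrative choices rather than the slides' example:

```python
import numpy as np

# Term-by-document matrix A: rows = terms, columns = documents (toy counts).
A = np.array([[1, 1, 0, 0],    # "car"
              [0, 1, 0, 0],    # "automobile"
              [0, 0, 1, 1],    # "cow"
              [0, 0, 1, 0]],   # "sheep"
             dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk = U[:, :k]                       # basis used to approximate all the documents
Dk = np.diag(s[:k]) @ Vt[:k, :]     # column j: coordinates of document j in the new basis

# Project a query (bag of terms) into the same k-dim space and rank documents by cosine.
q = np.array([1, 1, 0, 0], dtype=float)      # query containing "car", "automobile"
q_k = Uk.T @ q
scores = (Dk.T @ q_k) / (np.linalg.norm(Dk, axis=0) * np.linalg.norm(q_k) + 1e-12)
print(scores)                                # documents 0 and 1 score highest
```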

SVD and PCA
• For symmetric A, SVD is closely related to PCA.
• PCA: A = UΛU^T
  – U and Λ are the eigenvectors and eigenvalues
• SVD: A = UΣV^T
  – U is the left (column) eigenvectors
  – V is the right (row) eigenvectors
  – for symmetric A, Σ carries the same eigenvalues
• For symmetric A, the column eigenvectors equal the row eigenvectors.
• Note the difference in what A is in PCA and SVD:
  – SVD: A is directly the data, e.g. a term-by-document matrix
  – PCA: A is a covariance matrix, A = X^T X, where each row of X is a sample
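A quick numerical check of the symmetric case, using a positive semidefinite matrix built as X^T X (assumes distinct eigenvalues, so eigenvectors are unique up to sign):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 4))
A = X.T @ X                                   # symmetric (PSD) matrix, as in the PCA case

lam, U_eig = np.linalg.eigh(A)                # PCA view: A = U Lambda U^T
lam, U_eig = lam[::-1], U_eig[:, ::-1]        # descending order

U_svd, s, Vt = np.linalg.svd(A)               # SVD view: A = U Sigma V^T

print(np.allclose(s, lam))                    # singular values = eigenvalues (A is PSD)
print(np.allclose(np.abs(U_svd), np.abs(Vt.T)))   # column and row eigenvectors coincide
print(np.allclose(np.abs(U_svd), np.abs(U_eig)))
```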

Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing:
   – indexing: collecting terms
   – use a stop list: eliminate "meaningless" words
   – stemming
2. Construction: term-by-document matrix, sparse matrix storage.
3. Query matching: distance measures.
4. Data compression by low-rank approximation: SVD.
5. Ranking and relevance feedback.

Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data.
• E.g., "car" and "automobile" occur in similar documents, as do "cows" and "sheep".
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using SVD.

Similarity Measures
• Term to term: AA^T = UΣ²U^T = (UΣ)(UΣ)^T
  UΣ are the coordinates of A (rows) projected into the space V.
• Document to document: A^T A = VΣ²V^T = (VΣ)(VΣ)^T
  VΣ are the coordinates of A (columns) projected into the space U.

Similarity Measures
• Term to document: A = UΣV^T = (UΣ^{1/2})(VΣ^{1/2})^T
  UΣ^{1/2} are the coordinates of A (rows) projected into the space V.
  VΣ^{1/2} are the coordinates of A (columns) projected into the space U.
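A small check that the SVD factors reproduce these similarity matrices (random A; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(6, 4))                  # rows = terms, columns = documents

U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)

term_coords = U @ S                          # U Sigma: term coordinates
doc_coords = Vt.T @ S                        # V Sigma: document coordinates

print(np.allclose(term_coords @ term_coords.T, A @ A.T))    # term-to-term similarities
print(np.allclose(doc_coords @ doc_coords.T, A.T @ A))      # document-to-document similarities
```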

HITS (Hyperlink-Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages:
  – authorities contain high-quality information
  – hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it.
• A page is a good hub if it points to many authorities.
• Good authorities are pointed to by good hubs, and good hubs point to good authorities.

(Figure: hubs on one side pointing to authorities on the other.)

Power Iteration
• Each page i has both a hub score h_i and an authority score a_i.
• HITS successively refines these scores by computing
  a_i ← sum of h_j over pages j that link to i,   h_i ← sum of a_j over pages j that i links to.
• Define the adjacency matrix L of the directed web graph: L_ij = 1 if page i links to page j, and 0 otherwise.
• Now, in matrix form: a ← L^T h and h ← L a.

HITS and SVD
• In L, rows are outlinks and columns are inlinks.
• a will be the dominant eigenvector of the authority matrix L^T L.
• h will be the dominant eigenvector of the hub matrix LL^T.
• h and a are in fact the first left and right singular vectors of L, respectively.
• We are in fact running SVD on the adjacency matrix.
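A minimal HITS sketch on a toy adjacency matrix L (the graph is an example; L_ij = 1 when page i links to page j, as defined above):

```python
import numpy as np

# L[i, j] = 1 if page i links to page j (rows = outlinks, columns = inlinks).
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 1]], dtype=float)

n = L.shape[0]
a = np.ones(n)        # authority scores
h = np.ones(n)        # hub scores

for _ in range(100):
    a = L.T @ h       # a_i = sum of hub scores of pages linking to i
    h = L @ a         # h_i = sum of authority scores of pages i links to
    a /= np.linalg.norm(a)
    h /= np.linalg.norm(h)

# a and h converge to the dominant eigenvectors of L^T L and L L^T,
# i.e. the first right and left singular vectors of L.
U, s, Vt = np.linalg.svd(L)
print(np.allclose(np.abs(a), np.abs(Vt[0])), np.allclose(np.abs(h), np.abs(U[:, 0])))
```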

HITS vs. PageRank
• PageRank may be computed once; HITS is computed per query.
• HITS takes the query into account; PageRank doesn't.
• PageRank has no concept of hubs.
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot.
• PageRank is more stable because of its random-jump step.

NMF – NON-NEGATIVE MATRIX FACTORIZATION

Definition
• Given a non-negative matrix V_{n×m}, find non-negative matrix factors W_{n×k} and H_{k×m} such that
  V_{n×m} ≈ W_{n×k} H_{k×m}
• V: column matrix; each column is a data sample (n-dimensional).
• W: each column wi is one of the k bases.
• H: the coordinates of V projected onto W, i.e. vj ≈ W_{n×k} hj.

Motivation
• Non-negativity is natural in many applications.
• Probability is also non-negative.
• Additive model to capture local structure.

Multiplicative Update Algorithm
• Cost function: Euclidean distance, ||V - WH||².
• Multiplicative update (multiplication and division taken element-wise):
  H ← H ⊙ (W^T V) / (W^T W H),   W ← W ⊙ (V H^T) / (W H H^T)
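A compact sketch of these Euclidean multiplicative updates in the style of Lee & Seung (the data, rank k, iteration count, and the small eps guard are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, k = 30, 40, 4
V = rng.random((n, m))                 # non-negative data, one sample per column

W = rng.random((n, k)) + 1e-3          # non-negative initialization
H = rng.random((k, m)) + 1e-3
eps = 1e-12                            # guard against division by zero

for _ in range(500):
    H *= (W.T @ V) / (W.T @ W @ H + eps)   # H <- H * (W^T V) / (W^T W H)
    W *= (V @ H.T) / (W @ H @ H.T + eps)   # W <- W * (V H^T) / (W H H^T)

print(np.linalg.norm(V - W @ H))       # Frobenius reconstruction error (non-increasing)
print((W >= 0).all() and (H >= 0).all())
```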

Multiplicative Update Algorithm
• Cost function: divergence, D(A||B) = Σ_ij (A_ij log(A_ij / B_ij) - A_ij + B_ij).
  – It reduces to the Kullback–Leibler divergence when Σ_ij A_ij = Σ_ij B_ij = 1, i.e. when A and B can be regarded as normalized probability distributions.
• Multiplicative update:
  H_aj ← H_aj · (Σ_i W_ia V_ij / (WH)_ij) / Σ_i W_ia,   W_ia ← W_ia · (Σ_j H_aj V_ij / (WH)_ij) / Σ_j H_aj
• PLSA is NMF with KL divergence.

NMF vs. PCA
• n = 2429 faces; m = 19×19 pixels.
• Positive values are illustrated with black pixels and negative values with red pixels.
• NMF: parts-based representation.
• PCA: holistic representation.

Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS, 2001.
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.

Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki. (Highly recommended.)

  • An Introduction To Matrix Decomposition
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
Page 2: An Introduction To Matrix  Decomposition

Outline

bull Matrix Decompositionndash PCA SVD NMFndash LDA ICA Sparse Coding etc

What Is Matrix Decompositionbull We wish to decompose the matrix A by writing it as a product

of two or more matricesAntimesm = BntimeskCktimesm

bull Suppose A B C are column matricesndash Antimesm = (a1 a2 hellip am) each ai is a n-dim data samplendash Bntimesk = (b1 b2 hellip bk) each bj is a n-dim basis and space B consists of k

basesndash Cktimesm = (c1 c2 hellip cm) each ci is the k-dim coordinates of ai projected to

space B

Why We Need Matrix Decomposition

bull Given one data samplea1 = Bntimeskc1

(a11 a12 hellip a1n)T = (b1 b2 hellip bk) (c11 c12 hellip c1k)T

bull Another data sample a2 = Bntimeskc2

bull More data sample am = Bntimeskcm

bull Together (m data samples) (a1 a2 hellip am) = Bntimesk (c1 c2 hellip cm) Antimesm = BntimeskCktimesm

Why We Need Matrix Decomposition

(a1 a2 hellip am) = Bntimesk (c1 c2 hellip cm) Antimesm = BntimeskCktimesm

bull We wish to find a set of new basis B to represent data samples A and A will become C in the new space

bull In general B captures the common features in A while C carries specific characteristics of the original samples

bull In PCA B is eigenvectorsbull In SVD B is right (column) eigenvectorsbull In LDA B is discriminant directionsbull In NMF B is local features

PRINCIPLE COMPONENT ANALYSIS

Definition ndash Eigenvalue amp Eigenvector

Given a m x m matrix C for any λ and w if

Then λ is called eigenvalue and w is called eigenvector

Definition ndash Principle Component Analysis

ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)

bull Let A be a n times m data matrix in which the rows represent data samples

bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each

column so each column has zero meanbull Covariance matrix C (m x m)

Principle Component Analysisbull C can be decomposed as follows C=UΛUT

bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector

UTU=I U-1=UT

Maximizing Variancebull The objective of the rotation transformation is to find the

maximal variancebull Projection of data along w is Awbull Variance σ2

w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)

bull Task maximize variance subject to constraint wTw=1

Optimization Problembull Maximize

λ is the Lagrange multiplierbull Differentiating with respect to w yields

bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same

fashion to look for the next one which is orthogonal to (all) the principal component(s) already found

Property Data Decompositionbull PCA can be treated as data decomposition

a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T

=(u1u2hellipun) (b1 b2 hellip bn)T

= Σ biui

Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces

CVPR 1991 (Citation 2654)bull The eigenface approach

ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces

PageRank ndash Power Iteration

bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)

bull Row i has nonzero element in positions corresponding to inlinks Ii

Column-Stochastic amp Irreduciblebull Column-Stochastic

bull where

bull Irreducible

Iterative PageRank Calculationbull For k=12hellip

bull Equivalently (λ=1 A is a Markov chain transition matrix)

bull Why can we use power iteration to find the first eigenvector

Convergence of the power iteration

bull Expand the initial approximation r0 in terms of the eigenvectors

SINGULAR VALUE DECOMPOSITION

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

  • An Introduction To Matrix Decomposition
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
Page 3: An Introduction To Matrix  Decomposition

What Is Matrix Decompositionbull We wish to decompose the matrix A by writing it as a product

of two or more matricesAntimesm = BntimeskCktimesm

bull Suppose A B C are column matricesndash Antimesm = (a1 a2 hellip am) each ai is a n-dim data samplendash Bntimesk = (b1 b2 hellip bk) each bj is a n-dim basis and space B consists of k

basesndash Cktimesm = (c1 c2 hellip cm) each ci is the k-dim coordinates of ai projected to

space B

Why We Need Matrix Decomposition

bull Given one data samplea1 = Bntimeskc1

(a11 a12 hellip a1n)T = (b1 b2 hellip bk) (c11 c12 hellip c1k)T

bull Another data sample a2 = Bntimeskc2

bull More data sample am = Bntimeskcm

bull Together (m data samples) (a1 a2 hellip am) = Bntimesk (c1 c2 hellip cm) Antimesm = BntimeskCktimesm

Why We Need Matrix Decomposition

(a1 a2 hellip am) = Bntimesk (c1 c2 hellip cm) Antimesm = BntimeskCktimesm

bull We wish to find a set of new basis B to represent data samples A and A will become C in the new space

bull In general B captures the common features in A while C carries specific characteristics of the original samples

bull In PCA B is eigenvectorsbull In SVD B is right (column) eigenvectorsbull In LDA B is discriminant directionsbull In NMF B is local features

PRINCIPLE COMPONENT ANALYSIS

Definition ndash Eigenvalue amp Eigenvector

Given a m x m matrix C for any λ and w if

Then λ is called eigenvalue and w is called eigenvector

Definition ndash Principle Component Analysis

ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)

bull Let A be a n times m data matrix in which the rows represent data samples

bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each

column so each column has zero meanbull Covariance matrix C (m x m)

Principle Component Analysisbull C can be decomposed as follows C=UΛUT

bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector

UTU=I U-1=UT

Maximizing Variancebull The objective of the rotation transformation is to find the

maximal variancebull Projection of data along w is Awbull Variance σ2

w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)

bull Task maximize variance subject to constraint wTw=1

Optimization Problembull Maximize

λ is the Lagrange multiplierbull Differentiating with respect to w yields

bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same

fashion to look for the next one which is orthogonal to (all) the principal component(s) already found

Property Data Decompositionbull PCA can be treated as data decomposition

a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T

=(u1u2hellipun) (b1 b2 hellip bn)T

= Σ biui

Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces

CVPR 1991 (Citation 2654)bull The eigenface approach

ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces

PageRank ndash Power Iteration

bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)

bull Row i has nonzero element in positions corresponding to inlinks Ii

Column-Stochastic amp Irreduciblebull Column-Stochastic

bull where

bull Irreducible

Iterative PageRank Calculationbull For k=12hellip

bull Equivalently (λ=1 A is a Markov chain transition matrix)

bull Why can we use power iteration to find the first eigenvector

Convergence of the power iteration

bull Expand the initial approximation r0 in terms of the eigenvectors

SINGULAR VALUE DECOMPOSITION

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

  • An Introduction To Matrix Decomposition
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
Page 4: An Introduction To Matrix  Decomposition

Why We Need Matrix Decomposition

bull Given one data samplea1 = Bntimeskc1

(a11 a12 hellip a1n)T = (b1 b2 hellip bk) (c11 c12 hellip c1k)T

bull Another data sample a2 = Bntimeskc2

bull More data sample am = Bntimeskcm

bull Together (m data samples) (a1 a2 hellip am) = Bntimesk (c1 c2 hellip cm) Antimesm = BntimeskCktimesm

Why We Need Matrix Decomposition

(a1 a2 hellip am) = Bntimesk (c1 c2 hellip cm) Antimesm = BntimeskCktimesm

bull We wish to find a set of new basis B to represent data samples A and A will become C in the new space

bull In general B captures the common features in A while C carries specific characteristics of the original samples

bull In PCA B is eigenvectorsbull In SVD B is right (column) eigenvectorsbull In LDA B is discriminant directionsbull In NMF B is local features

PRINCIPLE COMPONENT ANALYSIS

Definition ndash Eigenvalue amp Eigenvector

Given a m x m matrix C for any λ and w if

Then λ is called eigenvalue and w is called eigenvector

Definition ndash Principle Component Analysis

ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)

bull Let A be a n times m data matrix in which the rows represent data samples

bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each

column so each column has zero meanbull Covariance matrix C (m x m)

Principle Component Analysisbull C can be decomposed as follows C=UΛUT

bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector

UTU=I U-1=UT

Maximizing Variancebull The objective of the rotation transformation is to find the

maximal variancebull Projection of data along w is Awbull Variance σ2

w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)

bull Task maximize variance subject to constraint wTw=1

Optimization Problembull Maximize

λ is the Lagrange multiplierbull Differentiating with respect to w yields

bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same

fashion to look for the next one which is orthogonal to (all) the principal component(s) already found

Property Data Decompositionbull PCA can be treated as data decomposition

a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T

=(u1u2hellipun) (b1 b2 hellip bn)T

= Σ biui

Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces

CVPR 1991 (Citation 2654)bull The eigenface approach

ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces

PageRank ndash Power Iteration

bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)

bull Row i has nonzero element in positions corresponding to inlinks Ii

Column-Stochastic amp Irreduciblebull Column-Stochastic

bull where

bull Irreducible

Iterative PageRank Calculationbull For k=12hellip

bull Equivalently (λ=1 A is a Markov chain transition matrix)

bull Why can we use power iteration to find the first eigenvector

Convergence of the power iteration

bull Expand the initial approximation r0 in terms of the eigenvectors

SINGULAR VALUE DECOMPOSITION

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

  • An Introduction To Matrix Decomposition
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
Page 5: An Introduction To Matrix  Decomposition

Why We Need Matrix Decomposition

(a1 a2 hellip am) = Bntimesk (c1 c2 hellip cm) Antimesm = BntimeskCktimesm

bull We wish to find a set of new basis B to represent data samples A and A will become C in the new space

bull In general B captures the common features in A while C carries specific characteristics of the original samples

bull In PCA B is eigenvectorsbull In SVD B is right (column) eigenvectorsbull In LDA B is discriminant directionsbull In NMF B is local features

PRINCIPLE COMPONENT ANALYSIS

Definition ndash Eigenvalue amp Eigenvector

Given a m x m matrix C for any λ and w if

Then λ is called eigenvalue and w is called eigenvector

Definition ndash Principle Component Analysis

ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)

bull Let A be a n times m data matrix in which the rows represent data samples

bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each

column so each column has zero meanbull Covariance matrix C (m x m)

Principle Component Analysisbull C can be decomposed as follows C=UΛUT

bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector

UTU=I U-1=UT

Maximizing Variancebull The objective of the rotation transformation is to find the

maximal variancebull Projection of data along w is Awbull Variance σ2

w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)

bull Task maximize variance subject to constraint wTw=1

Optimization Problembull Maximize

λ is the Lagrange multiplierbull Differentiating with respect to w yields

bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same

fashion to look for the next one which is orthogonal to (all) the principal component(s) already found

Property Data Decompositionbull PCA can be treated as data decomposition

a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T

=(u1u2hellipun) (b1 b2 hellip bn)T

= Σ biui

Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces

CVPR 1991 (Citation 2654)bull The eigenface approach

ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces

PageRank ndash Power Iteration

bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)

bull Row i has nonzero element in positions corresponding to inlinks Ii

Column-Stochastic amp Irreduciblebull Column-Stochastic

bull where

bull Irreducible

Iterative PageRank Calculationbull For k=12hellip

bull Equivalently (λ=1 A is a Markov chain transition matrix)

bull Why can we use power iteration to find the first eigenvector

Convergence of the power iteration

bull Expand the initial approximation r0 in terms of the eigenvectors

SINGULAR VALUE DECOMPOSITION

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

  • An Introduction To Matrix Decomposition
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
Page 6: An Introduction To Matrix  Decomposition

PRINCIPLE COMPONENT ANALYSIS

Definition ndash Eigenvalue amp Eigenvector

Given a m x m matrix C for any λ and w if

Then λ is called eigenvalue and w is called eigenvector

Definition ndash Principle Component Analysis

ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)

bull Let A be a n times m data matrix in which the rows represent data samples

bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each

column so each column has zero meanbull Covariance matrix C (m x m)

Principle Component Analysisbull C can be decomposed as follows C=UΛUT

bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector

UTU=I U-1=UT

Maximizing Variancebull The objective of the rotation transformation is to find the

maximal variancebull Projection of data along w is Awbull Variance σ2

w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)

bull Task maximize variance subject to constraint wTw=1

Optimization Problembull Maximize

λ is the Lagrange multiplierbull Differentiating with respect to w yields

bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same

fashion to look for the next one which is orthogonal to (all) the principal component(s) already found

Property Data Decompositionbull PCA can be treated as data decomposition

a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T

=(u1u2hellipun) (b1 b2 hellip bn)T

= Σ biui

Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces

CVPR 1991 (Citation 2654)bull The eigenface approach

ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces

PageRank ndash Power Iteration

bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)

bull Row i has nonzero element in positions corresponding to inlinks Ii

Column-Stochastic amp Irreduciblebull Column-Stochastic

bull where

bull Irreducible

Iterative PageRank Calculationbull For k=12hellip

bull Equivalently (λ=1 A is a Markov chain transition matrix)

bull Why can we use power iteration to find the first eigenvector

Convergence of the power iteration

bull Expand the initial approximation r0 in terms of the eigenvectors

SINGULAR VALUE DECOMPOSITION

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCA
• For symmetric A, SVD is closely related to PCA.
• PCA: A = UΛU^T
  – U and Λ are the eigenvectors and eigenvalues.
• SVD: A = UΛV^T
  – U is the left (column) eigenvectors;
  – V is the right (row) eigenvectors;
  – Λ is the same eigenvalues.
• For symmetric A, the column eigenvectors equal the row eigenvectors.
• Note the difference of A in PCA and SVD:
  – SVD: A is directly the data, e.g. a term-by-document matrix;
  – PCA: A is a covariance matrix, A = X^TX, where each row in X is a sample.

Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing:
   – indexing: collecting terms;
   – use a stop list: eliminate "meaningless" words;
   – stemming.
2. Construction: term-by-document matrix, sparse matrix storage.
3. Query matching: distance measures.
4. Data compression by low-rank approximation: SVD.
5. Ranking and relevance feedback.

Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data.
• E.g. "car" and "automobile" occur in similar documents, as do "cows" and "sheep".
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using SVD.

Similarity Measures
• Term to term: AA^T = UΣ²U^T = (UΣ)(UΣ)^T.
  UΣ are the coordinates of A (rows) projected to space V.
• Document to document: A^TA = VΣ²V^T = (VΣ)(VΣ)^T.
  VΣ are the coordinates of A (columns) projected to space U.

Similarity Measures
• Term to document: A = UΣV^T = (UΣ^(1/2))(VΣ^(1/2))^T.
  UΣ^(1/2) are the coordinates of A (rows) projected to space V;
  VΣ^(1/2) are the coordinates of A (columns) projected to space U.
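The identities above are easy to verify numerically; a minimal sketch follows, with arbitrary matrix sizes chosen only for illustration.

    import numpy as np

    A = np.random.rand(5, 7)                  # rows = terms, columns = documents (sizes are arbitrary)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    term_coords = U @ np.diag(s)              # U*Sigma: one row of coordinates per term
    doc_coords = Vt.T @ np.diag(s)            # V*Sigma: one row of coordinates per document

    # Inner products of the projected coordinates reproduce the similarity matrices
    assert np.allclose(term_coords @ term_coords.T, A @ A.T)    # term-to-term
    assert np.allclose(doc_coords @ doc_coords.T, A.T @ A)      # document-to-document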

HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages:
  – authorities contain high-quality information;
  – hubs are comprehensive lists of links to authorities.
• A page is a good authority if many hubs point to it.
• A page is a good hub if it points to many authorities.
• Good authorities are pointed to by good hubs, and good hubs point to good authorities.

(Figure: hubs pointing to authorities.)

Power Iteration
• Each page i has both a hub score hi and an authority score ai.
• HITS successively refines these scores by computing
  ai ← Σ(j: j→i) hj and hi ← Σ(j: i→j) aj.
• Define the adjacency matrix L of the directed web graph: Lij = 1 if page i links to page j, and 0 otherwise.
• Now, in matrix form: a ← L^T h and h ← L a (followed by normalization).
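A minimal HITS sketch built directly from these updates; the 4-page example graph and the iteration count are illustrative choices.

    import numpy as np

    def hits(L, n_iter=100):
        """L: adjacency matrix with L[i, j] = 1 if page i links to page j.
        Returns (hub_scores, authority_scores)."""
        n = L.shape[0]
        h = np.ones(n)
        a = np.ones(n)
        for _ in range(n_iter):
            a = L.T @ h                   # authorities gather score from the hubs that point to them
            h = L @ a                     # hubs gather score from the authorities they point to
            a /= np.linalg.norm(a)        # normalize so the scores stay bounded
            h /= np.linalg.norm(h)
        return h, a

    # Tiny 4-page example: page 0 links to pages 1 and 2, page 3 links to page 1
    L = np.array([[0, 1, 1, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
    hubs, authorities = hits(L)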

HITS and SVD
• In L, rows are outlinks and columns are inlinks.
• a will be the dominant eigenvector of the authority matrix L^TL.
• h will be the dominant eigenvector of the hub matrix LL^T.
• They are in fact the first left and right singular vectors of L: we are in fact running SVD on the adjacency matrix.

HITS vs PageRank
• PageRank may be computed once; HITS is computed per query.
• HITS takes the query into account; PageRank doesn't.
• PageRank has no concept of hubs.
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot.
• PageRank is more stable because of its random jump step.

NMF – NON-NEGATIVE MATRIX FACTORIZATION

Definition
• Given a non-negative matrix Vn×m, find non-negative matrix factors Wn×k and Hk×m such that
  Vn×m ≈ Wn×kHk×m
• V: column matrix, each column is a data sample (n-dimensional).
• W: k bases; each column wi represents one base.
• H: coordinates of V projected to W, i.e. vj ≈ Wn×khj.

Motivation
• Non-negativity is natural in many applications.
• Probability is also non-negative.
• Additive model to capture local structure.

Multiplicative Update Algorithm
• Cost function: Euclidean distance, ‖V − WH‖² = Σij (Vij − (WH)ij)².
• Multiplicative update (Lee & Seung):
  Haμ ← Haμ (W^TV)aμ / (W^TWH)aμ,  Wia ← Wia (VH^T)ia / (WHH^T)ia.
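A minimal sketch of these updates; the random initialization, fixed iteration count and the small constant eps are illustrative choices, not part of the original algorithm statement.

    import numpy as np

    def nmf_euclidean(V, k, n_iter=200, eps=1e-9, seed=0):
        """Lee-Seung multiplicative updates minimizing ||V - W H||_F^2 (V must be non-negative)."""
        rng = np.random.default_rng(seed)
        n, m = V.shape
        W = rng.random((n, k))
        H = rng.random((k, m))
        for _ in range(n_iter):
            H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H with W fixed; stays non-negative
            W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
        return W, H

    V = np.random.rand(20, 15)                     # non-negative data, columns are samples
    W, H = nmf_euclidean(V, k=4)
    print(np.linalg.norm(V - W @ H))               # reconstruction error after the updates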

Multiplicative Update Algorithm
• Cost function: divergence, D(A‖B) = Σij (Aij log(Aij/Bij) − Aij + Bij)
  – it reduces to the Kullback–Leibler divergence when Σij Aij = Σij Bij = 1, i.e. when A and B can be regarded as normalized probability distributions.
• Multiplicative update: see the sketch below.
• PLSA is NMF with KL divergence.
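A minimal sketch of the Lee–Seung updates for the divergence cost; again, initialization, iteration count and eps are illustrative choices.

    import numpy as np

    def nmf_kl(V, k, n_iter=200, eps=1e-9, seed=0):
        """Lee-Seung multiplicative updates for the (generalized) KL divergence D(V || W H)."""
        rng = np.random.default_rng(seed)
        n, m = V.shape
        W = rng.random((n, k))
        H = rng.random((k, m))
        for _ in range(n_iter):
            WH = W @ H + eps
            H *= (W.T @ (V / WH)) / (W.sum(axis=0, keepdims=True).T + eps)
            WH = W @ H + eps
            W *= ((V / WH) @ H.T) / (H.sum(axis=1, keepdims=True).T + eps)
        return W, H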

NMF vs PCA
• n = 2429 faces, m = 19×19 pixels.
• Positive values are illustrated with black pixels and negative values with red pixels.
• NMF: parts-based representation.
• PCA: holistic representations.

Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS, 2001.
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.

Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki. (Highly recommended.)

  • An Introduction To Matrix Decomposition
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
Page 7: An Introduction To Matrix  Decomposition

Definition ndash Eigenvalue amp Eigenvector

Given a m x m matrix C for any λ and w if

Then λ is called eigenvalue and w is called eigenvector

Definition ndash Principle Component Analysis

ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)

bull Let A be a n times m data matrix in which the rows represent data samples

bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each

column so each column has zero meanbull Covariance matrix C (m x m)

Principle Component Analysisbull C can be decomposed as follows C=UΛUT

bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector

UTU=I U-1=UT

Maximizing Variancebull The objective of the rotation transformation is to find the

maximal variancebull Projection of data along w is Awbull Variance σ2

w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)

bull Task maximize variance subject to constraint wTw=1

Optimization Problembull Maximize

λ is the Lagrange multiplierbull Differentiating with respect to w yields

bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same

fashion to look for the next one which is orthogonal to (all) the principal component(s) already found

Property Data Decompositionbull PCA can be treated as data decomposition

a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T

=(u1u2hellipun) (b1 b2 hellip bn)T

= Σ biui

Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces

CVPR 1991 (Citation 2654)bull The eigenface approach

ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces

PageRank ndash Power Iteration

bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)

bull Row i has nonzero element in positions corresponding to inlinks Ii

Column-Stochastic amp Irreduciblebull Column-Stochastic

bull where

bull Irreducible

Iterative PageRank Calculationbull For k=12hellip

bull Equivalently (λ=1 A is a Markov chain transition matrix)

bull Why can we use power iteration to find the first eigenvector

Convergence of the power iteration

bull Expand the initial approximation r0 in terms of the eigenvectors

SINGULAR VALUE DECOMPOSITION

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

  • An Introduction To Matrix Decomposition
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
Page 8: An Introduction To Matrix  Decomposition

Definition ndash Principle Component Analysis

ndash Principle Component Analysis (PCA)ndash Karhunen-Loeve transformation (KL transformation)

bull Let A be a n times m data matrix in which the rows represent data samples

bull Each row is a data vector each column represents a variablebull A is centered the estimated mean is subtracted from each

column so each column has zero meanbull Covariance matrix C (m x m)

Principle Component Analysisbull C can be decomposed as follows C=UΛUT

bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector

UTU=I U-1=UT

Maximizing Variancebull The objective of the rotation transformation is to find the

maximal variancebull Projection of data along w is Awbull Variance σ2

w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)

bull Task maximize variance subject to constraint wTw=1

Optimization Problembull Maximize

λ is the Lagrange multiplierbull Differentiating with respect to w yields

bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same

fashion to look for the next one which is orthogonal to (all) the principal component(s) already found

Property Data Decompositionbull PCA can be treated as data decomposition

a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T

=(u1u2hellipun) (b1 b2 hellip bn)T

= Σ biui

Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces

CVPR 1991 (Citation 2654)bull The eigenface approach

ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces

PageRank ndash Power Iteration

bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)

bull Row i has nonzero element in positions corresponding to inlinks Ii

Column-Stochastic amp Irreduciblebull Column-Stochastic

bull where

bull Irreducible

Iterative PageRank Calculationbull For k=12hellip

bull Equivalently (λ=1 A is a Markov chain transition matrix)

bull Why can we use power iteration to find the first eigenvector

Convergence of the power iteration

bull Expand the initial approximation r0 in terms of the eigenvectors

SINGULAR VALUE DECOMPOSITION

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

  • An Introduction To Matrix Decomposition
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
Page 9: An Introduction To Matrix  Decomposition

Principle Component Analysisbull C can be decomposed as follows C=UΛUT

bull Λ is a diagonal matrix diag(λ1 λ2hellipλn) each λi is an eigenvaluebull U is an orthogonal matrix each column is an eigenvector

UTU=I U-1=UT

Maximizing Variancebull The objective of the rotation transformation is to find the

maximal variancebull Projection of data along w is Awbull Variance σ2

w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)

bull Task maximize variance subject to constraint wTw=1

Optimization Problembull Maximize

λ is the Lagrange multiplierbull Differentiating with respect to w yields

bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same

fashion to look for the next one which is orthogonal to (all) the principal component(s) already found

Property Data Decompositionbull PCA can be treated as data decomposition

a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T

=(u1u2hellipun) (b1 b2 hellip bn)T

= Σ biui

Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces

CVPR 1991 (Citation 2654)bull The eigenface approach

ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces

PageRank ndash Power Iteration

bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)

bull Row i has nonzero element in positions corresponding to inlinks Ii

Column-Stochastic amp Irreduciblebull Column-Stochastic

bull where

bull Irreducible

Iterative PageRank Calculationbull For k=12hellip

bull Equivalently (λ=1 A is a Markov chain transition matrix)

bull Why can we use power iteration to find the first eigenvector

Convergence of the power iteration

bull Expand the initial approximation r0 in terms of the eigenvectors

SINGULAR VALUE DECOMPOSITION

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

  • An Introduction To Matrix Decomposition
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
Page 10: An Introduction To Matrix  Decomposition

Maximizing Variancebull The objective of the rotation transformation is to find the

maximal variancebull Projection of data along w is Awbull Variance σ2

w= (Aw)T(Aw) = wTATAw = wTCwwhere C = ATA is the covariance matrix of the data (A is centered)

bull Task maximize variance subject to constraint wTw=1

Optimization Problembull Maximize

λ is the Lagrange multiplierbull Differentiating with respect to w yields

bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same

fashion to look for the next one which is orthogonal to (all) the principal component(s) already found

Property Data Decompositionbull PCA can be treated as data decomposition

a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T

=(u1u2hellipun) (b1 b2 hellip bn)T

= Σ biui

Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces

CVPR 1991 (Citation 2654)bull The eigenface approach

ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces

PageRank ndash Power Iteration

bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)

bull Row i has nonzero element in positions corresponding to inlinks Ii

Column-Stochastic amp Irreduciblebull Column-Stochastic

bull where

bull Irreducible

Iterative PageRank Calculationbull For k=12hellip

bull Equivalently (λ=1 A is a Markov chain transition matrix)

bull Why can we use power iteration to find the first eigenvector

Convergence of the power iteration

bull Expand the initial approximation r0 in terms of the eigenvectors

SINGULAR VALUE DECOMPOSITION

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

  • An Introduction To Matrix Decomposition
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
Page 11: An Introduction To Matrix  Decomposition

Optimization Problembull Maximize

λ is the Lagrange multiplierbull Differentiating with respect to w yields

bull Eigenvalue equation Cw = λw where C = ATA bull Once the first principal component is found we continue in the same

fashion to look for the next one which is orthogonal to (all) the principal component(s) already found

Property Data Decompositionbull PCA can be treated as data decomposition

a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T

=(u1u2hellipun) (b1 b2 hellip bn)T

= Σ biui

Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces

CVPR 1991 (Citation 2654)bull The eigenface approach

ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces

PageRank ndash Power Iteration

bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)

bull Row i has nonzero element in positions corresponding to inlinks Ii

Column-Stochastic amp Irreduciblebull Column-Stochastic

bull where

bull Irreducible

Iterative PageRank Calculationbull For k=12hellip

bull Equivalently (λ=1 A is a Markov chain transition matrix)

bull Why can we use power iteration to find the first eigenvector

Convergence of the power iteration

bull Expand the initial approximation r0 in terms of the eigenvectors

SINGULAR VALUE DECOMPOSITION

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

  • An Introduction To Matrix Decomposition
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
Page 12: An Introduction To Matrix  Decomposition

Property Data Decompositionbull PCA can be treated as data decomposition

a=UUTa=(u1u2hellipun) (u1u2hellipun)T a=(u1u2hellipun) (ltu1agtltu2agthellipltunagt)T

=(u1u2hellipun) (b1 b2 hellip bn)T

= Σ biui

Face Recognition ndash Eigenfacebull Turk MA Pentland AP Face recognition using eigenfaces

CVPR 1991 (Citation 2654)bull The eigenface approach

ndash images are points in a vector spacendash use PCA to reduce dimensionalityndash face spacendash compare projections onto face space to recognize faces

PageRank ndash Power Iteration

bull Column j has nonzero elements in positions corresponding to outlinks of j (Nj in total)

bull Row i has nonzero element in positions corresponding to inlinks Ii

Column-Stochastic amp Irreduciblebull Column-Stochastic

bull where

bull Irreducible

Iterative PageRank Calculationbull For k=12hellip

bull Equivalently (λ=1 A is a Markov chain transition matrix)

bull Why can we use power iteration to find the first eigenvector

Convergence of the power iteration

bull Expand the initial approximation r0 in terms of the eigenvectors

SINGULAR VALUE DECOMPOSITION

SVD - Definitionbull Any m x n matrix A with m ge n can be factorized

bull

Singular Values And Singular Vectors

bull The diagonal elements σj of are the singular values of the matrix A

bull The columns of U and V are the left singular vectors and right singular vectors respectively

bull Equivalent form of SVD

Matrix approximation

bull Theorem Let Uk = (u1 u2 hellip uk) Vk = (v1 v2 hellip vk) and Σk = diag(σ1 σ2 hellip σk) and define

bull Then

bull It means that the best approximation of rank k for the matrix A is

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)


point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

  • An Introduction To Matrix Decomposition
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
Page 22: An Introduction To Matrix  Decomposition

SVD and PCAbull We can writebull Remember that in PCA we treat A as a row matrixbull V is just eigenvectors for A

ndash Each column in V is an eigenvector of row matrix Andash we use V to approximate a row in A

bull Equivalently we can writebull U is just eigenvectors for AT

ndash Each column in U is an eigenvector of column matrix Andash We use U to approximate a column in A

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

  • An Introduction To Matrix Decomposition
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
Page 23: An Introduction To Matrix  Decomposition

Example - LSI bull Build a term-by-document matrix A

bull Compute the SVD of A A = UΣVT

bull Approximate A by

ndash Uk Orthogonal basis that we use to approximate all the documents

ndash Dk Column j hold the coordinates of document j in the new basisndash Dk is the projection of A onto the subspace spanned by Uk

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

  • An Introduction To Matrix Decomposition
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
Page 24: An Introduction To Matrix  Decomposition

SVD and PCAbull For symmetric A SVD is closely related to PCA

bull PCA A = UΛUT

ndash U and Λ are eigenvectors and eigenvaluesbull SVD A = UΛVT

ndash U is left(column) eigenvectorsndash V is right(row) eigenvectorsndash Λ is the same eigenvalues

bull For symmetric A column eigenvectors equal to row eigenvectors

bull Note the difference of A in PCA and SVDndash SVD A is directly the data eg term-by-document matrixndash PCA A is covariance matrix A=XTX each row in X is a sample

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

  • An Introduction To Matrix Decomposition
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
Page 25: An Introduction To Matrix  Decomposition

Latent Semantic Indexing (LSI)1 Document file preparation preprocessing

ndash Indexing collecting termsndash Use stop list eliminate rdquomeaninglessrdquo wordsndash Stemming

2 Construction term-by-document matrix sparse matrix storage

3 Query matching distance measures4 Data compression by low rank approximation SVD5 Ranking and relevance feedback

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

  • An Introduction To Matrix Decomposition
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
Page 26: An Introduction To Matrix  Decomposition

Latent Semantic Indexingbull Assumption there is some underlying latent semantic

structure in the data

bull Eg car and automobile occur in similar documents as do cows and sheep

bull This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

  • An Introduction To Matrix Decomposition
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
Page 27: An Introduction To Matrix  Decomposition

Similarity Measures

bull Term to term AAT = UΣ2UT = (UΣ)(UΣ)T

UΣ are the coordinates of A (rows) projected to space V

bull Document to documentATA = VΣ2VT = (VΣ)(VΣ)T

VΣ are the coordinates of A (columns) projected to space U

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

  • An Introduction To Matrix Decomposition
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
Page 28: An Introduction To Matrix  Decomposition

Similarity Measures

bull Term to document A = UΣVT = (UΣfrac12)(VΣfrac12)T

UΣfrac12 are the coordinates of A (rows) projected to space VVΣfrac12 are the coordinates of A (columns) projected to space U

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

  • An Introduction To Matrix Decomposition
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
Page 29: An Introduction To Matrix  Decomposition

HITS (Hyperlink Induced Topic Search)

bull Idea Web includes two flavors of prominent pagesndash authorities contain high-quality informationndash hubs are comprehensive lists of links to authorities

bull A page is a good authority if many hubs point to itbull A page is a good hub if it points to many authoritiesbull Good authorities are pointed to by good hubs and good hubs

point to good authorities

Hubs Authorities

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

  • An Introduction To Matrix Decomposition
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
Page 30: An Introduction To Matrix  Decomposition

Power Iteration

bull Each page i has both a hub score hi and an authority score aibull HITS successively refines these scores by computing

bull Define the adjacency matrix L of the directed web graph

bull Now

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

  • An Introduction To Matrix Decomposition
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
Page 31: An Introduction To Matrix  Decomposition

HITS and SVD

bull L rows are outlinks columns are inlinks

bull a will be the dominant eigenvector of the authority matrix LTL bull h will be the dominant eigenvector of the hub matrix LLT

bull They are in fact the first left and right singular vectors of Lbull We are in fact running SVD on the adjacency matrix

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)

  • An Introduction To Matrix Decomposition
  • Outline
  • What Is Matrix Decomposition
  • Why We Need Matrix Decomposition
  • Why We Need Matrix Decomposition (2)
  • Principle Component Analysis
  • Definition ndash Eigenvalue amp Eigenvector
  • Definition ndash Principle Component Analysis
  • Principle Component Analysis (2)
  • Maximizing Variance
  • Optimization Problem
  • Property Data Decomposition
  • Face Recognition ndash Eigenface
  • Slide 14
  • Slide 15
  • PageRank ndash Power Iteration
  • Column-Stochastic amp Irreducible
  • Iterative PageRank Calculation
  • Convergence of the power iteration
  • Singular value decomposition
  • SVD - Definition
  • Singular Values And Singular Vectors
  • Matrix approximation
  • SVD and PCA
  • Example - LSI
  • SVD and PCA (2)
  • Latent Semantic Indexing (LSI)
  • Latent Semantic Indexing
  • Similarity Measures
  • Similarity Measures (2)
  • HITS (Hyperlink Induced Topic Search)
  • Power Iteration
  • HITS and SVD
  • HITS vs PageRank
  • NMF ndash Non-Negative Matrix Factorization
  • Definition
  • Motivation
  • Multiplicative Update Algorithm
  • Multiplicative Update Algorithm (2)
  • NMF vs PCA
  • Reference
  • Major Reference
Page 32: An Introduction To Matrix  Decomposition

HITS vs PageRankbull PageRank may be computed once HITS is computed per

query

bull HITS takes query into account PageRank doesnrsquot

bull PageRank has no concept of hubs

bull HITS is sensitive to local topology insertion or deletion of a small number of nodes may change the scores a lot

bull PageRank more stable because of its random jump step

NMF ndash NON-NEGATIVE MATRIX FACTORIZATION

Definition

bull Given a nonnegative matrix Vntimesm find non-negative matrix factors Wntimesk and Hktimesm such that

VntimesmasympWntimeskHktimesm

bull V column matrix each column is a data sample (n-dimension)bull Wi k-basis represents one basebull H coordinates of V projected to W

vj asymp Wntimeskhj

Motivationbull Non-negativity is natural in many applications

bull Probability is also non-negative

bull Additive model to capture local structure

Multiplicative Update Algorithmbull Cost function Euclidean distance

bull Multiplicative Update

Multiplicative Update Algorithmbull Cost function Divergence

ndash Reduce to Kullback-Leibler divergence whenndash A and B can be regarded as normalized probability distributions

bull Multiplicative update

bull PLSA is NMF with KL divergence

NMF vs PCA

bull n = 2429 faces m = 19x19 pixelsbull Positive values are illustrated with black pixels and negative values

with red pixels

bull NMF Parts-based representationbull PCA Holistic representations

Referencebull D D Lee and H S Seung Algorithms for non-negative matrix

factorization (pdf) NIPS 2001bull D D Lee and H S Seung Learning the parts of objects by non-negative

matrix factorization (pdf) Nature 401 788-791 (1999)

Major Referencebull Saara Hyvoumlnen Linear Algebra Methods for Data Mining Spring 2007

University of Helsinki (Highly recommend)
