Clustering and Non-negative Matrix Factorization
Presented by Mohammad Sajjad Ghaemi
DAMAS Lab, Computer Science and Software Engineering Department,
Laval University
12 April 2013
Presented by Mohammad Sajjad Ghaemi, Laboratory DAMAS Clustering and Non-negative Matrix Factorization 1/36
Outline
- What is clustering?
- NMF overview
- Cost functions and multiplicative algorithms
- A geometric interpretation of NMF
- r-separable NMF
What is clustering?
According to the label information, there are three categories of learning:
- Supervised learning: learning from labeled data.
- Semi-supervised learning: learning from both labeled and unlabeled data.
- Unsupervised learning: learning from unlabeled data.
Label information
Taken from [Jain,2010]
[Jain, 2010] Jain, A.K., 2010. Data clustering: 50 years beyond k-means. Pattern Recognition Letters 31, 651-666.
Semi-supervised learning and manifold assumption
Taken from [Jain,2010]
Unsupervised learning or clustering
Taken from [Jain,2010]
What is clustering?
- Clustering is grouping similar objects together.
- Clusterings are usually not "right" or "wrong"; different clusterings and clustering criteria can reveal different things about the data.
- There is no objectively "correct" clustering algorithm; rather, "clustering is in the eye of the beholder".
- Clustering algorithms:
  - Employ some notion of distance between objects.
  - Have an explicit or implicit criterion defining what a good cluster is.
  - Heuristically optimize that criterion to determine the clustering.
Comparing various clustering algorithms
Taken from [Jain,2010]
Outline
- What is clustering?
- NMF overview
- Cost functions and multiplicative algorithms
- A geometric interpretation of NMF
- r-separable NMF
Matrix factorization
- NMF (Non-negative Matrix Factorization)
  Question: given a non-negative matrix V, find non-negative matrix factors W and H such that
  V ≈ WH
  Answer: Non-negative Matrix Factorization (NMF).
  Advantage of non-negativity: interpretability.
- NMF is NP-hard.
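As a minimal sketch of the factorization V ≈ WH, one can use scikit-learn's `NMF` estimator (the library, matrix sizes, and rank below are illustrative choices, not part of the slides):

```python
import numpy as np
from sklearn.decomposition import NMF

# A small non-negative data matrix V (6 samples x 4 features).
rng = np.random.default_rng(0)
V = rng.random((6, 4))

# Factor V ~ W H with inner rank 2; both factors are kept non-negative.
model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(V)   # 6 x 2
H = model.components_        # 2 x 4

# The non-negativity constraint is what makes the factors interpretable.
assert (W >= 0).all() and (H >= 0).all()
print(np.linalg.norm(V - W @ H))  # reconstruction error of the rank-2 fit
```

Because the exact problem is NP-hard, solvers like this one only find a local optimum that depends on the initialization.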
Matrix factorization
- Generally, factorization of matrices is not unique:
  - Principal Component Analysis
  - Singular Value Decomposition
  - Nystrom Method
- Non-negative Matrix Factorization differs from the above methods.
- NMF enforces the constraint that the factors must be non-negative:
  - All elements must be equal to or greater than zero.
Matrix factorization
- Is there any unique solution to the NMF problem? No: for any invertible matrix D such that WD^{-1} and DH remain non-negative,
  V ≈ WH = (WD^{-1})(DH)
- NMF has the drawback of being highly ill-posed.
NMF is interesting because it does data clustering
Data Clustering = Matrix Factorizations
Many unsupervised learning methods are closely related in a simple way (Ding, He, Simon, SDM 2005).
Numerical example
Taken from Katsuhiko Ishiguro, et al., Extracting Essential Structure from Data.
Heat map of NMF on the gene expression data
The left is the gene expression data, where each column corresponds to a sample; the middle is the basis matrix; and the right is the coefficient matrix.
Taken from Yifeng Li, et al., The Non-Negative Matrix Factorization Toolbox for Biological Data Mining.
Heat map of NMF clustering on a yeast metabolic
The left is the gene expression data, where each column corresponds to a gene; the middle is the basis matrix; and the right is the coefficient matrix.
Taken from Yifeng Li, et al., The Non-Negative Matrix Factorization Toolbox for Biological Data Mining.
Outline
- What is clustering?
- NMF overview
- Cost functions and multiplicative algorithms
- A geometric interpretation of NMF
- r-separable NMF
How to solve it
Two conventional and convergent algorithms:
- Square of the Euclidean distance
  ||A - B||^2 = \sum_{ij} (A_{ij} - B_{ij})^2
- Generalized Kullback-Leibler divergence
  D(A||B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)
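Both cost functions are straightforward to evaluate; a small numpy sketch (the `eps` guard against log(0) is an implementation choice, not part of the slides):

```python
import numpy as np

def frobenius_sq(A, B):
    """Squared Euclidean distance: sum_ij (A_ij - B_ij)^2."""
    return np.sum((A - B) ** 2)

def gen_kl(A, B, eps=1e-12):
    """Generalized KL divergence: sum_ij (A_ij log(A_ij/B_ij) - A_ij + B_ij).

    eps guards against log(0) for zero entries.
    """
    A = A + eps
    B = B + eps
    return np.sum(A * np.log(A / B) - A + B)

A = np.array([[1.0, 2.0], [3.0, 4.0]])
# Both divergences vanish exactly when A == B and are positive otherwise.
assert frobenius_sq(A, A) == 0.0
assert abs(gen_kl(A, A)) < 1e-9
assert frobenius_sq(A, A + 1.0) == 4.0
assert gen_kl(A, A + 1.0) > 0.0
```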
How to minimize it
- Minimize ||V - WH||^2 or D(V||WH).
- Convex in W alone or in H alone, but not convex in both variables jointly.
- Goal: find local minima (global minima are hard to obtain).
- Gradient descent?
  - Slow convergence.
  - Sensitive to the step size.
  - Inconvenient for large data.
Cost functions and gradient-based algorithm for the squared Euclidean distance
- Minimize ||V - WH||^2.
- Gradient descent update:
  W_{ik}^{new} = W_{ik} - \mu_{ik} \nabla_W
  where \nabla_W is the gradient of the approximation objective function with respect to W.
- Without loss of generality, we can assume that \nabla_W consists of \nabla^+ and \nabla^-, its positive and unsigned negative terms, respectively. That is,
  \nabla_W = \nabla^+ - \nabla^-
- According to the steepest gradient descent method, W_{ik}^{new} = W_{ik} - \mu_{ik}(\nabla^+_{ik} - \nabla^-_{ik}) can minimize the NMF objectives.
- By assuming that each matrix element has its own learning rate \mu_{ik} = W_{ik} / \nabla^+_{ik}, we have
  W_{ik}^{new} = W_{ik} \frac{\nabla^-_{ik}}{\nabla^+_{ik}}   (multiplicative update rule)
Multiplicative vs. additive rules
By taking the gradient of the cost function with respect to W we have
  \nabla_W = WHH^T - VH^T
  W_{ik}^{new} = W_{ik} - \mu_{ik}\left((WHH^T)_{ik} - (VH^T)_{ik}\right)
  \mu_{ik} = \frac{W_{ik}}{(WHH^T)_{ik}}
  W_{ik}^{new} = W_{ik} \frac{(VH^T)_{ik}}{(WHH^T)_{ik}}
Similarly for H,
  H_{ik}^{new} = H_{ik} \frac{(W^T V)_{ik}}{(W^T W H)_{ik}}   (1)
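The two multiplicative rules can be sketched in a few lines of numpy; the `eps` term guarding against division by zero and the random initialization are implementation choices, not part of the slides:

```python
import numpy as np

def nmf_multiplicative(V, r, n_iter=200, seed=0, eps=1e-10):
    """Multiplicative updates for min ||V - WH||_F^2 with W, H >= 0.

    Implements the rules above:
        W_ik <- W_ik * (V H^T)_ik / (W H H^T)_ik
        H_ik <- H_ik * (W^T V)_ik / (W^T W H)_ik
    Starting from non-negative factors, the updates keep them non-negative.
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r))
    H = rng.random((r, n))
    errors = []
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H) ** 2)
    return W, H, errors

rng = np.random.default_rng(1)
V = rng.random((8, 6))
W, H, errors = nmf_multiplicative(V, r=3)
# The objective is non-increasing, as the auxiliary-function argument
# on the following slides guarantees.
assert all(e2 <= e1 + 1e-8 for e1, e2 in zip(errors, errors[1:]))
```

Note the contrast with the additive rule: no step size needs to be tuned, since the element-wise learning rate is absorbed into the ratio.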
Cost functions and gradient-based algorithm for the squared Euclidean distance
- The justification given for the multiplicative update rule offers no theoretical guarantee that the resulting updates will monotonically decrease the objective function!
- Currently, the auxiliary function technique is the most widely accepted approach for proving the monotonicity of multiplicative updates.
- Given an objective function J(W) to be minimized, G(W, W^t) is called an auxiliary function if it is a tight upper bound of J(W), that is,
  - G(W, W^t) >= J(W)
  - G(W, W) = J(W)
  for any W and W^t.
- Then iteratively applying the rule W^{t+1} = arg min_W G(W, W^t) results in a monotonic decrease of J(W).
Auxiliary function
Paraboloid function J(W) and its corresponding auxiliary function G(W, W^t), where G(W, W^t) >= J(W) and G(W^t, W^t) = J(W^t).
Taken from Zhaoshui He, et al., IEEE Transactions on Neural Networks, 2011.
Auxiliary function
Using an auxiliary function G(W, W^t) to minimize an objective function J(W). The auxiliary function is constructed around the current estimate of the minimizer; the next estimate is found by minimizing the auxiliary function, which provides an upper bound on the objective function. The procedure is iterated until it converges to a stationary point (generally, a local minimum) of the objective function.
Updates for H of the Euclidean distance
If K(h^t) is the diagonal matrix
  K_{ii}(h^t) = \frac{(W^T W h^t)_i}{h^t_i}
then
  G(h, h^t) = J(h^t) + (h - h^t)^T \nabla J(h^t) + \frac{1}{2}(h - h^t)^T K(h^t)(h - h^t)
is an auxiliary function for
  J(h) = \frac{1}{2} \sum_i \left( v_i - \sum_k W_{ik} h_k \right)^2
Proof
- The update rule can be obtained by taking the derivative of G(h, h^t) with respect to h and setting it to zero:
  \nabla J(h^t) + K(h^t)(h - h^t) = 0
  h = h^t - K(h^t)^{-1} \nabla J(h^t)
- Since J(h) is non-increasing under this auxiliary function, by writing the components of this equation explicitly we obtain
  h^{t+1}_i = h^t_i \frac{(W^T v)_i}{(W^T W h^t)_i}
- The update for W of the Euclidean distance can be shown similarly.
Outline
- What is clustering?
- NMF overview
- Cost functions and multiplicative algorithms
- A geometric interpretation of NMF
- r-separable NMF
A geometric interpretation of NMF
- Given M and its NMF M ≈ UV, one can scale M and U such that they become column stochastic, implying that V is column stochastic:
  M ≈ UV  ⟺  M' = M D_m = (U D_u)(D_u^{-1} V D_m) = U' V'
  M(:, j) = \sum_{i=1}^{k} U(:, i) V(i, j)  with  \sum_{i=1}^{k} V(i, j) = 1
- Therefore, the columns of M are convex combinations of the columns of U.
- In other terms, conv(M) ⊂ conv(U) ⊂ ∆^m, where ∆^m is the unit simplex.
- Solving exact NMF is equivalent to finding a polytope conv(U) between conv(M) and ∆^m with a minimum number of vertices.
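The normalization argument is easy to verify numerically: scale the columns of M and U to sum to one and the columns of V come out column stochastic as well (the matrix sizes below are illustrative):

```python
import numpy as np

# Build an exact NMF M = U Vc with k = 3 vertices, then rescale.
rng = np.random.default_rng(0)
U = rng.random((5, 3))
Vc = rng.random((3, 7))
M = U @ Vc

Dm = np.diag(1.0 / M.sum(axis=0))   # makes M column stochastic
Du = np.diag(1.0 / U.sum(axis=0))   # makes U column stochastic
Mp = M @ Dm                          # M'
Up = U @ Du                          # U'
Vp = np.linalg.inv(Du) @ Vc @ Dm     # V' = D_u^{-1} V D_m

# M' = U' V', and all three matrices are column stochastic.
assert np.allclose(Mp, Up @ Vp)
assert np.allclose(Mp.sum(axis=0), 1.0)
assert np.allclose(Up.sum(axis=0), 1.0)
assert np.allclose(Vp.sum(axis=0), 1.0)
```

Since V' is non-negative with unit column sums, each column of M' is indeed a convex combination of the columns of U'.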
Outline
- What is clustering?
- NMF overview
- Cost functions and multiplicative algorithms
- A geometric interpretation of NMF
- r-separable NMF
Separability assumption
- If the matrix V satisfies the separability condition, it is possible to compute optimal solutions in polynomial time.
- r-separable NMF: there exists an NMF (W, H) of rank r with V = WH where each column of W is equal to a column of V.
- V is r-separable  ⟺  V = WH = W[I_r, H']Π = [W, WH']Π
  for some H' ≥ 0 whose columns sum to one, some permutation matrix Π, where I_r is the r-by-r identity matrix.
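A quick numerical check of the definition: build V = W[I_r, H']Π from random factors and confirm that every column of W literally appears among the columns of V (the sizes r = 3 and 6×8 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
r, n_extra = 3, 5
W = rng.random((6, r))

Hp = rng.random((r, n_extra))
Hp /= Hp.sum(axis=0)                 # columns of H' sum to one

H = np.hstack([np.eye(r), Hp])       # [I_r, H']
perm = rng.permutation(r + n_extra)
Pi = np.eye(r + n_extra)[:, perm]    # a permutation matrix

V = W @ H @ Pi                       # an r-separable matrix

# Each column of W equals some (permuted) column of V.
for i in range(r):
    assert any(np.allclose(W[:, i], V[:, j]) for j in range(V.shape[1]))
```

This is what makes the separable case tractable: the factor W can be found by selecting columns of V itself, rather than searching over all non-negative matrices.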
A geometric interpretation of separability
Separability assumption
- Under separability, NMF reduces to the following problem: given a set of points (the normalized columns of V), identify the vertices of its convex hull.
- We are still very far from knowing the best ways to compute the convex hull in general dimensions, despite the variety of methods proposed for the convex hull problem.
- However, we want to design algorithms which are:
  - Fast: they should be able to deal with large-scale real-world problems where n is 10^6 to 10^9.
  - Robust: if noise is added to the separable matrix, they should still be able to identify the right set of columns.
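The vertex-identification problem above can be illustrated in two dimensions with `scipy.spatial.ConvexHull` (a general-purpose hull routine used here only as a toy stand-in; the triangle and the 20 sampled points are illustrative):

```python
import numpy as np
from scipy.spatial import ConvexHull

# Three generating points (the "columns of W") plus points that are
# strict convex combinations of them (the remaining columns of V).
vertices = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
rng = np.random.default_rng(0)
weights = rng.dirichlet(np.ones(3), size=20)   # strictly positive weights
interior = weights @ vertices                  # points inside the triangle

points = np.vstack([vertices, interior])
hull = ConvexHull(points)

# The hull vertices recover exactly the three generating points.
assert sorted(hull.vertices.tolist()) == [0, 1, 2]
```

General hull algorithms like this scale poorly with dimension, which is why separable-NMF research focuses on specialized column-selection methods that stay fast and noise-robust.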
Separability assumption
- For a separable matrix V, we have
  V = WH = W[I_r, H']Π = [W, WH']Π = VX,
  where
  X = Π^T \begin{pmatrix} I_r & H' \\ 0_{(n-r)×r} & 0_{(n-r)×(n-r)} \end{pmatrix} Π,
  Π is a permutation matrix, and the columns of H' ≥ 0 sum to one.
- Therefore, for r-separable NMF we need to solve the following optimization problem subject to some constraints:
  ||V(:, i) - VX(:, i)||_2 ≤ ε  for all i.
Questions?