Locally Discriminative Coclustering
Lijun Zhang, Student Member, IEEE, Chun Chen, Member, IEEE, Jiajun Bu, Member, IEEE, Zhengguang Chen, Deng Cai, Member, IEEE, and Jiawei Han, Fellow, IEEE
Abstract—Different from traditional one-sided clustering techniques, coclustering makes use of the duality between samples and
features to partition them simultaneously. Most of the existing coclustering algorithms focus on modeling the relationship between
samples and features, whereas the intersample and interfeature relationships are ignored. In this paper, we propose a novel
coclustering algorithm named Locally Discriminative Coclustering (LDCC) to explore the relationship between samples and features as
well as the intersample and interfeature relationships. Specifically, the sample-feature relationship is modeled by a bipartite graph
between samples and features. And we apply local linear regression to discovering the intrinsic discriminative structures of both
sample space and feature space. For each local patch in the sample and feature spaces, a local linear function is estimated to predict
the labels of the points in this patch. The intersample and interfeature relationships are thus captured by minimizing the fitting errors of
all the local linear functions. In this way, LDCC groups strongly associated samples and features together, while respecting the local
structures of both sample and feature spaces. Our experimental results on several benchmark data sets have demonstrated the
effectiveness of the proposed method.
Index Terms—Coclustering, clustering, bipartite graph, local linear regression.
1 INTRODUCTION
Data clustering is a fundamental topic in unsupervised learning, and has become a common technique for data mining, information retrieval, pattern recognition, bioinformatics, etc. The goal of clustering is to partition the data points into clusters such that those within each cluster are more closely related to one another than points assigned to different clusters [1]. Typically, the data is formulated as a 2D matrix where one dimension represents samples, and the other represents features. Traditional clustering algorithms [2], [3], [4], [5] are one-sided in the sense that they only consider clustering samples based on their distributions on features, or vice versa.
Recently, coclustering has become a topic of significant
interest in text mining [6], [7], [8], microarray analysis [9],
[10], [11], and Collaborative Filtering (CF) [12], [13]. Instead
of clustering one dimension of the data matrix, coclustering
makes use of the duality between samples and features to
partition both dimensions simultaneously. It has been
shown that coclustering often yields impressive performance improvement over traditional one-sided clustering
algorithms. More importantly, the resulting coclusters may
reveal valuable insights about the data. For example,
clustering documents and words simultaneously provides one way to describe the semantics, i.e., using the words in a cocluster to annotate itself; coclustering of gene expression data can be used to identify groups of genes showing similar activity patterns under a set of conditions; coclustering in CF helps to discover groups of users that exhibit highly correlated ratings on groups of items.
Most of the existing coclustering algorithms focus on modeling the relationship between samples and features, but with different strategies. The graph-based coclustering methods [7], [8] construct a bipartite graph to represent the relationship between samples and features. In the information theory-based coclustering methods [14], [15], [16], samples and features are treated as instances of two discrete random variables, and the joint probability distribution between them is used to encode the sample-feature relationship. In the matrix factorization-based coclustering techniques [17], [18], the sample-feature relationship is modeled from the perspective of data reconstruction. Despite their successes in making use of the sample-feature relationship, these algorithms fail to consider the intersample and interfeature relationships, which are essential for data clustering.
In this paper, we propose a novel coclustering algorithm named Locally Discriminative Coclustering (LDCC) to explore the sample-feature relationship as well as the intersample and interfeature relationships. Specifically, the sample-feature relationship is modeled by a bipartite graph [7], [8], where an edge signifies an association between a sample and a feature. And we apply local linear regression to discovering the intrinsic discriminative structures of both the sample space and the feature space. For each local patch in the sample and feature spaces, a local linear function is trained to predict the labels of the points belonging to this patch. The intersample and interfeature relationships are thus encoded in the local regression functions by minimizing the fitting errors over all the local patches. In this way, LDCC
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 6, JUNE 2012 1025
. L. Zhang, C. Chen, J. Bu, and Z. Chen are with the Zhejiang Provincial Key Laboratory of Service Robot, College of Computer Science, Zhejiang University, Cao Guangbiao Building, Yuquan Campus, Hangzhou 310027, China. E-mail: {zljzju, chenc, bjj, cerror}@zju.edu.cn.
. D. Cai is with the State Key Lab of CAD&CG, College of Computer Science, Zhejiang University, 388 Yu Hang Tang Road, Hangzhou 310027, China. E-mail: [email protected].
. J. Han is with the Department of Computer Science, University of Illinois at Urbana-Champaign, Room 2132, Siebel Center for Computer Science, 201 N. Goodwin Avenue, Urbana, IL 61801. E-mail: [email protected].

Manuscript received 4 May 2010; revised 31 May 2010; accepted 16 Feb. 2011; published online 18 Mar. 2011. Recommended for acceptance by B.C. Ooi. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-2010-05-0265. Digital Object Identifier no. 10.1109/TKDE.2011.71.

1041-4347/12/$31.00 © 2012 IEEE. Published by the IEEE Computer Society
groups strongly associated samples and features together, while respecting the local discriminative structures of both the sample and feature spaces. We further develop an efficient computational scheme to solve the corresponding optimization problem.
Our optimization scheme first projects the samples and features into a common subspace and then performs coclustering in this subspace. The original Bipartite Spectral Graph Partitioning (BGP) [7], [8] methods also adopt this strategy, and the low-dimensional representation is computed directly from the data matrix. Thus, by visualizing the projection results of BGP and LDCC, we can see the effect of capturing the intersample and interfeature relationships intuitively. Fig. 1 shows the 1D projection results obtained by applying BGP and LDCC to one 2-Class subset of the WebKB corpus. As can be seen from Fig. 1a, BGP maps most of the features and samples together, which is clearly undesirable for coclustering. That is probably because BGP only considers the sample-feature relationship contained in the data matrix. When the data matrix is highly sparse, the projection results obtained from this limited information may be very unstable. From Fig. 1b, we can see that LDCC makes a big improvement compared with BGP. The two sample classes are almost separable, and the features are distributed more evenly.
The outline of the paper is as follows: in Section 2, we provide a brief review of the related work. Our proposed Locally Discriminative Coclustering algorithm is introduced in Section 3. In Section 4, we compare our algorithm with state-of-the-art clustering and coclustering algorithms. Finally, we provide some concluding remarks and suggestions for future work in Section 5.
Notation. Small letters are used to denote scalars. Lowercase bold letters (e.g., $\mathbf{w}$) are used to denote column vectors, and $\|\cdot\|$ is used to denote the $\ell_2$-norm of a vector. Capital letters (e.g., $A$) are used to denote matrices. We use $\mathrm{Tr}(\cdot)$ to denote the trace of a matrix, and $\|\cdot\|_F$ to denote the Frobenius norm of a matrix. Script capital letters (e.g., $\mathcal{X}$) are used to denote ordinary sets. Blackboard bold capital letters (e.g., $\mathbb{R}$) are used to denote number sets.
2 RELATED WORK
The research literature on clustering is vast and mainly concerns one-sided clustering [19]. Although introduced quite early [20], coclustering has received much attention only in recent years due to its application to many practical problems, including text mining and microarray analysis. In this section, we briefly review three types of clustering or coclustering algorithms: graph partition-based, information theory-based, and matrix factorization-based.
Through constructing a similarity graph where vertices correspond to data points and edge weights represent degrees of similarity, clustering can be formulated as the problem of graph partitioning. Spectral partition methods have been used effectively for solving several graph partitioning objectives, such as ratio cut [21] and Normalized Cut (Ncut) [22]. In [7] and [8], the authors model the problem of coclustering documents and words as finding minimum cut vertex partitions in a bipartite graph between documents and words, which is then relaxed and solved by a spectral method. Recently, a new method for partitioning the document-word bipartite graph called the Isoperimetric Coclustering Algorithm (ICA) has been proposed [6]. ICA heuristically minimizes the ratio of the perimeter of the bipartite graph partition and the area of the partition under an appropriate definition of graph-theoretic area. In [23], a novel algorithm named Consistent Bipartite Graph Copartitioning is proposed for star-structured high-order heterogeneous data coclustering. The bipartite graph model has also been successfully applied to cocluster genes and conditions in microarray analysis [10]. It is important to note that there are no intersample and interfeature edges in the bipartite graph model.
In information theory-based methods, samples and features are treated as the instances of two random variables, whose joint distribution can be empirically estimated from the data matrix. The problem of clustering samples or features can be viewed as the process of compressing the associated random variable. The Information Bottleneck (IB) method [24] is a one-sided clustering algorithm which compresses one random variable so that the mutual information about the other is preserved as much as possible. Later, an agglomerative hard version of the IB method was applied to clustering documents after clustering words, which is called Double Clustering (DC) [14]. Iterative Double Clustering (IDC) [15] extends DC to cluster documents and words iteratively. Both DC and IDC are heuristic procedures, whereas Information-Theoretic Coclustering (ITCC) [16] clearly quantifies the loss in mutual information due to coclustering and presents an algorithm that reduces this loss function monotonically. A more generalized coclustering framework was proposed in [25], wherein any Bregman divergence can be used in the objective function, and various conditional expectation-based constraints can be supported.
Matrix factorization techniques have been widely studied and used for clustering and coclustering. The early work is mainly based on Singular Value Decomposition (SVD) or eigenvalue decomposition. Latent Semantic Indexing (LSI) [26] is a typical algorithm for one-sided document clustering, which first projects documents onto a lower dimensional subspace through SVD and then clusters documents in the reduced subspace. Nonnegative Matrix Factorization (NMF) [27] is a recently popularized technique which approximates the nonnegative data matrix by the product of two nonnegative matrices. Although the original motivation of NMF is to learn parts-based representations, it has been successfully
Fig. 1. The 1D projection results of one 2-Class subset selected from the WebKB corpus. To distinguish between features and samples, we plot them in different shapes, sizes, and colors.
applied to one-sided clustering [2]. As an extension, Nonnegative Matrix Trifactorization (NM3F) is proposed for coclustering [17], [18]. In NM3F, the original data matrix $X$ is decomposed into the product of three nonnegative (or two nonnegative and one unconstrained) matrices: $X = GSH^T$, where $G$ gives row clusters and $H$ gives column clusters. Graph regularized nonnegative matrix trifactorization for coclustering is introduced in [28]. To reduce the computational cost, a general coclustering framework named Coclustering based on Column and Row Decomposition (CRD) is proposed [29]. CRD does not require the whole data matrix to be in main memory, and its execution time is linear in $m$ (the number of samples) and $n$ (the number of features).
3 LOCALLY DISCRIMINATIVE COCLUSTERING
The coclustering problem considered in this paper is formally defined as follows: given a nonnegative data matrix $X \in \mathbb{R}^{m\times n}$, we use $\mathbf{x}_i^T$ to denote the $i$th row (sample) and $\mathbf{f}_j$ to denote the $j$th column (feature) of $X$. The goal of coclustering is to simultaneously group the samples $\{\mathbf{x}_1,\ldots,\mathbf{x}_m\} \subset \mathbb{R}^n$ and features $\{\mathbf{f}_1,\ldots,\mathbf{f}_n\} \subset \mathbb{R}^m$ into $c$ coclusters. The clustering result of samples is represented by a Partition Matrix (PM) $G \in \{0,1\}^{m\times c}$, such that $G_{ir}=1$ if $\mathbf{x}_i$ belongs to cluster $r$ and $G_{ir}=0$ otherwise. Similarly, the clustering result of features is represented by a PM $H \in \{0,1\}^{n\times c}$.
As discussed before, most of the existing coclustering algorithms only consider the sample-feature relationship. In the following, we introduce our Locally Discriminative Coclustering, which makes use of the sample-feature relationship, as well as the intersample and interfeature relationships, for coclustering. We begin with the discussion on modeling the sample-feature relationship.
3.1 Modeling the Sample-Feature Relationship
Similar to the previous approaches [7], [8], we model the relationship between samples and features using a bipartite graph. In the bipartite graph model, the $m$ samples $\{\mathbf{x}_1,\ldots,\mathbf{x}_m\}$ and $n$ features $\{\mathbf{f}_1,\ldots,\mathbf{f}_n\}$ are two sets of vertices. An edge $\langle\mathbf{x}_i, \mathbf{f}_j\rangle$ exists if and only if the $j$th feature is observed in the $i$th sample, and the edge weight is set to $X_{ij}$, which represents the association between the sample $\mathbf{x}_i$ and the feature $\mathbf{f}_j$. It is natural to require that the labels of a feature and a sample should be the same if they are strongly associated.
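As a small illustration of this construction, the sketch below (a toy matrix of our own choosing, not from the paper) enumerates the bipartite edges with their weights:

```python
# Toy sketch: enumerate the bipartite edges <x_i, f_j> with weights X_ij.
# An edge exists iff feature j is observed in sample i, i.e., X_ij > 0.
X = [[2.0, 0.0],
     [1.0, 1.0]]   # 2 samples x 2 features, arbitrary toy values

edges = [(i, j, X[i][j])
         for i in range(len(X))
         for j in range(len(X[0]))
         if X[i][j] > 0]
```

Here the zero entry $X_{12}$ produces no edge, so the graph has three weighted edges.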
Let $\mathbf{g}_i^T$ be the $i$th row of $G$, i.e., $\mathbf{g}_i$ is the indicator vector of sample $\mathbf{x}_i$, and let $\mathbf{h}_j^T$ be the $j$th row of $H$, i.e., the indicator vector of feature $\mathbf{f}_j$. Let $D^S \in \mathbb{R}^{m\times m}$ be the diagonal degree matrix of samples with $D^S_{ii} = \sum_{k=1}^n X_{ik}$, and $D^F \in \mathbb{R}^{n\times n}$ be the diagonal degree matrix of features with $D^F_{jj} = \sum_{k=1}^m X_{kj}$. In order to group strongly associated samples and features together, the following loss function can be used [30]:
$$\Phi_1(G,H) = \sum_{i=1}^{m}\sum_{j=1}^{n} \left\| \frac{\mathbf{g}_i}{\sqrt{D^S_{ii}}} - \frac{\mathbf{h}_j}{\sqrt{D^F_{jj}}} \right\|^2 X_{ij}. \tag{1}$$
By minimizing (1), we expect that if $\mathbf{x}_i$ and $\mathbf{f}_j$ are strongly associated (with a large $X_{ij}$), their indicator vectors should be the same. After some algebraic steps, (1) can be rewritten in matrix form as follows:
$$\begin{aligned}\Phi_1(G,H) &= \sum_{i=1}^{m}\sum_{j=1}^{n}\left(\frac{\|\mathbf{g}_i\|^2}{D^S_{ii}} - \frac{2\,\mathbf{g}_i^T\mathbf{h}_j}{\sqrt{D^S_{ii}D^F_{jj}}} + \frac{\|\mathbf{h}_j\|^2}{D^F_{jj}}\right)X_{ij}\\ &= \sum_{i=1}^{m}\|\mathbf{g}_i\|^2 + \sum_{j=1}^{n}\|\mathbf{h}_j\|^2 - \sum_{i=1}^{m}\sum_{j=1}^{n}\frac{2X_{ij}\,\mathbf{g}_i^T\mathbf{h}_j}{\sqrt{D^S_{ii}D^F_{jj}}}\\ &= m + n - 2\,\mathrm{Tr}\bigl(G^T(D^S)^{-1/2}X(D^F)^{-1/2}H\bigr). \end{aligned}\tag{2}$$
Since $m+n$ is a constant, the loss function in (2) can be simplified as
$$\tilde{\Phi}_1(G,H) = -2\,\mathrm{Tr}\bigl(G^T X_N H\bigr), \tag{3}$$
where
$$X_N = (D^S)^{-1/2}\, X\, (D^F)^{-1/2}. \tag{4}$$
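The normalization in (4) is a one-liner in practice. The sketch below (Python/NumPy, with a toy 2×2 matrix of our own choosing) forms the degree matrices from the row and column sums and computes $X_N$:

```python
import numpy as np

# Toy nonnegative data matrix: 2 samples (rows) x 2 features (columns).
X = np.array([[2.0, 0.0],
              [1.0, 1.0]])

# Diagonal degree matrices: D^S from row sums, D^F from column sums.
dS = X.sum(axis=1)   # diagonal of D^S
dF = X.sum(axis=0)   # diagonal of D^F

# X_N = (D^S)^{-1/2} X (D^F)^{-1/2}, as in (4).
XN = np.diag(dS ** -0.5) @ X @ np.diag(dF ** -0.5)
```

For this toy matrix, $D^S = \mathrm{diag}(2,2)$ and $D^F = \mathrm{diag}(3,1)$, so for example $({X_N})_{11} = 2/\sqrt{2\cdot 3}$.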
3.2 Modeling the Intersample and Interfeature Relationships
Inspired by recent developments in local learning [31], [32], [33], we propose to discover the intrinsic discriminative structure of both sample and feature spaces using local linear regression. The key mathematical derivations stated below stem from the work in [34], where Bach and Harchaoui apply global linear regression to clustering.
3.2.1 Local Linear Regression in the Sample Space
For each sample $\mathbf{x}_i$, we define the local patch $\mathcal{M}(\mathbf{x}_i)$ to be the set containing $\mathbf{x}_i$ and its neighboring points, with size $m_i$. We define $\mathcal{A}_i = \{k \mid \mathbf{x}_k \in \mathcal{M}(\mathbf{x}_i)\}$ to be the set containing the indices of the samples in $\mathcal{M}(\mathbf{x}_i)$. Let $X_i \in \mathbb{R}^{m_i\times n}$ be the local data matrix consisting of the samples in $\mathcal{M}(\mathbf{x}_i)$, that is, $X_i = [\ldots,\mathbf{x}_k,\ldots]^T$ for $k \in \mathcal{A}_i$. Let $G_i \in \mathbb{R}^{m_i\times c}$ be the local Partition Matrix of $\mathcal{M}(\mathbf{x}_i)$, that is, $G_i = [\ldots,\mathbf{g}_k,\ldots]^T$ for $k \in \mathcal{A}_i$. Since the local Partition Matrix $G_i$ is a part of $G$, we can construct a selection matrix $S_i \in \{0,1\}^{m_i\times m}$ for each $G_i$ such that
$$G_i = S_i G. \tag{5}$$
$S_i$ is constructed as follows: $S_i = [\ldots,\mathbf{e}_k,\ldots]^T$ for $k \in \mathcal{A}_i$, where $\mathbf{e}_k$ is an $m$-dimensional vector whose $k$th element is one and all other elements are zero.

We consider fitting a multioutput linear function $f(X_i) = X_i W_i + \mathbf{1}_{m_i}\mathbf{b}_i^T$ for each local patch $\mathcal{M}(\mathbf{x}_i)$ to model the relationship between $X_i$ and $G_i$ [31]. In this linear function, $\mathbf{1}_{m_i}$ is an $m_i$-dimensional vector of all ones, $W_i \in \mathbb{R}^{n\times c}$ is the coefficient matrix, and $\mathbf{b}_i \in \mathbb{R}^c$ is the intercept. Fitting this function can be mathematically formulated as
$$\min_{W_i,\,\mathbf{b}_i}\ \frac{1}{m_i}\left\|G_i - X_i W_i - \mathbf{1}_{m_i}\mathbf{b}_i^T\right\|_F^2 + \lambda\|W_i\|_F^2. \tag{6}$$
The penalty term $\lambda\|W_i\|_F^2$ is introduced to avoid overfitting [1].
The penalty term �kWik2F is introduced to avoid over-
fitting [1].Taking the first order partial derivatives of (6) with
respective to Wi, bi and requiring them to be zero, we get
the optimal W �i and b�i [34]
ZHANG ET AL.: LOCALLY DISCRIMINATIVE COCLUSTERING 1027
W �i ¼
�XTi �mi
Xi þmi�I��1
XTi �mi
Gi; ð7Þ
b�i ¼1
mi
�GTi � ðW �
i ÞTXT
i
�1mi
; ð8Þ
where I is the identity matrix and � ¼ I � 1mi
1mi1Tmi
is thecentering matrix.
Substituting the values of $W_i^*$ and $\mathbf{b}_i^*$ into (6), we obtain the fitting error of the local function [34]:
$$\begin{aligned} J_i &= \frac{1}{m_i}\left\|G_i - X_i W_i^* - \mathbf{1}_{m_i}(\mathbf{b}_i^*)^T\right\|_F^2 + \lambda\|W_i^*\|_F^2\\ &= \frac{1}{m_i}\left\|G_i - X_i W_i^* - \frac{\mathbf{1}_{m_i}\mathbf{1}_{m_i}^T}{m_i}\left(G_i - X_i W_i^*\right)\right\|_F^2 + \lambda\|W_i^*\|_F^2\\ &= \frac{1}{m_i}\left\|\Pi\left(G_i - X_i W_i^*\right)\right\|_F^2 + \lambda\|W_i^*\|_F^2\\ &= \frac{1}{m_i}\left\|\Pi\left(I - X_i\left(X_i^T\Pi X_i + m_i\lambda I\right)^{-1}X_i^T\Pi\right)G_i\right\|_F^2 + \lambda\left\|\left(X_i^T\Pi X_i + m_i\lambda I\right)^{-1}X_i^T\Pi\, G_i\right\|_F^2\\ &= \frac{1}{m_i}\mathrm{Tr}\left(G_i^T\left(\Pi - \Pi X_i\left(X_i^T\Pi X_i + m_i\lambda I\right)^{-1}X_i^T\Pi\right)^2 G_i\right) + \lambda\,\mathrm{Tr}\left(G_i^T\,\Pi X_i\left(X_i^T\Pi X_i + m_i\lambda I\right)^{-2}X_i^T\Pi\, G_i\right)\\ &= \frac{1}{m_i}\mathrm{Tr}\left(G_i^T\left(\Pi - \Pi X_i\left(X_i^T\Pi X_i + m_i\lambda I\right)^{-1}X_i^T\Pi\right)G_i\right). \end{aligned}\tag{9}$$
In the above derivations, we have used the fact that the centering matrix is idempotent, so that $\Pi = \Pi^k$ for $k = 1, 2, \ldots$. For each local patch $\mathcal{M}(\mathbf{x}_i)$, we define
$$L_i^S = \frac{1}{m_i}\left(\Pi - \Pi X_i\left(X_i^T\Pi X_i + m_i\lambda I\right)^{-1}X_i^T\Pi\right), \tag{10}$$
which characterizes the local discriminative structure of $\mathcal{M}(\mathbf{x}_i)$.
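The closed forms (7)-(10) are easy to sanity-check numerically. The sketch below (Python/NumPy, on randomly generated toy data of our own choosing) fits the local ridge regression via (7)-(8), evaluates the objective (6) at the optimum, and confirms that it equals $\mathrm{Tr}(G_i^T L_i^S G_i)$ from (9)-(10):

```python
import numpy as np

rng = np.random.default_rng(0)
mi, n, c, lam = 6, 4, 2, 0.1                      # toy patch size, dims, and lambda
Xi = rng.random((mi, n))                          # local data matrix
Gi = np.eye(c)[rng.integers(0, c, size=mi)]       # local indicator (partition) matrix

Pi = np.eye(mi) - np.ones((mi, mi)) / mi          # centering matrix
M = Xi.T @ Pi @ Xi + mi * lam * np.eye(n)
Wi = np.linalg.solve(M, Xi.T @ Pi @ Gi)           # optimal W_i*, eq. (7)
bi = (Gi.T - Wi.T @ Xi.T) @ np.ones(mi) / mi      # optimal b_i*, eq. (8)

# Fitting error (6) evaluated at the optimum...
pred = Xi @ Wi + np.outer(np.ones(mi), bi)
J_direct = np.linalg.norm(Gi - pred, 'fro') ** 2 / mi + lam * np.linalg.norm(Wi, 'fro') ** 2

# ...matches the closed form Tr(G_i^T L_i^S G_i) from (9)-(10).
LiS = (Pi - Pi @ Xi @ np.linalg.solve(M, Xi.T @ Pi)) / mi
J_closed = np.trace(Gi.T @ LiS @ Gi)
```

The two values `J_direct` and `J_closed` agree to machine precision, which is exactly the content of the derivation in (9).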
The fitting error $J_i$ contains $G_i$ as the variable, and a good local $G_i$ should give rise to a minimal fitting error. In other words, we are looking for a local partition such that the clusters are most linearly separable, where the separability of clusters is measured through the minimum of the discriminative cost in (6) [34]. It is then natural to require the global Partition Matrix $G$ to minimize the summation of the fitting errors over all the local patches $\{\mathcal{M}(\mathbf{x}_i)\}_{i=1}^m$, which leads to the following loss function:
$$\Phi_2(G) = \sum_{i=1}^m J_i = \sum_{i=1}^m \mathrm{Tr}\left(G_i^T L_i^S G_i\right) = \sum_{i=1}^m \mathrm{Tr}\left(G^T S_i^T L_i^S S_i G\right) = \mathrm{Tr}\left(G^T L^S G\right), \tag{11}$$
where
$$L^S = \sum_{i=1}^m S_i^T L_i^S S_i. \tag{12}$$
The formulation of $L_i^S$ in (10) involves the inverse of an $n\times n$ matrix, which is computationally expensive when the dimensionality $n$ is high. In the following, we use the Woodbury-Morrison formula [35] to derive a more efficient expression [34]:
$$\begin{aligned} L_i^S &= \frac{1}{m_i}\left(\Pi - \Pi X_i\left(X_i^T\Pi X_i + m_i\lambda I\right)^{-1}X_i^T\Pi\right)\\ &= \frac{1}{m_i}\Pi\left(I - \Pi X_i\left(X_i^T\Pi\,\Pi X_i + m_i\lambda I\right)^{-1}X_i^T\Pi\right)\Pi\\ &= \frac{1}{m_i}\Pi\left(I + \frac{1}{m_i\lambda}\Pi X_i X_i^T\Pi\right)^{-1}\Pi\\ &= \lambda\,\Pi\left(m_i\lambda I + \Pi X_i X_i^T\Pi\right)^{-1}\Pi. \end{aligned}\tag{13}$$
Using the above equation, we only need to invert an $m_i\times m_i$ matrix, which is much more efficient, since the size of the local patch is usually very small.
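The equivalence in (13) can likewise be checked numerically. In the sketch below (Python/NumPy, toy sizes of our own choosing with $n \gg m_i$), both expressions yield the same matrix, while the Woodbury form only inverts an $m_i \times m_i$ matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
mi, n, lam = 5, 40, 0.1                 # n >> m_i: the case where (13) pays off
Xi = rng.random((mi, n))
Pi = np.eye(mi) - np.ones((mi, mi)) / mi

# Direct form (10): requires an n x n inverse.
direct = (Pi - Pi @ Xi @ np.linalg.solve(
    Xi.T @ Pi @ Xi + mi * lam * np.eye(n), Xi.T @ Pi)) / mi

# Woodbury form (13): requires only an m_i x m_i inverse.
woodbury = lam * Pi @ np.linalg.solve(
    mi * lam * np.eye(mi) + Pi @ Xi @ Xi.T @ Pi, Pi)
```

Since the local patch size $m_i = k+1$ is typically a small constant while $n$ can be in the thousands, the second form is the one used in the complexity analysis of Section 3.5.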
3.2.2 Local Linear Regression in the Feature Space
Similarly, we can use local linear regression to model the interfeature relationship. For each feature $\mathbf{f}_j$, we define the local patch $\mathcal{N}(\mathbf{f}_j)$ to be the set containing $\mathbf{f}_j$ and its neighbors, with size $n_j$. And we define $\mathcal{B}_j = \{k \mid \mathbf{f}_k \in \mathcal{N}(\mathbf{f}_j)\}$ to be the set containing the indices of the features in $\mathcal{N}(\mathbf{f}_j)$. Let $F_j \in \mathbb{R}^{n_j\times m}$ be the local feature matrix consisting of the features in $\mathcal{N}(\mathbf{f}_j)$. Let $H_j \in \mathbb{R}^{n_j\times c}$ be the local Partition Matrix of $\mathcal{N}(\mathbf{f}_j)$. Define a selection matrix $U_j = [\ldots,\mathbf{e}_k,\ldots]^T$ for $k \in \mathcal{B}_j$, where $\mathbf{e}_k$ is an $n$-dimensional vector whose $k$th element is one and all other elements are zero. We have
$$H_j = U_j H. \tag{14}$$
We also train a local linear function $g(F_j) = F_j V_j + \mathbf{1}_{n_j}\mathbf{a}_j^T$ for each local patch $\mathcal{N}(\mathbf{f}_j)$ to best approximate $H_j$. As before, we minimize the fitting errors of all the local functions to capture the interfeature relationship. Following the same derivations as in Section 3.2.1, we obtain the following loss function of $H$:
$$\Phi_3(H) = \mathrm{Tr}\left(H^T L^F H\right), \tag{15}$$
where
$$L^F = \sum_{j=1}^n U_j^T L_j^F U_j, \tag{16}$$
$$L_j^F = \frac{1}{n_j}\left(\Pi - \Pi F_j\left(F_j^T\Pi F_j + n_j\lambda I\right)^{-1}F_j^T\Pi\right) = \lambda\,\Pi\left(n_j\lambda I + \Pi F_j F_j^T\Pi\right)^{-1}\Pi, \tag{17}$$
with $\Pi = I - \frac{1}{n_j}\mathbf{1}_{n_j}\mathbf{1}_{n_j}^T$ the $n_j\times n_j$ centering matrix.
3.3 The Objective
Define $P \in \{0,1\}^{(m+n)\times c}$ to be the total Partition Matrix consisting of $G$ and $H$:
$$P = \begin{bmatrix} G \\ H \end{bmatrix}. \tag{18}$$
Combining (3), (11), and (15), the loss function of $P$ is given by
$$\begin{aligned}\Phi(P) &= \tilde{\Phi}_1(G,H) + \alpha\Phi_2(G) + \beta\Phi_3(H)\\ &= \mathrm{Tr}\left(-2\,G^T X_N H + \alpha\, G^T L^S G + \beta\, H^T L^F H\right)\\ &= \mathrm{Tr}\left(\begin{bmatrix}G^T & H^T\end{bmatrix}\begin{bmatrix}\alpha L^S & -X_N\\ -(X_N)^T & \beta L^F\end{bmatrix}\begin{bmatrix}G\\ H\end{bmatrix}\right)\\ &= \mathrm{Tr}\left(P^T\begin{bmatrix}\alpha L^S & -X_N\\ -(X_N)^T & \beta L^F\end{bmatrix}P\right), \end{aligned}\tag{19}$$
where $\alpha \geq 0$ and $\beta \geq 0$ are the tradeoff parameters.

With the loss function in (19), we define the LDCC problem as:

Definition 1. Locally Discriminative Coclustering:
$$\begin{aligned}\min_P\ &\mathrm{Tr}\left(P^T L P\right)\\ \mathrm{s.t.}\ \ &L = \begin{bmatrix}\alpha L^S & -X_N\\ -(X_N)^T & \beta L^F\end{bmatrix},\\ &P \in \{0,1\}^{(m+n)\times c},\quad P\mathbf{1}_c = \mathbf{1}_{m+n}, \end{aligned}\tag{20}$$
where $X_N$, $L^S$, and $L^F$ are given by (4), (12), and (16), respectively.
3.4 The Algorithm
The LDCC problem is essentially a combinatorial optimization problem which is hard to solve. In order to solve it efficiently, we relax it following the spectral clustering methods in [3] and [36]. First, we map all the features and samples into a common low-dimensional subspace, and then cluster features and samples simultaneously in this subspace. Let $Z$ be an $(m+n)\times r$ matrix whose rows give the low-dimensional embeddings of all the samples and features in an $r$-dimensional subspace. The optimal $Z^*$ is obtained by solving the following problem:
$$\begin{aligned}\min_Z\ &\mathrm{Tr}\left(Z^T L Z\right)\\ \mathrm{s.t.}\ \ &Z \in \mathbb{R}^{(m+n)\times r},\quad Z^T Z = I. \end{aligned}\tag{21}$$
Let $\mathbf{z}_1,\ldots,\mathbf{z}_r$ be the eigenvectors of $L$ corresponding to the $r$ smallest eigenvalues. Then, the optimal solution $Z^*$ of (21) is given by
$$Z^* = [\mathbf{z}_1,\ldots,\mathbf{z}_r]. \tag{22}$$
After normalizing each row of $Z^*$ [3], we perform Kmeans to group the samples and features into $c$ coclusters.
In summary, the algorithm of LDCC is stated below.

1. Constructing the matrix $L$.
   a. Calculate $X_N$ by normalizing the data matrix $X$ according to (4).
   b. Find the $k$ nearest neighbors of each sample, and calculate $L^S$ according to (12).
   c. Find the $k$ nearest neighbors of each feature, and calculate $L^F$ according to (16).
   d. Construct $L$ from $X_N$, $L^S$, and $L^F$ according to (20).
2. Dimensionality reduction.
   a. Calculate the $r$ smallest eigenvectors of $L$: $\mathbf{z}_1,\ldots,\mathbf{z}_r$.
   b. Form the matrix $Z^* = [\mathbf{z}_1,\ldots,\mathbf{z}_r]$ by stacking the eigenvectors in columns.
3. Coclustering in the low-dimensional subspace.
   a. Normalize each row of $Z^*$ to have unit length.
   b. Cluster the samples and features into $c$ coclusters via Kmeans.
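The steps above can be sketched end to end. In the toy code below (Python/NumPy; $L^S$ and $L^F$ are replaced by small placeholder matrices rather than the full local-regression construction of Section 3.2, and step 3b is left to any standard Kmeans implementation), we assemble $L$ as in (20) and obtain the row-normalized embedding:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, r, alpha, beta = 8, 6, 2, 1.0, 1.0
X = rng.random((m, n)) + 0.1                       # toy nonnegative data matrix

# Step 1a: X_N via (4).
XN = np.diag(X.sum(1) ** -0.5) @ X @ np.diag(X.sum(0) ** -0.5)

# Steps 1b-1c: placeholders standing in for (12) and (16).
LS = 0.1 * np.eye(m)
LF = 0.1 * np.eye(n)

# Step 1d: assemble the symmetric matrix L as in (20).
L = np.block([[alpha * LS, -XN],
              [-XN.T, beta * LF]])

# Step 2: r smallest eigenvectors (eigh returns eigenvalues in ascending order).
vals, vecs = np.linalg.eigh(L)
Zstar = vecs[:, :r]                                # Z*, eq. (22)

# Step 3a: row-normalize before Kmeans.
Z = Zstar / np.linalg.norm(Zstar, axis=1, keepdims=True)
```

Step 3b would then feed `Z` into Kmeans with $c$ clusters; the first $m$ rows of the result label the samples and the remaining $n$ rows label the features.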
3.5 Complexity Analysis of LDCC
The computational complexity of LDCC is dominated by the following steps:

- Finding the $k$ nearest neighbors of each sample and calculating $L^S$ according to (12).
  - $O(m^2 n)$ is used to calculate the pairwise distances between the $m$ samples, and $O(m^2\log m)$ is used for $k$-nearest-neighbor finding for all the $m$ samples.
  - $O(m(nk^2 + k^3))$ is used to calculate $L_i^S$ according to (13) for all the $m$ samples.
- Finding the $k$ nearest neighbors of each feature and calculating $L^F$ according to (16).
  - $O(n^2 m)$ is used to calculate the pairwise distances between the $n$ features, and $O(n^2\log n)$ is used for $k$-nearest-neighbor finding for all the $n$ features.
  - $O(n(mk^2 + k^3))$ is used to calculate $L_j^F$ according to (17) for all the $n$ features.
- Calculating the $r$ smallest eigenvectors of $L$.
  - There are at most $mk^2$ and $nk^2$ nonzero elements in $L^S$ and $L^F$, respectively. In practice, the data matrix $X$ is usually very sparse, so the $(m+n)\times(m+n)$ matrix $L$ is sparse too. As a result, the Lanczos algorithm [37] can be used to efficiently compute the first $r$ eigenvectors within $O(qr(m+n)s)$, where $q$ is the number of iterations in Lanczos, and $s$ is the number of nonzero elements per row of $L$.

The total cost of LDCC is $O((mn + qrs)(m+n) + m^2\log m + n^2\log n + (mn + mk + nk)k^2)$.

The main memory cost of LDCC is to store the pairwise distance matrices and the sparse matrix $L$. Thus, the memory complexity is $O(m^2 + n^2 + s(m+n))$.
4 EXPERIMENTS
In this section, we perform text and gene expression clustering (coclustering) experiments to show the effectiveness of LDCC.
4.1 Experimental Design
We compare our method with the following five methods:

1. Kmeans on the original data matrix (Kmeans) [4].
2. Normalized cut (Ncut) [22].
3. Bipartite spectral graph partitioning (BGP) [7], [8].
4. Information-theoretic coclustering (ITCC) [16].¹
5. Dual regularized coclustering (DRCC) [28].
In the above methods, Kmeans and NCut are one-sidedclustering algorithms, while BGP, ITCC, DRCC, and LDCC
1. http://www.cs.utexas.edu/users/dml/Software/cocluster.html.
are coclustering algorithms. Note that the procedures for
solving Kmeans and ITCC can only find the local optimum.
In the experiments, we ran both Kmeans and ITCC 10 times
with different random starting points and the best result in
terms of their objective functions was recorded.
4.1.1 Evaluation Metric
In practice, the ground truth about the feature clusters is usually unknown. Thus, we evaluate the result of sample clustering in our experiments. The sample clustering performance is evaluated by comparing the label obtained from the clustering or coclustering algorithms with that provided by the data set. Two metrics, the accuracy (AC) and the normalized mutual information (MI), are used to measure the clustering performance [2], [38]. Given a sample $\mathbf{x}_i$, let $p_i$ and $q_i$ be the obtained cluster label and the label provided by the data set, respectively. The AC is defined as follows:
$$AC = \frac{\sum_{i=1}^m \delta\bigl(q_i, \mathrm{map}(p_i)\bigr)}{m}, \tag{23}$$
where $m$ is the total number of documents, $\delta(x,y)$ is the delta function that equals one if $x=y$ and zero otherwise, and $\mathrm{map}(p_i)$ is the permutation mapping function that maps each cluster label $p_i$ to the equivalent label from the data corpus. The best mapping can be found by using the Kuhn-Munkres algorithm [39].

Let $C$ denote the set of clusters provided by the data set and $C'$ the set obtained from our algorithm. Their mutual information metric $MI(C,C')$ is defined as follows:
$$MI(C,C') = \sum_{c_i\in C,\ c'_j\in C'} p(c_i,c'_j)\,\log_2\frac{p(c_i,c'_j)}{p(c_i)\,p(c'_j)}, \tag{24}$$
where $p(c_i)$ and $p(c'_j)$ are the probabilities that a sample arbitrarily selected from the data set belongs to the clusters $c_i$ and $c'_j$, respectively, and $p(c_i,c'_j)$ is the joint probability that the arbitrarily selected sample belongs to both $c_i$ and $c'_j$ at the same time. In our experiments, we use the normalized mutual information $\overline{MI}$ as follows:
$$\overline{MI} = \frac{MI(C,C')}{\max\bigl(H(C),\, H(C')\bigr)}, \tag{25}$$
where $H(C)$ and $H(C')$ are the entropies of $C$ and $C'$, respectively. It is easy to check that $\overline{MI}$ takes values between 0 and 1.
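For concreteness, both metrics can be implemented in a few lines of Python (standard library only); the brute-force permutation search below stands in for the Kuhn-Munkres algorithm of [39] and assumes the same small number of true and predicted clusters:

```python
import math
from collections import Counter
from itertools import permutations

def accuracy(true, pred):
    """AC from (23): best agreement over permutation mappings of cluster labels."""
    true_labels = sorted(set(true))
    pred_labels = sorted(set(pred))
    best = 0
    for perm in permutations(true_labels):           # brute-force stand-in for Kuhn-Munkres
        mapping = dict(zip(pred_labels, perm))
        best = max(best, sum(t == mapping[p] for t, p in zip(true, pred)))
    return best / len(true)

def nmi(true, pred):
    """Normalized mutual information from (24)-(25)."""
    m = len(true)
    pc, pk = Counter(true), Counter(pred)
    joint = Counter(zip(true, pred))
    mi = sum(v / m * math.log2((v / m) / (pc[c] / m * pk[k] / m))
             for (c, k), v in joint.items())
    entropy = lambda cnt: -sum(v / m * math.log2(v / m) for v in cnt.values())
    denom = max(entropy(pc), entropy(pk))
    return mi / denom if denom > 0 else 0.0
```

For example, a clustering that simply swaps the two label names scores AC = 1 and NMI = 1, while a clustering independent of the ground truth scores NMI = 0.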
4.1.2 Parameter Settings
There are several parameters to be tuned in each clustering or coclustering algorithm considered in our experiments. In order to compare these algorithms fairly, we run them under different parameter settings, and report the best result.

In Kmeans and Ncut, the number of sample clusters is set to the true number of classes for all the data sets. In BGP, ITCC, DRCC, and LDCC, the numbers of sample clusters and feature clusters are both set to the true number of classes for all the data sets.

In Ncut [22], the similarity between two samples $\mathbf{x}_i$ and $\mathbf{x}_j$ is computed with the heat kernel: $w_{ij} = \exp\bigl(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / t\bigr)$. The parameter $t$ is searched from the grid $\{1e{-}2, 1e{-}1, 1, 1e1, 1e2\}$.

In DRCC, two $k$-nearest-neighbor graphs are constructed to encode the manifold structures in the data space and feature space. As in [28], the neighborhood size of the sample graph is set to be the same as that of the feature graph. $k$ is searched from the grid $\{1, 2, \ldots, 10\}$. The two regularization parameters in DRCC are set to be the same, and searched from the grid $\{1e{-}2, 1e{-}1, 1, 1e1, 1e2\}$.

In LDCC, the regularization parameter $\lambda$ for local linear regression is set to 1, and the dimensionality $r$ is searched from the grid $\{5, 10, \ldots, 50\}$. The other parameter settings of LDCC are the same as those of DRCC.
4.2 Data Sets
Three text corpora and two gene expression data sets areused in our experiments.
20 Newsgroups.² The 20 Newsgroups corpus is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. On this data set, we select 2,000 words with the largest contribution to the mutual information between the words and the documents [14], and then remove the empty documents.
WebKB.³ The WebKB corpus contains web pages collected from computer science departments of various universities in January 1997. The 8,282 pages were manually classified into the following seven categories: student, faculty, staff, department, course, project, and other. We select the top 1,000 words by mutual information for this data set.
TechTC-100.⁴ The Technion Repository of Text Categorization Data sets (TechTC) provides a large number of diverse test collections for use in text categorization research. We use the TechTC-100 corpus, which contains 100 binary text data sets. Each data set in TechTC-100 consists of a total of 150 to 200 documents from two Open Directory Project (ODP) categories. In our experiments, we select the top 2,000 words by mutual information for each data set.
Leukemia.⁵ The Leukemia data set is a benchmark in gene expression analysis [40]. It contains 72 samples and 7,129 genes. Each sample belongs to either Acute Lymphoblastic Leukemia (ALL) or Acute Myeloid Leukemia (AML). We use the subset provided by Brunet et al. [41], which consists of 38 bone marrow samples (27 ALL samples and 11 AML samples). We screen out genes with max/min < 15 and max − min < 500, leaving a total of 1,999 genes.
Medulloblastoma.⁶ This gene expression data set [42] is collected from childhood brain tumors known as medulloblastomas. The pathogenesis of these tumors is not well understood, but it is generally accepted that there are two known histological subclasses: classic and desmoplastic. We use the subset provided by Brunet et al. [41], which consists of 25 classic samples and 9 desmoplastic samples. We screen out genes with max/min < 15 and max − min < 500, leaving a total of 1,710 genes.

2. http://people.csail.mit.edu/jrennie/20Newsgroups/.
3. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/.
4. http://techtc.cs.technion.ac.il/.
5. http://www.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=89.
6. http://www.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=89.
4.3 Experimental Results
4.3.1 Performance on Balanced Data Sets
In this section, we compare the performance of different algorithms on balanced data sets, where all the sample classes have the same size.

The first experiment is done on the 20 Newsgroups corpus. The evaluations were conducted with different numbers of document classes $c$, ranging from 2 to 10. At each run of the test, $c$ document classes are randomly selected from the whole corpus and 300 documents are randomly selected from each selected class. For each given class number $c$, 10 test runs are conducted, and the average performance is computed over these 10 tests. Table 1 shows the average clustering performance, as well as the standard deviation, for each algorithm. As can be seen, our LDCC algorithm significantly outperforms the other clustering and coclustering algorithms in all the cases.
Although the WebKB corpus is unbalanced, we generate balanced subsets by choosing the same number of documents from each document class. We perform clustering or coclustering experiments with $c = 2, \ldots, 7$ document classes. At each run of the test, $c$ document classes are randomly selected and 100 documents are randomly selected from each selected class. As before, 10 test runs are conducted for each $c$ and the average performance is reported. The clustering results on the balanced WebKB data set are shown in Table 2. On this data set, LDCC still performs the best in most cases. Compared with the results on the 20 Newsgroups corpus, the advantage of LDCC is less obvious on this corpus. That is probably because the WebKB corpus is more difficult to cluster.
From Tables 1 and 2, we can see that in general the accuracy keeps decreasing as the number of classes c increases. However, the Normalized Mutual Information (MI) does not have a clear pattern. The behavior of MI is quite complex and depends on the specific data set. For example, in [2], we can see that MI decreases on the TDT2 corpus and increases on the Reuters corpus as c increases.

ZHANG ET AL.: LOCALLY DISCRIMINATIVE COCLUSTERING 1031

TABLE 1
Performance Comparisons on the 20 Newsgroups Corpus (mean ± std-dev)
The best result is in bold.

TABLE 2
Performance Comparisons on the Balanced WebKB Corpus (mean ± std-dev)
The best result is in bold.
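The normalized mutual information used here can be computed as below. Normalization conventions vary (square root, maximum, or mean of the two entropies); the square-root variant is assumed in this sketch, since the paper does not state which is used:

```python
import numpy as np

def normalized_mutual_info(true_labels, pred_labels):
    """NMI = I(U;V) / sqrt(H(U) * H(V)), computed from the joint
    distribution of true classes U and predicted clusters V."""
    u = np.asarray(true_labels)
    v = np.asarray(pred_labels)
    n = len(u)
    eps = 1e-12
    joint = np.zeros((u.max() + 1, v.max() + 1))
    for a, b in zip(u, v):
        joint[a, b] += 1
    joint /= n                      # joint distribution p(u, v)
    pu = joint.sum(axis=1)          # marginal p(u)
    pv = joint.sum(axis=0)          # marginal p(v)
    mi = sum(joint[i, j] * np.log(joint[i, j] / (pu[i] * pv[j]))
             for i in range(len(pu)) for j in range(len(pv))
             if joint[i, j] > eps)
    hu = -sum(p * np.log(p) for p in pu if p > eps)
    hv = -sum(p * np.log(p) for p in pv if p > eps)
    return mi / (np.sqrt(hu * hv) + eps)
```

NMI is 1 for a perfect partition (up to label permutation) and 0 for a partition independent of the classes, which is why a degenerate single-cluster result scores 0.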
The TechTC-100 corpus contains 100 balanced binary data sets, and we perform clustering or coclustering experiments on each data set. The performance comparisons on the TechTC-100 corpus are shown in Fig. 2, where we plot the result of LDCC on the y-axis and the best result of the other methods on the x-axis. The number of points in the upper triangle is thus the number of data sets on which LDCC performs the best. In terms of accuracy, LDCC performs the best on 74 data sets; in terms of MI, it performs the best on 78. From Fig. 2b, we can see that the MI on many data sets is less than 0.1, which means this text corpus is very hard to cluster. Nevertheless, our algorithm still performs the best in most cases on this corpus.
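Counting the points in the upper triangle of the scatter plot amounts to counting per-data-set wins; a trivial sketch with a hypothetical function name:

```python
import numpy as np

def count_wins(ldcc_scores, best_other_scores):
    """Number of data sets on which LDCC strictly beats the best
    competing method, i.e. points above the diagonal in the plot."""
    ldcc = np.asarray(ldcc_scores)
    other = np.asarray(best_other_scores)
    return int((ldcc > other).sum())
```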
4.3.2 Performance on Unbalanced Data Sets
In the following, we evaluate the performance of all the methods on unbalanced data sets, where the sizes of the sample classes are highly skewed.
The WebKB corpus is an unbalanced data set, where the size of the document classes ranges from 137 to 3,728. As before,
we perform clustering or coclustering experiments with c = 2, ..., 7 document classes. At each run of the test, we select c document classes randomly and then select 10 percent of the documents from each selected class. In this way, the randomly generated subsets are highly unbalanced. Ten test runs are conducted for each c and the average performance is reported. The clustering results on the unbalanced WebKB data set are shown in Table 3. Compared with the results in Table 2, the performance of all the methods decreases in this experiment, so the unbalanced data sets are more difficult to cluster. Nevertheless, our LDCC still performs the best in most cases.
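The unbalanced subset generation described above can be sketched as follows (a hypothetical helper; the 10 percent fraction is taken from the text). Because each class contributes a fixed fraction of its own size, the subset inherits the skew of the original corpus:

```python
import numpy as np

def sample_unbalanced_subset(labels, c, fraction=0.10, rng=None):
    """Pick c classes at random, then draw `fraction` of the documents
    from each chosen class.  `rng` may be a seed or a Generator."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(rng)
    classes = rng.choice(np.unique(labels), size=c, replace=False)
    idx = []
    for cls in classes:
        members = np.flatnonzero(labels == cls)
        k = max(1, int(round(fraction * len(members))))
        idx.extend(rng.choice(members, size=k, replace=False))
    return np.array(idx)
```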
Finally, we show the clustering performance on the two gene expression data sets in Table 4. These two data sets are rather challenging for coclustering algorithms, since they contain a small number of samples but a large number of features. As a result, BGP groups all the samples into the same cocluster on the Medulloblastoma data set, so its MI is 0 on this data set. In contrast, our LDCC is the winner in most of the cases, which verifies that the intersample and interfeature relationships are essential for coclustering.

1032 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 6, JUNE 2012

Fig. 2. Performance comparisons on the TechTC-100 corpus. We plot the result of LDCC on the y-axis and the best result of the other methods on the x-axis.

TABLE 3
Performance Comparisons on the Unbalanced WebKB Corpus (mean ± std-dev)
The best result is in bold.

TABLE 4
Clustering Performance on the Two Gene Expression Data Sets
The best result is in bold.
4.3.3 A Case Study of Feature Clusters
In this section, we provide a case study of feature clusters. We form a text data set by choosing four classes (“atheism,” “hardware,” “motorcycles,” and “guns”) from the 20 Newsgroups corpus.
Table 5 summarizes the results of applying LDCC to this data set. The top of the table is the confusion matrix, from which we can see that LDCC is able to recover the original classes. Since LDCC partitions documents and words simultaneously, there is one associated word cluster for each document cluster. We list the words near the center of each word cluster in the bottom of Table 5. It can be observed that most of these words are consistent with the “concept” of the associated document cluster.
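Listing the words near each word-cluster center, as in Table 5, can be sketched like this. Cosine similarity to the cluster centroid is assumed; the paper does not specify which distance it uses, and the function name is ours:

```python
import numpy as np

def top_words_per_cluster(W, word_labels, vocab, top=5):
    """For each word cluster, return the member words closest (by cosine
    similarity) to the cluster centroid in the word representation W
    (one row per word)."""
    W = np.asarray(W, dtype=float)
    labels = np.asarray(word_labels)
    result = {}
    for k in np.unique(labels):
        members = np.flatnonzero(labels == k)
        center = W[members].mean(axis=0)
        sims = W[members] @ center / (
            np.linalg.norm(W[members], axis=1) * np.linalg.norm(center)
            + 1e-12)
        order = members[np.argsort(-sims)][:top]
        result[int(k)] = [vocab[i] for i in order]
    return result
```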
4.3.4 Parameter Selection
In LDCC, we apply ridge regression on each local patch to capture the intersample (or interfeature) relationship, where λ appears as a regularization parameter. Besides, there is another parameter μ which controls the relative importance of the intersample and interfeature relationships. In the following, we examine the impact of these two parameters on the performance of LDCC.
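The per-patch estimator is standard ridge regression, which has a closed-form solution; a minimal sketch with hypothetical names, where `lam` stands for the ridge regularization parameter:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T y.
    In LDCC such a fit is computed on each local patch to predict the
    labels of the points in that patch."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

Larger `lam` shrinks the fitted coefficients toward zero, which is what makes the local fits stable on small patches.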
The performance of LDCC under different settings of λ and μ is evaluated on the WebKB corpus. The number of document classes c is set to 4, and the other experimental settings are the same as those in Section 4.3.1. For brevity, we just show how the accuracy of LDCC varies with the parameters λ and μ. Fig. 3 plots the accuracy versus the value of λ with μ ∈ {0.1, 1, 10}. As can be seen, the performance of LDCC is quite stable with respect to λ as long as it is smaller than a certain threshold. Comparing the results under different settings of μ, we can see that LDCC is also insensitive to μ.
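The sensitivity study sweeps a small grid over the two parameters; a sketch of the harness, where the `run_ldcc` callable is a stand-in for one full clustering run returning mean accuracy:

```python
import itertools

def sensitivity_grid(run_ldcc, lambdas, mus):
    """Evaluate run_ldcc(lam, mu) at every point of the parameter grid
    and return a {(lam, mu): score} mapping for plotting."""
    return {(lam, mu): run_ldcc(lam, mu)
            for lam, mu in itertools.product(lambdas, mus)}
```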
5 CONCLUSIONS AND FUTURE WORK
In this paper, we proposed a novel coclustering algorithm named Locally Discriminative Coclustering, which explores the sample-feature, intersample, and interfeature relationships simultaneously. The sample-feature relationship is modeled by a bipartite graph, while the intersample and interfeature relationships are captured by local linear regressions. The results of the coclustering experiments are very promising.
The idea of local learning can be used to extend existing coclustering algorithms, such as the information theory-based [16], [25] and matrix factorization-based [17], [18] algorithms, and we will investigate this in our future work. Although clustering is inherently an unsupervised learning problem, sometimes a small set of labeled samples (or features) may be available; extending LDCC to incorporate such prior knowledge is another research topic. Furthermore, more efficient optimization methods for LDCC will be considered.
ACKNOWLEDGMENTS
This work was supported by Scholarship Award for Excellent Doctoral Student granted by Ministry of Education, National Key Technology R&D Program of China (2008BAH26B00), Program for New Century Excellent Talents in University (NCET-09-0685), and National Basic Research Program of China (973 Program) under Grant 2011CB302206.
TABLE 5
Coclustering Results of LDCC on the 20 Newsgroups Corpus
Fig. 3. The impact of the parameters λ and μ on the performance of LDCC.
REFERENCES
[1] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer Series in Statistics. Springer, 2009.
[2] W. Xu, X. Liu, and Y. Gong, "Document Clustering Based on Non-Negative Matrix Factorization," Proc. 26th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 267-273, 2003.
[3] A.Y. Ng, M.I. Jordan, and Y. Weiss, "On Spectral Clustering: Analysis and an Algorithm," Proc. Advances in Neural Information Processing Systems, pp. 849-856, 2002.
[4] J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," Proc. Fifth Berkeley Symp. Math. Statistics and Probability, pp. 281-297, 1967.
[5] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. Second Int'l Conf. Knowledge Discovery and Data Mining, pp. 226-231, 1996.
[6] M. Rege, M. Dong, and F. Fotouhi, "Co-Clustering Documents and Words Using Bipartite Isoperimetric Graph Partitioning," Proc. Sixth Int'l Conf. Data Mining, pp. 532-541, 2006.
[7] I.S. Dhillon, "Co-Clustering Documents and Words Using Bipartite Spectral Graph Partitioning," Proc. Seventh ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 269-274, 2001.
[8] H. Zha, X. He, C. Ding, H. Simon, and M. Gu, "Bipartite Graph Partitioning and Data Clustering," Proc. 10th Int'l Conf. Information and Knowledge Management, pp. 25-32, 2001.
[9] Y. Cheng and G.M. Church, "Biclustering of Expression Data," Proc. Eighth Int'l Conf. Intelligent Systems for Molecular Biology, pp. 93-103, 2000.
[10] Y. Kluger, R. Basri, J. Chang, and M. Gerstein, "Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions," Genome Research, vol. 13, no. 4, pp. 703-716, 2003.
[11] S.C. Madeira and A.L. Oliveira, "Biclustering Algorithms for Biological Data Analysis: A Survey," IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 1, no. 1, pp. 24-45, Jan.-Mar. 2004.
[12] T. George and S. Merugu, "A Scalable Collaborative Filtering Framework Based on Co-Clustering," Proc. IEEE Fifth Int'l Conf. Data Mining, pp. 625-628, 2005.
[13] P. Symeonidis, A. Nanopoulos, A.N. Papadopoulos, and Y. Manolopoulos, "Nearest-Biclusters Collaborative Filtering Based on Constant and Coherent Values," Information Retrieval, vol. 11, no. 1, pp. 51-75, 2008.
[14] N. Slonim and N. Tishby, "Document Clustering Using Word Clusters via the Information Bottleneck Method," Proc. 23rd Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 208-215, 2000.
[15] R. El-Yaniv and O. Souroujon, "Iterative Double Clustering for Unsupervised and Semi-Supervised Learning," Proc. 12th European Conf. Machine Learning (ECML '01), pp. 121-132, 2001.
[16] I.S. Dhillon, S. Mallela, and D.S. Modha, "Information-Theoretic Co-Clustering," Proc. Ninth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 89-98, 2003.
[17] B. Long, Z.M. Zhang, and P.S. Yu, "Co-Clustering by Block Value Decomposition," Proc. 11th ACM SIGKDD Int'l Conf. Knowledge Discovery in Data Mining, pp. 635-640, 2005.
[18] C. Ding, T. Li, W. Peng, and H. Park, "Orthogonal Nonnegative Matrix T-Factorizations for Clustering," Proc. 12th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 126-135, 2006.
[19] J. Han, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., 2005.
[20] J.A. Hartigan, "Direct Clustering of a Data Matrix," J. Am. Statistical Assoc., vol. 67, no. 337, pp. 123-129, 1972.
[21] A. Pothen, H.D. Simon, and K.-P. Liou, "Partitioning Sparse Matrices with Eigenvectors of Graphs," SIAM J. Matrix Analysis and Applications, vol. 11, no. 3, pp. 430-452, 1990.
[22] J. Shi and J. Malik, "Normalized Cuts and Image Segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, Aug. 2000.
[23] B. Gao, T.-Y. Liu, X. Zheng, Q.-S. Cheng, and W.-Y. Ma, "Consistent Bipartite Graph Co-Partitioning for Star-Structured High-Order Heterogeneous Data Co-Clustering," Proc. 11th ACM SIGKDD Int'l Conf. Knowledge Discovery in Data Mining, pp. 41-50, 2005.
[24] N. Tishby, F.C. Pereira, and W. Bialek, "The Information Bottleneck Method," Proc. 37th Ann. Allerton Conf. Comm., Control and Computing, pp. 368-377, 1999.
[25] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D.S. Modha, "A Generalized Maximum Entropy Approach to Bregman Co-Clustering and Matrix Approximation," Proc. 10th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 509-514, 2004.
[26] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman, "Indexing by Latent Semantic Analysis," J. Am. Soc. for Information Science, vol. 41, pp. 391-407, 1990.
[27] D.D. Lee and H.S. Seung, "Learning the Parts of Objects by Non-Negative Matrix Factorization," Nature, vol. 401, no. 6755, pp. 788-791, 1999.
[28] Q. Gu and J. Zhou, "Co-Clustering on Manifolds," Proc. 15th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 359-368, 2009.
[29] F. Pan, X. Zhang, and W. Wang, "CRD: Fast Co-Clustering on Large Datasets Utilizing Sampling-Based Matrix Decomposition," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 173-184, 2008.
[30] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, and B. Schölkopf, "Learning with Local and Global Consistency," Advances in Neural Information Processing Systems 16, vol. 16, pp. 321-328, 2004.
[31] Y. Yang, D. Xu, F. Nie, J. Luo, and Y. Zhuang, "Ranking with Local Regression and Global Alignment for Cross Media Retrieval," Proc. 17th Ann. ACM Int'l Conf. Multimedia, pp. 175-184, 2009.
[32] S.T. Roweis and L.K. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding," Science, vol. 290, no. 5500, pp. 2323-2326, 2000.
[33] M. Wu and B. Schölkopf, "A Local Learning Approach for Clustering," Advances in Neural Information Processing Systems 19, vol. 19, pp. 1529-1536, 2007.
[34] F. Bach and Z. Harchaoui, "DIFFRAC: A Discriminative and Flexible Framework for Clustering," Advances in Neural Information Processing Systems 20, vol. 20, pp. 49-56, 2008.
[35] G. Strang, Introduction to Linear Algebra, third ed. Wellesley-Cambridge Press, 2003.
[36] M. Belkin and P. Niyogi, "Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering," Advances in Neural Information Processing Systems 14, vol. 14, pp. 585-591, 2002.
[37] G.H. Golub and C.F. Van Loan, Matrix Computations, third ed. Johns Hopkins Univ. Press, 1996.
[38] D. Cai, X. He, and J. Han, "Document Clustering Using Locality Preserving Indexing," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 12, pp. 1624-1637, Dec. 2005.
[39] L. Lovász and M.D. Plummer, Matching Theory. North-Holland, 1986.
[40] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander, "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring," Science, vol. 286, no. 5439, pp. 531-537, 1999.
[41] J.-P. Brunet, P. Tamayo, T.R. Golub, and J.P. Mesirov, "Metagenes and Molecular Pattern Discovery Using Matrix Factorization," Proc. Nat'l Academy of Sciences USA, vol. 101, no. 12, pp. 4164-4169, 2004.
[42] S. Pomeroy et al., "Prediction of Central Nervous System Embryonal Tumour Outcome Based on Gene Expression," Nature, vol. 415, no. 6870, pp. 436-442, 2002.
Lijun Zhang received the BS degree in computer science from Zhejiang University, China, in 2007. Currently, he is working toward the PhD degree in computer science at Zhejiang University. His research interests include machine learning, information retrieval, and data mining. He is a student member of the IEEE.
Chun Chen received the BS degree in mathematics from Xiamen University, China, in 1981, and the MS and PhD degrees in computer science from Zhejiang University, China, in 1984 and 1990, respectively. Currently, he is working as a professor in the College of Computer Science, Zhejiang University. His research interests include information retrieval, data mining, computer vision, computer graphics, and embedded technology. He is a member of the IEEE.
Jiajun Bu received the BS and PhD degrees in computer science from Zhejiang University, China, in 1995 and 2000, respectively. Currently, he is a professor in the College of Computer Science, Zhejiang University. His research interests include embedded systems, data mining, information retrieval, and mobile databases. He is a member of the IEEE.
Zhengguang Chen received the BS degree in computer science from Zhejiang University, China, in 2009. Currently, he is working toward the MS degree in computer science at Zhejiang University. His research interests include computer vision, machine learning, and data mining.
Deng Cai received the BS and MS degrees from Tsinghua University, both in automation, in 2000 and 2003, respectively, and the PhD degree in computer science from the University of Illinois at Urbana-Champaign in 2009. Currently, he is working as an associate professor in the State Key Lab of CAD&CG, College of Computer Science, Zhejiang University, China. His research interests include machine learning, data mining, and information retrieval. He is a member of the IEEE.
Jiawei Han is working as a professor of computer science at the University of Illinois. Currently, he is working as the director of the Information Network Academic Research Center (INARC) supported by the Network Science-Collaborative Technology Alliance (NS-CTA) program of the US Army Research Lab. He has chaired or served in more than 100 program committees of the major international conferences in the fields of data mining and database systems, and has also served or is serving on the editorial boards of Data Mining and Knowledge Discovery, the IEEE Transactions on Knowledge and Data Engineering, the Journal of Computer Science and Technology, and the Journal of Intelligent Information Systems. He is the founding editor-in-chief of the ACM Transactions on Knowledge Discovery from Data (TKDD). His book Data Mining: Concepts and Techniques (Morgan Kaufmann) has been used worldwide as a textbook. He has received IBM Faculty Awards, HP Innovation Awards, the ACM SIGKDD Innovation Award (2004), the IEEE Computer Society Technical Achievement Award (2005), and the IEEE W. Wallace McDowell Award (2009). He is a fellow of the ACM and the IEEE.