PaCK: Scalable Parameter-Free Clustering on K-Partite Graphs

Jingrui He∗ Hanghang Tong∗ Spiros Papadimitriou† Tina Eliassi-Rad‡

Christos Faloutsos∗ Jaime Carbonell∗

Abstract

Given an author-paper-conference graph, how can we automatically find groups for authors, papers and conferences, respectively? Existing work either (1) requires fine tuning of several parameters, or (2) can only be applied to bipartite graphs (e.g., author-paper graphs, or paper-conference graphs). To address this problem, in this paper we propose PaCK for clustering such k-partite graphs. By optimizing an information-theoretic criterion, PaCK searches for the best number of clusters for each type of object and generates the corresponding clustering. The unique feature of PaCK over existing methods for clustering k-partite graphs lies in its parameter-free nature. Furthermore, it can easily be generalized to the cases where certain connectivity relations are expressed as tensors, e.g., time-evolving data. The proposed algorithm is scalable in the sense that it is linear with respect to the total number of edges in the graphs. We present theoretical analysis as well as experimental evaluations to demonstrate both its effectiveness and efficiency.

1 Introduction

Complex graphs that express various relationships among objects of different types are rapidly proliferating, largely due to the prevalence of the Web and the Internet. For example, the Resource Description Framework (RDF) [25] aims to express practically all multi-relational data in a standard, machine-understandable form. One of the Web’s creators has even envisioned the Giant Global Graph (GGG) [6], which would capture and represent relationships across documents and networks.

In this paper we focus on a simpler but expressive subset of multi-relational data representations. We consider entities or objects of multiple types and allow any pair of types to be linked by a binary relationship. For example, in a publication corpus, papers (one object type) are associated with several other object types, e.g., authors, subject keywords, and publication venues.

Given such data, how can we find meaningful patterns and groups of objects, across different types? One approach might be to cluster the objects of each type independently of

∗ Carnegie Mellon University
† IBM T.J. Watson Lab
‡ Lawrence Livermore National Laboratory

the rest. Traditional clustering techniques [14] are designed to group objects in an unlabeled data set such that objects within the same cluster are similar to each other, whereas objects from different clusters are dissimilar. Most clustering algorithms, such as k-means [10], spectral clustering [2] and information-theoretic clustering [12], focus on one-way clustering, i.e., clustering the objects according to their similarity based on the features.

However, for sparse relational data, co-clustering or bi-clustering techniques [19] simultaneously cluster objects of all types and typically produce results of better quality, by leveraging clusters along other types in the similarity measure. Most co-clustering algorithms focus on just two types of objects, typically viewed as either rows and columns of a matrix, or source and destination nodes of a bipartite graph. The information-theoretic co-clustering (ITCC) algorithm [9] was among the first in the machine learning community to address this problem. Follow-up work includes [18], [16], and [4].

More recently, algorithms that generalize co-clustering to more than two object types have appeared, such as Consistent Bipartite Graph Co-partitioning (CBGC) [11], spectral relational clustering [17], and collective matrix factorization [21, 22]. These address some of the challenges in mining multi-relational data. However, despite their success, all of these methods require the user to provide several parameters, such as the number of clusters of each type, weights for different relations, etc.

We aim to provide a method that can also recover the number of clusters, by employing a model selection criterion based on lossless compression principles. Our starting point is the cross-associations [8] and Autopart [7] methods, both of which are parameter-free and provide the basic underpinnings for the model selection principles we shall employ. However, neither of them applies to multi-relational data. Such data pose additional challenges, which we address by introducing several new ideas, including exponential cluster splits, cluster merges, and multiple concurrent trials.

In this paper we propose PaCK to co-cluster k-partite graphs. Our main contributions in this paper are:


• PaCK is parameter-free: it employs a simple but effective model selection criterion, which can recover the “true” cluster structure (when known).

• We carefully design a search procedure that can find a good approximate solution.

• Our algorithms are scalable to large data sets (linear in the number of edges).

• We generalize PaCK to tensors.

Extensive experiments on both real and synthetic data sets, including comparisons to several other methods, validate the effectiveness of PaCK.

The rest of the paper is organized as follows: Section 2 formulates the problem; Section 3 introduces the PaCK search procedure; Section 4 presents experimental results; Section 5 discusses the generalization of PaCK to tensors; finally, Section 6 reviews related methods and Section 7 concludes.

2 Problem Formulation

In this section, we give the problem formulation of PaCK. Similar to cross-associations [8] and Autopart [7], PaCK formulates the clustering problem as a compression problem. Unlike cross-associations [8] and Autopart [7], PaCK compresses a set of inter-correlated matrices collectively, as opposed to the single matrix in cross-associations and Autopart.

Given a set of inter-connected objects of different types, our goal is to find patterns based on the binary connectivity matrices in a parameter-free fashion, i.e., to find the clusterings for the different types of objects simultaneously so that, after re-arranging, the connectivity matrices consist of homogeneous rectangular regions of high or low density.

Generally speaking, to achieve this goal, we make use of the MDL (Minimum Description Length) principle to design a criterion, and we minimize this criterion greedily. We describe the criterion in this section and deal with the search procedure in the next section.

2.1 Notation. Given $m$ types of data objects, $\mathcal{X}_1 = \{x_{11},\dots,x_{1n_1}\}$, $\dots$, $\mathcal{X}_m = \{x_{m1},\dots,x_{mn_m}\}$, where $\mathcal{X}_i$ is the $i$-th object type, $x_{ij}$ is the $j$-th object of the $i$-th type, and $n_i$ is the number of objects of the $i$-th type, we are interested in collectively clustering $\mathcal{X}_1$ into $k_1$ disjoint clusters, $\dots$, and $\mathcal{X}_m$ into $k_m$ disjoint clusters. Let

$$\Phi_i : \{1,2,\dots,n_i\} \to \{1,2,\dots,k_i\}, \quad i = 1,\dots,m$$

denote the assignments (i.e., mappings) of objects in $\mathcal{X}_i$ to the corresponding clusters.

Let $D_{ij}$ denote the $n_i \times n_j$ binary connectivity matrix between object types $\mathcal{X}_i$ and $\mathcal{X}_j$, $i \ne j$; i.e., the element in the $s$-th row and $t$-th column of $D_{ij}$ is 1 if and only if $x_{is}$ is connected to $x_{jt}$.

To better understand the collective clustering, given the mappings $\Phi_i$, $i = 1,\dots,m$, let us rearrange the connectivity matrix $D_{ij}$ so that the objects within the same clusters are put together. In this way, the matrix $D_{ij}$ is divided into smaller blocks, referred to as $D_{ij}^{pq}$, $p = 1,\dots,k_i$ and $q = 1,\dots,k_j$. Let the dimensions of $D_{ij}^{pq}$ be $(a_i^p, a_j^q)$. In other words, $a_i^p$ is the number of objects from $\mathcal{X}_i$ that belong to cluster $p$, and $a_j^q$ is the number of objects from $\mathcal{X}_j$ that belong to cluster $q$. Table 1 summarizes the notation used in this paper.
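To make the notation concrete, the following minimal Python sketch (ours; all variable names are hypothetical, and the data is random) stores the connectivity matrices of a small author-paper-conference graph as a dictionary keyed by ordered type pairs, together with initial mappings $\Phi_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = [5, 8, 3]          # n_1 authors, n_2 papers, n_3 conferences
k = [2, 2, 2]          # k_i: number of clusters per type

# D[(i, j)] is the n_i x n_j binary connectivity matrix D_ij, for i < j
D = {
    (0, 1): (rng.random((n[0], n[1])) < 0.3).astype(int),  # author-paper
    (1, 2): (rng.random((n[1], n[2])) < 0.3).astype(int),  # paper-conference
}

# Phi[i][s] is the cluster assigned to object s of type i (0-indexed)
Phi = [rng.integers(0, k[i], size=n[i]) for i in range(len(n))]
```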

2.2 A Lossless Code for Connectivity Matrices. Suppose that we are interested in transmitting the connectivity matrices $D_{ij}$, $i = 1,\dots,m-1$ and $j = i+1,\dots,m$. We are also given the mappings $\Phi_i$ that partition the objects in $\mathcal{X}_i$ into $k_i$ clusters, with none of them empty. Next, we introduce how to simultaneously code multiple matrices based on the above information.

2.2.1 Description Complexity. The first part is the description complexity of transmitting the connectivity matrices. It consists of the following parts:

1. Send the number of object types, i.e., $\log^*(m)$, where $\log^*$ is the universal code length for integers.¹ This term is independent of the collective clustering.

2. Send the number of objects of each type, i.e., $\sum_{i=1}^{m}\log^*(n_i)$. This term is also independent of the collective clustering.

3. Send the permutations of the objects of the same type so that the objects within the same clusters are put together, i.e., $\sum_{i=1}^{m} n_i\lceil\log k_i\rceil$ bits.

4. Send the number of clusters using $\sum_{i=1}^{m}\log^* k_i$ bits.

5. Send the number of objects in each cluster. For object type $\mathcal{X}_i$, suppose that $a_i^1 \ge a_i^2 \ge \dots \ge a_i^{k_i} \ge 1$. Compute

$$a_i^p := \Big(\sum_{t=p}^{k_i} a_i^t\Big) - k_i + p$$

for $i = 1,\dots,m$ and $p = 1,\dots,k_i-1$. Altogether the desired quantities can be sent using $\sum_{i=1}^{m}\sum_{p=1}^{k_i-1}\lceil\log a_i^p\rceil$ bits.

6. For each matrix block $D_{ij}^{pq}$, $i = 1,\dots,m-1$, $j = i+1,\dots,m$, $p = 1,\dots,k_i$, and $q = 1,\dots,k_j$, send the number of ones in the matrix, using $\lceil\log(a_i^p a_j^q + 1)\rceil$ bits.

¹ $\log^*(x) \approx \log_2(x) + \log_2\log_2(x) + \dots$, where only the positive terms are retained; this is the optimal length if we do not know the range of values for $x$ beforehand.
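For illustration, here is a minimal Python sketch (ours) of this universal code length; the function name log_star is our own:

```python
import math

def log_star(x: float) -> float:
    # log2(x) + log2(log2(x)) + ..., keeping only the positive terms
    total = 0.0
    while x > 1.0:
        x = math.log2(x)
        total += x
    return total

print(log_star(16))  # 4 + 2 + 1 = 7 bits
```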


Table 1: Notation

Symbol : Definition
$m$ : Number of object types
$\mathcal{X}_i$ : The $i$-th object type
$n_i$ : Number of objects in $\mathcal{X}_i$
$k_i$ : Number of clusters for $\mathcal{X}_i$
$k_i^S$ : $k_i$ in the $S$-th iteration step of PaCK
$k_i^*$ : Optimal number of clusters for $\mathcal{X}_i$
$\Phi_i$ : Assignments/mappings of objects in $\mathcal{X}_i$ to the corresponding clusters
$\Phi_i(s)$ : $\Phi_i$ in the $s$-th iteration step of CCsearch
$\Phi_i^S$ : $\Phi_i$ in the $S$-th iteration step of PaCK
$\Phi_i^*$ : Optimal assignments/mappings of objects in $\mathcal{X}_i$ to the corresponding clusters
$D_{ij}$ : $n_i \times n_j$ binary connectivity matrix between object types $\mathcal{X}_i$ and $\mathcal{X}_j$
$D_{ij}\{t\}$ : $D_{ij}$ at time stamp $t$, when the relationship between $\mathcal{X}_i$ and $\mathcal{X}_j$ is a tensor
$a_i^p$ : Number of objects from $\mathcal{X}_i$ that belong to cluster $p$
$D_{ij}^{pq}$ : Block of $D_{ij}$ that corresponds to the $p$-th cluster in $\mathcal{X}_i$ and the $q$-th cluster in $\mathcal{X}_j$
$D_{ij}^{pq}(s)$ : $D_{ij}^{pq}$ in the $s$-th iteration step of CCsearch
$n(A, u)$ : Number of elements in the matrix/vector $A$ that are equal to $u$, $u = 0, 1$
$n(A)$ : Total number of elements in the matrix/vector $A$
$P_{ij}^{pq}$ : Proportion of 1s in $D_{ij}^{pq}$
$P_{ij}^{pq}(s)$ : $P_{ij}^{pq}$ in the $s$-th iteration step of CCsearch
$H(P_{ij}^{pq})$ : Binary Shannon entropy function with respect to $P_{ij}^{pq}$
$C_{ij}^{pq}$ : Coding length required to transmit the block $D_{ij}^{pq}$ using arithmetic coding
$C_{ij}^{pq}(s)$ : $C_{ij}^{pq}$ in the $s$-th iteration step of CCsearch
$T_D(\Phi_1,\dots,\Phi_m)$ : Total coding cost with respect to the mappings $\Phi_1,\dots,\Phi_m$

2.2.2 Code for the Matrix Blocks. In addition to the above information, we also need to transmit the matrix blocks $D_{ij}^{pq}$. For a single block $D_{ij}^{pq}$, we can model its elements as i.i.d. draws from a Bernoulli distribution with bias $P_{ij}^{pq} = n(D_{ij}^{pq},1)/(n(D_{ij}^{pq},1)+n(D_{ij}^{pq},0))$, where $n(D_{ij}^{pq},1)$ and $n(D_{ij}^{pq},0)$ are the numbers of ones and zeros in $D_{ij}^{pq}$. Therefore, the number of bits required to transmit this block using arithmetic coding is

$$C_{ij}^{pq} = C(D_{ij}^{pq}) := n(D_{ij}^{pq})\,H(P_{ij}^{pq}) = -n(D_{ij}^{pq},1)\log(P_{ij}^{pq}) - n(D_{ij}^{pq},0)\log(1-P_{ij}^{pq})$$

where $n(D_{ij}^{pq}) = n(D_{ij}^{pq},1)+n(D_{ij}^{pq},0)$, and $H$ is the binary Shannon entropy function.

2.2.3 Total Coding Cost. Based on the above discussion, the total coding cost for the connectivity matrices $D_{ij}$, $i = 1,\dots,m-1$, $j = i+1,\dots,m$, with respect to the given mappings $\Phi_i$, $i = 1,\dots,m$, is as follows:

$$T_D(\Phi_1,\dots,\Phi_m) := \sum_{i=1}^{m} n_i\lceil\log k_i\rceil + \sum_{i=1}^{m}\log^* k_i + \sum_{i=1}^{m}\sum_{p=1}^{k_i-1}\lceil\log a_i^p\rceil$$
$$+ \sum_{i=1}^{m-1}\sum_{j=i+1}^{m}\sum_{p=1}^{k_i}\sum_{q=1}^{k_j}\lceil\log(a_i^p a_j^q+1)\rceil + \sum_{i=1}^{m-1}\sum_{j=i+1}^{m}\sum_{p=1}^{k_i}\sum_{q=1}^{k_j} C_{ij}^{pq} \qquad (2.1)$$

Note that we ignore the costs $\log^*(m)$ and $\sum_{i=1}^{m}\log^*(n_i)$, since they do not depend on the collective clustering.
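To make Equation (2.1) concrete, here is a minimal Python sketch (ours, not the authors' implementation) that computes the total coding cost for binary connectivity matrices stored as in the earlier sketch; block_cost and total_cost are hypothetical helper names, and the model-independent terms $\log^*(m)$ and $\sum_i \log^*(n_i)$ are dropped as in the text:

```python
import math
import numpy as np

def log_star(x: float) -> float:
    # universal integer code length (see the footnote in Section 2.2.1)
    total = 0.0
    while x > 1.0:
        x = math.log2(x)
        total += x
    return total

def block_cost(block: np.ndarray) -> float:
    # bits for one block D_ij^pq under arithmetic coding: n(D) * H(P)
    n1, n = float(block.sum()), block.size
    n0 = n - n1
    if n1 == 0 or n0 == 0:
        return 0.0  # homogeneous block has zero entropy
    p = n1 / n
    return -n1 * math.log2(p) - n0 * math.log2(1 - p)

def total_cost(D: dict, Phi: list, k: list) -> float:
    # Equation (2.1): D[(i, j)] is the binary matrix D_ij (i < j);
    # Phi[i] holds cluster ids in {0, ..., k[i]-1}; clusters assumed non-empty
    m = len(Phi)
    a = [np.bincount(Phi[i], minlength=k[i]) for i in range(m)]
    cost = sum(len(Phi[i]) * math.ceil(math.log2(k[i])) for i in range(m) if k[i] > 1)
    cost += sum(log_star(k[i]) for i in range(m))
    for i in range(m):                        # send cluster sizes, largest first
        sizes = np.sort(a[i])[::-1]
        for p in range(k[i] - 1):
            cost += math.ceil(math.log2(int(sizes[p:].sum()) - k[i] + p + 1))
    for (i, j), Dij in D.items():             # per-block header + block code
        for p in range(k[i]):
            for q in range(k[j]):
                cost += math.ceil(math.log2(int(a[i][p] * a[j][q]) + 1))
                cost += block_cost(Dij[np.ix_(Phi[i] == p, Phi[j] == q)])
    return cost
```

PaCK's search procedure can then compare this cost across candidate cluster numbers and mappings.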

3 Search Procedure in PaCK

The optimal collective clustering corresponds to the numbers of clusters $k_i^*$ and the mappings $\Phi_i^*$ for each object type $\mathcal{X}_i$ such that the total coding cost $T_D(\Phi_1^*,\dots,\Phi_m^*)$ is minimized. This problem is NP-hard: even allowing only column re-ordering for a single connectivity matrix, a reduction from the TSP problem can be found [15]. Therefore, we have designed a greedy algorithm to minimize the total coding cost (Equation (2.1)). Specifically, to determine the optimal collective clustering, we must set the number of clusters for each object type, and then find the optimal mappings. These two components correspond to the two major steps in PaCK: (1) finding a good collective clustering given the number of clusters of each object type; and (2) searching for the optimal number of clusters.

In this section, we first describe these two steps, followed by a computational complexity analysis of the proposed PaCK.

3.1 CCsearch. In the CCsearch step, we are given the values of $k_1,\dots,k_m$ and want to find a set of mappings $\Phi_1,\dots,\Phi_m$ that minimizes the total coding cost (Equation (2.1)). Note that in this case, the first two terms in Equation (2.1) are fixed, and only the remaining three terms, $\sum_{i=1}^{m}\sum_{p=1}^{k_i-1}\lceil\log a_i^p\rceil$, $\sum_{i=1}^{m-1}\sum_{j=i+1}^{m}\sum_{p=1}^{k_i}\sum_{q=1}^{k_j}\lceil\log(a_i^p a_j^q+1)\rceil$ and $\sum_{i=1}^{m-1}\sum_{j=i+1}^{m}\sum_{p=1}^{k_i}\sum_{q=1}^{k_j} C_{ij}^{pq}$, depend on the mappings. Based on our experiments, in the regions where CCsearch searches for the optimal mappings given the numbers of clusters, the code for transmitting the matrix blocks dominates the total coding cost. Therefore, in CCsearch, we aim to minimize the following criterion:

$$\sum_{i=1}^{m-1}\sum_{j=i+1}^{m}\sum_{p=1}^{k_i}\sum_{q=1}^{k_j} C_{ij}^{pq} \qquad (3.2)$$

CCsearch (Alg. 1) is an intuitive and efficient alternating minimization algorithm that yields a local minimum of Equation (3.2).

Note that CCsearch is essentially a k-means-style algorithm [20]: it alternates between finding the cluster centroids and assigning objects to the ‘closest’ cluster, except that in each iteration step, the ‘features’ (i.e., the clusterings of the other object types) of an object may change. This is different from the sequential clustering algorithm proposed in [23], where the cluster membership of only one object may change in each iteration.

The correctness of CCsearch is given by Theorem 3.1.

THEOREM 3.1. For $s \ge 1$,

$$\sum_{i=1}^{m-1}\sum_{j=i+1}^{m}\sum_{p=1}^{k_i}\sum_{q=1}^{k_j} C_{ij}^{pq}(s) \;\ge\; \sum_{i=1}^{m-1}\sum_{j=i+1}^{m}\sum_{p=1}^{k_i}\sum_{q=1}^{k_j} C_{ij}^{pq}(s+1)$$

where $C_{ij}^{pq}(s)$ is $C_{ij}^{pq}$ in the $s$-th iteration step of CCsearch. In other words, CCsearch never increases the objective function (Equation (3.2)).

Proof. Omitted for brevity. □

3.2 Cluster Number Search. The second part of PaCK is an algorithm that looks for good values of $k_i$ (i.e., the cluster numbers for each object type). The basic idea of PaCK is as follows: we start with small values of $k_i$, progressively increase them, and find the best $\Phi_i$ using CCsearch. We use two strategies to split the current clusters and increase the cluster number: the linear split (step 14) and the exponential split (step 12). In order to escape local minima, we also allow merging two existing clusters into a bigger one (steps 26-36). Finally, in order to find a good local minimum, we run PaCK multiple times and choose the run with the lowest coding cost. This step can easily be parallelized and therefore adds almost nothing to the overall running time.

Algorithm 1 CCsearch
Input: The connectivity matrices $D_{ij}$, $i, j = 1,\dots,m$; the cluster numbers $k_i$, $i = 1,\dots,m$, for each type.
Output: The cluster assignment $\Phi_i$ for each object type.
1: Let $s$ denote the iteration index and set $s = 0$.
2: Initialize the collective clustering $\Phi_1(s),\dots,\Phi_m(s)$.
3: Set $\Phi'_i(s) = \Phi_i(s)$, $i = 1,\dots,m$.
4: while true do
5:   ### alternate among different types of objects
6:   for $l = 1,\dots,m$ do
7:     ### try to update the clustering for type $l$
8:     for $i = 1,\dots,m-1$, $j = i+1,\dots,m$, $p = 1,\dots,k_i$, $q = 1,\dots,k_j$ do
9:       Compute the matrix blocks $D_{ij}^{pq}(s)$ and the corresponding bias $P_{ij}^{pq}(s)$ based on $\Phi'_1(s),\dots,\Phi'_m(s)$.
10:    end for
11:    Hold the mapping $\Phi'_l(s)$ for object type $\mathcal{X}_l$. Concatenate all the matrices $D_{lj}$, $j \ne l$, to form a single matrix $D_l$, which is an $n_l \times \sum_{j\ne l} n_j$ matrix.
12:    for each row $x$ of $D_l$ do
13:      Split it into $\sum_{j\ne l} k_j$ parts, each corresponding to one cluster in $\Phi'_j(s)$, $j \ne l$.
14:      Let the $\sum_{j\ne l} k_j$ parts found in step 13 be $x_{11},\dots,x_{1k_1},\dots,x_{(l-1)1},\dots,x_{(l-1)k_{l-1}}, x_{(l+1)1},\dots,x_{(l+1)k_{l+1}},\dots,x_{m1},\dots,x_{mk_m}$.
15:    end for
16:    for each of the $\sum_{j\ne l} k_j$ parts do
17:      Compute $n(x_{jq}, u)$, $u = 0, 1$, $j \ne l$, $q = 1,\dots,k_j$, which is the number of elements in $x_{jq}$ that are equal to $u$.
18:    end for
19:    Define $\Phi'_l(s)$ such that the cluster it assigns to row $x$ satisfies the following condition: for all $1 \le p \le k_l$, $a + b \le c + d$, where
       $a = \sum_{j\ne l}\sum_{q=1}^{k_j} n(x_{jq},1)\log\frac{1}{P_{lj}^{\Phi'_l(s)q}(s)}$
       $b = \sum_{j\ne l}\sum_{q=1}^{k_j} n(x_{jq},0)\log\frac{1}{1-P_{lj}^{\Phi'_l(s)q}(s)}$
       $c = \sum_{j\ne l}\sum_{q=1}^{k_j} n(x_{jq},1)\log\frac{1}{P_{lj}^{pq}(s)}$
       $d = \sum_{j\ne l}\sum_{q=1}^{k_j} n(x_{jq},0)\log\frac{1}{1-P_{lj}^{pq}(s)}$
20:  end for
21:  ### terminate the whole program
22:  if there is no decrease in Equation (3.2) then
23:    Break.
24:  else
25:    for $l = 1,\dots,m$ do
26:      Set $\Phi_l(s+1) = \Phi'_l(s)$.
27:    end for
28:    Set $s \leftarrow s + 1$.
29:  end if
30: end while
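The heart of steps 12-19 is a per-row coding-cost comparison. The following Python sketch (ours, with hypothetical names; not the authors' code) performs one such pass for a single object type, given the concatenated matrix $D_l$ from step 11 and a column labeling that encodes the fixed clusterings of the other types; picking the argmin implements the $a + b \le c + d$ test:

```python
import numpy as np

def reassign_rows(Dl, row_labels, col_labels, k_row, k_col):
    # Dl: n_l x (sum of other types' sizes) binary matrix;
    # row_labels: current mapping Phi_l; col_labels: cluster id of each column.
    # Returns the updated row mapping.
    eps = 1e-9
    ones = np.zeros((k_row, k_col))
    tot = np.zeros((k_row, k_col))
    for p in range(k_row):
        rows = row_labels == p
        for q in range(k_col):
            cols = col_labels == q
            ones[p, q] = Dl[np.ix_(rows, cols)].sum()   # n(D^pq, 1)
            tot[p, q] = rows.sum() * cols.sum()         # n(D^pq)
    P = np.clip(ones / np.maximum(tot, 1), eps, 1 - eps)  # block biases P^pq
    # per-row counts of 1s (and 0s) inside each column cluster
    n1 = np.stack([Dl[:, col_labels == q].sum(axis=1) for q in range(k_col)], axis=1)
    n0 = np.bincount(col_labels, minlength=k_col) - n1
    # cost[x, p] = coding cost of row x if assigned to row cluster p
    cost = -(n1 @ np.log2(P).T) - (n0 @ np.log2(1 - P).T)
    return cost.argmin(axis=1)
```

In the full algorithm, this pass alternates over the $m$ object types, with the blocks and biases recomputed each time, until Equation (3.2) stops decreasing.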


Note that, unlike the outer loop of cross-associations [8], PaCK additionally performs the exponential split, the merge operation and multiple concurrent trials over $m$ ($m \ge 2$) types of objects. As we will show in the experimental section, these additional operations (exponential split, merge, and multiple concurrent trials) significantly improve the search quality. The complete PaCK algorithm is summarized in Alg. 2.

In Alg. 2, if the split operation in the previous iteration for a given type $l$ was not successful, we try to split it linearly, i.e., to increase the cluster number by 1 instead of doubling it. In order to decide which cluster to split, we use the procedure in Alg. 3, which is based on a maximum-entropy criterion.

The correctness of PaCK is given by Theorem 3.2.

THEOREM 3.2. (1) The outer loop of PaCK (i.e., step 6) never increases the objective function (Equation (2.1)); (2) PaCK converges in a finite number of steps.

Proof. Omitted for brevity. □

Note that PaCK stops when the total coding cost (Equation (2.1)) does not decrease. Therefore, PaCK can be seen as a model selection procedure. Given a specific model (the number of clusters for each object type), we find the parameters (the mappings from objects to object clusters) according to the empirical risk (Equation (3.2)). Then we evaluate different models based on the regularized empirical risk (Equation (2.1)). In this way, PaCK avoids over-fitting, i.e., generating a large number of clusters where each object corresponds to an individual cluster.

Although we assume that there is connectivity between each pair of object types, PaCK can easily be generalized to the cases where some object types are not connected with each other. In this case, the corresponding terms in the objective function (Equation (3.2)) and the total coding cost (Equation (2.1)) disappear.

3.3 Analysis of Computational Complexity.

In this subsection, we analyze the computational complexity of the proposed algorithms.

First, in each iteration of the proposed CCsearch algorithm, for object type $\mathcal{X}_i$, we need to count the number of non-zero elements in each row and assign the row to one of the $k_i$ clusters. Therefore, we have the following lemma for the proposed CCsearch algorithm:

LEMMA 3.1. The computational complexity of each iteration of the CCsearch algorithm is $O(k_i \cdot \sum_{j\ne i} n(D_{ij},1))$ for object type $\mathcal{X}_i$. Let $I$ denote the total number of iterations. The overall complexity of the CCsearch algorithm is $\sum_{i=1}^{m}(k_i \cdot \sum_{j\ne i} n(D_{ij},1)) \cdot I$.

Proof. Omitted for brevity. □

Algorithm 2 PaCK
Input: The connectivity matrices $D_{ij}$ ($i, j = 1,\dots,m$).
Output: $k_i^*$ and the corresponding mapping $\Phi_i^S$ for $i = 1,\dots,m$.
1: Let $S$ denote the search iteration index.
2: Initialize $S = 0$ and $k_i = 1$, $i = 1,\dots,m$.
3: for $l = 1,\dots,m$ do
4:   Initialize type $l$ as ‘split successfully’.
5: end for
6: while true do
7:   ### alternate among different types of objects
8:   for $l = 1 : m$ do
9:     ### try to update the cluster number for type $l$
10:    if type $l$ is marked as ‘split successfully’ then
11:      ### try exponential split
12:      Set $k_l^{S+1} = 2 k_l^S$.
13:    else
14:      ### try linear split
15:      Set $k_l^{S+1} = k_l^S + 1$.
16:    end if
17:    Concatenate all the matrices $D_{lj}$, $j \ne l$, to form a single matrix $D_l$, which is an $n_l \times \sum_{j\ne l} n_j$ matrix.
18:    Construct an initial mapping $\Phi'^{S+1}_l$ using InitialSearch.
19:    Use CCsearch to find new mappings $\Phi_1^{S+1},\dots,\Phi_m^{S+1}$.
20:    if there is no decrease in Equation (2.1) then
21:      Set $k_l^{S+1} = k_l^S$, $\Phi_1^{S+1} = \Phi_1^S$, $\dots$, $\Phi_m^{S+1} = \Phi_m^S$.
22:      Mark type $l$ as ‘split unsuccessfully’.
23:    else
24:      Mark type $l$ as ‘split successfully’.
25:    end if
26:    ### try to merge two clusters
27:    if $\sum_{i=1}^{m} k_i^{S+1} > S + 1$ then
28:      Randomly select two clusters of type $l$.
29:      Merge the two selected clusters.
30:      Use CCsearch to find new mappings $\Phi_1^{S+1},\dots,\Phi_m^{S+1}$.
31:      if $\Phi_1^{S+1},\dots,\Phi_m^{S+1}$ produce a decrease in Equation (2.1) then
32:        ### successful merge
33:        Keep the merged mappings $\Phi_1^{S+1},\dots,\Phi_m^{S+1}$.
34:        Update $k_l^{S+1} \leftarrow k_l^{S+1} - 1$.
35:      end if
36:    end if
37:  end for
38:  Update $S \leftarrow S + 1$.
39:  ### terminate the whole program
40:  if $\sum_{l=1}^{m} k_l^{S+1} = \sum_{l=1}^{m} k_l^S$ then
41:    Set $k_l^* = k_l^S$, $l = 1,\dots,m$.
42:    Set $S^* = S$.
43:    Break.
44:  end if
45: end while


Algorithm 3 InitialSearch
Input: The connectivity matrices $D_{lj}$, $j \ne l$, and the original mapping $\Phi_l$ with $k_l$ clusters.
Output: Initial mapping $\Phi'_l$ with $k_l + 1$ clusters.
1: Split the row group $r$ with the maximum entropy per row, i.e.,

$$r := \arg\max_{1\le p\le k_l} \sum_{j\ne l}\sum_{q=1}^{k_j} \frac{n(D_{lj}^{pq})\,H(P_{lj}^{pq})}{a_l^p}$$

2: Construct $\Phi'_l$ as follows. For every row $x$ in row group $r$, place it into the new group $k_l + 1$ if and only if doing so decreases the per-row entropy of group $r$, i.e., if and only if

$$\sum_{j\ne l}\sum_{q=1}^{k_j} \frac{n(\tilde D_{lj}^{rq})\,H(\tilde P_{lj}^{rq})}{a_l^r - 1} < \sum_{j\ne l}\sum_{q=1}^{k_j} \frac{n(D_{lj}^{rq})\,H(P_{lj}^{rq})}{a_l^r}$$

where $\tilde D_{lj}^{rq}$ is $D_{lj}^{rq}$ without row $x$, and $\tilde P_{lj}^{rq}$ is the bias of $\tilde D_{lj}^{rq}$. Otherwise, we place $x$ in the original group. If we move the row to the new group, we update $D_{lj}^{rq}$ and $P_{lj}^{rq}$, $j \ne l$, $q = 1,\dots,k_j$, by removing row $x$.
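A minimal Python sketch (ours, with hypothetical names) of step 1, selecting the row group with the maximum per-row entropy over the concatenated matrix and labelings used in Alg. 1:

```python
import numpy as np

def binary_entropy(p: float) -> float:
    # Shannon entropy H(p) of a Bernoulli bias, with H(0) = H(1) = 0
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def group_to_split(Dl, row_labels, col_labels, k_row, k_col):
    # return the row group r maximizing sum_q n(D^rq) H(P^rq) / a_r
    scores = np.zeros(k_row)
    for p in range(k_row):
        rows = row_labels == p
        a_p = rows.sum()
        if a_p == 0:
            continue
        for q in range(k_col):
            blk = Dl[np.ix_(rows, col_labels == q)]
            if blk.size:
                scores[p] += blk.size * binary_entropy(blk.mean()) / a_p
    return int(scores.argmax())
```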

Next, we analyze the complexity of the PaCK algorithm. Let $k^*_{\max}$ be the total number of times that CCsearch is called in PaCK, and let $I_{\max}$ be the maximum number of iteration steps each time we run the CCsearch algorithm. We have the following lemma for the proposed PaCK:

LEMMA 3.2. The overall complexity of the PaCK algorithm is $O(k^*_{\max} I_{\max} \cdot \sum_{i=1}^{m-1}(k_i^* \cdot \sum_{j\ne i} n(D_{ij},1)))$.

Proof. Omitted for brevity. □

Finally, notice that in each iteration of the outer loop of PaCK (step 6), we call CCsearch at most $2m$ times (i.e., at most twice for each type of object). Also, we only invoke the merge operation when the condition $\sum_{l=1}^{m} k_l^{S+1} > S + 1$ holds (step 27 in PaCK, where $S$ is the iteration index of the outer loop of PaCK). Therefore, we have the following lemma for the total number of times that CCsearch is called in PaCK.

LEMMA 3.3. $k^*_{\max}$ in Lemma 3.2 is upper bounded by $2m\sum_{l=1}^{m} k_l^*$.

Proof. Omitted for brevity. □

4 Experimental Results

In this section, we evaluate the performance of PaCK. The experiments are designed to answer the following two questions:

• How good is the search quality of PaCK?

• How fast is PaCK?

4.1 Experimental Setup

Data sets. We use both synthetic and real data sets. For the synthetic data sets, given the number of object types, we first specify the connectivity pattern among the different types, such as line-shape, star-shape, loop-shape and so on. Figure 1 illustrates the connectivity patterns used in our experiments. Then, within each object type, we generate clusters of different sizes. Finally, we randomly flip the elements of the connectivity matrices with a certain probability (i.e., the noise level).

Figure 1: Different connectivity patterns: (a) line; (b) star; (c) loop; (d) clique.

In addition to the synthetic data sets, we also test PaCK on two real data sets: the 20 newsgroup data set³ and the NIPS data set⁴.

For the 20 newsgroup data set, we use the documents from 4 different domains (‘talk’, ‘sci’, ‘rec’ and ‘comp’) to form a star-shaped k-partite (k = 5) graph, where the ‘documents’ from each domain are treated as one type of objects at the leaves and the ‘words’ as another type of objects in the center. The connectivity structure of this data set is shown in Figure 2. Altogether, there are 61,188 words, 16,015 documents and 4,140,814 edges.

Figure 2: The connectivity structure for the 20 newsgroup data set.

For the NIPS data set, we use the papers from the 1987 to 1999 NIPS proceedings to construct a ‘keyword-paper-author’ tri-partite graph, with ‘paper’ in the middle. The ‘keywords’ are extracted from the paper titles by removing the stop words (such as ‘a’, ‘the’, ‘and’, etc.). The connectivity structure of this data set is shown in Figure 3. Altogether, there are 2,037 authors, 1,740 papers, 2,151 keywords and 458,995 edges.

³ http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html
⁴ http://www.cs.toronto.edu/~roweis/data.html


Figure 3: The connectivity structure for the NIPS data set.

Comparison methods. To the best of our knowledge, there are no parameter-free methods for clustering k-partite graphs in the literature. For comparison, we have designed the following type-oblivious method based on Autopart [7] as the baseline. In the type-oblivious method, we first form one big connectivity matrix, where the rows and columns correspond to all the objects, and then apply Autopart to this big matrix. Note that in this way, the generated clusters may consist of objects of different types. To address this problem, we add a post-processing step that further decomposes such heterogeneous clusters into homogeneous clusters, i.e., clusters of the same type of objects. We also compare the proposed PaCK with the two most recent clustering methods for k-partite graphs: the collective matrix factorization method [21, 22] (referred to as ‘CollMat’) and the spectral relational clustering algorithm [17] (referred to as ‘Spec’). Both ‘CollMat’ and ‘Spec’ require the user to provide the cluster numbers as inputs.

Evaluation metric. To evaluate the quality of the clusters generated by PaCK, we adopt the measure from [1]. Specifically, given two clusterings $B = B_1 \cup \dots \cup B_k$ and $B' = B'_1 \cup \dots \cup B'_{k'}$ that partition the objects into $k$ and $k'$ subsets, define

$$d^2(B, B') = k + k' - 2\sum_{i,i'} \frac{|B_i \cap B'_{i'}|^2}{|B_i|\times|B'_{i'}|}$$

where $B_i \cap B'_{i'}$ is the set of common objects in $B_i$ and $B'_{i'}$, and $|\cdot|$ denotes set cardinality. Note that in this form, $d^2(B, B')$ is between 0 (if $B$ and $B'$ are exactly the same) and $k + k' - 2$ (if $k' = 1$). In our application, we normalize the distance so that it is always between 0 and 1, i.e.,

$$d^2(B, B') = \Big(k + k' - 2\sum_{i,i'} \frac{|B_i \cap B'_{i'}|^2}{|B_i|\times|B'_{i'}|}\Big)\Big/(k + k' - 2)$$

For each object type $\mathcal{X}_i$, we compare the clustering generated by PaCK and the ground truth using this distance function to get $d_i^2$. Then we calculate the average distance over all object types, i.e., $d^2 = \sum_{i=1}^{m} d_i^2 / m$. Smaller values of $d^2$ indicate better clustering for all object types.
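A small Python sketch (ours) of this normalized distance, taking two integer label arrays over the same objects; the factor 2 matches the endpoints stated above:

```python
import numpy as np

def d2(labels_a: np.ndarray, labels_b: np.ndarray) -> float:
    # normalized partition distance of [1]: 0 for identical partitions, up to 1
    ka, kb = int(labels_a.max()) + 1, int(labels_b.max()) + 1
    if ka + kb <= 2:
        return 0.0                               # two one-cluster partitions
    overlap = np.zeros((ka, kb))
    np.add.at(overlap, (labels_a, labels_b), 1)  # |B_i intersect B'_i'|
    sa = overlap.sum(axis=1, keepdims=True)      # |B_i|
    sb = overlap.sum(axis=0, keepdims=True)      # |B'_i'|
    s = (overlap ** 2 / np.maximum(sa * sb, 1)).sum()
    return float((ka + kb - 2 * s) / (ka + kb - 2))
```

For example, d2 of two identical labelings is 0, and d2 of any non-trivial labeling against a single all-in-one cluster is 1.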

Machine configurations. All running times reported in this paper are measured on the same machine with four 3.0GHz Intel Xeon CPUs and 16GB memory, running Linux (2.6 kernel). Unless otherwise stated, all experimental results are averaged over 20 runs.

4.2 Quality Assessment on Synthetic Data Sets.

Figure 4 compares the clustering quality of PaCK and the type-oblivious method on four connectivity patterns. Both methods are parameter-free. From these figures, we can draw the following conclusions: (1) in all cases, if there is no noise, PaCK recovers the exact clusterings for the different types of objects simultaneously (i.e., the average distance is 0); (2) the quality of PaCK is quite robust against the noise level; (3) in terms of clustering quality, PaCK is always significantly better than the type-oblivious method.

Figure 4: Comparison of PaCK vs. type-oblivious on synthetic data sets: (a) clique-shape; (b) line-shape; (c) star-shape; (d) loop-shape. Each panel plots average distance versus noise level. (Smaller is better.)

We also compare PaCK with ‘CollMat’ [21, 22] and ‘Spec’ [17]. Since both ‘CollMat’ and ‘Spec’ require the cluster numbers as inputs, we use the true cluster numbers for all three methods for a fair comparison. Note that in this case, PaCK degenerates to CCsearch. Figure 5 presents the comparison results. In most cases, our PaCK outperforms both ‘CollMat’ and ‘Spec’.

Figure 5: Comparison of PaCK vs. ‘CollMat’ and ‘Spec’ on synthetic data sets: (a) clique-shape; (b) line-shape; (c) star-shape; (d) loop-shape. Each panel plots average distance versus noise level. (Smaller is better.)

Figures 6 to 8 illustrate one run of PaCK on three connectivity patterns, with noise level 20%.


Figure 6: Clustering results of PaCK on loop-shape connectivity: (a) original; (b) after re-ordering.

Figure 7: Clustering results of PaCK on line-shape connectivity: (a) original; (b) after re-ordering.

In each of the three figures, the first sub-figure shows the original connectivity matrices. In the second sub-figure, the rows and columns of each connectivity matrix are rearranged so that the objects from the same cluster are put together. A dark block means that it has a large proportion of 1s, and a white block means that it has a small proportion of 1s. From the figures, we see that the clusters generated by PaCK are quite homogeneous: either quite dark or quite white.

If we ignore the exponential split, the merge operation and the multiple concurrent trials, the proposed PaCK (referred to as ‘PaCK-Basic’) is similar to the outer loop of cross-associations, except that in ‘PaCK-Basic’ we alternate among m, instead of 2, types of objects. We evaluate the benefit of these additional operations (exponential split, merge, and multiple concurrent trials). The results are presented in Figure 9. For the multiple concurrent trials, we run 10 trials of the proposed PaCK and pick the one that gives the lowest coding cost.

Figure 8: Clustering results of PaCK on star-shape connectivity: (a) original; (b) after re-ordering.

In our case, we use parallelism (i.e., multiple concurrent trials) to improve the search quality, rather than the search speed. The average distance is normalized by that of PaCK-Basic. From Figure 9, we can see that the additional operations (exponential split, merge, and multiple concurrent trials) largely improve the search quality: in all cases, the average distance of PaCK is only a fraction of that of PaCK-Basic.

4.3 Quality Assessment on 20 Newsgroup Data Sets.

We use the 20 newsgroup data sets to compare the proposed PaCK with the two most recent clustering algorithms for k-partite graphs: the collective matrix factorization method [21, 22] (referred to as ‘CollMat’) and the spectral relational clustering algorithm [17] (referred to as ‘Spec’). Both ‘CollMat’ and ‘Spec’ require the user to provide the cluster numbers as inputs. Therefore, we use the true cluster numbers for all three methods (‘CollMat’, ‘Spec’ and PaCK) for a fair comparison. The overall graph for this data set is a 5-partite graph with the ‘word’ object type in the center.


Figure 9: Comparison of search procedures, normalized distance of PaCK vs. PaCK-Basic on line-, loop-, star- and clique-shaped patterns: (a) no noise; (b) 20% noise. (Smaller is better.)

We also randomly select subsets from all four ‘document’ object types to form smaller star-shaped k-partite (2 ≤ k ≤ 5) graphs. The results are presented in Figure 10(a). Since we do not have the ground truth for the ‘word’ objects, the cluster distance is averaged over all ‘document’ objects in the corresponding graph, and it is normalized by the highest value among the three methods. It can be seen that in all cases the proposed PaCK performs best. In Figure 10(b), we also present the final normalized coding cost for the three methods. It is interesting to notice that PaCK also achieves the lowest coding cost, which indicates that there might be a close relationship between the coding cost and the clustering quality. Finally, it is worth pointing out that neither ‘CollMat’ nor ‘Spec’ works if the cluster numbers are not given by the user, whereas PaCK searches for such parameters automatically.

4.4 Quality Assessment on NIPS Data Sets.

Unlike the synthetic data sets and the 20 newsgroup data set, for the NIPS data set we do not have the ground truth. Therefore, we use this data set as a case study to illustrate the effectiveness of PaCK. The original connectivity matrices (‘paper’ versus ‘author’, and ‘paper’ versus ‘keyword’) are shown in Figure 11(a). Using PaCK, we find 13 ‘paper’ clusters, 12 ‘author’ clusters and 22 ‘keyword’ clusters. Figure 11(b) plots the connectivity matrices after rearranging based on the clustering results. Using these reordered matrices, we have a concise summary of the original data set; e.g., we can see that ‘author’ group 12 is working on the same topic (‘paper’ group 1), using the same set of keywords (‘keyword’ group 7).

Figure 10: Comparison of clustering results for the 20 newsgroups data sets with 2, 3 and 4 domains: (a) normalized distance; (b) normalized coding cost, for ‘Spec’, ‘CollMat’ and PaCK. (Smaller is better.)

We manually verify that the topic is ‘neural information processing’, and report in Figure 12 some sample papers, authors and keywords from each of these clusters. Other clusters found by PaCK are also consistent with human intuition; e.g., ‘paper’ cluster 2 is about statistical machine learning, ‘paper’ cluster 12 is about computer vision, and so on.

4.5 Evaluation of Speed.

According to our analysis in Subsection 3.3, the complexity of PaCK grows linearly with the total number of edges in the connectivity matrices. Figure 13 shows the wall-clock time of PaCK versus the total number of edges in the connectivity matrices at different noise levels. This figure illustrates that when there is no noise (the left sub-figure), the curves are straight lines. When the noise level is 10% (the right sub-figure), the curves are close to straight lines; the deviations may be due to the different numbers of iteration steps in the CCsearch algorithm.

5 Generalization of PaCK to Tensors

PaCK can be easily generalized to the cases where certain connectivity relations form tensors. For the sake of simplicity, assume that the relationship between $\mathcal{X}_1$ and $\mathcal{X}_2$ is a tensor, i.e., $D_{12}\{t\}$ varies with the third dimension, say time $t$, $t = 1,\dots,T$. In the CCsearch algorithm, besides maintaining the mappings $\Phi_1(s),\dots,\Phi_m(s)$, we also need to maintain the mapping $\Phi_{m+1}(s)$, which maps every time stamp $t$ to one of the $k_{m+1}$ time clusters. Let $a_{m+1}^l$ denote the number of time stamps that belong to the $l$-th time cluster, $l = 1,\dots,k_{m+1}$. With respect to the algorithm, we need to make the following modifications.


Figure 11: Clustering result on the NIPS data sets: (a) the original connectivity matrices (paper vs. author and paper vs. word); (b) the connectivity matrices after reordering, with paper cluster 1 vs. author cluster 12 and paper cluster 1 vs. word cluster 7 highlighted.

Figure 12: An example of the resulting ‘author-paper-keyword’ clusters.

1. In step 9 of CCsearch, we compute the matrix blocks $D_{ij}^{pq}(s)$ and the corresponding bias $P_{ij}^{pq}(s)$ for $i = 2,\dots,m-1$, $j = i+1,\dots,m$, $p = 1,\dots,k_i$ and $q = 1,\dots,k_j$. For object type $\mathcal{X}_1$, we need to calculate the tensor segment $D_{12}^{pql}(s)$ for $p = 1,\dots,k_1$, $q = 1,\dots,k_2$ and $l = 1,\dots,k_{m+1}$, which corresponds to $D_{12}^{pq}(s)$ within the $l$-th time cluster. As before, the corresponding bias $P_{12}^{pql}(s)$ is the proportion of 1s in $D_{12}^{pql}(s)$.

2. In steps 11-19 of CCsearch, to update the mapping $\Phi_1$ for object type $\mathcal{X}_1$, we need to construct the matrix $D_1$, which is originally the concatenation of the matrices $D_{1j}$, $j = 2,\dots,m$. When the relationship between $\mathcal{X}_1$ and $\mathcal{X}_2$ is a tensor, we first calculate the matrices $D_{12}\{l\} = \sum_{t\in l\text{-th cluster}} D_{12}\{t\}$, $l = 1,\dots,k_{m+1}$ (see the sketch after this list). Note that these matrices are not binary any more. Then we concatenate the matrices $D_{12}\{1\},\dots,D_{12}\{k_{m+1}\}$ and $D_{13},\dots,D_{1m}$ to form $D_1$, which is $n_1 \times (n_2 \times k_{m+1} + \sum_{j=3}^{m} n_j)$.

Figure 13: Wall-clock time versus the total number of edges: (a) no noise; (b) 10% noise. The number following each connectivity pattern is the number of object types.

3. For each row $x$ of $D_1$, split it into $k_2 \times k_{m+1} + \sum_{j=3}^{m} k_j$ parts. The first $k_2 \times k_{m+1}$ parts correspond to the clusters of object type $\mathcal{X}_2$ in $D_{12}\{1\},\dots,D_{12}\{k_{m+1}\}$. Denote them as $x_{2ql}$, $q = 1,\dots,k_2$, $l = 1,\dots,k_{m+1}$. Note that their values are not binary, and the corresponding $n(x_{2ql}, u)$, $u = 0, 1$, are defined as follows: $n(x_{2ql}, 1) = \|x_{2ql}\|_1$, where $\|\cdot\|_1$ is the $L_1$ norm, and $n(x_{2ql}, 0) = a_2^q \times a_{m+1}^l - n(x_{2ql}, 1)$. The remaining $\sum_{j=3}^{m} k_j$ parts correspond to the clusters of object types $\mathcal{X}_3,\dots,\mathcal{X}_m$, which are denoted as before: $x_{31},\dots,x_{3k_3},\dots,x_{m1},\dots,x_{mk_m}$. Their values are binary, and the corresponding $n(x_{jq}, u)$, $j = 3,\dots,m$, $q = 1,\dots,k_j$, $u = 0, 1$, is the number of elements in $x_{jq}$ that are equal to $u$.

4. The new mapping $\Phi'_1(s)$ satisfies the following condition: for all $1 \le p \le k_1$,

$$\sum_{q=1}^{k_2}\sum_{l=1}^{k_{m+1}}\Big[n(x_{2ql},1)\log\frac{1}{P_{12}^{\Phi'_1(s)ql}(s)} + n(x_{2ql},0)\log\frac{1}{1-P_{12}^{\Phi'_1(s)ql}(s)}\Big] + \sum_{j=3}^{m}\sum_{q=1}^{k_j}\Big[n(x_{jq},1)\log\frac{1}{P_{1j}^{\Phi'_1(s)q}(s)} + n(x_{jq},0)\log\frac{1}{1-P_{1j}^{\Phi'_1(s)q}(s)}\Big]$$
$$\le \sum_{q=1}^{k_2}\sum_{l=1}^{k_{m+1}}\Big[n(x_{2ql},1)\log\frac{1}{P_{12}^{pql}(s)} + n(x_{2ql},0)\log\frac{1}{1-P_{12}^{pql}(s)}\Big] + \sum_{j=3}^{m}\sum_{q=1}^{k_j}\Big[n(x_{jq},1)\log\frac{1}{P_{1j}^{pq}(s)} + n(x_{jq},0)\log\frac{1}{1-P_{1j}^{pq}(s)}\Big]$$

The new mapping for object type $\mathcal{X}_2$ can be found similarly. The new mappings for the other object types $\mathcal{X}_i$, $i = 3,\dots,m$, are found as before.

5. To update the mapping for the time dimension, for each time stamp $t$, we first divide the matrix $D_{12}\{t\}$ into smaller blocks, denoted as $D_{12}^{pq}\{t\}$, which correspond to the $p$-th cluster in $\mathcal{X}_1$ and the $q$-th cluster in $\mathcal{X}_2$ at time stamp $t$. Then the new mapping $\Phi'_{m+1}(s)$ satisfies the following condition: for all $1 \le l \le k_{m+1}$,

$$\sum_{p=1}^{k_1}\sum_{q=1}^{k_2}\Big[n(D_{12}^{pq}\{t\},1)\log\frac{1}{P_{12}^{pq\Phi'_{m+1}(s)}(s)} + n(D_{12}^{pq}\{t\},0)\log\frac{1}{1-P_{12}^{pq\Phi'_{m+1}(s)}(s)}\Big]$$
$$\le \sum_{p=1}^{k_1}\sum_{q=1}^{k_2}\Big[n(D_{12}^{pq}\{t\},1)\log\frac{1}{P_{12}^{pql}(s)} + n(D_{12}^{pq}\{t\},0)\log\frac{1}{1-P_{12}^{pql}(s)}\Big]$$
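As referenced in modification 2, here is a minimal Python sketch (ours) of the time-slice aggregation $D_{12}\{l\} = \sum_{t\in l\text{-th cluster}} D_{12}\{t\}$, which collapses the binary tensor into one non-negative matrix per time cluster before concatenation:

```python
import numpy as np

def aggregate_slices(D12, time_labels, k_time):
    # D12: binary tensor of shape (T, n1, n2); time_labels[t] in {0,...,k_time-1}.
    # Returns the list [D12{1}, ..., D12{k_time}], each n1 x n2, non-negative.
    return [D12[time_labels == l].sum(axis=0) for l in range(k_time)]

# Example: 6 time stamps, two time clusters
rng = np.random.default_rng(0)
D12 = (rng.random((6, 4, 5)) < 0.3).astype(int)
slices = aggregate_slices(D12, np.array([0, 0, 1, 1, 1, 0]), 2)
# each entry of slices[l] counts the 1s over the time stamps in cluster l
```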

Note that although the multi-way relational graph clustering algorithm proposed in [3] can also deal with this problem, it has to be given the number of clusters for each object type as well as the weight for each connectivity tensor, whereas our method is parameter-free.

6 Related Work

The idea of using compression for clustering can be traced back to the information-theoretic co-clustering algorithm [9], where the normalized non-negative contingency table is treated as a joint probability distribution between two discrete random variables that take values over the rows and columns. The optimal co-clustering is the one that minimizes the difference between the mutual information of the original random variables and the mutual information of the clustered random variables.

As mentioned in Section 1, the information-theoretic co-clustering algorithm can only be applied to bipartite graphs. However, the idea behind this algorithm can be generalized to more than two types of heterogeneous objects. For example, in [11], the authors proposed the CBGC algorithm, which aims at collective clustering for star-shaped inter-relationships among different types of objects. Follow-up work includes high-order co-clustering [13]. Another example is the spectral relational clustering algorithm proposed in [17]. Unlike the previous algorithms, this algorithm is not restricted to star-shaped structures. More recently, the collective matrix factorization proposed by Singh et al. [21, 22] can also be used for clustering k-partite graphs.

Despite their success, one major drawback of the above algorithms is that they all require the user to specify certain parameters. In the information-theoretic co-clustering algorithm, the user needs to specify the numbers of row clusters and column clusters. In both the CBGC algorithm and the spectral relational clustering algorithm, besides giving the number of clusters for each type of objects, the user also needs to specify reasonable weights for the different types of relations or features. However, in real applications it might be very difficult to determine the number of clusters, especially when the data set is very large, not to mention the challenge of specifying the weights. On the other hand, the proposed PaCK is totally parameter-free, i.e., it requires no user intervention.

In terms of parameter-free clustering algorithms for graphs, cross-associations [8], which is designed for bipartite graphs, is the most representative. Similarly, the Autopart algorithm [7] also tries to find the number of clusters and the corresponding clustering for a unipartite graph in a parameter-free fashion. Neither cross-associations [8] nor Autopart [7] applies to k-partite graphs with k > 2. The proposed PaCK generalizes the idea of cross-associations/Autopart so that it can deal with multiple types of objects. In terms of problem formulation, PaCK is similar to cross-associations/Autopart, except that in PaCK we try to compress multiple matrices collectively, instead of a single one. In fact, if we ignore the type information and treat the whole heterogeneous graph as one big (unipartite) graph, it seems that we could directly leverage Autopart for clustering. However, as we show in the experimental section, this strategy (‘type-oblivious’) usually leads to poor performance exactly because the type information is ignored. Moreover, we carefully design the search procedure (e.g., by introducing the exponential split, merge, and multiple concurrent trials) in the proposed PaCK, which largely improves the search quality, as we show in the experimental section. On the other hand, the proposed PaCK inherits two important merits from the original cross-associations/Autopart: (1) it is parameter-free and (2) it is scalable, both of which are very important for many real applications.

Other related work includes (1) GraphScope [24], which uses an information-theoretic criterion similar to that of cross-associations on time-evolving graphs, segmenting time into homogeneous intervals; and (2) multi-way distributional clustering (MDC) [5], which was demonstrated to outperform the previous information-theoretic clustering algorithms at the time it was proposed. However, in MDC we still need to tune the weights for the different connectivity matrices, and it is not clear what its computational complexity is in big-O notation. On the other hand, our PaCK is parameter-free, and it is clearly linear in the number of edges in the graph.

7 Conclusion

In this paper, we have proposed PaCK for clustering k-partite graphs, which is, to the best of our knowledge, the first parameter-free method for this problem. In terms of problem formulation, PaCK seeks a good compression of multiple matrices collectively. We carefully design the search procedure in PaCK so that (1) it finds a good approximate solution, and (2) the whole algorithm is scalable, in the sense that it is linear in the number of edges in the graphs. The major advantage of PaCK over all existing methods for clustering k-partite graphs (the CBGC algorithm, the spectral relational clustering algorithm, collective matrix factorization, etc.) lies in its parameter-free nature. Furthermore, PaCK can be easily generalized to the cases where certain connectivity relations form tensors. We verify the effectiveness and efficiency of PaCK through extensive experimental results.

References

[1] F. Bach and Z. Harchaoui. DIFFRAC: a discriminative and flexible framework for clustering. In NIPS, 2007.
[2] F. Bach and M. Jordan. Learning spectral clustering. In NIPS, 2003.
[3] A. Banerjee, S. Basu, and S. Merugu. Multi-way clustering on relation graphs. In SDM, 2007.
[4] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. Modha. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. The Journal of Machine Learning Research, 8:1919–1986, October 2007.
[5] R. Bekkerman, R. El-Yaniv, and A. McCallum. Multi-way distributional clustering via pairwise interactions. In ICML, pages 41–48, 2005.
[6] T. Berners-Lee. Giant Global Graph.
[7] D. Chakrabarti. AutoPart: Parameter-free graph partitioning and outlier detection. In PKDD, pages 112–124, 2004.
[8] D. Chakrabarti, S. Papadimitriou, D. Modha, and C. Faloutsos. Fully automatic cross-associations. In KDD, pages 79–88, 2004.
[9] I. Dhillon, S. Mallela, and D. Modha. Information-theoretic co-clustering. In KDD, pages 89–98, 2003.
[10] R. Duda, P. Hart, and D. Stork. Pattern Classification. 2001.
[11] B. Gao, T. Liu, X. Zheng, Q. Cheng, and W. Ma. Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering. In KDD, pages 41–50, 2005.
[12] E. Gokcay and J. Principe. Information theoretic clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:158–171, February 2002.
[13] G. Greco, A. Guzzo, and L. Pontieri. An information-theoretic framework for high-order co-clustering of heterogeneous objects. In SEBD, pages 397–404, 2007.
[14] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[15] D. Johnson, S. Krishnan, J. Chhugani, S. Kumar, and S. Venkatasubramanian. Compressing large boolean matrices using reordering techniques. In VLDB, pages 13–23, 2004.
[16] T. Li. A general model for clustering binary data. In KDD, pages 188–197, 2005.
[17] B. Long, Z. Zhang, X. Wu, and P. Yu. Spectral clustering for multi-type relational data. In ICML, pages 585–592, 2006.
[18] B. Long, Z. Zhang, and P. Yu. A probabilistic framework for relational clustering. In KDD, pages 470–479, 2007.
[19] S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM TCBB, 1:24–45, 2004.
[20] T. Mitchell. Machine Learning. McGraw-Hill Science/Engineering, 1997.
[21] A. P. Singh and G. J. Gordon. Relational learning via collective matrix factorization. In KDD, pages 650–658, 2008.
[22] A. P. Singh and G. J. Gordon. A unified view of matrix factorization models. In ECML/PKDD (2), pages 358–373, 2008.
[23] N. Slonim, N. Friedman, and N. Tishby. Unsupervised document classification using sequential information maximization. In SIGIR, pages 129–136, 2002.
[24] J. Sun, C. Faloutsos, S. Papadimitriou, and P. Yu. GraphScope: parameter-free mining of large time-evolving graphs. In KDD, pages 687–696, 2007.
[25] W3C. Resource Description Framework.