

Appl Intell
DOI 10.1007/s10489-014-0526-0

Co-clustering over multiple dynamic data streams based on non-negative matrix factorization

Chun-Yan Sang · Di-Hua Sun

© Springer Science+Business Media New York 2014

Abstract Clustering multiple data streams has become an active area of research with many practical applications. Most of the early work in this area focused on one-sided clustering, i.e., clustering data streams based on feature correlation. However, recent research has shown that data streams can be grouped based on the distribution of their features, while features can be grouped based on their distribution across data streams. In this paper, an evolutionary clustering algorithm is proposed for multiple data streams using graph regularization non-negative matrix factorization (EC-NMF) in which the geometric structure of both the data and feature manifold is considered. Instead of directly clustering multiple data streams periodically, EC-NMF works in the low-rank approximation subspace and incorporates prior knowledge from historic results with temporal smoothness. Furthermore, we develop an iterative algorithm and provide convergence and correctness proofs from a theoretical standpoint. The effectiveness and efficiency of the algorithm are both demonstrated in experiments on real and synthetic data sets. The results show that the proposed EC-NMF algorithm outperforms existing methods for clustering multiple data streams evolving over time.

C.-Y. Sang (✉) · D.-H. Sun
College of Automation, Chongqing University, Chongqing, 400030, China
e-mail: [email protected]

C.-Y. Sang · D.-H. Sun
Key Laboratory of Dependable Service Computing in Cyber Physical Society of Ministry of Education, Chongqing University, Chongqing, 400030, China

Keywords Low-rank approximation · Non-negative matrix factorization · Graph regularization · Co-clustering

1 Introduction

Data streams appear in many application fields such as traffic systems, sensor networks, and social networks. Clustering multiple data streams has become an active topic in data mining and machine learning [1–4]. To obtain interesting relationships and other useful information from these data streams, a variety of methods have been developed over the past few decades to cluster multiple data streams.

Traditional clustering techniques mainly focus on partitioning a single data stream at each time step or monitoring multiple data streams concurrently. To cluster multiple data streams, Beringer et al. [5] used a discrete Fourier transformation (DFT) to summarize the data streams and presented an online version of the classic K-means clustering algorithm. To avoid recalculating the DFT coefficients when new readings arrive and thus minimize processing time, an incremental update mechanism was proposed in [6]. To find the optimized cluster number automatically and partition the data objects effectively, Masoud et al. [7] proposed a combinatorial particle swarm optimization method, CPSOII, for dynamic data clustering. To discover cross-relationships among streams, a clustering-on-demand framework is presented in [8], while Dai et al. [9] proposed online clustering over multiple evolving streams through correlation and use of an event (COMET-CORE) framework. To find communities in a large set of interacting entities, Aggarwal et al. [1] proposed an online analytical processing framework for community detection of data streams.


Recently, to discover communities and capture their evolution with temporal smoothness, incremental and evolutionary clustering technologies have been designed to handle dynamic data streams [1–3, 10]. FacetNet was proposed for analyzing communities from social network data and community evolution [2]. Based on low-rank kernel matrix factorization, Wang et al. [11] proposed the ECKF framework, which is an evolutionary clustering algorithm for large-scale data. To identify communities simultaneously across different networks, Comar et al. [12] presented a framework that considers prior information about the potential relationships between the communities. The community structure can be characterized by clustering methods or a latent space model, while evolution of the community structure is captured in terms of various criteria [13, 14]. Meanwhile, several co-clustering algorithms have shown that the duality between data points and features is superior to traditional one-sided clustering [15–17]. Wu et al. [17] extended information theoretic co-clustering and proposed co-clustering with an augmented matrix to simultaneously co-cluster dyadic data using two augmented matrices.

In this paper, an evolutionary clustering algorithm for multiple data streams is proposed using graph regularization non-negative matrix factorization (EC-NMF). Our proposal was motivated by recent progress in graph regularization and matrix factorization [15, 18, 19], with the aim being to capture their evolution with temporal smoothness and to identify groups of correlated data streams. EC-NMF simultaneously considers the geometric structure contained in the data and feature graph, which is also constructed to encode the geometric information. To maintain a consistent clustering result, there is a trade-off between the historical cost embedded in EC-NMF and the benefit from current observations and historical results. Thus, the clustering results at a given time step are determined using both the current snapshot and prior knowledge from historic results.

The main contributions of this work are as follows:

1. We propose a novel clustering algorithm, EC-NMF, over multiple data streams. EC-NMF simultaneously considers the geometric structure information contained in data points and features.

2. As a trade-off against the benefit of maintaining a consistent clustering over different time steps, EC-NMF embeds the historical cost in the model.

3. We develop an iterative multiplicative updating optimization scheme to solve our proposed algorithm and provide convergence proofs of the optimization scheme.

The rest of this paper is organized as follows: Related research works are introduced in Section 2. Then, the preliminaries of EC-NMF are given in Section 3. Section 4 presents the proposed algorithm in detail, while convergence and correctness proofs of the EC-NMF algorithm are provided in Section 5. Experimental results using synthetic and real world data sets are presented in Section 6. Finally, Section 7 gives the conclusion.

2 A brief review of non-negative matrix factorization

Given a data matrix $A$, non-negative matrix factorization (NMF) aims to find two non-negative matrices $F$ and $G$ such that $A \approx FG^T$, under the constraints that $F \in \mathbb{R}^{d \times k}$ and $G \in \mathbb{R}^{n \times k}$ are non-negative. Semi-NMF and convex-NMF algorithms expand the range of application of NMF [20]. Semi-NMF offers a low-dimensional representation of data points, which lends itself to a convenient clustering interpretation, while convex-NMF restricts the columns of $F$ to be convex combinations of the columns of the data matrix $A$. This constraint has the advantage that the columns can be interpreted as weighted sums of certain data points. Based on NMF, a clustering method was proposed in [20], where $F$ is considered to be a centroid matrix with every column representing a cluster center, and $G$ is the cluster indicator matrix.
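To make this concrete, the following is a minimal sketch of the standard two-factor multiplicative-update NMF of [24] in Python/NumPy, assuming a dense non-negative data matrix; the function name, iteration count, and the small constant added to the denominators are illustrative choices rather than details taken from the methods reviewed here.

```python
import numpy as np

def nmf(A, k, iters=200, eps=1e-9, seed=0):
    """Two-factor NMF: A (d x n) ~= F (d x k) @ G.T (k x n), Frobenius loss."""
    rng = np.random.default_rng(seed)
    d, n = A.shape
    F = rng.random((d, k))
    G = rng.random((n, k))
    for _ in range(iters):
        # multiplicative updates keep F and G non-negative
        G = G * (A.T @ F) / (G @ (F.T @ F) + eps)
        F = F * (A @ G) / (F @ (G.T @ G) + eps)
    return F, G  # columns of F act as centroids, G as the cluster indicator

# usage: cluster the columns of A by the largest entry in each row of G
A = np.random.default_rng(1).random((50, 30))
F, G = nmf(A, k=3)
labels = G.argmax(axis=1)
```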

Although these methods are generally successful, they may require huge amounts of space for large sparse graphs. Indeed, several important applications can be modeled as large sparse graphs, including transportation and social network analyses. Moreover, low-rank approximation is essential for finding patterns and detecting anomalies in the matrix of a graph. It can also extract correlations and remove noise from matrix-structured data. This has led to the development of methods such as CUR [21], CMD [13], and the Colibri family [22].

Recent research has shown that the observed data lie on a nonlinear low-dimensional manifold embedded in a high-dimensional ambient space. To extend the applicable range of NMF methods, Cai et al. [18] proposed a graph regularized non-negative matrix factorization to find a compact representation that uncovers the hidden semantics and simultaneously respects the intrinsic geometric structure. Moreover, it has been shown that learning performance can be greatly improved if the manifold structure information contained in the data is exploited. Gu et al. [19] proposed a dual regularized co-clustering method based on semi-NMF tri-factorization with two graph regularizations. Graph dual regularization nonnegative matrix tri-factorization that simultaneously considers both the geometric structure of the data manifold as well as the feature manifold has also been proposed [23].

3 Preliminaries

3.1 Problem statement

The data streams are generated from various sensor networks with the data items arriving synchronously, which means that all data streams are updated simultaneously. Table 1 lists the main symbols used throughout the paper. Let $G^{(t)} = (S^{(t)}, E^{(t)}, W^{(t)})$ be a graph associated with a sensor network at time step $t$, in which $S^{(t)}$ denotes the set of data points and $E^{(t)}$ denotes the set of edges. Each data point $s_i^{(t)}$ corresponds to an entity. The edge set $E^{(t)}$ consists of a set of edges between $s_i^{(t)}$ and $s_j^{(t)}$. Each edge $e_{ij}^{(t)}$ represents an interaction between sensors $i$ and $j$ at time step $t$, and has a weight $\rho_{ij}^{(t)}$ associated with it. The weight $\rho_{ij}^{(t)}$ on each edge is a function of the similarity between data streams $s_i^{(t)}$ and $s_j^{(t)}$.

Without loss of generality, we use an adjacency matrix $A^{(t)} \in \mathbb{R}^{m \times n}$ to represent graph $G^{(t)} = (S^{(t)}, E^{(t)}, W^{(t)})$. If there is an edge from node $s_i^{(t)} \in S^{(t)}$ to node $s_j^{(t)} \in S^{(t)}$ with similarity $\rho_{ij}^{(t)}$, we set the value of $A^{(t)}(i, j)$ to $\rho_{ij}^{(t)} \in W^{(t)}$. Otherwise, we set it to zero. $A^{(t)}(i, j)$ is the element in the $i$th row and $j$th column of matrix $A^{(t)}$, and $A^{(t)}(:, j)$ denotes the $j$th column of $A^{(t)}$. Each row or column in $A^{(t)}$ corresponds to a node in $S^{(t)}$.

The cluster partitions are obtained by decomposing the adjacency matrix representation of the graph into a product of its latent factors. The square of the Euclidean distance and the generalized Kullback-Leibler divergence are common metrics for the approximation error [24]. In EC-NMF, we seek to minimize the distance function $\|A - B\|_F^2$ between two non-negative matrices $A$ and $B$, defined as

$$\|A - B\|_F^2 = \sum_{ij} (A_{ij} - B_{ij})^2, \qquad (1)$$

where $\|\cdot\|_F$ is the Frobenius norm. This is lower bounded by zero, and clearly vanishes if, and only if, $A = B$.

At snapshot $t$, the objective of EC-NMF is to create a set of partitions $\{\Pi_i^{(t)}\}_{i=1}^{k}$. $\forall i \in \{1, \cdots, k\}$, $\Pi_i^{(t)}$ should meet the following conditions: $\cup_{i=1}^{k} \Pi_i^{(t)} = \{s_1^{(t)}, s_2^{(t)}, \cdots, s_n^{(t)}\}$ and $\cap_{i=1}^{k} \Pi_i^{(t)} = \emptyset$. The community $\Pi_i^{(t)}$ is a set of similar data streams such that $\Pi_i^{(t)} = \{s_1^{(t)}, \cdots, s_{|\Pi_i^{(t)}|}^{(t)}\}$, where $|\Pi_i^{(t)}|$ is the total number of streams in $\Pi_i^{(t)}$. The similarity of $\forall s_i^{(t)}, s_j^{(t)} \in S^{(t)}$ is determined by $\rho_{ij}^{(t)}$.

Since the data streams can grow infinitely and arrive continuously, it is often hard to buffer and process all the data. We consider readings within a predefined time window of length $w$. The similarity matrix is computed as

$$\rho_{ij}^{(t)} = \exp\left( -\frac{\left\| s_i^{(t)} - s_j^{(t)} \right\|^2}{2\sigma^2} \right), \qquad (2)$$

where $\sigma$ is a scaling parameter controlling how rapidly the similarity $\rho_{ij}^{(t)}$ reduces with the distance between $s_i^{(t)}$ and $s_j^{(t)}$.

Table 1 Symbols

$A \in \mathbb{R}^{m \times n}$: adjacency matrix of graph $G$
$\rho_{ij}^{(t)}$: similarity between $s_i^{(t)}$ and $s_j^{(t)}$
$\bar{A}$: approximation of matrix $A$
$C \in \mathbb{R}^{n \times c}$: representative column matrix of $A$
$U \in \mathbb{R}^{c \times k}$: weight matrix with size $c \times k$
$R \in \mathbb{R}^{n \times k}$: indicator matrix with size $n \times k$
$G_R$: data graph of $G$
$W_R$: data weight matrix of $G_R$
$I$: middle matrix of $A$
$J$: right matrix of $A$
$g_{i\cdot}$: cluster label for $G_R$
$N(s_{:,j})$: $\varepsilon$-nearest neighbors of $s_{:,j}$
$L_R$: graph Laplacian of $G_R$
$D_R$: diagonal degree matrix of $G_R$
$N$: number of vertexes of $G_R$
$G_C$: feature graph of $G$
$W_C$: feature weight matrix of $G_C$
$f_{i\cdot}$: cluster label for $G_C$
$N(s_{j,:}^T)$: $\varepsilon$-nearest neighbors of $s_{j,:}^T$
$L_C$: graph Laplacian of $G_C$
$D_C$: diagonal degree matrix of $G_C$
$M$: number of vertexes of $G_C$


3.2 Graph regularization

Recent studies in spectral graph theory and manifold learning theory have demonstrated that local geometric structure can be effectively modeled through a nearest neighbor graph on a scatter of data points [25, 26]. It can be deduced that the data points are not only sampled from the data manifold, but also from the dual view, namely, the data manifold and the feature manifold [19]. To explore the geometric structure of the data and feature manifolds, a data graph and a feature graph are constructed as follows.

Data graph $G_R$ has vertices corresponding to $\{s_{\cdot 1}, \cdots, s_{\cdot n}\}$. According to the cluster assumption [27], if data points $s_{\cdot i}$ and $s_{\cdot j}$ are close to each other, their respective cluster labels $g_{i\cdot}$ and $g_{j\cdot}$ should also be close. We use the 0-1 weighting scheme to construct the $\varepsilon$-nearest neighbor data graph as in [18, 19]. The data weight matrix is defined as

$$[W_R]_{ij} = \begin{cases} 1, & \text{if } s_{:,i} \in N(s_{:,j}) \\ 0, & \text{otherwise} \end{cases} \qquad i, j = 1, \cdots, n, \qquad (3)$$

where $N(s_{:,j})$ represents the set of $\varepsilon$-nearest neighbors of $s_{:,j}$. The graph Laplacian of the data graph is defined as $L_R = D_R - W_R$ [28], where $D_R$ is a diagonal degree matrix whose entries are given by $[D_R]_{ii} = \sum_j [W_R]_{ij}$.

Next we construct feature graph $G_C$ in a similar way to data graph $G_R$. Vertices of the feature graph correspond to $\{s_{1\cdot}, \cdots, s_{m\cdot}\}$. According to the cluster assumption [27], if features $s_{i\cdot}$ and $s_{j\cdot}$ are close, their cluster labels $f_{i\cdot}$ and $f_{j\cdot}$ should also be close to each other. The 0-1 weighting scheme is used to construct an $\varepsilon$-nearest neighbor feature graph, with its vertices corresponding to $\{s_{1,:}^T, \cdots, s_{m,:}^T\}$. The feature weight matrix is defined as

$$[W_C]_{ij} = \begin{cases} 1, & \text{if } s_{i,:}^T \in N(s_{j,:}^T) \\ 0, & \text{otherwise} \end{cases} \qquad i, j = 1, \cdots, m. \qquad (4)$$

Furthermore, the graph Laplacian of the feature graph is defined as $L_C = D_C - W_C$.
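As an illustration of this construction, the sketch below builds a 0-1 nearest-neighbor weight matrix and the graph Laplacian $L = D - W$ with NumPy; the neighborhood size, the symmetrization step, and the function names are assumptions made for the example rather than details prescribed by [18, 19].

```python
import numpy as np

def knn_graph_laplacian(S, n_neighbors=5):
    """0-1 nearest-neighbor weight matrix W and graph Laplacian L = D - W.

    S: (n, d) array with one row per object (e.g., one row per data stream).
    """
    n = S.shape[0]
    # pairwise squared Euclidean distances
    sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for j in range(n):
        # indices of the closest points to object j, excluding j itself
        idx = np.argsort(sq[:, j])
        idx = idx[idx != j][:n_neighbors]
        W[idx, j] = 1.0
    W = np.maximum(W, W.T)        # symmetrize the neighborhood relation
    D = np.diag(W.sum(axis=1))    # diagonal degree matrix
    return W, D - W               # e.g. W_R and L_R = D_R - W_R

# the feature graph can be built by applying the same routine to the
# transposed data matrix: W_C, L_C = knn_graph_laplacian(A.T)
```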

4 EC-NMF approach

4.1 Objective function of EC-NMF

Based on the above analysis, it is assumed that we have a sensor network representation of the adjacency matrix. If the community structure evolves from $t$ to $t+1$, we construct graphs $G^{(t)}$ and $G^{(t+1)}$ at time steps $t$ and $t+1$, respectively. Let $A^{(t)} \in \mathbb{R}_+^{m \times n}$ and $A^{(t+1)} \in \mathbb{R}_+^{m \times n}$ be the adjacency matrices associated with graphs $G^{(t)}$ and $G^{(t+1)}$, respectively. Here $\mathbb{R}_+$ represents the set of non-negative real numbers. Based on the two graph regularizations presented in (3) and (4), we propose a new method for clustering multiple data streams. This method incorporates the geometric structure of both the data manifold and the feature manifold and captures their evolution with temporal smoothness. The goal is to discover the latent community from the adjacency matrix using matrix approximation techniques [20, 29]. The objective function can be written as

$$\begin{aligned}
J = {} & \alpha \left\| A^{(t+1)} - C^{(t+1)} U^{(t+1)} R^{(t+1)T} \right\|_F^2 + (1-\alpha) \left\| A^{(t)} - C^{(t)} U^{(t)} R^{(t+1)T} \right\|_F^2 \\
& + \lambda \, \mathrm{Tr}\!\left( C^{(t+1)T} L_{C^{(t+1)}} C^{(t+1)} \right) + \mu \, \mathrm{Tr}\!\left( R^{(t+1)T} L_{R^{(t+1)}} R^{(t+1)} \right) \\
& \text{s.t. } C^{(t+1)} \ge 0,\; U^{(t+1)} \ge 0,\; R^{(t+1)} \ge 0,\; C^{(t)} \ge 0,\; U^{(t)} \ge 0,
\end{aligned} \qquad (5)$$

where parameter $\alpha$ is defined by the user as a trade-off between the current snapshot cost and the historical cost, $\lambda, \mu \ge 0$ are regularization parameters that balance the reconstruction error of EC-NMF in the first term against the label smoothness of the data points and features in the third and fourth terms, $R^{(t+1)} \in \mathbb{R}_+^{n \times k}$ is the cluster membership matrix for graph $G^{(t+1)}$, $U^{(t+1)}$ and $U^{(t)}$ are weight matrices, and $C^{(t+1)}$ and $C^{(t)}$ are the representative column matrices of $A^{(t+1)}$ and $A^{(t)}$, respectively. $L_{R^{(t+1)}} = D_{R^{(t+1)}} - W_{R^{(t+1)}}$ is the graph Laplacian of the data graph, which reflects the label smoothness of the data points, and $L_{C^{(t+1)}} = D_{C^{(t+1)}} - W_{C^{(t+1)}}$ is the graph Laplacian of the feature graph, which reflects the label smoothness of the features.

Specifically, assume cluster numbers $k^{(t+1)}$ and $k^{(t)}$ at time steps $t+1$ and $t$, respectively. If $k^{(t)} < k^{(t+1)}$, then the extra clusters at time $t+1$ must be computed using only the current data, and we pad $U^{(t)}$ with zero columns for the extra clusters, i.e., $U^{(t)} = [U^{(t)}, 0_{c \times (k^{(t+1)} - k^{(t)})}]$. Otherwise, $k^{(t)} > k^{(t+1)}$ and we simply remove the deleted cluster weight vectors in $U^{(t)}$.
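For reference, a small sketch that evaluates the objective in (5) for a given set of factor matrices is shown below; it is useful for monitoring convergence of the updates derived in the next subsection. The variable names and the assumed shapes (chosen so that $CUR^T$ matches $A$) are illustrative assumptions of the sketch, not notation fixed by the paper.

```python
import numpy as np

def ecnmf_objective(A1, A0, C1, U1, R1, C0, U0, L_C1, L_R1, alpha, lam, mu):
    """Value of the EC-NMF objective in (5).

    A1, A0     : adjacency matrices at time steps t+1 and t
    C1, U1, R1 : current factors; C0, U0: factors kept from time step t
    L_C1, L_R1 : feature- and data-graph Laplacians at time step t+1
    """
    snapshot_cost  = alpha * np.linalg.norm(A1 - C1 @ U1 @ R1.T, 'fro') ** 2
    temporal_cost  = (1 - alpha) * np.linalg.norm(A0 - C0 @ U0 @ R1.T, 'fro') ** 2
    feature_smooth = lam * np.trace(C1.T @ L_C1 @ C1)
    data_smooth    = mu * np.trace(R1.T @ L_R1 @ R1)
    return snapshot_cost + temporal_cost + feature_smooth + data_smooth
```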

4.2 Optimization of EC-NMF

As can be seen, the objective function in (5) is minimized with respect to $C^{(t+1)}$, $U^{(t+1)}$, and $R^{(t+1)}$, and it is unrealistic to expect an algorithm to find the global minimum. In the following, we introduce an alternating scheme to optimize the objective function, which can achieve a local minimum.

Next, we discuss how to minimize objective function $J$. We optimize the objective function with respect to one variable while fixing the other variables. This iterative procedure repeats until convergence. Using the matrix properties $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$ and $\mathrm{Tr}(A) = \mathrm{Tr}(A^T)$, the objective function $J$ in (5) can be rewritten as:

$$\begin{aligned}
J = {} & \alpha \, \mathrm{Tr}\!\left( \left( A^{(t+1)} - C^{(t+1)} U^{(t+1)} R^{(t+1)T} \right) \left( A^{(t+1)} - C^{(t+1)} U^{(t+1)} R^{(t+1)T} \right)^T \right) \\
& + (1-\alpha) \, \mathrm{Tr}\!\left( \left( A^{(t)} - C^{(t)} U^{(t)} R^{(t+1)T} \right) \left( A^{(t)} - C^{(t)} U^{(t)} R^{(t+1)T} \right)^T \right) \\
& + \lambda \, \mathrm{Tr}\!\left( C^{(t+1)T} L_{C^{(t+1)}} C^{(t+1)} \right) + \mu \, \mathrm{Tr}\!\left( R^{(t+1)T} L_{R^{(t+1)}} R^{(t+1)} \right) \\
= {} & \alpha \left[ \mathrm{Tr}\!\left( A^{(t+1)} A^{(t+1)T} \right) - 2 \mathrm{Tr}\!\left( A^{(t+1)} R^{(t+1)} U^{(t+1)T} C^{(t+1)T} \right) + \mathrm{Tr}\!\left( C^{(t+1)} U^{(t+1)} R^{(t+1)T} R^{(t+1)} U^{(t+1)T} C^{(t+1)T} \right) \right] \\
& + (1-\alpha) \left[ \mathrm{Tr}\!\left( A^{(t)} A^{(t)T} \right) - 2 \mathrm{Tr}\!\left( A^{(t)} R^{(t+1)} U^{(t)T} C^{(t)T} \right) + \mathrm{Tr}\!\left( C^{(t)} U^{(t)} R^{(t+1)T} R^{(t+1)} U^{(t)T} C^{(t)T} \right) \right] \\
& + \lambda \, \mathrm{Tr}\!\left( C^{(t+1)T} L_{C^{(t+1)}} C^{(t+1)} \right) + \mu \, \mathrm{Tr}\!\left( R^{(t+1)T} L_{R^{(t+1)}} R^{(t+1)} \right)
\end{aligned} \qquad (6)$$

To alternately update the entries of $C^{(t+1)}$, $U^{(t+1)}$, and $R^{(t+1)}$, we resort to a Lagrangian function. Let $\psi_{ij}$, $\xi_{kj}$, and $\zeta_{jl}$ be Lagrange multipliers for the constraints $C_{ij}^{(t+1)} \ge 0$, $R_{kj}^{(t+1)} \ge 0$, and $U_{jl}^{(t+1)} \ge 0$, respectively. Then, the Lagrange function $L$ is expressed as

$$\begin{aligned}
L = {} & \alpha \left[ \mathrm{Tr}\!\left( A^{(t+1)} A^{(t+1)T} \right) - 2 \mathrm{Tr}\!\left( A^{(t+1)} R^{(t+1)} U^{(t+1)T} C^{(t+1)T} \right) + \mathrm{Tr}\!\left( C^{(t+1)} U^{(t+1)} R^{(t+1)T} R^{(t+1)} U^{(t+1)T} C^{(t+1)T} \right) \right] \\
& + (1-\alpha) \left[ \mathrm{Tr}\!\left( A^{(t)} A^{(t)T} \right) - 2 \mathrm{Tr}\!\left( A^{(t)} R^{(t+1)} U^{(t)T} C^{(t)T} \right) + \mathrm{Tr}\!\left( C^{(t)} U^{(t)} R^{(t+1)T} R^{(t+1)} U^{(t)T} C^{(t)T} \right) \right] \\
& + \lambda \, \mathrm{Tr}\!\left( C^{(t+1)T} L_{C^{(t+1)}} C^{(t+1)} \right) + \mu \, \mathrm{Tr}\!\left( R^{(t+1)T} L_{R^{(t+1)}} R^{(t+1)} \right) \\
& + \mathrm{Tr}\!\left( \psi C^{(t+1)T} \right) + \mathrm{Tr}\!\left( \xi R^{(t+1)T} \right) + \mathrm{Tr}\!\left( \zeta U^{(t+1)T} \right)
\end{aligned} \qquad (7)$$

Since the derivatives of $L$ with respect to $U^{(t+1)}$, $C^{(t+1)}$, and $R^{(t+1)}$ must be zero, we have

$$\frac{\partial L}{\partial U^{(t+1)}} = -2\alpha \left( C^{(t+1)T} A^{(t+1)} R^{(t+1)} - C^{(t+1)T} C^{(t+1)} U^{(t+1)} R^{(t+1)T} R^{(t+1)} \right) + \zeta, \qquad (8)$$

$$\frac{\partial L}{\partial C^{(t+1)}} = -2\alpha \left( A^{(t+1)} R^{(t+1)} U^{(t+1)T} - C^{(t+1)} U^{(t+1)} R^{(t+1)T} R^{(t+1)} U^{(t+1)T} \right) + 2\lambda L_{C^{(t+1)}} C^{(t+1)} + \psi, \qquad (9)$$

$$\begin{aligned}
\frac{\partial L}{\partial R^{(t+1)}} = {} & -2\alpha \left( A^{(t+1)T} C^{(t+1)} U^{(t+1)} - R^{(t+1)} U^{(t+1)T} C^{(t+1)T} C^{(t+1)} U^{(t+1)} \right) \\
& - 2(1-\alpha) \left( A^{(t)T} C^{(t)} U^{(t)} - R^{(t+1)} U^{(t)T} C^{(t)T} C^{(t)} U^{(t)} \right) + 2\mu L_{R^{(t+1)}} R^{(t+1)} + \xi. \qquad (10)
\end{aligned}$$

Using the Karush-Kuhn-Tucker conditions [30] and letting $\zeta_{jl} U_{jl}^{(t+1)} = 0$, $\psi_{ij} C_{ij}^{(t+1)} = 0$, and $\xi_{kj} R_{kj}^{(t+1)} = 0$, we obtain the following equations for $U^{(t+1)}$, $C^{(t+1)}$, and $R^{(t+1)}$:

$$\left[ -C^{(t+1)T} A^{(t+1)} R^{(t+1)} + C^{(t+1)T} C^{(t+1)} U^{(t+1)} R^{(t+1)T} R^{(t+1)} \right]_{jl} U_{jl}^{(t+1)} = 0, \qquad (11)$$

$$\left[ -\alpha A^{(t+1)} R^{(t+1)} U^{(t+1)T} + \alpha C^{(t+1)} U^{(t+1)} R^{(t+1)T} R^{(t+1)} U^{(t+1)T} + \lambda L_{C^{(t+1)}} C^{(t+1)} \right]_{ij} C_{ij}^{(t+1)} = 0, \qquad (12)$$

$$\begin{aligned}
\Big[ & -\alpha \left( A^{(t+1)T} C^{(t+1)} U^{(t+1)} - R^{(t+1)} U^{(t+1)T} C^{(t+1)T} C^{(t+1)} U^{(t+1)} \right) \\
& - (1-\alpha) \left( A^{(t)T} C^{(t)} U^{(t)} - R^{(t+1)} U^{(t)T} C^{(t)T} C^{(t)} U^{(t)} \right) + \mu L_{R^{(t+1)}} R^{(t+1)} \Big]_{kj} R_{kj}^{(t+1)} = 0. \qquad (13)
\end{aligned}$$

Since $L_{C^{(t+1)}} = D_{C^{(t+1)}} - W_{C^{(t+1)}}$ and $L_{R^{(t+1)}} = D_{R^{(t+1)}} - W_{R^{(t+1)}}$, this leads to the following update rules:

$$U_{jl}^{(t+1)} \leftarrow U_{jl}^{(t+1)} \frac{\left[ C^{(t+1)T} A^{(t+1)} R^{(t+1)} \right]_{jl}}{\left[ C^{(t+1)T} C^{(t+1)} U^{(t+1)} R^{(t+1)T} R^{(t+1)} \right]_{jl}}, \qquad (14)$$

$$C_{ij}^{(t+1)} \leftarrow C_{ij}^{(t+1)} \frac{\left[ \alpha A^{(t+1)} R^{(t+1)} U^{(t+1)T} + \lambda W_{C^{(t+1)}} C^{(t+1)} \right]_{ij}}{\left[ \alpha C^{(t+1)} U^{(t+1)} R^{(t+1)T} R^{(t+1)} U^{(t+1)T} + \lambda D_{C^{(t+1)}} C^{(t+1)} \right]_{ij}}, \qquad (15)$$

$$R_{kj}^{(t+1)} \leftarrow R_{kj}^{(t+1)} \frac{\left[ \alpha A^{(t+1)T} C^{(t+1)} U^{(t+1)} + (1-\alpha) A^{(t)T} C^{(t)} U^{(t)} + \mu W_{R^{(t+1)}} R^{(t+1)} \right]_{kj}}{\left[ \alpha R^{(t+1)} U^{(t+1)T} C^{(t+1)T} C^{(t+1)} U^{(t+1)} + (1-\alpha) R^{(t+1)} U^{(t)T} C^{(t)T} C^{(t)} U^{(t)} + \mu D_{R^{(t+1)}} R^{(t+1)} \right]_{kj}}. \qquad (16)$$

Proofs of the convergence of update rules (14), (15), and (16) are given in Section 5.1.

4.3 Algorithm description

To eliminate linearly dependent columns and construct the subspace used for low-rank approximation from data matrix $A$, a unique and independent subspace $C$ is constructed using Colibri-S and Colibri-D [21, 22]. Further details of this process are given below.

First, we initialize subspace $C_0$ using the sampling method of matrix $A$. Subspace $C$ is initialized as $C = C_0(:, 1)$ and the core matrix $I$ as $I = (C^T C)^{-1}$, where $(C^T C)^{-1}$ is the Moore-Penrose pseudo-inverse of the square matrix $C^T C$. Next, we iteratively test whether a new column in $C_0$ is linearly dependent on the current columns in $C$. If not, this column is appended to $C$ and core matrix $I$ is updated; else, this sample is discarded. This process is iterated until $C$ is obtained by eliminating all redundant columns from $C_0$. In this way, subspace matrix $C$ is obtained. The approximation of the adjacency matrix can be computed as $\bar{A} = C (C^T C)^{-1} C^T A$. Core matrix $I$ satisfies $I = (C^T C)^{-1}$, so $J$ is defined as $J = C^T A$ in the algorithm, and the final approximation of $A$ is $\bar{A} = CIJ$. The approximation error between matrix $A$ and $\bar{A}$ can be computed by (1).
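A rough sketch of this column-selection step is given below, assuming a dense NumPy matrix. Unlike Colibri-S, the sketch recomputes the projection for every candidate column instead of updating the core matrix incrementally, and the sample size and tolerance are illustrative assumptions.

```python
import numpy as np

def build_subspace(A, n_samples=50, tol=1e-8, seed=0):
    """Greedy selection of linearly independent columns of A.

    Returns C (selected columns), I = pinv(C^T C), and J = C^T A, so that the
    low-rank approximation is A_bar = C @ I @ J.
    """
    rng = np.random.default_rng(seed)
    cand = rng.choice(A.shape[1], size=min(n_samples, A.shape[1]), replace=False)
    C0 = A[:, cand]
    cols = [C0[:, 0]]
    for j in range(1, C0.shape[1]):
        c = C0[:, j]
        C = np.column_stack(cols)
        # residual of c after projecting onto the span of the kept columns
        resid = c - C @ (np.linalg.pinv(C) @ c)
        if np.linalg.norm(resid) > tol:   # keep only linearly independent columns
            cols.append(c)                # otherwise the sample is discarded
    C = np.column_stack(cols)
    I = np.linalg.pinv(C.T @ C)           # core matrix
    J = C.T @ A                           # right matrix
    return C, I, J

# approximation error as in (1):
# err = np.linalg.norm(A - C @ I @ J, 'fro') ** 2
```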

In EC-NMF, $A^{(t+1)}$ and $A^{(t)}$ are given, as well as the clustering results at time step $t$, including $\bar{A}^{(t)}$, $C_c^{(t)}$, $U^{(t)}$, and $R^{(t)}$. Now, we need to obtain $U^{(t+1)}$ and $R^{(t+1)}$, as well as the new representative subspace and approximation. First, we use Colibri methods to get $C_c^{(t+1)}$ and $\bar{A}^{(t+1)}$, as well as two disjoint subsets $C_a^{(t+1)}$ and $C_b^{(t+1)}$ ($C_c^{(t+1)} = C_a^{(t+1)} \cup C_b^{(t+1)}$), such that the elements in $C_a^{(t+1)}$ correspond to the selected samples that are unchanged from $t$ to $t+1$, while the items in $C_b^{(t+1)}$ correspond to the samples that changed or were unselected between the two time steps. Then, we obtain a unique and independent subset for $C_b^{(t+1)}$ using Colibri-S. As more columns are added to $C^{(t+1)}$, core matrix $I^{(t+1)}$ is updated simultaneously. Finally, $J^{(t+1)}$ is computed by $J^{(t+1)} = C^{(t+1)T} A^{(t+1)}$.

Algorithm 1 EC-NMF

Input: Matrices $A^{(t+1)}$, $A^{(t)}$, $C^{(t)}$, and $U^{(t)}$; the number of clusters $k^{(t+1)}$; regularization parameters $\lambda$ and $\mu$; and maximum number of iterations $T$
Output: Matrices $R^{(t+1)}$, $U^{(t+1)}$, and $C^{(t+1)}$

1. use the family of Colibri methods to get $C^{(t+1)}$
2. determine cluster number $k^{(t+1)}$
3. initialize $R^{(t+1)}$ and $U^{(t+1)}$ using $R^{(t)}$ and $U^{(t)}$, respectively
4. if we need to form a new partition
5.   go to 8
6. else
7.   let $R^{(t+1)} = R^{(t)}$ and $U^{(t+1)} = U^{(t)}$, and return
8. while not converging and the iteration count $\le T$ do
9.   update $U^{(t+1)}$ by (14)
10.  update $C^{(t+1)}$ by (15)
11.  update $R^{(t+1)}$ by (16)
12. end while
13. return $U^{(t+1)}$, $C^{(t+1)}$, and $R^{(t+1)}$

Algorithm 1 summarizes the details of the iterative updating algorithm for (5). Applying the update rules in (14), (15), and (16) until convergence, we obtain a new $U^{(t+1)}$, $C^{(t+1)}$, and $R^{(t+1)}$. To improve the clustering efficiency, the variable factors are initialized using the previous clustering results instead of random values. The computational cost of the proposed EC-NMF algorithm is discussed in Section 5.2.
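The following sketch shows how the inner loop of Algorithm 1 (lines 8-12) might look with the multiplicative rules (14)-(16) written in NumPy. The matrix shapes, the warm-started inputs, and the small constant guarding against division by zero are assumptions of this example rather than details fixed by the paper.

```python
import numpy as np

def ecnmf_iterate(A1, A0, C0, U0, W_C1, W_R1, C1, U1, R1,
                  alpha=0.6, lam=1.0, mu=1.0, max_iter=100, eps=1e-9):
    """Inner loop of Algorithm 1: multiplicative updates (14)-(16).

    Assumed shapes: A1, A0 ~ (m, n); C1 ~ (m, c1), U1 ~ (c1, k), R1 ~ (n, k);
    C0 ~ (m, c0), U0 ~ (c0, k); W_C1 ~ (m, m) feature graph, W_R1 ~ (n, n) data graph.
    C1, U1, R1 are warm-started from the previous time step (Algorithm 1, line 3).
    """
    D_C1 = np.diag(W_C1.sum(axis=1))   # degree matrices of the two graphs
    D_R1 = np.diag(W_R1.sum(axis=1))
    for _ in range(max_iter):
        # (14): weight matrix U^(t+1)
        U1 = U1 * (C1.T @ A1 @ R1) / (C1.T @ C1 @ U1 @ R1.T @ R1 + eps)
        # (15): representative column matrix C^(t+1)
        num = alpha * (A1 @ R1 @ U1.T) + lam * (W_C1 @ C1)
        den = alpha * (C1 @ U1 @ R1.T @ R1 @ U1.T) + lam * (D_C1 @ C1) + eps
        C1 = C1 * num / den
        # (16): cluster indicator R^(t+1), including the historical term from time t
        num = (alpha * (A1.T @ C1 @ U1) + (1 - alpha) * (A0.T @ C0 @ U0)
               + mu * (W_R1 @ R1))
        den = (alpha * (R1 @ U1.T @ C1.T @ C1 @ U1)
               + (1 - alpha) * (R1 @ U0.T @ C0.T @ C0 @ U0)
               + mu * (D_R1 @ R1) + eps)
        R1 = R1 * num / den
    return C1, U1, R1  # stream i is assigned to cluster R1[i].argmax()
```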

5 Theoretical analysis

5.1 Convergence

In this section, we investigate the convergence and correctness of the objective function in (5) under update rules (14), (15), and (16). Regarding these three update rules, we have the following theorem.


Theorem 5.1 For $A^{(t+1)}, C^{(t+1)}, U^{(t+1)}, R^{(t+1)} \ge 0$, the update formulas for $U^{(t+1)}$, $C^{(t+1)}$, and $R^{(t+1)}$ given in (14), (15), and (16), respectively, monotonically decrease the objective function in (5).

The proofs are provided with the aid of auxiliary functions. Since $U^{(t+1)}$ is related only to the first term of the objective function in (5), we can use a 3-factor NMF argument to show that (5) is monotonically decreasing under the update rule in (14) [15]. Next, we prove convergence under the update rule for $R^{(t+1)}$ in (16).

Definition 5.1 $Z(u, u')$ is an auxiliary function for $J(u)$ if the following two conditions are satisfied: $Z(u, u') \ge J(u)$ and $Z(u, u) = J(u)$.

Lemma 5.1 For any $u$ and $u'$, if $Z$ is an auxiliary function, $J$ is monotonically decreasing under the formula

$$u^{t+1} = \arg\min_u Z(u, u^t). \qquad (17)$$

Proof $J(u^{t+1}) \le Z(u^{t+1}, u^t) \le Z(u^t, u^t) = J(u^t)$. Thus, $Z(u^{t+1}, u^t) \le Z(u^t, u^t)$ and $J(u^{t+1}) \le J(u^t)$, which completes the proof.

Next, we show that the update rule for $R^{(t+1)}$ in (16) is exactly the same as the update in (17) with a proper auxiliary function. Considering any element $R_{kj}^{(t+1)}$ in $R^{(t+1)}$, we let

$$J\!\left( R^{(t+1)} \right) = \alpha \left\| A^{(t+1)} - C^{(t+1)} U^{(t+1)} R^{(t+1)T} \right\|_F^2 + (1-\alpha) \left\| A^{(t)} - C^{(t)} U^{(t)} R^{(t+1)T} \right\|_F^2 + \mu \, \mathrm{Tr}\!\left( R^{(t+1)T} L_{R^{(t+1)}} R^{(t+1)} \right).$$

Then, we get

$$\begin{aligned}
J\!\left( R^{(t+1)} \right)'_{kj} = \left[ \frac{\partial J(R^{(t+1)})}{\partial R^{(t+1)}} \right]_{kj} = \Big[ & -2\alpha \left( A^{(t+1)T} C^{(t+1)} U^{(t+1)} - R^{(t+1)} U^{(t+1)T} C^{(t+1)T} C^{(t+1)} U^{(t+1)} \right) \\
& - 2(1-\alpha) \left( A^{(t)T} C^{(t)} U^{(t)} - R^{(t+1)} U^{(t)T} C^{(t)T} C^{(t)} U^{(t)} \right) + 2\mu L_{R^{(t+1)}} R^{(t+1)} \Big]_{kj}
\end{aligned}$$

and

$$J\!\left( R^{(t+1)} \right)''_{kj} = \left[ 2\alpha U^{(t+1)T} C^{(t+1)T} C^{(t+1)} U^{(t+1)} + 2(1-\alpha) U^{(t)T} C^{(t)T} C^{(t)} U^{(t)} + 2\mu L_{R^{(t+1)}} \right]_{kj}.$$

Since our update rule operates element-wise, it is sufficient to show that each $J(R^{(t+1)})_{kj}$ is non-increasing under the update formula in (16).

Lemma 5.2 The function

$$\begin{aligned}
Z\!\left( R_{kj}^{(t+1)}, R_{kj}^{(t+1)\prime} \right) = {} & J_{kj}\!\left( R_{kj}^{(t+1)\prime} \right) + J'_{kj}\!\left( R_{kj}^{(t+1)\prime} \right) \left( R_{kj}^{(t+1)} - R_{kj}^{(t+1)\prime} \right) \\
& + \frac{\left[ \alpha R^{(t+1)} U^{(t+1)T} C^{(t+1)T} C^{(t+1)} U^{(t+1)} + (1-\alpha) R^{(t+1)} U^{(t)T} C^{(t)T} C^{(t)} U^{(t)} + \mu D_{R^{(t+1)}} R^{(t+1)} \right]_{kj}}{R_{kj}^{(t+1)\prime}} \left( R_{kj}^{(t+1)} - R_{kj}^{(t+1)\prime} \right)^2
\end{aligned} \qquad (18)$$

is an auxiliary function for $J_{kj}\!\left( R_{kj}^{(t+1)} \right)$.

Proof We first get the Taylor series expansion of $J_{kj}\!\left( R_{kj}^{(t+1)} \right)$:

$$\begin{aligned}
J_{kj}\!\left( R_{kj}^{(t+1)} \right) = {} & J_{kj}\!\left( R_{kj}^{(t+1)\prime} \right) + J'_{kj}\!\left( R_{kj}^{(t+1)\prime} \right) \left( R_{kj}^{(t+1)} - R_{kj}^{(t+1)\prime} \right) \\
& + \left\{ \left[ \alpha U^{(t+1)T} C^{(t+1)T} C^{(t+1)} U^{(t+1)} + (1-\alpha) U^{(t)T} C^{(t)T} C^{(t)} U^{(t)} \right]_{jj} + \mu \left[ L_{R^{(t+1)}} \right]_{kk} \right\} \left( R_{kj}^{(t+1)} - R_{kj}^{(t+1)\prime} \right)^2.
\end{aligned}$$

Using (18), showing that $Z\!\left( R_{kj}^{(t+1)}, R_{kj}^{(t+1)\prime} \right) \ge J_{kj}\!\left( R_{kj}^{(t+1)} \right)$ is equivalent to showing that

$$\frac{\left[ \alpha R^{(t+1)} U^{(t+1)T} C^{(t+1)T} C^{(t+1)} U^{(t+1)} + (1-\alpha) R^{(t+1)} U^{(t)T} C^{(t)T} C^{(t)} U^{(t)} + \mu D_{R^{(t+1)}} R^{(t+1)} \right]_{kj}}{R_{kj}^{(t+1)\prime}} \ge \left[ \alpha U^{(t+1)T} C^{(t+1)T} C^{(t+1)} U^{(t+1)} + (1-\alpha) U^{(t)T} C^{(t)T} C^{(t)} U^{(t)} \right]_{jj} + \mu \left[ L_{R^{(t+1)}} \right]_{kk}. \qquad (19)$$

We have

$$\begin{aligned}
\left[ \alpha R^{(t+1)} U^{(t+1)T} C^{(t+1)T} C^{(t+1)} U^{(t+1)} + (1-\alpha) R^{(t+1)} U^{(t)T} C^{(t)T} C^{(t)} U^{(t)} \right]_{kj} & = \sum_{l}^{K} R_{kl}^{(t+1)\prime} \left[ \alpha U^{(t+1)T} C^{(t+1)T} C^{(t+1)} U^{(t+1)} + (1-\alpha) U^{(t)T} C^{(t)T} C^{(t)} U^{(t)} \right]_{lj} \\
& \ge R_{kj}^{(t+1)\prime} \left[ \alpha U^{(t+1)T} C^{(t+1)T} C^{(t+1)} U^{(t+1)} + (1-\alpha) U^{(t)T} C^{(t)T} C^{(t)} U^{(t)} \right]_{jj}
\end{aligned}$$

and

$$\left[ \mu D_{R^{(t+1)}} R^{(t+1)} \right]_{kj} = \mu \sum_{l}^{M} \left[ D_{R^{(t+1)}} \right]_{kl} R_{lj}^{(t+1)\prime} \ge \mu \left[ D_{R^{(t+1)}} \right]_{kk} R_{kj}^{(t+1)\prime} \ge \mu \left[ D_{R^{(t+1)}} - W_{R^{(t+1)}} \right]_{kk} R_{kj}^{(t+1)\prime} = \mu \left[ L_{R^{(t+1)}} \right]_{kk} R_{kj}^{(t+1)\prime}.$$

Thus, (19) holds and $Z\!\left( R_{kj}^{(t+1)}, R_{kj}^{(t+1)\prime} \right) \ge J_{kj}\!\left( R_{kj}^{(t+1)} \right)$. Furthermore, $Z\!\left( R_{kj}^{(t+1)}, R_{kj}^{(t+1)} \right) = J_{kj}\!\left( R_{kj}^{(t+1)} \right)$ is obvious.

Proof of Theorem 5.1 Replacing $Z\!\left( R_{kj}^{(t+1)}, R_{kj}^{(t+1)\prime} \right)$ in (17) by (18), we get

$$\begin{aligned}
R_{kj}^{(t+1)} & = R_{kj}^{(t+1)\prime} - R_{kj}^{(t+1)\prime} \frac{J'_{kj}\!\left( R_{kj}^{(t+1)\prime} \right)}{2 \left[ \alpha R^{(t+1)} U^{(t+1)T} C^{(t+1)T} C^{(t+1)} U^{(t+1)} + (1-\alpha) R^{(t+1)} U^{(t)T} C^{(t)T} C^{(t)} U^{(t)} + \mu D_{R^{(t+1)}} R^{(t+1)} \right]_{kj}} \\
& = R_{kj}^{(t+1)\prime} \frac{\left[ \alpha A^{(t+1)T} C^{(t+1)} U^{(t+1)} + (1-\alpha) A^{(t)T} C^{(t)} U^{(t)} + \mu W_{R^{(t+1)}} R^{(t+1)} \right]_{kj}}{\left[ \alpha R^{(t+1)} U^{(t+1)T} C^{(t+1)T} C^{(t+1)} U^{(t+1)} + (1-\alpha) R^{(t+1)} U^{(t)T} C^{(t)T} C^{(t)} U^{(t)} + \mu D_{R^{(t+1)}} R^{(t+1)} \right]_{kj}}.
\end{aligned}$$

Since (18) is an auxiliary function for $J_{kj}\!\left( R_{kj}^{(t+1)} \right)$, $J_{kj}\!\left( R_{kj}^{(t+1)} \right)$ is non-increasing under the update rule in (16).

Note that since $C^{(t+1)}$ is related to the first and third terms, we can prove analogous convergence under the update rule for $C^{(t+1)}$ in (15).

5.2 Computational complexity

To compute the complexity of the EC-NMF algorithm, we need to inspect its main operations, namely, the matrix decomposition, constructing the $\varepsilon$-nearest neighbor graphs, and the multiplicative updates.

Each $U_{jl}^{(t+1)}$ and $R_{kj}^{(t+1)}$ is multiplied by an update factor, which involves several matrix multiplications. The computational complexity of EC-NMF is of the order $3c_1 m_1 k + m_1 n_1 k + n k^2 + c_1 k^2$ for (14) and of the order $3c_1 m_1 k + m_1 n_1 k + 2 n_1 k^2 + c_1 k^2 + 3 c_0 m_0 k + m_0 n_0 k + c_0 k^2$ for (16), where $k$ is the cluster number at time step $t+1$, $c_1$ and $c_0$ are the sizes of the data subspaces at time steps $t+1$ and $t$, respectively, and $n_1$ and $n_0$ are the sizes of the input matrices at time steps $t+1$ and $t$, respectively. Suppose the multiplicative update stops after $T$ iterations; then the overall cost of the updates is $O(T(cmk + mnk + nk^2 + ck^2))$, where $n = \max(n_0, n_1)$ and $c = \max(c_0, c_1)$. EC-NMF also needs $O(n^2 m + n m^2)$ time to construct the $\varepsilon$-nearest neighbor data graph and feature graph. Based on the above analysis, the overall cost of EC-NMF is $O(T(cmk + mnk + nk^2 + ck^2) + n^2 m + n m^2)$.

6 Experiments and results

In this section, we use several synthetic data sets and real world data sets to evaluate the efficiency and effectiveness of the EC-NMF algorithm.


6.1 Baseline algorithms and evaluation methods

To demonstrate the efficiency of EC-NMF, we compare it with the following three popular clustering algorithms.

The first is K-means clustering in the principal component analysis (PCA) subspace. Mathematically, PCA is equivalent to performing singular value decomposition (SVD) on the centered data matrix. In this paper, we simply use SVD instead of PCA because the centered data matrix is too large to fit into memory. The other two algorithms are the typical spectral clustering algorithm normalized cut (Ncut) [31] and NMF-based clustering.

To evaluate clustering quality, all our comparisons are based on clustering accuracy (Acc) and normalized mutual information (NMI) measurements.

Clustering accuracy (Acc) discovers the one-to-one relationships between clusters and classes and measures the extent to which each cluster contains data points from the corresponding class. It is defined as [18, 19]

$$Acc = \frac{\sum_{i=1}^{n} \delta(map(r_i), l_i)}{n}, \qquad (20)$$

where $r_i$ denotes the cluster label, $l_i$ denotes the true class label, $n$ is the total number of documents, $\delta(x, y)$ is the delta function that is equal to one if $x = y$ and zero otherwise, and $map(r_i)$ is the permutation mapping function that maps each cluster label $r_i$ to the equivalent label from the data set.

Between two random variables $i$ (category label) and $j$ (cluster label), NMI is defined as [32]:

$$NMI = \frac{\sum_{i=1}^{k^{(a)}} \sum_{j=1}^{k^{(b)}} n_{i,j} \log\left( \frac{n \cdot n_{i,j}}{n_i \cdot n_j} \right)}{\sqrt{\left( \sum_{i=1}^{k^{(a)}} n_i \log \frac{n_i}{n} \right) \left( \sum_{j=1}^{k^{(b)}} n_j \log \frac{n_j}{n} \right)}}, \qquad (21)$$

where $n$ is the number of documents, $n_i$ and $n_j$ denote the number of documents in category $i$ and cluster $j$, respectively, $n_{i,j}$ denotes the number of documents in category $i$ as well as in cluster $j$, and $k^{(a)}$ and $k^{(b)}$ are the cluster numbers in the true and predicted clusterings, respectively. The NMI score is 1 if the clustering results match the category labels perfectly, whereas the score is 0 if the data are randomly partitioned. The higher the NMI score, the better the clustering quality.
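A sketch of how both measures can be computed is given below: Acc uses the Hungarian algorithm (SciPy's linear_sum_assignment) to find the permutation mapping in (20), and NMI is evaluated directly from the contingency counts in (21). The function names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, pred_labels):
    """Acc as in (20): best one-to-one mapping between clusters and classes."""
    true_labels, pred_labels = np.asarray(true_labels), np.asarray(pred_labels)
    classes, clusters = np.unique(true_labels), np.unique(pred_labels)
    cost = np.zeros((len(clusters), len(classes)))
    for a, r in enumerate(clusters):
        for b, l in enumerate(classes):
            cost[a, b] = -np.sum((pred_labels == r) & (true_labels == l))
    row, col = linear_sum_assignment(cost)   # maximize total overlap
    return -cost[row, col].sum() / len(true_labels)

def nmi(true_labels, pred_labels):
    """NMI as in (21), computed from the contingency table."""
    true_labels, pred_labels = np.asarray(true_labels), np.asarray(pred_labels)
    n = len(true_labels)
    num = 0.0
    for i in np.unique(true_labels):
        for j in np.unique(pred_labels):
            nij = np.sum((true_labels == i) & (pred_labels == j))
            ni, nj = np.sum(true_labels == i), np.sum(pred_labels == j)
            if nij > 0:
                num += nij * np.log(n * nij / (ni * nj))
    h_true = sum(np.sum(true_labels == i) * np.log(np.sum(true_labels == i) / n)
                 for i in np.unique(true_labels))
    h_pred = sum(np.sum(pred_labels == j) * np.log(np.sum(pred_labels == j) / n)
                 for j in np.unique(pred_labels))
    return num / np.sqrt(h_true * h_pred)
```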

Since there are no predefined categories in our data, we had to design an alternative way to carry out the evaluation. In this paper, we use the K-means clustering result of the combined feature data as the target labels for feature clustering. Similarly, the K-means clustering result of the combined user data is used as the target labels for user clustering. To randomize the experiments, we conducted the evaluation using different cluster numbers. At each time step, for each given cluster number, 30 test runs were carried out on different randomly chosen clusters.

6.2 Synthetic data sets

For the experiments in this section, we used synthetic data sets to demonstrate that EC-NMF can efficiently cluster data for multiple evolving data streams over time. The synthetic data sets were generated by applying the prototype system used in [5], which is defined as $f(t + \Delta t) = f(t) + f'(t + \Delta t)$ and $f'(t + \Delta t) = f'(t) + u(t)$, where $f(\cdot)$ is a stochastic process and the $u(t)$, $t = 0, \Delta t, 2\Delta t, \cdots$, are independent random variables uniformly distributed in the interval $[-a, a]$. Thus, data stream $S(\cdot)$ can be defined as $S(t) = f(t + h(t)) + g(t)$, where $h(\cdot)$ and $g(\cdot)$ are stochastic processes generated in the same way as the prototype $f(\cdot)$. Constant $a$ determines the smoothness of a data stream, and can be different for $f(\cdot)$, $h(\cdot)$, and $g(\cdot)$, for example, 0.2, 0.4, and 0.5, respectively.
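A possible NumPy implementation of this generator is sketched below. Evaluating f at the warped times t + h(t) by linear interpolation (with clipping at the ends) is an assumption of the sketch, as are the default smoothness constants and the unit step size.

```python
import numpy as np

def prototype(length, a, rng):
    """Second-order random walk: f(t+dt) = f(t) + f'(t+dt), f'(t+dt) = f'(t) + u(t)."""
    u = rng.uniform(-a, a, size=length)   # u(t) ~ U[-a, a]
    fprime = np.cumsum(u)                 # f'(t)
    return np.cumsum(fprime)              # f(t)

def make_stream(length, a_f=0.2, a_h=0.4, a_g=0.5, rng=None):
    """S(t) = f(t + h(t)) + g(t), with f, h, g generated as prototypes."""
    rng = rng or np.random.default_rng()
    f = prototype(length, a_f, rng)
    h = prototype(length, a_h, rng)
    g = prototype(length, a_g, rng)
    t = np.arange(length)
    # evaluate f at the warped times t + h(t) by interpolation (clipped at the ends)
    return np.interp(t + h, t, f) + g

# e.g., 200 streams of 1000 points each, as used in the experiments below
streams = np.vstack([make_stream(1000) for _ in range(200)])
```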

To examine the quality of EC-NMF for clustering multiple data streams, we generated random data sets with 200 streams, each containing 1000 points. For each method, the length of a block was fixed at 100 points. The number of clusters in the data varied from 2 to 10.

To randomize the experiments, we conducted the evaluation with different numbers of time steps. For different values of α, the weight of the historical results varied; hence, different clustering results were obtained. Note that α = 1 means no temporal smoothness is considered. A higher value of α indicates less incorporation of historical knowledge into the clustering. The clustering results showing the effect of parameter α are given in Tables 2 and 3, with the best performance for each data set highlighted. As can be seen in these tables, when α = 0.6, EC-NMF achieves the highest performance in terms of Acc and NMI. Hence, in subsequent experiments, we compared the four methods with respect to Acc and NMI with α = 0.6.

To evaluate the performance of EC-NMF, we compared the four methods with α = 0.6 using the synthetic data sets. For each given cluster number k, the mean and standard error of the performance of each of these clustering algorithms on the synthetic data sets are given in Tables 4 and 5, with the best performance for each data set highlighted. Regardless of the number of clusters, EC-NMF performs better than the other three algorithms in terms of Acc and NMI. EC-NMF is much more effective than the other three methods in evolutionary clustering because it incorporates the previous clustering results and considers the geometric structure of the data.

Next, we evaluated the effect of window size and the number of data streams on the response time of the EC-NMF algorithm.

The following series of experiments evaluates the effect of window size on the execution time of these algorithms. In this experiment, window size w(t) varies from 30 to 360, the number of data streams is fixed at 200, each data stream contains 1000 points, and a data set contains four clusters.


Table 2 Acc measurements for EC-NMF with α ranging from 0.2 to 1 at each time step using synthetic data sets

Time step α = 0.2 α = 0.4 α = 0.6 α = 0.8 α = 1

1 0.497 ± 0.029 0.551 ± 0.032 0.861 ± 0.109 0.543 ± 0.089 0.498 ± 0.009

2 0.518 ± 0.017 0.572 ± 0.123 0.802 ± 0.201 0.538 ± 0.001 0.410 ± 0.308

3 0.551 ± 0.142 0.667 ± 0.019 0.813 ± 0.039 0.661 ± 0.029 0.493 ± 0.021

4 0.578 ± 0.091 0.611 ± 0.153 0.798 ± 0.045 0.548 ± 0.123 0.521 ± 0.073

5 0.509 ± 0.023 0.598 ± 0.050 0.702 ± 0.054 0.697 ± 0.065 0.519 ± 0.048

6 0.573 ± 0.211 0.612 ± 0.029 0.831 ± 0.057 0.518 ± 0.037 0.453 ± 0.039

7 0.556 ± 0.021 0.591 ± 0.091 0.702 ± 0.073 0.675 ± 0.098 0.500 ± 0.026

8 0.502 ± 0.013 0.601 ± 0.018 0.776 ± 0.078 0.573 ± 0.017 0.410 ± 0.062

9 0.519 ± 0.109 0.621 ± 0.031 0.802 ± 0.091 0.518 ± 0.083 0.479 ± 0.067

10 0.577 ± 0.128 0.662 ± 0.075 0.811 ± 0.021 0.536 ± 0.047 0.479 ± 0.088

average 0.538 0.609 0.790 0.581 0.476

Table 3 NMI measurements for EC-NMF with α ranging from 0.2 to 1 at each time step using synthetic data sets

Time step α = 0.2 α = 0.4 α = 0.6 α = 0.8 α = 1

1 0.429 ± 0.031 0.545 ± 0.023 0.763 ± 0.091 0.681 ± 0.025 0.541 ± 0.089

2 0.419 ± 0.097 0.539 ± 0.039 0.781 ± 0.002 0.641 ± 0.192 0.530 ± 0.011

3 0.397 ± 0.035 0.668 ± 0.020 0.800 ± 0.039 0.641 ± 0.006 0.663 ± 0.032

4 0.434 ± 0.048 0.546 ± 0.009 0.793 ± 0.050 0.581 ± 0.010 0.543 ± 0.015

5 0.413 ± 0.031 0.699 ± 0.192 0.782 ± 0.102 0.642 ± 0.070 0.691 ± 0.098

6 0.456 ± 0.073 0.510 ± 0.023 0.812 ± 0.009 0.641 ± 0.281 0.518 ± 0.044

7 0.397 ± 0.099 0.671 ± 0.045 0.783 ± 0.003 0.689 ± 0.000 0.676 ± 0.009

8 0.415 ± 0.010 0.575 ± 0.037 0.766 ± 0.098 0.642 ± 0.0034 0.573 ± 0.060

9 0.465 ± 0.023 0.513 ± 0.089 0.778 ± 0.023 0.641 ± 0.116 0.518 ± 0.011

10 0.478 ± 0.001 0.532 ± 0.031 0.809 ± 0.056 0.680 ± 0.004 0.538 ± 0.046

average 0.430 0.580 0.787 0.648 0.579

Table 4 Comparison of Acc results for the four methods based on synthetic data sets

k K-means Ncut NMF EC-NMF

2 0.281 ± 0.031 0.341 ± 0.034 0.281 ± 0.213 0.365 ± 0.045

5 0.290 ± 0.209 0.372 ± 0.030 0.316 ± 0.051 0.802 ± 0.009

8 0.316 ± 0.090 0.376 ± 0.020 0.396 ± 0.037 0.814 ± 0.011

10 0.305 ± 0.030 0.381 ± 0.201 0.410 ± 0.077 0.800 ± 0.037

12 0.329 ± 0.007 0.370 ± 0.001 0.341 ± 0.029 0.790 ± 0.078

15 0.341 ± 0.045 0.363 ± 0.001 0.359 ± 0.031 0.834 ± 0.092

18 0.359 ± 0.071 0.360 ± 0.003 0.290 ± 0.010 0.790 ± 0.023

20 0.313 ± 0.005 0.359 ± 0.056 0.315 ± 0.004 0.776 ± 0.055

25 0.298 ± 0.023 0.396 ± 0.007 0.376 ± 0.000 0.801 ± 0.041

30 0.293 ± 0.057 0.376 ± 0.092 0.376 ± 0.018 0.812 ± 0.029

average 0.312 0.369 0.346 0.758


Table 5 Comparison of NMI results for the four methods based on synthetic data sets

k K-means Ncut NMF EC-NMF

2 0.381 ± 0.011 0.326 ± 0.028 0.363 ± 0.063 0.377 ± 0.024

5 0.283 ± 0.033 0.342 ± 0.032 0.316 ± 0.087 0.781 ± 0.033

8 0.232 ± 0.001 0.370 ± 0.015 0.373 ± 0.033 0.800 ± 0.056

10 0.312 ± 0.010 0.326 ± 0.032 0.363 ± 0.052 0.793 ± 0.089

12 0.283 ± 0.032 0.342 ± 0.012 0.316 ± 0.040 0.782 ± 0.053

15 0.232 ± 0.009 0.370 ± 0.006 0.373 ± 0.099 0.812 ± 0.006

18 0.281 ± 0.055 0.326 ± 0.009 0.363 ± 0.021 0.783 ± 0.010

20 0.283 ± 0.017 0.342 ± 0.005 0.396 ± 0.037 0.766 ± 0.017

25 0.332 ± 0.043 0.370 ± 0.044 0.317 ± 0.017 0.778 ± 0.035

30 0.285 ± 0.057 0.326 ± 0.011 0.360 ± 0.023 0.809 ± 0.074

average 0.290 0.344 0.354 0.748

Fig. 1 Average processing time with window size varying from 30 to 360

Fig. 2 Average processing time with the number of streams varying from 50 to 1000


Figure 1 shows that as the window size increases, the processing time of each algorithm also increases. The y-axis shows the execution time, while the x-axis represents window size w(t), which varies from 30 to 360. Although the time increases for all these algorithms, the EC-NMF algorithm greatly outperforms the others.

In Fig. 2, the y-axis shows the execution time, while the x-axis represents the number of data streams, which varies from 100 to 1000. The other parameters are fixed as follows: w(t) = 120 and k = 4. As the number of streams increases, the execution time of all these methods increases. However, the execution efficiency of EC-NMF for clustering multiple data streams in the approximate subspace is higher than that of K-means, Ncut, and NMF.

6.3 Real world data sets

In this section, we discuss the three real data sets used to evaluate the EC-NMF algorithm.

The first data set contained weather data obtained from the Dayton¹ database. Since 1995, average daily temperatures in 290 cities around the world have been collected. Each city is regarded as a data stream, and having extracted records from January 1995 to November 2004, we regarded each year as a time step with dimension 290 (the number of cities) and length 365. The aim of clustering this data set was to identify groups of similar data streams on the same continent and belonging to the same temperature zone.

The second data set contained transportation data, obtained from the PeMS² database. With sensor devices installed on road networks, a system monitors traffic flows on major US highways 24 hours a day, 7 days a week, thus acquiring huge volumes of data. We selected 15 months of daily data records giving the occupancy rates (between 0 and 1) of different car lanes on freeways in the San Francisco bay area over time. The data set covered records sampled every 10 min from 1st January 2008 to 30th March 2009. We considered each day in this database as a single time series with dimension 963 (the number of sensors that functioned consistently throughout the study period) and length 6 × 24 = 144. Data for public holidays were removed from the data set, as well as data collected on two days with anomalies (8th and 9th March 2008), where all sensors were muted between 2:00 and 3:00 am. Ultimately, the data set consisted of 440 time series.

The experiment on this data set aimed to classify each observed day as the correct day of the week, from Monday to Sunday, labeling it with an integer in {1, 2, 3, 4, 5, 6, 7}. Each line describes a time series provided as a matrix. Each matrix describes the different occupancy rates (963 lines, one for each station/detector) sampled every 10 min during the day (144 columns). At a given timestamp, each attribute describes the measurement of the occupancy rate (between 0 and 1) of a captor location as recorded by a measuring station. There are 963 (stations) × 144 (timestamps) = 138,672 attributes for each record.

¹ http://www.engr.udayton.edu/weather/
² http://pems.dot.ca.gov/

Table 6 Overview of the real world data sets

Data sets    Number of samples n    Number of features m    Number of classes k
Dayton       290                    365                     5
PeMS         440                    138672                  7
Digg         10000                  6                       56

The third data set contained Internet data collected on Digg,³ a popular online social media website. We extracted tuples with timestamps from 1st to 30th August 2008. We sampled 10,000 Digg users who had bookmarked URLs on the following three topics: politics, computer science, and natural science. Included in this data set were stories, users and their actions (submit, dig, comment and reply) with respect to the stories, as well as explicit friendship (contact) relations among these users. To study data evolution, we segmented the duration into 10 time slots (each three days long), and constructed a user-user connectivity matrix from the user-URL matrix. Two users were linked if they had at least 5 URLs in common. To analyze users' topical interests, we also retrieved the topics of the stories and extracted keywords from the story titles.
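The user-user connectivity matrix can be built from a binary user-URL matrix as in the following sketch; the function name and the dense NumPy representation are assumptions, while the threshold of 5 shared URLs follows the description above.

```python
import numpy as np

def user_connectivity(user_url, min_common=5):
    """Binary user-user adjacency: 1 if two users share >= min_common URLs.

    user_url: (n_users, n_urls) 0/1 matrix of bookmark actions in one time slot.
    """
    common = user_url @ user_url.T            # URLs shared by each pair of users
    A = (common >= min_common).astype(float)  # link users with enough overlap
    np.fill_diagonal(A, 0.0)                  # no self-links
    return A
```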

The basic information of the three data sets is summarized in Table 6.

As shown in Tables 2 and 3, EC-NMF achieved the highest NMI values on the synthetic data sets with α = 0.6. In the following, we evaluate the performance of EC-NMF on the real world data sets with α = 0.6.

The clustering results of all four clustering algorithms on the Dayton data set are shown in Tables 7 and 8, with the best performance for each data set highlighted. From Tables 7 and 8, we observe that the EC-NMF algorithm performs better than the other three algorithms for most time steps, except t = 1. The reason for this is that EC-NMF integrates more historical knowledge, such as the last approximation results and last clustering results, and is thus more effective than the other three methods.

Next, we discuss the experiments on the PeMS data set. For each given cluster number k, the clustering results of all these algorithms on the PeMS data set are shown in Tables 9 and 10. Ncut, NMF, and the proposed EC-NMF algorithm achieved better performance than the K-means algorithm, since these three algorithms consider the geometric structure information contained in the data.

³ http://www.digg.com/


Table 7 Comparison of the Acc results for the four methods on Dayton data set with α = 0.6

Time step K-means Ncut NMF EC-NMF

1 0.311 ± 0.009 0.407 ± 0.012 0.449 ± 0.034 0.530 ± 0.001

2 0.327 ± 0.074 0.389 ± 0.042 0.419 ± 0.003 0.727 ± 0.029

3 0.344 ± 0.052 0.359 ± 0.070 0.496 ± 0.007 0.735 ± 0.107

4 0.328 ± 0.044 0.383 ± 0.068 0.441 ± 0.019 0.726 ± 0.032

5 0.355 ± 0.073 0.395 ± 0.024 0.498 ± 0.045 0.776 ± 0.074

6 0.333 ± 0.006 0.407 ± 0.011 0.480 ± 0.069 0.762 ± 0.023

7 0.351 ± 0.030 0.395 ± 0.058 0.468 ± 0.087 0.731 ± 0.013

8 0.348 ± 0.081 0.383 ± 0.022 0.496 ± 0.093 0.770 ± 0.050

9 0.312 ± 0.059 0.407 ± 0.029 0.455 ± 0.053 0.747 ± 0.027

10 0.337 ± 0.011 0.408 ± 0.001 0.488 ± 0.071 0.786 ± 0.081

average 0.334 0.393 0.469 0.729

Table 8 Comparison of the NMI results for the four methods on Dayton data set with α = 0.6

Time step K-means Ncut NMF EC-NMF

1 0.351 ± 0.132 0.400 ± 0.051 0.481 ± 0.002 0.534 ± 0.077

2 0.300 ± 0.029 0.380 ± 0.073 0.416 ± 0.028 0.659 ± 0.081

3 0.303 ± 0.172 0.352 ± 0.067 0.454 ± 0.013 0.705 ± 0.053

4 0.319 ± 0.090 0.355 ± 0.048 0.428 ± 0.070 0.707 ± 0.029

5 0.323 ± 0.010 0.395 ± 0.082 0.486 ± 0.006 0.753 ± 0.011

6 0.317 ± 0.044 0.408 ± 0.015 0.491 ± 0.044 0.759 ± 0.004

7 0.348 ± 0.085 0.400 ± 0.030 0.465 ± 0.091 0.763 ± 0.063

8 0.326 ± 0.013 0.397 ± 0.035 0.543 ± 0.069 0.762 ± 0.032

9 0.321 ± 0.010 0.396 ± 0.059 0.435 ± 0.037 0.786 ± 0.047

10 0.314 ± 0.134 0.399 ± 0.079 0.446 ± 0.044 0.796 ± 0.035

average 0.322 0.388 0.464 0.722

Table 9 Comparison of the Acc results for the four methods on PeMS data set with α = 0.6

k K-means Ncut NMF EC-NMF

3 0.252 ± 0.008 0.322 ± 0.019 0.410 ± 0.091 0.663 ± 0.098

5 0.309 ± 0.173 0.368 ± 0.041 0.399 ± 0.073 0.681 ± 0.010

8 0.319 ± 0.123 0.390 ± 0.063 0.378 ± 0.029 0.600 ± 0.012

10 0.309 ± 0.092 0.332 ± 0.077 0.342 ± 0.032 0.693 ± 0.054

12 0.299 ± 0.043 0.377 ± 0.023 0.401 ± 0.081 0.682 ± 0.093

15 0.315 ± 0.075 0.309 ± 0.062 0.431 ± 0.005 0.612 ± 0.016

18 0.291 ± 0.023 0.350 ± 0.011 0.379 ± 0.063 0.683 ± 0.031

20 0.305 ± 0.015 0.379 ± 0.029 0.408 ± 0.027 0.666 ± 0.073

average 0.240 0.283 0.315 0.528


Table 10 Comparison of the NMI results for the four methods on PeMS with α = 0.6

k K-means Ncut NMF EC-NMF

3 0.251 ± 0.027 0.321 ± 0.063 0.343 ± 0.021 0.572 ± 0.018

5 0.290 ± 0.010 0.334 ± 0.045 0.401 ± 0.012 0.503 ± 0.005

8 0.310 ± 0.017 0.320 ± 0.052 0.458 ± 0.001 0.589 ± 0.060

10 0.321 ± 0.003 0.318 ± 0.027 0.389 ± 0.009 0.599 ± 0.009

12 0.298 ± 0.024 0.390 ± 0.089 0.401 ± 0.0048 0.510 ± 0.011

15 0.288 ± 0.054 0.288 ± 0.030 0.320 ± 0.0087 0.535 ± 0.087

18 0.293 ± 0.097 0.309 ± 0.010 0.402 ± 0.002 0.600 ± 0.026

20 0.310 ± 0.033 0.313 ± 0.009 0.398 ± 0.032 0.513 ± 0.058

average 0.236 0.259 0.311 0.442

Table 11 Comparison of the Acc results for the four methods on Digg data set with α = 0.6

Time step K-means Ncut NMF EC-NMF

1 0.297 ± 0.042 0.422 ± 0.061 0.459 ± 0.059 0.685 ± 0.092

2 0.307 ± 0.153 0.387 ± 0.029 0.426 ± 0.035 0.730 ± 0.089

3 0.317 ± 0.114 0.374 ± 0.081 0.481 ± 0.026 0.749 ± 0.143

4 0.317 ± 0.072 0.392 ± 0.031 0.442 ± 0.088 0.749 ± 0.014

5 0.344 ± 0.297 0.418 ± 0.074 0.482 ± 0.057 0.783 ± 0.039

6 0.314 ± 0.058 0.414 ± 0.025 0.495 ± 0.011 0.748 ± 0.201

7 0.324 ± 0.139 0.396 ± 0.023 0.504 ± 0.095 0.725 ± 0.019

8 0.333 ± 0.017 0.390 ± 0.008 0.461 ± 0.083 0.753 ± 0.028

9 0.316 ± 0.045 0.411 ± 0.045 0.477 ± 0.060 0.732 ± 0.010

10 0.349 ± 0.067 0.417 ± 0.011 0.520 ± 0.035 0.778 ± 0.007

average 0.322 0.402 0.475 0.743

Table 12 Comparison of the NMI results for the four methods on Digg data set with α = 0.6

Time step K-means Ncut NMF EC-NMF

1 0.338 ± 0.019 0.425 ± 0.081 0.496 ± 0.191 0.628 ± 0.029

2 0.304 ± 0.024 0.388 ± 0.077 0.429 ± 0.015 0.673 ± 0.055

3 0.315 ± 0.001 0.366 ± 0.033 0.442 ± 0.003 0.718 ± 0.081

4 0.304 ± 0.072 0.368 ± 0.035 0.496 ± 0.051 0.726 ± 0.033

5 0.323 ± 0.081 0.404 ± 0.023 0.502 ± 0.047 0.750 ± 0.060

6 0.313 ± 0.178 0.419 ± 0.178 0.527 ± 0.006 0.754 ± 0.013

7 0.353 ± 0.073 0.432 ± 0.091 0.476 ± 0.058 0.788 ± 0.016

8 0.305 ± 0.041 0.429 ± 0.121 0.548 ± 0.021 0.784 ± 0.020

9 0.318 ± 0.024 0.403 ± 0.060 0.468 ± 0.041 0.788 ± 0.142

10 0.322 ± 0.003 0.419 ± 0.018 0.471 ± 0.033 0.792 ± 0.027

average 0.319 0.405 0.485 0.740



Finally, we report the experimental results on the Digg data set. Tables 11 and 12 show the clustering results over a different number of time steps with a varying number of clusters. Of these four algorithms, EC-NMF outperforms the other three in terms of both Acc and NMI, because it simultaneously considers the geometric structure of both the data and feature manifolds and incorporates prior knowledge from historic results with temporal smoothness.

7 Conclusion

Clustering data streams has been extensively studied in various research areas, including social network analysis, sensor networks, stock market trading, and web community analysis. In this paper, we presented a novel method, EC-NMF, for clustering multiple data streams, which simultaneously co-clusters multiple data streams using the geometric structure information contained in data points and features. Furthermore, we developed an iterative updating optimization algorithm. Finally, we discussed a variety of experiments on synthetic and real world data sets carried out to demonstrate the effectiveness of the proposed algorithm. We used clustering accuracy and NMI to measure the quality of the clustering results. On average, EC-NMF achieved better performance than the other methods for clustering multiple data streams evolving over time. In future work, we will extend the EC-NMF method based on semi-supervised NMF for clustering multi-type relational data in real world applications.

Acknowledgments The authors gratefully acknowledge the support provided for this research by the Research Fund for the Doctoral Program of Higher Education of China (Grant No. 20120191110047), the Natural Science Foundation Project of CQ CSTC of China (Grant No. CSTC2012JJB40002), the Engineering Center Research Program of Chongqing of China (Grant No. 2011pt-gc30005), and the Fundamental Research Funds for the Central Universities of China (Grant No. CDJXS10170004).

References

1. Aggarwal CC, Yu P (2005) Online analysis of community evolution in data streams. In: Proceedings of the SIAM international conference on data mining (SDM 2005)
2. Lin Y-R et al (2009) Analyzing communities and their evolutions in dynamic social networks. ACM Trans Knowl Disc Data (TKDD) 3(2):8
3. Newman ME, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69(2):026113
4. Papadopoulos S et al (2012) Community detection in social media. Data Min Knowl Disc 24(3):515–554
5. Beringer J, Hullermeier E (2006) Online clustering of parallel data streams. Data Knowl Eng 58(2):180–204
6. Al Aghbari Z, Kamel I, Awad T (2012) On clustering large number of data streams. Intell Data Anal 16(1):69–91
7. Masoud H, Jalili S, Hasheminejad SMH (2013) Dynamic clustering using combinatorial particle swarm optimization. Appl Intell 38(3):289–314
8. Dai BR et al (2006) Adaptive clustering for multiple evolving streams. IEEE Trans Knowl Data Eng 18(9):1166–1180
9. Yeh MY, Dai BR, Chen MS (2007) Clustering over multiple evolving streams by events and correlations. IEEE Trans Knowl Data Eng 19(10):1349–1362
10. Ning H et al (2010) Incremental spectral clustering by efficiently updating the eigen-system. Pattern Recog 43(1):113–127
11. Wang LJ et al (2012) Low-rank kernel matrix factorization for large-scale evolutionary clustering. IEEE Trans Knowl Data Eng 24(6):1036–1050
12. Mandayam Comar P, Tan P-N, Jain AK (2012) A framework for joint community detection across multiple related networks. Neurocomputing 76(1):93–104
13. Sun J, Xie Y, Zhang H, Faloutsos C (2007) Less is more: compact matrix decomposition for large sparse graphs. In: Proceedings of the 2007 SIAM international conference on data mining (SDM 2007)
14. Sarkar P, Moore AW (2005) Dynamic social network analysis using latent space models. ACM SIGKDD Explor Newsl 7(2):31–40
15. Ding C et al (2006) Orthogonal nonnegative matrix t-factorizations for clustering. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining. ACM
16. Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining. ACM
17. Wu M-L, Chang C-H, Liu R-Z (2013) Co-clustering with augmented matrix. Appl Intell 39(1):153–164
18. Cai D et al (2011) Graph regularized nonnegative matrix factorization for data representation. IEEE Trans Pattern Anal Mach Intell 33(8):1548–1560
19. Gu Q, Zhou J (2009) Co-clustering on manifolds. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM
20. Ding CH, Li T, Jordan MI (2010) Convex and semi-nonnegative matrix factorizations. IEEE Trans Pattern Anal Mach Intell 32(1):45–55
21. Drineas P, Kannan R, Mahoney MW (2006) Fast Monte Carlo algorithms for matrices III: computing a compressed approximate matrix decomposition. SIAM J Comput 36(1):184–206
22. Tong H et al (2008) Colibri: fast mining of large static and dynamic graphs. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM
23. Shang F, Jiao L, Wang F (2012) Graph dual regularization non-negative matrix factorization for co-clustering. Pattern Recog 45(6):2237–2250
24. Seung D, Lee L (2001) Algorithms for non-negative matrix factorization. Adv Neural Inf Process Syst 13:556–562
25. Chung FR (1997) Spectral graph theory, vol 92. AMS Bookstore
26. Belkin M, Niyogi P (2001) Laplacian eigenmaps and spectral techniques for embedding and clustering. In: Advances in neural information processing systems, vol 14. MIT Press, pp 585–591
27. Chapelle O, Scholkopf B, Zien A (2006) Semi-supervised learning, vol 2. MIT Press, Cambridge
28. Cvetkovic D, Rowlinson P (2004) Spectral graph theory. In: Topics in algebraic graph theory. Cambridge University Press, pp 88–112
29. Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788–791
30. Boyd SP, Vandenberghe L (2004) Convex optimization. Cambridge University Press
31. Shi JB, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22(8):888–905
32. Strehl A, Ghosh J (2003) Cluster ensembles - a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617

Chun-Yan Sang is a Ph.D. student at Chongqing University, China. She received her Master's degree in Computer Science from Chongqing University, China, in 2010. Her research interests include data mining, machine learning, and cyber-physical systems.

Di-Hua Sun is a professor in the College of Automation at Chongqing University. He received the Bachelor's degree from Huazhong University of Science and Technology, China, in 1982, and the Master's and Ph.D. degrees from Chongqing University, China, in 1989 and 1997, respectively. His research interests include cyber-physical systems, intelligent transportation systems, computer-based control, data analysis, and decision support.