Clustering over Multiple Evolving Streams by Events and Correlations

Mi-Yen Yeh, Bi-Ru Dai, Ming-Syan Chen

Electrical Engineering, National Taiwan University

IEEE Transactions on Knowledge and Data Engineering (TKDE), 2007

Outline

Introduction

Data Summarization

Similarity Measurement

COMET-CORE Framework

Empirical Studies

Conclusion

Introduction (1)

Good clustering puts similar objects together and separates dissimilar ones into different clusters.

Useful information can be obtained from clusters, for example in:

Data collection in sensor networks

Stock market trades

[Figure: example streams A, B, C, D, E, F, and G]

Introduction (2)

Online data summarization with offline clustering.

Periodical Online Clustering

[Figure: streams A to G and a user; annotations: "Waste!!" and "Lose information!!"]

Introduction (3)

COMET-CORE

Use online piecewise linear segments to approximate the original data.

Update correlations when a stream encounters a new end point.

Update clusters according to the updated correlations.

[Figure: data points and end points of a stream; a new end point triggers an update of the stream correlations]

Data Summarization (1)

Problem Model

$\Gamma = \{S_1, S_2, \ldots, S_n\}$ : the set of data streams

$S_i = S_i[1, \ldots, t, \ldots]$ : the i-th stream

$S_i[t]$ : arriving data of $S_i$ at time $t$

$S_i^{app}[t]$ : approximated data of $S_i$ at time $t$

$\hat{S}_i = \{(S_i[t_{v1}], t_{v1}), \ldots, (S_i[t_{vk}], t_{vk})\}$ : end point summary of stream $S_i$

The objective: given a set of data streams $\Gamma$ and the threshold parameters, monitor the stream clusters online.


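Approximation Line Formulation

For a sub-stream $S_i[t_s, \ldots, t_e]$ with end points $(t_s, S_i[t_s])$ and $(t_e, S_i[t_e])$:

$$S_i^{app}[t] = a t + b, \quad t \in [t_s, t_e]$$

The parameters:

$$a = \frac{S_i[t_e] - S_i[t_s]}{t_e - t_s}$$

$$b = S_i[t_s] - a t_s = S_i[t_e] - a t_e$$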
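The line formulation above can be illustrated with a minimal sketch. The function names are hypothetical and chosen only for this example; the code simply evaluates the slope and intercept of the line through the two end points of a sub-stream.

```python
def line_through_end_points(t_s, v_s, t_e, v_e):
    """Return (a, b) of the line through (t_s, v_s) and (t_e, v_e).

    v_s and v_e stand for S_i[t_s] and S_i[t_e].
    """
    if t_e == t_s:
        raise ValueError("end points must have distinct time indices")
    a = (v_e - v_s) / (t_e - t_s)   # slope
    b = v_s - a * t_s               # intercept (equivalently v_e - a * t_e)
    return a, b


def approximate(a, b, t):
    """Approximated value S_i^app[t] = a * t + b."""
    return a * t + b


a, b = line_through_end_points(2, 1.0, 6, 3.0)
midpoint = approximate(a, b, 4)     # value on the line halfway between the end points
```

By construction the line reproduces both end points exactly, so only the interior points of the sub-stream contribute to the approximation error.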


Error Function

$$error = \sum_{t=t_s}^{t_e} (S_i[t] - S_i^{app}[t])^2 = \sum_{t=t_s}^{t_e} (S_i[t] - a t - b)^2$$

Error Threshold

It may not be easy to choose a proper absolute error threshold.

Use a relative error threshold instead (e.g., 2% of the square sum of the original data stream).
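As a rough illustration of the error function and the relative threshold, the sketch below computes the squared error of the end-point line over a sub-stream and compares it with a fraction of the square sum of the original values. The function names are hypothetical, and taking the square sum over the same sub-stream is an assumption made here; the slides only say "square sum of original data stream".

```python
def segment_error(stream, t_s, t_e):
    """Sum of squared differences between stream[t] and the end-point line on [t_s, t_e]."""
    a = (stream[t_e] - stream[t_s]) / (t_e - t_s)
    b = stream[t_s] - a * t_s
    return sum((stream[t] - (a * t + b)) ** 2 for t in range(t_s, t_e + 1))


def exceeds_relative_threshold(stream, t_s, t_e, ratio=0.02):
    """True if the error is larger than `ratio` times the square sum of the sub-stream.

    Assumption: the square sum is taken over the same sub-stream [t_s, t_e].
    """
    square_sum = sum(stream[t] ** 2 for t in range(t_s, t_e + 1))
    return segment_error(stream, t_s, t_e) > ratio * square_sum


straight = [0.0, 1.0, 2.0, 3.0]     # lies exactly on a line
peak = [0.0, 2.0, 0.0]              # interior point far from the end-point line
flags = (exceeds_relative_threshold(straight, 0, 3),
         exceeds_relative_threshold(peak, 0, 2))
```

A relative threshold of this kind scales with the magnitude of the data, which is the motivation the slide gives for preferring it over an absolute one.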


Online Linear Line Segment Approximation

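$$S_i \rightarrow \hat{S}_i = \{(S_i[t_{v1}], t_{v1}), \ldots, (S_i[t_{vk}], t_{vk})\}$$

While the error of the current segment stays below the threshold $\delta_l$, the segment keeps growing. Once the error exceeds $\delta_l$, a new end point is generated.

[Figure: stream value versus time, with end points at $t_{v1}, \ldots, t_{vk}$]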
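The online segmentation can be sketched as follows. This is only one possible reading of the slide: the function name is hypothetical, an absolute threshold `delta_l` is used for brevity, and placing the new end point at the last point that still fit the segment is an assumption. The error is recomputed from scratch at every step to keep the sketch short.

```python
def online_end_points(stream, delta_l):
    """Return the end point summary [(value, time), ...] of a stream.

    Assumption: when extending the current segment to time t pushes the error
    above delta_l, the previous point t - 1 becomes the new end point.
    """
    end_points = [(stream[0], 0)]
    t_s = 0
    for t in range(2, len(stream)):
        a = (stream[t] - stream[t_s]) / (t - t_s)
        b = stream[t_s] - a * t_s
        error = sum((stream[u] - (a * u + b)) ** 2 for u in range(t_s, t + 1))
        if error > delta_l:
            t_s = t - 1                          # close the segment at the previous point
            end_points.append((stream[t_s], t_s))
    return end_points


triangle = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
summary = online_end_points(triangle, delta_l=0.5)
```

For the triangular stream above, the rising part is covered by one segment and the turning point at time 3 is reported as an end point, which is the behaviour the figure on the slide suggests.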

Similarity Measurement (1)

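Use the Pearson correlation as the similarity measure:

$$corr(X, Y) = \frac{E(XY) - E(X)E(Y)}{\sigma_X \sigma_Y}$$

Regard two streams as two different random variables:

$$corr(S_i, S_j) = \frac{\sum_t S_i[t] S_j[t] - \frac{1}{n} \sum_t S_i[t] \sum_t S_j[t]}{\sqrt{\sum_t (S_i[t] - \bar{S}_i)^2} \sqrt{\sum_t (S_j[t] - \bar{S}_j)^2}}$$

where $n$ is the number of time points and $\bar{S}_i$ is the mean of $S_i$.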


Definition 4.2. Given two streams $S_i$ and $S_j$, and a weight function $w(t)$, the weighted correlation coefficient between these two streams is defined as:

$$wcorr(S_i, S_j) = \frac{\sum_t w(t) S_i[t] S_j[t] - \frac{\sum_t w(t) S_i[t] \sum_t w(t) S_j[t]}{\sum_t w(t)}}{\sqrt{SA(S_i)} \sqrt{SA(S_j)}}$$

$$SA(S_x) = \sum_t w(t) (S_x[t])^2 - \frac{(\sum_t w(t) S_x[t])^2}{\sum_t w(t)}$$

$w(t)$ : a monotonically nondecreasing function of the time index $t$
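A small sketch may help to check the weighted correlation numerically. It evaluates the weighted sums of Definition 4.2 directly, with the weights passed in as a plain list; the function name and this calling convention are assumptions made for the example. A stream with zero weighted variance would cause a division by zero and is not handled here.

```python
import math


def wcorr(s_i, s_j, w):
    """Weighted correlation of two equally long streams under weights w[t]."""
    sw = sum(w)
    sw_i = sum(wt * x for wt, x in zip(w, s_i))
    sw_j = sum(wt * y for wt, y in zip(w, s_j))
    sw_ij = sum(wt * x * y for wt, x, y in zip(w, s_i, s_j))
    sa_i = sum(wt * x * x for wt, x in zip(w, s_i)) - sw_i ** 2 / sw
    sa_j = sum(wt * y * y for wt, y in zip(w, s_j)) - sw_j ** 2 / sw
    return (sw_ij - sw_i * sw_j / sw) / (math.sqrt(sa_i) * math.sqrt(sa_j))


# With uniform weights the value coincides with the ordinary Pearson correlation.
uniform = wcorr([1.0, 2.0, 3.0], [1.0, 3.0, 2.0], [1.0, 1.0, 1.0])

# Nondecreasing weights emphasise recent values.
recent = wcorr([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [0.125, 0.25, 0.5, 1.0])
```

With all weights equal to 1 the sum of weights is the number of time points, so the expression reduces to the unweighted Pearson form.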


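Definition 4.3. Given two streams $S_i$ and $S_j$, and a weight function $w(t)$, the WC vector of $S_i$ and $S_j$ is defined as:

$$WC_{S_iS_j}^{t_k} = (WC_1^{t_k}, WC_2^{t_k}, WC_3^{t_k}, WC_4^{t_k}, WC_5^{t_k}, t_k)$$

$$WC_1^{t_k} = \sum_{t=t_0}^{t_k} w(t) S_i[t]$$

$$WC_2^{t_k} = \sum_{t=t_0}^{t_k} w(t) (S_i[t])^2$$

$$WC_3^{t_k} = \sum_{t=t_0}^{t_k} w(t) S_i[t] S_j[t]$$

$$WC_4^{t_k} = \sum_{t=t_0}^{t_k} w(t) S_j[t]$$

$$WC_5^{t_k} = \sum_{t=t_0}^{t_k} w(t) (S_j[t])^2$$

The weighted correlation can then be computed from the WC vector:

$$wcorr(S_i, S_j) = \frac{WC_3 - \frac{WC_1 WC_4}{\sum_t w(t)}}{\sqrt{SA(S_i)} \sqrt{SA(S_j)}}$$

$$SA(S_i) = WC_2 - \frac{(WC_1)^2}{\sum_t w(t)}$$

The weight function is

$$w(t) = \lambda^{(t_{now} - t)}, \quad \lambda \in (0, 1]$$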
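The following sketch shows how a WC vector could be built and then turned into a weighted correlation without revisiting the streams. The function names are hypothetical, the symbol `lam` stands for the decay base of the weight function, and starting the time index at 0 is an assumption of this example. The sum of the weights is obtained in closed form as a geometric series.

```python
import math


def wc_vector(s_i, s_j, lam, t_k):
    """Return (WC1, WC2, WC3, WC4, WC5, t_k) with weights lam ** (t_k - t), t = 0..t_k."""
    w = [lam ** (t_k - t) for t in range(t_k + 1)]
    wc1 = sum(w[t] * s_i[t] for t in range(t_k + 1))
    wc2 = sum(w[t] * s_i[t] ** 2 for t in range(t_k + 1))
    wc3 = sum(w[t] * s_i[t] * s_j[t] for t in range(t_k + 1))
    wc4 = sum(w[t] * s_j[t] for t in range(t_k + 1))
    wc5 = sum(w[t] * s_j[t] ** 2 for t in range(t_k + 1))
    return (wc1, wc2, wc3, wc4, wc5, t_k)


def weight_sum(lam, t_k):
    """Closed form of sum_{t=0}^{t_k} lam ** (t_k - t)."""
    if lam == 1.0:
        return float(t_k + 1)
    return (1.0 - lam ** (t_k + 1)) / (1.0 - lam)


def wcorr_from_wc(wc, lam):
    """Weighted correlation computed only from the WC vector."""
    wc1, wc2, wc3, wc4, wc5, t_k = wc
    sw = weight_sum(lam, t_k)
    sa_i = wc2 - wc1 ** 2 / sw
    sa_j = wc5 - wc4 ** 2 / sw
    return (wc3 - wc1 * wc4 / sw) / (math.sqrt(sa_i) * math.sqrt(sa_j))


vec = wc_vector([1.0, 2.0, 3.0], [1.0, 3.0, 2.0], lam=1.0, t_k=2)
value = wcorr_from_wc(vec, lam=1.0)
```

The point of the vector is that five running sums and a time stamp are enough to recover the correlation, so the raw stream values need not be stored.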


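Similarity Update

Update the WC vector when a new end point is generated.

A linear scan of the data streams is replaced by an incremental update.

Between the previous end point $t_s$ and the new end point $t_e$, the stream is approximated by $S_i^{app}[t] = a_i^{now} t + b_i^{now}$, so the increment of the first component is

$$\Delta WC_1 = \sum_{t=t_s+1}^{t_e} w(t) S_i^{app}[t] = \sum_{t=t_s+1}^{t_e} w(t) (a_i^{now} t + b_i^{now}) = a_i^{now} \sum_{t=t_s+1}^{t_e} t \, w(t) + b_i^{now} \sum_{t=t_s+1}^{t_e} w(t)$$

The vector $WC_{S_iS_j}^{t_s} = ((WC_m^{t_s})_{m=1 \sim 5}, t_s)$ is updated to $WC_{S_iS_j}^{t_e} = ((WC_m^{t_e})_{m=1 \sim 5}, t_e)$. With $w(t) = \lambda^{(t_{now} - t)}$:

$$WC_1^{t_e} = \sum_{t=t_0}^{t_e} \lambda^{(t_e - t)} S_i^{app}[t] = \lambda^{(t_e - t_s)} \sum_{t=t_0}^{t_s} \lambda^{(t_s - t)} S_i^{app}[t] + \sum_{t=t_s+1}^{t_e} \lambda^{(t_e - t)} S_i^{app}[t] = \lambda^{(t_e - t_s)} WC_1^{t_s} + \Delta WC_1$$

$$WC_{S_iS_j}^{t_e} = ((\lambda^{(t_e - t_s)} WC_m^{t_s} + \Delta WC_m)_{m=1 \sim 5}, t_e)$$

[Figure: end point sequences $\hat{S}_i$ and $\hat{S}_j$ between $t_s$ and $t_e$]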
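The incremental update can be sketched as below, assuming an exponentially decaying weight with base `lam` (a symbol chosen for this example). The old components are scaled by the decay over the elapsed interval, and only the values between the previous and the new end point are scanned. The helper `batch_wc` is included solely to check the result against a full recomputation; both function names are hypothetical.

```python
def batch_wc(s_i, s_j, lam, t_k):
    """Recompute (WC1..WC5, t_k) from scratch with weights lam ** (t_k - t), t = 0..t_k."""
    sums = [0.0] * 5
    for t in range(t_k + 1):
        w, x, y = lam ** (t_k - t), s_i[t], s_j[t]
        for m, term in enumerate((x, x * x, x * y, y, y * y)):
            sums[m] += w * term
    return (*sums, t_k)


def incremental_wc(wc, seg_i, seg_j, lam, t_e):
    """Advance a WC vector from its time stamp t_s to t_e.

    seg_i and seg_j hold the (approximated) values for t = t_s + 1 .. t_e.
    """
    *old, t_s = wc
    if len(seg_i) != t_e - t_s or len(seg_j) != t_e - t_s:
        raise ValueError("segments must cover t_s + 1 .. t_e")
    delta = [0.0] * 5
    for offset, (x, y) in enumerate(zip(seg_i, seg_j)):
        w = lam ** (t_e - (t_s + 1 + offset))
        for m, term in enumerate((x, x * x, x * y, y, y * y)):
            delta[m] += w * term
    decay = lam ** (t_e - t_s)          # ages every previously accumulated term
    return (*(decay * o + d for o, d in zip(old, delta)), t_e)


s_i = [1.0, 2.0, 3.0, 4.0, 5.0]
s_j = [2.0, 1.0, 4.0, 3.0, 6.0]
wc_at_2 = batch_wc(s_i, s_j, 0.5, 2)
wc_at_4 = incremental_wc(wc_at_2, s_i[3:5], s_j[3:5], 0.5, 4)
```

The cost of an update depends on the length of the newest segment only, not on the total length of the streams seen so far.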

COMET-CORE Framework (1)

Definition 5.1. Assume that the centers of two clusters $C_i$ and $C_j$ are represented by the end point sequences $\hat{S}_i$ and $\hat{S}_j$, respectively. Then the WC vector of the two clusters, denoted by $WC_{C_iC_j}$, is equal to $WC_{S_iS_j}$. The weighted correlation between $C_i$ and $C_j$, denoted by $wcorr(C_i, C_j)$, is equal to $wcorr(S_i, S_j)$.

COMET-CORE procedure: a stream encounters a new end point → split cluster → merge cluster.

COMET-CORE Framework (2)

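Split cluster

For a cluster Ck containing trigger streams:

Update the weighted correlation.

Compare the correlation with δa. This yields new trigger groups and the non-trigger streams (Ctmp).

Compare the correlation between each non-trigger stream and the representative stream with δa.

In the illustrated example, the result is three new groups: Cnew1, Cnew2, and Cnew3.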

COMET-CORE Framework (3)

Assign WC vectors to newly generated clusters

Type 1: Ci and Cj originally belonged to the same cluster.

Type 2: Ci and Cj originally belonged to different clusters.

Type 3: Ci is a newly generated cluster and Coo is an originally existing one.

[Figure: before the split, the clusters are C1 = {S1, ..., S7}, C11 = {S11, ..., S14}, Cx, and Cy; after the split, they are C1 = {S1, S2, S3}, C4 = {S4, S5}, C6 = {S6, S7}, C11 = {S11, S12}, C14 = {S13, S14}, Cx, and Cy]

(a) Type 1: $WC_{c_1c_4} = WC_{s_1s_4}$

(b) Type 2: $WC_{c_4c_{14}} = WC_{c_1c_{11}}$

(c) Type 3: $WC_{c_4c_y} = WC_{c_1c_y}$
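The three assignment types can be expressed as a small lookup, sketched below under several assumptions: the dictionaries `origin`, `center`, `cluster_wc`, and `stream_wc` are hypothetical bookkeeping structures, and pairs are keyed by unordered sets, which ignores the fact that a real WC vector distinguishes the first stream from the second. Types 2 and 3 collapse into one case here because an unchanged cluster is treated as its own origin.

```python
def assign_wc(new_a, new_b, origin, center, cluster_wc, stream_wc):
    """Pick the WC vector for the cluster pair (new_a, new_b) after a split.

    origin[c]  : cluster that c belonged to before the split (c itself if unchanged)
    center[c]  : representative stream of cluster c
    cluster_wc : WC vectors between pre-split clusters, keyed by frozenset pairs
    stream_wc  : WC vectors between streams, keyed by frozenset pairs
    """
    if origin[new_a] == origin[new_b]:
        # Type 1: same original cluster, fall back to the streams' WC vector.
        return stream_wc[frozenset({center[new_a], center[new_b]})]
    # Type 2 and Type 3: inherit the WC vector of the original clusters.
    return cluster_wc[frozenset({origin[new_a], origin[new_b]})]


origin = {"C1": "C1", "C4": "C1", "C6": "C1", "C11": "C11", "C14": "C11", "Cy": "Cy"}
center = {"C1": "S1", "C4": "S4", "C6": "S6", "C11": "S11", "C14": "S13", "Cy": "Sy"}
cluster_wc = {frozenset({"C1", "C11"}): "WC_c1c11", frozenset({"C1", "Cy"}): "WC_c1cy"}
stream_wc = {frozenset({"S1", "S4"}): "WC_s1s4"}

type1 = assign_wc("C1", "C4", origin, center, cluster_wc, stream_wc)
type2 = assign_wc("C4", "C14", origin, center, cluster_wc, stream_wc)
type3 = assign_wc("C4", "Cy", origin, center, cluster_wc, stream_wc)
```

The strings stand in for actual WC vectors so that the example mirrors the three captions (a) to (c) directly.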

COMET-CORE Framework (4)

Merge cluster

After splitting and updating the inter-cluster correlations, two clusters are merged if their correlation is ≥ δe. Merging repeats until no such cluster pair exists.

[Figure: clusters C1, C2, and Ck; since wcorr(C1, C2) ≥ δe, C1 and C2 are merged into Cnew]

wcorr(Cnew, Ck) = min(wcorr(C1, Ck), wcorr(C2, Ck))
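The merge step can be sketched as a loop over the inter-cluster correlations. The data layout (a dictionary of clusters and a dictionary of pairwise correlations keyed by unordered pairs) and the function name are assumptions of this example, and merging the most correlated pair first is a choice made here, since the slide does not state an order.

```python
def merge_clusters(clusters, corr, delta_e):
    """Merge cluster pairs whose correlation is >= delta_e until none remains.

    clusters : dict mapping a cluster name to a set of stream names
    corr     : dict mapping frozenset({a, b}) to wcorr(a, b) for every cluster pair
    """
    clusters = {name: set(members) for name, members in clusters.items()}
    corr = dict(corr)
    while True:
        candidates = [(pair, c) for pair, c in corr.items() if c >= delta_e]
        if not candidates:
            return clusters, corr
        pair, _ = max(candidates, key=lambda item: item[1])
        a, b = sorted(pair)
        merged = a + "+" + b
        members = clusters.pop(a) | clusters.pop(b)
        del corr[pair]
        for other in clusters:
            # The correlation of the merged cluster is the smaller of the two old ones.
            corr[frozenset({merged, other})] = min(
                corr.pop(frozenset({a, other})), corr.pop(frozenset({b, other}))
            )
        clusters[merged] = members


clusters = {"C1": {"S1"}, "C2": {"S2"}, "Ck": {"S3"}}
corr = {
    frozenset({"C1", "C2"}): 0.9,
    frozenset({"C1", "Ck"}): 0.3,
    frozenset({"C2", "Ck"}): 0.6,
}
merged_clusters, merged_corr = merge_clusters(clusters, corr, delta_e=0.8)
```

Taking the minimum is a conservative choice: the merged cluster is considered only as correlated with a third cluster as its less correlated part.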

Empirical Studies (1)

Clustering algorithms

Basic: periodic agglomerative clustering

ODAC: periodic hierarchical clustering

COMET-CORE

[Figure: clustering procedure from all streams to the clustering result; decision labels: "Dissimilarity > Threshold" and "2Dis(P) - (Dis(C1) + Dis(C2)) < Threshold"]


Clustering quality measurement: Silhouette validation

$a(S_i)$ is the average dissimilarity of stream $S_i$ to all other streams in the same cluster.

$b(S_i)$ is the average dissimilarity of stream $S_i$ to all streams in the closest other cluster.

$$sil(S_i) = \frac{b(S_i) - a(S_i)}{\max\{a(S_i), b(S_i)\}}$$

Cluster silhouette: $sil_{C_k} = \frac{1}{m} \sum_{S_i \in C_k} sil(S_i)$, where $C_k$ owns $m$ streams.

Global silhouette: $GS = \frac{1}{p} \sum_{k=1}^{p} sil_{C_k}$, where there are $p$ clusters in total.
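The silhouette measures can be computed from a pairwise dissimilarity matrix, as in the sketch below. The function name and the matrix layout are assumptions of this example, the slides do not say which dissimilarity is used, and giving singleton clusters a silhouette of 0 is a common convention adopted here rather than something stated in the source. At least two clusters are required.

```python
def silhouette_scores(dis, clusters):
    """Return (list of cluster silhouettes, global silhouette GS).

    dis      : square matrix, dis[i][j] is the dissimilarity of streams i and j
    clusters : list of lists of stream indices (at least two clusters)
    """
    def average(i, members):
        return sum(dis[i][j] for j in members) / len(members)

    cluster_sil = []
    for k, cluster in enumerate(clusters):
        values = []
        for i in cluster:
            if len(cluster) == 1:
                values.append(0.0)                      # convention for singleton clusters
                continue
            a = average(i, [j for j in cluster if j != i])
            b = min(average(i, other) for m, other in enumerate(clusters) if m != k)
            denom = max(a, b)
            values.append((b - a) / denom if denom > 0 else 0.0)
        cluster_sil.append(sum(values) / len(values))
    return cluster_sil, sum(cluster_sil) / len(cluster_sil)


dis = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
per_cluster, gs = silhouette_scores(dis, [[0, 1], [2, 3]])
```

Values close to 1 indicate that streams are much closer to their own cluster than to the nearest other cluster, which is the case for the two well-separated pairs in this toy matrix.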


Evaluation on Real Data

δa = δe = 0.5

Data Sets


Evaluation on Cylinder-Bell-Funnel Data Set

δa = δe = 0.8

6 types, 100 streams for each type (600 streams in total), each 128 points long

Normally distributed random numbers ranging from 0 to 1 are added to each stream.


Evaluation on Random Walk Data Set

δa = δe = 0.7

Period = 200 data points (Basic & ODAC)

Experiments vary: 1. the number of streams, 2. the number of clusters

20000 points in each stream

Fixed at 500 streams

Almost independent of the number of clusters

Conclusion

The paper proposes COMET-CORE, a novel and efficient online framework for clustering over multiple streams.

COMET-CORE uses an efficient split-and-merge algorithm to modify clusters while keeping good clustering quality.
