Clustering over Multiple Evolving Streams by Events and Correlations. Mi-Yen Yeh, Bi-Ru Dai, Ming-Syan Chen. Electrical Engineering, National Taiwan University. IEEE Transactions on Knowledge and Data Engineering (TKDE) 2007


Page 1: Clustering over Multiple Evolving Streams by Events and Correlations

Clustering over Multiple Evolving Streams by Events and Correlations

Mi-Yen Yeh, Bi-Ru Dai, Ming-Syan Chen

Electrical Engineering, National Taiwan University

IEEE Transactions on Knowledge and Data Engineering (TKDE) 2007

Page 2: Clustering over Multiple Evolving Streams by Events and Correlations

Outline

Introduction Data Summarization Similarity Measurement COMET-CORE Framework Empirical Studies Conclusion

Page 3: Clustering over Multiple Evolving Streams by Events and Correlations

Introduction (1)

Good clustering puts similar objects together and separates dissimilar ones into different clusters.

Useful information from clusters:

Data collection in sensor networks

Stock market trades

[Figure: example streams A, B, C, D, E, F, G grouped into clusters]

Page 4: Clustering over Multiple Evolving Streams by Events and Correlations

Introduction (2)

Existing approaches:

Online data summarization with offline clustering

Periodical online clustering

Drawbacks: waste and loss of information

[Figure: streams A to G, their clusters, and the user]

Page 5: Clustering over Multiple Evolving Streams by Events and Correlations

Introduction (3)

COMET-CORE

Use online piecewise linear line segments to approximate the original data

Update correlations when a stream encounters a new end point

Update clusters according to the updated correlations

[Figure: a stream with its data points and end points; a new end point triggers an update of the stream correlations]

Page 6: Clustering over Multiple Evolving Streams by Events and Correlations

Data Summarization (1)

Problem Model

$\Gamma = \{S_1, S_2, \ldots, S_n\}$ : the set of data streams

$S_i = S_i[1, \ldots, t, \ldots]$ : i-th stream

$S_i[t]$ : arriving data of $S_i$ at time t

$S_i^{app}[t]$ : approximated data of $S_i$ at time t

$\hat{S}_i = \{(S_i[t_{v_1}], t_{v_1}), \ldots, (S_i[t_{v_k}], t_{v_k})\}$ : end points summary of stream $S_i$

The objective: given a set of data streams Γ and the threshold parameters, monitor the stream clusters online.

Page 7: Clustering over Multiple Evolving Streams by Events and Correlations

Data Summarization (2)

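Approximation Line Formulation

For a sub-stream $S_i[t_s, \ldots, t_e]$:

$$S_i^{app}[t] = a t + b, \quad t \in [t_s, t_e]$$

The parameters:

$$a = \frac{S_i[t_e] - S_i[t_s]}{t_e - t_s}$$

$$b = S_i[t_e] - a t_e = S_i[t_s] - a t_s$$

[Figure: the approximation line joins the two end points $(t_s, S_i[t_s])$ and $(t_e, S_i[t_e])$]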
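The two parameter formulas above can be checked with a few lines of Python. This is only an illustrative sketch; the function name `line_through` is hypothetical and not taken from the slides.

```python
def line_through(t_s, y_s, t_e, y_e):
    """Return (a, b) of the line y = a*t + b through (t_s, y_s) and (t_e, y_e)."""
    if t_e == t_s:
        raise ValueError("the two end points need distinct time stamps")
    a = (y_e - y_s) / (t_e - t_s)   # slope from the two end points
    b = y_s - a * t_s               # intercept; equals y_e - a * t_e
    return a, b


a, b = line_through(0, 1.0, 4, 9.0)
print(a, b)  # 2.0 1.0
```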

Page 8: Clustering over Multiple Evolving Streams by Events and Correlations

Data Summarization (3)

Error Function

$$error = \sum_{t=t_s}^{t_e} (S_i[t] - S_i^{app}[t])^2 = \sum_{t=t_s}^{t_e} (S_i[t] - a t - b)^2$$

Error Threshold

It may not be easy to choose a proper absolute error threshold.

Use a relative error threshold instead (e.g., 2% of the square sum of the original data stream).
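As a rough sketch of the error function and the relative threshold, the snippet below computes the squared error of a sub-stream against the line joining its first and last points. The function names and the choice to take the square sum over the same sub-stream are assumptions made for illustration; the slide does not say over which range the square sum is taken.

```python
def segment_error(values, t_s=0):
    """Squared error between values[k] = S_i[t_s + k] and the line
    joining the first and last points of the sub-stream."""
    n = len(values)
    if n < 2:
        return 0.0
    t_e = t_s + n - 1
    a = (values[-1] - values[0]) / (t_e - t_s)
    b = values[0] - a * t_s
    return sum((v - (a * (t_s + k) + b)) ** 2 for k, v in enumerate(values))


def exceeds_relative_threshold(values, ratio=0.02, t_s=0):
    """True if the error is larger than ratio * (square sum of the values).
    Assumption: the square sum is computed over the same sub-stream."""
    square_sum = sum(v * v for v in values)
    return segment_error(values, t_s) > ratio * square_sum


print(segment_error([0.0, 2.0, 0.0]))               # 4.0
print(exceeds_relative_threshold([0.0, 2.0, 0.0]))  # True
```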

Page 9: Clustering over Multiple Evolving Streams by Events and Correlations

Data Summarization (4)

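Online Linear Line Segment Approximation

$$S_i \rightarrow \hat{S}_i = \{(S_i[t_{v_1}], t_{v_1}), \ldots, (S_i[t_{v_k}], t_{v_k})\}$$

While the error stays below the threshold δl, the current segment is extended.

When the error exceeds the threshold δl, a new end point is generated.

[Figure: value versus time, with end points from $t_{v_1}$ to $t_{v_k}$]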
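The online procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: time stamps are taken to be the indices 0, 1, 2, ..., the first data point is treated as an end point, the relative threshold is applied to the current segment only, and when the threshold is exceeded the previous data point becomes the new end point. The names `online_end_points` and `relative_error_exceeded` are hypothetical.

```python
def relative_error_exceeded(buf, ratio):
    """Compare the squared error of buf against the line joining its first
    and last values with ratio * (square sum of buf)."""
    n = len(buf)
    if n < 3:
        return False  # a line through one or two points has zero error
    a = (buf[-1] - buf[0]) / (n - 1)
    b = buf[0]
    error = sum((v - (a * k + b)) ** 2 for k, v in enumerate(buf))
    return error > ratio * sum(v * v for v in buf)


def online_end_points(stream, ratio=0.02):
    """Return the end points (value, time) found while scanning the stream once."""
    end_points = []
    buf = []  # values of the segment currently being extended
    for t, v in enumerate(stream):
        buf.append(v)
        if t == 0:
            end_points.append((v, t))  # assumption: first point is an end point
        elif relative_error_exceeded(buf, ratio):
            # Close the segment at the previous point and start a new one there.
            end_points.append((buf[-2], t - 1))
            buf = [buf[-2], v]
    return end_points


print(online_end_points([0, 1, 2, 3, 2, 1, 0]))  # [(0, 0), (3, 3)]
```

Only the end points need to be stored, so the raw data points can be discarded once they have been summarized.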

Page 10: Clustering over Multiple Evolving Streams by Events and Correlations

Similarity Measurement (1)

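Use the Pearson correlation as the similarity measure:

$$corr(X, Y) = \frac{E(XY) - E(X)E(Y)}{\sigma_x \sigma_y}$$

Regard two streams as two different random variables, each observed at n time points:

$$corr(S_i, S_j) = \frac{\sum_t S_i[t] S_j[t] - \frac{1}{n} \sum_t S_i[t] \sum_t S_j[t]}{\sqrt{\sum_t (S_i[t] - \bar{S}_i)^2} \sqrt{\sum_t (S_j[t] - \bar{S}_j)^2}}$$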
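For reference, the Pearson correlation of two equally long stream windows can be computed from running sums, which is the form that the weighted version later builds on. The function name `pearson` is illustrative only.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences, from plain sums."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = sxy - sx * sy / n
    den = math.sqrt(sxx - sx * sx / n) * math.sqrt(syy - sy * sy / n)
    if den == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return num / den


print(round(pearson([1, 2, 3], [2, 4, 6]), 6))  # 1.0
print(round(pearson([1, 2, 3], [3, 2, 1]), 6))  # -1.0
```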

Page 11: Clustering over Multiple Evolving Streams by Events and Correlations

Similarity Measurement (2)

Definition 4.2. Given two streams Si and Sj, and a weight function w(t), the weighted correlation coefficient between these two streams is defined as:

$$wcorr(S_i, S_j) = \frac{\sum_t w(t) S_i[t] S_j[t] - \frac{\sum_t w(t) S_i[t] \sum_t w(t) S_j[t]}{\sum_t w(t)}}{\sqrt{A(S_i)} \sqrt{A(S_j)}}$$

$$A(S_x) = \sum_t w(t) (S_x[t])^2 - \frac{(\sum_t w(t) S_x[t])^2}{\sum_t w(t)}$$

w(t) : a monotonically nondecreasing function of time index t

Page 12: Clustering over Multiple Evolving Streams by Events and Correlations

Similarity Measurement (3)

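Definition 4.3. Given two streams Si and Sj, and a weight function w(t), the WC vector of Si and Sj is defined as:

$$WC_{S_i S_j} = (WC^1_{t_k}, WC^2_{t_k}, WC^3_{t_k}, WC^4_{t_k}, WC^5_{t_k}, t_k)$$

$$WC^1_{t_k} = \sum_{t=t_0}^{t_k} w(t) S_i[t]$$

$$WC^2_{t_k} = \sum_{t=t_0}^{t_k} w(t) (S_i[t])^2$$

$$WC^3_{t_k} = \sum_{t=t_0}^{t_k} w(t) S_i[t] S_j[t]$$

$$WC^4_{t_k} = \sum_{t=t_0}^{t_k} w(t) S_j[t]$$

$$WC^5_{t_k} = \sum_{t=t_0}^{t_k} w(t) (S_j[t])^2$$

The weighted correlation can be computed from the WC vector:

$$wcorr(S_i, S_j) = \frac{WC^3 - \frac{WC^1 WC^4}{\sum_t w(t)}}{\sqrt{A(S_i)} \sqrt{A(S_j)}}$$

$$A(S_i) = WC^2 - \frac{(WC^1)^2}{\sum_t w(t)}$$

$A(S_j)$ is obtained in the same way from $WC^5$ and $WC^4$.

Weight function:

$$w(t) = \lambda^{(t_{now} - t)}, \quad \lambda \in (0, 1]$$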
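A small sketch of Definition 4.3 follows, assuming an exponential weight of the form base^(t_now - t) with the base in (0, 1] and time stamps equal to list indices. The function names and the parameter name `lam` are hypothetical; the sum of the weights is returned alongside the vector because the correlation formula needs it.

```python
import math


def wc_vector(si, sj, lam):
    """Build (WC1..WC5, t_k) and the weight sum for two equal-length streams
    covering times 0..t_k, with w(t) = lam ** (t_k - t)."""
    t_k = len(si) - 1
    w = [lam ** (t_k - t) for t in range(len(si))]
    wc1 = sum(wt * a for wt, a in zip(w, si))
    wc2 = sum(wt * a * a for wt, a in zip(w, si))
    wc3 = sum(wt * a * b for wt, a, b in zip(w, si, sj))
    wc4 = sum(wt * b for wt, b in zip(w, sj))
    wc5 = sum(wt * b * b for wt, b in zip(w, sj))
    return (wc1, wc2, wc3, wc4, wc5, t_k), sum(w)


def wcorr_from_wc(wc, w_sum):
    """Weighted correlation computed only from the WC vector and sum of weights."""
    wc1, wc2, wc3, wc4, wc5, _ = wc
    a_i = wc2 - wc1 * wc1 / w_sum
    a_j = wc5 - wc4 * wc4 / w_sum
    return (wc3 - wc1 * wc4 / w_sum) / (math.sqrt(a_i) * math.sqrt(a_j))


si = [1.0, 3.0, 2.0, 5.0]
sj = [2.0 * v + 1.0 for v in si]      # exact linear relation with si
wc, w_sum = wc_vector(si, sj, lam=0.9)
print(round(wcorr_from_wc(wc, w_sum), 6))  # 1.0
```

With `lam=1.0` every point has weight 1 and the result reduces to the ordinary Pearson correlation.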

Page 13: Clustering over Multiple Evolving Streams by Events and Correlations

Similarity Measurement (4)

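Similarity Update

Update the WC vector when a new end point is generated.

A linear scan of the data streams is replaced by an incremental update.

For the newest segment of stream $S_i$, between the end points at $t_s$ and $t_e$:

$$S_i^{app}[t] = a_i^{now} t + b_i^{now}$$

Since $WC^1 = \sum_t w(t) S_i[t]$, the contribution of the new segment to $WC^1$ is

$$\sum_{t=t_s+1}^{t_e} w(t) S_i^{app}[t] = \sum_{t=t_s+1}^{t_e} w(t) (a_i^{now} t + b_i^{now}) = a_i^{now} \sum_{t=t_s+1}^{t_e} t \, w(t) + b_i^{now} \sum_{t=t_s+1}^{t_e} w(t)$$

The WC vector therefore moves from its state at $t_s$ to its state at $t_e$:

$$WC_{S_i S_j}: ((WC^m_{t_s})_{m=1 \sim 5}, t_s) \rightarrow ((WC^m_{t_e})_{m=1 \sim 5}, t_e)$$

[Figure: summaries $\hat{S}_i$ and $\hat{S}_j$ with the interval from $t_s$ to $t_e$ marked]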

ji

Page 14: Clustering over Multiple Evolving Streams by Events and Correlations

Similarity Measurement (5)

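With the weight function $w(t) = \lambda^{(t_{now} - t)}$ and $t_{now} = t_e$:

$$WC^1_{t_e} = \sum_{t=t_0}^{t_e} w(t) S_i^{app}[t] = \sum_{t=t_0}^{t_e} \lambda^{(t_e - t)} S_i^{app}[t]$$

$$= \lambda^{(t_e - t_s)} \sum_{t=t_0}^{t_s} \lambda^{(t_s - t)} S_i^{app}[t] + \sum_{t=t_s+1}^{t_e} \lambda^{(t_e - t)} S_i^{app}[t]$$

$$= \lambda^{(t_e - t_s)} WC^1_{t_s} + \sum_{t=t_s+1}^{t_e} \lambda^{(t_e - t)} S_i^{app}[t]$$

The same update applies to every component $m = 1, \ldots, 5$ of $WC_{S_i S_j}$: the value at $t_s$ is scaled by $\lambda^{(t_e - t_s)}$, the contribution of the new segment is added, and the time stamp becomes $t_e$.

[Figure: summaries $\hat{S}_i$ and $\hat{S}_j$ with the interval from $t_s$ to $t_e$ marked]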
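The incremental update derived above can be sketched as below. It is an illustration under assumptions: the weight is taken to be lam^(t_now - t), the new values are supplied as plain lists (in the framework they would come from the approximation lines), and the sum of weights is tracked alongside the five sums. The name `update_wc` is hypothetical.

```python
def update_wc(wc, w_sum, si_new, sj_new, lam):
    """Advance the five WC sums by len(si_new) time steps.

    wc      : [WC1, WC2, WC3, WC4, WC5] at the previous end point t_s
    w_sum   : sum of weights at t_s
    si_new, sj_new : values for t_s + 1 .. t_e (equal length)
    """
    n = len(si_new)                     # n = t_e - t_s
    decay = lam ** n
    sums = [decay * v for v in wc]      # old part scaled by lam ** (t_e - t_s)
    w_sum = decay * w_sum
    for k, (a, b) in enumerate(zip(si_new, sj_new)):
        w = lam ** (n - 1 - k)          # weight lam ** (t_e - t) of the new point
        sums[0] += w * a
        sums[1] += w * a * a
        sums[2] += w * a * b
        sums[3] += w * b
        sums[4] += w * b * b
        w_sum += w
    return sums, w_sum


# Usage: feed the streams in two chunks instead of rescanning everything.
si = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
sj = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
wc, w_sum = update_wc([0.0] * 5, 0.0, si[:2], sj[:2], lam=0.9)
wc, w_sum = update_wc(wc, w_sum, si[2:], sj[2:], lam=0.9)
```

The result matches a full rescan with weights `lam ** (5 - t)`, while touching only the points of the new segment.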

Page 15: Clustering over Multiple Evolving Streams by Events and Correlations

COMET-CORE Framework (1)

Definition 5.1. Assume that the centers of two clusters Ci and Cj are represented by the end point sequences $\hat{S}_i$ and $\hat{S}_j$, respectively. Then, the WC vector of the two clusters, denoted by $WC_{C_i C_j}$, is equal to $WC_{S_i S_j}$. The weighted correlation between Ci and Cj, denoted by wcorr(Ci, Cj), is equal to wcorr(Si, Sj).

COMET-CORE

When a stream encounters a new end point:

Split cluster

Merge cluster

Page 16: Clustering over Multiple Evolving Streams by Events and Correlations

COMET-CORE Framework (2)

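Split cluster

For a cluster Ck that contains trigger streams (streams that encountered a new end point):

Update the weighted correlation

Compare the correlation with δa

The trigger streams form new trigger groups; the non-trigger streams form Ctmp

Compare the correlation between each non-trigger stream and each representative stream with δa

[Figure: cluster Ck is split into three new groups Cnew1, Cnew2, Cnew3]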
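The split step can be illustrated with a greedy grouping sketch. The slide does not spell out how groups and representatives are chosen, so the following are assumptions made for illustration: the first stream placed in a group acts as its representative, a stream joins the first group whose representative it correlates with at δa or above, and non-trigger streams that match no representative stay together in a leftover group. The names `split_cluster`, `trigger`, `non_trigger`, and `corr` are hypothetical.

```python
def split_cluster(trigger, non_trigger, corr, delta_a):
    """Greedy sketch of a cluster split.

    trigger, non_trigger : lists of stream ids
    corr(a, b)           : weighted correlation between two streams
    Returns a list of groups; groups[g][0] is the group's representative.
    """
    groups = []
    for s in trigger:
        for g in groups:
            if corr(s, g[0]) >= delta_a:
                g.append(s)
                break
        else:
            groups.append([s])          # s starts a new trigger group
    leftover = []
    for s in non_trigger:
        for g in groups:
            if corr(s, g[0]) >= delta_a:
                g.append(s)
                break
        else:
            leftover.append(s)          # assumption: unmatched streams stay together
    if leftover:
        groups.append(leftover)
    return groups


# Toy correlations, symmetric lookup; unlisted pairs count as 0.
table = {
    frozenset(("T1", "T2")): 0.9, frozenset(("T1", "T3")): 0.2,
    frozenset(("N1", "T1")): 0.85, frozenset(("N2", "T1")): 0.1,
    frozenset(("N2", "T3")): 0.2,
}
corr = lambda a, b: table.get(frozenset((a, b)), 0.0)
print(split_cluster(["T1", "T2", "T3"], ["N1", "N2"], corr, 0.8))
# [['T1', 'T2', 'N1'], ['T3'], ['N2']]
```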

Page 17: Clustering over Multiple Evolving Streams by Events and Correlations

COMET-CORE Framework (3)

Assign WC vectors to newly generated clusters

Type 1: Ci and Cj originally belonged to the same cluster.

Type 2: Ci and Cj originally belonged to different clusters.

Type 3: Ci is a newly generated cluster and the other cluster already existed.

[Figure: before the split, C1 = {S1, S2, S3, S4, S5, S6, S7} and C11 = {S11, S12, S13, S14}, alongside the existing clusters Cx and Cy. After the split, C1 = {S1, S2, S3}, C4 = {S4, S5}, C6 = {S6, S7}, C11 = {S11, S12}, C14 = {S13, S14}.]

(a) Type 1: $WC_{c_1 c_4} = WC_{s_1 s_4}$

(b) Type 2: $WC_{c_4 c_{14}} = WC_{c_1 c_{11}}$

(c) Type 3: $WC_{c_4 c_y} = WC_{c_1 c_y}$

Page 18: Clustering over Multiple Evolving Streams by Events and Correlations

COMET-CORE Framework (4)

Merge Cluster

Merging takes place after splitting and after updating the inter-cluster correlations.

Two clusters are merged if their correlation is ≥ δe; this repeats until no such cluster pair exists.

[Figure: clusters C1, C2, and Ck with wcorr(C1, Ck) and wcorr(C2, Ck); since wcorr(C1, C2) ≥ δe, C1 and C2 are merged into Cnew]

wcorr(Cnew, Ck) = min(wcorr(C1, Ck), wcorr(C2, Ck))
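A minimal sketch of the merge loop is given below. Two points are assumptions rather than statements from the slides: the pair with the highest correlation is merged first, and the merged cluster keeps the id of one of the two merged clusters. The correlation of the merged cluster to every other cluster follows the min rule stated above. The names `merge_clusters`, `clusters`, and `corr` are hypothetical.

```python
def merge_clusters(clusters, corr, delta_e):
    """clusters : dict cluster id -> set of stream ids
    corr     : dict frozenset({id_a, id_b}) -> wcorr for every cluster pair
    Merges pairs with wcorr >= delta_e until none is left."""
    clusters = {k: set(v) for k, v in clusters.items()}
    corr = dict(corr)
    while True:
        candidates = [(c, pair) for pair, c in corr.items() if c >= delta_e]
        if not candidates:
            break
        _, pair = max(candidates, key=lambda item: item[0])
        keep, drop = sorted(pair)           # merged cluster keeps the id `keep`
        clusters[keep] |= clusters.pop(drop)
        del corr[pair]
        for other in clusters:
            if other == keep:
                continue
            dropped = corr.pop(frozenset((drop, other)))
            key = frozenset((keep, other))
            corr[key] = min(corr[key], dropped)   # min rule for the merged cluster
    return clusters, corr


clusters = {"C1": {"S1"}, "C2": {"S2"}, "Ck": {"S3"}}
corr = {
    frozenset(("C1", "C2")): 0.9,
    frozenset(("C1", "Ck")): 0.4,
    frozenset(("C2", "Ck")): 0.6,
}
merged, new_corr = merge_clusters(clusters, corr, delta_e=0.8)
print(sorted(merged))                      # ['C1', 'Ck']
print(new_corr[frozenset(("C1", "Ck"))])   # 0.4
```

Taking the minimum is a conservative choice: the merged cluster is treated as only as close to Ck as its less correlated half.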

Page 19: Clustering over Multiple Evolving Streams by Events and Correlations

Empirical Studies (1)

Clustering algorithms compared:

Basic: periodical agglomerative clustering

ODAC: periodical hierarchical clustering

COMET-CORE

[Figure: clustering flow from all streams to the clustering result, with the conditions "Dissimilarity > Threshold" and "2Dis(P) - (Dis(C1) + Dis(C2)) < Threshold"]

Page 20: Clustering over Multiple Evolving Streams by Events and Correlations

Empirical Studies (2)

Clustering quality measurement: Silhouette Validation

a(Si) is the average dissimilarity of stream Si to all other streams in the same cluster

b(Si) is the average dissimilarity of stream Si to all streams in the closest other cluster

$$sil(S_i) = \frac{b(S_i) - a(S_i)}{\max\{a(S_i), b(S_i)\}}$$

Cluster Silhouette

$$sil_{C_k} = \frac{1}{m} \sum_{S_i \in C_k} sil(S_i), \quad C_k \text{ owns } m \text{ streams}$$

Global Silhouette

$$GS = \frac{1}{p} \sum_{k=1}^{p} sil_{C_k}, \quad p \text{ clusters in total}$$
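The three silhouette formulas can be sketched as follows. The dissimilarity function is left as a parameter because the slide does not define it; the handling of singleton clusters (silhouette 0) is an assumed convention, and at least two clusters are required. The function name `silhouettes` is hypothetical.

```python
def silhouettes(clusters, dis):
    """clusters : list of lists of stream ids (at least two clusters)
    dis(a, b): dissimilarity between two streams
    Returns (list of cluster silhouettes, global silhouette)."""
    def avg_dis(s, members):
        return sum(dis(s, m) for m in members) / len(members)

    cluster_sil = []
    for k, members in enumerate(clusters):
        values = []
        for s in members:
            others = [m for m in members if m != s]
            if not others:
                values.append(0.0)      # assumed convention for singletons
                continue
            a = avg_dis(s, others)
            b = min(avg_dis(s, c) for j, c in enumerate(clusters) if j != k)
            denom = max(a, b)
            values.append((b - a) / denom if denom > 0 else 0.0)
        cluster_sil.append(sum(values) / len(values))
    return cluster_sil, sum(cluster_sil) / len(cluster_sil)


# Toy example: streams reduced to positions on a line.
pos = {"A": 0.0, "B": 1.0, "C": 10.0, "D": 11.0}
dis = lambda x, y: abs(pos[x] - pos[y])
per_cluster, gs = silhouettes([["A", "B"], ["C", "D"]], dis)
print(round(gs, 4))  # 0.8997
```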

Page 21: Clustering over Multiple Evolving Streams by Events and Correlations

Empirical Studies (3)

Evaluation on Real Data

δa = δe = 0.5

Data Sets

Page 22: Clustering over Multiple Evolving Streams by Events and Correlations

Empirical Studies (4)

Evaluation on Cylinder-Bell-Funnel Data Set

δa = δe = 0.8

6 types, 100 streams for each type (600 streams in total)

Each stream is 128 points long

Normally distributed random numbers ranging from 0 to 1 are added to each stream

Page 23: Clustering over Multiple Evolving Streams by Events and Correlations

Empirical Studies (5)

Evaluation on Random Walk Data Set

δa = δe = 0.7

Period = 200 data points (Basic and ODAC)

Two experiments: 1. varying the number of streams; 2. varying the number of clusters

20000 points in each stream; 500 streams fixed

Almost independent of the number of clusters

Page 24: Clustering over Multiple Evolving Streams by Events and Correlations

Conclusion

The paper proposes COMET-CORE, a novel and efficient online framework for clustering over multiple evolving streams.

COMET-CORE uses efficient split and merge algorithms to update clusters while maintaining good clustering quality.