Clustering over Multiple Evolving Streams by Events and Correlations
Mi-Yen Yeh, Bi-Ru Dai, Ming-Syan Chen
Electrical Engineering, National Taiwan University
IEEE Transactions on Knowledge and Data Engineering (TKDE), 2007
Outline
Introduction
Data Summarization
Similarity Measurement
COMET-CORE Framework
Empirical Studies
Conclusion
Introduction (1)
Good clustering puts similar objects together and separates dissimilar ones into different clusters.
Useful information from clusters
  Data collection in sensor networks
  Stock market trades
[Figure: streams A to G grouped into clusters]
Introduction (2)
Online data summarization with offline clustering
Periodic online clustering
[Figure: streams A to G clustered for a user; annotated "Waste!!" and "Lose information!!"]
Introduction (3)
COMET-CORE
  Use online piecewise linear line segments to approximate the original data
  Update correlations when a stream encounters a new end point
  Update clusters by the updated correlations
[Figure: data points and end points of a stream; a new end point triggers an update of the stream correlations]
Data Summarization (1)
Problem model
  $\Gamma = \{S_1, S_2, \ldots, S_n\}$: the set of data streams
  $S_i = S_i[1, \ldots, t, \ldots]$: the i-th stream
  $S_i[t]$: data of $S_i$ arriving at time t
  $S_i^{app}[t]$: approximated data of $S_i$ at time t
  $\hat{S}_i = \{(t_{v1}, S_i[t_{v1}]), \ldots, (t_{vk}, S_i[t_{vk}])\}$: end point summary of stream $S_i$
Objective: given a set of data streams Γ and the threshold parameters, monitor the stream clusters online.
Data Summarization (2)
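Approximation line formulation
For a sub-stream $S_i[t_s, \ldots, t_e]$:
$S_i^{app}[t] = a t + b, \quad t \in [t_s, t_e]$
The parameters:
$a = \frac{S_i[t_e] - S_i[t_s]}{t_e - t_s}$
$b = S_i[t_e] - a t_e = S_i[t_s] - a t_s$
[Figure: line segment between the end points (ts, Si[ts]) and (te, Si[te])]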
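As a minimal sketch of the approximation line above, the following Python computes a and b from two end points. The function name `fit_segment` and the sample values are illustrative assumptions, not taken from the paper.

```python
def fit_segment(ts, te, s_ts, s_te):
    """Return (a, b) of the line through the end points (ts, s_ts) and (te, s_te)."""
    if te == ts:
        raise ValueError("a segment needs two distinct time stamps")
    a = (s_te - s_ts) / (te - ts)
    b = s_ts - a * ts  # the same value as s_te - a * te
    return a, b


# Hypothetical sub-stream with end points (2, 10.0) and (6, 18.0).
a, b = fit_segment(2, 6, 10.0, 18.0)
approx = [a * t + b for t in range(2, 7)]  # approximated values for t = 2..6
```

Both expressions for b give the same intercept, which is why only the two end points need to be stored per segment.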
Data Summarization (3)
Error function
$error = \sum_{t=t_s}^{t_e} (S_i[t] - S_i^{app}[t])^2 = \sum_{t=t_s}^{t_e} (S_i[t] - a t - b)^2$
Error threshold
  A proper absolute error threshold may not be easy to give
  A relative error threshold can be used instead (e.g., 2% of the square sum of the original data stream)
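The error function and a relative threshold can be sketched as follows. This is only an illustration: the helper names are hypothetical, and taking the square sum over the current segment (rather than over the whole stream) is an assumption, since the slide does not say which range is meant.

```python
def segment_error(values, ts, te):
    """Sum of squared residuals between values[ts..te] and the end-point line."""
    a = (values[te] - values[ts]) / (te - ts)
    b = values[ts] - a * ts
    return sum((values[t] - (a * t + b)) ** 2 for t in range(ts, te + 1))


def exceeds_relative_threshold(values, ts, te, ratio=0.02):
    """True if the error is above `ratio` times the square sum of the segment.

    Assumption: the square sum is taken over values[ts..te] only.
    """
    square_sum = sum(values[t] ** 2 for t in range(ts, te + 1))
    return segment_error(values, ts, te) > ratio * square_sum


straight = [0.0, 1.0, 2.0, 3.0]   # lies exactly on a line, error 0
spike = [0.0, 5.0, 0.0]           # the middle point is far from the line
print(segment_error(straight, 0, 3), segment_error(spike, 0, 2))
```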
Data Summarization (4)
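Online linear line segment approximation
Each stream $S_i$ is summarized online as $\hat{S}_i = \{(t_{v1}, S_i[t_{v1}]), \ldots, (t_{vk}, S_i[t_{vk}])\}$
[Figure: value versus time with end points tv1, …, tvk. While the error stays below the threshold δl the current segment continues; when the error exceeds δl a new end point is generated.]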
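One way to picture the online segmentation is the sketch below. It is not the authors' implementation: it uses an absolute threshold `delta_l`, buffers the values of the open segment, and places the new end point at the last value that still fitted, all of which are assumptions made for illustration.

```python
def line_error(buf):
    """Squared error between buf and the line joining its first and last values."""
    n = len(buf) - 1
    a = (buf[-1] - buf[0]) / n
    return sum((v - (buf[0] + a * k)) ** 2 for k, v in enumerate(buf))


def online_end_points(stream, delta_l):
    """Scan the stream once and return the end points [(t, value), ...]."""
    end_points = []
    buf = []        # values of the open segment; buf[0] is its start end point
    last_t = None
    for t, x in enumerate(stream):
        last_t = t
        if not buf:
            end_points.append((t, x))   # the first value opens the first segment
            buf.append(x)
            continue
        buf.append(x)
        if len(buf) > 2 and line_error(buf) > delta_l:
            # Error above the threshold: close the segment at the previous value
            # (assumed placement) and start a new segment from there.
            end_points.append((t - 1, buf[-2]))
            buf = buf[-2:]
    if len(buf) > 1:
        end_points.append((last_t, buf[-1]))  # close the still-open segment
    return end_points


summary = online_end_points([0, 1, 2, 3, 2, 1, 0], delta_l=0.5)
```

For this triangular stream the summary keeps only the start, the peak, and the last value.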
Similarity Measurement (1)
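Use Pearson correlation as the similarity measure
$corr(X, Y) = \frac{E(XY) - E(X)E(Y)}{\sigma_x \sigma_y}$
Regard two streams as two different random variables:
$corr(S_i, S_j) = \frac{\sum_t S_i[t] S_j[t] - \frac{1}{n} \sum_t S_i[t] \sum_t S_j[t]}{\sqrt{\sum_t (S_i[t] - \bar{S}_i)^2} \sqrt{\sum_t (S_j[t] - \bar{S}_j)^2}}$
where n is the number of time points and $\bar{S}_i$ is the mean of $S_i$.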
Similarity Measurement (2)
Definition 4.2. Given two streams Si and Sj and a weight function w(t), the weighted correlation coefficient between these two streams is defined as:
$wcorr(S_i, S_j) = \frac{\sum_t w(t) S_i[t] S_j[t] - \frac{\sum_t w(t) S_i[t] \sum_t w(t) S_j[t]}{\sum_t w(t)}}{\sqrt{A(S_i)} \sqrt{A(S_j)}}$
$A(S_x) = \sum_t w(t) (S_x[t])^2 - \frac{(\sum_t w(t) S_x[t])^2}{\sum_t w(t)}$
w(t): a monotonically nondecreasing function of the time index t
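A direct, non-incremental reading of Definition 4.2 can be sketched in Python as below. The weights are passed as a plain list, and the function name and sample data are assumptions for illustration; with all weights equal to 1 the value coincides with the ordinary Pearson correlation.

```python
import math


def weighted_corr(si, sj, w):
    """Weighted correlation of two equal-length streams with weights w[t]."""
    sw = sum(w)
    swi = sum(wt * a for wt, a in zip(w, si))
    swj = sum(wt * b for wt, b in zip(w, sj))
    swij = sum(wt * a * b for wt, a, b in zip(w, si, sj))
    a_i = sum(wt * a * a for wt, a in zip(w, si)) - swi ** 2 / sw   # A(S_i)
    a_j = sum(wt * b * b for wt, b in zip(w, sj)) - swj ** 2 / sw   # A(S_j)
    return (swij - swi * swj / sw) / (math.sqrt(a_i) * math.sqrt(a_j))


# Uniform weights: a perfectly linear pair has correlation 1.
uniform = weighted_corr([1, 2, 3, 4], [2, 4, 6, 8], [1, 1, 1, 1])
# Nondecreasing weights favour recent values; a perfectly inverse pair gives -1.
recent = weighted_corr([1, 2, 3], [3, 2, 1], [0.25, 0.5, 1.0])
```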
Similarity Measurement (3)
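Definition 4.3. Given two streams Si and Sj and a weight function w(t), the WC vector of Si and Sj is defined as:
$WC_{S_i S_j} = (WC^1_{t_k}, WC^2_{t_k}, WC^3_{t_k}, WC^4_{t_k}, WC^5_{t_k}, t_k)$
$WC^1_{t_k} = \sum_{t=t_0}^{t_k} w(t) S_i[t]$
$WC^2_{t_k} = \sum_{t=t_0}^{t_k} w(t) (S_i[t])^2$
$WC^3_{t_k} = \sum_{t=t_0}^{t_k} w(t) S_i[t] S_j[t]$
$WC^4_{t_k} = \sum_{t=t_0}^{t_k} w(t) S_j[t]$
$WC^5_{t_k} = \sum_{t=t_0}^{t_k} w(t) (S_j[t])^2$
The weighted correlation follows from the WC vector:
$wcorr(S_i, S_j) = \frac{WC^3 - \frac{WC^1 \cdot WC^4}{\sum w(t)}}{\sqrt{A(S_i)} \sqrt{A(S_j)}}$
$A(S_i) = WC^2 - \frac{(WC^1)^2}{\sum w(t)}$
$A(S_j)$ is obtained in the same way from $WC^5$ and $WC^4$.
Weight function: $w(t) = \lambda^{(t_{now} - t)}, \quad \lambda \in (0, 1]$, where λ denotes the decay base.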
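To illustrate how the WC vector carries everything needed for the weighted correlation, here is a small Python sketch. It assumes an exponential weight with decay base `lam` in (0, 1], t0 = 0, and t_now = tk, so that the sum of weights has a closed form; the function names are hypothetical.

```python
import math


def wc_vector(si, sj, lam):
    """Build (WC1, ..., WC5, tk) for streams observed at t = 0..tk."""
    tk = len(si) - 1
    w = [lam ** (tk - t) for t in range(tk + 1)]   # weight of time t, with now = tk
    wc1 = sum(wt * a for wt, a in zip(w, si))
    wc2 = sum(wt * a * a for wt, a in zip(w, si))
    wc3 = sum(wt * a * b for wt, a, b in zip(w, si, sj))
    wc4 = sum(wt * b for wt, b in zip(w, sj))
    wc5 = sum(wt * b * b for wt, b in zip(w, sj))
    return (wc1, wc2, wc3, wc4, wc5, tk)


def weight_sum(lam, tk):
    """Sum of lam ** (tk - t) for t = 0..tk (geometric series)."""
    n = tk + 1
    return float(n) if lam == 1.0 else (1 - lam ** n) / (1 - lam)


def wcorr_from_wc(wc, lam):
    """Weighted correlation computed from the WC vector alone."""
    wc1, wc2, wc3, wc4, wc5, tk = wc
    sw = weight_sum(lam, tk)
    a_i = wc2 - wc1 ** 2 / sw
    a_j = wc5 - wc4 ** 2 / sw
    return (wc3 - wc1 * wc4 / sw) / (math.sqrt(a_i) * math.sqrt(a_j))


wc_plain = wc_vector([1, 2, 3, 4], [2, 4, 6, 8], lam=1.0)
corr_decayed = wcorr_from_wc(wc_vector([1, 2, 3, 4], [2, 4, 6, 8], lam=0.9), lam=0.9)
```

Only six numbers per stream pair are kept, so the raw data does not have to be revisited to evaluate the correlation.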
Similarity Measurement (4)
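Similarity update
  Update the WC vector when a new end point is generated
  Incremental update instead of a linear scan of the data streams
On the newest segment the stream is approximated by $S_i^{app}[t] = a_i^{now} t + b_i^{now}$, so the segment's contribution to $WC^1$ (the sum of $w(t) S_i[t]$) is
$\sum_{t=t_s+1}^{t_e} w(t) S_i^{app}[t] = \sum_{t=t_s+1}^{t_e} w(t) (a_i^{now} t + b_i^{now}) = a_i^{now} \sum_{t=t_s+1}^{t_e} w(t) t + b_i^{now} \sum_{t=t_s+1}^{t_e} w(t)$
The WC vector advances from $((WC^m_{t_s})_{m=1 \sim 5}, t_s)$ to $((WC^m_{t_e})_{m=1 \sim 5}, t_e)$.
[Figure: summaries of Si and Sj with the newest segment between ts and te]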
Similarity Measurement (5)
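With $w(t) = \lambda^{(t_{now} - t)}$ and $t_{now} = t_e$:
$WC^1_{t_e} = \sum_{t=t_0}^{t_e} w(t) S_i^{app}[t] = \sum_{t=t_0}^{t_e} \lambda^{(t_e - t)} S_i^{app}[t]$
$= \lambda^{(t_e - t_s)} \sum_{t=t_0}^{t_s} \lambda^{(t_s - t)} S_i^{app}[t] + \sum_{t=t_s+1}^{t_e} \lambda^{(t_e - t)} S_i^{app}[t]$
$= \lambda^{(t_e - t_s)} WC^1_{t_s} + WC^1_{t_s+1 \sim t_e}$
where $WC^1_{t_s+1 \sim t_e} = \sum_{t=t_s+1}^{t_e} \lambda^{(t_e - t)} S_i^{app}[t]$ is the contribution of the new segment. The same holds for every component, so for m = 1, …, 5:
$WC_{S_i S_j} = ((\lambda^{(t_e - t_s)} WC^m_{t_s} + WC^m_{t_s+1 \sim t_e})_{m=1 \sim 5}, t_e)$
[Figure: summaries of Si and Sj with the newest segment from ts + 1 to te]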
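The incremental update can be checked numerically with a short Python sketch. It assumes an exponential weight with decay base `lam`, t0 = 0, and t_now equal to the latest time stamp; `batch_wc` recomputes everything from scratch only to serve as a reference, and both function names are hypothetical.

```python
def batch_wc(si, sj, lam):
    """Reference: WC vector computed from all values at t = 0..tk."""
    tk = len(si) - 1
    terms = [0.0] * 5
    for t, (a, b) in enumerate(zip(si, sj)):
        w = lam ** (tk - t)
        for m, term in enumerate((a, a * a, a * b, b, b * b)):
            terms[m] += w * term
    return tuple(terms) + (tk,)


def advance_wc(wc, seg_i, seg_j, lam):
    """Advance (WC1..WC5, ts) to (WC1..WC5, te) using only values at ts+1..te."""
    *old, ts = wc
    te = ts + len(seg_i)
    decay = lam ** (te - ts)            # ages the sums accumulated up to ts
    fresh = [0.0] * 5                   # contribution of the new segment
    for k, (a, b) in enumerate(zip(seg_i, seg_j)):
        w = lam ** (te - (ts + 1 + k))
        for m, term in enumerate((a, a * a, a * b, b, b * b)):
            fresh[m] += w * term
    return tuple(decay * o + f for o, f in zip(old, fresh)) + (te,)


si = [1.0, 2.0, 4.0, 3.0, 5.0]
sj = [2.0, 1.0, 3.0, 5.0, 4.0]
wc_ts = batch_wc(si[:3], sj[:3], lam=0.5)              # state at ts = 2
wc_te = advance_wc(wc_ts, si[3:], sj[3:], lam=0.5)     # state at te = 4
wc_ref = batch_wc(si, sj, lam=0.5)
```

In the summarized setting the segment values would come from the approximation line rather than from the raw stream; plain lists are used here to keep the check short.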
COMET-CORE Framework (1)
Definition 5.1. Assume that the centers of two clusters Ci and Cj are represented by the end point sequences $\hat{S}_i$ and $\hat{S}_j$, respectively. Then the WC vector of the two clusters, denoted by $WC_{C_i C_j}$, is equal to $WC_{S_i S_j}$. The weighted correlation between Ci and Cj, denoted by wcorr(Ci, Cj), is equal to wcorr(Si, Sj).
[Figure: COMET-CORE flow. A stream encounters a new end point, then split cluster, then merge cluster]
COMET-CORE Framework (2)
Split cluster
[Figure: a cluster Ck containing trigger streams is split, via Ctmp, into three new groups Cnew1, Cnew2, Cnew3]
Steps shown in the figure:
  Update the weighted correlation
  Compare the correlation with δa: new trigger groups and non-trigger streams
  Compare the correlation between each non-trigger stream and the representative stream with δa
COMET-CORE Framework (3)
Assign WC vectors to newly generated clusters
  Type 1: Ci and Cj originally belonged to the same cluster.
  Type 2: Ci and Cj originally belonged to different clusters.
  Type 3: Ci is a newly generated cluster and Co is an originally existing one.
[Figure: before the split, C1 = {S1, …, S7} and C11 = {S11, …, S14}, alongside Cx and Cy. After the split, C1 = {S1, S2, S3}, C4 = {S4, S5}, C6 = {S6, S7}, C11 = {S11, S12}, C14 = {S13, S14}; Cx and Cy are unchanged.]
  (a) Type 1: $WC_{C_1 C_4} = WC_{S_1 S_4}$
  (b) Type 2: $WC_{C_4 C_{14}} = WC_{C_1 C_{11}}$
  (c) Type 3: $WC_{C_4 C_y} = WC_{C_1 C_y}$
COMET-CORE Framework (4)
Merge cluster
  Performed after splitting and updating the inter-cluster correlations
  Two clusters are merged if their correlation ≥ δe; merging repeats until no such cluster pair exists
[Figure: clusters C1, C2, Ck with wcorr(C1, C2) and wcorr(C2, Ck); since wcorr(C1, C2) ≥ δe, C1 and C2 are merged into Cnew]
wcorr(Cnew, Ck) = min(wcorr(C1, Ck), wcorr(C2, Ck))
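The merge step can be sketched in Python as follows. The data layout (a dict of clusters and a dict of pairwise correlations keyed by `frozenset`) and the choice to merge the most correlated pair first are assumptions for illustration; the slide only states the threshold test and the min rule.

```python
def merge_clusters(clusters, corr, delta_e):
    """Merge clusters until no pair has correlation >= delta_e.

    clusters: dict name -> set of stream ids
    corr:     dict frozenset({name1, name2}) -> wcorr, one entry per pair
    """
    clusters = {name: set(members) for name, members in clusters.items()}
    corr = dict(corr)
    while len(clusters) > 1:
        pair, best = max(corr.items(), key=lambda kv: kv[1])
        if best < delta_e:
            break
        c1, c2 = sorted(pair)
        new = c1 + "+" + c2                      # hypothetical name for Cnew
        members = clusters.pop(c1) | clusters.pop(c2)
        del corr[pair]
        for ck in clusters:
            v1 = corr.pop(frozenset((c1, ck)))
            v2 = corr.pop(frozenset((c2, ck)))
            corr[frozenset((new, ck))] = min(v1, v2)   # the min rule
        clusters[new] = members
    return clusters, corr


clusters = {"C1": {"S1"}, "C2": {"S2"}, "Ck": {"S3"}}
corr = {
    frozenset(("C1", "C2")): 0.9,
    frozenset(("C1", "Ck")): 0.3,
    frozenset(("C2", "Ck")): 0.6,
}
merged, merged_corr = merge_clusters(clusters, corr, delta_e=0.8)
```

Taking the minimum is the conservative choice: the merged cluster is treated as no more correlated with Ck than its less correlated half.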
Empirical Studies (1)
Clustering algorithms
  Basic: periodic agglomerative clustering
  ODAC: periodic hierarchical clustering
  COMET-CORE
[Figure: from all streams to the clustering result; conditions shown: "Dissimilarity > Threshold" and "2Dis(P) - (Dis(C1) + Dis(C2)) < Threshold"]
Empirical Studies (2)
Clustering quality measurement: silhouette validation
$sil(S_i) = \frac{b(S_i) - a(S_i)}{\max\{a(S_i), b(S_i)\}}$
  a(Si) is the average dissimilarity of stream Si to all other streams in the same cluster
  b(Si) is the average dissimilarity of stream Si to all streams in the closest other cluster
Cluster silhouette
$sil(C_k) = \frac{1}{m} \sum_{S_i \in C_k} sil(S_i)$, where Ck owns m streams
Global silhouette
$GS = \frac{1}{p} \sum_{k=1}^{p} sil(C_k)$, with p clusters in total
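A compact Python sketch of the silhouette measures is given below. The dissimilarity is passed in as a function because the slides do not state which one is used; scoring a singleton cluster as 0 is a common convention assumed here, and the example uses scalars with absolute difference purely for illustration.

```python
def global_silhouette(clusters, dis):
    """Global silhouette GS for a list of clusters (each a list of items).

    dis(x, y) returns the dissimilarity of two items. Needs at least two
    clusters, and a and b must not both be zero for any item.
    """
    cluster_sils = []
    for k, ck in enumerate(clusters):
        sils = []
        for i, s in enumerate(ck):
            others = [x for i2, x in enumerate(ck) if i2 != i]
            if not others:
                sils.append(0.0)     # assumed convention for singleton clusters
                continue
            a = sum(dis(s, x) for x in others) / len(others)
            b = min(
                sum(dis(s, x) for x in cj) / len(cj)
                for j, cj in enumerate(clusters)
                if j != k
            )
            sils.append((b - a) / max(a, b))
        cluster_sils.append(sum(sils) / len(sils))   # cluster silhouette
    return sum(cluster_sils) / len(cluster_sils)     # global silhouette


gs = global_silhouette([[0.0, 1.0], [10.0, 11.0]], dis=lambda x, y: abs(x - y))
```

Values close to 1 indicate compact, well separated clusters; values near 0 or below indicate overlapping ones.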
Empirical Studies (3)
Evaluation on real data
  δa = δe = 0.5
Data Sets
Empirical Studies (4)
Evaluation on the Cylinder-Bell-Funnel data set
  δa = δe = 0.8
  6 types, 100 streams for each type (600 streams in total), each stream 128 points long
  Normally distributed random numbers ranging from 0 to 1 are added to each stream
Empirical Studies (5)
Evaluation on the random walk data set
  δa = δe = 0.7
  Period = 200 data points (Basic and ODAC)
[Figure: two panels, 1. number of streams and 2. number of clusters; labels: "20000 points in each stream", "Fixed 500 streams", "Almost independent of cluster number"]
Conclusion
The paper proposes COMET-CORE, a novel and efficient online framework for clustering over multiple evolving streams.
COMET-CORE uses efficient split and merge algorithms to modify clusters while keeping good clustering quality.