Agenda
Abstract, Introduction, Preliminaries, Drifting concept detection, Clustering relationship analysis, Experimental results, Conclusions
Abstract
The problem of how to allocate unlabeled data points into proper clusters remains a challenging issue in the categorical domain.
Abstract
In this paper, a mechanism named MAximal Resemblance Data Labeling (abbreviated as MARDL) is proposed to allocate each unlabeled data point into the corresponding appropriate cluster.
MARDL has two advantages: 1) MARDL exhibits high execution efficiency, and 2) MARDL can achieve high intracluster similarity and low intercluster similarity.
Introduction
As the concepts behind the data evolve with time, the underlying clusters may also change considerably with time.
Previous works on clustering categorical data focus on clustering the entire data set and do not take drifting concepts into consideration.
The problem of clustering time-evolving data in the categorical domain remains a challenging issue.
Introduction
We propose a practical categorical clustering representative, named “Node Importance Representative” (NIR).
NIR represents clusters by measuring the importance of each attribute value in the clusters.
Introduction
Based on NIR, we propose the “Drifting Concept Detection” (DCD) algorithm.
In DCD, the incoming categorical data points of the current sliding window are first allocated into the corresponding proper clusters of the last clustering result.
If the distribution changes beyond the given criteria, the concepts are said to drift.
Introduction
The framework presented in this paper not only detects the drifting concepts in the categorical data but also explains the drifting concepts by analyzing the relationship between clustering results at different times.
The analyzing algorithm is named “Cluster Relationship Analysis” (CRA).
Preliminaries
The problem of clustering categorical time-evolving data is formulated as follows:
A series of categorical data points forms the data set D, and each data point is a vector of categorical attribute values.
Preliminaries
Suppose that the window size N is also given. The data set D is separated into several continuous subsets S^t, where the superscript t is the identification number of the sliding window; t is also called the time stamp in this paper. For example, the first N data points in D are located in the first subset S^1.
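As a minimal sketch of this partition (the function name and the list-of-tuples data layout are illustrative assumptions, not the paper's implementation):

```python
# Sketch: split a categorical data set D into continuous subsets S^t of size N.
def sliding_windows(D, N):
    """Yield (t, S_t): the time stamp t = 1, 2, ... and the t-th subset."""
    for t, start in enumerate(range(0, len(D), N), start=1):
        yield t, D[start:start + N]

# Example: with N = 3, S^1 holds the first three data points.
D = [("A", "E"), ("A", "F"), ("B", "E"), ("B", "F"), ("A", "E"), ("B", "E")]
for t, S_t in sliding_windows(D, N=3):
    print(t, S_t)
```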
Preliminaries
The objective of the framework is to perform clustering on the data set D, detect the drifting concepts between consecutive subsets S^{t-1} and S^t, and analyze the relationship between different clustering results.
Preliminaries
The basic idea behind NIR is to represent a cluster as the distribution of the attribute values, which are called “nodes”.
The importance of a node is evaluated based on the following two concepts:
The node is important in the cluster when the frequency of the node is high in this cluster.
The node is important in the cluster if the node appears prevalently in this cluster rather than in other clusters.
Preliminaries
Definition 1 (node). A node, I_r, is defined as an attribute name together with an attribute value.
Example: if the age is in the range 50-59 and the weight is also in the range 50-59, the attribute value 50-59 is confusing when it is separated from the attribute name.
The nodes [age = 50-59] and [weight = 50-59] avoid this ambiguity.
Preliminaries
Definition 2 (node importance). The importance value of node I_r in cluster c_i is calculated as:

w(c_i, I_r) = p(I_r, c_i) * f(I_r)

where p(I_r, c_i) is the probability that node I_r occurs in cluster c_i, and the weighting function f(I_r) measures how concentrated the node is among the k clusters:

f(I_r) = 1 - (1 / log k) * Σ_{y=1}^{k} [ -p_y(I_r) * log p_y(I_r) ]

with p_y(I_r) = p(I_r, c_y) / Σ_{z=1}^{k} p(I_r, c_z). A node that occurs in only one cluster obtains f(I_r) = 1; a node spread evenly over all clusters obtains f(I_r) = 0.
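A minimal Python sketch of this definition (the function names and the per-cluster list-of-tuples layout are assumptions for illustration):

```python
import math

def node_probability(cluster, node):
    """p(I_r, c_i): fraction of the cluster's points containing node = (attr_index, value)."""
    attr, value = node
    return sum(1 for point in cluster if point[attr] == value) / len(cluster)

def node_importance(clusters, node):
    """w(c_i, I_r) = p(I_r, c_i) * f(I_r) for every cluster c_i (Definition 2)."""
    k = len(clusters)
    probs = [node_probability(c, node) for c in clusters]
    total = sum(probs)
    if total == 0:
        return [0.0] * k          # the node occurs in no cluster
    if k == 1:
        return probs[:]           # f is undefined for k = 1; treat it as 1
    dist = [p / total for p in probs]                      # p_y(I_r)
    entropy = -sum(p * math.log(p) for p in dist if p > 0)
    f = 1 - entropy / math.log(k)  # 1 if concentrated in one cluster, 0 if spread evenly
    return [p * f for p in probs]
```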
Preliminaries
Example: suppose cluster c_1 contains three data points, all of which carry node [A1 = A], while cluster c_2 contains two data points, none of which carries it. The node occurs only in c_1, so:

f([A1 = A]) = 1 - (1 / log 2) * [ -(1) * log(1) ] = 1

w(c_1, [A1 = A]) = (3/3) * 1 = 1
w(c_2, [A1 = A]) = (0/2) * 1 = 0
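Using the node_importance sketch above, these numbers can be reproduced:

```python
# c1: three points carrying [A1 = A]; c2: two points without it.
c1 = [("A", "E"), ("A", "F"), ("A", "E")]
c2 = [("B", "E"), ("B", "F")]
print(node_importance([c1, c2], node=(0, "A")))  # -> [1.0, 0.0]
```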
Drifting Concept Detection
The objective of the DCD algorithm is to detect the difference of cluster distributions between the current data subset S^t and the last clustering result C^{[te, t-1]}, and to decide whether reclustering S^t is required.
Drifting Concept Detection
The goal of data labeling is to decide the most appropriate cluster label for each incoming data point.
Definition 3 (resemblance and maximal resemblance). Given a data point p_j and an NIR table of k clusters, the resemblance of p_j to cluster c_i is:

R(p_j, c_i) = Σ_{r=1}^{q} w(c_i, I_r)

where I_1, ..., I_q are the q nodes contained in p_j. The data point p_j is labeled to the cluster that obtains the maximal resemblance.
Drifting Concept Detection
When a data point p_j contains nodes that are more important in cluster c_x than in cluster c_y, R(p_j, c_x) will be larger than R(p_j, c_y).
If the maximal resemblance (i.e., that of the most appropriate cluster) is smaller than the threshold of that cluster, the data point is regarded as an outlier:

Label(p_j) = c_i* with i* = argmax_{i=1,...,k} R(p_j, c_i), if the maximal resemblance reaches the threshold; outlier, otherwise.
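A sketch of this labeling rule, assuming an importance table precomputed with the NIR sketch above (a single global threshold is used here for simplicity):

```python
def resemblance(point, cluster_index, importance):
    """R(p_j, c_i): sum of w(c_i, I_r) over the nodes I_r contained in p_j.
    importance maps a node (attr_index, value) to its list of per-cluster weights."""
    return sum(importance[(a, v)][cluster_index]
               for a, v in enumerate(point) if (a, v) in importance)

def label(point, k, importance, threshold):
    """Return the index of the maximal-resemblance cluster, or 'outlier'."""
    scores = [resemblance(point, i, importance) for i in range(k)]
    best = max(range(k), key=scores.__getitem__)
    return best if scores[best] >= threshold else "outlier"
```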
Drifting Concept Detection
Example: for data point p_6 = (B, E, G), the resemblances R(p_6, c_1) and R(p_6, c_2) are both 0. Since both are smaller than the threshold (0.5), this data point is regarded as an outlier.
For data point p_7 = (X, M, P), R(p_7, c_1) = 0.029 and R(p_7, c_2) = 0.029 + 0.5 + 1 = 1.529. Since 1.529 > 0.029 and exceeds the threshold (0.5), this data point is labeled to the second cluster.
Drifting Concept Detection
The clustering results are said to be different according to the following two criteria:
The clustering results are different if quite a large number of outliers are found by data labeling.
The clustering results are different if quite a large number of clusters are varied in the ratio of data points.
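A sketch of this decision with illustrative parameter names (the outlier-ratio, per-cluster variation, and varied-cluster-ratio thresholds are assumed knobs, not the paper's exact notation):

```python
def concept_drifts(labels, last_ratios, outlier_thr, variation_thr, cluster_thr):
    """labels: data-labeling output for the current window (cluster index or 'outlier').
    last_ratios: per-cluster data-point ratios in the last clustering result."""
    n = len(labels)
    # Criterion 1: too many outliers.
    if sum(1 for l in labels if l == "outlier") / n > outlier_thr:
        return True
    # Criterion 2: too many clusters whose data-point ratio varied.
    k = len(last_ratios)
    new_ratios = [sum(1 for l in labels if l == i) / n for i in range(k)]
    varied = sum(1 for old, new in zip(last_ratios, new_ratios)
                 if abs(old - new) > variation_thr)
    return varied / k > cluster_thr
```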
Drifting Concept Detection
There are three outliers in S^2, and the ratio of outliers in S^2 is 3/5 = 0.6 > 0.4. Therefore, S^2 is considered as a concept-drifting window, and reclustering is performed.
Drifting Concept Detection
The ratio of outliers in S^3 is 1/5 = 0.2, which does not exceed the outlier threshold (0.4). However, the ratios of data points in the clusters vary considerably between the last clustering result and S^3, and the ratio of varied clusters is 2/2 = 1 > 0.5.
Therefore, S^3 is also considered as a concept-drifting window.
Drifting Concept Detection
The bottlenecks of the execution time in DCD may occur in the reclustering step when the concept drifts, and in the NIR table updating step when the concept does not drift.
If we can obtain prior knowledge from domain experts, such as how frequently the concepts of the data drift, it can help us to set proper parameter values.
Clustering relationship analysis
CRA measures the similarity of clusters between the clustering results at different time stamps.
CRA links similar clusters when their similarity is higher than the threshold.
CRA will provide clues for us to catch the time-evolving trends in the data set.
Node Importance Vector and Cluster Distance
Each cluster c_i is represented as a node importance vector.
The dimensions of all the vectors are the same.
Example
Vector space (14 nodes): [A1=A], [A1=B], [A1=X], [A1=Y], [A1=Z], [A2=E], [A2=F], [A2=M], [A2=N], [A3=C], [A3=D], [A3=G], [A3=P], [A3=T]
Cosine measure
Calculate the cosine of the angle between two vectors as the measure of similarity.
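A one-function sketch of the standard cosine measure between two node importance vectors of equal dimension:

```python
import math

def cosine(u, v):
    """Cosine of the angle between vectors u and v: (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```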
Example
The similarity between vectors c_1^1 and c_1^2 is computed by the cosine measure.
Visualizing the Evolving Clusters
[Figure: clustering results visualized on a cluster-versus-time grid, linking similar clusters across time stamps]
Experimental Results-Test Environment
Synthetic data sets: a numerical data set is generated and clustered; a drifting concept is generated by combining two different clustering results.
Experimental Results-Test Environment
Real data set (KDD-CUP’99 Network Intrusion Detection)
Each record is either a normal connection or an attack. A drifting concept is counted only when the change continues for at least 300 connections.
493,857 records; each record contains 42 attributes; 33 drifting concepts in total.
Evaluation on Efficiency
The number of drifting concepts directly impacts the execution time of DCD.
The execution time of DCD is shorter than that of EM; the remaining parameters have only a little influence.
[Figure: execution time vs. number of drifting concepts; dimensionality = 20, # of clusters = 20, N = 500]
Evaluation on Scalability
The bottleneck is the number of drifting concepts that require reclustering.
[Figure: execution time vs. data size; data size = 50,000, N = 500, # of clusters = 20, dimensionality = 20]
Evaluation on Accuracy
Test the accuracy of the drifting concepts that are detected by DCD.
The CU (category utility) function maximizes both the probability that data points in the same cluster share the same attribute values and the probability that data points in different clusters have different attribute values.
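Assuming the standard definition of category utility (Gluck and Corter), the function being maximized can be written as:

$$\mathrm{CU} = \frac{1}{k}\sum_{i=1}^{k} P(c_i)\left[\sum_{a}\sum_{v} P(A_a = v \mid c_i)^2 - \sum_{a}\sum_{v} P(A_a = v)^2\right]$$

The bracketed term is large when points inside cluster c_i share attribute values well beyond the baseline distribution over the whole data set.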
Evaluation on Accuracy
Confusion matrix accuracy (CMA)
Evaluate the clustering results by comparing each output cluster c_i with the original clustering labels j.
CMA is obtained by maximizing the count of (i, j) pairs in which one output cluster c_i is mapped to one original clustering label j.
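A brute-force sketch of CMA (function and variable names are assumptions; the permutation search is acceptable for the small k used here):

```python
from itertools import permutations

def cma(output, truth, k):
    """Confusion matrix accuracy: best one-to-one mapping between output
    clusters and original labels, as a fraction of correctly mapped points."""
    count = [[0] * k for _ in range(k)]
    for o, t in zip(output, truth):
        count[o][t] += 1
    best = max(sum(count[i][perm[i]] for i in range(k))
               for perm in permutations(range(k)))
    return best / len(output)

# Example: two output clusters against two original labels.
print(cma(output=[0, 0, 1, 1], truth=[1, 1, 0, 0], k=2))  # -> 1.0
```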
Accuracy Evaluation on Synthetic Data Set
Each synthetic data set is generated by randomly combining 50 clustering results.
DCD is effective for detecting drifting concepts. When the data set varies dramatically, a smaller N should be used; when the data set is stable, a larger N saves execution time.
[Figure: detection accuracy vs. # of clusters, above 0.8 and highest when k is set to the maximum number of clusters; parameter setting 0.1, 0.1, 0.5; averages of 20 experiments]
Accuracy Evaluation on Synthetic Data Set
Clustering results: DCD vs. EM in setting D_1.
N = 2000, so drifting concepts occur once per five sliding windows (50 * 10000 / 2000 = 250 windows; 250 / 50 = 5).
The variation of CU and CMA when performing EM once is considerably larger than that of DCD.
Accuracy Evaluation on Synthetic Data Set
Clustering results: DCD vs. EM in setting D_2.
The drifting concepts occur irregularly.
DCD performs better than EM when clustering categorical time-evolving data.
Accuracy Evaluation on Real Data Set
A small sliding window size induces a high recall but a slightly lower precision.
Since the data set does not evolve frequently, a larger N is used (parameter setting 0.1, 0.1, 0.5; k = 10; N = 3000).
Accuracy Evaluation on Real Data Set
The records are the same in sliding windows 51-114, 134-149, and 155-160.
The peak values of CU in DCD occur at the time stamps where drifting concepts occur.
DCD is able to quickly reflect the drifting concept and generate better clustering results.
Conclusions
A framework to perform clustering on categorical time-evolving data.
Detects the drifting concepts at different sliding windows by DCD.
CRA analyzes and shows the changes between different clustering results.
Shows the relationship between clustering results by visualization.
DCD can provide high-quality clustering results with correctly detected drifting concepts.