Agenda
Abstract, Introduction, Preliminaries, Drifting concept detection, Clustering relationship analysis, Experimental results, Conclusions
Abstract
The problem of how to allocate unlabeled data points into proper clusters remains a challenging issue in the categorical domain.
Abstract
In this paper, a mechanism named MAximal Resemblance Data Labeling (abbreviated as MARDL) is proposed to allocate each unlabeled data point into the corresponding appropriate cluster.
MARDL has two advantages: 1) MARDL exhibits high execution efficiency, and 2) MARDL can achieve high intracluster similarity and low intercluster similarity.
Introduction
As the concepts behind the data evolve with time, the underlying clusters may also change considerably with time.
Previous works on clustering categorical data focus on clustering the entire data set and do not take drifting concepts into consideration.
The problem of clustering time-evolving data in the categorical domain remains a challenging issue.
Introduction
We propose a practical categorical clustering representative, named “Node Importance Representative” (NIR).
NIR represents clusters by measuring the importance of each attribute value in the clusters.
Introduction
Based on NIR, we propose the “Drifting Concept Detection” (DCD) algorithm.
In DCD, the incoming categorical data points of the current sliding window are first allocated into the corresponding proper clusters of the last clustering result.
If the distribution changes beyond the given criteria, the concepts are said to drift.
Introduction
The framework presented in this paper not only detects the drifting concepts in the categorical data but also explains the drifting concepts by analyzing the relationship between clustering results at different times.
The analyzing algorithm is named “Cluster Relationship Analysis” (CRA).
Preliminaries
The problem of clustering categorical time-evolving data is formulated as follows:
A series of categorical data points forms the data set D, and each data point is a vector of categorical attribute values.
Preliminaries
Suppose that the window size N is also given. The data set D is separated into several continuous subsets S^t, where the superscript t is the identification number of the sliding window; t is also called the time stamp in this paper. For example, the first N data points in D are located in the first subset S^1.
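As a minimal sketch of this partition (the function name and the list-of-tuples data layout are illustrative assumptions, not the paper's implementation):

```python
# Sketch: split a categorical data set D into continuous subsets S^t of size N.
def sliding_windows(D, N):
    """Yield (t, S_t): the time stamp t = 1, 2, ... and the t-th subset."""
    for t, start in enumerate(range(0, len(D), N), start=1):
        yield t, D[start:start + N]

# Example: with N = 3, S^1 holds the first three data points.
D = [("A", "E"), ("A", "F"), ("B", "E"), ("B", "F"), ("A", "E"), ("B", "E")]
for t, S_t in sliding_windows(D, N=3):
    print(t, S_t)
```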
Preliminaries
The objective of the framework is to perform clustering on the data set D, detect the drifting concepts between consecutive subsets S^{t-1} and S^t, and analyze the relationship between different clustering results.
Preliminaries
The basic idea behind NIR is to represent a cluster as the distribution of the attribute values, which are called “nodes”.
The importance of a node is evaluated based on the following two concepts:
The node is important in the cluster when the frequency of the node is high in this cluster.
The node is important in the cluster if the node appears prevalently in this cluster rather than in other clusters.
Preliminaries
Definition 1 (node). A node, I_r, is defined as an attribute name together with an attribute value.
Example: if the age is in the range 50-59 and the weight is also in the range 50-59, the attribute value 50-59 is confusing when it is separated from the attribute name.
The nodes [age = 50-59] and [weight = 50-59] avoid this ambiguity.
Preliminaries
Definition 2 (node importance). The importance value of node I_r in cluster c_i is calculated as:

w(c_i, I_r) = p(I_r, c_i) * f(I_r)

where p(I_r, c_i) is the probability that node I_r occurs in cluster c_i, and the weighting function f(I_r) measures how concentrated the node is among the k clusters:

f(I_r) = 1 - (1 / log k) * Σ_{y=1}^{k} [ -p_y(I_r) * log p_y(I_r) ]

with p_y(I_r) = p(I_r, c_y) / Σ_{z=1}^{k} p(I_r, c_z). A node that occurs in only one cluster obtains f(I_r) = 1; a node spread evenly over all clusters obtains f(I_r) = 0.
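A minimal Python sketch of this definition (the function names and the per-cluster list-of-tuples layout are assumptions for illustration):

```python
import math

def node_probability(cluster, node):
    """p(I_r, c_i): fraction of the cluster's points containing node = (attr_index, value)."""
    attr, value = node
    return sum(1 for point in cluster if point[attr] == value) / len(cluster)

def node_importance(clusters, node):
    """w(c_i, I_r) = p(I_r, c_i) * f(I_r) for every cluster c_i (Definition 2)."""
    k = len(clusters)
    probs = [node_probability(c, node) for c in clusters]
    total = sum(probs)
    if total == 0:
        return [0.0] * k          # the node occurs in no cluster
    if k == 1:
        return probs[:]           # f is undefined for k = 1; treat it as 1
    dist = [p / total for p in probs]                      # p_y(I_r)
    entropy = -sum(p * math.log(p) for p in dist if p > 0)
    f = 1 - entropy / math.log(k)  # 1 if concentrated in one cluster, 0 if spread evenly
    return [p * f for p in probs]
```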
Preliminaries
Example: suppose cluster c_1 contains three data points, all of which carry node [A1 = A], while cluster c_2 contains two data points, none of which carries it. The node occurs only in c_1, so:

f([A1 = A]) = 1 - (1 / log 2) * [ -(1) * log(1) ] = 1

w(c_1, [A1 = A]) = (3/3) * 1 = 1
w(c_2, [A1 = A]) = (0/2) * 1 = 0
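Using the node_importance sketch above, these numbers can be reproduced:

```python
# c1: three points carrying [A1 = A]; c2: two points without it.
c1 = [("A", "E"), ("A", "F"), ("A", "E")]
c2 = [("B", "E"), ("B", "F")]
print(node_importance([c1, c2], node=(0, "A")))  # -> [1.0, 0.0]
```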
Drifting Concept Detection
The objective of the DCD algorithm is to detect the difference of cluster distributions between the current data subset S^t and the last clustering result C^{[te, t-1]}, and to decide whether reclustering S^t is required.
Drifting Concept Detection
The goal of data labeling is to decide the most appropriate cluster label for each incoming data point.
Definition 3 (resemblance and maximal resemblance). Given a data point p_j and an NIR table of k clusters, the resemblance of p_j to cluster c_i is:

R(p_j, c_i) = Σ_{r=1}^{q} w(c_i, I_r)

where I_1, ..., I_q are the q nodes contained in p_j. The data point p_j is labeled to the cluster that obtains the maximal resemblance.
Drifting Concept Detection
When a data point p_j contains nodes that are more important in cluster c_x than in cluster c_y, R(p_j, c_x) will be larger than R(p_j, c_y).
If the maximal resemblance (i.e., that of the most appropriate cluster) is smaller than the threshold of that cluster, the data point is regarded as an outlier:

Label(p_j) = c_i* with i* = argmax_{i=1,...,k} R(p_j, c_i), if the maximal resemblance reaches the threshold; outlier, otherwise.
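A sketch of this labeling rule, assuming an importance table precomputed with the NIR sketch above (a single global threshold is used here for simplicity):

```python
def resemblance(point, cluster_index, importance):
    """R(p_j, c_i): sum of w(c_i, I_r) over the nodes I_r contained in p_j.
    importance maps a node (attr_index, value) to its list of per-cluster weights."""
    return sum(importance[(a, v)][cluster_index]
               for a, v in enumerate(point) if (a, v) in importance)

def label(point, k, importance, threshold):
    """Return the index of the maximal-resemblance cluster, or 'outlier'."""
    scores = [resemblance(point, i, importance) for i in range(k)]
    best = max(range(k), key=scores.__getitem__)
    return best if scores[best] >= threshold else "outlier"
```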
Drifting Concept Detection
Example: for data point p_6 = (B, E, G), the resemblances R(p_6, c_1) and R(p_6, c_2) are both 0. Since both are smaller than the threshold (0.5), this data point is regarded as an outlier.
For data point p_7 = (X, M, P), R(p_7, c_1) = 0.029 and R(p_7, c_2) = 0.029 + 0.5 + 1 = 1.529. Since 1.529 > 0.029 and exceeds the threshold (0.5), this data point is labeled to the second cluster.
Drifting Concept Detection
The clustering results are said to be different according to the following two criteria:
The clustering results are different if quite a large number of outliers are found by data labeling.
The clustering results are different if quite a large number of clusters are varied in the ratio of data points.
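A sketch of this decision with illustrative parameter names (the outlier-ratio, per-cluster variation, and varied-cluster-ratio thresholds are assumed knobs, not the paper's exact notation):

```python
def concept_drifts(labels, last_ratios, outlier_thr, variation_thr, cluster_thr):
    """labels: data-labeling output for the current window (cluster index or 'outlier').
    last_ratios: per-cluster data-point ratios in the last clustering result."""
    n = len(labels)
    # Criterion 1: too many outliers.
    if sum(1 for l in labels if l == "outlier") / n > outlier_thr:
        return True
    # Criterion 2: too many clusters whose data-point ratio varied.
    k = len(last_ratios)
    new_ratios = [sum(1 for l in labels if l == i) / n for i in range(k)]
    varied = sum(1 for old, new in zip(last_ratios, new_ratios)
                 if abs(old - new) > variation_thr)
    return varied / k > cluster_thr
```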
Drifting Concept Detection
There are three outliers in S^2, and the ratio of outliers in S^2 is 3/5 = 0.6 > 0.4. Therefore, S^2 is considered as a concept-drifting window, and reclustering is performed.
Drifting Concept Detection
The ratio of outliers in S^3 is 1/5 = 0.2, which does not exceed the outlier threshold (0.4). However, the ratios of data points in the clusters vary considerably between the last clustering result and S^3, and the ratio of varied clusters is 2/2 = 1 > 0.5.
Therefore, S^3 is also considered as a concept-drifting window.
Drifting Concept Detection
The bottlenecks of the execution time in DCD may occur in the reclustering step when the concept drifts, and in the NIR table updating step when the concept does not drift.
If we can obtain prior knowledge from domain experts, such as how frequently the concepts of the data drift, it can help us to set proper parameter values.
Clustering relationship analysis
CRA measures the similarity of clusters between the clustering results at different time stamps.
CRA links similar clusters when their similarity is higher than the threshold.
CRA will provide clues for us to catch the time-evolving trends in the data set.
Node Importance Vector and Cluster Distance
Each cluster c_i is represented as a node importance vector.
The dimensions of all the vectors are the same.
Example
Vector space (14 nodes): [A1=A], [A1=B], [A1=X], [A1=Y], [A1=Z], [A2=E], [A2=F], [A2=M], [A2=N], [A3=C], [A3=D], [A3=G], [A3=P], [A3=T]
Cosine measure
Calculate the cosine of the angle between two vectors as the measure of similarity.
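A one-function sketch of the standard cosine measure between two node importance vectors of equal dimension:

```python
import math

def cosine(u, v):
    """Cosine of the angle between vectors u and v: (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```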
Example
The similarity between vectors c_1^1 and c_1^2 is computed by the cosine measure.
Visualizing the Evolving Clusters
[Figure: clustering results visualized on a cluster-versus-time grid, linking similar clusters across time stamps]
Experimental Results-Test Environment
Synthetic data sets: a numerical data set is generated and clustered; a drifting concept is generated by combining two different clustering results.
Experimental Results-Test Environment
Real data set (KDD-CUP’99 Network Intrusion Detection)
Each record is either a normal connection or an attack. A drifting concept is counted only when the change continues for at least 300 connections.
493,857 records; each record contains 42 attributes; 33 drifting concepts in total.
Evaluation on Efficiency
The number of drifting concepts directly impacts the execution time of DCD.
The execution time of DCD is shorter than that of EM; the remaining parameters have only a little influence.
[Figure: execution time vs. number of drifting concepts; dimensionality = 20, # of clusters = 20, N = 500]
Evaluation on Scalability
The bottleneck is the number of drifting concepts that require reclustering.
[Figure: execution time vs. data size; data size = 50,000, N = 500, # of clusters = 20, dimensionality = 20]
Evaluation on Accuracy
Test the accuracy of the drifting concepts that are detected by DCD.
The CU (category utility) function maximizes both the probability that data points in the same cluster share the same attribute values and the probability that data points in different clusters have different attribute values.
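Assuming the standard definition of category utility (Gluck and Corter), the function being maximized can be written as:

$$\mathrm{CU} = \frac{1}{k}\sum_{i=1}^{k} P(c_i)\left[\sum_{a}\sum_{v} P(A_a = v \mid c_i)^2 - \sum_{a}\sum_{v} P(A_a = v)^2\right]$$

The bracketed term is large when points inside cluster c_i share attribute values well beyond the baseline distribution over the whole data set.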
Evaluation on Accuracy
Confusion matrix accuracy (CMA)
Evaluate the clustering results by comparing each output cluster c_i with the original clustering labels j.
CMA is obtained by maximizing the count of (i, j) pairs in which one output cluster c_i is mapped to one original clustering label j.
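A brute-force sketch of CMA (function and variable names are assumptions; the permutation search is acceptable for the small k used here):

```python
from itertools import permutations

def cma(output, truth, k):
    """Confusion matrix accuracy: best one-to-one mapping between output
    clusters and original labels, as a fraction of correctly mapped points."""
    count = [[0] * k for _ in range(k)]
    for o, t in zip(output, truth):
        count[o][t] += 1
    best = max(sum(count[i][perm[i]] for i in range(k))
               for perm in permutations(range(k)))
    return best / len(output)

# Example: two output clusters against two original labels.
print(cma(output=[0, 0, 1, 1], truth=[1, 1, 0, 0], k=2))  # -> 1.0
```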
Accuracy Evaluation on Synthetic Data Set
Each synthetic data set is generated by randomly combining 50 clustering results.
DCD is effective for detecting drifting concepts. When the data set varies dramatically, a smaller N should be used; when the data set is stable, a larger N saves execution time.
[Figure: detection accuracy vs. # of clusters, above 0.8 and highest when k is set to the maximum number of clusters; parameter setting 0.1, 0.1, 0.5; averages of 20 experiments]
Accuracy Evaluation on Synthetic Data Set
Clustering results: DCD vs. EM in setting D_1.
N = 2000, so drifting concepts occur once per five sliding windows (50 * 10000 / 2000 = 250 windows; 250 / 50 = 5).
The variation of CU and CMA when performing EM once is considerably larger than that of DCD.
Accuracy Evaluation on Synthetic Data Set
Clustering results: DCD vs. EM in setting D_2.
The drifting concepts occur irregularly.
DCD performs better than EM when clustering categorical time-evolving data.
Accuracy Evaluation on Real Data Set
A small sliding window size induces a high recall but a slightly lower precision.
Since the data set does not evolve frequently, a larger N is used (parameter setting 0.1, 0.1, 0.5; k = 10; N = 3000).
Accuracy Evaluation on Real Data Set
The records are the same in sliding windows 51-114, 134-149, and 155-160.
The peak values of CU in DCD occur at the time stamps where drifting concepts occur.
DCD is able to quickly reflect the drifting concept and generate better clustering results.
Conclusions
A framework to perform clustering on categorical time-evolving data.
Detects the drifting concepts at different sliding windows by DCD.
CRA analyzes and shows the changes between different clustering results.
Shows the relationship between clustering results by visualization.
DCD can provide high-quality clustering results with correctly detected drifting concepts.