
Catching the Trend: A Framework for Clustering Concept-Drifting Categorical Time-Evolving Data




Agenda

Abstract
Introduction
Preliminaries
Drifting Concept Detection
Clustering Relationship Analysis
Experimental Results
Conclusions


Abstract

The problem of how to allocate unlabeled data points into proper clusters remains a challenging issue in the categorical domain.


In this paper, a mechanism named MAximal Resemblance Data Labeling (abbreviated as MARDL) is proposed to allocate each unlabeled data point into the corresponding appropriate cluster.

MARDL has two advantages: 1) it exhibits high execution efficiency, and 2) it achieves high intracluster similarity and low intercluster similarity.


Introduction

As the concepts behind the data evolve with time, the underlying clusters may also change considerably with time.

Previous works on clustering categorical data focus on doing clustering on the entire data set and do not take the drifting concepts into consideration.

The problem of clustering time-evolving data in the categorical domain remains a challenging issue.


A practical categorical clustering representative, named the "Node Importance Representative" (NIR), is proposed.

NIR represents clusters by measuring the importance of each attribute value in the clusters.


Based on NIR, we propose the “Drifting Concept Detection”(DCD).

In DCD, the incoming categorical data points in the current sliding window are first allocated into the corresponding proper clusters of the last clustering result. If the distribution has changed (exceeding certain criteria), the concepts are said to drift.


The framework presented in this paper not only detects the drifting concepts in the categorical data but also explains the drifting concepts by analyzing the relationship between clustering results at different times.

The analyzing algorithm is named “Cluster Relationship Analysis” (CRA).


Preliminaries

The problem of clustering categorical time-evolving data is formulated as follows: the input is a series of categorical data sets D, where each data point is described by a set of categorical attributes.


Suppose that the window size N is also given. The data set D is separated into several continuous subsets S^t. The superscript t is the identification number of the sliding window; t is also called the time stamp in this paper. For example, the first N data points in D are located in the first subset S^1.
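As a minimal sketch of this windowing step (the function name is ours, not the paper's), assuming the data set is simply a Python list:

```python
def sliding_windows(D, N):
    """Partition data set D into continuous subsets S^1, S^2, ... of size N."""
    return [D[i:i + N] for i in range(0, len(D), N)]

D = list(range(10))                 # ten toy data points
print(sliding_windows(D, 5))        # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
```

Note that a trailing window may hold fewer than N points when N does not divide the data size evenly.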

The objective of the framework is to perform clustering on the data set D, to detect the drifting concepts between S^t and S^{t+1}, and to analyze the relationship between different clustering results.


The basic idea behind NIR is to represent a cluster as the distribution of the attribute values, which are called “nodes”.

The importance of a node is evaluated based on the following two concepts:
The node is important in the cluster when the frequency of the node is high in this cluster.
The node is important in the cluster if the node appears prevalently in this cluster rather than in other clusters.


Definition 1 (node). A node, I_r, is defined as the combination of an attribute name and an attribute value. For example, if the age is in the range 50-59 and the weight is also in the range 50-59, the attribute value "50-59" is ambiguous once separated from the attribute name; the nodes [age = 50-59] and [weight = 50-59] avoid this ambiguity.

Definition 2 (node importance). The importance value w(c_i, I_r) of node I_r in cluster c_i is calculated as

    w(c_i, I_r) = f(c_i, I_r) * W(I_r),

where f(c_i, I_r) is the fraction of the data points in cluster c_i that contain node I_r,

    p(I_r, c_y) = f(c_y, I_r) / sum_{z=1..k} f(c_z, I_r)

is the distribution of the node over the k clusters, and

    W(I_r) = 1 + (1 / log k) * sum_{y=1..k} p(I_r, c_y) * log p(I_r, c_y)

is one minus the normalized entropy of that distribution, so a node concentrated in few clusters receives a weight close to 1, while a node spread evenly over all clusters receives a weight close to 0.
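A minimal sketch of this node-importance computation, assuming clusters are lists of data points and a data point is a set of node strings (the representation and function name are ours, not the paper's; at least two clusters are assumed):

```python
import math

def node_importance(clusters, i, node):
    """w(c_i, I_r) = f(c_i, I_r) * W(I_r) for the i-th cluster."""
    k = len(clusters)  # k >= 2 assumed, since log k must be nonzero
    # f(c_y, I_r): fraction of data points in cluster c_y containing the node
    f = [sum(node in p for p in c) / len(c) for c in clusters]
    total = sum(f)
    if total == 0:
        return 0.0
    # p(I_r, c_y): distribution of the node over the clusters (zero terms add 0)
    probs = [fy / total for fy in f if fy > 0]
    # W(I_r) = 1 - normalized entropy of that distribution
    weight = 1 + sum(p * math.log(p) for p in probs) / math.log(k)
    return f[i] * weight

c1 = [{"A1=A"}, {"A1=A"}, {"A1=A"}]   # all three points contain [A1=A]
c2 = [{"A1=X"}, {"A1=Y"}]             # no point contains it
print(node_importance([c1, c2], 0, "A1=A"))  # 1.0
print(node_importance([c1, c2], 1, "A1=A"))  # 0.0
```

The toy values reproduce the pattern above: a node held by every point of one cluster and by no point of the other gets full weight in the first cluster and zero in the second.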

For example, suppose k = 2 and node [A1=A] appears in all 3 data points of cluster c_1 but in none of the 2 data points of cluster c_2. Then, using 0 * log 0 = 0,

    W([A1=A]) = 1 + (1 / log 2) * ((3/3) * log(3/3) + (0/3) * log(0/3)) = 1,
    w(c_1, [A1=A]) = (3/3) * 1 = 1,
    w(c_2, [A1=A]) = (0/2) * 1 = 0.


Drifting Concept Detection

The objective of the DCD algorithm is to detect the difference in cluster distributions between the current data subset S^t and the last clustering result C^[t_e, t-1], and to decide whether reclustering of S^t is required.


The goal of data labeling is to decide the most appropriate cluster label for each incoming data point.

Definition 3 (resemblance and maximal resemblance). Given a data point p_j consisting of the nodes I_1, ..., I_q and an NIR table of k clusters, the resemblance of p_j to cluster c_i is

    R(p_j, c_i) = sum_{r=1..q} w(c_i, I_r),

and p_j is labeled to the cluster that obtains the maximal resemblance.

When a data point p_j contains nodes that are more important in cluster c_x than in cluster c_y, R(p_j, c_x) will be larger than R(p_j, c_y).

If the maximal resemblance (i.e., the resemblance to the most appropriate cluster) is smaller than the threshold lambda of that cluster, the data point is regarded as an outlier:

    Label(p_j) = C_i*, if max_{i=1..k} R(p_j, c_i) >= lambda_i*; outlier, otherwise.
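The labeling rule can be sketched as follows; the dictionary-based NIR table and the example importance values are illustrative assumptions, not taken from the paper:

```python
def label(point, nir, thresholds):
    """Return the id of the cluster with maximal resemblance, or 'outlier'."""
    best, best_r = None, float("-inf")
    for cid, table in nir.items():
        # R(p_j, c_i): sum of node importance values over the point's nodes
        r = sum(table.get(node, 0.0) for node in point)
        if r > best_r:
            best, best_r = cid, r
    # Outlier rule: maximal resemblance below that cluster's threshold
    return best if best_r >= thresholds[best] else "outlier"

# Hypothetical NIR table for two clusters (values chosen for illustration)
nir = {1: {"A1=A": 1.0, "A2=E": 0.5}, 2: {"A1=X": 1.0, "A2=M": 0.529}}
lam = {1: 0.5, 2: 0.5}
print(label({"A1=X", "A2=M", "A3=P"}, nir, lam))  # 2 (resemblance 1.529 >= 0.5)
print(label({"A1=B", "A2=Q", "A3=G"}, nir, lam))  # outlier (resemblance 0 < 0.5)
```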

For example, with threshold lambda = 0.5:

p_6 = (B, E, G): R(p_6, c_1^1) = R(p_6, c_2^1) = 0. Both values are smaller than the threshold 0.5, so this data point is an outlier.
p_7 = (X, M, P): R(p_7, c_1^1) = 0.029 and R(p_7, c_2^1) = 1.529. Since 1.529 > 0.029 and it exceeds the threshold 0.5, this data point belongs to the second cluster.


The clustering results are said to be different according to the following two criteria:
The clustering results are different if a large number of outliers are found by data labeling.
The clustering results are different if a large number of clusters vary in their ratio of data points.


There are three outliers in S^2, so the ratio of outliers in S^2 is 3/5 = 0.6 > 0.4 (the outlier threshold). Therefore, S^2 is considered a concept-drifting window and reclustering is performed.
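The two drift criteria can be sketched as follows; the threshold names and the rule for aggregating per-cluster variations are our assumptions, not the paper's exact symbols:

```python
def concept_drifts(labels, prev_ratios, outlier_thr, ratio_thr, varied_thr):
    """labels: per-point label in the current window ('outlier' allowed);
    prev_ratios: {cluster_id: ratio of data points} from the last clustering result."""
    n = len(labels)
    # Criterion 1: too many outliers found by data labeling
    if sum(l == "outlier" for l in labels) / n > outlier_thr:
        return True
    # Criterion 2: too many clusters vary in their ratio of data points
    varied = sum(
        abs(sum(l == cid for l in labels) / n - old) > ratio_thr
        for cid, old in prev_ratios.items()
    )
    return varied / len(prev_ratios) > varied_thr

# S^2 from the slides: 3 outliers out of 5 points -> ratio 0.6 > 0.4 -> drift
print(concept_drifts(["outlier", "outlier", "outlier", 1, 2],
                     {1: 0.6, 2: 0.4}, 0.4, 0.3, 0.5))  # True
```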

The ratio of outliers in S^3 is 1/5 = 0.2, which is below the outlier threshold 0.4. However, the variation of the ratio of data points between clusters exceeds the cluster-variation threshold, so S^3 is also considered a concept-drifting window.


The bottlenecks of the execution time in DCD may occur on the reclustering step when the concept drifts and on the updating NIR table step when the concept does not drift.

If we can obtain prior knowledge, such as the frequency of the drifting concepts in the data, from domain experts, that knowledge can help us set proper parameter values.

Clustering Relationship Analysis

CRA measures the similarity of clusters between the clustering results at different time stamps.

CRA links similar clusters when their similarity is higher than the threshold.

CRA will provide clues for us to catch the time-evolving trends in the data set.


Node Importance Vector and Cluster Distance

Each cluster c_i is represented by a node importance vector over the global space of nodes, so the dimensions of all the vectors are the same.

Example

Vector space (14 nodes): [A1=A], [A1=B], [A1=X], [A1=Y], [A1=Z], [A2=E], [A2=F], [A2=M], [A2=N], [A3=C], [A3=D], [A3=G], [A3=P], [A3=T].


Cosine measure

Calculate the cosine of the angle between two vectors as a measure of their similarity.

Example

The similarity between the vectors of c_1^1 and c_1^2 is computed with the cosine measure.
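A minimal sketch of the cosine measure over sparse node-importance vectors (the dictionary representation and the example values are our assumptions):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse node-importance vectors."""
    dot = sum(w * v.get(node, 0.0) for node, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

# Hypothetical importance vectors for one cluster at two time stamps
u = {"A1=A": 1.0, "A2=E": 0.5}
v = {"A1=A": 1.0, "A2=F": 0.5}
print(round(cosine(u, v), 3))  # 0.8
```

Since node importance values are nonnegative, the similarity lies in [0, 1]; CRA links two clusters when this value exceeds the threshold.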


Visualizing the Evolving Clusters

[Figure: clustering results visualized over time, with one axis for clusters and one for time stamps.]

Experimental Results: Test Environment

Synthetic data sets: a numerical data set is generated and clustered, and a drifting concept is created by combining two different clustering results.

Real data set (KDD-CUP'99 Network Intrusion Detection):
Each record is either a normal connection or an attack.
Drifting concept: the change continues for at least 300 connections.
493,857 records; each record contains 42 attributes.
33 drifting concepts.

Evaluation on Efficiency

The number of drifting concepts directly impacts the execution time of DCD.
The execution time of DCD is shorter than that of EM.
Other factors have only a little influence on the execution time.
Setting: dimensionality = 20; # of clusters = 20; N = 500.

Evaluation on Scalability

Setting: data size = 50,000; N = 500; # of clusters = 20; dimensionality = 20.
Bottleneck: the number of drifting concepts that require reclustering.

Evaluation on Accuracy

Test the accuracy of the drifting concepts detected by DCD.
The category utility (CU) function is maximized when data points in the same cluster share the same attribute values and data points in different clusters have different attribute values.

Confusion matrix accuracy (CMA): the clustering results are evaluated by comparing them with the original clustering labels j, maximizing the count of pairs (i, j) in which one output cluster c_i is mapped to one original clustering label j.

Accuracy Evaluation on Synthetic Data Set

Each synthetic data set is generated by randomly combining 50 clustering results.
DCD is effective for detecting drifting concepts; its accuracy (> 0.8) is the highest among the compared settings.
When the data set varies dramatically, a smaller N is preferred; when the data set is stable, a larger N saves execution time.
Setting: thresholds 0.1, 0.1, and 0.5; k = the maximum number of clusters; results are averages of 20 experiments.

Clustering results: DCD vs. EM on data set D1.
Setting: N = 2000; drifting concepts occur once per five sliding windows (50 x 10000 / 2000 = 250 windows, 250 / 50 = 5).
The variation of CU and CMA when performing EM only once is considerably larger than with DCD.

Clustering results: DCD vs. EM on data set D2, in which the drifting concepts occur irregularly.
DCD is better than performing EM when clustering categorical time-evolving data.

Accuracy Evaluation on Real Data Set

A small sliding window size induces a high recall but a slightly lower precision.
Since the data set does not evolve frequently, a larger N is preferred.
Setting: thresholds 0.1, 0.1, and 0.5; k = 10; N = 3000.


The records are the same in sliding windows 51-114, 134-149, and 155-160.

The peak values of CU in DCD occur at the time stamps where drifting concepts occur.

DCD is able to quickly reflect the drifting concept and generate better clustering results.


Conclusions

A framework is proposed to perform clustering on categorical time-evolving data.
DCD detects the drifting concepts at each sliding window.
CRA analyzes and shows the changes between different clustering results.
The relationship between clustering results is shown through visualization.
DCD provides high-quality clustering results with correctly detected drifting concepts.