Efficient Clustering of Uncertain Data Wang Kay Ngai, Ben Kao, Chun Kit Chui, Reynold Cheng, Michael Chau, Kevin Y. Yip Speaker: Wang Kay Ngai


Page 1:

Efficient Clustering of Uncertain Data

Wang Kay Ngai, Ben Kao, Chun Kit Chui, Reynold Cheng, Michael Chau, Kevin Y. Yip

Speaker: Wang Kay Ngai

Page 2:

Data clustering

• Data clustering is used to discover cluster patterns in a data set.

• One kind of clustering partitions the data set into groups, called clusters, such that data within the same cluster are closer to each other (based on some distance function, such as the Euclidean distance) than to data in other clusters.

Page 3:

• K-means is a common method that tries to achieve the above clustering by ensuring that each data item is closer to the representative of its cluster than to those of any other clusters.

• The representative of a cluster is the mean value of all the data in that cluster.

[Figure: data points partitioned into clusters, each with its representative.]
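The K-means loop described above can be sketched in a few lines. This is a minimal illustrative version in Python with NumPy (function and variable names are our own, not the paper's implementation):

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means: assign each point to its nearest representative,
    then recompute each representative as the mean of its cluster."""
    rng = np.random.default_rng(seed)
    # Initialize representatives with k distinct random points.
    reps = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Euclidean distance from every point to every representative.
        dists = np.linalg.norm(points[:, None, :] - reps[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old representative if the cluster is empty
                reps[j] = members.mean(axis=0)
    return labels, reps
```

On well-separated data the loop converges in a few iterations to one representative per group.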

Page 4:

Uncertain data

• How about clustering of uncertain data?

• Sometimes a data value is uncertain but lies within a certain range described by a probability density function (pdf). We call such a value uncertain data.

• An example of uncertain data is the 2D location reported by a mobile device in a tracking system.

• The device may have already moved to a location other than the reported one by the time the data is received.

Page 5:

• The exact location of the device is uncertain but lies within a circular region determined by the maximum velocity v of the device and the time t elapsed since the location data was sent:

r = vt

[Figure: a uniform pdf f(x,y) over a circular uncertainty region of radius r centered at the reported location.]

Page 6:

• The region is called the uncertainty region.

• It can have an arbitrary shape, and its pdf can also be arbitrary.

[Figure: two uncertainty regions with arbitrary shapes and pdfs, e.g. a region that excludes a lake.]

Page 7:

Clustering of uncertain data

• In the example of mobile devices, in some applications each device may need to communicate with a server.

• The communication cost may depend on the communication distance and can be reduced by clustering the devices.

Page 8:

• Most devices communicate over short distances with the leaders of their clusters.

• Only the communications between the leaders and the server are long-distance.

• So clustering can reduce the total communication cost of the devices.

[Figure: devices grouped into clusters; each cluster's leader communicates with the server.]

Page 9:

UK-means

• K-means can be used to cluster uncertain data by using a new distance function, called the expected distance, between a data item and a cluster representative.

• We call this specialized K-means method UK-means.

Page 10:

• The expected distance between a data item o and a cluster representative c is defined as follows:

ED(o,c) = ∫_{p ∈ Region_o} f(p) D(p,c) dp

• For a cluster C_i of uncertain data, its representative c_i is assigned the mean of the centers of mass of all the data in that cluster:

c_i = (1 / |C_i|) Σ_{o ∈ C_i} ∫_{p ∈ Region_o} f(p) p dp
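For an arbitrary pdf these integrals are evaluated numerically. A minimal sketch, assuming each data item's pdf is approximated by weighted sample points (the representation and all names are illustrative, not the paper's implementation):

```python
import numpy as np

def expected_distance(samples, weights, c):
    """ED(o,c) ~= sum_i w_i * D(p_i, c), where the pairs (p_i, w_i)
    approximate o's pdf and the weights sum to 1."""
    dists = np.linalg.norm(np.asarray(samples, float) - np.asarray(c, float), axis=1)
    return float(np.dot(weights, dists))

def representative(cluster):
    """Mean of the centers of mass of all data in the cluster, where
    `cluster` is a list of (samples, weights) pairs."""
    centers = [np.dot(w, np.asarray(s, float)) for s, w in cluster]
    return np.mean(centers, axis=0)
```

With a point-mass pdf, `expected_distance` reduces to the ordinary Euclidean distance, which is a useful sanity check.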

Page 11:

Major overhead in UK-means

• A clustering process of UK-means consists of iterations.

• In each iteration, for each data item o, UK-means assigns o to the cluster whose representative has the smallest expected distance to o. We refer to this as a cluster assignment.

• In a brute-force approach to the cluster assignment of each data item o, UK-means simply computes the expected distance between o and every cluster representative in order to find the smallest one.

Page 12:

• In applications where the uncertainty regions or pdfs are arbitrary, computing an expected distance requires an expensive numerical integration.

• Since the brute-force approach incurs many expected distance computations, these computations become the major overhead in the whole clustering process.

Page 13:

• Suppose a lower bound of the expected distance ED(o,c1) between a data item o and a cluster representative c1 is larger than an upper bound of ED(o,c2) for another cluster representative c2. Then, without computing ED(o,c1), we know that o cannot be assigned to c1 (i.e., c1 is pruned).

[Figure: data o with representatives c1 and c2; the lower bound of ED(o,c1) exceeds the upper bound of ED(o,c2), so c1 is pruned.]

Page 14:

• If this condition does not hold for c1, then ED(o,c1), ED(o,c2), or both may need to be computed in the cluster assignment of o.

• For each data item, most of the cluster representatives will likely satisfy this condition, since in general each data item is much closer to a few of the cluster representatives than to all the others.

• So expected distance computations can be reduced in the whole clustering process using the upper and lower bounds.

• The amount of reduction depends on the tightness of the upper and lower bounds.
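The pruning logic described above can be sketched as follows, with the bound and exact-ED functions supplied by the caller (all names are illustrative):

```python
def assign_cluster(o, reps, lower, upper, exact):
    """Pick the representative with the smallest expected distance to o,
    computing the expensive exact ED only for representatives that
    survive pruning against the smallest upper bound."""
    best_upper = min(upper(o, c) for c in reps)
    # Prune every representative whose lower bound already exceeds best_upper.
    survivors = [j for j, c in enumerate(reps) if lower(o, c) <= best_upper]
    # Exact expected distances are needed only for the survivors.
    return min(survivors, key=lambda j: exact(o, reps[j]))
```

The tighter the bounds, the shorter the survivor list and the fewer exact computations are needed.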

Page 15:

Min-max-dist

• The basic approach for computing upper and lower bounds of ED(o,c) is to compute the maximum distance (MaxDist) and the minimum distance (MinDist), respectively, between c and any point in the minimum bounding rectangle (MBR) of o's uncertainty region.

• The approach is called min-max-dist.

• These bounds may bound ED(o,c) very loosely.

[Figure: MinDist and MaxDist between a representative c and the MBR of data o.]
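For an axis-aligned MBR, MinDist and MaxDist can be sketched as follows (illustrative code, not the paper's implementation):

```python
def min_dist(lo, hi, c):
    """Smallest Euclidean distance from point c to the box [lo, hi]
    (zero when c lies inside the box)."""
    d2 = 0.0
    for l, h, x in zip(lo, hi, c):
        if x < l:
            d2 += (l - x) ** 2
        elif x > h:
            d2 += (x - h) ** 2
    return d2 ** 0.5

def max_dist(lo, hi, c):
    """Largest Euclidean distance from point c to any point of the box,
    attained at the farther box face in each dimension."""
    return sum(max(abs(x - l), abs(x - h)) ** 2
               for l, h, x in zip(lo, hi, c)) ** 0.5
```

Both run in time linear in the dimensionality, which is what makes them cheap relative to an expected distance integration.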

Page 16:

Upre and Lpre

• Using the pdf (and hence the uncertainty region) of the data item o, instead of its MBR, when computing the bounds of ED(o,c) may give tighter bounds.

• So we propose two approaches, Upre and Lpre, that use the pdf to compute the upper and lower bounds, respectively.

• First, some anchor points y are placed near the data item o, and the expected distance ED(o,y) is pre-computed for each anchor point y.

[Figure: anchor points placed around the MBR of data o.]

Page 17:

• Then, by the triangle inequality, the following derivation gives an upper bound of ED(o,c) for any cluster representative c, using an anchor point y:

ED(o,c) = ∫_{p ∈ Region_o} f(p) D(p,c) dp
        ≤ ∫_{p ∈ Region_o} f(p) [D(p,y) + D(y,c)] dp
        = ED(o,y) + D(y,c)

[Figure: the triangle formed by a point p in o's uncertainty region, an anchor point y, and the representative c.]

Page 18:

• Similarly, by the triangle inequality, the lower bound of ED(o,c) in the Lpre approach is derived as follows:

max_y (ED(o,y) − D(y,c))

• Then, in the Upre approach, the minimum of such upper bounds over all anchor points y is used as the upper bound of ED(o,c):

min_y (ED(o,y) + D(y,c))

• The Upre and Lpre approaches try to reduce expected distance computations in the cluster assignment, at the cost of a few extra expected distance computations before the clustering process starts.
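Given the pre-computed ED(o,y) values and the ordinary point distances D(y,c), the Upre and Lpre bounds can be sketched as (names are illustrative):

```python
def upre_bound(ed_to_anchors, dists_to_c):
    """Upre: upper bound of ED(o,c) = min over anchors y of ED(o,y) + D(y,c)."""
    return min(ed + d for ed, d in zip(ed_to_anchors, dists_to_c))

def lpre_bound(ed_to_anchors, dists_to_c):
    """Lpre: lower bound of ED(o,c) = max over anchors y of ED(o,y) - D(y,c)."""
    return max(ed - d for ed, d in zip(ed_to_anchors, dists_to_c))
```

Adding anchors can only tighten both bounds, since the min and max range over more candidates.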

Page 19:

Ucs and Lcs

• As mentioned before, if a cluster representative c1 cannot be pruned when the lower bound of ED(o,c1) is compared to the upper bound of ED(o,c2) for another cluster representative c2, some expected distances may be computed.

• Suppose the min-max-dist approach is used for computing the upper and lower bounds.

• And suppose, after any pruning in the cluster assignment of o in an iteration of the clustering process, ED(o,c) still needs to be computed for some cluster representative c.

Page 20:

• Then, in any later iteration, we propose two approaches, Ucs and Lcs, that use the computed expected distance ED(o,c) to compute the upper and lower bounds of ED(o,c'), respectively, where c' is the representative of that cluster in that iteration.

• Note that the values of c' and c can differ because the value of a cluster representative is updated in each iteration.

• So the approaches Ucs and Lcs avoid the pre-computations of expected distances required by the approaches Upre and Lpre.

• But they can only be used after a specific expected distance has been computed.

Page 21:

• By the triangle inequality, in the Ucs approach the upper bound of ED(o,c') is computed as follows:

ED(o,c') ≤ ED(o,c) + D(c,c')

• These bounds get closer to ED(o,c'), and hence tighter, as D(c,c') becomes smaller.

• D(c,c') becomes smaller, and eventually zero, in the later iterations of a clustering process.

• Similarly, in the Lcs approach the lower bound of ED(o,c') is computed as follows:

ED(o,c') ≥ ED(o,c) − D(c,c')
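A minimal sketch of the Ucs and Lcs bounds, reusing a previously computed ED(o,c) when a representative moves from c to c' (names are illustrative):

```python
import numpy as np

def dist(a, b):
    """Euclidean distance between two points."""
    return float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))

def ucs_upper(ed_oc, c, c_new):
    """Ucs: ED(o,c') <= ED(o,c) + D(c,c')."""
    return ed_oc + dist(c, c_new)

def lcs_lower(ed_oc, c, c_new):
    """Lcs: ED(o,c') >= ED(o,c) - D(c,c')."""
    return ed_oc - dist(c, c_new)
```

Both bounds cost only one point-to-point distance, and they converge to ED(o,c') as the representative stops moving.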

Page 22:

Experimental results

• We want to see how many of the expected distance computations in the brute-force cluster assignment are saved when the upper- and lower-bound approaches are used.

• And to see how the saving affects the efficiency of the whole clustering process.

• So we conduct experiments on random 2D uncertain data, with and without cluster patterns.

• Each data item has a random uncertainty region with a random pdf.

Page 23:

• The pdf is approximated by s sample points for computing expected distances.

• The larger s is, the more accurate the computed expected distance (at a higher computation cost).

• We vary the parameters: s, the maximum size of the uncertainty region (d), the number of objects (n), and the number of clusters (K).

• The cluster representatives are initialized either to be distributed uniformly among the data or to be the centers of mass of the uncertainty regions of randomly selected data.

Page 24:

• In the experiments, the approaches Upre, Lpre, Ucs, and Lcs also use the bounds computed by the min-max-dist approach for pruning.

• For example, in the Upre approach, the minimum (i.e. tightest) of its upper bound and the one computed by the min-max-dist approach is used as the actual upper bound for pruning in a cluster assignment.

Page 25:

• K (= 49) expected distances are computed per object per iteration for the brute-force approach.

[Chart: average number of expected distance calculations per object per iteration vs. n (0–30000), for min-max-dist only, Upre, Lpre, Ucs, Lcs, ALL (9 pts), and ALL (9 pts, pre-computations excluded).]

Page 26:

• K (= 49) expected distances are computed per object per iteration for the brute-force approach.

[Chart: average number of expected distance calculations per object per iteration vs. d (0–20), for min-max-dist only, Upre, Lpre, Ucs, Lcs, ALL (9 pts), and ALL (9 pts, pre-computations excluded).]

Page 27:

• 100% of K expected distances are computed per object per iteration for the brute-force approach.

[Chart: average number of expected distance calculations per object per iteration, as a percentage of the brute-force approach, vs. K (0–80), for min-max-dist only, Upre, Lpre, Ucs, Lcs, ALL (9 pts), and ALL (9 pts, pre-computations excluded).]

Page 28:

[Chart: clustering time (seconds) vs. s (0–1200), for brute-force, min-max-dist only, Upre, Lpre, Ucs, Lcs, ALL (9 pts), and ALL (9 pts, pre-computations excluded).]

• The run-time of the clustering process using the proposed approaches is at least 10 times shorter than that of the brute-force approach.

Page 29:

[Chart: clustering time (seconds) vs. s (0–10000), for min-max-dist only, Upre, Lpre, Ucs, Lcs, ALL (9 pts), and ALL (9 pts, pre-computations excluded).]

Page 30:

Conclusions

• Approaches are proposed for reducing expected distance computations, the major overhead in the brute-force approach of UK-means.

• In most experiments, using both Ucs and Lcs reduces the overhead by at least 200 times (10 times more than min-max-dist alone), which results in at least a 10-fold increase in clustering efficiency.

• If the expected distance pre-computations in Upre and Lpre can be discounted in an application, using all the proposed approaches gives the greatest overhead reduction, which should yield the greatest improvement in clustering efficiency.