Jian Pei: CMPT 459/741 Clustering (4) 1
Fuzzy Clustering
• Each point xi takes a probability wij to belong to a cluster Cj
• Requirements
  – For each point xi: $\sum_{j=1}^{k} w_{ij} = 1$
  – For each cluster Cj: $0 < \sum_{i=1}^{m} w_{ij} < m$
Fuzzy C-Means (FCM)
Select an initial fuzzy pseudo-partition, i.e., assign values to all the wij
Repeat
  Compute the centroid of each cluster using the fuzzy pseudo-partition
  Recompute the fuzzy pseudo-partition, i.e., the wij
Until the centroids do not change (or the change is below some threshold)
Critical Details
• Optimization on the sum of the squared error (SSE):
  $SSE(C_1, \dots, C_k) = \sum_{j=1}^{k} \sum_{i=1}^{m} w_{ij}^p \, dist(x_i, c_j)^2$
• Computing centroids:
  $c_j = \sum_{i=1}^{m} w_{ij}^p x_i \,\Big/\, \sum_{i=1}^{m} w_{ij}^p$
• Updating the fuzzy pseudo-partition:
  $w_{ij} = \left(1/dist(x_i, c_j)^2\right)^{\frac{1}{p-1}} \Big/ \sum_{q=1}^{k} \left(1/dist(x_i, c_q)^2\right)^{\frac{1}{p-1}}$
  – When p = 2:
  $w_{ij} = \left(1/dist(x_i, c_j)^2\right) \Big/ \sum_{q=1}^{k} \left(1/dist(x_i, c_q)^2\right)$
Choice of p
• When p → 1, FCM behaves like traditional k-means
• When p is larger, the cluster centroids approach the global centroid of all data points
• The partition becomes fuzzier as p increases
Effectiveness
Mixture Models
• A cluster can be modeled as a probability distribution
  – Practically, assume a distribution can be approximated well using a multivariate normal distribution
• Multiple clusters form a mixture of different probability distributions
• A data set is a set of observations from a mixture of models
Object Probability
• Suppose there are k clusters and a set X of m objects
  – Let the j-th cluster have parameter θj = (µj, σj)
  – The probability that a point is in the j-th cluster is wj, with w1 + … + wk = 1
• The probability of an object x is
  $prob(x|\Theta) = \sum_{j=1}^{k} w_j \, p_j(x|\theta_j)$
• The probability of the data set X is
  $prob(X|\Theta) = \prod_{i=1}^{m} prob(x_i|\Theta) = \prod_{i=1}^{m} \sum_{j=1}^{k} w_j \, p_j(x_i|\theta_j)$
Example
$prob(x|\theta_i) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

$\theta_1 = (-4, 2) \qquad \theta_2 = (4, 2)$

$prob(x|\Theta) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(x+4)^2}{8}} + \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(x-4)^2}{8}}$
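The two-Gaussian example above can be evaluated numerically. The equal mixing weights of 0.5 in `mixture` are an assumption made for this sketch so that the result is a proper density:

```python
import math

def gaussian(x, mu, sigma):
    # N(x; mu, sigma) density
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def mixture(x):
    # equal-weight (0.5 / 0.5) mixture of N(-4, 2) and N(4, 2);
    # the weights are an illustrative assumption
    return 0.5 * gaussian(x, -4, 2) + 0.5 * gaussian(x, 4, 2)
```

The density is symmetric about 0 and peaks near the two component means, −4 and 4.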
Maximum Likelihood Estimation
• Maximum likelihood principle: if we know that a set of objects comes from one distribution but do not know the parameter, we choose the parameter that maximizes the probability of the data
• Maximize
  $prob(X|\Theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$
  – Equivalently, maximize
  $\log prob(X|\Theta) = -\sum_{i=1}^{m} \frac{(x_i-\mu)^2}{2\sigma^2} - 0.5\, m \log 2\pi - m \log \sigma$
EM Algorithm
• Expectation-Maximization algorithm
  Select an initial set of model parameters
  Repeat
    Expectation step: for each object, calculate the probability that it belongs to each distribution θi, i.e., prob(xi|θi)
    Maximization step: given the probabilities from the expectation step, find the new estimates of the parameters that maximize the expected likelihood
Until the parameters are stable
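The E- and M-steps can be sketched for a 1-d Gaussian mixture. This is an illustrative toy, not library code: the deterministic initialization (spreading the initial means over the sorted data) and the small variance floor are assumptions made to keep the sketch stable.

```python
import math

def em_gmm_1d(xs, k=2, n_iter=200):
    """Sketch of EM for a 1-d Gaussian mixture."""
    xs_sorted = sorted(xs)
    # spread initial means over the sorted data (illustrative choice)
    mus = [xs_sorted[(2 * j + 1) * len(xs) // (2 * k)] for j in range(k)]
    sigmas = [1.0] * k
    ws = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            ps = [w * math.exp(-(x - mu) ** 2 / (2 * s * s)) / (math.sqrt(2 * math.pi) * s)
                  for w, mu, s in zip(ws, mus, sigmas)]
            total = sum(ps) or 1e-300
            resp.append([p / total for p in ps])
        # M-step: re-estimate weights, means, and standard deviations
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            ws[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(sum(r[j] * (x - mus[j]) ** 2
                                          for r, x in zip(resp, xs)) / nj), 1e-3)
    return ws, mus, sigmas
```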
Advantages and Disadvantages
• Mixture models are more general than k-means and fuzzy c-means
• Clusters can be characterized by a small number of parameters
• The results may satisfy the statistical assumptions of the generative models
• Computationally expensive
• Need large data sets
• Hard to estimate the number of clusters
Grid-based Clustering Methods
• Ideas
  – Using multi-resolution grid data structures
  – Using dense grid cells to form clusters
• Several interesting methods
  – CLIQUE
  – STING
  – WaveCluster
CLIQUE
• Clustering In QUEst
• Automatically identifies subspaces of a high-dimensional data space
• Both density-based and grid-based
CLIQUE: the Ideas
• Partition each dimension into the same number of equal-length intervals
  – This partitions an m-dimensional data space into non-overlapping rectangular units
• A unit is dense if the number of data points in the unit exceeds a threshold
• A cluster is a maximal set of connected dense units within a subspace
CLIQUE: the Method
• Partition the data space and find the number of points in each cell of the partition
  – Apriori: a k-d cell cannot be dense if one of its (k−1)-d projections is not dense
• Identify clusters:
  – Determine dense units in all subspaces of interest and connected dense units in all subspaces of interest
• Generate minimal descriptions for the clusters
  – Determine the minimal cover for each cluster
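The counting step can be sketched as follows. This toy finds dense cells in the full space only; real CLIQUE works bottom-up per subspace with the Apriori pruning above. The function name and the equal-width binning over the observed data range are illustrative choices.

```python
from collections import Counter

def dense_cells(points, n_intervals, threshold):
    """Find dense grid cells: partition each dimension into equal-length
    intervals and keep cells whose point count exceeds the threshold.
    Cells are tuples of interval indices."""
    dims = len(points[0])
    los = [min(p[d] for p in points) for d in range(dims)]
    his = [max(p[d] for p in points) for d in range(dims)]
    widths = [((hi - lo) / n_intervals) or 1.0 for lo, hi in zip(los, his)]
    def cell(p):
        # clamp the max value into the last interval
        return tuple(min(int((p[d] - los[d]) / widths[d]), n_intervals - 1)
                     for d in range(dims))
    counts = Counter(cell(p) for p in points)
    return {c for c, n in counts.items() if n > threshold}
```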
CLIQUE: An Example
[Figure: grid partitions of two 2-d subspaces, salary (10,000) vs. age and vacation (week) vs. age, with age divided into intervals over 20–60; dense units in the 2-d subspaces suggest candidate dense units in the 3-d space]
CLIQUE: Pros and Cons
• Automatically finds subspaces of the highest dimensionality with high-density clusters
• Insensitive to the order of input
  – Does not presume any canonical data distribution
• Scales linearly with the size of input
• Scales well with the number of dimensions
• The quality of the clustering result may be degraded as the price of the simplicity of the method
Bad Cases for CLIQUE
Parts of a cluster may be missed
A cluster from CLIQUE may contain noise
Dimensionality Reduction
• Clustering a high-dimensional data set is challenging
  – The distance between two points could be dominated by noise
• Dimensionality reduction: choosing the informative dimensions for clustering analysis
  – Feature selection: choosing a subset of existing dimensions
  – Feature construction: constructing a new (small) set of informative attributes
Variance and Covariance
• Given a set of 1-d points, how different are those points?
  – Standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{X})^2}{n-1}}$
  – Variance: $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{X})^2}{n-1}$
• Given a set of 2-d points, are the two dimensions correlated?
  – Covariance: $cov(X, Y) = \frac{\sum_{i=1}^{n}(x_i - \bar{X})(y_i - \bar{Y})}{n-1}$
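The three formulas above translate directly into code (with the n−1 denominator in each case):

```python
def variance(xs):
    """Sample variance with the n-1 denominator."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def covariance(xs, ys):
    """Sample covariance of two dimensions, also with n-1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
```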
Principal Components
Art work and example from http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
Step 1: Mean Subtraction
• Subtract the mean from each dimension for each data point
• Intuition: centralizing the data set
Step 2: Covariance Matrix
$C = \begin{pmatrix} cov(D_1, D_1) & cov(D_1, D_2) & \cdots & cov(D_1, D_n) \\ cov(D_2, D_1) & cov(D_2, D_2) & \cdots & cov(D_2, D_n) \\ \vdots & \vdots & \ddots & \vdots \\ cov(D_n, D_1) & cov(D_n, D_2) & \cdots & cov(D_n, D_n) \end{pmatrix}$
Step 3: Eigenvectors and Eigenvalues
• Compute the eigenvectors and the eigenvalues of the covariance matrix
  – Intuition: find the direction-invariant vectors as candidates for new attributes
  – Eigenvalues indicate how much the direction-invariant vectors are scaled – the larger, the better for manifesting the data variance
Step 4: Forming New Features
• Choose the principal components and form new features
  – Typically, choose the top-k components
New Features
NewData = RowFeatureVector x RowDataAdjust
The first principal component is used
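The four steps above can be sketched end to end in NumPy (the function name is illustrative; `numpy.cov` and `numpy.linalg.eigh` do the heavy lifting):

```python
import numpy as np

def pca(X, k):
    """PCA sketch following the steps above: mean-subtract, covariance,
    eigendecomposition, then project onto the top-k components."""
    X_adj = X - X.mean(axis=0)                 # step 1: mean subtraction
    C = np.cov(X_adj, rowvar=False)            # step 2: covariance matrix
    evals, evecs = np.linalg.eigh(C)           # step 3: eigenvectors/eigenvalues
    order = np.argsort(evals)[::-1][:k]        # largest eigenvalues first
    feature_vec = evecs[:, order]              # step 4: keep top-k components
    return X_adj @ feature_vec                 # NewData, one row per object
```

For points that lie exactly on a line, all the variance is captured by the first component and the second coordinate of the output is zero.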
Clustering in Derived Space
[Figure: 2-d points in the X-Y plane projected onto the derived axis −0.707x + 0.707y]
Spectral Clustering
[Pipeline: data → affinity matrix W = [Wij] → A = f(W) → computing the k leading eigenvectors of A (Av = λv) → clustering in the new space → projecting back to cluster the original data]
Affinity Matrix
• Using a distance measure, where σ is a scaling parameter controlling how fast the affinity Wij decreases as the distance increases
• In the Ng-Jordan-Weiss algorithm, Wii is set to 0
$W_{ij} = e^{-\frac{dist(o_i, o_j)}{\sigma^2}}$
Clustering
• In the Ng-Jordan-Weiss algorithm, we define a diagonal matrix D such that
  $D_{ii} = \sum_{j=1}^{n} W_{ij}$
• Then, $A = D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$
• Use the k leading eigenvectors of A to form a new space
• Map the original data to the new space and conduct clustering
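The whole pipeline can be sketched in NumPy. This is an illustrative toy, not the original algorithm verbatim: it uses the common Gaussian affinity $e^{-dist^2/(2\sigma^2)}$ (the slide's scaling may differ), row-normalizes the eigenvector matrix, and runs a simple deterministic k-means in the new space.

```python
import numpy as np

def njw_spectral(X, k, sigma=1.0, n_iter=50):
    """Sketch of Ng-Jordan-Weiss spectral clustering."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2 * sigma ** 2))         # Gaussian affinity (assumed form)
    np.fill_diagonal(W, 0.0)                   # W_ii = 0
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    A = D_inv_sqrt @ W @ D_inv_sqrt            # A = D^(-1/2) W D^(-1/2)
    evals, evecs = np.linalg.eigh(A)
    V = evecs[:, np.argsort(evals)[::-1][:k]]  # k leading eigenvectors as columns
    V = V / np.linalg.norm(V, axis=1, keepdims=True)   # row-normalize
    # farthest-point initialization + plain k-means on the rows of V
    centers = [V[0]]
    for _ in range(1, k):
        d = np.min([((V - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(V[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        centers = np.array([V[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels
```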
Is a Clustering Good?
• Feasibility
  – Applying any clustering method on a uniformly distributed data set is meaningless
• Quality
  – Are the clustering results meeting users' interest?
  – Clustering patients into clusters corresponding to various diseases or sub-phenotypes is meaningful
  – Clustering patients into clusters corresponding to male or female is not meaningful
Major Tasks
• Assessing clustering tendency – Are there non-random structures in the data?
• Determining the number of clusters or other critical parameters
• Measuring clustering quality
Uniformly Distributed Data
• Clustering uniformly distributed data is meaningless
• A uniformly distributed data set is generated by a uniform data distribution
Hopkins Statistic
• Hypothesis: the data is generated by a uniform distribution in a space
• Sample n points, p1, …, pn, uniformly from the space of D
• For each point pi, find the nearest neighbor of pi in D, let xi be the distance between pi and its nearest neighbor in D
$x_i = \min_{v \in D} \{ dist(p_i, v) \}$
Hopkins Statistic
• Sample n points, q1, …, qn, uniformly from D
• For each qi, find the nearest neighbor of qi in D − {qi}; let yi be the distance between qi and its nearest neighbor in D − {qi}
• Calculate the Hopkins Statistic H
$y_i = \min_{v \in D,\, v \neq q_i} \{ dist(q_i, v) \}$

$H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$
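The two sampling steps and the ratio above can be sketched for 2-d points. The uniform samples are drawn from the bounding box of D here, which is a simplifying assumption about "the space of D":

```python
import math
import random

def hopkins(D, n, seed=0):
    """Sketch of the Hopkins Statistic for a list of 2-d points."""
    rnd = random.Random(seed)
    xmin = min(p[0] for p in D); xmax = max(p[0] for p in D)
    ymin = min(p[1] for p in D); ymax = max(p[1] for p in D)
    def nn_dist(q, pts):
        return min(math.dist(q, v) for v in pts)
    # x_i: nearest-neighbor distances from uniform samples to D
    xs = [nn_dist((rnd.uniform(xmin, xmax), rnd.uniform(ymin, ymax)), D)
          for _ in range(n)]
    # y_i: nearest-neighbor distances within D, excluding the point itself
    sample = rnd.sample(D, n)
    ys = [nn_dist(q, [v for v in D if v != q]) for q in sample]
    return sum(ys) / (sum(xs) + sum(ys))
```

For strongly clustered data the within-data distances yi are small, so H comes out close to 0; for uniform data it hovers around 0.5.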
Explanation
• If D is uniformly distributed, then $\sum_{i=1}^{n} x_i$ and $\sum_{i=1}^{n} y_i$ would be close to each other, and thus H would be around 0.5
• If D is skewed, then $\sum_{i=1}^{n} y_i$ would be substantially smaller, and thus H would be close to 0
• If H > 0.5, then it is unlikely that D has statistically significant clusters
Finding the Number of Clusters
• Depends on many factors
  – The shape and scale of the distribution in the data set
  – The clustering resolution required by the user
• Many methods exist
  – Set $k = \sqrt{n/2}$, so each cluster has $\sqrt{2n}$ points on average
  – Plot the sum of within-cluster variances with respect to k, and find the first (or the most significant) turning point
A Cross-Validation Method
• Divide the data set D into m parts
• Use m − 1 parts to find a clustering
• Use the remaining part as the test set to test the quality of the clustering
  – For each point in the test set, find the closest centroid or cluster center
  – Use the squared distances between all points in the test set and the corresponding centroids to measure how well the clustering model fits the test set
• Repeat m times for each value of k, and use the average as the quality measure
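The procedure above can be sketched as follows. The helper `kmeans` with farthest-point initialization is an illustrative stand-in for whatever clustering method is being evaluated; the function names are assumptions of this sketch.

```python
import math
import random

def kmeans(pts, k, n_iter=20):
    """Tiny k-means with deterministic farthest-point initialization."""
    centroids = [pts[0]]
    while len(centroids) < k:
        centroids.append(max(pts, key=lambda p: min(math.dist(p, c)
                                                    for c in centroids)))
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in pts:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[j].append(p)
        new_centroids = []
        for j, g in enumerate(groups):
            new_centroids.append(tuple(sum(vals) / len(g) for vals in zip(*g))
                                 if g else centroids[j])
        centroids = new_centroids
    return centroids

def cv_quality(D, k, m_folds=5, seed=0):
    """Average held-out squared distance to the closest centroid over m folds."""
    rnd = random.Random(seed)
    pts = list(D)
    rnd.shuffle(pts)
    folds = [pts[i::m_folds] for i in range(m_folds)]
    scores = []
    for f in range(m_folds):
        train = [p for i, fold in enumerate(folds) if i != f for p in fold]
        test = folds[f]
        centroids = kmeans(train, k)
        scores.append(sum(min(math.dist(p, c) ** 2 for c in centroids)
                          for p in test) / len(test))
    return sum(scores) / m_folds
```

A lower average score for some k suggests that value of k fits the data better.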
Measuring Clustering Quality
• Ground truth: the ideal clustering determined by human experts
• Two situations
  – There is a known ground truth – the extrinsic (supervised) methods, comparing the clustering against the ground truth
  – The ground truth is unavailable – the intrinsic (unsupervised) methods, measuring how well the clusters are separated
Quality in Extrinsic Methods
• Cluster homogeneity: the purer the clusters in a clustering are, the better the clustering
• Cluster completeness: objects in the same cluster in the ground truth should be clustered together
• Rag bag: putting a heterogeneous object into a pure cluster is worse than putting it into a rag bag
• Small cluster preservation: splitting a small cluster in the ground truth into pieces is worse than splitting a bigger one
Bcubed Precision and Recall
• D = {o1, …, on} – L(oi) is the cluster of oi given by the ground truth
• C is a clustering on D – C(oi) is the cluster-id of oi in C
• For two objects oi and oj, the correctness is 1 if L(oi) = L(oj) ⇔ C(oi) = C(oj), and 0 otherwise
Bcubed Precision and Recall
• Precision
  $Precision_{BCubed} = \frac{\sum_{i=1}^{n} \frac{\sum_{o_j:\, i \neq j,\, C(o_i)=C(o_j)} Correctness(o_i, o_j)}{\|\{o_j \mid i \neq j,\, C(o_i)=C(o_j)\}\|}}{n}$
• Recall
  $Recall_{BCubed} = \frac{\sum_{i=1}^{n} \frac{\sum_{o_j:\, i \neq j,\, L(o_i)=L(o_j)} Correctness(o_i, o_j)}{\|\{o_j \mid i \neq j,\, L(o_i)=L(o_j)\}\|}}{n}$
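The two definitions translate directly into code over label lists (the function name is illustrative; `max(…, 1)` guards the denominator for singleton objects):

```python
def bcubed(L, C):
    """BCubed precision and recall. L[i] is the ground-truth category of
    object i; C[i] is its cluster id in the clustering."""
    n = len(L)
    def correctness(i, j):
        # 1 iff L(oi) = L(oj) <=> C(oi) = C(oj)
        return 1 if (L[i] == L[j]) == (C[i] == C[j]) else 0
    precision = sum(
        sum(correctness(i, j) for j in range(n) if j != i and C[i] == C[j]) /
        max(sum(1 for j in range(n) if j != i and C[i] == C[j]), 1)
        for i in range(n)) / n
    recall = sum(
        sum(correctness(i, j) for j in range(n) if j != i and L[i] == L[j]) /
        max(sum(1 for j in range(n) if j != i and L[i] == L[j]), 1)
        for i in range(n)) / n
    return precision, recall
```

Merging everything into one cluster keeps recall perfect but hurts precision, as expected.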
Silhouette Coefficient
• No ground truth is assumed
• Suppose a data set D of n objects is partitioned into k clusters, C1, …, Ck
• For each object o,
  – Calculate a(o), the average distance between o and every other object in the same cluster – compactness of a cluster; the smaller, the better
  – Calculate b(o), the minimum average distance from o to all objects in a cluster that o does not belong to – degree of separation from other clusters; the larger, the better
Silhouette Coefficient
• Then
  $a(o) = \frac{\sum_{o' \in C_i,\, o' \neq o} dist(o, o')}{|C_i| - 1}$
  $b(o) = \min_{C_j:\, o \notin C_j} \left\{ \frac{\sum_{o' \in C_j} dist(o, o')}{|C_j|} \right\}$
  $s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}}$
• Use the average silhouette coefficient of all objects as the overall measure
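The average silhouette coefficient can be sketched directly from these formulas. This toy assumes every cluster has at least two objects (otherwise a(o) is undefined); the function name is illustrative.

```python
import math

def silhouette(D, labels):
    """Average silhouette coefficient; D is a list of points,
    labels gives each point's cluster. Assumes clusters of size >= 2."""
    clusters = {c: [D[i] for i in range(len(D)) if labels[i] == c]
                for c in set(labels)}
    def avg_dist(p, pts):
        return sum(math.dist(p, q) for q in pts) / len(pts)
    total = 0.0
    for i, o in enumerate(D):
        own = [p for j, p in enumerate(D) if labels[j] == labels[i] and j != i]
        a = sum(math.dist(o, p) for p in own) / len(own)   # compactness
        b = min(avg_dist(o, pts) for c, pts in clusters.items()
                if c != labels[i])                          # separation
        total += (b - a) / max(a, b)
    return total / len(D)
```

Values near 1 indicate compact, well-separated clusters; negative values indicate objects that sit closer to another cluster than to their own.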
Multi-Clustering
• A data set may be clustered in different ways
  – In different subspaces, that is, using different attributes
  – Using different similarity measures
  – Using different clustering methods
• Some different clusterings may capture different meanings of categorization
  – Orthogonal clusterings
• Putting users in the loop
To-Do List
• Read Chapters 10.5, 10.6, and 11.1
• Find out how Gaussian mixtures can be used in Spark MLlib
• (For thesis-based graduate students only) Learn LDA (Latent Dirichlet Allocation) by yourself