Fuzzy Clustering - Jian Pei: CMPT 459/741 Clustering (4), Simon Fraser University


  • Jian Pei: CMPT 459/741 Clustering (4) 1

    Fuzzy Clustering

    •  Each point xi takes a probability wij to belong to a cluster Cj

    •  Requirements
       – For each point xi: $\sum_{j=1}^{k} w_{ij} = 1$
       – For each cluster Cj: $0 < \sum_{i=1}^{m} w_{ij} < m$

  • Jian Pei: CMPT 459/741 Clustering (4) 2

    Fuzzy C-Means (FCM)

    Select an initial fuzzy pseudo-partition, i.e., assign values to all the wij

    Repeat
        Compute the centroid of each cluster using the fuzzy pseudo-partition
        Recompute the fuzzy pseudo-partition, i.e., the wij
    Until the centroids do not change (or the change is below some threshold)

  • Jian Pei: CMPT 459/741 Clustering (4) 3

    Critical Details

    •  Optimization on sum of the squared error (SSE):

       $SSE(C_1, \ldots, C_k) = \sum_{j=1}^{k} \sum_{i=1}^{m} w_{ij}^p \, dist(x_i, c_j)^2$

    •  Computing centroids:

       $c_j = \sum_{i=1}^{m} w_{ij}^p x_i \Big/ \sum_{i=1}^{m} w_{ij}^p$

    •  Updating the fuzzy pseudo-partition:

       $w_{ij} = \left(1 / dist(x_i, c_j)^2\right)^{\frac{1}{p-1}} \Big/ \sum_{q=1}^{k} \left(1 / dist(x_i, c_q)^2\right)^{\frac{1}{p-1}}$

       – When p=2:

       $w_{ij} = \left(1 / dist(x_i, c_j)^2\right) \Big/ \sum_{q=1}^{k} 1 / dist(x_i, c_q)^2$
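    The update rules above translate almost line by line into code. The following is a minimal NumPy sketch of FCM; the random initialization of the pseudo-partition, the convergence tolerance, and the function name are illustrative assumptions, not details taken from the slides.

        import numpy as np

        def fuzzy_c_means(X, k, p=2.0, max_iter=100, tol=1e-4, seed=0):
            rng = np.random.default_rng(seed)
            m = X.shape[0]
            # Initial fuzzy pseudo-partition: random weights, each row sums to 1
            W = rng.random((m, k))
            W /= W.sum(axis=1, keepdims=True)
            for _ in range(max_iter):
                # Centroids: c_j = sum_i w_ij^p x_i / sum_i w_ij^p
                Wp = W ** p
                centroids = (Wp.T @ X) / Wp.sum(axis=0)[:, None]
                # Squared distances dist(x_i, c_j)^2
                d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
                d2 = np.maximum(d2, 1e-12)          # guard against division by zero
                # w_ij = (1/d2_ij)^(1/(p-1)) / sum_q (1/d2_iq)^(1/(p-1))
                inv = (1.0 / d2) ** (1.0 / (p - 1.0))
                W_new = inv / inv.sum(axis=1, keepdims=True)
                if np.abs(W_new - W).max() < tol:   # change below the threshold
                    W = W_new
                    break
                W = W_new
            return centroids, W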

  • Jian Pei: CMPT 459/741 Clustering (4) 4

    Choice of p

    •  When p → 1, FCM behaves like traditional k-means

    •  When p is larger, the cluster centroids approach the global centroid of all data points

    •  The partition becomes fuzzier as p increases

  • Jian Pei: CMPT 459/741 Clustering (4) 5

    Effectiveness

  • Jian Pei: CMPT 459/741 Clustering (4) 6

    Mixture Models

    •  A cluster can be modeled as a probability distribution
       – Practically, assume a distribution can be approximated well using a multivariate normal distribution

    •  Multiple clusters form a mixture of different probability distributions

    •  A data set is a set of observations from a mixture of models

  • Jian Pei: CMPT 459/741 Clustering (4) 7

    Object Probability

    •  Suppose there are k clusters and a set X of m objects
       – Let the j-th cluster have parameter θj = (µj, σj)
       – The probability that a point is in the j-th cluster is wj, with w1 + … + wk = 1

    •  The probability of an object x is

       $prob(x \mid \Theta) = \sum_{j=1}^{k} w_j \, p_j(x \mid \theta_j)$

    •  The probability of the whole data set X is

       $prob(X \mid \Theta) = \prod_{i=1}^{m} prob(x_i \mid \Theta) = \prod_{i=1}^{m} \sum_{j=1}^{k} w_j \, p_j(x_i \mid \theta_j)$

  • Jian Pei: CMPT 459/741 Clustering (4) 8

    Example

    •  For a 1-d normal distribution,

       $prob(x_i \mid \Theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$

    •  With two clusters, $\theta_1 = (-4, 2)$ and $\theta_2 = (4, 2)$:

       $prob(x \mid \Theta) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(x+4)^2}{8}} + \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(x-4)^2}{8}}$

  • Jian Pei: CMPT 459/741 Clustering (4) 9

    Maximum Likelihood Estimation

    •  Maximum likelihood principle: if we know a set of objects are from one distribution, but do not know the parameter, we can choose the parameter maximizing the probability

    •  Maximize

       $prob(X \mid \Theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$

       – Equivalently, maximize

       $\log prob(X \mid \Theta) = -\sum_{i=1}^{m} \frac{(x_i-\mu)^2}{2\sigma^2} - 0.5\, m \log 2\pi - m \log \sigma$
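    As a quick numeric illustration of the principle above: for a single 1-d Gaussian, the log-likelihood is maximized by the sample mean and the (biased) sample standard deviation. The data values below are made up purely for illustration.

        import numpy as np

        x = np.array([-4.3, -3.9, -4.1, 3.8, 4.2, 4.0])

        def log_likelihood(x, mu, sigma):
            # log prob(X | Theta) = -sum (x_i - mu)^2 / (2 sigma^2) - 0.5 m log 2pi - m log sigma
            m = len(x)
            return (-np.sum((x - mu) ** 2) / (2 * sigma ** 2)
                    - 0.5 * m * np.log(2 * np.pi) - m * np.log(sigma))

        mu_hat, sigma_hat = x.mean(), x.std()       # closed-form maximizers
        print(log_likelihood(x, mu_hat, sigma_hat))
        print(log_likelihood(x, 0.0, 1.0))          # any other parameters score lower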

  • Jian Pei: CMPT 459/741 Clustering (4) 10

    EM Algorithm

    •  Expectation Maximization algorithm

       Select an initial set of model parameters
       Repeat
           Expectation step: for each object xi, calculate the probability that it belongs to each distribution θj, i.e., prob(θj | xi)
           Maximization step: given the probabilities from the expectation step, find the new estimates of the parameters that maximize the expected likelihood
       Until the parameters are stable
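    A compact EM sketch for a 1-d Gaussian mixture, following the E-step and M-step above. The initialization scheme, the fixed iteration count, and the variable names are illustrative assumptions.

        import numpy as np

        def em_gaussian_mixture(x, k, n_iter=100, seed=0):
            rng = np.random.default_rng(seed)
            m = len(x)
            w = np.full(k, 1.0 / k)                      # mixture weights w_j
            mu = rng.choice(x, size=k, replace=False)    # initial means
            sigma = np.full(k, x.std())                  # initial standard deviations
            for _ in range(n_iter):
                # E-step: responsibility r_ij = prob(distribution j | x_i, Theta)
                dens = (w / (np.sqrt(2 * np.pi) * sigma)
                        * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)))
                r = dens / dens.sum(axis=1, keepdims=True)
                # M-step: re-estimate parameters from the responsibilities
                nj = r.sum(axis=0)
                w = nj / m
                mu = (r * x[:, None]).sum(axis=0) / nj
                sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nj)
            return w, mu, sigma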

  • Jian Pei: CMPT 459/741 Clustering (4) 11

    Advantages and Disadvantages

    •  Mixture models are more general than k-means and fuzzy c-means

    •  Clusters can be characterized by a small number of parameters

    •  The results may satisfy the statistical assumptions of the generative models

    •  Computationally expensive
    •  Need large data sets
    •  Hard to estimate the number of clusters

  • Jian Pei: CMPT 459/741 Clustering (4) 12

    Grid-based Clustering Methods

    •  Ideas
       – Using multi-resolution grid data structures
       – Using dense grid cells to form clusters

    •  Several interesting methods
       – CLIQUE
       – STING
       – WaveCluster

  • Jian Pei: CMPT 459/741 Clustering (4) 13

    CLIQUE

    •  Clustering In QUEst
    •  Automatically identify subspaces of a high-dimensional data space
    •  Both density-based and grid-based

  • Jian Pei: CMPT 459/741 Clustering (4) 14

    CLIQUE: the Ideas

    •  Partition each dimension into the same number of equal-length intervals
       – Partition an m-dimensional data space into non-overlapping rectangular units
    •  A unit is dense if the number of data points in the unit exceeds a threshold
    •  A cluster is a maximal set of connected dense units within a subspace

  • Jian Pei: CMPT 459/741 Clustering (4) 15

    CLIQUE: the Method

    •  Partition the data space and find the number of points in each cell of the partition
       – Apriori: a k-d cell cannot be dense if one of its (k-1)-d projections is not dense
    •  Identify clusters
       – Determine dense units in all subspaces of interest and connected dense units in all subspaces of interest
    •  Generate a minimal description for the clusters
       – Determine the minimal cover for each cluster

  • Jian Pei: CMPT 459/741 Clustering (4) 16

    CLIQUE: An Example

    [Figure: grid partitions of the Salary (10,000) vs. age subspace and the Vacation (week) vs. age subspace, with age ranging from 20 to 60 on the horizontal axis.]

  • Jian Pei: CMPT 459/741 Clustering (4) 17

    CLIQUE: Pros and Cons

    •  Automatically find subspaces of the highest dimensionality with high-density clusters
    •  Insensitive to the order of input
       – Does not presume any canonical data distribution
    •  Scale linearly with the size of input
    •  Scale well with the number of dimensions
    •  The clustering result may be degraded at the expense of simplicity of the method

  • Jian Pei: CMPT 459/741 Clustering (4) 18

    Bad Cases for CLIQUE

    Parts of a cluster may be missed

    A cluster from CLIQUE may contain noise

  • Jian Pei: CMPT 459/741 Clustering (4) 19

    Dimensionality Reduction

    •  Clustering a high-dimensional data set is challenging
       – Distance between two points could be dominated by noise
    •  Dimensionality reduction: choosing the informative dimensions for clustering analysis
       – Feature selection: choosing a subset of existing dimensions
       – Feature construction: construct a new (small) set of informative attributes

  • Jian Pei: CMPT 459/741 Clustering (4) 20

    Variance and Covariance

    •  Given a set of 1-d points, how different are those points?
       – Standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$
       – Variance: $s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$

    •  Given a set of 2-d points, are the two dimensions correlated?
       – Covariance: $cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$
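    A quick numeric check of the three formulas above (all divide by n - 1, i.e., sample statistics); the data values are made up for illustration only.

        import numpy as np

        X = np.array([2.5, 0.5, 2.2, 1.9, 3.1])
        Y = np.array([2.4, 0.7, 2.9, 2.2, 3.0])

        print(X.std(ddof=1))        # standard deviation s
        print(X.var(ddof=1))        # variance s^2
        print(np.cov(X, Y)[0, 1])   # cov(X, Y)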

  • Jian Pei: CMPT 459/741 Clustering (4) 21

    Principal Components

    Art work and example from http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

  • Jian Pei: CMPT 459/741 Clustering (4) 22

    Step 1: Mean Subtraction

    •  Subtract the mean from each dimension for each data point

    •  Intuition: centralizing the data set

  • Jian Pei: CMPT 459/741 Clustering (4) 23

    Step 2: Covariance Matrix

    $C = \begin{pmatrix} cov(D_1, D_1) & cov(D_1, D_2) & \cdots & cov(D_1, D_n) \\ cov(D_2, D_1) & cov(D_2, D_2) & \cdots & cov(D_2, D_n) \\ \vdots & \vdots & \ddots & \vdots \\ cov(D_n, D_1) & cov(D_n, D_2) & \cdots & cov(D_n, D_n) \end{pmatrix}$

  • Jian Pei: CMPT 459/741 Clustering (4) 24

    Step 3: Eigenvectors and Eigenvalues

    •  Compute the eigenvectors and the eigenvalues of the covariance matrix
       – Intuition: find those direction-invariant vectors as candidates for the new attributes
       – Eigenvalues indicate how much the direction-invariant vectors are scaled; the larger, the better for manifesting the data variance

  • Jian Pei: CMPT 459/741 Clustering (4) 25

    Step 4: Forming New Features

    •  Choose the principal components and form the new features
       – Typically, choose the top-k components

  • Jian Pei: CMPT 459/741 Clustering (4) 26

    New Features

    NewData = RowFeatureVector x RowDataAdjust

    The first principal component is used
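    A compact NumPy sketch of steps 1 through 4 above (mean subtraction, covariance matrix, eigen-decomposition, projection onto the top-k components). The variable names loosely follow the tutorial's terminology (RowDataAdjust, RowFeatureVector); the choice of k and the use of eigh are my own illustrative choices.

        import numpy as np

        def pca(data, k=1):
            # Step 1: subtract the mean from each dimension
            row_data_adjust = data - data.mean(axis=0)
            # Step 2: covariance matrix of the centered data
            C = np.cov(row_data_adjust, rowvar=False)
            # Step 3: eigenvectors and eigenvalues, sorted by decreasing eigenvalue
            eigvals, eigvecs = np.linalg.eigh(C)
            order = np.argsort(eigvals)[::-1]
            row_feature_vector = eigvecs[:, order[:k]].T     # top-k components as rows
            # Step 4: NewData = RowFeatureVector x RowDataAdjust
            new_data = row_feature_vector @ row_data_adjust.T
            return new_data.T, row_feature_vector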

  • Clustering in Derived Space

    Jian Pei: CMPT 459/741 Clustering (4) 27

    [Figure: 2-d points plotted against axes X and Y with origin O; the derived feature used for clustering is -0.707x + 0.707y.]

  • Spectral Clustering

    Jian Pei: CMPT 459/741 Clustering (4) 28

    Data → Affinity matrix W = [wij] → A = f(W) → Compute the k leading eigenvectors of A (Av = λv) → Clustering in the new space → Project back to cluster the original data

  • Affinity Matrix

    •  Using a distance measure, define the affinity

       $W_{ij} = e^{-\frac{dist(o_i, o_j)^2}{2\sigma^2}}$

       where σ is a scaling parameter controlling how fast the affinity Wij decreases as the distance increases

    •  In the Ng-Jordan-Weiss algorithm, Wii is set to 0

    Jian Pei: CMPT 459/741 Clustering (4) 29

  • Clustering

    •  In the Ng-Jordan-Weiss algorithm, we define a diagonal matrix D such that

       $D_{ii} = \sum_{j=1}^{n} W_{ij}$

    •  Then, $A = D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$
    •  Use the k leading eigenvectors to form a new space
    •  Map the original data to the new space and conduct clustering

    Jian Pei: CMPT 459/741 Clustering (4) 30
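    A minimal sketch of the Ng-Jordan-Weiss pipeline described above. The Gaussian affinity, the choice of σ, the row renormalization of the eigenvector matrix, and the plain k-means step in the new space follow the usual NJW recipe and are assumptions where the slides leave details open.

        import numpy as np

        def njw_spectral_clustering(X, k, sigma=1.0, n_iter=50, seed=0):
            # Affinity matrix with W_ii = 0
            d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
            W = np.exp(-d2 / (2 * sigma ** 2))
            np.fill_diagonal(W, 0.0)
            # A = D^{-1/2} W D^{-1/2}
            d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
            A = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
            # The k leading eigenvectors of A form the new space (rows renormalized)
            eigvals, eigvecs = np.linalg.eigh(A)
            U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
            U /= np.linalg.norm(U, axis=1, keepdims=True) + 1e-12
            # Cluster the mapped points with a plain k-means loop; the labels carry back to the original points
            rng = np.random.default_rng(seed)
            centers = U[rng.choice(len(U), size=k, replace=False)]
            for _ in range(n_iter):
                labels = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
                centers = np.array([U[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            return labels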

  • Is a Clustering Good?

    •  Feasibility
       – Applying any clustering method to a uniformly distributed data set is meaningless
    •  Quality
       – Are the clustering results meeting users' interests?
       – Clustering patients into clusters corresponding to various diseases or sub-phenotypes is meaningful
       – Clustering patients into clusters corresponding to male or female is not meaningful

    Jian Pei: CMPT 459/741 Clustering (4) 31

  • Major Tasks

    •  Assessing clustering tendency – Are there non-random structures in the data?

    •  Determining the number of clusters or other critical parameters

    •  Measuring clustering quality

    Jian Pei: CMPT 459/741 Clustering (4) 32

  • Uniformly Distributed Data

    •  Clustering uniformly distributed data is meaningless

    •  A uniformly distributed data set is generated by a uniform data distribution

    Jian Pei: CMPT 459/741 Clustering (4) 33

    Figure 10.21: A data set that is uniformly distributed in the data space.

    • Measuring clustering quality. After applying a clustering method on a data set, we want to assess how good the resulting clusters are. A number of measures can be used. Some methods measure how well the clusters fit the data set, while others measure how well the clusters match the ground truth, if such truth is available. There are also measures that score clusterings and thus can compare two sets of clustering results on the same data set.

    In the rest of this section, we discuss each of the above three topics.

    10.6.1 Assessing Clustering Tendency

    Clustering tendency assessment determines whether a given data set has a non-random structure, which may lead to meaningful clusters. Consider a data set that does not have any non-random structure, such as a set of uniformly distributed points in a data space. Even though a clustering algorithm may return clusters for the data, those clusters are random and are not meaningful.

    Example 10.9 Clustering requires non-uniform distribution of data. Figure 10.21 shows a data set that is uniformly distributed in 2-dimensional data space. Although a clustering algorithm may still artificially partition the points into groups, the groups will unlikely mean anything significant to the application due to the uniform distribution of the data.

    "How can we assess the clustering tendency of a data set?" Intuitively, we can try to measure the probability that the data set is generated by a uniform data distribution. This can be achieved using statistical tests for spatial randomness. To illustrate this idea, let's look at a simple yet effective statistic called the Hopkins Statistic.

    The Hopkins Statistic is a spatial statistic that tests the spatial randomness of a variable as distributed in a space. Given a data set, D, which is regarded as a sample of a random variable, o, we want to determine how far away o is from being uniformly distributed in the data space. We calculate the Hopkins Statistic as follows:

  • Hopkins Statistic

    •  Hypothesis: the data is generated by a uniform distribution in a space

    •  Sample n points, p1, …, pn, uniformly from the space of D

    •  For each point pi, find the nearest neighbor of pi in D, let xi be the distance between pi and its nearest neighbor in D

    Jian Pei: CMPT 459/741 Clustering (4) 34

    $x_i = \min_{v \in D} \{dist(p_i, v)\}$

  • Hopkins Statistic

    •  Sample n points, q1, …, qn, uniformly from D
    •  For each qi, find the nearest neighbor of qi in D – {qi}; let yi be the distance between qi and its nearest neighbor in D – {qi}

       $y_i = \min_{v \in D,\, v \neq q_i} \{dist(q_i, v)\}$

    •  Calculate the Hopkins Statistic H

       $H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$

    Jian Pei: CMPT 459/741 Clustering (4) 35
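    A small sketch of the Hopkins Statistic computation described on the two slides above. The uniform sampling over the bounding box of D, the sample size n, and the brute-force nearest-neighbor search are illustrative choices.

        import numpy as np

        def hopkins_statistic(D, n=50, seed=0):
            rng = np.random.default_rng(seed)
            m, d = D.shape
            # Sample n points p_i uniformly from the data space (bounding box of D)
            P = rng.uniform(D.min(axis=0), D.max(axis=0), size=(n, d))
            # Sample n points q_i uniformly from D itself
            idx = rng.choice(m, size=n, replace=False)
            Q = D[idx]
            # x_i: distance from p_i to its nearest neighbor in D
            x = np.sqrt(((P[:, None, :] - D[None, :, :]) ** 2).sum(axis=2)).min(axis=1)
            # y_i: distance from q_i to its nearest neighbor in D - {q_i}
            dq = np.sqrt(((Q[:, None, :] - D[None, :, :]) ** 2).sum(axis=2))
            dq[np.arange(n), idx] = np.inf          # exclude q_i itself
            y = dq.min(axis=1)
            return y.sum() / (x.sum() + y.sum())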

  • Explanation

    •  If D is uniformly distributed, then $\sum_{i=1}^{n} y_i$ and $\sum_{i=1}^{n} x_i$ would be close to each other, and thus H would be around 0.5

    •  If D is skewed, then $\sum_{i=1}^{n} y_i$ would be substantially smaller, and thus H would be close to 0

    •  If H > 0.5, then it is unlikely that D has statistically significant clusters

    Jian Pei: CMPT 459/741 Clustering (4) 36

  • Finding the Number of Clusters

    •  Depends on many factors
       – The shape and scale of the distribution in the data set
       – The clustering resolution required by the user

    •  Many methods exist
       – Set $k = \sqrt{\frac{n}{2}}$, so that each cluster has $\sqrt{2n}$ points on average
       – Plot the sum of within-cluster variances with respect to k, and find the first (or the most significant) turning point

    Jian Pei: CMPT 459/741 Clustering (4) 37

  • A Cross-Validation Method

    •  Divide the data set D into m parts
    •  Use m – 1 parts to find a clustering
    •  Use the remaining part as the test set to test the quality of the clustering
       – For each point in the test set, find the closest centroid or cluster center
       – Use the squared distances between all points in the test set and the corresponding centroids to measure how well the clustering model fits the test set
    •  Repeat m times for each value of k, use the average as the quality measure

    Jian Pei: CMPT 459/741 Clustering (4) 38
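    A rough sketch of the m-fold procedure above. The k-means routine, the Euclidean distance, and the use of the total squared distance to the closest centroid as the fit measure follow the slide's description, with the remaining implementation details assumed.

        import numpy as np

        def kmeans_centroids(X, k, n_iter=50, seed=0):
            rng = np.random.default_rng(seed)
            centers = X[rng.choice(len(X), size=k, replace=False)]
            for _ in range(n_iter):
                labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
                centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            return centers

        def cross_validation_quality(D, k, m=5, seed=0):
            rng = np.random.default_rng(seed)
            folds = np.array_split(rng.permutation(len(D)), m)
            scores = []
            for i in range(m):
                test_idx = folds[i]
                train_idx = np.concatenate([folds[j] for j in range(m) if j != i])
                centers = kmeans_centroids(D[train_idx], k)
                # Squared distance from each test point to its closest centroid
                d2 = ((D[test_idx][:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
                scores.append(d2.min(axis=1).sum())
            return np.mean(scores)   # average over the m folds; compare across values of k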

  • Measuring Clustering Quality

    •  Ground truth: the ideal clustering determined by human experts

    •  Two situations
       – There is a known ground truth: the extrinsic (supervised) methods compare the clustering against the ground truth
       – The ground truth is unavailable: the intrinsic (unsupervised) methods measure how well the clusters are separated

    Jian Pei: CMPT 459/741 Clustering (4) 39

  • Quality in Extrinsic Methods

    •  Cluster homogeneity: the more pure the clusters in a clustering are, the better the clustering
    •  Cluster completeness: objects in the same cluster in the ground truth should be clustered together
    •  Rag bag: putting a heterogeneous object into a pure cluster is worse than putting it into a rag bag
    •  Small cluster preservation: splitting a small cluster in the ground truth into pieces is worse than splitting a bigger one

    Jian Pei: CMPT 459/741 Clustering (4) 40

  • Bcubed Precision and Recall

    •  D = {o1, …, on}
       – L(oi) is the cluster of oi given by the ground truth
    •  C is a clustering on D
       – C(oi) is the cluster-id of oi in C
    •  For two objects oi and oj, the correctness is 1 if L(oi) = L(oj) ⇔ C(oi) = C(oj), and 0 otherwise

    Jian Pei: CMPT 459/741 Clustering (4) 41

  • Bcubed Precision and Recall

    •  Precision

    •  Recall

    Jian Pei: CMPT 459/741 Clustering (4) 42

    • Rag bag. Consider a clustering C1 and a cluster C in C1 such that all objects in C except for one, denoted by o, belong to the same category according to ground truth. Consider a clustering C2 identical to C1 except that o is assigned to a cluster C′ ≠ C in C2 such that C′ contains objects from various categories according to ground truth, and thus is noisy. In other words, C′ in C2 is a rag bag. Then, a clustering quality measure Q respecting the rag bag criterion should give a higher score to C2, that is, Q(C2, Cg) > Q(C1, Cg).

    • Small cluster preservation. If a small category is split into small pieces in a clustering, those small pieces may likely become noise and thus the small category cannot be discovered from the clustering. The small cluster preservation criterion states that splitting a small category into pieces is more harmful than splitting a large category into pieces. Consider an extreme case. Let D be a data set of n + 2 objects such that, according to the ground truth, n objects, denoted by o1, . . . , on, belong to one category and the other 2 objects, denoted by on+1, on+2, belong to another category. Suppose clustering C1 has three clusters, C1 = {o1, . . . , on}, C2 = {on+1}, and C3 = {on+2}. Let clustering C2 have three clusters, too, namely C1 = {o1, . . . , on−1}, C2 = {on}, and C3 = {on+1, on+2}. In other words, C1 splits the small category and C2 splits the big category. A clustering quality measure Q preserving small clusters should give a higher score to C2, that is, Q(C2, Cg) > Q(C1, Cg).

    Many clustering quality measures satisfy some of the above four criteria. Here, we introduce the BCubed precision and recall metrics, which satisfy all of the above criteria.

    BCubed evaluates the precision and recall for every object in a clustering on a given data set according to the ground truth. The precision of an object indicates how many other objects in the same cluster belong to the same category as the object. The recall of an object reflects how many objects of the same category are assigned to the same cluster.

    Formally, let D = {o1, . . . , on} be a set of objects, and C be a clustering on D. Let L(oi) (1 ≤ i ≤ n) be the category of oi given by ground truth, and C(oi) be the cluster ID of oi in C. Then, for two objects, oi and oj (1 ≤ i, j ≤ n, i ≠ j), the correctness of the relation between oi and oj in clustering C is given by

    $Correctness(o_i, o_j) = \begin{cases} 1 & \text{if } L(o_i) = L(o_j) \Leftrightarrow C(o_i) = C(o_j) \\ 0 & \text{otherwise} \end{cases} \quad (10.28)$

    BCubed precision is defined as

    $Precision\ BCubed = \frac{\sum_{i=1}^{n} \frac{\sum_{o_j:\, i \neq j,\, C(o_i) = C(o_j)} Correctness(o_i, o_j)}{\|\{o_j \mid i \neq j,\, C(o_i) = C(o_j)\}\|}}{n} \quad (10.29)$

    BCubed recall is defined as

    $Recall\ BCubed = \frac{\sum_{i=1}^{n} \frac{\sum_{o_j:\, i \neq j,\, L(o_i) = L(o_j)} Correctness(o_i, o_j)}{\|\{o_j \mid i \neq j,\, L(o_i) = L(o_j)\}\|}}{n} \quad (10.30)$
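    The two formulas above translate directly into code. Below is a sketch where labels_true plays the role of L(o) and labels_pred plays the role of C(o); it uses a plain O(n^2) pairwise loop, so it is meant for small examples only.

        def bcubed(labels_true, labels_pred):
            n = len(labels_true)
            precision = recall = 0.0
            for i in range(n):
                same_cluster = [j for j in range(n) if j != i and labels_pred[j] == labels_pred[i]]
                same_category = [j for j in range(n) if j != i and labels_true[j] == labels_true[i]]
                # Correctness(o_i, o_j) = 1 iff L(o_i) = L(o_j) <=> C(o_i) = C(o_j)
                correct = lambda j: (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
                if same_cluster:
                    precision += sum(correct(j) for j in same_cluster) / len(same_cluster)
                if same_category:
                    recall += sum(correct(j) for j in same_category) / len(same_category)
            return precision / n, recall / n

        # Example: two ground-truth categories, one clustering that merges everything
        print(bcubed([0, 0, 1, 1], [0, 0, 0, 0]))   # precision < 1, recall = 1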

    Intrinsic Methods

    When the ground truth of a data set is not available, we have to use an intrinsic method to assess the clustering quality. In general, intrinsic methods evaluate a clustering by examining how well the clusters are separated and how compact the clusters are. Many intrinsic methods take advantage of a similarity metric between objects in the data set.

    The silhouette coefficient is such a measure. For a data set D of n objects, suppose D is partitioned into k clusters, C1, . . . , Ck. For each object o ∈ D, we calculate a(o) as the average distance between o and all other objects in the cluster to which o belongs. Similarly, b(o) is the minimum average distance from o to all clusters to which o does not belong. Formally, suppose o ∈ Ci (1 ≤ i ≤ k); then

    $a(o) = \frac{\sum_{o' \in C_i,\, o \neq o'} dist(o, o')}{|C_i| - 1} \quad (10.31)$

    and

    $b(o) = \min_{C_j:\, 1 \le j \le k,\, j \neq i} \left\{ \frac{\sum_{o' \in C_j} dist(o, o')}{|C_j|} \right\} \quad (10.32)$

    The silhouette coefficient of o is then defined as

    $s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}} \quad (10.33)$

    The value of the silhouette coefficient is between −1 and 1. The value of a(o) reflects the compactness of the cluster to which o belongs. The smaller the value is, the more compact the cluster is. The value of b(o) captures the degree to which o is separated from other clusters. The larger b(o) is, the more separated o is from other clusters. Therefore, when the silhouette coefficient value of o approaches 1, the cluster containing o is compact and o is far away from other clusters, which is the preferable case. However, when the silhouette coefficient value is negative (that is, b(o) < a(o)), this means that, in expectation, o is closer to the objects in another cluster than to the objects in the same cluster as o. In many cases, this is a bad case and should be avoided.

    To measure the fitness of a cluster within a clustering, we can compute the average silhouette coefficient value of all objects in the cluster. To measure the quality of a clustering, we can use the average silhouette coefficient value of all objects in the data set.

  • Silhouette Coefficient

    •  No ground truth is assumed
    •  Suppose a data set D of n objects is partitioned into k clusters, C1, …, Ck
    •  For each object o,
       – Calculate a(o), the average distance between o and every other object in the same cluster – compactness of a cluster; the smaller, the better
       – Calculate b(o), the minimum average distance from o to every object in a cluster that o does not belong to – degree of separation from other clusters; the larger, the better

    Jian Pei: CMPT 459/741 Clustering (4) 43

  • Silhouette Coefficient

    •  Then

       $a(o) = \frac{\sum_{o' \in C_i,\, o' \neq o} dist(o, o')}{|C_i| - 1}$

       $b(o) = \min_{C_j :\, o \notin C_j} \left\{ \frac{\sum_{o' \in C_j} dist(o, o')}{|C_j|} \right\}$

       $s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}}$

    •  Use the average silhouette coefficient of all objects as the overall measure

    Jian Pei: CMPT 459/741 Clustering (4) 44
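    A small sketch that computes the silhouette coefficient exactly as defined above, with X holding the objects and labels the cluster assignment; Euclidean distance is an illustrative choice for dist, and at least two clusters are assumed.

        import numpy as np

        def silhouette(X, labels):
            labels = np.asarray(labels)
            n = len(X)
            dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
            clusters = np.unique(labels)
            s = np.zeros(n)
            for i in range(n):
                in_cluster = (labels == labels[i])
                # a(o): average distance to the other objects in o's own cluster
                a = dist[i, in_cluster & (np.arange(n) != i)].mean() if in_cluster.sum() > 1 else 0.0
                # b(o): minimum average distance to the objects of any other cluster
                b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
                s[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
            return s.mean()   # average silhouette coefficient over all objects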

  • Multi-Clustering

    •  A data set may be clustered in different ways
       – In different subspaces, that is, using different attributes
       – Using different similarity measures
       – Using different clustering methods
    •  Some different clusterings may capture different meanings of categorization
       – Orthogonal clusterings
    •  Putting users in the loop

    Jian Pei: CMPT 459/741 Clustering (4) 45

  • To-Do List

    •  Read Chapters 10.5, 10.6, and 11.1
    •  Find out how Gaussian mixtures can be used in Spark MLlib
    •  (For thesis-based graduate students only) Learn LDA (Latent Dirichlet Allocation) by yourself

    Jian Pei: CMPT 459/741 Clustering (4) 46