
State-of-the-art Clustering Techniques: Support Vector Methods and Minimum Bregman Information Principle


DESCRIPTION

Clustering is an automatic learning technique aimed at grouping a set of objects into subsets, or clusters. The goal is to create clusters that are coherent internally but substantially different from each other: objects in the same cluster should be as similar as possible, whereas objects in one cluster should be as dissimilar as possible from objects in the other clusters. Clustering is an unsupervised learning technique, because it groups objects into clusters without any additional information: only the information provided by the data is used, and no human intervention adds information to improve the learning. The application domains are manifold. One example is the grouping of text documents, where the goal is the construction of groups of documents related to each other, i.e. documents about the same topic.

The goal of this thesis is to study in depth state-of-the-art and experimental clustering techniques. We consider two techniques. The first is known as the Minimum Bregman Information principle. This principle generalizes the classic relocation scheme already adopted by K-means, in order to allow the employment of a rich family of divergence functions called Bregman divergences. A new, more general clustering scheme was developed on top of this principle. Moreover, a co-clustering scheme is formulated too, which leads to an important generalization, as we will see in the sequel.

The second approach is Support Vector Clustering, a clustering process which relies on the state of the art of learning machines: the Support Vector Machines. Support Vector Clustering is currently the subject of active research, as it is still in an early stage of development. We have accurately analyzed this clustering method and we have also provided some contributions which allow a reduction in the number of iterations and in the computational complexity, and a gain in accuracy.

The main application domains we have dealt with are text mining and astrophysics data mining. Within these application domains we have verified and accurately analyzed the properties of both methodologies by means of dedicated experiments. The results are given in terms of robustness w.r.t. missing values, dimensionality reduction, robustness w.r.t. noise and outliers, and the ability to describe clusters of arbitrary shapes.


Page 1: State-of-the-art Clustering Techniques: Support Vector Methods and Minimum Bregman Information Principle

VINCENZO RUSSO STATE-OF-THE-ART CLUSTERING TECHNIQUES: SUPPORT VECTOR METHODS AND MINIMUM BREGMAN INFORMATION PRINCIPLE

UNIVERSITÀ degli STUDI di NAPOLI FEDERICO II

Support Vector Methods and Minimum Bregman Information Principle

State-of-the-art Clustering Techniques

by VINCENZO RUSSO

Supervisor: prof. Anna CORAZZA
Co-supervisor: prof. Ezio CATANZARITI

Page 2: State-of-the-art Clustering Techniques: Support Vector Methods and Minimum Bregman Information Principle


Unsupervised learning: groups a set of objects into subsets called clusters. The objects are represented as points in a subspace of $\mathbb{R}^d$, where $d$ is the number of point components, also called attributes or features. Several application domains: information retrieval, bioinformatics, cheminformatics, image retrieval, astrophysics, market segmentation, etc.

What is clustering?

Introduction

CLUSTERING

[Figure: non-structured data vs. a 3-clusters structure]

Page 3: State-of-the-art Clustering Techniques: Support Vector Methods and Minimum Bregman Information Principle


Support Vector Clustering (SVC)
Bregman Co-clustering

Two state-of-the-art approaches

Goals

| Goal | Application domain |
| Robustness w.r.t. Missing-valued Data | Astrophysics |
| Robustness w.r.t. Sparse Data | Textual documents |
| Robustness w.r.t. High "dimensionality" | Textual documents |
| Robustness w.r.t. Noise/Outliers | Synthetic data |

Other desirable properties: nonlinear separable problems handling; automatic detection of the number of clusters; application domain independence.

Page 4: State-of-the-art Clustering Techniques: Support Vector Methods and Minimum Bregman Information Principle


Nonlinear mapping $\Phi$ from a data space $\mathcal{X}$ to a higher dimensional feature space $\mathcal{F}$. In $\mathcal{F}$ we find the Minimum Enclosing Ball (MEB), i.e. the sphere enclosing all feature-space images and having the minimum radius. Mapping back onto $\mathcal{X}$: the sphere splits into contours.

The contours consist of Support Vectors (SV), and describe the clusters

Support Vector Clustering: the idea

Support Vector Clustering

6.1. INTRODUCTION TO SUPPORT VECTOR CLUSTERING

Figure 6.1: A graphical example of a data mapping from input space $\mathcal{X}$ (data space) to feature space $\mathcal{F}$ using the MEB formulation.

clusters. Unfortunately, finding the MEB is not enough. The process of finding such a sphere is only able to detect the cluster boundaries, which are modeled by mapping the sphere back to the data space. This first phase was called Cluster Description by Ben-Hur et al. (2001). A second stage for determining the membership of points to the clusters is needed. The authors called this step Cluster Labeling, even if it simply does a cluster assignment.⁵ In the following subsections we provide an overview of the Support Vector Clustering algorithm as originally proposed by Ben-Hur et al. (2001).

6.1.1 Cluster description

The SVM formulation for the smallest enclosing sphere was firstly presented to explain the Vapnik-Chervonenkis (VC) dimension (Vapnik, 1995). Later it was used for estimating the support of a high-dimensional distribution (Schölkopf et al., 1999). Finally, the MEB was used for the Support Vector Domain Description (SVDD), an SVM formulation for one-class classification (Tax, 2001; Tax and Duin, 1999a,b, 2004).⁶ The SVDD is the basic step of the SVC and allows describing the boundaries of clusters.

Let $\mathcal{X} = \{\vec{x}_1, \vec{x}_2, \cdots, \vec{x}_n\}$ be a dataset of $n$ points, with $\mathcal{X} \subseteq \mathbb{R}^d$ the data space. We use a nonlinear transformation $\Phi : \mathcal{X} \to \mathcal{F}$ from the input space $\mathcal{X}$ to some high dimensional feature space $\mathcal{F}$, wherein we look for the smallest enclosing sphere of radius $R$. This is formalized as follows

⁵ The name cluster labeling probably descends from the originally proposed algorithm, which is based on finding the connected components of a graph: the algorithms for finding the connected components usually assign the "component labels" to the vertices.

⁶ An alternative SVM formulation for the same task, called One Class SVM, can be found in Schölkopf et al. (2000b) (see Appendix A).


$\Phi^{-1} : \mathcal{F} \to \mathcal{X}$

Page 5: State-of-the-art Clustering Techniques: Support Vector Methods and Minimum Bregman Information Principle


Finding the Minimum Enclosing Ball (MEB): nonlinear Support Vector Domain Description (SVDD). QP problem; computational complexity $O(n^3)$ in the worst case (see below).

Phase I: Cluster description

CHAPTER 6. SUPPORT VECTOR CLUSTERING

$$\min_{R, \vec{a}} \; R^2 \qquad (6.1)$$

subject to $\|\Phi(\vec{x}_k) - \vec{a}\|^2 \le R^2, \quad k = 1, 2, \cdots, n$

where $\vec{a}$ is the center of the sphere. Soft constraints are incorporated by adding slack variables $\xi_k$

$$\min_{R, \vec{a}, \vec{\xi}} \; R^2 + C\sum_{k=1}^{n}\xi_k \qquad (6.2)$$

subject to $\|\Phi(\vec{x}_k) - \vec{a}\|^2 \le R^2 + \xi_k$ and $\xi_k \ge 0$, for $k = 1, 2, \cdots, n$.

To solve this problem we introduce the Lagrangian

$$L(R, \vec{a}, \vec{\xi}; \vec{\beta}, \vec{\mu}) = R^2 - \sum_{k}\left(R^2 + \xi_k - \|\Phi(\vec{x}_k) - \vec{a}\|^2\right)\beta_k - \sum_{k}\xi_k\mu_k + C\sum_{k}\xi_k \qquad (6.3)$$

with Lagrangian multipliers $\beta_k \ge 0$ and $\mu_k \ge 0$ for $k = 1, 2, \cdots, n$. The positive real constant $C$ provides a way to control the outliers percentage, allowing the sphere in feature space to not enclose all points of the target class. As in the case of supervised SVMs (see subsection 2.6.3), this could lead to some erroneous classifications, but it allows finding a solution to the problem. The solution is characterized by the saddle point of the Lagrangian

$$\max_{\vec{\beta}, \vec{\mu}} \; \min_{R, \vec{a}, \vec{\xi}} \; L(R, \vec{a}, \vec{\xi}; \vec{\beta}, \vec{\mu}) \qquad (6.4)$$

Setting to zero the partial derivatives of $L$ with respect to $R$, $\vec{a}$ and $\xi_k$, respectively, leads to

$$\sum_{k}\beta_k = 1 \qquad (6.5)$$

$$\vec{a} = \sum_{k}\beta_k\Phi(\vec{x}_k) \qquad (6.6)$$

$$\beta_k = C - \mu_k \qquad (6.7)$$

We still need to determine $R$. The KKT complementarity conditions⁷ (Boyd and Vandenberghe, 2004, sec. 5.5.3) result in

⁷ It is a generalization of the method of the Lagrange multipliers. Also in this case, if the primal problem is a convex problem, the strong duality holds.
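The excerpt stops before the Wolfe dual is spelled out. Substituting Equations 6.5-6.7 back into the Lagrangian gives the standard SVDD dual: maximize $\sum_k \beta_k K(\vec{x}_k, \vec{x}_k) - \sum_{k,l}\beta_k\beta_l K(\vec{x}_k, \vec{x}_l)$ subject to $\sum_k \beta_k = 1$ and $0 \le \beta_k \le C$. A minimal sketch of this dual with a generic solver (SciPy's SLSQP, purely for illustration; as the complexity discussion below notes, a practical implementation would rather use SMO or another decomposition method):

```python
import numpy as np
from scipy.optimize import minimize

def svdd_dual(K, C):
    """Solve the SVDD dual for a precomputed kernel matrix K (n x n).

    maximize   sum_k beta_k K[k,k] - sum_{k,l} beta_k beta_l K[k,l]
    subject to sum_k beta_k = 1,  0 <= beta_k <= C
    (here written as minimizing the negated objective).
    """
    n = K.shape[0]
    diag = np.diag(K)

    def objective(beta):
        return -(beta @ diag - beta @ K @ beta)

    def gradient(beta):
        return -(diag - 2.0 * K @ beta)

    beta0 = np.full(n, 1.0 / n)           # feasible start (assumes C >= 1/n)
    constraints = [{"type": "eq", "fun": lambda b: np.sum(b) - 1.0}]
    bounds = [(0.0, C)] * n
    res = minimize(objective, beta0, jac=gradient,
                   bounds=bounds, constraints=constraints, method="SLSQP")
    return res.x
```

In the solution, points with $0 < \beta_k < C$ are the support vectors and points with $\beta_k = C$ are the bounded support vectors (BSVs), i.e. the outliers.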

CHAPTER 6. SUPPORT VECTOR CLUSTERING

Definition 6.1 (Squared Feature Space Distance). Let $\vec{x}$ be a data point. We define the distance of its image in feature space $\mathcal{F}$, $\Phi(\vec{x})$, from the center of the sphere, $\vec{a}$, as follows

$$d_R^2(\vec{x}) = \|\Phi(\vec{x}) - \vec{a}\|^2 \qquad (6.13)$$

In view of Equation 6.6 and the definition of the kernel we have the kernelized version of the above distance

$$d_R^2(\vec{x}) = K(\vec{x}, \vec{x}) - 2\sum_{k=1}^{n}\beta_k K(\vec{x}_k, \vec{x}) + \sum_{k=1}^{n}\sum_{l=1}^{n}\beta_k\beta_l K(\vec{x}_k, \vec{x}_l) \qquad (6.14)$$

Since the solution vector $\vec{\beta}$ is sparse, i.e. only the Lagrangian multipliers associated to the support vectors are non-zero, we can rewrite the above equation as follows

$$d_R^2(\vec{x}) = K(\vec{x}, \vec{x}) - 2\sum_{k=1}^{n_{sv}}\beta_k K(\vec{x}_k, \vec{x}) + \sum_{k=1}^{n_{sv}}\sum_{l=1}^{n_{sv}}\beta_k\beta_l K(\vec{x}_k, \vec{x}_l) \qquad (6.15)$$

where $n_{sv}$ is the number of support vectors. It is trivial to define the radius $R$ of the feature space sphere as the distance of any support vector from the center of the sphere

$$R = d_R(\vec{x}_k), \quad \forall\vec{x}_k \in SV \qquad (6.16)$$

where $SV = \{\vec{x}_k \in \mathcal{X} : 0 < \beta_k < C\}$ and $d_R(\cdot) = \sqrt{d_R^2(\cdot)}$.

The contours that enclose the points in data space are defined by the level set⁸ of the function $d_R(\cdot)$

$$L_{d_R}(R) \equiv \{\vec{x} \mid d_R(\vec{x}) = R\} \qquad (6.17)$$

Such contours are interpreted as forming cluster boundaries. In view of Equation 6.16, we can say that SVs lie on cluster boundaries, BSVs are outside, and all other points lie inside the clusters.

6.1.1.1 The kernel choice

All works⁹ about the Support Vector Clustering assume the Gaussian kernel (see subsubsection 2.6.1.1)

$$K(\vec{x}, \vec{y}) = e^{-q\|\vec{x} - \vec{y}\|^2} \overset{(q = 1/2\sigma^2)}{=} e^{-\frac{\|\vec{x} - \vec{y}\|^2}{2\sigma^2}} \qquad (6.18)$$

as the only kernel available. There are several motivations for the exclusive employment of this kernel. The main reason is that Tax (2001) has investigated only two kernels for the SVDD, the polynomial and the Gaussian ones. The result

⁸ More on level sets in Gray (1997).

⁹ Actually, in one of the preliminary papers about SVC a Laplacian kernel was also employed (Ben-Hur et al., 2000a).
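A minimal sketch of Equations 6.15, 6.16 and 6.18, assuming the Gaussian kernel and illustrative variable names (`sv` holds the support vectors, `beta` their non-zero multipliers):

```python
import numpy as np

def gaussian_kernel(x, y, q):
    """Gaussian kernel K(x, y) = exp(-q * ||x - y||^2)  (Equation 6.18)."""
    return np.exp(-q * np.sum((x - y) ** 2))

def squared_feature_distance(x, sv, beta, q):
    """d_R^2(x) as in Equation 6.15, using only the support vectors and
    their non-zero multipliers."""
    k_xx = gaussian_kernel(x, x, q)                        # = 1 for the Gaussian kernel
    k_xs = np.array([gaussian_kernel(s, x, q) for s in sv])
    K_ss = np.array([[gaussian_kernel(si, sj, q) for sj in sv] for si in sv])
    return k_xx - 2.0 * beta @ k_xs + beta @ K_ss @ beta

def sphere_radius(sv, beta, q):
    """R = d_R(x_k) for any support vector x_k  (Equation 6.16)."""
    return np.sqrt(squared_feature_distance(sv[0], sv, beta, q))
```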


The kernel needs to be normalized; the Gaussian is the most used one.

2.6. SUPPORT VECTOR MACHINES

Figure 2.3: A nonlinear separable problem in the data space $\mathcal{X}$ (Class -1 vs. Class +1) that becomes linearly separable in the feature space $\mathcal{F}$.

In polynomial kernels, the parameter $k$ is the degree. In the exponential-based kernels (and others), the parameter $q$ is called kernel width. The kernel width has a different mathematical meaning depending on the kernel: in the case of the Gaussian kernel, it is a function of the variance

$$q = \frac{1}{2\sigma^2}$$

Kernel width¹³ is a general term to indicate the scale at which data is probed. Because of the aforementioned mathematical differences, a proper kernel width for a particular kernel function may not be proper for another kernel in the context of the same problem.

2.6.2 Maximal margin classifier

The margin concept is a first important step towards understanding the formulation of Support Vector Machines (SVMs). As we can see in Figure 2.1, there exist several separating hyperplanes that separate the data in two classes, and some classifiers (such as the perceptron) just find one of them as a solution. On the other hand, we can also define a unique separating hyperplane according to some criterion. The SVMs in particular define the criterion to be looking for a decision surface that is maximally far away from any data points. This distance from the decision surface to the closest data point determines the margin of the classifier. An SVM is designed to maximize the margin around the separating hyperplane. This necessarily means that the decision function for an SVM is fully specified by a usually small subset of the data which defines the position of the separator. These points are referred to as the support vectors. The data points other than support vectors play no part in determining the decision surface.

¹³ As we will see in the sequel, the kernel width is one of the SVM training hyper-parameters.

CHAPTER 2. MACHINE LEARNING ESSENTIALS

Theorem 2.4 Let $\mathcal{X} \subseteq \mathbb{R}$ be a non-empty set such that $\mathcal{X} = \{\vec{x}_1, \vec{x}_2, \cdots, \vec{x}_n\}$. Let $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a positive semidefinite kernel. The following condition holds

$$\forall\vec{x} \in \mathcal{X}, \quad \kappa(\vec{x}, \vec{x}) \ge 0$$

Theorem 2.5 Let $\mathcal{X} \subseteq \mathbb{R}$ be a non-empty set such that $\mathcal{X} = \{\vec{x}_1, \vec{x}_2, \cdots, \vec{x}_n\}$. For any positive definite kernel $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ the following inequality holds

$$|\kappa(\vec{x}_i, \vec{x}_j)|^2 \le \kappa(\vec{x}_i, \vec{x}_i)\,\kappa(\vec{x}_j, \vec{x}_j) \qquad (2.5)$$

where $i, j = 1, 2, \cdots, n$ and $i \ne j$.

Now we introduce the result that allows us to use Mercer kernels to compute inner products.

Theorem 2.6 Let $K$ be a symmetric function such that for all $\vec{x}_i, \vec{x}_j \in \mathcal{X} = \{\vec{x}_1, \vec{x}_2, \cdots, \vec{x}_n\} \subseteq \mathbb{R}$

$$K(\vec{x}_i, \vec{x}_j) = \Phi(\vec{x}_i) \cdot \Phi(\vec{x}_j) \qquad (2.6)$$

where $\Phi : \mathcal{X} \to \mathcal{F}$ and $\mathcal{F}$ is an inner product higher dimensional space, called Feature Space. The function $K$ can be represented in terms of Equation 2.6 if and only if the Gram matrix $G = (K(\vec{x}_i, \vec{x}_j))_{i,j=1,2,\cdots,n}$ is positive semidefinite, i.e. $K(\cdot)$ is a Mercer kernel.

The kernel function $K(\cdot)$ defines an explicit mapping if $\Phi$ is known, otherwise the mapping is said to be implicit. In the majority of cases, the function $\Phi$ is unknown, so we can implicitly perform an inner product in the feature space $\mathcal{F}$ by means of a kernel $K$. Using nonlinear kernel transformations, we have a chance to transform a non-separable problem in data space into a separable one in feature space (see Figure 2.3).

2.6.1.1 Valid Mercer kernels in $\mathbb{R}^n$ subspaces

There are several functions which are known to satisfy Mercer's condition for $\mathcal{X} \subseteq \mathbb{R}^n$. Some of them are

• Linear kernel: $K(\vec{x}, \vec{y}) = \vec{x} \cdot \vec{y}$
• Polynomial kernel: $K(\vec{x}, \vec{y}) = (\vec{x} \cdot \vec{y} + r)^k$, $r \ge 0$, $k \in \mathbb{N}$
• Gaussian kernel: $K(\vec{x}, \vec{y}) = e^{-q\|\vec{x} - \vec{y}\|^2}$, $q > 0$
• Exponential kernel: $K(\vec{x}, \vec{y}) = e^{-q\|\vec{x} - \vec{y}\|}$, $q > 0$
• Laplacian kernel: $K(\vec{x}, \vec{y}) = e^{-q|\vec{x} - \vec{y}|}$, $q > 0$

where $\|\cdot\|$ is the L2 metric (also known as Euclidean distance) and $|\cdot|$ is the L1 metric (also known as Manhattan distance). In addition, we can build other kernel functions: given two Mercer kernels $K_1$ and $K_2$, we can construct new Mercer kernels by properly combining $K_1$ and $K_2$ (Cristianini and Shawe-Taylor, 2000, sec. 3.3.2).
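The kernels listed above translate directly into code. A minimal sketch with NumPy, where `q`, `r` and `k` are the parameters named in the list:

```python
import numpy as np

def linear(x, y):
    return x @ y

def polynomial(x, y, r=1.0, k=2):
    return (x @ y + r) ** k                      # r >= 0, k in N

def gaussian(x, y, q=1.0):
    return np.exp(-q * np.sum((x - y) ** 2))     # squared L2 distance

def exponential(x, y, q=1.0):
    return np.exp(-q * np.linalg.norm(x - y))    # L2 (Euclidean) distance

def laplacian(x, y, q=1.0):
    return np.exp(-q * np.sum(np.abs(x - y)))    # L1 (Manhattan) distance
```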

Support Vector Clustering

6.1. INTRODUCTION TO SUPPORT VECTOR CLUSTERING

Algorithm 5 The general execution scheme for the SVC
 1: procedure SVC(X)
 2:   q ← initial kernel width
 3:   C ← initial soft constraint
 4:   while stopping criterion is not met do
 5:     β ← clusterDescription(X, q, C)    ▷ Find the MEB via SVDD
 6:     results ← clusterLabeling(X, β)
 7:     choose new q and/or C
 8:   end while
 9:   return results
10: end procedure
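A plain-Python rendering of Algorithm 5, kept deliberately skeletal: `cluster_description`, `cluster_labeling`, `update_params` and `stop` stand for the SVDD step, the labeling step, the q/C exploration strategy and the stopping criterion discussed later, and are assumptions rather than code from the thesis:

```python
def svc(X, q0, C0, cluster_description, cluster_labeling, update_params, stop):
    """Skeleton of Algorithm 5; `stop(None)` is expected to return False
    on the first pass so that at least one iteration runs."""
    q, C = q0, C0
    results = None
    while not stop(results):
        beta = cluster_description(X, q, C)    # find the MEB via SVDD
        results = cluster_labeling(X, beta)
        q, C = update_params(q, C, results)    # choose new q and/or C
    return results
```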

6.1.5 Complexity

We recall that the SVC is composed of two steps: the cluster description and the cluster labeling. To calculate the overall time and space complexity we have to analyze each step separately.

6.1.5.1 Cluster description complexity

The complexity of the cluster description is the complexity of the QP problem (see Equation 6.3) we have to solve for finding the MEB. Such a problem has $O(n^3)$ worst-case running time complexity and $O(n^2)$ space complexity. Anyway, the QP problem can be solved through efficient approximation algorithms like Sequential Minimal Optimization (SMO) (Platt, 1998) and many other decomposition methods. These methods can practically scale down the worst-case running time complexity to (approximately) $O(n^2)$, whereas the space complexity can be reduced to $O(1)$ (Ben-Hur et al., 2001, sec. 5).¹²

6.1.5.2 Cluster labeling complexity

The cluster labeling is composed of two sub-steps: (i) the construction of the adjacency matrix $A$ (see Equation 6.19) and (ii) the computation of the connected components of the undirected graph induced by the matrix $A$. The size of such a matrix is $\tilde{n} \times \tilde{n}$, where $\tilde{n} = n - n_{bsv}$. In the first sub-step we have to compute the sphere radius $R$, i.e. the distance $d_R(\vec{s})$ where $\vec{s}$ is any one of the support vectors, and the distance $d_R(\vec{y})$ for each point $\vec{y}$ sampled along the path connecting two points. The number $m$ of points sampled along a path is the same for all paths we have to check. Finally, due to the dimensions of the adjacency matrix $A$, we have to check $\tilde{n}^2$ paths. Let us recall Equation 6.15

$$d_R^2(\vec{x}) = K(\vec{x}, \vec{x}) - 2\sum_{k=1}^{n_{sv}}\beta_k K(\vec{x}_k, \vec{x}) + \sum_{k=1}^{n_{sv}}\sum_{l=1}^{n_{sv}}\beta_k\beta_l K(\vec{x}_k, \vec{x}_l)$$

¹² In the sequel we discuss the problem of the SVM training complexity in more detail.

Parameters: q = kernel width, C = soft margin

Page 6: State-of-the-art Clustering Techniques: Support Vector Methods and Minimum Bregman Information Principle


Each connected component is a cluster. The original Phase II is a bottleneck (worst case). Alternatives:

Cone Cluster Labeling: best performance/accuracy rate. Gradient Descent.

Phase I only describes the clusters' boundaries. Phase II: finding the connected components of the graph induced by the matrix A.

Phase II: Cluster labeling

Support Vector Clustering

6.1. INTRODUCTION TO SUPPORT VECTOR CLUSTERING

was that the Gaussian kernel produces tighter descriptions (and so tighter contours in the SVC), whereas the polynomial kernel stretches the data in the high dimensional feature space, causing data to become very hard to describe with a hypersphere. In subsubsection 6.1.3.2 we mention another important reason why the Gaussian kernel is the most common choice in this application. In all correlated works, the employment of the Gaussian kernel has been taken for granted. Anyway, in the sequel we will show that the SVC can also work with other types of kernels.

6.1.1.2 SVDD improvement

Tax and Juszczak (2003) have proposed a preprocessing method in order to enhance the performance of the SVDD. The SVDD (and, generally, any one-class classification formulation) is quite sensitive to data distributed in subspaces, which harms the performance. The preprocessing method is called kernel whitening and it consists of a way to scale the data such that the variances of the data are equal in all directions.

6.1.2 Cluster labeling

The cluster description algorithm does not differentiate between points that belong to different clusters. To build a decision method for cluster membership we use a geometric approach involving $d_R(\vec{x})$, based on the following observation: given a pair of data points that belong to different clusters, any path that connects them must exit from the sphere in the feature space. Therefore, such a path contains a segment of points $\vec{y}$ such that $d_R(\vec{y}) > R$. This leads to the definition of the adjacency matrix $A$ between all pairs of points whose images lie in or on the sphere in feature space. Let $S_{ij}$ be the line segment connecting $\vec{x}_i$ and $\vec{x}_j$, such that $S_{ij} = \{\vec{x}_{i+1}, \vec{x}_{i+2}, \cdots, \vec{x}_{j-2}, \vec{x}_{j-1}\}$ for all $i, j = 1, 2, \cdots, n$, then

$$A_{ij} = \begin{cases} 1 & \text{if } \forall\vec{y} \in S_{ij},\; d_R(\vec{y}) \le R \\ 0 & \text{otherwise} \end{cases} \qquad (6.19)$$

Clusters are now defined as the connected components of the graph induced by the matrix $A$. Checking the line segment is implemented by sampling a number $m$ of points between the starting point and the ending point. The exactness of $A_{ij}$ depends on the number $m$. Clearly, the BSVs are unclassified by this procedure since their feature space images lie outside the enclosing sphere. One may decide either to leave them unclassified or to assign them to the cluster that they are closest to. Generally, the latter is the most appropriate choice.

6.1.3 Working with Bounded Support Vectors

The shape of enclosing contours in data space is governed by two hyper-parameters (or simply parameters): the kernel width $q$ that determines the scale at which data

$S_{ij} = \{\vec{x}_{i+1}, \vec{x}_{i+2}, \cdots, \vec{x}_{j-2}, \vec{x}_{j-1}\}$
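A minimal sketch of Equation 6.19 and the connected-components step, reusing the `squared_feature_distance` helper sketched earlier; `m` is the number of points sampled on each segment and all names are illustrative:

```python
import numpy as np

def adjacency_matrix(X, sv, beta, q, R, m=10):
    """A_ij = 1 iff every sampled point on the segment x_i -> x_j stays
    inside the feature-space sphere (Equation 6.19)."""
    n = len(X)
    A = np.zeros((n, n), dtype=int)
    ts = np.linspace(0.0, 1.0, m + 2)[1:-1]          # m interior sample points
    for i in range(n):
        for j in range(i + 1, n):
            inside = all(
                squared_feature_distance(X[i] + t * (X[j] - X[i]), sv, beta, q) <= R ** 2
                for t in ts)
            A[i, j] = A[j, i] = int(inside)
    return A

def connected_components(A):
    """Label clusters as connected components of the graph induced by A
    (simple graph traversal)."""
    n = A.shape[0]
    labels = np.full(n, -1)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]
        labels[start] = current
        while stack:
            u = stack.pop()
            for v in np.nonzero(A[u])[0]:
                if labels[v] == -1:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels
```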


Page 7: State-of-the-art Clustering Techniques: Support Vector Methods and Minimum Bregman Information Principle


Parameters exploration: the greater the kernel width q, the greater the number of support vectors (and so of clusters); C rules the number of outliers and allows dealing with strongly overlapping clusters.

A brute-force approach is unfeasible. Approaches proposed in the literature:

Secant-like algorithm for q exploration; no theoretically rooted method for C exploration.

Data analysis is performed at different levels of detail. Pseudo-hierarchical: a strict hierarchy is not guaranteed when C < 1, due to the Bounded Support Vectors.

Pseudo-hierarchical execution

Support Vector Clustering

Page 8: State-of-the-art Clustering Techniques: Support Vector Methods and Minimum Bregman Information Principle


Soft margin C parameter selection. Heuristics: successfully applied in 90% of cases; only 10 tests out of 100 needed further tuning.

Those 10 datasets had a high percentage of missing values. New robust stop criterion:

Based upon Relative evaluation criteria (C-index, Dunn Index, ad hoc)

Kernel width (q) selection: SVC integration; softening strategy heuristics; applicable to all normalized kernels.

More kernels: Exponential ($K(\vec{x}, \vec{y}) = e^{-q\|\vec{x}-\vec{y}\|}$), Laplace ($K(\vec{x}, \vec{y}) = e^{-q|\vec{x}-\vec{y}|}$)

Proposed improvements

Support Vector Clustering

$O(Qn^3)$, $O(n_{sv}^2)$

Page 9: State-of-the-art Clustering Techniques: Support Vector Methods and Minimum Bregman Information Principle


Improvements - Stop criterion

| Dataset | Detected clusters | Actual clusters | Validity index |
| Iris   | 1 | 3 | 1.00E-06 |
| Iris   | 3 | 3 | 0.13 |
| Iris   | 4 | 3 | 0.05 |
| Breast | 1 | 2 | 1.00E-05 |
| Breast | 2 | 2 | 0.80 |
| Breast | 4 | 2 | 0.27 |

The bigger the validity index, the better the clustering found. The stop criterion halts the process when the index value starts to decrease.

The idea: the SVC outputs quality-increasing clusterings before reaching the optimal clustering. After that, it provides quality-decreasing partitionings.
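A sketch of that idea, under the assumption that the exploration yields a sequence of (q, C) pairs and that a relative validity index (e.g. one of those named above) scores each resulting clustering; `parameter_schedule`, `iterate_svc` and `validity_index` are hypothetical placeholders:

```python
def run_with_stop_criterion(parameter_schedule, iterate_svc, validity_index):
    """Halt as soon as the relative validity index starts to decrease.

    `parameter_schedule` yields successive (q, C) pairs,
    `iterate_svc` runs one cluster description + labeling pass,
    `validity_index` scores the resulting clustering (bigger is better).
    """
    best_score, best_clustering = float("-inf"), None
    for q, C in parameter_schedule:
        clustering = iterate_svc(q, C)
        score = validity_index(clustering)
        if score < best_score:        # quality started to decrease: stop
            break
        best_score, best_clustering = score, clustering
    return best_clustering
```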

Support Vector Clustering

Page 10: State-of-the-art Clustering Techniques: Support Vector Methods and Minimum Bregman Information Principle


Improvements - Kernel width selection

Support Vector Clustering

| Dataset | Algorithm | Accuracy | Macroaveraging | # iter | # potential "q" |
| Iris | SVC | 88.00% | 87.69% | 2 | 9 |
| Iris | SVC + softening | 94.00% | 93.99% | 1 | 13 |
| Iris | K-means | 85.33% | 85.11% | not applicable | not applicable |
| Wine | SVC | 87.07% | 87.55% | 3 | 7 |
| Wine | SVC + softening | 93.26% | 93.91% | 2 | 6 |
| Wine | K-means | 50.00% | 51.78% | not applicable | not applicable |
| B. Cancer* | SVC | 91.85% | 11.00% | 3 | 11 |
| B. Cancer* | SVC + softening | 96.71% | 2.82% | 3 | 13 |
| B. Cancer* | K-means | 60.23% | 32.00% | not applicable | not applicable |
| Syn02 | SVC | 88.80% | 100.00% | 8 | 18 |
| Syn02 | SVC + softening | 88.00% | 100.00% | 4 | 15 |
| Syn02 | K-means | 68.40% | 63.84% | not applicable | not applicable |
| Syn03 | SVC | 87.30% | 100.00% | 17 | 36 |
| Syn03 | SVC + softening | 87.30% | 100.00% | 6 | 31 |
| Syn03 | K-means | 39.47% | 39.90% | not applicable | not applicable |

* For B. Cancer the two value columns report Benign accuracy and Contamination.

Page 11: State-of-the-art Clustering Techniques: Support Vector Methods and Minimum Bregman Information Principle


Improvements - non-Gaussian kernels

Support Vector Clustering

| Dataset | Algorithm | Accuracy | Macroaveraging | # iter | # potential "q" |
| Iris | SVC + softening | 94.00% | 93.99% | 1 | 13 |
| Iris | SVC + softening + Exp kernel | 97.33% | 97.33% | 1 | 15 |
| Iris | K-means | 85.33% | 85.11% | not applicable | not applicable |
| CLA3 | SVC + softening | Failed (only one class out of 3 separated) | | | |
| CLA3 | SVC + softening + Exp kernel | 94.00% | 93.99% | 1 | 11 |
| CLA3 | K-means | 85.33% | 85.11% | not applicable | not applicable |

Exponential Kernel: improves the cluster separation in several cases

Laplace Kernel: improves/allows the cluster separation with normalized data

| Dataset | Algorithm | Accuracy | # iter | # potential "q" |
| Quad | SVC + softening | Failed (no class separated) | | |
| Quad | SVC + softening + Laplace kernel | 99.94% | 1 | 17 |
| Quad | K-means | 83.00% | not applicable | not applicable |
| SG03 | SVC + softening | 73.15% | 3 | 19 |
| SG03 | SVC + softening + Laplace kernel | 91.04% | 1 | 16 |
| SG03 | K-means | 50.24% | not applicable | not applicable |

Page 12: State-of-the-art Clustering Techniques: Support Vector Methods and Minimum Bregman Information Principle


Bregman divergence: $d_\varphi(\vec{x}_1, \vec{x}_2) = \varphi(\vec{x}_1) - \varphi(\vec{x}_2) - \langle\vec{x}_1 - \vec{x}_2, \nabla\varphi(\vec{x}_2)\rangle$, with $\varphi$ convex, real-valued, differentiable.

5.1. TOWARDS THE BREGMAN DIVERGENCES

This idea can be generalized to more than two points. We refer to a point of the form $\theta_1\vec{x}_1 + \theta_2\vec{x}_2 + \cdots + \theta_p\vec{x}_p$, where $\theta_1 + \theta_2 + \cdots + \theta_p = 1$, as an affine combination of the points $\vec{x}_1, \vec{x}_2, \cdots, \vec{x}_p$. Using induction from the definition of affine set, it can be shown that an affine set contains every affine combination of its points. The set of all affine combinations of points in some set $C \subseteq \mathbb{R}^n$ is called the affine hull of $C$, and denoted $\mathrm{aff}(C)$

$$\mathrm{aff}(C) = \{\theta_1\vec{x}_1 + \theta_2\vec{x}_2 + \cdots + \theta_p\vec{x}_p : \vec{x}_1, \vec{x}_2, \cdots, \vec{x}_p \in C,\; \theta_1 + \theta_2 + \cdots + \theta_p = 1\}.$$

The affine hull is the smallest affine set that contains $C$, in the following sense: if $S$ is any affine set with $C \subseteq S$, then $\mathrm{aff}(C) \subseteq S$. Finally, we define the relative interior of the set $C$, denoted $\mathrm{ri}(C)$, as its interior² relative to $\mathrm{aff}(C)$

$$\mathrm{ri}(C) = \{\vec{x} \in C : B(\vec{x}, r) \cap \mathrm{aff}(C) \subseteq C \text{ for some } r > 0\}, \qquad (5.2)$$

where $B(\vec{x}, r)$ is the ball of radius $r$ and center $\vec{x}$ (Boyd and Vandenberghe, 2004, chap. 2).

5.1.2 Bregman divergences

Let us define Bregman divergences (Bregman, 1967), which form a large class of well-behaved loss functions with a number of desirable properties.

Definition 5.1 (Bregman divergence) Let $\varphi$ be a real-valued convex function of Legendre type³ defined on the convex set $S \equiv \mathrm{dom}(\varphi) \subseteq \mathbb{R}^d$. The Bregman divergence $d_\varphi : S \times \mathrm{ri}(S) \to \mathbb{R}_+$ is defined as

$$d_\varphi(\vec{x}_1, \vec{x}_2) = \varphi(\vec{x}_1) - \varphi(\vec{x}_2) - \langle\vec{x}_1 - \vec{x}_2, \nabla\varphi(\vec{x}_2)\rangle \qquad (5.3)$$

where $\nabla\varphi$ is the gradient of $\varphi$, $\langle\cdot,\cdot\rangle$ is the dot product, and $\mathrm{ri}(S)$ is the relative interior of $S$.

Example 5.1 (Squared Euclidean Distance) Squared Euclidean distance is perhaps the simplest and most widely used Bregman divergence. The underlying function $\varphi(\vec{x}) = \langle\vec{x}, \vec{x}\rangle$ is strictly convex, differentiable in $\mathbb{R}^d$ and

$$d_\varphi(\vec{x}_1, \vec{x}_2) = \langle\vec{x}_1, \vec{x}_1\rangle - \langle\vec{x}_2, \vec{x}_2\rangle - \langle\vec{x}_1 - \vec{x}_2, \nabla\varphi(\vec{x}_2)\rangle = \langle\vec{x}_1, \vec{x}_1\rangle - \langle\vec{x}_2, \vec{x}_2\rangle - \langle\vec{x}_1 - \vec{x}_2, 2\vec{x}_2\rangle = \langle\vec{x}_1 - \vec{x}_2, \vec{x}_1 - \vec{x}_2\rangle = \|\vec{x}_1 - \vec{x}_2\|^2 \qquad (5.4)$$

² The interior of a set $C$ consists of all points of $C$ that are intuitively not on the "edge" of $C$ (Boyd and Vandenberghe, 2004, app. A).

³ A proper, closed, convex function $\varphi$ is said to be of Legendre type if: (i) $\mathrm{int}(\mathrm{dom}(\varphi))$ is non-empty, (ii) $\varphi$ is strictly convex and differentiable on $\mathrm{int}(\mathrm{dom}(\varphi))$, and (iii) $\forall\vec{z}_b \in \mathrm{bd}(\mathrm{dom}(\varphi))$, $\lim_{\vec{z}\to\vec{z}_b}\|\nabla\varphi(\vec{z})\| \to \infty$, where $\mathrm{dom}(\varphi)$ is the domain of $\varphi$, $\mathrm{int}(\mathrm{dom}(\varphi))$ is the interior of the domain of $\varphi$ and $\mathrm{bd}(\mathrm{dom}(\varphi))$ is the boundary of the domain of $\varphi$ (Banerjee et al., 2005c).
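A minimal sketch of Definition 5.1, with the gradient of $\varphi$ written out by hand for three common choices (squared Euclidean, relative entropy, Itakura-Saito); the instantiations are standard examples, not code from the thesis:

```python
import numpy as np

def bregman_divergence(phi, grad_phi, x1, x2):
    """d_phi(x1, x2) = phi(x1) - phi(x2) - <x1 - x2, grad phi(x2)>  (Equation 5.3)."""
    return phi(x1) - phi(x2) - np.dot(x1 - x2, grad_phi(x2))

# phi(x) = <x, x>  ->  squared Euclidean distance (Example 5.1)
sq_euclidean = lambda x1, x2: bregman_divergence(
    lambda x: np.dot(x, x), lambda x: 2.0 * x, x1, x2)

# phi(x) = sum_i x_i log x_i  ->  generalized relative entropy
kl = lambda x1, x2: bregman_divergence(
    lambda x: np.sum(x * np.log(x)), lambda x: np.log(x) + 1.0, x1, x2)

# phi(x) = -sum_i log x_i  ->  Itakura-Saito distance
itakura_saito = lambda x1, x2: bregman_divergence(
    lambda x: -np.sum(np.log(x)), lambda x: -1.0 / x, x1, x2)
```

With $\varphi(\vec{x}) = \sum_i x_i\log x_i$ the divergence reduces to the Kullback-Leibler divergence when $\vec{x}_1$ and $\vec{x}_2$ are probability vectors.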

Co-clustering: simultaneous clustering of both rows and columns of a data matrix. Bregman framework:

Generalizes the K-means strategy; large class of divergences: Bregman divergences; Minimum Bregman Information (MBI) principle; meta-algorithm.

Bregman Co-clustering (BCC)

Minimum Bregman Information Principle


5.1. TOWARDS THE BREGMAN DIVERGENCES

$$E_\pi[d_\varphi(X, \vec{s})] \qquad (5.6)$$

that depends on the choice of the representative $\vec{s}$ and can be optimized by picking up the right representative. We call this optimal distortion-rate function the Bregman Information (BI) of the random variable $X$ for the Bregman divergence $d_\varphi$ and denote it by $I_\varphi(X)$, i.e.

$$I_\varphi(X) = \min_{\vec{s}\in\mathrm{ri}(S)} E_\pi[d_\varphi(X, \vec{s})] = \min_{\vec{s}\in\mathrm{ri}(S)} \sum_{i=1}^{n}\pi_i d_\varphi(\vec{x}_i, \vec{s}) \qquad (5.7)$$

The optimal vector $\vec{s}$ which achieves the minimal distortion will be called the Bregman representative or, simply, the representative of $X$. The following proposition⁴ states that such a representative always exists, is uniquely determined and does not depend on the choice of the Bregman divergence. In fact, the minimizer is just the expectation of the random variable $X$.

Proposition 5.1 Let $X$ be a random variable that takes values in $\mathcal{X} = \{\vec{x}_i\}_{i=1}^{n} \subset S \subseteq \mathbb{R}^d$ following a positive probability distribution measure $\pi$ such that $E_\pi[X] \in \mathrm{ri}(S)$.⁵ Given a Bregman divergence $d_\varphi : S \times \mathrm{ri}(S) \to [0, \infty)$, the problem

$$\min_{\vec{s}\in\mathrm{ri}(S)} E_\pi[d_\varphi(X, \vec{s})] \qquad (5.8)$$

has a unique solution given by $\vec{s}^{\,*} = \vec{\mu} = E_\pi[X]$.

Using the proposition above, we can now give a more direct definition of the Bregman Information (BI).

Definition 5.2 (Bregman Information) Let $X$ be a random variable that takes values in $\mathcal{X} = \{\vec{x}_i\}_{i=1}^{n} \subset S \subseteq \mathbb{R}^d$ following a positive probability distribution measure $\pi$. Let $\vec{\mu} = E_\pi[X] = \sum_{i=1}^{n}\pi_i\vec{x}_i \in \mathrm{ri}(S)$ and let $d_\varphi : S \times \mathrm{ri}(S) \to [0, \infty)$ be a Bregman divergence. Then the Bregman Information of $X$ in terms of $d_\varphi$ is defined as

$$I_\varphi(X) = E_\pi[d_\varphi(X, \vec{\mu})] = \sum_{i=1}^{n}\pi_i d_\varphi(\vec{x}_i, \vec{\mu}) \qquad (5.9)$$

Example 5.3 (Variance) Let $\mathcal{X} = \{\vec{x}_i\}_{i=1}^{n}$ be a set in $\mathbb{R}^d$, and consider the uniform measure over $\mathcal{X}$, i.e. $\pi_i = 1/n$. The Bregman Information of $X$ with squared Euclidean distance as Bregman divergence is actually the variance

$$I_\varphi(X) = \sum_{i=1}^{n}\pi_i d_\varphi(\vec{x}_i, \vec{\mu}) = \frac{1}{n}\sum_{i=1}^{n}\|\vec{x}_i - \vec{\mu}\|^2 \qquad (5.10)$$

where

$$\vec{\mu} = \sum_{i=1}^{n}\pi_i\vec{x}_i = \sum_{i=1}^{n}\frac{1}{n}\vec{x}_i$$

⁴ A proof is available in Banerjee et al. (2005c).

⁵ The assumption that $E_\pi[X] \in \mathrm{ri}(S)$ is not restrictive since a violation can occur only when $\mathrm{co}(\mathcal{X}) \subset \mathrm{bd}(S)$, i.e. the entire convex hull of $\mathcal{X}$ is on the boundary of $S$.
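A minimal sketch of Definition 5.2, checked against Example 5.3 (uniform weights and squared Euclidean distance give the variance); it reuses the `bregman_divergence` helper sketched above:

```python
import numpy as np

def bregman_information(X, weights, phi, grad_phi):
    """I_phi(X) = sum_i pi_i d_phi(x_i, mu) with mu = sum_i pi_i x_i  (Equation 5.9)."""
    X = np.asarray(X, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mu = weights @ X                                   # Bregman representative = expectation
    return sum(w * bregman_divergence(phi, grad_phi, x, mu)
               for w, x in zip(weights, X))

# Example 5.3: uniform weights + squared Euclidean distance -> variance
X = np.array([[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]])
w = np.full(len(X), 1.0 / len(X))
bi = bregman_information(X, w, lambda x: np.dot(x, x), lambda x: 2.0 * x)
assert np.isclose(bi, np.mean(np.sum((X - X.mean(axis=0)) ** 2, axis=1)))
```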

Bregman divergence and Bregman Information correspondences:

| Divergence | Information | MBI | Algorithm |
| Euclidean | Variance | Least Squares | K-means |
| Relative Entropy | Mutual Information | Maximum Entropy | unnamed |
| Itakura-Saito | unnamed | unnamed | Linde-Buzo-Gray |
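Each row of the table pairs a divergence with the relocation algorithm it induces. A hedged sketch of that generalized relocation scheme (Bregman hard clustering in the style of Banerjee et al.): assign each point to the representative with the smallest Bregman divergence, then reset each representative to the cluster mean, which is the Bregman representative by Proposition 5.1. With the squared Euclidean divergence this is exactly K-means; the uniform-weight assumption and the helper names are illustrative:

```python
import numpy as np

def bregman_hard_clustering(X, k, divergence, n_iter=100, seed=0):
    """Generalized K-means relocation with an arbitrary Bregman divergence."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    reps = X[rng.choice(len(X), size=k, replace=False)]     # initial representatives
    for _ in range(n_iter):
        # assignment step: nearest representative w.r.t. the chosen divergence
        labels = np.array([np.argmin([divergence(x, r) for r in reps]) for x in X])
        # re-estimation step: the mean minimizes the in-cluster Bregman Information
        new_reps = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else reps[j]
                             for j in range(k)])
        if np.allclose(new_reps, reps):
            break
        reps = new_reps
    return labels, reps

# e.g. labels, reps = bregman_hard_clustering(X, 3, sq_euclidean)  # recovers K-means
```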

Page 13: State-of-the-art Clustering Techniques: Support Vector Methods and Minimum Bregman Information Principle


Sparse data and missing-valued data

Other experiments

| Dataset | SVC | BCC | K-means | # attr. affected | % obj. affected |
| MV5000 (25D)   | 99.02% | 94.00% | 71.08% | 10 | 27.0% |
| MV10000 (25D)  | 96.10% | 95.60% | 75.12% | 10 | 29.0% |
| AMV5000 (15D)  | 91.76% | 79.46% | 74.90% | 6  | 30.0% |
| AMV10000 (15D) | 90.31% | 83.51% | 68.20% | 6  | 30.0% |

Star/Galaxy data with missing values

Textual document data: sparsity and high “dimensionality”

| Dataset | SVC | BCC | K-means |
| CLASSIC3 (3303D) | 99.80% | 100.00% | 49.80% |
| SCI3 (9456D)     | failed | 89.39%  | 39.15% |
| PORE (13821D)    | failed | 82.68%  | 45.91% |

Page 14: State-of-the-art Clustering Techniques: Support Vector Methods and Minimum Bregman Information Principle


Outliers

| Dataset | SVC | Best BCC | K-means | # objects | # outliers |
| SynDECA 02 | 100.00% | 94.18% | 68.04% | 1000  | 112  |
| SynDECA 03 | 100.00% | 49.00% | 39.47% | 10000 | 1270 |

Other experiments

9.8. MISSING VALUES IN ASTROPHYSICS: THE DATA

Figure 9.2: Since the dataset is 3-dimensional, this plot exactly represents Syndeca 03 data. Thefeatures are unnamed, so in the plot we have written “col1”, “col2”, and “col3” for the first, thesecond, and the third feature respectively. The outliers are represented by empty circles.

object is represented by 25 features:²⁷ brightness (Point Spread Function flux), size (Petrosian radius, in arcsec), texture (small-scale roughness of object), and two shape features, all observed at 365 nm. The two shape features (10 actual attributes) are the ones that can have missing values. The guidelines used to extract the objects from the SDSS allowed us to get data that are fairly separable based on the features we used, so we were able to focus on and accurately test the missing-value robustness of our two clustering algorithms. In the second session data were extracted in a similar way, but the actual features are 15 because we avoided features with missing values; then we created different missing-valued versions of the various datasets by means of a simple homemade program that removes values on specific features. The items affected by this procedure are randomly chosen and their number is specified as an input parameter. Finally, to evaluate our clustering results we considered an object to be a Star if the "type_u, type_g, type_r, type_i, type_z" attributes in the SDSS catalogue assume the value 6. An object is a Galaxy if the aforesaid attributes assume the value 3.²⁸

²⁷ Five attributes, each over the u, g, r, i, z bands.

²⁸ The "type" attributes assume values as a function of the Stellarity index returned by SExtractor. SExtractor (Source-Extractor) is a program that builds a catalogue of objects from an astronomical image. It is particularly oriented towards reduction of large scale galaxy-survey data, but it also performs well on moderately crowded star fields. For more information see http://sextractor.sourceforge.net/.

[Plots of the SynDECA 02 and SynDECA 03 datasets]

Page 15: State-of-the-art Clustering Techniques: Support Vector Methods and Minimum Bregman Information Principle


Conclusions

Conclusions and future works

Support Vector Clustering achieves the goals

Bregman Co-clustering achieves the same goals, but the following issues still hold: the problem of estimating the number of clusters; the outlier handling problem.

| Goal | Application domain |
| Robustness w.r.t. Missing-valued Data | Astrophysics |
| Robustness w.r.t. Sparse Data | Textual documents |
| Robustness w.r.t. High "dimensionality" | Textual documents |
| Robustness w.r.t. Noise/Outliers | Synthetic data |
| Other properties: automatic discovery of the number of clusters, application domain independence, nonlinear separable problems handling, arbitrary-shaped clusters handling | Whole experimental stage |

Page 16: State-of-the-art Clustering Techniques: Support Vector Methods and Minimum Bregman Information Principle


SVC was made applicable in practice: complexity reduction for the kernel width selection; soft margin C parameter estimation; new effective stop criterion.

Non-Gaussian kernels: the kernel width selection was shown to be applicable to all normalized kernels; Exponential and Laplacian kernels successfully used.

Improved accuracy: softening strategy for the kernel width selection.

Contribution

Conclusions and future works

Page 17: State-of-the-art Clustering Techniques: Support Vector Methods and Minimum Bregman Information Principle


Minimum Enclosing Bregman Ball (MEBB): generalization of the Minimum Enclosing Ball (MEB) problem and of the Bâdoiu-Clarkson (BC) algorithm to Bregman divergences.

Core Vector Machines (CVM): SVMs reformulated as a MEB problem; they make use of the BC algorithm.

MEBB + CVM = Bregman Vector Machines. New implications for vector machines; new implications for SVC.

Adapting cluster labeling algorithms to the Bregman divergences

Future works

Conclusion and future works

10.3. FUTURE WORK

[Figure 10.1 reproduces Fig. 2 of Nock and Nielsen (2005): examples of Bregman balls for d = 2, for the Itakura-Saito, $L_2^2$ and Kullback-Leibler divergences; blue dots are the centers of the balls.]

Here, $\nabla F$ is the gradient operator of $F$. A Bregman divergence has the following properties: it is convex in $\mathbf{x}'$, always non-negative, and zero iff $\mathbf{x} = \mathbf{x}'$. Whenever $F(\mathbf{x}) = \sum_{i=1}^{d}x_i^2 = \|\mathbf{x}\|_2^2$, the corresponding divergence is the squared Euclidean distance ($L_2^2$): $D_F(\mathbf{x}', \mathbf{x}) = \|\mathbf{x} - \mathbf{x}'\|_2^2$, with which is associated the common definition of a ball in a Euclidean metric space:

$$B_{\mathbf{c},r} = \{\mathbf{x} \in \mathcal{X} : \|\mathbf{x} - \mathbf{c}\|_2^2 \le r\}, \qquad (2)$$

with $\mathbf{c} \in \mathcal{S}$ the center of the ball, and $r \ge 0$ its (squared) radius. Eq. (2) suggests a natural generalization of the definition of balls to arbitrary Bregman divergences. However, since a Bregman divergence is usually not symmetric, any $\mathbf{c} \in \mathcal{S}$ and any $r \ge 0$ actually define two dual Bregman balls:

$$B_{\mathbf{c},r} = \{\mathbf{x} \in \mathcal{X} : D_F(\mathbf{c}, \mathbf{x}) \le r\}, \qquad (3)$$

$$B'_{\mathbf{c},r} = \{\mathbf{x} \in \mathcal{X} : D_F(\mathbf{x}, \mathbf{c}) \le r\}. \qquad (4)$$

Remark that $D_F(\mathbf{c}, \mathbf{x})$ is always convex in $\mathbf{c}$ while $D_F(\mathbf{x}, \mathbf{c})$ is not always; the boundary $\partial B_{\mathbf{c},r}$ is not always convex (it depends on $\mathbf{x}$, given $\mathbf{c}$), while $\partial B'_{\mathbf{c},r}$ is always convex. In this paper, we are mainly interested in $B_{\mathbf{c},r}$ because of the convexity of $D_F$ in $\mathbf{c}$. The conclusion of the paper extends some results to build $B'_{\mathbf{c},r}$ as well. Figure 2 presents some examples of Bregman balls for three popular Bregman divergences (see Table 1 for the analytic expressions of the divergences). Let $\mathcal{S} \subseteq \mathcal{X}$ be a set of $m$ points that were sampled from $\mathcal{X}$. A smallest enclosing Bregman ball (SEBB) for $\mathcal{S}$ is a Bregman ball $B_{\mathbf{c}^*,r^*}$ with $r^*$ the minimal real such that $\mathcal{S} \subseteq B_{\mathbf{c}^*,r^*}$. With a slight abuse of language, we will refer to $r^*$ as the radius of the ball. Our objective is to approximate as best as possible the SEBB of $\mathcal{S}$, which amounts to minimizing the radius of the enclosing ball we build. As a simple matter of fact indeed, the SEBB is unique.

Lemma 1. The smallest enclosing Bregman ball $B_{\mathbf{c}^*,r^*}$ of $\mathcal{S}$ is unique.

Figure 10.1: Examples of Bregman balls. The two on the left are balls obtained by means of the Itakura-Saito distance. The middle one is a classic Euclidean ball. The other two are obtained by employing the Kullback-Leibler distance (Nock and Nielsen, 2005, fig. 2).

…tion of the data, therefore we can take it into account for much research in the SVC and generally in the SVM. In fact, we wish to recall that the classical BC algorithm is the optimization algorithm exploited by the already mentioned CVMs. The CVMs reformulate the SVMs as a MEB problem and we have already expressed our intention of testing such machines for the cluster description stage of the SVC (see section 6.12). Since the BC algorithm has been generalized to Bregman divergences, the research about vector machines (and therefore about the SVC) could have very interesting implications. We definitely intend to explore this direction.

10.3.2 Improve and extend the SVC software

For the sake of accuracy, and in order to perform more robust comparisons with other clustering algorithms, an improved and extended software package for the Support Vector Clustering (SVC) is needed. More stability and reliability are necessary. Moreover, it is important to implement all the key contributions to this promising technique proposed around the world. In fact, all the tests have currently been performed by exploiting only some of these characteristics and/or special contributions at a time.
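A minimal sketch of the two dual Bregman balls of Eqs. (3) and (4), reusing the divergence helpers sketched earlier; the numbers are illustrative only:

```python
import numpy as np

def in_bregman_ball(x, c, r, divergence):
    """Primal ball B_{c,r} of Eq. (3): D_F(c, x) <= r."""
    return divergence(c, x) <= r

def in_dual_bregman_ball(x, c, r, divergence):
    """Dual ball B'_{c,r} of Eq. (4): D_F(x, c) <= r."""
    return divergence(x, c) <= r

# With the squared Euclidean divergence the two balls coincide with the
# ordinary ball of Eq. (2); with Itakura-Saito or Kullback-Leibler they differ.
c = np.array([1.0, 1.0])
x = np.array([1.5, 0.5])
print(in_bregman_ball(x, c, 0.6, sq_euclidean),
      in_dual_bregman_ball(x, c, 0.6, itakura_saito))
```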


[Bregman balls: Itakura-Saito, Squared Euclidean, Kullback-Leibler]

The End