
Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005

Spectral Feature Analysis

Fei Wang
State Key Laboratory of Intelligent Technology and Systems
Department of Automation
Tsinghua University
Beijing 100084, P.R.China
E-mail: [email protected]

Jingdong Wang
Department of Computer Science
The Hong Kong University of Science and Technology
Clear Water Bay, Hong Kong
E-mail: [email protected]

Changshui Zhang
State Key Laboratory of Intelligent Technology and Systems
Department of Automation
Tsinghua University
Beijing 100084, P.R.China
E-mail: [email protected]

Abstract- The recent years have seen a surge of interest in spectral-based methods and kernel-based methods for machine learning and data mining. Despite the significant research, these methods remain only loosely related. In this paper, we derive an explicit theoretical relation between spectral clustering and weighted kernel principal component analysis (WKPCA). We show that spectral clustering is not only a method for data clustering, but also for feature extraction. We are then able to re-interpret the spectral clustering algorithm in terms of WKPCA and propose our spectral feature analysis (SFA) method. The spectral features extracted by SFA can capture the distinguishing information of data from different classes effectively. Finally, some experimental results are presented to show the effectiveness of our method.

I. INTRODUCTION

Spectral clustering [1][2][3] refers to a class of techniques that rely on the eigenstructure of an affinity matrix to partition the data objects into disjoint clusters, such that data points in the same cluster have high similarity while data points in different clusters have low similarity. Usually these methods treat the data objects as the nodes of a graph, and the pairwise similarities of the dataset are represented as the weighted edges of the graph. The data clustering problem is thus converted to a graph partitioning problem, and a spectral decomposition is then performed to achieve the optimal graph partitioning. Recently these approaches have proven useful for many applications, including computer vision [1] and VLSI design [5].

As another kind of powerful tool, kernel-based methods have aroused considerable interest in the fields of machine learning and data mining in recent years [6]. The main idea of these approaches is to map the data objects from the original low-dimensional space to a feature space with a much higher dimensionality. Processing the data directly in such a high-dimensional space is very difficult, and usually we do not even know what the nonlinear mapping function is. Fortunately, the problem to be solved ultimately depends only on the inner products between pairs of data objects in the feature space. Kernel-based methods define an appropriate form of this inner product (also called a "kernel function") and thereby analyze the data objects in the feature space.

On the surface, spectral clustering and kernel-based methods appear to be completely different. However, some researchers recently found that there are close relations between them. For example, Bengio et al. [7] showed that spectral clustering can be seen as learning the eigenfunctions of a kernel, and Dhillon et al. [8] proved that spectral clustering is a special case of the kernel k-means clustering algorithm.

This paper unites spectral clustering and kernel principal component analysis in a more explicit way. We prove theoretically that spectral clustering is a special version of the Weighted Kernel Principal Component Analysis (WKPCA) method in which the weights of the data objects are determined only by the dataset itself. The spectral clustering method can therefore be treated as a nonlinear feature extraction approach, and the features extracted by spectral clustering are called Spectral Features in this paper. Our experiments show that the spectral features of a dataset can discriminate the data generated from different classes effectively, which also implicitly explains why spectral clustering methods solve the clustering problem so well. Finally, we apply our Spectral Feature Analysis method to face recognition and object recognition, and the experimental results are presented to show the effectiveness of our method.

The paper is organized as follows: Section 2 introduces some related work. WKPCA and SFA are presented in Section 3. Our experiments are reported in Section 4, followed by the conclusions and discussions in Section 5.

II. RELATED WORK

A. Spectral Clustering

We will review spectral clustering more formally in this subsection. Assume the dataset is $\{x_i\}_{i=1}^M \subset \mathbb{R}^d$. If we take each data object $x_i$ as a node of a graph and the pairwise similarities of the data objects as the weighted edges of the graph, then the distribution of the dataset can be represented as a weighted undirected graph $G = (V, E)$, where $V$ is the vertex set and $E$ is the edge set. The data clustering problem is then equivalent to a graph partitioning problem. A graph bipartitioning problem is to partition $V$ into two subsets $V_1$ and $V_2$ subject to $V_1 \cap V_2 = \emptyset$ and $V_1 \cup V_2 = V$. There are many criteria to measure the quality of the final partition. The Normalized Cut value of a bipartition can be defined as follows [1]:


$$\mathrm{Ncut}(V_1, V_2) = \frac{\mathrm{cut}(V_1, V_2)}{\mathrm{assoc}(V_1, V)} + \frac{\mathrm{cut}(V_1, V_2)}{\mathrm{assoc}(V_2, V)} \qquad (1)$$

where $\mathrm{cut}(V_1, V_2) = \sum_{u \in V_1, v \in V_2} a(u, v)$, $\mathrm{assoc}(V_i, V) = \sum_{u \in V_i, t \in V} a(u, t)$, and $a(u, v)$ is the weight of the edge linking $u$ and $v$. Ncut is the Normalized Cut value, and the Normalized Cut criterion tells us to minimize it.

Yu et al. [9] generalize Eq. (1) to a $K$-way cut, which aims to partition $V$ into $K$ subsets $\{V_i\}_{i=1}^K$ with $V_i \cap V_j = \emptyset$ for $i \neq j$ and $\bigcup_{i=1}^K V_i = V$. Then

$$\mathrm{Nkcut}(G) = \sum_{i=1}^{K} \frac{\mathrm{cut}(V_i, V \setminus V_i)}{\mathrm{assoc}(V_i, V)} \qquad (2)$$

The term $\mathrm{cut}(V_i, V \setminus V_i)$ measures how many links escape from $V_i$, and $\mathrm{assoc}(V_i, V)$ measures the total connections from nodes in $V_i$ to all nodes in the graph. We can define the indicator matrix $X_{M \times K} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_K]$ with entries $X_{ij} = 1$ if $i \in V_j$ and $X_{ij} = 0$ otherwise, where $M$ is the number of vertices. Let $A_{M \times M}$ be the affinity matrix with $A_{ij} = a(i, j)$. The degree matrix is $D_{M \times M} = \mathrm{diag}(d_{11}, d_{22}, \ldots, d_{MM})$, where $d_{ii} = \sum_{u \in V} a(i, u)$. Yu et al. [9] proved that the minimization of (2) is equivalent to the following optimization problem:

$$\max_{Z} \ \varepsilon(Z) = \frac{1}{K}\,\mathrm{trace}(Z^T A Z) \quad \text{s.t.}\ Z^T D Z = I_K \qquad (3)$$

where $Z = X(X^T D X)^{-1/2}$ and $I_K$ is the identity matrix of order $K$. Let $Y = D^{1/2} Z$; then solving the optimization problem (3) is equivalent to the following spectral decomposition problem:

$$D^{-1/2} A D^{-1/2} Y = \Lambda Y \qquad (4)$$

The matrix $L = D^{-1/2} A D^{-1/2}$ is also called the Laplacian matrix [4]. In this way the graph partitioning problem, and hence the data clustering problem, is converted to a spectral decomposition problem.
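For concreteness, the following is a minimal sketch (in Python with NumPy/SciPy) of the relaxed normalized-cut procedure described above, assuming a Gaussian affinity: form $D^{-1/2} A D^{-1/2}$ as in Eq. (4), take its top $K$ eigenvectors, and cluster their rows with k-means. The function and parameter names are illustrative, not taken from the paper.

import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(X, n_clusters, sigma=1.0):
    # Affinity matrix A_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    A = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # Degrees d_ii and the matrix D^{-1/2} A D^{-1/2} of Eq. (4)
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # Eigenvectors for the K largest eigenvalues (eigh returns ascending order)
    _, eigvecs = np.linalg.eigh(L)
    Y = eigvecs[:, -n_clusters:]
    # Cluster the rows of the relaxed indicator matrix Y to get a hard partition
    _, labels = kmeans2(Y, n_clusters, minit='points')
    return labels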

B. Kernel PCA

As is well known, Principal Component Analysis (PCA) can only extract the linear principal components of a dataset. However, most real-world datasets are nonlinearly distributed. To overcome this problem, Scholkopf et al. [10] proposed the Kernel Principal Component Analysis (KPCA) method, which can extract the nonlinear principal components of a dataset using the kernel trick.

More precisely, let the dataset be $\{x_i\}_{i=1}^M \subset \mathbb{R}^d$. We map the dataset to a high-dimensional feature space $\mathbb{F}$ by some nonlinear function $\Phi: \mathbb{R}^d \rightarrow \mathbb{F}$, $x_i \mapsto \Phi(x_i)$. Assume the data objects have already been centered in the feature space, that is, $\sum_{i=1}^M \Phi(x_i) = 0$. We can define the covariance matrix $C = \frac{1}{M}\sum_{i=1}^M \Phi(x_i)\Phi(x_i)^T$. The eigenvalue decomposition of $C$ is $Cv = \lambda v$, which is equivalent to $\lambda v = Cv = \frac{1}{M}\sum_{i=1}^M (\Phi(x_i)^T v)\Phi(x_i)$, so the nonlinear principal direction we want satisfies $v \in \mathrm{span}\{\Phi(x_1), \ldots, \Phi(x_M)\}$. Let $v = \sum_{i=1}^M \alpha_i \Phi(x_i)$; substituting into $Cv = \lambda v$ and taking inner products with each $\Phi(x_k)$ gives

$$\lambda \sum_{i=1}^M \alpha_i \big(\Phi(x_k) \cdot \Phi(x_i)\big) = \frac{1}{M}\sum_{j=1}^M \sum_{i=1}^M \alpha_j \big(\Phi(x_k) \cdot \Phi(x_i)\big)\big(\Phi(x_i) \cdot \Phi(x_j)\big), \quad k = 1, \ldots, M \qquad (5)$$

Define the kernel matrix $(K)_{ij} = (\Phi(x_i) \cdot \Phi(x_j))$, where $(\cdot)$ denotes the inner product in the feature space. Then Eq. (5) can be written as $M\lambda K\alpha = K^2\alpha$. As $K$ is symmetric, it has a set of eigenvectors which spans the whole space, so it suffices to solve $M\lambda\alpha = K\alpha$. The projection of a data object $\Phi(x_i)$ on the $k$-th principal component is

$$\big(\Phi(x_i) \cdot v^k\big) = \sum_{j=1}^M \alpha_j^k \big(\Phi(x_i) \cdot \Phi(x_j)\big) = \sum_{j=1}^M \alpha_j^k K_{ij} \qquad (6)$$

From the derivation above we can see that the kernel principal components of the dataset can be computed once the kernel function $(K)_{ij} = (\Phi(x_i) \cdot \Phi(x_j))$ is appropriately defined.
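The derivation translates directly into a short NumPy sketch, shown here with a Gaussian kernel as an assumed choice; the explicit centering step enforces the assumption $\sum_i \Phi(x_i) = 0$, and all names are illustrative.

import numpy as np

def kpca(X, n_components, sigma=1.0):
    M = X.shape[0]
    # Kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # Centre the data in the feature space: K_c = (I - ee^T/M) K (I - ee^T/M)
    J = np.eye(M) - np.full((M, M), 1.0 / M)
    K_c = J @ K @ J
    # Solve M*lambda*alpha = K_c alpha and keep the leading components
    eigvals, eigvecs = np.linalg.eigh(K_c)
    idx = np.argsort(eigvals)[::-1][:n_components]
    lam, alpha = eigvals[idx], eigvecs[:, idx]
    # Normalise alpha so each principal direction v = sum_i alpha_i Phi(x_i) has unit norm
    alpha = alpha / np.sqrt(lam)
    # Projections of the training data, Eq. (6): sum_j alpha_j^k K_ij
    return K_c @ alpha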

III. SPECTRAL FEATURE ANALYSIS (SFA)

A. Weighted Kernel Principal Component Analysis (WKPCA)

We briefly reviewed the main idea of conventional KPCA in Section II-B. That algorithm does not take into account the weights of the data objects; that is to say, all the samples are assumed to be of the same importance. In this section we extend conventional KPCA to a more general case in which the importance of the data objects may differ. We call this generalized KPCA Weighted Kernel Principal Component Analysis (WKPCA).

Let the training dataset be $\{x_i\}_{i=1}^M \subset \mathbb{R}^d$, and let $\Phi: \mathbb{R}^d \rightarrow \mathbb{F}$ map the data from the original data space to the feature space. Conventional KPCA defines the covariance matrix of the training data in the feature space as $C = \frac{1}{M}\sum_{i=1}^M \Phi(x_i)\Phi(x_i)^T$, which assumes that the contribution of each data object to $C$ is the same ($1/M$). Now we generalize this formula and define the weighted covariance matrix of the training data in the feature space as

$$C = \sum_{i=1}^M p_i \Phi(x_i)\Phi(x_i)^T \qquad (7)$$

where $p_i$ represents the contribution of the data object $\Phi(x_i)$ to $C$. Assume the weights have been normalized and the dataset has been centered in the feature space, that is, $\sum_{i=1}^M p_i =$


$1$ and $\sum_{i=1}^M p_i\Phi(x_i) = 0$ (which can be seen from the appendix). Let $w_i = \sqrt{p_i}$; then the spectral decomposition of $C$ is:


$$Cv = \sum_{i=1}^M \big(w_i\Phi(x_i)\big)\big(w_i\Phi(x_i)\big)^T v = \lambda v \qquad (8)$$

From Eq. (8) we can see that $v$ lies in the span of $\{w_i\Phi(x_i)\}_{i=1}^M$. Let

$$v = \sum_{i=1}^M \alpha_i w_i \Phi(x_i) \qquad (9)$$

Combining Eq. (8) and Eq. (9), and taking the inner product with each $w_k\Phi(x_k)$, we get:

$$w_k\Phi(x_k) \cdot Cv = w_k\Phi(x_k) \cdot \Big(\sum_{i=1}^M \big(w_i\Phi(x_i)\big)\big(w_i\Phi(x_i)\big)^T\Big)\sum_{j=1}^M \alpha_j w_j\Phi(x_j)$$
$$= \sum_{j=1}^M\sum_{i=1}^M \alpha_j \big(w_k\Phi(x_k) \cdot w_i\Phi(x_i)\big)\big(w_i\Phi(x_i) \cdot w_j\Phi(x_j)\big)$$
$$= \lambda \sum_{i=1}^M \alpha_i \big(w_k\Phi(x_k) \cdot w_i\Phi(x_i)\big) \qquad (10)$$

Now we define the weighted kernel matrix $\tilde{K}$, where $\tilde{K}_{ij} = \big(w_i\Phi(x_i) \cdot w_j\Phi(x_j)\big)$. Let the diagonal matrix $W = \mathrm{diag}(w_1, w_2, \ldots, w_M)$; then

$$\tilde{K} = WKW \qquad (11)$$

where $K$ is the kernel matrix in conventional KPCA. From Section II-B we know that Eq. (10) is equivalent to $\tilde{K}\tilde{\alpha} = \tilde{\lambda}\tilde{\alpha}$. If we do an eigenvalue decomposition of $\tilde{K}$, we obtain the weighted kernel principal components of the training dataset, and we can perform WKPCA on the dataset afterwards.

According to [10], we need to normalize the eigenvectors of the covariance matrix $C$ by scaling the eigenvectors of $\tilde{K}$ by $1/\sqrt{\tilde{\lambda}_i}$, that is, $\tilde{\alpha}_i' = \tilde{\alpha}_i/\sqrt{\tilde{\lambda}_i}$, where $\tilde{\lambda}_i$ is the eigenvalue of $\tilde{K}$ corresponding to the eigenvector $\tilde{\alpha}_i$. Then we can compute the projections of the training dataset as follows:

$$\big(\Phi(x_k) \cdot v_j\big) = \sum_{i=1}^M \tilde{\alpha}_{ji}' \big(\Phi(x_k) \cdot w_i\Phi(x_i)\big) = w_k^{-1}\big(\tilde{K}\tilde{\alpha}_j'\big)_k = w_k^{-1}\big(\tilde{\lambda}_j\tilde{\alpha}_j/\sqrt{\tilde{\lambda}_j}\big)_k = w_k^{-1}\big(\sqrt{\tilde{\lambda}_j}\,\tilde{\alpha}_j\big)_k \qquad (12)$$

For the testing dataset $\Phi_t = [\Phi(y_1), \Phi(y_2), \ldots, \Phi(y_N)]$, the projections on the weighted kernel principal components are $\big(\Phi(y_k) \cdot v_j\big) = \sum_{i=1}^M \tilde{\alpha}_{ji}'\big(\Phi(y_k) \cdot w_i\Phi(x_i)\big)$. So we can define a kernel matrix $\tilde{K}_t \in \mathbb{R}^{N \times M}$, where $(\tilde{K}_t)_{ij} = \big(\Phi(y_i) \cdot w_j\Phi(x_j)\big)$. Then the projections of the testing data objects can be calculated as follows:

$$\big(\Phi(y_k) \cdot v_j\big) = \sum_{i=1}^M \tilde{\alpha}_{ji}'\big(\Phi(y_k) \cdot w_i\Phi(x_i)\big) = \big(\tilde{K}_t\tilde{\alpha}_j'\big)_k \qquad (13)$$

After we have obtained the projections of the training data and testing data, the data analysis (e.g. classification) procedure can be performed in a much lower-dimensional projection space.

All the inferences above are based on the assumption that $\sum_{i=1}^M p_i\Phi(x_i) = 0$. This condition cannot be satisfied in most cases, so we need to centralize all the samples in the feature space. The details of the centralization procedure are given in the appendix.
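A minimal sketch of the WKPCA procedure derived above, assuming the kernel matrix $K$ and the normalized weights $p_i$ are given: it forms $\tilde{K} = WKW$ (Eq. (11)), applies the weighted centering of the appendix (Eq. (16)), and returns the training projections of Eq. (12). All names are illustrative.

import numpy as np

def wkpca(K, p, n_components):
    # K: M x M kernel matrix, K_ij = Phi(x_i) . Phi(x_j)
    # p: weights with sum_i p_i = 1
    w = np.sqrt(p)
    K_w = (w[:, None] * K) * w[None, :]        # Eq. (11): K~ = W K W
    # Weighted centring, Eq. (16): (I - W e e^T W) K~ (I - W e e^T W), noting We = w
    J = np.eye(len(p)) - np.outer(w, w)
    K_c = J @ K_w @ J
    eigvals, eigvecs = np.linalg.eigh(K_c)
    idx = np.argsort(eigvals)[::-1][:n_components]
    lam, alpha = eigvals[idx], eigvecs[:, idx]
    alpha = alpha / np.sqrt(lam)               # alpha~' = alpha~ / sqrt(lambda~)
    # Training projections, Eq. (12): w_k^{-1} (K~_c alpha~'_j)_k
    proj = (K_c @ alpha) / w[:, None]
    return proj, alpha, w

The returned alpha and w are exactly what the test-set projection of Eq. (13) (and its centered form in the appendix) requires.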

B. Spectral Feature Analysis (SFA)

We proposed the WKPCA method in the last subsection. Reviewing the spectral clustering problem introduced in Section II-A, we can easily find that the Laplacian matrix $L$ in Eq. (4) is very similar to the weighted kernel matrix $\tilde{K}$ in Eq. (11). If we set the weights of the training dataset to be $p_i = 1/(Z d_{ii})$, where $d_{ii}$ is the $i$-th diagonal element of the degree matrix $D$ and $Z = \sum_i 1/d_{ii}$ is the normalization factor, then the weight matrix becomes $P = D^{-1}/Z$. If we treat the matrix $A$ in Eq. (4) as a kernel matrix, the weighted kernel matrix becomes

$$\tilde{K} = P^{1/2} A P^{1/2} = Z^{-1} D^{-1/2} A D^{-1/2} = Z^{-1} L \qquad (14)$$

Since $Z$ is a constant, the eigenvectors of the Laplacian matrix $L$ and the matrix $\tilde{K}$ are the same. Spectral clustering can therefore be viewed as a special WKPCA method in which the weights of the training data are determined only by the training set itself. Based on this observation, we propose our Spectral Feature Analysis (SFA) approach, summarized in Table I.

SFA thus extracts a set of nonlinear features from the training dataset, and it is interesting to study their properties. A brief code sketch of the procedure is given below, and experimental results are presented in the next section.
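The sketch below, again only an illustration, instantiates WKPCA with a Gaussian kernel and the data-determined weights: since $w_i$ is proportional to $1/\sqrt{d_{ii}}$, the weighted kernel matrix equals $D^{-1/2} A D^{-1/2}$ up to a constant factor, as in Eq. (14). For simplicity it omits the feature-space centering, so it corresponds to the non-centered (SFA2-style) variant used later in the experiments; the names are hypothetical.

import numpy as np

def gaussian_affinity(X, Y, sigma):
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def sfa(X_train, X_test, n_components, sigma):
    A = gaussian_affinity(X_train, X_train, sigma)   # affinity matrix
    d = A.sum(axis=1)                                # degrees d_ii
    w = 1.0 / np.sqrt(d)                             # w_i up to the constant 1/sqrt(Z)
    K_w = (w[:, None] * A) * w[None, :]              # D^{-1/2} A D^{-1/2}, cf. Eq. (14)
    eigvals, eigvecs = np.linalg.eigh(K_w)
    idx = np.argsort(eigvals)[::-1][:n_components]
    lam, alpha = eigvals[idx], eigvecs[:, idx]
    alpha = alpha / np.sqrt(lam)
    train_proj = (K_w @ alpha) / w[:, None]          # Eq. (12)
    A_t = gaussian_affinity(X_test, X_train, sigma)  # affinity between test and training sets
    test_proj = (A_t * w[None, :]) @ alpha           # Eq. (13), (K~_t)_ij = Phi(y_i) . w_j Phi(x_j)
    return train_proj, test_proj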

IV. EXPERIMENTS

In this section we give a set of experiments in which we used SFA to extract spectral features. First, we take a look at a simple toy example; following that, we describe real-world


TABLE I
SPECTRAL FEATURE ANALYSIS

Spectral Feature Analysis (SFA) (using a Gaussian kernel)
Input: training dataset $\{x_i\}_{i=1}^M \subset \mathbb{R}^d$, testing dataset $\{y_i\}_{i=1}^N \subset \mathbb{R}^d$,
       variance of the RBF kernel $\sigma^2$, the desired dimensionality $k$
Output: projections of the training dataset and the testing dataset
1. Construct the kernel matrix $A_{ij} = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$ and the degree matrix $D$.
2. Construct the weighted kernel matrix $\tilde{K} = D^{-1/2} A D^{-1/2}$.
3. Do an eigenvalue decomposition of the weighted kernel matrix, and calculate the projections of the training data on the first $k$ eigenvectors corresponding to the $k$ largest eigenvalues by Eq. (12).
4. Calculate the affinity matrix between the testing set and the training set.
5. Compute the projections of the testing dataset on the first $k$ eigenvectors calculated in step 3.

[Figure: two rows of contour plots on the toy dataset; panels labeled KPCA (upper row) and SFA (lower row)]
Fig. 1. Two-dimensional toy examples

experiments where we tried to assess the utility of the extracted features by classification tasks.

A. Toy Examples

In order to provide some intuition on how spectral feature analysis in the feature space behaves in the original input space, we show some experiments with a synthetic two-dimensional dataset. We select the Gaussian kernel as our kernel function, with variance 0.06. The experimental results can be seen in Figure 1.

The synthetic dataset is generated from three 2-D Gaussians. Their mean vectors are $[-0.5\ 0.2]^T$, $[0\ 0.6]^T$, and $[0.5\ 0]^T$ respectively, and they share the same diagonal covariance matrix. The plots in Figure 1 contain lines of constant kernel principal component value (the upper row) and of constant spectral feature value (the lower row). The corresponding eigenvalues are arranged in descending order from left to right.

It can easily be seen from Figure 1 that the features extracted by KPCA are global descriptive features, while the features extracted by SFA are local discriminative features. The first three features extracted by SFA (corresponding to the three largest eigenvalues) can divide the dataset into three clusters automatically, which implicitly explains the effectiveness of the spectral clustering methods.

B. Face Recognition

In the experiments of this subsection, SFA is applied to face recognition. We select the ORL database [12] as our experimental data. There are 10 different images of each of 40 distinct subjects. For some of the subjects, the images were taken at different times, with slightly varying lighting, facial expressions (open or closed eyes, smiling or not smiling), or facial details (glasses or no glasses). All the subjects are in an upright, frontal position (with tolerance for some pose variation). Figure 2 shows the 10 images of one subject.

Fig. 2. ORL database

We tested the recognition accuracy with different numbers of training samples: k (k = 2, 3, 4) images of each subject were randomly selected for training and the remaining 10 - k images of each subject were used for testing.

Besides our method, we also tested for comparison the PCA method (reducing the dimensionality of the training set by PCA), the GKPCA method (reducing the dimensionality by KPCA with a Gaussian kernel), and the SFA2 method (the non-centered SFA method using a Gaussian kernel). After the dimensionality reduction procedure, nearest-neighbor classification was performed to recognize each testing sample in the projection space. For each value of k, 100 independent runs were performed with different random partitions between the training and testing sets, and we adjusted the variance of the Gaussian kernel to achieve the best results for each experiment. The average recognition accuracies are displayed in Table II.
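The evaluation protocol just described can be summarized in a short sketch; the per-subject random split, the nearest-neighbor rule and the repeated runs follow the text, while the function names and the reduce_fn interface are illustrative assumptions.

import numpy as np

def evaluate(images, labels, k, reduce_fn, n_runs=100, n_components=20):
    rng = np.random.default_rng(0)
    accuracies = []
    for _ in range(n_runs):
        # Randomly pick k images per subject for training, the rest for testing
        train_idx, test_idx = [], []
        for subject in np.unique(labels):
            idx = rng.permutation(np.flatnonzero(labels == subject))
            train_idx.extend(idx[:k])
            test_idx.extend(idx[k:])
        train_idx, test_idx = np.array(train_idx), np.array(test_idx)
        # Project both sets into the low-dimensional space (PCA, GKPCA, SFA, ...)
        train_proj, test_proj = reduce_fn(images[train_idx], images[test_idx], n_components)
        # Nearest-neighbour classification in the projection space
        dists = np.sum((test_proj[:, None, :] - train_proj[None, :, :]) ** 2, axis=-1)
        predictions = labels[train_idx][np.argmin(dists, axis=1)]
        accuracies.append(np.mean(predictions == labels[test_idx]))
    return np.mean(accuracies)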

From Table II we can see that the results produced by GKPCA and SFA are better than the others, and the recognition accuracy of SFA is nearly the same as that of GKPCA, which demonstrates the effectiveness of our method.

C. Object Recognition

Our SFA method can also be used for object recognition. The COIL-20 database [13] is selected as our experimental dataset,


TABLE II
RECOGNITION ACCURACY ON ORL DATABASE

k    PCA      GKPCA    SFA2     SFA
2    0.7225   0.7713   0.7572   0.7707
3    0.7996   0.8511   0.8246   0.8475
4    0.8492   0.8990   0.8675   0.8917

which is a database of gray-scale images of 20 objects. For each object there are 72 images in total, taken around the object at pose intervals of 5 degrees. All the images were resized to 128 x 128 using interpolation-decimation filters to minimize aliasing. Figure 3 shows images of the 20 objects.

TABLE III
RECOGNITION ACCURACY RESULTS ON TRAINING SET 4

F=4     N=10     N=20
SFA2    0.7816   0.7868
SFA     0.7912   0.8169
PCA     0.7912   0.7949
GKPCA   0.7912   0.8169

TABLE IV
RECOGNITION ACCURACY RESULTS ON TRAINING SET 6

F=6     N=10     N=20
SFA2    0.8492   0.8523
SFA     0.8674   0.8955
PCA     0.8477   0.8591
GKPCA   0.8697   0.8955

Fig. 3. COIL-20 database

Since the 72 photos of each object are taken around the object uniformly, we select our training set in the manner of Figure 4; in this way we can maintain some global information about the object. For each object, 4 and 6 photos respectively were selected as our training set, and the remaining photos were used for testing. The final recognition accuracies on these two training sets are shown in Tables III and IV.

[Figure: two panels, (a) and (b)]
Fig. 4. The selection of the training set

Table III gives the recognition results on the training set composed of 4 images per object, while Table IV gives the results on the training set composed of 6 images per object. In both tables the first column lists the recognition methods (the abbreviations have the same meanings as in Section IV-B), and the second and third columns give the recognition accuracy when the final dimensionality of the projection space is 10 and 20 respectively. We can see from the tables that SFA and GKPCA perform better than the other two approaches.

V. CONCLUSIONS AND DISCUSSIONS

This paper studies spectral clustering methods from a novel point of view. We pointed out that spectral clustering methods can be used not only for clustering but also for feature extraction. The theoretical analysis in this paper proved that spectral clustering methods are a special kind of WKPCA approach. Our experiments showed that the results produced by our method are comparable to the results produced by KPCA.

Some interesting problems remain unresolved. On one hand, we only showed the differences between SFA and KPCA on a toy example, while the experimental results on face recognition and object recognition are nearly the same. It is therefore interesting to study the effect of the weight matrix W in SFA, and which kinds of problems are more suitable for our SFA approach. On the other hand, viewing spectral clustering from a kernel viewpoint provides another powerful way to address some problems existing in these methods, for example the determination of the variance of the Gaussian kernel, which is still a hard problem that cannot be resolved perfectly. There have been some methods for determining the parameters in kernel-based methods [14][15][16], and we hope to do some research on this topic in the future.

ACKNOWLEDGEMENTS

This work is supported by project 60475001 of the National Natural Science Foundation of China.

APPENDIX

In Section III-A, we assumed that all the mapped data are centered in the feature space. Now we will drop this assumption.



Generally speaking, the mean of the samples in a dataset is its best representative in the sense that the total squared distance between all the data objects and the mean is the smallest [11]. More precisely, if $\mu = \frac{1}{M}\sum_{i=1}^M x_i$, then $\mu = \arg\min_{\varepsilon}\sum_{i=1}^M \|x_i - \varepsilon\|^2$. Extending this idea, we can define the mean of the samples in the weighted case.

Theorem: Let the dataset be $\{x_i\}_{i=1}^M \subset \mathbb{R}^d$ and let the weight of $x_i$ be $p_i$. Then

$$m = \frac{\sum_{i=1}^M p_i x_i}{\sum_{i=1}^M p_i} = \arg\min_{\varepsilon} \sum_{i=1}^M p_i\|x_i - \varepsilon\|^2 \qquad (15)$$

Proof: Let $J = \sum_{i=1}^M p_i\|x_i - \varepsilon\|^2 = \sum_{i=1}^M [\sqrt{p_i}(x_i - \varepsilon)]^T[\sqrt{p_i}(x_i - \varepsilon)]$. Then

$$\frac{\partial J}{\partial \varepsilon} = -2\sum_{i=1}^M p_i(x_i - \varepsilon)$$

Setting $\frac{\partial J}{\partial \varepsilon} = 0$ gives $\varepsilon = \frac{\sum_{i=1}^M p_i x_i}{\sum_{i=1}^M p_i}$.

It is easy to see that if we set all the weights of the data objects to one, then $m = \mu$. According to (15), since we have assumed that $\sum_i p_i = 1$, we can define the weighted mean of the data objects in the feature space as $m = \sum_{i=1}^M p_i\Phi(x_i)$. Let $P = \mathrm{diag}(p_1, p_2, \ldots, p_M)$; then $m = \Phi P e$, where $\Phi = [\Phi(x_1), \Phi(x_2), \ldots, \Phi(x_M)]$ and $e$ is a column vector of dimension $M$ with all elements equal to one. Let the centered weighted kernel matrix be $\tilde{K}_c$; then

$$\tilde{K}_c = \big((\Phi - me^T)W\big)^T\big((\Phi - me^T)W\big) = (\Phi W - \Phi Pee^TW)^T(\Phi W - \Phi Pee^TW)$$
$$= \big(\Phi W(I - Wee^TW)\big)^T\big(\Phi W(I - Wee^TW)\big) = (I - Wee^TW)(\Phi W)^T(\Phi W)(I - Wee^TW)$$
$$= (I - Wee^TW)\,\tilde{K}\,(I - Wee^TW) \qquad (16)$$

For the testing dataset $\Phi_t = [\Phi(y_1), \Phi(y_2), \ldots, \Phi(y_N)]$, the centered version of $\tilde{K}_t$ can be computed as

$$\tilde{K}_{tc} = (\Phi_t - me_t^T)^T\big((\Phi - me^T)W\big) = (\Phi_t - \Phi Pee_t^T)^T(\Phi W - \Phi Pee^TW)$$
$$= \tilde{K}_t(I - Wee^TW) - e_te^TW\tilde{K}(I - Wee^TW) = (\tilde{K}_t - e_te^TW\tilde{K})(I - Wee^TW) \qquad (17)$$

where $e_t = [1, 1, \ldots, 1]^T \in \mathbb{R}^N$ and $N$ is the number of testing data.

Assume $\gamma_j$ is the $j$-th eigenvector of the centered covariance matrix $\tilde{C} = \sum_{i=1}^M p_i(\Phi(x_i) - m)(\Phi(x_i) - m)^T$; then we can calculate the projections of the training and testing data as follows:

$$\big(\Phi(x_k) \cdot \tilde{v}_j\big) = \sum_{i=1}^M \gamma_{ji}'\big(\Phi(x_k) \cdot w_i\Phi(x_i)\big) = w_k^{-1}\big(\tilde{K}_c\gamma_j'\big)_k \qquad (18)$$

$$\big(\Phi(y_k) \cdot \tilde{v}_j\big) = \sum_{i=1}^M \gamma_{ji}'\big(\Phi(y_k) \cdot w_i\Phi(x_i)\big) = \big(\tilde{K}_{tc}\gamma_j'\big)_k \qquad (19)$$

where $\gamma_j' = \gamma_j/\sqrt{\delta_j}$, and $\delta_j$ is the eigenvalue of $\tilde{K}_c$ corresponding to its eigenvector $\gamma_j$.
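For completeness, a small sketch of the test-side computation in Eqs. (17)-(19), reusing the quantities returned by the WKPCA sketch of Section III-A; the variable names are illustrative.

import numpy as np

def project_test(K_t, K_w, gamma, w):
    # K_t  : N x M matrix with (K~_t)_ij = Phi(y_i) . w_j Phi(x_j)
    # K_w  : M x M weighted kernel matrix K~ of the training data, Eq. (11)
    # gamma: M x n_components matrix of normalised eigenvectors of the centred K~_c
    # w    : weights w_i = sqrt(p_i)
    M = len(w)
    J = np.eye(M) - np.outer(w, w)                # I - W e e^T W
    e_t = np.ones((K_t.shape[0], 1))
    # Eq. (17): K~_tc = (K~_t - e_t e^T W K~)(I - W e e^T W)
    K_tc = (K_t - e_t @ (w[None, :] @ K_w)) @ J
    # Eq. (19): projections of the testing data on the centred principal directions
    return K_tc @ gamma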

REFERENCES

[1] J. Shi, J. Malik. Normalized Cuts and Image Segmentation. IEEE Transactions on PAMI, vol. 22, no. 8, August 2000.
[2] C. Ding, X. He, H. Zha, M. Gu, H. Simon. Spectral Min-Max Cut for Graph Partitioning and Data Clustering. Proceedings of the First IEEE ICDM, 2001.
[3] H. Zha, C. Ding, M. Gu, X. He, H. Simon. Spectral Relaxation for K-means Clustering. In NIPS 2001.
[4] F. Chung. Spectral Graph Theory. Number 92 in CBMS Regional Conference Series in Mathematics, American Mathematical Society, 1997.
[5] C. Alpert, A. Kahng, S. Yao. Spectral partitioning: The more eigenvectors, the better. Discrete Applied Mathematics, 90: 3-26, 1999.
[6] B. Scholkopf, A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. The MIT Press, Cambridge, Massachusetts, 2002.
[7] Y. Bengio, J. Paiement, P. Vincent. Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering. In NIPS 2003.
[8] I. S. Dhillon, Y. Guan, B. Kulis. A Unified View of Kernel k-means, Spectral Clustering and Graph Cuts. UTCS Technical Report TR-04-25, July 27, 2004.
[9] S. X. Yu, J. Shi. Multiclass Spectral Clustering. In ICCV 2003.
[10] B. Scholkopf, A. Smola, K. Muller. Nonlinear Component Analysis as a Kernel Eigenvalue Problem. MPI Technical Report No. 44, December 1996.
[11] D. Gering. Linear and Nonlinear Data Dimensionality Reduction. In fulfillment of the Area Exam doctoral requirements, April 17, 2002.
[12] F. Samaria, A. Harter. Parameterisation of a stochastic model for human face identification. Proceedings of the Second IEEE Workshop on Applications of Computer Vision, 1994.
[13] S. A. Nene, S. K. Nayar, H. Murase. Columbia Object Image Library (COIL-20). Technical Report CUCS-005-96, February 1996.
[14] O. Chapelle, V. Vapnik, O. Bousquet, S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, vol. 46, 2002.
[15] O. Chapelle, V. Vapnik. Model selection for support vector machines. In NIPS 1999.
[16] N. Cristianini, C. Campbell, J. Shawe-Taylor. Dynamically adapting kernels in support vector machines. In NIPS 1999.
