
Babeș–Bolyai University, Cluj-Napoca

Faculty of Mathematics and Computer Science

Semi-supervised Learning with Kernels

Ph.D. Thesis

Scientific Advisor: Zoltán KÁSA
Ph.D. Student: Zalán-Péter BODÓ

Cluj-Napoca

2009


Universitatea Babeș–Bolyai, Cluj-Napoca

Facultatea de Matematică și Informatică

Învățare semisupervizată folosind kerneluri

Teză de doctorat

Conducător științific: Zoltán KÁSA
Doctorand: Zalán-Péter BODÓ

Cluj-Napoca

2009


In loving memory of my parents


Acknowledgements

First of all I would like to thank my advisor, Zoltán Kása, for giving me the opportunity to continue my studies, for his guidance and encouragement throughout this research, and especially for the rigorous correction of the thesis and his priceless suggestions.

I would like to thank Lehel Csató and Zsolt Minier for the pleasant joint work performed during my PhD studies and research. Although we do not have tutors in our postgraduate studies, I can devoutly call Lehel my tutor, because I have learned a lot from him, not only about artificial intelligence and machine learning, but about scientific research in general.

I am deeply grateful to my girlfriend, Annamária Biró, who always helped me, especially by supporting me and tolerating my attitude in the fruitless phases of the research, and also when writing the PhD thesis; I could not have done any of this without her.

I would like to thank my family and all my relatives for their help and support during the years of study. I am also grateful to all my friends for their support and for the unforgettable revels.

Last but not least, I would like to thank Jakob Jonsson for proofreading the thesis and for the helpful and relevant suggestions and corrections he made.

I also acknowledge the support of grant CNCSIS/TD-35 of the Romanian Ministry of Education and Research.


Contents

1 Introduction
  1.1 Structure of the thesis
  1.2 Publications and contributions to the field

2 Semi-supervised learning
  2.1 Assumptions in semi-supervised learning
    2.1.1 The smoothness assumption
    2.1.2 The cluster assumption
    2.1.3 The manifold assumption
  2.2 Transduction
  2.3 A classification of semi-supervised methods
    2.3.1 Generative models
    2.3.2 Low density separation
    2.3.3 Graph-based methods
    2.3.4 Change of representation

3 Kernels and kernel methods
  3.1 A simple classification algorithm
  3.2 Some general purpose kernels
    3.2.1 The linear kernel
    3.2.2 The polynomial kernel
    3.2.3 The RBF kernel
    3.2.4 The sigmoid kernel
  3.3 Classification with Support Vector Machines
    3.3.1 Hard margin SVMs
    3.3.2 Soft margin SVMs
    3.3.3 Kernelization
    3.3.4 Classification with multiple classes
  3.4 Dimensionality reduction with PCA and KPCA
    3.4.1 Principal Component Analysis
    3.4.2 Kernel Principal Component Analysis

4 Data-dependent kernels
  4.1 The ISOMAP kernel
  4.2 The neighborhood kernel
  4.3 The bagged cluster kernel
  4.4 Multi-type cluster kernel
    4.4.1 Linear transfer function
    4.4.2 Step transfer function
    4.4.3 Linear step transfer function
    4.4.4 Polynomial transfer function
  4.5 Manifold regularization and data-dependent kernels for SSL using point cloud norms

5 Wikipedia-based kernels for text categorization
  5.1 Text categorization
    5.1.1 The bag-of-words representation
    5.1.2 Feature selection techniques in text categorization
    5.1.3 Machine learning in text categorization
    5.1.4 Evaluation measures
      Precision and recall
      Break-even point
      The E and F-measures
  5.2 String and text kernels
    5.2.1 String kernels
    5.2.2 The VSM kernel
    5.2.3 The GVSM kernel
    5.2.4 WordNet-based kernels
    5.2.5 Latent Semantic Kernel
    5.2.6 The von Neumann kernel
  5.3 Wikipedia-based text kernels
    5.3.1 Wikipedia
    5.3.2 Wikipedia-based document representation
    5.3.3 Dimensionality reduction for the Wikipedia kernel
    5.3.4 Link structure of Wikipedia
    5.3.5 Concept weighting
    5.3.6 Experimental methodology and test results
    5.3.7 Related methods
    5.3.8 Discussion

6 Hierarchical cluster kernels
  6.1 Motivation for a cluster kernel
  6.2 Hierarchical clustering
    6.2.1 Linkage distances
  6.3 Metric multi-dimensional scaling
  6.4 Ultrametric matrices and trees
  6.5 The hierarchical cluster kernel
    6.5.1 Hierarchical cluster kernel with graph distances
      Connecting the graph
  6.6 New test points
  6.7 Experimental methodology and test results
  6.8 Related work
  6.9 Discussion

7 Variations on the bagged cluster kernel
  7.1 The bagged cluster kernel
  7.2 Computing the reweighting kernel
    7.2.1 Combining kernels
    7.2.2 Using the Hadamard product for kernel reweighting
      Gaussian reweighting kernel
      Dot product-based reweighting kernel
  7.3 Getting the clustering
  7.4 Experimental methodology and test results
  7.5 Discussion

8 Conclusions

A Data sets
  A.1 Two-moons
  A.2 Reuters-21578
  A.3 USPS
  A.4 Digit1
  A.5 COIL2
  A.6 Text


Notation, symbols and acronyms

Notation and symbols

$\mathbf{a}, \mathbf{b}, \ldots, \boldsymbol{\alpha}, \boldsymbol{\beta}, \ldots$  vectors
$\mathbf{A}, \mathbf{B}, \ldots, \boldsymbol{\Lambda}, \boldsymbol{\Sigma}, \ldots$  matrices
$\mathbf{I}$  identity matrix
$\mathbf{O}$  null matrix
$\mathbf{K}$  kernel matrix
$a, b, \ldots, \alpha, \beta, \ldots$  scalars
$\ell, u, N$  number of labeled examples, number of unlabeled examples and total number of examples
$K$  number of classes / clusters
$A_{ij}$  element of matrix $\mathbf{A}$ from the $i$th row and $j$th column
$\mathbf{A}_{i\cdot}, \mathbf{A}_{\cdot j}$  $i$th row and $j$th column of matrix $\mathbf{A}$, respectively
$\mathbf{a}', \mathbf{A}'$  transpose of vector $\mathbf{a}$ / matrix $\mathbf{A}$
$\mathbf{a}^{*}, \mathbf{A}^{*}$  conjugate transpose of vector $\mathbf{a}$ / matrix $\mathbf{A}$; optimal solution of an optimization problem
$\mathbf{a}^{(k)}, \mathbf{A}^{(k)}$  elementwise power of vector $\mathbf{a}$ / matrix $\mathbf{A}$; $k$th element of a set of vectors / matrices
$\mathbf{A}^{\dagger}$  Moore–Penrose pseudo-inverse of matrix $\mathbf{A}$
$\mathbf{1}, \mathbf{1}_k$  vector of ones, vector of ones of size $k \times 1$
$\mathbf{0}, \mathbf{0}_k$  vector of zeros, vector of zeros of size $k \times 1$
$\langle \mathbf{x}, \mathbf{z} \rangle$  inner product, $\langle \mathbf{x}, \mathbf{z} \rangle = \sum_i x_i z_i = \mathbf{x}'\mathbf{z}$
$\phi(\mathbf{x})$  feature mapping, $\phi : \mathcal{X} \to \mathcal{H}$
$k(\cdot, \cdot)$  kernel function
$\mathrm{diag}(\mathbf{A})$  vector of the diagonal elements of $\mathbf{A}$
$\mathrm{diag}(\mathbf{a})$  diagonal matrix with the elements of $\mathbf{a}$ on the diagonal
$\mathrm{sgn}(a)$  signum function: $\mathrm{sgn}(a) = -1$ if $a < 0$, $\mathrm{sgn}(a) = 1$ if $a \ge 0$
$\mathrm{tr}(\mathbf{A})$  matrix trace operator, $\mathrm{tr}(\mathbf{A}) = \sum_i A_{ii}$
$\|\cdot\|, \|\cdot\|_2$  the Euclidean norm, $\|\mathbf{x}\| = \sqrt{\sum_i x_i^2}$
$\|\cdot\|_F$  the Frobenius matrix norm, $\|\mathbf{A}\|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\mathrm{tr}(\mathbf{A}^{*}\mathbf{A})}$
$\odot$  elementwise, Hadamard or Schur product, $(\mathbf{A} \odot \mathbf{B})_{ij} = A_{ij} \cdot B_{ij}$
$\oplus$  direct sum; see definition on page 112
$\otimes$  Kronecker product; see definition on page 112
$\not\equiv$  "not necessarily equal"
$\succeq$  $\mathbf{A} \succeq 0$ indicates that $\mathbf{A}$ is positive semi-definite

Acronyms

AI  Artificial Intelligence
BEP  Break-even Point
COIL  Columbia Object Image Library
DAG  Directed Acyclic Graph
DFT  Document Frequency Thresholding
DIA  Darmstadt Indexing Approach
ECOC  Error Correcting Output Codes
EM  Expectation Maximization
ESA  Explicit Semantic Analysis
FN  False Negatives
FP  False Positives
gHCK  Graph-based Hierarchical Cluster Kernel
GVSM  Generalized Vector Space Model
HCK  Hierarchical Cluster Kernel
IG  Information Gain
IR  Information Retrieval
ISOMAP  ISOmetric feature MAPping
kNN  k-Nearest Neighbors
KPCA  Kernel Principal Component Analysis
LapRLS  Laplacian Regularized Least Squares
LapSVM  Laplacian Support Vector Machines
LLE  Locally Linear Embedding
LLSF  Linear Least Squares Fit
LP  Label Propagation
LSA/LSI  Latent Semantic Analysis / Latent Semantic Indexing
MDS  Multi-Dimensional Scaling
MI  Mutual Information
ML  Machine Learning
MLE  Maximum Likelihood Estimation
NLP  Natural Language Processing
PCA  Principal Component Analysis
PSD  Positive Semi-Definite
RBF  Radial Basis Function
RLS  Regularized Least Squares
SSK  String Subsequence Kernel
SSL  Semi-Supervised Learning
SVD  Singular Value Decomposition
SVM  Support Vector Machine
TC  Text Categorization
TFIDF  Term Frequency × Inverse Document Frequency
TN  True Negatives
TP  True Positives
TSVM  Transductive Support Vector Machines
UPGMA  Unweighted Pair Group Method using Arithmetic mean
UPGMC  Unweighted Pair Group Method using Centroids
USPS  United States Postal Service
VSM  Vector Space Model
WPGMA  Weighted Pair Group Method using Arithmetic mean
WPGMC  Weighted Pair Group Method using Centroids

List of Figures

2.1 Clusters in the data set of concentric clusters
2.2 The propagation of labels in LP
2.3 Illustration for SSL based on change of representation
2.4 The input and the result of the LLE algorithm
3.1 The XOR problem
3.2 A simple classification algorithm
3.3 Training and testing with SVMs on the “two-moons” data set
3.4 Ambiguous classification
3.5 Codewords assigned to categories
3.6 DAG-based evaluation for multi-class settings
3.7 PCA in two dimensions
3.8 Illustration for KPCA
4.1 The input and the result of the ISOMAP algorithm
4.2 Semi-supervised learning using LapSVM
5.1 The bag-of-words (or VSM) representation of documents
5.2 Calculating document similarity
5.3 The scheme of the segmentation based feature selection for TC
5.4 The graphical representation of document sets
5.5 Computing the mismatch string kernel
5.6 Scheme of forming the Wikipedia kernel
5.7 Concept weighting with PageRank
6.1 Motivation for a cluster kernel
6.2 The points in the new representational space
6.3 Hierarchical clustering represented by dendrograms
6.4 Hierarchical clustering represented by a Venn diagram
6.5 Merging 3 clusters
6.6 Example of an ultrametric tree
7.1 Kernel reweighting
7.2 Data setting where conventional clustering fails
A.1 The “two-moons” data set

List of Tables

3.1 Some similarity metrics
5.1 Contingency table
5.2 Highest and lowest ranked articles in the reduced Wikipedia set
5.3 Results obtained using the Wikipedia-based kernel
5.4 Words in the WordSimilarity-353 corpus
5.5 Words not in Wikipedia
5.6 Words in Wikipedia
6.1 Linear and Gaussian kernels with SVMs
6.2 Hierarchical cluster kernels (HCK)
6.3 Graph-based hierarchical cluster kernels (gHCK)
6.4 ISOMAP and neighborhood kernels
6.5 The bagged cluster kernel and LapSVM with Gaussian kernel
6.6 Multi-type cluster kernel
7.1 Reweighting cluster kernels using hierarchical clustering
7.2 Reweighting cluster kernels obtained with k-means
7.3 Reweighting cluster kernels with spectral clustering using k-means
A.1 The 10 most frequent Reuters categories


Chapter 1

Introduction

Since 1956, Artificial Intelligence (AI) has been one of the intensely studied areas of computer science, the goal being to construct intelligent machines. Intelligence, however, is not a well-defined, easily definable concept – and we refer here to the informal definition of intelligence. An early attempt to demonstrate machine intelligence – and thus an approach to its definition – was the well-known Turing test: a human holds conversations in a natural language with another human and a machine, each of which tries to appear human, and the task is to determine which is the machine and which is the human. If this cannot be reliably judged by the first human, then the machine is said to pass the test. Thus one can say that AI's goal is to build machines that behave like humans.

According to Russell and Norvig [1995], AI systems can be organized into four groups: systems that think like humans, systems that act like humans, systems that think rationally and systems that act rationally. Machine Learning (ML) is a subdomain of AI which tries to model the most important brain activities: classification, differentiation and prediction; learning machines belong to the first category mentioned above. In the last 20 years cognitive psychology and computer science have drifted apart and new, quite narrow areas have appeared in the field of AI; some technologies have become indispensable and in some respects exceed what humans can do: think of information retrieval systems, where millions of documents have to be searched to return to the user those containing relevant information, and nowadays the accuracy reached by these systems is comparable to human judgement.

Mitchell [2006] defines ML as follows:


Machine Learning is a natural outgrowth of the intersection of Computer Science and Statistics. We might say the defining question of Computer Science is “How can we build machines that solve problems, and which problems are inherently tractable/intractable?” The question that largely defines Statistics is “What can be inferred from data plus a set of modeling assumptions, with what reliability?” The defining question for Machine Learning builds on both, but it is a distinct question. Whereas Computer Science has focused primarily on how to manually program computers, Machine Learning focuses on the question of how to get computers to program themselves (from experience plus some initial structure). Whereas Statistics has focused primarily on what conclusions can be inferred from data, Machine Learning incorporates additional questions about what computational architectures and algorithms can be used to most effectively capture, store, index, retrieve and merge these data, how multiple learning subtasks can be orchestrated in a larger system, and questions of computational tractability.

That is, in ML we use mathematical tools to define and solve typical problems like classification, and we use statistical models to build machines with the ability to learn. We assume that human learning resembles collecting statistics. Consider the classification of a color as red or orange; we possibly learn the correct decision boundary between these two colors by seeing many examples.

Two of the most important subdomains of ML are supervised and unsupervised learning: in supervised learning we are given examples together with teaching instructions, which we call labels; unsupervised learning is more difficult, since no additional instructions are given. The usual task in unsupervised learning is to find separate clusters grouping similar points, or to estimate the probability density of the data. This thesis focuses on semi-supervised learning (SSL): since human annotation of training examples in most cases requires domain experts, it is costly and very time consuming. A solution is to use the small proportion of labeled data together with a much larger set of easily collected unlabeled data to improve the performance of the learning algorithm. For example, suppose that in a text categorization problem the word “professor” turns out to be a good predictor for positive examples based on the labeled data. Then, if the unlabeled data shows that the words “professor” and “university” are correlated, using both words the accuracy of the classifier is expected to improve.


According to Chapelle et al. [2006],

Semi-supervised learning (SSL) is halfway between supervised and unsupervised learning. In addition to unlabeled data, the algorithm is provided with some supervision information – but not necessarily for all examples.

Zhu et al. [2007] showed by an experiment that human learning indeed makes use of unlabeled data: when a small portion of labeled and a large portion of unlabeled examples was shown to the participants, the decision boundary was shifted according to the distribution of the unlabeled examples. As in the example of discriminating between colors, here the subjects were shown artificially generated 3D shapes; they were told that the images they would see were microscopic images of pollen particles from the – fictive – flowers Belianthus or Nortulaca. After seeing a few labeled and then more unlabeled examples they were asked to classify some test shapes. The experiment showed that the decision boundary between the two classes is highly influenced by the unlabeled data, i.e. the decision boundary was shifted according to the distribution of the unlabeled examples.

Kernel functions or kernels return similarities between examples. In kernel methods we do not work directly with the data, but with a matrix containing the similarities of the examples, called the kernel matrix. This is again analogous to human classification, where similarities between examples play an important role [Estes, 1994]. Kernels are tools for the non-linear extension of linear methods: if an algorithm can be written in terms of dot products, we can simply exchange the dot product matrix for an arbitrary positive semi-definite kernel function, containing now the dot products of the data in a so-called feature space, and we obtain a non-linear extension of the simple algorithm. That is, we do not actually perform the inefficient mapping of the data points to a possibly higher dimensional feature space, but only provide their similarities in that space. In this way it is even possible to work in infinite-dimensional spaces. Therefore, by choosing different kernels, one can build different learning machines from the same simple learning algorithm.
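To make the kernel trick concrete, the sketch below (not from the thesis; plain NumPy with illustrative function names) evaluates the squared distance between two points in the feature space using kernel values only, so the mapping $\phi$ is never computed explicitly.

```python
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def rbf_kernel(x, z, sigma=1.0):
    # Gaussian (RBF) kernel: k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def feature_space_distance(x, z, k):
    # ||phi(x) - phi(z)||^2 = k(x, x) - 2 k(x, z) + k(z, z):
    # the distance in the feature space needs only kernel evaluations.
    return k(x, x) - 2 * k(x, z) + k(z, z)

x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])
print(feature_space_distance(x, z, linear_kernel))  # equals ||x - z||^2 = 2
print(feature_space_distance(x, z, rbf_kernel))     # distance in the RBF feature space
```

With the linear kernel this recovers the ordinary Euclidean distance; swapping in the RBF kernel changes the geometry without changing the algorithm.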

Data-dependent kernels give rise to semi-supervised learning machines: the kernel function does not depend anymore solely on the two points in question, but makes use of the entire data, the information contained in the whole learning set available. Mathematically, if $\mathcal{D}_1$ and $\mathcal{D}_2$ denote two data sets, $\mathcal{D}_1 \neq \mathcal{D}_2$, $\mathbf{x}, \mathbf{z} \in \mathcal{D}_1 \cap \mathcal{D}_2$, then

$$k(\mathbf{x}, \mathbf{z}; \mathcal{D}_1) \not\equiv k(\mathbf{x}, \mathbf{z}; \mathcal{D}_2)$$


where $k(\cdot, \cdot)$ denotes the kernel function, “;” means conditioning, and “$\not\equiv$” means “not necessarily equal”. Data-dependent kernels are used to improve the similarity measure that would be obtained from the labeled data alone; the feature space representation is now chosen using the information extracted from the labeled and unlabeled data sets. Such kernels can be used in any kernel method whenever unlabeled data are available too.
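As a toy illustration of this definition (it is not one of the kernels studied in the thesis), the sketch below makes the kernel value depend on the data set by rescaling the coordinates with statistics estimated from all available points; evaluating it for the same pair $(\mathbf{x}, \mathbf{z})$ under two different data sets gives two different values, exactly as the relation above permits.

```python
import numpy as np

def data_dependent_kernel(x, z, D):
    # Rescale coordinates by the per-feature standard deviation estimated
    # from the whole data set D (labeled and unlabeled points alike),
    # then take a linear kernel in the rescaled space.
    scale = np.std(np.asarray(D), axis=0) + 1e-12
    return np.dot(x / scale, z / scale)

x = np.array([1.0, 2.0])
z = np.array([2.0, 1.0])
D1 = [x, z, np.array([0.0, 0.0]), np.array([3.0, 3.0])]
D2 = D1 + [np.array([10.0, 0.0]), np.array([-10.0, 0.0])]  # extra unlabeled points

# The same pair (x, z) gets different similarities under D1 and D2.
print(data_dependent_kernel(x, z, D1))
print(data_dependent_kernel(x, z, D2))
```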

The subject of this thesis is the construction of such data-dependent kernels for supervised and semi-supervised learning, and the demonstration of the superiority of data-dependent kernels over conventional kernels.

1.1 Structure of the thesis

The thesis is divided into eight chapters. Chapter 2 gives an overview of SSL techniques and related concepts. After presenting the assumptions used in SSL, a classification of SSL methods is given, briefly presenting a method belonging to each category. In Chapter 3 – which is based on [Csató and Bodó, 2008], especially on its Chapter 6 – we introduce kernel methods and give a detailed description of Support Vector Machines (SVMs), Principal Component Analysis (PCA) and Kernel Principal Component Analysis (KPCA). These methods will be used later in the thesis. Chapter 4 presents existing data-dependent kernel construction techniques such as the ISOMAP kernel (other spectral kernel construction methods are described in different parts of the thesis), the neighborhood kernel, the bagged cluster kernel, the multi-type cluster kernel and a data-dependent kernel related to manifold regularization. The chapter is also based on [Bodó and Minier, 2008]. The following three chapters constitute the main part of the thesis: they contain the main contributions to the field of data-dependent kernel construction. Chapter 5 presents our Wikipedia-based kernel for text categorization published in the articles [Bodó et al., 2007] and [Minier et al., 2007]. This chapter also offers a detailed introduction to the field of text categorization. Chapter 6 presents our hierarchical cluster kernel for semi-supervised learning. We implemented the data-dependent kernels presented in Chapter 4 and experimentally compared them to our hierarchical kernel on different data sets. In Chapter 7 we propose three cluster kernels using the Hadamard product property of positive semi-definite matrices. Chapter 8 concludes the thesis, while Appendix A describes the data sets used for evaluating the methods presented in the thesis.


1.2 Publications and contributions to the field

The thesis is based on the following publications:

[Minier et al., 2006] Zsolt Minier, Zalán Bodó & Lehel Csató. Segmentation-based feature selection for text categorization. In Proceedings of the 2nd International Conference on Intelligent Computer Communication and Processing, pages 53–59, 2006, IEEE.

[Bodó et al., 2007] Zalán Bodó, Zsolt Minier & Lehel Csató. Text Categorization Experiments Using Wikipedia. Special Issue of Studia Universitatis Babeș-Bolyai, Series Informatica, pages 66–72, 2007.

[Minier et al., 2007] Zsolt Minier, Zalán Bodó & Lehel Csató. Wikipedia-based Kernels for Text Categorization. In Proceedings of the 9th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, pages 157–164, 2007, IEEE.

[Csató & Bodó, 2008] Lehel Csató & Zalán Bodó. Neurális hálók és a gépi tanulás módszerei (Neural networks and methods of machine learning). Presa Universitară Clujeană, Cluj-Napoca, 2008.

[Bodó, 2008] Zalán Bodó. Hierarchical Cluster Kernels for Supervised and Semi-Supervised Learning. In Proceedings of the 4th International Conference on Intelligent Computer Communication and Processing, pages 9–16, 2008, IEEE.

[Bodó & Minier, 2008] Zalán Bodó & Zsolt Minier. On Supervised and Semi-Supervised K-Nearest Neighbor Algorithms. Presented at the 7th Joint Conference on Mathematics and Computer Science, Cluj-Napoca, Romania, 2008; appeared in Studia Universitatis Babeș-Bolyai, Series Informatica, Volume LIII, Number 2, Cluj-Napoca, 2008, pages 79–92.

[Csató & Bodó, 2009] Lehel Csató & Zalán Bodó. Decomposition Methods for Label Propagation. In Proceedings of the conference Knowledge Engineering: Principles and Techniques (KEPT 2009), Presa Universitară Clujeană, July 2–4, 2009. Special Issue of Studia Universitatis Babeș-Bolyai, Series Informatica, 2009, pages 127–130.

[Bodó & Minier, 2009] Zalán Bodó & Zsolt Minier. Semi-supervised Feature Selections with SVMs. In Proceedings of the conference Knowledge Engineering: Principles and Techniques (KEPT 2009), Presa Universitară Clujeană, July 2–4, 2009. Special Issue of Studia Universitatis Babeș-Bolyai, Series Informatica, 2009, pages 159–162.

Our contributions to the field can be summarized as follows:

• a new kernel for text categorization that is based on information extracted from Wikipedia

  – proposing the inclusion of the link structure of Wikipedia in the kernel

  – concept weighting in the Wikipedia-based kernel using the PageRank algorithm

• a new method for constructing hierarchical cluster kernels for supervised and semi-supervised learning

  – proposal of a general framework for constructing hierarchical cluster kernels

  – definition of the hierarchical and the graph-based hierarchical cluster kernels

• construction of kernels using the Hadamard product property of positive semi-definite kernels

  – introduction of the Gaussian reweighting kernel

  – introduction of two reweighting kernels using the dot products of the cluster membership vectors


Chapter 2

Semi-supervised learning

The aim of supervised learning is to find a function $\hat{f}: \mathcal{X} \to \mathcal{Y}$ which best approximates the target function $f: \mathcal{X} \to \mathcal{Y}$ on a given subset of $\mathcal{X}$. The function $f$ is given by training examples $(\mathbf{x}_i, y_i)$, $\mathbf{x}_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$, $y_i = f(\mathbf{x}_i)$, $i = 1, 2, \ldots, \ell$, where the $\mathbf{x}_i$'s are called the predictive or independent variables, while the $y_i$'s are the target or dependent variables. If $\mathcal{Y}$ is finite, we talk about classification; if $\mathcal{Y}$ is infinite, we call the task regression. If $\mathcal{Y} = \{0, 1\}$ (or $\mathcal{Y} = \{-1, +1\}$) we talk about binary classification; otherwise, if $|\mathcal{Y}| > 2$, the classification is called multi-class. If $\mathcal{Y} = 2^{\mathcal{C}}$, $\mathcal{C} = \{c_1, \ldots, c_K\}$, $K \ge 2$, we call it multi-label classification; otherwise the classification is called single-label. The set $2^{\mathcal{C}}$ denotes the power set, i.e. the set of all subsets of $\mathcal{C}$.

A classical example of supervised learning is spam filtering: it is a binary classification task where the underlying system has to decide whether an incoming email is spam or ham. We usually define spam as “unsolicited bulk electronic mail (email)”, while the ham category consists of the relevant emails one receives. In actual spam filtering systems [Zdziarski, 2005] sophisticated machine learning algorithms are used, such as Naive Bayes or other methods performing well in text categorization. For further details regarding text categorization see Chapter 5.

In unsupervised learning clustering is used to find coherent groups of data without knowing the labels, the number of classes or any other information about the data. Unsupervised learning of a real valued function is called density estimation. For hierarchical clustering see Chapter 6.

Semi-supervised learning (SSL) is a special case of classification; it is halfway between classification and clustering. In semi-supervised learning the training data is augmented by a set of unlabeled data samples, that is, we have $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_\ell, y_\ell), \mathbf{x}_{\ell+1}, \mathbf{x}_{\ell+2}, \ldots, \mathbf{x}_{\ell+u}$, where usually there are far fewer labeled data than unlabeled ones, i.e. $\ell \ll u$. We denote by $N = \ell + u$ the size of the entire data set. Semi-supervised learning is the problem of assigning labels to the unlabeled samples of the data set using the information provided by both the labeled and the unlabeled data.

SSL techniques can also be used in the conventional classification setting, when simply the labels of unknown data points are needed without any extra knowledge. In these situations the test points play the role of the unlabeled points, assuming they are drawn from the same distribution as the training data.

In the semi-supervised case the inputs $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ are separated from their corresponding labels. The unlabeled set improves the estimation of the density function of the inputs. With additional assumptions about the nature of the whole data set, one can improve the performance of a specific algorithm by incorporating this extra knowledge into the learning process.

The unlabeled data can be used to reveal important facts. For example, suppose that in a text categorization problem the word “professor” turns out to be a good predictor for positive examples based on the labeled data. Then, if the unlabeled data shows that the words “professor” and “university” are correlated, using both words the accuracy of the classifier is expected to improve. To understand how one can use the unlabeled data to improve prediction, consider a simple semi-supervised learning method called self-training or bootstrapping: train the classifier on the labeled examples, make predictions on the unlabeled data, add the most confidently predicted points from the unlabeled set to the labeled set with their predicted labels, and retrain the classifier. This procedure is usually repeated until convergence. Thus we expect to improve the classifier's performance.
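A minimal sketch of this self-training loop, assuming scikit-learn is available; the base classifier and the confidence threshold are illustrative choices, not prescriptions from the thesis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_lab, y_lab, X_unlab, threshold=0.95, max_iter=20):
    """Iteratively move confidently predicted unlabeled points to the labeled set."""
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    clf = LogisticRegression()
    for _ in range(max_iter):
        if len(X_unlab) == 0:
            break
        clf.fit(X_lab, y_lab)
        proba = clf.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # nothing is predicted confidently enough: stop
        pseudo_labels = clf.classes_[proba[confident].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, pseudo_labels])
        X_unlab = X_unlab[~confident]
    clf.fit(X_lab, y_lab)  # final retraining on the augmented labeled set
    return clf
```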

Throughout the thesis, unless stated otherwise, the following notation will be used for learning settings: $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\} \subseteq \mathbb{R}^d$ denotes the independent variables, and $\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_\ell, y_\ell), \mathbf{x}_{\ell+1}, \mathbf{x}_{\ell+2}, \ldots, \mathbf{x}_{\ell+u}\}$ the training data, $N = \ell + u$, where the first part denotes the labeled data and the second part the unlabeled data set; $y_i$ denotes the label of $\mathbf{x}_i$, which may vary depending on the number of classes.

Among related areas which also use a small portion of labeled samples and a larger unlabeled set we can mention active learning and semi-supervised or constrained clustering. In active learning [Cohn et al., 1996] first the classifier is trained on the labeled examples, then some of the unlabeled examples are selected and transmitted to a domain expert for labeling. After receiving the right labels the classifier is retrained with the selected data and their labels, and this whole process is repeated until some termination criterion is met. The central problem of active learning is how to select the most “interesting” examples. One possibility would be to use a probabilistic classifier, which assigns probabilities to the label outputs, in which case the “interesting” cases would be the most uncertain predictions. Semi-supervised clustering [Basu, 2005; Chapelle et al., 2006] clusters the data subject to some constraints given in the form of must-links and cannot-links. Must-links indicate that two examples have to be put in the same cluster, while cannot-links indicate that two examples must not be placed in the same cluster. For example, spectral clustering can be easily extended to constrained clustering by the introduction of an indicator matrix [Bie et al., 2004; Bie, 2005].
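The selection step of active learning can be illustrated with a generic uncertainty-sampling heuristic (a sketch under the assumption that the classifier exposes class probabilities via a predict_proba method; it is not a method proposed in the thesis):

```python
import numpy as np

def select_most_uncertain(clf, X_pool, n_queries=5):
    # "Interesting" examples are those the probabilistic classifier is least
    # sure about: smallest margin between the two most probable classes.
    proba = clf.predict_proba(X_pool)
    sorted_proba = np.sort(proba, axis=1)
    margin = sorted_proba[:, -1] - sorted_proba[:, -2]
    return np.argsort(margin)[:n_queries]  # indices to send to the domain expert
```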

This chapter serves as an introduction to SSL: it presents the general assumptions used in semi-supervised learning methods and a classification of semi-supervised techniques, supported by examples.

2.1 Assumptions in semi-supervised learning

In order to be able to effectively use the unlabeled data to improve the system's performance, we need to state some assumptions on which we will rely when improving the method.

2.1.1 The smoothness assumption

The smoothness assumption says that points in a high density region should have similar labels, that is, labels should change in low density regions [Chapelle et al., 2006]:

Assumption 1. If two points $\mathbf{x}_i$ and $\mathbf{x}_j$ in a high density region are close, then so should be the corresponding outputs $y_i$ and $y_j$.

This assumption can be used in classification and regression as well as in clustering problems. For example, consider the optimization problem of label propagation in Section 2.3.3: by equation (2.2) we search for a function $f$ which changes smoothly in the dense region.


Figure 2.1: Clusters in the data set of concentric clusters: (a) the data set; (b) the 3 clusters shown by the contours.

2.1.2 The cluster assumption

The cluster assumption states that two points from the same cluster should have similar labels [Chapelle et al., 2006]:

Assumption 2. If two points are in the same cluster, they are likely to be of the same class.

An alternative formulation of the above assumption is the following, called low-density separation [Chapelle et al., 2006]:

Assumption 3. The decision boundary should lie in a low density region.

Clusters are chosen to group points in high-density regions. Thus a decision boundary in a high-density region would cut clusters into different classes. That is, if there were points in the same cluster with different labels, that would require the decision boundary to go through a high-density region.

Figure 2.1 illustrates the cluster assumption on a data set formed by two dense regions. In Figure 2.1(b) the clustered data is shown with the 3 clusters found. It is important to observe that the cluster assumption does not say that all the points belonging to one class must lie in the same cluster; still, the cluster assumption clearly holds on the data set.


2.1.3 The manifold assumption

The manifold assumption is usually used for dimensionality reduction, which is an important step of learning. It says [Chapelle et al., 2006],

Assumption 4. The high dimensional data lie roughly on a low dimensional manifold.

That is, the manifold assumption presumes that the data lie on a low-dimensional manifold. It is basically an assumption regarding dimensionality, but if one considers the manifold as the approximation of the high-dimensional region, then it is equivalent to the smoothness assumption. We use the manifold assumption several times in the thesis. For example, in Section 4.1 we will describe a method and a kernel which are based on this assumption.

2.2 Transduction

Transduction [Vapnik, 1998] is an alternative approach to machine learning and is related to SSL methods as follows: transductive methods are always semi-supervised (see Chapter 25 of [Chapelle et al., 2006]). Transduction is a new type of learning, and is based on Vapnik's principle: always try to solve the simpler problem. That is, do not construct an inductive classifier for the whole domain $\mathcal{X}$, but find the decision function only at the test points, $f: X_U \to \mathcal{Y}$, where $X_U$ denotes the set $\{\mathbf{x}_{\ell+1}, \ldots, \mathbf{x}_{\ell+u}\}$.

Among transductive techniques we mention the popular Transductive Support Vector Machines (TSVM) [Chapelle et al., 2006] (see also Section 2.3.2), but the algorithms presented in Section 2.3.3 – mincut-based SSL and label propagation – are also of a transductive nature.

2.3 A classification of semi-supervised methods

In this section we give a classification of existing semi-supervised techniques; it is identical to the classification given in [Chapelle et al., 2006]. The subsections correspond to the different classes, namely: generative models, low-density separation, graph-based methods and change of representation. For clearer comprehension every subsection also contains a short description of an algorithm.


2.3.1 Generative models

Generative methods model the class conditional density $P(\mathbf{x} \mid y)$ and the class priors $P(y)$ and use the Bayes theorem (e.g. in [Bishop, 2006]) for calculating the posteriors, which are used for classification:

$$P(y \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y)P(y)}{P(\mathbf{x})} = \frac{P(\mathbf{x} \mid y)P(y)}{\sum_y P(\mathbf{x} \mid y)P(y)}$$

Here $\mathbf{x}$ represents the random variable assigned to the independent variables and $y$ the random variable assigned to the dependent variables. We use the same notation for the variables $\mathbf{x}$, $y$ and the random variables assigned to them just for ease of comprehension; usually random variables are denoted by uppercase letters $X$, $Y$, etc.

A generative semi-supervised technique is Naive Bayes with EM [Dempster et al., 1977], which can be considered a special self-learning method. It was proposed in [Nigam et al., 2000] and applied to text categorization. We will briefly describe Naive Bayes-based text categorization here, and show how to modify it for semi-supervised settings.

In text categorization the documents $\mathbf{d}_i$ are the independent and the classes $c_j$ are the dependent variables. Let $n_D$ and $n_T$ denote the number of documents in the training corpus and the total number of words in the vocabulary, respectively. Our naive assumption is that the words, denoted by $w_{\mathbf{d}_i,k}$ or $w_k$, are independent of each other, and their order within a document is not important. These are evidently false statements, but fortunately they work well in practice. Thus we calculate the class conditional probabilities as

$$P(\mathbf{d}_i \mid c_j) = \prod_k P(w_{\mathbf{d}_i,k} \mid c_j)$$

The probabilities $P(w_{\mathbf{d}_i,k} \mid c_j)$ and $P(c_j)$ can be approximated by maximum likelihood estimation (MLE), using a smoothing scheme to avoid zero probabilities:

$$P(w_{\mathbf{d}_i,k} \mid c_j) = P(w_k \mid c_j) = \frac{1 + \sum_{i=1}^{n_D} z_{ij} \cdot n(w_k, \mathbf{d}_i)}{n_T + \sum_{m=1}^{n_T} \sum_{i=1}^{n_D} z_{ij} \cdot n(w_m, \mathbf{d}_i)}, \qquad P(c_j) = \frac{n(c_j)}{n_D}$$

where

$z_{ij}$ = 1 if the $i$th document is in the $j$th category, otherwise 0

$n(w_k, \mathbf{d}_i)$ = number of occurrences of word $w_k$ in document $\mathbf{d}_i$

$n(c_j)$ = size of category $c_j$

The difference between $w_{\mathbf{d}_i,k}$ and $w_k$ is that the former denotes the $k$th word in document $\mathbf{d}_i$, while the latter represents the $k$th word in the list of words selected from the corpus for indexing the documents (indexing terms, vocabulary).

Classification uses the Bayes theorem, and the predicted category of an arbitrary document $\mathbf{d}_i$ is obtained as

$$\arg\max_j P(c_j, \mathbf{d}_i) = \arg\max_j \left[ \log P(c_j) + \log P(\mathbf{d}_i \mid c_j) \right]$$

where we employed the fact that $P(\mathbf{d}_i)$ is constant for all classes; therefore it is sufficient to use the numerator of the right side of the Bayes theorem. The above steps of estimation and prediction are iteratively repeated in the semi-supervised version of the algorithm [Nigam et al., 2000; Nigam, 2001] as follows:

Algorithm 1 Naive Bayes + EM
1: Train the classifier on the labeled training examples; determine the probabilities $P(\mathbf{d}_i \mid c_j)$ and $P(c_j)$.
2: while there is improvement do
3:   E-step: Classify the unlabeled examples using the trained classifier.
4:   M-step: Re-estimate the classifier using the previously classified examples; recalculate $P(\mathbf{d}_i \mid c_j)$ and $P(c_j)$.
5: end while

Using the above algorithm the authors achieved a 30% improvement in classification accuracy for text categorization. The method can be used in any classification setting where the features are discrete and the independence assumptions are acceptable.
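A simplified sketch of Algorithm 1 on term-count data, assuming NumPy; it uses hard class assignments in the E-step (Nigam et al. weight the unlabeled documents by their posterior probabilities), and the helper names are illustrative.

```python
import numpy as np

def nb_fit(counts, labels, n_classes):
    # counts: (n_docs, n_terms) term-frequency matrix; labels: class index per document
    n_terms = counts.shape[1]
    log_prior = np.zeros(n_classes)
    log_cond = np.zeros((n_classes, n_terms))
    for c in range(n_classes):
        docs_c = counts[labels == c]
        log_prior[c] = np.log((len(docs_c) + 1) / (len(counts) + n_classes))
        word_counts = docs_c.sum(axis=0)
        # smoothed estimate: (1 + count of w_k in class c) / (n_T + total count in class c)
        log_cond[c] = np.log((1 + word_counts) / (n_terms + word_counts.sum()))
    return log_prior, log_cond

def nb_predict(counts, log_prior, log_cond):
    # argmax_j [ log P(c_j) + sum_k n(w_k, d) log P(w_k | c_j) ]
    return np.argmax(counts @ log_cond.T + log_prior, axis=1)

def nb_em(counts_lab, labels, counts_unlab, n_classes, max_iter=20):
    params = nb_fit(counts_lab, labels, n_classes)           # step 1: labeled data only
    for _ in range(max_iter):
        pseudo = nb_predict(counts_unlab, *params)            # E-step
        all_counts = np.vstack([counts_lab, counts_unlab])    # M-step
        all_labels = np.concatenate([labels, pseudo])
        params = nb_fit(all_counts, all_labels, n_classes)
        if np.array_equal(nb_predict(counts_unlab, *params), pseudo):
            break                                             # pseudo-labels no longer change
    return params
```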

2.3.2 Low density separation

Low density separation techniques are those semi-supervised methods that are based on the cluster assumption or the low density separation principle. These methods push the decision boundary away from regions with data. A natural choice is to use a large-margin method (see SVMs in Chapter 3) that can exploit the information contained in the unlabeled data. To this end transduction is applied and a transductive SVM (TSVM) is built [Vapnik, 1998; Chapelle et al., 2006]. SVMs are large-margin classifiers, that is, they maximize the margin of the separating hyperplane, thus minimizing an upper bound on the actual risk. For a detailed description of SVMs see Chapter 3.

The optimization task can be written in the following form:

$$\begin{aligned}
\text{minimize} \quad & \frac{1}{2}\|\mathbf{w}\|^2 \\
\text{subject to} \quad & y_i(\mathbf{w}'\mathbf{x}_i + b) \ge 1, \quad i = 1, \ldots, \ell \\
& y_j(\mathbf{w}'\mathbf{x}_j + b) \ge 1, \quad j = \ell+1, \ldots, N \\
& y_j \in \{-1, +1\}, \quad j = \ell+1, \ldots, N
\end{aligned}$$

where $(\mathbf{w}, b)$ are the parameters – normal vector and offset – of the hyperplane, and $y_i$, $i = 1, \ldots, \ell$, are known and fixed. The values to be found by the optimization process are $y_j$, $j = \ell+1, \ldots, N$. This is the hard-margin version of TSVM; the soft-margin version is slightly different, with additional slack variables introduced. TSVMs find a labeling of the unlabeled (test) data, $y_{\ell+1}, \ldots, y_N$, such that the separating hyperplane has maximum margin.

Another semi-supervised version of SVMs we mention here is the Laplacian SVM (LapSVM) [Belkin et al., 2006; Chapelle et al., 2006] (see also Section 4.5). One could argue that LapSVMs rather make use of the smoothness assumption, but they can be considered low-density separators as well, since a maximum margin separator is sought which varies smoothly in dense regions. However, in contrast with TSVM, LapSVM is an inductive classifier.

2.3.3 Graph-based methods

Graph-based methods use the – usually undirected – graph of the joint set of labeled and unlabeled data, which reflects similarities in the data by assigning a weight to each edge, the weight representing proximity. The graph can be represented with the adjacency matrix $\mathbf{W}$, where $W_{ij} > 0$ if there is a connection between examples $\mathbf{x}_i$ and $\mathbf{x}_j$, and $W_{ij} = 0$ if the examples are not connected. If the similarity relations are symmetric, then $\mathbf{W}$ is symmetric. But sometimes directed graphs are needed [Zhou et al., 2004], for example in Web page categorization based on hyperlink structure or document classification based on citation graphs. For graph-based methods all the information needed is contained in the graph Laplacian $\mathbf{L}$ [Chung, 1997; von Luxburg, 2006], which is defined as

$$\mathbf{L} = \mathbf{D} - \mathbf{W}$$

where $\mathbf{D}$ is the diagonal matrix containing the row sums of $\mathbf{W}$ on its diagonal. There are two normalized Laplacians used frequently in the literature:

• symmetric: $\mathbf{L}_{\mathrm{sym}} = \mathbf{D}^{-1/2}\mathbf{L}\mathbf{D}^{-1/2} = \mathbf{I} - \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}$

• random walk: $\mathbf{L}_{\mathrm{rw}} = \mathbf{D}^{-1}\mathbf{L} = \mathbf{I} - \mathbf{D}^{-1}\mathbf{W}$
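A minimal sketch of these constructions, assuming NumPy; the symmetrized kNN graph with Gaussian weights used here is one common choice, anticipating the graph construction discussed below.

```python
import numpy as np

def gaussian_knn_graph(X, k=5, sigma=1.0):
    # Pairwise squared distances, Gaussian weights, kept only for the k nearest neighbors.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq_dist / (2 * sigma ** 2))
    np.fill_diagonal(W, 0)
    nearest = np.argsort(-W, axis=1)[:, :k]
    mask = np.zeros_like(W, dtype=bool)
    rows = np.repeat(np.arange(len(X)), k)
    mask[rows, nearest.ravel()] = True
    return np.where(mask | mask.T, W, 0)   # symmetrize the kNN relation

def laplacians(W):
    d = W.sum(axis=1)
    D = np.diag(d)
    L = D - W                                                # unnormalized Laplacian
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt     # symmetric normalization
    L_rw = np.eye(len(W)) - np.diag(1.0 / d) @ W             # random-walk normalization
    return L, L_sym, L_rw
```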

One of the simplest graph-based algorithms is the method of Blum and Chawla [2001] based on graph minimum cuts (mincuts). Mincuts have been used in unsupervised learning for separating/detecting clusters: clustering is performed by finding the minimum cut in the graph, that is, those edges whose removal partitions the graph into two – or more, depending on the clustering task – disconnected subgraphs with a minimal sum of weights. The algorithm has the following steps: first the graph is built, after which two special nodes, called classification vertices, are introduced. One of these vertices is connected to the positive examples with weight ∞, while the other is connected only to the negative examples with weight ∞. The main step of the algorithm is to determine a mincut in the graph, a set of edges which disconnects the classification vertices. We assign label +1 to the vertices remaining on the side of the positive classification vertex, and −1 to the rest. The algorithm is transductive and results in hard classification; the mincut problem can be solved using, for example, the Ford–Fulkerson algorithm [Cormen et al., 2001] for undirected graphs.

Another well-known algorithm is Zhu and Ghahramani's label propagation (LP) [Zhu and Ghahramani, 2002; Zhu, 2005], which we briefly present here. The idea of LP is to build the data graph and iteratively propagate labels from labeled examples to unlabeled data. In every iteration, the labeled examples distribute their label among the neighbors according to the strength of the connection. First the data graph is constructed using the Gaussian similarity:

$$W_{ij} = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)$$

where $W_{ij}$ is a measure of the degree of similarity between examples $\mathbf{x}_i$ and $\mathbf{x}_j$. The degree of similarity determines how close two examples are, and is usually domain dependent. For example, when dealing with textual data, a better choice is to use a dot product-based metric (cosine similarity, Jaccard coefficient, overlap coefficient, Dice coefficient, etc.; see Table 3.1). The width parameter $\sigma$ in the Gaussian similarity specifies the distance scale within which the neighbourhood relationship counts as similarity. Other graph construction methods include both fully connected (or complete) and sparse graphs: εNN, kNN and tanh-weighted graphs [Zhu, 2005].

We will use the following matrix

$$\mathbf{P} = \mathbf{D}^{-1}\mathbf{W}$$

called the transition probability matrix, where a row $\mathbf{P}_{i\cdot}$ defines the transition probabilities from node $i$ to the other nodes, that is, $\mathbf{P}$ defines the transition probability distribution of a random walk [Grinstead and Snell, 2003]. To define the problem of label propagation, we additionally define the following matrices for the case of classifying data into $K$ classes:

$\mathbf{Y}_L$ – an $(\ell \times K)$ matrix with each row an indicator for a labeled example; thus if $K = 4$ and the $i$th example belongs to classes 3 and 4, then the $i$th row is $[0, 0, 1, 1]$, or mathematically $(\mathbf{Y}_L)_{ic} = \delta(y_i, c)$.

$\mathbf{Y}_U$ – a $(u \times K)$ matrix containing assignment probabilities for example–class pairs; its meaning is similar to the previous case.

$\mathbf{Y}$ – the concatenation of the above two matrices, an $(N \times K)$ matrix: $\mathbf{Y} = \begin{bmatrix}\mathbf{Y}_L \\ \mathbf{Y}_U\end{bmatrix}$

Label propagation propagates labels from the labeled examples to the unlabeled ones using the following sequence of steps [Zhu and Ghahramani, 2002]:

Algorithm 2 Label propagation
1: Compute $\mathbf{Y}(t+1) = \mathbf{P}\,\mathbf{Y}(t)$.
2: Reset the labeled data, $\mathbf{Y}_L(t+1) = \mathbf{Y}_L(0)$.
3: Set $t = t+1$ and repeat the above steps until convergence.

When converged, the labels are determined in the following way:

$$y_i = \arg\max_{j=1,\ldots,K} (\mathbf{Y}_U)_{ij}, \quad i = 1, \ldots, u \qquad (2.1)$$

This means hard classification with a single label, but alternatively one can obtain multi-label results by thresholding the converged values.

Figure 2.2: The propagation of labels in LP, shown at iterations 1, 4, 7, 10, 13 and 14. At the beginning (step 0, not shown here) only the 2 points in the black frames are labeled.

One can show that the final labels do not depend on the initial values of $\mathbf{Y}_U$. The exact solution of LP [Zhu and Ghahramani, 2002] is

$$\mathbf{Y}_U = (\mathbf{I} - \mathbf{P}_{UU})^{-1}\mathbf{P}_{UL}\mathbf{Y}_L$$
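A minimal sketch of both the iterative scheme (Algorithm 2) and the closed-form solution above, assuming NumPy; the labeled points are assumed to occupy the first $\ell$ rows of $\mathbf{W}$.

```python
import numpy as np

def label_propagation(W, Y_L, n_iter=1000, tol=1e-6):
    """Iterative LP; rows 0..l-1 of W correspond to the labeled points."""
    l, K = Y_L.shape
    N = W.shape[0]
    P = W / W.sum(axis=1, keepdims=True)                  # transition matrix P = D^{-1} W
    Y = np.vstack([Y_L, np.full((N - l, K), 1.0 / K)])    # any initialization works for Y_U
    for _ in range(n_iter):
        Y_new = P @ Y
        Y_new[:l] = Y_L                                    # clamp (reset) the labeled examples
        if np.abs(Y_new - Y).max() < tol:
            Y = Y_new
            break
        Y = Y_new
    return Y[l:].argmax(axis=1)                            # equation (2.1)

def label_propagation_exact(W, Y_L):
    """Closed-form solution Y_U = (I - P_UU)^{-1} P_UL Y_L."""
    l = Y_L.shape[0]
    P = W / W.sum(axis=1, keepdims=True)
    P_UU, P_UL = P[l:, l:], P[l:, :l]
    Y_U = np.linalg.solve(np.eye(P_UU.shape[0]) - P_UU, P_UL @ Y_L)
    return Y_U.argmax(axis=1)
```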

It is interesting to note that Google's PageRank algorithm [Page et al., 1998] is similar to label propagation; actually, there is a slightly different label propagation method having the same computations as PageRank. For variations on label propagation see [Chapelle et al., 2006].

It is easy to show that label propagation minimizes the expression

$$\frac{1}{2}\sum_{i,j=1}^{N} W_{ij}\|\mathbf{Y}_{i\cdot} - \mathbf{Y}_{j\cdot}\|_2^2 = \mathrm{tr}(\mathbf{Y}'\mathbf{L}\mathbf{Y}) \qquad (2.2)$$

with the constraint $\mathbf{Y}_L = \mathbf{Y}_L^{*}$, where $\mathbf{Y}_L^{*}$ contains the fixed labels of the labeled examples. For the binary case this simplifies to

$$\frac{1}{2}\sum_{i,j=1}^{N} W_{ij}(Y_i - Y_j)^2 = \mathbf{Y}'\mathbf{L}\mathbf{Y}$$

which is equal to the graph cut [Bie, 2005].

Figure 2.3: Illustration for SSL based on change of representation; the four labeled points A, B, C and D are shown among the unlabeled examples.

Figure 2.2 shows the propagation of the labels on the “two-moons” data set (see Appendix A) at different iterations. LP is able to find the right labeling of the data here, having only two labeled examples.

In [Csató and Bodó, 2009] we proposed an efficient decomposition scheme for solving the label propagation problem.

2.3.4 Change of representation

In these algorithms we change the representation of the points by considering all the examples given, that is, both the labeled and the unlabeled data. We try to find structure in the data which is better emphasized, or more easily observable, in the presence of the larger unlabeled data set. The algorithms thus proceed in the following two steps:

Algorithm 3 SSL with representational change
1: Build the new representation – new distance, dot product or kernel – of the learning examples.
2: Use a supervised learning method to obtain the decision function, employing the new representation obtained in the previous step.

To understand how a change in representation can influence the classification, consider the situation in Figure 2.3. Having only the four labeled points, we see that A is equally similar to D and B. In contrast, by adding the unlabeled points we observe that A, D and B, C lie in different clusters, since the labeled and unlabeled points together define two well-separated clusters: the middle Gaussian, and the points on the circle around the Gaussian. Hence, using first an appropriate clustering algorithm capable of producing probabilities that two points are in the same cluster, the basic similarity metric (kernel) can be weighted by these probabilities, thus producing a better representational space for the data. We will discuss such kernels later in Chapters 4 and 7.

Here we present a dimensionality reduction technique called Locally Linear Embedding (LLE), which maps the points – labeled and unlabeled alike – to a lower-dimensional space such that the neighborhood structure of the points is preserved. The method was proposed in [Roweis and Saul, 2000] and consists of the following three steps:

Algorithm 4 Locally Linear Embedding
1: Determine the k nearest neighbors of each point.
2: Rewrite each point as a linear combination of its neighbors.
3: Map the points to a lower-dimensional space such that the same linear relation holds between a point and its neighbors.

The second step means that we minimize
$$\sum_{i=1}^{N} \Big\| x_i - \sum_{j \in I(N(x_i))} W_{ij}\, x_j \Big\|^2$$
where $N(x)$ denotes the neighbors of $x$, while $I(N(x))$ returns the indices of the neighbors; the remaining entries of $W$ are zeros. This can be minimized separately for every point, that is, we minimize $\| x_i - \sum_{j \in I(N(x_i))} W_{ij} x_j \|^2$ with the constraint $\sum_j W_{ij} = 1$, which ensures translation invariance. This can be written as $\sum_{jk} W_{ij} W_{ik} C^{(i)}_{jk}$, where $C^{(i)}$ is called the local covariance matrix for $x_i$, and is defined as
$$C^{(i)}_{jk} = (x_i - x_j)'\,(x_i - x_k)$$
where $j, k \in I(N(x_i))$. The problem can be solved by introducing a Lagrange multiplier for the sum-to-one constraint of the coefficients, and the solution can be expressed as
$$W_{i\cdot} = \frac{(C^{(i)})^{-1} \mathbf{1}}{\mathbf{1}' (C^{(i)})^{-1} \mathbf{1}}$$
Since the local covariance matrix can be singular, usually a regularization parameter $r$ is introduced, and $C^{(i)} + r I$ is used in the computations instead of $C^{(i)}$.


Figure 2.4: The input and the result of the LLE algorithm: (a) the "swiss-roll" (three-dimensional) data set and (b) its two-dimensional embedding ($k = 7$, $r = 10^{-3}$). The highlighted points show how neighborhood relations are preserved.

In the next step the points have to be mapped to a lower-dimensional space using the same coefficients, that is, we minimize
$$\sum_{i=1}^{N} \Big\| z_i - \sum_{j=1}^{N} W_{ij}\, z_j \Big\|^2$$

By introducing constraints that do not change the solution, we can solve this problem by forming the matrix $M = (I - W)'(I - W)$ and computing its bottom (smallest) $d' + 1$ eigenvalues, from which we discard the eigenvector of the smallest eigenvalue. Thus the new representation of the points becomes – reading from the rows of the following matrix –
$$V = [u_2 \; u_3 \; \ldots \; u_{d'+1}]$$
where $d'$ denotes the dimension of the new space, $d' < d$, and $u_i$ is the $i$th eigenvector of $M$, starting from the smallest one. For a more detailed description of LLE and its solution see [Saul and Roweis, 2001]¹.
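As an illustration only, the following NumPy sketch follows the three steps of Algorithm 4 with the notation of this section ($k$ neighbors, regularization $r$, target dimension $d'$); it assumes the input points are stored in the rows of X.

import numpy as np

def lle(X, k=7, r=1e-3, d_new=2):
    N = X.shape[0]
    # Step 1: k nearest neighbors of every point (excluding the point itself)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    nbrs = np.argsort(dist, axis=1)[:, 1:k + 1]
    # Step 2: reconstruction weights W with the sum-to-one constraint
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[i] - X[nbrs[i]]                  # rows are x_i - x_j, j in I(N(x_i))
        C = Z @ Z.T                            # local covariance C^(i)
        C = C + r * np.eye(k)                  # regularization C^(i) + r I
        w = np.linalg.solve(C, np.ones(k))     # proportional to (C^(i))^{-1} 1
        W[i, nbrs[i]] = w / w.sum()            # normalize so that sum_j W_ij = 1
    # Step 3: bottom d'+1 eigenvectors of M = (I - W)'(I - W), drop the smallest
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    _, vecs = np.linalg.eigh(M)                # eigenvalues in ascending order
    return vecs[:, 1:d_new + 1]                # rows give the new coordinates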

LLE has three important parameters: the number of neighbors $k$, the regularization parameter $r$ and the number of new dimensions $d'$. For setting these parameters optimally see [Busa-Fekete and Kocsor, 2005].

Figure 2.4 shows the result of LLE applied to the swiss-roll data set. Here we used the parameters $k = 7$ and $r = 10^{-3}$, and we chose $d' = 2$ dimensions for representing the data. We highlighted only three points to show how the neighborhood of the points, and thus the relative distances, are preserved.

¹The MATLAB implementation of LLE and related papers can be downloaded from http://www.cs.toronto.edu/~roweis/lle

In this thesis we propose kernel construction methods for semi-supervised learning (see Chapters 5, 6 and 7); therefore the SSL techniques we deal with belong to the category of semi-supervised methods with change of representation.


Chapter 3

Kernels and kernel methods

Kernel methods are tools for extending linear algorithms to non-linear problems. These techniques are based on Mercer's theorem [Schölkopf and Smola, 2002], which says that any continuous symmetric positive semi-definite kernel function can be expressed as a dot product in a high-dimensional space. The first application of kernels in machine learning appeared in 1964 in the work of Aizerman et al. [1964], where the authors used a kernelized version of Rosenblatt's perceptron algorithm, but only recently have researchers recognized the great importance and wide applicability of kernels. Based on the work of Boser et al. [1992], Support Vector Machines (SVMs) became one of the most remarkable kernel-based techniques. The SVM builds a maximum margin hyperplane separating positive and negative examples, where the "maximum margin" condition is needed to minimize an upper bound on the actual risk. SVMs will be presented in detail in Section 3.3.

The chapter is based on [Csató and Bodó, 2008].

Kernels are defined as positive semi-definite functions which return the dot product of the vectors in some high-dimensional space $H$. Let $\phi : X \to H$ be the mapping from the input space $X$ to the – usually high-dimensional – feature space $H$, sometimes called the linearization space. Then we define the kernel function $k : X \times X \to \mathbb{R}$ as
$$k(x, z) = \langle \phi(x), \phi(z) \rangle = \phi(x)'\,\phi(z)$$
Thus if one can rewrite an algorithm in such a form that the independent variables appear only in the form of dot products, then one does not need to map the points to the feature space.


Figure 3.1: The XOR problem in the input space and in the feature space generated by the polynomial kernel of degree 2 (feature space axes $x_1^2$, $x_2^2$, $x_1 x_2$).

Instead, one only needs a kernel function that returns the dot product of the vectors. Furthermore, since angles, lengths and distances can be written as dot products, kernel methods cover all the geometric constructs that involve these quantities. The substitution of the dot product by another kernel function is called the kernel trick:

Kernel trick. In an algorithm which is formulated in terms of a kernel function $k(\cdot, \cdot)$, this kernel can be replaced by any other positive semi-definite kernel function.

In Figure 3.1 the XOR problem is shown (a) in the input space and (b) in the feature space generated by the 2nd order homogeneous polynomial kernel. As one can read off from the first illustration, our points are $X = \{(0, 0), (0, 1), (1, 0), (1, 1)\}$ and the corresponding labels $Y = \{-1, 1, 1, -1\}$. It is easy to observe that the data cannot be separated by a linear classifier, i.e. by a line, such that the circles appear on one side of the line and the points denoted by "x" on the other side. Now we use the following transformation
$$\phi(x) = \begin{bmatrix} x_1^2 & x_2^2 & \sqrt{2}\,x_1 x_2 \end{bmatrix}' \qquad (3.1)$$
to map the data to the three-dimensional Euclidean space, where $x_k$ denotes the $k$th component (dimension) of the vector $x$. After the mapping the points become linearly separable, as Figure 3.1(b) shows: now the task is to separate the points (0, 0, 0), (1, 1, 1) and (0, 1, 0), (1, 0, 0), and obviously many such hyperplanes exist. Equation (3.1) defines the mapping which corresponds to the 2nd order homogeneous polynomial kernel, which will be discussed later in detail. In this case our kernel function will be
$$k(x, z) = \phi(x)'\,\phi(z) = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2 = (x'z)^2$$
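As a quick sanity check – a sketch added here for illustration, not part of the original text – the identity above can be verified numerically with the explicit map (3.1):

import numpy as np

def phi(x):
    # explicit feature map of equation (3.1)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k_poly2(x, z):
    # homogeneous polynomial kernel of degree 2
    return float(x @ z) ** 2

x, z = np.array([0.3, 0.7]), np.array([1.0, 0.5])
print(phi(x) @ phi(z), k_poly2(x, z))   # the two values coincide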


                         set-based form                         vector-based form
matching coefficient     $|A \cap B|$                            $x'y$
cosine similarity        $\dfrac{|A \cap B|}{\sqrt{|A|\cdot|B|}}$            $\dfrac{x'y}{\|x\|\cdot\|y\|}$
Jaccard coefficient      $\dfrac{|A \cap B|}{|A \cup B|}$                    $\dfrac{x'y}{\|x\|^2 + \|y\|^2 - x'y}$
overlap coefficient      $\dfrac{|A \cap B|}{\min(|A|, |B|)}$                $\dfrac{x'y}{\min(\|x\|^2, \|y\|^2)}$
Dice coefficient         $\dfrac{2\,|A \cap B|}{|A| + |B|}$                  $\dfrac{2\,x'y}{\|x\|^2 + \|y\|^2}$

Table 3.1: Some similarity metrics.

Thus, for solving the XOR problem with a linear algorithm, e.g. the perceptron, an SVM, etc., one can use the kernel function shown above.

Apart from Mercer's theorem and the kernel trick, there is another important theorem, called the representer theorem, which specifies the form of the decision function and is useful when the problem cannot easily be written as a function of the dot products.

Theorem 3.1 ([Schölkopf and Smola, 2002]). Let $H$ be the feature space associated to a positive semi-definite kernel $k : X \times X \to \mathbb{R}$. Denote by $\Omega : [0, \infty) \to \mathbb{R}$ a strictly monotonically increasing function, and by $c : (X \times \mathbb{R}^2)^\ell \to \mathbb{R} \cup \{\infty\}$ an arbitrary loss function. Then each minimizer of the regularized risk
$$c\big((x_1, y_1, f(x_1)), \ldots, (x_\ell, y_\ell, f(x_\ell))\big) + \Omega(\|f\|_H)$$
admits a representation of the form
$$f(x) = \sum_{i=1}^{\ell} \alpha_i\, k(x_i, x)$$

Here $\Omega(\|f\|_H)$ denotes the regularization term added to the loss function.

Kernels can also be thought of as similarity measures: examples in the same class should have high kernel values. For binary classification, i.e. $Y = \{-1, 1\}$, the ideal kernel is $k(x_i, x_j) = y_i y_j$, provided of course that the labels of the test/unlabeled points were known [Cristianini et al., 2001]. A few similarity measures are enumerated in Table 3.1.

Figure 3.2: A simple classification algorithm (class centers $c_+$ and $c_-$, weight vector $w$, and the vector $x - c$).

In this chapter we present the kernelized version of a simple classification algorithm, introduce some general-purpose kernels, and describe in detail the process of classification using SVMs. The chapter ends with a section about Principal Component Analysis (PCA) and its kernelized version.

3.1 A simple classification algorithm

In this section we present a simple binary classification algorithm and its kernelized version [Schölkopf and Smola, 2002]. We split the training data into positive and negative examples, $X_+ = \{x_i \mid y_i = +1\}$ and $X_- = \{x_i \mid y_i = -1\}$, and denote their sizes by $N_+ = |X_+|$ and $N_- = |X_-|$.

Let us calculate the class centers
$$c_+ = \frac{1}{N_+} \sum_{x_i \in X_+} x_i \qquad \text{and} \qquad c_- = \frac{1}{N_-} \sum_{x_i \in X_-} x_i$$

We can then classify an unseen example by choosing the class whose center is closer. Let $w = c_+ - c_-$ and $c = (c_+ + c_-)/2$. The class label can be obtained by calculating the angle between the vectors $w$ and $x - c$, as shown in Figure 3.2. If the angle is in $[0, \pi/2)$, $x$ gets the label $+1$, otherwise $-1$. Suppose the vectors are normalized to unit length – otherwise we normalize them before classification; then the predicted label can be calculated as follows:
$$y = \operatorname{sgn}\langle x - c, w \rangle = \operatorname{sgn}\big( \langle c_+, x \rangle - \langle c_-, x \rangle + b \big)$$

Page 40: Semi-supervised Learning with Kernelszbodo/thesis/thesis.pdf · Acknowledgements First of all I would like to thank my advisor, Zoltán Kása, for giving me the opportunity to continue

26 3.2. SOME GENERAL PURPOSE KERNELS

where $b = \big( \|c_-\|^2 - \|c_+\|^2 \big)/2$. Substituting the corresponding expressions for $c_+$ and $c_-$ we obtain
$$y = \operatorname{sgn}\left( \frac{1}{N_+} \sum_{x_i \in X_+} \langle x, x_i \rangle - \frac{1}{N_-} \sum_{x_i \in X_-} \langle x, x_i \rangle + b \right) \qquad (3.2)$$
where
$$b = \frac{1}{2}\left( \frac{1}{N_-^2} \sum_{x_i, x_j \in X_-} \langle x_i, x_j \rangle - \frac{1}{N_+^2} \sum_{x_i, x_j \in X_+} \langle x_i, x_j \rangle \right) \qquad (3.3)$$

Now in the decision function, namely in equations (3.2) and (3.3), the independent variables appear only in the form of dot products; hence, instead of the dot product $\langle \cdot, \cdot \rangle$, one can use an arbitrary kernel function $k(\cdot, \cdot)$.
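A small NumPy sketch of the kernelized decision rule is given below; the base kernel is a free choice (here the RBF kernel of Section 3.2.3 is used for illustration), and the split into X_pos and X_neg follows the notation above. This is only an illustration of equations (3.2) and (3.3), not an optimized implementation.

import numpy as np

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def predict(x, X_pos, X_neg, k=rbf):
    # equations (3.2) and (3.3) with the dot product replaced by a kernel
    s_pos = np.mean([k(x, xi) for xi in X_pos])
    s_neg = np.mean([k(x, xi) for xi in X_neg])
    b = 0.5 * (np.mean([k(xi, xj) for xi in X_neg for xj in X_neg])
               - np.mean([k(xi, xj) for xi in X_pos for xj in X_pos]))
    return np.sign(s_pos - s_neg + b)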

3.2 Some general purpose kernels

As we already saw, a kernel function is a continuous symmetric positive semi-definite function which returns the dot product of the vectors in the feature space, $k(x, z) = \phi(x)'\phi(z)$, which is usually a higher-dimensional space than the input space. There are cases when it is hard to write the kernel function explicitly – for example when the function is replaced by a complex algorithm – but we can simply calculate the kernel values for a given set of points. In these cases we will use the kernel matrix instead of the kernel function. The kernel matrix is also called the Gram matrix and is usually denoted by $K$, $K_{ij} = k(x_i, x_j)$, $i, j = 1, \ldots, \ell$. A complex matrix of size $\ell \times \ell$ is positive semi-definite if
$$c^* K c \geq 0, \quad \forall c \in \mathbb{C}^\ell \qquad (3.4)$$
where $c^*$ is the conjugate transpose of $c$. A real symmetric matrix is positive semi-definite if and only if all its eigenvalues are non-negative. A kernel function is positive semi-definite if for all $x_1, \ldots, x_\ell$ the induced Gram matrix is positive semi-definite. Instead of "kernel function" we will use the term "kernel", which refers to either the kernel function or the kernel matrix, depending on the context.

In what follows we present four widely used general-purpose kernels, namely the linear, the polynomial, the Gaussian and the sigmoid kernel.

Page 41: Semi-supervised Learning with Kernelszbodo/thesis/thesis.pdf · Acknowledgements First of all I would like to thank my advisor, Zoltán Kása, for giving me the opportunity to continue

27 3.2. SOME GENERAL PURPOSE KERNELS

3.2.1 The linear kernel

The dot or inner product $\langle x, z \rangle = x'z = \sum_{i=1}^{d} x_i z_i$, $x, z \in \mathbb{R}^d$, is also called the linear kernel. The Gram matrix $XX'$, $X = [x_1, x_2, \ldots, x_N]'$, is always positive semi-definite; it has $r$ positive and $N - r$ zero eigenvalues, where $r$ is the rank of the Gram matrix. It can also be written as
$$XX' = \sum_{i=1}^{d} X_{\cdot i} X_{\cdot i}'$$

3.2.2 The polynomial kernel

We can connect the features of the examples by a logical "and"; thus higher-order features can be built. If we apply a learning method using these features instead of the features of the original space, we usually improve the separation ability of the method, because it is easier to find a good separator in a higher-dimensional space. In $\mathbb{R}^n$, $n + 1$ points can always be separated by a hyperplane (see [Burges, 1998]); therefore, the higher the dimensionality, the more points can be separated. By using kernels we exploit the advantages of high-dimensional spaces without actually working in these spaces.

A good example for extending the feature space is the problem of handwritten digit recognition [Schölkopf and Smola, 2002], where the 2nd degree polynomial kernel – which also considers the occurrences of pixel pairs – led to a significant improvement in the classification problem.

When working with numerical data, multiplication is used for the logical "and"; therefore we refer to these higher-order features as product features. Consider for example the set of $k$th degree monomials of a vector $x \in \mathbb{R}^d$,
$$\left\{ x_1^{j_1} x_2^{j_2} \cdots x_d^{j_d} \;\middle|\; \sum_{\ell=1}^{d} j_\ell = k \right\} \qquad (3.5)$$

For $d = 2$ and $k = 2$ this set becomes $\{x_1^2, x_2^2, x_1 x_2, x_2 x_1\}$. Hence for this case we can write the feature transformation as $\phi(x) = [x_1^2 \; x_2^2 \; x_1 x_2 \; x_2 x_1]'$, which is similar to (3.1), except that here the ordered monomials are taken. If we increase the dimension of the inputs and/or the degree of the monomials, then the dimension of the feature space increases very fast. For example, the number of $k$th degree unordered monomials for a given $(d, k)$ is
$$\binom{d + k - 1}{k} = \frac{(d + k - 1)!}{k!\,(d - 1)!}$$

Page 42: Semi-supervised Learning with Kernelszbodo/thesis/thesis.pdf · Acknowledgements First of all I would like to thank my advisor, Zoltán Kása, for giving me the opportunity to continue

28 3.2. SOME GENERAL PURPOSE KERNELS

If we consider $16 \times 16$ pixel images ($d = 2^8$), for $k = 5$ the dimension of the vectors in the feature space exceeds $2^{33}$. Therefore we will not compute the vectors mapped to the feature space, but only their dot product. Since we do not need ordered monomials, we use the map (3.1), or in general
$$k(x, z) = (x'z)^k$$
which is called the homogeneous polynomial kernel, since every monomial has the same degree. The inhomogeneous polynomial kernel is defined as
$$k(x, z) = (a\,x'z + b)^k$$
where $a, b \in \mathbb{R}_+$. It is easy to observe that the homogeneous polynomial kernel is positive semi-definite, because it is equal to the $k$th power of the dot product of the vectors. To see that the inhomogeneous polynomial kernel is positive semi-definite, rewrite it in the following way:
$$(a\,x'z + b)^k = \sum_{i=0}^{k} \binom{k}{i} a^{k-i}\, (x'z)^{k-i}\, b^i$$
It is indeed a kernel, because it is a linear combination, with non-negative coefficients, of homogeneous polynomial kernels up to degree $k$.

3.2.3 The RBF kernel

The RBF (Radial Basis Function) or Gaussian kernel is defined as
$$k(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)$$

This is the most widely used kernel due to its favourable properties:

(i) it has only one parameter, σ, which is called the neighborhood width;

(ii) it is invariant to translation, because it is a function of $\|x - z\|$;

(iii) it generates normalized data in the feature space: $k(x, x) = \|\phi(x)\|^2 = 1$;

(iv) the dimension of the feature space is infinite: for every new data point the dimension of the space spanned by the data increases by one.

Page 43: Semi-supervised Learning with Kernelszbodo/thesis/thesis.pdf · Acknowledgements First of all I would like to thank my advisor, Zoltán Kása, for giving me the opportunity to continue

29 3.3. CLASSIFICATION WITH SUPPORT VECTOR MACHINES

Its most important property is the last one, which tells us that using the RBF kernel one can find a solution – which is not necessarily good in general – to any problem. There is a theorem [Schölkopf and Smola, 2002] stating that for any set of distinct samples from the input domain, $\{x_1, \ldots, x_\ell\} \subseteq X$, $x_i \neq x_j$, $\forall i \neq j$, the Gram matrix $K_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$ has full rank, that is, the points $\phi(x_1), \ldots, \phi(x_\ell)$ in the feature space are linearly independent.
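This property can be observed numerically; the sketch below is only an illustration with an arbitrarily chosen σ = 1 and random points. Note that for very large σ the rank reported by floating-point routines may be smaller, even though the matrix has full rank in exact arithmetic.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                        # 20 distinct points in R^3
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * 1.0 ** 2))                    # RBF Gram matrix, sigma = 1
print(np.linalg.matrix_rank(K))                     # prints 20, i.e. full rank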

3.2.4 The sigmoid kernel

Another popular kernel is the sigmoid or tanh kernel:
$$k(x, z) = \tanh(a\,x'z + b)$$
The parameters of the sigmoid kernel have to be set with great care, because for some values the kernel is not positive semi-definite. If $a > 0$ and $b < 0$ the kernel becomes conditionally positive semi-definite [Schölkopf and Smola, 2002], and for small $a$ it is similar to the RBF kernel. For a detailed analysis of the sigmoid kernel see [tien Lin and jen Lin, 2003].

3.3 Classification with Support Vector Machines

Support Vector Machines (SVMs) were proposed for classification by Vladimir Vapnik already in the 1960s, but the kernelized version of the method appeared only in 1992 in the work of Boser et al. [1992]. SVMs search for a hyperplane separating the positive and negative examples with maximum margin. The SVM minimizes the hinge loss, defined as
$$L_{\text{hinge}}(x_i, y_i) = \max\{0,\; 1 - y_i(w'x_i + b)\}$$
plus the regularization term $\|w\|^2$, where $w$ denotes the normal of the separating hyperplane and $y_i \in \{-1, +1\}$; the margin of the separating hyperplane equals $2/\|w\|$ [Csató and Bodó, 2008]. SVMs are similar to regularized least squares (RLS) [Bishop, 2006], only the loss function is different; the SVM minimizes the hinge loss, while RLS minimizes the quadratic loss:
$$L_{\text{quad}}(x_i, y_i) = (1 - y_i(w'x_i + b))^2$$

We present two versions of the SVMs: hard margin and soft margin SVMs.

Page 44: Semi-supervised Learning with Kernelszbodo/thesis/thesis.pdf · Acknowledgements First of all I would like to thank my advisor, Zoltán Kása, for giving me the opportunity to continue

30 3.3. CLASSIFICATION WITH SUPPORT VECTOR MACHINES

3.3.1 Hard margin SVMs

We use hard margin SVMs if the data is separable. We search for a hyperplane parametrized by $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ that separates positive and negative examples,
$$w'x_i + b \geq 1, \quad \text{if } y_i = 1$$
$$w'x_i + b \leq -1, \quad \text{if } y_i = -1$$
The decision function is simply $f(x) = \operatorname{sgn}(w'x + b)$, that is, for a positive value the point obtains a positive label, while for a negative value it is classified as a negative example. These two inequalities can be merged into a single inequality,
$$y_i(w'x_i + b) \geq 1, \quad i \in \{1, \ldots, \ell\}$$

Because there are many such hyperplanes, we choose the one which minimizes the actual risk [Csató and Bodó, 2008]: the maximum margin hyperplane. The marginal hyperplanes are
$$H_1 : w'x + b = 1, \qquad H_2 : w'x + b = -1$$
By the formula $d(x, H(w, b)) = \frac{|w'x + b|}{\|w\|}$ [Sloughter, 2001] – which gives the distance of a point from a hyperplane – we see that the distance between $H_1$ and $H$, where $H$ now represents the separating hyperplane, is $1/\|w\|$, and similarly $1/\|w\|$ between $H_2$ and $H$. Therefore we want to minimize $(1/2)\|w\|$, or equivalently $(1/2)\|w\|^2$. Those points which lie on the marginal hyperplanes $H_1$ and $H_2$ are called support vectors, because those are the only points the separation depends on. The optimization problem is a constrained convex optimization task:
$$\min F(w, b) = \frac{1}{2}\|w\|^2 \quad \text{such that} \quad y_i(w'x_i + b) \geq 1, \; i = 1, \ldots, \ell \qquad (3.6)$$

Working with the Lagrangian formulation of the problem, the constraints become simpler and we will also be able to kernelize the algorithm. Thus we write
$$L_P(w, b, A) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{\ell} \alpha_i \big( y_i(w'x_i + b) - 1 \big)$$

Page 45: Semi-supervised Learning with Kernelszbodo/thesis/thesis.pdf · Acknowledgements First of all I would like to thank my advisor, Zoltán Kása, for giving me the opportunity to continue

31 3.3. CLASSIFICATION WITH SUPPORT VECTOR MACHINES

where $A = [\alpha_1 \ldots \alpha_\ell]'$ is the vector of positive Lagrange multipliers. Now the task is to minimize $L_P$ with respect to $w$ and $b$, and simultaneously maximize it with respect to the Lagrange multipliers in $A$ such that $\alpha_i \geq 0$, $i = 1, \ldots, \ell$. We calculate the partial derivatives of $L_P$ with respect to the vector $w$ and the scalar $b$, and set them to zero:
$$w - \sum_{i=1}^{\ell} \alpha_i y_i x_i = 0, \qquad \sum_{i=1}^{\ell} \alpha_i y_i = 0 \qquad (3.7)$$

Substituting the results back into $L_P$ we have:
$$L_D(A) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2}\|w\|^2 = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i\, y_i x_i' x_j y_j\, \alpha_j \qquad (3.8)$$
$$= A'\mathbf{1} - \frac{1}{2} A' D A \qquad (3.9)$$

where $D$ is a symmetric matrix, $D_{ij} = y_i x_i' x_j y_j$, $\mathbf{1}$ is the vector of ones of size $\ell$, and $A = [\alpha_1 \ldots \alpha_\ell]'$ is the vector of Lagrange multipliers. Now $L_D(A)$ has to be maximized with respect to the Lagrange multipliers. This form is called the Wolfe dual of the primal (3.6), with the constraints
$$A \geq 0; \qquad A'Y = 0$$

where $Y = [y_1 \ldots y_\ell]'$ is the label vector. Solving the above problem, the normal vector of the separating hyperplane is calculated by substituting the result into (3.7):
$$w^* = \sum_{i=1}^{\ell} \alpha_i^*\, y_i x_i$$
The optimal offset $b^*$ is calculated from the equation $w'x_i + b = y_i$,
$$b^* = y_i - (w^*)'x_i$$

for an arbitrary support vector index $i$; in practice, for a numerically stable solution, we take the algebraic mean of the $b$ values computed from all the support vectors. The decision function for a new point $x$ becomes
$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{\ell} y_i \alpha_i^*\, x'x_i + b^* \right) \qquad (3.10)$$

Page 46: Semi-supervised Learning with Kernelszbodo/thesis/thesis.pdf · Acknowledgements First of all I would like to thank my advisor, Zoltán Kása, for giving me the opportunity to continue

32 3.3. CLASSIFICATION WITH SUPPORT VECTOR MACHINES

3.3.2 Soft margin SVMs

For noisy settings where the data is not separable, we introduce additional slack variables $\xi_i$, which allow some points to be classified incorrectly. The optimization problem changes to
$$w'x_i + b \geq +1 - \xi_i, \quad \text{if } y_i = 1$$
$$w'x_i + b \leq -1 + \xi_i, \quad \text{if } y_i = -1$$

where $\xi_i \geq 0$, $\forall i$. For an error, i.e. an incorrect classification, to occur, $\xi_i$ must exceed 1, which means that $\sum_i \xi_i$ is an upper bound on the number of training errors. Therefore we introduce the penalty term $C\big(\sum_i \xi_i\big)^k$ into the minimization problem, with a regularization parameter $C$ which has to be set carefully: if $C$ is too large overfitting can occur, while a small $C$ can lead to underfitting. For $k = 1$ and $k = 2$ the optimization problem remains convex, but for $k = 1$ neither the Lagrange multipliers nor the $\xi_i$ slack variables appear in the dual; hence we discuss the case $k = 1$. Thus the optimization task can be written as
$$\min F(w, b, \xi) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{\ell} \xi_i \quad \text{such that} \quad y_i(w'x_i + b) \geq 1 - \xi_i, \; \xi_i \geq 0, \; i = 1, \ldots, \ell$$

Following the same derivation as in the case of hard margin SVMs, we obtain the following:
$$\max L_D(A) = A'\mathbf{1} - \frac{1}{2} A' D A \qquad (3.11)$$
$$\text{such that} \quad 0 \leq \alpha_i \leq C, \; i = 1, \ldots, \ell, \quad \text{and} \quad A'Y = 0$$

Similarly to the hard margin variant, we call the points with $\alpha_i > 0$ support vectors: if $\xi_i = 0$, the support vector lies on the marginal hyperplane; if $0 < \xi_i < 1$, the point lies inside the margin but no training error occurred; otherwise, if $\xi_i \geq 1$, the point is misclassified. The points for which $\alpha_i = 0$ are not important for the classification.

3.3.3 Kernelization

Suppose that the data is not linearly separable, but it shows a "clear" low-density region where the decision boundary should lie. In this case we proceed by taking a map $\phi : X \to H$ and mapping the points to the high-dimensional Hilbert space $H$, which is also called the feature space. We expect to generate features such that the data becomes linearly separable in that space. One can see that the independent variables appear only in dot products: in the matrix $D$ – in equations (3.9) and (3.11) –, in the decision function (3.10), and clearly when computing the optimal offset $b^*$ of the hyperplane. So again, as we saw in Section 3.1, we choose a kernel function $k(\cdot, \cdot)$ and replace the occurrences of $x_i'x_j$ with $k(x_i, x_j)$. Thus, in order to handle the non-linear case, we need to perform the following modifications:

$$D_{ij} = y_i y_j\, \phi(x_i)'\phi(x_j) = y_i y_j\, k(x_i, x_j)$$
whereas the decision function and the offset become
$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{\ell} y_i \alpha_i^*\, k(x, x_i) + b^* \right)$$
$$b^* = y_i - (w^*)'\phi(x_i) = y_i - \sum_{j=1}^{\ell} y_j \alpha_j^*\, k(x_j, x_i)$$
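In practice one rarely solves the dual by hand. The sketch below is an assumption of this text, not the implementation used in the thesis: it uses scikit-learn's SVC with a precomputed Gram matrix to illustrate how an arbitrary kernel matrix is plugged into a soft margin SVM. The kernel matrix passed at prediction time must contain kernel values between the test and the training points.

import numpy as np
from sklearn.svm import SVC

def rbf_gram(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y_train = np.array([-1] * 20 + [1] * 20)
X_test = rng.normal(0, 2, (5, 2))

svm = SVC(C=1.0, kernel='precomputed')          # soft margin SVM on a Gram matrix
svm.fit(rbf_gram(X_train, X_train), y_train)
y_pred = svm.predict(rbf_gram(X_test, X_train))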

Figure 3.3: Training and testing with SVMs on the "two-moons" data set: (a) linear kernel, (b) inhomogeneous polynomial kernel, $a = 0.5$, $r = 1$, $k = 3$, (c) RBF kernel, $\sigma = \sqrt{5}$, (d) sigmoid kernel, $a = 0.1$, $b = -8.5$.

In Figure 3.3, training and testing are illustrated on the "two-moons" data set using different SVMs, i.e. different kernels. The circles denote the examples of the positive class, while the points denoted by "x" constitute the negative class; the triangles represent the test points. The points were colored during the testing phase: the red ones denote the points classified as positive, the blue ones the points classified as negative. The hyperplanes are represented by the three curves: the black one is the separating hyperplane, while the gray curves are the marginal hyperplanes. The support vectors are the points marked with gray squares.

Figure 3.4: Ambiguous classification. The new point is put into the class whose boundary is the farthest. The corresponding class region is drawn with a continuous line.

3.3.4 Classification with multiple classes

SVMs are binary classifiers, but there also exist multi-class and binarization approaches. By binarization we mean the transformation of the problem into simpler binary classification problems which can be solved using the original formulation.

Let $K$ denote the number of classes. The first approach we present is the binarization technique called one-against-rest, sometimes wrongly called the one-against-all method, shortly 1-vs-rest. Using this technique one has to build $K$ binary classifiers: going through the classes from 1 to $K$, we take the points of the current class as positives, and the remaining points form the negative class. At classification the point is assigned to the class which provides the best, most probable or most confident classification. Using SVMs, the value on which the decision is based is the distance of the point from the separating hyperplane, because the farther the point is from the hyperplane, the more probable it is that the classification is correct. This is shown in Figure 3.4.

Another binarization method consists of building $K(K-1)/2$ binary classifiers, that is, one classifier for every pair of classes – the order is not important – and we train the classifier for every such pair such that the points of one class form the positive class and the points of the other class make up the negative class; the order is again unimportant. This is called the one-against-one or round robin binarization, or shortly 1-vs-1. At classification we apply a voting scheme, which we call MaxWin: we test the point with each classifier, and the "winner" class will be the one with the maximum number of votes/wins. At first glance the complexity of this method seems higher than that of 1-vs-rest, but this turns out to be a spurious conclusion. Suppose that the training data is uniformly distributed among the classes, that is, $\ell_i = \ell/K$, where $\ell_i$ denotes the number of points in the $i$th class. Then the complexity of the 1-vs-1 approach is $\frac{K(K-1)}{2} \cdot \frac{2\ell}{K} = (K-1)\ell$, while the complexity of 1-vs-rest is $K\ell$. For a detailed analysis see [Fürnkranz, 2002].

label      codeword
politics   0 1 1 0 1 1 0 0 0 1
sport      0 0 0 1 1 1 1 1 0 0
business   1 0 1 0 1 0 1 1 0 1
arts       1 0 0 0 0 1 1 0 1 0

Figure 3.5: Codewords assigned to categories. Example taken from [Berger, 1999].

A third technique we present uses error correcting output codes (ECOC) [Morelos-Zaragoza, 2002]. ECOC is used in information theory for error correction, when some kind of error recovery is needed because the data is transmitted through a noisy channel. In machine learning we use ECOC as a robust binarization technique too. The easiest way to illustrate ECOC-based binarization is through an example, shown in Figure 3.5. Suppose we have 4 classes: politics, sport, business and arts. To each class we assign a unique $b$-bit code, which means that $2^b \geq K$ must hold, otherwise we would not be able to use $K$ different codes. After determining the codewords we build $b$ classifiers, one for each column of the code matrix, and train them taking the points of the classes where a 1 appears as positives and the other points as negatives. When classifying a new point, each classifier is "asked" in turn. In this case we want to obtain zeros and ones, so the output $y$ of the SVM is transformed by $(y+1)/2$. Then the whole codeword is built from the answers, and we assign the class whose codeword is closest to this codeword in Hamming distance. The ECOC-based binarization is robust, since it has the ability to correct some errors made during classification. If $\Delta_{\min}(C)$ denotes the minimum distance between two codewords, where $C$ is the code matrix, then the ECOC-based classifier can recover from, or tolerate, $\lfloor (\Delta_{\min}(C) - 1)/2 \rfloor$ errors, i.e. misclassifications. It is easy to observe that 1-vs-rest is a special case of ECOC, when $C$ is a $K \times K$ identity matrix. Berger [1999] proved that if $b$ is large enough, a good solution is to assign the codewords randomly, which results in a well-separable code matrix.

Figure 3.6: DAG-based evaluation for multi-class settings. A DAG for $K = 3$ is shown. Near the vertices the lists attached to the nodes are shown.
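A short NumPy sketch of the ECOC decoding step, using the code matrix of Figure 3.5, is shown below; the bit vector fed to the decoder stands for the (already thresholded) answers of the b = 10 binary classifiers and is made up for illustration only.

import numpy as np

C = np.array([[0, 1, 1, 0, 1, 1, 0, 0, 0, 1],   # politics
              [0, 0, 0, 1, 1, 1, 1, 1, 0, 0],   # sport
              [1, 0, 1, 0, 1, 0, 1, 1, 0, 1],   # business
              [1, 0, 0, 0, 0, 1, 1, 0, 1, 0]])  # arts
classes = ['politics', 'sport', 'business', 'arts']

def ecoc_decode(bits):
    # assign the class whose codeword is closest in Hamming distance
    hamming = (C != bits).sum(axis=1)
    return classes[int(np.argmin(hamming))]

# answers of the 10 classifiers, with two bits flipped by classification errors
print(ecoc_decode(np.array([0, 1, 1, 0, 1, 1, 0, 1, 1, 1])))   # still 'politics'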

Further code matrix generation methods can be found in [Dietterich and Bakiri, 1995].

Another solution for handling multi-class settings with SVMs is called the DAGSVM (Directed Acyclic Graph SVM) and was proposed in [Platt et al., 1999]. Similarly to the 1-vs-1 approach, $K(K-1)/2$ classifiers are built, one for every class pair, but testing – i.e. predicting the class of an unseen example – requires only $K - 1$ tests. A $(K-1)$-ary tree is built with $K(K-1)/2$ inner nodes and $K$ leaves. At prediction, starting from the root, a decision is made at every node – we decide which class the point does not belong to – thus its label is determinable in $K - 1$ steps. This can also be interpreted as working with an ordered list, where every transition removes a class from the list, and the decision is made between the first and the last class of the list. In Figure 3.6 such an example is shown for 3 classes.

Multi-class learning using SVMs can also be formulated such that the training error and the regularization term are minimized for the whole problem at once [Weston and Watkins, 1999]. This approach resembles 1-vs-rest, since $K$ classifiers are searched for here too, which together separate the data correctly. Actually this will be a category ranking algorithm [Crammer and Singer, 2002]; we simply constrain the output of the correct class/classifier to be greater than the decision values of the other classes – with a margin of 2. Formally, the optimization task is now the following:

$$\min \; \frac{1}{2} \sum_{m=1}^{K} \|w_m\|^2 + C \sum_{i=1}^{\ell} \sum_{m \neq y_i} \xi_i^m$$
such that
$$w_{y_i}' x_i + b_{y_i} \geq w_m' x_i + b_m + 2 - \xi_i^m, \qquad \xi_i^m \geq 0, \quad i = 1, \ldots, \ell, \; m \in \{1, 2, \ldots, K\} \setminus \{y_i\}$$

The decision function now becomes
$$f(x) = \operatorname*{argmax}_{m = 1, 2, \ldots, K} \big( w_m' x + b_m \big)$$

For multi-class learning solutions using SVMs see also the works of Schölkopf and Smola [2002] and wei Hsu and jen Lin [2001].

3.4 Dimensionality reduction with principal components

In this section we briefly present principal component analysis and its kernelized version, used for reducing the dimensionality of the input space. These methods will be used later in the thesis.

For every kernel construction method proposed in this thesis we use SVMs for learning. Since most data sets are for binary classification, i.e. have two classes, most of the time the original formulation of soft margin SVM learning is used. However, for text categorization, where the number of classes is usually large, we use soft margin SVMs with one-against-one binarization.

3.4.1 Principal Component Analysis

Principal component analysis (PCA) [Jolliffe, 2002; Bishop, 2006; Shlens, 2005] is a dimensionality reduction method that filters out the noise from the measured data. It re-expresses the data in another basis and reveals the hidden structure, assuming that the points are drawn from a Gaussian distribution. PCA works on the covariance matrix, assuming that the data is centered, that is, the mean of each dimension is equal to zero.

If we arrange the samples $x_i$ in a matrix $X$ – now as columns –
$$X = \begin{bmatrix} x_1 & x_2 & \ldots & x_N \end{bmatrix}$$
then the empirical covariance matrix will be $\frac{1}{N} XX'$¹. The diagonal elements are the variances, while the off-diagonal terms are the covariances. The covariance measures the redundancy, while the variance measures the variability of the data along the dimensions. The goal of PCA is to re-express (transform) the data such that:

(i) we minimize redundancy (covariance),

(ii) we maximize variance.

PCA finds those axes (eigenvectors) along which the variance is maximal. Dimensionality reduction can then be performed by deleting/neglecting dimensions with small variance.

Once again, the aim is to find a transformation matrix $P$, $PX = Z$, where $Z$ is the re-expression of $X$ satisfying (i) and (ii). To achieve this, the covariance matrix $C_Z$ should be diagonalized, because we want to maximize variance and minimize covariance; that is, PCA finds an orthonormal matrix $P$ such that $C_Z$ is diagonalized. The rows of $P$ constitute the principal components of $X$. $P$ is chosen as follows. We rewrite $C_Z$ by substituting $PX$,

$$C_Z = \frac{1}{N} ZZ' = \frac{1}{N} P\,XX'\,P'$$

Since $XX'$ is symmetric, it can be rewritten as $XX' = U\Sigma U'$, where $\Sigma$ is a diagonal matrix and $U$ contains the eigenvectors of $XX'$ arranged in columns. If there are only $r \leq d$ orthonormal eigenvectors, then we simply select arbitrary orthonormal vectors to fill up the matrix of eigenvectors. Now we select $P$ to contain the eigenvectors of $XX'$ arranged in rows, that is, $P$ will be equal to $U'$. Therefore

$$C_Z = \frac{1}{N} P\,XX'\,P' = \frac{1}{N} P (P'\Sigma P) P' = \frac{1}{N} (PP')\,\Sigma\,(PP') = \frac{1}{N}\Sigma$$

¹We assume that the data is centered; if not, we always center the data before applying PCA.


Figure 3.7: PCA in two dimensions. The leftmost picture (a) shows the data with the eigenvectors of the covariance matrix, while the rightmost picture (b) shows the transformed data, $PX$. The data was generated along the line $y = x$, adding some Gaussian noise to the $y$ values. The first eigenvector's direction is denoted by the continuous line, while the second one is represented by the dashed line.

Dimensionality reduction using PCA is then performed by selecting only the first $k$ eigenvectors (arranged in decreasing order of the eigenvalues), assuming that small variances represent only noise in the data:
$$Z = \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_k \end{bmatrix} \cdot \begin{bmatrix} x_1 & x_2 & \ldots & x_N \end{bmatrix}$$

Figure 3.7 shows the eigenvectors obtained by performing PCA on a two-dimensional data set. Observe that the direction of the first eigenvector corresponds to the direction with maximum variance.

PCA can also be performed using the Singular Value Decomposition (SVD). Suppose that the matrix $X$ can be factorized as $X = USV'$, from which $XX' = US^2U'$. Then we obtain the low-dimensional representation provided by PCA if we multiply $X$ from the left by the matrix containing the desired number of eigenvectors,
$$U_k' X = U_k' U S V' = I_k S V' = S_k V' = S_k V_k'$$
where $A_k$ denotes the matrix formed using only the first $k$ columns of $A$; the rest of the matrix is filled with zeros.
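A compact NumPy sketch of PCA through the SVD, following the column-wise data layout of this section, is given below; the toy data roughly mimics Figure 3.7 and is an assumption made for illustration.

import numpy as np

def pca_svd(X, k):
    # X: (d, N) data matrix with one sample per column; returns S_k V_k' of shape (k, N)
    X = X - X.mean(axis=1, keepdims=True)        # center every dimension
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return s[:k, None] * Vt[:k]                  # equals U_k' X

rng = np.random.default_rng(0)
t = rng.uniform(-10, 10, 200)
X = np.vstack([t, t + rng.normal(0, 1, 200)])    # noisy points along y = x
Z = pca_svd(X, k=1)                              # one-dimensional representation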


Figure 3.8: Illustration for KPCA. The illustrations show the first two eigenvectors obtained using KPCA; more precisely, the contour lines shown are perpendicular to the eigenvectors in the feature space: (a), (b) linear kernel, which corresponds to PCA; (c), (d) RBF kernel.

3.4.2 Kernel Principal Component Analysis

Kernel principal component analysis (KPCA) [Schölkopf et al., 1996, 1999] is the kernelized version of PCA, that is, we perform PCA in the feature space. If the simple features – or the input space – provide only an inadequate description of the data, we use feature space transformations through the use of kernels.

In PCA we solve the eigenproblem
$$\lambda v = C_X v$$
where $\lambda > 0$, $v \in \mathbb{R}^d$, and $\|v\| = 1$. Using the definition of the covariance matrix we can write

$$\lambda v = \frac{1}{N} \sum_{i=1}^{N} (x_i x_i')\, v = \frac{1}{N} \sum_{i=1}^{N} (x_i' v)\, x_i \qquad (3.12)$$

That is, the solution vectors $v$ lie in the span of the vectors $x_i$, $v = \sum_{i=1}^{N} \beta_i x_i$. Mapping the independent variables to the feature space, we have the covariance matrix $C_\phi = \frac{1}{N} \sum_{i=1}^{N} \phi(x_i)\phi(x_i)'$, where we assumed that the data is centered in the feature space.


Now we have to solve the following eigenproblem:
$$\lambda v = C_\phi v$$
Using equation (3.12) we arrive at the dual eigenproblem [Schölkopf et al., 1999],
$$N\lambda\, \beta = K\beta$$
To obtain eigenvectors of length 1, $\beta_i$ has to be multiplied by $1/\sqrt{N\lambda_i}$, provided that $\|\beta_i\| = 1$. To obtain the principal components, every point is projected onto the chosen $k$ eigenvectors, that is

$$v_j'\,\phi(x) = \sum_{i=1}^{N} \beta_{ji}\, \phi(x_i)'\phi(x) = \sum_{i=1}^{N} \beta_{ji}\, k(x_i, x), \qquad j = 1, \ldots, k$$

In order to obtain centered data in the feature space, instead of $K$ we use the centered kernel matrix
$$\left( I - \frac{1}{N}\mathbf{1}\mathbf{1}' \right) K \left( I - \frac{1}{N}\mathbf{1}\mathbf{1}' \right)$$

To write the KPCA transformation in a more compact form, we notice that the projection onto the eigenvectors can be written as $V_k'K$, where $V_k$ denotes the matrix of eigenvectors, keeping only the first $k$ of them. Thus, since $K = VSV'$, the new representation of the points becomes
$$V_k' K = V_k' V S V' = S_k V_k'$$
placed in the columns of the resulting matrix.

Figure 3.8 shows KPCA applied to a noisy sinus-like data set. The contour lines are perpendicular to the eigenvectors.
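The following NumPy sketch performs the KPCA projection of the training points from a given Gram matrix, using the centering and the $1/\sqrt{N\lambda}$ normalization described above; the toy data only imitates the setting of Figure 3.8 and is an assumption for illustration.

import numpy as np

def kpca(K, k):
    # K: (N, N) kernel matrix of the training data; returns the k projections, shape (k, N)
    N = K.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N          # I - (1/N) 1 1'
    Kc = J @ K @ J                               # centered kernel matrix
    lam, V = np.linalg.eigh(Kc)                  # eigenvalues of Kc equal N*lambda
    idx = np.argsort(lam)[::-1][:k]              # keep the k largest
    B = V[:, idx] / np.sqrt(lam[idx])            # beta_j scaled by 1/sqrt(N lambda_j)
    return B.T @ Kc                              # rows: sum_i beta_ji k(x_i, .)

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
P = np.column_stack([x, np.sin(x) + 0.2 * rng.normal(size=100)])
sq = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)
Z = kpca(np.exp(-sq / 2.0), k=2)                 # first two kernel principal components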


Chapter 4

Data-dependent kernels

In supervised learning one has to find a decision function separating the examples into classes. Semi-supervised learners, however, also use a set of unlabeled examples, and try to benefit simultaneously from the large amount of training data and the usually smaller amount of label information. Data-dependent kernels are somewhat similar to semi-supervised learning machines: the kernel function no longer depends solely on the two points in question, but in some form makes use of the entire data, the information contained in the whole available learning set. That is, $k(x, z)$ with data set $D_1$ returns a value not necessarily equal to $k(x, z)$ with data set $D_2$, although the kernel function – or preferably the kernel construction method – is the same. This can be formalized as
$$k(x, z; D_1) \asymp k(x, z; D_2)$$
provided that the data sets are different, i.e. $D_1 \neq D_2$, where "$\asymp$" means "not necessarily equal" and ";" stands for conditioning.

Learning with data-dependent kernels is considered semi-supervised learning. These kernels belong to the class of semi-supervised learning with change of representation (see Section 2.3.4), because here we construct a kernel – or equivalently a distance – thus giving the points a new representation. Then, if a supervised learning algorithm is used, we arrive at the general scheme of SSL methods with change of representation described in Section 2.3.4.

Only a few of the existing data-dependent kernels were selected to be presented in this chapter, roughly speaking those which are, to some extent, connected to our data-dependent kernel construction methods. In our experiments we use these semi-supervised kernels and compare them against the kernels we propose.

4.1 The ISOMAP kernel

The ISOMAP (ISOmetric feature MAPping) algorithm [Tenenbaum et al., 2000; Ham et al., 2004] was the first method that used graph distances for approximating the true distances on the underlying manifold, assuming that such a manifold exists. The ISOMAP kernel is defined as
$$K_{\text{isomap}} = -\tfrac{1}{2}\, J G^{(2)} J \qquad (4.1)$$
where $G^{(2)}$ contains the squared graph distances (shortest paths) and $J$ is the centering matrix, $J = I - \frac{1}{N}\mathbf{1}\mathbf{1}'$, $I$ is the identity matrix and $\mathbf{1}$ is the $N \times 1$ vector of 1's. The graph distances are computed by first calculating the Euclidean distances between the points, and then considering an ε-neighborhood or the k nearest neighbors of the points; next, the shortest path distances are calculated, for example by Dijkstra's algorithm [Cormen et al., 2001]. If the points were centered along each dimension, then the ISOMAP kernel from (4.1) would be positive semi-definite. Unfortunately we are not able to center the data. But – analogously to PCA – because only the large eigenvalues and the corresponding eigenvectors are important, we proceed in the following way. The kernel $K_{\text{isomap}}$ can be factorized as $USU'$, where $U$ contains the eigenvectors and the diagonal matrix $S$ holds the eigenvalues of the decomposed matrix [Golub and Van Loan, 1996]. Then the ISOMAP kernel we use is $\tilde{K}_{\text{isomap}} = U\tilde{S}U'$, where $\tilde{S}$ is the truncated diagonal matrix of eigenvalues:
$$\tilde{S}_{ii} = \begin{cases} S_{ii}, & \text{if } S_{ii} \geq 0 \\ 0, & \text{otherwise} \end{cases}$$
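A NumPy sketch of this construction is given below; it uses a kNN graph and a simple Floyd–Warshall pass for the shortest paths (Dijkstra's algorithm, as mentioned above, would be the faster choice for large N), and it assumes the resulting neighborhood graph is connected.

import numpy as np

def isomap_kernel(X, k=7):
    N = X.shape[0]
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # Euclidean distances
    G = np.full((N, N), np.inf)                  # kNN graph, non-edges are infinite
    nn = np.argsort(D, axis=1)[:, 1:k + 1]
    for i in range(N):
        G[i, nn[i]] = D[i, nn[i]]
    G = np.minimum(G, G.T)                       # symmetrize the neighborhood graph
    np.fill_diagonal(G, 0.0)
    for m in range(N):                           # Floyd-Warshall shortest paths
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    J = np.eye(N) - np.ones((N, N)) / N          # centering matrix
    K = -0.5 * J @ (G ** 2) @ J                  # equation (4.1)
    S, U = np.linalg.eigh(K)
    S = np.clip(S, 0.0, None)                    # truncate the negative eigenvalues
    return (U * S) @ U.T                         # U S~ U'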

Informally, the ISOMAP kernel maps the points to the space where their pairwise distances are equal to the shortest path distances on the data graph in the input space. If the points are centered along each dimension, then $-\frac{1}{2} J G^{(2)} J$ is equal to the matrix of dot products of the vectors mapped to the above-mentioned space [Borg and Groenen, 2005]. That is, we want to minimize
$$\| \tau(G) - \tau(D_{\text{eucl}}) \|_F^2$$
where $D_{\text{eucl}}$ denotes the matrix of Euclidean distances and the operator $\tau(\cdot)$ transforms distances to dot products, $\tau(G) = -\frac{1}{2} J G^{(2)} J$.


Figure 4.1: The input and the result of the ISOMAP algorithm: (a) the "swiss-roll" (three-dimensional) data set and (b) its two-dimensional embedding ($k = 7$). The emphasized points show how distances on the manifold are kept.

Denote by $B$ the dot product matrix, $B = XX'$, where the points are put in the rows of $X$. Then for the squared distances we can write
$$G^{(2)} = \operatorname{diag}(B)\,\mathbf{1}' + \mathbf{1}\,\operatorname{diag}(B)' - 2B$$
By multiplying both sides of the equation by the centering matrix $J$ and by the scalar $-1/2$ we obtain
$$-\frac{1}{2} J G^{(2)} J = -\frac{1}{2} J \operatorname{diag}(B)\,\mathbf{1}' J - \frac{1}{2} J \mathbf{1}\,\operatorname{diag}(B)' J + J B J = J B J$$
because centering the vector $\mathbf{1}$ results in the zero vector $\mathbf{0}$. Hence if $X$ is column centered, that is, the data points are centered along each dimension, then following the above procedure we obtain the desired matrix of dot products.

Figure 4.1 shows an example data set – where the manifold assumption clearly holds – and its two-dimensional embedding resulting from the ISOMAP algorithm¹. The "swiss-roll" can be "unraveled" into the two-dimensional representation shown in Figure 4.1(b), and the marked points show that the distances on the manifold are kept.

Here we presented only the ISOMAP kernel, but there are many dimensionality reduction techniques which can be used to construct data-dependent kernels. For example, consider the spectral methods of PCA and KPCA presented in Sections 3.4.1 and 3.4.2, or LLE from Section 2.3.4.

¹The MATLAB implementation of ISOMAP and the above data set can be downloaded from http://isomap.stanford.edu. The illustration in Figure 4.1(b) was created using this program.


4.2 The neighborhood kernel

The neighborhood kernel introduced in [Weston et al., 2004, 2006] is perhaps the simplest cluster kernel, and it is based on the semi-supervised cluster assumption. We call a kernel function or matrix a "cluster kernel" if it was constructed using the cluster assumption of semi-supervised learning. Let $N(x)$ denote the neighbors of $x$. This, again, could be obtained by considering an ε-neighborhood of the point or its k nearest neighbors according to some distance measure. The neighborhood kernel represents each point as the average of its neighbors, that is, $\phi_{\text{nbd}}(x) = \frac{1}{|N(x)|} \sum_{x' \in N(x)} \phi_b(x')$, and the kernel is
$$k_{\text{nbd}}(x, z) = \frac{\sum_{x' \in N(x),\, z' \in N(z)} k_b(x', z')}{|N(x)|\,|N(z)|}$$

It can be seen that this is a valid kernel, because it is a linear combination of other kernel values. Here we followed the notation used in [Weston et al., 2006], where $k_b(\cdot, \cdot)$ denotes some base kernel used between the neighboring points (for example the linear or the Gaussian kernel). If the dissimilarity measure used in determining the neighbors is a true distance measure, then the points of each cluster appearing in the data set will be contracted. Using matrix notation, the neighborhood kernel can be written in the following form:
$$K_{\text{nbd}} = (n\,n')^{(-1)} \odot D K_b D'$$
where $n$ is the neighborhood size vector, that is, $n_i$ is the size of the neighborhood of point $x_i$, $A^{(-1)}$ denotes the elementwise inverse of the matrix $A$, $\odot$ denotes the elementwise (Hadamard) product, $D$ is the neighborhood matrix defined as
$$D_{ij} = \begin{cases} 1, & \text{if } j \text{ is a neighbor of } i \\ 0, & \text{otherwise} \end{cases}$$
and $K_b$ is the base kernel matrix, $(K_b)_{ij} = k_b(x_i, x_j)$.
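In matrix form the computation is a few lines of NumPy; the sketch below assumes a binary neighborhood matrix built from the k nearest neighbors (whether a point counts as its own neighbor is a modeling choice, not prescribed by the text above).

import numpy as np

def knn_neighborhood_matrix(X, k=5):
    # D_ij = 1 if j is among the k nearest neighbors of i (the point itself included)
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d, axis=1)[:, :k]
    D = np.zeros_like(d)
    np.put_along_axis(D, idx, 1.0, axis=1)
    return D

def neighborhood_kernel(K_b, D):
    # K_nbd = (n n')^(-1) (elementwise product) D K_b D'
    n = D.sum(axis=1)                            # neighborhood sizes n_i
    return (D @ K_b @ D.T) / np.outer(n, n)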

4.3 The bagged cluster kernel

The bagged cluster kernel was proposed in [Weston et al., 2004]; it reweights the base kernel values by the probability that the points belong to the same cluster. To compute this probability, the bagged cluster kernel uses k-means clustering, exploiting the property that the choice of the initial cluster centers strongly affects the output of the algorithm. Here we give only a short description of the method, because in Chapter 7 we construct similar kernels and also give a more detailed presentation of this method.

The kernel is constructed as follows:

Algorithm 5 Bagged cluster kernel
1: Run k-means $t$ times, which results in the cluster assignments $c_j(x_i)$, $j = 1, \ldots, t$, $i = 1, \ldots, N$, $c_\cdot(\cdot) \in \{1, \ldots, K\}$.
2: Construct the bagged kernel in the following way:
$$k_{\text{bag}}(x, z) = \frac{\sum_{j=1}^{t} [c_j(x) = c_j(z)]}{t}$$

The final kernel used in [Weston et al., 2004, 2006] is some base kernel weighted by the bagged kernel,
$$k(x, z) = k_b(x, z) \cdot k_{\text{bag}}(x, z) \qquad (4.2)$$
that is, the base kernel values are multiplied by the probabilities that the points belong to the same cluster.
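The sketch below illustrates Algorithm 5 and equation (4.2) with a deliberately minimal k-means loop (any k-means implementation would do); the number of runs t and the number of clusters K are free parameters of this illustration.

import numpy as np

def kmeans_labels(X, K, n_iter=50, rng=None):
    # plain k-means that returns only the cluster assignments
    rng = np.random.default_rng() if rng is None else rng
    centers = X[rng.choice(len(X), K, replace=False)].astype(float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for c in range(K):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def bagged_cluster_kernel(X, K=2, t=20, seed=0):
    # fraction of runs in which two points fall into the same cluster
    rng = np.random.default_rng(seed)
    K_bag = np.zeros((len(X), len(X)))
    for _ in range(t):
        labels = kmeans_labels(X, K, rng=rng)
        K_bag += (labels[:, None] == labels[None, :])
    return K_bag / t

# final kernel of equation (4.2): elementwise product with a base kernel K_b
# K = K_b * bagged_cluster_kernel(X)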

4.4 Multi-type cluster kernel

Chapelle et al. [2002] develop a cluster kernel which combines several techniques such as spectral clustering, kernel PCA and random walks. The proposed cluster kernel is built following the steps shown in Algorithm 6.

Algorithm 6 Multi-type cluster kernel
1: Compute the RBF kernel and store it in the matrix $W$.
2: Symmetrically normalize $W$, that is, let $L = D^{-1/2} W D^{-1/2}$, where $D = \operatorname{diag}(W\mathbf{1})$, and compute its eigendecomposition, $L = U \Sigma U'$.
3: Determine a transfer function $\varphi(\cdot)$ for transforming the eigenvalues, $\tilde{\lambda}_i = \varphi(\lambda_i)$, and construct $\tilde{L} = U \tilde{\Sigma} U'$, where $\tilde{\Sigma}$ contains the transformed eigenvalues on the diagonal.
4: Let $\tilde{D}$ be a diagonal matrix with diagonal elements $\tilde{D}_{ii} = 1/\tilde{L}_{ii}$, and compute $\tilde{K} = \tilde{D}^{1/2} \tilde{L} \tilde{D}^{1/2}$.
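A direct NumPy transcription of Algorithm 6 is sketched below; the transfer function is passed as a parameter, so the linear, step, linear step and polynomial variants of the following subsections are obtained by changing only that argument (for the step-type transfer functions the diagonal of $\tilde{L}$ must stay nonzero). The example transfer functions at the end are illustrative choices, not prescribed values.

import numpy as np

def multitype_cluster_kernel(X, transfer, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))            # step 1: RBF kernel
    d = W.sum(axis=1)
    L = W / np.sqrt(np.outer(d, d))               # step 2: D^{-1/2} W D^{-1/2}
    lam, U = np.linalg.eigh(L)
    L_t = (U * transfer(lam)) @ U.T               # step 3: U phi(Sigma) U'
    D_t = 1.0 / np.diag(L_t)                      # step 4: D~_ii = 1 / L~_ii
    return np.sqrt(np.outer(D_t, D_t)) * L_t      # D~^{1/2} L~ D~^{1/2}

# example transfer functions (Sections 4.4.1-4.4.4):
linear_tf = lambda lam: lam                       # reproduces the RBF kernel
poly_tf = lambda lam: lam ** 3                    # polynomial / random walk, t = 3
step_tf = lambda lam: (lam >= 0.9).astype(float)  # step function, lambda_cut = 0.9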


The kernel type depends on the chosen transfer function $\varphi$, which modifies the eigenvalues of the Laplacian. We discuss here four types of transfer functions, as in [Chapelle et al., 2002].

4.4.1 Linear transfer function

The linear transfer function $\varphi(\lambda) = \lambda$ yields the RBF kernel, because in this case $\tilde{L} = L$ and $\tilde{D} = D$, thus $\tilde{K} = D^{1/2} D^{-1/2} W D^{-1/2} D^{1/2} = W$.

4.4.2 Step transfer function

In the following, $\lambda_i$ denote the eigenvalues of the matrix $L$ defined before. The step transfer function is defined as $\varphi(\lambda_i) = 1$ if $\lambda_i \geq \lambda_{\text{cut}}$ and 0 otherwise, where $\lambda_{\text{cut}}$ is a predetermined cut threshold for the eigenvalues. This results in the dot product kernel matrix of the points in the spectral clustering representation. Spectral clustering has several slightly different formulations; here we discuss the technique proposed in [Ng et al., 2002]; the algorithm can also be found in [von Luxburg, 2006]. Spectral clustering has the following steps:

Algorithm 7 Spectral clustering
1: Compute the RBF kernel matrix, $W_{ij} = \exp(-\|x_i - x_j\|^2/(2\sigma^2))$, and set $W_{ii} = 0$.
2: Compute the normalizing matrix $D = \operatorname{diag}(W\mathbf{1})$ and set $L = D^{-1/2} W D^{-1/2}$.
3: Find the first $k$ eigenvectors $v_1, v_2, \ldots, v_k$ of $L$ corresponding to the largest eigenvalues, and form the matrix $V = [v_1 \; v_2 \; \ldots \; v_k]$, $V \in \mathbb{R}^{N \times k}$.
4: Row-normalize $V$, that is, let $U_{ij} = V_{ij} / \big( \sum_j V_{ij}^2 \big)^{1/2}$.
5: Perform a simple clustering, e.g. k-means, on the new data representation, that is, on the rows of $U$.

In the ideal case, suppose there are $k$ clusters infinitely far apart from each other. In this case the first $k$ eigenvalues of $L$ will be 1, while the remaining eigenvalues are strictly less than 1. Now if we choose the first $k$ eigenvectors with eigenvalue 1, following the algorithm above, the new representation we obtain consists of $k$ orthogonal vectors, one for each cluster; that is, each point from one cluster will correspond to one such orthogonal vector, thus clustering becomes trivial. For a detailed discussion of spectral clustering see [Ng et al., 2002; von Luxburg, 2006; Bie, 2005].


In order to see that the multi-type cluster kernel with the step function is equivalent to the representation used in spectral clustering, calculate $\tilde{L}_{ii}$, which will be equal to

$$\tilde{L}_{ii} = \mathbf{U}_{i\cdot}\tilde{\boldsymbol{\Sigma}}\mathbf{U}_{i\cdot}' = \mathbf{U}_{i\cdot}\mathbf{U}_{i\cdot}' = \|\mathbf{U}_{i\cdot}\|^2$$

where $\tilde{\boldsymbol{\Sigma}} = \begin{bmatrix}\mathbf{I} & \mathbf{O}\\ \mathbf{O} & \mathbf{O}\end{bmatrix}$ and where $\mathbf{U}$ now denotes the matrix whose first $k$ columns are equal to the first $k$ eigenvectors, while the remaining columns are filled with zeros. Thus $\tilde{D}_{ii} = 1/\|\mathbf{U}_{i\cdot}\|^2$. We know that $\tilde{\mathbf{K}} = \tilde{\mathbf{D}}^{1/2}\mathbf{U}\tilde{\boldsymbol{\Sigma}}\mathbf{U}'\tilde{\mathbf{D}}^{1/2}$, hence we can write

$$\tilde{K}_{ij} = \bigl(\tilde{\mathbf{D}}^{1/2}\mathbf{U}\bigr)_{i\cdot}\tilde{\boldsymbol{\Sigma}}\bigl(\tilde{\mathbf{D}}^{1/2}\mathbf{U}\bigr)_{j\cdot}' = \bigl(\tilde{D}_{ii}^{1/2}\mathbf{U}_{i\cdot}\bigr)\tilde{\boldsymbol{\Sigma}}\bigl(\tilde{D}_{jj}^{1/2}\mathbf{U}_{j\cdot}\bigr)'$$

which is equivalent to the row-normalization step of the spectral clustering algorithm.

4.4.3 Linear step transfer function

The linear step transfer function simply cuts off the eigenvalues which are smaller than a predetermined threshold: $\varphi(\lambda_i) = \lambda_i$ if $\lambda_i \ge \lambda_{\mathrm{cut}}$ and $0$ otherwise. Without normalization, that is with $\mathbf{D} = \mathbf{I}$ and similarly $\tilde{\mathbf{D}} = \mathbf{I}$, the method would be equivalent to the data representation in kernel PCA space, since in that case we simply cut off the least significant directions to obtain a more efficient low-rank representation of $\mathbf{L}$.

4.4.4 Polynomial transfer function

The polynomial transfer function is defined as $\varphi(\lambda_i) = \lambda_i^t$, where $t \in \mathbb{N}$ or $t \in \mathbb{R}$ is a parameter. Thus the final kernel can be written as

$$\tilde{\mathbf{K}} = \tilde{\mathbf{D}}^{1/2}\mathbf{D}^{1/2}\bigl(\mathbf{D}^{-1}\mathbf{W}\bigr)^{t}\mathbf{D}^{-1/2}\tilde{\mathbf{D}}^{1/2} \qquad (4.3)$$

where $\mathbf{D}^{-1}\mathbf{W} = \mathbf{P}$ is the transition probability matrix, $P_{ij}$ being the probability of going from point $i$ to point $j$ – used for example in label propagation (see Section 2.3.3). This is called the random walk kernel, since (4.3) can be considered a symmetrized version of the transition probability matrix $\mathbf{P}$.

In [Bodó and Minier, 2008] we empirically compared supervised and semi-supervised k-nearest neighbor (kNN) classification methods on two data sets: the USPS


Figure 4.2: Semi-supervised learning using LapSVM: (a) RBF kernel, $\sigma^2 = 2.5$ (used also in the Laplacian), $\gamma_A = 0.05$, $\gamma_I = 0$; (b) the same setting now with $\gamma_I = 0.5$.

data set and a subset of the Reuters-21578 text categorization corpus (for details see Appendix A). For the semi-supervised kNN methods we used the ISOMAP kernel, the multi-type cluster kernel and our hierarchical cluster kernel, which will be presented in Chapter 6. We also tested label propagation (LP) on these data sets, since LP can be considered a semi-supervised kNN algorithm; here, however, the labels of the unlabeled points are influenced not only by the labeled points but also by the other unlabeled points.

4.5 Manifold regularization and data-dependent kernels for SSL using point cloud norms

Sindhwani et al. [2005] introduced a kernel which modifies the Reproducing Kernel Hilbert Space (RKHS) by modifying the inner product:

$$\langle f, g\rangle_{\tilde{\mathcal{H}}} = \langle f, g\rangle_{\mathcal{H}} + \langle Sf, Sg\rangle_{\mathcal{V}}$$

where $\mathcal{H}$ and $\tilde{\mathcal{H}}$ denote the original and the modified RKHS respectively, $\mathcal{V}$ is a linear space with a positive semi-definite inner product, and $S : \mathcal{H} \to \mathcal{V}$ is a bounded linear operator. In $\mathbb{R}^N$ the norm – called the point cloud norm – is defined as

$$\|Sf\|_{\mathcal{V}}^2 = \mathbf{f}'\mathbf{M}\mathbf{f}$$

where $S(f) = [f(\mathbf{x}_1)\ \ldots\ f(\mathbf{x}_N)]$, denoted by the $N$-dimensional vector $\mathbf{f}$, and $\mathbf{M}$ is a positive semi-definite matrix. The authors derive an exact form for the new


kernel, which is

$$\tilde{k}(\mathbf{x},\mathbf{z}) = k(\mathbf{x},\mathbf{z}) - \mathbf{k}_{\mathbf{x}}'(\mathbf{I} + \mathbf{M}\mathbf{K})^{-1}\mathbf{M}\mathbf{k}_{\mathbf{z}} \qquad (4.4)$$

where $\mathbf{k}_{\mathbf{x}} = [k(\mathbf{x},\mathbf{x}_1)\ \ldots\ k(\mathbf{x},\mathbf{x}_N)]'$. The key issue is the choice of the matrix $\mathbf{M}$, which in the case of semi-supervised learning should reflect the intrinsic geometry of the data. Therefore the authors choose the graph Laplacian, $\mathbf{L} = \mathbf{D} - \mathbf{W}$ (or a power $\mathbf{L}^t$ of the graph Laplacian), which imposes a smoothness condition on the function,

$$\mathbf{f}'\mathbf{L}\mathbf{f} = \frac{1}{2}\sum_{i,j=1}^{N} W_{ij}\bigl(f(\mathbf{x}_i) - f(\mathbf{x}_j)\bigr)^2$$

The above kernel is equivalent to manifold regularization [Belkin et al., 2006; Chapelle et al., 2006]. Consider the following optimization problem:

$$\min_f\ \frac{1}{\ell}\sum_{i=1}^{\ell} V(\mathbf{x}_i, y_i, f) + \gamma_A\|f\|_K^2 + \frac{\gamma_I}{(\ell + u)^2}\mathbf{f}'\mathbf{L}\mathbf{f}$$

where $V(\cdot,\cdot,\cdot)$ is an appropriate loss function. If we use the quadratic loss, we obtain Laplacian Regularized Least Squares (LapRLS), while using the hinge loss, Laplacian SVMs (LapSVM) are obtained; these are equivalent to their original (non-Laplacian) formulations using the kernel defined in (4.4).
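As an illustration of equation (4.4), the sketch below (our own, not the authors' code) deforms an RBF base kernel with a graph Laplacian built from a kNN similarity graph over the labeled and unlabeled points; the kNN graph construction, the $\gamma$ scaling of the Laplacian and all names are illustrative assumptions.

```python
import numpy as np

def rbf(X, sigma2=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma2))

def deformed_gram(X, sigma2=1.0, gamma=1.0, n_neighbors=5):
    """Gram matrix of the kernel (4.4) over the point cloud X (labeled + unlabeled),
    with M chosen as gamma times the graph Laplacian of a kNN similarity graph."""
    K = rbf(X, sigma2)                        # base kernel
    W = rbf(X, sigma2)                        # affinity graph weights
    np.fill_diagonal(W, 0.0)
    # keep only the n_neighbors largest weights in every row, then symmetrize
    drop = np.argsort(W, axis=1)[:, :-n_neighbors]
    np.put_along_axis(W, drop, 0.0, axis=1)
    W = np.maximum(W, W.T)
    M = gamma * (np.diag(W.sum(axis=1)) - W)  # graph Laplacian L = D - W, scaled
    N = len(X)
    # On the cloud itself the columns of K play the role of the vectors k_x,
    # so the deformed Gram matrix is K~ = K - K (I + M K)^{-1} M K.
    return K - K @ np.linalg.solve(np.eye(N) + M @ K, M @ K)
```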

Figure 4.2 shows the “two-moons” data set with only two labeled points and densely structured unlabeled points revealing the true geometry of the data. The classifications shown by the contour lines were performed using LapSVM.

In [Bodó and Minier, 2009] we proposed semi-supervised SVMs, especially Laplacian SVMs, for dimensionality reduction. For the complete method and the results obtained, see the referred article.


Chapter 5

Wikipedia-based kernels for text categorization

Text categorization is the problem of determining the true predefined categories of natural language documents based on some training examples. It constitutes a subdomain of information retrieval (IR); however, it is often wrongly classified as a natural language processing task. The best performing and widely used vector space model representation for documents creates extremely high-dimensional and sparse vectors; thus, in order to work efficiently in this representational space, dimensionality reduction techniques must be applied. Kernel methods, however, offer an elegant way to overcome dimensionality: they require only the kernel matrix, containing data similarities. We propose a data-dependent kernel construction method which is expected to yield better results for text categorization tasks.

In this chapter text categorization and related concepts will be presented in detail: document representation, feature selection, applied machine learning techniques, evaluation techniques and different string and document kernels. The last part of the chapter then describes the proposed data-dependent Wikipedia-based kernel for text categorization.

The techniques presented in this chapter are based on the articles [Minier et al., 2006], [Bodó et al., 2007] and [Minier et al., 2007].


5.1 Text categorization

Text categorization (TC) is the task of determining the categories of natural language documents, which in fact means the determination of the main concepts characterizing the document. The categories are predefined; our task is to determine one or more categories to which the document belongs. The problem can be formalized as a multi-class, multi-label classification: we search for a function

$$f : \mathcal{D}\times\mathcal{C} \to \{T, F\}$$

which best approximates the unknown true function observed through training examples; $\mathcal{D}$ represents the documents, $\mathcal{C}$ is the set of categories, while $T$ and $F$ mean “true” and “false”, respectively. The above function can also be written as $f : \mathcal{D} \to 2^{\mathcal{C}}$. One of the most successful applications of text categorization is spam filtering. Although spam filtering, being a binary single-label classification problem, has a smaller complexity than the general TC problem, the special tags or features and other techniques involved also make it much more complex. An intelligent email filtering system has to recognize obfuscated or scrambled words, has to filter out “word salads” that try to mislead statistical filters, not to mention the integrated optical character recognition (OCR) system against image-based spam messages. And these are only a few of the required features of such a system.

In this chapter the two terms “class” and “category”, and also the three terms “word”, “term” and “feature”, will be used interchangeably.

5.1.1 The bag-of-words representation

Although many representations were proposed over the years for TC [Baeza-Yates and Ribeiro-Neto, 1999], most of them adopted from IR, the simple but very efficient bag-of-words model became the most successful one. Information retrieval deals with natural language documents and searches a document database against a user-defined query, the objective being to retrieve the most relevant documents. In the bag-of-words model documents are represented as bags of words: a bag or multiset is a generalization of a set in which an element can appear more than once, the ordering of the elements being irrelevant. It can be implemented as a set where each element has an associated frequency, stored for example in a hash table. In the bag-of-words model the words are considered independent of each


Figure 5.1: The bag-of-words (or VSM) representation of some documents in the 3-dimensional space indexed by the terms music, sport and science.

other, which is truly an incorrect assumption for natural languages, because words are correlated. Still, this representation provides some of the best results. The independence assumption is used elsewhere too. An example is the Naive Bayes classifier family, which provides fast and accurate categorization for natural language documents. These classifiers are among the best performing methods in spam filtering [Zdziarski, 2005]. We also call the bag-of-words model the Vector Space Model (VSM); it was proposed in [Salton et al., 1975].

Suppose we parse a document corpus – the set of training documents – and collect the words in a set $W$. For efficiency reasons we represent documents as vectors; therefore we define an order between the words and store each document as a vector, where each dimension stores the number of occurrences of the corresponding word in the actual document (see Figure 5.1). Then the whole training corpus can be stored as a matrix, which is called the term×document matrix:

$$\mathbf{D} = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1 n_D}\\ \vdots & \vdots & & \vdots\\ w_{n_T 1} & w_{n_T 2} & \cdots & w_{n_T n_D} \end{bmatrix} \qquad (5.1)$$

where $w_{ij}$ is the weight or frequency of term $i$ in document $j$, while $n_T$ and $n_D$ ($n_D := \ell$) denote the number of terms and documents, respectively. We used the term “term” instead of “word”, because sometimes a stemming algorithm is applied to the words in $W$, resulting in word stubs rather than roots (etymons). The problem with the above matrix is its size: even in a relatively small corpus thousands of words may appear. The good news about $\mathbf{D}$ is that it is very sparse. Using the techniques from the subsequent sections we drastically decrease the size of $\mathbf{D}$ (along the term dimension) for the following two reasons:


Figure 5.2: Calculating document similarity: (a) dot product = cosine of the angle enclosed by the document vectors; (b) the cosine function on $[0;\pi/2]$.

(i) to speed up computation; (ii) to eliminate words which are irrelevant for categorization. The first reason is a direct consequence of the smaller matrix. Eliminating certain terms is also important, since their presence usually harms the performance of the categorization; such words are for example the stop-words (the, and, or, about, etc.) and words which can easily mislead the classifier.

For document vectors a better representation than pure word frequencies is to transform these frequencies such that every term obtains a different weight, and the weights also vary between documents. One of the most popular and efficient representations is the tfidf (term frequency × inverse document frequency) transformation [Baeza-Yates and Ribeiro-Neto, 1999],

$$\mathrm{tfidf}(w_{ij}) = w_{ij}\cdot\log\frac{n_D}{n_i}$$

where $w_{ij}$ denotes the frequency of word $i$ in document $j$ as in (5.1), $n_D$ is the size of the training corpus, while $n_i$ is the number of documents containing the $i$th word. It is efficient because the computation is linear in the number of terms, and it was empirically proven to be a very effective weighting scheme. The weighting is the inverse document frequency term, $\mathrm{idf}(i) = \log\frac{n_D}{n_i}$, which gives a higher weight to infrequent terms in the corpus [Baeza-Yates and Ribeiro-Neto, 1999]. The motivation for the idf factor is that terms used frequently throughout the corpus have no discriminative power in categorization.

Regarding the lengths of the documents, we do not want the representation to be influenced by document length, because we are not interested in the absolute frequencies but rather in the distribution of the words. Therefore we normalize the document vectors to unit length, $\|\mathbf{d}_i\| = 1$, where $\mathbf{d}_i$ denotes the $i$th column of $\mathbf{D}$, that is the $i$th document.
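The following small numpy sketch illustrates the pipeline described above on a toy term×document matrix: tfidf weighting, unit-length normalization of the document vectors, and the resulting dot product (cosine) kernel. The toy matrix and all names are our own illustrative assumptions.

```python
import numpy as np

# Toy term x document matrix D (rows = terms, columns = documents), as in (5.1).
D = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0],
              [1.0, 1.0, 0.0]])

n_T, n_D = D.shape
n_i = (D > 0).sum(axis=1)                     # number of documents containing term i
D_tfidf = D * np.log(n_D / n_i)[:, None]      # tfidf(w_ij) = w_ij * log(n_D / n_i)
D_tfidf /= np.linalg.norm(D_tfidf, axis=0)    # normalize every document to unit length
K = D_tfidf.T @ D_tfidf                       # VSM (cosine) kernel between documents
```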


The simple dot product kernel induced by the bag-of-words representation returns the cosine of the angle enclosed by the document vectors – provided that the vectors are normalized (see Figure 5.2). This is equivalent to lexical matching, counting the terms appearing in both documents. Now consider the following part of a dialogue from Stanley Kubrick’s famous last movie “Eyes Wide Shut”1:

. . .

– Now. . . get undressed.

– Get undressed?

– Remove your clothes.

. . .

Clearly, the expressions “get undressed” and “remove your clothes” have the same meaning, despite the fact that they contain no common words; therefore, in the bag-of-words representation they are orthogonal.

To overcome this problem several methods were proposed over the years. For a good survey of these techniques see [Cristianini et al., 2003]. In Section 5.2 we also give a detailed description of these representations.

5.1.2 Feature selection techniques in text categorization

Most ML problems deal with thousands of features, some of which represent only noise, while others are strongly correlated. Therefore, in order to handle this high dimensionality and build efficient ML systems, feature selection (dimensionality reduction) must take place. According to Guyon and Elisseeff [2003], dimensionality reduction methods can be classified as follows:

• Variable subset selection – One tries to select the best subset of the features.

– Wrappers – use an arbitrary ML technique, e.g. SVMs, together with a heuristic for choosing the best feature/variable set. These methods are often criticized for being “brute force”; however, they use efficient search strategies. We differentiate these methods by the direction of the selection process:

1http://www.script-o-rama.com/movie_scripts/e/eyes-wide-shut-script-transcript.html


∗ Forward selection – start from the empty set and incrementally build the feature set by selecting the best features.

∗ Backward elimination – start with the set containing all the features and in each step eliminate the weakest features.

– Embedded methods – these appear within the learning algorithm. A good example is SVMs with zero-norm minimization, turning the SVM into

$$\min_{\mathbf{w},b}\ \|\mathbf{w}\|_0^0 \quad \text{s.t.}\ y_i(\mathbf{w}'\mathbf{x}_i + b) \ge 1,\ i = 1,\ldots,\ell$$

where the zero-norm is defined in the following way:

$$\|\mathbf{w}\|_0^0 = |\{w_i \mid w_i \neq 0\}|$$

Thus learning the optimal large-margin separator and finding the best feature subset are performed simultaneously. For a detailed presentation of the problem and efficient solutions see [Weston et al., 2003].

– Filters – perhaps the most popular feature selection methods; they use some heuristic to select the best feature subset. The methods presented below (DIA association factor, DFT, IG, etc.) fall into this category.

• Feature construction – new features are constructed from the original features: (i) by clustering (k-means, spectral clustering, etc.), (ii) by matrix factorization (PCA/SVD).

In the following we present the most popular filter methods for feature selection in TC. All of these methods calculate the strength of association between a term and a category. To filter out irrelevant terms we calculate a global value for each term, in order to be able to rank the features.

The simplest and yet very effective method is document frequency thresholding (DFT). The words that appear too many times in the corpus, as well as the words that appear only a few times, are irrelevant for classification. Therefore we sort the words (terms) in decreasing/increasing order of their count and cut off the two tails of the ordering; that is, we determine two thresholds $t_1$ and $t_2$, and only the words with occurrence greater than $t_1$ and less than $t_2$ are taken into account.


The second simple feature selection method we describe here was applied in the Darmstadt Indexing Approach (DIA), which is also shown by its name: it is called the DIA association factor [Fuhr et al., 1991]. It is the conditional probability of class $c_j$ given the term $t_i$,

$$z(t_i, c_j) = p(c_j|t_i) = \frac{p(c_j, t_i)}{p(t_i)}$$

its estimate being

$$z(t_i, c_j) = \frac{A}{A + B}$$

where

$$\begin{array}{c|cc} & c_j & \bar{c}_j\\ \hline t_i & A & B\\ \bar{t}_i & C & D \end{array} \qquad (5.2)$$

The contingency table (5.2) shows the common occurrences of terms and classes, $c_j$ meaning the occurrence of class $c_j$ and $\bar{c}_j$ denoting all classes except $c_j$, and similarly for $t_i$ and $\bar{t}_i$.

The information gain (IG) term selection criterion is actually the mutual information (MI) borrowed from information theory [Cover and Thomas, 2006], and is defined between two random variables $X$ and $Y$ as [Sebastiani, 2002]:

$$I(X;Y) = H(X) - H(X|Y) = \sum_{x,y} p(x, y)\log\frac{p(x, y)}{p(x)p(y)}$$

where in our case $x \in \{t_i, \bar{t}_i\}$, $y \in \{c_j, \bar{c}_j\}$. It measures the reduction of uncertainty of a random variable ($X$) due to another random variable ($Y$). In TC we use the following form,

$$I(t_i, c_j) = -\sum_{c\in\{c_j,\bar{c}_j\}} p(c)\log p(c) + \sum_{t\in\{t_i,\bar{t}_i\},\ c\in\{c_j,\bar{c}_j\}} p(t)p(c|t)\log p(c|t)$$

where the estimates are

$$p(c_j) = \frac{|c_j|}{n_D}, \qquad p(t_i) = \frac{A + B}{n_D}, \qquad p(c_j|t_i) = \frac{p(c_j, t_i)}{p(t_i)} = \frac{A}{A + B}$$


A simpler method than information gain is the mutual information [Yang and Pedersen, 1997; Sebastiani, 2002], also called the pointwise mutual information, which measures how far two random variables (or densities) are from being independent, and is defined in the following way:

$$\mathrm{MI}(t_i, c_j) = \log\frac{p(t_i, c_j)}{p(t_i)p(c_j)} = \log\frac{A\cdot n_D}{(A + C)\cdot(A + B)}$$

where $A$, $B$ and $C$ are taken from the contingency table (5.2).

The $\chi^2$ term selection method [Yang and Pedersen, 1997; Sebastiani, 2002] is one of the best performing simple feature selection techniques in TC according to Yang and Pedersen [1997]. It is a hypothesis test borrowed from statistics, with the null hypothesis that the “events” $t_i$ and $c_j$ are independent. Let $N_{ij}$ denote the number of common occurrences of the events $t_i$ and $c_j$, $n_{ij}$ the expected number of common occurrences of $t_i$ and $c_j$, while $N_{i\cdot}$ and $N_{\cdot j}$ are the occurrences of $t_i$ and the occurrences of $c_j$, respectively. Then the $\chi^2$ statistic is defined as

$$\chi^2 = \sum_{i,j}\frac{(N_{ij} - n_{ij})^2}{n_{ij}} = n_D\cdot\sum_{i,j}\frac{(N_{ij} - N_{i\cdot}N_{\cdot j}/n_D)^2}{N_{i\cdot}N_{\cdot j}} \qquad (5.3)$$

The null hypothesis can be formalized as $P(N_{i\cdot}|N_{\cdot j}) = P(N_{i\cdot})$, where the $N$’s now denote events, from which it follows that

$$\frac{n_{ij}}{N_{\cdot j}} = \frac{N_{i\cdot}}{n_D}$$

and from this the expression of (5.3) follows. Using the estimates, the $\chi^2$-based scores are calculated as

$$\chi^2(t_i, c_j) = \frac{n_D(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$$

The four measures presented above (the DIA association factor, the information gain, the mutual information and the $\chi^2$ statistic) all calculate an “association”


Figure 5.3: The scheme of the segmentation-based feature selection for TC.

value between a term and a category. If there is only a loose connection between a term and the categories, the term is irrelevant and has no predictive power; therefore we choose terms with high values. First the “local” (per-category) scores have to be converted to “global” scores, that is, scores over all the categories. There are two commonly used methods, the average and the maximum scores, defined in the following way:

$$\mathrm{score}_{\mathrm{avg}}(t_i) = \sum_{j=1}^{K} p(c_j)\cdot\mathrm{score}(t_i, c_j)$$

$$\mathrm{score}_{\max}(t_i) = \max\{\mathrm{score}(t_i, c_j) \mid j = 1,\ldots,K\}$$

where $\mathrm{score} \in \{z, I, \mathrm{MI}, \chi^2\}$.
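As a worked illustration of the $\chi^2$ scores and of the two globalization schemes above, here is a minimal numpy sketch (ours, not from the cited works); it assumes binary term-occurrence and label-indicator matrices, and the helper names are illustrative.

```python
import numpy as np

def chi2_scores(X, Y):
    """X: n_D x n_T binary term-occurrence matrix, Y: n_D x K binary label matrix.
    Returns the K x n_T matrix of chi^2(t_i, c_j) scores from the contingency counts."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n_D = X.shape[0]
    A = Y.T @ X                    # term present,  class present
    B = (1 - Y).T @ X              # term present,  class absent
    C = Y.T @ (1 - X)              # term absent,   class present
    D = (1 - Y).T @ (1 - X)        # term absent,   class absent
    num = n_D * (A * D - C * B) ** 2
    den = (A + C) * (B + D) * (A + B) * (C + D)
    return num / np.maximum(den, 1e-12)

def globalize(scores, Y, how="max"):
    # Turn the per-category scores (K x n_T) into one global score per term.
    if how == "max":
        return scores.max(axis=0)
    p_c = np.asarray(Y, dtype=float).mean(axis=0)   # estimate of p(c_j)
    return (p_c[:, None] * scores).sum(axis=0)      # weighted average score
```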

In [Minier et al., 2006] we developed a novel feature selection technique for TC. It is a filtering technique like $\chi^2$, so we are able to select $k$ features from the whole term set of the corpus. However, $k$ is not set manually; it is determined by the system, thus no cross-validation is needed to set this parameter. The method is based on the idea that documents contain semantically coherent parts, but in each document only a few segments are important, i.e. related to the real topic of the category. The scheme of the algorithm is shown in Figure 5.3. We therefore segmented the documents of the training corpus and represented these segments as mini-documents using the bag-of-words representation. For the segmentation a sentence-level semantic similarity measure was used, based on WordNet2 [Miller et al., 1993] and introduced in [Corley et al., 2005]. For each category the resulting mini-document vectors were clustered using the k-means algorithm [Bishop, 2006]. We expect the clustering process to result in one large cluster and many smaller clusters, the larger

2We briefly present WordNet and related issues later in this chapter.


cluster containing the relevant segments of the documents belonging to the corresponding category. Thus for each category we determined the largest cluster, and the terms appearing in these clusters formed our term set. The results obtained show that the method is comparable with $\chi^2$ term selection; moreover, the macro-averaged precision–recall breakeven point (see its definition later in this chapter) can be increased using our method.

5.1.3 Machine learning in text categorization

In the eighties the dominant technique used for categorizing documents was knowledge engineering-based logical rule matching. The expert systems contained rules constructed manually by domain experts. Although these systems give good results on different classification tasks, problems arise when re-designing the system. This is called the knowledge acquisition bottleneck problem [Sebastiani, 2002]. In these cases the domain experts have to modify the system’s rule set, and it is not unusual that the entire system has to be built again.

In the nineties machine learning methods came into view. ML techniques gave better performance at a lower price: there was no need for domain experts and the systems became easily portable. The problem now became the optimal choice of the system parameters.

Almost every ML technique developed so far has been tested on the text categorization task, but not all gave good results. It is very important that the method is fast both at learning and at testing; consider for example the problem of spam filtering, where almost real-time learning and decision making is needed. The greatest problem we deal with in TC is dimensionality: it is not unusual that a document corpus containing thousands of documents results in document vectors with a dimensionality of the order of $10^5$. However, the document vectors are usually sparse, which can make some methods particularly useful for TC.

Sebastiani [2002] gives a good overview of TC and the machine learning techniques used in TC: the referred article presents the basics of text categorization, related problems, dimensionality reduction methods, machine learning algorithms and evaluation methods. Here we briefly describe the most popular ML techniques used for categorizing documents.

We already gave a short description of the Naive Bayes probabilistic text classifier in


Section 2.3.1. Although it uses the simplifying assumption that the words in a document are independent of each other, Naive Bayes is a fast method with good performance in TC.

Rocchio’s method is one of the earliest algorithms applied in IR (see [Sebastiani, 2002; Jackson and Moulinier, 2002; Tikk, 2007]). It provides a fast linear algorithm by constructing profiles or prototypes for each class, where an unseen example is classified by comparing it to the class profiles. Profile-based classification methods build a synthetic representation of each class, which in turn can be compared to the examples’ vectors using some similarity metric. In Rocchio’s method the profiles $\mathbf{w}_1,\mathbf{w}_2,\ldots,\mathbf{w}_K$ are built in the following way:

$$\mathbf{w}_i = \frac{\beta}{|c_i|}\sum_{\mathbf{d}_j\in c_i}\mathbf{d}_j - \frac{\gamma}{|\bar{c}_i|}\sum_{\mathbf{d}_j\in \bar{c}_i}\mathbf{d}_j$$

where $c_i$ denotes the $i$th class, $\bar{c}_i$ its complement, $|c_i|$ and $|\bar{c}_i|$ their sizes, and $(\beta, \gamma)$ are the parameters of the algorithm. The crucial task is to determine these parameters. Moschitti [2003] offers a heuristic determination by which Rocchio’s method can be tuned to approach the best performing (text) classifiers. The algorithm also needs some thresholds $t_1,\ldots,t_K$ for determining the class labels. The labels are determined as

$$f_i(\mathbf{d}) = \begin{cases} 1, & \mathbf{w}_i\cdot\mathbf{d} > t_i\\ 0, & \text{otherwise} \end{cases}$$

where $f_i$ is the decision or class assignment function for the $i$th class. Observe that if $\beta = 1$ and $\gamma = 0$ we obtain the algorithm presented in Section 3.1, that is, Rocchio’s method is a linear classifier that builds the class profiles in a batch setting; therefore it is called a batch method. On-line or incremental methods, however, build a classifier right after seeing the first training examples, and as the data is sequentially processed the classifier is refined. One well-known on-line learning method is the perceptron algorithm, which performs well on TC tasks, as do the Widrow–Hoff and Winnow algorithms. A brief overview of the above-mentioned methods can be found in [Sebastiani, 2002; Jackson and Moulinier, 2002].
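A minimal sketch of Rocchio profile construction and thresholded class assignment is given below (our own illustration, not code from the cited works); the default $\beta$ and $\gamma$ values and all names are illustrative assumptions, and every class is assumed to have both positive and negative documents.

```python
import numpy as np

def rocchio_profiles(D, labels, K, beta=16.0, gamma=4.0):
    """D: n_T x n_D matrix with documents in its columns; labels: length-n_D class ids.
    Builds one profile per class from its positive and negative documents."""
    profiles = np.zeros((K, D.shape[0]))
    for i in range(K):
        pos = D[:, labels == i]
        neg = D[:, labels != i]
        profiles[i] = beta * pos.mean(axis=1) - gamma * neg.mean(axis=1)
    return profiles

def rocchio_decide(profiles, d, thresholds):
    # f_i(d) = 1 if w_i . d > t_i (multi-label decision, one bit per class).
    return (profiles @ d > thresholds).astype(int)
```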

The k-nearest neighbor (kNN) method [Cover and Hart, 1967; Duda et al., 2001; Tikk, 2007] is an example-based classifier, which does not build an explicit decision function. Instead it assigns to an unlabeled example the class which is in majority


among its $k$ nearest neighbors, that is

$$\hat{f}(\mathbf{x}) = \operatorname*{argmax}_{c=1,\ldots,K}\ \sum_{\mathbf{z}\in N_k(\mathbf{x})} \mathrm{sim}(\mathbf{z},\mathbf{x})\,\delta(c, f(\mathbf{z}))$$

where $N_k(\mathbf{x})$ denotes the set of $k$ nearest neighbors of $\mathbf{x}$, the function $\mathrm{sim}(\cdot,\cdot)$ returns the similarity of two points and $\delta(\cdot,\cdot)$ is the Kronecker delta function. In soft kNN we average the labels of the surrounding points, thus the decision function becomes

$$\hat{\mathbf{y}} = \frac{1}{\sum_{\mathbf{z}\in N_k(\mathbf{x})} W_{\mathbf{z}\mathbf{x}}}\sum_{\mathbf{z}\in N_k(\mathbf{x})} W_{\mathbf{z}\mathbf{x}}\, f(\mathbf{z})$$

where $W_{\mathbf{z}\mathbf{x}}$ denotes the similarity between $\mathbf{z}$ and $\mathbf{x}$, and $\hat{\mathbf{y}}$ contains the class assignment values; the functions $f(\cdot)$ and $\hat{f}(\cdot)$ return the known and the predicted labels of the points. There is no actual training in kNN; the label is determined by running over the entire training set whenever a new example has to be labeled.
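The soft kNN decision function above can be sketched as follows (our own illustration); the RBF similarity used for $W_{\mathbf{z}\mathbf{x}}$ and the variable names are assumptions.

```python
import numpy as np

def soft_knn(X_train, Y_train, x, k=5, sigma2=1.0):
    """Y_train: n x K matrix of (one-hot or soft) label assignments f(z).
    Returns the similarity-weighted average of the labels of the k nearest neighbours."""
    d2 = ((X_train - x) ** 2).sum(axis=1)
    nn = np.argsort(d2)[:k]                    # indices of the k nearest neighbours
    w = np.exp(-d2[nn] / (2.0 * sigma2))       # W_zx, here an RBF similarity
    return (w[:, None] * Y_train[nn]).sum(axis=0) / w.sum()
```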

The Linear Least Squares Fit (LLSF) method [Yang and Chute, 1992, 1994] is a powerful (multi-label) regression-based classification method. The multi-label classification task can be viewed as determining a category×feature association matrix $\mathbf{W}$, where $W_{ij}$ represents how feature $j$ contributes to assigning category $i$ to an example. Hence we want to determine $\mathbf{W}$ such that

$$\underbrace{\mathbf{W}}_{K\times n_T}\ \underbrace{\mathbf{A}}_{n_T\times n_D} = \underbrace{\mathbf{B}}_{K\times n_D}$$

where $\mathbf{A}$ contains the training data, i.e. the document vectors in its columns, and $\mathbf{B}$ is a category assignment matrix containing values from $\{0, 1\}$ (or $\{-1, +1\}$); $K$ denotes the number of classes. Since this system does not always have a solution, the problem is rewritten as finding the $\mathbf{W}$ that minimizes

$$\|\mathbf{W}\mathbf{A} - \mathbf{B}\|_F^2 = \sum_{j=1}^{n_D}\|\mathbf{W}\mathbf{A}_{\cdot j} - \mathbf{B}_{\cdot j}\|_2^2$$

The problem can be solved in many ways, but usually one computes the Moore–Penrose pseudoinverse of $\mathbf{A}$ (using SVD) and expresses $\mathbf{W}$ as $\mathbf{W} = \mathbf{B}\mathbf{A}^{\dagger}$. Regularized linear regression can be used as well, where we minimize

$$\min_{\mathbf{W}}\ \|\mathbf{W}\mathbf{A} - \mathbf{B}\|_F^2 + \lambda\|\mathbf{W}\|_F^2$$


where $\lambda$ is the regularization parameter. In this case the solution can be expressed as

$$\mathbf{W} = \mathbf{B}\mathbf{A}'(\mathbf{A}\mathbf{A}' + \lambda\mathbf{I})^{-1}$$

For an unseen example, represented as a vector $\mathbf{x}$ of feature frequencies, the solution $\mathbf{y}$ is obtained by

$$\mathbf{y} = \mathbf{W}\mathbf{x}$$

where $\mathbf{y}$ contains the weights of the classes assigned to the unseen $\mathbf{x}$.
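A minimal numpy sketch of the regularized LLSF solution and prediction is shown below (ours, for illustration only); the matrix shapes follow the text, and all names are assumptions.

```python
import numpy as np

def llsf_fit(A, B, lam=1.0):
    # A: n_T x n_D training documents in columns, B: K x n_D category assignments.
    # Regularized solution W = B A' (A A' + lam I)^{-1}.
    n_T = A.shape[0]
    return B @ A.T @ np.linalg.inv(A @ A.T + lam * np.eye(n_T))

def llsf_predict(W, x):
    # y = W x: one weight per category for the unseen document vector x.
    return W @ x
```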

The most successful classification method in TC is learning with SVMs. Because of the sparseness of the document vectors, and due to the fact that only a few of the training examples are involved in determining the decision function, SVMs represent a powerful technique for categorizing textual data. Joachims [1997, 1998] enumerates the following arguments for using SVMs in text categorization:

• SVMs can handle large feature spaces.

• Although the dimension of the document vectors is high, they are sparse; therefore efficient techniques can be applied which overcome the dimensionality problem, even without the prior use of any feature selection method.

• Most text categorization data are linearly separable; hence linear SVMs provide remarkable results.

Classification with SVMs is presented in detail in Section 3.3.

5.1.4 Evaluation measures

Apart from designing good learning algorithms, one also needs to assess the performance of the methods, that is, evaluation measures have to be defined. The common measures of accuracy and error are usually sufficient to evaluate an ML system; there are cases, however, when these measures cannot truly reflect the performance of a method because of the particular properties of the data. This is illustrated by the following example. Suppose that the number of learning examples is $\ell = 10\,000$, the number of classes is $K = 100$, and the learning points are uniformly distributed among the classes, that is, every class has $\approx 100$ points. Consider now a classifier that always says


                             Expert judgement
    Category i              TRUE         FALSE
    Classifier   TRUE       TP_i         FP_i
    judgement    FALSE      FN_i         TN_i

Table 5.1: Contingency table. TP = true positives, FP = false positives, FN = false negatives, TN = true negatives.

“NO”, that is, none of the classes is assigned to any point. Using the notions defined in Table 5.1, we can calculate the accuracy of this classifier for every class $i$:

$$\mathrm{Acc}_i = \frac{TN_i}{TN_i + FN_i} = \frac{1}{1 + \frac{FN_i}{TN_i}} \approx \frac{1}{1 + \frac{1}{100}} \approx 0.99$$

that is, the classifier is 99% accurate, although it “rejected” every point for every class. This problem appears when the number of classes is large, while the average number of labels per point is relatively small compared to the number of classes. To avoid such problems, the following measures were introduced for evaluating IR systems.

Precision and recall

Precision and recall are the most common measures for evaluating a TC system, and they were adopted from IR. Precision is the proportion of returned documents that are targets, while recall is the proportion of target documents returned:

$$P = \frac{\#\ \text{retrieved and relevant}}{\#\ \text{retrieved}}, \qquad R = \frac{\#\ \text{retrieved and relevant}}{\#\ \text{relevant}}$$

Formally,

$$P_i = \frac{TP_i}{TP_i + FP_i}, \qquad R_i = \frac{TP_i}{TP_i + FN_i}$$

Accuracy and error can also be written using the quantities defined in Table 5.1:

$$\mathrm{Acc}_i = \frac{TP_i + TN_i}{TP_i + FP_i + FN_i + TN_i},$$


$$\mathrm{Err}_i = \frac{FP_i + FN_i}{TP_i + FP_i + FN_i + TN_i}$$

There are two conventional methods of calculating the performance of a text categorization system based on precision and recall. The first method is called micro-averaging, while the second is called macro-averaging. Micro-averaged values are calculated by constructing a global contingency table and then calculating precision and recall from these sums. In contrast, macro-averaged scores are calculated by first calculating precision and recall for each category and then taking the average of these. The notable difference between the two calculations is that micro-averaging gives equal weight to every document (called a document-pivoted measure), while macro-averaging gives equal weight to every category (category-pivoted measure):

$$P_{\mathrm{micro}} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)}; \qquad R_{\mathrm{micro}} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)}$$

$$P_{\mathrm{macro}} = \frac{1}{|C|}\sum_{i=1}^{|C|}\frac{TP_i}{TP_i + FP_i}; \qquad R_{\mathrm{macro}} = \frac{1}{|C|}\sum_{i=1}^{|C|}\frac{TP_i}{TP_i + FN_i}$$
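The micro- and macro-averaged precision and recall can be computed directly from the per-category contingency counts, as in the following small sketch (ours); the toy counts are purely illustrative numbers.

```python
import numpy as np

def micro_macro(tp, fp, fn):
    """tp, fp, fn: per-category contingency counts (one entry per category)."""
    p_micro = tp.sum() / (tp + fp).sum()
    r_micro = tp.sum() / (tp + fn).sum()
    p_macro = np.mean(tp / (tp + fp))
    r_macro = np.mean(tp / (tp + fn))
    return p_micro, r_micro, p_macro, r_macro

# toy counts for three categories
tp = np.array([90.0, 5.0, 10.0])
fp = np.array([10.0, 0.0, 40.0])
fn = np.array([10.0, 95.0, 5.0])
print(micro_macro(tp, fp, fn))
```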

Break-even point

The properties, or the expected behaviour, of text categorization / information retrieval systems can vary: for one system it is better to return mostly correct answers, while for another it is better to cover more true positives. There is a tradeoff between precision and recall: if a classifier says “True” to every category for every document, it achieves perfect recall but very low precision; conversely, by saying “False” to almost every category except a few correct “True” answers, it achieves a maximum precision of 1 but a very low recall. Hence precision and recall are somewhat complementary.

It is usually beneficial to have a single measure assessing the system performance. One such measure is the so-called break-even point (BEP), the point where precision equals recall; it can be reached by tuning the parameters of the system. Since TP, FP and FN are natural numbers, when there is no such point the average of the nearest precision and recall is used, called the interpolated BEP.

The E and F-measures


Figure 5.4: The graphical representation of the document sets: A denoting the set of relevant and B the set of retrieved documents.

Rijsbergen [1979] introduced two measures for evaluating information retrieval systems, the E-measure and the F-measure; these became two of the most frequently used measures.

The E-measure quantifies the degree of non-overlap between the retrieved and the relevant documents. Let $A$ and $B$ denote the set of relevant and the set of retrieved documents (see Figure 5.4), and let us denote the shaded region by $A\,\Delta\,B = A\cup B - A\cap B$. Then, by definition,

$$E = \frac{|A\,\Delta\,B|}{|A| + |B|}$$

where the denominator normalizes the value, because we are interested only in the proportion of the relevant and non-relevant documents. Moreover, from the contingency table (Table 5.1) we can write $|A| = TP + FN$ and $|B| = TP + FP$. Using these, the E-measure becomes

$$E = \frac{FN + FP}{2TP + FP + FN} = 1 - \frac{2TP}{2TP + FP + FN} = 1 - \frac{2}{\frac{TP+FP}{TP} + \frac{TP+FN}{TP}} = 1 - \frac{1}{\frac{1}{2}\frac{1}{P} + \frac{1}{2}\frac{1}{R}} = 1 - \frac{2PR}{P + R}$$

that is, one minus the harmonic mean of precision and recall. Thus the more efficient the system, the lower the corresponding E-value. The measure achieves its minimum at the highest values for which $P = R$, or $P \approx R$ if such a point does not exist, which is equal to the break-even point.

The F-measure is simply the complement of the E-measure, indicating the extent of overlap of the above sets, that is

$$F = \frac{2PR}{P + R} \qquad (5.4)$$

A more general version of these are the $E_\beta$ and $F_\beta$-measures, introducing a parameter $\beta \in [0,\infty)$ as a weighting factor for the importance of the recall (or precision)


[Rijsbergen, 1979; Lewis, 1995]:

$$E_\beta = 1 - \frac{(\beta^2 + 1)PR}{\beta^2 P + R}; \qquad F_\beta = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}$$

By setting $\beta = 1$ we obtain the E and F measures above, which is why the conventional notation uses $E_1$ and $F_1$; this is also the most frequently applied setting, yielding the F-measure defined in (5.4).

5.2 String and text kernels

Kernels can be viewed as similarity measures. Using the kernel trick, one can design a large variety of algorithms by using the same learning method and varying only the similarity measure between data points, i.e. the kernel function.

Previously we argued that the simple dot product (linear) kernel cannot capture synonymy, hyponymy and other relations between terms. In this section we present some more sophisticated kernels used in IR and TC: in the first part we describe string kernels used for lexical subsequence matching, while in the second part more sophisticated “semantic” kernels are introduced.

5.2.1 String kernels

String kernels were first used in computational biology, but they have also been applied successfully in text categorization, that is, for determining document similarities. String kernels measure the overlap between two strings by considering common subsequences in the strings. We briefly describe here the common string kernels, starting with a few definitions.

Let $\Sigma$ denote the string alphabet, $|\Sigma|$ its cardinality, $\Sigma^n$ the set of strings of length $n$, and let $\Sigma^*$ be the set of all strings over the alphabet $\Sigma$, $\Sigma^* = \cup_{n=0}^{\infty}\Sigma^n$. We say that $s[i:j] := s_i s_{i+1}\ldots s_j$ is a substring of $s \in \Sigma^*$, and that $v = s[\mathbf{i}] := s_{i_1}s_{i_2}\ldots s_{i_k}$ is a subsequence of $s$ if $\mathbf{i} = \{i_1, i_2,\ldots,i_k\}$, $1 \le i_1 < i_2 < \ldots < i_k \le |s|$. The expression $s[\bar{\mathbf{i}}]$ denotes the complementary subsequence of $s[\mathbf{i}]$, containing the characters from the positions $\{1,\ldots,|s|\}\setminus\mathbf{i}$. The length of the subsequence $v$ is defined to be $l(\mathbf{i}) = i_k - i_1 + 1$.


The String Subsequence Kernel (SSK) was introduced in [Lodhi et al., 2002] and transforms a string/text to the feature space $\mathbb{R}^{|\Sigma|^n}$ by the following formula:

$$\phi_u(s) = \sum_{\mathbf{i}: s[\mathbf{i}] = u}\lambda^{l(\mathbf{i})}$$

for each subsequence $u \in \Sigma^n$, where $\lambda \in (0, 1]$ is the parameter of the representation. Therefore the SSK is written as

$$k_n(s, t) = \sum_{u\in\Sigma^n}\phi_u(s)\,\phi_u(t) = \sum_{u\in\Sigma^n}\ \sum_{\mathbf{i}: s[\mathbf{i}] = u}\ \sum_{\mathbf{j}: t[\mathbf{j}] = u}\lambda^{l(\mathbf{i}) + l(\mathbf{j})}$$

The authors give an algorithm based on dynamic programming by which a kernel computation $k_n(s, t)$ can be performed in $O(n|s||t|)$ time. The kernel was evaluated on the Reuters-21578 text categorization corpus and compared to the standard word kernel – that is, the linear kernel of the document vectors built using the tfidf transformation (see Section 5.1.1) – and to the n-grams kernel, which is also the linear kernel, but with the documents indexed by n-grams. The SSK always outperformed the word kernel, but the best results were achieved using the n-grams kernel. One can observe that the SSK and the n-grams kernel “search” for common subsequences in strings, thus implementing some kind of stemming, but as with the standard linear kernel the comparison remains at the lexical level. It is interesting that the SSK provides such good results, since lexical matching of subsequences can lead to wrong similarity assessments: consider for example the strings “computer” and “pute”, which contain a long common subsequence but have no connection between them.

The $(k,m)$-mismatch kernel was introduced in [Leslie et al., 2002; Leslie and Kuang, 2003]. The $(k,m)$-mismatch feature mapping counts the number of matches of the $k$-grams ($k$-mers) of a text with all possible $k$-grams generated by the alphabet, within $m$ mismatches. If $\alpha = a_1 a_2\ldots a_k$ is a $k$-gram, then $N_{(k,m)}(\alpha)$ denotes the $(k,m)$-neighborhood set of $\alpha$, containing all $\beta \in \Sigma^k$ that differ from $\alpha$ in at most $m$ characters. The generated feature space – similarly to the SSK – has dimension $|\Sigma|^k$. Let our intermediate mapping be

$$\phi_\beta(\alpha) = \begin{cases} 1, & \text{if } \beta \in N_{(k,m)}(\alpha)\\ 0, & \text{otherwise} \end{cases}$$

For an arbitrary text $s \in \Sigma^*$ we define the mismatch mapping as the sum of the above


Figure 5.5: Computing the mismatch string kernel: calculating the feature space representation $\phi_{(3,1)}(\text{abbaba})$.

mappings for each $k$-gram in $s$,

$$\phi_{(k,m)}(s) = \sum_{k\text{-grams } \alpha \text{ in } s}\bigl(\phi_\beta(\alpha)\bigr)_{\beta\in\Sigma^k}$$

The $(k,m)$-mismatch kernel is then given by

$$k_{(k,m)}(s, t) = \langle\phi_{(k,m)}(s), \phi_{(k,m)}(t)\rangle$$

In Figure 5.5 an example is shown for the computation of the feature vector of “abbaba” with parameters $k = 3$ and $m = 1$; the symbol # denotes the end of the string. The corresponding feature vector is [1 2 3 2 2 1 2 3]. For kernel computations one does not need to store the whole tree; only a depth-first traversal is needed, performed in parallel on the trees corresponding to the two strings. The total complexity of one kernel computation is $O(k^{m+1}|\Sigma|^m(|s| + |t|))$.
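For illustration, the sketch below (ours) computes the $(k,m)$-mismatch feature map explicitly by enumerating $\Sigma^k$, which is only feasible for tiny alphabets; it is not the efficient trie-based traversal described above, and the feature ordering and end-of-string handling need not match those used in Figure 5.5.

```python
from itertools import product

def mismatch_features(s, alphabet, k=3, m=1):
    # phi_(k,m)(s): for every beta in Sigma^k, count the k-grams of s within m mismatches.
    kgrams = [s[i:i + k] for i in range(len(s) - k + 1)]
    feats = []
    for beta in product(alphabet, repeat=k):
        count = sum(1 for alpha in kgrams
                    if sum(a != b for a, b in zip(alpha, beta)) <= m)
        feats.append(count)
    return feats

def mismatch_kernel(s, t, alphabet, k=3, m=1):
    fs = mismatch_features(s, alphabet, k, m)
    ft = mismatch_features(t, alphabet, k, m)
    return sum(a * b for a, b in zip(fs, ft))

print(mismatch_features("abbaba", "ab"))          # explicit feature vector over {a,b}^3
print(mismatch_kernel("abbaba", "babab", "ab"))   # kernel value between two toy strings
```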

Other string kernels include the $(g, k)$-gappy kernel [Leslie and Kuang, 2003], the $(g, m, \lambda)$-wildcard kernel [Leslie and Kuang, 2003] and the string kernel introduced in [Vishwanathan and Smola, 2002].

5.2.2 The VSM kernel

The most common kernel – also called the standard word kernel or bag-of-words kernel – is the dot product, which gives the cosine of the angle enclosed by two normalized documents,

$$k(\mathbf{d}_i, \mathbf{d}_j) = \mathbf{d}_i'\mathbf{d}_j$$


The Gram matrix is the document×document matrix $\mathbf{D}'\mathbf{D}$, with $\mathbf{D}$ defined in equation (5.1). In the following we consider linear transformations of the document vector, $\phi(\mathbf{d}) = \mathbf{P}\mathbf{d}$, following Cristianini et al. [2002].

If $\mathbf{P} = \mathbf{I}$, we obtain the kernel shown above; this is called the VSM kernel. If $\mathbf{P} = \mathrm{diag}(\mathbf{t})$, where $t_i = \log\frac{n_D}{n_i}$ is the inverse document frequency of the corresponding term and $n_i$ denotes the number of documents in which the $i$th term appears, it is called the VSM kernel with tfidf weighting.

5.2.3 The GVSM kernel

The Generalized Vector Space Model (GVSM) kernel captures the correlations between the terms in the corpus by applying the transformation $\phi(\mathbf{d}) = \mathbf{D}'\mathbf{d}$. Thus the kernel becomes

$$k(\mathbf{d}_i, \mathbf{d}_j) = \mathbf{d}_i'\mathbf{D}\mathbf{D}'\mathbf{d}_j,$$

where the matrix $\mathbf{D}\mathbf{D}'$ has a nonzero entry at position $(i, j)$ if there is at least one document in which the $i$th and $j$th terms co-occur. The GVSM was proposed by Wong et al. [1985] (see also [Wong et al., 1987]).
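A minimal numpy sketch contrasting the VSM and GVSM Gram matrices on a toy term×document matrix (our own illustration; the toy matrix is an assumption):

```python
import numpy as np

# Toy term x document matrix; columns are (already weighted) document vectors.
D = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])

K_vsm = D.T @ D                    # VSM kernel
K_gvsm = D.T @ (D @ D.T) @ D       # GVSM kernel: k(d_i, d_j) = d_i' D D' d_j
```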

5.2.4 WordNet-based kernels

Another possibility is to use external information for determining the semantic relatedness of two terms, and to use this term×term proximity matrix to define the linear transformation

$$k(\mathbf{d}_i, \mathbf{d}_j) = \mathbf{d}_i'\mathbf{P}'\mathbf{P}\mathbf{d}_j = \mathbf{d}_i'\mathbf{P}^2\mathbf{d}_j$$

where $P_{ij}$ stores the relatedness of terms $i$ and $j$. The issue is to find a good $\mathbf{P}$, acquired from external knowledge. A possible source of external knowledge on word relatedness is WordNet3. WordNet is a semantic network initially developed to provide conceptual searching, and it has proved to be a very beneficial resource in various domains such as natural language processing (NLP) and information retrieval. WordNet contains only content words, that is nouns, verbs, adjectives and adverbs, stored separately and connected by different types of semantic relations. A concept is called a synset because it is represented by a synonym set. Relations include synonymy, antonymy, hyponymy/hypernymy (IS-A), meronymy/holonymy, etc.

3http://wordnet.princeton.edu


The word semantic relatedness measures developed so far take into account the hierarchical structure engendered by these relations and some properties of the network such as density, relation type, depth and link strength. The relatedness metrics can be categorized into several classes by the way they calculate similarity: edge-counting, information content, feature-based and hybrid methods. A good enumeration and brief description of these measures can be found in [Varelas, 2005].

Most of the measures below operate on hyponymy/hypernymy relations, and thus can be used only for nouns and verbs. The simplest one is the simple node counting metric, which counts the nodes on the path from one concept to another through IS-A relations; this distance decreases as the concepts become more similar and increases with dissimilarity, so to obtain a similarity we take

$$\mathrm{sim}_{\mathrm{snc}}(c_1, c_2) = 2D - \mathrm{dist}(c_1, c_2)$$

where $\mathrm{dist}(\cdot,\cdot)$ denotes the distance between the concepts $c_1$ and $c_2$ by node counting, and $D$ is the depth of the WordNet taxonomy. Normalized to $[0, 1]$ we obtain $\mathrm{sim}_{\mathrm{snc}}(c_1, c_2) = 1 - \mathrm{dist}(c_1, c_2)/(2D)$.

The Hirst and St-Onge measure divides the relations into three types – extra-strong, strong and medium-strong relations – which possess different weights in the semantic network; the similarity is then calculated by node counting. The weighting rules are described in detail in [Hirst and St-onge, 1997].

The Wu and Palmer measure, introduced in [Wu and Palmer, 1994], calculates relatedness using the following formula:

$$\mathrm{sim}_{\mathrm{WUP}}(c_1, c_2) = \frac{2\cdot N_3}{N_1 + N_2 + 2\cdot N_3}$$

where $N_1$ is the length of the shortest path from $c_1$ to $c$, $N_2$ from $c_2$ to $c$, and $N_3$ from $c$ to the root, $c$ being the most specific common subsumer (ancestor) of $c_1$ and $c_2$. Besides concept distance, this metric takes into account the depth of the lowest common subsumer.

The Leacock and Chodorow measure [Leacock and Chodorow, 1998] calculates similarity as

$$\mathrm{sim}_{\mathrm{LCH}}(c_1, c_2) = -\log\left(\frac{\mathrm{dist}(c_1, c_2)}{2\cdot D}\right)$$

The Resnik measure [Resnik, 1995] was the first proposed information-content-based metric:

$$\mathrm{sim}_{\mathrm{RES}}(c_1, c_2) = \mathrm{IC}(c)$$


where $\mathrm{IC}(c) = -\log P(c)$ is the information content, an information-theoretic measure introduced for quantifying the specificity of concepts in a taxonomy. As one goes deeper in a taxonomy like WordNet, the specificity increases while the probability decreases. If the hierarchy has only one root concept, its probability is 1 and its information content equals 0, i.e. it is too general to be of any use. This metric can return arbitrarily large values, but in practice its upper bound is $\log N$, where $N$ is the total number of concepts in the database, because if the frequency of a concept is 0 one cannot calculate its information content.

The Lin similarity metric [Lin, 1998] is an information-theoretic measure applicable within any probabilistic model:

$$\mathrm{sim}_{\mathrm{LIN}}(c_1, c_2) = \frac{2\cdot\mathrm{IC}(c)}{\mathrm{IC}(c_1) + \mathrm{IC}(c_2)}$$

The value returned is a number between 0 and 1.

The simplified Jiang and Conrath measure [Jiang and Conrath, 1997] has the form

$$\mathrm{dist}_{\mathrm{JCN}}(c_1, c_2) = \mathrm{IC}(c_1) + \mathrm{IC}(c_2) - 2\cdot\mathrm{IC}(c)$$

It is actually a distance, so one can take its inverse, as in [Patwardhan et al., 2003].

The WordNet-based Lesk similarity measure was proposed by Banerjee and Pedersen [2002]. The Lesk algorithm was originally used for word sense disambiguation, comparing the different glosses of the word being analyzed to the glosses of the other words in the sentence.

Using gloss vectors for measuring word similarity was proposed by Patwardhan and Pedersen, and is actually an adaptation of Schütze’s context-group discrimination method [Schütze, 1998]. Word senses are represented by second-order co-occurrence vectors calculated from their glosses.

The relatedness measures presented above calculate similarity between concepts or synsets, but in order to use them in kernels, word similarities are needed. The simplest method for obtaining word relatedness is the following:

$$\mathrm{sim}(w_1, w_2) = \max\{\mathrm{sim}(c_1, c_2) \mid c_1 \in s(w_1),\ c_2 \in s(w_2)\}$$

where $s(w)$ denotes the different senses of $w$ as concepts. Although this is an ad-hoc method, it has proved useful, and it avoids the complex problem of disambiguating word senses.


5.2.5 Latent Semantic Kernel

Latent Semantic Analysis (LSA, also called Latent Semantic Indexing or LSI) was introduced in [Deerwester et al., 1990]. It is an approximation method for representing documents and terms. The term×document matrix is decomposed by SVD, $\mathbf{D} = \mathbf{U}\mathbf{S}\mathbf{V}'$, where $\mathbf{U}$ and $\mathbf{V}$ are the left and right singular matrices containing the singular vectors in their columns, and $\mathbf{S}$ is diagonal, containing the singular values in decreasing order. More precisely, $\mathbf{U}$ contains the eigenvectors of $\mathbf{D}\mathbf{D}'$, and $\mathbf{V}$ contains the eigenvectors of $\mathbf{D}'\mathbf{D}$. If the diagonal of $\mathbf{S}$ is truncated by keeping only the largest $k$ elements and setting the others to zero, one obtains the best rank-$k$ approximation of $\mathbf{D}$, written as $\tilde{\mathbf{D}} = \mathbf{U}\mathbf{S}_k\mathbf{V}'$. Like PCA (see Section 3.4.1), SVD transforms the data into a lower dimensional space by removing noise and keeping only the most relevant information. Two documents (or two terms) are compared by taking the dot product of the corresponding rows of $\mathbf{V}\mathbf{S}_k$ (respectively $\mathbf{U}\mathbf{S}_k$).

The question is how to map new documents to this space. Let $\mathbf{d}$ denote a document in the input space, while $\hat{\mathbf{d}}$ represents the same document in LSA space. Then the following equation holds:

$$\mathbf{d}'\mathbf{D} = \hat{\mathbf{d}}'\mathbf{S}_k\mathbf{V}_k'$$

From $\mathbf{D} \approx \mathbf{U}\mathbf{S}_k\mathbf{V}'$ it follows that $\mathbf{d}'\mathbf{U}_k = \hat{\mathbf{d}}'$, that is $\hat{\mathbf{d}} = \mathbf{U}_k'\mathbf{d}$. Hence the transformation matrix will be $\mathbf{U}_k' = \mathbf{I}_k\mathbf{U}'$, that is, we project the document vectors onto the first $k$ columns of $\mathbf{U}$. Thus the Latent Semantic Kernel (LSK) [Cristianini et al., 2002] becomes

$$k(\mathbf{d}_i, \mathbf{d}_j) = \mathbf{d}_i'\mathbf{U}_k\mathbf{U}_k'\mathbf{d}_j$$

Thus one does not have to compute the SVD of $\mathbf{D}$; it suffices to perform the eigenvalue decomposition of $\mathbf{D}\mathbf{D}'$. However, this is often a very large square matrix – the number of terms without feature selection can be of order $10^4$ or $10^5$, depending on the size of the corpus – but it is easy to notice that we obtain the same kernel working with the much smaller matrix $\mathbf{D}'\mathbf{D}$, since the number of documents is usually smaller than the number of terms. The VSM kernel has the form

$$\mathbf{K} = \mathbf{D}'\mathbf{D} = \mathbf{V}\mathbf{S}\mathbf{U}'\mathbf{U}\mathbf{S}\mathbf{V}' = \mathbf{V}\mathbf{S}^2\mathbf{V}'$$

Now if we substitute back the LSI transformation, we get the rank-$k$ approximation of the same kernel,

$$\tilde{\mathbf{K}} = (\mathbf{I}_k\mathbf{U}'\mathbf{D})'\,\mathbf{I}_k\mathbf{U}'\mathbf{D} = \mathbf{V}\mathbf{S}\mathbf{U}'\mathbf{U}\mathbf{I}_k'\mathbf{I}_k\mathbf{U}'\mathbf{U}\mathbf{S}\mathbf{V}' = \mathbf{V}\mathbf{S}_k^2\mathbf{V}'$$


This has the beneficial property that the matrix is much smaller; therefore the eigenvalue decomposition is less expensive.
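The following sketch (ours, not from the cited work) computes the rank-$k$ latent semantic Gram matrix from the smaller matrix $\mathbf{D}'\mathbf{D}$, as described above; it covers only the training kernel, and new documents would be projected with $\mathbf{U}_k'\mathbf{d}$ as in the text.

```python
import numpy as np

def latent_semantic_kernel(D, k):
    """D: n_T x n_D term-document matrix. Returns the rank-k Gram matrix
    V S_k^2 V' = D' U_k U_k' D, computed from the smaller matrix D'D."""
    K = D.T @ D                          # VSM kernel, K = V S^2 V'
    lam, V = np.linalg.eigh(K)           # eigenvalues in ascending order
    top = np.argsort(lam)[::-1][:k]      # indices of the k largest eigenvalues
    return (V[:, top] * lam[top]) @ V[:, top].T
```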

5.2.6 The von Neumann kernel

The von Neumann kernel [Kandola et al., 2002b; Cristianini et al., 2003] defines the kernel using recursive equations. As we have already seen, $\mathbf{K} = \mathbf{D}'\mathbf{D}$ defines the linear kernel, where $\mathbf{D}$ denotes the term×document matrix, while $\mathbf{G} = \mathbf{D}\mathbf{D}'$ gives the correlations between terms. The document and term similarities are defined in the following recursive way:

$$\hat{\mathbf{K}} = \lambda\mathbf{D}'\hat{\mathbf{G}}\mathbf{D} + \mathbf{K}$$
$$\hat{\mathbf{G}} = \lambda\mathbf{D}\hat{\mathbf{K}}\mathbf{D}' + \mathbf{G}$$

where $\lambda \in \mathbb{R}$ is a parameter setting the influence of the augmenting term. The similarities are thus calculated by adding a term to the base similarity matrix which connects the entities – documents or terms – using the other similarity matrix, defined analogously. It is proved by Kandola et al. [2002b] that if $\lambda < \|\mathbf{K}\|_F^{-1} = \|\mathbf{G}\|_F^{-1}$, the solutions of the above equations are

$$\hat{\mathbf{K}} = \mathbf{K}(\mathbf{I} - \lambda\mathbf{K})^{-1}$$
$$\hat{\mathbf{G}} = \mathbf{G}(\mathbf{I} - \lambda\mathbf{G})^{-1}$$

The kernel defined as

$$\hat{\mathbf{K}}(\lambda) = \mathbf{K}(\mathbf{I} - \lambda\mathbf{K})^{-1}$$

is called the von Neumann kernel.

The von Neumann kernel – similarly to the WordNet-based kernels presented in Section 5.2.4 – can be viewed as a kernel based on the semantic proximity matrix $\mathbf{P} = \lambda\hat{\mathbf{G}} + \mathbf{I}$, since $\mathbf{D}'\mathbf{P}\mathbf{D} = \lambda\mathbf{D}'\hat{\mathbf{G}}\mathbf{D} + \mathbf{K} = \hat{\mathbf{K}}(\lambda)$. Using the von Neumann kernel the authors achieved superior results on document retrieval compared to the VSM kernel with tfidf weighting (using the Medline1033 corpus4).

4ftp://ftp.cs.cornell.edu/pub/smart/med
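A minimal numpy sketch of the von Neumann kernel (ours, not the authors' code); $\lambda$ is chosen as a fraction of $\|\mathbf{K}\|_F^{-1}$ so that the convergence condition above holds, and the particular fraction is an illustrative assumption.

```python
import numpy as np

def von_neumann_kernel(D, fraction=0.5):
    """D: n_T x n_D term-document matrix. Computes K(lambda) = K (I - lambda K)^{-1},
    with lambda = fraction * ||K||_F^{-1} so that lambda < ||K||_F^{-1} is satisfied."""
    K = D.T @ D
    lam = fraction / np.linalg.norm(K, "fro")
    return K @ np.linalg.inv(np.eye(K.shape[0]) - lam * K)
```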


5.3 Wikipedia-based text kernels

In the previous section we presented a few string and document/text kernels used in IR systems that try to provide a more precise similarity between documents. Document vectors are built using the assumption that occurrences of words are independent, and the results delivered using the linear kernel – or the VSM kernel – prove that despite the falsity of the assumption good classifiers can be built. However, on top of the VSM representation one could use a special kernel providing a better document similarity, for which only the word frequencies are known; the additional information of word order is neglected. Hence most document kernels do the following: since there is no other information than word frequencies, they try to relate similar or semantically connected words within a document, which should result in a better kernel. If, for example, two documents contain totally different words – that is, no common word exists – but describe the same event using different words and expressions, the linear kernel returns a value of 0, which would mean that the documents have totally different meanings/topics. This shows how synonymy can cause serious problems for an IR or TC system using the VSM kernel. However, for example by clustering semantically related terms, this problem can be overcome.

In this section we present our document kernel based on Wikipedia, which transforms documents to the Wikipedia concept space and uses dimensionality reduction techniques to filter out the noise. It is related to the GVSM (Section 5.2.3) and the latent semantic kernel (Section 5.2.5) presented in the previous section.

5.3.1 Wikipedia

Wikipedia5 is the largest encyclopedia edited collaboratively on the Internet. It is written in a clear and coherent style with many concepts sufficiently explained, making it an excellent resource for natural language research. The word “wiki” means “fast” in Hawaiian, and it was first used by Ward Cunningham in 1994, when he invented the WikiWikiWeb, a simple tool for exchanging ideas between programmers and developers. A wiki is a “collection of hypertext documents that can directly be edited by anyone” [Voss, 2005]. Sometimes the backronym “What I Know Is” is used to explain what wiki means.

5http://wikipedia.org


Wikipedia can be traced back to Nupedia, an online encyclopedia founded by Jimmy Wales in 1999. Nupedia was not a wiki, but it was freely available to anyone. Wikipedia was started in January 2001 as a side project of Nupedia. Soon after, Nupedia went bankrupt and it was finally shut down in 2003. Wikipedia, however, became very popular and was soon started in other languages – initially it was available only in English. According to Voss [2005], in March 2005 Wikipedia existed in 195 languages, among which 21 had more than 10 000 articles. Now, in August 2008, there are Wikipedias in 264 languages – some of them no longer active, for example the Wikipedia written in the Kanuri language6 – among which 79 languages have more than 10 000 articles and 22 have more than 100 000 articles7. Of course the English Wikipedia is the largest, with 2 493 680 articles, but the Romanian (18th, 113 521 articles) and Hungarian (22nd, 102 217 articles) versions are also among the 22 largest Wikipedias.

Wikipedia is freely downloadable in several formats8 (e.g. XML), which makes it an outstanding resource for research, including natural language processing and information retrieval applications.

Wikipedia uses the software MediaWiki9 – also called the wiki engine – which was initially developed for Wikipedia, but it is freely downloadable and can be used to create easy-to-edit and easy-to-maintain community sites.

According to Alexa10, in August 2008 Wikipedia is the 8th most popular site on the Internet; about 9% of the global Internet users visit the site on an average day.

Wikipedia also receives increasing academic attention from researchers. The article Wikipedia:Wikipedia_in_academic_studies contains a large list of conference and journal papers using Wikipedia as a tool for solving different problems, especially in NLP and IR.

6 http://kr.wikipedia.org
7 http://meta.wikimedia.org/wiki/List_of_Wikipedias
8 http://download.wikimedia.org
9 http://www.mediawiki.org

10 Alexa – http://www.alexa.com – is a popular site providing Internet traffic rankings, having one of the largest Web crawls.


5.3.2 Wikipedia-based document representation

We saw that in the VSM with tfidf weighting documents are represented by term frequencies multiplied by the idf factor. Thus we can build the term×document matrix D in (5.1), which we repeat here:
\[
\mathbf{D} =
\begin{bmatrix}
w_{11} & w_{12} & \cdots & w_{1 n_D} \\
\vdots & \vdots & \ddots & \vdots \\
w_{n_T 1} & w_{n_T 2} & \cdots & w_{n_T n_D}
\end{bmatrix}
\qquad \text{repeated (5.1)}
\]

In this way we obtain a dual representation: (i) a representation of documents (columns of D), and (ii) a representation of terms (rows of D). Each document is represented by the terms occurring in the document, while each term is represented by the documents in which the term appears.

We now switch to another representation of documents. First we give a new representation for the terms, namely we represent each term by the distribution of that term in another document corpus. To represent a document – that is, a set of terms – in this new document space, we simply form the weighted sum of the term vectors, weighted by the term weights. Consider the following example: we have three indexing terms and our document looks like [1 0 1]′. We choose only two other documents from the corpus,
\[
\begin{bmatrix} 2 & 1 & 0 \\ 1 & 3 & 2 \end{bmatrix}
\]
where we put the documents in the rows of the matrix. Now the terms get the following representations:
\[
\begin{bmatrix} 2 \\ 1 \end{bmatrix} \cdot 1 = \begin{bmatrix} 2 \\ 1 \end{bmatrix}; \qquad
\begin{bmatrix} 1 \\ 3 \end{bmatrix} \cdot 0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}; \qquad
\begin{bmatrix} 0 \\ 2 \end{bmatrix} \cdot 1 = \begin{bmatrix} 0 \\ 2 \end{bmatrix}
\]
from which the document vector will be 1·[2 1]′ + 0·[0 0]′ + 1·[0 2]′ = [2 3]′. One can easily observe that this is actually the GVSM kernel (Section 5.2.3), that is, we transform the documents by D′d, provided that the documents from which the term distributions are taken form the same corpus from which the documents actually come.
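The toy calculation above is easy to verify numerically. The following is a minimal numpy sketch (not code from the thesis); the matrix P below holds the two "other" documents in its rows, exactly as in the example, so the term×document matrix is D = P′ and the transformed document is D′d = Pd.

```python
import numpy as np

# The two "other" documents in the rows (documents x terms), as in the example.
P = np.array([[2, 1, 0],
              [1, 3, 2]])
d = np.array([1, 0, 1])   # weights of the three indexing terms in our document

# Each term is represented by its column of P (its distribution over the two
# documents); the document becomes the weighted sum of these term vectors.
new_d = P @ d             # = 1*[2 1]' + 0*[1 3]' + 1*[0 2]' = [2 3]'
print(new_d)              # [2 3]
```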

Gabrilovich and Markovitch [2007] modified the GVSM transformation in the following way: instead of using the same corpus for detecting term correlations, they used Wikipedia for extracting the term distributions.


Wikipedia consists of about 2.5 × 10^6 articles and can therefore give documents a better, richer representation. In the following we will use the terms "article" and "concept" interchangeably.

Suppose that Wikipedia contains n_C articles, and we are interested in the distribution over these Wikipedia articles of the terms which index the documents. Then the document transformation to the Wikipedia concept space becomes:

\[
\underbrace{\begin{bmatrix}
c_{11} & c_{12} & \ldots & c_{1 n_T} \\
c_{21} & c_{22} & \ldots & c_{2 n_T} \\
\vdots & \vdots & \ddots & \vdots \\
c_{n_C 1} & c_{n_C 2} & \ldots & c_{n_C n_T}
\end{bmatrix}}_{\mathbf{W}}
\cdot
\underbrace{\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_{n_T} \end{bmatrix}}_{\mathbf{d}}
\]

We call the matrix W the Wikipedia matrix. Gabrilovich and Markovitch [2007] call this method Explicit Semantic Analysis (ESA), because unlike LSA, it maps the terms/documents to explicit – and not latent – concepts. However, we can interpret this transformation – and similarly the GVSM transformation – as transforming the document vector into a vector of similarities between the document and the Wikipedia concepts; thus a comparison in this new representation compares these similarity vectors, that is, it measures how close these similarities are. Hence the new document kernel becomes
\[
k(\mathbf{d}_i, \mathbf{d}_j) = \mathbf{d}_i' \mathbf{W}'\mathbf{W} \mathbf{d}_j
\]
where W′W is the Wikipedia term×term co-occurrence matrix.

Gabrilovich and Markovitch [2007] used the cosine similarity (Table 3.1) to compare words and texts, and they tested the new representation on the WordSimilarity-35311 collection and on a collection of 50 documents from the Australian Broadcasting Corporation's news mail service [Lee et al., 2005]. The similarity values obtained using the new representation with cosine similarity were compared to human judgement by calculating human–computer correlations. They achieved correlation coefficients of 0.75 and 0.72 for words and texts, respectively, which was the highest value produced by such an automated system at the time. For the performance of other algorithms see the referred paper of Gabrilovich and Markovitch.
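As a hedged illustration of the kernel above (not the thesis implementation), the sketch below builds a random toy Wikipedia matrix W of concept×term weights and evaluates k(d_i, d_j) = d_i′W′Wd_j both directly and through the term×term co-occurrence matrix; all sizes and values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_concepts, n_terms = 1000, 50            # toy sizes, illustration only
W = rng.random((n_concepts, n_terms))     # Wikipedia (concept x term) matrix
d_i, d_j = rng.random(n_terms), rng.random(n_terms)   # tfidf document vectors

# Compare the documents through their similarities to the Wikipedia concepts.
k_ij = (W @ d_i) @ (W @ d_j)              # k(d_i, d_j) = d_i' W'W d_j

# Equivalent form using the term x term co-occurrence matrix W'W.
C = W.T @ W
assert np.allclose(k_ij, d_i @ C @ d_j)
```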

11http://www.cs.technion.ac.il/∼gabr/resources/data/wordsim353


Figure 5.6: Scheme of forming the Wikipedia kernel. First the documents are mapped from the word space (w_1, ..., w_{n_T}) to the Wikipedia concept space (v_1, ..., v_{n_C}), then by LSA to a (latent) concept space (u_1, ..., u_{n_L}) for reducing dimensionality and filtering out noise.

Inspired by the performance of the document representation in the Wikipedia concept space, we decided to use this kernel for text categorization. We also tried to reduce dimensionality and filter out noise from this representation.

This method of transforming documents can be considered a semi-supervised technique, where a large unlabeled corpus of Wikipedia articles is used to give documents a new representation. The schematic representation of our method is illustrated in Figure 5.6. The methods presented here are described in the papers [Bodó et al., 2007] and [Minier et al., 2007].

5.3.3 Dimensionality reduction for the Wikipedia kernel

When building the Wikipedia kernel we did not use all the articles, because many of them are too short to be of any use, and of course there are many irrelevant articles too, like collector or category pages (e.g. Category:Machine_learning), internal Wikipedia pages (e.g. Wikipedia:Statistics), etc. Therefore we selected a fraction of the articles of the entire Wikipedia which we considered useful; the methodology will be described in Section 5.3.6. However, it is possible that irrelevant concepts still multiply our dimensions.

To filter out irrelevant concepts we decided to use LSA, that is, to approximate the Wikipedia matrix by a lower rank matrix,
\[
\mathbf{W} \approx \widetilde{\mathbf{W}} = \mathbf{U}\mathbf{S}_k\mathbf{V}'
\]

This means that the new Wikipedia kernel becomes
\[
k(\mathbf{d}_i, \mathbf{d}_j) = \mathbf{d}_i' \widetilde{\mathbf{W}}'\widetilde{\mathbf{W}} \mathbf{d}_j = \mathbf{d}_i' \mathbf{V}_k \mathbf{S}_k^2 \mathbf{V}_k' \mathbf{d}_j
\]

We experimented with the above kernel, but we observed that by using S_k some dimensions received a very high influence, which in turn resulted in a decrease of performance.


Thus we replaced S_k by I_k and obtained the kernel
\[
k(\mathbf{d}_i, \mathbf{d}_j) = \mathbf{d}_i' \mathbf{V}_k \mathbf{V}_k' \mathbf{d}_j
\]

Another interpretation of the above transformation is the following: while in LSA the first component of the SVD of D contains the eigenvectors or principal components of the feature space (see Section 5.2.5), here V contains them (because D is a term×document matrix, while W is a concept×term matrix), and we assume that the Wikipedia articles yield a better covariance; hence we simply replaced DD′ by W′W.

One problem with the above decomposition is that W becomes quite large – because n_C is large – thus the SVD of the matrix is very inefficient, if at all manageable in acceptable time. However, we can observe that the eigendecomposition of the much smaller matrix W′W (of size n_T × n_T) already gives us V, since
\[
\mathbf{W}'\mathbf{W} = \mathbf{V}\mathbf{S}^2\mathbf{V}'
\]
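A small numpy sketch of this shortcut (toy sizes, not the thesis code): the right singular vectors V of W are obtained from the eigendecomposition of the n_T × n_T matrix W′W, and the reduced kernel d_i′V_kV_k′d_j is then evaluated without ever decomposing the large W.

```python
import numpy as np

rng = np.random.default_rng(0)
n_concepts, n_terms, k = 2000, 100, 20        # toy sizes
W = rng.random((n_concepts, n_terms))

# Eigendecomposition of the small matrix W'W = V S^2 V' gives the same V
# as the SVD of the large concept x term matrix W.
evals, V = np.linalg.eigh(W.T @ W)            # eigenvalues in ascending order
V_k = V[:, np.argsort(evals)[::-1][:k]]       # top-k right singular vectors

def wiki_kernel(d_i, d_j):
    # Reduced Wikipedia kernel with S_k replaced by I_k: d_i' V_k V_k' d_j.
    return (V_k.T @ d_i) @ (V_k.T @ d_j)

d_i, d_j = rng.random(n_terms), rng.random(n_terms)
print(wiki_kernel(d_i, d_j))
```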

5.3.4 Link structure of Wikipedia

Wikipedia also possesses a link structure, which can be exploited by propagating frequencies through these connections. Consider the concept×concept link matrix E, defined by E_ij = 1 if there is a link from the ith to the jth concept, and 0 otherwise. We also want to keep the already assigned weights, therefore we set the main diagonal to all ones; thus our updated Wikipedia matrix becomes
\[
\widetilde{\mathbf{W}} = \mathbf{E}'\mathbf{W}
\]
This means that W̃_ij = (E_{·i})′ W_{·j}, that is, we add to W_ij the sum of occurrences of term j in the concepts C = {k_1, ..., k_n}, where the set C contains those concepts – or the indices of those concepts – from which there is a link to concept i.
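The propagation step can be sketched as follows (toy sizes and hypothetical random links; in practice E is a large sparse matrix built from the Wikipedia hyperlinks):

```python
import numpy as np

rng = np.random.default_rng(0)
n_concepts, n_terms = 6, 4
W = rng.random((n_concepts, n_terms))          # Wikipedia (concept x term) matrix

# Concept x concept link matrix: E_ij = 1 if concept i links to concept j.
E = (rng.random((n_concepts, n_concepts)) < 0.3).astype(float)
np.fill_diagonal(E, 1.0)                       # keep the original weights

# W_tilde = E'W: row i of the new matrix adds the rows of W of all concepts
# that link to concept i (plus its own row, thanks to the diagonal).
W_tilde = E.T @ W
print(W_tilde.shape)
```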

5.3.5 Concept weighting

While the words/dimensions in the training corpus get varying importance through the tfidf weighting, the concepts are treated with equal importance in Wikipedia. A possible weighting scheme could be obtained by ranking the Wikipedia articles based on citation or reference analysis, so these importances could be extracted from the link structure discussed in the previous section.


Figure 5.7: Concept weighting with PageRank. The decrease of the PageRank index on a normal (left) and on a log-log (center) scale; on the right the decrease with our rank transformation $\tilde{r} = \log(r + 1)$ is shown.

The famous PageRank algorithm [Page et al., 1998] of Google does this by considering only the link or citation structure of a set of hyperlinked pages: a page has a high rank if it has many back-links, or if it has only a few but important back-links. This can be formulated recursively as
\[
r_i = \sum_{j \in N^{-1}(i)} \frac{r_j}{|N(j)|}
\]
where r_i denotes the rank of the ith page, N^{-1}(i) is the set of backward neighbors of i, that is, the neighbors pointing to i, and N(j) is the set of forward neighbors of j. If P denotes the adjacency matrix with P_ij = 1/|N(i)| whenever i points to j, where N(i) represents the forward neighbors of i, then the ranks can be calculated by solving the eigenproblem r = P′r.

PageRank can be viewed as a random walk on a graph, and the ranks are the probabilities that a random surfer is at the respective page at time step k. If k is sufficiently large, the probabilities converge to a unique fixed distribution. The only problem with the above formulation is that the graph may have nodes with no forward neighbors, and groups of nodes from which no forward links go out. To solve these problems one can add a little randomness to the surfing process,
\[
\mathbf{r} = c\,\mathbf{P}'\mathbf{r} + (1 - c)\,\mathbf{1}, \qquad c \in [0, 1]
\]
where 1 denotes the all-ones column vector of size n_p, n_p being the number of pages. In our experiments we used the value c = 0.85, following Page et al. [1998].
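The iteration r = cP′r + (1 − c)1 is straightforward to implement by power iteration; the sketch below (a toy graph, not the thesis code) also applies the rank transformation log(r + 1) used for weighting the Wikipedia matrix. Dangling nodes are simply left with zero outgoing probability here.

```python
import numpy as np

def pagerank(adj, c=0.85, n_iter=100):
    """Power iteration for r = c P'r + (1 - c) 1, with P_ij = 1/|N(i)| for
    every forward link i -> j (rows with no out-links stay all zero)."""
    out_deg = adj.sum(axis=1)
    P = np.divide(adj, out_deg[:, None],
                  out=np.zeros_like(adj, dtype=float),
                  where=out_deg[:, None] > 0)
    r = np.ones(adj.shape[0])
    for _ in range(n_iter):
        r = c * (P.T @ r) + (1 - c)
    return r

adj = np.array([[0, 1, 1],      # toy link structure
                [1, 0, 0],
                [0, 1, 0]], dtype=float)
r = pagerank(adj)
r_tilde = np.log(r + 1)          # smoothed ranks, as in the concept weighting
# W_weighted = np.diag(r_tilde) @ W   # would weight the Wikipedia matrix rows
```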

Using the ranks produced by the PageRank algorithm we replace W by diag(r̃)W, where r̃ = log(r + 1). With the log function we tried to smooth the weights to some extent, since the ratio of the highest and lowest ranked concepts was 8.8 × 10^3. We used the translation r + 1 instead of r to avoid negative weights.

Highest ranked articles:
Place  Article title       No.       Rank
1.     united states       231 770   1320.84328
2.     united kingdom      12 118    667.21293
3.     2006                13 420    666.54867
4.     2005                13 417    622.37032
5.     france              288 348   571.75942
6.     2004                13 367    481.86905
7.     world war ii        12 585    480.33496
8.     germany             4415      470.65791
9.     england             3421      467.52380
10.    canada              271 492   440.91171
11.    2003                13 419    391.87985
12.    australia           264 104   382.22329
13.    japan               5829      382.05746
14.    english language    3393      377.01480
15.    europe              3390      376.93508
16.    india               5432      346.36223
17.    2002                13 353    338.49065
18.    london              6705      332.89629
19.    2001                13 052    326.38888
20.    italy               5431      325.82783

Lowest ranked articles:
Place  Article title                                         No.       Rank
20.    wikipedia:people by year/reports/canadians/for ...    139 249   0.15015
19.    wikipedia:dead external links/301/m                   310 077   0.15014
18.    wikipedia:dead external links/301/c                   310 064   0.15014
17.    wikipedia:dead external links/301/a                   310 060   0.15014
16.    wikipedia:reference desk archive/humanities/ja ...    238 364   0.15014
15.    wikipedia:reference desk archive/science/janua ...    238 365   0.15013
14.    wikipedia:reference desk archive/miscellaneous ...    212 383   0.15013
13.    wikipedia:dead external links/301/s                   310 084   0.15012
12.    wikipedia:reference desk archive/science/octob ...    212 380   0.15012
11.    list of songs containing overt references to r ...    71 582    0.15012
10.    list of performers on top of the pops                 302 455   0.15010
9.     wikipedia:version 1.0 editorial team/computer a ...   281 658   0.15010
8.     wikipedia:version 1.0 editorial team/physics ar ...   279 053   0.15010
7.     wikipedia:version 0.5/biglist2                        322 821   0.15008
6.     wikipedia:version 0.5/full list                       322 828   0.15007
5.     wikipedia:version 1.0 editorial team/assessment ...   269 869   0.15007
4.     wikipedia:version 1.0 editorial team/release ve ...   327 212   0.15006
3.     wikipedia:wikiproject automobiles/articles            295 708   0.15005
2.     wikipedia:airline destination lists                   247 538   0.15005
1.     wikipedia:bad links/http1                             221 406   0.15005

Table 5.2: Highest and lowest ranked articles in the reduced Wikipedia set – 327 653 pages.

Figure 5.7 shows the magnitude of the ranks obtained. The first plot shows the true ranks, on the middle one the ranks are plotted on a log-log scale, while the rightmost plot shows the ranks using our rank transformation. In Table 5.2 some of the highest and lowest ranked Wikipedia article titles are shown.

5.3.6 Experimental methodology and test results

In our experiments we used the English Wikipedia dump of November 200612, containing about 1.6 million articles, which amounts to about 8 gigabytes of textual data – excluding pictures and other additional files, but including the XML tags. Similarly to [Gabrilovich and Markovitch, 2007], in order to filter out irrelevant articles (category pages, internal Wikipedia pages, stubs [see Wikipedia:Stub], etc.), we eliminated the following articles:

• articles containing fewer than 500 words

• articles having fewer than 5 forward links to other Wikipedia articles

12 http://download.wikimedia.org/enwiki


Method                 #eigv   mP      mR      mBEP    mF1     MP      MR      MBEP    MF1
χ2 5209                –       88.46   84.59   86.52   86.48   71.61   61.25   66.43   66.02
χ2 4601                –       87.88   83.09   85.49   85.42   64.21   54.62   59.41   59.03
Wikipedia covariance   –       48.95   35.42   42.18   41.10   8.79    5.35    7.07    6.65
LSA                    4500    88.05   83.65   85.85   85.80   62.48   54.38   58.43   58.15
LSA+links              4500    87.89   83.71   85.80   85.75   65.11   56.05   60.58   60.24
LSA+PageRank           4000    87.44   83.68   85.56   85.52   59.75   53.86   56.80   56.65

Table 5.3: Performance on the Reuters corpus, in percentages. Notation: mP = micro-precision, mR = micro-recall, mBEP = micro-breakeven, mF1 = micro-F1, MP = macro-precision, MR = macro-recall, MBEP = macro-breakeven, MF1 = macro-F1.

After the reduction we built an inverted index for the words appearing in Wikipedia, excluding stop words and the 300 most frequent words. We also neglected words appearing in fewer than 10 articles. The exclusion of such words was done in order to help filter out irrelevant indexing terms, since we assumed that a word that is unimportant in Wikipedia is irrelevant for indexing the text categorization corpus too. In this way we were left with 327 653 articles.

For testing the Wikipedia kernel on text categorization we used the Reuters-21578 corpus, ModApté split, with 90 + 1 categories, where the category "unknown" was not used.13

We also used a filtering term selection method for selecting the most relevant terms from the Reuters corpus. To this end we used χ2 term selection (see Section 5.1.2) and selected 5209 terms (we obtained good results with this number of features in [Minier et al., 2006]). For these words we built word vectors representing the distribution of these words over the Wikipedia articles. All the words extracted from the Wikipedia articles and from the Reuters corpus were stemmed using Porter's algorithm14 [Baeza-Yates and Ribeiro-Neto, 1999].

After building the document vectors we used SVMs to learn the training data; for this we used the LIBSVM implementation [Chang and Lin, 2001]. For details about SVMs see Section 3.3.

The results are shown in Table 5.3. To evaluate the system we used precision, recall, the precision–recall breakeven point and the F1 measure. χ2 5209 and χ2 4601 show the results obtained using χ2 term selection.

13 For a detailed description of Reuters-21578 see Appendix A.
14 http://tartarus.org/∼martin/PorterStemmer


The rows Wikipedia covariance, LSA, LSA+links and LSA+PageRank show the results obtained using the Wikipedia kernel, namely: the Wikipedia kernel without reduction, the Wikipedia kernel with LSA, the Wikipedia kernel with LSA using the link matrix E, and finally the Wikipedia kernel with LSA and weighting the concepts with PageRank. The column #eigv shows the number of eigenvectors chosen.

The best results were achieved by using simple χ2 term selection, and the second best by using the Wikipedia kernel with LSA.

5.3.7 Related methods

As we already saw, our Wikipedia kernel is similar to the GVSM kernel [Cristianini et al., 2003], since the matrix W′W gives the term co-occurrences in Wikipedia, while when LSA is used to reduce the dimensionality of the concepts, a co-occurrence matrix is produced using the latent concepts.

By reducing the concept dimensionality we arrive at a kernel similar to the LSK [Cristianini et al., 2002] presented in Section 5.2.5, with the difference that now LSA is performed on the Wikipedia matrix W.

Another method which resembles the construction of the Wikipedia kernel – or, inversely, to which the Wikipedia kernel is similar – is the automatic construction of domain kernels [Gliozzo and Strapparava, 2005]. A domain transformation is defined as mapping the document vectors from the space of indexing terms to the space of concepts, here called domains: the dimension corresponding to a domain shows the extent to which the document and the domain are related. The term×domain matrix which produces the domain kernel is called the domain matrix. Gliozzo and Strapparava [2005] proposed LSA for the automatic construction of the domain matrix.

5.3.8 Discussion

The results obtained are quite surprising, since we expected more impressive results based on the results published in [Gabrilovich and Markovitch, 2007]. The first line of Table 5.3 (χ2 5209) shows the performance we used as baseline. It is quite surprising that the Wikipedia kernel without reduction provides such disappointing results. We also tested our system with χ2 term selection but using only 4601 out of the 5209 features, because the remaining 608 features did not appear in the inverted index of Wikipedia.


student        professor   football
soccer         food        fruit
tiger          jaguar      territory
surface        life        death
OPEC           oil         murder
manslaughter   weather     forecast
...            ...         ...

Table 5.4: Words in the WordSimilarity-353 corpus.

abbenhau     blowingand   chaneriya
dezuka       garlem       harjavalta
lechin       maringa      nonthapanthawat
pequiven     qtrly        ritterbusch
siedenburg   tachon       umuarama
wello        yearago      zincor
...          ...          ...

Table 5.5: Words selected by χ2 which do not appear in our inverted index of Wikipedia.

That is, we can consider the results in the second line of Table 5.3 as a baseline too. Thus the results obtained by the Wikipedia kernel slightly outperform simple χ2 term selection using the VSM kernel: considering the micro-averaged BEP we obtained better results for almost all methods, except when using the Wikipedia covariance matrix, and considering the macro-averaged BEP values the baseline results were slightly outperformed by the Wikipedia kernel with LSA using the hyperlinks.

In Tables 5.4, 5.5 and 5.6 we list some randomly selected words/stems from the WordSimilarity-353 database, from the terms selected by χ2 but not appearing in Wikipedia, and some terms selected by χ2 that do occur in Wikipedia. As one can see, the words in Table 5.4 are quite common, while the words appearing in Tables 5.5 and 5.6 are more sophisticated, more technical and rare words which appear only a few times in the entire Wikipedia. However, there were also some common words selected by χ2 which were neglected in the preprocessing step, being considered too general for a good discrimination. One argument for the inferior results could therefore be the specificity of the words selected by χ2.

Another explanation for the results of the experiments could be that the distribution of words in Reuters is significantly different from the distribution in Wikipedia, so that no useful principal components can be extracted.

Another linguistic argument could be that the usage of words has somewhat shifted from 1987 – when Reuters was written (see Appendix A) – to this day.


luther    spillov   ghana
cypriot   guayana   teaneck
sushi     shoichi   pretoria
agrum     kim       ratzeburg
rijeka    jeddah    heavier
wim       surgic    heinz
...       ...       ...

Table 5.6: Words selected by χ2 which appear in our inverted index of Wikipedia.

Of course there is large room for improvement, and there are many unexplored possibilities: choosing other term selection methods, trying to find a better document representation using the term vectors, etc. We also plan to run experiments on other TC corpora.


Chapter 6

Hierarchical cluster kernels

Hierarchical clustering methods are perhaps the most popular unsupervised learning algorithms in machine learning. Clustering [Berkhin, 2002; Jain et al., 1999] means finding an optimal partition of the points, such that similar ones are placed in the same cluster, while different clusters group dissimilar points. Hierarchical clustering is performed as successive steps of merging or dividing an initial partition/clustering until the desired number of clusters is found.

In Chapter 4 we presented several cluster kernels, which made use of the semi-supervised cluster assumption, saying that if two points are in the same cluster, they are likely to have the same class labels. In this chapter we use hierarchical – more precisely agglomerative – clustering to build cluster kernels which provide better similarities for supervised learning algorithms.

This chapter is based on the article [Bodó, 2008].

6.1 Motivation for a cluster kernel

Consider the example shown in Figure 6.1(a). We are given 2 labeled points – the blue circle and the red square – and 1 unlabeled/test point – the green triangle – with coordinates (0, 0), (5, 2) and (4.2321, −3.3302), respectively. The third point lies exactly on the optimal separating hyperplane which separates the positive and negative examples, that is, on the line perpendicular to the line defined by the labeled examples. How do we decide about our test point: does it belong to the first or to the second class? Moreover, if our test point did not lie exactly on the separating hyperplane, but very close to it, it would be similarly hard to decide its correct class.

Figure 6.1: Motivation for a cluster kernel: (a) labeled points (blue circle, red square) and the test point; (b) labeled and test points with additional unlabeled points.

The distance matrix containing the pairwise distances between our points is
\[
\begin{bmatrix}
0 & 5.3852 & 5.3852 \\
5.3852 & 0 & 5.3852 \\
5.3852 & 5.3852 & 0
\end{bmatrix}
\]

Now consider Figure 6.1(b), where we added some unlabeled data, namely 100 points from the first distribution and 100 from the second distribution. The generating distributions are N(μ_1, Σ) and N(μ_2, Σ), where
\[
\boldsymbol{\mu}_1 = [0 \;\; 0]', \qquad \boldsymbol{\mu}_2 = [5 \;\; 0]', \qquad
\boldsymbol{\Sigma} = \begin{bmatrix} 0.5 & 0 \\ 0 & 2 \end{bmatrix}
\]
Having this information, it can be clearly seen that our test point must have label 2, but we have to provide a method by which the classifier can make the correct decision. We will use a hierarchical clustering algorithm, namely single linkage agglomerative clustering, on all the labeled and unlabeled points. The steps are the following:

Algorithm 8 Single linkage agglomerative clustering
1: Find the two clusters with minimum distance – initially the points are considered as separate clusters – using the metric
\[
D(C_1, C_2) = \min\{\, \|\mathbf{x} - \mathbf{z}\|_2 \mid \mathbf{x} \in C_1,\ \mathbf{z} \in C_2 \,\}
\]
2: Merge these clusters and repeat the process until one big cluster is obtained.

Now we use the distances obtained by the above clustering algorithm: we calculate the distance between two points as the distance between their clusters at the step when they are merged. Thus we obtain the following distance matrix involving the labeled points and the test point:
\[
\begin{bmatrix}
0 & 2.4900 & 2.4900 \\
2.4900 & 0 & 0.5222 \\
2.4900 & 0.5222 & 0
\end{bmatrix}
\]

Figure 6.2: The points in the new representational space. Now the test point is so close to the red square that it evidently belongs to that class.

We see that now the distance from the test point to the second labeled point is much smaller than the distance to the first labeled point. Thus a simple classification algorithm that classifies a new point as belonging to the class whose class center is closer (see Section 3.1) can now correctly classify our test point. Figure 6.2 shows the new representation of the points, where the triangle is clearly much closer to the square than to the circle.

6.2 Hierarchical clustering

Clustering is a special type of classification (see [Jain and Dubes, 1988] for a hierarchy of classification algorithms); it is often referred to by the name of unsupervised learning. According to Jain and Dubes [1988], supervised learning is called extrinsic, while unsupervised learning is called intrinsic classification. Intrinsic classification classifies – or clusters – objects into separate groups based only on the proximity matrix, while extrinsic classifiers are given additional information; that is, besides the proximity matrix, they are given class labels too, based on which they learn how to label unseen examples.

Clustering means partitioning a set of points into "separate" groups or components, called clusters. Although we often imagine the clustering process as grouping the data into separate, exclusive clusters, the clusters can also overlap, that is, one point can belong to more than one cluster. Consider for example the task of clustering/classifying animals into different groups. We know that kangaroos are marsupials but also vertebrates, because marsupials form a subclass of mammals, and mammals form a subclass of vertebrates. However, one cannot say the same about fish; they are vertebrates but not marsupials or mammals. In classification we call this setting multi-label classification.

A clustering method can be hierarchical or partitional. Hierarchical clustering builds a tree in successive steps, where the nodes of the tree represent nested partitions of the data. By a partition we mean a decomposition of the data into several groups. Conversely, partitional clustering methods produce exactly one partition.

A clustering is called nested if a partition is formed by merging components of another clustering; then we say that the latter is nested into the first. Hierarchical clustering consists of a sequence of partitions, where every partition is nested into the next partition in the sequence. Agglomerative (bottom-up, clumping) methods start with as many clusters as there are points, and at every step the most similar pair of clusters is merged, until one big cluster is obtained. Divisive (top-down, splitting) methods, however, start with the whole data set as a single cluster, and recursively split the cluster(s) until one-point clusters are reached. For a detailed introduction to hierarchical clustering see [Jain and Dubes, 1988; Duda et al., 2001].

The general agglomerative clustering algorithm producing a binary tree is the following:

Algorithm 9 Agglomerative clustering
1: Define the initial clusters as the points themselves.
2: Find the most similar pair of clusters.
3: Merge them, thus creating a new cluster.
4: Repeat from step 2 until a single cluster is obtained.

One can design a large variety of methods by simply choosing a different function for determining cluster similarity.

We can represent a hierarchical clustering by a dendrogram, which is actually a binary tree. In Figure 6.3 two such dendrograms are shown. The vertical axis represents the dissimilarities of the examples. If there are N points to cluster, the resulting tree will have N leaves and N − 1 internal nodes, including the root. If the desired number of clusters is known, one does not need the whole tree, but can stop the process when the required number of clusters is formed. If the lowest level is denoted as level 0, at the kth level N − k clusters are formed.

Figure 6.3: Hierarchical clustering represented by dendrograms: (a) dendrogram resulting from clustering the "two-moons" data set (33 points) with average linkage; (b) dendrogram of 5 points.

Figure 6.4: Hierarchical clustering represented by a Venn diagram.

Other representations include the Venn diagram (Figure 6.4) and the representation by (LISP-style) nested lists. For example, the clustering of the 5 points shown in Figure 6.3(b) and Figure 6.4 can be written as (((A B) (C D)) E).

Divisive clustering methods start with the cluster containing all the data points – called the conjoint clustering – and successively split the cluster into two subclusters until the desired number of clusters is reached, or every data point is put in a separate cluster.

In order to produce a dendrogram, or equivalently to perform hierarchical clustering, one can use an arbitrary non-hierarchical method in a divisive manner: split the initial cluster of points into two, and then recursively apply the same method to split the resulting clusters into smaller ones, until one-point clusters are reached.

6.2.1 Linkage distances

In order to merge clusters in agglomerative clustering, some distance metric has to be used. Suppose we have k clusters at a given moment. Thus k(k − 1)/2 distances have to be calculated to be able to decide which of the k(k − 1)/2 possible cluster mergers is the best.

Single linkage – or nearest neighbor – clustering uses the following cluster distance function:
\[
D(C_1, C_2) = \min\{\, d(\mathbf{x}, \mathbf{z}) \mid \mathbf{x} \in C_1,\ \mathbf{z} \in C_2 \,\} \qquad (6.1)
\]
That is, we choose to merge those clusters for which this minimal distance is minimal. One can observe that single linkage clustering is equivalent to finding the minimal spanning tree of the data graph, i.e. if we consider the initial graph as the complete graph of the data points, then by always choosing the edge with minimal weight – or distance in this case – we obtain the minimal spanning tree of the graph.

Complete linkage – or farthest neighbor – clustering defines the distance between clusters as
\[
D(C_1, C_2) = \max\{\, d(\mathbf{x}, \mathbf{z}) \mid \mathbf{x} \in C_1,\ \mathbf{z} \in C_2 \,\} \qquad (6.2)
\]
Complete linkage clustering can be considered as working with the complete graphs of the clusters, that is, in each cluster every node is connected to all the other nodes. Let us define the diameter of a cluster as its largest distance (longest edge), while the diameter of a partition is defined as the largest diameter of the clusters in the partition. Then complete linkage clustering chooses to merge those clusters which increase the diameter of the partition as little as possible.

These two methods tend to be sensitive to "outlier" points [Duda et al., 2001]. As Jain and Dubes [1988] explain, single linkage clustering can wrongly chain clusters and form clusters with little homogeneity, while complete linkage clustering can result in clusters which are not well separated. The following distances represent compromises between single and complete linkage clustering.


In average linkage clustering the average of the pointwise distances between all the elements of the two clusters is taken:
\[
D(C_1, C_2) = \frac{1}{|C_1|\,|C_2|} \sum_{i=1}^{|C_1|} \sum_{j=1}^{|C_2|} d(\mathbf{x}_i, \mathbf{z}_j) \qquad (6.3)
\]
Average linkage clustering is also called UPGMA (Unweighted Pair Group Method using Arithmetic mean).

Weighted average linkage clustering (also called the Weighted Pair Group Method using Arithmetic mean, WPGMA) calculates the linkage distance as
\[
D(C_1, C_2) = \frac{1}{2}\,\frac{1}{|C_{11}|\,|C_2|} \sum_{i=1}^{|C_{11}|} \sum_{j=1}^{|C_2|} d(\mathbf{x}_i, \mathbf{z}_j)
 + \frac{1}{2}\,\frac{1}{|C_{12}|\,|C_2|} \sum_{i=1}^{|C_{12}|} \sum_{j=1}^{|C_2|} d(\mathbf{x}_i, \mathbf{z}_j)
\]
where C_1 was obtained by merging the clusters C_{11} and C_{12} in the previous step.

Another, similar distance gives rise to average group linkage clustering (Unweighted Pair Group Method using Centroids, UPGMC),
\[
D(C_1, C_2) = d(\mathbf{m}_1, \mathbf{m}_2)
\]
where m_i denotes the center of cluster i, that is, $\mathbf{m}_i = \frac{1}{|C_i|} \sum_{\mathbf{x}_j \in C_i} \mathbf{x}_j$.

Weighted average group linkage clustering (Weighted Pair Group Method using Centroids, WPGMC) calculates the linkage distance as
\[
D(C_1, C_2) = d(\mathbf{w}_1, \mathbf{w}_2)
\]
where w_1 and w_2 are calculated recursively by w_i = (1/2)(w_{i1} + w_{i2}), and where C_i was obtained by merging clusters C_{i1} and C_{i2}.

The last method we mention here is Ward's method [Ward, Jr., 1963], which chooses to merge those clusters resulting in minimum variance clusters,
\[
D(C_1, C_2) = \sum_{\mathbf{x}_i \in C_{12}} (\mathbf{x}_i - \mathbf{m}_{12})^2
 = \sum_i \mathbf{x}_i^2 - \frac{1}{|C_{12}|} \Bigl( \sum_i \mathbf{x}_i \Bigr)^2
\]
where C_{12} denotes the cluster obtained by merging clusters C_1 and C_2, and similarly m_{12} denotes the cluster center of C_{12}.

We call the function D(·, ·) the linkage distance, while d(·, ·) is called the pointwise distance. The pointwise distance d(x, z) is usually taken to be the Euclidean distance ‖x − z‖_2.
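For concreteness, the three linkage distances of equations (6.1)-(6.3) can be computed directly from the pairwise Euclidean distances of two clusters; the sketch below is only an illustration with made-up points, not the thesis implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def linkage_distance(C1, C2, method='single'):
    """Single (6.1), complete (6.2) and average (6.3) linkage distances
    between two clusters given as arrays of points (Euclidean d(x, z))."""
    D = cdist(C1, C2)
    if method == 'single':
        return D.min()
    if method == 'complete':
        return D.max()
    if method == 'average':
        return D.mean()
    raise ValueError(method)

C1 = np.array([[0.0, 0.0], [0.5, 0.2]])
C2 = np.array([[3.0, 1.0], [2.5, 0.5]])
print([linkage_distance(C1, C2, m) for m in ('single', 'complete', 'average')])
```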


Figure 6.5: Merging 3 clusters.

In the following we will only consider the single, complete and average linkage distances. All of these possess the following property: suppose that in two consecutive steps we merge three clusters, C_1, C_2 and C_3; first we merge C_1 with C_2, resulting in C_{12}, and then we merge it with C_3 (see Figure 6.5). The property is the following: if
\[
D(C_1, C_2) \le D(C_1, C_3) \quad \text{and} \quad D(C_1, C_2) \le D(C_2, C_3)
\]
then
\[
D(C_1, C_2) \le D(C_{12}, C_3)
\]
We now prove the above statement for these clusterings, because this property is of vital importance for our hierarchical cluster kernel. Again, we deal only with the single, complete and average linkage distances, that is, we prove the above property for equations (6.1), (6.2) and (6.3), in this order.

(1) We know that D(C_1, C_2) ≤ D(C_1, C_3) and D(C_1, C_2) ≤ D(C_2, C_3), which is equivalent to saying that D(C_1, C_2) ≤ min{d(x, z) | x ∈ C_1, z ∈ C_3} and D(C_1, C_2) ≤ min{d(x, z) | x ∈ C_2, z ∈ C_3}. Consequently D(C_1, C_2) ≤ min{d(x, z) | x ∈ C_1 ∪ C_2, z ∈ C_3} = D(C_{12}, C_3).

(2) The proof is similar to the previous one.

(3) This case is a little more complicated than the previous two. We proceed in the following way. We know that D(C_1, C_2) ≤ D(C_1, C_3) and D(C_1, C_2) ≤ D(C_2, C_3), and we will prove that D(C_1, C_2) ≤ D(C_{12}, C_3) by considering the following two cases: (i) D(C_1, C_3) ≤ D(C_2, C_3) and (ii) D(C_2, C_3) ≤ D(C_1, C_3). Then, respectively, we prove that the following relations hold: (i) D(C_1, C_2) ≤ D(C_1, C_3) ≤ D(C_{12}, C_3) and (ii) D(C_1, C_2) ≤ D(C_2, C_3) ≤ D(C_{12}, C_3). In each case, the first of the two relations is true by assumption, so we only need to prove the following relations: (i) D(C_1, C_3) ≤ D(C_{12}, C_3) and (ii) D(C_2, C_3) ≤ D(C_{12}, C_3). For the sake of simplicity, the following notations will be used:

\[
N_1 = |C_1|, \quad N_2 = |C_2|, \quad N_3 = |C_3|
\]
\[
C_1 = \{x_1, \ldots, x_{N_1}\}, \quad C_2 = \{z_1, \ldots, z_{N_2}\}, \quad C_3 = \{v_1, \ldots, v_{N_3}\}
\]
\[
D_{13} = \sum_i \sum_j d(x_i, v_j), \qquad D_{23} = \sum_i \sum_j d(z_i, v_j)
\]

(i) We start from the case assumption, that is, D(C_1, C_3) ≤ D(C_2, C_3), which can be rewritten as
\[
\frac{1}{N_1 N_3} D_{13} \le \frac{1}{N_2 N_3} D_{23}
\]
After multiplying both sides by N_1 N_2 N_3 and adding N_1 D_{13} to both sides we obtain
\[
(N_1 + N_2) D_{13} \le N_1 (D_{13} + D_{23})
\]
Finally we divide each side by N_1 N_3 (N_1 + N_2), which is allowed because no cluster is empty, and get
\[
\frac{1}{N_1 N_3} D_{13} \le \frac{1}{(N_1 + N_2) N_3} (D_{13} + D_{23})
\]
which is exactly D(C_1, C_3) ≤ D(C_{12}, C_3); thus the proof is complete.

(ii) The proof is similar to the previous case if one switches the roles of C_1 and C_2.

Thus our statement is proved.

This property is called the ultrametric property, or ultrametricity. Ultrametric matrices and trees are presented in Section 6.4.

6.3 Metric multi-dimensional scaling

Metric multi-dimensional scaling (metric MDS) can be used to represent points in a low-dimensional space, using the kernel or the distance matrix of the points. We will later use this method to represent our points in the space induced by the hierarchical cluster kernel.

Metric MDS is a method for finding a low-dimensional representation of high-dimensional data that faithfully preserves the inner products [Chapelle et al., 2006].

Page 110: Semi-supervised Learning with Kernelszbodo/thesis/thesis.pdf · Acknowledgements First of all I would like to thank my advisor, Zoltán Kása, for giving me the opportunity to continue

96 6.4. ULTRAMETRIC MATRICES AND TREES

That is, we seek the low-dimensional representation Ψ of X – the data are put in the rows of the N × d matrix X – such that
\[
\| \mathbf{X}\mathbf{X}' - \boldsymbol{\Psi}\boldsymbol{\Psi}' \|_F^2
\]
is minimal, which is achieved by the low-rank approximation U S_k U' of XX', where U is the eigenvector matrix and S_k contains the first (largest) k eigenvalues on the diagonal. Thus the new representation is given by the rows of the matrix U S_k^{1/2}.

Metric MDS is also used to map data to a lower-dimensional space given only dissimilarities, i.e. distances. In this case the distances are first transformed to dot products by the formula
\[
-\frac{1}{2} \mathbf{J}\mathbf{D}^{(2)}\mathbf{J}
\]
where D^{(2)} contains the squared pairwise distances and J is the centering matrix defined as J = I − (1/N) 1 1'. Then we use the low-rank approximation of the dot product matrix as in the previous case. We have already described this technique when we presented ISOMAP in Section 4.1. MDS is used by the hierarchical cluster kernel construction method: we use the above method of transforming distances to dot products, and if a low-dimensional representation is needed, we use its eigendecomposition.
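A compact numpy sketch of this transformation (illustrative, with a hypothetical toy distance matrix): squared distances are centered into a Gram matrix and the embedding is read off from the top eigenpairs.

```python
import numpy as np

def mds_embedding(D2, k):
    """Metric MDS: turn squared distances D2 into dot products -1/2 J D2 J
    and embed the points using the k largest eigenpairs."""
    N = D2.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N          # centering matrix
    K = -0.5 * J @ D2 @ J                        # dot-product (Gram) matrix
    evals, evecs = np.linalg.eigh(K)
    idx = np.argsort(evals)[::-1][:k]
    lam = np.clip(evals[idx], 0, None)           # guard against tiny negatives
    return evecs[:, idx] * np.sqrt(lam)          # rows = embedded points

x = np.array([[0.0], [1.0], [3.0]])              # three collinear points
D2 = (x - x.T) ** 2                              # their squared distances
print(mds_embedding(D2, 1))                      # recovers the line up to shift/sign
```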

It is interesting to note the similarities between PCA (Section 3.4.1) and MDS. Both methods need the eigendecomposition of a positive semi-definite matrix, from which the new representation is obtained. The two matrices (1/N) X'X and XX' have the same rank and essentially the same eigenvalues – the eigenvalues of the Gram matrix are equal to the eigenvalues of the covariance matrix multiplied by N – which follows from the eigendecomposition of the matrices. From this it follows that if a relatively large gap is found between the kth and (k + 1)th eigenvalues, then one can obtain a good approximation of the high-dimensional points by mapping them to the low-dimensional subspace of dimensionality k.

For more details on multi-dimensional scaling techniques see [Borg and Groenen, 2005].

6.4 Ultrametric matrices and trees

In this section we cite some definitions and theorems to introduce ultrametric trees and matrices. We are going to use ultrametrics to build the hierarchical cluster kernel.



Figure 6.6: Example of an ultrametric tree.

Definition 6.1 ([Wu and Chao, 2004]). An N × N matrix M is said to be ultrametric iff
\[
M_{ij} \le \max\{M_{ik}, M_{jk}\}, \qquad \forall i, j, k \in \{1, 2, \ldots, N\}
\]
The following theorem establishes the connection between ultrametric matrices and trees.

Theorem 6.1 ([Wu and Chao, 2004]). A distance matrix is ultrametric iff it can be realized by a unique ultrametric tree.

Definition 6.2 ([Gent et al., 2003]). A tree T is called ultrametric if the nodes are labeled with values such that any path from the root to a leaf results in a non-increasing sequence.

Figure 6.6 illustrates an ultrametric tree with 5 leaves and 4 internal nodes. One can observe that every path – starting from the root and leading to a leaf – produces a non-increasing sequence. The values attached to the internal nodes are considered distances: for every pair of leaves, their distance is equal to the value attached to their lowest common subsumer (ancestor). For example, the distance is 3 between A and B, 3 between B and D, and 5 between C and E. Moreover, this construction defines a 5 × 5 ultrametric matrix M with the property M_ij ≤ max{M_ik, M_kj}, ∀i, j, k ∈ {1, ..., 5}.

6.5 The hierarchical cluster kernel

Performing hierarchical clustering on a data set results in a dendrogram, as shown in Section 6.2. The nodes of the resulting tree can be labeled by the distance of the clusters which were merged at the respective node. Thus we can build a distance matrix of the data points by taking the distance value attached to the lowest common ancestor of the points in the tree built in the previous step. In order to transform distances to dot products we use the method described in Sections 4.1 and 6.3. The only problem with this transformation is that the resulting Gram matrix is not necessarily positive semi-definite, but we will use a theorem which states that under this transformation ultrametric distance matrices always result in proper Gram matrices. By the previous section and Section 6.2.1 we can see that the distance matrix resulting from hierarchical clustering – using appropriate distance metrics – is an ultrametric matrix.

For the construction of our cluster kernel we need the following theorems.

Theorem 6.2 ([Fischer et al., 2003]). For every ultrametric M, √M is ℓ2 embeddable.

The Euclidean embedding of vectors means that the dissimilarities contained in some matrix can be interpreted as Euclidean distances. If this is possible, then it is easy to reconstruct the vectors from this distance matrix. For Euclidean embeddings and multi-dimensional scaling methods see Section 6.3 and the book [Borg and Groenen, 2005].

Theorem 6.3 ([Fischer et al., 2003]). For a metric M, √M is ℓ2 embeddable iff M_c = J M J is negative semi-definite.

The matrix J used in the above theorem is the centering matrix defined previously. The combination of the previous two theorems results in the following theorem, which is the core of our method.

Theorem 6.4 ([Fischer et al., 2003]). Given an ultrametric M, the matrix −(1/2)M_c = −(1/2) J M J is a Gram matrix containing the dot products of the vectors z_i, i = 1, 2, ..., N, whose squared Euclidean distances are contained in the matrix M.

Thus we take an ultrametric distance matrix and transform the distances to dot products as in ISOMAP or MDS. Ultrametricity, however, assures us that this transformation yields a positive semi-definite matrix.

We construct the cluster kernel using the linkage distances from hierarchical clustering. Thus we map the points to a feature space where the pointwise distances are equal to the cluster distances in the input space. The steps are shown in Algorithm 10.

Algorithm 10 Hierarchical cluster kernel
1: Perform an agglomerative clustering on the labeled and unlabeled data – for example using one of the linkage functions from equations (6.1), (6.2) and (6.3).
2: Define the matrix M with entries M_ij = the linkage distance in the resulting ultrametric tree at the lowest common subsumer of i and j; M_ii = 0, ∀i.
3: Define the kernel matrix as K = −(1/2) J M J.

We call the obtained kernel K a data-dependent kernel. Here we use the unlabeled data in the clustering step to determine "better" pointwise distances, based on which we construct the kernel. We expect to obtain better similarities than if only the labeled training set were used. In order to compute the kernel function values for the test points, we include them in the unlabeled set. For every new test point, unavailable at training time, the whole process of clustering must be performed again, which is rather impractical in certain situations. In Section 6.6 we present a more efficient technique which can be used in these cases.
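A minimal sketch of Algorithm 10 using scipy (not the thesis' MATLAB code): scipy's cophenetic distances are exactly the linkage distances at the lowest common subsumer, so the ultrametric matrix M and the centered kernel K follow directly. The toy data below are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

def hierarchical_cluster_kernel(X, method='average'):
    """Algorithm 10: agglomerative clustering, ultrametric matrix M of
    merge-level (cophenetic) distances, then K = -1/2 J M J."""
    Z = linkage(X, method=method)        # step 1: single/complete/average linkage
    M = squareform(cophenet(Z))          # step 2: M_ij = linkage distance at the
                                         #         lowest common subsumer, M_ii = 0
    N = X.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    return -0.5 * J @ M @ J              # step 3: kernel matrix

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
K = hierarchical_cluster_kernel(X)
print(np.linalg.eigvalsh(K).min())       # ~ >= 0: PSD, as guaranteed by Theorem 6.4
```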

From the resulting kernel matrix we can obtain the new representation of the points using MDS. Since K is positive semi-definite, by the spectral theorem it can be decomposed as U S U', where U is a unitary matrix, i.e. UU' = I, containing the eigenvectors of K, and S is a diagonal matrix containing the corresponding eigenvalues. Thus Z = U S^{1/2} gives the low-dimensional representation of the vectors, namely the new vectors are placed in the rows of Z. We used this procedure to represent the points in Figure 6.2.

6.5.1 Hierarchical cluster kernel with graph distances

When building our hierarchical cluster kernel we used the cluster assumption (Section 2.1), that is, if two points lie in the same cluster, they are likely to belong to the same class. In the improved HCK we also make use of the manifold assumption (see Section 2.1). That is, we approximate manifold distances by distances – shortest paths – computed on the kNN or εNN graph, as is done in ISOMAP (Section 4.1): we simply use the graph distances as the pointwise distance d(·, ·) (see Section 6.2.1).

The previous algorithm is augmented with three preceding steps, as shown in Algorithm 11.

We deliberately started the numbering from −2 to highlight that these steps precede the algorithm from the previous subsection. We also emphasize that these three steps are optional: they should only be used if the manifold assumption holds on the data set.


Algorithm 11 Graph-based hierarchical cluster kernel
-2: Determine the k nearest neighbors or an ε-neighborhood of each point, and set the distances to all other points to zero.
-1: Compute the shortest paths for every pair of points – using for example Dijkstra's algorithm.
0: Use these distances as the pointwise distance d(·, ·) in equations (6.1), (6.2) and (6.3).

We use here the shortest path distances computed on the k nearest neighbor or the ε-neighborhood graph of the data, thus – if the data lie on a low-dimensional manifold – approximating the pointwise distances on this manifold. For the proof of the previous statement see [Bernstein et al., 2000]. The very first technique using shortest path graph distances for dimensionality reduction was ISOMAP, presented in Section 4.1. We are going to compare our method to the ISOMAP kernel too [Ham et al., 2004].
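The graph-based variant can be sketched by replacing the Euclidean pointwise distances with shortest-path distances on a kNN graph before clustering (a toy illustration under the assumption that the resulting graph is connected; the connection procedure for the disconnected case is described below):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

def graph_hck(X, k=7, method='average'):
    """Graph-based HCK sketch: kNN graph -> shortest-path (approximate manifold)
    distances -> hierarchical cluster kernel as in Algorithm 10."""
    D = squareform(pdist(X))
    N = X.shape[0]
    G = np.zeros_like(D)
    for i in range(N):                          # keep only the k nearest neighbors
        nn = np.argsort(D[i])[1:k + 1]
        G[i, nn] = D[i, nn]
    G = np.maximum(G, G.T)                      # symmetrize (zero = no edge)
    SP = shortest_path(G, method='D', directed=False)   # Dijkstra
    Z = linkage(squareform(SP, checks=False), method=method)
    M = squareform(cophenet(Z))
    J = np.eye(N) - np.ones((N, N)) / N
    return -0.5 * J @ M @ J

rng = np.random.default_rng(0)
K = graph_hck(rng.random((60, 2)))              # uniform toy data, graph connected
```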

Connecting the graph

The graph built by taking the k or ε-neighborhood of the points can contain several components, for example if k or ε is too small. Tenenbaum et al. [2000] argue that the small components which are disconnected from the "giant" component usually contain outliers, thus they can be left out of further analysis. However, we want to build a proper Gram matrix over the entire data set and let the classification algorithm deal with the outliers.

We follow the idea described in [Yong and Jie, 2005] to obtain one connected component. Let D denote the pointwise distance matrix that we obtain after sparsifying the neighborhood matrix by choosing the k nearest neighbors or an ε-neighborhood of each point. Similarly, G denotes the all-pairs shortest path matrix computed using D. Then G_ij = ∞ for some i and j means that the graph contains more than one connected component. In this case we choose the two unconnected points i and j which have minimal Euclidean distance, and connect them using a relatively large distance,
\[
L_{ij} = g_{\max} + \frac{d_{\min}}{d_{\max}} \qquad (6.4)
\]
where g_max denotes the maximal (finite) path length in the whole graph, while d_min and d_max are the minimal and maximal – either Euclidean or graph – distances between the data points.


When connecting these two points, one should update only those distances in the distance matrix whose values could not be calculated (∞). For every such pair of points k and ℓ we calculate the minimal distance as
\[
G_{k\ell} = \min\{\, G_{ik} + L_{ij} + G_{j\ell},\; G_{kj} + L_{ij} + G_{i\ell} \,\}
\]
Unfortunately, after the insertion of such a connection there could still remain unconnected components. In such a case the above procedure must be repeated until one connected component is obtained. For every such iteration we increment L_ij by d_min/d_max.
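One possible reading of this procedure is sketched below (not the thesis code; the exact bookkeeping of g_max and of the increment is an assumption on our part). D holds the Euclidean distances, and G the shortest-path matrix with ∞ between different components, e.g. as returned by shortest_path on the sparsified graph.

```python
import numpy as np

def connect_components(D, G):
    """Repeatedly connect the Euclidean-closest pair of unconnected points with
    the large distance L_ij of equation (6.4) and fill in the infinite entries."""
    d_min, d_max = D[D > 0].min(), D.max()
    extra = d_min / d_max
    while np.isinf(G).any():
        cand = np.where(np.isinf(G), D, np.inf)       # only unconnected pairs
        i, j = np.unravel_index(np.argmin(cand), cand.shape)
        L_ij = G[np.isfinite(G)].max() + extra        # g_max + d_min/d_max
        n = G.shape[0]
        for k in range(n):                            # update only the inf entries
            for l in range(n):
                if np.isinf(G[k, l]):
                    G[k, l] = min(G[i, k] + L_ij + G[j, l],
                                  G[j, k] + L_ij + G[i, l])
        extra += d_min / d_max                        # increment for the next round
    return G
```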

6.6 New test points

The cluster kernels we have presented do not allow simple kernel calculations for new, unseen points which were not available in the training phase together with the labeled and unlabeled data. To extend the kernel to unseen data, all the steps must be performed again – involving all the data – whenever a new test point arrives, which can be very inefficient. Here we outline a solution to this problem following Vishwanathan et al. [2006].

Given a kernel matrix K of size N × N, we approximate it using a feature map φ : X → R^n, X ⊆ R^d, and a positive semi-definite matrix Q ∈ R^{n×n}, with n known. We approximate K with the functional form k(x_1, x_2) = φ(x_1)'Qφ(x_2) via the minimization of the following expression:
\[
\min_{\mathbf{Q}} \| \mathbf{K} - \boldsymbol{\varphi}'\mathbf{Q}\boldsymbol{\varphi} \|^2 \qquad (6.5)
\]
where φ = [φ(x_1) ... φ(x_N)] and the constraint is K − φ'Qφ ⪰ 0. Vishwanathan et al. [2006] proved that (6.5) is minimized by
\[
\mathbf{Q} = (\boldsymbol{\varphi}^{\dagger})'(\mathbf{K} - \mathbf{P})\boldsymbol{\varphi}^{\dagger}
\]
where
\[
\mathbf{P} = \mathbf{K}\mathbf{V}_2(\mathbf{V}_2'\mathbf{K}\mathbf{V}_2)^{-1}\mathbf{V}_2'\mathbf{K}
\]
Here V_2 is the N × (N − r) matrix resulting from the singular value decomposition (SVD) of φ = U S V', V = [V_1 V_2], where r is the rank of φ.


We mention that a possibility for proposing a "good" feature mapping φ is to use an RBF-type function and optimize (6.5) with respect to the centers of the RBF and the weights, keeping the matrix Q fixed; this is left as future work.

6.7 Experimental methodology and test results

In this section we present the results obtained for both versions of our hierarchical cluster kernel, and we also compare the kernel to the other data-dependent kernels presented in Chapter 4. For learning we used Support Vector Machines, namely the LIBSVM version 2.85 [Chang and Lin, 2001] SVM implementation, which is available for many programming languages, including MATLAB1.

The data sets used for evaluating the different kernels were the following:

Data set   Classes   Dimension   Points   Comment
USPS       2         241         1500     imbalanced
Digit1     2         241         1500     artificial
COIL2      2         241         1500
Text       2         11 960      1500     sparse discrete

The detailed descriptions of these sets can be found in Appendix A. Each data set has two variations: one with 10 and one with 100 labeled points; furthermore, each data set contains 12 labeled–unlabeled splits of the data. We used only the first split from each set. The column labels 10 and 100 in the tables showing the obtained results indicate which version of the data set was used, i.e. how many labeled examples were used.

The evaluation measure used was accuracy, and the results are given in percentages. Table 6.1 shows the baseline results obtained with SVMs using linear and Gaussian kernels. Principally, we wanted to outperform these results. The hyperparameter of the Gaussian kernel was set using two methods for estimating a base σ0 value, after which we used 10-fold cross-validation on the larger labeled set containing 100 labeled points. The two methods for determining σ0 were the following. In the first method [Chapelle et al., 2006] we set σ0 to the mean of the norms of the vectors from the labeled and unlabeled data set. For the second method we followed Zhu and Ghahramani [2002].

¹The MATLAB packages can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/matlab/


              Linear              Gaussian
              10       100        10       100
    USPS      72.82    86.43      80.07    89.71
    Digit1    81.07    90.86      56.11    93.86
    COIL2     60.74    80.43      57.38    82.50
    Text      58.26    67.86      59.06    56.43

Table 6.1: Linear and Gaussian kernels with SVMs.

              HCK, 10 points                  HCK, 100 points
              single   complete   average    single   complete   average
    USPS      80.07    82.01      81.48      81.79    89.50      92.86
    Digit1    48.86    60.67      71.75      70.21    89.71      93.79
    COIL2     67.85    55.64      68.05      96.00    86.36      91.71
    Text      66.78    50.27      64.63      73.14    49.57      50.14

Table 6.2: Hierarchical cluster kernels (HCK).

This method was used in label propagation for determining the width of the Gaussian similarity. Here we built a distance matrix of the entire data set using Euclidean distances. Then, using Kruskal's minimum spanning tree algorithm [Cormen et al., 2001], we added the edges with minimum weight to construct the spanning tree, but we stopped when the edge with minimum weight connecting two components containing differently labeled data was found. In the second case we set σ_0 to this distance. Then we performed 10-fold cross-validation using 100 labeled data – however, this can be considered as cheating a little with the settings when only 10 labeled examples were available. The cross-validation was performed for the following parameters:

[σ0/8 σ0/4 σ0/2 σ0 2σ0 4σ0 8σ0]

and the optimal parameters found by the two methods – i.e. by using the two different σ_0's – were averaged, resulting in [4.0039 1.2555 265.3732 0.9041] for the USPS, Digit1, COIL2 and Text data sets, respectively.
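The two base-width estimates are easy to reproduce; the sketch below gives one possible reading of the procedure (the function names are mine, and the stopping rule of the second method is my interpretation of the description above; labels are assumed to be None for unlabeled points):

    import numpy as np
    from itertools import combinations

    def sigma0_mean_norm(X):
        """First method: sigma_0 is the mean norm of the labeled + unlabeled vectors."""
        return float(np.linalg.norm(X, axis=1).mean())

    def sigma0_mst(X, y):
        """Second method: add edges in Kruskal's order (increasing Euclidean distance)
        and stop at the first edge joining two components that contain differently
        labeled points; return that edge's weight."""
        n = len(X)
        parent = list(range(n))
        labels = [set() if y[i] is None else {y[i]} for i in range(n)]

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        edges = sorted((np.linalg.norm(X[i] - X[j]), i, j)
                       for i, j in combinations(range(n), 2))
        for d, i, j in edges:
            ri, rj = find(i), find(j)
            if ri == rj:
                continue
            if labels[ri] and labels[rj] and labels[ri] != labels[rj]:
                return float(d)              # conflicting edge found: this is sigma_0
            parent[rj] = ri                  # merge the two components
            labels[ri] |= labels[rj]
        return None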

In Tables 6.2 and 6.3 the results obtained with our hierarchical (HCK) and graph-based hierarchical cluster kernels (gHCK) are shown. For gHCK we used a kNN graph with k = 7.

Table 6.4 shows the results obtained using the ISOMAP (Section 4.1) and neighborhood kernels (Section 4.2), Table 6.5 shows the results obtained by the bagged cluster kernel (Section 4.3) and the Laplacian SVM (Section 4.5), and finally Table 6.6 shows the results obtained using the multi-type cluster kernel (Section 4.4).


              gHCK, k = 7, 10 points          gHCK, k = 7, 100 points
              single   complete   average    single   complete   average
    USPS      80.07    88.26      89.26      81.79    95.64      95.64
    Digit1    48.86    75.50      94.70      70.21    93.71      95.21
    COIL2     60.60    68.52      60.54      93.86    88.79      90.64
    Text      66.78    56.17      47.32      73.14    67.71      66.86

Table 6.3: Graph-based hierarchical cluster kernels (gHCK).

              ISOMAP, k = 7        Neighborhood kernel
              10       100         10              100
    USPS      85.10    86.71       76.31 (k = 3)   94.14 (k = 7)
    Digit1    94.43    97.43       87.11 (k = 6)   94.21 (k = 5)
    COIL2     62.62    80.64       64.43 (k = 7)   84.43 (k = 7)
    Text      59.80    72.43       51.68 (k = 6)   62.79 (k = 2)

Table 6.4: ISOMAP and neighborhood kernels.


For LapSVM the following methodology was adopted from [Chapelle et al., 2006]. First, to carry out the computations, we used LIBSVM with the kernel

    k̃(x, z) = k(x, z) − k_x′ ( (N²/2)(λ_A/λ_I) I + Lᵗ K )⁻¹ Lᵗ k_z

using the fixed parameter t = 1. The Laplacian was computed using the kNN data graph with k = 5 for the USPS, Digit1 and COIL2 data sets, and k = 50 for Text. For the Gaussian kernel we used the parameters found by cross-validation for the non-Laplacian Gaussian SVM. The regularization parameters λ_A and λ_I were set by a 10-fold cross-validation process using 100 labeled data. The parameters were chosen from the set {10⁻⁶, 10⁻⁵, ..., 10⁶} for both λ_A and λ_I.


              Bagged cluster kernel            LapSVM
              10              100              10       100
    USPS      87.38 (k = 7)   92.79 (k = 6)    81.95    95.93
    Digit1    93.29 (k = 5)   96.93 (k = 7)    84.50    97.64
    COIL2     71.28 (k = 7)   85.57 (k = 5)    76.64    97.71
    Text      63.29 (k = 5)   66.14 (k = 2)    63.42    62.50

Table 6.5: The bagged cluster kernel and LapSVM with Gaussian kernel.

              Step              Linear step       Polynomial
              10       100      10       100      10       100
    USPS      80.07    92.86    80.07    92.86    80.07    80.29
    Digit1    91.01    91.29    91.01    91.36    48.86    65.07
    COIL2     55.77    84.86    55.77    84.86    54.23    82.29
    Text      53.56    74.79    53.02    75.29    50.60    56.71

Table 6.6: Multi-type cluster kernel.

The optimal parameters found were the following:

              λ_A      λ_I
    USPS      10⁻⁶     10³
    Digit1    10⁻⁶     10
    COIL2     10⁻⁶     10⁴
    Text      10⁻⁶     10³

For the ISOMAP kernel we used a kNN graph with k set to 7. For the neighborhood kernel we tried different parameter settings by changing the number of neighbors from 2 to 7; in Table 6.4 the best results are shown, indicating also the best parameters. For the bagged cluster kernel we set the parameter t = 20, and similarly to the previous kernel we experimented with different numbers of clusters; k was chosen from the set {2, 3, ..., 7}; in Table 6.5 only the best results are shown. For the multi-type cluster kernel we used the following fixed parameters:

• step transfer function: the largest ℓ + 10 eigenvalues were used, where ℓ is the number of labeled examples

• linear step transfer function: the largest ℓ + 10 eigenvalues were used


• polynomial transfer function: t = 5

For the neighborhood and bagged cluster kernel we used the linear kernel as base kernel.

6.8 Related work

Data-dependent kernels were presented in Chapter 4. Here we briefly describe a kernel construction method used to provide a better representation for the clustering of points. Our method of constructing hierarchical cluster kernels is based on the work of Fischer et al. [2003]. The authors of the article propose a new two-step clustering procedure:

Algorithm 12 Clustering with the connectivity kernel
1: Give a new representation of the points based on the effective dissimilarity.
2: Cluster the points using the new representation.

If u ∈ {1, ..., K}^N denotes the cluster assignment vector of the points, then the clustering method minimizes the expression

    H(u; D) = Σ_{k=1}^{K} (1/N_k) Σ_{i: u_i=k} Σ_{j: u_j=k} D_ij

called the pairwise clustering cost function, where N_k denotes the size of cluster k and D contains the effective dissimilarities. To compute these values the authors build a graph of the data; they assume that on the path between two points belonging to different clusters there will be an edge with large weight, representing the weakest link on the path. The effective dissimilarity will be represented by this value. If P_ij ⊆ P denotes the set of all possible paths between points i and j, and P is the set of all paths in the data graph, then the path-specific effective dissimilarity is defined in the following way:

    D_ij = min_{p ∈ P_ij}  max_{1 ≤ h ≤ |p|−1}  D′_{p[h] p[h+1]}

They approximate the effective dissimilarities using a Kruskal-style algorithm, which is actually the single linkage agglomerative clustering technique. When two clusters are merged, the linkage distance D_ij approximates the effective dissimilarity and gives the distances between objects from clusters i and j. Then the kernel matrix is built using the formula K = −(1/2) J D J, and the eigendecomposition of K is performed to obtain the


new representation of the points. Then a conventional clustering method is used to obtain a partition of the data.
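For concreteness, a compact sketch of this construction (my reimplementation for illustration, not the authors' code; the minimax path closure and the eigenvalue threshold are implementation choices of mine):

    import numpy as np

    def connectivity_kernel(D):
        """D: N x N symmetric dissimilarity matrix with zero diagonal.
        Effective dissimilarities = minimax path distances on the complete graph,
        computed with a Floyd-Warshall-style closure; kernel via double centering."""
        N = D.shape[0]
        E = D.copy()
        for k in range(N):                              # minimax path closure
            E = np.minimum(E, np.maximum(E[:, [k]], E[[k], :]))
        J = np.eye(N) - np.ones((N, N)) / N             # centering matrix
        K = -0.5 * J @ E @ J                            # K = -(1/2) J D J
        w, V = np.linalg.eigh(K)                        # eigendecomposition of K
        keep = w > 1e-10
        X_new = V[:, keep] * np.sqrt(w[keep])           # new representation of the points
        return K, X_new

A conventional clustering algorithm (e.g. k-means) is then run on the rows of X_new.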

Our method can be viewed as a generalization of the connectivity kernel, since if the condition presented in Section 6.2.1 is satisfied, we can use an arbitrary linkage distance when performing agglomerative clustering. Moreover, we applied the kernel construction method to semi-supervised learning settings, where only a small portion of the data labels is known, and we also proposed a manifold-based extension of the kernel.

6.9 Discussion

We proposed hierarchical and graph-based hierarchical cluster kernels for supervised and semi-supervised learning. These kernels calculate dot products in feature space using the ultrametric distance matrix resulting from a “special” agglomerative clustering. HCK uses the cluster assumption of semi-supervised learning, while gHCK uses both the cluster and the manifold assumption.

Looking at Table 6.1 and then considering the results returned by the hierarchical cluster kernels in Tables 6.2 and 6.3, we can see that the baseline results are always outperformed, that is, for all data sets. The greatest improvement can be observed for the COIL2 data set with 100 labeled points; here, by single linkage HCK, we achieved an improvement of 13.5%, but the improvements of 5.93% and 5.28% for the USPS and Text data sets are also quite impressive.

The results are comparable to those of other data-dependent kernels. HCK and gHCK also mostly outperform the ISOMAP kernel (Table 6.4) – except for Digit1 – and the neighborhood and bagged cluster kernels (Tables 6.4 and 6.5) – with the same exception of Digit1. LapSVM mostly outperforms our kernels (Table 6.5), but sometimes the results are close. The multi-type cluster kernel (Table 6.6) yields better results only for the Text data set, using the step and linear step transfer functions. We may conclude that HCK and gHCK are useful kernels for almost all types of data sets; the data sets were chosen to cover artificial, real, imbalanced and high-dimensional sets.

The hierarchical cluster kernel is strongly connected to the connectivity kernel proposed in [Fischer et al., 2003], although the freedom in choosing the linkage function makes it a more flexible technique for constructing kernels for different data settings.

The proposed hierarchical cluster kernel offers a framework for constructing different


cluster kernels, by using different linkage functions and clustering techniques. Non-hierarchical methods can be used as well in kernel construction, by applying divisive clustering and using an arbitrary non-hierarchical method. The single condition that has to be satisfied by the clustering method is the following relation regarding cluster dissimilarities: if D(C_1, C_2) ≤ D(C_1, C_3) and D(C_1, C_2) ≤ D(C_2, C_3), then D(C_1, C_2) ≤ D(C_12, C_3).


Chapter 7

Variations on the bagged cluster kernel

The bagged cluster kernel presented in Section 4.3 combines a “reweighting” kernel with a base kernel in such a way that the base kernel values are reweighted by the probabilities that the points lie in the same cluster. Methods for efficiently combining different kernels obtained from different sources have been studied in numerous articles, the aim being the construction of more powerful kernels providing better similarities. In this chapter we present two cluster kernels similar to the bagged cluster kernel: the resulting kernel matrix is the Hadamard product of the base kernel and a reweighting kernel.

7.1 The bagged cluster kernel

We described the bagged cluster kernel in Section 4.3. We saw that the kernel reweights the base kernel values by the probability that the points belong to the same cluster, and these probabilities are calculated using the k-means algorithm [Bishop, 2006]. K-means minimizes the following cost/energy function

    F(U, c_1, ..., c_K) = (1/2) Σ_{k=1}^{K} Σ_{n=1}^{N} U_{kn} ‖x_n − c_k‖²

where c_k, k = 1, ..., K, denote the cluster centers and U_{kn} ∈ {0, 1}, n = 1, ..., N, k = 1, ..., K, are the cluster indicators; U_{kn} = 1 if x_n belongs to the kth cluster and 0


otherwise. The constraints are Σ_{k=1}^{K} U_{kn} = 1 for all n, which means – in the case of k-means – that each point can belong to only one cluster. Fuzzy c-means is a generalization of simple k-means, where U_{kn} takes a real value from the interval [0, 1] and the columns of the matrix U form probability distributions over the clusters. K-means and fuzzy c-means are easily kernelizable [Dhillon et al., 2004; Vishwanathan and Murty, 2002], because the Euclidean distance can be written using only dot products. The energy function is usually minimized by the following EM-style iterative algorithm:

Algorithm 13 K-means
1: Initialize the cluster centers randomly.
2: Compute U:
       U_{kn} = 1 if ‖x_n − c_k‖² ≤ ‖x_n − c_m‖² for all m ≠ k, and U_{kn} = 0 otherwise.
3: Compute the cost function F.
4: If converged, STOP.
5: Otherwise compute the new centers,
       c_k = (1 / Σ_{n=1}^{N} U_{kn}) Σ_{n=1}^{N} U_{kn} x_n,
   then go to step 2.
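A direct NumPy transcription of Algorithm 13 (an illustrative sketch only; the random initialization, the iteration cap and the empty-cluster guard are my own choices, not prescribed by the thesis):

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=None):
        """X: N x d data matrix. Returns the K x N indicator matrix U and the centers."""
        rng = np.random.default_rng(seed)
        N = X.shape[0]
        centers = X[rng.choice(N, size=K, replace=False)]              # step 1
        U = np.zeros((K, N))
        for _ in range(n_iter):
            d2 = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(-1)  # K x N squared distances
            U_new = np.zeros((K, N))
            U_new[d2.argmin(axis=0), np.arange(N)] = 1.0               # step 2: nearest center
            if np.array_equal(U_new, U):                               # step 4: converged
                break
            U = U_new
            counts = U.sum(axis=1, keepdims=True)
            counts[counts == 0] = 1.0                                  # guard against empty clusters
            centers = (U @ X) / counts                                 # step 5: new centers
        return U, centers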

The output of the above algorithm, that is, the indicator values U_{kn}, strongly depends on the initial cluster centers; therefore the algorithm is usually run several times to obtain consistent results. The bagged kernel is a positive semi-definite kernel, because it is the dot product of the vectors φ_bag(x) = ([c_j(x) = q] : j = 1, ..., t, q = 1, ..., K). By the Schur product theorem [Lütkepohl, 1996], the kernel defined in (4.2) by the elementwise, Hadamard or Schur product is also positive semi-definite.

In matrix notation one can write the bagged cluster kernel as follows:

    K_bag = (1/t) Σ_{i=1}^{t} U^{(i)}′ U^{(i)}

where U is the K × N cluster membership matrix and U^{(i)} now denotes the ith such


matrix.
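With the k-means sketch given after Algorithm 13, the reweighting part of the bagged cluster kernel can be written directly from this formula (again only a sketch; the function name and the choice of seeds are mine, and t and K are the free parameters):

    import numpy as np

    def bagged_reweighting_kernel(X, K=2, t=20):
        """Average co-clustering indicator over t randomly initialized k-means runs,
        K_bag = (1/t) sum_i U^(i)' U^(i)."""
        N = X.shape[0]
        K_bag = np.zeros((N, N))
        for i in range(t):
            U, _ = kmeans(X, K, seed=i)      # k-means as sketched in this section
            K_bag += U.T @ U
        return K_bag / t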

7.2 Computing the reweighting kernel

7.2.1 Combining kernels

Methods for combining kernels to improve classification performance have been studied in many papers. The most popular technique studied is the linear combination of kernels. Kandola et al. [2002a], for example, linearly combine different kernels and maximize kernel alignment [Cristianini et al., 2001], while Diosan et al. [2007] perform optimization with genetic algorithms, using the classifier's accuracy to determine the fitness of an individual, to mention just a few approaches.

In Section 3.2 we already defined kernel matrices as positive (semi-)definite matrices. In this section we deal only with real Gram matrices. To repeat the definition here, we call a real symmetric matrix A of size N × N positive semi-definite if for all x ∈ R^N

    x′ A x ≥ 0

We use the notation A ⪰ 0 to indicate that A is positive semi-definite. To verify the positive semi-definiteness of a matrix, one can look for example at its eigenvalues, since a real symmetric matrix is positive semi-definite if and only if its eigenvalues are non-negative.

In this section we enumerate some kernel combinations which result in positive semi-definite kernels; because of the repeated use of the expression “positive semi-definite” we will use the acronym “psd”. The following statements (theorems) are taken from [Lütkepohl, 1996; Abadir and Magnus, 2005].

(1) If K is positive definite, then K⁻¹ is positive definite too.

(2) If K_1 and K_2 are psd matrices, then so is K_1 + K_2.

(3) If K is a psd matrix and a > 0 is a real scalar, then a · K is a psd matrix.

(4) If K is a psd matrix of size N × N and Z is an arbitrary N × n matrix, then Z′ K Z is psd.

(5) If K_1 and K_2 are psd matrices and K_1 K_2 = K_2 K_1, then K_1 K_2 is a psd matrix.

(6) If K is a psd matrix and n ∈ ℕ \ {0}, then Kⁿ is psd.

(7) If K_1 and K_2 are psd matrices, then so is K_1 ⊙ K_2, where ⊙ denotes the Hadamard product.

(8) If K_1 and K_2 are psd matrices, then so is K_1 ⊕ K_2, where ⊕ denotes the direct sum.

(9) If K_1 and K_2 are psd matrices, then so is K_1 ⊗ K_2, where ⊗ denotes the Kronecker product.

The Hadamard product (Schur product, elementwise product), the direct sum and the Kronecker product of matrices are defined in the following way:

    (A ⊙ B)_{ij} = A_{ij} · B_{ij}

    A ⊕ B = [ A  O ]
            [ O  B ]

    A ⊗ B = [ A_{11}·B  ···  A_{1n}·B ]
            [    ⋮       ⋱      ⋮     ]
            [ A_{m1}·B  ···  A_{mn}·B ]
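As a quick numerical illustration of properties (2), (3) and (7)–(9) (a toy check of my own, not part of the thesis), the combinations can be formed and tested for positive semi-definiteness in a few lines:

    import numpy as np

    A = np.array([[2.0, 1.0], [1.0, 2.0]])          # a small psd matrix
    B = np.array([[1.0, 0.5], [0.5, 1.0]])          # another psd matrix

    hadamard = A * B                                 # elementwise (Schur) product
    direct_sum = np.block([[A, np.zeros_like(B)],
                           [np.zeros_like(A), B]])
    kronecker = np.kron(A, B)

    for M in (A + B, 3.0 * A, hadamard, direct_sum, kronecker):
        assert np.linalg.eigvalsh(M).min() >= -1e-12  # all eigenvalues non-negative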

We will use the second and the third property from the above list, and the property involving Hadamard products. For the first two of these the proofs are trivial, because they follow directly from the definition of positive semi-definite matrices:

    if x′ K_1 x ≥ 0 and x′ K_2 x ≥ 0, then x′ (K_1 + K_2) x ≥ 0 for all x;

    if x′ K x ≥ 0, then x′ (a K) x ≥ 0 for all x.

The following proof for the Hadamard product of Gram matrices is based on [Yang et al., 2002]. First we state the theorem.

Theorem 7.1 (Schur product theorem). If A and B are positive semi-definite matrices, then so is A ⊙ B, where ⊙ denotes the Hadamard product defined above.

In proving the theorem we will use the following two lemmata:


Lemma 7.1. If A, B and C, D are real matrices of the same size, then

    (A ⊙ B) · (C ⊙ D) = (A C) ⊙ (B D)                     (7.1)

Lemma 7.2. If A and B are real matrices of the same size, then

    (A ⊙ B)′ = A′ ⊙ B′                                    (7.2)

Proof. We denote the spectral decompositions of A and B by the following matrices:

    A = U S U′,    B = V T V′.

Using equations (7.1) and (7.2), we can write

    A ⊙ B = (U S U′) ⊙ (V T V′)
          = (U ⊙ V)(S ⊙ T)(U′ ⊙ V′)
          = (U ⊙ V)(S ⊙ T)(U ⊙ V)′

and we also know that

    (U ⊙ V)(U ⊙ V)′ = (U U′) ⊙ (V V′) = I ⊙ I = I

because by definition U and V are orthonormal matrices. From this it directly follows that (U ⊙ V)(S ⊙ T)(U ⊙ V)′ gives the spectral decomposition of A ⊙ B; thus for the eigenvalues or spectrum of A ⊙ B it holds that λ(A ⊙ B) = λ(A) · λ(B). Furthermore, since A and B are psd matrices we know that λ(A) ≥ 0 and λ(B) ≥ 0, from which it follows that λ(A ⊙ B) ≥ 0, which is equivalent to the statement that A ⊙ B is a psd matrix.

7.2.2 Using the Hadamard product for kernel reweighting

Following the idea from the construction of the bagged cluster kernel, we develop similar techniques that reweight the kernel matrix using the cluster structure of the labeled – and unlabeled – data. Thus, if two points are in the same cluster, their similarity obtains a high weight, ≈ 1 or > 1, while if they are in different clusters, their similarity will be multiplied by a lower weight which is < 1 or ≪ 1. The weights of the original


or base kernel matrix will form another kernel matrix, which we call the reweighting kernel, in order to guarantee the positive semi-definiteness of the final kernel. Therefore – similarly to the bagged cluster kernel – the new cluster kernel becomes

    k(x, z) = k_rw(x, z) · k_b(x, z)

where k_rw(·, ·) denotes the reweighting kernel, while k_b(·, ·) denotes the base kernel. In matrix notation we can write

    K = K_rw ⊙ K_b

There are two problems with the construction of the above cluster kernel: (i) the reweighting kernel must be positive semi-definite, (ii) the base kernel matrix has to be positive. The first condition is obviously important; with this condition we can guarantee the positive semi-definiteness of the resulting cluster kernel. However, the second condition is also important, because although the kernel matrix is positive semi-definite, it can contain negative values. For negative values a quite different and more complex reweighting scheme would have to be used. To avoid such situations, which encumber building a reweighting kernel, we require the base kernel matrix to be positive, that is, k_b(x, z) ≥ 0 for all x, z. Using inner products this can be accomplished by shifting each dimension of the data to the positive half-space.

Here we present two methods for constructing a positive semi-definite reweighting kernel.

Gaussian reweighting kernel

Suppose that we are given the output of an arbitrary clustering algorithm, the cluster membership matrix U of size K × N, where K is the number of clusters and N is the number of points. We distinguish between two classes of clusterings here: partitional and fuzzy clustering methods. Partitional methods yield a cluster membership matrix in which each column contains exactly one value of 1 and all the other entries are zeros; a fuzzy clustering algorithm produces a matrix U where each column is a probability distribution over the clusters, i.e. U_ij contains the probability that the jth point belongs to the ith cluster. In both of these cases we assume that two points belong to the same cluster(s) if their cluster membership vectors are similar or close to each other. Thus we define similarity using the Gaussian kernel in the following way:

    k_rw(x, z) = exp( −‖U_{·x} − U_{·z}‖² / (2σ²) )                  (7.3)


where U_{·x} denotes the cluster membership vector of the point x, i.e. the column of U corresponding to x.

From Section 3.2.3 we know that the resulting matrix is always positive semi-definite. If the vectors are all different – which is not the case here, since that would mean every point has a different cluster membership vector – the resulting Gram matrix is positive definite.

Evidently, each value of the kernel matrix will be a real number between 0 and 1. In this case the parameter σ defines the amount of separation between similar and dissimilar points: if σ is large, the gap between the values expressing similarity and dissimilarity becomes larger, while for small σ these values will be close.

Dot product-based reweighting kernel

Another possibility is to use the dot product of the cluster membership vectors, and define the following reweighting kernel:

    K_rw = U′U + α · 11′                                              (7.4)

where U denotes the cluster membership matrix and α ∈ [0, 1). The first member U′U of the kernel represents similarity between the cluster membership vectors, while the second member is used to avoid zeros: if two membership vectors are totally distinct, say [1 0]′ and [0 1]′, then the corresponding entry in the dot product matrix will be zero, and by the Hadamard product the corresponding entry of the resulting kernel will also be zero. If however the resulting clustering is not too “confident”, one should instead use a small value close to 0 for reweighting the base kernel. If two points do not lie in the same cluster, this does not necessarily mean that they have to be in different classes, even if we set the number of clusters to the number of classes. This is accomplished by the term α11′.

Still, we have to prove that this reweighting kernel is positive semi-definite. First, we know that U′U is positive semi-definite for all matrices U. Second, we have to prove that α11′ is psd. We know that 11′ is a rank-one real symmetric matrix. Since it is a rank-one matrix it has only one non-zero eigenvalue, and from the equation tr(11′) = Σ_{i=1}^{N} λ_i [Lütkepohl, 1996] we can calculate this eigenvalue, which equals N. Since N is always positive, 11′ is always positive semi-definite, and from the third property listed in Section 7.2.1 it follows that α11′ is psd. Hence, using the second property, U′U + α11′ is always positive semi-definite.


[Figure 7.1: Kernel reweighting. Representation of points in the space induced by the reweighted kernel: (a) original data, (b) σ = 0.05, (c) σ = 0.5, (d) σ = 1, (e) σ = 2, (f) σ = 20.]

Another version of the above kernel would be

    K_rw = β · U′U + 11′                                              (7.5)

where β ∈ (0, ∞). Here the kernel values for which the dot product matrix of cluster membership vectors is zero remain unchanged because of the 11′ term; however, if the points lie in the same cluster, βU′U gives a weight greater than zero, thus the corresponding kernel value is increased by a factor greater than 1.

For fuzzy clusterings, that is, where a column of U contains cluster membership probabilities, we first normalize the columns of U in order to obtain the value 1 in the case of identical cluster membership vectors.

For computing kernel values for new test points we can consider the technique described in Section 6.6.
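A compact sketch covering the three reweighting kernels of this chapter (my variable and function names; the default parameters follow the fixed settings reported in Section 7.4, and the base kernel matrix is assumed to have non-negative entries):

    import numpy as np

    def reweighting_cluster_kernel(U, K_base, kind="gauss",
                                   sigma=1 / np.sqrt(2), alpha=0.3, beta=0.5):
        """U: K x N cluster membership matrix, K_base: N x N non-negative base kernel.
        Returns the Hadamard product of the reweighting kernel and the base kernel."""
        N = U.shape[1]
        if kind == "gauss":                                    # equation (7.3)
            sq = ((U[:, :, None] - U[:, None, :]) ** 2).sum(axis=0)
            K_rw = np.exp(-sq / (2.0 * sigma ** 2))
        elif kind == "dot_alpha":                              # equation (7.4)
            K_rw = U.T @ U + alpha * np.ones((N, N))
        else:                                                  # equation (7.5)
            K_rw = beta * (U.T @ U) + np.ones((N, N))
        return K_rw * K_base                                   # elementwise (Hadamard) product

The membership matrix U can come, for example, from the k-means sketch of Section 7.1 or from a hierarchical clustering cut at K clusters.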

Figure 7.1 illustrates a small data set of 10 points together with its representation in the space induced by the Gaussian reweighting kernel, using the dot products of the points as the base kernel. To determine the weights we performed binary spectral clustering, and the illustrations (b)–(f) show the new representation – reduced to 2 dimensions – obtained by varying the parameter of the Gaussian kernel used for computing the weights. For constructing the Laplacian we used the Gaussian kernel with parameter σ = 5, and we obtained the following clustering of the points:

    {1, 3, 8}, {2, 4, 5, 6, 7, 9, 10}

We can see that for small σ we recover the same representation, since every entry of the weight matrix tends towards 1, while for a large σ the points of the clusters become well separated.


[Figure 7.2: Data setting where conventional clustering fails. Clustering algorithms will identify the two Gaussians as separate clusters, while class labels – indicated by the different colors and shapes – show a different separation.]


7.3 Getting the clustering

In order to construct the reweighting kernel the points have to be clustered, i.e. a clustering algorithm must be involved. In the optimal case, if we set the number of clusters to the number of classes in the classification problem, we should obtain a partition which corresponds to the proper class separation of the data. However, this is usually not the case. Consider the situation shown in Figure 7.2: the positive class is denoted by the red circles, while the blue x's denote the negative class. Without any additional knowledge every clustering algorithm would identify the left and right hand Gaussians as separate clusters; however, the classification rule is very simple: if x_2 ≥ 0 then the point belongs to the positive class, otherwise it belongs to the negative class. Thus we cannot zero a kernel value if the clustering algorithm puts the respective points in different clusters; we can only say that if two points are in the same cluster they are more likely to be in the same class – as the cluster assumption states (see Section 2.1.2). From this point of view the most accurate technique would be to use the reweighting kernel defined in equation (7.5): for points lying in the same cluster the similarity provided by the base kernel is increased, while for points lying in different clusters the base kernel value remains unchanged.


                  10                          100
    USPS    (a)   86.17 (weighted, k = 6)     95.29 (ward, k = 6)
            (b)   86.11 (weighted, k = 6)     95.50 (ward, k = 6)
            (c)   84.97 (weighted, k = 4)     95.29 (ward, k = 6)
    Digit1  (a)   89.06 (ward, k = 6)         94.94 (average, k = 6)
            (b)   89.06 (ward, k = 6)         95.29 (average, k = 7)
            (c)   89.06 (ward, k = 6)         94.57 (average, k = 6)
    COIL2   (a)   62.08 (weighted, k = 6)     85.93 (weighted, k = 6)
            (b)   62.08 (weighted, k = 6)     85.64 (weighted, k = 6)
            (c)   62.95 (complete, k = 7)     86.07 (ward, k = 4)
    Text    (a)   62.35 (ward, k = 7)         68.07 (median, k = 7)
            (b)   61.28 (ward, k = 7)         71.14 (weighted, k = 3)
            (c)   59.13 (ward, k = 7)         71.21 (weighted, k = 2)

Table 7.1: Reweighting cluster kernels (RCK) using hierarchical clustering. In brackets we indicate the settings yielding the best results.


7.4 Experimental methodology and test results

We performed experiments using the three cluster kernels described in Section 7.2.2. The evaluation sets are the same as in the previous chapter, that is, the USPS, Digit1, COIL2 and Text data sets. For learning the decision function we used SVMs; we used the LIBSVM implementation, as in the previous chapter. The base kernel we experimented with was the linear kernel.

Tables 7.1, 7.2 and 7.3 show the results obtained using the different reweighting kernels, i.e. the kernels defined in equations (7.3), (7.4) and (7.5). The best parameters of the respective clustering algorithm are listed in brackets after the accuracy results. The parameters of the reweighting kernels were fixed during the experiments; namely, the following settings were used:


                  10                 100
    USPS    (a)   84.45 (k = 7)      92.98 (k = 7)
            (b)   83.86 (k = 7)      92.45 (k = 7)
            (c)   83.66 (k = 7)      92.59 (k = 7)
    Digit1  (a)   83.14 (k = 2)      94.28 (k = 6)
            (b)   84.58 (k = 2)      94.08 (k = 7)
            (c)   84.13 (k = 2)      92.96 (k = 7)
    COIL2   (a)   58.55 (k = 7)      83.76 (k = 7)
            (b)   58.95 (k = 7)      83.76 (k = 7)
            (c)   58.60 (k = 7)      83.28 (k = 6)
    Text    (a)   −                  −
            (b)   −                  −
            (c)   −                  −

Table 7.2: Reweighting cluster kernels obtained with k-means. The results are averaged over 10 runs.

• Gaussian reweighting kernel defined in (7.3): σ = 1/√2

• reweighting kernel defined in (7.4): α = 0.3

• reweighting kernel defined in (7.5): β = 0.5

In the tables the three kernels are denoted by the three letters (a), (b) and (c).

Three clustering techniques were applied to build the reweighting kernel. In Table 7.1 the results obtained using hierarchical clustering techniques are shown. Here we used the MATLAB names of the different linkage functions:

• single – single linkage

• complete – complete linkage

• average – average linkage

• weighted – weighted average linkage

• centroid – average group linkage

• median – weighted average group linkage


                  10                 100
    USPS    (a)   81.43 (k = 7)      90.87 (k = 5)
            (b)   81.63 (k = 7)      91.39 (k = 5)
            (c)   81.16 (k = 7)      91.56 (k = 7)
    Digit1  (a)   88.32 (k = 2)      95.20 (k = 7)
            (b)   88.32 (k = 2)      94.64 (k = 2)
            (c)   88.32 (k = 2)      94.73 (k = 7)
    COIL2   (a)   58.22 (k = 7)      83.83 (k = 6)
            (b)   58.03 (k = 6)      83.37 (k = 6)
            (c)   55.83 (k = 7)      83.20 (k = 6)
    Text    (a)   63.26 (k = 5)      66.93 (k = 2)
            (b)   61.50 (k = 5)      70.07 (k = 3)
            (c)   59.26 (k = 5)      71.00 (k = 2)

Table 7.3: Reweighting cluster kernels with spectral clustering using k-means. The results are averaged over 10 runs.

• ward – Ward’s method

Table 7.2 shows the results obtained using k-means clustering. Because the output of k-means strongly depends on the initial cluster centers, we repeated the k-means clustering 10 times and averaged the accuracy results over the 10 runs. For the Text data set k-means turned out to be inadequate because of its high dimensionality (here inadequate means that it had a very high running time).

Finally, we used spectral clustering to obtain the kernel: we used the normalized spectral clustering method – the version of Shi and Malik – as described in [von Luxburg, 2006], with k-means clustering. Hence, similarly to the case of k-means, we averaged the accuracy results over 10 runs.

7.5 Discussion

In this chapter we proposed three types of reweighting cluster kernels (RCK) for supervised and semi-supervised learning. These kernels reweight the similarities provided by a base kernel, observing the cluster structure of the data returned by a clustering algorithm.

Comparing the results returned by the linear and Gaussian kernels (Table 6.1) with


the results provided by the reweighting kernels using the linear kernel as the base kernel (Tables 7.1, 7.2 and 7.3), the baseline results are clearly outperformed. The three different approaches, however, provide very similar results: in most cases the accuracy results differ only in the decimals. The best results are obtained using hierarchical clustering methods, while k-means and normalized spectral clustering produce similar outputs. The results provided by these kernels are comparable to those of other data-dependent kernels too.

Another possibility is to use constrained clustering, in order to exploit the class information of the labeled points. Using constrained clustering techniques, it is expected that by involving the label information the clustering algorithm would produce a grouping equivalent to the true class separation. However, since constrained clustering can itself be considered a semi-supervised learning technique, this would mean constructing kernels for semi-supervised learning using semi-supervised learning. This possibility thus remains to be weighed.


Chapter 8

Conclusions

Data-dependent kernels combine kernel methods and semi-supervised learning by constructing kernels – thus implicitly giving a new representation for the examples – with the use of the labeled and unlabeled data sets. We call these kernels data-dependent because the kernel function depends on the entire data set of labeled and unlabeled points available in the training phase. Mathematically speaking, if D_1 ≠ D_2 and x, z ∈ D_1 ∩ D_2, then k(x, z; D_1) is not necessarily equal to k(x, z; D_2). In the thesis we proposed data-dependent kernels and kernel construction methods for supervised and semi-supervised learning settings. Namely, we introduced the following data-dependent kernels:

• Wikipedia-based kernels for text categorization (Chapter 5)

• hierarchical cluster kernels (Chapter 6)

• reweighting cluster kernels (Chapter 7)

Text categorization is the task of using some training examples to build a supervised learner which is capable of assigning one or more predefined categories to natural language documents. The Wikipedia-based kernel is built by approximating the Wikipedia matrix by a lower rank matrix, in order to filter out irrelevant terms and, at the same time, provide a richer representation for documents. Our method can be interpreted as simply replacing the covariance matrix of terms, built using the training documents, by the covariance


matrix induced by Wikipedia. Since some of the indexing terms selected by the χ² feature selection method were not found in Wikipedia, we have two baselines here: one with 5209 and the other with 4601 terms. Our method slightly outperformed only the χ² feature selection method with 4601 terms. The reasons for the inferior results here could be the specificity of the base term selection method, the distributional differences of words in Reuters-21578 and Wikipedia, etc. Further research in this direction includes choosing other term selection methods, finding a better document representation using the term vectors and performing experiments on other TC corpora too.

Hierarchical cluster kernels are constructed using the ultrametric property of certain linkage distances used in agglomerative clustering methods. In Chapter 6 we proposed two such kernels: one which is based on the cluster assumption and another one using also the manifold assumption. Of course – similarly to all semi-supervised techniques – these are expected to improve performance if at least one of the assumptions holds. We also proposed a general framework for constructing hierarchical cluster kernels. The kernels were tested on different data sets used for benchmarks in [Chapelle et al., 2006]. HCK and gHCK significantly outperformed the linear and Gaussian kernels used as baseline measures; the greatest improvement measured between the best baseline and the HCK/gHCK results was 13.5%, for the COIL2 data set. We also compared our kernels to other data-dependent kernels and found them comparable: HCK and gHCK often yielded better results than the other kernels. Based on the experiments we may conclude that HCK and gHCK are useful for almost all types of data sets. There is room for improvement here too: testing the method proposed for calculating kernel values for unseen points.

In Chapter 7 we presented three reweighting schemes for constructing reweighting cluster kernels (RCK). The underlying idea is borrowed from the bagged cluster kernel: one builds a positive (semi-)definite matrix (kernel) by which one reweights the values of a base kernel, e.g. a linear or Gaussian kernel. Two types of reweighting kernels were studied: the Gaussian reweighting kernel and reweighting kernels using the dot products of the cluster membership vectors. The experiments show similar results for these kernels, which clearly outperform the baseline results, and show these cluster kernels to be comparable to the existing data-dependent kernels.

The work presented in this thesis shows that by using data-dependent kernels one can significantly outperform conventional kernel methods. As the experiments show, the proposed kernels can be used on a large variety of data sets. The results of the research


carried out in the present thesis may therefore serve as useful and efficient tools in many application domains where classification algorithms are needed.


Appendix A

Data sets

A.1 Two-moons

The famous “two-moons” data set [Chapelle et al., 2006; Belkin et al., 2006] is a synthetic/toy data set used to demonstrate the behaviour of a semi-supervised learning algorithm. The data set contains two classes which correspond to the two crescents, positioned as shown in Figure A.1 such that they cannot be separated linearly. Then a small number of points – e.g. 2 points, 1 from each class – is selected and labeled. In the figure the square and the circle denote the labeled points¹.

A.2 Reuters-21578

The Reuters text categorization test collection contains documents which appeared on the Reuters newswire in 1987. These documents were manually categorized by personnel from Reuters Ltd. and Carnegie Group Inc. The collection was made available for scientific research in 1990.

There are three commonly used variations of the Reuters corpus:

• the Reuters-21578 ModApté split with the 10 largest categories

• the Reuters-21578 ModApté split with 90 categories

¹One version of the data set can be found in the ManifoldLearn MATLAB package, downloadable from http://manifold.cs.uchicago.edu/manifold_regularization/software.html


[Figure A.1: The “two-moons” data set.]

                Training set   Test set                 Training set   Test set
    earn        2877           1087         trade       369            117
    acq         1650           719          interest    347            131
    money-fx    538            179          ship        197            89
    grain       433            149          wheat       212            71
    crude       389            189          corn        181            56

Table A.1: The 10 most frequent categories from the Reuters-21578 corpus, ModApté split with 90 categories.

• the Reuters-21578 ModApté split with 115 categories

We used the second one, that is, the Reuters-21578 ModApté split corpus with 90 categories, being the most widely used corpus for evaluating TC systems.

Originally the Reuters-21578 corpus contained 21 578 documents – that is where its name comes from; however, some of these documents (namely 8676) were removed, and there are also uncategorized documents left in the corpus². The ModApté split, which is used for the evaluation of almost all TC systems, contains 12 902 documents divided in the following way:

• 9603 documents for training

• 3299 documents for testing

²The 90- and 115-categories versions of Reuters can be downloaded from the homepage of Alessandro Moschitti, http://dit.unitn.it/~moschitt/corpora.htm


The Reuters-21578 corpus is very imbalanced, as is clear from Table A.1, where we list the 10 most frequent categories from the Reuters corpus ModApté split with 90 categories. Furthermore, there are 7 categories containing only one training example, and these are: castor-oil, cotton-oil, groundnut-oil, lin-oil, nkr, rye, sun-meal. However, there are many other problems with this corpus: it contains many misspellings, documents with an empty body containing only the title of the news as usable information – there are 869 documents in the training corpus and 233 in the test corpus containing only the sentence “Blah blah blah.” besides the news title – etc.

In the Reuters-21578 ModApté split with 90 categories a document is assigned on average to 1.2 categories.

For more information about the Reuters-21578 corpus see [Debole and Sebastiani, 2004] or the homepage of David Lewis³.

A.3 USPS

This and the following three evaluation data sets can be downloaded from the address http://www.kyb.tuebingen.mpg.de/ssl-book/benchmarks.html, the website of the book [Chapelle et al., 2006].

The USPS (United States Postal Service) data set is derived from the USPS set of handwritten digits⁴. The data set is imbalanced, since it was created by putting the digits 2 and 5 into one class and the rest into the second class, after first selecting 150 images for each of the ten digits. Because the set was used as a benchmark data set for the algorithms presented in the book [Chapelle et al., 2006], it was obscured using the following algorithm, in order to make it hard or even impossible to cheat, that is, to recognize where the data comes from.

Algorithm 14 Data obfuscation algorithm
1: Randomly select and permute 241 features (dimensions).
2: Add to each feature a random bias drawn from N(0, 1).
3: Multiply each column by a uniform random value from [−1, −0.5] ∪ [0.5, 1].
4: Add independent noise drawn from N(0, σ²I) to each data point.
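A NumPy rendering of these steps (my reimplementation for illustration only, not the benchmark's original code; it assumes the 241 features have already been selected, so only the permutation part of step 1 is shown):

    import numpy as np

    def obfuscate(X, sigma=0.1, seed=None):
        """X: N x 241 data matrix; returns an obscured copy following Algorithm 14."""
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        X = X[:, rng.permutation(d)]                          # step 1: permute the features
        X = X + rng.normal(size=d)                            # step 2: random bias per feature
        scale = rng.uniform(0.5, 1.0, size=d) * rng.choice([-1.0, 1.0], size=d)
        X = X * scale                                         # step 3: values in [-1,-0.5] u [0.5,1]
        return X + rng.normal(scale=sigma, size=X.shape)      # step 4: independent Gaussian noise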

³http://www.daviddlewis.com/resources/testcollections/reuters21578/
⁴http://archive.ics.uci.edu/ml/datasets/


The parameter σ was set to 0.1. This data set contains 1500 examples and 2 × 12 splits of the data, where the first 12 splits contain 10 labeled and 1490 unlabeled examples, while the second 12 splits contain 100 labeled and 1400 unlabeled examples.

A.4 Digit1

This data set contains artificial writings of the digit 1. Similarly to the previous data set, it contains 1500 examples. The classes were formed according to the tilt angle of the digit. The data was obscured with the algorithm described above, setting σ = 0.05.

A.5 COIL2

The COIL2 (COIL – Columbia Object Image Library) data set was derived from the COIL-100 object recognition library⁵, which contains color pictures of a set of objects taken from different angles. The pictures were downsampled from 128 × 128 pixels to 16 × 16 pixels using averaging. Originally the COIL data set used in [Chapelle et al., 2006] contained 6 classes, but we used the two-class version of the set, which can be found on the web page of the book. We found no additional information on this page regarding the forming of the two classes. The obscuring algorithm was applied again, this time using σ = 2.

A.6 Text

The Text corpus is also a text categorization corpus, derived from the 20Newsgroups data set⁶. The original corpus contains approximately 20 000 documents categorized nearly evenly into 20 categories/newsgroups. Thus, unlike Reuters-21578, it is a highly balanced corpus. The Text data set was derived from 20Newsgroups using only the 5 comp.* categories: the category comp.sys.ibm.pc.hardware constitutes one class, while the remaining comp.* categories constitute the other.

⁵http://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php
⁶Downloadable from http://people.csail.mit.edu/jrennie/20Newsgroups/ or see A. Moschitti's home page


The documents, having a dimension of 11 960, are represented using the tfidf weighting. To form the two classes, 750 documents were selected randomly from the two classes mentioned previously.


Bibliography

K. M. Abadir and J. R. Magnus. Matrix Algebra. Cambridge University Press, 2005.

A. Aizerman, E. M. Braverman, and L. I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.

R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999. URL http://sunsite.dcc.uchile.cl/irbook.

S. Banerjee and T. Pedersen. An adapted Lesk algorithm for word sense disambiguation using WordNet. In CICLing, pages 136–145, 2002. URL http://link.springer.de/link/service/series/0558/bibs/2276/22760136.htm.

S. Basu. Semi-supervised Clustering: Probabilistic Models, Algorithms and Experiments. PhD thesis, The University of Texas at Austin, 2005.

M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006. URL http://www.jmlr.org/papers/v7/belkin06a.html.

A. Berger. Error-correcting output coding for text classification. In Proceedings of IJCAI-99 Workshop on Machine Learning for Information Filtering, 1999.

P. Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.

M. Bernstein, V. de Silva, J. C. Langford, and J. B. Tenenbaum. Graph approximations to geodesics on embedded manifolds, 2000. URL http://isomap.stanford.edu/BdSLT.pdf.

T. D. Bie. Semi-Supervised Learning Based On Kernel Methods And Graph Cut Algorithms. PhD thesis, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, 3001 Leuven (Heverlee), 2005.


T. D. Bie, J. A. K. Suykens, and B. D. Moor. Learning from general label constraints. In A. L. N. Fred, T. Caelli, R. P. W. Duin, A. C. Campilho, and D. de Ridder, editors, Structural, Syntactic, and Statistical Pattern Recognition, Joint IAPR International Workshops, SSPR 2004 and SPR 2004, Lisbon, Portugal, August 18-20, 2004, Proceedings, volume 3138 of Lecture Notes in Computer Science, pages 671–679. Springer, 2004. ISBN 3-540-22570-6.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer Verlag, New York, 2006.

A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proc. 18th International Conf. on Machine Learning, pages 19–26. Morgan Kaufmann, San Francisco, CA, 2001.

Z. Bodó. Hierarchical cluster kernels for supervised and semi-supervised learning. In Proceedings of the 4th International Conference on Intelligent Computer Communication and Processing (ICCP 2008), pages 9–16. IEEE, August 28–30 2008.

Z. Bodó and Z. Minier. On supervised and semi-supervised k-nearest neighbor algorithms. In Proceedings of the 7th Joint Conference on Mathematics and Computer Science, volume LIII, pages 79–92. Studia Universitatis Babes-Bolyai, Series Informatica, July 2008.

Z. Bodó and Z. Minier. Semi-supervised feature selections with SVMs. In Proceedings of the conference Knowledge Engineering: Principles and Techniques (KEPT 2009), pages 159–162. Presa Universitara Clujeana, July 2–4 2009. Special Issue of Studia Universitatis Babes-Bolyai, Series Informatica.

Z. Bodó, Z. Minier, and L. Csató. Text categorization experiments using Wikipedia. In Proceedings of the conference Knowledge Engineering: Principles and Techniques (KEPT 2007), pages 66–72. Presa Universitara Clujeana, June 6–7 2007. Special Issue of Studia Universitatis Babes-Bolyai, Series Informatica.

I. Borg and P. J. F. Groenen. Modern Multidimensional Scaling, 2nd edition. Springer-Verlag, New York, 2005.

B. E. Boser, I. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. Computational Learning Theory, 5:144–152, 1992.

C. J. C. Burges. A tutorial on support vector machines for pattern recognition. In Knowledge Discovery and Data Mining, volume 2, pages 121–167. 1998.

R. Busa-Fekete and A. Kocsor. Locally linear embedding and its variants for feature extraction. In IEEE International Workshop on Soft Computing Applications, SOFA 2005, 2005.


C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. In S. Becker, S. Thrun, and K. Obermayer, editors, NIPS, pages 585–592. MIT Press, 2002. ISBN 0-262-02550-7. URL http://books.nips.cc/papers/files/nips15/AA13.pdf.

O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. MIT Press, Sept. 2006. URL http://www.kyb.tuebingen.mpg.de/ssl-book/.

F. Chung. Spectral graph theory (reprinted with corrections). In CBMS: Conference Board of the Mathematical Sciences, Regional Conference Series, 1997.

D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996. ISSN 1076-9757.

C. Corley, A. Csomai, and R. Mihalcea. Text semantic similarity, with applications. In Proceedings of International Conference Recent Advances in Natural Language Processing (RANLP 2005), September 2005.

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, Cambridge, MA, 2001.

T. Cover and J. Thomas. Elements of Information Theory, Second Edition. Wiley-Interscience, 2006.

T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13, 1967.

K. Crammer and Y. Singer. A new family of online algorithms for category ranking. In The 25th Annual International ACM SIGIR Conference. ACM, 2002. URL http://doi.acm.org/10.1145/564376.564404.

N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. S. Kandola. On kernel-target alignment. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada], pages 367–373. MIT Press, 2001. URL http://www-2.cs.cmu.edu/Groups/NIPS/NIPS2001/papers/psgz/LT17.ps.gz.

N. Cristianini, J. Shawe-Taylor, and H. Lodhi. Latent semantic kernels. J. Intell. Inf. Syst, 18(2-3):127–152, 2002.


N. Cristianini, J. Kandola, A. Vinokourov, and J. Shawe-Taylor. Kernel methods for textprocessing. In J. A. K. Suykens, G. Horváth, S. Basu, C. Micchelli, and J. Vandewalle,editors, Advances in Learning Theory: Methods, Models and Applications, pages197–221. IOS Press, 2003.

L. Csató and Z. Bodó. Neurális hálók és a gépi tanulás módszerei (Retele neurale simetode de instruire automata). Presa Universitara Clujeana, Cluj-Napoca, 2008.

L. Csató and Z. Bodó. Decomposition methods for label propagation. In Proceedingsof the conference Knowledge Engineering: Principles and Techniques (KEPT 2009),pages 127–130. Presa Universitara Clujeana, July 2–4 2009. Special Issue of StudiaUniversitatis Babes-Bolyai, Series Informatica.

F. Debole and F. Sebastiani. An analysis of the relative hardness of reuters-21578 sub-sets. Journal of the American Society for Information Science and Technology, 56:971–974, 2004.

S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexingby latent semantic analysis. Journal of the American Society for Information Science,41, June 1990.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incompletedata via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38,1977.

I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: spectral clustering and normalizedcuts. In ACM SIGKDD – Knowledge discovery and data mining, pages 551–556,2004. URL http://doi.acm.org/10.1145/1014052.1014118.

T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correctingoutput codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.

L. Diosan, M. Oltean, A. Rogozan, and J.-P. Pécuchet. Improving SVM performance us-ing a linear combination of kernels. In B. Beliczynski, A. Dzielinski, M. Iwanowski,and B. Ribeiro, editors, Adaptive and Natural Computing Algorithms, 8th Interna-tional Conference, ICANNGA 2007, Warsaw, Poland, April 11-14, 2007, Proceed-ings, Part II, volume 4432 of Lecture Notes in Computer Science, pages 218–227.Springer, 2007. ISBN 978-3-540-71590-0.

R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley and Sons, 2001. ISBN 0-471-05669-3.

W. K. Estes. Classification and Cognition. Oxford University Press, 1994. ISBN 9780195073355.

B. Fischer, V. Roth, and J. M. Buhmann. Clustering with the connectivity kernel. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, NIPS. MIT Press, 2003. ISBN 0-262-20152-6. URL http://books.nips.cc/papers/files/nips16/NIPS2003_AA12.pdf.

N. Fuhr, S. Hartmann, G. Knorz, G. Lustig, M. Schwantner, and K. Tzeras. AIR/X – a rule-based multistage indexing system for large subject fields. In A. Lichnerowicz, editor, Proceedings of RIAO-91, 3rd International Conference “Recherche d'Information Assistee par Ordinateur”, pages 606–623, Barcelona, ES, 1991. Elsevier Science Publishers, Amsterdam, NL. URL http://www.darmstadt.gmd.de/~tzeras/FullPapers/gz/Fuhr-etal-91.ps.gz.

J. Fürnkranz. Round robin classification. Journal of Machine Learning Research, 2:721–747, 2002.

E. Gabrilovich and S. Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of The 20th International Joint Conference on Artificial Intelligence (IJCAI), January 2007.

I. P. Gent, P. Prosser, B. M. Smith, and W. Wei. Supertree construction with constraint programming. In ICCP: International Conference on Constraint Programming (CP), LNCS, 2003.

A. Gliozzo and C. Strapparava. Domain kernels for text categorization. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), pages 56–63, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W/W05/W05-0608.

G. H. Golub and C. F. Van Loan. Matrix Computations, 3rd Edition. The Johns Hopkins University Press, Baltimore, MD, 1996.

C. M. Grinstead and J. L. Snell. Introduction to Probability. AMS, 2003. URL http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/book.html.

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

J. Ham, D. D. Lee, S. Mika, and B. Schölkopf. A kernel view of the dimensionality reduction of manifolds. In C. E. Brodley, editor, Machine Learning, Proceedings of the Twenty-first International Conference (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004, volume 69 of ACM International Conference Proceeding Series. ACM, 2004.

G. Hirst and D. St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms, Aug. 31 1997. URL http://citeseer.ist.psu.edu/109361.html;http://www.cs.utoronto.ca/~pedmonds/cl-group/pubs/ps-files/Hirst+StOnge-Wordnet-95.ps.gz.

P. Jackson and I. Moulinier. Natural Language Processing for Online Applications: Text Retrieval, Extraction & Categorization. John Benjamins, 2002. ISBN 90-272-4989-X. URL http://members.aol.com/JacksonPE/music1/nlp4olap.htm.

A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, 1988.

A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. CSURV: Computing Surveys, 31, 1999.

J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. CoRR, cmp-lg/9709008, 1997. URL http://arxiv.org/abs/cmp-lg/9709008. Informal publication.

T. Joachims. Text categorization with support vector machines: Learning with many relevant features. Technical Report LS VIII-Report, Universität Dortmund, Dortmund, Germany, 1997.

T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In C. Nedellec and C. Rouveirol, editors, Machine Learning: ECML-98, 10th European Conference on Machine Learning, Chemnitz, Germany, April 21-23, 1998, Proceedings, volume 1398 of Lecture Notes in Computer Science, pages 137–142. Springer, 1998. ISBN 3-540-64417-2.

I. T. Jolliffe. Principal Component Analysis. Series in Statistics. Springer Verlag, 2002.

J. Kandola, J. Shawe-Taylor, and N. Cristianini. Optimizing kernel alignment over combinations of kernels. Technical Report 2002-121, Department of Computer Science, Royal Holloway, University of London, UK, 2002a.

J. S. Kandola, J. Shawe-Taylor, and N. Cristianini. Learning semantic similarity. In S. Becker, S. Thrun, and K. Obermayer, editors, NIPS, pages 657–664. MIT Press, 2002b. ISBN 0-262-02550-7. URL http://books.nips.cc/papers/files/nips15/AA22.pdf.

C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In C. Fellbaum, editor, WordNet: An Electronic Lexical Database, pages 265–283. The MIT Press, Cambridge, Massachusetts, 1998.

M. D. Lee, B. Pincombe, and M. Welsh. An empirical evaluation of models of text document similarity. In CogSci2005, pages 1254–1259, 2005. URL www.rpi.edu/~grayw/courses/cogs6100-CEg/fall05/downloads/LeePinWel05_CSC.pdf.

C. Leslie and R. Kuang. Fast kernels for inexact string matching. In COLT: Proceedings of the Workshop on Computational Learning Theory, Morgan Kaufmann Publishers, 2003.

C. S. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for SVM protein classification. In S. Becker, S. Thrun, and K. Obermayer, editors, NIPS, pages 1417–1424. MIT Press, 2002. ISBN 0-262-02550-7. URL http://books.nips.cc/papers/files/nips15/AP03.pdf.

D. D. Lewis. Evaluating and optimizing autonomous text classification systems. In SIGIR ’95, pages 246–254, New York, NY, USA, 1995. ACM Press.

D. Lin. An information-theoretic definition of similarity. In ICML, pages 296–304, 1998.

H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002.

H. Lütkepohl. Handbook of matrices. John Wiley & Sons Ltd., Chichester, 1996. ISBN 0-471-97015-8.

G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. Five papers on WordNet. Technical Report, Cognitive Science Laboratory, Princeton University, 1993.

Z. Minier, Z. Bodó, and L. Csató. Segmentation-based feature selection for text categorization. In Proceedings of the 2nd International Conference on Intelligent Computer Communication and Processing (ICCP 2006), pages 53–59. IEEE, September 1–2 2006.

Z. Minier, Z. Bodó, and L. Csató. Wikipedia-based kernels for text categorization. In Proceedings of the 9th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC 2007), pages 157–164. IEEE, September 26–29 2007.

T. M. Mitchell. The discipline of machine learning. Technical Report CMU-ML-06-108, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 2006.

R. Morelos-Zaragoza. The Art of Error Correcting Coding. John Wiley and Sons, Inc., 2002. ISBN 0-471-49581-6.

A. Moschitti. A study on optimal parameter tuning for Rocchio text classifier. In F. Sebastiani, editor, Advances in Information Retrieval, 25th European Conference on IR Research, ECIR 2003, Pisa, Italy, April 14-16, 2003, Proceedings, volume 2633 of Lecture Notes in Computer Science, pages 420–435. Springer, 2003. ISBN 3-540-01274-5. URL http://link.springer.de/link/service/series/0558/bibs/2633/26330420.htm.

A. Y. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press. URL http://www-2.cs.cmu.edu/~nips/2001papers/psgz/AA35.ps.gz.

K. Nigam. Using unlabeled data to improve text classification. PhD thesis, Carnegie Mellon University, 2001.

K. Nigam, A. McCallum, S. Thrun, and T. M. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103–134, 2000.

L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.

S. Patwardhan, S. Banerjee, and T. Pedersen. Using measures of semantic relatedness for word sense disambiguation. In A. F. Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, 4th International Conference, CICLing 2003, Mexico City, Mexico, February 16-22, 2003, Proceedings, volume 2588 of Lecture Notes in Computer Science, pages 241–257. Springer, 2003. ISBN 3-540-00532-3. URL http://link.springer.de/link/service/series/0558/bibs/2588/25880241.htm.

J. C. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, NIPS, pages 547–553. The MIT Press, 1999. ISBN 0-262-19450-3. URL http://nips.djvuzone.org/djvu/nips12/0547.djvu.

P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. In IJCAI, pages 448–453, 1995.

C. J. van Rijsbergen. Information Retrieval. Butterworths, 1979.

S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000. URL http://www.sciencemag.org/content/vol290/issue5500/.

S. Russell and P. Norvig. Artificial Intelligence: a Modern Approach. Prentice-Hall, 1995.

G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18:229–237, 1975.

L. K. Saul and S. T. Roweis. An introduction to locally linear embedding, 2001.

B. Schölkopf and A. J. Smola. Learning with Kernels. The MIT Press, Cambridge, MA, 2002.

B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Technical Report 44, Max-Planck Institut für biologische Kybernetik, Arbeitsgruppe Bülthoff, Spemannstrasse 38, 72076 Tübingen, Germany, Dec. 1996.

B. Schölkopf, A. J. Smola, and K.-R. Müller. Kernel principal component analysis. Advances in kernel methods: support vector learning, pages 327–352, 1999. URL citeseer.ist.psu.edu/article/sch99kernel.html.

H. Schütze. Automatic word sense discrimination. Computational Linguistics, 24(1):97–123, 1998.

F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.

J. Shlens. A tutorial on principal component analysis, 2005.

V. Sindhwani, P. Niyogi, and M. Belkin. Beyond the point cloud: from transductive to semi-supervised learning. In L. D. Raedt and S. Wrobel, editors, Machine Learning, Proceedings of the Twenty-Second International Conference (ICML 2005), Bonn, Germany, August 7-11, 2005, volume 119 of ACM International Conference Proceeding Series, pages 824–831. ACM, 2005. ISBN 1-59593-180-5. URL http://doi.acm.org/10.1145/1102351.1102455.

D. Sloughter. The calculus of functions of several variables, 2001.

J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, Dec. 2000. URL http://isomap.stanford.edu/.

H.-T. Lin and C.-J. Lin. A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods, 2003. URL http://citeseer.ist.psu.edu/584223.html.

D. Tikk. Szövegbányászat. Typotex, Budapest, 2007.

V. Vapnik. Statistical Learning Theory. Wiley, 1998.

G. Varelas. Semantic similarity methods in WordNet and their application to information retrieval on the web. Technical Report TR-TUC-ISL-01-2005, Technical Univ. of Crete (TUC), Dept. of Electronic and Computer Engineering, Chania, Crete, Greece, 2005. URL http://www.intelligence.tuc.gr/publications/Varelas.pdf.

S. Vishwanathan and N. M. Murty. Kernel enabled K-means algorithm. Technical report, The Indian Institute of Science, Bangalore, 2002. URL http://eprints.iisc.ernet.in/archive/00000010.

S. V. N. Vishwanathan and A. J. Smola. Fast kernels for string and tree matching. In S. Becker, S. Thrun, and K. Obermayer, editors, NIPS, pages 569–576. MIT Press, 2002. ISBN 0-262-02550-7. URL http://books.nips.cc/papers/files/nips15/AA11.pdf.

S. V. N. Vishwanathan, K. M. Borgwardt, O. Guttman, and A. J. Smola. Kernel extrapolation. Neurocomputing, 69(7-9):721–729, 2006.

U. von Luxburg. A tutorial on spectral clustering. Technical Report 149, Max Planck Institute for Biological Cybernetics, August 2006.

J. Voss. Measuring Wikipedia, 2005. URL http://www.citebase.org/abstract?id=oai:eprints.rclis.org:3610.

J. H. Ward, Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58:236–244, 1963.

C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines, 2001. URL http://citeseer.ist.psu.edu/537288.html.

J. Weston and C. Watkins. Support vector machines for multiclass pattern recognition. In Proceedings of the Seventh European Symposium On Artificial Neural Networks, 4 1999. URL http://citeseer.ist.psu.edu/article/weston99support.html.

J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping. Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research, 3:1439–1461, 2003.

J. Weston, C. Leslie, D. Zhou, A. Elisseeff, and W. S. Noble. Semi-supervised protein classification using cluster kernels, 2004. URL http://eprints.pascal-network.org/archive/00000479/;http://eprints.pascal-network.org/archive/00000479/01/CLUS.pdf.

J. Weston, C. Leslie, E. Ie, and W. S. Noble. Semi-supervised protein classification using cluster kernels. In O. Chapelle, B. Schölkopf, and A. Zien, editors, Semi-Supervised Learning, chapter 19, pages 343–360. MIT Press, 2006.

S. K. M. Wong, W. Ziarko, and P. C. N. Wong. Generalized vector spaces model in information retrieval. In J. M. Tague, editor, Proceedings of the 8th annual international ACM SIGIR conference on Research and development in information retrieval, pages 18–25, Montreal, Quebec, Canada, June 1985. ACM Press.

S. K. M. Wong, W. Ziarko, V. V. Raghavan, and P. C. N. Wong. On modeling of information retrieval concepts in vector spaces. ACM Trans. on Database Sys., 12(2):299, June 1987.

B. Y. Wu and K.-M. Chao. Spanning Trees and Optimization Problems. Chapman and Hall/CRC, Boca Raton, Florida, 2004.

Z. Wu and M. S. Palmer. Verb semantics and lexical selection. In ACL, pages 133–138, 1994.

Y. Yang and C. G. Chute. A linear least squares fit mapping method for information retrieval from natural language texts. In COLING, pages 447–453, 1992. URL http://acl.ldc.upenn.edu/C/C92/C92-2069.pdf.

Y. Yang and C. G. Chute. An example-based mapping method for text categorization and retrieval. ACM Transactions on Information Systems, 12(3):252–295, July 1994.

Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In International Conference on Machine Learning, pages 412–420, 1997. URL citeseer.nj.nec.com/yang97comparative.html.

Z.-P. Yang, X. Zhang, and C.-G. Cao. Inequalities involving Khatri-Rao products of Hermitian matrices. The Korean Journal of Computational & Applied Mathematics, 9(1):125–133, 2002. ISSN 1229-9502.

Q. Yong and Y. Jie. Geodesic distance for support vector machines. Acta Automatica Sinica, 31(2):202–208, 2005. URL http://www.aas.net.cn/qikan/manage/wenzhang/050205.pdf.

J. A. Zdziarski. Ending spam: Bayesian content filtering and the art of statistical language classification. No Starch Press, 2005. ISBN 1-59327-052-6. URL ftp://uiarchive.cso.uiuc.edu/pub/etext/gutenberg/;http://www.oreilly.com/catalog/1593270526/;http://www.loc.gov/catdir/toc/ecip0510/2005008221.html.

D. Zhou, B. Schölkopf, and T. Hofmann. Semi-supervised learning on directed graphs. In NIPS, 2004. URL http://books.nips.cc/papers/files/nips17/NIPS2004_0540.pdf.

X. Zhu. Semi-supervised learning with graphs. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 2005. Chairs: John Lafferty and Ronald Rosenfeld.

X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.

X. Zhu, T. J. Rogers, R. Qian, and C. Kalish. Humans perform semi-supervised classification too. In AAAI, page 864. AAAI Press, 2007. ISBN 978-1-57735-323-2.