
The Sum-over-Paths Covariance Kernel: A Novel Covariance Measure between Nodes of a Directed Graph

Amin Mantrach, Luh Yen, Jerome Callut, Kevin Francoisse, Masashi Shimbo, and Marco Saerens, Member, IEEE

Abstract—This work introduces a link-based covariance measure between the nodes of a weighted directed graph, where a cost is associated with each arc. To this end, a probability distribution on the (usually infinite) countable set of paths through the graph is defined by minimizing the total expected cost between all pairs of nodes while fixing the total relative entropy spread in the graph. This results in a Boltzmann distribution on the set of paths such that long (high-cost) paths occur with a low probability while short (low-cost) paths occur with a high probability. The sum-over-paths (SoP) covariance measure between nodes is then defined according to this probability distribution: two nodes are considered as highly correlated if they often co-occur together on the same—preferably short—paths. The resulting covariance matrix between nodes (say $n$ nodes in total) is a Gram matrix and therefore defines a valid kernel on the graph. It is obtained by inverting an $n \times n$ matrix depending on the costs assigned to the arcs. In the same spirit, a betweenness score is also defined, measuring the expected number of times a node occurs on a path. The proposed measures could be used for various graph mining tasks such as computing betweenness centrality, semi-supervised classification of nodes, visualization, etc., as shown in Section 7.

Index Terms—Graph mining, kernel on a graph, shortest path, correlation measure, betweenness measure, resistance distance, commute time distance, biased random walk, semi-supervised classification.


1 INTRODUCTION

1.1 General Introduction

Network and link analysis are an important, growing field that is the subject of much recent work in various areas of science: applied mathematics, computer science, social science, physics, pattern recognition, applied statistics, to name a few—see Section 1.2 below. Within this context, one key issue is the proper definition of a similarity measure between the nodes of the graph, capturing their relationship and taking both direct and indirect links into account. Indeed, the input graph is typically very sparse, each node being connected to only a few neighbors. Therefore, the goal is to go from "local" to "global," from the local neighborhood inputs to the definition of a global similarity between all nodes of the graph, considering indirect links as well.

This paper proposes precisely such a similarity measure, taking the form of a covariance matrix, by extending the framework developed in [1] in the context of routing and already exploited in [2] in order to define a dissimilarity measure between nodes generalizing both the shortest path and the commute time—or resistance—distance. This quantity will be called the sum-over-paths (SoP) covariance and has a clear, intuitive interpretation: When enumerating all of the possible paths through the graph, two nodes will be considered as highly correlated if they often co-occur on the same—preferably short—paths.

In order to compute this covariance measure, we first define a probability distribution on the countable set of paths through the graph by adopting a statistical physics framework. The definition of this probability measure relies on the work of Akamatsu developed in the field of transportation networks [3]. Consider a graph or network $G$, where a positive, immediate cost $c_{kk'} > 0$ is associated with each arc connecting a node $k$ to a node $k'$. Consider further the set of all paths (including cycles) $\mathcal{R}$ through the graph, between all pairs of nodes. A cost is associated with each path, being the sum of the individual costs of the arcs lying on the path. By defining the probability distribution over $\mathcal{R}$ as the one minimizing the expected cost of the paths in $\mathcal{R}$, subject to a constant relative entropy spread in the graph, we obtain a Boltzmann distribution. This distribution has the desired property that long paths occur with a low probability while short paths occur with a high probability. As in statistical physics, it depends on a parameter, $\theta = 1/T$, controlling the temperature of the system. When $T$ is low, only short paths are taken into account, while, when $T$ is large, the paths tend to have a total cost proportional to minus the log-probability of the natural random walk through the graph. For intermediate values of $T$, short (low-cost) paths are favored over long (high-cost) paths.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. X, XXXXXXX 2010

. A. Mantrach is with the IRIDIA—CoDE—Université Libre de Bruxelles, 50, Av. F. Roosevelt, CP 194/6—B-1050 Brussels, Belgium. E-mail: [email protected].

. L. Yen, J. Callut, K. Francoisse, and M. Saerens are with the ISYS/LSM, Université Catholique de Louvain, IAG, ISYS, Place des Doyens, 1, 1348 Louvain-la-Neuve, Belgium. E-mail: {luh.yen, jerome.callut, kevin.francoisse, marco.saerens}@uclouvain.be.

. M. Shimbo is with the Graduate School of Information Science—Nara Institute of Science and Technology—8916-5 Takayama, Ikoma, Nara 630-0192, Japan. E-mail: [email protected].

Manuscript received 12 Sept. 2008; revised 27 Feb. 2009; accepted 18 Mar. 2009; published online 1 Apr. 2009. Recommended for acceptance by J. Kosecka. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPAMI-2008-09-0615. Digital Object Identifier no. 10.1109/TPAMI.2009.78.

0162-8828/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.

In a second step, the SoP covariance measure between nodes is defined according to this probability distribution. Roughly speaking, it corresponds to the expectation of the number of times node $i$ and node $j$ co-occur on the same path. The proposed measure therefore characterizes the relationship between the nodes and depends on $\theta = 1/T$, controlling the entropy spread in the graph. Technically speaking, the covariance measure is obtained by taking the second-order derivative of the partition function associated with the system. The resulting covariance matrix between nodes (say $n$ nodes in total) is a Gram matrix and therefore defines a valid kernel matrix on the graph. It is shown that it can be obtained by inverting an $n \times n$ matrix depending on the immediate costs assigned to the arcs. The SoP correlation matrix is easily deduced from the covariance matrix by normalizing its entries in the usual way. Once such a similarity matrix has been defined, it can be exploited in order to compute betweenness, to find communities, outliers, etc.

Finally, a betweenness measure associated with each node, called the SoP betweenness, is defined in the same manner: It corresponds to the expectation of the number of passages through each node. It is obtained by taking the first-order derivative of the partition function.

1.2 Related Work

This section provides a short survey of the related work, aiming to define meaningful link-based similarities between the nodes of a graph and taking the form of a similarity measure or a kernel matrix. Similarity between nodes is also called relatedness in the literature and the most well-known quantities measuring relatedness are co-citation [4] and bibliographic coupling [5]. On the other hand, the standard correlation measure between nodes [6] is simply the inner product between the normalized node feature vectors. Here, each node feature vector simply contains the weights of the direct links of this node to the other nodes. This measure is therefore closely related to co-citation. Time autocovariance between nodes, after having defined a random walk on the graph, is exploited in [7] in order to quantify the quality of a node clustering solution. The proposed measure encompasses several of the node clustering criteria proposed in the literature, including the modularity [8] and the Fiedler partition.

More sophisticated measures have been proposed as well. For instance, Klein and Randić [9] proposed to use the effective resistance between two nodes as a meaningful distance measure. They call this quantity the resistance distance between nodes. Indeed, it can be shown that the effective resistance is a Euclidean distance [9], [10], [11], [12]. The close link between the effective resistance and the commute time of a random walker on the graph was established in [13], while the links between the Laplacian matrix and the commute time (as well as the Fiedler vector) were studied in [14]. Therefore, the resistance distance is sometimes called the commute time distance.

Chebotarev and Shamis proposed [15], [16] a similarity measure between nodes integrating indirect paths based on the matrix-forest theorem. Interestingly, this quantity, called the "regularized Laplacian kernel" in [17], defines a kernel matrix and is related to the Laplacian matrix of the graph. Ito et al. [18] further propose the modified regularized Laplacian kernel, an extension of the regularized Laplacian kernel, by introducing a new parameter controlling importance and relatedness. Moreover, in [18], [19], it is shown that the regularized Laplacian kernel overcomes some limitations of the von Neumann kernel [20] when ranking linked documents. This modified regularized Laplacian kernel is also closely related to a graph regularization framework introduced by Zhou and Schölkopf in [21], extended to directed graphs in [22]. The resulting kernel on a graph is a normalized version of the "regularized Laplacian kernel" involving the extension of the Laplacian matrix for directed graphs [23]. In the same spirit, Chen et al. [24] studied graph embedding and semi-supervised classification by using this directed version of the "regularized Laplacian kernel."

The exponential and von Neumann diffusion kernels, based this time on the adjacency matrix, are introduced in [20], [25]. The defined kernel matrices are computed through a power series of the adjacency matrix of the graph; they are closely related to graph regularization models [26].

Moreover, some authors recently considered similarity measures based on random walk or electrical concepts (for a nice introduction to this topic, see [27]). For instance, Harel and Koren [28] investigated the possibility of clustering data according to some random-walk-related quantities, such as the probability of visiting a node before returning to the starting node. They showed that their algorithm is able to cluster arbitrary nonconvex shapes. White and Smyth [29] investigated the use of the average first passage time as a similarity measure between nodes. Their purpose was to generalize the random-walk approach of Brin and Page [30] and Page et al. [31] by capturing a concept of "relative centrality" of a given node with respect to some other node of interest. A recent study comparing several measures for analyzing the proximity of nodes in a graph in the framework of coauthorship networks is presented in [32].

On the other hand, Kondor and Lafferty [26] as well as Smola and Kondor [33] defined a graph regularization model related to the graph principal components analysis introduced in [14]. This model results in the definition of a family of kernels on a graph that provides similarities between nodes, just as any other graph kernel [25]. An interesting attempt to learn the regularization operator in the context of semi-supervised learning can be found in [34]. The result is a kernel on a graph maximizing kernel alignment to the labeled data, in a semi-supervised setting. Another approach has been investigated by Palmer and Faloutsos [35], who define a similarity function between categorical attributes, called "refined escape probability," based on random walks and electrical networks. They show that this quantity provides a reasonably good measure for clustering and classifying categorical attributes.

The "commute time" (CT) kernel has been introduced in [14], [17] and was inspired by the already mentioned work of Klein and Randić [9] and Chandra et al. [13]. It takes its name from the average commute time, which is defined as the average number of steps a random walker, starting from a given node, will take before entering another node for the first time, and go back to the starting node. The CT kernel is defined as the inner product in a Euclidean space, where the nodes are exactly separated by the commute time distance. An interesting method allowing to efficiently compute truncated commute time neighbors appears in [36]. Almost at the same period, Qiu and Hancock [37], [38], Ham et al. [39], Yen et al. [40], as well as Brand [41] defined the same CT embedding, preserving the commute time distance, and applied it to image segmentation and multibody motion tracking [37], [38], to dimensionality reduction of manifolds [39], to clustering [40], as well as to collaborative filtering [17], [41], with interesting results. The commute time kernel is also closely related to the "Fiedler vector" [42], [43], widely used for graph partitioning [44], [45] or clustering [46], [47], [48], as detailed in [17]. An electrical interpretation of the elements of the CT kernel is provided in [48]. Families of dissimilarity measures subsuming both the shortest path distance and the commute time distance were recently proposed by Yen et al. [2] and Chebotarev [49]. Note that the dissimilarity defined by Chebotarev [49] is a distance, that is, it verifies all of the properties of a distance, including the triangular inequality (which is not the case for [2]).
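The commute time distance discussed above is not computed in this paper, but it is easy to sketch numerically via the standard Laplacian-pseudoinverse formula $ct(i,j) = \mathrm{vol}(G)\,(l^+_{ii} + l^+_{jj} - 2 l^+_{ij})$, for an undirected, connected graph. This is an illustrative sketch, not code from the paper; the function name is ours.

```python
import numpy as np

def commute_times(A):
    """Average commute times for a symmetric adjacency matrix A, using the
    Moore-Penrose pseudoinverse of the Laplacian:
    ct(i,j) = vol(G) * (l+_ii + l+_jj - 2 l+_ij)."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A      # graph Laplacian
    Lp = np.linalg.pinv(L)              # pseudoinverse of L
    vol = A.sum()                       # volume: sum of all edge weights
    d = np.diag(Lp)
    return vol * (d[:, None] + d[None, :] - 2 * Lp)

# Two nodes joined by a unit-weight edge: the walker needs exactly one
# step each way, so the commute time between them is 2.
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
ct = commute_times(A)
```

On this two-node graph the off-diagonal commute time is 2 and the diagonal is 0, matching the definition (go and come back in one step each).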

On the other hand, Zhou [50], [51] uses the average first passage time between two nodes as a dissimilarity index in order to cluster them. He studies various greedy clustering techniques based on this dissimilarity index. Another similarity measure related to the average first passage time appears in [52]. It is defined as the escape probability, that is, the probability that a random walker starting from one node will visit the other node before returning to the starting node. The resulting similarity is directed and closely related to the effective conductance between the two nodes. Also related is the measure investigated by Koren et al. [53], [54], where the authors propose to replace the effective conductance by a cycle-free effective conductance.

In two recent papers [55], [56], Nadler et al. as well as Pons and Latapy [57], [58] proposed a well-formulated distance measure between nodes of a graph based on a diffusion process, called the "diffusion distance." A valid kernel, called the "Markov diffusion kernel," has been derived from this diffusion distance in [59]. An application of the diffusion distance to dimensionality reduction and graph visualization appears in [60]. The natural embedding induced by the diffusion distance is called the "diffusion map" by Nadler et al. [55], [56]. Moreover, in [57], [58], Pons and Latapy defined a hierarchical clustering approach for clustering the nodes according to the squared diffusion distance.

Two recent PageRank-based attempts to define meaningful similarities between nodes appear in [61], [62], [63]. In these two last works, the authors propose a random walk with restart procedure while, in the first work, Gori and Pucci define a random walk process starting from the node of interest, controlled by some precomputed correlation matrix between nodes. These two algorithms are thus inspired by the well-known PageRank procedure [30], [31], adapted in order to provide relative similarities between nodes. Yet another PageRank-inspired algorithm defining similarities between nodes was proposed in [64]. It provides a general way of computing similarities between the nodes of two different graphs. Applying this procedure to the same graph allows us to find self-similarities, that is, similarities between nodes of the same graph. Furthermore, a similarity between nodes based on the number of different paths connecting two nodes, and therefore on the maximum flow/minimum cut, is studied in [65]. This measure has been tested in two collaborative recommendation tasks [17], but did not perform well in this context. Finally, Tahbaz and Jadbabaie [66] introduce a one-parameter family of algorithms that recovers both the Bellman-Ford procedure for finding the shortest paths as well as the iterative algorithm for computing the average first-passage time. It is based on heuristic grounds and not on a well-defined cost function to optimize.

There are also several attempts to generalize graph kernels to directed graphs. For instance, an extension of the Laplacian matrix to directed graphs is proposed in [23] while an extension of the regularized Laplacian kernel to directed graphs appears in [67], [68]. Zhou et al. [22] used the regularized normalized Laplacian matrix defined in [23] in the context of semi-supervised classification of labeled nodes of a directed graph, while Chen et al. [24] used the same kernel matrix, but this time unnormalized, for directed graph embedding. Zhao et al. [69] propose a directed contextual distance and define a directed graph from which the Laplacian matrix is computed. It is then used for ranking and clustering images.

Note that most of the mentioned approaches (except those described in the previous paragraph) are restricted to the analysis of undirected graphs. For instance, the Laplacian-based techniques assume, most of the time, a symmetric adjacency matrix, that is, an undirected graph. Many approaches further assume a completely connected graph having one single connected component. Finally, some approaches consider only aperiodic Markov chains (such as the diffusion map method), do not allow absorbing nodes, or lack a clear, intuitive interpretation. The SoP approach developed in this paper does not suffer from any of these problems while remaining competitive in terms of performance, at least for the investigated semi-supervised classification tasks (see Section 7). However, the main drawback of the SoP approach is that it does not scale well on large graphs when the entire kernel matrix is needed. This issue is discussed in the paper and will be investigated in future work. Still another weakness of the SoP approach is the fact that it depends on a hyperparameter that has to be tuned. An interesting contribution to the field would be a comprehensive experimental comparison of all the kernels on a graph proposed up to now.

1.3 Contributions and Organization of the Paper

This work has three main contributions:

. It introduces a well-founded covariance as well as a betweenness measure between nodes of a weighted directed graph. The resulting covariance matrix defines a valid kernel on a graph.

. It shows how these covariance and betweenness measures can be computed efficiently from the immediate costs associated with each arc by inverting an $n \times n$ matrix.



. Experimental comparisons with several kernels on a graph, computed on eight different databases, show that the SoP correlation measure obtains competitive performance in semi-supervised classification tasks.

Section 2 introduces the necessary background and notation. In Section 3, the probability distribution on the set of paths—a Boltzmann distribution—is defined. Section 4 introduces our betweenness and covariance measures and shows how these quantities can be derived analytically from the partition function. Section 5 explains how the partition function can be computed exactly from the immediate costs, while Section 6 derives the formulas for computing the quantities of interest: betweenness and covariance. Section 7 applies the covariance measure to semi-supervised classification tasks and compares the results with various other kernels on a graph. Concluding remarks and possible extensions are discussed in Section 8.

2 BACKGROUND AND NOTATION

Consider a weighted directed graph or network, $G$, not necessarily strongly connected, with a set of $n$ nodes $V$ (or vertices) and a set of arcs $E$ (or edges). To each arc linking nodes $k$ and $k'$, we associate a positive number $c_{kk'} > 0$ representing the immediate cost of following this arc. The cost matrix $\mathbf{C}$ is the matrix containing the immediate costs $c_{kk'}$ as elements. A random walk on this graph is defined in the standard way. In node $k$, the random walker chooses the next arc to follow according to transition probabilities representing the probability of jumping from node $k$ to node $k' \in S(k)$, the set of successor nodes (successors $S$). These transition probabilities will be denoted by $p_{kk'} = \mathrm{P}(k'|k)$ with $k' \in S(k)$. Furthermore, $\mathbf{P}$ will be the matrix containing the transition probabilities $p_{kk'}$ as elements. If there is no arc between $k$ and $k'$, we simply consider that $c_{kk'}$ takes a large value, denoted by $\infty$; in this case, the corresponding transition probability will be set to zero, $p_{kk'} = 0$.

The natural random walk on this graph will be defined in the following way. It corresponds to a standard random walk through the graph with transition probabilities

$$p^{\mathrm{ref}}_{kk'} = \frac{1/c_{kk'}}{\sum_{k''} (1/c_{kk''})}. \qquad (1)$$

The corresponding transition probability matrix will be denoted by $\mathbf{P}^{\mathrm{ref}}$. In other words, the random walker chooses to follow an arc with a probability proportional to the inverse of the immediate cost (apart from the sum-to-one normalization), therefore locally favoring arcs having a low cost. These transition probabilities will be used as reference probabilities later; hence the superscript "ref." If, instead of $\mathbf{C}$, we are given an adjacency matrix with elements $a_{kk'} \geq 0$ indicating the affinity between node $k$ and node $k'$, the corresponding costs are computed from $c_{kk'} = 1/a_{kk'}$ and the transition probabilities associated with each node are simply proportional to the affinities (and normalized). Note that relations between affinity and cost other than the inverse relation could be considered as well. The matrix $\mathbf{P}^{\mathrm{ref}}$ is provided by the user, is assumed to be stochastic, and contains real, positive numbers.
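Equation (1) and the $\infty$-cost convention for missing arcs can be sketched in a few lines. This is an illustrative sketch, not code from the paper; the function name and the example cost matrix are ours, and we assume every node has at least one successor.

```python
import numpy as np

def reference_transitions(C):
    """Natural random walk of (1): p_ref[k, k'] proportional to 1/c[k, k'].
    Missing arcs are encoded as an infinite cost, so 1/inf -> 0 and they
    receive zero transition probability. Assumes every row has an arc."""
    C = np.asarray(C, dtype=float)
    inv = 1.0 / C                                  # 1/c, zero where c = inf
    return inv / inv.sum(axis=1, keepdims=True)    # row-wise normalization

inf = np.inf
C = np.array([[inf, 1.0, 2.0],     # hypothetical 3-node cost matrix:
              [1.0, inf, 1.0],     # inf on missing arcs (and self-loops)
              [inf, 3.0, inf]])
Pref = reference_transitions(C)
```

From node 0, the arc of cost 1 gets probability $\frac{1/1}{1/1 + 1/2} = 2/3$ and the arc of cost 2 gets $1/3$; node 2 has a single successor, which therefore gets probability 1. If an affinity matrix is given instead, `C = 1.0 / A` recovers the cost matrix used above.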

The objective of the next sections is to define the probability distribution on the set of paths as well as the betweenness and covariance measures. Before diving into the details, let us briefly describe the main ideas behind the model. In the first step, the potentially infinite set of paths in the graph is enumerated and a probability distribution is assigned to each individual path: the longer the path, the smaller the probability of following it. This probability distribution depends on a parameter, $\theta = 1/T$, controlling the exploration carried out in the graph: When $\theta$ is large, no exploration is performed and only the shortest paths are enumerated, while, when $\theta$ is small, a random walk on the graph is performed according to the natural transition probabilities $p^{\mathrm{ref}}_{kk'}$. In the second step, 1) the expected number of visits to each node as well as 2) the expected number of times two nodes are visited within the same path are computed through a sum-over-paths statistical physics formalism. These two quantities are, respectively, a measure of betweenness and a measure of covariance on the set of nodes.
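The two steps just described can be approximated by brute force on a toy graph: enumerate paths up to some maximum length, weight each path by its reference probability times a Boltzmann factor, then compute expected visit counts and co-occurrence covariances. This is only an illustrative truncated approximation under assumptions of ours (toy graph, cutoff length); the paper computes these quantities exactly by matrix inversion in later sections.

```python
import itertools
import math

# Toy directed graph with arc costs (hypothetical example, not from the paper).
cost = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 3.0, (2, 0): 1.0}
nodes = [0, 1, 2]

# Reference transition probabilities of (1): proportional to 1/cost.
succ = {k: [kp for kp in nodes if (k, kp) in cost] for k in nodes}
pref = {(k, kp): (1.0 / cost[k, kp]) /
        sum(1.0 / cost[k, s] for s in succ[k])
        for k in nodes for kp in succ[k]}

def paths_up_to(tmax):
    """Enumerate every path (cycles allowed) of at most tmax steps."""
    for t in range(1, tmax + 1):
        for p in itertools.product(nodes, repeat=t + 1):
            if all((p[i], p[i + 1]) in cost for i in range(t)):
                yield p

theta = 1.0
weights = {}
for p in paths_up_to(8):   # truncation: the true set of paths is infinite
    E = sum(cost[p[i], p[i + 1]] for i in range(len(p) - 1))
    Pref = math.prod(pref[p[i], p[i + 1]] for i in range(len(p) - 1))
    weights[p] = Pref * math.exp(-theta * E)   # Boltzmann factor
Z = sum(weights.values())                      # truncated partition function
probs = {p: w / Z for p, w in weights.items()}

# Step 1 quantity: expected number of occurrences of each node on a path.
betweenness = {k: sum(pr * p.count(k) for p, pr in probs.items())
               for k in nodes}

# Step 2 quantity: cov(i, j) = E[n_i n_j] - E[n_i] E[n_j], with n_k the
# number of occurrences of node k on a path drawn from `probs`.
def cov(i, j):
    eij = sum(pr * p.count(i) * p.count(j) for p, pr in probs.items())
    return eij - betweenness[i] * betweenness[j]
```

Increasing the cutoff length or lowering $\theta$ spreads probability mass over longer paths; the exact sum-over-paths formalism removes the cutoff altogether.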

3 A BOLTZMANN DISTRIBUTION ON THE SET OF PATHS

This section describes how the probability distribution on the set of paths is assigned. To this end, let us first choose two nodes, an initial node $i$ and a destination node $j$, and define the set of paths (including cycles) connecting these two nodes as $\mathcal{R}_{ij} = \{\wp_{r_{ij}}\}$. Thus, $\wp_{r_{ij}}$ is path number $r_{ij}$, with $r_{ij}$ usually ranging from 1 to $\infty$. Let us denote by $E_{r_{ij}}$ the total cost associated with path number $r_{ij}$, referred to as the energy associated with that path. Here, we assume that $\wp_{r_{ij}}$ is a valid path from the initial node to the destination node, that is, every $c_{k_{t-1}k_t} \neq \infty$ along that path. In addition, let us define the set of all paths through the graph as $\mathcal{R} = \bigcup_{ij} \mathcal{R}_{ij}$. We further assume that the total cost associated with a path is additive, i.e., $E(\wp_{r_{ij}}) = \sum_{t=1}^{t_f} c_{k_{t-1}k_t}$, where $k_0 = i$ is the initial node and $k_{t_f} = j$ is the destination node, while $t_f$ is the time (number of steps) needed to end the path in node $j$. Now, a probability distribution on this set $\mathcal{R}$, representing the probabilities of following the paths $\wp \in \mathcal{R}$, is defined as the probability distribution $\mathrm{P}$ minimizing the total expected cost-to-go, $\mathbb{E}\{E(\wp)\}$, among all of the distributions having a fixed relative entropy $J_0$ with respect to the natural random walk on the graph. This choice naturally defines a probability distribution on the set of paths such that long paths occur with a low probability while short paths occur with a high probability. In other words, we are seeking path probabilities, $\mathrm{P}(\wp),\ \wp \in \mathcal{R}$, minimizing the total expected cost subject to a constant relative entropy constraint:

$$\begin{aligned} &\underset{\mathrm{P}(\wp)}{\text{minimize}} && \sum_{\wp \in \mathcal{R}} \mathrm{P}(\wp)\, E(\wp) \\ &\text{subject to} && \sum_{\wp \in \mathcal{R}} \mathrm{P}(\wp) \ln\!\left(\mathrm{P}(\wp)/\mathrm{P}^{\mathrm{ref}}(\wp)\right) = J_0, \end{aligned} \qquad (2)$$

where $\mathrm{P}^{\mathrm{ref}}(\wp)$ represents the probability of following the path $\wp$ when walking according to the natural random walk, i.e., using transition probabilities $p^{\mathrm{ref}}_{kk'}$ (see (1)). Here, $J_0 > 0$ is provided a priori by the user, according to the desired degree of randomness he is willing to concede. By minimizing the Lagrange function over the set of path probabilities $\mathrm{P}(\wp)$,


$$\mathcal{L} = \sum_{\wp \in \mathcal{R}} P(\wp)\, E(\wp) + \lambda \left[ \sum_{\wp \in \mathcal{R}} P(\wp) \ln \frac{P(\wp)}{P^{\mathrm{ref}}(\wp)} - J_0 \right] + \mu \left[ \sum_{\wp \in \mathcal{R}} P(\wp) - 1 \right], \qquad (3)$$

we obtain a Boltzmann probability distribution:

$$P(\wp) = \frac{P^{\mathrm{ref}}(\wp) \exp[-\theta E(\wp)]}{\sum_{\wp' \in \mathcal{R}} P^{\mathrm{ref}}(\wp') \exp[-\theta E(\wp')]} \qquad (4)$$

$$\phantom{P(\wp)} = \frac{\exp[-\theta E(\wp) + \ln P^{\mathrm{ref}}(\wp)]}{\sum_{\wp' \in \mathcal{R}} \exp[-\theta E(\wp') + \ln P^{\mathrm{ref}}(\wp')]}, \qquad (5)$$

where the Lagrange parameter $\lambda$ plays the role of a temperature ($\lambda = T$) and $\theta = 1/\lambda$ is the inverse temperature. Thus, as expected, short paths $\wp$ (having small $E(\wp)$) are favored in that they have a large probability of being followed. Indeed, from (5), we clearly observe that, when $\theta \to 0$, the path probabilities reduce to the probabilities generated by the natural random walk on the graph (characterized by the transition probabilities $p^{\mathrm{ref}}_{kk'}$ as defined in (1)). In this case, $J_0 \to 0$ as well. On the other hand, when $\theta$ is large, the first term in the exponential dominates the second one, so that the probability distribution defined by (5) is biased toward short paths (the most likely paths are the shortest ones). Note that, in the sequel, it will be assumed that the user provides the value of the parameter $\theta$ instead of $J_0$, with $\theta > 0$. Note also that the model could be derived from a maximum entropy principle instead [70], [71].

For illustration, the graph in Fig. 1 has one shortest path between nodes $i$ and $j$, with a unit cost on all transitions. Ignoring cycles and assuming uniform reference probabilities, the probability of the shortest path is $P(\wp) = \frac{1}{1 + 2\exp(-\theta)}$, while the probability of each of the two nonshortest paths is $P(\wp) = \frac{\exp(-\theta)}{1 + 2\exp(-\theta)}$. For a large value $\theta = 20$, the probability of the shortest path is practically 1, while the other paths have an almost zero probability. However, for a low value $\theta \simeq 0$, the probabilities of the three paths going from node $i$ to node $j$ are almost equal, 1/3 each. For an intermediate value $\theta = 1$, the probability of the shortest path is $\simeq 0.57612$, which is higher than the probability of each nonshortest path, $0.21194$.
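The three path probabilities above are easy to verify numerically. A minimal sketch in plain Python (the unit extra cost of the two nonshortest paths and the uniform reference probabilities follow the Fig. 1 setup described above):

```python
import math

def path_probabilities(theta):
    """Boltzmann probabilities of the three i -> j paths of Fig. 1:
    the shortest path, plus two paths each costing one extra unit."""
    # Unnormalized Boltzmann weights exp(-theta * cost), with the cost
    # common to all three paths factored out (only the extra cost matters).
    weights = [1.0, math.exp(-theta), math.exp(-theta)]
    z = sum(weights)
    return [w / z for w in weights]

for theta in (0.001, 1.0, 20.0):
    print(theta, [round(p, 5) for p in path_probabilities(theta)])
```

At $\theta = 1$, this reproduces the probabilities $\approx 0.57612$ and $0.21194$ quoted above; at $\theta = 20$, essentially all probability mass sits on the shortest path.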

4 THE SOP BETWEENNESS AND COVARIANCE MEASURES

By following arguments inspired from [1], it is now shown that all of the quantities of interest, and in particular the betweenness and covariance measures, can be computed from a quantity appearing in the denominator of (5), defined as

$$\mathcal{Z} = \sum_{\wp \in \mathcal{R}} \exp\!\big[-\theta E(\wp) + \ln P^{\mathrm{ref}}(\wp)\big], \qquad (6)$$

which corresponds to the partition function in statistical physics (see [70] or any textbook in statistical physics, for instance [72], [73]). In the next section, we will show that the partition function can easily be computed from the cost matrix and the reference transition probabilities (see [3] and the next section). Let us further define the free energy $F$ as

$$F = -\frac{1}{\theta} \ln \mathcal{Z} = -T \ln \mathcal{Z}, \qquad (7)$$

where $T = 1/\theta$ is the temperature of the system.

Indeed, let us first show how the expected energy (or expected cost) can be computed from the partition function:

$$\bar{E} = \frac{\partial(-\ln \mathcal{Z})}{\partial \theta} \qquad (8)$$
$$= \sum_{\wp \in \mathcal{R}} \frac{\exp[-\theta E(\wp) + \ln P^{\mathrm{ref}}(\wp)]}{\mathcal{Z}}\, E(\wp) \qquad (9)$$
$$= \sum_{\wp \in \mathcal{R}} P(\wp)\, E(\wp). \qquad (10)$$

Moreover, it can easily be shown [70] that the $m$th centered moment (with $m > 1$) of the energy can be found by

$$\mathrm{E}\big\{\big(E(\wp) - \bar{E}\big)^m\big\} = (-1)^m \frac{\partial^m}{\partial \theta^m} \ln \mathcal{Z}, \qquad (11)$$

where $\mathrm{E}\{\cdot\}$ is the expectation operator.

The expected number of transitions through the link $k \to k'$ can also be easily computed:

$$\bar{\eta}(k, k') = \frac{\partial F}{\partial c_{kk'}} = \frac{1}{\theta} \frac{\partial(-\ln \mathcal{Z})}{\partial c_{kk'}} \qquad (12)$$
$$= \sum_{\wp \in \mathcal{R}} \frac{\exp[-\theta E(\wp) + \ln P^{\mathrm{ref}}(\wp)]}{\mathcal{Z}}\, \eta(\wp; k, k') \qquad (13)$$
$$= \sum_{\wp \in \mathcal{R}} P(\wp)\, \eta(\wp; k, k'), \qquad (14)$$

where $\eta(\wp; k, k')$ indicates the number of times the link $k \to k'$ is present in path $\wp$, and thus the number of times the link is traversed within this path.

On the other hand, the expected number of passages in node $k$, which defines the betweenness measure, is

$$\mathrm{bet}(k) = \bar{n}_k = \sum_{l=1}^{n} \bar{\eta}(l, k) \qquad (15)$$

and corresponds to the sum of incoming transitions in node $k$. This is the first quantity of interest.

MANTRACH ET AL.: THE SUM-OVER-PATHS COVARIANCE KERNEL: A NOVEL COVARIANCE MEASURE BETWEEN NODES OF A DIRECTED... 5

Fig. 1. A graph with one shortest path between $i$ and $j$. The edge weights are assumed to be 1.


Furthermore, the expected number of times the link $k \to k'$ and the link $l \to l'$ are traversed together along a path is

$$\bar{\eta}(k, k'; l, l') = \frac{1}{\theta^2} \frac{\partial^2 \ln \mathcal{Z}}{\partial c_{ll'} \partial c_{kk'}} \qquad (16)$$
$$= \sum_{\wp \in \mathcal{R}} \frac{\exp[-\theta E(\wp) + \ln P^{\mathrm{ref}}(\wp)]}{\mathcal{Z}}\, \eta(\wp; k, k')\, \eta(\wp; l, l') - \left[\sum_{\wp \in \mathcal{R}} \frac{\exp[-\theta E(\wp) + \ln P^{\mathrm{ref}}(\wp)]}{\mathcal{Z}}\, \eta(\wp; k, k')\right] \left[\sum_{\wp \in \mathcal{R}} \frac{\exp[-\theta E(\wp) + \ln P^{\mathrm{ref}}(\wp)]}{\mathcal{Z}}\, \eta(\wp; l, l')\right] \qquad (17)$$
$$= \sum_{\wp \in \mathcal{R}} P(\wp)\, \eta(\wp; k, k')\, \eta(\wp; l, l') - \left[\sum_{\wp \in \mathcal{R}} P(\wp)\, \eta(\wp; k, k')\right] \left[\sum_{\wp \in \mathcal{R}} P(\wp)\, \eta(\wp; l, l')\right] \qquad (18)$$
$$= \sum_{\wp \in \mathcal{R}} P(\wp)\, \eta(\wp; k, k')\, \eta(\wp; l, l') - \bar{\eta}(k, k')\, \bar{\eta}(l, l'), \qquad (19)$$

and this quantity is a measure of covariance between link $k \to k'$ and link $l \to l'$.

Finally, the covariance measure between node $k'$ and node $l'$ is simply defined from (19) as

$$\mathrm{cov}(k', l') = \sum_{k,l=1}^{n} \bar{\eta}(k, k'; l, l') \qquad (20)$$
$$= \sum_{\wp \in \mathcal{R}} P(\wp)\, \eta(\wp; k')\, \eta(\wp; l') - \bar{n}_{k'}\, \bar{n}_{l'} \qquad (21)$$
$$= \sum_{\wp \in \mathcal{R}} P(\wp)\, \big(\eta(\wp; k') - \bar{n}_{k'}\big)\big(\eta(\wp; l') - \bar{n}_{l'}\big), \qquad (22)$$

which is the second quantity of interest. Here, $\eta(\wp; k') = \sum_k \eta(\wp; k, k')$ indicates the number of times node $k'$ is visited on path $\wp$. Obviously, (22) defines a valid kernel on a graph since $\mathrm{cov}(k', l')$ is an inner product in the space of paths.

Of course, one single parameter $\theta$ may not be adequate for all regions of a large graph. In a certain sense, the $\theta$ parameter regulates the subset of paths for which the probability mass is significant, as a function of the total cost of the path (see Fig. 4 for an illustration of this property in Section 7). For a dense, highly connected community, large values of $\theta$ should be sufficient (only short paths are considered). On the other hand, for a sparsely connected region, lower values of $\theta$ might be required to capture the large-scale covariance. One could therefore imagine a model containing different values of $\theta$, tuned for each region of the graph, in the same spirit as a mixture of Gaussians, each with its own variance. Note that if there are several connected components, the covariance between the nodes of one connected component and those of another will be negative. This can be seen from (21): If there is no path connecting node $k'$ from a first connected component to node $l'$ from a second connected component, the first term of (21) cancels.

5 COMPUTATION OF THE PARTITION FUNCTION $\mathcal{Z}$

By applying the ideas introduced by Akamatsu [3], let us now show how the partition function $\mathcal{Z}$ (6) can be computed exactly from the immediate costs. We start from the cost matrix $\mathbf{C}$, from which we build a new matrix $\mathbf{W}$ as

$$\mathbf{W} = \mathbf{P}^{\mathrm{ref}} \circ \exp[-\theta \mathbf{C}] = \exp\!\big[-\theta \mathbf{C} + \ln \mathbf{P}^{\mathrm{ref}}\big], \qquad (23)$$

where $\mathbf{P}^{\mathrm{ref}}$ is the transition-probabilities matrix containing the $p^{\mathrm{ref}}_{kk'}$, the logarithm and exponential functions are taken elementwise, and $\circ$ is the elementwise (Hadamard) matrix product.

Remember that $\wp_{r_{ij}}$ is the path number $r_{ij}$ between initial node $i$ and destination node $j$. Now, since all of the quantities in the exponential of (6) are summed along a path, $\ln P^{\mathrm{ref}}(\wp_{r_{ij}}) = \sum_{t=1}^{t_f} \ln p^{\mathrm{ref}}_{k_{t-1}k_t}$ and $E(\wp_{r_{ij}}) = \sum_{t=1}^{t_f} c_{k_{t-1}k_t}$, where each link $k_{t-1} \to k_t$ lies on path number $r_{ij}$; we easily observe that element $(i, j)$ of the matrix $\mathbf{W}^t$ ($\mathbf{W}$ to the power $t$) is $[\mathbf{W}^t]_{ij} = \sum_{\wp_{r_{ij}} \in \wp_{ij}(t)} \exp[-\theta E(\wp_{r_{ij}}) + \ln P^{\mathrm{ref}}(\wp_{r_{ij}})]$, where $\wp_{ij}(t)$ is the set of paths connecting the initial node $i$ to the destination node $j$ in exactly $t$ steps. Consequently, the partition function is

$$\mathcal{Z} = \sum_{\wp \in \mathcal{R}} \exp\!\big[-\theta E(\wp) + \ln P^{\mathrm{ref}}(\wp)\big] \qquad (24)$$
$$= \sum_{i,j=1}^{n} \sum_{t=1}^{\infty} \sum_{\wp_{r_{ij}} \in \wp_{ij}(t)} \exp\!\big[-\theta E(\wp_{r_{ij}}) + \ln P^{\mathrm{ref}}(\wp_{r_{ij}})\big] \qquad (25)$$
$$= \sum_{i,j=1}^{n} \sum_{t=1}^{\infty} \big[\mathbf{W}^t\big]_{ij} = \mathbf{e}^T \left( \sum_{t=1}^{\infty} \mathbf{W}^t \right) \mathbf{e}. \qquad (26)$$

Thus, the sum over the elements of the matrix series $\sum_{t=1}^{\infty} \mathbf{W}^t$ corresponds to the partition function. Computing the series of powers of $\mathbf{W}$ provides

$$\sum_{t=1}^{\infty} \mathbf{W}^t = (\mathbf{I} - \mathbf{W})^{-1} - \mathbf{I}, \qquad (27)$$

which converges if the spectral radius of $\mathbf{W}$ is less than 1, $\rho(\mathbf{W}) < 1$. Since the matrix $\mathbf{W}$ only contains nonnegative elements, a sufficient condition for $\rho(\mathbf{W}) < 1$ is that all of its row sums are less than 1 (the matrix is substochastic), which is always achieved for $\theta > 0$ since $c_{kk'} > 0$ (see (23)). Indeed, it is well known that the spectral radius of a real square matrix is always smaller than or equal to its maximum absolute row-sum norm (see, e.g., [74, p. 111]). Equation (27) is therefore well defined provided $\theta > 0$.

Now, if we pose $\mathbf{Z} = (\mathbf{I} - \mathbf{W})^{-1}$, the partition function is

$$\mathcal{Z} = \mathbf{e}^T \big[(\mathbf{I} - \mathbf{W})^{-1} - \mathbf{I}\big] \mathbf{e} \qquad (28)$$
$$= \mathbf{e}^T (\mathbf{Z} - \mathbf{I})\, \mathbf{e} \qquad (29)$$
$$= \mathbf{e}^T \mathbf{Z}\, \mathbf{e} - n \qquad (30)$$
$$= z_{\bullet\bullet} - n, \qquad (31)$$



where $z_{\bullet\bullet} = \sum_{k,k'=1}^{n} z_{kk'}$ and the $z_{kk'}$ are the elements $k, k'$ of $\mathbf{Z}$. By analogy with Markov chains, $\mathbf{Z}$ will be called the fundamental matrix. As suggested by a reviewer, an intuitive interpretation of the elements $z_{kk'}$ of the $\mathbf{Z}$ matrix can be given. Consider a special random walk defined by the transition-probabilities matrix $\mathbf{W}$. Since $\mathbf{W}$ is substochastic, the random walker has a nonzero probability of disappearing at each node $k$ and each time step, equal to $(1 - \sum_{k'} w_{kk'})$. From (23), it can be observed that the probability of surviving a transition $k \to k'$ is proportional to $\exp[-\theta c_{kk'}]$. This interpretation makes sense: there is a smaller probability of surviving edges with a high cost. In this case, the elements of the $\mathbf{Z}$ matrix, $z_{kk'} = [\mathbf{Z}]_{kk'}$, can be interpreted as the expected number of passages through node $k'$ (see, for instance, [27], [75]) for an "evaporating" random walker starting in node $k$.

6 COMPUTATION OF THE BETWEENNESS AND COVARIANCE MEASURES

Now that we have seen how to compute the partition function $\mathcal{Z}$, we turn to the computation of the betweenness and covariance measures, which can be deduced from $\mathcal{Z}$ due to (12), (15), (16), and (20).

However, let us first derive two formulas that will be useful in the sequel. They involve the $i$th row (viewed as a column vector) and the $j$th column of $\mathbf{Z}$: $\mathbf{Z}^T \mathbf{e}_i = \mathbf{z}^r_i$ and $\mathbf{Z}\mathbf{e}_j = \mathbf{z}^c_j$. In other words, the column vector $\mathbf{z}^r_i = (\mathrm{row}_i(\mathbf{Z}))^T$ contains the elements of the $i$th row of matrix $\mathbf{Z}$, while the column vector $\mathbf{z}^c_j = \mathrm{col}_j(\mathbf{Z})$ contains the elements of the $j$th column of $\mathbf{Z}$.

These two quantities can easily be found by solving the linear systems of equations

$$(\mathbf{I} - \mathbf{W})^T \mathbf{z}^r_i = \mathbf{e}_i \quad \text{and} \quad (\mathbf{I} - \mathbf{W})\, \mathbf{z}^c_j = \mathbf{e}_j. \qquad (32)$$

Elementwise, these last equations (32) yield

$$z_{ik'} = \delta_{ik'} + \sum_{k \in \mathcal{P}(k')} z_{ik} \exp\!\big[-\theta c_{kk'} + \ln p^{\mathrm{ref}}_{kk'}\big] \qquad (33)$$

and

$$z_{kj} = \delta_{kj} + \sum_{k' \in \mathcal{S}(k)} \exp\!\big[-\theta c_{kk'} + \ln p^{\mathrm{ref}}_{kk'}\big]\, z_{k'j}, \qquad (34)$$

where $\mathcal{P}(k')$ is the set of predecessors of node $k'$ and $\mathcal{S}(k)$ is the set of successors of node $k$. When summing these last equations over $i$ and $j$, we obtain

$$z_{\bullet k'} = \sum_{i=1}^{n} z_{ik'} = 1 + \sum_{k \in \mathcal{P}(k')} z_{\bullet k} \exp\!\big[-\theta c_{kk'} + \ln p^{\mathrm{ref}}_{kk'}\big], \qquad (35)$$
$$z_{k\bullet} = \sum_{j=1}^{n} z_{kj} = 1 + \sum_{k' \in \mathcal{S}(k)} \exp\!\big[-\theta c_{kk'} + \ln p^{\mathrm{ref}}_{kk'}\big]\, z_{k'\bullet}. \qquad (36)$$

These equations will be useful later for deriving the expression of the covariance measure (see the Appendix).

We thus have to compute the derivatives of $\mathcal{Z}$ (28) in terms of $c_{kk'}$ (see (12) and (16)) in order to obtain the different quantities of interest, which is done in the Appendix. For the expected number of passages through the link $k \to k'$, we obtain

$$\bar{\eta}(k, k') = \frac{\partial F}{\partial c_{kk'}} \qquad (37)$$
$$= \frac{\sum_{i,j=1}^{n} z_{ik}\, z_{k'j} \exp\!\big[-\theta c_{kk'} + \ln p^{\mathrm{ref}}_{kk'}\big]}{\mathcal{Z}} \qquad (38)$$
$$= \frac{z_{\bullet k}\, z_{k'\bullet} \exp\!\big[-\theta c_{kk'} + \ln p^{\mathrm{ref}}_{kk'}\big]}{\mathcal{Z}}. \qquad (39)$$

Therefore, the expected number of passages through node $k'$ (the betweenness of node $k'$) is

$$\mathrm{bet}(k') = \bar{n}_{k'} = \sum_{k=1}^{n} \bar{\eta}(k, k') = \frac{(z_{\bullet k'} - 1)\, z_{k'\bullet}}{\mathcal{Z}}, \qquad (40)$$

where we used (35). The column vector containing the elements $\mathrm{bet}(k')$ will be called $\mathbf{b}$.

Moreover, the covariance between node $k$ and node $l$ is (see the Appendix)

$$\mathrm{cov}(k, l) = \frac{1}{\mathcal{Z}} \bigg\{ (z_{\bullet k} - 1)\, z_{k\bullet}\, \delta_{kl} + z_{k\bullet} (z_{\bullet l} - 1)(z_{lk} - \delta_{lk}) + z_{l\bullet} (z_{\bullet k} - 1)(z_{kl} - \delta_{kl}) - \frac{z_{k\bullet}\, z_{l\bullet} (z_{\bullet k} - 1)(z_{\bullet l} - 1)}{\mathcal{Z}} \bigg\}. \qquad (41)$$

The matrix containing the elements $\mathrm{cov}(k, l)$ will be denoted by $\boldsymbol{\Sigma}$. On the other hand, the correlation between nodes $k$ and $l$ is

$$\mathrm{cor}(k, l) = \frac{\mathrm{cov}(k, l)}{\sqrt{\mathrm{cov}(k, k)\, \mathrm{cov}(l, l)}}. \qquad (42)$$

In Algorithm 1,$^1$ we present the corresponding algorithm for computing the betweenness as well as the covariance measures for a graph $G$.

Algorithm 1. Computation of the betweenness vector and the covariance matrix between nodes.

Input:
- A graph $G$ containing $n$ nodes.
- $\theta > 0$: the parameter controlling the degree of randomness.
- $\mathbf{C}$: the $n \times n$ cost matrix associated with $G$, containing elements $c_{kk'} > 0$.
- $\mathbf{P}^{\mathrm{ref}}$: the $n \times n$ reference transition-probabilities matrix.

Output:
- The betweenness vector $\mathbf{b}$ containing the betweenness of each node, $\mathrm{bet}(k)$.
- The covariance matrix $\boldsymbol{\Sigma}$ between every pair of nodes, containing the elements $\mathrm{cov}(k, l)$.

1. $\mathbf{W} = \mathbf{P}^{\mathrm{ref}} \circ \exp[-\theta \mathbf{C}]$, where $\circ$ is the elementwise product and the exponential is taken elementwise
2. $\mathbf{Z} = (\mathbf{I} - \mathbf{W})^{-1}$, containing the elements $z_{kk'}$
3. $z_{\bullet k'} = \sum_{k=1}^{n} z_{kk'}$, $z_{k\bullet} = \sum_{k'=1}^{n} z_{kk'}$ for all $k, k'$; $z_{\bullet\bullet} = \sum_{k,k'=1}^{n} z_{kk'}$; $\mathcal{Z} = z_{\bullet\bullet} - n$
4. for $k = 1$ to $n$ do
5. &ensp; $\mathrm{bet}(k) = (z_{\bullet k} - 1)\, z_{k\bullet} / \mathcal{Z}$
6. &ensp; for $l = k$ to $n$ do
7. &ensp;&ensp; $\mathrm{cov}(k, l) = \frac{1}{\mathcal{Z}} \big\{ (z_{\bullet k} - 1) z_{k\bullet} \delta_{kl} + z_{k\bullet}(z_{\bullet l} - 1)(z_{lk} - \delta_{lk}) + z_{l\bullet}(z_{\bullet k} - 1)(z_{kl} - \delta_{kl}) - z_{k\bullet} z_{l\bullet}(z_{\bullet k} - 1)(z_{\bullet l} - 1)/\mathcal{Z} \big\}$
8. &ensp;&ensp; $\mathrm{cov}(l, k) = \mathrm{cov}(k, l)$
9. &ensp; end for
10. end for
11. return $\mathbf{b} = [\mathrm{bet}(k)]$, $\boldsymbol{\Sigma} = [\mathrm{cov}(k, l)]$

1. The Matlab/Octave implementation of the algorithm can be downloaded from http://iridia.ulb.ac.be/~amantrac/pub/SoP_TPAMI.zip.
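Algorithm 1 can be transcribed almost line by line; a vectorized numpy sketch follows (the 4-cycle test graph at the end is an illustrative assumption, not one of the paper's data sets):

```python
import numpy as np

def sop_betweenness_covariance(C, Pref, theta):
    """Vectorized transcription of Algorithm 1.

    C:     n x n matrix of positive arc costs (np.inf where there is no arc).
    Pref:  n x n reference transition-probability matrix.
    theta: inverse-temperature parameter, theta > 0.
    Returns the betweenness vector b and the covariance matrix Sigma."""
    n = C.shape[0]
    W = Pref * np.exp(-theta * C)            # step 1, Eq. (23); exp(-inf) = 0
    Z = np.linalg.inv(np.eye(n) - W)         # step 2, fundamental matrix
    zc = Z.sum(axis=0)                       # z_{.k}: column sums
    zr = Z.sum(axis=1)                       # z_{k.}: row sums
    calZ = Z.sum() - n                       # partition function, Eq. (31)
    b = (zc - 1.0) * zr / calZ               # step 5, Eq. (40)
    I = np.eye(n)
    Sigma = (np.diag((zc - 1.0) * zr)                  # (z_{.k}-1) z_{k.} delta_{kl}
             + np.outer(zr, zc - 1.0) * (Z.T - I)      # z_{k.}(z_{.l}-1)(z_{lk}-delta_{lk})
             + np.outer(zc - 1.0, zr) * (Z - I)        # z_{l.}(z_{.k}-1)(z_{kl}-delta_{kl})
             - np.outer(zr * (zc - 1.0), zr * (zc - 1.0)) / calZ) / calZ
    return b, Sigma

# Illustrative undirected 4-cycle with unit costs and natural random-walk reference.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
C = np.where(A > 0, 1.0, np.inf)
Pref = A / A.sum(axis=1, keepdims=True)
b, Sigma = sop_betweenness_covariance(C, Pref, theta=1.0)
```

On this vertex-transitive graph, all betweenness values coincide and the covariance matrix comes out symmetric with nonnegative diagonal, as expected from (22) and (41).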

Note that the covariance between one node of interest $k$ and the remaining nodes (one column of the covariance matrix) can be obtained by solving four linear systems of equations instead: (32) with $i = j = k$ (or, elementwise, (33) and (34)), and the equations $(\mathbf{I} - \mathbf{W})\mathbf{z}^c = \mathbf{e}$ and $(\mathbf{I} - \mathbf{W})^T \mathbf{z}^r = \mathbf{e}$ (or, elementwise, (35) and (36)), where $\mathbf{z}^c$ and $\mathbf{z}^r$ contain, respectively, the sums over columns (the row sums) and the sums over rows (the column sums) of $\mathbf{Z}$, and $\mathbf{e}$ is a column vector full of 1s. Thus, the column vectors $\mathbf{z}^c$ and $\mathbf{z}^r$ contain, respectively, the elements $z_{k\bullet}$ and $z_{\bullet k'}$. All of these linear systems can be solved efficiently, especially when the matrix $\mathbf{W}$ is sparse [76]. In particular, they can be solved iteratively: for instance, for (32), we can iterate $\mathbf{z}^r_i \leftarrow \mathbf{W}^T \mathbf{z}^r_i + \mathbf{e}_i$ and $\mathbf{z}^c_j \leftarrow \mathbf{W}\mathbf{z}^c_j + \mathbf{e}_j$, since $\rho(\mathbf{W}) < 1$ (see, for instance, [77], [78]).
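The fixed-point iteration mentioned above can be sketched as follows (numpy; the $2 \times 2$ substochastic matrix is an illustrative assumption):

```python
import numpy as np

def solve_fixed_point(W, b, tol=1e-12, max_iter=100000):
    """Solve (I - W) x = b by the iteration x <- W x + b, which
    converges whenever the spectral radius of W is strictly below 1."""
    x = np.zeros_like(b)
    for _ in range(max_iter):
        x_new = W @ x + b
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    return x

# Illustrative substochastic matrix; zc then holds the row sums of Z = (I - W)^{-1}.
W = np.array([[0.0, 0.4],
              [0.4, 0.0]])
zc = solve_fixed_point(W, np.ones(2))
```

With a right-hand side $\mathbf{e}$, the iteration recovers the row sums $z_{k\bullet}$ without ever forming the (generally dense) inverse.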

If, instead, all of the elements of the covariance matrix have to be computed, the inversion of the matrix $(\mathbf{I} - \mathbf{W})$ is needed, which can be an issue when dealing with large graphs. First, the computation time for inverting the matrix is significant, and second, even if the original matrix is sparse, the inverse is usually dense and may not fit into main memory. One potential solution to this problem is to first perform an incomplete factorization of the matrix. Indeed, if the graph is undirected, the matrix $(\mathbf{I} - \mathbf{W})$ is positive definite, and a low-rank (or incomplete) Cholesky factorization can be computed efficiently [79]; it remains sparse if the original matrix is sparse. Once this factorization is computed, each column of the matrix inverse can be obtained by simple back-substitution; this technique therefore exploits the sparseness of the network. The design of algorithms able to mine large graphs by, for instance, limiting the length of the walks (such as in [80]) will be investigated in further work.

7 EXPERIMENTS

This experimental section has two main objectives. First, the SoP betweenness is compared to two standard betweenness measures: Freeman's betweenness (see [6], computed from geodesic paths) and Newman's betweenness (see [81], computed from random walks on the graph). Second, the performance of the SoP covariance measure is assessed on a semi-supervised node classification task. In this setting, the SoP correlation kernel is compared to state-of-the-art kernels on a graph on eight different data sets, for several different labeling rates.

7.1 Betweenness Measure

In this first experiment, we report, for different values of $\theta$, the linear correlation between the SoP betweenness and both Freeman's and Newman's betweenness measures. This linear correlation was computed on various social networks, but only the results obtained on the well-known Zachary's karate club [82] and the dolphin network [8] are reported, the results being quite similar for the other data sets.

The results for these two data sets are shown in Fig. 2, where the linear correlation between the SoP betweenness and the two other betweenness measures is displayed in terms of $\theta$. We clearly observe that the SoP betweenness measure is highly correlated with Newman's betweenness, while it is less correlated with Freeman's betweenness. The same behavior has been observed on various other data sets (newsgroups, Florentine families, IMDb). It can also be observed that the correlation does not change much with the parameter $\theta$ within the same data set. There are, however, larger differences from one data set to another.

7.2 Semi-Supervised Classification

In this experiment, we address the task of classifying unlabeled nodes in partially labeled graphs. Note that the goal of this experiment is not to design a state-of-the-art semi-supervised classifier; rather, it is to study the performance of the proposed SoP correlation measure in comparison with other kernels on a graph.

Data sets. The different classification models, referred to as classifiers, are compared on eight data sets that have been used previously for semi-supervised classification: the four-universities WebKB cocite data sets [22], [83], the two industry data sets [83], the IMDb prodco data set [83], and the CoRA cite data set [83].$^2$


Fig. 2. Linear correlation between the SoP betweenness and Freeman's and Newman's betweenness measures, obtained on Zachary's karate club (upper) and the dolphin network (lower) for different values of $0 < \theta < 10$.

2. These data sets (original source) can be downloaded from http://netkit-srl.sourceforge.net/data.html. The preprocessed version in Matlab/Octave format used in our experiments is available from http://iridia.ulb.ac.be/~amantrac/pub/SoP_TPAMI.zip.


IMDb: The collaborative Internet Movie Database (IMDb) [83] has several applications, such as making movie recommendations or movie category classification. The classification problem focuses on predicting the movie's notoriety (whether or not the movie is a box-office hit). It contains a graph of movies linked together whenever they share the same production company. The weight of an edge in the resulting graph is the number of production companies two movies have in common. The IMDb prodco graph contains 1,169 movies, with the class distribution shown in Table 1.

Industry: Industry regroups two data sets [83]. The industry-pr data set is based on 35,318 Newswire press releases. The companies mentioned in each press release were extracted, and an edge was placed between two companies if they appeared together in a press release. The industry-yh data set is based on 22,170 business news stories collected from the Web; an edge is placed between two companies if they appeared together in a story. The weight of an edge is the number of such co-occurrences found in the complete corpus. To classify a company, Yahoo!'s 12 industry sectors have been used in the two industry data sets. The details of the two industry data sets are reported in Table 2.

CoRA: CoRA cite is a graph of 3,583 nodes collected from machine learning research papers labeled with seven different topics [83]. Papers are linked if they share a common author or if one cites the other. The composition of the CoRA cite data set is reported in Table 3.

WebKB: WebKB consists of sets of Web pages gathered from four computer science departments (one for each university), with each page manually labeled into six categories: course, department, faculty, project, staff, and student [83]. Two pages are linked by cocitation (if x links to z and y links to z, then x and y are cociting z). The composition of the data set is shown in Table 4.

Classification models. The compared classifiers are based on the following:

1. the SoP correlation kernel (SoP) introduced in this paper,
2. the normalized commute time (NCT) kernel [22],
3. the randomized shortest path (RSP) similarity induced by the randomized shortest path dissimilarity [2],
4. the diffusion map kernel (DM) [59] computed from the diffusion distance [55], [56], [57], [58],
5. the commute time (CT) kernel [14], [17],
6. the regularized Laplacian (RL) kernel [15], [16],
7. the Netkit (Netkit) framework [83], and
8. a simple k-nearest neighbor (KNN) classifier.

The first six classifiers are kernel-based, while the last two are included as baselines.

Note that, for the randomized shortest path dissimilarity, the natural kernel matrix based on the matrix of squared dissimilarities $\mathbf{D}_{\mathrm{RSP}}$ is derived in the standard way by $\mathbf{K}_{\mathrm{RSP}} = -\frac{1}{2} \mathbf{H} \mathbf{D}_{\mathrm{RSP}} \mathbf{H}$, where $\mathbf{H} = \mathbf{I} - \mathbf{e}\mathbf{e}^T/n$ is the centering matrix, $\mathbf{e}$ is a column vector full of ones, and $\mathbf{D}_{\mathrm{RSP}}$ is a matrix containing the symmetric squared dissimilarities [2]. This is the standard way of deriving a similarity from a dissimilarity [84] when the dissimilarity matrix contains squared distances.
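The double-centering step used for $\mathbf{K}_{\mathrm{RSP}}$ can be sketched as follows (numpy; the three collinear points are an illustrative check that $-\frac{1}{2}\mathbf{H}\mathbf{D}\mathbf{H}$ recovers inner products of centered points when $\mathbf{D}$ contains squared Euclidean distances):

```python
import numpy as np

def kernel_from_squared_dissimilarities(D2):
    """Classical double centering: K = -1/2 H D2 H, with H = I - ee^T / n."""
    n = D2.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * H @ D2 @ H

# Illustrative check: squared Euclidean distances between three points on a
# line should yield the Gram matrix of the centered points.
x = np.array([0.0, 1.0, 3.0])
D2 = (x[:, None] - x[None, :]) ** 2
K = kernel_from_squared_dissimilarities(D2)
```

This is the same classical multidimensional-scaling construction referenced above for turning the RSP dissimilarities into a kernel.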

For the kernel-based methods (SoP, NCT, RSP, DM, CT, and RL), the classification of the unlabeled nodes was performed according to a simple sum of the similarities with the labeled nodes (as, for instance, described in [22]). More precisely, let us define an $n$-dimensional indicator vector $\mathbf{y}_i$ containing as entries a 1 when the corresponding node belongs to class $i$ and 0 otherwise (in which case the node is unlabeled or belongs to another class). For each node, its similarity with the nodes belonging to class $i$ is contained in the column vector $\mathbf{K}\mathbf{y}_i$, where $\mathbf{K}$ is the graph kernel matrix. Then, each node is assigned to the class showing the largest


TABLE 1. Class Distribution for the IMDb-prodco Data Set

TABLE 2. Class Distribution for the industry-yh and industry-pr Data Sets

TABLE 3. Class Distribution for the CoRA cite Data Set

TABLE 4. Class Distribution for the WebKB cocite Data Set


similarity; the predicted class index is thus provided by $\mathrm{argmax}_i(\mathbf{K}\mathbf{y}_i)$ for all nodes.
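The assignment rule $\mathrm{argmax}_i(\mathbf{K}\mathbf{y}_i)$ amounts to one matrix product; a minimal numpy sketch (the block-structured kernel and the label vector are illustrative assumptions):

```python
import numpy as np

def classify_with_kernel(K, labels):
    """Assign each node to the class with the largest summed similarity
    to the labeled nodes; labels[k] = class index, or -1 if unlabeled."""
    classes = sorted(set(labels[labels >= 0]))
    # One indicator column vector y_i per class.
    Y = np.stack([(labels == c).astype(float) for c in classes], axis=1)
    scores = K @ Y                  # column i contains K y_i
    return np.array(classes)[np.argmax(scores, axis=1)]

# Illustrative block-structured similarity with two communities,
# one labeled node per community.
K = np.array([[2.0, 1.5, 0.1, 0.1],
              [1.5, 2.0, 0.1, 0.1],
              [0.1, 0.1, 2.0, 1.5],
              [0.1, 0.1, 1.5, 2.0]])
labels = np.array([0, -1, 1, -1])
pred = classify_with_kernel(K, labels)
```

Here each unlabeled node is pulled toward the labeled node of its own block, so the two communities are recovered.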

Note that, for the NetKit classifier (Netkit), we only tested the default parameters of the framework, which generally provide good results [83]. This method therefore defines baseline performances on each data set. We also report, as a baseline, the results of a KNN classifier. Our implementation of the KNN consists in taking all neighbors within a maximum of $k$ steps of the considered node. An unlabeled node is assigned the label that is most represented among the nodes located within at most $k$ steps. Each vote is weighted by the similarity in terms of number of steps ($1/k$) with the node of interest.

Note that, while the SoP kernel and the other graph kernels investigated in this experimental section suffer to a certain extent from the diagonal dominance problem [85], [86], no attempt was made to tackle this issue.

Experimental methodology. The classification accuracy will be reported for several labeling rates (20, 35, 50, 65, 80, and 95 percent), i.e., proportions of nodes for which the label is known. The labels of the remaining nodes are removed and used as test data. For each considered labeling rate, 100 random node-label deletions (100 runs) were performed, over which performances are averaged. For each unlabeled node, the various classifiers predict the most suitable category according to the procedures described in the previous paragraphs.

For each run, a fivefold internal cross validation is performed on the remaining labeled nodes in order to tune the hyperparameters of each classifier (for instance, the parameter $\theta$ for the SoP correlation). Thus, the performance on each run is assessed on the remaining unlabeled nodes with the hyperparameters tuned during the cross validation. We report, for each method and each labeling rate, the average classification rate obtained over the 100 runs.

Results and discussion. Comparative results for each method are reported on the eight different data sets in Figs. 6a, 6b, 6c, and 6d. Clearly, the SoP, the RSP, and the NCT kernel-based classifiers outperform the other approaches on the majority of data sets. The RSP obtains the best results on two data sets (WebKB-washington and WebKB-texas), while the NCT kernel achieves the best performance on the WebKB-wisconsin data set. The NetKit package provides poor results except on the industry data sets, where it achieves the best results. Note that the DM kernel and the RL kernel are competitive methods but never outperform the three leading methods on the tested data sets. In the case of an undirected graph, the NCT kernel differs from the RL method only by the normalization of the Laplacian matrix. This normalization clearly boosts the performance for the classification task. The KNN obtains good results on the IMDb data set (Fig. 6d); it is based on the hypothesis that propagating labels from node to node is a good way to label unlabeled nodes. Intuitively, this measure will be ineffective in case of a low labeling rate. This can indeed be observed on all data sets, and more strongly on the IMDb and CoRA data sets (Fig. 6d).


Fig. 3. A simple example graph, for illustration.

Fig. 4. SoP correlation matrix computed on the graph of Fig. 3 for, respectively, (a) $\theta \to 0$, (b) $\theta = 0.1$, and (c) $\theta \to \infty$. Dark red colors indicate high correlations while dark blue colors indicate low correlations.

Fig. 5. Curves of the classification rates obtained for different values of the parameter $\theta$ (x-axis) and different labeling rates.



Fig. 6. Classification rates in percent, averaged over 100 runs, obtained on partially labeled graphs for increasing labeling rates of 20, 35, 50, 65, 80, and 95 percent. Results are reported for the SoP kernel, the NCT kernel, the RSP kernel, the CT kernel, the DM kernel, the RL kernel, the Netkit (NetKit) framework, and the KNN algorithm. The graphs show the results obtained on the washington, wisconsin, texas, and cornell WebKB data sets, on the two industry data sets, and on the IMDb and CoRA data sets [83]. Note that the results of the RSP kernel are not reported on the largest data sets (industry and CoRA) because of prohibitive computation time.


The SoP measure is based on the same diffusion hypothesis. However, in the present case, all of the other nodes of the graph are taken into account when assigning a label to a node. The way a node is correlated with the others is tuned through the $\theta$ parameter. A high $\theta$ value means that only the shortest paths are considered. Considering only the shortest paths leaves little chance for two nodes to appear together on the same path, which leads to a uniformly low correlation. This can be observed in Fig. 4c, where we display the SoP matrix obtained for the graph of Fig. 3: The correlation between nodes is low and nearly the same for all pairs of nodes, so that discriminating between nodes is difficult. A lower $\theta$ behaves better for discriminating between nodes, since longer paths are then taken into account. This behavior can be observed in Fig. 4a: The left-hand side nodes and the right-hand side nodes are well clustered. We may still perform better using a $\theta$ of, for example, 0.1, which gives more weight to close nodes and improves the diffusion toward unlabeled nodes. Fig. 4b shows that, for a $\theta$ of 0.1, the left-hand side nodes are less correlated with the right-hand side nodes than for a near-zero $\theta$ value (Fig. 4a), while the nodes are still well clustered. This is confirmed by the experiments shown in Fig. 5, where we present the influence of the $\theta$ parameter on the classification rate on the WebKB-washington data set for increasing labeling rates. On this data set, the best $\theta$ parameter is located between 0.2 and 0.8, according to the labeling rate considered.

8 CONCLUSION

This work introduced a new family of covariance measures between the nodes of a weighted directed graph. It depends on a metaparameter $\theta$ gradually biasing the simple random walk on the graph toward short paths. The main idea behind the model is that two nodes are correlated if they often co-occur on the same, preferably short, trajectory. A sum-over-paths statistical physics framework is used in order to derive the form of the covariance measure in terms of the immediate costs defined on the arcs.

The resulting covariance matrix between nodes is a graph kernel and is called the SoP kernel. It can be computed efficiently by inverting an $n \times n$ matrix, where $n$ is the number of nodes. Semi-supervised classification experiments show that the SoP kernel performs well compared to other competing kernels on a graph. The main drawback of the SoP method is that it does not scale well to large graphs when the entire kernel matrix is needed. This issue is currently being investigated by the authors, for instance, by limiting the length of the walks.

In future work, comparisons between this covariance measure and other popular choices mentioned in Section 1.2 will be performed on collaborative recommendation tasks. We will also try to tackle Markov decision processes through the sum-over-paths statistical physics framework. The SoP model could also be exploited in order to define hidden Markov model parameter estimation techniques that are intermediate between the Viterbi algorithm (shortest path) and the EM (Baum-Welch) algorithm (natural random walk on the lattice). Finally, links between the proposed sum-over-paths framework and the matrix-forest theorem [15], as well as the generating function approach to random walks [87], will be investigated. We are also interested in the application of the SoP framework to graph kernels defined on different graphs [88], [89] in order to perform, for instance, graph classification.

On the other hand, we are currently developing models dealing with graphs containing features on the nodes, in addition to labels, as is often the case for data coming from spatial statistics or econometrics [90], [91]. Comparisons with existing models from these fields, taking both the features and the graph structure into account, will be performed.

APPENDIX

COMPUTATION OF THE PARTIAL DERIVATIVES OF THE PARTITION FUNCTION $\mathcal{Z}$

First, let us compute the expected energy $\bar{E}$ given by (8):

$$\bar{E} = \frac{\partial(-\ln \mathcal{Z})}{\partial \theta} = -\frac{\partial_\theta \big[\mathbf{e}^T \big((\mathbf{I} - \mathbf{W})^{-1} - \mathbf{I}\big)\mathbf{e}\big]}{\mathcal{Z}} = -\frac{\mathbf{e}^T \big(\partial_\theta (\mathbf{I} - \mathbf{W})^{-1}\big)\, \mathbf{e}}{\mathcal{Z}}, \qquad (43)$$

where the partition function $\mathcal{Z}$ is provided by (28). We thus have to compute $\partial_\theta(\mathbf{I} - \mathbf{W})^{-1}$; by setting $\mathbf{Z} = (\mathbf{I} - \mathbf{W})^{-1}$ and denoting the elements $k, k'$ of $\mathbf{Z}$ by $z_{kk'}$, we obtain

$$\partial_\theta (\mathbf{I} - \mathbf{W})^{-1} = -\mathbf{Z}\big(\partial_\theta(\mathbf{I} - \mathbf{W})\big)\mathbf{Z} = \mathbf{Z}\big(\partial_\theta \mathbf{W}\big)\mathbf{Z} = \mathbf{Z}\mathbf{W}'_\theta\mathbf{Z}, \qquad (44)$$

where the matrix $\mathbf{W}'_\theta$ contains the elements $[\mathbf{W}'_\theta]_{kk'} = -c_{kk'} \exp[-\theta c_{kk'} + \ln p^{\mathrm{ref}}_{kk'}]$. In matrix form, $\mathbf{W}'_\theta = -\mathbf{C} \circ \mathbf{W}$, where $\circ$ is the elementwise (Hadamard) matrix product [92]. Therefore, the expected energy $\bar{E}$ is

$$\bar{E} = -\frac{\mathbf{e}^T \mathbf{Z}\mathbf{W}'_\theta\mathbf{Z}\, \mathbf{e}}{\mathcal{Z}}. \qquad (45)$$

We now turn to the computation of $\bar{\eta}(k, k')$ (12). Recall that $F$ is the free energy defined in (7):

$$\bar{\eta}(k, k') = \frac{\partial F}{\partial c_{kk'}} \qquad (46)$$
$$= -\frac{1}{\theta} \frac{\partial_{c_{kk'}} \big[\mathbf{e}^T \big((\mathbf{I} - \mathbf{W})^{-1} - \mathbf{I}\big)\mathbf{e}\big]}{\mathcal{Z}} \qquad (47)$$
$$= -\frac{1}{\theta} \frac{\mathbf{e}^T \big(\partial_{c_{kk'}} (\mathbf{I} - \mathbf{W})^{-1}\big)\, \mathbf{e}}{\mathcal{Z}} \qquad (48)$$
$$= -\frac{1}{\theta} \frac{\mathbf{e}^T \big(\partial_{c_{kk'}} \mathbf{Z}\big)\, \mathbf{e}}{\mathcal{Z}}. \qquad (49)$$

Let us compute $\partial_{c_{kk'}} \mathbf{Z}$:

$$\partial_{c_{kk'}} \mathbf{Z} = \partial_{c_{kk'}} (\mathbf{I} - \mathbf{W})^{-1} \qquad (50)$$
$$= -\mathbf{Z}\big(\partial_{c_{kk'}}(\mathbf{I} - \mathbf{W})\big)\mathbf{Z} \qquad (51)$$
$$= \mathbf{Z}\big(\partial_{c_{kk'}} \mathbf{W}\big)\mathbf{Z} \qquad (52)$$
$$= -\theta \exp\!\big[-\theta c_{kk'} + \ln p^{\mathrm{ref}}_{kk'}\big]\, \mathbf{Z}\mathbf{e}_k \mathbf{e}_{k'}^T \mathbf{Z}. \qquad (53)$$

12 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. X, XXXXXXX 2010


Thus, by defining $z_{\bullet k} = \sum_{i=1}^{n} z_{ik}$ and $z_{k\bullet} = \sum_{i=1}^{n} z_{ki}$, $\eta(k,k')$ is given by

$$
\begin{aligned}
\eta(k,k') &= \exp\big[-\theta c_{kk'} + \ln p^{\mathrm{ref}}_{kk'}\big]\,\frac{\mathbf{e}^{\mathrm{T}}\mathbf{Z}\mathbf{e}_{k}\,\mathbf{e}_{k'}^{\mathrm{T}}\mathbf{Z}\mathbf{e}}{\mathcal{Z}} &\text{(54)}\\
&= \frac{\sum_{i,j=1}^{n} z_{ik}\,z_{k'j}\,\exp\big[-\theta c_{kk'} + \ln p^{\mathrm{ref}}_{kk'}\big]}{\mathcal{Z}} &\text{(55)}\\
&= \frac{z_{\bullet k}\,z_{k'\bullet}\,\exp\big[-\theta c_{kk'} + \ln p^{\mathrm{ref}}_{kk'}\big]}{\mathcal{Z}}. &\text{(56)}
\end{aligned}
$$

The expected number of passages through node $k'$, that is, the betweenness measure (15), is

$$
\begin{aligned}
\mathrm{bet}(k') &= \sum_{k=1}^{n}\eta(k,k') &\text{(57)}\\
&= \frac{\sum_{k=1}^{n} z_{\bullet k}\,z_{k'\bullet}\,\exp\big[-\theta c_{kk'} + \ln p^{\mathrm{ref}}_{kk'}\big]}{\mathcal{Z}} &\text{(58)}\\
&= \frac{(z_{\bullet k'} - 1)\,z_{k'\bullet}}{\mathcal{Z}}, &\text{(59)}
\end{aligned}
$$

where we used (35). The second-order derivative (16) is a bit tedious to compute; it aims to differentiate $\eta(k,k')$ provided by (54):

$$
\begin{aligned}
\eta(k,k';l,l') &= \frac{1}{\theta^{2}}\,\frac{\partial^{2}(\ln\mathcal{Z})}{\partial c_{ll'}\,\partial c_{kk'}} &\text{(60)}\\
&= -\frac{1}{\theta}\,\frac{\partial\,\eta(k,k')}{\partial c_{ll'}} &\text{(61)}\\
&= -\frac{\exp\big[-\theta c_{kk'} + \ln p^{\mathrm{ref}}_{kk'}\big]}{\theta}
\Bigg\{-\theta\,\frac{\mathbf{e}^{\mathrm{T}}\mathbf{Z}\mathbf{e}_{k}\,\mathbf{e}_{k'}^{\mathrm{T}}\mathbf{Z}\mathbf{e}}{\mathcal{Z}}\,\delta_{kl}\delta_{k'l'}
+ \frac{\mathbf{e}_{k'}^{\mathrm{T}}\mathbf{Z}\mathbf{e}}{\mathcal{Z}}\,\big(\partial_{c_{ll'}}\mathbf{e}^{\mathrm{T}}\mathbf{Z}\mathbf{e}_{k}\big)
+ \frac{\mathbf{e}^{\mathrm{T}}\mathbf{Z}\mathbf{e}_{k}}{\mathcal{Z}}\,\big(\partial_{c_{ll'}}\mathbf{e}_{k'}^{\mathrm{T}}\mathbf{Z}\mathbf{e}\big)
- \frac{\mathbf{e}^{\mathrm{T}}\mathbf{Z}\mathbf{e}_{k}\,\mathbf{e}_{k'}^{\mathrm{T}}\mathbf{Z}\mathbf{e}}{\mathcal{Z}^{2}}\,\big(\partial_{c_{ll'}}\mathcal{Z}\big)\Bigg\} &\text{(62)}\\
&= \exp\big[-\theta c_{kk'} + \ln p^{\mathrm{ref}}_{kk'}\big]
\Bigg\{\frac{\mathbf{e}^{\mathrm{T}}\mathbf{Z}\mathbf{e}_{k}\,\mathbf{e}_{k'}^{\mathrm{T}}\mathbf{Z}\mathbf{e}}{\mathcal{Z}}\,\delta_{kl}\delta_{k'l'}
+ \exp\big[-\theta c_{ll'} + \ln p^{\mathrm{ref}}_{ll'}\big]
\bigg[\frac{\mathbf{e}_{k'}^{\mathrm{T}}\mathbf{Z}\mathbf{e}}{\mathcal{Z}}\,\big(\mathbf{e}^{\mathrm{T}}\mathbf{Z}\mathbf{e}_{l}\,\mathbf{e}_{l'}^{\mathrm{T}}\mathbf{Z}\mathbf{e}_{k}\big)
+ \frac{\mathbf{e}^{\mathrm{T}}\mathbf{Z}\mathbf{e}_{k}}{\mathcal{Z}}\,\big(\mathbf{e}_{k'}^{\mathrm{T}}\mathbf{Z}\mathbf{e}_{l}\,\mathbf{e}_{l'}^{\mathrm{T}}\mathbf{Z}\mathbf{e}\big)
- \frac{\mathbf{e}^{\mathrm{T}}\mathbf{Z}\mathbf{e}_{k}\,\mathbf{e}_{k'}^{\mathrm{T}}\mathbf{Z}\mathbf{e}}{\mathcal{Z}^{2}}\,\mathbf{e}^{\mathrm{T}}\mathbf{Z}\mathbf{e}_{l}\,\mathbf{e}_{l'}^{\mathrm{T}}\mathbf{Z}\mathbf{e}\bigg]\Bigg\} &\text{(63)}\\
&= \exp\big[-\theta c_{kk'} + \ln p^{\mathrm{ref}}_{kk'}\big]
\Bigg\{\frac{z_{\bullet k}\,z_{k'\bullet}}{\mathcal{Z}}\,\delta_{kl}\delta_{k'l'}
+ \exp\big[-\theta c_{ll'} + \ln p^{\mathrm{ref}}_{ll'}\big]
\bigg[\frac{z_{k'\bullet}\,z_{\bullet l}\,z_{l'k}}{\mathcal{Z}}
+ \frac{z_{l'\bullet}\,z_{\bullet k}\,z_{k'l}}{\mathcal{Z}}
- \frac{z_{k'\bullet}\,z_{\bullet k}\,z_{l'\bullet}\,z_{\bullet l}}{\mathcal{Z}^{2}}\bigg]\Bigg\}. &\text{(64)}
\end{aligned}
$$

Finally, using (33)-(36), the corresponding covariances between nodes (20) are

$$
\begin{aligned}
\mathrm{cov}(k',l') &= \sum_{k,l=1}^{n}\eta(k,k';l,l') &\text{(65)}\\
&= \sum_{k=1}^{n}\exp\big[-\theta c_{kk'} + \ln p^{\mathrm{ref}}_{kk'}\big]
\Bigg\{\frac{z_{\bullet k}\,z_{k'\bullet}}{\mathcal{Z}}\,\delta_{k'l'}
+ \frac{z_{k'\bullet}\,(z_{\bullet l'}-1)\,z_{l'k}}{\mathcal{Z}}
+ \frac{z_{l'\bullet}\,z_{\bullet k}\,(z_{k'l'}-\delta_{k'l'})}{\mathcal{Z}}
- \frac{z_{k'\bullet}\,z_{\bullet k}\,z_{l'\bullet}\,(z_{\bullet l'}-1)}{\mathcal{Z}^{2}}\Bigg\} &\text{(66)}\\
&= \frac{(z_{\bullet k'}-1)\,z_{k'\bullet}}{\mathcal{Z}}\,\delta_{k'l'}
+ \frac{z_{k'\bullet}\,(z_{\bullet l'}-1)\,(z_{l'k'}-\delta_{l'k'})}{\mathcal{Z}}
+ \frac{z_{l'\bullet}\,(z_{\bullet k'}-1)\,(z_{k'l'}-\delta_{k'l'})}{\mathcal{Z}}
- \frac{z_{k'\bullet}\,(z_{\bullet k'}-1)\,z_{l'\bullet}\,(z_{\bullet l'}-1)}{\mathcal{Z}^{2}} &\text{(67)}\\
&= \frac{1}{\mathcal{Z}}\Bigg\{(z_{\bullet k'}-1)\,z_{k'\bullet}\,\delta_{k'l'}
+ z_{k'\bullet}\,(z_{\bullet l'}-1)\,(z_{l'k'}-\delta_{l'k'})
+ z_{l'\bullet}\,(z_{\bullet k'}-1)\,(z_{k'l'}-\delta_{k'l'})
- \frac{z_{k'\bullet}\,(z_{\bullet k'}-1)\,z_{l'\bullet}\,(z_{\bullet l'}-1)}{\mathcal{Z}}\Bigg\}. &\text{(68)}
\end{aligned}
$$
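As a concrete illustration, the closed-form expressions (56), (59), and (68) can be evaluated numerically on a toy graph. The following is a minimal sketch with assumed variable names (not the authors' code); it also verifies two internal consistency properties of the derivation: summing the arc passages of (56) over k reproduces the betweenness (59), and the covariance matrix of (68) is symmetric, as a Gram matrix must be.

```python
import numpy as np

# Minimal numerical sketch of the appendix formulas on a toy graph
# (assumed variable names, not the authors' code).
A = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])              # adjacency of a small graph
C = np.ones((3, 3))                       # unit cost c_kk' on every arc
P_ref = A / A.sum(axis=1, keepdims=True)  # reference transition probabilities
theta = 1.0

# w_kk' = exp(-theta c_kk' + ln p_ref_kk'); zero wherever p_ref is zero
W = P_ref * np.exp(-theta * C)
n = W.shape[0]
Z = np.linalg.inv(np.eye(n) - W)          # fundamental matrix, entries z_kk'
e = np.ones(n)
Zp = e @ (Z - np.eye(n)) @ e              # partition function of Eq. (28)

z_col = Z.sum(axis=0)                     # z_{.k} (column sums)
z_row = Z.sum(axis=1)                     # z_{k.} (row sums)

# Expected number of passages through arc (k, k'), Eq. (56)
eta = np.outer(z_col, z_row) * W / Zp

# Betweenness of node k', Eq. (59); must match summing eta over k, Eq. (57)
bet = (z_col - 1.0) * z_row / Zp
assert np.allclose(bet, eta.sum(axis=0))

# Covariance between nodes, Eq. (68)
b = z_row * (z_col - 1.0)
T2 = np.outer(z_row, z_col - 1.0) * (Z.T - np.eye(n))
cov = (np.diag(b) + T2 + T2.T - np.outer(b, b) / Zp) / Zp
assert np.allclose(cov, cov.T)            # a valid Gram matrix is symmetric
```

The first assertion is exactly identity (35) at work: summing $z_{\bullet k} w_{kk'}$ over $k$ gives $z_{\bullet k'} - 1$, which turns the sum of (56) into (59).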

ACKNOWLEDGMENTS

The authors thank the anonymous reviewers for their interesting remarks and suggestions that helped to significantly improve the quality of the paper. Part of this work has been funded by projects with the "Region wallonne" and the Belgian "Politique Scientifique Federale." The authors thank these institutions for giving them the opportunity to conduct both fundamental and applied research.

REFERENCES

[1] M. Saerens, Y. Achbany, F. Fouss, and L. Yen, "Randomized Shortest-Path Problems: Two Related Models," Neural Computation, 2009.

[2] L. Yen, A. Mantrach, M. Shimbo, and M. Saerens, "A Family of Dissimilarity Measures between Nodes Generalizing Both the Shortest-Path and the Commute-Time Distances," Proc. ACM SIGKDD, pp. 785-793, 2008.

[3] T. Akamatsu, "Cyclic Flows, Markov Process and Stochastic Traffic Assignment," Transportation Research B, vol. 30, no. 5, pp. 369-386, 1996.

[4] H. Small, "Co-Citation in the Scientific Literature: A New Measure of the Relationship between Two Documents," J. Am. Soc. Information Science, vol. 24, no. 4, pp. 265-269, 1973.

[5] M.M. Kessler, "Bibliographic Coupling between Scientific Papers," Am. Documentation, vol. 14, no. 1, pp. 10-25, 1963.

[6] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications. Cambridge Univ. Press, 1994.

[7] J.-C. Delvenne, S. Yaliraki, and M. Barahona, "Stability of Graph Communities across Time Scales," arXiv:0812.1770, 2008.

MANTRACH ET AL.: THE SUM-OVER-PATHS COVARIANCE KERNEL: A NOVEL COVARIANCE MEASURE BETWEEN NODES OF A DIRECTED... 13


[8] M. Newman and M. Girvan, "Finding and Evaluating Community Structure in Networks," Physical Rev. E, vol. 69, p. 026113, 2004.

[9] D.J. Klein and M. Randic, "Resistance Distance," J. Math. Chemistry, vol. 12, pp. 81-95, 1993.

[10] R.B. Bapat, "Resistance Distance in Graphs," The Math. Student, vol. 68, pp. 87-98, 1999.

[11] F. Gobel and A.A. Jagers, "Random Walks on Graphs," Stochastic Processes and Their Applications, vol. 2, pp. 311-336, 1974.

[12] L. Lovasz, "Random Walks on Graphs: A Survey," Combinatorics: Paul Erdos Is Eighty, vol. 2, pp. 353-397, 1996.

[13] A.K. Chandra, P. Raghavan, W.L. Ruzzo, R. Smolensky, and P. Tiwari, "The Electrical Resistance of a Graph Captures Its Commute and Cover Times," Proc. Ann. ACM Symp. Theory of Computing, pp. 574-586, 1989.

[14] M. Saerens, F. Fouss, L. Yen, and P. Dupont, "The Principal Components Analysis of a Graph, and Its Relationships to Spectral Clustering," Proc. 15th European Conf. Machine Learning, pp. 371-383, 2004.

[15] P. Chebotarev and E. Shamis, "The Matrix-Forest Theorem and Measuring Relations in Small Social Groups," Automation and Remote Control, vol. 58, no. 9, pp. 1505-1514, 1997.

[16] P. Chebotarev and E. Shamis, "On Proximity Measures for Graph Vertices," Automation and Remote Control, vol. 59, no. 10, pp. 1443-1459, 1998.

[17] F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens, "Random-Walk Computation of Similarities between Nodes of a Graph, with Application to Collaborative Recommendation," IEEE Trans. Knowledge and Data Eng., vol. 19, no. 3, pp. 355-369, Mar. 2007.

[18] T. Ito, M. Shimbo, T. Kudo, and Y. Matsumoto, "Application of Kernels to Link Analysis," Proc. ACM SIGKDD, pp. 586-592, 2005.

[19] M. Shimbo and T. Ito, "Kernels as Link Analysis Measures," Mining Graph Data, D. Cook and L. Holder, eds., pp. 283-310, John Wiley & Sons, 2006.

[20] J. Kandola, N. Cristianini, and J. Shawe-Taylor, "Learning Semantic Similarity," Proc. Advances in Neural Information Processing Systems 15, pp. 657-664, 2002.

[21] D. Zhou and B. Scholkopf, "Learning from Labeled and Unlabeled Data Using Random Walks," Proc. 26th DAGM Symp., C.E. Rasmussen, H.H. Bulthoff, M.A. Giese, and B. Scholkopf, eds., pp. 237-244, 2004.

[22] D. Zhou, J. Huang, and B. Scholkopf, "Learning from Labeled and Unlabeled Data on a Directed Graph," Proc. 22nd Int'l Conf. Machine Learning, pp. 1041-1048, 2005.

[23] F.R. Chung, "Laplacians and the Cheeger Inequality for Directed Graphs," Annals of Combinatorics, vol. 9, pp. 1-19, 2005.

[24] T. Chen, Q. Yang, and X. Tang, "Directed Graph Embedding," Proc. Int'l Joint Conf. Artificial Intelligence, pp. 2707-2712, 2007.

[25] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge Univ. Press, 2004.

[26] R.I. Kondor and J. Lafferty, "Diffusion Kernels on Graphs and Other Discrete Structures," Proc. 19th Int'l Conf. Machine Learning, pp. 315-322, 2002.

[27] P.G. Doyle and J.L. Snell, Random Walks and Electric Networks. The Math. Assoc. of Am., 1984.

[28] D. Harel and Y. Koren, "On Clustering Using Random Walks," Proc. Conf. Foundations of Software Technology and Theoretical Computer Science, pp. 18-41, 2001.

[29] S. White and P. Smyth, "Algorithms for Estimating Relative Importance in Networks," Proc. ACM SIGKDD, pp. 266-275, 2003.

[30] S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Computer Networks and ISDN Systems, vol. 30, nos. 1-7, pp. 107-117, 1998.

[31] L. Page, S. Brin, R. Motwani, and T. Winograd, "The Pagerank Citation Ranking: Bringing Order to the Web," Technical Report 1999-0120, Computer Science Dept., Stanford Univ., 1999.

[32] D. Liben-Nowell and J. Kleinberg, "The Link-Prediction Problem for Social Networks," J. Am. Soc. Information Science and Technology, vol. 58, no. 7, pp. 1019-1031, 2007.

[33] A.J. Smola and R. Kondor, "Kernels and Regularization on Graphs," Proc. Conf. Learning Theory, M. Warmuth and B. Scholkopf, eds., pp. 144-158, 2003.

[34] X. Zhu, J. Kandola, J. Lafferty, and Z. Ghahramani, "Graph Kernels by Spectral Transforms," Semi-Supervised Learning, O. Chapelle, B. Scholkopf, and A. Zien, eds., pp. 277-291, MIT Press, 2006.

[35] C. Palmer and C. Faloutsos, "Electricity Based External Similarity of Categorical Attributes," Proc. Seventh Pacific-Asia Conf. Knowledge Discovery and Data Mining, pp. 486-500, 2003.

[36] P. Sarkar and A. Moore, "A Tractable Approach to Finding Closest Truncated-Commute-Time Neighbors in Large Graphs," Proc. 23rd Conf. Uncertainty in Artificial Intelligence, 2007.

[37] H. Qiu and E.R. Hancock, "Image Segmentation Using Commute Times," Proc. 16th British Machine Vision Conf., pp. 929-938, 2005.

[38] H. Qiu and E.R. Hancock, "Clustering and Embedding Using Commute Times," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 11, pp. 1873-1890, Nov. 2007.

[39] J. Ham, D. Lee, S. Mika, and B. Scholkopf, "A Kernel View of the Dimensionality Reduction of Manifolds," Proc. 21st Int'l Conf. Machine Learning, pp. 369-376, 2004.

[40] L. Yen, D. Vanvyve, F. Wouters, F. Fouss, M. Verleysen, and M. Saerens, "Clustering Using a Random Walk-Based Distance Measure," Proc. 13th European Symp. Artificial Neural Networks, pp. 317-324, 2005.

[41] M. Brand, "A Random Walks Perspective on Maximizing Satisfaction and Profit," Proc. 2005 SIAM Int'l Conf. Data Mining, 2005.

[42] M. Fiedler, "A Property of Eigenvectors of Nonnegative Symmetric Matrices and Its Applications to Graph Theory," Czechoslovak Math. J., vol. 25, no. 100, pp. 619-633, 1975.

[43] B. Mohar, "Laplace Eigenvalues of Graphs—A Survey," Discrete Math., vol. 109, pp. 171-183, 1992.

[44] T. Chan, P. Ciarlet, and W. Szeto, "On the Optimality of the Median Cut Spectral Bisection Graph Partitioning Method," SIAM J. Scientific Computing, vol. 18, no. 3, pp. 943-948, 1997.

[45] A. Pothen, H.D. Simon, and K.-P. Liou, "Partitioning Sparse Matrices with Eigenvectors of Graphs," SIAM J. Matrix Analysis and Applications, vol. 11, no. 3, pp. 430-452, 1990.

[46] L. Donetti and M. Munoz, "Detecting Network Communities: A New Systematic and Efficient Algorithm," J. Statistical Mechanics: Theory and Experiment, vol. P10012, 2004.

[47] L. Yen, F. Fouss, C. Decaestecker, P. Francq, and M. Saerens, "Graph Nodes Clustering Based on the Commute-Time Kernel," Proc. 11th Pacific-Asia Conf. Knowledge Discovery and Data Mining, pp. 1037-1045, 2007.

[48] L. Yen, F. Fouss, C. Decaestecker, P. Francq, and M. Saerens, "Graph Nodes Clustering with the Sigmoid Commute-Time Kernel: A Comparative Study," Data & Knowledge Eng., 2009.

[49] P. Chebotarev, "A New Family of Graph Distances," arXiv:0810.2717v2, 2008.

[50] H. Zhou, "Distance, Dissimilarity Index, and Network Community Structure," Physical Rev. E, vol. 67, p. 061901, 2003.

[51] H. Zhou, "Network Landscape from a Brownian Particle Perspective," Physical Rev. E, vol. 67, p. 041908, 2003.

[52] H. Tong, Y. Koren, and C. Faloutsos, "Fast Direction-Aware Proximity for Graph Mining," Proc. ACM SIGKDD, pp. 747-756, 2007.

[53] Y. Koren, S. North, and C. Volinsky, "Measuring and Extracting Proximity in Networks," Proc. ACM SIGKDD, pp. 245-255, 2006.

[54] Y. Koren, S. North, and C. Volinsky, "Measuring and Extracting Proximity Graphs in Networks," ACM Trans. Knowledge Discovery in Data, vol. 1, no. 3, pp. 12:1-12:30, 2007.

[55] B. Nadler, S. Lafon, R. Coifman, and I. Kevrekidis, "Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators," Proc. Advances in Neural Information Processing Systems 18, pp. 955-962, 2005.

[56] B. Nadler, S. Lafon, R. Coifman, and I. Kevrekidis, "Diffusion Maps, Spectral Clustering and Reaction Coordinate of Dynamical Systems," Applied and Computational Harmonic Analysis, vol. 21, pp. 113-127, 2006.

[57] P. Pons and M. Latapy, "Computing Communities in Large Networks Using Random Walks," Proc. Int'l Symp. Computer and Information Sciences, pp. 284-293, 2005.

[58] P. Pons and M. Latapy, "Computing Communities in Large Networks Using Random Walks," J. Graph Algorithms and Applications, vol. 10, no. 2, pp. 191-218, 2006.



[59] F. Fouss, L. Yen, A. Pirotte, and M. Saerens, "An Experimental Investigation of Graph Kernels on a Collaborative Recommendation Task," Proc. Sixth Int'l Conf. Data Mining, pp. 863-868, 2006.

[60] S. Lafon and A.B. Lee, "Diffusion Maps and Coarse-Graining: A Unified Framework for Dimensionality Reduction, Graph Partitioning, and Data Set Parameterization," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 9, pp. 1393-1403, Sept. 2006.

[61] M. Gori and A. Pucci, "A Random-Walk Based Scoring Algorithm with Application to Recommender Systems for Large-Scale E-Commerce," Proc. ACM SIGKDD, 2006.

[62] J.-Y. Pan, H.-J. Yang, C. Faloutsos, and P. Duygulu, "Automatic Multimedia Cross-Modal Correlation Discovery," Proc. ACM SIGKDD, pp. 653-658, 2004.

[63] H. Tong, C. Faloutsos, and J.-Y. Pan, "Random Walk with Restart: Fast Solutions and Applications," Knowledge and Information Systems, 2007.

[64] V.D. Blondel and P.V. Dooren, "A Measure of Similarity between Graph Vertices, with Application to Synonym Extraction and Web Searching," SIAM Rev., vol. 46, no. 4, pp. 647-666, 2004.

[65] W. Lu, J. Janssen, E. Milos, N. Japkowicz, and Y. Zhang, "Node Similarity in the Citation Graph," Knowledge and Information Systems, vol. 11, no. 1, pp. 105-129, 2006.

[66] A. Tahbaz and A. Jadbabaie, "A One-Parameter Family of Distributed Consensus Algorithms with Boundary: From Shortest Paths to Mean Hitting Times," Proc. IEEE Conf. Decision and Control, pp. 4664-4669, 2006.

[67] R. Agaev and P. Chebotarev, "The Matrix of Maximum Out Forests of a Digraph and Its Applications," Automation and Remote Control, vol. 61, no. 9, pp. 1424-1450, 2000.

[68] R. Agaev and P. Chebotarev, "Spanning Forests of a Digraph and Their Applications," Automation and Remote Control, vol. 62, no. 3, pp. 443-466, 2001.

[69] D. Zhao, Z. Lin, and X. Tang, "Contextual Distance for Data Perception," Proc. 11th IEEE Int'l Conf. Computer Vision, pp. 1-8, 2007.

[70] E.T. Jaynes, "Information Theory and Statistical Mechanics," Physical Rev., vol. 106, pp. 620-630, 1957.

[71] J.N. Kapur and H.K. Kesavan, Entropy Optimization Principles with Applications. Academic Press, 1992.

[72] L. Reichl, A Modern Course in Statistical Physics, second ed. Wiley, 1998.

[73] E. Schrodinger, Statistical Thermodynamics, second ed. Cambridge Univ. Press, 1952.

[74] R. Bronson, Matrix Operations. McGraw-Hill, 1989.

[75] J.G. Kemeny and J.L. Snell, Finite Markov Chains. Springer-Verlag, 1976.

[76] T.A. Davis, Direct Methods for Sparse Linear Systems. SIAM, 2006.

[77] G.H. Golub and C.F.V. Loan, Matrix Computations, third ed. Johns Hopkins Univ. Press, 1996.

[78] C.D. Meyer, Matrix Analysis and Applied Linear Algebra. SIAM, 2001.

[79] S. Fine and K. Scheinberg, "Efficient SVM Training Using Low-Rank Kernel Representations," J. Machine Learning Research, vol. 2, pp. 243-264, 2001.

[80] J. Callut, K. Francoisse, M. Saerens, and P. Dupont, "Semi-Supervised Classification from Discriminative Random Walks," Proc. European Conf. Machine Learning, pp. 162-177, 2008.

[81] M. Newman, "A Measure of Betweenness Centrality Based on Random Walks," Social Networks, vol. 27, no. 1, pp. 39-54, 2005.

[82] W.W. Zachary, "An Information Flow Model for Conflict and Fission in Small Groups," J. Anthropological Research, pp. 452-473, 1977.

[83] S.A. Macskassy and F. Provost, "Classification in Networked Data: A Toolkit and a Univariate Case Study," J. Machine Learning Research, vol. 8, pp. 935-983, 2007.

[84] I. Borg and P. Groenen, Modern Multidimensional Scaling: Theory and Applications. Springer, 1997.

[85] J. Weston, B. Scholkopf, E. Eskin, C. Leslie, and W. Noble, "Dealing with Large Diagonals in Kernel Matrices," Annals Inst. of Statistical Math., vol. 55, no. 2, pp. 391-408, 2003.

[86] D. Greene and P. Cunningham, "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering," Proc. 23rd Int'l Conf. Machine Learning, pp. 377-384, 2006.

[87] J. Rudnick and G. Gaspari, Elements of the Random Walk. Cambridge Univ. Press, 2004.

[88] K. Borgwardt, C.S. Ong, S. Schonauer, S. Vishwanathan, A. Smola, and H.-P. Kriegel, "Protein Function Prediction via Graph Kernels," Bioinformatics, vol. 12, no. 1, pp. 337-357, 2005.

[89] S. Vishwanathan, K.M. Borgwardt, I.R. Kondor, and N.N. Schraudolph, "Graph Kernels," http://arxiv.org/pdf/0807.0093, 2009.

[90] O. Schabenberger and C.A. Gotway, Statistical Methods for Spatial Data Analysis. Chapman & Hall, 2004.

[91] J. LeSage and R.K. Pace, Introduction to Spatial Econometrics. Chapman & Hall, 2009.

[92] D.A. Harville, Matrix Algebra from a Statistician's Perspective. Springer-Verlag, 1997.

Amin Mantrach received the BSc degree in computer science and the MSc degree in bioinformatics in 2004 from the Universite Libre de Bruxelles (ULB), Belgium. He is now completing the PhD degree at the IRIDIA Laboratory (the Artificial Intelligence Laboratory, ULB, Belgium). His research interests include graph mining, semi-supervised learning, and text categorization.

Luh Yen received the MSc degree in electrical engineering from the Universite Catholique de Louvain (UCL), Belgium, in 2002. She is now completing the PhD degree at the ISYS Laboratory and Machine Learning Group of the Universite Catholique de Louvain. Her research interests include graph mining, clustering, and multivariate statistical analysis.

Jerome Callut received the MSc degree in computer science from the Universite Libre de Bruxelles (ULB) in 2003 and the PhD degree in applied sciences from the INGI Department of the Universite Catholique de Louvain (UCL) in 2007. During this period, his research was mainly concerned with the structural estimation of hidden Markov models (HMMs) and, more particularly, with the analysis of first passage times in these models. He then joined the ISYS Unit of the Louvain School of Management (LSM) as a postdoctoral researcher. In this context, his research is focused on text mining and semi-supervised classification in large-scale graphs. His other research interests include kernel methods, sequential process mining, convex optimization, and financial forecasting models.

Kevin Francoisse received the MSc degree in computer science and statistics from the Universite Catholique de Louvain (UCL), Belgium, in 2006. He is now completing the PhD degree in the Machine Learning Group at the UCL. His research interests include semi-supervised learning, graph mining, and collaborative recommendation problems.



Masashi Shimbo received the PhD degree in engineering from Kyoto University in 2000. He is currently an assistant professor in the Graduate School of Information Science, Nara Institute of Science and Technology. His research interests include graph-theoretic approaches to data mining and application of machine learning techniques to natural language processing.

Marco Saerens received the BSc degree in physics engineering and the MSc degree in theoretical physics from the Universite Libre de Bruxelles (ULB). After graduation, he joined the IRIDIA Laboratory (the Artificial Intelligence Laboratory, ULB, Belgium) as a research assistant and completed the PhD degree in applied sciences. While remaining a part-time researcher at IRIDIA, he then worked in the research departments of various companies, mainly in the fields of speech recognition, data mining, and artificial intelligence. In 2002, he joined the Universite Catholique de Louvain (UCL) as a professor of computer sciences. His main research interests include artificial intelligence, machine learning, data mining, pattern recognition, and speech/language processing. He is a member of the IEEE, the IEEE Computer Society, and the ACM.

