A Link-Analysis Extension of Correspondence

reducing the Markov chain by stochastic complementation allows us to focus the analysis on the elements and relationships we are interested in.

Interestingly enough, when dealing with a bipartite graph (i.e., the database only contains two tables linked by one relation), stochastic complementation followed by a basic diffusion map is exactly equivalent to simple correspondence analysis. On the other hand, when dealing with a star-schema database (one central table linked to several tables by different relations), this two-step procedure reduces to multiple correspondence analysis. The proposed methodology therefore extends correspondence analysis to the analysis of a relational database.

In short, this paper has three main contributions:

  • A two-step procedure for analyzing weighted graphs or relational databases is proposed.
  • It is shown that the suggested procedure extends correspondence analysis.
  • A kernel version of the diffusion-map distance, applicable to directed graphs, is introduced.

The paper is organized as follows. Section II introduces the basic diffusion-map distance and its natural kernel on a graph. Section III introduces some basic notions of stochastic complementation of a Markov chain. Section IV presents the two-step procedure for analyzing the relationships between elements of different tables and establishes the equivalence between the proposed methodology and correspondence analysis in some special cases. Section V presents some illustrative examples involving several datasets, while Section VI is the conclusion.

II. THE DIFFUSION-MAP DISTANCE AND ITS NATURAL KERNEL MATRIX

In this section, the basic diffusion-map distance [42], [43], [46], [47] is briefly reviewed and some of its theoretical justifications are detailed. Then, a natural kernel matrix is derived from the diffusion-map distance, providing a meaningful similarity measure between nodes.

A. Notations and definitions

Let us consider that we are given a weighted, directed graph G, possibly defined from a relational database in the following, obvious, way: each element of the database is a node and each relation corresponds to a link (for a detailed procedure for building a graph from a relational database, see [20]). The associated adjacency matrix A is defined in a standard way as a_ij = [A]_ij = w_ij if node i is connected to node j, and a_ij = 0 otherwise (say G has n nodes in total). The weight w_ij > 0 of the edge connecting node i and node j is set to a larger value if the affinity between i and j is important. If no information about the strength of the relationship is available, we simply set w_ij = 1 (unweighted graph). We further assume that there are no self-loops (w_ii = 0 for i = 1, ..., n) and that the graph has a single connected component; that is, any node can be reached from any other node. If the graph is not connected, there is no relationship at all between the different components and the analysis has to be performed separately on each of them. It is therefore to be hoped that the graph modeling the relational database does not contain too many disconnected components; this can be considered as a limitation of our method. Partitioning a graph into connected components from its adjacency matrix can be done in O(n^2) (see for instance [56]). Based on the adjacency matrix, the Laplacian matrix L of the graph is defined in the usual manner: L = D − A, where D = Diag(a_i.) is the generalized outdegree matrix with diagonal entries d_ii = [D]_ii = a_i. = Σ_{j=1}^{n} a_ij. The column vector d is simply the vector containing the outdegree of each node. Furthermore, the volume of the graph is defined as v_g = vol(G) = Σ_{i=1}^{n} d_ii = Σ_{i,j=1}^{n} a_ij.

Usually, we are dealing with symmetric adjacency matrices, in which case L is symmetric and positive semidefinite (see for instance [10]).

From this graph, we define a natural random walk through the graph in the usual way by associating a state to each node and assigning a transition probability to each link. Thus, a random walker can jump from element to element, and each element therefore represents a state of the Markov chain describing the sequence of visited states. A random variable s(t) contains the current state of the Markov chain at time step t: if the random walker is in state i at time t, then s(t) = i. The random walk is defined by the following single-step transition probabilities of jumping from any state i = s(t) to an adjacent state j = s(t+1): P(s(t+1) = j | s(t) = i) = a_ij / a_i. = p_ij. The transition probabilities only depend on the current state and not on the past ones (first-order Markov chain). Since the graph is completely connected, the Markov chain is irreducible; that is, every state can be reached from any other state. If we denote the probability of being in state i at time t by x_i(t) = P(s(t) = i) and we define P as the transition matrix with entries p_ij, the evolution of the Markov chain is characterized by x(t+1) = P^T x(t), with x(0) = x_0, where ^T denotes the matrix transpose. This provides the state probability distribution x(t) = [x_1(t), x_2(t), ..., x_n(t)]^T at time t once the initial distribution x(0) is known. Moreover, we will denote by x_i(t) the column vector containing the probability distribution of finding the random walker in each state at time t when starting from state i at time t = 0. That is, the entries of the vector x_i(t) are x_ij(t) = P(s(t) = j | s(0) = i), j = 1, ..., n.

Since the Markov chain represents a random walk on the graph G, the transition matrix is simply P = D^{-1} A. Moreover, if the adjacency matrix A is symmetric, the Markov chain is reversible and the steady-state vector, π, is simply proportional to the degree of each state [48], d (which has to be normalized in order to obtain a valid probability distribution). Moreover, this implies that all the eigenvalues (both left and right) of the transition matrix are real.
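As a small illustrative sketch (not from the paper), the natural random walk above can be set up numerically. The 4-node graph is a hypothetical example; the check confirms that, for a symmetric A, the degree vector normalized to sum to one is indeed stationary:

```python
import numpy as np

# Hypothetical 4-node undirected graph given by its symmetric adjacency matrix A.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 1.],
              [1., 1., 0., 1.],
              [0., 1., 1., 0.]])

d = A.sum(axis=1)            # outdegrees a_i.
P = np.diag(1.0 / d) @ A     # transition matrix P = D^{-1} A of the random walk

# For a symmetric A, the steady-state vector pi is proportional to the degrees.
pi = d / d.sum()

# Stationarity check: pi^T P = pi^T.
assert np.allclose(pi @ P, pi)
```

Each row of P sums to one, as required of a valid transition matrix.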

    B. The diffusion-map distance

In our two-step procedure, a diffusion-map projection, based on the so-called diffusion-map distance, will be performed after stochastic complementation. Now, since the original definition of the diffusion-map distance deals only with undirected, aperiodic Markov chains, it will first be assumed in this section that the reduced Markov chain, obtained after

    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL.23, NO. 4, April 2011


stochastic complementation, is indeed undirected, aperiodic and connected, in which case the corresponding random walk defines an irreducible reversible Markov chain. Notice that it is not required that the original adjacency matrix is irreducible and reversible; these assumptions are only required for the reduced adjacency matrix obtained after stochastic complementation (see the discussion in Section III-A). Moreover, some of these assumptions will be relaxed in Section II-C, when introducing the diffusion-map kernel that is well-defined even if the graph is directed.

The original derivation of the diffusion map, introduced independently by Nadler et al. and Pons & Latapy [42], [43], [46], [47], is detailed in this section, but other interpretations of this mapping appeared in the literature (see the discussion at the end of this section). Moreover, the basic diffusion map is closely related to correspondence analysis, as detailed in Section IV. For an application of the basic diffusion map to dimensionality reduction, see [35].

Since P is aperiodic, irreducible and reversible, it is well-known that all the eigenvalues of P are real and the eigenvectors are also real (see, e.g., [7], p. 202). Moreover, all its eigenvalues lie in [−1, +1], and the eigenvalue 1 has multiplicity one [7]. With these assumptions, Nadler et al. and Pons & Latapy [42], [43], [46], [47] proposed to use as distance between states i and j,

d²_ij(t) = Σ_{k=1}^{n} (x_ik(t) − x_jk(t))² / π_k        (1)

        ∝ (x_i(t) − x_j(t))^T D^{-1} (x_i(t) − x_j(t))        (2)

since, for a simple random walk on an undirected graph, the entries of the steady-state vector are proportional (hence the ∝ sign) to the generalized degree of each node (the total of the elements of the corresponding row of the adjacency matrix [48]). This distance, called the diffusion-map distance, corresponds to the sum of the squared differences between the probability distributions of being in any state after t transitions when starting (i.e., at time t = 0) from two different states, state i and state j. In other words, two nodes are similar when they diffuse through the network, and thus influence the network, in a similar way. This is a natural definition which quantifies the similarity between two states based on the evolution of the states' probability distributions. Of course, when i = j, d_ij(t) = 0.

Nadler et al. [42], [43] showed that this distance measure has a simple expression in terms of the right eigenvectors of P:

d²_ij(t) = Σ_{k=1}^{n} λ_k^{2t} (u_ki − u_kj)²        (3)

where u_ki = [u_k]_i is component i of the kth right eigenvector, u_k, of P and λ_k is its corresponding eigenvalue. As usual, the λ_k are ordered by decreasing modulus, so that the contributions to the sum in Equation (3) are decreasing with k. On the other hand, x_i(t) can easily be expressed [42], [43] in the space spanned by the left eigenvectors of P, the v_k,

x_i(t) = (P^T)^t e_i = Σ_{k=1}^{n} λ_k^t v_k u_k^T e_i = Σ_{k=1}^{n} (λ_k^t u_ki) v_k        (4)

where e_i is the ith column of I, e_i = [0, ..., 0, 1, 0, ..., 0]^T, with the single 1 in position i.

The resulting mapping aims to represent each state i in an n-dimensional Euclidean space with coordinates (|λ_2^t| u_2i, |λ_3^t| u_3i, ..., |λ_n^t| u_ni), as in Equation (4) (the first right eigenvector is trivial and is therefore discarded). Dimensions are ordered by decreasing modulus |λ_k^t|. This original mapping introduced by Nadler and coauthors will be referred to as the basic diffusion map in this paper, in contrast with the diffusion-map kernel (K_DM) that will be introduced in the next section.
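The spectral identity (3) can be verified numerically. In the sketch below (hypothetical graph, not from the paper), the right eigenvectors of P are obtained from the symmetric matrix D^{-1/2} A D^{-1/2}; with u_k = D^{-1/2} φ_k, the left eigenvectors v_k = D^{1/2} φ_k satisfy the orthogonality v_i^T D^{-1} v_j = δ_ij assumed in the derivation:

```python
import numpy as np

A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 1.],
              [1., 1., 0., 1.],
              [0., 1., 1., 0.]])
d = A.sum(axis=1)
P = A / d[:, None]
t = 3

# Eigendecomposition via the symmetric matrix S = D^{-1/2} A D^{-1/2}.
S = A / np.sqrt(np.outer(d, d))
lam, Phi = np.linalg.eigh(S)          # orthonormal phi_k, real lambda_k
U = Phi / np.sqrt(d)[:, None]         # right eigenvectors u_k = D^{-1/2} phi_k

i, j = 0, 1
# Spectral expression (3):
spectral = sum(lam[k]**(2 * t) * (U[i, k] - U[j, k])**2 for k in range(4))

# Direct expression (2):
Pt = np.linalg.matrix_power(P, t)
diff = Pt[i] - Pt[j]
direct = diff @ np.diag(1.0 / d) @ diff
assert np.isclose(spectral, direct)
```

Note that the trivial eigenvector (λ = 1, u constant) contributes nothing to the sum, which is why it can be discarded from the mapping.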

The weighting factor D^{-1} in Equation (2) is necessary to obtain Equation (3), since the v_k are not orthogonal. Instead, it can easily be shown that we have v_i^T D^{-1} v_j = δ_ij, which amounts to redefining the inner product as ⟨x, y⟩ = x^T D^{-1} y, where the metric of the space is D^{-1} [7].

Notice also that there is a close relationship between spectral clustering (the mapping provided by the normalized Laplacian matrix; see for instance [15], [45], [65]) and the basic diffusion map. Indeed, a common embedding of the nodes consists of representing each node by the coordinates of the smallest non-trivial eigenvectors (corresponding to the smallest eigenvalues) of the normalized Laplacian matrix, L̃ = D^{-1/2} L D^{-1/2}. More precisely, if u_k is the kth largest right eigenvector of the transition matrix P and l_k is the kth smallest non-trivial eigenvector of the normalized Laplacian matrix L̃, we have (see Appendix A for the proof and details)

u_k = D^{-1/2} l_k        (5)

and l_k is associated with eigenvalue (1 − λ_k). A subtle, still important, difference between this mapping and the one provided by the basic diffusion map concerns the order in which the dimensions are sorted. Indeed, for the basic diffusion map, the eigenvalues of the transition matrix P are ordered by decreasing modulus. For this spectral-clustering model, the eigenvalues are sorted by decreasing value (and not modulus), which can result in a different representation if P has large negative eigenvalues. This shows that the mappings provided by spectral clustering and by the basic diffusion map are closely related.

Notice that at least three other justifications of this eigenvector-based mapping appeared before in the literature, and are briefly reviewed here. (i) It has been shown that the entries of the subdominant right eigenvector of the transition matrix P of an aperiodic, irreducible, reversible Markov chain can be interpreted as a relative distance to its stationary distribution (see [60], Section 1.8.1, or the appendix of [18]). This distance may be regarded as an indicator of the number of iterations required to reach this equilibrium position, if the system starts in the state from which the distance is being measured. These quantities are only relative, but they serve as a means of comparison among the states [60]. (ii) The same embedding can be obtained by minimizing the criterion Σ_{i=1}^{n} Σ_{j=1}^{n} a_ij (z_i − z_j)² = z^T L z [10], [26] subject to z^T D z = 1, therefore penalizing the nodes having a large outdegree [74]. Here, z_i is the coordinate of node i on the axis and the vector z contains the z_i. The problem amounts to finding the smallest non-trivial eigenvector of (I − P),


which is the same as the second largest eigenvector of P, and this is once more similar to the basic diffusion map. Notice that this mapping has been rediscovered and reinterpreted by Belkin & Niyogi [2], [3] in the context of nonlinear dimensionality reduction. (iii) The last justification of the basic diffusion map, introduced in [15], is based on the concept of 2-way partitioning of a graph [58]. Minimizing a normalized cut criterion while imposing that the membership vector is centered with respect to the metric D leads to exactly the same embedding as in the previous interpretation. Moreover, some authors [72] showed that applying a specific cut criterion to bipartite graphs leads to simple correspondence analysis. Notice that the second justification, (i), leads to exactly the same mapping as the basic diffusion map, while the third and fourth justifications, (ii) and (iii), lead to the same embedding space, but with a possibly different ordering and rescaling of the axes.

More generally, these mappings are, of course, also related to graph embedding and nonlinear dimensionality reduction, which have been highly studied topics in recent years, especially in the manifold learning community (see, e.g., [21], [30], [37], [67], for recent surveys or developments). Experimental comparisons with popular nonlinear dimensionality reduction techniques are presented in the experimental section.

C. A kernel view of the diffusion-map distance

We now introduce¹ a variant of the basic diffusion-map model introduced by Nadler et al. and Pons & Latapy [42], [43], [46], [47], which is still well-defined when the original graph is directed. In other words, we do not assume that the initial adjacency matrix A is symmetric in this section. This extension presents several advantages in comparison with the original basic diffusion map: (i) the kernel version of the diffusion map is applicable to directed graphs, while the original model is restricted to undirected graphs; (ii) the extended model induces a valid kernel on a graph; (iii) the resulting matrix has the nice property of being symmetric positive definite, so that the spectral decomposition can be computed on a symmetric positive definite matrix; and, finally, (iv) the resulting mapping is displayed in a Euclidean space in which the coordinate axes are set in the directions of maximal variance by using (uncentered, if the kernel is not centered) kernel principal-component analysis [54], [57] or multidimensional scaling [6], [12]. This kernel-based technique will be referred to as the diffusion-map kernel PCA or the K_DM PCA.

Let us define W = (Diag(π))^{-1}, where π is the stationary distribution of the finite Markov chain. Remember that if the adjacency matrix is symmetric, the stationary distribution of the natural random walk is proportional to the degree of the nodes, so that W ∝ D^{-1} [48]. The diffusion-map distance is therefore redefined as

d²_ij(t) = (x_i(t) − x_j(t))^T W (x_i(t) − x_j(t))        (6)

¹Part of the material of this section was published in a conference paper [19] presenting a similar idea.

Since x_i(t) = (P^T)^t e_i, Equation (6) becomes

d²_ij(t) = (e_i − e_j)^T P^t W (P^T)^t (e_i − e_j)
        = (e_i − e_j)^T K_DM (e_i − e_j)
        = [K_DM]_ii + [K_DM]_jj − [K_DM]_ij − [K_DM]_ji        (7)

where we defined

K_DM(t) = P^t W (P^T)^t,        (8)

referred to as the diffusion-map kernel. Thus, the matrix K_DM is the natural kernel (inner-product matrix) associated with the squared diffusion-map distances [6], [12]. It is clear that this matrix is symmetric positive semidefinite and contains inner products in a Euclidean space where the node vectors are exactly separated by d_ij(t) (the proof is straightforward and can be found in [17], Appendix D, where the same reasoning was applied to the commute-time kernel). It is therefore a valid kernel matrix.
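Since the kernel of Equation (8) only requires the stationary distribution, it can be built for a directed graph as well. The sketch below (hypothetical 3-node directed graph, not from the paper) constructs K_DM and verifies the distance identity (7):

```python
import numpy as np

# Hypothetical strongly connected directed graph (asymmetric A).
A = np.array([[0., 2., 1.],
              [1., 0., 1.],
              [2., 1., 0.]])
d = A.sum(axis=1)
P = A / d[:, None]

# Stationary distribution pi: left eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()

t = 4
W = np.diag(1.0 / pi)
Pt = np.linalg.matrix_power(P, t)
K_DM = Pt @ W @ Pt.T                  # Equation (8); symmetric by construction

# Distance identity (7):
i, j = 0, 2
d2_kernel = K_DM[i, i] + K_DM[j, j] - K_DM[i, j] - K_DM[j, i]
diff = Pt[i] - Pt[j]
d2_direct = diff @ W @ diff           # Equation (6)
assert np.isclose(d2_kernel, d2_direct)
```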

Performing an (uncentered, if the kernel is not centered) principal-component analysis (PCA) in this embedding space amounts to changing the coordinate system by putting the new coordinate axes in the directions of maximal variance. From the theory of classical multidimensional scaling [6], [12], it suffices² to compute the m first eigenvalues/eigenvectors of K_DM and to consider that these eigenvectors, multiplied by the square root of the corresponding eigenvalues, are the coordinates of the nodes in the principal-component space spanned by these eigenvectors (see [6], [12]; for a similar application with the commute-time kernel, see [51]; for the general definition of kernel PCA, see [54], [55]). In other words, we compute the m first eigenvalues/eigenvectors of K_DM: K_DM w_k = λ_k w_k, where the w_k are orthonormal. Then, we represent each node i in an m-dimensional Euclidean space with coordinates (√λ_1 w_1i, √λ_2 w_2i, ..., √λ_m w_mi), where w_ki = [w_k]_i corresponds to element i of the eigenvector w_k associated with eigenvalue λ_k; this is the vector representation of state i in the principal-component space.
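The kernel-PCA step can be sketched as follows (hypothetical undirected graph, not from the paper): eigendecompose K_DM, keep the m leading components scaled by the square roots of the eigenvalues, and note that the full embedding reproduces the squared diffusion-map distances exactly:

```python
import numpy as np

A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 1.],
              [1., 1., 0., 1.],
              [0., 1., 1., 0.]])
d = A.sum(axis=1)
P = A / d[:, None]
t = 3
W = np.diag(d.sum() / d)              # W = (Diag(pi))^{-1}, pi = d / vol(G)
Pt = np.linalg.matrix_power(P, t)
K = Pt @ W @ Pt.T                     # diffusion-map kernel, Equation (8)

lam, Wvec = np.linalg.eigh(K)
order = np.argsort(lam)[::-1]         # sort eigenvalues in decreasing order
lam, Wvec = lam[order], Wvec[:, order]

m = 2
coords = Wvec[:, :m] * np.sqrt(np.maximum(lam[:m], 0.0))   # m-dim node coordinates

# The full (m = n) embedding separates nodes by exactly d_ij(t):
G_full = Wvec * np.sqrt(np.maximum(lam, 0.0))
i, j = 0, 1
d2_embed = np.sum((G_full[i] - G_full[j])**2)
diffv = Pt[i] - Pt[j]
assert np.isclose(d2_embed, diffv @ W @ diffv)
```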

It can easily be shown that, when the initial graph is undirected, this PCA based on the kernel matrix K_DM is similar to the diffusion map introduced in the last section, up to an isometry. Indeed, by the classical theory of multidimensional scaling [6], [12], the eigenvectors of the kernel matrix K_DM, multiplied by the square root of the corresponding eigenvalues, define coordinates in a Euclidean space where the observations are exactly separated by the distance d_ij(t). Since this is exactly the property of the basic diffusion map (Equation (3)), both representations are similar up to an isometry.

Notice that the resulting kernel matrix can easily be centered [40] by H K_DM H with H = I − (ee^T / n), where e is a column vector all of whose elements are 1 (i.e., e = [1, 1, ..., 1]^T). H is called the centering matrix. This places the origin of the coordinates of the diffusion map at the center of gravity of the node vectors.
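The effect of the centering matrix is easy to see on any symmetric kernel; the 4×4 matrix below is an arbitrary illustrative example (not from the paper), and after centering, the implied node vectors have zero mean:

```python
import numpy as np

n = 4
H = np.eye(n) - np.ones((n, n)) / n   # centering matrix H = I - e e^T / n

# Arbitrary symmetric kernel matrix, purely for illustration.
K = np.array([[2., 1., 0., 1.],
              [1., 3., 1., 0.],
              [0., 1., 2., 1.],
              [1., 0., 1., 3.]])
Kc = H @ K @ H

# Rows and columns of the centered kernel sum to zero, i.e. the origin is
# at the center of gravity of the node vectors.
assert np.allclose(Kc.sum(axis=0), 0.0)
assert np.allclose(Kc.sum(axis=1), 0.0)
```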

²The proof must be slightly adapted in order to account for the fact that the kernel matrix is not centered, as in [50].


D. Links between the basic diffusion map and the kernel diffusion map

While both represent the graph in a Euclidean space where the nodes are exactly separated by the distances defined by Equation (2), and thus provide exactly the same embedding, the mappings are, however, different for each method. Indeed, the coordinate system in the embedding space differs for each method.

In the case of the basic diffusion map, the eigenvector u_k provides the kth coordinate of the nodes in the embedding space. However, in the case of the diffusion-map kernel, since a kernel PCA is performed, the first coordinate axis corresponds instead to the direction of maximal variance in terms of diffusion-map distance (Equation (2)). Therefore, the coordinate system used by the diffusion-map kernel is different from the one used by the basic diffusion map.

Putting the coordinate system in the directions of maximal variance, and thus computing a kernel PCA, is probably more natural. We now show that there is a close relationship between the two representations. Indeed, from Equation (4), we easily observe that the mapping provided by the basic diffusion map remains the same as a function of the parameter t, up to a scaling of each coordinate/dimension (only the scaling changes). This is in fact not the case for the kernel diffusion map. In fact, the mapping provided by the diffusion-map kernel tends to be the same as the one provided by the basic diffusion map for growing values of t in the case of an undirected graph. Indeed, it can be shown that the kernel matrix can be rewritten as K_DM ∝ U Λ^{2t} U^T, where U contains the right eigenvectors of P, the u_k, as columns, and Λ contains the corresponding eigenvalues. In this case, when t is large, every additional dimension has a very small contribution in comparison with the previous ones.

This fact will be illustrated in the experimental section (i.e., Section V). In practice, we observed that the two mappings are already almost identical when t is equal to 5 or 6 (see for instance Figure 3 in Section V).

III. ANALYZING RELATIONS BY STOCHASTIC COMPLEMENTATION

In this section, the concept of stochastic complementation [41] is briefly reviewed and applied to the analysis of a graph through the random-walk-on-a-graph model. From the initial graph, a reduced graph containing only the nodes of interest, and which is much easier to analyze, is built.

A. Computing a reduced Markov chain by stochastic complementation

Suppose we are interested in analyzing the relationship between two sets of nodes of interest. A reduced Markov chain can be computed from the original chain in the following manner. First, the set of states is partitioned into two subsets, S_1, corresponding to the nodes of interest to be analyzed, and S_2, corresponding to the remaining nodes, to be hidden. We further denote by n_1 and n_2 (with n_1 + n_2 = n) the number of states in S_1 and S_2, respectively; usually n_2 ≫ n_1.

Thus, the transition matrix is repartitioned as

             S_1    S_2
P =  S_1 [ P_11   P_12 ]
     S_2 [ P_21   P_22 ]        (9)

The idea is to censor the useless elements by masking them during the random walk. That is, during any random walk on the original chain, only the states belonging to S_1 are recorded; all the other reached states, belonging to subset S_2, are censored and therefore not recorded. One can show that the resulting reduced Markov chain obtained by censoring the states of S_2 is the stochastic complement of the original chain [41]. Thus, performing a stochastic complementation allows us to focus the analysis on the tables and elements representing the factors/features of interest. The reduced chain inherits all the characteristics of the original chain; it simply censors the useless states. The stochastic complement P_c of the chain, partitioned as in Equation (9), is defined as (see for instance [41])

P_c = P_11 + P_12 (I − P_22)^{-1} P_21        (10)

It can be shown that the matrix P_c is stochastic, that is, the sum of the elements of each row is equal to 1 [41]; it therefore corresponds to a valid transition matrix between states of interest. We will assume that this resulting stochastic matrix is aperiodic and irreducible, that is, primitive [48]. Indeed, Meyer showed in [41] that if the initial chain is irreducible or aperiodic, so is the reduced chain. Moreover, even if the initial chain is periodic, the reduced chain frequently becomes aperiodic by stochastic complementation [41]. One way to ensure the aperiodicity of the reduced chain is to introduce a small positive quantity on the diagonal of the adjacency matrix A, which does not fundamentally change the model. Then, P has nonzero diagonal entries and the stochastic complement, P_c, is primitive (see [41], Theorem 5.1).
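Equation (10) can be sketched numerically as follows (hypothetical 5-state chain, not from the paper), keeping S_1 = {0, 1} and censoring S_2 = {2, 3, 4}; the check confirms that the complement is row-stochastic:

```python
import numpy as np

# Hypothetical connected undirected 5-node graph.
A = np.array([[0., 1., 1., 0., 0.],
              [1., 0., 0., 1., 0.],
              [1., 0., 0., 1., 1.],
              [0., 1., 1., 0., 1.],
              [0., 0., 1., 1., 0.]])
P = A / A.sum(axis=1, keepdims=True)

n1 = 2                                 # S_1 = {0, 1}; S_2 = {2, 3, 4}
P11, P12 = P[:n1, :n1], P[:n1, n1:]
P21, P22 = P[n1:, :n1], P[n1:, n1:]

# Stochastic complement, Equation (10).
Pc = P11 + P12 @ np.linalg.inv(np.eye(P22.shape[0]) - P22) @ P21

# Pc is a valid transition matrix on the states of interest.
assert np.allclose(Pc.sum(axis=1), 1.0)
```

(I − P_22) is invertible here because the chain is irreducible and S_1 is non-empty, so the walk always eventually escapes S_2.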

Let us show that the reduced chain also represents a random walk on a reduced graph G_c containing only the nodes of interest. We therefore partition the matrices A, D, L as

A = [ A_11  A_12       D = [ D_1   O        L = [ L_11  L_12
      A_21  A_22 ],          O    D_2 ],          L_21  L_22 ]

from which we easily find P_c = D_1^{-1}(A_11 + A_12 (D_2 − A_22)^{-1} A_21) = D_1^{-1} A_c, where we defined A_c = A_11 + A_12 (D_2 − A_22)^{-1} A_21. Notice that if A is symmetric (the graph G is undirected), A_c is symmetric as well. Since P_c is stochastic, we deduce that the diagonal matrix D_1 contains the row sums of A_c and that the entries of A_c are positive. The reduced chain thus corresponds to a random walk on the graph G_c whose adjacency matrix is A_c.

Moreover, the corresponding Laplacian matrix of the graph G_c can be obtained by

L_c = D_1 − A_c = (D_1 − A_11) − A_12 (D_2 − A_22)^{-1} A_21
    = L_11 − L_12 L_22^{-1} L_21        (11)

since L_12 = −A_12 and L_21 = −A_21. If the adjacency matrix A is symmetric, L_11 (L_22) is positive definite since it is obtained from the positive semidefinite matrix L by deleting the rows associated with S_2 (S_1) and the corresponding

    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL.23, NO. 4, April 2011


columns, therefore eliminating the linear relationship. Notice that L_c is simply the Schur complement of L_{22} [27]. Thus, for an undirected graph G, instead of directly computing P_c, it is more interesting to compute L_c, which is symmetric positive definite, from which P_c can easily be deduced: P_c = I - D_1^{-1} L_c, directly following from L_c = D_1 - A_c; see the next section for a proposition of iterative computation of L_c.
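To make the construction concrete, the following sketch (a minimal NumPy illustration, not the authors' code; any adjacency matrix used with it is assumed to yield an invertible D_2 - A_{22}) performs stochastic complementation on a small dense graph and returns A_c, P_c = D_1^{-1} A_c, and L_c = D_1 - A_c as defined above:

```python
import numpy as np

def stochastic_complement(A, keep):
    """Reduce a random walk on the graph with adjacency matrix A to the
    node subset `keep` (the set S1); the remaining nodes (S2) are censored.

    Returns (A_c, P_c, L_c): the reduced adjacency matrix
    A_c = A11 + A12 (D2 - A22)^{-1} A21, the stochastically complemented
    transition matrix P_c = D1^{-1} A_c, and the reduced Laplacian
    L_c = D1 - A_c (the Schur complement of L22 in L)."""
    n = A.shape[0]
    keep = np.asarray(keep)
    drop = np.setdiff1d(np.arange(n), keep)   # the censored nodes S2

    A11 = A[np.ix_(keep, keep)]
    A12 = A[np.ix_(keep, drop)]
    A21 = A[np.ix_(drop, keep)]
    A22 = A[np.ix_(drop, drop)]

    D2 = np.diag(A[drop].sum(axis=1))         # degrees of the censored nodes
    A_c = A11 + A12 @ np.linalg.solve(D2 - A22, A21)

    D1 = np.diag(A_c.sum(axis=1))             # D1 holds the row sums of A_c
    P_c = np.linalg.solve(D1, A_c)            # stochastic by construction
    L_c = D1 - A_c
    return A_c, P_c, L_c
```

On an undirected graph, A_c stays symmetric, the rows of P_c sum to one, and P_c = I - D_1^{-1} L_c, matching the properties stated above.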

B. Iterative computation of L_c for large, sparse graphs

In order to compute L_c from Equation (11), we need to evaluate L_{22}^{-1} L_{21}. We now show how this matrix can be computed iteratively by using, for instance, a simple Jacobi-like or Gauss-Seidel-like algorithm. In order to simplify the development, let us pick one column, say column j, of L_{21} and denote it as b_j; there are n_1 (with n_1 << n_2) such columns. The problem is thus to compute x_j = L_{22}^{-1} L_{21} e_j (column j of L_{22}^{-1} L_{21}) such that

    L_{22} x_j = (D_2 - A_{22}) x_j = b_j     (12)

By transforming this last equation, we obtain x_j = D_2^{-1} (A_{22} x_j + b_j) = P_{22} x_j + D_2^{-1} b_j, which leads to the following iterative updating rule

    \hat{x}_j \leftarrow P_{22} \hat{x}_j + D_2^{-1} b_j     (13)

where \hat{x}_j is an approximation of x_j. This procedure converges since the matrix L_{22} is irreducible and has weak diagonal dominance [70]. Initially, one can start, for instance, with \hat{x}_j = 0. The computation of Equation (13) fully exploits the sparseness since only the non-zero entries of P_{22} contribute to the update. Of course, D_2^{-1} b_j is pre-calculated.

Once all the \hat{x}_j have been computed in turn (only n_1 in total), the matrix \hat{X} containing the \hat{x}_j as columns is constructed; we thus have L_{22}^{-1} L_{21} \approx \hat{X}. The final step consists of computing L_c = L_{11} - L_{12} \hat{X}. The time complexity of this algorithm is n_1 times the complexity of solving one sparse system of n_2 linear equations. Now, if the graph is undirected, the matrix L_{22} is positive definite. Recent numerical analysis techniques have shown that positive definite sparse linear systems in an n-by-n matrix with m non-zero entries can be solved in time O(mn), for instance by using conjugate gradient techniques [59], [64]. Thus, apart from the matrix multiplication L_{12} \hat{X}, the complexity is O(n_1 n_2 m), where m is the number of non-zero entries of L_{22}. In practice, the matrix is usually very sparse and m = \alpha n_2 with \alpha quite small, resulting in O(\alpha n_1 n_2^2) with n_1 << n_2. The second step, i.e., the mapping by basic diffusion map or by diffusion-map kernel PCA, has a negligible complexity since it is performed on a reduced n_1 x n_1 matrix. Of course, more sophisticated iterative algorithms can be used as well (see for instance [25], [49], [59]).
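The updating rule can be sketched as follows (a NumPy/SciPy illustration, not the authors' implementation; the fixed `n_iter` budget is a simplification of the convergence test a production version would use, and the graph is assumed to have no self-loops so that the diagonal of L_{22} equals D_2):

```python
import numpy as np
from scipy import sparse

def reduced_laplacian_iterative(L, keep, drop, n_iter=500):
    """Jacobi-like computation of L_c = L11 - L12 L22^{-1} L21.

    L: sparse Laplacian of a graph without self-loops (so diag(L22) = D2);
    keep / drop: index lists for the nodes of interest (S1) and the
    censored nodes (S2).  Each column x_j of L22^{-1} L21 is obtained by
    iterating Equation (13): x_j <- P22 x_j + D2^{-1} b_j."""
    L11 = L[keep][:, keep].toarray()
    L12 = L[keep][:, drop]
    L21 = L[drop][:, keep]
    L22 = L[drop][:, drop]

    d2 = L22.diagonal()                                      # D2 (no self-loops)
    P22 = sparse.diags(1.0 / d2) @ (sparse.diags(d2) - L22)  # P22 = D2^{-1} A22
    B = (sparse.diags(1.0 / d2) @ L21).toarray()             # pre-computed D2^{-1} b_j

    X = np.zeros(B.shape)            # start from x_j = 0
    for _ in range(n_iter):
        X = P22 @ X + B              # Eq. (13), applied to all columns at once
    return L11 - L12 @ X             # L_c = L11 - L12 L22^{-1} L21
```

Only the sparse products P22 @ X and L12 @ X are needed, so each sweep costs time proportional to the number of non-zero entries, as discussed above.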

IV. ANALYZING THE REDUCED MARKOV CHAIN WITH THE BASIC DIFFUSION MAP: LINKS WITH CORRESPONDENCE ANALYSIS

    Once a reduced Markov chain containing only the nodes

    of interest has been obtained, one may want to visualize the

graph in a low-dimensional space preserving as accurately as possible the proximity between the nodes. This is the second step of our procedure. For this purpose, we propose to use the diffusion maps introduced in Section II-B and Section II-C. Interestingly enough, computing a basic diffusion map on the reduced Markov chain is equivalent to correspondence analysis in two special cases of interest: a bipartite graph and a star-schema database. Therefore, the proposed two-step procedure can be considered as a generalization of correspondence analysis.

Correspondence analysis (see for instance [23], [24], [31], [62]) is a widely used multivariate statistical analysis technique which is still the subject of much research effort (see for instance [5], [29]). As stated for instance in [34], simple correspondence analysis aims to provide insights into the dependence of two categorical variables. The relationships between the attributes of the two categorical variables are usually analyzed through a biplot [23], a two-dimensional representation of the attributes of both variables. The coordinates of the attributes on the biplot are obtained by computing the eigenvectors of a matrix. Many different derivations of simple correspondence analysis have been developed, allowing for different interpretations of the technique, such as maximizing the correlation between two discrete variables, reciprocal averaging, categorical discriminant analysis, scaling and quantification of categorical variables, performing a principal components analysis based on the chi-square distance, optimal scaling, dual scaling, etc. [34]. Multiple correspondence analysis is the extension of simple correspondence analysis to a larger number of categorical variables.

A. Simple correspondence analysis

As stated before, simple correspondence analysis (see for instance [23], [24], [31], [62]) aims to study the relationships between two random variables x1 and x2 (the features), each having mutually exclusive, categorical outcomes, denoted as attributes. Suppose the variable x1 has n_1 observed attributes and the variable x2 has n_2 observed attributes, each attribute being a possible outcome value for the feature. An experimenter makes a series of measurements of the features x1, x2 on a sample of individuals and records the outcomes in a frequency (also called contingency) table, f_ij, containing the number of individuals having both attribute x1 = i and attribute x2 = j. In our relational database, this corresponds to two tables, each table corresponding to one variable and containing the set of observed attributes (outcomes) of the variable. The two tables are linked by a single relation (see Figure 1 for a simple example).

This situation can be modeled as a bipartite graph where each node corresponds to an attribute and links are only defined between attributes of x1 and attributes of x2. The weight associated to each link is set to w_ij = f_ij, quantifying the strength of the relationship between i and j. The associated n x n adjacency matrix and the corresponding transition matrix can be factorized as

    A = \begin{bmatrix} O & A_{12} \\ A_{21} & O \end{bmatrix}, \quad
    P = \begin{bmatrix} O & P_{12} \\ P_{21} & O \end{bmatrix}     (14)


Fig. 1. Trivial example of a single relation between two variables, Document and Word. The Document table contains outcomes of documents while the Word table contains outcomes of words.

where O is a matrix full of zeroes.

Suppose we are interested in studying the relationships between the attributes of the first variable x1, which corresponds to the n_1 first elements. By stochastic complementation (see Equation (10)), we easily obtain P_c = P_{12} P_{21} = D_1^{-1} A_{12} D_2^{-1} A_{21}. Computing the diffusion map for t = 1 aims to extract the subdominant right-hand eigenvectors of P_c, which exactly corresponds to correspondence analysis (see for instance [24], Equation (4.3.5)). Moreover, it can easily be shown that P_c has only real non-negative eigenvalues, and thus ordering the eigenvalues by modulus is equivalent to ordering them by value. In correspondence analysis, eigenvalues reflect the relative importance of the dimensions: each eigenvalue is the amount of inertia a given attribute explains in the frequency table [31]. The basic diffusion map after stochastic complementation on this bipartite graph therefore leads to the same results as simple correspondence analysis.
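As a small numerical illustration (the 3 x 3 contingency table below is invented), one can build P_c by stochastic complementation of the bipartite graph and inspect its spectrum; the subdominant right eigenvectors carry the correspondence-analysis coordinates:

```python
import numpy as np

# Hypothetical contingency table: f_ij = number of individuals having
# attribute i of x1 and attribute j of x2.
F = np.array([[20.,  5.,  1.],
              [ 4., 18.,  3.],
              [ 2.,  6., 15.]])

A12, A21 = F, F.T
D1 = np.diag(A12.sum(axis=1))   # row totals of the table
D2 = np.diag(A21.sum(axis=1))   # column totals of the table

# Stochastic complement onto the attributes of x1 (Equation (10)):
P_c = np.linalg.solve(D1, A12) @ np.linalg.solve(D2, A21)

# Basic diffusion map for t = 1: the eigenvalues are real and non-negative;
# the largest one is the trivial eigenvalue 1, and the subdominant right
# eigenvectors give the correspondence-analysis coordinates.
vals, vecs = np.linalg.eig(P_c)
order = np.argsort(-vals.real)
ca_coordinates = vecs[:, order[1:]].real
```

Because P_c is the product of two stochastic matrices, it is itself stochastic, which the trivial eigenvalue 1 reflects.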

Relationships between simple correspondence analysis and link-analysis techniques have already been highlighted. For instance, Zha et al. [72] showed the equivalence of a normalized cut performed on a bipartite graph and simple correspondence analysis. On the other hand, Saerens et al. investigated the relationships between Kleinberg's HITS algorithm [33] and correspondence analysis [18] or principal-component analysis [50].

    B. Multiple correspondence analysis

Multiple correspondence analysis assigns a numerical score to each attribute of a set of p > 2 categorical variables [23], [62]. Suppose the data are available in the form of a star schema: the individuals are contained in a main table and the categorical features of these individuals, such as education level, gender, etc., are contained in p auxiliary, satellite, tables. The corresponding graph is built naturally by defining one node for each individual and for each attribute, while a link between an individual and an attribute is defined when the individual possesses this attribute. This configuration is known as a star schema [32] in the data warehouse and relational database fields (see Figure 2 for a trivial example).

Let us first renumber the nodes in such a way that the attribute nodes appear first and the individual nodes last. Thus, the attributes-to-individuals matrix will be denoted by A_{12}; it contains a 1 in entry (i, j) when individual j has attribute i, and 0 otherwise. The individuals-to-attributes matrix, the transpose of the attributes-to-individuals matrix, is A_{21}. Thus, the adjacency matrix of the graph is

    A = \begin{bmatrix} O & A_{12} \\ A_{21} & O \end{bmatrix}     (15)

Now, the individuals-to-attributes matrix exactly corresponds to the data matrix A_{21} = X containing, as rows, the individuals and, as columns, the attributes. Since the different features are coded as indicator (dummy) variables, a row of the X matrix contains a 1 if the individual has the corresponding attribute and 0 otherwise. We thus have A_{21} = X and A_{12} = X^T.

Assuming binary weights, the matrix D_1 contains on its diagonal the frequencies of each attribute, that is, the number of individuals having this attribute. On the other hand, D_2 contains p on each element of its diagonal since each individual has exactly one attribute for each of the p features (attributes corresponding to a feature are mutually exclusive). Thus, D_2 = pI, and P_{12} = D_1^{-1} A_{12}, P_{21} = D_2^{-1} A_{21}.

Suppose we are first interested in the relationships between attribute nodes, therefore hiding the individual nodes contained in the main table. By stochastic complementation (Equation (10)), the corresponding attribute-attribute transition matrix is

    P_c = D_1^{-1} A_{12} D_2^{-1} A_{21} = \frac{1}{p} D_1^{-1} A_{12} A_{21}
        = \frac{1}{p} D_1^{-1} X^T X = \frac{1}{p} D_1^{-1} F     (16)

where the element f_ij of the frequency matrix F = X^T X, also called the Burt matrix, contains the number of co-occurrences of the two attributes i and j, that is, the number of individuals having both attribute i and attribute j.

The largest non-trivial right eigenvector of the matrix P_c represents the scores of the attributes in a multiple correspondence analysis. Thus, computing the eigenvalues and eigenvectors of P_c and displaying the nodes with coordinates proportional to the eigenvectors, weighted by the corresponding eigenvalue, exactly corresponds to multiple correspondence analysis (see [62], Equation (10)). This is precisely what we obtain when computing the basic diffusion map on P_c with t = 1. Indeed, as for simple correspondence analysis, it can easily be shown that P_c has real non-negative eigenvalues, and thus ordering the eigenvalues by modulus is equivalent to ordering by value.
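A minimal sketch of this computation (the indicator matrix X below is invented, coding p = 2 categorical features for five individuals; not the authors' code):

```python
import numpy as np

# Hypothetical indicator (dummy) data matrix X: one row per individual,
# one column per attribute (columns 0-1: first feature; columns 2-4:
# second feature).  Each row sums to p since the attributes of a feature
# are mutually exclusive.
X = np.array([[1, 0, 1, 0, 0],
              [1, 0, 0, 1, 0],
              [0, 1, 0, 1, 0],
              [0, 1, 0, 0, 1],
              [1, 0, 0, 0, 1]], dtype=float)
p = 2

F = X.T @ X                        # Burt matrix of attribute co-occurrences
D1 = np.diag(X.sum(axis=0))        # attribute frequencies
P_c = np.linalg.solve(D1, F) / p   # Equation (16): P_c = (1/p) D1^{-1} F

vals, vecs = np.linalg.eig(P_c)
order = np.argsort(-vals.real)
# Skipping the trivial eigenvalue 1, the next eigenvectors, weighted by
# their eigenvalues, give the MCA scores of the attributes.
mca_scores = vecs[:, order[1:]].real * vals.real[order[1:]]
```

Since every individual holds exactly one attribute per feature, each row of F sums to p times the attribute frequency, which is why P_c comes out stochastic.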

If we are interested in the relationships between elements of the main table (the individuals) instead of the attributes, we obtain

    P_c = \frac{1}{p} A_{21} D_1^{-1} A_{12} = \frac{1}{p} X D_1^{-1} X^T     (17)

    which, once again, is exactly the result obtained by multiple

    correspondence analysis (see [62], Equation (9)).


Fig. 2. Trivial example of a star-schema relation between a main variable, Individual, and auxiliary variables, Gender, Education level, etc. Each table contains outcomes of the corresponding random variable.

V. EXPERIMENTS

This experimental section aims to answer four research questions. (1) How do the graph mappings provided by the kernel PCA based on the diffusion-map kernel (K_DM PCA) compare with the basic diffusion-map projection? (2) Does the proposed two-step procedure (stochastic complementation + diffusion map) provide realistic subgraph drawings? (3) How does the diffusion-map kernel combined with stochastic complementation compare to other popular dimensionality reduction techniques? (4) Does stochastic complementation accurately preserve the structural information?

    A. Graph embedding

Two simple graphs are studied in order to illustrate the visualization of the graph structure by diffusion maps alone (without stochastic complementation): the Zachary karate club [71] and the dolphins social network [38].

Zachary karate club. Zachary studied the relations between the 34 members of a karate club. A disagreement between the club instructor (node 1) and the administrator (node 34) resulted in the split of the club into two parts. Each member of the club is represented by a node of the graph, and the weight between nodes (i, j) is set to the number of times member i and member j met outside the club. The Ucinet drawing of the network is shown in Figure 3(a). The resulting friendship network between members allows one to discover how the club split, but its mapping also allows one to study the proximity between the different members.

It can be observed from the graph embeddings provided by the basic diffusion map (Figure 3(c)) that the value of the parameter t has no influence on the positions of the nodes, but only on the axis scaling (remember Equation (4)). On the contrary, the influence of the value of the parameter t is clearly visible on the K_DM PCA mapping (Figure 3(d)).

However, the projection of a graph on a two-dimensional space usually leads to a loss of information. The information preservation ratio can be estimated using (\lambda_1 + \lambda_2) / \sum_i \lambda_i; it was computed for the 2-D mapping of the network using the basic diffusion map and the K_DM PCA, and is shown in Figure 3(b). It can be observed that the ratio increases with t but, overall, the information is better preserved with the K_DM PCA than with the basic diffusion map. This is due to a better choice of the projection axes for the K_DM PCA, which are oriented in the directions of maximal variance. Since proximity on the mapping can be interpreted as a strong relationship between members, the social community structure is clearly observable visually from Figure 3(c)-(d). For the K_DM PCA, the choice of the value for the parameter t depends on the property of the graph that the user wishes to highlight. When t is low (t = 1), nodes with few relevant connections are considered as outliers and rejected from the community. On the contrary, when t increases, the effect of marginal nodes is faded and only the global member structure, similar to the basic diffusion map, is visible.
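Given the eigenvalue spectrum of a mapping, the ratio is straightforward to compute; a short sketch (the spectrum below is invented for illustration):

```python
import numpy as np

def preservation_ratio(eigvals, k=2):
    """Information preservation ratio of a k-dimensional mapping:
    (lambda_1 + ... + lambda_k) / sum_i lambda_i, with the eigenvalues
    sorted in decreasing order."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    return lam[:k].sum() / lam.sum()

# Hypothetical spectrum of a mapping at some value of t:
print(preservation_ratio([0.9, 0.5, 0.2, 0.1, 0.05]))  # close to 0.8
```

A ratio near 1 means the 2-D projection retains most of the variance captured by the full spectrum.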

Dolphins social network. This unweighted graph represents the associations between bottlenose dolphins from the same community, observed over a period of seven years. Only the basic diffusion map with t = 1 is shown (since, once again, the parameter t has no influence on the mapping) while the K_DM PCA is computed for t = 1, 3, and 6. It can be observed from the K_DM PCA (see Figure 4) that a dolphin (the 61st member, Zig) is rejected far away from the community due to his lack of interaction with the other dolphins. His only contact is also poorly connected to the remaining community. As explained in Section II and confirmed in Figure 4, the basic diffusion map and the K_DM PCA mappings become similar when t increases. For instance, when t = 6 or more, the outlier is no longer highlighted and only the main structure of the network is visible. Two main subgroups can be identified from the mapping (notice that Newman and Girvan [44] actually identified 4 sub-communities by clustering).


Fig. 3. Zachary karate social network: (a) Ucinet projection of the Zachary karate network, (b) the information preservation ratio of the 2-D projection as a function of the parameter t for the basic diffusion map and the K_DM PCA, (c) the graph mapping obtained by the basic diffusion map (t = 1 and 4), and (d) the graph mapping obtained by a K_DM PCA (t = 1 and 4).

B. Analysing the effect of stochastic complementation on a real-world dataset

This second experiment aims to illustrate the two-step mapping procedure, i.e., first applying a stochastic complementation and then computing the K_DM PCA, on a real-world dataset, the newsgroups data. However, for illustrative purposes, the procedure is first applied on a toy example.

Toy example. The graph (see Figure 5) is composed of four objects (e1, e2, e3, and e4) belonging to two different classes (c1 and c2). Each object is also connected to one or several of the five attributes (a1, ..., a5). The reduced graph mapping obtained by our two-step procedure allows one to highlight the relations between the attribute values (i.e., the a nodes) and the classes (i.e., the c nodes). To achieve this goal, the e nodes are eliminated by performing a stochastic complementation: only the a and c nodes are kept. The resulting subgraph is displayed on a 2-D plane by performing a K_DM PCA. Since the connectivity between nodes a1 and a2 (a3, a4, and a5) is larger than with the remaining nodes, these two (three) nodes are close together on the resulting map. Moreover, node c1 (c2) is highly connected to nodes a1, a2 (a3, a4, a5) through indirect links and is therefore displayed close to these nodes.

    Newsgroups dataset. The real-world dataset studied in


Fig. 4. Dolphins social network: the graph mapping obtained by the basic diffusion map (t = 1; upper left figure) and using the K_DM PCA (t = 1, 3, and 6).

Fig. 5. Toy example illustrating our two-step procedure (stochastic complementation followed by K_DM PCA).

this section is the newsgroups dataset.3 It is composed of about 20,000 unstructured documents, taken from 20 discussion groups (newsgroups) of the Usenet diffusion list. For ease of interpretation, we decided to limit the dataset size by randomly sampling 150 documents out of three slightly correlated topics (sport/baseball, politics/mideast, and space/general; 50 documents from each topic). These 150 documents are preprocessed as described in [68], [69]. The resulting graph is composed of 150 document nodes, 564 term nodes, and 3 topic nodes representing the topics of the documents. Each document is connected to its corresponding topic node with a weight fixed to one. The weights between the documents and the terms are set equal to the tf.idf factor and normalized in order to obtain weights between 0 and 1 [68], [69].

Thus, the class (or topic) nodes are connected to the document nodes of the corresponding topics, and each document is also connected to the terms contained in the document. Drawing a

3. Available from http://people.csail.mit.edu/jrennie/20Newsgroups/.

parallel with our illustrative example (see Figure 5), topic nodes correspond to c-nodes, document nodes to e-nodes, and terms to a-nodes. The goal of this experiment is to study the similarity between the terms and the topics through their connections with the document nodes. The reduced Markov chain is computed by setting S_1 to the nodes of the graph corresponding to the terms and the topics. The remaining nodes (the documents) are rejected into the subgroup S_2. Figure 6 shows the embedding of the terms used in the 150 documents of the newsgroups subset, as well as the topics of the documents. The K_DM PCA quickly provides the same mapping as the basic diffusion map when increasing the value of t. However, it can be observed on the embedding with t = 1 that several nodes are rejected outside the triangle, far away from the other nodes of the graph.

A new mapping where the terms corresponding to each node are also displayed (for visualization convenience, only terms cited by 25 documents or more are shown) for the K_DM PCA with t = 1 is shown in Figure 7. It can be observed that a group of terms is stuck near each topic node, denoting


of order n_1^2 per iteration (these algorithms are iterative by nature). On the other hand, computing the shortest-path distances matrix takes O(n_1^2 log(n_1)). Thus, each algorithm has a time complexity between O(n_1^2) and O(n_1^3).

In this experiment, we address the task of classification of unlabeled nodes in partially labeled graphs, that is, semi-supervised classification on a graph [73]. Notice that the goal of this experiment is not to design a state-of-the-art semi-supervised classifier; rather, it is to study the performance of the proposed method in comparison with other embedding methods.

Three graphs are investigated. The first graph is constructed from the well-known Iris dataset [4]. The weight (affinity) between nodes representing samples is provided by w_ij = exp[-d_ij^2 / σ^2], where d_ij is the Euclidean distance in the feature space and σ^2 is simply the sample variance. The classes are the three iris species. The second graph is extracted from the IMDb movie database [39]. 1126 movies are linked to form a connected graph: an edge is added between two movies if they share common production companies. Each node is classified as a high- or low-income movie (two classes). The last graph, extracted from the CORA dataset [39], is composed of scientific papers from three topics. A citation graph is built upon the dataset, where two papers are linked if the first paper cites the second one. The tested graph contains 1410 nodes divided into three classes representing machine-learning research topics.
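The Gaussian affinity used for the Iris graph can be sketched as follows. This is our own minimal illustration, with random vectors standing in for the Iris features; taking σ² as the total feature variance is one plausible reading, since the text only says "the sample variance":

```python
import numpy as np

def gaussian_affinity(X):
    """Affinity w_ij = exp(-d_ij^2 / sigma^2), where d_ij is the
    Euclidean distance between samples i and j in feature space.
    sigma^2 is taken here as the total sample variance (assumption)."""
    # squared Euclidean distances between all pairs of rows
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    sigma2 = X.var(axis=0).sum()   # assumption: sum of per-feature variances
    W = np.exp(-sq / sigma2)
    np.fill_diagonal(W, 0.0)       # no self-loops in the graph
    return W

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))       # stand-in for the Iris feature matrix
W = gaussian_affinity(X)
```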

For each of these three graphs, extra nodes are added to represent the class labels (called the class nodes). Each class node is connected to the graph nodes of the corresponding class. Moreover, in order to define cross-validation folds, these graph nodes are randomly split into training sets and test sets (called the training nodes and the test nodes, respectively), the edges between the test nodes and the class nodes being removed. The graph is then reduced to the test nodes and to the class nodes by stochastic complementation (the training nodes are rejected in the S2 subset and thus censored), and projected into a two-dimensional space by applying one of the projection algorithms described before. If the relationship between the test nodes and the class nodes is accurately reconstructed in the reduced graph, these nodes from the test set should be projected close to the class node of their corresponding class. We report the classification accuracy for several labeling rates, i.e., proportions of unlabeled nodes constituting the test set. The proportion of test nodes varies between 50% of the graph nodes (2-fold cross validation) and 10% (10-fold cross validation). This means that the proportion of training nodes left apart (censored) by stochastic complementation increases with the number of folds. The whole cross-validation procedure is repeated 10 times (10 runs) and the classification accuracy averaged over these 10 runs is reported, as well as the 95% confidence interval.
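The averaging over runs can be sketched as follows. This is our own illustration: the paper does not state the confidence-interval formula, so a normal approximation (1.96 s / sqrt(n)) is assumed, and the accuracy values below are made up:

```python
import numpy as np

def mean_and_ci95(acc):
    """Average accuracy over runs plus a 95% confidence-interval
    half-width, using the normal approximation 1.96 * s / sqrt(n)
    (an assumption; the exact formula is not specified in the text)."""
    acc = np.asarray(acc, dtype=float)
    m = acc.mean()
    half = 1.96 * acc.std(ddof=1) / np.sqrt(len(acc))
    return m, half

# hypothetical accuracies over 10 cross-validation runs
runs = [0.91, 0.88, 0.93, 0.90, 0.89, 0.92, 0.90, 0.91, 0.87, 0.92]
m, h = mean_and_ci95(runs)
print(f"{m:.3f} +/- {h:.3f}")
```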

For classification, the assigned label of each test node is simply the label provided by the nearest class node, in terms of Euclidean distance in the two-dimensional embedding space. This permits assessing whether the class information is correctly preserved during stochastic complementation and 2D dimensionality reduction. The parameter t of the KDM PCA is set to 5, in view of our preliminary experiments.

Figures 8(a)-(c) show the classification accuracy, as well as the 95% confidence interval, obtained on the three investigated graphs for different training/test set partitionings (folds). The x-axis represents the number of folds, and thus an increasing number of nodes left apart (censored) by stochastic complementation (from 0%, 50%, ..., up to 90%). As a

baseline, the whole original graph (corresponding to 1 single fold and referred to as 1-fold) is also projected without removing any class link and without performing a stochastic complementation; this situation represents the ideal case since all the class information is kept. All the methods should obtain a good accuracy score in this setting; this is indeed what is observed.
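The nearest-class-node decision rule used above can be sketched as follows (our own minimal illustration; the 2D coordinates are invented stand-ins for an embedding):

```python
import numpy as np

def nearest_class_node(test_xy, class_xy):
    """Assign to each test node (row of test_xy, shape (n, 2)) the
    index of the closest class node (row of class_xy, shape (c, 2)),
    using Euclidean distance in the embedding space."""
    # squared distances between every test node and every class node
    d2 = ((test_xy[:, None, :] - class_xy[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

# three class nodes and three test nodes in a hypothetical 2D embedding
class_xy = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
test_xy = np.array([[0.2, -0.1], [4.5, 0.3], [0.1, 4.0]])
labels = nearest_class_node(test_xy, class_xy)
print(labels)  # -> [0 1 2]
```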

First, we observe that, although obtaining very good performance when projecting the original graph (1-fold), CCA and SM perform poorly when the number of folds, and thus the amount of censored nodes, increases. On the other hand, LE is quite unstable, performing poorly on the CORA dataset. This means that stochastic complementation combined with CCA, SM, or LE does not work properly. On the contrary, the performance of KDM PCA and MDS remains fairly stable; for instance, the average decrease in performance of KDM PCA is around 10% in comparison with the mapping of the original graph (from 1-fold to 2-fold, 50% of the nodes are censored), which remains reasonable. MDS offers a good alternative to KDM PCA, showing competitive performance; however, it involves the computation of the all-pairs shortest-path distance.

These results are confirmed when displaying the mappings. Figures 8(d)-(h) show a mapping example of the test nodes as well as the class nodes (the white markers) of the CORA graph, for the 10-fold cross-validation setting. Thus, only 10% of the graph nodes are unlabeled and projected after stochastic complementation of the 90% remaining nodes. It can be observed that the Laplacian Eigenmap (Figure 8(e)) and the MDS (Figure 8(h)) managed to separate the different classes, but mostly in terms of angular similarity. On the KDM PCA mapping (Figure 8(d)), the class nodes are well located, at the center of the set of nodes belonging to the class. On the other hand, the mappings provided by CCA and SM after stochastic complementation do not accurately preserve the class information.

D. Discussion of the results

Let us now come back to our research questions. As a first observation, we can say that the two-step procedure (stochastic complementation followed by a diffusion-map projection) provides an embedding in a low-dimensional subspace from which useful information can be extracted. Indeed, the experiments show that highly related elements are displayed close together while poorly related elements tend to be drawn far apart. This is quite similar to correspondence analysis, to which the procedure is closely related. Secondly, it seems that stochastic complementation reasonably preserves proximity information when combined with a diffusion map (KDM PCA) or an ISOMAP projection (MDS). For the diffusion


[Figure 8 appears here. Panels (a)-(c) plot the classification rate (y-axis) against the number of folds (x-axis) for the IRIS, IMDB, and CORA graphs; panels (d)-(h) show the 2D mappings produced by KDM PCA, LE, CCA, SM, and MDS.]

Fig. 8. (a)-(c) Classification accuracy obtained by the five compared projection methods for the Iris ((a), 3 classes), IMDb ((b), 2 classes), and Cora ((c), 3 classes) datasets respectively, as a function of training/test set partitioning (number of folds). (d)-(h) The mapping of 10% of the Cora graph (10-fold setting) obtained by the five projection methods. The compared methods are the diffusion-map kernel ((d), KDM PCA, or KDM), the Laplacian Eigenmap ((e), LE), the Curvilinear Component Analysis ((f), CCA), the Sammon nonlinear Mapping ((g), SM), and the Multidimensional Scaling or ISOMAP ((h), MDS). The class label nodes are represented by white markers.


map, this is normal since both stochastic complementation and the diffusion-map distance are based on a Markov-chain model: stochastic complementation is the natural technique for censoring states of a Markov chain. On the contrary, stochastic complementation should not be combined with a Laplacian eigenmap, a curvilinear component analysis, or a Sammon nonlinear mapping, since the resulting mapping is not accurate. Finally, the KDM PCA provides exactly the same results as the basic diffusion map when t is large. However, when the parameter t is low, the resulting projection tends to highlight the outlier nodes and to magnify the relative differences between nodes. It is therefore recommended to display a whole range of mappings for several different values of t.

VI. CONCLUSIONS AND FURTHER WORK

This work introduced a link-analysis based technique for analyzing relationships existing in relational databases. The database is viewed as a graph where the nodes correspond to the elements contained in the tables and the links correspond to the relations between the tables. A two-step procedure is defined for analyzing the relationships between elements of interest contained in a table, or a subset of tables. More precisely, this work (1) proposes to use stochastic complementation for extracting, from the original graph, a subgraph containing the elements of interest and (2) introduces a kernel-based extension of the basic diffusion map for displaying and analyzing the reduced subgraph. It is shown that the resulting method is closely related to correspondence analysis.

Several datasets are analyzed by using this procedure, showing that it seems to be well suited for analyzing relationships between elements. Indeed, stochastic complementation considerably reduces the original graph and allows the analysis to focus on the elements of interest, without having to define a state of the Markov chain for each element of the relational database. However, one fundamental limitation of this method is that the relational database could contain too many disconnected components, in which case our link-analysis approach is almost useless. Moreover, it is clearly not always an easy task to extract a graph from a relational database, especially when the database is huge. These are the two main drawbacks of the proposed two-step procedure.

Further work will be devoted to the application of this methodology to fuzzy SQL queries or fuzzy information retrieval. The objective is to retrieve not only the elements strictly complying with the constraints of the SQL query, but also the elements that almost comply with these constraints and are therefore close to the target elements. We will also evaluate the proposed methodology on real relational databases.

    ACKNOWLEDGMENTS

Part of this work has been funded by projects with the Région Wallonne and the Belgian Politique Scientifique Fédérale. We thank these institutions for giving us the opportunity to conduct both fundamental and applied research. We also thank Professor Jef Wijsen, from the Université de Mons, Belgium, and Dr John Lee, from the Université catholique de Louvain, Belgium, for the interesting discussions and suggestions about this work.

APPENDIX

A. Some links between the basic diffusion map and spectral clustering

Suppose the right eigenvectors and eigenvalues of matrix P are u_k and λ_k. Then, the matrix I - P has the same eigenvectors as P and eigenvalues given by (1 - λ_k). Therefore, the largest eigenvectors of P correspond to the smallest eigenvectors of I - P.

Assuming a connected undirected graph, we will now express the eigenvectors of P in terms of the eigenvectors l_k of the normalized Laplacian matrix, \tilde{L} = D^{-1/2} L D^{-1/2}. We thus have

(I - P) u_k = (1 - λ_k) u_k    (18)

But,

I - P = D^{-1} (D - A)
      = D^{-1} L
      = D^{-1/2} (D^{-1/2} L D^{-1/2}) D^{1/2}
      = D^{-1/2} \tilde{L} D^{1/2}    (19)

Inserting this result in Equation (18) provides \tilde{L} D^{1/2} u_k = (1 - λ_k) D^{1/2} u_k. Thus, the (unnormalized) eigenvectors of \tilde{L} are l_k = D^{1/2} u_k, and are associated with eigenvalues (1 - λ_k).

REFERENCES

[1] P. Baldi, P. Frasconi, and P. Smyth. Modeling the Internet and the Web: Probabilistic Methods and Algorithms. John Wiley & Sons, 2003.

[2] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, volume 14, pages 585-591. MIT Press, 2001.

[3] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373-1396, 2003.

[4] C. Blake, E. Keogh, and C. Merz. UCI repository of machine learning databases. [http://www.ics.uci.edu/mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science, 1998.

[5] J. Blasius, M. Greenacre, P. Groenen, and M. van de Velden. Special journal issue on correspondence analysis and related methods. Computational Statistics and Data Analysis, 53(8):3103-3106, 2008.

[6] I. Borg and P. Groenen. Modern multidimensional scaling: Theory and applications. Springer, 1997.

[7] P. Bremaud. Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Springer-Verlag, 1999.

[8] P. Carrington, J. Scott, and S. Wasserman. Models and Methods in Social Network Analysis. Cambridge University Press, 2006.

[9] S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Elsevier Science, 2003.

[10] F. R. Chung. Spectral graph theory. American Mathematical Society, 1997.

[11] D. J. Cook and L. B. Holder. Mining graph data. Wiley and Sons, 2006.

[12] T. Cox and M. Cox. Multidimensional scaling, 2nd Ed. Chapman and Hall, 2001.

[13] N. Cressie. Statistics for spatial data. Wiley, 1991.

[14] P. Demartines and J. Herault. Curvilinear component analysis: A self-organizing neural network for nonlinear mapping of data sets. IEEE Transactions on Neural Networks, 8(1):148-154, 1997.

[15] C. Ding. Spectral clustering. Tutorial presented at the 16th European Conference on Machine Learning (ECML 2005), 2005.

[16] P. Domingos. Prospects and challenges for multi-relational data mining. ACM SIGKDD Explorations Newsletter, 5(1):80-83, 2003.

[17] F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens. Random-walk computation of similarities between nodes of a graph, with application to collaborative recommendation. IEEE Transactions on Knowledge and Data Engineering, 19(3):355-369, 2007.


[18] F. Fouss, J.-M. Renders, and M. Saerens. Links between Kleinberg's hubs and authorities, correspondence analysis and Markov chains. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM), pages 521-524, 2003.

[19] F. Fouss, L. Yen, A. Pirotte, and M. Saerens. An experimental investigation of graph kernels on a collaborative recommendation task. Proceedings of the 6th International Conference on Data Mining (ICDM 2006), pages 863-868, 2006.

[20] F. Geerts, H. Mannila, and E. Terzi. Relational link-based ranking. Proceedings of the 30th Very Large Data Bases Conference (VLDB), pages 552-563, 2004.

[21] X. Geng, D.-C. Zhan, and Z.-H. Zhou. Supervised nonlinear dimensionality reduction for visualization and classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 35(6):1098-1107, 2005.

[22] L. Getoor, editor. Introduction to statistical relational learning. MIT Press, 2007.

[23] J. Gower and D. Hand. Biplots. Chapman & Hall, 1995.

[24] M. J. Greenacre. Theory and Applications of Correspondence Analysis. Academic Press, 1984.

[25] A. Greenbaum. Iterative Methods for Solving Linear Systems. Society for Industrial and Applied Mathematics, 1997.

[26] K. M. Hall. An r-dimensional quadratic placement algorithm. Management Science, 17(8):219-229, 1970.

[27] D. A. Harville. Matrix Algebra from a Statistician's Perspective. Springer-Verlag, 1997.

[28] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer-Verlag, 2009.

[29] H. Hwang, W. Dhillon, and Y. Takane. An extension of multiple correspondence analysis for identifying heterogeneous subgroups of respondents. Psychometrika, 71(1):161-171, 2006.

[30] A. Izenman. Modern multivariate statistical techniques: Regression, classification, and manifold learning. Springer, 2008.

[31] R. Johnson and D. Wichern. Applied Multivariate Statistical Analysis, 6th Ed. Prentice Hall, 2007.

[32] R. Kimball and M. Ross. The data warehouse toolkit: The complete guide to dimensional modeling. John Wiley & Sons, 2002.

[33] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.

[34] P. Kroonenberg and M. Greenacre. Correspondence analysis. In Encyclopedia of Statistical Sciences, 2nd ed. (S. Kotz, founder and editor-in-chief), pages 1394-1403, 2006.

[35] S. Lafon and A. B. Lee. Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9):1393-1403, 2006.

[36] A. N. Langville and C. D. Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2006.

[37] J. Lee and M. Verleysen. Nonlinear dimensionality reduction. Springer, 2007.

[38] D. Lusseau, K. Schneider, O. Boisseau, P. Haase, E. Slooten, and S. Dawson. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Can geographic isolation explain this unique trait? Behavioral Ecology and Sociobiology, 54(4):396-405, 2003.

[39] S. A. Macskassy and F. Provost. Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research, 8:935-983, 2007.

[40] K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Academic Press, 1979.

[41] C. D. Meyer. Stochastic complementation, uncoupling Markov chains, and the theory of nearly reducible systems. SIAM Review, 31(2):240-272, 1989.

[42] B. Nadler, S. Lafon, R. Coifman, and I. Kevrekidis. Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators. Advances in Neural Information Processing Systems 18, pages 955-962, 2005.

[43] B. Nadler, S. Lafon, R. Coifman, and I. Kevrekidis. Diffusion maps, spectral clustering and reaction coordinate of dynamical systems. Applied and Computational Harmonic Analysis, 21:113-127, 2006.

[44] M. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69:026113, 2004.

[45] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, volume 14, pages 849-856, Vancouver, Canada, 2001. MIT Press.

[46] P. Pons and M. Latapy. Computing communities in large networks using random walks. Proceedings of the Inter