A Link Analysis Extension of Correspondence Analysis for Mining Relational Databases


Luh Yen, Marco Saerens, Member, IEEE, and François Fouss

Abstract: This work introduces a link analysis procedure for discovering relationships in a relational database or a graph, generalizing both simple and multiple correspondence analysis. It is based on a random walk model through the database defining a Markov chain having as many states as elements in the database. Suppose we are interested in analyzing the relationships between some elements (or records) contained in two different tables of the relational database. To this end, in a first step, a reduced, much smaller, Markov chain, containing only the elements of interest and preserving the main characteristics of the initial chain, is extracted by stochastic complementation [41]. This reduced chain is then analyzed by projecting jointly the elements of interest in the diffusion map subspace [42] and visualizing the results. This two-step procedure reduces to simple correspondence analysis when only two tables are defined, and to multiple correspondence analysis when the database takes the form of a simple star-schema. On the other hand, a kernel version of the diffusion map distance, generalizing the basic diffusion map distance to directed graphs, is also introduced, and the links with spectral clustering are discussed. Several data sets are analyzed by using the proposed methodology, showing the usefulness of the technique for extracting relationships in relational databases or graphs.

Index Terms: Graph mining, link analysis, kernel on a graph, diffusion map, correspondence analysis, dimensionality reduction, statistical relational learning.

    1 INTRODUCTION

TRADITIONAL statistical, machine learning, pattern recognition, and data mining approaches (see, for example, [28]) usually assume a random sample of independent objects from a single relation. Many of these techniques have gone through the extraction of knowledge from data (typically extracted from relational databases), almost always leading, in the end, to the classical double-entry tabular format, containing features for a sample of the population. These features are then used in order to learn from the sample, provided that it is representative of the population as a whole. However, real-world data coming from many fields (such as the World Wide Web, marketing, social networks, or biology; see [16]) are often multirelational and interrelated. The work recently performed in statistical relational learning [22], aiming at working with such data sets, incorporates research topics such as link analysis [36], [63], web mining [1], [9], social network analysis [8], [66], or graph mining [11]. All these research fields intend to find and exploit links between objects (in addition to features, as is also the case in the field of spatial statistics [13], [53]), which could be of various types and involved in different kinds of relationships. The focus of the techniques has moved from the analysis of the features describing each instance belonging to the population of interest (attribute-value analysis) to the analysis of the links existing between these instances (relational analysis), in addition to the features.

This paper precisely proposes a link-analysis-based technique for discovering relationships existing between elements of a relational database or, more generally, a graph. More specifically, this work is based on a random walk through the database defining a Markov chain having as many states as elements in the database. Suppose, for instance, we are interested in analyzing the relationships between elements contained in two different tables of a relational database. To this end, a two-step procedure is developed. First, a much smaller, reduced Markov chain, containing only the elements of interest (typically the elements contained in the two tables) and preserving the main characteristics of the initial chain, is extracted by stochastic complementation [41]. An efficient algorithm for extracting the reduced Markov chain from the large, sparse Markov chain representing the database is proposed. Then, the reduced chain is analyzed by, for instance, projecting the states in the subspace spanned by the right eigenvectors of the transition matrix ([42], [43], [46], [47]; called the basic diffusion map in this paper), or by computing a kernel principal component analysis [54], [57] on a diffusion map kernel computed from the reduced graph and visualizing the results. Indeed, a valid graph kernel based on the diffusion map distance, extending the basic diffusion map to directed graphs, is introduced.

The motivations for developing this two-step procedure are twofold. First, the computation would be cumbersome, if not impossible, when dealing with the complete database. Second, in many situations, the analyst is not interested in studying all the relationships between all elements of the database, but only a subset of them.


L. Yen and M. Saerens are with the Information Systems Research Unit (ISYS/LSM) and Machine Learning Group (MLG), Université Catholique de Louvain (UCL), 1 place des Doyens, 1348 Louvain-La-Neuve, Belgium. E-mail: {luh.yen, marco.saerens}@ucLouvain.be.

F. Fouss is with the Management Department (LSM), Facultés Universitaires Catholiques de Mons (FUCaM), 151 Chaussée de Binche, 7000 Mons, Belgium. E-mail: [email protected].

Manuscript received 16 Jan. 2009; revised 14 Sept. 2009; accepted 29 Dec. 2009; published online 23 Aug. 2010. Recommended for acceptance by Z.-H. Zhou. Digital Object Identifier no. 10.1109/TKDE.2010.142.



Moreover, if the whole set of elements in the database were analyzed, the resulting mapping would be averaged out by the numerous relationships and elements we are not interested in (for instance, the principal axis would be completely different). It would therefore not exclusively reflect the relationships between the elements of interest. Reducing the Markov chain by stochastic complementation thus allows the analysis to focus on the elements and relationships we are interested in.

Interestingly enough, when dealing with a bipartite graph (i.e., the database only contains two tables linked by one relation), stochastic complementation followed by a basic diffusion map is exactly equivalent to simple correspondence analysis. On the other hand, when dealing with a star-schema database (i.e., one central table linked to several tables by different relations), this two-step procedure reduces to multiple correspondence analysis. The proposed methodology therefore extends correspondence analysis to the analysis of a relational database.

In short, this paper has three main contributions:

. A two-step procedure for analyzing weighted graphs or relational databases is proposed.

. It is shown that the suggested procedure extends correspondence analysis.

. A kernel version of the diffusion map distance, applicable to directed graphs, is introduced.

The paper is organized as follows: Section 2 introduces the basic diffusion map distance and its natural kernel on a graph. Section 3 introduces some basic notions of stochastic complementation of a Markov chain. Section 4 presents the two-step procedure for analyzing the relationships between elements of different tables and establishes the equivalence between the proposed methodology and correspondence analysis in some special cases. Section 5 presents some illustrative examples involving several data sets, while Section 6 gives the conclusion.

2 THE DIFFUSION MAP DISTANCE AND ITS NATURAL KERNEL MATRIX

In this section, the basic diffusion map distance [42], [43], [46], [47] is briefly reviewed and some of its theoretical justifications are detailed. Then, a natural kernel matrix is derived from the diffusion map distance, providing a meaningful similarity measure between nodes.

    2.1 Notations and Definitions

Let us consider that we are given a weighted, directed graph $G$, possibly defined from a relational database in the following, obvious way: each element of the database is a node and each relation corresponds to a link (for a detailed procedure allowing to build a graph from a relational database, see [20]). The associated adjacency matrix $\mathbf{A}$ is defined in a standard way as $a_{ij} = [\mathbf{A}]_{ij} = w_{ij}$ if node $i$ is connected to node $j$, and $a_{ij} = 0$ otherwise (say $G$ has $n$ nodes in total). The weight $w_{ij} > 0$ of the edge connecting node $i$ and node $j$ is set to a larger value if the affinity between $i$ and $j$ is important. If no information about the strength of the relationship is available, we simply set $w_{ij} = 1$ (unweighted graph). We further assume that there are no self-loops ($w_{ii} = 0$ for $i = 1, \ldots, n$) and that the graph has a single connected component; that is, any node can be reached from any other node. If the graph is not connected, there is no relationship at all between the different components and the analysis has to be performed separately on each of them. It is therefore to be hoped that the graph modeling the relational database does not contain too many disconnected components; this can be considered as a limitation of our method. Partitioning a graph into connected components from its adjacency matrix can be done in $O(n^2)$ (see, for instance, [56]). Based on the adjacency matrix, the Laplacian matrix $\mathbf{L}$ of the graph is defined in the usual manner: $\mathbf{L} = \mathbf{D} - \mathbf{A}$, where $\mathbf{D} = \mathbf{Diag}(a_{i.})$ is the generalized outdegree matrix with diagonal entries $d_{ii} = [\mathbf{D}]_{ii} = a_{i.} = \sum_{j=1}^{n} a_{ij}$. The column vector $\mathbf{d} = \mathbf{diag}(a_{i.})$ is simply the vector containing the outdegree of each node. Furthermore, the volume of the graph is defined as $v_g = \mathrm{vol}(G) = \sum_{i=1}^{n} d_{ii} = \sum_{i,j=1}^{n} a_{ij}$. Usually, we are dealing with symmetric adjacency matrices, in which case $\mathbf{L}$ is symmetric and positive semidefinite (see, for instance, [10]).

From this graph, we define a natural random walk through the graph in the usual way by associating a state to each node and assigning a transition probability to each link. Thus, a random walker can jump from element to element, and each element therefore represents a state of the Markov chain describing the sequence of visited states. A random variable $s(t)$ contains the current state of the Markov chain at time step $t$: if the random walker is in state $i$ at time $t$, then $s(t) = i$. The random walk is defined by the following single-step transition probabilities of jumping from any state $i = s(t)$ to an adjacent state $j = s(t+1)$: $P(s(t+1) = j \,|\, s(t) = i) = a_{ij}/a_{i.} = p_{ij}$. The transition probabilities only depend on the current state and not on the past ones (first-order Markov chain). Since the graph is connected, the Markov chain is irreducible, that is, every state can be reached from any other state. If we denote the probability of being in state $i$ at time $t$ by $x_i(t) = P(s(t) = i)$ and we define $\mathbf{P}$ as the transition matrix with entries $p_{ij}$, the evolution of the Markov chain is characterized by $\mathbf{x}(t+1) = \mathbf{P}^{\mathrm{T}} \mathbf{x}(t)$, with $\mathbf{x}(0) = \mathbf{x}_0$ and $\mathrm{T}$ denoting the matrix transpose. This provides the state probability distribution $\mathbf{x}(t) = [x_1(t), x_2(t), \ldots, x_n(t)]^{\mathrm{T}}$ at time $t$ once the initial distribution $\mathbf{x}_0$ is known. Moreover, we will denote by $\mathbf{x}_i(t)$ the column vector containing the probability distribution of finding the random walker in each state at time $t$ when starting from state $i$ at time $t = 0$. That is, the entries of the vector $\mathbf{x}_i(t)$ are $x_{ij}(t) = P(s(t) = j \,|\, s(0) = i)$, $j = 1, \ldots, n$.

Since the Markov chain represents a random walk on the graph $G$, the transition matrix is simply $\mathbf{P} = \mathbf{D}^{-1}\mathbf{A}$. Moreover, if the adjacency matrix $\mathbf{A}$ is symmetric, the Markov chain is reversible and the steady-state vector, $\boldsymbol{\pi}$, is simply proportional to the degree of each state [48], $\mathbf{d}$ (which has to be normalized in order to obtain a valid probability distribution). Moreover, this implies that all the eigenvalues (both left and right) of the transition matrix are real.
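As a concrete illustration, the following minimal numpy sketch (the toy graph and all names are ours, not from the paper) builds $\mathbf{P} = \mathbf{D}^{-1}\mathbf{A}$ and the steady-state vector for a small undirected graph:

```python
import numpy as np

# Toy symmetric adjacency matrix (4 nodes, one connected component).
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])

d = A.sum(axis=1)          # generalized outdegrees a_i.
P = A / d[:, None]         # transition matrix P = D^{-1} A
pi = d / d.sum()           # symmetric A: steady state proportional to degrees

# Sanity check: pi is a fixed point of the evolution x(t+1) = P^T x(t).
assert np.allclose(P.T @ pi, pi)
```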

2.2 The Diffusion Map Distance

In our two-step procedure, a diffusion map projection, based on the so-called diffusion map distance, will be performed after stochastic complementation. Now, since the original definition of the diffusion map distance deals only with undirected, aperiodic Markov chains, it will first be assumed in this section that the reduced Markov chain, obtained after stochastic complementation, is indeed undirected, aperiodic, and connected, in which case the corresponding random walk defines an irreducible reversible Markov chain. Notice that it is not required that the original adjacency matrix be irreducible and reversible; these assumptions are only required for the reduced adjacency matrix obtained after stochastic complementation (see the discussion in Section 3.1). Moreover, some of these assumptions will be relaxed in Section 2.3, when introducing the diffusion map kernel, which is well-defined even if the graph is directed.

The original derivation of the diffusion map, introduced independently by Nadler et al. and by Pons and Latapy [42], [43], [46], [47], is detailed in this section, but other interpretations of this mapping have appeared in the literature (see the discussion at the end of this section). Moreover, the basic diffusion map is closely related to correspondence analysis, as detailed in Section 4. For an application of the basic diffusion map to dimensionality reduction, see [35].

Since $\mathbf{P}$ is aperiodic, irreducible, and reversible, it is well known that all the eigenvalues of $\mathbf{P}$ are real and the eigenvectors are also real (see, e.g., [7], p. 202). Moreover, all its eigenvalues $\lambda \in [-1, 1]$, and the eigenvalue $1$ has multiplicity one [7]. With these assumptions, Nadler et al. and Pons and Latapy [42], [43], [46], [47] proposed to use as distance between states $i$ and $j$

$$d^2_{ij}(t) = \sum_{k=1}^{n} \frac{(x_{ik}(t) - x_{jk}(t))^2}{\pi_k} \qquad (1)$$

$$\propto (\mathbf{x}_i(t) - \mathbf{x}_j(t))^{\mathrm{T}} \mathbf{D}^{-1} (\mathbf{x}_i(t) - \mathbf{x}_j(t)), \qquad (2)$$

since, for a simple random walk on an undirected graph, the entries of the steady-state vector are proportional (the $\propto$ sign) to the generalized degree of each node (the total of the elements of the corresponding row of the adjacency matrix [48]). This distance, called the diffusion map distance, corresponds to the sum of the squared differences between the probability distributions of being in any state after $t$ transitions when starting (i.e., at time $t = 0$) from two different states, state $i$ and state $j$. In other words, two nodes are similar when they diffuse through the network, and thus influence the network, in a similar way. This is a natural definition which quantifies the similarity between two states based on the evolution of the states' probability distributions. Of course, when $i = j$, $d_{ij}(t) = 0$.

Nadler et al. [42], [43] showed that this distance measure has a simple expression in terms of the right eigenvectors of $\mathbf{P}$:

$$d^2_{ij}(t) = \sum_{k=1}^{n} \lambda_k^{2t} (u_{ki} - u_{kj})^2, \qquad (3)$$

where $u_{ki} = [\mathbf{u}_k]_i$ is component $i$ of the $k$th right eigenvector, $\mathbf{u}_k$, of $\mathbf{P}$, and $\lambda_k$ is its corresponding eigenvalue. As usual, the $\lambda_k$ are ordered by decreasing modulus, so that the contributions to the sum in (3) are decreasing with $k$. On the other hand, $\mathbf{x}_i(t)$ can easily be expressed [42], [43] in the space spanned by the left eigenvectors of $\mathbf{P}$, the $\mathbf{v}_k$:

$$\mathbf{x}_i(t) = (\mathbf{P}^{\mathrm{T}})^t \mathbf{e}_i = \sum_{k=1}^{n} \lambda_k^t \mathbf{v}_k \mathbf{u}_k^{\mathrm{T}} \mathbf{e}_i = \sum_{k=1}^{n} \lambda_k^t u_{ki} \mathbf{v}_k, \qquad (4)$$

where $\mathbf{e}_i$ is the $i$th column of $\mathbf{I}$, $\mathbf{e}_i = [0, \ldots, 0, 1, 0, \ldots, 0]^{\mathrm{T}}$, with the single $1$ in position $i$. The resulting mapping aims to represent each state $i$ in an $n$-dimensional euclidean space with coordinates $(|\lambda_2^t|\, u_{2i}, |\lambda_3^t|\, u_{3i}, \ldots, |\lambda_n^t|\, u_{ni})$, as in (4) (the first right eigenvector is trivial and is therefore discarded). Dimensions are ordered by decreasing modulus, $|\lambda_k^t|$. This original mapping introduced by Nadler and coauthors will be referred to as the basic diffusion map in this paper, in contrast with the diffusion map kernel (KDM) that will be introduced in Section 2.3.
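The basic diffusion map can be sketched in a few lines of numpy. Note that numpy normalizes eigenvectors to unit norm rather than with respect to the $\mathbf{D}^{-1}$ metric used in the paper, so the coordinates below match the mapping only up to a rescaling of each axis (the function name and defaults are ours):

```python
import numpy as np

def basic_diffusion_map(P, t=1, dim=2):
    """Basic diffusion map: state i gets coordinate |lambda_k|^t * u_ki on
    axis k, from the right eigenvectors of P (trivial eigenvector dropped)."""
    lam, U = np.linalg.eig(P)
    order = np.argsort(-np.abs(lam))          # decreasing modulus, as in (3)
    lam, U = lam[order].real, U[:, order].real
    return np.abs(lam[1:dim + 1]) ** t * U[:, 1:dim + 1]
```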

The weighting factor, $\mathbf{D}^{-1}$, in (2) is necessary to obtain (3), since the $\mathbf{v}_k$ are not orthogonal. Instead, it can easily be shown that $\mathbf{v}_i^{\mathrm{T}} \mathbf{D}^{-1} \mathbf{v}_j = \delta_{ij}$, which amounts to redefining the inner product as $\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^{\mathrm{T}} \mathbf{D}^{-1} \mathbf{y}$, where the metric of the space is $\mathbf{D}^{-1}$ [7].

Notice also that there is a close relationship between spectral clustering (the mapping provided by the normalized Laplacian matrix; see, for instance, [15], [45], [65]) and the basic diffusion map. Indeed, a common embedding of the nodes consists of representing each node by the coordinates of the smallest nontrivial eigenvectors (corresponding to the smallest eigenvalues) of the normalized Laplacian matrix, $\tilde{\mathbf{L}} = \mathbf{D}^{-1/2} \mathbf{L} \mathbf{D}^{-1/2}$. More precisely, if $\mathbf{u}_k$ is the $k$th largest right eigenvector of the transition matrix $\mathbf{P}$ and $\tilde{\mathbf{l}}_k$ is the $k$th smallest nontrivial eigenvector of the normalized Laplacian matrix $\tilde{\mathbf{L}}$, we have (see the Appendix for the proof and details)

$$\mathbf{u}_k = \mathbf{D}^{-1/2}\, \tilde{\mathbf{l}}_k, \qquad (5)$$

and $\tilde{\mathbf{l}}_k$ is associated to eigenvalue $1 - \lambda_k$.
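Relation (5) is easy to verify numerically. The following sketch (same toy graph as before) checks that each $\mathbf{D}^{-1/2}\tilde{\mathbf{l}}_k$ is a right eigenvector of $\mathbf{P}$ with eigenvalue $1 - \mu_k$, where $\mu_k$ is the corresponding eigenvalue of $\tilde{\mathbf{L}}$:

```python
import numpy as np

A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
d = A.sum(axis=1)
P = A / d[:, None]

D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_tilde = D_inv_sqrt @ (np.diag(d) - A) @ D_inv_sqrt   # normalized Laplacian

mu, V = np.linalg.eigh(L_tilde)            # eigenvalues mu_k = 1 - lambda_k
for k in range(len(mu)):
    u = D_inv_sqrt @ V[:, k]               # relation (5): u_k = D^{-1/2} l~_k
    assert np.allclose(P @ u, (1.0 - mu[k]) * u)
```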

A subtle, still important, difference between this mapping and the one provided by the basic diffusion map concerns the order in which the dimensions are sorted. Indeed, for the basic diffusion map, the eigenvalues of the transition matrix $\mathbf{P}$ are ordered by decreasing modulus. For this spectral clustering model, the eigenvalues are sorted by decreasing value (and not modulus), which can result in a different representation if $\mathbf{P}$ has large negative eigenvalues. This shows that the mappings provided by spectral clustering and by the basic diffusion map are closely related.

Notice that at least three other justifications of this eigenvector-based mapping appeared before in the literature, and are briefly reviewed here. 1) It has been shown that the entries of the subdominant right eigenvector of the transition matrix $\mathbf{P}$ of an aperiodic, irreducible, reversible Markov chain can be interpreted as a relative distance to its stationary distribution (see [60], Section 1.8.1, or [18], Appendix). This distance may be regarded as an indicator of the number of iterations required to reach this equilibrium position, if the system starts in the state from which the distance is being measured. These quantities are only relative, but they serve as a means of comparison among the states [60]. 2) The same embedding can be obtained by minimizing the criterion $\sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} (z_i - z_j)^2 = 2\, \mathbf{z}^{\mathrm{T}} \mathbf{L} \mathbf{z}$ [10], [26] subject to $\mathbf{z}^{\mathrm{T}} \mathbf{D} \mathbf{z} = 1$, thereby penalizing the nodes having a large outdegree [74]. Here, $z_i$ is the coordinate of node $i$ on the axis and the vector $\mathbf{z}$ contains the $z_i$. The problem sums up to finding the smallest nontrivial eigenvector of $\mathbf{I} - \mathbf{P}$, which is the same as the second largest eigenvector of $\mathbf{P}$, and this is once more similar to the basic diffusion map. Notice that this mapping has been rediscovered and reinterpreted by Belkin and Niyogi [2], [3] in the context of nonlinear dimensionality reduction. 3) The last justification of the basic diffusion map, introduced in [15], is based on the concept of two-way partitioning of a graph [58]. Minimizing a normalized cut criterion while imposing that the membership vector is centered with respect to the metric $\mathbf{D}$ leads to exactly the same embedding as in the previous interpretation. Moreover, some authors [72] showed that applying a specific cut criterion to bipartite graphs leads to simple correspondence analysis. Notice that the second justification, 1), leads to exactly the same mapping as the basic diffusion map, while the third and fourth justifications, 2) and 3), lead to the same embedding space, but with a possibly different ordering and rescaling of the axes.

More generally, these mappings are, of course, also related to graph embedding and nonlinear dimensionality reduction, which have been highly studied topics in recent years, especially in the manifold learning community (see, e.g., [21], [30], [37], [67], for recent surveys or developments). Experimental comparisons with popular nonlinear dimensionality reduction techniques are presented in Section 5.

    2.3 A Kernel View of the Diffusion Map Distance

We now introduce¹ a variant of the basic diffusion map model introduced by Nadler et al. and Pons and Latapy [42], [43], [46], [47], which is still well-defined when the original graph is directed. In other words, we do not assume that the initial adjacency matrix $\mathbf{A}$ is symmetric in this section. This extension presents several advantages in comparison with the original basic diffusion map:

1. the kernel version of the diffusion map is applicable to directed graphs, while the original model is restricted to undirected graphs,

2. the extended model induces a valid kernel on a graph,

3. the resulting matrix has the nice property of being symmetric positive definite; the spectral decomposition can thus be computed on a symmetric positive definite matrix, and finally,

4. the resulting mapping is displayed in a euclidean space in which the coordinate axes are set in the directions of maximal variance by using (uncentered, if the kernel is not centered) kernel principal component analysis [54], [57] or multidimensional scaling [6], [12].

This kernel-based technique will be referred to as the diffusion map kernel PCA or the KDM PCA.

Let us define $\mathbf{W} = \mathbf{Diag}(\boldsymbol{\pi})^{-1}$, where $\boldsymbol{\pi}$ is the stationary distribution of the finite Markov chain. Remember that if the adjacency matrix is symmetric, the stationary distribution of the natural random walk is proportional to the degree of the nodes, so that $\mathbf{W} \propto \mathbf{D}^{-1}$ [48]. The diffusion map distance is therefore redefined as

$$d^2_{ij}(t) = (\mathbf{x}_i(t) - \mathbf{x}_j(t))^{\mathrm{T}} \mathbf{W} (\mathbf{x}_i(t) - \mathbf{x}_j(t)). \qquad (6)$$

Since $\mathbf{x}_i(t) = (\mathbf{P}^{\mathrm{T}})^t \mathbf{e}_i$, (6) becomes

$$d^2_{ij}(t) = (\mathbf{e}_i - \mathbf{e}_j)^{\mathrm{T}} \mathbf{P}^t \mathbf{W} (\mathbf{P}^{\mathrm{T}})^t (\mathbf{e}_i - \mathbf{e}_j) = (\mathbf{e}_i - \mathbf{e}_j)^{\mathrm{T}} \mathbf{K}_{\mathrm{DM}} (\mathbf{e}_i - \mathbf{e}_j) = [\mathbf{K}_{\mathrm{DM}}]_{ii} + [\mathbf{K}_{\mathrm{DM}}]_{jj} - [\mathbf{K}_{\mathrm{DM}}]_{ij} - [\mathbf{K}_{\mathrm{DM}}]_{ji}, \qquad (7)$$

where we defined

$$\mathbf{K}_{\mathrm{DM}}(t) = \mathbf{P}^t \mathbf{W} (\mathbf{P}^{\mathrm{T}})^t, \qquad (8)$$

referred to as the diffusion map kernel. Thus, the matrix $\mathbf{K}_{\mathrm{DM}}$ is the natural kernel (inner product matrix) associated to the squared diffusion map distances [6], [12]. It is clear that this matrix is symmetric positive semidefinite and contains inner products in a euclidean space where the node vectors are exactly separated by $d_{ij}(t)$ (the proof is straightforward and can be found in [17], Appendix D, where the same reasoning was applied to the commute time kernel). It is therefore a valid kernel matrix.
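A direct numpy transcription of (8) could look as follows. The helper computing $\boldsymbol{\pi}$ as the dominant left eigenvector is our own addition for the directed case, assuming the chain is irreducible so that the stationary distribution is unique:

```python
import numpy as np

def stationary_distribution(P):
    """Dominant left eigenvector of P (eigenvalue 1), normalized to sum to 1;
    assumes an irreducible chain so that this vector is unique."""
    lam, V = np.linalg.eig(P.T)
    k = np.argmin(np.abs(lam - 1.0))
    pi = np.abs(V[:, k].real)
    return pi / pi.sum()

def diffusion_map_kernel(P, t):
    """K_DM(t) = P^t W (P^T)^t with W = Diag(pi)^{-1}, as in (8)."""
    W = np.diag(1.0 / stationary_distribution(P))
    Pt = np.linalg.matrix_power(P, t)
    return Pt @ W @ Pt.T
```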

Performing an (uncentered, if the kernel is not centered) principal component analysis (PCA) in this embedding space aims to change the coordinate system by putting the new coordinate axes in the directions of maximal variance. From the theory of classical multidimensional scaling [6], [12], it suffices² to compute the $m$ first eigenvalues/eigenvectors of $\mathbf{K}_{\mathrm{DM}}$ and to consider these eigenvectors, multiplied by the square root of the corresponding eigenvalues, as the coordinates of the nodes in the principal component space spanned by these eigenvectors (see [6], [12]; for a similar application with the commute time kernel, see [51]; for the general definition of kernel PCA, see [54], [55]). In other words, we compute the $m$ first eigenvalues/eigenvectors of $\mathbf{K}_{\mathrm{DM}}$: $\mathbf{K}_{\mathrm{DM}} \mathbf{w}_k = \lambda_k \mathbf{w}_k$, where the $\mathbf{w}_k$ are orthonormal. Then, we represent each node $i$ in an $m$-dimensional euclidean space with coordinates $(\sqrt{\lambda_1}\, w_{1i}, \sqrt{\lambda_2}\, w_{2i}, \ldots, \sqrt{\lambda_m}\, w_{mi})$, where $w_{ki} = [\mathbf{w}_k]_i$ is element $i$ of the eigenvector $\mathbf{w}_k$ associated to eigenvalue $\lambda_k$; this is the vector representation of state $i$ in the principal component space.

It can easily be shown that when the initial graph is undirected, this PCA based on the kernel matrix $\mathbf{K}_{\mathrm{DM}}$ is similar to the diffusion map introduced in the previous section, up to an isometry. Indeed, by the classical theory of multidimensional scaling [6], [12], the eigenvectors of the kernel matrix $\mathbf{K}_{\mathrm{DM}}$ multiplied by the square root of the corresponding eigenvalues define coordinates in a euclidean space where the observations are exactly separated by the distance $d_{ij}(t)$. Since this is exactly the property of the basic diffusion map (3), both representations are similar up to an isometry.

Notice that the resulting kernel matrix can easily be centered [40] by $\mathbf{H} \mathbf{K}_{\mathrm{DM}} \mathbf{H}$ with $\mathbf{H} = \mathbf{I} - \mathbf{e}\mathbf{e}^{\mathrm{T}}/n$, where $\mathbf{e}$ is a column vector all of whose elements are $1$ (i.e., $\mathbf{e} = [1, 1, \ldots, 1]^{\mathrm{T}}$). $\mathbf{H}$ is called the centering matrix. This aims to place the origin of the coordinates of the diffusion map at the center of gravity of the node vectors.
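A minimal sketch of the corresponding kernel PCA step on the centered kernel (clipping tiny negative eigenvalues caused by round-off is our own safeguard, not part of the method):

```python
import numpy as np

def kdm_pca(K, dim=2):
    """Kernel PCA on the centered diffusion map kernel: node i receives
    coordinates sqrt(lambda_k) * w_ki on the first `dim` principal axes."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n     # centering matrix H = I - ee^T/n
    lam, W = np.linalg.eigh(H @ K @ H)      # eigh returns ascending order
    lam, W = lam[::-1], W[:, ::-1]          # largest eigenvalues first
    lam = np.clip(lam[:dim], 0.0, None)     # guard against round-off negatives
    return W[:, :dim] * np.sqrt(lam)
```

Pairwise euclidean distances between the rows of the returned coordinate matrix reproduce the diffusion map distances $d_{ij}(t)$ when all $n$ axes are kept.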


1. Part of the material of this section was published in a conference paper [19] presenting a similar idea.

2. The proof must be slightly adapted in order to account for the fact that the kernel matrix is not centered, as in [50].


2.4 Links between the Basic Diffusion Map and the Kernel Diffusion Map

While both represent the graph in a euclidean space where the nodes are exactly separated by the distances defined by (2), and thus provide exactly the same embedding, the mappings are, however, different for each method. Indeed, the coordinate system in the embedding space differs for each method.

In the case of the basic diffusion map, the eigenvector $\mathbf{u}_k$ provides the $k$th coordinate of the nodes in the embedding space. However, in the case of the diffusion map kernel, since a kernel PCA is performed, the first coordinate axis corresponds instead to the direction of maximal variance in terms of the diffusion map distance (2). Therefore, the coordinate system used by the diffusion map kernel is actually different from the one used by the basic diffusion map.

Putting the coordinate system in the directions of maximal variance, and thus computing a kernel PCA, is probably more natural. We now show that there is a close relationship between the two representations. Indeed, from (4), we easily observe that the mapping provided by the basic diffusion map remains the same as a function of the parameter $t$, up to a scaling of each coordinate/dimension (only the scaling changes). This is not the case for the kernel diffusion map. In fact, the mapping provided by the diffusion map kernel tends to the one provided by the basic diffusion map for growing values of $t$, in the case of an undirected graph. Indeed, it can be shown that the kernel matrix can be rewritten as $\mathbf{K}_{\mathrm{DM}} \propto \mathbf{U} \boldsymbol{\Lambda}^{2t} \mathbf{U}^{\mathrm{T}}$, where $\mathbf{U}$ contains the right eigenvectors of $\mathbf{P}$, the $\mathbf{u}_k$, as columns, and $\boldsymbol{\Lambda}$ is the diagonal matrix of eigenvalues. In this case, when $t$ is large, every additional dimension has a very small contribution in comparison with the previous ones.

This fact will be illustrated in the experimental section (i.e., Section 5). In practice, we observed that the two mappings are already almost identical when $t$ is equal to 5 or 6 (see, for instance, Fig. 3 in Section 5).

3 ANALYZING RELATIONS BY STOCHASTIC COMPLEMENTATION

In this section, the concept of stochastic complementation [41] is briefly reviewed and applied to the analysis of a graph through the random-walk-on-a-graph model. From the initial graph, a reduced graph containing only the nodes of interest, and which is much easier to analyze, is built.

3.1 Computing a Reduced Markov Chain by Stochastic Complementation

Suppose we are interested in analyzing the relationship between two sets of nodes of interest. A reduced Markov chain can be computed from the original chain in the following manner: First, the set of states is partitioned into two subsets, $S_1$, corresponding to the nodes of interest to be analyzed, and $S_2$, corresponding to the remaining nodes, to be hidden. We further denote by $n_1$ and $n_2$ (with $n_1 + n_2 = n$) the number of states in $S_1$ and $S_2$, respectively; usually $n_2 \gg n_1$. Thus, the transition matrix is repartitioned as

$$\mathbf{P} = \begin{bmatrix} \mathbf{P}_{11} & \mathbf{P}_{12} \\ \mathbf{P}_{21} & \mathbf{P}_{22} \end{bmatrix}, \qquad (9)$$

where the blocks are indexed by $S_1$ and $S_2$: $\mathbf{P}_{11}$ contains the transition probabilities among the states of $S_1$, $\mathbf{P}_{22}$ among the states of $S_2$, and $\mathbf{P}_{12}$, $\mathbf{P}_{21}$ between the two subsets.

The idea is to censor the useless elements by masking them during the random walk. That is, during any random walk on the original chain, only the states belonging to $S_1$ are recorded; all the other reached states, belonging to subset $S_2$, are censored, and therefore not recorded. One can show that the resulting reduced Markov chain obtained by censoring the states of $S_2$ is the stochastic complement of the original chain [41]. Thus, performing a stochastic complementation allows the analysis to focus on the tables and elements representing the factors/features of interest. The reduced chain inherits all the characteristics of the original chain; it simply censors the useless states. The stochastic complement $\mathbf{P}_c$ of the chain, partitioned as in (9), is defined as (see, for instance, [41])

$$\mathbf{P}_c = \mathbf{P}_{11} + \mathbf{P}_{12} (\mathbf{I} - \mathbf{P}_{22})^{-1} \mathbf{P}_{21}. \qquad (10)$$

It can be shown that the matrix $\mathbf{P}_c$ is stochastic, that is, the sum of the elements of each row is equal to 1 [41]; it therefore corresponds to a valid transition matrix between states of interest. We will assume that this resulting stochastic matrix is aperiodic and irreducible, that is, primitive [48]. Indeed, Meyer showed in [41] that if the initial chain is irreducible or aperiodic, so is the reduced chain. Moreover, even if the initial chain is periodic, the reduced chain frequently becomes aperiodic by stochastic complementation [41]. One way to ensure the aperiodicity of the reduced chain is to introduce a small positive quantity on the diagonal of the adjacency matrix $\mathbf{A}$, which does not fundamentally change the model. Then, $\mathbf{P}$ has nonzero diagonal entries and the stochastic complement, $\mathbf{P}_c$, is primitive (see [41], Theorem 5.1).
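A dense numpy sketch of (10) follows (the boolean-mask convention is ours); for large, sparse chains, the iterative scheme of Section 3.2 should be preferred:

```python
import numpy as np

def stochastic_complement(P, mask1):
    """P_c = P11 + P12 (I - P22)^{-1} P21, as in (10); `mask1` is a boolean
    array selecting the states of interest S1."""
    mask2 = ~mask1
    P11 = P[np.ix_(mask1, mask1)]
    P12 = P[np.ix_(mask1, mask2)]
    P21 = P[np.ix_(mask2, mask1)]
    P22 = P[np.ix_(mask2, mask2)]
    n2 = P22.shape[0]
    Pc = P11 + P12 @ np.linalg.solve(np.eye(n2) - P22, P21)
    assert np.allclose(Pc.sum(axis=1), 1.0)   # P_c must remain stochastic
    return Pc
```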

Let us show that the reduced chain also represents a random walk on a reduced graph $G_c$ containing only the nodes of interest. We therefore partition the matrices $\mathbf{A}$, $\mathbf{D}$, $\mathbf{L}$ as

$$\mathbf{A} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}, \quad \mathbf{D} = \begin{bmatrix} \mathbf{D}_1 & \mathbf{O} \\ \mathbf{O} & \mathbf{D}_2 \end{bmatrix}, \quad \mathbf{L} = \begin{bmatrix} \mathbf{L}_{11} & \mathbf{L}_{12} \\ \mathbf{L}_{21} & \mathbf{L}_{22} \end{bmatrix},$$

from which we easily find $\mathbf{P}_c = \mathbf{D}_1^{-1} (\mathbf{A}_{11} + \mathbf{A}_{12} (\mathbf{D}_2 - \mathbf{A}_{22})^{-1} \mathbf{A}_{21}) = \mathbf{D}_1^{-1} \mathbf{A}_c$, where we defined $\mathbf{A}_c = \mathbf{A}_{11} + \mathbf{A}_{12} (\mathbf{D}_2 - \mathbf{A}_{22})^{-1} \mathbf{A}_{21}$. Notice that if $\mathbf{A}$ is symmetric (the graph $G$ is undirected), $\mathbf{A}_c$ is symmetric as well. Since $\mathbf{P}_c$ is stochastic, we deduce that the diagonal matrix $\mathbf{D}_1$ contains the row sums of $\mathbf{A}_c$ and that the entries of $\mathbf{A}_c$ are nonnegative. The reduced chain thus corresponds to a random walk on the graph $G_c$ whose adjacency matrix is $\mathbf{A}_c$.

Moreover, the corresponding Laplacian matrix of the graph $G_c$ can be obtained by

$$\mathbf{L}_c = \mathbf{D}_1 - \mathbf{A}_c = \mathbf{D}_1 - \mathbf{A}_{11} - \mathbf{A}_{12} (\mathbf{D}_2 - \mathbf{A}_{22})^{-1} \mathbf{A}_{21} = \mathbf{L}_{11} - \mathbf{L}_{12} \mathbf{L}_{22}^{-1} \mathbf{L}_{21}, \qquad (11)$$

since $\mathbf{L}_{12} = -\mathbf{A}_{12}$ and $\mathbf{L}_{21} = -\mathbf{A}_{21}$. If the adjacency matrix $\mathbf{A}$ is symmetric, $\mathbf{L}_{11}$ ($\mathbf{L}_{22}$) is positive definite, since it is obtained from the positive semidefinite matrix $\mathbf{L}$ by deleting the rows associated to $S_2$ ($S_1$) and the corresponding columns, thereby eliminating the linear relationship among the rows. Notice that $\mathbf{L}_c$ is simply the Schur complement of $\mathbf{L}_{22}$ [27]. Thus, for an undirected graph $G$, instead of directly computing $\mathbf{P}_c$, it is more interesting to compute $\mathbf{L}_c$, which is symmetric positive definite, from which $\mathbf{P}_c$ can easily be deduced: $\mathbf{P}_c = \mathbf{I} - \mathbf{D}_1^{-1} \mathbf{L}_c$, which follows directly from $\mathbf{L}_c = \mathbf{D}_1 - \mathbf{A}_c$; see Section 3.2 for a proposition of iterative computation of $\mathbf{L}_c$.
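For an undirected graph, this route can be sketched as follows (dense linear algebra, for illustration only; the sparse iterative computation is deferred to Section 3.2):

```python
import numpy as np

def reduced_chain_undirected(A, mask1):
    """L_c = L11 - L12 L22^{-1} L21, as in (11), then P_c = I - D1^{-1} L_c."""
    mask2 = ~mask1
    d = A.sum(axis=1)                  # generalized outdegrees
    L = np.diag(d) - A                 # Laplacian matrix L = D - A
    L11 = L[np.ix_(mask1, mask1)]
    L12 = L[np.ix_(mask1, mask2)]
    L21 = L[np.ix_(mask2, mask1)]
    L22 = L[np.ix_(mask2, mask2)]      # positive definite for a connected graph
    Lc = L11 - L12 @ np.linalg.solve(L22, L21)
    Pc = np.eye(Lc.shape[0]) - Lc / d[mask1][:, None]   # I - D1^{-1} L_c
    return Lc, Pc
```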

3.2 Iterative Computation of $\mathbf{L}_c$ for Large, Sparse Graphs

In order to compute $\mathbf{L}_c$ from (11), we need to evaluate $\mathbf{L}_{22}^{-1} \mathbf{L}_{21}$. We now show how this matrix can be computed iteratively by using, for instance, a simple Jacobi-like or Gauss-Seidel-like algorithm. In order to simplify the development, let us pick one column, say column $j$, of $\mathbf{L}_{21}$ and denote it as $\mathbf{b}_j$; there are $n_1$ ($\ll n_2$) such columns. The problem is thus to compute $\mathbf{x}_j = \mathbf{L}_{22}^{-1} \mathbf{L}_{21} \mathbf{e}_j$ (column $j$ of $\mathbf{L}_{22}^{-1} \mathbf{L}_{21}$) such that

$$\mathbf{L}_{22} \mathbf{x}_j = (\mathbf{D}_2 - \mathbf{A}_{22}) \mathbf{x}_j = \mathbf{b}_j. \qquad (12)$$

By transforming this last equation, we obtain $\mathbf{x}_j = \mathbf{D}_2^{-1} (\mathbf{A}_{22} \mathbf{x}_j + \mathbf{b}_j) = \mathbf{P}_{22} \mathbf{x}_j + \mathbf{D}_2^{-1} \mathbf{b}_j$, which leads to the following iterative updating rule:

$$\hat{\mathbf{x}}_j \leftarrow \mathbf{P}_{22} \hat{\mathbf{x}}_j + \mathbf{D}_2^{-1} \mathbf{b}_j, \qquad (13)$$

where $\hat{\mathbf{x}}_j$ is an approximation of $\mathbf{x}_j$. This procedure converges, since the matrix $\mathbf{L}_{22}$ is irreducible and has weak diagonal dominance [70]. Initially, one can start, for instance, with $\hat{\mathbf{x}}_j = \mathbf{0}$. The computation of (13) fully exploits the sparseness, since only the nonzero entries of $\mathbf{P}_{22}$ contribute to the update. Of course, $\mathbf{D}_2^{-1} \mathbf{b}_j$ is precalculated.

Once all the $\mathbf{x}_j$ have been computed in turn (only $n_1$ in total), the matrix $\mathbf{X}$ containing the $\mathbf{x}_j$ as columns is constructed; we thus have $\mathbf{L}_{22}^{-1} \mathbf{L}_{21} = \mathbf{X}$. The final step consists of computing $\mathbf{L}_c = \mathbf{L}_{11} - \mathbf{L}_{12} \mathbf{X}$.

The time complexity of this algorithm is $n_1$ times the complexity of solving one sparse system of $n_2$ linear equations. Now, if the graph is undirected, the matrix $\mathbf{L}_{22}$ is positive definite. Recent numerical analysis techniques have shown that positive definite sparse linear systems in an $n$-by-$n$ matrix with $m$ nonzero entries can be solved very efficiently, in time nearly linear in $m$, for instance, by using conjugate gradient techniques [59], [64]. Thus, apart from the matrix multiplication $\mathbf{L}_{12} \mathbf{X}$, the complexity is $O(n_1 m)$, where $m$ is the number of nonzero entries of $\mathbf{L}_{22}$. In practice, the matrix is usually very sparse, with $m = \varepsilon n_2^2$ and $\varepsilon$ quite small, resulting in $O(\varepsilon n_1 n_2^2)$ with $n_1 \ll n_2$. The second step, i.e., the mapping by basic diffusion map or by diffusion map kernel PCA, has negligible complexity since it is performed on a reduced $n_1 \times n_1$ matrix. Of course, more sophisticated iterative algorithms can be used as well (see, for instance, [25], [49], [59]).

4 ANALYZING THE REDUCED MARKOV CHAIN WITH THE BASIC DIFFUSION MAP: LINKS WITH CORRESPONDENCE ANALYSIS

Once a reduced Markov chain containing only the nodes of interest has been obtained, one may want to visualize the graph in a low-dimensional space preserving as accurately as possible the proximity between the nodes. This is the second step of our procedure. For this purpose, we propose to use the diffusion maps introduced in Sections 2.2 and 2.3. Interestingly enough, computing a basic diffusion map on the reduced Markov chain is equivalent to correspondence analysis in two special cases of interest: a bipartite graph and a star-schema database. Therefore, the proposed two-step procedure can be considered as a generalization of correspondence analysis.

Correspondence analysis (see, for instance, [23], [24], [31], [62]) is a widely used multivariate statistical analysis technique which is still the subject of much research effort (see, for instance, [5], [29]). As stated, for instance, in [34], simple correspondence analysis aims to provide insights into the dependence of two categorical variables. The relationships between the attributes of the two categorical variables are usually analyzed through a biplot [23], a 2D representation of the attributes of both variables. The coordinates of the attributes on the biplot are obtained by computing the eigenvectors of a matrix. Many different derivations of simple correspondence analysis have been developed, allowing for different interpretations of the technique, such as maximizing the correlation between two discrete variables, reciprocal averaging, categorical discriminant analysis, scaling and quantification of categorical variables, performing a principal component analysis based on the chi-square distance, optimal scaling, dual scaling, etc. [34]. Multiple correspondence analysis is the extension of simple correspondence analysis to a larger number of categorical variables.

4.1 Simple Correspondence Analysis

As stated before, simple correspondence analysis (see, for instance, [23], [24], [31], [62]) aims to study the relationships between two random variables $x_1$ and $x_2$ (the features), each having mutually exclusive, categorical outcomes, denoted as attributes. Suppose the variable $x_1$ has $n_1$ observed attributes and the variable $x_2$ has $n_2$ observed attributes, each attribute being a possible outcome value for the feature. An experimenter makes a series of measurements of the features $x_1, x_2$ on a sample of $v_g$ individuals and records the outcomes in a frequency (also called contingency) table, $f_{ij}$, containing the number of individuals having both attribute $x_1 = i$ and attribute $x_2 = j$. In our relational database, this corresponds to two tables, each table corresponding to one variable and containing the set of observed attributes (outcomes) of the variable. The two tables are linked by a single relation (see Fig. 1 for a simple example).

This situation can be modeled as a bipartite graph, where each node corresponds to an attribute and links are only defined between attributes of $x_1$ and attributes of $x_2$. The weight associated to each link is set to $w_{ij} = f_{ij}$, quantifying the strength of the relationship between $i$ and $j$.


Fig. 1. Trivial example of a single relation between two variables, Document and Word. The Document table contains outcomes of documents, while the Word table contains outcomes of words.


The associated $n \times n$ adjacency matrix and the corresponding transition matrix can be factorized as

$$\mathbf{A} = \begin{bmatrix} \mathbf{O} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{O} \end{bmatrix}, \quad \mathbf{P} = \begin{bmatrix} \mathbf{O} & \mathbf{P}_{12} \\ \mathbf{P}_{21} & \mathbf{O} \end{bmatrix}, \qquad (14)$$

where $\mathbf{O}$ is a matrix full of zeroes.

Suppose we are interested in studying the relationships between the attributes of the first variable $x_1$, which correspond to the $n_1$ first elements. By stochastic complementation (see (10)), we easily obtain $\mathbf{P}_c = \mathbf{P}_{12} \mathbf{P}_{21} = \mathbf{D}_1^{-1} \mathbf{A}_{12} \mathbf{D}_2^{-1} \mathbf{A}_{21}$. Computing the diffusion map for $t = 1$ aims to extract the subdominant right-hand eigenvectors of $\mathbf{P}_c$, which exactly corresponds to correspondence analysis (see, for instance, [24], (4.3.5)). Moreover, it can easily be shown that $\mathbf{P}_c$ has only real nonnegative eigenvalues, and thus, ordering the eigenvalues by modulus is equivalent to ordering them by value. In correspondence analysis, the eigenvalues reflect the relative importance of the dimensions: each eigenvalue is the amount of inertia a given dimension exhibits in the frequency table [31]. The basic diffusion map after stochastic complementation on this bipartite graph therefore leads to the same results as simple correspondence analysis.
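The equivalence is easy to reproduce on a toy contingency table. The following sketch (data invented for illustration) computes $\mathbf{P}_c$ and its nontrivial right eigenvectors, whose components give the correspondence analysis coordinates of the row attributes up to scaling:

```python
import numpy as np

# Toy 3x4 contingency table: f_ij = number of individuals having both
# attribute i of x1 and attribute j of x2 (bipartite edge weights).
F = np.array([[10., 2., 1., 0.],
              [ 1., 8., 3., 1.],
              [ 0., 1., 2., 9.]])

D1_inv = np.diag(1.0 / F.sum(axis=1))
D2_inv = np.diag(1.0 / F.sum(axis=0))
Pc = D1_inv @ F @ D2_inv @ F.T             # P_c = D1^{-1} A12 D2^{-1} A21

lam, U = np.linalg.eig(Pc)                 # real, nonnegative eigenvalues
order = np.argsort(-lam.real)              # the trivial eigenvalue 1 is first
row_scores = U[:, order].real[:, 1:3]      # two nontrivial CA axes
```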

Relationships between simple correspondence analysis and link analysis techniques have already been highlighted. For instance, Zha et al. [72] showed the equivalence of a normalized cut performed on a bipartite graph and simple correspondence analysis. On the other hand, Saerens et al. investigated the relationships between Kleinberg's HITS algorithm [33] and correspondence analysis [18] or principal component analysis [50].

4.2 Multiple Correspondence Analysis

Multiple correspondence analysis assigns a numerical score to each attribute of a set of $p > 2$ categorical variables [23], [62]. Suppose the data are available in the form of a star-schema: the individuals are contained in a main table and the categorical features of these individuals, such as education level, gender, etc., are contained in $p$ auxiliary, satellite tables. The corresponding graph is built naturally by defining one node for each individual and for each attribute, while a link between an individual and an attribute is defined when the individual possesses this attribute. This configuration is known as a star-schema [32] in the data warehouse or relational database fields (see Fig. 2 for a trivial example).

Let us first renumber the nodes in such a way that the attribute nodes appear first and the individual nodes last. Thus, the attributes-to-individuals matrix will be denoted by $\mathbf{A}_{12}$; it contains a $1$ on the $(i, j)$ entry when individual $j$ has attribute $i$, and $0$ otherwise. The individuals-to-attributes matrix, the transpose of the attributes-to-individuals matrix, is $\mathbf{A}_{21}$. Thus, the adjacency matrix of the graph is

$$\mathbf{A} = \begin{bmatrix} \mathbf{O} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{O} \end{bmatrix}. \qquad (15)$$

Now, the individuals-to-attributes matrix exactly corresponds to the data matrix $\mathbf{A}_{21} = \mathbf{X}$ containing, as rows, the individuals and, as columns, the attributes. Since the different features are coded as indicator (dummy) variables, a row of the $\mathbf{X}$ matrix contains a $1$ if the individual has the corresponding attribute and $0$ otherwise. We thus have $\mathbf{A}_{21} = \mathbf{X}$ and $\mathbf{A}_{12} = \mathbf{X}^{\mathrm{T}}$.

Assuming binary weights, the matrix $\mathbf{D}_1$ contains on its diagonal the frequencies of each attribute, that is, the number of individuals having this attribute. On the other hand, $\mathbf{D}_2$ contains $p$ on each element of its diagonal, since each individual has exactly one attribute for each of the $p$ features (attributes corresponding to a feature are mutually exclusive). Thus, $\mathbf{D}_2 = p\,\mathbf{I}$, $\mathbf{P}_{12} = \mathbf{D}_1^{-1} \mathbf{A}_{12}$, and $\mathbf{P}_{21} = \mathbf{D}_2^{-1} \mathbf{A}_{21}$.

Suppose we are first interested in the relationships between attribute nodes, thereby hiding the individual nodes contained in the main table. By stochastic complementation (10), the corresponding attribute-attribute transition matrix is

$$\mathbf{P}_c = \mathbf{D}_1^{-1} \mathbf{A}_{12} \mathbf{D}_2^{-1} \mathbf{A}_{21} = \frac{1}{p} \mathbf{D}_1^{-1} \mathbf{A}_{12} \mathbf{A}_{21} = \frac{1}{p} \mathbf{D}_1^{-1} \mathbf{X}^{\mathrm{T}} \mathbf{X} = \frac{1}{p} \mathbf{D}_1^{-1} \mathbf{F}, \qquad (16)$$

where the element $f_{ij}$ of the frequency matrix $\mathbf{F} = \mathbf{X}^{\mathrm{T}} \mathbf{X}$, also called the Burt matrix, contains the number of co-occurrences of the two attributes $i$ and $j$, that is, the number of individuals having both attribute $i$ and attribute $j$.

The largest nontrivial right eigenvector of the matrix $\mathbf{P}_c$ represents the scores of the attributes in a multiple correspondence analysis. Thus, computing the eigenvalues and eigenvectors of $\mathbf{P}_c$ and displaying the nodes with coordinates proportional to the eigenvectors, weighted by the corresponding eigenvalue, exactly corresponds to multiple correspondence analysis (see [62], (10)). This is precisely what we obtain when computing the basic diffusion map on $\mathbf{P}_c$ with $t = 1$. Indeed, as for simple correspondence analysis, it can easily be shown that $\mathbf{P}_c$ has real nonnegative eigenvalues, and thus, ordering the eigenvalues by modulus is equivalent to ordering them by value.
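A toy sketch of (16), with an invented indicator matrix: the Burt matrix is built from $\mathbf{X}$ and the nontrivial right eigenvectors of $\mathbf{P}_c$ give the multiple correspondence analysis scores of the attributes up to scaling:

```python
import numpy as np

# Indicator matrix X: 4 individuals, p = 2 features coded over
# 5 attribute columns (2 gender attributes + 3 education levels).
X = np.array([[1., 0., 1., 0., 0.],
              [1., 0., 0., 1., 0.],
              [0., 1., 0., 1., 0.],
              [0., 1., 0., 0., 1.]])
p = 2

F = X.T @ X                                # Burt matrix of co-occurrences
D1_inv = np.diag(1.0 / X.sum(axis=0))      # inverse attribute frequencies
Pc = D1_inv @ F / p                        # (16): P_c = (1/p) D1^{-1} F

lam, U = np.linalg.eig(Pc)
order = np.argsort(-lam.real)
attribute_scores = U[:, order].real[:, 1:3]   # nontrivial MCA axes
```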


Fig. 2. Trivial example of a star-schema relation between a main variable, Individual, and auxiliary variables, Gender, Education level, etc. Each table contains outcomes of the corresponding random variable.


If we are interested in the relationships between the elements of the main table (the individuals) instead of the attributes, we obtain

$$\mathbf{P}_c = \frac{1}{p} \mathbf{A}_{21} \mathbf{D}_1^{-1} \mathbf{A}_{12} = \frac{1}{p} \mathbf{X} \mathbf{D}_1^{-1} \mathbf{X}^{\mathrm{T}}, \qquad (17)$$

which, once again, is exactly the result obtained by multiple correspondence analysis (see [62], (9)).

5 EXPERIMENTS

This experimental section aims to answer four research questions:

1. How do the graph mappings provided by the kernel PCA based on the diffusion map kernel (KDM PCA) compare with the basic diffusion map projection?

2. Does the proposed two-step procedure (stochastic complementation + diffusion map) provide realistic subgraph drawings?

3. How does the diffusion map kernel combined with stochastic complementation compare to other popular dimensionality reduction techniques?

4. Does stochastic complementation accurately preserve the structural information?

5.1 Graph Embedding

Two simple graphs are studied in order to illustrate the visualization of the graph structure by diffusion maps alone (without stochastic complementation): the Zachary karate club [71] and the dolphins social network [38].

5.1.1 Zachary Karate Club

Zachary studied the relations between the 34 members of a karate club. A disagreement between the club instructor (node 1) and the administrator (node 34) resulted in the split of the club into two parts. Each member of the club is represented by a node of the graph, and the weight between nodes $i$ and $j$ is set to the number of times member $i$ and member $j$ met outside the club. The Ucinet drawing of the network is shown in Fig. 3a. The friendship network built between members allows discovering how the club split, but its mapping also allows studying the proximity between the different members.

It can be observed from the graph embeddings provided by the basic diffusion map (Fig. 3c) that the value of the parameter $t$ has no influence on the positions of the nodes, but only on the axis scaling (remember (4)). On the contrary, the influence of the value of the parameter $t$ is clearly visible on the KDM PCA mapping (Fig. 3d).

However, the projection of a graph on a 2D space usually leads to a loss of information. The information preservation ratio can be estimated by $(\lambda_1 + \lambda_2)/\sum_i \lambda_i$; it was computed for the 2D mapping of the network using the basic diffusion map and the KDM PCA, and is shown in Fig. 3b. It can be observed that the ratio increases with $t$ but, overall, the information is better preserved with the KDM PCA than with the basic diffusion map. This is due to a better choice of the projection axes for the KDM PCA, which are oriented in the directions of maximal variance. Since a proximity on the mapping can be interpreted as a strong relationship between members, the social community structure is clearly observable visually from Figs. 3c and 3d. For the KDM PCA, the choice of the value of the parameter $t$ depends on the graph's properties that the user wishes to highlight. When $t$ is low ($t = 1$), nodes with few relevant connections are considered as outliers and rejected from the community. On the contrary, when $t$ increases, the effect of marginal nodes fades and only the global member structure, similar to the basic diffusion map, is visible.

5.1.2 Dolphins Social Network

This unweighted graph represents the associations between bottlenose dolphins from the same community, observed over a period of seven years. Only the basic diffusion map with $t = 1$ is shown (since, once again, the parameter $t$ has no influence on the mapping), while the KDM PCA is computed for $t = 1, 3$, and $6$. It can be observed from the KDM PCA (see Fig. 4) that one dolphin (the 61st member, Zig) is rejected far away from the community due to his lack of interaction with the other dolphins. His only contact is also poorly connected to the remaining community. As explained in Section 2 and confirmed in Fig. 4, the basic diffusion map and the KDM PCA mappings become similar when $t$ increases. For instance, when $t = 6$ or more, the outlier is no longer highlighted and only the main structure of the network is visible. Two main subgroups can be identified from the mapping (notice that Newman and Girvan [44] actually identified four subcommunities by clustering).

5.2 Analyzing the Effect of Stochastic Complementation on a Real-World Data Set

This second experiment aims to illustrate the two-step mapping procedure, i.e., first applying a stochastic complementation and then computing the KDM PCA, on a real-world data set, the newsgroups data. However, for illustrative purposes, the procedure is first applied on a toy example.

5.2.1 Toy Example

The graph (see Fig. 5) is composed of four objects ($e_1, e_2, e_3$, and $e_4$) belonging to two different classes ($c_1$ and $c_2$). Each object is also connected to one or several of the five attributes ($a_1, \ldots, a_5$). The reduced graph mapping obtained by our two-step procedure allows highlighting the relations between the attribute values (i.e., the $a$ nodes) and the classes (i.e., the $c$ nodes). To achieve this goal, the $e$ nodes are eliminated by performing a stochastic complementation: only the $a$ and $c$ nodes are kept. The resulting subgraph is displayed on a 2D plane by performing a KDM PCA. Since the connectivity between nodes $a_1$ and $a_2$ ($a_3$, $a_4$, and $a_5$) is larger than with the remaining nodes, these two (three) nodes are close together on the resulting map. Moreover, node $c_1$ ($c_2$) is highly connected to nodes $a_1$, $a_2$ ($a_3$, $a_4$, and $a_5$) through indirect links and is therefore displayed close to these nodes.

5.2.2 Newsgroups Data Set

The real-world data set studied in this section is the newsgroups data set. It is composed of about 20,000 unstructured documents, taken from 20 discussion groups (newsgroups) of the Usenet diffusion list. For the ease of interpretation, we decided to limit the data set size by randomly sampling 150 documents out of three slightly correlated topics (sport/baseball, politics/mideast, and space/general; 50 documents from each topic). These 150 documents are preprocessed as described in [68], [69]. The resulting graph is composed of 150 document nodes, 564 term nodes, and three topic nodes representing the topics of the documents. Each document is connected to its corresponding topic node with a weight fixed to 1. The weights between the documents and the terms are set equal to the tf.idf factor and normalized in order to obtain weights between 0 and 1 [68], [69].

Thus, the class (or topic) nodes are connected to the document nodes of the corresponding topics, and each document is also connected to the terms contained in the document. Drawing a parallel with our illustrative example (see Fig. 5), topic nodes correspond to $c$-nodes, document nodes to $e$-nodes, and terms to $a$-nodes. The goal of this experiment is to study the similarity between the terms and the topics through their connections with the document nodes. The reduced Markov chain is computed by setting $S_1$ to the nodes of the graph corresponding to the terms and the topics. The remaining nodes (the documents) are rejected in the subset $S_2$. Fig. 6 shows the embedding of the terms used in the 150 documents of the newsgroups subset, as well as the topics of the documents. The KDM PCA quickly provides


Fig. 3. Zachary karate social network: (a) Ucinet projection of the Zachary karate network, (b) the information preservation ratio of the 2D projection as a function of the parameter $t$ for the basic diffusion map and the KDM PCA, (c) the graph mapping obtained by the basic diffusion map ($t = 1$ and $4$), and (d) the graph mapping obtained by a KDM PCA ($t = 1$ and $4$).


the same mapping as the basic diffusion map when increasing the value of $t$. However, it can be observed on the embedding with $t = 1$ that several nodes are rejected outside the triangle, far away from the other nodes of the graph.

A new mapping, where the terms corresponding to each node are also displayed (for visualization convenience, only terms cited by 25 documents or more are shown), is provided for the KDM PCA with $t = 1$ in Fig. 7. It can be observed that a group of terms sticks near each topic node, denoting their relevance to this topic (e.g., "player", "win", "hit", and "team" for the sport topic). We also observe that terms lying in-between two topics are commonly used by both topics ("human", "nation", and "European" seem to be words used in discussions on politics as well as on space), while terms centered in the middle of the three topics are common terms without any specificity (such as "work", "Usa", or "make").

Globally, the projection provides a good representation of the use of the terms in the different discussion groups, for both the basic diffusion map and the KDM PCA. Terms rejected outside the triangle are often cited by only a few documents and seem to have few links with other terms. They are probably out of topic, as, for instance, the series of terms on printers in the outlier group on the left of the sport topic ("printer", "Packard", or "Hewlett").

5.3 Graph Reduction Influence and Embedding Comparison

The objective of this experiment is twofold. The first aim is to study the influence of stochastic complementation on the graph mapping. The second is to compare five popular dimensionality reduction methods, namely, the diffusion map kernel PCA (KDM PCA or simply KDM), the Laplacian Eigenmap (LE) [3], the Curvilinear Component Analysis (CCA) [14], Sammon's nonlinear Mapping (SM) [52], and the classical Multidimensional Scaling based on geodesic distances (MDS) [6], [12]. For CCA, SM, and MDS, the distance matrix is given by the shortest path distance computed on the reduced graph, whose weights are set to the inverse of the entries of the adjacency matrix obtained by stochastic complementation. Notice that the MDS method computed from the geodesic distance on a graph is also known as the ISOMAP method, after [61]. Provided that the resulting reduced Markov chain is usually dense, the time complexity of each algorithm is as follows: For KDM PCA, LE, and MDS, the problem is to compute the $d$ dominant eigenvectors of a square matrix, since the graph is mapped on a $d$-dimensional space, which is $O(d\tau n_1^2)$, where $n_1$ is the number of nodes of interest being displayed and $\tau$ is the number of iterations of the power method. For SM and CCA, the complexity is about $O(\tau n_1^2)$, where $\tau$ is the number of iterations (these algorithms are iterative by nature). On the other hand, computing the shortest path distance matrix takes $O(n_1^2 \log n_1)$. Thus, each algorithm has a time complexity between $O(n_1^2)$ and $O(n_1^3)$.

Fig. 4. Dolphins social network: The graph mapping obtained by the basic diffusion map ($t = 1$; upper left figure) and using the KDM PCA ($t = 1, 3$, and $6$).

Fig. 5. Toy example illustrating our two-step procedure (stochastic complementation followed by KDM PCA).

In this experiment, we address the task of classification of unlabeled nodes in partially labeled graphs, that is, semisupervised classification on a graph [73]. Notice that the goal of this experiment is not to design a state-of-the-art semisupervised classifier; rather, it is to study the performance of the proposed method in comparison with other embedding methods.

Three graphs are investigated. The first graph is constructed from the well-known Iris data set [4]. The weight (affinity) between the nodes representing two samples is provided by $w_{ij} = \exp(-d_{ij}^2/\sigma^2)$, where $d_{ij}$ is the euclidean distance between the samples in the feature space and $\sigma^2$ is simply the sample variance. The classes are the three iris species. The second graph is extracted from the IMDb movie database [39]. Here, 1,126 movies are linked to form a connected graph: an edge is added between two movies if they share common production companies. Each node is classified as a high- or low-income movie (two classes). The last graph, extracted from the CORA data set [39], is composed of scientific papers from three topics. A citation graph is built upon the data set, where two papers are linked if the first paper cites the second one. The tested graph contains 1,410 nodes divided into three classes representing machine learning research topics.
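For the first graph, the affinity construction can be sketched as follows; the paper does not detail how $\sigma^2$ is computed from the features, so summing the per-feature sample variances is our assumption:

```python
import numpy as np

def gaussian_affinity_graph(features):
    """Affinity w_ij = exp(-d_ij^2 / sigma^2), with zero diagonal (no
    self-loops); `features` is an (n_samples, n_features) array."""
    diff = features[:, None, :] - features[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)           # pairwise squared distances
    sigma2 = features.var(axis=0).sum()     # sample variance (our convention)
    W = np.exp(-d2 / sigma2)
    np.fill_diagonal(W, 0.0)
    return W
```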

For each of these three graphs, extra nodes are added to represent the class labels (called the class nodes). Each class node is connected to the graph nodes of the corresponding class. Moreover, in order to define cross-validation folds, these graph nodes are randomly split into training sets and test sets (called the training nodes and the test nodes, respectively), the edges between the test nodes and the class nodes being removed. The graph is then reduced to the test nodes and the class nodes by stochastic complementation (the training nodes are rejected in the $S_2$ subset, and thus censored) and projected into a 2D space by applying one of the projection algorithms described before. If the relationship


Fig. 7. Newsgroups data set: Example of stochastic complementation followed by a KDM PCA ($t = 1$). Only terms cited by at least 25 documents and peripheral terms are displayed.

Fig. 6. Newsgroups data set: Example of stochastic complementation followed by a basic diffusion map (upper left figure) and by a KDM PCA with $t = 1$ (upper right), $t = 3$ (lower left), and $t = 5$ (lower right). Terms and topic nodes are displayed jointly.

  • 8/3/2019 A Link Analysis Extension of Correspondence Analysis for Mining Relational Databases

    12/15

If the relationship between the test nodes and the class nodes is accurately reconstructed in the reduced graph, the nodes from the test set should be projected close to the class node of their corresponding class. We report the classification accuracy for several labeling rates, i.e., proportions of unlabeled nodes constituting the test set. The proportion of test nodes varies from 50 percent of the graph nodes (twofold cross-validation) to 10 percent (10-fold cross-validation).

This means that the proportion of training nodes left apart (censored) by stochastic complementation increases with the number of folds. The whole cross-validation procedure is repeated 10 times (10 runs), and the classification accuracy averaged over these 10 runs is reported, together with the 95 percent confidence interval.
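For reference, the censoring step itself follows Meyer's stochastic complementation [41]: given a transition matrix partitioned into kept states ($S_1$) and censored states ($S_2$), the reduced chain is $\mathbf{P}_c = \mathbf{P}_{11} + \mathbf{P}_{12}(\mathbf{I} - \mathbf{P}_{22})^{-1}\mathbf{P}_{21}$. A minimal sketch in Python/NumPy, with illustrative names:

```python
import numpy as np

def stochastic_complement(P, keep):
    """Censor a Markov chain: reduce the transition matrix P to the states
    listed in `keep`, using P_c = P11 + P12 (I - P22)^{-1} P21 (Meyer [41])."""
    keep = np.asarray(keep)
    drop = np.setdiff1d(np.arange(P.shape[0]), keep)
    P11 = P[np.ix_(keep, keep)]
    P12 = P[np.ix_(keep, drop)]
    P21 = P[np.ix_(drop, keep)]
    P22 = P[np.ix_(drop, drop)]
    I = np.eye(len(drop))
    # Solve (I - P22) X = P21 instead of forming the inverse explicitly.
    return P11 + P12 @ np.linalg.solve(I - P22, P21)
```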

For classification, the assigned label of each test node is simply the label provided by the nearest class node, in terms of Euclidean distance in the 2D embedding space. This allows us to assess whether the class information is correctly preserved during stochastic complementation and 2D dimensionality reduction. The parameter t of the KDM PCA is set to 5, in view of our preliminary experiments.
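A minimal sketch of this nearest-class-node decision rule, assuming the 2D coordinates of the test nodes and class nodes are already available (the function name and array layout below are illustrative):

```python
import numpy as np

def nearest_class_node_labels(test_emb, class_emb, class_labels):
    """Assign to each test node the label of its nearest class node,
    in terms of Euclidean distance in the 2D embedding space.
    test_emb: (m, 2) array; class_emb: (c, 2) array; class_labels: (c,) array."""
    diffs = test_emb[:, None, :] - class_emb[None, :, :]   # (m, c, 2)
    dists = np.linalg.norm(diffs, axis=2)                  # (m, c) distance matrix
    return class_labels[np.argmin(dists, axis=1)]          # label of nearest class node
```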

Figs. 8a, 8b, and 8c show the classification accuracy, as well as the 95 percent confidence interval, obtained on the three investigated graphs for different training/test set partitionings (folds). The x-axis represents the number of folds, and thus an increasing proportion of nodes left apart (censored) by stochastic complementation (from 0 and 50 up to 90 percent). As a baseline, the whole original graph (corresponding to one single fold and referred to as 1-fold) is also projected without removing any class link and without performing a stochastic complementation; this situation represents the ideal case, since all the class information is kept. All the methods should obtain a good accuracy score in this setting, and this is indeed what is observed.

First, we observe that, although obtaining very good performance when projecting the original graph (1-fold), CCA and SM perform poorly when the number of folds, and thus the amount of censored nodes, increases. On the other hand, LE is quite unstable, performing poorly on the CORA data set. This means that stochastic complementation combined with CCA, SM, or LE does not work properly. On the contrary, the performance of KDM PCA and MDS remains fairly stable; for instance, the average decrease of performance of KDM PCA is around 10 percent in comparison with the mapping of the original graph (from 1-fold to 2-fold, where 50 percent of the nodes are censored), which remains reasonable. MDS offers a good alternative to KDM PCA, showing competitive performance; however, it involves the computation of the all-pairs shortest-path distance.

These results are confirmed when displaying the mappings. Figs. 8d, 8e, 8f, 8g, and 8h show a mapping example of the test nodes, as well as the class nodes (the white markers), of the CORA graph, for the 10-fold cross-validation setting. Thus, only 10 percent of the graph nodes are unlabeled and projected after stochastic complementation of the 90 percent remaining nodes. It can be observed that the Laplacian Eigenmap (Fig. 8e) and the MDS (Fig. 8h) managed to separate the different classes, but mostly in terms of angular similarity. On the KDM PCA mapping (Fig. 8d), the class nodes are well located, at the center of the set of nodes belonging to the class. On the other hand, the mappings provided by CCA and SM after stochastic complementation do not accurately preserve the class information.

    5.4 Discussion of the Results

Let us now come back to our research questions. As a first observation, we can say that the two-step procedure (stochastic complementation followed by a diffusion map projection) provides an embedding in a low-dimensional subspace from which useful information can be extracted. Indeed, the experiments show that highly related elements are displayed close together, while poorly related elements tend to be drawn far apart. This is quite similar to correspondence analysis, to which the procedure is closely related. Second, it seems that stochastic complementation reasonably preserves proximity information when combined with a diffusion map (KDM PCA) or an ISOMAP projection (MDS). For the diffusion map, this is expected, since both stochastic complementation and the diffusion map distance are based on a Markov chain model; stochastic complementation is the natural technique for censoring states of a Markov chain. On the contrary, stochastic complementation should not be combined with a Laplacian Eigenmap, a curvilinear component analysis, or a Sammon nonlinear mapping; the resulting mapping is not accurate. Finally, the KDM PCA provides exactly the same results as the basic diffusion map when t is large. However, when the parameter t is low, the resulting projection tends to highlight the outlier nodes and to magnify the relative differences between nodes. It is therefore recommended to display a whole range of mappings for several different values of t.
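For readers who want to reproduce such a sweep over t, here is a minimal basic diffusion map sketch in Python/NumPy. It is an illustration under common conventions (node i is mapped to the coordinates $\lambda_k^t u_k(i)$, using the nontrivial right eigenvectors of $\mathbf{P} = \mathbf{D}^{-1}\mathbf{A}$), not the authors' exact KDM PCA, and it assumes an undirected graph so that the spectrum is real.

```python
import numpy as np

def diffusion_map(A, d=2, t=1):
    """Basic diffusion map embedding of an undirected graph with adjacency
    matrix A: node i is mapped to (lambda_k^t * u_k(i)) for k = 2..d+1,
    where u_k are right eigenvectors of the transition matrix P = D^{-1} A."""
    P = A / A.sum(axis=1, keepdims=True)
    lam, U = np.linalg.eig(P)
    order = np.argsort(-lam.real)            # sort by decreasing eigenvalue
    lam, U = lam.real[order], U.real[:, order]
    # Skip the trivial constant eigenvector associated with lambda_1 = 1.
    return (lam[1:d + 1] ** t) * U[:, 1:d + 1]

# Displaying a whole range of mappings, as recommended in the text:
# for t in (1, 3, 5): emb = diffusion_map(A, d=2, t=t)
```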

    6 CONCLUSIONS AND FURTHER WORK

This work introduced a link-analysis-based technique for analyzing relationships existing in relational databases. The database is viewed as a graph, where the nodes correspond to the elements contained in the tables and the links correspond to the relations between the tables. A two-step procedure is defined for analyzing the relationships between elements of interest contained in a table, or a subset of tables. More precisely, this work 1) proposes to use stochastic complementation for extracting a subgraph containing the elements of interest from the original graph and 2) introduces a kernel-based extension of the basic diffusion map for displaying and analyzing the reduced subgraph. It is shown that the resulting method is closely related to correspondence analysis.

Several data sets are analyzed by using this procedure, showing that it seems to be well suited for analyzing relationships between elements. Indeed, stochastic complementation considerably reduces the original graph and allows the analysis to focus on the elements of interest, without having to define a state of the Markov chain for each element of the relational database.

However, one fundamental limitation of this method is that the relational database could contain too many disconnected components, in which case our link analysis approach is almost useless. Moreover, it is clearly not always an easy task to extract a graph from a relational database, especially when the database is huge. These are the two main drawbacks of the proposed two-step procedure.

Further work will be devoted to the application of this methodology to fuzzy SQL queries or fuzzy information retrieval. The objective is to retrieve not only the elements strictly complying with the constraints of the SQL query, but also the elements that almost comply with these constraints and are therefore close to the target elements. We will also evaluate the proposed methodology on real relational databases.


Fig. 8. (a)-(c) Classification accuracy obtained by the five compared projection methods for the Iris ((a), three classes), IMDb ((b), two classes), and CORA ((c), three classes) data sets, respectively, as a function of training/test set partitioning (number of folds). (d)-(h) The mapping of 10 percent of the CORA graph (10-fold setting) obtained by the five projection methods: the diffusion map kernel ((d), KDM PCA, or KDM), the Laplacian Eigenmap ((e), LE), the curvilinear component analysis ((f), CCA), the Sammon nonlinear mapping ((g), SM), and the multidimensional scaling or ISOMAP ((h), MDS). The class label nodes are represented by white markers.


    APPENDIX

SOME LINKS BETWEEN THE BASIC DIFFUSION MAP AND SPECTRAL CLUSTERING

Suppose the right eigenvectors and eigenvalues of the matrix $\mathbf{P}$ are $\mathbf{u}_k$ and $\lambda_k$. Then, the matrix $\mathbf{I} - \mathbf{P}$ has the same eigenvectors as $\mathbf{P}$, with eigenvalues given by $1 - \lambda_k$. Therefore, the largest eigenvectors of $\mathbf{P}$ correspond to the smallest eigenvectors of $\mathbf{I} - \mathbf{P}$.

Assuming a connected undirected graph, we will now express the eigenvectors of $\mathbf{P}$ in terms of the eigenvectors $\tilde{\mathbf{l}}_k$ of the normalized Laplacian matrix, $\tilde{\mathbf{L}} = \mathbf{D}^{-1/2} \mathbf{L} \mathbf{D}^{-1/2}$. We thus have

$(\mathbf{I} - \mathbf{P})\,\mathbf{u}_k = (1 - \lambda_k)\,\mathbf{u}_k. \qquad (18)$

But

$\mathbf{I} - \mathbf{P} = \mathbf{D}^{-1}(\mathbf{D} - \mathbf{A}) = \mathbf{D}^{-1}\mathbf{L} = \mathbf{D}^{-1/2}\,(\mathbf{D}^{-1/2}\mathbf{L}\mathbf{D}^{-1/2})\,\mathbf{D}^{1/2} = \mathbf{D}^{-1/2}\,\tilde{\mathbf{L}}\,\mathbf{D}^{1/2}. \qquad (19)$

Inserting this result in (18) provides $\tilde{\mathbf{L}}\,\mathbf{D}^{1/2}\mathbf{u}_k = (1 - \lambda_k)\,\mathbf{D}^{1/2}\mathbf{u}_k$. Thus, the (unnormalized) eigenvectors of $\tilde{\mathbf{L}}$ are $\tilde{\mathbf{l}}_k = \mathbf{D}^{1/2}\mathbf{u}_k$, and are associated with the eigenvalues $1 - \lambda_k$.
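A quick numerical sanity check of (18)-(19), sketched in Python/NumPy on a small undirected toy graph of our own choosing (the adjacency matrix below is purely illustrative):

```python
import numpy as np

A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
deg = A.sum(axis=1)
P = A / deg[:, None]                       # transition matrix P = D^{-1} A
L = np.diag(deg) - A                       # unnormalized Laplacian L = D - A
L_tilde = np.diag(deg ** -0.5) @ L @ np.diag(deg ** -0.5)  # normalized Laplacian

lam, U = np.linalg.eig(P)                  # right eigenvectors u_k, eigenvalues lambda_k
D_half = np.diag(deg ** 0.5)
for k in range(len(lam)):
    l_k = D_half @ U[:, k]                 # candidate eigenvector of L_tilde
    # Check L_tilde l_k = (1 - lambda_k) l_k, i.e., l_k = D^{1/2} u_k.
    assert np.allclose(L_tilde @ l_k, (1 - lam[k]) * l_k)
```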

ACKNOWLEDGMENTS

Part of this work has been funded by projects with the Région Wallonne and the Belgian Politique Scientifique Fédérale. We thank these institutions for giving us the opportunity to conduct both fundamental and applied research. We also thank Professor Jef Wijsen, from the Université de Mons, Belgium, and Dr. John Lee, from the Université Catholique de Louvain, Belgium, for the interesting discussions and suggestions about this work.

REFERENCES

[1] P. Baldi, P. Frasconi, and P. Smyth, Modeling the Internet and the Web: Probabilistic Methods and Algorithms. John Wiley & Sons, 2003.
[2] M. Belkin and P. Niyogi, "Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering," Advances in Neural Information Processing Systems, vol. 14, pp. 585-591, MIT Press, 2001.
[3] M. Belkin and P. Niyogi, "Laplacian Eigenmaps for Dimensionality Reduction and Data Representation," Neural Computation, vol. 15, pp. 1373-1396, 2003.
[4] C. Blake, E. Keogh, and C. Merz, "UCI Repository of Machine Learning Databases," Univ. of California, Dept. of Information and Computer Science, http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[5] J. Blasius, M. Greenacre, P. Groenen, and M. van de Velden, "Special Issue on Correspondence Analysis and Related Methods," Computational Statistics and Data Analysis, vol. 53, no. 8, pp. 3103-3106, 2009.
[6] I. Borg and P. Groenen, Modern Multidimensional Scaling: Theory and Applications. Springer, 1997.
[7] P. Brémaud, Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Springer-Verlag, 1999.
[8] P. Carrington, J. Scott, and S. Wasserman, Models and Methods in Social Network Analysis. Cambridge Univ. Press, 2006.
[9] S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data. Elsevier Science, 2003.
[10] F.R. Chung, Spectral Graph Theory. Am. Math. Soc., 1997.
[11] D.J. Cook and L.B. Holder, Mining Graph Data. Wiley and Sons, 2006.
[12] T. Cox and M. Cox, Multidimensional Scaling, second ed. Chapman and Hall, 2001.
[13] N. Cressie, Statistics for Spatial Data. Wiley, 1991.
[14] P. Demartines and J. Hérault, "Curvilinear Component Analysis: A Self-Organizing Neural Network for Nonlinear Mapping of Data Sets," IEEE Trans. Neural Networks, vol. 8, no. 1, pp. 148-154, Jan. 1997.
[15] C. Ding, "Spectral Clustering," tutorial presented at the 16th European Conf. Machine Learning (ECML '05), 2005.
[16] P. Domingos, "Prospects and Challenges for Multi-Relational Data Mining," ACM SIGKDD Explorations Newsletter, vol. 5, no. 1, pp. 80-83, 2003.
[17] F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens, "Random-Walk Computation of Similarities between Nodes of a Graph, with Application to Collaborative Recommendation," IEEE Trans. Knowledge and Data Eng., vol. 19, no. 3, pp. 355-369, Mar. 2007.
[18] F. Fouss, J.-M. Renders, and M. Saerens, "Links between Kleinberg's Hubs and Authorities, Correspondence Analysis and Markov Chains," Proc. Third IEEE Int'l Conf. Data Mining (ICDM), pp. 521-524, 2003.
[19] F. Fouss, L. Yen, A. Pirotte, and M. Saerens, "An Experimental Investigation of Graph Kernels on a Collaborative Recommendation Task," Proc. Sixth Int'l Conf. Data Mining (ICDM '06), pp. 863-868, 2006.
[20] F. Geerts, H. Mannila, and E. Terzi, "Relational Link-Based Ranking," Proc. 30th Very Large Data Bases Conf. (VLDB), pp. 552-563, 2004.
[21] X. Geng, D.-C. Zhan, and Z.-H. Zhou, "Supervised Nonlinear Dimensionality Reduction for Visualization and Classification," IEEE Trans. Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 35, no. 6, pp. 1098-1107, Dec. 2005.
[22] Introduction to Statistical Relational Learning, L. Getoor and B. Taskar, eds. MIT Press, 2007.
[23] J. Gower and D. Hand, Biplots. Chapman & Hall, 1995.
[24] M.J. Greenacre, Theory and Applications of Correspondence Analysis. Academic Press, 1984.
[25] A. Greenbaum, Iterative Methods for Solving Linear Systems. Soc. for Industrial and Applied Math., 1997.
[26] K.M. Hall, "An R-Dimensional Quadratic Placement Algorithm," Management Science, vol. 17, no. 8, pp. 219-229, 1970.
[27] D.A. Harville, Matrix Algebra from a Statistician's Perspective. Springer-Verlag, 1997.
[28] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second ed. Springer-Verlag, 2009.
[29] H. Hwang, W. Dhillon, and Y. Takane, "An Extension of Multiple Correspondence Analysis for Identifying Heterogeneous Subgroups of Respondents," Psychometrika, vol. 71, no. 1, pp. 161-171, 2006.
[30] A. Izenman, Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer, 2008.
[31] R. Johnson and D. Wichern, Applied Multivariate Statistical Analysis, sixth ed. Prentice Hall, 2007.
[32] R. Kimball and M. Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. John Wiley & Sons, 2002.
[33] J.M. Kleinberg, "Authoritative Sources in a Hyperlinked Environment," J. ACM, vol. 46, no. 5, pp. 604-632, 1999.
[34] P. Kroonenberg and M. Greenacre, "Correspondence Analysis," Encyclopedia of Statistical Sciences, S. Kotz, ed., second ed., pp. 1394-1403, John Wiley & Sons, 2006.
[35] S. Lafon and A.B. Lee, "Diffusion Maps and Coarse-Graining: A Unified Framework for Dimensionality Reduction, Graph Partitioning, and Data Set Parameterization," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 9, pp. 1393-1403, Sept. 2006.
[36] A.N. Langville and C.D. Meyer, Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton Univ. Press, 2006.
[37] J. Lee and M. Verleysen, Nonlinear Dimensionality Reduction. Springer, 2007.
[38] D. Lusseau, K. Schneider, O. Boisseau, P. Haase, E. Slooten, and S. Dawson, "The Bottlenose Dolphin Community of Doubtful Sound Features a Large Proportion of Long-Lasting Associations. Can Geographic Isolation Explain This Unique Trait?" Behavioral Ecology and Sociobiology, vol. 54, no. 4, pp. 396-405, 2003.
[39] S.A. Macskassy and F. Provost, "Classification in Networked Data: A Toolkit and a Univariate Case Study," J. Machine Learning Research, vol. 8, pp. 935-983, 2007.
[40] K.V. Mardia, J.T. Kent, and J.M. Bibby, Multivariate Analysis. Academic Press, 1979.
[41] C.D. Meyer, "Stochastic Complementation, Uncoupling Markov Chains, and the Theory of Nearly Reducible Systems," SIAM Rev., vol. 31, no. 2, pp. 240-272, 1989.
[42] B. Nadler, S. Lafon, R. Coifman, and I. Kevrekidis, "Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators," Advances in Neural Information Processing Systems, vol. 18, pp. 955-962, MIT Press, 2005.
[43] B. Nadler, S. Lafon, R. Coifman, and I. Kevrekidis, "Diffusion Maps, Spectral Clustering and Reaction Coordinates of Dynamical Systems," Applied and Computational Harmonic Analysis, vol. 21, pp. 113-127, 2006.
[44] M. Newman and M. Girvan, "Finding and Evaluating Community Structure in Networks," Physical Rev. E, vol. 69, p. 026113, 2004.
[45] A.Y. Ng, M.I. Jordan, and Y. Weiss, "On Spectral Clustering: Analysis and an Algorithm," Advances in Neural Information Processing Systems, vol. 14, pp. 849-856, MIT Press, 2001.
[46] P. Pons and M. Latapy, "Computing Communities in Large Networks Using Random Walks," Proc. Int'l Symp. Computer and Information Sciences (ISCIS '05), pp. 284-293, 2005.
[47] P. Pons and M. Latapy, "Computing Communities in Large Networks Using Random Walks," J. Graph Algorithms and Applications, vol. 10, no. 2, pp. 191-218, 2006.
[48] S. Ross, Stochastic Processes, second ed. Wiley, 1996.
[49] Y. Saad, Iterative Methods for Sparse Linear Systems, second ed. Soc. for Industrial and Applied Math., 2003.
[50] M. Saerens and F. Fouss, "HITS Is Principal Component Analysis," Proc. 2005 IEEE/WIC/ACM Int'l Joint Conf. Web Intelligence, pp. 782-785, 2005.
[51] M. Saerens, F. Fouss, L. Yen, and P. Dupont, "The Principal Components Analysis of a Graph, and Its Relationships to Spectral Clustering," Proc. 15th European Conf. Machine Learning (ECML '04), pp. 371-383, 2004.
[52] J.W. Sammon, "A Nonlinear Mapping for Data Structure Analysis," IEEE Trans. Computers, vol. C-18, no. 5, pp. 401-409, May 1969.
[53] O. Schabenberger and C. Gotway, Statistical Methods for Spatial Data Analysis. Chapman & Hall, 2005.
[54] B. Schölkopf and A. Smola, Learning with Kernels. MIT Press, 2002.
[55] B. Schölkopf, A. Smola, and K. Müller, "Nonlinear Component Analysis as a Kernel Eigenvalue Problem," Neural Computation, vol. 10, no. 5, pp. 1299-1319, 1998.
[56] R. Sedgewick, Algorithms in C. Addison-Wesley, 1990.
[57] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge Univ. Press, 2004.
[58] J. Shi and J. Malik, "Normalized Cuts and Image Segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, Aug. 2000.
[59] D. Spielman and S.-H. Teng, "Nearly-Linear Time Algorithms for Preconditioning and Solving Symmetric, Diagonally Dominant Linear Systems," arXiv, http://arxiv.org/abs/cs/0607105, 2007.
[60] W.J. Stewart, Introduction to the Numerical Solution of Markov Chains. Princeton Univ. Press, 1994.
[61] J.B. Tenenbaum, V. de Silva, and J.C. Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, vol. 290, pp. 2319-2323, 2000.
[62] M. Tenenhaus and F. Young, "An Analysis and Synthesis of Multiple Correspondence Analysis, Optimal Scaling, Dual Scaling, Homogeneity Analysis and Other Methods for Quantifying Categorical Multivariate Data," Psychometrika, vol. 50, no. 1, pp. 91-119, 1985.
[63] M. Thelwall, Link Analysis: An Information Science Approach. Elsevier, 2004.
[64] L. Trefethen and D. Bau, Numerical Linear Algebra. Soc. for Industrial and Applied Math., 1997.
[65] U. von Luxburg, "A Tutorial on Spectral Clustering," Statistics and Computing, vol. 17, no. 4, pp. 395-416, 2007.
[66] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications. Cambridge Univ. Press, 1994.
[67] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, "Graph Embedding and Extensions: A General Framework for Dimensionality Reduction," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 40-51, Jan. 2007.
[68] L. Yen, F. Fouss, C. Decaestecker, P. Francq, and M. Saerens, "Graph Nodes Clustering Based on the Commute-Time Kernel," Proc. 11th Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD '07), 2007.
[69] L. Yen, F. Fouss, C. Decaestecker, P. Francq, and M. Saerens, "Graph Nodes Clustering with the Sigmoid Commute-Time Kernel: A Comparative Study," Data and Knowledge Eng., vol. 68, pp. 338-361, 2009.
[70] D.M. Young and R.T. Gregory, A Survey of Numerical Mathematics. Dover Publications, 1988.
[71] W.W. Zachary, "An Information Flow Model for Conflict and Fission in Small Groups," J. Anthropological Research, vol. 33, pp. 452-473, 1977.
[72] H. Zha, X. He, C.H.Q. Ding, M. Gu, and H.D. Simon, "Bipartite Graph Partitioning and Data Clustering," Proc. ACM 10th Int'l Conf. Information and Knowledge Management (CIKM '01), pp. 25-32, 2001.
[73] X. Zhu and A. Goldberg, Introduction to Semi-Supervised Learning. Morgan & Claypool Publishers, 2009.
[74] J.Y. Zien, M.D. Schlag, and P.K. Chan, "Multilevel Spectral Hypergraph Partitioning with Arbitrary Vertex Sizes," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 18, no. 9, pp. 1389-1399, 1999.

Luh Yen received the MSc degree in electrical engineering from the Université Catholique de Louvain (UCL), Belgium, in 2002. She completed her PhD degree at the ISYS Laboratory and Machine Learning Group of the Université Catholique de Louvain. Her research interests include graph mining, clustering, and multivariate statistical analysis.

Marco Saerens received the BSc degree in physics engineering and the MSc degree in theoretical physics from the Université Libre de Bruxelles (ULB). After graduation, he joined the IRIDIA Laboratory (the Artificial Intelligence Laboratory of the Université Libre de Bruxelles, Belgium) as a research assistant and completed the PhD degree in applied sciences. While remaining a part-time researcher at IRIDIA, he also worked as a senior researcher in the R&D departments of various companies, mainly in the fields of speech recognition, data mining, and artificial intelligence. In 2002, he joined the Université Catholique de Louvain (UCL) as a professor in computer sciences. His main research interests include artificial intelligence, machine learning, pattern recognition, data mining, and speech/language processing. He is a member of the IEEE and the ACM.

François Fouss received the MS degree in information systems and the PhD degree in management science from the Université Catholique de Louvain (UCL), Belgium, in 2002 and 2007, respectively. In 2007, he joined the Facultés Universitaires Catholiques de Mons (FUCaM) as a professor of computing science. His main research areas include data mining, machine learning, graph mining, classification, and collaborative recommendation.

