Themes10 Centrality DoubleCol

Embed Size (px)

Citation preview

  • 8/2/2019 Themes10 Centrality DoubleCol

    1/14

    Principal Component Centrality: A KLT-inspiredTransform for Identifying Inuential Neighborhoods

    in Social Network GraphsMuhammad U. Ilyas* and Hayder Radha

    Department of Electrical and Computer EngineeringMichigan State UniversityEast Lansing, MI 48824

    {ilyasmuh, radha }@egr.msu.edu

    Abstract The measurement of node centrality is a keyarea of research in social network analysis. However,centrality is vaguely dened and there are several measuresin existence. The appropriateness of a particular nodecentrality measure for a particular network is judgedby the combination of type of commodity duplicationand type of ow process in the network. In the contextof social networks, Eigenvector centrality (EVC) is anode centrality measure used to rank nodes accordingto the inuence they exert. In this paper, we introducePrincipal Component Centrality (PCC), a measure of node centrality that is inspired by the Karhunen Lo evetransform (KLT)/Principal Component Analysis (PCA).We demonstrate PCCs ability to discover signicantlymore social hubs acting as inuential neighborhoods inmassive social graphs (with millions of links/nodes) than

    EVCs limited focus on one cluster within such graphs. Weshow PCCs performance by processing a friendship graphobtained from Googles Orkut social networking serviceand a gaming graph obtained from users of FacebooksFighters Club application.

    I. INTRODUCTION

    Centrality [3], [4], [6], [14], [26] is a measure to assessthe criticality of a nodes position. Node centrality as ameasure of a nodes importance by virtue of its centrallocation has been in common use by social scientists inthe study of social networks for decades. Over the years

    several different meanings of centrality have emerged.Naturally, the idea of ranking nodes for their ability tospread or detect (positive or negative) inuence is of signicant interest to social network analysis.

    Among many centrality measures, Eigenvalue Cen-trality (EVC) is arguably the most successful tool for

    This work was supported in part by NSF Award CNS-0721550,NSF Award CCF-0728996, NSF Award CCF-0515253, and an unre-stricted gift from Microsoft Research.

    Local Centrality Maximas

    Graph Plane

    Centrality Plane

    Fig. 1. This gure shows a graph on the lower plane, overlayed

    with another plane of the interpolated surface plot of node centralityscores. The centrality planes typically exhibit a number of peaks orlocal maxima.

    detecting the most inuential node(s) within a socialgraph. Thus, EVC has been a highly popular centralitymeasure in the social sciences ( [15], [29], [3], [13], [11],[12], [30], [28], [5], [4]) (it is often referred to simplyas centrality). As we demonstrate later in this paper, onekey shortcoming of EVC is its focus on (virtually) asingle inuential set of nodes that tend to cluster within

    a single neighborhood. In other words, EVC has thetendency of identifying a set of inuential nodes that areall within the same region of a graph. This shortcomingmay not represent a major issue for many social scienceproblems and Internet applications, such as PageRank,where EVC has been used extensively [19]. Meanwhile,when dealing with massive social network graphs, itis hardly the case that there is a single neighborhoodof inuential nodes; rather, there are usually multiple

  • 8/2/2019 Themes10 Centrality DoubleCol

    2/14

    inuential neighborhoods most of which are not detectedor identied by EVC.

    In order to identify inuential neighborhoods, thereis a need to associate such neighborhoods with someform of an objective measure of centrality that can beevaluated and searched for. To that end, one can think of a centrality plane that is overlaid over the underlyingsocial graph under consideration. This centrality planemay contain multiple centrality score maxima, each of which is centered on an inuential neighborhood.

    Nodes that have centrality score higher than othernodes are located under a centrality peak and are morecentral than any of their neighbors. We use the termsocial hubs to refer to nodes forming centrality maxima.Figure 1 illustrates this concept. Thus, these social hubsform the kernel of inuential neighborhoods in socialnetworks. Hence, our focus in this paper is on identifyinginuential neighborhoods rather than inuential nodes.

    We will show that EVC has a tendency to be toonarrowly focused on a dominating neighborhood. To thisend, we introduce a new measure of centrality that wecall Principal Component Centrality (PCC) that gradu-ally widens the focus of EVC in a controlled manner.More importantly, PCC provides a general framework for transforming social graphs into a spectral spaceanalogous to popular signal transforms that operate onrandom signals.

    In this paper, we give a brief review of common cen-trality measures accompanied by a critique with regardto their scope of application to social networks. Wethen present Principal Component Centrality (PCC), anode centrality measure that is inspired by the KarhunenLoeve transform (KLT) and Principal Component Anal-ysis (PCA). In essence, PCC is a general transformof social graphs that can provide vital insight into thecentrality and related characteristics of such graphs.Similar to the KLT of a signal, the proposed PCC of a social graph gives a form of compact representationthat identies inuential nodes and more importantlyinuential neighborhoods . Hence, PCC provides an ele-gant social graph transform framework that outperforms

    EVC as we show in this paper. In particular, early inthis paper, we demonstrate EVCs shortcoming by usingboth EVC and PCC to compute node centralities in anetwork small enough to allow meaningful illustration.This is followed by a thorough description of PCC,and its utility in transforming massive real-world socialnetwork graphs. We also develop the equivalence of an inverse PCC transform that attempts to reconstructa representation of the original social graph from its

    inuential neighborhoods.The rest of this paper is organized as follows. Sec-

    tion II gives a background review of existing centralitymeasures for graphs, highlights problems in EVC andmotivates our development of a new node centrality.Section III introduces PCC as a new measure of central-ity. It describes in detail the advantages, mathematicalinterpretation, visualization and the effect of varyingnumber of features of PCC. Section IV applies PCC totwo real-world social graphs. The rst is an undirected,unweighted friendship graph from Googles Orkut socialnetworking service. Orkut currently has approximately440, 000 active subscribers [24] and has large subscriberbases in Brazil and India. The data set available to us [21]consists of 70, 000 users connected by 2, 971, 776 links.The second is a weighted, undirected gaming graph of matches between users of Facebooks Fighters Clubapplication. This data set was originally collected by

    Nazir in [22]. It consists of 667, 560 recorded matchesbetween 143, 020 users making for a graph with 526, 224edges between users. Section V concludes the paper.

    II . B ACKGROUND

    Let A denote the adjacency matrix of a graph G (V, E )consisting of the set of nodes V = {v1, v2 , v3, . . . , v N }of size N and set of undirected edges E . When a link is present between two nodes vi and v j both A i,j andA j,i are set equal to 1 and set to 0 otherwise. Let (vi )denote the neighborhood of vi , the set of nodes vi isconnected to directly.

    A. Degree Centrality

    The degree centrality of a node in a graph is a measureof the relative importance to the graphs connectivity.The degree centrality of a node is dened as the numberof edges incident on it. Nodes with more incident edgeshave higher degree centrality than nodes with fewerincident edges. If di denotes the degree of node vi thenits degree centrality is computed by:

    CD (vi ) =di

    N 1 (1)

    Degree centrality is a measure of a nodes rate of dissemination (of an infection) in the immediate shortterm. It has the advantage that its computation does notrequire nodes to exchange information. However, it hastwo signicant disadvantages;

    1) Without an exchange of centrality information withother nodes, it is not possible to interpret and

  • 8/2/2019 Themes10 Centrality DoubleCol

    3/14

    evaluate an individual nodes centrality relative tothat of others.

    2) Degree centrality does not take into account thecentrality of its neighbors.

    B. Closeness Centrality

    The closeness centrality of a node is dened asthe mean length of geodesic paths to all other nodes.Intuitively, nodes occupying a more central locationwithin the graph are expected to have shorter paths.Closeness centrality is a measure of the rate at whicha node can spread an infection to all reachable nodes.Closeness is a suitable measure of centrality when theow of commodity in the network follows geodesicpaths. Closeness centrality is a good measure of theaverage detection time in a network with ows of non-replicating commodity following geodesic paths.

    C. Betweenness CentralityThe betweenness centrality of a node is dened as

    the fraction of geodesic paths (shortest paths) out of all geodesic paths between all pairs of nodes passingthrough that node. Thus, nodes located on more geodesicpaths have a higher betweenness centrality than nodeslocated on fewer geodesic paths. Intuitively, since thesubproblem optimality principal holds for the shortestpath problem, a nodes location on a geodesic pathimplies close proximity to all other nodes on that path.A nodes betweenness can be interpreted as a measure

    of disruption caused when the node is removed from thenetwork. Like closeness, betweenness too assumes thatthe ow of commodity is along geodesics. Betweennesscentrality is a good measure of the average probabilityof detection of ows in a network with non-replicatingcommodity following geodesic paths.

    D. Eigenvector Centrality

    Eigenvector centrality (EVC) is a relative score recur-sively dened as a function of the number and strengthof connections to its neighbors and as well as those

    neighbors centralities. Let x (i) be the EVC score of a node vi . Then,

    x (i) =1

    j( v i )

    x ( j )

    =1

    N

    j =1A i,j x ( j )

    (2)

    0 0.5 1 1.5 20

    0.2

    0.4

    0.6

    0.8

    1

    1.2

    1.4

    1.6

    1.8

    2

    x

    y

    0

    0.05

    0.1

    0.15

    0.2

    0.25

    0.3

    Fig. 2. A spatial graph of 200 nodes. Node colors are indicative of the range in which their EVC falls.

    Here is a constant. Equation 2 can berewritten in vector form equation 3 wherex = {x (1) , x (2) , x (3) , . . . , x (N )} is the vector of EVC scores of all nodes.

    x =1

    Ax

    x = Ax (3)

    This is the well known eigenvector equation wherethis centrality takes its name from. is an eigenvalueand x is the corresponding eigenvector of matrix A.Obviously several eigenvalue/eigenvector pairs exist foran adjacency matrix A. The EVC of nodes are de-ned on the basis of the Perron eigenvalue A (thePerron eigenvalue is the largest of all eigenvalues of A and is also called the principal eigenvalue). If is any other eigenvalue of A then A > | | . Theeigenvector x = {x (1) , x (2) , . . . , x (N )} correspondingto the Perron eigenvalue is the Perron eigenvector orprincipal eigenvector. Thus the EVC of a node vi is thecorresponding element x (i) of the Perron eigenvector x.

    Note that when the adjacency matrix A is symmetricall elements of the principal eigenvector x are positive.As mentioned above, EVC is widely used in the socialsciences ( [20], [30], [3], [14], [12], [13], [31], [29], [5],[4]) and is often referred to simply as centrality.

    EVC does not suffer from the same problems asdegree, closeness and betweenness centralities. In com-puting a nodes EVC it takes into consideration its neigh-borss EVC scores. Because of its recursive denition,

  • 8/2/2019 Themes10 Centrality DoubleCol

    4/14

    EVC is suited to measure nodes power to inuenceother nodes in the network both directly and indirectlythrough its neighbors. Connections to neighbors thatare in turn well connected themselves are rated higherthan connections to neighbors that are weakly connected.Like closeness and betweenness, the EVC of a nodeprovides a network-wide perspective. At the same time itcan take advantage of distributed methods of computingeigenvectors/eigenvalues of a matrix but does not haveto bear the overhead of excess network trafc. Sankar-alingam [27], Kohlsch utter [18] and Canright, Eng-Monsen and Jelasity [7], Bischof [2], Bai [1] and Tisseur[28] proposed some parallel algorithms for computingeigenvectors and eigenvalues of adjacency matrices.

    E. The Need for a New Centrality Measure

    In the preceding sections we highlighted some of the key characteristics of the most common measures

    of centrality. Our discussion left us with only one vi-able measure of centrality that takes into considerationthe centrality scores of a nodes neighbors and whichprovides a network-wide perspective, i.e. eigenvectorcentrality. EVC has been used extensively to great ef-fect in the study and analysis of a wide variety of networks that are shown to exhibit small-world andscale-free properties. In [8] Canright and Eng-Monsencorrelated EVC with the instantaneous rate of spreadof contagion on a Gnutella network peer-to-peer graph,a social network of students in Oslo, a collaborationgraph of researchers at Telenor R&D and a snapshot of a collaboration graph of the Santa Fe Institute. In [23]Newman analyzed the use of EVC in a lexical network of co-occuring words in Reuters newswire stories. In[9] Carreras et al. used EVC to study the spread of epidemics in mobile networks. They used three sets of traces collected by Intel Cambridge, a trace of the publictransportation network from the DieselNet project at theUniversity of Massachusetts at Amherst and mobility andinteraction traces from MITs Reality Mining project.

    Now consider the graph in gure 2. It consists of 200nodes and is typical of the kinds of peer-to-peer networks

    formed by social interactions in networks such as thecellphone based Nokia SensorPlanet project ( [25], [17]).Its nodes are assigned one of six colors from the adjacentcolor palette. Each of the six colors represents one of six bins of a histogram spanning, in uniform step sizes,the range from the smallest to the largest EVCs. As thelegend accompanying gure 2 shows, blue represents thelowest EVCs and red the highest. We make the followingobservations:

    10 5 0 5 10 15 20 25 300

    20

    40

    60

    i

    F r e q u e n c y

    Adjacency MatrixLaplacian MatrixAveraged Adjacency MatrixAveraged Laplacian Matrix

    0 50 100 150 2000

    0.5

    1

    P

    i = 1

    P

    | i |

    /

    i = 1

    N

    | i |

    Adjacency MatrixLaplacian MatrixAveraged Adjacency MatrixAveraged Laplacian Matrix

    Fig. 3. [Top] Histogram of Eigenvalues of adjacency matrix andLaplacian matrix A of network in gure 2; [Bottom] Cumulative sumof the sequence of eigenvalues of adjacency matrix and Laplacianmatrix of network in gure 2 when sorted in descending order of magnitudes. In both gures the lines plotted in red color are averagesof 50 networks generated randomly with the same parameters.

    1) EVCs are tightly clustered around a very smallregion with respect to the total size of the network and drops off sharply as one moves away from thenode of peak EVC.

    2) EVC is unable to provide much centrality informa-tion for the vast majority of nodes in the network.

    3) The position of the peak EVC node appears some-what arbitrary because a visual inspection showsthat almost equally signicant clusters of nodescan be visually spotted in other locations in thegraph. Counter to intuition, the high EVC clusteris connected to the rest of the network by a singlelink.

    III . P RINCIPAL COMPONENT CENTRALITY

    The EVC of a node is recursively dened as a mea-sure of centrality that is proportional to the number of neighbors of a node and their respective EVCs. As wesaw in section II-D, the mathematical expression for

    the vector of node EVCs is equivalent to the principaleigenvector. Our motivation for Principal ComponentCentrality (PCC) as a new measure of node centralitymay be understood by looking at EVC through the lensof the Karhunen Lo` eve Transform (KLT). When theKLT is derived from an N N covariance matrix of N random variables, the principal eigenvector is themost dominant feature vector, i.e. the direction in N -dimensional hyperspace along which the spread of data

  • 8/2/2019 Themes10 Centrality DoubleCol

    5/14

    0 1 20

    1

    2

    x

    y

    ~A1

    0 1 20

    1

    2

    x

    y

    ~A2

    0 1 20

    1

    2

    x

    y

    ~A3

    0 1 20

    1

    2

    x

    y

    ~A5

    0 1 20

    1

    2

    x

    y

    ~A10

    0 1 20

    1

    2

    x

    y

    ~A15

    0 1 20

    1

    2

    x

    y

    ~A50

    0 1 20

    1

    2

    x

    y

    ~A200

    Fig. 4. Reconstructed topologies of the graph from gure 2 using only the rst 1, 2, 3, 5, 10, 15, 50 and all 200 eigenvectors.

    points is maximized. Similarly, the second eigenvec-tor (corresponding to the second largest eigenvalue) isrepresentative of the second most signicant feature of the data set. It may also be thought of as the mostsignicant feature after the data points are collapsedalong the direction of the principal eigenvector. When

    the covariance matrix is computed empirically from a setof data points, the eigendecomposition is the well knownPrincipal Component Analysis (PCA) [11]. Since we areoperating on the adjacency matrix derived from graphdata we call the node centrality proposed in this paperPrincipal Component Centrality (PCC). In a covariancematrix, a non-zero entry with a large magnitude atpositions (i, j ) and ( j,i ) is representative of a strongrelationship between the i-th and j -th random variables.A non-zero entry in the adjacency matrix representinga link from one node to another is, in a broad sense,also an indication of a relationship between the twonodes. Based on this understanding we draw an analogybetween graph adjacency matrix and covariance matrix.

    In the preceding section we described various central-ity measures from literature. Among them, EVC is thenode centrality most often used in the study of socialnetworks and other networks with small-world proper-ties. While EVC assigns centrality to nodes accordingto the strength of the most dominant feature of the data

    set, PCC takes into consideration additional, subsequentfeatures. We dene the PCC of a node in a graph as theEuclidean distance/ 2 norm of a node from the origin inthe P -dimensional eigenspace formed by the P mostsignicant eigenvectors. For a graph consisting of asingle connected component, the N eigenvalues | 1 |

    | 2| . . . | N | = 0 correspond to the normalizedeigenvectors x1 , x2, . . . , xN . The eigenvector/eigenvaluepairs are indexed in order of descending magnitudeof eigenvalues. When P = 1 , PCC equals a scaledversion of EVC. Unlike other measures of centrality, theparameter P in PCC can be used as a tuning parameter toadjust the number of eigenvectors included in the PCC.The question of selection of an appropriate value of P will be addressed in subsequent subsection III-D. Let Xdenote the N N matrix of concatenated eigenvectorsX = [x1x2 . . . xN ] and let = [ 1 2 . . . N ] be thevector of eigenvalues. Furthermore, if P < N and if matrix X has dimensions N N , then XN P will denotethe submatrix of X consisting of the rst N rows andrst P columns. Then PCC can be expressed in matrixform as:

    C P = ((AXN P ) (AX N P )) 1P 1 (4)The operator is the Hadamard (or entrywise

  • 8/2/2019 Themes10 Centrality DoubleCol

    6/14

    product or Schur product) operator. Equation 4 can alsobe written in terms of the eigenvalue and eigenvectormatrices and X, of the adjacency matrix A:

    C P = (XN P XN P ) (P 1 P 1). (5)It is important to note a major difference between a

    traditional signal transform under KLT as comparedwith the proposed PCC graph transform. First, recallthat, under KLT, a transform matrix T is derived froma covariance matrix C ; and then the eigenvector-basedtransform T is applied on any realization of the randomsignal that has covariance C . Meanwhile, under theproposed PCC, the adjacency matrix A plays a dual role:at one hand, it plays the role of the covariance matrixof the KLT; and on the other hand, one can think of Aas being the signal that is represented compactly by

    the PCC vector C P . Effectively, the adjacency matrixA represents the social graph (i.e., signal) that we areinterested in analyzing; and at the same time A is used toderive the eigendecomposition; and hence, we have thedual role for A. Later, we will develop the equivalenceof an inverse PCC, and we will see this dual role of theadjacency matrix A again.

    A. Interpretation of Eigenvalues

    The denition of PCC is based on the graph adjacencymatrix A. For a matrix A of size N N its eigenvectorsxi for 1 i N are interpreted as N -dimensionalfeatures (feature vectors) of the set of N -dimensionaldata points represented by their covariance (adjacency)matrix A. The magnitude of an eigenvalue correspondingto an eigenvector provides a measure of the importanceand prominence of the feature represented by it. Theeigenvalue i is the power of the corresponding featurexi in A.

    An alternative representation of a graphs topologyis the graph Laplacian matrix which is frequently usedin spectral graph theory [10]. The graph Laplacian canbe obtained from the adjacency matrix by setting the

    diagonal entries of the adjacency matrix to A i,i = N j =1; i= j A i,j , i.e. a diagonal entry in a Laplacian ma-trix is the negative of the sum of all off-diagonal entriesin the same row in the adjacency matrix. This denitionapplies equally to weighted and unweighted graphs. Thegraph Laplacian is always positive-semidenite whichmeans all of its eigenvalues are non-negative with atleast one eigenvalue equal to 0. The adjacency matrix,however, does not guarantee positive semideniteness

    and typically has several negative eigenvalues. This is thereason the ordering of features is based on magnitudesof eigenvalues. The bar chart at the top of gure 3plots histograms of eigenvalues for both adjacency andLaplacian matrices of the network in gure 2. But whythen, did we not use the Laplacian matrix in the rstplace? The reason is that the eigendecomposition of the adjacency matrix yields greater energy compactionthan that of the Laplacian. The middle plot in gure3 shows the normalized, cumulative function of thesorted sequence of eigenvalue powers. The line for theeigenvalue derived from the adjacency matrix rises fasterthan that of the Laplacian matrix. The adjacency matrixcurve indicates that 25%, 50% and 75% of total power iscaptured by the rst 15 (7.5%), 44 (22%) and 89 (44.5%)features, respectively. In contrast, the Laplacian matrixeigendecomposition shows that the same power levelsare contained in its rst 26 (13%), 61 (30.5%) and 103

    (51.5%) features, respectively. Thus eigendecompositionof the adjacency matrix of social graphs offers moreenergy compaction, i.e. a set of features of the adjacencymatrix captures more energy than the same number of features of the corresponding Laplacian matrix.

    B. Interpretation of Eigenvectors

    EVC interprets the elements of the Perron-eigenvectorx1 of adjacency matrix A as measures of correspondingnodes centralities in the network topology (see sectionII-D). Research on scale-free network topologies hasdemonstrated EVCs usefulness. However, when appliedto large spatial graphs of uniformly, randomly deployednodes such as the one in gure 2, EVC fails to assignsignicant scores to a large fraction of nodes. For abroader understanding that encompasses all eigenvectorswe revert to the interpretation of eigenvectors as features.One way of understanding PCC is in terms of PCA[11], where PCC takes part of its name from. PCA ndsthe eigenvectors x1, x2, x3, . . . , xN and eigenvalues of G s adjacency matrix A. Every eigenvector representsa feature of the adjacency matrix. To understand howthese feature vectors are to be interpreted in graphical

    terms, refer to equation 6 which uses eigenvectors andeigenvalues to reconstruct an approximation AP of theadjacency matrix A. Reconstruction can be performed tovarying degrees of accuracy depending on P , the numberof features/ eigenvectors used. If we set P = N inequation 6 (all eigenvectors/eigenvalues are used), theadjacency matrix can be reconstructed without losses(see He [15]). Here, denotes the diagonal matrix of eigenvalues sorted in descending order of magnitude

  • 8/2/2019 Themes10 Centrality DoubleCol

    7/14

  • 8/2/2019 Themes10 Centrality DoubleCol

    8/14

    0 1 20

    0.5

    1

    1.5

    2

    x

    C1(i)

    y

    0 2 40

    50

    100

    150

    200

    C1(i)

    F r e q u e n c y

    (a)

    0 1 20

    0.5

    1

    1.5

    2

    x

    C2(i)

    y

    0 2 40

    50

    100

    150

    C2(i)

    F r e q u e n c y

    (b)

    0 1 20

    0.5

    1

    1.5

    2

    x

    C3(i)

    y

    0 2 40

    50

    100

    150

    C3(i)

    F r e q u e n c y

    (c)

    0 1 20

    0.5

    1

    1.5

    2

    x

    C5(i)

    y

    0 2 40

    50

    100

    C5(i)

    F r e q u e n c y

    (d)

    0 1 20

    0.5

    1

    1.5

    2

    x

    C10 (i)

    y

    0 2 40

    10

    20

    30

    40

    C10 (i)

    F r e q u e n c y

    (e)

    0 1 20

    0.5

    1

    1.5

    2

    x

    C15 (i)

    y

    0 2 40

    10

    20

    30

    C15 (i)

    F r e q u e n c y

    (f)

    0 1 20

    0.5

    1

    1.5

    2

    x

    C50 (i)

    y

    0 2 40

    20

    40

    60

    C50 (i)

    F r e q u e n c y

    (g)

    0 1 20

    0.5

    1

    1.5

    2

    x

    C200 (i)

    y

    0 2 40

    20

    40

    60

    C200 (i)

    F r e q u e n c y

    (h)

    Fig. 6. PCC of nodes in network of gure 2 when computed using rst 1, 2, 3, 5, 10, 15, 50 and all 200 eigenvectors. The histogramsaccompanying each graph plot show the distribution of PCC of their nodes. The red lineplot in the histogram represents the average PCChistograms of 50 randomly generated networks with the same parameters as the network in gure 2.

    PCC vectors and EVC to study the effect of adding morefeatures. We compute the phase angle (n ) of a PCCvector using n features with the EVC vector as,

    (P ) = arccosCP

    |CP |

    CE |C E |

    .(7)

    Here, denotes the inner product operator. Therelationship of the phase angle with the number of features used in PCC for the network under considerationis plotted in gure 7. Initially, the function of phase angle rises sharply and then levels off almost completely at22 features. This means that, in this example, the relativePCCs of nodes cease to change with the addition of more

  • 8/2/2019 Themes10 Centrality DoubleCol

    9/14

    0 50 100 150 2000

    0.5

    1

    P # of eigenvectors

    ( r a

    d )

    Fig. 7. Plot of phase angles (in radians) of PCC vectors with theEVC vector for the graph in gures 6.

    features beyond the rst 22 features. The phase angleplot may be used for determining how many featuresare sufcient for the computation of PCC of a network.

    IV. A PPLICATION OF PCC TO SOCIAL N ETWORKS

    In this section we apply PCC to a large scale dataset obtained from the friend network of the Orkut socialnetworking service [16]. We apply PCC to identify socialhubs in the network. We motivate the search for localcentrality maxima by the following application. Supposewe wish to deploy a limited number of monitors ina social graph to spot the emergence and adoption of as many trends as possible. The spread of trends insocial networks is modeled as an inuence process. Thedegree to which a node is inuential in the spread of atrend in the long term is measured by its EVC, whichwas explained in detail in section II-D. Recall from theearlier example network (on which we studied PCC indetail) that EVC and PCC vary gradually along a path.We dene a social hub in a network as a node whosecentrality score forms a local maxima, i.e. its centralityscore is higher than all of its neighbors (see gure 1 foran illustration of social hubs and the corresponding localmaxima). The number of local maxima identied is usedas a performance metric. The results obtained by usingEVC provides the baseline for comparison. On a walk over a graph, the EVCs of nodes change gradually, i.e.a nodes EVC is high in part because its neighbors EVC

    is also high. This means, if we were to pick nodes forplacement of monitors in descending order of EVC, quitea few will end up monitoring the same well connectedcluster of nodes. This will introduce redundancy inthe monitoring at the cost of coverage. With a limitednumber of monitors available it would be more desirousto position them in different vicinities of the graph. If,on the other hand, a node is selected only if its centralitymeasure is signicant and locally maximum we can

    avoid redundancy and in effect, observe a greater numberof nodes with the same number of monitors. This iswhy our focus is on identifying inuential neighborhoodsrather than inuential nodes. To this end we demonstrateapplication of PCC to two social network data sets, onederived from Googles Orkut social network service andthe other from Facebooks Fighters Club application.

    A. Orkut Data Set

    The rst data set is a friendship graph of 70, 000users of Googles Orkut social network service. Thisdata set was originally collected by Mislove [21] andconstitutes an unweighted, undirected graph. The dataset is a social graph obtained from subscriber friendslists of Googles Orkut social network service [21]. Thedata set consists of 70, 000 user nodes with 2, 971, 776undirected links between them and was processed andanalyzed on Matlab 7.4 (R2007a) on a Dell PowerEdge

    server with an Intel QuadCore Xeon 2.13GHz processorand 4GB RAM.Our objective for applying PCC on this data set is

    the discovery of more social hubs than are identied byEVC. Figure 8a plots the EVC scores of all 70, 000 nodesin the data set. It shows that node 692 in the network has the highest EVC, followed by a cluster of nodes withnode IDs centered around 43, 000. Like in the exampleillustrated earlier the remaining majority of nodes isassigned centrality scores close to 0. The histogram of node EVCs in gure 8b conrms this. Earlier, for thesample network in section III we determined the numberof features P to use for the computation of PCC as thenumber of eigenvalues after which the rate of growthof their cumulative sum begins to decline signicantly.An alternative approach which yielded a clearer cutoff point for the selection of the number of features to usein PCC was the plot of the rate of change in the phaseangle of PCC vectors with the EVC vector (see gure7). Figure 8c plots the phase angles of PCC whilevarying the number of features from 1 through 100. Weselect P = 14 as a cut-off point for the computation of PCC (marked in red). Figure 8e plots PCC vector of all

    nodes in the Orkut graph. When we compare it to theplot of the EVC vector what stands out immediately ishow a lot of nodes with near-zero EVCs are assignedhigher and highly varying PCC scores. Figure 8f is thehistogram of PCC scores which, at a standard deviationof 6839.7, is more spread out than that of the EVC witha standard deviation of 4382.7.

    Figure 8d plots the number of local maxima that arefound in the graph for values of P . At rst glance it

  • 8/2/2019 Themes10 Centrality DoubleCol

    10/14

    1 2 3 4 5 6 7x 10

    4

    0

    0.5

    1

    Node ID

    C E

    (a)

    0 0.2 0.4 0.6 0.8 10

    5

    x 104

    CE

    # o

    f n o

    d e s

    (b)

    10 20 30 40 50 60 70 80 90 1000

    0.5

    1

    P # of eigenvectors

    ( r a

    d )

    14

    (c)

    10 20 30 40 50 60 70 80 90 100

    85

    90

    95

    P # of eigenvectors

    # o

    f m a x i m a s

    (d)

    1 2 3 4 5 6 7

    x 104

    0.20.40.60.8

    Node ID

    C 1 4

    (e)

    0 0.2 0.4 0.6 0.8 10

    5

    x 104

    C14

    # o

    f n o

    d e s

    (f)

    0 0.2 0.4 0.6 0.8 10

    10

    20

    CE

    # o

    f n o

    d e s

    (g)

    0 0.2 0.4 0.6 0.8 10

    10

    20

    C14

    # o

    f n o

    d e s

    (h)

    10 20 30 40 50 60 70 80 90 1000

    1

    2

    3

    P # of features

    | S P

    V

    ( C P

    [ | S

    P | ] ) |

    (i)

    0 2040 60

    80 100

    0

    50

    10080

    85

    90

    95

    P1P2

    | S P 1

    S P 2

    |

    (j)

    Fig. 8. Orkut data set: a) EVC scores of nodes, b) A histogram of EVC scores, c) Phase angles of PCC vectors C 1 through C 100 withEVC vector C E , d) Number of local maxima discovered using PCCs of varying number of features, e) PCC scores of nodes using 14features, f) A histogram of PCC scores of nodes using 14 features, g) A histogram of EVCs of the 20 social hubs with the highest EVCs,

    and h) A histogram of PCCs of the 20 social hubs with the highest PCCs based on 14 features, i) The size of the intersection set of S P and V (C P [|S P |]) for 1 P 100, and j) The size of the intersection set of S P 1 and S P 2 when 1 P 1, P 2 100.

    might appear that EVC found 84, while PCC improvesthis number to 91 when using just 14 out of 70, 000possible features. In the plotted range a maximum of 95maxima are identied when 61 features are used. We ex-amine how many of the social hubs identied are in fact

    trivial maxima with low centrality. This is accomplishedby viewing the centrality scores of nodes identiedas social hubs. Let S n denote the set of social hubsidentied by using PCC with n features/eigenvectors.Figure 8g is the histogram of EVC histogram of the

  • 8/2/2019 Themes10 Centrality DoubleCol

    11/14

    top 20 EVC scoring nodes in S 1 (set of social hubsidentied using only EVC/PCC with 1 feature). Onlyone node (node number 692, the node with the highestEVC of all 70, 000 nodes in gure 8b) truly stands outwith a high EVC, whereas the other 19 social hubsEVC scores lie in the lowest bin of the histogram.This behavior is consistent with our observations in theillustrated example in gure 2 where EVC was onlyable to identify a single social hub. Thus, after thedenition of local maxima excludes nodes surroundingthe most central node, EVC fails to identify any otherinuential neighborhood. One might wonder if, amongthe more than 2000 nodes with pronounced EVC scoresin gure 8b there is not a single node besides node692 that might be a local maxima. We veried fromthe data set that node 692 has 2185 neighbors, most of which have node IDs in the range between 42134 and44314 (clearly visible in gure 8b). In contrast, gure

    8h is a histogram of the PCC scores of the 20 nodeswith highest PCC scores in S 14 (the set of social hubsidentied using C14 , PCC with 14 features). Here, atotal of 8 social hubs have non-trivial PCC scores, theremaining 12 social hubs have PCC scores too low tobe considered signicant. The IDs of nodes identied associal hubs of inuential neighborhoods in descendingorder of PCCs are 692, 317, 4749, 487, 39, 14857,35348 and 12219. This is a substantial improvement overthe single neighborhood identied using EVC. Thus,using PCC in conjunction with a node selection criteriaprovided by the denition of local maxima identiesmany more inuential neighborhoods in a social network than is possible by using EVC.

    We also raise the question of how different the setof nodes identied as social hubs is from the nodes wewould have identied as central were we to rely solelyon nodes centrality scores. This raises the question of how different S P , the set of nodes identied as socialhubs based on CP , is from V (C P [|S P |]), the vertex set(returned by the function V () ) of the rst |S P | nodesranked in descending order of CP . Figure 8i plots thesize of the intersection set |S P V (C P [|S P |])| . The

    data point at P = 1 is 1 and is the number of nodescommon in the set of social hubs identied by EVCand those identied by node EVC scores alone. For therange 1 P 100 at most 3 nodes identied as socialhubs are also present in the set of the rst |S P | mostcentral nodes, i.e. placing monitors at nodes based solelyon their centrality scores produces a lot of redundantcoverage.

    We compute the sizes of intersection sets of all pairs

    of S P 1 and S P 2 for 1 P 1, P 2 100. This is plotted ingure 8j. It shows that as the number of features P 1 usedto compute CP 1 is increased from CP 2, the set of socialhubs S P 2 identied by it is (almost) always a supersetof the set S P 1 if P 1 < P 2. Thus the inclusion of morefeature vectors adds members to the set of social hubswithout removing previous ones.

    B. Facebook Fighters Club Data Set

    The second data set is derived from a list of matchesplayed between users of Facebooks Fighters Clubapplication. This data set contains 143, 020 users and667, 560 matches was originally collected by Nazir in[22]. It differs from the Orkut friendship graph in thatit is a weighted, undirected graph. The weights of linksbetween two user nodes represent the number of inter-actions/matches played against each other. Each vertexin the weighted gaming graph represents a user of the

    application with 526, 224 edges between them. Thus,weight of an edge between two users is the number of matches recorded between them in the data set. Edgeweights in this data set range from 1 to 29.

    Figure 9a plots the EVC scores of all 143, 020 nodesin the data set. Unlike in the preceding Orkut data set,there are only very few nodes that are assigned EVCssignicantly greater than 0. The group of nodes in thenode ID space above 130, 000 are the only ones withhigh EVCs, while almost all other nodes have near-zero EVCs. Like in the example illustrated earlier, theremaining majority of nodes is assigned centrality scoresclose to 0. The histogram of node EVCs in gure 9bconrms this. Figure 9c plots the phase angles of the PCC vector while varying the number of featuresfrom 1 through 100. Figure 9d plots the number of localmaximas that are found in the gaming graph for values of 1 P 100. EVC nds approximately 1.23 105 whilePCC increases this number slightly to 1.232 105 whenusing just 20 out of 143, 020 possible features. The phaseangle attains a stable value around P = 10 features(marked by red line) and so we will use P = 10 for thecomputation of PCC, i.e. C10 . Figure 9e plots PCCs of

    all nodes in the graph with 10 features and gure 9f istheir histogram.

    This raises the question of how different S P , the setof nodes classied as social hubs based on CP , is fromV (C P [|S P |]), the vertex set (returned by the functionV () ) of the rst |S P | nodes ranked in descending order of CP . Figure 9i plots the size of the intersection set |S P V (C P [|S P |])| . The data point at P = 1 is approximately1.23 105 and is the number of nodes common in the set

  • 8/2/2019 Themes10 Centrality DoubleCol

    12/14

    2 4 6 8 10 12 14x 10

    4

    0

    0.5

    1

    Node ID

    C E

    (a)

    0 0.2 0.4 0.6 0.8 10

    5

    10

    x 104

    CE

    # o

    f n o

    d e s

    (b)

    0 20 40 60 80 1000

    0.5

    1

    P # of eigenvectors

    ( r a

    d )

    10

    (c)

    10 20 30 40 50 60 70 80 90 1001.228

    1.23

    1.232x 10

    5

    P # of eigenvectors

    # o

    f m a x

    i m a s

    (d)

    2 4 6 8 10 12 14x 10

    4

    0.20.40.60.8

    Node ID

    C 1 0

    (e)

    0 0.2 0.4 0.6 0.8 10

    1

    2 x 105

    C10

    # o

    f n o

    d e s

    (f)

    0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.90

    100

    CE

    # o

    f n o

    d e s

    (g)

    0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.90

    10

    20

    C10

    # o

    f n o

    d e s

    (h)

    0 10 20 30 40 50 60 70 80 90 1001.19

    1.195

    1.2

    1.205

    1.21

    1.215

    1.22

    1.225

    1.23

    1.235 x 105

    P # of features

    | S P

    V

    ( C P

    [ | S

    P | ] ) |

    (i)

    0 2040 60

    80 100

    0

    50

    1001.15

    1.21.25 x 10

    5

    P1

    P2

    | S P 1

    S

    P 2

    |

    (j)

    Fig. 9. Facebook Fighters Club application data set: a) EVC scores of nodes, b) A histogram of EVC scores, c) Phase angles of PCCvectors C 1 through C 100 with EVC vector C E , d) Number of local maxima discovered using PCCs of varying number of features, e) PCCscores of nodes using 10 features, f) A histogram of PCC scores of nodes using 10 features, g) A histogram of EVCs of the 200 social hubswith the highest EVCs, and h) A histogram of PCCs of the 200 social hubs with the highest PCCs based on 10 features, i) The size of theintersection set of S P and V (C P [|S P | ]) for 1 P 100, and j) The size of the intersection set of S P 1 and S P 2 when 1 P 1, P 2 100.

    of social hubs identied by EVC and those identied bynode EVC scores alone. As we proceed on the horizontalaxis the number of features used to compute PCC isincreased. Each data point is the number of nodes in theintersection of the set of social hubs by P -feature PCC

    (C P ) with the set of most central nodes nodes by PCCCP of equal size. As P increases in the range 1 P 100 the size of the intersection set rapidly drops at rstand then climbs back close to the starting value at 65features.

  • 8/2/2019 Themes10 Centrality DoubleCol

    13/14

    We compute the sizes of intersection sets of all pairsof S P 1 and S P 2 for 1 P 1, P 2 100. This is plottedin gure 9j. It shows that as the number of features P 1used to compute CP 1 is increased from C P 2, the setof social hubs S P 2 identied by it is (almost) always asuperset of the set S P 1 if P 1 < P 2. Thus the inclusionof more feature vectors retains a large fraction of socialhubs identied using fewer features.

    V. C ONCLUSIONS

    We reviewed previously dened measures of centralityand pointed out their shortcomings in general and EVCin particular. We introduced PCC, a new measure of nodecentrality. PCC is based on PCA and the KLT whichtakes the view of treating a graphs adjacency matrix asa covariance matrix. PCC interprets a nodes centralityas its 2 norm from the origin in the eigenspace formedby the P most signicant feature vectors (eigenvectors)

    of the adjacency matrix. Unlike EVC, PCC allows theaddition of more features for the computation of nodecentralities. We explore two criteria for the selection of the number of features to use for PCC; a) The relativecontribution of each features power (eigenvalue) to thetotal power of adjacency matrix and b) Incrementalchanges in the phase angle of the PCC with P featuresand the EVC as P is increased. We also provide a visualinterpretation of signicant eigenvectors of an adjacencymatrix. The use of the adjacency matrix is compared withthat of the Laplacian and it is shown that eigendecompo-sition of the adjacency yields signicantly higher degreeof energy compaction than does the Laplacian at thesame number of features. We also investigated the effectof adding successive eigenvectors and the informationthey contain by looking at reconstructions of the originalgraphs topology using a subset of features.

    We applied PCC analysis to Googles Orkut social net-working service. Our objective was the identication of social hubs in social networks that are left undiscoveredby EVC. In the case of the Orkut graph we saw thatusing 14 most signicant eigenvectors out of a possible70, 000 raises the number of inuential neighborhoods

    discovered from just 1 (that around the most centralnode) to 8 (including the one identied by EVC). Theincrease in the number of social hubs found using PCC iseven greater. The top 200 social hubs found using PCCall have normalized PCC greater than 0.1. The socialhubs found using EVC however contain only 13 socialhubs with normalized EVC greater than 0.1.

    The Orkut friendship graph we used was unweightedand undirected, while the Facebook application graph

    was an undirected and weighted graph. However, inorder to ensure that eigenvalues and eigenvectors remainreal the graphs must be undirected. We compared thesets of nodes identied as social hubs with the set of highest scoring nodes by centrality alone and saw thatmany of the nodes with high centralities belong to thesame neighborhood. Thus, the notion of a local maximaserves its purpose of removing neighbors of highlycentral nodes. A comparison of social hubs discovered byPCC with different numbers of features showed that theaddition of more features in PCC adds new social hubsto the list of identied hubs without replacing previouslyidentied ones.

    In the future we intend to extend the denition of PCC so it can be applied to both directed and undi-rected graphs. Furthermore, we propose to formulate adistributed method for computing PCC along the linesof Canrights method [7] for computing EVC in peer-to-

    peer systems.R EFERENCES

    [1] Z. Bai, J. Demmel, J. Dongarra, A. Petitet, H. Robinson,and K. Stanley, The spectral decomposition of nonsymmetricmatrices on distributed memory parallel computers, SIAM Journal on Scientic Computing , vol. 18, pp. 14461461, 1997.

    [2] C. Bischof and I. for Defense Analyses, Parallel Performanceof a Symmetric Eigensolver Based on the Invariant Subspace Decomposition Approach . Supercomputing Research Center:IDA, 1994.

    [3] P. Bonacich, Power and centrality: A family of measures, American Journal of Sociology , vol. 92, pp. 11701182, 1987.

    [4] P. Bonacich and P. Lloyd, Eigenvector-like measures of cen-

    trality for asymmetric relations, Social Networks , vol. 23, no. 3,pp. 191201, 2001.[5] S. Borgatti, Centrality and network ow, Social Networks ,

    vol. 27, no. 1, pp. 5571, 2005.[6] S. Borgatti and M. Everett, A graph-theoretic perspective on

    centrality, Social Networks , vol. 28, no. 4, pp. 466484, 2006.[7] G. Canright, K. Eng-Monsen, and M. Jelasity, Efcient and

    robust fully distributed power method with an application tolink analysis, Department of Computer Science, University of Bologna, Tech. Rep. UBLCS-2005-17 , pp. 200517, 2005.

    [8] G. Canright and K. Eng-Monsen, Spreading on networks: Atopographic view, ComPlexUs , vol. 3, no. 1-3, pp. 131146,2006.

    [9] I. Carreras, D. Miorandi, G. Canright, and K. Eng-Monsen,Understanding the spread of epidemics in highly partitionedmobile networks, in Proceedings of the 1st international con- ference on Bio inspired models of network, information and computing systems . ACM New York, NY, USA, 2006.

    [10] F. R. K. Chung, Spectral Graph Theory (CBMS Regional Conference Series in Mathematics, No. 92)(Cbms Regional Conference Series in Mathematics) .American Mathematical Society, February 1997. [Online].Available: http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&path=ASIN/0821803158

    [11] R. Duda, P. Hart, and D. Stork, Pattern Classication , 2nd ed.Wiley New York, 2001.

  • 8/2/2019 Themes10 Centrality DoubleCol

    14/14

    [12] M. Everett and S. Borgatti, Extending centrality , P. J. Carring-ton, J. Scott, and S. Wasserman, Eds. Cambridge UniversityPress, 2005.

    [13] L. Freeman, A set of measures of centrality based on between-ness, Sociometry , vol. 40, no. 1, pp. 3541, 1977.

    [14] N. Friedkin, Theoretical foundations for centrality measures, American Journal of Sociology , vol. 96, no. 6, p. 1478, 1991.

    [15] H. He, Eigenvectors and reconstruction, the electronic journal

    of combinatorics , vol. 14, no. 14, p. 1, 2007.[16] G. Inc., orkut - welcome to the orkut club!http://www.orkut.com/Signup, Tech. Rep., Oct 14, 2009.

    [17] A. Kansal, M. Goraczko, and F. Zhao, Building a sensornetwork of mobile phones, in Proceedings of the 6th interna-tional conference on Information processing in sensor networks .ACM, 2007, p. 548.

    [18] C. Kohlsch utter, P. Chirita, and W. Nejdl, Efcient parallelcomputation of pagerank, Lecture Notes in Computer Science ,vol. 3936, p. 241, 2006.

    [19] A. Langville, C. Meyer, and P. Fern Andez, Googles pager-ank and beyond: the science of search engine rankings, The Mathematical Intelligencer , vol. 30, no. 1, pp. 6869, 2008.

    [20] S. Milgram, The small world problem, Psychology Today ,vol. 2, no. 1, pp. 6067, 1967.

    [21] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, andB. Bhattacharjee, Measurement and analysis of online socialnetworks, in Proceedings of the 5th ACM/Usenix Internet Measurement Conference (IMC07) , San Diego, CA, October2007.

    [22] A. Nazir, S. Raza, and C. Chuah, Unveiling facebook: a

    measurement study of social network based applications, inProceedings of the 8th ACM SIGCOMM conference on Internet measurement . ACM New York, NY, USA, 2008, pp. 4356.

    [23] M. Newman, Analysis of weighted networks, Physical Review E , vol. 70, no. 5, p. 56131, 2004.

    [24] Orkut, Site prole for orkut.com (rank 4,168) com-plete, http://siteanalytics.compete.com/orkut.com/?metric=uv,Tech. Rep., September, 2009.

    [25] B. P al and P. Scientist, Sensorplanet: An open global researchframework for mobile device centric wireless sensor networks,in MobiSensors: NSF Workshop on Data Management for Mobile Sensor Networks, Pittsburgh , 2007.

    [26] B. Ruhnau, Eigenvector-centrality a node-centrality? Social Networks , vol. 22, no. 4, pp. 357365, 2000.

    [27] K. Sankaralingam, S. Sethumadhavan, and J. Browne, Dis-tributed pagerank for p2p systems, in High Performance Dis-tributed Computing, 2003. Proceedings. 12th IEEE Interna-tional Symposium on , 2003, pp. 5868.

    [28] F. Tisseur and J. Dongarra, A parallel divide and conqueralgorithm for the symmetric eigenvalue problem on distributedmemory architectures, SIAM Journal on Scientic Computing ,vol. 20, p. 2223, 1999.

    [29] S. Wasserman and K. Faust, Social network analysis: Methods

    and applications . Cambridge University Press, 1994.[30] D. Watts, Networks, dynamics, and the small-world phe-

    nomenon, American Journal of Sociology , vol. 105, no. 2, pp.493527, 1999.

    [31] , Small Worlds: The Dynamics of Networks between Order and Randomness . Princeton University Press, 1999.