Network Vector: Distributed Representations of Networks with Global Context

Hao Wu, USC ISI

[email protected]

Kristina Lerman, USC ISI

[email protected]

Abstract

We propose a neural embedding algorithm called Network Vector, which learns distributed representations of nodes and of entire networks simultaneously. By embedding networks in a low-dimensional space, the algorithm allows us to compare networks in terms of structural similarity and to solve outstanding predictive problems. Unlike alternative approaches that focus on node-level features, we learn a continuous global vector that captures each node's global context by maximizing the predictive likelihood of random walk paths in the network. Our algorithm is scalable to real-world graphs with many nodes. We evaluate our algorithm on datasets from diverse domains, and compare it with state-of-the-art techniques on node classification, role discovery and concept analogy tasks. The empirical results show the effectiveness and efficiency of our algorithm.

1 Introduction

Applications in network analysis, including network pattern recognition, classification, role discovery, and anomaly detection, among others, critically depend on the ability to measure similarity between networks or between individual nodes within networks. For example, given an email communications network within an enterprise, one may want to classify individuals according to their functional roles, or, given one individual, find another one playing a similar role.

Computing network similarity requires going beyond comparing networks at a node level to measuring their structural similarity. To characterize network structure, traditional approaches extract features such as node degrees, clustering coefficients, eigenvalues, the lengths of shortest paths and so on [Berlingerio et al., 2012; Henderson et al., 2012; Gilpin et al., 2013]. However, these hand-crafted features are usually heterogeneous, and it is often not clear how to integrate them within a learning framework. In addition, some graph features, such as eigenvalues, are computationally expensive and do not scale well in tasks involving large networks. Recent advances in distributed representation of nodes in networks [Perozzi et al., 2014; Tang et al., 2015; Grover and Leskovec, 2016] created an alternate

framework for unsupervised feature learning of nodes in networks. These methods are based on the idea of preserving local neighborhoods of nodes with neural network embeddings. An objective is defined on the proximity between nodes when exploring the network neighborhood with various strategies, mainly Depth-First Search (DFS) and Breadth-First Search (BFS). The objective is optimized using a single-layer neural network for efficient training. However, the embeddings used for feature representation limit their scope to the local context of nodes, without directly exploiting the global context of the network. To represent the whole network, these approaches require us to integrate the representations of all nodes, for example by averaging them. However, not all nodes contribute equally to the global representation of the network, and in order to account for their varying importance, aggregation schemes need to weigh nodes, which adds an extra layer of complexity to the learning task.

To address the above-mentioned challenge, we describe a neural network algorithm called Network Vector, which learns distributed representations of networks that account for their global context. The algorithm is scalable to real-world networks with large numbers of nodes and can be applied to generic networks such as social networks, knowledge graphs, and citation networks. Networks are compressed into real-valued vectors that preserve the network structure, so that the learned distributed representations can be used to effectively measure network similarity. Specifically, given two networks, even ones with different size and topology, the distance between the learned vector representations can be used to measure their structural similarity. In addition, this approach allows us to compare individual nodes by looking at the similarity of their ego-networks, i.e., networks that contain the focal node, all of its neighbors, and the connections between them.

Our approach is inspired by Paragraph Vector [Le and Mikolov, 2014], which learns distributed representations of texts of variable length, such as sentences and documents. By exchanging the notions of ordered "word" sequences in sentences and "node" sequences in paths along network edges, we learn network representations in a way similar to learning representations of sentences and documents. Specifically, we sample sequences of nodes from a network using random walks, as in [Perozzi et al., 2014; Grover and Leskovec, 2016]. In contrast to existing


approaches, the likelihood of the next node in a random walk sequence predicted by our algorithm depends not only on the previous nodes, but also on the global context of the network. The global context vector representation of the network is learned to maximize the average predicted likelihood of nodes in random walk sequences sampled from the network. The learned representations can be used as signatures of the networks for comparison, or as features for classification and other predictive tasks.

We evaluate the algorithm on several real-world datasets from a diversity of domains, including a citation network of knowledge concepts in Wikipedia, an email interaction network, a legal citation network, a social network of bloggers, a protein-protein interaction network and a language network. We focus on predictive tasks including role discovery in networks, which aims to identify individual nodes serving similar roles, inference of analogous relations between concept pairs in Wikipedia, and multi-label node classification. We compare Network Vector with the state-of-the-art feature learning algorithms node2vec [Grover and Leskovec, 2016], LINE [Tang et al., 2015] and DeepWalk [Perozzi et al., 2014], and with feature-based baselines such as node degrees, clustering coefficients and eigenvalues. Experiments demonstrate the superior performance of Network Vector, due to its capacity for learning the global context of the network.

Our contributions are summarized as follows:

1. We propose Network Vector, a distributed feature learning algorithm for representing an entire network and its nodes simultaneously. We define an objective function that preserves the local neighborhood of nodes and the global context of the entire network.

2. We evaluate Network Vector on role discovery, concept analogy and node multi-label classification tasks. Experiments on several benchmark datasets from diverse domains show its effectiveness and efficiency.

2 Related Work

Our algorithm builds its foundation on learning distributed representations of concepts [Hinton, 1986]. Distributed representations encode structural relationships between concepts and are typically learned using back-propagation through neural networks. Recent advances in natural language processing have successfully adopted distributed representation learning and introduced a family of neural language models [Bengio et al., 2003; Mnih and Hinton, 2007; Mikolov et al., 2010; Mikolov et al., 2013a; Mikolov et al., 2013b] to model word sequences in sentences and documents. These approaches embed words such that words in similar contexts tend to have similar representations in latent space.

By exchanging the notions of nodes in a network and words in a document, recent research [Perozzi et al., 2014; Tang et al., 2015; Cao et al., 2015; Grover and Leskovec, 2016] attempts to learn node representations in a network in a way similar to learning word embeddings in neural language models. Our work follows this line of approaches, in which nodes in a neighborhood have similar embeddings in vector space. Different node sampling strategies are explored

for characterizing the neighborhood structure. For example, DeepWalk [Perozzi et al., 2014] samples node sequences from a network using a stream of short first-order random walks, and models them just like word sequences in documents using neural embeddings. LINE [Tang et al., 2015] samples nodes in a pairwise manner and models the first-order and second-order proximity between them. GraRep [Cao et al., 2015] extends LINE to exploit structural information beyond second-order proximity. To offer a flexible node sampling scheme, node2vec [Grover and Leskovec, 2016] utilizes second-order random walks, and combines Depth-First Search (DFS) and Breadth-First Search (BFS) strategies to explore the local neighborhood structure.

However, existing approaches only consider the local network structure (i.e., the neighborhoods of nodes) when learning node embeddings, and exploit little information about the global structure of the network. Although the recent approach GraRep [Cao et al., 2015] attempts to capture long-distance relationships between nodes, it limits its scope to a fixed number of hops. More importantly, existing approaches focus on node representations, and additional effort is required to compute the representation of the entire network. The simple scheme of averaging the representations of all nodes to represent the network is by no means a good choice, as it ignores the statistics of node frequency and the roles of nodes in the network. In contrast, we introduce a notion called the global network vector, which aims to represent the structural properties of an entire network. The global vector representation of the network acts as a memory that contributes to the prediction of a node together with the node's neighbors, and is updated to maximize the predictive likelihood. As a result, our algorithm can simultaneously learn the global representation of a network and the representations of nodes in the network. This is inspired by Paragraph Vector [Le and Mikolov, 2014], which learns a continuous vector to represent a piece of text of variable length, such as sentences, paragraphs and documents.

3 Network Vector

We consider the problem of embedding the nodes of a network and the entire network into a low-dimensional vector space. Let G = {V, E} denote a graph, where V is the set of vertices and E ⊆ V × V is the set of edges with weights W. The goal of our approach is to map the entire graph to a low-dimensional vector, represented by v_G ∈ R^d, and to map each node i to a unique vector v_i ∈ R^d in the same vector space. Although the dimensionality of the network representation v_G can be different from that of the node representations v_i

in theory, we adopt the same dimensionality d for ease of computation in real-world applications. Suppose that there are M graphs given (e.g., the ego-networks of M persons of interest in a social network) and N distinct nodes in the corpus; then there are (M + N) × d parameters to be learned.

3.1 A Neural Architecture

Our approach to modeling networks is motivated by learning distributed representations of variable-length texts, e.g., sentences and documents [Le and Mikolov, 2014]. The concept of "words" in a document is


Figure 1: A neural network architecture of Network Vector. It learns the global network representation as a distributed memory evolving with the sliding window over node sequences.

replaced by "nodes" in a network in our modeling. The goal is to predict a node given the other nodes in its local context as well as the global context of the network. Text has a linear property: the local context of a word can be naturally defined by the surrounding words in ordered sequences. However, networks are not linear. In order to characterize the local context of a node, without loss of generality, we sample node sequences from the given network with the second-order random walks of [Grover and Leskovec, 2016], which offer a flexible notion of a node's local neighborhood by combining Depth-First Search (DFS) and Breadth-First Search (BFS) strategies. Our learning framework can easily adopt higher-order random walks, but with higher computation cost. Each random walk starts from an arbitrary root node and generates an ordered sequence of nodes with a second-order Markov chain. Specifically, consider node v_a that was visited in the previous step, and suppose the random walk currently reaches node v_b. The next node v_c is then sampled with probability:

P(v_c | v_a, v_b) = (1/Z) M^a_{bc} W_{bc}    (1)

where M^a_{bc} is the unweighted transition probability of moving from node v_b to v_c given v_a, W_{bc} is the weight of edge (v_b, v_c), and Z is the normalization term. We define M^a_{bc} as:

M^a_{bc} = 1/p   if d(v_a, v_c) = 0
           1     if d(v_a, v_c) = 1
           1/q   if d(v_a, v_c) = 2    (2)

where d(v_a, v_c) is the shortest-path distance between v_a and v_c. The parameters p and q control how strongly the random walk is biased toward the node visited in the previous step and toward nodes that are farther away. The random walk terminates when l vertices have been sampled, and the procedure is repeated r times for each root node.
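To make the sampling procedure concrete, the sketch below implements the walk of Eqs. (1)-(2) directly. It assumes the graph is stored as a neighbor dictionary adj and an edge-weight dictionary weights (hypothetical structures chosen for illustration; for an unweighted graph all weights are 1), and it evaluates the bias terms on the fly rather than precomputing transition probabilities with alias tables as in the complexity analysis of Section 3.4.

```python
import random

def biased_random_walk(adj, weights, root, length, p=1.0, q=1.0):
    """Sample one node sequence with the second-order random walk of Eqs. (1)-(2).

    adj:     dict mapping node -> set of neighboring nodes
    weights: dict mapping (u, v) -> edge weight W_uv (both directions stored)
    p, q:    parameters biasing the walk toward revisits / outward exploration
    """
    walk = [root]
    while len(walk) < length:
        curr = walk[-1]
        neighbors = list(adj[curr])
        if not neighbors:
            break
        if len(walk) == 1:
            # First step: no previous node, so sample proportionally to edge weight.
            probs = [weights[(curr, nxt)] for nxt in neighbors]
        else:
            prev = walk[-2]
            probs = []
            for nxt in neighbors:
                if nxt == prev:             # d(v_a, v_c) = 0
                    bias = 1.0 / p
                elif nxt in adj[prev]:      # d(v_a, v_c) = 1
                    bias = 1.0
                else:                       # d(v_a, v_c) = 2
                    bias = 1.0 / q
                probs.append(bias * weights[(curr, nxt)])
        walk.append(random.choices(neighbors, weights=probs)[0])
    return walk

def sample_corpus(adj, weights, r=10, l=80, p=1.0, q=1.0):
    """Repeat r walks of length l from every root node, as described in the text."""
    return [biased_random_walk(adj, weights, root, l, p, q)
            for _ in range(r) for root in adj]
```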

Figure 1 illustrates a neural network architecture for learning the global network vector and node vectors simultaneously. A sliding window (v_1, ..., v_n) of fixed length n is repeatedly sampled over node sequences. The algorithm predicts the target node v_n given the preceding nodes v_{1:n−1} as local context and the entire network G as global context, with probability P(v_n | v_{1:n−1}, G). Formally, the probability distribution of a target node is defined as:

P(v_n | v_{1:n−1}; G) = (1/Z_c) exp[−E(G, v_{1:n−1}; v_n)]    (3)

where Z_c = Σ_{v_m ∈ V} exp[−E(G, v_{1:n−1}; v_m)] is the normalization term. We extend the scalable version of the Log-Bilinear model [Mnih and Hinton, 2007], called the vector Log-Bilinear model (vLBL) [Mnih and Kavukcuoglu, 2013]. In our model, the energy function E(G, v_{1:n−1}; v_n) is specified as:

E(G, v_{1:n−1}; v_n) = −v̂^⊤ v_n    (4)

where v̂ is the predicted representation of the target node:

v̂ = v_G + Σ_{i=1}^{n−1} c_i ⊙ v_i    (5)

Here ⊙ denotes the Hadamard (element-wise) product, and c_i is the weight vector for the context node in position i; c_i parameterizes context nodes at different hops away from the target node in the random walks. The global network vector v_G is shared across all sliding windows of node sequences. After training, the global network vector preserves the structural information of the network and can be used as a feature input for the network. In our model, in order to impose symmetry in the feature space of nodes and to activate more interactions between the feature vector v_G and the node vectors, we use the same set of feature vectors for both the target nodes and the context nodes. This is different from [Mnih and Kavukcuoglu, 2013], where two separate sets of representations are used for the target node and the context nodes, respectively. In practice, we find that this choice improves the performance of Network Vector.
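As a rough illustration of this architecture, the sketch below computes the predicted vector of Eq. (5) and the softmax of Eq. (3) for a single sliding window. The variable names are ours, and the exact normalization over all N nodes is shown only for clarity; training replaces it with negative sampling (Section 3.2).

```python
import numpy as np

def predict_target_distribution(v_G, context_vecs, c, node_matrix):
    """Score every candidate target node for one sliding window (Eqs. (3)-(5)).

    v_G:          (d,)    global network vector
    context_vecs: (n-1,d) vectors of the context nodes v_1..v_{n-1}
    c:            (n-1,d) position-dependent weight vectors c_i
    node_matrix:  (N,d)   matrix of all node vectors (shared with the context)
    """
    # Eq. (5): predicted representation of the target node.
    v_hat = v_G + (c * context_vecs).sum(axis=0)   # Hadamard products, then sum
    # Eq. (4): the energy is a negative dot product, so the score is v_hat . v_m.
    scores = node_matrix @ v_hat                   # (N,)
    # Eq. (3): softmax over all nodes gives P(v_n | v_1:n-1, G).
    scores -= scores.max()                         # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()
```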

3.2 Learning with Negative Sampling

The global network vector v_G, the node vectors v_i and the position-dependent context parameters c_i are initialized with random values, and optimized by maximizing the objective in Eq. (3). Stochastic gradient ascent is performed to update the set of parameters θ = {v_G, v_i, c_i}:

∆θ = ε ∇_θ log P(v_n | v_{1:n−1}, G) = ε ∂/∂θ log [ exp(v̂^⊤ v_n) / Σ_{m=1}^{N} exp(v̂^⊤ v_m) ]    (6)

where ε is the learning rate. The computation involves the normalization term and is proportional to the number of distinct nodes N; this is expensive and impractical in real applications. In our approach, we adopt negative sampling [Mikolov et al., 2013b] for optimization. Negative sampling is a simplified version of noise contrastive estimation [Mnih and Teh, 2012]; it trains a logistic regression to distinguish data samples of v_n from a "noise" distribution. Our objective is to maximize

log σ(v̂^⊤ v_n) + Σ_{m=1}^{k} E_{v_m ∼ P_n(v)} [ log σ(−v̂^⊤ v_m) ]    (7)


where σ(x) = 1/(1 + exp(−x)) is the sigmoid function, and P_n(v) is the global unigram distribution of the training data, acting as the noise distribution from which we draw k negative samples of nodes. Negative sampling allows us to train the model efficiently without the explicit normalization in Eq. (6), and hence makes it more scalable.
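A minimal sketch of one such update is given below. It assumes the predicted vector v̂ of Eq. (5) has already been computed for the current window; the function and parameter names are illustrative, not the authors' implementation.

```python
import numpy as np

def negative_sampling_step(v_hat, target, negatives, node_matrix, lr=0.025):
    """One stochastic gradient ascent step on the objective of Eq. (7).

    v_hat:       (d,)  predicted vector from Eq. (5)
    target:      index of the observed node v_n
    negatives:   k node indices drawn from the noise distribution P_n(v)
    node_matrix: (N,d) node embedding matrix, updated in place
    Returns the update for v_hat, to be propagated back to v_G, c_i and v_i.
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    grad_v_hat = np.zeros_like(v_hat)
    for idx, label in [(target, 1.0)] + [(m, 0.0) for m in negatives]:
        v_m = node_matrix[idx]
        # Push sigma(v_hat . v_m) toward 1 for the positive sample and toward 0
        # for the negatives; `err` is the gradient of the corresponding log-sigmoid.
        err = label - sigmoid(np.dot(v_hat, v_m))
        grad_v_hat += err * v_m
        node_matrix[idx] += lr * err * v_hat
    return lr * grad_v_hat
```

By Eq. (5), the returned update is added to v_G as is, while each context vector v_i and weight c_i receives it after element-wise multiplication with c_i and v_i, respectively.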

3.3 An Inverse Architecture

The architecture in Figure 1 utilizes a linear combination of the global network vector and the context node vectors to predict the target node in a sliding window. Another way of training the global network vector is to model the likelihood of observing a node v_t sampled from the sliding window conditioned on the feature vector v_G, given by

P(v_t | G) = (1/Z_G) exp[−E(G; v_t)]    (8)

where Z_G is the normalization term specific to the feature representation of G. The energy function E(G; v_t) is:

E(G; v_t) = −v_G^⊤ v_t    (9)

This architecture is a counterpart of the Distributed Bag-of-Words version of Paragraph Vector [Le and Mikolov, 2014]. However, it ignores the order of the nodes in the sliding window and performs poorly in practice when used alone. We extend the framework by simultaneously training network and node vectors using a Skip-gram-like model [Mikolov et al., 2013a; Mikolov et al., 2013b]. The model additionally maximizes the likelihood of observing the local context v_{t−n:t+n} (excluding v_t) of the target node v_t, conditioned on the feature representation of v_t. Unfortunately, modeling the joint distribution of a set of context nodes is not tractable. The problem can be relaxed by assuming that the nodes in different context positions are conditionally independent given the target node:

P(v_{t−n:t+n} | v_t) = Π_{i=t−n}^{t+n} P(v_i | v_t)    (10)

where P(v_i | v_t) = (1/Z_t) exp[−E(v_t; v_i)]. The energy function is:

E(v_t; v_i) = −v_t^⊤ (c_i ⊙ v_i)    (11)

The objective is to maximize the log-likelihood of the product of the probabilities P(v_t | G) and P(v_{t−n:t+n} | v_t).
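The combined objective can be illustrated with the short sketch below, which evaluates both terms with an exact softmax for readability (training again uses negative sampling instead); the function and variable names are ours.

```python
import numpy as np

def log_softmax(scores):
    scores = scores - scores.max()
    return scores - np.log(np.exp(scores).sum())

def inverse_log_likelihood(v_G, t, context_idx, c, node_matrix):
    """Log-likelihood of one window under the inverse architecture: the term of
    Eqs. (8)-(9) plus the Skip-gram-like term of Eqs. (10)-(11).

    t:           index of the sampled target node v_t
    context_idx: indices of the context nodes v_{t-n..t+n}, with t excluded
    c:           (2n,d) position-dependent weight vectors c_i
    node_matrix: (N,d)  shared node embedding matrix
    """
    v_t = node_matrix[t]
    # Eqs. (8)-(9): the global vector alone predicts the sampled node.
    ll = log_softmax(node_matrix @ v_G)[t]
    # Eqs. (10)-(11): v_t predicts each context node independently.
    for pos, i in enumerate(context_idx):
        scores = (node_matrix * c[pos]) @ v_t    # v_t . (c_pos ⊙ v_m) for all m
        ll += log_softmax(scores)[i]
    return ll
```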

3.4 Complexity Analysis

The computation of Network Vector consists of two key parts: sampling node sequences with random walks and optimizing the vectors. For each node sequence of fixed length l, we start from a randomly chosen root node. At each step, the walk visits a new node based on the transition probabilities P(v_c | v_a, v_b) in Eq. (1). The transition probabilities P(v_c | v_a, v_b) can be precomputed and stored in memory using O(|E|^2/|V|) space. Sampling a new node in the walk can then be done in O(1) time using alias sampling [Walker, 1977]. The overall time complexity is O(r|V|l) for repeating r random walks of fixed length l from each node as root.
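For reference, a standard implementation of Walker's alias method is sketched below; this is a textbook version written for illustration, not code taken from the paper.

```python
import random

def build_alias_table(probs):
    """Preprocess a discrete distribution (probabilities summing to 1) into an
    alias table (Walker, 1977) so that every subsequent draw takes O(1) time."""
    n = len(probs)
    scaled = [p * n for p in probs]
    accept, alias = [0.0] * n, [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        accept[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for leftover in small + large:
        accept[leftover] = 1.0
    return accept, alias

def alias_draw(accept, alias):
    """Draw one index from the preprocessed distribution in O(1) time."""
    i = random.randrange(len(accept))
    return i if random.random() < accept[i] else alias[i]
```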

Figure 2: Network Vector is trained on Zachary's Karate network to learn two-dimensional embeddings representing the nodes (green circles) and the entire graph (orange circle). The size of each node is proportional to its degree. Note that the global representation of the graph is close to the high-degree nodes.

The time complexity of optimization with negative sampling in Eq. (7) is proportional to the dimensionality of the vectors d, the length of the context window n and the number of negative samples k. It takes O(dnk) time for the nodes within the sliding window (v_1, ..., v_n). The introduced global vector v_G requires O(dk) time to optimize, the same as any other node vector in the sliding window. Given r random walks of fixed length l starting from every node, the overall time complexity is O(dnkr|V|l). Storing the node vectors and the global network vector requires O(d|V| + d) space.

3.5 The Property of Network Vector

The property of the global network vector v_G in the architecture shown in Figure 1 can be explained by looking at the objective in Eq. (3). v_G is part of the input to the neural network, and can be viewed as a term that helps to represent the distribution of the target node v_n. The relevant part v_G^⊤ v_n is related logarithmically to the probability P(v_n | v_{1:n−1}; G). Therefore, the more frequently a particular v_n is observed in the data, the larger the value v_G^⊤ v_n will be, and hence the closer v_G will be to v_n in vector space. The training objective is to maximize the logarithm of the product of all probabilities P(v_n | v_{1:n−1}; G), and its value is related to v_G^⊤ v̄_n, where v̄_n is the expected vector obtained by averaging all observed v_n in the data. The same holds for Eq. (8) in the inverse architecture, where the global network vector v_G is the only input to the neural network used to predict every node v_t.

Karate Network

As an illustrative example, we apply Network Vector to the classic Karate network [Zachary, 1977]. The nodes in the network represent members of a karate club, and the edges are social links between the members outside the club. There are 34 nodes and 78 undirected edges in total. We use the inverse architecture to train the vectors. Figure 2 shows the output of


our method in two-dimensional space. We use green circles to denote nodes and an orange circle to denote the entire graph. The size of a node is proportional to its degree in the graph. We can see that the learned global network vector is close to the high-degree nodes, such as nodes 1 and 34, which serve as the hubs of the two splits of the club. The resulting global vector mostly represents the backbone nodes (e.g., hubs) in the network and compensates for the lack of global information in local neighborhoods.

Legal Citation Networks

Given a citation network of documents, for example scientific papers or legal opinions, we want to identify similar documents. These could be groundbreaking works that serve to open new fields of discourse in science and law, or foundational works that span disciplines but have less of an impact on discourse, such as "methods" papers in science.

For the purpose of a case study, we collected a large digitized record of federal court opinions from the CourtListener project1. The most cited legal decisions from the United States Supreme Court were selected, and ego-networks of citations were constructed for these legal cases. Two distinct graph patterns are observed: "citations have a few giant hubs" and "citations are well connected". We list a few examples in Table 1, where the titles of the cases with these two citation patterns are colored red and blue, respectively. The ego-networks of the first six cases listed in Table 1 have just a few giant hubs which are linked by many other cases. For example, the case "Buckley v. Valeo, 424 U.S. 1 (1976)" is a landmark decision in American campaign finance law. The case "Coleman v. Miller, 307 U.S. 433 (1939)" is a landmark decision centered on the Child Labor Amendment, which was proposed for ratification by Congress in 1924. These cases are generally centered on a specific topic, and their citations may have a narrow topic. There are only a few hubs cited frequently by others, and the citing cases generally do not cite each other. On the other side, the ego-networks of the last six cases listed in Table 1 have citations that are well connected. For example, "Yick Wo v. Hopkins, 118 U.S. 356 (1886)" was the first case in which the United States Supreme Court ruled that a law that is race-neutral on its face, but is administered in a prejudicial manner, violates the Equal Protection Clause. The case "Stromberg v. California, 283 U.S. 359 (1931)" is a landmark in the history of First Amendment constitutional law, extending protection to the substance of the First Amendment. These cases are influential in history and are cited by many diverse subsequent legal decisions, which usually cite each other.

Our Network Vector algorithm is used to learn two-dimensional embeddings from the ego-networks of the legal cases, and their projections are shown as open dots in the right panel of Table 1. The structures of the ego-networks for four sampled Supreme Court legal cases (Case IDs: 110578, 93272, 101957, 105547) are also illustrated in the figure. Note that an ego-network includes the case itself (not shown in the figure), all the cases it cites (smaller circle on the right) and all the cases that cite it (larger circle on the left). Lines represent citations among these cases. The two groups of ego-networks contrast with each other. Compared to the

1 https://www.courtlistener.com/

ego-networks in red boxes, which are cited by unrelated legal cases, there is clearly more coherence in the discourse related to the cases in blue boxes, as indicated by citations among other Supreme Court cases that cite them. Although the differences between these two groups of ego-networks could be captured in a standard way, by features related to the degree distribution of the ego-networks or their clustering coefficients, the distinctions between other ego-networks may be more subtle, necessitating a new approach for evaluating their similarity. In this representation, the position of a case in the learned space captures the structure of its ego-network. Cases that are more similar to the ego-networks in red boxes fall in the top half of the 2-D plane (red open dots), while cases similar to those in blue boxes fall in the bottom half (blue open dots). Thus, distances between the learned representations of the ego-networks of legal cases can be used to quantitatively capture their similarity.

4 Experiments

Network Vector learns representations of network nodes and of the entire network simultaneously. We evaluate both representations on predictive tasks. First, we apply Network Vector to a setting where only local information about nodes, such as their immediate neighbors, is available. We learn representations for the ego-networks of a set of nodes using Network Vector and evaluate them on role discovery in social networks and concept analogy in an encyclopedia. Second, when the connectivity of all nodes in the entire network is available, we may learn node representations using Network Vector, where the additional global vector for the network helps in learning high-quality node representations. The resulting node representations are evaluated on multi-label classification.

4.1 Role Discovery

Roles reflect individuals' functions within social networks. For example, the email communication network within an enterprise reflects employees' responsibilities and organizational hierarchies. An engineer's interactions with her team are different from those of a senior manager. In the Wikipedia network, each article cites other concepts that explain the meaning of the article's concept. Some concepts may "bridge" the network by connecting different concept categories. For example, the concept Bat belongs to the category Mammals; however, since a bat resembles a bird, it refers to many similar articles in the category Birds.

Datasets

We use the following datasets in the evaluation:

• Enron Email Network: This dataset contains email interaction data from about 150 users, mostly senior management of Enron. There are about half a million emails communicated among 85,601 distinct email addresses2. We have 362,467 links left after removing duplicates and self-links. Each of the email addresses belonging to Enron employees has one of 9 different positions: CEO, President, Vice

2 http://www.cs.cmu.edu/~enron/


Table 1: A subset of the most cited legal decisions with two distinct patterns of ego-networks. 2-D embeddings of the legal decisions are learned using Network Vector with their ego-networks as input, and mapped to the right panel. Citations of the top 6 cases in the table have a few giant hubs (red dots in the figure), while those of the bottom 6 cases are well connected (blue dots in the figure).

ID      Title (with url)                            Page          Year
93272   Chicago & Grand Trunk Ry. Co. v. Wellman    143 U.S. 339  1892
99622   F. S. Royster Guano Co. v. Virginia         253 U.S. 412  1920
103222  Coleman v. Miller                           307 U.S. 433  1939
109380  Buckley v. Valeo                            424 U.S. 1    1976
118093  Arizonans for Official English v. Arizona   520 U.S. 43   1997
110578  Ridgway v. Ridgway                          454 U.S. 46   1981
91704   Yick Wo v. Hopkins                          118 U.S. 356  1886
98094   Weeks v. United States                      232 U.S. 383  1914
101741  Stromberg v. California                     283 U.S. 359  1931
101957  Powell v. Alabama                           287 U.S. 45   1932
103050  Johnson v. Zerbst                           304 U.S. 458  1938
105547  Roth v. United States                       354 U.S. 476  1957

Table 2: Performance on the role discovery task.

                            Wiki - 15 classes       Email - 9 classes       Email - 3 classes
Method                      p@1    p@5    p@10      p@1    p@5    p@10      p@1    p@5    p@10
Degrees+Clustering+Eigens   0.160  0.149  0.146     0.090  0.102  0.083     0.210  0.200  0.196
node2vec                    0.231  0.224  0.218     0.290  0.280  0.268     0.500  0.498  0.474
Network Vector              0.607  0.560  0.522     0.290  0.298  0.281     0.520  0.498  0.483

President, Director, Managing Director, Manager, Employee, In House Lawyer and Trader. We use the positions as roles. This categorization is fine-grained. In order to understand how the feature representations reflect the properties of different strata in the corporation, we also use the coarse-grained labels Leader (aggregating CEO, President and Vice President), Manager (aggregating Director, Managing Director and Manager) and Employee (including Employee, In House Lawyer and Trader) to divide the users into 3 roles.

• Wikipedia for Schools Network: We use a subset of articles available at Wikipedia for Schools3. This dataset contains 4,604 articles and 119,882 links between them. The articles are categorized by subject. For example, the article about Cat is categorized as subject.Science.Biology.Mammals. We use one of 15 second-level category names (e.g., Science in the case of Cat) as the role label.

Methods for Comparison

For real-world networks, such as email networks, information about all node connectivities may not be fully available, e.g., for privacy reasons. For this reason, we explore the prediction task with local information only (i.e., immediate neighbors). For each node, we first generate its ego-network, which is the induced subgraph of its immediate neighbors, and learn global vector representations for the set of ego-networks with Network Vector. We use the architecture of Eq. (3). In our experiments, we repeat the random walks γ = 10 times from each root node, and the length of each random walk is fixed at l = 80. For comparison, we evaluate

3 http://schools-wikipedia.org/

the performance of Network Vector against the following network feature-based algorithms [Berlingerio et al., 2012]:

• Degrees: the number of nodes and edges, the average node degree, and the maximum "in" and "out" node degrees. The degree features are aggregated to form the representations of the ego-networks.

• Clustering Coefficients: measure the degree to which nodes tend to cluster. We compute the global clustering coefficient and the average clustering coefficient of nodes to represent each ego-network.

• Eigens: For each ego-network, we compute the 10 largest eigenvalues of its adjacency matrix.

• node2vec [Grover and Leskovec, 2016]: This approach learns low-dimensional feature representations of nodes in a network by interpolating between BFS and DFS when sampling node sequences. Parameters p and q are introduced to control the likelihood of revisiting a node in walks, and to discourage or encourage outward exploration, resulting in BFS-like or DFS-like sampling strategies. It is interesting to note that when p = 1 and q = 1, node2vec reduces to DeepWalk [Perozzi et al., 2014], which utilizes uniform random walks. We adapt node2vec and use the mean of the learned node vectors to represent each ego-network.

Results

Given a node's ego-network, we rank other nodes' ego-networks by their distance to it in the vector space of feature representations. Table 2 shows the average precision of retrieved nodes with the same roles (class labels) at cut-offs k = 1, 5, 10. For simplicity, cosine similarity is used to compute the distance between two nodes. From the results, we can see how the global context allows Network Vector to outperform


Table 3: Performance on the concept analogy task.

Method                      Hit@1   Hit@5   Hit@10
Degrees                     0.0147  0.0423  0.0717
Clustering Coefficients     0.0006  0.0043  0.0086
Eigens                      0.0025  0.0074  0.0116
Degrees+Clustering+Eigens   0.0153  0.0453  0.0803
node2vec                    0.2450  0.5098  0.6150
Network Vector              0.2849  0.5619  0.6930

node2vec in role discovery. However, the performance gain varies across datasets. We observe that Network Vector performs slightly better than node2vec on the Enron email interaction network, while the performance improvement is over 150% on the Wikipedia network. Compared to the combination of degrees, clustering coefficients and eigenvalues, the improvements of the two learning algorithms, Network Vector and node2vec, are substantial, with over 100% performance gains in all cases.
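The retrieval protocol behind Table 2 can be summarized with the sketch below, which assumes the learned ego-network embeddings and role labels are available as NumPy arrays (the names are illustrative).

```python
import numpy as np

def precision_at_k(embeddings, labels, k_values=(1, 5, 10)):
    """Average precision@k for role discovery: rank every other ego-network by
    cosine similarity and count how many of the top k share the query's role.

    embeddings: (N,d) ego-network vectors; labels: length-N array of role labels
    """
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)            # never retrieve the query itself
    order = np.argsort(-sims, axis=1)
    results = {}
    for k in k_values:
        hits = (labels[order[:, :k]] == labels[:, None]).mean(axis=1)
        results[k] = hits.mean()
    return results
```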

4.2 Concept Analogy

We also evaluate the feature representations of ego-networks on an analogy task. For the Wikipedia network, we follow the word analogy task defined in [Mikolov et al., 2013a]: given a pair of Wikipedia articles describing two concepts (a, b) and an article describing another concept c, the task is to find a concept d such that a is to b as c is to d. For example, Europe is to euro as USA is to dollar. This analogy task can be solved by finding the concept that is closest to v_b − v_a + v_c

in vector space, where the distance is computed using cosine similarity.
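A small sketch of this procedure, with illustrative variable names, might look as follows.

```python
import numpy as np

def solve_analogy(vectors, names, a, b, c, topk=10):
    """Return the top-k candidates d such that a : b :: c : d, ranked by cosine
    similarity to v_b - v_a + v_c; the three query concepts are excluded."""
    idx = {name: i for i, name in enumerate(names)}
    query = vectors[idx[b]] - vectors[idx[a]] + vectors[idx[c]]
    X = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = X @ (query / np.linalg.norm(query))
    for name in (a, b, c):
        sims[idx[name]] = -np.inf
    return [names[i] for i in np.argsort(-sims)[:topk]]
```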

There are 1,632 semantic tuples in the Wikipedia network matched to the semantic pairs in [Mikolov et al., 2013a]. We use them as the evaluation benchmark. Table 3 shows the accuracy of hitting the answer d within the top k = 1, 5, 10 positions of the ranking list. From the results, we can see that Network Vector performs much better than the baselines, which use degrees, clustering coefficients and eigenvalues of the adjacency matrix. Combining heterogeneous features (degrees, clustering coefficients and eigenvalues) of different scales makes it difficult to define an effective distance metric. Network Vector does not suffer from this problem, because it automates feature learning with an objective function. In this task, we empirically fix the dimensionality of the vectors to 100 and the context window to 10.

4.3 Multi-label Classification

Multi-label classification is a challenging task, where each node may have one or multiple labels. A classifier is trained to predict the possible labels for each test node. In our Network Vector algorithm, the global representation of the entire network serves as additional context along with the local neighborhood in learning node representations.

Datasets

To understand whether the global representation helps learn better node representations, we perform multi-label classification with the same benchmarks and experimental procedure as [Grover and Leskovec, 2016], using the same datasets:

• BlogCatalog [Zafarani and Liu, 2009; Tang and Liu, 2009]: This is a network of social relationships provided by bloggers on the BlogCatalog website. The labels represent the interests of bloggers on a list of topic categories. There are 10,312 nodes, 333,983 edges and 39 distinct node labels.

• Protein-Protein Interactions (PPI) [Breitkreutz et al.,

2008; Grover and Leskovec, 2016]: This is a subgraph of the entire PPI network for Homo sapiens. The node labels are obtained from hallmark gene sets [Liberzon et al., 2011] and represent biological states. There are 3,890 nodes, 76,584 edges and 50 distinct node labels.

• Wikipedia Cooccurrences [Mahoney, 2009; Grover and Leskovec, 2016]: This is a network of words appearing in the first million bytes of the Wikipedia dump. The edge weight is defined by the cooccurrence of two words within a sliding window of length 2. The Part-of-Speech (POS) tags [Marcus et al., 1993], inferred using the Stanford POS-Tagger [Toutanova et al., 2003], are used as labels. There are 4,777 nodes, 184,812 edges and 40 distinct node labels.

Methods for Comparison

We compare the node representations learned by Network Vector against the following feature learning methods for node representations:

• Spectral clustering [Tang and Liu, 2011]: This method

learns the d smallest eigenvectors of the normalized graph Laplacian matrix, and utilizes them as the d-dimensional feature representations of nodes.

• DeepWalk [Perozzi et al., 2014]: This method learns d-dimensional feature representations by applying Skip-gram [Mikolov et al., 2013a; Mikolov et al., 2013b] to node sequences generated by uniform random walks from the source nodes on the graph.

• LINE [Tang et al., 2015]: This method learns d-dimensional feature representations by sampling nodes at 1-hop and 2-hop distance from the source nodes in a BFS-like manner.

• node2vec [Grover and Leskovec, 2016]: We use the original node2vec algorithm with the optimal parameter settings of (p, q) reported in [Grover and Leskovec, 2016].

Network Vector utilizes only first-order or second-order proximity between nodes in a two-layer neural embedding framework. The first layer computes the context feature vector, and the second layer computes the probability distribution of target nodes. It is thus similar to the other neural-embedding-based feature learning methods DeepWalk, LINE and node2vec. For fair comparison, we exclude the recent approaches GraRep [Cao et al., 2015], HNE [Chang et al., 2015] and SDNE [Wang et al., 2016]: GraRep utilizes information from network neighborhoods beyond second-order proximity, and both HNE and SDNE employ deep neural networks with more than two layers. GraRep, HNE and SDNE are less computationally efficient and cannot scale as well as DeepWalk, LINE, node2vec


Table 4: Macro-F1 scores for multi-label classification with a balanced 50-50 split between training and testing data. Results of Spectral clustering, DeepWalk, LINE and node2vec are reported in the node2vec paper.

Algorithm                    BlogCatalog   PPI     Wikipedia
Spectral Clustering          0.0405        0.0681  0.0395
LINE                         0.0784        0.1447  0.1164
DeepWalk                     0.2110        0.1768  0.1274
node2vec (p*, q*)            0.2581        0.1791  0.1552
Network Vector (p=q=1)       0.2473        0.1938  0.1388
Network Vector (p*, q*)      0.2607        0.1985  0.1765
settings (p*, q*)            0.25, 0.25    4, 1    4, 0.5
Gain over DeepWalk           12.4%         9.6%    8.9%
Gain over node2vec           1.0%          9.7%    13.7%

and our algorithm Network Vector. For fair comparison, we use the inverse architecture of Network Vector, which is Skip-gram-like [Mikolov et al., 2013a] and similar to that of node2vec. The parameter settings used for Network Vector favor node2vec and are exactly the same as in [Grover and Leskovec, 2016]. Specifically, we set d = 128, r = 10, l = 80, and a context size n = 10, aligned with typical values used for DeepWalk and LINE. A single pass over the data (one epoch) is used for optimization. In order to perform multi-label classification, the node representations learned by each approach are used as feature input to a one-vs-rest logistic regression with L2 regularization. Our experiments are repeated for 10 random equal splits of train and test data, and average results are reported.
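A simplified sketch of this evaluation protocol, written with scikit-learn and assuming the learned node embeddings and a binary label-indicator matrix are already available, might look as follows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

def evaluate_multilabel(features, labels, train_fraction=0.5, seed=0):
    """Train a one-vs-rest L2-regularized logistic regression on the node
    vectors and report Macro-F1 on the held-out split.

    features: (N,d) node embeddings; labels: (N,L) binary indicator matrix
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, train_size=train_fraction, random_state=seed)
    clf = OneVsRestClassifier(LogisticRegression(penalty="l2", max_iter=1000))
    clf.fit(X_tr, y_tr)
    return f1_score(y_te, clf.predict(X_te), average="macro")
```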

Results

Macro-F1 scores are used as the evaluation metric, and Table 4 shows the results. We run Network Vector with node sequences generated by the biased random walks of node2vec. The default parameter setting (p = 1, q = 1) used in DeepWalk and the optimal parameter settings of node2vec reported in [Grover and Leskovec, 2016] are used.

From the results, we can see that Network Vector outperforms node2vec using the same biased random walks, and DeepWalk using the same uniform random walks. It is evident that the global representation of the entire network allows Network Vector to exploit the global structure of the networks to learn better node representations. Network Vector achieves a slight performance gain of 1.0% over node2vec, and a significant 12.4% gain over DeepWalk, on BlogCatalog. On PPI, the gains of Network Vector over DeepWalk and node2vec are significant and similar, 9.6% and 9.7% respectively. In the case of the Wikipedia word cooccurrence network, Network Vector outperforms node2vec by a decent margin, achieving a 13.7% performance gain, with a smaller gain of 8.9% over DeepWalk. Overall, the sampling strategies of node2vec, even with optimal parameter settings (p, q), are limited to exploring the local neighborhoods of the source nodes and cannot exploit the global network structure well. Network Vector overcomes this limitation of locality: by utilizing an additional global vector to memorize the collective information from all the local neighborhoods of nodes, even those within 2 hops, Network Vector learns improved node representations.

Figure 3: Performance of Network Vector and node2vec when varying the parameter q while fixing p = ∞ to encourage reaching unvisited nodes in the random walks. Macro-F1 and Micro-F1 scores for multi-label classification with a balanced 50% train-test split are reported.

Parameter Sensitivity

In order to understand how Network Vector improves the learning of node representations with biased random walks in fine-grained settings, we evaluate performance while varying the parameter settings of (p, q). We fix p = ∞ to discourage revisiting the node sampled at the previous step of a random walk, and vary the value of q in the range from 2^−4 to 2^4 to perform DFS-like sampling to various degrees.

Figure 3 shows the comparison of Network Vector and node2vec in both Macro-F1 and Micro-F1 scores. As we can see, Network Vector consistently outperforms node2vec under different settings of q on all three datasets. However, we observe that on BlogCatalog, Network Vector achieves relatively larger gains over node2vec when q is large, so that the random walks are biased towards BFS-like sampling, compared to when q is small and the sampling is more DFS-like. This is mainly because when the random walks are biased towards nodes close to the source nodes, the global information about network structure exploited by Network Vector can compensate more for the locality of BFS-like sampling. However, when q is small, the random walks are biased towards sampling nodes far away from the source nodes, and explore information closer to the global network structure. Hence, Network Vector is not as


Figure 4: Performance of Network Vector and node2vec when varying the fraction of labeled data used for training.

helpful in this case. We see similar patterns in the performance margin between Network Vector and node2vec when q becomes large on the word cooccurrence network of Wikipedia. However, in the case of PPI, the performance gains achieved by Network Vector over node2vec are stable even when various values of q are used. The reason is probably that the biological states of proteins in a protein-protein interaction network exhibit a high degree of homophily, since proteins in a local neighborhood usually organize together to perform similar functions. Hence, the global network structure is not very informative for predicting the biological states of proteins, even when we set a large value of q.

Effect of Training Data

To see the effect of training data, we compare performance while varying the fraction of labeled data from 10% to 90%. Figure 4 shows the results on PPI. The parameters (p, q) are fixed at the optimal values (4, 1). As we can see, when more labeled data is used, the performance of node2vec and Network Vector generally increases. Network Vector achieves its largest gains over node2vec, 9.0% in Macro-F1 score and 10.3% in Micro-F1 score, at 40% labeled data. When only 10% of the labeled data is used, Network Vector yields only a 1.5% gain in Macro-F1 score and a 7.1% gain in Micro-F1 score. We have similar observations on the BlogCatalog and Wikipedia datasets; those results are not shown.

5 Conclusion

We have presented Network Vector, an algorithm for learning distributed representations of nodes and networks simultaneously. By embedding the network in a lower-dimensional vector space, our algorithm allows for quantitative comparison of networks. It also allows for the comparison of individual network nodes, since each node can be represented by its ego-network, a network containing the node itself, its network neighbors, and all connections between them.

In contrast to existing network embedding methods, which only learn representations of component nodes, Network Vector directly learns the representation of an entire network. Learning a representation of a network allows us to evaluate the similarity between two networks or two individual nodes, which enables us to answer questions that were difficult to address with existing methods. For instance, given a node in a network, for example a manager within an organization, we can identify other people serving a similar role within that organization. Also, given a connection denoting some

relationship between two people within a social network, we could find another pair in an analogous relationship. Beyond social networks, we can also answer new questions about knowledge networks that connect concepts or documents to each other, for example Wikipedia and citation networks. This can be especially useful in cases where the contents of documents are not available, for privacy or other reasons, but the network of interactions exists.

For networks in which content is available for the nodes, the learning method could be extended to account for it. For example, for knowledge networks, the approach could be combined with text to learn representations of networks that give a more fine-grained view of their similarity. Additionally, other non-textual attributes could also be included in the learning algorithm. The flexibility of such learning algorithms makes them ideal candidates for applications requiring similarity comparison of different types of objects.

References

[Bengio et al., 2003] Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155, 2003.

[Berlingerio et al., 2012] Michele Berlingerio, Danai Koutra, Tina Eliassi-Rad, and Christos Faloutsos. NetSimile: a scalable approach to size-independent network similarity. arXiv preprint arXiv:1209.2684, 2012.

[Breitkreutz et al., 2008] Bobby-Joe Breitkreutz, Chris Stark, Teresa Reguly, Lorrie Boucher, Ashton Breitkreutz, Michael Livstone, Rose Oughtred, Daniel H. Lackner, Jurg Bahler, Valerie Wood, et al. The BioGRID interaction database: 2008 update. Nucleic Acids Research, 36(suppl 1):D637-D640, 2008.

[Cao et al., 2015] Shaosheng Cao, Wei Lu, and Qiongkai Xu. GraRep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pages 891-900. ACM, 2015.

[Chang et al., 2015] Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C. Aggarwal, and Thomas S. Huang. Heterogeneous network embedding via deep architectures. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 119-128. ACM, 2015.

[Gilpin et al., 2013] Sean Gilpin, Tina Eliassi-Rad, and Ian Davidson. Guided learning for role discovery (GLRD): framework, algorithms, and applications. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 113-121. ACM, 2013.

[Grover and Leskovec, 2016] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855-864, 2016.

[Henderson et al., 2012] Keith Henderson, Brian Gallagher, Tina Eliassi-Rad, Hanghang Tong, Sugato Basu, Leman Akoglu, Danai Koutra, Christos Faloutsos, and Lei Li. RolX: structural role extraction & mining in large graphs. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1231-1239. ACM, 2012.

[Hinton, 1986] Geoffrey E. Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, volume 1, page 12. Amherst, MA, 1986.

[Le and Mikolov, 2014] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1188-1196, 2014.

[Liberzon et al., 2011] Arthur Liberzon, Aravind Subramanian, Reid Pinchback, Helga Thorvaldsdottir, Pablo Tamayo, and Jill P. Mesirov. Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12):1739-1740, 2011.

[Mahoney, 2009] Matt Mahoney. Large text compression benchmark. URL: http://www.mattmahoney.net/text/text.html, 2009.

[Marcus et al., 1993] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313-330, 1993.

[Mikolov et al., 2010] Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1045-1048, 2010.

[Mikolov et al., 2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[Mikolov et al., 2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.

[Mnih and Hinton, 2007] Andriy Mnih and Geoffrey Hinton. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, pages 641-648. ACM, 2007.

[Mnih and Kavukcuoglu, 2013] Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pages 2265-2273, 2013.

[Mnih and Teh, 2012] Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426, 2012.

[Perozzi et al., 2014] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701-710. ACM, 2014.

[Tang and Liu, 2009] Lei Tang and Huan Liu. Relational learning via latent social dimensions. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 817-826. ACM, 2009.

[Tang and Liu, 2011] Lei Tang and Huan Liu. Leveraging social media networks for classification. Data Mining and Knowledge Discovery, 23(3):447-478, 2011.

[Tang et al., 2015] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067-1077. International World Wide Web Conferences Steering Committee, 2015.

[Toutanova et al., 2003] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 173-180. Association for Computational Linguistics, 2003.

[Walker, 1977] Alastair J. Walker. An efficient method for generating discrete random variables with general distributions. ACM Transactions on Mathematical Software (TOMS), 3(3):253-256, 1977.

[Wang et al., 2016] Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1225-1234. ACM, 2016.

[Zachary, 1977] Wayne W. Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33(4):452-473, 1977.

[Zafarani and Liu, 2009] Reza Zafarani and Huan Liu. Social computing data repository at ASU, 2009.