IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, …people.seas.harvard.edu/~babis/tkde16.pdf · Theory and Algorithms Chen Chen, Hanghang Tong, B. Aditya Prakash, Charalampos

Node Immunization on Large Graphs:Theory and Algorithms

Chen Chen, Hanghang Tong, B. Aditya Prakash, Charalampos E. Tsourakakis,

Tina Eliassi-Rad, Christos Faloutsos, and Duen Horng Chau

Abstract—Given a large graph, like a computer communication network, which k nodes should we immunize (or monitor, or remove),

to make it as robust as possible against a computer virus attack? This problem, referred to as the node immunization problem, is the

core building block in many high-impact applications, ranging from public health, cybersecurity to viral marketing. A central component

in node immunization is to find the best k bridges of a given graph. In this setting, we typically want to determine the relative importance

of a node (or a set of nodes) within the graph, for example, how valuable (as a bridge) a person or a group of persons is in a social

network. First of all, we propose a novel ‘bridging’ score D�, inspired by immunology, and we show that its results agree with intuition

for several realistic settings. Since the straightforward way to compute D� is computationally intractable, we then focus on the

computational issues and propose a surprisingly efficient way (Oðnk2 þmÞ) to estimate it. Experimental results on real graphs show

that (1) the proposed ‘bridging’ score gives mining results consistent with intuition; and (2) the proposed fast solution is up to seven

orders of magnitude faster than straightforward alternatives.

Index Terms—Immunization, graph mining, scalability

Ç

1 INTRODUCTION

GIVEN a graph, we want to quickly find the k best nodesto immunize (or, equivalently, remove), to make the

remaining nodes to be most robust to the virus attack.This is the core problem for many applications: In a com-puter network intrusion setting, we want the k best nodesto defend (e.g., through expensive and extensive vigi-lance), to minimize the spread of malware. Similarly, in alaw-enforcement setting, given a network of criminals,we want to neutralize those nodes that will maximallyscatter the graph.

There are three main challenges behind this problem.First (C1. Vulnerability measure), we need a ‘Vulnerability’measure of the graph, that is, how likely/easily that a graphwill be infected by a virus. Second (C2. Shield-value), basedon the ‘Vulnerability’ measure of the entire graph, we furtherneed a measure to quantify the ‘Shield-value’ of a given set ofnodes in the graph, i.e., how important are they in terms ofmaintaining the ‘Vulnerability’ of the graph? Alternatively,how much less vulnerable will be the graph to the virus

attack, if those nodes are removed/immunized? Third (C3.Algorithms), based on the ‘Shield-value’ measure of a set ofnodes, we need an effective and scalable algorithm toquickly determine the k nodes that collectively exhibit thehighest ‘Shield-value’ score on large, disk-resident graphs.

In this paper, we aim to address these challenges inmultiple dimensions. Motivated from immunology andgraph loop/path capacity, we adopt the first1 eigenvalue� of the graph as the ‘Vulnerability’ measurement (for C1).Based on that, we propose a novel definition of the‘Shield-value’ score SvðSÞ for a specific set of nodes (forC2). By carefully using the results from the theory ofmatrix perturbation, we show that the proposed ‘Shield-value’ gives a good approximation of the correspondingeigen-drop (i.e., the decrease of the ‘Vulnerability’ mea-surement if we remove/immunize the set of nodes Sfrom the graph). Furthermore, we show that the proposed‘Shield-value’ score is sub-modular, which enables us todevelop a near-optimal and scalable algorithm (NetShield)to find a set of nodes with highest ‘Shield-value’ score (forC3). Finally, we propose a variant (NetShieldþ) to furtherbalance the optimization quality and computational cost.

The main contributions of this paper can be summa-rized as

1. A novel definition of the ‘Shield-value’ score SvðSÞ fora set of nodes, by carefully using the results from thetheory of matrix perturbation.

2. A near-optimal and scalable algorithm (NetShield) andits variant (NetShield+) to find a set of nodes withhighest ‘Shield-value’ score, by carefully using resultsfrom the theory of sub-modularity.

� C. Chen and H. Tong are with Computer Science Department, ArizonaState University, Tempe, AZ. E-mail: {cchen211, hanghang.tong}@asu.edu.

� B. Aditya Prakash is with the Virginia Tech, Blacksburg, VA.E-mail: [email protected].

� C. Tsourakakis is with the School of Engineering and Applied Science,Harvard University, Cambridge, MA. E-mail: [email protected].

� T. Eliassi-Rad is with the Computer Science Department, RutgersUniversity, Piscataway, NJ. E-mail: [email protected].

� C. Faloutsos is with the Computer Science Department, Carnegie MellonUniversity, Pittsburgh, PA. E-mail: [email protected].

� D. H. Chau is with the School of Computational Science & Engineering,Georgia Tech, Atlanta, GA. E-mail: [email protected].

Manuscript received 25 Nov. 2014; accepted 26 July 2015. Date of publication6 Aug. 2015; date of current version 3 Dec. 2015.Recommended for acceptance by R. Jin.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference the Digital Object Identifier below.Digital Object Identifier no. 10.1109/TKDE.2015.2465378

1. In this paper, the first eigenvalue means the eigenvalue with thelargest module.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO. 1, JANUARY 2016 113

1041-4347� 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

3. Extensive experiments on several real data sets, illus-trating the effectiveness and efficiency of the pro-posed methods.

The rest of the paper is organized as follows: Section 2gives the problem definitions. We present the ‘Vulnerability’measurement in Section 3. The proposed ‘Shield-value’ scoreis presented in Section 4. We address the computationalissues in Section 5 and evaluate the proposed methods inSection 6. Section 7 gives the related work, and Section 8gives the conclusions.

2 PROBLEM DEFINITIONS

Table 1 lists the main symbols we use throughout the paper.In this paper, we focus on un-directed un-weighted graphs.We represent the graph by its adjacency matrix. Followingstandard notations, we use capital bold letters for matrices(e.g., A), lower-case bold letters for vectors (e.g., a), and cal-ligraphic fonts for sets (e.g., S). We denote the transpose

with a prime (i.e., A0 is the transpose of A), and we useparenthesized superscripts to denote the correspondingvariable after deleting the nodes indexed by the super-

scripts. For example, � is the first eigenvalue of A, then �i isthe first eigenvalue of A after deleting its i(th) row/column.We use ð�i;uiÞ to denote the ith eigen-pair (sorted by themagnitude of the eigenvalue) of A. When the subscript isomitted, we refer to them as the first eigenvalue and eigen-vector respectively (i.e., � , �1 and u , u1).

With the above notations, our problems can be formallydefined as follows:

Problem 1.Measuring ‘Vulnerability’

Given: A large un-directed un-weighted connected graph Gwith adjacency matrix A;

Find: A single number VðGÞ, reflecting the ‘Vulnerability’of the whole graph.

Problem 2.Measuring ‘Shield-value’

Given: A subset S with k nodes in a large un-directed un-weighted connected graph G with adjacency matrix A;

Find: A single number SvðSÞ, reflecting the ‘Shield-value’of these k nodes (that is, the benefit of their removal/immunization to the vulnerability of the graph).

Problem 3. Finding kNodes of Best ‘Shield-value’

Given: A large un-directed un-weighted connected graph Gwith n nodes and an integer k;

Find: A subset S of k nodes with the highest ‘Shield-value’

score among all nk

� �possible subsets.

In the next three sections, we present the correspondingsolutions respectively.

3 BACKGROUND: OUR SOLUTION FOR PROBLEM 1

As mentioned in Section 1, the ultimate goal of node immu-nization problem is to contain epidemic over the network.In an epidemic network, nodes can have different statesdepending on the epidemic model. The model we simulatehere is SIS model [46]. In SIS model, each node would haveone of the following two states: susceptible and infected.Susceptible nodes can be infected by infected nodes withinfection rate b at each time stamp, and each infected nodecan get back to susceptible state with host-recovery rate d.Epidemic threshold is an intrinsic property of a network.When the strength of the virus is greater than the epidemicthreshold, then the epidemic would breakout.

Here, we begin to address Problem 1. According to [46],the epidemic thresholds of arbitrary cascade models onarbitrary networks can be determined by the largest eigen-value of network’s connectivity matrix. The intuition is that,the larger the largest eigenvalue is, the more connected thegraph is, and therefore the more vulnerable the structure isunder epidemic. Thus we suggest using the first eigenvalue� as ‘Vulnerability’ score. We should point out that it is notour main contribution to adopt � as the ‘Vulnerability’ mea-sure of a graph. Nonetheless, it is the base of our proposedsolutions for both Problem 2 and Problem 3.

3.1 ‘Vulnerability’ Score

In Problem 1, the goal is to measure the ‘Vulnerability’ of thewhole graph by a single number. We adopt the first eigen-value of the adjacency matrix A as such a measurement(eq. (1)): the larger � is, the more vulnerable the wholegraph is.

VðGÞ , �: (1)

Fig. 1 presents an example, where we have four graphswith five nodes. Intuitively, the graph becomes more andmore vulnerable from the left to the right. In other words,for a given strength of the virus attack, it is more likely thatan epidemic will break out in the graphs on the right thanthose on the left side. Therefore, the vulnerability of thegraph increases. We can see that the corresponding �increases from left to right as well. Note that ‘Vulnerability’score in this paper is not necessarily comparable between

TABLE 1Symbols

Symbol Definition and Description

A;B; . . . matrices (bold upper case)Aði; jÞ the element at the ith row and jth

column of matrix AAði; :Þ the ith row of matrix AAð:; jÞ the jth column of matrix AA0 transpose of matrix Aa;b; . . . column vectorsS; T ; . . . sets (calligraphic)

n number of nodes in the graphm number of edges in the graph

ð�i;uiÞ the ith eigen-pair of A� first eigenvalue of A (i.e., � , �1)u first eigenvector of A (i.e., u , u1)�ðiÞ, �ðSÞ first eigenvalue of A by deleting

node i (or the set of nodes in S)D�ðiÞ eigen-drop: D�ðiÞ ¼ �� ðiÞ

D�ðSÞ eigen-drop: D�ðSÞ ¼ �� ðSÞ

SvðiÞ ‘Shield-value’ score of node iSvðSÞ ‘Shield-value’ score of nodes in SVðGÞ ‘Vulnerability’ score of the graph

114 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO. 1, JANUARY 2016

graphs with different number of nodes. That means if wehave two graphs with the same ‘Vulnerability’ score but dif-ferent number of nodes, this does not necessarily meansthat they two have the same ability to contain the epidemic.

Notice that the concept of ‘Vulnerability’ is different fromvertex connectivity of the graph [20]. For ‘Vulnerability’, wewant to quantify how likily/easiy a graph will be infectedby a virus (given the strength of virus attack). Whereas forvertex connectivity, we want to quantify how difficult for agraph to be disconnected. For example, both graph (a) and(b) in Fig. 1 have the same vertex connectivity (both are 1).But graph (b) is more vulnerable to the virus attack. Alsonotice that although ‘Vulnerability’ is related to both graphdensity (i.e., average degree) and diameter, neither of themcan fully describe the ‘Vulnerability’ by itself. For example,in Fig. 1, (a) and (b) share the same density/average degreealthough (b) is more vulnerable than (a); (b) and (c) sharethe same diameter although (c) is more vulnerable than (b).

3.2 Justifications

The first eigenvalue � is a good measurement of the graph‘Vulnerability’, because of recent results on epidemic thresh-olds from immunology [7]: � is closely related to the epi-demic threshold t of a graph under a flu-like susceptible-infective-susceptible (SIS) epidemic model, and specificallyt ¼ 1=�. This means that a virus less infective than t willquickly get extinguished instead of lingering forever. There-fore, given the strength of the virus (that is, the infectionrate and the host-recovery rate), an epidemic is more likelyfor a graph with larger �.

We can also show that the first eigenvalue � is closelyrelated to the so-called loop capacity and the path capacity ofthe graph, that is, the number of loops and paths of length l(l ¼ 2; 3; . . .). If a graph has many such loops and paths, thenit is well connected, and thusmore vulnerable (i.e., it is easierfor a virus to propagate across the graph = the graph is lessrobust to virus attack). Note that although there are manyother measurements that are also related to graph connectiv-ity like second smallest eigenvalue of the LaplacianMatrix ofthe graph, they are not as directly related to epidemic thresh-old as � is, as shown in [46]. Thus, for the epidemic-like influ-ence process, � is more suitable for evaluating vulnerabilityscore than those alternativemeasurements.

4 OUR SOLUTION FOR PROBLEM 2

In this section, we focus on Problem 2. We first present oursolution, and then provide justifications.

4.1 Proposed ‘Shield-Value’ Score

In Problem 2, the goal is to quantify the importance of agiven set of nodes S, and specifically the impact of their

deletion/immunization to the ‘Vulnerability’ of the rest ofthe graph. The obvious choice is the drop in eigenvalue, oreigen-drop D� that their removal will cause to the graph. Wepropose to approximate it, to obtain efficient computations,as we will describe later. Specifically, we propose usingSvðSÞ defined as:

SvðSÞ ¼Xi2S

2�uðiÞ2 �Xi;j2S

Aði; jÞuðiÞuðjÞ: (2)

Intuitively, by eq. (2), a set of nodes S has higher ‘Shield-value’ score if (1) each of them has a high eigen-score (uðiÞ),and (2) they are dissimilar with each other (small or zeroAði; jÞ). Fig. 2 shows an example on measuring the ‘Shield-value’ score of a given set of nodes. The best k nodes found byourNetShield (which will be introduced very soon in the nextsection) are shaded. The result is consistent with intuition. InFig. 2a, it picks node 13 as best k ¼ 1 node (although nodes 1,5 and 9 have the highest degree). In Fig. 2b, deleting theshaded nodes (node 1, 5, 9 and 13) will make the graph leastvulnerable (i.e., the remaining graphs are sets of isolatednodes; and therefore it is most robust to virus attack).

4.2 Justifications

Here, we provide some justifications on the proposed‘Shield-value’ score, which is summarized in Lemma 1. Itsays that our proposed ‘Shield-value’ score SvðSÞ is a goodapproximation for the eigen-drop D�ðSÞ when deleting theset of nodes S from the original graph A.

Lemma 1. Let �ðSÞ be the (exact) first eigen-value of A, where Ais the perturbed version of A by removing all of its rows/col-umns indexed by set S. Let d ¼ �� 2 be the eigen-gap, and dbe the maximum degree of A. If � is the simple first eigen-value

of A, and d � 2ffiffiffiffiffiffiffiffi2kd

p, then

D�ðSÞ ¼ SvðSÞ þOXj2S

kAð:; jÞk2 !

; (3)

where SvðSÞ is computed by eq. (2) and D�ðSÞ ¼ �� ðSÞ.

Proof. First, let us write A as a perturbed version of the orig-inal matrix A:

A ¼ Aþ E; and E ¼ Fþ F0 þG; (4)

Fig. 1. An example of measuring ‘Vulnerability’ of the graph. Moreedges, and carefully placed, make the graph better connected, and thusmore vulnerable. Notice that the chain (a) and the star (b) have thesame number of edges, but our � score correctly considers the star asmore vulnerable.

Fig. 2. An example on measuring the ‘Shield-value’ score of a given setof nodes. The best k nodes found by our NetShield are shaded. In (a),notice that the highest degree nodes (e.g., node 1) is not chosen. In (b),immunizing the shaded nodes makes the remaining graph most robustto the virus attack.

CHEN ET AL.: NODE IMMUNIZATION ON LARGE GRAPHS: THEORY AND ALGORITHMS 115

where Fð:; jÞ ¼ �Að:; jÞ ðj 2 S and Fð:; jÞ ¼ 0 ðj =2 SÞ;Gði; jÞ ¼ Aði; jÞ ði; j 2 SÞ andGði; jÞ ¼ 0ði =2 S; or j =2 SÞ.

Since Au ¼ �u, we have

u0F0u ¼ u0Fu ¼ ðF0uÞ0u ¼ �Xj2S

�uðjÞ2

u0Gu ¼Xi;j2S

Aði; jÞuðiÞuðjÞ:(5)

Let ~� be the corresponding perturbed eigen-value of �,according to the matrix perturbation theory (p.183 [53]),we have

~� ¼ �þ u0EuþOðkEk2F Þ¼ �þ u0Fuþ u0F0uþ u0GuþOðkEk2F Þ

¼ ��Xj2S

2�uðjÞ �Xi;j2S

Aði; jÞuðiÞuðjÞ !

þOXj2S

kAð:; jÞk2 !

¼ �� SvðSÞ þOXj2S

kAð:; jÞk2 !

:

(6)

Let ~�iði ¼ 2; . . . ; nÞ be the corresponding perturbedeigen-value of �iði ¼ 2; . . . ; nÞ. Again, by the matrix per-turbation theory (p.203 [53]), we have

~� � �� kEk2 � �� kEkF � ��ffiffiffiffiffiffiffiffi2kd

p

~�i � �i þ kEk2 � �i þ kEkF � �i þffiffiffiffiffiffiffiffi2kd

p:

(7)

Since d ¼ �� 2 � 2ffiffiffiffiffiffiffiffi2kd

p, we have ~� � ~�iði ¼ 2; . . . ; nÞ.

In other words, we have �ðSÞ ¼ ~�. Therefore,

D�ðSÞ ¼ �� ðSÞ ¼ �� ~�

¼ SvðSÞ þOXj2S

kAð:; jÞk2 !

(8)

which completes the proof. tu

Notice that kEkF and kEk2 refer to the Frobeniousnorm and the l2 norm of E, respectively. The former is

defined as kEkF ¼Pn

i¼1

Pnj¼1 a

2ij, while kEk2 equals to the

largest eigenvalue of E. And the inequality kEkF > kEk2always holds.

5 OUR SOLUTION FOR PROBLEM 3

In this section, we deal with Problem 3. Here, the goal isto find a subset of k nodes with the highest ‘Shield-value’

score (among all nk

� �possible subsets). We start by show-

ing that the two straightforward methods (referred to as‘Com-Eigs’, and ‘Com-Eval’) are computationally intrac-table. Then, we present the proposed NetShield algorithmand analyze its accuracy as well as its computationalcomplexity. Finally to further balance the optimizationquality and computational cost, we propose a variant ofNetShield, NetShield+.

5.1 Challenges

There are two obviously straightforward methods forProblem 3. The first one (referred to as ‘Com-Eigs’2)works as follows: for each possible subset S, we deletethe corresponding rows/columns from the adjacencymatrix A; compute the first eigenvalue of the new per-turbed adjacency matrix; and finally output the subset ofnodes which has the smallest eigenvalue (therefore hasthe largest eigen-drop). Despite the simplicity of thisstrategy, it is computational intractable due to its combi-natorial nature. It is easy to show that the computational

complexity of ‘Com-Eigs’ is Oð nk

� ��mÞ.3 This is computa-

tionally intractable even for small graphs. For example, ina graph with 1 K nodes and 10 K edges, suppose that ittakes about 0.01 second to find its first eigenvalue. Thenwe need about 2,615 years to find the best-5 nodes withthe highest ‘Shield-value’ score!

A more reasonable (in terms of speed) way to find thebest-k nodes is to evaluate SvðSÞ, rather than to compute

the first eigenvalue �ðSÞ, nk

� �times, and pick the subset

with the highest SvðSÞ. We refer to this strategy as‘Com-Eval’. Compared with the straightforward method

(referred to as ‘Com-Eigs’, which is Oð nk

� ��mÞ); ‘Com-

Eval’ is much faster (Oð nk

� �� k2Þ). However, ‘Com-Eval’ is

still not applicable to real applications due to its combi-natorial nature. Again, in a graph with 1 K nodes and10 K edges, suppose that it only takes about 0.00001 sec-ond to evaluate SvðSÞ once. Then we still need about 3months to find the best-5 nodes with the highest ‘Shield-value’ score!

Theorem 1. K-node immunization with � is NP complete.

Proof. See the appendix. tu

5.2 Proposed NetShield Algorithm

The proposed NetShield is given in Alg. 1. In Alg. 1, we com-pute the first eigenvalue � and the corresponding eigenvec-tor u in step 1. In step 4, the n� 1 vector v measures the‘Shield-value’ score of each individual node. Then, in eachiteration of steps 6-17, we greedily select one more nodeand add it into set S according to scoreðjÞ (step 13). Notethat steps 10-12 are to exclude those nodes that are alreadyin the selected set S.

5.3 Analysis of NetShield

Here, we analyze the accuracy and efficiency of the pro-posed NetShield.

First, according to the following theorem, Alg. 1 is near-optimal wrt ‘Com-Eval’. In addition, by Lemma 1, our‘Shield-value’ score (which ‘Com-Eval’ tries to optimize) is agood approximation for the actual eigen-drop D�ðSÞ (which‘Com-Eigs’ tries to optimize). Therefore, we would expectthat Alg. 1 also gives a good approximation wrt ‘Com-Eigs’(See Section 6 for experimental validation).

2. To our best knowledge, this is the best known method to get theoptimal solution of Problem 3.

3. We assume that k is relatively small compared with n andm (e.g.,tens or hundreds). Therefore, after deleting k rows/columns from A,we still have OðmÞ edges.


Algorithm 1. NetShield

Input: the adjacency matrix A and an integer kOutput: a set S with k nodes1: compute the first eigenvalue � of A; let u be the correspond-

ing eigenvector uðjÞðj ¼ 1; . . . ; nÞ;2: initialize S to be empty;3: for j ¼ 1 to n do4: vðjÞ ¼ ð2 � ��Aðj; jÞÞ � uðjÞ2;5: end for6: for iter ¼ 1 to k do7: let B ¼ Að:;SÞ;8: let b ¼ B � uðSÞ;9: for j ¼ 1 to n do10: if j 2 S then11: let scoreðjÞ ¼ �1;12: else13: let scoreðjÞ ¼ vðjÞ � 2 � bðjÞ � uðjÞ;14: end if15: end for16: let i ¼ argmaxjscoreðjÞ, add i to set S;17: end for18: return S.

Theorem 2 Effectiveness of NetShield. Let S and ~S be thesets selected by Alg. 1 and by ‘Com-Eval’, respectively. Let

D�ðSÞ and D�ð~SÞ be the corresponding eigen-drops. Then,

D�ðSÞ � ð1� 1=eÞD�ð~SÞ.

Proof. Let I ;J ;K be three sets and I � J . Define the fol-lowing three sets based on I ;J ;K: S ¼ I [ K; T ¼J [ K; R ¼ J n I .

Substituting eq. (2), we have

SvðSÞ � SvðIÞ ¼Xi2K

2�uðiÞ2 �Xi;j2K

Aði; jÞuðiÞuðjÞ

� 2X

j2I ;i2KAði; jÞuðiÞuðjÞ

SvðT Þ � SvðJ Þ ¼Xi2K



� 2X

j2J ;i2KAði; jÞuðiÞuðjÞ:

(9)

According to Perron-Frobenius theorem, we haveuðiÞ � 0ði ¼ 1; . . . ; nÞ. Therefore,

ðSvðSÞ � SvðIÞÞ � ðSvðT Þ � SvðJ ÞÞ¼ 2

Xi2K;j2R

Aði; jÞuðiÞuðjÞ � 0

) SvðSÞ � SvðIÞ � SvðT Þ � SvðJ Þ:

(10)

Therefore, the function SvðSÞ is sub-modular.Next, we can verify that node i selected in step 16

of Alg. 1 satisfies i ¼ argmaxj =2 SSvðS [ jÞ for a fixed

set S.Next, we prove that SvðSÞ is monotonically non-

decreasing wrt S. According to eq. (9), we have

SvðSÞ � SvðIÞ ¼Xi2K



� 2X

j2I ;i2KAði; jÞuðiÞuðjÞ

�Xi2K

2�uðiÞ2 � 2X

j2S;i2KAði; jÞuðiÞuðjÞ

¼ 2Xi2K

uðiÞð�uðiÞ �Xj2S

Aði; jÞuðjÞÞ

� 2Xi2K

uðiÞð�uðiÞ �Xnj¼1

Aði; jÞuðjÞÞ

¼ 2Xi2K

uðiÞð�uðiÞ � �uðiÞÞ ¼ 0;

(11)

where the last equality is due to the definition ofeigenvalue.

Finally, it is easy to verify that SvðfÞ ¼ 0, where f is anempty set. Using the property of sub-modular func-

tions [30], we have D�ðSÞ � ð1� 1=eÞD�ð~SÞ. tu

According to Lemma 2, the computational complexity ofAlg. 1 is Oðnk2 þmÞ, which is much faster than both ‘Com-

Eigs’ (Oð nk

� ��mÞ) and ‘Com-Eval’ (Oð n

k

� �� k2Þ).

Lemma 2 (Computational Complexity of NetShield). Thecomputational complexity of Alg. 1 is Oðnk2 þmÞ.

Proof. The cost of step 1 is OðmÞ, and the cost of step 2 isconstant. For steps 3-5, its cost is OðnÞ. For each innerloop of steps 6-17, its cost is OðnÞ þOðn � iterÞ. Therefore,we have

costðNetshieldÞ ¼ OðmÞ þOðnÞ þXkiter¼1

ðnþ n � iterÞ

¼ Oðnk2 þmÞ(12)

which completes the proof. tu

Finally, according to Lemma 3, the space cost of Alg. 1 isalso efficient (i.e., linear wrt the size of the graph).

Lemma 3 (Space Cost of NetShield). The space cost of Alg. 1is Oðnþmþ kÞ.

Proof. The space cost of step 1 is Oðnþmþ 1Þ: OðmÞ forstoring the graph, OðnþmÞ for running the eigen-decomposition algorithm, OðnÞ for storing u and Oð1Þ forstoring �. The cost for step 2 is Oð1Þ. For steps 3-5, weneed an additional OðnÞ space. Then, it takes OðnÞ spacefor each inner loop (steps 6-17) and we can re-use thisspace for the next iteration. Finally, we need OðkÞ to storethe selected nodes (step 18).

Putting the above together and ignoring the constantterm, we have that the space cost of Alg. 1 isOðnþmþ kÞ, which completes the proof. tu

5.4 A Variant: NetShield+ Algorithm

Recall in Lemma 1, the eigen-gap d, max degree d and k

should satisfy d � 2ffiffiffiffiffiffiffiffi2kd

p. Given the fact that � � d, we have

d � d. Therefore we get the constraint between max degree


d and k, which can be simplified as k � d=8. The constraintimplies that in order to get a good approximation ofD�ðSÞ with SvðSÞ, the number of nodes we select toimmunize should be less than d=8, which might not holdwhen the max degree of the graph is relatively small. Toaddress this problem and further balance the optimiza-tion quality and the computational cost, we proposeNetShield+ algorithm, which is given in Alg. 2. Instead offinding out all the k nodes to delete in one round as inNetShield (i.e., compute the first eigenvalue and corre-sponding eigenvector only once), NetShield+ tries to findout those k nodes iteratively. By fixing a batch number bas an extra input, NetShield+ would pick out and delete bbest nodes for current graph at each round, and then usethe updated graph for next round of computation. Morediscussion on choosing an appropriate value of b is inSection 6. In Alg. 2, an extra variable b is provided asinput compared to NetShield. It first computes the numberof iterations t in step 1. In each iteration of steps 3-8, wefind b nodes to delete from current graph by NetShieldalgorithm and add them to S. At the end of each iteration,we update matrix A by deleting those selected nodesfrom it. The algorithm will terminate when all the k nodesare collected.

Algorithm 2. NetShield+

Input: the adjacency matrix A, two integers k and bOutput: a set S with k nodes1: compute the number of iterations t ¼ bk=bc;2: initialize S to be empty;3: for j ¼ 1 to t do4: initialize S0 to be empty;5: S0 ¼ NetShieldðA; bÞ;6: S ¼ S

SS0;

7: update A by deleting the nodes in S0;8: end for9: if k > tb then10: S0 ¼ NetShieldðA; k� tbÞ;11: S ¼ S

SS0;

12: end if13: return S.

By a similar procedure for Lemma 2, we can show thatthe time complexity of NetShield+ is Oðmk=bþ nkbÞ; and itsspace cost is the same as that of NetShield. Thus, it is still alinear algorithm wrt the size of the input graph.

6 EXPERIMENTAL EVALUATIONS

We present detailed experimental results in this section. Allthe experiments are designed to answer the followingquestions:

1: (Effectiveness) How effective is the proposed SvðSÞ inreal graphs?

2: (Efficiency) How fast and scalable is the proposedNetShield?

6.1 Data Sets

The real data sets we used are summarized in Table 2. Thefirst data set (Karate) is a unipartite graph, which describesthe friendship among the 34 members of a karate club at a

US university [65]. Each node is a member in the karate cluband the existence of the edge indicates that the two corre-sponding members are friends. Overall, we have n ¼ 34nodes andm ¼ 156 edges.

The second data set (AA) is an author-author networkfrom DBLP.4 AA is a co-authorship network, where eachnode is an author and the existence of an edge indicates theco-authorship between the two corresponding persons.Overall, we have n ¼ 418; 236 nodes and m ¼ 2; 753; 798edges. We also construct much smaller co-authorship net-works, using the authors from only one conference (e.g.,KDD, SIGIR, SIGMOD, etc.). For example, KDD is the co-authorship network for the authors in the ‘KDD’ conference.For these smaller co-authorship networks, they typicallyhave a few thousand nodes and up to a few ten thousandedges. In this graph, the node immunization algorithm canhelp us identify a set of authors who are most important interms of their influence in data mining and informationretrieval area.

The third data set (NetFlix) is from the Netflix prize.5 Thisis also a bipartite graph. We have two types of nodes: userand movie. The existence of an edge indicates that the corre-sponding user has rated the corresponding movie. Overall,we have n ¼ 2; 667; 199 nodes and m ¼ 171; 460; 874 edges.This is a bipartite graph, and we convert it to a unipartite

graph A: A ¼ ð 0 BB0 0

Þ, where 0 is a matrix with all zero

entries and B is the adjacency matrix of the bipartite graph.Like the AA data set, by our node immunization algorithm,it aims to a set of well connected users/movies.

The last is a series of data set (Oregon) from Oregonautonomous system (AS) router graphs, which are AS-level connectivity networks inferred from Oregon route-views [57]. The number of nodes in this set ranges from633 to 13,947, the corresponding edges ranges from 1,086to 30,584. The result returned by node immunizationalgorithm would be a set of most important routers inthe network to immunize when virus begins to spreadaround the Internet.

Repeatability of Experimental Results. The code for the pro-posedNetShield andNetShield+ is available in https://www.dropbox.com/s/aaq5ly4mcxhijmg/Netshieldplus.tar.

TABLE 2Summary of the Data Sets

Name n m

Karate 34 152AA 418,236 2,753,798NetFlix 2,667,199 171,460,874Oregon-A 633 1,086Oregon-B 1,503 2,810Oregon-C 2,504 4,723Oregon-D 2,854 4,932Oregon-E 3,995 7,710Oregon-F 5,296 10,097Oregon-G 7,352 15,665Oregon-H 10,860 23,409Oregon-I 13,947 30,584

4. http://www.informatik.uni-trier.de/ley/db/5. http://www.netflixprize.com/


6.2 Effectiveness

Here, we first test the approximation accuracy of the pro-posed SvðSÞ. Then, we compare different immunizationpolicies, followed by some case studies. Notice that theexperiment results of quality versus speed trade-off for theproposed NetShield, NetShield+, the optimal ‘Com-Eigs’ andthe alternative greedy method are presented in Section 6.3.

6.2.1 Approximation Quality of SvðSÞThe proposed NetShield is based on eq. (2). That is, wewant to approximate the first eigenvalue of the perturbedmatrix by � and u. By Lemma 1, it says that SvðSÞ is agood approximation for the actual eigen-drop D�ðSÞ.Here, let us experimentally evaluate how good thisapproximation is on real graphs. We construct an author-ship network from one of the following conferences:‘KDD’, ‘ICDM’, ‘SDM’, ‘SIGMOD’, ‘VLDB’, ‘NIPS’, ‘UAI’,‘SIGIR’ and ‘WWW’. We then compute the linear correla-tion coefficient between D�ðSÞ and SvðSÞ with several dif-ferent k values (k ¼ 1; 2; 5; 10; 20). The results are shown inTable 3. It can be seen that the approximation is verygood—in all the cases, the linear correlation coefficient isgreater than 0:9. Fig. 3 gives the scatter plot of D�ðSÞ (i.e.,the actual eigen-drop) versus SvðSÞ (i.e., the proposed‘Shield-value’) for k ¼ 5 on ‘ICDM’ data set.

6.2.2 Immunization by NetShield and NetShield+

Recall that the proposed ‘Vulnerability’ score of the graph ismotivated by the epidemic threshold [7]. In this paper, weprimarily use SIS model (like, e.g., the flu) in our experimentfor simplicity. Nonetheless, it has been proved that largesteigenvalue of the connectivity matrix can be used as epi-demic threshold for many other cascade models on arbi-trary networks [46].

We compare NetShield and NetShield+ with the followingalternative choices: (1) picking a random neighbor of a ran-domly chosen node [11] (‘Aquaintance’), (2) picking thenodes with the highest eigen-scores uðiÞði ¼ 1; . . . ; nÞ(‘Eigs’),6 (3) picking the nodes with the highest abnormalityscores [54] (‘abnormality’), (4) picking the nodes with thehighest betweenness centrality scores based on the shortestpath [15] (‘Short’), (5) picking the nodes with the highestbetweenness centrality scores based on random walks [39](‘N.RW’), (6) picking the nodes with the highest degrees(‘Degree’), (7) picking the nodes with the highest PageRankscores [43] (‘PageRank’) and (8) picking the nodes with

highest robustness scores [8] (‘Robust’). For each method,we delete 5 nodes for immunization. Let s ¼ � � b=d bethe normalized virus strength (bigger s means more stron-ger virus), where b and d are the infection rate and host-recovery rate, respectively. The result is presented in Fig. 4,which is averaged over 100 runs. It can be seen that the pro-posed NetShield+ and NetShield are always the best—theircurves are always the lowest which means that we alwayshave the least number of infected nodes in the graph withthis immunization strategy. Notice that the performance of‘Eigs’ is much worse than the proposed NetShield. This indi-cates that by collectively finding a set of nodes with the high-est ‘Shield-value’, we indeed obtain extra performance gain(compared with na€ıvely choosing the top-k nodes whichhave the highest individual ‘Shield-value’ scores). Fig. 5shows the statistical significance of our proposed methodon Oregon-A with the same setting as Fig. 4b. Note that asthe average numbers of infectees and variances of abnor-mality, shortest path betweenness centrality, eigen-scoresand robustness score based methods are relative large andbeyond the scope, we choose to only report the results ofother stable methods. The results on other Oregon graphsare similar to those of Oregon-A.

6.2.3 Case Studies

Next, we will show some case studies to illustrate the effec-tiveness of the proposed SvðSÞ, the ‘Shield-value’ score of asubset of nodes.

We run the proposed NetShield on AA data set and returnthe best k ¼ 200 authors. Some representative authors, toname a few, are ‘Sudhakar M. Reddy’, ‘Wei Wang’, ‘HeinrichNiemann’, ‘Srimat T. Chakradhar’, ‘Philip S. Yu’, ‘Lei Zhang’,

TABLE 3Evaluation on the Approximation Accuracy of SvðSÞ

k ‘KDD’ ‘ICDM’ ‘SDM’ ‘SIGMOD’ ‘VLDB’ ‘NIPS’ ‘UAI’ ‘SIGIR’ ‘WWW’

1 0.9519 0.9908 0.9995 1.0000 0.9548 0.9915 0.9990 0.9882 0.94382 0.9629 0.9910 0.9984 0.9927 0.9541 0.9914 0.9988 0.9673 0.94275 0.9721 0.9888 0.9992 0.9895 0.9671 0.9925 0.9987 0.9423 0.940610 0.9726 0.9863 0.9987 0.9852 0.9382 0.9924 0.9986 0.9327 0.937720 0.9683 0.9798 0.9929 0.9772 0.9298 0.9907 0.9985 0.9354 0.9288

Larger is better.

Fig. 3. Evaluation of the approximation accuracy of SvðSÞ on the ‘ICDM’graph. The proposed ‘Shield-value’ SvðSÞ (y-axis) gives a good approxi-mation for the actual eigen-drop D�ðSÞ (x-axis). Most points are on orclose to the diagonal (ideal).

6. For the un-directed graph which we focus on in this paper, ‘Eigs’is equivalent to ‘HITS’ [29].


‘Wei Li’, ‘Jiawei Han’, ‘Srinivasan Parthasarathy’, ‘SrivathsRavi’, ‘Antonis M. Paschalis’, ‘Mohammed Javeed Zaki’, ‘LeiLi’, ‘Dimitris Gizopoulos’, ‘Alberto L. Sangiovanni-Vincentelli’,‘Narayanan Vijaykrishnan’, ‘Jason Cong’, ‘Thomas S. Huang’,etc. We can make some very interesting observations fromthe result:

1. There are some multi-disciplinary people in theresult. For example, Prof. Alberto L. Sangiovanni-Vincentelli from UC Berkeley is interested in ‘designtechnology’, ‘cad’, ‘embedded systems’, and ‘formalverification’; Prof. Philip S. Yu from UIC is interestedin ‘databases’, ‘performance’, ‘distributed systems’and ‘data mining’.

2. Some people show up because they are famous inone specific area, and occasionally have one/twopapers in a remotely related area (therefore, increas-ing the path capacity between two remote areas). Forexample, Dr. Srimat T. Chakradhar mainly focuseson ‘cad’. But he has co-authored in a ‘NIPS’ paper.Therefore, he creates a critical connection betweenthese two (originally) remote areas: ‘cad’ and‘machine learning’.

3. Some people show up because they have ambiguousnames (e.g., Wei Wang, Lei Li, Lei Zhang, Wei Li,etc.). Take ‘Wei Wang’ as an example; according toDBLP,7 there are 49 different ‘Wei Wang’s. In ourexperiment, we treat all of them as one person. Thatis to say, it is equivalent to putting an artificial ‘WeiWang’ in the graph who brings 49 different ‘WeiWang’s together. These 49 ‘Wei Wang’s are infact spread out in quite different areas. (e.g., WeiWang@UNC is in ‘data mining’ and ‘bio’; WeiWang@NUS is in ‘communication’; Wei Wang@MITis in ‘non-linear systems’.)

6.3 Efficiency

We will study the wall-clock running time of the proposedNetShield and NetShield+ here. Basically, we want to answerthe following three questions:

1. (Speed) What is the speedup of the proposed Net-Shield over the straightforward methods (‘Com-Eigs’and ‘Com-Eval’)?

2. (Scalability) How does NetShield scale with the size ofthe graph (n andm) and k?

3. (Quality/Speed Trade-Off) How does NetShield andNetShield+ balance between the quality and thespeed?

For the results we report in this section, all of the experi-ments are done on the same machine with four 2.4 GHzAMD CPUs and 48 GB memory, running Linux (2.6 kernel).If the program takes more than 1,000,000 seconds, we stoprunning it.

First, we compare NetShield with ‘Com-Eigs’ and ‘Com-Eval’. Fig. 6 shows the comparison on three real data sets.We can make the following conclusions: (1) Straightforwardmethods (‘Com-Eigs’ and ‘Com-Eval’) are computationallyintractable even for a small graph. For example, on theKarate data set with only 34 nodes, it takes more than100,000 and 1,000 seconds to find the best-10 by ‘Com-Eigs’and by ‘Com-Eval’, respectively. (2) The speedup of the pro-posed NetShield over both ‘Com-Eigs’ and ‘Com-Eval’ ishuge—in most cases, we achieve several (up to 7) orders ofmagnitude speedups! (3) The speedup of the proposedNetShield over both ‘Com-Eigs’ and ‘Com-Eval’ quicklyincreases wrt the size of the graph as well as k. (4) For agiven size of the graph (fixed n and m), the wall-clock timeis almost constant—suggesting that NetShield spends mostof its running time in computing � and u.

Next, we evaluate the scalability ofNetShield. From Fig. 7,it can be seen thatNetShield scales linearly wrt both n andm,which means that it is suitable for large graphs.

Then, we evaluate how the proposed NetShield balancesbetween the quality and speed. For the Karate graph, we use

Fig. 4. Evaluation of immunization of NetShield and NetShield+ on the Oregon-A graph. The fraction of infected nodes (in log-scale) versus the timestep. s is normalized virus strength. Lower is better. The proposed NetShield and NetShield+ is always the best, leading to the fastest healing of thegraph. Best viewed in color.

Fig. 5. Average number of infectees at the end of each simulation usingdifferent methods and their corresponding variance.

7. http://www.informatik.uni-trier.de/ey/db/indices/a-tree/w/Wang:Wei.html


the proposed NetShield to find a set of k nodes and check thecorresponding eigen-drop (i.e., the decrease of the firsteigenvalue of the adjacency matrix) as well as the corre-sponding wall-clock time. We compare it with ‘Com-Eigs’,which always gives the optimal solutions (i.e., it returns thesubset that leads to the largest eigen-drop). The results(eigen-drop versus wall-clock time) are plotted in Fig. 8. Itcan been seen that NetShield gains significant of speedupover the ‘Com-Eigs’, at the cost of a small fraction of qualityloss (i.e., the green dash lines are near-flat).

We also compare the proposed NetShield with the follow-ing heuristic (referred to as ‘Greedy’): at each iteration, we re-compute the first eigenvector of the current graph and pick anode with the highest eigen-score uðiÞ; then we delete thisnode from the graph and go to the next iteration. For theNetFlix graph, we find a set of k nodes and check the corre-sponding eigen-drop as well as the corresponding wall-clocktime. The quality/speed trade-off curve is plotted in Fig. 9.From the figure,we canmake two observations: (1) the quality

of the two methods (‘Greedy’ versus the proposed NetShield)are almost the same (note that the green dash curves in theplots are always straight flat); (2) the proposed NetShield isalways faster than ‘Greedy’ (up to 103� speedup).

Finally, we evaluate how NetShield+ further balancesbetween the quality and speed. To try different batch value b,we move the experiment on a larger data set, Oregon-G. InFig. 10a, we set k to different values. For each setting of k, wechange the value of b and report the relationship betweenratio b=k wrt eigen-drop. The three lines all begins withb ¼ 1, that is b=k ¼ 0:02; 0:01; 0:005 for k ¼ 50; 100; 200respectively. Note that when b=k increases to 1, NetShield+ isreduced to NetShield. As we can see, as b increases, eigen-drop decreases, but does not make significant differences ineach setting. Fig. 10b reports the relationship between wallclock time and eigen-drop when setting different b=k ratiosand k values. Setting b ¼ 1 is very time consuming in all threecases. Howeverwhen b is increased to k=10, the time is signif-icantly reducedwhile eigen-drop still keeps relatively high.

7 RELATED WORK

In this section, we review the related work, which can becategorized into five parts: measuring the importance of

Fig. 6. Wall-clock time versus the budget k for different methods. The time is in the logarithmic scale. Our NetShield (red star) is much faster. Lower isbetter.

Fig. 7. Evaluation of the scalability of the proposed NetShield wrt. n(number of nodes) and m (number of edges), respectively. The wall-clock time of our NetShield scales linearly wrt n andm.

Fig. 8. Evaluation of the quality/speed trade off. Eigen-drop versus wall-clock time, with different budget k.The proposed NetShield (red star)achieves a good balance between eigen-drop and speed. Note that thex-axis (wall-clock time) is in logarithmic scale. The number inside theparenthesis above each green dash curve is the ratio of eigen-dropbetween NetShield and ‘Com-Eigs’. NetShield is optimal when this ratiois 1. Best viewed in color.


nodes on graphs, immunization, spectral graph analysis,influence maximization, and general graph mining.

Measuring importance of nodes on graphs. In the literature,there are a lot of node importance measurements, includingbetweenness centrality, both the one based on the shortestpath [15] and the one based on random walks [39],PageRank [43], HITS [29], and coreness score (defined byk-core decomposition) [36]. Other remotely related worksinclude the abnormality score of a given node [54], articula-tion points [20], and k-vertex cut [20]. Our ‘Shield-value’score is fundamentally different from these node importancescores, in the sense that they all aim to measure the impor-tance of an individual node; whereas our ‘Shield-value’ triesto collectively measure the importance of a set of k nodes.Despite the fact that all these existing measures are success-ful for the goal they were originally designed for, they arenot designed for the purpose of immunization. Therefore, itis not surprising that they lead to sub-optimal immuniza-tion results (See Fig. 4). Moreover, several of these impor-tance measurements do not scale up well for large graphs,being cubic or quadratic wrt the number of nodes n, even ifwe use approximations (e.g., [37]). In contrast, the proposed

NetShield is linear wrt the number of edges and the number

of nodes (Oðnk2 þmÞ). Another remotely related work isoutbreak detection [31] in the sense that both works aim toselect a subset of “important” nodes on graphs. However,the motivating applications (e.g., immunization) of thiswork is different from detecting outbreak [31] (e.g., contami-nants in water distribution network). Consequently wesolve a different optimization problem (i.e., maximize the‘Shield-value’ in eq. (2)) in this paper.

Another related topic is information diffusion. Manyworks in this domain are based on finding out the mostinfluential or critical nodes among the network to maxi-mize/minimize the spread of information as shownin [60], [22], [3]. Saito et al. [49] and Yamagishi et al. [64]give the diffusion probability model and opinion forma-tion model respectively based on node attributes. Tuliet al. [58] present an approach for selecting critical nodesfor both simple and complex contagions, with theassumption that a node can contract a contagion frommore than one neighbor. Another interesting work isabout selecting critical nodes from the network withincertain budget as in [42] and [41].

Immunization. There is vast literature on virus propaga-tion and epidemic thresholds: for full cliques (e.g.,Hethcote [24]), for power-law graphs [5], and studies ofheuristics for immunization policies [11]. The only papersthat study arbitrary graphs focus on the epidemic threshold(Wang et al. [61] and its follow-up work [16], [7], [46]). Inshort, none of the above papers solves the problem of opti-mal immunization for an arbitrary, given graph.

Tong et al. in [57] address the problem of optimizing theleading eigenvalue by edge manipulation. Prakash et al. [45]present an immunization approach of online networksbased on self-similar selection, which does not require infor-mation about network morphology at individual node level.The reverse engineering of immunization problems can bedefined as follows: given a snapshot of a graph in which aninfection has been spreading for some time, find out theoriginal seed set where the infection started. Related worksabout this topic are shown in [47] and [48]. Other relatedworks include [44], [46], [35] and [67] which study the the-ory about determining epidemic in the network, algorithmsabout effective immunization, reverse engineering andnode immunization given uncertain data.

Fig. 9. Comparison of NetShield versus ‘Greedy’. The proposed Net-Shield (red star) is better than ‘Greedy’ (i.e., faster, with the same qual-ity). Note that the x-axis (wall-clock time) is in logarithmic scale. Thenumber inside the parenthesis above each green dash curve is thespeedup of the proposed NetShield over ‘Greedy’. Best viewed in color.

Fig. 10. Evaluation of quality/speed trade off of NetShield+. Eigen-drop does not change linearly wrt computation time, it is easy to find compromisepoints where we can get considerable eigen-drop with short computation time.


Spectral graph analysis. Pioneering works in this aspectcan be traced back to Fiedler’s seminal work [14]. Represen-tative follow-up works include [12], [40], [50], [66], etc. Allof these works use the eigenvectors of the graph (or thegraph Laplacian) to find communities in the graph.

Influence maximization. Although node immunizationand influence maximization all aim to find a subset ofnodes to affect the influence spread in the graph, theyare different with each other in the sense that nodeimmunization tries to minimize the influence spread bychanging the graph structure, while influence maximiza-tion aims to choose an optimal subset of seeds to maxi-mize the ‘infected’ population. The pioneering work ininfluence maximization is from Kempe et al. [28]. Toaddress the NP-hardness of the problem, different effi-cient and scalable algorithms were proposed to approxi-mate the optimal solution for different models [9], [10],[18], [19], [52].

General graph mining. In recent years, graph mining is avery hot research topic. Representative works includepattern and law mining [1], [6], frequent substructurediscovery [26], [63], community mining and graph parti-tion [2], [27], proximity [17], [55], [56], bridgeness-baseddetection of fuzzy communities [38], the network valueof a customer [13], the bridge centrality [25], graphblocker [21], the connectivity of the small world [51] andsocial capital [32], etc. Research about sampling in graphshows that the influential individuals in the graph canbe identified by only accessing to a small portion ofnodes in the network. Also, certain sample biases arebeneficial for many applications [34], [33]. A largeamount of work is also done on analyzing the spreadingprocess of competing information, virus and etc. [4], [59],[62]. The algorithm in [23] enables within-networkand across-network classification with regional featuresof the graph.

8 CONCLUSION

We studied the node immunization problem on largereal graphs. Besides the problem definitions, our maincontributions can be summarized as the following threeperspectives. First, we proposed a novel definition of‘Shield-value’ score SvðSÞ for a set of nodes S, by care-fully using the results from the theory of matrix pertur-bation. Second, we proposed a near-optimal and scalablealgorithm (NetShield) to find a set of nodes with thehighest ‘Shield-value’ score. We further proposed its vari-ant (NetShield+) to balance the optimization quality andspeed. Finally, we conducted extensive experiments onseveral real data sets to illustrate both the effectivenessas well as the efficiency of our methods. Specifically, theproposed methods (a) give an effective immunizationstrategy (b) scale linearly with the size of the graph(number of edges) and (c) outperform competitors byseveral orders of magnitude.

Future work includes (1) to parallelize the currentmethod (e.g., using Hadoop8) and (2) to study extensionsfor additional virus propagation models, like SIR [24] etc.

APPENDIX

Proof of NP-Completeness of K-Node Immunization

Proof. We consider the decision version of the K-nodeimmunization problem as follows. tu

Problem 4. K-node immunization (Decision Version) (IMMðG; kÞ)Given: A large un-directed un-weighted connected graph

Gwith n nodes and an integer k;Find: A subset S of k nodes. By deleting S from graph G

(with adjacency matrix A), we get a new graph

GðSÞ(with adjacency matrix A), in which �ðSÞ � t.To make the problem easier, we proof that theproblem is already NP-complete when t ¼ 0.

First, we show that K-node immunization problem is inNP: given subset ðSÞ to be deleted from graph G, we can

check in poly-time if the first eigenvalue of new graph GðSÞ

is less than 0 or not.Second, we prove that K-node immunization problem is

poly-time reducible from a known NP-complete problem,i.e., the Independent Set problem(INDðG; kÞ).

Problem 5. Independent Set problem(INDðG; kÞ) Given a largeun-directed un-weighted connected graph G ¼ ðV;EÞ and anumber k > 0, is there a set of k vertices, no two of which areadjacent?

Assume the size ofG is n. Given an instance of INDðG; kÞ,we create an instance IMMðG;n� kÞ (delete n� k nodes inG such that the the first eigenvalue in new graph is less orequal to 0). We now need to prove two things:

1. If there is a YES answer to INDðG; kÞ, then there is aYES answer to IMMðG; n� kÞ.

The adjacency matrix of G which has YES answer toINDðG; kÞ is

A ¼ Sk�k Xk�ðn�kÞXk�ðn�kÞ Tðn�kÞ�ðn�kÞ

� �;

where Sk�k ¼ 0, because the k nodes in S are independent toeach other. By deleting the rest n� k nodes in T (T ¼ V =S),we have Xk�ðn�kÞ ¼ 0, Tðn�kÞ�ðn�kÞ ¼ 0. Therefore the adja-

cency matrix for new graph GðT Þ has A ¼ 0. Hence �ðT Þ ¼�(0)¼ 0. So there is a YES answer to IMMðG; n� kÞ.

2. If there is a NO answer to INDðG; kÞ, then there is aNO answer to IMMðG;n� kÞ.

Suppose we have a YES answer to IMMðG; n� kÞ. Thenby deleting n� k nodes from graph G (suppose they are in

T ), we will get new graph GðT Þ with �ðT Þ � 0where

A ¼ Sk�k 0k�ðn�kÞ0k�ðn�kÞ 0ðn�kÞ�ðn�kÞ

� �:

Since Sk�k � 0, to satisfy �ðT Þ � 0, we need to have Sk�k ¼0, which implies that all the k nodes in S are independent toeach other. The conclusion is contradict with the assump-tion that there is a NO answer to INDðG; kÞ, thereforeIMMðG;n� kÞ can only have NO answer here.

Hence K-node immunization (Decision Version) is NP-complete.8. http://hadoop.apache.org/


REFERENCES

[1] R. Albert, H. Jeong, and A.-L. Barabasi, “Diameter of the worldwide web,”Nature, vol. 401, pp. 130–131, 1999.

[2] L. Backstrom, D. P. Huttenlocher, J. M. Kleinberg, and X. Lan,“Group formation in large social networks: Membership, growth,and evolution,” in Proc. ACM SIGKDD Int. Conf. Knowl. DiscoveryData Mining, 2006, pp. 44–54.

[3] T. Berger-Wolf and H. Habiba, “Working for influence: Effectof network density and modularity on diffusion in networks,”in Proc. IEEE 11th Int. Conf. Data Mining Workshops, 2011,pp. 933–940.

[4] A. Beutel, B. A. Prakash, R. Rosenfeld, and C. Faloutsos,“Interacting viruses in networks: Can both survive?” in Proc. 18thACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2012,pp. 426–434.

[5] L. Briesemeister, P. Lincoln, and P. Porras, “Epidemic profiles anddefense of scale-free networks,” in Proc. ACMWorkshop Rapid Mal-code, Oct. 27, 2003, pp. 67–75.

[6] A. Broder, R. Kumar, F. Maghoul1, P. Raghavan, S. Rajagopalan,R. Stata, A. Tomkins, and J. Wiener, “Graph structure in the web:Experiments and models,” in Proc. World Wide Web Conf. Comput.Netw., 2000, pp. 309–320.

[7] D. Chakrabarti, Y. Wang, C. Wang, J. Leskovec, and C. Faloutsos,“Epidemic thresholds in real networks,” ACM Trans. Inf. Syst.Security, vol. 10, no. 4, p. 1, 2007.

[8] H. Chan, L. Akoglu, and H. Tong, “Make it or break it: Manipulat-ing robustness in large networks,” in Proc. SIAM Int. Conf. DataMining, 2014, pp. 325–333.

[9] W. Chen, Y. Wang, and S. Yang, “Efficient influence maximizationin social networks,” in Proc. 15th ACM SIGKDD Int. Conf. Knowl.Discovery Data Mining, 2009, pp. 199–208.

[10] W. Chen, Y. Yuan, and L. Zhang, “Scalable influence maximiza-tion in social networks under the linear threshold model,” in Proc.10th Int. Conf. Data Mining, 2010, pp. 88–97.

[11] R. Cohen, S. Havlin, and D. ben Avraham, “Efficient immuniza-tion strategies for computer networks and populations,” Phys.Rev. Lett., vol. 91, no. 24, p. 247901, Dec. 2003.

[12] C. H. Q. Ding, T. Li, and M. I. Jordan, “Nonnegative matrix factor-ization for combinatorial optimization: Spectral clustering, graphmatching, and clique finding,” in Proc. 11th Int. Conf. Data Mining,2008, pp. 183–192.

[13] P. Domingos and M. Richardson, “Mining the network value ofcustomers,” in Proc. ACM SIGKDD Int. Conf. Knowl. DiscoveryData Mining, 2001, pp. 57–66.

[14] M. Fiedler, “Algebraic connectivity of graphs,” Czechoslovak Math.J., vol. 23, no. 2, pp. 298–305, 1973.

[15] L. C. Freeman, “A set of measures of centrality based onbetweenness,” in Proc. Sociometry, 1977, pp. 35–41.

[16] A. Ganesh, E. Massouli, and D. Towsley, “The effect of networktopology on the spread of epidemics,” in Proc. IEEE INFOCOM,2005, pp. 1455–1466.

[17] F. Geerts, H. Mannila, and E. Terzi, “Relational link-basedranking,” in Proc. Int. Conf. Very Large Databases, 2004, pp. 552–563.

[18] A. Gionis, E. Terzi, and P. Tsaparas, “Opinion maximization insocial networks,” in Proc. SIAM Data Mining Conf., 2013,pp. 387–395.

[19] A. Goyal, W. Lu, and L. V. Lakshmanan, “Simpath: An effi-cient algorithm for influence maximization under the linearthreshold model,” in Proc. IEEE 11th Int. Conf. Data Mining,2011, pp. 211–220.

[20] H. Nagamochi and T. Ibaraki, Algorithmic Aspects of Graph Connec-tivity. Cambridge, U.K: Cambridge Univ. Press, 2008.

[21] H. Habiba and T. Y. Berger-Wolf, “Graph theoretic measures foridentifying effective blockers of spreading processes in dynamicnetworks,” in Proc. MLG-ICML Workshop Mach. Learn. Graphs,2008.

[22] H. Habiba, “Critical individuals in dynamic populationnetworks,” PhD dissertation, Northwestern Univ., Evanston, IL,USA, 2013.

[23] K. Henderson, B. Gallagher, L. Li, L. Akoglu, T. Eliassi-Rad, H.Tong, and C. Faloutsos, “It’s who you know: Graph mining usingrecursive structural features,” in Proc. 17th ACM SIGKDD Int.Conf. Knowl. Discovery Data Mining, 2011, pp. 663–671.

[24] H. W. Hethcote, “The mathematics of infectious diseases,” SIAMRev., vol. 42, pp. 599–653, 2000.

[25] W. Hwang, T. Kim, M. Ramanathan, and A. Zhang, “Bridgingcentrality: Graph mining from element level to group level,” inProc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2008,pp. 336–344.

[26] R. Jin, C. Wang, D. Polshakov, S. Parthasarathy, and G. Agrawal,“Discovering frequent topological structures from graph data-sets,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Min-ing, 2005, pp. 606–611.

[27] G. Karypis and V. Kumar, “Multilevel K-way hypergraph parti-tioning,” in Proc. 36th Annu. ACM/IEEE Design Automation Conf.,1999, pp. 343–348.

[28] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread ofinfluence through a social network,” in Proc. 9th ACM SIGKDDInt. Conf. Knowl. Discovery Data Mining, 2003, pp. 137–146.

[29] J. M. Kleinberg, “Authoritative sources in a hyperlinked environ-ment,” in Proc. ACM-SIAM Symp. Discrete Algorithms, 1998,pp. 668–677.

[30] A. Krause and C. Guestrin, “Near-optimal observation selectionusing submodular functions,” in Proc. 22nd Nat. Conf. Artif. Intell.,2007, pp. 1650–1654.

[31] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen,and N. S. Glance, “Cost-effective outbreak detection in networks,”in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining,2007, pp. 420–429.

[32] L. Licamele and L. Getoor, “Social capital in friendship-eventnetworks,” in Proc. Int. Conf. Data Mining, 2006, pp. 959–964.

[33] A. S. Maiya, “Sampling and inference in complex networks,” PhDdissertation, Stanford Univ., Stanford, CA, USA, 2011.

[34] A. S. Maiya and T. Y. Berger-Wolf, “Benefits of bias: Towardsbetter characterization of network sampling,” in Proc. 17thACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2011,pp. 105–113.

[35] Y. Matsubara, Y. Sakurai, B. A. Prakash, L. Li, and C. Faloutsos,“Rise and fall patterns of information diffusion: Model andimplications,” in Proc. 18th ACM SIGKDD Int. Conf. Knowl. Discov-ery Data Mining, 2012, pp. 6–14.

[36] J. Moody and D. R. White, “Social cohesion and embeddedness: Ahierarchical conception of social groups,” in Proc. Amer. Sociol.Rev., 2003, pp. 1–25.

[37] J. I. Munro and D. Wagner, “Better approximation of betweennesscentrality,” in Proc. 10th Workshop Algorithm Eng. Experiments,2008.

[38] T. Nepusz, A. Petraczi, L. Negyessy, and F. Bazso, “Fuzzy com-munities and the concept of bridgeness in complex networks,” inProc. Phys. Soc., 2007.

[39] M. Newman, “A measure of betweenness centrality based on ran-dom walks,” Soc. Netw., vol. 27, pp. 39–54, 2005.

[40] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering:Analysis and an algorithm,” in Proc. Adv. Neural Inf. Process. Syst.,2001, pp. 849–856.

[41] H. Nguyen, “Interactions on complex networks: Inference algo-rithms and applications,” PhD dissertation, Univ. of Houston,Houston, TX, USA, 2013.

[42] H. Nguyen and R. Zheng, “On budgeted influence maximizationin social networks,” IEEE J. Sel. Areas Commun., vol. 31, no. 6,pp. 1084–1094, Jun. 2013.

[43] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRankcitation ranking: Bringing order to the web,” Stanford Digital Lib.Technol. Project, Stanford, CA, USA, Tech. Rep. SIDL-WP-1999-0120 (version of 11/11/1999), 1998.

[44] B. A. Prakash, “Propagation and immunization in largenetworks,” XRDS: Crossroads, The ACM Mag. Students, vol. 19,no. 1, pp. 56–59, 2012.

[45] B. A. Prakash, L. Adamic, T. Iwashyna, H. Tong, and C. Faloutsos,“Fractional immunization in hospital-transfer graphs,”.

[46] B. A. Prakash, D. Chakrabarti, N. C. Valler, M. Faloutsos, and C.Faloutsos, “Threshold conditions for arbitrary cascade models onarbitrary networks,” Knowl. Inf. Syst., vol. 33, no. 3, pp. 549–575,2012.

[47] B. A. Prakash, J. Vreeken, and C. Faloutsos, “Spotting culprits inepidemics: How many and which ones?” in Proc. 12th IEEE/ACMInt. Conf. Data Mining, 2012, pp. 11–20.

[48] B. A. Prakash, J. Vreeken, and C. Faloutsos, “Efficiently spottingthe starting points of an epidemic in a large graph,” Knowl. Inf.Syst., vol. 38, pp. 35–59, 2014.


[49] K. Saito, K. Ohara, Y. Yamagishi, M. Kimura, and H. Motoda,“Learning diffusion probability based on node attributes in socialnetworks,” in Proc. Found. Intell. Syst., 2011, pp. 153–162.

[50] J. Shi and J. Malik, “Normalized cuts and image segmentation,” inProc. IEEE Conf. Comput. Vis. Pattern Recog., 1997, pp. 731–737.

[51] X. Shi, M. Bonner, L. A. Adamic, and A. C. Gilbert, “The verysmall world of the well-connected,” in Proc. 19th ACM Conf.Hypertext Hypermedia, 2008, pp. 61–70.

[52] Y. Singer, “How to win friends and influence people, truthfully:Influence maximization mechanisms for social networks,” in Proc.5th ACM Int. Conf. Web Search Data Mining, 2012, pp. 733–742.

[53] G. W. Stewart and J.-G. Sun,Matrix Perturbation Theory. New York,NY, USA: Academic, 1990.

[54] J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos, “Neighborhoodformation and anomaly detection in bipartite graphs,” in Proc. Int.Conf. Data Mining, 2005, pp. 418–425.

[55] H. Tong, C. Faloutsos, and J.-Y. Pan, “Fast random walk withrestart and its applications,” in Proc. Int. Conf. Data Mining, 2006,pp. 613–622.

[56] H. Tong, J. He, M. Li, W.-Y. Ma, H.-J. Zhang, and C. Zhang,“Manifold-ranking-based keyword propagation for imageretrieval,” EURASIP J. Appl. Signal Process., vol. 2006, p. 190, 2006.

[57] H. Tong, B. A. Prakash, T. Eliassi-Rad, M. Faloutsos, and C.Faloutsos, “Gelling, and melting, large graphs by edge manipula-tion,” in Proc. 21st ACM Int. Conf. Inf. Knowl. Manage., 2012,pp. 245–254.

[58] G. Tuli, C. J. Kuhlman, M. V. Marathe, S. Ravi, and D. J.Rosenkrantz, “Blocking complex contagions using communitystructure,” in Proc. Workshop Multiagent Interaction Netw., 2012.

[59] N. C. Valler, “Spreading processes on networks theory andapplications,” PhD dissertation, Univ. of California, Oakland, CA,USA, 2012.

[60] D. Wang, Z. Wen, H. Tong, C.-Y. Lin, C. Song, and A.-L. Barab�asi,“Information spreading in context,” in Proc. 20th Int. Conf. WorldWide Web, 2011, pp. 735–744.

[61] Y. Wang, D. Chakrabarti, C. Wang, and C. Faloutsos, “Epidemicspreading in real networks: An eigenvalue viewpoint,” in Proc.22nd Int. Symp. Rel. Distrib. Syst., 2003, pp. 25–34.

[62] X. Wei, N. C. Valler, B. A. Prakash, I. Neamtiu, M. Faloutsos, andC. Faloutsos, “Competing memes propagation on networks: Anetwork science perspective,” IEEE J. Sel. Areas Commun., vol. 31,no. 6, pp. 1049–1060, Jun. 2013.

[63] D. Xin, J. Han, X. Yan, and H. Cheng, “Mining compressedfrequent-pattern sets,” in Proc. Int. Conf. Very Large Databases,2005, pp. 709–720.

[64] Y. Yamagishi, K. Saito, K. Ohara, M. Kimura, and H. Motoda,“Learning attribute-weighted voter model over social networks,”J. Mach. Learn. Res.-Proc. Track, vol. 20, pp. 263–280, 2011.

[65] W. W. Zachary, “An information flow model for conflict and fis-sion in small groups,” J. Anthropol. Res., vol. 33, pp. 452–473, 1977.

[66] H. Zha, X. He, C. H. Q. Ding, M. Gu, and H. D. Simon, “Spectralrelaxation for k-means clustering,” in Proc. Adv. Neural Inf. Process.Syst., 2001, pp. 1057–1064.

[67] Y. Zhang and B. A. Prakash, “Scalable vaccine distribution in largegraphs given uncertain data,” in Proc. 23rd ACM Int. Conf. Inf.Knowl. Manage., 2014, pp. 1719–1728.

Chen Chen received the bachelor’s and master’sdegrees in computer science from Beihang Uni-versity and New York University in 2011 and2013, respectively. She is currently workingtoward the PhD degree in the School of Comput-ing, Informatics, and Decision Systems Engineer-ing, Arizona State University. Her researchinterests include large scale data mining ingraphs and real-world network analysis.

Hanghang Tong received the MSc and PhDdegrees from Carnegie Mellon University in 2008and 2009, both majored in machine learning. Heis currently an assistant professor in the Schoolof Computing, Informatics and Decision SystemsEngineering, Arizona State University. Beforethat, he was an assistant professor in the Com-puter Science Department, City College, City Uni-versity of New York, a research staff member atIBM T.J. Watson Research Center, and a post-doctoral fellow in Carnegie Mellon University. His

research interest includes large scale data mining for graphs and multi-media. He has received several awards, including Best Paper Award inCIKM 2012, SDM 2008, and ICDM 2006. He has published more than80 referred articles and more than 20 patents. He has served as a pro-gram committee member in top data mining, databases, and artificialintelligence venues.

B. Aditya Prakash received the BTech degree inCS from the Indian Institute of Technology (IIT)-Bombay in 2007 and the PhD degree from theComputer Science Department, Carnegie MellonUniversity in 2012. He is an assistant professor inthe Computer Science Department, VirginiaTech. He has published more than 28 refereedpapers in major venues, holds two U.S. patentsand has given two tutorials (VLDB 2012 andECML/PKDD 2012) at leading conferences. Hiswork has also received a Best Paper Award and

two Best-of-Conference Selections (CIKM 2012, ICDM 2012, and ICDM2011) and multiple travel awards. His research interests include datamining, applied machine learning and databases, with emphasis on big-data problems in large real-world networks and time-series. He is also amember of the Discovery Analytics Center at Virginia Tech.

Charalampos E. Tsourakakis received thediploma in electrical and engineering from theNational Technical University of Athens andthe master’s of science degree from the MachineLearning Department, Carnegie Mellon Univer-sity. He received the PhD degree in algorithms,combinatorics, and optimization from CarnegieMellon University. He is a postdoctoral fellow inthe Harvard School of Engineering and AppliedSciences. His research interests include algo-rithm design, random graphs, and data mining.

Tina Eliassi-Rad received the PhD degree incomputer sciences (with a minor in mathematicalstatistics) from the University of Wisconsin-Madi-son. She is an associate professor of computerscience at Rutgers University. Before joining aca-demia, she was a member of Technical Staff andprincipal investigator at Lawrence LivermoreNational Laboratory. Her current research lays atthe intersection of graph mining, network science,and computational social science. Within datamining and machine learning, her research has

been applied to the world-wide web, text corpora, large-scale scientificsimulation data, complex networks, fraud detection, and cyber situationalawareness. She has published more than 60 peer-reviewed papers(including a Best Paper Runner-Up Award at ICDM’09 and a Best Inter-disciplinary Paper Award at CIKM’12); and has given more than100 invited presentations. She is an action editor for the Data Mining andKnowledge Discovery Journal and a member of the editorial board for theSpringer Encyclopedia of Machine Learning and Data Mining. In 2010,she received an Outstanding Mentor Award from the US DOE Office ofScience.


Christos Faloutsos is a professor at CarnegieMellon University. He has received the Presiden-tial Young Investigator Award by the US NationalScience Foundation (1989), the Research Contri-butions Award in ICDM 2006, the SIGKDD Inno-vations Award (2010), 20 “Best Paper” Awards(including two “Test of Time” awards), and fourteaching awards. Five of his advisees haveattracted KDD or SCS dissertation awards. Hehas served as a member of the executive com-mittee of SIGKDD. He has published more than

300 refereed articles, 17 book chapters, and two monographs. He holdseight patents and he has given more than 35 tutorials and 15 invited dis-tinguished lectures. His research interests include data mining forgraphs and streams, fractals, database performance, and indexing formultimedia and bio-informatics data. He is a fellow of the ACM.

Duen Horng (Polo) Chau received the PhDdegree in machine learning and the master’sdegree in human-computer interaction (HCI). Heis an assistant professor in Georgia Tech’sSchool of Computational Science and Engineer-ing, and an associate director of the MS Analyticsprogram. His PhD thesis won Carnegie Mellon’sComputer Science Dissertation Award, Honor-able Mention. His research lab bridges data min-ing and HCI to solves large-scale, real worldproblems. They develop scalable, interactive,

and interpretable tools for big data analytics. Their patented Poloniummalware detection technology protects 120 million people worldwide.Their auction fraud detection research was widely covered by media.Their fake review detection research received the Best Student PaperAward at the 2014 SIAM Data Mining Conference. He received a YahooFaculty Research and Engagement Award, a Raytheon Faculty Fellow-ship, and LexisNexis Dean’s Excellence Award. He is the only two-timeSymantec fellow and an award-winning designer. He designed CarnegieMellon’s ID card.

" For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.


Documents

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, …people.seas.harvard.edu/~babis/tkde16.pdf · Theory and Algorithms Chen Chen, Hanghang Tong, B. Aditya Prakash, Charalampos