CPP-SNS: A Solution to Influence Maximization Problem Under Cost Control
Qianyi ZHAN, Hongchao YANG, Chongjun WANG and Junyuan XIE
Department of Computer Science and Technology, Nanjing University, Nanjing, China
Email: zhanqianyi@gmail.com, chjwang, jyxie@nju.edu.cn
Abstract—As more and more people join social networks, viral marketing on online social networks has become a new trend in advertising. Motivated by this, much research focuses on how to maximize information propagation, which is called the influence maximization problem. Traditional work has made significant progress on this topic. However, since all ad companies have a marketing budget, research on the influence maximization problem should take cost control into account.
Under the condition of cost control, we model each user's cost of helping spread information as a feature of the corresponding node in the network. We then modify several of the most widely studied algorithms to suit the new model. We propose a new algorithm called CPP-SNS, which selects seeds according to the cost performance of nodes. Further improvements, based on a partial node loading strategy and the submodular property of the spread function, make CPP-SNS more efficient in practical scenarios. Extensive experiments show that this method performs well in different social networks. Based on our results, we also provide some advice for practical marketing.
Keywords-social network; viral marketing; influence maximization; cost control;
I. INTRODUCTION
Nowadays, online social networks (OSNs) play an increasingly significant role as a medium for information spread. This trend has given birth to viral marketing, which promotes a brand or product by creating buzz or word-of-mouth effects. How to run a successful viral marketing campaign has attracted the attention of sociologists, psychologists, mathematicians and even epidemiologists. Computer scientists, in turn, use mathematical theory and computation to understand the diffusion process in social networks, and much research in this field concerns the influence maximization problem, which is the fundamental problem of viral marketing.
In the seminal paper [1], Kempe et al. defined influence
maximization as an optimization problem: A social network
is modeled as a directed graph G(V,E), where the nodes
V represent users and weighted edges E reflect influence
between users. The goal is to find a seed set S, including
k nodes, such that with a given propagation model, the
information propagation range of S is the largest. [1] also
proposed two basic stochastic influence propagation models,
the independent cascade (IC) model and the linear threshold (LT) model, both extracted from mathematical sociology.
Each node is either active or inactive in both models. In the IC model, an active node tries to activate each of its inactive neighbors independently, with a probability given by the weight of the corresponding edge. The IC model stresses individual influence among friends. In the LT model, each node has a threshold, and a node is not activated until the sum of the incoming edge weights from its active neighbors is no less than its threshold. The LT model emphasizes threshold behavior in information spreading.
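To make the IC process concrete, here is a minimal Python sketch. It is not code from the paper. It assumes the graph is stored as a dictionary that maps each node to its out-neighbors and their edge probabilities, and the names `simulate_ic`, `estimate_spread` and the parameter `runs` are illustrative choices.

```python
import random
from collections import deque


def simulate_ic(graph, seeds, rng):
    """Run one independent-cascade simulation and return the set of active nodes.

    graph maps each node to a dict {neighbor: activation probability}
    (an assumed representation, not one prescribed by the paper).
    """
    active = set(seeds)
    frontier = deque(seeds)
    while frontier:
        u = frontier.popleft()
        for v, prob in graph.get(u, {}).items():
            # A newly active node gets exactly one attempt per inactive neighbor.
            if v not in active and rng.random() < prob:
                active.add(v)
                frontier.append(v)
    return active


def estimate_spread(graph, seeds, runs=1000, seed=0):
    """Monte Carlo estimate of the spread: mean number of active nodes over `runs` simulations."""
    rng = random.Random(seed)
    total = sum(len(simulate_ic(graph, seeds, rng)) for _ in range(runs))
    return total / runs
```

The estimate becomes more stable as `runs` grows, which is also why greedy methods built on such simulations are slow.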
The influence spread function $\sigma(S)$ denotes the expected number of active nodes after propagation starting from seed set $S$. In both the IC model and the LT model, this function has two nice properties. The function $\sigma(\cdot)$ is monotone if $\sigma(A) \le \sigma(B)$ whenever $A \subseteq B$. Moreover, it is submodular if $\sigma(A \cup \{v\}) - \sigma(A) \ge \sigma(B \cup \{v\}) - \sigma(B)$ for all $A \subseteq B$ and $v \notin B$. Kempe et al. also proved that the optimization problems for both models are NP-hard.
A. Related Work
Most current solutions to the influence maximization problem are greedy algorithms. The simple greedy algorithm in [1] repeatedly chooses the node with the maximum marginal gain. It has been proved that this intuitive algorithm achieves an approximation ratio of (1 − 1/e). However, the simple greedy algorithm is rather slow and not scalable because it relies on Monte Carlo (MC) simulations to estimate influence spread. Therefore much work has been devoted to improving the simple greedy method ([2], [3]).
The CELF (Cost-Effective Lazy Forward selection) algorithm [4], proposed by Leskovec et al., is one of them. It requires less running time by reducing the number of spread estimations. CELF is reported to be 700 times faster than the simple greedy algorithm; however, it is still not fast or scalable enough in many situations.
Chen et al. [5] presented the NewGreedyIC algorithm specifically for the IC model. The main idea is to first remove the edges that do not take part in propagation, and then run simple greedy on the residual graph. NewGreedyIC improves performance, but if the method used CELF on the residual graph, its efficiency would improve significantly and it would naturally be faster than CELF.
A heuristic strategy is used in the DegreeDiscount algorithm [5], which chooses seeds based on two principles: prefer nodes with large degree, and avoid nodes that can be activated by the already chosen seeds. The latter principle is further developed as the "looking forward" strategy in [5].
2013 IEEE 25th International Conference on Tools with Artificial Intelligence
1082-3409/13 $31.00 © 2013 IEEE
DOI 10.1109/ICTAI.2013.129
Wang et al. [6] designed the CG (Community-Based Greedy) algorithm on the basis of community detection. The basic procedure is to divide the social network into several communities and to select as seeds the nodes that maximize influence within their own communities. Compared with other methods, CG combines two key problems in social network analysis: information propagation and community detection.
Since Domingos and Richardson ([7], [8]) first studied influence maximization, much progress has been made on this topic. However, most current work makes an implicit but unrealistic assumption: a node incurs no cost for being a seed. Some work takes cost into consideration, but the value is the same and fixed for all nodes. In real situations, an ad company usually has a budget for viral marketing. On the other hand, a user spreads information at the cost of time and network traffic, and an indiscreet message from a celebrity can damage his or her public image and lose fans. Therefore some users are unwilling to act as seeds even when they have been selected. These reasons show the need to study influence maximization models and algorithms under the condition of cost control.
Some work has added the element of cost to the research. For example, [9] introduced the notion of mechanism design into the influence maximization problem. Considering that each node is self-interested, an incentive compatible mechanism is designed so that each node reports its true cost. However, [9] made a clear mistake in the proof of budget control, and the mistake also affects the subsequent proofs.
B. Contributions and Roadmap
Given the lack of research on cost control in the influence maximization problem, this paper adds the factor of a node's cost to the model. The classical algorithms are modified to take the node's cost into consideration; for example, a node's cost performance is used to select seeds greedily. We also propose new methods closely related to cost. Since the time complexity of the simple greedy algorithm is too high, we propose a new algorithm, named CPP-SNS, which reduces the running time significantly without narrowing the propagation range. The improvement is based on the submodular property of the influence function and on a partial loading technique. Through extensive experiments, we not only show the effectiveness of CPP-SNS, but also derive some advice for practical viral marketing.
In Section 2, some classical algorithms are adapted to the cost control setting. In Section 3, the CPP-SNS algorithm is proposed to improve the cost performance greedy algorithm. Experiments demonstrating the performance of the different methods are presented in Section 4, and Section 5 concludes the paper.
II. SIMPLE ALGORITHMS BASED ON COST CONTROL
To solve the cost control problem, we add the node's cost to the classical model. As a result, the traditional algorithms for seed selection, including random selection and greedy selection, have to be altered. The involvement of cost also gives rise to cost based selection algorithms. In this section we focus on these three kinds of selection algorithms. Before that, we first define the influence maximization problem under the cost control condition.
A. Problem Definition
We first define the cost of a node: each node $v_i$ keeps a value $c_i$ that represents its cost of spreading information. The cost $c_i$ is generated by the cost function $cost(\vec{a}_i)$, where $\vec{a}_i$ is the feature vector of node $v_i$.
Then the influence maximization problem under the cost control condition is modeled as follows. Given a social network $G(V,E)$, a budget $b$ ($b > 0$) and the nodes' cost function $c_i = cost(\vec{a}_i)$, propagation starts from a set $S$ ($S \subseteq V$) of nodes as initial seeds, and the number of nodes activated at the end is $\sigma(S)$. The aim is to find a seed set $S^*$ such that
$$\sigma(S^*) = \max \Big\{ \sigma(S) : S \subseteq V,\ \sum_{v_i \in S} c_i \le b \Big\}$$
When $cost(\vec{a}_i) = 1$ for every node, the budget simply bounds the number of seeds and the cost control problem reduces to the traditional influence maximization problem, so the traditional problem is a special case of the cost control problem.
Since the original problem is NP-hard, it follows directly that:
Theorem 1: The influence maximization problem under the cost control condition is NP-hard.
Having described the problem, we now modify some classical algorithms to solve it.
B. Random Selection Algorithm
Although cost is added to the model, the random selection algorithm stays the same, since it ignores both a node's influence and its cost. Here simple random and repeated random are briefly described.
1) Simple random algorithm: As its name suggests, the simple random algorithm chooses nodes from the social network at random, while making sure that the total cost of the chosen nodes does not exceed the budget. For a seed $s$ chosen by this method, the probability that $s \in S^*$ (the optimal seed set) is obviously extremely low, especially when the number of nodes is large. As a result, the outcome of this simple process is not desirable, even though it requires the least runtime. The simple random algorithm cannot be used to solve real problems; however, it provides a baseline for comparison, which makes it worth discussing.
2) Repeated random algorithm: The simple random algorithm suffers from its low probability of selecting effective seeds. To overcome this disadvantage, we can repeat the process many times; with enough repetitions the probability increases and the result is much better. This is the main idea of the repeated random algorithm.
The following theorem shows the probability of getting the optimal solution using this method.
Theorem 2: The repeated random algorithm achieves the optimal solution with probability $1 - 1/e$ if the number of repetitions $N$ is large enough.
Due to space limitations, the details of the proof are not shown in this paper. It is easy to find that $N$ should be $1/p_s$ to reach that probability, where $p_s$ ($0 < p_s \le 1$) denotes the probability of obtaining the optimal solution using the simple random algorithm. In practice, however, $N$ is a huge number since $p_s$ is extremely low. Therefore the repeated random algorithm cannot be applied, especially in a large network with a large number of users.
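The paper omits the proof of Theorem 2. One possible argument, given here only as a sketch, assumes that the $N$ repetitions are independent, that each one returns the optimal solution with probability $p_s$, and that $1/p_s$ is an integer. It relies on the standard inequality $1 - x \le e^{-x}$.

```latex
\Pr[\text{at least one of the } N \text{ runs is optimal}] = 1 - (1 - p_s)^{N}

% choose N = 1/p_s and apply 1 - x \le e^{-x} with x = p_s
(1 - p_s)^{1/p_s} \le \left(e^{-p_s}\right)^{1/p_s} = e^{-1}

\Longrightarrow\quad 1 - (1 - p_s)^{1/p_s} \ge 1 - \frac{1}{e} \approx 0.632
```

For example, if $p_s = 10^{-6}$, about $10^{6}$ repetitions are needed to reach this probability, which illustrates why the method does not scale.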
C. Cost Performance Preferred Algorithm
Another classical kind of selection algorithm is greedy selection. Because of its solid mathematical foundation and good performance in optimization problems, the greedy algorithm is well studied in many related topics. The simple greedy selection algorithm chooses the node with the largest marginal gain, which can be formulated as follows:
$$v = \arg\max_{v_i \in V,\ cost(\vec{a}_i) < b^*} \sigma(S_{cur} \cup \{v_i\}) \tag{1}$$
where $V$ is the set of nodes, $S_{cur}$ denotes the current seed set and $b^*$ is the residual budget.
With the addition of the node's cost, the simple greedy algorithm is no longer appropriate. Although there is no hard evidence of a positive correlation between a node's influence and its cost, it is reasonable to assume that a node's cost reflects its influence and that large influence usually implies a high cost of spreading. As a result, it is more rational to measure a node by its cost as well as its influence. The combined factor is the ratio of the node's marginal influence gain to its cost, which we call cost performance. The seed is the node with the highest cost performance, and the selection is formulated as follows:
$$v = \arg\max_{v_i \in V,\ cost(\vec{a}_i) < b^*} \frac{\sigma(S_{cur} \cup \{v_i\}) - \sigma(S_{cur})}{cost(\vec{a}_i)} \tag{2}$$
where $V$ is the set of nodes, $S_{cur}$ denotes the current seed set and $b^*$ is the residual budget.
Algorithm 1 is the pseudo-code of the Cost Performance Preferred (CPP) algorithm.
Like the simple greedy algorithm, the cost performance preferred algorithm calls for a large number of propagation simulations, which causes its poor time efficiency. In addition, with the same budget, CPP chooses more seeds than the simple greedy algorithm, since CPP prefers nodes with low cost. This increases the number of simulations as well.
Algorithm 1 Greedy Selection Algorithm Based on Cost Performance
Input: G(V,E), b, c_i = cost(a_i), p
Output: S
  S = empty;
  budget = b;
  while budget > 0 do
    costPer = 0; seed = none;  // best cost performance found in this round
    for each v_i in V \ S with v_i.cost <= budget do
      if (inf(S + v_i) − inf(S)) / v_i.cost > costPer then
        costPer = (inf(S + v_i) − inf(S)) / v_i.cost;
        seed = v_i;  // choose the node with highest cost performance
      end if
    end for
    if seed = none then break;  // no affordable node increases the spread
    S = S + {seed};
    budget = budget − seed.cost;
  end while
Overall, the CPP algorithm needs more time than the simple greedy algorithm, which motivates us to improve it in the next section.
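The CPP selection rule can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: `spread` is any function that returns the (estimated) influence of a seed set, for instance a Monte Carlo estimator, a node is treated as affordable when its cost does not exceed the remaining budget, and the function name `cpp_greedy` is hypothetical.

```python
def cpp_greedy(nodes, cost, spread, budget):
    """Select seeds by marginal gain per unit cost (cost performance).

    nodes:  list of node ids (iterated once per round)
    cost:   dict mapping node -> positive cost
    spread: callable mapping a set of seeds to its influence value
    budget: total budget b
    """
    seeds, remaining = set(), budget
    base = spread(seeds)
    while True:
        best, best_ratio, best_value = None, 0.0, base
        for v in nodes:
            # Assumption: a node is affordable when cost <= remaining budget.
            if v in seeds or cost[v] > remaining:
                continue
            value = spread(seeds | {v})
            ratio = (value - base) / cost[v]
            if ratio > best_ratio:
                best, best_ratio, best_value = v, ratio, value
        if best is None:
            break  # nothing affordable improves the spread
        seeds.add(best)
        remaining -= cost[best]
        base = best_value
    return seeds
```

Every round calls `spread` once for each affordable non-seed node, which is the source of the poor time efficiency discussed above.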
D. Cost Based Selection Algorithm
Cost itself is an indicator of a node's influence, so it can serve as the criterion for seed choice. We list two selection methods: the high cost first (HCF) algorithm and the low cost first (LCF) algorithm.
1) High Cost First Algorithm (HCF): Nodes with large influence usually have a high cost of spreading information in viral marketing. The main reason is that a strong attitude from a famous person may cause an uproar in public opinion, and followers are also disappointed by celebrities who send obvious advertisements. Therefore, to maintain their public image, users with many followers are more cautious about the views they express. Based on this observation, we design the high cost first algorithm (HCF), which gives preference to nodes with high cost.
In this algorithm, the first step is to sort the nodes in descending order of cost; any sorting algorithm can be used. If the node with the current highest cost fits within the residual budget, it is added to the seed set. The algorithm repeats this step in a loop until the budget runs out.
The complexity of HCF is determined by the sorting algorithm, which is at least $O(n \log n)$. Therefore, the complexity of this algorithm is $O(n \log n)$, where $n$ denotes the number of nodes in the social network.
2) Low Cost First Algorithm (LCF): HCF gives priority to nodes with high cost, because the large influence of such nodes is believed to help spread information most widely. The disadvantage is that the budget is used up quickly owing to the high cost of the seeds, and too small a number of seeds also works against influence maximization. In contrast to HCF, the low cost first algorithm (LCF) gives priority to nodes with low cost, since it raises influence by enlarging the seed set.
The only difference between LCF and HCF is the sorting order of the nodes: instead of descending order, LCF sorts nodes in ascending order of cost. As with HCF, the complexity of the sorting algorithm decides the complexity of LCF, which is also $O(n \log n)$, where $n$ denotes the number of nodes in the social network.
Further discussion of HCF and LCF is given in Section 4.
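Both cost based methods fit in a few lines of Python. The sketch below assumes that costs are given as a dictionary and that a node is skipped when it does not fit the remaining budget; the function name `cost_first` and the flag `high_first` are illustrative.

```python
def cost_first(cost, budget, high_first=True):
    """HCF (high_first=True) or LCF (high_first=False) seed selection.

    cost maps node -> cost. Nodes are scanned in cost order and added
    whenever they still fit in the remaining budget.
    """
    order = sorted(cost, key=cost.get, reverse=high_first)  # the O(n log n) step
    seeds, remaining = [], budget
    for v in order:
        if cost[v] <= remaining:
            seeds.append(v)
            remaining -= cost[v]
    return seeds
```

With costs `{"a": 5, "b": 3, "c": 2, "d": 1}` and a budget of 6, HCF picks `["a", "d"]` while LCF picks `["d", "c", "b"]`, which shows how LCF trades influential seeds for a larger seed set.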
III. IMPROVEMENT ON GREEDY ALGORITHM
As mentioned above, although the classical algorithms have been modified to adapt to the cost control condition, they cannot be used directly in real applications because of their high complexity. Since the random selection and cost based selection algorithms are straightforward, we focus on improving the CPP algorithm in this section.
A. Improvement Based on Submodular Property
Before the analysis, we list some notation. The seed set after adding the $i$th seed is $S_i$. For a node $v$, $c_v$ denotes its cost and $g_v^i$ denotes its marginal gain in the selection process of the $i$th seed (the $i$th round for short).
We restate the submodular property. The influence function is submodular if $\sigma(A \cup \{v\}) - \sigma(A) \ge \sigma(B \cup \{v\}) - \sigma(B)$ for all $A \subseteq B$ and $v \notin B$. Roughly speaking, the larger the seed set is, the smaller the marginal gain the same node brings. From this property we directly get the following result: for a node $v$, $g_v^i \ge g_v^j$ if $i < j$, since $i < j$ implies $S_i \subseteq S_j$.
The cost performance of node $v$ in the $i$th round is defined as $p_v^i = g_v^i / c_v$. Consider two different nodes $v$ and $u$ that are not in the seed set before the $j$th round. In the traditional algorithm, propagation has to be simulated for each node, and the nodes' cost performance is then calculated and compared. However, if we know that $p_v^i > p_u^i$ in the $i$th round ($i < j$) and $p_v^j > p_u^i$ in the $j$th round, then there is no need to calculate $p_u^j$. The deduction is as follows. Since $g_u^i \ge g_u^j$ for $i < j$, we have
$$p_u^i = \frac{g_u^i}{c_u} \ge \frac{g_u^j}{c_u} = p_u^j$$
So under the condition $p_v^j > p_u^i$, we have $p_v^j > p_u^j$.
According to the analysis above, it is not necessary to run the simulation for every non-seed node in every selection round. The detailed procedure is as follows. When choosing the first seed, we simulate the propagation process of each node in the network and calculate its cost performance $p$. All nodes are then sorted by $p$ in descending order, and the first node is chosen as a seed. In each following round, say the $i$th, only the first node $v$ in the current array needs to be simulated. If $p_v^i = p_v^{i-1}$, node $v$ is the $i$th seed. Otherwise, update node $v$'s cost performance to $p_v^i$ and reinsert $v$ into the array, keeping the descending order. After that, take the current first node and repeat the same loop.
To choose the first seed, this method executes $|V|$ information propagation simulations. After that, in the ideal situation, only the nodes that become seeds need simulations. Therefore the total number of simulations is $|V| + |S| - 1$, which is significantly less than the $|V| \times |S|$ of the original greedy algorithm.
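A lazy variant of cost performance selection can be sketched in Python with a priority queue. This is an illustration, not the paper's code. It uses the standard lazy-forward test (the top entry is selected once its value has been recomputed in the current round), which also covers the unchanged-value case described above. The function name `cpp_lazy`, the `spread` callable and the returned evaluation counter are assumptions made for the example.

```python
import heapq


def cpp_lazy(nodes, cost, spread, budget):
    """Cost performance selection with lazy re-evaluation.

    Returns (seeds, evaluations), where evaluations counts how many
    candidate cost-performance values were computed.
    """
    seeds, remaining = set(), budget
    base = spread(seeds)
    # Heap entry: (-ratio, tie-breaker, node, round in which ratio was computed)
    heap = []
    for order, v in enumerate(nodes):
        ratio = (spread({v}) - base) / cost[v]
        heap.append((-ratio, order, v, 0))
    heapq.heapify(heap)
    evaluations = len(heap)
    rnd = 0  # number of seeds selected so far
    while heap:
        neg_ratio, order, v, stamp = heapq.heappop(heap)
        if cost[v] > remaining:
            continue  # the budget only shrinks, so v can never fit again
        if stamp == rnd:
            # Value is current; by submodularity every other entry is an upper bound.
            if -neg_ratio <= 0:
                break
            seeds.add(v)
            remaining -= cost[v]
            base = spread(seeds)
            rnd += 1
        else:
            # Stale value: recompute against the current seed set and reinsert.
            ratio = (spread(seeds | {v}) - base) / cost[v]
            evaluations += 1
            heapq.heappush(heap, (-ratio, order, v, rnd))
    return seeds, evaluations
```

On a submodular `spread`, this returns the same kind of seed set as exhaustive re-evaluation while recomputing only the entries that reach the top of the queue.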
B. Improvement on Partial Loading Strategy
The above method can reduce the number of simulations to $|V| + |S| - 1$ in the best case. According to (3), when $|V|/|S|$ is large enough, the number of simulations decreases by a factor of $|S|$. However, this is not enough, because it happens only in the ideal case. Further improvement is required.
$$\lim_{|V| \to \infty} \frac{|S| \cdot |V|}{|V| + |S| - 1} = \lim_{|V| \to \infty} \frac{|S|}{1 + \frac{|S| - 1}{|V|}} = |S| \tag{3}$$
It is easy to see that most simulations are spent on the selection of the first seed, because it is the basis for the following choices. Hence we make a compromise between simulating propagation for all nodes and obtaining each node's exact cost performance. The method is to load at the beginning only part of the nodes, those with a high probability of being seeds, and to use cost performance to choose seeds among them. The key questions for this method are which nodes should be loaded first and how to add other nodes during the selection.
A new criterion is introduced to decide which nodes are loaded first. If node $v_i$'s activation probability is $p_i$, the expected number of active nodes at the end is $N = \sum_{i=1}^{|V|} p_i$ [5]. Assume that only node $v$ has the information at first, which means the propagation starts with one node. Let $l_i$ denote the length of the shortest path between node $v_i$ and node $v$, and let $p^{l_i}$ denote the probability that node $v_i$ is activated along this shortest path, so $p^j$ is the probability that a node is activated along its shortest path of length $j$. It is obvious that $p_i \ge p^{l_i}$. Let $L = \max_{i=1}^{|V|} l_i$ and let $r_j$ be the number of nodes with $l = j$; then we have:
$$N = \sum_{i=1}^{|V|} p_i \ge \sum_{i=1}^{|V|} p^{l_i} = \sum_{j=1}^{L} r_j p^j \tag{4}$$
We use $\sum_{j=1}^{L} r_j p^j$ to indicate node $v$'s influence. This value can be calculated by counting node $v$'s neighbors, its neighbors' neighbors, and so on. Because of the "small world effect" in social networks, the length of the shortest path between two nodes is not too large, so the calculation can be completed quickly. The approximate value of cost performance is obtained after getting the approximate value of influence.
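The level-by-level count amounts to a breadth-first search. The following Python sketch assumes an adjacency-list dictionary and a single uniform propagation probability `p`; the function name `approx_influence` is illustrative.

```python
from collections import deque


def approx_influence(adj, v, p):
    """Return sum over j of r_j * p**j, where r_j is the number of nodes
    at shortest-path distance j from v (computed by breadth-first search).

    adj maps each node to an iterable of its out-neighbors.
    """
    dist = {v: 0}
    queue = deque([v])
    score = 0.0
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1   # first visit gives the shortest distance
                score += p ** dist[w]   # this node contributes p^(l_w)
                queue.append(w)
    return score
```

Dividing the returned score by the node's cost gives the approximate cost performance used to rank nodes for loading.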
After ranking the nodes by their approximate cost performance, the top $m$ nodes are loaded first. The value of $m$ is a free parameter. When $m = |V|$, this method is equivalent to the Greedy algorithm, and when $m = 1$, the method selects by the approximate value instead of the actual value of cost performance.
Now that the $m$ nodes $v_1, \dots, v_{i+m}$ have been loaded, the remaining problem is how to add other nodes to this array. Our approach is to select one seed from the array and, at the same time, load another node into the array. Which node should be loaded next? A natural choice is the node ranked $m + 1$ by the load criterion. However, since the seed set is no longer empty, if the cost of node $v_{i+m+1}$ exceeds the residual budget, $v_{i+m+1}$ has no chance of being a seed and there is no need to load it. So a more practical and quicker way is to load the highest-ranked node whose cost fits within the residual budget.
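The loading rule can be sketched as a small Python helper. It assumes the nodes have already been ranked by approximate cost performance; the name `next_to_load` and its return convention are hypothetical.

```python
def next_to_load(ranked, pos, cost, remaining):
    """Find the next node to load into the working array.

    ranked:    nodes sorted by approximate cost performance, best first
    pos:       index of the first node that has not been loaded yet
    cost:      dict mapping node -> cost
    remaining: residual budget

    Returns (index, node) for the highest-ranked unloaded node that still
    fits the budget, or (len(ranked), None) if no such node exists.
    """
    for i in range(pos, len(ranked)):
        if cost[ranked[i]] <= remaining:
            return i, ranked[i]
    return len(ranked), None
```

Skipping unaffordable nodes here avoids spending a simulation on a node that could never be selected.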
C. CPP-SNS Algorithm
By combining the two improvements given above, we propose the CPP-SNS algorithm (Cost Performance Preferred Algorithm Based on Strategy of Nodes Loading and Submodular Property). Algorithm 2 is the pseudo-code:
Algorithm 2 CPP-SNS Algorithm
Input: G(V,E), b, c_i = cost(a_i), p, m
Output: S
1: S = empty;
2: budget = b;
3: for i = 0 to |V| do
4:   get cost performance by computing $\sum_{j=1}^{L} r_j p^j$;
5: end for
6: rank nodes by approximate cost performance → cnodes;
7: cnodes[0 ∼ m − 1] → lnodes;
8: for i = 0 to m do
9:   get lnodes[i].gain/lnodes[i].cost;
10: end for
11: rank lnodes by cost performance;
12: pos = m − 1;
13: while budget > 0 do
14:   if lnodes[0].cost ≤ budget then
15:     S = S + lnodes[0];
16:     budget = budget − lnodes[0].cost;
17:   end if
18:   remove lnodes[0] from lnodes;
19:   get the position of next node → pos;
20:   get cnodes[pos].gain/cnodes[pos].cost;
21:   cnodes[pos] → lnodes;
22: end while
The node's approximate cost performance is computed in the 4th line of the CPP-SNS algorithm. To calculate the value of $\sum_{j=1}^{L} r_j p^j$ for one node, the method only needs to count its neighbor nodes level by level, which takes $O(|V|)$ time. Thus the time complexity of the calculation for all nodes is $O(|V|^2)$. Any sorting algorithm can be used to rank the nodes in the 6th line. Considering the huge number of users in a social network, we choose quicksort. The first node is removed in the 18th line after scanning, and the following nodes then move forward one step. The 19th line chooses the next node to load according to its approximate cost performance. The new node is added to the current loading array after its cost performance is updated, in the 20th and 21st lines. Binary search can be used to find the insertion position; alternatively, the comparison between the new node's cost performance and those of the nodes in the array can start from the end. This reduces the number of comparisons because the later a node is added, the smaller its cost performance should be.
This completes the description of the CPP-SNS algorithm. To evaluate its effectiveness, thorough experiments are conducted in the next section.
IV. EXPERIMENT AND ANALYSIS
Having introduced the new algorithm, we are now interested in understanding its behavior in practice and in comparing its performance with the other methods mentioned above. Through the experiments, we found that our algorithm achieves better performance in both influence spread and running time.
A. Experiment Setup
The preparation for the experiments is described below, including the algorithms used for comparison, the cost functions and the network data.
1) Comparison Algorithms: Besides the CPP-SNS algorithm, other methods, namely the Random algorithm, Repeated Random (ReRan for short), the Greedy algorithm, the CPP algorithm, the HCF algorithm and the LCF algorithm, are also tested for comparison. Another algorithm, called SNS (Algorithm Based on Strategy of Nodes Loading and Submodular Property), is worth mentioning. It combines the simple greedy algorithm with both improvements (A and B) from the previous section. Comparing the results of SNS and CPP-SNS reveals the necessity of considering cost performance.
2) The Cost Function: The key factor in the budget control problem is the cost function $c_i = cost(\vec{a}_i)$, where $\vec{a}_i$ is the feature vector of node $v_i$. It is observed that there is a positive correlation between the cost of a node and its degree, which means that $c_i = cost(d_i)$ is an increasing function of $d_i$, where $d_i$ denotes the degree of node $v_i$. Therefore we design three kinds of cost functions to describe different growth rates of cost.
• Linear cost function: v.cost = v.degree/15 + 1;
• Exponential cost function: v.cost = 1.015^(v.degree);
• Logarithmic cost function: v.cost = log(v.degree + e).
Figure 1. Propagation Range Comparison in COND-MAT
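The three cost functions translate directly into Python. The sketch assumes that "log" means the natural logarithm, which makes a node of degree 0 cost exactly 1 in all three cases; the function names are illustrative.

```python
import math


def linear_cost(degree):
    """Linear growth: degree / 15 + 1."""
    return degree / 15 + 1


def exponential_cost(degree):
    """Exponential growth: 1.015 raised to the degree."""
    return 1.015 ** degree


def logarithmic_cost(degree):
    """Logarithmic growth: ln(degree + e) (natural log is an assumption)."""
    return math.log(degree + math.e)
```

For a node of degree 300, the linear cost is 21, the exponential cost is roughly 87 and the logarithmic cost is below 6, which illustrates how differently the three curves price high-degree nodes.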
3) The Network Data: The real datasets used in the experiments are all from the Stanford Large Network Dataset Collection (footnote 1), and the details are listed in Table I. The link density differs across these social networks, which lets us observe the algorithms' performance in diverse networks.
Table I
DETAILS OF DATASET
Name        Nodes    Edges    Description
COND-MAT    108300   186936   Collaboration network of Arxiv Condensed Matter
cit-HepPh   15233    58891    Arxiv High Energy Physics paper citation network
facebook    4039     168486   Social circles from Facebook
The code is written in Java, and the experiments are run on a Windows XP machine with a 2.59 GHz Pentium(R) Dual-Core E5300 CPU and 2GB of memory.
B. Experiment Results
In the experiments, an algorithm's performance is measured by its information propagation range and running time. Each reported result is the mean of 10 runs of one algorithm on a specific network.
1) COND-MAT Network: The result for the COND-MAT network is shown in Figure 1. Overall, we find that although CPP-SNS approximates the nodes' cost performance, its final propagation range is quite close to that of CPP.
Footnote 1: http://snap.stanford.edu/data
Figure 2. Runtime Comparison in COND-MAT
With the linear cost function (Figure 1.a), the results reveal further information: the results of HCF, Greedy and SNS are close, which means that nodes with high cost have larger influence. This is because HCF, which chooses nodes according to cost, achieves results similar to Greedy only when cost represents influence well.
Moreover, the results of HCF, Greedy and SNS are worse than those of the other methods at first; however, their effectiveness rises with the spread probability. The reason is that these methods select fewer seeds than the others, but all of their seeds have high degrees. In contrast, the rise of CPP and CPP-SNS is mild, while LCF shows no clear change because it neglects the node's influence.
Figure 1.b, which shows the result for the exponential cost function, reveals some differences. Although the outcomes of HCF, Greedy and SNS are still similar, they do not gain as much advantage as in Figure 1.a when the propagation probability increases, and CPP and CPP-SNS even outperform them in the end. This is because, with cost growing fast, the number of seeds shrinks quickly. CPP-SNS still performs well and is closer to Greedy than in Figure 1.a.
Figure 1.c shows the outcome of applying the logarithmic cost function. All algorithms have close results, which demonstrates that when cost increases slowly, considering cost performance shows no clear advantage over the simple methods.
We also record the runtime of these algorithms when the propagation probability is 0.01, shown in Figure 2. We can see that Greedy, CPP and ReRan are much slower than the other methods. This can be explained by the huge number of simulations they use. SNS, which is based on Greedy, is 1000 to 2000 times faster than Greedy, and CPP-SNS reduces the running time of CPP by a factor of 400 to 700. The choice of cost function has little effect on the runtime of all methods.
Figure 3. Propagation Range Comparison in cit-HepPh
2) cit-HepPh Network: Figure 3 shows the results for the three kinds of cost functions in the cit-HepPh network. In Figure 3.a, CPP and CPP-SNS have a better propagation range than the other methods, and CPP-SNS is quite close to Greedy. As the spread probability increases, CPP and CPP-SNS show a larger improvement in propagation range and outperform Greedy and SNS. This is mainly because CPP and CPP-SNS take cost performance into account and choose more seeds.
We can also see the relation between Greedy and SNS in Figure 3.a. They share the same starting point, and as the probability increases, SNS falls behind Greedy. The poor performance of HCF in Figure 3.a illustrates that nodes with high cost may not have large influence in the cit-HepPh network. That is why SNS is not as good as Greedy. Moreover, the fast increase of ReRan's result implies that the probability of selecting a node with large influence is high. In other words, in the cit-HepPh network there is no huge difference between the nodes' influence.
The finding in Figure 3.b is consistent with Figure 3.a, while Figure 3.c is consistent with Figure 1.c. When cost increases slowly, it is not such an important factor for the node.
As for the COND-MAT network, the runtime is recorded in Figure 4. The observations are the same as those from Figure 2. Here CPP-SNS is 1200 to 2000 times faster than CPP.
Figure 4. Runtime Comparison in cit-HepPh
Figure 5. Propagation Range Comparison in facebook
3) facebook Network: The outcome for the facebook network, shown in Figure 5, reveals properties of the different algorithms. For CPP-SNS, the approximation of influence leads to a drop in its result in the end. LCF can have the largest seed set among all methods, but whatever the cost function is, LCF performs poorly. This tells us that in a strongly connected network such as facebook, nodes with small influence are unfit to be seeds.
In addition, compared with Figure 1 and Figure 3, there is only a tiny difference between the results of the other 7 algorithms, LCF excluded. The reason is that the nodes in this network are closely linked and there are many nodes with large influence. Although the seeds selected by the algorithms differ, a good propagation result can be achieved as long as some nodes with large influence are chosen. The performance of Random and ReRan also confirms this.
Figure 6. Runtime Comparison in facebook
Figure 6 presents the runtime. Greedy, CPP and ReRan take more time because the facebook network is more complex and its nodes all have large degrees. CPP-SNS is 80 to 150 times faster than CPP.
C. Advice on Viral Marketing
Based on the analysis of our experiments, we give some advice on viral marketing. First of all, a strongly connected network is a better choice for an ad service.
Once the network is given, an important problem is how to find the seed users. As mentioned before, the price of nodes with large influence is usually high. In real life, ad companies could draw a "degree-cost" growth curve to help make decisions. When the curve grows fast, the cost of nodes should not be neglected. For users with low ROI (return on investment), even if they have huge influence, it is better to think twice before selecting them. When the "degree-cost" growth is mild, ad companies could focus on the nodes with large influence. Although they may not have a desirable ROI, this is an easy and effective approach. When the "degree-cost" relation is unknown, taking ROI into account is still a rational choice because it avoids the risk.
V. CONCLUSION
This paper addresses the cost control problem in influence maximization. We incorporate the factor of a node's cost into the traditional algorithms and improve the cost performance preferred method to make it more practical. Experiments show that the CPP-SNS algorithm performs well in different networks.
Our future research will focus on how to estimate a node's cost correctly and how different factors affect the cost. The influence maximization problem under the cost control condition also needs further research.
ACKNOWLEDGMENT
This research was supported by NSFC (No. 61375069,
61105069) and Technology Foundation of Jiangsu Province
of China (No. BE2012181).
REFERENCES
[1] D. Kempe, J. Kleinberg, and E. Tardos, "Maximizing the spread of influence through a social network," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003, pp. 137–146.
[2] E. Even-Dar and A. Shapira, "A note on maximizing the spread of influence in social networks," in Internet and Network Economics, 2007, pp. 281–286.
[3] C. Budak, D. Agrawal, and A. El Abbadi, "Limiting the spread of misinformation in social networks," in Proceedings of the 20th International Conference on World Wide Web, 2011, pp. 665–674.
[4] J. Leskovec, A. Krause, C. Guestrin, et al., "Cost-effective outbreak detection in networks," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007, pp. 420–429.
[5] W. Chen, Y. Wang, and S. Yang, "Efficient influence maximization in social networks," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, pp. 199–208.
[6] Y. Wang, G. Cong, G. Song, and K. Xie, "Community-based greedy algorithm for mining top-k influential nodes in mobile social networks," in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010, pp. 1039–1048.
[7] P. Domingos and M. Richardson, "Mining the network value of customers," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001, pp. 57–66.
[8] M. Richardson and P. Domingos, "Mining knowledge-sharing sites for viral marketing," in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp. 61–70.
[9] Y. Singer, "How to win friends and influence people, truthfully: Influence maximization mechanisms for social networks," in Proceedings of the 5th ACM International Conference on Web Search and Data Mining, 2012, pp. 733–742.