Learning Social Infectivity in Sparse Low-rank Networks Using Multi-dimensional Hawkes Processes

Ke Zhou, Hongyuan Zha, Le Song
Georgia Institute of Technology

Abstract

How will the behaviors of individuals in a social network be influenced by their neighbors, the authorities and the communities in a quantitative way? Such critical and valuable knowledge is unfortunately not readily accessible, and we tend to only observe its manifestation in the form of recurrent and time-stamped events occurring at the individuals involved in the social network. It is an important yet challenging problem to infer the underlying network of social influence based on the temporal patterns of the historical events that we can observe.

In this paper, we propose a convex optimization approach to discover the hidden network of social influence by modeling the recurrent events at different individuals as multi-dimensional Hawkes processes, emphasizing the mutually-exciting nature of the dynamics of event occurrence. Furthermore, our estimation procedure, using nuclear and ℓ1 norm regularization simultaneously on the parameters, is able to take into account prior knowledge of the presence of neighbor interaction, authority influence, and community coordination in the social network. To efficiently solve the resulting optimization problem, we also design an algorithm, ADM4, which combines techniques of the alternating direction method of multipliers and majorization minimization. We experimented with both synthetic and real-world data sets, and showed that the proposed method can discover the hidden network more accurately and produce a better predictive model than several baselines.

Appearing in Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS) 2013, Scottsdale, AZ, USA. Volume 31 of JMLR: W&CP 31. Copyright 2013 by the authors.

1 INTRODUCTION

In today's explosively growing social networks such as Facebook, millions of people interact with each other in a real-time fashion. The decisions made by individuals are largely influenced by their neighbors, the authorities and various communities. For example, a recommendation from a close friend can be decisive for the purchase of a product. Thus, modeling the influences between people is a vital task for studying social networks. Equally important, the issue has gained much attention in recent years due to its wide-spread applications in e-commerce, online advertisements and so on, while the availability of large-scale historical social event datasets further fuels its rapid development.

Despite its importance, the network of social influence is usually hidden and not directly measurable. It is different from the "physical" connections in a social network, which do not necessarily indicate direct influence: some "friends" on Facebook ("physical" connection) never interact with each other (no influence). However, the network of social influence does manifest itself in the form of various time-stamped and recurrent events occurring at different individuals that are readily observable, and the dynamics of these historical events carry much information about how individuals interact and influence each other. For instance, one might decide to purchase a product the very next day after seeing many people around her buy it today, while it may take the same person a prolonged period to make the decision if no one from her community has adopted it. Consequently, we set the goal of this paper to be inferring the network of social influence from the dynamics of these historical events observed on many individuals, and producing a predictive model to answer quantitatively the question: when will this individual take an action given that some other individuals have already done so?

From a modeling perspective, we also want to take into account several key features of social influence. First, many social actions are recurrent in nature. For instance, an individual can participate in a discussion forum and post her opinions multiple times. Second, actions between interacting people are often self- and mutually-exciting: the likelihood of an individual's future participation in an event is increased if she has participated in it before, and more so if many of her neighbors have also participated in the event. Third, the network of social influence has certain topological structures. It is usually sparse, i.e., most individuals only influence a small number of neighbors, while there are a small number of hubs with wide-spread influence on many others. Moreover, people tend to form communities, with the likelihood of taking an action increased under the influence of other members of the same community (assortative, e.g., peer-to-peer relations) or under the influence of members from another specific community (disassortative, e.g., teacher-to-student relations). These topological priors give rise to the structures of the adjacency matrices corresponding to the networks: they tend to have a small number of nonzero entries and they also have sophisticated low-rank structures.

In this paper, we propose a regularized convex optimization approach to discovering the hidden network of social influence based on a multi-dimensional Hawkes process. The multi-dimensional Hawkes process captures the mutually-exciting and recurrent nature of individual behaviors, while regularization using the nuclear norm and the ℓ1 norm simultaneously on the infectivity matrix allows us to impose priors on the network topology (sparsity and low-rank structure). The advantage of our formulation is that the corresponding network discovery problem can be solved efficiently by bringing to bear a large collection of tools developed in the optimization community. In particular, we developed an algorithm, called ADM4, to solve the problem efficiently by combining the ideas of the alternating direction method of multipliers [3] and majorization minimization [8]. In our experiments on both synthetic and real-world datasets, the proposed method performs significantly better than alternatives in terms of accurately discovering the hidden network and predicting the response time of an individual.

2 RELATED WORK

We will start by summarizing related work. Estimating hidden social influence from historical events has attracted increasing attention recently. For instance, [13] proposes a hidden-Markov-based model of the influence between people, which treats time as a discrete index and hence does not lead to models predictive of the response time. The approach in [6] models the probability of a user being influenced by its neighbors with sub-modular functions, but it is not easy to incorporate recurrent events and topological priors in a principled way. In [12, 15], continuous-time models are proposed to recover sparse influence networks, but these models cannot handle recurrent events and do not take into account the low-rank network structure.

Self-exciting point processes are frequently used to model continuous-time events where the occurrence of one event increases the possibility of future events. The Hawkes process [7], an important type of self-exciting process, has been investigated for a wide range of applications such as market modeling [19], earthquake prediction [11] and crime modeling [18]. The maximum likelihood estimation of the one-dimensional Hawkes process is studied in [10] under the EM framework. Additionally, [16] models cascades of events using marked Poisson processes, with marks representing the types of events, while [2] proposes a model based on the Hawkes process for events between pairs of nodes. The novelty of our paper is the application of the multi-dimensional Hawkes process [1, 7] to the problem of discovering the hidden influence network, leveraging its connections to convex low-rank matrix factorization techniques.

Low-rank matrix factorizations have been applied to a number of real-world problems including collaborative filtering and image processing. The nuclear norm [17] has been shown to be an effective formulation for the estimation of low-rank matrices. On the other hand, ℓ1 regularization has also been applied to estimate sparse matrices [9]. Furthermore, [14] shows that the accuracy of matrix completion can be improved with both sparsity and low-rank regularizations. One contribution of our paper is to apply these matrix completion results to the social influence estimation problem; we also develop a new algorithm, ADM4, for solving the corresponding optimization problem efficiently.

3 MULTI-DIMENSIONAL HAWKES PROCESSES WITH LOW-RANK AND SPARSE STRUCTURES

3.1 One-dimensional Hawkes Processes

Before introducing multi-dimensional Hawkes processes, we first describe the one-dimensional Hawkes process briefly. In its most basic form, a one-dimensional Hawkes process is a point process N_t with its conditional intensity expressed as follows [7]:

λ(t) = µ + a ∫_{−∞}^{t} g(t − s) dN_s = µ + a ∑_{i: t_i < t} g(t − t_i),

where µ > 0 is the base intensity, the t_i are the times of events in the point process before time t, and g(t) is the decay kernel. We focus on the exponential kernel g(t) = w e^{−wt} as a concrete example in this paper, but the framework can be easily adapted to other positive kernels. In the above conditional intensity function, the sum over i with t_i < t captures the self-exciting nature of the point process: the occurrence of events in the past makes a positive contribution to the event intensity in the future. Given a sequence of events {t_i}_{i=1}^{n} observed in the time interval [0, T] and generated from the above conditional intensity, the log-likelihood function can be expressed as follows:

L = log [ ∏_{i=1}^{n} λ(t_i) / exp(∫_0^T λ(t) dt) ] = ∑_{i=1}^{n} log λ(t_i) − ∫_0^T λ(t) dt.
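As a concrete illustration of these two formulas, the following numpy sketch evaluates the conditional intensity and the log-likelihood for the exponential kernel g(t) = w e^{−wt}; the function name and interface are ours, not part of the paper.

```python
import numpy as np

def hawkes_loglik_1d(times, mu, a, w, T):
    """Log-likelihood of a 1-D Hawkes process with exponential kernel
    g(t) = w * exp(-w t), observed on [0, T]:
    L = sum_i log lambda(t_i) - int_0^T lambda(t) dt, where
    lambda(t) = mu + a * sum_{t_i < t} g(t - t_i)."""
    times = np.asarray(times, dtype=float)
    log_term = 0.0
    for i, t in enumerate(times):
        past = times[:i]  # events strictly before t_i
        lam = mu + a * np.sum(w * np.exp(-w * (t - past)))
        log_term += np.log(lam)
    # Compensator in closed form: int_0^T lambda = mu*T + a * sum_i G(T - t_i),
    # with G(t) = int_0^t g(s) ds = 1 - exp(-w t) for the exponential kernel.
    compensator = mu * T + a * np.sum(1.0 - np.exp(-w * (T - times)))
    return log_term - compensator
```

With a = 0 the process reduces to a homogeneous Poisson process and the log-likelihood collapses to ∑ log µ − µT, which is a convenient sanity check.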

3.2 Multi-dimensional Hawkes Processes

In order to model social influence, the one-dimensional Hawkes process discussed above needs to be extended to the multi-dimensional case. Specifically, we have U Hawkes processes that are coupled with each other: each of the Hawkes processes corresponds to an individual, and the influence between individuals is explicitly modeled. Formally, the multi-dimensional Hawkes process is defined by a U-dimensional point process N_t^u, u = 1, ..., U, with the conditional intensity for the u-th dimension expressed as follows:

λ_u(t) = µ_u + ∑_{i: t_i < t} a_{u u_i} g(t − t_i),

where µ_u ≥ 0 is the base intensity for the u-th Hawkes process. The coefficient a_{uu′} ≥ 0 captures the mutually-exciting property between the u-th and u′-th dimensions. Intuitively, it captures the degree of influence of events occurring in the u′-th dimension on the u-th dimension: a larger value of a_{uu′} indicates that events in the u′-th dimension are more likely to trigger an event in the u-th dimension in the future. We collect the parameters into matrix-vector form, µ = (µ_u) for the base intensities and A = (a_{uu′}) for the mutually-exciting coefficients, called the infectivity matrix. We use A ≥ 0 and µ ≥ 0 to indicate that both are required to be entry-wise nonnegative.

Suppose we have m samples, {c_1, ..., c_m}, from the multi-dimensional Hawkes process. Each sample c is a sequence of events observed during a time period [0, T_c], in the form {(t_i^c, u_i^c)}_{i=1}^{n_c}. Each pair (t_i^c, u_i^c) represents an event occurring in the u_i^c-th dimension at time t_i^c. Thus, the log-likelihood of the model parameters Θ = {A, µ} can be expressed as follows:

L(A, µ) = ∑_c ( ∑_{i=1}^{n_c} log λ_{u_i^c}(t_i^c) − ∑_{u=1}^{U} ∫_0^{T_c} λ_u(t) dt )

        = ∑_c [ ∑_{i=1}^{n_c} log ( µ_{u_i^c} + ∑_{j: t_j^c < t_i^c} a_{u_i^c u_j^c} g(t_i^c − t_j^c) ) − T_c ∑_{u=1}^{U} µ_u − ∑_{u=1}^{U} ∑_{j=1}^{n_c} a_{u u_j^c} G(T_c − t_j^c) ],

where G(t) = ∫_0^t g(s) ds. In general, the parameters A and µ can be estimated by maximizing the log-likelihood, i.e., min_{A≥0, µ≥0} −L(A, µ).
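The multi-dimensional log-likelihood of a single sequence can be computed directly from this formula; a minimal numpy sketch for the exponential kernel (names and 0-based dimension indexing are our own choices):

```python
import numpy as np

def hawkes_loglik(times, dims, mu, A, w, T):
    """Log-likelihood of one observed sequence from a multi-dimensional
    Hawkes process with exponential kernel g(t) = w*exp(-w*t).

    times : event times t_i in increasing order
    dims  : event dimensions u_i (0-based)
    mu    : (U,) base intensities;  A : (U, U) infectivity matrix a_{uu'}
    """
    times, dims = np.asarray(times, float), np.asarray(dims, int)
    ll = 0.0
    for i, (t, u) in enumerate(zip(times, dims)):
        # lambda_u(t_i) = mu_u + sum_{j < i} a_{u, u_j} g(t_i - t_j)
        lam = mu[u] + np.sum(A[u, dims[:i]] * w * np.exp(-w * (t - times[:i])))
        ll += np.log(lam)
    # Compensator: T * sum_u mu_u + sum_u sum_j a_{u, u_j} G(T - t_j)
    G = 1.0 - np.exp(-w * (T - times))           # G(t) = 1 - exp(-w t)
    ll -= T * mu.sum() + np.sum(A[:, dims] @ G)
    return ll
```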

3.3 Sparse and Low-Rank Regularization

As mentioned earlier, we would like to take into account the structure of the social influence in the proposed model. We focus on two important properties of social influences: sparsity and low rank. The sparsity of social influences implies that most individuals only influence a small fraction of users in the network, while there can be a few hubs with wide-spread influence. This can be reflected in the sparsity pattern of A. Furthermore, the community structure in the influence network implies low-rank structure, which can also be reflected in the matrix A. Thus, we incorporate this prior knowledge by imposing both low-rank and sparse regularization on A. That is, we regularize our maximum likelihood estimator with

min_{A≥0, µ≥0} −L(A, µ) + λ_1 ‖A‖_* + λ_2 ‖A‖_1,   (1)

where ‖A‖_* is the nuclear norm of the matrix A, defined to be the sum of its singular values ∑_{i=1}^{rank A} σ_i. The nuclear norm has been used to estimate low-rank matrices effectively [17]. Moreover, ‖A‖_1 = ∑_{u,u′} |a_{uu′}| is the ℓ1 norm of the matrix A, which is used to enforce the sparsity of A. The parameters λ_1 and λ_2 control the strength of the two regularization terms.
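For reference, both regularizers in (1) are cheap to evaluate with standard linear-algebra routines; a small illustrative helper (not code from the paper):

```python
import numpy as np

def regularizer(A, lam1, lam2):
    """lam1 * nuclear norm (sum of singular values) + lam2 * l1 norm of A."""
    nuclear = np.linalg.svd(A, compute_uv=False).sum()
    l1 = np.abs(A).sum()
    return lam1 * nuclear + lam2 * l1
```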

4 EFFICIENT OPTIMIZATION

It can be observed that the objective function in Equation (1) is non-differentiable and thus difficult to optimize in general. We apply the idea of the alternating direction method of multipliers (ADMM) [5] to convert the optimization problem into several sub-problems that are easier to solve. ADMM has been shown to be a special case of the more general Douglas-Rachford splitting method, which has good convergence properties under some fairly mild conditions [4].


Specifically, we first rewrite the optimization problem in Equation (1) in an equivalent form by introducing two auxiliary variables Z_1 and Z_2:

min_{A≥0, µ≥0, Z_1, Z_2} −L(A, µ) + λ_1 ‖Z_1‖_* + λ_2 ‖Z_2‖_1,   (2)
s.t. A = Z_1, A = Z_2.

In ADMM, we optimize the augmented Lagrangian of the above problem, which can be expressed as follows:

L_ρ = −L(A, µ) + λ_1 ‖Z_1‖_* + λ_2 ‖Z_2‖_1 + ρ trace(U_1^T (A − Z_1)) + ρ trace(U_2^T (A − Z_2)) + (ρ/2)(‖A − Z_1‖² + ‖A − Z_2‖²),

where ρ > 0 is called the penalty parameter and ‖·‖ denotes the Frobenius norm. The matrices U_1 and U_2 are the dual variables associated with the constraints A = Z_1 and A = Z_2, respectively.

The algorithm for solving the above augmented Lagrangian problem involves the following key iterative steps (see also the details in the Appendix):

A^{k+1}, µ^{k+1} = argmin_{A≥0, µ≥0} L_ρ(A, µ, Z_1^k, Z_2^k, U_1^k, U_2^k),   (3)

Z_1^{k+1} = argmin_{Z_1} L_ρ(A^{k+1}, µ^{k+1}, Z_1, Z_2^k, U_1^k, U_2^k),   (4)

Z_2^{k+1} = argmin_{Z_2} L_ρ(A^{k+1}, µ^{k+1}, Z_1^{k+1}, Z_2, U_1^k, U_2^k),   (5)

U_1^{k+1} = U_1^k + (A^{k+1} − Z_1^{k+1}),
U_2^{k+1} = U_2^k + (A^{k+1} − Z_2^{k+1}).

The advantage of these sequential updates is that the variables are separated and can be optimized one at a time. We first consider the optimization problems for Z_1 and Z_2, and then describe the algorithm used to optimize with respect to A and µ.
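To make the splitting concrete, here is a self-contained sketch of the iteration (3)-(5). Since the full Hawkes likelihood needs event data, we substitute a simple quadratic data-fit term 0.5·‖A − M‖² for −L(A, µ); this substitution, and all names, are ours. The Z-updates and dual updates follow the scheme above.

```python
import numpy as np

def soft_threshold(X, t):
    """Entry-wise shrinkage: prox of the l1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def svd_threshold(X, t):
    """Singular-value shrinkage: prox of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

def adm4_skeleton(M, lam1, lam2, rho=1.0, iters=200):
    """ADMM splitting as in (3)-(5), with the Hawkes likelihood replaced
    by a quadratic fit 0.5*||A - M||^2 for illustration only."""
    A = np.zeros_like(M); Z1 = A.copy(); Z2 = A.copy()
    U1 = np.zeros_like(M); U2 = np.zeros_like(M)
    for _ in range(iters):
        # A-step: argmin_A 0.5||A-M||^2 + rho/2(||A-Z1+U1||^2 + ||A-Z2+U2||^2)
        A = (M + rho * (Z1 - U1) + rho * (Z2 - U2)) / (1.0 + 2.0 * rho)
        A = np.maximum(A, 0.0)                  # A >= 0 constraint
        Z1 = svd_threshold(A + U1, lam1 / rho)  # nuclear-norm prox
        Z2 = soft_threshold(A + U2, lam2 / rho) # l1 prox
        U1 += A - Z1                            # dual updates
        U2 += A - Z2
    return A
```

In ADM4 proper, the quadratic A-step is replaced by the majorization-minimization solver of Section 4.2, but the outer loop has exactly this shape.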

4.1 Solving for Z1 and Z2.

When solving for Z_1 in Equation (4), the relevant terms from L_ρ are

λ_1 ‖Z_1‖_* + ρ trace((U_1^k)^T (A^{k+1} − Z_1)) + (ρ/2) ‖A^{k+1} − Z_1‖²,

which can be simplified to the equivalent problem

Z_1^{k+1} = argmin_{Z_1} λ_1 ‖Z_1‖_* + (ρ/2) ‖A^{k+1} − Z_1 + U_1^k‖².

The above problem has the closed-form solution

Z_1^{k+1} = S_{λ_1/ρ}(A^{k+1} + U_1^k),   (6)

where S_α(X) is the singular-value soft-thresholding function defined as S_α(X) = U diag((σ_i − α)_+) V^T for any matrix X with singular value decomposition X = U diag(σ_i) V^T.

Similarly, the optimization for Z_2 can be simplified into the equivalent form

Z_2^{k+1} = argmin_{Z_2} λ_2 ‖Z_2‖_1 + (ρ/2) ‖A^{k+1} − Z_2 + U_2^k‖²,

which again has a closed-form solution. In this case, depending on the magnitude of the ij-th entry of the matrix (A^{k+1} + U_2^k), the corresponding entry (Z_2^{k+1})_{ij} is updated as

(Z_2^{k+1})_{ij} =
  (A^{k+1} + U_2^k)_{ij} − λ_2/ρ,   if (A^{k+1} + U_2^k)_{ij} ≥ λ_2/ρ,
  (A^{k+1} + U_2^k)_{ij} + λ_2/ρ,   if (A^{k+1} + U_2^k)_{ij} ≤ −λ_2/ρ,   (7)
  0,                                 if |(A^{k+1} + U_2^k)_{ij}| < λ_2/ρ.

4.2 Solving for A and µ

The optimization problem for A and µ defined in Equation (3) can be equivalently written as

A^{k+1}, µ^{k+1} = argmin_{A≥0, µ≥0} f(A, µ),

where f(A, µ) = −L(A, µ) + (ρ/2)(‖A − Z_1^k + U_1^k‖² + ‖A − Z_2^k + U_2^k‖²). We propose to solve this problem with a majorization-minimization algorithm, a generalization of the EM algorithm. Since the optimization is convex, we still obtain the global optimum of this subproblem. Specifically, given any estimates A^{(m)} and µ^{(m)} of A and µ, we minimize a surrogate function Q(A, µ; A^{(m)}, µ^{(m)}) that is a tight upper bound of f(A, µ). Indeed, Q(A, µ; A^{(m)}, µ^{(m)}) can be defined as follows:

Q(A, µ; A^{(m)}, µ^{(m)}) = −∑_c ∑_{i=1}^{n_c} [ p_{ii}^c log ( µ_{u_i^c} / p_{ii}^c ) + ∑_{j=1}^{i−1} p_{ij}^c log ( a_{u_i^c u_j^c} g(t_i^c − t_j^c) / p_{ij}^c ) ]
  + ∑_c [ T_c ∑_{u=1}^{U} µ_u + ∑_{u=1}^{U} ∑_{j=1}^{n_c} a_{u u_j^c} G(T_c − t_j^c) ]
  + (ρ/2)(‖A − Z_1^k + U_1^k‖² + ‖A − Z_2^k + U_2^k‖²),   (8)

where

p_{ii}^c = µ_{u_i^c}^{(m)} / ( µ_{u_i^c}^{(m)} + ∑_{j=1}^{i−1} a_{u_i^c u_j^c}^{(m)} g(t_i^c − t_j^c) ),

p_{ij}^c = a_{u_i^c u_j^c}^{(m)} g(t_i^c − t_j^c) / ( µ_{u_i^c}^{(m)} + ∑_{j=1}^{i−1} a_{u_i^c u_j^c}^{(m)} g(t_i^c − t_j^c) ).

Intuitively, p_{ij}^c can be interpreted as the probability that the i-th event is influenced by a previous event j in the network, and p_{ii}^c is the probability that the i-th event is sampled from the base intensity. Thus, the first two terms of Q(A, µ; A^{(m)}, µ^{(m)}) can be viewed as the joint probability of the unknown infectivity structures and the observed events.

Algorithm 1 ADMM-MM (ADM4) for estimating A and µ
  Input: Observed samples {c_1, ..., c_m}.
  Output: A and µ.
  Initialize µ and A randomly; set U_1 = 0, U_2 = 0.
  for k = 1, 2, ... do
    Update A^{k+1} and µ^{k+1} by optimizing Q defined in (8) as follows:
      while not converged do
        Update A and µ using (10) and (9), respectively.
      end while
    Update Z_1^{k+1} using (6); update Z_2^{k+1} using (7).
    Update U_1^{k+1} = U_1^k + (A^{k+1} − Z_1^{k+1}) and U_2^{k+1} = U_2^k + (A^{k+1} − Z_2^{k+1}).
  end for
  return A and µ.

As is further shown in the Appendix, optimizing Q(A, µ; A^{(m)}, µ^{(m)}) ensures that f(A, µ) decreases monotonically. Moreover, the advantage of optimizing Q(A, µ) is that the parameters A and µ can be solved for independently of each other with closed-form solutions, and the nonnegativity constraints are automatically taken care of. That is,

µ_u^{(m+1)} = ( ∑_c ∑_{i: i ≤ n_c, u_i^c = u} p_{ii}^c ) / ( ∑_c T_c ),   (9)

a_{uu′}^{(m+1)} = ( −B + √(B² + 8ρC) ) / (4ρ),   (10)

where

B = ∑_c ∑_{j: u_j^c = u′} G(T_c − t_j^c) + ρ(−z_{1,uu′} + u_{1,uu′} − z_{2,uu′} + u_{2,uu′}),

C = ∑_c ∑_{i: u_i^c = u} ∑_{j < i: u_j^c = u′} p_{ij}^c,

with z_{1,uu′}, u_{1,uu′}, z_{2,uu′}, u_{2,uu′} denoting the (u, u′)-th entries of Z_1^k, U_1^k, Z_2^k, U_2^k, respectively.

The overall optimization algorithm is summarized inAlgorithm 1.
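The building blocks of the inner MM loop can be written down directly; the sketch below computes the branching probabilities p_{ii}^c, p_{ij}^c for one sequence and applies the closed-form updates (9) and (10). The function names and per-sequence interface are our own simplification.

```python
import numpy as np

def em_responsibilities(times, dims, mu, A, w):
    """Branching probabilities for one sequence: P[i, i] is the probability
    that event i comes from the base intensity, P[i, j] (j < i) that it was
    triggered by event j, with exponential kernel g(t) = w*exp(-w*t)."""
    n = len(times)
    P = np.zeros((n, n))
    for i in range(n):
        u = dims[i]
        g = A[u, dims[:i]] * w * np.exp(-w * (times[i] - times[:i]))
        denom = mu[u] + g.sum()
        P[i, i] = mu[u] / denom
        P[i, :i] = g / denom
    return P

def update_mu(P, dims, U, T_total):
    """Eq. (9): mu_u = (sum of background probabilities of events in
    dimension u) / (total observation time)."""
    diag = np.diag(P)
    return np.array([diag[dims == u].sum() for u in range(U)]) / T_total

def update_a(B, C, rho):
    """Eq. (10): the nonnegative root of 2*rho*a^2 + B*a - C = 0."""
    return (-B + np.sqrt(B * B + 8.0 * rho * C)) / (4.0 * rho)
```

Note that each row of P sums to one by construction, which is what makes the Jensen-type surrogate (8) tight at the current iterate.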

5 EXPERIMENTS

In this section, we conduct experiments on both synthetic and real-world datasets to evaluate the performance of the proposed method.

5.1 Synthetic Data

Data Generation. The goal is to show that our proposed algorithm can reconstruct the underlying parameters from observed recurrent events. To this end, we consider a U-dimensional Hawkes process with U = 1000 and generate the true parameters µ from a uniform distribution on [0, 0.001]. The infectivity matrix A is generated as A = UV^T. We consider two different types of influence in our experiments, assortative mixing and disassortative mixing:

• In the assortative mixing case, U and V are both 1000 × 9 matrices, with the entries in rows 100(i − 1) + 1 to 100(i + 1) of column i, i = 1, ..., 9, sampled randomly from [0, 0.1] and all other entries set to zero. Assortative mixing captures the scenario where influence mostly comes from members of the same group.

• In the disassortative mixing case, U is generated in the same way as in the assortative mixing case, while V has non-zero entries in rows 100(i − 1) + 1 to 100(i + 1) of column 10 − i, i = 1, ..., 9. Disassortative mixing captures the scenario where influence can come from outside the group, possibly from a few influential hubs.

We scale A so that its spectral radius is 0.8 to ensure the point process is well-defined, i.e., has finite intensity¹. Then, 50,000 samples are drawn from the multi-dimensional Hawkes process specified by A and µ. The proposed algorithms are applied to the samples to obtain estimates of A and µ.
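The assortative construction can be sketched as follows, scaled down to U = 200 with blocks of width 20 so it runs quickly; the 0-based slicing is our reading of the 1-based index ranges in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
U_dim, blocks, width = 200, 9, 20

# Overlapping block factors: column i is supported on rows
# width*(i-1) .. width*(i+1), mimicking the assortative construction.
Ufac = np.zeros((U_dim, blocks))
Vfac = np.zeros((U_dim, blocks))
for i in range(1, blocks + 1):
    rows = slice(width * (i - 1), min(width * (i + 1), U_dim))
    Ufac[rows, i - 1] = rng.uniform(0.0, 0.1, rows.stop - rows.start)
    Vfac[rows, i - 1] = rng.uniform(0.0, 0.1, rows.stop - rows.start)
A = Ufac @ Vfac.T                      # nonnegative, rank <= 9

# Rescale so the spectral radius is 0.8, keeping the process stable.
A *= 0.8 / np.max(np.abs(np.linalg.eigvals(A)))
```

The disassortative variant only changes the column index of Vfac from i − 1 to 9 − i, shifting influence across blocks.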

Evaluation Metric. We use three evaluation metrics to measure the performance:

• RelErr is defined as the averaged relative error between the estimated parameters and the true parameters, i.e., |â_ij − a_ij| / |a_ij| for a_ij ≠ 0 and |â_ij − a_ij| for a_ij = 0.

• PredLik is defined as the log-likelihood of the estimated model on a separate held-out test set containing 50,000 samples.

• RankCorr is defined as the averaged Kendall's rank correlation coefficient between each row of A and Â. It measures whether the relative order of the estimated social influences is correctly recovered.
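These metrics are simple to implement; a sketch with a plain O(n²) Kendall tau, treating zero true entries with absolute error as described above (function names are ours):

```python
import numpy as np

def rel_err(A_hat, A):
    """Averaged relative error; absolute error is used where the
    true entry is zero."""
    err = np.abs(A_hat - A).astype(float)
    nz = A != 0
    err[nz] /= np.abs(A[nz])
    return err.mean()

def kendall_tau(x, y):
    """Kendall's tau-a between two score vectors (O(n^2) pair count)."""
    n, s = len(x), 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    return s / (n * (n - 1) / 2)

def rank_corr(A_hat, A):
    """Averaged Kendall correlation between corresponding rows."""
    return float(np.mean([kendall_tau(A_hat[u], A[u])
                          for u in range(A.shape[0])]))
```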

Results. We included the following methods in the comparisons:

• TimeWindow. In this method, we first discretize time into windows of equal length and then represent each node by a vector in which each dimension is the number of events occurring within the corresponding time window at the node. Cosine similarity between these vectors is used to estimate the infectivity matrix.

¹ In our case, the point process is well-defined iff the spectral radius of A is less than 1.

[Figure 1: Assortative mixing networks: performance measured by RelErr, PredLik and RankCorr with respect to the number of training samples. Panels: (a) RelErr, (b) PredLik, (c) RankCorr; methods compared: Full, LowRank, SparseLowRank (LowRankSparse), Sparse, NetRate, TimeWindow.]

[Figure 2: Disassortative mixing networks: performance measured by RelErr, PredLik and RankCorr with respect to the number of training samples. Panels: (a) RelErr, (b) PredLik, (c) RankCorr.]

• NetRate. This method was proposed in [15] for modeling information diffusion in networks. It cannot model recurrent events, so we only keep the first occurrence at each node in the training data.

• Full. The infectivity matrix A is estimated as a general U × U matrix without any structure.

• LowRank. Only the nuclear norm is used, to obtain a low-rank estimate of A.

• Sparse. Only the ℓ1 norm is used, to obtain a sparse estimate of A.

• LowRankSparse. This is the proposed method, Algorithm 1. Both the nuclear norm and the ℓ1 norm are used to estimate A.

For each method, the parameters are selected on a validation set that is disjoint from both the training and test sets. We run each experiment five times with different samples and report the performance metrics averaged over the five runs.

Figure 1 plots the results on assortative mixing networks measured by RelErr, PredLik and RankCorr with respect to the amount of training data. It can be observed from Figure 1 that as the number of training samples increases, RelErr decreases and both PredLik and RankCorr increase, indicating that all methods improve their estimation accuracy with more training samples. Moreover, LowRank and Sparse outperform Full in all cases. Therefore, we conclude that utilizing the structure of the matrix improves the estimation very significantly. LowRankSparse outperforms all other baselines since it fully utilizes the prior information about the infectivity matrix. It is interesting to observe that when the number of training samples is small, the improvements of LowRankSparse over the other baselines are very large. We believe this is because when the number of training samples is not sufficiently large for a good estimate, the prior knowledge from the structure of the infectivity matrix becomes more important and useful. TimeWindow is not as good as the other methods since it cannot capture the time pattern very accurately. Similarly, in Figure 2, the proposed method LowRankSparse outperforms the other methods on disassortative mixing networks. These two sets of experiments indicate that LowRankSparse handles different network topologies well.

In Figure 3 and Figure 4, we plot the performance with respect to the values of the two parameters λ_1 and λ_2 in LowRankSparse. It can be observed that the performance first increases and then decreases as λ_1 grows. Note that when λ_1 = 0, we obtain a sparse solution that is not low-rank, which is outperformed by solutions that are both low-rank and sparse. Similar observations can be made for λ_2. In general, we use λ_2 = 0.6 in Figure 3 and λ_1 = 0.02 in Figure 4, respectively.

[Figure 3: Performance measured by PredLik with respect to the value of λ_1.]

[Figure 4: Performance measured by PredLik with respect to the value of λ_2.]

[Figure 5: Performance measured by PredLik with respect to the number of inner/outer iterations, for Inner = 10, 20, 30, 40.]

In order to investigate the convergence of the proposed algorithm, in Figure 5 we present the performance measured by PredLik with respect to the number of outer iterations K in Algorithm 1. We can observe that the performance measured by PredLik grows with the number of outer iterations and converges within about 50 iterations. We also illustrate the impact of the number of inner iterations in Figure 5. It can be observed that a larger number of inner iterations leads to faster convergence and slightly better performance.

5.2 Real-world Data

We also evaluate the proposed method on a real-world data set. To this end, we use the MemeTracker data set². The data set contains the information flows captured by hyperlinks between different sites, with timestamps. In particular, we first extract the top 500 popular sites and the links between them. Each event records that one site created a hyperlink to another site at a particular time. We use 50% of the data for training and 50% for testing.
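As a concrete illustration of this preprocessing (the record layout and helper names below are hypothetical sketches, not from the paper): keep the most active sites, retain only links between them, order the link-creation events by time, and split chronologically.

```python
from collections import defaultdict

def build_event_sequences(link_events, top_k=500):
    """Keep the top_k most active sites and turn hyperlink-creation
    records (source, target, timestamp) into time-ordered (site, time)
    events, where the event occurs at the site that created the link."""
    counts = defaultdict(int)
    for src, tgt, _t in link_events:
        counts[src] += 1
        counts[tgt] += 1
    top_sites = set(sorted(counts, key=counts.get, reverse=True)[:top_k])
    events = [(src, t) for src, tgt, t in link_events
              if src in top_sites and tgt in top_sites]
    events.sort(key=lambda e: e[1])  # chronological order
    return events

def temporal_split(events, frac=0.5):
    """Split a chronologically sorted event list: first `frac` for training."""
    cut = int(len(events) * frac)
    return events[:cut], events[cut:]
```

A temporal (rather than random) split is the natural choice here, since the Hawkes model is evaluated by predicting future events from past ones.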

In Figure 6, we show the negative log-likelihood of Full, Sparse, LowRank and LowRankSparse on the test set. We can see that LowRankSparse outperforms the baselines. Therefore, we conclude that LowRankSparse can better model the influences in social networks.

We also study whether the proposed model can discover the influence network between users from the recurrent events. To this end, we present the RankCorr of Full, Sparse, LowRank and LowRankSparse in Figure 7. We also include NetRate as a baseline. It can be observed that LowRankSparse obtains better rank

² http://memetracker.org

correlation than the other models, which indicates that it can capture the influence network better. In Figure 8, we visualize the influence network estimated from the MemeTracker data. We can observe a quite dense region near the bottom right of the infectivity matrix. The sites in this region form the center of the infectivity network. Example sites in this region include news.cnet.com, blogs.zdnet.com and blogs.abcnews.com. The first two are well-known IT news portals and the third is a blog site belonging to a general news portal. It is clear that all of these are popular sites that can quickly detect trending events and propagate them to many other sites.

6 CONCLUSIONS

In this paper, we propose to infer the network of social influence from observed recurrent events indicating users' activities in social networks. The proposed model utilizes the mutually-exciting multi-dimensional Hawkes model to capture the temporal patterns of user behaviors. Moreover, we estimate an infectivity matrix for the network that is both low-rank and sparse by optimizing the nuclear norm and ℓ1 norm simultaneously. The resulting optimization problem is solved by combining the ideas of the alternating direction method of multipliers and majorization-minimization. The experimental results on both simulated and real-world datasets suggest that the proposed model can estimate the social influence between users accurately.

There are several interesting directions for future studies. First, we plan to investigate adapting the proposed model to the problem of collaborative filtering. In particular, we plan to model social influences to capture both short-term and long-term preferences of users. Moreover, we can also investigate the problem of estimating the decay kernel together with the other parameters of the model.

Acknowledgements

Part of this work is supported by NSF IIS-1116886, NSF IIS-1218749 and a DARPA Xdata grant.


Figure 6: Performance measured by negative log-likelihood on the MemeTracker data set (Full, LowRank, Sparse, LowRankSparse).

Figure 7: Performance measured by RankCorr on the MemeTracker data set (Full, LowRank, Sparse, LowRankSparse, NetRate).

Figure 8: Influence structure estimated from the MemeTracker data set.

Appendix

Derivation of ADMM. In ADMM, we consider the augmented Lagrangian of the above constrained optimization problem by writing it as follows:

min −L(µ, A) + λ1‖Z1‖∗ + λ2‖Z2‖1 + (ρ/2)(‖A − Z1‖² + ‖A − Z2‖²)    (11)

subject to A = Z1, A = Z2. Clearly, for all ρ the optimization problem defined above is equivalent to Equation (2) and thus equivalent to the problem defined in Equation (1). The augmented Lagrangian of Equation (2) is the (standard) Lagrangian of Equation (11), which can be expressed as follows:

Lρ = −L(µ, A) + λ1‖Z1‖∗ + λ2‖Z2‖1 + trace(Y1ᵀ(A − Z1)) + trace(Y2ᵀ(A − Z2)) + (ρ/2)(‖A − Z1‖² + ‖A − Z2‖²).

Thus, we can solve the optimization problem defined in Equation (2) by applying gradient ascent to the dual variables Y1 and Y2. It can be shown that the updates of Y1 and Y2 at the k-th iteration have the following form:

Y1^{k+1} = Y1^k + ρ(A^{k+1} − Z1^{k+1}),
Y2^{k+1} = Y2^k + ρ(A^{k+1} − Z2^{k+1}),

where A^{k+1}, Z1^{k+1} and Z2^{k+1} are obtained by optimizing Lρ with Y1 = Y1^k and Y2 = Y2^k fixed:

argmin_{A, µ, Z1, Z2} Lρ(A, µ, Z1, Z2, Y1^k, Y2^k).

In ADMM, the above problem is solved by updating A, µ, Z1 and Z2 sequentially as follows:

A^{k+1}, µ^{k+1} = argmin_{A, µ} Lρ(A, µ, Z1^k, Z2^k, Y1^k, Y2^k),
Z1^{k+1} = argmin_{Z1} Lρ(A^{k+1}, µ^{k+1}, Z1, Z2^k, Y1^k, Y2^k),
Z2^{k+1} = argmin_{Z2} Lρ(A^{k+1}, µ^{k+1}, Z1^{k+1}, Z2, Y1^k, Y2^k).

It is usually more convenient to consider the scaled form of ADMM. Letting U1^k = Y1^k/ρ and U2^k = Y2^k/ρ, we obtain the algorithm described in Section 4.
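In the scaled form, the Z1 and Z2 subproblems have standard closed-form solutions: singular-value thresholding for the nuclear-norm term and elementwise soft-thresholding for the ℓ1 term. A minimal sketch of one outer iteration follows (assuming numpy; the inner (A, µ) update, which the paper handles by majorization-minimization, is abstracted as a callable `update_A_mu` and is not reproduced here):

```python
import numpy as np

def svt(M, tau):
    """Prox of tau*||.||_*: singular-value thresholding."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    """Prox of tau*||.||_1: elementwise soft-thresholding."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def admm_step(Z1, Z2, U1, U2, update_A_mu, lam1, lam2, rho):
    """One outer iteration of the scaled-form ADMM derived above.
    `update_A_mu` stands in for the inner step that minimizes
    -L(mu, A) + (rho/2)(||A - Z1 + U1||^2 + ||A - Z2 + U2||^2)."""
    A = update_A_mu(Z1 - U1, Z2 - U2)   # inner (A, mu) update
    Z1 = svt(A + U1, lam1 / rho)        # nuclear-norm prox
    Z2 = soft(A + U2, lam2 / rho)       # l1 prox
    U1 = U1 + A - Z1                    # scaled dual updates
    U2 = U2 + A - Z2
    return A, Z1, Z2, U1, U2
```

As a smoke test, dropping the likelihood term makes the inner update a simple averaging, `lambda B1, B2: (B1 + B2) / 2`, so the step reduces to pure proximal averaging of the two regularizers.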

Majorization Minimization. First, we claim that the following properties hold for Q(A, µ; A^(m), µ^(m)) defined in Equation (8):

1. For all A and µ, Q(A, µ; A^(m), µ^(m)) ≥ f(A, µ).

2. Q(A^(m), µ^(m); A^(m), µ^(m)) = f(A^(m), µ^(m)).

Proof. The first claim can be shown by utilizing Jensen's inequality: for all c and i, we have

log(µ_{u_i^c} + Σ_{j=1}^{i−1} a_{u_i^c u_j^c} g(t_i^c − t_j^c)) ≥ p_{ii}^c log(µ_{u_i^c} / p_{ii}^c) + Σ_{j=1}^{i−1} p_{ij}^c log(a_{u_i^c u_j^c} g(t_i^c − t_j^c) / p_{ij}^c).

Summing up over c and i proves the claim.
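The inequality invoked here is the standard log-sum lower bound: for any probability vector p with positive entries, log Σ_k x_k ≥ Σ_k p_k log(x_k / p_k), with equality when p_k ∝ x_k (presumably how the branching probabilities p_ij^c are chosen to make the surrogate tight at the current iterate). A quick numerical sanity check of both facts:

```python
import math

def jensen_lower_bound(xs, ps):
    """Lower bound from Jensen's inequality:
    log(sum_k x_k) = log(sum_k p_k * (x_k / p_k)) >= sum_k p_k * log(x_k / p_k)
    for any probability vector p with p_k > 0."""
    return sum(p * math.log(x / p) for x, p in zip(xs, ps))

# The x_k play the role of mu and the a * g(t_i - t_j) summands.
xs = [0.5, 1.5, 0.25]
lhs = math.log(sum(xs))

assert jensen_lower_bound(xs, [0.2, 0.5, 0.3]) <= lhs        # holds for any p
p_star = [x / sum(xs) for x in xs]                           # p_k proportional to x_k
assert abs(jensen_lower_bound(xs, p_star) - lhs) < 1e-12     # ...and is tight there
```

Tightness at p_k ∝ x_k is what makes the surrogate Q touch f at the current iterate, which is exactly the second claimed property.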

The second claim can be checked by setting A = A^(m) and µ = µ^(m).

The above two properties imply that if (A^(m+1), µ^(m+1)) = argmin_{A, µ} Q(A, µ; A^(m), µ^(m)), we have

f(A^(m), µ^(m)) = Q(A^(m), µ^(m); A^(m), µ^(m)) ≥ Q(A^(m+1), µ^(m+1); A^(m), µ^(m)) ≥ f(A^(m+1), µ^(m+1)).

Thus, optimizing Q with respect to A and µ ensures that the value of f(A, µ) decreases monotonically.
