
Pseudo Cold Start Link Prediction with Multiple Sources in Social Networks

Liang Ge 1, Aidong Zhang 2

1,2 Computer Science and Engineering Department, State University of New York at Buffalo
Buffalo, 14260, U.S.A.
[email protected], [email protected]

Abstract

Link prediction is an important task in social networks and data mining for understanding the mechanisms by which social networks form and evolve. Most link prediction research assumes that either a snapshot of the social network or a social network with some missing links is available, and therefore approaches the problem by exploring the topological structure of the social network using only one source of information. In many application domains, however, a variety of auxiliary information is available in addition to the social network of interest.

In this work, we introduce pseudo cold start link prediction with multiple sources: the problem of predicting the structure of a social network when only a small subgraph of the social network is known and multiple heterogeneous sources are available. We propose a two-phase supervised method: the first phase provides an efficient feature selection scheme that finds, from the multiple sources, the best feature for predicting the structure of the social network; in the second phase, we propose a regularization method to control the risk of over-fitting induced by the first phase. We assess our method empirically over a large data collection obtained from Youtube. Extensive experimental evaluations confirm the effectiveness of our approach.

1 Introduction

Link prediction, introduced by Liben-Nowell and Kleinberg [1], refers to the following problem: given a snapshot of a network at time t and a future time t', predict the new links that are likely to appear in the network within the interval [t, t']. Link prediction is about understanding to what extent the mechanisms by which networks form and evolve can be modeled by features intrinsic to the network itself [1]. Therefore, most existing research tackles this problem by exploring the topological structure of the social network of interest [1, 9, 12, 26]. In addition to the social network, in many application domains we also have exogenous information, most interestingly auxiliary networks over the same group of people from heterogeneous sources. Take the Facebook network as an example: besides the friendship relations between users, there are other relations based on group membership, commenting, and so on.

In this paper, we tackle the pseudo cold start link prediction (PCS) with multiple sources problem: we still aim to predict the links, but instead of having a snapshot of the social network, we only have knowledge of a small subgraph. Meanwhile, we also obtain multiple auxiliary networks over the same users in the social network. We are interested in predicting possible links among the vertices of the social network by exploiting the multiple auxiliary networks. Notice that when the size of the known subgraph is zero, i.e., we have no knowledge of the existing link structure and only one source of information is available, the problem reduces to the cold start link prediction introduced by Leroy et al. [2].

Consider, for instance, an online coupon company G (e.g., Groupon) and a company F (e.g., Facebook) that provides general-purpose social network services. Suppose G and F have made the following agreement: 1) F provides functionality by which users can share and review their favorite online coupons, and these shares and reviews are made available to their contacts; 2) when a user clicks an online coupon, the user is redirected to the corresponding page on G's website; but 3) F keeps the structure of its social network secret (this may be mandatory because of privacy regulations). Thanks to the agreement, users of F may influence each other to purchase online coupons, and the easiest way for them to buy coupons would be via the website of G. In this scenario, G owns the purchase histories and reviews of its customers. Without violating the privacy regulations of F, G may have crawled a subgraph of the social network of F. The question herein is: can G nonetheless infer the social network, using the purchase and review information it owns and the small subgraph it crawls? This would be quite helpful for G: it would enable G to adopt viral marketing strategies [3], and if G decides, in the future, to develop its own social networking service, the inferred network can be used to recommend possible friends, thus facilitating the initial growth of this social network.

The PCS problem brings distinct challenges with respect to traditional link prediction. Most existing research in link prediction exploits the topological structure of the social network [1, 9, 12, 26].


However, in the PCS problem, we only know a small subgraph of the social network (e.g., a subgraph containing only 5% of the total nodes). This setting prevents us from applying many existing link prediction approaches.

Since we have obtained multiple auxiliary networks from heterogeneous sources, we aim to predict the links in the social network by exploring this auxiliary information. This brings another challenge: the curse of multiple sources. For each auxiliary source, there are a great number of features or predictors that can be explored or derived to predict the links in the social network. Liben-Nowell and Kleinberg [1] examined 10 topological features for link prediction, while Leroy et al. [2] enumerated and inspected 15 features from group membership information alone. Given the multiple sources available, the possible number of features for link prediction is overwhelming, and the fact that computing each feature on a large-scale social network is not trivial increases our burden. The curse of multiple sources urges us to propose an efficient and effective feature selection scheme to find the best feature for link prediction from the multiple sources. On the other hand, heterogeneous auxiliary sources grant us broader perspectives on the link structure of the social network than any single source. It is thus expected that the more sources we obtain, the more precisely we should be able to predict. How to meet this expectation is yet another challenge in the PCS problem.

In this paper, we propose a two-phase supervised method for the PCS problem. In the first phase, a feature selection approach is proposed to find the feature in each source that can best infer the links in the social network. The proposed approach selects the best feature by aligning the feature to the known small subgraph. The efficiency of the feature selection approach is guaranteed by the fact that the alignment is shown to be equivalent to one-dimensional curve fitting. To control the possible risk of over-fitting in the first phase, we propose a regularization method in the second phase. The regularization is achieved by mapping the original nodes of the social network to a Hilbert space and mandating that the mapping respect the known small subgraph and a regularizer. The link predictions are derived from the inner products of the mappings. It is noted that the optimal mapping comes from a Reproducing Kernel Hilbert Space (RKHS) induced by the output of the first phase.

We apply our method to a large data collection obtained from Youtube, a popular online video sharing community. We aim to predict the contact links between users and explore four auxiliary networks (see details in Section 2). The extensive empirical evaluation results demonstrate the power of our approach.

Our contributions can be summarized as follows:

• We introduce pseudo cold start link prediction (PCS) with multiple sources as the problem of predicting the link structure of a social network assuming a small subgraph and multiple auxiliary networks are available.

• We propose a two-phase supervised method as the solution to the PCS problem, which includes an efficient feature selection scheme in the first phase and a regularization method in the second phase.

• We apply our method to predict the link structure of the Youtube social network with four auxiliary networks, and the experimental results show the power of our approach.

• Our approach answers the challenge of the curse of multiple sources and achieves the intuition that the more sources we obtain, the more precisely we are able to predict.

The rest of the paper is organized as follows: in the next section, we present the formal definition of the problem and describe the characteristics of the data sets. In Sections 3 and 4, the proposed two-phase method is presented. Extensive experimental results are shown in Section 5. Related work is discussed in Section 6, and finally we discuss future research directions and conclude the paper.

2 Problem and Data Sets

2.1 Problem Definition For a set O of n users, we aim to predict the link structure of the social network $A_{n \times n}$. We are given a subgraph of size l ($l \ll n$) together with its links, and other auxiliary networks over the same n users: $B_{n \times n}, C_{n \times n}, \cdots$. Predicting the links in A is to build a function $h: O \times O \to \{0, 1\}$.

2.2 Data Sets Youtube is a highly popular online video sharing community. In Youtube, a user can add another user as a contact, subscribe to other users' videos, and tag favorite videos. In this work, we focus on link prediction in the contact network. Links in Youtube are directed, but most of the features we use in the following sections are symmetric, which implies that we predict the same likelihood for the existence of links in both directions.

We obtain the Youtube data sets from Tang et al. [4]. The data collection is composed of five networks, of which the contact network is the one to predict; we treat the other four networks as auxiliary sources. There are 15088 users in the contact network.


Figure 1: Degree distribution of Network A.

Figure 2: Degree distribution of Networks B, C, D, and E.

We remove the users with zero contacts; after the removal, 13723 users remain. The data sets are described as follows:

• Major network A: One user adds another user as a contact. This is the network whose link structure we will predict.

• Co-contact network B: Two users are connected if they share a contact.

• Co-subscription network C: Two users sharing any subscription are connected.

• Co-subscribed network D: Two users are connected if they are subscribed to by the same user.

• Favorite network E: Two users are connected if they share favorite videos.

The four auxiliary networks show distinct, even complementary, perspectives on the link structure of A. Roughly speaking, these auxiliary networks may reflect the latent reasons that people on Youtube add others as contacts: for example, people may connect to others because they both know the same people, or because they share favorite videos.

Table 1: Data set properties

Network   Avg. degree   Max. degree   Total links
A              11.2          534          153,350
B           1,143.2        5,637       15,688,290
C             771.0        6,517       10,581,208
D             306.2        5,686        4,202,604
E           1,357.4        7,640       18,627,088

The degree properties of the various networks are summarized in Table 1, where the second and third columns give the average and maximum degree of each network and the last column gives the total number of links. It is easy to see that the major network A is rather sparse while the auxiliary networks are much denser. This simple observation shows that the auxiliary networks are verbose: a user, for instance, may share favorite videos with many people, but only a few of them become his contacts. The direct consequence is that we cannot infer the links of A by directly using the links of the auxiliary networks.

Almost all social networks are found to exhibit the scale-free property, i.e., the degree distribution of the social network follows a power law. Similarly skewed degree distributions are observed in networks A, B, C, D, and E, as shown in Figures 1 and 2.

In the PCS problem, we only know a subgraph of size l. In this paper, we take l to be 5%, 10%, and 15% of n, and randomly sample l nodes and their links from A as the known subgraph. Notice that the subgraph is small: when l is 10% of n, the subgraph on average contains only 0.9% of the total links. Due to space limits, unless otherwise stated, all the empirical results in this paper assume l is 10% of n. We also report results for l equal to 5% and 15% of n.
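For concreteness, the sampling procedure just described can be sketched in a few lines of NumPy; the function name, the dense 0/1 adjacency representation, and the fixed seed are our illustrative choices, not details fixed by the paper:

```python
import numpy as np

# A minimal sketch of sampling the known subgraph: draw l = fraction * n
# nodes uniformly at random and keep the links among them (A_{l x l}).
def sample_known_subgraph(A, fraction=0.10, seed=0):
    n = A.shape[0]
    l = int(fraction * n)
    rng = np.random.default_rng(seed)
    known = rng.choice(n, size=l, replace=False)  # indices of sampled nodes
    A_ll = A[np.ix_(known, known)]                # induced known subgraph
    return known, A_ll
```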

2.3 Result Presentation The link prediction problem is intrinsically a very difficult binary prediction problem due to the sparseness of the social network. Given n nodes, we have a space of $n^2 - n$ possible links, of which only a very small fraction actually exists. In our data, the existing links constitute only approximately 0.08% of all possible links, which means accuracy is not a very meaningful measure for link prediction: by predicting an empty network (no links exist), we can always achieve an accuracy of 99.92%. For the same reason, comparing different link prediction functions by recall is not appropriate either. To compare the performance of different predictive functions, we adopt the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC).

The ROC curve, with the false positive rate on the x-axis and the true positive rate on the y-axis, is a reliable way to compare binary classifiers. As we move along the x-axis, a larger percentage of links is selected (by lowering the threshold on the link occurrence score), and the percentage of actually occurring links that are included among the selected links increases monotonically. When all links are selected, this percentage is 1, so the ROC curve always starts at (0, 0) and ends at (1, 1). The ROC curve associated with randomly selecting links results in a 45-degree line. The standard measure for the quality of an ROC curve is the Area Under the Curve (AUC), which is strictly bounded between 0 and 1.


A perfect algorithm has an AUC of 1 and a random algorithm has an AUC of 0.5; we normalize AUC to lie between 0 and 100.
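As a concrete illustration of this protocol, the following sketch computes the ROC curve and the normalized AUC with scikit-learn; the function name and the dense score-matrix representation are our assumptions, and since the features are symmetric we score each unordered pair once via the upper triangle:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Evaluate a symmetric n x n link-score matrix against the true adjacency A.
def evaluate(scores, A):
    iu = np.triu_indices_from(A, k=1)          # all n(n-1)/2 candidate links
    y_true, y_score = A[iu], scores[iu]
    fpr, tpr, _ = roc_curve(y_true, y_score)   # ROC from (0, 0) to (1, 1)
    return fpr, tpr, 100.0 * roc_auc_score(y_true, y_score)  # AUC in [0, 100]
```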

3 Feature Selection Phase

In the PCS problem, approaches that employ features intrinsic to the social network do not work because we only know a small subgraph. Naturally, we have to explore the auxiliary networks to infer the possible links. Since a great number of possible features can be derived from the auxiliary networks, we encounter the curse of multiple sources. Furthermore, it has been reported [1, 12] that no feature is consistently superior across all data sets. This observation makes sense because different social networks may have distinct mechanisms of formation and evolution. The massive scale of the social network and the auxiliary networks makes the problem even harder. Therefore, an efficient and effective feature selection strategy has to be developed to tackle the curse of multiple sources. In this section, we first introduce the pool of candidate features for each source and present the approach for finding the best feature for link prediction in the major network. After that, we show empirical evidence supporting our feature selection strategy. Finally, we show that by properly integrating the best feature from each source, we obtain a result that is better than the best feature from any single source.

3.1 The Pool of Candidate Features For a given auxiliary network, for instance B, we adopt four mappings to construct link prediction features: 1) $B \to B$, the network itself; 2) $B \to \tilde{B}$, its normalized matrix, where $\tilde{B} = W^{-1/2} B W^{-1/2}$ and W is the diagonal degree matrix with $W_{ii} = \sum_j B_{ij}$; 3) $B \to L$, its Laplacian, where $L = W - B$; and 4) $B \to \tilde{L}$, its normalized Laplacian, where $\tilde{L} = I - \tilde{B}$ and I is the identity matrix of the same size as B.

The first set of prediction features we examine is the exponential [8] and von Neumann [9] graph kernels, defined as

(3.1)  $f_{exp}(B) = \exp(\alpha B)$,

(3.2)  $f_{neu}(B) = (I - \alpha B)^{-1}$,

where α is a positive parameter. The exponential and von Neumann kernels are equivalent to path counting features for link prediction. Path counting features capture the proximity between two nodes in terms of the number and length of paths between them. Exploiting the fact that $B^r$, the r-th power of the network matrix of an unweighted graph, contains the number of paths of length r connecting each node pair, it is assumed that nodes connected by many paths should be considered closer than pairs connected by fewer paths. We notice the following:

(3.3)  $\exp(\alpha B) = \sum_{i=0}^{\infty} \frac{\alpha^i}{i!} B^i$,

(3.4)  $(I - \alpha B)^{-1} = \sum_{i=0}^{\infty} \alpha^i B^i$.
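The series identity in Eq. 3.4 can be checked numerically in a few lines; the sanity check below is our illustration (the random toy matrix, seed, and truncation length are arbitrary choices), not an experiment from the paper:

```python
import numpy as np

# Verify (I - alpha*B)^{-1} equals the truncated power series for small alpha.
rng = np.random.default_rng(0)
B = (rng.random((20, 20)) < 0.2).astype(float)
B = np.triu(B, 1)
B = B + B.T                                          # random symmetric 0/1 graph
alpha = 0.5 / np.abs(np.linalg.eigvalsh(B)).max()    # alpha < 1/rho(B): series converges
katz = np.linalg.inv(np.eye(20) - alpha * B)         # von Neumann / Katz kernel
series = sum(alpha ** i * np.linalg.matrix_power(B, i) for i in range(60))
print(np.allclose(katz, series))                     # True
```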

Notice that the von Neumann graph kernel is also called the Katz index [7], which enjoys much success in practice. When the network is large, the inversion of $I - \alpha B$ becomes too expensive, so we can stop counting paths after length r; some experiments also show that small r usually yields better results than large r [6]. Notice that when r = 2, the feature is equivalent to another popular feature that computes the closeness of a pair by counting common neighbors. We thus define the sum of powers as another candidate feature for link prediction:

(3.5)  $f_{sop}(B) = \sum_{i=1}^{r} \alpha^i B^i$.

Notice that if we replace B with $\tilde{B}$ in Eqs. 3.1, 3.2, and 3.5, we obtain another three candidate features for link prediction.

The second set of features for link prediction employs the Laplacian and normalized Laplacian. The Moore-Penrose pseudoinverse of the Laplacian [10] is called the commute time or resistance distance kernel. By regularizing the commute kernel, we obtain regularized Laplacian kernels [11]. These two are defined as

(3.6)  $f_{comm}(L) = L^{+}$,

(3.7)  $f_{commr}(L) = (I + \alpha L)^{-1}$.

The "commute time" kernel takes its name from the average commute time, which is defined as the average number of steps a random walker, starting from node $i \neq j$, takes before entering node j for the first time and returning to i. The kernel is also known as the resistance distance kernel because of a close analogy with the effective resistance in electrical networks.


Table 2: Best features for the auxiliary networks

Network   Best Predictor   Mapping       RMSE   Parameter   AUC-$A_{l \times l}$   AUC-A
B         Sum of Powers    $\tilde{B}$   0.08   0.65        87.3                   86.1
B         Heat Kernel      L             0.22   0.89        73.5                   71.2
B         Commr Kernel     L             0.23   0.92        85.8                   80.2
C         Sum of Powers    $\tilde{C}$   0.11   0.43        77.7                   74.4
C         Heat Kernel      L             0.38   0.93        68.1                   68.4
C         Sum of Powers    C             0.39   0.01        73.8                   71.8
D         Sum of Powers    $\tilde{D}$   0.08   0.33        77.3                   80.2
D         Sum of Powers    D             0.10   0.01        75.6                   77.9
D         Heat Kernel      L             0.39   0.79        71.1                   72.8
E         Sum of Powers    $\tilde{E}$   0.09   0.56        71.4                   73.8
E         Sum of Powers    E             0.26   0.89        59.1                   63.5
E         Heat Kernel      L             0.28   0.95        68.9                   60.7

Another candidate feature is the heat diffusion kernel [9], defined as

(3.8)  $f_{heat}(L) = \exp(-\alpha L)$.

Also, if we replace L with $\tilde{L}$ in Eqs. 3.6, 3.7, and 3.8, we obtain another three candidate link prediction features. Therefore, for an individual source, we can enumerate 12 candidate features for inferring the link structure of the major network. The candidate pool is obviously not complete, but it is rich enough to produce good performance and enables an efficient feature selection strategy.
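The 12 candidates can be enumerated mechanically. The sketch below is one way to do so for a single auxiliary network, assuming a dense symmetric adjacency matrix; the function name and default parameter values are ours, and for the von Neumann kernel the series view of Eq. 3.4 requires α below the reciprocal of the largest eigenvalue:

```python
import numpy as np
from scipy.linalg import expm, inv, pinv

# A sketch of the 12-feature candidate pool for one auxiliary network B:
# {exp, von Neumann, sum of powers} on {B, normalized B} plus
# {commute, regularized commute, heat} on {L, normalized L}.
def candidate_pool(B, alpha=0.1, r=3):
    n = B.shape[0]
    I = np.eye(n)
    d = B.sum(axis=1)
    W_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    B_norm = W_inv_sqrt @ B @ W_inv_sqrt      # normalized matrix
    L = np.diag(d) - B                        # Laplacian
    L_norm = I - B_norm                       # normalized Laplacian

    def sop(X):                               # Eq. (3.5), truncated path counting
        return sum(alpha ** i * np.linalg.matrix_power(X, i) for i in range(1, r + 1))

    feats = {}
    for name, X in (("B", B), ("B_norm", B_norm)):
        feats["exp_" + name] = expm(alpha * X)       # Eq. (3.1)
        feats["neu_" + name] = inv(I - alpha * X)    # Eq. (3.2), Katz index
        feats["sop_" + name] = sop(X)                # Eq. (3.5)
    for name, X in (("L", L), ("L_norm", L_norm)):
        feats["comm_" + name] = pinv(X)              # Eq. (3.6), commute time
        feats["commr_" + name] = inv(I + alpha * X)  # Eq. (3.7)
        feats["heat_" + name] = expm(-alpha * X)     # Eq. (3.8)
    return feats
```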

3.2 Feature Selection Strategy We notice that each feature in the pool can be represented as a function f(X), where X could be B, $\tilde{B}$, L, or $\tilde{L}$. Since we already know a small subgraph of size l, we propose to select the best feature by minimizing the following:

(3.9)  $\min_{f} \| f(X_{l \times l}) - A_{l \times l} \|_F$,

where f is from the pool, X is one of the four mappings, $A_{l \times l}$ is the given subgraph, $X_{l \times l}$ is the mapping restricted to the same nodes as in $A_{l \times l}$, and $\| \cdot \|_F$ denotes the Frobenius norm.

To solve Eq. 3.9, we rely on two observations. First, for any symmetric matrix X, f(X) can be computed as $f(X) = U f(\Lambda) U^T$, where U is the orthogonal matrix whose i-th column is the i-th eigenvector of X and $\Lambda$ is the diagonal matrix in which $\Lambda_{ii}$ is the i-th eigenvalue of X. Second, the Frobenius norm is invariant under multiplication by an orthogonal matrix. Based on these two observations we have:

$\min \| f(X_{l \times l}) - A_{l \times l} \|_F = \| U f(\Lambda) U^T - A_{l \times l} \|_F = \| f(\Lambda) - U^T A_{l \times l} U \|_F \sim \sum_i \left( f(\Lambda_{ii}) - U_{.i}^T A_{l \times l} U_{.i} \right)^2$   (3.10)

In the last step of Eq. 3.10, notice that $f(\Lambda)$ is a diagonal matrix, so we can drop all non-diagonal elements of $U^T A_{l \times l} U$ because they are irrelevant to f. Therefore, feature selection is equivalent to one-dimensional curve fitting, which can be solved efficiently. Notice also that the last step of Eq. 3.10 corresponds to the root mean squared error (RMSE). Furthermore, most features in the candidate pool are parameter-laden (except the commute kernel); with Eq. 3.10, we can also tune the parameter by minimizing the RMSE.
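A minimal sketch of this selection rule follows, assuming dense matrices. The function names are ours: `rmse_of_feature` diagonalizes $X_{l \times l}$ once and then scores any scalar function of the eigenvalues, so sweeping a parameter grid for one feature costs one eigendecomposition plus cheap one-dimensional fits:

```python
import numpy as np

# Score a candidate feature by the one-dimensional fitting RMSE of Eq. (3.10).
# f_scalar maps an eigenvalue lambda to f(lambda).
def rmse_of_feature(X_ll, A_ll, f_scalar):
    lam, U = np.linalg.eigh(X_ll)                 # X_{l x l} = U diag(lam) U^T
    target = np.einsum("ij,ij->j", U, A_ll @ U)   # U_{.i}^T A_{l x l} U_{.i}
    return np.sqrt(np.mean((f_scalar(lam) - target) ** 2))

# Example: tune alpha for the sum-of-powers feature (r = 3) on one mapping.
def best_alpha(X_ll, A_ll, r=3):
    grid = np.linspace(0.01, 0.99, 99)
    return min(grid, key=lambda a: rmse_of_feature(
        X_ll, A_ll, lambda lam: sum(a ** i * lam ** i for i in range(1, r + 1))))
```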

Now we show some empirical evidence to justify the benefits of the proposed feature selection strategy. For each auxiliary network, we present only the top-3 features in terms of RMSE, along with the AUC and ROC curve of each feature. The detailed results are shown in Table 2, where AUC-$A_{l \times l}$ denotes the prediction performance on the known subgraph $A_{l \times l}$ and AUC-A denotes the prediction performance on the major network A. The parameter column indicates the optimal value of α selected by our approach. One thing missing from this table is the parameter r for the Sum of Powers feature, which is selected to be 3. Figures 3 to 6 show the ROC curves for the top-3 features from each auxiliary network.

Several observations arise from the empirical evaluation. First, our proposed method selects the best feature for each source, i.e., the feature with minimal RMSE corresponds to the best performance both on the known subgraph and on the whole social network.


Figure 3: ROC of the best features from Network B.

Figure 4: ROC of the best features from Network C.

Figure 5: ROC of the best features from Network D.

Figure 6: ROC of the best features from Network E.

Second, the prediction power of each source differs: network B has the greatest prediction power. Since network B is a co-contact network, i.e., users are connected if they share a contact, this observation confirms the popular assumption made both in daily life and in social network research: "friends of friends are friends".

Third, in related work on a similar topic [6], Lu et al. tackle the problem of multiple sources by mandating that the Katz index be the best feature for each source, which is inflexible and does not respect the fact that different sources have different best features. Our work shows that the best feature for each source in Youtube is the sum of powers of the normalized network. Notice that when the data set changes, the best feature for each source may also change, yet our proposed approach is still able to capture it.

Finally, we observe that although the minimal RMSE corresponds to the best performance, when RMSE values are large, a smaller RMSE does not necessarily indicate better performance. For example, the RMSEs for the heat kernel and the regularized commute kernel of B are 0.22 and 0.23, respectively; yet in terms of AUC, the regularized commute kernel is superior to the heat kernel. This happens because a large RMSE means a significant part of the feature does not fit $A_{l \times l}$, and RMSE does not distinguish between mismatch on the major part and mismatch on the outliers.

Figures 7 and 8 (better seen in color) show the curve fitting plots for the heat kernel and the regularized commute kernel, respectively. The x-axis denotes the value of $f(\Lambda_{ii})$ and the y-axis the value of $U_{.i}^T A_{l \times l} U_{.i}$. As we can see, the regularized commute kernel fits the major part more closely; its RMSE is larger than that of the heat kernel because the heat kernel probably fits the outliers more closely. This explains how a larger RMSE can nonetheless correspond to better performance.

Figure 7: Heat kernel curve fitting.

Figure 8: CommR kernel curve fitting.

3.3 The Complement of Multiple Sources From the empirical evidence, we notice that network B has the greatest prediction power and that the sum of powers of normalized B achieves the best performance. We also observe that the best features from the other three auxiliary networks perform well. People are sophisticated and become friends for a variety of reasons. Therefore, although network B shows the greatest prediction power, which suggests that the latent reason B represents is probably the major reason people become friends in Youtube, we can never safely rule out the other possible reasons that the other three auxiliary networks embody. We believe the four auxiliary networks are to some extent complementary, in that a proper integration of them should shed more light on how people are connected in Youtube. To explore the complementarity of multiple sources, we propose two alternatives. One is to select the best feature from each source and combine them to perform the prediction. The second is to combine the best features overall, i.e., select the features with top performance; in this case, since network B has the greatest prediction power, several of its features will be selected.


Figure 9: ROC curve of the first alternative.

Figure 10: ROC curve of the second alternative.

Figure 11: Ensemble of two features from B.

Figures 9 and 10 show the ROC curves of the two alternatives on the major network A. In Fig. 9, B, C, D, and E denote the sum of powers of the normalized matrices B, C, D, and E, respectively, and B+C+D+E denotes the average ensemble of the four. The ensemble achieves a better result than the best feature from any single source, which supports our assertion that the multiple sources are to some extent complementary. In Fig. 10, on the other hand, B+B+C+E denotes the average ensemble of two features from B (sum of powers and regularized commute kernel) and the best features from C and E. The performance of this ensemble is inferior to the first alternative but still better than the best feature of each source.

Figure 11 shows the ensemble of the top-2 features in B, i.e., the sum of powers of normalized B and the regularized commute kernel, compared with the performance of the individual features. As we can see, the performance of the ensemble, although very close to that of the sum of powers of normalized B, is still worse. This indicates that the performance increase in Figure 10 occurs because of the ensemble of features from different sources. It also explains why the ensemble in Figure 10 performs worse than the one in Figure 9.

Table 3: Performance of different combinations on the major network A

       B      C      D      E
B    84.1   87.5   87.1   84.7
C     -     72.9   84.2   78.0
D     -      -     80.7   82.6
E     -      -      -     72.7

Table 3 presents the performance of different combinations of the best feature from each source. Any cell in the table denotes the average ensemble of two features; since the table is symmetric, we only show its upper triangle. It is clear that any combination of two features boosts the performance compared to either feature by itself. We also experimented with the combinations B+C+E, B+D+E, and C+D+E; these combinations all boost the performance over the individual sources. The results are inspiring in that we are able to achieve the desirable goal that the more sources we obtain, the more precisely we can predict.
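A sketch of this average ensemble is below. The min-max rescaling of each score matrix before averaging is our assumption: the paper says only "average ensemble" and does not spell out how scores from different kernels are made commensurable. Since AUC is rank-based, the rescaling matters only for the combination step, not for evaluating a single feature:

```python
import numpy as np

# Average ensemble of several link-score matrices. Each matrix is rescaled to
# [0, 1] first (our choice) so that no single source dominates the average.
def average_ensemble(score_matrices):
    scaled = [(S - S.min()) / (S.max() - S.min() + 1e-12) for S in score_matrices]
    return np.mean(scaled, axis=0)

# e.g. scores = average_ensemble([sop_B, sop_C, sop_D, sop_E])
```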

4 Regularization

The feature selection phase chooses the feature with the best performance, selected by aligning with the known subgraph $A_{l \times l}$. In the PCS problem, $A_{l \times l}$ is usually small, so the selected feature is prone to over-fitting. It is important to control the over-fitting through proper regularization.

Inspired by the work of Vert et al. [25], we propose the following regularization strategy:

• In the major network A, construct a mapping h that maps the original nodes of A to a Hilbert space.

• The mapping h must respect the known subgraph $A_{l \times l}$ as much as possible and follow a regularizer.

• The prediction is made by computing the inner product between the mappings of two nodes.

It is noted that the optimal mapping h comes from the Reproducing Kernel Hilbert Space (RKHS) induced by the output of the first phase. RKHS theory for real-valued functions provides a powerful theoretical framework for regularization: numerous models, including support vector machines, can be derived from the application of the representer theorem with different data-dependent cost functions.

In the following, we briefly review the main ingredients of the RKHS theory for vector-valued functions that are needed for our regularization.


4.1 RKHS for Vector-valued Functions For a given Hilbert space $\mathcal{F}^{\dagger}$, $L(\mathcal{F}^{\dagger})$ denotes the set of bounded linear operators from $\mathcal{F}^{\dagger}$ to itself. Given $Z \in L(\mathcal{F}^{\dagger})$, $Z^{*}$ denotes the adjoint of Z.

Definition 1 [13, 14, 19] Let O be a set and $\mathcal{F}^{\dagger}$ a Hilbert space. $K_x : O \times O \to L(\mathcal{F}^{\dagger})$ is a kernel if:

• $\forall (o, o') \in O \times O$, $K_x(o, o') = K_x(o, o')^{*}$

• $\forall m \in \mathbb{N}$, $\forall \{(o_i, y_i)\}_{i=1}^{m} \subseteq O \times \mathcal{F}^{\dagger}$, $\sum_{i,j=1}^{m} \langle y_i, K_x(o_i, o_j) y_j \rangle_{\mathcal{F}^{\dagger}} \geq 0$

The kernel defined here is an operator-valued kernel with output in $L(\mathcal{F}^{\dagger})$ instead of $\mathbb{R}$. The definition above parallels that of an ordinary kernel (symmetric, positive semi-definite).

Theorem 1 [13, 14, 19] Let O be a set and $\mathcal{F}^{\dagger}$ a Hilbert space. If $K_x : O \times O \to L(\mathcal{F}^{\dagger})$ is an operator-valued kernel, then there exists a unique RKHS $\mathcal{H}$ which admits $K_x$ as its reproducing kernel; that is,

$\forall o \in O, \forall y \in \mathcal{F}^{\dagger}, \quad \langle h, K_x(o, \cdot) y \rangle_{\mathcal{H}} = \langle h(o), y \rangle_{\mathcal{F}^{\dagger}}.$

Theorem 1 is analogous to the Moore-Aronszajn theorem [16], which describes the relation between a kernel and the RKHS of real-valued functions: each kernel defines a unique RKHS. With the operator-valued kernel and Theorem 1, we have the following representer theorem.

Theorem 2 [14, 19] Let O be a set and $\mathcal{F}^{\dagger}$ a Hilbert space. Given a set of labeled samples $S_l = \{(o_i, y_i)\}_{i=1}^{l} \subseteq O \times \mathcal{F}^{\dagger}$ and an RKHS $\mathcal{H}$ with reproducing kernel $K_x : O \times O \to L(\mathcal{F}^{\dagger})$, the minimizer $\hat{h}$ of the regularization problem

(4.11)  $\operatorname*{argmin}_{h \in \mathcal{H}} \; J(h) = \sum_{i=1}^{l} \| h(o_i) - y_i \|^{2}_{\mathcal{F}^{\dagger}} + \lambda \| h \|^{2}_{\mathcal{H}}$

with $\lambda > 0$ admits an expansion

$\hat{h}(\cdot) = \sum_{j=1}^{l} K_x(o_j, \cdot)\, c_j,$

where the vectors $c_j \in \mathcal{F}^{\dagger}$, $j \in \{1, \cdots, l\}$, satisfy the equations

$y_j = \sum_{i=1}^{l} \left( K_x(o_i, o_j) + \lambda \delta_{ij} \right) c_i,$

where $\delta$ is the Kronecker symbol: $\delta_{ii} = 1$ and $\delta_{ij} = 0$ for all $j \neq i$.

Theorem 2 is devoted to penalized least squares. Eq. 4.11 expresses exactly the property we desire for our regularization: respect the known subgraph $A_{l \times l}$ as much as possible while following a regularizer. To benefit from RKHS theory, we must define a proper operator-valued kernel based on the features selected in the first phase and find a representation for the labels $\{y_i\}_{i=1}^{l}$ of the nodes in the major network A.

4.2 Regularization First we show how to obtain an operator-valued kernel. Given a scalar kernel $k_x : O \times O \to \mathbb{R}$, we can define the operator-valued kernel $K_x : O \times O \to L(\mathcal{F}^{\dagger})$ as

$K_x(o, o') = k_x(o, o') \cdot I_{\mathcal{F}^{\dagger}},$

where $I_{\mathcal{F}^{\dagger}}$ is the identity operator on $\mathcal{F}^{\dagger}$. Clearly, $K_x$ is symmetric. Furthermore, the positive semi-definiteness of $k_x$ leads to that of $K_x$: $\forall m \in \mathbb{N}$, $\forall \{(o_i, y_i)\}_{i=1}^{m} \subseteq O \times \mathcal{F}^{\dagger}$, we have

$\sum_{i,j=1}^{m} \langle y_i, K_x(o_i, o_j) y_j \rangle_{\mathcal{F}^{\dagger}} = \sum_{i,j=1}^{m} k_x(o_i, o_j) \langle y_i, y_j \rangle_{\mathcal{F}^{\dagger}} \geq 0.$

The input scalar kernel is chosen to be the best feature K from the first phase, which, in our case, is the average ensemble of the sum of powers of the normalized networks. Note that this feature may not be a positive semi-definite matrix and thus not a kernel per se. Theoretically, a kernel matrix K must be positive semi-definite; empirically, as a machine learning heuristic, a choice of K that does not satisfy Mercer's condition may still perform reasonably if K at least approximates the intuitive idea of similarity. A link prediction feature captures the idea of proximity between users and can therefore serve as a reasonable input for constructing the operator-valued kernel.

In this context, the known subgraph $A_{l \times l}$ is viewed as the training set. But unlike in ordinary classification, we do not have labels $\{y_i\}_{i=1}^{l}$ for the nodes in the training set; rather, we have labels for the links in the training set. To obtain node labels, we construct a Gram matrix $K_{Y_l}$ from the training set $A_{l \times l}$ using any kernel that encodes the proximities of nodes in a given graph. In this work, we use the heat diffusion kernel: $K_{Y_l} = \exp(-\beta L_{Y_l})$, where $L_{Y_l}$ is the Laplacian of $A_{l \times l}$.

We assume that a positive semi-definite kernel $k_y : O \times O \to \mathbb{R}$ underlies this Gram matrix. By the Mercer theorem [17], there exist a Hilbert space $\mathcal{F}^{\dagger}$ and a mapping $y : O \to \mathcal{F}^{\dagger}$ such that $\forall (o, o') \in O \times O$, $k_y(o, o') = \langle y(o), y(o') \rangle_{\mathcal{F}^{\dagger}}$. In this way, we obtain the label information $\{y_i\}_{i=1}^{l}$ for the nodes in the training set. Intuitively, $y_i$ can be seen as a vector representing the position in the Hilbert space $\mathcal{F}^{\dagger}$ of the node $o_i$ in the training set.


When $K_x$ and $\{y_i\}_{i=1}^{l}$ are defined, the solution to Eq. 4.11 reads

$C = Y_l (K_{X_l} + \lambda I_l)^{-1},$

where $Y_l = (y_1^T, \cdots, y_l^T)$, $K_{X_l}$ is the $l \times l$ Gram matrix associated with the kernel $k_x$, and $I_l$ is the identity matrix of size l.

Now we can compute the prediction between nodes $o_i$ and $o_j$ as follows:

$\langle h(o_i), h(o_j) \rangle_{\mathcal{F}^{\dagger}} = \langle C k_x(o_{1,\cdots,l}, o_i),\, C k_x(o_{1,\cdots,l}, o_j) \rangle_{\mathcal{F}^{\dagger}} = k_x(o_{1,\cdots,l}, o_i)^T\, Q^T K_{Y_l} Q\, k_x(o_{1,\cdots,l}, o_j),$

where $Q = (K_{X_l} + \lambda I_l)^{-1}$ and $k_x(o_{1,\cdots,l}, o_i)$ is the $l \times 1$ vector whose k-th element is $k_x(o_k, o_i)$, with $o_k$ a node in the training set. It is noted that the node labels $\{y_i\}_{i=1}^{l}$ are never explicitly computed.
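Putting the pieces together, the whole second phase reduces to a few matrix operations. The sketch below is one way to realize it, assuming `Kx` is the n × n scalar-kernel matrix produced by the first phase and `known` holds the indices of the l training nodes; the function name and default parameter values are ours:

```python
import numpy as np
from scipy.linalg import expm, inv

# Second-phase regularization: the returned matrix approximates <h(o_i), h(o_j)>.
def regularized_scores(Kx, A_ll, known, lam=0.1, beta=1.0):
    l = len(known)
    K_Xl = Kx[np.ix_(known, known)]           # Gram matrix of the training nodes
    L_Yl = np.diag(A_ll.sum(axis=1)) - A_ll   # Laplacian of the known subgraph
    K_Yl = expm(-beta * L_Yl)                 # heat-diffusion label Gram matrix
    Q = inv(K_Xl + lam * np.eye(l))           # (K_{X_l} + lambda I_l)^{-1}
    M = Q.T @ K_Yl @ Q
    k = Kx[known, :]                          # k_x(o_{1..l}, o_i) for every node i
    return k.T @ M @ k                        # n x n matrix of inner products
```

Note that, exactly as stated above, the node labels never appear in the computation.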

Theoretically, any kernel can act as the input scalar kernel for the regularization. However, as our empirical evaluation will show, the quality of the input kernel has a huge influence on the performance of the regularization strategy. Simply put, a kernel that performs better in the first phase will also perform better here. The first phase is therefore justified from another perspective: it selects a proper kernel for the regularization.

In the following, we show empirical evidence supporting two claims: 1) regularization is able to boost the performance achieved in the first phase, and 2) different input kernels greatly influence the performance of the regularization.

Figure 12: ROC curve for the regularization.

Figure 13: ROC curves for different input kernels.

In Figure 12, the legends B, C, D, E, and B+C+D+E stand for the best feature from each source in the first phase and the average ensemble of the best features; the black curve denotes the performance of regularization with input kernel B+C+D+E. As seen from this figure, regularization boosts the performance achieved in the first phase. Figure 13 concerns regularization with different input kernels: the legends B, C, D, E, and B+C+D+E indicate that we choose the sum of powers of the normalized auxiliary matrix, or the average ensemble, as the input kernel to the regularization. It is clear from the diagram that different kernels yield different performance, and the best input kernel from the first phase achieves the best performance.

5 Empirical Evaluation

In the previous sections, we presented empirical evidence to support our conclusions. To make the empirical evaluation more complete, several further issues need to be discussed.

5.1 Low Rank Approximation In Section 3, we need to compute various features of the auxiliary networks. Those networks are huge and dense, so computing features from them is not trivial. Brute force is seldom an option; instead, we use a low rank approximation of the features. Since all our features in Section 3 can be represented as f(X) with X a symmetric matrix, a rank-k approximation of f(X) is given by truncation to k eigenvalues and eigenvectors:

$f_{(k)}(X) = U_{(k)} f(\Lambda_{(k)}) U_{(k)}^T,$

where the largest k eigenvalues are used when X is an auxiliary network or a normalized auxiliary network, and the smallest k eigenvalues are used for the Laplacian and normalized Laplacian. It is to be expected that the parameter k influences performance.
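A sketch of this truncation with SciPy's sparse eigensolver follows; the function name is ours, and the `which` flag encodes the choice between keeping the largest eigenvalues (adjacency-type matrices) and the smallest (Laplacian-type matrices):

```python
import numpy as np
from scipy.sparse.linalg import eigsh

# Rank-k approximation f_(k)(X) = U_(k) f(Lambda_(k)) U_(k)^T.
def low_rank_feature(X, f_scalar, k=200, which="LA"):
    lam, U = eigsh(X, k=k, which=which)   # "LA": largest eigenvalues; "SA": smallest
    return (U * f_scalar(lam)) @ U.T      # scale columns by f(lam); no n x n diag
```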

Table 4: Influence of low rank approximation

Feature   AUC (k=100)   AUC (k=200)   Time (k=100)   Time (k=200)
Sopb          84.1          86.0           34s            86s
Sopc          74.4          72.9           21s            77s
Sopd          80.2          80.7           19s            36s
Sope          73.8          72.3           61s            97s

In Table 4, the first two columns show the AUC for different k and the last two columns show the time to compute each feature; Sopb denotes the sum of powers of normalized B, and similarly for the other rows. The table shows the influence of the low rank approximation on both performance and running time. The experiments were carried out on an Intel Core Duo 2.7 GHz machine with 4 GB of memory. Clearly, increasing k increases the running time, but a larger k does not necessarily yield better performance. In this work, we take k to be 200.

5.2 The Known Subgraph In our PCS problem, we know a small subgraph.


Figure 14: Parameter λ Sensitivity

In the empirical evidence shown in Sections 2 and 3, we took the size of the subgraph, l, to be 10% of n, the size of A. To evaluate the influence of l more completely, we take l to be 5%, 10%, and 15% of n, randomly sample l users and their corresponding links from the major network 10 times as the known subgraph, and summarize the results in Table 5.

Table 5: Influence of the known subgraph

Method             5%           10%           15%
Sopb           85.5 ± 0.7    86.1 ± 0.5    86.2 ± 0.5
Ensemble       88.4 ± 0.6    89.0 ± 0.4    89.3 ± 0.5
Regularization 90.2 ± 0.3    90.8 ± 0.3    91.2 ± 0.4

In this table, Sopb denotes the best feature from B (the sum of powers of normalized B), Ensemble denotes the average ensemble of the best feature from each source, and Regularization denotes the performance of the regularization. We notice that as the training size increases, the performance of Sopb and Ensemble hardly increases. This happens because, for a specific data set, the feature from each source that performs best for link prediction in the social network is fixed: the same feature is selected regardless of the training size. Regularization, in contrast, does show a performance increase with training size, which matches the basic machine learning expectation that a bigger training set allows us to do better.

5.3 Parameter Sensitivity The performance of the regularization is sensitive to the parameter λ. We tune this parameter by maximizing prediction performance on the training set; Figure 14 shows the performance curve over different values of λ for one experiment.

Another parameter is the β in the heat diffusion kernel, which represents the proximity relations between nodes. We tune it the same way, by maximizing prediction performance on the training set. We observe that this parameter has little effect on prediction performance, so we omit those results here.
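A sketch of this tuning loop is below, reusing the `regularized_scores` and `evaluate` sketches from earlier sections; the grid of λ values is an illustrative choice:

```python
import numpy as np

# Pick the lambda that maximizes AUC on the known subgraph (the training set).
def tune_lambda(Kx, A_ll, known, grid=np.logspace(-4, 1, 12)):
    def train_auc(lam):
        S = regularized_scores(Kx, A_ll, known, lam=lam)
        return evaluate(S[np.ix_(known, known)], A_ll)[2]   # normalized AUC
    return max(grid, key=train_auc)
```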

6 Related Work

Liben-Nowell and Kleinberg [1] introduced the link prediction problem and showed that simple graph-theoretic measures, such as the number of common neighbors, are able to detect links that are likely to appear in a social network. By adopting more elaborate measures that consider the ensemble of all paths between two nodes (e.g., the Katz index [7]), they further improved the prediction quality.

Inspired by this seminal work, a great deal of attention has been paid to discovering unsupervised measures that better predict link structure [9, 22, 26]. These link prediction methods are ad hoc, and no method is consistently superior across all data sets. This makes sense in that different networks have distinct mechanisms of formation and evolution, as shown in [22]. It is noted, however, that the Katz index [7] and its truncated variant have been successful in many applications [1, 6].

Link prediction can also be viewed as a binary classification problem in which the label of a link is either 0 (non-existing) or 1 (existing). Hasan et al. [24] propose to employ various classification approaches to handle link prediction directly. Lichtenwalter et al. [22] notice the extremely imbalanced class distribution of the links and propose an over-sampling strategy to overcome the imbalance. Vert et al. [25] propose a supervised model in which the points of the network are first mapped to a Euclidean space and the prediction is made by computing distances between the points; the mapping is mandated to conform to the existing link structure. Inspired by recent developments in semi-supervised learning [18], Kashima et al. [20] apply the label propagation framework [27] to the link prediction problem and are able to bring the time complexity of their algorithm down from $O(n^4)$ to $O(n^3)$. Brouard et al. [19] present a new representer theorem for semi-supervised link prediction and show that semi-supervision is able to boost the performance of supervised link prediction.

Most research on link prediction assumes that a general link structure of the network is known, either the network at time t or the network with some missing links. Leroy et al. [2] introduce the cold start link prediction problem by assuming that no known link structure is available.


Their ambition is to rebuild the network from the auxiliary information. However, they manually inspect the possible features from a single source of auxiliary information and find the best feature by comparing against ground truth. With multiple auxiliary sources, their approach would lose much of its efficiency.

Several researchers have focused on problems related to multiple sources in social networks. Tang et al. [4] study the problem of clustering a social network in the context of multiple sources, i.e., multi-dimensional social networks. Du et al. [5] study the evolution of a large multi-modal social network from Nokia. In the work most closely related to ours, Lu et al. [6] propose a weighted ensemble model to combine different features from various sources, along with two regularization strategies to control the complexity of the model. One limitation is that they mandate the Katz index to be the best feature for each source; this assumption may break down on other data sets, whereas we propose a feature selection strategy to determine the best feature for each source.

7 Future Work and Conclusion

In this work, we introduced the pseudo cold start link prediction with multiple sources problem: given a small subgraph and multiple auxiliary sources, we aim to predict the link structure of the social network. The problem poses great challenges in that traditional link prediction methods do not work in this scenario. Multiple auxiliary sources also bring a curse: they imply an explosion of possible features for link prediction. At the same time, multiple sources offer distinct, even complementary, perspectives for predicting the link structure of the social network.

To tackle this problem, we proposed a two-phase method. We first proposed an efficient and effective feature selection strategy to cure the curse of multiple sources; the strategy finds the best feature in each source by aligning candidate features to the known subgraph, and its efficiency is guaranteed by its equivalence to one-dimensional curve fitting. We also discovered that by combining the best feature from each source (which may not be the best features overall), we achieved a better result than the best individual feature among all sources. Considering the risk of over-fitting due to the small size of the known subgraph, in the second phase we proposed a regularization method that maps the nodes of the social network to a Hilbert space and mandates that the mapping respect the known subgraph and a regularizer. We showed that the proposed regularization improves the performance obtained with the best feature discovered in the first phase. Extensive experimental evaluation verified our assumptions and conclusions, thus showing the benefits of our approach.

In the future, we are interested in exploring more features from each source and designing an efficient, unified framework for finding the best feature for link prediction. Also, in this work we used the average ensemble of the best feature from each source as the input to the second phase; it is natural to shift to a weighted ensemble of the best features, so an efficient weighting scheme is a goal of future interest.

References

[1] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. Journal of the American Society for Information Science and Technology, 2007.
[2] V. Leroy, B. Cambazoglu, and F. Bonchi. Cold start link prediction. KDD, 2010.
[3] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. KDD, 2003.
[4] L. Tang, H. Liu, J. Zhang, and Z. Nazeri. Community evolution in dynamic multi-mode networks. KDD, 2008.
[5] N. Du, H. Wang, and C. Faloutsos. xSocial: analysis of large multi-modal social network patterns and a generator. ECML/PAKDD, 2010.
[6] Z. Lu, B. Savas, W. Tang, and I. Dhillon. Supervised link prediction using multiple sources. ICDM, 2010.
[7] L. Katz. A new status index derived from sociometric analysis. Psychometrika, 1953.
[8] J. Kandola, J. Shawe-Taylor, and N. Cristianini. Learning semantic similarity. NIPS, 2002.
[9] T. Ito, M. Shimbo, T. Kudo, and Y. Matsumoto. Application of kernels to link analysis. ICDM, 2005.
[10] F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. TKDE, 2007.
[11] A. Smola and R. Kondor. Kernels and regularization on graphs. Proc. Conf. on Learning Theory and Kernel Machines.
[12] J. Kunegis and A. Lommatzsch. Learning spectral graph transformations for link prediction. ICML, 2011.
[13] E. Senkene and A. Tempel'man. Hilbert spaces of operator-valued functions. Lithuanian Mathematical Journal, 1973.
[14] A. Caponnetto, C. Micchelli, M. Pontil, and Y. Ying. Universal multitask kernels. Journal of Machine Learning Research, 2008.
[15] C. Micchelli and M. Pontil. On learning vector-valued functions. Neural Computation, 2005.
[16] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 1950.
[17] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society, 1909.
[18] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 2006.
[19] C. Brouard, F. d'Alché-Buc, and M. Szafranski. Semi-supervised penalized output kernel regression for link prediction. ICML, 2011.
[20] H. Kashima, T. Kato, Y. Yamanishi, M. Sugiyama, and K. Tsuda. Link propagation: a fast semi-supervised learning algorithm for link prediction. SDM, 2009.
[21] C. Cortes, M. Mohri, and J. Weston. A general regression technique for learning transductions. ICML, 2005.
[22] R. Lichtenwalter, J. Lussier, and N. Chawla. New perspectives and methods in link prediction. KDD, 2010.
[23] H. Kautz, B. Selman, and M. Shah. Referral Web: combining social networks and collaborative filtering. Communications of the ACM, 1997.
[24] M. Hasan, V. Chaoji, S. Salem, and M. Zaki. Link prediction using supervised learning. SDM, 2006.
[25] J. Vert and Y. Yamanishi. Supervised graph inference. NIPS, 2004.
[26] Z. Huang. Link prediction based on graph topology: the predictive value of the generalized clustering coefficient. KDD, 2006.
[27] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. NIPS, 2004.
