p709

7/30/2019 p709

http://slidepdf.com/reader/full/p709 1/10

Bimodal Invitation-Navigation Fair Bets Modelfor Authority Identification in a Social Network

Suratna BudalakotiThe University of Texas at Austin

[email protected]

Ron BekkermanLinkedIn Corporation, USA

[email protected]

ABSTRACT

We consider the problem of identifying the most respected,authoritative members of a large-scale online social network(OSN) by constructing a global ranked list of its members.The problem is distinct from the problem of identifying in-fluencers: we are interested in identifying members who areinfluential in the real world, even when not necessarily soon the OSN. We focus on two sources for information about

user authority: (a) invitations to connect, which are usuallysent to people whom the inviter respects, and (b) members’browsing behavior, as profiles of more important people areviewed more often than others’. We construct two directedgraphs over the same set of nodes (representing memberprofiles): the invitation graph and the navigation graph re-spectively. We show that the standard PageRank algorithm,a baseline in web page ranking, is not effective in peopleranking, and develop a social capital based model, calledthe fair bets model, as a viable solution. We then proposea novel approach, called bimodal fair bets, for combininginformation from two (or more) endorsement graphs drawnfrom the same OSN, by simultaneously using the authorityscores of nodes in one graph to inform the other, and viceversa, in a mutually reinforcing fashion. We evaluate the

ranking results on the LinkedIn social network using thismodel, where members who have Wikipedia profiles are as-sumed to be authoritative. Experimental results show thatour approach outperforms the baseline approach by a largemargin.

Categories and Subject Descriptors

H.2.8 [Database Applications]: Data Mining

General Terms

Algorithms, Experimentation

Keywords

Social Networks,Authority,PageRank,Reputation,Influence

1. INTRODUCTIONAn online social network (OSN) is an imperfect represen-

tation of real-world social interactions, as only a fraction of people’s real-world activities are reflected online. It can be

Copyright is held by the International World Wide Web Conference Com-mittee (IW3C2). Distribution of these papers is limited to classroom use,and personal use by others.WWW 2012 , April 16–20, 2012, Lyon, France.ACM 978-1-4503-1229-5/12/04.

compared to a still camera capturing snapshots of modernsociety: it cannot see reality from all possible angles, how-ever it often manages to capture the essence. This raisesa fascinating question: can we use the structure of an on-line social network (OSN) to infer real-world social statushierarchy, in general; and – in particular – can we identifywho the most well respected, prominent members of the so-ciety are? In this paper, we attempt to construct an inferredglobal rank of members of an OSN, according to the level of

recognition they have achieved in the real world. Our modelis applied to the LinkedIn professional network, which islarge enough (over 100M professionals) to yield statisticallymeaningful results.

We distinguish between the tasks of identifying authorities and identifying influencers (as, for example, in [9]). Usu-ally, influence is measured in terms of the degree to whicha user’s behavior affects her peers, as observed on the OSN.In order to influence many people, a member needs to bemore than just active on an OSN. Maintaining an influen-tial online presence can be extremely competitive, and manyrespected people may not be willing to invest such a high de-gree of time and effort. Another potential reason could bethat they represent a demographic group not yet engaged

by the OSN, or the online world in general. Moreover, so-cial media influencers might not be authoritative in the realworld: Khrabov et. al. [13] observed that the most influen-tial Twitter users are relatively, if not completely, unknownoutside their online circles. In contrast, we are interested inauthority identification: inferring influence a user has in thereal world, even if this influence is not highly notable online.

Since the global authority information is not directly ob-tainable, we build our model over two basic actions mostOSN members perform, both of which can be viewed as alocal endorsement from a user A to a user B: (a) user A in-vites user B to connect, and (b) user A views user B’s profile.Our assumption is that users on a social network are morelikely to send invitations to users that they respect, or atleast do not disrespect. Similarly, browsing through another

user’s profile can be seen as a sign of interest. Obviously,the local signal is weak and noisy: there are many reasonswhy a user could send an invitation to another user, or viewtheir profile. However, the signal distills with the volume: if thousands of users wish to connect with a particular person,or her profile is viewed by millions, it is a strong indicationthat the person is an authority outside the OSN. The sig-nal becomes even stronger when aggregated over the entireOSN data. If user A wishes to connect with user B, and userB wishes to connect with user C, there is a certain amount

WWW 2012 – Session: Social Interactions and the Web April 16–20, 2012, Lyon, France

709

7/30/2019 p709


of implicit endorsement that goes from user A to user C.Aggregating billions of invites and profile views, the OSNproduces a stream of endorsement that is directed towardsa few users who may rarely send invites or view profiles, butare getting constantly approached by others who are con-stantly approached by others etc. The real-world influenceof people on the very top of this pyramid is indubitable.

We notice that the text on a user profile is an insufficientand sometimes misleading proxy to the the real-world status

of a particular OSN member. Some member profiles are tooshort and do not contain enough information to perform anaccurate inference. Some are too verbose and may be exag-gerated for marketing or search engine optimization (SEO)purposes. We do not use profile information to fit our model,however, we rely on user profiles in our model’s evaluation.

The goal of this work is to answer the question “Who arethe most respected . . . on LinkedIn?”, rather than “Is userA more respected than user B?”. Despite that we build aglobal ranked list of LinkedIn users, it would be irresponsiblefor us to infer that the user ranked 30M on that list is morerespected than the user ranked 31M, because the signal istoo noisy. However, we’d like to be confident that the top 1%of users in the list are highly respected by many people. For

example, we can infer that T. Boone Pickens

1

is among themost respected financiers on LinkedIn – he ended up beingthe first financier in the constructed ranked list. By creatingone global ranked list that consists of many millions of users,we guarantee that the resulting pool of authoritative users(say, the top 1% of users in the ranked list) is large andrepresentative enough such that it can be further refined bya specific request (“Who are the most respected financiers on LinkedIn?” or “Who are the most respected people fromJapan on LinkedIn?”

Apart from the novelty of the problem being addressed,the paper makes the following technical contributions:

1. We propose a tournament-based model of user interac-tions, where users can be visualized as expending and

accumulating social capital while respectively initiat-ing and accepting links. In other words, user A can bemodeled as making a payment in social capital whenshe sends user B an invitation, or views B’s profile.A’s authority score, in this framework, is the amountshe can afford to pay per interaction initiated by her,based on the capital she has accumulated from others.Mathematically, this is represented using a variant of the fair bets model [5, 20].

2. We present an approach to combining authority-relatedinformation from multiple graphs, where each graph isconstructed over the same set of users, but representsdifferent aspects/modes of their behavior. This modelis equivalent to simultaneously using the fair bets score

vector of one graph for random restarts of walks on theother graph, and vice versa.

The algorithms described here, while evaluated only on theLinkedIn social graph, are easily applicable to other socialnetworks. Having to combine data from multiple graphs is acommon problem in social network analysis. Some examplesof such datasets are: user ‘follow’ networks and ‘retweeting’behavior on microblogging sites such as Twitter2, and user

1www.boonepickens.com2www.twitter.com

social networks and voting/response behavior on Q&A orcontent sharing platforms.

2. BACKGROUNDUser interactions on social networks lend themselves nat-

urally to a graph-based representation. Treating each useras a vertex, we represent the invitation data as a directedgraph, with an edge directed from the inviting user to the

invitee. This graph is referred to as the invitation graph .Similarly, a navigation graph is constructed, with edges di-rected from the person who views a profile, to the personwhose profile was viewed. So, in both cases, an endorse-ment is modeled as ‘flowing’ in the direction of the edge.

Hyperlinked datasets such as the WWW are often repre-sented as directed graphs [3, 14] for the task of identifyinghigh-quality pages, with hyperlinks interpreted as endorse-ment. Similar representations (with invitations, ‘follows’, or‘retweets’ as directed edges) have been used for identifyinginfluencers in social networks [4, 7, 22]. A variety of linkanalysis algorithms [3, 14] exist for identifying importantnodes in such graphs, the most popular one being PageR-ank [2, 3].

The PageRank algorithm employs a recursive definition of

authority: the authority of a vertex is a weighted sum of theauthority of the vertices that point to it (ignoring, for now,the random restart aspect). This can distort user authorityestimates in a number of ways:

1. An authoritative user is more likely to accept connec-tion invitations than to send them out. This is becausemany non-authoritative users find a lot of value in con-necting with authorities, while the opposite is not al-ways true. More generally, link formation in OSNs isfound to be consistent with a status-based model [18],where low status nodes link to those of high status.This observation does not play a part in the PageR-ank model.

2. While most information on the Web is publicly acces-sible, social networks such as LinkedIn have a varietyof privacy settings, and sometimes do not allow usersto access profiles more than a few degrees from theirown. As a result, a user’s network size and opennessplay a major role in the number of invitations / profileviews they receive.

3. Motivated users can take advantage of behavioral norms.For example, the norm of reciprocity, i.e., users feelingobligated to return links with courtesy links, is usedby unscrupulous users to increase their link count, onboth Flickr3 and Twitter [16, 22].

4. Older users can become entrenched over time, and havean indegree disproportionate with their authority level.This can discourage younger users from participating.It may not be an issue on professional networks suchas LinkedIn, but is a serious factor [17] in informationsharing networks such as Twitter and Digg4.

In other words, members’ PageRank scores on an OSNgraph depends on two factors: a) their authority, which de-termines the desirability of a connection with them, and

3www.flickr.com4www.digg.com


710

http://www.boonepickens.com/

http://www.twitter.com/

http://www.flickr.com/

http://www.digg.com/

http://www.digg.com/

http://www.flickr.com/

http://www.twitter.com/

http://www.boonepickens.com/

7/30/2019 p709


Figure 1: Indegree-Total Connections Ratio His-togram: Users with ≤ 10 Connections

b) their visibility on the graph, that is, the likelihood thatthey will be noticed by other users. Non-authoritative mem-bers can improve their PageRank scores by increasing theirvisibility, usually through increased activity (engaging other

members via page views), or increased connectivity (by send-ing more invitations). We use the fair bets model, discussednext, to take into account this observation.

3. THE FAIR BETS MODELThe fair bets model was developed [5, 20] to rank play-

ers in round-robin tournaments. The model is based on theidea that, a player is allowed to bet an amount of money pergame. She forfeits this amount to her opponent if she losesthe game, and if she wins, she is awarded the amount betby her opponent. The score assigned to a player is then theamount she can afford to bet, assuming she has to bet thesame amount against all players. More recently, this modelhas been studied by Slutzki et al. [21].

Mathematically, this model can be represented as follows:we can construct a graph G, with each player i as a vertexvi, and assuming each pair of players played a maximum of one game against another player (this can be generalized),an edge directed from the loser of the game to the winner.This graph can then be represented as a matrix V , wherevij = 1 if there is an edge directed from player i to j in thegraph. Then the fair bets score aj of player j satisfies thefollowing property:

N i=1

vijai =N i=1

vjiaj

That is, the amount of money any player j pays out pergame (aj) is the amount she makes in total, divided by thenumber of games lost. This can can be written in matrixform as:

V ⊤a = Ca (1)

where C is a diagonal matrix such that C ii is equal to thesum of the ith row of V , that is, C ii =

N k=1 vik. A straight-

forward relationship can be established between PageRankand fair bets scores [5]. For a stochastic matrix P , thePageRank vector r corresponding to the matrix is given by

the equation:

P ⊤r = r (2)

Let P = C −1V . That is, P is the stochastic matrix con-structed by normalizing all the rows of some tournamentmatrix V to 1.

Then, equation (2) can be written as:

V ⊤C −1r = r ⇒ C −1V ⊤ C −1r = C −1r

Setting a = C −1r gives:

C −1V ⊤a = a⇒ V ⊤a = Ca

which is the same as equation (1). That is, if the PageRankvector for a stochastic matrix P is given by r, then the fairbets vector of the original graph matrix V = CP is givenby a = C −1r. Thus, mathematically, the fair bets score of a vertex in a graph is equal to its PageRank score, dividedby its outdegree.

3.1 Fair Bets as Social CapitalIn the context of online social networks, fair bets can be

viewed as a model of social capital accumulation and expen-diture. Users can grow their connection graph in two ways:either by sending invitations or accepting them. As send-ing an invitation requires time and effort on a user’s behalf,and a willingness to make the gesture, users are more likelyto make this investment if they believe the new connectioncan help them in achieving social/professional growth. Thisgrowth can take place online: more connections increase thelikelihood that someone will stumble on the person’s profile,thus increasing the likelihood of invitations. Or both theoriginal invitation, and subsequent new connections, couldbe side-effects of real world activity.

Thus, over time, the initial time and social capital spentin inviting connections pays off, as the user accumulates in-vitations in return. In this setup, highly respected usersreceive multiple invitations without making a significant ef-

fort, while the payoff for less authoritative users is lower.The standard fair bets model can then be visualized as fol-lows: assuming users were paying each other to accept invi-tations on an OSN, then the fair bets score of a user is theamount she can afford to pay on average.

4. USER AUTHORITY EVOLUTIONFor an OSN graph, the standard fair bets model discussed

above assumes a linear relationship between a vertex’s au-thority score and its outdegree. The fair bets score ai of avertex vi can be written as:

ai =indegree(vi)

outdegree(vi)· µi

That is, the fair bets authority score of a vertex directlyproportional to a) the mean authority accumulated per in-cident vertex, µi, and, b) the indegree to outdegree ratio(i-o ratio). Both factors depend on the stage of evolution of the vertex. The evolution of user vertices on the invitationgraph can be divided into three stages. The first stage isthat of users with less than 10 connections. A normalizedhistogram of the indegree to number of connections (i-t ra-tio) for this group of users is shown in Figure 1. As can beseen, a majority of these users have a ratio close to 1.Thisis because new users are unlikely to send invitations, due to


711

7/30/2019 p709


Figure 2: Indegree-Total Connections Ratio His-togram: Users with 50 to 1000 Connections

being isolated by the small size of their connection graph.This can give them an artificially high i-o score. To addressthe skewness of the i-o ratio of poorly connected users, we

use a Laplace smoothing of the outdegree value in the fairbets formula, by adding a small constant (equation 3).

The i-t ratio for users with 50−1000 connections is shownin Figure 2. While there’s still a fair number of users withan i-t ratio of more than 0.9, the ratio is relatively normallydistributed, with an overwhelming majority in the 0.2-0.6range.

On the other extreme, for users with more than 3500 con-nections, the graph is biased once again towards much higherratio values, as shown in Figure 3. This is a very small subsetof users, consisting largely of extremely active5 and influen-tial users. PageRank would rank these users near the topof the ranked list. Interestingly, a fair bets-based rankingplaces these users near the bottom of the list (with rare ex-

ceptions), despite their high indegree-outdegree ratio. Thisis because, for users with an extremely large number of in-coming edges, a majority of these incoming edges have lowvalues of authority, due to the way authority scores are usu-ally distributed across the graph (power law). This resultsin a lower mean value.

4.1 Log Fair BetsAs a basic validation, we evaluated the relationship be-

tween the fair bets based rank assigned to a user, and his/herprofessional seniority level. The seniority level data is pro-prietary standardized data derived from LinkedIn profiles,that maps millions of job titles in the LinkedIn dataset toone of ten levels: from intern (0), to founder (9). A rankingby authority is more likely to be reliable if users at higher

ranks, on average, hold titles of higher seniority, comparedto lower ranked users. Figure 4 shows the evolution of se-niority with fair bets ranks. The ranks towards the right arethe highest ranks.

Interestingly, there is a dramatic jump in the seniority of people at the very top of the ranked list. However, after acertain point, users’ ranks seem to bear little relationship

5This seems paradoxical, but a user with 10, 000 connectionsand an i-t ratio of 0.9 has sent out 1000 invitations, a higherlevel of activity than most users.

Figure 3: Indegree-Total Connections Ratio His-togram: Users with More Than 3500 Connections

to seniority levels. The reason is the over-steep normaliza-tion: a user with 100 connections will need to have twice thePageRank score as a user with 50 connections (assuming the

same i-o ratio), to have the same fair bets score. Intuitively,this seems unlikely. PageRank scores are likely to follow apower law distribution, so that a few users would contributemost of a user’s score. Assuming more active users havehigher scores, users are more likely to receive their morevaluable edges sooner rather than later. Also, a user’s con-nection network grows much faster in the initial stages, aseach connection makes them visible to many new users. Atsome point, the law of diminishing returns would set in, asmost connections of a newly added connection are alreadypart of the user’s network, thus unlikely to lead to moreincoming invitations. The same logic extends to page views.

Based on these observations, the normalization we use,which we refer to as log fair bets (LFB), is as follows:

f i = indegree(vi)log (10 + outdegree(vi))

· µi (3)

Log fair bets can be interpreted as assuming that the arrivalpatterns of incoming links follows a power law distributionwith respect to time (measured by outdegree). That is, theexpected authority value of links received once k invites havebeen sent is 1

k. This expected value includes both the prob-

ability of receiving a link, and the authority of the link. Inthis interpretation, the logk can be seen as approximatingthe sum

k

i=11i

. The value of 10 is the Laplace smoothingparameter, fixed based on the analysis in the previous sec-tion. A validation similar to that for fair bets results wasdone for log fair bets by comparing ranking results againststandardized seniority data. Figure 5 shows the resulting

graph. As can be seen, the log fair bets graph is muchsmoother, and the seniority level tracks the ranking muchmore closely.

5. COMBINING INVITATION AND NAVI-

GATION GRAPHSWe construct two separate graphs, the invitation graph

and the navigation graph, to represent invitation data andbrowsing patterns respectively. The assumption behind thisdecision is that the two graphs are complementary: there is


712

7/30/2019 p709


authority-related information in each graph that is missingin the other. This is in contrast to a commonly acceptedassumption about user browsing patterns.

According to the the random surfer model interpretationof the PageRank algorithm for the Web, the PageRank vec-tor corresponds to the fraction of time a user will spend on aweb page, if she were to start at a random page, and at eachtimestep, randomly select an outgoing edge. This positsan extremely close relationship between link structure and

browsing patterns. The assumption is reasonable for webpages, but not for social network graphs, for a number of reasons. For example, while both invitations and browsingbehavior are asymmetric, that is, one user takes the initia-tive, which reveals a greater involvement on their part thanthe other user, the degree of asymmetry is much lesser forinvitation than navigation behavior. For an inviter’s connec-tion request to be successful, the invitee has to accept therequest. On the other hand, navigation requires no activityby the person whose profile is being viewed, and is whollyasymmetric. Thus, invitation requests are more likely to b edirected to people in the inviter’s professional peer group,while navigation data reflects information about who theuser aspires to know. A user may be more likely to con-

nect to her immediate supervisor, but may browse her com-pany CEO’s profile more often. Also, invitation requests areguided by a number of social norms. For example, a usermay feel obligated to send requests to all the people shemeets at her workplace. Such obligations do not exist fornavigation behavior.

Given two separate graphs over which authority ranks canbe calculated, a combined rank can be arrived at in twoways:

5.1 Rank Merging via MetasearchUse a metasearch-based approach to merge the two rank-

ings. Borda voting [1], for example, is a simple but usuallyeffective approach to merging two ranked lists: the rank of a user is essentially the mean of their rank in the two lists.

5.2 Bimodal Authority ModelsIn the random surfer interpretation of the PageRank algo-

rithm, at each timestep, with a certain probability d (usu-ally set to 0.85), the surfer randomly selects an outgoinglink from the current page. With the remaining probability1 − d, the surfer gets bored and jumps to a completely newpage. The probability 1 − d is referred to as the teleporta-tion probability , and the vector the new page is chosen fromis called the teleportation vector . The vector can be uni-form, or biased to reflect some priorly known information.For example, the teleportation vector could be personalized[6] given sufficient information about the surfer, or be bi-ased towards trusted vertices. Its effect is to bias the overall

scores towards the preferences of the vertices with highervalues in the teleportation vector.

A natural way, then, to inform one graph (say, invitation)with information from the other (say, navigation) would beto use the authority vector of one as the teleportation vectorfor the other. Following this, the improved results in thenavigation graph can be reused to improve the results inthe invitation graph, and so on till convergence. We referto this approach as the bimodal authority approach. Theidea behind this approach is mutual positive reinforcement:useful information in one graph can be used to improve the

authority estimates of the other graph, and vice versa. Theteleportation vector could be based on PageRank (bimodal PR), or log fair bets scores (bimodal LFB ).

However, as the next section shows, successive alternateruns of the two algorithms are not necessary. Instead, acomposite graph can be created, by merging the invitationand navigation graphs in a certain way. The invitation andnavigation PageRank vectors that would result from the bi-modal approach, can be obtained from the the PageRank

vector of the composite graph.

5.2.1 Proof Of Equivalence: Bimodal and Compos-ite Graph Models

We are given two graphs, GA = (V A, E A) and GN =(V N , E N ), representing different aspects of user behavior.Both graphs have the same number of vertices, say, k. Foreach vertex v ∈ V A , there is a corresponding twin vertexv′ ∈ V N . In our example, the vertex v for a user representsher invitation behavior, while v′ represents her navigationbehavior. We would like to use the PageRank vector of onegraph as the teleportation vector of the other. That is, theteleportation probability for v ∈ V A should be equal to thePageRank score of its twin vertex v ∈ V N , and vice versa.

To do this efficiently, we prove the following result:Construct a new graph G = (V A ∪ V N , E = E A ∪ E N ∪E AN ), where E AN is a new set of directed edges, betweenall pair of twin vertices, and weighted d. That is, a vertex vin the invitation graph is connected edge to its twin vertexv′ in the navigation graph via a directed edge of weight d.A similar directed edge of weight d connects v′ to v. Thencalculating the PageRank vector for graph G is equivalentto solving the problem described above.

Proof: Let the transition matrix of V A be written as P Aand its (unknown) PageRank vector be rA . Similarly, letthe transition matrix and PageRank vector of V N be P N andrN respectively. Let e be a vector such that ei = 1 for all i.Then, by our recursive definition, PageRank vectors of GA

and GN satisfy the following equations:

(1− d)P A + der⊤N

⊤rA = rA (4)

(1− d)P N + der⊤A

⊤

rN = rN (5)

Expanding (4), we get:

(1− d)P A⊤rA+drN e

⊤rA = rA ⇒ (1− d)P A⊤rA+drN = rA

(6)since rA sums to 1.

Similarly, for (5), we get:

(1 − d)P N ⊤

rN + drA = rN (7)

Let I k be an identity matrix of size k. Then equations (6)and (7) can be written in matrix form as follows:

(1 − d)P A

⊤ dI kdI k (1 − d)P N

⊤

rArN

=

rArN

(8)

Let P be a matrix, such that:

P =

(1− d)P A dI k

dI k (1− d)P N

(9)


713

7/30/2019 p709


Figure 4: User Fair Bets Rank vs Mean Seniority Level (over consecutive groups of 2000 people)

Figure 5: User Log Fair Bets Rank vs Mean Seniority Level (over consecutive groups of 2000 people)


714

7/30/2019 p709


http://www.wikipedia.org/

7/30/2019 p709


vectors of their Bags-of-Words. We estimate the IDF scoresof words over the entire collection of LinkedIn member pro-files.

For every person wi with a Wikipedia profile from thecandidate list, and for every LinkedIn member li with thesame name, we compute the right side of formula (10) anddecide that li = wi if the resulting value is above a presetthreshold. After some hand-tuning, the final system yieldedabout 30K LinkedIn members who have Wikipedia profiles.

We estimate the mapping’s precision as very high – we spotchecked a couple of hundred mappings and did not see asingle instance of a wrong mapping. We cannot estimatethe mapping’s recall though. For our model’s evaluationpurposes however, the mapping’s recall does not matter.

7. EXPERIMENTAL RESULTS

7.1 Evaluation MeasuresWe use two widely used measures, the mean average pre-

cision (MAP) score, and the normalized discounted cumu-lative gain (NDCG) score, to evaluate the quality of ourranked results.

Given a ranked list and a set of relevant documents (or

in our case, users who have Wikipedia profiles), its AveragePrecision (AP) is defined as the mean of the precision scores,calculated at each rank where a relevant match was foundon the list. MAP is the average of AP scores across multiplequeries. Since, in our case, we are essentially evaluating asingle query, the average precision score serves as the MAP.Since we are more interested in the quality of the higherranks of our results, than the entire list, the MAP scoresare given after cutting off the list at three thresholds: after1000 ranks (MAP@100), after a hundred thousand ranks(MAP@100K), and after one million ranks (MAP@1mil).

The MAP measure treats all users on the Wikipedia listas equally relevant. The other measure we use, NDCG, en-ables us to differentiate between users in terms of degreesof relevance. Given a ranked list, the DCG score of the listupto n ranks is given by:

DCG = m1 +ni=2

mi

log2 i(11)

where mi is the estimated relevance of the ith match. TheNDCG score is given by normalizing this value by the ideal DCG (IDCG) value, that is, the maximum score that anyranking can achieve given the relevance scores.

For any user with a Wikipedia profile, we calculate herrelevance score mi, as the log of the mean number of pageviews per day received by her profile, based on two monthsof Wikipedia page view data8 (May and June 2011). Therelevance score for all users receiving less than three pageviews a day is set to 1. This gives us a relevance rangeof approximate 1-15, as highly trafficked profile pages onWikipedia receive around 10, 000 page views a day.

Based on this, the idea DCG score (IDCG) can be calcu-lated as follows: sort the Wikipedia users’ list by descendingorder of page views, and calculate:

IDCG = log2 p1 +ki=2

log2 pilog2 i

(12)

8The data was collected from the websitehttp://stats.grok.se .

where pi is the page views received by the i-th ranked user.The value of k is the cutoff limit. In our case, the maximumis approximately 30, 000, the number of Wikipedia profileswe have mapped to LinkedIn users. To ensure that ranksbeyond the first few hundred impact NDCG results, we di-vide user ranks into buckets of 500. For the first 500 ranks,i = 2, i = 3 for the next 500, and so on, in equations (11)and (12). Thus, a user with a relevance score mi, placedin the first 500, would add mi to the DCG score, while the

same user, placed in the 501-1000 range would add milog2 3 to

DCG.Like MAP, we calculate NDCG after 1000 (NDCG@1000),

100,000 (NDCG@100K), and 1 million (NDCG@1mil). TheIDCG score increases in value from NDCG@1000 to NDCG@100K, but then remains constant till NDCG@1million. Forthis reason, the NDCG score falls from the 1000 to 100,000level, but then increases for the 1 million level.

7.2 Algorithm ComparisonAll algorithms were implemented in a map-reduce frame-

work, and run on a set of 100 Hadoop nodes. The open-source implementation of PageRank in the Pegasus soft-ware toolkit [12] was used as the original code base, and the

code was modified to incorporate bimodal authority models.The results are shown in Table 1. The percentage improve-ments/deterioration, shown in brackets in each case, is basedon treating the invitation graph based PageRank (InvitationGraph-PR) algorithm as the baseline for comparison. Ascan be seen from the table, the log fair bets (Log FB) modelconsistently performs better than the PageRank model forboth the invitation and navigation graphs.

Interestingly, among the hybrid models that combine bothinvitation and navigation data, the best performing ones arethe log fair bets models (Borda LFB and Bimodal LFB).The performance of the PageRank-based hybrid models isaround the same as the single graph-based approaches. Thereason for this is the large impact of user activity levelson the hybrid PageRank models. In the case of bimodalPageRank, the largest mutual reinforcement is for user whoare most active, as they have higher PageRank scores onboth graphs. A similar effect occurs in Borda voting basedPageRank. Since Borda voting is based on mean scores, thehighest ranked users on both graphs are people ranked highlyon both graphs. These are usually highly active users. Incontrast many authoritative users are not very highly rankedin one of the two graphs (for example, many people wouldview the profile of someone famous like Bill Gates, but veryfew would send an invite), and end up being ranked lowon average. As a result, PageRank-based Borda voting isunable to take advantage of the best information in bothgraphs. In contrast, the bimodal log fair bets more (BimodalLFB) is the only one actually able to achieve positive mutual

reinforcement, and outperforms all other algorithms by asignificant margin.

The only exception to this is the NDCG@1000 score, wherethe bimodal LFB comes in second to navigation graph LFB.The reason behind this is that there a small number of veryhigh profile ’celebrity’ users, who garner an extremely largenumber of page views both on Wikipedia and LinkedIn.Their high page views give them large values of mi, whichgives navigation LFB an edge at the 1000 level. This ad-vantage, however, does not carry beyond the first 1000 orso members. Even up to the 1000 level, the actual number


716

http://stats.grok.se/



7/30/2019 p709


7/30/2019 p709


both inside and outside the OSN. A long list of 10 millionpeople or so would be too noisy as its length is comparablewith the size of the OSN. A reasonable cutoff might be ataround a hundred thousand people, based on the guidancegiven in Figures 4 and 5: people’s seniority level drops downdramatically after the first few hundreds of thousands.

Our model is straightforwardly generalizable to any net-work with multiple types of endorsements between nodes.For example, in the social search domain, nodes are Web

pages and edges are hyperlinks between them, while anothertype of edges can be the social network of the pages’ cre-ators. Similarly, for many social networks, users’ invitationand activity (‘likes’,‘retweets’) behavior can be modeled astwo different graphs.

We note that the implicit signals of endorsement, such assending a connection invite, or viewing someone’s profile,are more useful than explicit signals of endorsement, suchas, for example, writing an online recommendation. Thereason for this is two-fold. First, the explicit data is sig-nificantly sparser than the implicit data. Second, there aremultiple incentives for people to endorse someone publicly:for example, this can bring visibility to the endorser. Care-ful analysis of online users’ behavior, coupled with the large

scale and richness of the raw data, is the key to answeringsuch sociological questions, that our predecessors have beentrying to answer for hundreds of years.

10. REFERENCES

[1] J. A. Aslam and M. Montague. Models formetasearch. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval , SIGIR ’01, pages276–284, New York, NY, USA, 2001. ACM.

[2] M. Bianchini, M. Gori, and F. Scarselli. InsidePageRank. ACM Transactions on Internet ,5(1):92–128, 2005.

[3] S. Brin and L. Page. The anatomy of a large-scale

hypertextual Web search engine. Computer Networks and ISDN Systems , 30(1-7):107–117, Apr. 1998.

[4] M. Cha, H. Haddadi, F. Benevenuto, andK. Gummadi. Measuring user influence in twitter:The million follower fallacy. In 4th International AAAI Conference on Weblogs and Social Media (ICWSM), pages 10–17, 2010.

[5] H. Daniels. Round-robin tournament scores.Biometrika , 56(2):295–299, 1969.

[6] D. Fogaras, B. Racz, K. Csalogany, and T. Sarlos.Towards scaling fully personalized pagerank:Algorithms, lower bounds, and experiments. Internet Mathematics , 2(3), 2005.

[7] R. Ghosh and K. Lerman. Predicting influential users

in online social networks. In SNA-KDD: Proceedings of KDD Workshop on Social Network Analysis , 2010.

[8] D. F. Gleich, P. G. Constantine, and A. D. Flaxman.Tracking the Random Surfer : Empirically MeasuredTeleportation Parameters in PageRank. Human Factors , 2010.

[9] M. Gomez Rodriguez, J. Leskovec, and A. Krause.Inferring networks of diffusion and influence. InProceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining ,pages 1019–1028, 2010.

[10] A. A. Goodrum, K. W. McCain, S. Lawrence, andC. L. Giles. Scholarly publishing in the internet age: acitation analysis of computer science literature.INFORMATION PROCESSING and MANAGEMENT , 37(5):661–675, 2001.

[11] A. Goyal, F. Bonchi, and L. Lakshmanan. Learninginfluence probabilities in social networks. InProceedings of the third ACM international conference on Web search and data mining , pages 241–250. ACM,2010.

[12] U. Kang, C. E. Tsourakakis, and C. Faloutsos.Pegasus: mining peta-scale graphs. Knowl. Inf. Syst.,27(2):303–325, 2011.

[13] A. Khrabrov and G. Cybenko. Discovering influence incommunication networks using dynamic graphanalysis. In Proceedings of the 2010 IEEE Second International Conference on Social Computing ,SOCIALCOM ’10, pages 288–294, Washington, DC,USA, 2010. IEEE Computer Society.

[14] J. M. Kleinberg. Authoritative sources in ahyperlinked environment. Journal of the ACM ,46(5):604–632, Sept. 1999.

[15] H. Kwak, C. Lee, H. Park, and S. Moon. What is

Twitter, a social network or a news media? In WWW ’10: Proceedings of the 19th international conference on World wide web, pages 591–600, New York, NY,USA, 2010. ACM.

[16] J. Lee, P. Antoniadis, and K. Salamatian. FavingReciprocity in Content Sharing Communities: AComparative Analysis of Flickr and Twitter. InAdvances in Social Networks Analysis and Mining (ASONAM), 2010 International Conference on , pages136–143. IEEE, 2010.

[17] K. Lerman. Social information processing in socialnews aggregation. IEEE Internet Computing: special issue on Social Search , 2007.

[18] J. Leskovec, D. Huttenlocher, and J. Kleinberg. Signed

networks in social media. In Proceedings of the 28th international conference on Human factors in computing systems , pages 1361–1370. ACM, 2010.

[19] Y. Liu, B. Gao, T. Liu, Y. Zhang, Z. Ma, S. He, andH. Li. BrowseRank: letting web users vote for pageimportance. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval , pages 451–458.ACM, 2008.

[20] J. Moon and N. Pullman. On generalized tournamentmatrices. SIAM Review , 12(3):384–399, 1970.

[21] G. Slutzki and O. Volij. Ranking participants ingeneralized tournaments. International Journal of Game Theory , 33:255–270, 2005.

[22] J. Weng, E. Lim, J. Jiang, and Q. He. Twitterrank:finding topic-sensitive influential twitterers. InProceedings of the third ACM international conference on Web search and data mining , pages 261–270. ACM,2010.

[23] D. Zhou, S. A. Orshanskiy, H. Zha, and C. L. Giles.Co-ranking authors and documents in a heterogeneousnetwork. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining , pages739–744, Washington, DC, USA, 2007. IEEEComputer Society.


718

Documents

p709