69
Esteban Moro (UC3M & IIC) Les Houches ‘14

Temporal dynamics of human behavior in social networks (i)

Embed Size (px)

Citation preview

Page 1: Temporal dynamics of human behavior in social networks (i)

Esteban Moro (UC3M & IIC)

Les Houches ‘14

Page 2: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

First Lecture • Motivation • Social dynamical processes • Ties

• Tie activity • Tie dynamics

• Community dynamics • Network dynamics http://bit.ly/LesHouches

Page 3: Temporal dynamics of human behavior in social networks (i)

@estebanmoro vimeo.com/52390053

Social networks are dynamical by nature

@estebanmoro

Page 4: Temporal dynamics of human behavior in social networks (i)

@estebanmoro vimeo.com/52390053

Social networks are dynamical by nature

@estebanmoro

Page 5: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

Tim

e scale

Page 6: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

appear/disappeart1 t2 t3

Barabasi et al., Physica A (2002), Holme et al. Soc.Net.(2004)

Nod

es

Time scale

Page 7: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

form/decayt1 t2 t3

Tie activity is bursty t

Groups of conversation

t1 t1+dt

Hidalgo et al., Physica A (2008), Burt, Soc.Net.(2000)

Barabasi, Nature (2005)

Kovanen et al. J.Stat.Mech (2011), Zhao et al. NetMob (2011)

Tie

sappear/disappear

t1 t2 t3

Barabasi et al., Physica A (2002), Holme et al. Soc.Net.(2004)

Nod

es

Time scale

Page 8: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

form/decayt1 t2 t3

Tie activity is bursty t

Groups of conversation

t1 t1+dt

Hidalgo et al., Physica A (2008), Burt, Soc.Net.(2000)

Barabasi, Nature (2005)

Kovanen et al. J.Stat.Mech (2011), Zhao et al. NetMob (2011)

Tie

sappear/disappear

t1 t2 t3

Barabasi et al., Physica A (2002), Holme et al. Soc.Net.(2004)

Nod

es

Communities form/change/decay

t1 t2

Palla et al. Proc.of SPIE (2007)Com

mun

ities

Time scale

Page 9: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

form/decayt1 t2 t3

Tie activity is bursty t

Groups of conversation

t1 t1+dt

Hidalgo et al., Physica A (2008), Burt, Soc.Net.(2000)

Barabasi, Nature (2005)

Kovanen et al. J.Stat.Mech (2011), Zhao et al. NetMob (2011)

Tie

sappear/disappear

t1 t2 t3

Barabasi et al., Physica A (2002), Holme et al. Soc.Net.(2004)

Nod

es

Communities form/change/decay

t1 t2

Palla et al. Proc.of SPIE (2007)Com

mun

ities

Networks form/change/decay

t1 t2

Kossinets and Watts, Science (2006)

Net

wor

k

Time scale

Page 10: Temporal dynamics of human behavior in social networks (i)

1 Social dynamical process

3.3 Dynamical communication strategies 59

A

0.0

0.2

0.4

0.6

0.8

10 20 50k

mean

gg1

g2

g3

0.00

0.05

0.10

0.15

0.20

10 20 50k

mean

gg1

g2

g3

ki

pi

ci

52 105 158 211

52 105 158 211

B

C D

log

n↵

,i

1

2

3

4

-1 0 1 2 3 4 53.5e-05

7.3e-05

1.5e-04

3.2e-04

6.6e-04

1.4e-03

2.9e-03

6.0e-03

1.3e-02

2.6e-02

1

2

3

4

-1 0 1 2 3 4 5

0.00003511

0.00007296

0.00015161

0.00031503

0.00065460

0.00136021

0.00282641

0.00587305

0.01220371

0.025358322.5e-2

3.5e-5

2.8e-3

3.1e-4

A

B

log n!,i

log i

Figure 3.11: Heterogeneity of dynamical social strategies: A and B shows different snapshots ofthe neighborhood of two different individuals (in red) at 4 equally spaced times in the observationtime window t = 52, 105, 158, and 211 days. Each black (grey) line corresponds to an open(closed) tie at that particular instant. C Log-density plot of the social activity n

↵,i

as a functionof the social capacity

i

for each individual in our database. Solid line corresponds to theline n

↵,i

= 0.75i

obtained through PCA. Dashed curves are the iso-connectivity lines ki

=

i

+ n↵,i

for ki

= 10, 20, 50. D shows the average value for the persistence pi

and clusteringcoefficient c

i

for three groups of equal connectivity (dashed lines in panel C) but for differentquantiles of �

i

. Specifically, �i

< 0.43 (black), 0.43 < �i

< 0.88 (gray) and �i

> 0.88 (white)

It is very important to note that, despite the strict relation between �i

(or both n↵,i

and

i

) and the social connectivity ki

, individual social strategies of communication cannot be identified only by means of k

i

.This becomes clear by looking at panel C of Fig. 3.11, where we show that users

with exactly the same ki

(dashed curves) can be characterized by very different com-binations of social capacity and social activity. As we will see in the following section,the implications of this result go beyond the characterization of how people allocatetime and resources across their social circle. In fact, the adoption of one strategy or the

Page 11: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Cognitive limits • Dunbar’s number

•There is a cognitive limit to the number of people with whom one can maintain stable social relationships. (Dunbar 1992)

• The magical number Seven Plus Minus Two • The number of objects an average human

can hold in working memory is 7 ± 2 (Miller ’56)

ki

hwij|k

ii

Miritello, G. et al., 2013. Time as a limited resource: Communication strategy in mobile phone networks. Social Networks.

Page 12: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

●●

●●

●●

●●

●●

● ●

●●

●●

●● ●

●●

● ●

●●

●●

● ●●

● ●

weak tie

structural hole

bridge

strong tie

• Embeddedness / clustering / triadic closure / weak ties

• Embeddedness, clustering: People who spend time with a thirdare likely to encounter each other(triadic closure). Minimizes conflict, maximizes trusts,…

• Bridges, structural holes (Burt): Bridges have structural advantagessince they have access to non-redundant information

• Weak ties (Granovetter): weak ties tend to connect different areas of the network (they are more likely to be sources of novel information)

Rivera, M.T., Soderstrom, S.B. & Uzzi, B., 2010. Dynamics of Dyads in Social Networks: Assortative, Relational, and Proximity Mechanisms. Annual Review of Sociology, 36(1), pp.91–115.

Page 13: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Contagion

• Human behaviors spread on the network • Dynamics too

• Homophily

• The greater the similarity between individuals the more likely they are to establish a connection

measure the relative topological overlap of the neighborhood of two users A and B, representing the proportion of their common friends, as OAB = NAB/((KA-1)+(KB-1)-NAB), where NAB is the number of common neighbors of A and B, and KA (KB) denotes the degree of node A(B).1 Fig. 3(d) demonstrates the effect of removing links in order of strongest (or weakest) overlaps. In both cases, we find that removing ties in rank order of weakest to strongest ties will lead to a sudden disintegration of the network. In contrast, reversing the order shrinks the network without precipitously breaking it apart.

0

0.1

0.2

0.3

0.4

0.5

0.6

0 5 10 15 20 25

Probability

Number of Churner Neighbours

May ChurnersJune ChurnersJuly Churners

(a)

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0 0.2 0.4 0.6 0.8 1

Probability

Proportion of Pairs Adjacent

3 Churners4 Churners5 Churners6 Churners

(b)

Figure 2. Probability of churning when (a) k friends have already churned (b) adjacent pairs of friends have already churned

This result is broadly consistent with the strength of weak ties hypothesis [5], offering one of its first confirmations in mobile networks. Accordingly, tie strength is driven not only by the individuals involved in the tie, but also by the network structure in the tie’s immediate vicinity. Further, given that the strong ties are predominantly within communities, their removal will only

1 If A and B have no common acquaintances we have OAB =1.

locally disintegrate a community, while the removal of the weak links will delete bridges that connect different communities, leading to a network collapse. Further, we believe that the observed local relationship, between network topology and tie strength affects any global information diffusion process (like churn). In fact, we opine that churn as a behavior can be viewed less as a dyadic phenomenon (affected only by strong churner-churner ties), but more as a diffusion process where both strong and weak ties play a significant role in spreading the influence through the network topology.

4. PREDICTING CHURNERS IN THE CALL GRAPH We next discuss how to exploit social ties to identify potential churners in an operator’s network. Our approach is as follows. We start with a set of churners (e.g. for April) and their social relationships (ties) captured in the call graph (for March). Using the underlying topology of the call graph, we then initiate a diffusion process with the churners as seeds. Effectively, we model a “word-of-mouth” scenario where a churner influences one of his neighbors to churn, from where the influence spreads to some other neighbor, and so on. At the end of the diffusion process, we inspect the amount of influence received by each node. Using a threshold-based technique, a node that is currently not a churner can be declared to be a potential future one, based on the influence that has been accumulated. Finally, we measure the number of correct predictions by tallying with the actual set of churners that were recorded for a subsequent month (e.g. for May). The diffusion model is based on Spreading Activation (SPA) techniques proposed in cognitive psychology and later used for trust metric computations [32]. In essence, SPA is similar to performing a breadth-first search on the call graph GMarch=(V,E). The basic steps are outlined below:-

Node Activation: During each iterative step i, there is a set of active nodes. Let X be an active node which has associated energy E(X,i) at step i. Intuitively, E(X,i) is the amount of (social) influence2 transmitted to the node via one or more of its neighbors. A node with high influence has a greater propensity to churn. Let N(X) be the set of neighbors of X. Active nodes for step i+1 comprises of nodes which are neighbors of currently active members. Further, a currently active node X transfers a fraction of its energy to each neighbor Y (connected by a directed edge <X,Y>), in the process of activating it. The amount of energy that is transferred from X to Y depends on the Spreading Factor d and the Transfer Function F, respectively.

Spreading Factor: SPA starts with a set of active nodes (seed nodes) each having initial energy E(X,0). At each subsequent step i, an active node transfers a portion of its energy d· E(X,i) to its neighbors, while retaining (1 − d) · E(X,i) for itself, where d is the global Spreading Factor. The spreading factor concept is very intuitive and, in fact, very close to real models of energy spreading. Observe that the overall amount of energy in the network does not change over time, i.e. ∑X E(X,i) = ∑X∈V E(X,0) = E0, for each step i. The spreading factor determines the amount of

2 The terms “energy” and “influence” are used interchangeably in

this context.

Worldwide Buzz 27

Attribute Random CommunicateAge -0.0001 0.297Gender 0.0001 -0.032ZIP -0.0003 0.557County 0.0005 0.704Language -0.0001 0.694

Table 5: Correlation coe�cients for random pairs of people and pairs of people who communicate.We compare the degree of homophily of random pairs of users with pairs of users that communicate.

10 20 30 40 50 60 70 80

10

20

30

40

50

60

70

80

10 20 30 40 50 60 70 80

10

20

30

40

50

60

70

80

(a) Random (b) Communicate

Figure 21: Number of pairs of people of di↵erent ages. We plot ages of two people and colorcorresponds to the number of such pairs. (a) Ages of randomly selected pairs of people; we notethere is little correlation. (b) Ages of people who communicate with one another, i.e., ages of peopleat the endpoints of links in the communication network. The high correlation is captured by thediagonal trend.

We contrast this statistic with the correlation coe�cient where we choose users via a processof uniform random sampling across 1.3 billion users.

We also consider two measures of similarity—the correlation coe�cient and the probabil-ity that users have the same attribute value, e.g., that users come from the same countries.

Table 5 compares correlation coe�cient of various user attributes when pairs of usersare chosen uniformly at random with pairs of users that communicate. As attributes arenot correlated for random pairs of people, they are highly correlated for users who com-municate. Also, notice that gender communication is negatively correlated—people tend tocommunicate more with people of a di↵erent gender.

Figure 21 further illustrates the results of table 5. We plot the number of pairs of peopleof particular age. Figure 21(a) shows the distribution over the randomly sampled pairs, i.e.,the Messenger user base created by sampling from 1.3 billion random user pairs, and plotthe distribution over reported ages. As most of the population comes from the age group10-30, the distribution of random pairs of people reaches the mode at those ages but thereis no correlation. Figure 21(b) shows the distribution of ages over the pairs of people that

Worldwide Buzz 27

Attribute Random CommunicateAge -0.0001 0.297Gender 0.0001 -0.032ZIP -0.0003 0.557County 0.0005 0.704Language -0.0001 0.694

Table 5: Correlation coe�cients for random pairs of people and pairs of people who communicate.We compare the degree of homophily of random pairs of users with pairs of users that communicate.

10 20 30 40 50 60 70 80

10

20

30

40

50

60

70

80

10 20 30 40 50 60 70 80

10

20

30

40

50

60

70

80

(a) Random (b) Communicate

Figure 21: Number of pairs of people of di↵erent ages. We plot ages of two people and colorcorresponds to the number of such pairs. (a) Ages of randomly selected pairs of people; we notethere is little correlation. (b) Ages of people who communicate with one another, i.e., ages of peopleat the endpoints of links in the communication network. The high correlation is captured by thediagonal trend.

We contrast this statistic with the correlation coe�cient where we choose users via a processof uniform random sampling across 1.3 billion users.

We also consider two measures of similarity—the correlation coe�cient and the probabil-ity that users have the same attribute value, e.g., that users come from the same countries.

Table 5 compares correlation coe�cient of various user attributes when pairs of usersare chosen uniformly at random with pairs of users that communicate. As attributes arenot correlated for random pairs of people, they are highly correlated for users who com-municate. Also, notice that gender communication is negatively correlated—people tend tocommunicate more with people of a di↵erent gender.

Figure 21 further illustrates the results of table 5. We plot the number of pairs of peopleof particular age. Figure 21(a) shows the distribution over the randomly sampled pairs, i.e.,the Messenger user base created by sampling from 1.3 billion random user pairs, and plotthe distribution over reported ages. As most of the population comes from the age group10-30, the distribution of random pairs of people reaches the mode at those ages but thereis no correlation. Figure 21(b) shows the distribution of ages over the pairs of people that

Correlation coefficient

Number of pairs of people at different ages

Leskovec, J. & Horvitz, E., 2008. Planetary-scale views on a large instant-messaging network. pp.915–924.

Dasgupta, K. et al., 2008. Social ties and their relevance to churn in mobile telecom networks.

Page 14: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Contagion = Homophily?

• Influence and homophily are usually confounded in observational social network studies

launched in July 2007 (Yahoo! Go) (Fig. 2A), and (iii) preciseattribute and dynamic behavioral data on users’ demographics,geographic location, mobile device type and usage, and per-daypage views of different types of content (e.g., sports, weather, news,finance, and photo sharing) from desktop, mobile, and Go plat-forms. Much of these data, such as mobile device usage and pageviews of different types of content, provide fine-grained proxies forindividuals’ tastes and preferences. The complete set of covariatesincludes 40 time-varying and 6 time-invariant individual and net-work characteristics. Taken together, the sampled users of the IM

network registered !14 billion page views and sent 3.9 billionmessages over 89.3 million distinct relationships. For details aboutthe service, the data, and descriptive statistics see the Data sectionof the SI.

Evidence of Assortative Mixing and Temporal ClusteringWe observe strong evidence of both assortative mixing and tem-poral clustering in Go adoption. At the end of the 5-month period,adopters have a 5-fold higher percentage of adopters in their localnetworks (t " stat # 100.12, p $ 0.001; k.s. " stat # 0.06, p $ 0.001)and receive a 5-fold higher percentage of messages from adoptersthan nonadopters (t " stat # 88.30, p $ 0.001; k.s. " stat # 0.17,p $ 0.001). Both the number and percentage of one’s local networkwho have adopted are highly predictive of one’s propensity to adopt(Logistic: !(#) # 0.153, p $ 0.001; !(%) # 1.268, p $ 0.001), and toadopt earlier (Hazard Rate: !(#) # 0.10, p $ 0.001; !(%) # 0.003,p $ 0.001). The likelihood of adoption increases dramatically withthe number of adopter friends (Fig. 2C), and correspondingly,adopters are more likely to have more adopter friends (Fig. 2B),mirroring prior evidence on product adoption in networks (29).

Adoption decisions among friends also cluster in time. Werandomly reassigned all Go adoption times (while maintaining theadoption frequency distribution over time) and compared observeddyadic differences in adoption times among friends to differencesamong friends with randomly reassigned adoption times, a proce-dure known as the ‘‘shuffle test’’ of social influence (25). Comparedwith these randomly reassigned adoption times, friends are between100% and 500% more likely to adopt within 2 days of each other,after which the temporal interdependence of adoption amongfriends disappears (Fig. 1D).

Evidence of assortative mixing and temporal clustering maysuggest peer influence in Go adoption, but is by no means conclu-sive. Demographic, behavioral, and preference similarities couldsimultaneously drive friendship and adoption, creating assortativemixing. Such homophily could also explain the temporal clustering

Fig. 1. Diffusion of Yahoo! Go over time. (A–C and D–F) Two subgraphs of theYahoo! IM network colored by adoption states on July 4 (the Go launch date),August 10, and October 29, 2007. For animations of the diffusion of Yahoo! Goover time see Movies S1 and S2.

Fig. 2. Assortative mixing and temporal clustering. (A) The number of Go adopters per day from July 1 to October 29, 2007. (B) The fraction of adopters andnonadopters with a given number of adopter friends. (C) The ratio of the likelihood of adoption given n adopter friends Pa(n) and the likelihood of adoption given0adopterfriendsPa(0)wherethenumberofadopter friends isassessedatthetimeofadoption. (D) Frequencyofobserveddyadicdifferences inadoptiontimesbetweenfriends compared with differences in adoption times between friends with randomly reassigned adoption times. %t # ti " tj, where ti represents the time of i’s adoption.

Aral et al. PNAS ! December 22, 2009 ! vol. 106 ! no. 51 ! 21545

SOCI

AL

SCIE

NCE

S

early Go adopters. Random matching also implies that the marginalinfluence of an additional adopter friend grows with the number ofadopter friends, whereas propensity score results show linear todiminishing marginal influence effects of additional adopter friends(Fig. 3B Right Inset). This occurs in part because there is exagger-ated homophily among larger clusters of adopter friends (Fig. 3BLeft Inset). The more adopters there are in a group of friends themore likely they are to be more similar to one another. Comparisonsto random therefore incorrectly imply that influence grows super-linearly with the number of adopter friends, whereas there is simplygreater homophily in larger groups of adopters.

Homophily also accounts for temporal clustering. We redefinedtreatment to capture the effect of having a friend who adoptedwithin a certain time period (or recency)(!t " ti

a # tja $ R) and

reevaluated results under random and propensity score matching(Fig. 3D). Random matching overestimates the contribution ofinfluence to the temporal clustering of adoption decisions by%200% for dyads that adopt on the same day (!t $ 0), %100% fordyads that adopt 1 day apart (!t $ 1), and so on. Friends who adoptcontemporaneously are again more similar along observable de-mographic and behavioral dimensions [measured by cos(xit

a, xjta), Fig.

3D Inset], indicating that homophily explains a good deal of variancein the temporal clustering of Go adoption decisions.

Thus, homophily can, to a large extent, explain what seems at firstto be a contagious process driven by peer influence. Over half of thecumulative adoption of treated users (those with at least oneadopter friend) can be attributed to homophily effects (Fig. 4 A andB). The remaining adoption events (49.8%) represent the upperbound of influence effects established by our matched sampleestimates. We also evaluated these influence effects under variousenvironmental conditions (by holding out and varying one charac-teristic (xi) at a time while matching on all other characteristics, SI,Environmental Conditions) and found the upper bounds of influ-

ence vary across different segments of the population. When ego’saverage strength of ties to adopter friends is above the median, thelikelihood of adoption controlling for homophily is on average 2times higher than when below the median (Fig. 4C). Those withcohesive, dense local networks (with more ties among their friends)adopt at a higher rate in the presence of an adopter friendcontrolling for observed homophily (Fig. 4D), reinforcing priorarguments that cohesive networks magnify information exchangeand persuasion via redundancy and trust (39). Finally, greaterconsumption of news content makes ego more susceptible topotential influence. Because Yahoo! Go delivers personalizednews, those with greater interest in such content are more suscep-tible to influence, demonstrating the importance of creating robustmatches based on contextual behavioral variables (Fig. 4E). Theseestimates provide examples of the types of environmental condi-tions that affect the prevalence of influence in networks anddemonstrate how to test them.

DiscussionWe present a generalized statistical framework for distinguishingpeer-to-peer influence from homophily in dynamic networks of anysize. Application of this framework to a network of 27 millionindividuals connected by instant message traffic provides an esti-mate of the degree to which peer influence and homophily affectthe diffusion of a new mobile service application across thisnetwork. Most critically, the results show that previous methodsoverestimate peer influence in this network by 300–700% and thathomophily explains %50% of the perceived behavioral contagion inmobile service adoption. These findings demonstrate that homoph-ily can account for a great deal of what appears at first to be acontagious process.

Overestimates of influence are magnified at early stages of thediffusion process because those who are most susceptible are also

Fig. 3. Distinguishing homophily and influence. (A and B) The fraction of observed treated to untreated adopters (n&/n#) under random (A) and propensity score(B) matching over time. The dotted line shows a ratio of 1, when treatment has no effect. The Right Inset in B graphs the average marginal influence effects of having1, 2, 3, or 4 adopter friends implied by random (open circles) and propensity score (filled circles) matching. The Left Inset graphs the average cosine distance of attributeandbehaviorvectorsofadopters toadopterfriendsasthenumberofadopters inthe localnetwork increases ('i,j

n cos(xia,xj

a)/n). (C)Graphsthecosinedistancesofadoptersto their adopter friends cos(xit

a, xjta), their nonadopter friends cos(xit

a, xjt), and a random alter cos(xita, xrt) over time with trend lines fitted by ordinary least squares. (D)

The fraction of treated and untreated adopters, where treatment is defined as having a friend who adopted within a certain time period (or recency) (!t " tia # tj

a $R), under random matching (open circles) and propensity score matching (filled circles). The Inset graphs the cosine distances of dyads of adopters cos(xit

a, xjta) by the time

interval between their adoption.

Aral et al. PNAS ! December 22, 2009 ! vol. 106 ! no. 51 ! 21547

SOCI

AL

SCIE

NCE

S

Aral, S., et al. 2009. Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sciences, 106(51), p.21544.

Page 15: Temporal dynamics of human behavior in social networks (i)

form/decayt1 t2 t3

Tie activity is bursty t

Groups of conversation

t1 t1+dt

Tie

s

Communities form/change/decay

t1 t2

Com

mun

ities

Networks form/change/decay

t1 t2

Net

wor

k

2 Tie Activity

Page 16: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Bursty human dynamics: inter-event time between activities is heavy-tailed distributed

We obtained the letter correspondence recordsfor 16 writers, performers, politicians, and scien-tists. Each data set consists of a list of letters thatwere sent by each of these individuals, and eachrecord comprises the name of the sender, the nameof the recipient, and the date when the letter waswritten [see supporting online material (SOM) textS1 for details]. The nature of the data raises twoissues to consider during analysis. First, the preciseauthorship date of some letters is unknown, sowe restricted our analysis to only those letters thathave precise authorship dates. Second, it is highlyunlikely that all of the letters written by a partic-ular individual are present in the database.We haveconfirmed that our results are insensitive to sam-pling effects from this method of data collection(SOM text S2).

An important consideration in studying theletter correspondence patterns of these individu-als is that the data cover their entire lifetimes. Asa result, it is conceivable that changing commu-nication needs might affect letter correspondencepatterns. For example, before Einstein becamewidely known, the bulk of his recorded commu-nication was to friends and relatives. After theconfirmation of his theory of relativity in 1919,Einstein’s need to communicate with other indi-viduals substantially increased. By that time, hisstepdaughter Ilse Einstein was helping him withsecretarial tasks, resulting in greatly improved cov-erage of his recorded correspondence (23). Becauseof this secretarial assistance and his increased fame,we expect that the average time between consecu-

tively sent letters, the average interevent time ⟨t⟩,is significantly larger during the beginning ofEinstein’s life than during the latter part of hislife. Our expectations are verified in Fig. 1, A andB, demonstrating that these time series are non-stationary; that is, the heavy-tailed t distributionresults from a mixture of time scales (24).

Because these time series are nonstationary,we partitioned each complete time series intosmaller time segments so that we could approx-imate stationary behavior within each time seg-ment. We accomplished this by splitting the timeseries into segments lasting 364 days (52 weeks),unless fewer than 10 events fell within that timeperiod, in which case consecutive segments weremerged until this criterion was met.

Assuming that the correspondence patternswithin each time segment are stationary, we canthen model the behavior within each time seg-ment with standard techniques. As a first approx-imation, one might naively expect that letters aresent at a constant rate r and that the time at whichevery letter is sent is independent of all others.Such a process is referred to as a homogeneousPoisson process, which gives rise to an exponen-tial t distribution p(t) = re−rt. Whereas the tail ofthe t distribution within these time segments is ap-proximately exponential, the best-estimate predic-tions of a homogeneous Poisson process do notproduce the correct decay rate (Fig. 1C). This sug-gests that only a few changes to the homogeneousPoisson process are needed to reproduce the ob-served t distribution. We hypothesize that, as for

e-mail correspondence, two additional ingredientsmust also be considered for modeling letter cor-respondence (14).

First, circadian and weekly cycles of activitymay influence when individuals communicate.Previously, we accounted for these cycles ofactivity in e-mail communication with a non-homogeneous Poisson process whose rate r(t)changes periodically on daily and weekly timescales. For letter correspondence, however, theresolution of the data does not permit us toidentify activity patterns within a day, and day-to-day changes in activity provide no additionalinsight (SOM text S3).We therefore approximatethe nonhomogeneous Poisson process r(t) by ahomogeneous Poisson process with constant rateri during time segment i; that is, we model therate of activity r(t) throughout each individual'slife by a piecewise constant function of time.

Second, individuals are much more likely tocontinue writing letters once they have written oneletter, in order to use their time more effectively.We account for this behavior by hypothesizingthat, once an individual finishes writing a letter,there is a probability xi that they will write an-other letter. This process repeats itself until thiscascade of additional letters concludes with prob-ability 1 − xi, at which point the individual’sbehavior is again governed by a homogeneousPoisson process with rate ri (25). We refer to theresulting model as a cascading Poisson process.

To compare the predictions of the cascadingPoisson process (26) to the empirical data, we

Fig. 1. Nonstationarityof Albert Einstein’s lettercorrespondence activity.We selected Einstein asan example, but nonsta-tionarities are present forall 16 writers, performers,politicians, and scientistsstudied here. (A) Averageof t over 100 consecutivet’s. During the beginningof Einstein’s life (blue-shaded region), ⟨t⟩100 issignificantly larger thanduring the end of his life(orange-shaded region).

1900 1920 1940 1960Year

10-1

100

101

102

⟨τ⟩

(d)

100101 102

Inter-event time, τ (d)

Inter-event time, τ (d)

10-6

10-5

10-4

10-3

10-2

100

10-1

100

Pro

babi

lity

dens

ity

1896-19251925-19551896-1955

0 2 4 6 8 1010-5

10-4

10-3

10-2

10-1

100

Cum

ulat

ive

dist

ribut

ion

Empirical (1931)Poisson process

A B

C

(B) Logarithmically binned probability density of the nonzero t’s. If we separately consider the tdistribution during each portion of Einstein’s life, it is clear that the complete t distribution(black line) is actually a mixture of behaviors. To emphasize the origins of the heavy-taileddistribution, the probability densities of each portion of Einstein’s life are normalized so thattheir integrals are equal to the fraction of nonzero t’s during that time period. d, days. (C)Comparison of the empirical t distribution during a particular time segment with the simulatedpredictions of the best-estimate homogeneous Poisson process that is interval-censored in thesame manner as the data. It is visually apparent that a homogeneous Poisson process is notconsistent with the empirical data, which is confirmed by Monte Carlo hypothesis testing (SOMtext S3).

www.sciencemag.org SCIENCE VOL 325 25 SEPTEMBER 2009 1697

REPORTS

on

Sept

embe

r 25,

200

9 w

ww

.sci

ence

mag

.org

Dow

nloa

ded

from

Malmgren, R. et al., 2009. On universality in human correspondence activity. Science, 325(5948), p.1696.

Page 17: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Bursty human dynamics: inter-event time between activities is heavy-tailed distributed

well approximated by Poisson processes1–3. In contrast, thereis increasing evidence that the timing of many humanactivities, ranging from communication to entertainment andwork patterns, follow non-Poisson statistics, characterized bybursts of rapidly occurring events separated by long periods ofinactivity4–8. Here I show that the bursty nature of humanbehaviour is a consequence of a decision-based queuingprocess9,10: when individuals execute tasks based on some per-ceived priority, the timing of the tasks will be heavy tailed, withmost tasks being rapidly executed, whereas a few experience verylong waiting times. In contrast, random or priority blindexecution is well approximated by uniform inter-event statistics.These finding have important implications, ranging fromresource management to service allocation, in both communi-cations and retail.Humans participate on a daily basis in a large number of distinct

activities, ranging from electronic communication (such as sendinge-mails or making telephone calls) to browsing the Internet,initiating financial transactions, or engaging in entertainment andsports. Given the number of factors that determine the timing ofeach action, ranging from work and sleep patterns to resourceavailability, it seems impossible to seek regularities in humandynamics, apart from the obvious daily and seasonal periodicities.Therefore, in contrast with the accurate predictive tools common in

physical sciences, forecasting human and social patterns remains adifficult and often elusive goal.

Current models of human activity are based on Poisson pro-cesses, and assume that in a dt time interval an individual (agent)engages in a specific action with probability qdt, where q is theoverall frequency of themonitored activity. This model predicts thatthe time interval between two consecutive actions by the sameindividual, called the waiting or inter-event time, follows anexponential distribution (Fig. 1a–c)1. Poisson processes are widelyused to quantify the consequences of human actions, such asmodelling traffic flow patterns or accident frequencies1, and arecommercially used in call centre staffing2, inventory control3, or toestimate the number of congestion-caused blocked calls in calls inmobile communication4. Yet, an increasing number of recentmeasurements indicate that the timing of many human actionssystematically deviates from the Poisson prediction, the waiting orinter-event times being better approximated by a heavy tailed orPareto distribution (Fig. 1d–f). The differences between Poissonand heavy-tailed behaviour are striking: a Poisson distributiondecreases exponentially, forcing the consecutive events to followeach other at relatively regular time intervals and forbidding verylong waiting times. In contrast, the slowly decaying, heavy-tailedprocesses allow for very long periods of inactivity that separatebursts of intensive activity (Fig. 1).

Figure 1 The difference between the activity patterns predicted by a Poisson process andthe heavy-tailed distributions observed in human dynamics. a, Succession of eventspredicted by a Poisson process, which assumes that in any moment an event takes place

with probability q. The horizontal axis denotes time, each vertical line corresponding to an

individual event. Note that the inter-event times are comparable to each other, long

delays being virtually absent. b, The absence of long delays is visible on the plot showingthe delay times t for 1,000 consecutive events, the size of each vertical line

corresponding to the gaps seen in a. c, The probability of finding exactly n events within afixed time interval is P(n; q) ¼ e 2qt(qt )n/n!, which predicts that for a Poisson process the

inter-event time distribution follows P(t) ¼ qe 2qt, shown on a log-linear plot in c for the

events displayed in a, b. d, The succession of events for a heavy-tailed distribution.e, The waiting time t of 1,000 consecutive events, where the mean event time waschosen to coincide with the mean event time of the Poisson process shown in a–c. Notethe large spikes in the plot, corresponding to very long delay times. b and e have the samevertical scale, allowing the comparison of the regularity of a Poisson process with the

intermittent nature of the heavy-tailed process. f, Delay time distribution P(t) . t 22 for

the heavy-tailed process shown in d, e, appearing as a straight line with slope22 on a

log–log plot. The signal shown in d–f was generated using g ¼ 1 in the stochastic

priority list model discussed in the Supplementary Information.

letters to nature

NATURE |VOL 435 | 12 MAY 2005 | www.nature.com/nature208© 2005 Nature Publishing Group

Barabasi, A.-L., 2005. The origin of bursts and heavy tails in human dynamics. Nature, 435(7039), pp.207–211.

Page 18: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Why?: managing tasks, queue theory. time in the queue

1 x1 2 x2 3 x3 4 x4 5 x5

… L xL

L+1 xL+1

⇢(x) x

max

45

user’s list

⌧ =Barabasi, A.-L., 2005. The origin of bursts and heavy tails in

human dynamics. Nature, 435(7039), pp.207–211.

selected task, independent of its priority. Thus the p ! 1 limitof the model describes the deterministic protocol (iii), whenalways the highest-priority task is chosen for execution whereasp ! 0 corresponds to the random choice protocol (ii) discussedabove.To establish that this priority list model can account for the

observed fat-tailed inter-event time distribution, we first studied itsdynamics numerically with priorities chosen from a uniformdistribution x i [ [0, 1]. Computer simulations show that in thep ! 1 limit the probability that a task spends t time on the list has apower law tail with exponent a ¼ 1 (Fig. 3a) in agreement with theexponent obtained for e-mail communications (Fig. 2a). In thep ! 0 limit P(t) follows an exponential distribution (Fig. 3b), asexpected for the case (ii). As the typical length of the priority listdiffers from individual to individual, it is particularly important forthe tail of P(t) to be independent of L. Numerical simulationsindicate that this is indeed the case: changes in L do not affect thescaling of P(t). The fact that the scaling holds for L ¼ 2 indicatesthat it is not necessary to have a long priority list: as long asindividuals balance at least two tasks, a bursty, heavy-tailed inter-event dynamics will emerge.To determine the tail of P(t) analytically I consider a stochastic

version of the model in which the probability to choose a task withpriority x for execution in a unit of time is P(x) < xg, where g is aparameter that allows us to interpolate between the random choicelimit (ii, g ¼ 0, p ¼ 0) and the deterministic case, when always thehighest-priority item is chosen for execution (iii, g ¼ 1, p ¼ 1).Note that this parameterization captures the scaling of the modelonly in the p ! 0 and p ! 1 limits, but not for intermediate pvalues, thus it is chosen only for mathematical convenience. Theprobability that a task with priority x is executed at time t isf(x, t) ¼ (1 2 P(x))t21P(x). The average waiting time of a taskwith priority x is obtained by averaging over t weighted with f(x,t)providing

tðxÞ ¼X1

t¼1

tf ðx; tÞ ¼ 1

PðxÞ<1

xgð1Þ

that is, the higher an item’s priority, the shorter the average time itwaits before execution. To calculate P(t) I use the fact that thepriorities are chosen from the r(x) distribution; that is, rðxÞdx¼PðtÞdt; which gives

PðtÞ< rðt21=gÞt1þ1=g

ð2Þ

In the g ! 1 limit, which converges to the strictly priority-baseddeterministic choice (p ¼ 1) in the model, equation (2) predictsP(t) < t21, in agreement with the numerical results (Fig. 3a), aswell as the empirical data on the e-mail inter-arrival times (Fig. 2a).In the g ¼ 0 (p ¼ 0) limit, t(x) is independent of x thus P(t)converges to an exponential distribution, as shown in Fig. 3b (seeSupplementary Information).

The apparent dependence of P(t) on the r(x) distribution fromwhich the agent chooses the priorities may appear to represent apotential problem, as assigning priorities is a subjective process,each individual being characterized by its own r(x) distribution.According to equation (2), however, in the g ! 1 limit, P(t) isindependent of r(x). Indeed, in the deterministic limit the uniformr(x) can be transformed into an arbitrary r 0(x) with a parameterchange, without altering the order in which the tasks are executed11.This insensitivity of the tail to r(x) explains why, despite thediversity of human actions encompassing both professional andpersonal priorities, most decision-driven processes develop a heavytail.

To obtain empirical evidence for the validity of the proposedqueuing mechanism I consider the e-mail activity pattern of anindividual11,12. Once in front of a computer, an individual will replyimmediately to a high-priority message, while placing the lessurgent or the more difficult ones on its priority list to competewith other non-e-mail activities. I propose, therefore, that theobserved inter-event time distribution is in fact rooted in theuneven waiting times experienced by different tasks. To test thishypothesis the waiting time for each task needs to be determineddirectly. In the e-mail data set we have the time, sender and recipientof each e-mail transmitted over several months by each user, thus wecan determine the time it takes for a user to reply to a receivedmessage11. As Fig. 2b shows, the waiting time distribution P(tw) forthe user whose P(t) is shown in Fig. 2a is best approximated byPðtwÞ< t2aw

w with exponent aw ¼ 1, supporting the hypothesisthat the heavy-tailed waiting time distribution drives the observedbursty e-mail activity patterns.

As in the p ! 1 limit of the model the priority list is dominatedby low-priority tasks, new tasks will often be executed immediately.This results in a peak at P(t ¼ 1) (see Supplementary Fig. 3), which,although in some cases may represent amodel artefact, in the e-mailcontext is not unrealistic: most e-mails are either deleted right away(which is one kind of task execution) or immediately replied to.Only the more difficult or time consuming tasks will queue on thepriority list. The e-mail data set does not allow us to resolve thispeak, however, because amessage which the user deletes or replies toright away will appear to have some waiting time, given the delaybetween the arrival of themessage and the time the user has a chanceto check her e-mail.

Although I have illustrated the queuing process for e-mails, ingeneral the model is better suited to capture the competitionbetween different kinds of activities an individual is engaged in;that is, the switching between various work, entertainment andcommunication events. Indeed, most data sets displaying heavy-tailed inter-event times in a specific activity reflect the outcome ofthe competition between tasks of different nature. For example, thestarting of an online gaming session often implies that all higher-priority work-, family-, and entertainment-related activities havebeen already executed.

Detailed models of human activity require us to consider theimpact of a number of additional mechanisms on the queuing

Figure 3 The waiting time distribution predicted by the investigated queuing model. Thepriorities were chosen from a uniform distribution x i [ [0,1], and I monitored a priority

list of length L ¼ 100 over T ¼ 106 time steps. a, Log–log plot of the tail of probabilityP(t) that a task spends t time on the list obtained for p ¼ 0.99999, corresponding to the

deterministic limit of the model. The continuous line of the log–log plot corresponds to the

scaling predicted by equation (2), having slope 21, in agreement with the numerical

results and the analytical predictions. The data were log-binned, to reduce the uneven

statistical fluctuations common in heavy-tailed distributions, a procedure that does not

alter the slope of the tail. For the full curve, including the t ¼ 1 peak, see Supplementary

Fig. 3. b, Linear-log plot of the P(t) distribution for p ¼ 0.00001, corresponding to the

random choice limit of the model. The fact that the curve follows a straight line on a linear-

log plot indicates that P(t) decays exponentially.

letters to nature

NATURE |VOL 435 | 12 MAY 2005 | www.nature.com/nature210© 2005 Nature Publishing Group

Page 19: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Bursty contacts: inter-event times on ties are also heavy-tailed distributed

4

tribution around 20 seconds is found. This peak is dueto event correlations between links. The power law indi-cates the non-Poissonian, bursty character of the events.Both the characteristics vanish for the time-shuffled nullmodel BCW, and the inter-event time is well describedby an exponential function (see inset of Fig. 4), i.e., theprocess is Poissonian.

FIG. 4: (color online) Scaled inter-event time distributions forthe MCN data. Edges were binned (log bins with base 1.3)according to their weights and for every second bin the inter-event time distribution of the events occurring in the corre-sponding edge is shown. Each inter-event time distributionis scaled by the average inter-event time of the correspondingbin τ∗. The inset shows scaled inter-event time distributionsfor the empirical network (red circles) and for the time shufflednetwork (blue squares). An exponential density distributionwith average value of 1 is shown as a light (yellow) line.

The effect of burstiness on the spreading speed can beeasily demonstrated with the following single-link calcu-lation. Let us denote the average time for the infection tospread through a link (the residual waiting time) by ⟨τR⟩,and assume that one of the nodes gets infected at a uni-formly chosen random time. Similarly to Iribarren et al.[12] and Vazquez et al., [11] we calculate ⟨τR⟩ for a giveninter-event time distribution P (τ). For simplicity, weconsider how the burstiness introduced by a continuouspower-law distribution of inter-event times P (τ) ∼ τ−α

affects the average infection times when compared to aPoisson process. If we fix the average inter-event time(and thus the number of events for a long observationperiod), the ratio of average infection times becomes

r = ⟨τR,powerlaw⟩ / ⟨τR,poisson⟩ = (α−2)2

2(α−1)(α−3) for α > 3.

Now r is decreasing with α, r < 1 when α > 2+√2 ≈ 3.4,

and r goes to infinity at α = 3. This indicates that theburstiness characterized by power law distributions withslow decay has a decelerating effect on spreading withrespect to the Poison process with the same mean. How-ever, if the decay is fast enough, i.e., the second momentof the power law distribution is smaller than that of thePoisson distribution, we see acceleration. This mean field

type of reasoning has its limitations, nevertheless, it illus-trates the mechanisms of slowing down because of bursts:the residual waiting time increases as the chance for longwaiting times after getting infected increases with a fattailed waiting time distribution.In conclusion, the spreading phenomena in small-world

communication networks are slow mainly for two reasons.First, the community structure and its correlation withlink weights have already a considerable effect. Second,the inhomogeneous and bursty activity patterns on thelinks result in an additional slowing down. Thus it ismisleading to emphasize only one of these reasons. Butas shown here, by using proper null models the contribu-tions of different factors can be distinguished. Somewhatsurprisingly, the daily pattern and event correlations be-tween links seem to play only a minor role in overallspreading speed.Acknowledgement The project ICTeCollective ac-

knowledges the financial support of the Future andEmerging Technologies (FET) programme within theSeventh Framework Programme for Research of the Eu-ropean Commission, under FET-Open grant number:238597. Partial support by the Academy of Finland, theFinnish Center of Excellence program 2006-2011, projectno. 129670, as well as OTKA K60456 and TEKES arealso acknowledged.

[1] M. Newman, A.-L. Barabasi and D. J. Watts The Struc-ture and Dynamics of Networks (Princeton UP, 2006), M.Newman Networks: An Introduction (Oxford UP, 2010)

[2] A. Barrat, M. Barthalemy and A. Vespignani Dynamicalprocesses on complex networks (Cambridge UP, 2008).

[3] R. Pastor-Satorras and A. Vespignani Evolution andstructure of the Internet (Oxford UP, 2004)

[4] H.W. Hethcote, SIAM Review 42, 599 (2000).[5] R. Lambiotte, J.-C. Delvenne and M. Barahona,

(arxiv.org/abs/0812.1770) (2008).[6] R. Toivonen et al., Phys. Rev. E 79, 016109 (2009).[7] P.J. Mucha et al., Science 328, 876 (2010).[8] M. Granovetter, Am. J. Sociol. 78, 1360 (1973).[9] J.-P. Onnela et al., Proc. Natl. Acad. Sci. (USA) 104,

7332 (2007).[10] A.-L. Barabasi, Bursts: The Hidden Pattern Behind Ev-

erything We Do (Dutton Books, 2010).[11] A. Vazquez et al., Phys.Rev.Lett. 98, 158702 (2007).[12] J.L. Iribarren and E. Moro, Phys.Rev.Lett. 103, 038702

(2009).[13] J. Candia et al., J. Phys. A: Math. Theor. 41, 224015

(2008)[14] J.-P. Onnela et al., New J. Phys. 9, 179 (2007)[15] N. Eagle, A. Pentland, and D. Lazer, Proc. Natl. Acad.

Sci. (USA) 106, 15274 (2009).[16] J. Eckmann, E. Moses, and D. Sergi, Proc. Natl. Acad.

Sci. U.S.A. 101, 14333 (2004)[17] R. D. Malmgren et al., Proc. Natl. Acad. Sci. U.S.A. 105,

18153 (2008), R.D. Malmgren et.al., Science 325 1696(2009)

DayMinutes

Karsai, M. et al., 2011. Small But Slow World: How Network Topology and Burstiness Slow Down Spreading. Physical Review E, 83(2), p.025102.

Miritello, G., Moro, E. & Lara, R., 2011. Dynamical strength of social ties in information spreading. Physical Review E, 83(4), p.045102.

Page 20: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Bursty contacts: impact on the waiting time • When should I wait next call from a friend? • When is the next bus coming?

• Given , calculate

t t+ �t

P (�t) P (⌧)Task/bus/call arrives at random time

Page 21: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Bursty contacts: impact on the waiting time • When should I wait next call from a friend? • When is the next bus coming?

• Given , calculate

t t+ �t

P (�t) P (⌧)

P (⇥) =

Z 1

⌧d�t

�tP (�t)

�t

1

�t

⇤ =�t

2

✓1 +

⇥2�t

�t2

Task/bus/call arrives at random time

Page 22: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Bursty contacts: impact on the waiting time • When should I wait next call from a friend? • When is the next bus coming?

• Given , calculate

t t+ �t

P (�t) P (⌧)

P (⇥) =

Z 1

⌧d�t

�tP (�t)

�t

1

�t

⇤ =�t

2

✓1 +

⇥2�t

�t2

!"#$%&'(&)%*+%,-".%&/0&,/,10*%23%,-&#34%4&/,&-56%&#7&-56%&/0&8"7(&9::;

)%*+%,-".%

<-"*-&!565,.&

)/5,-4

=,-%*6%85"-%&

!565,.&)/5,-4

>-?%*&@34&

<-/A4

B$$&@34&

<-/A4

B$$&@34&<-/A4&

C9::DE

:F::&-/&:GH: !" #$ %# #& #$

:GHI&-/&ID9G !' #! #$ #! #&

IDH:&-/&I;H: !" %( %% #" #)

B$$&-56%4&1&:F::&-/&I;H: !& #" %! #' #* !

"#$!%&'()!*!+,-.+!,-.!/01230&(435!6&74)8!'5!349)!-:!8&5$!;+!94<,3!')!)=/)23)8>!/01230&(435>!/&73420(&7(5!&3!413)79)84&3)!'0+!+3-/+>!.&+!')+3!41!3,)!413)7/)&?!,-07+!.,)1!37&::42!()6)(+!&7)!1-79&((5!(-.)7$!

"@$!A,&73!B!')(-.!+,-.+!,-.!3,)!84+374'034-1!-:!(&3)1)++!&18!,-.!3,4+!6&74)+!'5!35/)!-:!'0+!+3-/$! C1! /&73420(&7>! 43! +,-.+! 3,&3! -6)7! &! D0&73)7! -:! '0+)+! +3&73)8! 3,)47! E-071)5+! )=&23(5! -1!349)>!'03!:-7!1-1F34941<!/-413+!+3-/+>!3,)!/7-/-734-1!.&+!&'-03!"GH$!

!"#$%&'(&)#%*+*,,&-.&/-+01$*23*+%&43,*,&56&768*&-.&43,&9%-8:(&;<<=

:

D

I:

ID

9:

9D

H:

1D 1' 1H 19 1I : I 9 H ' D J ; F G I: II I9 IH I' ID IJ I; IF IG 9:

>?+3%*,

@*$A*+%#B*

<-"*-&!565,.&)/5,-

=,-%*6%85"-%&!565,.&)/5,-

>-?%*&@34&<-/A

I&KL+$38%4&-?%&HM&/0&#34%4&-?"-&N%*%&6/*%&-?",&9:&

!

I7)D0)13!J)7642)+!

"K$!;+! 8)+274')8! 41! /&7&<7&/,! @>! 3,)! /01230&(435! -:! :7)D0)13! +)7642)+! 4+!9)&+07)8! '5! 3,)!L=2)++! M&4341<! %49)! NLM%O! '-71)! '5! /&++)1<)7+$! ;+! :-7! 1-1F:7)D0)13! +)7642)+>! -6)7&((!/01230&(435!.&+!&++)++)8!'5!&++0941<!/7-/-734-1+!-:!BPH>!*PH!&18!BPH!:-7!3,)!3,7))!35/)+!-:!'0+!+3-/$!

"Q$!C3!+,-0(8!')!1-3)8!3,&3!.,4(+3!&1!LM%!-:>!+&5>!G!94103)+!9&5!+))9!&22)/3&'()>!3,4+!4+!41!&88434-1!3-!3,)!&6)7&<)!.&4341<!349)$!I-7!&1!R)6)75!3)1!94103)S!+)7642)!43!7)/7)+)13+!&!*PH!4127)&+)!41!.&4341<!349)!-6)7!3,)!&6)7&<)!+2,)80()8!.&4341<!349)!-:!#!94103)+$!T6)7!,&(:!-:!3,)!/&++)1<)7+!.-0(8!3,)7):-7)!,&6)!3-!)=/)23!3-!.&43!&3!()&+3!K!94103)+!:-7!3,)47!'0+>!+-9)!902,!(-1<)7$!

& '

Bus Punctuality Statistics GB 2007. Dept. of Transport

Task/bus/call arrives at random time

Page 23: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Is that all? Nope: bursts are correlated in time

• To find correlation, detect sequence of events with

• If activity is a renewal process, the probability that we find n of such events in a row is

• P(E) decays exponentially • However, in real data it decays like a

power-law

ResultsCorrelated events. A sequence of discrete temporal events can beinterpreted as a time-dependent point process, X(t), where X(ti) 5 1at each time step ti when an event takes place, otherwise X(ti) 5 0. Todetect bursty clusters in this binary event sequence we have toidentify those events we consider correlated. The smallest temporalscale at which correlations can emerge in the dynamics is betweenconsecutive events. If only X(t) is known, we can assume twoconsecutive actions at ti and ti 1 1 to be related if they follow eachother within a short time interval, ti 1 12ti # Dt30,38. For events withthe duration di this condition is slightly modified: ti 1 12(ti 1 di) #Dt.

This definition allows us to detect bursty periods, defined as asequence of events where each event follows the previous one withina time intervalDt. By counting the number of events, E, that belong tothe same bursty period, we can calculate their distribution P(E) in asignal. For a sequence of independent events, P(E) is uniquely deter-mined by the inter-event time distribution P(tie) as follows:

P E~nð Þ~ðDt

0P tieð Þdtie

" #n{1

1{

ðDt

0P tieð Þdtie

" #ð1Þ

for n . 0. Here the integralÐ Dt

0 P tieð Þdtie defines the probability todraw an inter-event time P(tie) # Dt randomly from an arbitrarydistribution P(tie). The first term of (1) gives the probability that wedo it independently n21 consecutive times, while the second termassigns that the nth drawing gives a P(tie) . Dt therefore the evolvingtrain size becomes exactly E 5 n. If the measured time window isfinite (which is always the case here), the integral

Ð Dt0 P tieð Þdtie~a

where a , 1 and the asymptotic behaviour appears like P(E 5 n) ,a(n21) in a general exponential form (for related numerical results seeSI). Consequently for any finite independent event sequence the P(E)distribution decays exponentially even if the inter-event time distri-bution is fat-tailed. Deviations from this exponential behavior indi-cate correlations in the timing of the consecutive events.

Bursty sequences in human communication. To check the scalingbehavior of P(E) in real systems we focused on outgoing events ofindividuals in three selected datasets: (a) A mobile-call dataset from aEuropean operator; (b) Text message records from the same dataset;(c) Email communication sequences26 (for detailed data descriptionsee Methods). For each of these event sequences the distribution ofinter-event times measured between outgoing events are shown inFig. 2 (left bottom panels) and the estimated power-law exponentvalues are summarized in Table 1. To explore the scaling behavior ofthe autocorrelation function, we took the averages over 1,000

randomly selected users with maximum time lag of t 5 106. InFig. 2.a and b (right bottom panels) for mobile communicationsequences strong temporal correlation can be observed (forexponents see Table 1). The power-law behavior in A(t) appearsafter a short period denoting the reaction time through thecorresponding channel and lasts up to 12 hours, capturing thenatural rhythm of human activities. For emails in Fig. 2.c (rightbottom panels) long term correlation are detected up to 8 hours,which reflects a typical office hour rhythm (note that the datasetincludes internal email communication of a university staff).

The broad shape of P(tie) and A(t) functions confirm that humancommunication dynamics is inhomogeneous and displays non-trivial correlations up to finite time scales. However, after destroyingevent-event correlations by shuffling inter-event times in thesequences (see Methods) the autocorrelation functions still showslow power-law like decay (empty symbols on bottom right panels),indicating spurious unexpected dependencies. This clearly demon-strates the disability of A(t) to characterize correlations for hetero-geneous signals (for further results see SI). However, a more effectivemeasure of such correlations is provided by P(E). Calculating thisdistribution for various Dt windows, we find that the P(E) shows thefollowing scale invariant behavior

P Eð Þ*E{b ð2Þ

for each of the event sequences as depicted in the main panels ofFig. 2. Consequently P(E) captures strong temporal correlations inthe empirical sequences and it is remarkably different from P(E)calculated for independent events, which, as predicted by (1), showexponential decay (empty symbols on the main panels).

Exponential behavior of P(E) was also expected from results pub-lished in the literature assuming human communication behavior tobe uncorrelated29,30,39. However, the observed scaling behavior ofP(E) offers direct evidence of correlations in human dynamics, whichcan be responsible for the heterogeneous temporal behavior. Thesecorrelations induce long bursty trains in the event sequence ratherthan short bursts of independent events.

We have found that the scaling of the P(E) distribution is quiterobust against changes in Dt for an extended regime of time-windowsizes (Fig. 2). In addition, the measurements performed on themobile-call sequences indicate that the P(E) distribution remainsfat-tailed also when it is calculated for users grouped by their activity.Moreover, the observed scaling behavior of the characteristic func-tions remains similar if we remove daily fluctuations (for results seeSI). These analyses together show that the detected correlated beha-vior is not an artifact of the averaging method nor can be attributed tovariations in activity levels or circadian fluctuations.

Figure 1 | Activity of single entities with color-coded inter-event times. (a): Sequence of earthquakes with magnitude larger than two at a single location(South of Chishima Island, 8th–9th October 1994) (b): Firing sequence of a single neuron (from rat’s hippocampal) (c): Outgoing mobile phone callsequence of an individual. Shorter the time between the consecutive events darker the color.

www.nature.com/scientificreports

SCIENTIFIC REPORTS | 2 : 397 | DOI: 10.1038/srep00397 2

P (E = n) =

Z �t

0P (�t)d�t

!n�1 1�

Z �t

0P (�t)d�t

!

�t < �t

Bursty periods in natural phenomena. As discussed above, tem-poral inhomogeneities are present in the dynamics of several naturalphenomena, e.g. in recurrent seismic activities at the same loca-tion19,20,21 (for details see Methods and SI). The broad distributionof inter-earthquake times in Fig. 3.a (right top panel) demonstratesthe temporal inhomogeneities. The characterizing exponent value c5 0.7 is in qualitative agreement with the results in the literature23 asc 5 221/p where p is the Omori decay exponent22,23. At the sametime the long tail of the autocorrelation function (right bottom panel)assigning long-range temporal correlations. Counting the number ofearthquakes belonging to the same bursty period with Dt 52…32 hours window sizes, we obtain a broad P(E) distribution(see Fig. 3.a main panel), as observed earlier in communicationsequences, but with a different exponent value b 5 2.5 (see inTable 1). This exponent value meets with known seismicproperties as it can be derived as b 5 b/a 1 1, where a denotes theproductivity law exponent40, while b is coming from the well knownGutenberg-Richter law41. Note that the presence of long bursty trainsin earthquake sequences were already assigned to long temporalcorrelations by measurements using conditional probabilities42,43.

Another example of naturally occurring bursty behavior is pro-vided by the firing patterns of single neurons (see Methods). The

recorded neural spike sequences display correlated and stronglyinhomogeneous temporal bursty behavior, as shown in Fig. 3.b.The distributions of the length of neural spike trains are found tobe fat-tailed and indicate the presence of correlations between con-secutive bursty spikes of the same neuron.

Memory process. In each studied system (communication ofindividuals, earthquakes at given location, or neurons) quali-tatively similar behaviour was detected as the single entitiesperformed low frequency random events or they passed throughlonger correlated bursty cascades. While these phenomena arevery different in nature, there could be some element of simi-larities in their mechanisms. We think that this common featureis a threshold mechanism.

From this point of view the case of human communication dataseems problematic. In fact generally no accumulation of stress isneeded for an individual to make a phone call. However, accordingto the Decision Field Theory of psychology44, each decision (includ-ing initiation of communication) is a threshold phenomenon, as thestimulus of an action has to reach a given level for to be chosen fromthe enormously large number of possible actions.

As for earthquakes and neuron firings it is well known that theyare threshold phenomena. For earthquakes the bursty periods at agiven location are related to the relaxation of accumulated stress afterreaching a threshold7–9. In case of neurons, the firings take place inbursty spike trains when the neuron receives excitatory input and itsmembrane potential exceeds a given potential threshold45. The spikesfired in a single train are correlated since they are the result of thesame excitation and their firing frequency is coding the amplitude ofthe incoming stimuli46.

The correlations taking place between consecutive bursty eventscan be interpreted as a memory process, allowing us to calculate theprobability that the entity will perform one more event within a Dttime frame after it executed n events previously in the actual cascade.This probability can be written as:

Figure 2 | The characteristic functions of human communication event sequences. The P(E) distributions with various Dt time-window sizes (mainpanels), P(tie) distributions (left bottom panels) and average autocorrelation functions (right bottom panels) calculated for different communicationdatasets. (a) Mobile-call dataset: the scale-invariant behavior was characterized by power-law functions with exponent values a^0:5, b^4:1 and c^0:7(b) Almost the same exponents were estimated for short message sequences taking values a^0:6, b^3:9 and c^0:7. (c) Email event sequence withestimated exponents a^0:75, b^2:5 and c^1:0. A gap in the tail of A(t) on figure (c) appears due to logarithmic binning and slightly negativecorrelation values. Empty symbols assign the corresponding calculation results on independent sequences. Lanes labeled with s, m, h and d are denotingseconds, minutes, hours and days respectively.

Table 1 | Characteristic exponents of the (a) autocorrelation func-tion, (b) bursty number, (c) inter-event time distribution functionsand n memory functions calculated in different datasets (see SI)and for the model study

a b c n

Mobile-call sequence 0.5 4.1 0.7 3.0Short message sequence 0.6 3.9 0.7 2.8Email sequence 0.75 2.5 1.0 1.3Earthquake sequence (Japan) 0.3 2.5 0.7 1.6Neuron firing sequence 1.0 2.3 1.1 1.3Model 0.7 3.0 1.3 2.0

www.nature.com/scientificreports

SCIENTIFIC REPORTS | 2 : 397 | DOI: 10.1038/srep00397 3

Karsai, M. et al., 2012. Universal features of correlated bursty behaviour. Scientific Reports, 2.

Page 24: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Is that all? Nope • Adjacent tie contacts

are correlated in time

*

ij

2

events and in particular, the possible heavy-tail proper-ties of P (�tij) are directly inherited by P (⇤ij). Fig. 2shows our (rescaled) results for P (�tij) and P (⇤ij). Forcomparison, we also show the results obtained when i)the time-stamps of the ⇥ ⇤ i events are randomly se-lected from the complete CDR, thus destroying any possi-ble temporal correlation with i ⇤ j and e�ectively mim-icking Eq. (1) and ii) when the whole CDR time-stampsare shu⌅ed thus destroying both tie temporal patternsand correlation between ties. Both shu⌅ings preserve thetie intensity wij [18], i.e. the number of calls and theirduration and also the circadian rhythms of human com-munication [15]. The result for P (�tij) shows that smalland large inter-event times are more probable for the realseries than for the shu⌅ed ones, where the pdf is almostexponential as in a Poissonian process, apart from a smalldeviation due to the circadian rhythms. This bursty pat-tern of activity has been found in numerous examplesof human behavior [6] and seems to be universal in theway a single individual schedules tasks. Here we see thatit also happens at the level of two individuals interac-tion confirming recent results in mobile [15] and onlinecommunities [7] dynamics. The pdf for ⇤ij is also heavy-tailed but displays a larger number of short ⇤ij comparedto the shu⌅ed one. The abundance of short ⇤ij suggeststhat receiving an information (⇥ ⇤ i) triggers commu-nication with other people (i ⇤ j), a manifestation ofgroup conversations [11–13]. While the fat-tail of P (⇤ij)is accurately described by Eq. (1), i.e. large transmissionintervals ⇤ij are mostly due to large inter-event commu-nication times in the i ⇤ j tie, the behavior of P (⇤ij) isnot only due to the bursty patterns of �tij , but also to thetemporal correlation between the i ⇤ j and the ⇥ ⇤ ievents. In fact, if the correlation between the i ⇤ j andthe ⇥ ⇤ i series is destroyed, the probability of short-time intervals decreases and approaches the Poissoniancase (Fig. 2). In summary, relay times depend on twomain properties of human communication that competeto one another. While the bursty nature of human ac-tivity yields to large transmission times hindering anypossible infection, group conversations translate into anunexpected abundance of short relay times, favoring theprobability of propagation.

To investigate the e�ect of these two conflicting prop-erties of human communication on information spread-ing, we simulate the epidemic Susceptible-Infectious-Recovered (SIR) model in our social network consideringthe real time sequence of communication events [15, 23]and compare them to the shu⌅ed data. We start themodel by infecting a node at a random instant and con-sidering all other nodes as susceptible. In each call aninfected node can infect a susceptible node with prob-ability ⇥. Due to the synchronous nature of the phonecommunication, this happens regardless of who initiatesthe call. However, since the same results are obtainedby considering directionality in the calls, for computa-

i � j

� ⇥ i

t�t

t

�ij �tij

FIG. 1. (color online) Schematic view of communicationsevents around individual i: each horizontal segment indicatesan event between i ! j (top) and ⇤ ! i (bottom). At eacht↵

in the ⇤ ! i time series, ⇥ij

is the time elapsed to the nexti ! j event, which is di�erent from the inter-event time �t

ij

in the i ! j time series. The red shaded area represents therecover time window T

i

after t↵

.

10-6 10-4 10-2 100 10210-4

10-2

100

10-4 10-2 100 102

10-6

10-3

100

103

⇥ij / �tij

P(⇥

ij/�t

ij)

P(�t ij/�t

ij)

FIG. 2. (color online) Distribution of the relay time inter-vals ⇥

ij

(main) and of the inter-event times �tij

(inset) in thei ! j tie rescaled by �t

ij

. The black circles correspond tothe real data, while the red squares is the overall-shu⇥ed re-sult. Blue diamonds correspond to the case in which only the⇤ ! i sequence is randomized. Only ties with w

ij

� 10 areconsidered. In both graphs the dashed line correspond to thee�x function.

tional reasons we consider the latter case. Nodes remaininfected during a time Ti until they decay into the re-covered state. For the sake of simplicity we simulate thesimplest model in which the recovering time Ti is deter-ministic and homogeneous Ti = T and set T = 2 days,although di�erent and/or stochastic Ti can be studiedwithin the same model. The spreading dynamics gener-ates a viral cascade that grows until there are no morenodes in the infected state. We repeat the spreading pro-cess for 3 � 104 randomly chosen seeds. Note that ourmodel includes the SI model simulations in [15] where⇥ = 1 and T = T0, with T0 being the total duration ofthe dataset.By looking at the size of the largest cascade smax (over

all realizations) at each value of ⇥, we first ensure theexistence of a percolation transition [4] (see Fig. 3), con-firmed by a change in the behavior of smax from smallto large cascades at a given value of ⇥ (tipping point).

2

events and in particular, the possible heavy-tail proper-ties of P (�tij) are directly inherited by P (⇤ij). Fig. 2shows our (rescaled) results for P (�tij) and P (⇤ij). Forcomparison, we also show the results obtained when i)the time-stamps of the ⇥ ⇤ i events are randomly se-lected from the complete CDR, thus destroying any possi-ble temporal correlation with i ⇤ j and e�ectively mim-icking Eq. (1) and ii) when the whole CDR time-stampsare shu⌅ed thus destroying both tie temporal patternsand correlation between ties. Both shu⌅ings preserve thetie intensity wij [18], i.e. the number of calls and theirduration and also the circadian rhythms of human com-munication [15]. The result for P (�tij) shows that smalland large inter-event times are more probable for the realseries than for the shu⌅ed ones, where the pdf is almostexponential as in a Poissonian process, apart from a smalldeviation due to the circadian rhythms. This bursty pat-tern of activity has been found in numerous examplesof human behavior [6] and seems to be universal in theway a single individual schedules tasks. Here we see thatit also happens at the level of two individuals interac-tion confirming recent results in mobile [15] and onlinecommunities [7] dynamics. The pdf for ⇤ij is also heavy-tailed but displays a larger number of short ⇤ij comparedto the shu⌅ed one. The abundance of short ⇤ij suggeststhat receiving an information (⇥ ⇤ i) triggers commu-nication with other people (i ⇤ j), a manifestation ofgroup conversations [11–13]. While the fat-tail of P (⇤ij)is accurately described by Eq. (1), i.e. large transmissionintervals ⇤ij are mostly due to large inter-event commu-nication times in the i ⇤ j tie, the behavior of P (⇤ij) isnot only due to the bursty patterns of �tij , but also to thetemporal correlation between the i ⇤ j and the ⇥ ⇤ ievents. In fact, if the correlation between the i ⇤ j andthe ⇥ ⇤ i series is destroyed, the probability of short-time intervals decreases and approaches the Poissoniancase (Fig. 2). In summary, relay times depend on twomain properties of human communication that competeto one another. While the bursty nature of human ac-tivity yields to large transmission times hindering anypossible infection, group conversations translate into anunexpected abundance of short relay times, favoring theprobability of propagation.

To investigate the e�ect of these two conflicting prop-erties of human communication on information spread-ing, we simulate the epidemic Susceptible-Infectious-Recovered (SIR) model in our social network consideringthe real time sequence of communication events [15, 23]and compare them to the shu⌅ed data. We start themodel by infecting a node at a random instant and con-sidering all other nodes as susceptible. In each call aninfected node can infect a susceptible node with prob-ability ⇥. Due to the synchronous nature of the phonecommunication, this happens regardless of who initiatesthe call. However, since the same results are obtainedby considering directionality in the calls, for computa-

i � j

� ⇥ i

t�t

t

�ij �tij

FIG. 1. (color online) Schematic view of communicationsevents around individual i: each horizontal segment indicatesan event between i ! j (top) and ⇤ ! i (bottom). At eacht↵

in the ⇤ ! i time series, ⇥ij

is the time elapsed to the nexti ! j event, which is di�erent from the inter-event time �t

ij

in the i ! j time series. The red shaded area represents therecover time window T

i

after t↵

.

10-6 10-4 10-2 100 10210-4

10-2

100

10-4 10-2 100 102

10-6

10-3

100

103

⇥ij / �tij

P(⇥

ij/�t

ij)

P(�t ij/�t

ij)

FIG. 2. (color online) Distribution of the relay time inter-vals ⇥

ij

(main) and of the inter-event times �tij

(inset) in thei ! j tie rescaled by �t

ij

. The black circles correspond tothe real data, while the red squares is the overall-shu⇥ed re-sult. Blue diamonds correspond to the case in which only the⇤ ! i sequence is randomized. Only ties with w

ij

� 10 areconsidered. In both graphs the dashed line correspond to thee�x function.

tional reasons we consider the latter case. Nodes remaininfected during a time Ti until they decay into the re-covered state. For the sake of simplicity we simulate thesimplest model in which the recovering time Ti is deter-ministic and homogeneous Ti = T and set T = 2 days,although di�erent and/or stochastic Ti can be studiedwithin the same model. The spreading dynamics gener-ates a viral cascade that grows until there are no morenodes in the infected state. We repeat the spreading pro-cess for 3 � 104 randomly chosen seeds. Note that ourmodel includes the SI model simulations in [15] where⇥ = 1 and T = T0, with T0 being the total duration ofthe dataset.By looking at the size of the largest cascade smax (over

all realizations) at each value of ⇥, we first ensure theexistence of a percolation transition [4] (see Fig. 3), con-firmed by a change in the behavior of smax from smallto large cascades at a given value of ⇥ (tipping point).

P (⇥) =

Z 1

⌧d�t

�tP (�t)

�t

1

�t

Miritello, G., Moro, E. & Lara, R., 2011. Dynamical strength of social ties in information spreading. Physical Review E, 83(4), p.045102.

Page 25: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Is that all? Nope • Temporal motifs:

• Two interactions are∆t-conected if they happen with a time difference < ∆t

Temporal motifs in time-dependent networks 5

Figure 1. (a) An example event data set E with six events. Durations have beenomitted for simplicity. With �t = 10 there are two maximal subgraphs, shown in (b)and (c). (d) Valid subgraphs contained in the maximal subgraph in (b). In additionto these the maximal subgraph itself and all unit subgraphs are valid subgraphs. Themaximal subgraph in (c) does not contain other valid subgraphs than the maximal andunit subgraphs. (e) Event sets that are contained in (b) but are not valid subgraphs:the upper one because it is not �t-connected, the lower one because it does not includeall consecutive �t-connected events of node c.

The presented definition for temporal subgraph is meaningful only when each node

is involved in at most one event at a time. When overlapping events are allowed, the

large number of di↵erent situations that can arise in the most general case makes it

more di�cult to define temporal subgraphs in such a way that the results could still be

easily interpreted. Yet the prospects of such a definition are enticing, as it would allow

for the exploration of almost any temporal network data, for example transportation

networks [28] and time-varying brain functional networks [29].

3. Algorithm for identification of temporal motifs

Because maximal subgraphs are temporally separated from all other events by at least

time �t, all subgraphs are fully contained in some maximal subgraph. Based on this

observation the process of identifying all temporal motifs in a given event set E can be

separated to three parts:

(i) Find all maximal connected subgraphs E⇤max

.

Temporal motifs in time-dependent networks 5

Figure 1. (a) An example event data set E with six events. Durations have beenomitted for simplicity. With �t = 10 there are two maximal subgraphs, shown in (b)and (c). (d) Valid subgraphs contained in the maximal subgraph in (b). In additionto these the maximal subgraph itself and all unit subgraphs are valid subgraphs. Themaximal subgraph in (c) does not contain other valid subgraphs than the maximal andunit subgraphs. (e) Event sets that are contained in (b) but are not valid subgraphs:the upper one because it is not �t-connected, the lower one because it does not includeall consecutive �t-connected events of node c.

The presented definition for temporal subgraph is meaningful only when each node

is involved in at most one event at a time. When overlapping events are allowed, the

large number of di↵erent situations that can arise in the most general case makes it

more di�cult to define temporal subgraphs in such a way that the results could still be

easily interpreted. Yet the prospects of such a definition are enticing, as it would allow

for the exploration of almost any temporal network data, for example transportation

networks [28] and time-varying brain functional networks [29].

3. Algorithm for identification of temporal motifs

Because maximal subgraphs are temporally separated from all other events by at least

time �t, all subgraphs are fully contained in some maximal subgraph. Based on this

observation the process of identifying all temporal motifs in a given event set E can be

separated to three parts:

(i) Find all maximal connected subgraphs E⇤max

.

Kovanen, L. et al., 2011. Temporal motifs in time-dependent networks. Journal Of Statistical Mechanics-Theory And Experiment, 2011(11), p.P11005.

Page 26: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Is that all? Nope • Temporal motifs:

• Most of temporal motifs involve two nodes

• Motifs that allow causal interpretations are more common

Zhao, Q. et al., 2010. Communication motifs: a tool to characterize social communications. Proceedings of the 19th ACM international conference on Information and knowledge management, pp.1645–1648.

Page 27: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Is that all? • Temporal motifs

depend on sex

• All-female/All-male cases are overrepresented/underrepresented

3

A B C

1.,2.

1.

2.

2

.1

.

1

.2

.

1

.

2

.

2

.

1

.

repeated contact returned contact

causal chain non-causal chain

out-star in-start=1

t=3

t=4

t=8d

a

b

c

t=1

t=3

t=4

a

b

c

1. 2.3.

t=1,t=2,t=7

ba

t=1,t=2

ba1., 2.

event sequence temporal subgraph(6t=3)

temporal motif

FIG. 1. (A) A schematic presentation of two temporal networks. The top one has clear temporal structure, while the bottom oneis random; however, both give rise to the same aggregate network. (B) Two examples on identifying temporal motifs. Startingfrom a temporal network (left) we first identify temporal subgraphs (middle) and then the temporal motif corresponding toeach subgraph (right). Note that temporal motifs do not contain information about the identities of nodes or the exact times ofevents, but do retain information about node colors and the temporal order of events. (C) The six possible uncolored two-eventtemporal motifs whose colored variants are used in our analysis.

�20 �10 0 10 20

z-score

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

0.40

P(z

)

r(m)=1.18r(m)=0.82

1

.

2

.

1

.

2

.

1

.

2

.

1

.

2

.

1

.

2

.

1

.

2

.

1

.

2

.

1

.

2

.

FIG. 2. The distribution of z-scores for all two-event motifsin the synthetic data, averaged over 50 data sets. For mostmotifs z ⇡ 0 because there are no di↵erences between nodecolors. The peak at z ⇡ 17 corresponds to the four causalchains where the first event occurs between nodes of the samecolor. Because expected motif count is defined by the aver-age count of uncolored motifs, the peak at z ⇡ �17 containsthe four remaining causal chains where the first event occursbetween nodes of di↵erent color.

duration of calls it can be represented as a temporal net-work. Node types are formed by combining the gender,age group, and payment type (prepaid or postpaid mobilesubscription plan) of customers. Our analysis focuses ontwo-event motifs for simplicity (Figure 1C). Larger mo-tifs are not only more demanding computationally, butalso more laborious to analyze: there are already 56448di↵erent two-event motifs with the 24 node types created

by combining gender, age group, and payment type.

A. Node types a↵ect motif counts

We first check whether the null hypothesis is true orfalse—whether node types a↵ect motifs counts beyondwhat can be expected based on the aggregate network—by plotting the distribution of z-scores, shown in Figure3A. The distribution does not have zero mean or unitvariance; instead, about 35% of motifs have |z| > 1.96.The null hypothesis is clearly false, and we can concludethat motif counts are not independent of node types.

For comparison, Figure 3B shows the same distributionafter shu✏ing node types, and Figure 3C after shu✏ingevent types (see Appendix B). In both cases the distri-butions suggest that the null hypothesis is true. Indeed,after randomizing node types there can be no di↵erencesbetween them even though the data still contains thesame untyped temporal subgraphs as the original data.The time shu✏ed data, on the other hand, has exactlythe same aggregate network as the original data. How-ever, because event times are now uncorrelated, all di↵er-ences between node types are explained by the structureof the aggregate network.

B. There is temporal homophily

Figure 3A shows that there are di↵erences betweennode types; we will now look at what these di↵erencesare. Homophily refers to the well-documented tendencyof individuals to interact with others similar to them withrespect to various social and demographic factors [31–

Kovanen, L. et al., 2013. Temporal motifs reveal homophily, gender-specific patterns, and group talk in call sequences. PNAS, 110(45), pp.18070–18075.

Page 28: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Wrap-up • Activity within a single tie is bursty

• P(dt) is a heavy tailed • Bursts are correlated

• Activity across adjacent ties is correlated • Two adjacent ties

• Group conversations • Impact on the waiting time (spreading)

• More adjacent ties • Temporal motifs

Page 29: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Wrap-up • Activity within a single tie is bursty

• P(dt) is a heavy tailed • Bursts are correlated

• Activity across adjacent ties is correlated • Two adjacent ties

• Group conversations • Impact on the waiting time (spreading)

• More adjacent ties • Temporal motifs

Page 30: Temporal dynamics of human behavior in social networks (i)

form/decayt1 t2 t3

Tie activity is bursty t

Groups of conversation

t1 t1+dt

Tie

s

Communities form/change/decay

t1 t2

Com

mun

ities

Networks form/change/decay

t1 t2

Net

wor

k

3 Tie dynamics

Page 31: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Ties are formed and decay • Why?

• Creation

• Node creation/decay • Assortative: Homophily/Heterophily • Relational: Reciprocity, Triadic Closure, Degree • Proximity: Proximity and Social Foci

• Decay

• Idem

• How? • Social limitations / strategies

Page 32: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Ties are formed and decay • Why?

• Creation

• Node creation/decay • Assortative: Homophily/Heterophily • Relational: Reciprocity, Triadic Closure, Degree • Proximity: Proximity and Social Foci

• Decay

• Idem

• How? • Social limitations / strategies

Page 33: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Tie formation: Relational predictors

• Triadic closure • Mutual acquaintances / embeddedness

Page 34: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Tie formation: predictors

than individuals who are separated by onlyone intermediary (dij 0 2). Figure 1A (cir-cles), however, demonstrates that when twoindividuals share at least one class, they areon average 3 times more likely to interact ifthey also share an acquaintance (dij 0 2), andabout 140 times more likely if they do not(dij 9 2). In addition, Fig. 1B shows that theempirical probability of tie formation in-creases with the number of mutual acquaint-ances both for pairs with (circles) and without(triangles) shared classes, becoming indepen-dent of shared affiliations for large numbersof mutual acquaintances (six and more). Figure1C displays equivalent information for shared

classes, indicating that while the effect of asingle shared class is roughly interchange-able with a single mutual acquaintance, thepresence of additional acquaintances has agreater effect than additional foci in ourdata set. These findings imply that even aminimally accurate, generative network mod-el would need to account separately for (i)triadic closure, (ii) focal closure, and (iii) thecompounding effect of both biases together.

Our data can also shed light on theoreticalnotions of tie strength (13) and attribute-based homophily (6, 26). We found (Fig. 2)that the likelihood of triadic closure increasesif the average tie strength between two

strangers and their mutual acquaintances ishigh, which supports commonly acceptedtheory (6, 13). By contrast, homophily withrespect to individual attributes appears to playa weaker role than might be expected. Of theattributes we considered in this and othermodels (27)—status (undergraduate, graduatestudent, faculty, or staff), gender, age, andtime in the community—none has a signifi-cant effect on triadic closure. The significantpredictors are tie strength, number of mutualacquaintances, shared classes, the interac-tion of shared classes and acquaintances, andstatus obstruction, which we define as the ef-fect on triadic closure of a mediating indi-vidual who has a different status than eitherof the potential acquaintances. For example,two students connected through a professorare less likely to form a direct tie than twostudents connected through another student,ceteris paribus. We suspect, however, thatstatus obstruction may be an indicator of un-observed focal closure beyond class attend-ance. Thus, although homophily has oftenbeen observed with respect to individualattributes in cross-sectional data (6, 26), theseeffects may be mostly indirect, operatingthrough the structural constraint of shared foci(10), such as selection of courses or extra-curricular activities.

Our results also have implications for theutility of cross-sectional network analysis,which relies on the assumption that thenetwork properties of interest are in equilibri-um (4). Figure 3 shows that different networkmeasures exhibit varying levels of stabilityover time and with respect to the smoothingwindow t. Average vertex degree bkÀ, frac-tional size of the largest component S, andmean shortest path length L all exhibitseasonal changes and produce different mea-surements for different choices of t, where bkÀis especially sensitive to t. The clusteringcoefficient C (28), however, stays virtuallyconstant as bkÀ changes, suggesting, perhapssurprisingly, that averages of local networkproperties are more stable than global proper-ties such as L or S. Nevertheless, these resultssuggest that as long as the smoothing windowt is chosen appropriately and care is taken toavoid collecting data in the vicinity of exog-enous changes (e.g., end of semester), averagenetwork measures remain stable over time andthus can be recovered with reasonable fidelityfrom network snapshots.

The relative stability of average networkproperties, however, does not imply equiva-lent stability of individual network properties,for which the empirical picture is more com-plicated. On the one hand, we find thatdistributions of individual-level propertiesare stable, with the same caveats that applyto averages. For example (Fig. 4, A to C), theshape of the degree distribution p(k) isrelatively constant across the duration of our

Fig. 1. Cyclic and fo-cal closure. (A) Averagedaily empirical proba-bility pnew of a new tiebetween two individ-uals as a function oftheir network distancedij. Circles, pairs thatshare one or more inter-action foci (attend oneor more classes togeth-er); triangles, pairs thatdo not share classes.(B) pnew as a functionof the number of mu-tual acquaintances. Cir-cles, pairs with one ormore shared foci; tri-angles, pairs withoutshared foci. (C) pnew asa function of the number of shared interaction foci. Circles, pairs with one or more mutualacquaintances; triangles, pairs without mutual acquaintances. Lines are shown as a guide for theeye; standard errors are smaller than symbol size.

Fig. 2. Results of mul-tivariate survival analy-sis of triadic closure fora sample of 1190 pairsof graduate and under-graduate students. Shownare the hazard ratiosand 95% confidence in-tervals from Cox re-gression of time to tieformation between twoindividuals since theirtransition to distancedij 0 2. Hazard ratio gmeans that the proba-bility of closure changesby a factor of g with a unit change in the covariate or relative to the reference category. We treat acovariance as significant if the corresponding 95% confidence interval does not contain g 0 1 (noeffect). Predictors, sorted by effect magnitude: strong indirect (1 if indirect connection strength isabove sample median, 0 otherwise), classes (number of shared classes), acquaintances (number ofmutual network neighbors less 1), same age (1 if absolute difference in age is less than 1 year, 0otherwise), same year (1 if absolute difference in number of years at the university is less than 1, 0otherwise), gender [effects of male-male (MM) and female-female (FF) pair, respectively, relative toa female-male (FM) pair], acquaint*classes (interaction effect between acquaintances and classes),and obstruction (1 if no mutual acquaintance has the same status as either member of the pair, 0otherwise) (25).

REPORTS

www.sciencemag.org SCIENCE VOL 311 6 JANUARY 2006 89

Probability of a new tie

than individuals who are separated by onlyone intermediary (dij 0 2). Figure 1A (cir-cles), however, demonstrates that when twoindividuals share at least one class, they areon average 3 times more likely to interact ifthey also share an acquaintance (dij 0 2), andabout 140 times more likely if they do not(dij 9 2). In addition, Fig. 1B shows that theempirical probability of tie formation in-creases with the number of mutual acquaint-ances both for pairs with (circles) and without(triangles) shared classes, becoming indepen-dent of shared affiliations for large numbersof mutual acquaintances (six and more). Figure1C displays equivalent information for shared

classes, indicating that while the effect of asingle shared class is roughly interchange-able with a single mutual acquaintance, thepresence of additional acquaintances has agreater effect than additional foci in ourdata set. These findings imply that even aminimally accurate, generative network mod-el would need to account separately for (i)triadic closure, (ii) focal closure, and (iii) thecompounding effect of both biases together.

Our data can also shed light on theoreticalnotions of tie strength (13) and attribute-based homophily (6, 26). We found (Fig. 2)that the likelihood of triadic closure increasesif the average tie strength between two

strangers and their mutual acquaintances ishigh, which supports commonly acceptedtheory (6, 13). By contrast, homophily withrespect to individual attributes appears to playa weaker role than might be expected. Of theattributes we considered in this and othermodels (27)—status (undergraduate, graduatestudent, faculty, or staff), gender, age, andtime in the community—none has a signifi-cant effect on triadic closure. The significantpredictors are tie strength, number of mutualacquaintances, shared classes, the interac-tion of shared classes and acquaintances, andstatus obstruction, which we define as the ef-fect on triadic closure of a mediating indi-vidual who has a different status than eitherof the potential acquaintances. For example,two students connected through a professorare less likely to form a direct tie than twostudents connected through another student,ceteris paribus. We suspect, however, thatstatus obstruction may be an indicator of un-observed focal closure beyond class attend-ance. Thus, although homophily has oftenbeen observed with respect to individualattributes in cross-sectional data (6, 26), theseeffects may be mostly indirect, operatingthrough the structural constraint of shared foci(10), such as selection of courses or extra-curricular activities.

Our results also have implications for theutility of cross-sectional network analysis,which relies on the assumption that thenetwork properties of interest are in equilibri-um (4). Figure 3 shows that different networkmeasures exhibit varying levels of stabilityover time and with respect to the smoothingwindow t. Average vertex degree bkÀ, frac-tional size of the largest component S, andmean shortest path length L all exhibitseasonal changes and produce different mea-surements for different choices of t, where bkÀis especially sensitive to t. The clusteringcoefficient C (28), however, stays virtuallyconstant as bkÀ changes, suggesting, perhapssurprisingly, that averages of local networkproperties are more stable than global proper-ties such as L or S. Nevertheless, these resultssuggest that as long as the smoothing windowt is chosen appropriately and care is taken toavoid collecting data in the vicinity of exog-enous changes (e.g., end of semester), averagenetwork measures remain stable over time andthus can be recovered with reasonable fidelityfrom network snapshots.

The relative stability of average networkproperties, however, does not imply equiva-lent stability of individual network properties,for which the empirical picture is more com-plicated. On the one hand, we find thatdistributions of individual-level propertiesare stable, with the same caveats that applyto averages. For example (Fig. 4, A to C), theshape of the degree distribution p(k) isrelatively constant across the duration of our

Fig. 1. Cyclic and fo-cal closure. (A) Averagedaily empirical proba-bility pnew of a new tiebetween two individ-uals as a function oftheir network distancedij. Circles, pairs thatshare one or more inter-action foci (attend oneor more classes togeth-er); triangles, pairs thatdo not share classes.(B) pnew as a functionof the number of mu-tual acquaintances. Cir-cles, pairs with one ormore shared foci; tri-angles, pairs withoutshared foci. (C) pnew asa function of the number of shared interaction foci. Circles, pairs with one or more mutualacquaintances; triangles, pairs without mutual acquaintances. Lines are shown as a guide for theeye; standard errors are smaller than symbol size.

Fig. 2. Results of mul-tivariate survival analy-sis of triadic closure fora sample of 1190 pairsof graduate and under-graduate students. Shownare the hazard ratiosand 95% confidence in-tervals from Cox re-gression of time to tieformation between twoindividuals since theirtransition to distancedij 0 2. Hazard ratio gmeans that the proba-bility of closure changesby a factor of g with a unit change in the covariate or relative to the reference category. We treat acovariance as significant if the corresponding 95% confidence interval does not contain g 0 1 (noeffect). Predictors, sorted by effect magnitude: strong indirect (1 if indirect connection strength isabove sample median, 0 otherwise), classes (number of shared classes), acquaintances (number ofmutual network neighbors less 1), same age (1 if absolute difference in age is less than 1 year, 0otherwise), same year (1 if absolute difference in number of years at the university is less than 1, 0otherwise), gender [effects of male-male (MM) and female-female (FF) pair, respectively, relative toa female-male (FM) pair], acquaint*classes (interaction effect between acquaintances and classes),and obstruction (1 if no mutual acquaintance has the same status as either member of the pair, 0otherwise) (25).

REPORTS

www.sciencemag.org SCIENCE VOL 311 6 JANUARY 2006 89

Distance in the network

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—May 2007 1025DOI: 10.1002/asi

FIG. 5. Relative average performance of various predictors versus random predictions. The value shown is the average ratio over the five datasets of the givenpredictor’s performance versus the random predictor’s performance. The error bars indicate the minimum and maximum of this ratio over the five datasets. Theparameters for the starred predictors are as follows: (a) for weighted Katz, ! 0.005; (b) for Katz clustering, 1 ! 0.001, ! 0.15, 2 ! 0.1; (c) for low-rankinner product, rank ! 256; (d) for rooted Pagerank, ! 0.15; (e) for unseen bigrams, unweighted common neighbors with ! 8; and (f) for SimRank, ! 0.8.gda

brbb

0

10

20

30

40

50

Rel

ativ

epe

rfor

man

cera

tiove

rsus

rand

ompr

edic

tions

random predictor

Ada

mic

/Ada

r

wei

ghte

dK

atz*

Kat

zcl

uste

ring*

low

-ran

kin

ner

prod

uct*

com

mon

neig

hbor

s

root

edP

ageR

ank*

Jacc

ard

unse

enbi

gram

s*

Sim

Ran

k*

grap

hdi

stan

ce

hitti

ngtim

e

Similarities Among the Predictors and the Datasets

Not surprisingly, there is significant overlap in the predic-tions made by the various methods. In Figure 8, we show thenumber of common predictions made by 10 of the mostsuccessful measures on the cond-mat graph. We see thatKatz, low-rank inner product, and Adamic/Adar are quitesimilar in their predictions, as are (to a somewhat lesserextent) rooted PageRank, SimRank, and Jaccard. Hittingtime is remarkably unlike any of the other nine in its predic-tions, despite its reasonable performance. The number ofcommon correct predictions shows qualitatively similar be-havior (see Figure 9). It would be interesting to understandthe generality of these overlap phenomena, especially be-cause certain of the large overlaps do not seem to followobviously from the definitions of the measures.

It is harder to quantify the differences among the datasets,but their relationship is a very interesting issue as well. Oneperspective is provided by the methods based on low-rankapproximation: On four of the datasets, their performancetends to be best at an intermediate rank while on gr-qc theyperform best at rank 1 (see Figure 10, e.g., for a plot of thechange in the performance of the low-rank matrix-entry

predictor as the rank of the approximation varies). This factsuggests a sense in which the collaborations in gr-qc havea much “simpler” structure than those in the other four. Onealso observes the apparent importance of node degree inthe hep-ph collaborations: The preferential-attachmentpredictor—which considers only the number (and not theidentity) of a scientist’s coauthors—does uncharacteristicallywell on this dataset, outperforming the basic graph-distancepredictor. Finally, it would be interesting to make precise asense in which astro-ph is a “difficult” dataset, given thelow performance of all methods relative to random andthe fact that none beats simple ranking by common neigh-bors. We will explore this issue further when we considercollaboration data drawn from other fields.

Because almost all of our experiments were carried outon social networks formed via the collaborations of physi-cists, it is difficult to draw broad conclusions about linkprediction in social networks in general. The culture ofphysicists and of physics collaboration (e.g., see the work ofKatz & Martin, 1997) plays a role in the quality of ourresults. The considerations discussed earlier suggest thatthere are some important differences even within physics

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—May 2007 1021DOI: 10.1002/asi

Evaluating a Link Predictor

Each link predictor p that we consider outputs a ranked listLp of pairs in A ! A " Eold; these are predicted new collabora-tions, in decreasing order of confidence. For our evaluation, wefocus on the set Core, so we define (Core !Core) and . Our performance measure for Predictor pis then determined as follows: From the ranked list Lp, we takethe first n pairs that are in Core ! Core, and determine the sizeof the intersection of this set of pairs with the set .

Methods for Link Prediction

In this section, we survey an array of methods for link pre-diction. All the methods assign a connection weight score(x, y)to pairs of nodes x, y , based on the input graph Gcollab, andthen produce a ranked list in decreasing order of score(x, y).Thus, they can be viewed as computing a measure of proxim-ity or “similarity” between nodes x and y, relative to thenetwork topology. In general, the methods are adapted fromtechniques used in graph theory and in social-network analy-sis; in a number of cases, these techniques were not designedto measure node-to-node similarity and hence need to bemodified for this purpose. Figure 2 summarizes most of thesemeasures; we discuss them in more detail later. Note that

98

Enew!

n J |Enew! |

xEnew! J Enew

some of these measures are designed only for connectedgraphs; because each graph Gcollab that we consider has a giantcomponent—a single component containing most of thenodes—it is natural to restrict the predictions for these mea-sures to this component.

Perhaps the most basic approach is to rank pairs x, y bythe length of their shortest path in Gcollab. Such a measurefollows the notion that collaboration networks are “smallworlds,” in which individuals are related through shortchains (Newman, 2001b) (In keeping with the notion thatwe rank pairs in decreasing order of score(x, y), we definescore(x, y) here to be the negative of the shortest pathlength.) Pairs with shortest-path distance equal to oneare joined by an edge in Gcollab, and hence they belong tothe training edge set Eold. For all of our graphs Gcollab, thereare well more than n pairs at shortest-path distance two, soour shortest-path predictor simply selects a random subset ofthese distance-two pairs.

Methods Based on Node Neighborhoods

For a node x, let #(x) denote the set of neighbors of x inGcollab. A number of approaches are based on the idea that twonodes x and y are more likely to form a link in the future iftheir sets of neighbors #(x) and #(y) have large overlap; this

98

FIG. 2. Values for score(x, y) under various predictors; each predicts pairs x, y in descending order ofscore(x, y). The set #(x) consists of the neighbors of the node x in Gcollab.

98

graph distance (negated) length of shortest path between x and y

common neighbors

Jaccard’s coefficient

Adamic/Adar

preferential attachment

Katzb

where paths {paths of length exactly from x to y}

weighted: paths number of collaborations between x, y.

unweighted: paths iff x and y collaborate.

hitting time "Hx,y

stationary-normed "Hx,y y

commute time "(Hx,y $ Hy,x)stationary-normed "Hx,y y $ Hy,x x)

where Hx,y expected time for random walk from x to reach yy stationary-distribution weight of y (proportion of time

the random walk is at node y)

rooted PageRank stationary distribution weight of y under the following random walk: with probability , jump to x. with probability 1 " , go to a random neighbor of current node.

SimRank•1 if x%y

g # ©a!#(x)©b!#(y) score(a, b)

!#(x)! # !#(y)!otherwiseg

a

aa

JpJ

p#p#

p#

"1#x,y J 1

"1#x,y J

Ox,y"O#

J

©&!%1b

/ ' 0paths"O#x,y 0

0 ≠(x) 0 ' 0 ≠ (y) 0©z!≠(x)t≠(y) 1

log 0≠(z) 00 ≠(x) " ≠(y) 00 ≠(x) ´ ≠(y) 00 ≠(x) " ≠ (y) 0

Liben Nowell, D. & Kleinberg, J., 2007. The link‐prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7), pp.1019–1031.

Kossinets, G. & Watts, D.J., 2006. Empirical analysis of an evolving social network. Science, 311(5757), pp.88–90.

Page 35: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

3020 C.A. Hidalgo, C. Rodriguez-Sickert / Physica A 387 (2008) 3017–3024

Fig. 2. Persistence across a cellular phone network (a) Distribution of persistence for all links (b) Fraction of surviving ties as a function of time.The inset shows the same plot in a double logarithmic scale. The continuous line is t�1/4.

Fig. 3. Network structure and the persistence of ties (a) A fragment of the network extracted by considering up to the second neighbour of arandomly chosen node (indicated by a black arrow). (b) Distribution of persistence divided into nine degree categories. (c) Number of persistentlinks defined as those with a persistence of, from top to bottom: 5/10, 6/10, 7/10, 8/10, 9/10 and 10/10. (d) Distribution of persistence dividedinto nine clustering categories. (e) Distribution of persistence divided into five different reciprocity segments.

correlations between persistence, perseverance and the topological attributes of the mobile call network. In particular,we find that these temporal attributes correlate with topological variables such as the number of connection or degreeki , the average reciprocity of a node r (fraction of ties containing both incoming and outgoing calls) and the clusteringcoefficient of a node Ci defined as

Ci = 2�ki (ki � 1)

, (3)

where � is the number of triads in which the node is involved. Fig. 3(b) shows a histogram of persistence split intonine different degree categories revealing that persistent links represent a large fraction of the connections for lowdegree nodes while transient links are more common for large degree nodes. The number of persistent ties however,grows as a function of degree, meaning that although on average the persistence of high degree nodes is lower, inabsolute terms their core is larger.

• Tie decay: predictors

• Links with large embeddedness and reciprocity are more likely to persist

C.A. Hidalgo, C. Rodriguez-Sickert / Physica A 387 (2008) 3017–3024 3021

Table 1Persistence of ties and link attributes

Pearson’s correlation �C �k �r R TO Persistence

�C 1 0.023 0.15 0.11 0.23 0.15�k 1 0.02 �0.13 �0.19 �0.16�r 1 �0.68 �0.073 0.033R 1 0.2964 0.5886TO 1 0.3537

Regression coefficients 0.09 0.002 0.15 0.35 0.56Partial correlations 0.0027 0.0032 0.007 0.26 0.034

Fig. 3(d) shows the distribution of persistence divided by clustering coefficient categories, indicating that highlyclustered nodes tend to have relatively large cores. In the core periphery context, this means that persevering nodesare located in dense parts of the social network (Fig. 3(a) I) while those in sparser parts tend to have nonpersistentties acting as bridges which interruptedly connect different parts of the network (Fig. 3(a) II). Finally, we split thedistribution of persistence by reciprocity (Fig. 3(e)) and observe that nodes with more reciprocated ties tend to bemore persistent.

5.2. Multivariate analysis

In the previous section we presented a bivariate analysis in which we analysed the effect of three single structuralvariables and found that persistence depends monotonically on all of them (degree, clustering coefficient andreciprocity). The observed correlations, however, might well be redundant. To check if this is the case we perform amultivariate analysis to quantify the effect of each of these variables on the persistence of ties. Because of the largenumber of observations considered (⇠2 million nodes, ⇠8 million ties) the confidence intervals of the regressions donot spread far from the predicted values. Hence we concentrate our discussion on the relative magnitude of the effectsrather than on their significance.

On a social network, it is a well-known fact that agents tend to connect to others of similar degree significantlymore than random [21,22]. It is not known, however, whether links connecting same degree agents tend to be morestable than those connecting different degree agents. To study this effect we performed a regression in which we studythe persistence of a link as function of the difference in degree of the nodes that link connects. Furthermore, we alsoinclude in the regression the difference in clustering and average reciprocity of nodes connected by a particular link.In addition to this, we consider two link attributes: the reciprocity of links R, was there ever a panel in which callerand callee reciprocally called each other? and the topological overlap associated with that link which is defined as

T .O.(i, j) =s

O2i j

ki k j, (4)

where Oi j is the number of neighbours that agents i and j have in common and ki and k j are their respectivedegrees.

Together these five variables explain 40% of the variance in persistence (Table 1 R2 = 0.397). The contribution ofeach one of them can be isolated by considering the partial regression coefficients [23], which are a way to quantifyhow much of the variance is explained by each one of the covariates used in a regression. This technique shows thatassortative mixing is not associated with the persistence of ties. Whereas the reciprocity of the links (0 nonreciprocal,1 reciprocal) explains 26% of persistence followed by topological overlap which explains 3.4% of the variance inpersistence.

In the previous section we showed that high degree agents had on average less persistent ties than low degreeagents. We also saw that highly clustered agents tended to have a larger number of persistent connections and thatreciprocal ties tend to be more persistent in average. Again, we explore the redundancy of such statements usinglinear regression and split the contribution to perseverance from each of these variables by calculating their partialcorrelations (Table 2). Together, these variables explain almost 50% of the variance in perseverance (R2 = 0.49).Their contributions however are quite uneven. When we look at the partial correlation coefficients extracted from our

( )R.S. BurtrSocial Networks 22 2000 1–28 11

Fig. 1. Decay functions.

variation. The hazard rate for a relationship is the probability that it will be gone nextyear. Hazard rates for the colleague relations are given in Table 3, predicted by logitequations in Table 4, and graphed in Fig. 1B. Test statistics in Table 4 are adjusted

Ždown for autocorrelation between relations cited by the same respondent e.g., Kish and.Frankel, 1974 .

Decay is high on average. Three in four of the 22,709 colleague relations at risk ofdecay in Table 3 are gone next year.Model I, in the first column of Table 4, shows that tie and node age are both

statistically significant factors in the liability of newness. As expected for tie age, thehazard of decay is lower for older relationships. Hazard rates are lower in Table 3 for

Ž .older relationships e.g., 0.753 vs. 0.529 , there is a statistically significanty6.7 z-scoreŽtest statistic in Model I for tie age T in Table 4 P-0.001, and log T yields no stronger

.effect, y6.3 z-score , and there is a negative slope to the predicted hazard rates in Fig.1B. A colleague relationship that survives for a decade is almost sure to survive into the

Ž .future the predicted hazard rate is near zero .

Burt, R., 2000. Decay functions. Social Networks, 22(1), pp.1–28.

Hidalgo, C. & Rodriguez-Sickert, C., 2008. The dynamics of a mobile phone network. Physica A, 387(12), pp.3017–3024.

Page 36: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Two paradoxes • Ties bridging distant parts of the network (the ones important for information

diffusion, achievement) are not only the least likely to be created, but also the most likely to decay R.S. Burt / Social Networks 24 (2002) 333–363 335

Fig. 1. (A–C) Social capital and bridges across structural holes.

often cited as an early exemplar in this research). For example, the sociogram in Fig. 2shows three groups (A, B and C), and the density table at the bottom of the figure shows thegeneric pattern of in-group relations stronger than relations between groups in that diagonalelements of the table are higher than off-diagonals (each cell of a density table is the averageof relations between individuals in the row and individuals in the column). The result isthat people are not simultaneously aware of opportunities in all groups. Even if informationis of high quality, and eventually reaches everyone, the fact that diffusion occurs over aninterval of time means that individuals informed early or more broadly have an advantage.

2.2. Bridges across structural holes

Participation in, and control of, information diffusion underlies the social capital of struc-tural holes (Burt, 1992, 2000b, 2002). The argument describes social capital as a functionof brokerage opportunities, and draws on network concepts that emerged in sociology dur-ing the 1970s; most notably Granovetter (1973) on the strength of weak ties, Freeman(1977, 1979) on betweenness centrality, Cook and Emerson (1978) on the benefits ofhaving exclusive exchange partners, and Burt (1980) on the autonomy created by complex

Page 37: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Two paradoxes • Tie formation tends to close triangles. But ties embedded in triangles are less

likely to decay. Thus, network should become more clustered

data set except during natural spells of re-duced activity, such as winter break (Fig. 4C).On the other hand, as Fig. 4D illustrates, in-dividual ranks change substantially over theduration of the data set. Analogous results(27) apply to the concept of Bweak ties[ (13):The distribution of tie strength in the net-work is stable over time, and bridges are, onaverage, weaker than embedded ties Econsist-ent with (13)^. However, they do not retaintheir bridging function, or even remain weak,indefinitely.

Our results suggest that conclusionsrelating differences in outcome measuressuch as status or performance to differencesin individual network position (14) should betreated with caution. Bridges, for example,may indeed facilitate diffusion of informa-tion across entire communities (13). How-ever, their unstable nature suggests that theyare not Bowned[ by particular individualsindefinitely; thus, whatever advantages they

confer are also temporary. Furthermore, it isunclear to what extent individuals are ca-pable of strategically manipulating their po-sitions in a large network, even if that istheir intention (14). Rather, it appears thatindividual-level decisions tend to Baverage out,[yielding regularities that are simple functionsof physical and social proximity. Sharing focalactivities (10) and peers (26), for example,greatly increases the likelihood of individualsbecoming connected, especially when theseconditions apply simultaneously.

It may be the case, of course, that the in-dividuals in our population—mostly studentsand faculty—do not strategically manipulatetheir networks because they do not need to, notbecause it is impossible. Thus, our conclusionsregarding the relation between local and globalnetwork dynamics may be specific to theparticular environment that we have studied.Comparative studies of corporate or militarynetworks could help illuminate which features

of network evolution are generic and which arespecific to the cultural, organizational, andinstitutional context in question. We note thatthe methods we introduced here are generic andmay be applied easily to a variety of other set-tings. We conclude by emphasizing that under-standing tie formation and related processes insocial networks requires longitudinal data onboth social interactions and shared affiliations(4, 6, 10). With the appropriate data sets, theo-retical conjectures can be tested directly, andconclusions previously based on cross-sectionaldata can be validated or qualified appropriately.

References and Notes1. P. S. Dodds, D. J. Watts, C. F. Sabel, Proc. Natl. Acad. Sci.

U.S.A. 100, 12516 (2003).2. J. M. Kleinberg, Nature 406, 845 (2000).3. T. W. Valente, Network Models of the Diffusion of

Innovations (Hampton Press, Cresskill, NJ, 1995).4. P. Doreian, F. N. Stokman, Eds., Evolution of Social

Networks (Gordon and Breach, New York, 1997).5. P. Lazarsfeld, R. Merton, in Freedom and Control in

Modern Society, M. Berger, T. Abel, C. Page, Eds. (VanNostrand, New York, 1954), pp. 18–66.

6. M. McPherson, L. Smith-Lovin, J. M. Cook, Annu. Rev.Sociol. 27, 415 (2001).

7. J. A. Davis, Am. J. Sociology 68, 444 (1963).8. T. M. Newcomb, The Acquaintance Process (Holt Rinehart

and Winston, New York, 1961).9. P. M. Blau, J. E. Schwartz, Crosscutting Social Circles

(Academic Press, Orlando, FL, 1984).10. S. L. Feld, Am. J. Sociology 86, 1015 (1981).11. J. S. Coleman, Sociol. Theory 6, 52 (1988).12. A. Rapoport, Bull. Math. Biophys. 15, 523 (1953).13. M. S. Granovetter, Am. J. Sociology 78, 1360 (1973).14. R. S. Burt, Am. J. Sociology 110, 349 (2004).15. M. Hammer, Soc. Networks 2, 165 (1980).16. M. T. Hallinan, E. E. Hutchins, Soc. Forces 59, 225 (1980).17. M. E. J. Newman, Proc. Natl. Acad. Sci. U.S.A. 98, 404

(2001).18. S. Wasserman, K. Faust, Social Network Analysis: Methods

and Applications (Cambridge Univ. Press, Cambridge,1994).

19. M. E. J. Newman, SIAM Review 45, 167 (2003).20. J. P. Eckmann, E. Moses, D. Sergi, Proc. Natl. Acad. Sci.

U.S.A. 101, 14333 (2004).21. C. Cortes, D. Pregibon, C. Volinsky, J. Comp. Graph. Stat.

12, 950 (2003).22. P. Holme, C. R. Edling, F. Liljeros, Soc. Networks 26, 155

(2004).23. B. Wellman, C. Haythornthwaite, Eds., The Internet in

Everyday Life (Blackwell, Oxford, 2003).24. N. K. Baym, Y. B. Zhang, M. Lin, New Media Soc. 6, 299

(2004).25. Materials and methods are available as supporting

material on Science Online.26. H. Louch, Soc. Networks 22, 45 (2000).27. G. Kossinets, D. J. Watts, data not shown.28. M. E. J. Newman, S. H. Strogatz, D. J. Watts, Phys. Rev. E

6402, 026118 (2001).29. We thank P. Dodds and two anonymous reviewers for

helpful comments and B. Beecher and W. Bourne forassistance with data collection and anonymization. Thisresearch was supported by NSF (SES 033902), the James S.McDonnell Foundation, Legg Mason Funds, and theInstitute for Social and Economic Research and Policy atColumbia University.

Supporting Online Materialwww.sciencemag.org/cgi/content/full/311/5757/88/DC1Materials and MethodsReferences

1 July 2005; accepted 29 November 200510.1126/science.1116869

Fig. 3. Network-levelproperties over time, forthree choices of smooth-ing window t 0 30 days(dashes), 60 days (solidlines), and 90 days(dots). (A) Mean vertexdegree bkÀ. (B) Fraction-al size of the largestcomponent S. (C) Meanshortest path length inthe largest componentL. (D) Clustering coeffi-cient C.

Fig. 4. Stability of de-gree distribution and in-dividual degree ranks. (A)Degree distribution inthe instantaneous net-work at day 61, logarith-mically binned. (B) Sameat day 270. (C) TheKolmogorov-Smirnovstatistic D comparing de-gree distribution in theinstantaneous network atday 61 and in subse-quent daily approxima-tions. (D) Dissimilaritycoefficient for degreeranks z 0 1 – rS

2, whererS is the Spearman rankcorrelation between indi-vidual degrees at day 61 and in subsequent approximations. z varies between 0 and 1 andmeasures the proportion of variance in degree ranks that cannot be predicted from the ranks inthe initial network.

REPORTS

6 JANUARY 2006 VOL 311 SCIENCE www.sciencemag.org90

Page 38: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• How are ties formed and destroyed? Is there any strategy?

• Cognitive limitations, time limitations • Dunbar number: there is a limit to

the number of people with whom one can maintain stable social relationships.

• Time/attention is limited: how do we manage relationships if our time is limited?

Page 39: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• How are ties formed and destroyed? • Disentangling tie burstiness and formation/decay

G. Miritello, R. Lara, M. Cebrián & E. Moro 2

G. Miritello, R. Lara, M. Cebrián & E. Moro 3

0T

i $ j

(a)

(c)

(b)

Links tipo (a): 15%Links tipo (b): 19%Links tipo (c): 24%Links tipo (d): 42%

Links tipo (e) : 3.5%

t

(e)

(d)

�tij

Figure S2: (Color online) Schematic view of the different situations of tie formation/decay and theinterplay between the tie communication patterns and tie formation/decay for a given observationtime window of length T (shadowed area). Each lines refers to a different tie while each verticalsegment indicates a communication event between i � j and �tij is the interevent time in the i � jtime series.

for the link. As was shown in [3, ?] the pdf for inter event times depends mostly on the averageinter-event time �tij, i.e. P(�tij) = P(�tij/�tij) where P(x) is a universal function. Thus, we couldrewrite the previous expression as

P(�ij|�tij) =1

�ij

��ij

0

P(�tij/�tij)d�tij (2)

However, links are very heterogeneous in the sense that they have very different �tij. Or equiv-alently, they very different weights wij = T/�tij weight. Suppose that �(�tij) is the distributionof average inter-event times across links and that each node chooses her links activities from thatdistribution of �tij. Then the probability to observe one of her links at time � is given by:

P(�) =

�d�tij�(�tij)P(�|�tij) (3)

Thus, the growing function of the observed connectivity as a function of time is given by the ccf ofP(�).

ki(t) = ki(�)

� t

0

P(�)d� (4)

where ki(�) is the total connectivity of node i. Note that since P(x) and �(�tij) are heavy tailed,then P(�) is heavy tailed too and thus the ki(t) can show an apparent non-trivial time dependenceeven if all links are open during the observation time.

Let’s do an example: assume that the distribution of inter-event times is given by the exponen-tial pdf P(�t|�t) = e��t/�t/�t and also that the pdf for the average inter-event time is an exponen-

Activity localization in online social networks

7 months6 months 6 months

Figure S1: (Color online) Schematic view of the time intervals considered in our database and thedifferent situations of tie formation/decay and the interplay between the tie communication pat-terns and tie formation/decay for a given observation time window ⌦ of length T = 7 months(shadowed area). Each lines refers to a different tie while each vertical segment indicates a com-munication event between i $ j and �tij is the interevent time in the i $ j time series.

after ⌦. This later filter prevents spurious effects in the analysis of tie dynamics just becauseindividuals subscribe/unsubscribe just before/after ⌦; for example, we could have observed anapparent rapid growth of their social network at the beginning of the observation window or afast dissolution at its end [2]. This results in the removal of about the 17% of nodes and the 37% ofreciprocated links within ⌦.

2 Entanglement between bursty activity and tie dynamics

As stressed in the main text, one of the most challenging problems in the study of the dynamics oftie creation and removal is to identify whether a tie is actually a new/old connection. Althoughin most social networks there are specific events for the formation of new “friends” (or followers)or the corresponding unfriending events, due to the cheap cost of maintaing those connectionsmost of those ties are abandoned and thus activity between individuals is the only way to assesthe existence or not of that relationship.

However, human activity is bursty, meaning that there are large periods of inactivity followedby bursts of activity []. This means that within a particular tie i $ j the time between consecutivecommunication events �tij is heavy-tailed distributed. In our database we find that this is indeedthe case and in line with [3, ?] we find that there is a universal law for the distribution of inter-

Activity localization in online social networks

Miritello, G. et al., 2013. Limited communication capacity unveils strategies for human interaction. Scientific Reports, 3.

Page 40: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• How are ties formed and destroyed? • Very heterogeneous tie evolution

• Mean • But for 20% of nodes

hn↵,ii ' hn!,ii ' 8 hkii ' 16n↵,i > 15

G. Miritello, R. Lara, M. Cebrián & E. Moro 3

0 T

i � j

(a)

(c)

(b)

Links tipo (a): 15%Links tipo (b): 19%Links tipo (c): 24%Links tipo (d): 42%

Links tipo (e) : 3.5%

t(e)

(d)

�tij

Figure S2: (Color online) Schematic view of the different situations of tie formation/decay and theinterplay between the tie communication patterns and tie formation/decay for a given observationtime window of length T (shadowed area). Each lines refers to a different tie while each verticalsegment indicates a communication event between i � j and �tij is the interevent time in the i � jtime series.

for the link. As was shown in [3, ?] the pdf for inter event times depends mostly on the averageinter-event time �tij, i.e. P(�tij) = P(�tij/�tij) where P(x) is a universal function. Thus, we couldrewrite the previous expression as

P(�ij|�tij) =1

�ij

��ij

0

P(�tij/�tij)d�tij (2)

However, links are very heterogeneous in the sense that they have very different �tij. Or equiv-alently, they very different weights wij = T/�tij weight. Suppose that �(�tij) is the distributionof average inter-event times across links and that each node chooses her links activities from thatdistribution of �tij. Then the probability to observe one of her links at time � is given by:

P(�) =

�d�tij�(�tij)P(�|�tij) (3)

Thus, the growing function of the observed connectivity as a function of time is given by the ccf ofP(�).

ki(t) = ki(�)

�t

0

P(�)d� (4)

where ki(�) is the total connectivity of node i. Note that since P(x) and �(�tij) are heavy tailed,then P(�) is heavy tailed too and thus the ki(t) can show an apparent non-trivial time dependenceeven if all links are open during the observation time.

Let’s do an example: assume that the distribution of inter-event times is given by the exponen-tial pdf P(�t|�t) = e��t/�t/�t and also that the pdf for the average inter-event time is an exponen-

Activity localization in online social networks

7 months6 months 6 months

Figure 1. (Color online) Schematic view of the time intervals considered in our

database and the di↵erent situations of tie formation/decay and the interplay between

the tie communication patterns and tie formation/decay for a given observation time

window ⌦ of length T = 7 months (shadowed area). Each lines refers to a di↵erent

tie while each vertical segment indicates a communication event between i $ j and

�tij is the interevent time in the i $ j time series.

Empirical analysisTo study the formation and decay of communication ties, westudy the Call Detail Records (CDRs) recorded from a uniquemobile phone operator over a period of 19 months. The dataconsists of the anonymized voice calls of about 20 million userswithin 700 million communication ties. After filtering out allthe incoming or outgoing calls that involve other operators,we only consider users that are active across the whole timeperiod and retain only ties which are reciprocated [14]. Werefer to SI for further details about the processing and thesampling of the data.

Detection of tie creation/removal. In most studies of commu-nication networks a tie is assumed to be present if it showsany kind of activity in the observation window [14]. However,since communication is bursty [17], large inter-event timesbetween interactions are likely and thus they might be unob-served or mistaken as tie decay or formation, specially if theobservation window is short (see Fig.). For example, in ourdatabase we find that the average time between tie communi-cation events is h�tiji = 14 days (with � = 18 days) and thuswe might get spurious e↵ects if the observation window is ofthe order of months, as repeated interactions may fall outsidethe observation window [19].

To overcome this we propose a di↵erent method to asseswhether a tie has been formed/decayed in the observationwindow ⌦. The method is based on the observation of tieactivity in a time window before/after ⌦: if tie activity is ob-served in the 6 months before ⌦ then it is considered an oldtie [cases (a) and (d) in Fig. ]; on the other hand, if activityis observed in the 6 months after ⌦ we will assume that thetie persists [cases (b) and (d) in Fig. ]. In any other case, wewill consider that the tie is formed and/or decay in ⌦ [cases(a), (b) and (c) in Fig. ]. Of course, it is possible that even ifthere is no communication before/after the observation win-dow, the tie is still active after/before our database. Thiswould require that the tie has an inter-event time �tij big-ger than 7 months, i.e. case (e) in Figure . However, in ourdatabase, only 3.5% of the links have such a long inter-eventtime which validates the accuracy of our definition of tie de-cay/formation. See Suppl. Material for detailed informationon our discrimination method.

Dynamical social strategy.The procedure described above al-lows us to determine the tie formation and decay events foreach individual along the observation period of 7 months (seeFig. 2). With those events, we build her instantaneous so-

cial capacity i(t), defined as the number of open ties at anygiven instant t. In principle i(t) is very di↵erent from ki(t),the aggregated number of revealed links up to time t, whichis usually what is taken as a proxy for social connectivity[19]. But if we aggregate the number of added (removed)ties up to time t, denoted by n↵,i(t) [n!,i(t)], we get thatki(T ) = i(0) + n↵,i(T ). Thus ki(T ) is a combination ofthe social capacity and tie formation activity in the observa-tion period. In our database we find a large heterogeneityin n↵,i(T ) and n!,i(T ) [see Fig. 3a]: while on average peo-ple create/destroy about 8 (reciprocated) ties in a period of 7months, 20% of users in our database add/remove more than15 ties in that period. This is a relatively large number for amobile-phone communication, where much more e↵ort is re-quired to establish and maintain a tie if compared to onlinecommunication networks such as Twitter or Facebook, whichare often used to collect as many friends and followers as pos-sible. Note that on average n↵,i(T ) and n!,i(T ) almost equalski(T )/2, (see Fig. 3a) for the observation period, which sug-gests that a large fraction of the revealed aggregated socialconnectivity ki(T ) is given by newly formed or removed con-nections. Thus, ki(T ) usually overestimates the instantaneoushuman social capacity of maintaining social ties.

The imbalance between the number of added/removedties measures how social capacity changes. At the end of

A

B

C

Num

ber o

f tie

sTi

e id

Days

�i(t)ki(t)

0 30 60 90 120 150 180 210

110

2030

110

2030

n�,i(t)

n�,i(t) + �i(0)

Figure 2. From communication activity to tie dynamics:Panel (A) shows the communication events of a given individual in our database with

all her neighbors in the observation window. For each tie id, a vertical line represents

a call with the corresponding neighbor. Grey horizontal rectangles are drawn from

the first to the last observed communication event in each tie, considering also events

before and after the observation window. Panel (B) shows vertical up/down arrows

for each tie formation/decay events detected within the observation window. Using

those events, panel (C) shows the aggregated number of open ties as a function of

time i(0)+n↵,i(t) and the aggregated number of closed ties n!,i. Dashed line

is the apparent growth in the social connectivity ki(t) obtained by the cumulative

number of observed activity in ties up to some time, while red line is the number of

open connections at a given instant i(t) .

2 www.pnas.org/cgi/doi/10.1073/pnas.0709640104 Footline Author

days

Agg. # of ties destroyed

Agg. # of ties formed

#of ties opened

Page 41: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• How are ties formed and destroyed? • Tie formation/decay is bursty

3.3 Dynamical communication strategies 57

10!5 10!3 10!1 101

10!5

10!3

10!1

101 (b)

10!5 10!3 10!1 10110

!510

!310

!110

1(a)

P(x/x

)

x/x

x = �tk�k+1

P(x/x

)

x/x

x = �tk�k�1

Figure 3.10: (Rescaled) Distribution of the time gap between edge creation (a) and edge removal(b). Each curve refers to a group of nodes with a different activity rate ↵

i

, where groups havebeen obtained according to the quartiles of ↵

i

for the whole population. The dashed line in bothpanels correspond to an exponential distribution with the same mean.

We then calculate the aggregated number of events ni,↵

and ni,!

and fit them to linearmodels to obtain the simulated ↵

i

and !i

for those cases in which ni,↵

� 5 and ni,!

�5. As shown in Fig. 3.9 (c) the empirical values of n

i,↵

observed in our data can be wellexplained by the simulations, suggesting that our model works well at the particularscale considered. In addition, we also find a good agreement between the measuredvalues of ↵

i

and !i

and the simulations (Fig. 3.9 (d)), despite of the small amount ofoutliers that cannot be explained by our model.

3.3.2 Dynamics of tie creation/removal

We have seen that the growth of the number of added and removed connections canbe described by a linear process having rates ↵

i

and !i

, respectively. However, aspanel (b) of Fig. 3.8 suggests, it is important to note that the distribution of the timegap between creation and removing of ties is not an homogeneous Poissonian process.To characterize the temporal evolution of ties we analyze the distribution of the inter-event times elapsed between consecutive additions or deletions. We denote these timedifferences respectively �t

k,k+1 and �tk,k�1 since they increase and decrease the social

connectivity k of one unit. Our results in Fig. 3.10 indicate a heterogenous pattern ofactivity and, despite the exponential cut-off, small inter-event times are significantlymore probable than an exponential distribution. This is in line with previous researchthat has shown that the nature of link creation process in online social networks is broadand follows a power-law distribution (Gaito et al. 2012; Kikas et al. 2012; Leskovec,J. and Backstrom, L. et al. 2008). In these studies the bursty behavior of tie creationis usually associated to an acceleration (or exploration) phase in which tie formationrapidly increases. Since the trains of bursts are more likely to appear right after the

2

• For each type of paid service (e.g. PSTN calls),date when the user first and last used this service(whenever applicable).

• Time series indicating the number of days in eachmonth when the user connected to the Skype net-work

• For each type of free service (e.g. Skype-to-Skypeaudio calls, video calls, chat, etc.), time series indi-cating the number of days in each month when theuser used this service.

• Time stamped events of link addition and deletionof each users.

In the Skype network, when a user adds a friend tohis/her contact list, the friend may confirm the contactinvitation or not. Also, at any point in time a user maydelete a “friend” from their contact list.[32] Thus, thenetwork evolves by means of the following events: contactaddition, contact confirmation and contact deletion. Inour study to take into account only trusted social links weretained only confirmed edges, meaning edges where bothparties accepted the connection. Failure to do so wouldlead to mixing undesired with desired connections.

For the present study we employed two subsets of theabove dataset. The first dataset (DS1) includes everyactive user as of the end of 2010, all confirmed edgesbetween these users, and the date of confirmation of eachedge. In this context, we define an active user as one whoconnected to the Skype network at least in two di↵erentmonths during the first year after their registration date.In order to consider users with realistic number of friendswe selected only those with degree between 2 and 1000.Users with more connections are suspected to be botsor are business accounts and their behavior di↵ers fromthe majority who use Skype for personal communication.This filtering led to a set of more than 150 million users.

The second dataset (DS2) includes the set of edge ad-dition events and edge confirmation events for the periodof 2009�2011 as well as edge deletion events recorded fora year-long period in 2010�2011. Each of these events istime-stamped with a date. Only events related to userswith degree between 2 and 1000 were kept. Unlike DS1,non-active users were retained in DS2.

EGOCENTRIC NETWORK EVOLUTION

In this section we look at the evolution of egocentricnetworks to gain deeper understanding about the govern-ing microscopic rules of contact list evolution.

Bursty edge dynamics

To characterize the temporal evolution of contact lists,we first examine the sequences of edge addition and dele-tion of each individual and we calculate the distributionsof inter-event times

⌧a = tai+1 � tai , ⌧d = tdi+1 � tdi , ⌧ad = tde � tae(1)

elapsed between consecutive additions at tai and tai+1 ordeletions at tdi and tdi+1 of the same user or between theaddition and deletion of an edge e. If this distributionfollows a power-law as

P (⌧) ⇠ ⌧�� (2)

it indicates strong temporal heterogeneities and bursti-ness, or otherwise if it decays exponentially it reflectsregular dynamical features. Bursty temporal evolution ofhuman dynamics was confirmed in various systems rang-ing from library loans to human communication [7, 14]or recently for the evolution of social networks [6].

1e-6x

1e-5x

1e-4x

1e-3x

1e-2x

1e-1x

1e-0x

1 10 100

P(τ a

) , P

(τd)

τa , τd (in days)

(a)

P(τa)P(τd)

γ=0.85

1e-6x

1e-5x

1e-4x

1e-3x

1e-2x

1e-1x

1e-0x

1 10 100

P(τ a

d)

τad , (in days)

(b)

P(τad)γ=0.82

FIG. 1: Inter-event time distributions of (a) edge addition(blue squares) and deletion (red circles) events of users and(b) addition and deletion of links (green triangles) in DS2.The straight lines indicate power-law functions with exponent(a) � = 0.85 and (b) � = 0.82. For a formal definition of the⌧a, ⌧d, and ⌧ad see Eq.1.

Calculating P (⌧) for DS2 we observe heterogeneouslydistributed inter-event times in Fig.1.a both in case ofedge additions and edge deletions. The distributions areshowing rather similar scaling with a section fitting ona power-law with exponent � ' 0.85 and an exponentialcuto↵ due to the finite time window. This is an interest-

3.3 Dynamical communication strategies 57

10!5 10!3 10!1 101

10!5

10!3

10!1

101 (b)

10!5 10!3 10!1 101

10!5

10!3

10!1

101(a)

P(x/x

)

x/x

x = �tk�k+1

P(x/x

)

x/x

x = �tk�k�1

Figure 3.10: (Rescaled) Distribution of the time gap between edge creation (a) and edge removal(b). Each curve refers to a group of nodes with a different activity rate ↵

i

, where groups havebeen obtained according to the quartiles of ↵

i

for the whole population. The dashed line in bothpanels correspond to an exponential distribution with the same mean.

We then calculate the aggregated number of events ni,↵

and ni,!

and fit them to linearmodels to obtain the simulated ↵

i

and !i

for those cases in which ni,↵

� 5 and ni,!

�5. As shown in Fig. 3.9 (c) the empirical values of n

i,↵

observed in our data can be wellexplained by the simulations, suggesting that our model works well at the particularscale considered. In addition, we also find a good agreement between the measuredvalues of ↵

i

and !i

and the simulations (Fig. 3.9 (d)), despite of the small amount ofoutliers that cannot be explained by our model.

3.3.2 Dynamics of tie creation/removal

We have seen that the growth of the number of added and removed connections canbe described by a linear process having rates ↵

i

and !i

, respectively. However, aspanel (b) of Fig. 3.8 suggests, it is important to note that the distribution of the timegap between creation and removing of ties is not an homogeneous Poissonian process.To characterize the temporal evolution of ties we analyze the distribution of the inter-event times elapsed between consecutive additions or deletions. We denote these timedifferences respectively �t

k,k+1 and �tk,k�1 since they increase and decrease the social

connectivity k of one unit. Our results in Fig. 3.10 indicate a heterogenous pattern ofactivity and, despite the exponential cut-off, small inter-event times are significantlymore probable than an exponential distribution. This is in line with previous researchthat has shown that the nature of link creation process in online social networks is broadand follows a power-law distribution (Gaito et al. 2012; Kikas et al. 2012; Leskovec,J. and Backstrom, L. et al. 2008). In these studies the bursty behavior of tie creationis usually associated to an acceleration (or exploration) phase in which tie formationrapidly increases. Since the trains of bursts are more likely to appear right after the

Kikas, R., Dumas, M. & Karsai, M., 2012. Bursty egocentric network evolution in Skype. arXiv.org, physics.soc-ph.

Miritello, G., Temporal Patterns of Communication in Social Networks, Springer 2013

Page 42: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• How are ties formed and destroyed? • Linear tie formation/decay and conserved capacity

G. Miritello, R. Lara, M. Cebrián & E. Moro 3

0 T

i � j

(a)

(c)

(b)

Links tipo (a): 15%Links tipo (b): 19%Links tipo (c): 24%Links tipo (d): 42%

Links tipo (e) : 3.5%

t(e)

(d)

�tij

Figure S2: (Color online) Schematic view of the different situations of tie formation/decay and theinterplay between the tie communication patterns and tie formation/decay for a given observationtime window of length T (shadowed area). Each lines refers to a different tie while each verticalsegment indicates a communication event between i � j and �tij is the interevent time in the i � jtime series.

for the link. As was shown in [3, ?] the pdf for inter event times depends mostly on the averageinter-event time �tij, i.e. P(�tij) = P(�tij/�tij) where P(x) is a universal function. Thus, we couldrewrite the previous expression as

P(�ij|�tij) =1

�ij

��ij

0

P(�tij/�tij)d�tij (2)

However, links are very heterogeneous in the sense that they have very different �tij. Or equiv-alently, they very different weights wij = T/�tij weight. Suppose that �(�tij) is the distributionof average inter-event times across links and that each node chooses her links activities from thatdistribution of �tij. Then the probability to observe one of her links at time � is given by:

P(�) =

�d�tij�(�tij)P(�|�tij) (3)

Thus, the growing function of the observed connectivity as a function of time is given by the ccf ofP(�).

ki(t) = ki(�)

�t

0

P(�)d� (4)

where ki(�) is the total connectivity of node i. Note that since P(x) and �(�tij) are heavy tailed,then P(�) is heavy tailed too and thus the ki(t) can show an apparent non-trivial time dependenceeven if all links are open during the observation time.

Let’s do an example: assume that the distribution of inter-event times is given by the exponen-tial pdf P(�t|�t) = e��t/�t/�t and also that the pdf for the average inter-event time is an exponen-

Activity localization in online social networks

7 months6 months 6 months

Figure 1. (Color online) Schematic view of the time intervals considered in our

database and the di↵erent situations of tie formation/decay and the interplay between

the tie communication patterns and tie formation/decay for a given observation time

window ⌦ of length T = 7 months (shadowed area). Each lines refers to a di↵erent

tie while each vertical segment indicates a communication event between i $ j and

�tij is the interevent time in the i $ j time series.

Empirical analysisTo study the formation and decay of communication ties, westudy the Call Detail Records (CDRs) recorded from a uniquemobile phone operator over a period of 19 months. The dataconsists of the anonymized voice calls of about 20 million userswithin 700 million communication ties. After filtering out allthe incoming or outgoing calls that involve other operators,we only consider users that are active across the whole timeperiod and retain only ties which are reciprocated [14]. Werefer to SI for further details about the processing and thesampling of the data.

Detection of tie creation/removal. In most studies of commu-nication networks a tie is assumed to be present if it showsany kind of activity in the observation window [14]. However,since communication is bursty [17], large inter-event timesbetween interactions are likely and thus they might be unob-served or mistaken as tie decay or formation, specially if theobservation window is short (see Fig.). For example, in ourdatabase we find that the average time between tie communi-cation events is h�tiji = 14 days (with � = 18 days) and thuswe might get spurious e↵ects if the observation window is ofthe order of months, as repeated interactions may fall outsidethe observation window [19].

To overcome this we propose a di↵erent method to asseswhether a tie has been formed/decayed in the observationwindow ⌦. The method is based on the observation of tieactivity in a time window before/after ⌦: if tie activity is ob-served in the 6 months before ⌦ then it is considered an oldtie [cases (a) and (d) in Fig. ]; on the other hand, if activityis observed in the 6 months after ⌦ we will assume that thetie persists [cases (b) and (d) in Fig. ]. In any other case, wewill consider that the tie is formed and/or decay in ⌦ [cases(a), (b) and (c) in Fig. ]. Of course, it is possible that even ifthere is no communication before/after the observation win-dow, the tie is still active after/before our database. Thiswould require that the tie has an inter-event time �tij big-ger than 7 months, i.e. case (e) in Figure . However, in ourdatabase, only 3.5% of the links have such a long inter-eventtime which validates the accuracy of our definition of tie de-cay/formation. See Suppl. Material for detailed informationon our discrimination method.

Dynamical social strategy.The procedure described above al-lows us to determine the tie formation and decay events foreach individual along the observation period of 7 months (seeFig. 2). With those events, we build her instantaneous so-

cial capacity i(t), defined as the number of open ties at anygiven instant t. In principle i(t) is very di↵erent from ki(t),the aggregated number of revealed links up to time t, whichis usually what is taken as a proxy for social connectivity[19]. But if we aggregate the number of added (removed)ties up to time t, denoted by n↵,i(t) [n!,i(t)], we get thatki(T ) = i(0) + n↵,i(T ). Thus ki(T ) is a combination ofthe social capacity and tie formation activity in the observa-tion period. In our database we find a large heterogeneityin n↵,i(T ) and n!,i(T ) [see Fig. 3a]: while on average peo-ple create/destroy about 8 (reciprocated) ties in a period of 7months, 20% of users in our database add/remove more than15 ties in that period. This is a relatively large number for amobile-phone communication, where much more e↵ort is re-quired to establish and maintain a tie if compared to onlinecommunication networks such as Twitter or Facebook, whichare often used to collect as many friends and followers as pos-sible. Note that on average n↵,i(T ) and n!,i(T ) almost equalski(T )/2, (see Fig. 3a) for the observation period, which sug-gests that a large fraction of the revealed aggregated socialconnectivity ki(T ) is given by newly formed or removed con-nections. Thus, ki(T ) usually overestimates the instantaneoushuman social capacity of maintaining social ties.

The imbalance between the number of added/removedties measures how social capacity changes. At the end of

A

B

C

Num

ber o

f tie

sTi

e id

Days

�i(t)ki(t)

0 30 60 90 120 150 180 210

110

2030

110

2030

n�,i(t)

n�,i(t) + �i(0)

Figure 2. From communication activity to tie dynamics:Panel (A) shows the communication events of a given individual in our database with

all her neighbors in the observation window. For each tie id, a vertical line represents

a call with the corresponding neighbor. Grey horizontal rectangles are drawn from

the first to the last observed communication event in each tie, considering also events

before and after the observation window. Panel (B) shows vertical up/down arrows

for each tie formation/decay events detected within the observation window. Using

those events, panel (C) shows the aggregated number of open ties as a function of

time i(0)+n↵,i(t) and the aggregated number of closed ties n!,i. Dashed line

is the apparent growth in the social connectivity ki(t) obtained by the cumulative

number of observed activity in ties up to some time, while red line is the number of

open connections at a given instant i(t) .

2 www.pnas.org/cgi/doi/10.1073/pnas.0709640104 Footline Author

days

-4

-3

-2

-1

0

-4 -3 -2 -1 03.5e-05

1.0e-03

2.0e-03

3.1e-03

4.1e-03

5.1e-03

6.1e-03

7.1e-03

8.1e-03

9.1e-03

6e-3

9e-3

3e-3

0

-4

-3

-2

-1

-4 -3 -2 -1

"black"black

3.5e-05

1.0e-03

2.0e-03

3.1e-03

4.1e-03

5.1e-03

6.1e-03

7.1e-03

8.1e-03

9.1e-03

log

�i

log �i5 10 20 50 100 200

1e-07

1e-05

1e-03

1e-01

x

dens

10-7

10-5

10-3

10-1

10 100

P(x

)

x

kin�,i

n�,i

�i

n�,i

n�

,i

0 20 40 60 80 100

020

4060

80100

A B C

Figure 3. Characterization of social dynamical strategies (A) Probability distribution function (pdf) of the aggregated social connectivity ki, number

of created ties n↵,i and number of deleted ties n!,i at t = T , compared with the pdf for the average social capacity i over the observation window. (b) We observe a

positive correlation between the number of created n↵,i and removed n!,i connections with a linear correlation coe�cient of 0.87. The results form the PCA analysis indicate

that the 93% of the variation can be explained by the first component with a standard deviation of 1.81 in the (0.70, 0.71) direction. This result is shown in the box plot,

where the bottom and the top of the boxes correspond to the 25th and the 75th quantiles respectively, while the band near the middle is the 50th percentile (median) of

the distribution. The down and top of the whiskers represent the 5th and 95th percentiles. The line y = x lies between the 9th and the 91st percentiles, thus capturing the

90% of the distribution in correspondence of each box. The blue solid lines refer to the 5th and 95th percentiles of random generated n↵ and n! from a Poisson distribution

with n↵ taken as the expected number of events in a given time interval of length T. (c) Density plot of ↵i as a function of !i. We observe.. (In order for the regression to

be significant, the arrivals and removals time sequences must contain a su�cient number of points, that we set equal to 5. This corresponds to keep all those nodes who form

at least 5 connections and at the same time remove 5 connections between the first and the last day of the observation period T (n↵ � 5 and n! � 5). For these nodes

we find that the linear model apply with a high statistical significance (p-value 0.05).

the observation period the change in the social capacity isi(T ) � i(0) = n↵,i(T ) � ni,!(T ). Interestingly, we findthat for most users in our database we get n↵,i(T ) ' n!,i(T )(Fig.3b). This means that there is some sort of conservationprinciple in social attention, where the number of destroyedties equals the number of formed ties in the observation win-dow such that the total number of open relationships in agiven time window T remains almost constant. This conser-vation of social capacity not only happens at this particulartime scale T but also instantaneously: as seen in Fig.3 we findthat for 80% of the users tie formation/destruction happenslinearly in time so that ni,↵(t) ' ↵it and ni,! ' !it where↵i and !i are the rates of tie formation/decay and ↵i ' !i.These two facts have a remarkable consequence: despite tiesare added/decayed continually, the social capacity for each in-dividual remains almost constant throughout the observationperiod i(t) ' i, signaling that people tend to balance theformation/removal of edges in such a way that the numberof open connections remains stable over time (or that variesslower in time than tie evolution). This finding is the rootof many observations in the literature (see for example [4, ?])that the distribution of connectivity in social networks seemsto be stable in time but the neighbors of a given node changefrom one time window to the other. In fact, we find thatthe average social persistence pi, i.e. the fraction of neigh-bors that remain at the end of the 7 months observation timewindow is around 75%, meaning that users renew their socialcircle slowly, in line with studies in o↵-line social networks.This value is much larger than what is expected in a randommodel where every tie has the same probability to disappear:simulations using the real tie formation/destruction sequenceof events but where ties are randomly chosen yield to pi = 50%[Cebrian: are details available in the Suppl. Matheorial]. Thisresult indicates that the way in which people add and removesocial connections from their social circle is not random; in-stead, some existing ties are more probable to be destroyedthan others.

Thus, each user social dynamics can be characterized interms of his social capacity i and his social activity n↵,i ina time window (or rate ↵i). These two quantities give in-formation about two related although not equivalent featuresof social communication. While the social capacity is a mea-sure of the number of relations that a user manages instan-

taneously in time, the social activity is instead related to thenumber of relations a user establishes and at what rate. How-ever, as shown in Fig. 4, we observe for most of the individ-uals n↵,i ' � i with � = 0.75, meaning that the numberof created connections tends to be proportional to the socialcapacity. This correlation resembles the preferential attach-

ment process by which tie formation is more probable withmore connected individuals. Note however that we find that

A

0.0

0.2

0.4

0.6

0.8

10 20 50k

mean

gg1

g2

g3

0.00

0.05

0.10

0.15

0.20

10 20 50k

mean

gg1

g2

g3

ki

pi

ci

52 105 158 211

52 105 158 211

B

C D

log

n�

,i

1

2

3

4

-1 0 1 2 3 4 53.5e-05

7.3e-05

1.5e-04

3.2e-04

6.6e-04

1.4e-03

2.9e-03

6.0e-03

1.3e-02

2.6e-02

1

2

3

4

-1 0 1 2 3 4 5

0.00003511

0.00007296

0.00015161

0.00031503

0.00065460

0.00136021

0.00282641

0.00587305

0.01220371

0.025358322.5e-2

3.5e-5

2.8e-3

3.1e-4

A

B

log n�,i

Figure 4. Heterogeneity of dynamical social strategies: A

and B shows di↵erent snapshots of the neighborhood of two di↵erent individuals (in

red) at 4 equally spaced times in the observation time window t = 52, 105, 158,and 211 days. Each black (grey) line corresponds to an open (closed) tie at that

particular instant. C Log-density plot of the social activity n↵,i as a function of the

social capacity i for each individual in our database. Solid line corresponds to the

line n↵,i = 0.75i obtained through PCA. Dashed curves are the iso-connectivity

lines ki = i + n↵,i for ki = 10, 20, 50. D shows the average value for the

persistence pi and clustering coe�cient ci for three groups of equal connectivity

(dashed lines in panel C) but for di↵erent quantiles of �i. Specifically, �i < 0.43(black), 0.43 < �i < 0.88 (gray) and �i > 0.88 (white).

Footline Author PNAS Issue Date Volume Issue Number 3

n↵,i(t) ' ↵it n!,i(t) ' !it ↵i ' !i

i(t) ' i(0)

Page 43: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• How are ties formed and destroyed? Linear tie formation/decay

3.5 4.0 4.5 5.0 5.5

0100

200

300

400

500

600

700

Year

ne

igh

bo

rd id

2

7

20

54

148

403

1096

2980

Page 44: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• How are ties formed and destroyed? Linear tie formation/decay

3.5 4.0 4.5 5.0 5.5

0100

200

300

400

500

600

700

Year

ne

igh

bo

rd id

2

7

20

54

148

403

1096

2980

1.1 persons/

0.6 persons/

Page 45: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• How are ties formed and destroyed? • Social capacity and activity are not

independent

• For a given we have

• Social explorers (A)

• Balanced (-)

• Social keepers (B)

1

2

3

4

-1 0 1 2 3 4 53.5e-05

7.3e-05

1.5e-04

3.2e-04

6.6e-04

1.4e-03

2.9e-03

6.0e-03

1.3e-02

2.6e-02

1

2

3

4

-1 0 1 2 3 4 53.5e-05

7.3e-05

1.5e-04

3.2e-04

6.6e-04

1.4e-03

2.9e-03

6.0e-03

1.3e-02

2.6e-022.5e-2

2.9e-3

3.2e-4

3.5e-5

A

B

logn↵,i

log i

0.0

0.2

0.4

0.6

0.8

10 20 50k

mean

gg1

g2

g3

pi

ki

0.00

0.05

0.10

0.15

0.20

10 20 50k

mean

gg1

g2

g3

ci

ki

52 105 158 211

52 105 158 211A

B

C

n↵,i / i

ki

n↵,i ' i

n↵,i � i

n↵,i ⌧ iMiritello, Lara, Cebrián and EMScientific Reports 3, 1950 (2013)

Page 46: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

n↵,i = 23,i = 4Social explorer

n↵,i = 3,i = 24Social keeper

• How are ties formed and destroyed?

Page 47: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

n↵,i = 23,i = 4Social explorer

n↵,i = 3,i = 24Social keeper

• How are ties formed and destroyed?

Page 48: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

n↵,i = 23,i = 4Social explorer

n↵,i = 3,i = 24Social keeper

• How are ties formed and destroyed?

Page 49: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Wrap-up • Triadic closure, reciprocity are predictors for the formation of a link • Embeddedness, reciprocity are predictors for the persistence of a

link • Tie formation/decay is bursty • Tie formation/decay strategy:

• Heterogeneous • Linear in time • Social explorers / social keepers

Page 50: Temporal dynamics of human behavior in social networks (i)

form/decayt1 t2 t3

Tie activity is bursty t

Groups of conversation

t1 t1+dt

Tie

s

Communities form/change/decay

t1 t2

Com

mun

ities

Networks form/change/decay

t1 t2

Net

wor

k

4 Community activity

Page 51: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• What structural properties influence community evolution? • Communities = dense areas of the network • Detection of communities: clique percolation algorithm http://cfinder.org

LETTERS

Quantifying social group evolutionGergely Palla1, Albert-Laszlo Barabasi2 & Tamas Vicsek1,3

The rich set of interactions between individuals in society1–7

results in complex community structure, capturing highly con-nected circles of friends, families or professional cliques in a socialnetwork3,7–10. Thanks to frequent changes in the activity and com-munication patterns of individuals, the associated social and com-munication network is subject to constant evolution7,11–16. Ourknowledge of the mechanisms governing the underlying commun-ity dynamics is limited, but is essential for a deeper understandingof the development and self-optimization of society as a whole17–22.We have developed an algorithm based on clique percolation23,24

that allows us to investigate the time dependence of overlappingcommunities on a large scale, and thus uncover basic relationshipscharacterizing community evolution. Our focus is on networkscapturing the collaboration between scientists and the calls be-tween mobile phone users. We find that large groups persist forlonger if they are capable of dynamically altering their member-ship, suggesting that an ability to change the group compositionresults in better adaptability. The behaviour of small groups dis-plays the opposite tendency—the condition for stability is thattheir composition remains unchanged. We also show that know-ledge of the time commitment of members to a given communitycan be used for estimating the community’s lifetime. These find-ings offer insight into the fundamental differences between thedynamics of small groups and large institutions.

The data sets we consider are (1) the monthly list of articles in theCornell University Library e-print condensed matter (cond-mat)archive spanning 142 months, with over 30,000 authors25, and (2)the record of phone calls between the customers of a mobile phonecompany spanning 52 weeks (accumulated over two-week-long per-iods), and containing the communication patterns of over 4 millionusers. Both types of collaboration events (a new article or a phonecall) document the presence of social interaction between theinvolved individuals (nodes), and can be represented as (time-dependent) links. The extraction of the changing link weights fromthe primary data is described in Supplementary Information. InFig. 1a, b we show the local structure at a given time step in thetwo networks in the vicinity of a randomly chosen individual(marked by a red frame). The communities (social groups repre-sented by more densely interconnected parts within a network ofsocial links) are colour coded, so that black nodes/edges do notbelong to any community, and those that simultaneously belong totwo or more communities are shown in red.

The two networks have rather different local structure: the collab-oration network of scientists emerges as a one-mode projection of thebipartite graph between authors and papers, so it is quite dense andthe overlap between communities is very significant. In contrast, in thephone-call network the communities are less interconnected and areoften separated by one or more inter-community nodes/edges. Indeed,whereas the phone record captures the communication between twopeople, the publication record assigns to all individuals that contributeto a paper a fully connected clique. As a result, the phone data are

dominated by single links, whereas the co-authorship data have manydense, highly connected neighbourhoods. Furthermore, the links inthe phone network correspond to instant communication events, cap-turing a relationship as it happens. In contrast, the co-authorship data

1Statistical and Biological Physics Research Group of the HAS, Pazmany P. stny. 1A, H-1117 Budapest, Hungary. 2Center for Complex Network Research and Departments of Physics andComputer Science, University of Notre Dame, Indiana 46566, USA. 3Department of Biological Physics, Eotvos University, Pazmany P. stny. 1A, H-1117 Budapest, Hungary.

a Co-authorship

c

e f

d

b Phone call

140.6

Zip-codeAge

Zip-codeAge

0.5

0.4

0.3

0.2

0.1

0

⟨nre

al⟩ /

⟨nra

nd⟩

⟨nre

al⟩ /

s10

6

2

12

8

4

00 20

Growth

Merging

Birth

t t + 1

t t + 1

t t + 1

Contraction

Splitting

Death

t t + 1

t

t t ∪ t + 1 t + 1

t + 1

t t + 1

40 60

s80 100 120 0 20 40 60

s80 100 120

Figure 1 | Structure and schematic dynamics of the two networksconsidered. a, The co-authorship network. The figure shows the localcommunity structure at a given time step in the vicinity of a randomly selectednode. b, As a but for the phone-call network. c, The filled black symbolscorrespond to the average size of the largest subset of members with the samezip-code, Ænrealæ, in the phone-call communities divided by the same quantityfound in random sets, Ænrandæ, as a function of the community size, s. Similarly,the open symbols show the average size of the largest subset of communitymembers with an age falling in a three-year time window, divided by the samequantity in random sets. The error bars in both cases correspond to Ænrealæ/(Ænrandæ 1 srand) and Ænrealæ/(Ænrandæ 2 srand), where srand is the standarddeviation in the case of the random sets. d, The Ænrealæ/s as a function of s, forboth the zip-code (filled black symbols) and the age (open symbols). e, Possibleevents in community evolution. f, The identification of evolving communities.The links at t (blue) and the links at t 1 1 (yellow) are merged into a joint graph(green). Any CPM community at t or t 1 1 is part of a CPM community in thejoined graph, so these can be used to match the two sets of communities.

Vol 446 | 5 April 2007 | doi:10.1038/nature05670

664Nature ©2007 Publishing Group

Palla, G., Barabasi, A.-L. & Vicsek, T., 2007. Quantifying social group evolution. Nature, 446(7), pp.664–667.

Page 52: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Dynamics of communities: • For each interval of time calculate the communities • Compare them at different time intervals

LETTERS

Quantifying social group evolutionGergely Palla1, Albert-Laszlo Barabasi2 & Tamas Vicsek1,3

The rich set of interactions between individuals in society1–7

results in complex community structure, capturing highly con-nected circles of friends, families or professional cliques in a socialnetwork3,7–10. Thanks to frequent changes in the activity and com-munication patterns of individuals, the associated social and com-munication network is subject to constant evolution7,11–16. Ourknowledge of the mechanisms governing the underlying commun-ity dynamics is limited, but is essential for a deeper understandingof the development and self-optimization of society as a whole17–22.We have developed an algorithm based on clique percolation23,24

that allows us to investigate the time dependence of overlappingcommunities on a large scale, and thus uncover basic relationshipscharacterizing community evolution. Our focus is on networkscapturing the collaboration between scientists and the calls be-tween mobile phone users. We find that large groups persist forlonger if they are capable of dynamically altering their member-ship, suggesting that an ability to change the group compositionresults in better adaptability. The behaviour of small groups dis-plays the opposite tendency—the condition for stability is thattheir composition remains unchanged. We also show that know-ledge of the time commitment of members to a given communitycan be used for estimating the community’s lifetime. These find-ings offer insight into the fundamental differences between thedynamics of small groups and large institutions.

The data sets we consider are (1) the monthly list of articles in theCornell University Library e-print condensed matter (cond-mat)archive spanning 142 months, with over 30,000 authors25, and (2)the record of phone calls between the customers of a mobile phonecompany spanning 52 weeks (accumulated over two-week-long per-iods), and containing the communication patterns of over 4 millionusers. Both types of collaboration events (a new article or a phonecall) document the presence of social interaction between theinvolved individuals (nodes), and can be represented as (time-dependent) links. The extraction of the changing link weights fromthe primary data is described in Supplementary Information. InFig. 1a, b we show the local structure at a given time step in thetwo networks in the vicinity of a randomly chosen individual(marked by a red frame). The communities (social groups repre-sented by more densely interconnected parts within a network ofsocial links) are colour coded, so that black nodes/edges do notbelong to any community, and those that simultaneously belong totwo or more communities are shown in red.

The two networks have rather different local structure: the collab-oration network of scientists emerges as a one-mode projection of thebipartite graph between authors and papers, so it is quite dense andthe overlap between communities is very significant. In contrast, in thephone-call network the communities are less interconnected and areoften separated by one or more inter-community nodes/edges. Indeed,whereas the phone record captures the communication between twopeople, the publication record assigns to all individuals that contributeto a paper a fully connected clique. As a result, the phone data are

dominated by single links, whereas the co-authorship data have manydense, highly connected neighbourhoods. Furthermore, the links inthe phone network correspond to instant communication events, cap-turing a relationship as it happens. In contrast, the co-authorship data

1Statistical and Biological Physics Research Group of the HAS, Pazmany P. stny. 1A, H-1117 Budapest, Hungary. 2Center for Complex Network Research and Departments of Physics andComputer Science, University of Notre Dame, Indiana 46566, USA. 3Department of Biological Physics, Eotvos University, Pazmany P. stny. 1A, H-1117 Budapest, Hungary.

a Co-authorship

c

e f

d

b Phone call

140.6

Zip-codeAge

Zip-codeAge

0.5

0.4

0.3

0.2

0.1

0

⟨nre

al⟩ /

⟨nra

nd⟩

⟨nre

al⟩ /

s10

6

2

12

8

4

00 20

Growth

Merging

Birth

t t + 1

t t + 1

t t + 1

Contraction

Splitting

Death

t t + 1

t

t t ∪ t + 1 t + 1

t + 1

t t + 1

40 60

s80 100 120 0 20 40 60

s80 100 120

Figure 1 | Structure and schematic dynamics of the two networksconsidered. a, The co-authorship network. The figure shows the localcommunity structure at a given time step in the vicinity of a randomly selectednode. b, As a but for the phone-call network. c, The filled black symbolscorrespond to the average size of the largest subset of members with the samezip-code, Ænrealæ, in the phone-call communities divided by the same quantityfound in random sets, Ænrandæ, as a function of the community size, s. Similarly,the open symbols show the average size of the largest subset of communitymembers with an age falling in a three-year time window, divided by the samequantity in random sets. The error bars in both cases correspond to Ænrealæ/(Ænrandæ 1 srand) and Ænrealæ/(Ænrandæ 2 srand), where srand is the standarddeviation in the case of the random sets. d, The Ænrealæ/s as a function of s, forboth the zip-code (filled black symbols) and the age (open symbols). e, Possibleevents in community evolution. f, The identification of evolving communities.The links at t (blue) and the links at t 1 1 (yellow) are merged into a joint graph(green). Any CPM community at t or t 1 1 is part of a CPM community in thejoined graph, so these can be used to match the two sets of communities.

Vol 446 | 5 April 2007 | doi:10.1038/nature05670

664Nature ©2007 Publishing Group

Page 53: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Community evolution depends on its size: • Larger communities are on average older • Membership of large communities is changing at a higher rate

record the results of a long-term collaboration process. These fun-damental differences suggest that any common features of the com-munity evolution in the two networks represent potentially genericcharacteristics of community formation, rather than being rooted inthe details of the network representation or data collection process.

The communities at each time step were extracted using the cliquepercolation method23,24 (CPM). The key features of the communitiesobtained by the CPM are that their members can be reached throughwell connected subsets of nodes, and that the communities mayoverlap (share nodes with each other). This latter property is essen-tial, as most networks are characterized by overlapping and nestedcommunities6,23. As a first step, it is important to check if the uncov-ered communities correspond to groups of individuals with a sharedcommon activity pattern. For this purpose, we compared the averageweight of the links inside communities, wc, to the average weightof the inter-community links, wic. For the co-authorship networkwc/wic is about 2.9, whereas for the phone-call network the differenceis even more significant, as wc/wic < 5.9, indicating that the intensityof collaboration/communication within a group is significantly higherthan with contacts belonging to a different group26–28. Although forco-authors the quality of the clustering can be directly tested by study-ing their publication records in more detail, in the phone-call networkpersonal information is not available. In this case the zip-code andthe age of the users provide additional information for checking thehomogeneity of the communities. According to Fig. 1c, the Ænrealæ/Ænrandæ ratio is significantly larger than 1 for both the zip-code and theage, indicating that communities have a tendency to contain peoplefrom the same generation and living in the same neighbourhood(Ænrealæ is the size of the largest subset of people having the same zip-code averaged over time steps and the set of available communities,while Ænrandæ represents the same average but with randomly selectedusers). It is of specific interest that Ænrealæ/Ænrandæ for the zip-code has aprominent peak at community size s < 35, suggesting that commu-nities of this size are geographically the most homogeneous ones.However, as Fig. 1d shows, the situation is more complex: on aver-age, the smaller communities are more homogeneous in respect ofboth the zip-code and the age, but there is still a noticeable peak ats < 30–35 for the zip-code. In summary, the phone-call communitiesuncovered by the CPM tend to contain individuals living in the same

neighbourhood, and having a comparable age, a homogeneity thatsupports the validity of the uncovered community structure. Furthersupport is given in Supplementary Information.

The basic events that may occur in the life of a community areshown in Fig. 1e: a community can grow or contract; groups maymerge or split; new communities are born while others may disappear.We have developed a method for the appropriate matching (betweenthe subsequent states of the evolving communities) from the informa-tion available for relatively distant points in time only (see Methods).

After determining the dynamically changing community struc-ture, we first consider two basic quantities characterizing a commun-ity: its size s and its age t, representing the time passed since its birth.The quantities s and t are positively correlated: larger communitiesare on average older (Fig. 2a). Next we used the auto-correlationfunction, C(t), to quantify the relative overlap between two statesof the same community A(t) at t time steps apart:

C(t):A(t0)\A(t0zt)j jA(t0)|A(t0zt)j j

ð1Þ

where A(t0)\A(t0zt)j j is the number of common nodes (members)in A(t0) and A(t0 1 t), and A(t0)|A(t0zt)j j is the number of nodesin the union of A(t0) and A(t0 1 t). Figure 2b shows the average timedependent auto-correlation function for communities born with dif-ferent sizes. The data indicate that the collaboration network is more‘dynamic’ (ÆC(t)æ decays faster). We also find that in both networks,the auto-correlation function decays faster for the larger communit-ies, showing that the membership of the larger communities is chan-ging at a higher rate. In contrast, small communities change at asmaller rate, their composition being more or less static. To quantifythis aspect of community evolution, we define the stationarity f of acommunity as the average correlation between subsequent states:

f:

Ptmax{1

t~t0

C(t ,tz1)

tmax{t0{1ð2Þ

where t0 denotes the birth of the community, and tmax is the last stepbefore the extinction of the community. Thus, 1 2 f represents theaverage ratio of members changed in one step.

a

b

c

d

2.835

18

16

14

12

10

8

6

30

25

20

15

10

5

Co-authorshipPhone call

Phone call, s = 6Phone call, s = 12Phone call, s = 18

Co-authorship, s = 6Co-authorship, s = 12Co-authorship, s = 18

2.4

2.0

1.6

1.2

0.8

1.0

0.8

0.6

0.4

0.2

0 20 40 60 80s

100 120 140

0.85 0.875 0.9 0.925 0.95 0.975 1.0

0.8 0.825 0.85 0.875 0.9 0.925 0.95 0.975 1.0

6050403020100

0 5 1510 20 25t ζ

ζ

30 35 40

⟨τ(s

)⟩ / ⟨

τ⟩

⟨τ*⟩

20

15

10

5

0

⟨τ*⟩

⟨C(t)

ss

Figure 2 | Characteristic features of community evolution. a, The age t ofcommunities with a given size (number of people) s, averaged over the set ofavailable communities and the time steps, divided by the average age of allcommunities, Ætæ, as a function of s. The increasing nature of the plotindicates that larger communities are on average older. b, The auto-correlation function C(t) of communities with different sizes averaged overthe communities and t0.The unit of time, t, is two weeks, thus, for the co-authorship network, where the data samples were taken monthly, the C(t)

values are shown for every other time step. c, The life-span t* averaged overthe communities as a function of the stationarity f and the community size sfor the co-authorship network. (The communities still living at the lastavailable time step in the data set were excluded from this investigation.) Thepeak in Æt*æ is close to f 5 1 for small sizes, whereas it is shifted towards lowerf values for large sizes. d, Similar results found in the phone-call network. Inc and d, the white line corresponds to the optimal stationarity.

NATURE | Vol 446 | 5 April 2007 LETTERS

665Nature ©2007 Publishing Group

record the results of a long-term collaboration process. These fun-damental differences suggest that any common features of the com-munity evolution in the two networks represent potentially genericcharacteristics of community formation, rather than being rooted inthe details of the network representation or data collection process.

The communities at each time step were extracted using the cliquepercolation method23,24 (CPM). The key features of the communitiesobtained by the CPM are that their members can be reached throughwell connected subsets of nodes, and that the communities mayoverlap (share nodes with each other). This latter property is essen-tial, as most networks are characterized by overlapping and nestedcommunities6,23. As a first step, it is important to check if the uncov-ered communities correspond to groups of individuals with a sharedcommon activity pattern. For this purpose, we compared the averageweight of the links inside communities, wc, to the average weightof the inter-community links, wic. For the co-authorship networkwc/wic is about 2.9, whereas for the phone-call network the differenceis even more significant, as wc/wic < 5.9, indicating that the intensityof collaboration/communication within a group is significantly higherthan with contacts belonging to a different group26–28. Although forco-authors the quality of the clustering can be directly tested by study-ing their publication records in more detail, in the phone-call networkpersonal information is not available. In this case the zip-code andthe age of the users provide additional information for checking thehomogeneity of the communities. According to Fig. 1c, the Ænrealæ/Ænrandæ ratio is significantly larger than 1 for both the zip-code and theage, indicating that communities have a tendency to contain peoplefrom the same generation and living in the same neighbourhood(Ænrealæ is the size of the largest subset of people having the same zip-code averaged over time steps and the set of available communities,while Ænrandæ represents the same average but with randomly selectedusers). It is of specific interest that Ænrealæ/Ænrandæ for the zip-code has aprominent peak at community size s < 35, suggesting that commu-nities of this size are geographically the most homogeneous ones.However, as Fig. 1d shows, the situation is more complex: on aver-age, the smaller communities are more homogeneous in respect ofboth the zip-code and the age, but there is still a noticeable peak ats < 30–35 for the zip-code. In summary, the phone-call communitiesuncovered by the CPM tend to contain individuals living in the same

neighbourhood, and having a comparable age, a homogeneity thatsupports the validity of the uncovered community structure. Furthersupport is given in Supplementary Information.

The basic events that may occur in the life of a community areshown in Fig. 1e: a community can grow or contract; groups maymerge or split; new communities are born while others may disappear.We have developed a method for the appropriate matching (betweenthe subsequent states of the evolving communities) from the informa-tion available for relatively distant points in time only (see Methods).

After determining the dynamically changing community struc-ture, we first consider two basic quantities characterizing a commun-ity: its size s and its age t, representing the time passed since its birth.The quantities s and t are positively correlated: larger communitiesare on average older (Fig. 2a). Next we used the auto-correlationfunction, C(t), to quantify the relative overlap between two statesof the same community A(t) at t time steps apart:

C(t):A(t0)\A(t0zt)j jA(t0)|A(t0zt)j j

ð1Þ

where A(t0)\A(t0zt)j j is the number of common nodes (members)in A(t0) and A(t0 1 t), and A(t0)|A(t0zt)j j is the number of nodesin the union of A(t0) and A(t0 1 t). Figure 2b shows the average timedependent auto-correlation function for communities born with dif-ferent sizes. The data indicate that the collaboration network is more‘dynamic’ (ÆC(t)æ decays faster). We also find that in both networks,the auto-correlation function decays faster for the larger communit-ies, showing that the membership of the larger communities is chan-ging at a higher rate. In contrast, small communities change at asmaller rate, their composition being more or less static. To quantifythis aspect of community evolution, we define the stationarity f of acommunity as the average correlation between subsequent states:

f:

Ptmax{1

t~t0

C(t ,tz1)

tmax{t0{1ð2Þ

where t0 denotes the birth of the community, and tmax is the last stepbefore the extinction of the community. Thus, 1 2 f represents theaverage ratio of members changed in one step.

a

b

c

d

2.835

18

16

14

12

10

8

6

30

25

20

15

10

5

Co-authorshipPhone call

Phone call, s = 6Phone call, s = 12Phone call, s = 18

Co-authorship, s = 6Co-authorship, s = 12Co-authorship, s = 18

2.4

2.0

1.6

1.2

0.8

1.0

0.8

0.6

0.4

0.2

0 20 40 60 80s

100 120 140

0.85 0.875 0.9 0.925 0.95 0.975 1.0

0.8 0.825 0.85 0.875 0.9 0.925 0.95 0.975 1.0

6050403020100

0 5 1510 20 25t ζ

ζ

30 35 40

⟨τ(s

)⟩ /

⟨τ⟩

⟨τ*⟩

20

15

10

5

0

⟨τ*⟩

⟨C(t)

ss

Figure 2 | Characteristic features of community evolution. a, The age t ofcommunities with a given size (number of people) s, averaged over the set ofavailable communities and the time steps, divided by the average age of allcommunities, Ætæ, as a function of s. The increasing nature of the plotindicates that larger communities are on average older. b, The auto-correlation function C(t) of communities with different sizes averaged overthe communities and t0.The unit of time, t, is two weeks, thus, for the co-authorship network, where the data samples were taken monthly, the C(t)

values are shown for every other time step. c, The life-span t* averaged overthe communities as a function of the stationarity f and the community size sfor the co-authorship network. (The communities still living at the lastavailable time step in the data set were excluded from this investigation.) Thepeak in Æt*æ is close to f 5 1 for small sizes, whereas it is shifted towards lowerf values for large sizes. d, Similar results found in the phone-call network. Inc and d, the white line corresponds to the optimal stationarity.

NATURE | Vol 446 | 5 April 2007 LETTERS

665Nature ©2007 Publishing Group

record the results of a long-term collaboration process. These fun-damental differences suggest that any common features of the com-munity evolution in the two networks represent potentially genericcharacteristics of community formation, rather than being rooted inthe details of the network representation or data collection process.

The communities at each time step were extracted using the cliquepercolation method23,24 (CPM). The key features of the communitiesobtained by the CPM are that their members can be reached throughwell connected subsets of nodes, and that the communities mayoverlap (share nodes with each other). This latter property is essen-tial, as most networks are characterized by overlapping and nestedcommunities6,23. As a first step, it is important to check if the uncov-ered communities correspond to groups of individuals with a sharedcommon activity pattern. For this purpose, we compared the averageweight of the links inside communities, wc, to the average weightof the inter-community links, wic. For the co-authorship networkwc/wic is about 2.9, whereas for the phone-call network the differenceis even more significant, as wc/wic < 5.9, indicating that the intensityof collaboration/communication within a group is significantly higherthan with contacts belonging to a different group26–28. Although forco-authors the quality of the clustering can be directly tested by study-ing their publication records in more detail, in the phone-call networkpersonal information is not available. In this case the zip-code andthe age of the users provide additional information for checking thehomogeneity of the communities. According to Fig. 1c, the Ænrealæ/Ænrandæ ratio is significantly larger than 1 for both the zip-code and theage, indicating that communities have a tendency to contain peoplefrom the same generation and living in the same neighbourhood(Ænrealæ is the size of the largest subset of people having the same zip-code averaged over time steps and the set of available communities,while Ænrandæ represents the same average but with randomly selectedusers). It is of specific interest that Ænrealæ/Ænrandæ for the zip-code has aprominent peak at community size s < 35, suggesting that commu-nities of this size are geographically the most homogeneous ones.However, as Fig. 1d shows, the situation is more complex: on aver-age, the smaller communities are more homogeneous in respect ofboth the zip-code and the age, but there is still a noticeable peak ats < 30–35 for the zip-code. In summary, the phone-call communitiesuncovered by the CPM tend to contain individuals living in the same

neighbourhood, and having a comparable age, a homogeneity thatsupports the validity of the uncovered community structure. Furthersupport is given in Supplementary Information.

The basic events that may occur in the life of a community areshown in Fig. 1e: a community can grow or contract; groups maymerge or split; new communities are born while others may disappear.We have developed a method for the appropriate matching (betweenthe subsequent states of the evolving communities) from the informa-tion available for relatively distant points in time only (see Methods).

After determining the dynamically changing community struc-ture, we first consider two basic quantities characterizing a commun-ity: its size s and its age t, representing the time passed since its birth.The quantities s and t are positively correlated: larger communitiesare on average older (Fig. 2a). Next we used the auto-correlationfunction, C(t), to quantify the relative overlap between two statesof the same community A(t) at t time steps apart:

C(t):A(t0)\A(t0zt)j jA(t0)|A(t0zt)j j

ð1Þ

where A(t0)\A(t0zt)j j is the number of common nodes (members)in A(t0) and A(t0 1 t), and A(t0)|A(t0zt)j j is the number of nodesin the union of A(t0) and A(t0 1 t). Figure 2b shows the average timedependent auto-correlation function for communities born with dif-ferent sizes. The data indicate that the collaboration network is more‘dynamic’ (ÆC(t)æ decays faster). We also find that in both networks,the auto-correlation function decays faster for the larger communit-ies, showing that the membership of the larger communities is chan-ging at a higher rate. In contrast, small communities change at asmaller rate, their composition being more or less static. To quantifythis aspect of community evolution, we define the stationarity f of acommunity as the average correlation between subsequent states:

f:

Ptmax{1

t~t0

C(t ,tz1)

tmax{t0{1ð2Þ

where t0 denotes the birth of the community, and tmax is the last stepbefore the extinction of the community. Thus, 1 2 f represents theaverage ratio of members changed in one step.

a

b

c

d

2.835

18

16

14

12

10

8

6

30

25

20

15

10

5

Co-authorshipPhone call

Phone call, s = 6Phone call, s = 12Phone call, s = 18

Co-authorship, s = 6Co-authorship, s = 12Co-authorship, s = 18

2.4

2.0

1.6

1.2

0.8

1.0

0.8

0.6

0.4

0.2

0 20 40 60 80s

100 120 140

0.85 0.875 0.9 0.925 0.95 0.975 1.0

0.8 0.825 0.85 0.875 0.9 0.925 0.95 0.975 1.0

6050403020100

0 5 1510 20 25t ζ

ζ

30 35 40

⟨τ(s

)⟩ /

⟨τ⟩

⟨τ*⟩

20

15

10

5

0

⟨τ*⟩

⟨C(t)

ss

Figure 2 | Characteristic features of community evolution. a, The age t ofcommunities with a given size (number of people) s, averaged over the set ofavailable communities and the time steps, divided by the average age of allcommunities, Ætæ, as a function of s. The increasing nature of the plotindicates that larger communities are on average older. b, The auto-correlation function C(t) of communities with different sizes averaged overthe communities and t0.The unit of time, t, is two weeks, thus, for the co-authorship network, where the data samples were taken monthly, the C(t)

values are shown for every other time step. c, The life-span t* averaged overthe communities as a function of the stationarity f and the community size sfor the co-authorship network. (The communities still living at the lastavailable time step in the data set were excluded from this investigation.) Thepeak in Æt*æ is close to f 5 1 for small sizes, whereas it is shifted towards lowerf values for large sizes. d, Similar results found in the phone-call network. Inc and d, the white line corresponds to the optimal stationarity.

NATURE | Vol 446 | 5 April 2007 LETTERS

665Nature ©2007 Publishing Group

Tim

e sin

ce b

irth

of

the

com

mun

ity

Community sizeUnit time = 2 weeks

Page 54: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Community evolution depends on its size: • Large communities are on average older • Membership of large communities is changing at a higher rate • Thus: large communities survive by continually changing membership

We observe an interesting effect when investigating the relation-ship between the lifetime t* (the number of steps between the birthand the disintegration of a community), the stationarity and thecommunity size. The lifetime can be viewed as a simple measure of‘fitness’: communities having higher fitness have an extended life,whereas the ones with small fitness quickly disintegrate. In Fig. 2c,d we show the average life-span Æt*æ (colour coded) as a function ofthe stationarity f and the community size s (both s and f werebinned). In both networks, for small community sizes the highestaverage life-span is at a stationarity value very close to one, indicating

that for small communities it is optimal to have static, time-independent membership. On the other hand, the peak in Æt*æ isshifted towards low f values for large communities, suggesting thatfor these the optimal regime is to be dynamic, that is, to have acontinually changing membership.

To illustrate the difference in the optimal behaviour (a pattern ofmembership dynamics leading to extended lifetime) of small andlarge communities, in Fig. 3 we show the time evolution of fourcommunities from the co-authorship network. As Fig. 3 indicates,a typical small and stationary community undergoes minor changes,but lives for a long time. This is well illustrated by the snapshots ofthe community structure, showing that the community’s stability isconferred by a core of three individuals representing a collaborativegroup spanning over 52 months. In contrast, a small community withhigh turnover of its members has a lifetime of nine time steps only(Fig. 3b). The opposite is seen for large communities: a large sta-tionary community disintegrates after four time steps (Fig. 3c). Incontrast, a large non-stationary community whose members changedynamically, resulting in significant fluctuations in both size andcomposition, has a quite extended lifetime (Fig. 3d).

The different stability rules followed by the small and largecommunities raise an important question: could the inspection of

a

b

c

d

e

50

small, stationary

large, stationary

large, non-stationary

small, non-stationary

0 10 20 30

τ = 0–2

τ = 1

τ = 9 τ = 10

τ = 2 τ = 3 τ = 4 τ = 5 τ = 6 τ = 7 τ = 8

τ = 3 τ = 4–34 τ = 35 τ = 36–52

τ40 50

0 10 20 30τ

40 50

0 10 20 30τ

40 50

0 10 20 30τ

40 50

0

s

50

0

s

50

New Leaving innext step

Old200

150

100

50

0

0

ss

Figure 3 | Evolution of four types of communities in the co-authorshipnetwork. The height of the columns corresponds to the actual communitysize, s, and within one column the yellow colour indicates the number of ‘old’nodes (that have been present in the community at least in the previous timestep as well), while newcomers are shown with green. The membersabandoning the community in the next time step are shown with orange orpurple colour, depending on whether they are old or new. (This latter type ofmember joins the community for only one time step.) We show a small andstationary community (a), a small and non-stationary community (b), alarge and stationary community (c) and, finally, a large and non-stationarycommunity (d). A mainly growing stage (two time steps) in the evolution ofthe last community is detailed in e.

⟨τn⟩

⟨τ*⟩

wout/(win + wout)

wout/(win + wout)

p lp d

a

b

Co-authorshipPhone call

0 0.2 0.4 0.6 0.8 1.0

0 0.2 0.4 0.6 0.8 1.0

8

12

16

4

0

Phone callCo-authorship

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

Wout/(Win + Wout)0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

0.08

0.07

0.06

0.05

0.04

0.03

0.02

0.01

0

0.20

0.15

0.05

0

0.10Wout/(Win + Wout)

0

10

20

30

Figure 4 | Effects of links between communities. a, The probability pl of amember abandoning the community in the next step as a function of theratio of the member’s aggregated link weights to other parts of the network(wout) and the member’s total aggregated link weight (win 1 wout). The insetshows the average time spent in the community by the nodes, Ætnæ, as afunction of wout/(win 1 wout). b, The probability pd of a communitydisintegrating in the next step as a function of the ratio of the aggregatedweights of links from the community to other parts of the network (Wout)and the aggregated weights of all links starting from the community(Win 1 Wout). The inset shows the average life-time Æt*æ of communities as afunction of Wout/(Win 1 Wout).

LETTERS NATURE | Vol 446 | 5 April 2007

666Nature ©2007 Publishing Group

We observe an interesting effect when investigating the relation-ship between the lifetime t* (the number of steps between the birthand the disintegration of a community), the stationarity and thecommunity size. The lifetime can be viewed as a simple measure of‘fitness’: communities having higher fitness have an extended life,whereas the ones with small fitness quickly disintegrate. In Fig. 2c,d we show the average life-span Æt*æ (colour coded) as a function ofthe stationarity f and the community size s (both s and f werebinned). In both networks, for small community sizes the highestaverage life-span is at a stationarity value very close to one, indicating

that for small communities it is optimal to have static, time-independent membership. On the other hand, the peak in Æt*æ isshifted towards low f values for large communities, suggesting thatfor these the optimal regime is to be dynamic, that is, to have acontinually changing membership.

To illustrate the difference in the optimal behaviour (a pattern ofmembership dynamics leading to extended lifetime) of small andlarge communities, in Fig. 3 we show the time evolution of fourcommunities from the co-authorship network. As Fig. 3 indicates,a typical small and stationary community undergoes minor changes,but lives for a long time. This is well illustrated by the snapshots ofthe community structure, showing that the community’s stability isconferred by a core of three individuals representing a collaborativegroup spanning over 52 months. In contrast, a small community withhigh turnover of its members has a lifetime of nine time steps only(Fig. 3b). The opposite is seen for large communities: a large sta-tionary community disintegrates after four time steps (Fig. 3c). Incontrast, a large non-stationary community whose members changedynamically, resulting in significant fluctuations in both size andcomposition, has a quite extended lifetime (Fig. 3d).

The different stability rules followed by the small and largecommunities raise an important question: could the inspection of

a

b

c

d

e

50

small, stationary

large, stationary

large, non-stationary

small, non-stationary

0 10 20 30

τ = 0–2

τ = 1

τ = 9 τ = 10

τ = 2 τ = 3 τ = 4 τ = 5 τ = 6 τ = 7 τ = 8

τ = 3 τ = 4–34 τ = 35 τ = 36–52

τ40 50

0 10 20 30τ

40 50

0 10 20 30τ

40 50

0 10 20 30τ

40 50

0

s

50

0

s

50

New Leaving innext step

Old200

150

100

50

0

0

ss

Figure 3 | Evolution of four types of communities in the co-authorshipnetwork. The height of the columns corresponds to the actual communitysize, s, and within one column the yellow colour indicates the number of ‘old’nodes (that have been present in the community at least in the previous timestep as well), while newcomers are shown with green. The membersabandoning the community in the next time step are shown with orange orpurple colour, depending on whether they are old or new. (This latter type ofmember joins the community for only one time step.) We show a small andstationary community (a), a small and non-stationary community (b), alarge and stationary community (c) and, finally, a large and non-stationarycommunity (d). A mainly growing stage (two time steps) in the evolution ofthe last community is detailed in e.

⟨τn⟩

⟨τ*⟩

wout/(win + wout)

wout/(win + wout)

p lp d

a

b

Co-authorshipPhone call

0 0.2 0.4 0.6 0.8 1.0

0 0.2 0.4 0.6 0.8 1.0

8

12

16

4

0

Phone callCo-authorship

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

Wout/(Win + Wout)0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

0.08

0.07

0.06

0.05

0.04

0.03

0.02

0.01

0

0.20

0.15

0.05

0

0.10Wout/(Win + Wout)

0

10

20

30

Figure 4 | Effects of links between communities. a, The probability pl of amember abandoning the community in the next step as a function of theratio of the member’s aggregated link weights to other parts of the network(wout) and the member’s total aggregated link weight (win 1 wout). The insetshows the average time spent in the community by the nodes, Ætnæ, as afunction of wout/(win 1 wout). b, The probability pd of a communitydisintegrating in the next step as a function of the ratio of the aggregatedweights of links from the community to other parts of the network (Wout)and the aggregated weights of all links starting from the community(Win 1 Wout). The inset shows the average life-time Æt*æ of communities as afunction of Wout/(Win 1 Wout).

LETTERS NATURE | Vol 446 | 5 April 2007

666Nature ©2007 Publishing Group

Page 55: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Is there any structural property that predicts community evolution? • YES: how members are “committed” with the community • Commitment = F(strong links, embeddedness, etc.) = I/O ratio

We observe an interesting effect when investigating the relation-ship between the lifetime t* (the number of steps between the birthand the disintegration of a community), the stationarity and thecommunity size. The lifetime can be viewed as a simple measure of‘fitness’: communities having higher fitness have an extended life,whereas the ones with small fitness quickly disintegrate. In Fig. 2c,d we show the average life-span Æt*æ (colour coded) as a function ofthe stationarity f and the community size s (both s and f werebinned). In both networks, for small community sizes the highestaverage life-span is at a stationarity value very close to one, indicating

that for small communities it is optimal to have static, time-independent membership. On the other hand, the peak in Æt*æ isshifted towards low f values for large communities, suggesting thatfor these the optimal regime is to be dynamic, that is, to have acontinually changing membership.

To illustrate the difference in the optimal behaviour (a pattern ofmembership dynamics leading to extended lifetime) of small andlarge communities, in Fig. 3 we show the time evolution of fourcommunities from the co-authorship network. As Fig. 3 indicates,a typical small and stationary community undergoes minor changes,but lives for a long time. This is well illustrated by the snapshots ofthe community structure, showing that the community’s stability isconferred by a core of three individuals representing a collaborativegroup spanning over 52 months. In contrast, a small community withhigh turnover of its members has a lifetime of nine time steps only(Fig. 3b). The opposite is seen for large communities: a large sta-tionary community disintegrates after four time steps (Fig. 3c). Incontrast, a large non-stationary community whose members changedynamically, resulting in significant fluctuations in both size andcomposition, has a quite extended lifetime (Fig. 3d).

The different stability rules followed by the small and largecommunities raise an important question: could the inspection of

a

b

c

d

e

50

small, stationary

large, stationary

large, non-stationary

small, non-stationary

0 10 20 30

τ = 0–2

τ = 1

τ = 9 τ = 10

τ = 2 τ = 3 τ = 4 τ = 5 τ = 6 τ = 7 τ = 8

τ = 3 τ = 4–34 τ = 35 τ = 36–52

τ40 50

0 10 20 30τ

40 50

0 10 20 30τ

40 50

0 10 20 30τ

40 50

0

s

50

0

s

50

New Leaving innext step

Old200

150

100

50

0

0

ss

Figure 3 | Evolution of four types of communities in the co-authorshipnetwork. The height of the columns corresponds to the actual communitysize, s, and within one column the yellow colour indicates the number of ‘old’nodes (that have been present in the community at least in the previous timestep as well), while newcomers are shown with green. The membersabandoning the community in the next time step are shown with orange orpurple colour, depending on whether they are old or new. (This latter type ofmember joins the community for only one time step.) We show a small andstationary community (a), a small and non-stationary community (b), alarge and stationary community (c) and, finally, a large and non-stationarycommunity (d). A mainly growing stage (two time steps) in the evolution ofthe last community is detailed in e.

⟨τn⟩

⟨τ*⟩

wout/(win + wout)

wout/(win + wout)

p lp d

a

b

Co-authorshipPhone call

0 0.2 0.4 0.6 0.8 1.0

0 0.2 0.4 0.6 0.8 1.0

8

12

16

4

0

Phone callCo-authorship

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

Wout/(Win + Wout)0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

0.08

0.07

0.06

0.05

0.04

0.03

0.02

0.01

0

0.20

0.15

0.05

0

0.10Wout/(Win + Wout)

0

10

20

30

Figure 4 | Effects of links between communities. a, The probability pl of amember abandoning the community in the next step as a function of theratio of the member’s aggregated link weights to other parts of the network(wout) and the member’s total aggregated link weight (win 1 wout). The insetshows the average time spent in the community by the nodes, Ætnæ, as afunction of wout/(win 1 wout). b, The probability pd of a communitydisintegrating in the next step as a function of the ratio of the aggregatedweights of links from the community to other parts of the network (Wout)and the aggregated weights of all links starting from the community(Win 1 Wout). The inset shows the average life-time Æt*æ of communities as afunction of Wout/(Win 1 Wout).

LETTERS NATURE | Vol 446 | 5 April 2007

666Nature ©2007 Publishing Group

Wout

Win

+Wout

Page 56: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• What makes a node to join a community? • Explicit definition of communities

• LiveJournal

• Free on-line community with ~ 10M members • 300,000 update the content in 24-hour period • Maintaining journals, individual and group blogs • Declaring who are their friends and to which communities they belong

• DBLP

• On-line database of computer science publications (about 400,000 papers) • Friendship network – co-authors in the paper • Conference - community

Backstrom, L. et al., 2006. Group formation in large social networks: membership, growth, and evolution.

Page 57: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Is membership a behavior spreading on the network? (contagion/influence) • Yes: probability to join a community depends on the number of friends previously

in the community changes in membership consistently precede or lag changes in in-terest? While such questions are extremely natural at a qualitativelevel, it is highly challenging to turn them into precise quantita-tive ones, even on data as detailed as we have here. We approachthis through a novel methodology based on burst analysis [22]; weidentify bursts both in term usage within a group and in its mem-bership. We find that these are aligned in time to a statisticallysignificant extent; furthermore, for CS conference data in DBLP,we present evidence that topics of interest tend to cross betweenconferences earlier than people do.

Related Work. As discussed above, there is a large body of workon identifying tightly-connected clusters within a given graph (seee.g. [14, 15, 16, 20, 28]). While such clusters are often referred toas “communities”, it is important to note that this is a very differenttype of problem from what we consider here — while this cluster-ing work seeks to infer potential communities in a network basedon density of linkage, we start with a network in which the com-munities of interest have already been explicitly identified and seekto model the mechanisms by which these communities grow andchange. Dill et al. [11] study implicitly-defined “communities” ofa different sort: For a variety of features (e.g. a particular keyword,a name of a locality, or a ZIP code), they consider the subgraphof the Web consisting of all pages containing this feature. Suchcommunities of Web pages are still quite different from explicitly-identified groups where participants deliberately join, as we studyhere; moreover, the questions considered in [11] are quite differentfrom our focus here.

The use of on-line social networking sites for data mining ap-plications has been the subject of a number of recent papers; see[1, 26] for two recent examples. These recent papers have focusedon different questions, and have not directly exploited the structureof the user-defined communities embedded in these systems. Stud-ies of the relationship between different newsgroups on Usenet [4,35] has taken advantage of the self-identified nature of these on-line communities, although again the specific questions are quitedifferent.

As noted earlier, the questions we consider are closely relatedto the diffusion of innovations, a broad area of study in the socialsciences [31, 33, 34]; the particular property that is “diffusing” inour work is membership in a given group. The question of how asocial network evolves as its members’ attributes change has beenthe subject of recent models by Sarkar and Moore [32] and Holmeand Newman [19]; a large-scale empirical analysis of social net-work evolution in a university setting was recently performed byKossinets and Watts [23]; and rich models for the evolution of top-ics over time have recently been proposed by Wang and McCallum[36]. Mathematical models for group evolution and change havebeen proposed in a number of social science contexts; for an ap-proach to this issue in terms of diffusion models, we refer the readerto the book by Boorman and Levitt [3].

2. COMMUNITY MEMBERSHIPBefore turning to our studies of the processes by which individ-

uals join communities in a social network, we provide some detailson the two sources of data, LiveJournal and DBLP. LiveJournal(LJ) is a free on-line community with almost 10 million members;a significant fraction of these members are highly active. (For ex-ample, roughly 300, 000 update their content in any given 24-hourperiod.) LiveJournal allows members to maintain journals, individ-ual and group blogs, and — most importantly for our study here —it allows people to declare which other members are their friendsand to which communities they belong. By joining a community,

0

0.005

0.01

0.015

0.02

0.025

0 5 10 15 20 25 30 35 40 45 50

prob

abili

ty

k

Probability of joining a community when k friends are already members

Figure 1: The probability p of joining a LiveJournal commu-nity as a function of the number of friends k already in thecommunity. Error bars represent two standard errors.

0

0.02

0.04

0.06

0.08

0.1

0 2 4 6 8 10 12 14 16 18

prob

abili

ty

k

Probability of joining a conference when k coauthors are already ’members’ of that conference

Figure 2: The probability p of joining a DBLP community as afunction of the number of friends k already in the community.Error bars represent two standard errors.

one typically gains the right to create new posts in that communityand other people’s posts become more accessible.

DBLP, our second dataset, is an on-line database of computerscience publications, providing the title, author list, and conferenceof publication for over 400,000 papers. A great deal of work hasgone into disambiguation of similar names, so co-authorship rela-tionships are relatively free of name resolution problems. For ourpurposes, we view DBLP as parallel to the friends-and-communitiesstructure of LiveJournal, with a “friendship” network defined bylinking people together who have co-authored a paper, and withconferences serving as communities. We say that a person hasjoined a community (conference) when he or she first publishesa paper there; and, for this section, we consider the person to be-long to the community from this point onward. (See Section 4 foran analysis of changes in community membership that include no-tions of both joining and leaving.) For simplicity of terminology,we refer to two people in either of LJ or DBLP as “friends” whenthey are neighbors in the respective networks.

A fundamental question about the evolution of communities isto to determine who will join in the future. As discussed above, ifwe view membership in a community as a “behavior” that spreadsthrough the network, then we can gain insight into this questionfrom the study of the diffusion of innovation [31, 33, 34].

46

Research Track Paper

Diminishing returns

Page 58: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Is membership a behavior diffusing on the network?

• Beyond the dyadic approximation: • Probability to join depends on the clustering of

friends in the community

t

t+1

Figure 3: The top two levels of decision tree splits for predict-ing single individuals joining communities in LiveJournal. Theoverall rate of joining is 8.48e-4.

0.002

0.003

0.004

0.005

0.006

0.007

0.008

0.009

0 0.2 0.4 0.6 0.8 1

Pro

babi

lity

Proportion of Pairs Adjacent

Probability of joining a community versus adjacent pairs of friends in the community

3 friends4 friends5 friends

Figure 4: The probability of joining a LiveJournal communityas a function of the internal connectedness of friends already inthe community. Error bars represent two standard errors.

Internal Connectedness of Friends. The top-level splits in the LJand DBLP decision trees were quite stable over multiple samples;in Figure 3 we show the top two levels of splits in a representativedecision tree for LJ. We now discuss a class of features that provedparticularly informative for the LJ dataset: the internal connected-ness of an individual’s friends.

The general issue underlying this class of feature is the follow-ing: given someone with k friends in a community, are they morelikely to join the community if many of their friends are directlylinked to one another, or if very few of their friends are linked to oneanother? This distinction turns out to result in a significant effect onthe probability of joining. To make this precise, we use the follow-ing notation. For an individual u in the fringe of a community, witha set S of friends in the community, let e(S) denote the number ofedges with both ends in S. (This is the number of pairs in S whoare themselves friends with each other.) Let ϕ(S) = e(S)/

`|S|2

´

denote the fraction of pairs in S connected by an edge.We find that individuals whose friends in a community are linked

to one another — i.e., those for which e(S) and ϕ(S) are larger —are significantly more likely to join the community. In particular,the top-level decision tree split for the LJ dataset is based on ϕ(S),and in the right branch (when ϕ(S) exceeds a lower bound), thenext split is based on e(S). We can see the effect clearly by fixinga number of friends k, and then plotting the joining probability asa function of ϕ(S), over the sub-population of instances where theindividual has k friends in the community. Figure 4 shows thisrelationship for the sub-populations with k = 3, 4, and 5; in each

case, we see that the joining probability increases as the density oflinkage increases among the individual’s friends in the community.

It is interesting to consider such a finding from a theoretical per-spective — why should the fact that your friends in a communityknow each other make you more likely to join? There are socio-logical principles that could potentially support either side of thisdichotomy.4 On the one hand, arguments based on weak ties [17](and see also the notion of structural holes in [6]) support the no-tion that there is an informational advantage to having friends in acommunity who do not know each other — this provides multiple“independent” ways of potentially deciding to join. On the otherhand, arguments based on social capital (e.g. [8, 9]) suggest thatthere is a trust advantage to having friends in a community whoknow each other — this indicates that the individual will be sup-ported by a richer local social structure if he or she joins. Thus, onepossible conclusion from the trends in Figure 4 is that trust advan-tages provide a stronger effect than informational advantages in thecase of LiveJournal community membership.

The fact that edges among one’s friends make community mem-bership more likely is also consistent with observations made inrecent work of Centola, Macy, and Eguiluz [7]. They contend thatinstances of successful social diffusion “typically unfold in highlyclustered networks” [7]. In the case of LJ and DBLP communi-ties, for example, Macy observes that links among one’s friendsmay contribute to a “coordination effect,” in which one receives astronger net endorsement of a community if it is a shared focus ofinterest among a group of interconnected friends [27].

Relation to Mathematical Models of Diffusion. There are a num-ber of theoretical models for the diffusion of a new behavior in asocial network, based on simple mechanisms in which the behav-ior spreads contagiously across edges; see for example [12, 21, 34]for references. Many of these models operate in regimented timesteps: at each step, the nodes that have already adopted the behaviormay have a given probability of “infecting” their neighbors; or eachnode may have a given threshold d, and it will adopt the behavioronce d of its neighbors have adopted it.

Now, it is an interesting question to consider how these modelsare related to the empirical data in Figures 1 and 2. The theoreticalmodels posit very simple dynamics by which influence is transmit-ted: in each time step, each node assesses the states of its neighborsin some fashion, and then takes action based on this information.The spread of a real behavior, of course, is more complicated, andour measurements of LJ and DBLP illustrate this: we observe thebehavior of an individual node u’s friends in one snapshot, andthen u’s own behavior in the next, but we do not know (i) when orwhether u became aware of these friends’ behavior, (ii) how longit took for this awareness to translate into a decision by u to act,and (iii) how long it took u to actually act after making this deci-sion. (Imagine, for example, a scenario in which u decides to join acommunity after seeing two friends join, but by the time u actuallyjoins, three more of her friends have joined as well.) Moreover,for any given individual in the LJ and DBLP data, we do not knowhow far along processes (i), (ii), and (iii) are at the time of the firstsnapshot — that is, we do not know how much of the informationcontained in the first snapshot was already known to the individ-ual, how much they observed in the interval between the first andsecond snapshots, and how much they never observed.

These considerations help clarify what the curves in Figures 1and 2 are telling us. The concrete property they capture is the mea-

4We thank David Strang for helping to bring the arguments on eachside into focus.

49

Research Track Paper

Page 59: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Wrap-up • Communities evolution

• Larger communities last longer • Larger communities change at higher rate • Communities with larger I/O ratio are more stable

• Node dynamics

• Probability to join a community depends on the number of friends previously in the community

• Probability to join depends on the clustering of friends in the community

Page 60: Temporal dynamics of human behavior in social networks (i)

form/decayt1 t2 t3

Tie activity is bursty t

Groups of conversation

t1 t1+dt

Tie

s

Communities form/change/decay

t1 t2

Com

mun

ities

Networks form/change/decay

t1 t2

Net

wor

k

4 Network activity

Page 61: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• How does a network growth? • Nodes arrival is not linear. Neither edges

degree d power power law log stretchedlaw exp. cutoff normal exp.

1 9.84 12.50 11.65 12.102 11.55 13.85 13.02 13.403 10.53 13.00 12.15 12.594 9.82 12.40 11.55 12.055 8.87 11.62 10.77 11.28

avg., d ≤ 20 8.27 11.12 10.23 10.76

Table 3: Edge gap distribution: percent improvement of thelog-likelihood at MLE over the exponential distribution.

10-5

10-4

10-3

10-2

10-1

100

100 101 102 103

Gap

pro

babi

lity,

p(δ

(1))

Gap, δ(1)

10-5

10-4

10-3

10-2

10-1

100 101 102 103

Gap

pro

babi

lity,

p(δ

(1))

Gap, δ(1)

(a) FLICKR (b) DELICIOUS

10-5

10-4

10-3

10-2

10-1

100

100 101 102 103

Gap

pro

babi

lity,

p(δ

(1))

Gap, δ(1)

10-6

10-5

10-4

10-3

10-2

10-1

100

100 101 102 103

Gap

pro

babi

lity,

p(δ

(1))

Gap, δ(1)

(c) ANSWERS (d) LINKEDIN

Figure 8: Edge gap distribution for a node to obtain the secondedge, δ(1), and MLE power law with exponential cutoff fits.

6.1.2 Time gap between the edgesNow that we have a model for the lifetime of a node u, we must

model that amount of elapsed time between edge initiations fromu. Let δu(d) = td+1(u)− td(u) be the time it takes for the node uwith current degree d to create its (d+1)-st out-edge; we call δu(d)the edge gap. Again, we examine several candidate distributions tomodel edge gaps. Table 3 shows the percent improvement of thelog-likelihood at the MLE over the exponential distribution. Thebest likelihood is provided by a power law with exponential cutoff:pg(δ(d);α, β) ∝ δ(d)−α exp(−βδ(d)), where d is the current de-gree of the node. (Note that the distribution is neither exponentialnor Poisson, as one might be tempted to assume.) We confirm theseresults in Figure 8, in which we plot the MLE estimates to gap dis-tribution δ(1), i.e., distribution of times that it took a node of degree1 to add the second edge. In fact, we find that all gaps distributionsδ(d) are best modeled by a power law with exponential cut-off (Ta-ble 3 gives improvements in log-likelihoods for d = 1, . . . , 5 andthe average for d = 1, . . . , 20.)For each δ(d) we fit a separate distribution and Figure 9 shows

the evolution of the parameters α and β of the gap distribution, asa function of the degree d of the node. Interestingly, the powerlaw exponent α(d) remains constant as a function of d, at almostthe same value for all four networks. On the other hand, the expo-nential cutoff parameter β(d) increases linearly with d, and variesby an order of magnitude across networks; this variation modelsthe extent to which the “rich get richer” phenomenon manifestsin each network. This means that the slope α of power-law partremains constant, only the exponential cutoff part (parameter β)starts to kick in sooner and sooner. So, nodes add their (d + 1)st

edge faster than their dth edge, i.e., nodes start to create more andmore edges (sleeping times get shorter) as they get older (and havehigher degree). So, based on Figure 9, the overall gap distributioncan be modeled by pg(δ|d; α, β) ∝ δ−α exp(−βdδ).

0.5

0.6

0.7

0.8

0.9

1

0 20 40 60 80 100

Gap

par

amet

er α

Current degree, d

0

0.1

0.2

0.3

0.4

0.5

0 20 40 60 80 100

Gap

par

amet

er β

Current degree, d

FlickrLinkedInAnswersDelicious

0

0.1

0.2

0.3

0.4

0.5

0 20 40 60 80 100

Gap

par

amet

er β

Current degree, d

Figure 9: Evolution of the α and β parameters with the currentnode degree d. α remains constant, and β linearly increases.

101

102

103

104

105

106

0 5 10 15 20 25

Nod

es

Time (months)

N(t) ∝ e0.25 t

4.0e46.0e48.0e41.0e51.2e51.4e51.6e51.8e52.0e52.2e5

0 5 10 15 20 25 30 35 40 45

Nod

es

Time (months)

N(t) = 16t2 + 3e3 t + 4e4

(a) FLICKR (b) DELICIOUS

0.0e0

1.0e5

2.0e5

3.0e5

4.0e5

5.0e5

6.0e5

0 2 4 6 8 10 12 14 16 18

Nod

es

Time (weeks)

N(t) = -284 t2 + 4e4 t - 2.5e3

0e0

1e6

2e6

3e6

4e6

5e6

6e6

7e6

0 5 10 15 20 25 30 35 40 45

Nod

es

Time (months)

N(t) = 3900 t2 + 7600 t - 1.3e5

(c) ANSWERS (d) LINKEDIN

Figure 10: Number of nodes over time.

Given the above observation, a natural hypothesis would be thatnodes that will attain high degree in the network are in some waya priori special, i.e., they correspond to “more social” people whowould inherently tend to have shorter gap times and enthusiasti-cally invite friends at a higher rate than others, attaining high de-gree quickly due to their increased activity level. However, thisphenomenon does not occur in any of the networks. We com-puted the correlation coefficient between δ(1) and the final degreed(u). The correlation values are −0.069 for DELICIOUS,−0.043for FLICKR, −0.036 for ANSWERS, and −0.027 for LINKEDIN.Thus, there is almost no correlation, which shows that the gap dis-tribution is independent of a node’s final degree. It only depends onnode lifetime, i.e., high degree nodes are not a priori special, theyjust live longer, and accumulate many edges.

6.2 Node arrivalsFinally, we turn to the question of modeling node arrivals into the

system. Figure 10 shows the number of users in each of our net-works over time, and Table 4 captures the best fits. FLICKR growsexponentially over much of our network, while the growth of othernetworks is much slower. DELICIOUS grows slightly superlinearly,LINKEDIN quadratically, and ANSWERS sublinearly. Given thesewild variations we conclude the node arrival process needs to bespecified in advance as it varies greatly across networks due to ex-ternal factors.

7. A NETWORK EVOLUTIONMODELWe first take stock of what we have measured and observed so

far. In Section 6.2, we analyzed the node arrival rates and showedthat they are network-dependent and can be succinctly representedby a node arrival function N(t) that is either a polynomial or anexponential. In Section 6.1, we analyzed the node lifetimes andshowed they are exponentially distributed with parameter λ. InSection 4.1, we argued that the destination of the first edge of a

degree d power power law log stretchedlaw exp. cutoff normal exp.

1 9.84 12.50 11.65 12.102 11.55 13.85 13.02 13.403 10.53 13.00 12.15 12.594 9.82 12.40 11.55 12.055 8.87 11.62 10.77 11.28

avg., d ≤ 20 8.27 11.12 10.23 10.76

Table 3: Edge gap distribution: percent improvement of thelog-likelihood at MLE over the exponential distribution.

10-5

10-4

10-3

10-2

10-1

100

100 101 102 103

Gap

pro

babi

lity,

p(δ

(1))

Gap, δ(1)

10-5

10-4

10-3

10-2

10-1

100 101 102 103

Gap

pro

babi

lity,

p(δ

(1))

Gap, δ(1)

(a) FLICKR (b) DELICIOUS

10-5

10-4

10-3

10-2

10-1

100

100 101 102 103

Gap

pro

babi

lity,

p(δ

(1))

Gap, δ(1)

10-6

10-5

10-4

10-3

10-2

10-1

100

100 101 102 103

Gap

pro

babi

lity,

p(δ

(1))

Gap, δ(1)

(c) ANSWERS (d) LINKEDIN

Figure 8: Edge gap distribution for a node to obtain the secondedge, δ(1), and MLE power law with exponential cutoff fits.

6.1.2 Time gap between the edgesNow that we have a model for the lifetime of a node u, we must

model that amount of elapsed time between edge initiations fromu. Let δu(d) = td+1(u)− td(u) be the time it takes for the node uwith current degree d to create its (d+1)-st out-edge; we call δu(d)the edge gap. Again, we examine several candidate distributions tomodel edge gaps. Table 3 shows the percent improvement of thelog-likelihood at the MLE over the exponential distribution. Thebest likelihood is provided by a power law with exponential cutoff:pg(δ(d);α, β) ∝ δ(d)−α exp(−βδ(d)), where d is the current de-gree of the node. (Note that the distribution is neither exponentialnor Poisson, as one might be tempted to assume.) We confirm theseresults in Figure 8, in which we plot the MLE estimates to gap dis-tribution δ(1), i.e., distribution of times that it took a node of degree1 to add the second edge. In fact, we find that all gaps distributionsδ(d) are best modeled by a power law with exponential cut-off (Ta-ble 3 gives improvements in log-likelihoods for d = 1, . . . , 5 andthe average for d = 1, . . . , 20.)For each δ(d) we fit a separate distribution and Figure 9 shows

the evolution of the parameters α and β of the gap distribution, asa function of the degree d of the node. Interestingly, the powerlaw exponent α(d) remains constant as a function of d, at almostthe same value for all four networks. On the other hand, the expo-nential cutoff parameter β(d) increases linearly with d, and variesby an order of magnitude across networks; this variation modelsthe extent to which the “rich get richer” phenomenon manifestsin each network. This means that the slope α of power-law partremains constant, only the exponential cutoff part (parameter β)starts to kick in sooner and sooner. So, nodes add their (d + 1)st

edge faster than their dth edge, i.e., nodes start to create more andmore edges (sleeping times get shorter) as they get older (and havehigher degree). So, based on Figure 9, the overall gap distributioncan be modeled by pg(δ|d; α, β) ∝ δ−α exp(−βdδ).

0.5

0.6

0.7

0.8

0.9

1

0 20 40 60 80 100

Gap

par

amet

er α

Current degree, d

0

0.1

0.2

0.3

0.4

0.5

0 20 40 60 80 100

Gap

par

amet

er β

Current degree, d

FlickrLinkedInAnswersDelicious

0

0.1

0.2

0.3

0.4

0.5

0 20 40 60 80 100

Gap

par

amet

er β

Current degree, d

Figure 9: Evolution of the α and β parameters with the currentnode degree d. α remains constant, and β linearly increases.

101

102

103

104

105

106

0 5 10 15 20 25

Nod

es

Time (months)

N(t) ∝ e0.25 t

4.0e46.0e48.0e41.0e51.2e51.4e51.6e51.8e52.0e52.2e5

0 5 10 15 20 25 30 35 40 45

Nod

es

Time (months)

N(t) = 16t2 + 3e3 t + 4e4

(a) FLICKR (b) DELICIOUS

0.0e0

1.0e5

2.0e5

3.0e5

4.0e5

5.0e5

6.0e5

0 2 4 6 8 10 12 14 16 18

Nod

es

Time (weeks)

N(t) = -284 t2 + 4e4 t - 2.5e3

0e0

1e6

2e6

3e6

4e6

5e6

6e6

7e6

0 5 10 15 20 25 30 35 40 45

Nod

es

Time (months)

N(t) = 3900 t2 + 7600 t - 1.3e5

(c) ANSWERS (d) LINKEDIN

Figure 10: Number of nodes over time.

Given the above observation, a natural hypothesis would be thatnodes that will attain high degree in the network are in some waya priori special, i.e., they correspond to “more social” people whowould inherently tend to have shorter gap times and enthusiasti-cally invite friends at a higher rate than others, attaining high de-gree quickly due to their increased activity level. However, thisphenomenon does not occur in any of the networks. We com-puted the correlation coefficient between δ(1) and the final degreed(u). The correlation values are −0.069 for DELICIOUS,−0.043for FLICKR, −0.036 for ANSWERS, and −0.027 for LINKEDIN.Thus, there is almost no correlation, which shows that the gap dis-tribution is independent of a node’s final degree. It only depends onnode lifetime, i.e., high degree nodes are not a priori special, theyjust live longer, and accumulate many edges.

6.2 Node arrivalsFinally, we turn to the question of modeling node arrivals into the

system. Figure 10 shows the number of users in each of our net-works over time, and Table 4 captures the best fits. FLICKR growsexponentially over much of our network, while the growth of othernetworks is much slower. DELICIOUS grows slightly superlinearly,LINKEDIN quadratically, and ANSWERS sublinearly. Given thesewild variations we conclude the node arrival process needs to bespecified in advance as it varies greatly across networks due to ex-ternal factors.

7. A NETWORK EVOLUTIONMODELWe first take stock of what we have measured and observed so

far. In Section 6.2, we analyzed the node arrival rates and showedthat they are network-dependent and can be succinctly representedby a node arrival function N(t) that is either a polynomial or anexponential. In Section 6.1, we analyzed the node lifetimes andshowed they are exponentially distributed with parameter λ. InSection 4.1, we argued that the destination of the first edge of a

Network T N E Eb Eu E∆ % ρ κFLICKR (03/2003–09/2005) 621 584,207 3,554,130 2,594,078 2,257,211 1,475,345 65.63 1.32 1.44

DELICIOUS (05/2006–02/2007) 292 203,234 430,707 348,437 348,437 96,387 27.66 1.15 0.81ANSWERS (03/2007–06/2007) 121 598,314 1,834,217 1,067,021 1,300,698 303,858 23.36 1.25 0.92LINKEDIN (05/2003–10/2006) 1294 7,550,955 30,682,028 30,682,028 30,682,028 15,201,596 49.55 1.14 1.04

Table 1: Network dataset statistics. Eb is the number of bidirectional edges, Eu is the number of edges in undirected network, E∆ isthe number of edges that close triangles, % is the fraction of triangle-closing edges, ρ is the densification exponent (E(t) ∝ N(t)ρ),and κ is the decay exponent (Eh ∝ exp(−κh)) of the number of edges Eh closing h hop paths (see Section 5 and Figure 4).

10-5

10-4

10-3

100 101 102

Edge

pro

babi

lity,

pe(

d)

Destination node degree, d

pe(d) ∝ d0

10-5

10-4

10-3

10-2

10-1

100

100 101 102 103

Edge

pro

babi

lity,

pe(

d)

Destination node degree, d

pe(d) ∝ d1

(a) Gnp (b) PA

10-7

10-6

10-5

10-4

10-3

100 101 102 103

Edge

pro

babi

lity,

pe(

d)Destination node degree, d

pe(d) ∝ d1

10-7

10-6

10-5

10-4

10-3

10-2

10-1

100 101 102 103

Edge

pro

babi

lity,

pe(

d)

Destination node degree, d

pe(d) ∝ d1

(c) FLICKR (d) DELICIOUS

10-7

10-6

10-5

10-4

10-3

10-2

100 101 102 103

Edge

pro

babi

lity,

pe(

d)

Destination node degree, d

pe(d) ∝ d0.9

10-8

10-7

10-6

10-5

10-4

100 101 102 103

Edge

pro

babi

lity,

pe(

d)

Destination node degree, d

pe(d) ∝ d0.6

(e) ANSWERS (f) LINKEDIN

Figure 1: Probability pe(d) of a new edge e choosing a destina-tion at a node of degree d.

and for every edge that arrives into the network, we measure thelikelihood that the particular edge endpoints would be chosen undersome model. The product of these likelihoods over all edges willgive the likelihood of the model. A higher likelihood means a “bet-ter” model in the sense that it offers a more likely explanation ofthe observed data. For numerical purposes, we use log-likelihoods.

4. PREFERENTIAL ATTACHMENTIn this section we study the bias in selection of an edge’s source

and destination based on the degree and age of the node.

4.1 Edge attachment by degreeThe preferential attachment (PA) model [1] postulates that when

a new node joins the network, it creates a constant number of edges,where the destination node of each edge is chosen proportional tothe destination’s degree. Using our data, we compute the probabil-ity pe(d) that a new edge chooses a destination node of degree d;pe(d) is normalized by the number of nodes of degree d that existjust before this step. We compute:

pe(d) =

!t[et = (u, v) ∧ dt−1(v) = d]!

t |{u : dt−1(u) = d}| .

First, Figure 1(a) shows pe(d) for the Erdos–Rényi [9] randomnetwork, Gnp, with p = 12/n. In Gnp, since the destination nodeis chosen independently of its degree, the line is flat. Similarly,in the PA model, where nodes are chosen proportionally to theirdegree, we get a linear relationship pe(d) ∝ d; see Figure 1(b).

0.01

0.1

1

10

0 20 40 60 80 100 120 140

Avg.

no.

of c

reat

ed e

dges

, e(a

)

Node age (weeks), a

0.01

0.1

1

10

0 5 10 15 20 25 30 35 40

Avg.

no.

of c

reat

ed e

dges

, e(a

)

Node age (weeks), a

(a) FLICKR (b) DELICIOUS

0.1

1

10

0 2 4 6 8 10 12 14 16

Avg.

no.

of c

reat

ed e

dges

, e(a

)

Node age (weeks), a

0.01

0.1

1

0 20 40 60 80 100 120 140 160 180

Avg.

no.

of c

reat

ed e

dges

, e(a

)

Node age (weeks), a

(c) ANSWERS (d) LINKEDIN

Figure 2: Average number of edges created by a node of age a.

Next we turn to our four networks and fit the function pe(d) ∝dτ . In FLICKR, Figure 1(c), degree 1 nodes have lower probabilityof being linked as in the PA model; the rest of the edges could beexplained well by PA. In DELICIOUS, Figure 1(d), the fit nicely fol-lows PA. In ANSWERS, Figure 1(e), the presence of PA is slightlyweaker, with pe(d) ∝ d0.9. LINKEDIN has a very different pattern:edges to the low degree nodes do not attach preferentially (the fit isd0.6), whereas edges to higher degree nodes are more “sticky” (thefit is d1.2). This suggests that high-degree nodes in LINKEDIN getsuper-preferential treatment. To summarize, even though there areminor differences in the exponents τ for each of the four networks,we can treat τ ≈ 1, meaning, the attachment is essentially linear.

4.2 Edges by the age of the nodeNext, we examine the effect of a node’s age on the number of

edges it creates. The hypothesis is that older, more experiencedusers of a social networking website are also more engaged andthus create more edges.Figure 2 plots the fraction of edges initiated by nodes of a certain

age. Then e(a), the average number of edges created by nodes ofage a, is the number of edges created by nodes of age a normalizedby the number of nodes that achieved age a:

e(a) =|{e = (u, v) : t(e) − t(u) = a}|

|{t(u) : tℓ − t(u) ≥ a}| ,

where tℓ is the time when the last node in the network joined.Notice a spike at nodes of age 0. These correspond to the people

who receive an invite to join the network, create a first edge, andthen never come back. For all other ages, the level of activity seemsto be uniform over time, except for LINKEDIN, in which activity ofolder nodes slowly increases over time.

4.3 Bias towards node age and degreeUsing the MLE principle, we study the combined effect of node

age and degree by considering the following four parameterizedmodels for choosing the edge endpoints at time t.

Leskovec, J. et al., 2008. Microscopic evolution of social networks. In KDD '08: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining.  ACM

Page 62: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• How does a network growth? • Edge dynamics: Preferential attachment • Triadic closure

Network T N E Eb Eu E∆ % ρ κFLICKR (03/2003–09/2005) 621 584,207 3,554,130 2,594,078 2,257,211 1,475,345 65.63 1.32 1.44

DELICIOUS (05/2006–02/2007) 292 203,234 430,707 348,437 348,437 96,387 27.66 1.15 0.81ANSWERS (03/2007–06/2007) 121 598,314 1,834,217 1,067,021 1,300,698 303,858 23.36 1.25 0.92LINKEDIN (05/2003–10/2006) 1294 7,550,955 30,682,028 30,682,028 30,682,028 15,201,596 49.55 1.14 1.04

Table 1: Network dataset statistics. Eb is the number of bidirectional edges, Eu is the number of edges in undirected network, E∆ isthe number of edges that close triangles, % is the fraction of triangle-closing edges, ρ is the densification exponent (E(t) ∝ N(t)ρ),and κ is the decay exponent (Eh ∝ exp(−κh)) of the number of edges Eh closing h hop paths (see Section 5 and Figure 4).

10-5

10-4

10-3

100 101 102

Edge

pro

babi

lity,

pe(

d)

Destination node degree, d

pe(d) ∝ d0

10-5

10-4

10-3

10-2

10-1

100

100 101 102 103

Edge

pro

babi

lity,

pe(

d)

Destination node degree, d

pe(d) ∝ d1

(a) Gnp (b) PA

10-7

10-6

10-5

10-4

10-3

100 101 102 103

Edge

pro

babi

lity,

pe(

d)

Destination node degree, d

pe(d) ∝ d1

10-7

10-6

10-5

10-4

10-3

10-2

10-1

100 101 102 103

Edge

pro

babi

lity,

pe(

d)

Destination node degree, d

pe(d) ∝ d1

(c) FLICKR (d) DELICIOUS

10-7

10-6

10-5

10-4

10-3

10-2

100 101 102 103

Edge

pro

babi

lity,

pe(

d)

Destination node degree, d

pe(d) ∝ d0.9

10-8

10-7

10-6

10-5

10-4

100 101 102 103

Edge

pro

babi

lity,

pe(

d)

Destination node degree, d

pe(d) ∝ d0.6

(e) ANSWERS (f) LINKEDIN

Figure 1: Probability pe(d) of a new edge e choosing a destina-tion at a node of degree d.

and for every edge that arrives into the network, we measure thelikelihood that the particular edge endpoints would be chosen undersome model. The product of these likelihoods over all edges willgive the likelihood of the model. A higher likelihood means a “bet-ter” model in the sense that it offers a more likely explanation ofthe observed data. For numerical purposes, we use log-likelihoods.

4. PREFERENTIAL ATTACHMENTIn this section we study the bias in selection of an edge’s source

and destination based on the degree and age of the node.

4.1 Edge attachment by degreeThe preferential attachment (PA) model [1] postulates that when

a new node joins the network, it creates a constant number of edges,where the destination node of each edge is chosen proportional tothe destination’s degree. Using our data, we compute the probabil-ity pe(d) that a new edge chooses a destination node of degree d;pe(d) is normalized by the number of nodes of degree d that existjust before this step. We compute:

pe(d) =

!t[et = (u, v) ∧ dt−1(v) = d]!

t |{u : dt−1(u) = d}| .

First, Figure 1(a) shows pe(d) for the Erdos–Rényi [9] randomnetwork, Gnp, with p = 12/n. In Gnp, since the destination nodeis chosen independently of its degree, the line is flat. Similarly,in the PA model, where nodes are chosen proportionally to theirdegree, we get a linear relationship pe(d) ∝ d; see Figure 1(b).

0.01

0.1

1

10

0 20 40 60 80 100 120 140

Avg.

no.

of c

reat

ed e

dges

, e(a

)

Node age (weeks), a

0.01

0.1

1

10

0 5 10 15 20 25 30 35 40

Avg.

no.

of c

reat

ed e

dges

, e(a

)

Node age (weeks), a

(a) FLICKR (b) DELICIOUS

0.1

1

10

0 2 4 6 8 10 12 14 16

Avg.

no.

of c

reat

ed e

dges

, e(a

)

Node age (weeks), a

0.01

0.1

1

0 20 40 60 80 100 120 140 160 180

Avg.

no.

of c

reat

ed e

dges

, e(a

)

Node age (weeks), a

(c) ANSWERS (d) LINKEDIN

Figure 2: Average number of edges created by a node of age a.

Next we turn to our four networks and fit the function pe(d) ∝dτ . In FLICKR, Figure 1(c), degree 1 nodes have lower probabilityof being linked as in the PA model; the rest of the edges could beexplained well by PA. In DELICIOUS, Figure 1(d), the fit nicely fol-lows PA. In ANSWERS, Figure 1(e), the presence of PA is slightlyweaker, with pe(d) ∝ d0.9. LINKEDIN has a very different pattern:edges to the low degree nodes do not attach preferentially (the fit isd0.6), whereas edges to higher degree nodes are more “sticky” (thefit is d1.2). This suggests that high-degree nodes in LINKEDIN getsuper-preferential treatment. To summarize, even though there areminor differences in the exponents τ for each of the four networks,we can treat τ ≈ 1, meaning, the attachment is essentially linear.

4.2 Edges by the age of the nodeNext, we examine the effect of a node’s age on the number of

edges it creates. The hypothesis is that older, more experiencedusers of a social networking website are also more engaged andthus create more edges.Figure 2 plots the fraction of edges initiated by nodes of a certain

age. Then e(a), the average number of edges created by nodes ofage a, is the number of edges created by nodes of age a normalizedby the number of nodes that achieved age a:

e(a) =|{e = (u, v) : t(e) − t(u) = a}|

|{t(u) : tℓ − t(u) ≥ a}| ,

where tℓ is the time when the last node in the network joined.Notice a spike at nodes of age 0. These correspond to the people

who receive an invite to join the network, create a first edge, andthen never come back. For all other ages, the level of activity seemsto be uniform over time, except for LINKEDIN, in which activity ofolder nodes slowly increases over time.

4.3 Bias towards node age and degreeUsing the MLE principle, we study the combined effect of node

age and degree by considering the following four parameterizedmodels for choosing the edge endpoints at time t.

100

101

102

103

104

0 5 10 15 20 25 30

Num

ber o

f edg

es

Hops, h

100

101

102

103

104

0 2 4 6 8 10

Num

ber o

f edg

es

Hops, h

(a) Gnp (b) PA

100

101

102

103

104

105

106

0 2 4 6 8 10 12

Num

ber o

f edg

es

Hops, h

∝ e-1.45 h

100

101

102

103

104

105

106

0 2 4 6 8 10 12 14 16

Num

ber o

f edg

es

Hops, h

∝ e-0.8 h

(c) FLICKR (d) DELICIOUS

100

101

102

103

104

105

0 2 4 6 8 10 12 14 16

Num

ber o

f edg

es

Hops, h

∝ e-0.95 h

100

101

102

103

104

0 2 4 6 8 10 12

Num

ber o

f edg

es

Hops, h

∝ e-1.04 h

(e) ANSWERS (f) LINKEDIN

Figure 4: Number of edges Eh created to nodes h hops away.h = 0 counts the number of edges that connected previouslydisconnected components.

number of edges decays exponentially with the hop distance be-tween the nodes (see Table 1 for fitted decay exponents κ). Thismeans that most edges are created locally between nodes that areclose. The exponential decay suggests that the creation of a largefraction of edges can be attributed to locality in the network struc-ture, namely most of the times people who are close in the network(e.g., have a common friend) become friends themselves.These results involve counting the number of edges that link

nodes certain distance away. In a sense, this overcounts edges(u, w) for which u and w are far away, as there are many moredistant candidates to choose from — it appears that the number oflong-range edges decays exponentially while the number of long-range candidates grows exponentially. To explore this phenomenon,we count the number of hops each new edge spans but then nor-malize the count by the total number of nodes at h hops. Moreprecisely, we compute

pe(h) =

!t[et connects nodes at distance h in Gt−1]!

t(# nodes at distance h from the source node of et).

First, Figures 5(a) and (b) show the results forGnp and PA mod-els. (Again, the isolated dot at h = 0 plots the probability of anew edge connecting disconnected components.) In Gnp, edgesare created uniformly at random, and so the probability of linkingis independent of the number of hops between the nodes. In PA,due to degree correlations short (local) edges prevail. However, anon-trivial amount of probability goes to edges that span more thantwo hops. (Notice the logarithmic y-axis.)Figures 5(c)–(f) show the plots for the four networks. Notice

the probability of linking to a node h hops away decays double-exponentially, i.e., pe(h) ∝ exp(exp(−h)), since the number ofedges at h hops increases exponentially with h. This behavior isdrastically different from both the PA and Gnp models. Also notethat almost all of the probability mass is on edges that close length-two paths. This means that edges are most likely to close triangles,i.e., connect people with common friends.

10-5

10-4

10-3

0 2 4 6 8 10 12

Edge

pro

babi

lity,

pe(

h)

Hops, h

10-5

10-4

10-3

0 2 4 6 8 10

Edge

pro

babi

lity,

pe(

h)

Hops, h

(a) Gnp (b) PA

10-8

10-7

10-6

10-5

10-4

10-3

0 2 4 6 8 10

Edge

pro

babi

lity,

pe(

h)

Hops, h

10-7

10-6

10-5

10-4

10-3

10-2

0 2 4 6 8 10 12 14

Edge

pro

babi

lity,

pe(

h)

Hops, h

(c) FLICKR (d) DELICIOUS

10-7

10-6

10-5

10-4

0 2 4 6 8 10 12Ed

ge p

roba

bilit

y, p

e(h)

Hops, h

10-8

10-7

10-6

10-5

10-4

0 2 4 6 8 10 12

Edge

pro

babi

lity,

pe(

h)

Hops, h

(e) ANSWERS (f) LINKEDIN

Figure 5: Probability of linking to a random node at h hopsfrom source node. Value at h = 0 hops is for edges that connectpreviously disconnected components.

Figure 6: Triangle-closing model: node u creates an edge byselecting intermediate node v, which then selects target node wto which the edge (u, w) is created.

Column E∆ in Table 1 further illustrates this point by presentingthe number of triangle-closing edges. FLICKR and LINKEDIN havethe highest fraction of triangle-closing edges, whereas ANSWERSand DELICIOUS have substantially less such edges. Note that herewe are not measuring the fraction of nodes participating in trian-gles. Rather, we unroll the evolution of the network, and for everynew edge check to see if it closes a new triangle or not.

5.1 Triangle-closing modelsGiven that such a high fraction of edges close triangles, we aim

to model how a length-two path should be selected. We considera scenario in which a source node u has decided to add an edge tosome node w two hops away, and we are faced with various alter-natives for the choice of node w. Figure 6 illustrates the setting.Edges arrive one by one and the simplest model to close a trian-gle (edge (u, w) in the figure) is to have u select a destination wrandomly from all nodes at two hops from u.To improve upon this baseline model we consider various models

of choosing node w. We consider processes in which u first selectsa neighbor v according to some mechanism, and v then selects aneighbor w according to some (possibly different) mechanism. Theedge (u, w) is then created and the triangle (u, v, w) is closed. Theselection of both v andw involves picking a neighbor of a node. Weconsider five different models to pick a neighbor v of u, namely,node v is chosen

Page 63: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• But how do people get to the network) • Adoption Contagion + • Mass media • = Bass Model

• Homophily: geography

the creators in the form of advertising and marketing or doneimplicitly by the users through word-of-mouth. As explainedabove, we use Google news and search volumes as a proxy forthese media influences. Importantly, we note that media coverage(Fig. 1) of Twitter was nearly non-existent during the first twoyears. The company itself did almost no official advertising.During this time, Google search volume was highly correlated withuser growth. After a critical mass of users was reached, mediacoverage began to increase super-linearly. Many of the spikes inadoption rates were the result of celebrity endorsements such asOprah’s decision to officially sign up on-air during her show onApril 17th, 2009 and political events like the Iranian protests inJuly and August 2009 that sparked debate over the use of socialmedia to coordinate revolution.

Qualitatively, we recognize the media as having an enormousrole in driving adoption. We also find that news coverage did notpick up until after the nation had achieved a critical mass of users,suggesting strong endogeneity where media responds to the veryadoption it produces. This is much different than the traditionalmodeling of media [4,5]. We seek to capture these stylized facts byincluding a powerful media agent whose coverage both grows withadoption and produces powerful and random shocks, simulatinghyper-influentials and major media events.

Model IntroductionTo capture both geographic effects as well as media influence,

we introduce our model as follows:

(i) We begin by initializing the agent population and socialnetwork. Contagion spreading is simulated by a mechanismresembling the susceptible - infected (SI) model, which is also aspecial case of the Bass model, widely used in the diffusion ofinnovations literature. We create a population of N agents andplace each agent into one of L cities, creating city level meta-populations. Each agent can be one of two types, early adopter orregular adopter. The geographic placement and and agent types arechosen to reflect empirical distributions of real Twitter users aswell the composition of cities. Thus, if a city was measured to have4% of all US Twitter users, 4% of our agents are placed there.Furthermore, of the agents placed in that city, if the compositionwas measure empirically to be 30% early adopters, 30% of agentswill have an early adopter type, with the remainder marked asregular.

Agents are then connected by links to form a social network.The empirical characteristics of links and distances can be set toreflect those measured in on-line social networks. Liben-Nowellet al. [33] show that pr, the probability of being connected tosomeone located a distance r from your city, follows a truncatedpower-law, pr~r{czn, where c~1:2 and the probability ofconnection becomes roughly constant for distances greater thann~1000 km. We are also able to set the degree distribution anddensity of the social network to reflect different topologies.

(ii) Next, we add dynamics to the simulation. At any giventime, each agent can be in one of two states, susceptible (S) orinfected (I ). Initial adoption is seeded to a small fraction of agents

Figure 1. Plots of weekly national adoption. (a.) The number of new U.S. Twitter users is plotted for each week, normalized by the maximumweekly increase during the entire period of data collection. (b.) The cumulative total number of U.S. Twitter users is plotted for for the same timeperiod. Google search and news volumes are normalized such that the maximum value is 1.doi:10.1371/journal.pone.0029528.g001

Adoption with Geographic and Media Influence

PLoS ONE | www.plosone.org 3 January 2012 | Volume 7 | Issue 1 | e29528

dn

dt= p(t)(1� n) + qn(1� n)

mass media

Contagion

Toole, J.L., Cha, M. & Gonzalez, M.C., 2011. Modeling the adoption of innovations in the presence of geographic and media influences. PLoS ONE, 7(1), pp.e29528–e29528.

Page 64: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Users leaving behavior is also contagious

measure the relative topological overlap of the neighborhood of two users A and B, representing the proportion of their common friends, as OAB = NAB/((KA-1)+(KB-1)-NAB), where NAB is the number of common neighbors of A and B, and KA (KB) denotes the degree of node A(B).1 Fig. 3(d) demonstrates the effect of removing links in order of strongest (or weakest) overlaps. In both cases, we find that removing ties in rank order of weakest to strongest ties will lead to a sudden disintegration of the network. In contrast, reversing the order shrinks the network without precipitously breaking it apart.

0

0.1

0.2

0.3

0.4

0.5

0.6

0 5 10 15 20 25

Probability

Number of Churner Neighbours

May ChurnersJune ChurnersJuly Churners

(a)

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0 0.2 0.4 0.6 0.8 1

Probability

Proportion of Pairs Adjacent

3 Churners4 Churners5 Churners6 Churners

(b)

Figure 2. Probability of churning when (a) k friends have already churned (b) adjacent pairs of friends have already churned

This result is broadly consistent with the strength of weak ties hypothesis [5], offering one of its first confirmations in mobile networks. Accordingly, tie strength is driven not only by the individuals involved in the tie, but also by the network structure in the tie’s immediate vicinity. Further, given that the strong ties are predominantly within communities, their removal will only

1 If A and B have no common acquaintances we have OAB =1.

locally disintegrate a community, while the removal of the weak links will delete bridges that connect different communities, leading to a network collapse. Further, we believe that the observed local relationship, between network topology and tie strength affects any global information diffusion process (like churn). In fact, we opine that churn as a behavior can be viewed less as a dyadic phenomenon (affected only by strong churner-churner ties), but more as a diffusion process where both strong and weak ties play a significant role in spreading the influence through the network topology.

4. PREDICTING CHURNERS IN THE CALL GRAPH We next discuss how to exploit social ties to identify potential churners in an operator’s network. Our approach is as follows. We start with a set of churners (e.g. for April) and their social relationships (ties) captured in the call graph (for March). Using the underlying topology of the call graph, we then initiate a diffusion process with the churners as seeds. Effectively, we model a “word-of-mouth” scenario where a churner influences one of his neighbors to churn, from where the influence spreads to some other neighbor, and so on. At the end of the diffusion process, we inspect the amount of influence received by each node. Using a threshold-based technique, a node that is currently not a churner can be declared to be a potential future one, based on the influence that has been accumulated. Finally, we measure the number of correct predictions by tallying with the actual set of churners that were recorded for a subsequent month (e.g. for May). The diffusion model is based on Spreading Activation (SPA) techniques proposed in cognitive psychology and later used for trust metric computations [32]. In essence, SPA is similar to performing a breadth-first search on the call graph GMarch=(V,E). The basic steps are outlined below:-

Node Activation: During each iterative step i, there is a set of active nodes. Let X be an active node which has associated energy E(X,i) at step i. Intuitively, E(X,i) is the amount of (social) influence2 transmitted to the node via one or more of its neighbors. A node with high influence has a greater propensity to churn. Let N(X) be the set of neighbors of X. Active nodes for step i+1 comprises of nodes which are neighbors of currently active members. Further, a currently active node X transfers a fraction of its energy to each neighbor Y (connected by a directed edge <X,Y>), in the process of activating it. The amount of energy that is transferred from X to Y depends on the Spreading Factor d and the Transfer Function F, respectively.

Spreading Factor: SPA starts with a set of active nodes (seed nodes) each having initial energy E(X,0). At each subsequent step i, an active node transfers a portion of its energy d· E(X,i) to its neighbors, while retaining (1 − d) · E(X,i) for itself, where d is the global Spreading Factor. The spreading factor concept is very intuitive and, in fact, very close to real models of energy spreading. Observe that the overall amount of energy in the network does not change over time, i.e. ∑X E(X,i) = ∑X∈V E(X,0) = E0, for each step i. The spreading factor determines the amount of

2 The terms “energy” and “influence” are used interchangeably in

this context.

measure the relative topological overlap of the neighborhood of two users A and B, representing the proportion of their common friends, as OAB = NAB/((KA-1)+(KB-1)-NAB), where NAB is the number of common neighbors of A and B, and KA (KB) denotes the degree of node A(B).1 Fig. 3(d) demonstrates the effect of removing links in order of strongest (or weakest) overlaps. In both cases, we find that removing ties in rank order of weakest to strongest ties will lead to a sudden disintegration of the network. In contrast, reversing the order shrinks the network without precipitously breaking it apart.

0

0.1

0.2

0.3

0.4

0.5

0.6

0 5 10 15 20 25

Probability

Number of Churner Neighbours

May ChurnersJune ChurnersJuly Churners

(a)

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0 0.2 0.4 0.6 0.8 1

Probability

Proportion of Pairs Adjacent

3 Churners4 Churners5 Churners6 Churners

(b)

Figure 2. Probability of churning when (a) k friends have already churned (b) adjacent pairs of friends have already churned

This result is broadly consistent with the strength of weak ties hypothesis [5], offering one of its first confirmations in mobile networks. Accordingly, tie strength is driven not only by the individuals involved in the tie, but also by the network structure in the tie’s immediate vicinity. Further, given that the strong ties are predominantly within communities, their removal will only

1 If A and B have no common acquaintances we have OAB =1.

locally disintegrate a community, while the removal of the weak links will delete bridges that connect different communities, leading to a network collapse. Further, we believe that the observed local relationship, between network topology and tie strength affects any global information diffusion process (like churn). In fact, we opine that churn as a behavior can be viewed less as a dyadic phenomenon (affected only by strong churner-churner ties), but more as a diffusion process where both strong and weak ties play a significant role in spreading the influence through the network topology.

4. PREDICTING CHURNERS IN THE CALL GRAPH We next discuss how to exploit social ties to identify potential churners in an operator’s network. Our approach is as follows. We start with a set of churners (e.g. for April) and their social relationships (ties) captured in the call graph (for March). Using the underlying topology of the call graph, we then initiate a diffusion process with the churners as seeds. Effectively, we model a “word-of-mouth” scenario where a churner influences one of his neighbors to churn, from where the influence spreads to some other neighbor, and so on. At the end of the diffusion process, we inspect the amount of influence received by each node. Using a threshold-based technique, a node that is currently not a churner can be declared to be a potential future one, based on the influence that has been accumulated. Finally, we measure the number of correct predictions by tallying with the actual set of churners that were recorded for a subsequent month (e.g. for May). The diffusion model is based on Spreading Activation (SPA) techniques proposed in cognitive psychology and later used for trust metric computations [32]. In essence, SPA is similar to performing a breadth-first search on the call graph GMarch=(V,E). The basic steps are outlined below:-

Node Activation: During each iterative step i, there is a set of active nodes. Let X be an active node which has associated energy E(X,i) at step i. Intuitively, E(X,i) is the amount of (social) influence2 transmitted to the node via one or more of its neighbors. A node with high influence has a greater propensity to churn. Let N(X) be the set of neighbors of X. Active nodes for step i+1 comprises of nodes which are neighbors of currently active members. Further, a currently active node X transfers a fraction of its energy to each neighbor Y (connected by a directed edge <X,Y>), in the process of activating it. The amount of energy that is transferred from X to Y depends on the Spreading Factor d and the Transfer Function F, respectively.

Spreading Factor: SPA starts with a set of active nodes (seed nodes) each having initial energy E(X,0). At each subsequent step i, an active node transfers a portion of its energy d· E(X,i) to its neighbors, while retaining (1 − d) · E(X,i) for itself, where d is the global Spreading Factor. The spreading factor concept is very intuitive and, in fact, very close to real models of energy spreading. Observe that the overall amount of energy in the network does not change over time, i.e. ∑X E(X,i) = ∑X∈V E(X,0) = E0, for each step i. The spreading factor determines the amount of

2 The terms “energy” and “influence” are used interchangeably in

this context.

Dasgupta, K. et al., 2008. Social ties and their relevance to churn in mobile telecom networks.

Page 65: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Network dies if cascades of users leaving are unlimited in size • Network is resilient if it can resists cascades of limited size. • What properties of the network make it resilient? Degree distribution?

David Garcia, Pavlin Mavrodiev, Frank Schweitzer:Social Resilience in Online Communities: The Autopsy of Friendster

Submitted on February 22, 2013. http://www.sg.ethz.ch/research/publications/

could be sufficient for some practical applications, the empirical test of the power-law hypothesiscan only be tested, and eventually rejected, through the result of a statistical test, assuming areasonable confidence level.

Facebook

100 101 102 103 104

degree

10-610-410-2100

Prob

[deg

ree

= d] Orkut

100 101 102 103 104 105

degree

10-810-610-410-2100

Prob

[deg

ree

= d]

Friendster

100 101 102 103 104

degree

10-810-610-410-2100

Prob

[deg

ree

= d] LiveJournal

100 101 102 103 104

degree

10-810-610-410-2100

Prob

[deg

ree

= d]

MySpace

100 101 102 103 104 105

degree

10-610-410-2100

Prob

[deg

ree

= d]

100 101 102 103 104 105

degree

10-810-610-410-2100

Prob

[deg

ree

> d]

OrkutLiveJournalMySpaceFacebookFriendster

Figure 2: Complementary cumulative density function and probability density functions ofnode degree in the five considered communities. Red lines show the ML power-law fits fromddeg

min

We followed the state-of-the-art methodology to test power laws [7], which roughly involves thefollowing steps. First, we created Maximum Likelihood (ML) estimators ↵ and ddegmin for p(d).Second, we tested the empirical data above ddegmin against the power law hypothesis and werecorded the corresponding KS-statistics (D). Third, we repeated the KS test for 100 syntheticdatasets that follow the fitted power law above ddegmin. The p-value is then the fraction of thesynthetic D values that are larger than the empirical one. Thus, for each degree distribution, wehave the ML estimates ddeg

min

and ↵, which define the best case in terms of the KS test, withan associated D value, and the p-value.

Ultimately, a power law hypothesis cannot be rejected if (i) the p-value of the KS-test is abovea chosen significance level [7], and (ii) there is a sufficiently large amount of datapoints from

10/22

●●

●●●

●●

●●

●●●

●●

● ●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●●

●●●

●●

● ●

● ●

●●

●●

●●●●

●●

●●

●●

● ●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●●

●●

● ●

●●

● ●

●●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●

●●●

●●●

●●

●●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●● ●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●● ●●

●●

●●

●●

●●

●●

●●

●●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

● ●

●●

●●

● ●

●●

●●

●●

● ●

●●●

● ●

● ●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

● ●●

● ●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●● ●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●

● ●

● ●●

●●

●●

●●

● ●●

●●

●●

● ●●

●●

●●

●●●

●●

●●

●●

● ●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●●

●●

●●

●●

●●

●●●

●●

●●

● ●●

● ●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

● ●●

●●

●●

● ●●

●●

●●

●●

●●

●●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●●

●●

●●

●●●

●●●

● ●

●●

● ●

● ●

●●●

●●

●●

●●●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●● ●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●● ●

●●

●●

●●

●●●

● ●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●● ●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●● ●●●

●●

●●

●●

● ●

● ●

●●●

●●

●●

●●●● ●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●●

●●

●●

●●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

● ●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●●●

●●

● ●

● ●

●●●

●●

●●

●●● ●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●●

●●

●●●

●●

●●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

● ●

● ●

●●

●●●

● ●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

● ●

● ●

●●●

●●

● ●

●●●

●●

●●

●●

●● ●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ●●

●●

●●

●●●

●●

●●

● ●

●●

●●

●●

●●

● ●

●●

●●

● ●●

●●

● ●●

●●

●●

●●

●●

q = 1/8 q = 1/4 q = 3/8 q = 1/2

Robustness of the network against node removal attacks Degree distribution in OSN networks

Random

By degree

Fraction of nodes removed

David Garcia, Pavlin Mavrodiev, Frank Schweitzer:Social Resilience in Online Communities: The Autopsy of Friendster

Submitted on February 22, 2013. http://www.sg.ethz.ch/research/publications/

4 Data on Online Social Networks

For our empirical study of social network resilience, we use datasets from five different OSN.The choice of these datasets aims at spanning a variety of success stories across OSN, includingsuccessful and failed communities, as well as communities currently in decline. The size, datagathering methods, and references are summarized in Table 1, and outlined in the following.

Table 1: Outline of OSN and datasetsname date status users links source

Livejournal 1999 successful 5.2M 28M [26]Friendster 2002 failed 117M 2580M Internet ArchiveMyspace 2003 in decline 100K 6.8M [3]Orkut 2004 in decline 3M 223M [26]

Facebook 2004 successful 3M 23M [14]

FriendsterThe most recent dataset we take into account is the one retrieved by the Internet Archive, with thepurpose of preserving Friendster’s information before its discontinuation. This dataset providesa high-quality snapshot of the large amount of user information that was publicly available onthe site, including friend lists and interest-based groups [31]. In this article, we provide the firstanalysis of the social network topology of Friendster as a whole.

Since some user profiles in Friendster were private, this dataset does not include their connec-tions. However, these private users would be listed as contacts in the list of their friends whowere not private. We symmetrized the Friendster dataset by adding these additional links. Dueto the large size of the Friendster dataset, we symmetrized the data by using Hadoop, whichwe distribute under a creative commons license 3.

LivejournalIn Livejournal, users keep personal blogs and define different types of friendship links. Theinformation retrieval method for the creation of this dataset combined user id sampling withneighborhood exploration [26], covering more than 95% of the whole community. We choose thisLivejournal dataset for its overall quality, as it provides a view of practically the whole OSN.

Note that the desing of Livejournal as an OSN deviates from the other four communitiesanalyzed here. First, Livejournal is a blog community, in which the social network functionalityplays a secondary role. Second, Livejournal social links are directed, in the sense that one usercan be friend of another without being friended back. In our analysis, we only take includereciprocal links, referring to previous research on its k-core decomposition [21]. By including

3web.sg.ethz.ch/users/dgarcia/Friendster-sim.tar.bz2

8/22

Garcia, D., Mavrodiev, P. & Schweitzer, F., 2013. Social Resilience in Online Communities: The Autopsy of Friendster. arXiv.org, cs.SI.

Page 66: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• If users leave once they have less than k neighbor. Thus the relevant structure is the k-core of the network: the part of the network than remains once nodes with degree smaller thank k are removed

• There is a critical k-coreness in each network Ks

Page 67: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Thus: are unsuccessful networks those with small k-cores? David Garcia Chair of Systems Design www.sg.ethz.chThe Autopsy of Friendster Measuring Social Resilience

Empirical K-core decompositions

����������

�� ��

��

���

���

��� ��������� �

� ������

Conference in Online Social Networks. Boston, US October 7th, 2013 11 / 19

NO!

k-co

rene

ss

Colored area is proportional to the number of nodes

A large proportion of nodes have a large k-coreness

Page 68: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Assumption: Ks increases with time (linearly)

• Network decay is a much more complex process

David Garcia, Pavlin Mavrodiev, Frank Schweitzer:Social Resilience in Online Communities: The Autopsy of Friendster

Submitted on February 22, 2013. http://www.sg.ethz.ch/research/publications/

and it took a bit of time for the OSN to become more resilient. The second peak is shortly afterhaving 22 million users, which coincides with the decay of popularity of Friendster in the US.Finally, the ratio of users at risk went above 0.5 before the community had 80 million accounts,showing a lack of cohesion as its shutdown approaches, as new users do not manage to connectto the rest.

To conclude our analysis, we explored how the spread of departures captured in the k-coredecomposition (see Section 3.3) can describe the collapse of Friendster as an OSN. As we donot have access to the precise amount of active users of Friendster, we proxy its value throughthe Google search volume of www.friendster.com. The inset of Figure 6 shows the relative weeklysearch volume from 2004, where the increase of popularity of Friendster is evident. At somepoint in 2009, Friendster introduced changes in its user interface, coinciding with some technicalproblems, and the rise of popularity of Facebook 4. This led to the fast decrease of active usersin the community, ending on its discontinuation in 2011.

date

Goo

gle

sear

chvo

lum

e(%

)0

2040

6080

2Jul2009 1Jan2010 2Jul2010 1Jan2011

��������

������

P(k

s<

6)

��

���

���

���

������� ��� ��� ���

Figure 6: Weekly Google search trend volume for Friendster. The red line shows the estima-tion of the remaning users in a process of unraveling. Inset: time series of fraction of nodeswith ks < 6.

We scale the search volumes fixing 100% as the total amount of users with coreness above 0, 68million. At the point when the collapse of Friendster started, the search volume indicates apopularity of 78% of its maximum. We take this point to start the simulation of a user departurecascade, with an initial amount of 58 million active users, i.e. users with coreness above 3. Thesecond reference point we take is June 2010, when Friendster was reported to have 10 million

4www.time.com/time/business/article/0,8599,1707760,00.html

17/22

Model

Page 69: Temporal dynamics of human behavior in social networks (i)

@estebanmoro

• Wrap up • Network growth

• Nodes and edges arrive to the network through a complex process (mass media)

• Edges are created through the preferential attachment and triadic closure process

• Joining the network is also a contagious process • Network death

• Users leaving behavior is also contagious. Depends on the embeddedness • Network resilience has an impact through the k-core of the network