
Social Network Analysis on the Arxiv HEP-PH and HEP-TH citation graphs – Gamma distribution instead of Power Law observed

Wang Fumin

[email protected]

Datasets:

cit-HepPh: The Arxiv HEP-PH (high energy physics phenomenology) citation graph is from the e-print arXiv and covers all the citations within a dataset of 34,546 papers with 421,578 edges. If a paper i cites paper j, the graph contains a directed edge from i to j. If a paper cites, or is cited by, a paper outside the dataset, the graph does not contain any information about this.

cit-HepTh: The Arxiv HEP-TH (high energy physics theory) citation graph covers 27,770 papers with 352,807 edges.

Both datasets can be downloaded from http://snap.stanford.edu/data/index.html
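As a minimal sketch of how such an edge list might be read, the parser below assumes the usual SNAP text layout: comment lines starting with `#`, then one whitespace-separated "citing cited" pair per line. The function name `parse_edge_list` and the local file name in the usage note are illustrative assumptions, not part of the original report.

```python
def parse_edge_list(lines):
    """Parse lines of the form 'i j' into directed edges (i cites j)."""
    edges = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and '#' comment/header lines.
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()[:2]
        edges.append((int(src), int(dst)))
    return edges
```

For example, `parse_edge_list(open("cit-HepPh.txt"))` would return the list of citation pairs, assuming the decompressed file has been saved under that (hypothetical) local name.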

Experiments:

Graph Densification

We computed node and edge statistics for the two graphs above over the period January 1993 to April 2003 (124 months). In our computation, we accounted for the effects of phantom nodes and edges according to the following rules:

Suppose paper i cites paper j.

If paper i’s publication date is before Jan 1993, then paper i is accumulated into Jan 1993’s node statistics. Likewise, the link is accumulated into Jan 1993’s edge statistics.

If paper j’s publication date is before Jan 1993 or after April 2003, then paper j is discarded from our node statistics.

The unit of computation is “month.”
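The rules above can be sketched as follows. This is one possible reading, not the report's actual code: it assumes a cited paper inside the period is counted in its own publication month, that an edge is always counted in the citing paper's (clamped) month, and that citing papers dated after April 2003 are skipped. The names `densification_series` and `pub_month` are hypothetical.

```python
from itertools import accumulate

START = (1993, 1)   # January 1993
END = (2003, 4)     # April 2003

def month_index(year, month):
    """Months elapsed since January 1993 (negative for earlier dates)."""
    return (year - START[0]) * 12 + (month - START[1])

NUM_MONTHS = month_index(*END) + 1  # 124 months

def densification_series(citations, pub_month):
    """citations: iterable of (i, j) pairs, meaning i cites j.
    pub_month: dict mapping paper -> (year, month).
    Returns cumulative node and edge counts, one entry per month."""
    node_month = {}
    new_edges = [0] * NUM_MONTHS
    for i, j in citations:
        # Citing papers dated before Jan 1993 are accumulated into Jan 1993.
        mi = max(month_index(*pub_month[i]), 0)
        if mi >= NUM_MONTHS:
            continue  # assumption: citing papers after the period are skipped
        new_edges[mi] += 1
        node_month[i] = mi
        # Cited papers outside the period are discarded from node statistics.
        mj = month_index(*pub_month[j])
        if 0 <= mj < NUM_MONTHS:
            node_month.setdefault(j, mj)
    new_nodes = [0] * NUM_MONTHS
    for m in node_month.values():
        new_nodes[m] += 1
    return list(accumulate(new_nodes)), list(accumulate(new_edges))
```

Plotting the two returned series against each other on log-log axes gives a densification plot of the kind discussed later in the report.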

Shrinking diameter

We defined the time evolution of both the PH and TH citation graphs as follows:

Suppose paper i cites paper j.

If paper i’s publication date is before or equal to a timestamp m, then papers i and j and this link are considered to belong to the graph at timestamp m.

Although our algorithm for computing the effective diameter is inspired by ANF (C. R. Palmer, 2002), we discarded all aspects of approximation in their algorithm, since we found that we could fit our entire dataset into memory (about 100 MB) by using bitmasks. In other words, our computation for this part is “exact.”

The unit of computation is also “month.”
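A minimal sketch of an exact, bitmask-based neighborhood computation in the spirit of ANF is shown below. Several details are assumptions rather than statements from the report: edges are treated as undirected, the effective diameter is taken as the (linearly interpolated) hop count at which 90% of all reachable pairs are covered, and each node counts as reaching itself at 0 hops. Python integers stand in for the bitmasks.

```python
def neighborhood_counts(num_nodes, edges):
    """counts[h] = number of (u, v) pairs with v within h hops of u."""
    adj = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)  # assumption: ignore edge direction
    # reach[v] is a bitmask with bit w set if w is within h hops of v.
    reach = [1 << v for v in range(num_nodes)]
    counts = [num_nodes]
    while True:
        new_reach = []
        for v in range(num_nodes):
            mask = reach[v]
            for w in adj[v]:
                mask |= reach[w]  # exact set union, no sketching
            new_reach.append(mask)
        total = sum(bin(m).count("1") for m in new_reach)
        if total == counts[-1]:
            return counts  # no new pairs were reached; stop
        counts.append(total)
        reach = new_reach

def effective_diameter(counts, quantile=0.9):
    """Interpolated hop count covering `quantile` of reachable pairs."""
    target = quantile * counts[-1]
    for h, c in enumerate(counts):
        if c >= target:
            if h == 0:
                return 0.0
            prev = counts[h - 1]
            return (h - 1) + (target - prev) / (c - prev)
```

Node identifiers are assumed to be remapped to 0..num_nodes-1 beforehand. Running this once per monthly snapshot would yield a diameter-versus-time curve.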

Probability distribution of Seniority differences

Suppose paper i cites paper j. We define their seniority difference as follows:

We look up the publication dates of papers i and j in their metadata. If a publication date is found, it is used. If not, we take the year and month of publication from the paper ID (e.g., a paper ID of 9301253 translates into a publication year of 1993 and a month of January) and then randomly assign a day of the month (between 1 and 30) to the paper.

The seniority difference is then defined as the difference between the publication dates of papers i and j, counted in days.

A 30-day moving average is also computed for ease of analysis.
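The fallback procedure above might look like the sketch below. A few choices are assumptions added for illustration: IDs are zero-padded to seven digits in case leading zeros were dropped, two-digit years of 90 or more map to the 1990s, the random day is capped at 28 in February so that the date is valid, and the moving average is a trailing one. All function names are hypothetical.

```python
import random
from datetime import date

def date_from_paper_id(paper_id, rng):
    """Derive a date from a YYMMNNN-style paper ID with a random day."""
    digits = str(paper_id).zfill(7)
    yy, mm = int(digits[:2]), int(digits[2:4])
    year = 1900 + yy if yy >= 90 else 2000 + yy
    # The report draws a day between 1 and 30; February is capped here.
    day = rng.randint(1, 28 if mm == 2 else 30)
    return date(year, mm, day)

def seniority_difference(date_i, date_j):
    """Days between the citing paper i and the cited paper j."""
    return (date_i - date_j).days

def moving_average(values, window=30):
    """Trailing moving average over full windows only."""
    return [sum(values[k - window + 1:k + 1]) / window
            for k in range(window - 1, len(values))]
```

Here `values` would be the per-day probabilities of the seniority difference histogram.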

Rank degree distribution

We followed exactly the same treatment for phantom nodes and edges as we did for the graph densification part.

Following Leskovec (J. Leskovec, March 2007), we performed a power-law regression on the first 2500 ranks to compute the power-law exponent.
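A power-law regression on the top ranks can be sketched as an ordinary least-squares fit in log-log space. This is an illustrative stand-in, assuming the model degree = c * rank^slope; the report does not state which fitting tool was actually used, and `fit_power_law` is a hypothetical name.

```python
import math

def fit_power_law(degrees, max_rank=2500):
    """Fit degree = c * rank**slope on the `max_rank` largest degrees.
    Returns (slope, c)."""
    ranked = sorted((d for d in degrees if d > 0), reverse=True)[:max_rank]
    xs = [math.log(r) for r in range(1, len(ranked) + 1)]
    ys = [math.log(d) for d in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, math.exp(intercept)
```

On a rank-degree plot the fitted slope is negative; its magnitude is the power-law exponent of the rank-degree relationship.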

Discussions:

Fig. 1 below is a log-log plot of the number of edges versus the number of nodes. The wide gap between the regression lines of Ph and Th and the hypothetical linear growth line clearly shows that these citation graphs become denser over time.

Fig. 1 Densification Plot: log-log plot of Number of Edges versus Number of Nodes for Th and Ph, with a linear-growth reference line and power-law fits. Th: y = 0.0231 x^1.619, R² = 0.9928. Ph: y = 0.0208 x^1.6118, R² = 0.9977.
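As a quick sanity check on the fitted curves, one can plug the full dataset sizes into the two regression equations of Fig. 1 and compare the results with the edge counts listed under Datasets. This is only a rough check under an assumption: the monthly statistics use the phantom-node rules, so their final counts need not equal the raw dataset totals exactly. The helper name `power_fit` is hypothetical.

```python
def power_fit(coefficient, exponent, nodes):
    """Evaluate y = coefficient * nodes**exponent."""
    return coefficient * nodes ** exponent

# Fitted curves from Fig. 1, evaluated at the dataset sizes.
ph_edges = power_fit(0.0208, 1.6118, 34546)
th_edges = power_fit(0.0231, 1.619, 27770)

# Relative deviation from the dataset edge counts (421,578 and 352,807).
ph_deviation = abs(ph_edges - 421578) / 421578
th_deviation = abs(th_edges - 352807) / 352807
```

Small deviations here would indicate that the power-law fits track the observed edge counts up to the end of the period.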


Currently there is no widely agreed-upon method of dealing with phantom nodes and edges (i.e., nodes and edges that were formed before our earliest record of a graph). In this report, we proposed the treatment of phantom nodes and edges described above, which gives densification exponents of 1.612 and 1.619 for the Arxiv HEP-PH and HEP-TH citation graphs, respectively. Since our proposal yields fairly similar exponents for two citation graphs of the same background, it seems able to cope reasonably well with the problem of the graphs’ “missing past.”

Next, to verify that such densification indeed results in a smaller diameter, we plotted the time evolution of both citation graphs’ effective diameter in Fig. 2 below. Our results confirm that the effective diameter of both graphs shrinks as time advances.

Fig. 2 Shrinking Diameter: Effective Diameter versus Date (Jan-92 to Jan-04) for Th and Ph.

Fig. 3 Seniority Difference Probability Distribution: Probability versus Seniority Difference (Days) for Ph and Th, each with a 30-day moving average. Annotated values: 27 and 126.

To see the effect of seniority difference on individual nodes’ willingness to form edges, we plotted the seniority difference probability distribution of both graphs in Fig. 3. Our results show that nodes in these two graphs generally prefer to establish links with comparatively fresh nodes (27- and 126-day differences, respectively), but not the freshest ones, a result that is in accordance with Ko’s findings (Y. K. Ko, 2011).

Last but not least, we checked whether the rank-degree distribution of both citation graphs reveals a power-law relationship, as suggested by previous work (J. Leskovec, March 2007).

From what we see in Fig. 4, the beginning part of the rank-degree log-log plot does roughly resemble a straight line, but it is clear that such a linear relationship is not as strong as that reported by Leskovec et al. In fact, when we tried to compute the densification exponent from the slope of this line according to Theorem 5.2 described in (J. Leskovec, March 2007), we found that their formula gave implausible results: 1.373 and 1.350 for the HEP-PH and HEP-TH graphs, respectively. Compared against the true values of 1.612 and 1.619, these translate into errors of 16.2% and 15.2%.

Theorem 5.2 (J. Leskovec, March 2007): [equation relating the average degree to the slope; not recoverable from the extracted text]

Fig. 4 Rank-Degree distribution: log-log plot for Ph and Th with fitted curves Gamma(0.8, 9947) and Gamma(0.7, 8090).

As such, we believe it would be more honest to admit that a power law does not hold for citation graphs, and that the densification exponent can be calculated accurately only if we take into account the real underlying rank-degree distribution. Looking at Fig. 4 more carefully, we suspect that the distribution could be a gamma distribution. Surprisingly, a simple gamma fit shows that a gamma distribution not only coincides with the data visually, but also delivers fairly accurate predictions of the densification exponent. In fact, our gamma-distribution-based formula below returns densification exponents of 1.553 and 1.571 for the PH and TH graphs, respectively.

[equation relating ln(average degree) to ln gamma(n, k, theta) and ln n; not recoverable from the extracted text]

We believe the small errors of these predictions, 3.7% and 3.0% respectively, are strong evidence of the presence of a gamma distribution in citation graphs.
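The quoted error figures can be reproduced with a relative-error calculation, shown here as a small sketch. It assumes the errors are measured against the densification exponents of 1.612 (PH) and 1.619 (TH) reported earlier; `relative_error_percent` is a hypothetical helper name.

```python
def relative_error_percent(predicted, true_value):
    """Absolute relative error, expressed in percent."""
    return abs(predicted - true_value) / true_value * 100.0

# Gamma-based predictions versus the measured densification exponents.
ph_error = relative_error_percent(1.553, 1.612)
th_error = relative_error_percent(1.571, 1.619)
```

Rounded to one decimal place, these give the 3.7% and 3.0% quoted in the text.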

Derivation of the above equation:

The value of the real curve at rank n is c*gamma(n, k, theta), where c is the number of edges (because the integral of c*gamma(rank, k, theta) over rank equals the number of edges). However, we also know that this value is one, because one is the smallest degree of a node. It follows that the number of edges is 1/gamma(n, k, theta).
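The last step of the derivation can be evaluated numerically with the sketch below. It rests on assumptions not confirmed by the report: that gamma(n, k, theta) denotes the gamma probability density with shape k and scale theta, that the Fig. 4 legend Gamma(0.8, 9947) lists (k, theta) for the Ph graph, and that n is the total number of Ph papers (34,546). The function names are hypothetical.

```python
import math

def gamma_pdf(x, shape, scale):
    """Gamma probability density with shape k and scale theta."""
    return (x ** (shape - 1) * math.exp(-x / scale)
            / (math.gamma(shape) * scale ** shape))

def edges_from_gamma(n, shape, scale):
    """Number of edges implied by c * gamma(n, k, theta) = 1."""
    return 1.0 / gamma_pdf(n, shape, scale)

# Assumed inputs: Ph fit from the Fig. 4 legend, n = number of Ph papers.
ph_edges_estimate = edges_from_gamma(34546, 0.8, 9947)
```

Comparing `ph_edges_estimate` with the 421,578 edges of the Ph dataset gives a rough feel for how well the fitted gamma curve accounts for the total edge count.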

References

C. R. Palmer, P. B. Gibbons, and C. Faloutsos. (2002). ANF: A fast and scalable tool for data mining in massive graphs. SIGKDD, Edmonton, AB, Canada.

J. Leskovec, J. Kleinberg, and C. Faloutsos. (March 2007). Graph Evolution: Densification and Shrinking Diameters. ACM Transactions on Knowledge Discovery from Data, Vol 1, No. 1, Article 2.

Y. K. Ko, J. K. Lou, C. T. Li, S. D. Lin, and S. K. Jeng. (2011). A Social Network Evolution Model Based on Seniority. Social Networks.