A Comparison of On-line Computer Science Citation Databases

Vaclav Petricek, Ingemar J. Cox, Hui Han, Isaac G. Councill, C. Lee Giles

[email protected]

http://www.cs.ucl.ac.uk/staff/V.Petricek

Motivation

Autonomous databases have advantages over manually constructed ones

- Easier maintenance
- Lower cost

Is it really an equivalent solution that is just cheaper?

Does the automated acquisition introduce any bias?

Talk Overview

Datasets

Acquisition bias and models

CS Citation Distribution

Conclusions

Future Work

Datasets - DBLP

DBLP has been operated by Michael Ley since 1994 [8]. It currently contains over 550,000 computer science references from around 368,000 authors.

Each entry is manually inserted by a group of volunteers and occasionally hired students. The entries are obtained from conference proceedings and journals.

Datasets - CiteSeer

CiteSeer was created by Steve Lawrence and C. Lee Giles in 1997. It currently contains over 716,797 documents.

In contrast to DBLP, each CiteSeer entry is created automatically from an analysis of documents found on the Web.

Datasets – Publication year

[Figure: publication year distribution for CiteSeer and DBLP, annotated "Declining CiteSeer maintenance" and "Increased DBLP funding"]

Author bias

CiteSeer papers have a higher average number of authors

Both databases show growing team sizes

Author bias

Crossover for low number of authors

CiteSeer has a higher proportion of multi-author papers than DBLP

(for number of authors <4)

Author bias

“Papers with a higher number of authors are more likely to be included in CiteSeer”

Hypothesis

Crawler suffers from acquisition bias due to
- Submission
- Crawling

Models - CiteSeer

CiteSeer Submission model

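Probability of a document being submitted grows with the number of authors

- Each coauthor submits the publication with probability β
- Probabilities are independent for coauthors

citeseer_s(i) = (1 - (1 - β)^i) * all(i)

Here all(i) is the number of publications with i authors, and citeseer_s(i) is the number of those that reach CiteSeer by submission.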

Models - CiteSeer

CiteSeer crawler model
- Probability of crawling a document grows with the number of its online copies
- Probability of a document being online grows with the number of authors
- Probabilities independent between authors
- Publication published online with probability δ
- Publication found by crawler with probability γ

citeseer_c(i) = (1 - (1 - γδ)^i) * all(i)

Both models result in an equivalent type of bias
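
The two acquisition models can be sketched in a few lines of Python. This is an illustration only: the function names and the parameter values (β = 0.2, γ = 0.5, δ = 0.4) are assumptions chosen for the example, not values from the talk. With γδ = β the crawler model gives the same inclusion probability as the submission model, which is why both lead to the same type of bias.

```python
def p_submission(i, beta):
    """Probability that at least one of i independent coauthors submits the paper."""
    return 1.0 - (1.0 - beta) ** i


def p_crawler(i, gamma, delta):
    """Probability that the crawler finds at least one online copy of the paper.

    Each of the i authors puts the paper online with probability delta,
    and each online copy is found by the crawler with probability gamma.
    """
    return 1.0 - (1.0 - gamma * delta) ** i


# Hypothetical parameter values, chosen so that gamma * delta equals beta.
BETA, GAMMA, DELTA = 0.2, 0.5, 0.4

for i in range(1, 6):
    print(i, round(p_submission(i, BETA), 4), round(p_crawler(i, GAMMA, DELTA), 4))
```

Multiplying either probability by all(i) gives the expected number of i-author papers that end up in CiteSeer under that model.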

Coverage

Can we estimate the coverage of DBLP?

Can we estimate the coverage of CiteSeer?

Can we estimate the coverage of CS literature?

We need a model of the DBLP acquisition method

Models - DBLP

DBLP model
- Publication included in DBLP with probability α
- α is a parameter reflecting DBLP “coverage” of CS literature

dblp(i) = α * all(i)

Coverage

citeseer(i) = (1 - (1 - β)^i) * all(i)

dblp(i) = α * all(i)

r(i) = dblp(i) / citeseer(i)

r(i) = α / (1 - (1 - β)^i)
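
To see what the ratio predicts, the sketch below evaluates r(i) for a growing number of authors. The parameter values are for illustration only: α = 0.3 matches the estimate reported in the Results, while β = 0.2 is a made-up value. Because (1 - β)^i shrinks towards 0 as i grows, r(i) falls towards α, so under these models the ratio for papers with many authors approaches the DBLP coverage.

```python
def ratio(i, alpha, beta):
    """Model ratio r(i) = dblp(i) / citeseer(i) for papers with i authors."""
    return alpha / (1.0 - (1.0 - beta) ** i)


ALPHA = 0.3  # illustrative value
BETA = 0.2   # hypothetical value, not from the talk

for i in (1, 2, 4, 8, 16, 32):
    print(f"i={i:2d}  r(i)={ratio(i, ALPHA, BETA):.3f}")
```

Note that all(i) cancels in the ratio, which is what makes r(i) usable without knowing the true size of the literature.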

Results

r(i) = α / (1 - (1 - β)^i)

α ~ 0.3

DBLP covers approx 30% of CS literature

CiteSeer covers approx 40%

CS literature ~ 2M publications
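
These three figures are consistent with the entry counts quoted on the dataset slides. The arithmetic below is a reconstruction under that assumption, not necessarily the authors' exact procedure: dividing the DBLP count by α gives an estimate of the literature size, and dividing the CiteSeer count by that size gives the CiteSeer coverage.

```python
DBLP_ENTRIES = 550_000      # "over 550,000" references (Datasets - DBLP)
CITESEER_ENTRIES = 716_797  # documents (Datasets - CiteSeer)
ALPHA = 0.3                 # estimated DBLP coverage

# dblp = alpha * all, so all = dblp / alpha
total_cs = DBLP_ENTRIES / ALPHA
citeseer_coverage = CITESEER_ENTRIES / total_cs

print(f"estimated CS literature size: {total_cs:,.0f}")
print(f"estimated CiteSeer coverage:  {citeseer_coverage:.0%}")
```

The result is a literature size of roughly 1.8M publications and a CiteSeer coverage of about 39%, in line with the rounded values of 2M and 40%.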

Citation distribution

Studied before

Follow a power law

Redner, Laherrere et al., Lehmann and others

Mostly physics community

We use a subset of CiteSeer and DBLP papers that have citation information

Power law

Sparse data for high number of citations

Exponential binning

Data aggregated in exponentially increasing ‘bins’

Equivalent to constant bins on a logarithmic scale

Easier interpolation
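
A minimal sketch of exponential binning follows. The function name, the bin base of 2 and the toy data are assumptions for illustration; dividing each bin count by the bin width is a common normalisation, but the talk does not say whether it was used.

```python
from collections import Counter


def exponential_bins(citation_counts, base=2):
    """Aggregate citation counts into bins [base**k, base**(k+1)).

    Returns a list of (lower, upper, density) tuples, where density is the
    number of papers in the bin divided by the bin width. Papers with zero
    citations are skipped because they have no place on a logarithmic axis.
    """
    histogram = Counter(c for c in citation_counts if c >= 1)
    if not histogram:
        return []
    largest = max(histogram)
    bins = []
    lower = 1
    while lower <= largest:
        upper = lower * base
        papers = sum(n for c, n in histogram.items() if lower <= c < upper)
        bins.append((lower, upper, papers / (upper - lower)))
        lower = upper
    return bins


# Toy data for illustration only; not taken from CiteSeer or DBLP.
toy_counts = [1, 1, 1, 1, 2, 2, 3, 5, 6, 9, 20]
for lower, upper, density in exponential_bins(toy_counts):
    print(f"[{lower}, {upper}): {density:.4f}")
```

Each bin doubles in width, so the sparse tail of highly cited papers is pooled into a few wide bins instead of many nearly empty ones.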

Distribution of citations more uneven in CS than in Physics

Significant differences between DBLP and CiteSeer

Slope

# citations   Lehmann   DBLP     CiteSeer
< 50          -1.29     -1.876   -1.504
> 50          -2.32     -3.509   -3.074
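
The slopes in the table are slopes of the citation distribution on a log-log plot, that is, power-law exponents. The talk does not say how the fits were computed; an ordinary least-squares fit on log-transformed values is one plausible way and is what this sketch assumes. The function name and the synthetic points are hypothetical.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    return cov / var


# Synthetic points generated from y = 1000 * x**-1.5, for illustration only.
xs = [1, 2, 4, 8, 16, 32]
ys = [1000 * x ** -1.5 for x in xs]
print(round(loglog_slope(xs, ys), 3))
```

To reproduce a two-regime table like the one above, the same fit would be run separately on the points below and above 50 citations.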

CiteSeer contains fewer low-cited papers than DBLP

No model yet

Lawrence
- “Online or invisible?”

Conclusions - authors

CiteSeer and DBLP have very different acquisition methods

Significant bias against papers with a low number of authors (fewer than 4) in CiteSeer.

Single author papers appear to be disadvantaged with regard to the CiteSeer acquisition method.

Two probabilistic models for paper acquisition in CiteSeer result in the same type of bias
- Crawler model
- Submission model

Conclusions - coverage

A simple model of DBLP coverage predicts that DBLP covers approx 30% of the entire Computer Science literature.

This gives a CiteSeer coverage of approx 40% and a total number of CS papers of around 2M.

Conclusions - citations

CiteSeer and DBLP citation distributions are different

Both indicate that highly cited papers in Computer Science receive a larger citation share than in Physics.

CiteSeer contains fewer low-cited papers than DBLP

Future Work

Repeat experiments on the most recent CiteSeer data

Other methods to estimate computer science literature size and trends

- Overlap of CiteSeer and DBLP

Bias introduced by bibliography parsing

Collaborative network analysis

Connection to internet surveys?

Thank you