How to Build Your Own Citation Index
First-hand experience with WoS, Scopus, and CSA reference data
Philipp Mayr, Frank Sawitzky, Andreas Strotmann
(GESIS – Leibniz Institute for the Social Sciences, Cologne)
Background I
● Author cocitation and collaboration network mining and visualization
– e.g. Bubela, Strotmann, et al. (2010), Cell Stem Cell: researchers' commercialization activity tends to reduce their collaboration breadth
Background II
Citation Index for the Social Sciences
● GESIS' Sowiport portal
– 18 databases, including 6 CSA databases, all social sciences
– CSA comes with cited refs for some docs
– SSOAR – extract refs from OA full text and index in Sowiport
– Crawl Google Scholar for citations to “our” docs
Two Models of Citation Graphs
Bipartite (Classic IR) Model: Citing and Cited Partitions
• Citing nodes: full bibliographic records
• Cited nodes: „keys“, e.g.
– First author name & initials + year of publication + journal key + volume + number + page
Uniform Model: Interconnected Documents
• All nodes: bibliographic records
– Citing nodes: full records
– Cited nodes: mostly simplified records
– „Matched“ cited nodes have full records
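The two graph models above can be sketched as data structures. This is a minimal illustration, not code from the project; all field and key names are invented for the example.

```python
from dataclasses import dataclass, field

# --- Bipartite (classic IR) model: citing records point to cited "keys" ---
def cited_key(author, year, journal_key, volume, number, page):
    """Build a cited-node key from first-author name & initials plus source data."""
    return f"{author}|{year}|{journal_key}|{volume}|{number}|{page}"

# Citing node (a full record ID) mapped to its list of cited keys.
bipartite_graph = {
    "citing:12345": [cited_key("Smith J", 2008, "J STEM CELL", 3, 2, 117)],
}

# --- Uniform model: every node is a bibliographic record ---
@dataclass
class Record:
    record_id: str
    full: bool  # True for full records, False for simplified (unmatched cited) nodes
    cites: list = field(default_factory=list)  # record_ids of cited nodes

citing = Record("scopus:12345", full=True, cites=["ref:abc"])
cited = Record("ref:abc", full=False)  # a "matched" cited node would be upgraded to full=True
```

In the bipartite model, two references that generate the same key collapse into one cited node automatically; in the uniform model, that collapsing is exactly the citation-matching problem described next.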
Citation Matching
• Goal: citation network
– Unique nodes for documents
• Sub-tasks:
– Match cited references to each other
– Match cited references to full records
– Match full records across databases
Scopus Citations
• Cited reference info contains
– Up to 8 author names (family name + initials)
• Including the last author
• Frequently as cited (not standardized or corrected)
– Publication year, title, journal name/vol./no./pages
• Frequently as cited
– Reasonably well parsable, but not normalized
Matching Scopus Citations to Scopus Full Records
External matching: Scopus search engine
● „Algorithm“: parse the Scopus reference into subfields, construct complex search queries for the Scopus engine, download the resulting full records, choose the best fit
● High-precision searches: complex searches allowed, many searchable fields
– Improve recall by successively vaguer queries
● Only a small number of downloads allowed, so many queries are needed to construct a sizable citation index
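The "successively vaguer queries" idea can be sketched as a fallback chain: try the most constrained query first and relax it until something matches. This is an illustration only; the field names and the `search` callback are placeholders, not the actual Scopus API.

```python
def build_queries(ref):
    """Yield queries for a parsed reference dict, from most precise to vaguest."""
    yield {k: ref[k] for k in ("author", "year", "title", "journal", "volume", "page") if k in ref}
    yield {k: ref[k] for k in ("author", "year", "journal", "volume") if k in ref}
    yield {k: ref[k] for k in ("author", "year", "title") if k in ref}

def match(ref, search):
    """Run queries until one returns hits; the caller still picks the best fit."""
    for query in build_queries(ref):
        hits = search(query)
        if hits:
            return hits
    return []
```

Precision is protected by ordering: a vague query only runs after every stricter one has come back empty, which keeps the number of ambiguous result lists to filter small.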
Matching Scopus Citations to PubMed Full Records
Cross-DB external match: Scopus/Medline
● „Algorithm“: parse the Scopus reference, construct PubMed batch citation matcher queries, download the matched PubMed(!) records
– Only for biomedical fields
– Result is a citation network of PubMed records, not Scopus records
– Requires matching of the Scopus citing records as well
● Works in either direction (Scopus ↔ PubMed)
● Both databases include PubMed IDs
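Query lines for NCBI's batch citation matcher (the `ecitmatch` E-utility) are pipe-delimited in the order journal | year | volume | first page | author name | user key. A minimal sketch of building such a line, assuming an already-parsed reference dict with illustrative field names:

```python
def ecitmatch_line(ref, key):
    """Format one parsed reference as a PubMed batch citation matcher query line."""
    fields = [
        ref.get("journal", ""),
        str(ref.get("year", "")),
        str(ref.get("volume", "")),
        str(ref.get("first_page", "")),
        ref.get("author", ""),
        key,  # caller-chosen key, echoed back so results can be re-attached
    ]
    return "|".join(fields) + "|"
```

The user key is what makes batch matching practical here: the matcher echoes it back next to the PMID it found, so each hit can be re-attached to the Scopus citing record it came from.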
Matching Web of Science References to WoS Full Records
WoS cited reference info contains
● First author (last name plus initials)
● Publication year
● Source title code
● Vol./num./page
● Increasingly often, a DOI
No title included!
Matching WoS Cited References to WoS Record
External matching via WoS web search
● Only small queries supported
– Many downloads necessary
● Crucial search fields not supported (vol., num.)
– Therefore, highly ambiguous results are to be expected
● Requires translation of the source title from code to full title
● Requires algorithmic filtering of the correct hit from a long result list
Matching WoS references to WoS
● Internal matching
● Kompetenzzentrum Bibliometrie has a full local copy of the WoS data
● Experiment: is there a good „match key“ to support this?
– Dinkel (2011), ISSI
– Results in error estimates for references
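A match key of this kind can be sketched as a normalized concatenation of the fields WoS references actually carry (first author, year, source title code, volume, page). The normalization below is an assumption for illustration, not the key studied in Dinkel (2011):

```python
import re

def normalize(s):
    """Uppercase and strip everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", str(s).upper())

def match_key(author, year, source_code, volume, page):
    """Build a deduplication key from the fields present in a WoS cited reference."""
    return "|".join(normalize(x) for x in (author, year, source_code, volume, page))
```

Two records (or a reference and a record) are treated as the same document when their keys collide; error estimates then follow from counting collisions that a stricter comparison rejects.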
Building a Citation Index for the Social Sciences: CSA
● Basis: Cambridge Scientific Abstracts (Social Sciences)
– To be extended with additional sources of cited-reference info
● Nationwide licensing scheme for Germany, administered at GESIS
● Six CSA/ProQuest databases incorporated into GESIS' „Sowiport“ social sciences portal
– Now including ~8.5 million cited references
● No matchings to full records provided by ProQuest
● Early experimental results available on the portal
– Focus on precision, not recall
CSA References in GESIS' Sowiport Database
● Each full record contains „references“ and „cited-by“ information
– Some with actionable links to full records
● Combines the WoS/Scopus and Google Scholar approaches to citation index construction
CSA reference information
● Fields: citing ID, reference ID, authors, title, year, publisher, source title/num./vol./pages, ISSN
– The format changes over time, though
● Mostly automatically parsed, as fields are frequently mis-assigned
● Example (book):
<CI>200601317</CI><CA>Voice UK</CA>
<CT>No More Abuse.</CT><CY>2000</CY>
<CZ>Derby: Voice UK</CZ>
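A snippet like the one above can be pulled apart with the standard library's XML parser. The tag meanings (CI = reference ID, CA = author, CT = title, CY = year, CZ = publisher/source) are inferred from the example; the wrapping `<ref>` element is added here only to make the fragment well-formed:

```python
import xml.etree.ElementTree as ET

snippet = """<ref>
<CI>200601317</CI><CA>Voice UK</CA>
<CT>No More Abuse.</CT><CY>2000</CY>
<CZ>Derby: Voice UK</CZ>
</ref>"""

root = ET.fromstring(snippet)
ref = {child.tag: child.text for child in root}  # e.g. ref["CT"] is the cited title
```

In practice this structured path is only the easy case; as noted above, mis-assigned fields mean the values still need validation and re-parsing before they can feed a matching query.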
Citation Matching in CSA
„Algorithm“:
● Internal matching
– However, across multiple CSA databases
● Parse references; construct search queries (Solr)
– Exact title and year, or
– Fuzzy title plus year and ISSN
– Choose the first match
● Favors precision over recall
– Fuzzy match only for journal literature, for example
● Research to be continued!
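The two-stage Solr strategy above can be sketched as query construction plus a fallback. The field names (`title_exact`, `title`, `year`, `issn`) and the `search` callback are assumptions for illustration, not the Sowiport schema:

```python
def exact_query(title, year):
    """Stage 1: exact title and year."""
    return f'title_exact:"{title}" AND year:{year}'

def fuzzy_query(title, year, issn):
    """Stage 2: fuzzy title (Solr's trailing ~ per term) plus year and ISSN."""
    fuzzy_terms = " ".join(t + "~" for t in title.split())
    return f"title:({fuzzy_terms}) AND year:{year} AND issn:{issn}"

def find_match(ref, search):
    """Return the first hit of the exact query, else of the fuzzy one, else None."""
    hits = search(exact_query(ref["title"], ref["year"]))
    if not hits and ref.get("issn"):  # fuzzy fallback only when an ISSN is present,
        hits = search(fuzzy_query(ref["title"], ref["year"], ref["issn"]))  # i.e. journal literature
    return hits[0] if hits else None
```

Gating the fuzzy stage on the ISSN is what restricts it to journal literature, which is how the algorithm keeps precision high at the cost of recall on books and chapters.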
Experiments - Datasets
Caveat
● Scopus/PubMed and WoS experiments run on the stem cell research field (biomedical area)
– < 100k citing docs, ~1 million references
– > 95% of refs are to journal articles
● CSA experiment run on social sciences databases
– ~1 million full records, ~10 million references
– Only recent records contain refs
– Many(!!) refs to non-journal articles
Some Rough Numbers
● Scopus ↔ PubMed full-record matching
– > 95% match rate
● Scopus references → Scopus/PubMed full record
– ~90% „exact“ match rate + ~5% fuzzy matches
– ~1% false positives needed to be filtered out
● WoS references → WoS full record
– ~90% match rate
– >> 50% false positives needed to be filtered out
● CSA references → CSA full record
– ~30% match rate
– ~1% false positives
Discussion
CSA matching is much(!) harder
● Social science publication culture
– Books, chapters, and articles
● Published in roughly equal numbers; books are cited most
– Multilingual publishing
● English is not the only language
● Docs may be cited in translation
– Broad referencing behaviour
● Large proportion of references to non-source items
● Biomedical publication culture
– >> 90% of references are to international journal articles
– Near-complete coverage in the WoS/Scopus/PubMed databases
Discussion
=> A first-try high-precision match rate of ~30% is an excellent result
● Close to the expected rate of references to journal articles
● Plenty of research opportunities to improve matching of non-journal literature references to source records
– e.g. to GESIS' own SOLIS / SOFIS / SSOAR databases
– e.g. by crawling Google Scholar for reference links
– You are invited to try your hand at this, too!
● See: GESIS Application Laboratory
Outlook – What we should be doing
Towards a distributed semantic citation index
● Based on digital full-text collections (cooperate with publishers)
● Reference extraction (with contexts)
– Enables sentiment analysis (important in the social sciences)
● Reference matching
– Enables referential semantics
● Open exchange of reference-semantics information
– „<this> paper indexed in our collection cites <that> paper indexed in yours“
● Semi-automatic / computer-aided
– Algorithms + professional indexers (authority files) + crowdsourcing