How to Build Your Own Citation Index
First-hand experience with WoS, Scopus, and CSA reference data
Philipp Mayr, Frank Sawitzky, Andreas Strotmann
(GESIS – Leibniz Institute for the Social Sciences, Cologne)
Background I
● Author cocitation and collaboration network mining and visualization
– e.g. Bubela, Strotmann, et al. (2010), Cell Stem Cell: researchers' commercialization activity tends to reduce their collaboration breadth
Background II
Citation Index for the Social Sciences
● GESIS' Sowiport portal
– 18 databases, including 6 CSA databases, all social sciences
– CSA comes with cited refs for some docs
– SSOAR – extract refs from OA full text and index in Sowiport
– Crawl Google Scholar for citations to “our” docs
Two Models of Citation Graphs
Bipartite (Classic IR) Model: Citing and Cited Partitions
• Citing nodes: full bibliographic records
• Cited nodes: „keys“, e.g.
– First author name & initials + year of publication + journal key + volume + number + page
Uniform Model: Interconnected Documents
• All nodes: bibliographic records
– Citing nodes: full records
– Cited nodes: mostly simplified records
– „Matched“ cited nodes have full records
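The two graph models above can be sketched as data structures. This is a minimal illustration, not code from the project; all field and key names are invented for the example.

```python
from dataclasses import dataclass, field

# --- Bipartite (classic IR) model: citing records point to cited "keys" ---
def cited_key(author, year, journal_key, volume, number, page):
    """Build a cited-node key from first-author name & initials plus source data."""
    return f"{author}|{year}|{journal_key}|{volume}|{number}|{page}"

# Citing node (a full record ID) mapped to its list of cited keys.
bipartite_graph = {
    "citing:12345": [cited_key("Smith J", 2008, "J STEM CELL", 3, 2, 117)],
}

# --- Uniform model: every node is a bibliographic record ---
@dataclass
class Record:
    record_id: str
    full: bool  # True for full records, False for simplified (unmatched cited) nodes
    cites: list = field(default_factory=list)  # record_ids of cited nodes

citing = Record("scopus:12345", full=True, cites=["ref:abc"])
cited = Record("ref:abc", full=False)  # a "matched" cited node would be upgraded to full=True
```

In the bipartite model, two references that generate the same key collapse into one cited node automatically; in the uniform model, that collapsing is exactly the citation-matching problem described next.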
Citation Matching
• Goal: citation network
– Unique nodes for documents
• Sub-tasks:
– Match cited references to each other
– Match cited references to full records
– Match full records across databases
Scopus Citations
• Cited reference info contains
– Up to 8 author names (family name + initials)
• Including the last author
• Frequently as cited (not standardized or corrected)
– Publication year, title, journal name/vol./no./pages
• Frequently as cited
– Reasonably well parsable, but not normalized
Matching Scopus Citations to Scopus Full Records
External matching: Scopus search engine
● „Algorithm“: parse the Scopus reference into subfields, construct complex search queries for the Scopus engine, download the resulting full records, choose the best fit
● High-precision searches: complex searches allowed, many searchable fields
– Improve recall by successively vaguer queries
● Only a small number of downloads allowed, so many queries are needed to construct a sizable citation index
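The "successively vaguer queries" idea can be sketched as a fallback chain: try the most constrained query first and relax it until something matches. This is an illustration only; the field names and the `search` callback are placeholders, not the actual Scopus API.

```python
def build_queries(ref):
    """Yield queries for a parsed reference dict, from most precise to vaguest."""
    yield {k: ref[k] for k in ("author", "year", "title", "journal", "volume", "page") if k in ref}
    yield {k: ref[k] for k in ("author", "year", "journal", "volume") if k in ref}
    yield {k: ref[k] for k in ("author", "year", "title") if k in ref}

def match(ref, search):
    """Run queries until one returns hits; the caller still picks the best fit."""
    for query in build_queries(ref):
        hits = search(query)
        if hits:
            return hits
    return []
```

Precision is protected by ordering: a vague query only runs after every stricter one has come back empty, which keeps the number of ambiguous result lists to filter small.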
Matching Scopus Citations to PubMed Full Records
Cross-DB external match: Scopus/Medline
● „Algorithm“: parse the Scopus reference, construct PubMed batch citation matcher queries, download the matched PubMed(!) records
– Only for biomedical fields
– Result is a citation network of PubMed records, not Scopus records
– Requires matching of the Scopus citing records as well
● Works in either direction (Scopus ↔ PubMed)
● Both databases include PubMed IDs
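Query lines for NCBI's batch citation matcher (the `ecitmatch` E-utility) are pipe-delimited in the order journal | year | volume | first page | author name | user key. A minimal sketch of building such a line, assuming an already-parsed reference dict with illustrative field names:

```python
def ecitmatch_line(ref, key):
    """Format one parsed reference as a PubMed batch citation matcher query line."""
    fields = [
        ref.get("journal", ""),
        str(ref.get("year", "")),
        str(ref.get("volume", "")),
        str(ref.get("first_page", "")),
        ref.get("author", ""),
        key,  # caller-chosen key, echoed back so results can be re-attached
    ]
    return "|".join(fields) + "|"
```

The user key is what makes batch matching practical here: the matcher echoes it back next to the PMID it found, so each hit can be re-attached to the Scopus citing record it came from.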
Matching Web of Science References to WoS Full Records
WoS cited reference info contains
● First author (last name plus initials)
● Publication year
● Source title code
● Vol./num./page
● Increasingly often, a DOI
No title included!
Matching WoS Cited References to WoS Record
External matching via WoS web search
● Only small queries supported
– Many downloads necessary
● Crucial search fields not supported (vol., num.)
– Therefore, highly ambiguous results are to be expected
● Requires translation of the source title from code to full title
● Requires algorithmic filtering of the correct hit from a long result list
Matching WoS references to WoS
● Internal matching
● Kompetenzzentrum Bibliometrie has a full local copy of the WoS data
● Experiment: is there a good „match key“ to support this?
– Dinkel (2011), ISSI
– Results in error estimates for references
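A match key of this kind can be sketched as a normalized concatenation of the fields WoS references actually carry (first author, year, source title code, volume, page). The normalization below is an assumption for illustration, not the key studied in Dinkel (2011):

```python
import re

def normalize(s):
    """Uppercase and strip everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", str(s).upper())

def match_key(author, year, source_code, volume, page):
    """Build a deduplication key from the fields present in a WoS cited reference."""
    return "|".join(normalize(x) for x in (author, year, source_code, volume, page))
```

Two records (or a reference and a record) are treated as the same document when their keys collide; error estimates then follow from counting collisions that a stricter comparison rejects.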
Building a Citation Index for the Social Sciences: CSA
● Basis: Cambridge Scientific Abstracts (Social Sciences)
– To be extended with additional sources of cited-reference info
● Nationwide licensing scheme for Germany, administered at GESIS
● Six CSA/ProQuest databases incorporated into GESIS' „Sowiport“ social sciences portal
– Now including ~8.5 million cited references
● No matchings to full records provided by ProQuest
● Early experimental results available on the portal
– Focus on precision, not recall
CSA References in GESIS' Sowiport Database
● Each full record contains „references“ and „cited-by“ information
– Some with actionable links to full records
● Combines the WoS/Scopus and Google Scholar approaches to citation index construction
CSA reference information
● Fields: citing ID, reference ID, authors, title, year, publisher, source title/num./vol./pages, ISSN
– The format changes over time, though
● Mostly automatically parsed, as fields are frequently mis-assigned
● Example (book):
<CI>200601317</CI><CA>Voice UK</CA>
<CT>No More Abuse.</CT><CY>2000</CY>
<CZ>Derby: Voice UK</CZ>
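A snippet like the one above can be pulled apart with the standard library's XML parser. The tag meanings (CI = reference ID, CA = author, CT = title, CY = year, CZ = publisher/source) are inferred from the example; the wrapping `<ref>` element is added here only to make the fragment well-formed:

```python
import xml.etree.ElementTree as ET

snippet = """<ref>
<CI>200601317</CI><CA>Voice UK</CA>
<CT>No More Abuse.</CT><CY>2000</CY>
<CZ>Derby: Voice UK</CZ>
</ref>"""

root = ET.fromstring(snippet)
ref = {child.tag: child.text for child in root}  # e.g. ref["CT"] is the cited title
```

In practice this structured path is only the easy case; as noted above, mis-assigned fields mean the values still need validation and re-parsing before they can feed a matching query.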
Citation Matching in CSA
„Algorithm“:
● Internal matching
– However, across multiple CSA databases
● Parse references; construct search queries (Solr)
– Exact title and year, or
– Fuzzy title plus year and ISSN
– Choose the first match
● Favors precision over recall
– Fuzzy match only for journal literature, for example
● Research to be continued!
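The two-stage Solr strategy above can be sketched as query construction plus a fallback. The field names (`title_exact`, `title`, `year`, `issn`) and the `search` callback are assumptions for illustration, not the Sowiport schema:

```python
def exact_query(title, year):
    """Stage 1: exact title and year."""
    return f'title_exact:"{title}" AND year:{year}'

def fuzzy_query(title, year, issn):
    """Stage 2: fuzzy title (Solr's trailing ~ per term) plus year and ISSN."""
    fuzzy_terms = " ".join(t + "~" for t in title.split())
    return f"title:({fuzzy_terms}) AND year:{year} AND issn:{issn}"

def find_match(ref, search):
    """Return the first hit of the exact query, else of the fuzzy one, else None."""
    hits = search(exact_query(ref["title"], ref["year"]))
    if not hits and ref.get("issn"):  # fuzzy fallback only when an ISSN is present,
        hits = search(fuzzy_query(ref["title"], ref["year"], ref["issn"]))  # i.e. journal literature
    return hits[0] if hits else None
```

Gating the fuzzy stage on the ISSN is what restricts it to journal literature, which is how the algorithm keeps precision high at the cost of recall on books and chapters.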
Experiments - Datasets
Caveat
● Scopus/PubMed and WoS experiments run on the stem cell research field (biomedical area)
– < 100k citing docs, ~1 million references
– > 95% of refs are to journal articles
● CSA experiment run on social sciences databases
– ~1 million full records, ~10 million references
– Only recent records contain refs
– Many(!!) refs to non-journal articles
Some Rough Numbers
● Scopus ↔ PubMed full-record matching
– > 95% match rate
● Scopus references → Scopus/PubMed full record
– ~90% „exact“ match rate + ~5% fuzzy matches
– ~1% false positives needed to be filtered out
● WoS references → WoS full record
– ~90% match rate
– >> 50% false positives needed to be filtered out
● CSA references → CSA full record
– ~30% match rate
– ~1% false positives
Discussion
CSA matching is much(!) harder
● Social science publication culture
– Books, chapters, and articles
● Published in roughly equal numbers; books are cited most
– Multilingual publishing
● English is not the only language
● Docs may be cited in translation
– Broad referencing behaviour
● Large proportion of references to non-source items
● Biomedical publication culture
– >> 90% of references are to international journal articles
– Near-complete coverage in the WoS/Scopus/PubMed databases
Discussion
=> A first-try high-precision match rate of ~30% is an excellent result
● Close to the expected rate of references to journal articles
● Plenty of research opportunities to improve matching of non-journal literature references to source records
– e.g. to GESIS' own SOLIS / SOFIS / SSOAR databases
– e.g. by crawling Google Scholar for reference links
– You are invited to try your hand at this, too!
● See: GESIS Application Laboratory
Outlook – What we should be doing
Towards a distributed semantic citation index
● Based on digital full-text collections (cooperate with publishers)
● Reference extraction (with contexts)
– Enables sentiment analysis (important in the social sciences)
● Reference matching
– Enables referential semantics
● Open exchange of reference-semantics information
– „<this> paper indexed in our collection cites <that> paper indexed in yours“
● Semi-automatic / computer-aided
– Algorithms + professional indexers (authority files) + crowdsourcing