
Source: scimaps.org/exhibit/docs/031810-meeting/slides/bennett_presentatio…


Publications, Identity, and Disambiguation

NIH Workshop on Identifiers and Disambiguation in Scholarly Work

Denise Beaubien Bennett Gainesville, FL March 18, 2010

“Until George W. Bush became President, the first President Bush never used his middle initials,” George H.W. Bush’s chief of staff, Jean Becker, says. “But once his son became President, the elder Bush began to realize that it was necessary, to help identify which President Bush was being referred to.”

•  How confident are we that all mentions of plain “George Bush” refer to Senior?

•  Remember that George H.W. Bush had several roles: CIA Director, Ambassador to China, Vice President


Automated disambiguation

•  Scopus

•  Web of Science

•  CiteSeer

•  DBLP author search engine – query interpreted as set of prefixes (implicit truncation) of name parts

•  Author-ity

•  improving recall and precision over time!
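The DBLP bullet above describes a query being read as a set of prefixes of name parts. A minimal sketch of that idea follows. It is not DBLP's implementation: the function names, the case folding, the hyphen handling, and the rule that each query token only needs to prefix at least one name part are all assumptions made for illustration.

```python
def matches_prefix_query(query: str, author_name: str) -> bool:
    """Return True if every query token is a prefix of at least one name part.

    Assumption: matching is case-insensitive and hyphens separate name parts.
    """
    prefixes = query.lower().split()
    parts = author_name.lower().replace("-", " ").split()
    return all(any(part.startswith(p) for part in parts) for p in prefixes)


def search(query: str, names: list[str]) -> list[str]:
    """Filter a list of author names by the prefix query, preserving order."""
    return [name for name in names if matches_prefix_query(query, name)]


if __name__ == "__main__":
    names = ["Denise Beaubien Bennett", "Dennis Bennet", "Ben Dennison"]
    # "den" and "benn" are both truncated; the third name has no part
    # starting with "benn", so it is filtered out.
    print(search("den benn", names))
```

Implicit truncation raises recall (a short query finds many spellings) at the cost of precision, so the results still need to be sorted into individual people afterwards.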

Scopus – snapshot from 2007

2007 – one solid cluster, 6 ambiguous outliers


Scopus in 2010: improving

2010 - one solid cluster, 3 ambiguous outlier names


Web of Science

•  Their example shows incompleteness of disambiguation; continue using all variations

with and without apostrophe
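Where disambiguation is incomplete, the searcher has to supply the name variations by hand. The sketch below generates a few spellings of a surname with and without its apostrophe. The function name and the three variants it produces are assumptions for illustration, and a given database may need other forms.

```python
def apostrophe_variants(surname: str) -> list[str]:
    """Return search variants of a surname with and without its apostrophe.

    Assumption: straight and curly apostrophes are treated as the same mark,
    and three forms are worth trying (kept, removed, replaced by a space).
    """
    normalized = surname.replace("\u2019", "'")
    if "'" not in normalized:
        return [surname]
    candidates = [
        normalized,                    # O'Brien
        normalized.replace("'", ""),   # OBrien
        normalized.replace("'", " "),  # O Brien
    ]
    # dict.fromkeys removes duplicates while keeping the original order.
    return list(dict.fromkeys(candidates))


if __name__ == "__main__":
    print(apostrophe_variants("O'Brien"))
```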


WoS Distinct Author Sets – clustering is improving

DIY disambiguation

Web of Science


CiteSeer – disambiguated (but not perfect)

unclustered items are mostly typos

alternate name resolves to preferred name


Author-ity clusters


Author-ity pairwise ranking

Author-ity ranking results


Author-ity ranking – the bottom super-high probability through 130.

less than 50% with title far off topic

Voluntary Profiles

Author (or proxy) created and maintained

•  Compliance challenges with ingestion and updating

•  Usually include numbers

•  COS Expertise - 480,000 profiles

•  ResearcherID (to be used by ORCID)

•  RePEc Author Service in IDEAS


COS Community of Science

useful tools

18 months ago


ResearcherID

author-controlled profile


ResearcherID - features

value added from WoS – only works on cites in WoS

ResearcherID dups

keywords helpful when present


RePEc Author Service

•  Relies on authors to maintain their profiles and identify articles as written by them

•  23,000+ registered authors and 7,000+ registered non-authors

from 2007: dups & funnies

disambiguated index is much cleaner in 2010

they track lost and deceased authors


In development

•  Cooperative Identities Hub

•  ISNI

•  ORCID


Manual checking

•  no guarantee of perfection

•  scalability

•  MathSciNet

•  Mathematics Genealogy Project

•  ACM


MathSciNet clusters all papers but preserves name on piece


However…

•  Even the small, discipline-specific database of MathSciNet cannot corral all the duplicate names.

–  only half of the entries disambiguated for:

•  Zhang, Lei

•  Zhang, Li

•  Red herring: how many people only author one paper in their career???

–  about 46% in Medline (sec. 3.5)


Many people, same name


MGP -


ACM – discloses the weighting

ACM Digital Library – not quite yet


After we disambiguate, we can:

•  Link / cluster records within the silo

–  highlighting the preferred version

•  Link headings (or records) across silos

•  Analyze / repackage / mashup the data
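The first step in the list above, clustering records within a silo and highlighting a preferred version, can be sketched as follows. This assumes a disambiguation step has already produced pairs of record ids judged to be the same author. The union-find grouping and the "most frequent name form wins" rule are illustrative choices, not the method of any database named in these slides.

```python
from collections import Counter, defaultdict


def cluster_records(records: dict, same_author_pairs: list) -> list:
    """Group record ids into clusters and pick a preferred name per cluster.

    records: record id -> name string as it appears on the piece.
    same_author_pairs: (id, id) pairs already judged to be one person.
    """
    parent = {rid: rid for rid in records}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_author_pairs:
        parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for rid in records:
        clusters[find(rid)].append(rid)

    result = []
    for members in clusters.values():
        counts = Counter(records[m] for m in members)
        # Assumed rule: most frequent form wins; ties go to the longer form.
        preferred = max(counts, key=lambda name: (counts[name], len(name)))
        result.append({"preferred": preferred, "members": sorted(members)})
    return sorted(result, key=lambda c: c["members"])


if __name__ == "__main__":
    records = {
        1: "Bennett, D.",
        2: "Bennett, Denise B.",
        3: "Bennett, Denise B.",
        4: "Bush, George",
    }
    for cluster in cluster_records(records, [(1, 2), (2, 3)]):
        print(cluster)
```

The original name on each record is left untouched; only the cluster carries the preferred form, which is the behavior the MathSciNet slide describes as preserving the name on the piece.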


Linking within a silo

•  more examples -- inspiration from outside the university/research world


Linking in Community-maintained IMDB

others born the same day or year or place

links to people, films, etc. credit!

Community-maintained - MusicBrainz

members & years


Community-maintained - MusicBrainz

please – no “eyes” no “pears” no hyphen


Linking across silos

•  VIAF – Virtual International Authority File

•  Getty ULAN – Union List of Artist Names

•  Names Project - UK individuals and institutions

–  for benefit of institutional and subject repositories

•  BKN People – using Bibliographic Ontology (BIBO) to aggregate author silos

•  rely on local silos for maintenance
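One way to picture linking across silos while leaving maintenance to the local silos is a hub that stores only pointers to local records, never the records themselves. The sketch below is a hypothetical illustration: the `Hub` class, the `hub-N` identifiers, and the silo names are invented here and do not describe how VIAF, ULAN, the Names Project, or BKN People work internally.

```python
from dataclasses import dataclass, field


@dataclass
class HubRecord:
    hub_id: str
    links: dict = field(default_factory=dict)  # silo name -> local record id


class Hub:
    """Holds cross-silo links only; each silo keeps maintaining its own data."""

    def __init__(self):
        self._records = {}   # hub id -> HubRecord
        self._by_local = {}  # (silo, local id) -> hub id
        self._next = 1

    def link(self, *local_refs):
        """Declare that the given (silo, local_id) records are one entity."""
        existing = {self._by_local[ref] for ref in local_refs if ref in self._by_local}
        if len(existing) > 1:
            # Merging two hub records is a judgment call; leave it to a human.
            raise ValueError("records already belong to different hub entries")
        hub_id = existing.pop() if existing else f"hub-{self._next}"
        if hub_id not in self._records:
            self._records[hub_id] = HubRecord(hub_id)
            self._next += 1
        for silo, local_id in local_refs:
            # Simplification: one local id per silo per hub record.
            self._records[hub_id].links[silo] = local_id
            self._by_local[(silo, local_id)] = hub_id
        return hub_id

    def crosswalk(self, silo, local_id, target_silo):
        """Translate a local id in one silo to the linked id in another."""
        hub_id = self._by_local.get((silo, local_id))
        if hub_id is None:
            return None
        return self._records[hub_id].links.get(target_silo)


if __name__ == "__main__":
    hub = Hub()
    hub.link(("silo_a", "a-17"), ("silo_b", "b-903"))
    print(hub.crosswalk("silo_a", "a-17", "silo_b"))
```

Because the hub keeps one local id per silo, duplicates inside a silo would have to be resolved locally first, which matches the de-duplication "within and across silos" distinction raised later in the talk.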


VIAF – linking across files

authority record in BNF (France) matches these other files

Getty Union List of Artist Names

•  ULAN

•  Used mostly by museums

•  Merges multiple authority files

•  Displays all options and sources

•  Guides to preferred name


name variations

preferred among options


relationships

sources

Names Project (UK)


Names Project (UK)


BKN People: uses BIBO


BKN People: uses BIBO


Analyzing / repackaging the data

–  discover outliers through analysis

•  what’s wrong with this picture?

–  run the outliers by human checkers

–  use the analyzed results to refine the disambiguation
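The loop described above (analyze, surface outliers, hand them to human checkers) can be illustrated with a single signal: publication year. The sketch flags items in one author cluster whose year sits far from the cluster's median. The function name and the 25-year threshold are assumptions, and a real check would combine several signals.

```python
from statistics import median


def flag_year_outliers(years_by_item: dict, max_gap: int = 25) -> list:
    """Return ids of items whose year is more than max_gap from the median.

    Flagged items are candidates for human review, not automatic removal.
    """
    mid = median(years_by_item.values())
    return sorted(
        item for item, year in years_by_item.items() if abs(year - mid) > max_gap
    )


if __name__ == "__main__":
    cluster = {"p1": 1998, "p2": 2001, "p3": 2004, "p4": 2009, "p5": 1931}
    # The median year is 2001; p5 is 70 years away and gets flagged.
    print(flag_year_outliers(cluster))
```

Decisions made by the checkers (split the item off, or confirm it) can then be fed back as labeled examples to refine the automated step.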


WorldCat Identities

more than birth/death dates

the fun stuff

Anne O’Tate (Author-ity) analyze by address

note the fractions of addresses


Anne O’Tate (Author-ity) analyze by topic

neat clustering, compared to “Topics” with 324 results

analyze – author’s impact within silo

IDEAS / RePEc


MathSciNet collaboration distance

the Kevin Bacon of Math

How close are these authors?
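Collaboration distance is the length of the shortest chain of coauthorships connecting two authors. A minimal sketch using breadth-first search over a coauthor graph follows. The function names and the toy data are assumptions, and this is not MathSciNet's implementation. It only gives meaningful answers once author names have been disambiguated, since two people merged under one name would create false shortcuts.

```python
from collections import defaultdict, deque


def build_coauthor_graph(papers: list) -> dict:
    """Build an undirected graph: author -> set of coauthors."""
    graph = defaultdict(set)
    for authors in papers:
        for author in authors:
            graph[author].update(a for a in authors if a != author)
    return graph


def collaboration_distance(graph: dict, start: str, goal: str):
    """Return the number of coauthorship steps from start to goal, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        author, dist = queue.popleft()
        for coauthor in graph.get(author, ()):
            if coauthor == goal:
                return dist + 1
            if coauthor not in seen:
                seen.add(coauthor)
                queue.append((coauthor, dist + 1))
    return None  # no chain of coauthorships connects the two


if __name__ == "__main__":
    papers = [["A", "B"], ["B", "C"], ["D"]]
    graph = build_coauthor_graph(papers)
    print(collaboration_distance(graph, "A", "C"))
```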


DBLP Vis – coauthor intensity

see # papers with coauthor when mouse-over a year

DBLP Vis – coauthor timecolor

see fatter boxes on graph when mouse-over a year


Features to help disambiguate

•  affiliation (how many addresses/year?)

•  email address

•  coauthors

•  keywords from source or all metadata

•  dates - degree years, expected range

•  web page – URL and other data

•  caution - what fuzziness/distance is acceptable? differences by disciplines?
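The features listed above can be combined into a pairwise "same author?" score for two records that share a name. The sketch below uses exact matches for email and affiliation and Jaccard overlap for coauthors and keywords. The weights are arbitrary placeholders chosen for illustration, not values from Author-ity or any other system in this talk, and the record fields are hypothetical.

```python
def jaccard(a, b) -> float:
    """Overlap of two collections: |intersection| / |union| (0.0 if both empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Placeholder weights that sum to 1.0; real systems would learn or tune these.
WEIGHTS = {"email": 0.4, "affiliation": 0.2, "coauthors": 0.25, "keywords": 0.15}


def same_author_score(rec1: dict, rec2: dict) -> float:
    """Score in [0, 1]; higher means more evidence the records share an author."""

    def exact(key):
        # A missing value is no evidence either way, so it scores 0.
        return 1.0 if rec1.get(key) and rec1.get(key) == rec2.get(key) else 0.0

    features = {
        "email": exact("email"),
        "affiliation": exact("affiliation"),
        "coauthors": jaccard(rec1.get("coauthors", ()), rec2.get("coauthors", ())),
        "keywords": jaccard(rec1.get("keywords", ()), rec2.get("keywords", ())),
    }
    return sum(WEIGHTS[name] * value for name, value in features.items())


if __name__ == "__main__":
    a = {"email": "dbb@example.edu", "affiliation": "Univ. A",
         "coauthors": {"X", "Y"}, "keywords": {"libraries"}}
    b = {"email": "dbb@example.edu", "affiliation": "Univ. B",
         "coauthors": {"Y", "Z"}, "keywords": {"libraries"}}
    print(round(same_author_score(a, b), 3))
```

The caution above applies directly: the cutoff for calling two records "the same" is a judgment that may differ by discipline, and a low keyword overlap can simply mean one author with many interests.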

Use with care: one author, many interests


For contemplation and discussion


Assigning numbers

•  Centralized numbering system – governance issues, unpalatable to some

•  Individual small silo numbering – can be highly accurate

•  Record linking across files – easily accomplished

•  Getting started -- authors could include number(s) with all contact info


Trustworthiness

•  Am I in control of all of my publications?

•  If I’m logged in (to ResearcherID, via my university account, etc.) and I indicate “these items are mine,” should you trust my accuracy?

•  Have I captured all of my items?

–  variants on my name

–  items I forgot

–  items credited without my awareness


Issues to explore

•  Ingestion vs. maintenance

–  very different problems

–  author compliance needed?

•  De-duplication (within and across silos)

•  Management and cooperation for updating

•  Scalability

•  Automated vs. manual techniques

•  Optimizing computational performance

•  Long tail of one-hit authors (how much attention?)


Researchers, projects, products, models

•  Great review (by the Author-ity folks) Smalheiser NR, Torvik VI. (2009) Author name disambiguation.

Databases and those who created or tinkered with them

•  MathSciNet

•  ULAN

•  DBLP - Han

•  CiteSeer – Giles, Han

•  IMDB – Malin

•  ANAC – Levy sheet music

•  Medline – Torvik and Smalheiser

•  D-Dupe - Getoor

•  rexa.info – McCallum

•  VIAF - Hickey