DESCRIPTION
Presented at the 10th annual Data Harmony Users Group meeting on Wednesday, February 12, 2014 by Bob Kasenchak of Access Innovations, Inc. With the rise of ORCID and other universal databases of researchers and institutions, it is increasingly crucial for publishers to sort out their own data containing named entities. This talk details Access Innovations' approach to author disambiguation, which includes a taxonomy-based solution in addition to algorithmic processes. The presentation includes a case study.
LEVERAGING SEMANTIC FINGERPRINTS FOR BUILDING AUTHOR NETWORKS
Bob Kasenchak
Production Coordinator
DHUG 2014
NAMED ENTITY DISAMBIGUATION
Most publishers (and many other organizations) need to disambiguate lists of:
Persons
  Authors
  Editors
  Members
  Employees
Institutions
  Colleges, Companies, Laboratories, Organizations
Copyright 2014 Access Innovations, Inc.
BUT WHY DISAMBIGUATE?
Facilitate content discovery
  Browse by Author or Institution name
Resolve member, author, marketing lists
Link out to other organizations (e.g., ORCID)
Demonstrate value to stakeholders
  e.g., College libraries less apt to cancel subscriptions if they are shown how many of their professors are published in your content
Market research and analysis
TWO DISAMBIGUATION PROCESSES
Matching algorithms
  String matching
  Fuzzy matching
Leveraging other data associated with each entity to increase matching probability and reduce false matches, such as:
  Country
  Date
  Co-authors
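The two ideas on this slide, name matching plus supporting data, can be sketched in a few lines. This is only an illustration and not the algorithm used in the talk: the record layout (`name`, `country`, `coauthors`), the use of Python's `difflib` for fuzzy matching, and the 0.1 bonus weights are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def exact_match(a: str, b: str) -> bool:
    """String matching: identical after trimming and case-folding."""
    return a.strip().casefold() == b.strip().casefold()


def fuzzy_score(a: str, b: str) -> float:
    """Fuzzy matching: similarity ratio between 0 and 1 from difflib."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def match_score(candidate: dict, record: dict) -> float:
    """Combine name similarity with agreement on other data.

    The 0.1 bonuses are arbitrary illustrative weights, not tuned values.
    """
    score = fuzzy_score(candidate["name"], record["name"])
    # Same country on both sides raises confidence a little.
    if candidate.get("country") and candidate.get("country") == record.get("country"):
        score += 0.1
    # So does sharing at least one co-author.
    if set(candidate.get("coauthors", [])) & set(record.get("coauthors", [])):
        score += 0.1
    return min(score, 1.0)


record = {"name": "Carlson, Thomas A.", "country": "US", "coauthors": ["Smith, J."]}
plain = {"name": "Carlson, Thomas"}
with_country = {"name": "Carlson, Thomas", "country": "US"}
print(match_score(plain, record), match_score(with_country, record))
```

A date check (for example, publication years that fall within a plausible career span) could be added in the same way as the country and co-author bonuses.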
TWO-PHASE WORKFLOW
Initial set of raw data is used to create an authority file
  Questionable names are subject to human review
  Authority file is subject to constant review and cleanup
Entities are extracted from new content and compared to the authority file
  Anomalies are reviewed and matched to existing records or added as new entities
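The second phase, comparing extracted entities to the authority file, might be sketched as a three-way triage: accept a confident match, send a borderline one to an editor, or treat the name as a new entity. The thresholds and the `triage` function below are hypothetical, and `difflib` stands in for whatever matching algorithm is actually used. In practice a proposed new entity would normally also pass through editorial review before being added.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

AUTO_MATCH = 0.92  # assumed threshold: accept without review
REVIEW = 0.75      # assumed threshold: send to an editor


def triage(name: str, authority: List[str]) -> Tuple[str, Optional[str]]:
    """Return (decision, best authority entry) for one extracted name."""
    best, best_score = None, 0.0
    for entry in authority:
        score = SequenceMatcher(None, name.casefold(), entry.casefold()).ratio()
        if score > best_score:
            best, best_score = entry, score
    if best_score >= AUTO_MATCH:
        return "match", best
    if best_score >= REVIEW:
        return "review", best
    return "new", None


authority_file = ["Ohio State University", "Ohio University"]
for incoming in ["Ohio State University", "Ohio State Univ.", "Massachusetts Institute of Technology"]:
    print(incoming, "->", triage(incoming, authority_file))
```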
INSTITUTION DISAMBIGUATION
Having a clean Institution authority file allows for better processing of persons
  The work is easier and more clear-cut
Develop standards and practices, but be prepared to change or add to them as new data comes to light
  Forcing data into a bad paradigm isn’t helpful
  The data should inform your standards and practices
INSTITUTION DISAMBIGUATION FLOW
QUALITY OF RAW DATA MATTERS
Well-formed source data?
Structured or unstructured?
Legacy content?
  Often not as well structured
  Or auto-tagged, so can be unreliable
Parsed using punctuation etc. as delimiters
Common abbreviations and stopwords
Also, leverage country information if available
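As a rough sketch of the parsing step, the snippet below splits an affiliation string on commas, expands a few abbreviations, and drops stopwords. The abbreviation table, the stopword list, and the `parse_affiliation` function are hypothetical and far smaller than anything a real dataset would need. The state-and-ZIP pattern assumes US-style addresses.

```python
import re

# Hypothetical abbreviation table; a real one would be much longer.
ABBREVIATIONS = {
    "univ": "university",
    "inst": "institute",
    "aerosp": "aerospace",
    "dept": "department",
    "mech": "mechanical",
    "eng": "engineering",
}
STOPWORDS = {"of", "the", "and", "for"}

EMAIL = re.compile(r"\S+@\S+")
# Two-letter state code, optionally followed by a ZIP code (US addresses only).
STATE_ZIP = re.compile(r"^[A-Z]{2}(\s+\d{5}\S*)?$")


def normalize_segment(segment: str) -> str:
    """Lower-case, strip punctuation, expand abbreviations, drop stopwords."""
    words = re.findall(r"[a-z]+", segment.lower())
    expanded = [ABBREVIATIONS.get(w, w) for w in words]
    return " ".join(w for w in expanded if w not in STOPWORDS)


def parse_affiliation(raw: str) -> dict:
    """Use commas as delimiters and sort the pieces into rough fields."""
    parsed = {"institution": "", "department": "", "other": []}
    for seg in (s.strip() for s in raw.split(",")):
        if not seg:
            continue
        if EMAIL.fullmatch(seg) or STATE_ZIP.match(seg):
            parsed["other"].append(seg)
        elif seg.lower().startswith(("dept", "department")):
            parsed["department"] = normalize_segment(seg)
        elif not parsed["institution"]:
            parsed["institution"] = normalize_segment(seg)
        else:
            parsed["other"].append(seg)
    return parsed


print(parse_affiliation("Ohio State Univ., Dept. of Mech. Eng., Columbus, OH 43210, mechprof@osu.edu"))
```

Normalizing this way lets variants that differ only in abbreviation or address detail collapse to the same key, while a short form with no recognizable institution word would still need fuzzy matching or editorial review.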
INSTITUTIONS: RAW DATA
Ohio Aerosp. Inst., Cleveland, OH 44142
Ohio Aerospace Institute (OAI)
Ohio Dominican University
Ohio Institute of Technology
Ohio Northern University
Ohio State
Ohio State Univ., Columbus, OH
Ohio State Univ., Columbus, OH 43210
Ohio State Univ., Columbus, OH 43210‐1298
Ohio State Univ., Dept. of Linguist.
Ohio State Univ., Dept. of Mech. Eng., Columbus, OH 43210, mechprof@osu.edu
Ohio University
INSTITUTION DISAMBIGUATION FLOW
HUMAN EDITORIAL REVIEW
Two kinds of human intervention are used:
QC of automated matches for accuracy
  Culls out errors
  Gather data to iteratively adjust matching algorithms
Reviewing non-matched entities
  Match by hand to existing authority file
  Create new listings for new entities
EDITORIAL REVIEW INTERFACE
Institutions to be reviewed
AuthorityFile lookup
Search results
AUTHORS (AND OTHER PERSONS)
Persons are trickier than institutions!
Variants
  Nicknames
  Middle name, initial, or nothing
  Initials
Suffixes and Prefixes
Similar last names
Name changes
Transliterations
NAMES: RAW DATA
Carlson, N.
Carlson, Neil N.
Carlson, P.
Carlson, R. L.
Carlson, R. M. K.
Carlson, R. W.
Carlson, Roy
Carlson, Roy F.
Carlson, T. A.
Carlson, Thomas
Carlson, Thomas A.
Carlson, Thomas J.
Carlson, W. G.
Carlson, William
Carlson, William V.
Which, if any, are the same person?
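A first pass at this question can at least rule out pairs that cannot be the same person. The sketch below is an illustrative assumption and not the method from the talk: it treats an initial as compatible with any given name that starts with the same letter, and treats a missing middle name as compatible with anything. It does not handle nicknames, name changes, or transliterations, and a compatible pair is only a candidate, not a confirmed match.

```python
from typing import List, Tuple


def parse_name(raw: str) -> Tuple[str, List[str]]:
    """Split 'Last, Given M.' into (last name, given-name tokens), lower-cased."""
    last, _, given = raw.partition(",")
    tokens = [t.strip(".").lower() for t in given.split()]
    return last.strip().lower(), [t for t in tokens if t]


def tokens_compatible(a: str, b: str) -> bool:
    """An initial is compatible with any name starting with that letter."""
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def could_be_same_person(x: str, y: str) -> bool:
    last_x, given_x = parse_name(x)
    last_y, given_y = parse_name(y)
    if last_x != last_y:
        return False
    # zip stops at the shorter list, so a missing middle name never conflicts.
    return all(tokens_compatible(a, b) for a, b in zip(given_x, given_y))


pairs = [
    ("Carlson, T. A.", "Carlson, Thomas A."),
    ("Carlson, Thomas A.", "Carlson, Thomas J."),
    ("Carlson, Thomas", "Carlson, Thomas J."),
    ("Carlson, R. L.", "Carlson, Roy F."),
]
for left, right in pairs:
    print(left, "|", right, "->", could_be_same_person(left, right))
```

Note that "Carlson, Thomas" is compatible with both "Thomas A." and "Thomas J.", which is exactly the kind of ambiguity that name strings alone cannot resolve.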
PERSON NAME DISAMBIGUATION FLOW
RESOLVER; SEMANTIC FINGERPRINTS
AUTHOR NAME AUTHORITY FILE
Each author record is linked to other associated data:
  Every DOI (or other document #)
  Every co-author
  Every institution
  Dates of publication
  Subject terms from thesaurus used to index content associated with each person
Each of these is used in the disambiguation algorithm to weight the potential matches of similar names
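One way to picture such a record is as a small data structure that accumulates evidence with every document. The `AuthorRecord` class, its field names, and the identifiers in the usage example are hypothetical. The `name_variants` field is an addition for illustration and is not part of the list above.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterable, Set


@dataclass
class AuthorRecord:
    author_id: str
    preferred_name: str
    name_variants: Set[str] = field(default_factory=set)   # assumption, not in the slide
    document_ids: Set[str] = field(default_factory=set)    # DOIs or other document numbers
    coauthor_ids: Set[str] = field(default_factory=set)
    institution_ids: Set[str] = field(default_factory=set)
    publication_years: Set[int] = field(default_factory=set)
    subject_terms: Counter = field(default_factory=Counter)  # thesaurus term -> count

    def add_document(self, doc_id: str, printed_name: str, year: int,
                     coauthors: Iterable[str], institutions: Iterable[str],
                     terms: Iterable[str]) -> None:
        """Attach one document's metadata to this author."""
        self.document_ids.add(doc_id)
        self.name_variants.add(printed_name)
        self.publication_years.add(year)
        self.coauthor_ids.update(coauthors)
        self.institution_ids.update(institutions)
        self.subject_terms.update(terms)


# Made-up identifiers, for illustration only.
rec = AuthorRecord("A1", "Carlson, Thomas A.")
rec.add_document("doc-001", "Carlson, T. A.", 1975, ["A2"], ["I7"], ["spectroscopy", "photoelectron"])
rec.add_document("doc-002", "Carlson, Thomas A.", 1978, ["A2", "A3"], ["I7"], ["spectroscopy"])
print(rec.subject_terms.most_common(1))
```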
LEVERAGING THESAURUS TERMS
The indexing from every paper by each known author comprises a weighted subject “fingerprint”
Potential matching names from incoming content are associated with the indexing from each paper
Subject terms are compared to potential matches to increase certainty weighting
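A minimal sketch of the fingerprint idea follows. The slides do not say how fingerprints are compared, so cosine similarity over term counts is used here purely as a stand-in. The subject terms and the two candidate authors are invented for the example.

```python
import math
from collections import Counter
from typing import Iterable, List


def build_fingerprint(papers: Iterable[List[str]]) -> Counter:
    """Sum the indexing terms of every paper into a weighted term profile."""
    fingerprint = Counter()
    for terms in papers:
        fingerprint.update(terms)
    return fingerprint


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors (0 means no overlap)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two known authors with the same name string but different subject areas.
chemist = build_fingerprint([
    ["spectroscopy", "photoelectron"],
    ["spectroscopy", "molecular orbitals"],
])
linguist = build_fingerprint([["syntax", "phonology"]])

# Indexing terms of an incoming paper by "Carlson, T."
incoming = Counter(["spectroscopy", "photoelectron"])

print(cosine_similarity(incoming, chemist), cosine_similarity(incoming, linguist))
```

The similarity would then feed into the overall match weighting alongside co-authors, institutions, and dates, instead of deciding the match on its own.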
PERSON NAME DISAMBIGUATION FLOW
EDITORIAL REVIEW INTERFACE
Authors to be reviewed
AuthorityFile lookup
Search results
ITERATIVE PROCESSES
Every batch of new content adds more data for the matching algorithms to use
The authority files should be reviewed by editors for QC to keep the files clean
Editors can suggest tweaks to the algorithm based on the results that are being sent to them for review and QC of the authority files
  Too many obvious matches being kicked out; or
  Bad automatic matches being added to authority files
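The two failure modes named here could be tracked with a simple tally of editorial outcomes. Everything in this sketch is an assumption made for illustration: the `(decision, verdict)` pair format, the labels, and the 0.2 tolerance. It only suggests a direction for tuning and does not adjust anything itself.

```python
from typing import List, Tuple

# Each review is (algorithm decision, editor verdict):
#   decision: "auto" (matched automatically) or "review" (kicked out to an editor)
#   verdict:  "match" or "no_match" (the editor's judgement)
Review = Tuple[str, str]


def feedback_rates(reviews: List[Review]) -> Tuple[float, float]:
    """Return (share of kicked-out items that were matches,
               share of automatic matches that were wrong)."""
    kicked = [verdict for decision, verdict in reviews if decision == "review"]
    auto = [verdict for decision, verdict in reviews if decision == "auto"]
    missed = kicked.count("match") / len(kicked) if kicked else 0.0
    bad_auto = auto.count("no_match") / len(auto) if auto else 0.0
    return missed, bad_auto


def suggest_adjustment(missed: float, bad_auto: float, tolerance: float = 0.2) -> str:
    """Map the two rates to a tuning direction; the tolerance is arbitrary."""
    if missed > tolerance and bad_auto > tolerance:
        return "revisit matching features"
    if missed > tolerance:
        return "lower auto-match threshold"
    if bad_auto > tolerance:
        return "raise auto-match threshold"
    return "no change"


batch = [("review", "match")] * 3 + [("review", "no_match")] + [("auto", "match")] * 5
missed_rate, bad_rate = feedback_rates(batch)
print(missed_rate, bad_rate, suggest_adjustment(missed_rate, bad_rate))
```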
CONTENT-AWARE PROCESSES
Every dataset is different, so the named entity disambiguation processes and algorithms should be modified to suit
More “adjustable” than “one-size-fits-all”
Basic processes can be customized to suit different datasets and client needs
Leveraging thesaurus/subject terms from indexing is a huge addition to the disambiguation algorithms
NAMED ENTITY DISAMBIGUATION PROCESSES AND PROCEDURES
Bob Kasenchak
Project Coordinator
November 20, 2013
Thank You – Any Questions?