"Unique," "Descriptive," and Other Damned Lies: The Challenges of Identifying...

HATHITRUST A Shared Digital Repository

“Unique,” “Descriptive,” and Other Damned Lies: The Challenges of Identifying Related Records

Valerie Glenn and Bill Dueber

LITA Forum

November 14, 2015

Overview

• Introduction/Background

• What we’re trying to do & why

• What is a Federal Government Document?

• What’s been done

• Next steps

Background

• 2011 Constitutional Convention – Ballot Initiative #4

• Resolved: “that HathiTrust facilitate collective action to create a comprehensive digital corpus of U.S. federal publications including those issued by GPO and other federal agencies”

• Resolved: “that HathiTrust develop a process of catalog record review to ensure accurate and full display of U.S. federal publications including those issued by GPO and other federal agencies”

What are we trying to do?

• Define the corpus of US federal documents

• Identify documents that aren’t in the HathiTrust Digital Library

• Find documents and digitize them

What’s Been Done

• Matching on Identifiers
  • OCLC #
  • LCCN
  • ISSN
  • SuDoc Call number

• “Duplicates”

• Related (parts of the same series, etc.)
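Identifier matching of this kind can be sketched in a few lines. The example below is only an illustration, not the HathiTrust code: the record layout, the `normalize_oclc` and `group_by_identifier` helpers, and the sample values are all assumptions made up for this sketch. It reduces different spellings of an OCLC number to bare digits, then groups record ids that share a number.

```python
import re
from collections import defaultdict


def normalize_oclc(raw):
    """Reduce spellings such as '(OCoLC)ocm00012345' to bare digits ('12345')."""
    digits = re.sub(r"\D", "", raw)      # drop prefixes, spaces, punctuation
    return digits.lstrip("0") or None    # drop leading zeros; None if nothing is left


def group_by_identifier(records, field="oclc"):
    """Map each normalized identifier to the ids of the records that carry it."""
    groups = defaultdict(list)
    for rec in records:
        for raw in rec.get(field, []):
            key = normalize_oclc(raw)
            if key:
                groups[key].append(rec["id"])
    # Only identifiers shared by more than one record are interesting here.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical sample records, invented for the example.
records = [
    {"id": "a", "oclc": ["(OCoLC)ocm00012345"]},
    {"id": "b", "oclc": ["12345"]},
    {"id": "c", "oclc": ["ocn999"]},
]
candidates = group_by_identifier(records)
```

A shared identifier only nominates candidates. As the slide notes, the records in a group may be true duplicates or merely related (parts of the same series, for example), so each group still needs a second look.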

Enumeration and Chronology

Image found at http://goo.gl/qkrd0Q

Quick Record-matching Quiz #1

•Mathematical preparation for general physics with calculus / by Davidson, Ronald C. Published: 1973

•A textbook of oral pathology by Shafer, William G. Published: 1974

Quick Record-matching Quiz #2

•Mathematical preparation for general physics with calculus / by Davidson, Ronald C. Published: 1973

•Mathematical preparation for general physics with calculus / by Davidson, Ronald Published: 1973
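One way to put a number on how close the two quiz records are is plain string similarity. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the `normalize` and `similarity` helpers are assumptions for this illustration, not the method used in the project.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def similarity(a, b):
    """Return a ratio between 0.0 and 1.0; higher means more alike."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


physics_full = ("Mathematical preparation for general physics with calculus / "
                "by Davidson, Ronald C. Published: 1973")
physics_short = ("Mathematical preparation for general physics with calculus / "
                 "by Davidson, Ronald Published: 1973")
pathology = "A textbook of oral pathology by Shafer, William G. Published: 1974"

close_pair = similarity(physics_full, physics_short)
far_pair = similarity(physics_full, pathology)
```

Here `close_pair` comes out well above `far_pair`. A high score only flags a pair for review; it cannot say whether two near-identical strings describe the same item.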

Quick Record-matching Quiz #3

What is the most reliable unique identifier in all of Libraryland?

OCLC Number

FEEL BAD!!!!!

Enum/Chron

FEEL BAD!!!!!

Examples

1985

v. 3

NO. 1-12 1963-64

This stuff we can parse with a few dozen lines of Ruby, or even a regex.
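For the three easy cases above, a single regular expression is enough. The sketch below is in Python rather than Ruby, and the pattern, the `parse_simple` function, and the output layout are assumptions for illustration only. It accepts an optional labelled number or range, an optional year or year range, and nothing else.

```python
import re

SIMPLE = re.compile(
    r"""^\s*
    (?:(?P<label>v|no|pt)\.?\s*(?P<start>\d+)(?:-(?P<end>\d+))?)?   # e.g. "NO. 1-12"
    \s*
    (?:(?P<year>\d{4})(?:-(?P<year_end>\d{2,4}))?)?                 # e.g. "1963-64"
    \s*$""",
    re.IGNORECASE | re.VERBOSE,
)


def parse_simple(enumchron):
    """Return a dict for the easy cases, or None when the string is anything else."""
    m = SIMPLE.match(enumchron)
    if not m or not enumchron.strip():
        return None
    out = {}
    if m.group("label"):
        start = int(m.group("start"))
        out["label"] = m.group("label").lower()
        out["numbers"] = (start, int(m.group("end") or start))
    if m.group("year"):
        first = m.group("year")
        last = m.group("year_end") or first
        # Expand an abbreviated end year: "1963-64" means 1963 to 1964.
        last = first[: 4 - len(last)] + last
        out["years"] = (int(first), int(last))
    return out
```

Anything the pattern does not fully cover returns `None`, which keeps the easy cases from silently swallowing the hard ones.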

Examples

V. 138 NO. 125-127 PT. 2 SEP 15-17 1992

NO. 3-4, 8, 13, 15-20, 22-23, 25, 27-28, 30-31, 33, 39-41, 43-44:V. 1, 45-46, 48-58, 63-66, 68-81, 83-91, 95, 99-113, 115-128, 130-133, 135-136, 144-145, 147-148, 151, 155, 157, 159, 162-164, 173-174, 178, 180, 182, 185, 190, 195, 198-199, 201-202, 205, 207-208
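A long list of numbers and ranges like the one above can be expanded mechanically, as long as the oddities are set aside instead of guessed at. The sketch below is a hypothetical helper (`expand_numbers` is not from the talk): it splits on commas, expands plain numbers and ranges, and returns every piece it does not understand, such as "43-44:V. 1", as a leftover.

```python
import re

ITEM = re.compile(r"^(\d+)(?:-(\d+))?$")              # "8" or "15-20"
LABEL = re.compile(r"^\s*NOS?\.\s*", re.IGNORECASE)   # leading "NO." or "NOS."


def expand_numbers(enumchron):
    """Return (sorted issue numbers, list of pieces that could not be parsed)."""
    body = LABEL.sub("", enumchron)
    numbers, leftovers = set(), []
    for piece in body.split(","):
        piece = piece.strip()
        m = ITEM.match(piece)
        if m:
            start = int(m.group(1))
            end = int(m.group(2) or start)
            if start <= end:
                numbers.update(range(start, end + 1))
                continue
        if piece:
            # Anything else (e.g. "43-44:V. 1") is kept for a human or a smarter parser.
            leftovers.append(piece)
    return sorted(numbers), leftovers


held, unparsed = expand_numbers("NO. 3-4, 8, 13, 15-20, 43-44:V. 1, 45-46")
```

Returning the leftovers explicitly makes it possible to measure how much of the data a parser like this actually covers.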

Examples

V. 33:NO. 36-54+SS1-4;SUP. ;ANNUAL SUMM. 1984

Examples

31-40D

V. 45:NO. 7-9V. 45:NO. 7-92008

2011:pt.1 (1.501-1.640) = P.1 (1.501-1.640)/2011

V 11-13,14b/d no 11ab - 14 Jul 93 + abs 1992/93 c-f not e index

Examples

982

NOS. 9-1461 WITH MANY EXCEPTIONS

So...where are we?

• Parser up over 1000 lines with a long way to go

• “parse” about 65% of enumchron (3.5M)

• Not at all sure they’re all right

• ...or how to compare them

• ...or how to do gap detection

• ...or what to do with the other 35%
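Once an enumchron has been reduced to a list of held numbers, the simplest form of gap detection is to look for jumps between consecutive values. The sketch below assumes that reduction has already happened and that numbering is continuous, which the examples above show is often not true; `find_gaps` is a hypothetical name used only here.

```python
def find_gaps(held):
    """Return (first_missing, last_missing) ranges between the lowest and highest held numbers."""
    numbers = sorted(set(held))
    gaps = []
    for prev, cur in zip(numbers, numbers[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps


missing = find_gaps([3, 4, 8, 13, 15, 16])
```

This says nothing about issues before the first held number or after the last one, and it cannot tell a real gap from a number the publisher never issued.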

FEEL BAD!!!!!

Next steps

• Refine enum/chron parsing

• String matching

• Automated gap detection

How to find out more

• HathiTrust Registry of US Federal Government Documents: http://www.hathitrust.org/usdocs_registry

• Contact Bill: dueberb@umich.edu, @billdueber

• Contact Valerie: valglenn@umich.edu, @vdglenn

Thank you! Questions?
