HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Page 1: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

HATHITRUST: SHARING THE CARE AND FEEDING OF THE ELEPHANT

John Weise, Chris Powell, and Kat Hagedorn
University of Michigan Libraries

Page 2: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Introduction

HathiTrust ingests and integrates digital content produced by a variety of systems, processes, practices, and workflows at partner institutions.

•  Google
•  Internet Archive
•  Locally scanned, e.g., Yale, Michigan, and several others.

Page 3: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Some of Michigan's Hats

•  Google partner
•  HathiTrust administrator
    o  Specifications and guidelines
    o  Ingest manager/gatekeeper
•  HathiTrust partner
•  Michigan as Michigan
    o  MDP scans to HT (i.e., Google scans)
    o  Local scans to HT
    o  Legacy migration to HT
    o  Investigate and fix problems

Page 4: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Making Decisions

Try as we might, to do what is right, there may be more than one right answer.

Page 5: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

The aggregation of content in HathiTrust has revealed outcroppings in the data landscape that were not as apparent when segregated.

Page 6: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

We won't talk about...

•  HathiTrust governance, the many benefits of partnership, or the lawsuit.

•  Users, data mining, or preservation per se, but they are inherent throughout.

•  Google's scanning processes except to illustrate a point.

Page 7: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

In a nutshell

We're contemplating the impact of independent decisions made in the past on preservation and access today.

Page 8: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

To do this, we'll talk about...

•  Michigan's digital library heritage.
•  The impact of local decisions on global preservation and access.
•  Meaningful vs. meaningless variations in practice.
•  Variations in quality.
•  The benefits of aggregation for preservation.
•  Where we can go from here.

Page 9: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Our mass digitization heritage

Page 10: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Large scale, but sharp focus

•  Collaborative, but separate
•  Curated
    o  Condition
    o  Completeness
    o  Metadata availability
    o  Restricted scope
    o  Meaningfulness within the context of the collection
•  Separate systems obscured variation in application of agreed-upon standards

Page 11: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Now these texts are moving into an environment where the sharp focus that defined their previous online existence is less meaningful, and some shortcomings are now exposed.

Page 12: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Michigan's Local Legacy

•  5K-10K volumes/year back to the 1990s
•  24K volumes migrated to HathiTrust
•  Relatively painstaking process
    o  Why?

Page 13: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Reasons volumes don't make the automated move

•  A record for the item cannot be located in the catalog
•  Non-standard naming conventions
•  Skips in file sequence
•  Bitonal TIFF images aren't 600 dpi
•  Various TIFF header anomalies
•  JPEG2000 images that don't contain resolution information
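
Two of the checks above, naming conventions and skips in file sequence, can be sketched in a few lines of Python. This is only an illustration, not the actual ingest validation: the naming pattern (a zero-padded 8-digit sequence number with a .tif or .jp2 extension) and the function name check_volume are assumptions made for the example.

```python
import re

# Assumed convention for this sketch only: an 8-digit, zero-padded
# sequence number followed by .tif or .jp2 (e.g. 00000001.tif).
NAME_PATTERN = re.compile(r"^(\d{8})\.(tif|jp2)$")


def check_volume(filenames):
    """Return (nonstandard_names, missing_sequence_numbers) for one volume."""
    nonstandard = []
    seen = set()
    for name in filenames:
        match = NAME_PATTERN.match(name)
        if match:
            seen.add(int(match.group(1)))
        else:
            nonstandard.append(name)
    if not seen:
        return nonstandard, []
    # Any number between 1 and the highest one seen that has no file is a skip.
    missing = [n for n in range(1, max(seen) + 1) if n not in seen]
    return nonstandard, missing


files = ["00000001.tif", "00000002.tif", "00000004.jp2", "page5.tif"]
nonstandard, missing = check_volume(files)
print("Non-standard names:", nonstandard)
print("Skipped sequence numbers:", missing)
```

Checks on resolution and TIFF headers would need an image library and are left out of this sketch.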

Page 14: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Successful volumes sharing the larger repository aren't all the same

•  Different libraries (even within the same institution)

•  Different materials (books, journals, photos)
•  Different physical formats
•  Different languages and scripts
•  Different application of standards (including MARC)
•  Different decisions made along the way

Page 15: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Meaningful vs. meaningless variation

•  Variation you want to maintain vs. variation you want to obscure

•  Need for consensus
•  Need for certainty that solutions are truly global
•  Why is this variation occurring?
•  How can you spot variation in such a large pool?
•  How are truly meaningful variants identified and preserved?
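
One simple way to start spotting variation in a large pool is to tally the values of a single attribute across all volumes and look at the rare ones. The sketch below assumes each volume is described by a plain dictionary; the field name "dpi", the 5% threshold, and the function name rare_values are hypothetical choices for illustration, not part of any HathiTrust tooling.

```python
from collections import Counter


def rare_values(records, field, threshold=0.05):
    """Return {value: count} for values of `field` seen in fewer than
    `threshold` (as a fraction) of the records."""
    counts = Counter(record.get(field) for record in records)
    total = sum(counts.values())
    return {value: n for value, n in counts.items() if n / total < threshold}


# Hypothetical data: 30 volumes scanned at 600 dpi and one at 400 dpi.
volumes = [{"id": i, "dpi": 600} for i in range(30)] + [{"id": 30, "dpi": 400}]
print(rare_values(volumes, "dpi"))
```

A tally like this only surfaces the variants; deciding whether a rare value is meaningful or meaningless still takes human judgment.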

Page 16: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Digitization Decisions: Page Features/Book Structures

Page 17: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Digitization Decisions: Omissions

•  It's impossible to illustrate what you have omitted

•  It's also impossible to find where omissions occurred

Page 18: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Digitization Decisions: Inserts

Page 19: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Cataloging differences

Page 20: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Even among brief descriptions

Page 21: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

And among expanded descriptions

Page 22: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

The combined repository gives you a fresh and broader look at your collections and your practices.

Page 23: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Content quality problems

•  Issues we see with quality can be found in any collection

•  Some are unavoidable or were based on a particular decision due to resource issues

•  Some can be given special treatment if they occur frequently or are anticipated

•  There's a trade-off, naturally
    o  decision between a pristine corpus and a massively useful corpus

Page 24: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Focus on potential physical volume errors NOT volume scan errors

These are volume scan errors...

Warp

Skew

Page 25: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

RTL and upside-down (e.g., Japanese)

Page 26: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Unfolded foldouts

Page 27: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Pagetagging gone awry

Page 28: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Faint text

Page 29: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Pages misnumbered and duplicated in physical volume

page 135

page 139, which should be page 136

Page 30: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Pages missing in the physical volumes

page 96

page 99

pages 97 and 98 are not in volume
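
Breaks like these can be flagged automatically when the printed page numbers are available in scan order. The following is a minimal sketch under that assumption; the function name find_page_anomalies is hypothetical, and the check cannot tell whether a break comes from the physical volume or from the scan.

```python
def find_page_anomalies(page_numbers):
    """Return (index, previous, current) for every position where the
    printed page number is not exactly one more than the one before it."""
    anomalies = []
    for index in range(1, len(page_numbers)):
        previous = page_numbers[index - 1]
        current = page_numbers[index]
        if current != previous + 1:
            anomalies.append((index, previous, current))
    return anomalies


# Pages 97 and 98 absent from the volume: the sequence jumps from 96 to 99.
print(find_page_anomalies([95, 96, 99, 100]))

# A misnumbered page: 139 printed where 136 belongs.
print(find_page_anomalies([134, 135, 139, 137]))
```

Each flagged position still needs a look at the page images to decide whether the volume itself is at fault.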

Page 31: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Benefits of corpus

•  Preservation
•  Noting provenance and process of creating these digitized volumes
•  Aggregation
•  Ability to compare volumes
•  Reveal potential solutions to problems
•  Certification of particular volumes

Page 32: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

More hands make lighter work

•  Working with institutions collectively rather than one at a time

•  Working together to find common models and workflows

•  Share experience and develop policies to mitigate newly discovered issues and maintain the corpus

Page 33: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Lessons we're learning as we go

•  You do NOT have to solve everything at once
•  Don't let potential problems prevent you from moving forward
•  Decide what is most important and where to use your resources, and do it at the beginning of your project if at all possible

Page 34: HathiTrust: Sharing the Care and Feeding of the Elephant: Digital Library Federation Forum 2012

Contact info

•  www.hathitrust.org
•  John: [email protected]
•  Chris: [email protected]
•  Kat: [email protected]