HATHI TRUST A Shared Digital Repository HathiTrust Digital Library Cooperation for Preservation


HATHI TRUST A Shared Digital Repository

HathiTrust Digital Library

Cooperation for Preservation

Outline

• About HathiTrust
  – Mission & Goals
• Background
• What we do
  – Services
• How we do it
  – Governance
  – Partnership & Resources
  – Technology
• Future Directions

What is HathiTrust

• Shared Digital Repository
  – Launched 2008 by 25 institutions (now 26)
  – Initial focus on digitized book and journal content
  – Expanding to non-book/non-journal, born digital
  – “Light” archive
• Collaboration
  – Preservation and access
  – Print collections
  – Local services
  – Public Good

Background

History

• Michigan Digitization Project 2004
• “…U of M shall have the right to use the U of M Digital Copy, in whole or in part at U of M's sole discretion, as part of services offered in cooperation with partner research libraries such as the institutions in the Digital Library Federation…”

History

• Collective Agreement with CIC Announced in June 2007

• CIC agreed to establish a shared digital repository

History

The Partners

• When announced in October 2008, partners included:
  – University of California system
  – CIC (Committee on Institutional Cooperation)
  – University of Virginia

University of Chicago
University of Illinois
Indiana University
University of Iowa
University of Michigan
Michigan State University
University of Minnesota
Northwestern University
Ohio State University
Pennsylvania State University
Purdue University
University of Wisconsin-Madison

Columbia University

The Name

• The meaning behind the name
  – Hathi (hah-tee): Hindi for elephant
  – Big, strong
  – Never forgets, wise
  – Secure
  – Trustworthy

Content Distribution

As of February 1:
• 5,323,716 - Total
• 764,481 - Public Domain

Content Growth

What we do

Services

How we do it

Governance

[Organization chart: HathiTrust, Executive Committee, Strategic Advisory Board. Labels in the chart: Budget/Finances, Decision-making, Policy, Planning]

Executive Committee

• Paul Courant, University Librarian and Dean of Libraries, UM
• Laine Farley, Executive Director, CDL
• John King, Vice Provost for Academic Information, UM
• Paula Kaufman, University Librarian and Dean of Libraries, UI
• Brian Schottlaender, University Librarian, UCSD
• Ed Van Gemert, Director of Libraries, UW - Madison
• Brenda Johnson, Dean of Libraries, IU
• Brad Wheeler, Chief Information Officer, IU
• John Wilkin, Executive Director of HathiTrust and Associate University Librarian, LIT, UM

Strategic Advisory Board

• Ed Van Gemert (Chair), Director of Libraries, UW - Madison
• John Butler, Associate University Librarian for Information Technology, U Minn
• Patricia Cruse, Director, Preservation, CDL
• Bernie Hurley, Director, Library Technologies, UC Berkeley
• R. Bruce Miller, University Librarian, UC - Merced
• Sarah Pritchard, University Librarian, Northwestern
• Paul Soderdahl, Director, LIT, U Iowa
• John Wilkin, Executive Director, HathiTrust (ex officio)

Partnership & Resources (1)

• Funded for an initial 5 years with base funding from partners

• Budget – separately held within UMich budget system, managed by the Executive Committee

• Cost Model – Per GB cost of storage per year with a one-time fee on new content to build a capital fund

• Review in 3rd yr of each 5 yr period
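The per-GB cost model above reduces to simple arithmetic. The following is a minimal sketch only: the slides give no dollar figures, so the rates are hypothetical placeholders, and charging the one-time capital-fund fee per GB of new content is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostRates:
    """Hypothetical rates; the slides do not state actual figures."""
    per_gb_per_year: float      # recurring storage charge
    one_time_per_new_gb: float  # assumed per-GB fee feeding the capital fund


def annual_charge(stored_gb: float, new_gb: float, rates: CostRates) -> float:
    """Charge for one year.

    stored_gb is the partner's total content held during the year
    (including anything newly deposited); new_gb is the portion that
    was deposited this year and therefore incurs the one-time fee.
    """
    if stored_gb < 0 or new_gb < 0 or new_gb > stored_gb:
        raise ValueError("expected 0 <= new_gb <= stored_gb")
    recurring = stored_gb * rates.per_gb_per_year
    one_time = new_gb * rates.one_time_per_new_gb
    return recurring + one_time
```

In later years the same content is charged only the recurring part, since `new_gb` for it drops to zero.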

Partnership & Resources (2)

• Staff/Expertise – highly integrated
  – Project managers, IT and communications staff, copyright experts, administrators (UM, Indiana and UC taking the lead)
• Working groups
• UM recently hired a Digital Preservation Librarian
• Shared development space

Financial contributions of partners

HathiTrust Functional Framework

Partnership & Resources (3)

• Toward a Cloud Library
  – CLIR, Mellon Foundation
  – OCLC Research, NYU, HathiTrust, Recap Libraries

• Objective: Characterize the near-term opportunity for externalizing management of academic research collections leveraging capacity of large-scale shared print and digital repositories*

• Outcomes: opportunity and risk assessment based on aggregate collection analysis; draft service agreement enabling generic consumer library to selectively outsource preservation and access of low-use research collections to large-scale print and digital repositories

*From the RLG Partner Update January 7, 2010

Partnership & Resources (4)

• CRL TRAC Audit
  – Portico and HathiTrust assessments timely
  – “Certification will augment CRL’s strategic archiving of print, and support a responsible transition to electronic-only formats where appropriate.”
  – Work with UC to design shared print journal archiving effort
  – “With this hybrid strategy CRL hopes to enable its community to accelerate the shift to electronic-only resources in a careful and responsible manner.”

* http://www.crl.edu/archiving-preservation/digital-archives/certification-and-assessment-digital-repositories

Partnership & Resources (5)

• New cost model
• Based on benefits to institutions
  – Public Domain
  – In-copyright
• Volumes “held”

Partnership & Resources (6)

• Timeline:
  – Implement in 2013
  – Accept new partners now with costs based on overlap calculations
• Requirements:
  – Print holdings database
  – Update mechanisms
  – Manual remediation

Technology - OAIS

[Diagram: HathiTrust components mapped to the OAIS model. Labels in the diagram:
• GRIN; Internal Data Loading
• Google; [OCA]; In-house Conversion
• MARC record extensions (Aleph); Rights DB
• Page Turner; HathiTrust API; OAI; GeoIP DB; CNRI Handles; [Solr]
• METS/PREMIS object; TIFF G4/JPEG2000; OCR; MD5 checksums
• METS object; PNG; OCR; PDF
• Isilon; Site Replication; TSM; MD5 checksum validation
• GROOVE (JHOVE)]

Technology – Architecture

• Inbound validation, standards-based object storage and related metadata

• Storage in Ann Arbor and Indianapolis
• Encrypted backup to 3rd location
• Rights database for rights metadata
• Online catalog as source and storage for descriptive metadata

Technology - Ingest

• Automatic validation in GROOVE
  – Check barcode check digit using Luhn algorithm
  – Fixity check on JPEG2000, TIFF, UTF-8 using MD5
  – Well-formedness and embedded metadata check on JPEG2000, TIFF, UTF-8 using JHOVE
• Creation of METS and PREMIS
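The first two validation steps above can be illustrated with a short sketch. This is not GROOVE's code: the function names are hypothetical, the barcode is assumed to be a plain digit string whose last digit is the Luhn check digit, and the expected MD5 is assumed to arrive as a hex string alongside each file.

```python
import hashlib
from typing import BinaryIO


def luhn_is_valid(barcode: str) -> bool:
    """Return True if the digit string passes the Luhn check."""
    if len(barcode) < 2 or not (barcode.isascii() and barcode.isdigit()):
        return False
    total = 0
    # Walk right to left; the check digit is position 0 and is not doubled.
    for position, char in enumerate(reversed(barcode)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


def md5_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a binary stream without loading it whole."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(stream: BinaryIO, expected_md5: str) -> bool:
    """Compare a freshly computed digest with the one supplied at ingest."""
    return md5_of_stream(stream) == expected_md5.strip().lower()
```

Reading in chunks keeps memory use flat for large page images; MD5 here serves as a fixity check against accidental corruption, not as a security measure.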

• Isilon storage
• Simple filesystem layout
  – One directory per volume, zip file and METS file
  – Use of a namespace allows for conflicting identifiers
  – Namespaces for institutions and, if needed, types of identifiers within the institution
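A small sketch shows how a namespace keeps colliding identifiers apart in a one-directory-per-volume layout. The directory scheme, file names, and the `locate_volume` helper are all hypothetical; the slides state only that each volume has its own directory holding a zip file and a METS file.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class VolumeLocation(NamedTuple):
    directory: PurePosixPath
    zip_file: PurePosixPath
    mets_file: PurePosixPath


def locate_volume(root: str, namespace: str, local_id: str) -> VolumeLocation:
    """Map (namespace, local identifier) to a per-volume directory.

    Assumed layout: <root>/<namespace>/<local_id>/ containing
    <local_id>.zip and <local_id>.mets.xml.
    """
    for part in (namespace, local_id):
        # Reject anything that could escape the intended directory.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"unsafe path component: {part!r}")
    directory = PurePosixPath(root) / namespace / local_id
    return VolumeLocation(
        directory=directory,
        zip_file=directory / f"{local_id}.zip",
        mets_file=directory / f"{local_id}.mets.xml",
    )
```

Two institutions that both use the local identifier `12345` resolve to different directories because the namespace is part of the path.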

Technology - Repository

• Why METS?
  – Can serve as Archival Information Package and a Dissemination Information Package
  – Designed to record the relationship between pieces of complex digital objects
  – Can be created automatically as texts are loaded or reloaded
  – Preservation actions (PREMIS)

Technology – METS Object

• What’s there?

– metsHdr with an ID and CREATEDATE

– 2 dmdSecs: MARCXML and mdRef

– amdSec containing one techMD with PREMIS metadata

– fileSec with 4 fileGrps (zip, images, OCR, hOCR)

– Physical structMap tying together files with metadata (pg. numbers and features)
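The sections listed above can be sketched as an empty METS skeleton built with Python's standard library. The element names come from the METS schema; the ID values, the `USE` labels (taken from the slide's wording), and the `build_mets_skeleton` helper are illustrative assumptions, and the result is not claimed to validate against the schema or to match HathiTrust's actual profile.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_mets_skeleton(object_id: str) -> ET.Element:
    root = ET.Element(q("mets"), {"OBJID": object_id})

    # metsHdr with an ID and CREATEDATE
    created = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    ET.SubElement(root, q("metsHdr"), {"ID": "HDR1", "CREATEDATE": created})

    # Two dmdSecs: one wrapping MARCXML, one pointing elsewhere via mdRef
    wrapped = ET.SubElement(root, q("dmdSec"), {"ID": "DMD1"})
    ET.SubElement(wrapped, q("mdWrap"), {"MDTYPE": "MARC"})
    referenced = ET.SubElement(root, q("dmdSec"), {"ID": "DMD2"})
    ET.SubElement(referenced, q("mdRef"), {"MDTYPE": "MARC", "LOCTYPE": "OTHER"})

    # amdSec containing one techMD that carries PREMIS metadata
    amd = ET.SubElement(root, q("amdSec"))
    tech = ET.SubElement(amd, q("techMD"), {"ID": "TECH1"})
    ET.SubElement(tech, q("mdWrap"), {"MDTYPE": "PREMIS"})

    # fileSec with four fileGrps
    file_sec = ET.SubElement(root, q("fileSec"))
    for use in ("zip", "images", "OCR", "hOCR"):
        ET.SubElement(file_sec, q("fileGrp"), {"USE": use})

    # Physical structMap; per-page divs would link files to page metadata
    struct = ET.SubElement(root, q("structMap"), {"TYPE": "physical"})
    ET.SubElement(struct, q("div"), {"TYPE": "volume"})
    return root
```

Calling `ET.tostring(build_mets_skeleton("example.0001"), encoding="unicode")` serializes the skeleton for inspection.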

Technology – METS Object

Future Directions

Future Directions (1)

Future Directions (2)

Links

• Catalog, Full-text search, and Collection Builder
  – http://catalog.hathitrust.org
• METS and PREMIS implementation
  – http://www.hathitrust.org/preservation
• Technical profile
  – http://www.hathitrust.org/technology
• Technical flow diagram
  – http://www.hathitrust.org/documents/HathiTrust-PASIG-200910.pdf
  – http://www.hathitrust.org/documents/HathiTrust-PASIG-notes-200910.pdf
• Rights management
  – http://www.hathitrust.org/rights_management
• TRAC
  – http://www.hathitrust.org/accountability

Thank You!

hathitrust-info@umich.edu

jjyork@umich.edu

http://www.hathitrust.org
