Where Preservation Meets Mass Digitization John A. Kunze California Digital Library LAUC Fall Assembly, UC Merced, 16 November 2007

2

The UC Libraries’ Digital Preservation Program

UC-wide program: serves all 10 UC campuses
– 208,000 students
– 121,000 faculty and staff
– 10+ libraries
– Museums

Located at the CDL

4

Preservation challenges: case studies

With benefit of hindsight, what’s hard?

• Policy

• Making files small

• Fast data transfer

• Cheap, reliable storage

• Lots of annoying files

• Preserving the revenue stream

5

What’s digital preservation?

Storing digital objects while retaining a balance of usability and faithfulness (truthiness) to their creators’ original intentions

6

Policy Challenges

• How faithful

• How long

• How many replicas

• How much manipulation

• Right(s)mare

7

Fast data transfer challenges

Lots of files, lots of data
• Could take months to move and replicate

Explore data transfer / replication options
• Test with CDL and New York University

Survey tool performance and usability

Continuing conversations with the San Diego Supercomputer Center and the Library of Congress with goal of creating guidelines

8

Transfer tools tested

Ubiquitous, usual suspects: RSYNC, SCP, SFTP, FTP
• MogileFS (simple distributed filesystem, Perl scripts) http://www.danga.com/mogilefs/
• High Performance SSH (no system gaming) http://www.psc.edu/networking/projects/hpn-ssh/

But parallelism really works:
• GridFTP (high security, from Grid community) http://www.globus.org/grid_software/data/gridftp.php
• SRB (bundled Sget/Sput tools) http://www.sdsc.edu/srb/index.php/Main_Page
• BBFTP (easy installation and use) http://doc.in2p3.fr/bbftp/
• BBCP (easy installation and use) http://www.slac.stanford.edu/~abh/bbcp/

Practically, combine parallelism with common tools: 20 x SCP!
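The "20 x SCP" idea can be sketched in a few lines of Python. This is a minimal illustration, not the scripts used in the CDL tests: the function names (`partition`, `scp_command`, `parallel_copy`) and the round-robin batching are assumptions made for the example, and it presumes `scp` is installed with key-based login already set up for the destination.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def partition(paths, n_streams):
    """Deal paths round-robin into at most n_streams non-empty batches."""
    batches = [paths[i::n_streams] for i in range(n_streams)]
    return [batch for batch in batches if batch]


def scp_command(batch, destination):
    """Build one scp invocation: scp -q file1 file2 ... host:/dir."""
    return ["scp", "-q", *batch, destination]


def parallel_copy(paths, destination, n_streams=20, run=subprocess.run):
    """Run up to n_streams scp processes at once; return their exit codes.

    `run` is injectable so the batching logic can be exercised without a
    network (hypothetical convenience, not part of any tool named above).
    """
    batches = partition(list(paths), n_streams)
    if not batches:
        return []
    with ThreadPoolExecutor(max_workers=len(batches)) as pool:
        results = list(pool.map(lambda b: run(scp_command(b, destination)), batches))
    return [result.returncode for result in results]
```

A call such as `parallel_copy(files, "user@host:/incoming/", n_streams=20)` (hypothetical host and path) would start up to twenty concurrent SCP sessions. Threads are enough here because each one only waits on an external process.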

9

Upload Comparison

10

Making many files small

Now we know how to move millions of files

How to make them smaller?

11

What is mass digitization?

Large-scale scanning of newspapers, books, videos, etc. from the world’s major libraries
– Millions of items/hours to digitize, e.g.,

12

Why mass digitization?

For better access and search
– Page images remotely accessible
– OCR (Optical Character Recognition) makes text visible to search engines

Mass digitization is, for us, not intended to replace the physical item

13

“Page Image Compression for Mass Digitization”

A study of page image tradeoffs with:
• National Library of France (BnF)
• Harvard University Libraries (HUL)
– With Google Book Search: G9 Libraries – Harvard, Michigan, Stanford, NYPL, Oxford, University of California, etc.
• University of California Berkeley (UCB) and the California Digital Library (CDL)
– With Open Content Alliance: Internet Archive, Microsoft, University of Toronto, etc.

Presented at IS&T Archiving 2007, Arlington, May 2007

14

Mass book digitization tradeoffs

For our millions of volumes
• Need to strike balance between size of the files and quality of the reading experience
• Images need to work with OCR
• Possibility of re-printing books (print on demand), but this was not investigated formally

Recommendations common to all 3 groups:
• JPEG 2000 JP2 (ISO/IEC 15444-1) file format
• An all color, all lossy solution is feasible
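The size side of this tradeoff is simple arithmetic, sketched below in Python. Every number in the usage note (page size, resolution, bit depth, compression ratio, pages per volume) is an illustrative assumption, not a figure from the study, and both function names are made up for the example.

```python
def page_image_bytes(width_in, height_in, dpi, bits_per_pixel, compression_ratio):
    """Estimated size in bytes of one compressed page image.

    compression_ratio is uncompressed size divided by compressed size,
    e.g. 20 for a 20:1 lossy encoding (an assumed value, not a recommendation).
    """
    pixels = round(width_in * dpi) * round(height_in * dpi)
    raw_bytes = pixels * bits_per_pixel / 8
    return raw_bytes / compression_ratio


def collection_terabytes(volumes, pages_per_volume, bytes_per_page):
    """Total storage in decimal terabytes for a collection of page images."""
    return volumes * pages_per_volume * bytes_per_page / 1e12
```

For instance, under the assumption of a 6 x 9 inch page scanned at 300 dpi in 24-bit color and compressed 20:1, `page_image_bytes(6, 9, 300, 24, 20)` gives 729,000 bytes per page, and one million such volumes of 300 pages each would need about 219 TB. Halving the compression ratio doubles the storage, which is why the size and quality question matters at this scale.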

15

Text pages: point size mixes, foxing, handwriting

16

Text page: fonts, paper color, bleed-through

17

Text page: wordy, tight 2-cols, uneven ink

(details)

18

Color page: high information density

(detail)

19

Color page: over-exposed, fine lines

20

Grayscale: coarse half-tones

(detail)

21

Don’t forget audio/video

Case: Swedish National Archive of Sound and Moving Images is digitizing 6 million hours of material
– 50 different recording formats and catalogs, growing 10% per annum
– E.g., 500,000 hours of open-reel 4-track using 16 simultaneous players, 8 players per operator
– E.g., 220,000 hours of VHS using 12 simultaneous players

Digitizing and ingesting 42 TB/month
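A back-of-the-envelope calculation shows what these figures imply. The sketch below assumes real-time playback, players running around the clock, and a 30-day month with decimal terabytes. None of those assumptions comes from the archive itself, and the function names are illustrative.

```python
def wall_clock_years(content_hours, simultaneous_players, hours_per_day=24.0):
    """Years needed to play content_hours in real time across parallel players."""
    days = content_hours / simultaneous_players / hours_per_day
    return days / 365.0


def sustained_megabytes_per_second(terabytes_per_month, days_per_month=30.0):
    """Average ingest rate in MB/s implied by a monthly volume (decimal units)."""
    seconds = days_per_month * 86400
    return terabytes_per_month * 1e12 / seconds / 1e6


open_reel_years = wall_clock_years(500_000, 16)      # roughly 3.6 years
vhs_years = wall_clock_years(220_000, 12)            # roughly 2.1 years
ingest_rate = sustained_megabytes_per_second(42)     # roughly 16 MB/s
```

Even with 16 players running nonstop, the open-reel material alone takes several years, and 42 TB/month works out to a steady stream of about 16 MB/s that the storage and network must absorb without interruption.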

22

Cheap, reliable storage

OK, we can make files smaller and we can move lots of them quickly, but can we make disk cheaper and still reliable?

• RAID (Redundant Arrays of Inexpensive Disks) 1980s

• JBOD (Just a Bunch of Disks) 1990s

• MAID (Massive Arrays of Idle Disks) 2000s

23

Lots of annoying files, or “making files fewer”

Origin: web archiving

Solution: aggregate W/ARC file format
– Many “files” in one file for speed and ease
– Records are sort of peers of files

Generalization to mass digitization and other processing products

W/ARC File Anatomy

WARC = Web ARChive file format

[Diagram: a W/ARC file is a sequence of W/ARC records; each record has two parts]

• Text header: length, source URI, date, type, …

• Content block: e.g., HTTP response headers and length bytes of HTML, GIF, PDF, …

Append at will

WARC is fast track ISO work item
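The record anatomy above (text header, then a content block whose size the header declares) can be illustrated with a small Python sketch. This is a simplified layout with made-up field names (`Type`, `Source-URI`, `Date`, `Length`), intended only to show the "many files in one file, append at will" idea. It is not a conforming WARC reader or writer.

```python
import io
from datetime import datetime, timezone


def append_record(stream, uri, rec_type, payload, date=None):
    """Append one record: text header, blank line, content block, CRLF."""
    date = date or datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header = (
        f"Type: {rec_type}\r\n"
        f"Source-URI: {uri}\r\n"
        f"Date: {date}\r\n"
        f"Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    stream.seek(0, io.SEEK_END)          # records are only ever appended
    stream.write(header + payload + b"\r\n")


def read_records(stream):
    """Yield (header_fields, payload) for each record, in file order."""
    stream.seek(0)
    while True:
        line = stream.readline()
        if not line:                     # end of file
            return
        fields = {}
        while line not in (b"\r\n", b""):
            name, _, value = line.decode("utf-8").rstrip("\r\n").partition(": ")
            fields[name] = value
            line = stream.readline()
        payload = stream.read(int(fields["Length"]))  # length-delimited block
        stream.readline()                # consume the CRLF after the block
        yield fields, payload
```

Because each content block is delimited by its declared length rather than by a terminator, a record can hold arbitrary bytes (HTML, GIF, PDF), and a reader can skip a record without parsing its contents.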

25

Digitizing the Digital

Origin: preservation of revenue stream

Case of Data Desiccation: creating no-frills, sometimes feature-poor derivatives that retain most of the original scholarly value but are likely to be less perishable than the original format (similar to “digital microfilm”)

Save desiccated derivatives along with original, just in case no one ever again
• Has the funds to touch files
• Has the expertise to convert them properly

26

Example

Photo of Mission San Luis de Tolosa
[2]About the City [3]Visiting SLO [4]What’s New [5]City Government [6]Employment Opportunities [7]Bids & Proposals [8]Economic Development [9]FAQs [10]How are we doing?

City of San Luis Obispo
About the City

[Choose a Destination....] [11]Search [12]Contact Us [13]City Home

A Brief History

Who we are and how we got started. The City of San Luis Obispo serves as the commercial, governmental and cultural hub of California’s Central Coast. One of California’s oldest communities, it began with the founding of Mission San Luis Obispo de Tolosa in 1772 by Father Junípero Serra as the fifth mission in the California chain of 21 missions. The mission was named after Saint Louis, a 13th Century Bishop of Toulouse, France. (San Luis Obispo is Spanish for "St. Louis, the Bishop".) It was first incorporated in 1856 as a General Law City, and became a Charter City in 1876.

Where we’re located. With a population of 44,000, the City is located eight miles from the Pacific Ocean and is midway between San Francisco and Los Angeles at the junction of Highway 101 and scenic Highway 1. San Luis Obispo is the County Seat, and a number of federal and state regional offices and facilities are located here, including Cal Poly State University, Cuesta Community College, Regional Water Quality Board and Caltrans District offices. The City’s ideal weather and natural beauty provide numerous opportunities for outdoor recreation at nearby City and State parks, lakes, beaches and wilderness areas.

Great place to live and visit. While San Luis Obispo grew relatively…

27

Example continued: endnotes

…[18]About the City | [19]Visiting SLO | [20]What’s New | [21]City Government | [22]Employment [23]Bids & Proposals | [24]Economic Development | [25]FAQs | [26]How are we doing? [27]©2006, City of San Luis Obispo

References

1. http://www.ci.san-luis-obispo.ca.us/briefhistory.asp#content
2. http://www.ci.san-luis-obispo.ca.us/about.asp
3. http://www.ci.san-luis-obispo.ca.us/visit.asp
4. http://www.ci.san-luis-obispo.ca.us/whatsnew.asp
5. http://www.ci.san-luis-obispo.ca.us/government.asp
6. http://www.ci.san-luis-obispo.ca.us/humanresources/index.asp
7. http://www.ci.san-luis-obispo.ca.us/finance/bids.asp
8. http://www.ci.san-luis-obispo.ca.us/economicdevelopment/index.asp
9. http://www.ci.san-luis-obispo.ca.us/faq.asp
10. http://www.ci.san-luis-obispo.ca.us/how.asp
11. http://www.ci.san-luis-obispo.ca.us/search2.asp
12. http://www.ci.san-luis-obispo.ca.us/contact.asp
13. http://www.ci.san-luis-obispo.ca.us/index.asp
14. http://www.ci.san-luis-obispo.ca.us/visit.asp
…
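The desiccated form shown in this example (plain text with numbered link markers, followed by a reference list of URLs) can be approximated with Python's standard `html.parser`. This is a minimal sketch of the idea, not the tool that produced the example. The `Desiccator` class and `desiccate` function are hypothetical names, and the sketch ignores images, scripts, and layout.

```python
from html.parser import HTMLParser


class Desiccator(HTMLParser):
    """Collect visible text, replacing each link with a numbered marker."""

    def __init__(self):
        super().__init__()
        self.text = []    # text fragments and [n] markers, in document order
        self.links = []   # link targets, in order of appearance

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
                self.text.append(f"[{len(self.links)}]")

    def handle_data(self, data):
        self.text.append(data)


def desiccate(html):
    """Return a plain-text derivative: body text, then numbered references."""
    parser = Desiccator()
    parser.feed(html)
    parser.close()
    body = " ".join("".join(parser.text).split())   # collapse whitespace
    refs = "\n".join(f"{i}. {url}" for i, url in enumerate(parser.links, 1))
    return body + "\n\nReferences\n\n" + refs
```

The derivative loses fonts, images, and navigation behavior but keeps the words and the link targets, which is the "no-frills but less perishable" tradeoff that desiccation aims for.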

28

Desiccation and Mass Digitization?

How to make the OCR’d plain text version of a book as acceptable as possible?

Very difficult problem: cf. work of Project Gutenberg and Distributed Proofreaders
– Born-digital plain text prettier than OCR
– Page numbers, footnotes, sidebars
– Multiple columns and reading order

At the same time, page/section/chapter structural layout is a mass digitization feature frontier

29

Questions?

[email protected]