Ebooks: digitizing our print collections
Sian Meikle, University of Toronto Libraries

Digitizing our print collections
About mass digitization
  Who is digitizing?
  What is getting digitized?
  What is the output?
  Digitization issues
Integration of print-to-digital content
  Choices for discovery and access
  Issues for delivery

Partners: 11 libraries; several publishers
  University of California
  National Library of Catalonia
  University Complutense of Madrid
  Harvard University
  University of Michigan
  New York Public Library
  Oxford University
  Stanford University
  University of Texas at Austin
  University of Virginia
  University of Wisconsin at Madison

Scanning: in-copyright & out-of-copyright
In-copyright:
  Searching: all content fully indexed
  Display: snippets can be viewed, full text can be bought or located in library
Out-of-copyright:
  PDFs can be downloaded for personal use, bought, or located in library

Google output
  Metadata
  Scanned images: TIFFs
  Access derivatives:
    JPEGs
    Image-based PDFs (one per page or one per book)
    Uncorrected OCR

Partners:
  Internet Archive
  29 libraries
  Publishers: O’Reilly Media
  Industrial partners: MSN, HP Labs, Adobe, Xerox

Open Content Alliance library partners
  Boston Public Library
  Boston Library Consortium
  Columbia University
  Emory University
  European Archive
  Indiana University
  Johns Hopkins University Libraries
  McMaster University
  Memorial University of Newfoundland
  Missouri Botanical Garden
  MSN
  National Archives (United Kingdom)
  National Library of Australia
  Rice University
  Tufts University
  San Francisco Public Library
  Simon Fraser University
  Smithsonian Libraries
  University of Alberta
  University of British Columbia
  University of California
  University of Chicago Library
  University of Georgia
  University of Illinois Urbana-Champaign
  University of North Carolina
  University of Ottawa
  University of Pittsburgh
  University of Texas
  University of Toronto
  University of Virginia
  Washington University
  York University

Scanning: out-of-copyright and copyright-cleared material
Searching:
  Full content via MSN Live Book Search
  Metadata via Internet Archive
Use: all scans & derivatives can be downloaded

Open Content Alliance output
  Metadata
  Scanned images: TIFFs/J2Ks/CR2s
  Access derivatives:
    JPEGs
    DjVu (viewer, requires plugin)
    Flip book (viewer, does not require plugin)
    Image-based PDFs (one per book)
    Uncorrected OCR integrated into the PDFs

Mass digitization content
  900,000 pre-1923 titles
  60% are unique
  40% have more than one manifestation

[Chart: Published page count, in billions of pages, for three periods: pre-1923, 1923-1963, and post-1963. Data courtesy Microsoft, 2007.]

Comparison: scanned & born-digital

Scanned from print                   Born-digital
Page images                          E-text
Search uncorrected OCR               Search text
TOC, title page, index are marked    Can be highly segmented, linked
Literature, history, …               STM, social sciences, reference, …

Mass digitization: some Q&A

Duplication
  Q: How do we guard against duplication?
  A: It might be cheaper just to scan duplicates.

Omissions
  Q: What about fold-outs, uncut pages, tightly bound books, print running into margins…
  A: Mass digitization works because it is efficient. A parallel process should handle exception cases.

Mass digitization at U of Toronto

Not scanned: 2,400 (8%)

Scanned: 32,000 (92%)

Method 1: Union digital repository
Internet Archive (OCA)
  E-books integrated with non-book content
  User contributions (content, reviews)
  Other sites can point to this content

Method 2: Full text search repository
MSN Live Books and Google Books
  Both cross-book & intra-book searching
  Google’s goal is to index
  MSN is developing a reading environment

Google

MSN Live Book Search

Why load it locally?
Safekeeping
  Lots of copies keep stuff safe!
Discovery
  Integration with licensed books
  Integration with non-book content
  Local subject specialization

Method 3: Local load
University of Michigan
  E-books linked from OPAC
  Rights system decides who can view:
    Nobody
    University of Michigan
    United States
    World
  In-book searching: OCR, one-at-a-time
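
A rights system with the four viewing levels listed above could be sketched roughly as follows. This is only an illustration, not the University of Michigan's implementation: the names `Audience` and `can_view` are hypothetical, and the sketch assumes the levels nest (a University of Michigan reader also counts as a United States reader, who also counts as a World reader).

```python
from enum import IntEnum


class Audience(IntEnum):
    """Viewing levels from the slide, ordered from narrowest to widest."""
    NOBODY = 0
    UNIVERSITY = 1       # University of Michigan readers
    UNITED_STATES = 2
    WORLD = 3


def can_view(book_audience: Audience, reader_group: Audience) -> bool:
    """Decide whether a reader may open a book.

    book_audience: the widest group allowed to view the book.
    reader_group:  the narrowest group the reader belongs to
                   (UNIVERSITY, UNITED_STATES, or WORLD).
    Assumption: groups nest, so a book open to a wider group is also
    open to every narrower group.
    """
    if reader_group is Audience.NOBODY:
        raise ValueError("a reader must belong to at least the WORLD group")
    if book_audience is Audience.NOBODY:
        return False     # closed to everyone
    return book_audience >= reader_group


if __name__ == "__main__":
    # A book restricted to the United States, requested by three readers.
    for reader in (Audience.UNIVERSITY, Audience.UNITED_STATES, Audience.WORLD):
        print(reader.name, can_view(Audience.UNITED_STATES, reader))
```

Storing one level per book keeps the decision to a single comparison; a real system would also need a way to classify each reader, which is outside this sketch.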

MBooks at University of Michigan
Download and validation:
  Local data mover GROOVE (Perl & MySQL)
Data integrity:
  MD5 fixity checks on JPEGs, TIFFs, UTF-8
Quality assurance:
  GROOVE samples 20-page chunks for students to check with ACDSee
  Problems referred to Google for later correction
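
The data-integrity and quality-assurance steps above can be illustrated with a short sketch. It is not GROOVE itself (which the slide describes as Perl and MySQL). It assumes the fixity check is an MD5 checksum comparison against a manifest mapping file names to expected digests, and that sampling picks one run of 20 consecutive pages; the manifest format and all function names are hypothetical.

```python
import hashlib
import random
from pathlib import Path


def md5_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in blocks to bound memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def failed_fixity(book_dir: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or whose digest differs from the manifest.

    manifest maps a file name (hypothetical layout: one file per page image
    or OCR text) to the MD5 digest recorded when the file was produced.
    """
    failures = []
    for name, expected in sorted(manifest.items()):
        target = book_dir / name
        if not target.is_file() or md5_of_file(target) != expected.lower():
            failures.append(name)
    return failures


def sample_chunk(page_count: int, chunk_size: int = 20, rng=None) -> range:
    """Pick one run of consecutive 0-based page indexes for manual review."""
    rng = rng or random.Random()
    if page_count <= chunk_size:
        return range(page_count)          # short book: review every page
    start = rng.randrange(page_count - chunk_size + 1)
    return range(start, start + chunk_size)
```

Reviewing a consecutive run rather than scattered pages lets a checker notice sequence problems, such as skipped or repeated pages, as well as image quality.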

University of Michigan: OPAC link

University of Michigan: e-book display

University of Michigan: search in e-book

How do people read?

Intentional reading
  Attentive, sustained, linear reading of text
  Heavily influenced by printed-book culture
  Dominant in classical and scholarly literature

Functional reading
  Manipulating different content types
  Web browsing, text database searching
  Most screen reading is functional

Hillesund, T., & Noring, J. E. (2006)

How do people know what they’ve read?

[A] strong relationship…exists between the sensory motor representation of the user and his/her treatment of the information content of the paper book or e-book…

Because an electronic book is functionally closer to a computer than a traditional book […] it does not provide the external indicators to memory that the classical book does…

Morineau et al., 2005

Delivering the book to the user

Printed books

Make use copy

Make discovery surrogate

Search surrogates, choose candidates

Examine candidates

Browse more candidates

Choose material

Online books

Make discovery surrogate

Search surrogates, choose candidates

Examine candidates

???

Choose material

Make use copy

[Diagram label: User tasks]

Implications for mass digitization
  Support production of good print copies for use
  Target TOC and index for indexing & correction
  Provide granular linking
  Provide browse functions

References

Christianson, M., & Aucoin, M. (2005). Electronic or print books: Which are used? Library Collections, Acquisitions, and Technical Services, 29(1), 71-81.

Hillesund, T., & Noring, J. E. (2006). Digital libraries and the need for a universal digital publication format. JEP: the Journal of Electronic Publishing, 9(2).

Levine-Clark, M. (2006). Electronic book usage: A survey at the University of Denver. portal: Libraries and the Academy, 6(3), 285-299.

Morineau, T., Blanche, C., Tobin, L., & Gueguen, N. (2005). The emergence of the contextual role of the e-book in cognitive processes through an ecological and functional analysis. International Journal of Human-Computer Studies, 62(3), 329-348.

Su, S. (2005). Desirable search features of web-based scholarly e-book systems. Electronic Library, 23(1), 64-71.

Mass digitization archives
  Google Books: http://books.google.com/
  Internet Archive: http://archive.org/
  MSN Live Book Search: http://books.live.com/
  University of Michigan: http://mirlyn.lib.umich.edu/