Ebooks: digitizing our print collections
Sian Meikle, University of Toronto Libraries
Digitizing our print collections
About mass digitization
  Who is digitizing?
  What is getting digitized?
  What is the output?
  Digitization issues
Integration of print-to-digital content
  Choices for discovery and access
  Issues for delivery
Partners: 11 libraries; several publishers
  University of California
  National Library of Catalonia
  University Complutense of Madrid
  Harvard University
  University of Michigan
  New York Public Library
  Oxford University
  Stanford University
  University of Texas at Austin
  University of Virginia
  University of Wisconsin at Madison
Scanning: in-copyright & out-of-copyright
In-copyright:
  Searching: all content fully indexed
  Display: snippets can be viewed; full text can be bought or located in library
Out-of-copyright:
  PDFs can be downloaded for personal use, bought, or located in library
Google output
  Metadata
  Scanned images: TIFFs
  Access derivatives:
    JPEGs
    Image-based PDFs (one per page or one per book)
    Uncorrected OCR
Partners:
  Internet Archive
  29 libraries
  Publishers: O’Reilly Media
  Industrial partners: MSN, HP Labs, Adobe, Xerox
Open Content Alliance library partners
  Boston Public Library
  Boston Library Consortium
  Columbia University
  Emory University
  European Archive
  Indiana University
  Johns Hopkins University Libraries
  McMaster University
  Memorial University of Newfoundland
  Missouri Botanical Garden
  MSN
  National Archives (United Kingdom)
  National Library of Australia
  Rice University
  Tufts University
  San Francisco Public Library
  Simon Fraser University
  Smithsonian Libraries
  University of Alberta
  University of British Columbia
  University of California
  University of Chicago Library
  University of Georgia
  University of Illinois Urbana-Champaign
  University of North Carolina
  University of Ottawa
  University of Pittsburgh
  University of Texas
  University of Toronto
  University of Virginia
  Washington University
  York University
Scanning: out-of-copyright, copyright-cleared material
Searching:
  Search full content via MSN Live Book Search
  Metadata via Internet Archive
Use: all scans & derivatives can be downloaded
Open Content Alliance output
  Metadata
  Scanned images: TIFFs/J2Ks/CR2s
  Access derivatives:
    JPEGs
    DJVU (viewer, requires plugin)
    Flip book (viewer, does not require plugin)
    Image-based PDFs (one per book)
    Uncorrected OCR integrated into the PDFs
Mass digitization content
  900,000 pre-1923 titles
  60% are unique
  40% have more than one manifestation
[Chart: Published page count, in billions of pages (axis 0 to 12), by period: pre-1923, 1923-1963, post-1963. Data courtesy Microsoft, 2007]
Comparison: scanned & born-digital
Scanned from print | Born-digital
Page images | E-text
Search uncorrected OCR | Search text
TOC, title page, index are marked | Can be highly segmented, linked
Literature, history, … | STM, social sciences, reference, …
Mass digitization: some Q&A
Duplication
  Q: How do we guard against duplication?
  A: It might be cheaper just to scan duplicates.
Omissions
  Q: What about fold-outs, uncut pages, tightly bound books, print running into margins…
  A: Mass digitization works because it is efficient. A parallel process should handle exception cases.
Mass digitization at U of Toronto
Not scanned: 2,400 (8%)
Scanned: 32,000 (92%)
Method 1: Union digital repository
Internet Archive (OCA)
  E-books integrated with non-book content
  User contributions (content, reviews)
  Other sites can point to this content
Method 2: Full-text search repository
MSN Live Books and Google Books
  Both cross-book & intra-book searching
  Google’s goal is to index
  MSN is developing a reading environment
MSN Live Book Search
Why load it locally?
Safekeeping
  Lots of copies keep stuff safe!
Discovery
  Integration with licensed books
  Integration with non-book content
  Local subject specialization
Method 3: Local load
University of Michigan
  E-books linked from OPAC
  Rights system decides who can view:
    Nobody
    University of Michigan
    United States
    World
  In-book searching: OCR, one-at-a-time
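The four viewing scopes listed for the rights system can be read as an ordered scale from narrowest to widest audience. The following is a minimal sketch of that idea, not the University of Michigan's implementation: the names `Audience` and `can_view`, and the reader flags `on_campus` and `in_us`, are hypothetical, and the rule that campus readers also count as US readers is an assumption.

```python
from enum import IntEnum


class Audience(IntEnum):
    """Viewing scopes named on the slide, ordered from narrowest to widest."""
    NOBODY = 0
    UNIVERSITY_OF_MICHIGAN = 1
    UNITED_STATES = 2
    WORLD = 3


def can_view(book_scope: Audience, on_campus: bool, in_us: bool) -> bool:
    """Return True if a reader may view a book released at book_scope.

    on_campus and in_us are hypothetical flags describing the reader; how a
    real rights system would establish them is not covered by the slides.
    """
    if book_scope == Audience.NOBODY:
        return False
    if book_scope == Audience.WORLD:
        return True
    if book_scope == Audience.UNITED_STATES:
        # Assumption: campus readers are treated as US readers.
        return in_us or on_campus
    # Remaining case: restricted to the University of Michigan.
    return on_campus
```

For example, `can_view(Audience.UNITED_STATES, on_campus=False, in_us=True)` allows the reader, while the same reader is refused a book scoped to `Audience.UNIVERSITY_OF_MICHIGAN`.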
MBooks at University of Michigan
Download and validation:
  Local data mover GROOVE (Perl & MySQL)
Data integrity:
  MD5 fixity checks on JPEGs, TIFFs, UTF-8
Quality assurance:
  GROOVE samples 20-page chunks for students to check with ACDSee
  Problems referred to Google for later correction
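The integrity and quality-assurance steps above can be illustrated with a short sketch. It is not GROOVE's code: the function names, the manifest layout (file name mapped to expected digest), and the use of MD5 as the digest are assumptions made for illustration; only the 20-page chunk size comes from the slide.

```python
import hashlib
import random
from pathlib import Path


def file_digest(path: Path, algorithm: str = "md5") -> str:
    """Hash a file in blocks so large page images need not fit in memory."""
    h = hashlib.new(algorithm)
    with path.open("rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()


def failed_fixity(expected: dict[str, str], root: Path) -> list[str]:
    """Return the names of files that are missing or whose digest differs.

    expected is a hypothetical manifest: file name -> expected hex digest.
    """
    bad = []
    for name, digest in expected.items():
        path = root / name
        if not path.is_file() or file_digest(path) != digest:
            bad.append(name)
    return bad


def sample_chunk(page_count: int, chunk_size: int = 20, rng=None) -> range:
    """Pick one run of consecutive page numbers (1-based) for manual review."""
    if rng is None:
        rng = random.Random()
    if page_count <= chunk_size:
        return range(1, page_count + 1)
    start = rng.randint(1, page_count - chunk_size + 1)
    return range(start, start + chunk_size)
```

A reviewer would then open the pages in `sample_chunk(page_count)` in an image viewer, and any file named by `failed_fixity` would be downloaded again or reported.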
University of Michigan: OPAC link
University of Michigan: e-book display
University of Michigan: search in e-book
How do people read?
Intentional reading
  Attentive, sustained, linear reading of text
  Heavily influenced by printed-book culture
  Dominant in classical and scholarly literature
Functional reading
  Manipulating different content types
  Web browsing, text database searching
  Most screen reading is functional
Intentional Functional
Hillesund, T., & Noring, J. E. (2006)
How do people know what they’ve read?
[A] strong relationship…exists between the sensory motor representation of the user and his/her treatment of the information content of the paper book or e-book…
Because an electronic book is functionally closer to a computer than a traditional book […] it does not provide the external indicators to memory that the classical book does…
Morineau et al., 2005
Delivering the book to the user
Printed books
Make use copy
Make discovery surrogate
Search surrogates, choose candidates
Examine candidates
Browse more candidates
Choose material
Online books
Make discovery surrogate
Search surrogates, choose candidates
Examine candidates
???
Choose material
Make use copy
User tasks
Implications for mass digitization
Support production of good print copies for use
Target TOC and index for indexing & correction
Provide granular linking
Provide browse functions
References
Blanche, C., Gueguen, N., Morineau, T., & Tobin, L. (2005). The emergence of the contextual role of the e-book in cognitive processes through an ecological and functional analysis. International Journal of Human-Computer Studies, 62(3), 329-348.
Christianson, M., & Aucoin, M. (2005). Electronic or print books: Which are used? Library Collections, Acquisitions, and Technical Services, 29(1), 71-81.
Hillesund, T., & Noring, J. E. (2006). Digital libraries and the need for a universal digital publication format. JEP: the Journal of Electronic Publishing, 9(2).
Levine-Clark, M. (2006). Electronic book usage: A survey at the University of Denver. portal: Libraries and the Academy, 6(3), 285-299.
Su, S. (2005). Desirable search features of web-based scholarly e-book systems. Electronic Library, 23(1), 64-71.
Mass digitization archives
Google Books: http://books.google.com/
Internet Archive: http://archive.org/
MSN Live Book Search: http://books.live.com/
University of Michigan: http://mirlyn.lib.umich.edu/