Webinar given on November 12, 2008 as part of an O'Reilly Tools of Change series on publishing and technology. More information on Liza Daly and threepress can be found at http://www.threepress.org/
What publishers need to know about digitization
Liza Daly
Consultant, Threepress Consulting Inc.
http://threepress.org/
Thursday, November 13, 2008
Software engineer and consultant specializing in web-based publishing applications
Digitization projects for Ford Foundation, Arnold Arboretum, Rosen Publishing and SAGE Publications
Online reference products for Oxford University Press and Columbia University Press
Current: ebook applications and consulting
Introduction: Liza Daly [email protected]
1. Digitization 101: from scanning to OCR to XML
2. Smart vendor selection
3. A gentle introduction to XML
4. I’ve got digital content: now what?
Introduction: What I’ll cover
What we talk about when we talk about digitization
Turning printed content...
...or microfilm archives
...or documents in legacy systems
...into modern digital forms.
(sometimes starting from print is easier)
text
<text>
Assume that we’re starting from a print archive.
(If you’re starting from a digital file, congratulations, your costs just went down -- but not to zero!)
Digitization 101
Scan
From paper to digital images...
OCR
...to digital text...
XML
...to reusable markup.
Digitization 101: Scanning
Scan
http://www.flickr.com/photos/heather-dietz/448629362/
Digitization 101: Scanning methods
Destructive scanning
Pages are cut out of the binding and machine-fed into the scanner in batch.
(Imagine a huge office copier.)
The physical copies that were scanned are normally destroyed.
Non-destructive scanning
Pages kept in their original binding
Manual page-turning
Originals are returned to the source
Primarily for rare or historical works
High-volume, non-destructive automated scanning also exists.
Optical Character Recognition
OCR software “guesses” the letters that appear in an image. A dictionary is used to help correct errors.
Common errors include wordsruntogether or speling mistakes.
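A rough sketch of how a dictionary can help correct both kinds of error. This is not how any particular OCR product works: the tiny word list, the function names, and the 0.8 similarity cutoff are all assumptions chosen for illustration.

```python
import difflib

# Hypothetical mini-dictionary; a real project would load a full word list
# plus any custom terms (proper names, jargon).
DICTIONARY = {"common", "errors", "include", "words", "run", "together",
              "or", "spelling", "mistakes"}


def split_run_together(token, dictionary):
    """Split a token into dictionary words, or return None if impossible."""
    if not token:
        return []
    # Try the longest prefix first, then backtrack to shorter ones.
    for end in range(len(token), 0, -1):
        prefix = token[:end]
        if prefix in dictionary:
            rest = split_run_together(token[end:], dictionary)
            if rest is not None:
                return [prefix] + rest
    return None


def correct_token(token, dictionary):
    """Return a list of corrected words for one OCR token."""
    word = token.lower()
    if word in dictionary:
        return [word]
    pieces = split_run_together(word, dictionary)
    if pieces:
        return pieces
    # Fall back to the closest dictionary word, if one is similar enough.
    close = difflib.get_close_matches(word, sorted(dictionary), n=1, cutoff=0.8)
    return close if close else [token]  # leave unknown words for a human check


def correct_line(line, dictionary=DICTIONARY):
    corrected = []
    for token in line.split():
        corrected.extend(correct_token(token, dictionary))
    return " ".join(corrected)


print(correct_line("common errors include wordsruntogether or speling mistakes"))
# prints: common errors include words run together or spelling mistakes
```

Words the dictionary cannot account for are passed through unchanged, which is where a custom dictionary or a human check comes in.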
Digitization 101: OCR
OCR quality is sensitive to a number of factors.
Is the document in good condition with clear type?
Is the layout simple or complex?
Is a custom dictionary required for proper names or obscure terms?
This is easy.
This is hard.
http://timesmachine.nytimes.com/
                 Better OCR          Worse OCR
Layout           Simple text         Multicolumn, sidebars
Vocabulary       Common              Specialized
Source quality   Clean and legible   Damaged, dirty or partial
Limitations and cautions:
Documents with specialized jargon, such as medical journals or archaic texts, will require custom dictionaries.
Tables and equations aren’t suitable for OCR.
A human check is always advisable.
If the goal of digitization is to make content findable on the web, the text needs to be correct.
Scan the documents to convert them to digital files
Apply OCR to the scans to get computer-ready text
Convert the text into XML
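The last step can be sketched in a few lines of Python. This is a minimal illustration only: it assumes the OCR output is plain text with blank lines between paragraphs, and the tag names (`document`, `title`, `p`) are made up for the example rather than taken from any standard schema.

```python
import xml.etree.ElementTree as ET


def text_to_xml(ocr_text, title):
    """Wrap blank-line-separated paragraphs of OCR text in simple XML."""
    # <document>, <title> and <p> are illustrative tag names, not a standard schema.
    root = ET.Element("document")
    ET.SubElement(root, "title").text = title
    for block in ocr_text.split("\n\n"):
        paragraph = " ".join(block.split())  # rejoin lines broken by the page layout
        if paragraph:
            ET.SubElement(root, "p").text = paragraph
    return ET.tostring(root, encoding="unicode")


sample = "This is a paragraph\nbroken across two lines.\n\nThis is another paragraph."
print(text_to_xml(sample, "Sample page"))
```

A real conversion would also need to identify headings, footnotes, tables and other structure, which is part of why this step adds cost.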
Digitization 101: XML
Not all digitization projects end with XML.
Why?
[Chart: characters-per-page (1,000 to 3,000+) versus digitization cost/time, for machine OCR, human-checked OCR, and XML]
Vendor selection and costs
Consider:
Quantity of material
Quality of the originals
Layout complexity
Vocabulary
But also:
Project management
Shipping
Heterogeneous content
Front/back matter & indexes
Vendor tips
Send samples before considering any estimate
...and have the output evaluated.
Compare not just cost-per-page but estimated time.
Feel comfortable with their project management.
Check references!
Should you partner?
It’s too early to say whether Google Books is right for all publishers.
But you’re certainly giving up:
1. Control
2. Revenue share
3. Ownership
Creative partnerships
Consider whether some of your backlist is public domain or can be released under a Creative Commons license.
XML 101
XML 101: What’s XML?
XML is just plain text, with markers to tell a computer what the text means and how it should be laid out.
Text with “markup” is an old idea.
This is a paragraph.¶This is another paragraph.
XML just changes the symbols around.
<p>This is a paragraph.</p><p>This is another paragraph.</p>
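To show that the two notations carry the same information, here is a small sketch that reads the XML paragraphs with Python's standard library and writes them back out in the pilcrow style. The `<doc>` wrapper is an assumption added for the example, since a well-formed XML document needs a single root element.

```python
import xml.etree.ElementTree as ET

# The two paragraphs from the slide, wrapped in a hypothetical <doc> root
# element because a well-formed XML document needs exactly one root.
xml_text = "<doc><p>This is a paragraph.</p><p>This is another paragraph.</p></doc>"

root = ET.fromstring(xml_text)
paragraphs = [p.text for p in root.findall("p")]

# Rebuild the older pilcrow-style markup from the XML.
pilcrow_text = "\u00b6".join(paragraphs)
print(pilcrow_text)
# prints: This is a paragraph.¶This is another paragraph.
```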
XML 101: What’s XML good for?
1. Everybody speaks it.
2. Once you have one kind of XML, it’s easy to turn it into another kind.
When you decide to digitize to XML, you’ll need to pick what kind of XML you want.
Kinds of XML
DTD
Format
Language
XSD
Schema
The schema defines which <tags> may appear in a document, and what they mean.
A paragraph ¶ in one schema might be <p>, but in another it might be <para>.
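Because the difference is often just vocabulary, moving between schemas can be as simple as renaming tags. The sketch below assumes a hypothetical mapping (`doc` to `article`, `p` to `para`); conversions between real schemas usually involve more than renaming and are often written in XSLT instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from one schema's tag names to another's.
TAG_MAP = {"doc": "article", "p": "para"}


def rename_tags(xml_text, tag_map):
    """Return the same document with tags renamed according to tag_map."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Tags missing from the mapping are left unchanged.
        element.tag = tag_map.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


source = "<doc><p>This is a paragraph.</p><p>This is another paragraph.</p></doc>"
print(rename_tags(source, TAG_MAP))
```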
XML 101: Schema vocabulary
TEI
DocBook
METS/ALTO
PRISM
ePub
DAISY
XML
XML 101: Choosing a schema
Books: DocBook, DAISY, ePub, TEI
Magazines/Newspapers: METS/ALTO, PRISM
Scholarly: TEI, MathML
XML 101: DIY schemas
Creating your own schema should be a last resort.
Expensive to build and maintain.
High training and hiring costs.
Reduced opportunities for interoperability.
Regulatory compliance.
[Chart: cost ($ to $$$), low to high]
Complex schemas cost more...
...but also provide more opportunity for product development.
Now what?
Monetizing: XML conversion
[Diagram sequence: XML; XML and web; web and UGC]
Remixing content
XML allows content to be distributed, altered, and recontextualized in unexpected ways.
http://flickr.com/photos/thomashawk/2492298772/
Small Beer Press
Questions?
Liza Daly
Threepress Consulting Inc.
+01 617 301 [email protected]