View
224
Download
0
Category
Preview:
Citation preview
7/29/2019 Developing Digital Libraries
1/37
Developing Digital Libraries:
General Strategies
Sukhdev SinghBased on From Paper to Collection by Dr.
Michel Loots, Dan Camarzan and Ian H
Witten. Human Info NGO, Belgium; Simple
Words, Romania; University of waikato, New
Zealand.
7/29/2019 Developing Digital Libraries
2/37
INTRODUCTION
Typical steps that have to be implemented are:i. Selecting the documents to be includedii. Securing copyrights permissions to use
these documents in thedigital libraryiii. Scanning and OCR of the hard-copydocuments which are notavailable in to digital form to have a perfect
digital formativ. Converting all documents to a format(integrating text andimages).
7/29/2019 Developing Digital Libraries
3/37
v. Tagging the chapters, paragraphs andimages of the digitaldocuments
vi. Organising the collection into a optimallystructured digitallibraryvii. Building the digital library using the
Greenstone softwareviii. Printing and distributing the collection onCD-ROM and/ordistributing it over the Internet
7/29/2019 Developing Digital Libraries
4/37
Answers the following before
jumping in: What is the goal of your collection?
What is your target group?
How big is itlocal, regional, or
globaly?
How many documents are you making
available?
How many pages?
How much graphics content?
7/29/2019 Developing Digital Libraries
5/37
Does the material split into parts that will beconsulted by a limitedaudience and parts that need to bedisseminated widely?
Are the documents already availableelectronically?
If so, in which formats? (Note incidentally thatPDF files are notautomatically equivalent to digital full-textform, as they oftencontain only page images.)
What is the co ri ht status of the
7/29/2019 Developing Digital Libraries
6/37
Who owns the copyright?
Are there other organizations with the same
target audience? Are you willing to collaborate with other
groups?
What budget is available for the whole
project? What human resources are available (in
person-months) for co-ordination,editing, scanning and programming?
How many computers are available for this
7/29/2019 Developing Digital Libraries
7/37
How many CD-ROMs do you want to
distribute?
Will they be free, or for sale?
7/29/2019 Developing Digital Libraries
8/37
SCANNERS AND
SCANNING The first step in converting paper
documents into a digital library
collection is to obtain images of allpages of all publications in digitalformat.
7/29/2019 Developing Digital Libraries
9/37
Scanners
Scanners are available in all priceranges, and all shapes and sizes. They
range from Rs. 5000 for flat-bedscanners to upwards of Rs. 25,00,000for large industrial scanners frommanufacturers such as Bell & Howell.
7/29/2019 Developing Digital Libraries
10/37
The output format of a scanned page is a
computer file that is usually stored in TIFF
or Bitmap format. Compressed TIFF IV isthe best format to use. An average page
scanned and converted to this format
occupies only 50 Kb, compared to perhaps
2 Mb for the equivalent page inuncompressed Bitmap form.
7/29/2019 Developing Digital Libraries
11/37
Low-cost flat-bed scanner
Low-cost flat-bed units are the cheapestand most widely available type of
scanner. There are many brands: HP,Agfa, Acer, etc. Prices range from 5,000to 15,000. Both black-and-white andcolor images can be scanned.
The low price allows each computer tohave its own scanner.
7/29/2019 Developing Digital Libraries
12/37
Fact is that rates exceeding twelve pages
per hour are rarely
achieved.
7/29/2019 Developing Digital Libraries
13/37
these scanners are useful only for small
jobs with limited numbers of pagesno
more than 200 to 400 pages a month on aregular
basis, or one-time jobs of up to 1000 or
2000 pages.
7/29/2019 Developing Digital Libraries
14/37
Low-end scanner with sheet
feeder Low-end scanners with sheet feeders
typically cost between Rs. 25,000 andRs. 60,000. Ten to fifty pages can beinserted, scanned and processed atonce: thus the operator does not haveto attend constantly to the machine.This increases capacity up to 150 to 200
pages per day. These scanners aremore robust, and have a larger lifespanbefore repairusually in therange 30,000 to 50,000 pages.
7/29/2019 Developing Digital Libraries
15/37
A disadvantage is that only one side of
the page is scanned at a timethestack of pages must be reversed andrescanned in order to obtain an imageof both sides. This often creates
problems because sheet feeders arenever without problems and sometimespages get blocked. These scanners areuseful for up to 1500 to 3000 pages a
month.
7/29/2019 Developing Digital Libraries
16/37
Color scanners
Any scanning operation invariablyinvolves some color images, so a color
scanner will always be required.Generally speaking, less than 5% of anypublication contains color images, plusthe cover. Thus a low cost flat-bed
scanner as described above suffices. Itis advisable to select one capable ofscanning up to 600 dpi resolution.
7/29/2019 Developing Digital Libraries
17/37
Professional duplex
scanners Professional scanners are reliable,
heavy-duty machines capable of
processing a large volume of pagestypically from 2000 pages to 10,000pages per day. They have an automaticsheet-feeder tray system that processes
batches of about 50 to 200 pages. Thebest and fastest are duplex machinesthat scan both sides of the page atonce.
7/29/2019 Developing Digital Libraries
18/37
Professional duplex scanners require apowerful computer with a hard disk of at least10 to 20 Gb. Prices range from Rs. 2,50,000 to
Rs. 25,00,000. For example, the Canon DR-6020duplex scanner costs Rs.2,50,000 and workswith double-sided documents. It has a capacityof about 2000 pages per day and a lifespan of
600,000 to 800,000 pages. Bell & Howell andFujitsu scanners range from Rs. 5,00,000 toRs. 25,00,000 and have a lifespan of manymillions of pages.
7/29/2019 Developing Digital Libraries
19/37
Scanning programs
Every scanner comes with its ownsoftware, which means that the program
must be installed on the computer thatmanages the scanner. Some have acomputer card that needs to beinstalled in your computer to speed up
the scanning operation.
7/29/2019 Developing Digital Libraries
20/37
Preparing the documents
Before being scanned, documents mustbe properly prepared. Dusty documentsmust be cleaned, humid documentsdried, clips removed, pages unfolded.The spine of each book should beremoved by cutting it off, straight andprecisely. Books provided by libraries
must often be rebound, and if so youshould be particularly careful whenremoving spines in order tofacilitate smooth rebinding.
7/29/2019 Developing Digital Libraries
21/37
The scanning process
Using software provided with thescanner, a digital image of each paper
page is scanned and transformed into aBitmap or TIFF image.
7/29/2019 Developing Digital Libraries
22/37
Quality control
First divide the material into batcheswith similar paper and print qualities.
Perform OCR tests on a sample fromthe first batch to determine the optimalsettings. Then scan all material in thisbatch before
proceeding to the next one.
7/29/2019 Developing Digital Libraries
23/37
Filename conventions
Follow a suitable file namingconvention.
7/29/2019 Developing Digital Libraries
24/37
Productivity and resources
You should not underestimate themagnitude of the scanning operation
and particularly the OCR process thatfollows. It is best to consider scanningand OCR as completely separateactivities. The optimal choice
from an economic and practical point ofview should be made individually foreach one.
7/29/2019 Developing Digital Libraries
25/37
Scanning costs
An important decision is whether toinvest in scanning equipment and
perform all scanning oneself, oroutsource it to a scanning company.The main considerations are:
7/29/2019 Developing Digital Libraries
26/37
pressure of time for the scanning job;total number of pages;salary costs of those who perform the
scanning
7/29/2019 Developing Digital Libraries
27/37
OCR: OPTICALCHARACTER
RECOGNITION The following steps are involved in
converting paper documents to
computer form:1. scanning;
2. page layout analysis;
3. recognition;
4. scanning images and tables.
7/29/2019 Developing Digital Libraries
28/37
On the market are many good OCR
programs, with prices ranging from Rs.
5000 to Rs.20,000. For example, among many others are:
Read-Iris(http://www.readiris.com/)
Omnipage(http://www.omnipage.com/)
Fine-Reader(http://www.finereader.com/)
http://www.finereader.com/http://www.finereader.com/7/29/2019 Developing Digital Libraries
29/37
A choice must be made between
undertaking the scanning and OCR in-
house or outsourcing it to a commercialorganization. To do it in-house
requires a scanner, OCR software
program, OCR skill development, and a
quality-conscious, highly motivatedworkforce.
7/29/2019 Developing Digital Libraries
30/37
The OCR process
The OCR process differs from one OCRprogram to another, and each one
requires a considerable amount oflearning. The programs manual willexplain this process in detail. Fourpoints deserve particular attention:
quality control, tables, images, andspecialized material such as formulas,foreign characters etc.
7/29/2019 Developing Digital Libraries
31/37
Quality control
The first is performed at the same timeas OCR. Every OCR program has
a built-in spell-checker that highlightsevery suspect letter.
The second is a general check of thetext once the OCR process is
finished.
7/29/2019 Developing Digital Libraries
32/37
The third is a spelling check usingMicrosoft Word.
Finally, the completed document shouldbe checked by an independent personwho samples the complete book andchecks for errors, problems with tables
and images, tagging, and the generallook of the resulting text.
7/29/2019 Developing Digital Libraries
33/37
Tables
Images
Specialized material
7/29/2019 Developing Digital Libraries
34/37
Productivity and resources
Young people between 18 and 25 are bestsuited for this job.Because the first hours are always themost productive, the work should either beorganized on a part-time basis or only themost motivated and concentrated peopleshould be selected for full-time work.
7/29/2019 Developing Digital Libraries
35/37
Two-thirds of people tend to quit or getbored after about three to five weeks. Thistranslates into poorer quality and low
productivity in the last weeks.A regular supply of work is needed tojustify the necessary training, to maintainconcentration, and to keep spirits high.
28 Pages / Day
7/29/2019 Developing Digital Libraries
36/37
Alternatives to OCR
Manual retyping
Image files
Combining scanning and OCR
7/29/2019 Developing Digital Libraries
37/37
CREATING AN
ELECTRONIC COLLECTION Meta Data
Tagging document files
Recommended