Chapter 4
Technical Aspects of Digitization
4.1 Digitization
Digitization is the process of converting a printed resource into a digital resource. What are the
advantages that have encouraged the digitization of vast amounts of analogue
resources? “The aim of expressing an object in numbers is that it can be stored and
manipulated by computers. Computers are number crunchers, performing millions of
calculations per second. By digitizing an original and placing a digital copy of it on a
computer, the file can be manipulated, transferred, and stored with ease” (Wentzel, Larry,
2006, p. 11).
4.2 Process of Digitization
Digitization is the process of converting something into a numerical form that can be processed by computers. Its main objectives are the storage and manipulation of, and access to, resources. Reduced storage space requirements and the ability of a large number of users to access resources without wear and tear on the originals have encouraged the digitization of printed resources.
“The process of digitization involves two major sets of activities: (i) The process of digital
conversion whereby source materials are converted into digital form, and (ii) The
processing of the digitized information, which involves several activities related to the
storage, organization, processing and retrieval of digitized information” (Choudhury &
Choudhury, 2007, p. 104). The stages involved in the digitization process are
scanning, indexing, storing, and retrieval, each of which is discussed below.
Scanning
Using an electronic image scanner or a digital camera, the source document in printed form is converted to an electronic image. In this process, the source document is scanned at a predefined resolution and bit depth. The images are stored in files in which the binary digits (bits) for each pixel are recorded; such a file is called a “bit-map page image”. The scanning software is also used for formatting, tagging, storage and retrieval of the scanned image.
Indexing
In this step, the scanned image files are indexed by linking the database of scanned images to a text database. The text database links the set of images according to keyword and the location of the image in the image database. Some scanning software requires manual keying in of the indexing terms for the image files, while other software facilitates selection of indexing terms from the image files.
Storing
The file of the scanned image is saved or stored for further processing. The size of these files depends upon many factors, such as the resolution used in scanning, the scan area, the compression technique, and the file format used for the scanned image. The scanned image is stored on offline storage media such as CD-ROM or DVD-ROM, external hard disks, snap servers, etc.
Retrieval
Retrieval is an essential part of the process; without it, scanning would be of no use. When a document is scanned, it is stored in the system as two files. The first file holds the image along with a key to the second file, which stores the location of the document. To retrieve a scanned document, the system uses the second file, whose key is linked to the first file, to locate and return the document (Arora, 2001).
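The indexing and retrieval stages can be illustrated with a minimal sketch. It assumes a simple in-memory keyword index; the file paths, keywords and the function names `build_index` and `retrieve` are hypothetical and are not taken from any particular scanning software.

```python
from collections import defaultdict


def build_index(records):
    """Map each keyword (lower-cased) to the set of image files tagged with it."""
    index = defaultdict(set)
    for image_path, keywords in records.items():
        for keyword in keywords:
            index[keyword.lower()].add(image_path)
    return index


def retrieve(index, keyword):
    """Return the image locations recorded for a keyword, in sorted order."""
    return sorted(index.get(keyword.lower(), ()))


# Hypothetical scanned pages with manually keyed indexing terms.
records = {
    "scans/page_001.tif": ["manuscript", "history"],
    "scans/page_002.tif": ["history"],
}
index = build_index(records)
print(retrieve(index, "History"))
```

The text database here is reduced to a dictionary; a production system would normally keep the same keyword-to-location mapping in a database.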
4.3 Technological Background of Digitization
Digital images are represented by a set of pixels or bits. These bit-mapped images cannot be searched like an ASCII file. However, by applying Optical Character Recognition (OCR) technology, the text in a bit-mapped image file can be converted to an ASCII file. Some technical specifications that control the quality of the scanned image are discussed below.
Bit Depth
Bit is the abbreviated form of binary digit. A bit takes one of two values, 0 or 1. Bits are used
to describe the range of shades between pure black and pure white. Black and white files
are called 1-bit, as there are only two shades, black and white. The bit depth or colour depth of a
scanner is an indication of the range of colours that can be captured by the scanner. It does
not define the limits of the colour range that is readable by the device but simply specifies
the number of separate distinct colours. A higher figure will equate to a more accurate
description of the colours available to the scanner but does not necessarily mean that they
are available to the user at the end of the process.
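The relationship between bit depth and the number of distinct shades or colours is a power of two. The following short sketch shows the arithmetic; the function name `distinct_values` is chosen here for illustration only.

```python
def distinct_values(bit_depth):
    """Number of distinct shades or colours representable with the given bit depth."""
    return 2 ** bit_depth


# 1-bit bitonal, 8-bit grayscale, 24-bit RGB and 48-bit 'hi-bit' colour.
for bits in (1, 8, 24, 48):
    print(f"{bits}-bit: {distinct_values(bits):,} distinct values")
```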
Scanners will often capture at a larger ‘bit depth’ of 36-42 bit and then save or export from the scanner in standard 24 bit RGB (Red Green Blue) colour. This extended colour depth is used internally by the scanner to produce the best possible quality original image data but is not normally available to the user. Recently, however, there has been a move towards some scanners allowing the full size ‘hi-bit’ version of the file to be saved and edited as a 48 bit TIFF (Tagged Image File Format) or PNG (Portable Network Graphics). The colour depth in itself does not provide much evidence of the quality of the scanner; however, it does give some guidance as to how capable the scanner might be if it can use all the colour data it produces.
Resolution
Before a document is scanned, the resolution must be decided. It is expressed in dpi (dots per inch) or ppi (pixels per inch) and indicates the quality of the scanned document: the higher the resolution, the greater the dpi/ppi value (Wentzel, 2006). The appropriate level varies from document to document. “The higher the resolution, the finer the grid used to segment the image”. However, the higher the resolution, the larger the file, i.e. file size grows as resolution increases.
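The effect of resolution on file size can be estimated with simple arithmetic. The sketch below assumes an uncompressed image with no file header, so the results are only approximations of real file sizes; the function name and the letter-size page are illustrative choices.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bit_depth):
    """Approximate size of an uncompressed scan: pixel count times bits per pixel."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bit_depth / 8


# A letter-size page (8.5 x 11 inch) scanned in 24-bit colour.
for dpi in (150, 300, 600):
    size_mb = uncompressed_size_bytes(8.5, 11, dpi, 24) / 1_000_000
    print(f"{dpi} dpi: about {size_mb:.1f} MB")
```

Because both the width and the height in pixels scale with the dpi value, doubling the resolution roughly quadruples the uncompressed size under these assumptions.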
Optical and interpolated resolution are two different types of resolution, distinguished by how they are generated. Optical resolution is the maximum resolution a scanner is capable of capturing. Interpolated resolution is artificially generated: the software takes the pixels captured by the scanner, expands the grid pattern, and estimates the pixels that were not captured by the scanner.
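Interpolation can be sketched for a single row of gray values. The example assumes simple linear interpolation between neighbouring samples; actual scanner software may use other methods, and the function name `interpolate_row` is hypothetical.

```python
def interpolate_row(samples, factor):
    """Expand a row of captured samples by estimating values between neighbours.

    Between each pair of captured samples, (factor - 1) estimated values are
    inserted on a straight line between the two captured values.
    """
    expanded = []
    for left, right in zip(samples, samples[1:]):
        for step in range(factor):
            expanded.append(left + (right - left) * step / factor)
    expanded.append(float(samples[-1]))
    return expanded


captured = [0, 100, 200]             # values the sensor actually recorded
print(interpolate_row(captured, 2))  # every second value is an estimate
```

The estimated values add pixels but no new detail, which is why optical resolution is the more meaningful figure when comparing scanners.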
Many recommendations have been put forward for selecting a proper resolution to achieve good quality scanning for different types of documents. Wentzel (2006), in “Scanning for Digitization Projects”, has put forward the following recommendations.
• Normal web image: 72 dpi GIF/JPEG
• Minimum gray/color print setting: 150 dpi JPEG
• Optimal color print setting: 300 dpi TIFF
• Optimal setting for running pages of text through OCR: 300 dpi TIFF
• Best black and white print setting: 600 dpi TIFF
• Archival setting (all colors): 600 dpi TIFF
The Digital Library Federation (DLF) also recommends using 300 dpi 24-bit color TIFF for images and 600 dpi 1-bit bitonal TIFF for pages of text <www.diglib.org/standards/bmarkfin.htm#benchmark>. The resolution may be adjusted based on the facilities available and the type of documents to be scanned.
Threshold
Bitonal scanning is used for pages containing text or line drawings. It is also known as binary or black and white scanning, where one pixel is represented by one bit. For black and white photographs with intermediate or continuous tones, grayscale scanning is used. For colour photographs, colour scanning is used. Bitonal
scanning has the fastest processing. On the other hand, grayscale will provide more
accurate results, especially on degraded or shaded background documents. Colour scanning
helps to retain colour information and/or colour graphics in the source document. “The
threshold setting in bitonal scanning defines the point on a scale, usually ranging from 0 to
255, at which gray values will be interpreted as black or white pixels” (Arora, 2001, p. 17).
The threshold setting determines the image quality in bitonal scanning.
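The role of the threshold can be sketched as follows. The example assumes the common convention that 0 is black and 255 is white, so gray values below the threshold become black pixels; the function name `to_bitonal` and the default of 128 are illustrative assumptions.

```python
def to_bitonal(gray_rows, threshold=128):
    """Convert 0-255 gray values to bitonal pixels: 1 for black, 0 for white.

    Assumes 0 is black and 255 is white, so values below the threshold
    are interpreted as black.
    """
    return [[1 if value < threshold else 0 for value in row] for row in gray_rows]


faded_text = [[30, 140, 250], [90, 160, 200]]
print(to_bitonal(faded_text))                 # default threshold
print(to_bitonal(faded_text, threshold=170))  # higher threshold keeps faint strokes
```

Raising the threshold turns more mid-gray pixels black, which can recover faint text but may also darken a shaded background.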
Compression
The scanned image file is very large if the source document is scanned at high resolution. Therefore, to make the files manageable by the computer system and by the user, it is necessary to reduce or compress the file size.
“Compression is the process of reducing the size of a data file or an image by abbreviating
the repetitive information such as one or more rows of white bits to a single code” (Arora,
2001, p. 17). It helps in economic storage, processing and transmission over a network.
Data compression algorithms are of two types- lossless and lossy.
a) Lossless compression
It uses algorithms which encode repeating elements or patterns within an image. For example, if the same colour is present in several adjacent pixels, two bytes can be used to store the information: the first byte for the colour and the second for the number of adjacent pixels. When the file is decompressed, the original image is restored exactly.
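The two-byte scheme described above is a form of run-length encoding. A minimal sketch, assuming 8-bit pixel values and a run count capped at 255 so that each count fits in one byte, could look like this; the function names are illustrative.

```python
def rle_encode(pixels):
    """Encode a row of pixel values as (value, run_length) pairs.

    Runs are capped at 255 so that each count would fit in a single byte.
    """
    runs = []
    for value in pixels:
        if runs and runs[-1][0] == value and runs[-1][1] < 255:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


def rle_decode(runs):
    """Rebuild the original row of pixels from (value, run_length) pairs."""
    pixels = []
    for value, count in runs:
        pixels.extend([value] * count)
    return pixels


row = [255] * 6 + [0] * 2 + [255] * 3
encoded = rle_encode(row)
print(encoded)
print(rle_decode(encoded) == row)
```

Decoding reproduces the original row exactly, which is what makes the scheme lossless.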
b) Lossy compression
In this type, the compression ratio is much higher than in lossless compression, but the quality of the image degrades.
Some of the commonly used compression protocols are –
i) ITU-G4: Developed by the International Telecommunication Union (ITU), it is a popular standard protocol for black and white images.
ii) JPEG: Joint Photographic Experts Group (JPEG) is an ISO/IEC 10918-1 compression protocol. It represents an area that has the same tone, shade, colour, or other characteristics by a code.
iii) LZW: Lempel-Ziv-Welch (LZW) uses a table-based lookup algorithm invented by Abraham Lempel, Jacob Ziv, and Terry Welch. Two commonly used file formats in which LZW compression is used are the Graphics Interchange Format (GIF) and the Tagged Image File Format (TIFF) (Arora, 2001, p. 19).
iv) Fractal and wavelet compression: These lossy compression formats offer advantages for providing access to digital images of oversized materials on the web. They convert the image into mathematical models instead of an array of pixels and thus save storage space.
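The table-based lookup used by LZW, listed under iii) above, can be illustrated with a textbook sketch. It is not the exact variant used inside GIF or TIFF files, which add variable-width codes and table resets; the function names are chosen for illustration.

```python
def lzw_compress(data):
    """Textbook LZW: replace repeated byte sequences with table codes."""
    table = {bytes([i]): i for i in range(256)}  # start with all single bytes
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate            # keep extending the known sequence
        else:
            codes.append(table[current])   # emit code for the longest match
            table[candidate] = len(table)  # learn the new sequence
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes):
    """Rebuild the original bytes by regrowing the same table while decoding."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):           # sequence defined by this very step
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        output.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(output)


sample = b"ABABABABABAB"
codes = lzw_compress(sample)
print(len(sample), "bytes ->", len(codes), "codes")
```

The decoder never receives the table; it rebuilds it from the code stream, which is why the original data is restored without loss.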
Image Enhancement
The image enhancement process can improve the quality of the image captured by the scanning device. Image editing software helps in this process. “For archiving and online publishing of images image editor is a must. We can resize images, crop, create image for website, save in multiple formats” (Deka, 2008, p. 171). According to Arora (2001), the scan area can be decomposed into smaller areas, each of which can be treated separately to further improve image quality. A lot of image editing software can be used for image enhancement, such as Adobe Photoshop and PaintShop Pro.
4.4 File Formats
File format for storage, dissemination and preservation of digital resources is one of the
most important technical issues to be taken into consideration. “ One of the key
components in ensuring resource longevity is the choice of file and media formats used to
create, store, and deliver digital content, and the strategies that are employed to manage
these in the long term” (Williamson, 2005, p. 508). A file format stores information such as size, resolution, and compression protocol. Digitized resources can be stored in different file formats for easy storage and retrieval. PDF, SGML, TIFF, MPEG, and WAVE are some popular file formats used for storing digitized resources. There are two main types of file format, as follows.
Open File Format
An open file format is freely available for use, is free from patent or licence restrictions, and can be used by anyone in any proprietary, free, or open source software.
An open standard approach brings a wide range of benefits (Williamson, 2006). These are –
• Resources are freed from dependencies on a single application or particular
hardware platforms;
• Resources can be preserved and accessed over the long term.
OpenDocument, Office Open XML, PNG, JPEG 2000, and ZIP are some examples of open file formats.
Proprietary File Format
A proprietary file format is owned by an individual or an organization, which protects
it from unauthorized use through patents or licences. “These formats are owned by an
organization or group (e.g. Microsoft), may sometimes be accepted as de facto standards
through sheer ubiquity, and might even be referred to as standards, but cannot be regarded
as open since the owner could theoretically choose to change the format or conditions of
usage at any time” (Williamson, 2005, p. 509).
Table 4.1 lists file formats for different media types, along with the creator, date of creation, media type and format of each.
Table 4.1 List of File Formats
Sl. No | File Name | File Extension | Creator | Creation Date | Media Type | Format
--- | --- | --- | --- | --- | --- | ---
1 | Advanced Audio Coding | .aac | Collaboration between corporations approved by MPEG | 1997 | Sound | Lossy Compression
2 | Advanced Authoring Format | .aaf | Advanced Media Workflow Association | 2000 | Moving Image | Uncompressed
3 | Apple QuickTime | .mov | Apple Computer, Inc. | 1991 | Moving Image | Container
4 | Audio Interchange File Format | .aiff | Electronic Arts Interchange and Apple Computer, Inc. | 1988 | Sound | Uncompressed
5 | Audio Video Interleave | .avi | Microsoft | 1992 | Moving Image | Container
6 | Bitmap | .bmp | IBM and Microsoft | 1988 | Still Image | Compressed or Uncompressed
7 | Broadcast Wave File | .bwav | IBM and Microsoft | 1997 | Sound | Uncompressed
8 | Digital Video File | .dv or .dif | Sony | 1994 | Video | Uncompressed
9 | Extensible Music Format | .xmf | The MIDI Manufacturers Association, XMF Working Group | 2001 | Moving Image | Container
10 | Final Cut Pro | .fcp | Final Cut Pro/Apple Computer, Inc. | 1999 | Moving Image | Uncompressed
11 | Flash Video | .swf (or .flv) | Adobe/Macromedia | 1997 | Moving Image | Moving Image/Dynamic
12 | Graphics Interchange Format | .gif | CompuServe | 1987 | Still Image | Lossless Compression
13 | JPEG | .jpg | Joint Photographic Experts Group | 1990 | Still Image | Lossy Compression
14 | Keynote | .key | Apple Computer, Inc. | 2003 | Presentation | Container
15 | Material Exchange Format | .mxf | Pro-MPEG Forum | 2004 | Moving Image | Container
16 | MPEG-1 or MPEG-2 | .mpg | Motion Picture Experts Group | 1988 | Moving Image | Container
17 | MPEG-1/2 Audio Layer 3 | .mp3 | Motion Picture Experts Group | 1991 | Sound | Lossy Compression
18 | MPEG-4 | .mp4 | Motion Picture Experts Group | 1998 | Moving Image | Container
19 | Ogg Vorbis Compressed Video | .ogm | Ogg Vorbis | 2003 | Moving Image | Container
20 | Open Office Impress | .odp | Sun Microsystems | 2000 | Presentation | Container
21 | Photoshop Document | .psd | Adobe | 1990 | Still Image | Uncompressed
22 | Portable Network Graphics | .png | The Portable Networks Graphics Development Group of the World Wide Web Consortium | 1996 | Still Image | Lossless Compression
23 | Power Point Document | .ppt | Microsoft | 2003 | Presentation | Container
24 | Raw Image File | .dng, .cr2, .nef, .arw, and .srf | Depends on equipment manufacturer | 2000 | Still Image | Uncompressed
25 | RealAudio File Format | .ra | RealMedia | 1995 | Sound | Compressed
26 | Scalable Vector Graphics | .svg | The World Wide Web Consortium | 1999 | Still or Moving Image | Uncompressed
27 | Tagged Image File Format | .tiff | Aldus | 1985 | Still Image | Container or Uncompressed
28 | WAVE Form Audio Format | .wav | IBM and Microsoft | 1992 | Sound | Uncompressed
(Source: www.nyu.edu/tisch/preservation/...2/07f_1807_nmartin_a2.doc)
4.5 Hardware Used for Digitization
Devices are needed to capture the image of the source document. A scanner is generally used to capture images from textual documents, pictures, or other sources. The hardware used in the process of digitization is discussed below.
4.5.1 Scanner
A scanner works much like a photocopier. In a flatbed scanner, a moving lamp throws light onto the object to be digitized, and the reflected light is focused through a series of mirrors and lenses onto the recording medium. The recording medium is a compact light sensor, either a CCD (Charge-Coupled Device) or a CIS (Contact Image Sensor), each of which is composed of hundreds or thousands of elements. When light strikes an element, the intensity of the light is assigned a number. The numeric readings of light intensity and the element positions are recorded in sequence into a file, which forms the digital version of the original. The following features should be analysed first when selecting a scanner.
a) Driver of scanner
A driver is software that operates the scanner and transfers the digitized file to the hard drive or to other software. The scan driver may be standalone or a plug-in, a specialized version of the driver that is accessible through Photoshop, Word or another program. A standalone driver runs the scanner without involving other software and saves the file to the hard drive. Plug-ins are opened within Photoshop or Word, and after scanning the files can be used immediately in that program.
Scan drivers fall into two groups: native and third party. Flatbed scanner manufacturers provide their own native drivers for their scanners and publish driver updates on their websites. In the case of specialized scanners, such as overhead book scanners or digital cameras, the native driver is the only driver available.
Third party scan drivers offer better control over the scanner and the scanned image than native drivers. These drivers have to be procured separately unless they are supplied with the scanner as an incentive. Windows Image Acquisition (WIA) is a third party scan driver provided with Microsoft Windows XP. It offers the features most commonly available on flatbed scanners. However, specialized scanners cannot be operated with WIA.
b) Scanning speed
Scanning time varies depending on the type of scanner used. Within a busy workflow, scanning speed can often be a deciding factor in scanner choice and should always be researched and considered before a choice is made. Many scanners offer a choice of scan qualities that depends on the number of passes and/or the speed of the CCD: the more passes the CCD makes, the higher the quality and the slower the scanning speed. Some early scanners were unable to scan Red, Green and Blue data in one go (one-pass) and had to make three separate scans (three-pass). This did not normally affect the quality but was very slow. Some scanners offer functions such as dust and noise reduction; however, these also slow down the process significantly.
c) Scan area
The scan area is the dimension or area that the scanner is capable of scanning. Scan areas are specified in inches and/or media sizes such as
8 ½ X 11 inch (standard letter)
8 ½ X 14 inch (legal)
11 X 17 inch (ledger)
Most flatbed scanners have a nominal size of A4 but can scan an area of about 8.5" by 12-14". A3-sized scanners are available but they can take up a considerable amount of space. They are, of course, essential if it becomes necessary to capture works larger than A4, although if the objects are very large or difficult to handle, a digital camera might well offer a more pragmatic alternative. High-end A3 flatbed scanners are very popular in commercial digitization as they can be set up to scan a number of images in one go. This offers greatly increased efficiency and throughput. But these machines are very costly.
Some flatbed scanners offer dual optics, where the optical system can be switched to scan a ‘sweet-zone’ which offers a smaller scan area with a greatly increased resolution. This is normally of use when scanning small to medium sized transparencies within the full size of the scanner bed.
There is a range of optional add-on parts that can provide additional functionality and productivity for many mid-range to high-end scanners. Two of the most common options for flatbed scanners are the automatic sheet/transparency feeder (ASF/ATF) and the transparency media adapter (TMA). An ASF or ATF is used to batch scan quantities of single sheets or transparencies. Normally an ASF/ATF is best for creating small, low quality scans, either 1-bit black and white images from text for later optical character recognition (OCR) or small scans for thumbnail creation. A TMA provides an alternative light source within the scanner which enables transparent artworks such as photo-slides and larger colour transparencies to be scanned.
d) Scanner types
Selecting the right scanner is a more difficult job than selecting the right computer. Scanners are used to capture images of resources in printed form or on microfilm. Based on how the image is interpreted, there are two types of image scanner: vector scanners and raster scanners. A vector scanner interprets the image as a set of x, y coordinates. A raster scanner captures images by passing light down the page and digitally encoding it row by row.
i) Drum scanner: Drum scanners use photo-multiplier tubes (PMT) to produce very high quality results. They typically have a density range of 3.4-4.0 with a ‘dMax’ at the top of that range. They can offer an optical resolution of up to 8000 samples per inch (spi). Drum scanners are the tool of choice of the print industry and are normally used only by professional digitization bureaux because of their expense and their complexity, which requires skilful operation to get the best from them. Only flexible original artwork can be scanned in a drum scanner, as it has to be mounted on a transparent acrylic cylinder (drum) and then spun at high speed around the photo-multipliers within the cylinder. Mounting transparencies on the drum is a slow and skilled operation, and it is normal to have at least two drums in use so that one can be mounted whilst the other is being scanned.
Fig. 4.1: A Drum Scanner
Although the quality from these scanners is exemplary, they tend to be slow and cannot normally provide the level of productivity required by most digitization projects. There are also some preservation issues with the standard use of mounting oil to avoid Newton’s rings between the transparency and the drum. If mounting oil is used, the transparencies must be scrupulously cleaned after scanning.
ii) Flatbed scanner: It works like a photocopier: a lamp moves slowly across the face of the original and the reflected light is focused through a series of mirrors and lenses onto the recording medium. Here, the recording medium is a compact light sensor, either a Charge-Coupled Device (CCD) or a Contact Image Sensor (CIS), each of which is composed of hundreds or thousands of elements. When light strikes an element, the intensity of the light is assigned a number. The numeric readings of light intensity and the element positions are recorded in sequence into a file, which forms the digital version of the original. To capture colour, the scanner must either make three passes with a Red, Green or Blue filter in front of the CCD or have three lines of CCD elements, each with a Red, Green or Blue filter on top.
Fig. 4.2: Flatbed Scanner (HP Scanjet G2410)
Flatbed scanners are much cheaper than drum scanners and also much easier to operate. The technology and the quality of CCDs have improved a lot, yet flatbed scanners remain cheaper than drum scanners. Another advantage is that they can be operated by unskilled operators, as their functions are simple. The document to be scanned does not need to be bent around a drum. Flatbed scanners also offer higher scanning speed than drum scanners. Many flatbed scanners are available in the market. The major printer manufacturers offer low-cost flatbed scanners which can be used for scanning photographs and loose sheets.
iii) Overhead scanner: This type of scanner is quite expensive compared to a flatbed scanner, but it can be helpful when we need to capture images of extremely fragile materials. Overhead scanners that scan only in black and white should be avoided. A photograph of a Zeutschel overhead scanner, a popular scanner used by LICs and resource centres for digitization, is given below.
Fig. 4.3: An Overhead Scanner (Zeutschel os 5000)
Zeutschel scanners can be used to digitise books, magazines and other large documents. Special, careful procedures and functions for books are used during scanning. These include book cradles, radiographic tables, innovative light systems and the scanning of documents with the text facing upwards. Depending on customer needs, Zeutschel offers different models for colour, greyscale and black/white.
iv) Sheet-fed scanner: In this type of scanner, we have to slide sheets of paper through the
scanner. It is not good for capturing images of loose manuscripts, photographs, fragile
materials, etc.
v) Microfilm scanner: It is a good choice for microfilm, photographs, slides and negatives.
But it has the limitation of size of the scanning. The microfilm produced from the original
documents can be preserved in ideal condition for a very long time.
Fig. 4.4: Microfilm scanner (B-M-I EYECOM MIC5M)
The steady growth of digital imaging technology over the last five years has led to a vast range of professional and consumer scanners in the market. Quality and speed are steadily rising and cost is slowly falling. However, although it is possible to buy fast low-quality scanners or slow high-quality scanners at a low price, scanners that are both productive and high-quality still tend to be very expensive.
4.5.2 Digital Camera
A digital camera is a good choice not only for digitizing the valuable documents of an organization; it can also be used for other purposes, such as photographing the organization, its different sections, and its staff, and uploading these to the website of the organization. When damaged materials that cannot be moved must be digitized without disturbing their position, investing in a digital camera is the better choice. Any modern DSLR (Digital Single Lens Reflex) camera or point-and-shoot digital camera can be used as a document scanner. We can use a DSLR with a dedicated flash and a lens with some measure of zoom (18-55mm or 18-200mm). To do this properly, the room where scanning is done should be well lit.
Fig. 4.5: Digital Camera Used as Document Scanner
The camera must be properly aligned with the document; otherwise we will get slightly skewed shots, which could be a problem. We can use holding arms to fasten the camera in place while taking the photographs. Most tripods will not angle down enough for this to work, but if we place the document on an easel, it is feasible to find the right angle for alignment. During a visit to the University Library of Osmania University, the researcher observed a Sony digital camera being used to capture images of rare documents.
4.6 Software Used for Digitization
The scanner can only capture the image of the source document, which has to be processed further to enhance image quality and clarity or to make it searchable and accessible to users in future. For these purposes, we need software such as scanning software and Optical Character Recognition (OCR) software.
4.6.1 Scanning Software
For the proper operation of the scanner, we have to install the driver and the scanning software for that particular scanner. Scanner software controls the scanning process and drives the hardware that captures the image data and passes it on to the next stage of the image workflow. This software usually offers a range of image processing features. It can be either a device-specific program designed to work with one scanner or a plug-in based on a driver interface such as TWAIN or ISIS, which can be accessed from within a host program.
Software can play an important role within a workflow in terms of productivity and quality of the scan, so it is important to consider how best to combine the work undertaken by scanning software with that done by image processing software. In addition to setting resolution, scan area, colour or greyscale mode, and reflective/transmissive quality, the scanner software can also be used to control colour optimization, colour transmission, sharpening, tonal optimization, automated dust/scratch removal, negative to positive image selection, scan quality control, image rotation, batch scanning, etc.
Using any of these facilities at the time of acquiring the image can save a lot of time in corrective manipulation later on in the workflow, but it is worth comparing the performance of these functions between the scanner software and the image processing software when deciding which is going to be more effective. Some examples of scanning software are FreeKapture, VueScan and Scanitto Pro.
FreeKapture 2.0: It is a free TWAIN image capture application from TSoft that works on any TWAIN-compliant Windows (98 and later) system. TWAIN is, allegedly, an acronym for Technology Without An Interesting Name and is software (a driver) supplied by the manufacturer of TWAIN-compliant devices. Using this driver, FreeKapture is able to scan, save and print images (photographs etc.). Images are saved in JPG or BMP formats.
VueScan: It is an easy-to-use replacement for the software that comes with a scanner and supports most flatbed scanners, printer/scanners and film scanners. Over 10 million people have downloaded VueScan since it was first released in 1998. It is a powerful scanning tool with many useful features and currently supports over 1200 scanners and 321 digital camera RAW files.
Scanitto Pro: Scanitto Pro provides one-click scanning and copying utilizing TWAIN drivers, which provide exceptional scan and copy quality. In addition, Scanitto Pro integrates with all major operating systems to provide a seamless document management environment which is intuitive and very simple to use. Scanitto Pro is extremely stable and has passed all the major security and operational tests. It supports multiple file formats, including PDF, BMP, JPG, TIFF, JP2 and PNG. It also supports all major European languages, including English, French, German, Italian, Spanish and Russian.
4.6.2 OCR Software
A scanned document is nothing but a picture of a printed page. It cannot be edited or
manipulated or managed or searched based on the content. In other words, scanned
documents have to be referred to by their labels rather than characters in the documents.
OCR (Optical Character Recognition) software is used to transform a scanned image of a textual page into a word processing file. The function of OCR software is to take the captured image or set of images and generate a file containing the text in ASCII code or in a specified word processing format, leaving the image intact in the process.
OCR does not actually convert an image into text but rather creates a separate file containing the text. There are four types of OCR technology: matrix matching, feature extraction, structural analysis and neural networks. In matrix matching, each character is compared with a template of the same character. In feature extraction, a character is recognized from its structure and shape based on a set of rules. In structural analysis, characters are determined on the basis of density gradations or character darkness. Neural network technology uses a form of artificial intelligence that attempts to minimize human effort by using fuzzy logic; it is also known as ICR (Intelligent Character Recognition). A lot of OCR software is available in the market nowadays. ABBYY FineReader 11 and OmniPage Pro are two of the widely used OCR packages.
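Matrix matching, the first of the four approaches above, can be sketched with tiny hand-made templates. The 3 x 3 bitmaps, the three-letter alphabet and the function names below are hypothetical; commercial OCR engines use far larger templates and additional processing.

```python
# Hypothetical 3x3 templates: '1' is a black pixel, '0' is a white pixel.
TEMPLATES = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
    "T": ["111", "010", "010"],
}


def match_score(glyph, template):
    """Count the pixel positions where the glyph agrees with the template."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph, templates=TEMPLATES):
    """Return the character whose template agrees with the glyph most closely."""
    return max(templates, key=lambda name: match_score(glyph, templates[name]))


noisy_l = ["100", "100", "110"]  # an 'L' with one pixel lost in scanning
print(recognize(noisy_l))
```

Choosing the best score rather than requiring an exact match lets the sketch tolerate a small amount of scanning noise.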
ABBYY FineReader 11: With new support for Arabic (Modern Standard), Vietnamese and Turkmen (Latin), ABBYY FineReader 11 detects any combination of 189 languages. FineReader 11 supports a wide range of output formats. The OCR results can also be sent directly to applications such as Microsoft Word, Excel and PowerPoint, Adobe Acrobat, Corel WordPerfect and OpenOffice.org Writer. It has cutting-edge image correction tools which adjust motion blur, ISO noise, 3D image distortion, brightness, contrast, color levels and curved text for the best possible results.
OmniPage: The newest version of OmniPage utilizes the latest OCR software technology
with greatly increased accuracy and innovative cloud service capabilities and recognition of
123 languages. OCR loses its convenience if the software is too difficult or confusing to
use. Such is the risk with any multi-featured software. OmniPage easily navigates around
this risk with its intuitive design and logical layout. Even an OCR rookie could navigate
through the many features of this software.
4.7 Storage Space of Scanned Image: An Experimental Study
Two files, a textual file of size 19.2 KB and an image file of size 577 KB, both in docx file format, were created and printouts were taken. Both pages were scanned using two different flatbed scanners: the Avision FB6280E, an A3 book-edge scanner, and the Canon imageCLASS D520. The textual document was scanned at different resolutions using the black and white option, and the output was saved in different file formats on both scanners. The following table gives the file sizes of the images saved in the different file formats.
Table 4.2 File Size of B/W Scanned Image
Sl No. | File format | FB6280, 200 dpi | FB6280, 300 dpi | FB6280, 600 dpi | Canon imageCLASS D520, 200 dpi | Canon imageCLASS D520, 300 dpi | Canon imageCLASS D520, 600 dpi
--- | --- | --- | --- | --- | --- | --- | ---
1 | pdf | 152 KB | 315 KB | 1.04 MB | 34.1 KB | 95 KB | 67 KB
2 | bmp | 10.6 MB | 23.1 MB | 95.8 MB | 464 KB | 1 MB | 4.02 MB
3 | tiff | 6.08 MB | 15.5 MB | 59.1 MB | 457 KB | 1 MB | 1.78 MB
4 | jpg | 169 KB | 350 KB | 1.22 MB | - | - | -
5 | gif | 430 KB | 1.29 MB | 4.12 MB | - | - | -
A similar process was applied to the image printout, which was scanned using the colour option. The file sizes of the scanned documents saved in different file formats are presented in the following table.
Table 4.3 File Size of Colour Scanned Image
Sl No. | File format | FB6280, 200 dpi | FB6280, 300 dpi | FB6280, 600 dpi | Canon imageCLASS D520, 200 dpi | Canon imageCLASS D520, 300 dpi | Canon imageCLASS D520, 600 dpi
--- | --- | --- | --- | --- | --- | --- | ---
1 | pdf | 127 KB | 283 KB | 1.09 MB | 57.3 KB | 101 KB | 1 MB
2 | bmp | 6.32 MB | 14.2 MB | 56.9 MB | 10.7 MB | 24 MB | 96.2 MB
3 | tiff | 5.28 MB | 12.4 MB | 48.6 MB | 10.7 MB | 24 MB | 96.2 MB
4 | jpg | 459 KB | 336 KB | 1.28 MB | 277 KB | 629 KB | 2.70 MB
5 | gif | 507 KB | 1.20 MB | 4.75 MB | - | - | -
From Tables 4.2 and 4.3, it is found that the file size of the same document scanned on two different scanners and saved in the same file format at the same resolution is different. The file size of the scanned image generally increases with the scanning resolution: the higher the resolution, the greater the file size. The quality of the scanned images was found to be better at higher resolutions.
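The growth in file size can be checked against the measurements. The sketch below takes the BMP sizes recorded for the FB6280 in Table 4.2 and compares the observed growth with the square of the resolution ratio, which is what one would expect for an uncompressed format; the function name `growth` is illustrative.

```python
# BMP file sizes in MB for the FB6280, as recorded in Table 4.2.
bmp_mb = {200: 10.6, 300: 23.1, 600: 95.8}


def growth(sizes, base_dpi):
    """For each resolution, pair the observed size ratio with (dpi / base_dpi) squared."""
    return {
        dpi: (size / sizes[base_dpi], (dpi / base_dpi) ** 2)
        for dpi, size in sizes.items()
    }


for dpi, (observed, expected) in growth(bmp_mb, 200).items():
    print(f"{dpi} dpi: observed x{observed:.2f}, squared ratio x{expected:.2f}")
```

For these BMP files the observed growth stays close to the squared ratio. The compressed formats in the tables do not follow it as closely, since their size also depends on image content.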
4.8 Summing Up
Digitization has many aspects to be dealt with, from scan area and resolution to storage file formats. The selection of hardware and software is also a factor in a successful digitization project. University libraries can opt for either an in-house or an outsourced process to digitize their rich collections. They can approach institutes like CDAC-Noida, CDAC-Pune, IIIT Allahabad, and the Indira Gandhi National Centre for the Arts, New Delhi, to provide the necessary infrastructure and manpower for digitization of their valuable and rare documents, provided the conditions laid down by the respective bodies are acceptable to the university libraries.