Million Book Project @ Bibliotheca Alexandrina
Noha Adly, 20 November 2006
BA Digitization Workflow
Statistics - November 2006
                     Arabic      Latin      Total
Scanned    Books     22,023      4,646     26,669
           Pages  7,003,185  1,350,688  8,353,873
Processed  Books     21,947      4,642     26,589
           Pages  6,987,392  1,348,900  8,336,292
OCRed      Books     16,652      4,600     21,252
           Pages  5,248,337  1,327,385  6,575,722
Total Archived Data: 1,500 GB
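As a quick check on the figures above, the share of scanned books that have already been OCRed can be worked out per script. The sketch below uses only the book counts from the statistics; the function and variable names are illustrative.

```python
# Book counts from the November 2006 statistics (scanned vs. OCRed).
scanned = {"Arabic": 22_023, "Latin": 4_646}
ocred = {"Arabic": 16_652, "Latin": 4_600}


def ocr_coverage(scanned_books: int, ocred_books: int) -> float:
    """Return the percentage of scanned books that have been OCRed."""
    return 100.0 * ocred_books / scanned_books


coverage = {lang: ocr_coverage(scanned[lang], ocred[lang]) for lang in scanned}
total_coverage = ocr_coverage(sum(scanned.values()), sum(ocred.values()))

for lang, pct in coverage.items():
    print(f"{lang}: {pct:.1f}% of scanned books OCRed")
print(f"Total: {total_coverage:.1f}% of scanned books OCRed")
```

On these counts, Latin OCR is nearly complete, while roughly a quarter of the scanned Arabic books are still waiting for OCR.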
Statistics (Contd)
Daily Rates
– Scan: ≈ 1800 pages/person
– Process: ≈ 1800 pages/person
– Latin OCR: ≈ 4000 pages/person
– Arabic OCR: ≈ 1500 pages/person
Five Minolta scanners
2 shifts, 7 days a week
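The daily rates can be turned into a rough scanning capacity. The sketch below assumes one operator per scanner per shift and that the 1800 pages/person rate applies to each operator; neither assumption is stated in the slides, so treat the result as an order-of-magnitude estimate only.

```python
import math

PAGES_PER_PERSON_PER_DAY = 1800  # scan rate from the slide
SCANNERS = 5                     # five Minolta scanners
SHIFTS_PER_DAY = 2               # two shifts, seven days a week
TOTAL_SCANNED_PAGES = 8_353_873  # total pages scanned as of November 2006


def daily_scan_capacity(scanners: int, shifts: int, rate: int) -> int:
    """Pages scanned per day across all scanners and shifts.

    Assumption (not in the slides): one operator per scanner per shift,
    each working at the quoted per-person rate.
    """
    return scanners * shifts * rate


capacity = daily_scan_capacity(SCANNERS, SHIFTS_PER_DAY, PAGES_PER_PERSON_PER_DAY)
# Working days needed to scan the reported page total at that capacity.
days_needed = math.ceil(TOTAL_SCANNED_PAGES / capacity)

print(f"Estimated capacity: {capacity} pages/day")
print(f"Days to scan {TOTAL_SCANNED_PAGES:,} pages: {days_needed}")
```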
OCR
Image to Text
OCR - Arabic
Poses unique challenges
– Written cursively, with blocks of connected characters
– A ‘block of characters’ can have more than one baseline
– Uses external objects such as dots, 'Hamza' and 'Madda'
– Diacritization
– Characters can have more than one shape according to their position
– Overlapping makes it difficult to determine the spacing
Sakhr Automatic Reader is used
Tricky with old books
Requires learning
Arabic Script Is Cursive
Old, Smudgy, and Stuck Together
Use of Diacritics
Font    Low Bound  High Point  % Books
AR-H1      97.70%      99.50%    0.24%
AR-H2      97.60%      99.50%    2.66%
AR-H3      97.04%      99.10%    8.01%
AR-H4   Under construction
AR-L4      92.70%      96.70%    6.62%
DT-M1   Under construction
DT-L2      88.40%      96.80%    7.24%
TA-H1      97.30%      99.10%    1.26%
TA-H2      97.60%      99.20%   11.89%
TA-H3   Under construction
TA-H4      96.50%      97.74%    2.99%
TA-L1      94.00%      97.70%    1.65%
TA-L4      94.00%      97.90%    8.68%
TA-M2      95.80%      98.80%   23.47%
TA-M4      94.50%      97.50%   15.57%
X                                9.72%
16 Font Groups
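To summarize the font group table in a single pair of numbers, the bounds can be averaged with each group's share of books as the weight. The sketch below assumes that Low Bound and High Point are OCR accuracy bounds per font group, and it covers only the groups that have figures; the groups without figures and the "X" row are left out. Row labels follow the most likely reading of the flattened table.

```python
# (font group, low bound %, high point %, share of books %) from the table.
FONT_GROUPS = [
    ("AR-H1", 97.70, 99.50, 0.24),
    ("AR-H2", 97.60, 99.50, 2.66),
    ("AR-H3", 97.04, 99.10, 8.01),
    ("AR-L4", 92.70, 96.70, 6.62),
    ("DT-L2", 88.40, 96.80, 7.24),
    ("TA-H1", 97.30, 99.10, 1.26),
    ("TA-H2", 97.60, 99.20, 11.89),
    ("TA-H4", 96.50, 97.74, 2.99),
    ("TA-L1", 94.00, 97.70, 1.65),
    ("TA-L4", 94.00, 97.90, 8.68),
    ("TA-M2", 95.80, 98.80, 23.47),
    ("TA-M4", 94.50, 97.50, 15.57),
]


def weighted_bounds(groups):
    """Return (low, high) bounds averaged by each group's share of books."""
    total_share = sum(share for _, _, _, share in groups)
    low = sum(lo * share for _, lo, _, share in groups) / total_share
    high = sum(hi * share for _, _, hi, share in groups) / total_share
    return low, high


low, high = weighted_bounds(FONT_GROUPS)
covered = sum(share for _, _, _, share in FONT_GROUPS)
print(f"Groups with figures cover {covered:.2f}% of books")
print(f"Weighted low bound: {low:.2f}%, weighted high point: {high:.2f}%")
```

Weighting by share of books keeps a rare font group from pulling the summary as much as a common one such as TA-M2.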
Evaluation of VERUS and AR
Research agreement with NovoDynamics
Preliminary evaluation on two data sets is promising
– Challenge: difficult to OCR, degraded images
– Normal: known to return acceptable accuracy
[Pie chart] Challenge Set: 95% / 5% (legend: VERUS, AR)
[Pie chart] Normal Set: 38.67% / 61.33% (legend: VERUS, AR)
Encoding
Image on Text
Image-on-Text
Multilayered:
– Visible page image
– Hidden OCR text
View exact original layout while searching and highlighting
Supported with some OCR suites only
Supported formats: DjVu and PDF
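The image-on-text idea can be illustrated with a small data model: the page image stays visible, while each OCRed word is stored with its position so that a search can return the regions to highlight. This is a conceptual sketch only; the `Word` class and `find_highlights` function are hypothetical and do not reflect the actual DjVu or PDF internals or the tools used at BA.

```python
from dataclasses import dataclass

Box = tuple[int, int, int, int]  # (left, top, right, bottom) in image pixels


@dataclass
class Word:
    """One OCRed word in the hidden text layer, with its place on the image."""
    text: str
    box: Box


def find_highlights(hidden_layer: list[Word], query: str) -> list[Box]:
    """Return the boxes of hidden-layer words that match the query.

    Matching is case-insensitive and whole-word; the viewer would draw
    these boxes over the visible page image.
    """
    wanted = query.casefold()
    return [word.box for word in hidden_layer if word.text.casefold() == wanted]


# Placeholder page content, for illustration only.
page = [
    Word("Million", (10, 10, 90, 30)),
    Word("Book", (100, 10, 150, 30)),
    Word("book", (10, 40, 60, 60)),
]
print(find_highlights(page, "book"))
```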
Quality Assurance
No missing cover or pages
All pages are in order
Text quality
Image quality
PDF quality
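The first two checks (no missing pages, all pages in order) lend themselves to automation. The sketch below assumes each scanned image carries a page sequence number from 1 to the expected page count; that numbering scheme and the function name are assumptions for illustration. The text, image, and PDF quality checks are not covered.

```python
from collections import Counter


def check_page_sequence(page_numbers: list[int], expected_count: int) -> dict:
    """Report missing, duplicated, and out-of-order pages for one book."""
    expected = set(range(1, expected_count + 1))
    counts = Counter(page_numbers)
    missing = sorted(expected - set(counts))
    duplicates = sorted(page for page, n in counts.items() if n > 1)
    # Pages are in order when each number is strictly greater than the previous one.
    in_order = all(a < b for a, b in zip(page_numbers, page_numbers[1:]))
    return {"missing": missing, "duplicates": duplicates, "in_order": in_order}


# A book expected to have 5 pages, scanned with page 5 missing and 3/4 swapped.
report = check_page_sequence([1, 2, 4, 3], expected_count=5)
print(report)
```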
DAR
Digital Assets Repository
System Architecture
DAF/DAK APIs
Digital Assets Keeper (DAK)
Repository Database
Authentication and Authorization Subsystem
Users/groups/permissions Database
Storage Subsystem
Offline Storage
Online Storage
Integrated Library System
Catalog Database
User Interface
Administration Tool
Digitization Client
Archiving Tool
Cataloging Tool
Publishing Interface
OAI Gateway
Digital Assets Factory (DAF)
Digitization Database
Encoding Tool
DAK Publishing Module
Show notes
Transfer of Digitized Books
Challenges
– Storage: CD vs Online
– Bandwidth: 10 Mbps vs 155 Mbps
– Copyright: not published
Actions:
– Transferred 8,500+ books to the Internet Archive
– Process is still going on
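The bandwidth challenge is easier to appreciate with a back-of-the-envelope transfer time. The sketch below uses the 1,500 GB total archived data figure from the statistics as an illustrative payload (the slides do not give the size of the transferred books) and assumes the link is fully used with no protocol overhead, so real transfers would take longer.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_gb: float, link_mbps: float) -> float:
    """Days needed to move size_gb (decimal gigabytes) over a link_mbps link.

    Assumes full, uninterrupted use of the link and no protocol overhead.
    """
    bits = size_gb * 1e9 * 8
    seconds = bits / (link_mbps * 1e6)
    return seconds / SECONDS_PER_DAY


for mbps in (10, 155):
    print(f"1,500 GB at {mbps} Mbps: {transfer_days(1500, mbps):.1f} days")
```

Under these assumptions the same payload takes about two weeks at 10 Mbps and under a day at 155 Mbps.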
Books From India
Towards better collaboration
Books From India
Language          Number of Books
Arabic                        832
Arabic + French                 3
Arabic + German                 1
Persian                       101
French                          2
English                         1
Spanish                         1
German                          1
Total                         942
Progress
Phase Name   Done as of November 1, 2006   Expected to finish by   Comments
Cataloging   801                           -                       35 have metadata problems
Processing   742                           November 20, 2006
OCRing       200                           March 1, 2007
Encoding     171                           -                       -
Publishing   171                           -                       -
Metadata Problems
Processing
OCR Using VERUS or AR?
Calculated accuracy for a small sample
– Images processed once with darkening effect and once without
– VERUS likes darkening, AR does not
– Overall, AR won 70% of cases
[Pie chart] VERUS 30%, AR 70%
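A comparison like this can be tallied as follows. The sketch assumes each engine is scored per image by the better of its two runs (with and without darkening) and that the engine with the higher accuracy wins the image; the slides do not say how a "case" was defined, so this scoring rule is an assumption. The sample values are placeholders, not measurements from the project.

```python
from collections import Counter


def tally_wins(samples: dict) -> Counter:
    """Count, per engine, the images on which it has the best accuracy.

    `samples` maps an image id to {engine: {"dark": acc, "plain": acc}}.
    Each engine is scored by its better preprocessing variant; ties count
    for neither engine.
    """
    wins = Counter()
    for scores in samples.values():
        best = {engine: max(runs.values()) for engine, runs in scores.items()}
        top = max(best.values())
        leaders = [engine for engine, acc in best.items() if acc == top]
        if len(leaders) == 1:
            wins[leaders[0]] += 1
    return wins


def win_shares(wins: Counter) -> dict:
    """Convert win counts to percentages of all decided images."""
    total = sum(wins.values())
    return {engine: 100.0 * n / total for engine, n in wins.items()} if total else {}


# Placeholder accuracies for two images (illustration only).
samples = {
    "img1": {"VERUS": {"dark": 0.90, "plain": 0.85}, "AR": {"dark": 0.88, "plain": 0.93}},
    "img2": {"VERUS": {"dark": 0.95, "plain": 0.80}, "AR": {"dark": 0.70, "plain": 0.91}},
}
print(win_shares(tally_wins(samples)))
```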
Thank You