International Atomic Energy Agency
Digital Preservation Session
Tue, 4 Nov 2008
34th INIS Liaison Officers’ Meeting
3-5 Nov 2008, Vienna, Austria
S. Rieder, G. St-Pierre, Y. Reynaud-Pulido, T. Kalapurackal
Database Production and Imaging Group, INIS Unit
INIS & NKM Section

Digital Preservation at INIS
INIS Mission:
• preservation of nuclear knowledge
• serving as a reservoir of nuclear information
• provision of quality information services
• promotion of a culture of “information and knowledge sharing”

Digital Preservation at INIS
• INIS Non-Conventional Literature (NCL)
• Production of the INIS electronic Full Text Database
• Digital Preservation Activities
• Digitization projects
  • at IAEA
  • at Member States

Digital Preservation at INIS
Objectives:
• Consistent, high level of image quality
• Interoperability and accessibility of digitized resources
• Long-term preservation of digital resources for future generations
Member States and IAEA: develop good practices for digital preservation
‘Overview of INIS Digital Preservation Practices’: INIS Information Letter No. 253 & Attachment (2008-10-03)
http://www.iaea.org/inisnkm/marea/restricted/restrictedpdf/2008/infoletter253_attachment.pdf

INIS Digital Preservation Principles
INIS principles and workflow are based on Cornell University’s digital imaging tutorial:
http://www.library.cornell.edu/preservation/tutorial/index.html
Available in English, French and Spanish

INIS Workflow
• Document Benchmarking
• Document Preparation
• Scanning
• Quality Control
• Image Enhancement
• Metadata Creation/Validation
• Export including Compression
• Completeness Check
• Back-up
• Post-processing
• Storage and dissemination

Benchmarking & Document Preparation
Benchmarking:
• How to adequately capture the ‘original’ content in digital form?
• Do the physical format & condition meet digitizing requirements?
• What is the type of material to be digitized?
• Which resolution?
• At which bit-depth?
• Which compression parameters?
• Estimated accuracy level for OCR?
• Other considerations?
Preparation:
• Physically (unbind, remove staples/clips, etc.)
• Structurally (add/remove barcodes, separate chapters, parts, etc.)
• Characteristics of paper (e.g. size, thickness, glossy/matt, condition)

Scanning – Capture Modes & Optical Resolution
Capture modes: depend on the physical form of the original
• Bitonal: 1 bit/pixel – black & white (printed text)
• Greyscale: 8 bits/pixel – 256 grey shades (black & white photographs)
• Colour: 24 bits/pixel – 16 million colours & grey shades (continuous tone & colour)
Optical Resolution: “dots per inch” (DPI) or “pixels per inch” (PPI)
• Higher resolution: finer detail, but larger file size
Bit depth: amount of information captured per pixel
• Greater bit depth: more accurate representation
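
The link between resolution, bit depth and file size can be sketched with a short calculation. This is an illustration only: the function name is hypothetical, the A4 page size is an assumed example, and the figures are for raw, uncompressed pixel data (real files add headers and are usually compressed).

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw (uncompressed) size of one scanned page in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed example page: A4, 210 mm x 297 mm (25.4 mm per inch)
A4_WIDTH_IN = 210 / 25.4
A4_HEIGHT_IN = 297 / 25.4

for mode, bits in [("bitonal", 1), ("greyscale", 8), ("colour", 24)]:
    size = uncompressed_size_bytes(A4_WIDTH_IN, A4_HEIGHT_IN, 300, bits)
    print(f"{mode:9s} {bits:2d} bit/pixel: {size / 1_000_000:.1f} MB")
```

At 300 dpi this gives roughly 1.1 MB for a bitonal A4 page, 8.7 MB for greyscale and 26.1 MB for colour. Doubling the resolution roughly quadruples each figure, because both the width and the height in pixels double.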

Scanning at INIS – Capture & Optical Resolution
INIS practice:
• Standard scanner settings (for plain b/w text): bitonal (black & white), 300 dpi
• Special cases (colour, pictures): greyscale and colour, 200 – 300 dpi with 8 bit depth (256 colours/tones)
• IMPORTANT: post-processing image compression needed to reduce file size
• NEVER use colour settings to scan B/W documents

Quality Control - QC
Retain: value, utility, integrity of resources
Verify: quality, accuracy, consistency
INIS verifies:
• accuracy & completeness (e.g. same number of pages?)
• data integrity
• correctness of metadata form and validity
• correct matching of metadata and image files
• ‘checksum’ algorithm (authenticity & integrity of digitized files): number & order of bytes (e.g. after move, copy, transfer, burn)
• visual inspection: resolution, colour, tone, appearance (caution: lighting conditions and monitors vary)
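
A small sketch of why a checksum covers both the number and the order of bytes. The slides do not say which algorithm INIS uses; SHA-256 from the Python standard library is an assumption here, and the sample byte strings are invented for illustration.

```python
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 digest of a byte string, as hexadecimal text."""
    return hashlib.sha256(data).hexdigest()


original = b"INIS full text, page 1"
truncated = original[:-1]              # one byte lost in transfer
reordered = b"INIS full text, paeg 1"  # same bytes, two of them swapped

# A plain sum of byte values cannot see the reordering...
print(sum(original) == sum(reordered))            # True
# ...but the digest changes when the number or the order of bytes changes.
print(checksum(original) == checksum(truncated))  # False
print(checksum(original) == checksum(reordered))  # False
# An exact copy always yields the same digest.
print(checksum(original) == checksum(bytes(original)))  # True
```

In practice the digest would be recorded when the master file is created and recomputed after each move, copy, transfer or burn.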

Image Enhancement
Definition: any process applied to the raw scan to improve quality or legibility of the resource
Image Enhancement at INIS:
• despeckling
• deskewing
• noise reduction
• black border removal
• colour and tone adjustment, etc.
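
To give an idea of what despeckling does, here is a toy sketch on a bitonal image held as a list of rows (1 = black, 0 = white). It is not the tool used at INIS, which the slides do not name; the rule used (clear a black pixel with no black neighbour) is a deliberately simple assumption, and production software uses more refined filters.

```python
def despeckle(image):
    """Remove isolated black pixels from a bitonal image.

    `image` is a list of rows; each pixel is 1 (black) or 0 (white).
    A black pixel is cleared when none of its 8 neighbours is black.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]  # work on a copy
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            has_black_neighbour = any(
                image[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_black_neighbour:
                cleaned[y][x] = 0
    return cleaned


page = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],  # isolated speck
    [0, 0, 0, 1, 1],  # part of a character stroke
    [0, 0, 0, 1, 0],
]
for row in despeckle(page):
    print(row)
```

The isolated dot in the second row is removed, while the connected pixels of the stroke are left untouched.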

Quality Control and Image Enhancement (1)
Skewed?

Quality Control and Image Enhancement (2)
Noisy (e.g. unnecessary dots)?

Quality Control and Image Enhancement (3)
Black border?

Quality Control and Image Enhancement (4)
IMPORTANT
• Paper Size must match document hard copy
• A4 ≠ Letter Size
• Text cut = RESCAN
• If noticed during QC of incoming PDF, INIS will request the Input Centre to resend the page

File Formats
Very important: prefer ‘non-proprietary’ formats
Several standard file formats exist, with different resolution, bit-depth, colour capabilities, etc.
INIS Digital Collection:
1. From ‘Paper’ or ‘Microfiche’ to ‘Digital’:
  • Master images in TIFF Group IV (b/w), in JPEG (colour)
  • Majority: full-text searchable PDF
2. Digital files received from INIS National Centres: PDF
Compression: JBIG2 (b/w), JPEG (colour)

Preservation Formats
PDF: open standard – official ISO 32000-1:2008
PDF/A:
• Long-term archiving of electronic documents
• Creation of PDF documents whose visual appearance will remain the same over the course of time
• Official ISO standard: ISO 19005-1:2005
• Further development ongoing: http://www.pdfa.org
INIS is considering adopting PDF/A:
• for efficient preservation
• for long-term archival of the Agency’s and Member States’ nuclear information resources
• pilot project in 2009

OCR – Optical Character Recognition
• Printed text searchable as electronic text
• Primary objective for INIS digitization projects: creation of ‘searchable full text’
INIS:
• major tool for mass production: ABBYY FineReader 8
• ~ 98% accuracy: printed text in Latin & Cyrillic characters
Satisfactory testing with Script and Arabic Characters:
• Adobe Acrobat Professional 8.0: Chinese (Simplified), Japanese, Korean
• ABBYY FineReader Pro 9: Hebrew, Thai
• VERUS™ Professional: Arabic
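
The slides do not say how the ~ 98% figure was measured. One common way to express OCR accuracy is at character level: compare the recognized text with a manually verified reference and count the edits needed to turn one into the other. The sketch below assumes that definition; the function names and sample strings are illustrative only.

```python
def edit_distance(reference, recognized):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(recognized) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, rec_char in enumerate(recognized, start=1):
            cost = 0 if ref_char == rec_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR output
                current[j - 1] + 1,      # extra character in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, recognized):
    """Share of reference characters recognized correctly, between 0 and 1."""
    if not reference:
        return 1.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 1.0 - errors / len(reference))


# Assumed sample: the letter "l" misread as the digit "1"
print(character_accuracy("nuclear knowledge", "nuc1ear knowledge"))
```

With this measure, 98% accuracy corresponds to about two wrong characters per hundred, which is why a verification sample is still useful before full-text search is relied on.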

Various OCR types
Typewritten
Hand print and cursive
Fraktur
Music scores
MICR (Magnetic Ink Character Recognition)

OCR process 1 (no or wrong dictionary)

OCR process 2 (proper dictionary)

OCR - value added
Image Layer:
• Scanned (raster) image
• Visual representation of the original document
Hidden Text:
• Enables full-text search
• Extra information for search engines
• May contain recognition errors

Storage of Digital Files
Mandatory: reliable & controlled environment
Storage of master files: high quality, industry standard devices, e.g. CD-R, DVD, or other contemporary reliable media
Backup of master files: regularly, off-site, secure location
RAID: Redundant Array of Independent Disks
• several drives act collectively as a single storage system
• consider RAID for large production environment
INIS: THECUS N5200B PRO, 5 x 3.5" SATA disks of 1 TB each in RAID 5, configured as local network data storage
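
A short worked example of what RAID 5 means for the configuration above. In RAID 5 the equivalent of one disk holds parity, so five 1 TB disks give about 4 TB of usable space and survive the failure of one disk. The sketch is simplified (real RAID 5 rotates parity across all disks and works on fixed-size stripes), and the function names and sample blocks are invented for illustration.

```python
from functools import reduce


def raid5_usable_capacity(disk_count, disk_size_tb):
    """Usable capacity of a RAID 5 array: one disk's worth is used for parity."""
    if disk_count < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (disk_count - 1) * disk_size_tb


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


print(raid5_usable_capacity(5, 1))  # 4 (TB)

# Four data blocks plus one parity block, as on a five-disk array
data = [b"INIS", b"NCL1", b"PDFA", b"TIFF"]
parity = xor_blocks(data)

# Simulate losing the third disk and rebuilding its block from the rest
survivors = data[:2] + data[3:] + [parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)  # b'PDFA'
```

RAID protects against a single disk failure only; it does not replace the regular off-site backup of master files.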

Back-up and Off-Site Storage
Create: regular back-ups of master files
Store: remote from the original source in a secure location
INIS:
1970 to 1997: ‘microfiche’
• NCL full text: paper to microfiche
• safe, long-term storage: INIS National Centres
• full set of NCL microfiche: Austrian Central Lib. of Physics
From 1997: ‘digital’
• NCL on CD: INIS Document Delivery Centres (National Centres)
• Secure “off site” & back-up: Austrian Central Lib. of Physics
• 2008: microfiche to PDF: Austrian Central Lib. of Physics, INIS National Centres, INIS Online Database

Preservation Planning
Contents of digital files must remain ‘meaningful’
Different processes:
1. Refreshing: copy files from one storage medium to another
  • verify authenticity & integrity of the files (e.g. checksum)
2. Migration: transfer files from one HW & SW to another, or from one computer generation to the next
  • format-based: move files from ‘obsolete’ format to ‘new’ format
3. Emulation: re-create technical environment
  • maintain information about HW & SW = system reengineered
INIS:
• Refreshing: CD to DVD (until 2007)
• from 2008: copy to Thecus storage device
• When PDF/A implemented: ‘migration’
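
Refreshing, as described above, amounts to copying every file to the new medium and confirming that each copy is identical to its source. A minimal sketch follows; the function names are hypothetical, the paths are supplied by the caller, and SHA-256 is an assumed choice of checksum rather than the documented INIS procedure.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source_dir, target_dir):
    """Copy all files to the new medium; return relative paths that failed verification."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    mismatches = []
    for source in sorted(source_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(source_dir)
        target = target_dir / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copies content and timestamps
        if sha256_of(source) != sha256_of(target):
            mismatches.append(str(relative))
    return mismatches
```

An empty list returned by `refresh` means every copy matched its source. Any path in the list should be copied again, or restored from the off-site backup, before the old medium is retired.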

Metadata
Key role for digital resources: describe, process, manage, track, access, preserve
INIS: comprehensive ‘bibliographic’ metadata
• describe the intellectual content of full text
• bibliographic elements to identify & retrieve resources
INIS Database: digital resources with bibliographic metadata
Technical metadata for digital resources: automatic creation with PDF files
Future: more sophisticated approach with implementation of PDF/A

Thank you for your attention!
Your INIS Digital Preservation Team