International Atomic Energy Agency: Digital Preservation Session, Tue, 4 Nov 2008. 34th INIS Liaison Officers’ Meeting, 3-5 Nov 2008, Vienna, Austria. S. Rieder, G. St-Pierre, Y. Reynaud-Pulido, T. Kalapurackal, Database Production and Imaging Group, INIS Unit, INIS & NKM Section


Page 1

International Atomic Energy Agency

Digital Preservation Session
Tue, 4 Nov 2008

34th INIS Liaison Officers’ Meeting
3-5 Nov 2008, Vienna, Austria

S. Rieder, G. St-Pierre, Y. Reynaud-Pulido, T. Kalapurackal

Database Production and Imaging Group, INIS Unit
INIS & NKM Section

Page 2

International Atomic Energy Agency, 34th ILO Meeting, 3-5 Nov 2008, Vienna, AT

Digital Preservation at INIS

INIS Mission:

• preservation of nuclear knowledge
• serving as a reservoir of nuclear information
• provision of quality information services
• promotion of a culture of “information and knowledge sharing”

Page 3

Digital Preservation at INIS

• INIS Non-Conventional Literature (NCL)
• Production of the INIS electronic Full Text Database
• Digital Preservation Activities
• Digitization projects
  • at IAEA
  • at Member States

Page 4

Digital Preservation at INIS

Objectives:

• Consistent, high level of image quality
• Interoperability and accessibility of digitized resources
• Long-term preservation of digital resources for future generations
  • Member States
  • IAEA
• Develop good practices for digital preservation

‘Overview of INIS Digital Preservation Practices’: INIS Information Letter No. 253 & Attachment (2008-10-03)

http://www.iaea.org/inisnkm/marea/restricted/restrictedpdf/2008/infoletter253_attachment.pdf

Page 5

INIS Digital Preservation Principles

INIS principles and workflow are based on Cornell University’s digital imaging tutorial:

http://www.library.cornell.edu/preservation/tutorial/index.html

available in English, French, Spanish

Page 6

INIS Workflow

• Document Benchmarking
• Document Preparation
• Scanning
• Quality Control
• Image Enhancement
• Metadata Creation/Validation
• Export including Compression
• Completeness Check
• Back-up
• Post-processing
• Storage and dissemination

Page 7

Benchmarking & Document Preparation

Benchmarking:
• Adequately capture the ‘original’ content in digital form?
• Physical format & condition meet digitizing requirements?
• What is the type of material to be digitized?
• Which resolution?
• At which bit-depth?
• Which compression parameters?
• Estimated accuracy level for OCR?
• Other considerations?

Preparation:
• Physically (unbind, remove staples/clips, etc.)
• Structurally (add/remove barcodes, separate chapters, parts, etc.)
• Characteristics of paper (e.g. size, thickness, glossy/matt, condition)

Page 8

Scanning – Capture Modes & Optical Resolution

Capture modes: depend on the physical form of the original
• Bitonal: 1 bit/pixel – black & white (printed text)
• Greyscale: 8 bits/pixel – 256 grey shades (black & white photographs)
• Colour: 24 bits/pixel – 16 million colours & grey shades (continuous tone & colour)

Optical Resolution: “dots per inch” (DPI) or “pixels per inch” (PPI)
• High resolution → fine detail → large file size

Bit depth: amount of information captured
• Greater bit depths → more accurate representation
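The link between bit depth, resolution and file size can be made concrete with a little arithmetic. The following is a minimal sketch and not part of the original slides: the function name and the A4 dimensions in inches are illustrative assumptions, and the result covers raw pixel data only (no file header, no row padding, no compression).

```python
# Illustrative sketch (not from the slides): uncompressed size of a scanned page.

A4_INCHES = (8.27, 11.69)  # assumed page size: A4 expressed in inches (approx.)

# Bits per pixel for the three capture modes named on the slide.
CAPTURE_MODES = {"bitonal": 1, "greyscale": 8, "colour": 24}


def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Return the raw pixel-data size in bytes, rounded up to a whole byte."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8


if __name__ == "__main__":
    for mode, bits in CAPTURE_MODES.items():
        size = uncompressed_size_bytes(*A4_INCHES, dpi=300, bits_per_pixel=bits)
        print(f"{mode:9s} at 300 dpi: {size / 1_000_000:.1f} MB uncompressed")
```

At the same resolution, a 24-bit colour scan holds 24 times as much raw pixel data as a bitonal scan of the same page.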

Page 9

Scanning at INIS – Capture & Optical Resolution

INIS practice:

Standard Scanner Settings (for plain b/w text):
• bitonal (black & white)
• 300 dpi

Special Cases (colour, pictures):
• greyscale and colour
• 200 – 300 dpi with 8 bit depth (256 colours/tones)
• IMPORTANT: post-processing image compression needed to reduce file size

NEVER use colour settings to scan B/W documents

Page 10

Quality Control - QC

Retain: value, utility, integrity of resources

Verify: quality, accuracy, consistency

INIS verifies:
• accuracy & completeness (e.g. same number of pages?)
• data integrity
• correctness of metadata form and validity
• correct matching of metadata and image files
• ‘checksum’ algorithm (authenticity & integrity of digitized files)
• number & order of bytes (e.g. after move, copy, transfer, burn)
• visual inspection: resolution, colour, tone, appearance (attention: lighting conditions and monitors vary)
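The check for "correct matching of metadata and image files" can be sketched as a set comparison. This is only an illustration under stated assumptions: the slides do not say how INIS names its files, so the convention that a file's base name equals its record identifier, and the function and key names, are hypothetical.

```python
# Illustrative sketch (not from the slides): match metadata records to files.
# Assumption: each file is named "<record id>.<extension>".
from pathlib import PurePath


def match_metadata_to_files(record_ids, filenames):
    """Report records without a file and files without a record."""
    expected = set(record_ids)
    found = {PurePath(name).stem for name in filenames}
    return {
        "missing_files": sorted(expected - found),  # record exists, file absent
        "orphan_files": sorted(found - expected),   # file exists, record absent
    }


if __name__ == "__main__":
    report = match_metadata_to_files(
        ["rec001", "rec002", "rec003"],          # hypothetical record ids
        ["rec001.pdf", "rec003.pdf", "rec009.pdf"],
    )
    print(report)
```

Both lists being empty means every record has exactly one matching file name; it says nothing about the content of the files, which is what the checksum and visual checks cover.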

Page 11

Image Enhancement

Definition: any process applied to the raw scan to improve quality or legibility of the resource

Image Enhancement at INIS:
• despeckling
• deskewing
• noise reduction
• black border removal
• colour and tone adjustment, etc.
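As an illustration of one enhancement step, despeckling of a bitonal scan can be sketched as removing black pixels that have no black neighbour. This is a deliberately simple rule chosen for the example; the slides do not describe the algorithm used by the INIS tools, and production software typically uses more robust filters.

```python
# Illustrative sketch (not from the slides): naive despeckling of a bitonal image.
# The image is a list of rows; each pixel is 1 (black) or 0 (white).


def despeckle(image):
    """Return a copy with isolated black pixels (no black 8-neighbour) removed."""
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            has_black_neighbour = False
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and image[ny][nx] == 1:
                        has_black_neighbour = True
            if not has_black_neighbour:
                cleaned[y][x] = 0  # isolated dot: treat as noise
    return cleaned


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],  # single isolated dot
        [0, 0, 0, 1, 1],  # two connected pixels, kept
        [0, 0, 0, 0, 0],
    ]
    for row in despeckle(page):
        print(row)
```

A rule this aggressive would also delete legitimate single-pixel marks such as fine punctuation at low resolution, which is one reason visual quality control follows enhancement.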

Page 12

Quality Control and Image Enhancement (1)

Skewed?

Page 13

Quality Control and Image Enhancement (2)

Noisy (e.g. unnecessary dots)?

Page 14

Quality Control and Image Enhancement (3)

Black border?

Page 15

Quality Control and Image Enhancement (4)

IMPORTANT
• Paper size must match document hard copy
• A4 ≠ Letter size
• Text cut = RESCAN
• If noticed during QC of incoming PDF, INIS will request the Input Centre to resend the page
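A page-size check of the "A4 ≠ Letter" kind can be sketched by comparing measured page dimensions with the nominal sizes (A4 is 210 × 297 mm, US Letter is 215.9 × 279.4 mm). The function names and the 2 mm tolerance below are assumptions made for the example, not INIS parameters.

```python
# Illustrative sketch (not from the slides): classify a page as A4 or Letter.

PAPER_SIZES_MM = {
    "A4": (210.0, 297.0),
    "Letter": (215.9, 279.4),
}


def points_to_mm(points):
    """Convert PDF points (1/72 inch) to millimetres."""
    return points * 25.4 / 72.0


def classify_paper_size(width_mm, height_mm, tolerance_mm=2.0):
    """Return 'A4', 'Letter', or None; orientation is ignored."""
    short_side, long_side = sorted((width_mm, height_mm))
    for name, (nominal_short, nominal_long) in PAPER_SIZES_MM.items():
        if (abs(short_side - nominal_short) <= tolerance_mm
                and abs(long_side - nominal_long) <= tolerance_mm):
            return name
    return None  # unknown size: flag for manual inspection


if __name__ == "__main__":
    # A PDF page of 612 x 792 points is Letter, not A4.
    print(classify_paper_size(points_to_mm(612), points_to_mm(792)))
```

Comparing the result with the size of the hard copy gives an early warning that text may have been cut off at the page edge.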

Page 16

File Formats

Very important: prefer ‘non-proprietary’ formats

Several standard file formats exist: different resolution, bit-depth, colour capabilities, etc.

INIS Digital Collection:
1. From ‘Paper’ or ‘Microfiche’ to ‘Digital’:
   • Master images in TIFF Group IV (b/w), in JPEG (colour)
   • Majority: full-text searchable PDF
2. Digital files received from INIS National Centres:
   • PDF

Compression: JBIG2 (b/w), JPEG (colour)

Page 17

Preservation Formats

PDF: open standard – official ISO 32000-1:2008

PDF/A:
• Long-term archiving of electronic documents
• Creation of PDF documents whose visual appearance will remain the same over the course of time
• Official ISO standard: ISO 19005-1:2005
• Further development ongoing
• http://www.pdfa.org

INIS: considers adopting PDF/A
• for efficient preservation and long-term archival of the Agency’s and Member States’ nuclear information resources
• pilot project in 2009

Page 18

OCR – Optical Character Recognition

• Printed text searchable as electronic text
• Primary objective for INIS digitization projects: creation of ‘searchable full text’

INIS:
• major tool for mass production: ABBYY FineReader 8
• ~ 98% accuracy: printed text in Latin & Cyrillic characters

Satisfactory testing with Script and Arabic Characters:
• Adobe Acrobat Professional 8.0: Chinese (Simplified), Japanese, Korean
• ABBYY FineReader Pro9: Hebrew, Thai
• VERUS™ Professional: Arabic
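The slides quote an accuracy of about 98% without saying how it was measured. One common way to express OCR accuracy is at the character level: one minus the edit distance between a proofread reference and the OCR output, divided by the length of the reference. The sketch below illustrates that definition only; it is an assumption, not the method behind the figure above, and the function names are illustrative.

```python
# Illustrative sketch (not from the slides): character-level OCR accuracy.


def edit_distance(reference, recognized):
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(recognized) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, rec_char in enumerate(recognized, start=1):
            substitution = previous[j - 1] + (ref_char != rec_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_accuracy(reference, recognized):
    """Return accuracy in the range 0.0 to 1.0 relative to the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, recognized)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    truth = "nuclear information"
    ocr_output = "nuc1ear inforrnation"  # typical confusions: l/1 and m/rn
    print(f"{character_accuracy(truth, ocr_output):.1%}")
```

Under this definition, 98% accuracy still leaves roughly one wrong character in every fifty, which matters for full-text search on long documents.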

Page 19

Various OCR types

• Typewritten
• Hand print and cursive
• Fraktur
• Music scores
• MICR (Magnetic Ink Character Recognition)

Page 20

OCR process 1 (no or wrong dictionary)

Page 21

OCR process 2 (proper dictionary)

Page 22

OCR - value added

Image Layer
• Scanned (raster) image
• Visual representation of the original document

Hidden Text
• Enables full-text search
• Extra information for search engines

errors

Page 23

Storage of Digital Files

Mandatory: reliable & controlled environment

Storage of master files: high quality, industry standard devices, e.g. CD-R, DVD, or other contemporary reliable media

Backup of master files: regularly, off-site, secure location

RAID: Redundant Array of Independent Disks
• several drives act collectively as a single storage system
• consider RAID for large production environment

INIS: THECUS N5200B PRO, 5 x 3.5" SATA RAID 5 disks, 1 TB each, configured as local network data storage
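For orientation, RAID 5 spends the equivalent of one disk on parity, so five 1 TB disks give about 4 TB of usable space and the array survives the loss of any single disk. The sketch below shows the capacity arithmetic and the XOR parity idea on small byte strings. It is a conceptual illustration with assumed function names, not a description of how the Thecus device is implemented; real RAID 5 also rotates the parity blocks across all disks.

```python
# Illustrative sketch (not from the slides): RAID 5 capacity and XOR parity.
from functools import reduce


def raid5_usable_capacity_tb(disk_count, disk_size_tb):
    """Usable capacity: one disk's worth of space is used for parity."""
    if disk_count < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (disk_count - 1) * disk_size_tb


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


if __name__ == "__main__":
    print(raid5_usable_capacity_tb(5, 1), "TB usable")

    # One stripe: four data blocks plus one parity block.
    data = [b"INIS", b"NCL1", b"PDFA", b"TIFF"]
    parity = xor_blocks(data)

    # Simulate losing the third disk and rebuilding its block.
    survivors = [data[0], data[1], data[3], parity]
    rebuilt = xor_blocks(survivors)
    print(rebuilt == data[2])
```

RAID protects against a disk failure but not against accidental deletion, fire or theft, which is why regular off-site backup remains necessary.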

Page 24

Back-up and Off-Site Storage

Create: regular back-ups of master files

Store: remote from the original source in a secure location

INIS:

1970 to 1997: ‘microfiche’
• NCL full text: paper → microfiche
• safe, long-term storage: INIS National Centres
• full set of NCL microfiche: Austrian Central Lib. of Physics

From 1997: ‘digital’
• NCL on CD: INIS Document Delivery Centres (National Centres)
• Secure “off site” & back-up: Austrian Central Lib. of Physics
• 2008: microfiche to PDF: Austrian Central Lib. of Physics, INIS National Centres, INIS Online Database

Page 25

Preservation Planning

Contents of digital files must remain ‘meaningful’

Different processes:
1. Refreshing:
   • copy files from one storage medium to another
   • verify authenticity & integrity of the files (e.g. checksum)
2. Migration:
   • transfer files from one HW & SW configuration to another, or from one computer generation to the next
   • format-based: move files from ‘obsolete’ format to ‘new’ format
3. Emulation:
   • re-create technical environment
   • maintain information about HW & SW = system reengineered

INIS:
• Refreshing: CD to DVD (until 2007)
• from 2008: copy to Thecus storage device
• When PDF/A implemented: ‘migration’
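Refreshing with checksum verification can be sketched as follows: compute a checksum of the source file, copy it to the new medium, recompute the checksum on the copy and compare. The slides do not name a checksum algorithm, so SHA-256 is an assumption here, as are the function names and paths.

```python
# Illustrative sketch (not from the slides): copy a file and verify the copy.
# Assumption: SHA-256 as the checksum; the slides do not specify an algorithm.
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh_file(source, target_dir):
    """Copy `source` into `target_dir` and verify the copy byte for byte."""
    source = Path(source)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name

    expected = sha256_of(source)
    shutil.copy2(source, target)  # copies content and file timestamps
    actual = sha256_of(target)

    if actual != expected:
        raise IOError(f"checksum mismatch after copying {source} to {target}")
    return target, actual
```

Storing the returned checksum next to the file lets later refresh cycles detect silent corruption on the storage medium, not only errors made during the copy.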

Page 26

Metadata

Key role for digital resources: describe, process, manage, track, access, preserve

INIS: comprehensive ‘bibliographic’ metadata
• describe the intellectual content of full text
• bibliographic elements to identify & retrieve resources

INIS Database: digital resources with bibliographic metadata

Technical metadata for digital resources: automatic creation with PDF files

Future: more sophisticated approach with implementation of PDF/A

Page 27

Thank you for your attention!

Your INIS Digital Preservation Team