Newspaper digitization Frederick Zarndt IFLA Newspapers Section [email protected] @cowboyMontana hashtag #IFLAnewspaper

20140410 ifla digitization workshop [idlc kuala lumpur]


Newspapers digitization workshop at the 2014 International Digital Libraries Conference in Kuala Lumpur April 10, 2014.


Page 1: 20140410 ifla digitization workshop [idlc kuala lumpur]

Newspaper digitization

Frederick Zarndt IFLA Newspapers Section

[email protected] @cowboyMontana

hashtag #IFLAnewspaper

Page 2: 20140410 ifla digitization workshop [idlc kuala lumpur]

the agenda

1. Introductions
2. Review of the OAIS reference model
3. Newspaper digitization programs
4. Selection of materials
5. Importance of standards
6. Project management
7. Digitization workflow
7.1. Images
7.2. Metadata
7.3. File formats
8. Digitization workflow demonstration with docWorks
9. Quality assurance and acceptance criteria
10. Tools for digitization, workflow, digital preservation, and project management
11. Digital preservation considerations
12. Wrap-up

10.30 Morning tea break
13.00 Lunch
15.30 Afternoon tea break

Page 3: 20140410 ifla digitization workshop [idlc kuala lumpur]

An Open Archival Information System (or OAIS) is an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community.

Wikipedia contributors, “Open Archival Information System," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Open_Archival_Information_System (accessed March 2014).

Page 4: 20140410 ifla digitization workshop [idlc kuala lumpur]

• Negotiate for and accept appropriate information from information Producers.

• Obtain sufficient control of the information provided to the level needed to ensure Long-Term Preservation.

• Determine, either by itself or in conjunction with other parties, which communities should become the Designated Community and, therefore, should be able to understand the information provided.

• Ensure that the information to be preserved is Independently Understandable to the Designated Community. In other words, the community should be able to understand the information without needing the assistance of the experts who produced the information.

• Follow documented policies and procedures which ensure that the information is preserved against all reasonable contingencies, and which enable the information to be disseminated as authenticated copies of the original, or as traceable to the original.

• Make the preserved information available to the Designated Community.

Open Archival Information System (OAIS) reference model

Wikipedia contributors, “Open Archival Information System," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Open_Archival_Information_System (accessed March 2014).

Page 5: 20140410 ifla digitization workshop [idlc kuala lumpur]

Open Archival Information System (OAIS) reference model

Page 6: 20140410 ifla digitization workshop [idlc kuala lumpur]

programs

Page 7: 20140410 ifla digitization workshop [idlc kuala lumpur]

National

Collaborative

Individual

programs

Page 8: 20140410 ifla digitization workshop [idlc kuala lumpur]

national: centrally funded and managed programs with several participants. strict standards.

• National Digital Newspaper Program (Library of Congress)

• Australian Newspaper Digitisation Program

national programs

Page 9: 20140410 ifla digitization workshop [idlc kuala lumpur]

cooperative: organizations collaborate to achieve a common goal but digitization programs are managed separately. flexible standards.

• Europeana newspapers

• Digital Public Library of America

cooperative programs

Page 10: 20140410 ifla digitization workshop [idlc kuala lumpur]

individual: organization digitizes on its own. may, but more usually does not, follow open standards. all commercial organizations.

• ProQuest Historical Newspapers
• Newspapers.com
• Newsbank
• many others…

individual programs

Page 11: 20140410 ifla digitization workshop [idlc kuala lumpur]

• digitization program requires careful thought

• must be adapted to local circumstances

• ask those who have gone before

• join the IFLA Newspapers Section! (ask me how)

programs

Image courtesy of Donald Zolan.

Page 12: 20140410 ifla digitization workshop [idlc kuala lumpur]

Discussion questions

1. Has your organization already begun to digitize newspapers? How is the digitization program organized and funded?

2. If your organization hasn’t yet begun to digitize newspapers, what type of digitization program would best suit your organization / state / country? Why?

programs

Page 13: 20140410 ifla digitization workshop [idlc kuala lumpur]

Experience is that marvelous thing that enables you to recognize a mistake when you make it again.

F. P. Jones

Page 14: 20140410 ifla digitization workshop [idlc kuala lumpur]

selection

Page 15: 20140410 ifla digitization workshop [idlc kuala lumpur]

reasons for digitization

newspapers are deteriorating

microfilm is dissolving

no storage space


Page 16: 20140410 ifla digitization workshop [idlc kuala lumpur]

access

• Who are your users? Do you know?

• Can you ask them what they expect from a digital newspaper collection? Can you trust their answers?

• Trove, Papers Past, Cambridge Public Library, CDNC: These digital newspaper collections are used mostly by people 50+ years old and with an interest in family history.

Page 17: 20140410 ifla digitization workshop [idlc kuala lumpur]

Library of Congress selection criteria for the National Digital Newspaper Program (NDNP)

• Image quality
• Intellectual content
• Refinements

http://www.loc.gov/ndnp/guidelines/selection.html

Page 18: 20140410 ifla digitization workshop [idlc kuala lumpur]

selection for NDNP

Image quality

All NDNP newspaper images are scanned from microfilm.

1. Microfilm should be produced from properly prepared unbound originals.
2. Microfilm reduction ratio should be less than 20x. This allows 400dpi images to be scanned from the film.
3. Variations in microfilm density within and between images should be no more than 0.2.
4. Negative microfilm duplicated for scanning should have resolution test patterns readable at 5.0 or higher. For camera master microfilm without resolution test charts, resolution can be estimated by comparison to film with resolution test charts and original material.
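The link between reduction ratio and scan resolution in item 2 can be checked with simple arithmetic. The sketch below assumes that the resolution measured on the original page equals the resolution at which the film is scanned divided by the reduction ratio. The function names and the sample ratios are illustrative and are not part of the NDNP guidelines.

```python
def effective_dpi(film_scan_dpi: float, reduction_ratio: float) -> float:
    """Resolution relative to the original page, assuming a simple linear relationship."""
    if reduction_ratio <= 0:
        raise ValueError("reduction ratio must be positive")
    return film_scan_dpi / reduction_ratio


def required_film_dpi(target_dpi: float, reduction_ratio: float) -> float:
    """Film scanning resolution needed to reach target_dpi on the original page."""
    if reduction_ratio <= 0:
        raise ValueError("reduction ratio must be positive")
    return target_dpi * reduction_ratio


# Sample reduction ratios chosen only for illustration.
for ratio in (16, 20, 24):
    needed = required_film_dpi(400, ratio)
    print(f"{ratio}x reduction: scan the film at {needed:.0f} dpi for 400 dpi on the page")
```

Under this assumption a higher reduction ratio demands a proportionally higher film scanning resolution, which is one way to read the preference for ratios below 20x.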

Page 19: 20140410 ifla digitization workshop [idlc kuala lumpur]

Intellectual content

1. Newspaper title reflects the political, economic and cultural history of the State.
2. Selected newspaper titles should ensure broad geographical coverage.
3. Newspaper titles that provide coverage of a geographic area or a group over long time periods are preferred over short lived titles or titles with significant gaps.

selection for NDNP

Page 20: 20140410 ifla digitization workshop [idlc kuala lumpur]

Selection criteria refinements

1. Orphan titles: Special consideration should be given to high research value titles that have ceased publication and lack active ownership.
2. Newspaper titles that document a significant (minority) community at the state or regional level may be given special consideration.
3. Newspapers that have already been digitized by other organizations (for example, ProQuest) should not be digitized again.

selection for NDNP

Page 21: 20140410 ifla digitization workshop [idlc kuala lumpur]

National Library of Australia collection managers in consultation with staff from Preservation Services nominate materials for digitization. The Library works closely with state and territory libraries to systematically digitise newspapers held in these libraries. Selected newspapers include those with

• Cultural and/or historical significance
• Uniqueness and/or rarity of the material
• Copyright status or permission to digitise obtained
• Material in high demand
• Material at risk because of its physical condition

https://www.nla.gov.au/policy-and-planning/collection-digitisation-policy

selection for ANDP

Page 22: 20140410 ifla digitization workshop [idlc kuala lumpur]

Most newspaper titles selected for digitization are out of copyright and in the public domain. Negotiating use rights is quite simply too much trouble and fraught with legal pitfalls.

Copyright laws and policies vary considerably between countries.

copyright

Page 23: 20140410 ifla digitization workshop [idlc kuala lumpur]

…however…

Digitization of and public access to in-copyright newspapers are not impossible.

Page 24: 20140410 ifla digitization workshop [idlc kuala lumpur]


Page 25: 20140410 ifla digitization workshop [idlc kuala lumpur]


Page 26: 20140410 ifla digitization workshop [idlc kuala lumpur]


Page 27: 20140410 ifla digitization workshop [idlc kuala lumpur]


Page 28: 20140410 ifla digitization workshop [idlc kuala lumpur]


Page 29: 20140410 ifla digitization workshop [idlc kuala lumpur]

Discussion questions

1. Has your organization already selected newspapers to digitize? Why did it choose the titles that were selected? Please answer (hypothetically) if your organization hasn’t begun a newspapers digitization program.

2. Why would or why wouldn’t your organization select in-copyright newspapers to digitize?

selection

Page 30: 20140410 ifla digitization workshop [idlc kuala lumpur]


importance of standards

Page 31: 20140410 ifla digitization workshop [idlc kuala lumpur]

• Availability : Open standards are available for all to read and implement.

• Maximize end-user choice : Open standards create a fair, competitive market for implementation of the standards. They do not lock the customer into a particular vendor or group.

• No royalty : Open standards are free for all to implement, with no royalty or fee.

• No discrimination : Open standards and the organizations that administer them do not favor one implementor over another for any reason other than the technical standards compliance of a vendor's implementation.

• Extension or subset : Implementations of open standards may be extended, or offered in subset form. However, certification organizations may decline to certify subset implementations, and may place requirements upon extensions.

• Predatory practices : Open standards may employ license terms that protect against subversion of the standard by embrace-and-extend tactics. The licenses attached to the standard may require the publication of reference information for extensions, and a license for all others to create, distribute and sell software that is compatible with the extensions. An open standard may not otherwise prohibit extensions.

Adapted from FOSS Open Standards. http://en.wikibooks.org/wiki/FOSS_Open_Standards

open standards

Page 32: 20140410 ifla digitization workshop [idlc kuala lumpur]

• Not restrictive : Less chance of being locked in by a specific technology and/or vendor.

• Interoperable : Easier for systems from different parties or using different technologies to interoperate and communicate with one another.

• Protection against obsolescence : Better protection of the data files created by an application against obsolescence.

• Portable : Applications / data are easier to port from one platform to another since they follow known guidelines and rules, and the interfaces are known.

Adapted from FOSS Open Standards. http://en.wikibooks.org/wiki/FOSS_Open_Standards

open standards

Page 33: 20140410 ifla digitization workshop [idlc kuala lumpur]

What standards are important for newspaper digitization?

• METS XML is an open standard administered by the METS editorial board. See http://www.loc.gov/standards/mets/.
• ALTO XML is an open standard administered by the ALTO editorial board. See http://www.loc.gov/standards/alto/.
• Various image file formats including TIFF, JPEG, JPEG2000.
• PDF/A is a portable document format developed by Adobe. It is a subset of the complete PDF specification and has been adopted by ISO as a standard. See http://www.pdfa.org/.
• Various library metadata standards including, but not limited to
  • MODS XML http://www.loc.gov/standards/mods/
  • Dublin Core http://dublincore.org/
  • PREMIS http://www.loc.gov/standards/premis/

newspapers and standards

Page 34: 20140410 ifla digitization workshop [idlc kuala lumpur]

importance of standards

with few exceptions libraries use METS XML + ALTO XML + image files (TIFF, JPEG2000) for newspaper digitization programs

Page 35: 20140410 ifla digitization workshop [idlc kuala lumpur]

proprietary standards

Olive ActivePaper Archive stores historical newspaper data in an XML format that is as capable as METS/ALTO XML but is not an open standard.

Early versions of WordPerfect (MS Word too) stored data in a proprietary format, not in an open standard like Open Document Format (ODF). WordPerfect or special software is needed to view the files.

Adobe’s Flash is a de facto but not an open standard. Flash now appears to be on a path to obsolescence, destined to be replaced by HTML5.

Page 36: 20140410 ifla digitization workshop [idlc kuala lumpur]

Discussion questions

1. Name a few standards that you use every time you connect to the Internet.

2. What library standards does your organization currently use? What other, non-library standards, if any, does your organization use?

importance of standards

Page 37: 20140410 ifla digitization workshop [idlc kuala lumpur]

In theory, there's no difference between theory and practice, but in practice, there is.

Anonymous

Page 38: 20140410 ifla digitization workshop [idlc kuala lumpur]

project management

Page 39: 20140410 ifla digitization workshop [idlc kuala lumpur]

From the Standish Group’s 2012 Chaos Report on IT Project Failure.


Page 40: 20140410 ifla digitization workshop [idlc kuala lumpur]

Roger Sessions estimates that the worldwide cost of IT failure is USD $500 billion per month

Roger Sessions: CTO of ObjectWatch. He has written seven books including Simple Architectures for Complex Enterprises and many articles. He is a founding member of the Board of Directors of the International Association of Software Architects.

high cost of IT failure

Page 41: 20140410 ifla digitization workshop [idlc kuala lumpur]

in a recent survey of 1230 IT professionals conducted by Embarcadero Technologies, 2 of the 3 biggest project challenges cited by the IT pros are “poor planning” and “poor or no requirements”

plan!

Page 42: 20140410 ifla digitization workshop [idlc kuala lumpur]

in a March 2007 web poll conducted by the Computing Technology Industry Association "nearly 28 percent of the more than 1,000 respondents singled out poor communications as the number one cause of project failure"

communicate!

Page 43: 20140410 ifla digitization workshop [idlc kuala lumpur]

A recent survey of 752 IEEE members conducted by IEEE Spectrum and The New York Times discovered that "just 9 percent of 133 respondents whose organizations currently offshore R&D reported 'No problem'. The biggest headache was 'Language, communication, or culture' barriers, as reported by 54.1 percent of respondents." (http://www.spectrum.ieee.org/feb07/4881)

communicate!

Page 44: 20140410 ifla digitization workshop [idlc kuala lumpur]

In their 2009 book Cultural Intelligence: Living and Working Globally, Thomas and Inkson say “Although we increasingly cross boundaries and surmount barriers to trade, migration, travel, and the exchange of information, cultural boundaries are not so easily bridged. Unlike legal, political, or economic aspects of the global environment, which are observable, culture is largely invisible. Therefore, culture is the aspect of the global context that is most often overlooked.”

communicate!

Page 45: 20140410 ifla digitization workshop [idlc kuala lumpur]

in a white paper written for Project Perfect, Taimour al Neimat lists

• poor planning
• unclear goals and objectives
• objectives changing during the project
• unrealistic time or resource estimates
• lack of executive support and user involvement
• failure to communicate and act as a team
• inappropriate skills

as primary causes for the failure of complex IT projects

Taimour al Neimat. Why IT projects fail. The PROJECT PERFECT White Paper Collection. Oct 2005. http://www.projectperfect.com.au/downloads/Info/info_it_projects_fail.pdf accessed Mar 2014.

plan!

Page 46: 20140410 ifla digitization workshop [idlc kuala lumpur]

typical tender evaluation criteria in priority order

1. understanding of requirements
2. reputation of service bureau
3. price

requirements?

Page 47: 20140410 ifla digitization workshop [idlc kuala lumpur]

incomplete requirements

requirements in a recent tender from an (anonymous) government agency somewhere in the world

• project to convert ~ 170,000 text images to xml
• value of project ~ USD $180,000
• 19 pages of definitions, governing law, proposal evaluation criteria, contractual conditions, instructions about tender response format, etc
• technical requirements description? < 1 page
• data acceptance criteria? “a high level of accuracy”

Page 48: 20140410 ifla digitization workshop [idlc kuala lumpur]

complete requirements

Library of Congress JPEG2000 profile

Page 49: 20140410 ifla digitization workshop [idlc kuala lumpur]

poor planning

a recent newspapers digitization program established by a prominent national library

• digitize more than 20 million text pages
• high level image and xml requirements
• value of work awarded? > USD $5,000,000
• after award of work, technical requirements expand to 43+ pages from ~3 pages
• acceptance criteria? added as an afterthought and not well defined

Page 50: 20140410 ifla digitization workshop [idlc kuala lumpur]

the value of simplicity

“There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies.”

C.A.R. Hoare

Professor Sir Charles Anthony Richard Hoare Emeritus Professor at Oxford University, Senior Researcher at Microsoft Research, recipient of the ACM Turing Award, author of many books on computers and software.

Page 51: 20140410 ifla digitization workshop [idlc kuala lumpur]

• unitary: the requirement addresses one and only one thing

• complete: the requirement is fully stated in one place with no missing information

• consistent: the requirement does not contradict any other requirement and is fully consistent with all authoritative external documentation

• atomic: it does not contain conjunctions, for example, "the code field must validate American and Canadian postal codes" should be written as two separate requirements

good requirements

Page 52: 20140410 ifla digitization workshop [idlc kuala lumpur]

• traceable: the requirement meets all or part of a business need as stated by stakeholders and authoritatively documented

• current: the requirement has not been made obsolete by the passage of time

• feasible: the requirement can be implemented within the constraints of the project

• unambiguous: the requirement is concisely stated without recourse to technical jargon or acronyms

• verifiable: the implementation of the requirement can be determined through one of four possible methods: inspection, demonstration, test, or analysis

good requirements

Page 53: 20140410 ifla digitization workshop [idlc kuala lumpur]


Page 54: 20140410 ifla digitization workshop [idlc kuala lumpur]

• be impeccable with your word
• don’t take anything personally
• don’t make assumptions
• always do your best
• be mindful

simple principles for (good) communication

Page 55: 20140410 ifla digitization workshop [idlc kuala lumpur]

no communication ... little communication ... poor communication ... reduced communication ...

... all result in more assumptions about intent!

why (better) communication is necessary

Page 56: 20140410 ifla digitization workshop [idlc kuala lumpur]

The single biggest problem with communication is the illusion that it has taken place.

George Bernard Shaw, 1925 Nobel Prize in Literature.

Page 57: 20140410 ifla digitization workshop [idlc kuala lumpur]

“projects are about communication, communication, and communication”

Elenbaas, B. Staging a Project: Are You Setting Your Project Up for Success? Proceedings of the Project Management Institute Annual Seminars & Symposiums. 2000.

Page 58: 20140410 ifla digitization workshop [idlc kuala lumpur]

“Plan to throw one away; you will anyhow. If there is anything new about the function of a system, the first implementation will have to be redone completely to achieve a satisfactory (i.e., acceptably small, fast, and maintainable) result. It costs a lot less if you plan to have a prototype.”

Butler Lampson

Butler Lampson was a founding member of Xerox PARC, worked for DEC, and now works at Microsoft Research. He is an adjunct professor at MIT and an ACM Fellow.

the value of prototypes / pilots

Page 59: 20140410 ifla digitization workshop [idlc kuala lumpur]

implement: pilot

create requirements and acceptance criteria
repeat {
    digitize (small) pilot batch
    test data against acceptance criteria
    adjust requirements and acceptance criteria
} until (no more adjustments are necessary)
digitize more data

pilot batches are VERY VERY important!!
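The pilot loop on this slide can be sketched in Python. This is a toy illustration only: the callables `digitize_batch`, `passes`, and `adjust`, the `max_rounds` limit, and the accuracy figures are all hypothetical stand-ins for whatever a real project would use.

```python
from typing import Callable, List


def run_pilots(
    digitize_batch: Callable[[int], List[dict]],
    passes: Callable[[dict], bool],
    adjust: Callable[[List[dict]], None],
    max_rounds: int = 10,
) -> int:
    """Repeat small pilot batches until a round needs no adjustments; return rounds used."""
    for round_no in range(1, max_rounds + 1):
        batch = digitize_batch(round_no)            # digitize (small) pilot batch
        failures = [item for item in batch if not passes(item)]  # test against criteria
        if not failures:
            return round_no                          # no more adjustments are necessary
        adjust(failures)                             # adjust requirements and criteria
    raise RuntimeError("pilot did not converge; revisit the requirements")


# Toy data: each pilot page carries a made-up OCR accuracy in whole percent.
state = {"min_accuracy": 99}


def digitize_batch(round_no: int) -> List[dict]:
    # Pretend the production process improves a little every round.
    return [{"page": n, "accuracy": 95 + round_no} for n in range(3)]


def passes(item: dict) -> bool:
    return item["accuracy"] >= state["min_accuracy"]


def adjust(failures: List[dict]) -> None:
    # Here "adjusting" means relaxing an unrealistic criterion by one point.
    state["min_accuracy"] -= 1


rounds = run_pilots(digitize_batch, passes, adjust)
print(f"pilot converged after {rounds} rounds; accepted threshold {state['min_accuracy']}%")
```

Only after the loop ends would full production ("digitize more data") begin. The `max_rounds` guard is an addition of this sketch so that a pilot that never converges is noticed instead of running forever.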

Page 60: 20140410 ifla digitization workshop [idlc kuala lumpur]

implement: in-house

reasons for in-house production

• collection cannot be moved
• collection is badly organized
• digitization must be done slowly over a long period
• digitization is very simple

Page 61: 20140410 ifla digitization workshop [idlc kuala lumpur]

implement: outsource

reasons for outsourced production

• originals can’t be scanned in-house because…
  • equipment is too expensive
  • output data is beyond staff experience
  • labor is too expensive
• large volume of work in a short time
• insufficient space, infrastructure, or staff

Page 62: 20140410 ifla digitization workshop [idlc kuala lumpur]

The project management tool one chooses should be intuitive, easy to use, and accessible to all. If it isn’t, many will avoid / refuse / dislike / resent using it.

• Discussion of project management tools at http://en.wikipedia.org/wiki/Comparison_of_project-management_software

• List of project management tools at http://en.wikipedia.org/wiki/Comparison_of_project-management_software

project management tools

Page 63: 20140410 ifla digitization workshop [idlc kuala lumpur]

Discussion questions

1. What project management practices does your organization follow? Why?

2. What library standards does your organization currently use? What other, non-library standards, if any, does your organization use?

3. What reasons, in addition to those already cited, would your organization have to digitize newspapers in-house or to outsource digitization?

project management

Page 64: 20140410 ifla digitization workshop [idlc kuala lumpur]

“Perfection is attained, not when there is nothing left to add, but when there is nothing left to take away.”

Antoine de Saint-Exupéry

Page 65: 20140410 ifla digitization workshop [idlc kuala lumpur]

digitization workflow

Page 66: 20140410 ifla digitization workshop [idlc kuala lumpur]


• digital library: one or more digital collections

digitization workflow

Page 67: 20140410 ifla digitization workshop [idlc kuala lumpur]

digital library

Page 68: 20140410 ifla digitization workshop [idlc kuala lumpur]

• digital library: one or more digital collections
• digital collection: organized group(s) of digital objects

digitization workflow

Page 69: 20140410 ifla digitization workshop [idlc kuala lumpur]


digital collection

Page 70: 20140410 ifla digitization workshop [idlc kuala lumpur]

• digital library: one or more digital collections
• digital collection: organized group(s) of digital objects
• digital object: a surrogate or digital copy of the original source document, for example, a newspaper issue

digitization workflow

Page 71: 20140410 ifla digitization workshop [idlc kuala lumpur]

digital object

Page 72: 20140410 ifla digitization workshop [idlc kuala lumpur]

An example of what ALTO makes possible

The Day book. (Chicago, Ill.), 29 Feb. 1912. Chronicling America: Historic American Newspapers. Lib. of Congress. <http://chroniclingamerica.loc.gov/lccn/sn83045487/1912-02-29/ed-1/seq-26/>

Page 73: 20140410 ifla digitization workshop [idlc kuala lumpur]

• digital library: one or more digital collections
• digital collection: organized group(s) of digital objects
• digital object: a surrogate or digital copy of the original source document, for example, a newspaper issue
• metadata: data about data. information about a digital object(s) or a digital collection(s) or the original source document(s)

digitization workflow

Page 74: 20140410 ifla digitization workshop [idlc kuala lumpur]

metadata

Page 75: 20140410 ifla digitization workshop [idlc kuala lumpur]

why digitize newspapers?

• to enhance accessibility
• to increase collaboration and cooperation between libraries and archives around the world
• to promote research
• to provide opportunities for entrepreneurs
• other reasons?

Page 76: 20140410 ifla digitization workshop [idlc kuala lumpur]

Open Archival Information System (OAIS) reference model

Page 77: 20140410 ifla digitization workshop [idlc kuala lumpur]

digitization workflow

Page 78: 20140410 ifla digitization workshop [idlc kuala lumpur]

source → produce images → images → produce digital objects → objects → ingest → preserve → access

the digitization process

Page 79: 20140410 ifla digitization workshop [idlc kuala lumpur]

source → produce images → images

the digitization process

Page 80: 20140410 ifla digitization workshop [idlc kuala lumpur]

standard file formats

• image file formats
  • TIFF
  • JPEG2000
  • JPEG
  • GIF
• text file formats
  • PDF, PDF/A, PDF/A-1b, PDF/A-1a
  • TEI XML
  • HTML
  • plain text
  • NITF / NewsML
• metadata
  • METS
  • MODS / PREMIS / ALTO / MIX ...

Page 81: 20140410 ifla digitization workshop [idlc kuala lumpur]

image decisions

• image production source materials
  • original documents: better quality, more expensive
  • microfiche: poorer quality, less expensive, microfiche quality varies
• bit depth
  • black-and-white (bitonal)
  • greyscale
  • color
• resolution
• compression
  • no compression
  • lossless (reversible)
  • lossy (irreversible)
• image metadata

Page 82: 20140410 ifla digitization workshop [idlc kuala lumpur]

image format comparison

Wikipedia contributors, "Comparison of Graphics File Formats," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Comparison_of_graphics_file_formats (accessed August 1, 2012)

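• JBIG (.jbig, .jbg): compression: lossless; bit depth: 1-bit; metadata: no; color management: no; 1st public release: 2000?

• JPEG (.jpg, .jpeg): compression: lossy, DCT, RLE, Huffman; bit depth: 8-bit, 12-bit, 24-bit; metadata: yes; color management: yes; mime type: image/jpeg, public.jpeg; patent: no; 1st public release: 1992

• JPEG2000 (.jp2): compression: many lossless and lossy compression algorithms; bit depth: 8-bit, 16-bit, color to 48 bits; metadata: yes; color management: yes; mime type: image/jp2, public.jpeg200; patent: yes but part 1 is patent free; 1st public release: 2000

• TIFF (.tiff, .tif): compression: none, LZW, RLE, ZIP, Other; bit depth: 1, 2, 4, 8, 16, 24, 32 bits; metadata: yes; color management: yes; mime type: image/tiff, public.tiff; patent: no; 1st public release: 1986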

Page 83: 20140410 ifla digitization workshop [idlc kuala lumpur]

image compression comparison

Format | The Sacred Heart Review 300dpi | Los Angeles Star 300dpi | Die Susquehanna Zeitung 600dpi
TIFF (uncompressed) | 17.2 MB | 87 MB | 415.5 MB
TIFF (lossless LZW compression) | 10.2 MB | 75.8 MB | 232.9 MB
JPEG (maximum quality [lossless]) | 7.0 MB | 37.2 MB | 101.1 MB
JPEG (medium quality) | 1.5 MB | 4.6 MB | 10.2 MB
JPEG2000 (lossless compression) | 7.1 MB | 52.7 MB | 166.2 MB
JPEG2000 (lossy [70] compression) | 5.1 MB | 37.1 MB | 116.7 MB
JPEG2000 (lossy [30] compression) | 2.2 MB | 16.1 MB | 50.3 MB
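To compare the options across the three titles, it can help to express each file size as a percentage of the uncompressed TIFF. The short sketch below does that with the figures from this slide; the variable and function names are illustrative.

```python
# File sizes in MB, copied from the slide. Column order:
# The Sacred Heart Review, Los Angeles Star, Die Susquehanna Zeitung.
sizes_mb = {
    "TIFF (uncompressed)": (17.2, 87.0, 415.5),
    "TIFF (lossless LZW compression)": (10.2, 75.8, 232.9),
    "JPEG (maximum quality)": (7.0, 37.2, 101.1),
    "JPEG (medium quality)": (1.5, 4.6, 10.2),
    "JPEG2000 (lossless compression)": (7.1, 52.7, 166.2),
    "JPEG2000 (lossy [70] compression)": (5.1, 37.1, 116.7),
    "JPEG2000 (lossy [30] compression)": (2.2, 16.1, 50.3),
}

baseline = sizes_mb["TIFF (uncompressed)"]


def percent_of_baseline(sizes):
    """Express each size as a percentage of the uncompressed TIFF for the same title."""
    return tuple(round(100 * size / base, 1) for size, base in zip(sizes, baseline))


relative = {name: percent_of_baseline(sizes) for name, sizes in sizes_mb.items()}

for name, percents in relative.items():
    print(f"{name:36s}" + "".join(f"{p:8.1f}%" for p in percents))
```

The percentages differ noticeably between titles for the same format, which suggests that savings depend on the source material and should be measured on a pilot sample rather than assumed.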

Page 84: 20140410 ifla digitization workshop [idlc kuala lumpur]

image bit depth comparison

Format | USA case law image 1, 300dpi | USA case law image 2, 300dpi
TIFF 1-bit CCITT G4 compression | 40 KB | 87 KB
JPEG2000 W5x3 reversible compression | 2.6 MB | 3.6 MB
JPEG2000 W9x7 irreversible compression | 647 KB | 1 MB

Page 85: 20140410 ifla digitization workshop [idlc kuala lumpur]

Image courtesy of http://epsos.de (accessed at http://commons.wikimedia.org March 2014).

GARBAGE IN, GARBAGE OUT

GIGO

Page 86: 20140410 ifla digitization workshop [idlc kuala lumpur]

Deaths. lln»rieff, Esq. of <c .. Qn. Sunday, the till. greatly Drandrellt, of Orms4\irJi.- ~ ; ;✓ ' • * On ijfr r inn l j j j i l F i i j ' 1 1 f H a v o d i v y d , Carnarvonshire, S ; **" *- ' « ' March Oxford, F. Tfovmeud, Uerald. » • V . • O n T n c s d a v l a s t , M r . C har l es . IWilinson, this 8 ; had vf thesis#,, a week ago, which tcrminate<i'iu his death. . / ' ■ O'i Sunday, dJst nit. at. AsbtCnvHall, mar Lancaster, Mr.,Geo. Worn ick, many years house'steward hit late Once The Hamilton and Brandon. He locked himself h»oWn'r«wte<: soon. twelve o'clock" that dny, and fii»-d a loaded pistol " t h r o u g h I n s b e a d , 1 w h i c h instantaneously killed him. Coronet's Verdict, shot himself in a temporary fit of Friday week,

raw OCR text

Excerpt from The British Newspaper Archive, Chester Courant, Tuesday 6-Apr-1819, page 3.

newspaper image

Page 87: 20140410 ifla digitization workshop [idlc kuala lumpur]

Discussion topics

1. Assume your organization decides to digitize 1000 newspaper issues averaging 12 pages per issue. The images are scanned 2-up and average 80MB each. How much disk storage is needed for the images?

2. Now assume instead that your organization uses TIFF images with LZW (lossless) compression, which saves on average 40%. How much disk storage is needed for the images?
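One way to work through the two storage questions is sketched below. It assumes that "2-up" means two newspaper pages per image, that the 80MB average applies per image, and that 1 GB = 1000 MB; the function name is illustrative.

```python
import math


def image_storage_gb(issues, pages_per_issue, pages_per_image, mb_per_image, savings=0.0):
    """Estimate image storage in decimal GB.

    savings is the fraction of space saved by compression (0.4 means 40% smaller).
    """
    images_per_issue = math.ceil(pages_per_issue / pages_per_image)
    total_images = issues * images_per_issue
    return total_images * mb_per_image * (1.0 - savings) / 1000


uncompressed_gb = image_storage_gb(1000, 12, 2, 80)
lzw_gb = image_storage_gb(1000, 12, 2, 80, savings=0.4)

print(f"question 1: about {uncompressed_gb:.0f} GB")
print(f"question 2: about {lzw_gb:.0f} GB")
```

Under these assumptions there are 6,000 images in total. Backup copies, derivative images, and XML files would add to the figure and are not counted here.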

digitization workflow

Page 88: 20140410 ifla digitization workshop [idlc kuala lumpur]

why (better) communication is necessary

Page 89: 20140410 ifla digitization workshop [idlc kuala lumpur]

images → produce digital objects → objects

the digitization process

Page 90: 20140410 ifla digitization workshop [idlc kuala lumpur]

images → image processing → layout analysis → OCR → metadata → build digital objects → objects

the digitization process

Page 91: 20140410 ifla digitization workshop [idlc kuala lumpur]

images → image processing → layout analysis → OCR → metadata → build digital objects → objects

the digitization process

• crop, de-skew, split images
• apply image improvement algorithms as needed
  • sharpening filters
  • local adaptive thresholding
  • remove text bleed-thru
  • etc
• create master images
• create working images

Page 92: 20140410 ifla digitization workshop [idlc kuala lumpur]


Page 93: 20140410 ifla digitization workshop [idlc kuala lumpur]


Page 94: 20140410 ifla digitization workshop [idlc kuala lumpur]


Page 95: 20140410 ifla digitization workshop [idlc kuala lumpur]

what’s wrong with this image?

Page 96: 20140410 ifla digitization workshop [idlc kuala lumpur]

text is skewed about 1° from vertical

Page 97: 20140410 ifla digitization workshop [idlc kuala lumpur]

text is de-skewed

text is skewed

Page 98: 20140410 ifla digitization workshop [idlc kuala lumpur]

the digitization process

images → image processing → layout analysis → OCR → metadata → build digital objects → objects

• analyze layout of text image
• estimate font types and sizes
• calculate coordinates of text blocks
• determine layout object types (text, illustration, headline, etc.)

Page 99: 20140410 ifla digitization workshop [idlc kuala lumpur]

newspaper text layout analysis

Page 100: 20140410 ifla digitization workshop [idlc kuala lumpur]

the digitization process

images → image processing → layout analysis → OCR → metadata → build digital objects → objects

• perform optical character recognition (OCR)
• calculate word and character coordinates
• calculate word and character confidences
• apply language dictionaries
• correct OCR text (optional)

Page 101: 20140410 ifla digitization workshop [idlc kuala lumpur]

the digitization process

images → image processing → layout analysis → OCR → metadata → build digital objects → objects

• populate metadata fields
• verify / correct page numbers
• verify / correct document structure

Page 102: 20140410 ifla digitization workshop [idlc kuala lumpur]

the digitization process

images → image processing → layout analysis → OCR → metadata → build digital objects → objects

• create METS / ALTO XML files
• create image files and image metadata
• create PDF files (if required)
• verify digital object
• calculate file fixity checks (checksums)
• perform file validation and verification
• perform quality assurance

Page 103: 20140410 ifla digitization workshop [idlc kuala lumpur]

real world digitization production workflow

• automatic production steps performed by software
• manual production steps performed by operators

Page 104: 20140410 ifla digitization workshop [idlc kuala lumpur]

digital library standards

• METS XML for descriptive, structural, technical, and administrative metadata
• descriptive metadata
  • Metadata Object Description Schema (MODS): selected metadata from MARC
  • Dublin Core: fundamental group of text elements for describing and cataloging
• technical metadata
  • ALTO for OCR text
  • PREMIS for digital preservation
  • MIX and ANSI/NISO Z39.87 for images

Page 105: 20140410 ifla digitization workshop [idlc kuala lumpur]

Metadata Encoding and Transmission Standard

• METS is an XML standard for encoding descriptive, administrative, and structural metadata about objects within a digital library
• METS files consist of 7 sections, all optional except the structural map: header, descriptive metadata, administrative metadata, file section, structural map, structural links, and behavior
• METS profiles describe a class of METS documents in sufficient detail to provide both document authors and programmers the guidance to create and process METS documents conforming with a particular profile
• current version 1.9.1
• administered by METS editorial board (international group of volunteers)
• standards hosted by Library of Congress at http://www.loc.gov/standards/mets/

Page 106: 20140410 ifla digitization workshop [idlc kuala lumpur]

Graphic from Karin Bredenberg, Communicating Archival Metadata conference and workshops. Riksarkivet, 2011.

METS file structure

Page 107: 20140410 ifla digitization workshop [idlc kuala lumpur]

Metadata Object Description Schema

• MODS is an XML schema for a bibliographic element set that may be used for library applications. Derivative of the MARC 21 bibliographic format. Includes a subset of MARC fields, using language-based tags rather than numeric ones
• Subset of MARC 21
• Mappings exist between MODS and MARC, Dublin Core, and RDA (conversion tools exist)
• May be used in conjunction with METS XML
• current version 3.4
• administered by Library of Congress Network Development and MARC Standards Office with help from interested users
• standards hosted by Library of Congress at http://www.loc.gov/standards/mods/

Page 108: 20140410 ifla digitization workshop [idlc kuala lumpur]

MODS metadata in METS XML

<mets:dmdSec ID="issue-nla.news-issn18368190_18740425">
  <mets:mdWrap MDTYPE="MODS">
    <mets:xmlData>
      <mods:mods xmlns="http://www.loc.gov/mods/v3">
        <mods:language>
          <mods:languageTerm type="code" authority="rfc3066">en</mods:languageTerm>
        </mods:language>
        <mods:genre>newspaper issue</mods:genre>
        <mods:originInfo>
          <mods:dateIssued>18740425</mods:dateIssued>
        </mods:originInfo>
        <mods:relatedItem type="host">
          <mods:titleInfo>
            <mods:title>The Queenslander (Brisbane, Qld. : 1866-1939)</mods:title>
          </mods:titleInfo>
          <mods:genre>newspaper</mods:genre>
          <mods:identifier>ISSN18368190</mods:identifier>
          <mods:part>
            <mods:detail type="volume">
              <mods:number>IX</mods:number>
            </mods:detail>
          </mods:part>
          <mods:part>
            <mods:detail type="issue">
              <mods:number>12</mods:number>
            </mods:detail>
          </mods:part>
        </mods:relatedItem>
      </mods:mods>
    </mets:xmlData>
  </mets:mdWrap>
</mets:dmdSec>
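A minimal sketch of how a program might read the MODS fields out of a METS descriptive metadata section like the one above. It uses Python's standard `xml.etree.ElementTree`. The sample is a shortened copy of the slide's record with both namespace prefixes declared explicitly (an assumption, since the slide shows only a fragment), and `issue_summary` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for METS and MODS v3.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
}

# Shortened version of the slide's record; prefixes declared here so the
# fragment parses on its own (assumption: the full file declares them).
SAMPLE = """<mets:dmdSec xmlns:mets="http://www.loc.gov/METS/"
    xmlns:mods="http://www.loc.gov/mods/v3"
    ID="issue-nla.news-issn18368190_18740425">
  <mets:mdWrap MDTYPE="MODS">
    <mets:xmlData>
      <mods:mods>
        <mods:originInfo>
          <mods:dateIssued>18740425</mods:dateIssued>
        </mods:originInfo>
        <mods:relatedItem type="host">
          <mods:titleInfo>
            <mods:title>The Queenslander (Brisbane, Qld. : 1866-1939)</mods:title>
          </mods:titleInfo>
        </mods:relatedItem>
      </mods:mods>
    </mets:xmlData>
  </mets:mdWrap>
</mets:dmdSec>"""


def issue_summary(xml_text):
    """Return the section ID, issue date, and host newspaper title."""
    root = ET.fromstring(xml_text)
    date = root.findtext(".//mods:originInfo/mods:dateIssued", namespaces=NS)
    title = root.findtext(
        ".//mods:relatedItem[@type='host']/mods:titleInfo/mods:title",
        namespaces=NS,
    )
    return {"id": root.get("ID"), "date_issued": date, "title": title}


summary = issue_summary(SAMPLE)
print(summary)
```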

Page 109: 20140410 ifla digitization workshop [idlc kuala lumpur]

Dublin Core metadata

• Dublin Core is a set of vocabulary terms used to describe resources for the purposes of discovery.
• Dublin Core metadata element set is endorsed in IETF RFC 5013, ISO 15836:2009, and NISO Z39.85
• Metadata terms last updated 14-Jun-2012
• May be used in conjunction with METS XML
• Dublin Core Metadata Initiative (DCMI) is an open organization, incorporated as a public, not-for-profit company in Singapore
• Dublin Core Metadata Initiative is hosted at http://dublincore.org/

Page 110: 20140410 ifla digitization workshop [idlc kuala lumpur]

Analyzed Layout and Text Object

• ALTO XML provides technical metadata for describing the layout and content of physical text resources, such as pages of a book or a newspaper
• commonly used in conjunction with METS XML but may be used standalone
• current version 2.1
• administered by ALTO editorial board (international group of volunteers)
• standards hosted by Library of Congress at http://www.loc.gov/standards/alto/

Page 111: 20140410 ifla digitization workshop [idlc kuala lumpur]

<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://schema.ccs-gmbh.com/metae/alto-1-4.xsd" xmlns:xlink="http://www.w3.org/1999/xlink">
  <Description>
    <MeasurementUnit>pixel</MeasurementUnit>
    <sourceImageInformation>
      <fileName>//docstorage/impdata_2$/IN/NLA/db0046/batch-1109/nlaImageSeq-2349218-b.tif</fileName>
    </sourceImageInformation>
  </Description>
  <Styles>
    <TextStyle ID="TXT_0" FONTSIZE="7" FONTFAMILY="Times New Roman" FONTSTYLE="bold"/>
    <TextStyle ID="TXT_1" FONTSIZE="9" FONTFAMILY="Times New Roman" FONTSTYLE="bold"/>
  </Styles>
  <Layout>
    <Page ID="P1" PHYSICAL_IMG_NR="1" HEIGHT="9224" WIDTH="7136" PC="0.967">
      <TopMargin ID="P1_TM00001" HPOS="0" VPOS="0" WIDTH="7135" HEIGHT="814"/>
      <LeftMargin ID="P1_LM00001" HPOS="0" VPOS="814" WIDTH="151" HEIGHT="8194"/>
      <RightMargin ID="P1_RM00001" HPOS="6959" VPOS="814" WIDTH="176" HEIGHT="8194"/>
      <BottomMargin ID="P1_BM00001" HPOS="0" VPOS="9008" WIDTH="7135" HEIGHT="216"/>
      <PrintSpace ID="P1_PS00001" HPOS="151" VPOS="814" WIDTH="6808" HEIGHT="8194">
        <ComposedBlock ID="ART1" HEIGHT="2366" WIDTH="929" HPOS="209" VPOS="831">
          <ComposedBlock ID="ZONE1-1" HEIGHT="88" WIDTH="641" HPOS="357" VPOS="831">
            <TextBlock ID="P1_TB00004" HPOS="357" VPOS="831" WIDTH="641" HEIGHT="88" STYLEREFS="TXT_4 PAR_LEFT">
              <TextLine ID="P1_TL00065" HPOS="357" VPOS="831" WIDTH="641" HEIGHT="75">
                <String ID="P1_ST00404" HPOS="357" VPOS="831" WIDTH="65" HEIGHT="74" CONTENT="The" WC="0.98" CC="000"/>
                <SP ID="P1_SP00340" HPOS="422" VPOS="906" WIDTH="0"/>
                <String ID="P1_ST00405" HPOS="422" VPOS="831" WIDTH="576" HEIGHT="74" CONTENT="Queenslander." WC="0.96" CC="0000000000000"/>
              </TextLine>
            </TextBlock>
          </ComposedBlock>
          <ComposedBlock ID="ZONE1-2" HEIGHT="83" WIDTH="894" HPOS="228" VPOS="964"/>
          <ComposedBlock ID="ZONE1-3" HEIGHT="46" WIDTH="702" HPOS="331" VPOS="1087"/>
              <TextLine ID="P1_TL01143" HPOS="5946" VPOS="8957" WIDTH="881" HEIGHT="46">
                <String ID="P1_ST06356" HPOS="5946" VPOS="8965" WIDTH="3" HEIGHT="27" CONTENT="I" WC="1.00" CC="0"/>
                <SP ID="P1_SP05236" HPOS="5950" VPOS="8992" WIDTH="658"/>
                <String ID="P1_ST06357" HPOS="6608" VPOS="8957" WIDTH="219" HEIGHT="46" CONTENT="Proprietors." WC="1.00" CC="101401212010"/>
              </TextLine>
            </TextBlock>
          </ComposedBlock>
        </ComposedBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
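A minimal sketch of how ALTO word content and word confidence (the `CONTENT` and `WC` attributes shown above) might be read back per text line. The sample is a cut-down, well-formed fragment written for this illustration, and the helper names are hypothetical. Namespaces are stripped so the same code would work on ALTO files with or without a default namespace.

```python
import xml.etree.ElementTree as ET

# Cut-down ALTO fragment for illustration (not the full slide sample).
SAMPLE = """<alto>
  <Layout>
    <Page ID="P1" HEIGHT="9224" WIDTH="7136">
      <PrintSpace>
        <TextBlock ID="P1_TB00004">
          <TextLine ID="P1_TL00065">
            <String CONTENT="The" WC="0.98" HPOS="357" VPOS="831"/>
            <SP/>
            <String CONTENT="Queenslander." WC="0.96" HPOS="422" VPOS="831"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def lines_with_confidence(xml_text):
    """Return (line ID, line text, mean word confidence) for each TextLine."""
    root = ET.fromstring(xml_text)
    result = []
    for elem in root.iter():
        if local(elem.tag) != "TextLine":
            continue
        words = [child for child in elem if local(child.tag) == "String"]
        text = " ".join(w.get("CONTENT", "") for w in words)
        confs = [float(w.get("WC")) for w in words if w.get("WC") is not None]
        mean_wc = sum(confs) / len(confs) if confs else None
        result.append((elem.get("ID"), text, mean_wc))
    return result


ocr_lines = lines_with_confidence(SAMPLE)
print(ocr_lines)
```

A mean word confidence per line or per page is one possible input to the quality checks discussed later; how a project weights it is a local decision.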

Analyzed Layout and Text Object

Page 112: 20140410 ifla digitization workshop [idlc kuala lumpur]

Analyzed Layout and Text Object book

Page 113: 20140410 ifla digitization workshop [idlc kuala lumpur]

Analyzed Layout and Text Object newspaper

Page 114: 20140410 ifla digitization workshop [idlc kuala lumpur]

Preservation Metadata Implementation Strategies

• PREMIS is a core set of implementable preservation metadata, broadly applicable across a wide range of digital preservation contexts and supported by guidelines and recommendations for creation, management, and use
• In 2003 OCLC and RLG jointly sponsored the formation of the PREMIS working group, composed of international experts in the use of metadata to support digital preservation activities
• PREMIS data dictionary current version 2.2
• May be used in conjunction with METS XML
• PREMIS tools are freely available
• PREMIS Maintenance Activity and Editorial Committee has international members from libraries and industry
• PREMIS data dictionary is hosted at http://www.loc.gov/standards/premis/

Page 115: 20140410 ifla digitization workshop [idlc kuala lumpur]

PREMIS data in METS file

<mets:amdSec>
  <mets:techMD ID="PREMISOBJECT1">
    <mets:mdWrap MDTYPE="PREMIS">
      <mets:xmlData>
        <premis:object xmlns:premis="http://www.loc.gov/standards/premis/v1">
          <premis:objectIdentifier>
            <premis:objectIdentifierType>National Library of Australia</premis:objectIdentifierType>
            <premis:objectIdentifierValue>nlaImageSeq-218-b.tif</premis:objectIdentifierValue>
          </premis:objectIdentifier>
          <premis:objectCategory>file</premis:objectCategory>
          <premis:objectCharacteristics>
            <premis:format>
              <premis:formatDesignation>
                <premis:formatName>TIFF</premis:formatName>
                <premis:formatVersion>TIFF 6.0</premis:formatVersion>
              </premis:formatDesignation>
            </premis:format>
          </premis:objectCharacteristics>
          <premis:relationship>
            <premis:relationshipType>derivation</premis:relationshipType>
            <premis:relationshipSubType>is derivative of</premis:relationshipSubType>
            <premis:relatedObjectIdentification>
              <premis:relatedObjectIdentifierType>National Library of Australia</premis:relatedObjectIdentifierType>
              <premis:relatedObjectIdentifierValue>nlaImageSeq-218-b.tif</premis:relatedObjectIdentifierValue>
              <premis:relatedObjectSequence>0</premis:relatedObjectSequence>
            </premis:relatedObjectIdentification>
            <premis:relatedEventIdentification>
              <premis:relatedEventIdentifierType>National Library of Australia</premis:relatedEventIdentifierType>
              <premis:relatedEventIdentifierValue>deskew-nlaImageSeq-218-b.tif</premis:relatedEventIdentifierValue>
              <premis:relatedEventSequence>0</premis:relatedEventSequence>
            </premis:relatedEventIdentification>
          </premis:relationship>
        </premis:object>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:techMD>
</mets:amdSec>

Page 116: 20140410 ifla digitization workshop [idlc kuala lumpur]

digitization workflow

Page 117: 20140410 ifla digitization workshop [idlc kuala lumpur]

implement: software

• commercial off-the-shelf (COTS)
• open source
• customized COTS
• customized open source
• custom in-house

Page 118: 20140410 ifla digitization workshop [idlc kuala lumpur]

Discussion topics

1. Assuming your organization will digitize historic newspapers, will it digitize the newspapers in-house or out-source digitization? Why? (If you don’t know, guesses and speculations are fine.)

2. Describe your organization’s current digitization workflow.

digitization workflow

Page 119: 20140410 ifla digitization workshop [idlc kuala lumpur]

quality assurance and acceptance criteria

Page 120: 20140410 ifla digitization workshop [idlc kuala lumpur]

quality assurance and acceptance criteria

Wikipedia on data quality:

The processes and technologies involved in ensuring the conformance of data values to requirements and acceptance criteria

Page 121: 20140410 ifla digitization workshop [idlc kuala lumpur]

automatic quality checks

• is the digital object complete? are all its components present?
• is the digital object verifiable?
• is the digital object uncorrupted?
• do the components of the digital object conform to standards?
• do the file names conform to project requirements?
• does the directory structure conform to project requirements?
• does the digital object metadata conform to project specifications?
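Two of the automatic checks above (file names conform, all components present) can be sketched in a few lines. The naming rule and the "one TIFF plus one XML file per page" rule below are hypothetical project requirements made up for this illustration, not requirements from any real project.

```python
import re

# Hypothetical project rule: <titlecode>_<YYYYMMDD>_<4-digit page>.<ext>
NAME_RULE = re.compile(r"^[a-z0-9]+_\d{8}_\d{4}\.(tif|xml)$")

# Hypothetical completeness rule: every page needs an image and an XML file.
REQUIRED_EXTENSIONS = {"tif", "xml"}


def check_issue(file_names):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    pages = {}
    for name in sorted(file_names):
        if not NAME_RULE.match(name):
            problems.append(f"bad file name: {name}")
            continue
        stem, ext = name.rsplit(".", 1)
        pages.setdefault(stem, set()).add(ext)
    for stem, extensions in sorted(pages.items()):
        for missing in sorted(REQUIRED_EXTENSIONS - extensions):
            problems.append(f"missing component: {stem}.{missing}")
    return problems


problems = check_issue([
    "queenslander_18740425_0001.tif",
    "queenslander_18740425_0001.xml",
    "queenslander_18740425_0002.tif",
    "Queenslander-18740425-3.TIF",
])
for problem in problems:
    print(problem)
```

Checks of this kind are cheap to run on every delivered batch, which leaves operator time for the manual checks.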

Page 122: 20140410 ifla digitization workshop [idlc kuala lumpur]

manual quality checks

• does the digital object metadata meet accuracy specifications?
• does the text meet accuracy specifications?
• is the image quality satisfactory?
• are article continuations correct?
• is the text in reading order?

Page 123: 20140410 ifla digitization workshop [idlc kuala lumpur]

acceptance criteria for an English language digitization project at a large, well-known, and internationally recognized national library

character accuracy > 80%
word accuracy > 75%
significant word accuracy > 65%

what’s wrong with this?

Page 124: 20140410 ifla digitization workshop [idlc kuala lumpur]

project quality requirement:

“a high level of accuracy”

what’s wrong with this?

Page 125: 20140410 ifla digitization workshop [idlc kuala lumpur]

project quality requirement:

“article titles must be 99.5% accurate”

what’s wrong with this?

Page 126: 20140410 ifla digitization workshop [idlc kuala lumpur]

project quality requirement:

“article title characters in each issue must be 99.5% accurate, that is, each issue may have no more than 5 errors in 1000 article title characters”

what’s wrong with this?
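One way to make a requirement like this testable is to measure it against hand-corrected ground truth. The sketch below counts an "error" as one character insertion, deletion, or substitution (Levenshtein edit distance). That definition is an assumption, since the requirement itself does not say how errors are counted, and the function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def title_character_accuracy(pairs):
    """pairs: (ground truth title, OCR title) for each article title in an issue."""
    total_chars = sum(len(truth) for truth, _ in pairs)
    errors = sum(edit_distance(truth, ocr) for truth, ocr in pairs)
    return 1.0 - errors / total_chars if total_chars else 1.0


def meets_requirement(pairs, threshold=0.995):
    """True if the issue's title character accuracy reaches the threshold."""
    return title_character_accuracy(pairs) >= threshold


issue_titles = [
    ("Deaths.", "Deaths."),
    ("The Queenslander.", "The Queenslander,"),  # one wrong character
]
accuracy = title_character_accuracy(issue_titles)
print(f"{accuracy:.4f}", meets_requirement(issue_titles))
```

In this small example a single wrong character in 24 title characters already fails a 99.5% threshold, which shows how much the result depends on how many title characters an issue contains.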

Page 127: 20140410 ifla digitization workshop [idlc kuala lumpur]

image quality

• sharpness: the amount of detail an image can convey
• noise: random variation of image density
• dynamic range
• contrast (gamma): the slope of the tone reproduction curve in a log-log space. high contrast usually involves loss of dynamic range, that is, loss of detail, or clipping, in highlights or shadows.
• vignetting: darkens images near the corners
• artifacts: “leftovers” from sharpening or compression

Wikipedia contributors, “Image quality," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Image_quality (accessed March 2014).

Page 128: 20140410 ifla digitization workshop [idlc kuala lumpur]

image quality

“…images which are ultimately to be viewed by human beings, the only “correct” method of quantifying visual image quality is through subjective evaluation. in practice, however, subjective evaluation is usually too inconvenient, time-consuming and expensive…”

Zhou Wang and Hamid R. Sheikh. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing. April 2004.

“…best way to assess the quality of an image is to look at it because human eyes are the ultimate viewers of most images…”

Zhou Wang, Alan Bovik, and Ligang Lu. Why is image quality assessment so difficult? IEEE Transactions on Image Processing. April 2004.

Page 129: 20140410 ifla digitization workshop [idlc kuala lumpur]

acceptance criteria for the National Library of Australia NDP


Page 130: 20140410 ifla digitization workshop [idlc kuala lumpur]

Discussion topics

1. How does your organization currently do quality assurance for digital data?

2. How much time / effort is given to writing quality assurance procedures and acceptance criteria for digitized data?

quality assurance

Page 131: 20140410 ifla digitization workshop [idlc kuala lumpur]

digitization tools

Page 132: 20140410 ifla digitization workshop [idlc kuala lumpur]

open source vs. commercial software: pros

• acquisition : cost, development and implementation contract costs are likely to be lower than for proprietary software. less likely that there will be contractually-bound upgrade costs. total cost of ownership over the lifetime of usage must be taken into account

• data transferability : with open source code and open data formats, there are greater opportunities to share data across interoperable platforms

• re-use : open source is free from per user or per instance costs and there is a guaranteed freedom to use it in any way. re-use is enabled.


Adapted from Open Gov Summit 2013. http://opengov2013.zaizi.com/pros-and-cons-of-open-source-solutions/

Page 133: 20140410 ifla digitization workshop [idlc kuala lumpur]

open source vs. commercial software: pros

• cost effective : pay once, or not at all, for development, and reuse where appropriate.

• non-restrictive : open source licenses do not limit or restrict who can use the software, the type of user, or the areas of business in which the software can be used. provides a licensing model that enables rapid provisioning for both known and unanticipated users and for new use cases.

• scalable : open source solutions are scalable upwards and downwards with a reduction in the risk of longer term financial implications. no license fees on a “per user” or “per box” basis. no redundant licenses

Adapted from Open Gov Summit 2013. http://opengov2013.zaizi.com/pros-and-cons-of-open-source-solutions/

Page 134: 20140410 ifla digitization workshop [idlc kuala lumpur]

• easy to prototype and adapt : open source software is particularly suitable for rapid prototyping and experimentation, where the ability to “test drive” the software with minimal costs and administrative delays can be important. (proprietary software suppliers may also provide the same through a ‘proof of concept’ phase at minimal or no cost.)


Adapted from Open Gov Summit 2013. http://opengov2013.zaizi.com/pros-and-cons-of-open-source-solutions/

open source vs. commercial software: pros

Page 135: 20140410 ifla digitization workshop [idlc kuala lumpur]

open source vs. commercial software: cons

• support and maintenance costs : may outweigh those of the proprietary package and include ‘hidden’ commitments.

• intellectual property rights : as code is modified and adapted, there may be legal risks concerning the code’s open source status and who owns the intellectual property rights of the modified code.

• expertise : requires software installation and maintenance expertise. modification of open source code requires software development expertise. organizations must ensure that they have the right level of expertise to manage it effectively.

Adapted from Open Gov Summit 2013. http://opengov2013.zaizi.com/pros-and-cons-of-open-source-solutions/

Page 136: 20140410 ifla digitization workshop [idlc kuala lumpur]

digitization tools

a variety of open source and commercial off-the-shelf (COTS) software is available for digitization projects

• easier for systems from different parties or using different technologies to interoperate and communicate with one another
• better protection of the data files created by an application against obsolescence of the application
• applications / data are easier to port from one platform to another since they follow known guidelines and rules, and the interfaces

Page 137: 20140410 ifla digitization workshop [idlc kuala lumpur]

ocr software

• ABBYY FineReader (http://www.abbyy.com)

• Tesseract (https://code.google.com/p/tesseract-ocr)

• Nuance OmniPage (http://www.nuance.com)

• IRIS Readiris (http://www.irislink.com)

• LEADTOOLS OCR (http://www.leadtools.com)

• OCRopus (https://code.google.com/p/ocropus)

Wikipedia contributors, “Comparison of optical character recognition software," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Comparison_of_optical_character_recognition_software (accessed March 2014).

Wikipedia contributors, “Optical character recognition," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Optical_character_recognition (accessed March 2014).

open source

Page 138: 20140410 ifla digitization workshop [idlc kuala lumpur]

imaging software

• LEADTOOLS image SDK (http://www.leadtools.com)

• ImageGear image SDK (http://www.accusoft.com)

• FreeImage image SDK (http://freeimage.sourceforge.net)

• BlackIce image toolkits (http://www.blackice.com)

• Adobe Photoshop (http://www.adobe.com/Photoshop)

• GIMP (http://www.gimp.org)

• GraphicsMagick (http://www.graphicsmagick.org)

• ImageMagick (http://www.imagemagick.org)

open source

Page 139: 20140410 ifla digitization workshop [idlc kuala lumpur]


digital workflow software

• Content Conversion Specialists docWorks (http://content-conversion.com)

• ScanFlow (http://www.treventus.com)

• Goobi (http://www.goobi.org)

• Zissor (http://zissor.com)

open source

Page 140: 20140410 ifla digitization workshop [idlc kuala lumpur]


other software

• BagIt : hierarchical file packaging format for the exchange of digital content. A "bag" has just enough structure to safely enclose descriptive "tags" and a "payload" but does not require any knowledge of the payload's internal semantics. See http://sourceforge.net/projects/loc-xferutils and http://tools.ietf.org/html/draft-kunze-bagit-06.
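To make the "bag" structure concrete, here is a minimal sketch that writes one by hand using only the Python standard library: a `bagit.txt` declaration, a `data/` payload directory, and a `manifest-md5.txt` listing a checksum for each payload file. The version string, file names, and the `make_bag` helper are assumptions for illustration. A real project would more likely use an existing BagIt tool and should follow the specification version it targets.

```python
import hashlib
import os
import tempfile


def make_bag(bag_dir, payload):
    """Write a minimal bag and return the manifest text.

    payload maps file names to bytes (hypothetical input shape).
    """
    data_dir = os.path.join(bag_dir, "data")
    os.makedirs(data_dir)
    # Bag declaration; the version number here is an assumption.
    with open(os.path.join(bag_dir, "bagit.txt"), "w", encoding="utf-8") as f:
        f.write("BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    lines = []
    for name in sorted(payload):
        with open(os.path.join(data_dir, name), "wb") as f:
            f.write(payload[name])
        checksum = hashlib.md5(payload[name]).hexdigest()
        lines.append(f"{checksum}  data/{name}\n")
    manifest = "".join(lines)
    with open(os.path.join(bag_dir, "manifest-md5.txt"), "w", encoding="utf-8") as f:
        f.write(manifest)
    return manifest


with tempfile.TemporaryDirectory() as tmp:
    bag = os.path.join(tmp, "issue_18740425")
    manifest = make_bag(bag, {
        "page_0001.tif": b"image bytes",
        "page_0001.xml": b"<alto/>",
    })
    bag_contents = sorted(os.listdir(bag))

print(bag_contents)
print(manifest, end="")
```

The receiver of a bag only needs the manifest to confirm that every payload file arrived intact, which is why the format needs no knowledge of what the payload means.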

open source

Page 141: 20140410 ifla digitization workshop [idlc kuala lumpur]

Discussion questions

1. What software tools does your organization use for digital projects or digital libraries?

2. Does your organization host a digital library? If so, does it use Google Analytics or a similar tool? Why or why not?

3. What software tools does your organization use for project management? Are the tools web-based?

digitization tools

Page 142: 20140410 ifla digitization workshop [idlc kuala lumpur]

Preservation of software and preservation of data are two sides of the same coin. From February 2011 Workshop for Digital Curators.

digital preservation

Page 143: 20140410 ifla digitization workshop [idlc kuala lumpur]

preservation

Open Archival Information System (OAIS) reference model

Page 144: 20140410 ifla digitization workshop [idlc kuala lumpur]

digitization ≠ digital preservation

Page 145: 20140410 ifla digitization workshop [idlc kuala lumpur]

Vint Cerf on “bit rot”

Page 146: 20140410 ifla digitization workshop [idlc kuala lumpur]

digital preservation

long-term, error-free storage of digital information, with means for retrieval and interpretation, for the entire time span the information is required

Page 147: 20140410 ifla digitization workshop [idlc kuala lumpur]

digital data risks

• standards / format obsolescence
• migration to new format, media, or hardware
• media obsolescence / decay
• bit rot

Page 148: 20140410 ifla digitization workshop [idlc kuala lumpur]

format obsolescence

remember … WordPerfect ?

MARC records ? Adobe Flash ?

Page 149: 20140410 ifla digitization workshop [idlc kuala lumpur]

strategies for format obsolescence

• migrate data to new formats
• create a computer software museum with virtual machines
• format registries
• format validators
• don’t worry about it!

Page 150: 20140410 ifla digitization workshop [idlc kuala lumpur]

Jeff Rothenberg on format obsolescence

“... digital documents are evolving so rapidly that shifts in the forms of documents must inevitably arise. New forms do not necessarily subsume their predecessors or provide compatibility with previous formats.”

Jeff Rothenberg. Ensuring the Longevity of Digital Documents. Originally published in Scientific American. January 1995. Expanded version published February, 1999. (accessed 1 August 2012 at http://www.clir.org/pubs/archives/ensuring.pdf)

Page 151: 20140410 ifla digitization workshop [idlc kuala lumpur]

standard model for format obsolescence

• digital format registry collects information about target format
• this information is used to build format identification and verification tools
• holders of content use these tools to extract metadata from content in target format; metadata is stored with the content
• format registry scans computing environment to determine which formats are obsolescent; notifications sent for obsolete formats
• on receiving such a notification, someone builds a tool to convert obsolete format to non-obsolete format using the format specification in the registry
• on receiving such a notification, holder of content in obsolete format uses conversion tool and content metadata to convert the file in an obsolete format to a file in a non-obsolete format
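The identification and notification steps of this model can be sketched with a toy registry. The sketch identifies formats by their leading "magic" bytes, which is one common identification technique, and then lists holdings in formats that a caller has declared obsolete. The registry contents, function names, and the choice of which format to flag are all illustrative. No real format is being declared obsolete here.

```python
# Toy stand-in for a format registry: (format name, leading "magic" bytes).
SIGNATURES = [
    ("TIFF", b"II*\x00"),            # little-endian TIFF
    ("TIFF", b"MM\x00*"),            # big-endian TIFF
    ("PDF", b"%PDF-"),
    ("PNG", b"\x89PNG\r\n\x1a\n"),
    ("JPEG", b"\xff\xd8\xff"),
]


def identify(header):
    """Return the format name whose signature matches the first bytes."""
    for name, magic in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def obsolescence_notifications(holdings, obsolete_formats):
    """holdings maps file names to their first bytes.

    Returns the sorted names of files held in a format listed as obsolete.
    """
    return sorted(
        name for name, header in holdings.items()
        if identify(header) in obsolete_formats
    )


holdings = {
    "page_0001.tif": b"II*\x00\x08\x00\x00\x00",
    "issue.pdf": b"%PDF-1.4\n",
    "thumb.png": b"\x89PNG\r\n\x1a\n\x00\x00",
}
# PNG is flagged purely to exercise the code path.
flagged = obsolescence_notifications(holdings, obsolete_formats={"PNG"})
print(flagged)
```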

Page 152: 20140410 ifla digitization workshop [idlc kuala lumpur]

David Rosenthal on format obsolescence

“... format obsolescence is a rare problem that happens infrequently to a minority of unpopular formats ...”

David Rosenthal. Format obsolescence: Assessing the threat and the defenses. (accessed 1 August 2012 at http://lockss.org/locksswiki/files/LibraryHighTech2010.pdf)

Page 153: 20140410 ifla digitization workshop [idlc kuala lumpur]

alternate model for format obsolescence

• store only essential data • perform only essential tasks • delay performing tasks as long as possible

David Rosenthal. Format obsolescence: Assessing the threat and the defenses. Library High Tech, Special Issue, vol. 28, no. 2, 2010, pp.195-210. doi:10.1108/07378831011047613 (accessed 1 August 2012 at http://lockss.org/locksswiki/files/LibraryHighTech2010.pdf).

Page 154: 20140410 ifla digitization workshop [idlc kuala lumpur]

importance of standards vis-a-vis format obsolescence

well-defined standards …

• guide developers in creation of tools
• facilitate development of a broad range of tools for any format
• allow developers to maintain existing tools

Page 155: 20140410 ifla digitization workshop [idlc kuala lumpur]

data migration risks

• file format changes, for example, PDF 1.4 to PDF 1.7
• file name differences, for example, case sensitive / insensitive names, new operating system
• extended file attributes
• file permissions, for example, BSD Unix drwxr-xr-x@ to Windows file permissions
• soft links / hard links
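The file name risk can be checked before a migration. The sketch below finds groups of paths that differ only by letter case, which would collide when moved from a case-sensitive to a case-insensitive file system. Using `str.casefold` is an approximation: real file systems apply their own case and Unicode normalization rules, so treat the result as a warning list rather than a guarantee.

```python
from collections import defaultdict


def case_collisions(paths):
    """Return groups of paths that are identical when letter case is ignored."""
    groups = defaultdict(list)
    for path in paths:
        groups[path.casefold()].append(path)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


collisions = case_collisions([
    "batch1/Page_0001.tif",
    "batch1/page_0001.tif",
    "batch1/page_0002.tif",
])
print(collisions)
```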

Page 156: 20140410 ifla digitization workshop [idlc kuala lumpur]

media obsolescence

• 5 ¼” floppy disks
• 8 track tapes
• 3 ½” floppy disks
• ZIP drives
• CD-R, CD-RW, Blu-Ray
• DAT tapes
• microfilm
• etc.

Page 157: 20140410 ifla digitization workshop [idlc kuala lumpur]

strategies for media obsolescence

• migrate data to new media, for example, floppy disks to DVD
• create and maintain a computer hardware museum

Page 158: 20140410 ifla digitization workshop [idlc kuala lumpur]

media decay

a report by NIST and the Library of Congress says ...

• virtually all CD-Rs tested indicated an estimated life expectancy beyond 15 years
• only 47 percent of recordable DVDs indicated an estimated life expectancy beyond 15 years; some had a life expectancy as short as 1.9 years
• in practice actual lifetimes may be considerably shorter

Page 159: 20140410 ifla digitization workshop [idlc kuala lumpur]

prevention / detection of media decay

• proper storage
• data file checksums (MD5, SHA-1, ...)
• monitor media integrity
• migrate data from old media to new media

Page 160: 20140410 ifla digitization workshop [idlc kuala lumpur]

bit rot

gradual decay of data due to …

• storage media failure because of media quality
• storage media failure because of improper storage
• random events (bit-flip, environmental influences)
• software / hardware errors

Page 161: 20140410 ifla digitization workshop [idlc kuala lumpur]

prevention / detection of bit rot

• data file fixity check (checksums) such as MD5, SHA-1, ...
• monitor file integrity with frequent, corrective audits
• duplicate copies, geographically distributed
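A minimal sketch of a fixity audit: record a checksum for each file once, then periodically recompute and compare. It uses Python's standard `hashlib`. SHA-256 is chosen here as an assumption (the slide names MD5 and SHA-1 as examples), and the function names and demo files are illustrative.

```python
import hashlib
import os
import tempfile


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Checksum a file in chunks so large images need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_fixity(paths, algorithm="sha256"):
    """Return {path: checksum}, to be stored alongside the content."""
    return {path: file_digest(path, algorithm) for path in paths}


def audit(recorded, algorithm="sha256"):
    """Compare current files with recorded checksums."""
    report = {}
    for path, expected in recorded.items():
        if not os.path.exists(path):
            report[path] = "missing"
        elif file_digest(path, algorithm) != expected:
            report[path] = "changed"
        else:
            report[path] = "ok"
    return report


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name in ("a.tif", "b.tif", "c.tif"):
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write(b"master image " + name.encode())
        paths.append(path)
    recorded = record_fixity(paths)
    with open(paths[1], "ab") as f:   # simulate silent corruption
        f.write(b"\x00")
    os.remove(paths[2])               # simulate loss
    report = {os.path.basename(p): status for p, status in audit(recorded).items()}

print(report)
```

An audit like this only detects damage. Correcting it requires another good copy, which is where duplicate, geographically distributed copies come in.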

Page 162: 20140410 ifla digitization workshop [idlc kuala lumpur]

distributed decentralized digital preservation

• the more copies, the safer the data
• the more independent copies, the safer the data
• the more frequently copies are audited, the safer the data

Paraphrased from David Rosenthal. Keeping bits safe: How hard can it be?

Page 163: 20140410 ifla digitization workshop [idlc kuala lumpur]

distributed decentralized digital preservation

• n+1 copies are safer than n copies
• n independent copies on different storage devices / media are safer than n copies on similar or identical storage devices / media
• data audited every week is safer than data audited every month
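A simplified way to see why the first two points hold. Assume, purely for illustration, that each copy is lost between two audits with the same probability p and that copies fail independently of one another. Neither assumption comes from the slide, and real storage failures are often correlated.

```latex
P(\text{all } n \text{ copies lost}) = p^{n},
\qquad
P(\text{all } n+1 \text{ copies lost}) = p^{n+1} = p \cdot p^{n} < p^{n}
\quad (0 < p < 1)
```

For example, with p = 0.01 the chance of losing every copy is 10^{-6} for three copies and 10^{-8} for four. If the copies are not independent, for instance because they sit on identical devices in one room, a single event can destroy all of them, and the chance of total loss can stay close to p however many copies exist.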

Page 164: 20140410 ifla digitization workshop [idlc kuala lumpur]

LOCKSS Lots Of Copies Keep Stuff Safe

• It ingests content from target websites using a web crawler similar to those used by search engines.

• It preserves content by continually comparing the content it has collected with the same content collected by other LOCKSS Boxes, and repairing any differences.

• It delivers authoritative content to readers by acting as a web proxy, cache or via Metadata resolvers when the publisher’s website is not available.

• It provides management through a web interface that allows librarians to select new content for preservation, monitor the content being preserved and control access to the preserved content.

• It dynamically migrates content to new formats as needed for display.

From LOCKSS webpages http://www.lockss.org.

LOCKSS box: Open source LOCKSS software installed on a dedicated computer or virtual machine.
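The "compare and repair" idea in the second bullet can be illustrated with a simple majority vote over checksums. This is a deliberately simplified sketch and not the actual LOCKSS polling protocol: the box names, the in-memory dictionary, and the `audit_and_repair` function are all hypothetical.

```python
import hashlib
from collections import Counter


def digest(content):
    """SHA-256 checksum of one copy's bytes."""
    return hashlib.sha256(content).hexdigest()


def audit_and_repair(boxes):
    """boxes maps a box name to its copy of the content (bytes).

    Copies that disagree with the majority are overwritten in place.
    Returns the sorted names of the boxes that were repaired.
    """
    votes = Counter(digest(content) for content in boxes.values())
    winner, count = votes.most_common(1)[0]
    if count <= len(boxes) / 2:
        raise ValueError("no majority: cannot decide which copy is good")
    good = next(c for c in boxes.values() if digest(c) == winner)
    repaired = []
    for name in sorted(boxes):
        if digest(boxes[name]) != winner:
            boxes[name] = good
            repaired.append(name)
    return repaired


boxes = {
    "my library": b"page data",
    "library X": b"page data",
    "library Y": b"page d\x00ta",   # damaged copy
}
repaired = audit_and_repair(boxes)
print(repaired)
```

With only two copies a disagreement cannot be resolved by voting, which is one practical reason to keep at least three independent copies.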

Page 165: 20140410 ifla digitization workshop [idlc kuala lumpur]

how LOCKSS works data copied to another LOCKSS box

library X LOCKSS box

library Y LOCKSS box

my library LOCKSS box

data

Page 166: 20140410 ifla digitization workshop [idlc kuala lumpur]

how LOCKSS works data audited

library X LOCKSS box

library Y LOCKSS box

my library LOCKSS box

data

audit

Page 167: 20140410 ifla digitization workshop [idlc kuala lumpur]

how LOCKSS works data audited

library X LOCKSS box

library Y LOCKSS box

my library LOCKSS box

data

audit

audit fails

audit  ok

Page 168: 20140410 ifla digitization workshop [idlc kuala lumpur]

how LOCKSS works data copied to another LOCKSS box

library X LOCKSS box

library Y LOCKSS box

my library LOCKSS box

data

Page 169: 20140410 ifla digitization workshop [idlc kuala lumpur]

private LOCKSS networks

Alabama Digital Preservation Network (http://www.adpn.org/).

CLOCKSS (Controlled LOCKSS), a non-profit collaboration of North American, European, and Asian cultural heritage institutions whose purpose is to preserve digital content with LOCKSS (http://www.clockss.org).

MetaArchive Cooperative is a digital preservation cooperative created by cultural heritage institutions (http://www.metaarchive.org).

Page 170: 20140410 ifla digitization workshop [idlc kuala lumpur]

digital preservation references

• Nancy McGovern and Katherine Skinner editors. Aligning National Approaches to Digital Preservation. Educopia Institute Publications. Atlanta Georgia. 2012. Proceedings of a conference on digital preservation held at the National Library of Estonia in May 2011. (accessed 15 August 2012 at http://www.educopia.org/sites/default/files/ANADP_Educopia_2012.pdf).

• David Rosenthal. Format obsolescence: Assessing the threat and the defenses. Library High Tech, Special Issue, v. 28, n. 2, 2010, pp.195-210. doi:10.1108/07378831011047613 (accessed 1 August 2012 at http://lockss.org/locksswiki/files/LibraryHighTech2010.pdf).

• David Rosenthal. Keeping bits safe: How hard can it be? Communications of the ACM v. 53, n. 11, 2010, pp. 47-55. doi:10.1145/1839676.1839692 (accessed 1 August 2012 at http://lockss.org/locksswiki/files/ACM2010.pdf).

• Jeff Rothenberg. Ensuring the Longevity of Digital Documents. Originally published in Scientific American January 1995. Expanded version published February 1999. (accessed 1 August 2012 at http://www.clir.org/pubs/archives/ensuring.pdf)

• Joint Information Systems Committee (JISC) Programme on Digital Preservation at http://www.jisc.ac.uk/preservation.

• Library of Congress on Digital Preservation at http://www.digitalpreservation.gov.

• Stanford University’s website for LOCKSS at http://www.lockss.org.

Page 171: 20140410 ifla digitization workshop [idlc kuala lumpur]

newspaper digitization programs around the world

Europeana Newspapers Project, a collaboration of 17 organizations (http://www.europeana-newspapers.eu/)

Bibliothèque nationale de France (http://gallica.bnf.fr/)

National Library of Australia, Australian Digital Newspapers Program (http://trove.nla.gov.au/newspaper)

Singapore National Library Board (http://newspapers.nl.sg/)

National Library of New Zealand (http://paperspast.natlib.govt.nz/)

National Digital Newspaper Program, Library of Congress (http://chroniclingamerica.loc.gov/)

British Newspaper Archive, British Library (http://www.bl.uk/welcome/newspapers)

Koninklijke Bibliotheek, the Netherlands (http://kranten.kb.nl/)

National Library of Finland (http://digi.kansalliskirjasto.fi/)

National Library of Latvia (https://periodika.lndb.lv/)

Page 172: 20140410 ifla digitization workshop [idlc kuala lumpur]

• Library of Congress National Digital Newspaper Program http://www.loc.gov/ndnp/

• Australian Newspaper Digitisation Program http://www.nla.gov.au/content/newspaper-digitisation-program

• IFLA Newspapers Section Digitisation projects and best practices http://www.ifla.org/node/6777

• ICON: International Coalition on Newspapers http://icon.crl.edu/digitization.htm

Page 173: 20140410 ifla digitization workshop [idlc kuala lumpur]

• METS, MODS, ALTO, PRISM, and other library standards: http://www.loc.gov/standards

• OAIS: http://public.ccsds.org/publications/RefModel.aspx

• NISO standards and guidelines: http://www.niso.org/publications/rp

• Good practice guides: http://www.ukoln.ac.uk

• And many, many more

Page 174: 20140410 ifla digitization workshop [idlc kuala lumpur]

Wikipedia contributors, "List of online newspaper archives," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Wikipedia:List_of_online_newspaper_archives (accessed March 17, 2013).

Page 175: 20140410 ifla digitization workshop [idlc kuala lumpur]

?!

Frederick Zarndt Secretary, IFLA Newspapers Section


Photo held by John Oxley Library, State Library of Queensland. Original from Courier-Mail, Brisbane, Queensland, Australia.