20140410 ifla digitization workshop [idlc kuala lumpur]


DESCRIPTION

Newspapers digitization workshop at the 2014 International Digital Libraries Conference in Kuala Lumpur April 10, 2014.


Newspaper digitization

Frederick Zarndt IFLA Newspapers Section

frederick@frederickzarndt.com @cowboyMontana

hashtag #IFLAnewspaper

the agenda

1. Introductions
2. Review of the OAIS reference model
3. Newspaper digitization programs
4. Selection of materials
5. Importance of standards
6. Project management
7. Digitization workflow
   7.1. Images
   7.2. Metadata
   7.3. File formats
8. Digitization workflow demonstration with docWorks
9. Quality assurance and acceptance criteria
10. Tools for digitization, workflow, digital preservation, and project management
11. Digital preservation considerations
12. Wrap-up

10.30 Morning tea break
13.00 Lunch
15.30 Afternoon tea break

An Open Archival Information System (or OAIS) is an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community.

Wikipedia contributors, “Open Archival Information System," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Open_Archival_Information_System (accessed March 2014).

• Negotiate for and accept appropriate information from information Producers.

• Obtain sufficient control of the information provided to the level needed to ensure Long-Term Preservation.

• Determine, either by itself or in conjunction with other parties, which communities should become the Designated Community and, therefore, should be able to understand the information provided.

• Ensure that the information to be preserved is Independently Understandable to the Designated Community. In other words, the community should be able to understand the information without needing the assistance of the experts who produced the information.

• Follow documented policies and procedures which ensure that the information is preserved against all reasonable contingencies, and which enable the information to be disseminated as authenticated copies of the original, or as traceable to the original.

• Make the preserved information available to the Designated Community.

Open Archival Information System (OAIS) reference model


programs (figure): National, Collaborative, Individual programs

national: centrally funded and managed programs with several participants. strict standards.

• National Digital Newspaper Program (Library of Congress)

• Australian Newspaper Digitisation Program

national programs

cooperative: organizations collaborate to achieve a common goal but digitization programs are managed separately. flexible standards.

• Europeana newspapers

• Digital Public Library of America

cooperative programs

individual: organization digitizes on its own. may or, more usually, does not follow open standards. all commercial organizations.

• ProQuest Historical Newspapers

• Newspapers.com

• Newsbank

• many others…

individual programs

• digitization program requires careful thought

• must be adapted to local circumstances

• ask those who have gone before

• join the IFLA Newspapers Section! (ask me how)

programs

Image courtesy of Donald Zolan.

Discussion questions

1. Has your organization already begun to digitize newspapers? How is the digitization program organized and funded?

2. If your organization hasn’t yet begun to digitize newspapers, what type of digitization program would best suit your organization / state / country? Why?

programs

Experience is that marvelous thing that enables you to recognize a mistake when you make it again.

F. P. Jones

selection

reasons for digitization

newspapers are deteriorating

microfilm is dissolving

no storage space


access

• Who are your users? Do you know?

• Can you ask them what they expect from a digital newspaper collection? Can you trust their answers?

• Trove, Papers Past, Cambridge Public Library, CDNC: These digital newspaper collections are used mostly by people 50+ years old and with an interest in family history.

Library of Congress selection criteria for the National Digital Newspaper Program (NDNP)

• Image quality
• Intellectual content
• Refinements

http://www.loc.gov/ndnp/guidelines/selection.html

Image quality

All NDNP newspaper images are scanned from microfilm.

1. Microfilm should be produced from properly prepared unbound originals.
2. Microfilm reduction ratio should be less than 20x. This allows 400dpi images to be scanned from the film.
3. Variations in microfilm density within and between images should be no more than 0.2.
4. Negative microfilm duplicated for scanning should have resolution test patterns readable at 5.0 or higher. For camera master microfilm without resolution test charts, resolution can be estimated by comparison to film with resolution test charts and original material.

selection for NDNP

Intellectual content

1. Newspaper title reflects the political, economic and cultural history of the State.
2. Selected newspaper titles should ensure broad geographical coverage.
3. Newspaper titles that provide coverage of a geographic area or a group over long time periods are preferred over short lived titles or titles with significant gaps.

selection for NDNP

Selection criteria refinements

1. Orphan titles: Special consideration should be given to high research value titles that have ceased publication and lack active ownership.

2. Newspaper titles that document a significant (minority) community at the state or regional level may be given special consideration.

3. Newspapers that have already been digitized by other organizations (for example, ProQuest) should not be digitized again.

selection for NDNP

National Library of Australia collection managers, in consultation with staff from Preservation Services, nominate materials for digitization. The Library works closely with state and territory libraries to systematically digitise newspapers held in these libraries. Selected newspapers include those with

• Cultural and/or historical significance
• Uniqueness and/or rarity of the material
• Copyright status or permission to digitise obtained
• Material in high demand
• Material at risk because of its physical condition

https://www.nla.gov.au/policy-and-planning/collection-digitisation-policy

selection for ANDP

Most newspaper titles selected for digitization are out of copyright and in the public domain. Negotiating use rights is quite simply too much trouble and fraught with legal pitfalls.

Copyright laws and policies vary considerably between countries.

copyright

…however…

Digitization and public access to in-copyright newspapers is not impossible.

Discussion questions

1. Has your organization already selected newspapers to digitize? Why did it choose the titles that were selected? Please answer (hypothetically) if your organization hasn’t begun a newspapers digitization program.

2. Why would or why wouldn’t your organization select in-copyright newspapers to digitize?

selection

importance of standards

• Availability : Open standards are available for all to read and implement.

• Maximize end-user choice : Open standards create a fair, competitive market for implementation of the standards. They do not lock the customer into a particular vendor or group.

• No royalty : Open standards are free for all to implement, with no royalty or fee.

• No discrimination : Open standards and the organizations that administer them do not favor one implementor over another for any reason other than the technical standards compliance of a vendor's implementation.

• Extension or subset : Implementations of open standards may be extended, or offered in subset form. However, certification organizations may decline to certify subset implementations, and may place requirements upon extensions.

• Predatory practices : Open standards may employ license terms that protect against subversion of the standard by embrace-and-extend tactics. The licenses attached to the standard may require the publication of reference information for extensions, and a license for all others to create, distribute and sell software that is compatible with the extensions. An open standard may not otherwise prohibit extensions.

Adapted from FOSS Open Standards. http://en.wikibooks.org/wiki/FOSS_Open_Standards

open standards

• Not restrictive : Less chance of being locked in by a specific technology and/or vendor.

• Interoperable : Easier for systems from different parties or using different technologies to interoperate and communicate with one another.

• Protection against obsolescence : Better protection of the data files created by an application against obsolescence.

• Portable : Applications / data are easier to port from one platform to another since they follow known guidelines, rules, and interfaces.

Adapted from FOSS Open Standards. http://en.wikibooks.org/wiki/FOSS_Open_Standards

open standards

What standards are important for newspaper digitization?

• METS XML is an open standard administered by the METS editorial board. See http://www.loc.gov/standards/mets/.
• ALTO XML is an open standard administered by the ALTO editorial board. See http://www.loc.gov/standards/alto/.
• Various image file formats including TIFF, JPEG, JPEG2000.
• PDF/A is a portable document format developed by Adobe. It is a subset of the complete PDF specification and has been adopted by ISO as a standard. See http://www.pdfa.org/.
• Various library metadata standards including, but not limited to
  • MODS XML http://www.loc.gov/standards/mods/
  • Dublin Core http://dublincore.org/
  • PREMIS http://www.loc.gov/standards/premis/

newspapers and standards

importance of standards

with few exceptions libraries use METS XML + ALTO XML + image files (TIFF, JPEG2000) for newspaper digitization programs

proprietary standards

Olive ActivePaper Archive stores historical newspaper data in an XML format that is as capable as METS/ALTO XML but is not an open standard.

Early versions of WordPerfect (MS Word too) stored data in a proprietary format, not in an open standard like Open Document Format (ODF). WordPerfect or special software is needed to view the files.

Adobe’s Flash is a de facto but not an open standard. Flash now appears to be on a path to obsolescence, destined to be replaced by HTML5.

Discussion questions

1. Name a few standards that you use every time you connect to the Internet.

2. What library standards does your organization currently use? What other, non-library standards, if any, does your organization use?

importance of standards

In theory, there's no difference between theory and practice, but in practice, there is.

Anonymous

project management

From the Standish Group’s 2012 Chaos Report on IT Project Failure.


Roger Sessions estimates that the worldwide cost of IT failure is USD $500 billion per month

Roger Sessions: CTO of ObjectWatch. He has written seven books including Simple Architectures for Complex Enterprises and many articles. He is a founding member of the Board of Directors of the International Association of Software Architects.

high cost of IT failure

in a recent survey of 1230 IT professionals conducted by Embarcadero Technologies, 2 of the 3 biggest project challenges cited by the IT pros are “poor planning” and “poor or no requirements”

plan!

in a March 2007 web poll conducted by the Computing Technology Industry Association "nearly 28 percent of the more than 1,000 respondents singled out poor communications as the number one cause of project failure"

communicate!

A recent survey of 752 IEEE members conducted by IEEE Spectrum and The New York Times discovered that "just 9 percent of 133 respondents whose organizations currently offshore R&D reported 'No problem'. The biggest headache was 'Language, communication, or culture' barriers, as reported by 54.1 percent of respondents." (http://www.spectrum.ieee.org/feb07/4881)

communicate!

In their 2009 book Cultural Intelligence: Living and Working Globally, Thomas and Inkson say “Although we increasingly cross boundaries and surmount barriers to trade, migration, travel, and the exchange of information, cultural boundaries are not so easily bridged. Unlike legal, political, or economic aspects of the global environment, which are observable, culture is largely invisible. Therefore, culture is the aspect of the global context that is most often overlooked.”

communicate!

in a white paper written for Project Perfect, Taimour al Neimat lists

• poor planning
• unclear goals and objectives
• objectives changing during the project
• unrealistic time or resource estimates
• lack of executive support and user involvement
• failure to communicate and act as a team
• inappropriate skills

as primary causes for the failure of complex IT projects

Taimour al Neimat. Why IT Projects Fail. The PROJECT PERFECT White Paper Collection. Oct 2005. http://www.projectperfect.com.au/downloads/Info/info_it_projects_fail.pdf accessed Mar 2014.

plan!

typical tender evaluation criteria in priority order

1. understanding of requirements
2. reputation of service bureau
3. price

requirements?

incomplete requirements

requirements in recent tender from an (anonymous) government agency somewhere in the world

• project to convert ~ 170,000 text images to xml
• value of project ~ USD $180,000
• 19 pages of definitions, governing law, proposal evaluation criteria, contractual conditions, instructions about tender response format, etc
• technical requirements description? < 1 page
• data acceptance criteria? “a high level of accuracy”

complete requirements

Library of Congress JPEG2000 profile

a recent newspaper digitization program established by a prominent national library

• digitize more than 20 million text pages
• high level image and xml requirements
• value of work awarded? > USD $5,000,000
• after award of work, technical requirements expand to 43+ pages from ~3 pages
• acceptance criteria? added as an afterthought and not well defined

poor planning

the value of simplicity

“There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies.”

C.A.R. Hoare

Professor Sir Charles Anthony Richard Hoare: Emeritus Professor at Oxford University, Senior Researcher at Microsoft Research, recipient of the ACM Turing Award, author of many books on computers and software.

• unitary: the requirement addresses one and only one thing

• complete: the requirement is fully stated in one place with no missing information

• consistent: the requirement does not contradict any other requirement and is fully consistent with all authoritative external documentation

• atomic: it does not contain conjunctions, for example, "the code field must validate American and Canadian postal codes" should be written as two separate requirements


good requirements

• traceable: the requirement meets all or part of a business need as stated by stakeholders and authoritatively documented

• current: the requirement has not been made obsolete by the passage of time

• feasible: the requirement can be implemented within the constraints of the project

• unambiguous: the requirement is concisely stated without recourse to technical jargon or acronyms

• verifiable: the implementation of the requirement can be determined through one of four possible methods: inspection, demonstration, test, or analysis

good requirements


• be impeccable with your word
• don’t take anything personally
• don’t make assumptions
• always do your best
• be mindful

simple principles for (good) communication

no communication ... little communication ... poor communication ... reduced communication ...

... all result in more assumptions about intent!

why (better) communication is necessary

The single biggest problem with communication is the illusion that it has taken place.

George Bernard Shaw, 1925 Nobel Prize in Literature.

“projects are about communication, communication, and communication”

Elenbaas, B. Staging a Project: Are You Setting Your Project Up for Success? Proceedings of the Project Management Institute Annual Seminars & Symposium. 2000.

“Plan to throw one away; you will anyhow. If there is anything new about the function of a system, the first implementation will have to be redone completely to achieve a satisfactory (i.e., acceptably small, fast, and maintainable) result. It costs a lot less if you plan to have a prototype.”

Butler Lampson

Butler Lampson was a founding member of Xerox PARC, worked for DEC, and now works at Microsoft Research. He is an adjunct professor at MIT and an ACM Fellow.

the value of prototypes / pilots

create requirements and acceptance criteria
repeat {
    digitize (small) pilot batch
    test data against acceptance criteria
    adjust requirements and acceptance criteria
} until (no more adjustments are necessary)
digitize more data

implement: pilot

pilot batches are VERY VERY important!!
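The pilot loop can be sketched in Python. This is only an illustration under stated assumptions: the AcceptanceCriteria class, the run_pilot function, the single "word accuracy" measure, and the adjustment policy are all hypothetical and do not come from any real tender or tool. A real pilot would adjust written requirements, not a single number.

```python
from dataclasses import dataclass


@dataclass
class AcceptanceCriteria:
    """Hypothetical acceptance criteria: a minimum OCR word accuracy."""
    min_word_accuracy: float


def run_pilot(batches, criteria, measure, adjust):
    """Digitize pilot batches until a round needs no adjustment.

    measure -- callable(batch) -> measured word accuracy of that batch
    adjust  -- callable(criteria, measured) -> criteria (unchanged when
               no adjustment is necessary)
    Returns (final_criteria, rounds_used).
    """
    rounds = 0
    for batch in batches:
        rounds += 1
        measured = measure(batch)                   # test data against criteria
        new_criteria = adjust(criteria, measured)   # adjust if needed
        if new_criteria == criteria:                # no more adjustments
            break
        criteria = new_criteria
    return criteria, rounds


def adjust(criteria, measured):
    # Illustrative policy only: if the pilot shows the target is out of
    # reach for this material, restate it at the level actually measured.
    if measured < criteria.min_word_accuracy:
        return AcceptanceCriteria(min_word_accuracy=measured)
    return criteria


# Made-up measurements for three small pilot batches.
measured_by_batch = {"pilot-1": 0.91, "pilot-2": 0.93, "pilot-3": 0.93}
final, rounds = run_pilot(
    ["pilot-1", "pilot-2", "pilot-3"],
    AcceptanceCriteria(min_word_accuracy=0.98),
    measured_by_batch.get,
    adjust,
)
```

With these made-up numbers the first batch forces an adjustment and the second confirms it, so the loop stops after two rounds and full production can start against criteria that have been tested on real material.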

reasons for in-house production

• collection cannot be moved
• collection is badly organized
• digitization must be done slowly over a long period
• digitization is very simple

implement: in-house

reasons for outsourced production

• originals can’t be scanned in-house because…
  • equipment is too expensive
  • output data is beyond staff experience
  • labor is too expensive
• large volume of work in a short time
• insufficient space, infrastructure, or staff

implement: outsource

The project management tool one chooses should be intuitive, easy to use, and accessible to all. If it isn’t, many will avoid / refuse / dislike / resent using it.

• Discussion of project management tools at http://en.wikipedia.org/wiki/Comparison_of_project-management_software

• List of project management tools at http://en.wikipedia.org/wiki/Comparison_of_project-management_software

project management tools

Discussion questions

1. What project management practices does your organization follow? Why?

2. What library standards does your organization currently use? What other, non-library standards, if any, does your organization use?

3. What reasons, in addition to those already cited, would your organization have to digitize newspapers in-house or to outsource digitization?

project management

“Perfection is attained, not when there is nothing left to add, but when there is nothing left to take away.”

Antoine de Saint-Exupéry

digitization workflow

• digital library: one or more digital collections
• digital collection: organized group(s) of digital objects
• digital object: a surrogate or digital copy of the original source document, for example, a newspaper issue

An example of what ALTO makes possible

The Day book. (Chicago, Ill.), 29 Feb. 1912. Chronicling America: Historic American Newspapers. Lib. of Congress. <http://chroniclingamerica.loc.gov/lccn/sn83045487/1912-02-29/ed-1/seq-26/>

• metadata: data about data. information about a digital object(s) or a digital collection(s) or the original source document(s)

• to enhance accessibility
• to increase collaboration and cooperation between libraries and archives around the world
• to promote research
• to provide opportunities for entrepreneurs
• other reasons?

why digitize newspapers?

Open Archival Information System (OAIS) reference model


the digitization process (diagram): source → produce images → images → produce digital objects → objects → ingest → preserve → access

the digitization process: source → produce images → images

• image file formats
  • TIFF
  • JPEG2000
  • JPEG
  • GIF
• text file formats
  • PDF, PDF/A, PDF/A-1b, PDF/A-1a
  • TEI XML
  • HTML
  • plain text
  • NITF / NewsML
• metadata
  • METS
  • MODS / PREMIS / ALTO / MIX ...

standard file formats

• image production source materials
  • original documents: better quality, more expensive
  • microfiche: poorer quality, less expensive, microfiche quality varies
• bit depth
  • black-and-white (bitonal)
  • greyscale
  • color
• resolution
• compression
  • no compression
  • lossless (reversible)
  • lossy (irreversible)
• image metadata

image decisions

image format comparison

Wikipedia contributors, "Comparison of Graphics File Formats," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Comparison_of_graphics_file_formats (accessed August 1, 2012)

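JBIG (.jbig, .jbg)
• compression: lossless
• bit depth: 1-bit
• metadata: no
• color management: no
• 1st public release: 2000?

JPEG (.jpg, .jpeg)
• compression: lossy, DCT, RLE, Huffman
• bit depth: 8-bit, 12-bit, 24-bit
• metadata: yes
• color management: yes
• mime type: image/jpeg (public.jpeg)
• patent: no
• 1st public release: 1992

JPEG2000 (.jp2)
• compression: many lossless and lossy compression algorithms
• bit depth: 8-bit, 16-bit, color to 48 bits
• metadata: yes
• color management: yes
• mime type: image/jp2 (public.jpeg200
• patent: yes but part 1 is patent free
• 1st public release: 2000

TIFF (.tiff, .tif)
• compression: none, LZW, RLE, ZIP, other
• bit depth: 1, 2, 4, 8, 16, 24, 32 bits
• metadata: yes
• color management: yes
• mime type: image/tiff (public.tiff)
• patent: no
• 1st public release: 1986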

                                     The Sacred Heart Review   Los Angeles Star   Die Susquehanna Zeitung
                                     300dpi                    300dpi             600dpi
TIFF (uncompressed)                  17.2 MB                   87 MB              415.5 MB
TIFF (lossless LZW compression)      10.2 MB                   75.8 MB            232.9 MB
JPEG (maximum quality [lossless])    7.0 MB                    37.2 MB            101.1 MB
JPEG (medium quality)                1.5 MB                    4.6 MB             10.2 MB
JPEG2000 (lossless compression)      7.1 MB                    52.7 MB            166.2 MB
JPEG2000 (lossy [70] compression)    5.1 MB                    37.1 MB            116.7 MB
JPEG2000 (lossy [30] compression)    2.2 MB                    16.1 MB            50.3 MB

image compression comparison
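The savings implied by the file sizes in this comparison can be worked out with a few lines of Python. This is a minimal sketch: the function name saving_percent and the dictionary layout are illustrative choices, and only the two lossless rows are included.

```python
# File sizes in MB, copied from the compression comparison for the
# uncompressed TIFF and the two lossless options.
SIZES_MB = {
    "The Sacred Heart Review 300dpi": {"tiff": 17.2, "tiff_lzw": 10.2, "jp2_lossless": 7.1},
    "Los Angeles Star 300dpi": {"tiff": 87.0, "tiff_lzw": 75.8, "jp2_lossless": 52.7},
    "Die Susquehanna Zeitung 600dpi": {"tiff": 415.5, "tiff_lzw": 232.9, "jp2_lossless": 166.2},
}


def saving_percent(original_mb, compressed_mb):
    """Percentage of storage saved relative to the uncompressed file."""
    return round((1 - compressed_mb / original_mb) * 100, 1)


# Saving per title for each lossless option.
SAVINGS = {
    title: {
        "tiff_lzw": saving_percent(sizes["tiff"], sizes["tiff_lzw"]),
        "jp2_lossless": saving_percent(sizes["tiff"], sizes["jp2_lossless"]),
    }
    for title, sizes in SIZES_MB.items()
}

for title, saving in SAVINGS.items():
    print(f"{title}: LZW saves {saving['tiff_lzw']}%, "
          f"lossless JPEG2000 saves {saving['jp2_lossless']}%")
```

The LZW saving ranges from roughly 13% to 44% across the three titles, so a single "average" compression figure should be checked against a pilot batch of the actual material before it is used for storage planning.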

                                         USA case law image 1   USA case law image 2
                                         300dpi                 300dpi
TIFF 1-bit CCITT G4 compression          40 KB                  87 KB
JPEG2000 W5x3 reversible compression     2.6 MB                 3.6 MB
JPEG2000 W9x7 irreversible compression   647 KB                 1 MB

image bit depth comparison

Image courtesy of http://epsos.de (accessed at http://commons.wikimedia.org March 2014).

GARBAGE IN, GARBAGE OUT

GIGO

Deaths. lln»rieff, Esq. of <c .. Qn. Sunday, the till. greatly Drandrellt, of Orms4\irJi.- ~ ; ;✓ ' • * On ijfr r inn l j j j i l F i i j ' 1 1 f H a v o d i v y d , Carnarvonshire, S ; **" *- ' « ' March Oxford, F. Tfovmeud, Uerald. » • V . • O n T n c s d a v l a s t , M r . C har l es . IWilinson, this 8 ; had vf thesis#,, a week ago, which tcrminate<i'iu his death. . / ' ■ O'i Sunday, dJst nit. at. AsbtCnvHall, mar Lancaster, Mr.,Geo. Worn ick, many years house'steward hit late Once The Hamilton and Brandon. He locked himself h»oWn'r«wte<: soon. twelve o'clock" that dny, and fii»-d a loaded pistol " t h r o u g h I n s b e a d , 1 w h i c h instantaneously killed him. Coronet's Verdict, shot himself in a temporary fit of Friday week,

raw OCR text

Excerpt from The British Newspaper Archive, Chester Courant, Tuesday 6-Apr-1819, page 3.

newspaper image

Discussion topics

1. Assume your organization decides to digitize 1000 newspaper issues averaging 12 pages per issue. The images are scanned 2-up and average 80MB each. How much disk storage is needed for the images?

2. Now assume instead that your organization uses TIFF images with LZW (lossless) compression, which saves on average 40%. How much disk storage is needed for the images?
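One way to work through the two storage questions is sketched below in Python. It assumes that "2-up" means two newspaper pages per image, that the 80MB average applies to the uncompressed images, and that 1 GB = 1000 MB; the function name and parameters are illustrative.

```python
def image_storage_gb(issues, pages_per_issue, pages_per_image,
                     mb_per_image, saving=0.0):
    """Estimate image storage in GB (decimal, 1 GB = 1000 MB).

    saving -- fraction of storage saved by compression (0.4 = 40%)
    """
    images = issues * pages_per_issue / pages_per_image
    total_mb = images * mb_per_image * (1.0 - saving)
    return total_mb / 1000


# Question 1: 1000 issues x 12 pages, 2 pages per image, 80 MB per image.
uncompressed_gb = image_storage_gb(1000, 12, 2, 80)

# Question 2: the same images stored as LZW-compressed TIFF (40% saving).
lzw_gb = image_storage_gb(1000, 12, 2, 80, saving=0.4)

print(f"uncompressed: {uncompressed_gb:.0f} GB, with LZW: {lzw_gb:.0f} GB")
```

Under these assumptions there are 6000 images, which need about 480 GB uncompressed and about 288 GB with LZW compression. Working copies, backups, and derivative files would add to these figures.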

digitization workflow


the digitization process: images → produce digital objects → objects

produce digital objects: image processing → layout analysis → OCR → metadata → build digital objects

the digitization process: image processing

• crop, de-skew, split images
• apply image improvement algorithms as needed
  • sharpening filters
  • local adaptive thresholding
  • remove text bleed-thru
  • etc
• create master images
• create working images
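Local adaptive thresholding, one of the image improvement steps listed above, can be illustrated with a small pure-Python sketch. It is a simple mean-based variant written for clarity, not the algorithm used by any particular product; the function name, the window size, and the offset are assumptions, and production software would work on real image files with an imaging library.

```python
def adaptive_threshold(gray, window=3, offset=10):
    """Binarize a greyscale image given as a list of rows (values 0-255).

    A pixel becomes black (0) when it is darker than the mean of its
    local window minus `offset`; otherwise it becomes white (255).
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Clip the window at the image borders.
            y0, y1 = max(0, y - half), min(height, y + half + 1)
            x0, x1 = max(0, x - half), min(width, x + half + 1)
            neighbours = [gray[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            local_mean = sum(neighbours) / len(neighbours)
            row.append(0 if gray[y][x] < local_mean - offset else 255)
        result.append(row)
    return result


# A tiny "page" whose background darkens from left to right (as on
# unevenly exposed microfilm), with two ink pixels in the middle row.
page = [
    [200, 180, 160, 140, 120, 100],
    [200, 120, 160, 140,  60, 100],
    [200, 180, 160, 140, 120, 100],
]
binary = adaptive_threshold(page)
```

In this example no single global threshold separates ink from paper, because the ink pixel on the bright side (120) is lighter than the background on the dark side (100). Comparing each pixel with its own neighbourhood marks only the two ink pixels as black.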


what’s wrong with this image?

text is skewed about 1° from vertical

text is de-skewed

text is skewed

the digitization process: layout analysis

• analyze layout of text image
• estimate font types and sizes
• calculate coordinates of text blocks
• determine layout object types (text, illustration, headline, etc)

newspaper text layout analysis

the digitization process: OCR

• perform optical character recognition (OCR)
• calculate word and character coordinates
• calculate word and character confidences
• apply language dictionaries
• correct OCR text (optional)

the digitization process: metadata

• populate metadata fields
• verify / correct page numbers
• verify / correct document structure

the digitization process: build digital objects

• create METS / ALTO XML files
• create image files and image metadata
• create PDF files (if required)
• verify digital object
• calculate file fixity checks (checksums)
• perform file validation and verification
• perform quality assurance
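The fixity step can be sketched with Python's standard hashlib module. The function names file_checksum and verify_fixity are illustrative, and SHA-256 is an assumed default; a program's own specification determines which algorithm is required.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks so that
    large master images do not have to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, expected, algorithm="sha256"):
    """Compare a file's current checksum with the value recorded when
    the digital object was built."""
    return file_checksum(path, algorithm) == expected.lower()
```

A checksum recorded at build time (for example in the METS file) can later be passed to verify_fixity after every transfer or storage migration to confirm that the file has not changed.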

• automatic production steps performed by software

• manual production steps performed by operators

real world digitization production workflow

• METS XML for descriptive, structural, technical, and administrative metadata
• descriptive metadata
  • Metadata Object Description Schema (MODS): selected metadata from MARC
  • Dublin Core: fundamental group of text elements for describing and cataloging
• technical metadata
  • ALTO for OCR text
  • PREMIS for digital preservation
  • MIX and ANSI/NISO Z39.87 for images

digital library standards

Metadata Encoding and Transmission Standard

• METS is an XML standard for encoding descriptive, administrative, and structural metadata about objects within a digital library
• METS files consist of 7 sections: header, descriptive, administrative, file section, structural map, structural link, and behavior. Only the structural map is required; the other sections are optional
• METS profiles describe a class of METS documents in sufficient detail to provide both document authors and programmers the guidance to create and process METS documents conforming with a particular profile
• current version 1.9.1
• administered by METS editorial board (international group of volunteers)
• standards hosted by Library of Congress at http://www.loc.gov/standards/mets/

Graphic from Karin Bredenberg, Communicating Archival Metadata conference and workshops. Riksarkivet, 2011.

METS file structure

Metadata Object Description Schema

• MODS is an XML schema for a bibliographic element set that may be used for library applications. Derivative of MARC 21 bibliographic format. Includes a subset of MARC fields, using language-based tags rather than numeric ones
• Subset of MARC 21
• Mappings exist between MODS and MARC, Dublin Core, and RDA (conversion tools exist)
• May be used in conjunction with METS XML
• current version 3.4
• administered by Library of Congress Network Development and MARC Standards Office with help from interested users
• standards hosted by Library of Congress at http://www.loc.gov/standards/mods/

MODS metadata in METS XML

<mets:dmdSec ID="issue-nla.news-issn18368190_18740425">
  <mets:mdWrap MDTYPE="MODS">
    <mets:xmlData>
      <mods:mods xmlns="http://www.loc.gov/mods/v3">
        <mods:language>
          <mods:languageTerm type="code" authority="rfc3066">en</mods:languageTerm>
        </mods:language>
        <mods:genre>newspaper issue</mods:genre>
        <mods:originInfo>
          <mods:dateIssued>18740425</mods:dateIssued>
        </mods:originInfo>
        <mods:relatedItem type="host">
          <mods:titleInfo>
            <mods:title>The Queenslander (Brisbane, Qld. : 1866-1939)</mods:title>
          </mods:titleInfo>
          <mods:genre>newspaper</mods:genre>
          <mods:identifier>ISSN18368190</mods:identifier>
          <mods:part>
            <mods:detail type="volume">
              <mods:number>IX</mods:number>
            </mods:detail>
          </mods:part>
          <mods:part>
            <mods:detail type="issue">
              <mods:number>12</mods:number>
            </mods:detail>
          </mods:part>
        </mods:relatedItem>
      </mods:mods>
    </mets:xmlData>
  </mets:mdWrap>
</mets:dmdSec>
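A descriptive metadata section like this one can be read with Python's standard xml.etree.ElementTree module. The sketch below uses a shortened copy of the sample in which the mets and mods namespace prefixes are declared on the fragment itself, which is an assumption made so the fragment parses on its own; in a complete METS file they are normally declared on the root element. The function name describe_issue is illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for METS and MODS.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
}

# Shortened, self-contained version of the sample (prefixes declared here).
SAMPLE_DMDSEC = """\
<mets:dmdSec xmlns:mets="http://www.loc.gov/METS/"
             xmlns:mods="http://www.loc.gov/mods/v3"
             ID="issue-nla.news-issn18368190_18740425">
  <mets:mdWrap MDTYPE="MODS">
    <mets:xmlData>
      <mods:mods>
        <mods:genre>newspaper issue</mods:genre>
        <mods:originInfo>
          <mods:dateIssued>18740425</mods:dateIssued>
        </mods:originInfo>
        <mods:relatedItem type="host">
          <mods:titleInfo>
            <mods:title>The Queenslander (Brisbane, Qld. : 1866-1939)</mods:title>
          </mods:titleInfo>
          <mods:genre>newspaper</mods:genre>
        </mods:relatedItem>
      </mods:mods>
    </mets:xmlData>
  </mets:mdWrap>
</mets:dmdSec>
"""


def describe_issue(dmd_sec_xml):
    """Pull a few descriptive fields for one newspaper issue out of a
    METS dmdSec that wraps MODS."""
    root = ET.fromstring(dmd_sec_xml)
    mods = root.find(".//mods:mods", NS)
    return {
        "id": root.get("ID"),
        "genre": mods.findtext("mods:genre", namespaces=NS),
        "date_issued": mods.findtext("mods:originInfo/mods:dateIssued", namespaces=NS),
        "title": mods.findtext(
            "mods:relatedItem/mods:titleInfo/mods:title", namespaces=NS
        ),
    }


issue = describe_issue(SAMPLE_DMDSEC)
```

The issue-level genre and date come from the direct children of the MODS record, while the newspaper title comes from the related "host" item, mirroring how the sample separates the issue from the title it belongs to.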

Dublin Core metadata

• Dublin Core is a set of vocabulary terms used to describe resources for the purposes of discovery.

• Dublin Core metadata element set is endorsed in IETF RFC 5013, ISO 15836:2009, and NISO Z39.85

• Metadata terms last updated 14-Jun-2012

• May be used in conjunction with METS XML

• Dublin Core Metadata Initiative (DCMI) is an open organization, incorporated as a public, not-for-profit company in Singapore

• Dublin Core Metadata Initiative is hosted at http://dublincore.org/

Analyzed Layout and Text Object

• ALTO XML provides technical metadata for describing the layout and content of physical text resources, such as pages of a book or a newspaper
• commonly used in conjunction with METS XML but may be used standalone
• current version 2.1
• administered by ALTO editorial board (international group of volunteers)
• standards hosted by Library of Congress at http://www.loc.gov/standards/alto/

<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://schema.ccs-gmbh.com/metae/alto-1-4.xsd" xmlns:xlink="http://www.w3.org/1999/xlink">
  <Description>
    <MeasurementUnit>pixel</MeasurementUnit>
    <sourceImageInformation>
      <fileName>//docstorage/impdata_2$/IN/NLA/db0046/batch-1109/nlaImageSeq-2349218-b.tif</fileName>
    </sourceImageInformation>
  </Description>
  <Styles>
    <TextStyle ID="TXT_0" FONTSIZE="7" FONTFAMILY="Times New Roman" FONTSTYLE="bold"/>
    <TextStyle ID="TXT_1" FONTSIZE="9" FONTFAMILY="Times New Roman" FONTSTYLE="bold"/>
  </Styles>
  <Layout>
    <Page ID="P1" PHYSICAL_IMG_NR="1" HEIGHT="9224" WIDTH="7136" PC="0.967">
      <TopMargin ID="P1_TM00001" HPOS="0" VPOS="0" WIDTH="7135" HEIGHT="814"/>
      <LeftMargin ID="P1_LM00001" HPOS="0" VPOS="814" WIDTH="151" HEIGHT="8194"/>
      <RightMargin ID="P1_RM00001" HPOS="6959" VPOS="814" WIDTH="176" HEIGHT="8194"/>
      <BottomMargin ID="P1_BM00001" HPOS="0" VPOS="9008" WIDTH="7135" HEIGHT="216"/>
      <PrintSpace ID="P1_PS00001" HPOS="151" VPOS="814" WIDTH="6808" HEIGHT="8194">
        <ComposedBlock ID="ART1" HEIGHT="2366" WIDTH="929" HPOS="209" VPOS="831">
          <ComposedBlock ID="ZONE1-1" HEIGHT="88" WIDTH="641" HPOS="357" VPOS="831">
            <TextBlock ID="P1_TB00004" HPOS="357" VPOS="831" WIDTH="641" HEIGHT="88" STYLEREFS="TXT_4 PAR_LEFT">
              <TextLine ID="P1_TL00065" HPOS="357" VPOS="831" WIDTH="641" HEIGHT="75">
                <String ID="P1_ST00404" HPOS="357" VPOS="831" WIDTH="65" HEIGHT="74" CONTENT="The" WC="0.98" CC="000"/>
                <SP ID="P1_SP00340" HPOS="422" VPOS="906" WIDTH="0"/>
                <String ID="P1_ST00405" HPOS="422" VPOS="831" WIDTH="576" HEIGHT="74" CONTENT="Queenslander." WC="0.96" CC="0000000000000"/>
              </TextLine>
            </TextBlock>
          </ComposedBlock>
          <ComposedBlock ID="ZONE1-2" HEIGHT="83" WIDTH="894" HPOS="228" VPOS="964"/>
          <ComposedBlock ID="ZONE1-3" HEIGHT="46" WIDTH="702" HPOS="331" VPOS="1087"/>
              <TextLine ID="P1_TL01143" HPOS="5946" VPOS="8957" WIDTH="881" HEIGHT="46">
                <String ID="P1_ST06356" HPOS="5946" VPOS="8965" WIDTH="3" HEIGHT="27" CONTENT="I" WC="1.00" CC="0"/>
                <SP ID="P1_SP05236" HPOS="5950" VPOS="8992" WIDTH="658"/>
                <String ID="P1_ST06357" HPOS="6608" VPOS="8957" WIDTH="219" HEIGHT="46" CONTENT="Proprietors." WC="1.00" CC="101401212010"/>
              </TextLine>
            </TextBlock>
          </ComposedBlock>
        </ComposedBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
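As a rough illustration of how ALTO output might be consumed, the Python sketch below collects every String element with its position and word confidence (WC). It uses only the standard library. The sample document is a trimmed, hypothetical stand-in modeled on the excerpt above, and the helper names `local_name` and `alto_words` are assumptions made for this example, not part of any ALTO tool.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical ALTO sample modeled on the excerpt above.
SAMPLE_ALTO = """<alto>
  <Layout>
    <Page ID="P1" HEIGHT="9224" WIDTH="7136">
      <PrintSpace ID="P1_PS00001">
        <TextBlock ID="P1_TB00004">
          <TextLine ID="P1_TL00065">
            <String ID="P1_ST00404" HPOS="357" VPOS="831" CONTENT="The" WC="0.98"/>
            <SP ID="P1_SP00340" HPOS="422" VPOS="906" WIDTH="0"/>
            <String ID="P1_ST00405" HPOS="422" VPOS="831" CONTENT="Queenslander." WC="0.96"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Drop a '{namespace}' prefix so namespaced and plain ALTO both work."""
    return tag.rsplit("}", 1)[-1]


def alto_words(xml_text):
    """Return (text, hpos, vpos, word_confidence) for every String element."""
    root = ET.fromstring(xml_text)
    words = []
    for elem in root.iter():
        if local_name(elem.tag) == "String":
            words.append((
                elem.get("CONTENT"),
                float(elem.get("HPOS")),
                float(elem.get("VPOS")),
                float(elem.get("WC", "nan")),  # WC is optional in ALTO
            ))
    return words


words = alto_words(SAMPLE_ALTO)
print(" ".join(text for text, _, _, _ in words))
```

The coordinates returned with each word are what allow a viewer to highlight search hits on the page image.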

Analyzed Layout and Text Object

Analyzed Layout and Text Object book

Analyzed Layout and Text Object newspaper

Preservation Metadata Implementation Strategies

• PREMIS is a core set of implementable preservation metadata, broadly applicable across a wide range of digital preservation contexts and supported by guidelines and recommendations for creation, management, and use

• In 2003 OCLC and RLG jointly sponsored the formation of the PREMIS working group comprised of international experts in the use of metadata to support digital preservation activities

• PREMIS data dictionary current version 2.2
• May be used in conjunction with METS XML
• PREMIS tools are freely available
• PREMIS Maintenance Activity and Editorial Committee has international members from libraries and industry
• PREMIS data dictionary is hosted at http://www.loc.gov/standards/premis/

PREMIS data in METS file

<mets:amdSec>
  <mets:techMD ID="PREMISOBJECT1">
    <mets:mdWrap MDTYPE="PREMIS">
      <mets:xmlData>
        <premis:object xmlns:premis="http://www.loc.gov/standards/premis/v1">
          <premis:objectIdentifier>
            <premis:objectIdentifierType>National Library of Australia</premis:objectIdentifierType>
            <premis:objectIdentifierValue>nlaImageSeq-218-b.tif</premis:objectIdentifierValue>
          </premis:objectIdentifier>
          <premis:objectCategory>file</premis:objectCategory>
          <premis:objectCharacteristics>
            <premis:format>
              <premis:formatDesignation>
                <premis:formatName>TIFF</premis:formatName>
                <premis:formatVersion>TIFF 6.0</premis:formatVersion>
              </premis:formatDesignation>
            </premis:format>
          </premis:objectCharacteristics>
          <premis:relationship>
            <premis:relationshipType>derivation</premis:relationshipType>
            <premis:relationshipSubType>is derivative of</premis:relationshipSubType>
            <premis:relatedObjectIdentification>
              <premis:relatedObjectIdentifierType>National Library of Australia</premis:relatedObjectIdentifierType>
              <premis:relatedObjectIdentifierValue>nlaImageSeq-218-b.tif</premis:relatedObjectIdentifierValue>
              <premis:relatedObjectSequence>0</premis:relatedObjectSequence>
            </premis:relatedObjectIdentification>
            <premis:relatedEventIdentification>
              <premis:relatedEventIdentifierType>National Library of Australia</premis:relatedEventIdentifierType>
              <premis:relatedEventIdentifierValue>deskew-nlaImageSeq-218-b.tif</premis:relatedEventIdentifierValue>
              <premis:relatedEventSequence>0</premis:relatedEventSequence>
            </premis:relatedEventIdentification>
          </premis:relationship>
        </premis:object>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:techMD>
</mets:amdSec>
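To show how PREMIS data embedded in a METS file might be read back, here is a minimal Python sketch using the standard library. The sample is a shortened, hypothetical version of the excerpt above, wrapped in a `mets:mets` root so that the namespaces are declared. The function name `premis_objects` is an assumption for this example; the PREMIS namespace URI is the one used in the excerpt.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "premis": "http://www.loc.gov/standards/premis/v1",  # as in the excerpt above
}

# Shortened, hypothetical sample based on the excerpt above.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
    xmlns:premis="http://www.loc.gov/standards/premis/v1">
  <mets:amdSec>
    <mets:techMD ID="PREMISOBJECT1">
      <mets:mdWrap MDTYPE="PREMIS">
        <mets:xmlData>
          <premis:object>
            <premis:objectIdentifier>
              <premis:objectIdentifierType>National Library of Australia</premis:objectIdentifierType>
              <premis:objectIdentifierValue>nlaImageSeq-218-b.tif</premis:objectIdentifierValue>
            </premis:objectIdentifier>
            <premis:objectCharacteristics>
              <premis:format>
                <premis:formatDesignation>
                  <premis:formatName>TIFF</premis:formatName>
                  <premis:formatVersion>TIFF 6.0</premis:formatVersion>
                </premis:formatDesignation>
              </premis:format>
            </premis:objectCharacteristics>
          </premis:object>
        </mets:xmlData>
      </mets:mdWrap>
    </mets:techMD>
  </mets:amdSec>
</mets:mets>"""


def premis_objects(mets_text):
    """Return identifier and format details for each PREMIS object in a METS file."""
    root = ET.fromstring(mets_text)
    found = []
    # Only look inside metadata wrappers that declare themselves as PREMIS.
    for wrap in root.iterfind(".//mets:mdWrap[@MDTYPE='PREMIS']", NS):
        for obj in wrap.iterfind(".//premis:object", NS):
            designation = "premis:objectCharacteristics/premis:format/premis:formatDesignation"
            found.append({
                "identifier": obj.findtext(
                    "premis:objectIdentifier/premis:objectIdentifierValue", namespaces=NS),
                "format": obj.findtext(designation + "/premis:formatName", namespaces=NS),
                "version": obj.findtext(designation + "/premis:formatVersion", namespaces=NS),
            })
    return found


objects = premis_objects(SAMPLE_METS)
print(objects)
```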


implement: software

• commercial off-the-shelf (COTS)
• open source
• customized COTS
• customized open source
• custom in-house


Discussion topics

1. Assuming your organization will digitize historic newspapers, will it digitize the newspapers in-house or out-source digitization? Why? (If you don’t know, guesses and speculations are fine.)

2. Describe your organization’s current digitization workflow.


quality assurance and acceptance criteria


Wikipedia on data quality: The processes and technologies involved in ensuring the conformance of data values to requirements and acceptance criteria


automatic quality checks

• is the digital object complete? are all its components present?
• is the digital object verifiable?
• is the digital object uncorrupted?
• do the components of the digital object conform to standards?
• do the file names conform to project requirements?
• does the directory structure conform to project requirements?
• does the digital object metadata conform to project specifications?
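Several of the automatic checks above (completeness, file names, empty files) lend themselves to scripting. The Python sketch below is one possible shape for such a check, under an invented project rule: page images are named 0001.tif, 0002.tif, and so on, and each image has an ALTO file with the same stem. The rule, the function name `check_issue`, and the sample files are all hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical project rule: four-digit page number plus .tif or .xml.
PAGE_NAME = re.compile(r"^\d{4}\.(tif|xml)$")


def check_issue(issue_dir):
    """Return a list of human-readable problems found in one issue directory."""
    problems = []
    files = sorted(p for p in Path(issue_dir).iterdir() if p.is_file())
    for path in files:
        if not PAGE_NAME.match(path.name):
            problems.append(f"bad file name: {path.name}")
        elif path.stat().st_size == 0:
            problems.append(f"empty file: {path.name}")
    # Completeness: every page image needs an ALTO file and vice versa.
    images = {p.stem for p in files if p.suffix == ".tif"}
    texts = {p.stem for p in files if p.suffix == ".xml"}
    for stem in sorted(images - texts):
        problems.append(f"missing ALTO for page {stem}")
    for stem in sorted(texts - images):
        problems.append(f"missing image for page {stem}")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    issue = Path(tmp)
    (issue / "0001.tif").write_bytes(b"II*\x00")
    (issue / "0001.xml").write_text("<alto/>")
    (issue / "0002.tif").write_bytes(b"II*\x00")   # its ALTO file is missing
    (issue / "Page3.TIF").write_bytes(b"II*\x00")  # violates the naming rule
    report = check_issue(issue)

print(report)
```

Checks like these are cheap to run on every delivery, which leaves human reviewers free for the checks that cannot be automated.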

manual quality checks

• does the digital object metadata meet accuracy specifications?
• does the text meet accuracy specifications?
• is the image quality satisfactory?
• are article continuations correct?
• is the text in reading order?

acceptance criteria for an English language digitization project at a large, well-known, and internationally recognized national library

character accuracy > 80%
word accuracy > 75%
significant word accuracy > 65%

what’s wrong with this?

project quality requirement:

“a high level of accuracy”

what’s wrong with this?

project quality requirement:

“article titles must be 99.5% accurate”

what’s wrong with this?

project quality requirement:

“article title characters in each issue must be 99.5% accurate, that is, each issue may have no more than 5 errors in 1000 article title characters”

what’s wrong with this?
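A requirement such as "no more than 5 errors in 1000 article title characters" can only be enforced once "error" has a definition. The Python sketch below assumes one possible definition: the Levenshtein edit distance between a ground-truth title and the OCR title. The slides do not say how errors are counted, so this choice, the function names, and the sample titles are all assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def title_accuracy(pairs):
    """Character accuracy over (ground_truth, ocr_text) title pairs of one issue."""
    total_chars = sum(len(truth) for truth, _ in pairs)
    errors = sum(edit_distance(truth, ocr) for truth, ocr in pairs)
    return 1.0 - errors / total_chars, errors, total_chars


# Hypothetical sample: two article titles from one issue.
issue_titles = [
    ("The Queenslander.", "The Queenslander."),
    ("Shipping Intelligence", "Shipping lntelligcnce"),  # two substitutions
]
accuracy, errors, total = title_accuracy(issue_titles)
print(f"{errors} errors in {total} characters, accuracy {accuracy:.3%}")
print("accept" if accuracy >= 0.995 else "reject")
```

With only 38 title characters in this sample, two errors are enough to fail a 99.5% threshold, so the sample size behind a measurement matters as much as the threshold.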

image quality

• sharpness: the amount of detail an image can convey
• noise: random variation of image density
• dynamic range
• contrast (gamma): the slope of the tone reproduction curve in a log-log space. high contrast usually involves loss of dynamic range, that is, loss of detail, or clipping, in highlights or shadows.
• vignetting: darkens images near the corners
• artifacts: “leftovers” from sharpening or compression

Wikipedia contributors, “Image quality," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Image_quality (accessed March 2014).


Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing. April 2004.

image quality

“…images which are ultimately to be viewed by human beings, the only “correct” method of quantifying visual image quality is through subjective evaluation. in practice, however, subjective evaluation is usually too inconvenient, time-consuming and expensive…”

“…best way to assess the quality of an image is to look at it because human eyes are the ultimate viewers of most images…”

Zhou Wang, Alan C. Bovik, and Ligang Lu. Why is image quality assessment so difficult? IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). May 2002.


acceptance criteria for the National Library of Australia NDP


Discussion topics

1. How does your organization currently do quality assurance for digital data?

2. How much time / effort is given to writing quality assurance procedures and acceptance criteria for digitized data?


digitization tools

open source vs. commercial software: pros

• acquisition : cost, development and implementation contract costs are likely to be lower than for proprietary software. less likely that there will be contractually-bound upgrade costs. total cost of ownership over the lifetime of usage must be taken into account

• data transferability : with open source code and open data formats, there are greater opportunities to share data across interoperable platforms

• re-use : open source is free from per user or per instance costs and there is a guaranteed freedom to use it in any way. re-use is enabled.


Adapted from Open Gov Summit 2013. http://opengov2013.zaizi.com/pros-and-cons-of-open-source-solutions/

• cost effective : pay once, if at all, for development and reuse where appropriate.

• non-restrictive : open source licenses do not limit or restrict who can use the software, the type of user, or the areas of business in which the software can be used. provides a licensing model that enables rapid provisioning of both known and unanticipated users and in new use cases.

• scalable : open source solutions are scalable upwards and downwards with a reduction in the risk of longer term financial implications. no license fees on a “per user” or “per box” basis. no redundant licenses


• easy to prototype and adapt : open source software is particularly suitable for rapid prototyping and experimentation, where the ability to “test drive” the software with minimal costs and administrative delays can be important. (proprietary software suppliers may also provide the same through a ‘proof of concept’ phase at minimal or no cost.)


open source vs. commercial software: cons

• support and maintenance costs : may outweigh those of the proprietary package and include ‘hidden’ commitments.

• intellectual property rights : as code is modified and adapted, there may be legal risks concerning the code’s open source status and who owns the intellectual property rights of the modified code.

• expertise : requires software installation and maintenance expertise. modification of open source code requires software development expertise. organizations must ensure that they have the right level of expertise to manage it effectively.

Adapted from Open Gov Summit 2013. http://opengov2013.zaizi.com/pros-and-cons-of-open-source-solutions/

digitization tools

a variety of open source and commercial off-the-shelf (COTS) software is available for digitization projects

• easier for systems from different parties or using different technologies to interoperate and communicate with one another
• better protection of the data files created by an application against obsolescence of the application
• applications / data are easier to port from one platform to another since they follow known guidelines and rules, and the interfaces


ocr software

• ABBYY FineReader (http://www.abbyy.com)
• Tesseract, open source (https://code.google.com/p/tesseract-ocr)
• Nuance OmniPage (http://www.nuance.com)
• IRIS Readiris (http://www.irislink.com)
• LEADTOOLS OCR (http://www.leadtools.com)
• OCRopus, open source (https://code.google.com/p/ocropus)

Wikipedia contributors, “Comparison of optical character recognition software,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Comparison_of_optical_character_recognition_software (accessed March 2014).

Wikipedia contributors, “Optical character recognition,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Optical_character_recognition (accessed March 2014).

imaging software

• LEADTOOLS image SDK (http://www.leadtools.com)
• ImageGear image SDK (http://www.accusoft.com)
• FreeImage image SDK, open source (http://freeimage.sourceforge.net)
• BlackIce image toolkits (http://www.blackice.com)
• Adobe Photoshop (http://www.adobe.com/Photoshop)
• GIMP, open source (http://www.gimp.org)
• GraphicsMagick, open source (http://www.graphicsmagick.org)
• ImageMagick, open source (http://www.imagemagick.org)

digital workflow software

• Content Conversion Specialists docWorks (http://content-conversion.com)
• ScanFlow (http://www.treventus.com)
• Goobi, open source (http://www.goobi.org)
• Zissor (http://zissor.com)

other software

• BagIt : hierarchical file packaging format for the exchange of digital content. A "bag" has just enough structure to safely enclose descriptive "tags" and a "payload" but does not require any knowledge of the payload's internal semantics. See http://sourceforge.net/projects/loc-xferutils and http://tools.ietf.org/html/draft-kunze-bagit-06.

open source
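To make the "bag" idea concrete, the following Python sketch writes a minimal bag by hand: payload files under data/, a bagit.txt declaration, and an MD5 payload manifest. It is a simplified illustration using only the standard library, not a replacement for the Library of Congress tools referenced above. The function name `make_minimal_bag`, the sample payload, and the version string in bagit.txt are assumptions; check the version of the specification you follow for the exact required values.

```python
import hashlib
import tempfile
from pathlib import Path


def make_minimal_bag(bag_dir, payload):
    """Write payload files under data/ plus bagit.txt and manifest-md5.txt.

    payload maps a relative file name to its bytes (hypothetical sample data).
    """
    bag = Path(bag_dir)
    bag.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for rel_path, content in sorted(payload.items()):
        target = bag / "data" / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.md5(content).hexdigest()
        # One manifest line per payload file: checksum, whitespace, path.
        manifest_lines.append(f"{digest}  data/{rel_path}")
    # The version string below is an assumption; use the one your spec requires.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8")
    (bag / "manifest-md5.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8")
    return bag


with tempfile.TemporaryDirectory() as tmp:
    bag = make_minimal_bag(
        Path(tmp) / "issue-bag",
        {"0001.tif": b"page one", "0001.xml": b"<alto/>"},
    )
    manifest = (bag / "manifest-md5.txt").read_text(encoding="utf-8")
    top_level = sorted(p.name for p in bag.iterdir())

print(top_level)
print(manifest)
```

The receiver of a bag can recompute the checksums of everything under data/ and compare them with the manifest to confirm that the transfer was complete and uncorrupted.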

Discussion questions

1. What software tools does your organization use for digital projects or digital libraries?

2. Does your organization host a digital library? If so, does it use Google Analytics or a similar tool? Why or why not?

3. What software tools does your organization use for project management? Are the tools web-based?


Preservation of software and preservation of data are two sides of the same coin. From February 2011 Workshop for Digital Curators.

digital preservation

Open Archival Information System (OAIS) reference model

digitization ≠ digital preservation

Vint Cerf on “bit rot”

digital preservation

long-term, error-free storage of digital information, with means for retrieval and interpretation, for the entire time span the information is required

digital data risks

• standards / format obsolescence
• migration to new format, media, or hardware
• media obsolescence / decay
• bit rot

format obsolescence

remember … WordPerfect ?

MARC records ? Adobe Flash ?

strategies for format obsolescence

• migrate data to new formats
• create a computer software museum with virtual machines
• format registries
• format validators
• don’t worry about it!

Jeff Rothenberg on format obsolescence

“... digital documents are evolving so rapidly that shifts in the forms of documents must inevitably arise. New forms do not necessarily subsume their predecessors or provide compatibility with previous formats.”

Jeff Rothenberg. Ensuring the Longevity of Digital Documents. Originally published in Scientific American. January 1995. Expanded version published February, 1999. (accessed 1 August 2012 at http://www.clir.org/pubs/archives/ensuring.pdf)

standard model for format obsolescence

• digital format registry collects information about target format

• this information is used to build format identification and verification tools

• holders of content use these tools to extract metadata from content in target format; metadata is stored with the content

• format registry scans computing environment to determine which formats are obsolescent; notifications sent for obsolete formats

• on receiving such a notification, someone builds a tool to convert obsolete format to non-obsolete format using the format specification in the registry

• on receiving such a notification, holder of content in obsolete format uses conversion tool and content metadata to convert the file in an obsolete format to a file in a non-obsolete format
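The identification and notification steps of the model above can be sketched in a few lines of Python. The "registry" here is a tiny hypothetical table of well-known byte signatures found at the start of a file; real format registries and identification tools hold far more detail. The names `SIGNATURES`, `identify`, and `needs_attention` are assumptions for this example.

```python
# Hypothetical miniature "format registry": byte signatures found at offset 0.
SIGNATURES = [
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
]


def identify(header):
    """Return the format name whose signature starts the header, else None."""
    for signature, name in SIGNATURES:
        if header.startswith(signature):
            return name
    return None


def needs_attention(holdings, obsolete_formats):
    """Flag files that are unidentified or stored in a format marked obsolete."""
    flagged = {}
    for name, header in holdings.items():
        fmt = identify(header)
        if fmt is None:
            flagged[name] = "unidentified format"
        elif fmt in obsolete_formats:
            flagged[name] = f"obsolete format: {fmt}"
    return flagged


# First bytes of four hypothetical files.
holdings = {
    "0001.tif": b"II*\x00\x08\x00\x00\x00",
    "issue.pdf": b"%PDF-1.4\n",
    "scan.png": b"\x89PNG\r\n\x1a\n",
    "notes.dat": b"\x00\x01\x02\x03",
}
# PNG is marked obsolete here only to exercise the code; it is not obsolete.
flagged = needs_attention(holdings, obsolete_formats={"PNG"})
print(flagged)
```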

David Rosenthal on format obsolescence

“... format obsolescence is a rare problem that happens infrequently to a minority of unpopular formats ...”

David Rosenthal. Format obsolescence: Assessing the threat and the defenses. (accessed 1 August 2012 at http://lockss.org/locksswiki/files/LibraryHighTech2010.pdf)

alternate model for format obsolescence

• store only essential data • perform only essential tasks • delay performing tasks as long as possible

David Rosenthal. Format obsolescence: Assessing the threat and the defenses. Library High Tech, Special Issue, vol. 28, no. 2, 2010, pp.195-210. doi:10.1108/07378831011047613 (accessed 1 August 2012 at http://lockss.org/locksswiki/files/LibraryHighTech2010.pdf).

importance of standards vis-a-vis format obsolescence

well-defined standards …

• guide developers in creation of tools
• facilitate development of a broad range of tools for any format
• allow developers to maintain existing tools

data migration risks

• file format changes, for example, PDF 1.4 to PDF 1.7
• file name differences, for example, case sensitive / insensitive names, new operating system
• extended file attributes
• file permissions, for example, BSD Unix drwxr-xr-x@ to Windows file permissions
• soft links / hard links
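One of the risks above, file names that differ only by case, can be detected before a migration to a case-insensitive file system. The Python sketch below is a minimal check; the function name `case_collisions` and the sample paths are hypothetical.

```python
from collections import defaultdict


def case_collisions(paths):
    """Group paths that would collide on a case-insensitive file system."""
    groups = defaultdict(list)
    for path in paths:
        groups[path.casefold()].append(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical file listing from a case-sensitive source system.
names = ["issue1/Page1.tif", "issue1/page1.tif", "issue1/page2.tif"]
collisions = case_collisions(names)
print(collisions)
```

Any group reported here would silently lose a file if copied as-is, so such names need to be resolved before the data is moved.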

media obsolescence

• 5 ¼” floppy disks
• 8 track tapes
• 3 ½” floppy disks
• ZIP drives
• CD-R, CD-RW, Blu-Ray
• DAT tapes
• microfilm
• etc

strategies for media obsolescence

• migrate data to new media, for example, floppy disks to DVD
• create and maintain a computer hardware museum

media decay

a report by NIST and the Library of Congress says ...
• virtually all CD-Rs tested indicated an estimated life expectancy beyond 15 years
• only 47 percent of recordable DVDs indicated an estimated life expectancy beyond 15 years; some had a life expectancy as short as 1.9 years
• in practice actual lifetimes may be considerably shorter

prevention / detection of media decay

• proper storage
• data file checksums (MD5, SHA-1, ...)
• monitor media integrity
• migrate data from old media to new media

bit rot

gradual decay of data due to …

• storage media failure because of media quality
• storage media failure because of improper storage
• random events (bit-flip, environmental influences)
• software / hardware errors

prevention / detection of bit rot

• data file fixity check (checksums) such as MD5, SHA-1, ...
• monitor file integrity with frequent, corrective audits
• duplicate copies, geographically distributed
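A fixity check of the kind listed above amounts to recording a checksum for each file once and recomputing it at every audit. The Python sketch below shows one way this might look with the standard library. The function names, the sample files, and the choice of SHA-1 as the default algorithm are assumptions for illustration; any algorithm supported by `hashlib` can be passed instead.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in chunks so large page images need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_fixity(root):
    """Map each file path (relative to root) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*")) if p.is_file()
    }


def audit(root, recorded):
    """Recompute checksums and report each file as ok, changed, or missing."""
    root = Path(root)
    report = {}
    for rel_path, expected in recorded.items():
        path = root / rel_path
        if not path.is_file():
            report[rel_path] = "missing"
        elif file_digest(path) != expected:
            report[rel_path] = "changed"
        else:
            report[rel_path] = "ok"
    return report


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    for name in ("0001.tif", "0002.tif", "0003.tif"):
        (store / name).write_bytes(b"image data for " + name.encode())
    recorded = record_fixity(store)                   # done once, at ingest
    (store / "0002.tif").write_bytes(b"image dat\x00") # simulate corruption
    (store / "0003.tif").unlink()                      # simulate a lost file
    report = audit(store, recorded)

print(report)
```

An audit only detects damage; repairing it requires a second, intact copy, which is why fixity checks and duplicate copies appear together in the list above.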

distributed decentralized digital preservation

• the more copies, the safer the data
• the more independent copies, the safer the data
• the more frequently copies are audited, the safer the data

Paraphrased David Rosenthal. Keeping bits safe: How hard can it be?

distributed decentralized digital preservation

• n+1 copies are safer than n copies

• n independent copies on different storage devices / media are safer than n copies on similar or identical storage devices / media

• data audited every week is safer than data audited every month
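The claims above can be illustrated with a deliberately simple probability model, written here in Python. It assumes that each copy is lost on its own with probability `p_copy` during one audit interval, and that a shared failure (for example, identical media from one bad batch) destroys all copies at once with probability `p_common`. Both the model and the numbers are assumptions chosen for illustration; they are not taken from the cited paper.

```python
def loss_probability(p_copy, copies, p_common=0.0):
    """Probability that every copy is lost during one audit interval.

    p_copy:   chance that a single copy is lost on its own
    p_common: chance of a shared failure that destroys all copies together
              (0.0 means the copies are fully independent)
    """
    return p_common + (1.0 - p_common) * p_copy ** copies


for n in (1, 2, 3):
    independent = loss_probability(0.01, n)
    correlated = loss_probability(0.01, n, p_common=0.001)
    print(f"{n} copies: independent {independent:.2e}, correlated {correlated:.2e}")
```

In this toy model each extra independent copy multiplies the loss probability by `p_copy`, while a shared failure mode puts a floor under it that no number of copies removes. More frequent audits correspond to a shorter interval and therefore a smaller `p_copy` between repairs.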

LOCKSS Lots Of Copies Keep Stuff Safe

• It ingests content from target websites using a web crawler similar to those used by search engines.

• It preserves content by continually comparing the content it has collected with the same content collected by other LOCKSS Boxes, and repairing any differences.

• It delivers authoritative content to readers by acting as a web proxy, cache or via Metadata resolvers when the publisher’s website is not available.

• It provides management through a web interface that allows librarians to select new content for preservation, monitor the content being preserved and control access to the preserved content.

• It dynamically migrates content to new formats as needed for display.

From LOCKSS webpages http://www.lockss.org.

LOCKSS box: Open source LOCKSS software installed on a dedicated computer or virtual machine.

how LOCKSS works [diagram sequence: my library, library X, and library Y each run a LOCKSS box]

1. data copied to another LOCKSS box
2. data audited
3. data audited: one audit fails, one audit is ok
4. data copied to another LOCKSS box
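The copy, audit, and repair cycle described above can be imitated with a simple majority vote over checksums, sketched below in Python. This is only a toy model: the actual LOCKSS software uses its own peer-to-peer polling protocol, which this sketch does not reproduce. The function name `audit_and_repair` and the sample data are hypothetical.

```python
import hashlib
from collections import Counter


def audit_and_repair(boxes):
    """Compare copies across boxes and overwrite any copy that loses the vote.

    boxes maps a box name to the bytes it holds; returns the repaired box names.
    """
    digests = {name: hashlib.sha256(data).hexdigest() for name, data in boxes.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(boxes) / 2:
        raise ValueError("no majority: cannot tell which copy is good")
    good_copy = next(data for name, data in boxes.items() if digests[name] == winner)
    repaired = []
    for name in boxes:
        if digests[name] != winner:
            boxes[name] = good_copy  # copy the agreed content to the failed box
            repaired.append(name)
    return repaired


boxes = {
    "my library": b"page image bytes",
    "library X": b"page image bytes",
    "library Y": b"page imagE bytes",  # simulated damage
}
repaired = audit_and_repair(boxes)
print(repaired)
```

A vote needs at least three copies to be meaningful, which is one practical reason preservation networks keep several copies rather than two.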

private LOCKSS networks

Alabama Digital Preservation Network (http://www.adpn.org/).

CLOCKSS (Controlled LOCKSS), a non-profit collaboration of North American, European, and Asian cultural heritage institutions whose purpose is to preserve digital content with LOCKSS (http://www.clockss.org).

MetaArchive Cooperative is a digital preservation cooperative created by cultural heritage institutions (http://www.metaarchive.org).

digital preservation references

• Nancy McGovern and Katherine Skinner, editors. Aligning National Approaches to Digital Preservation. Educopia Institute Publications. Atlanta, Georgia. 2012. Proceedings of a conference on digital preservation held at the National Library of Estonia in May 2011. (accessed 15 August 2012 at http://www.educopia.org/sites/default/files/ANADP_Educopia_2012.pdf).

• David Rosenthal. Format obsolescence: Assessing the threat and the defenses. Library High Tech, Special Issue, v. 28, n. 2, 2010, pp.195-210. doi:10.1108/07378831011047613 (accessed 1 August 2012 at http://lockss.org/locksswiki/files/LibraryHighTech2010.pdf).

• David Rosenthal. Keeping bits safe: How hard can it be? Communications of the ACM v. 53, n. 11, 2010, pp. 47-55. doi:10.1145/1839676.1839692 (accessed 1 August 2012 at http://lockss.org/locksswiki/files/ACM2010.pdf).

• Jeff Rothenberg. Ensuring the Longevity of Digital Documents. Originally published in Scientific American January 1995. Expanded version published February 1999. (accessed 1 August 2012 at http://www.clir.org/pubs/archives/ensuring.pdf)

• Joint Information Systems Committee (JISC) Programme on Digital Preservation at http://www.jisc.ac.uk/preservation.

• Library of Congress on Digital Preservation at http://www.digitalpreservation.gov.

• Stanford University’s website for LOCKSS at http://www.lockss.org.

newspaper digitization programs around the world

Europeana Newspapers Project, a collaboration of 17 organizations (http://www.europeana-newspapers.eu/)

Bibliothèque nationale de France (http://gallica.bnf.fr/)

National Library of Australia, Australian Digital Newspapers Program (http://trove.nla.gov.au/newspaper)

Singapore National Library Board (http://newspapers.nl.sg/)

National Library of New Zealand (http://paperspast.natlib.govt.nz/)

National Digital Newspaper Program, Library of Congress (http://chroniclingamerica.loc.gov/)

British Newspaper Archives, British Library (http://www.bl.uk/welcome/newspapers)

Koninklijke Bibliotheek, the Netherlands (http://kranten.kb.nl/)

National Library of Finland (http://digi.kansalliskirjasto.fi/)

National Library of Latvia (https://periodika.lndb.lv/)

• Library of Congress National Digital Newspaper Program http://www.loc.gov/ndnp/

• Australian Newspaper Digitisation Program http://www.nla.gov.au/content/newspaper-digitisation-program

• IFLA Newspapers Section Digitisation projects and best practices http://www.ifla.org/node/6777

• ICON: International Coalition on Newspapers http://icon.crl.edu/digitization.htm

• METS, MODS, ALTO, PREMIS, and other library standards : http://www.loc.gov/standards

• OAIS : http://public.ccsds.org/publications/RefModel.aspx

• NISO standards and guidelines : http://www.niso.org/publications/rp

• Good practice guides : http://www.ukoln.ac.uk

• And many, many more

Wikipedia contributors, "List of online newspaper archives," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Wikipedia:List_of_online_newspaper_archives (accessed March 17, 2013).


Frederick Zarndt Secretary, IFLA Newspapers Section

frederick@frederickzarndt.com

Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.
