1

CS 502: Computing Methods for Digital Libraries

Lecture 28

Current work in preservation

2

Administration

Review class

• Tuesday, 12:20. Room to be announced on web site "Notices".

• Format: questions (by you) and answers (by me).

Laptops

• Return before examination. Bring receipt to examination.

Examination

• Part 1: 5 questions, 1.5-hour time limit

• Part 2: nomad experiment questionnaire, no time limit

3

Education and research

Digital libraries in a state of flux:

• Much of this class has described material that is still experimental

• Cornell people and our colleagues are actively involved in many aspects

This class:

• Recent activities in preservation of materials on the web

• Some of my recent work

4

Some light reading

William Y. Arms, "Preservation of scientific serials: three current examples." Journal of Electronic Publishing, 5(2), December 1999. http://www.press.umich.edu/jep/05-02/arms.html

William Y. Arms, "Economic models for open-access publishing." iMP, March 2000. http://www.cisp.org/imp/march_2000/03_00arms.htm

5

Preservation of serials

September 1999 -- Workshop chaired by Deanna Marcum, Don Waters, Cliff Lynch

Issues in preserving online journals for 100 years

Invited paper by William Arms

"Preservation of Scientific Serials: Three Current Examples"

• ACM Digital Library

• Internet RFC Series

• D-Lib Magazine

Motivated by realization that early preservation work may be tackling the wrong problem

6

Publisher's role in preservation

Life cycle of electronic publication

1. Active management by publisher

2. Long-term preservation by another organization

Overall observation

• The length of #1 may be very short or hundreds of years

• The most vulnerable time is the transition between #1 and #2

Preservation discussions have emphasized #2 (e.g., the 5-level model)

7

ACM Digital Library

Organizational

• ACM is a stable organization that considers the Digital Library one of its principal assets

Rights

• ACM either owns copyright or has full preservation rights

Technical

• Complex: relational database (schema), SGML (DTD), rendering software, private metadata system

• Strong computing department

Replication

• No independent mirrors

8

Internet RFC Series

Organizational

• Complex relationship between Internet Society (ISOC), Internet Engineering Task Force (IETF) and RFC editor. Currently actively managed, but no long-term commitment

• Secretariat & RFC editor -- income from meetings & grants

Rights

• ISOC and IETF have very broad rights

Technical

• Simple: text only (a few PostScript)

Replication

• Several independent mirrors

9

D-Lib Magazine

Organizational

• Published by CNRI, reliant on grants.

Rights

• Authors own rights in articles. CNRI owns rights in other materials.

Technical

• Simple: uses basic web technology.

• Used for experiments in DOIs, XML metadata, etc.

Replication

• Several independent mirrors

10

Approaches to preservation of the web

Partnership with publishers

Publishers and libraries as partners

Selective collection of open access web

Librarianship in a new domain

Bulk collection of open access web

Automatic librarianship

11

Partnerships with publishers

Library of Congress and UMI

• US theses and dissertations

American Physical Society and Cornell University

• Journals in physics

Elsevier Science

• Policy statement on archiving

12

Approaches to preservation of the web

Partnership with publishers

Publishers and libraries as partners

Selective collection of open access web

Librarianship in a new domain

Bulk collection of open access web

Automatic librarianship

Cornell and Library of Congress

13

Selective preservation

Selection of web sites

Example: National Library of Australia

• national importance

• multiple versions (print and online)

• authority and research value

14

Selection of web sites

Pragmatic considerations

• technical complexity

-- not all standards are good

• frequency of making copies

• COST

Librarianship in a new domain

15

Catalogs and indexes

Example: CORC

• simple standard using Dublin Core

• tools for creating records

• COST

Librarianship in a new domain
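
A Dublin Core record of the kind mentioned above can be sketched in a few lines. This is only an illustration, not CORC's actual record format or tooling: the `record` wrapper element and the `make_dc_record` helper are hypothetical, and the field values are taken from the first item on the "Some light reading" slide. The namespace is the standard Dublin Core element set namespace.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def make_dc_record(fields):
    """Build a flat Dublin Core record from (element, value) pairs.

    The <record> wrapper is a hypothetical container chosen for this sketch.
    """
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")
    for name, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


fields = [
    ("title", "Preservation of scientific serials: three current examples"),
    ("creator", "William Y. Arms"),
    ("date", "1999-12"),
    ("identifier", "http://www.press.umich.edu/jep/05-02/arms.html"),
]

record = make_dc_record(fields)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Every element is optional and repeatable in simple Dublin Core, which is what keeps the cost of creating a record low compared with full cataloging.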

16

Bulk collection: automatic librarianship

Volumes of information are too great for human selection, indexing and management

Examples:

• Kulturarw3 -- National Library of Sweden

• Internet Archive -- Brewster Kahle

Automatic methods are used to collect, organize and provide access

17

Automatic librarianship

Collection

Example: Internet Archive

• Collecting open access web since 1996

• Complete sweep of web approximately once a month

• HTML pages only

• 14 terabytes of data (soon all online)

• access for researchers using Unix tools

• 7 people
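
The idea of a sweep that keeps HTML pages only can be sketched as a breadth-first crawl. This is a minimal sketch under stated assumptions, not the Internet Archive's crawler: `FAKE_WEB` is a hypothetical in-memory stand-in for the network, and `sweep` and `LinkExtractor` are names invented for this example. A real crawler would also need politeness delays, robots.txt handling and error recovery.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href values of <a> tags in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical stand-in for the web: URL -> (content type, body).
FAKE_WEB = {
    "http://example.org/": (
        "text/html",
        '<a href="a.html">A</a> <a href="logo.gif">logo</a>',
    ),
    "http://example.org/a.html": (
        "text/html",
        '<a href="/">home</a> <a href="b.html">B</a>',
    ),
    "http://example.org/b.html": ("text/html", "<p>leaf page</p>"),
    "http://example.org/logo.gif": ("image/gif", ""),
}


def sweep(seed, fetch):
    """Breadth-first sweep from seed, archiving HTML pages only."""
    seen = {seed}
    queue = deque([seed])
    archived = {}
    while queue:
        url = queue.popleft()
        result = fetch(url)
        if result is None:
            continue  # unreachable page: skip it
        content_type, body = result
        if content_type != "text/html":
            continue  # HTML pages only
        archived[url] = body
        parser = LinkExtractor()
        parser.feed(body)
        for href in parser.links:
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return archived


archived = sweep("http://example.org/", FAKE_WEB.get)
print(sorted(archived))
```

The `seen` set is what lets the sweep terminate on a web full of cycles; here the link from `a.html` back to the home page is followed at most once.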

18

Automatic librarianship

Indexing

Examples:

• ResearchIndex

• Google

19

Legal issues

Legal position of archives that download open access materials is unclear

• Preservation is in the national interest

• See the discussion in The Digital Dilemma (National Academy of Sciences, 1999)

• Crucial factor is economic impact on copyright owners

• Library of Congress has no special position except via copyright deposit

• U.S. Copyright Office has offered to help clarify the position

20

Current activities

Selection: guidelines and prototypes

• Library of Congress working group

• Political web sites

Tools

• Web site mirroring

• Web site profiler (M.Eng. project)
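
The slides do not say what the web site profiler measures. As one guess, a profiler could summarize a site's technical complexity, one of the pragmatic selection considerations listed earlier, by counting its URLs by file type. The sketch below assumes that reading; `profile_site` and the example URLs are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse


def profile_site(urls):
    """Count URLs by file extension as a crude measure of technical complexity."""
    counts = Counter()
    for url in urls:
        path = urlparse(url).path  # drops the query string
        suffix = PurePosixPath(path).suffix.lower()
        counts[suffix or "(none)"] += 1
    return counts


# Hypothetical list of URLs found on one site.
urls = [
    "http://example.org/",
    "http://example.org/index.html",
    "http://example.org/papers/report.PDF",
    "http://example.org/papers/report.ps",
    "http://example.org/cgi-bin/search?q=preservation",
]

profile = profile_site(urls)
for suffix, count in sorted(profile.items()):
    print(suffix, count)
```

A site that is mostly static HTML is far easier to copy and keep usable than one dominated by scripts and database queries, so even a count this crude could inform a selection decision.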

Copyright

• Ad hoc working group (Deanna Marcum, Bill Arms)

21

CS 502: Computing Methods for Digital Libraries

THE END