Lecture 5 A research perspective on Digital Libraries

Page 1: Lecture  5  A research perspective on Digital Libraries

1 herbert van de sompel

CS 502 Computing Methods for Digital Libraries

Cornell University – Computer Science
Herbert Van de Sompel [email protected]

Lecture 5 A research perspective on Digital Libraries

Page 2: Lecture  5  A research perspective on Digital Libraries

DL Ancestry

[Figure 3: Digital Library System Ancestry. A timeline from 1991 to 1994 showing the lineage of CORE, DLI, the Physics e-Print Server, UCSTRI, CS-TR, NCSTRL, WATERS, LTRS (TRSkit), NTRS, STELAR, and ADS, with a Current Status column that reads “Still In Use” for seven entries. Annotations in the figure: “Still operational, but no longer being developed.”; “Has also branched into many sub-fields of Physics, as well as Mathematics and Chemistry.”; “Other databases spun off (Physics / Geophysics, Space Instrumentation)”.]

Page 3: Lecture  5  A research perspective on Digital Libraries

URLs to some of these DLs

• ADS: http://adswww.harvard.edu/
• NCSTRL: http://www.ncstrl.org
• UCSTRI: http://www.cs.indiana.edu:800/cstr/cover.html
• arXiv: http://arXiv.org
• LTRS: http://techreports.larc.nasa.gov/ltrs/
• NTRS: http://techreports.larc.nasa.gov/cgi-bin/NTRS

Page 4: Lecture  5  A research perspective on Digital Libraries

DL Architectural Review

Assumptions made in this perspective
– things start with TCP/IP connectivity
– distribute full content (reports, software, etc.)
  • not only metadata

Page 5: Lecture  5  A research perspective on Digital Libraries

DL Architecture History approach 1

1. Build special client and server (generally using Motif/X11, Tcl/Tk, etc.), and use TCP/IP as the transport protocol only
• pros: rich functionality
• cons: high development cost, client distribution problem
• observation: many of these projects spent more time building the interfaces, protocols, searching, etc. than populating their DL!

Page 6: Lecture  5  A research perspective on Digital Libraries

DL Architecture History approach 2

2. Use standard protocols built upon TCP/IP: SMTP, FTP, Gopher, WAIS, HTTP, etc.
• con: less functionality (restricted by protocol)
• pros: less development cost, uses commonly available clients
• observation: this approach is now the most common
• The ones listed on slide 2 fit into this category

Page 7: Lecture  5  A research perspective on Digital Libraries

Early TCP/IP DLs

A very old one: IETF: http://www.ietf.org/

• Internet RFC’s

• Very first TCP/IP DL?

Page 8: Lecture  5  A research perspective on Digital Libraries

Early TCP/IP DLs

• Netlib
  – http://www.netlib.org/
  – begun in 1985, distributing mathematical software via e-mail (SMTP)
  – other access methods and protocols added (ftp, X11 client, http)

Page 9: Lecture  5  A research perspective on Digital Libraries

Netlib 1995

Page 10: Lecture  5  A research perspective on Digital Libraries

Netlib 2001

Page 11: Lecture  5  A research perspective on Digital Libraries

Los Alamos arXiv

• Physics pre-print server
  – http://xxx.lanl.gov/ == http://arXiv.org
  – begun in 1991 as an e-mail service to exchange TeX source of pre-prints in high energy physics
  – ftp, http access added shortly
  – Now THE communication channel in Physics
  – Paul Ginsparg

Page 12: Lecture  5  A research perspective on Digital Libraries

Characteristics of early TCP/IP, non-HTTP DLs

• Useful – could get the “thing” that you were looking for
• Constrained by transport protocol
  – SMTP, FTP, etc. interface inherently “clunky”
  – Higher level services such as searching, sophisticated browsing, etc. difficult to implement
• Small scale
  – would the same systems work well if the holdings went from 100’s or 1000’s to millions?

Page 13: Lecture  5  A research perspective on Digital Libraries

Characteristics of early TCP/IP, HTTP DLs

• Initial HTTP implementations / conversions pretty much provided incremental steps in DL improvement
  – a “nice” ftp interface, maybe with better searching and browsing – but the nature of the DLs changed little
• LTRS is an example of an http DL that is really: FTP + Searching (WAIS) + Browsing
  • http://techreports.larc.nasa.gov/ltrs/
• Also check out the user interface of http://arXiv.org

Page 14: Lecture  5  A research perspective on Digital Libraries

Early TCP/IP, HTTP DLs

• But http is a very general transport protocol, and it is possible to build even higher level protocols on top of it

• Combine this with the expressive HTTP client (web browser), and there is a lot of potential

• Dienst
  – (http://www.ncstrl.org/Dienst/htdocs/Info/protocol4.html)
  – builds an actual DL protocol on top of HTTP
    • 1994 -- the first to do so?
• Open Archives Initiative: metadata harvesting protocol on top of HTTP
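
A protocol layered on HTTP, in the sense used on this slide, can be as small as a fixed vocabulary of requests encoded in ordinary URLs. The sketch below only builds such a URL and sends nothing over the network. The base URL is hypothetical; the `verb` and `metadataPrefix` parameter names follow the OAI metadata harvesting protocol as it was later standardized, so treat them as an assumption relative to the version discussed in this lecture.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint, used for illustration only.
BASE_URL = "http://repository.example.org/oai"

def harvest_request(verb, **arguments):
    """Encode a harvesting request as a plain HTTP GET URL."""
    query = {"verb": verb, **arguments}
    return BASE_URL + "?" + urlencode(query)

# Ask for all records in unqualified Dublin Core.
url = harvest_request("ListRecords", metadataPrefix="oai_dc")
print(url)
```

Because the request is just a URL, any HTTP client (including a web browser) can issue it, which is the potential the slide points at.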

Page 15: Lecture  5  A research perspective on Digital Libraries

Sophistication increases, tracks meet

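[Diagram: sophistication (vertical axis) plotted against time (horizontal axis) for two tracks, the research track and the library automation track. Stages labeled in the diagram: e-mail; ftp / gopher; http (LTRS, e-print, Netlib, etc.); http (Dienst).]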

Page 16: Lecture  5  A research perspective on Digital Libraries

A Framework for Distributed Digital Object Services

Kahn/Wilensky Framework [Kahn 1995]
• 1995
• A high level document
• Almost a definition of key concepts, terminologies, … for next generation DLs
• Foundation for a research discipline?
• Not detailed enough to be a real architecture.
• Architecture is independent of the type of data stored in the DL

Page 17: Lecture  5  A research perspective on Digital Libraries

KWF: key terms

• digital object (do)
  – A do is a data structure that contains
    • Digital data; data is typed (cf MIME)
    • Persistent Key Metadata; especially handle
    • Other metadata (for instance Terms and Conditions)
• handle
  – a handle is a unique, persistent name for a do
• repository
  – The place where do’s live
  – Has unique global name
• Repository Access Protocol (RAP)
  – To deposit/access do’s in repositories
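
The key terms above can be read as a small data model. The following is a minimal sketch in Python, not something KWF prescribes: the class names, field names, the handle string `example/report-1`, and the repository name are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class DigitalObject:
    handle: str       # key metadata: the unique, persistent name
    data_type: str    # data is typed, much as MIME types label content
    data: bytes
    other_metadata: dict = field(default_factory=dict)  # e.g. terms and conditions

@dataclass
class Repository:
    name: str         # unique global name of the repository
    objects: dict = field(default_factory=dict)  # handle -> DigitalObject

    def deposit(self, do):
        """Place a digital object in the repository, keyed by its handle."""
        self.objects[do.handle] = do

    def access(self, handle):
        """Look a digital object up by its handle."""
        return self.objects[handle]

report = DigitalObject(
    handle="example/report-1",          # hypothetical handle
    data_type="bit-sequence",
    data=b"report body",
    other_metadata={"terms": "public"},
)
repo = Repository(name="example-repository")
repo.deposit(report)
print(repo.access("example/report-1").data_type)
```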

Page 18: Lecture  5  A research perspective on Digital Libraries

KWF: flow

[Flow diagram, read in order:]
• The Originator makes a digital object, which consists of
  – Data
  – Key-Metadata
    • handle (the handle comes from a handle generator)
• The do can go in a Repository, at which point the do becomes a stored do
• The Repository registers the do’s handle with a Handle Server, at which point the do becomes a registered do
• The Repository keeps
  – a Properties record per do
    • Key metadata: handle
    • Other metadata: Terms and conditions
  – a Transaction record per do
• Accesses to and deposits of the do in repositories go through the Repository Access Protocol
• What the client receives as a result of an access to a do is a dissemination.

Page 19: Lecture  5  A research perspective on Digital Libraries

Digital objects

• do = data + key-metadata
  – data is typed; core types include:
    • bit-sequence / set-of-bit-sequences
    • digital-object / set-of-digital-objects
    • handle / set-of-handles
  – other types can be defined, and registered with a global type registry
    • definition and registration left undefined
    • ~ similar to MIME
  – key-metadata includes handle
  – possibly other metadata (left undefined in KWF)
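
Since KWF leaves the definition and registration of new types undefined, any concrete registry is guesswork. The sketch below shows one possible shape, assuming types are identified by plain strings; the `TypeRegistry` class and the `technical-report` type are hypothetical.

```python
# The six core types named on the slide.
CORE_TYPES = frozenset({
    "bit-sequence", "set-of-bit-sequences",
    "digital-object", "set-of-digital-objects",
    "handle", "set-of-handles",
})

class TypeRegistry:
    """A stand-in for the global type registry (interface is an assumption)."""

    def __init__(self):
        self._types = set(CORE_TYPES)

    def register(self, type_name):
        self._types.add(type_name)

    def is_registered(self, type_name):
        return type_name in self._types

registry = TypeRegistry()
known_before = registry.is_registered("technical-report")
registry.register("technical-report")   # hypothetical user-defined type
known_after = registry.is_registered("technical-report")
print(known_before, known_after)
```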

Page 20: Lecture  5  A research perspective on Digital Libraries

Digital objects

• Composite do’s:
  – a do with data of type digital-object
  – non-composite do’s are elemental do’s
  – composite do’s can, for instance, be used to collect similar works together
    • the composite do then contains a do for each work of Shakespeare...
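
The Shakespeare example can be sketched as follows. This is an illustration under assumptions: the class, the handles, and the decision to treat `set-of-digital-objects` as composite alongside `digital-object` are choices made here, not statements taken from the slide.

```python
from dataclasses import dataclass

@dataclass
class DigitalObject:
    handle: str
    data_type: str
    data: object

# Assumption: data holding one or several digital objects makes a do composite.
COMPOSITE_TYPES = {"digital-object", "set-of-digital-objects"}

def is_composite(do):
    """A do is composite if its data consists of other digital objects."""
    return do.data_type in COMPOSITE_TYPES

# Elemental do's: their data is a plain bit sequence.
hamlet = DigitalObject("example/hamlet", "bit-sequence", b"To be, or not to be")
macbeth = DigitalObject("example/macbeth", "bit-sequence", b"Tomorrow, and tomorrow")

# Composite do: one contained do per work.
works = DigitalObject("example/shakespeare", "set-of-digital-objects", [hamlet, macbeth])

print(is_composite(works), is_composite(hamlet))
```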

Page 21: Lecture  5  A research perspective on Digital Libraries

Changing digital objects

• Mutable do’s can be changed once placed in a repository
  – key-metadata cannot be changed – the do’s handle never changes!
• Immutable do’s cannot be changed once placed in a repository
  – however, they can be deleted
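
One way a repository might enforce these rules is sketched below. All class and method names are hypothetical: the handle is exposed read-only, data can be replaced only on mutable objects, and deletion is allowed in both cases.

```python
class ImmutableObjectError(Exception):
    """Raised when a change to an immutable do is attempted."""

class StoredObject:
    def __init__(self, handle, data, mutable):
        self._handle = handle
        self.data = data
        self.mutable = mutable

    @property
    def handle(self):
        # Read-only: the handle never changes, even for mutable objects.
        return self._handle

class Repository:
    def __init__(self):
        self._objects = {}

    def deposit(self, handle, data, mutable):
        self._objects[handle] = StoredObject(handle, data, mutable)

    def get(self, handle):
        return self._objects[handle]

    def replace_data(self, handle, new_data):
        obj = self._objects[handle]
        if not obj.mutable:
            raise ImmutableObjectError(handle)
        obj.data = new_data

    def delete(self, handle):
        # Deletion is permitted for mutable and immutable objects alike.
        del self._objects[handle]

    def __contains__(self, handle):
        return handle in self._objects

repo = Repository()
repo.deposit("example/draft", "v1", mutable=True)
repo.replace_data("example/draft", "v2")

repo.deposit("example/final", "v1", mutable=False)
try:
    repo.replace_data("example/final", "v2")
    rejected = False
except ImmutableObjectError:
    rejected = True
repo.delete("example/final")
```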

Page 22: Lecture  5  A research perspective on Digital Libraries

Handles

• Guest lecture by Professor Arms 02/19

Page 23: Lecture  5  A research perspective on Digital Libraries

Repositories

• A network accessible storage system in which digital objects may be stored for possible subsequent access or retrieval

• A stored do is a do that resides in a repository
• A registered do is a do that the repository has registered with a handle server
  – storing and registering can be the same or different processes

Page 24: Lecture  5  A research perspective on Digital Libraries

Repositories

• A repository keeps a properties record for each do
  – contains key-metadata and any other metadata the repository chooses to keep
• A do may have a transaction record associated with it in a repository
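
The distinction between stored and registered do’s, and the two per-object records, can be sketched as below. This is a hedged illustration: the class names, the record layouts, and the assumption that the handle server maps a handle to a repository name are all choices made for this example.

```python
class HandleServer:
    """Maps a handle to the name of a repository holding the do (assumed layout)."""

    def __init__(self):
        self._locations = {}

    def register(self, handle, repository_name):
        self._locations[handle] = repository_name

    def resolve(self, handle):
        return self._locations.get(handle)

class Repository:
    def __init__(self, name, handle_server):
        self.name = name
        self.handle_server = handle_server
        self._data = {}
        self.properties = {}     # handle -> properties record
        self.transactions = {}   # handle -> transaction record (list of events)

    def store(self, handle, data, **other_metadata):
        """Storing makes the do a stored do and creates its records."""
        self._data[handle] = data
        self.properties[handle] = {"handle": handle, **other_metadata}
        self.transactions[handle] = ["store"]

    def register(self, handle):
        """Registering is a separate step here; it could also happen inside store()."""
        self.handle_server.register(handle, self.name)
        self.transactions[handle].append("register")

    def is_stored(self, handle):
        return handle in self._data

server = HandleServer()
repo = Repository("example-repository", server)
repo.store("example/report-1", b"report body", terms="public")
resolved_before = server.resolve("example/report-1")
repo.register("example/report-1")
resolved_after = server.resolve("example/report-1")
```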

Page 25: Lecture  5  A research perspective on Digital Libraries

Repository Access Protocol

• “Protocol” may be misleading; it’s really just the concept for a protocol
• RAP is designed to be simple; higher level services should come from other protocols
• KWF defines 3 basic operation classes:
  – ACCESS_DO [metadata; key-metadata, digital object]
    • A dissemination of a do is the result of a request to access a do
  – DEPOSIT_DO [metadata; key-metadata, digital object]
  – ACCESS_REF
    • this is a means to tell the world about other ways (protocols) to access do’s in the repository.
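
Because RAP is a concept rather than a wire protocol, the three operation classes can only be illustrated, not implemented faithfully. The sketch below models them as methods on an in-memory repository; the method signatures, the request-type strings, and the shape of a dissemination are assumptions.

```python
class Repository:
    def __init__(self, name, other_access=()):
        self.name = name
        self._objects = {}
        self._other_access = list(other_access)

    def deposit_do(self, handle, data, metadata=None):
        """DEPOSIT_DO: place a digital object and its metadata in the repository."""
        self._objects[handle] = {
            "key-metadata": {"handle": handle},
            "metadata": dict(metadata or {}),
            "data": data,
        }

    def access_do(self, handle, request="digital object"):
        """ACCESS_DO: return a dissemination of the requested part of a do."""
        stored = self._objects[handle]
        if request == "digital object":
            return dict(stored)
        if request in ("metadata", "key-metadata"):
            return dict(stored[request])
        raise ValueError(f"unknown request type: {request}")

    def access_ref(self):
        """ACCESS_REF: advertise other ways (protocols) to reach the do's."""
        return list(self._other_access)

repo = Repository("example-repository", other_access=["http", "ftp"])
repo.deposit_do("example/report-1", b"report body", {"title": "An example report"})
dissemination = repo.access_do("example/report-1", "metadata")
print(dissemination, repo.access_ref())
```

Keeping the interface this small matches the slide’s point that higher level services such as searching belong in other protocols.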

Page 26: Lecture  5  A research perspective on Digital Libraries

Terms and Conditions

• TC are attached to:
  – each do
  – each dissemination
  – each repository
• TC are a precondition for any operation on the above
• Repositories responsible for enforcing TC

Page 27: Lecture  5  A research perspective on Digital Libraries

Terms and Conditions

[Figure 1 from 95 TR-1593: a repository, a digital object, and a dissemination, each with its own terms and conditions; the digital object and the dissemination each contain data. Relationships in the figure are annotated with cardinalities (1 or N).]

Page 28: Lecture  5  A research perspective on Digital Libraries

Digital Objects: Terms and Conditions

• Set by originator and/or repository

• Can be arbitrarily complex, but generally consist of:

– permissions: read, write, etc.

– authentication - person, group, etc.

– payment

– 3rd party intervention (possibly in support of the above)
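
A repository that enforces terms and conditions as a precondition for every operation might look like the sketch below. It covers only the permissions case; authentication, payment, and third-party intervention are left out. The representation of terms as a mapping from operation name to a set of allowed users is an assumption, as are all names.

```python
class AccessDenied(Exception):
    """Raised when terms and conditions do not allow an operation."""

def permitted(terms, user, operation):
    """terms maps an operation name to the set of users allowed to perform it."""
    return user in terms.get(operation, set())

class Repository:
    def __init__(self, terms):
        self.terms = terms     # repository-level terms and conditions
        self._objects = {}     # handle -> (data, object-level terms)

    def deposit(self, user, handle, data, terms):
        if not permitted(self.terms, user, "write"):
            raise AccessDenied(f"{user} may not write to this repository")
        self._objects[handle] = (data, terms)

    def access(self, user, handle):
        data, object_terms = self._objects[handle]
        # Both the repository's and the object's terms must allow the read.
        if not (permitted(self.terms, user, "read")
                and permitted(object_terms, user, "read")):
            raise AccessDenied(f"{user} may not read {handle}")
        return data

repo = Repository(terms={"read": {"alice", "bob"}, "write": {"alice"}})
repo.deposit("alice", "example/report-1", b"report body", terms={"read": {"alice"}})

alice_result = repo.access("alice", "example/report-1")
try:
    repo.access("bob", "example/report-1")
    bob_denied = False
except AccessDenied:
    bob_denied = True
```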

Page 29: Lecture  5  A research perspective on Digital Libraries

Readings

• Kahn, R. & Wilensky, R. 1995. A Framework for Distributed Digital Object Services. http://WWW.CNRI.Reston.VA.US/home/cstr/arch/k-w.html

• Arms, W.Y. 1995. Key Concepts in the Architecture of the Digital Library. In: D-Lib Magazine. http://www.dlib.org/dlib/July95/07arms.html

• VanHeyningen, M. 1994. The Unified Computer Science Technical Report Index: Lessons in indexing diverse resources. http://www.cs.indiana.edu/ucstri/paper/paper.html