January 2006. Andy Powell, Eduserv Foundation, [email protected], www.eduserv.org.uk/foundation. Persistently identifying Web site content. Future-proofing Institutional Web sites: DCC and Wellcome Library workshop.

Persistently identifying website content


A presentation given at the Digital Curation Centre Joint Workshop on Future-Proofing Institutional Websites, held in London in January 2006. See http://www.dcc.ac.uk/events/fpw-2006/

Page 1: Persistently identifying website content

January 2006

Andy Powell, Eduserv Foundation

[email protected]

www.eduserv.org.uk/foundation

Persistently identifying Web site content

Future-proofing Institutional Web sites

DCC and Wellcome Library workshop

Page 2: Persistently identifying website content

January 2006, Future-proofing institutional Web sites

Contents

• context

• functional requirements

• issues raised

• practical suggestions

• note: not going to look at any particular solutions in any detail – PURLs, DOIs, Handles, ARKs, …

Page 3: Persistently identifying website content

Context – institutional Web sites

• institutional Web sites are:

– heterogeneous – i.e. wide variety of content, managed/unmanaged, formal/informal

– primarily accessed via mainstream Web browsers – but that may change over time

– dynamic – i.e. content is regularly added (and changed and removed!)

– closely tied to the institution – and institutions are liable to change!

Page 4: Persistently identifying website content

Context – man vs. machine

• identifiers serve a human and machine/software purpose

– person: “here’s one I found earlier” – e.g. using del.icio.us or connotea

– machine: “is this the same as that?”

• worth remembering that machines tend to be fairly stupid…

– e.g. if some people use the PURL and some use the corresponding URL, then del.icio.us won’t spot that their entries are about the same thing

• in most cases, being able to resolve the identifier is helpful to both people and machines

• in most cases, the longer an identifier lasts, the better – even after the resolution service breaks!
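
The “is this the same as that?” problem can be sketched in a few lines. This is a deliberately naive model of a bookmarking service, not a description of how del.icio.us or connotea actually work; the user names are hypothetical, and the PURL and its target are the UKOLN pair quoted elsewhere in this presentation.

```python
from collections import Counter

# Two bookmarks for the same resource: one user saved the PURL,
# the other saved the URL that the PURL redirects to.
bookmarks = [
    ("alice", "http://purl.org/net/ukoln"),  # hypothetical user
    ("bob", "http://www.ukoln.ac.uk/"),      # hypothetical user
]

# A naive service counts bookmarks per identifier string.
counts = Counter(uri for _user, uri in bookmarks)

# Each string is seen once, so the service reports two unrelated
# resources rather than one resource bookmarked twice.
distinct_resources = len(counts)
print(distinct_resources)
```

Unless the service is told that the two strings are equivalent, plain string comparison is all it has to go on.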

Page 5: Persistently identifying website content

Context – what is being identified

• the most important question in any discussion about identifiers is “what is being identified?”

• in the case of institutional Web sites…

– the site

– significant parts of the site

– static documents, individual images, etc.

– dynamic services

– …

• some possibility for confusion here

– e.g. what does http://www.bris.ac.uk/ identify?

• but in the case of institutional Web sites, people usually do the ‘right thing’ and what is being identified is obvious from the context…

Page 6: Persistently identifying website content

Context - works vs. manifestations

• one key aspect is whether the identifier is for an abstract ‘work’ or a particular ‘manifestation’ of that work

• there are some scenarios in which it is necessary to identify the ‘work’…

• in other cases, it is necessary to identify a particular ‘manifestation’ of the work

• beginning to see this problem in the development of eprint archives and institutional repositories

“Crystal Studio is a recommended resource for the teaching of crystallography at undergraduate level.”

“To perform this exercise you will need a copy of Crystal Studio version 5.0 (versions 4.0 Lite and 4.0 Professional do not support the required options).”
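
One way to picture the distinction is to give the work and each manifestation its own identifier. The sketch below is only an illustration: the `Work` class and every URI in it are hypothetical placeholders, and only the product name and version labels come from the two quotations above.

```python
from dataclasses import dataclass, field


@dataclass
class Work:
    """An abstract work with its own identifier, independent of any version."""
    uri: str
    # Maps a version label to the URI of that particular manifestation.
    manifestations: dict[str, str] = field(default_factory=dict)


# All URIs below are hypothetical placeholders.
crystal_studio = Work(
    uri="http://www.example.org/software/crystal-studio",
    manifestations={
        "4.0 Lite": "http://www.example.org/software/crystal-studio/4.0-lite",
        "4.0 Professional": "http://www.example.org/software/crystal-studio/4.0-professional",
        "5.0": "http://www.example.org/software/crystal-studio/5.0",
    },
)

# A general recommendation cites the work...
recommended = crystal_studio.uri
# ...while the exercise must cite the one version that supports it.
required = crystal_studio.manifestations["5.0"]
```

The first quotation needs `recommended`; the second needs `required`, and would be wrong if it pointed at the work as a whole.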

Page 7: Persistently identifying website content

Functional requirements…

• the JISC IE technical standards document says…

http://www.ukoln.ac.uk/distributed-systems/jisc-ie/arch/standards/

Every significant item that is made available through a JISC IE network service should be assigned a URI that is reasonably persistent. This means that item URIs should not be expected to break for a period of 10-15 years after they have first been used. For this reason, JISC IE service components should not hardcode file format, server technology, service organisational structure or other information that is likely to change over a 10-15 year period into item URIs. If items become unavailable during that period, then the URI should resolve to a Web page that explains why the item is no longer available and what actions the end-user can take to obtain a copy of the item or similar resources. Furthermore, item URIs should not contain end-user-specific information, i.e. all item URIs should work for all end-users (albeit allowing for appropriate authentication challenges to be inserted into the process by which the URI is resolved).
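
The requirement that a URI for a withdrawn item should still resolve to an explanatory page might look like the sketch below. The JISC IE text does not name a status code; using 410 Gone for withdrawn items is an assumption made here, and the paths, reasons and the `respond` function are all hypothetical.

```python
from http import HTTPStatus

# Hypothetical record of withdrawn items: path -> (reason, advice).
WITHDRAWN = {
    "/reports/2004/annual": (
        "This report was withdrawn by the institution.",
        "Contact the library to request an archived copy.",
    ),
}


def respond(path: str, live_paths: set[str]) -> tuple[int, str]:
    """Return (status code, body) for a requested item path."""
    if path in live_paths:
        return HTTPStatus.OK.value, "item content"
    if path in WITHDRAWN:
        # The URI still resolves: the body explains why the item has gone
        # and what the end-user can do to obtain a copy.
        reason, advice = WITHDRAWN[path]
        return HTTPStatus.GONE.value, f"{reason} {advice}"
    return HTTPStatus.NOT_FOUND.value, "No item has been assigned this URI."


status, body = respond("/reports/2004/annual", live_paths={"/about"})
print(status, body)
```

The point of the sketch is the middle branch: a withdrawn item is treated differently from a URI that was never assigned.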

Page 8: Persistently identifying website content

What should be identified?

• “every significant item”

• what does that mean?

• every resource that people are likely to want to cite persistently?

• there might be stuff on institutional Web sites that we don’t need to cite persistently

– but often difficult to pre-judge what is significant and what isn’t

– and judgements about significance and required level of persistence may come from outside the institution

Page 9: Persistently identifying website content

What does ‘reasonably persistent’ mean?

• notion of ‘persistence’ is application dependent

• perhaps helpful to think about 15 – 20 year timeframe?

– longer than the Web has been around to date

– solutions for 20 year period may well last longer

– ‘forever’ is too long

• what will have changed in 20 years’ time?

– technology - HTML replaced? HTTP replaced? DNS replaced? URI system replaced?

– organisations – mergers, closures, new institutions, new government departments, etc.

– people – deaths, retirements, etc.

– countries!

Page 10: Persistently identifying website content

What does ‘break’ mean?

• what does it mean for an identifier to break?

• need to differentiate between the breakage of services on the identifier and breakage of the identifier itself

• most obvious services on identifiers are ‘resolution services’

– “give me a representation of the identified thing”

– known as ‘dereferencing’ in W3C documentation

• resolution services can break (by design or by accident) but the identifier may live on and remain useful

• the identifier itself only breaks when all parties (including software systems) have forgotten what it identified, or when parties no longer agree about what it identifies (e.g. if it gets re-assigned)

Page 11: Persistently identifying website content

Usability issues

• “the only good long-term identifier is a good short-term identifier”

• unless identifiers work well now, they won’t turn into persistent identifiers, because they won’t be used at all

• what does “work well” mean (particularly in the context of institutional Web sites)?

– conformant with current Internet standards

– usable in Web browsers (without additional plug-ins - i.e. usable by everyone)

– meaningful to people

– resolvable

– simple to assign and maintain

– low cost (in terms of money and time)

Page 12: Persistently identifying website content

Interim conclusions…

• identifiers for content on institutional Web sites should be URIs

– why? because the URI is the global and unambiguous standard for identifiers on the Internet

• ‘http’ URIs are better than any other form of URI

– why? because they work in current Internet tools, particularly Web browsers

– built-in resolution mechanism

– easy to assign and low-cost (typically!)

Page 13: Persistently identifying website content

‘http’ URI problems?

• but ‘http’ URIs tend to break, don’t they?

– note: usually it is the resolution service that breaks (i.e. they stop working as locators) - this doesn’t necessarily imply that they stop functioning as identifiers, though the two may be closely related

• reasons for fragility of ‘http’ URI resolution examined later

• but ‘poor design’ and lack of commitment often to blame

• not necessarily the case that one can apply generic Internet-wide findings about ‘http’ URI breakage to ‘institutional’ Web sites

• attempts at more persistent forms of identifier often based on moving away from direct ties to HTTP and/or introducing a level of indirection

Page 14: Persistently identifying website content

How indirection works (or not?)

• populate resolution service tables with identifier -> locator mappings (and possibly other metadata)

– DOI: 10.1000/182 -> http://www.doi.org/hb.html

– Handle: 4263537/4002 -> http://www.handle.net/documentation.html

– ARK: http://ark.nlm.nih.gov/ark:/12025/pm10611131 -> http://brain.oxfordjournals.org/cgi/content/full/123/1/171

– PURL: http://purl.org/net/ukoln -> http://www.ukoln.ac.uk/

• typically used as the basis for HTTP redirects, e.g.

– http://dx.doi.org/10.1000/182 -> http://www.doi.org/hb.html

– http://hdl.handle.net/4263537/4002 -> http://www.handle.net/documentation.html

– etc.

• helps to ensure persistence… but

– HTTP redirects not handled very well by browsers - end-user is typically left using the non-persistent URI

– need commitment to maintain resolver services and tables

– introduces a second (at least) identifier for each resource
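
The indirection described on this slide can be sketched as a lookup table plus a redirect. This is a minimal illustration, not how any particular PURL, DOI or Handle resolver is implemented: the function names are hypothetical, the choice of a 302 status is an assumption, and the relocated address in the usage lines is a made-up placeholder. The two table entries are the PURL and DOI examples given above.

```python
# Resolver table: persistent identifier -> current locator.
RESOLVER_TABLE = {
    "http://purl.org/net/ukoln": "http://www.ukoln.ac.uk/",
    "http://dx.doi.org/10.1000/182": "http://www.doi.org/hb.html",
}


def resolve(identifier: str) -> tuple[int, dict[str, str]]:
    """Return an HTTP status and headers for a request to the identifier."""
    locator = RESOLVER_TABLE.get(identifier)
    if locator is None:
        return 404, {}
    # Assumed: a temporary redirect, so the identifier stays the URI to cite.
    return 302, {"Location": locator}


def relocate(identifier: str, new_locator: str) -> None:
    """When the resource moves, only the table entry changes."""
    RESOLVER_TABLE[identifier] = new_locator


before = resolve("http://purl.org/net/ukoln")
relocate("http://purl.org/net/ukoln", "http://www.example.org/ukoln/")  # placeholder
after = resolve("http://purl.org/net/ukoln")
print(before, after)
```

The identifier is unchanged across the move, but the sketch also shows both costs listed above: someone has to keep the table up to date, and the end-user who follows the redirect ends up holding the locator rather than the identifier.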

Page 15: Persistently identifying website content

What about uniqueness?

• the same identifier should not be assigned to more than one resource

• a resource may have more than one identifier assigned to it… but this should be avoided as far as possible

– e.g. the DOI “10.1000/182” can be encoded as a URI in several ways:

– http://dx.doi.org/10.1000/182, doi:10.1000/182, urn:doi:10.1000/182 and info:doi/10.1000/182

– therefore, DOI-aware applications need to have knowledge of these encodings hard-coded into them (partly because the DOI itself is just a string, but also because nothing in the URI specification indicates that the URI encodings are equivalent)

– though within a domain this may become the norm (e.g. Google Scholar, Crossref, Connotea, etc.)
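
What “hard-coded knowledge of these encodings” means in practice is roughly the following. The prefix list contains only the four encodings named on this slide; the function name `bare_doi` is hypothetical, and a real DOI-aware application would need to handle more cases than this sketch does.

```python
from typing import Optional

# The four ways the slide shows the same DOI being written as a URI.
DOI_PREFIXES = (
    "http://dx.doi.org/",
    "urn:doi:",
    "info:doi/",
    "doi:",
)


def bare_doi(uri: str) -> Optional[str]:
    """Strip a known DOI encoding and return the bare DOI, else None."""
    for prefix in DOI_PREFIXES:
        if uri.startswith(prefix):
            return uri[len(prefix):]
    return None


encodings = [
    "http://dx.doi.org/10.1000/182",
    "doi:10.1000/182",
    "urn:doi:10.1000/182",
    "info:doi/10.1000/182",
]
unique = {bare_doi(uri) for uri in encodings}
print(unique)
```

Nothing in the URIs themselves says they are equivalent; the equivalence lives entirely in the prefix list that the application carries around.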

Page 16: Persistently identifying website content

ARK system

• ARKs are worthy of note since they are ‘http’ URIs

– and therefore meet many of the usability requirements outlined earlier

• ARKs clearly flag an institutional commitment to persistence

– the identifier owner (often the resource owner) commits to maintaining ARK services and associated metadata

– no reliance on third-party resolver

• but they suffer from the HTTP redirect problem

• and ultimately may lead to multiple URIs being assigned to a single resource

Page 17: Persistently identifying website content

Anatomy of ‘http’ URIs

http://www.somewhere.ac.uk/physics/index.cfm?name=about

http://www.somewhere.ac.uk/chemistry/report.rtf

‘http’ URI scheme – URI persistence not reliant on HTTP protocol, but is reliant on continued registration and management of the scheme (and of the URI spec. itself!)

DNS domain name – persistence reliant on continued ownership and management of the DNS domain name (and the DNS!)

Component hierarchy, often organisationally based – persistence reliant on continued management of component structure, i.e. not re-using old components

Server technology – change of technology may enforce change of URI, leading to multiple URIs for same resource (with no simple mechanism for determining equivalence)

File format – inappropriate if identifier is for the ‘work’ rather than the ‘manifestation’ - because changing the format will result in a new URI
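
The same breakdown can be reproduced with Python's standard library, as a rough sketch. The `anatomy` function and its dictionary keys are hypothetical names chosen to mirror the labels above; note that the code cannot tell whether a suffix such as `.cfm` or `.rtf` reflects server technology or file format, so it reports the suffix and leaves that judgement to the reader.

```python
from pathlib import PurePosixPath
from urllib.parse import parse_qs, urlsplit


def anatomy(uri: str) -> dict:
    """Split an 'http' URI into the components labelled on the slide."""
    parts = urlsplit(uri)
    path = PurePosixPath(parts.path)
    return {
        "scheme": parts.scheme,
        "domain": parts.netloc,
        # Directory components, often organisationally based.
        "hierarchy": path.parent.parts[1:],
        # Suffix of the final component: server technology or file format.
        "suffix": path.suffix,
        "query": parse_qs(parts.query),
    }


physics = anatomy("http://www.somewhere.ac.uk/physics/index.cfm?name=about")
chemistry = anatomy("http://www.somewhere.ac.uk/chemistry/report.rtf")
print(physics)
print(chemistry)
```

Each key in the result is something that has to stay stable, or stay managed, for the URI as a whole to persist.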

Page 18: Persistently identifying website content

Improving persistence of ‘http’ URIs

• choose long-lived DNS domain names – e.g. try to avoid details of internal organisational structure

• partition URI components by ‘function’ rather than by organisational structure - because structure is likely to change

• avoid exposing Web server technology in URIs (Cold Fusion, PHP, etc.) - to allow changes to technology without URI proliferation and resolver breakage

• avoid embedding details of document format into URIs, unless particular manifestation is being identified

• avoid embedding end-user or session information into URIs – so that they can be shared between people
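
Some of these guidelines can be checked mechanically. The sketch below is an illustration under stated assumptions: the suffix and parameter lists are small made-up samples rather than a complete catalogue, `persistence_warnings` is a hypothetical name, and the domain-name and component-structure guidelines are left out because they need human judgement.

```python
from pathlib import PurePosixPath
from urllib.parse import parse_qs, urlsplit

# Illustrative lists only; a real site would maintain its own.
TECHNOLOGY_SUFFIXES = {".cfm", ".php", ".asp", ".jsp"}
FORMAT_SUFFIXES = {".rtf", ".doc", ".pdf", ".html"}
SESSION_PARAMETERS = {"sessionid", "sid", "user"}


def persistence_warnings(uri: str) -> list[str]:
    """List the guidelines that a candidate URI appears to break."""
    parts = urlsplit(uri)
    suffix = PurePosixPath(parts.path).suffix.lower()
    parameters = {name.lower() for name in parse_qs(parts.query)}
    warnings = []
    if suffix in TECHNOLOGY_SUFFIXES:
        warnings.append("exposes server technology")
    if suffix in FORMAT_SUFFIXES:
        warnings.append("embeds document format")
    if SESSION_PARAMETERS & parameters:
        warnings.append("embeds end-user or session information")
    return warnings


for candidate in (
    "http://www.somewhere.ac.uk/physics/index.cfm?name=about",
    "http://www.somewhere.ac.uk/chemistry/report.rtf",
    "http://www.somewhere.ac.uk/research/crystallography",  # hypothetical
):
    print(candidate, persistence_warnings(candidate))
```

A format warning is only a prompt: as the fourth bullet says, a format suffix is appropriate when a particular manifestation is what is being identified.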

Page 19: Persistently identifying website content

Conclusions and recommendations

• persistent identifiers require persistent commitment from the institution (and third-parties)

• need to determine what ‘persistent’ means in practice (on the basis that ‘forever’ is unrealistic)

• ‘http’ URIs can be made more persistent if they are constructed and managed sensibly

• use of DOIs/Handles/ARKs/PURLs may be appropriate (particularly where domain practice is clear)

– but need to be clear about cost/benefits and institutional and third-party commitment to maintaining resolver tables and associated services

– where these are used, always and only use the ‘http’ form of URI (e.g. http://dx.doi.org/10.1000/182)

Page 20: Persistently identifying website content

Questions…