HIBERLINK: Reference Rot and Linked Data: Threat and Remedy


PRELIDA

18/19th October 2014

Funded by the Andrew W. Mellon Foundation

Peter Burnhill, EDINA, University of Edinburgh

for the Hiberlink Team at University of Edinburgh & LANL Research Library

The Project Team 2013 – 2015, funded by the

Andrew W. Mellon Foundation

• Los Alamos National Laboratory:

Research Library: Martin Klein, [Rob Sanderson], Harihar Shankar, Herbert Van de Sompel

• University of Edinburgh:

Language Technology Group: Beatrice Alex, Claire Grover,

Colin Matheson, Richard Tobin, [Ke “Adam” Zhou]

EDINA * : Neil Mayo, Muriel Mewissen (Project Manager),

Tim Stickland, Richard Wincewicz, Peter Burnhill

Centre for Service Delivery & Digital Expertise



http://www.jisc.ac.uk/



1. Social Science Research Council [now ESRC, UK]

– ‘Scientific Officer’

2. Scottish Education Data Archive, 1979 – 1984/1987

– Survey statistician: school leavers, YTS, 16-19 cohort surveys; demand for HE

3. Edinburgh University Data Library, 1984 to present

– President of IASSIST, 1997 – 2001: social science data professionals

4. ESRC Regional Research Laboratory for Scotland, 1986 – 1990

– Co-director, early days of Geographical Information Systems (GIS)

– member of Data Task Force, UK Inter-Agency Global Env. Change

5. Graduate School, Faculty of Social Science, UofEd 1987 – 1997

– Senior Lecturer (p/t), teaching quantitative/survey methods

– Director of RAPID: ESRC Research Activity & Publications Information Database

6. EDINA national data centre, 1995/6 to present

– Director: set-up and continuous development; Jisc-funded UK national services

7. UK Digital Curation Centre (DCC), 2003/04 – 2004/05

– Director for set-up & definition of ‘data curation + digital preservation’

8. CLOCKSS Founder & Board Member / LOCKSS deployment

Data Manufacturing

Data Brokering

funding Data & use of Data

Spatial Data & MetaData

licence to use

Ensuring researchers, students and their teachers have

ease and continuing access to online resources used for scholarship

P.Burnhill, Edinburgh 2009

access to content & services

Buckland: thinking about Digital Libraries

mix of the document tradition (signifying objects & their use)

& the computation tradition (applying algorithmic, logical,

mathematical, and mechanical techniques to information management)

“Both traditions are needed. Information Science is rooted in part in humanities and qualitative social sciences. The landscape of Information Science is complex. An ecumenical view is needed.”

– M.Buckland, Journal of American Society for Information Science, 50, 1999

2 (non-convergent) mentalities, Document-ness & Computation

+ a third dimension, the domain of application:

• Academic discipline – if we do this for ourselves

• Business area – if we do this for use beyond …

Related Activity

by Partners

• Los Alamos National Laboratory Research Library:

• Memento

• ResourceSync: http://www.niso.org/workrooms/resourcesync/

• University of Edinburgh / Informatics / Language Technology Group:

• Text mining / Edinburgh Parser

• University of Edinburgh/ Jisc / EDINA :

• CLOCKSS / LOCKSS

• Keepers Registry: https://www.era.lib.ed.ac.uk/handle/1842/6682

Picture credit: http://somanybooksblog.com/2009/03/27/library-tour/

But online articles in the Scholarly Record are not in

the custody of Libraries, nor on their digital shelves.

Top level Problem: We would like to assume that our libraries are ensuring that online e-journal content is being kept safe

Evidence from <thekeepers.org> is worrying!

The Keepers Registry aggregates what is being kept by the (10) leading archiving agencies (CLOCKSS, Portico, national libraries, etc.) and sets this against all serials issued with an ISSN

① ‘Ingest Ratio’ = titles being ingested by one or more Keeper

/ ‘online serials’ in ISSN Register

= 23,268 / 136,965 [in March 2014] => 17%

* We do not know about 83% of e-serials having ISSN *

‘KeepSafe Ratio’ = ingest by 3+ Keepers = 9,652 / 136,965 => 7%

② Title Lists of 3 US research libraries (Columbia, Cornell & Duke),

checked in 2011/12: ‘Ingest Ratio’ = 22% to 28%; c.75% unknown fate

③ User-centric Evidence, UK usage in 2012, UK OpenURL Router logs

=> over two thirds (68%; 36,326 titles) held by none!
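The two ratios in ① can be reproduced from the counts quoted on this slide. The following is a minimal sketch of that arithmetic only; the constant and function names are assumptions introduced for illustration and are not part of the Keepers Registry.

```python
# Counts quoted above from the Keepers Registry / ISSN Register (March 2014).
ONLINE_SERIALS_IN_ISSN_REGISTER = 136_965
INGESTED_BY_ONE_OR_MORE_KEEPERS = 23_268
INGESTED_BY_THREE_OR_MORE_KEEPERS = 9_652


def ratio_percent(numerator: int, denominator: int) -> float:
    """Return numerator / denominator expressed as a percentage."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return 100.0 * numerator / denominator


ingest_ratio = ratio_percent(
    INGESTED_BY_ONE_OR_MORE_KEEPERS, ONLINE_SERIALS_IN_ISSN_REGISTER
)
keepsafe_ratio = ratio_percent(
    INGESTED_BY_THREE_OR_MORE_KEEPERS, ONLINE_SERIALS_IN_ISSN_REGISTER
)
# Titles for which no Keeper reports ingest: their fate is unknown.
unknown_fate = 100.0 - ingest_ratio

print(f"Ingest Ratio:   {ingest_ratio:.0f}%")
print(f"KeepSafe Ratio: {keepsafe_ratio:.0f}%")
print(f"Unknown fate:   {unknown_fate:.0f}%")
```

Rounded to whole percentages, this gives the 17%, 7% and 83% figures quoted above.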



Memento

The Memento "Time Travel for the Web" protocol http://mementoweb.org/

• an interoperable approach to access web archives (IETF RFC 7089)

• adopted by all major public archives worldwide, including the Internet Archive.

• Memento for Chrome http://bit.ly/memento-for-chrome

• This protocol underpins the work being done in Hiberlink
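In RFC 7089 a client asks a TimeGate for the version of a resource closest to a chosen date by sending an `Accept-Datetime` request header, and an archived copy (a Memento) identifies its capture time in a `Memento-Datetime` response header. A minimal sketch of building such a request is given below. It sends nothing over the network, and the TimeGate address in `TIMEGATE_PREFIX` and the function names are assumptions made for illustration; substitute the TimeGate of whichever archive or aggregator is actually used.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Mapping, Optional
from urllib.request import Request

# Assumption: a TimeGate that takes the original URI appended to this prefix.
TIMEGATE_PREFIX = "http://timetravel.mementoweb.org/timegate/"


def accept_datetime(when: datetime) -> str:
    """Format a datetime as the GMT date string carried by Accept-Datetime."""
    return format_datetime(when.astimezone(timezone.utc), usegmt=True)


def timegate_request(original_uri: str, when: datetime) -> Request:
    """Build (but do not send) a datetime-negotiation request for a URI."""
    return Request(
        TIMEGATE_PREFIX + original_uri,
        headers={"Accept-Datetime": accept_datetime(when)},
        method="HEAD",
    )


def memento_datetime(response_headers: Mapping[str, str]) -> Optional[datetime]:
    """Read the capture time of a Memento from its response headers, if any."""
    value = response_headers.get("Memento-Datetime")
    return parsedate_to_datetime(value) if value else None


wanted = datetime(2009, 6, 11, 15, 33, 37, tzinfo=timezone.utc)
request = timegate_request("http://dl00.org", wanted)
print(request.full_url)
print(accept_datetime(wanted))
```

Passing the request to `urllib.request.urlopen` would perform the negotiation against a live TimeGate; that step is left out so the sketch runs offline.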


Now, about Reference Rot & Linked Data …

1. Some definitions

• What is Reference Rot?

• What may be special about Linked Data?

2. Evoking metaphor

• The moment / snapshot / memento

• Flash-freezing to avoid or to stop the rot (of fruit on vine)

3. Evidence of Threat of Reference Rot

4. Devising Remedy for Reference Rot

• Proposals for intervention: plug-ins & infrastructural solutions

5. Next Steps: how to take this work forward?

Reference Rot = Link Rot + Content Drift

“when links to web resources

no longer point to what they once did”

Investigating Reference Rot in Web-Based Scholarly Communication

Link Rot

‘Link Rot’

+ Content Drift: what is at the end of the URI has changed, or gone!

[Screenshots of http://dl00.org as it appeared in 2000, 2004, 2005 and 2008]

(a) Dynamic content, as values on a webpage change over time

(b) Static content, but very different (often unrelated) web pages

What of Linked Data?

One or more sets of 3 linked URIs: conversation or statements for the long term? As time passes, so the content at the end of each of those URIs will suffer:

Reference Rot = Link Rot + Content Drift

“when links to web resources

no longer point to what they once did”
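To make the exposure concrete: every statement is three URIs, so a small set of statements already depends on several distinct web resources, each of which can rot independently. The sketch below is illustrative only. The `Triple` type, the function name and the `example.org` URIs are assumptions introduced here, and no RDF library is used.

```python
from typing import Iterable, List, NamedTuple


class Triple(NamedTuple):
    """One Linked Data statement: three linked URIs."""
    subject: str
    predicate: str
    obj: str


def uris_exposed_to_rot(triples: Iterable[Triple]) -> List[str]:
    """Return the distinct http(s) URIs used, in order of first appearance."""
    seen = {}
    for triple in triples:
        for term in triple:
            if term.startswith(("http://", "https://")):
                seen.setdefault(term, None)
    return list(seen)


# Hypothetical statements: an article referencing two other web resources.
statements = [
    Triple(
        "http://example.org/article/1",
        "http://purl.org/dc/terms/references",
        "http://example.org/dataset/7",
    ),
    Triple(
        "http://example.org/article/1",
        "http://purl.org/dc/terms/references",
        "http://example.org/site-record/42",
    ),
]

for uri in uris_exposed_to_rot(statements):
    print(uri)
```

Two statements here rest on four distinct URIs, including the vocabulary term itself; each is a candidate for pro-active archiving.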

“Adding eScience Assets to the Data Web”, Herbert Van de Sompel, Carl Lagoze, Michael L. Nelson, Simeon Warner, Robert Sanderson, Pete Johnston. Proceedings of Linked Data on the Web (LDOW2009) Workshop, [v1] Thu, 11 Jun 2009 15:33:37 GMT http://arxiv.org/abs/0906.2135v1

Example: ‘mark up’ archaeological site record (metadata)

RDF graph: Article & Supplementary Data http://www.emeraldinsight.com/fig/0350570303002.png

1. Build and publish as metadata in XML format to be found on the web

2. Publishing text and data/multimedia content in XML will delight researchers

• Researchers want to access ‘article as data’, via computational algorithm

What we are doing in Hiberlink

1. Creating evidence on extent of ‘Reference Rot’

– Main focus has been on references (& URIs) made in Journal Articles

• Inc. reference rot in Supreme Court judgments with Harvard Law Library & Perma.cc

– ETD2014 was opportunity to look at Reference Rot & the e-Thesis

– PRELIDA is opportunity to look at impact on Linked Data

2. Understanding the preparation/publication/ingest workflow(s)

– Identifying opportunity for productive intervention

3. Prototypes for pro-active archiving to enable remedy

– Embedding such ‘solutions’ in existing tools & infrastructure

4. Raising awareness & seeking collaborative actions

…. through events like this

Empirical evidence on the Threat of

Reference Rot

Large-scale analyses: Journal Articles & E-Theses

Methodology: to discover the answers to 2 questions

i. Do those links (URIs) still work? Is the URI on the ‘Live Web’?

• Allowing up to a maximum of 50 redirects, recording the HTTP transaction chain and regarding a 2XX status code as ‘live’


ii. Is there a ‘Memento’ of that reference in the ‘Archived Web’?

Memento: a prior version, what the Original Resource was like at some time in the past.
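The liveness test in question (i) can be sketched as below. This is an illustration under stated assumptions, not the Hiberlink team's code: `check_live`, `MAX_REDIRECTS` and the injected `fetch` callable (which returns a status code and an optional `Location` value for one URI) are names introduced here so that the logic can be run without network access.

```python
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin

# fetch(uri) -> (HTTP status code, Location header value or None)
Fetch = Callable[[str], Tuple[int, Optional[str]]]

MAX_REDIRECTS = 50  # limit quoted in the methodology above


def check_live(
    uri: str, fetch: Fetch, max_redirects: int = MAX_REDIRECTS
) -> Tuple[bool, List[Tuple[str, int]]]:
    """Follow redirects from uri and report (is_live, transaction_chain).

    The chain records every (URI, status) pair visited. A URI counts as
    'live' only if the chain ends in a 2XX status within the redirect limit.
    """
    chain: List[Tuple[str, int]] = []
    current = uri
    while True:
        status, location = fetch(current)
        chain.append((current, status))
        if 300 <= status < 400 and location:
            # len(chain) - 1 redirects have been followed so far.
            if len(chain) > max_redirects:
                return False, chain
            current = urljoin(current, location)  # resolve relative Location
            continue
        return 200 <= status < 300, chain


# Canned responses standing in for real HTTP transactions.
RESPONSES = {
    "http://example.org/old": (301, "/new"),
    "http://example.org/new": (200, None),
    "http://example.org/gone": (404, None),
    "http://example.org/loop": (302, "http://example.org/loop"),
}


def fake_fetch(uri: str) -> Tuple[int, Optional[str]]:
    return RESPONSES[uri]


for start in ("http://example.org/old", "http://example.org/gone"):
    live, chain = check_live(start, fake_fetch)
    print(start, "live" if live else "not live", chain)
```

In a real run `fetch` would issue an HTTP request with automatic redirect handling switched off, so that each hop can be recorded in the chain.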

A Measure of Reference Rot: Are those references available? [in 6,400 e-Theses defended in 2003-2010 at 5 US universities]

         Live on Web   Not Found on ‘Live Web’   All
Count    29,122        16,860                    45,982
%        63.3          36.7                      100

Less than two-thirds of those links lead to live content

1st Order Indicator of ‘Reference Rot’: more than one third of references to the Web are subject to ‘rot’

After up to 50 redirects

References in Citations Rot over Time: URIs cease to exist on the live Web

[excluding 0s&1s: a few theses are unaffected; a few are ruined]

We can’t stop that process of rot: Web content changes over time, and Reference Rot is an inevitable function of time

Number of months elapsed from Date Thesis Defended until date archives checked (June 2014)

Searching for ‘Datetime’ Mementos of content in ‘Archived Web’ [in 6,400 e-Theses defended in 2003-2010 at 5 US universities]

%                      Live on Web   Not found on ‘Live Web’   All
Found to be Archived                                           47.6
Not Found                                                      52.4
All                                                            100

There seems a 50:50 chance that referenced content is in the ‘Archived Web’.

Some content is being ‘co-incidentally harvested’ by routine web archiving.

=> half of those references are at ‘risk of loss’

‘Incidental Archiving’ is constant over time (This is an ‘upper bound estimate’, independent of age of e-thesis)

We can improve upon this ‘50:50 chance’

by pro-actively archiving what we cite

We already have ‘Lost Content’ for References to Web [in 6,400 e-Theses defended in 2003-2010 at 5 US universities]

%                      Live on Web   Not found on ‘Live Web’   All
Found to be Archived   29.3          18.3                      47.6
Not Found              34.0          18.4                      52.4
All                    63.3          36.7                      100

18.4%: ‘not live & not found in archive’, judged to be lost forever

34%: ‘live’ & ‘not in archive’, at risk of loss

NB: The 34% ‘at risk’ could be saved by pro-active archiving
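The marginal totals and the two headline figures follow directly from the four cells of the table above. A minimal sketch of that cross-tabulation is given below; the dictionary layout and names are assumptions made for illustration, while the percentages are those quoted on the slide.

```python
# Percentages of all references, keyed by (archive status, live-web status).
CELLS = {
    ("archived", "live"): 29.3,
    ("archived", "not_live"): 18.3,
    ("not_archived", "live"): 34.0,
    ("not_archived", "not_live"): 18.4,
}


def margin(axis: int, value: str) -> float:
    """Sum the cells whose key matches value on the given axis (0 or 1)."""
    total = sum(pct for key, pct in CELLS.items() if key[axis] == value)
    return round(total, 1)


found_in_archive = margin(0, "archived")
not_in_archive = margin(0, "not_archived")
live_on_web = margin(1, "live")
not_on_live_web = margin(1, "not_live")

# Neither live nor archived: judged lost.
lost = CELLS[("not_archived", "not_live")]
# Still live but not archived: could be saved by pro-active archiving.
at_risk = CELLS[("not_archived", "live")]

print(f"Found in archive: {found_in_archive}%  Not found: {not_in_archive}%")
print(f"Live on Web: {live_on_web}%  Not on 'Live Web': {not_on_live_web}%")
print(f"Lost: {lost}%  At risk: {at_risk}%")
```

The row and column sums reproduce the ‘All’ margins of the table, which is a quick consistency check on the quoted cells.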

Hiberlink Next Phase: in-depth study of Content Drift

But demonstrated that problem exists & is severe

• The Web changes over time: significant reference rot occurs

• Routine Web Archiving delivers no better than a 50:50 chance of having co-incidentally archived what you referenced

- and probably much less chance when we check extent of content drift

- Not (yet) studied impact on Linked Data but expect similar

“Researchers need to know when information on a viewed page has changed.”

“Authors of long-shelf-life material want to be sure that their links will still work far into the future.”

Jonathan Zittrain, Larry Lessig and Kendra Albert report that

• Harvard Law Review

75% of links are dead

• top 1% Impact Factor Journals

10% of links dead just 15 months after publication

• US Supreme Court decisions

29% of links dead

49% of links do not point to the original target

http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2329161

Devising Remedy for Reference Rot

for Linked Data?

Seek pro-active ‘transactional archiving’ solutions

– focus on what is regarded by authors as important

a) Understand the preparation/publication workflow

– identifying where there can be productive intervention

b) Devise prototypes for pro-active archiving

– writing & implementing code!

c) Propose/test infrastructure for temporal referencing

– supporting & using the Memento protocol

Where possible, we wish to embed ‘solutions’ in existing tools & infrastructure

Strategy for Making Remedy

3 workflows in scholarly statement

Extended length of stages in workflows magnifies reference rot & its effect, as referenced content on the web rots over time

① Preparation -> Study -> Compose -> (Review) -> Submission

② Publication -> (Editorial) Examination -> (Revision) -> Acceptance -> Issue

③ Post-Publication -> Deposit/Ingest -> Provide/Access -> Use

Identify the best opportunities for Intervention to make Remedy, to ‘flash-freeze’, either to avoid reference rot or to ‘stop the rot’

What are the key workflows for the manufacture, release and use of Linked Data?

3 workflows in Linked Data

What is it that changes over time: concepts, assigned attributes; why and on what timescale?

① Manufacture -> Create -> (Review) -> Prepare to publish/release/commit

② Authority: Release -> (Editorial) Examination -> (Revision) -> Acceptance

③ Use: Curate -> Deposit/Ingest -> Provide/Access -> Use

Identify the best opportunities for Intervention to make Remedy, to ‘flash-freeze’, either to avoid reference rot or to ‘stop the rot’


1. Hiberlink Plug-in - for pro-active ‘transactional’ archiving

– At the time of authoring (i.e. manufacture)

2. Missing Link - re-factoring the HTML link

– By which one annotates with {DateTime; location of archived copy/ies}

3. HiberActive - a system for actively archiving references

– Designed to ‘stop the rot’; a lossy 2nd best to ‘transactional archiving’

LANL: Martin Klein, Harihar Shankar, Herbert Van de Sompel

UoEd EDINA: Neil Mayo, Tim Stickland, Richard Wincewicz

‘Work in progress’ to effect Remedy

Hiberlink

ETD2014, Leicester UK July 25th 2014


For use during authoring [manufacture] of information object &

before final issue, but also

before ingest by ‘library’ (& maybe for repair by ‘library’ …)

Hiberlink Plug-in [for Zotero]

① Triggers archiving of referenced web content

② Returns DateTime URI for archived content

1. Hiberlink Plug-in - to enable pro-active archiving

2. Missing Link - re-factor the HTML link that is returned

‘Work in progress’ to effect Remedy (2)

a) Take simple URI - to French National Library (say)

b) Augment Link with a set of Datetime & location pairs

Prepared by: Herbert Van de Sompel, Martin Klein, Robert Sanderson (Los Alamos National Laboratory); Michael Nelson (Old Dominion University)

http://mementoweb.org/missing-link/
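The idea of augmenting a link can be sketched as follows: keep the original URI in `href`, and add the date of linking plus the location of one or more archived copies. The attribute names `data-versiondate` and `data-versionurl`, the function name and the `archive.example.org` snapshot URI are assumptions chosen for illustration; consult the Missing Link document itself for the proposed syntax.

```python
from html import escape
from typing import Sequence


def augmented_link(
    original_uri: str,
    text: str,
    linked_on: str,
    snapshot_uris: Sequence[str] = (),
) -> str:
    """Render an HTML link that keeps the original URI and adds temporal context.

    linked_on is the date of linking (for example an ISO 8601 date);
    snapshot_uris are locations of archived copies, if any are known.
    """
    attrs = {"href": original_uri, "data-versiondate": linked_on}
    if snapshot_uris:
        # URIs contain no unescaped spaces, so a space-separated list is safe.
        attrs["data-versionurl"] = " ".join(snapshot_uris)
    rendered = " ".join(
        f'{name}="{escape(value, quote=True)}"' for name, value in attrs.items()
    )
    return f"<a {rendered}>{escape(text)}</a>"


print(
    augmented_link(
        "http://www.bnf.fr/",
        "French National Library",
        "2014-10-18",
        ["http://archive.example.org/20141018/http://www.bnf.fr/"],
    )
)
```

Because the original URI stays in `href`, the link still works for clients that ignore the extra attributes, while Memento-aware tools can use the date and snapshot location if the live page rots.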


1. Hiberlink Plug-in - to enable pro-active archiving

2. Missing Link - re-factoring the HTML link

First two approaches support ‘perfect scenario’:

• All authors archive all their cited URIs

• e.g. (but not exclusively) with Hiberlink / Zotero

3. HiberActive

– Enables repositories to ‘stop the rot’ by actively archiving those references in e-theses

– A notification hub, a component for the infrastructure

• testing workflow with ResourceSync, CORE & external archive programme

‘Work in progress’ to effect Remedy (3)

• The Web changes over time: significant reference rot inevitably occurs (as a function of time)

• Web Archiving delivers only c.50:50 chance of success of co-incidentally archiving what you referenced

• Link by means of the original URI, at time of manufacture

• But then …. Augment the link with temporal context, to increase robustness of link to referenced content

o Date of linking

o URI of archived snapshot(s)

• Then again, maybe this is all about archiving to support citation and not really about ‘preservation’, but it does assist continuity of access

Summary

Picture credit: http://somanybooksblog.com/2009/03/27/library-tour/

Multi-level Problem: Digital Shelving for The Research Object; First Order References; Second Order References; ….

Simple Statements [with URIs]

1st Order References [with URIs]

Complex Research Objects {URIs}

1st Order References {URI}

2nd Order References {URI}

2nd Order References [with URIs]

“Digital information is best preserved by replicating it [on digital shelving] at multiple archives run by autonomous organizations”

B. Cooper and H. Garcia-Molina (2002)

Next Steps: how to take this work forward?

to ensure URI/references don’t rot

• Need to move from the ‘incidental Web archiving’ of cited URIs to pro-active archiving, by makers of Linked Data & by repositories?

• Engage with these Hiberlink remedies

• The Hiberlink Plug-in for Zotero / HiberActive

Email: [email protected]

Subject: Hiberlink ETD

Thank you,

Questions welcome

& check:

http://hiberlink.org/news.html

http://hiberlink.org #hiberlink

