Presented by Peter Burnhill (EDINA, University of Edinburgh) at the 3rd PRELIDA Consolidation and Dissemination Workshop, Riva, Italy, October 17, 2014. More information about the workshop at: prelida.eu
Reference Rot and Linked Data: Threat and Remedy
PRELIDA
18/19th October 2014
Funded by the Andrew W. Mellon Foundation
Peter Burnhill, EDINA, University of Edinburgh
for the Hiberlink Team at University of Edinburgh & LANL Research Library
The Project Team 2013 – 2015, funded by the
Andrew W. Mellon Foundation
• Los Alamos National Laboratory:
Research Library: Martin Klein, [Rob Sanderson], Harihar Shankar, Herbert Van de Sompel
• University of Edinburgh:
Language Technology Group: Beatrice Alex, Claire Grover,
Colin Matheson, Richard Tobin, [Ke “Adam” Zhou]
EDINA * : Neil Mayo, Muriel Mewissen (Project Manager),
Tim Stickland, Richard Wincewicz, Peter Burnhill
Centre for Service Delivery & Digital Expertise
1. Social Science Research Council [now ESRC, UK]
– ‘Scientific Officer’
2. Scottish Education Data Archive, 1979 – 1984/1987
– Survey statistician: school leavers, YTS, 16-19 cohort surveys; demand for HE
3. Edinburgh University Data Library, 1984 to present
– President of IASSIST, 1997 – 2001: social science data professionals
4. ESRC Regional Research Laboratory for Scotland, 1986 – 1990
– Co-director, early days of Geographical Information Systems (GIS)
– member of Data Task Force, UK Inter-Agency Global Env. Change
5. Graduate School, Faculty of Social Science, UofEd 1987 – 1997
– Senior Lecturer (p/t), teaching quantitative/survey methods
– Director of RAPID: ESRC Research Activity & Publications Information Database
6. EDINA national data centre, 1995/6 to present
– Director: set-up and continuous development; Jisc-funded UK national services
7. UK Digital Curation Centre (DCC), 2003/04 – 2004/05
– Director for set-up & definition of ‘data curation + digital preservation’
8. CLOCKSS Founder & Board Member / LOCKSS deployment
Data Manufacturing
Data Brokering
funding Data & use of Data
Spatial Data & MetaData
licence to use
Ensuring researchers, students and their teachers have
ease and continuing access to online resources used for scholarship
P. Burnhill, Edinburgh 2009
access to content & services
Buckland: thinking about Digital Libraries
mix of the document tradition (signifying objects & their use)
& the computation tradition (applying algorithmic, logical,
mathematical, and mechanical techniques to information management)
“Both traditions are needed. Information Science is rooted in part in humanities and qualitative social sciences. The landscape of Information Science is complex. An ecumenical view is needed.”
– M.Buckland, Journal of American Society for Information Science, 50, 1999
2 (non-convergent) mentalities, Document-ness & Computation
+ a third dimension, the domain of application:
• Academic discipline – if we do this for ourselves
• Business area – if we do this for use beyond …
Related Activity
by Partners
• Los Alamos National Laboratory Research Library:
• Memento
• ResourceSync: http://www.niso.org/workrooms/resourcesync/
• University of Edinburgh / Informatics / Language Technology Group:
• Text mining / Edinburgh Parser
• University of Edinburgh/ Jisc / EDINA :
• CLOCKSS / LOCKSS
• Keepers Registry: https://www.era.lib.ed.ac.uk/handle/1842/6682
Picture credit: http://somanybooksblog.com/2009/03/27/library-tour/
But online articles in the Scholarly Record are not in
the custody of Libraries, nor on their digital shelves.
Top level Problem: We would like to assume that our libraries are ensuring that online e-journal content is being kept safe
Evidence from <thekeepers.org> is worrying!
The Keepers Registry aggregates what is being kept by the (10) leading
archiving agencies (CLOCKSS, Portico, national libraries, etc.) and sets it
against all serials issued with an ISSN
① ‘Ingest Ratio’ = titles being ingested by one or more Keeper
/ ‘online serials’ in ISSN Register
= 23,268 / 136,965 [in March 2014] => 17%
* We do not know about 83% of e-serials having ISSN *
‘KeepSafe Ratio’ = ingest by 3+ Keepers = 9,652 / 136,965 => 7%
② Title Lists of 3 US research libraries (Columbia, Cornell & Duke),
checked in 2011/12: ‘Ingest Ratio’ = 22% to 28%; c.75% unknown fate
③ User-centric Evidence, UK usage in 2012, UK OpenURL Router logs
=> over two thirds, 68% (36,326 titles), held by none!
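The ‘Ingest Ratio’ and ‘KeepSafe Ratio’ in ① are simple proportions of the ISSN Register count. A minimal sketch of that arithmetic, using only the March 2014 figures quoted above; the constant and function names are illustrative and are not part of the Keepers Registry:

```python
# Figures quoted on the slide (March 2014).
ONLINE_SERIALS_IN_ISSN_REGISTER = 136_965
INGESTED_BY_ONE_OR_MORE_KEEPERS = 23_268
INGESTED_BY_THREE_OR_MORE_KEEPERS = 9_652


def ratio_percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100 * part / whole, 1)


ingest_ratio = ratio_percent(
    INGESTED_BY_ONE_OR_MORE_KEEPERS, ONLINE_SERIALS_IN_ISSN_REGISTER
)
keepsafe_ratio = ratio_percent(
    INGESTED_BY_THREE_OR_MORE_KEEPERS, ONLINE_SERIALS_IN_ISSN_REGISTER
)
# Titles for which no Keeper reports ingest: their fate is unknown.
unknown_fate = round(100 - ingest_ratio, 1)

print(ingest_ratio, keepsafe_ratio, unknown_fate)
```

Rounded to whole numbers these give the 17%, 7% and 83% shown on the slide.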
Memento
The Memento "Time Travel for the Web" protocol http://mementoweb.org/
• an interoperable approach to access web archives (IETF RFC 7089)
• adopted by all major public archives worldwide, including the Internet Archive.
• Memento for Chrome http://bit.ly/memento-for-chrome
• This protocol underpins the work being done in Hiberlink
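In RFC 7089 a client asks a TimeGate for the version of an Original Resource closest to a chosen moment by sending an `Accept-Datetime` request header. The sketch below only builds such a request and makes no network call. The TimeGate address is a hypothetical placeholder, and `memento_request` is an illustrative name, not part of any Memento library:

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical TimeGate endpoint, for illustration only.
TIMEGATE_BASE = "https://timegate.example.org/timegate/"


def memento_request(original_uri, when):
    """Return (request URI, headers) for an RFC 7089 datetime negotiation."""
    if when.tzinfo is None:
        raise ValueError("when must be timezone-aware")
    # Accept-Datetime uses the HTTP date format, always expressed in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE_BASE + original_uri, {"Accept-Datetime": http_date}


uri, headers = memento_request(
    "http://dl00.org", datetime(2004, 6, 1, 12, 0, tzinfo=timezone.utc)
)
print(uri)
print(headers)
```

A TimeGate that supports the protocol would answer with a redirect to the archived copy (the Memento) nearest to the requested datetime.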
Now, about Reference Rot & Linked Data …
1. Some definitions
• What is Reference Rot?
• What may be special about Linked Data?
2. Evoking metaphor
• The moment / snapshot / memento
• Flash-freezing to avoid or to stop the rot (of fruit on vine)
3. Evidence of Threat of Reference Rot
4. Devising Remedy for Reference Rot
• Proposals for intervention: plug-ins & infrastructural solutions
5. Next Steps: how to take this work forward?
Reference Rot = Link Rot + Content Drift
“when links to web resources
no longer point to what they once did”
Investigating Reference Rot in Web-Based Scholarly Communication
‘Link Rot’
+ Content Drift: What is at end of URI has changed, or gone!
[Figure: http://dl00.org as seen in 2000, 2004, 2005 and 2008]
(a) Dynamic content, as values on a webpage change over time
(b) Static content, but very different (often unrelated) web pages
What of Linked Data?
One or more sets of 3 linked URIs: conversation or statements for the long term? As time passes, so the content at the end of each of those URIs will suffer:
Reference Rot = Link Rot + Content Drift
“when links to web resources
no longer point to what they once did”
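Because a Linked Data statement is made of three URIs, it depends on every one of them still pointing to what it once did. A minimal sketch of that dependency follows; the statement, the example.org URIs and the liveness observations are all made up for illustration:

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One Linked Data statement: three URIs (subject, predicate, object)."""
    subject: str
    predicate: str
    obj: str


def rotted_uris(triple, is_live):
    """Return the URIs in a triple that are not known to be live."""
    return [uri for uri in triple if not is_live.get(uri, False)]


# Hypothetical statement and liveness observations, for illustration only.
statement = Triple(
    "http://example.org/site/42",
    "http://example.org/vocab/locatedIn",
    "http://example.org/place/riva",
)
observed = {
    "http://example.org/site/42": True,
    "http://example.org/vocab/locatedIn": False,  # e.g. vocabulary moved or gone
    "http://example.org/place/riva": True,
}

print(rotted_uris(statement, observed))
```

A single rotted URI is enough to undermine the whole statement, which is why the question of Reference Rot applies to each end of each link.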
“Adding eScience Assets to the Data Web”, Herbert Van de Sompel, Carl Lagoze, Michael L. Nelson, Simeon Warner, Robert Sanderson, Pete Johnston. Proceedings of Linked Data on the Web (LDOW2009) Workshop, [v1] Thu, 11 Jun 2009 15:33:37 GMT http://arxiv.org/abs/0906.2135v1
Example: ‘mark up’ archaeological site record (metadata)
RDF graph: Article & Supplementary Data http://www.emeraldinsight.com/fig/0350570303002.png
1. Build and publish as metadata in XML format to be found on the web
2. Publishing text and data/multimedia content in XML will delight researchers
• Researchers want to access ‘article as data’, via computational algorithm
What we are doing in Hiberlink
1. Creating evidence on extent of ‘Reference Rot’
– Main focus has been on references (& URIs) made in Journal Articles
• Incl. reference rot in Supreme Court judgments, with Harvard Law Library & Perma.cc
– ETD2014 was opportunity to look at Reference Rot & the e-Thesis
– PRELIDA is opportunity to look at impact on Linked Data
2. Understanding the preparation/publication/ingest workflow(s)
– Identifying opportunity for productive intervention
3. Prototypes for pro-active archiving to enable remedy
– Embedding such ‘solutions’ in existing tools & infrastructure
4. Raising awareness & seeking collaborative actions
…. through events like this
Empirical evidence on the Threat of
Reference Rot
Large-scale analyses: Journal Articles & E-Theses
Methodology: to discover answer to 2 questions
i. Do those links (URIs) still work? Is the URI on the ‘Live Web’?
• Allowing up to a maximum of 50 redirects, recording the HTTP transaction chain and regarding a 2XX status code as ‘live’
ii. Is there a ‘Memento’ of that reference in the ‘Archived Web’?
Memento: a prior version, what the Original Resource was like at some time in the past.
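The rule in question i can be sketched as a small classifier over a recorded HTTP transaction chain. How Hiberlink actually stores its chains is not described here, so the list-of-status-codes representation and the name `is_live` are assumptions made for illustration:

```python
MAX_REDIRECTS = 50


def is_live(status_chain, max_redirects=MAX_REDIRECTS):
    """Classify a recorded HTTP transaction chain as 'live' or not.

    status_chain holds the status code of each response, in order,
    ending with the final response received.
    """
    if not status_chain:
        return False  # no response at all (e.g. DNS failure or timeout)
    *hops, final = status_chain
    if any(not 300 <= code < 400 for code in hops):
        return False  # malformed chain: only redirects may precede the end
    if len(hops) > max_redirects:
        return False  # more redirects than the study allows
    return 200 <= final < 300  # only a 2XX final response counts as 'live'


print(is_live([301, 302, 200]))
print(is_live([301, 404]))
```

Under this rule a URI that redirects and then returns 404 counts as not found on the ‘Live Web’, as does a chain that never reaches a final 2XX response within 50 redirects.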
A Measure of Reference Rot: Are those references available? [in 6,400 e-Theses defended in 2003-2010 at 5 US universities]
Less than two-thirds
of those links lead
to live content
         Live on Web    Not Found on ‘Live Web’    All
Count    29,122         16,860                     45,982
%        63.3           36.7                       100%
1st Order Indicator of ‘Reference Rot’:
more than one third of references
to the Web subject to ‘rot’
After up to 50 redirects
References in Citations Rot over Time: URIs cease to exist on the live Web
[excluding 0s & 1s: a few theses are unaffected; a few are ruined]
We can’t stop that process of rot: Web content changes over time,
Reference Rot is an inevitable function of time
Number of months elapsed from Date Thesis Defended until date archives checked (June 2014)
Searching for ‘Datetime’ Mementos of content in ‘Archived Web’ [in 6,400 e-Theses defended in 2003-2010 at 5 US universities]
%                       Live on Web    Not found on ‘Live Web’    All
Found to be Archived                                              47.6
Not Found                                                         52.4
All                                                               100%
There seems a 50:50 chance that referenced content is in the ‘Archived Web’.
Some content is being ‘co-incidentally harvested’ by routine web archiving.
=> half of those references are at ‘risk of loss’
‘Incidental Archiving’ is constant over time (This is an ‘upper bound estimate’, independent of age of e-thesis)
We can improve upon this ‘50:50 chance’
by pro-actively archiving what we cite
We already have ‘Lost Content’ for References to Web [in 6,400 e-Theses defended in 2003-2010 at 5 US universities]
%                       Live on Web    Not found on ‘Live Web’    All
Found to be Archived    29.3           18.3                       47.6
Not Found               34.0           18.4                       52.4
All                     63.3           36.7                       100%
18.4% ‘not live’ & ‘not found in archive’:
judged to be lost forever
34% ‘live’ & ‘not in archive’:
at risk of loss
NB: The 34% ‘at risk’ could be saved by pro-active archiving
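The row and column totals in the table follow from its four inner cells. A short sketch that recomputes them from the percentages quoted above; the dictionary layout and the `margin` helper are illustrative only:

```python
# Percentages of all referenced URIs, as quoted in the table above.
# Keys: (archive status, live-web status).
cells = {
    ("archived", "live"): 29.3,
    ("archived", "not_live"): 18.3,
    ("not_archived", "live"): 34.0,
    ("not_archived", "not_live"): 18.4,
}


def margin(position, label):
    """Sum the cells whose key has `label` at the given position."""
    return round(sum(v for k, v in cells.items() if k[position] == label), 1)


found_archived = margin(0, "archived")
not_archived = margin(0, "not_archived")
live = margin(1, "live")
not_live = margin(1, "not_live")

at_risk = cells[("not_archived", "live")]           # could still be archived
lost = cells[("not_archived", "not_live")]          # neither live nor archived

print(found_archived, not_archived, live, not_live)
print(at_risk, lost)
```

The ‘at risk’ and ‘lost’ cells together make up the 52.4% of references for which no archived copy was found.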
Hiberlink Next Phase: in-depth study of Content Drift
But demonstrated that problem exists & is severe
• The Web changes over time: significant reference rot occurs
• Routine Web Archiving delivers no better than 50:50 chance of success of having co-incidentally archived what you referenced
- and probably much less chance when we check extent of content drift
- Not (yet) studied impact on Linked Data but expect similar
“Researchers need to know when information on a viewed page has changed.”
“Authors of long-shelf-life material want to be sure that their links will still work far into the future.”
Jonathan Zittrain, Larry Lessig and Kendra Albert report that
• Harvard Law Review
75% of links are dead
• top 1% Impact Factor Journals
10% of links dead just 15 months after publication
• US Supreme Court decisions
29% of links dead
49% of links do not point to the original target
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2329161
Devising Remedy for Reference Rot
for Linked Data?
Seek pro-active ‘transactional archiving’ solutions
– focus on what is regarded by authors as important
a) Understand the preparation/publication workflow
– identifying where there can be productive intervention
b) Devise prototypes for pro-active archiving
– writing & implementing code!
c) Propose/test infrastructure for temporal referencing
– supporting & using the Memento protocol
Where possible, we wish to embed ‘solutions’ in existing tools & infrastructure
Strategy for Making Remedy
3 workflows in scholarly statement
The extended length of the stages in these workflows magnifies reference rot & its effect, as referenced content on the web rots over time
① Preparation -> Study -> Compose -> (Review) -> Submission
② Publication -> (Editorial) Examination -> (Revision) -> Acceptance -> Issue
③ Post-Publication -> Deposit/Ingest -> Provide/Access -> Use
Identify the best opportunities for Intervention to make Remedy, to ‘flash-freeze’, either to avoid reference rot or to ‘stop the rot’
What are the key workflows for the manufacture, release and use of Linked Data?
3 workflows in Linked Data
What is it that changes over time: concepts, assigned attributes; why and on what timescale?
① Manufacture -> Create -> (Review) -> Prepare to publish/release/commit
② Authority: Release -> (Editorial) Examination -> (Revision) -> Acceptance
③ Use: Curate -> Deposit/Ingest -> Provide/Access -> Use
Identify the best opportunities for Intervention to make Remedy, to ‘flash-freeze’, either to avoid reference rot or to ‘stop the rot’
What are the key workflows for the manufacture, release and use of Linked Data?
1. Hiberlink Plug-in - for pro-active ‘transactional’ archiving
– At the time of authoring (ie manufacture)
2. Missing Link - re-factoring the HTML link
– By which one annotates with {DateTime; location of archived copy/ies}
3. HiberActive - a system for actively archiving references
– Designed to ‘stop the rot’, a lossy 2nd best to ‘transactional archiving’
LANL: Martin Klein, Harihar Shankar, Herbert Van de Sompel
UoEd EDINA: Neil Mayo, Tim Stickland, Richard Wincewicz
‘Work in progress’ to effect Remedy
Hiberlink
ETD2014, Leicester UK July 25th 2014
For use during authoring [manufacture] of information object &
before final issue, but also
before ingest by ‘library’ (& maybe for repair by ‘library’ …)
Hiberlink Plug-in [for Zotero]
① Triggers archiving of referenced web content
② Returns DateTime URI for archived content
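Steps ① and ② can be sketched as one function called at the moment a reference is made. This is not the plug-in's code or API: the `archive` callable stands in for whatever archiving service is used, and `fake_archive` returns made-up snapshot URIs purely for illustration:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass(frozen=True)
class ArchivedReference:
    original_uri: str
    archived_uri: str
    archived_at: datetime


def cite(original_uri: str,
         archive: Callable[[str], str],
         now: Callable[[], datetime] = lambda: datetime.now(timezone.utc),
         ) -> ArchivedReference:
    """Trigger archiving of a referenced URI and keep what comes back."""
    archived_at = now()
    archived_uri = archive(original_uri)  # step 1: trigger archiving
    # step 2: keep the archived copy's URI together with its datetime
    return ArchivedReference(original_uri, archived_uri, archived_at)


def fake_archive(uri: str) -> str:
    """Hypothetical archive: real services return their own snapshot URIs."""
    return "https://archive.example.org/snapshot/" + uri


ref = cite("http://dl00.org", fake_archive,
           now=lambda: datetime(2014, 10, 17, tzinfo=timezone.utc))
print(ref)
```

Keeping the original URI alongside the snapshot URI and datetime means the reference still records what was cited even if the archive later disappears.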
1. Hiberlink Plug-in - to enable pro-active archiving
2. Missing Link - re-factor the HTML link that is returned
‘Work in progress’ to effect Remedy (2)
a) Take simple URI - to French National Library (say)
b) Augment Link with a set of Datetime & location pairs
Prepared by: Herbert Van de Sompel, Martin Klein, Robert Sanderson - Los Alamos National Laboratory; Michael Nelson - Old Dominion University
http://mementoweb.org/missing-link/
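The idea of augmenting a link can be sketched as rendering an ordinary HTML anchor with two extra attributes: the date of linking and the location(s) of archived copies. The attribute names `data-versiondate` and `data-versionurl`, the function name and the archive URI below are assumptions for illustration; the Missing Link document at the address above is the place to check the actual markup:

```python
from datetime import date
from html import escape


def augmented_link(href, text, version_date, version_urls):
    """Render an HTML anchor carrying the linking date and archived copies."""
    attributes = {
        "href": href,  # the original URI stays the link target
        # Attribute names below are assumptions for illustration.
        "data-versiondate": version_date.isoformat(),
        "data-versionurl": " ".join(version_urls),
    }
    rendered = " ".join(
        f'{name}="{escape(value, quote=True)}"'
        for name, value in attributes.items()
    )
    return f"<a {rendered}>{escape(text)}</a>"


link = augmented_link(
    "http://www.bnf.fr/",
    "French National Library",
    date(2014, 10, 17),
    ["https://archive.example.org/20141017/http://www.bnf.fr/"],
)
print(link)
```

The link still points at the original URI, so it works as before on the live Web; the extra attributes give a reader or tool somewhere to go when it no longer does.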
1. Hiberlink Plug-in - to enable pro-active archiving
2. Missing Link - re-factoring the HTML link
First two approaches support ‘perfect scenario’:
• All authors archive all their cited URIs
• e.g. (but not exclusively) with Hiberlink / Zotero
3. HiberActive
– Enables repositories to ‘stop the rot’ by actively archiving those references in e-theses
– A notification hub, a component for the infrastructure
• testing workflow with ResourceSync, CORE & external archive programme
‘Work in progress’ to effect Remedy (3)
• The Web changes over time: significant reference rot inevitably occurs (as a function of time)
• Web Archiving delivers only c.50:50 chance of success of co-incidentally archiving what you referenced
• Link by means of the original URI, at time of manufacture
• But then …. Augment the link with temporal context, to increase robustness of link to referenced content
o Date of linking
o URI of archived snapshot(s)
• Then again, maybe this is all about archiving to support citation and not really about ‘preservation’, but it does assist continuity of access
Summary
Picture credit: http://somanybooksblog.com/2009/03/27/library-tour/
Multi-level Problem: Digital Shelving for The Research Object; First Order References; Second Order References; ….
Simple Statements [with URIs]
1st Order References [with URIs]
Complex Research Objects {URIs}
1st Order References {URI}
2nd Order References {URI}
2nd Order References [with URIs]
“Digital information is best preserved by replicating it [on digital shelving] at multiple archives run by autonomous organizations”
B. Cooper and H. Garcia-Molina (2002)
Next Steps: how to take this work forward?
to ensure URI/references don’t rot
• Need to move from the ‘incidental Web archiving’ of cited URIs to pro-active archiving, by makers of Linked Data & by repositories?
• Engage with these Hiberlink remedies
• The Hiberlink Plug-in for Zotero / HiberActive
Email: [email protected]
Subject: Hiberlink ETD
Thank you,
Questions welcome
& check:
http://hiberlink.org/news.html
http://hiberlink.org #hiberlink