Web Today, Gone Tomorrow? => transactional archiving of web content. Peter Burnhill, University of Edinburgh. Professional/Scholarly Publishing (PSP) Division, Association of American Publishers (AAP), Washington DC, 1-4 February 2017


Page 1: Web Today, Good Tomorrow? Transactional archiving of web content

Web Today, Gone Tomorrow? => transactional archiving of web content

Peter Burnhill

 University of Edinburgh

Professional/Scholarly Publishing (PSP) Division, Association of American Publishers (AAP)
Washington DC, 1-4 February 2017

Page 2: Web Today, Good Tomorrow? Transactional archiving of web content

12 ‘Dark Archive Nodes’ in long-lived research institutions in 8 different countries/jurisdictions:

North America: Indiana, Rice, Stanford, Virginia, OCLC; Alberta (Ca.)
Europe: Edinburgh (UK); Humboldt (Ger); Cattolica dSC (It.)
Asia/Pacific: ANU; NII (Japan); UHK

Triggered 29 titles so far [1.1 m downloads in 2016]

Triggered release at Stanford & EDINA via OpenURLs to local library link-resolvers & CrossRef

CLOCKSS Archive Network Library Stewardship: Global & Decentralized

not-for-profit joint venture Board: 12 publishers & 12 libraries

Cross-sectoral collaboration & innovation

Stanford

TRAC Certified

Page 3: Web Today, Good Tomorrow? Transactional archiving of web content

① Web-scale not-for-profit archiving agencies:

② National institutions (usually national libraries) …

③ Consortia of university libraries & specialist centres …

National Science Library, Chinese Academy of Sciences

1. We now have a variety of digital shelving


Good News: a lot of online e-journal content is being kept safe

Swiss National Library

Page 4: Web Today, Good Tomorrow? Transactional archiving of web content

… to discover who is looking after what

An established Global Monitor

thekeepers.org

2. We have means to search ‘holdings’ on digital shelves

12 ‘keepers’ (+ Swiss National Library)

Funded by:

Developed & managed by:

on Title or ISSN, using the ISSN Register & ISSN-L as kernel field

Page 5: Web Today, Good Tomorrow? Transactional archiving of web content

3. Use Registry as ‘Observatory’: provide evidence on progress

Page 6: Web Today, Good Tomorrow? Transactional archiving of web content

very many ‘at risk’ e-journals from the “65% of publishers”:

the hardest to reach & work with

BIG publishers act early but incompletely

** Amber Alert **

a lot of Arts, Humanities, Law & ‘applied’ literature not being archived

STEM journals well archived

Page 7: Web Today, Good Tomorrow? Transactional archiving of web content

Progress as archiving agencies form a Keepers Network to tackle that Long Tail and ensure completeness

=> Their recent Statement * endorsed by library community
• ARL + CARL + LIBER + RLUK + CAUL (IARLA: International Alliance of Research Library Associations)
• Ivy Plus Libraries Collections Group, USA
• + library groups in Canada, Australasia, South America and Europe

* ‘Working Together to Ensure the Future of the Digital Scholarly Record’ http://thekeepers.blogs.edina.ac.uk/keepers-extra/ensuringthefuture

=> Need support from Publishers & Publisher Associations

1. To read and endorse the Keepers Statement *
• be vocal to all publishers in your support of archiving agencies
• make it easier for archives to ingest your content & keep it safe

2. To double-check actual ingest of your content via Keepers Registry

Page 8: Web Today, Good Tomorrow? Transactional archiving of web content

References to Content

=> Back into Scholarly Publications: has ‘fixity’
DOI, ISSN; CLOCKSS, Portico, CrossRef, etc.
E-Journal Archiving #keepers

=> Out onto the Web at Large: dynamic, lacks fixity
URLs: ‘Web today, gone tomorrow’
Reference Rot #hiberlink

Threat to Integrity of scholarly publication => References to Content

Now The Bad News: 3 Red Alerts for Publishers

Page 9: Web Today, Good Tomorrow? Transactional archiving of web content

Project: 2 years, March 2013 to June 2015

Funder: Andrew W. Mellon Foundation

Partners: University of Edinburgh (EDINA & Language Technology Group, School of Informatics); Los Alamos National Laboratory

Ambition:
1. Define and measure the extent of ‘Reference Rot’
2. Scope possible intervention opportunities to stop the rot

We did that and went further, to
3. Devise sustainable solutions capable of maximal reach

The aim today is to
4. Prompt action by those who can make a difference …

Page 10: Web Today, Good Tomorrow? Transactional archiving of web content

Red Alert 1: Scholarly Articles increasingly link to Web Resources, not just back to other Articles

Figure: three panels (arXiv, Elsevier corpus, PMC). Dark solid lines represent URIs to Web-at-large, from 1997/2011

Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. PLoS ONE 9(12): e115253. doi:10.1371/journal.pone.0115253 http://journals.plos.org/plosone/article?id=info:doi/10.1371/journal.pone.0115253

Data: 1.2 million articles with URI references, of which 393,000 to ‘Wild Web’ => 1 million URIs

Page 11: Web Today, Good Tomorrow? Transactional archiving of web content

Reference Rot = Link Rot + Content Drift

When what was referenced & cited ceases to say the same thing, or ‘has ceased to be’ http://www.snorgtees.com/this-parrot-has-ceased-to-be

1. Link Rot: Link stops working

=> two questions about the 1 million URLs to Web-at-large

1. Do those links (URLs) still work on the ‘Live Web’?

2. Is there a ‘Memento’ of that reference in the ‘Archived Web’?

Page 12: Web Today, Good Tomorrow? Transactional archiving of web content

Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. PLoS ONE 9(12): e115253. doi:10.1371/journal.pone.0115253 http://journals.plos.org/plosone/article?id=info:doi/10.1371/journal.pone.0115253

Within 14 days of publication date …

                                  PMC      Elsevier
‘Not Archived’                    74.5%    75.2%
Of those ‘Not Archived’:
  still ‘Live’ on the Web         80%      67.3%
  ‘No longer Live’ on the Web     20%      32.7%

Many ‘missing, presumed lost’
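The figures above can be combined to estimate how much is already beyond recovery. The short calculation below is a sketch only: it assumes that the ‘still Live’ and ‘No longer Live’ percentages are shares of the ‘Not Archived’ URIs, as the table layout suggests, and the function name `lost_share` is made up for illustration.

```python
# Figures taken from the table above (percentages of referenced URIs).
not_archived = {"PMC": 74.5, "Elsevier": 75.2}
# Assumption: these are shares of the 'Not Archived' URIs, not of all URIs.
no_longer_live = {"PMC": 20.0, "Elsevier": 32.7}


def lost_share(corpus: str) -> float:
    """Percentage of all referenced URIs that are neither archived nor live."""
    return round(not_archived[corpus] * no_longer_live[corpus] / 100, 1)


for corpus in not_archived:
    print(f"{corpus}: {lost_share(corpus)}% not archived and no longer live")
```

Under that reading, roughly 15% of the PMC URIs and 25% of the Elsevier URIs were both unarchived and dead: the ‘missing, presumed lost’.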

Most referenced URIs at risk of loss

Team at Harvard Law School established similar evidence

• 70% of the URLs within [law] journals & 50% of the URLs within U.S. Supreme Court opinions … “do not produce the information originally cited.”

Jonathan Zittrain, Kendra Albert and Lawrence Lessig (2014). Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations. Legal Information Management 14. doi:10.1017/S1472669614000255.

Red Alert 2

Reference Rot is already significant

Page 13: Web Today, Good Tomorrow? Transactional archiving of web content

Red Alert 3: Content Drift is even scarier!

when what is at end of cited URL has changed, or gone!!

Example: http://dl00.org in 2000, 2004, 2005 and 2008

(a) Dynamic content: values on a web page change over time

(b) Static content: but very different (often unrelated) web pages
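To make the distinction between Link Rot and Content Drift concrete, here is a minimal sketch that compares an archived snapshot with what the URL returns today. It is illustrative only: it uses Python's standard `difflib` ratio rather than the similarity measures of the published studies, and the 0.9 threshold and the function names are arbitrary assumptions.

```python
from difflib import SequenceMatcher
from typing import Optional


def similarity(snapshot_text: str, live_text: str) -> float:
    """Return a 0.0 to 1.0 similarity ratio between two page texts."""
    return SequenceMatcher(None, snapshot_text, live_text).ratio()


def classify(snapshot_text: str, live_text: Optional[str],
             threshold: float = 0.9) -> str:
    """Label a cited URL, given its snapshot text and its live text.

    live_text is None when the link no longer resolves.
    The threshold is an arbitrary illustration, not a published value.
    """
    if live_text is None:
        return "link rot"
    if similarity(snapshot_text, live_text) < threshold:
        return "content drift"
    return "unchanged"


cited = "Digital Libraries 2000 conference programme"
print(classify(cited, None))                           # link no longer works
print(classify(cited, "Cheap domain names for sale"))  # page replaced
print(classify(cited, cited))                          # same content
```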

Page 14: Web Today, Good Tomorrow? Transactional archiving of web content

‘Similarity’ of Representative Mementos & Live Web Content as at August 2015, by Year of Publication: 655,000 Elsevier articles, 1997 to 2012

Jones SM, Van de Sompel H, Shankar H, Klein M, Tobin R, et al. (2016) Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content. PLOS ONE 11(12): e0167475. doi:10.1371/journal.pone.0167475 http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0167475

‘Similarity’ decreases over time

After 3 years, only ¼ (25%) of URIs lead to unchanged content

+ increase in Link Rot

* fresh evidence on ‘Content Drift’ *

Page 15: Web Today, Good Tomorrow? Transactional archiving of web content

In articles published in 2012, only about 25% of referenced resources remain unchanged by August of 2015

Confirmed in all 3 datasets: 25% in each

Page 16: Web Today, Good Tomorrow? Transactional archiving of web content

=> Content of Citations Rot over Time!!

… leading to rotten references for the reader

Get Smell Out, Copyright © 2017

Page 17: Web Today, Good Tomorrow? Transactional archiving of web content

Rot in References means a Defective Article!

undermines the integrity of the scholarly record http://www.fao.org/wairdocs/tan/x5883e/x5883e01.htm

Page 18: Web Today, Good Tomorrow? Transactional archiving of web content

So what should we expect of the Publisher?

Beyond the assurance that the fish / references / articles sold are not rotten

Kind permission from Manchester Evening News

Page 19: Web Today, Good Tomorrow? Transactional archiving of web content

5 Options to Remedy Reference Rot

Hint: Remedy for fish is ‘Quick Freeze & Store with Date Stamp’

Kind permission from Asia Quality Control

Always end on the +ve … !!

Page 20: Web Today, Good Tomorrow? Transactional archiving of web content

① Take Snapshot of what is at end of URL & put in safe place until needed by reader

• Various web archives support on-demand creation of snapshots of URLs:
– archive.is / Internet Archive / perma.cc / webcitation.org

Archive-It @archiveitorg perma.cc @permacc
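As a rough illustration of on-demand snapshot creation, the sketch below asks the Internet Archive to capture a URL. It assumes the publicly known "Save Page Now" address form `https://web.archive.org/save/<url>`; the service may rate-limit, require authentication or change its behaviour, so treat `take_snapshot` as a hypothetical helper rather than a supported client.

```python
from datetime import datetime, timezone
from urllib.request import Request, urlopen

# Assumption: the Internet Archive's "Save Page Now" address form.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def save_request_url(original_url: str) -> str:
    """Build the address that asks the archive to capture original_url."""
    return SAVE_ENDPOINT + original_url


def take_snapshot(original_url: str, timeout: float = 60.0) -> dict:
    """Request a snapshot and return the three elements worth recording.

    Performs a network call, so it is defined here but not executed.
    """
    request = Request(save_request_url(original_url),
                      headers={"User-Agent": "snapshot-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        snapshot_url = response.geturl()  # address reached after redirects
    return {
        "original_url": original_url,
        "snapshot_url": snapshot_url,
        # Time of the request, used here as a stand-in for the capture time.
        "snapshot_datetime": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }


print(save_request_url("http://example.org/page"))
```

The returned dictionary deliberately keeps the original URL, the snapshot URL and a date/time together, since a snapshot is only useful to a later reader if all three travel with the citation.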

Page 21: Web Today, Good Tomorrow? Transactional archiving of web content

Decide where to intervene for best effect?

Activity                       Actor                             Snapshot Quality
1. Preparation                 Author / reference tool           best
2. Submission / Issue          Editor / manuscript system        good
3. Access (post-publication)   Aggregator / publisher platform   better late than not
4. Shelving                    Librarian / IR, journal archive   better than nothing

Need to put the means of re-creating fixity within the software being used in each workflow

Page 22: Web Today, Good Tomorrow? Transactional archiving of web content

‘Best’ would be to help authors do right thing - at earliest moment of capture!

http://the-animals-biography.blogspot.co.uk/2014/04/kingfisher.html

Page 23: Web Today, Good Tomorrow? Transactional archiving of web content

.. when the Authors are trawling for content

Page 24: Web Today, Good Tomorrow? Transactional archiving of web content

② Help the Author record their dependencies?

• Preparation -> Study -> Compose -> Submission
• ‘transactional archiving’ of referenced web content
• do it when noted & citation created
• OK, but how to effect change in note-taking software? e.g. EndNote, Mendeley, Reference Manager, RefMe, Zotero

=> Good News: something already exists …
• Hiberlink Project: EDINA developed code for Zotero [open source]

Note: University of Edinburgh now investigating how to assist doctoral students with their references to web resources in e-theses

Page 25: Web Today, Good Tomorrow? Transactional archiving of web content

Need to create a time-based record of what an Author regards as significant …

Page 26: Web Today, Good Tomorrow? Transactional archiving of web content

… or needs to provide as evidence!

Alexander Lexén

https://www.flydreamers.com/en/photo/alexander-lexen-s-fly-fishing-catch-of-a-european-brown-trout-fly-dreamers-pic291999

Page 27: Web Today, Good Tomorrow? Transactional archiving of web content

More Good News: Metadata for the citation of that Snapshot

Three key elements should be recorded in the citation:
1. Original URL
2. Snapshot URL where the web content was archived
3. Date/Time when the snapshot was taken (& archived)

A proposed standard ‘Robust Links’ syntax is set out at

http://robustlinks.mementoweb.org/
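The three elements can be carried in an ordinary HTML link. The sketch below assumes the attribute names `data-originalurl`, `data-versionurl` and `data-versiondate` from the Robust Links proposal; check them against the specification before relying on them. The function name and the example URLs are hypothetical.

```python
from html import escape


def robust_link(original_url: str, snapshot_url: str,
                snapshot_date: str, text: str) -> str:
    """Render an HTML anchor that carries all three citation elements.

    Attribute names are assumed from the Robust Links proposal.
    """
    return (
        f'<a href="{escape(original_url)}" '
        f'data-originalurl="{escape(original_url)}" '
        f'data-versionurl="{escape(snapshot_url)}" '
        f'data-versiondate="{escape(snapshot_date)}">'
        f"{escape(text)}</a>"
    )


print(robust_link(
    "http://example.org/",                                             # 1. original URL
    "https://web.archive.org/web/20150801000000/http://example.org/",  # 2. snapshot URL
    "2015-08-01",                                                      # 3. snapshot date
    "Example resource",
))
```

The link still points at the original URL, so nothing changes for a reader while the page is alive; the extra attributes only matter once it rots.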

Page 28: Web Today, Good Tomorrow? Transactional archiving of web content

③ Adapt the publisher process to ‘stop the rot’

• Submission -> Editing -> (Revision) -> Acceptance -> Issue

a) Publishers should create Snapshots in web archives
• Editors to use citations with the 3 Robust Link elements

b) Submission systems should accept citations submitted with Robust Link syntax!
• Engage / amend / use ‘Robust Links’ syntax

=> Yet More Good News: something already exists …
Hiberlink Project: algorithm created for OJS [open source]; code in GitHub

Page 29: Web Today, Good Tomorrow? Transactional archiving of web content

④ Value in having ‘Hibernator’ Infrastructure

as middleware which simplifies interaction between publisher systems & web archives

Publishing platform -> ‘Hibernator’ -> External archival service (e.g. Internet Archive, Perma.cc)

• Asynchronous - returns Hiberlink in Robust Link format
• Distributed - archived in different locations
• Lightweight - leveraging HTTP & what already exists

Note: University of Edinburgh is building the Hibernator for its doctoral students to support references used as evidence in e-theses
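The middleware idea can be sketched in a few lines. This is not the Hibernator's actual design: `hibernate`, the stub archivers and the `.example` hosts are all hypothetical. It only shows the pattern of sending one URL to several archives at once and tolerating the failure of any one of them. A real service would also return to the caller immediately and deliver the links later.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone
from typing import Callable, Dict

# An archiver takes the original URL and returns the snapshot URL.
Archiver = Callable[[str], str]


def hibernate(original_url: str, archivers: Dict[str, Archiver]) -> dict:
    """Send one URL to several archives concurrently and collect the results."""
    snapshots, failed = {}, []
    with ThreadPoolExecutor(max_workers=max(1, len(archivers))) as pool:
        futures = {name: pool.submit(fn, original_url)
                   for name, fn in archivers.items()}
        for name, future in futures.items():
            try:
                snapshots[name] = future.result()
            except Exception:
                failed.append(name)  # one archive failing must not block the rest
    return {
        "original_url": original_url,
        "snapshot_urls": snapshots,
        "failed": failed,
        "snapshot_datetime": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }


# Stub archivers standing in for real services (hypothetical hosts).
def archive_a(url: str) -> str:
    return "https://archive-a.example/snap/" + url


def archive_down(url: str) -> str:
    raise RuntimeError("archive unavailable")


print(hibernate("http://example.org/", {"a": archive_a, "b": archive_down}))
```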

Page 30: Web Today, Good Tomorrow? Transactional archiving of web content

⑤ Act to help the Reader, given rot

Activity     Responsibility   Snapshot Quality
3. Access    Platform         better late than not

Access/Post-Publication -> Reader Access -> Use
• Install ‘Link Decoration’: enable readers to employ Memento to search web archives for content ‘around time of submission’

Finish on this Good News: Herbert Van de Sompel et al. (2015) Robust Links - Link Decorations http://robustlinks.mementoweb.org/spec/
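What "employ Memento" means for a reader's browser can be sketched as a datetime-negotiated lookup: ask a TimeGate for the version of a URL closest to a given date, using the `Accept-Datetime` header defined by the Memento protocol (RFC 7089). The TimeGate address below is an assumption about the Time Travel aggregator, and `memento_lookup` is a hypothetical helper that only builds the request.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Tuple

# Assumption: address form of the Memento "Time Travel" aggregator TimeGate.
TIMEGATE = "http://timetravel.mementoweb.org/timegate/"


def memento_lookup(original_url: str, around: datetime) -> Tuple[str, Dict[str, str]]:
    """Build the URL and headers for a datetime-negotiated archive lookup."""
    if around.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    # Accept-Datetime expects an HTTP-style date in GMT.
    stamp = format_datetime(around.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE + original_url, {"Accept-Datetime": stamp}


url, headers = memento_lookup("http://example.org/",
                              datetime(2017, 2, 1, tzinfo=timezone.utc))
print(url)
print(headers)
```

The pair can be passed to `urllib.request.Request(url, headers=headers)`; a TimeGate normally answers with a redirect to the archived copy nearest the requested date, which is how a decorated link can offer the content ‘around time of submission’.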

Page 31: Web Today, Good Tomorrow? Transactional archiving of web content

Thank You: Questions Welcome


With kind permission from 'Feather Saturnfly' on flickr, All Rights Reserved

Page 32: Web Today, Good Tomorrow? Transactional archiving of web content

Useful links – that still work

Hiberlink.org

Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content. PLOS ONE 11(12) doi:10.1371/journal.pone.0167475 http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0167475

The Cobweb: Can the Internet be archived? New Yorker, Annals of Technology, January 2015 http://www.newyorker.com/magazine/2015/01/26/cobweb

The growing problem of Internet “link rot” and best practices for media and online publishers https://journalistsresource.org/studies/society/internet/website-linking-best-practices-media-online-publishers

Law Library of Congress Implements Solution for Link and Reference Rot https://www.digitalgov.gov/2016/04/13/law-library-of-congress-implements-solution-for-link-and-reference-rot/