Working together to archive
the UK Web
Helen Hockx-Yu
Head of Web Archiving, British Library
www.bl.uk 2
The UK Web Domain
4th largest TLD after .com, .de and .net
Over 10 million registered .uk domains
UK organisations also use non .uk domain
names (eg .com or .org) – scale unknown
Non-print Legal Deposit (since April 2013) applies to
the open (freely available) web: .uk and other UK-published (non
.uk) websites, such as .com, .org…
also e-journals, e-books, news web pages and other digital
publications, either by harvesting or mutual agreement on other
delivery methods
Web Archiving at the British Library
Collect UK digital heritage and provide continued access to archived
web resources
Started web archiving in 2003: Open UK Web Archive
Selective, topical collections and key sites
Consortium sharing infrastructure and development effort;
agreement on who collects what
Curating collections with organisations and researchers
Archiving UK Web for non-print Legal Deposit since April 2013: Legal
Deposit UK Web Archive
Comprehensive national archive with on-site access only
Joint responsibility of six Legal Deposit Libraries (LDLs)
UK Legal Deposit libraries
The British Library
Bodleian Libraries of the
University of Oxford
Cambridge University Library
The National Library of
Scotland
The Library of Trinity College,
Dublin
The National Library of Wales
Non-print Legal Deposit Governance
Governance is important
Representation of stakeholders
Ensure accountability and effectiveness of implementation
Joint decision making
Collaboration
Key groups
Joint Committee for Legal Deposit – collaboration with publishers
e.g. legal deposit content on users’ devices
Legal Deposit Libraries Committee – e.g. notice and take-down policy
Legal Deposit Implementation Group - e.g. collect embedded content
(eg CSS, images) regardless of where it is hosted
Web Archiving Collection Prioritisation Group
Collecting strategy for websites
[Diagram: Domain Crawl; News; Key sites; Events; Special collection]
Domain crawl:
• Broad sweep of UK domain
• Once or twice a year
Events & key sites and news:
• Events of UK interest
• High value, high impact sites
• National & regional news
Special Collection:
• Focused, thematic collections
• Support priority subjects
The Digital Library System
4 Nodes (Complete Copies)
British Library, St. Pancras
British Library, Boston Spa
National Library of Wales
National Library of Scotland
Additional Access Points
Bodleian Library, Oxford
Cambridge University Library
Trinity College Library, Dublin
Beyond the LDLs
Curating (open access) collections
World War 1 Collection including 1000+ Centenary Community Projects funded by the
Heritage Lottery Fund
The National Archives
UK Government Web Archive
The Digital Preservation Coalition
Web Archiving Task Force
Technology Watch report
The Web Observatory
Web archives as data on the web
The crowd: nomination form & Twitter
to encourage selection
International collaboration
International Internet Preservation Consortium (IIPC)
49 members worldwide
British Library plays an active role in the IIPC
A founding member
On the Steering Committee
Hosts the IIPC Programme and Communications Officer
Benefits of collaboration
Community of practice
Tools development, eg OpenWayback
Staff training and development
Collaborative collections, eg Olympic Games
Collaboration with researchers
Building collections
Researchers’ involvement in
scoping collections, selecting
and describing websites
Creation of specific, (narrow)
topical collections
Formulating research questions
Brainstorming sessions, workshops, discussion, surveys etc.
Lack of awareness & baseline knowledge
Challenging: you don’t know what you don’t know
Co-development of access services
This is changing how we collect and store data
JISC UK Web Domain dataset (1996-2013)
Collaboration between the Internet Archive (IA), the Joint Information Systems
Committee (JISC) and the British Library
Extracted copies of UK websites from the Internet Archive’s collection
1st tranche: 1996 – 2010, 30TB, 2.5 billion URLs
2nd tranche: 2010 – April 2013, 27.5TB, 1.5 billion URLs (estimated)
Research agreement between JISC and IA, upholding IA’s Terms of Use
Access via IA’s Wayback Machine
Allows replication / extraction of derivative or secondary datasets
BL hosts the dataset on behalf of JISC
Data used by research projects
Institute of Historical Research project: Analytical Access to the Domain Dark
Archive (AADDA)
Oxford Internet Institute project: Big data for political science
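As a rough illustration of the scale of the two tranches listed above, the following minimal sketch adds up the quoted figures and derives an average size per URL. The figures are those on the slide (the second URL count is an estimate); the function names and the decimal unit conversion (1 TB = 10^12 bytes) are assumptions made for this example only.

```python
# Figures as quoted on the slide; the 2nd tranche URL count is an estimate.
TRANCHES = {
    "1996-2010": {"size_tb": 30.0, "urls_billion": 2.5},
    "2010-April 2013": {"size_tb": 27.5, "urls_billion": 1.5},
}


def totals(tranches):
    """Return (total size in TB, total URLs in billions)."""
    size_tb = sum(t["size_tb"] for t in tranches.values())
    urls_billion = sum(t["urls_billion"] for t in tranches.values())
    return size_tb, urls_billion


def mean_kb_per_url(size_tb, urls_billion):
    """Average stored size per URL in KB.

    Assumes decimal units: 1 TB = 1e12 bytes, 1 KB = 1e3 bytes.
    """
    return (size_tb * 1e12) / (urls_billion * 1e9) / 1e3


if __name__ == "__main__":
    total_tb, total_urls = totals(TRANCHES)
    print(f"Total: {total_tb} TB, {total_urls} billion URLs")
    for label, t in TRANCHES.items():
        kb = mean_kb_per_url(t["size_tb"], t["urls_billion"])
        print(f"{label}: about {kb:.1f} KB per URL")
```

On these figures the dataset comes to 57.5TB and roughly 4 billion URLs, with the average size per URL rising between the first and second tranche.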
Big UK Domain Data for Arts and Humanities
Funded by the UK Arts and Humanities Research Council as one of
the 21 “Big Data” projects
Collaboration between the Institute of Historical Research, Oxford
Internet Institute, British Library and Aarhus University
Develop theoretical and methodological framework for the study of
web archives
Build on AADDA: researchers and the BL co-produce access tools
A major study of the history of UK web space from 1996 to 2013 +
sub-projects covering a range of disciplines
Also an online training course and peer-reviewed journal articles.
Web archiving researcher bursaries
Query building
Corpus formation and
handling
Annotation and curation
In-corpus analysis
Whole-dataset analysis
Shine
What’s in it for us?
Helps researchers understand the value of web archives and explore new
ways of using these for scholarly research
Allows BL to obtain hands-on experience with indexing and processing
large scale web archive datasets
Prototype analytics and visualisations can be applied to our own Legal
Deposit collection
Enables BL to participate in various UK, European and international
projects
Helps curators understand characteristics of large scale digital corpora
Improves the way we collect and store web archives
Evolution of the UK web (2004-2013)
Memento service
The “access” paradoxes
Completeness versus openness of web archives
Some countries don’t have Legal Deposit
Legal Deposit national collections have restricted access
Document-centred versus data-driven
Pre-selected or defined collections not relevant to all researchers;
difficulty in finding relevant content in large scale web archive.
Arbitrary (national) boundaries often irrelevant to research questions,
but most heritage institutions operate within certain geographical
areas
…
Web archives for reference AND for
analytics
Base-line knowledge self-explanatory
Focus on national events for curated
collections; provide means to assemble
research corpora
Link to what we do not have
Offer a bag of tools to support scholarly use
A way forward
Exploit open licences, changes to copyright law
Online access to selected websites, metadata and secondary datasets
The British Library Collection Development Policy for websites
Lobbying – review of Non-print Legal Deposit Regulations in 2018