Working together to archive the UK Web Helen Hockx-Yu Head of Web Archiving, British Library

Source: ncdd.nl/site/wp-content/uploads/2014/12/NCDDWorkshop_HHY_Final…


Working together to archive the UK Web

Helen Hockx-Yu

Head of Web Archiving, British Library

www.bl.uk

The UK Web Domain

4th largest TLD after .com, .de and .net

Over 10 million registered .uk domains

UK organisations also use non-.uk domain names (e.g. .com or .org); scale unknown

Non-print Legal Deposit (since April 2013) applies to:

• the open (freely available) web: .uk and other UK-published (non-.uk) websites, such as .com, .org…

• also e-journals, e-books, news web pages and other digital publications, either by harvesting or by mutual agreement on other delivery methods


Web Archiving at the British Library

Collect UK digital heritage and provide continued access to archived web resources

Started web archiving in 2003: Open UK Web Archive

Selective, topical collections and key sites

Consortium sharing infrastructure and development effort; agreement on who collects what

Curating collections with organisations and researchers

Archiving UK Web for non-print Legal Deposit since April 2013: Legal Deposit UK Web Archive

Comprehensive national archive with on-site access only

Joint responsibility of six Legal Deposit Libraries (LDLs)


UK Legal Deposit libraries

The British Library

Bodleian Libraries of the University of Oxford

Cambridge University Library

The National Library of Scotland

The Library of Trinity College, Dublin

The National Library of Wales


Non-print Legal Deposit Governance

Governance is important

Representation of stakeholders

Ensure accountability and effectiveness of implementation

Joint decision making

Collaboration

Key groups

Joint Committee for Legal Deposit: collaboration with publishers, e.g. legal deposit content on users’ devices

Legal Deposit Libraries Committee, e.g. notice and take-down policy

Legal Deposit Implementation Group, e.g. collect embedded content (e.g. CSS, images) regardless of where it is hosted

Web Archiving Collection Prioritisation Group
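
The embedded-content decision noted above can be pictured as a crawl scope rule: a resource is collected if it belongs to a UK site, or if it is embedded in a page from a UK site, wherever it is hosted. The following is a minimal sketch under stated assumptions, not the Legal Deposit Libraries' actual crawler configuration. The function names, the `UK_PUBLISHED_HOSTS` set and the example hosts are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical list of non-.uk hosts judged to be UK-published.
# How such a list is built in practice is not described in the slides.
UK_PUBLISHED_HOSTS = {"example.com"}


def host_of(url):
    """Return the lower-cased host name of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def is_uk_site(url):
    """True for .uk hosts and for hosts on the UK-published list."""
    host = host_of(url)
    return host.endswith(".uk") or host in UK_PUBLISHED_HOSTS


def in_scope(url, embedded_in=None):
    """Decide whether a URL should be collected.

    embedded_in is the URL of the page that embeds this resource
    (e.g. via a stylesheet link or image tag), or None for an
    ordinary hyperlink.
    """
    if is_uk_site(url):
        return True
    # Embedded resources are collected regardless of where they are
    # hosted, as long as the embedding page itself is in scope.
    return embedded_in is not None and is_uk_site(embedded_in)
```

Under this sketch, a stylesheet on a foreign CDN is in scope only when reached as an embed of a UK page, not when reached by following an ordinary link.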


Collecting strategy for websites

Domain crawl:

• Broad sweep of UK domain

• Once or twice a year

Events, key sites and news:

• Events of UK interest

• High value, high impact sites

• National & regional news

Special collections:

• Focused, thematic collections

• Support priority subjects

[Diagram labels: Domain Crawl; News; Key sites; Events; Special collection (repeated four times)]


The Digital Library System

4 Nodes (Complete Copies)

British Library, St. Pancras

British Library, Boston Spa

National Library of Wales

National Library of Scotland

Additional Access Points

Bodleian Library, Oxford

Cambridge University Library

Trinity College Library, Dublin


Beyond the LDLs

Curating (open access) collections

World War 1 Collection, including 1000+ Centenary Community Projects funded by the Heritage Lottery Fund

The National Archives

UK Government Web Archive

The Digital Preservation Coalition

Web Archiving Task Force

Technology Watch report

The Web Observatory

Web archives as data on the web

The crowd: nomination form & Twitter to encourage selection


International collaboration

International Internet Preservation Consortium (IIPC)

49 members worldwide

British Library plays an active role in the IIPC

A founding member

On the Steering Committee

Hosts the IIPC Programme and Communications Officer

Benefits of collaboration

Community of practice

Tools development, eg OpenWayback

Staff training and development

Collaborative collections, eg Olympic Games


Collaboration with researchers

Building collections

Researchers’ involvement in scoping collections, selecting and describing websites

Creation of specific (narrow) topical collections

Formulating research questions

Brainstorming sessions, workshops, discussion, surveys etc.

Lack of awareness & baseline knowledge

Challenging: you don’t know what you don’t know

Co-development of access services

This is changing how we collect and store data


JISC UK Web Domain dataset (1996-2013)

Collaboration between the Internet Archive (IA), the Joint Information Systems Committee (JISC) and the British Library

Extracted copies of UK websites from the Internet Archive’s collection

1st tranche: 1996 – 2010, 30 TB, 2.5 billion URLs

2nd tranche: 2010 – April 2013, 27.5 TB, 1.5 billion URLs (estimated)

Research agreement between JISC and IA, upholding IA’s Terms of Use

Access via IA’s Wayback Machine

Allows replication / extraction of derivative or secondary datasets

BL hosts the dataset on behalf of JISC

Data used by research projects

Institute of Historical Research project: Analytical Access to the Domain Dark Archive (AADDA)

Oxford Internet Institute project: Big data for political science
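
For a sense of scale, the two tranche figures above can be combined. This is a back-of-the-envelope sketch, not an official statistic: it assumes decimal terabytes (10^12 bytes), treats the estimated second-tranche URL count as exact, and the names `TRANCHES` and `summarise` are illustrative only.

```python
# Figures as stated on the slide; the second tranche's URL count is an estimate.
TRANCHES = [
    {"label": "1996-2010", "size_tb": 30.0, "urls": 2_500_000_000},
    {"label": "2010-April 2013", "size_tb": 27.5, "urls": 1_500_000_000},
]

BYTES_PER_TB = 10**12  # assumption: decimal terabytes


def summarise(tranches):
    """Return total size (TB), total URL count and mean stored bytes per URL."""
    total_tb = sum(t["size_tb"] for t in tranches)
    total_urls = sum(t["urls"] for t in tranches)
    avg_bytes_per_url = total_tb * BYTES_PER_TB / total_urls
    return total_tb, total_urls, avg_bytes_per_url


total_tb, total_urls, avg_bytes = summarise(TRANCHES)
print(f"{total_tb} TB, {total_urls:,} URLs, about {avg_bytes / 1000:.1f} kB per URL")
```

On these assumptions the dataset comes to 57.5 TB and about 4 billion URLs, or roughly 14 kB stored per URL on average.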


Big UK Domain Data for Arts and Humanities

Funded by the UK Arts and Humanities Research Council as one of the 21 “Big Data” projects

Collaboration between the Institute of Historical Research, Oxford Internet Institute, British Library and Aarhus University

Develop theoretical and methodological framework for the study of web archives

Build on AADDA: researchers and the BL co-produce access tools

A major study of the history of UK web space from 1996 to 2013, plus sub-projects covering a range of disciplines

Also an online training course and peer-reviewed journal articles.


Web archiving researcher bursaries


Query building

Corpus formation and handling

Annotation and curation

In-corpus analysis

Whole-dataset analysis

Shine


What’s in it for us?

Helps researchers understand the value of web archives and explore new ways of using these for scholarly research

Allows BL to obtain hands-on experience with indexing and processing large-scale web archive datasets

Prototype analytics and visualisations can be applied to our own Legal Deposit collection

Enables BL to participate in various UK, European and international projects

Helps curators understand characteristics of large-scale digital corpora

Improves the way we collect and store web archives


Evolution of the UK web (2004-2013)


Memento service


The “access” paradoxes

Completeness versus openness of web archives

Some countries don’t have Legal Deposit

Legal Deposit national collections have restricted access

Document-centred versus data-driven

Pre-selected or defined collections are not relevant to all researchers; relevant content is difficult to find in large-scale web archives.

Arbitrary (national) boundaries are often irrelevant to research questions, but most heritage institutions operate within certain geographical areas


A way forward

Web archives for reference AND for analytics

Base-line knowledge self-explanatory

Focus on national events for curated collections; provide means to assemble research corpora

Link to what we do not have

Offer a bag of tools to support scholarly use

Exploit open licences, changes to copyright law

Online access to selected websites, metadata and secondary datasets

The British Library Collection Development Policy for websites

Lobbying – review of Non-print Legal Deposit Regulations in 2018