A survey of Web preservation initiatives Michael Day UKOLN, University of Bath m.day@ukoln.ac.uk 7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2003), Trondheim, Norway, 17-22 August 2003


A survey of Web preservation initiatives

Michael Day

UKOLN, University of Bath

m.day@ukoln.ac.uk

7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2003),

Trondheim, Norway, 17-22 August 2003


Presentation overview

• The importance of the Web
• Challenges:
  – Technical, legal, and organisational challenges
• Approaches to collection:
  – Harvesting-based, selective, and deposit; combined approaches
• Discussion:
  – Collection and access policies, software, costs, long-term preservation


Importance of the Web

An all-pervasive communication medium:

• In research:
  – Scientists are 'increasingly reliant' on the Web for supporting research (Hendler, 2003)
• Wider societal role:
  – Personal communication, e-commerce, etc.
  – "… the information source of first resort for millions of readers" (Lyman, 2002)


The UKOLN study

Feasibility study produced for:
– Joint Information Systems Committee (JISC)
– Wellcome Library

– A survey of initiatives
– Recommendations for the JISC and Wellcome Library
– Supplementary legal study (Charlesworth)
– Published February 2003

http://library.wellcome.ac.uk/projects/archiving_reports.shtml


Technical challenges (1)

Size of Web:
– Surface Web > 50 TB (2000) … and still growing
– The 'deep Web'
– Scale of task means that Web-archiving needs to be a collaborative activity


Technical challenges (2)

Dynamic nature of Web:
– Web pages disappear on average after 75 days
– Many leave no trace

Evolution of Web-based technologies:
– Increasing reliance on databases, scripts, plug-ins, etc.
– A 'moving target'


Legal challenges

Copyright

Content liability, e.g.:
– Defamation
– Data protection

In the UK:
– Selective approach would be the safest solution (unless law changes)

See: Charlesworth (2003)
http://library.wellcome.ac.uk/projects/archiving_reports.shtml


Organisational challenges

Decentralised organisation:
– Web-archiving initiatives focus on defined sub-sets of the Web, e.g.:
  – National domain, subject, organisation type
– Need for co-operation between initiatives

Quality:
– Much on Web is low-quality (or worse)
– Is there a need to preserve all of this?


Initiatives (1)

The Internet Archive:
– Largest initiative, running since 1996
– Co-operates on special collections and with other repositories

National libraries:
– Pioneer archives in Sweden (Kulturarw3) and Australia (PANDORA)
– Now many, many more
– Changes to legal deposit legislation in some countries


Initiatives (2)

National archives:
– Focus on government Web-sites (however defined)
– Guidance for Web-site managers:
  – e.g., UK and Australia
– Snapshots:
  – e.g., USA and UK

Other:
– Universities and scholarly societies:
  – e.g., Archipol, Occasio archive, Political Communications Web Archiving (Cornell)


Approaches (1)

Automatic harvesting:
– Use of Web crawler technologies
– Crawler follows links and downloads content
– Pioneered by Internet Archive and Kulturarw3 project
– Also used for the gathering of the Finnish and Austrian Web
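
The harvesting loop described above (follow links, download content) can be sketched in a few lines. This is a minimal illustration and not the crawler used by any of the initiatives named here: the `harvest` function, the `fetch` callable, and the in-memory `FAKE_WEB` pages are all hypothetical, and a dictionary stands in for real HTTP requests so that the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def harvest(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`; returns {url: html} for pages fetched.

    `fetch(url)` is a caller-supplied function returning HTML text, or None
    when the page cannot be retrieved.
    """
    archive = {}
    queue = deque([seed])
    seen = {seed}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        archive[url] = html  # "downloads content"
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:  # "follows links"
            # Resolve relative links and drop #fragments before queueing.
            target, _fragment = urldefrag(urljoin(url, href))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return archive


# Hypothetical three-page site used in place of the live Web.
FAKE_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.org/a.html": '<a href="/">home</a>',
    "http://example.org/b.html": '<a href="/missing.html">gone</a>',
}

archive = harvest("http://example.org/", FAKE_WEB.get)
print(sorted(archive))
```

A production harvester would presumably add request throttling, robots.txt handling, and persistent storage; those are left out to keep the link-following logic visible.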


Approaches (2)

Selective approaches:
– Selection of individual Web sites
– Negotiate rights with site owners
– Collection using gathering or mirroring software, FTP, or e-mail
– Pioneered in PANDORA project
– Experimented with by Library of Congress and British Library

Deposit approaches:
– Site owners/administrators deposit site in repositories


Approaches (3)

Combined approaches:
– Combines the advantages of the harvesting and selective approaches
– Pioneered by the Bibliothèque nationale de France
– Experimented with enhancements to the harvesting approach
  • e.g., noting the change frequency of sites, and their 'importance'
  • Uses the selective approach for the 'deep Web'
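
One way to picture the harvesting enhancement mentioned above is a revisit queue ordered by a score that combines how often a site changes with how 'important' it is judged to be. The sketch below is an illustration only and is not the Bibliothèque nationale de France's method: the `Site` fields, the scoring formula (changes per day multiplied by an importance weight), and the sample values are all assumptions made for the example.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Site:
    url: str
    changes_per_day: float  # assumed estimate of how often the site changes
    importance: float       # assumed weight in [0, 1]


def revisit_order(sites):
    """Return URLs ordered from highest to lowest crawl priority."""
    # heapq is a min-heap, so the score is negated to pop the largest first.
    heap = [(-(s.changes_per_day * s.importance), s.url) for s in sites]
    heapq.heapify(heap)
    order = []
    while heap:
        _score, url = heapq.heappop(heap)
        order.append(url)
    return order


# Hypothetical sites and values.
sites = [
    Site("http://news.example/", changes_per_day=24.0, importance=0.5),
    Site("http://static.example/", changes_per_day=0.01, importance=0.9),
    Site("http://blog.example/", changes_per_day=1.0, importance=0.8),
]
print(revisit_order(sites))
```

With these made-up values the frequently changing site is visited first even though its importance weight is lower, which shows how the two signals trade off under a multiplicative score.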


Collection policies

Dependent on technical approach chosen:
– National domain ++ (for harvesting-based approaches)
– Collection guidelines (for selective approaches)
  – Based on relevance, provenance, quality, etc.
– Frequency of capture
– Possible overlap with subject gateway initiatives, e.g. the Resource Discovery Network (RDN) in the UK
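
For a harvesting-based collection policy, the scope rule has to be expressed in a form a crawler can apply to each URL. The sketch below is one hedged reading of 'national domain ++': accept hosts under a national top-level domain, plus an explicit list of additional hosts judged to be in scope. The function name `in_scope`, the `.se` suffix, and the extra host are hypothetical choices for the example, not a policy taken from any initiative.

```python
from urllib.parse import urlsplit


def in_scope(url, national_suffix=".se", extra_hosts=frozenset()):
    """True if the URL's host is in the national domain or on the extra list."""
    host = (urlsplit(url).hostname or "").lower()
    if not host:
        return False
    return host.endswith(national_suffix) or host in extra_hosts


# Hypothetical extra host: a site of national interest outside the national domain.
EXTRA = frozenset({"www.example.com"})

print(in_scope("http://www.example.se/index.html", extra_hosts=EXTRA))
print(in_scope("http://www.example.com/", extra_hosts=EXTRA))
print(in_scope("http://www.example.org/", extra_hosts=EXTRA))
```

Keeping the extra hosts in a separate list means the automatic rule and the manually selected additions can be reviewed independently.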


Approximate size (2002)

Country    Initiative          Type  Size (GB)     No. sites
USA        Internet Archive    H     >150,000.00
Sweden     Kulturarw3          H     4,500.00
France     BnF                 C     <1,000.00
Austria    AOLA                H     448.00
Australia  PANDORA             S     405.00        3,300
Finland    HUL                 H     401.00
UK         Britain on the Web  S     0.03          100
USA        MINERVA             S     *             35

Type: H = harvesting-based, S = selective, C = combined

Source: Day (2003)
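
For the two selective initiatives that report both a size and a site count, the table allows a rough average collection size per site. The calculation below is a back-of-envelope sketch that uses only the table's figures; it assumes the size column is in gigabytes and that 1 GB = 1,000 MB, and the function name is made up for the example.

```python
def mb_per_site(size_gb, sites):
    """Average archive size per site in MB, assuming 1 GB = 1,000 MB."""
    return size_gb * 1000 / sites


# Figures from the table above (2002).
pandora = mb_per_site(405.00, 3300)
britain_on_the_web = mb_per_site(0.03, 100)

print(f"PANDORA: {pandora:.1f} MB per site")
print(f"Britain on the Web: {britain_on_the_web:.1f} MB per site")
```

The averages say nothing about capture frequency or the number of snapshots per site, so they are only a crude comparison.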


Access policies

Access policies differ:
– Internet Archive and the PANDORA archive make data available
  – e.g., the Wayback Machine
– Other collections effectively closed (for legal reasons or because experimental)
– Need for specialised Web indexes that can search and navigate large collections of Web material
  – e.g., Nordic Web Archive (NWA) Toolset


Software

Various software in use:
– Harvesting:
  – Adapted Combine harvester, NEDLIB harvester, Xyleme, Alexa
– Selective:
  – HTTrack (popular), etc.
  – PANDAS (PANDORA Digital Archiving System): helps with managing the process, adding metadata, etc.


Costs

Costs vary widely:
– Selective approach much more expensive (per TB) than bulk harvesting
– But resulting archives are more widely accessible
– Significant costs in undertaking rights clearance


Long-term preservation

Many initiatives until now mainly focused on the collection of resources:

– Need to consider the longer term
– Descriptive and technical metadata
– Migration needs (e.g. for complex sites)
– Need for Web archiving initiatives to become trusted repositories
– Need to be embedded into the 'core activities' of their host organisation
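
To make the distinction between descriptive and technical metadata concrete, the sketch below shows one possible shape for a record stored alongside a captured resource. It is illustrative only: the class names and every field are assumptions chosen for the example and do not follow any particular metadata standard or any initiative's schema.

```python
import hashlib
from dataclasses import dataclass, asdict


@dataclass
class DescriptiveMetadata:
    title: str
    publisher: str
    subject: str


@dataclass
class TechnicalMetadata:
    url: str
    capture_date: str  # ISO 8601 date of the capture
    mime_type: str
    size_bytes: int
    sha256: str        # fixity value for later integrity checks


def describe_capture(url, capture_date, mime_type, content, title, publisher, subject):
    """Build both metadata records for one captured resource (content is bytes)."""
    technical = TechnicalMetadata(
        url=url,
        capture_date=capture_date,
        mime_type=mime_type,
        size_bytes=len(content),
        sha256=hashlib.sha256(content).hexdigest(),
    )
    descriptive = DescriptiveMetadata(title=title, publisher=publisher, subject=subject)
    return {"descriptive": asdict(descriptive), "technical": asdict(technical)}


# Hypothetical capture used only to exercise the function.
record = describe_capture(
    url="http://example.org/",
    capture_date="2003-08-17",
    mime_type="text/html",
    content=b"<html><body>Hello</body></html>",
    title="Example home page",
    publisher="Example organisation",
    subject="Demonstration",
)
print(record["technical"]["size_bytes"])
```

Recording the format and a checksum at capture time is one plausible way to support later migration and integrity checking, which is why those fields appear in the technical record here.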


Summing up

• Much experimentation to date, but now moving into implementation phase
• Co-operation and collaboration are important
• Combined technical approaches offer best way forward
• Legal challenges still problematic
• Long-term preservation issues still to be explored in detail


Acknowledgements

UKOLN is funded by Resource: the Council for Museums, Archives and Libraries, the Joint Information Systems Committee (JISC) of the UK higher and further education funding councils, as well as by project funding from the JISC and the European Union. UKOLN also receives support from the University of Bath, where it is based.

http://www.ukoln.ac.uk/