Web site archiving by capturing all unique responses. Kent Fitch, Project Computing Pty Ltd. Archiving the Web Conference Information Day, National Library of Australia, 12 November 2004.

Page 1:

Web site archiving by capturing all unique responses

Kent Fitch, Project Computing Pty Ltd

Archiving the Web Conference Information Day

National Library of Australia, 12 November 2004

Page 2:

Reasons for archiving web sites

● They are important
– Main public and internal communication mechanism
– Australian "Government Online", US Government Paperwork Elimination Act

● Legal
– Act of publication
– Context as well as content

● Reputation, Community Expectations

● Commercial Advantage

● Provenance

Page 3:

Web site characteristics

● Increasingly dynamic content

● Content changes relatively slowly as a % of total

● A small set of pages accounts for most hits

● Most responses have been seen before

Page 4:

Web site characteristics

Page 5:

Desirable attributes of an archiving methodology

● Coverage
– Temporal
– Responses to searches, forms, scripted links

● Robustness
– Simple
– Adaptable

● Cost
– Feasible
– Scalable

● Ability to recreate web site at a point in time
– Exactly as originally delivered
– Support analysis, recovery

Page 6:

Approaches to archiving

● Content archiving
– Input side: capture all changes

● "Snapshot"
– Crawl
– Backup

● Response archiving
– Output side: capture all unique request/responses

Page 7:

Content archiving vs Snapshot vs Response archiving

● Cost
– Content archiving: often part of CMS, small volumes ✔
– Snapshot: complete crawl is large ✘
– Response archiving: small volumes ✔, but collection overhead and in the critical path ✘

● Coverage
– Content archiving: dynamic content hard to capture, subvertable ✘
– Snapshot: incomplete (no forms, scripts, ...), gap between crawls ✘
– Response archiving: address space and temporally complete, not subvertable, too complete! ✔

● Robust
– Content archiving: assumes all content is perfectly managed
– Snapshot: simple ✔
– Response archiving: conceptually simple, independent of content type

● Recreate web site
– Content archiving: requires "live site": hardware, software, content, data, authentication, ...
– Snapshot: faithful but incomplete
– Response archiving: faithful and complete

Page 8:

Is response archiving feasible?

● Yes, because:

– Only a small % of responses are unique

– Overhead and robustness can be addressed by design

– Non-material changes can be defined
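The core observation can be sketched in a few lines: a response is archived only when its content digest has not been seen before for that URL, so duplicate responses (the vast majority) cost only a checksum. This is a minimal illustration of the idea, not pageVault's implementation; the names and in-memory dictionary are assumptions.

```python
import hashlib

# Illustrative only: (url, digest) pairs already archived.
seen = {}

def is_unique(url: str, body: bytes) -> bool:
    digest = hashlib.sha256(body).hexdigest()
    key = (url, digest)
    if key in seen:
        return False          # identical response already archived
    seen[key] = True
    return True               # new content: hand off to the archiver

print(is_unique("/home", b"<html>v1</html>"))  # True: first sighting
print(is_unique("/home", b"<html>v1</html>"))  # False: duplicate response
print(is_unique("/home", b"<html>v2</html>"))  # True: content changed
```

Because only unique responses proceed past the digest check, storage grows with the rate of change, not the rate of traffic.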

Page 9:

Approaches to response archiving

● Network sniffer
– Not in the critical path
– Cannot support HTTPS

● Proxy
– End-to-end problems (HTTPS, client IP address)
– Extra latency (TCP/IP session)

● Filter
– Runs within the web server
– Full access to request/response
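The filter approach can be illustrated with a toy middleware: code running inside the server sees the complete request and response, including HTTPS traffic that a network sniffer would miss. This WSGI sketch is an assumption for illustration, not pageVault's actual Apache/IIS filter.

```python
# A minimal "filter" sketch: wrap a WSGI app so that, after each response
# is produced, (path, body) is handed off to an archive. Hypothetical names.
def archiving_filter(app, archive):
    def wrapped(environ, start_response):
        body = b"".join(app(environ, start_response))
        archive.append((environ.get("PATH_INFO", ""), body))  # cheap hand-off
        return [body]
    return wrapped

# Usage with a trivial WSGI application:
def hello(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]

archive = []
app = archiving_filter(hello, archive)
result = app({"PATH_INFO": "/hi"}, lambda status, headers, exc_info=None: None)
print(archive)  # [('/hi', b'hello')]
```

The real filter would do as little as possible at this point (checksum and queue), leaving comparison and storage to an external process.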

Page 10:

A Filter implementation: pageVault

● Simple filter “gatherer”

– Uses Apache 2 or IIS server architecture
– Big problems with Apache 1

● Does as little as possible within the server

Page 11:

pageVault Architecture

Page 12:

pageVault design goals

● Filter must be simple, efficient, robust
– Negligible impact on server performance
– No changes to web applications

● Selection of responses to archive based on URL, content type
– Support definition of "non-material" differences

● Flexible archiving
– Union archives, split archives

● Complete "point in time" viewing experience
– Plus support for analysis
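Defining "non-material" differences amounts to masking volatile fragments of a response before comparing it with what is already archived. A sketch of that idea, assuming regex-based exclusions (the patterns below are invented examples, not pageVault's configuration syntax):

```python
import hashlib
import re

# Hypothetical exclusion patterns for fragments that change on every
# request but carry no archival meaning.
NON_MATERIAL = [
    re.compile(rb"Generated at \d{2}:\d{2}:\d{2}"),
    re.compile(rb"sessionid=[0-9a-f]+"),
]

def material_digest(body: bytes) -> str:
    # Strip non-material fragments, then checksum what remains.
    for pattern in NON_MATERIAL:
        body = pattern.sub(b"", body)
    return hashlib.sha256(body).hexdigest()

a = material_digest(b"<p>Hello</p> Generated at 09:30:01")
b = material_digest(b"<p>Hello</p> Generated at 17:45:59")
print(a == b)  # True: the two pages differ only non-materially
```

Two responses that differ only in excluded fragments produce the same digest, so the second is treated as a duplicate and never stored.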

Page 13:

Sample pageVault archive capabilities

● What did this page/this site look like at 9:30 on 4th May last year?

● How many times and exactly how has this page changed over the last 6 months?

● Which images in the "logos" directory have changed this week?

● Show these versions of this URL side-by-side

● Which media releases on our site have mentioned "John Laws", however briefly they were available?
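Queries like "what did this page look like at 9:30 on 4th May?" reduce to finding, for a URL, the latest captured version at or before a timestamp. pageVault keeps its index in a B+Tree (via JDBM, per the acknowledgements); a sorted list with binary search stands in for it in this sketch, and all data below is invented.

```python
import bisect

# Hypothetical index: URL -> list of (capture_time, version) sorted by time.
index = {
    "/media/release1.html": [(100, "v1"), (250, "v2"), (400, "v3")],
}

def as_at(url: str, t: int):
    """Return the version of `url` as delivered at time `t`, or None."""
    versions = index.get(url, [])
    # Find the last capture whose timestamp is <= t.
    i = bisect.bisect_right(versions, (t, chr(0x10FFFF)))
    return versions[i - 1][1] if i else None

print(as_at("/media/release1.html", 300))  # v2: latest capture at or before t=300
print(as_at("/media/release1.html", 50))   # None: URL not yet captured at t=50
```

Listing changes over a period or diffing versions side by side are range scans over the same per-URL version list.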

Page 14:

Performance impact

● Determining "uniqueness" requires calculation of a checksum
– 0.2ms per 10KB [*]

● pageVault adds 0.3 - 0.4ms to service a typical request
– a "minimal" static page takes ~1.1ms
– typical scripted pages take ~5 - 100ms
– the performance impact of determining strings to exclude as "non-material" is negligible

[*] Apache 2.0.40, Sparc 750MHz processor, Solaris 8
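The quoted 0.2ms per 10KB reflects 2004 hardware (a 750MHz SPARC). Measuring the equivalent cost today is straightforward; the choice of MD5 below is an assumption, as the slides do not name pageVault's checksum algorithm.

```python
import hashlib
import timeit

# Time the checksum of a 10KB response body, averaged over many runs.
payload = b"x" * 10_240

per_call = timeit.timeit(
    lambda: hashlib.md5(payload).digest(), number=10_000
) / 10_000
print(f"{per_call * 1000:.4f} ms per 10KB checksum")
```

On modern commodity CPUs this lands in the low microseconds, a few orders of magnitude cheaper than a typical scripted page, which supports the claim that the checksum is not the bottleneck.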

Page 15:

Comparison with Vignette’s WebCapture

WebCapture:
● Enterprise-sized, integrated, strategic
● Large investment
● Focused on transactions
● Aims to be able to replay transactions

pageVault:
● Simple, standalone, lightweight
● Inexpensive
● Targets all responses
● Aims to recreate all responses on the entire website

Page 16:

pageVault applicability

● Simple web site archives

● Notary service
– Independent archive of delivered responses

● Union archive
– Organisation-wide (multiple sites)
– National archive
– Thematic collection

Page 17:

Summary

● Effective web site archiving is an unmet need
– Legal
– Reputation, community expectation
– Provenance

● Complete archiving with input-side and snapshot approaches is impractical

● An output-side approach can be scalable, complete, inexpensive

Page 18:

Thanks to...

● Russell McCaskie, Records Manager, CSIRO

– Russell brought the significant issues in preserving and managing web-site content to our attention in 1999

● The JDBM team

– An open source B+Tree implementation used by pageVault

Page 19:

More information

http://www.projectcomputing.com

[email protected]