
Page 1: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Lessons Learned Archiving the National Web of New Zealand

Kris Carpenter Negulescu, The Internet Archive

Gordon Paynter, The National Library of New Zealand

Future Perfect 2012, 27 March 2012, Wellington, New Zealand

Page 2: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Why collect the web?

Page 3: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Legal deposit

• The National Library of New Zealand Act (2003)

• “Legal deposit” now includes “Internet documents”

• Available from http://legislation.govt.nz/

Page 4: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Two web archiving programmes

Selective: Harvesting of specific websites or parts of websites

Domain: Harvesting of the entire “New Zealand Internet”

http://topics.breitbart.com/fishing+pole/

http://www.trimarinegroup.com/operations/fleet.php

Page 5: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Selective Web Archiving

Page 6: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Selective web archiving

Page 7: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Selective web archiving

Page 8: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Selective web archiving

Page 9: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Selective web archiving

[Architecture diagram: submission tools (the Web Curator Tool, digitisation & sound preservation) feed web harvests and other published & unpublished material into the NDHA (Rosetta) for preservation and administration; access tools (National Library Beta, Voyager, Tapuhi, Timeframes, Papers Past) and the Rosetta access modules, including ArcViewer, provide access, alongside collection management systems (IAMS) on shared technology infrastructure.]

Page 10: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Selective web archiving

Page 11: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Selective web archiving

Page 12: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Selective web archiving

From January 2007: 14,182 harvests

• 83% Endorsed and Archived

• 17% Rejected or Aborted

• Using the Web Curator Tool

From 2000-2006: 441 harvests

• Some covering multiple websites

• Using a desktop website capture tool

Page 13: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

New Zealand Web Harvests

October 2008 and April 2010

Page 14: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

New Zealand Web Harvests

• Scope

• Seeds

• Robots Policy

• Notification and communications

• How are we going to accomplish this?

• When are we going to stop?

Page 15: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

New Zealand Web Harvests

2008

• 17 days in October

• 106,184,620 URLs

• 4.6 Terabytes

• 397,000 hosts

• Seeds are known hosts

2010

• 24 days in April-May

• 131,770,485 URLs

• 6.9 Terabytes

• 559,000 hosts

• Seeds include .nz, .com, .org and .net zone files
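As a concrete illustration of how zone files can become crawl seeds (not from the presentation; a minimal sketch in Python, assuming a standard master-file zone format and hypothetical file names):

# Minimal sketch: derive crawl seeds from a DNS zone file.
# Assumes the owner name is the first whitespace-separated field of each
# NS (delegation) record line; relative names under $ORIGIN are not handled.

def seeds_from_zone_file(path, scheme="http"):
    """Yield one seed URL per unique delegated domain in the zone."""
    seen = set()
    with open(path, encoding="utf-8", errors="replace") as zone:
        for line in zone:
            line = line.strip()
            if not line or line.startswith(";"):   # skip blanks and comments
                continue
            fields = line.split()
            if "NS" not in fields:                 # only delegation records
                continue
            domain = fields[0].rstrip(".").lower()
            if domain and domain not in seen:
                seen.add(domain)
                yield f"{scheme}://{domain}/"

# Example: write seeds from a hypothetical .nz zone file for the crawler.
if __name__ == "__main__":
    with open("seeds.txt", "w", encoding="utf-8") as out:
        for seed in seeds_from_zone_file("nz.zone"):
            out.write(seed + "\n")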

Page 16: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

New Zealand Web Harvests

• Harvest analysis:

– What exactly do we have?

– What’s a good harvest frequency?

• Preservation analysis:

– ARC or WARC format?

– Should they be stored in the National Digital Heritage Archive?

• Public access analysis:

– Ethical issues

– Privacy issues

– Legal and evidentiary value

– Copyright

Page 17: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Challenges and Lessons

Page 18: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Scope of a National Domain

• How is a national web domain defined?

– Hosts in the top-level domain, or domains operated by registrars in the country?

– Hosts known to be hosted on IP addresses within geographic boundaries?

– Content and advertising embedded in websites published on the above

– Curator-selected websites, destinations, or services considered to be within the bounds of a country’s legislative or cultural heritage mandate

Page 19: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Scope of a National Domain

• New Zealand Web Harvest scope (see the sketch below):

– Hosts in the .nz top-level domain

– Hosts from .com, .org and .net that are physically in New Zealand

– A list of hosts known to be within the scope of the legislation

– Image, video clips, and other files that are embedded in web pages on the hosts above

• New Zealand Web Harvest seeds:

– 2008: Gathered from the Library and the Internet Archive’s past crawls

– 2010: Zone files for .nz, .com, .org and .net (plus 2008 hosts)
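A minimal sketch of how those host-level scope rules might be checked in code (not the harvest’s actual implementation), assuming MaxMind’s geoip2 library with a GeoLite2-Country database for the “physically in New Zealand” test; KNOWN_IN_SCOPE stands in for the curator-maintained list of hosts known to be within the legislation:

import socket
import geoip2.database
import geoip2.errors

GEO_READER = geoip2.database.Reader("GeoLite2-Country.mmdb")  # assumed database path
KNOWN_IN_SCOPE = set()  # stand-in for the curated list of known in-scope hosts

def in_scope(host):
    """Apply the three host-level scope rules; embedded files are handled at crawl time."""
    host = host.lower().rstrip(".")
    if host.endswith(".nz"):                      # hosts in the .nz top-level domain
        return True
    if host in KNOWN_IN_SCOPE:                    # hosts known to be within the legislation
        return True
    if host.endswith((".com", ".org", ".net")):   # .com/.org/.net hosts physically in NZ
        try:
            ip = socket.gethostbyname(host)
            return GEO_READER.country(ip).country.iso_code == "NZ"
        except (socket.gaierror, geoip2.errors.AddressNotFoundError):
            return False
    return False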

Page 20: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Shape of harvest

• How broad or deep should the harvest be?

– Usually as broad as possible (survey of all resources at the highest levels)

– Usually deep enough to collect the primary resources of interest while minimizing the unwanted, unrelated junk prevalent in any top-level domain

Page 21: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Shape of harvest

• New Zealand Web Harvest

– Up to 10,000 URLs from every host

– But up to 50,000 for .govt.nz and .ac.nz.

• On average, about 250 URLs (12 megabytes) per host
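That per-host average is consistent with the earlier totals: the 2010 harvest works out to about 236 URLs and 12 MB per host (131,770,485 URLs and 6.9 TB over roughly 559,000 hosts), and 2008 to about 267 URLs and 12 MB per host. A minimal sketch of how such a per-host cap might be enforced (illustrative only; in practice the crawler’s own quota settings do this):

# Illustrative per-host URL budget: 10,000 URLs per host by default,
# 50,000 for .govt.nz and .ac.nz hosts, as described above.
from collections import Counter
from urllib.parse import urlsplit

DEFAULT_LIMIT = 10_000
EXTENDED_LIMIT = 50_000
EXTENDED_SUFFIXES = (".govt.nz", ".ac.nz")

fetched_per_host = Counter()

def url_limit(host):
    return EXTENDED_LIMIT if host.endswith(EXTENDED_SUFFIXES) else DEFAULT_LIMIT

def should_fetch(url):
    """Return True while the URL's host is still under its budget."""
    host = urlsplit(url).hostname or ""
    if fetched_per_host[host] >= url_limit(host):
        return False
    fetched_per_host[host] += 1
    return True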

Page 22: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Harvest Policies & Practices

• Robots Policy

– Respect robots.txt

– Ignore it for embedded and inline content on unrestricted pages (see the sketch after this list)

• Notification

– Notifications may be sent to site owners/publishers prior to harvest

• Politeness settings

– Usually limited to the load of a visitor navigating the site in a browser

• Trade-off of harvest duration vs scale of resources

– Need to keep the data capture period brief
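A minimal sketch of that robots policy using Python’s urllib.robotparser: honour robots.txt for ordinary pages, but allow embedded and inline content of pages that were themselves permitted. The user-agent string and the is_embed/parent_allowed flags are illustrative assumptions, not details from the harvest:

from urllib import robotparser
from urllib.parse import urlsplit

USER_AGENT = "NLNZWebHarvest"        # hypothetical agent string
_parsers = {}

def _robots_for(url):
    """Fetch and cache the robots.txt parser for the URL's site."""
    root = "{0.scheme}://{0.netloc}".format(urlsplit(url))
    if root not in _parsers:
        rp = robotparser.RobotFileParser()
        rp.set_url(root + "/robots.txt")
        rp.read()
        _parsers[root] = rp
    return _parsers[root]

def may_fetch(url, is_embed=False, parent_allowed=False):
    """Apply robots.txt, except for embeds of pages we were allowed to fetch."""
    if is_embed and parent_allowed:
        return True
    return _robots_for(url).can_fetch(USER_AGENT, url)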

Page 23: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Harvest Policies & Practices

• New Zealand Web Harvest Robots Policy

– Selective: Ignore robots.txt (usually)

– 2008: Ignore robots.txt (unless asked otherwise)

– 2010: Mostly honour robots.txt (following consultation)

• Four to six weeks of notification through many channels

Page 24: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Harvest Infrastructure

• Dedicated crawlers to capture data

– Service nodes for reporting and access; shared infrastructure for automated QA, data mining and analysis

• Hardware:

– Quad-core processors (2.6 GHz)

– 4–8 GB RAM per core

– 8+ terabytes of local disk (four 2-terabyte SATA drives)

• Software:

– Ubuntu Linux

– Java(TM) SE Runtime Environment (latest build)

– Heritrix 3 or v1.14.x

• Network:

– Bandwidth limited to ~300 Mbits/sec per project
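A back-of-the-envelope check (arithmetic on the figures above, not a number from the slides): at a ~300 Mbit/s cap, the 6.9 terabytes captured in 2010 amounts to only about two days of raw transfer, so politeness limits and per-host breadth, not bandwidth, drive the multi-week crawl duration.

# Rough arithmetic only: minimum transfer time for the 2010 harvest volume
# at the ~300 Mbit/s per-project cap.
TERABYTE = 10**12
captured_bytes = 6.9 * TERABYTE          # 2010 harvest size
link_bits_per_sec = 300 * 10**6          # ~300 Mbit/s cap

seconds = captured_bytes * 8 / link_bits_per_sec
print(f"Minimum raw transfer time: {seconds / 86400:.1f} days")   # ~2.1 days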

Page 25: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Harvest Infrastructure

In-house

• Possibly cheaper

• Large staff requirement

• Hardware requirements

• Network requirements

• Risks: what don’t we know?

Commissioned

• Higher outright cost

• Contractor provides expertise: Heritrix, crawler traps, scope, etc

• Contractor provides staff, computers, bandwidth

The New Zealand Web Harvests were commissioned from the Internet Archive.

Unexpected issue: International bandwidth

Page 26: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Challenges of All Web Archiving

• Not all data can be crawled

• Can publishers “opt in” or “opt out”?

• Data may be lost no matter how carefully it is managed

• Harvested data is hard to make accessible

– Intuitive interfaces for discovering and navigating resources

– With robust APIs

– All done in a compelling and sustainable way

• Research and experimentation are essential to keep pace with publisher innovation

Page 27: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Challenges of Domain Archiving

• Harvests are at best samples

– Time & expense: can’t get everything

– Rate of change: don’t get every version

– Rate of collection: issues of ‘time skew’

• Choice of user agents/protocols

– If you crawl as the Mozilla agent, your content may not redisplay in IE

– Which mobile agents should you crawl as, if any?

• Site structure & publishing models

– Some parts of sites are not “archive-friendly” (JavaScript, AJAX, Flash, etc.)

– Sites change both their technical structure and their policies quickly and often (YouTube, Facebook, etc.)

Page 28: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Challenges of Domain Archiving

Social networks and collaborative/semi-private spaces

Immersive Worlds

70+% of the world’s digital content is now generated by individuals – not all of it can be crawled

(UK Telegraph, IDC annual survey, released May 2010)

Page 29: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Challenges of Domain Archiving

• Manageable Costs/Sustainable Approaches

– Access to power & other critical operational resources

– Sufficient processing capacity for collection, analysis, discovery, & dissemination of resources

– Bandwidth

• Recruitment and retention of staff/engineering expertise; effective ongoing training

Page 30: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Challenges of Domain Archiving

When do you stop crawling?

• The internet is infinitely large!

• Indicators that suggest diminishing returns have set in:

– A relatively small number of remaining hosts have a lot of depth

– More HTML than images appearing in the crawl log (see the sketch below)

– Higher incidence of crawler traps, content farms

• At this point we expect:

– We will capture proportionally more junk

– Website owners will complain that we're over-crawling
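A minimal sketch of the HTML-versus-images indicator, assuming a Heritrix-style crawl.log in which the content (MIME) type is the seventh whitespace-separated field; a rising ratio across successive snapshots of the log suggests the crawl has reached the deep, page-heavy tail:

def html_to_image_ratio(crawl_log_path):
    """Count text/html vs image/* responses in a Heritrix-style crawl.log."""
    html = images = 0
    with open(crawl_log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            fields = line.split()
            if len(fields) < 7:
                continue
            mime = fields[6].lower()        # assumed position of the MIME type
            if mime.startswith("text/html"):
                html += 1
            elif mime.startswith("image/"):
                images += 1
    return html / images if images else float("inf")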

Page 31: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Challenges of Domain Archiving

How do you assess the quality of a harvest?

• Quantitative measures of quality, breadth and depth

• Qualitative measures, including characterization of resources and how they fit with other collections

• Usually harvest for weeks, depending upon the desired scope, and then run a “patch crawl”
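A minimal sketch of one possible input to such a “patch crawl” (an assumption about workflow, not the method used here): pull URLs that failed during the main harvest out of a Heritrix-style crawl.log, where the second whitespace-separated field is the fetch status code and the fourth is the URI.

def patch_crawl_candidates(crawl_log_path):
    """Collect URLs whose fetch failed, as candidate seeds for a follow-up crawl."""
    candidates = set()
    with open(crawl_log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            fields = line.split()
            if len(fields) < 4:
                continue
            status, url = fields[1], fields[3]
            # Negative codes are crawler-internal failures; 5xx are server errors.
            if status.startswith("-") or status.startswith("5"):
                candidates.add(url)
    return sorted(candidates)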

Page 32: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Challenges of Domain Archiving

• Being responsive during a crawl

• New Zealand Web Harvest 2008:– 37 individual contacts during harvest

– 2 major mailing list discussions

– Blogs & Twitter

– Newspapers (“Library harvest costs website dear”) and radio

• A communications strategy and plan are essential

– The biggest difficulty is responding promptly outside working hours

Page 33: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Final thoughts?

What have we learned that is particularly relevant to New Zealand?

Page 34: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Final thoughts

• New Zealand faces the same challenges as our peers overseas

• Most of the world favours dedicated web archives

– But we’re preserving web material alongside other formats

• When will it be economical to harvest from within New Zealand?

Page 35: Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zealand

Final thoughts: how should national domain crawls work?

• Institutions crawl within their national domains from their own national infrastructure

• Institutions share tools, metadata, knowledge and best practices

– And, to the extent possible, data!

– Collaboration will always achieve greater results than acting alone!

• Over the long term, shared goals and resources can help mitigate economic and other barriers to the collection, mining, and access of New Zealand’s national digital heritage