WAX: A candle in the darkness

A digital to digital project

Wendy Gogel, Andrea Goethals

Harvard University Library, Office for Information Systems

May 1, 2009

Today’s Journey

• The Darkness – The Web: Introducing the challenge of web archiving
• The Candle – WAX: HUL’s Web Archive Collection Service
• The Light – The Collections: Demonstrating the results

The Darkness: The Web

The Challenges of Web Archiving

• A fleeting record – here today, gone tomorrow
• Government Documents
• Public Debate
• Culture
• Personal expression
• University Output

Harvard Magazine May/June 2009

Curator Activities

• Selection
• Acquisition
• Rights management
• Quality assurance
• Arrangement
• Storage
• Description and indexing for discovery (cataloguing, searching, browsing)
• Presentations and exhibitions
• Preservation

IP and Other Legal Risks

• Copyright infringement
• State tort liability
• Civil damages resulting from invasion of privacy, sensitive personal data, commercial content, defamatory content
• Statutory content restrictions
• Foreign laws

Preservation Challenges

• We were not there at creation
• Viruses more likely
• Formats misidentify themselves
• A lot of formats are invalid (especially HTML)
• It’s a moving target – what should we preserve?
• Evolving born digital formats
• Proliferation of formats
• Partial capture
• Complex behaviors and styles
• Complex delivery to maintain
• Hyperlinked resources
• Multiple renderers will continue to evolve

2006/07 Alternatives

| Alternative | Selection | Crawling | Management (QA and Metadata) | Storage | Preservation | Discovery and Display | Notes |
|---|---|---|---|---|---|---|---|
| Wayback (IA) | No | Yes | No | Yes | Partial: replicated storage, not Harvard owned | No full-text searching | |
| Contract IA | Yes | Yes | No, handle in-house | No, handle in-house | No, handle in-house | No, handle in-house | |
| Archive It! (IA) | Yes | Yes | Minimal, has since improved | Yes | Partial: replicated storage | Minimal, has since improved | 2008 costs: $16,000/yr; $2,000/yr Harvard copy |
| Customize IIPC Tools (WAX)* | Yes | Yes | Yes | Yes | More than others | Yes | |

* Additional benefit of integration with HUL central services

The Candle: WAX

HUL’s Web Archiving Project

• 2.5 year pilot project funded by LDI
• Key Goals
  1. Gain experience in domain
  2. Explore legal terrain
  3. Investigate sustainability of a Harvard web archiving service
    • quantify technical, human, and $ requirements
    • aim for operational efficiencies

Project Players

1. Curators and Collection Managers
  • Harvard University Archives
  • Schlesinger Library on the History of Women in America
  • Edwin O. Reischauer Institute of Japanese Studies
2. Legal Counsel – Office of General Counsel (OGC)
3. Technologists – OIS

What Did We Build? WAX


Third Party Software

• International Internet Preservation Consortium (IIPC) tools (www.netpreserve.org)
  • Heritrix
  • HCC
  • NutchWAX
  • Wayback
• JBoss
• Oracle
• Struts
• Tomcat
• Quartz job scheduler

The Web is vast and interconnected.

How do you specify the part you want to capture?

Or “training a web crawler”…

How to Train a Web Crawler

1. Tell it where to start
  • “Seed URI”
2. Tell it what to collect and where to stop
  • “Scope”
3. Tell it when and how often
  • “Schedule”
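The three instructions above can be pictured as one small record per website. The following is a minimal sketch only, not the WAX data model: the class name, the field names, and the choice of days as the schedule unit are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class HarvestProfile:
    """Hypothetical container for what a crawler is told (not the WAX schema)."""
    seed_uri: str        # where to start
    scope: str           # what to collect and where to stop
    interval_days: int   # how often to harvest (unit assumed to be days)
    first_run: date      # when the first harvest happens

    def next_run(self, last_run: date) -> date:
        """Return the date of the harvest that follows last_run."""
        return last_run + timedelta(days=self.interval_days)


# Example values are invented for illustration.
profile = HarvestProfile(
    seed_uri="http://www.example.edu/library/",
    scope="directory",
    interval_days=7,
    first_run=date(2009, 5, 1),
)
print(profile.next_run(profile.first_run))
```

Keeping seed, scope, and schedule together in one record means a scheduler only needs the profile to decide when to launch the crawler and what to hand it.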

Web Archiving Steps

1. Create a harvest profile
   Identify website URI (“seed”), define scope and schedule
2. Harvest web site
3. QA harvest
4. Send harvest to DRS
5. Index harvest
   Becomes searchable and viewable by users

A lot of work per website – which steps can be automated?

Web Archiving Steps

Manual by curator → 1. Create a harvest profile
Automated by scheduler and crawler software → 2. Harvest web site
Manual by curator → 3. QA harvest
Manual by curator → 4. Send harvest to DRS
Automated by indexing software → 5. Index harvest

Workflow Efficiencies

• Curator’s manual tasks:
  • Create a harvest profile
    • 3 scopes: Directory, host and host+1
    • Schedules
    • Global excluded URIs
  • QA harvests
    • Remove unwanted pieces
    • Detect missing pieces
    • Refinement of seed scope
  • Send harvests to DRS

How can the system help with these tasks?
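To make the three scopes concrete, here is a rough sketch of a scope test. The slides name the scopes but do not define them, so the rules below are assumptions: "directory" is read as same host and same directory as the seed, "host" as same host, and "host+1" as the host plus anything linked directly from a page on that host. The function and parameter names are hypothetical.

```python
from urllib.parse import urlsplit


def seed_directory(seed: str) -> str:
    """Path of the seed up to and including its last slash."""
    path = urlsplit(seed).path or "/"
    return path[: path.rfind("/") + 1]


def in_scope(candidate: str, seed: str, scope: str, found_on: str = "") -> bool:
    """Decide whether candidate belongs to the crawl (assumed scope rules).

    found_on is the URI of the page on which the link was discovered.
    """
    seed_host = urlsplit(seed).hostname
    cand = urlsplit(candidate)
    same_host = cand.hostname == seed_host
    if scope == "directory":
        return same_host and (cand.path or "/").startswith(seed_directory(seed))
    if scope == "host":
        return same_host
    if scope == "host+1":
        # Assumption: one hop beyond the seed host is still collected.
        linked_from_host = bool(found_on) and urlsplit(found_on).hostname == seed_host
        return same_host or linked_from_host
    raise ValueError(f"unknown scope: {scope}")


seed = "http://www.example.edu/library/index.html"
print(in_scope("http://www.example.edu/news/", seed, "directory"))
print(in_scope("http://www.example.edu/news/", seed, "host"))
```

A curator refining a seed's scope during QA is, in effect, choosing which of these rules the next crawl applies.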

Efficiencies: QA Harvests

• Exclude URIs from future crawls

• Delete URIs from harvest

• Delete URIs from harvest and Exclude them from future crawls
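The three QA actions differ only in which of two lists they touch: the contents of the current harvest and the exclusions applied to future crawls. A minimal sketch of that distinction follows; the class and method names are hypothetical and the URIs are invented.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class HarvestReview:
    """Hypothetical QA state for one harvest of one seed."""
    harvested: Set[str]                              # URIs captured in this harvest
    excluded: Set[str] = field(default_factory=set)  # URIs future crawls should skip

    def exclude(self, uri: str) -> None:
        """Keep the capture, but skip the URI in future crawls."""
        self.excluded.add(uri)

    def delete(self, uri: str) -> None:
        """Drop the capture from this harvest only."""
        self.harvested.discard(uri)

    def delete_and_exclude(self, uri: str) -> None:
        """Drop the capture and skip the URI in future crawls."""
        self.delete(uri)
        self.exclude(uri)


review = HarvestReview(harvested={
    "http://www.example.edu/",
    "http://www.example.edu/calendar?month=1",
    "http://ads.example.com/banner.gif",
})
review.exclude("http://www.example.edu/calendar?month=1")
review.delete_and_exclude("http://ads.example.com/banner.gif")
print(sorted(review.harvested))
print(sorted(review.excluded))
```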

Efficiencies: Send Harvests to DRS

The Ultimate Shortcut?

• Can pre-configure WAX to send harvests directly to the DRS
  • Skip QA step
  • Skip push to archive step

Web Harvest Objects: Unit of Preservation in the DRS

• For each crawl starting from a seed URI:
  • One or more ARC files (*.arc.gz)
    • contain one or more “resources” – the individual HTML, JPEG, JavaScript, etc. files that make up the harvested web pages
  • Crawl log
    • records all URI requests, regardless of result
  • Crawler configuration
  • Metadata
    • descriptive, administrative, technical
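The parts of a web harvest object listed above can be sketched as a simple record with a completeness check. This is an illustration only: the class, the field names, and the file names are assumptions, and the real DRS object model is not described in these slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List


def _empty_metadata() -> Dict[str, dict]:
    """The three metadata categories named on the slide, initially empty."""
    return {"descriptive": {}, "administrative": {}, "technical": {}}


@dataclass
class WebHarvestObject:
    """Hypothetical view of one crawl of one seed as a unit of preservation."""
    seed_uri: str
    arc_files: List[str]   # one or more *.arc.gz files
    crawl_log: str         # records all URI requests, regardless of result
    crawler_config: str    # the settings the crawler ran with
    metadata: Dict[str, dict] = field(default_factory=_empty_metadata)

    def problems(self) -> List[str]:
        """List reasons this object is not ready to deposit."""
        found = []
        if not self.arc_files:
            found.append("no ARC files")
        found += [f"not an ARC file: {name}"
                  for name in self.arc_files if not name.endswith(".arc.gz")]
        return found


harvest = WebHarvestObject(
    seed_uri="http://www.example.edu/",
    arc_files=["crawl-0001.arc.gz"],   # invented file names
    crawl_log="crawl.log",
    crawler_config="order.xml",
)
print(harvest.problems())
```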

WAX Legal Mitigations: Crawls

• Polite crawling
  • Obey robots.txt
  • Leave WAX crawler information in logs
• Employ a respectful “request frequency” during crawls
  • Don’t overload web servers
• Capture surface web only
  • No attempt to crawl protected content
• Choice of offsite crawler for curators
  • Non-Harvard IP address
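As an illustration of polite crawling, the sketch below uses Python's standard `urllib.robotparser` to honor robots.txt and to pick a delay between requests. It is not how WAX or Heritrix implements these rules; the user-agent string and the 5-second default delay are assumptions chosen for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler name; a real crawler identifies itself so that
# site owners can recognize it in their server logs.
USER_AGENT = "example-archive-crawler"
DEFAULT_DELAY = 5.0  # assumed minimum seconds between requests to one server


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that was already downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def may_fetch(rules: RobotFileParser, uri: str) -> bool:
    """True if robots.txt allows this crawler to request the URI."""
    return rules.can_fetch(USER_AGENT, uri)


def delay_seconds(rules: RobotFileParser) -> float:
    """Use the site's Crawl-delay when it asks for more than the default."""
    declared = rules.crawl_delay(USER_AGENT)
    if declared is None:
        return DEFAULT_DELAY
    return max(DEFAULT_DELAY, float(declared))


rules = build_rules("User-agent: *\nDisallow: /private/\nCrawl-delay: 10\n")
print(may_fetch(rules, "http://www.example.edu/private/report.html"))
print(may_fetch(rules, "http://www.example.edu/public/index.html"))
print(delay_seconds(rules))
```

Taking the larger of the site's requested delay and the crawler's own default keeps the request frequency respectful even when a site declares no preference.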

WAX Legal Mitigations: Use

• Don’t compete with or divert traffic from live site
• Exclude robots from the WAX archive
• Add transformative content
  • Framing
  • Presentation pages with original intellectual content
• Embargo display for 3 months
• Link to live site
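The display embargo reduces to a date comparison. The sketch below assumes the 3 months are counted from the harvest date, which the slides do not state; the function names are hypothetical.

```python
import calendar
from datetime import date

EMBARGO_MONTHS = 3  # from the slide; the starting point is an assumption


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping to the last day of the month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def is_displayable(harvested_on: date, today: date) -> bool:
    """True once the embargo period after the harvest has passed."""
    return today >= add_months(harvested_on, EMBARGO_MONTHS)


print(is_displayable(date(2009, 2, 1), date(2009, 4, 30)))
print(is_displayable(date(2009, 2, 1), date(2009, 5, 1)))
```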

The Collections

• 191 “seeds” identified by curators for harvesting
• Stored in DRS:
  • Over 8 million web archive resources
  • 365.17 gigabytes of storage ($913/year)
  • 291 MIME types

application/x-download

message/rfc822

image/x-portable-anymap

javascript/x-javascript

application/bds

image/png?ver=074219b2138e87ecf980914471183dfc

application/xrds+xml

"text/xml"

image/x-bmp

gif

application/x-rar-compressed

Image/png

mime/type

image/null

text/troff

application/vnd.sun.xml.impress

text/enriched

application/icalendar

application-x/javascript

x-mapp-php4

imag/x-icon

application/x-shockwave-flash2-preview

Swish

image/x-photoshop

application/x-quicktimeplayer

application/x-java-vm

text/Javascript

text\css

application/x-Shockwave-Flash

png

text/x-c++

image/x-cmu-raster

httpd/yahoo-send-as-is

application/x-mpeg

Video/X-Flv

text/x-python

audio/x-scpls

application/pgp-keys

text/calendar

text/x-vcard

application/octet-string

application/x-troff-me

video/x-m4v

application/pgp-signature

image/x-portable-graymap

image/#{favicon_formats[format]}

image/files/curryjpg

test/xml

text/x-invalid

video/x-flv

text/javascript+json

Shockwave

audio/x-realaudio

chemical/mdl-rdf

content-type

text/text

Text/HTML

audio/mid

text/Calendar

application/x-wais-source

application/x-perl

image/txt

Applicationxm

PNG

x-png

unknown/unknown

text/x-javascript

application/octetstream

Image

application/x-sh

audio/x-mpegurl

audio/unknown

chemical/x-xyz

application/perl

application/x.atom+xml

application/octet_stream

video/mp4
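Many of the values above are malformed rather than genuinely distinct formats. As a rough illustration of how such labels might be tidied before counting, the sketch below lowercases the value, strips quotes and trailing parameters, repairs a backslash separator, and rejects anything that is not a `type/subtype` pair. The function name and rules are assumptions, not the WAX approach, and a syntactically clean label still says nothing about what the file really is.

```python
import re
from typing import Optional

# Simplified token pattern for a MIME type or subtype (assumed rule set).
_TOKEN = re.compile(r"^[a-z0-9][a-z0-9!#$&^_.+-]*$")


def normalize_mime(raw: str) -> Optional[str]:
    """Return a cleaned 'type/subtype' string, or None if unrecoverable."""
    value = raw.strip().strip("\"'").lower()
    value = value.replace("\\", "/")                        # text\css -> text/css
    value = re.split(r"[;?]", value, maxsplit=1)[0].strip()  # drop parameters
    parts = value.split("/")
    if len(parts) != 2:
        return None
    if not all(_TOKEN.match(part) for part in parts):
        return None
    return value


for sample in ['"text/xml"', "Text/HTML", "text\\css", "gif", "image/files/curryjpg"]:
    print(sample, "->", normalize_mime(sample))
```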

The Light: The Collections

The Partners

Megan Sniffin-Marinoff, University Archivist

A-Sites: Archived Harvard Web Sites collected by the Harvard University Archives

Marilyn Dunn, Executive Director of the Schlesinger Library and Librarian of the Radcliffe Institute

Blogs: Capturing Women's Voices collected by the Arthur and Elizabeth Schlesinger Library on the History of Women in America

Helen Hardacre, Reischauer Institute Professor of Japanese Religions and Society

Web Archiving Project on Constitutional Revision collected by the Reischauer Institute of Japanese Studies with Sponsorship from the Harvard College Library Documentation Center on Contemporary Japan

To Participate

http://hul.harvard.edu/ois/systems/wax

Questions?

“…we have rather chosen to fill our hives with honey and wax, thus furnishing mankind with the two noblest of things, which are sweetness and light.”

Jonathan Swift

Image Credits

Title slide: http://www.flickr.com/photos/lwr/59014972/in/set-1552655/

The darkness: http://www.melegraph.com/images/outerspace.jpg

The candle: http://www.sxc.hu/pic/m/a/as/asolario/472153_peach_votive_candle.jpg

The Web: http://projecta-z.com/Internet_map_1024.jpg

The light: http://i252.photobucket.com/albums/hh2/habeba2007/candles-1-1.gif
