WAX: A candle in the darkness
A digital to digital project
Wendy Gogel, Andrea Goethals
Harvard University Library, Office for Information Systems
May 1, 2009
Today’s Journey
• The Darkness – The Web: Introducing the challenge of web archiving
• The Candle – WAX: HUL’s Web Archive Collection Service
• The Light – The Collections: Demonstrating the results
The Darkness: The Web
The Challenges of Web Archiving
• A fleeting record – here today, gone tomorrow
• Government Documents
• Public Debate
• Culture
• Personal expression
• University Output
Harvard Magazine May/June 2009
Curator Activities
• Selection
• Acquisition
• Rights management
• Quality assurance
• Arrangement
• Storage
• Description and indexing for discovery (cataloguing, searching, browsing)
• Presentations and exhibitions
• Preservation
IP and Other Legal Risks
• Copyright infringement
• State tort liability
• Civil damages, resulting from invasion of privacy, sensitive personal data, commercial content, defamatory content
• Statutory content restrictions
• Foreign Laws
Preservation Challenges
• We were not there at creation
• Viruses more likely
• Formats misidentify themselves
• A lot of formats are invalid (especially HTML)
• It’s a moving target – what should we preserve?
• Evolving born digital formats
• Proliferation of formats
• Partial capture
• Complex behaviors and styles
• Complex delivery to maintain
• Hyperlinked resources
• Multiple renderers will continue to evolve
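One way to catch formats that misidentify themselves is to compare the declared type with the file's leading bytes. A minimal sketch, assuming a small hand-picked signature table; the function names are hypothetical, and this is not a description of how WAX or the DRS identify formats.

```python
from typing import Optional

# Well-known leading byte signatures ("magic numbers") for a few formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]


def sniff_type(data: bytes) -> Optional[str]:
    """Return the MIME type implied by the leading bytes, or None if unknown."""
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    return None


def is_misidentified(declared: str, data: bytes) -> bool:
    """True when the content is recognizable and contradicts the declared type."""
    sniffed = sniff_type(data)
    return sniffed is not None and sniffed != declared.strip().lower()
```

Content with no known signature (plain HTML, for example) is never flagged, so the check only reports contradictions it can actually see.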
2006/07 Alternatives

| Option | Selection | Crawling | Management (QA and Metadata) | Storage | Preservation | Discovery and Display | Notes |
|---|---|---|---|---|---|---|---|
| Wayback (IA) | No | Yes | No | Yes | Partial - Replicated storage – Not Harvard owned | No full text searching | |
| Contract IA | Yes | Yes | No, handle in-house | No, handle in-house | No, handle in-house | No, handle in-house | |
| Archive It! (IA) | Yes | Yes | Minimal, has since improved | Yes | Partial - Replicated storage | Minimal, has since improved | 2008 costs: $16,000/yr; $2,000/yr Harvard copy |
| Customize IIPC Tools (WAX)* | Yes | Yes | Yes | Yes | More than others | Yes | |

* Additional benefit of integration with HUL central services
The Candle: WAX
HUL’s Web Archiving Project
• 2.5 year pilot project funded by LDI
• Key Goals
  1. Gain experience in domain
  2. Explore legal terrain
  3. Investigate sustainability of a Harvard web archiving service
    • quantify technical, human, and $ requirements
    • aim for operational efficiencies
Project Players
1. Curators and Collection Managers
  • Harvard University Archives
  • Schlesinger Library on the History of Women in America
  • Edwin O. Reischauer Institute of Japanese Studies
2. Legal Counsel – Office of General Counsel (OGC)
3. Technologists – OIS
What Did We Build? WAX
Third Party Software
• International Internet Preservation Consortium (IIPC) tools (www.netpreserve.org)
  • Heritrix
  • HCC
  • NutchWAX
  • Wayback
• JBoss
• Oracle
• Struts
• Tomcat
• Quartz job scheduler
The Web is vast and interconnected.
How do you specify the part you want to capture?
Or “training a web crawler”…
How to Train a Web Crawler
1. Tell it where to start
  • “Seed URI”
2. Tell it what to collect and where to stop
  • “Scope”
3. Tell it when and how often
  • “Schedule”
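The first two instructions can be sketched as a breadth-first walk that starts at the seed and follows only in-scope links. This is an illustration over a hypothetical in-memory link graph, with no network access and no scheduling; `LINKS`, `crawl`, and `same_host` are made-up names, not part of Heritrix or WAX.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": each URI maps to the URIs it links to.
LINKS = {
    "http://example.org/blog/": [
        "http://example.org/blog/post1",
        "http://example.org/about",
    ],
    "http://example.org/blog/post1": [
        "http://other.example.com/",
        "http://example.org/blog/",
    ],
    "http://example.org/about": [],
    "http://other.example.com/": [],
}


def crawl(seed, in_scope, links=LINKS):
    """Breadth-first walk from the seed, following only in-scope URIs."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        uri = queue.popleft()
        order.append(uri)  # a real crawler would fetch and store the resource here
        for target in links.get(uri, []):
            if target not in seen and in_scope(target):
                seen.add(target)
                queue.append(target)
    return order


def same_host(seed):
    """Scope rule: stay on the seed's host."""
    host = urlparse(seed).netloc
    return lambda uri: urlparse(uri).netloc == host
```

With the seed `http://example.org/blog/` and the `same_host` rule, the walk visits the three `example.org` pages and stops at the link to `other.example.com`.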
Web Archiving Steps
1. Create a harvest profile
  Identify website URI (“seed”), define scope and schedule
2. Harvest web site
3. QA harvest
4. Send harvest to DRS
5. Index harvest
  Becomes searchable and viewable by users
A lot of work per website – which steps can be automated?
Web Archiving Steps
Manual by curator → 1. Create a harvest profile
Automated by scheduler and crawler software → 2. Harvest web site
Manual by curator → 3. QA harvest
Manual by curator → 4. Send harvest to DRS
Automated by indexing software → 5. Index harvest
Workflow Efficiencies
• Curator’s manual tasks:
  • Create a harvest profile
    • 3 scopes: Directory, host and host+1
    • Schedules
    • Global excluded URIs
  • QA harvests
    • Remove unwanted pieces
    • Detect missing pieces
    • Refinement of seed scope
  • Send harvests to DRS
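The three scopes can be read as three membership tests against the seed. The sketch below assumes "directory" means the seed's directory and everything beneath it on the same host, "host" means any URI on the seed's host, and "host+1" additionally admits URIs one link away from a page on that host. These definitions and the `in_scope` helper are assumptions for illustration, not WAX's actual rules.

```python
import posixpath
from typing import Optional
from urllib.parse import urlparse


def _host(uri: str) -> str:
    return (urlparse(uri).hostname or "").lower()


def _directory(uri: str) -> str:
    """Directory part of the URI path, always ending in '/'."""
    path = urlparse(uri).path or "/"
    if not path.endswith("/"):
        path = posixpath.dirname(path)
    return path if path.endswith("/") else path + "/"


def in_scope(seed: str, uri: str, scope: str, via: Optional[str] = None) -> bool:
    """Decide whether uri belongs to the crawl; via is the page that linked to it."""
    on_host = _host(uri) == _host(seed)
    if scope == "directory":
        return on_host and (urlparse(uri).path or "/").startswith(_directory(seed))
    if scope == "host":
        return on_host
    if scope == "host+1":
        # One hop off the host: allowed only when linked from a page on the host.
        return on_host or (via is not None and _host(via) == _host(seed))
    raise ValueError(f"unknown scope: {scope}")
```

For a seed of `http://example.org/blog/index.html`, the directory scope keeps `/blog/2009/post.html` and drops `/about.html`, while the host scope keeps both.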
How can the system help with these tasks?
Efficiencies: QA Harvests
• Exclude URIs from future crawls
• Delete URIs from harvest
• Delete URIs from harvest and exclude them from future crawls
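The three QA actions differ only in what they touch: the stored harvest, the profile's exclusion list, or both. A minimal sketch with hypothetical `Harvest` and `Profile` classes; WAX's real data model is not shown in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class Harvest:
    uris: set = field(default_factory=set)      # URIs captured in this harvest


@dataclass
class Profile:
    excluded: set = field(default_factory=set)  # URIs skipped by future crawls


def exclude(profile: Profile, uris) -> None:
    """Keep what was captured, but skip these URIs next time."""
    profile.excluded.update(uris)


def delete(harvest: Harvest, uris) -> None:
    """Remove these URIs from this harvest only."""
    harvest.uris.difference_update(uris)


def delete_and_exclude(harvest: Harvest, profile: Profile, uris) -> None:
    """Remove the URIs now and skip them in future crawls."""
    delete(harvest, uris)
    exclude(profile, uris)
```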
Efficiencies: Send Harvests to DRS
The Ultimate Shortcut?
• Can pre-configure WAX to send harvests directly to the DRS
  • Skip QA step
  • Skip push to archive step
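The effect of the shortcut on the workflow can be sketched as a flag that drops the QA step and turns the push to the DRS into an automated step. The `send_directly` flag and `plan` function are hypothetical names; the slides do not show how WAX stores this setting.

```python
# The five workflow steps and who performs each one by default.
STEPS = [
    ("create harvest profile", "manual"),
    ("harvest web site", "automated"),
    ("QA harvest", "manual"),
    ("send harvest to DRS", "manual"),
    ("index harvest", "automated"),
]


def plan(send_directly: bool):
    """Return (step, mode) pairs for one harvest profile."""
    result = []
    for name, mode in STEPS:
        if send_directly and name == "QA harvest":
            continue                      # QA is skipped entirely
        if send_directly and name == "send harvest to DRS":
            mode = "automated"            # no manual push by the curator
        result.append((name, mode))
    return result
```

Under this reading, a pre-configured profile leaves the curator with a single manual task: creating the harvest profile.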
Web Harvest Objects: Unit of Preservation in the DRS
• For each crawl starting from a seed URI:
  • One or more ARC files (*.arc.gz)
    • contain one or more “resources” - the individual HTML, JPEG, JavaScript, etc. files that make up the harvested web pages
  • Crawl log
    • records all URI requests, regardless of result
  • Crawler configuration
  • Metadata
    • descriptive, administrative, technical
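Each resource inside an ARC file is preceded by a one-line header. The sketch below parses such a line, assuming the version-1 URL-record layout (URL, IP address, 14-digit archive date, content type, length, separated by single spaces) and assuming no field contains an embedded space. `ArcRecordHeader` and `parse_arc_header` are illustrative names, not part of any IIPC tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ArcRecordHeader:
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int  # size of the record body in bytes


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Parse one version-1 URL-record header line."""
    fields = line.strip().split(" ")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    url, ip_address, date, content_type, length = fields
    return ArcRecordHeader(
        url=url,
        ip_address=ip_address,
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        length=int(length),
    )
```

The content type recorded here is whatever the server reported at harvest time, so it may not match the actual format of the resource.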
WAX Legal Mitigations: Crawls
• Polite crawling
  • Obey robots.txt
  • Leave WAX crawler information in logs
• Employ a respectful “request frequency” during crawls
  • Don’t overload web servers
• Capture surface web only
  • No attempt to crawl protected content
• Choice of offsite crawler for curators
  • Non-Harvard IP address
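Obeying robots.txt and spacing out requests can be sketched with Python's standard `urllib.robotparser`. The user-agent string and the two-second delay are assumptions for illustration, not WAX's actual settings, and `polite_fetch_plan` is a hypothetical helper.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-archive-crawler"  # hypothetical; not WAX's real agent string


def make_robots(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a site's robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_fetch_plan(uris, robots, delay_seconds=2.0, sleep=time.sleep):
    """Yield only the URIs robots.txt allows, pausing between requests."""
    first = True
    for uri in uris:
        if not robots.can_fetch(USER_AGENT, uri):
            continue              # obey robots.txt
        if not first:
            sleep(delay_seconds)  # respectful request frequency
        first = False
        yield uri
```

Passing `sleep` as a parameter keeps the pause testable without actually waiting.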
WAX Legal Mitigations: Use
• Don’t compete with or divert traffic from live site
  • Exclude robots from the WAX archive
• Add transformative content
  • Framing
  • Presentation pages with original intellectual content
• Embargo display for 3 months
• Link to live site
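The display embargo amounts to a date comparison. A minimal sketch, assuming the embargo runs for three calendar months from the harvest date; how WAX actually counts the period is not stated in these slides, and both function names are hypothetical.

```python
import calendar
from datetime import date


def add_months(day: date, months: int) -> date:
    """Shift a date forward by whole months, clamping to the month's last day."""
    index = day.month - 1 + months
    year = day.year + index // 12
    month = index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def is_displayable(harvested: date, today: date, embargo_months: int = 3) -> bool:
    """True once the embargo period after the harvest date has passed."""
    return today >= add_months(harvested, embargo_months)
```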
The Collections
• 191 “seeds” identified by curators for harvesting
• Stored in DRS:
  • Over 8 million web archive resources
  • 365.17 gigabytes of storage ($913/year)
  • 291 MIME types
application/x-download
message/rfc822
image/x-portable-anymap
javascript/x-javascript
application/bds
image/png?ver=074219b2138e87ecf980914471183dfc
application/xrds+xml
"text/xml"
image/x-bmp
gif
application/x-rar-compressed
Image/png
mime/type
image/null
text/troff
application/vnd.sun.xml.impress
text/enriched
application/icalendar
application-x/javascript
x-mapp-php4
imag/x-icon
application/x-shockwave-flash2-preview
Swish
image/x-photoshop
application/x-quicktimeplayer
application/x-java-vm
text/Javascript
text\css
application/x-Shockwave-Flash
png
text/x-c++
image/x-cmu-raster
httpd/yahoo-send-as-is
application/x-mpeg
Video/X-Flv
text/x-python
audio/x-scpls
application/pgp-keys
text/calendar
text/x-vcard
application/octet-string
application/x-troff-me
video/x-m4v
application/pgp-signature
image/x-portable-graymap
image/#{favicon_formats[format]}
image/files/curryjpg
test/xml
text/x-invalid
video/x-flv
text/javascript+json
Shockwave
audio/x-realaudio
chemical/mdl-rdf
content-type
text/text
Text/HTML
audio/mid
text/Calendar
application/x-wais-source
application/x-perl
image/txt
Applicationxm
PNG
x-png
unknown/unknown
text/x-javascript
application/octetstream
Image
application/x-sh
audio/x-mpegurl
audio/unknown
chemical/x-xyz
application/perl
application/x.atom+xml
application/octet_stream
video/mp4
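Many of the reported types above differ only in case, quoting, or stray characters. A sketch of how such strings might be cleaned up before counting; the rules here are assumptions for illustration, not the processing WAX applies, and `normalize_mime` is a hypothetical helper.

```python
import re
from typing import Optional

# type "/" subtype, using the characters permitted in media type names
_MEDIA_TYPE = re.compile(
    r"^[a-z0-9][a-z0-9!#$&^_.+-]*/[a-z0-9][a-z0-9!#$&^_.+-]*$"
)


def normalize_mime(raw: str) -> Optional[str]:
    """Return a cleaned 'type/subtype' string, or None if none can be recovered."""
    value = raw.strip().strip("\"'").lower()
    value = value.replace("\\", "/")                   # e.g. text\css
    value = re.split(r"[;?\s]", value, maxsplit=1)[0]  # drop parameters and query debris
    return value if _MEDIA_TYPE.match(value) else None
```

A value that passes is only syntactically well formed: `test/xml` and `image/txt` survive the cleanup even though neither names a real format.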
The Light: The Collections
The Partners
Megan Sniffin-Marinoff, University Archivist
A-Sites: Archived Harvard Web Sites collected by the Harvard University Archives
Marilyn Dunn, Executive Director of the Schlesinger Library and Librarian of the Radcliffe Institute
Blogs: Capturing Women's Voices collected by the Arthur and Elizabeth Schlesinger Library on the History of Women in America
Helen Hardacre, Reischauer Institute Professor of Japanese Religions and Society
Web Archiving Project on Constitutional Revision collected by the Reischauer Institute of Japanese Studies with Sponsorship from the Harvard College Library Documentation Center on Contemporary Japan
To Participate
http://hul.harvard.edu/ois/systems/wax
Questions?
“…we have rather chosen to fill our hives with honey and wax, thus furnishing mankind with the two noblest of things, which are sweetness and light.”
Jonathan Swift
Image Credits
Title slide: http://www.flickr.com/photos/lwr/59014972/in/set-1552655/
The darkness: http://www.melegraph.com/images/outerspace.jpg
The candle: http://www.sxc.hu/pic/m/a/as/asolario/472153_peach_votive_candle.jpg
The Web: http://projecta-z.com/Internet_map_1024.jpg
The light: http://i252.photobucket.com/albums/hh2/habeba2007/candles-1-1.gif