Documenting Internet2: an IT perspective
Eric Celeste, University of Minnesota (Twin Cities) Libraries
for the Coalition for Networked Information
6 December 2005
...or... A joyful romp with Heritrix, JavaScript, & Spotlight!
background...
• DI2 brought together
– University of Minnesota (CBI)
– University of Michigan (SI)
– Internet2
• web crawling only a small part
• the “save everything” approach
briefly…
• on crawling with spiders
• on Heritrix and JavaScript
• on Spotlight and local files
• on sinkholes and strategies
spiders on the web
pages
links
hosts & domains
robots.txt
scope
seeds
excluded pages
done!
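
The crawl terms above (seeds, links, scope, robots.txt, excluded pages) can be sketched as a small breadth-first loop. This is an illustrative sketch, not Heritrix's implementation: `crawl`, `fetch_links`, and the dictionary of robots.txt texts are hypothetical names that stand in for real network fetches.

```python
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def crawl(seeds, fetch_links, robots_txt, scope_hosts, excluded_prefixes=()):
    """Breadth-first crawl; returns URLs in the order they were visited.

    fetch_links(url) -> iterable of hrefs on that page (hypothetical callback).
    robots_txt: dict of host -> robots.txt text (stands in for fetching it).
    scope_hosts: set of hosts the crawl is allowed to stay within.
    """
    parsers = {}
    for host, text in robots_txt.items():
        parser = RobotFileParser()
        parser.parse(text.splitlines())
        parsers[host] = parser

    def allowed(url):
        host = urlparse(url).netloc
        if host not in scope_hosts:
            return False  # out of scope
        if any(url.startswith(prefix) for prefix in excluded_prefixes):
            return False  # explicitly excluded page
        parser = parsers.get(host)
        return parser is None or parser.can_fetch("*", url)

    queue = deque(url for url in seeds if allowed(url))
    seen = set(queue)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for href in fetch_links(url):
            link = urljoin(url, href).split("#", 1)[0]  # resolve, drop fragment
            if link not in seen and allowed(link):
                seen.add(link)
                queue.append(link)
    return visited  # queue is empty: done!
```

The crawl ends when the queue of in-scope, unseen links runs dry, which is the "done!" state on the slide.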
our crawler
• Heritrix, from the IA
• aiming for broad deployment, Archive-It
• cross-platform, many users
• simple setup, sophisticated options
• generates ARC files
from ARC to archive
• keep originals intact
• a few large files to manage
• can serve a mirror from the master
• can extract files for research
• solution requires Perl, PHP, JavaScript, MySQL
processing...
• for mirroring online
– optimizing and indexing with Perl
– loading into MySQL database
– presenting via PHP
• for using on local disk
– extracting files from ARC
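
The "extracting files from ARC" step might look roughly like the sketch below. The project's own tooling was Perl; Python is used here only for illustration. The sketch assumes an uncompressed, version-1-style ARC stream whose record headers carry five space-separated fields (URL, IP address, date, content type, length). Real ARC files are often gzipped and may differ. `iter_arc_records` and `local_path` are hypothetical helper names.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def iter_arc_records(stream):
    """Yield (header, body) pairs from an uncompressed ARC-like byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator line between records
        # Assumed header layout: URL IP date content-type length
        text = line.decode("utf-8", "replace").strip()
        url, ip, date, mime, length = text.rsplit(" ", 4)
        body = stream.read(int(length))
        yield {"url": url, "ip": ip, "date": date, "mime": mime}, body


def local_path(url):
    """Map a URL to a relative file path so the extracted copy stays browsable."""
    parts = urlparse(url)
    path = parts.path
    if not path or path.endswith("/"):
        path += "index.html"  # directory-style URLs need a file name
    return PurePosixPath(parts.netloc) / path.lstrip("/")
```

The body of an `http://` record normally still begins with the HTTP response headers, so a fuller extractor would split those off before writing the file to disk. The leading `filedesc://` record describes the ARC file itself and would be skipped.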
joys of javascript...
• modifies the page after loading
• HTML almost unmolested
• changes explicit in code
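
One way to keep the HTML "almost unmolested" is to add a single script tag when a page is served and let the script make every other change after loading. The slides do not say how the script is attached, so this server-side sketch is an assumption; `add_archive_script` and the `/archive.js` path are hypothetical.

```python
import re


def add_archive_script(html, script_src="/archive.js"):
    """Insert one <script> tag and change nothing else in the archived page."""
    tag = f'<script src="{script_src}"></script>'
    match = re.search(r"</head\s*>", html, flags=re.IGNORECASE)
    if match:
        # Place the tag just before the closing head tag.
        return html[:match.start()] + tag + html[match.start():]
    # No head element found: append, so any doctype stays first.
    return html + tag
```

Removing that one tag gives back the original bytes, which keeps the archive's changes explicit and easy to undo.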
are we there yet?
• make the archive obvious
• yet intrude as little as possible
global research locally
• a web site in your pocket
• applying local tools
• maintaining browse-ability
• Apple’s Spotlight one of many
sinkholes / strategies
• partnership with institution
– config, IP, retention
• crawling far from perfect
– no creation dates, exclusions
– sticky traps, scripted pages (AJAX)
• scripts still immature
– better demarcation
– more self-contained (not at /)
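
The "sticky traps" noted above are URL spaces that never end, such as a relative link that keeps nesting the same directories. A common heuristic, sketched here with made-up thresholds, rejects URLs whose paths are very deep or repeat a segment several times. `looks_like_trap` is a hypothetical name, and this is not a description of the rules Heritrix itself applies.

```python
from collections import Counter
from urllib.parse import urlparse


def looks_like_trap(url, max_depth=20, max_repeats=3):
    """Guess whether a URL path shows the shape of a crawler trap."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True  # suspiciously deep path
    counts = Counter(segments)
    # The same path segment appearing max_repeats times or more
    return any(n >= max_repeats for n in counts.values())
```

A check like this would run before a link is queued. It catches only path-shaped traps and misses others, such as endless calendar pages driven by query strings.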
still...
• capture & save what we can
• keep it as “original” as possible
• stay flexible for the future
• have fun in the present!