Documenting Internet2 an IT perspective Eric Celeste University of Minnesota (Twin Cities) Libraries for the Coalition for Networked Information 6 December 2005 ...or... A joyful romp with Heritrix, JavaScript, & Spotlight!

Documenting Internet2
an IT perspective

Eric Celeste
University of Minnesota (Twin Cities) Libraries

for the Coalition for Networked Information

6 December 2005

...or... A joyful romp with Heritrix, JavaScript, & Spotlight!

background...

• DI2 brought together
  – University of Minnesota (CBI)
  – University of Michigan (SI)
  – Internet2

• web crawling only a small part

• the “save everything” approach

briefly…

• on crawling with spiders
• on Heritrix and JavaScript
• on Spotlight and local files
• on sinkholes and strategies

spiders on the web

pages

links

hosts & domains

robots.txt

scope

seeds

excluded pages

done!
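
The crawl walked through on the preceding slides (seeds, links, hosts, scope, robots.txt, excluded pages) can be sketched roughly as follows. This is an illustrative Python sketch, not how Heritrix is implemented: the in-memory `PAGES` link graph, the `crawl` function, the agent name, and the example URLs are all hypothetical stand-ins, and no network access takes place. Only `urllib.robotparser` from the standard library is used to interpret robots.txt rules.

```python
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "web": page URL -> links found on that page.
PAGES = {
    "http://example.org/": ["/about", "/private/notes", "http://other.example.com/"],
    "http://example.org/about": ["/", "/calendar?month=1"],
    "http://example.org/private/notes": [],
    "http://example.org/calendar?month=1": [],
}

# Hypothetical robots.txt content for the crawled host.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""


def crawl(seeds, in_scope_hosts, excluded_prefixes, robots_txt, agent="sketch-bot"):
    """Breadth-first crawl from the seeds; returns captured URLs in visit order.

    Seeds are captured unconditionally; the checks apply to discovered links.
    """
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())

    queue = deque(seeds)
    seen = set(seeds)
    captured = []
    while queue:
        url = queue.popleft()
        captured.append(url)
        for link in PAGES.get(url, []):
            target = urljoin(url, link)  # resolve relative links
            if target in seen:
                continue
            seen.add(target)
            if urlparse(target).hostname not in in_scope_hosts:
                continue  # outside the crawl scope
            if any(target.startswith(p) for p in excluded_prefixes):
                continue  # explicitly excluded page
            if not robots.can_fetch(agent, target):
                continue  # disallowed by robots.txt
            queue.append(target)
    return captured


result = crawl(
    seeds=["http://example.org/"],
    in_scope_hosts={"example.org"},
    excluded_prefixes=["http://example.org/calendar"],
    robots_txt=ROBOTS_TXT,
)
```

With these assumed inputs the crawl keeps the seed and the "about" page, and drops the off-host link, the robots.txt-protected page, and the excluded calendar page.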

our crawler

• Heritrix, from the Internet Archive (IA)
• aiming for broad deployment, Archive-It
• cross-platform, many users
• simple setup, sophisticated options

• generates ARC files

from ARC to archive

• keep originals intact
• a few large files to manage
• can serve a mirror from the master
• can extract files for research
• solution requires Perl, PHP, JavaScript, MySQL

processing...

• for mirroring online
  – optimizing and indexing with Perl
  – loading into MySQL database
  – presenting via PHP
• for using on local disk
  – extracting files from ARC
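
To make the indexing and extraction steps concrete, here is a rough Python sketch (the project itself used Perl) that walks the records in an ARC stream and notes each record's byte offset, which is the kind of value an index could store so a mirror can seek straight to a page. It assumes the version-1 ARC record header of five space-separated fields (URL, IP address, archive date, content type, length) followed by that many bytes of content. The `ArcRecord` type and `read_arc_records` function are hypothetical names, and the sketch ignores gzip compression and the leading file-description record that real ARC files carry.

```python
import io
from typing import BinaryIO, Iterator, NamedTuple


class ArcRecord(NamedTuple):
    url: str
    ip: str
    date: str
    content_type: str
    offset: int  # byte offset of the record header within the stream
    body: bytes  # the captured response, exactly as stored


def read_arc_records(stream: BinaryIO) -> Iterator[ArcRecord]:
    """Yield records from an uncompressed ARC-style stream (assumed v1 layout)."""
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank line separating two records
        url, ip, date, ctype, length = line.decode("utf-8").split()
        body = stream.read(int(length))
        yield ArcRecord(url, ip, date, ctype, offset, body)


# A tiny hand-made sample with two identical records (illustrative data only).
body = b"HTTP/1.1 200 OK\r\n\r\n<html>hi</html>"
sample = (
    b"http://example.org/ 192.0.2.1 20051206120000 text/html %d\n" % len(body)
    + body
    + b"\n"
)
records = list(read_arc_records(io.BytesIO(sample * 2)))
```

Because the original bytes are only read, never rewritten, this style of access fits the "keep originals intact" goal: the index lives beside the ARC files rather than inside them.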

joys of javascript...

• modifies the page after loading

• HTML almost unmolested
• changes explicit in code
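
One way to picture this approach: when serving an archived page, the server adds a single script reference and leaves the rest of the HTML byte-for-byte as captured; the script then makes its changes in the browser after the page loads. The Python sketch below shows only the server-side insertion and is an assumption about how such a step might look, not the project's actual code. The script path `/archive-banner.js` and the function name `add_archive_script` are hypothetical.

```python
import re

# Hypothetical script that would adjust the page in the browser after loading.
SCRIPT_TAG = '<script src="/archive-banner.js"></script>'


def add_archive_script(html: str) -> str:
    """Insert one script tag before </head>; leave everything else untouched."""
    match = re.search(r"</head\s*>", html, flags=re.IGNORECASE)
    if match is None:
        # No head element found: append the tag rather than restructure the page.
        return html + SCRIPT_TAG
    return html[: match.start()] + SCRIPT_TAG + html[match.start():]


page = "<html><HEAD><title>t</title></HEAD><body>x</body></html>"
served = add_archive_script(page)
```

Removing the one inserted tag from `served` gives back the original page exactly, which is the sense in which the HTML stays almost unmolested and every other change remains explicit in the script's code.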

are we there yet?

• make the archive obvious
• yet intrude as little as possible

global research locally

• a web site in your pocket
• applying local tools
• maintaining browse-ability
• Apple’s Spotlight one of many
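
For local tools such as Spotlight to index a crawl, each captured URL needs a home on disk that still mirrors the site's structure so it stays browsable. A minimal Python sketch of one possible URL-to-path mapping follows. The `local_path` function, the `mirror` root directory, and the `index.html` default are assumptions for illustration; the sketch drops query strings and discards `..` segments so extracted files cannot land outside the root.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_path(url: str, root: str = "mirror") -> PurePosixPath:
    """Map a captured URL to a relative file path under the mirror root."""
    parts = urlparse(url)
    path = parts.path
    if path == "" or path.endswith("/"):
        path += "index.html"  # directory-style URLs get a default file name
    # Drop empty, "." and ".." segments so the result stays under the root.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    return PurePosixPath(root, parts.hostname or "unknown-host", *segments)


home = local_path("http://example.org/")
page = local_path("http://example.org/a/b.html")
odd = local_path("http://example.org/../../etc/passwd")
```

Grouping files by host keeps pages from different servers apart, which matters when a crawl's scope covers several hosts in one domain.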

sinkholes / strategies

• partnership with institution
  – config, IP, retention
• crawling far from perfect
  – no creation dates, exclusions
  – sticky traps, scripted pages (AJAX)
• scripts still immature
  – better demarcation
  – more self-contained (not at /)

still...

• capture & save what we can
• keep it as “original” as possible
• stay flexible for the future
• have fun in the present!

more information

• http://wiki.lib.umn.edu/DI2/

• Eric Celeste <[email protected]>