Lazy Preservation, Warrick, and the Web Infrastructure. Frank McCown, Old Dominion University, Computer Science Department, Norfolk, Virginia, USA. Internet Archive Tutorial, JCDL 2007, Vancouver, BC, June 19, 2007.


Page 1: Lazy Preservation, Warrick, and the Web Infrastructure

Lazy Preservation, Warrick, and the Web Infrastructure

Frank McCown

Old Dominion University, Computer Science Department

Norfolk, Virginia, USA

Internet Archive Tutorial, JCDL 2007, Vancouver, BC, June 19, 2007

Page 2 (slide 14)

• McCown, et al., Brass: A Queueing Manager for Warrick, IWAW 2007.

• McCown, et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM/IEEE JCDL 2007.

• McCown and Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006.

• McCown, et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.

Available at http://warrick.cs.odu.edu/

Page 3 (slide 15)

What Types of Websites Are Lost?

Marshall, McCown, and Nelson, Evaluating Personal Archiving Strategies for Internet-based Information, IS&T Archiving 2007.

Page 4 (slide 31)

Success of website recovery each week

*On average, we recovered 61% of a website in any given week.

Page 5 (slide 42)

Overlap with Internet Archive

Page 6 (slide 46)

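[Figure: components of a website on a web server and their relation to the Web Infrastructure. Labels: Web Server; Database; Perl script; config; Dynamic page; Static files (HTML files, PDFs, images, style sheets, JavaScript, etc.); Web Infrastructure. Legend: Recoverable; Not Recoverable.]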

Page 7 (slide 47)

Injecting Server Components into Crawlable Pages

[Figure labels: Erasure codes; HTML pages; Recover at least m blocks]
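
The slide names erasure codes and the condition "recover at least m blocks" but does not say which code is used. As an illustration only, the sketch below uses the simplest possible erasure code, a single XOR parity block, so that a server-side file split into m data blocks plus one parity block (n = m + 1) can be rebuilt from any m of them. The function names `encode` and `decode` and the sample file contents are hypothetical and are not taken from the talk or from Warrick; a real deployment would likely want a code that tolerates more than one lost block.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data: bytes, m: int):
    """Split data into m zero-padded blocks and append one XOR parity block."""
    size = -(-len(data) // m)  # ceiling division
    padded = data.ljust(size * m, b"\0")
    blocks = [padded[i * size:(i + 1) * size] for i in range(m)]
    return blocks + [xor_blocks(blocks)]


def decode(received: dict, m: int, length: int) -> bytes:
    """Rebuild the data from any m of the m + 1 blocks.

    `received` maps block index (0..m, where index m is the parity block)
    to the block's bytes; `length` is the original data length.
    """
    if len(received) < m:
        raise ValueError("need at least m blocks to recover the data")
    missing = [i for i in range(m) if i not in received]
    if missing:
        # Exactly one data block is absent, so the parity block is present:
        # XOR of all received blocks yields the missing data block.
        received = dict(received)
        received[missing[0]] = xor_blocks(list(received.values()))
    return b"".join(received[i] for i in range(m))[:length]


# Hypothetical server-side file whose blocks would be embedded in HTML pages.
source = b"<?php echo 'hello'; ?>"
blocks = encode(source, m=4)
# Simulate losing the page that carried block 2.
recovered = decode({i: b for i, b in enumerate(blocks) if i != 2}, 4, len(source))
```

In this sketch each block would be carried by a different crawlable HTML page, so losing any one page from the web repositories still leaves enough blocks to rebuild the file.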
