VT Web Archiving Anthony Rinaldi and Dev Mehta CS 4624 Clients: Mohamed Magdy and Tarek Kanan Blacksburg, VA 5/6/2014


VT Web Archiving
Anthony Rinaldi and Dev Mehta

CS 4624
Clients: Mohamed Magdy and Tarek Kanan

Blacksburg, VA
5/6/2014


Project Goals

● Set up a web crawler with Heritrix

● Archive files from vt.edu

● Integrate with Wayback

● Set up search with Solr (stretch goal)


Problems Encountered

● Older version of software.

● Finding documentation to configure Heritrix.
  o Only crawl vt.edu pages.
  o Crawl all vt.edu pages.

● Issues with CentOS firewalling.
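
The scoping problem above (crawl only vt.edu pages, but all of them) comes down to a host check on each discovered URL. The Python sketch below illustrates that decision outside of Heritrix. It is not Heritrix configuration, and the function name `in_scope` is hypothetical.

```python
from urllib.parse import urlsplit


def in_scope(url, domain="vt.edu"):
    """Return True if the URL's host is `domain` or one of its subdomains."""
    # urlsplit(...).hostname is lower-cased, or None when there is no host part.
    host = urlsplit(url).hostname or ""
    # Matching on "." + domain avoids accepting look-alikes such as notvt.edu.
    return host == domain or host.endswith("." + domain)


print(in_scope("http://www.cs.vt.edu/courses"))   # subdomain of vt.edu
print(in_scope("http://vt.edu.example.com/"))     # different registered domain
```

Checking the parsed host rather than searching the whole URL string keeps links that merely mention vt.edu in a path or query out of the crawl.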


Work Accomplished

● Working set-up of Heritrix that successfully crawls vt.edu web pages.
  o Customized configuration to increase crawl depth.
  o Rejects URLs outside the vt.edu domain.

● Working set-up of Wayback machine:
  o Processes WARC files from Heritrix.
  o Front-end for Heritrix-based crawls.
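
For readers unfamiliar with WARC, each record Heritrix writes starts with a version line and named header fields, followed by a blank line and the captured content. The Python sketch below reads the headers of one uncompressed record. The sample record is made up for illustration and omits fields a real record carries (for example WARC-Record-ID and WARC-Date), and `parse_warc_headers` is a hypothetical helper, not part of Heritrix or Wayback.

```python
def parse_warc_headers(record: bytes) -> dict:
    """Parse the header block of a single uncompressed WARC record."""
    # Headers end at the first blank line (CRLF CRLF); the content block follows.
    head, _, _ = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    headers = {"version": lines[0]}
    for line in lines[1:]:
        # Split on the first colon only, since values such as URLs contain colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


# Made-up sample record, shortened for illustration.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://www.vt.edu/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

info = parse_warc_headers(sample)
print(info["WARC-Type"], info["WARC-Target-URI"])
```

Files produced by a crawler are often gzip-compressed and contain many records back to back, so a real reader would need to decompress and use Content-Length to step from one record to the next.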


Lessons Learned

● Sometimes, documentation leaves much to be desired.

● Crawls can be extremely large if not configured properly.
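
To give a rough sense of why crawl size grows so quickly with depth, the Python sketch below computes an upper bound on pages reached from a single seed. It assumes every page links to the same number of new, unique pages, which real sites do not do, so the numbers are illustrative only. The function name `max_pages` and the value of 10 links per page are assumptions, not measurements from the vt.edu crawl.

```python
def max_pages(links_per_page, depth):
    """Upper bound on pages reachable within `depth` hops of one seed page."""
    # Hop 0 is the seed itself; hop d contributes links_per_page ** d pages.
    return sum(links_per_page ** d for d in range(depth + 1))


for depth in (1, 3, 5):
    print(depth, max_pages(10, depth))
```

Under these assumptions, going from 3 hops to 5 hops raises the bound from 1,111 pages to 111,111, which is why depth and domain limits matter.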


Demo

Heritrix:
● https://administrator:[email protected]:12222/

Wayback:
● http://webarchive.cc.vt.edu/


Questions?