Page 1: Farl web archiving

A survey of web-based art resources with findings applicable to FARL electronic records collection development

Alison Rhonemus, LIS 698, Seminar and Practicum, Dr. Tula Giannini

Frick Art Reference Library

Deborah Kempe, Chief, Collections Management & Access

Web Survey and Collection Development

Coffee on the terrace

Page 2: Farl web archiving

M-LEAD-TWO

Intern enterprises: "collection assessments, digital resource surveys, web archiving, provide support for important consortial programs such as shared resources"
● Brooklyn Museum: Mark Daly, Ronnette Hope, Project Manager: Emily Atwater
● NYARC Latin American Resources (MOMA): Ralph Baylor
● FARL: Gretchen Nadasky, Alison Rhonemus

Page 3: Farl web archiving

Frick Art Reference Library

In early 2011, the Frick Art Reference Library and the Thomas J. Watson Library at The Metropolitan Museum of Art completed a pilot project to address coordinated collecting of born-digital auction catalogs using ContentDM and Archive-It.

Page 4: Farl web archiving

The FARL web archiving program is situated in Collection Development. Current plans for website capture include online auction catalogs and art web resources cataloged by NYARC.

Fellow M-LEAD-TWO intern Gretchen Nadasky has just described online auction catalogs. My project focused on NYARC cataloged websites.

Page 5: Farl web archiving

Web Archiving

"The Internet Archive is already doing it."

Actually, the IA is providing the tools for other institutions to use in archiving.

Page 6: Farl web archiving

Archive-It uses open source tools developed by the Internet Archive
● Heritrix Web Crawler
● Wayback Interface
● WARC format, an ISO standard
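For readers unfamiliar with WARC: each record is a block of named header fields followed by the captured payload. The sketch below parses the header of one hand-written, uncompressed record using only the Python standard library. The sample record and the helper name `parse_warc_header` are illustrative assumptions, not Archive-It output, and real files are usually gzip-compressed and better handled with a dedicated WARC library.

```python
def parse_warc_header(raw: bytes) -> dict:
    """Parse the header block of a single uncompressed WARC record."""
    # Headers end at the first blank line (CRLF CRLF); the payload follows.
    header_block, _, _payload = raw.partition(b"\r\n\r\n")
    lines = header_block.decode("utf-8").split("\r\n")
    version = lines[0]
    if not version.startswith("WARC/"):
        raise ValueError("not a WARC record")
    fields = {"version": version}
    for line in lines[1:]:
        # Split on the first colon only, since values may contain colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields


# Hypothetical record written by hand for illustration.
SAMPLE_RECORD = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"WARC-Date: 2013-05-01T12:00:00Z\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

header = parse_warc_header(SAMPLE_RECORD)
print(header["WARC-Type"], header["WARC-Target-URI"])
```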

Page 9: Farl web archiving

• Password-protected sites – cannot be archived

• JavaScript – more complicated implementations can be difficult to capture and display. Ongoing area of development.

• Videos – difficulty with some proprietary formats

• Form- and database-driven content – may be archived using a sitemap or other direct links to the content.

Evaluating seeds

Page 10: Farl web archiving

Robots.txt Blocks

The crawler by default respects all robots.txt files. Check post-crawl reports for blocked seeds or documents.

If your site is blocked:

a) Contact the site owner and ask if they will unblock

b) Ask your Partner Specialist to turn on the “ignore robots” feature in your account
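Candidate seeds can also be screened for robots.txt blocks before a crawl, rather than only in post-crawl reports. A minimal sketch using Python's standard `urllib.robotparser` follows. The robots.txt text, the seed URLs, and the user-agent string `example-crawler` are all hypothetical; substitute the user agent your crawler actually announces.

```python
from urllib.robotparser import RobotFileParser


def blocked_seeds(robots_txt, seeds, user_agent):
    """Return the seeds that the given robots.txt text disallows."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return [url for url in seeds if not parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt content for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

seeds = [
    "http://example.org/catalog/",
    "http://example.org/private/images/",
]
print(blocked_seeds(SAMPLE_ROBOTS, seeds, "example-crawler"))
```

To test a live site, `RobotFileParser.set_url()` followed by `read()` fetches the file over the network instead of parsing a string.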

Notes:

/ denotes single directory seed

subdomains.archive.org (add individually or expand seed)
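One way to read the two notes above is as a scoping rule: a seed ending in "/" limits capture to that directory, and a subdomain counts as a separate host unless it is added as its own seed. The sketch below encodes that reading only. The function `in_scope` is a hypothetical helper and a deliberate simplification, not Archive-It's actual scoping logic.

```python
from urllib.parse import urlsplit


def in_scope(seed: str, url: str) -> bool:
    """Simplified scope test based on the two notes above (an assumption)."""
    s, u = urlsplit(seed), urlsplit(url)
    if u.hostname != s.hostname:
        # Subdomains are different hosts: add them individually as seeds.
        return False
    if s.path.endswith("/"):
        # Trailing slash: stay inside that single directory.
        return u.path.startswith(s.path)
    # No trailing slash: assume the whole host is in scope.
    return True


seed = "http://example.org/exhibitions/"
print(in_scope(seed, "http://example.org/exhibitions/2013/index.html"))
print(in_scope(seed, "http://example.org/about.html"))
print(in_scope("http://example.org/", "http://blog.example.org/post"))
```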

Page 11: Farl web archiving

Site Survey Criteria
● html/flash/pdf
● images
● embedded material
● links
● directories and subdomains
● terms, rights statements and permissions

Page 12: Farl web archiving

Obvious ruse

Page 13: Farl web archiving

More of the obvious

Sites created without the intention of being archived are the sites in need of archiving.

Page 14: Farl web archiving

Survey Says

● 257 cataloged entries
● 168 resources are possible to capture
● 82 resources would require more research or display definite red flags for web archiving.
● PDFs are available for at least some of the content in 75 resources.
● Flash was an element in 23 resources
● 16 sites used HTML5
● 54 used a CMS like Drupal or WordPress
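To put the survey counts in proportion, the short sketch below expresses each figure as a share of the 257 cataloged entries. The counts come from the list above; the dictionary labels and the helper `share` are illustrative names, and the categories are not assumed to be mutually exclusive.

```python
TOTAL = 257  # cataloged entries surveyed

# Counts taken from the survey list; labels are shorthand for this sketch.
counts = {
    "possible to capture": 168,
    "more research or red flags": 82,
    "some content as PDF": 75,
    "Flash element": 23,
    "HTML5": 16,
    "CMS (Drupal, WordPress, etc.)": 54,
}


def share(n: int, total: int = TOTAL) -> float:
    """Percentage of the total, rounded to one decimal place."""
    return round(100 * n / total, 1)


for label, n in counts.items():
    print(f"{label}: {n} of {TOTAL} ({share(n)}%)")
```

By this arithmetic, roughly 65% of entries look capturable and about 32% need more research or raise red flags.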

Page 15: Farl web archiving

There were 3 cataloged resources no longer available on the live web but viewable through the Internet Archive. Another 2 defunct resources were not available through the Internet Archive. The main page for one of these lost resources was available as a snapshot in the Wayback Machine, but the actual cataloged resource was not.
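A check like this can be scripted against the Internet Archive's public Wayback availability endpoint, which returns JSON describing the closest snapshot for a URL. The sketch below is one possible approach and not the method used in the survey. The helper names are hypothetical, and the sample response is hand-written to mirror the documented shape rather than taken from a real lookup.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://archive.org/wayback/available"


def availability_url(url: str) -> str:
    """Build the query URL for the availability endpoint."""
    return f"{API}?{urlencode({'url': url})}"


def closest_snapshot(response: dict):
    """Return the closest snapshot URL, or None if nothing is archived."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check(url: str, timeout: float = 10):
    """Query the live endpoint (requires network access)."""
    with urlopen(availability_url(url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))


# Hand-written sample response for illustration only.
SAMPLE = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "status": "200",
            "timestamp": "20130101000000",
            "url": "http://web.archive.org/web/20130101000000/http://example.org/",
        }
    }
}
print(closest_snapshot(SAMPLE))
print(closest_snapshot({"archived_snapshots": {}}))
```

As the slide notes, a snapshot of a site's main page does not guarantee that the specific cataloged resource was captured, so each cataloged URL needs to be checked individually.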

Page 19: Farl web archiving

Change is Constant

Archive-It Updates:
● Heritrix 1 series to Heritrix 3 series (February)
● Archive-It 4.8 (May)

Page 20: Farl web archiving

Archive-It 4.8

Page 21: Farl web archiving

Plans

● Upcoming grants

● Capture of NYARC institution websites

● Include Wayback interface links in Arcade catalog records

● Continue to identify websites for capture and implement capture

Page 22: Farl web archiving

Conclusions

○ Digital resources not prevalent enough to reassign current staff

○ Website capture most costly in terms of staff time

○ Copyright continues to be an issue

○ Long term digital preservation needs yet to be assessed

○ Capture of Frick Collection and NYARC sites will serve as a challenging test case

