Web-at-Risk Test Crawl Report
California Digital Library
January 23, 2006

Contents

Test Crawl Overview
A Change of Plan
The Respondents
About Crawl Results
Pre-Crawl Analysis: Rights Issues
Test Crawl Settings and Process
Crawl Scope
Communication / Reports
Crawl Frequency
Language and Web Site Models
Multimedia
Comparison with Other Crawlers
Crawl Success
Conclusions
Next Steps

Web-at-Risk Test Crawl Report: Appendix A Sites Submitted

Web-at-Risk Test Crawl Report: Appendix B The Katrina Crawl
The Crawl
Gathering the Seeds
Crawling Specifics
Katrina Crawl Results
Site Selection and Classification
Rights, Ownership, and Responsibilities
Technical Infrastructure
Information Analysis and Display
Conclusions

Web-at-Risk Test Crawl Report: Appendix C Individual Crawl Reports
Elizabeth Cowell (submitted by Ann Latta): UC Merced
Sherry DeDekker: California Water Science Center
Megan Dreger (submitted by James Jacobs): City of San Diego Planning Department
Peter Filardo and Michael Nash: New York City Central Labor Council
Valerie Glenn and Arelene Weibel: Strengthening Social Security
Valerie Glenn and Arelene Weibel: Defense Base Closure and Realignment Commission
Gabriela Gray: Join Arnold
Gabriela Gray: Mayor-Elect Villaraigosa
Ron Heckart and Nick Robinson: Public Policy Institute of California
Terry Huwe: AFL-CIO
Kris Kasianovitz: Los Angeles Dept. of City Planning
Kris Kasianovitz: Southern California Association of Governments
Linda Kennedy: California Bay-Delta Authority
Janet Martorana: Santa Barbara County Department of Planning and Development
Lucia Orlando: Monterey Bay National Marine Sanctuary
Richard Pearce-Moses: Arizona Department of Water Resources
Richard Pearce-Moses: Citizens Clean Election Commission
Juri Stratford: City of Davis
Yvonne Wilson: Orange County Sanitation District
Crawl Report Key

Test Crawl Overview

During September and October 2005, the California Digital Library (CDL) embarked on a series of "test" web crawls for the Web-at-Risk project. These crawls were conducted using the Heritrix crawler [1] to gather content that was specifically requested by the 22 Web-at-Risk curatorial partners. In keeping with the scope of the Web-at-Risk project, the content consisted of government and political web sites. The purpose of these crawls was:

• To teach CDL staff details of Heritrix crawler settings and performance
• To inform the Web Archiving Service requirements and design concerning crawler performance, default settings, and the settings that need to be available to curators
• To learn more about the nature of the materials that curators hope to collect and preserve, particularly in regard to rights issues and technical obstacles
• To convey any peculiarities about the content back to the curators
• To gather curators' assessment of the crawler's effectiveness

A Change of Plan

The CDL requested that curators submit their sample URIs by August 25, 2005.[2] On August 28, Hurricane Katrina was upgraded to a Category 5 storm and Mayor Ray Nagin ordered the evacuation of New Orleans. As events unfolded in New Orleans and the Gulf Coast over the following week, CDL chose to suspend the original test plan and focus on gathering Hurricane Katrina information. Once this time-critical material was collected, we turned back to the original set of URIs provided by curators. Consequently, Appendix B of this report includes a brief synopsis and evaluation of our experience crawling the Katrina web materials; the body of this report concerns only our original test crawl scope. An additional development at this time was the release of Web aRchive Access (WERA),[3] an open-source tool for displaying crawl results (ARC files). This made it possible for us to display the results to our curators and improved their ability to assess the results.

The Respondents

Our request for test crawl URIs went out to 22 curators at the University of North Texas, New York University, and several University of California campuses.

[1] Heritrix web site, Internet Archive, <http://crawler.archive.org/>.
[2] URL (Uniform Resource Locator) and URI (Uniform Resource Identifier) are related terms. We will use URI in this report, in keeping with the terminology used in the Heritrix crawler documentation.
[3] Web aRchive Access (WERA), <http://nwa.nb.no/>.


In many cases, curators worked for different departments within the same institution. We asked each curator to send us a first-choice and second-choice URI, with a note describing what they hoped to capture about each site. Many of the curators involved in the test crawl activities either had previous web archiving experience or were quite familiar with the issues involved. A number of them warned us of the challenges our crawler might encounter with the particular sites they selected. The curators' response patterns indicated an important assumption about the service they expect. Three curators responded that they were not going to send any URIs because they were collaborating with another curator who would take care of selecting the test sites. This implies that our curators envision a service that will allow them to work collaboratively on building a web archive collection. In all three cases the respondents and their collaborators were from the same institution. In a fourth case, a respondent said that he suspected he would send in the same sites as another curator because he knew they were working on similar issues. The two individuals concerned were from entirely different institutions, but nonetheless see the potential to work collaboratively.

About Crawl Results

Fifteen curators sent their site selections, giving us a total of 30 sites to choose from for our test crawls. It is critical to note that this sample is far too small to reveal anything conclusive about the nature of web sites or crawlers. In some cases the results may be affected by the peculiarities of network traffic and server performance at the time the sites were crawled; in other cases they may be affected by our own learning process in using Heritrix and WERA. The value of these tests is to learn more about the specific interests and reactions of our group of curators and to acquaint us all with the tools at hand. The following results should be interpreted with those caveats in mind.

Pre-Crawl Analysis: Rights Issues

Before crawling the test sites, we conducted an initial review of the sites in question to determine what issues they might pose, including rights management issues. In the "Web-at-Risk Rights Management Protocol," the CDL outlines the approach to rights management that will be followed for the Web-at-Risk project and that will inform the development of the Web Archiving Service. One aspect of this plan is that for each site selected for crawling, the curator will determine which of the three following categories to apply:

• Scheme A: Consent implied
• Scheme B: Consent sought
• Scheme C: Consent required

Only Scheme C requires the curator to get advance, explicit permission to crawl the site. It is hoped that the project's focus on government and political information will ensure that most materials fall within Scheme A or B. However, in spite of this subject focus, this small collection of sites presented some interesting rights issues.


We used a two-step review process to match our 30 suggested sites to the correct rights scheme. First, we looked at the nature of the content owner's organization (federal government, nonprofit organization, etc.). This was done by reviewing the organization itself as well as the domain name used for the organization's web site. This gave us a first-round guess at what the correct rights scheme would be for each site. Next, the sites were carefully reviewed for copyright notices and our initial determinations were revised. The breakdown of agencies by type is as follows:

[Bar chart: Site Types. Categories: Federal, State, Local, Non-Profit; values: 3, 9, 9, 9.]
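The first-round, domain-based guess described above can be pictured as a simple lookup. The sketch below is purely illustrative: the suffix-to-scheme mapping is an assumption of ours, not the project's actual rule set, and the second-pass review of copyright statements is what actually settled each site's scheme.

```python
# Illustrative sketch only: a first-round rights-scheme guess from the seed's
# domain suffix. The mapping is an assumption for illustration, not the
# Web-at-Risk project's actual determination rules.
from urllib.parse import urlparse

ASSUMED_SCHEME_BY_SUFFIX = {
    ".gov": "A (consent implied)",
    ".us":  "A (consent implied)",
    ".edu": "B (consent sought)",
    ".org": "B (consent sought)",
    ".com": "C (consent required)",
}

def first_round_guess(seed_uri: str) -> str:
    """Guess a rights scheme from the domain suffix; default to Scheme B so
    that a human reviewer takes a closer look."""
    host = (urlparse(seed_uri).hostname or "").lower()
    for suffix, scheme in ASSUMED_SCHEME_BY_SUFFIX.items():
        if host.endswith(suffix):
            return scheme
    return "B (consent sought)"

print(first_round_guess("http://www.brac.gov/"))       # A (consent implied)
print(first_round_guess("http://www.joinarnold.com"))  # C (consent required)
```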

Seven of the 15 sites were from agencies devoted to water management. The full list of sites submitted is available in Appendix A.

Site domains
Although 21 sites were published by government agencies, only 12 of them were in the .gov domain. By domain, the sites included:

[Bar chart: Domains (all sites). Categories: gov, us, org, com, edu.]

The nine local government sites presented by far the most variety in domain names. There was a weak correlation between domain names and the nature of the agency at the local government level.


[Bar chart: Domains (local sites only). Categories: gov, us, org, com, edu.]

Copyright statements
We next reviewed each site to determine whether copyright statements on the site could help determine what rights scheme might apply to the site. The copyright statements for the sites we crawled are available with each individual crawl report in Appendix C. Here too, local government sites offered little correlation between the nature of the content-owning organization and the rights statements displayed on the site. City web sites varied dramatically in their rights statements: some stated that their materials were in the public domain; others vigorously defended their copyright. This City of San Diego site did both.[4]

After both rights reviews, it was determined that of the 30 sites submitted:

• 14 fell into Rights Scheme A and could be crawled without notification or permission.
• 13 fell into Rights Scheme B and could be crawled, but would also require identifying and notifying the content owner.
• 3 fell within Rights Scheme C and would require the explicit consent of the content owner prior to crawling.

The process of reviewing the sites for rights statements changed our assessment of the correct rights scheme in a number of cases, and all three "Scheme C: Consent required" designations were made on the basis of statements posted on the site. Note that we did not ultimately seek permission for these materials, and access to the results of our crawls has been strictly limited to the curators and project staff for the purpose of crawl analysis.

[4] City of San Diego web site, Disclaimer page, <http://www.sandiego.gov/directories/disclaimer.shtml>.


In short, our pre-crawl analysis of these 30 sites brought up complex rights issues and exemplified the challenges that lay ahead.

Test Crawl Settings and Process

Although we had originally planned to crawl only one site for each curator, some curators supplied two sites that posed interesting problems; in these cases we crawled both. Each site was crawled with two settings, resulting in 19 test sites and 38 total crawls. We used Heritrix version 1.5.1 to conduct the test crawls, running four crawler instances on two servers. Each site was crawled separately; that is, each seed list contained one URI. We kept most default settings, except for the following.

Crawl size: Each crawl was set to stop at a maximum of 1 gigabyte (GB) of data. Of the 38 crawls conducted, 18 hit the 1 GB size limit. Note that this limitation was imposed for the purpose of these early tests and will not be applied to future services.

Crawl duration: When crawls took an inordinately long time to complete, we started over again with "max retries" set at three. This setting improved crawler performance when pausing or hanging was an issue.

Politeness [5]: Because we crawled each site individually, we set our crawler for very high politeness values. Politeness pertains to the impact of the crawl on the content owner's server and is determined by combining a few different Heritrix settings that together determine how demanding the crawler is on the remote servers' resources. Our politeness settings were:

• delay-factor: 5
• max-delay-ms: 5000
• min-delay-ms: 500
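To make these numbers concrete, the sketch below approximates how Heritrix-style politeness is usually described: the crawler waits a multiple (delay-factor) of the time the previous fetch from that host took, clamped between min-delay-ms and max-delay-ms. This is an illustrative approximation in Python, not the project's actual configuration file or the crawler's source code.

```python
# Illustrative sketch (not actual Heritrix code): approximate the politeness
# pause between requests to the same host, using the settings reported above.
# All values are in milliseconds.

DELAY_FACTOR = 5
MAX_DELAY_MS = 5000
MIN_DELAY_MS = 500

def politeness_delay_ms(last_fetch_duration_ms: float) -> float:
    """Wait delay-factor times the duration of the previous fetch from this
    host, but never less than min-delay-ms or more than max-delay-ms."""
    delay = DELAY_FACTOR * last_fetch_duration_ms
    return max(MIN_DELAY_MS, min(MAX_DELAY_MS, delay))

if __name__ == "__main__":
    # A page that took 800 ms to fetch leads to a 4000 ms pause;
    # a page that took 50 ms still gets the 500 ms minimum.
    for duration in (50, 800, 3000):
        print(duration, "->", politeness_delay_ms(duration))
```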

Original host only vs. linked hosts included: Each site was crawled with two settings. The first setting restricted results to only the host from the original seed URI. The second setting also gathered any pages the site linked to directly, but no more. This second setting was constructed to gather pages considered relevant to the original site, and to capture sites in their entirety when an organization relies on more than one host name to provide its web presence.
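The difference between the two settings amounts to a simple scope rule. The sketch below is our own illustration of that rule, using hypothetical URIs; it is not Heritrix's actual scope implementation.

```python
# Illustrative sketch of the two crawl-scope rules described above.
# This is not Heritrix code; it only models the decision logic.
from urllib.parse import urlparse

SEED = "http://www.example.gov/"          # hypothetical seed URI
SEED_HOST = urlparse(SEED).hostname

def in_scope_original_host_only(uri: str) -> bool:
    """Setting 1: keep a URI only if it lives on the seed's host."""
    return urlparse(uri).hostname == SEED_HOST

def in_scope_linked_hosts_included(uri: str, referrer: str) -> bool:
    """Setting 2: also keep a URI on another host, but only when it was
    linked directly from a page on the seed host (one hop off-site)."""
    if in_scope_original_host_only(uri):
        return True
    return urlparse(referrer).hostname == SEED_HOST

# Example: a PDF hosted on a second server but linked from the seed site
# is excluded by setting 1 and included by setting 2.
print(in_scope_original_host_only("http://docs.example.org/report.pdf"))           # False
print(in_scope_linked_hosts_included("http://docs.example.org/report.pdf", SEED))  # True
```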

Crawl Scope

A comparison of the two different crawl settings used (original host only vs. linked hosts included) turned up some counterintuitive results.

[5] For further information, see section 6.3.3.1, "Politeness," in the Heritrix User Manual, <http://crawler.archive.org/articles/user_manual.html>.


When compared quantitatively, 8 out of 19 crawls took longer to capture the site when limited to "original host only" than with the "linked hosts" setting. It is not clear why this is the case, since the "linked hosts" crawl should be much larger; indeed, in all cases the linked hosts crawl retrieved more files than the original host crawl. The following two tables compare both the number of files retrieved and the duration of the two types of crawls.

Table 1: Number of files retrieved

           Original Host Only    Linked Hosts Included
Most       46,197                70,114
Fewest     247                   1,343
Median     2,423                 9,250
Average    6,359                 17,247

Table 2: Duration of the crawl

           Original Host Only    Linked Hosts Included
Longest    32 hr 21 min          37 hr 11 min
Shortest   18 min                19 min
Median     7 hr 33 min           11 hr 22 min
Average    1 hr 42 min           7 hr 9 min

Given that this is a very small sample of crawls, and that the gap between the largest and smallest crawls is fairly noteworthy, perhaps the only telling figure to consider here is the median. According to the median figures, with only 50.5% more time the crawler acquired over 281% more documents. When compared qualitatively, the results also appeared somewhat counterintuitive. Of the 18 curators who responded, 12 stated that they preferred the "original host only" crawl (four were undecided). We would have expected this preference to vary a little more from site to site. Oddly, one of the two curators who preferred the larger crawl scope had a crawl that captured materials from over 2,500 other hosts. In some cases a site's links to exterior hosts are critical: the site's value may hinge upon how well it gathers documents from other sources. Both of the curators who preferred the broader setting did so for this reason:

"For this site it was essential to capture the link hosts (via) because many of the press materials, etc. were on external sites."

More critically, sites are also often composed of content from more than one server. This is particularly likely to be the case if a site is providing a large body of PDF or multimedia files, so a crawl restricted to the original host would be missing critical segments of the site's content. Our test crawls did in fact turn up sites that were composed of more than one host name. For example, in the case of UC Merced, separate host names are used for different areas of the site, such as faculty or admissions.


In the case of the Arizona Department of Water Resources, the distinction between host names appears to be accidental, perhaps the result of an attempt to transition to a simpler, more memorable URI. Most pages from this site come from www.azwater.gov, but hundreds of internal links, including critical style sheet files, are still hard-coded to point to www.water.az.gov. Finally, when the crawl is restricted to the original host, the end user is much more likely to encounter errors when viewing the archived results. When the end user selects a link that was not captured, WERA provides a "Sorry, this URI was not found" message. When the linked hosts are included, the end user browsing the site sees the site closer to its original context and with fewer error messages. Conversely, when end users encounter frequent error screens, they may develop both frustration and a sense of mistrust in the quality of the archive. It is worth noting that the curators are not likely to browse these results in the same way that an end user of their archives might. The curators know what these sites contain, choose them accordingly, and may be less inclined to click on links that would result in a "Sorry, this URI was not found" message. Ultimately, the value of a site's external links would seem likely to vary depending on the nature of the site. Sites with rich internal content and only "frivolous" external links would be best captured with the "original host only" setting.

Before we ran these crawls, we asked curators to specify what they hoped a crawl would capture. Many referenced specific pages or directories they hoped to capture, and of those, three specified URIs that were not from the original host. When reviewing the results, two out of those three still preferred the original host crawl, even though that crawl did not capture the materials they specifically hoped to capture. Why was the feedback so consistent on this point? A look at the WERA interface used to display crawl results may provide an answer. WERA does not offer an immediate means of browsing a site; you have to search by keyword to find your way "into" the captured site. Once you have a page displayed you can browse within the site, but you must begin by searching for the right starting point. As the image below shows, you select the site you want to search from a dropdown menu, then enter terms to search against.

[Screenshot: WERA search interface]

Because all pages from the more comprehensive crawls are indexed, the search results include pages from all of those other hosts. This sets up a bit of cognitive dissonance: the user specified a search against a particular site, yet results from other hosts vastly outnumber pages from that site.


Underlying the issue of crawl scope is the deeper question of what an archivist hopes to capture when a site is crawled. Is it just a list of particular documents? Or is it a faithful recreation of the site as it existed on that day? It may be that a captured site has content of primary and secondary importance. The primary content is what should be retrieved when searching against the archive, while the secondary content should only be present to avoid error messages and establish the site's full original context. Another approach is suggested by this curator's response:

"The results on the multiple hosts crawl is mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains rather than 'source + 1.'"

This suggests a need for the ability to designate seed URIs as related components of a single site.

Communication / Reports

When we reported the test results back to curators, we provided a synopsis of the crawl results, links to particular Heritrix reports, and a link to the WERA display interface. The Heritrix reports are all plain text, providing tables of MIME type or response code frequency. Attempting to integrate these reports and the display of the archived results is a challenge. One curator, for example, obtained documents from over 200 hosts in the "linked hosts included" crawl, but was only aware of having found 10 additional documents when reviewing these same search results in WERA. Although WERA is helpful for seeing results from an end user's perspective, it does not provide adequate tools for analysis. In some cases this is simply because WERA is a new and occasionally buggy tool. It is possible, for instance, to follow occasional links out of the archive and into "real-time" sites. In some cases it is also possible to browse to a page and display it, but when you search for that same page by its URI, WERA does not find anything. One curator notes:

"We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE; the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly."

It is also important to note that although WERA was used for the purpose of reporting test crawl results, it is not envisioned as the final display interface for the Web Archiving Service. Even so, the feedback the curators provide about WERA should inform the functionality of the WAS interface. Clearly, it is still quite a struggle for curators to determine exactly what a crawl retrieved. One curator reports:

"After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report there are 8,899 documents in this crawl alone. My other crawl yielded 2,991 documents. I think that the tools that are being developed will help us manage these crawls."

Crawl Frequency

When asked how frequently they wanted to crawl sites, curators responded with a variety of preferences:

[Bar chart: Desired Crawl Frequency. Categories: Daily, Weekly, Monthly, Once, Unknown.]

Again, it is worth considering precisely what curators hope to capture in a repeated crawl of a site. Some insight is provided by these curators' comments:

"We hope the crawler will be able to report when new publication files are posted on the web site."

And

"The ability to report on new publications is critical to our goal of using the crawler as a discovery tool."

As with the other NDIIPP grants, the purpose of the Web Archiving Service tools will be archiving and preservation, not resource discovery. This indicates that we should further investigate what a "weekly" or "monthly" crawl really means to curators. If a site was not updated over the course of a year, would the curator want to continue running weekly crawls of the site? Would the curator want each crawl to appear on an archive timeline for that site, even if the content was no different for each date? One curator comments:

"I want to qualify the frequency for this site. I'd like to do a monthly crawl for three-four months. I'd want [to] reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot [of] new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation."


Language and Web Site Models

One of the challenges of communicating and interpreting crawl results is that crawlers don't necessarily work the way people envision them to. Further, the appearance of a web site on a screen and its architecture on a server may be quite different: the directory structure of a site may have no relationship to the way its navigation is organized on a screen. One frequent point of confusion curators encountered while interpreting crawl results is the concept of how many "levels down" the crawler went. One curator requested that we "drill down several levels (at least 3)" in our capture. One challenge with this request is that "levels down" can be interpreted to mean different things. In some cases curators clearly mean the directory structure of the web site on the remote server. In other cases, however, they seem to mean the number of hosts away from the original site:

"In general, restricted to original host works better. The broader search includes pages from businesses, e.g. Comcast, and other local agencies, e.g. other local and state government sites. But restricting the outside sites to the first level seems to be a good compromise."

If the curator is referring to a structure of subdirectories, the next challenge is that the Heritrix crawler does not work this way. Heritrix ignores the site's directory structure and instead follows links from the seed URI it is provided. The Heritrix manual specifically defines "depth" in this way:

"This scope allows for limiting the depth of a crawl (how many links away Heritrix should crawl) but does not impose any limits on the hosts, domains, or URI paths crawled." [6]
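The difference between link hops and directory depth can be shown with a small example. The sketch below is our own illustration using hypothetical URIs and a hypothetical link graph: a document buried several directories deep is still only one link hop from the seed if the home page links to it directly.

```python
# Illustrative sketch: "depth" as link hops (what Heritrix limits) vs.
# directory depth in the URI path (what curators often picture).
# The link graph and URIs below are hypothetical.
from collections import deque
from urllib.parse import urlparse

links = {
    "http://www.example.gov/": [
        "http://www.example.gov/planning/eir/2005/final/report.pdf",
        "http://www.example.gov/about.html",
    ],
    "http://www.example.gov/about.html": [
        "http://www.example.gov/contact.html",
    ],
}

def link_hops(seed: str) -> dict:
    """Breadth-first count of how many links away each page is from the seed."""
    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in hops:
                hops[target] = hops[page] + 1
                queue.append(target)
    return hops

def directory_depth(uri: str) -> int:
    """Number of path segments in the URI, regardless of how it was reached."""
    return len([seg for seg in urlparse(uri).path.split("/") if seg])

report = "http://www.example.gov/planning/eir/2005/final/report.pdf"
print(link_hops("http://www.example.gov/")[report])  # 1 hop from the seed
print(directory_depth(report))                       # 5 path segments deep
```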

Multimedia

Some curators chose sites because of the value of their multimedia resources. The crawler yielded mixed results in capturing these resources. For one site:

• A text search on the log file turns up numerous .ram files, but only one .ppt file.
• .asx files are Windows streaming media redirector files, which generally lead to associated .wmf files. No .asx references appear in the crawl log, nor do any .wmf files.
• .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files.
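Checks like these can be partially automated by tallying file extensions from the crawl log. The sketch below is a generic illustration: it assumes a whitespace-delimited log with the URI in a known column, and that column position is a placeholder that would need to be adjusted to the actual Heritrix crawl.log layout.

```python
# Illustrative sketch: tally media-file extensions seen in a crawl log.
# Assumes a whitespace-delimited log line containing the fetched URI; the
# column index is a placeholder, not a statement of the real crawl.log format.
from collections import Counter
from urllib.parse import urlparse
import sys

URI_COLUMN = 3          # placeholder: position of the URI field in each line
EXTENSIONS = {".ram", ".asx", ".smil", ".rm", ".wmf", ".ppt", ".pdf"}

def count_extensions(log_path: str) -> Counter:
    counts = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            fields = line.split()
            if len(fields) <= URI_COLUMN:
                continue
            path = urlparse(fields[URI_COLUMN]).path.lower()
            for ext in EXTENSIONS:
                if path.endswith(ext):
                    counts[ext] += 1
    return counts

if __name__ == "__main__":
    print(count_extensions(sys.argv[1]))
```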

As noted, nearly half the sites crawled reached the 1 GB size limit and so did not complete. This makes it difficult to determine whether there were genuine problems with particular types of files, or whether the crawler simply did not get to the missing files before the crawl was stopped.

[6] Heritrix User Manual, section 6.1.1, "Crawl Scope: Broad Scope," <http://crawler.archive.org/articles/user_manual.html>.

Comparison with Other Crawlers

Three of our curators had previously used other crawlers, such as HTTrack and Wget, to capture the same sites we attempted with Heritrix. Because these curators provided strong details when describing their sites, we crawled all of the sites they sent us (six sites). In one case, the Defense Base Closure and Realignment Commission, the curator had greater success capturing .aspx files with HTTrack than we had with Heritrix. Other comparisons:

"We were surprised that your crawl found 4,888 documents. Another crawl that we conducted at about the same time using Wget found only 1,474. However, both spiders found roughly the same number of bytes. As I understand, Wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can. That may be the difference."

And

"We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with Wget."

Crawl Success

We asked curators to rate the overall success of the test crawls on the following scale:

• Not effective (none of the desired documents were captured)
• Somewhat effective (some of the desired documents were captured)
• Mostly effective (most of the desired documents were captured)
• Effective (all of the desired documents were captured)

[Bar chart: Crawl Success. Categories: Effective, Mostly Effective, Somewhat Effective, Not Effective.]

Here are some of the comments curators had concerning the overall success of these tests. The full text of all curator reports and their feedback to CDL is available in Appendix C.


Los Angeles Planning Department

"The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all PDFs. When searching specifically for EIRs I got a large result list (699 citations); however [sic], when I investigated whether or not the actual file was captured, I found that the main EIR page was captured, typically a htm/html file, but when I clicked on a link to get to the full report all I got was the 'Sorry, no Documents w/the given URI were found' message. This could be that the file was no longer available when the site was harvested. However, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved."

City of San Diego Planning Department
This comment is from a curator who is filling in for the person who originated the test crawl request, so she is also attempting to interpret what that other person was hoping to capture:

"Due to the vague request to 'drill down several levels,' I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example, http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit."

Defense Base Closure and Realignment Commission

"I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured."

Public Policy Institute of California


"There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an 'object not found' message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on 'Date' retrieves the message 'Sorry, no documents with the given uri were found.' 3) The search boxes are not functional; searches retrieve 'Sorry, no documents with the given uri were found.'"

AFL-CIO

"I realize the collection interface is a 'work in progress' and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g. 'Working Families Toolkit,' 'BushWatch') that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2 million-plus hits per year web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer, which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses."

Conclusions

The process of running these test crawls has been valuable. It has resulted in a number of lessons learned and further directions for inquiry. It is clear that the tools that help curators analyze the results of these crawls will have to do more than mimic the original browsing context or supply tables of data. We will be looking for any opportunity to improve WAS reporting capabilities as we move forward with the project. The differences encountered in language and visualization underscore the importance of a strong, intuitive design for the curator tools and of clear help screens. Each of us may visualize web sites differently, and the crawler may behave differently than we expect. A certain degree of online help will be needed to design crawls effectively, and further guidance should be available to help people interpret crawl results when those results don't match what the person anticipated.


The help and documentation for the Web Archiving Service will also need to address the rights analysis issues raised above. In most cases this is work that is done prior to issuing crawls; it cannot be addressed by the design of the WAS interface alone. When the Web-at-Risk project reaches the point of conducting usability studies, we should be sure to include tests that further uncover how users understand crawl frequency settings. Additionally, the desire for an analysis tool that can convey when a site has changed significantly is not limited to this project; it is a common issue faced by the Internet Archive, members of the International Internet Preservation Consortium, and others. CDL is actively communicating with these organizations as we all work toward a solution for this problem. Similarly, CDL should ensure that future assessment and analysis work with our curators addresses the issues raised by the two crawl scope settings. This inquiry should also extend to participants who exemplify end users, to determine whether people using web archives experience crawl scope differently than people who build them. Finally, a number of lessons were learned via the Katrina crawl described in Appendix B. The most outstanding finding is that event-based crawls such as Katrina and site-specific crawls such as these have quite different characteristics and require different functionality and analysis tools.

Next Steps

There is certainly more to learn from these crawl results and about Heritrix; in some cases it is still unclear why the crawler failed to retrieve certain documents. The curators' feedback concerning these results has been extremely valuable. They have provided insight as to what was captured and what is still missing, which would have been difficult to determine without their subject expertise in the sites chosen. As we continue working to improve crawler success and performance, we will turn our attention to the Virtual Remote Control site created by Cornell. This site was developed to present particular problems to crawlers and is well documented, enabling the user to gauge a crawler's results. We will be using that site to replicate particular problems raised in our test crawl set, and may also reattempt some of the sites crawled for these tests. When we release the first version of the Web Archiving Service to curators in July 2006, we will request that they include their original test site in the crawls they attempt and compare the results with these tests. The test results for these crawls will remain in place as a point of comparison while we continue developing the Web Archiving Service. Additionally, certain issues raised by this feedback, particularly the desired scope and frequency of crawls, merit attention in future usability analysis work.


Web-at-Risk Test Crawl Report: Appendix A Sites Submitted

Curator | Site | Crawled?
Sherry DeDekker | http://ca.water.usgs.gov (California Water Science Center) | Yes
Sherry DeDekker | http://www.dwr.water.ca.gov (California Department of Water Resources) | No
Peter Filardo and Michael Nash | http://www.nycclc.org (New York City Central Labor Council) | Yes
Peter Filardo and Michael Nash | http://www.dsausa.org (Democratic Socialists of America) | No
Valerie Glenn and Arelene Weibel | http://www.strengtheningsocialsecurity.gov (Strengthening Social Security) | Yes
Valerie Glenn and Arelene Weibel | http://www.brac.gov (The Defense Base Closure and Realignment Commission) | Yes
Gabriela Gray | http://www.joinarnold.com (Join Arnold) | Yes
Gabriela Gray | http://www.antonio2005.com (Mayor-elect Antonio Villaraigosa) | Yes
Ron Heckart and Nick Robinson | http://www.ppic.org (Public Policy Institute of California) | Yes
Ron Heckart and Nick Robinson | http://www.cbp.org (California Budget Project) | No
Terrence Huwe | http://www.aflcio.org (AFL-CIO) | Yes
Terrence Huwe | http://www.seiu.org (Service Employees International Union) | No
James Jacobs | http://www.sandiego.gov/planning (City of San Diego Planning Department; analyzed by Megan Dreger) | Yes
James Jacobs | http://www.sandag.org (San Diego Association of Governments) | No
Kris Kasianovitz | http://cityplanning.lacity.org (Los Angeles Department of City Planning) | Yes
Kris Kasianovitz | http://www.scag.ca.gov (Southern California Association of Governments) | Yes
Linda Kennedy | http://calwater.ca.gov (California Bay-Delta Authority, CALFED) | Yes
Linda Kennedy | http://www.dfg.ca.gov (California Department of Fish and Game) | No
Ann Latta | http://www.ucmerced.edu (UC Merced; analyzed by Elizabeth Cowell) | Yes
Ann Latta | http://www.coastal.ca.gov/web (California Coastal Commission) | No
Janet Martorana | http://www.countyofsb.org/plandev/default.htm (Santa Barbara County Department of Planning and Development) | Yes
Janet Martorana | http://www.sbcag.org (Santa Barbara County Association of Governments) | No
Lucia Orlando | http://montereybay.noaa.gov (Monterey Bay National Marine Sanctuary) | Yes
Lucia Orlando | http://www.waterboards.ca.gov/centralcoast (Central Coast Regional Water Quality Control Board) | No
Richard Pearce-Moses | http://www.azwater.gov (Arizona Department of Water Resources) | Yes
Richard Pearce-Moses | http://www.ccec.state.az.us/ccecscr/home.asp (Citizen's Clean Election Commission) | Yes
Juri Stratford | http://www.city.davis.ca.us (City of Davis, California) | Yes
Juri Stratford | http://www.sacog.org (Sacramento Area Council of Governments) | No
Yvonne Wilson | http://www.ocsd.com (The Orange County Sanitation District) | Yes


Web-at-Risk Test Crawl Report: Appendix B The Katrina Crawl

The Crawl

During the early fall of 2005, the California Digital Library was embarking on a series of test crawls as part of the Web-at-Risk project when Hurricane Katrina struck. CDL had requested that curators submit their sample URLs by August 25th, and on Friday, August 26th, Louisiana Governor Blanco declared a state of emergency. It was over that weekend that CDL staff reached the conclusion that an event of this historic significance would shift our priorities. On Monday we suspended our initial test crawl plans and began preparing on a number of fronts to capture web-based materials related to Katrina. Specifically, this meant:

• identifying which web sites or sub-sites to collect, which involved:
  o notifying 30+ curators of our emergency crawl plans and requesting their assistance
  o setting up a web-based interface to gather their suggested seed URLs
  o sorting through the resulting seed list and feeding it to our primary and secondary (Stanford) crawl operators
• determining our collection parameters
• getting our crawlers installed and configured
• locating and setting up disk space to store the crawls
• initiating and monitoring the crawls

This crawl activity posed several new challenges. It was the first time CDL staff had used the Heritrix crawler, which was not yet installed on a production server when the hurricane struck. The last week of August saw CDL and UC Berkeley staff scrambling to find server space to set up Heritrix instances, disk partitions, and job parameters for the crawl. This was done with remarkable speed, owing to the urgency of the situation and to everyone's desire to capture a record of the events. Given our limited experience, in order to reduce the risk of losing the historically significant and fleeting materials related to Hurricane Katrina, CDL worked with Stanford University to concurrently run the same crawl using a different crawler. After both CDL and Stanford had crawled the same seed list for a month, the task was taken up by the San Diego Supercomputer Center, who have continued crawling these sites using Heritrix.

Gathering the Seeds

The CDL sent out an initial request to the Web-at-Risk curators to submit URLs related to Hurricane Katrina for crawling. We worked collaboratively with a large group of content specialists to identify the sites:

• The 22 curators of the Web-at-Risk project (University of California libraries, University of North Texas, New York University, Stanford University, and the Arizona State Library)

• The Library of Congress
• Librarians at Louisiana State University and the University of Mississippi


Over the course of the crawl, the list of seed URLs grew to over 700 (just over 500 of which were crawled by CDL). The image of our input form above provides some sense of the range of materials collected. Given the sudden nature of this event, there was no time to investigate rights issues or technical problems each site might have presented. We informed the curators that "Our immediate plan is to simply collect the material before it disappears. We will not make the material immediately available." We also had little time for quality control and were not able to guarantee that sites in our seed list would be comprehensively crawled.

Crawling Specifics

CDL's first crawl was run on September 1, 2005, using a seed list of 89 URLs. The final crawl run by CDL was on October 10, using a seed list of 589 URLs. This final seed list was then sent to the San Diego Supercomputer Center, who are continuing to run twice-weekly crawls.


The CDL crawls were done using Heritrix version 1.5.1. We began with a single instance of Heritrix, eventually dividing the seed list among six instances. We got through the entire seed list 29 times in 40 days. However, the content we were able to collect was limited by our crawler settings. Our crawler was configured to visit one host at a time, collecting content at a conservative rate, and never more than three hops away from the seed URL. To ensure that the crawler moved through the seed list in a timely way, it was limited to spending 15 minutes at any given host. So the material collected does not represent the entirety of what was available at each site.

Note that all of the seeds were crawled with the same configuration, no matter how different the structure of these sites might be. Thus the New Orleans version of Craigslist, NASA's information pages, and blog sites were all crawled in the same manner, despite being quite different in context, architecture, and other characteristics. Given our short preparation time, the goal was to find a crawler configuration that would match the broadest range of Katrina materials. In most cases, whether the seed URL was for a personal blog or for a government agency, the seed itself was not centrally about Katrina. The Katrina information was generally situated on the front page and top levels of each site, so CDL expected that this setting, even with the 15-minute time limit, would at least capture the content nearest to the front pages of the widest variety of sites.

In choosing crawler configuration settings we also faced conflicting goals. On one hand, we wanted to capture as much Katrina content as possible. On the other hand, we were very hesitant to start hitting sites that were providing much-needed information at a crucial time (emergency sites, relief sites), thus making those sites hard to reach because of our capture activity. Many of these sites were already heavily used and perhaps not running at full capacity; some were also geographically impacted directly by the hurricane. So we had to choose settings that balanced the need to collect with politeness across a wide range of sites.

When the San Diego Supercomputer Center took over the Katrina crawls in October, they revisited the Heritrix configuration settings. The SDSC crawl placed a limit not on the amount of time spent at a site but on the number of documents to be retrieved overall. Their crawls are being conducted "breadth first," in the sense of gathering pages consecutively across the seed list rather than gathering pages from one site at a time. The Stanford University crawl of the same sites began on September 7th and continued for 30 consecutive days using the WebVac crawler. While we have statistics concerning the size of the Stanford Katrina collection, it is very difficult to compare the configuration settings and effectiveness of WebVac vs. Heritrix. Further, there is no easy way to display materials gathered with WebVac, so the Stanford Katrina content is stored but is not accessible to us at the moment.

Katrina Crawl Results

In terms of creating a collection of the Katrina event on the web, we are not satisfied that CDL's crawling efforts were successful. As mentioned, we had to impose time limits on each site visited and were not able to conduct quality control on the URLs that were captured. The total size of the Katrina capture at CDL is 50 GB. This represents 29 completed crawls of the seed list, which grew to over 500 seeds over the 40 days we ran the crawl. It is clear from the byte size alone that we barely skimmed the surface, and indeed the Stanford crawl proves this point: when the Stanford crawl was based on about 370 seeds, it was pulling in about 35 GB a day, or 1.5 million pages a day.

In spite of this, our attempts to capture Katrina news and events did have some very useful and positive outcomes. In terms of test crawling, the experience was quite successful, opening our eyes to both certain obstacles and new possibilities. Most importantly, it prompted us to consider the nature and demands of event-based crawling, which had not been part of our original test plan. Here are some of the aspects of event-based crawls that Katrina surfaced.

Site Selection and Classification

This type of event demands a deep, collaborative effort to identify material to collect. CDL had to mobilize a large number of curators to make their best guesses at identifying promising sites. We set up a "crawl seed nomination" web form, accessible to the curators from their home institutions, and invited the curators to use it to enter their suggested seed URLs. CDL staff did not have time to add many automatic error- or duplicate-checking features, so this cumbersome work was done by hand (a rough sketch of such automated checking appears at the end of this section). On the other hand, it provided us a first-hand trial of what a more general curator user interface might require.

The selection and management of seed lists is critical for sudden, event-based crawls. The curators contributing the URLs will not necessarily be well versed in the topic; in the case of Katrina, curators in California were not uniformly familiar with the Gulf Coast, the towns, the government agencies, etc. In addition, it is difficult to predict which aspects of the event will be of historic, enduring value. Because disk storage was not a pressing issue, it was better to err on the side of a wider net when selecting sites. It became clear that there might be a role for "smart crawling," which would spread a wider net than that provided by human-generated seed lists through the use of automated tools that discover relevant materials. We were specifically interested in using tools developed at Emory and Cornell that interface with the Rainbow text classification tool.[7] We did not have the resources to investigate this for the Katrina crawl, but plan to fold it into future crawling tests.

Finally, it is worth examining why the seed list grew continuously throughout the event. Part of this growth is due to the natural increase in the number of pages and sites devoted to the emergency, and to having the additional time to identify and add them. However, the nature of the event itself changed over time, starting as a hurricane, then becoming a flood, a massive relocation, and a political and social issue. So the range of relevant sites changed as the event itself took on broader implications. This suggests that site selection is an ongoing process, not strictly an activity undertaken at the beginning of an event.

[7] Bergmark, Donna. "Heritrix processor for use with rainbow," <http://groups.yahoo.com/group/archive-crawler/message/1905>.
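As a small illustration of the automated seed checking mentioned above, the sketch below normalizes and de-duplicates nominated seed URLs. It is a hypothetical outline of the kind of check a nomination form could apply before human review; the normalization rules are assumptions, not a description of the form CDL actually built.

```python
# Illustrative sketch: normalize and de-duplicate nominated seed URLs before
# a human review. The rules here are assumptions for illustration only.
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

def normalize_seed(raw: str) -> Optional[str]:
    """Return a canonical form of a nominated seed URL, or None if it is
    obviously malformed."""
    raw = raw.strip()
    if not raw:
        return None
    if "://" not in raw:
        raw = "http://" + raw                      # tolerate a missing scheme
    scheme, host, path, query, _ = urlsplit(raw)
    if not host or " " in host or "." not in host:
        return None                                # reject obvious non-URLs
    return urlunsplit((scheme.lower(), host.lower(), path or "/", query, ""))

def dedupe_seeds(nominations: list) -> list:
    """Drop malformed and duplicate nominations, preserving first-seen order."""
    seen, result = set(), []
    for raw in nominations:
        seed = normalize_seed(raw)
        if seed and seed not in seen:
            seen.add(seed)
            result.append(seed)
    return result

print(dedupe_seeds(["www.fema.gov", "http://www.FEMA.gov/", "not a url"]))
# -> ['http://www.fema.gov/']
```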

Rights, Ownership, and Responsibilities

Our current rights protocol stipulates that when a web page is clearly in the for-profit domain, we must seek permission from the content owners to collect. This restriction applies most notably to news agency sites, which would likely be a critical part of the web-based fallout of any major event. This was certainly the case with Katrina. However, rights negotiation is a time-consuming process. In the case of a planned event such as an election, one is able to anticipate a certain number of sources and take rights management steps before the event occurs. But with emergencies such as Katrina, there is simply no way to secure rights in advance without missing time-sensitive material. Knowing this would be an issue, CDL notified our curators in advance that the Katrina materials we gathered would not be publicly accessible but merely captured and preserved. The complex rights issues behind web archiving, as well as the collaborative environment described above, raise a number of questions:

• Who owns the content?
• Who is responsible for the project?
• Who is responsible for responding if a content owner objects?
• When is it permissible to ignore robots exclusion files?
• How much rights management can be done as part of post-processing for an event crawl?
• Can there be standing agreements with major news agencies to allow for this type of crawl in emergency situations? If so, what constitutes an emergency?

Technical Infrastructure

CDL did not have an adequate technical infrastructure in place at the time of the Katrina crawl. As mentioned above, staff at both CDL and UC Berkeley scrambled to find server space and to install and configure the crawlers. This sudden shift in workload left the system vulnerable and somewhat unstable. In order to start the crawling process as soon as possible, CDL began the project on a temporary server. Making the transition to more permanent storage in the midst of this crawling project was neither easy nor flawless, and a certain amount of data had to be recovered from backup sources.

Information Analysis and Display

The challenge of analyzing crawl results has been described in some detail in our Test Crawl Report. That report cites a curator who found the task of reviewing an 8,899-document crawl to be unmanageable. The tools currently available are not at all up to the task of analyzing a large and complex crawl. An event-based crawl is likely to result in massive amounts of data of widely varying quality. The selection of seeds is based on guesses that, given sufficient resources, should be reviewed, refined, and enhanced as the event progresses. Time-series data based on changing input parameters represents a kind of moving target that suggests the need to develop new analysis tools. Andreas Paepcke of Stanford University addressed this problem when considering how a social scientist might want to analyze the Katrina materials. Consider the researcher who is interested in finding out how quickly the notion of race entered the public discourse in the aftermath of Katrina. Paepcke suggests:

"For example, the tool could perform word frequency analysis across successive crawls of Katrina sites, so that the user could identify the emergence of word usage such as 'looter,' 'ninth ward,' etc. The social scientist would then define within the tool clusters of words that, in the scientist's judgment, are direct or indirect occurrence indicators of the concept 'race' (i.e. ninth ward, looters, lacks, poor). The scientist could subsequently interact with the tool at the level of these well-defined concepts. Example command: count occurrences of 'race' within the first three paragraphs of all pages and tabulate the differences across 10 days. We could go further and apply well-known topic categorization algorithms on the data to suggest new concepts as they arise in the series of text streams. For example, the tool might identify an increasing frequency of the term 'mismanagement' beginning in week two after the storm and alert the sociologist to this evidence of a new discourse theme." [8]

While we don't expect to be able to incorporate this level of analysis into our web archiving tools any time soon, these ideas illustrate a useful direction for archival analysis tools.
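As a small illustration of the word-frequency idea in Paepcke's example, the sketch below counts how often a set of terms appears in the text of successive crawl snapshots. It is a generic outline, not part of the Web-at-Risk toolset; the directory layout and term list are hypothetical.

```python
# Illustrative sketch: track term frequencies across successive crawl snapshots
# to watch a discourse theme emerge. Paths and terms below are hypothetical.
import re
from collections import Counter
from pathlib import Path

TERMS = ["looter", "ninth ward", "mismanagement"]

def term_counts(text: str) -> Counter:
    """Count occurrences of each tracked term in one snapshot's text."""
    lowered = text.lower()
    return Counter({term: len(re.findall(re.escape(term), lowered)) for term in TERMS})

def tabulate(snapshot_dir: str) -> None:
    """Print one row per snapshot file (e.g. one extracted-text file per crawl date)."""
    for snapshot in sorted(Path(snapshot_dir).glob("*.txt")):
        counts = term_counts(snapshot.read_text(encoding="utf-8", errors="replace"))
        row = ", ".join(f"{term}: {counts[term]}" for term in TERMS)
        print(f"{snapshot.name}: {row}")

if __name__ == "__main__":
    tabulate("katrina_snapshots")   # hypothetical directory of per-crawl text dumps
```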

Conclusions

At the time Hurricane Katrina hit, the Web-at-Risk service requirements and test crawl plan had been written for a more orderly and considered approach to web crawling. Our attempt to capture the web-based aftermath of Katrina highlighted our need to revisit the service requirements and see how well they support an event-based response. We need to develop a collaborative and robust mechanism to identify and collect seed URLs. When an event happens quickly, it is also important to have in place general crawling guidelines that will result in a well-rounded collection. The technical infrastructure for the Web Archiving Service that we are building for the Web-at-Risk project should be robust enough to handle occasional and sudden bursts of activity. Attempting to alter the infrastructure quickly in reaction to emergency events impairs our responsiveness and leaves the altered systems in a vulnerable state.

[8] Paepcke, Andreas, Senior Research Scientist and Director of the Digital Library, Stanford University. Email correspondence with Patricia Cruse, October 26, 2005.

24

Finally work needs to continue on both the rights management front and on developing improved web archiving analysis tools so that the material gathered can be used to its greatest potential

25

Web-at-Risk Test Crawl Report: Appendix C Individual Crawl Reports
Included below are the crawl reports provided to individual curators, including their analysis and feedback about those results. Note that a key to interpreting the tables in these crawl reports is provided at the end of this appendix.

Elizabeth Cowell (submitted by Ann Latta): UC Merced
CDL Report to Curator

URL: http://www.ucmerced.edu
Curator's original comments: "UC Merced is the first research university to be built in the 21st century. The educational and land use issues are significant. Of particular interest is http://www.ucmercedplanning.net. This site addresses major issues of land use - the university is being built on agricultural land. Controversy existed re: issues of redevelopment of downtown Merced vs. appropriation of agricultural land; there are major environmental issues focused on endangered species. Educational issues involving faculty job descriptions, student body, etc. are significant because of the economic, ethnic, and cultural diversity of the region."
Site copyright statement: "© 2004 UC Regents"
Crawl Results

Comments from crawl operator: When we set the crawl to include pages from linked sites, the crawler got "trapped" at the Elsevier site. There is JavaScript on that linked page that causes the crawler to continue looking for additional pages on the Elsevier site, even when you're only trying to capture a single page. Once we set a limit for the maximum number of retry attempts, the crawl completed. This data is from the completed crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 227 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
969  71552369  www.ucmerced.edu
238  2564803  www.ucop.edu
226  14851  dns
197  16583197  www.universityofcalifornia.edu
156  8487817  www.elsevier.com
151  1437436  www.greatvalley.org
112  2354582  faculty.ucmerced.edu
105  5659795  www.pacific.edu
90  111985  k12.ucop.edu
86  255733  www-cms.llnl.gov
85  1178031  admissions.ucmerced.edu
81  297947  uc-industry.berkeley.edu
71  108265  www.mssmfoundation.org
67  349300  www.nps.gov
66  308926  www.usafreedomcorps.gov
54  137085  slugstore.ucsc.edu
52  52202  www.cerrocoso.edu
51  977315  www.universityofcalifornia.com

Curator Feedback to CDL (Cowell Merced)

Crawl Success: mostly effective
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: 10 more links came up in the "via" search; 8 of these were not useful for the research of land use issues. The two that were useful were a benefit of the via search.
Crawl Frequency: monthly

Sherry DeDekker: California Water Science Center
CDL Report to Curator

URL: http://ca.water.usgs.gov
Curator's original comments: "We are interested in the CA water reports and databases behind the links on this page. Also, in looking at http://waterdata.usgs.gov/ca/nwis/nwis, this section is an interactive interface to multiple data sets. Is it possible to capture this type of site as well as the static reports?"
Site copyright statement: "Information presented on this website is considered public information (unless otherwise noted) and may be distributed or copied. Use of appropriate byline/photo/image credit is requested. We strongly recommend that USGS data be acquired directly from a USGS server and not through other sources that may change the data in some way."
Crawl Results

Comments from crawl operator: Databases, e.g. http://waterdata.usgs.gov/ca/nwis/nwis, are disallowed by http://waterdata.usgs.gov/robots.txt. Some water data reports are NOT caught by the broader "linked hosts included" settings, e.g. http://pubs.usgs.gov/wdr/2004/wdr-ca-04-1 -- we would also want to submit http://ca.water.usgs.gov/waterdata as a seed.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 662 additional sites were crawled. The following list includes the hosts that supplied more than 50 files. Note that the host "pubs.usgs.gov" supplied a higher number of files than the original host itself.
[urls] [bytes] [host]
1963  255912820  pubs.usgs.gov
1153  47066381  ca.water.usgs.gov
698  56570  dns
404  112354772  geopubs.wr.usgs.gov
385  9377715  water.usgs.gov
327  203939163  greenwood.cr.usgs.gov
318  17431487  www.elsevier.com
219  3254794  www.usgs.gov
189  2737159  www.lsu.edu
163  2292905  wrgis.wr.usgs.gov
158  31124201  www.epa.gov
149  921063  www.usda.gov
[list truncated…]

Curator Feedback to CDL (DeDekker CWSC)

Crawl Success: somewhat effective
Crawl Success Comments: The site appears to access water data reports (http://ca.water.usgs.gov/archive/waterdata/index.html), but none are actually available through the links. I expected the site to not be able to access real-time data, but these are archived reports.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly

Megan Dreger (submitted by James Jacobs): City of San Diego Planning Department

CDL Report to Curator

URL: http://www.sandiego.gov/planning
Curator's original comments: I would like to drill down several levels (at least 3) of this site. For example, following the link to City of Villages/general plan update leads to many more important planning documents.
Site copyright statement: This site contains the two following notices on the same page:
"Restrictions on Use of Materials: This site is operated and maintained by the City of San Diego through its Department of Information Technology and Communications (referred to as IT&C). Except as provided herein, no material or information from this site may be copied, reproduced, republished, uploaded, posted, transmitted, or distributed except as authorized in this notice, expressly authorized within this site, or approved in writing by IT&C.
Copyright Notice: Unless a copyright is indicated, information on the City of San Diego Web site is in the public domain and may be reproduced, published or otherwise used with the City of San Diego's permission. We request only that the City of San Diego be cited as the source of the information and that any photo credits, graphics or bylines be similarly credited to the photographer, author, or City of San Diego, as appropriate. If a copyright is indicated on a photo, graphic, or any other material, permission to copy these materials must be obtained from the original source."
Crawl Results

Comments from crawl operator: Need feedback about whether the desired content was retrieved. Question for curator: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 1084 additional hosts were crawled. (This figure does not represent a complete crawl, as the size limitation was reached.) The following hosts supplied more than 75 documents to your site:
[urls] [bytes] [host]
3728  556231640  www.sandiego.gov
1247  38685244  genesis.sannet.gov
1085  80905  dns
807  6676252  www.houstontexans.com
428  1079658  www.cacities.org
399  102298888  www.buccaneers.com
259  1797232  granicus.sandiego.gov
258  42666066  clerkdoc.sannet.gov
238  5413894  www.ccdc.com
225  2503591  www.ci.el-cajon.ca.us
223  1387347  www.ipl.org
217  2683826  www.sdcounty.ca.gov
203  11673212  restaurants.sandiego.com
195  2620365  www.sdcommute.com
192  1344523  www.bengals.com
189  2221192  www.kidsdomain.com
176  1333528  www.buffalobills.com
171  685965  www.chumpsoft.com
166  277238  www.proquest.com
[list truncated…]


Curator Feedback to CDL (Dreger San Diego)

Crawl Success: mostly effective
Crawl Success Comments: This crawl was not completed due to size, so that may explain some of my questions. It was pretty effective in terms of getting the Planning Dept. pages, but went out further than I expected. Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example, http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit. It's nice that PDFs and other formats are included.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The linked hosts included ("via") crawl seemed to include more extraneous stuff.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): Q: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface. A: You asked for a better definition of what was meant by the request to drill down several levels (at least 3). Unfortunately, I wasn't the original curator and I'm not sure what he meant. The crawl you did was very useful.
Crawl Frequency: monthly
Questions/Comments about crawl: I think that the crawl frequency should be at least monthly for these pages.

Peter Filardo and Michael Nash: New York City Central Labor Council
CDL Report to Curator

URL: http://www.nycclc.org
Curator's original comments: (none)
Site copyright statement: "© 2004 New York City Central Labor Council. No portion of this website may be reproduced in any form without permission from the Central Labor Council. Contact our offices for more information at nycaflcio@aol.com."
Crawl Results
NOTE: Because your Crawl "A" had to be stopped, then resumed, each of your reports for that crawl is in two segments. To browse a list of all reports for that crawl, go to http://voro.cdlib.org:8081/ingest_misc/ndiipp/testcrawls_raw/filardo_labor_via

Comments from crawl operator: A. Linked hosts included: Crawl complete after recovery, with the addition of max retries. Seemed to hang at http://www.nycclc.org/calendar/event.asp?EventId=501 and http://www.nycclc.org/assets/HLCapplicationmembership.pdf. Ended the crawl because it seemed to hang; recovered from the previous job, and the recovery was successful. Note for the future that a recovered job is identifiable because the logs directory is called logs-R.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, x additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
1913  74260017  www.nycclc.org
156  11755  dns
115  710552  www.aflcio.org
73  1477966  www.comptroller.nyc.gov
71  193264  www.empirepage.com
60  570115  www.redcross.org
58  269079  www.afl-cio.org
57  240845  www.campsussex.org
57  113676  www.mssm.edu
56  449473  www.labor-studies.org
53  184605  www.pbbc.org
52  134326  www.senate.gov
[list truncated…]

Curator Feedback to CDL (Filardo, NYCCLC)

None provided


Valerie Glenn and Arelene Weibel: Strengthening Social Security
CDL Report to Curator

URL: http://www.strengtheningsocialsecurity.gov
Curator's original comments: "Contains external links to audio & video that would be essential to completing this site (see press room); some are files, some are links to webcasts, some are on external .gov sites, and some are external .com sites."
Site copyright statement: Copyright info not found
Crawl Results

Comments from crawl operator: Interesting audio/video/webcast content -- need feedback about success in capturing it. Questions for Curator: How successfully did this crawl capture the multimedia documents you were interested in?
Comments from coordinator: In the media center area of this site (http://www.strengtheningsocialsecurity.gov/press/media_center.shtml) I've found .ram (both video and audio alone), .smil, and .asx files. The site also contains numerous .ppt and .pdf files.
• A text search on the log file turns up numerous .ram files, but only 1 .ppt file.
• .asx files are Windows streaming media redirector files, which generally lead to associated .wmf files. No .asx references appear in the crawl log, nor do any .wmf files.
• Similarly, .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files. I assume that when displayed, some of the Real media files from this site would function, but many of the other multimedia files would not. (A simple log-scanning check along these lines is sketched below, after the host list.)
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 388 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
660  10668874  www.chelseapiers.com
562  7334035  www.whitehouse.gov
477  6366197  www.laopinion.com
391  29623  dns
356  3874719  www.wkrc.com
243  12294240  www.strengtheningsocialsecurity.gov
178  1935969  www.xavier.edu
148  237055  image.com.com
127  682069  online.wsj.com
117  898439  www.omaha.com
116  514995  www.npr.org
108  995733  www.nba.com
[list truncated…]
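As a rough way to check multimedia capture, the sketch below scans a crawl log for URLs ending in the streaming and document extensions discussed above and tallies them. It is a minimal sketch under assumptions: the log file name is hypothetical, and it simply looks for http tokens anywhere on each line rather than relying on a particular Heritrix log column layout.

import sys
from collections import Counter

# Media extensions mentioned in the coordinator's comments (plus .wmv).
MEDIA_EXTS = (".ram", ".smil", ".smi", ".asx", ".rm", ".wmv", ".wmf", ".ppt", ".pdf")

def tally_media(log_path):
    """Count crawl-log entries whose URL ends in a media extension."""
    counts = Counter()
    with open(log_path, encoding="utf-8", errors="ignore") as log:
        for line in log:
            for token in line.split():
                lowered = token.lower()
                if lowered.startswith("http") and lowered.endswith(MEDIA_EXTS):
                    counts[lowered.rsplit(".", 1)[-1]] += 1
    return counts

if __name__ == "__main__":
    # Usage: python media_tally.py crawl.log
    for ext, count in sorted(tally_media(sys.argv[1]).items()):
        print(ext, count)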

Curator Feedback to CDL (Glenn Strength Soc Sec)

Crawl Success: mostly effective
Crawl Success Comments: My main concern about this site was the multimedia documents included - I've posted those thoughts in the "Questions for Curator" text box.
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: For this site it was essential to capture the linked hosts ("via"), because many of the press materials, etc. were on external sites.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): Q: How successfully did this crawl capture the multimedia documents you were interested in? A: I'm disappointed that not all of the multimedia files were captured, but there seem to be only a few that aren't included. I found it interesting that broadcasts from the same host (whitehouse.gov) weren't completely captured - some were, some weren't.
Crawl Frequency: once
Questions/Comments about crawl: [note: the curator alludes to adding a comment here, but no comment was received]


Valerie Glenn and Arelene Weibel: Defense Base Closure and Realignment Commission
CDL Report to Curator

URL: http://www.brac.gov
Curator's original comments: "In our previous efforts we have been unable to capture agency databases. The BRAC site includes a document library which has a search feature (http://www.brac.gov/Search.aspx) and a browse feature (http://www.brac.gov/Browse.aspx). We would really like to see how this information can be captured so that we can recreate it on our own servers."
Site copyright statement: "The contents of all material available on this Internet site are in the public domain and are not copyrighted. The content of this site may be freely reproduced, downloaded, disseminated, published, or transferred in any form and by any means. However, in some cases the copyright for certain text or images on this site may be held by other parties."
Crawl Results

Comments from crawl operator: A. Linked hosts included: http://www.brac.gov/Search.aspx can't be captured by Heritrix. http://www.brac.gov/Browse.aspx seems to only capture the first 25 documents. Tried again with the browse page as the starting point, but the crawl stopped after 1005 documents; it extracted 20 links from the browse page and then there were no more URLs in the frontier queue which had been extracted from Browse.aspx -- perhaps this needs more experimentation. B. Restricted to original host: Again, only the first 25 pages from browse were captured -- we can't even successfully pass a seed URL listing the maximum documents per browse page (50).
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 44 additional hosts were crawled. Note that because your crawl did not complete, this is not an accurate count of how many other sites your original site may link to.
[urls] [bytes] [host]
2034  1064389540  www.brac.gov
555  5874934  www.slu.edu
87  173510  www.cpcc.edu
54  154588  www.wmata.com
47  685158  www.sluhospital.com
44  3501  dns
44  582555  www.c-span.org
43  174467  www.adobe.com
38  178153  www.q-and-a.org
32  127325  slubkstore.com
24  140653  www.c-spanclassroom.org
23  326680  www.capitalnews.org
22  213116  cancercenter.slu.edu
21  196012  www.defenselink.mil
[list truncated…]

Curator Feedback to CDL (Glenn Defense Base Closure)

Crawl Success: somewhat effective
Crawl Success Comments: I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: once
Questions/Comments about crawl: One of the reasons I submitted this site to be crawled is that I'd already tried to harvest it using HTTrack. That product appeared to capture more of the documents than Heritrix - but we still haven't been able to capture the entire site.

Gabriela Gray: Join Arnold
CDL Report to Curator

URL: http://www.joinarnold.com
Curator's original comments: "California Recovery Team: non-profit, pro-Arnold group not registered as a campaign committee. Critical aspects: complex file and directory naming structure; looping, e.g. contactus.asp and contactadd.asp links."
Site copyright statement: "Copyright 2005"
Crawl Results

Comments from crawl operator: A. Linked hosts included: Great site for testing -- this loop is really interesting because a new URL is generated with each loop, so that the duplicate-detection underway at IA would still not eliminate it. 44,332 of the retrieved URLs were contact pages. B. Restricted to original host: Got into a loop; by the end, 999 documents retrieved, 34 minutes. C. Restricted to original host + regular expression: Excluding pages that matched a regular expression on contactadd.asp with the c= parameter did not end the loop. What did end the loop: excluding both contactus and contactadd pages so they were not retrieved at all -- a drawback. (IA takes the manual approach of gathering the pages, then having an operator stop the crawl and take out the looping URLs by hand -- not scaleable.)

<newObject name="contact" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
  <string name="decision">REJECT</string>
  <string name="regexp">.*contact.*asp.*c=.*</string>
</newObject>

Related hosts crawled: Because of looping problems, we were not able to crawl other hosts linked from this site.
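A pattern like the one above can be checked against sample URLs before a crawl is launched. The snippet below is only an illustration of that kind of regex exclusion test: the sample URLs and the exact expression are hypothetical, not taken from the actual crawl settings.

import re

# Illustrative exclusion pattern for the looping contact pages; the exact
# expression used in the test crawl may have differed.
EXCLUDE = re.compile(r".*contact.*\.asp.*c=.*")

# Hypothetical sample of discovered URLs.
candidates = [
    "http://www.joinarnold.com/en/agenda/reform.asp",
    "http://www.joinarnold.com/contactadd.asp?c=8841",
    "http://www.joinarnold.com/contactus.asp?c=8842",
]

for url in candidates:
    decision = "REJECT" if EXCLUDE.match(url) else "ACCEPT"
    print(decision, url)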

Curator Feedback to CDL (Gray Arnold)

Crawl Success: somewhat effective
Crawl Success Comments: We spot-checked, and it looks like most files were captured, but the individual pages don't display most of the images. (This may simply be a problem with the WERA interface.) Strangely enough, the Flash files work perfectly, which is exactly the opposite of our own capture experience.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: As per the crawl notes, we only checked the original-host version, since the "via" crawl failed.
Crawl Frequency: once
Questions/Comments about crawl: Same comments as on the Villaraigosa crawl. We're inexperienced with this type of capture: archive files as-is and use a tacked-on script and a special server-side interface to interpret links in the new environment. Our model has been to actually alter the internal links from absolute to relative formats, so that the site works in any environment.

Gabriela Gray: Mayor-Elect Villaraigosa
CDL Report to Curator

URL: http://www.antonio2005.com
Curator's original comments: Critical aspects: Flash animation; content scattered across multiple servers; maintaining complex internal link structure; JavaScript menus; streaming media.
Site copyright statement: "©2005 Villaraigosa for Mayor 2005"
Crawl Results

Comments from crawl operator:
• (For linked hosts results) Need feedback on media, etc. retrieved -- this site is an ideal example of the need for scope+one.
• (For restricted to original host) How much was left out due to the domain restriction? Need feedback.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 263 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:
[urls] [bytes] [host]
817  10291631  ensim3.interlix.com
805  117538973  www.antonio2005.com
472  6333775  www.laopinion.com
265  21173  dns
110  19355921  www2.dailynews.com
100  16605730  www2.dailybulletin.com
95  1410145  www.americanpresidents.org
86  820148  www.dailynews.com
73  168698  www.chumpsoft.com
72  52321  images.ibsys.com
69  836295  www.laobserved.com
65  137700  www.mysql.com
55  213569  www.ensim.com
55  177141  www.lamayorcn.com
55  296311  www.surveyusa.com
53  495858  abclocal.go.com
52  522324  www.c-span.org
51  244668  gallery.menalto.com
[list truncated…]

Curator Feedback to CDL (Gray Villaraigosa)

Crawl Success: mostly effective
Crawl Success Comments: Doing some spot checks, it looks like all of the pages were captured. Some problems with media files -- WERA shows them when we search, but the files are often size 0. In addition, many files on external servers are listed and even have some descriptive info, but when we click on "Overview" it says "Sorry, no documents with the given uri were found," so we have no idea if they were really captured.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: Most of the links to other sites go to pages that truly are external to the site, not incorporated into it. The only exception would be the pages from ga3.org and ga4.org.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): The result on the multiple-hosts crawl is mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains rather than source + 1. Also, there seem to be a lot of links going to ensim3.interlix.com -- there were more files captured from there than from antonio2005.com. This is a case where it looks like the entire site (or much of it) was mirrored on two different servers. When we find these, we often try to collapse them into one seamless whole, eliminating the duplication, which is meaningless from the user's viewpoint. Simply capturing both mirrors and leaving the cross-links intact is an option we've used when we can't collapse, but it often leads to problems with links between the two, which seems to be the case here.

Crawl Frequency: once
Questions/Comments about crawl: We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE -- the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.

Ron Heckart and Nick Robinson: Public Policy Institute of California
CDL Report to Curator

URL: http://www.ppic.org
Curator's original comments: "We are particularly interested in their publications. We hope the crawler will be able to report when new publication files are posted on the website. Our main focus of interest is on their new publications at http://www.ppic.org/main/newpubs.asp."
Site copyright statement: "All Contents © Public Policy Institute of California 2003, 2004, 2005"
Crawl Results

Comments from crawl operator: We can't at the moment use Heritrix to report on new publications posted.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 238 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:
[urls] [bytes] [host]
2421  324309107  www.ppic.org
433  1367362  www.cacities.org
238  19286  dns
229  4675065  www.icma.org
200  598505  bookstore.icma.org
151  1437436  www.greatvalley.org
144  517953  www.kff.org
137  5304390  www.rff.org
113  510174  www-hoover.stanford.edu
102  1642991  www.knowledgeplex.org
97  101335  cdn.mapquest.com
81  379020  www.cde.ca.gov
73  184118  www.ilsg.org
68  4539957  caag.state.ca.us
62  246921  www.milkeninstitute.org
[list truncated…]

Curator Feedback to CDL (Heckart PPIC)

Crawl Success: mostly effective
Crawl Success Comments: There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to "Publications" and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on "Date" retrieves the message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Crawl Frequency: weekly
Questions/Comments about crawl: Our crawl report included the following comment from the crawl operator: "We can't at the moment use Heritrix to report on new publications posted." The ability to report on new publications is critical to our goal of using the crawler as a discovery tool. What are the prospects for providing this functionality in the future?
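The new-publication reporting the curators ask about was not something Heritrix provided, but a rough approximation can be produced outside the crawler by diffing the URL lists of two successive crawls. The sketch below is illustrative only: the log file names are hypothetical, the field positions assume a typical Heritrix 1.x crawl.log layout, and treating new PDFs as candidate publications is an assumption.

def crawled_urls(log_path):
    """Collect successfully fetched URLs from a Heritrix-style crawl log.

    Assumes the fetch status code is the second field and the URL the fourth,
    as in typical Heritrix 1.x crawl.log lines; adjust if your layout differs.
    """
    urls = set()
    with open(log_path, encoding="utf-8", errors="ignore") as log:
        for line in log:
            fields = line.split()
            if len(fields) > 3 and fields[1] == "200":
                urls.add(fields[3])
    return urls

if __name__ == "__main__":
    # Hypothetical file names for two successive monthly crawls.
    previous = crawled_urls("ppic-2005-11.crawl.log")
    current = crawled_urls("ppic-2005-12.crawl.log")
    for url in sorted(current - previous):
        if url.lower().endswith(".pdf"):  # treat new PDFs as candidate new publications (assumption)
            print("new:", url)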


Terry Huwe: AFL-CIO
CDL Report to Curator

URL: http://www.aflcio.org
Curator's original comments: "This site is content rich and has many files that will be useful in the future. Specific areas that are of special interest follow below: http://www.aflcio.org/corporatewatch - the data related to executive pay watch is especially useful; http://www.aflcio.org/mediacenter - would like to see press stories captured if possible; http://www.aflcio.org/issues - links to newsletters and original content. Also, the "Legislative Action Center" on the home page; this is a useful topic guide to legislative history from a labor perspective."
Site copyright statement: "Copyright © 2005 AFL-CIO"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 2572 additional hosts were crawled. The following hosts supplied more than 75 URLs to your site:
[urls] [bytes] [host]
12702  481956063  www.aflcio.org
2657  184477  dns
1375  35611678  www.local237teamsters.com
570  8144650  www.illinois.gov
502  52847039  www.ilo.org
435  3851046  www.cioslorit.org
427  2782314  www.nola.com
401  8414837  www1.paperthin.com
392  15725244  www.statehealthfacts.kff.org
326  4600633  www.dol.gov
288  12303728  searchoxide.com
284  3401275  www.sikids.com
280  3069385  www.washingtonpost.com
272  1480539  www.cdc.gov
235  5455692  www.kff.org
[list truncated…]

Curator Feedback to CDL (Huwe AFL-CIO)

Crawl Success: effective
Crawl Success Comments: I realize the collection interface is a work in progress and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g. Working Families Toolkit, BushWatch) that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2-million-plus-hits-per-year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses of the toolkit. Having said all that, my short answer is: I'm stoked. :-)
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In the case of this collection there's a lot of original content (and context-making documents) on the linked sites, simply due to the hierarchical/federal nature of the labor movement.
Crawl Frequency: monthly
Questions/Comments about crawl: I may have missed correspondence on this question, as I've been busy with a library renovation here. My question: Is the 1 gig limit a useful one? I couldn't help wondering what the actual figure would be if that limit were set higher. I'm generally interested in the total size of Web collections, as I need to monitor ours (which is content-rich but even so does not exceed 1 gig). So discussion of this at least would be interesting, and maybe, if it's not technically challenging, that 1 gig level might be raised and we could see what happens. Thanks to all involved -- very interesting process.

Kris Kasianovitz: Los Angeles Dept. of City Planning
CDL Report to Curator

URL: http://cityplanning.lacity.org
Curator's original comments: "Website uses frames. Most of the documents will be pdfs. Of particular interest: the EIRs, which are not archived on the page once the project is approved (see http://cityplanning.lacity.org/EIR/TOC_EIR.htm), and the General and Community Plans: http://cityplanning.lacity.org/complan/gen_plan/genplan2.htm, http://cityplanning.lacity.org/ComPlan/cpbpage.htm"
Site copyright statement: No copyright information found
Crawl Results

Comments from crawl operator: (Linked hosts included crawl) ended because it ground on for 3 days without hitting the data limit; not sure if the URLs at the end of the log are valid/useful.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 119 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
10493  840876945  cityplanning.lacity.org
601  5156252  metrolinktrains.com
183  644377  www.cr.nps.gov
121  11162  dns
90  977850  www.metrolinktrains.com
81  1207859  www.fta.dot.gov
79  263432  www.fypower.org
66  333540  www.adobe.com
64  344638  lacity.org
63  133340  ceres.ca.gov
60  274940  www.amtrak.com
59  389217  www.nhtsa.dot.gov
58  347752  www.unitedweride.gov
52  209082  www.dot.gov
52  288783  www.nationaltrust.org
51  278949  www.portoflosangeles.org
[list truncated…]

Curator Feedback to CDL (Kasianovitz LA City Planning)

Crawl Success: mostly effective
Crawl Success Comments: The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department site is loaded with EIRs, notices, etc. In most cases the documents are all pdfs. When searching specifically for EIRs I got a large result list (699 citations); however, when I investigated whether or not the actual file was captured, I found that the main EIR page was captured (typically a htm/html file), but when I clicked on a link to get to the full report, all I got was the "Sorry, no documents with the given URI were found" message. This could be because the file was no longer available when the site was harvested; however, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved. Here are a few specific examples:
Final EIR directory: http://cityplanning.lacity.org/EIR/Tocfeir.htm
Sierra Canyon Secondary School (cover page): http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/SierraCyn_coverpg.htm. Access to the Draft EIR and Final EIR is provided from this cover page. Within the system, the links to both the Draft and Final are broken ("no documents with that given URI"), e.g. http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/DEIR/Table of Contents.htm
Villa Marina EIR: http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
Directory of EIR notices of preparation: http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM
This provides links to EIR preparation notices (all pdfs), a total of 27 links/documents. I encountered the following three issues:

• pdf opened = when clicking on the link to the notice, the pdf opened with no problem: 16 of 27.

• "Sorry, no document with the given uri was found" = no pdf harvested, but I could get to it from the live site: 4 of 27.

• Acrobat "could not open" message (could open the live page outside of WERA) = the following Acrobat message came up when I tried to open the pdf: "Acrobat could not open 'ENV-2005-0881-EIR[1].pdf' because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded). To create an Adobe Acrobat PDF document, go to the source application. Then print the document to Adobe PDF." I copied and pasted the url into a regular browser and could open the pdf with no problem. Also, in a few cases, if I clicked on the GO button after the first attempt to open the pdf in the system, it seemed to launch: 7 of 27.

Conversely, I found a number of pages that contained full documents in html, with links to pdfs that worked with no problem. See the following document: http://cityplanning.lacity.org/cwd/gnlpln/transelt/TET2Bkgrnd.htm. File types and error codes were what I expected.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The "restricted to original host" crawl is more helpful/relevant for getting to the materials from a specific agency. For some of the results that were restricted to host, I was getting external links. Here are some comparisons for each of the crawl settings.
Searched for "villa marina":
LA Dept. of City Planning: 6 results
http://cityplanning.lacity.org/EIR/NOPs/ENV-2004-3812-EIR.pdf
http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM
http://cityplanning.lacity.org/EIR/Tocfeir.htm
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/Cwd/GnlPln/HsgElt/HETblFig/ApVHgSit.htm
LA City Dept. of Planning (via): 2 results
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/Cwd/GnlPln/HsgElt/HETblFig/ApVHgSit.htm
Searched for "eir":
LA Dept. of City Planning: 699 results
LA City Dept. of Planning (via): 324 results
For both of these searches the URIs were from cityplanning.lacity.org.
Searched for "transportation":
LA Dept. of City Planning: 699 results
LA City Dept. of Planning (via): 290 results (most are from external sources and tended to be the index or main page of another agency or organization). Because this just got me to the main page, and none of the links functioned at that level, the via result was less helpful. However, the via results are useful for discovering other agencies or organizations that I should be looking at for materials.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): The list of linked hosts provided at the end of the report is helpful. This will help me target other agencies that might be key places to check for further collection development.
Crawl Frequency: monthly
Questions/Comments about crawl: I want to qualify the frequency for this site: I'd like to do a monthly crawl for 3-4 months. I'd then want to reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot of new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation. After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report there are 8899 documents in this crawl alone; my other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls. I should note that at this point in time, for local documents, I am more interested in individual documents than in capturing an entire website and preserving the functionality. Local agencies (with the exception of perhaps the Mayor's site) tend not to change the design (look and feel) very often or have a very sophisticated design (i.e. flash, changing images, etc.), and I don't see that this would be of interest to researchers' needs; it is the content (reports, maps, etc.) that is contained/accessed on the websites that is important. Maybe I'm wrong or being short-sighted about that.
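Given how often a cover page was captured while its linked PDFs were not, a simple audit can compare the PDF links found in captured HTML against the URLs recorded in the crawl log. The sketch below is a rough illustration under assumptions: the captured pages are available as local HTML files, the crawl log is a Heritrix-style crawl.log with the URL in the fourth field, and the directory and file names are hypothetical.

import os
import re
from urllib.parse import urljoin

PDF_LINK = re.compile(r'href="([^"]+\.pdf)"', re.IGNORECASE)

def pdf_links(html_dir, base_url):
    """Collect absolute PDF URLs referenced by captured HTML pages."""
    links = set()
    for name in os.listdir(html_dir):
        if not name.lower().endswith((".htm", ".html")):
            continue
        text = open(os.path.join(html_dir, name), encoding="utf-8", errors="ignore").read()
        for href in PDF_LINK.findall(text):
            links.add(urljoin(base_url, href))
    return links

def crawled(log_path):
    """URLs recorded in a Heritrix-style crawl log (URL assumed to be field 4)."""
    with open(log_path, encoding="utf-8", errors="ignore") as log:
        return {line.split()[3] for line in log if len(line.split()) > 3}

if __name__ == "__main__":
    wanted = pdf_links("captured_html", "http://cityplanning.lacity.org/EIR/NOPs/")
    have = crawled("crawl.log")
    for url in sorted(wanted - have):
        print("missing:", url)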


Kris Kasianovitz: Southern California Association of Governments
CDL Report to Curator

Curator's original comments: "This is a critical regional agency for Los Angeles, Orange, Ventura, Imperial, Riverside, and San Bernardino counties. Its main areas are transportation, housing, and economic development. This will provide an analysis of the overall site, which has a lot of content. Publications/reports are typically in pdf - they are presented as full reports and as pieces of the report (for easier downloading), so there might be duplication. The full report is really all that would be needed. See http://www.scag.ca.gov/publications. The Resources page contains pdfs, images, dynamic content, and GIS programs, including an interactive atlas: http://www.scag.ca.gov/resources.htm. One part of the Resource site is the Web Accessible Geographic Data Search (WAGS): http://mapsvr.scag.ca.gov/wags/index.cfm?fuseaction=. It requires a user-created login and password (although there is a guest login that allows you to bypass this). I'm not sure what kind of difficulty the harvester will encounter with this portion of the site. The interactive atlas also has a create-an-account/guest login issue: http://mapsvr.scag.ca.gov/atlas/presmap.asp?Cmd=INIT. Since it is a dynamic page, I don't know how this will be handled by the harvester."
Site copyright statement: "© 1999-2005 Southern California Association of Governments"
Crawl Results


Comments from crawl operator: Interesting login problem; Heritrix was unable to retrieve the guest login pages. Cold Fusion and asp pages don't generate new URLs and thus don't get crawled. NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than 1 minute.
Robots.txt file: The site you selected forbids crawlers from gathering certain data. It reads:

User-agent: *
Disallow: /_mm
Disallow: /_notes
Disallow: /_baks
Disallow: /MMWIP

User-agent: googlebot
Disallow: /csi

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 500 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
2517  863231651  www.scag.ca.gov
690  6134101  www.metrolinktrains.com
506  40063  dns
428  1084533  www.cacities.org
397  16161513  www.sce.com
196  581022  bookstore.icma.org
187  4505985  www.icma.org
175  7757737  www.ci.seal-beach.ca.us
158  1504151  www.h2ouse.org
149  940692  www.healthebay.org
137  317748  www.ci.pico-rivera.ca.us
130  18259431  www.ci.ventura.ca.us
123  490154  www.chinohills.org
121  406068  www.lakewoodcity.org
119  203542  www.lavote.net
117  2449995  www.ci.malibu.ca.us
114  744410  www.ci.irvine.ca.us
113  368023  www.whitehouse.gov
109  974674  www.dot.ca.gov
107  892192  www.lacanadaflintridge.com
[list truncated…]

Curator Feedback to CDL (Kasianovitz SCAG)

Crawl Success: mostly effective
Crawl Success Comments: Similar to my comments about the Los Angeles Dept. of City Planning - the crawl brought back a lot of webpages, but not the publications/documents that I would want to collect. Again, the same problem happened with SCAG as happened with the EIR example: the crawl brought back html pages with links to reports (typically in pdf format) - but the actual documents were not captured. While the webpage is helpful, as it gives context, the main content that I'd want to capture (the reports) was not captured. See the following for example; none of the webpages linked from these pages are available, though they should link to a page that has the material. I tried searching for the documents separately and couldn't get to them. See http://www.scag.ca.gov/publications/index.htm (the timeline arrows at the top seemed to function; I'm not sure what this is for) and http://www.scag.ca.gov/livable/pubs.htm. I was impressed to find that zip files were captured and I was able to download them. Unfortunately, when I opened them there wasn't any content (I did the same search by mistake with the Arizona Dept. of Water Resources and actually found content in the folders). I found 10 with the search "type:zip". The gif or jpg images retrieved are not useful - most were just bars or bullets or covers of reports (although this might be helpful to identify titles, I think I would end up discarding these after doing more checking of the results).
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments from LA Dept. of City Planning. "Restricted" gets me to the relevant materials for that agency; "via" brings back too many main webpages for other agencies to be useful.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): I expected that there would be a problem with the content behind logins. The crawl confirmed that material behind login screens couldn't be captured. Can I get a copy of these crawl results? NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than one minute.
Crawl Frequency: monthly
Questions/Comments about crawl: How to handle the copyright issue? For the login information, I'm not sure what all was blocked by the robots file. Interesting that metrolinktrains.com is the #1 related host for both of my crawled sites.

Linda Kennedy: California Bay Delta Authority
CDL Report to Curator

URL: http://calwater.ca.gov
Curator's original comments: "We are interested in the environmental impact statements and other key documents, and the various news releases, other announcements, and archives of CALFED."
Site copyright statement: "© 2001 CALFED Bay-Delta Program"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 519 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
1130  473192247  calwater.ca.gov
741  201538533  www.parks.ca.gov
521  40442  dns
373  51291934  solicitation.calwater.ca.gov
242  78913513  www.calwater.ca.gov
225  410972  cwea.org
209  87556344  www.science.calwater.ca.gov
173  109807146  science.calwater.ca.gov
172  1160607  www.adobe.com
129  517834  www.whitehouse.gov
[list truncated…]

Curator Feedback to CDL (Kennedy CALFED)


Crawl Success: mostly effective
Crawl Success Comments: I looked closely at the CALFED home page (http://calwater.ca.gov), the Tribal home page (http://calwater.ca.gov/Tribal/Tribal_Home.shtml), Key Documents (http://calwater.ca.gov/CALFEDDocuments/CALFEDDocuments.shtml), and the Archives page (http://calwater.ca.gov/Archives/Archives.shtml). The crawl did not complete in either the via or non-via search. Nearly all linked pages were retrieved in the non-via search; however, the retrievals from the via search were much less complete than the retrievals from the non-via search. For example, on the Key Documents page there were 3 missing links from the non-via search but 14 missing links from the via search. When Adobe documents were retrieved from either crawl, they came up correctly. Three asp links of tribal maps from the Tribal home page were retrieved by the non-via search but not the via search. A few of the images were missing from the displays, and this was also affected by the browser used. Usually the same image was missing from both crawls, but sometimes the images were more complete in the non-via crawl retrievals. There were some display problems with the right-hand menu boxes on the http://calwater.ca.gov/Tribal/Tribal_Home.shtml page, for example, that did not display correctly when viewed in Firefox. The same page viewed correctly in Internet Explorer, but when printed out the boxes printed incorrectly, just as viewed in the Firefox browser. Grant Opportunities: http://calwater.ca.gov/GrantOpportunities/GrantInformation.shtml - this link did not work in 2 via and 2 non-via instances (from the Tribal home page and from the Archives page) but did work on one non-via crawl page (the CALFED home page). It could also be searched and retrieved directly from the test crawl search page.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments above. The non-via search was substantially more complete.
Crawl Frequency: monthly

Janet Martorana: Santa Barbara County Department of Planning and Development
CDL Report to Curator

URL: http://www.countyofsb.org/plandev/default.htm
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 487 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
3119  1102414495  www.countyofsb.org
485  34416  dns
428  1083047  www.cacities.org
357  6126453  www.sbcphd.org
320  6203035  icma.org
250  438507  www.sbcourts.org
234  1110744  vortex.accuweather.com
200  593112  bookstore.icma.org
[list truncated…]

Curator Feedback to CDL (Martorana SBCD)

Crawl Success: mostly effective
Crawl Success Comments: Most documents I expected to find were captured, but a number were not. For example, off this page, http://www.countyofsb.org/plandev/comp/threeyear/2005-2008/default.html, I expected to get to the final work program, http://www.countyofsb.org/plandev/pdf/comp/programs/Three_Year_WP2005-2008_3YrWrkProgram.pdf, but got the "Sorry, no documents with the given uri were found" message. Other examples, within the Energy Division, a part of the Planning & Development Dept.: off this page, http://www.countyofsb.org/energy/information.asp, I could access all links except for two, http://www.countyofsb.org/energy/information/oil&GasFields.asp (Oil and Gas Fields) and http://www.countyofsb.org/energy/information/oil&GasProduction.asp (Oil and Gas Production). The crawler seemed to cut off the URL right before the ampersand; perhaps it has problems with ampersands.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly
Questions/Comments about crawl: I'd like to be notified when I've navigated away from the captured site. I found myself on real-time web sites but couldn't tell what was on the captured site and what was real. Example: when I did a search on "santa barbara" I got 1528 results using the via results. I went to the County of Santa Barbara Online site, and then, after clicking on links, I was on the live web and not in the crawl results database any longer, yet there was no indication that I had left the crawled database. The WERA uri was still displaying at the top of the screen, and I couldn't tell which were the captured sites and which were the current real-time sites. Other observations: the webpage navigation doesn't work, e.g. "Table of Contents" doesn't jump to that section of the webpage (www.countyofsb.org/energy/projects/shell.asp and www.countyofsb.org/energy/mitigation/oakProject.asp); links to glossary terms go to the glossary but not to the term itself.
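One common cause of links being cut at an ampersand is that the page source encodes the & as &amp; and the link handling fails to unescape it before the URL is queued. The snippet below only illustrates that unescaping step; the relative href shown is a guess at how the link might appear in the source, not a claim about what Heritrix actually did here.

import html
from urllib.parse import urljoin

base = "http://www.countyofsb.org/energy/information.asp"
# As the link might appear in the page source, with the ampersand HTML-encoded.
raw_href = "information/oil&amp;GasFields.asp"

# Unescape entities before resolving the link; otherwise the URL can end up
# truncated or mangled at the "&".
resolved = urljoin(base, html.unescape(raw_href))
print(resolved)  # http://www.countyofsb.org/energy/information/oil&GasFields.asp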


Lucia Orlando: Monterey Bay National Marine Sanctuary
CDL Report to Curator

URL: http://montereybay.noaa.gov
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 795 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
5272  468755541  montereybay.noaa.gov
861  61141  dns
554  20831035  www.wunderground.com
368  4718168  montereybay.nos.noaa.gov
282  3682907  www.oceanfutures.org
273  10146417  www.mbnms-simon.org
260  7159780  www.mbayaq.org
163  61399  bc.us.yahoo.com
152  1273085  www.mbari.org
146  710203  www.monterey.com
119  3474881  www.rsis.com
119  279531  www.steinbeck.org
118  1092484  bonita.mbnms.nos.noaa.gov
109  924184  www.duke.edu
104  336986  www.montereybayaquarium.org
103  595953  icons.wunderground.com
102  339589  www.uncw.edu
[list truncated…]

Curator Feedback to CDL (Orlando - Monterey Bay)

Crawl Success: (rating not provided)
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: This site contains a large mix of government policy, laws, and regulatory information, as well as links to recreational and educational activities available in the MBNMS. I was most interested in links to laws/regs/policy and educational info and organizations. I thought the links restricted to the original host best captured this information succinctly.
Crawl Frequency: unknown

Richard Pearce-Moses: Arizona Department of Water Resources
CDL Report to Curator

URL: http://www.azwater.gov
Curator's original comments: (redirects to http://www.azwater.gov/dwr) "In arid Arizona, water is one of the most important - and most contested - resources. The publications and records of this Department are of critical value to the state. Our spider can get many files from this site (1474 files in 258 directories). We are mostly interested in documents by programs. Although our spider can't get the imaged documents database (http://www.azwater.gov/dwr/Content/ImagedRecords/default.htm), this directory may not be critical if we can get the imaged documents transferred to us. We are not interested in blank forms and applications."
Site copyright statement: "Copyright © 1998 - 2005 Arizona Department of Water Resources and ADWR Network. All Rights Reserved."
Crawl Results

Questions for curator: Did this capture the documents you needed?

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 195 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
2233  988447782  www.azwater.gov
286  2350888  www.water.az.gov
253  4587125  www.groundwater.org
226  3093331  www.azcentral.com
196  15626  dns
178  395216  www.macromedia.com
128  1679057  www.prescott.edu
123  947183  www.azleg.state.az.us
115  792968  www.usda.gov
[list truncated…]

Curator Feedback to CDL (Pearce-Moses AZWater)

Crawl Success: (not provided)
Crawl Success Comments: We were surprised that your crawl found 4888 documents. Another crawl that we conducted at about the same time using wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand it, wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can; that may be the difference. The crawl is listed as not completing, but it appears to be very close, based on the total number of bytes downloaded.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Questions/Comments about crawl: It seemed odd that it took 2h 54m to crawl 7709 documents with linked hosts, but 4h 4m to crawl only 4888 docs when the spider was restricted to the original host.

Richard Pearce-Moses: Citizens Clean Election Commission
CDL Report to Curator

URL: http://www.ccec.state.az.us/ccecscr/home.asp
Curator's original comments: "This commission was established by initiative. Its work is of great historical significance, as it is changing the way the public elects officials. We have not been able to spider this site because links are buried in JavaScript. (We use wget as our spider.) We are primarily interested in acquiring their publications, election data, and things listed under 'popular links.'"
Site copyright statement: "Copyright 2004 Arizona Citizens Clean Elections Commission. All Rights Reserved."
Crawl Results

Comments from crawl operator: The JavaScript issue is an interesting problem; we need curator feedback about what we captured.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, the following additional sites were crawled. Total: 15
[urls] [bytes] [host]
929  95456563  www.ccec.state.az.us
76  6117977  www.azcleanelections.gov
55  513218  az.gov
49  499337  www.governor.state.az.us
44  174903  www.adobe.com
40  141202  www.azleg.state.az.us
31  18549  www.az.gov
28  202755  www.azsos.gov
23  462603  gita.state.az.us
19  213976  www.benefitoptions.az.gov
17  89612  www.azredistricting.org
14  1385  dns
3  1687  wwwimages.adobe.com
2  1850  www.capitolrideshare.com
2  26438  www.ftc.gov

Curator Feedback to CDL (Pearce-Moses CCEC)

Crawl Success: (not provided)
Crawl Success Comments: We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in JavaScript. We have not been able to crawl this site with wget.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Juri Stratford: City of Davis
CDL Report to Curator

URL: http://www.city.davis.ca.us
Curator's original comments: "We are primarily interested in the GIS data produced for the City of Davis. As the GIS data represent the largest part of the City of Davis web site, it may not be much more difficult to archive the site at the top level. Mapping and Geographic Information Systems URL: http://www.city.davis.ca.us/gis"
Site copyright statement: "This web site is Copyright © 2004 by the City of Davis. All Rights Reserved. The City retains the copyright on all text, graphic images, and other content of this site. You may not copy, modify, and/or re-use text, images, or other web content from this web site; distribute the City's web content; mirror content from this web site on a non-City server; or make any other use of the content of this web site that would violate the City's copyright, without written permission from the City of Davis. To the extent allowed by law, commercial use of our web material is prohibited without written permission from the City of Davis. All art work shown on these web pages is protected by US Copyright laws. Limited reproduction for non-commercial purposes can be authorized by the City of Davis, provided that requests are approved prior to use. Contact the Community Development Department, Cultural Services program staff, at (530) 757-5610 for more information. Some content included in this web site may be provided courtesy of third parties who may retain copyright control of the provided material. Any service marks and trademarks contained herein are the property of their respective owners."
Crawl Results

Comments from crawl operator ldquoGIS Potential issue img disallowed by robotstxt eg httpwwwcitydaviscausimgfeaturedmap-staticjpg canrsquot be retrieved also some maps on a second server disallowed Need feedback about gis material that was captured what was captured that is useful Much duplication -- pages captured repeatedlyrdquo Robotstxt The site you selected prohibits crawlers from collecting certain documents The file reads

User-agent Disallow img Disallow calendar Disallow miscemailcfm Disallow edbusiness Disallow gisoldmap Disallow policelog Disallow pcsgrantssacog Disallow jobslistings Disallow css Disallow pcsnutcrackerhistorycfm Disallow pcsnutcrackerpdfs User-agent asterias Disallow User-agent gigabot Disallow
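The practical effect of rules like these can be checked mechanically. The short Python sketch below is only illustrative; it assumes the reconstructed paths shown above (the extracted file lost its punctuation) and uses the standard library's urllib.robotparser to test whether a crawler that honors robots.txt, as Heritrix did in these test crawls, may fetch a given URL.

    from urllib import robotparser

    # Two of the reconstructed rules from the City of Davis file (assumed paths).
    rules = """\
    User-agent: *
    Disallow: /img
    Disallow: /gis/oldmap
    """.splitlines()

    rp = robotparser.RobotFileParser()
    rp.parse(rules)

    # The map image mentioned by the crawl operator is blocked; the GIS library is not.
    for url in ("http://www.city.davis.ca.us/img/featured/map-static.jpg",
                "http://www.city.davis.ca.us/gis/library/"):
        verdict = "allowed" if rp.can_fetch("*", url) else "blocked by robots.txt"
        print(url, verdict)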

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 420 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls]  [bytes]     [host]
16455   947871325   www.city.davis.ca.us
420     29555       dns
332     10377948    www.asucd.ucdavis.edu
305     33270715    selectree.calpoly.edu
279     3815103     www.w3.org
161     2027740     www.cr.nps.gov
139     941939      www.comcast.com
133     951815      www.yolocounty.org
[List truncated…]

Curator Feedback to CDL (Stratford, Davis)

Crawl Success: mostly effective
Crawl Success Comments: Looking at the GIS Online Maps page, it's not clear which formats were retrieved and which were not. For example, the Growth Map Flash file downloads fine, but the Flash/ArcIMS files do not download.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In general, restricted to original host works better. The broader search includes pages from businesses (e.g. Comcast) and other local agencies, e.g. other local and state government sites. But restricting the outside sites to the first level seems to be a good compromise.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question): In general it looked like it did a good job pulling about geographic data/images. For example, you can pull down data from http://www.city.davis.ca.us/gis/library. It's difficult for me to get a sense of the level of duplication from the way the search results display.
Crawl Frequency: monthly
Questions / Comments about crawl: I like the ability to navigate within the archive once you have a starting point. This is very nice.

Yvonne Wilson: Orange County Sanitation District
CDL Report to Curator

URL: http://www.ocsd.com
Curator's original comments: "This is an Orange County agency that has small reports and complex planning documents on its web site."
Site copyright statement: "Copyright © 2001-2005 Orange County Sanitation District. Unless a copyright is indicated, the information on this site is freely available for non-commercial, non-profit making use. If a copyright is indicated on any materials displayed on our website, permission to copy these materials must be obtained from the original source. Commercial use of District materials is expressly prohibited without the written permission of the OCSD."

Crawl Results

Comments from crawl operator: In both crawl settings we had to limit the maximum number of retry attempts in order to complete the crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 85 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls]  [bytes]    [host]
755     85943567   www.ocsd.com
164     7635257    www.ci.seal-beach.ca.us
122     809190     www.ci.irvine.ca.us
95      169207     epa.gov
86      7673       dns
85      559125     order.e-arc.com
66      840581     www.ci.huntington-beach.ca.us
62      213476     www.cityoforange.org
57      313579     www.epa.gov
55      4477820    www.villapark.org
50      1843748    www.cityoflapalma.org
50      463285     www.ocbinc.com
[List truncated…]

Curator Feedback to CDL (Wilson, OCSD)

Crawl Success: somewhat effective
Crawl Success Comments: Using WERA, I searched inquiries by type and title in the two OCSD collections, plain and via. I received no hits for pdf, only the homepage for html, and three hits for text; there are many pdf sections in the EIRs. I next searched by title in the two collections. I was the most successful in via: by searching the titles "carbon canyon" and "Ellis Ave Pumping Station" I found articles, but not the EIRs, which are available full text. At this point I presumed the crawl did not drill down far enough. Then I searched for a secondary page entitled "Ocean Monitoring"; this time the search found only an internal letter and memo, but not all the documents related to this topic. Via collection search seems to be the most productive, but it is not consistent.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: I searched some of the outside links to US Marine Fisheries and EPA Beach Watch and received no hits.
Crawl Frequency: monthly

Crawl Report Key: Web-at-Risk Test Crawls
This document is a guide to the test crawl report, providing further information about some of the statistics and results conveyed there. Your report includes basic information about the site you submitted, your original comments about that site, and any copyright statements CDL found when we examined the site. Your site was crawled using the Heritrix crawler, version 1.5.1. Your report will include the following information about the site.

Crawl Settings: We crawled each site in two different ways: (A) linked hosts included, and (B) restricted to original host. Where linked hosts were included, we set the crawler to gather any outside page that your nominated site linked to, but no further. So if your site linked to a single document from whitehouse.gov, we captured that document but did not crawl any further on the whitehouse.gov site. Your report will show results from both styles of crawling to give you a sense of whether or not the site draws heavily from valuable materials on another site. You will also receive a list of the other hosts that site linked to and how many documents were gathered from those hosts.

Robots.txt file: The presence of a robots.txt file means that the content provider is asking us to refrain from crawling either all or part of the site. This refers only to the host you named in your crawl request. In some cases the site had a robots.txt file but it didn't say anything; we noted when this occurred. We obeyed robots.txt instructions for these crawls, so if the site contained one but we still got a result, that means the robots.txt file only prevented us from crawling certain areas of the site. When robots.txt files were present, we have included the text of that file in your report so you can see which segments of the site the site owner wants to protect.

Crawl duration

Total number of documents: The "Documents" count will include page components (such as images or flash files).

File types (mime types): This area will contain a URL. When you go to that URL you will see a list of the different file types that were retrieved as part of the crawl. IMPORTANT: The Heritrix crawler is currently experiencing difficulty with this report and it is missing a crucial column. Until that is fixed, you can see the different file types retrieved, from most common to least, but you cannot yet tell how many files each one included. CDL will contact you when this report has been fixed.

Response code reports: The URL in this column will lead to a list of response codes in order by frequency. This will include "200" for files that were successfully captured and error codes for files that were not captured. The error code list includes some codes specific to Heritrix; the key to interpreting these codes is at http://crawler.archive.org/articles/user_manual.html#statuscodes. Note that this report only gives you quantitative information about response codes; it does not link response codes to specific files. For these details see "hosts report and crawl log" below.

How much data collected (bytes): The file size of the total crawl is reported in bytes. You can use the byte conversion tool at http://www.techtutorials.net/reference/byteconverters.html if you want to recalculate the size of the crawl in another measurement, such as kilobytes or megabytes.

Did crawl complete: This will say "no" if the crawl results exceeded 1 gigabyte or if the crawler encountered an obstacle to capturing the site that could not be fixed.

Location of hosts report and crawl log: You are welcome to review the page-by-page details of the crawl log. This is a generic report that comes with the Heritrix crawler and is not terribly user friendly, but it provides the most detail about the crawl process. This report will list every file that the crawler attempted to get and provide some information about each file. The Heritrix manual can help you interpret this report: http://crawler.archive.org/articles/user_manual.html#logs (go to section 8.2.1, Crawl Log).

Comments from Crawl Operator: These are observations that the Web Archive Programmer made about the crawl process for your site.

Questions for Curator: This section does not appear in every report. If you have a question listed here, please respond to it in the Test Crawl Feedback form.

Your Collection: Important: The screens you will see do not represent the final user interface for the Web Archiving Service tools. We are using WERA, an open-source search and display tool, only to show you your test crawl results. Because we did not seek the right to redistribute these documents, these pages are available only for the purpose of analyzing crawler effectiveness. You must have a password to view these pages; your report will include the address of a wiki page and a login and password. Each site was crawled twice: plain crawl = only pages from the original site were collected; via = pages from the original site, as well as pages that site links to, were collected. Unfortunately you cannot simply browse your site; you must select a collection and type a search. You will be able to navigate throughout your site once you load a page containing links. You will be able to review your colleagues' sites as well. Note that the WERA display tool is not perfect: if the same document was gathered from more than one crawl, it may not display in every collection.

Related Hosts Crawled: This section provides further information about the additional materials that were gathered when we set the crawler to include documents that your site links to. This can be critical in deciding what settings are needed to capture your site. Some sites, for instance, will keep all of their pdf or image files on a separate server. If you don't allow the crawler to move away from the original URI, you won't capture a critical portion of the site's content. In other cases, however, this setting will lead to irrelevant information. This report includes the most commonly linked hosts from your site.
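For curators who want numbers beyond what WERA shows, the reports described above can also be summarized directly from the crawl log. The sketch below is a rough illustration, not part of the CDL tools: it assumes the standard whitespace-delimited Heritrix 1.x crawl.log layout, in which the second field of each line is the fetch status code and the third is the document size in bytes, and it prints a response-code tally plus the crawl size in megabytes.

    from collections import Counter

    def summarize_crawl_log(path):
        """Tally status codes and total bytes from a Heritrix crawl.log."""
        statuses = Counter()
        total_bytes = 0
        with open(path, encoding="utf-8", errors="replace") as log:
            for line in log:
                fields = line.split()
                if len(fields) < 3:
                    continue  # skip blank or malformed lines
                statuses[fields[1]] += 1
                try:
                    total_bytes += int(fields[2])
                except ValueError:
                    pass  # the size field may be "-" for failed fetches
        return statuses, total_bytes

    codes, size = summarize_crawl_log("crawl.log")  # hypothetical local copy of the log
    for code, count in codes.most_common():
        print(code, count)
    print("total collected: %.1f MB" % (size / 1024.0 / 1024.0))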


Linda Kennedy: California Bay Delta Authority ..............................................57
Janet Martorana: Santa Barbara County Department of Planning and Development ..........59
Lucia Orlando: Monterey Bay National Marine Sanctuary .....................................61
Richard Pearce Moses: Arizona Department of Water Resources ...............................63
Richard Pearce Moses: Citizens Clean Election Commission ..................................65
Juri Stratford: City of Davis ..............................................................67
Yvonne Wilson: Orange County Sanitation District ...........................................70

Crawl Report Key ...........................................................................72

Test Crawl Overview
During September and October 2005, the California Digital Library (CDL) embarked on a series of "test" web crawls for the Web-at-Risk project. These crawls were conducted using the Heritrix1 crawler to gather content that was specifically requested by the 22 Web-at-Risk curatorial partners. In keeping with the scope of the Web-at-Risk project, the content consisted of government and political web sites. The purpose of these crawls was:

• To teach CDL staff details of Heritrix crawler settings and performance
• To inform the Web Archiving Service requirements and design concerning crawler performance, default settings, and the settings that need to be available to curators
• To learn more about the nature of the materials that curators hope to collect and preserve, particularly in regard to rights issues and technical obstacles
• To convey any peculiarities about the content back to the curators
• To gather curators' assessment of the crawler's effectiveness

A Change of Plan
The CDL requested that curators submit their sample URIs by August 25, 2005.2 On August 28, Hurricane Katrina was upgraded to a Category 5 storm and Mayor Ray Nagin ordered the evacuation of New Orleans. As events unfolded in New Orleans and the Gulf Coast over the following week, CDL chose to suspend the original test plan and focus on gathering Hurricane Katrina information. Once this time-critical material was collected, we turned back to the original set of URIs provided by curators. Consequently, Appendix B of this report includes a brief synopsis and evaluation of our experience crawling the Katrina web materials. The body of this report concerns only our original test crawl scope. An additional development at this time was the release of Web aRchive Access (WERA),3 an open-source tool for displaying crawl results (arc files). This made it possible for us to display the results to our curators and improved their ability to assess the results.

The Respondents
Our request for test crawl URIs went out to 22 curators at the University of North Texas, New York University, and several University of California campuses. In many cases curators worked for different departments within the same institution. We asked each curator to send us a first-choice and second-choice URI, with a note describing what they hoped to capture about each site. Many of the curators involved in the test crawl activities either had previous web archiving experience or were quite familiar with the issues involved. A number of them warned us of the challenges our crawler might encounter with the particular sites they selected. The curators' response patterns indicated an important assumption about the service they expect. Three curators responded that they were not going to send any URIs because they were collaborating with another curator who would take care of selecting the test sites. This implies that our curators envision a service that will allow them to work collaboratively on building a web archive collection. In all three cases the respondents and their collaborators were from the same institution. In a fourth case, a respondent said that he suspected he would send in the same sites as another curator because he knew they were working on similar issues. The two individuals concerned were from entirely different institutions, but nonetheless see the potential to work collaboratively.

1 Heritrix Web Site, Internet Archive <http://crawler.archive.org>
2 URL (Uniform Resource Locator) and URI (Universal Resource Identifier) are related terms. We will use URI in this report in keeping with the terminology used in the Heritrix crawler documentation.
3 WERA (Web aRchive Access) <http://nwa.nb.no>

About Crawl Results
Fifteen curators sent their site selections, giving us a total of 30 sites to choose from for our test crawls. It is critical to note that this sample is far too small to reveal anything conclusive about the nature of web sites or crawlers. In some cases the results may be affected by the peculiarities of network traffic and server performance at the time the sites were crawled; in other cases they may be affected by our own learning process in using Heritrix and WERA. The value of these tests is to learn more about the specific interests and reactions of our group of curators and to acquaint us all with the tools at hand. The following results should be interpreted with those caveats in mind.

Pre-Crawl Analysis: Rights Issues
Before crawling the test sites, we conducted an initial review of the sites in question to determine what issues they might pose, including rights management issues. In the "Web-at-Risk Rights Management Protocol," the CDL outlines the approach to rights management that will be followed for the Web-at-Risk project and that will inform the development of the Web Archiving Service. One aspect of this plan is that for each site selected for crawling, the curator will determine which of the three following categories to apply:

• Scheme A: Consent implied
• Scheme B: Consent sought
• Scheme C: Consent required

Only Scheme C requires the curator to get advance, explicit permission to crawl the site. It is hoped that the project's focus on government and political information will ensure that most materials fall within Scheme A or B. However, in spite of this subject focus, this small collection of sites presented some interesting rights issues.

We used a two-step review process to match our 30 suggested sites to the correct rights scheme. First, we looked at the nature of the content-owner's organization (federal government, nonprofit organization, etc.). This was done by reviewing the organization itself as well as the domain name used for the organization's web site. This gave us a first-round guess at what the correct rights scheme would be for each site. Next, the sites were carefully reviewed for copyright notices and our initial determinations were revised. The breakdown of agencies by type is as follows:

[Chart: Site Types. Federal: 3, State: 9, Local: 9, Non-Profit: 9]

Seven of the 15 sites were from agencies devoted to water management. The full list of sites submitted is available in Appendix A.

Site domains
Although 21 sites were published by government agencies, only 12 of them were in the .gov domain. By domain, the sites included:

[Chart: Domains (all sites), covering the .gov, .us, .org, .com, and .edu domains]

The nine local government sites presented by far the most variety in domain names. There was a weak correlation between domain names and the nature of the agency at the local government level.

[Chart: Domains (local sites only), covering the .gov, .us, .org, .com, and .edu domains]

Copyright statements
We next reviewed each site to determine whether copyright statements on the site could help determine what rights scheme might apply to the site. The copyright statements for the sites we crawled are available with each individual crawl report in Appendix C. Here too, local government sites offered little correlation between the nature of the content-owning organization and the rights statements displayed on the site. City web sites varied dramatically in their rights statements: some stated that their materials were in the public domain, others vigorously defended their copyright. This City of San Diego site did both.4

After both rights reviews, it was determined that of the 30 sites submitted:

• 14 fell into Rights Scheme A and could be crawled without notification or permission
• 13 fell into Rights Scheme B and could be crawled, but would also require identifying and notifying the content owner
• 3 fell within Rights Scheme C and would require the explicit consent of the content owner prior to crawling

The process of reviewing the sites for rights statements changed our assessment of the correct rights scheme in a number of cases, and all three "Scheme C: Consent Required" designations were made on the basis of statements posted on the site. Note that we did not ultimately seek permission for these materials, and access to the results of our crawls has been strictly limited to the curators and project staff for the purpose of crawl analysis. In short, our pre-crawl analysis of these 30 sites brought up complex rights issues and exemplified the challenges that lay ahead.

4 City of San Diego web site, Disclaimer page <http://www.sandiego.gov/directories/disclaimer.shtml>

Test Crawl Settings and Process
Although we had originally planned to crawl only one site for each curator, some curators supplied two sites that posed interesting problems; in these cases we crawled both. Each site was crawled with two settings, resulting in 19 test sites and 38 total crawls conducted. We used Heritrix version 1.5.1 to conduct the test crawls. The crawls were conducted using four crawler instances on two servers. Each site was crawled separately, that is, each seed list contained one URI. We kept most default settings, except for the following:

Crawl size: Each crawl was set to stop at a maximum of 1 gigabyte (gig) of data. Of the 38 crawls conducted, 18 hit the 1 gig size limit. Note that this limitation was imposed for the purpose of these early tests and will not be applied to future services.

Crawl duration: When crawls took an inordinately long time to complete, we started over again with "max retries" set at three. This setting improved crawler performance when pausing or hanging was an issue.

Politeness:5 Because we crawled each site individually, we set our crawler for very high politeness values. Politeness pertains to the impact of the crawl on the content owner's server and is determined by combining a few different Heritrix settings that together determine how demanding the crawler is on the remote servers' resources. Our politeness settings were:

• Delay-factor: 5
• Max-delay-ms: 5000
• Min-delay-ms: 500
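Read together, these three values mean roughly the following (a minimal sketch of our understanding of the Heritrix politeness settings, not Heritrix code): the crawler waits delay-factor times the duration of the last fetch before requesting the next URI from the same host, never waiting less than min-delay-ms or more than max-delay-ms.

    DELAY_FACTOR = 5
    MAX_DELAY_MS = 5000
    MIN_DELAY_MS = 500

    def politeness_delay_ms(last_fetch_ms):
        """Delay before the next request to the same host, per the settings above."""
        return max(MIN_DELAY_MS, min(MAX_DELAY_MS, last_fetch_ms * DELAY_FACTOR))

    # A quick 40 ms fetch still forces a 500 ms pause; a slow 2-second fetch
    # is capped at the 5-second maximum rather than waiting a full 10 seconds.
    print(politeness_delay_ms(40))    # 500
    print(politeness_delay_ms(2000))  # 5000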

Original host only vs. linked hosts included: Each site was crawled with two settings. The first setting restricted results to only the host from the original seed URI. The second setting allowed us to gather any pages to which the site linked directly, but no more. This second setting was constructed to gather pages considered relevant to the original site, and to gather sites in their entirety when an organization relied on more than one host name to provide its web presence.
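The distinction between the two settings can be expressed as a simple scope test. The sketch below is illustrative only (it is not the actual Heritrix scope implementation), and the seed host name is a made-up example: a URI is kept if it is on the seed's host, or, in the "linked hosts included" crawl, if it is exactly one link away and that link was found on the seed's host.

    from urllib.parse import urlsplit

    SEED_HOST = "www.example.gov"  # hypothetical seed host

    def in_scope(uri, found_on_host, include_linked_hosts):
        """Decide whether a discovered URI should be fetched."""
        host = urlsplit(uri).hostname or ""
        if host == SEED_HOST:
            return True                        # always crawl the original host
        if include_linked_hosts:
            return found_on_host == SEED_HOST  # one hop off-host, but no further
        return False

    # An off-host PDF linked directly from the seed site is kept only in the wider crawl.
    print(in_scope("http://docs.example.org/report.pdf", "www.example.gov", True))   # True
    print(in_scope("http://docs.example.org/report.pdf", "www.example.gov", False))  # False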

Crawl Scope
A comparison of the two different crawl settings used (original host only vs. linked hosts included) turned up some counterintuitive results.

5 For further information see section 6.3.3.1, "Politeness," in the Heritrix User Manual <http://crawler.archive.org/articles/user_manual.html>

When compared quantitatively, 8 out of 19 crawls took longer to capture the site when limited to "original host only" than with the "linked hosts" setting. It is not clear why this is the case, since the "linked hosts" crawl should be much larger. Indeed, in all cases the linked hosts crawl retrieved more files than the original host crawl. The following two tables compare both the number of files retrieved and the duration of the two types of crawls.

Table 1: Number of files retrieved
          Original Host Only    Linked Hosts Included
Most      46197                 70114
Fewest    247                   1343
Median    2423                  9250
Average   6359                  17247

Table 2: Duration of the crawl
          Original Host Only    Linked Hosts Included
Longest   32 hr 21 min          37 hr 11 min
Shortest  18 min                19 min
Median    7 hr 33 min           11 hr 22 min
Average   1 hr 42 min           7 hr 9 min
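The percentage figures quoted in the next paragraph follow directly from the two median rows; the short calculation below is just a worked check of that arithmetic.

    orig_files, linked_files = 2423, 9250          # Table 1, median row
    orig_minutes = 7 * 60 + 33                     # 7 hr 33 min = 453 minutes
    linked_minutes = 11 * 60 + 22                  # 11 hr 22 min = 682 minutes

    extra_time = (linked_minutes / orig_minutes - 1) * 100   # about 50.5
    extra_docs = (linked_files / orig_files - 1) * 100       # about 281.8
    print("%.1f%% more time, %.1f%% more documents" % (extra_time, extra_docs))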

Given that this is a very small sample of crawls, and that the gap between the largest and smallest crawls is fairly noteworthy, perhaps the only telling figure to consider here is the median. According to the median figures, with only 50.5% more time the crawler acquired over 281% more documents. When compared qualitatively, the results also appeared somewhat counterintuitive. Of the 18 curators who responded, 12 stated that they preferred the "original host only" crawl (four were undecided). We would have expected this preference to vary a little more from site to site. Oddly, one of the two curators who preferred the larger crawl scope had a crawl that captured materials from over 2500 other hosts. In some cases a site's links to exterior hosts are critical. The site's value may hinge upon how well it gathers documents from other sources; both of the curators who preferred the broader setting did so for this reason:

For this site it was essential to capture the link hosts (via) because many of the press materials, etc. were on external sites.

More critically, sites are also often composed of content from more than one server. This is particularly likely to be the case if a site is providing a large body of pdf or multimedia files, so a crawl restricted to the original host would be missing critical segments of the site's content. Our test crawls did in fact turn up sites that were composed of more than one host name. For example, in the case of UC Merced, separate host names are used for different areas of the site, such as faculty or admissions. In the case of the Arizona Department of Water Resources, the distinction between host names appears to be accidental, perhaps the result of an attempt to transition to a simpler, more memorable URI. Most pages from this site come from www.azwater.gov, but hundreds of internal links, including critical style sheet files, are still hard-coded to point to www.water.az.gov. Finally, when the site is restricted to the original host, the end user is much more likely to encounter errors when viewing the archived results. When the end user selects a link that was not captured, WERA provides a "Sorry, this URI was not found" message. When the linked hosts are included, the end user browsing the archive sees the site in a form closer to its original context, and with fewer error messages. Conversely, when end users encounter frequent error screens, they may develop both frustration and a sense of mistrust in the quality of the archive. It is worth noting that the curators are not likely to browse these results in the same way that an end user of their archives might. The curators know what these sites contain, choose them accordingly, and may be less inclined to click on links that would result in a "Sorry, this URI was not found" message. Ultimately, the value of a site's external links would seem likely to vary depending on the nature of the site. Sites with rich internal content and only "frivolous" external links would be best captured with the "original host only" setting. Before we ran these crawls, we asked curators to specify what they hoped a crawl would capture. Many referenced specific pages or directories they hoped to capture, and of those, three specified URIs that were not from the original host. When reviewing the results, two out of those three still preferred the original host crawl, even though that crawl did not capture the materials they specifically hoped to capture. Why was the feedback so consistent on this point? A look at the WERA interface used to display crawl results may provide an answer. WERA does not offer an immediate means of browsing a site; you have to search by keyword to find your way "into" the captured site. Once you have a page displayed you can browse within the site, but you must begin by searching for the right starting point. As this image shows, you select the site you want to search from a dropdown menu, then enter terms to search against.

Because all pages from the more comprehensive crawls are indexed, the search results include pages from all of those other hosts. This sets up a bit of cognitive dissonance: the user specified a search against a particular site, yet results from other hosts vastly outnumber pages from that site.

Underlying the issue of crawl scope is the deeper question of what an archivist hopes to capture when a site is crawled. Is it just a list of particular documents? Or is it a faithful recreation of the site as it existed on that day? It may be that a captured site has content of primary and secondary importance. The primary content is what should be retrieved when searching against the archive, while the secondary content should only be present to avoid error messages and establish the site's full original context. Another approach is suggested by this curator's response:

The results on the multiple hosts crawl is mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains rather than "source + 1."

This suggests the ability to link seed URIs as being related components of a single site.

Communication / Reports
When we reported the test results back to curators, we provided a synopsis of the crawl results, links to particular Heritrix reports, and to the WERA display interface. The Heritrix reports are all plain text, providing tables of MIME type or response code frequency. Attempting to integrate these reports and the display of the archived results is a challenge. One curator, for example, obtained documents from over 200 hosts in the "linked hosts included" crawl, but was only aware of having found 10 additional documents when reviewing these same search results in WERA. Although WERA is helpful for seeing results from an end user's perspective, it does not provide adequate tools for analysis. In some cases this is simply because WERA is a new and occasionally buggy tool. It is possible, for instance, to follow occasional links out of the archive and into "real-time" sites. In some cases it's also possible to browse to a page and display it, but when you search for that same page by its URI, WERA does not find anything. One curator notes:

We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE; the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.

It is also important to note that although WERA was used for the purpose of reporting test crawl results, it is not envisioned as the final display interface for the Web Archiving Service. Even so, the feedback the curators provide about WERA should inform the functionality of the WAS interface. Clearly it is still quite a struggle for curators to determine exactly what a crawl retrieved. One curator reports:

After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report there are 8899 documents in this crawl alone. My other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls.

Crawl Frequency
When asked how frequently they wanted to crawl sites, curators responded with a variety of preferences:

[Chart: Desired Crawl Frequency, with responses of Daily, Weekly, Monthly, Once, and Unknown]

Again, it is worth considering precisely what curators hope to capture in a repeated crawl of a site. Some insight is provided by these curators' comments:

We hope the crawler will be able to report when new publication files are posted on the web site.

And:

The ability to report on new publications is critical to our goal of using the crawler as a discovery tool.

As with the other NDIIPP grants, the purpose of Web Archiving Service tools will be for archiving and preservation, not for resource discovery. This indicates that we should further investigate what a "weekly" or "monthly" crawl really means to curators. If a site was not updated over the course of a year, would the curator want to continue running weekly crawls of the site? Would the curator want each crawl to appear on an archive timeline for that site, even if the content was no different for each date? One curator comments:

I want to qualify the frequency for this site. I'd like to do a monthly crawl for three-four months. I'd want [to] reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot [of] new content being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation.

Language and Web Site Models
One of the challenges of communicating and interpreting crawl results is that crawlers don't necessarily work the way people envision them to. Further, the appearance of a web site on a screen and its architecture on a server may be quite different: the directory structure of a site may have no relationship to the way its navigation is organized on a screen. One frequent point of confusion curators encountered while interpreting crawl results is the concept of how many "levels down" the crawler went. One curator requested that we "drill down several levels (at least 3)" in our capture. One challenge with this request is that "levels down" can be interpreted to mean different things. In some cases curators clearly mean the directory structure of the web site on the remote server. In other cases, however, they seem to mean the number of hosts away from the original site:

In general, restricted to original host works better. The broader search includes pages from businesses (e.g. Comcast) and other local agencies, e.g. other local and state government sites. But restricting the outside sites to the first level seems to be a good compromise.

If the curator is referring to a structure of subdirectories, the next challenge is that the Heritrix crawler does not work this way. Heritrix ignores the site's directory structure and instead follows links from the seed URI it is provided. The Heritrix manual specifically defines "depth" in this way:

This scope allows for limiting the depth of a crawl (how many links away Heritrix should crawl) but does not impose any limits on the hosts, domains, or URI paths crawled.6
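The two readings of "levels down" can be made concrete with a small sketch (the URI below is hypothetical). Directory depth is a property of the URI's path on the server; the depth Heritrix limits is the number of link hops from the seed, which it records per URI as a discovery path in the crawl log, one letter per hop, regardless of where the page sits in the directory tree.

    from urllib.parse import urlsplit

    def directory_depth(uri):
        """Number of path segments below the server root."""
        path = urlsplit(uri).path
        return len([segment for segment in path.split("/") if segment])

    def link_hops(discovery_path):
        """Heritrix-style hop count: one letter per hop (e.g. 'LLL' = 3 links from the seed)."""
        return len(discovery_path)

    # A document four directories deep may still be only one link from the home page.
    print(directory_depth("http://www.example.gov/plans/2005/eir/report.pdf"))  # 4
    print(link_hops("L"))                                                       # 1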

Multimedia
Some curators chose sites because of the value of their multimedia resources. The crawler yielded mixed results in capturing these resources. For one site:

• A text search on the log file turns up numerous .ram files, only one .ppt file
• .asx files are Windows streaming media redirector files, which generally lead to associated .wmf files. No .asx references appear in the crawl log, nor do any .wmf files
• .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files

As noted, nearly half the sites crawled reached the 1 gig size limit and so did not complete. This makes it difficult to determine whether there were genuine problems with particular types of files or if the crawler simply did not get to the missing files before the crawl was stopped.

6 Heritrix User Manual, Section 6.1.1, Crawl Scope: Broad Scope <http://crawler.archive.org/articles/user_manual.html>

Comparison with Other Crawlers
Three of our curators had previously used other crawlers, such as HTTrack and Wget, to capture the same sites we attempted with Heritrix. Because these curators provided strong details when describing their sites, we crawled all of the sites they sent us (six sites). In one case, the "Defense Base Closure and Realignment Commission," the curator had greater success capturing .aspx files with HTTrack than we had with Heritrix. Other comparisons:

We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using Wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand, Wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can. That may be the difference.

And:

We were very pleased with this crawl as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with Wget.

Crawl Success
We asked curators to rate the overall success of the test crawls on the following scale:

• Not effective (none of the desired documents were captured)
• Somewhat effective (some of the desired documents were captured)
• Mostly effective (most of the desired documents were captured)
• Effective (all of the desired documents were captured)

[Chart: Crawl Success, showing the number of responses rated Effective, Mostly Effective, Somewhat Effective, and Not Effective]

Here are some of the comments curators had concerning the overall success of these tests. The full text of all curator reports and their feedback to CDL is available in Appendix C.

Los Angeles Planning Department

The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all pdfs. When searching specifically for EIRs I got a large result list (699 citations); however [sic] when I investigated whether or not the actual file was captured, I found that the main EIR page was captured, typically a htm/html file, but when I clicked on a link to get to the full report all I got was the "Sorry, no Documents w/the given URI were found." This could be that the file was no longer available when the site was harvested. However, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved.

City of San Diego Planning Department
(This comment is from a curator who is filling in for the person who originated the test crawl request, so she is also attempting to interpret what that other person was hoping to capture.)

Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content), but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit.

Defense Base Closure and Realignment Commission

I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1GB, but I think that more documents could have been captured.

Public Policy Institute of California

There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on "Date" retrieves a message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."

AFL-CIO

I realize the collection interface is a "work in progress" and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g. "Working Families Toolkit," "BushWatch") that will have historical value. I'm hard-pressed to find fault with the crawl short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2 million-plus hits per year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weakness.

Conclusions
The process of running these test crawls has been valuable. It has resulted in a number of lessons learned and further directions for inquiry. It is clear that the tools that help curators analyze the results of these crawls will have to do more than mimic the original browsing context or supply tables of data. We will be looking for any opportunity to improve WAS reporting capabilities as we move forward with the project. The differences encountered in language and visualization raise the importance of a strong, intuitive design for the curator tools and for clear help screens. Each of us may visualize web sites differently, and the crawler may behave differently than we expect. A certain degree of online help will be needed to design crawls effectively, and further guidance should be available to help people interpret crawl results when those results don't match what the person anticipated.

The help and documentation for the Web Archiving Service will also need to address the rights analysis issues raised above. In most cases this is work that is done prior to issuing crawls; it cannot be addressed by the design of the WAS interface alone. When the Web-at-Risk project reaches the point of conducting usability studies, we should be sure to include tests that further uncover how users understand crawl frequency settings. Additionally, the desire for an analysis tool that can convey when a site has changed significantly is not limited to this project. It is a common issue faced by the Internet Archive, members of the International Internet Preservation Consortium, and others. CDL is actively communicating with these organizations as we all work toward a solution for this problem. Similarly, CDL should ensure that future assessment and analysis work with our curators addresses the issues raised by the two crawl scope settings. This inquiry should also extend to participants who exemplify end users, to determine if people using web archives experience crawl scope differently than people who build them. Finally, a number of lessons were learned via the Katrina crawl described in Appendix B. The most outstanding finding is that event-based crawls such as Katrina and site-specific crawls such as these have quite different characteristics and require different functionality and analysis tools.

Next Steps
There is certainly more to learn from these crawl results and about Heritrix; in some cases it is still unclear why the crawler failed to retrieve certain documents. The curators' feedback concerning these results has been extremely valuable. They have provided insight as to what was captured and what is still missing, which would have been difficult to determine without their subject expertise in the sites chosen. As we continue working to improve crawler success and performance, we will turn our attention to the Virtual Remote Control site created by Cornell. This site was developed to present particular problems to crawlers and is well documented, enabling the user to gauge a crawler's results. We will be using that site to replicate particular problems raised in our test crawl set, and may also reattempt some of the sites crawled for these tests. When we release the first version of the Web Archiving Service to curators in July 2006, we will request that they include their original test site in the crawls they attempt and compare the results with these tests. The test results for these crawls will remain in place as a point of comparison while we continue developing the Web Archiving Service. Additionally, certain issues raised by this feedback, particularly the desired scope and frequency of crawls, merit attention in future usability analysis work.

Web-at-Risk Test Crawl Report: Appendix A Sites Submitted

Curator                           Site                                                                                                           Crawled
Sherry DeDekker                   http://ca.water.usgs.gov (California Water Science Center)                                                     Yes
Sherry DeDekker                   http://www.dwr.water.ca.gov (California Department of Water Resources)
Peter Filardo and Michael Nash    http://www.nycclc.org (New York City Central Labor Council)                                                    Yes
Peter Filardo and Michael Nash    http://www.dsausa.org (Democratic Socialists of America)
Valerie Glenn and Arelene Weibel  http://www.strengtheningsocialsecurity.gov (Strengthening Social Security)                                     Yes
Valerie Glenn and Arelene Weibel  http://www.brac.gov (The Defense Base Closure and Realignment Commission)                                      Yes
Gabriela Gray                     http://www.joinarnold.com (Join Arnold)                                                                        Yes
Gabriela Gray                     http://www.antonio2005.com (Mayor-elect Antonio Villaraigosa)                                                  Yes
Ron Heckart and Nick Robinson     http://www.ppic.org (Public Policy Institute of California)                                                    Yes
Ron Heckart and Nick Robinson     http://www.cbp.org (California Budget Project)
Terrence Huwe                     http://www.aflcio.org (AFL-CIO)                                                                                Yes
Terrence Huwe                     http://www.seiu.org (Service Employees International Union)
James Jacobs                      http://www.sandiego.gov/planning (City of San Diego Planning Department, analyzed by Megan Dreger)             Yes
James Jacobs                      http://www.sandag.org (San Diego Association of Governments)
Kris Kasianovitz                  http://cityplanning.lacity.org (Los Angeles Department of City Planning)                                       Yes
Kris Kasianovitz                  http://www.scag.ca.gov (Southern California Association of Governments)                                        Yes
Linda Kennedy                     http://calwater.ca.gov (California Bay-Delta Authority, CALFED)                                                Yes
Linda Kennedy                     http://www.dfg.ca.gov (California Department of Fish and Game)
Ann Latta                         http://www.ucmerced.edu (UC Merced, analyzed by Elizabeth Cowell)                                              Yes
Ann Latta                         http://www.coastal.ca.gov/web/ (California Coastal Commission)
Janet Martorana                   http://www.countyofsb.org/plandev/default.htm (Santa Barbara County Department of Planning and Development)   Yes
Janet Martorana                   http://www.sbcag.org (Santa Barbara County Association of Governments)
Lucia Orlando                     http://montereybay.noaa.gov (Monterey Bay National Marine Sanctuary)                                           Yes
Lucia Orlando                     http://www.waterboards.ca.gov/centralcoast (Central Coast Regional Water Quality Control Board)
Richard Pearce-Moses              http://www.azwater.gov (Arizona Department of Water Resources)                                                 Yes
Richard Pearce-Moses              http://www.ccec.state.az.us/ccecscr/home.asp (Citizens Clean Election Commission)                              Yes
Juri Stratford                    http://www.city.davis.ca.us (City of Davis, California)                                                        Yes
Juri Stratford                    http://www.sacog.org (Sacramento Area Council of Governments)
Yvonne Wilson                     http://www.ocsd.com (The Orange County Sanitation District)                                                    Yes

Web-at-Risk Test Crawl Report: Appendix B The Katrina Crawl

The Crawl
During the early Fall of 2005, the California Digital Library was embarking on a series of test crawls as part of the Web-at-Risk project when hurricane Katrina struck. CDL had requested that curators submit their sample URLs by August 25th, and on Friday, August 26th, Louisiana Governor Blanco declared a state of emergency. It was over that weekend that CDL staff reached the conclusion that an event of this historic significance would shift our priorities. On Monday we suspended our initial test crawl plans and began preparing on a number of fronts to capture web-based materials related to Katrina. Specifically, this meant:

• identifying which web sites or sub-sites to collect, which involved:
  o notifying 30+ curators of our emergency crawl plans and requesting their assistance
  o setting up a web-based interface to gather their suggested seed URLs
  o sorting through the resulting seed list and feeding it to our primary and secondary (Stanford) crawl operators
• determining our collection parameters
• getting our crawlers installed and configured
• locating and setting up disk space to store the crawls
• initiating and monitoring the crawls

This crawl activity posed several new challenges. It was the first time CDL staff had used the Heritrix crawler, which was not yet installed on a production server when the hurricane struck. The last week of August saw CDL and UC Berkeley staff scrambling to find server space to set up Heritrix instances, disk partitions, and job parameters for the crawl. This was done with remarkable speed, owing to the urgency of the situation and to everyone's desire to capture a record of the events. Given our limited experience, in order to reduce the risk of losing the historically significant and fleeting materials related to hurricane Katrina, CDL worked with Stanford University to concurrently run the same crawl using a different crawler. After both CDL and Stanford had crawled the same seed list for a month, the task was then taken up by the San Diego Supercomputer Center, who have continued crawling these sites using Heritrix.

Gathering the Seeds
The CDL sent out an initial request to the Web-at-Risk curators to submit URLs related to hurricane Katrina for crawling. We worked collaboratively with a large group of content specialists to identify the sites:

• The 22 curators of the Web-at-Risk project (University of California Libraries, University of North Texas, New York University, Stanford University, and the Arizona State Library)
• The Library of Congress
• Librarians at Louisiana State University, University of Mississippi

Over the course of the crawl, the list of seed URLs grew to over 700 (just over 500 of which were crawled by CDL). The image of our input form above provides some sense of the range of materials collected. Given the sudden nature of this event, there was no time to investigate rights issues or technical problems each site might have presented. We informed the curators that "Our immediate plan is to simply collect the material before it disappears. We will not make the material immediately available." We also had little time for quality control, and were not able to guarantee that sites in our seed list would be comprehensively crawled.

Crawling Specifics
CDL's first crawl was run on September 1, 2005, using a seed list of 89 URLs. The final crawl run by CDL was on October 10, using a seed list of 589 URLs. This final seed list was then sent to the San Diego Supercomputing Center, who are continuing to run twice-weekly crawls.

The CDL crawls were done using Heritrix version 1.5.1. We began with a single instance of Heritrix, eventually dividing the seed list among 6 instances. We got through the entire seed list 29 times in 40 days. However, the content we were able to collect was limited by our crawler settings. Our crawler was configured to visit one host at a time, collecting content at a conservative rate, and never more than three hops away from the seed URL. To insure that the crawler moved through the seed list in a timely way, it was limited to spending 15 minutes at any given host. So the material collected does not represent the entirety of what was available at each site. Note that all of the seeds were crawled with the same configuration, no matter how different the structure of these sites might be. Thus the New Orleans version of Craig's List, NASA's information pages, and blog sites were all crawled in the same manner, despite being quite different in context, architecture, and other characteristics. Given our short preparation time, the goal was to find a crawler configuration that would be a match for the broadest range of Katrina materials. In most cases, whether the seed URL was for a personal blog or for a government agency, the seed itself was not centrally about Katrina. The Katrina information was generally situated on the front page and top levels of each site, so CDL expected that this setting, even with the 15-minute time limit, would at least capture the content nearest to the front pages of the widest variety of sites we could capture. In choosing crawler configuration settings we also faced conflicting goals. On one hand, we wanted to capture as much Katrina content as possible. On the other hand, we were very hesitant to start hitting sites that were providing much needed information at a crucial time (emergency sites, relief sites), thus making sites hard to reach due to our capture activity. Many of these sites were already getting heavily used and perhaps were not running at full capacity. Some were also geographically impacted directly by the hurricane. So we had to choose settings that balanced the need to collect with politeness across a wide range of sites. When the San Diego Supercomputer Center took over the Katrina crawls in October, they revisited the Heritrix configuration settings. The SDSC crawl placed a limit not on the amount of time spent at a site, but on the number of documents to be retrieved overall. Their crawls are being conducted "breadth first," in the sense of gathering pages consecutively across the seed list rather than gathering pages from one site at a time. The Stanford University crawl of the same sites began on September 7th and continued for 30 consecutive days, using the WebVac crawler. While we have statistics concerning the size of the Stanford Katrina collection, it is very difficult to compare the configuration settings and effectiveness of WebVac vs. Heritrix. Further, there is no easy way to display materials gathered with WebVac, so the Stanford Katrina content is stored but is not accessible to us at the moment.

Katrina Crawl Results
In terms of creating a collection of the Katrina event on the web, we are not satisfied that CDL's crawling efforts were successful. As mentioned, we had to impose time limits on each site visited and were not able to conduct quality control on the URLs that were captured. The total size of the Katrina capture at CDL is 50 GB. This represents 29 completed crawls of the seed list, which grew to over 500 seeds over the 40 days we ran the crawl. It's clear from the byte size alone that we barely skimmed the surface, and indeed the Stanford crawl proves this point. When the Stanford crawl was based on about 370 seeds, it was pulling in about 35 GB a day, or 15 million pages a day. In spite of this, our attempts to capture Katrina news and events did have some very useful and positive outcomes. In terms of test crawling, the experience was quite successful, opening our eyes to both certain obstacles and to new possibilities. Most importantly, it prompted us to consider the nature and demands of event-based crawling, which had not been part of our original test plan. Here are some of the aspects of event-based crawls that Katrina surfaced.

Site Selection and Classification This type of event demands a deep collaborative effort to identify material to collect CDL had to mobilize a large number of curators to make their best guesses at identifying promising sites We set up a lsquocrawl seed nominationrsquo web form accessible to the curators from their home institutions and invited the curators to use it to enter their suggested seed URLs CDL staff did not have time to add much automatic error or duplicate checking features so this cumbersome work was done by hand On the other hand it provided us a first-hand trial of what a more general curator user interface might require The selection and management of seed lists is critical for sudden event-based crawls The curators contributing the URLs will not necessarily be well-versed in the topic in the case of Katrina curators in California were not uniformly familiar with the Gulf Coast the towns the government agencies etc In addition it is difficult to predict which aspects of the event will be of historic enduring value Because disk storage was not a pressing issue it was better to err on the side of a wider net when selecting sites It became clear that there might be a role for ldquosmart crawlingrdquo which would spread a wider net than that provided by human-generated seed lists through the use of automated tools that discovered relevant materials We were specifically interested in using tools developed at Emory and Cornell that interface with the Rainbow text classification tool7 We did not have the resources to investigate this for the Katrina crawl but plan to fold it into future crawling tests Finally it is worth examining why the seed list grew continuously throughout the event Part of this growth is due to the natural increase in the number of pages and sites devoted to the emergency and having the additional time to identify and add them However the nature of the event itself changed over time starting as a hurricane then becoming a flood a massive relocation and a political and social issue So the range of relevant sites changed as the event itself took on broader

7 Bergmark, Donna. "Heritrix processor for use with rainbow." <http://groups.yahoo.com/group/archive-crawler/message/1905>


This suggests that site selection is an ongoing process, not strictly an activity undertaken at the beginning of an event.
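The seed-nomination workflow described above points to the kind of automatic error and duplicate checking that had to be done by hand during the Katrina crawl. The fragment below is a hedged illustration in Python, not part of the Web-at-Risk toolset: it normalizes nominated URLs, drops duplicates, and applies a crude keyword screen that merely stands in for a trained classifier such as Rainbow. All names and keyword choices here are assumptions.

    from urllib.parse import urlparse, urlunparse

    # Illustrative keyword list; a real deployment would use a trained
    # text classifier (e.g., Rainbow) rather than a handful of keywords.
    RELEVANCE_TERMS = {"katrina", "hurricane", "evacuee", "fema", "levee"}

    def normalize(url):
        """Normalize a nominated seed URL so duplicates can be detected."""
        if "://" not in url:
            url = "http://" + url
        parts = urlparse(url)
        return urlunparse(("http", parts.netloc.lower(), parts.path or "/",
                           "", parts.query, ""))

    def clean_seed_list(nominations):
        """Return unique, well-formed seeds from raw curator nominations."""
        seeds, seen = [], set()
        for raw in nominations:
            url = normalize(raw.strip())
            if not urlparse(url).netloc:        # reject malformed entries
                continue
            if url in seen:                     # drop exact duplicates
                continue
            seen.add(url)
            seeds.append(url)
        return seeds

    def looks_relevant(page_text):
        """Very rough relevance screen; a stand-in for topic classification."""
        words = set(page_text.lower().split())
        return len(words & RELEVANCE_TERMS) >= 2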

Rights, Ownership, and Responsibilities

Our current rights protocol stipulates that when a web page is clearly in the for-profit domain, we must seek permission from the content owners to collect. This restriction applies most notably to news agency sites, which would likely be a critical part of the web-based fallout of any major event; this was certainly the case with Katrina. However, rights negotiation is a time-consuming process. In the case of a planned event, such as an election, one is able to anticipate a certain number of sources and take rights management steps before the event occurs. But with emergencies such as Katrina, there is simply no way to secure rights in advance without missing time-sensitive material. Knowing this would be an issue, CDL notified our curators in advance that the Katrina materials we gathered would not be publicly accessible, but merely captured and preserved.

The complex rights issues behind web archiving, as well as the collaborative environment described above, raise a number of questions:

• Who owns the content?
• Who is responsible for the project?
• Who is responsible for responding if a content owner objects?
• When is it permissible to ignore robots exclusion files?
• How much rights management can be done as part of post-processing for an event crawl?
• Can there be standing agreements with major news agencies to allow for this type of crawl in emergency situations? If so, what constitutes an emergency?

Technical Infrastructure

CDL did not have an adequate technical infrastructure in place at the time of the Katrina crawl. As mentioned above, staff at both CDL and UC Berkeley scrambled to find server space and to install and configure the crawlers. This sudden shift in workload left the system vulnerable and somewhat unstable. In order to start the crawling process as soon as possible, CDL began the project on a temporary server. Making the transition to more permanent storage in the midst of this crawling project was neither easy nor flawless, and a certain amount of data had to be recovered from backup sources.

Information Analysis and Display

The challenge of analyzing crawl results has been described in some detail in our Test Crawl Report. That report cites a curator who found the task of reviewing an 8,899-document crawl to be unmanageable. The tools currently available are not at all up to the task of analyzing a large and complex crawl. An event-based crawl is likely to result in massive amounts of data of widely varying quality. The selection of seeds is based on guesses that, given sufficient resources, should be reviewed, refined, and enhanced as the event progresses. Time-series data based on changing input parameters represents a kind of moving target that suggests the need to develop new analysis tools.

Andreas Paepcke of Stanford University addressed this problem when considering how a social scientist might want to analyze the Katrina materials. Consider the researcher who is interested in finding out how quickly the notion of race entered the public discourse in the aftermath of Katrina. Paepcke suggests:

"For example, the tool could perform word frequency analysis across successive crawls of Katrina sites, so that the user could identify the emergence of word usage such as 'looter,' 'ninth ward,' etc. The social scientist would then define within the tool clusters of words that in the scientist's judgment are direct or indirect occurrence indicators of the concept 'race' (i.e., ninth ward, looters, lacks, poor, ...). The scientist could subsequently interact with the tool at the level of these well-defined concepts. Example command: 'count occurrences of race within the first three paragraphs of all pages and tabulate the differences across 10 days.' We could go further and apply well-known topic categorization algorithms on the data to suggest new concepts as they arise in the series of text streams. For example, the tool might identify an increasing frequency of the term 'mismanagement' beginning in week two after the storm and alert the sociologist to this evidence of a new discourse theme."8

While we don't expect to be able to incorporate this level of analysis into our web archiving tools any time soon, these ideas illustrate a useful direction for archival analysis tools.
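As a rough illustration of the word-frequency idea in Paepcke's suggestion, the following Python sketch counts occurrences of a concept's indicator terms across successive crawl snapshots. It is an assumption-laden toy, not a proposed Web-at-Risk feature; it assumes each crawl has already been reduced to plain extracted text, and the indicator terms shown are placeholders.

    import re
    from collections import Counter

    # Illustrative indicator terms for a concept the researcher defines.
    CONCEPT_TERMS = ["looter", "looters", "ninth ward", "evacuee"]

    def term_counts(text, terms=CONCEPT_TERMS):
        """Count indicator-term occurrences in one crawl snapshot's text."""
        text = text.lower()
        counts = Counter()
        for term in terms:
            counts[term] = len(re.findall(r"\b" + re.escape(term) + r"\b", text))
        return counts

    def concept_trend(snapshots):
        """Tabulate concept frequency across successive crawls.

        `snapshots` is a list of (crawl_date, extracted_text) pairs.
        Returns a list of (crawl_date, total_occurrences).
        """
        return [(date, sum(term_counts(text).values())) for date, text in snapshots]

    # Example use:
    # trend = concept_trend([("2005-09-08", text1), ("2005-09-09", text2)])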

Conclusions

At the time Hurricane Katrina hit, the Web-at-Risk service requirements and test crawl plan had been written for a more orderly and considered approach to web crawling. Our attempt to capture the web-based aftermath of Katrina highlighted our need to revisit the service requirements and see how well they supported an event-based response. We need to develop a collaborative and robust mechanism to identify and collect seed URLs. When an event happens quickly, it is also important to have in place general crawling guidelines that will result in a well-rounded collection. The technical infrastructure for the Web Archiving Service that we are building for the Web-at-Risk project should be robust enough to handle occasional and sudden bursts of activity. Attempting to alter the infrastructure quickly in reaction to emergency events impairs our responsiveness and leaves the altered systems in a vulnerable state.

8 Paepcke, Andreas, Senior Research Scientist and Director of the Digital Library, Stanford University. Email correspondence with Patricia Cruse, October 26, 2005.


Finally, work needs to continue on both the rights management front and on developing improved web archiving analysis tools, so that the material gathered can be used to its greatest potential.


Web-at-Risk Test Crawl Report: Appendix C Individual Crawl Reports

Included below are the crawl reports provided to individual curators, including their analysis and feedback about those results. Note that a key to interpreting the tables in these crawl reports is provided at the end of this appendix.

Elizabeth Cowell (submitted by Ann Latta): UC Merced

CDL Report to Curator

URL: http://www.ucmerced.edu

Curator's original comments: "UC Merced is the first research university to be built in the 21st century. The educational and land use issues are significant. Of particular interest is http://www.ucmercedplanning.net. This site addresses major issues of land use - the university is being built on agricultural land. Controversy existed re issues of redevelopment of downtown Merced vs. appropriation of agricultural land - there are major environmental issues focused on endangered species - Educational issues involving faculty job descriptions, student body, etc. are significant because of the economic, ethnic and cultural diversity of the region."

Site copyright statement: "© 2004 UC Regents"

Crawl Results


Comments from crawl operator: When we set the crawl to include pages from linked sites, the crawler got "trapped" at the Elsevier site. There is JavaScript on that linked page that causes the crawler to continue looking for additional pages on the Elsevier site, even when you're only trying to capture a single page. Once we set a limit for the maximum number of retry attempts, the crawl completed. This data is from the completed crawl.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 227 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
969 71552369 www.ucmerced.edu
238 2564803 www.ucop.edu
226 14851 dns
197 16583197 www.universityofcalifornia.edu
156 8487817 www.elsevier.com
151 1437436 www.greatvalley.org
112 2354582 faculty.ucmerced.edu
105 5659795 www.pacific.edu
90 111985 k12.ucop.edu
86 255733 www-cms.llnl.gov
85 1178031 admissions.ucmerced.edu
81 297947 uc-industry.berkeley.edu
71 108265 www.mssmfoundation.org
67 349300 www.nps.gov
66 308926 www.usafreedomcorps.gov
54 137085 slugstore.ucsc.edu
52 52202 www.cerrocoso.edu
51 977315 www.universityofcalifornia.com

Curator Feedback to CDL (Cowell Merced)

Crawl Success: mostly effective


Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: 10 more links came up in the via search; 8 of these were not useful for the research of land use issues. The two that were useful were a benefit of the via search.

Crawl Frequency: monthly


Sherry DeDekker: California Water Science Center

CDL Report to Curator

URL: http://ca.water.usgs.gov

Curator's original comments: "We are interested in the CA water reports and databases behind the links on this page. Also in looking at http://waterdata.usgs.gov/ca/nwis/nwis; this section is an interactive interface to multiple data sets. Is it possible to capture this type of site as well as the static reports?"

Site copyright statement: "Information presented on this website is considered public information (unless otherwise noted) and may be distributed or copied. Use of appropriate byline/photo/image credit is requested. We strongly recommend that USGS data be acquired directly from a USGS server and not through other sources that may change the data in some way."

Crawl Results

Comments from crawl operator: Databases, e.g. http://waterdata.usgs.gov/ca/nwis/nwis, are disallowed by http://waterdata.usgs.gov/robots.txt. Some water data reports are NOT caught by the broader "linked hosts included" settings, e.g. http://pubs.usgs.gov/wdr/2004/wdr-ca-04-1 -- would also want to submit http://ca.water.usgs.gov/waterdata as a seed.

Related hosts crawled:


When the crawl was set to include documents from other sites that the original site linked to, 662 additional sites were crawled. The following list includes the hosts that supplied more than 50 files. Note that the host "pubs.usgs.gov" supplied a higher number of files than the original host itself.

[urls] [bytes] [host]
1963 255912820 pubs.usgs.gov
1153 47066381 ca.water.usgs.gov
698 56570 dns
404 112354772 geopubs.wr.usgs.gov
385 9377715 water.usgs.gov
327 203939163 greenwood.cr.usgs.gov
318 17431487 www.elsevier.com
219 3254794 www.usgs.gov
189 2737159 www.lsu.edu
163 2292905 wrgis.wr.usgs.gov
158 31124201 www.epa.gov
149 921063 www.usda.gov
[list truncated…]

Curator Feedback to CDL (DeDekker CWSC)

Crawl Success: somewhat effective

Crawl Success Comments: The site appears to access water data reports (http://ca.water.usgs.gov/archive/waterdata/index.html), but none are actually available through the links. I expected the site to not be able to access real-time data, but these are archived reports.

Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Frequency: weekly


Megan Dreger (submitted by James Jacobs): City of San Diego Planning Department

CDL Report to Curator

URL: http://www.sandiego.gov/planning

Curator's original comments: "I would like to drill down several levels (at least 3) of this site. For example, following the link to City of Villages/general plan update leads to many more important planning documents."

Site copyright statement: This site contains the two following notices on the same page:

Restrictions on Use of Materials: "This site is operated and maintained by the City of San Diego through its Department of Information Technology and Communications (referred to as IT&C). Except as provided herein, no material or information from this site may be copied, reproduced, republished, uploaded, posted, transmitted, or distributed except as authorized in this notice, expressly authorized within this site, or approved in writing by IT&C.

Copyright Notice: Unless a copyright is indicated, information on the City of San Diego Web site is in the public domain and may be reproduced, published or otherwise used with the City of San Diego's permission. We request only that the City of San Diego be cited as the source of the information and that any photo credits, graphics or byline be similarly credited to the photographer, author or City of San Diego, as appropriate. If a copyright is indicated on a photo, graphic, or any other material, permission to copy these materials must be obtained from the original source."

Crawl Results


Comments from crawl operator: Need feedback about whether the desired content was retrieved. Question for curator: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 1084 additional hosts were crawled. (This figure does not represent a complete crawl, as the size limitation was reached.) The following hosts supplied more than 75 documents to your site:

[urls] [bytes] [host]
3728 556231640 www.sandiego.gov
1247 38685244 genesis.sannet.gov
1085 80905 dns
807 6676252 www.houstontexans.com
428 1079658 www.cacities.org
399 102298888 www.buccaneers.com
259 1797232 granicus.sandiego.gov
258 42666066 clerkdoc.sannet.gov
238 5413894 www.ccdc.com
225 2503591 www.ci.el-cajon.ca.us
223 1387347 www.ipl.org
217 2683826 www.sdcounty.ca.gov
203 11673212 restaurants.sandiego.com
195 2620365 www.sdcommute.com
192 1344523 www.bengals.com
189 2221192 www.kidsdomain.com
176 1333528 www.buffalobills.com
171 685965 www.chumpsoft.com
166 277238 www.proquest.com
[list truncated…]


Curator Feedback to CDL (Dreger San Diego)

Crawl Success: mostly effective

Crawl Success Comments: This crawl was not completed due to size, so that may explain some of my questions. It was pretty effective in terms of getting the Planning Dept. pages, but went out further than I expected. Due to the vague request to drill down several levels, I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit. It's nice that PDFs and other formats are included.

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: The linked hosts included (via) crawl seemed to include more extraneous stuff.

Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
A: You asked for a better definition of what was meant by the request to drill down several levels (at least 3). Unfortunately, I wasn't the original curator and I'm not sure what he meant. The crawl you did was very useful.

Crawl Frequency: monthly

Questions/Comments about crawl: I think that the crawl frequency should be at least monthly for these pages.
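The distinction raised in this report between directory levels in the site architecture and link hops from the seed can be made concrete with a small, hedged Python sketch. It is an illustration only: it measures directory depth from a URL's path, which is independent of how many clicks (hops) the page is from the seed, and the example URLs are simply the ones discussed above.

    from urllib.parse import urlparse

    def directory_depth(url):
        """Directory levels in the site architecture (path segments)."""
        path = urlparse(url).path
        segments = [s for s in path.split("/") if s]
        # A trailing file name still counts as one level here.
        return len(segments)

    print(directory_depth("http://www.sandiego.gov/planning"))                             # 1
    print(directory_depth("http://www.sandiego.gov/cityofvillages/overview/roots.shtml"))  # 3

A page that is three directories deep may be reachable in a single click from the seed, or may take several, which is why the two notions of "level" lead to different crawl scopes.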


Peter Filardo and Michael Nash: New York City Central Labor Council

CDL Report to Curator

URL: http://www.nycclc.org

Curator's original comments: (none)

Site copyright statement: "© 2004 New York City Central Labor Council. No portion of this website may be reproduced in any form without permission from the Central Labor Council. Contact our offices for more information at nycaflcio@aol.com."

Crawl Results

NOTE: Because your Crawl "A" had to be stopped, then resumed, each of your reports for that crawl is in two segments. To browse a list of all reports for that crawl, go to http://voro.cdlib.org:8081/ingest_misc/ndiipp/testcrawls_raw/filardo_labor_via

Comments from crawl operator: A. Linked hosts included: Crawl complete after recovery, with the addition of max retries. Seemed to hang at http://www.nycclc.org/calendar/event.asp?EventId=501 and http://www.nycclc.org/assets/HLC/application/membership.pdf. Ended the crawl when it seemed to hang; recovered from the previous job, and the recovery was successful. Note for the future that a recovered job is identifiable because the logs directory is called logs-R.


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, x additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
1913 74260017 www.nycclc.org
156 11755 dns
115 710552 www.aflcio.org
73 1477966 www.comptroller.nyc.gov
71 193264 www.empirepage.com
60 570115 www.redcross.org
58 269079 www.afl-cio.org
57 240845 www.campsussex.org
57 113676 www.mssm.edu
56 449473 www.labor-studies.org
53 184605 www.pbbc.org
52 134326 www.senate.gov
[list truncated…]

Curator Feedback to CDL (Filardo NYCCLC)

None provided


Valerie Glenn and Arelene Weibel: Strengthening Social Security

CDL Report to Curator

URL: http://www.strengtheningsocialsecurity.gov

Curator's original comments: "Contains external links to audio & video that would be essential to completing this site (see press room); some are files, some are links to webcasts, some are on external .gov sites and some are external .com sites."

Site copyright statement: Copyright info not found

Crawl Results

Comments from crawl operator: Interesting audio/video/webcast content -- need feedback about success in capturing. Question for curator: How successfully did this crawl capture the multimedia documents you were interested in?

Comments from coordinator: In the media center area of this site (http://www.strengtheningsocialsecurity.gov/press/media_center.shtml) I've found .ram (both video and audio-only), .smil, and .asx files. The site also contains numerous .ppt and .pdf files.

• A text search on the log file turns up numerous .ram files, and only 1 .ppt file.


• .asx files are Windows streaming media redirector files, which generally lead to associated wmf files. No .asx references appear in the crawl log, nor do any wmf files.
• Similarly, .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files.

I assume that when displayed, some of the Real media files from this site would function, but many of the other multimedia files would not. (A sketch of this kind of log check appears after the host list below.)

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 388 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
660 10668874 www.chelseapiers.com
562 7334035 www.whitehouse.gov
477 6366197 www.laopinion.com
391 29623 dns
356 3874719 www.wkrc.com
243 12294240 www.strengtheningsocialsecurity.gov
178 1935969 www.xavier.edu
148 237055 image.com.com
127 682069 online.wsj.com
117 898439 www.omaha.com
116 514995 www.npr.org
108 995733 www.nba.com
[list truncated…]
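The text search of the log mentioned above can be automated. The following Python sketch tallies captured URLs by media extension in a Heritrix-style crawl log; it is a hedged illustration, not part of the project's tooling, and the assumed log layout (whitespace-separated fields with the URL in one column) may not match the actual log format.

    from collections import Counter
    from urllib.parse import urlparse

    MEDIA_EXTENSIONS = {".ram", ".smil", ".asx", ".rm", ".wmv", ".ppt", ".pdf"}

    def tally_media(log_path, url_field=3):
        """Count captured URLs by media extension in a crawl log.

        Assumes one whitespace-separated record per line with the URL in
        column `url_field` (0-based); adjust for the actual log format.
        """
        counts = Counter()
        with open(log_path, encoding="utf-8", errors="replace") as log:
            for line in log:
                fields = line.split()
                if len(fields) <= url_field:
                    continue
                path = urlparse(fields[url_field]).path.lower()
                for ext in MEDIA_EXTENSIONS:
                    if path.endswith(ext):
                        counts[ext] += 1
        return counts

    # Example: print(tally_media("crawl.log"))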

Curator Feedback to CDL (Glenn Strength Soc Sec)

Crawl Success: mostly effective

Crawl Success Comments: My main concern about this site was the multimedia documents included - I've posted those thoughts in the Questions for Curator text box.

Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: For this site it was essential to capture the linked hosts (via) because many of the press materials, etc. were on external sites.

Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: How successfully did this crawl capture the multimedia documents you were interested in?
A: I'm disappointed that not all of the multimedia files were captured, but there seem to be only a few that aren't included. I found it interesting that broadcasts from the same host (whitehouse.gov) weren't completely captured - some were, some weren't.

Crawl Frequency: once

Questions/Comments about crawl: [note: the curator alludes to adding a comment here, but no comment was received]


Valerie Glenn and Arelene Weibel: Defense Base Closure and Realignment Commission

CDL Report to Curator

URL: http://www.brac.gov

Curator's original comments: "In our previous efforts we have been unable to capture agency databases. The BRAC site includes a document library which has a search feature (http://www.brac.gov/Search.aspx) and a browse feature (http://www.brac.gov/Browse.aspx). We would really like to see how this information can be captured so that we can recreate it on our own servers."

Site copyright statement: "The contents of all material available on this Internet site are in the public domain and are not copyrighted. The content of this site may be freely reproduced, downloaded, disseminated, published, or transferred in any form and by any means. However, in some cases the copyright for certain text or images on this site may be held by other parties."

Crawl Results

Comments from crawl operator: A. Linked hosts included: http://www.brac.gov/Search.aspx can't be captured by Heritrix. http://www.brac.gov/Browse.aspx seems to only capture the first 25 documents. Tried again with the browse page as the starting point, but it stopped after 1005 documents; 20 links were extracted from the browse page and then there were no more URLs in the frontier queue which had been extracted from Browse.aspx -- perhaps this needs more experimentation. B. Restricted to original host: again, only the first 25 pages from browse -- can't even successfully pass a seed URL listing the max docs per browse page (50).

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 44 additional hosts were crawled. Note that because your crawl did not complete, this is not an accurate count of how many other sites your original site may link to.

[urls] [bytes] [host]
2034 1064389540 www.brac.gov
555 5874934 www.slu.edu
87 173510 www.cpcc.edu
54 154588 www.wmata.com
47 685158 www.sluhospital.com
44 3501 dns
44 582555 www.c-span.org
43 174467 www.adobe.com
38 178153 www.q-and-a.org
32 127325 slubkstore.com
24 140653 www.c-spanclassroom.org
23 326680 www.capitalnews.org
22 213116 cancercenter.slu.edu
21 196012 www.defenselink.mil
[list truncated…]

Curator Feedback to CDL (Glenn Defense Base Closure)

Crawl Success: somewhat effective

Crawl Success Comments: I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Frequency: once

Questions/Comments about crawl: One of the reasons I submitted this site to be crawled is that I'd already tried to harvest it using HTTrack. That product appeared to capture more of the documents than Heritrix - but we still haven't been able to capture the entire site.


Gabriela Gray: Join Arnold

CDL Report to Curator

URL: http://www.joinarnold.com

Curator's original comments: "California Recovery Team. Non-profit pro-Arnold group, not registered as a campaign committee. Critical Aspects: Complex file and directory naming structure. Looping, e.g. contactus.asp and contactadd.asp links."

Site copyright statement: "Copyright 2005"

Crawl Results

Comments from crawl operator: A. Linked hosts included: Great site for testing -- this loop is really interesting because a new URL is generated with each loop, so that the duplicate detection underway at IA would still not eliminate it. 44332 of the retrieved URLs were contact pages. B. Restricted to original host: Got into a loop; by the end, 999 documents retrieved, 34 minutes. C. Restricted to original host + regular expression: Excluding pages that matched the regular expression contactadd.asp?c= did not end the loop. What did end the loop: excluding both contactus and contactadd pages so they were not retrieved -- a drawback. (IA takes the manual approach of gathering the pages, then having an operator stop the crawl and take out the looping URLs by hand -- not scaleable.)

<newObject name="contact" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
  <string name="decision">REJECT</string>
  <string name="regexp">.*contact.*asp.*c=.*</string>
</newObject>

Related hosts crawled: Because of looping problems, we were not able to crawl other hosts linked from this site.

Curator Feedback to CDL (Gray Arnold)

Crawl Success: somewhat effective

Crawl Success Comments: We spot-checked, and it looks like most files were captured, but the individual pages don't display most of the images. (This may simply be a problem with the WERA interface.) Strangely enough, the Flash files work perfectly, which is exactly the opposite of our own capture experience.

Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: As per the crawl notes, we only checked the original host version, since the via crawl failed.

Crawl Frequency: once

Questions/Comments about crawl: Same comments as on the Villaraigosa crawl. We're inexperienced with this type of capture: archive files as-is and use a tacked-on script and a special server-side interface to interpret links in the new environment. Our model has been to actually alter the internal links from absolute to relative formats so that it works in any environment.


Gabriela Gray: Mayor-Elect Villaraigosa

CDL Report to Curator

URL: http://www.antonio2005.com

Curator's original comments: "Critical Aspects: Flash animation; content scattered across multiple servers; maintaining complex internal link structure; JavaScript menus; streaming media."

Site copyright statement: "©2005 Villaraigosa for Mayor 2005"

Crawl Results

Comments from crawl operator

• (For linked hosts results) Need feedback on media, etc. retrieved -- this site is an ideal example of the need for scope+one.

• (For restricted to original host) How much was left out due to the domain restriction? Need feedback.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 263 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

[urls] [bytes] [host]
817 10291631 ensim3.interlix.com
805 117538973 www.antonio2005.com
472 6333775 www.laopinion.com
265 21173 dns
110 19355921 www2.dailynews.com
100 16605730 www2.dailybulletin.com
95 1410145 www.americanpresidents.org
86 820148 www.dailynews.com
73 168698 www.chumpsoft.com
72 52321 images.ibsys.com
69 836295 www.laobserved.com
65 137700 www.mysql.com
55 213569 www.ensim.com
55 177141 www.lamayor.cn.com
55 296311 www.surveyusa.com
53 495858 abclocal.go.com
52 522324 www.c-span.org
51 244668 gallery.menalto.com
[list truncated…]

Curator Feedback to CDL (Gray Villaraigosa)

Crawl Success: mostly effective

Crawl Success Comments: Doing some spot checks, it looks like all of the pages were captured. Some problems with media files -- WERA shows them when we search, but the files are often size 0. In addition, many files on external servers are listed, and even have some descriptive info, but when we click on "Overview" it says "Sorry, no documents with the given uri were found," so we have no idea if they were really captured.

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: Most of the links to other sites go to pages that truly are external to the site, not incorporated into it. The only exception would be the pages from ga3.org and ga4.org.

Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) The results on the multiple hosts crawl are mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains rather than source + 1. Also, there seem to be a lot of links going to ensim3.interlix.com -- there were more files captured from there than from antonio2005.com. This is a case where it looks like the entire site (or much of it) was mirrored on two different servers. When we find these, we often try to collapse them into one seamless whole, eliminating the duplication, which is meaningless from the user's viewpoint. Simply capturing both mirrors and leaving the cross-links intact is an option we've used when we can't collapse, but it often leads to problems with links between the two, which seems to be the case here.


Crawl Frequency: once

Questions/Comments about crawl: We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE -- the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.


Ron Heckart and Nick Robinson: Public Policy Institute of California

CDL Report to Curator

URL: http://www.ppic.org

Curator's original comments: "We are particularly interested in their publications. We hope the crawler will be able to report when new publication files are posted on the website. Our main focus of interest is on their new publications at http://www.ppic.org/main/newpubs.asp."

Site copyright statement: "All Contents © Public Policy Institute of California 2003, 2004, 2005"

Crawl Results

Comments from crawl operator: We can't at the moment use Heritrix to report on new publications posted.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 238 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

[urls] [bytes] [host]
2421 324309107 www.ppic.org
433 1367362 www.cacities.org
238 19286 dns
229 4675065 www.icma.org
200 598505 bookstore.icma.org
151 1437436 www.greatvalley.org
144 517953 www.kff.org
137 5304390 www.rff.org
113 510174 www-hoover.stanford.edu
102 1642991 www.knowledgeplex.org
97 101335 cdn.mapquest.com
81 379020 www.cde.ca.gov
73 184118 www.ilsg.org
68 4539957 caag.state.ca.us
62 246921 www.milkeninstitute.org
[list truncated…]

Curator Feedback to CDL (Heckart PPIC)

Crawl Success: mostly effective

Crawl Success Comments: There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to "Publications" and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on "Date" retrieves the message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)

Crawl Frequency: weekly

Questions/Comments about crawl: Our crawl report included the following comment from the crawl operator: "We can't at the moment use Heritrix to report on new publications posted." The ability to report on new publications is critical to our goal of using the crawler as a discovery tool. What are the prospects for providing this functionality in the future?
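One way to approximate the "report on new publications" request raised above is to compare the URL sets of two successive crawls. The following Python sketch is a hedged illustration of that idea, not an existing Heritrix or Web Archiving Service feature; it assumes each crawl can be reduced to a plain list of captured URLs, and the "/main/" path hint is an assumption rather than anything the crawler provides.

    def read_url_list(path):
        """Read one captured URL per line from a crawl's URL listing."""
        with open(path, encoding="utf-8") as f:
            return {line.strip() for line in f if line.strip()}

    def new_publications(previous_listing, current_listing, path_hint="/main/"):
        """Return URLs present in the current crawl but not the previous one."""
        previous = read_url_list(previous_listing)
        current = read_url_list(current_listing)
        return sorted(url for url in current - previous if path_hint in url)

    # Example:
    # for url in new_publications("crawl_2005-10.txt", "crawl_2005-11.txt"):
    #     print(url)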


Terry Huwe: AFL-CIO

CDL Report to Curator

URL: http://www.aflcio.org

Curator's original comments: "This site is content-rich and has many files that will be useful in the future. Specific areas that are of special interest follow below. http://www.aflcio.org/corporatewatch: the data related to executive pay watch is especially useful. http://www.aflcio.org/mediacenter: would like to see press stories captured if possible. http://www.aflcio.org/issues: links to newsletters and original content. Also, the "Legislative Action Center" on the home page; this is a useful topic guide to legislative history from a labor perspective."

Site copyright statement: "Copyright © 2005 AFL-CIO"

Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 2572 additional hosts were crawled. The following hosts supplied more than 75 URLs to your site:

[urls] [bytes] [host]
12702 481956063 www.aflcio.org
2657 184477 dns
1375 35611678 www.local237teamsters.com
570 8144650 www.illinois.gov
502 52847039 www.ilo.org
435 3851046 www.cioslorit.org
427 2782314 www.nola.com
401 8414837 www1.paperthin.com
392 15725244 www.statehealthfacts.kff.org
326 4600633 www.dol.gov
288 12303728 search.oxide.com
284 3401275 www.sikids.com
280 3069385 www.washingtonpost.com
272 1480539 www.cdc.gov
235 5455692 www.kff.org
[list truncated…]

Curator Feedback to CDL (Huwe AFL-CIO)

Crawl Success: effective

Crawl Success Comments: I realize the collection interface is a work in progress and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g., Working Families Toolkit, BushWatch) that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2-million-plus-hits-per-year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait, but I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses of the toolkit. Having said all that, my short answer is: I'm stoked. :-)

Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: In the case of this collection there's a lot of original content (and context-making documents) on the linked sites, simply due to the hierarchical/federal nature of the labor movement.

Crawl Frequency: monthly

Questions/Comments about crawl: I may have missed correspondence on this question, as I've been busy with a library renovation here. My question: Is the 1 GB limit a useful one? I couldn't help wondering what the actual figure would be if that limit were set higher. I'm generally interested in the total size of Web collections, as I need to monitor ours (which is content-rich but even so does not exceed 1 GB). So discussion of this at least would be interesting, and maybe, if it's not technically challenging, that 1 GB level might be raised and we could see what happens. Thanks to all involved -- very interesting process.


Kris Kasianovitz: Los Angeles Dept. of City Planning

CDL Report to Curator

URL: http://cityplanning.lacity.org

Curator's original comments: "Website uses frames. Most of the documents will be PDFs. Of particular interest - the EIRs, which are not archived on the page once the project is approved; see http://cityplanning.lacity.org/EIR/TOC_EIR.htm - General and Community Plans: http://cityplanning.lacity.org/complan/gen_plan/genplan2.htm, http://cityplanning.lacity.org/ComPlan/cpbpage.htm"

Site copyright statement: No copyright information found

Crawl Results

Comments from crawl operator: (Linked hosts included crawl) Ended because it ground on for 3 days without hitting the data limit; not sure if the URLs at the end of the log are valid/useful.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 119 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
10493 840876945 cityplanning.lacity.org
601 5156252 metrolinktrains.com
183 644377 www.cr.nps.gov
121 11162 dns
90 977850 www.metrolinktrains.com
81 1207859 www.fta.dot.gov
79 263432 www.fypower.org
66 333540 www.adobe.com
64 344638 lacity.org
63 133340 ceres.ca.gov
60 274940 www.amtrak.com
59 389217 www.nhtsa.dot.gov
58 347752 www.unitedweride.gov
52 209082 www.dot.gov
52 288783 www.nationaltrust.org
51 278949 www.portoflosangeles.org
[list truncated…]

Curator Feedback to CDL (Kasianovitz LA City Planning)

Crawl Success: mostly effective

Crawl Success Comments: The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all PDFs. When searching specifically for EIRs I got a large result list (699 citations); however, when I investigated whether or not the actual file was captured, I found that the main EIR page was captured (typically a .htm/.html file), but when I clicked on a link to get to the full report, all I got was the "Sorry, no documents with the given URI were found" message. This could be because the file was no longer available when the site was harvested; however, I tested a few of these and found that I could still access them on the City Planning live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved. Here are a few specific examples:

Final EIR directory: http://cityplanning.lacity.org/EIR/Tocfeir.htm

Sierra Canyon Secondary School (cover page): http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/SierraCyn_coverpg.htm
Access to the Draft EIR and Final EIR is provided from this cover page. Within the system, the links to both the Draft and Final are broken/no documents with that given URI: http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/DEIR/Table of Contents.htm

Villa Marina EIR: http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm

Directory of EIR notices of preparation: http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM

This provides links to EIR preparation notices (all PDFs), a total of 27 links/documents. I encountered the following three issues:

• PDF opened = when clicking on the link to the notice, the PDF opened with no problem: 16 of 27

• "Sorry, no document with the given uri was found" = no PDF harvested, but I could get to it from the live site: 4 of 27

• Acrobat "could not open" message (could open the live page outside of WERA) = the following Acrobat message came up when I tried to open the PDF: "Acrobat could not open ENV-2005-0881-EIR[1].pdf because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded). To create an Adobe Acrobat PDF document, go to the source application. Then print the document to Adobe PDF." I copied and pasted the URL into a regular browser and could open the PDF with no problem. Also, in a few cases, if I clicked on the GO button after the first attempt to open the PDF in the system, it seemed to launch: 7 of 27

Conversely, I found a number of pages that contained full documents in HTML, with links to PDFs, that worked with no problem. See the following document: http://cityplanning.lacity.org/cwd/gnlpln/transelt/TET2Bkgrnd.htm. File types and error codes were what I expected.

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: The "restricted to original host" crawl is more helpful/relevant for getting to the materials from a specific agency. For some of the results that were restricted to host, I was getting external links. Here are some comparisons for each of the crawl settings.

Searched for "villa marina":
LA Dept of City Planning: 6 results
http://cityplanning.lacity.org/EIR/NOPs/ENV-2004-3812-EIR.pdf
http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM
http://cityplanning.lacity.org/EIR/Tocfeir.htm
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/Cwd/GnlPln/HsgElt/HE/TblFig/ApVHgSit.htm
LA City Dept of Planning (via): 2 results
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/Cwd/GnlPln/HsgElt/HE/TblFig/ApVHgSit.htm

Searched for "eir":
LA Dept of City Planning: 699 results
LA City Dept of Planning (via): 324 results

For both of these searches the URIs were from cityplanning.lacity.org.

Searched for "transportation":
LA Dept of City Planning: 699 results
LA City Dept of Planning (via): 290 results (most are from external sources and tended to be the index or main page of another agency or organization). Because this just got me to the main page, and none of the links functioned at that level, the via result was less helpful. However, the via results are useful for discovering other agencies or organizations that I should be looking at for materials.

Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) The list of linked hosts provided at the end of the report is helpful. This will help me target other agencies that might be key places to check for further collection development.

Crawl Frequency: monthly

Questions/Comments about crawl: I want to qualify the frequency for this site: I'd like to do a monthly crawl for 3-4 months. I'd then want to reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot of new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation. After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report there are 8899 documents in this crawl alone; my other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls. I should note that at this point in time, for local documents, I am more interested in individual documents than in capturing an entire website and preserving the functionality. Local agencies (with the exception of perhaps the Mayor's site) tend not to change the design (look and feel) very often, or to have a very sophisticated design (i.e. Flash, changing images, etc.), and I don't see that this would be of interest to researchers' needs; it is the content (reports, maps, etc.) that is contained/accessed on the websites that is important. Maybe I'm wrong or being short-sighted about that.


Kris Kasianovitz: Southern California Association of Governments

CDL Report to Curator

Curator's original comments: "This is a critical regional agency for Los Angeles, Orange, Ventura, Imperial, Riverside, and San Bernardino counties. Its main areas are Transportation, Housing, and Economic Development. This will provide an analysis of the overall site, which has a lot of content. Publications/reports are typically in PDF - they are presented as full reports and pieces of the report (for easier downloading), so there might be duplication; the full report is really all that would be needed. See http://www.scag.ca.gov/publications. The Resources page contains the PDFs, images, dynamic content, and GIS programs, including an interactive atlas: http://www.scag.ca.gov/resources.htm. One part of the Resource site is the Web Accessible Geographic Data Search (WAGS): http://mapsvr.scag.ca.gov/wags/index.cfm?fuseaction= . It requires a user-created login and password (although there is a guest login that allows you to bypass this); I'm not sure what kind of difficulty the harvester will encounter with this portion of the site. The interactive atlas also has a create-an-account/guest login issue: http://mapsvr.scag.ca.gov/atlas/presmap.asp?Cmd=INIT. Since it is a dynamic page, I don't know how this will be handled by the harvester."

Site copyright statement: "© 1999-2005 Southern California Association of Governments"

Crawl Results


Comments from crawl operator: Interesting login problem; Heritrix was unable to retrieve the guest login pages. Cold Fusion and ASP pages don't generate new URLs and thus don't get crawled. NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than 1 minute.

Robots.txt file: The site you selected forbids crawlers from gathering certain data. It reads:

User-agent: *
Disallow: /_mm/
Disallow: /_notes/
Disallow: /_baks/
Disallow: /MMWIP/

User-agent: googlebot
Disallow: /*.csi
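For reference, a crawler's view of rules like these can be checked with a few lines of Python using the standard library's robots.txt parser. This is an illustrative sketch only: the URL and user-agent strings below are assumptions, and the standard-library parser treats Disallow rules as simple path prefixes, so wildcard lines like the googlebot one may not be honored the same way by every crawler.

    from urllib.robotparser import RobotFileParser

    # Assumed location of the robots.txt file under discussion.
    parser = RobotFileParser("http://www.scag.ca.gov/robots.txt")
    parser.read()   # fetches and parses the live file

    for url in [
        "http://www.scag.ca.gov/publications/",
        "http://www.scag.ca.gov/_baks/somefile.htm",
    ]:
        # can_fetch() answers: may this user-agent request this URL?
        print(url, parser.can_fetch("web-at-risk-crawler", url))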

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 500 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
2517 863231651 www.scag.ca.gov
690 6134101 www.metrolinktrains.com
506 40063 dns
428 1084533 www.cacities.org
397 16161513 www.sce.com
196 581022 bookstore.icma.org
187 4505985 www.icma.org
175 7757737 www.ci.seal-beach.ca.us
158 1504151 www.h2ouse.org
149 940692 www.healthebay.org
137 317748 www.ci.pico-rivera.ca.us
130 18259431 www.ci.ventura.ca.us
123 490154 www.chinohills.org
121 406068 www.lakewoodcity.org
119 203542 www.lavote.net
117 2449995 www.ci.malibu.ca.us
114 744410 www.ci.irvine.ca.us
113 368023 www.whitehouse.gov
109 974674 www.dot.ca.gov
107 892192 www.lacanadaflintridge.com
[list truncated…]

Curator Feedback to CDL (Kasianovitz SCAG)

Crawl Success: mostly effective


Crawl Success Comments: Similar to my comments about the Los Angeles Dept. of City Planning - the crawl brought back a lot of webpages, but not the publications/documents that I would want to collect. Again, the same problem happened with SCAG as happened with the EIR example: the crawl brought back HTML pages with links to reports (typically in PDF format), but the actual documents were not captured. While the webpage is helpful, as it gives context, the main content that I'd want to capture (the reports) was not captured. See the following for example; none of the webpages linked from these pages are available (they should link to a page that will have the material), and I tried searching for the documents separately and couldn't get to them: http://www.scag.ca.gov/publications/index.htm (the timeline arrows at the top seemed to function; I'm not sure what this is for) and http://www.scag.ca.gov/livable/pubs.htm. I was impressed to find that ZIP files were captured, and I was able to download them. Unfortunately, when I opened them there wasn't any content (I did the same search by mistake with the Arizona Dept. of Water Resources and actually found content in the folders); I found 10 with the search "type:zip". The GIF or JPG images retrieved are not useful - most were just bars or bullets or covers of reports (although this might be helpful to identify titles, I think I would end up discarding these -- after doing more checking of the results).

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: See comments from LA Dept. of City Planning. Restricted gets me to the relevant materials for that agency; via brings back too many main webpages for other agencies to be useful.

Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) I expected that there would be a problem with the content behind logins. The crawl confirmed that material behind login screens couldn't be captured. Can I get a copy of these crawl results? NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than one minute.

Crawl Frequency: monthly

Questions/Comments about crawl: How to handle the copyright issue? For the login information, I'm not sure what all was blocked by the robots file. Interesting that metrolinktrains.com is the #1 related host for both of my crawled sites.


Linda Kennedy: California Bay Delta Authority

CDL Report to Curator

URL: http://calwater.ca.gov

Curator's original comments: "We are interested in the environmental impact statements and other key documents, and the various news releases and other announcements and archives of CALFED."

Site copyright statement: "© 2001 CALFED Bay-Delta Program"

Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 519 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
1130 473192247 calwater.ca.gov
741 201538533 www.parks.ca.gov
521 40442 dns
373 51291934 solicitation.calwater.ca.gov
242 78913513 www.calwater.ca.gov
225 410972 cwea.org
209 87556344 www.science.calwater.ca.gov
173 109807146 science.calwater.ca.gov
172 1160607 www.adobe.com
129 517834 www.whitehouse.gov
[list truncated…]

Curator Feedback to CDL (Kennedy CALFED)


Crawl Success: mostly effective

Crawl Success Comments: I looked closely at the CALFED home page (http://calwater.ca.gov), Tribal home page (http://calwater.ca.gov/Tribal/Tribal_Home.shtml), Key Documents (http://calwater.ca.gov/CALFEDDocuments/CALFEDDocuments.shtml), and Archives page (http://calwater.ca.gov/Archives/Archives.shtml). The crawl did not complete in either the via or non-via search. Nearly all linked pages were retrieved in the non-via search; however, the retrievals from the via search were much less complete than the retrievals from the non-via search. For example, on the Key Documents page there were 3 missing links from the non-via search but 14 missing links from the via search. When Adobe documents were retrieved from either crawl, they came up correctly. Three asp links of tribal maps from the Tribal home page were retrieved by the non-via search but not the via search. A few of the images were missing from the displays, and this was also affected by the browser used: usually the same image was missing from both crawls, but sometimes the images were more complete in the non-via crawl retrievals. There were some display problems with the right-hand menu boxes on the http://calwater.ca.gov/Tribal/Tribal_Home.shtml page, for example, that did not display correctly when viewed in Firefox. The same page viewed correctly in Internet Explorer, but when printed out the boxes printed incorrectly, just as viewed in the Firefox browser. Grant Opportunities (http://calwater.ca.gov/GrantOpportunities/GrantInformation.shtml): this link did not work in 2 via and 2 non-via instances (from the Tribal home page and from the Archives page) but did work on one non-via crawl page (the CALFED home page). It could also be searched and retrieved directly from the test crawl search page.

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: See comments above. The non-via search was substantially more complete.

Crawl Frequency: monthly


Janet Martorana: Santa Barbara County Department of Planning and Development

CDL Report to Curator

URL: http://www.countyofsb.org/plandev/default.htm

Curator's original comments: None provided

Site copyright statement: No copyright information found

Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 487 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
3119 1102414495 www.countyofsb.org
485 34416 dns
428 1083047 www.cacities.org
357 6126453 www.sbcphd.org
320 6203035 icma.org
250 438507 www.sbcourts.org
234 1110744 vortex.accuweather.com
200 593112 bookstore.icma.org
[list truncated…]

Curator Feedback to CDL (Martorana SBCD)

Crawl Success: mostly effective


Crawl Success Comments: Most documents I expected to find were captured, but a number were not. For example, off this page, http://www.countyofsb.org/plandev/comp/threeyear/2005-2008/default.html, I expected to get to the final work program, http://www.countyofsb.org/plandev/pdf/comp/programs/Three_Year_WP/2005-2008_3YrWrkProgram.pdf, but got the "Sorry, no documents with the given uri were found" message. Other examples, within the Energy Division, a part of the Planning & Development Dept.: off this page, http://www.countyofsb.org/energy/information.asp, I could access all links except for two: http://www.countyofsb.org/energy/information/oil&GasFields.asp (Oil and Gas Fields) and http://www.countyofsb.org/energy/information/oil&GasProduction.asp (Oil and Gas Production). The crawler seemed to cut off the URL right before the ampersand; perhaps it has problems with ampersands.

Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Frequency: weekly

Questions/Comments about crawl: I'd like to be notified when I've navigated away from the captured site. I found myself on realtime web sites but couldn't tell what was on the captured site and what was real. Example: when I did a search on "santa barbara" I got 1528 results; using the via results I went to the County of Santa Barbara Online site, and then, after clicking on links, I was on the web (realtime) and not in the crawl results database any longer, yet there were no indications that I had left the crawled database. The WERA uri was still displaying at the top of the screen; I couldn't tell which were the captured sites and which were the current realtime sites. Other observations: the webpage navigation doesn't work, e.g. "Table of Contents" doesn't jump to that section on the webpage (www.countyofsb.org/energy/projects/shell.asp and www.countyofsb.org/energy/mitigation/oakProject.asp); links to glossary terms go to the glossary but not to the term itself.
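The ampersand problem described above is consistent with link extraction that fails to decode HTML entities: in page source, a URL containing "&" is normally written as "&amp;", and a crawler that does not unescape it (or that truncates at the ampersand) will request the wrong address. The snippet below is a hedged Python illustration of the decoding step, using made-up markup that merely resembles the problem case; it is not taken from Heritrix.

    import re
    from html import unescape

    HREF_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)

    def extract_links(page_source):
        """Pull href values out of markup and decode entities like &amp;."""
        return [unescape(match) for match in HREF_RE.findall(page_source)]

    # Hypothetical markup resembling the problem case.
    sample = '<a href="information/oil&amp;GasFields.asp">Oil and Gas Fields</a>'
    print(extract_links(sample))   # ['information/oil&GasFields.asp']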


Lucia Orlando: Monterey Bay National Marine Sanctuary

CDL Report to Curator

URL: http://montereybay.noaa.gov

Curator's original comments: None provided

Site copyright statement: No copyright information found

Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 795 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
5272 468755541 montereybay.noaa.gov
861 61141 dns
554 20831035 www.wunderground.com
368 4718168 montereybay.nos.noaa.gov
282 3682907 www.oceanfutures.org
273 10146417 www.mbnms-simon.org
260 7159780 www.mbayaq.org
163 61399 bc.us.yahoo.com
152 1273085 www.mbari.org
146 710203 www.monterey.com
119 3474881 www.rsis.com
119 279531 www.steinbeck.org
118 1092484 bonita.mbnms.nos.noaa.gov
109 924184 www.duke.edu
104 336986 www.montereybayaquarium.org
103 595953 icons.wunderground.com
102 339589 www.uncw.edu
[list truncated…]

Curator Feedback to CDL (Orlando – Monterey Bay)

Crawl Success: (rating not provided)

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: This site contains a large mix of government policy, laws, and regulatory information, as well as links to recreational and educational activities available in the MBNMS. I was most interested in links to laws/regs/policy and educational info and organizations. I thought the links restricted to the original host best captured this information succinctly.

Crawl Frequency: unknown


Richard Pearce-Moses: Arizona Department of Water Resources

CDL Report to Curator

URL: http://www.azwater.gov

Curator's original comments: (redirects to http://www.azwater.gov/dwr) "In arid Arizona, water is one of the most important – and most contested – resources. The publications and records of this Department are of critical value to the state. Our spider can get many files from this site (1474 files in 258 directories). We are mostly interested in documents by programs. Although our spider can't get the imaged documents database (http://www.azwater.gov/dwr/Content/ImagedRecords/default.htm), this directory may not be critical if we can get the imaged documents transferred to us. We are not interested in blank forms and applications."

Site copyright statement: "Copyright © 1998 - 2005 Arizona Department of Water Resources and ADWR Network. All Rights Reserved."

Crawl Results

Questions for curator: Did this capture the documents you needed?


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 195 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls]  [bytes]      [host]
2233    988447782    www.azwater.gov
286     2350888      www.water.az.gov
253     4587125      www.groundwater.org
226     3093331      www.azcentral.com
196     15626        dns
178     395216       www.macromedia.com
128     1679057      www.prescott.edu
123     947183       www.azleg.state.az.us
115     792968       www.usda.gov
[List truncated…]

Curator Feedback to CDL (Pearce-Moses AZWater)

Crawl Success: (not provided)
Crawl Success Comments: We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand, wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can. That may be the difference. The crawl is listed as not completing, but it appears to be very close based on the total number of bytes downloaded.
Crawl Scope Preferences: Original host only. (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Questions / Comments about crawl: It seemed odd that it took 2h 54m to crawl 7709 documents with linked hosts, but 4h 4m to crawl only 4888 docs when the spider was restricted to the original host.


Richard Pearce-Moses Citizens Clean Election Commission CDL Report to Curator

URL: http://www.ccec.state.az.us/ccecscr/home.asp
Curator's original comments: This commission was established by initiative. Its work is of great historical significance, as it is changing the way the public elects officials. We have not been able to spider this site because links are buried in JavaScript. (We use wget as our spider.) We are primarily interested in acquiring their publications, election data, and things listed under "popular links."
Site copyright statement: "Copyright 2004 Arizona Citizens Clean Elections Commission. All Rights Reserved."
Crawl Results

Comments from crawl operator: JavaScript issue; interesting problem; need curator feedback about what we captured.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, the following additional sites were crawled. Total: 15


[urls]  [bytes]     [host]
929     95456563    www.ccec.state.az.us
76      6117977     www.azcleanelections.gov
55      513218      az.gov
49      499337      www.governor.state.az.us
44      174903      www.adobe.com
40      141202      www.azleg.state.az.us
31      18549       www.az.gov
28      202755      www.azsos.gov
23      462603      gita.state.az.us
19      213976      www.benefitoptions.az.gov
17      89612       www.azredistricting.org
14      1385        dns
3       1687        wwwimages.adobe.com
2       1850        www.capitolrideshare.com
2       26438       www.ftc.gov

Curator Feedback to CDL (Pearce-Moses CCEC)

Crawl Success: (not provided)
Crawl Success Comments: We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with wget.
Crawl Scope Preferences: Original host only. (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)


Juri Stratford City of Davis CDL Report to Curator

URL: http://www.city.davis.ca.us
Curator's original comments: "We are primarily interested in the GIS data produced for the City of Davis. As the GIS data represent the largest part of the City of Davis web site, it may not be much more difficult to archive the site at the top level. Mapping and Geographic Information Systems URL: http://www.city.davis.ca.us/gis"
Site copyright statement: "This web site is Copyright © 2004 by the City of Davis. All Rights Reserved. The City retains the copyright on all text, graphic images, and other content of this site. You may not copy, modify and/or re-use text, images or other web content from this web site, distribute the City's web content, mirror content from this web site on a non-City server, or make any other use of the content of this web site that would violate the City's copyright, without written permission from the City of Davis. To the extent allowed by law, commercial use of our web material is prohibited without written permission from the City of Davis. All art work shown on these web pages is protected by US Copyright laws. Limited reproduction for non-commercial purposes can be authorized by the City of Davis provided that requests are approved prior to use. Contact the Community Development Department, Cultural Services program staff at (530) 757-5610 for more information. Some content included in this web site may be provided courtesy of third parties who may retain copyright control of the provided material. Any service marks and trademarks contained herein are the property of their respective owners."
Crawl Results


Comments from crawl operator: "GIS: potential issue: /img is disallowed by robots.txt, e.g. http://www.city.davis.ca.us/img/featured/map-static.jpg can't be retrieved; some maps on a second server are also disallowed. Need feedback about the GIS material that was captured: what was captured that is useful? Much duplication -- pages captured repeatedly."
Robots.txt: The site you selected prohibits crawlers from collecting certain documents. The file reads:

User-agent: *
Disallow: /img
Disallow: /calendar
Disallow: /misc/email.cfm
Disallow: /ed/business
Disallow: /gis/oldmap
Disallow: /police/log
Disallow: /pcs/grants/sacog
Disallow: /jobs/listings
Disallow: /css
Disallow: /pcs/nutcracker/history.cfm
Disallow: /pcs/nutcracker/pdfs

User-agent: asterias
Disallow:

User-agent: gigabot
Disallow:
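For illustration only, the sketch below shows how a robots-honoring crawler treats the /img exclusion above, using Python's standard urllib.robotparser. The directory paths are approximate reconstructions of the garbled listing, and this is not the code Heritrix itself uses; it simply demonstrates why the /img images noted by the crawl operator are absent from the archive.

    from urllib import robotparser

    # Approximate reconstruction of part of the City of Davis robots.txt above.
    ROBOTS_TXT = """\
    User-agent: *
    Disallow: /img
    Disallow: /calendar
    Disallow: /misc/email.cfm
    """

    rp = robotparser.RobotFileParser()
    rp.parse(ROBOTS_TXT.splitlines())

    # The disallowed image is skipped even though pages referencing it were captured.
    print(rp.can_fetch("*", "http://www.city.davis.ca.us/img/featured/map-static.jpg"))  # False
    print(rp.can_fetch("*", "http://www.city.davis.ca.us/gis/library/"))                 # True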

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 420 additional hosts were crawled. The following hosts supplied more than 50 files to your site:


[urls]  [bytes]      [host]
16455   947871325    www.city.davis.ca.us
420     29555        dns
332     10377948     www.asucd.ucdavis.edu
305     33270715     selectree.calpoly.edu
279     3815103      www.w3.org
161     2027740      www.cr.nps.gov
139     941939       www.comcast.com
133     951815       www.yolocounty.org
[List truncated…]

Curator Feedback to CDL (Stratford Davis)

Crawl Success: mostly effective
Crawl Success Comments: Looking at the GIS Online Maps page, it's not clear which formats were retrieved and which were not. For example, the Growth Map Flash file downloads fine, but the Flash/ArcIMS files do not download.
Crawl Scope Preferences: Original host only. (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In general, restricted to original host works better. The broader search includes pages from businesses (e.g. Comcast) and other local agencies (e.g. other local and state government sites). But restricting the outside sites to the first level seems to be a good compromise.
Response to CDL questions (in some cases CDL posed specific questions to the curator in the test report; this is the curator's answer): In general it looked like it did a good job pulling down geographic data/images. For example, you can pull down data from http://www.city.davis.ca.us/gis/library. It's difficult for me to get a sense of the level of duplication from the way the search results display.
Crawl Frequency: monthly
Questions / Comments about crawl: I like the ability to navigate within the archive once you have a starting point. This is very nice.


Yvonne Wilson Orange County Sanitation District CDL Report to Curator

URL: http://www.ocsd.com
Curator's original comments: "This is an Orange County agency that has small reports and complex planning documents on its web site."
Site copyright statement: "Copyright © 2001-2005 Orange County Sanitation District. Unless a copyright is indicated, the information on this site is freely available for non-commercial, non-profit making use. If a copyright is indicated on any materials displayed on our website, permission to copy these materials must be obtained from the original source. Commercial use of District materials is expressly prohibited without the written permission of the OCSD."
Crawl Results

Comments from crawl operator: In both crawl settings, we had to limit the maximum number of retry attempts in order to complete the crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 85 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls]  [bytes]     [host]
755     85943567    www.ocsd.com
164     7635257     www.ci.seal-beach.ca.us
122     809190      www.ci.irvine.ca.us
95      169207      epa.gov
86      7673        dns


85      559125      order.e-arc.com
66      840581      www.ci.huntington-beach.ca.us
62      213476      www.cityoforange.org
57      313579      www.epa.gov
55      4477820     www.villapark.org
50      1843748     www.cityoflapalma.org
50      463285      www.ocbinc.com
[List truncated…]

Curator Feedback to CDL (Wilson OCSD)

Crawl Success: somewhat effective
Crawl Success Comments: Using WERA, I searched inquiries by type and title in the two OCSD collections, plain and via. I received no hits for pdf, only the homepage for html, and three hits for text. There are many pdf sections in the EIRs. I next searched by title in the two collections; I was the most successful in via. By searching the titles "carbon canyon" and "Ellis Ave Pumping Station" I found articles but not the EIRs, which are available full text. At this point I presumed the crawl did not drill down far enough. Then I searched for a secondary page entitled Ocean Monitoring; this time the search found only an internal letter and memo, but not all the documents related to this topic. The via collection search seems to be the most productive, but it is not consistent.
Crawl Scope Preferences: Original host only. (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: I searched some of the outside links, to US Marine Fisheries and EPA Beach Watch, and received no hits.
Crawl Frequency: monthly


Crawl Report Key: Web-at-Risk Test Crawls

This document is a guide to the test crawl report, providing further information about some of the statistics and results conveyed there. Your report includes basic information about the site you submitted, your original comments about that site, and any copyright statements CDL found when we examined the site. Your site was crawled using the Heritrix crawler, version 1.5.1. Your report will include the following information about the site.

Crawl Settings: We crawled each site in two different ways: (A) linked hosts included, and (B) restricted to original host. Where linked hosts were included, we set the crawler to gather any outside page that your nominated site linked to, but no further. So if your site linked to a single document from whitehouse.gov, we captured that document but did not crawl any further on the whitehouse.gov site. Your report will show results from both styles of crawling, to give you a sense of whether or not the site draws heavily from valuable materials on another site. You will also receive a list of the other hosts that site linked to and how many documents were gathered from those hosts.

Robots.txt file: The presence of a robots.txt file means that the content provider is asking us to refrain from crawling either all or part of the site. This refers only to the host you named in your crawl request. In some cases the site had a robots.txt file but it didn't say anything; we noted when this occurred. We obeyed robots.txt instructions for these crawls, so if the site contained one but we still got a result, that means the robots.txt file only prevented us from crawling certain areas of the site. When robots.txt files were present, we have included the text of that file in your report so you can see which segments of the site the site owner wants to protect.

Crawl duration

Total number of documents: The "Documents" count will include page components (such as images or Flash files).

File types (MIME types): This area will contain a URL. When you go to that URL you will see a list of the different file types that were retrieved as part of the crawl. IMPORTANT: The Heritrix crawler is currently experiencing difficulty with this report, and it is missing a crucial column. Until that is fixed, you can see the different file types retrieved, from most common to least, but you cannot yet tell how many files each one included. CDL will contact you when this report has been fixed.


Response code reports: The URL in this column will lead to a list of response codes, in order by frequency. This will include "200" for files that were successfully captured and error codes for files that were not captured. The error code list includes some codes specific to Heritrix. The key to interpreting these codes is at http://crawler.archive.org/articles/user_manual.html#statuscodes. Note that this report only gives you quantitative information about response codes; it does not link response codes to specific files. For these details see "hosts report and crawl log" below.

How much data collected (bytes): The file size of the total crawl is reported in bytes. You can use the byte conversion tool at http://www.techtutorials.net/reference/byteconverters.html if you want to recalculate the size of the crawl in another measurement, such as kilobytes or megabytes.

Did crawl complete?: This will say "no" if the crawl results exceeded 1 gigabyte, or if the crawler encountered an obstacle to capturing the site that could not be fixed.

Location of hosts report and crawl log: You are welcome to review the page-by-page details of the crawl log. This is a generic report that comes with the Heritrix crawler; it is not terribly user friendly, but it provides the most detail about the crawl process. This report will list every file that the crawler attempted to get and provide some information about each file. The Heritrix manual can help you interpret this report: http://crawler.archive.org/articles/user_manual.html#logs (go to section 8.2.1, Crawl Log).

Comments from Crawl Operator: These are observations that the Web Archive Programmer made about the crawl process for your site.

Questions for Curator: This section does not appear in every report. If you have a question listed here, please respond to it in the Test Crawl Feedback form.

Your Collection: Important: the screens you will see do not represent the final user interface for the Web Archiving Service tools. We are using WERA, an open-source search and display tool, only to show you your test crawl results.


Because we did not seek the right to redistribute these documents, these pages are available only for the purpose of analyzing crawler effectiveness. You must have a password to view these pages. Your report will include the address of a wiki page and a login and password. Each site was crawled twice: plain crawl = only pages from the original site were collected; via = pages from the original site, as well as pages that site links to, were collected. Unfortunately you cannot simply browse your site; you must select a collection and type a search. You will be able to navigate throughout your site once you load a page containing links. You will be able to review your colleagues' sites as well. Note that the WERA display tool is not perfect: if the same document was gathered from more than one crawl, it may not display in every collection.

Related Hosts Crawled: This section provides further information about the additional materials that were gathered when we set the crawler to include documents that your site links to. This can be critical in deciding what settings are needed to capture your site. Some sites, for instance, will keep all of their pdf or image files on a separate server. If you don't allow the crawler to move away from the original URI, you won't capture a critical portion of the site's content. In other cases, however, this setting will lead to irrelevant information. This report includes the most commonly linked hosts from your site.
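As a convenience for reading the "[urls] [bytes] [host]" listings that appear throughout these reports, the following small Python sketch tabulates such rows and converts the byte counts to megabytes. It is an illustrative helper, not a CDL or Heritrix tool, and the sample rows are adapted from the reconstructed Monterey Bay listing above.

    # Sample rows in the "[urls] [bytes] [host]" pattern used in the reports.
    ROWS = """
    5272 468755541 montereybay.noaa.gov
    554  20831035  www.wunderground.com
    282  3682907   www.oceanfutures.org
    """

    def parse_hosts_report(text):
        """Yield (urls, bytes, host) tuples from a whitespace-separated listing."""
        for line in text.splitlines():
            parts = line.split()
            if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
                yield int(parts[0]), int(parts[1]), parts[2]

    for urls, size, host in parse_hosts_report(ROWS):
        print(f"{host:25} {urls:6d} URLs {size / 1048576:8.1f} MB")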


Test Crawl Overview

During September and October 2005, the California Digital Library (CDL) embarked on a series of "test" web crawls for the Web-at-Risk project. These crawls were conducted using the Heritrix1 crawler to gather content that was specifically requested by the 22 Web-at-Risk curatorial partners. In keeping with the scope of the Web-at-Risk project, the content consisted of government and political web sites. The purpose of these crawls was:

• To teach CDL staff details of Heritrix crawler settings and performance
• To inform the Web Archiving Service requirements and design concerning crawler performance, default settings, and the settings that need to be available to curators
• To learn more about the nature of the materials that curators hope to collect and preserve, particularly in regard to rights issues and technical obstacles
• To convey any peculiarities about the content back to the curators
• To gather curators' assessment of the crawler's effectiveness

A Change of Plan

The CDL requested that curators submit their sample URIs by August 25, 2005.2 On August 28, Hurricane Katrina was upgraded to a Category 5 storm and Mayor Ray Nagin ordered the evacuation of New Orleans. As events unfolded in New Orleans and the Gulf Coast over the following week, CDL chose to suspend the original test plan and focus on gathering Hurricane Katrina information. Once this time-critical material was collected, we turned back to the original set of URIs provided by curators. Consequently, Appendix B of this report includes a brief synopsis and evaluation of our experience crawling the Katrina web materials; the body of this report concerns only our original test crawl scope. An additional development at this time was the release of Web aRchive Access (WERA),3 an open-source tool for displaying crawl results (arc files). This made it possible for us to display the results to our curators and improved their ability to assess the results.

The Respondents

Our request for test crawl URIs went out to 22 curators at the University of North Texas, New York University, and several University of California campuses.

1 Heritrix web site, Internet Archive, <http://crawler.archive.org>.
2 URL (Uniform Resource Locator) and URI (Uniform Resource Identifier) are related terms. We will use URI in this report, in keeping with the terminology used in the Heritrix crawler documentation.
3 WERA (Web aRchive Access), <http://nwa.nb.no>.


In many cases, curators worked for different departments within the same institution. We asked each curator to send us a first-choice and second-choice URI, with a note describing what they hoped to capture about each site. Many of the curators involved in the test crawl activities either had previous web archiving experience or were quite familiar with the issues involved; a number of them warned us of the challenges our crawler might encounter with the particular sites they selected. The curators' response patterns indicated an important assumption about the service they expect. Three curators responded that they were not going to send any URIs because they were collaborating with another curator who would take care of selecting the test sites. This implies that our curators envision a service that will allow them to work collaboratively on building a web archive collection. In all three cases the respondents and their collaborators were from the same institution. In a fourth case, a respondent said that he suspected he would send in the same sites as another curator because he knew they were working on similar issues. The two individuals concerned were from entirely different institutions, but nonetheless see the potential to work collaboratively.

About Crawl Results

Fifteen curators sent their site selections, giving us a total of 30 sites to choose from for our test crawls. It is critical to note that this sample is far too small to reveal anything conclusive about the nature of web sites or crawlers. In some cases the results may be affected by the peculiarities of network traffic and server performance at the time the sites were crawled; in other cases they may be affected by our own learning process in using Heritrix and WERA. The value of these tests is to learn more about the specific interests and reactions of our group of curators and to acquaint us all with the tools at hand. The following results should be interpreted with those caveats in mind.

Pre-Crawl Analysis: Rights Issues

Before crawling the test sites, we conducted an initial review of the sites in question to determine what issues they might pose, including rights management issues. In the "Web-at-Risk Rights Management Protocol," the CDL outlines the approach to rights management that will be followed for the Web-at-Risk project and that will inform the development of the Web Archiving Service. One aspect of this plan is that for each site selected for crawling, the curator will determine which of the three following categories to apply:

• Scheme A: Consent implied
• Scheme B: Consent sought
• Scheme C: Consent required

Only Scheme C requires the curator to get advance explicit permission to crawl the site. It is hoped that the project's focus on government and political information will ensure that most materials fall within Scheme A or B. However, in spite of this subject focus, this small collection of sites presented some interesting rights issues.


We used a two-step review process to match our 30 suggested sites to the correct rights scheme. First, we looked at the nature of the content-owner's organization (federal government, nonprofit organization, etc.). This was done by reviewing the organization itself as well as the domain name used for the organization's web site. This gave us a first-round guess at what the correct rights scheme would be for each site. Next, the sites were carefully reviewed for copyright notices, and our initial determinations were revised. The breakdown of agencies by type is as follows:

[Chart: Site Types. Federal: 3; State: 9; Local: 9; Non-Profit: 9]

Seven of the 15 sites were from agencies devoted to water management. The full list of sites submitted is available in Appendix A.

Site domains: Although 21 sites were published by government agencies, only 12 of them were in the .gov domain. By domain, the sites included:

[Chart: Domains (all sites), by domain: gov, us, org, com, edu]

The nine local government sites presented by far the most variety in domain names. There was a weak correlation between domain names and the nature of the agency at the local government level:


[Chart: Domains (local sites only), by domain: gov, us, org, com, edu]

Copyright statements: We next reviewed each site to determine whether copyright statements on the site could help determine what rights scheme might apply to the site. The copyright statements for the sites we crawled are available with each individual crawl report in Appendix C. Here too, local government sites offered little correlation between the nature of the content-owning organization and the rights statements displayed on the site. City web sites varied dramatically in their rights statements: some stated that their materials were in the public domain, others vigorously defended their copyright. This City of San Diego site did both.4

After both rights reviews, it was determined that of the 30 sites submitted:

• 14 fell into Rights Scheme A and could be crawled without notification or permission
• 13 fell into Rights Scheme B and could be crawled, but would also require identifying and notifying the content owner
• 3 fell within Rights Scheme C and would require the explicit consent of the content owner prior to crawling

The process of reviewing the sites for rights statements changed our assessment of the correct rights scheme in a number of cases, and all three "Scheme C: Consent Required" designations were made on the basis of statements posted on the site.

4 City of San Diego web site, Disclaimer page, <http://www.sandiego.gov/directories/disclaimer.shtml>.


Note that we did not ultimately seek permission for these materials, and access to the results of our crawls has been strictly limited to the curators and project staff for the purpose of crawl analysis. In short, our pre-crawl analysis of these 30 sites brought up complex rights issues and exemplified the challenges that lay ahead.

Test Crawl Settings and Process

Although we had originally planned to crawl only one site for each curator, some curators supplied two sites that posed interesting problems; in these cases we crawled both. Each site was crawled with two settings, resulting in 19 test sites and 38 total crawls conducted. We used Heritrix version 1.5.1 to conduct the test crawls. The crawls were conducted using four crawler instances on two servers. Each site was crawled separately; that is, each seed list contained one URI. We kept most default settings, except for the following:

Crawl size: Each crawl was set to stop at a maximum of 1 gigabyte (gig) of data. Of the 38 crawls conducted, 18 hit the 1 gig size limit. Note that this limitation was imposed for the purpose of these early tests and will not be applied to future services.

Crawl duration: When crawls took an inordinately long time to complete, we started over again with "max retries" set at three. This setting improved crawler performance when pausing or hanging was an issue.

Politeness:5 Because we crawled each site individually, we set our crawler for very high politeness values. Politeness pertains to the impact of the crawl on the content owner's server and is determined by combining a few different Heritrix settings that together determine how demanding the crawler is on the remote servers' resources. Our politeness settings were:

• Delay-factor: 5
• Max-delay-ms: 5000
• Min-delay-ms: 500
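These three values interact roughly as the Heritrix documentation describes: the crawler waits between successive requests to the same host for a period proportional to how long the last fetch from that host took, bounded below and above by the minimum and maximum delays. The sketch below is a simplified approximation of that behavior, not Heritrix source code.

    def politeness_delay_ms(last_fetch_duration_ms,
                            delay_factor=5.0,
                            min_delay_ms=500,
                            max_delay_ms=5000):
        """Approximate wait before the next request to the same host."""
        delay = delay_factor * last_fetch_duration_ms
        return max(min_delay_ms, min(max_delay_ms, delay))

    # A page that took 200 ms to fetch -> wait 1000 ms before the next request;
    # a very slow 3000 ms fetch is capped at the 5000 ms maximum.
    print(politeness_delay_ms(200))    # 1000.0
    print(politeness_delay_ms(3000))   # 5000.0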

Original host only vs. linked hosts included: Each site was crawled with two settings. The first setting restricted results to only the host from the original seed URI. The second setting allowed us to gather any pages to which the site linked directly, but no more. This second setting was constructed to gather pages considered relevant to the original site, and to gather sites in their entirety when an organization relied on more than one host name to provide its web presence.
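A simplified way to picture the two settings is the scope test sketched below: a candidate URI is accepted if it is on the seed's host, or, in the "linked hosts included" configuration, if it was discovered directly on a page that is itself on the seed's host, i.e. one hop off-site and no further. This is an illustration only, not the actual Heritrix scope classes; the seed host is one of the Appendix A examples and the off-site URLs mirror the whitehouse.gov example used in the Crawl Report Key.

    from urllib.parse import urlparse

    SEED_HOST = "www.strengtheningsocialsecurity.gov"   # example seed from Appendix A

    def in_scope(candidate_url, found_on_url, include_linked_hosts):
        """Decide whether a discovered URI should be crawled.

        candidate_url: the link just discovered
        found_on_url:  the page on which it was discovered
        """
        candidate_host = urlparse(candidate_url).netloc
        source_host = urlparse(found_on_url).netloc
        if candidate_host == SEED_HOST:
            return True                      # always stay on the original host
        if include_linked_hosts:
            # accept off-site pages only when linked directly from the seed host
            return source_host == SEED_HOST
        return False

    # An off-site document linked straight from the seed site is captured ...
    print(in_scope("http://www.whitehouse.gov/report.pdf",
                   "http://www.strengtheningsocialsecurity.gov/index.html", True))   # True
    # ... but links found on that off-site page are not followed any further.
    print(in_scope("http://www.whitehouse.gov/another.html",
                   "http://www.whitehouse.gov/report.pdf", True))                    # False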

Crawl Scope

A comparison of the two different crawl settings used (original host only vs. linked hosts included) turned up some counterintuitive results.

5 For further information, see section 6.3.3.1, "Politeness," in the Heritrix User Manual, <http://crawler.archive.org/articles/user_manual.html>.


When compared quantitatively, 8 out of 19 crawls took longer to capture the site when limited to "original host only" than with the "linked hosts" setting. It is not clear why this is the case, since the "linked hosts" crawl should be much larger; indeed, in all cases the linked hosts crawl retrieved more files than the original host crawl. The following two tables compare both the number of files retrieved and the duration of the two types of crawls.

Table 1: Number of files retrieved

           Original Host Only    Linked Hosts Included
Most       46197                 70114
Fewest     247                   1343
Median     2423                  9250
Average    6359                  17247

Table 2: Duration of the crawl

           Original Host Only    Linked Hosts Included
Longest    32 hr 21 min          37 hr 11 min
Shortest   18 min                19 min
Median     7 hr 33 min           11 hr 22 min
Average    1 hr 42 min           7 hr 9 min
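The percentage comparison in the next paragraph follows from the median rows of the two tables above; as a quick check:

    # Median duration: 7 hr 33 min (original host only) vs. 11 hr 22 min (linked hosts)
    extra_time = (11 * 60 + 22) / (7 * 60 + 33) - 1   # roughly 0.51, i.e. about 50% more time
    # Median files retrieved: 2423 vs. 9250
    extra_docs = 9250 / 2423 - 1                       # roughly 2.82, i.e. about 282% more documents
    print(round(extra_time, 3), round(extra_docs, 3))  # 0.506 2.818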

Given that this is a very small sample of crawls, and that the gap between the largest and smallest crawls is fairly noteworthy, perhaps the only telling figure to consider here is the median. According to the median figures, with only 50.5% more time the crawler acquired over 281% more documents. When compared qualitatively, the results also appeared somewhat counterintuitive. Of the 18 curators who responded, 12 stated that they preferred the "original host only" crawl (four were undecided). We would have expected this preference to vary a little more from site to site. Oddly, one of the two curators who preferred the larger crawl scope had a crawl that captured materials from over 2500 other hosts. In some cases a site's links to exterior hosts are critical: the site's value may hinge upon how well it gathers documents from other sources, and both of the curators who preferred the broader setting did so for this reason.

For this site it was essential to capture the linked hosts (via) because many of the press materials, etc. were on external sites.

More critically, sites are also often composed of content from more than one server. This is particularly likely to be the case if a site is providing a large body of pdf or multimedia files, so a crawl restricted to the original host would be missing critical segments of the site's content. Our test crawls did in fact turn up sites that were composed of more than one host name. For example, in the case of UC Merced, separate host names are used for different areas of the site, such as faculty or admissions.


In the case of the Arizona Department of Water Resources, the distinction between host names appears to be accidental, perhaps the result of an attempt to transition to a simpler, more memorable URI: most pages from this site come from www.azwater.gov, but hundreds of internal links, including critical style sheet files, are still hard-coded to point to www.water.az.gov. Finally, when the site is restricted to the original host, the end user is much more likely to encounter errors when viewing the archived results. When the end user selects a link that was not captured, WERA provides a "Sorry, this URI was not found" message. When the linked hosts are included, the end user browsing the site sees the site closer to its original context and with fewer error messages. Conversely, when end users encounter frequent error screens, they may develop both frustration and a sense of mistrust in the quality of the archive. It is worth noting that the curators are not likely to browse these results in the same way that an end user of their archives might: the curators know what these sites contain, choose them accordingly, and may be less inclined to click on links that would result in a "Sorry, this URI was not found" message. Ultimately, the value of a site's external links would seem likely to vary depending on the nature of the site. Sites with rich internal content and only "frivolous" external links would be best captured with the "original host only" setting. Before we ran these crawls, we asked curators to specify what they hoped a crawl would capture. Many referenced specific pages or directories they hoped to capture, and of those, three specified URIs that were not from the original host. When reviewing the results, two out of those three still preferred the original host crawl, even though that crawl did not capture the materials they specifically hoped to capture. Why was the feedback so consistent on this point? A look at the WERA interface used to display crawl results may provide an answer. WERA does not offer an immediate means of browsing a site; you have to search by keyword to find your way "into" the captured site. Once you have a page displayed you can browse within the site, but you must begin by searching for the right starting point. As this image shows, you select the site you want to search from a dropdown menu, then enter terms to search against.

Because all pages from the more comprehensive crawls are indexed, the search results include pages from all of those other hosts. This sets up a bit of cognitive dissonance: the user specified a search against a particular site, yet results from other hosts vastly outnumber pages from that site.


Underlying the issue of crawl scope is the deeper question of what an archivist hopes to capture when a site is crawled. Is it just a list of particular documents? Or is it a faithful recreation of the site as it existed on that day? It may be that a captured site has content of primary and secondary importance: the primary content is what should be retrieved when searching against the archive, while the secondary content should only be present to avoid error messages and establish the site's full original context. Another approach is suggested by this curator's response:

The results on the multiple hosts crawl is mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains rather than "source + 1."

This suggests the ability to link seed URIs as being related components of a single site.

Communication / Reports

When we reported the test results back to curators, we provided a synopsis of the crawl results, links to particular Heritrix reports, and to the WERA display interface. The Heritrix reports are all plain text, providing tables of MIME type or response code frequency. Attempting to integrate these reports and the display of the archived results is a challenge. One curator, for example, obtained documents from over 200 hosts in the "linked hosts included" crawl, but was only aware of having found 10 additional documents when reviewing these same search results in WERA. Although WERA is helpful for seeing results from an end user's perspective, it does not provide adequate tools for analysis. In some cases this is simply because WERA is a new and occasionally buggy tool. It is possible, for instance, to follow occasional links out of the archive and into "real-time" sites. In some cases it's also possible to browse to a page and display it, but when you search for that same page by its URI, WERA does not find anything. One curator notes:

We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE; the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.

It is also important to note that although WERA was used for the purpose of reporting test crawl results, it is not envisioned as the final display interface for the Web Archiving Service. Even so, the feedback the curators provide about WERA should inform the functionality of the WAS interface. Clearly, it is still quite a struggle for curators to determine exactly what a crawl retrieved. One curator reports:

After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report, there are 8899 documents in this crawl alone.


My other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls.

Crawl Frequency

When asked how frequently they wanted to crawl sites, curators responded with a variety of preferences:

[Chart: Desired Crawl Frequency; categories: Daily, Weekly, Monthly, Once, Unknown]

Again, it is worth considering precisely what curators hope to capture in a repeated crawl of a site. Some insight is provided by these curators' comments:

We hope the crawler will be able to report when new publication files are posted on the web site.

And

The ability to report on new publications is critical to our goal of using the crawler as a discovery tool.

As with the other NDIIPP grants, the purpose of the Web Archiving Service tools will be archiving and preservation, not resource discovery. This indicates that we should further investigate what a "weekly" or "monthly" crawl really means to curators. If a site was not updated over the course of a year, would the curator want to continue running weekly crawls of the site? Would the curator want each crawl to appear on an archive timeline for that site, even if the content was no different for each date? One curator comments:

I want to qualify the frequency for this site: I'd like to do a monthly crawl for three to four months. I'd want [to] reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot [of] new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation.


Language and Web Site Models

One of the challenges of communicating and interpreting crawl results is that crawlers don't necessarily work the way people envision them to. Further, the appearance of a web site on a screen and its architecture on a server may be quite different: the directory structure of a site may have no relationship to the way its navigation is organized on a screen. One frequent point of confusion curators encountered while interpreting crawl results is the concept of how many "levels down" the crawler went. One curator requested that we "drill down several levels (at least 3)" in our capture. One challenge with this request is that "levels down" can be interpreted to mean different things. In some cases curators clearly mean the directory structure of the web site on the remote server. In other cases, however, they seem to mean the number of hosts away from the original site:

In general, restricted to original host works better. The broader search includes pages from businesses (e.g. Comcast) and other local agencies (e.g. other local and state government sites). But restricting the outside sites to the first level seems to be a good compromise.

If the curator is referring to a structure of subdirectories, the next challenge is that the Heritrix crawler does not work this way. Heritrix ignores the site's directory structure and instead follows links from the seed URI it is provided. The Heritrix manual specifically defines "depth" in this way:

This scope allows for limiting the depth of a crawl (how many links away Heritrix should crawl), but does not impose any limits on the hosts, domains, or URI paths crawled.6
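The distinction matters because a document that is several levels down in the server's directory structure may be only one link away from the home page, and vice versa. The sketch below contrasts the two measures; it is illustrative only, the URL is hypothetical (patterned on the Santa Barbara example earlier in this report), and Heritrix's own limit is the link-hop count described in the quoted passage.

    from urllib.parse import urlparse

    def directory_depth(url):
        """Depth in the server's path hierarchy (what curators often mean by 'levels down')."""
        path = urlparse(url).path
        return len([segment for segment in path.split("/") if segment])

    # A deeply nested PDF that the home page happens to link to directly:
    url = "http://www.countyofsb.org/plandev/pdf/comp/programs/some_report.pdf"
    print(directory_depth(url))   # 5 path segments below the site root

    # For the crawler, the same document is only 1 link hop from the seed, so a
    # "three links away" crawl still captures it, while a page one directory deep
    # but reachable only through many intermediate pages might be missed.
    link_hops_from_seed = 1
    print(link_hops_from_seed)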

Multimedia

Some curators chose sites because of the value of their multimedia resources. The crawler yielded mixed results in capturing these resources. For one site:

• A text search on the log file turns up numerous .ram files, only one .ppt file.
• .asx files are Windows streaming media redirector files, which generally lead to associated .wmf files. No .asx references appear in the crawl log, nor do any .wmf files.
• .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files.

As noted, nearly half the sites crawled reached the 1 gig size limit and so did not complete. This makes it difficult to determine whether there were genuine problems with particular types of files, or if the crawler simply did not get to the missing files before the crawl was stopped.

6 Heritrix User Manual, section 6.1.1, "Crawl Scope: Broad Scope," <http://crawler.archive.org/articles/user_manual.html>.



Comparison with Other Crawlers

Three of our curators had previously used other crawlers, such as HTTrack and Wget, to capture the same sites we attempted with Heritrix. Because these curators provided strong details when describing their sites, we crawled all of the sites they sent us (six sites). In one case, the Defense Base Closure and Realignment Commission, the curator had greater success capturing .aspx files with HTTrack than we had with Heritrix. Other comparisons:

We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using Wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand, Wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can. That may be the difference.

And

We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with Wget.

Crawl Success

We asked curators to rate the overall success of the test crawls on the following scale:

• Not effective (none of the desired documents were captured)
• Somewhat effective (some of the desired documents were captured)
• Mostly effective (most of the desired documents were captured)
• Effective (all of the desired documents were captured)

[Chart: Crawl Success; categories: Effective, Mostly Effective, Somewhat Effective, Not Effective]

Here are some of the comments curators had concerning the overall success of these tests. The full text of all curator reports and their feedback to CDL is available in Appendix C.


Los Angeles Planning Department

The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all pdfs. When searching specifically for EIRs I got a large result list (699 citations); however [sic], when I investigated whether or not the actual file was captured, I found that the main EIR page was captured, typically a htm/html file, but when I clicked on a link to get to the full report, all I got was the "Sorry, no Documents w/the given URI were found" message. This could be that the file was no longer available when the site was harvested. However, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved.

City of San Diego Planning Department

This comment is from a curator who is filling in for the person who originated the test crawl request, so she is also attempting to interpret what that other person was hoping to capture.

Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content), but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit.

Defense Base Closure and Realignment Commission

I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.

Public Policy Institute of California


There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: the drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: the radio button selections are not functional. For example, clicking on "Date" retrieves a message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."

AFL-CIO

I realize the collection interface is a "work in progress" and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g. "Working Families Toolkit," "BushWatch") that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2-million-plus-hits-per-year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer, which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses.

Conclusions

The process of running these test crawls has been valuable. It has resulted in a number of lessons learned and further directions for inquiry. It is clear that the tools that help curators analyze the results of these crawls will have to do more than mimic the original browsing context or supply tables of data. We will be looking for any opportunity to improve WAS reporting capabilities as we move forward with the project. The differences encountered in language and visualization raise the importance of a strong, intuitive design for the curator tools and for clear help screens. Each of us may visualize web sites differently, and the crawler may behave differently than we expect. A certain degree of online help will be needed to design crawls effectively, and further guidance should be available to help people interpret crawl results when those results don't match what the person anticipated.


The help and documentation for the Web Archiving Service will also need to address the rights analysis issues raised above. In most cases this is work that is done prior to issuing crawls; it cannot be addressed by the design of the WAS interface alone. When the Web-at-Risk project reaches the point of conducting usability studies, we should be sure to include tests that further uncover how users understand crawl frequency settings. Additionally, the desire for an analysis tool that can convey when a site has changed significantly is not limited to this project; it is a common issue faced by the Internet Archive, members of the International Internet Preservation Consortium, and others. CDL is actively communicating with these organizations as we all work toward a solution for this problem. Similarly, CDL should ensure that future assessment and analysis work with our curators addresses the issues raised by the two crawl scope settings. This inquiry should also extend to participants who exemplify end users, to determine if people using web archives experience crawl scope differently than people who build them. Finally, a number of lessons were learned via the Katrina crawl described in Appendix B. The most outstanding finding is that event-based crawls, such as Katrina, and site-specific crawls, such as these, have quite different characteristics and require different functionality and analysis tools.

Next Steps

There is certainly more to learn from these crawl results and about Heritrix; in some cases it is still unclear why the crawler failed to retrieve certain documents. The curators' feedback concerning these results has been extremely valuable. They have provided insight as to what was captured and what is still missing, which would have been difficult to determine without their subject expertise in the sites chosen. As we continue working to improve crawler success and performance, we will turn our attention to the Virtual Remote Control site created by Cornell. This site was developed to present particular problems to crawlers and is well documented, enabling the user to gauge a crawler's results. We will be using that site to replicate particular problems raised in our test crawl set, and may also reattempt some of the sites crawled for these tests. When we release the first version of the Web Archiving Service to curators in July 2006, we will request that they include their original test site in the crawls they attempt and compare the results with these tests. The test results for these crawls will remain in place as a point of comparison while we continue developing the Web Archiving Service. Additionally, certain issues raised by this feedback, particularly the desired scope and frequency of crawls, merit attention in future usability analysis work.


Web-at-Risk Test Crawl Report Appendix A Sites Submitted

Curator | Site | Crawled
Sherry DeDekker | http://ca.water.usgs.gov (California Water Science Center) | Yes
Sherry DeDekker | http://www.dwr.water.ca.gov (California Department of Water Resources) | No
Peter Filardo and Michael Nash | http://www.nycclc.org (New York City Central Labor Council) | Yes
Peter Filardo and Michael Nash | http://www.dsausa.org (Democratic Socialists of America) | No
Valerie Glenn and Arelene Weibel | http://www.strengtheningsocialsecurity.gov (Strengthening Social Security) | Yes
Valerie Glenn and Arelene Weibel | http://www.brac.gov (The Defense Base Closure and Realignment Commission) | Yes
Gabriela Gray | http://www.joinarnold.com (Join Arnold) | Yes
Gabriela Gray | http://www.antonio2005.com (Mayor-elect Antonio Villaraigosa) | Yes
Ron Heckart and Nick Robinson | http://www.ppic.org (Public Policy Institute of California) | Yes
Ron Heckart and Nick Robinson | http://www.cbp.org (California Budget Project) | No
Terrence Huwe | http://www.aflcio.org (AFL-CIO) | Yes
Terrence Huwe | http://www.seiu.org (Service Employees International Union) | No
James Jacobs | http://www.sandiego.gov/planning (City of San Diego Planning Department; analyzed by Megan Dreger) | Yes
James Jacobs | http://www.sandag.org (San Diego Association of Governments) | No
Kris Kasianovitz | http://cityplanning.lacity.org (Los Angeles Department of City Planning) | Yes
Kris Kasianovitz | http://www.scag.ca.gov (Southern California Association of Governments) | Yes
Linda Kennedy | http://calwater.ca.gov (California Bay-Delta Authority, CALFED) | Yes
Linda Kennedy | http://www.dfg.ca.gov (California Department of Fish and Game) | No
Ann Latta | http://www.ucmerced.edu (UC Merced; analyzed by Elizabeth Cowell) | Yes
Ann Latta | http://www.coastal.ca.gov/web (California Coastal Commission) | No
Janet Martorana | http://www.countyofsb.org/plandev/default.htm (Santa Barbara County Department of Planning and Development) | Yes
Janet Martorana | http://www.sbcag.org (Santa Barbara County Association of Governments) | No
Lucia Orlando | http://montereybay.noaa.gov (Monterey Bay National Marine Sanctuary) | Yes
Lucia Orlando | http://www.waterboards.ca.gov/centralcoast (Central Coast Regional Water Quality Control Board) | No
Richard Pearce-Moses | http://www.azwater.gov (Arizona Department of Water Resources) | Yes
Richard Pearce-Moses | http://www.ccec.state.az.us/ccecscr/home.asp (Citizen's Clean Election Commission) | Yes
Juri Stratford | http://www.city.davis.ca.us (City of Davis, California) | Yes
Juri Stratford | http://www.sacog.org (Sacramento Area Council of Governments) | No
Yvonne Wilson | http://www.ocsd.com (The Orange County Sanitation District) | Yes


Web-at-Risk Test Crawl Report Appendix B The Katrina Crawl

The Crawl

During the early fall of 2005, the California Digital Library was embarking on a series of test crawls as part of the Web-at-Risk project when Hurricane Katrina struck. CDL had requested that curators submit their sample URLs by August 25th, and on Friday, August 26th, Louisiana Governor Blanco declared a state of emergency. It was over that weekend that CDL staff reached the conclusion that an event of this historic significance would shift our priorities. On Monday we suspended our initial test crawl plans and began preparing on a number of fronts to capture web-based materials related to Katrina. Specifically, this meant:

• identifying which web sites or sub-sites to collect, which involved:
  o notifying 30+ curators of our emergency crawl plans and requesting their assistance
  o setting up a web-based interface to gather their suggested seed URLs
  o sorting through the resulting seed list and feeding it to our primary and secondary (Stanford) crawl operators
• determining our collection parameters
• getting our crawlers installed and configured
• locating and setting up disk space to store the crawls
• initiating and monitoring the crawls

This crawl activity posed several new challenges. It was the first time CDL staff had used the Heritrix crawler, which was not yet installed on a production server when the hurricane struck. The last week of August saw CDL and UC Berkeley staff scrambling to find server space and to set up Heritrix instances, disk partitions, and job parameters for the crawl. This was done with remarkable speed, owing to the urgency of the situation and to everyone's desire to capture a record of the events. Given our limited experience, in order to reduce the risk of losing the historically significant and fleeting materials related to Hurricane Katrina, CDL worked with Stanford University to concurrently run the same crawl using a different crawler. After both CDL and Stanford had crawled the same seed list for a month, the task was then taken up by the San Diego Supercomputer Center, who have continued crawling these sites using Heritrix.

Gathering the Seeds

The CDL sent out an initial request to the Web-at-Risk curators to submit URLs related to Hurricane Katrina for crawling. We worked collaboratively with a large group of content specialists to identify the sites:

• The 22 curators of the Web-at-Risk project (University of California libraries, University of North Texas, New York University, Stanford University, and the Arizona State Library)
• The Library of Congress
• Librarians at Louisiana State University and the University of Mississippi


Over the course of the crawl, the list of seed URLs grew to over 700 (just over 500 of which were crawled by CDL). The image of our input form above provides some sense of the range of materials collected. Given the sudden nature of this event, there was no time to investigate rights issues or technical problems each site might have presented. We informed the curators that "Our immediate plan is to simply collect the material before it disappears. We will not make the material immediately available." We also had little time for quality control and were not able to guarantee that sites in our seed list would be comprehensively crawled.

Crawling Specifics

CDL's first crawl was run on September 1, 2005, using a seed list of 89 URLs. The final crawl run by CDL was on October 10, using a seed list of 589 URLs. This final seed list was then sent to the San Diego Supercomputer Center, who are continuing to run twice-weekly crawls.


The CDL crawls were done using Heritrix version 1.5.1. We began with a single instance of Heritrix, eventually dividing the seed list among six instances. We got through the entire seed list 29 times in 40 days. However, the content we were able to collect was limited by our crawler settings. Our crawler was configured to visit one host at a time, collecting content at a conservative rate, and never more than three hops away from the seed URL. To ensure that the crawler moved through the seed list in a timely way, it was limited to spending 15 minutes at any given host. So the material collected does not represent the entirety of what was available at each site. Note that all of the seeds were crawled with the same configuration, no matter how different the structure of these sites might be. Thus the New Orleans version of Craigslist, NASA's information pages, and blog sites were all crawled in the same manner, despite being quite different in context, architecture, and other characteristics. Given our short preparation time, the goal was to find a crawler configuration that would match the broadest range of Katrina materials. In most cases, whether the seed URL was for a personal blog or for a government agency, the seed itself was not centrally about Katrina. The Katrina information was generally situated on the front page and top levels of each site, so CDL expected that this setting, even with the 15-minute time limit, would at least capture the content nearest to the front pages of the widest variety of sites.

In choosing crawler configuration settings we also faced conflicting goals. On one hand, we wanted to capture as much Katrina content as possible. On the other hand, we were very hesitant to start hitting sites that were providing much-needed information at a crucial time (emergency sites, relief sites), making those sites hard to reach because of our capture activity. Many of these sites were already being heavily used and perhaps were not running at full capacity; some were also geographically impacted, directly, by the hurricane. So we had to choose settings that balanced the need to collect with politeness across a wide range of sites.

When the San Diego Supercomputer Center took over the Katrina crawls in October, they revisited the Heritrix configuration settings. The SDSC crawl placed a limit not on the amount of time spent at a site but on the number of documents to be retrieved overall. Their crawls are being conducted "breadth first," in the sense of gathering pages consecutively across the seed list rather than gathering pages from one site at a time.

The Stanford University crawl of the same sites began on September 7th and continued for 30 consecutive days using the WebVac crawler. While we have statistics concerning the size of the Stanford Katrina collection, it is very difficult to compare the configuration settings and effectiveness of WebVac vs. Heritrix. Further, there is no easy way to display materials gathered with WebVac, so the Stanford Katrina content is stored but is not accessible to us at the moment.
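The division of the seed list among six Heritrix instances was managed by hand during the Katrina crawl. A minimal sketch of how such a split might be scripted is shown below; the file names and the round-robin scheme are illustrative assumptions, not the procedure actually used.

  # Sketch: split one master seed list into N chunks, one per Heritrix instance.
  # File names and the round-robin assignment are illustrative assumptions.
  N_INSTANCES = 6

  def split_seeds(seed_file="seeds.txt", n=N_INSTANCES):
      with open(seed_file) as f:
          seeds = [line.strip() for line in f]
      # Drop blank lines and comment lines.
      seeds = [s for s in seeds if s and not s.startswith("#")]
      for i in range(n):
          chunk = seeds[i::n]  # round-robin keeps chunk sizes roughly equal
          with open(f"seeds-instance-{i + 1}.txt", "w") as out:
              out.write("\n".join(chunk) + "\n")

  if __name__ == "__main__":
      split_seeds()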

Katrina Crawl Results

In terms of creating a collection of the Katrina event on the web, we are not satisfied that CDL's crawling efforts were successful.


As mentioned, we had to impose time limits on each site visited and were not able to conduct quality control on the URLs that were captured. The total size of the Katrina capture at CDL is 50 GB. This represents 29 completed crawls of the seed list, which grew to over 500 seeds over the 40 days we ran the crawl. It's clear from the byte size alone that we barely skimmed the surface, and indeed the Stanford crawl proves this point: when the Stanford crawl was based on about 370 seeds, it was pulling in about 35 GB a day, or 1.5 million pages a day. In spite of this, our attempts to capture Katrina news and events did have some very useful and positive outcomes. In terms of test crawling, the experience was quite successful, opening our eyes both to certain obstacles and to new possibilities. Most importantly, it prompted us to consider the nature and demands of event-based crawling, which had not been part of our original test plan. Here are some of the aspects of event-based crawls that Katrina surfaced.

Site Selection and Classification

This type of event demands a deep collaborative effort to identify material to collect. CDL had to mobilize a large number of curators to make their best guesses at identifying promising sites. We set up a 'crawl seed nomination' web form, accessible to the curators from their home institutions, and invited the curators to use it to enter their suggested seed URLs. CDL staff did not have time to add much in the way of automatic error or duplicate checking features, so this cumbersome work was done by hand. On the other hand, it provided us a first-hand trial of what a more general curator user interface might require.

The selection and management of seed lists is critical for sudden event-based crawls. The curators contributing the URLs will not necessarily be well versed in the topic; in the case of Katrina, curators in California were not uniformly familiar with the Gulf Coast, the towns, the government agencies, etc. In addition, it is difficult to predict which aspects of the event will be of historic, enduring value. Because disk storage was not a pressing issue, it was better to err on the side of a wider net when selecting sites. It became clear that there might be a role for "smart crawling," which would spread a wider net than that provided by human-generated seed lists through the use of automated tools that discover relevant materials. We were specifically interested in using tools developed at Emory and Cornell that interface with the Rainbow text classification tool.7 We did not have the resources to investigate this for the Katrina crawl but plan to fold it into future crawling tests.

Finally, it is worth examining why the seed list grew continuously throughout the event. Part of this growth is due to the natural increase in the number of pages and sites devoted to the emergency, and to having the additional time to identify and add them. However, the nature of the event itself changed over time, starting as a hurricane, then becoming a flood, a massive relocation, and a political and social issue. So the range of relevant sites changed as the event itself took on broader implications.

7 Bergmark, Donna. "Heritrix processor for use with rainbow." <http://groups.yahoo.com/group/archive-crawler/message/1905>


This suggests that site selection is an ongoing process, not strictly an activity undertaken at the beginning of an event.
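Because error and duplicate checking of the nominated seeds had to be done by hand, a small normalization pass is the kind of helper a curator interface could eventually provide. The sketch below is an assumption about what such a pass might look like, not a description of the form CDL actually built.

  # Sketch: normalize and de-duplicate curator-nominated seed URLs.
  # The normalization rules here are assumptions, not CDL's actual procedure.
  from urllib.parse import urlparse, urlunparse

  def normalize(url):
      url = url.strip()
      if not url:
          return None
      if "://" not in url:
          url = "http://" + url        # nominations often omit the scheme
      parts = urlparse(url)
      if not parts.netloc:
          return None                  # nothing crawlable here
      # Lower-case the host, drop any fragment, keep path and query as given.
      return urlunparse((parts.scheme, parts.netloc.lower(),
                         parts.path or "/", parts.params, parts.query, ""))

  def dedupe(nominations):
      seen, clean = set(), []
      for raw in nominations:
          url = normalize(raw)
          if url and url not in seen:
              seen.add(url)
              clean.append(url)
      return clean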

Rights, Ownership, and Responsibilities

Our current rights protocol stipulates that when a web page is clearly in the for-profit domain, we must seek permission from the content owners to collect. This restriction applies most notably to news agency sites, which would likely be a critical part of the web-based fallout of any major event; this was certainly the case with Katrina. However, rights negotiation is a time-consuming process. In the case of a planned event, such as an election, one is able to anticipate a certain number of sources and take rights management steps before the event occurs. But with emergencies such as Katrina, there is simply no way to secure rights in advance without missing time-sensitive material. Knowing this would be an issue, CDL notified our curators in advance that the Katrina materials we gathered would not be publicly accessible, but merely captured and preserved. The complex rights issues behind web archiving, as well as the collaborative environment described above, raise a number of questions:

• Who owns the content?
• Who is responsible for the project?
• Who is responsible for responding if a content owner objects?
• When is it permissible to ignore robots exclusion files?
• How much rights management can be done as part of post-processing for an event crawl?
• Can there be standing agreements with major news agencies to allow for this type of crawl in emergency situations? If so, what constitutes an emergency?

Technical Infrastructure

CDL did not have an adequate technical infrastructure in place at the time of the Katrina crawl. As mentioned above, staff at both CDL and UC Berkeley scrambled to find server space and to install and configure the crawlers. This sudden shift in workload left the system vulnerable and somewhat unstable. In order to start the crawling process as soon as possible, CDL began the project on a temporary server. Making the transition to more permanent storage in the midst of this crawling project was neither easy nor flawless, and a certain amount of data had to be recovered from backup sources.

Information Analysis and Display

The challenge of analyzing crawl results has been described in some detail in our Test Crawl Report. That report cites a curator who found the task of reviewing an 8,899-document crawl to be unmanageable. The tools currently available are not at all up to the task of analyzing a large and complex crawl.


An event-based crawl is likely to result in massive amounts of data of widely varying quality. The selection of seeds is based on guesses that, given sufficient resources, should be reviewed, refined, and enhanced as the event progresses. Time-series data based on changing input parameters represents a kind of moving target that suggests the need to develop new analysis tools. Andreas Paepcke of Stanford University addressed this problem when considering how a social scientist might want to analyze the Katrina materials. Consider the researcher who is interested in finding out how quickly the notion of race entered the public discourse in the aftermath of Katrina. Paepcke suggests:

"For example, the tool could perform word frequency analysis across successive crawls of Katrina sites so that the user could identify the emergence of word usage, such as 'looter,' 'ninth ward,' etc. The social scientist would then define within the tool clusters of words that in the scientist's judgment are direct or indirect occurrence indicators of the concept 'race' (i.e. ninth ward, looters, lacks, poor, ...). The scientist could subsequently interact with the tool at the level of these well-defined concepts. Example command: count occurrences of 'race' within the first three paragraphs of all pages and tabulate the differences across 10 days. We could go further and apply well-known topic categorization algorithms on the data to suggest new concepts as they arise in the series of text streams. For example, the tool might identify an increasing frequency of the term 'mismanagement' beginning in week two after the storm and alert the sociologist to this evidence of a new discourse theme." 8

While we don't expect to be able to incorporate this level of analysis into our web archiving tools any time soon, these ideas illustrate a useful direction for archival analysis tools.
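As a concrete illustration of the first step Paepcke describes, counting occurrences of a cluster of indicator words across successive crawl snapshots can be sketched in a few lines. The word cluster and the idea of one plain-text extract per crawl date are hypothetical examples, not features of any existing Web-at-Risk tool.

  # Sketch: tally a cluster of indicator words across successive crawl snapshots
  # (one plain-text extract per crawl date). Cluster and file layout are hypothetical.
  import re

  RACE_CLUSTER = ("ninth ward", "looter", "poor")

  def count_cluster(text, cluster=RACE_CLUSTER):
      text = text.lower()
      return sum(len(re.findall(r"\b" + re.escape(term) + r"\b", text))
                 for term in cluster)

  def trend(snapshot_files):
      # Returns (filename, occurrences) pairs so the emergence of the theme
      # can be tabulated across crawl dates.
      results = []
      for path in snapshot_files:
          with open(path, encoding="utf-8", errors="ignore") as f:
              results.append((path, count_cluster(f.read())))
      return results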

Conclusions

At the time Hurricane Katrina hit, the Web-at-Risk service requirements and test crawl plan had been written for a more orderly and considered approach to web crawling. Our attempt to capture the web-based aftermath of Katrina highlighted our need to revisit the service requirements and see how well they supported an event-based response. We need to develop a collaborative and robust mechanism to identify and collect seed URLs. When an event happens quickly, it is also important to have in place general crawling guidelines that will result in a well-rounded collection. The technical infrastructure for the Web Archiving Service that we are building for the Web-at-Risk project should be robust enough to handle occasional and sudden bursts of activity. Attempting to alter the infrastructure quickly in reaction to emergency events impairs our responsiveness and leaves the altered systems in a vulnerable state.

8 Paepcke, Andreas, Senior Research Scientist and Director of the Digital Library, Stanford University. Email correspondence with Patricia Cruse, October 26, 2005.


Finally, work needs to continue on both the rights management front and on developing improved web archiving analysis tools, so that the material gathered can be used to its greatest potential.


Web-at-Risk Test Crawl Report: Appendix C Individual Crawl Reports

Included below are the crawl reports provided to individual curators, including their analysis and feedback about those results. Note that a key to interpreting the tables in these crawl reports is provided at the end of this appendix.

Elizabeth Cowell (submitted by Ann Latta): UC Merced

CDL Report to Curator

URL: http://www.ucmerced.edu

Curator's original comments: "UC Merced is the first research university to be built in the 21st century. The educational and land use issues are significant. Of particular interest is http://www.ucmercedplanning.net. This site addresses major issues of land use; the university is being built on agricultural land. Controversy existed re: issues of redevelopment of downtown Merced vs. appropriation of agricultural land; there are major environmental issues focused on endangered species. Educational issues involving faculty job descriptions, student body, etc. are significant because of the economic, ethnic and cultural diversity of the region."

Site copyright statement: "© 2004 UC Regents"

Crawl Results


Comments from crawl operator: When we set the crawl to include pages from linked sites, the crawler got "trapped" at the Elsevier site. There is JavaScript on that linked page that causes the crawler to continue looking for additional pages on the Elsevier site, even when you're only trying to capture a single page. Once we set a limit for the maximum number of retry attempts, the crawl completed. This data is from the completed crawl.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 227 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
969 71552369 www.ucmerced.edu
238 2564803 www.ucop.edu
226 14851 dns
197 16583197 www.universityofcalifornia.edu
156 8487817 www.elsevier.com
151 1437436 www.greatvalley.org
112 2354582 faculty.ucmerced.edu
105 5659795 www.pacific.edu
90 111985 k12.ucop.edu
86 255733 www-cms.llnl.gov
85 1178031 admissions.ucmerced.edu
81 297947 uc-industry.berkeley.edu
71 108265 www.mssmfoundation.org
67 349300 www.nps.gov
66 308926 www.usafreedomcorps.gov
54 137085 slugstore.ucsc.edu
52 52202 www.cerrocoso.edu
51 977315 www.universityofcalifornia.com
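The per-host tallies above follow the urls/bytes/host layout of the hosts report that Heritrix typically writes at the end of a job. A sketch of filtering such a report for hosts that supplied more than 50 files is given below; the file name hosts-report.txt and the exact column layout are assumptions.

  # Sketch: pick out hosts supplying more than a threshold number of URLs
  # from a Heritrix-style hosts report ("urls bytes host" per line).
  # The report file name and column layout are assumptions.
  def top_hosts(report_path="hosts-report.txt", min_urls=50):
      hosts = []
      with open(report_path) as f:
          for line in f:
              parts = line.split()
              if len(parts) < 3 or not parts[0].isdigit():
                  continue              # skip the header or malformed lines
              urls, nbytes, host = int(parts[0]), int(parts[1]), parts[2]
              if urls > min_urls and host != "dns":
                  hosts.append((urls, nbytes, host))
      return sorted(hosts, reverse=True)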

Curator Feedback to CDL (Cowell Merced)

Crawl Success: mostly effective


Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: 10 more links came up in the via search; 8 of these were not useful for the research of land use issues. The two that were useful were a benefit of the via search.

Crawl Frequency: monthly


Sherry DeDekker: California Water Science Center

CDL Report to Curator

URL: http://ca.water.usgs.gov

Curator's original comments: "We are interested in the CA water reports and databases behind the links on this page. Also in looking at http://waterdata.usgs.gov/ca/nwis/nwis; this section is an interactive interface to multiple data sets. Is it possible to capture this type of site as well as the static reports?"

Site copyright statement: "Information presented on this website is considered public information (unless otherwise noted) and may be distributed or copied. Use of appropriate byline/photo/image credit is requested. We strongly recommend that USGS data be acquired directly from a USGS server and not through other sources that may change the data in some way."

Crawl Results

Comments from crawl operator: Databases, e.g. http://waterdata.usgs.gov/ca/nwis/nwis, are disallowed by http://waterdata.usgs.gov/robots.txt. Some water data reports are NOT caught by the broader "linked hosts included" settings, e.g. http://pubs.usgs.gov/wdr/2004/wdr-ca-04-1 -- would also want to submit http://ca.water.usgs.gov/waterdata as a seed.

Related hosts crawled:


When the crawl was set to include documents from other sites that the original site linked to, 662 additional sites were crawled. The following list includes the hosts that supplied more than 50 files. Note that the host "pubs.usgs.gov" supplied a higher number of files than the original host itself.

[urls] [bytes] [host]
1963 255912820 pubs.usgs.gov
1153 47066381 ca.water.usgs.gov
698 56570 dns
404 112354772 geopubs.wr.usgs.gov
385 9377715 water.usgs.gov
327 203939163 greenwood.cr.usgs.gov
318 17431487 www.elsevier.com
219 3254794 www.usgs.gov
189 2737159 www.lsu.edu
163 2292905 wrgis.wr.usgs.gov
158 31124201 www.epa.gov
149 921063 www.usda.gov
[list truncated…]

Curator Feedback to CDL (DeDekker CWSC)

Crawl Success: somewhat effective

Crawl Success Comments: The site appears to access water data reports (http://ca.water.usgs.gov/archive/waterdata/index.html), but none are actually available through the links. I expected the site to not be able to access real-time data, but these are archived reports.

Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Frequency: weekly


Megan Dreger (submitted by James Jacobs): City of San Diego Planning Department

CDL Report to Curator

URL: http://www.sandiego.gov/planning

Curator's original comments: "I would like to drill down several levels (at least 3) of this site. For example, following the link to City of Villages/general plan update leads to many more important planning documents."

Site copyright statement: This site contains the two following notices on the same page:

"Restrictions on Use of Materials: This site is operated and maintained by the City of San Diego through its Department of Information Technology and Communications (referred to as IT&C). Except as provided herein, no material or information from this site may be copied, reproduced, republished, uploaded, posted, transmitted, or distributed except as authorized in this notice, expressly authorized within this site, or approved in writing by IT&C.

Copyright Notice: Unless a copyright is indicated, information on the City of San Diego Web site is in the public domain and may be reproduced, published or otherwise used with the City of San Diego's permission. We request only that the City of San Diego be cited as the source of the information and that any photo credits, graphics or bylines be similarly credited to the photographer, author or City of San Diego, as appropriate. If a copyright is indicated on a photo, graphic, or any other material, permission to copy these materials must be obtained from the original source."

Crawl Results


Comments from crawl operator: Need feedback about whether the desired content was retrieved. Question for curator: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 1084 additional hosts were crawled. (This figure does not represent a complete crawl, as the size limitation was reached.) The following hosts supplied more than 75 documents to your site:

[urls] [bytes] [host]
3728 556231640 www.sandiego.gov
1247 38685244 genesis.sannet.gov
1085 80905 dns
807 6676252 www.houstontexans.com
428 1079658 www.cacities.org
399 102298888 www.buccaneers.com
259 1797232 granicus.sandiego.gov
258 42666066 clerkdoc.sannet.gov
238 5413894 www.ccdc.com
225 2503591 www.ci.el-cajon.ca.us
223 1387347 www.ipl.org
217 2683826 www.sdcounty.ca.gov
203 11673212 restaurants.sandiego.com
195 2620365 www.sdcommute.com
192 1344523 www.bengals.com
189 2221192 www.kidsdomain.com
176 1333528 www.buffalobills.com
171 685965 www.chumpsoft.com
166 277238 www.proquest.com
[list truncated…]


Curator Feedback to CDL (Dreger San Diego)

Crawl Success: mostly effective

Crawl Success Comments: This crawl was not completed due to size, so that may explain some of my questions. It was pretty effective in terms of getting the Planning Dept. pages, but went out further than I expected. Due to the vague request to drill down several levels, I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit. It's nice that PDFs and other formats are included.

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: The linked hosts included (via) crawl seemed to include more extraneous stuff.

Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
A: You asked for a better definition of what was meant by the request to drill down several levels (at least 3). Unfortunately I wasn't the original curator, and I'm not sure what he meant. The crawl you did was very useful.

Crawl Frequency: monthly

Questions/Comments about crawl: I think that the crawl frequency should be at least monthly for these pages.


Peter Filardo and Michael Nash: New York City Central Labor Council

CDL Report to Curator

URL: http://www.nycclc.org

Curator's original comments: (none)

Site copyright statement: "© 2004 New York City Central Labor Council. No portion of this website may be reproduced in any form without permission from the Central Labor Council. Contact our offices for more information at nycaflcio@aol.com."

Crawl Results

NOTE: Because your Crawl "A" had to be stopped and then resumed, each of your reports for that crawl is in two segments. To browse a list of all reports for that crawl, go to http://voro.cdlib.org:8081/ingest_misc/ndiipp/testcrawls_raw/filardo_labor_via

Comments from crawl operator: A. Linked hosts included: Crawl complete after recovery, with addition of max retries. Seemed to hang at http://www.nycclc.org/calendar/event.asp?EventId=501 and http://www.nycclc.org/assets/HLC/application/membership.pdf. Ended crawl; it seemed to hang. Recovered from the previous job; the recovery was successful. Note for the future that a recovered job is identifiable because the logs directory is called logs-R.


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, x additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
1913 74260017 www.nycclc.org
156 11755 dns
115 710552 www.aflcio.org
73 1477966 www.comptroller.nyc.gov
71 193264 www.empirepage.com
60 570115 www.redcross.org
58 269079 www.afl-cio.org
57 240845 www.campsussex.org
57 113676 www.mssm.edu
56 449473 www.labor-studies.org
53 184605 www.pbbc.org
52 134326 www.senate.gov
[list truncated…]

Curator Feedback to CDL (Filardo, NYCCLC)

None provided


Valerie Glenn and Arelene Weibel: Strengthening Social Security

CDL Report to Curator

URL: http://www.strengtheningsocialsecurity.gov

Curator's original comments: "Contains external links to audio & video that would be essential to completing this site (see press room); some are files, some are links to webcasts, some are on external .gov sites, and some are external .com sites."

Site copyright statement: Copyright info not found

Crawl Results

Comments from crawl operator: Interesting audio/video/webcast content -- need feedback about success in capturing it.

Questions for Curator: How successfully did this crawl capture the multimedia documents you were interested in?

Comments from coordinator: In the media center area of this site (http://www.strengtheningsocialsecurity.gov/press/media_center.shtml) I've found .ram (both video and audio alone), .smil, and .asx files. The site also contains numerous .ppt and .pdf files.
• A text search on the log file turns up numerous .ram files, but only 1 .ppt file.


• .asx files are Windows streaming media redirector files, which generally lead to associated wmf files. No .asx references appear in the crawl log, nor do any wmf files.
• Similarly, .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files.
I assume that when displayed, some of the Real media files from this site would function, but many of the other multimedia files would not.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 388 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
660 10668874 www.chelseapiers.com
562 7334035 www.whitehouse.gov
477 6366197 www.laopinion.com
391 29623 dns
356 3874719 www.wkrc.com
243 12294240 www.strengtheningsocialsecurity.gov
178 1935969 www.xavier.edu
148 237055 image.com.com
127 682069 online.wsj.com
117 898439 www.omaha.com
116 514995 www.npr.org
108 995733 www.nba.com
[list truncated…]
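Since the .smil and .asx files captured here are only pointers to the actual media, one possible follow-up, sketched below, would be to pull the referenced media URLs out of the captured pointer files and feed them back in as seeds. This was not done during the test; the loose attribute matching and the local copies of the pointer files are assumptions.

  # Sketch: extract media URLs referenced by captured .smil/.asx pointer files
  # so they can be added to a later seed list. Assumes the pointer files have
  # been exported to local disk; attribute matching is deliberately loose.
  import re

  URL_ATTR = re.compile(r'(?:src|href)\s*=\s*"([^"]+)"', re.IGNORECASE)

  def media_references(pointer_files):
      refs = set()
      for path in pointer_files:
          with open(path, encoding="utf-8", errors="ignore") as f:
              for url in URL_ATTR.findall(f.read()):
                  if url.startswith(("http://", "rtsp://", "mms://")):
                      refs.add(url)
      return sorted(refs)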

Curator Feedback to CDL (Glenn Strength Soc Sec)

Crawl Success: mostly effective

Crawl Success Comments: My main concern about this site was the multimedia documents included - I've posted those thoughts in the "Questions for Curator" text box.

Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: For this site it was essential to capture the linked hosts (via), because many of the press materials etc. were on external sites.

Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: How successfully did this crawl capture the multimedia documents you were interested in?
A: I'm disappointed that not all of the multimedia files were captured, but there seem to be only a few that aren't included.


I found it interesting that broadcasts from the same host (whitehouse.gov) weren't completely captured - some were, some weren't.

Crawl Frequency: once

Questions/Comments about crawl: [note: the curator alludes to adding a comment here, but no comment was received]


Valerie Glenn and Arelene Weibel: Defense Base Closure and Realignment Commission

CDL Report to Curator

URL: http://www.brac.gov

Curator's original comments: "In our previous efforts we have been unable to capture agency databases. The BRAC site includes a document library which has a search feature (http://www.brac.gov/Search.aspx) and a browse feature (http://www.brac.gov/Browse.aspx). We would really like to see how this information can be captured so that we can recreate it on our own servers."

Site copyright statement: "The contents of all material available on this Internet site are in the public domain and are not copyrighted. The content of this site may be freely reproduced, downloaded, disseminated, published, or transferred in any form and by any means. However, in some cases the copyright for certain text or images on this site may be held by other parties."

Crawl Results

Comments from crawl operator: A. Linked hosts included: http://www.brac.gov/Search.aspx can't be captured by Heritrix. http://www.brac.gov/Browse.aspx seems to only capture the first 25 documents. Tried again with the browse page as the starting point, but stopped after 1005 documents; extracted 20 links from the browse page, and then there were no more URLs in the frontier queue which had been extracted from Browse.aspx.


Perhaps more experimentation is needed. B. Restricted to original host: again, only the first 25 pages from browse -- can't even successfully pass a seed URL listing the max docs per browse page (50).

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 44 additional hosts were crawled. Note that because your crawl did not complete, this is not an accurate count of how many other sites your original site may link to.

[urls] [bytes] [host]
2034 1064389540 www.brac.gov
555 5874934 www.slu.edu
87 173510 www.cpcc.edu
54 154588 www.wmata.com
47 685158 www.sluhospital.com
44 3501 dns
44 582555 www.c-span.org
43 174467 www.adobe.com
38 178153 www.q-and-a.org
32 127325 slu.bkstore.com
24 140653 www.c-spanclassroom.org
23 326680 www.capitalnews.org
22 213116 cancercenter.slu.edu
21 196012 www.defenselink.mil
[list truncated…]
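One workaround that was not attempted here would be to generate the paginated browse URLs explicitly and submit them all as seeds, so the crawler does not have to discover them by following links. The sketch below assumes hypothetical "page" and "pagesize" query parameters; the real paging mechanism of Browse.aspx would have to be confirmed first.

  # Sketch: generate paginated browse URLs to use as explicit seeds.
  # The "page"/"pagesize" parameter names and the page count are hypothetical.
  def browse_seeds(base="http://www.brac.gov/Browse.aspx", pages=40, page_size=25):
      return [f"{base}?page={n}&pagesize={page_size}" for n in range(1, pages + 1)]

  if __name__ == "__main__":
      print("\n".join(browse_seeds()))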

Curator Feedback to CDL (Glenn Defense Base Closure)

Crawl Success: somewhat effective

Crawl Success Comments: I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Frequency: once

Questions/Comments about crawl: One of the reasons I submitted this site to be crawled is that I'd already tried to harvest it using HTTrack. That product appeared to capture more of the documents than Heritrix - but we still haven't been able to capture the entire site.


Gabriela Gray: Join Arnold

CDL Report to Curator

URL: http://www.joinarnold.com

Curator's original comments: "California Recovery Team. Non-profit pro-Arnold group, not registered as a campaign committee. Critical Aspects: Complex file and directory naming structure. Looping, e.g. contactus.asp and contactadd.asp links."

Site copyright statement: "Copyright 2005"

Crawl Results

Comments from crawl operator: A. Linked hosts included: Great site for testing -- this loop is really interesting because a new URL is generated with each loop, so that the duplicate detection underway at IA would still not eliminate it. 44,332 of the retrieved URLs were contact pages. B. Restricted to original host: Got into a loop; by the end, 999 documents retrieved, 34 minutes. C. Restricted to original host + regular expression:


Excluding pages that matched the regular expression contactadd.asp?c= did not end the loop. What did end the loop: excluding both contactus and contactadd pages so they were not retrieved -- a drawback. (IA takes the manual approach of gathering the pages, then having an operator stop the crawl and take out the looping URLs by hand -- not scaleable.)

<newObject name="contact" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
  <string name="decision">REJECT</string>
  <string name="regexp">contact.asp?c=</string>
</newObject>

Related hosts crawled: Because of looping problems, we were not able to crawl other hosts linked from this site.
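One lesson from this experimentation is that candidate exclusion expressions are cheap to test offline against URLs taken from the crawl log before another job is launched. A minimal sketch follows; the sample URLs and patterns are invented for illustration.

  # Sketch: test candidate exclusion regexes against looping URLs pulled from
  # a previous crawl log. Sample URLs and patterns are invented for illustration.
  import re

  looping_urls = [
      "http://www.joinarnold.com/en/contactus.asp?c=12345",
      "http://www.joinarnold.com/en/contactadd.asp?c=67890",
  ]

  candidates = [
      r"contactadd\.asp\?c=",       # narrower pattern
      r"contact(us|add)\.asp",      # rejects both looping pages
  ]

  for pattern in candidates:
      rejected = [u for u in looping_urls if re.search(pattern, u)]
      print(pattern, "->", len(rejected), "of", len(looping_urls), "rejected")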

Curator Feedback to CDL (Gray Arnold)

Crawl Success: somewhat effective

Crawl Success Comments: We spot-checked, and it looks like most files were captured, but the individual pages don't display most of the images. (This may simply be a problem with the WERA interface.) Strangely enough, the Flash files work perfectly, which is exactly the opposite of our own capture experience.

Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: As per the crawl notes, we only checked the original host version, since the via crawl failed.

Crawl Frequency: once

Questions/Comments about crawl: Same comments as on the Villaraigosa crawl. We're inexperienced with this type of capture: archive files as-is and use a tacked-on script and a special server-side interface to interpret links in the new environment. Our model has been to actually alter the internal links from absolute to relative formats so that the site works in any environment.


Gabriela Gray: Mayor-Elect Villaraigosa

CDL Report to Curator

URL: http://www.antonio2005.com

Curator's original comments: "Critical Aspects: Flash animation, content scattered across multiple servers, maintaining complex internal link structure, JavaScript menus, streaming media."

Site copyright statement: "©2005 Villaraigosa for Mayor 2005"

Crawl Results

Comments from crawl operator

• (For linked hosts results) Need feedback on media etc. retrieved -- this site is an ideal example of the need for scope+one.

• (For restricted to original host) How much was left out due to the domain restriction? Need feedback.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 263 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

[urls] [bytes] [host]
817 10291631 ensim3.interlix.com


805 117538973 www.antonio2005.com
472 6333775 www.laopinion.com
265 21173 dns
110 19355921 www2.dailynews.com
100 16605730 www2.dailybulletin.com
95 1410145 www.americanpresidents.org
86 820148 www.dailynews.com
73 168698 www.chumpsoft.com
72 52321 images.ibsys.com
69 836295 www.laobserved.com
65 137700 www.mysql.com
55 213569 www.ensim.com
55 177141 www.lamayorcnc.com
55 296311 www.surveyusa.com
53 495858 abclocal.go.com
52 522324 www.c-span.org
51 244668 gallery.menalto.com
[list truncated…]

Curator Feedback to CDL (Gray Villaraigosa)

Crawl Success: mostly effective

Crawl Success Comments: Doing some spot checks, it looks like all of the pages were captured. Some problems with media files -- WERA shows them when we search, but the files are often size 0. In addition, many files on external servers are listed and even have some descriptive info, but when we click on Overview it says "Sorry, no documents with the given uri were found," so we have no idea if they were really captured.

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: Most of the links to other sites go to pages that truly are external to the site, not incorporated into it. The only exception would be the pages from ga3.org and ga4.org.

Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) The results on the multiple hosts crawl are mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains, rather than source + 1. Also, there seem to be a lot of links going to ensim3.interlix.com -- there were more files captured from there than from antonio2005. This is a case where it looks like the entire site (or much of it) was mirrored on two different servers. When we find these, we often try to collapse them into one seamless whole, eliminating the duplication, which is meaningless from the user's viewpoint. Simply capturing both mirrors and leaving the cross-links intact is an option we've used when we can't collapse, but it often leads to problems with links between the two, which seems to be the case here.


Crawl Frequency: once

Questions/Comments about crawl: We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE -- the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.


Ron Heckart and Nick Robinson: Public Policy Institute of California

CDL Report to Curator

URL: http://www.ppic.org

Curator's original comments: "We are particularly interested in their publications. We hope the crawler will be able to report when new publication files are posted on the website. Our main focus of interest is on their new publications at http://www.ppic.org/main/newpubs.asp."

Site copyright statement: "All Contents © Public Policy Institute of California 2003, 2004, 2005"

Crawl Results

Comments from crawl operator: We can't at the moment use Heritrix to report on new publications posted.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 238 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

[urls] [bytes] [host]
2421 324309107 www.ppic.org
433 1367362 www.cacities.org
238 19286 dns
229 4675065 www.icma.org
200 598505 bookstore.icma.org
151 1437436 www.greatvalley.org
144 517953 www.kff.org
137 5304390 www.rff.org
113 510174 www-hoover.stanford.edu
102 1642991 www.knowledgeplex.org
97 101335 cdn.mapquest.com


81 379020 www.cde.ca.gov
73 184118 www.ilsg.org
68 4539957 caag.state.ca.us
62 246921 www.milkeninstitute.org
[list truncated…]

Curator Feedback to CDL (Heckart PPIC)

Crawl Success: mostly effective

Crawl Success Comments: There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on Date retrieves the message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)

Crawl Frequency: weekly

Questions/Comments about crawl: Our crawl report included the following comment from the crawl operator: "We can't at the moment use Heritrix to report on new publications posted." The ability to report on new publications is critical to our goal of using the crawler as a discovery tool. What are the prospects for providing this functionality in the future?


Terry Huwe: AFL-CIO

CDL Report to Curator

URL: http://www.aflcio.org

Curator's original comments: "This site is content-rich and has many files that will be useful in the future. Specific areas that are of special interest follow below. http://www.aflcio.org/corporatewatch: the data related to executive pay watch is especially useful. http://www.aflcio.org/mediacenter: would like to see press stories captured if possible. http://www.aflcio.org/issues: links to newsletters and original content. Also, the "Legislative Action Center" on the home page; this is a useful topic guide to legislative history from a labor perspective."

Site copyright statement: "Copyright © 2005 AFL-CIO"

Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 2572 additional hosts were crawled. The following hosts supplied more than 75 URLs to your site:

[urls] [bytes] [host]
12702 481956063 www.aflcio.org
2657 184477 dns
1375 35611678 www.local237teamsters.com


570 8144650 www.illinois.gov
502 52847039 www.ilo.org
435 3851046 www.cioslorit.org
427 2782314 www.nola.com
401 8414837 www1.paperthin.com
392 15725244 www.statehealthfacts.kff.org
326 4600633 www.dol.gov
288 12303728 searchoxide.com
284 3401275 www.sikids.com
280 3069385 www.washingtonpost.com
272 1480539 www.cdc.gov
235 5455692 www.kff.org
[list truncated…]

Curator Feedback to CDL (Huwe AFL-CIO)

Crawl Success: effective

Crawl Success Comments: I realize the collection interface is a work in progress and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g. Working Families Toolkit, BushWatch) that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2-million-plus-hits-per-year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait, but I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses of the toolkit. Having said all that, my short answer is: I'm stoked. :-)

Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: In the case of this collection there's a lot of original content (and context-making documents) on the linked sites, simply due to the hierarchical/federal nature of the labor movement.

Crawl Frequency: monthly

Questions/Comments about crawl: I may have missed correspondence on this question, as I've been busy with a library renovation here. My question: Is the 1 gig limit a useful one? I couldn't help wondering what the actual figure would be if that limit were set higher. I'm generally interested in the total size of Web collections, as I need to monitor ours (which is content-rich but even so does not exceed 1 gig). So discussion of this at least would be interesting, and maybe, if it's not technically challenging, that 1 gig level might be raised and we could see what happens. Thanks to all involved--very interesting process.


Kris Kasianovitz: Los Angeles Dept. of City Planning

CDL Report to Curator

URL: http://cityplanning.lacity.org

Curator's original comments: "Website uses frames. Most of the documents will be pdfs. Of particular interest: the EIRs, which are not archived on the page once the project is approved; see http://cityplanning.lacity.org/EIR/TOC_EIR.htm. General and Community Plans: http://cityplanning.lacity.org/complan/gen_plan/genplan2.htm, http://cityplanning.lacity.org/ComPlan/cpbpage.htm"

Site copyright statement: No copyright information found

Crawl Results

Comments from crawl operator: (Linked hosts included crawl) ended because it ground on for 3 days without hitting the data limit; not sure if the URLs at the end of the log are valid/useful.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 119 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]


10493 840876945 cityplanning.lacity.org
601 5156252 metrolinktrains.com
183 644377 www.cr.nps.gov
121 11162 dns
90 977850 www.metrolinktrains.com
81 1207859 www.fta.dot.gov
79 263432 www.fypower.org
66 333540 www.adobe.com
64 344638 lacity.org
63 133340 ceres.ca.gov
60 274940 www.amtrak.com
59 389217 www.nhtsa.dot.gov
58 347752 www.unitedweride.gov
52 209082 www.dot.gov
52 288783 www.nationaltrust.org
51 278949 www.portoflosangeles.org
[list truncated…]

Curator Feedback to CDL (Kasianovitz LA City Planning)

Crawl Success: mostly effective

Crawl Success Comments: The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department site is loaded with EIRs, notices, etc. In most cases the documents are all pdfs. When searching specifically for EIRs I got a large result list (699 citations); however, when I investigated whether or not the actual file was captured, I found that the main EIR page was captured (typically a htm/html file), but when I clicked on a link to get to the full report, all I got was the "Sorry, no documents with the given URI were found" message. This could be that the file was no longer available when the site was harvested; however, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved. Here are a few specific examples:

Final EIR directory: http://cityplanning.lacity.org/EIR/Tocfeir.htm
Sierra Canyon Secondary School (cover page): http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/SierraCyn_coverpg.htm
Access to the Draft EIR and Final EIR is provided from this cover page. Within the system the links to both the Draft and Final are broken/no documents with that given URI: http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/DEIR/Table of Contents.htm
Villa Marina EIR: http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
Directory of EIR notices of preparation: http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM


This provides links to EIR preparation notices (all pdfs), a total of 27 links/documents. I encountered the following three issues:

• pdf opened = when clicking on the link to the notice, the pdf opened with no problem: 16 of 27.

• "Sorry, no document with the given uri was found" = no pdf harvested, but I could get to it from the live site: 4 of 27.

• Acrobat "could not open" message (could open the live page outside of WERA) = the following Acrobat message came up when I tried to open the pdf: "Acrobat could not open 'ENV-2005-0881-EIR[1].pdf' because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded). To create an Adobe Acrobat PDF document, go to the source application. Then print the document to Adobe PDF." I copied and pasted the url into a regular browser and could open the pdf with no problem. Also, in a few cases, if I clicked on the GO button after the first attempt to open the pdf in the system, it seemed to launch: 7 of 27.

Conversely, I found a number of pages that contained full documents in html, with links to pdfs that worked with no problem. See the following document: http://cityplanning.lacity.org/cwd/gnlpln/transelt/TET2Bkgrnd.htm. File types and error codes were what I expected.

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: The "restricted to original host" crawl is more helpful/relevant for getting to the materials from a specific agency. For some of the results that were restricted to host, I was getting external links. Here are some comparisons for each of the crawl settings.

Searched for "villa marina":
LA Dept. of City Planning: 6 results
http://cityplanning.lacity.org/EIR/NOPs/ENV-2004-3812-EIR.pdf
http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM
http://cityplanning.lacity.org/EIR/Tocfeir.htm
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/Cwd/GnlPln/HsgElt/HETblFigApVHgSit.htm
LA City Dept. of Planning (via): 2 results
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/Cwd/GnlPln/HsgElt/HETblFigApVHgSit.htm

Searched for "eir":
LA Dept. of City Planning: 699 results
LA City Dept. of Planning (via): 324 results


For both of these searches the URIs were from cityplanning.lacity.org.

Searched for "transportation":
LA Dept. of City Planning: 699 results
LA City Dept. of Planning (via): 290 results (most are from external sources and tended to be the index or main page of another agency or organization. Because this just got me to the main page, and none of the links functioned at that level, the via result was less helpful. However, the via results are useful for discovering other agencies or organizations that I should be looking at for materials.)

Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) The list of linked hosts provided at the end of the report is helpful. This will help me target other agencies that might be key places to check for further collection development.

Crawl Frequency: monthly

Questions/Comments about crawl: I want to qualify the frequency for this site. I'd like to do a monthly crawl for 3-4 months; I'd then want to reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot of new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation. After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report there are 8899 documents in this crawl alone; my other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls. I should note that at this point in time, for local documents, I am more interested in individual documents than in capturing an entire website and preserving the functionality. Local agencies (with the exception of perhaps the Mayor's site) tend not to change the design (look and feel) very often, or have a very sophisticated design (i.e. flash, changing images, etc.), and I don't see that this would be of interest to researchers' needs; it is the content, reports, maps, etc. that are contained/accessed on the websites that are important. Maybe I'm wrong or being short-sighted about that.


Kris Kasianovitz: Southern California Association of Governments

CDL Report to Curator

Curator's original comments: "This is a critical regional agency for Los Angeles, Orange, Ventura, Imperial, Riverside and San Bernardino counties. Its main areas are Transportation, Housing, and Economic Development. This will provide an analysis of the overall site, which has a lot of content. Publications/reports are typically in pdf; they are presented as full reports and pieces of the report (for easier downloading), so there might be duplication. The full report is really all that would be needed. See http://www.scag.ca.gov/publications. The Resources page contains the pdfs, images, dynamic content, and gis programs, including an interactive atlas: http://www.scag.ca.gov/resources.htm. One part of the Resource site is the Web Accessible Geographic Data Search (WAGS): http://mapsvr.scag.ca.gov/wags/index.cfm?fuseaction= It requires a user-created login and password (although there is a guest login that allows you to bypass this). I'm not sure what kind of difficulty the harvester will encounter with this portion of the site. The interactive atlas also has a create-an-account/guest login issue: http://mapsvr.scag.ca.gov/atlas/presmap.asp?Cmd=INIT Since it is a dynamic page, I don't know how this will be handled by the harvester."

Site copyright statement: "© 1999-2005 Southern California Association of Governments"

Crawl Results


Comments from crawl operator: Interesting login problem: Heritrix was unable to retrieve the guest login pages. Cold Fusion and asp pages don't generate new URLs and thus don't get crawled. NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than 1 minute.

Robots.txt file: The site you selected forbids crawlers from gathering certain data. It reads:

User-agent: *
Disallow: /_mm
Disallow: /_notes
Disallow: /_baks
Disallow: /MMWIP

User-agent: googlebot
Disallow: /csi
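The practical effect of rules like these can be checked with a standard robots.txt parser before a crawl is configured; a small sketch follows, assuming a local copy of the robots.txt file and using illustrative URLs.

  # Sketch: check which URLs a robots.txt would block for our crawler.
  # Assumes a local copy of the robots.txt; the URLs are illustrative.
  from urllib.robotparser import RobotFileParser

  rp = RobotFileParser()
  with open("robots.txt") as f:
      rp.parse(f.read().splitlines())

  for url in [
      "http://www.scag.ca.gov/publications/index.htm",
      "http://www.scag.ca.gov/_baks/somefile.htm",
  ]:
      print(url, "allowed" if rp.can_fetch("heritrix", url) else "disallowed")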

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 500 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
2517 863231651 www.scag.ca.gov
690 6134101 www.metrolinktrains.com
506 40063 dns
428 1084533 www.cacities.org
397 16161513 www.sce.com
196 581022 bookstore.icma.org
187 4505985 www.icma.org
175 7757737 www.ci.seal-beach.ca.us
158 1504151 www.h2ouse.org
149 940692 www.healthebay.org
137 317748 www.ci.pico-rivera.ca.us
130 18259431 www.ci.ventura.ca.us
123 490154 www.chinohills.org
121 406068 www.lakewoodcity.org
119 203542 www.lavote.net
117 2449995 www.ci.malibu.ca.us
114 744410 www.ci.irvine.ca.us
113 368023 www.whitehouse.gov
109 974674 www.dot.ca.gov
107 892192 www.lacanadaflintridge.com
[list truncated…]

Curator Feedback to CDL (Kasianovitz SCAG)

Crawl Success: mostly effective


Crawl Success Comments: Similar to my comments about the Los Angeles Dept. of City Planning crawl: the crawl brought back a lot of webpages, but not the publications/documents that I would want to collect. Again, the same problem happened with SCAG as happened with the EIR example: the crawl brought back html pages with links to reports (typically in pdf format), but the actual documents were not captured. While the webpage is helpful, as it gives context, the main content that I'd want to capture (the reports) was not captured. See the following for example: none of the webpages linked from this page are available; they should link to a page that will have the material. I tried searching for the documents separately and couldn't get to them. See http://www.scag.ca.gov/publications/index.htm (the timeline arrows at the top seemed to function; I'm not sure what this is for) and http://www.scag.ca.gov/livable/pubs.htm. I was impressed to find that zip files were captured, and I was able to download them. Unfortunately, when I opened them there wasn't any content (I did the same search by mistake with the Arizona Dept. of Water Resources and actually found content in the folders); I found 10 with the search "type:zip". The gif or jpg images retrieved are not useful -- most were just bars or bullets or covers of reports (although this might be helpful to identify titles, I think I would end up discarding these -- after doing more checking of the results).

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: See comments from LA Dept. of City Planning. Restricted gets me to the relevant materials for that agency; via brings back too many main webpages for other agencies to be useful.

Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) I expected that there would be a problem with the content behind logins. The crawl confirmed that material behind login screens couldn't be captured. Can I get a copy of these crawl results? NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than one minute.

Crawl Frequency: monthly

Questions/Comments about crawl: How to handle the copyright issue? For the login information, I'm not sure what all was blocked by the robots file. Interesting that metrolinktrains.com is the #1 related host for both of my crawled sites.


Linda Kennedy: California Bay Delta Authority

CDL Report to Curator

URL: http://calwater.ca.gov

Curator's original comments: "We are interested in the environmental impact statements and other key documents, and the various news releases and other announcements and archives of CALFED."

Site copyright statement: "© 2001 CALFED Bay-Delta Program"

Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 519 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
1130 473192247 calwater.ca.gov
741 201538533 www.parks.ca.gov
521 40442 dns
373 51291934 solicitation.calwater.ca.gov
242 78913513 www.calwater.ca.gov
225 410972 cwea.org
209 87556344 www.science.calwater.ca.gov
173 109807146 science.calwater.ca.gov
172 1160607 www.adobe.com
129 517834 www.whitehouse.gov
[list truncated…]

Curator Feedback to CDL (Kennedy CALFED)


Crawl Success: mostly effective

Crawl Success Comments: I looked closely at the CALFED home page (http://calwater.ca.gov), Tribal home page (http://calwater.ca.gov/Tribal/Tribal_Home.shtml), Key Documents (http://calwater.ca.gov/CALFEDDocuments/CALFEDDocuments.shtml), and Archives page (http://calwater.ca.gov/Archives/Archives.shtml). The crawl did not complete in either the via or non-via search. Nearly all linked pages were retrieved in the non-via search. However, the retrievals from the via search were much less complete than the retrievals from the non-via search. For example, on the Key Documents page there were 3 missing links from the non-via search but 14 missing links from the via search. When Adobe documents were retrieved from either crawl, they came up correctly. Three asp links of tribal maps from the Tribal home page were retrieved by the non-via search but not the via search. A few of the images were missing from the displays, and this was also affected by the browser used. Usually the same image was missing from both crawls, but sometimes the images were more complete in the non-via crawl retrievals. There were some display problems with the right-hand menu boxes on the http://calwater.ca.gov/Tribal/Tribal_Home.shtml page, for example, that did not display correctly when viewed in Firefox. The same page viewed correctly in Internet Explorer, but when printed out the boxes printed incorrectly, just as viewed in the Firefox browser. Grant Opportunities (http://calwater.ca.gov/GrantOpportunities/GrantInformation.shtml): this link did not work in 2 via and 2 non-via instances (from the Tribal home page and from the Archives page) but did work on one non-via crawl page (the CALFED home page). It could also be searched and retrieved directly from the test crawl search page.

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: See comments above. The non-via search was substantially more complete.

Crawl Frequency: monthly

58

Janet Martorana Santa Barbara County Department of Planning and Development CDL Report to Curator

URL: http://www.countyofsb.org/plandev/default.htm
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 487 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
3119  1102414495  www.countyofsb.org
485   34416       dns
428   1083047     www.cacities.org
357   6126453     www.sbcphd.org
320   6203035     icma.org
250   438507      www.sbcourts.org
234   1110744     vortex.accuweather.com
200   593112      bookstore.icma.org
[list truncated…]

Curator Feedback to CDL (Martorana SBCD)

Crawl Success: mostly effective

59

Crawl Success Comments: Most documents I expected to find were captured, but a number were not. For example, off this page, http://www.countyofsb.org/plandev/comp/threeyear/2005-2008/default.html, I expected to get to the final work program, http://www.countyofsb.org/plandev/pdf/comp/programs/Three_Year_WP2005-2008_3YrWrkProgram.pdf, but got the "Sorry, no documents with the given uri were found" message. Other examples are within the Energy Division, a part of the Planning & Development Dept. Off this page, http://www.countyofsb.org/energy/information.asp, I could access all links except for two: http://www.countyofsb.org/energy/information/oil&GasFields.asp (Oil and Gas Fields) and http://www.countyofsb.org/energy/information/oil&GasProduction.asp (Oil and Gas Production). The crawler seemed to cut off the URL right before the ampersand; perhaps it has problems with ampersands.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly
Questions/Comments about crawl: I'd like to be notified when I've navigated away from the captured site. I found myself on realtime web sites but couldn't tell what was on the captured site and what was real. Example: when I did a search on "santa barbara" I got 1528 results using the via results. I went to the County of Santa Barbara Online site and then, after clicking on links, I was on the web realtime and not in the crawl results database any longer, yet there were no indications that I had left the crawled database. The WERA uri was still displaying at the top of the screen. I couldn't tell what were the captured sites and what were the current realtime sites. Other observations: the webpage navigation doesn't work, e.g., the Table of Contents doesn't jump to that section on the webpage (www.countyofsb.org/energy/projects/shell.asp and www.countyofsb.org/energy/mitigation/oakProject.asp), and links to glossary terms go to the glossary but not to the term itself.

60

Lucia Orlando Monterey Bay National Marine Sanctuary CDL Report to Curator

URL: http://montereybay.noaa.gov
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 795 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
5272  468755541  montereybay.noaa.gov
861   61141      dns
554   20831035   www.wunderground.com
368   4718168    montereybay.nos.noaa.gov
282   3682907    www.oceanfutures.org
273   10146417   www.mbnms-simon.org
260   7159780    www.mbayaq.org
163   61399      bc.us.yahoo.com
152   1273085    www.mbari.org
146   710203     www.monterey.com
119   3474881    www.rsis.com
119   279531     www.steinbeck.org
118   1092484    bonita.mbnms.nos.noaa.gov
109   924184     www.duke.edu
104   336986     www.montereybayaquarium.org

61

103   595953     icons.wunderground.com
102   339589     www.uncw.edu
[list truncated…]

Curator Feedback to CDL (Orlando - Monterey Bay)

Crawl Success: (rating not provided)
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: This site contains a large mix of government policy, laws, and regulatory information, as well as links to recreational and educational activities available in the MBNMS. I was most interested in links to laws/regs/policy and educational info and organizations. I thought the links restricted to the original host best captured this information succinctly.
Crawl Frequency: unknown

62

Richard Pearce-Moses Arizona Department of Water Resources CDL Report to Curator

URL: http://www.azwater.gov
Curator's original comments: (redirects to http://www.azwater.gov/dwr) "In arid Arizona, water is one of the most important - and most contested - resources. The publications and records of this Department are of critical value to the state. Our spider can get many files from this site (1474 files in 258 directories). We are mostly interested in documents by programs. Although our spider can't get the imaged documents database (http://www.azwater.gov/dwr/Content/ImagedRecords/default.htm), this directory may not be critical if we can get the imaged documents transferred to us. We are not interested in blank forms and applications."
Site copyright statement: "Copyright © 1998 - 2005 Arizona Department of Water Resources and ADWR Network. All Rights Reserved."
Crawl Results

Questions for curator: Did this capture the documents you needed?

63

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 195 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
2233  988447782  www.azwater.gov
286   2350888    www.water.az.gov
253   4587125    www.groundwater.org
226   3093331    www.azcentral.com
196   15626      dns
178   395216     www.macromedia.com
128   1679057    www.prescott.edu
123   947183     www.azleg.state.az.us
115   792968     www.usda.gov
[List truncated…]

Curator Feedback to CDL (Pearce-Moses AZWater)

Crawl Success: (not provided)
Crawl Success Comments: We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand it, wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can. That may be the difference. The crawl is listed as not completing, but it appears to be very close based on the total number of bytes downloaded.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Questions/Comments about crawl: It seemed odd that it took 2h 54m to crawl 7709 documents with linked hosts, but 4h 4m to crawl only 4888 docs when the spider was restricted to the original host.

64

Richard Pearce-Moses Citizens Clean Election Commission CDL Report to Curator

URL: http://www.ccec.state.az.us/ccecscr/home.asp
Curator's original comments: "This commission was established by initiative. Its work is of great historical significance, as it is changing the way the public elects officials. We have not been able to spider this site because links are buried in JavaScript. (We use wget as our spider.) We are primarily interested in acquiring their publications, election data, and things listed under 'popular links'."
Site copyright statement: "Copyright 2004 Arizona Citizens Clean Elections Commission. All Rights Reserved."
Crawl Results

Comments from crawl operator: JavaScript issue; interesting problem; need curator feedback about what we captured.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, the following additional sites were crawled. Total: 15

65

[urls] [bytes] [host]
929  95456563  www.ccec.state.az.us
76   6117977   www.azcleanelections.gov
55   513218    az.gov
49   499337    www.governor.state.az.us
44   174903    www.adobe.com
40   141202    www.azleg.state.az.us
31   18549     www.az.gov
28   202755    www.azsos.gov
23   462603    gita.state.az.us
19   213976    www.benefitoptions.az.gov
17   89612     www.azredistricting.org
14   1385      dns
3    1687      wwwimages.adobe.com
2    1850      www.capitolrideshare.com
2    26438     www.ftc.gov

Curator Feedback to CDL (Pearce-Moses CCEC)

Crawl Success: (not provided)
Crawl Success Comments: We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with wget.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

66

Juri Stratford City of Davis CDL Report to Curator

URL: http://www.city.davis.ca.us
Curator's original comments: "We are primarily interested in the GIS data produced for the City of Davis. As the GIS data represent the largest part of the City of Davis web site, it may not be much more difficult to archive the site at the top level. Mapping and Geographic Information Systems URL: http://www.city.davis.ca.us/gis/"
Site copyright statement: "This web site is Copyright © 2004 by the City of Davis. All Rights Reserved. The City retains the copyright on all text, graphic images and other content of this site. You may not copy, modify and/or re-use text, images or other web content from this web site, distribute the City's web content, mirror content from this web site on a non-City server, or make any other use of the content of this web site that would violate the City's copyright, without written permission from the City of Davis. To the extent allowed by law, commercial use of our web material is prohibited without written permission from the City of Davis. All art work shown on these web pages is protected by US Copyright laws. Limited reproduction for non-commercial purposes can be authorized by the City of Davis, provided that requests are approved prior to use. Contact the Community Development Department, Cultural Services program staff at (530) 757-5610 for more information. Some content included in this web site may be provided courtesy of third parties who may retain copyright control of the provided material. Any service marks and trademarks contained herein are the property of their respective owners."
Crawl Results

67

Comments from crawl operator: "GIS: potential issue; /img is disallowed by robots.txt, e.g. http://www.city.davis.ca.us/img/featured/map-static.jpg can't be retrieved; also some maps on a second server are disallowed. Need feedback about GIS material that was captured: what was captured that is useful? Much duplication -- pages captured repeatedly."
Robots.txt: The site you selected prohibits crawlers from collecting certain documents. The file reads:

User-agent: *
Disallow: /img
Disallow: /calendar
Disallow: /misc/email.cfm
Disallow: /edbusiness
Disallow: /gis/oldmap
Disallow: /police/log
Disallow: /pcs/grants/sacog
Disallow: /jobs/listings
Disallow: /css
Disallow: /pcs/nutcracker/history.cfm
Disallow: /pcs/nutcracker/pdfs

User-agent: asterias
Disallow: /

User-agent: gigabot
Disallow: /
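As a minimal illustration of how a crawler decides what these directives allow, the sketch below uses Python's standard-library robots.txt parser. The parser fetches the live file, so results depend on the current site; the two example paths are taken from the operator and curator comments in this report.

from urllib import robotparser

# Fetch and parse the City of Davis robots.txt, then test URLs against it.
rp = robotparser.RobotFileParser()
rp.set_url("http://www.city.davis.ca.us/robots.txt")
rp.read()

for url in [
    "http://www.city.davis.ca.us/gis/library/",                  # not disallowed
    "http://www.city.davis.ca.us/img/featured/map-static.jpg",   # blocked by the /img rule
]:
    allowed = rp.can_fetch("heritrix", url)   # rules for "*" apply to any crawler name
    print(url, "->", "fetch" if allowed else "skip (robots.txt)")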

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 420 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

68

[urls] [bytes] [host]
16455  947871325  www.city.davis.ca.us
420    29555      dns
332    10377948   www.asucd.ucdavis.edu
305    33270715   selectree.calpoly.edu
279    3815103    www.w3.org
161    2027740    www.cr.nps.gov
139    941939     www.comcast.com
133    951815     www.yolocounty.org
[List truncated…]

Curator Feedback to CDL (Stratford Davis)

Crawl Success: mostly effective
Crawl Success Comments: Looking at the GIS Online Maps page, it's not clear which formats were retrieved and which were not. For example, the Growth Map Flash file downloads fine, but the Flash/ArcIMS files do not download.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In general, restricted to original host works better. The broader search includes pages from businesses, e.g. Comcast, and other local agencies, e.g. other local and state government sites. But restricting the outside sites to the first level seems to be a good compromise.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question): In general it looked like it did a good job pulling back geographic data/images. For example, you can pull down data from http://www.city.davis.ca.us/gis/library. It's difficult for me to get a sense of the level of duplication from the way the search results display.
Crawl Frequency: monthly
Questions/Comments about crawl: I like the ability to navigate within the archive once you have a starting point. This is very nice.

69

Yvonne Wilson Orange County Sanitation District CDL Report to Curator

URL: http://www.ocsd.com
Curator's original comments: "This is an Orange County agency that has small reports and complex planning documents on its web site."
Site copyright statement: "Copyright © 2001-2005 Orange County Sanitation District. Unless a copyright is indicated, the information on this site is freely available for non-commercial, non-profit making use. If a copyright is indicated on any materials displayed on our website, permission to copy these materials must be obtained from the original source. Commercial use of District materials is expressly prohibited without the written permission of the OCSD."
Crawl Results

Comments from crawl operator: In both crawl settings we had to limit the maximum number of retry attempts in order to complete the crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 85 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
755  85943567  www.ocsd.com
164  7635257   www.ci.seal-beach.ca.us
122  809190    www.ci.irvine.ca.us
95   169207    epa.gov
86   7673      dns

70

85   559125    order.e-arc.com
66   840581    www.ci.huntington-beach.ca.us
62   213476    www.cityoforange.org
57   313579    www.epa.gov
55   4477820   www.villapark.org
50   1843748   www.cityoflapalma.org
50   463285    www.ocbinc.com
[List truncated…]

Curator Feedback to CDL (Wilson OCSD)

Crawl Success: somewhat effective
Crawl Success Comments: Using WERA, I searched inquiries by type and title in the two OCSD collections, plain and via. I received no hits for pdf, only the homepage for html, and three hits for text. There are many pdf sections in the EIRs. I next searched by title in the two collections. I was most successful in via. By searching the titles "carbon canyon" and "Ellis Ave Pumping Station" I found articles but not the EIRs, which are available full text. At this point I presumed the crawl did not drill down far enough. Then I searched for a secondary page entitled "Ocean Monitoring"; this time the search found only an internal letter and memo, but not all the documents related to this topic. Via collection search seems to be the most productive, but it is not consistent.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: I searched some of the outside links to US Marine Fisheries and EPA Beach Watch and received no hits.
Crawl Frequency: monthly

71

Crawl Report Key: Web-at-Risk Test Crawls
This document is a guide to the test crawl report, providing further information about some of the statistics and results conveyed there. Your report includes basic information about the site you submitted, your original comments about that site, and any copyright statements CDL found when we examined the site. Your site was crawled using the Heritrix crawler, version 1.5.1. Your report will include the following information about the site (Item: Explanation).
Crawl Settings: We crawled each site in two different ways: A) linked hosts included; B) restricted to original host. Where linked hosts were included, we set the crawler to gather any outside page that your nominated site linked to, but no further. So if your site linked to a single document from whitehouse.gov, we captured that document but did not crawl any further on the whitehouse.gov site. Your report will show results from both styles of crawling to give you a sense of whether or not the site draws heavily from valuable materials on another site. You will also receive a list of the other hosts that site linked to and how many documents were gathered from those hosts.
Robots.txt file: The presence of a robots.txt file means that the content provider is asking us to refrain from crawling either all or part of the site. This refers only to the host you named in your crawl request. In some cases the site had a robots.txt file but it didn't say anything; we noted when this occurred. We obeyed robots.txt instructions for these crawls, so if the site contained one but we still got a result, that means the robots.txt file only prevented us from crawling certain areas of the site. When robots.txt files were present, we have included the text of that file in your report so you can see which segments of the site the site owner wants to protect.
Crawl duration
Total number of documents: The "Documents" count will include page components (such as images or flash files).
File types (mime types): This area will contain a URL. When you go to that URL, you will see a list of the different file types that were retrieved as part of the crawl. IMPORTANT: The Heritrix crawler is currently experiencing difficulty with this report, and it is missing a crucial column. Until that is fixed, you can see the different file

72

types retrieved, from most common to least, but you cannot yet tell how many files each one included. CDL will contact you when this report has been fixed.
Response code reports: The URL in this column will lead to a list of response codes in order by frequency. This will include "200" for files that were successfully captured and error codes for files that were not captured. The error code list includes some codes specific to Heritrix. The key to interpreting these codes is at http://crawler.archive.org/articles/user_manual.html#statuscodes. Note that this report only gives you quantitative information about response codes; it does not link response codes to specific files. For these details, see "hosts report and crawl log" below.
How much data collected (bytes): The file size of the total crawl is reported in bytes. You can use the byte conversion tool at http://www.techtutorials.net/reference/byteconverters.html if you want to recalculate the size of the crawl in another measurement, such as kilobytes or megabytes.
Did crawl complete: This will say "no" if the crawl results exceeded 1 gigabyte or if the crawler encountered an obstacle to capturing the site that could not be fixed.
Location of hosts report and crawl log: You are welcome to review the page-by-page details of the crawl log. This is a generic report that comes with the Heritrix crawler and is not terribly user friendly, but it provides the most detail about the crawl process. This report will list every file that the crawler attempted to get and provide some information about each file. The Heritrix manual can help you interpret this report: http://crawler.archive.org/articles/user_manual.html#logs (go to section 8.2.1, Crawl Log).
Comments from Crawl Operator: These are observations that the Web Archive Programmer made about the crawl process for your site.
Questions for Curator: This section does not appear in every report. If you have a question listed here, please respond to it in the Test Crawl Feedback form.
Your Collection: Important: the screens you will see do not represent the final user interface for the Web Archiving Service tools. We are using WERA, an open-source search and display tool, only to show you your test crawl results.
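For curators comfortable with a little scripting, the per-file details mentioned above can also be summarized directly from the crawl log. The sketch below is a minimal example, assuming the standard space-delimited Heritrix 1.x crawl.log layout described in the user manual (timestamp, status code, size, URI, discovery path, referrer, MIME type, ...); the file name is a placeholder.

from collections import Counter

# Tally status codes, MIME types, and total bytes from a Heritrix crawl.log.
status_counts, mime_counts, total_bytes = Counter(), Counter(), 0
with open("crawl.log") as log:              # placeholder path
    for line in log:
        fields = line.split()
        if len(fields) < 7:
            continue                        # skip malformed or truncated lines
        status, size, mime = fields[1], fields[2], fields[6]
        status_counts[status] += 1
        mime_counts[mime] += 1
        if size.isdigit():                  # failed fetches report "-" for size
            total_bytes += int(size)

print("Status codes:", status_counts.most_common())
print("MIME types:", mime_counts.most_common(10))
print(f"Total collected: {total_bytes / 2**30:.2f} GB")

This approximates the response code report, the missing file-type counts, and the byte total described above.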

73

Because we did not seek the right to redistribute these documents these pages are available only for the purpose of analyzing crawler effectiveness You must have a password to view these pages Your report will include the address of a wiki page and a login and password each site was crawled twice plain crawl = only pages from the original site were collected via = pages from the original site as well as pages that site links to were collected Unfortunately you cannot simply browse your site you must select a collection and type a search You will be able to navigate throughout your site once you load a page containing links You will be able to review your colleaguesrsquo sites as well Note that the WERA display tool is not perfect If the same document was gathered from more one crawl it may not display in every collection Related Hosts Crawled This section provides further information about the additional materials that were gathered when we set the crawler to inlcude documents that your site links to This can be critical in deciding what settings are needed to capture your site Some sites for instance will keep all of their pdf or image files on a separate server If you donrsquot allow the crawler to move away from the original URI you wonrsquot capture a critical portion of the sitersquos content In other cases however this setting will lead to irrelevant information This report includes the most commonly linked hosts from your site

74


In some cases curators worked for different departments within the same institution. We asked each curator to send us a first-choice and second-choice URI, with a note describing what they hoped to capture about each site. Many of the curators involved in the test crawl activities either had previous web archiving experience or were quite familiar with the issues involved. A number of them warned us of the challenges our crawler might encounter with the particular sites they selected. The curators' response patterns indicated an important assumption about the service they expect. Three curators responded that they were not going to send any URIs because they were collaborating with another curator who would take care of selecting the test sites. This implies that our curators envision a service that will allow them to work collaboratively on building a web archive collection. In all three cases the respondents and their collaborators were from the same institution. In a fourth case, a respondent said that he suspected he would send in the same sites as another curator because he knew they were working on similar issues. The two individuals concerned were from entirely different institutions, but nonetheless see the potential to work collaboratively.

About Crawl Results
Fifteen curators sent their site selections, giving us a total of 30 sites to choose from for our test crawls. It is critical to note that this sample is far too small to reveal anything conclusive about the nature of web sites or crawlers. In some cases the results may be affected by the peculiarities of network traffic and server performance at the time the sites were crawled; in other cases they may be affected by our own learning process in using Heritrix and WERA. The value of these tests is to learn more about the specific interests and reactions of our group of curators, and to acquaint us all with the tools at hand. The following results should be interpreted with those caveats in mind.

Pre-Crawl Analysis: Rights Issues
Before crawling the test sites, we conducted an initial review of the sites in question to determine what issues they might pose, including rights management issues. In the "Web-at-Risk Rights Management Protocol," the CDL outlines the approach to rights management that will be followed for the Web-at-Risk project and that will inform the development of the Web Archiving Service. One aspect of this plan is that for each site selected for crawling, the curator will determine which of the three following categories to apply:

• Scheme A: Consent implied
• Scheme B: Consent sought
• Scheme C: Consent required

Only Scheme C requires the curator to get advance, explicit permission to crawl the site. It is hoped that the project's focus on government and political information will ensure that most materials fall within Scheme A or B. However, in spite of this subject focus, this small collection of sites presented some interesting rights issues.

4

We used a two-step review process to match our 30 suggested sites to the correct rights scheme. First, we looked at the nature of the content-owner's organization (federal government, nonprofit organization, etc.). This was done by reviewing the organization itself, as well as the domain name used for the organization's web site. This gave us a first-round guess at what the correct rights scheme would be for each site. Next, the sites were carefully reviewed for copyright notices, and our initial determinations were revised. The breakdown of agencies by type is as follows:

[Chart: Site Types. Federal: 3, State: 9, Local: 9, Non-Profit: 9]

Seven of the 15 sites were from agencies devoted to water management. The full list of sites submitted is available in Appendix A.
Site domains: Although 21 sites were published by government agencies, only 12 of them were in the .gov domain. By domain, the sites included:

[Chart: Domains (all sites): gov, us, org, com, edu]

The nine local government sites presented by far the most variety in domain names. There was a weak correlation between domain names and the nature of the agency at the local government level.

5

[Chart: Domains (local sites only): gov, us, org, com, edu]

Copyright statements: We next reviewed each site to determine whether copyright statements on the site could help determine what rights scheme might apply to the site. The copyright statements for the sites we crawled are available with each individual crawl report in Appendix C. Here too, local government sites offered little correlation between the nature of the content-owning organization and the rights statements displayed on the site. City web sites varied dramatically in their rights statements: some stated that their materials were in the public domain, others vigorously defended their copyright. This City of San Diego site did both. [4]

After both rights reviews, it was determined that, of the 30 sites submitted:

• 14 fell into Rights Scheme A and could be crawled without notification or permission

• 13 fell into Rights Scheme B and could be crawled, but would also require identifying and notifying the content owner

• 3 fell within Rights Scheme C and would require the explicit consent of the content owner prior to crawling

The process of reviewing the sites for rights statements changed our assessment of the correct rights scheme in a number of cases, and all three "Scheme C: Consent Required" designations were made on the basis of statements posted on the site. Note that we did not ultimately seek permission for these materials, and access to

[4] City of San Diego web site, Disclaimer page <http://www.sandiego.gov/directories/disclaimer.shtml>

6

the results of our crawls has been strictly limited to the curators and project staff for the purpose of crawl analysis. In short, our pre-crawl analysis of these 30 sites brought up complex rights issues and exemplified the challenges that lay ahead.

Test Crawl Settings and Process
Although we had originally planned to crawl only one site for each curator, some curators supplied two sites that posed interesting problems; in these cases we crawled both. Each site was crawled with two settings, resulting in 19 test sites and 38 total crawls conducted. We used Heritrix version 1.5.1 to conduct the test crawls. The crawls were conducted using four crawler instances on two servers. Each site was crawled separately; that is, each seed list contained one URI. We kept most default settings, except for the following.
Crawl size: Each crawl was set to stop at a maximum of 1 gigabyte (gig) of data. Of the 38 crawls conducted, 18 hit the 1 gig size limit. Note that this limitation was imposed for the purpose of these early tests and will not be applied to future services.
Crawl duration: When crawls took an inordinately long time to complete, we started over again with "max retries" set at three. This setting improved crawler performance when pausing or hanging was an issue.
Politeness [5]: Because we crawled each site individually, we set our crawler for very high politeness values. Politeness pertains to the impact of the crawl on the content owner's server, and is determined by combining a few different Heritrix settings that together determine how demanding the crawler is on the remote servers' resources. Our politeness settings were:

• delay-factor: 5
• max-delay-ms: 5000
• min-delay-ms: 500
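In Heritrix, these values combine to set the pause between successive requests to the same host: the time the previous fetch took is multiplied by the delay factor and then clamped to the minimum and maximum delays. A small sketch of that calculation, using the values above:

DELAY_FACTOR = 5       # delay-factor
MAX_DELAY_MS = 5000    # max-delay-ms
MIN_DELAY_MS = 500     # min-delay-ms

def politeness_delay_ms(last_fetch_ms: float) -> float:
    """Pause before recontacting the same host: the previous fetch time times
    the delay factor, kept within the configured minimum and maximum."""
    return min(MAX_DELAY_MS, max(MIN_DELAY_MS, last_fetch_ms * DELAY_FACTOR))

# A quick 50 ms fetch still earns a 500 ms pause; a slow 3-second fetch is capped at 5 seconds.
print(politeness_delay_ms(50), politeness_delay_ms(3000))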

Original host only vs. linked hosts included: Each site was crawled with two settings. The first setting restricted results to only the host from the original seed URI. The second setting allowed us to gather any pages to which the site linked directly, but no more. This second setting was constructed to gather pages considered relevant to the original site, and to gather sites in their entirety when an organization relied on more than one host name to provide its web presence.
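For concreteness, a simplified sketch of the two scope rules; the seed and the off-site document are illustrative examples drawn from sites mentioned in this report.

from urllib.parse import urlparse

SEED = "http://www.ppic.org/"
SEED_HOST = urlparse(SEED).netloc

def in_scope(url: str, found_on_host: str, linked_hosts_included: bool) -> bool:
    """found_on_host is the host of the page where the link was discovered.
    Original host only: keep nothing off the seed host.
    Linked hosts included: keep off-site pages only when linked directly from a
    page on the seed host, and do not crawl beyond them."""
    host = urlparse(url).netloc
    if host == SEED_HOST:
        return True
    return linked_hosts_included and found_on_host == SEED_HOST

print(in_scope("http://www.whitehouse.gov/report.pdf", SEED_HOST, True))             # captured: linked from the seed host
print(in_scope("http://www.whitehouse.gov/report.pdf", "www.whitehouse.gov", True))  # not followed any further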

Crawl Scope
A comparison of the two different crawl settings used (original host only vs. linked hosts included) turned up some counterintuitive results.

[5] For further information, see section 6.3.3.1, "Politeness," in the Heritrix User Manual <http://crawler.archive.org/articles/user_manual.html>

7

When compared quantitatively, 8 out of 19 crawls took longer to capture the site when limited to "original host only" than with the "linked hosts" setting. It is not clear why this is the case, since the "linked hosts" crawl should be much larger. Indeed, in all cases the linked hosts crawl retrieved more files than the original host crawl. The following two tables compare both the number of files retrieved and the duration of the two types of crawls.

Table 1: Number of files retrieved
          Original Host Only    Linked Hosts Included
Most      46197                 70114
Fewest    247                   1343
Median    2423                  9250
Average   6359                  17247

Table 2: Duration of the crawl
          Original Host Only    Linked Hosts Included
Longest   32 hr 21 min          37 hr 11 min
Shortest  18 min                19 min
Median    7 hr 33 min           11 hr 22 min
Average   1 hr 42 min           7 hr 9 min
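Taking the medians from the two tables, a quick calculation (used in the discussion below) shows how much extra time the broader crawls cost relative to what they returned:

# Median crawl duration: 7 hr 33 min (original host only) vs. 11 hr 22 min (linked hosts included)
orig_min, linked_min = 7 * 60 + 33, 11 * 60 + 22
extra_time = (linked_min / orig_min - 1) * 100      # ~50.5% more crawl time

# Median files retrieved: 2423 vs. 9250
extra_docs = (9250 / 2423 - 1) * 100                # ~281.8% more documents

print(f"{extra_time:.1f}% more time, {extra_docs:.1f}% more documents")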

Given that this is a very small sample of crawls, and that the gap between the largest and smallest crawls is fairly noteworthy, perhaps the only telling figure to consider here is the median. According to the median figures, with only 50.5% more time the crawler acquired over 281% more documents. When compared qualitatively, the results also appeared somewhat counterintuitive. Of the 18 curators who responded, 12 stated that they preferred the "original host only" crawl (four were undecided). We would have expected this preference to vary a little more from site to site. Oddly, one of the two curators who preferred the larger crawl scope had a crawl that captured materials from over 2500 other hosts. In some cases a site's links to exterior hosts are critical; the site's value may hinge upon how well it gathers documents from other sources. Both of the curators who preferred the broader setting did so for this reason:

For this site it was essential to capture the link hosts (via) because many of the press materials, etc. were on external sites.

More critically, sites are also often composed of content from more than one server. This is particularly likely to be the case if a site is providing a large body of pdf or multimedia files. So a crawl restricted to the original host would be missing critical segments of the site's content. Our test crawls did in fact turn up sites that were composed of more than one host name. For example, in the case of UC Merced, separate host names are used for different areas of the site, such as faculty or

8

admissions. In the case of the Arizona Department of Water Resources, the distinction between host names appears to be accidental, perhaps the result of an attempt to transition to a simpler, more memorable URI. Most pages from this site come from www.azwater.gov, but hundreds of internal links, including critical style sheet files, are still hard-coded to point to www.water.az.gov. Finally, when the site is restricted to the original host, the end user is much more likely to encounter errors when viewing the archived results. When the end user selects a link that was not captured, WERA provides a "Sorry, this URI was not found" message. When the linked hosts are included, the end user browsing the site sees the site closer to its original context and with fewer error messages. Conversely, when end users encounter frequent error screens, they may develop both frustration and a sense of mistrust in the quality of the archive. It is worth noting that the curators are not likely to browse these results in the same way that an end user of their archives might. The curators know what these sites contain, choose them accordingly, and may be less inclined to click on links that would result in a "Sorry, this URI was not found" message. Ultimately, the value of a site's external links would seem likely to vary depending on the nature of the site. Sites with rich internal content and only "frivolous" external links would be best captured with the "original host only" setting. Before we ran these crawls, we asked curators to specify what they hoped a crawl would capture. Many referenced specific pages or directories they hoped to capture, and of those, three specified URIs that were not from the original host. When reviewing the results, two out of those three still preferred the original host crawl, even though that crawl did not capture the materials they specifically hoped to capture. Why was the feedback so consistent on this point? A look at the WERA interface used to display crawl results may provide an answer. WERA does not offer an immediate means of browsing a site; you have to search by keyword to find your way "into" the captured site. Once you have a page displayed you can browse within the site, but you must begin by searching for the right starting point. As this image shows, you select the site you want to search from a dropdown menu, then enter terms to search against.

Because all pages from the more comprehensive crawls are indexed, the search results include pages from all of those other hosts. This sets up a bit of cognitive dissonance: the user specified a search against a particular site, yet results from other hosts vastly outnumber pages from that site.

9

Underlying the issue of crawl scope is the deeper question of what an archivist hopes to capture when a site is crawled. Is it just a list of particular documents? Or is it a faithful recreation of the site as it existed on that day? It may be that a captured site has content of primary and secondary importance. The primary content is what should be retrieved when searching against the archive, while the secondary content should only be present to avoid error messages and establish the site's full original context. Another approach is suggested by this curator's response:

The results on the multiple hosts crawl are mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains rather than "source + 1".

This suggests the ability to link seed URIs as being related components of a single site.

Communication / Reports
When we reported the test results back to curators, we provided a synopsis of the crawl results, links to particular Heritrix reports, and to the WERA display interface. The Heritrix reports are all plain text, providing tables of MIME type or response code frequency. Attempting to integrate these reports and the display of the archived results is a challenge. One curator, for example, obtained documents from over 200 hosts in the "linked hosts included" crawl, but was only aware of having found 10 additional documents when reviewing these same search results in WERA. Although WERA is helpful for seeing results from an end user's perspective, it does not provide adequate tools for analysis. In some cases this is simply because WERA is a new and occasionally buggy tool. It is possible, for instance, to follow occasional links out of the archive and into "real-time" sites. In some cases it's also possible to browse to a page and display it, but when you search for that same page by its URI, WERA does not find anything. One curator notes:

We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE; the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.

It is also important to note that although WERA was used for the purpose of reporting test crawl results, it is not envisioned as the final display interface for the Web Archiving Service. Even so, the feedback the curators provide about WERA should inform the functionality of the WAS interface. Clearly, it is still quite a struggle for curators to determine exactly what a crawl retrieved. One curator reports:

After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to

10

the crawl report, there are 8899 documents in this crawl alone. My other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls.

Crawl Frequency
When asked how frequently they wanted to crawl sites, curators responded with a variety of preferences:

[Chart: Desired Crawl Frequency: Daily, Weekly, Monthly, Once, Unknown]

Again, it is worth considering precisely what curators hope to capture in a repeated crawl of a site. Some insight is provided by these curators' comments:

We hope the crawler will be able to report when new publication files are posted on the web site.

And

The ability to report on new publications is critical to our goal of using the crawler as a discovery tool.

As with the other NDIIPP grants, the purpose of Web Archiving Service tools will be for archiving and preservation, not for resource discovery. This indicates that we should further investigate what a "weekly" or "monthly" crawl really means to curators. If a site was not updated over the course of a year, would the curator want to continue running weekly crawls of the site? Would the curator want each crawl to appear on an archive timeline for that site, even if the content was no different for each date? One curator comments:

I want to qualify the frequency for this site. I'd like to do a monthly crawl for three-four months. I'd want [to] reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot [of] new content being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation.

11

Language and Web Site Models
One of the challenges of communicating and interpreting crawl results is that crawlers don't necessarily work the way people envision them to. Further, the appearance of a web site on a screen and its architecture on a server may be quite different: the directory structure of a site may have no relationship to the way its navigation is organized on a screen. One frequent point of confusion curators encountered while interpreting crawl results is the concept of how many "levels down" the crawler went. One curator requested that we "drill down several levels (at least 3)" in our capture. One challenge with this request is that "levels down" can be interpreted to mean different things. In some cases curators clearly mean the directory structure of the web site on the remote server. In other cases, however, they seem to mean the number of hosts away from the original site:

In general, restricted to original host works better. The broader search includes pages from businesses, e.g. Comcast, and other local agencies, e.g. other local and state government sites. But restricting the outside sites to the first level seems to be a good compromise.

If the curator is referring to a structure of subdirectories, the next challenge is that the Heritrix crawler does not work this way. Heritrix ignores the site's directory structure and instead follows links from the seed URI it is provided. The Heritrix manual specifically defines "depth" in this way:

This scope allows for limiting the depth of a crawl (how many links away Heritrix should crawl) but does not impose any limits on the hosts, domains, or URI paths crawled. [6]
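To make the distinction concrete, the sketch below contrasts directory depth (taken from the URL path) with the link-hop count Heritrix uses (the length of the discovery path recorded in the crawl log, one letter per hop: L = link, E = embed, R = redirect, and so on). The example URL is illustrative.

from urllib.parse import urlparse

def directory_depth(url: str) -> int:
    """How many levels below the server root the URL's path implies."""
    return max(len([part for part in urlparse(url).path.split("/") if part]) - 1, 0)

def link_hops(discovery_path: str) -> int:
    """How many hops Heritrix counts, regardless of directory structure."""
    return len(discovery_path)

url = "http://www.countyofsb.org/plandev/pdf/programs/workprogram.pdf"
print(directory_depth(url))   # 3: three directories below the root
print(link_hops("L"))         # 1: a single link hop if the seed page links to it directly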

Multimedia
Some curators chose sites because of the value of their multimedia resources. The crawler yielded mixed results in capturing these resources. For one site:

• A text search on the log file turns up numerous .ram files, only one .ppt file.

• .asx files are Windows streaming media redirector files, which generally lead to associated wmf files. No .asx references appear in the crawl log, nor do any wmf files.

• .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files (a quick way to check for this pattern is sketched below).
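A rough way to run that check against the crawl log: group the captured URIs by file extension and flag any controller format whose companion media format never appears. The extension pairs below mirror the observations above; the log path is a placeholder.

from collections import defaultdict

COMPANIONS = {".asx": ".wmf", ".smil": ".rm"}   # controller format -> expected media format

def by_extension(uris):
    groups = defaultdict(list)
    for uri in uris:
        dot = uri.rfind(".")
        if dot != -1:
            groups[uri[dot:].lower()].append(uri)
    return groups

with open("crawl.log") as log:                   # placeholder path; URI is the 4th field
    captured = by_extension(line.split()[3] for line in log if len(line.split()) > 3)

for controller, media in COMPANIONS.items():
    if controller in captured and media not in captured:
        print(f"{controller} files were captured but no {media} files were found")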

As noted, nearly half the sites crawled reached the 1 gig size limit and so did not complete. This makes it difficult to determine whether there were genuine problems

[6] Heritrix User Manual, Section 6.1.1, "Crawl Scope: Broad Scope" <http://crawler.archive.org/articles/user_manual.html>

12

with particular types of files, or if the crawler simply did not get to the missing files before the crawl was stopped.

Comparison with Other Crawlers
Three of our curators had previously used other crawlers, such as HTTrack and Wget, to capture the same sites we attempted with Heritrix. Because these curators provided strong details when describing their sites, we crawled all of the sites they sent us (six sites). In one case, the "Defense Base Closure and Realignment Commission," the curator had greater success capturing aspx files with HTTrack than we had with Heritrix. Other comparisons:

We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using Wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand, Wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can. That may be the difference.

And

We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with Wget.

Crawl Success
We asked curators to rate the overall success of the test crawls on the following scale:

• Not effective (none of the desired documents were captured)
• Somewhat effective (some of the desired documents were captured)
• Mostly effective (most of the desired documents were captured)
• Effective (all of the desired documents were captured)

[Chart: Crawl Success ratings: Effective, Mostly Effective, Somewhat Effective, Not Effective]

Here are some of the comments curators had concerning the overall success of these tests. The full text of all curator reports and their feedback to CDL is available in Appendix C.

13

Los Angeles Planning Department

The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all pdfs. When searching specifically for EIRs I got a large result list (699 citations); however [sic], when I investigated whether or not the actual file was captured, I found that the main EIR page was captured, typically a htm/html file, but when I clicked on a link to get to the full report all I got was the "Sorry, no Documents w/the given URI were found" message. This could be that the file was no longer available when the site was harvested. However, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved.

City of San Diego Planning Department
This comment is from a curator who is filling in for the person who originated the test crawl request, so she is also attempting to interpret what that other person was hoping to capture.

Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example, http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit.

Defense Base Closure and Realignment Commission

I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1GB, but I think that more documents could have been captured.

Public Policy Institute of California

14

There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on "Date" retrieves a message, "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."

AFL-CIO

I realize the collection interface is a "work in progress" and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g., "Working Families Toolkit," "BushWatch") that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2 million-plus hits per year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses.

Conclusions
The process of running these test crawls has been valuable. It has resulted in a number of lessons learned and further directions for inquiry. It is clear that the tools that help curators analyze the results of these crawls will have to do more than mimic the original browsing context or supply tables of data. We will be looking for any opportunity to improve WAS reporting capabilities as we move forward with the project. The differences encountered in language and visualization raise the importance of a strong, intuitive design for the curator tools and for clear help screens. Each of us may visualize web sites differently, and the crawler may behave differently than we expect. A certain degree of online help will be needed to design crawls effectively, and further guidance should be available to help people interpret crawl results when those results don't match what the person anticipated.

15

The help and documentation for the Web Archiving Service will also need to address the rights analysis issues raised above. In most cases this is work that is done prior to issuing crawls; it cannot be addressed by the design of the WAS interface alone. When the Web-at-Risk project reaches the point of conducting usability studies, we should be sure to include tests that further uncover how users understand crawl frequency settings. Additionally, the desire for an analysis tool that can convey when a site has changed significantly is not limited to this project. It is a common issue faced by the Internet Archive, members of the International Internet Preservation Consortium, and others. CDL is actively communicating with these organizations as we all work toward a solution for this problem. Similarly, CDL should ensure that future assessment and analysis work with our curators addresses the issues raised by the two crawl scope settings. This inquiry should also extend to participants who exemplify end users, to determine if people using web archives experience crawl scope differently than people who build them. Finally, a number of lessons were learned via the Katrina crawl described in Appendix B. The most outstanding finding is that event-based crawls, such as Katrina, and site-specific crawls, such as these, have quite different characteristics and require different functionality and analysis tools.

Next Steps
There is certainly more to learn from these crawl results and about Heritrix; in some cases it is still unclear why the crawler failed to retrieve certain documents. The curators' feedback concerning these results has been extremely valuable. They have provided insight as to what was captured and what is still missing, which would have been difficult to determine without their subject expertise in the sites chosen. As we continue working to improve crawler success and performance, we will turn our attention to the Virtual Remote Control site created by Cornell. This site was developed to present particular problems to crawlers and is well documented, enabling the user to gauge a crawler's results. We will be using that site to replicate particular problems raised in our test crawl set, and may also reattempt some of the sites crawled for these tests. When we release the first version of the Web Archiving Service to curators in July 2006, we will request that they include their original test site in the crawls they attempt and compare the results with these tests. The test results for these crawls will remain in place as a point of comparison while we continue developing the Web Archiving Service. Additionally, certain issues raised by this feedback, particularly the desired scope and frequency of crawls, merit attention in future usability analysis work.

16

Web-at-Risk Test Crawl Report Appendix A Sites Submitted

Curator | Site submitted | Crawled
Sherry DeDekker | http://ca.water.usgs.gov (California Water Science Center) | Yes
Sherry DeDekker | http://www.dwr.water.ca.gov (California Department of Water Resources) | -
Peter Filardo and Michael Nash | http://www.nycclc.org (New York City Central Labor Council) | Yes
Peter Filardo and Michael Nash | http://www.dsausa.org (Democratic Socialists of America) | -
Valerie Glenn and Arelene Weibel | http://www.strengtheningsocialsecurity.gov (Strengthening Social Security) | Yes
Valerie Glenn and Arelene Weibel | http://www.brac.gov (The Defense Base Closure and Realignment Commission) | Yes
Gabriela Gray | http://www.joinarnold.com (Join Arnold) | Yes
Gabriela Gray | http://www.antonio2005.com (Mayor-elect Antonio Villaraigosa) | Yes
Ron Heckart and Nick Robinson | http://www.ppic.org (Public Policy Institute of California) | Yes
Ron Heckart and Nick Robinson | http://www.cbp.org (California Budget Project) | -
Terrence Huwe | http://www.aflcio.org (AFL-CIO) | Yes
Terrence Huwe | http://www.seiu.org (Service Employees International Union) | -
James Jacobs | http://www.sandiego.gov/planning (City of San Diego Planning Department; analyzed by Megan Dreger) | Yes
James Jacobs | http://www.sandag.org (San Diego Association of Governments) | -
Kris Kasianovitz | http://cityplanning.lacity.org (Los Angeles Department of City Planning) | Yes
Kris Kasianovitz | http://www.scag.ca.gov (Southern California Association of Governments) | Yes
Linda Kennedy | http://calwater.ca.gov (California Bay-Delta Authority, CALFED) | Yes
Linda Kennedy | http://www.dfg.ca.gov (California Department of Fish and Game) | -
Ann Latta | http://www.ucmerced.edu (UC Merced; analyzed by Elizabeth Cowell) | Yes
Ann Latta | http://www.coastal.ca.gov/web (California Coastal Commission) | -
Janet Martorana | http://www.countyofsb.org/plandev/default.htm (Santa Barbara County Department of Planning and Development) | Yes
Janet Martorana | http://www.sbcag.org (Santa Barbara County Association of Governments) | -
Lucia Orlando | http://montereybay.noaa.gov (Monterey Bay National Marine Sanctuary) | Yes
Lucia Orlando | http://www.waterboards.ca.gov/centralcoast (Central Coast Regional Water Quality Control Board) | -
Richard Pearce-Moses | http://www.azwater.gov (Arizona Department of Water Resources) | Yes

17

Richard Pearce-Moses | http://www.ccec.state.az.us/ccecscr/home.asp (Citizens Clean Election Commission) | Yes
Juri Stratford | http://www.city.davis.ca.us (City of Davis, California) | Yes
Juri Stratford | http://www.sacog.org (Sacramento Area Council of Governments) | -
Yvonne Wilson | http://www.ocsd.com (The Orange County Sanitation District) | Yes

18

Web-at-Risk Test Crawl Report Appendix B The Katrina Crawl

The Crawl
During the early Fall of 2005, the California Digital Library was embarking on a series of test crawls as part of the Web-at-Risk project when hurricane Katrina struck. CDL had requested that curators submit their sample URLs by August 25th, and on Friday, August 26th, Louisiana Governor Blanco declared a state of emergency. It was over that weekend that CDL staff reached the conclusion that an event of this historic significance would shift our priorities. On Monday we suspended our initial test crawl plans and began preparing on a number of fronts to capture web-based materials related to Katrina. Specifically, this meant:

• identifying which web sites or sub-sites to collect, which involved:
  o notifying 30+ curators of our emergency crawl plans and requesting their assistance
  o setting up a web-based interface to gather their suggested seed URLs
  o sorting through the resulting seed list and feeding it to our primary and secondary (Stanford) crawl operators
• determining our collection parameters
• getting our crawlers installed and configured
• locating and setting up disk space to store the crawls
• initiating and monitoring the crawls

This crawl activity posed several new challenges. It was the first time CDL staff had used the Heritrix crawler, which was not yet installed on a production server when the hurricane struck. The last week of August saw CDL and UC Berkeley staff scrambling to find server space and to set up Heritrix instances, disk partitions, and job parameters for the crawl. This was done with remarkable speed, owing to the urgency of the situation and to everyone's desire to capture a record of the events. Given our limited experience, and in order to reduce the risk of losing the historically significant and fleeting materials related to Hurricane Katrina, CDL worked with Stanford University to concurrently run the same crawl using a different crawler. After both CDL and Stanford had crawled the same seed list for a month, the task was taken up by the San Diego Supercomputer Center, which has continued crawling these sites using Heritrix.

Gathering the Seeds

CDL sent an initial request to the Web-at-Risk curators to submit URLs related to Hurricane Katrina for crawling. We worked collaboratively with a large group of content specialists to identify the sites:

• The 22 curators of the Web-at-Risk project (University of California Libraries, University of North Texas, New York University, Stanford University, and the Arizona State Library)
• The Library of Congress
• Librarians at Louisiana State University and the University of Mississippi

Over the course of the crawl, the list of seed URLs grew to over 700, just over 500 of which were crawled by CDL. The image of our input form above provides some sense of the range of materials collected. Given the sudden nature of this event, there was no time to investigate rights issues or technical problems each site might have presented. We informed the curators that "Our immediate plan is to simply collect the material before it disappears. We will not make the material immediately available." We also had little time for quality control and were not able to guarantee that sites in our seed list would be comprehensively crawled.

Crawling Specifics

CDL's first crawl was run on September 1, 2005, using a seed list of 89 URLs. The final crawl run by CDL was on October 10, using a seed list of 589 URLs. This final seed list was then sent to the San Diego Supercomputer Center, which is continuing to run twice-weekly crawls.


The CDL crawls were done using Heritrix version 1.5.1. We began with a single instance of Heritrix, eventually dividing the seed list among 6 instances. We got through the entire seed list 29 times in 40 days. However, the content we were able to collect was limited by our crawler settings. Our crawler was configured to visit one host at a time, collecting content at a conservative rate, and never more than three hops away from the seed URL. To ensure that the crawler moved through the seed list in a timely way, it was limited to spending 15 minutes at any given host. So the material collected does not represent the entirety of what was available at each site.

Note that all of the seeds were crawled with the same configuration, no matter how different the structure of these sites might be. Thus the New Orleans version of Craig's List, NASA's information pages, and blog sites were all crawled in the same manner, despite being quite different in context, architecture, and other characteristics. Given our short preparation time, the goal was to find a crawler configuration that would match the broadest range of Katrina materials. In most cases, whether the seed URL was for a personal blog or for a government agency, the seed itself was not centrally about Katrina. The Katrina information was generally situated on the front page and top levels of each site, so CDL expected that this setting, even with the 15-minute time limit, would at least capture the content nearest to the front pages of the widest variety of sites we could capture.

In choosing crawler configuration settings we also faced conflicting goals. On one hand, we wanted to capture as much Katrina content as possible. On the other hand, we were very hesitant to start hitting sites that were providing much-needed information at a crucial time (emergency sites, relief sites), thus making those sites hard to reach due to our capture activity. Many of these sites were already being heavily used and perhaps were not running at full capacity; some were also geographically impacted directly by the hurricane. So we had to choose settings that balanced the need to collect with politeness across a wide range of sites.

When the San Diego Supercomputer Center took over the Katrina crawls in October, they revisited the Heritrix configuration settings. The SDSC crawl placed a limit not on the amount of time spent at a site but on the number of documents to be retrieved overall. Their crawls are being conducted "breadth first," in the sense of gathering pages consecutively across the seed list rather than gathering pages from one site at a time.

The Stanford University crawl of the same sites began on September 7th and continued for 30 consecutive days using the WebVac crawler. While we have statistics concerning the size of the Stanford Katrina collection, it is very difficult to compare the configuration settings and effectiveness of WebVac vs. Heritrix. Further, there is no easy way to display materials gathered with WebVac, so the Stanford Katrina content is stored but is not accessible to us at the moment.
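To make the scope and time-budget trade-offs described above concrete, the following is a minimal sketch in Python of the kind of per-seed logic those settings imposed: a three-hop limit from the seed and a 15-minute ceiling per host. It is an illustration only, not CDL's actual Heritrix configuration; the fetch and extract_links functions are placeholders.

    import time
    from collections import deque

    MAX_HOPS = 3                 # never follow links more than three hops from the seed
    HOST_TIME_BUDGET = 15 * 60   # 15 minutes per seed/host, in seconds

    def crawl_seed(seed_url, fetch, extract_links):
        # `fetch` and `extract_links` stand in for the real download and
        # link-extraction steps; they are placeholders, not Heritrix APIs.
        start = time.time()
        queue = deque([(seed_url, 0)])
        seen = {seed_url}
        pages = []
        while queue and time.time() - start < HOST_TIME_BUDGET:
            url, hops = queue.popleft()
            page = fetch(url)            # a politeness delay would be applied here
            pages.append((url, page))
            if hops >= MAX_HOPS:
                continue                 # depth limit reached; do not queue its links
            for link in extract_links(page):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, hops + 1))
        return pages

Under these assumed limits, a seed whose Katrina material sits near the front page is captured, while deep archives and slow hosts are truncated, which is consistent with the partial captures reported above.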

Katrina Crawl Results

In terms of creating a collection of the Katrina event on the web, we are not satisfied that CDL's crawling efforts were successful. As mentioned, we had to impose time limits on each site visited and were not able to conduct quality control on the URLs that were captured. The total size of the Katrina capture at CDL is 50 GB. This represents 29 completed crawls of the seed list, which grew to over 500 seeds over the 40 days we ran the crawl. It is clear from the byte size alone that we barely skimmed the surface, and indeed the Stanford crawl proves this point: when the Stanford crawl was based on about 370 seeds, it was pulling in about 35 GB, or 1.5 million pages, a day. In spite of this, our attempts to capture Katrina news and events did have some very useful and positive outcomes. In terms of test crawling, the experience was quite successful, opening our eyes both to certain obstacles and to new possibilities. Most importantly, it prompted us to consider the nature and demands of event-based crawling, which had not been part of our original test plan. Here are some of the aspects of event-based crawls that Katrina surfaced.

Site Selection and Classification

This type of event demands a deep, collaborative effort to identify material to collect. CDL had to mobilize a large number of curators to make their best guesses at identifying promising sites. We set up a "crawl seed nomination" web form, accessible to the curators from their home institutions, and invited the curators to use it to enter their suggested seed URLs. CDL staff did not have time to add many automatic error- or duplicate-checking features, so this cumbersome work was done by hand (see the sketch at the end of this section). On the other hand, it provided us a first-hand trial of what a more general curator user interface might require.

The selection and management of seed lists is critical for sudden event-based crawls. The curators contributing the URLs will not necessarily be well versed in the topic; in the case of Katrina, curators in California were not uniformly familiar with the Gulf Coast, the towns, the government agencies, etc. In addition, it is difficult to predict which aspects of the event will be of historic, enduring value. Because disk storage was not a pressing issue, it was better to err on the side of a wider net when selecting sites. It became clear that there might be a role for "smart crawling," which would spread a wider net than that provided by human-generated seed lists through the use of automated tools that discover relevant materials. We were specifically interested in using tools developed at Emory and Cornell that interface with the Rainbow text classification tool.7 We did not have the resources to investigate this for the Katrina crawl but plan to fold it into future crawling tests.

Finally, it is worth examining why the seed list grew continuously throughout the event. Part of this growth is due to the natural increase in the number of pages and sites devoted to the emergency, and to having the additional time to identify and add them. However, the nature of the event itself changed over time, starting as a hurricane, then becoming a flood, a massive relocation, and a political and social issue. So the range of relevant sites changed as the event itself took on broader implications. This suggests that site selection is an ongoing process, not strictly an activity undertaken at the beginning of an event.

7 Bergmark, Donna. "Heritrix processor for use with rainbow." <http://groups.yahoo.com/group/archive-crawler/message/1905>
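As a minimal illustration of the automatic checking the nomination form lacked, the sketch below (Python; the function names and sample entries are invented for illustration) normalizes nominated seed URLs and drops duplicates and obviously malformed entries before they reach a crawl operator.

    from urllib.parse import urlsplit, urlunsplit

    def normalize_seed(raw):
        """Return a cleaned-up seed URL, or None if the entry is unusable."""
        url = raw.strip()
        if not url:
            return None
        if "://" not in url:
            url = "http://" + url              # nominations often omit the scheme
        parts = urlsplit(url)
        host = parts.netloc.lower()
        if parts.scheme not in ("http", "https") or " " in host or "." not in host:
            return None                        # not something a crawler can fetch
        return urlunsplit((parts.scheme, host, parts.path or "/", parts.query, ""))

    def dedupe_seeds(raw_entries):
        """Preserve first-seen order while removing exact duplicates."""
        seen, cleaned = set(), []
        for raw in raw_entries:
            url = normalize_seed(raw)
            if url and url not in seen:
                seen.add(url)
                cleaned.append(url)
        return cleaned

    # Example: three nominations, two of which collapse to the same seed
    print(dedupe_seeds(["www.fema.gov", "http://WWW.FEMA.GOV/", "not a url"]))

Even this level of checking would have removed much of the hand work described above, while leaving judgment calls about relevance to the curators.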

Rights, Ownership, and Responsibilities

Our current rights protocol stipulates that when a web page is clearly in the for-profit domain, we must seek permission from the content owners to collect. This restriction applies most notably to news agency sites, which would likely be a critical part of the web-based fallout of any major event; this was certainly the case with Katrina. However, rights negotiation is a time-consuming process. In the case of a planned event such as an election, one is able to anticipate a certain number of sources and take rights management steps before the event occurs. But with emergencies such as Katrina, there is simply no way to secure rights in advance without missing time-sensitive material. Knowing this would be an issue, CDL notified our curators in advance that the Katrina materials we gathered would not be publicly accessible, but merely captured and preserved. The complex rights issues behind web archiving, as well as the collaborative environment described above, raise a number of questions:

• Who owns the content?
• Who is responsible for the project?
• Who is responsible for responding if a content owner objects?
• When is it permissible to ignore robots exclusion files?
• How much rights management can be done as part of post-processing for an event crawl?
• Can there be standing agreements with major news agencies to allow for this type of crawl in emergency situations? If so, what constitutes an emergency?

Technical Infrastructure

CDL did not have an adequate technical infrastructure in place at the time of the Katrina crawl. As mentioned above, staff at both CDL and UC Berkeley scrambled to find server space and to install and configure the crawlers. This sudden shift in workload left the system vulnerable and somewhat unstable. In order to start the crawling process as soon as possible, CDL began the project on a temporary server. Making the transition to more permanent storage in the midst of this crawling project was neither easy nor flawless, and a certain amount of data had to be recovered from backup sources.

Information Analysis and Display

The challenge of analyzing crawl results has been described in some detail in our Test Crawl Report. That report cites a curator who found the task of reviewing an 8,899-document crawl to be unmanageable. The tools currently available are not at all up to the task of analyzing a large and complex crawl. An event-based crawl is likely to result in massive amounts of data of widely varying quality. The selection of seeds is based on guesses that, given sufficient resources, should be reviewed, refined, and enhanced as the event progresses. Time-series data based on changing input parameters represents a kind of moving target that suggests the need to develop new analysis tools. Andreas Paepcke of Stanford University addressed this problem when considering how a social scientist might want to analyze the Katrina materials. Consider the researcher who is interested in finding out how quickly the notion of race entered the public discourse in the aftermath of Katrina. Paepcke suggests:

"For example, the tool could perform word frequency analysis across successive crawls of Katrina sites so that the user could identify the emergence of word usage such as 'looter,' 'ninth ward,' etc. The social scientist would then define within the tool clusters of words that in the scientist's judgment are direct or indirect occurrence indicators of the concept 'race' (i.e., 'ninth ward,' 'looters,' 'blacks,' 'poor'). The scientist could subsequently interact with the tool at the level of these well-defined concepts. Example command: count occurrences of 'race' within the first three paragraphs of all pages and tabulate the differences across 10 days. We could go further and apply well-known topic categorization algorithms on the data to suggest new concepts as they arise in the series of text streams. For example, the tool might identify an increasing frequency of the term 'mismanagement' beginning in week two after the storm and alert the sociologist to this evidence of a new discourse theme." 8

While we don't expect to be able to incorporate this level of analysis into our web archiving tools any time soon, these ideas illustrate a useful direction for archival analysis tools.
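As a rough sketch of the kind of concept tracking Paepcke describes, the following Python counts occurrences of a cluster of indicator terms in the extracted text of successive dated crawl snapshots. This is not an existing Web-at-Risk tool; the directory layout and concept list are invented for illustration.

    import os, re
    from collections import Counter

    # Hypothetical layout: one directory per crawl date, containing extracted text files
    CRAWL_DIRS = ["crawl-2005-09-01", "crawl-2005-09-11", "crawl-2005-09-21"]
    RACE_CLUSTER = ["ninth ward", "looter", "looters", "blacks", "poor"]

    def count_cluster(directory, terms):
        """Count total occurrences of each indicator term across all text files."""
        counts = Counter()
        patterns = {t: re.compile(re.escape(t), re.IGNORECASE) for t in terms}
        for name in os.listdir(directory):
            path = os.path.join(directory, name)
            if not os.path.isfile(path):
                continue
            with open(path, encoding="utf-8", errors="ignore") as f:
                text = f.read()
            for term, pat in patterns.items():
                counts[term] += len(pat.findall(text))
        return counts

    for d in CRAWL_DIRS:
        if os.path.isdir(d):
            print(d, dict(count_cluster(d, RACE_CLUSTER)))

Plotting such counts across crawl dates is one simple way to watch a discourse theme emerge, even before more sophisticated topic categorization is available.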

Conclusions

At the time Hurricane Katrina hit, the Web-at-Risk service requirements and test crawl plan had been written for a more orderly and considered approach to web crawling. Our attempt to capture the web-based aftermath of Katrina highlighted our need to revisit the service requirements and see how well they supported an event-based response. We need to develop a collaborative and robust mechanism to identify and collect seed URLs. When an event happens quickly, it is also important to have in place general crawling guidelines that will result in a well-rounded collection. The technical infrastructure for the Web Archiving Service that we are building for the Web-at-Risk project should be robust enough to handle occasional and sudden bursts of activity. Attempting to alter the infrastructure quickly in reaction to emergency events impairs our responsiveness and leaves the altered systems in a vulnerable state.

8 Paepcke, Andreas. Senior Research Scientist and Director of the Digital Library, Stanford University. Email correspondence with Patricia Cruse, October 26, 2005.


Finally, work needs to continue on both the rights management front and on developing improved web archiving analysis tools so that the material gathered can be used to its greatest potential.


Web-at-Risk Test Crawl Report: Appendix C Individual Crawl Reports

Included below are the crawl reports provided to individual curators, including their analysis and feedback about those results. Note that a key to interpreting the tables in these crawl reports is provided at the end of this appendix.

Elizabeth Cowell (submitted by Ann Latta): UC Merced
CDL Report to Curator

URL: http://www.ucmerced.edu

Curator's original comments: "UC Merced is the first research university to be built in the 21st century. The educational and land use issues are significant. Of particular interest is http://www.ucmercedplanning.net. This site addresses major issues of land use - the university is being built on agricultural land. Controversy existed re: issues of redevelopment of downtown Merced vs. appropriation of agricultural land - there are major environmental issues focused on endangered species - Educational issues involving faculty job descriptions, student body, etc. are significant because of the economic, ethnic, and cultural diversity of the region."

Site copyright statement: "© 2004 UC Regents"

Crawl Results


Comments from crawl operator: When we set the crawl to include pages from linked sites, the crawler got "trapped" at the Elsevier site. There is JavaScript on that linked page that causes the crawler to continue looking for additional pages on the Elsevier site, even when you're only trying to capture a single page. Once we set a limit for the maximum number of retry attempts, the crawl completed. This data is from the completed crawl.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 227 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
969 71552369 www.ucmerced.edu
238 2564803 www.ucop.edu
226 14851 dns
197 16583197 www.universityofcalifornia.edu
156 8487817 www.elsevier.com
151 1437436 www.greatvalley.org
112 2354582 faculty.ucmerced.edu
105 5659795 www.pacific.edu
90 111985 k12.ucop.edu
86 255733 www-cms.llnl.gov
85 1178031 admissions.ucmerced.edu
81 297947 uc-industry.berkeley.edu
71 108265 www.mssmfoundation.org
67 349300 www.nps.gov
66 308926 www.usafreedomcorps.gov
54 137085 slugstore.ucsc.edu
52 52202 www.cerrocoso.edu
51 977315 www.universityofcalifornia.com

Curator Feedback to CDL (Cowell Merced)

Crawl Success: mostly effective


Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: 10 more links came up in the "via" search; 8 of these were not useful for the research of land use issues. The two that were useful were a benefit of the "via" search.
Crawl Frequency: monthly


Sherry DeDekker: California Water Science Center
CDL Report to Curator

URL: http://ca.water.usgs.gov

Curator's original comments: "We are interested in the CA water reports and databases behind the links on this page. Also in looking at http://waterdata.usgs.gov/ca/nwis/nwis; this section is an interactive interface to multiple data sets. Is it possible to capture this type of site as well as the static reports?"

Site copyright statement: "Information presented on this website is considered public information (unless otherwise noted) and may be distributed or copied. Use of appropriate byline/photo/image credit is requested. We strongly recommend that USGS data be acquired directly from a USGS server and not through other sources that may change the data in some way."

Crawl Results

Comments from crawl operator: Databases, e.g. http://waterdata.usgs.gov/ca/nwis/nwis, are disallowed by http://waterdata.usgs.gov/robots.txt (see the sketch following the host list below). Some water data reports are NOT caught by the broader "linked hosts included" settings, e.g. http://pubs.usgs.gov/wdr/2004/wdr-ca-04-1 -- would also want to submit http://ca.water.usgs.gov/waterdata as a seed.

Related hosts crawled


When the crawl was set to include documents from other sites that the original site linked to, 662 additional sites were crawled. The following list includes the hosts that supplied more than 50 files. Note that the host "pubs.usgs.gov" supplied a higher number of files than the original host itself.

[urls] [bytes] [host]
1963 255912820 pubs.usgs.gov
1153 47066381 ca.water.usgs.gov
698 56570 dns
404 112354772 geopubs.wr.usgs.gov
385 9377715 water.usgs.gov
327 203939163 greenwood.cr.usgs.gov
318 17431487 www.elsevier.com
219 3254794 www.usgs.gov
189 2737159 www.lsu.edu
163 2292905 wrgis.wr.usgs.gov
158 31124201 www.epa.gov
149 921063 www.usda.gov
[list truncated...]
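The robots.txt exclusion noted by the crawl operator above can be checked ahead of time with the Python standard library. The sketch below is a small, hypothetical pre-flight check for a nominated seed, not part of the Heritrix configuration used for these tests.

    from urllib.robotparser import RobotFileParser

    # Check whether a polite crawler may fetch a seed before adding it to a crawl
    rp = RobotFileParser("http://waterdata.usgs.gov/robots.txt")
    rp.read()

    seed = "http://waterdata.usgs.gov/ca/nwis/nwis"
    if rp.can_fetch("*", seed):
        print("allowed:", seed)
    else:
        print("disallowed by robots.txt:", seed)

Running this sort of check when seeds are nominated would let curators know up front which requested databases a polite crawler will refuse to gather.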

Curator Feedback to CDL (DeDekker CWSC)

Crawl Success: somewhat effective
Crawl Success Comments: Site appears to access water data reports (http://ca.water.usgs.gov/archive/waterdata/index.html) but none are actually available through the links. I expected the site to not be able to access real-time data, but these are archived reports.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly


Megan Dreger (submitted by James Jacobs): City of San Diego Planning Department

CDL Report to Curator

URL: http://www.sandiego.gov/planning

Curator's original comments: "I would like to drill down several levels (at least 3) of this site. For example, following the link to City of Villages/general plan update leads to many more important planning documents."

Site copyright statement: This site contains the two following notices on the same page:

"Restrictions on Use of Materials: This site is operated and maintained by the City of San Diego through its Department of Information Technology and Communications (referred to as IT&C). Except as provided herein, no material or information from this site may be copied, reproduced, republished, uploaded, posted, transmitted, or distributed except as authorized in this notice, expressly authorized within this site, or approved in writing by IT&C.

Copyright Notice: Unless a copyright is indicated, information on the City of San Diego Web site is in the public domain and may be reproduced, published or otherwise used with the City of San Diego's permission. We request only that the City of San Diego be cited as the source of the information and that any photo credits, graphics or bylines be similarly credited to the photographer, author, or City of San Diego, as appropriate. If a copyright is indicated on a photo, graphic, or any other material, permission to copy these materials must be obtained from the original source."

Crawl Results


Comments from crawl operator: Need feedback about whether the desired content was retrieved. Question for curator: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 1084 additional hosts were crawled. (This figure does not represent a complete crawl, as the size limitation was reached.) The following hosts supplied more than 75 documents to your site:

[urls] [bytes] [host]
3728 556231640 www.sandiego.gov
1247 38685244 genesis.sannet.gov
1085 80905 dns
807 6676252 www.houstontexans.com
428 1079658 www.cacities.org
399 102298888 www.buccaneers.com
259 1797232 granicus.sandiego.gov
258 42666066 clerkdoc.sannet.gov
238 5413894 www.ccdc.com
225 2503591 www.ci.el-cajon.ca.us
223 1387347 www.ipl.org
217 2683826 www.sdcounty.ca.gov
203 11673212 restaurants.sandiego.com
195 2620365 www.sdcommute.com
192 1344523 www.bengals.com
189 2221192 www.kidsdomain.com
176 1333528 www.buffalobills.com
171 685965 www.chumpsoft.com
166 277238 www.proquest.com
[list truncated...]


Curator Feedback to CDL (Dreger San Diego)

Crawl Success: mostly effective
Crawl Success Comments: This crawl was not completed due to size, so that may explain some of my questions. It was pretty effective in terms of getting the Planning Dept. pages, but went out further than I expected. Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example, http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems when trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit. It's nice that PDFs and other formats are included.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The linked hosts included (via) crawl seemed to include more extraneous material.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
A: You asked for a better definition of what was meant by the request to drill down several levels (at least 3). Unfortunately, I wasn't the original curator and I'm not sure what he meant. The crawl you did was very useful.
Crawl Frequency: monthly
Questions/Comments about crawl: I think that the crawl frequency should be at least monthly for these pages.


Peter Filardo and Michael Nash: New York City Central Labor Council
CDL Report to Curator

URL: http://www.nycclc.org

Curator's original comments: (none)

Site copyright statement: "© 2004 New York City Central Labor Council. No portion of this website may be reproduced in any form without permission from the Central Labor Council. Contact our offices for more information at nycaflcio@aol.com."

Crawl Results

NOTE: Because your Crawl "A" had to be stopped, then resumed, each of your reports for that crawl is in two segments. To browse a list of all reports for that crawl, go to http://voro.cdlib.org:8081/ingest_misc/ndiipp/testcrawls_raw/filardo_labor_via

Comments from crawl operator: A. Linked hosts included: Crawl complete after recovery, with the addition of max retries. Seemed to hang at http://www.nycclc.org/calendar/event.asp?EventId=501 and http://www.nycclc.org/assets/HLCapplicationmembership.pdf. Ended the crawl when it seemed to hang, then recovered from the previous job; the recovery was successful. Note for the future that a recovered job is identifiable because the logs directory is called logs-R.


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, x additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
1913 74260017 www.nycclc.org
156 11755 dns
115 710552 www.aflcio.org
73 1477966 www.comptroller.nyc.gov
71 193264 www.empirepage.com
60 570115 www.redcross.org
58 269079 www.afl-cio.org
57 240845 www.campsussex.org
57 113676 www.mssm.edu
56 449473 www.labor-studies.org
53 184605 www.pbbc.org
52 134326 www.senate.gov
[list truncated...]

Curator Feedback to CDL (Filardo, NYCCLC)

None provided


Valerie Glenn and Arelene Weibel: Strengthening Social Security
CDL Report to Curator

URL: http://www.strengtheningsocialsecurity.gov

Curator's original comments: "Contains external links to audio & video that would be essential to completing this site (see press room); some are files, some are links to webcasts, some are on external .gov sites, and some are external .com sites."

Site copyright statement: Copyright info not found

Crawl Results

Comments from crawl operator: Interesting audio/video/webcast content -- need feedback about success in capturing.
Questions for Curator: How successfully did this crawl capture the multimedia documents you were interested in?
Comments from coordinator: In the media center area of this site (http://www.strengtheningsocialsecurity.gov/press/media_center.shtml) I've found .ram (both video and audio alone), .smil, and .asx files. The site also contains numerous .ppt and .pdf files.
• A text search on the log file turns up numerous .ram files, but only 1 .ppt file.
• .asx files are Windows streaming media redirector files, which generally lead to associated wmf files. No .asx references appear in the crawl log, nor do any wmf files.
• Similarly, .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files (a short sketch of extracting these pointers follows the host list below).
I assume that when displayed, some of the Real Media files from this site would function, but many of the other multimedia files would not.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 388 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
660 10668874 www.chelseapiers.com
562 7334035 www.whitehouse.gov
477 6366197 www.laopinion.com
391 29623 dns
356 3874719 www.wkrc.com
243 12294240 www.strengtheningsocialsecurity.gov
178 1935969 www.xavier.edu
148 237055 image.com.com
127 682069 online.wsj.com
117 898439 www.omaha.com
116 514995 www.npr.org
108 995733 www.nba.com
[list truncated...]
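One way to pick up the media files hiding behind pointer formats like .smil and .asx is to parse the pointer files that were captured and feed the media URLs they reference back into the seed list. The sketch below (Python; the example file name and base URL are hypothetical) pulls src attributes out of a captured SMIL file, which is ordinary XML; .asx files could be handled the same way.

    import xml.etree.ElementTree as ET
    from urllib.parse import urljoin

    def media_urls_from_smil(smil_path, base_url):
        """Return the absolute URLs of media referenced by a captured SMIL file."""
        tree = ET.parse(smil_path)
        urls = []
        for element in tree.iter():
            src = element.attrib.get("src")
            if src:
                urls.append(urljoin(base_url, src))
        return urls

    # Example (hypothetical captured file and location):
    # print(media_urls_from_smil("media_center/clip01.smil",
    #                            "http://www.strengtheningsocialsecurity.gov/press/"))

Seeding a follow-up crawl with the URLs returned this way would be one pragmatic response to the missing .rm files noted above.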

Curator Feedback to CDL (Glenn Strength Soc Sec)

Crawl Success: mostly effective
Crawl Success Comments: My main concern about this site was the multimedia documents included - I've posted those thoughts in the Questions for Curator text box.
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: For this site it was essential to capture the linked hosts (via) because many of the press materials, etc. were on external sites.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: How successfully did this crawl capture the multimedia documents you were interested in?
A: I'm disappointed that not all of the multimedia files were captured, but there seem to be only a few that aren't included. I found it interesting that broadcasts from the same host (whitehouse.gov) weren't completely captured - some were, some weren't.
Crawl Frequency: once
Questions/Comments about crawl: [note: the curator alludes to adding a comment here, but no comment was received]


Valerie Glenn and Arelene Weibel: Defense Base Closure and Realignment Commission
CDL Report to Curator

URL: http://www.brac.gov

Curator's original comments: "In our previous efforts we have been unable to capture agency databases. The BRAC site includes a document library which has a search feature (http://www.brac.gov/Search.aspx) and a browse feature (http://www.brac.gov/Browse.aspx). We would really like to see how this information can be captured so that we can recreate it on our own servers."

Site copyright statement: "The contents of all material available on this Internet site are in the public domain and are not copyrighted. The content of this site may be freely reproduced, downloaded, disseminated, published, or transferred in any form and by any means. However, in some cases the copyright for certain text or images on this site may be held by other parties."

Crawl Results

Comments from crawl operator: A. Linked hosts included: http://www.brac.gov/Search.aspx can't be captured by Heritrix. http://www.brac.gov/Browse.aspx seems to only capture the first 25 documents. Tried again with the browse page as the starting point, but the crawl stopped after 1005 documents; it extracted 20 links from the browse page and then there were no more URLs in the frontier queue that had been extracted from Browse.aspx -- perhaps this needs more experimentation. B. Restricted to original host: again, only the first 25 pages from browse -- we can't even successfully pass a seed URL listing the maximum documents per browse page (50).

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 44 additional hosts were crawled. Note that because your crawl did not complete, this is not an accurate count of how many other sites your original site may link to.

[urls] [bytes] [host]
2034 1064389540 www.brac.gov
555 5874934 www.slu.edu
87 173510 www.cpcc.edu
54 154588 www.wmata.com
47 685158 www.sluhospital.com
44 3501 dns
44 582555 www.c-span.org
43 174467 www.adobe.com
38 178153 www.q-and-a.org
32 127325 slubkstore.com
24 140653 www.c-spanclassroom.org
23 326680 www.capitalnews.org
22 213116 cancercenter.slu.edu
21 196012 www.defenselink.mil
[list truncated...]

Curator Feedback to CDL (Glenn Defense Base Closure)

Crawl Success: somewhat effective
Crawl Success Comments: I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: once
Questions/Comments about crawl: One of the reasons I submitted this site to be crawled is that I'd already tried to harvest it using HTTrack. That product appeared to capture more of the documents than Heritrix - but we still haven't been able to capture the entire site.


Gabriela Gray: Join Arnold
CDL Report to Curator

URL: http://www.joinarnold.com

Curator's original comments: "California Recovery Team. Non-profit pro-Arnold group, not registered as a campaign committee. Critical aspects: complex file and directory naming structure; looping, e.g. contactus.asp and contactadd.asp links."

Site copyright statement: "Copyright 2005"

Crawl Results

Comments from crawl operator: A. Linked hosts included: Great site for testing -- this loop is really interesting because a new URL is generated with each loop, so that the duplicate-detection underway at IA would still not eliminate it. 44,332 of the retrieved URLs were contact pages. B. Restricted to original host: Got into a loop by the end; 999 documents retrieved, 34 minutes. C. Restricted to original host + regular expression: Excluding pages that matched the regular expression contactadd.asp?c= did not end the loop. What did end the loop: excluding both contactus and contactadd pages so they were not retrieved -- a drawback. (IA takes the manual approach of gathering the pages, then having an operator stop the crawl and take out the looping URLs by hand -- not scaleable.)

<newObject name="contact" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
  <string name="decision">REJECT</string>
  <string name="regexp">.*contact.*\.asp\?c=.*</string>
</newObject>

Related hosts crawled: Because of looping problems, we were not able to crawl other hosts linked from this site.

Curator Feedback to CDL (Gray Arnold)

Crawl Success: somewhat effective
Crawl Success Comments: We spot-checked and it looks like most files were captured, but the individual pages don't display most of the images. (This may simply be a problem with the WERA interface.) Strangely enough, the Flash files work perfectly, which is exactly the opposite of our own capture experience.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: As per the crawl notes, we only checked the original host version, since the via crawl failed.
Crawl Frequency: once
Questions/Comments about crawl: Same comments as on the Villaraigosa crawl. We're inexperienced with this type of capture: archive files as-is and use a tacked-on script and a special server-side interface to interpret links in the new environment. Our model has been to actually alter the internal links from absolute to relative formats so that the site works in any environment.


Gabriela Gray: Mayor-Elect Villaraigosa
CDL Report to Curator

URL: http://www.antonio2005.com

Curator's original comments: "Critical aspects: Flash animation; content scattered across multiple servers; maintaining complex internal link structure; JavaScript menus; streaming media."

Site copyright statement: "© 2005 Villaraigosa for Mayor 2005"

Crawl Results

Comments from crawl operator:
• (For the linked hosts results) Need feedback on media, etc. retrieved -- this site is an ideal example of the need for scope+one.
• (For restricted to original host) How much was left out due to the domain restriction? Need feedback.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 263 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

[urls] [bytes] [host]
817 10291631 ensim3.interlix.com
805 117538973 www.antonio2005.com
472 6333775 www.laopinion.com
265 21173 dns
110 19355921 www2.dailynews.com
100 16605730 www2.dailybulletin.com
95 1410145 www.americanpresidents.org
86 820148 www.dailynews.com
73 168698 www.chumpsoft.com
72 52321 images.ibsys.com
69 836295 www.laobserved.com
65 137700 www.mysql.com
55 213569 www.ensim.com
55 177141 www.lamayorcn.com
55 296311 www.surveyusa.com
53 495858 abclocal.go.com
52 522324 www.c-span.org
51 244668 gallery.menalto.com
[list truncated...]

Curator Feedback to CDL (Gray Villaraigosa)

Crawl Success: mostly effective
Crawl Success Comments: Doing some spot checks, it looks like all of the pages were captured. Some problems with media files -- WERA shows them when we search, but the files are often size 0. In addition, many files on external servers are listed and even have some descriptive info, but when we click on Overview it says "Sorry, no documents with the given uri were found," so there is no way to know if they were really captured.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: Most of the links to other sites go to pages that truly are external to the site, not incorporated into it. The only exception would be the pages from ga3.org and ga4.org.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): The results on the multiple hosts crawl are mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains, rather than source + 1. Also, there seem to be a lot of links going to ensim3.interlix.com -- there were more files captured from there than from antonio2005. This is a case where it looks like the entire site (or much of it) was mirrored on two different servers. When we find these we often try to collapse them into one seamless whole, eliminating the duplication, which is meaningless from the user's viewpoint. Simply capturing both mirrors and leaving the cross-links intact is an option we've used when we can't collapse, but it often leads to problems with links between the two, which seems to be the case here.


Crawl Frequency: once
Questions/Comments about crawl: We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE -- the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.


Ron Heckart and Nick Robinson: Public Policy Institute of California
CDL Report to Curator

URL: http://www.ppic.org

Curator's original comments: "We are particularly interested in their publications. We hope the crawler will be able to report when new publication files are posted on the website. Our main focus of interest is on their new publications, at http://www.ppic.org/main/newpubs.asp."

Site copyright statement: "All Contents © Public Policy Institute of California 2003, 2004, 2005"

Crawl Results

Comments from crawl operator: We can't at the moment use Heritrix to report on new publications posted.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 238 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

[urls] [bytes] [host]
2421 324309107 www.ppic.org
433 1367362 www.cacities.org
238 19286 dns
229 4675065 www.icma.org
200 598505 bookstore.icma.org
151 1437436 www.greatvalley.org
144 517953 www.kff.org
137 5304390 www.rff.org
113 510174 www-hoover.stanford.edu
102 1642991 www.knowledgeplex.org
97 101335 cdn.mapquest.com
81 379020 www.cde.ca.gov
73 184118 www.ilsg.org
68 4539957 caag.state.ca.us
62 246921 www.milkeninstitute.org
[list truncated...]

Curator Feedback to CDL (Heckart PPIC)

Crawl Success: mostly effective
Crawl Success Comments: There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on "Date" retrieves the message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Crawl Frequency: weekly
Questions/Comments about crawl: Our crawl report included the following comment from the crawl operator: "We can't at the moment use Heritrix to report on new publications posted." The ability to report on new publications is critical to our goal of using the crawler as a discovery tool. What are the prospects for providing this functionality in the future?


Terry Huwe: AFL-CIO
CDL Report to Curator

URL: http://www.aflcio.org

Curator's original comments: "This site is content-rich and has many files that will be useful in the future. Specific areas that are of special interest follow below. http://www.aflcio.org/corporatewatch: the data related to executive pay watch is especially useful. http://www.aflcio.org/mediacenter: would like to see press stories captured if possible. http://www.aflcio.org/issues: links to newsletters and original content. Also, the "Legislative Action Center" on the home page; this is a useful topic guide to legislative history from a labor perspective."

Site copyright statement: "Copyright © 2005 AFL-CIO"

Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 2572 additional hosts were crawled. The following hosts supplied more than 75 URLs to your site:

[urls] [bytes] [host]
12702 481956063 www.aflcio.org
2657 184477 dns
1375 35611678 www.local237teamsters.com
570 8144650 www.illinois.gov
502 52847039 www.ilo.org
435 3851046 www.cioslorit.org
427 2782314 www.nola.com
401 8414837 www1.paperthin.com
392 15725244 www.statehealthfacts.kff.org
326 4600633 www.dol.gov
288 12303728 searchoxide.com
284 3401275 www.sikids.com
280 3069385 www.washingtonpost.com
272 1480539 www.cdc.gov
235 5455692 www.kff.org
[list truncated...]

Curator Feedback to CDL (Huwe AFL-CIO)

Crawl Success: effective
Crawl Success Comments: I realize the collection interface is a work in progress and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g., Working Families Toolkit, BushWatch) that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2 million-plus hits per year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses of the toolkit. Having said all that, my short answer is: I'm stoked. :-)
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In the case of this collection there's a lot of original content (and context-making documents) on the linked sites, simply due to the hierarchical/federal nature of the labor movement.
Crawl Frequency: monthly
Questions/Comments about crawl: I may have missed correspondence on this question, as I've been busy with a library renovation here. My question: Is the 1 GB limit a useful one? I couldn't help wondering what the actual figure would be if that limit were set higher. I'm generally interested in the total size of Web collections, as I need to monitor ours (which is content-rich but even so does not exceed 1 GB). So discussion of this at least would be interesting, and maybe, if it's not technically challenging, the 1 GB level might be raised and we could see what happens. Thanks to all involved -- very interesting process.


Kris Kasianovitz: Los Angeles Dept. of City Planning
CDL Report to Curator

URL: http://cityplanning.lacity.org

Curator's original comments: "Website uses frames. Most of the documents will be pdfs. Of particular interest: the EIRs, which are not archived on the page once the project is approved; see http://cityplanning.lacity.org/EIR/TOC_EIR.htm. General and Community Plans: http://cityplanning.lacity.org/complan/gen_plan/genplan2.htm and http://cityplanning.lacity.org/ComPlan/cpbpage.htm."

Site copyright statement: No copyright information found

Crawl Results

Comments from crawl operator: (The linked hosts included crawl) ended because it ground on for 3 days without hitting the data limit; not sure if the URLs at the end of the log are valid/useful.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 119 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
10493 840876945 cityplanning.lacity.org
601 5156252 metrolinktrains.com
183 644377 www.cr.nps.gov
121 11162 dns
90 977850 www.metrolinktrains.com
81 1207859 www.fta.dot.gov
79 263432 www.fypower.org
66 333540 www.adobe.com
64 344638 lacity.org
63 133340 ceres.ca.gov
60 274940 www.amtrak.com
59 389217 www.nhtsa.dot.gov
58 347752 www.unitedweride.gov
52 209082 www.dot.gov
52 288783 www.nationaltrust.org
51 278949 www.portoflosangeles.org
[list truncated...]
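A quick way to see which of the PDF documents discussed in the curator feedback below were actually fetched is to scan the crawl log. The sketch below (Python) assumes the common whitespace-separated Heritrix crawl.log layout, with the fetch status code in the second column and the URI in the fourth; that layout should be verified against the actual log before relying on the output.

    import sys

    def pdf_fetches(log_path):
        """Return (status, uri) pairs for every PDF URI recorded in the crawl log."""
        results = []
        with open(log_path, encoding="utf-8", errors="ignore") as log:
            for line in log:
                fields = line.split()
                if len(fields) < 4:
                    continue
                status, uri = fields[1], fields[3]
                if uri.lower().endswith(".pdf"):
                    results.append((status, uri))
        return results

    if __name__ == "__main__":
        for status, uri in pdf_fetches(sys.argv[1]):
            print(status, uri)

A listing like this would let a curator confirm whether a missing EIR document was never discovered, was fetched with an error status, or was captured but is not displaying in WERA.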

Curator Feedback to CDL (Kasianovitz LA City Planning)

Crawl Success: mostly effective
Crawl Success Comments: The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all pdfs. When searching specifically for EIRs I got a large result list (699 citations); however, when I investigated whether or not the actual file was captured, I found that the main EIR page was captured (typically a htm/html file), but when I clicked on a link to get to the full report, all I got was the "Sorry, no documents with the given URI were found" message. This could be because the file was no longer available when the site was harvested; however, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved. Here are a few specific examples:

Final EIR directory: http://cityplanning.lacity.org/EIR/Tocfeir.htm
Sierra Canyon Secondary School (cover page): http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/SierraCyn_coverpg.htm
Access to the Draft EIR and Final EIR is provided from this cover page. Within the system the links to both the Draft and Final are broken (no documents with that given URI): http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/DEIR/Table of Contents.htm
Villa Marina EIR: http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
Directory of EIR notices of preparation: http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM

This provides links to EIR preparation notices (all pdfs), a total of 27 links/documents. I encountered the following three issues:

• pdf opened = when clicking on the link to the notice, the pdf opened with no problem: 16 of 27.
• "Sorry, no document with the given uri was found" = no pdf harvested, but I could get to it from the live site: 4 of 27.
• Acrobat "could not open" message (could open the live page outside of WERA) = the following Acrobat message came up when I tried to open the pdf: "Acrobat could not open ENV-2005-0881-EIR[1].pdf because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded). To create an Adobe Acrobat PDF document, go to the source application. Then print the document to Adobe PDF." I copied and pasted the url into a regular browser and could open the pdf with no problem. Also, in a few cases, if I clicked on the GO button after the first attempt to open the pdf in the system, it seemed to launch: 7 of 27.

Conversely, I found a number of pages that contained full documents in html, with links to pdfs that worked with no problem. See the following document: http://cityplanning.lacity.org/cwd/gnlpln/transelt/TET2Bkgrnd.htm. File types and error codes were what I expected.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The crawl restricted to the original host is more helpful/relevant for getting to the materials from a specific agency. For some of the results that were restricted to host, I was getting external links. Here are some comparisons for each of the crawl settings.

Searched for "villa marina":
LA Dept. of City Planning: 6 results
http://cityplanning.lacity.org/EIR/NOPs/ENV-2004-3812-EIR.pdf
http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM
http://cityplanning.lacity.org/EIR/Tocfeir.htm
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/CwdGnlPln/HsgElt/HETblFigApVHgSit.htm
LA City Dept. of Planning (via): 2 results
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/CwdGnlPln/HsgElt/HETblFigApVHgSit.htm

Searched for "eir":
LA Dept. of City Planning: 699 results
LA City Dept. of Planning (via): 324 results

For both of these searches the URIs were from cityplanning.lacity.org.

Searched for "transportation":
LA Dept. of City Planning: 699 results
LA City Dept. of Planning (via): 290 results (most are from external sources and tended to be the index or main page of another agency or organization. Because this just got me to the main page, and none of the links functioned at that level, the via result was less helpful. However, the via results are useful for discovering other agencies or organizations that I should be looking at for materials.)

Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): The list of linked hosts provided at the end of the report is helpful. This will help me target other agencies that might be key places to check for further collection development.
Crawl Frequency: monthly
Questions/Comments about crawl: I want to qualify the frequency for this site: I'd like to do a monthly crawl for 3-4 months. I'd then want to reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot of new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation. After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report there are 8,899 documents in this crawl alone; my other crawl yielded 2,991 documents. I think that the tools that are being developed will help us manage these crawls. I should note that at this point in time, for local documents, I am more interested in individual documents than in capturing an entire website and preserving its functionality. Local agencies (with the exception of perhaps the Mayor's site) tend not to change the design (look and feel) very often, or to have a very sophisticated design (i.e. flash, changing images, etc.), and I don't see that this would be of interest to researchers; it is the content (reports, maps, etc.) that is contained/accessed on the websites that is important. Maybe I'm wrong or being short-sighted about that.


Kris Kasianovitz: Southern California Association of Governments
CDL Report to Curator

Curator's original comments: "This is a critical regional agency for Los Angeles, Orange, Ventura, Imperial, Riverside, and San Bernardino counties. Its main areas are Transportation, Housing, and Economic Development. This will provide an analysis of the overall site, which has a lot of content. Publications/reports are typically in pdf - they are presented as full reports and as pieces of the report (for easier downloading), so there might be duplication. The full report is really all that would be needed; see http://www.scag.ca.gov/publications. The Resources page contains the pdfs, images, dynamic content, and gis programs, including an interactive atlas: http://www.scag.ca.gov/resources.htm. One part of the Resource site is the Web Accessible Geographic Data Search (WAGS): http://mapsvr.scag.ca.gov/wags/index.cfm?fuseaction=. It requires a user-created login and password (although there is a guest login that allows you to bypass this). I'm not sure what kind of difficulty the harvester will encounter with this portion of the site. The interactive atlas also has a create-an-account/guest login issue: http://mapsvr.scag.ca.gov/atlas/presmap.asp?Cmd=INIT. Since it is a dynamic page, I don't know how this will be handled by the harvester."

Site copyright statement: "© 1999-2005 Southern California Association of Governments"

Crawl Results


Comments from crawl operator: Interesting login problem: Heritrix was unable to retrieve the guest login pages. Cold Fusion and asp pages don't generate new URLs and thus don't get crawled. NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than 1 minute.

Robots.txt file: The site you selected forbids crawlers from gathering certain data. It reads:

User-agent: *
Disallow: /_mm
Disallow: /_notes
Disallow: /_baks
Disallow: /MMWIP
User-agent: googlebot
Disallow: /csi

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 500 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
2517 863231651 www.scag.ca.gov
690 6134101 www.metrolinktrains.com
506 40063 dns
428 1084533 www.cacities.org
397 16161513 www.sce.com
196 581022 bookstore.icma.org
187 4505985 www.icma.org
175 7757737 www.ci.seal-beach.ca.us
158 1504151 www.h2ouse.org
149 940692 www.healthebay.org
137 317748 www.ci.pico-rivera.ca.us
130 18259431 www.ci.ventura.ca.us
123 490154 www.chinohills.org
121 406068 www.lakewoodcity.org
119 203542 www.lavote.net
117 2449995 www.ci.malibu.ca.us
114 744410 www.ci.irvine.ca.us
113 368023 www.whitehouse.gov
109 974674 www.dot.ca.gov
107 892192 www.lacanadaflintridge.com
[list truncated...]

Curator Feedback to CDL (Kasianovitz SCAG)

Crawl Success: mostly effective


Crawl Success Comments: Similar to my comments about the Los Angeles Dept. of City Planning - the crawl brought back a lot of webpages, but not the publications/documents that I would want to collect. Again, the same problem happened with SCAG as happened with the EIR example: the crawl brought back html pages with links to reports (typically in pdf format), but the actual documents were not captured. While the webpage is helpful, as it gives context, the main content that I'd want to capture (the reports) was not captured. See the following pages for example; none of the documents linked from these pages are available. They should link to a page that will have the material; I tried searching for the documents separately and couldn't get to them. See http://www.scag.ca.gov/publications/index.htm (the timeline arrows at the top seemed to function; I'm not sure what this is for) and http://www.scag.ca.gov/livable/pubs.htm. I was impressed to find that zip files were captured, and I was able to download them. Unfortunately, when I opened them there wasn't any content (I did the same search by mistake with the Arizona Dept. of Water Resources and actually found content in the folders); I found 10 with the search type:zip. The gif or jpg images retrieved are not useful - most were just bars or bullets or covers of reports (although this might be helpful to identify titles, I think I would end up discarding these, after doing more checking of the results).
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments from LA Dept. of City Planning. Restricted gets me to the relevant materials for that agency; via brings back too many main webpages for other agencies to be useful.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): I expected that there would be a problem with the content behind logins. The crawl confirmed that material behind login screens couldn't be captured. Can I get a copy of these crawl results? NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than one minute.
Crawl Frequency: monthly
Questions/Comments about crawl: How to handle the copyright issue? For the login information, I'm not sure what all was blocked by the robots file. Interesting that metrolinktrains.com is the #1 related host for both of my crawled sites.


Linda Kennedy California Bay Delta Authority CDL Report to Curator

URL: httpcalwatercagov
Curator's original comments: "We are interested in the environmental impact statements and other key documents, and the various news releases and other announcements and archives of CALFED."
Site copyright statement: "© 2001 CALFED Bay-Delta Program"
Crawl Results

Related hosts crawled When the crawl was set to include documents from other sites that the original site linked to 519 additional hosts were crawled The following hosts supplied more than 50 files to your site

[urls] [bytes] [host] 1130 473192247 calwatercagov 741 201538533 wwwparkscagov 521 40442 dns 373 51291934 solicitationcalwatercagov 242 78913513 wwwcalwatercagov 225 410972 cweaorg 209 87556344 wwwsciencecalwatercagov 173 109807146 sciencecalwatercagov 172 1160607 wwwadobecom 129 517834 wwwwhitehousegov [list truncatedhellip]

Curator Feedback to CDL (Kennedy CALFED)


Crawl Success: mostly effective

Crawl Success Comments: I looked closely at the CALFED home page (httpcalwatercagov), the Tribal home page (httpcalwatercagovTribalTribal_Homeshtml), the Key Documents page (httpcalwatercagovCALFEDDocumentsCALFEDDocumentsshtml), and the Archives page (httpcalwatercagovArchivesArchivesshtml). The crawl did not complete in either the via or the non-via search. Nearly all linked pages were retrieved in the non-via search; however, the retrievals from the via search were much less complete than the retrievals from the non-via search. For example, on the Key Documents page there were 3 missing links from the non-via search but 14 missing links from the via search. When Adobe documents were retrieved from either crawl, they came up correctly. Three asp links of tribal maps from the Tribal home page were retrieved by the non-via search but not the via search. A few of the images were missing from the displays, and this was also affected by the browser used. Usually the same image was missing from both crawls, but sometimes the images were more complete in the non-via crawl retrievals. There were some display problems with the right-hand menu boxes, for example on the httpcalwatercagovTribalTribal_Homeshtml page, which did not display correctly when viewed in Firefox. The same page viewed correctly in Internet Explorer, but when printed out, the boxes printed incorrectly, just as viewed in the Firefox browser. Grant Opportunities (httpcalwatercagovGrantOpportunitiesGrantInformationshtml): this link did not work in 2 via and 2 non-via instances (from the Tribal home page and from the Archives page), but did work on one non-via crawl page (the CALFED home page). It could also be searched and retrieved directly from the test crawl search page.

Crawl Scope Preferences: Original host only. (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: See comments above. The non-via search was substantially more complete.

Crawl Frequency: monthly


Janet Martorana Santa Barbara County Department of Planning and Development CDL Report to Curator

URL: httpwwwcountyofsborgplandevdefaulthtm
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 487 additional hosts were crawled. The following hosts supplied more than 50 files to your site: [urls] [bytes] [host] 3119 1102414495 wwwcountyofsborg 485 34416 dns 428 1083047 wwwcacitiesorg 357 6126453 wwwsbcphdorg 320 6203035 icmaorg 250 438507 wwwsbcourtsorg 234 1110744 vortexaccuweathercom 200 593112 bookstoreicmaorg [list truncated...]

Curator Feedback to CDL (Martorana SBCD)

Crawl Success mostly effective


Crawl Success Comments: Most documents I expected to find were captured, but a number were not. For example, off this page, httpwwwcountyofsborgplandevcompthreeyear2005-2008defaulthtml, I expected to get to the final work program, httpwwwcountyofsborgplandevpdfcompprogramsThree_Year_WP2005-2008_3YrWrkProgrampdf, but got the "Sorry, no documents with the given URI were found" message. Other examples are within the Energy Division, a part of the Planning & Development Dept. Off this page, httpwwwcountyofsborgenergyinformationasp, I could access all links except for two: httpwwwcountyofsborgenergyinformationoilampGasFieldsasp (Oil and Gas Fields) and httpwwwcountyofsborgenergyinformationoilampGasProductionasp (Oil and Gas Production). The crawler seemed to cut off the URL right before the ampersand; perhaps it has problems with ampersands.

Crawl Scope Preferences: unknown. (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Frequency: weekly

Questions / Comments about crawl: I'd like to be notified when I've navigated away from the captured site. I found myself on real-time web sites but couldn't tell what was on the captured site and what was real. Example: When I did a search on "santa barbara" I got 1528 results using the via results. I went to the County of Santa Barbara Online site and then, after clicking on links, I was on the web (real time) and not in the crawl results database any longer, yet there were no indications that I had left the crawled database. The WERA URI was still displaying at the top of the screen. I couldn't tell which were the captured sites and which were the current real-time sites. Other observations: the web page navigation doesn't work, e.g., a Table of Contents doesn't jump to that section on the web page (wwwcountyofsborgenergyprojectsshellasp and wwwcountyofsborgenergymitigationoakProjectasp); links to glossary terms go to the glossary but not to the term itself.
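The ampersand problem the curator describes is a common link-extraction pitfall: in HTML source, an "&" inside an href is usually written as the entity "&amp;", and an extractor that truncates at the entity, or fails to unescape it, queues a broken URL. The sketch below only illustrates that failure mode with hypothetical URLs; it is not a claim about Heritrix's actual extractor code.

```python
# Illustration of how an unescaped "&amp;" entity can truncate a link
# (hypothetical URLs, not the real site paths).
from html import unescape
from urllib.parse import urlsplit

href_as_written = "information/oil&amp;GasFields.asp"

bad = href_as_written.split("&")[0]   # naive handling stops at the entity: "information/oil"
good = unescape(href_as_written)      # proper unescaping: "information/oil&GasFields.asp"

print(bad)
print(good)
# "&" is legal inside a URL path, so the unescaped form resolves as intended.
print(urlsplit("http://www.example.gov/" + good).path)
```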


Lucia Orlando Monterey Bay National Marine Sanctuary CDL Report to Curator

URL: httpmontereybaynoaagov
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled When the crawl was set to include documents from other sites that the original site linked to 795 additional hosts were crawled The following hosts supplied more than 50 files to your site [urls] [bytes] [host] 5272 468755541 montereybaynoaagov 861 61141 dns 554 20831035 wwwwundergroundcom 368 4718168 montereybaynosnoaagov 282 3682907 wwwoceanfuturesorg 273 10146417 wwwmbnms-simonorg 260 7159780 wwwmbayaqorg 163 61399 bcusyahoocom 152 1273085 wwwmbariorg 146 710203 wwwmontereycom 119 3474881 wwwrsiscom 119 279531 wwwsteinbeckorg 118 1092484 bonitambnmsnosnoaagov 109 924184 wwwdukeedu 104 336986 wwwmontereybayaquariumorg


103 595953 iconswundergroundcom 102 339589 wwwuncwedu [list truncatedhellip]

Curator Feedback to CDL (Orlando ndash Monterey Bay)

Crawl Success: (rating not provided)

Crawl Scope Preferences: Original host only. (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: This site contains a large mix of government policy, laws, and regulatory information, as well as links to recreational and educational activities available in the MBNMS. I was most interested in links to laws/regs/policy and educational info and organizations. I thought the links restricted to the original host best captured this information succinctly.

Crawl Frequency: unknown


Richard Pearce Moses Arizona Department of Water Resources CDL Report to Curator

URL: httpwwwazwatergov
Curator's original comments (redirects to httpwwwazwatergovdwr): In arid Arizona, water is one of the most important – and most contested – resources. The publications and records of this Department are of critical value to the state. Our spider can get many files from this site (1474 files in 258 directories). We are mostly interested in documents by programs. Although our spider can't get the imaged documents database (httpwwwazwatergovdwrContent ImagedRecordsdefaulthtm), this directory may not be critical if we can get the imaged documents transferred to us. We are not interested in blank forms and applications.
Site copyright statement: "Copyright © 1998 - 2005 Arizona Department of Water Resources and ADWR Network. All Rights Reserved."
Crawl Results

Questions for curator: Did this capture the documents you needed?


Related hosts crawled When the crawl was set to include documents from other sites that the original site linked to 195 additional hosts were crawled The following hosts supplied more than 50 files to your site [urls] [bytes] [host] 2233 988447782 wwwazwatergov 286 2350888 wwwwaterazgov 253 4587125 wwwgroundwaterorg 226 3093331 wwwazcentralcom 196 15626 dns 178 395216 wwwmacromediacom 128 1679057 wwwprescottedu 123 947183 wwwazlegstateazus 115 792968 wwwusdagov [List truncatedhellip]

Curator Feedback to CDL (Pearce-Moses AZWater)

Crawl Success: (not provided)

Crawl Success Comments: We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand it, wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can; that may be the difference. The crawl is listed as not completing, but it appears to be very close, based on the total number of bytes downloaded.

Crawl Scope Preferences: Original host only. (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Questions / Comments about crawl: It seemed odd that it took 2h 54m to crawl 7709 documents with linked hosts, but 4h 4m to crawl only 4888 docs when the spider was restricted to the original host.


Richard Pearce Moses Citizens Clean Election Commission CDL Report to Curator

URL: httpwwwccecstateazusccecscrhomeasp
Curator's original comments: This commission was established by initiative. Its work is of great historical significance, as it is changing the way the public elects officials. We have not been able to spider this site because links are buried in JavaScript. (We use wget as our spider.) We are primarily interested in acquiring their publications, election data, and things listed under "popular links."
Site copyright statement: "Copyright 2004 Arizona Citizens Clean Elections Commission. All Rights Reserved."
Crawl Results

Comments from crawl operator: JavaScript issue; interesting problem. Need curator feedback about what we captured.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, the following additional sites were crawled. Total: 15


[urls] [bytes] [host] 929 95456563 wwwccecstateazus 76 6117977 wwwazcleanelectionsgov 55 513218 azgov 49 499337 wwwgovernorstateazus 44 174903 wwwadobecom 40 141202 wwwazlegstateazus 31 18549 wwwazgov 28 202755 wwwazsosgov 23 462603 gitastateazus 19 213976 wwwbenefitoptionsazgov 17 89612 wwwazredistrictingorg 14 1385 dns 3 1687 wwwimagesadobecom 2 1850 wwwcapitolridesharecom 2 26438 wwwftcgov

Curator Feedback to CDL (Pearce-Moses CCEC)

Crawl Success: (not provided)

Crawl Success Comments: We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with wget.

Crawl Scope Preferences: Original host only. (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)


Juri Stratford City of Davis CDL Report to Curator

URL: httpwwwcitydaviscaus
Curator's original comments: "We are primarily interested in the GIS data produced for the City of Davis. As the GIS data represent the largest part of the City of Davis web site, it may not be much more difficult to archive the site at the top level. Mapping and Geographic Information Systems URL: httpwwwcitydaviscausgis"
Site copyright statement: "This web site is Copyright © 2004 by the City of Davis. All Rights Reserved. The City retains the copyright on all text, graphic images, and other content of this site. You may not copy, modify, and/or re-use text, images, or other web content from this web site; distribute the City's web content; mirror content from this web site on a non-City server; or make any other use of the content of this web site that would violate the City's copyright without written permission from the City of Davis. To the extent allowed by law, commercial use of our web material is prohibited without written permission from the City of Davis. All art work shown on these web pages is protected by US Copyright laws. Limited reproduction for non-commercial purposes can be authorized by the City of Davis, provided that requests are approved prior to use. Contact the Community Development Department, Cultural Services program staff, at (530) 757-5610 for more information. Some content included in this web site may be provided courtesy of third parties who may retain copyright control of the provided material. Any service marks and trademarks contained herein are the property of their respective owners."
Crawl Results


Comments from crawl operator: "GIS. Potential issue: img is disallowed by robots.txt; e.g., httpwwwcitydaviscausimgfeaturedmap-staticjpg can't be retrieved, and some maps on a second server are also disallowed. Need feedback about the GIS material that was captured: what was captured that is useful? Much duplication -- pages captured repeatedly."

Robots.txt: The site you selected prohibits crawlers from collecting certain documents. The file reads:

User-agent: *
Disallow: /img
Disallow: /calendar
Disallow: /miscemailcfm
Disallow: /edbusiness
Disallow: /gisoldmap
Disallow: /policelog
Disallow: /pcsgrantssacog
Disallow: /jobslistings
Disallow: /css
Disallow: /pcsnutcrackerhistorycfm
Disallow: /pcsnutcrackerpdfs
User-agent: asterias
Disallow:
User-agent: gigabot
Disallow:

Related hosts crawled When the crawl was set to include documents from other sites that the original site linked to 420 additional hosts were crawled The following hosts supplied more than 50 files to your site


[urls] [bytes] [host] 16455 947871325 wwwcitydaviscaus 420 29555 dns 332 10377948 wwwasucducdavisedu 305 33270715 selectreecalpolyedu 279 3815103 wwww3org 161 2027740 wwwcrnpsgov 139 941939 wwwcomcastcom 133 951815 wwwyolocountyorg [List truncatedhellip]

Curator Feedback to CDL (Stratford Davis)

Crawl Success: mostly effective

Crawl Success Comments: Looking at the GIS Online Maps page, it's not clear which formats were retrieved and which were not. For example, the Growth Map Flash file downloads fine, but the Flash/ArcIMS files do not download.

Crawl Scope Preferences: Original host only. (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: In general, restricted to original host works better. The broader search includes pages from businesses (e.g., Comcast) and other local agencies (e.g., other local and state government sites). But restricting the outside sites to the first level seems to be a good compromise.

Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) In general, it looked like it did a good job pulling in geographic data/images. For example, you can pull down data from httpwwwcitydaviscausgislibrary. It's difficult for me to get a sense of the level of duplication from the way the search results display.

Crawl Frequency: monthly

Questions / Comments about crawl: I like the ability to navigate within the archive once you have a starting point. This is very nice.


Yvonne Wilson Orange County Sanitation District CDL Report to Curator

URL: httpwwwocsdcom
Curator's original comments: "This is an Orange County agency that has small reports and complex planning documents on its web site."
Site copyright statement: "Copyright © 2001-2005 Orange County Sanitation District. Unless a copyright is indicated, the information on this site is freely available for non-commercial, non-profit-making use. If a copyright is indicated on any materials displayed on our website, permission to copy these materials must be obtained from the original source. Commercial use of District materials is expressly prohibited without the written permission of the OCSD."
Crawl Results

Comments from crawl operator: In both crawl settings, we had to limit the maximum number of retry attempts in order to complete the crawl.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 85 additional hosts were crawled. The following hosts supplied more than 50 documents to your site: [urls] [bytes] [host] 755 85943567 wwwocsdcom 164 7635257 wwwciseal-beachcaus 122 809190 wwwciirvinecaus 95 169207 epagov 86 7673 dns


85 559125 ordere-arccom 66 840581 wwwcihuntington-beachcaus 62 213476 wwwcityoforangeorg 57 313579 wwwepagov 55 4477820 wwwvillaparkorg 50 1843748 wwwcityoflapalmaorg 50 463285 wwwocbinccom [List truncatedhellip]

Curator Feedback to CDL (Wilson OCSD)

Crawl Success: somewhat effective

Crawl Success Comments: Using WERA, I searched by type and title in the two OCSD collections, plain and via. I received no hits for PDF, only the homepage for HTML, and three hits for text. There are many PDF sections in the EIRs. I next searched by title in the two collections; I was the most successful in via. By searching the titles "carbon canyon" and "Ellis Ave Pumping Station" I found articles, but not the EIRs, which are available full text. At this point I presumed the crawl did not drill down far enough. Then I searched for a secondary page entitled "Ocean Monitoring"; this time the search found only an internal letter and memo, but not all the documents related to this topic. The via collection search seems to be the most productive, but it is not consistent.

Crawl Scope Preferences: Original host only. (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: I searched some of the outside links, to US Marine Fisheries and EPA Beach Watch, and received no hits.

Crawl Frequency: monthly


Crawl Report Key: Web-at-Risk Test Crawls

This document is a guide to the test crawl report, providing further information about some of the statistics and results conveyed there. Your report includes basic information about the site you submitted, your original comments about that site, and any copyright statements CDL found when we examined the site. Your site was crawled using the Heritrix crawler, version 1.5.1. Your report will include the following information about the site (each item is followed by its explanation):

Crawl Settings: We crawled each site in two different ways: (A) linked hosts included, and (B) restricted to original host. Where linked hosts were included, we set the crawler to gather any outside page that your nominated site linked to, but no further. So if your site linked to a single document from whitehousegov, we captured that document but did not crawl any further on the whitehousegov site. Your report will show results from both styles of crawling, to give you a sense of whether or not the site draws heavily from valuable materials on another site. You will also receive a list of the other hosts that site linked to and how many documents were gathered from those hosts.

Robots.txt file: The presence of a robots.txt file means that the content provider is asking us to refrain from crawling either all or part of the site. This refers only to the host you named in your crawl request. In some cases the site had a robots.txt file but it didn't say anything; we noted when this occurred. We obeyed robots.txt instructions for these crawls, so if the site contained one but we still got a result, that means the robots.txt file only prevented us from crawling certain areas of the site. When robots.txt files were present, we have included the text of that file in your report so you can see which segments of the site the site owner wants to protect.

Crawl duration

Total number of documents: The "Documents" count will include page components (such as images or Flash files).

File types (MIME types): This area will contain a URL. When you go to that URL, you will see a list of the different file types that were retrieved as part of the crawl. IMPORTANT: The Heritrix crawler is currently experiencing difficulty with this report, and it is missing a crucial column. Until that is fixed, you can see the different file types retrieved, from most common to least, but you cannot yet tell how many files each one included. CDL will contact you when this report has been fixed.

Response code reports: The URL in this column will lead to a list of response codes in order of frequency. This will include "200" for files that were successfully captured, and error codes for files that were not captured. The error code list includes some codes specific to Heritrix. The key to interpreting these codes is at httpcrawlerarchiveorgarticlesuser_manualhtmlstatuscodes. Note that this report only gives you quantitative information about response codes; it does not link response codes to specific files. For those details, see "hosts report and crawl log" below.

How much data collected (bytes): The file size of the total crawl is reported in bytes. You can use the byte conversion tool at httpwwwtechtutorialsnetreferencebyteconvertershtml if you want to recalculate the size of the crawl in another measurement, such as kilobytes or megabytes.

Did crawl complete?: This will say "no" if the crawl results exceeded 1 gigabyte, or if the crawler encountered an obstacle to capturing the site that could not be fixed.

Location of hosts report and crawl log: You are welcome to review the page-by-page details of the crawl log. This is a generic report that comes with the Heritrix crawler and is not terribly user friendly, but it provides the most detail about the crawl process. This report will list every file that the crawler attempted to get and provide some information about each file. The Heritrix manual can help you interpret this report: httpcrawlerarchiveorgarticlesuser_manualhtmllogs (go to section 8.2.1, Crawl Log).

Comments from Crawl Operator: These are observations that the Web Archive Programmer made about the crawl process for your site.

Questions for Curator: This section does not appear in every report. If you have a question listed here, please respond to it in the Test Crawl Feedback form.

Your Collection: Important: the screens you will see do not represent the final user interface for the Web Archiving Service tools. We are using WERA, an open-source search and display tool, only to show you your test crawl results. Because we did not seek the right to redistribute these documents, these pages are available only for the purpose of analyzing crawler effectiveness, and you must have a password to view them. Your report will include the address of a wiki page, a login, and a password. Each site was crawled twice: "plain crawl" means only pages from the original site were collected; "via" means pages from the original site, as well as pages that site links to, were collected. Unfortunately, you cannot simply browse your site; you must select a collection and type a search. You will be able to navigate throughout your site once you load a page containing links. You will be able to review your colleagues' sites as well. Note that the WERA display tool is not perfect: if the same document was gathered from more than one crawl, it may not display in every collection.

Related Hosts Crawled: This section provides further information about the additional materials that were gathered when we set the crawler to include documents that your site links to. This can be critical in deciding what settings are needed to capture your site. Some sites, for instance, will keep all of their PDF or image files on a separate server. If you don't allow the crawler to move away from the original URI, you won't capture a critical portion of the site's content. In other cases, however, this setting will lead to irrelevant information. This report includes the most commonly linked hosts from your site.
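The response code report and crawl log described above lend themselves to simple scripted summaries. The sketch below assumes a local copy of the log and the common whitespace-delimited Heritrix 1.x crawl.log layout, with the status code in the second column and the document size in bytes in the third; check the column positions against the Heritrix manual for your own logs before relying on them.

```python
# Rough sketch: tally response codes and total bytes from a Heritrix crawl log.
# Assumes status code in field 2 and size in field 3 of a whitespace-delimited log.
from collections import Counter

status_counts = Counter()
total_bytes = 0

with open("crawl.log") as log:        # hypothetical local copy of the crawl log
    for line in log:
        fields = line.split()
        if len(fields) < 3:
            continue
        status_counts[fields[1]] += 1
        if fields[2].isdigit():       # size may be "-" for failed fetches
            total_bytes += int(fields[2])

print(status_counts.most_common(10))
print(f"{total_bytes / (1024 * 1024):.1f} MB collected")
```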


We used a two-step review process to match our 30 suggested sites to the correct rights scheme. First, we looked at the nature of the content-owner's organization (federal government, nonprofit organization, etc.). This was done by reviewing the organization itself as well as the domain name used for the organization's web site. This gave us a first-round guess at what the correct rights scheme would be for each site. Next, the sites were carefully reviewed for copyright notices, and our initial determinations were revised. The breakdown of agencies by type is as follows:

[Chart: Site Types, showing the number of sites by type: Federal, State, Local, Non-Profit]

Seven of the 15 sites were from agencies devoted to water management. The full list of sites submitted is available in Appendix A.

Site domains: Although 21 sites were published by government agencies, only 12 of them were in the gov domain. By domain, the sites included:

[Chart: Domains (all sites): gov, us, org, com, edu]

The nine local government sites presented by far the most variety in domain names. There was a weak correlation between domain names and the nature of the agency at the local government level.

[Chart: Domains (local sites only): gov, us, org, com, edu]

Copyright statements: We next reviewed each site to determine whether copyright statements on the site could help determine what rights scheme might apply to the site. The copyright statements for the sites we crawled are available with each individual crawl report in Appendix C. Here too, local government sites offered little correlation between the nature of the content-owning organization and the rights statements displayed on the site. City web sites varied dramatically in their rights statements: some stated that their materials were in the public domain, while others vigorously defended their copyright. This City of San Diego site did both. [4]

After both rights reviews, it was determined that, of the 30 sites submitted:

• 14 fell into Rights Scheme A and could be crawled without notification or permission
• 13 fell into Rights Scheme B and could be crawled, but would also require identifying and notifying the content owner
• 3 fell within Rights Scheme C and would require the explicit consent of the content owner prior to crawling

The process of reviewing the sites for rights statements changed our assessment of the correct rights scheme in a number of cases, and all three "Scheme C: Consent Required" designations were made on the basis of statements posted on the site. Note that we did not ultimately seek permission for these materials, and access to the results of our crawls has been strictly limited to the curators and project staff for the purpose of crawl analysis. In short, our pre-crawl analysis of these 30 sites brought up complex rights issues and exemplified the challenges that lay ahead.

[4] City of San Diego web site, Disclaimer page <httpwwwsandiegogovdirectoriesdisclaimershtml>

Test Crawl Settings and Process

Although we had originally planned to crawl only one site for each curator, some curators supplied two sites that posed interesting problems; in these cases we crawled both. Each site was crawled with two settings, resulting in 19 test sites and 38 total crawls conducted. We used Heritrix version 1.5.1 to conduct the test crawls. The crawls were conducted using four crawler instances on two servers. Each site was crawled separately; that is, each seed list contained one URI. We kept most default settings, except for the following:

Crawl size: Each crawl was set to stop at a maximum of 1 gigabyte (gig) of data. Of the 38 crawls conducted, 18 hit the 1 gig size limit. Note that this limitation was imposed for the purpose of these early tests and will not be applied to future services.

Crawl duration: When crawls took an inordinately long time to complete, we started over again with "max retries" set at three. This setting improved crawler performance when pausing or hanging was an issue.

Politeness [5]: Because we crawled each site individually, we set our crawler for very high politeness values. Politeness pertains to the impact of the crawl on the content owner's server, and is determined by combining a few different Heritrix settings that together determine how demanding the crawler is on the remote servers' resources. Our politeness settings, and a sketch of how they combine, follow:

• delay-factor: 5
• max-delay-ms: 5000
• min-delay-ms: 500
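As we understand these settings, the crawler waits delay-factor times as long as the previous fetch from the same host took, bounded below by min-delay-ms and above by max-delay-ms. A rough sketch of that arithmetic (a simplification for illustration, not Heritrix's actual code):

```python
# Rough sketch of how the three politeness values interact (our reading of the
# settings above; simplified, not Heritrix's implementation).
def next_delay_ms(last_fetch_duration_ms: float,
                  delay_factor: float = 5.0,
                  min_delay_ms: float = 500.0,
                  max_delay_ms: float = 5000.0) -> float:
    """Wait delay_factor times as long as the last fetch took,
    but never less than min_delay_ms nor more than max_delay_ms."""
    delay = delay_factor * last_fetch_duration_ms
    return max(min_delay_ms, min(delay, max_delay_ms))

# A fast 50 ms fetch still earns the server a 500 ms pause; a slow 3-second
# fetch is capped at the 5-second maximum.
print(next_delay_ms(50))    # 500.0
print(next_delay_ms(3000))  # 5000.0
```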

Original host only vs. linked hosts included: Each site was crawled with two settings. The first setting restricted results to only the host from the original seed URI. The second setting allowed us to gather any pages to which the site linked directly, but no more. This second setting was constructed to gather pages considered relevant to the original site, and to gather sites in their entirety when an organization relied on more than one host name to provide its web presence. (A sketch of the distinction follows.)
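A minimal sketch of the two scope rules as described here, using hypothetical host names; this is an illustration of the policy, not Heritrix's implementation:

```python
# Sketch of the two crawl scopes: "original host only" vs. "linked hosts included"
# (one hop off-site, but no further). Hostnames are hypothetical.
from urllib.parse import urlsplit

def in_scope(candidate: str, referrer: str, seed_host: str,
             include_linked_hosts: bool) -> bool:
    if urlsplit(candidate).hostname == seed_host:
        return True
    if include_linked_hosts:
        # allow the single hop off-site, but only when discovered on a seed-host page
        return urlsplit(referrer).hostname == seed_host
    return False

seed_host = "www.example.gov"
print(in_scope("http://www.whitehouse.gov/doc.pdf",
               "http://www.example.gov/reports.htm", seed_host, True))   # True
print(in_scope("http://www.whitehouse.gov/other.htm",
               "http://www.whitehouse.gov/doc.pdf", seed_host, True))    # False
```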

Crawl Scope

A comparison of the two different crawl settings used (original host only vs. linked hosts included) turned up some counterintuitive results.

[5] For further information, see section 6.3.3.1, "Politeness," in the Heritrix User Manual <httpcrawlerarchiveorgarticlesuser_manualhtml>.


When compared quantitatively, 8 out of 19 crawls took longer to capture the site when limited to "original host only" than with the "linked hosts" setting. It is not clear why this is the case, since the "linked hosts" crawl should be much larger. Indeed, in all cases the linked hosts crawl retrieved more files than the original host crawl. The following two tables compare both the number of files retrieved and the duration of the two types of crawls.

Table 1: Number of files retrieved

            Original Host Only    Linked Hosts Included
  Most      46197                 70114
  Fewest    247                   1343
  Median    2423                  9250
  Average   6359                  17247

Table 2: Duration of the crawl

            Original Host Only    Linked Hosts Included
  Longest   32 hr 21 min          37 hr 11 min
  Shortest  18 min                19 min
  Median    7 hr 33 min           11 hr 22 min
  Average   1 hr 42 min           7 hr 9 min

Given that this is a very small sample of crawls, and that the gap between the largest and smallest crawls is fairly noteworthy, perhaps the only telling figure to consider here is the median. According to the median figures, with only 50.5% more time the crawler acquired over 281% more documents. When compared qualitatively, the results also appeared somewhat counterintuitive. Of the 18 curators who responded, 12 stated that they preferred the "original host only" crawl (four were undecided). We would have expected this preference to vary a little more from site to site. Oddly, one of the two curators who preferred the larger crawl scope had a crawl that captured materials from over 2500 other hosts. In some cases a site's links to exterior hosts are critical: the site's value may hinge upon how well it gathers documents from other sources, and both of the curators who preferred the broader setting did so for this reason:
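Those percentages can be checked directly against the median rows of Tables 1 and 2:

```python
# Worked check of the median comparison above.
files_original, files_linked = 2423, 9250
time_original, time_linked = 7 * 60 + 33, 11 * 60 + 22   # minutes

extra_time = (time_linked - time_original) / time_original * 100
extra_files = (files_linked - files_original) / files_original * 100

print(f"{extra_time:.1f}% more time")    # ~50.6% more time
print(f"{extra_files:.1f}% more files")  # ~281.8% more files
```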

For this site it was essential to capture the link hosts (via) because many of the press materials etc were on external sites

More critically, sites are also often composed of content from more than one server. This is particularly likely to be the case if a site is providing a large body of PDF or multimedia files, so a crawl restricted to the original host would be missing critical segments of the site's content. Our test crawls did in fact turn up sites that were composed of more than one host name. For example, in the case of UC Merced, separate host names are used for different areas of the site, such as faculty or admissions. In the case of the Arizona Department of Water Resources, the distinction between host names appears to be accidental, perhaps the result of an attempt to transition to a simpler, more memorable URI. Most pages from this site come from wwwazwatergov, but hundreds of internal links, including critical style sheet files, are still hard-coded to point to wwwwaterazgov.

Finally, when the site is restricted to the original host, the end user is much more likely to encounter errors when viewing the archived results. When the end user selects a link that was not captured, WERA provides a "Sorry, this URI was not found" message. When the linked hosts are included, the end user browsing the site sees the site closer to its original context and with fewer error messages. Conversely, when end users encounter frequent error screens, they may develop both frustration and a sense of mistrust in the quality of the archive. It is worth noting that the curators are not likely to browse these results in the same way that an end user of their archives might. The curators know what these sites contain, choose them accordingly, and may be less inclined to click on links that would result in a "Sorry, this URI was not found" message.

Ultimately, the value of a site's external links would seem likely to vary depending on the nature of the site. Sites with rich internal content and only "frivolous" external links would be best captured with the "original host only" setting. Before we ran these crawls, we asked curators to specify what they hoped a crawl would capture. Many referenced specific pages or directories they hoped to capture, and of those, three specified URIs that were not from the original host. When reviewing the results, two out of those three still preferred the original host crawl, even though that crawl did not capture the materials they specifically hoped to capture.

Why was the feedback so consistent on this point? A look at the WERA interface used to display crawl results may provide an answer. WERA does not offer an immediate means of browsing a site; you have to search by keyword to find your way "into" the captured site. Once you have a page displayed you can browse within the site, but you must begin by searching for the right starting point. In the WERA search screen, you select the site you want to search from a dropdown menu, then enter terms to search against.

Because all pages from the more comprehensive crawls are indexed, the search results include pages from all of those other hosts. This sets up a bit of cognitive dissonance: the user specified a search against a particular site, yet results from other hosts vastly outnumber pages from that site.


Underlying the issue of crawl scope is the deeper question of what an archivist hopes to capture when a site is crawled. Is it just a list of particular documents? Or is it a faithful recreation of the site as it existed on that day? It may be that a captured site has content of primary and secondary importance: the primary content is what should be retrieved when searching against the archive, while the secondary content should only be present to avoid error messages and establish the site's full original context. Another approach is suggested by this curator's response:

The results on the multiple hosts crawl is mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains, rather than "source + 1."

This suggests the ability to link seed URIs as being related components of a single site

Communication / Reports

When we reported the test results back to curators, we provided a synopsis of the crawl results, links to particular Heritrix reports, and a link to the WERA display interface. The Heritrix reports are all plain text, providing tables of MIME type or response code frequency. Attempting to integrate these reports and the display of the archived results is a challenge. One curator, for example, obtained documents from over 200 hosts in the "linked hosts included" crawl, but was only aware of having found 10 additional documents when reviewing these same search results in WERA. Although WERA is helpful for seeing results from an end user's perspective, it does not provide adequate tools for analysis. In some cases this is simply because WERA is a new and occasionally buggy tool. It is possible, for instance, to follow occasional links out of the archive and into "real-time" sites. In some cases it's also possible to browse to a page and display it, but when you search for that same page by its URI, WERA does not find anything. One curator notes:

We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE; the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.

It is also important to note that although WERA was used for the purpose of reporting test crawl results, it is not envisioned as the final display interface for the Web Archiving Service. Even so, the feedback the curators provide about WERA should inform the functionality of the WAS interface. Clearly, it is still quite a struggle for curators to determine exactly what a crawl retrieved. One curator reports:

After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report there are 8899 documents in this crawl alone. My other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls.

Crawl Frequency

When asked how frequently they wanted to crawl sites, curators responded with a variety of preferences:

[Chart: Desired Crawl Frequency: Daily, Weekly, Monthly, Once, Unknown]

Again, it is worth considering precisely what curators hope to capture in a repeated crawl of a site. Some insight is provided by these curators' comments:

We hope the crawler will be able to report when new publication files are posted on the web site

And

The ability to report on new publications is critical to our goal of using the crawler as a discovery tool

As with the other NDIIPP grants, the purpose of the Web Archiving Service tools will be archiving and preservation, not resource discovery. This indicates that we should further investigate what a "weekly" or "monthly" crawl really means to curators. If a site was not updated over the course of a year, would the curator want to continue running weekly crawls of the site? Would the curator want each crawl to appear on an archive timeline for that site, even if the content was no different for each date? One curator comments:

I want to qualify the frequency for this site. I'd like to do a monthly crawl for three to four months. I'd want [to] reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot [of] new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation.


Language and Web Site Models

One of the challenges of communicating and interpreting crawl results is that crawlers don't necessarily work the way people envision them to. Further, the appearance of a web site on a screen and its architecture on a server may be quite different: the directory structure of a site may have no relationship to the way its navigation is organized on a screen. One frequent point of confusion curators encountered while interpreting crawl results is the concept of how many "levels down" the crawler went. One curator requested that we "drill down several levels (at least 3)" in our capture. One challenge with this request is that "levels down" can be interpreted to mean different things. In some cases curators clearly mean the directory structure of the web site on the remote server. In other cases, however, they seem to mean the number of hosts away from the original site:

In general restricted to original host works better The broader search includes pages from businesses eg Comcast and other local agencies eg other local and state government sites But restricting the outside sites to the first level seems to be a good compromise

If the curator is referring to a structure of subdirectories, the next challenge is that the Heritrix crawler does not work this way. Heritrix ignores the site's directory structure and instead follows links from the seed URI it is provided. The Heritrix manual specifically defines "depth" in this way (a small sketch of the distinction follows the quotation):

This scope allows for limiting the depth of a crawl (how many links away Heritrix should crawl), but does not impose any limits on the hosts, domains, or URI paths crawled. [6]
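The sketch below illustrates the difference between the two readings of "levels down": link hops from the seed, which is what Heritrix limits, versus directory depth on the server. The host names and paths are hypothetical.

```python
# Sketch of "levels down": directory depth vs. link hops (hypothetical URLs).
from urllib.parse import urlsplit

def directory_depth(url: str) -> int:
    # Number of path segments below the site root.
    path = urlsplit(url).path
    return len([seg for seg in path.split("/") if seg])

# A page several directories deep can still be one link hop from the seed
# if the seed's front page links to it directly.
seed = "http://www.example.gov/"
deep_page = "http://www.example.gov/plans/2005/reports/final.pdf"

print(directory_depth(deep_page))   # 4 path segments below the root
# Link hops: seed -> deep_page is 1 hop, regardless of the path depth.
```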

Multimedia

Some curators chose sites because of the value of their multimedia resources. The crawler yielded mixed results in capturing these resources. For one site:

• A text search on the log file turns up numerous .ram files, and only one .ppt file.
• .asx files are Windows streaming media redirector files, which generally lead to associated .wmf files. No .asx references appear in the crawl log, nor do any .wmf files.
• .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files.

As noted, nearly half the sites crawled reached the 1 gig size limit and so did not complete. This makes it difficult to determine whether there were genuine problems with particular types of files, or if the crawler simply did not get to the missing files before the crawl was stopped.

[6] Heritrix User Manual, Section 6.1.1, "Crawl Scope: Broad Scope" <httpcrawlerarchiveorgarticlesuser_manualhtml>

Comparison with Other Crawlers

Three of our curators had previously used other crawlers, such as HTTrack and Wget, to capture the same sites we attempted with Heritrix. Because these curators provided strong details when describing their sites, we crawled all of the sites they sent us (six sites). In one case, the Defense Base Closure and Realignment Commission, the curator had greater success capturing .aspx files with HTTrack than we had with Heritrix. Other comparisons:

We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using Wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand it, Wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can. That may be the difference.

And

We were very pleased with this crawl as it demonstrated that the Heritrix spider could follow links embedded in Java We have not been able to crawl this site with Wget

Crawl Success

We asked curators to rate the overall success of the test crawls on the following scale:

• Not effective (none of the desired documents were captured)
• Somewhat effective (some of the desired documents were captured)
• Mostly effective (most of the desired documents were captured)
• Effective (all of the desired documents were captured)

[Chart: Crawl Success ratings: Effective, Mostly Effective, Somewhat Effective, Not Effective]

Here are some of the comments curators had concerning the overall success of these tests. The full text of all curator reports and their feedback to CDL is available in Appendix C.


Los Angeles Planning Department

The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all PDFs. When searching specifically for EIRs I got a large result list (699 citations); however [sic], when I investigated whether or not the actual file was captured, I found that the main EIR page was captured, typically a htm/html file, but when I clicked on a link to get to the full report, all I got was the "Sorry, no Documents w/the given URI were found" message. This could be that the file was no longer available when the site was harvested. However, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved.

City of San Diego Planning Department

This comment is from a curator who is filling in for the person who originated the test crawl request, so she is also attempting to interpret what that other person was hoping to capture.

Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (wwwproquestcom, wwwinfopeopleorg) that I believe are included because they are listed on the public library's pages (wwwsandiegogovpublic-library). So the crawl appears to include not just the pages linked from wwwsandiegogovplanning (in the nav bar as well as the content), but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example, httpwwwsandiegogovcityofvillagesoverviewrootsshtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of wwwsandiegogov rather than limit.

Defense Base Closure and Realignment Commission

I don't think this crawl was very successful. None of the documents in the folders on httpwwwbracgovSupplementalaspx were captured, no public comments after the opening page (httpwwwbracgovBrowseCommentsaspx) were captured, and none of the documents linked from the Browse page (httpwwwbracgovBrowseaspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.

Public Policy Institute of California


There are some problems with the functionality of captured pages. 1) httpwwwppicorgmainhomeasp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) httpwwwppicorgmainallpubsasp: The radio button selections are not functional. For example, clicking on "Date" retrieves the message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."

AFL-CIO

I realize the collection interface is a "work in progress" and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g., "Working Families Toolkit," "BushWatch") that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2-million-plus-hits-per-year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses.

Conclusions

The process of running these test crawls has been valuable. It has resulted in a number of lessons learned and further directions for inquiry. It is clear that the tools that help curators analyze the results of these crawls will have to do more than mimic the original browsing context or supply tables of data. We will be looking for any opportunity to improve WAS reporting capabilities as we move forward with the project. The differences encountered in language and visualization raise the importance of a strong, intuitive design for the curator tools and for clear help screens. Each of us may visualize web sites differently, and the crawler may behave differently than we expect. A certain degree of online help will be needed to design crawls effectively, and further guidance should be available to help people interpret crawl results when those results don't match what the person anticipated.


The help and documentation for the Web Archiving Service will also need to address the rights analysis issues raised above. In most cases this is work that is done prior to issuing crawls; it cannot be addressed by the design of the WAS interface alone. When the Web-at-Risk project reaches the point of conducting usability studies, we should be sure to include tests that further uncover how users understand crawl frequency settings. Additionally, the desire for an analysis tool that can convey when a site has changed significantly is not limited to this project. It is a common issue faced by the Internet Archive, members of the International Internet Preservation Consortium, and others. CDL is actively communicating with these organizations as we all work toward a solution to this problem. Similarly, CDL should ensure that future assessment and analysis work with our curators addresses the issues raised by the two crawl scope settings. This inquiry should also extend to participants who exemplify end users, to determine if people using web archives experience crawl scope differently than people who build them. Finally, a number of lessons were learned via the Katrina crawl described in Appendix B. The most outstanding finding is that event-based crawls, such as Katrina, and site-specific crawls, such as these, have quite different characteristics and require different functionality and analysis tools.

Next Steps

There is certainly more to learn from these crawl results and about Heritrix; in some cases it is still unclear why the crawler failed to retrieve certain documents. The curators' feedback concerning these results has been extremely valuable. They have provided insight as to what was captured and what is still missing, which would have been difficult to determine without their subject expertise in the sites chosen. As we continue working to improve crawler success and performance, we will turn our attention to the Virtual Remote Control site created by Cornell. This site was developed to present particular problems to crawlers and is well documented, enabling the user to gauge a crawler's results. We will be using that site to replicate particular problems raised in our test crawl set, and we may also reattempt some of the sites crawled for these tests. When we release the first version of the Web Archiving Service to curators in July 2006, we will request that they include their original test site in the crawls they attempt and compare the results with these tests. The test results for these crawls will remain in place as a point of comparison while we continue developing the Web Archiving Service. Additionally, certain issues raised by this feedback, particularly the desired scope and frequency of crawls, merit attention in future usability analysis work.


Web-at-Risk Test Crawl Report Appendix A Sites Submitted

Curator | Site | Crawled?
Sherry DeDekker | httpcawaterusgsgov (California Water Science Center) | Yes
Sherry DeDekker | httpwwwdwrwatercagov (California Department of Water Resources) | No
Peter Filardo and Michael Nash | httpwwwnycclcorg (New York City Central Labor Council) | Yes
Peter Filardo and Michael Nash | httpwwwdsausaorg (Democratic Socialists of America) | No
Valerie Glenn and Arelene Weibel | httpwwwstrengtheningsocialsecuritygov (Strengthening Social Security) | Yes
Valerie Glenn and Arelene Weibel | httpwwwbracgov (The Defense Base Closure and Realignment Commission) | Yes
Gabriela Gray | httpwwwjoinarnoldcom (Join Arnold) | Yes
Gabriela Gray | httpwwwantonio2005com (Mayor-elect Antonio Villaraigosa) | Yes
Ron Heckart and Nick Robinson | httpwwwppicorg (Public Policy Institute of California) | Yes
Ron Heckart and Nick Robinson | httpwwwcbporg (California Budget Project) | No
Terrence Huwe | httpwwwaflcioorg (AFL-CIO) | Yes
Terrence Huwe | httpwwwseiuorg (Service Employees International Union) | No
James Jacobs | httpwwwsandiegogovplanning (City of San Diego Planning Department; analyzed by Megan Dreger) | Yes
James Jacobs | httpwwwsandagorg (San Diego Association of Governments) | No
Kris Kasianovitz | httpcityplanninglacityorg (Los Angeles Department of City Planning) | Yes
Kris Kasianovitz | httpwwwscagcagov (Southern California Association of Governments) | Yes
Linda Kennedy | httpcalwatercagov (California Bay-Delta Authority, CALFED) | Yes
Linda Kennedy | httpwwwdfgcagov (California Department of Fish and Game) | No
Ann Latta | httpwwwucmercededu (UC Merced; analyzed by Elizabeth Cowell) | Yes
Ann Latta | httpwwwcoastalcagovweb (California Coastal Commission) | No
Janet Martorana | httpwwwcountyofsborgplandevdefaulthtm (Santa Barbara County Department of Planning and Development) | Yes
Janet Martorana | httpwwwsbcagorg (Santa Barbara County Association of Governments) | No
Lucia Orlando | httpmontereybaynoaagov (Monterey Bay National Marine Sanctuary) | Yes
Lucia Orlando | httpwwwwaterboardscagovcentralcoast (Central Coast Regional Water Quality Control Board) | No
Richard Pearce-Moses | httpwwwazwatergov (Arizona Department of Water Resources) | Yes
Richard Pearce-Moses | httpwwwccecstateazusccecscrhomeasp (Citizen's Clean Election Commission) | Yes
Juri Stratford | httpwwwcitydaviscaus (City of Davis, California) | Yes
Juri Stratford | httpwwwsacogorg (Sacramento Area Council of Governments) | No
Yvonne Wilson | httpwwwocsdcom (The Orange County Sanitation District) | Yes


Web-at-Risk Test Crawl Report Appendix B The Katrina Crawl

The Crawl

During the early fall of 2005, the California Digital Library was embarking on a series of test crawls as part of the Web-at-Risk project when hurricane Katrina struck. CDL had requested that curators submit their sample URLs by August 25th, and on Friday, August 26th, Louisiana Governor Blanco declared a state of emergency. It was over that weekend that CDL staff reached the conclusion that an event of this historic significance would shift our priorities. On Monday we suspended our initial test crawl plans and began preparing on a number of fronts to capture web-based materials related to Katrina. Specifically, this meant:

• identifying which web sites or sub-sites to collect, which involved:
  o notifying 30+ curators of our emergency crawl plans and requesting their assistance
  o setting up a web-based interface to gather their suggested seed URLs
  o sorting through the resulting seed list and feeding it to our primary and secondary (Stanford) crawl operators
• determining our collection parameters
• getting our crawlers installed and configured
• locating and setting up disk space to store the crawls
• initiating and monitoring the crawls

This crawl activity posed several new challenges. It was the first time CDL staff had used the Heritrix crawler, which was not yet installed on a production server when the hurricane struck. The last week of August saw CDL and UC Berkeley staff scrambling to find server space and to set up Heritrix instances, disk partitions, and job parameters for the crawl. This was done with remarkable speed, owing to the urgency of the situation and to everyone's desire to capture a record of the events. Given our limited experience, and in order to reduce the risk of losing the historically significant and fleeting materials related to Hurricane Katrina, CDL worked with Stanford University to concurrently run the same crawl using a different crawler. After both CDL and Stanford had crawled the same seed list for a month, the task was taken up by the San Diego Supercomputer Center, which has continued crawling these sites using Heritrix.

Gathering the Seeds

The CDL sent out an initial request to the Web-at-Risk curators to submit URLs related to Hurricane Katrina for crawling. We worked collaboratively with a large group of content specialists to identify the sites:

• The 22 curators of the Web-at-Risk project (University of California Libraries, University of North Texas, New York University, Stanford University, and the Arizona State Library)
• The Library of Congress
• Librarians at Louisiana State University and the University of Mississippi


Over the course of the crawl, the list of seed URLs grew to over 700 (just over 500 of which were crawled by CDL). The image of our input form above provides some sense of the range of materials collected. Given the sudden nature of this event, there was no time to investigate rights issues or technical problems each site might have presented. We informed the curators that "Our immediate plan is to simply collect the material before it disappears. We will not make the material immediately available." We also had little time for quality control and were not able to guarantee that sites in our seed list would be comprehensively crawled.

Crawling Specifics

CDL's first crawl was run on September 1, 2005, using a seed list of 89 URLs. The final crawl run by CDL was on October 10, using a seed list of 589 URLs. This final seed list was then sent to the San Diego Supercomputer Center, which is continuing to run twice-weekly crawls.


The CDL crawls were done using Heritrix version 1.5.1. We began with a single instance of Heritrix, eventually dividing the seed list among 6 instances. We got through the entire seed list 29 times in 40 days. However, the content we were able to collect was limited by our crawler settings. Our crawler was configured to visit one host at a time, collecting content at a conservative rate, and never more than three hops away from the seed URL. To ensure that the crawler moved through the seed list in a timely way, it was limited to spending 15 minutes at any given host. So the material collected does not represent the entirety of what was available at each site.

Note that all of the seeds were crawled with the same configuration, no matter how different the structure of these sites might be. Thus the New Orleans version of Craig's List, NASA's information pages, and blog sites were all crawled in the same manner, despite being quite different in context, architecture, and other characteristics. Given our short preparation time, the goal was to find a crawler configuration that would be a match for the broadest range of Katrina materials. In most cases, whether the seed URL was for a personal blog or for a government agency, the seed itself was not centrally about Katrina. The Katrina information was generally situated on the front page and top levels of each site, so CDL expected that this setting, even with the 15-minute time limit, would at least capture the content nearest to the front pages of the widest variety of sites we could capture.

In choosing crawler configuration settings we also faced conflicting goals. On one hand, we wanted to capture as much Katrina content as possible. On the other hand, we were very hesitant to start hitting sites that were providing much needed information at a crucial time (emergency sites, relief sites), thus making sites hard to reach due to our capture activity. Many of these sites were already getting heavily used and perhaps were not running at full capacity; some were also geographically impacted directly by the hurricane. So we had to choose settings that balanced the need to collect with politeness across a wide range of sites.

When the San Diego Supercomputer Center took over the Katrina crawls in October, they revisited the Heritrix configuration settings. The SDSC crawl placed a limit not on the amount of time spent at a site but on the number of documents to be retrieved overall. Their crawls are being conducted "breadth first" in the sense of gathering pages consecutively across the seed list, rather than gathering pages from one site at a time.

The Stanford University crawl of the same sites began on September 7th and continued for 30 consecutive days using the WebVac crawler. While we have statistics concerning the size of the Stanford Katrina collection, it is very difficult to compare the configuration settings and effectiveness of WebVac vs. Heritrix. Further, there is no easy way to display materials gathered with WebVac, so the Stanford Katrina content is stored but is not accessible to us at the moment.
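
Splitting a large seed list among several crawler instances is mechanically simple. The sketch below is only an illustration of that step, not CDL's actual tooling; it assumes a plain seeds.txt file with one URL per line and writes one seed file per Heritrix instance.

# Illustrative sketch (not CDL's actual tooling): divide a master seed list
# round-robin among several Heritrix instances. Assumes seeds.txt holds one
# URL per line; writes seeds-instance-1.txt ... seeds-instance-6.txt.
from itertools import cycle
from pathlib import Path

def split_seed_list(seed_file: str, instances: int = 6) -> None:
    seeds = [s.strip() for s in Path(seed_file).read_text().splitlines() if s.strip()]
    buckets = [[] for _ in range(instances)]
    for bucket, seed in zip(cycle(buckets), seeds):
        bucket.append(seed)  # round-robin keeps the instances roughly balanced
    for i, bucket in enumerate(buckets, start=1):
        Path(f"seeds-instance-{i}.txt").write_text("\n".join(bucket) + "\n")

split_seed_list("seeds.txt", instances=6)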

Katrina Crawl Results

In terms of creating a collection of the Katrina event on the web, we are not satisfied that CDL's crawling efforts were successful. As mentioned, we had to impose time limits on each site visited and were not able to conduct quality control on the URLs that were captured. The total size of the Katrina capture at CDL is 50 GB. This represents 29 completed crawls of the seed list, which grew to over 500 seeds over the 40 days we ran the crawl. It's clear from the byte size alone that we barely skimmed the surface, and indeed the Stanford crawl proves this point: when the Stanford crawl was based on about 370 seeds, it was pulling in about 35 GB a day, or 1.5 million pages a day.

In spite of this, our attempts to capture Katrina news and events did have some very useful and positive outcomes. In terms of test crawling, the experience was quite successful, opening our eyes both to certain obstacles and to new possibilities. Most importantly, it prompted us to consider the nature and demands of event-based crawling, which had not been part of our original test plan. Here are some of the aspects of event-based crawls that Katrina surfaced:

Site Selection and Classification

This type of event demands a deep, collaborative effort to identify material to collect. CDL had to mobilize a large number of curators to make their best guesses at identifying promising sites. We set up a 'crawl seed nomination' web form, accessible to the curators from their home institutions, and invited the curators to use it to enter their suggested seed URLs. CDL staff did not have time to add much in the way of automatic error or duplicate checking features, so this cumbersome work was done by hand. On the other hand, it provided us a first-hand trial of what a more general curator user interface might require (a sketch of the kind of duplicate checking that had to be done by hand follows at the end of this section).

The selection and management of seed lists is critical for sudden event-based crawls. The curators contributing the URLs will not necessarily be well-versed in the topic; in the case of Katrina, curators in California were not uniformly familiar with the Gulf Coast, the towns, the government agencies, etc. In addition, it is difficult to predict which aspects of the event will be of historic, enduring value. Because disk storage was not a pressing issue, it was better to err on the side of a wider net when selecting sites. It became clear that there might be a role for "smart crawling," which would spread a wider net than that provided by human-generated seed lists through the use of automated tools that discover relevant materials. We were specifically interested in using tools developed at Emory and Cornell that interface with the Rainbow text classification tool.7 We did not have the resources to investigate this for the Katrina crawl, but plan to fold it into future crawling tests.

Finally, it is worth examining why the seed list grew continuously throughout the event. Part of this growth is due to the natural increase in the number of pages and sites devoted to the emergency, and to having the additional time to identify and add them. However, the nature of the event itself changed over time, starting as a hurricane, then becoming a flood, a massive relocation, and a political and social issue. So the range of relevant sites changed as the event itself took on broader implications. This suggests that site selection is an ongoing process, not strictly an activity undertaken at the beginning of an event.

7 Bergmark, Donna. "Heritrix processor for use with rainbow." <httpgroupsyahoocomgrouparchive-crawlermessage1905>
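
As an illustration of the kind of automatic checking the nomination form lacked, the following sketch (a hypothetical example, not the form CDL built) normalizes nominated URLs and drops obvious duplicates before they reach the master seed list.

# Hypothetical sketch of duplicate checking for nominated seed URLs; this is
# not the CDL nomination form, just an illustration of the idea.
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    url = url.strip()
    if not url.startswith(("http://", "https://")):
        url = "http://" + url              # curators often omit the scheme
    scheme, netloc, path, query, _ = urlsplit(url)
    return urlunsplit((scheme, netloc.lower(), path.rstrip("/") or "/", query, ""))

def dedupe(nominations):
    seen, seeds = set(), []
    for raw in nominations:
        url = normalize(raw)
        if url not in seen:
            seen.add(url)
            seeds.append(url)
    return seeds

print(dedupe(["www.fema.gov/", "http://WWW.FEMA.GOV", "http://www.redcross.org"]))
# -> ['http://www.fema.gov/', 'http://www.redcross.org/']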

Rights, Ownership, and Responsibilities

Our current rights protocol stipulates that when a web page is clearly in the for-profit domain, we must seek permission from the content owners to collect. This restriction applies most notably to news agency sites, which would likely be a critical part of the web-based fallout of any major event; this was certainly the case with Katrina. However, rights negotiation is a time-consuming process. In the case of a planned event such as an election, one is able to anticipate a certain number of sources and take rights management steps before the event occurs. But with emergencies such as Katrina, there is simply no way to secure rights in advance without missing time-sensitive material. Knowing this would be an issue, CDL notified our curators in advance that the Katrina materials we gathered would not be publicly accessible but merely captured and preserved. The complex rights issues behind web archiving, as well as the collaborative environment described above, raise a number of questions:

• Who owns the content?
• Who is responsible for the project?
• Who is responsible for responding if a content owner objects?
• When is it permissible to ignore robots exclusion files?
• How much rights management can be done as part of post-processing for an event crawl?
• Can there be standing agreements with major news agencies to allow for this type of crawl in emergency situations? If so, what constitutes an emergency?

Technical Infrastructure

CDL did not have an adequate technical infrastructure in place at the time of the Katrina crawl. As mentioned above, staff at both CDL and UC Berkeley scrambled to find server space and to install and configure the crawlers. This sudden shift in workload left the system vulnerable and somewhat unstable. In order to start the crawling process as soon as possible, CDL began the project on a temporary server. Making the transition to more permanent storage in the midst of this crawling project was neither easy nor flawless, and a certain amount of data had to be recovered from backup sources.

Information Analysis and Display

The challenge of analyzing crawl results has been described in some detail in our Test Crawl Report. That report cites a curator who found the task of reviewing an 8,899-document crawl to be unmanageable. The tools currently available are not at all up to the task of analyzing a large and complex crawl. An event-based crawl is likely to result in massive amounts of data of widely varying quality. The selection of seeds is based on guesses that, given sufficient resources, should be reviewed, refined, and enhanced as the event progresses. Time series data based on changing input parameters represents a kind of moving target that suggests the need to develop new analysis tools. Andreas Paepcke of Stanford University addressed this problem when considering how a social scientist might want to analyze the Katrina materials. Consider the researcher who is interested in finding out how quickly the notion of race entered the public discourse in the aftermath of Katrina. Paepcke suggests:

"For example, the tool could perform word frequency analysis across successive crawls of Katrina sites, so that the user could identify the emergence of word usage such as 'looter', 'ninth ward', etc. The social scientist would then define within the tool clusters of words that in the scientist's judgment are direct or indirect occurrence indicators of the concept 'race' (i.e., ninth ward, looters, lacks, poor). The scientist could subsequently interact with the tool at the level of these well-defined concepts. Example command: count occurrences of 'race' within the first three paragraphs of all pages and tabulate the differences across 10 days. We could go further and apply well-known topic categorization algorithms on the data to suggest new concepts as they arise in the series of text streams. For example, the tool might identify an increasing frequency of the term 'mismanagement' beginning in week two after the storm and alert the sociologist to this evidence of a new discourse theme." 8

While we don't expect to be able to incorporate this level of analysis into our web archiving tools any time soon, these ideas illustrate a useful direction for archival analysis tools.
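
As a very rough sketch of this idea (not a Web-at-Risk tool), and assuming page text has already been extracted from each dated crawl into directories of plain .txt files, an indicator-term cluster could be tallied across snapshots like this:

# Rough sketch of the word-frequency idea, assuming extracted page text lives
# in snapshots/<date>/*.txt; the indicator terms are only an example cluster.
import re
from collections import Counter
from pathlib import Path

INDICATORS = ("looter", "looters", "ninth ward", "poor")

def concept_counts(snapshot_dir: Path) -> Counter:
    counts = Counter()
    for page in snapshot_dir.rglob("*.txt"):
        text = page.read_text(errors="ignore").lower()
        for term in INDICATORS:
            counts[term] += len(re.findall(r"\b" + re.escape(term) + r"\b", text))
    return counts

for day in sorted(Path("snapshots").iterdir()):
    print(day.name, sum(concept_counts(day).values()))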

Conclusions

At the time Hurricane Katrina hit, the Web-at-Risk service requirements and test crawl plan had been written for a more orderly and considered approach to web crawling. Our attempt to capture the web-based aftermath of Katrina highlighted our need to revisit the service requirements and see how well they supported an event-based response. We need to develop a collaborative and robust mechanism to identify and collect seed URLs. When an event happens quickly, it is also important to have in place general crawling guidelines that will result in a well-rounded collection. The technical infrastructure for the Web Archiving Service that we are building for the Web-at-Risk project should be robust enough to handle occasional and sudden bursts of activity. Attempting to alter the infrastructure quickly in reaction to emergency events impairs our responsiveness and leaves the altered systems in a vulnerable state.

8 Paepcke, Andreas, Senior Research Scientist and Director of the Digital Library, Stanford University. Email correspondence with Patricia Cruse, October 26, 2005.


Finally, work needs to continue on both the rights management front and on developing improved web archiving analysis tools, so that the material gathered can be used to its greatest potential.


Web-at-Risk Test Crawl Report: Appendix C Individual Crawl Reports

Included below are the crawl reports provided to individual curators, including their analysis and feedback about those results. Note that a key to interpreting the tables in these crawl reports is provided at the end of this appendix.

Elizabeth Cowell (submitted by Ann Latta): UC Merced

CDL Report to Curator

URL: httpwwwucmercededu

Curator's original comments: "UC Merced is the first research university to be built in the 21st century. The educational and land use issues are significant. Of particular interest is httpwwwucmercedplanningnet. This site addresses major issues of land use - the university is being built on agricultural land. Controversy existed re issues of redevelopment of downtown Merced vs. appropriation of agricultural land - there are major environmental issues focused on endangered species - educational issues involving faculty job descriptions, student body, etc. are significant because of the economic, ethnic and cultural diversity of the region."

Site copyright statement: "© 2004 UC Regents"

Crawl Results


Comments from crawl operator: When we set the crawl to include pages from linked sites, the crawler got "trapped" at the Elsevier site. There is JavaScript on that linked page that causes the crawler to continue looking for additional pages on the Elsevier site, even when you're only trying to capture a single page. Once we set a limit for the maximum number of retry attempts, the crawl completed. This data is from the completed crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 227 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
969 71552369 wwwucmercededu
238 2564803 wwwucopedu
226 14851 dns
197 16583197 wwwuniversityofcaliforniaedu
156 8487817 wwwelseviercom
151 1437436 wwwgreatvalleyorg
112 2354582 facultyucmercededu
105 5659795 wwwpacificedu
90 111985 k12ucopedu
86 255733 www-cmsllnlgov
85 1178031 admissionsucmercededu
81 297947 uc-industryberkeleyedu
71 108265 wwwmssmfoundationorg
67 349300 wwwnpsgov
66 308926 wwwusafreedomcorpsgov
54 137085 slugstoreucscedu
52 52202 wwwcerrocosoedu
51 977315 wwwuniversityofcaliforniacom

Curator Feedback to CDL (Cowell, UC Merced)

Crawl Success: mostly effective
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: 10 more links came up in the via search; 8 of these were not useful for the research of land use issues. The two that were useful were a benefit of the via search.
Crawl Frequency: monthly


Sherry DeDekker: California Water Science Center

CDL Report to Curator

URL: httpcawaterusgsgov

Curator's original comments: "We are interested in the CA water reports and databases behind the links on this page. Also, in looking at httpwaterdatausgsgovcanwisnwis, this section is an interactive interface to multiple data sets. Is it possible to capture this type of site as well as the static reports?"

Site copyright statement: "Information presented on this website is considered public information (unless otherwise noted) and may be distributed or copied. Use of appropriate byline/photo/image credit is requested. We strongly recommend that USGS data be acquired directly from a USGS server and not through other sources that may change the data in some way."

Crawl Results

Comments from crawl operator: Databases, e.g. httpwaterdatausgsgovcanwisnwis, are disallowed by httpwaterdatausgsgovrobotstxt. Some water data reports are NOT caught by the broader "linked hosts included" settings, e.g. httppubsusgsgovwdr2004wdr-ca-04-1 -- would also want to submit httpcawaterusgsgovwaterdata as a seed.
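
Heritrix honors robots.txt by default, so an exclusion like this removes whole sections of a site from the crawl. A candidate seed can be checked against a site's robots.txt ahead of time with Python's standard robotparser; the host and path below are reconstructions of the garbled URLs above and are given only as an example of the check, not as verified paths.

# Illustrative check of a seed URL against a site's robots.txt before crawling.
# The host and path are assumptions based on the curator's (garbled) URLs.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://waterdata.usgs.gov/robots.txt")
rp.read()                                    # fetch and parse robots.txt

seed = "http://waterdata.usgs.gov/ca/nwis/nwis"
print(seed, "allowed" if rp.can_fetch("*", seed) else "disallowed by robots.txt")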


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 662 additional sites were crawled. The following list includes the hosts that supplied more than 50 files. Note that the host "pubsusgsgov" supplied a higher number of files than the original host itself.
[urls] [bytes] [host]
1963 255912820 pubsusgsgov
1153 47066381 cawaterusgsgov
698 56570 dns
404 112354772 geopubswrusgsgov
385 9377715 waterusgsgov
327 203939163 greenwoodcrusgsgov
318 17431487 wwwelseviercom
219 3254794 wwwusgsgov
189 2737159 wwwlsuedu
163 2292905 wrgiswrusgsgov
158 31124201 wwwepagov
149 921063 wwwusdagov
[list truncated...]

Curator Feedback to CDL (DeDekker, California Water Science Center)

Crawl Success: somewhat effective
Crawl Success Comments: The site appears to access water data reports (httpcawaterusgsgovarchivewaterdataindexhtml), but none are actually available through the links. I expected the site to not be able to access real-time data, but these are archived reports.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly


Megan Dreger (submitted by James Jacobs): City of San Diego Planning Department

CDL Report to Curator

URL: httpwwwsandiegogovplanning

Curator's original comments: "I would like to drill down several levels (at least 3) of this site. For example, following the link to City of Villages/general plan update leads to many more important planning documents."

Site copyright statement: This site contains the two following notices on the same page:
Restrictions on Use of Materials: "This site is operated and maintained by the City of San Diego through its Department of Information Technology and Communications (referred to as IT&C). Except as provided herein, no material or information from this site may be copied, reproduced, republished, uploaded, posted, transmitted, or distributed except as authorized in this notice, expressly authorized within this site, or approved in writing by IT&C."
Copyright Notice: "Unless a copyright is indicated, information on the City of San Diego Web site is in the public domain and may be reproduced, published or otherwise used with the City of San Diego's permission. We request only that the City of San Diego be cited as the source of the information and that any photo credits, graphics or byline be similarly credited to the photographer, author or City of San Diego, as appropriate. If a copyright is indicated on a photo, graphic, or any other material, permission to copy these materials must be obtained from the original source."

Crawl Results


Comments from crawl operator: Need feedback about whether the desired content was retrieved.
Question for curator: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 1084 additional hosts were crawled. (This figure does not represent a complete crawl, as the size limitation was reached.) The following hosts supplied more than 75 documents to your site:
[urls] [bytes] [host]
3728 556231640 wwwsandiegogov
1247 38685244 genesissannetgov
1085 80905 dns
807 6676252 wwwhoustontexanscom
428 1079658 wwwcacitiesorg
399 102298888 wwwbuccaneerscom
259 1797232 granicussandiegogov
258 42666066 clerkdocsannetgov
238 5413894 wwwccdccom
225 2503591 wwwciel-cajoncaus
223 1387347 wwwiplorg
217 2683826 wwwsdcountycagov
203 11673212 restaurantssandiegocom
195 2620365 wwwsdcommutecom
192 1344523 wwwbengalscom
189 2221192 wwwkidsdomaincom
176 1333528 wwwbuffalobillscom
171 685965 wwwchumpsoftcom
166 277238 wwwproquestcom
[list truncated...]


Curator Feedback to CDL (Dreger, San Diego)

Crawl Success: mostly effective
Crawl Success Comments: This crawl was not completed due to size, so that may explain some of my questions. It was pretty effective in terms of getting the Planning Dept pages, but went out further than I expected. Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (wwwproquestcom, wwwinfopeopleorg) that I believe are included because they are listed on the public library's pages (wwwsandiegogovpublic-library). So the crawl appears to include not just the pages linked from wwwsandiegogovplanning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example httpwwwsandiegogovcityofvillagesoverviewrootsshtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of wwwsandiegogov rather than limit. It's nice that PDFs and other formats are included.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The linked hosts included (via) crawl seemed to include more extraneous stuff.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
A: You asked for a better definition of what was meant by the request to drill down several levels (at least 3). Unfortunately, I wasn't the original curator and I'm not sure what he meant. The crawl you did was very useful.
Crawl Frequency: monthly
Questions/Comments about crawl: I think that the crawl frequency should be at least monthly for these pages.


Peter Filardo and Michael Nash: New York City Central Labor Council

CDL Report to Curator

URL: httpwwwnycclcorg

Curator's original comments: (none)

Site copyright statement: "© 2004 New York City Central Labor Council. No portion of this website may be reproduced in any form without permission from the Central Labor Council. Contact our offices for more information at nycaflcioaolcom."

Crawl Results

NOTE: Because your Crawl "A" had to be stopped, then resumed, each of your reports for that crawl is in two segments. To browse a list of all reports for that crawl, go to httpvorocdliborg8081ingest_miscndiipptestcrawls_rawfilardo_labor_via

Comments from crawl operator:
A. Linked hosts included: Crawl complete after recovery, with the addition of max retries. Seemed to hang at httpwwwnycclcorgcalendareventaspEventId=501 and httpwwwnycclcorgassetsHLCapplicationmembershippdf. Ended the crawl when it seemed to hang; recovered from the previous job, and the recovery was successful. Note for the future that a recovered job is identifiable because the logs directory is called logs-R.


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, x additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
1913 74260017 wwwnycclcorg
156 11755 dns
115 710552 wwwaflcioorg
73 1477966 wwwcomptrollernycgov
71 193264 wwwempirepagecom
60 570115 wwwredcrossorg
58 269079 wwwafl-cioorg
57 240845 wwwcampsussexorg
57 113676 wwwmssmedu
56 449473 wwwlabor-studiesorg
53 184605 wwwpbbcorg
52 134326 wwwsenategov
[list truncated...]

Curator Feedback to CDL (Filardo, NYCCLC)

None provided


Valerie Glenn and Arelene Weibel: Strengthening Social Security

CDL Report to Curator

URL: httpwwwstrengtheningsocialsecuritygov

Curator's original comments: "contains external links to audio & video that would be essential to completing this site (see press room); some are files, some are links to webcasts, some are on external gov sites, and some are external com sites"

Site copyright statement: Copyright info not found

Crawl Results

Comments from crawl operator: Interesting audio/video/webcast content; need feedback about success in capturing it.
Questions for Curator: How successfully did this crawl capture the multimedia documents you were interested in?
Comments from coordinator: In the media center area of this site (httpwwwstrengtheningsocialsecuritygovpressmedia_centershtml) I've found ram (both video and audio alone), smil, and asx files. The site also contains numerous ppt and pdf files.
• A text search on the log file turns up numerous ram files and only 1 ppt file.
• asx files are Windows streaming media redirector files, which generally lead to associated wmf files. No asx references appear in the crawl log, nor do any wmf files.
• Similarly, smil files are used to control and point to associated media files, in this case rm files. We are getting the smil files but not the rm files.
I assume that when displayed, some of the real media files from this site would function, but many of the other multimedia files would not.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 388 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
660 10668874 wwwchelseapierscom
562 7334035 wwwwhitehousegov
477 6366197 wwwlaopinioncom
391 29623 dns
356 3874719 wwwwkrccom
243 12294240 wwwstrengtheningsocialsecuritygov
178 1935969 wwwxavieredu
148 237055 imagecomcom
127 682069 onlinewsjcom
117 898439 wwwomahacom
116 514995 wwwnprorg
108 995733 wwwnbacom
[list truncated...]
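
The coordinator's check above (searching the crawl log for media extensions) is easy to script. The sketch below is only illustrative; it assumes a Heritrix-style crawl.log in which each fetched URI appears as a whitespace-separated field on its own line.

# Rough sketch: tally media-file extensions recorded in a crawl log so gaps
# (for example the missing rm files) stand out. Assumes the fetched URI is a
# whitespace-separated field on each log line, as in Heritrix 1.x crawl.log.
from collections import Counter

MEDIA_EXTS = (".ram", ".rm", ".smil", ".asx", ".wmv", ".mp3", ".ppt", ".pdf")

def media_counts(log_path: str) -> Counter:
    counts = Counter()
    with open(log_path, errors="ignore") as log:
        for line in log:
            for field in line.split():
                if field.startswith("http") and "." in field:
                    ext = "." + field.rsplit(".", 1)[-1].lower()
                    if ext in MEDIA_EXTS:
                        counts[ext] += 1
    return counts

print(media_counts("crawl.log"))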

Curator Feedback to CDL (Glenn, Strengthening Social Security)

Crawl Success: mostly effective
Crawl Success Comments: My main concern about this site was the multimedia documents included; I've posted those thoughts in the Questions for Curator text box.
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: For this site it was essential to capture the linked hosts (via) because many of the press materials etc. were on external sites.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: How successfully did this crawl capture the multimedia documents you were interested in?
A: I'm disappointed that not all of the multimedia files were captured, but there seem to be only a few that aren't included. I found it interesting that broadcasts from the same host (whitehousegov) weren't completely captured: some were, some weren't.
Crawl Frequency: once
Questions/Comments about crawl: [note: the curator alludes to adding a comment here, but no comment was received]


Valerie Glenn and Arelene Weibel: Defense Base Closure and Realignment Commission

CDL Report to Curator

URL: httpwwwbracgov

Curator's original comments: "In our previous efforts we have been unable to capture agency databases. The BRAC site includes a document library which has a search feature (httpwwwbracgovSearchaspx) and a browse feature (httpwwwbracgovBrowseaspx). We would really like to see how this information can be captured so that we can recreate it on our own servers."

Site copyright statement: "The contents of all material available on this Internet site are in the public domain and are not copyrighted. The content of this site may be freely reproduced, downloaded, disseminated, published, or transferred in any form and by any means. However, in some cases the copyright for certain text or images on this site may be held by other parties."

Crawl Results

Comments from crawl operator:
A. Linked hosts included: httpwwwbracgovSearchaspx can't be captured by Heritrix. httpwwwbracgovBrowseaspx seems to only capture the first 25 documents. Tried again with the browse page as the starting point, but the crawl stopped after 1005 documents; it extracted 20 links from the browse page and then there were no more URLs in the frontier queue which had been extracted from browseaspx -- perhaps this needs more experimentation.
B. Restricted to original host: Again, only the first 25 pages from browse were captured; we can't even successfully pass a seed URL listing the max docs per browse page (50).
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 44 additional hosts were crawled. Note that because your crawl did not complete, this is not an accurate count of how many other sites your original site may link to.
[urls] [bytes] [host]
2034 1064389540 wwwbracgov
555 5874934 wwwsluedu
87 173510 wwwcpccedu
54 154588 wwwwmatacom
47 685158 wwwsluhospitalcom
44 3501 dns
44 582555 wwwc-spanorg
43 174467 wwwadobecom
38 178153 wwwq-and-aorg
32 127325 slubkstorecom
24 140653 wwwc-spanclassroomorg
23 326680 wwwcapitalnewsorg
22 213116 cancercentersluedu
21 196012 wwwdefenselinkmil
[list truncated...]
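
A common workaround for browse pages that expose only their first screen of results is to seed the crawler with every page of the listing explicitly. The sketch below is purely illustrative: the page and pagesize parameter names are hypothetical (the report does not record the actual Browseaspx query string), so it shows the shape of the approach rather than a recipe for this particular site.

# Hypothetical sketch: pre-generate seed URLs for each page of a paginated
# browse listing. The parameter names (page, pagesize) are invented for
# illustration; the real query string would have to be inspected first.
def browse_page_seeds(base: str, total_docs: int, page_size: int = 25):
    pages = (total_docs + page_size - 1) // page_size      # ceiling division
    return [f"{base}?page={n}&pagesize={page_size}" for n in range(1, pages + 1)]

for seed in browse_page_seeds("http://www.brac.gov/Browse.aspx", total_docs=1005):
    print(seed)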

Curator Feedback to CDL (Glenn, Defense Base Closure)

Crawl Success: somewhat effective
Crawl Success Comments: I don't think this crawl was very successful. None of the documents in the folders on httpwwwbracgovSupplementalaspx were captured, no public comments after the opening page (httpwwwbracgovBrowseCommentsaspx) were captured, and none of the documents linked from the Browse page (httpwwwbracgovBrowseaspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: once
Questions/Comments about crawl: One of the reasons I submitted this site to be crawled is that I'd already tried to harvest it using HTTrack. That product appeared to capture more of the documents than Heritrix, but we still haven't been able to capture the entire site.


Gabriela Gray: Join Arnold

CDL Report to Curator

URL: httpwwwjoinarnoldcom

Curator's original comments: "California Recovery Team. Non-profit pro-Arnold group, not registered as a campaign committee. Critical aspects: complex file and directory naming structure; looping, e.g. contactusasp and contactaddasp links."

Site copyright statement: "Copyright 2005"

Crawl Results

Comments from crawl operator:
A. Linked hosts included: Great site for testing -- this loop is really interesting because a new URL is generated with each loop, so that the duplicate-detection underway at IA would still not eliminate it. 44332 of the retrieved URLs were contact pages.
B. Restricted to original host: Got into a loop by the end; 999 documents retrieved in 34 minutes.
C. Restricted to original host + regular expression:


Excluding pages that matched the regular expression contactaddaspc= did not end the loop. What did end the loop: excluding both contactus and contactadd pages so they were not retrieved -- a drawback. (IA takes the manual approach of gathering the pages, then having an operator stop the crawl and take out the looping URLs by hand -- not scaleable.)

<newObject name="contact" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
  <string name="decision">REJECT</string>
  <string name="regexp">contactaspc=</string>
</newObject>

Related hosts crawled: Because of looping problems, we were not able to crawl other hosts linked from this site.
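
Loops like this usually betray themselves in the crawl log as a single path that keeps producing new query strings. The sketch below is a rough illustration (not an IA or Web-at-Risk tool): it groups logged URLs by host and path and flags any path with an unusually large number of query-string variants, which is how the contact-page trap would surface. The example URLs are hypothetical reconstructions.

# Rough sketch: flag likely crawler traps by counting query-string variants
# per host+path in a list of logged URLs. The example URLs are hypothetical.
from collections import Counter
from urllib.parse import urlsplit

def trap_candidates(urls, threshold: int = 100):
    per_path = Counter()
    for url in urls:
        parts = urlsplit(url)
        if parts.query:                      # only query-generated variants matter here
            per_path[(parts.netloc, parts.path)] += 1
    return [(host, path, n) for (host, path), n in per_path.most_common() if n >= threshold]

logged = [f"http://www.joinarnold.com/contactadd.asp?c={i}" for i in range(250)]
print(trap_candidates(logged))
# -> [('www.joinarnold.com', '/contactadd.asp', 250)]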

Curator Feedback to CDL (Gray, Arnold)

Crawl Success: somewhat effective
Crawl Success Comments: We spot-checked, and it looks like most files were captured, but the individual pages don't display most of the images. (This may simply be a problem with the WERA interface.) Strangely enough, the Flash files work perfectly, which is exactly the opposite of our own capture experience.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: As per the crawl notes, we only checked the original host version, since the via crawl failed.
Crawl Frequency: once
Questions/Comments about crawl: Same comments as on the Villaraigosa crawl. We're inexperienced with this type of capture: archive files as-is and use a tacked-on script and a special server-side interface to interpret links in the new environment. Our model has been to actually alter the internal links from absolute to relative formats so that it works in any environment.


Gabriela Gray: Mayor-Elect Villaraigosa

CDL Report to Curator

URL: httpwwwantonio2005com

Curator's original comments: "Critical aspects: Flash animation; content scattered across multiple servers; maintaining complex internal link structure; JavaScript menus; streaming media"

Site copyright statement: "©2005 Villaraigosa for Mayor 2005"

Crawl Results

Comments from crawl operator:
• (For linked hosts results) Need feedback on media etc. retrieved; this site is an ideal example of the need for scope+one.
• (For restricted to original host) How much was left out due to the domain restriction? Need feedback.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 263 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:
[urls] [bytes] [host]
817 10291631 ensim3interlixcom
805 117538973 wwwantonio2005com
472 6333775 wwwlaopinioncom
265 21173 dns
110 19355921 www2dailynewscom
100 16605730 www2dailybulletincom
95 1410145 wwwamericanpresidentsorg
86 820148 wwwdailynewscom
73 168698 wwwchumpsoftcom
72 52321 imagesibsyscom
69 836295 wwwlaobservedcom
65 137700 wwwmysqlcom
55 213569 wwwensimcom
55 177141 wwwlamayorcncom
55 296311 wwwsurveyusacom
53 495858 abclocalgocom
52 522324 wwwc-spanorg
51 244668 gallerymenaltocom
[list truncated...]

Curator Feedback to CDL (Gray, Villaraigosa)

Crawl Success: mostly effective
Crawl Success Comments: Doing some spot checks, it looks like all of the pages were captured. Some problems with media files: WERA shows them when we search, but the files are often size 0. In addition, many files on external servers are listed and even have some descriptive info, but when we click on Overview it says "Sorry, no documents with the given uri were found," so no idea if they were really captured.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: Most of the links to other sites go to pages that truly are external to the site, not incorporated into it. The only exception would be the pages from ga3org and ga4org.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): The results on the multiple hosts crawl are mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains, rather than source + 1. Also, there seem to be a lot of links going to ensim3interlixcom; there were more files captured from there than from antonio2005. This is a case where it looks like the entire site (or much of it) was mirrored on two different servers. When we find these we often try to collapse them into one seamless whole, eliminating the duplication, which is meaningless from the user's viewpoint. Simply capturing both mirrors and leaving the cross-links intact is an option we've used when we can't collapse, but it often leads to problems with links between the two, which seems to be the case here.


Crawl Frequency: once
Questions/Comments about crawl: We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE; the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.


Ron Heckart and Nick Robinson: Public Policy Institute of California

CDL Report to Curator

URL: httpwwwppicorg

Curator's original comments: "We are particularly interested in their publications. We hope the crawler will be able to report when new publication files are posted on the website. Our main focus of interest is on their new publications at httpwwwppicorgmainnewpubsasp."

Site copyright statement: "All Contents © Public Policy Institute of California 2003, 2004, 2005"

Crawl Results

Comments from crawl operator: We can't at the moment use Heritrix to report on new publications posted.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 238 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:
[urls] [bytes] [host]
2421 324309107 wwwppicorg
433 1367362 wwwcacitiesorg
238 19286 dns
229 4675065 wwwicmaorg
200 598505 bookstoreicmaorg
151 1437436 wwwgreatvalleyorg
144 517953 wwwkfforg
137 5304390 wwwrfforg
113 510174 www-hooverstanfordedu
102 1642991 wwwknowledgeplexorg
97 101335 cdnmapquestcom
81 379020 wwwcdecagov
73 184118 wwwilsgorg
68 4539957 caagstatecaus
62 246921 wwwmilkeninstituteorg
[list truncated...]

Curator Feedback to CDL (Heckart, PPIC)

Crawl Success: mostly effective
Crawl Success Comments: There are some problems with the functionality of captured pages. 1) httpwwwppicorgmainhomeasp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) httpwwwppicorgmainallpubsasp: The radio button selections are not functional. For example, clicking on "Date" retrieves the message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Crawl Frequency: weekly
Questions/Comments about crawl: Our crawl report included the following comment from the crawl operator: "We can't at the moment use Heritrix to report on new publications posted." The ability to report on new publications is critical to our goal of using the crawler as a discovery tool. What are the prospects for providing this functionality in the future?
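
Heritrix itself does not report newly posted publications, but successive crawl logs can approximate that. The sketch below is a possible approach rather than an existing Web Archiving Service feature: it assumes two Heritrix-style crawl logs and lists document URLs (for example PDFs) that appear only in the newer log.

# Possible approach (not an existing service feature): diff two crawl logs to
# list newly discovered publication files. Assumes each log line contains the
# fetched URI as a whitespace-separated field.
def fetched_uris(log_path: str) -> set:
    uris = set()
    with open(log_path, errors="ignore") as log:
        for line in log:
            uris.update(f for f in line.split() if f.startswith("http"))
    return uris

def new_publications(old_log: str, new_log: str, exts=(".pdf", ".doc")):
    fresh = fetched_uris(new_log) - fetched_uris(old_log)
    return sorted(u for u in fresh if u.lower().endswith(exts))

for url in new_publications("crawl-2005-11.log", "crawl-2005-12.log"):
    print(url)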


Terry Huwe: AFL-CIO

CDL Report to Curator

URL: httpwwwaflcioorg

Curator's original comments: "This site is content rich and has many files that will be useful in the future. Specific areas that are of special interest follow below. httpwwwaflcioorgcorporatewatch: the data related to executive pay watch is especially useful. httpwwwaflcioorgmediacenter: would like to see press stories captured if possible. httpwwwaflcioorgissues: links to newsletters and original content. Also, the 'Legislative Action Center' on the home page; this is a useful topic guide to legislative history from a labor perspective."

Site copyright statement: "Copyright © 2005 AFL-CIO"

Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 2572 additional hosts were crawled. The following hosts supplied more than 75 URLs to your site:
[urls] [bytes] [host]
12702 481956063 wwwaflcioorg
2657 184477 dns
1375 35611678 wwwlocal237teamsterscom
570 8144650 wwwillinoisgov
502 52847039 wwwiloorg
435 3851046 wwwciosloritorg
427 2782314 wwwnolacom
401 8414837 www1paperthincom
392 15725244 wwwstatehealthfactskfforg
326 4600633 wwwdolgov
288 12303728 searchoxidecom
284 3401275 wwwsikidscom
280 3069385 wwwwashingtonpostcom
272 1480539 wwwcdcgov
235 5455692 wwwkfforg
[list truncated...]

Curator Feedback to CDL (Huwe, AFL-CIO)

Crawl Success: effective
Crawl Success Comments: I realize the collection interface is a work in progress and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g. Working Families Toolkit, BushWatch) that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2 million-plus hits per year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses of the toolkit. Having said all that, my short answer is: I'm stoked :-)
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In the case of this collection there's a lot of original content (and context-making documents) on the linked sites, simply due to the hierarchical/federal nature of the labor movement.
Crawl Frequency: monthly
Questions/Comments about crawl: I may have missed correspondence on this question, as I've been busy with a library renovation here. My question: Is the 1 gig limit a useful one? I couldn't help wondering what the actual figure would be if that limit were set higher. I'm generally interested in the total size of Web collections, as I need to monitor ours (which is content-rich but even so does not exceed 1 gig). So discussion of this at least would be interesting, and maybe, if it's not technically challenging, that 1 gig level might be raised and we could see what happens. Thanks to all involved--very interesting process.


Kris Kasianovitz: Los Angeles Dept. of City Planning

CDL Report to Curator

URL: httpcityplanninglacityorg

Curator's original comments: "Website uses frames. Most of the documents will be pdfs. Of particular interest: the EIRs, which are not archived on the page once the project is approved (see httpcityplanninglacityorgEIRTOC_EIRhtm), and the General and Community Plans: httpcityplanninglacityorgcomplangen_plangenplan2htm, httpcityplanninglacityorgComPlancpbpagehtm."

Site copyright statement: No copyright information found

Crawl Results

Comments from crawl operator: (The linked hosts included crawl) ended because it ground on for 3 days without hitting the data limit; not sure if the URLs at the end of the log are valid/useful.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 119 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
10493 840876945 cityplanninglacityorg
601 5156252 metrolinktrainscom
183 644377 wwwcrnpsgov
121 11162 dns
90 977850 wwwmetrolinktrainscom
81 1207859 wwwftadotgov
79 263432 wwwfypowerorg
66 333540 wwwadobecom
64 344638 lacityorg
63 133340 cerescagov
60 274940 wwwamtrakcom
59 389217 wwwnhtsadotgov
58 347752 wwwunitedweridegov
52 209082 wwwdotgov
52 288783 wwwnationaltrustorg
51 278949 wwwportoflosangelesorg
[list truncated...]

Curator Feedback to CDL (Kasianovitz, LA City Planning)

Crawl Success: mostly effective
Crawl Success Comments: The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all pdfs. When searching specifically for EIRs, I got a large result list (699 citations); however, when I investigated whether or not the actual file was captured, I found that the main EIR page was captured (typically a htm/html file), but when I clicked on a link to get to the full report, all I got was the "Sorry, no documents with the given URI were found" message. This could be that the file was no longer available when the site was harvested; however, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved. Here are a few specific examples:
Final EIR directory: httpcityplanninglacityorgEIRTocfeirhtm
Sierra Canyon Secondary School (cover page): httpcityplanninglacityorgEIRSierraCyn2ndSchoolSierraCyn_coverpghtm. Access to the Draft EIR and Final EIR is provided from this cover page. Within the system, the links to both the Draft and Final are broken/no documents with that given URI: httpcityplanninglacityorgEIRSierraCyn2ndSchoolDEIRTable of Contentshtm
Villa Marina EIR: httpcityplanninglacityorgEIRVillaMarinaVillaMarina_coverpghtm
Directory of EIR notices of preparation: httpcityplanninglacityorgEIRNOPsTOCNOPHTM


This provides links to EIR preparation notices (all pdfs), a total of 27 links/documents. I encountered the following three issues:

• pdf opened = when clicking on the link to the notice, the pdf opened with no problem: 16 of 27.

• "Sorry, no document with the given uri was found" = no pdf harvested, but I could get to it from the live site: 4 of 27.

• Acrobat "could not open" message (could open the live page outside of WERA) = the following Acrobat message came up when I tried to open the pdf: "Acrobat could not open ENV-2005-0881-EIR[1].pdf because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded). To create an Adobe Acrobat PDF document, go to the source application. Then print the document to Adobe PDF." I copied and pasted the url into a regular browser and could open the pdf with no problem. Also, in a few cases, if I clicked on the GO button after the first attempt to open the pdf in the system, it seemed to launch: 7 of 27.

Conversely, I found a number of pages that contained full documents in html, with links to pdfs that worked with no problem. See the following document: httpcityplanninglacityorgcwdgnlplntranseltTET2Bkgrndhtm. File types and error codes were what I expected.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The crawl restricted to the original host is more helpful/relevant for getting to the materials from a specific agency. For some of the results that were restricted to host, I was getting external links. Here are some comparisons for each of the crawl settings.
Searched for "villa marina":
LA Dept of City Planning: 6 results
httpcityplanninglacityorgEIRNOPsENV-2004-3812-EIRpdf
httpcityplanninglacityorgEIRVillaMarinaVillaMarina_coverpghtm
httpcityplanninglacityorgEIRNOPsTOCNOPHTM
httpcityplanninglacityorgEIRTocfeirhtm
httpcityplanninglacityorgcomplanpdfplmcptxtpdf
httpcityplanninglacityorgCwdGnlPlnHsgEltHETblFigApVHgSithtm
LA City Dept of Planning (via): 2 results
httpcityplanninglacityorgcomplanpdfplmcptxtpdf
httpcityplanninglacityorgCwdGnlPlnHsgEltHETblFigApVHgSithtm
Searched for "eir":
LA Dept of City Planning: 699 results
LA City Dept of Planning (via): 324 results


For both of these searches, the URIs were from cityplanninglacityorg.
Searched for "transportation":
LA Dept of City Planning: 699 results
LA City Dept of Planning (via): 290 results (most are from external sources and tended to be the index or main page of another agency or organization. Because this just got me to the main page and none of the links functioned at that level, the via result was less helpful. However, the via results are useful for discovering other agencies or organizations that I should be looking at for materials.)
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): The list of linked hosts provided at the end of the report is helpful. This will help me target other agencies that might be key places to check for further collection development.
Crawl Frequency: monthly
Questions/Comments about crawl: I want to qualify the frequency for this site. I'd like to do a monthly crawl for 3-4 months; then I'd want to reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot of new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation. After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report, there are 8899 documents in this crawl alone; my other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls. I should note that at this point in time, for local documents, I am more interested in individual documents, more so than in capturing an entire website and preserving the functionality. Local agencies (with the exception of perhaps the Mayor's site) tend not to change the design (look and feel) very often or have a very sophisticated design (i.e., flash, changing images, etc.), and I don't see that this would be of interest to researchers' needs; it is the content (reports, maps, etc.) that is contained/accessed on the websites that is important. Maybe I'm wrong or being short-sighted about that.


Kris Kasianovitz: Southern California Association of Governments

CDL Report to Curator

Curator's original comments: "This is a critical regional agency for Los Angeles, Orange, Ventura, Imperial, Riverside, and San Bernadino counties. Its main areas are Transportation, Housing, and Economic Development. This will provide an analysis of the overall site, which has a lot of content. Publications/reports are typically in pdf - they are presented as full reports and pieces of the report (for easier downloading), so there might be duplication; the full report is really all that would be needed. See httpwwwscagcagovpublications. The Resources page contains the pdfs, images, dynamic content, and gis programs, including an interactive atlas: httpwwwscagcagovresourceshtm. One part of the Resource site is the Web Accessible Geographic Data Search (WAGS): httpmapsvrscagcagovwagsindexcfmfuseaction=. It requires a user-created login and password (although there is a guest login that allows you to bypass this). I'm not sure what kind of difficulty the harvester will encounter with this portion of the site. The interactive atlas also has a create-an-account/guest login issue: httpmapsvrscagcagovatlaspresmapaspCmd=INIT. Since it is a dynamic page, I don't know how this will be handled by the harvester."

Site copyright statement: "© 1999-2005 Southern California Association of Governments"

Crawl Results


Comments from crawl operator: Interesting login problem; Heritrix was unable to retrieve the guest login pages. Cold Fusion and asp pages don't generate new URLs and thus don't get crawled.
NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than 1 minute.
Robots.txt file: The site you selected forbids crawlers from gathering certain data. It reads:

User-agent: *
Disallow: /_mm
Disallow: /_notes
Disallow: /_baks
Disallow: /MMWIP

User-agent: googlebot
Disallow: /csi

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 500 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
2517 863231651 wwwscagcagov
690 6134101 wwwmetrolinktrainscom
506 40063 dns
428 1084533 wwwcacitiesorg
397 16161513 wwwscecom
196 581022 bookstoreicmaorg
187 4505985 wwwicmaorg
175 7757737 wwwciseal-beachcaus
158 1504151 wwwh2ouseorg
149 940692 wwwhealthebayorg
137 317748 wwwcipico-riveracaus
130 18259431 wwwciventuracaus
123 490154 wwwchinohillsorg
121 406068 wwwlakewoodcityorg
119 203542 wwwlavotenet
117 2449995 wwwcimalibucaus
114 744410 wwwciirvinecaus
113 368023 wwwwhitehousegov
109 974674 wwwdotcagov
107 892192 wwwlacanadaflintridgecom
[list truncated...]

Curator Feedback to CDL (Kasianovitz, SCAG)

Crawl Success: mostly effective
Crawl Success Comments: Similar to my comments about the Los Angeles Dept of City Planning: the crawl brought back a lot of webpages, but not all the publications/documents that I would want to collect. Again, the same problem happened with SCAG as happened with the EIR example. The crawl brought back html pages with links to reports (typically in pdf format), but the actual documents were not captured. While the webpage is helpful, as it gives context, the main content that I'd want to capture (the reports) was not captured. See the following for example; none of the webpages linked from this page are available (they should link to a page that will have the material), and I tried searching for the documents separately and couldn't get to them: httpwwwscagcagovpublicationsindexhtm (the timeline arrows at the top seemed to function; I'm not sure what this is for), httpwwwscagcagovlivablepubshtm. I was impressed to find that zip files were captured, and I was able to download them. Unfortunately, when I opened them there wasn't any content (I did the same search by mistake with the Arizona Dept of Water Resources and actually found content in the folders); I found 10 with the search type:zip. The gif or jpg images retrieved are not useful; most were just bars or bullets or covers of reports (although this might be helpful to identify titles, I think I would end up discarding these after doing more checking of the results).
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments from LA Dept of City Planning. Restricted gets me to the relevant materials for that agency; via brings back too many main webpages for other agencies to be useful.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): I expected that there would be a problem with the content behind logins. The crawl confirmed that material behind login screens couldn't be captured. Can I get a copy of these crawl results? NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than one minute.
Crawl Frequency: monthly
Questions/Comments about crawl: How to handle the copyright issue? For the login information, I'm not sure what all was blocked by the robots file. Interesting that metrolinktrainscom is the #1 related host for both of my crawled sites.


Linda Kennedy: California Bay Delta Authority. CDL Report to Curator

URL: http://calwater.ca.gov
Curator's original comments: "We are interested in the environmental impact statements and other key documents, and the various news releases and other announcements and archives of CALFED."
Site copyright statement: "© 2001 CALFED Bay-Delta Program"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 519 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
1130  473192247  calwater.ca.gov
741   201538533  www.parks.ca.gov
521   40442      dns
373   51291934   solicitation.calwater.ca.gov
242   78913513   www.calwater.ca.gov
225   410972     cwea.org
209   87556344   www.science.calwater.ca.gov
173   109807146  science.calwater.ca.gov
172   1160607    www.adobe.com
129   517834     www.whitehouse.gov
[list truncated…]

Curator Feedback to CDL (Kennedy CALFED)


Crawl Success: mostly effective

Crawl Success Comments: I looked closely at the CALFED home page (http://calwater.ca.gov), Tribal Home page (http://calwater.ca.gov/Tribal/Tribal_Home.shtml), Key Documents (http://calwater.ca.gov/CALFEDDocuments/CALFEDDocuments.shtml), and Archives page (http://calwater.ca.gov/Archives/Archives.shtml). The crawl did not complete in either the via or non-via search. Nearly all linked pages were retrieved in the non-via search. However, the retrievals from the via search were much less complete than the retrievals from the non-via search. For example, on the Key Documents page there were 3 missing links from the non-via search but 14 missing links from the via search. When Adobe documents were retrieved from either crawl, they came up correctly. Three .asp links of tribal maps from the Tribal Home page were retrieved by the non-via search but not the via search. A few of the images were missing from the displays, and this was also affected by the browser used. Usually the same image was missing from both crawls, but sometimes the images were more complete in the non-via crawl retrievals. There were some display problems with the right-hand menu boxes on the http://calwater.ca.gov/Tribal/Tribal_Home.shtml page, for example, that did not display correctly when viewed via Firefox. The same page viewed correctly in Internet Explorer, but when printed out, the boxes printed incorrectly, just as viewed in the Firefox browser. Grant Opportunities (http://calwater.ca.gov/GrantOpportunities/GrantInformation.shtml): this link did not work in 2 via and 2 non-via instances (from the Tribal home page and from the Archives page) but did work on one non-via crawl page (the CALFED home page). It could also be searched and retrieved directly from the test crawl search page.

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: See comments above. The non-via search was substantially more complete.

Crawl Frequency: monthly


Janet Martorana: Santa Barbara County Department of Planning and Development. CDL Report to Curator

URL: http://www.countyofsb.org/plandev/default.htm
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 487 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
3119  1102414495  www.countyofsb.org
485   34416       dns
428   1083047     www.cacities.org
357   6126453     www.sbcphd.org
320   6203035     icma.org
250   438507      www.sbcourts.org
234   1110744     vortex.accuweather.com
200   593112      bookstore.icma.org
[list truncated…]

Curator Feedback to CDL (Martorana SBCD)

Crawl Success: mostly effective


Crawl Success Comments: Most documents I expected to find were captured, but a number were not. For example, off this page, http://www.countyofsb.org/plandev/comp/threeyear/2005-2008/default.html, I expected to get to the final work program, http://www.countyofsb.org/plandev/pdf/comp/programs/Three_Year_WP2005-2008_3YrWrkProgram.pdf, but got the "Sorry, no documents with the given uri were found" message. Other examples are within the Energy Division, a part of the Planning & Development Dept. Off this page, http://www.countyofsb.org/energy/information.asp, I could access all links except for two: http://www.countyofsb.org/energy/information/oil&GasFields.asp (Oil and Gas Fields) and http://www.countyofsb.org/energy/information/oil&GasProduction.asp (Oil and Gas Production). The crawler seemed to cut off the URL right before the ampersand; perhaps it has problems with ampersands.

Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Frequency: weekly

Questions / Comments about crawl: I'd like to be notified when I've navigated away from the captured site. I found myself on real-time web sites but couldn't tell what was on the captured site and what was real. Example: When I did a search on "santa barbara" I got 1528 results using the via results. I went to the County of Santa Barbara Online site, and then after clicking on links I was on the web real-time and not in the crawl results database any longer, yet there were no indications that I had left the crawled database. The WERA URI was still displaying at the top of the screen. I couldn't tell which were the captured sites and which were the current real-time sites. Other observations: the web page navigation doesn't work, e.g., the Table of Contents doesn't jump to that section on the web page (www.countyofsb.org/energy/projects/shell.asp and www.countyofsb.org/energy/mitigation/oakProject.asp), and links to glossary terms go to the glossary but not to the term itself.


Lucia Orlando: Monterey Bay National Marine Sanctuary. CDL Report to Curator

URL: http://montereybay.noaa.gov
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 795 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
5272  468755541  montereybay.noaa.gov
861   61141      dns
554   20831035   www.wunderground.com
368   4718168    montereybay.nos.noaa.gov
282   3682907    www.oceanfutures.org
273   10146417   www.mbnms-simon.org
260   7159780    www.mbayaq.org
163   61399      bc.us.yahoo.com
152   1273085    www.mbari.org
146   710203     www.monterey.com
119   3474881    www.rsis.com
119   279531     www.steinbeck.org
118   1092484    bonita.mbnms.nos.noaa.gov
109   924184     www.duke.edu
104   336986     www.montereybayaquarium.org
103   595953     icons.wunderground.com
102   339589     www.uncw.edu
[list truncated…]

Curator Feedback to CDL (Orlando, Monterey Bay)

Crawl Success: (rating not provided)

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: This site contains a large mix of government policy, laws, and regulatory information, as well as links to recreational and educational activities available in the MBNMS. I was most interested in links to laws/regs/policy and educational info and organizations. I thought the links restricted to the original host best captured this information succinctly.

Crawl Frequency: unknown


Richard Pearce-Moses: Arizona Department of Water Resources. CDL Report to Curator

URL: http://www.azwater.gov
Curator's original comments: (redirects to http://www.azwater.gov/dwr) In arid Arizona, water is one of the most important – and most contested – resources. The publications and records of this Department are of critical value to the state. Our spider can get many files from this site (1474 files in 258 directories). We are mostly interested in documents by programs. Although our spider can't get the imaged documents database (http://www.azwater.gov/dwr/Content/ImagedRecords/default.htm), this directory may not be critical if we can get the imaged documents transferred to us. We are not interested in blank forms and applications.
Site copyright statement: "Copyright © 1998 - 2005 Arizona Department of Water Resources and ADWR Network. All Rights Reserved."
Crawl Results

Questions for curator: Did this capture the documents you needed?


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 195 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
2233  988447782  www.azwater.gov
286   2350888    www.water.az.gov
253   4587125    www.groundwater.org
226   3093331    www.azcentral.com
196   15626      dns
178   395216     www.macromedia.com
128   1679057    www.prescott.edu
123   947183     www.azleg.state.az.us
115   792968     www.usda.gov
[list truncated…]

Curator Feedback to CDL (Pearce-Moses AZWater)

Crawl Success: (not provided)

Crawl Success Comments: We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand it, wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can; that may be the difference. The crawl is listed as not completing, but it appears to be very close based on the total number of bytes downloaded.

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Questions / Comments about crawl: It seemed odd that it took 2h 54m to crawl 7709 documents with linked hosts, but 4h 4m to crawl only 4888 docs when the spider was restricted to the original host.


Richard Pearce-Moses: Citizens Clean Election Commission. CDL Report to Curator

URL: http://www.ccec.state.az.us/ccecscr/home.asp
Curator's original comments: This commission was established by initiative. Its work is of great historical significance, as it is changing the way the public elects officials. We have not been able to spider this site because links are buried in JavaScript. (We use wget as our spider.) We are primarily interested in acquiring their publications, election data, and things listed under "popular links".
Site copyright statement: "Copyright 2004 Arizona Citizens Clean Elections Commission. All Rights Reserved."
Crawl Results

Comments from crawl operator: JavaScript issue; interesting problem. Need curator feedback about what we captured.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, the following additional sites were crawled. Total: 15


[urls] [bytes] [host]
929  95456563  www.ccec.state.az.us
76   6117977   www.azcleanelections.gov
55   513218    az.gov
49   499337    www.governor.state.az.us
44   174903    www.adobe.com
40   141202    www.azleg.state.az.us
31   18549     www.az.gov
28   202755    www.azsos.gov
23   462603    gita.state.az.us
19   213976    www.benefitoptions.az.gov
17   89612     www.azredistricting.org
14   1385      dns
3    1687      wwwimages.adobe.com
2    1850      www.capitolrideshare.com
2    26438     www.ftc.gov

Curator Feedback to CDL (Pearce-Moses CCEC)

Crawl Success: (not provided)

Crawl Success Comments: We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with wget.

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)


Juri Stratford: City of Davis. CDL Report to Curator

URL: http://www.city.davis.ca.us
Curator's original comments: "We are primarily interested in the GIS data produced for the City of Davis. As the GIS data represent the largest part of the City of Davis web site, it may not be much more difficult to archive the site at the top level. Mapping and Geographic Information Systems URL: http://www.city.davis.ca.us/gis"
Site copyright statement: "This web site is Copyright © 2004 by the City of Davis. All Rights Reserved. The City retains the copyright on all text, graphic images, and other content of this site. You may not copy, modify and/or re-use text, images or other web content from this web site, distribute the City's web content, mirror content from this web site on a non-City server, or make any other use of the content of this web site that would violate the City's copyright, without written permission from the City of Davis. To the extent allowed by law, commercial use of our web material is prohibited without written permission from the City of Davis. All art work shown on these web pages is protected by US Copyright laws. Limited reproduction for non-commercial purposes can be authorized by the City of Davis provided that requests are approved prior to use. Contact the Community Development Department, Cultural Services program staff at (530) 757-5610 for more information. Some content included in this web site may be provided courtesy of third parties who may retain copyright control of the provided material. Any service marks and trademarks contained herein are the property of their respective owners."
Crawl Results


Comments from crawl operator: "GIS: potential issue; /img is disallowed by robots.txt, e.g. http://www.city.davis.ca.us/img/featured/map-static.jpg can't be retrieved; also some maps on a second server are disallowed. Need feedback about GIS material that was captured: what was captured that is useful? Much duplication -- pages captured repeatedly."

Robots.txt: The site you selected prohibits crawlers from collecting certain documents. The file reads:

User-agent: *
Disallow: /img
Disallow: /calendar
Disallow: /misc/email.cfm
Disallow: /ed/business
Disallow: /gis/oldmap
Disallow: /police/log
Disallow: /pcs/grants/sacog
Disallow: /jobs/listings
Disallow: /css
Disallow: /pcs/nutcracker/history.cfm
Disallow: /pcs/nutcracker/pdfs

User-agent: asterias
Disallow: /

User-agent: gigabot
Disallow: /

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 420 additional hosts were crawled. The following hosts supplied more than 50 files to your site:


[urls] [bytes] [host]
16455  947871325  www.city.davis.ca.us
420    29555      dns
332    10377948   www.asucd.ucdavis.edu
305    33270715   selectree.calpoly.edu
279    3815103    www.w3.org
161    2027740    www.cr.nps.gov
139    941939     www.comcast.com
133    951815     www.yolocounty.org
[list truncated…]

Curator Feedback to CDL (Stratford Davis)

Crawl Success: mostly effective

Crawl Success Comments: Looking at the GIS Online Maps page, it's not clear which formats were retrieved and which were not. For example, the Growth Map Flash file downloads fine, but the Flash/ArcIMS files do not download.

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: In general, restricted to original host works better. The broader search includes pages from businesses, e.g. Comcast, and other local agencies, e.g. other local and state government sites. But restricting the outside sites to the first level seems to be a good compromise.

Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question): In general, it looked like it did a good job pulling down geographic data/images. For example, you can pull down data from http://www.city.davis.ca.us/gis/library. It's difficult for me to get a sense of the level of duplication from the way the search results display.

Crawl Frequency: monthly

Questions / Comments about crawl: I like the ability to navigate within the archive once you have a starting point. This is very nice.


Yvonne Wilson: Orange County Sanitation District. CDL Report to Curator

URL: http://www.ocsd.com
Curator's original comments: "This is an Orange County agency that has small reports and complex planning documents on its web site."
Site copyright statement: "Copyright © 2001-2005 Orange County Sanitation District. Unless a copyright is indicated, the information on this site is freely available for non-commercial, non-profit making use. If a copyright is indicated on any materials displayed on our website, permission to copy these materials must be obtained from the original source. Commercial use of District materials is expressly prohibited without the written permission of the OCSD."
Crawl Results

Comments from crawl operator: In both crawl settings we had to limit the maximum number of retry attempts in order to complete the crawl.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 85 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
755  85943567  www.ocsd.com
164  7635257   www.ci.seal-beach.ca.us
122  809190    www.ci.irvine.ca.us
95   169207    epa.gov
86   7673      dns
85   559125    order.e-arc.com
66   840581    www.ci.huntington-beach.ca.us
62   213476    www.cityoforange.org
57   313579    www.epa.gov
55   4477820   www.villapark.org
50   1843748   www.cityoflapalma.org
50   463285    www.ocbinc.com
[list truncated…]

Curator Feedback to CDL (Wilson OCSD)

Crawl Success: somewhat effective

Crawl Success Comments: Using WERA, I searched inquiries by type and title in the two OCSD collections, plain and via. I received no hits for PDF, only the homepage for HTML, and three hits for text. There are many PDF sections in the EIRs. I next searched by title in the two collections; I was the most successful in via. By searching the titles "carbon canyon" and "Ellis Ave Pumping Station" I found articles, but not the EIRs, which are available full text. At this point I presumed the crawl did not drill down far enough. Then I searched for a secondary page entitled "Ocean Monitoring"; this time the search found only an internal letter and memo, but not all the documents related to this topic. The via collection search seems to be the most productive, but it is not consistent.

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: I searched some of the outside links to US Marine Fisheries and EPA Beach Watch and received no hits.

Crawl Frequency: monthly


Crawl Report Key: Web-at-Risk Test Crawls

This document is a guide to the test crawl report, providing further information about some of the statistics and results conveyed there. Your report includes basic information about the site you submitted, your original comments about that site, and any copyright statements CDL found when we examined the site. Your site was crawled using the Heritrix crawler, version 1.5.1. Your report will include the following information about the site:

Crawl Settings: We crawled each site in two different ways: (A) linked hosts included; (B) restricted to original host. Where linked hosts were included, we set the crawler to gather any outside page that your nominated site linked to, but no further. So if your site linked to a single document from whitehouse.gov, we captured that document but did not crawl any further on the whitehouse.gov site. Your report will show results from both styles of crawling, to give you a sense of whether or not the site draws heavily from valuable materials on another site. You will also receive a list of the other hosts that site linked to and how many documents were gathered from those hosts.

Robots.txt file: The presence of a robots.txt file means that the content provider is asking us to refrain from crawling either all or part of the site. This refers only to the host you named in your crawl request. In some cases the site had a robots.txt file, but it didn't say anything; we noted when this occurred. We obeyed robots.txt instructions for these crawls, so if the site contained one but we still got a result, that means the robots.txt file only prevented us from crawling certain areas of the site. When robots.txt files were present, we have included the text of that file in your report so you can see which segments of the site the site owner wants to protect.

Crawl duration
Total number of documents: The "Documents" count will include page components (such as images or Flash files).

File types (MIME types): This area will contain a URL. When you go to that URL, you will see a list of the different file types that were retrieved as part of the crawl. IMPORTANT: The Heritrix crawler is currently experiencing difficulty with this report, and it is missing a crucial column. Until that is fixed, you can see the different file types retrieved, from most common to least, but you cannot yet tell how many files each one included. CDL will contact you when this report has been fixed.

Response code reports: The URL in this column will lead to a list of response codes in order by frequency. This will include "200" for files that were successfully captured and error codes for files that were not captured. The error code list includes some codes specific to Heritrix; the key to interpreting these codes is at http://crawler.archive.org/articles/user_manual.html#statuscodes. Note that this report only gives you quantitative information about response codes; it does not link response codes to specific files. For these details, see "hosts report and crawl log" below.

How much data collected (bytes): The file size of the total crawl is reported in bytes. You can use the byte conversion tool at http://www.techtutorials.net/reference/byteconverters.html if you want to recalculate the size of the crawl in another measurement, such as kilobytes or megabytes.

Did crawl complete: This will say "no" if the crawl results exceeded 1 gigabyte or if the crawler encountered an obstacle to capturing the site that could not be fixed.

Location of hosts report and crawl log: You are welcome to review the page-by-page details of the crawl log. This is a generic report that comes with the Heritrix crawler and is not terribly user friendly, but it provides the most detail about the crawl process. This report will list every file that the crawler attempted to get and provide some information about each file. The Heritrix manual can help you interpret this report: http://crawler.archive.org/articles/user_manual.html#logs (go to section 8.2.1, Crawl Log).

Comments from Crawl Operator: These are observations that the Web Archive Programmer made about the crawl process for your site.

Questions for Curator: This section does not appear in every report. If you have a question listed here, please respond to it in the Test Crawl Feedback form.

Your Collection: IMPORTANT: The screens you will see do not represent the final user interface for the Web Archiving Service tools. We are using WERA, an open-source search and display tool, only to show you your test crawl results.


Because we did not seek the right to redistribute these documents, these pages are available only for the purpose of analyzing crawler effectiveness. You must have a password to view these pages. Your report will include the address of a wiki page and a login and password. Each site was crawled twice: plain crawl = only pages from the original site were collected; via = pages from the original site, as well as pages that site links to, were collected. Unfortunately you cannot simply browse your site; you must select a collection and type a search. You will be able to navigate throughout your site once you load a page containing links. You will be able to review your colleagues' sites as well. Note that the WERA display tool is not perfect: if the same document was gathered from more than one crawl, it may not display in every collection.

Related Hosts Crawled: This section provides further information about the additional materials that were gathered when we set the crawler to include documents that your site links to. This can be critical in deciding what settings are needed to capture your site. Some sites, for instance, will keep all of their PDF or image files on a separate server. If you don't allow the crawler to move away from the original URI, you won't capture a critical portion of the site's content. In other cases, however, this setting will lead to irrelevant information. This report includes the most commonly linked hosts from your site.
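The items above arrive as raw Heritrix output: a byte total, a response-code list, and the crawl log itself. Purely as an illustration of how that output might be summarized outside of WERA, the short Python sketch below converts a byte count into a more readable unit and tallies fetch status codes from a crawl log. It is not part of the Web Archiving Service tools; it assumes the whitespace-separated crawl.log layout described in section 8.2.1 of the Heritrix 1.x user manual (timestamp, status code, document size, URI, ...), and the file name crawl.log is only an example.

import sys
from collections import Counter

def human_size(num_bytes: float) -> str:
    """Convert a raw byte count (as reported by Heritrix) to a readable unit."""
    for unit in ("bytes", "KB", "MB", "GB"):
        if num_bytes < 1024 or unit == "GB":
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024

def tally_status_codes(log_path: str) -> Counter:
    """Count fetch status codes from a Heritrix 1.x crawl.log.

    Assumes the documented layout: timestamp, status code, document size,
    URI, discovery path, referrer, MIME type, ... (whitespace separated).
    """
    codes = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            fields = line.split()
            if len(fields) >= 2:
                codes[fields[1]] += 1
    return codes

if __name__ == "__main__":
    # Example: python summarize_crawl.py crawl.log
    for code, count in tally_status_codes(sys.argv[1]).most_common():
        print(f"{code}\t{count}")
    # e.g. a byte total like those in the reports above:
    print(human_size(1102414495))  # -> "1.0 GB"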



[Figure: Domains (local sites only), number of sites per top-level domain: gov, us, org, com, edu]

Copyright statements: We next reviewed each site to determine whether copyright statements on the site could help determine what rights scheme might apply to it. The copyright statements for the sites we crawled are available with each individual crawl report in Appendix C. Here too, local government sites offered little correlation between the nature of the content-owning organization and the rights statements displayed on the site. City web sites varied dramatically in their rights statements: some stated that their materials were in the public domain, others vigorously defended their copyright. This City of San Diego site did both.4

After both rights reviews, it was determined that of the 30 sites submitted:

• 14 fell into Rights Scheme A and could be crawled without notification or permission.

• 13 fell into Rights Scheme B and could be crawled, but would also require identifying and notifying the content owner.

• 3 fell within Rights Scheme C and would require the explicit consent of the content owner prior to crawling.

The process of reviewing the sites for rights statements changed our assessment of the correct rights scheme in a number of cases, and all three "Scheme C: Consent Required" designations were made on the basis of statements posted on the site. Note that we did not ultimately seek permission for these materials, and access to the results of our crawls has been strictly limited to the curators and project staff for the purpose of crawl analysis. In short, our pre-crawl analysis of these 30 sites brought up complex rights issues and exemplified the challenges that lay ahead.

4 City of San Diego web site, Disclaimer page <http://www.sandiego.gov/directories/disclaimer.shtml>

Test Crawl Settings and Process

Although we had originally planned to crawl only one site for each curator, some curators supplied two sites that posed interesting problems; in these cases we crawled both. Each site was crawled with two settings, resulting in 19 test sites and 38 total crawls conducted. We used Heritrix version 1.5.1 to conduct the test crawls. The crawls were conducted using four crawler instances on two servers. Each site was crawled separately; that is, each seed list contained one URI. We kept most default settings, except for the following.

Crawl size: Each crawl was set to stop at a maximum of 1 gigabyte (gig) of data. Of the 38 crawls conducted, 18 hit the 1 gig size limit. Note that this limitation was imposed for the purpose of these early tests and will not be applied to future services.

Crawl duration: When crawls took an inordinately long time to complete, we started over again with "max retries" set at three. This setting improved crawler performance when pausing or hanging was an issue.

Politeness:5 Because we crawled each site individually, we set our crawler for very high politeness values. Politeness pertains to the impact of the crawl on the content owner's server and is determined by combining a few different Heritrix settings that together determine how demanding the crawler is on the remote servers' resources. Our politeness settings were:

• delay-factor: 5
• max-delay-ms: 5000
• min-delay-ms: 500
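To make these values concrete, the sketch below (a minimal Python illustration, not Heritrix code) shows how the three settings combine, as described in the Heritrix user manual's politeness section: the wait before the next request to a host is the duration of the last fetch from that host multiplied by delay-factor, clamped between min-delay-ms and max-delay-ms. The example fetch durations are hypothetical.

def politeness_delay_ms(last_fetch_ms: float,
                        delay_factor: float = 5.0,
                        min_delay_ms: int = 500,
                        max_delay_ms: int = 5000) -> float:
    """Approximate the per-host wait Heritrix applies between requests.

    The wait scales with how long the previous fetch from that host took
    (delay_factor * last_fetch_ms) and is clamped to the configured minimum
    and maximum. Defaults mirror the test-crawl settings listed above.
    """
    return max(min_delay_ms, min(max_delay_ms, delay_factor * last_fetch_ms))

# A fast page (100 ms) still earns the 500 ms minimum wait:
assert politeness_delay_ms(100) == 500
# A 400 ms fetch is multiplied by 5 -> a 2000 ms wait:
assert politeness_delay_ms(400) == 2000
# A slow 3-second fetch is capped at the 5000 ms maximum:
assert politeness_delay_ms(3000) == 5000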

Original host only vs. linked hosts included: Each site was crawled with two settings. The first setting restricted results to only the host from the original seed URI. The second setting allowed us to gather any pages to which the site linked directly, but no more. This second setting was constructed to gather pages considered relevant to the original site, and to gather sites in their entirety when an organization relied on more than one host name to provide its web presence.
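The rule behind the two settings can be pictured with a minimal sketch. This is not Heritrix's actual scope implementation; it is only an illustration, in Python, of the decision the two settings imply, and the host names in the example are used purely for demonstration.

def in_scope(uri_host: str, referrer_host: str, seed_host: str,
             include_linked_hosts: bool) -> bool:
    """Decide whether a discovered URI is kept, for the two test settings.

    'Original host only' keeps a URI only when it is on the seed's host.
    'Linked hosts included' also keeps an off-host URI, but only when the
    page that linked to it was on the seed's host (one hop off-site, no
    further).
    """
    if uri_host == seed_host:
        return True
    return include_linked_hosts and referrer_host == seed_host

# A whitehouse.gov document linked from a page on the nominated site:
print(in_scope("www.whitehouse.gov", "www.scag.ca.gov", "www.scag.ca.gov", True))   # True
print(in_scope("www.whitehouse.gov", "www.scag.ca.gov", "www.scag.ca.gov", False))  # False
# A page two hosts away (linked from the whitehouse.gov document) is never kept:
print(in_scope("www.irs.gov", "www.whitehouse.gov", "www.scag.ca.gov", True))       # False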

Crawl Scope

A comparison of the two different crawl settings used (original host only vs. linked hosts included) turned up some counterintuitive results.

5 For further information, see section 6.3.3.1, "Politeness," in the Heritrix User Manual <http://crawler.archive.org/articles/user_manual.html>.


When compared quantitatively, 8 out of 19 crawls took longer to capture the site when limited to "original host only" than with the "linked hosts" setting. It is not clear why this is the case, since the "linked hosts" crawl should be much larger. Indeed, in all cases the linked hosts crawl retrieved more files than the original host crawl. The following two tables compare both the number of files retrieved and the duration of the two types of crawls.

Table 1: Number of files retrieved

          Original Host Only   Linked Hosts Included
Most      46197                70114
Fewest    247                  1343
Median    2423                 9250
Average   6359                 17247

Table 2: Duration of the crawl

          Original Host Only   Linked Hosts Included
Longest   32 hr 21 min         37 hr 11 min
Shortest  18 min               19 min
Median    7 hr 33 min          11 hr 22 min
Average   1 hr 42 min          7 hr 9 min

Given that this is a very small sample of crawls, and that the gap between the largest and smallest crawls is fairly noteworthy, perhaps the only telling figure to consider here is the median. According to the median figures, with only 50.5% more time the crawler acquired over 281% more documents. When compared qualitatively, the results also appeared somewhat counterintuitive. Of the 18 curators who responded, 12 stated that they preferred the "original host only" crawl (four were undecided). We would have expected this preference to vary a little more from site to site. Oddly, one of the two curators who preferred the larger crawl scope had a crawl that captured materials from over 2500 other hosts. In some cases a site's links to exterior hosts are critical; the site's value may hinge upon how well it gathers documents from other sources. Both of the curators who preferred the broader setting did so for this reason:

For this site it was essential to capture the link hosts (via) because many of the press materials, etc., were on external sites.

More critically, sites are also often composed of content from more than one server. This is particularly likely to be the case if a site is providing a large body of PDF or multimedia files, so a crawl restricted to the original host would be missing critical segments of the site's content. Our test crawls did in fact turn up sites that were composed of more than one host name. For example, in the case of UC Merced, separate host names are used for different areas of the site, such as faculty or


admissions. In the case of the Arizona Department of Water Resources, the distinction between host names appears to be accidental, perhaps the result of an attempt to transition to a simpler, more memorable URI. Most pages from this site come from www.azwater.gov, but hundreds of internal links, including critical style sheet files, are still hard-coded to point to www.water.az.gov. Finally, when the site is restricted to the original host, the end user is much more likely to encounter errors when viewing the archived results. When the end user selects a link that was not captured, WERA provides a "Sorry, this URI was not found" message. When the linked hosts are included, the end user browsing the site sees the site closer to its original context and with fewer error messages. Conversely, when end users encounter frequent error screens, they may develop both frustration and a sense of mistrust in the quality of the archive. It is worth noting that the curators are not likely to browse these results in the same way that an end user of their archives might. The curators know what these sites contain, choose them accordingly, and may be less inclined to click on links that would result in a "Sorry, this URI was not found" message. Ultimately, the value of a site's external links would seem likely to vary depending on the nature of the site. Sites with rich internal content and only "frivolous" external links would be best captured with the "original host only" setting. Before we ran these crawls, we asked curators to specify what they hoped a crawl would capture. Many referenced specific pages or directories they hoped to capture, and of those, three specified URIs that were not from the original host. When reviewing the results, two out of those three still preferred the original host crawl, even though that crawl did not capture the materials they specifically hoped to capture. Why was the feedback so consistent on this point? A look at the WERA interface used to display crawl results may provide an answer. WERA does not offer an immediate means of browsing a site; you have to search by keyword to find your way "into" the captured site. Once you have a page displayed, you can browse within the site, but you must begin by searching for the right starting point. As this image shows, you select the site you want to search from a dropdown menu, then enter terms to search against.

Because all pages from the more comprehensive crawls are indexed, the search results include pages from all of those other hosts. This sets up a bit of cognitive dissonance: the user specified a search against a particular site, yet results from other hosts vastly outnumber pages from that site.


Underlying the issue of crawl scope is the deeper question of what an archivist hopes to capture when a site is crawled. Is it just a list of particular documents? Or is it a faithful recreation of the site as it existed on that day? It may be that a captured site has content of primary and secondary importance. The primary content is what should be retrieved when searching against the archive, while the secondary content should only be present to avoid error messages and establish the site's full original context. Another approach is suggested by this curator's response:

The results on the multiple hosts crawl is mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains rather than "source + 1".

This suggests the ability to link seed URIs as being related components of a single site

Communication / Reports

When we reported the test results back to curators, we provided a synopsis of the crawl results, links to particular Heritrix reports, and to the WERA display interface. The Heritrix reports are all plain text, providing tables of MIME type or response code frequency. Attempting to integrate these reports and the display of the archived results is a challenge. One curator, for example, obtained documents from over 200 hosts in the "linked hosts included" crawl, but was only aware of having found 10 additional documents when reviewing these same search results in WERA. Although WERA is helpful for seeing results from an end user's perspective, it does not provide adequate tools for analysis. In some cases this is simply because WERA is a new and occasionally buggy tool. It is possible, for instance, to follow occasional links out of the archive and into "real-time" sites. In some cases it's also possible to browse to a page and display it, but when you search for that same page by its URI, WERA does not find anything. One curator notes:

We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE; the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.

It is also important to note that although WERA was used for the purpose of reporting test crawl results, it is not envisioned as the final display interface for the Web Archiving Service. Even so, the feedback the curators provide about WERA should inform the functionality of the WAS interface. Clearly it is still quite a struggle for curators to determine exactly what a crawl retrieved. One curator reports:

After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to


the crawl report, there are 8899 documents in this crawl alone. My other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls.

Crawl Frequency

When asked how frequently they wanted to crawl sites, curators responded with a variety of preferences:

[Figure: Desired Crawl Frequency, number of curators selecting Daily, Weekly, Monthly, Once, or Unknown]

Again, it is worth considering precisely what curators hope to capture in a repeated crawl of a site. Some insight is provided by these curators' comments:

We hope the crawler will be able to report when new publication files are posted on the web site.

And

The ability to report on new publications is critical to our goal of using the crawler as a discovery tool.

As with the other NDIIPP grants, the purpose of the Web Archiving Service tools will be archiving and preservation, not resource discovery. This indicates that we should further investigate what a "weekly" or "monthly" crawl really means to curators. If a site was not updated over the course of a year, would the curator want to continue running weekly crawls of the site? Would the curator want each crawl to appear on an archive timeline for that site, even if the content was no different for each date? One curator comments:

I want to qualify the frequency for this site. I'd like to do a monthly crawl for three-four months. I'd want [to] reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot [of] new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation.


Language and Web Site Models

One of the challenges of communicating and interpreting crawl results is that crawlers don't necessarily work the way people envision them to. Further, the appearance of a web site on a screen and its architecture on a server may be quite different: the directory structure of a site may have no relationship to the way its navigation is organized on a screen. One frequent point of confusion curators encountered while interpreting crawl results is the concept of how many "levels down" the crawler went. One curator requested that we "drill down several levels (at least 3)" in our capture. One challenge with this request is that "levels down" can be interpreted to mean different things. In some cases curators clearly mean the directory structure of the web site on the remote server. In other cases, however, they seem to mean the number of hosts away from the original site:

In general, restricted to original host works better. The broader search includes pages from businesses, e.g. Comcast, and other local agencies, e.g. other local and state government sites. But restricting the outside sites to the first level seems to be a good compromise.

If the curator is referring to a structure of subdirectories, the next challenge is that the Heritrix crawler does not work this way. Heritrix ignores the site's directory structure and instead follows links from the seed URI it is provided. The Heritrix manual specifically defines "depth" in this way:

"This scope allows for limiting the depth of a crawl (how many links away Heritrix should crawl) but does not impose any limits on the hosts, domains or URI paths crawled."6
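The difference between the two readings of "levels down" can be made concrete with a small example. The following Python sketch (illustrative only; the URLs are made up) measures directory depth from the URI path, while the hop count is simply how many links the crawler followed from the seed. The two numbers need not agree.

from urllib.parse import urlparse

def directory_depth(url: str) -> int:
    """How many directory levels below the host root the URL's path sits."""
    path = urlparse(url).path
    # '/planning/docs/2005/reports/eir.pdf' -> 4 directory levels
    return max(len([part for part in path.split("/") if part]) - 1, 0)

# Hypothetical case: the seed's front page links directly to a PDF buried
# four directories down. For Heritrix that PDF is 1 link (hop) from the
# seed; by directory structure it is 4 levels down.
seed = "http://www.example.gov/"
linked_pdf = "http://www.example.gov/planning/docs/2005/reports/eir.pdf"

hops_from_seed = 1                   # one link followed from the seed page
print(directory_depth(linked_pdf))   # -> 4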

Multimedia

Some curators chose sites because of the value of their multimedia resources. The crawler yielded mixed results in capturing these resources. For one site:

• A text search on the log file turns up numerous .ram files, only one .ppt file.

• .asx files are Windows streaming media redirector files, which generally lead to associated .wmf files. No .asx references appear in the crawl log, nor do any .wmf files.

• .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files.

As noted, nearly half the sites crawled reached the 1 gig size limit and so did not complete. This makes it difficult to determine whether there were genuine problems with particular types of files, or if the crawler simply did not get to the missing files before the crawl was stopped.

6 Heritrix User Manual, section 6.1.1, Crawl Scope: Broad Scope <http://crawler.archive.org/articles/user_manual.html>

Comparison with Other Crawlers

Three of our curators had previously used other crawlers, such as HTTrack and wget, to capture the same sites we attempted with Heritrix. Because these curators provided strong details when describing their sites, we crawled all of the sites they sent us (six sites). In one case, the Defense Base Closure and Realignment Commission, the curator had greater success capturing .aspx files with HTTrack than we had with Heritrix. Other comparisons:

We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using Wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand, Wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can. That may be the difference.

And

We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with Wget.

Crawl Success

We asked curators to rate the overall success of the test crawls on the following scale:

• Not effective (none of the desired documents were captured)
• Somewhat effective (some of the desired documents were captured)
• Mostly effective (most of the desired documents were captured)
• Effective (all of the desired documents were captured)

[Figure: Crawl Success, number of curators rating the crawls Effective, Mostly Effective, Somewhat Effective, or Not Effective]

Here are some of the comments curators had concerning the overall success of these tests. The full text of all curator reports and their feedback to CDL is available in Appendix C.


Los Angeles Planning Department

The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all PDFs. When searching specifically for EIRs I got a large result list (699 citations); however [sic], when I investigated whether or not the actual file was captured, I found that the main EIR page was captured, typically a htm/html file, but when I clicked on a link to get to the full report, all I got was the "Sorry, no Documents w/the given URI were found" message. This could be that the file was no longer available when the site was harvested. However, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved.

City of San Diego Planning Department: This comment is from a curator who is filling in for the person who originated the test crawl request, so she is also attempting to interpret what that other person was hoping to capture.

Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit.

Defense Base Closure and Realignment Commission

I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1GB, but I think that more documents could have been captured.

Public Policy Institute of California


There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on "Date" retrieves the message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."

AFL-CIO

I realize the collection interface is a "work in progress" and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g., "Working Families Toolkit," "BushWatch") that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2 million-plus hits per year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses.

Conclusions

The process of running these test crawls has been valuable. It has resulted in a number of lessons learned and further directions for inquiry. It is clear that the tools that help curators analyze the results of these crawls will have to do more than mimic the original browsing context or supply tables of data. We will be looking for any opportunity to improve WAS reporting capabilities as we move forward with the project. The differences encountered in language and visualization raise the importance of a strong, intuitive design for the curator tools and for clear help screens. Each of us may visualize web sites differently, and the crawler may behave differently than we expect. A certain degree of online help will be needed to design crawls effectively, and further guidance should be available to help people interpret crawl results when those results don't match what the person anticipated.


The help and documentation for the Web Archiving Service will also need to address the rights analysis issues raised above. In most cases this is work that is done prior to issuing crawls; it cannot be addressed by the design of the WAS interface alone. When the Web-at-Risk project reaches the point of conducting usability studies, we should be sure to include tests that further uncover how users understand crawl frequency settings. Additionally, the desire for an analysis tool that can convey when a site has changed significantly is not limited to this project; it is a common issue faced by the Internet Archive, members of the International Internet Preservation Consortium, and others. CDL is actively communicating with these organizations as we all work toward a solution for this problem. Similarly, CDL should ensure that future assessment and analysis work with our curators addresses the issues raised by the two crawl scope settings. This inquiry should also extend to participants who exemplify end users, to determine whether people using web archives experience crawl scope differently than people who build them. Finally, a number of lessons were learned via the Katrina crawl described in Appendix B. The most outstanding finding is that event-based crawls such as Katrina and site-specific crawls such as these have quite different characteristics and require different functionality and analysis tools.

Next Steps

There is certainly more to learn from these crawl results and about Heritrix; in some cases it is still unclear why the crawler failed to retrieve certain documents. The curators' feedback concerning these results has been extremely valuable. They have provided insight as to what was captured and what is still missing, which would have been difficult to determine without their subject expertise in the sites chosen. As we continue working to improve crawler success and performance, we will turn our attention to the Virtual Remote Control site created by Cornell. This site was developed to present particular problems to crawlers and is well documented, enabling the user to gauge a crawler's results. We will be using that site to replicate particular problems raised in our test crawl set, and may also reattempt some of the sites crawled for these tests. When we release the first version of the Web Archiving Service to curators in July 2006, we will request that they include their original test site in the crawls they attempt and compare the results with these tests. The test results for these crawls will remain in place as a point of comparison while we continue developing the Web Archiving Service. Additionally, certain issues raised by this feedback, particularly the desired scope and frequency of crawls, merit attention in future usability analysis work.


Web-at-Risk Test Crawl Report: Appendix A Sites Submitted

Curator | Site Submitted | Crawled
Sherry DeDekker | http://ca.water.usgs.gov (California Water Science Center) | Yes
Sherry DeDekker | http://www.dwr.water.ca.gov (California Department of Water Resources) |
Peter Filardo and Michael Nash | http://www.nycclc.org (New York City Central Labor Council) | Yes
Peter Filardo and Michael Nash | http://www.dsausa.org (Democratic Socialists of America) |
Valerie Glenn and Arelene Weibel | http://www.strengtheningsocialsecurity.gov (Strengthening Social Security) | Yes
Valerie Glenn and Arelene Weibel | http://www.brac.gov (The Defense Base Closure and Realignment Commission) | Yes
Gabriela Gray | http://www.joinarnold.com (Join Arnold) | Yes
Gabriela Gray | http://www.antonio2005.com (Mayor-elect Antonio Villaraigosa) | Yes
Ron Heckart and Nick Robinson | http://www.ppic.org (Public Policy Institute of California) | Yes
Ron Heckart and Nick Robinson | http://www.cbp.org (California Budget Project) |
Terrence Huwe | http://www.aflcio.org (AFL-CIO) | Yes
Terrence Huwe | http://www.seiu.org (Service Employees International Union) |
James Jacobs | http://www.sandiego.gov/planning (City of San Diego Planning Department; analyzed by Megan Dreger) | Yes
James Jacobs | http://www.sandag.org (San Diego Association of Governments) |
Kris Kasianovitz | http://cityplanning.lacity.org (Los Angeles Department of City Planning) | Yes
Kris Kasianovitz | http://www.scag.ca.gov (Southern California Association of Governments) | Yes
Linda Kennedy | http://calwater.ca.gov (California Bay-Delta Authority, CALFED) | Yes
Linda Kennedy | http://www.dfg.ca.gov (California Department of Fish and Game) |
Ann Latta | http://www.ucmerced.edu (UC Merced; analyzed by Elizabeth Cowell) | Yes
Ann Latta | http://www.coastal.ca.gov/web (California Coastal Commission) |
Janet Martorana | http://www.countyofsb.org/plandev/default.htm (Santa Barbara County Department of Planning and Development) | Yes
Janet Martorana | http://www.sbcag.org (Santa Barbara County Association of Governments) |
Lucia Orlando | http://montereybay.noaa.gov (Monterey Bay National Marine Sanctuary) | Yes
Lucia Orlando | http://www.waterboards.ca.gov/centralcoast (Central Coast Regional Water Quality Control Board) |
Richard Pearce-Moses | http://www.azwater.gov (Arizona Department of Water Resources) | Yes
Richard Pearce-Moses | http://www.ccec.state.az.us/ccecscr/home.asp (Citizen's Clean Election Commission) | Yes
Juri Stratford | http://www.city.davis.ca.us (City of Davis, California) | Yes
Juri Stratford | http://www.sacog.org (Sacramento Area Council of Governments) |
Yvonne Wilson | http://www.ocsd.com (The Orange County Sanitation District) | Yes


Web-at-Risk Test Crawl Report: Appendix B The Katrina Crawl

The Crawl

During the early fall of 2005, the California Digital Library was embarking on a series of test crawls as part of the Web-at-Risk project when hurricane Katrina struck. CDL had requested that curators submit their sample URLs by August 25th, and on Friday, August 26th, Louisiana Governor Blanco declared a state of emergency. It was over that weekend that CDL staff reached the conclusion that an event of this historic significance would shift our priorities. On Monday we suspended our initial test crawl plans and began preparing on a number of fronts to capture web-based materials related to Katrina. Specifically, this meant:

• identifying which web sites or sub-sites to collect, which involved:
  o notifying 30+ curators of our emergency crawl plans and requesting their assistance
  o setting up a web-based interface to gather their suggested seed URLs
  o sorting through the resulting seed list and feeding it to our primary and secondary (Stanford) crawl operators
• determining our collection parameters
• getting our crawlers installed and configured
• locating and setting up disk space to store the crawls
• initiating and monitoring the crawls

This crawl activity posed several new challenges. It was the first time CDL staff had used the Heritrix crawler, which was not yet installed on a production server when the hurricane struck. The last week of August saw CDL and UC Berkeley staff scrambling to find server space to set up Heritrix instances, disk partitions, and job parameters for the crawl. This was done with remarkable speed, owing to the urgency of the situation and to everyone's desire to capture a record of the events. Given our limited experience, in order to reduce the risk of losing the historically significant and fleeting materials related to hurricane Katrina, CDL worked with Stanford University to concurrently run the same crawl using a different crawler. After both CDL and Stanford had crawled the same seed list for a month, the task was then taken up by the San Diego Supercomputer Center, who have continued crawling these sites using Heritrix.

Gathering the Seeds

The CDL sent out an initial request to the Web-at-Risk curators to submit URLs related to Hurricane Katrina for crawling. We worked collaboratively with a large group of content specialists to identify the sites:

• the 22 curators of the Web-at-Risk project (University of California Libraries, University of North Texas, New York University, Stanford University, and the Arizona State Library)
• the Library of Congress
• librarians at Louisiana State University and the University of Mississippi


Over the course of the crawl, the list of seed URLs grew to over 700 (just over 500 of which were crawled by CDL). The image of our input form above provides some sense of the range of materials collected. Given the sudden nature of this event, there was no time to investigate the rights issues or technical problems each site might have presented. We informed the curators that "Our immediate plan is to simply collect the material before it disappears. We will not make the material immediately available." We also had little time for quality control and were not able to guarantee that sites in our seed list would be comprehensively crawled.

Crawling Specifics

CDL's first crawl was run on September 1, 2005, using a seed list of 89 URLs. The final crawl run by CDL was on October 10, using a seed list of 589 URLs. This final seed list was then sent to the San Diego Supercomputer Center, which is continuing to run twice-weekly crawls.


The CDL crawls were done using Heritrix version 1.5.1. We began with a single instance of Heritrix, eventually dividing the seed list among 6 instances. We got through the entire seed list 29 times in 40 days.

However, the content we were able to collect was limited by our crawler settings. Our crawler was configured to visit one host at a time, collecting content at a conservative rate and never more than three hops away from the seed URL. To ensure that the crawler moved through the seed list in a timely way, it was limited to spending 15 minutes at any given host. So the material collected does not represent the entirety of what was available at each site. Note that all of the seeds were crawled with the same configuration, no matter how different the structure of these sites might be. Thus the New Orleans version of Craigslist, NASA's information pages, and blog sites were all crawled in the same manner, despite being quite different in context, architecture, and other characteristics. Given our short preparation time, the goal was to find a crawler configuration that would be a match for the broadest range of Katrina materials. In most cases, whether the seed URL was for a personal blog or for a government agency, the seed itself was not centrally about Katrina. The Katrina information was generally situated on the front page and top levels of each site, so CDL expected that this setting, even with the 15-minute time limit, would at least capture the content nearest to the front pages of the widest variety of sites.

In choosing crawler configuration settings we also faced conflicting goals. On one hand, we wanted to capture as much Katrina content as possible. On the other hand, we were very hesitant to start hitting sites that were providing much-needed information at a crucial time (emergency sites, relief sites), thus making those sites hard to reach because of our capture activity. Many of these sites were already being heavily used and perhaps were not running at full capacity; some were also geographically impacted by the hurricane. So we had to choose settings that balanced the need to collect with politeness across a wide range of sites. (An illustrative sketch of these kinds of settings appears below.)

When the San Diego Supercomputer Center took over the Katrina crawls in October, they revisited the Heritrix configuration settings. The SDSC crawl placed a limit not on the amount of time spent at a site but on the number of documents to be retrieved overall. Their crawls are being conducted "breadth first," in the sense of gathering pages consecutively across the seed list rather than gathering pages from one site at a time.

The Stanford University crawl of the same sites began on September 7th and continued for 30 consecutive days using the WebVac crawler. While we have statistics concerning the size of the Stanford Katrina collection, it is very difficult to compare the configuration settings and effectiveness of WebVac vs. Heritrix. Further, there is no easy way to display materials gathered with WebVac, so the Stanford Katrina content is stored but is not accessible to us at the moment.
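The report does not reproduce the actual job settings CDL used, and it does not say how the 15-minute-per-host cap was implemented. As a rough illustration only, the hop limit and "conservative rate" described above correspond to the following kinds of Heritrix 1.x crawl-order options; the element names follow Heritrix 1.x conventions, but the politeness values shown here are assumptions, not CDL's configuration:

  <!-- Illustrative Heritrix 1.x crawl-order fragment; values are assumptions, not CDL's settings -->
  <newObject name="scope" class="org.archive.crawler.scope.DomainScope">
    <!-- "never more than three hops away from the seed URL" -->
    <integer name="max-link-hops">3</integer>
  </newObject>
  <newObject name="frontier" class="org.archive.crawler.frontier.BdbFrontier">
    <!-- polite pacing: wait a multiple of each fetch's duration, within a min/max window -->
    <float name="delay-factor">5.0</float>
    <integer name="min-delay-ms">2000</integer>
    <integer name="max-delay-ms">30000</integer>
  </newObject>

The SDSC variation described above (capping the overall number of documents retrieved rather than the time spent per host) would be expressed through Heritrix's crawl-wide download limits rather than through the frontier's per-host pacing.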

Katrina Crawl Results

In terms of creating a collection of the Katrina event on the web, we are not satisfied that CDL's crawling efforts were successful. As mentioned, we had to impose time limits on each site visited and were not able to conduct quality control on the URLs that were captured. The total size of the Katrina capture at CDL is 50 GB. This represents 29 completed crawls of the seed list, which grew to over 500 seeds over the 40 days we ran the crawl. It's clear from the byte size alone that we barely skimmed the surface, and indeed the Stanford crawl proves this point: when the Stanford crawl was based on about 370 seeds, it was pulling in about 35 GB a day, or 1.5 million pages a day.

In spite of this, our attempts to capture Katrina news and events did have some very useful and positive outcomes. In terms of test crawling, the experience was quite successful, opening our eyes both to certain obstacles and to new possibilities. Most importantly, it prompted us to consider the nature and demands of event-based crawling, which had not been part of our original test plan. Here are some of the aspects of event-based crawls that Katrina surfaced.

Site Selection and Classification

This type of event demands a deep, collaborative effort to identify material to collect. CDL had to mobilize a large number of curators to make their best guesses at identifying promising sites. We set up a 'crawl seed nomination' web form, accessible to the curators from their home institutions, and invited the curators to use it to enter their suggested seed URLs. CDL staff did not have time to add much automatic error or duplicate checking, so this cumbersome work was done by hand. On the other hand, it provided us a first-hand trial of what a more general curator user interface might require.

The selection and management of seed lists is critical for sudden, event-based crawls. The curators contributing the URLs will not necessarily be well versed in the topic; in the case of Katrina, curators in California were not uniformly familiar with the Gulf Coast, the towns, the government agencies, etc. In addition, it is difficult to predict which aspects of the event will be of historic, enduring value. Because disk storage was not a pressing issue, it was better to err on the side of a wider net when selecting sites. It became clear that there might be a role for "smart crawling," which would spread a wider net than that provided by human-generated seed lists through the use of automated tools that discover relevant materials. We were specifically interested in using tools developed at Emory and Cornell that interface with the Rainbow text classification tool.7 We did not have the resources to investigate this for the Katrina crawl but plan to fold it into future crawling tests.

Finally, it is worth examining why the seed list grew continuously throughout the event. Part of this growth is due to the natural increase in the number of pages and sites devoted to the emergency, and to having the additional time to identify and add them. However, the nature of the event itself changed over time, starting as a hurricane, then becoming a flood, a massive relocation, and a political and social issue. So the range of relevant sites changed as the event itself took on broader implications. This suggests that site selection is an ongoing process, not strictly an activity undertaken at the beginning of an event.

7 Bergmark, Donna. "Heritrix processor for use with rainbow." <http://groups.yahoo.com/group/archive-crawler/message/1905>

Rights, Ownership, and Responsibilities

Our current rights protocol stipulates that when a web page is clearly in the for-profit domain, we must seek permission from the content owners to collect it. This restriction applies most notably to news agency sites, which would likely be a critical part of the web-based fallout of any major event; this was certainly the case with Katrina. However, rights negotiation is a time-consuming process. In the case of a planned event such as an election, one is able to anticipate a certain number of sources and take rights management steps before the event occurs. But with emergencies such as Katrina, there is simply no way to secure rights in advance without missing time-sensitive material. Knowing this would be an issue, CDL notified our curators in advance that the Katrina materials we gathered would not be publicly accessible but merely captured and preserved.

The complex rights issues behind web archiving, as well as the collaborative environment described above, raise a number of questions:

• Who owns the content?
• Who is responsible for the project?
• Who is responsible for responding if a content owner objects?
• When is it permissible to ignore robots exclusion files?
• How much rights management can be done as part of post-processing for an event crawl?
• Can there be standing agreements with major news agencies to allow for this type of crawl in emergency situations? If so, what constitutes an emergency?

Technical Infrastructure

CDL did not have an adequate technical infrastructure in place at the time of the Katrina crawl. As mentioned above, staff at both CDL and UC Berkeley scrambled to find server space and to install and configure the crawlers. This sudden shift in workload left the system vulnerable and somewhat unstable. In order to start the crawling process as soon as possible, CDL began the project on a temporary server. Making the transition to more permanent storage in the midst of this crawling project was neither easy nor flawless, and a certain amount of data had to be recovered from backup sources.

Information Analysis and Display

The challenge of analyzing crawl results has been described in some detail in our Test Crawl Report. That report cites a curator who found the task of reviewing an 8,899-document crawl to be unmanageable. The tools currently available are not at all up to the task of analyzing a large and complex crawl. An event-based crawl is likely to result in massive amounts of data of widely varying quality. The selection of seeds is based on guesses that, given sufficient resources, should be reviewed, refined, and enhanced as the event progresses. Time-series data based on changing input parameters represents a kind of moving target that suggests the need to develop new analysis tools.

Andreas Paepcke of Stanford University addressed this problem when considering how a social scientist might want to analyze the Katrina materials. Consider the researcher who is interested in finding out how quickly the notion of race entered the public discourse in the aftermath of Katrina. Paepcke suggests:

"For example, the tool could perform word frequency analysis across successive crawls of Katrina sites so that the user could identify the emergence of word usage such as 'looter,' 'ninth ward,' etc. The social scientist would then define within the tool clusters of words that in the scientist's judgment are direct or indirect occurrence indicators of the concept 'race' (i.e., ninth ward, looters, lacks, poor, ...). The scientist could subsequently interact with the tool at the level of these well-defined concepts. Example command: count occurrences of 'race' within the first three paragraphs of all pages, and tabulate the differences across 10 days. We could go further and apply well-known topic categorization algorithms on the data to suggest new concepts as they arise in the series of text streams. For example, the tool might identify an increasing frequency of the term 'mismanagement' beginning in week two after the storm and alert the sociologist to this evidence of a new discourse theme." 8

While we don't expect to be able to incorporate this level of analysis into our web archiving tools any time soon, these ideas illustrate a useful direction for archival analysis tools.

Conclusions

At the time Hurricane Katrina hit, the Web-at-Risk service requirements and test crawl plan had been written for a more orderly and considered approach to web crawling. Our attempt to capture the web-based aftermath of Katrina highlighted our need to revisit the service requirements and see how well they support an event-based response. We need to develop a collaborative and robust mechanism to identify and collect seed URLs. When an event happens quickly, it is also important to have in place general crawling guidelines that will result in a well-rounded collection. The technical infrastructure for the Web Archiving Service that we are building for the Web-at-Risk project should be robust enough to handle occasional and sudden bursts of activity. Attempting to alter the infrastructure quickly in reaction to emergency events impairs our responsiveness and leaves the altered systems in a vulnerable state.

8 Paepcke, Andreas, Senior Research Scientist and Director of the Digital Library, Stanford University. Email correspondence with Patricia Cruse, October 26, 2005.


Finally, work needs to continue both on the rights management front and on developing improved web archiving analysis tools, so that the material gathered can be used to its greatest potential.


Web-at-Risk Test Crawl Report: Appendix C Individual Crawl Reports

Included below are the crawl reports provided to individual curators, including their analysis of and feedback about those results. Note that a key to interpreting the tables in these crawl reports is provided at the end of this appendix.

Elizabeth Cowell (submitted by Ann Latta): UC Merced

CDL Report to Curator

URL: http://www.ucmerced.edu

Curator's original comments: "UC Merced is the first research university to be built in the 21st century. The educational and land use issues are significant. Of particular interest is http://www.ucmercedplanning.net. This site addresses major issues of land use - the university is being built on agricultural land. Controversy existed re issues of redevelopment of downtown Merced vs. appropriation of agricultural land - there are major environmental issues focused on endangered species - educational issues involving faculty job descriptions, student body, etc. are significant because of the economic, ethnic, and cultural diversity of the region."

Site copyright statement: "© 2004 UC Regents"

Crawl Results


Comments from crawl operator: When we set the crawl to include pages from linked sites, the crawler got "trapped" at the Elsevier site. There is JavaScript on that linked page that causes the crawler to continue looking for additional pages on the Elsevier site, even when you're only trying to capture a single page. Once we set a limit for the maximum number of retry attempts, the crawl completed. This data is from the completed crawl. (An illustrative fragment showing where that retry limit lives appears after the host list below.)

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 227 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
969 71552369 www.ucmerced.edu
238 2564803 www.ucop.edu
226 14851 dns
197 16583197 www.universityofcalifornia.edu
156 8487817 www.elsevier.com
151 1437436 www.greatvalley.org
112 2354582 faculty.ucmerced.edu
105 5659795 www.pacific.edu
90 111985 k12.ucop.edu
86 255733 www-cms.llnl.gov
85 1178031 admissions.ucmerced.edu
81 297947 uc-industry.berkeley.edu
71 108265 www.mssmfoundation.org
67 349300 www.nps.gov
66 308926 www.usafreedomcorps.gov
54 137085 slugstore.ucsc.edu
52 52202 www.cerrocoso.edu
51 977315 www.universityofcalifornia.com
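The retry cap the operator mentions is a standard Heritrix 1.x frontier setting. As a sketch only (the values shown are assumptions; the report does not say what limit CDL actually used):

  <newObject name="frontier" class="org.archive.crawler.frontier.BdbFrontier">
    <!-- stop re-queuing a URI after this many failed fetch attempts -->
    <integer name="max-retries">3</integer>
    <integer name="retry-delay-seconds">900</integer>
  </newObject>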

Curator Feedback to CDL (Cowell Merced)

Crawl Success: mostly effective


Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: 10 more links came up in the via search; 8 of these were not useful for the research of land use issues. The two that were useful were a benefit of the via search.

Crawl Frequency: monthly


Sherry DeDekker: California Water Science Center

CDL Report to Curator

URL: http://ca.water.usgs.gov

Curator's original comments: "We are interested in the CA water reports and databases behind the links on this page. Also in looking at http://waterdata.usgs.gov/ca/nwis/nwis; this section is an interactive interface to multiple data sets. Is it possible to capture this type of site as well as the static reports?"

Site copyright statement: "Information presented on this website is considered public information (unless otherwise noted) and may be distributed or copied. Use of appropriate byline/photo/image credit is requested. We strongly recommend that USGS data be acquired directly from a USGS server and not through other sources that may change the data in some way."

Crawl Results

Comments from crawl operator: Databases, e.g., http://waterdata.usgs.gov/ca/nwis/nwis, are disallowed by http://waterdata.usgs.gov/robots.txt. Some water data reports are NOT caught by the broader "linked hosts included" settings, e.g., http://pubs.usgs.gov/wdr/2004/wdr-ca-04-1 -- would also want to submit http://ca.water.usgs.gov/waterdata as a seed.

Related hosts crawled:


When the crawl was set to include documents from other sites that the original site linked to, 662 additional sites were crawled. The following list includes the hosts that supplied more than 50 files. Note that the host "pubs.usgs.gov" supplied a higher number of files than the original host itself.

[urls] [bytes] [host]
1963 255912820 pubs.usgs.gov
1153 47066381 ca.water.usgs.gov
698 56570 dns
404 112354772 geopubs.wr.usgs.gov
385 9377715 water.usgs.gov
327 203939163 greenwood.cr.usgs.gov
318 17431487 www.elsevier.com
219 3254794 www.usgs.gov
189 2737159 www.lsu.edu
163 2292905 wrgis.wr.usgs.gov
158 31124201 www.epa.gov
149 921063 www.usda.gov
[list truncated...]

Curator Feedback to CDL (DeDekker CWSC)

Crawl Success: somewhat effective

Crawl Success Comments: Site appears to access water data reports (http://ca.water.usgs.gov/archive/waterdata/index.html) but none are actually available through the links. I expected the site to not be able to access real-time data, but these are archived reports.

Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Frequency: weekly


Megan Dreger (submitted by James Jacobs): City of San Diego Planning Department

CDL Report to Curator

URL: http://www.sandiego.gov/planning

Curator's original comments: I would like to drill down several levels (at least 3) of this site. For example, following the link to City of Villages/general plan update leads to many more important planning documents.

Site copyright statement: This site contains the two following notices on the same page:

Restrictions on Use of Materials: "This site is operated and maintained by the City of San Diego through its Department of Information Technology and Communications (referred to as IT&C). Except as provided herein, no material or information from this site may be copied, reproduced, republished, uploaded, posted, transmitted, or distributed except as authorized in this notice, expressly authorized within this site, or approved in writing by IT&C."

Copyright Notice: "Unless a copyright is indicated, information on the City of San Diego Web site is in the public domain and may be reproduced, published or otherwise used with the City of San Diego's permission. We request only that the City of San Diego be cited as the source of the information and that any photo credits, graphics or byline be similarly credited to the photographer, author, or City of San Diego, as appropriate. If a copyright is indicated on a photo, graphic, or any other material, permission to copy these materials must be obtained from the original source."

Crawl Results


Comments from crawl operator: Need feedback about whether the desired content was retrieved. Question for curator: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 1084 additional hosts were crawled. (This figure does not represent a complete crawl, as the size limitation was reached.) The following hosts supplied more than 75 documents to your site:

[urls] [bytes] [host]
3728 556231640 www.sandiego.gov
1247 38685244 genesis.sannet.gov
1085 80905 dns
807 6676252 www.houstontexans.com
428 1079658 www.cacities.org
399 102298888 www.buccaneers.com
259 1797232 granicus.sandiego.gov
258 42666066 clerkdoc.sannet.gov
238 5413894 www.ccdc.com
225 2503591 www.ci.el-cajon.ca.us
223 1387347 www.ipl.org
217 2683826 www.sdcounty.ca.gov
203 11673212 restaurants.sandiego.com
195 2620365 www.sdcommute.com
192 1344523 www.bengals.com
189 2221192 www.kidsdomain.com
176 1333528 www.buffalobills.com
171 685965 www.chumpsoft.com
166 277238 www.proquest.com
[list truncated...]


Curator Feedback to CDL (Dreger San Diego)

Crawl Success: mostly effective

Crawl Success Comments: This crawl was not completed due to size, so that may explain some of my questions. It was pretty effective in terms of getting the Planning Dept pages, but went out further than I expected. Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit. It's nice that PDFs and other formats are included.

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: The linked hosts included (via) crawl seemed to include more extraneous stuff.

Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)

Q: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.

A: You asked for a better definition of what was meant by the request to drill down several levels (at least 3). Unfortunately, I wasn't the original curator and I'm not sure what he meant. The crawl you did was very useful.

Crawl Frequency: monthly

Questions/Comments about crawl: I think that the crawl frequency should be at least monthly for these pages.


Peter Filardo and Michael Nash: New York City Central Labor Council

CDL Report to Curator

URL: http://www.nycclc.org

Curator's original comments: (none)

Site copyright statement: "© 2004 New York City Central Labor Council. No portion of this website may be reproduced in any form without permission from the Central Labor Council. Contact our offices for more information at nycaflcio@aol.com."

Crawl Results

NOTE: Because your Crawl "A" had to be stopped, then resumed, each of your reports for that crawl is in two segments. To browse a list of all reports for that crawl, go to http://voro.cdlib.org:8081/ingest_misc/ndiipp/testcrawls_raw/filardo_labor_via

Comments from crawl operator:

A. Linked hosts included: Crawl complete after recovery, with the addition of max retries. Seemed to hang at http://www.nycclc.org/calendar/event.asp?EventId=501 and http://www.nycclc.org/assets/HLCapplicationmembership.pdf. Ended the crawl when it seemed to hang; recovered from the previous job, and the recovery was successful. Note for the future that a recovered job is identifiable because the logs directory is called logs-R.


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, x additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
1913 74260017 www.nycclc.org
156 11755 dns
115 710552 www.aflcio.org
73 1477966 www.comptroller.nyc.gov
71 193264 www.empirepage.com
60 570115 www.redcross.org
58 269079 www.afl-cio.org
57 240845 www.campsussex.org
57 113676 www.mssm.edu
56 449473 www.labor-studies.org
53 184605 www.pbbc.org
52 134326 www.senate.gov
[list truncated...]

Curator Feedback to CDL (Filardo NYCCLC)

None provided


Valerie Glenn and Arelene Weibel: Strengthening Social Security

CDL Report to Curator

URL: http://www.strengtheningsocialsecurity.gov

Curator's original comments: "Contains external links to audio & video that would be essential to completing this site (see press room); some are files, some are links to webcasts, some are on external .gov sites, and some are external .com sites."

Site copyright statement: Copyright info not found

Crawl Results

Comments from crawl operator: Interesting audio/video/webcast content -- need feedback about success in capturing it.

Questions for Curator: How successfully did this crawl capture the multimedia documents you were interested in?

Comments from coordinator: In the media center area of this site (http://www.strengtheningsocialsecurity.gov/press/media_center.shtml), I've found .ram (both video and audio alone), .smil, and .asx files. The site also contains numerous .ppt and .pdf files.

• A text search on the log file turns up numerous .ram files, but only 1 .ppt file.
• .asx files are Windows streaming media redirector files, which generally lead to associated .wmf files. No .asx references appear in the crawl log, nor do any .wmf files.
• Similarly, .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files. (An illustrative SMIL fragment follows the host list below.)

I assume that when displayed, some of the real media files from this site would function, but many of the other multimedia files would not.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 388 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
660 10668874 www.chelseapiers.com
562 7334035 www.whitehouse.gov
477 6366197 www.laopinion.com
391 29623 dns
356 3874719 www.wkrc.com
243 12294240 www.strengtheningsocialsecurity.gov
178 1935969 www.xavier.edu
148 237055 image.com.com
127 682069 online.wsj.com
117 898439 www.omaha.com
116 514995 www.npr.org
108 995733 www.nba.com
[list truncated...]
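To make the coordinator's point about pointer files concrete: a SMIL file is itself a small XML document whose media references typically live on a streaming server, so archiving the .smil does not archive the video it points to. The fragment below is purely hypothetical (the host and file name are invented for illustration), not a file taken from the crawled site:

  <smil>
    <head>
      <layout>
        <root-layout width="320" height="240"/>
      </layout>
    </head>
    <body>
      <!-- the crawler can save this .smil, but the .rm it references sits behind
           an rtsp:// URL that an HTTP crawler such as Heritrix never fetches -->
      <video src="rtsp://media.example.gov/press/briefing.rm"/>
    </body>
  </smil>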

Curator Feedback to CDL (Glenn Strength Soc Sec)

Crawl Success: mostly effective

Crawl Success Comments: My main concern about this site was the multimedia documents included - I've posted those thoughts in the Questions for Curator text box.

Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: For this site it was essential to capture the linked hosts (via), because many of the press materials, etc. were on external sites.

Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)

Q: How successfully did this crawl capture the multimedia documents you were interested in?

A: I'm disappointed that not all of the multimedia files were captured, but there seem to be only a few that aren't included. I found it interesting that broadcasts from the same host (whitehouse.gov) weren't completely captured - some were, some weren't.

Crawl Frequency: once

Questions/Comments about crawl: [note: the curator alludes to adding a comment here, but no comment was received]


Valerie Glenn and Arelene Weibel: Defense Base Closure and Realignment Commission

CDL Report to Curator

URL: http://www.brac.gov

Curator's original comments: "In our previous efforts we have been unable to capture agency databases. The BRAC site includes a document library which has a search feature (http://www.brac.gov/Search.aspx) and a browse feature (http://www.brac.gov/Browse.aspx). We would really like to see how this information can be captured so that we can recreate it on our own servers."

Site copyright statement: "The contents of all material available on this Internet site are in the public domain and are not copyrighted. The content of this site may be freely reproduced, downloaded, disseminated, published, or transferred in any form and by any means. However, in some cases the copyright for certain text or images on this site may be held by other parties."

Crawl Results

Comments from crawl operator:

A. Linked hosts included: http://www.brac.gov/Search.aspx can't be captured by Heritrix. http://www.brac.gov/Browse.aspx seems to only capture the first 25 documents. Tried again with the browse page as the starting point, but it stopped after 1005 documents; it extracted 20 links from the browse page and then there were no more URLs in the frontier queue that had been extracted from Browse.aspx -- perhaps need more experimentation.

B. Restricted to original host: again, only the first 25 pages from browse -- can't even successfully pass a seed URL listing the max docs per browse page (50).

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 44 additional hosts were crawled. Note that because your crawl did not complete, this is not an accurate count of how many other sites your original site may link to.

[urls] [bytes] [host]
2034 1064389540 www.brac.gov
555 5874934 www.slu.edu
87 173510 www.cpcc.edu
54 154588 www.wmata.com
47 685158 www.sluhospital.com
44 3501 dns
44 582555 www.c-span.org
43 174467 www.adobe.com
38 178153 www.q-and-a.org
32 127325 slubkstore.com
24 140653 www.c-spanclassroom.org
23 326680 www.capitalnews.org
22 213116 cancercenter.slu.edu
21 196012 www.defenselink.mil
[list truncated...]

Curator Feedback to CDL (Glenn Defense Base Closure)

Crawl Success: somewhat effective

Crawl Success Comments: I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Frequency: once

Questions/Comments about crawl: One of the reasons I submitted this site to be crawled is that I'd already tried to harvest it using HTTrack. That product appeared to capture more of the documents than Heritrix - but we still haven't been able to capture the entire site.


Gabriela Gray: Join Arnold

CDL Report to Curator

URL: http://www.joinarnold.com

Curator's original comments: "California Recovery Team. Non-profit pro-Arnold group, not registered as a campaign committee. Critical Aspects: Complex file and directory naming structure. Looping, e.g., contactus.asp and contactadd.asp links."

Site copyright statement: "Copyright 2005"

Crawl Results

Comments from crawl operator:

A. Linked hosts included: Great site for testing -- this loop is really interesting because a new URL is generated with each loop, so that the duplicate-detection work underway at IA would still not eliminate it. 44332 of the retrieved URLs were contact pages.

B. Restricted to original host: Got into a loop by the end; 999 documents retrieved, 34 minutes.

C. Restricted to original host + regular expression: Excluding pages that matched the regular expression contactadd.asp?c= did not end the loop. What did end the loop: excluding both the contactus and contactadd pages so they were not retrieved at all -- a drawback. (IA takes the manual approach of gathering the pages, then having an operator stop the crawl and take out the looping URLs by hand -- not scaleable.) A sketch of the rules that did end the loop follows below.

  <newObject name="contact" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
    <string name="decision">REJECT</string>
    <string name="regexp">.*contact.*asp.*c=.*</string>
  </newObject>

Related hosts crawled: Because of looping problems, we were not able to crawl other hosts linked from this site.
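Returning to the exclusion described in item C above: the change that did end the loop (dropping both contact pages entirely) could be expressed with the same decide-rule mechanism, for example as two REJECT rules. The regular expressions below are a sketch, not the operator's actual settings:

  <newObject name="contactus" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
    <string name="decision">REJECT</string>
    <string name="regexp">.*contactus\.asp.*</string>
  </newObject>
  <newObject name="contactadd" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
    <string name="decision">REJECT</string>
    <string name="regexp">.*contactadd\.asp.*</string>
  </newObject>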

Curator Feedback to CDL (Gray Arnold)

Crawl Success: somewhat effective

Crawl Success Comments: We spot-checked and it looks like most files were captured, but the individual pages don't display most of the images. (This may simply be a problem with the WERA interface.) Strangely enough, the Flash files work perfectly, which is exactly the opposite of our own capture experience.

Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: As per the crawl notes, we only checked the original host version, since the via crawl failed.

Crawl Frequency: once

Questions/Comments about crawl: Same comments as on the Villaraigosa crawl. We're inexperienced with this type of capture: archive files as is, and use a tacked-on script and a special server-side interface to interpret links in the new environment. Our model has been to actually alter the internal links from absolute to relative formats so that the site works in any environment.


Gabriela Gray: Mayor-Elect Villaraigosa

CDL Report to Curator

URL: http://www.antonio2005.com

Curator's original comments: Critical Aspects: Flash animation; content scattered across multiple servers; maintaining complex internal link structure; JavaScript menus; streaming media.

Site copyright statement: "©2005 Villaraigosa for Mayor 2005"

Crawl Results

Comments from crawl operator:

• (for Linked hosts results) Need feedback on media, etc. retrieved -- this site is an ideal example of the need for scope+one.
• (for Restricted to original host) How much was left out due to the domain restriction? Need feedback.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 263 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

[urls] [bytes] [host]
817 10291631 ensim3.interlix.com
805 117538973 www.antonio2005.com
472 6333775 www.laopinion.com
265 21173 dns
110 19355921 www2.dailynews.com
100 16605730 www2.dailybulletin.com
95 1410145 www.americanpresidents.org
86 820148 www.dailynews.com
73 168698 www.chumpsoft.com
72 52321 images.ibsys.com
69 836295 www.laobserved.com
65 137700 www.mysql.com
55 213569 www.ensim.com
55 177141 www.lamayorcn.com
55 296311 www.surveyusa.com
53 495858 abclocal.go.com
52 522324 www.c-span.org
51 244668 gallery.menalto.com
[list truncated...]

Curator Feedback to CDL (Gray Villaraigosa)

Crawl Success: mostly effective

Crawl Success Comments: Doing some spot checks, it looks like all of the pages were captured. Some problems with media files -- WERA shows them when we search, but the files are often size 0. In addition, many files on external servers are listed, and even have some descriptive info, but when we click on Overview it says "Sorry, no documents with the given uri were found," so no idea if they were really captured.

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: Most of the links to other sites go to pages that truly are external to the site, not incorporated into it. The only exception would be the pages from ga3.org and ga4.org.

Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): The results on the multiple hosts crawl are mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains, rather than source + 1. Also, there seem to be a lot of links going to ensim3.interlix.com -- there were more files captured from there than from antonio2005. This is a case where it looks like the entire site (or much of it) was mirrored on two different servers. When we find these, we often try to collapse them into one seamless whole, eliminating the duplication, which is meaningless from the user's viewpoint. Simply capturing both mirrors and leaving the cross-links intact is an option we've used when we can't collapse, but it often leads to problems with links between the two, which seems to be the case here.


Crawl Frequency: once

Questions/Comments about crawl: We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE -- the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.


Ron Heckart and Nick Robinson: Public Policy Institute of California

CDL Report to Curator

URL: http://www.ppic.org

Curator's original comments: "We are particularly interested in their publications. We hope the crawler will be able to report when new publication files are posted on the website. Our main focus of interest is on their new publications at http://www.ppic.org/main/newpubs.asp."

Site copyright statement: "All Contents © Public Policy Institute of California, 2003, 2004, 2005"

Crawl Results

Comments from crawl operator: We can't at the moment use Heritrix to report on new publications posted.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 238 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

[urls] [bytes] [host]
2421 324309107 www.ppic.org
433 1367362 www.cacities.org
238 19286 dns
229 4675065 www.icma.org
200 598505 bookstore.icma.org
151 1437436 www.greatvalley.org
144 517953 www.kff.org
137 5304390 www.rff.org
113 510174 www-hoover.stanford.edu
102 1642991 www.knowledgeplex.org
97 101335 cdn.mapquest.com
81 379020 www.cde.ca.gov
73 184118 www.ilsg.org
68 4539957 caag.state.ca.us
62 246921 www.milkeninstitute.org
[list truncated...]

Curator Feedback to CDL (Heckart PPIC)

Crawl Success: mostly effective

Crawl Success Comments: There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on Date retrieves the message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)

Crawl Frequency: weekly

Questions/Comments about crawl: Our crawl report included the following comment from the crawl operator: "We can't at the moment use Heritrix to report on new publications posted." The ability to report on new publications is critical to our goal of using the crawler as a discovery tool. What are the prospects for providing this functionality in the future?


Terry Huwe: AFL-CIO

CDL Report to Curator

URL: http://www.aflcio.org

Curator's original comments: "This site is content rich and has many files that will be useful in the future. Specific areas that are of special interest follow below. http://www.aflcio.org/corporatewatch: the data related to executive pay watch is especially useful. http://www.aflcio.org/mediacenter: would like to see press stories captured if possible. http://www.aflcio.org/issues: links to newsletters and original content. Also, "Legislative Action Center" on the home page; this is a useful topic guide to legislative history from a labor perspective."

Site copyright statement: "Copyright © 2005 AFL-CIO"

Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 2572 additional hosts were crawled. The following hosts supplied more than 75 URLs to your site:

[urls] [bytes] [host]
12702 481956063 www.aflcio.org
2657 184477 dns
1375 35611678 www.local237teamsters.com
570 8144650 www.illinois.gov
502 52847039 www.ilo.org
435 3851046 www.cioslorit.org
427 2782314 www.nola.com
401 8414837 www1.paperthin.com
392 15725244 www.statehealthfacts.kff.org
326 4600633 www.dol.gov
288 12303728 searchoxide.com
284 3401275 www.sikids.com
280 3069385 www.washingtonpost.com
272 1480539 www.cdc.gov
235 5455692 www.kff.org
[list truncated...]

Curator Feedback to CDL (Huwe AFL-CIO)

Crawl Success: effective

Crawl Success Comments: I realize the collection interface is a work in progress and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g., Working Families Toolkit, BushWatch) that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2 million-plus hits per year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses of the toolkit. Having said all that, my short answer is: I'm stoked :-)

Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: In the case of this collection there's a lot of original content (and context-making documents) on the linked sites, simply due to the hierarchical/federal nature of the labor movement.

Crawl Frequency: monthly

Questions/Comments about crawl: I may have missed correspondence on this question, as I've been busy with a library renovation here. My question: Is the 1 gig limit a useful one? I couldn't help wondering what the actual figure would be if that limit were set higher. I'm generally interested in the total size of Web collections, as I need to monitor ours (which is content-rich but even so does not exceed 1 gig). So discussion of this at least would be interesting, and maybe, if it's not technically challenging, that 1 gig level might be raised and we could see what happens. Thanks to all involved--very interesting process.


Kris Kasianovitz: Los Angeles Dept. of City Planning

CDL Report to Curator

URL: http://cityplanning.lacity.org

Curator's original comments: "Website uses frames. Most of the documents will be pdfs. Of particular interest: the EIRs, which are not archived on the page once the project is approved; see http://cityplanning.lacity.org/EIR/TOC_EIR.htm. General and Community Plans: http://cityplanning.lacity.org/complan/gen_plan/genplan2.htm, http://cityplanning.lacity.org/ComPlan/cpbpage.htm."

Site copyright statement: No copyright information found

Crawl Results

Comments from crawl operator: (Linked hosts included crawl) ended because it ground on for 3 days without hitting the data limit; not sure if the URLs at the end of the log are valid/useful.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 119 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
10493 840876945 cityplanning.lacity.org
601 5156252 metrolinktrains.com
183 644377 www.cr.nps.gov
121 11162 dns
90 977850 www.metrolinktrains.com
81 1207859 www.fta.dot.gov
79 263432 www.fypower.org
66 333540 www.adobe.com
64 344638 lacity.org
63 133340 ceres.ca.gov
60 274940 www.amtrak.com
59 389217 www.nhtsa.dot.gov
58 347752 www.unitedweride.gov
52 209082 www.dot.gov
52 288783 www.nationaltrust.org
51 278949 www.portoflosangeles.org
[list truncated...]

Curator Feedback to CDL (Kasianovitz LA City Planning)

Crawl Success: mostly effective

Crawl Success Comments: The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all pdfs. When searching specifically for EIRs, I got a large result list (699 citations); however, when I investigated whether or not the actual file was captured, I found that the main EIR page was captured (typically a htm/html file), but when I clicked on a link to get to the full report, all I got was the "Sorry, no documents with the given URI were found" message. This could be that the file was no longer available when the site was harvested; however, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved. Here are a few specific examples:

Final EIR directory: http://cityplanning.lacity.org/EIR/Tocfeir.htm

Sierra Canyon Secondary School (cover page): http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/SierraCyn_coverpg.htm. Access to the Draft EIR and Final EIR is provided from this cover page. Within the system, the links to both the Draft and Final are broken/no documents with that given URI: http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/DEIR/Table of Contents.htm

Villa Marina EIR: http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm

Directory of EIR notices of preparation: http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM


This provides links to EIR preparation notices (all pdfs), a total of 27 links/documents. I encountered the following three issues:

• pdf opened = when clicking on the link to the notice, the pdf opened with no problem: 16 of 27
• "Sorry, no document with the given uri was found" = no pdf harvested, but I could get to it from the live site: 4 of 27
• Acrobat "Could Not Open" message (could open the live page outside of WERA) = the following Acrobat message came up when I tried to open the pdf: "Acrobat could not open ENV-2005-0881-EIR[1].pdf because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded). To create an Adobe Acrobat PDF document, go to the source application. Then print the document to Adobe PDF." I copied and pasted the url into a regular browser and could open the pdf with no problem. Also, in a few cases, if I clicked on the GO button after the first attempt to open the pdf in the system, it seemed to launch: 7 of 27

Conversely, I found a number of pages that contained full documents in html, with links to pdfs that worked with no problem. See the following document: http://cityplanning.lacity.org/cwd/gnlpln/transelt/TET2Bkgrnd.htm. File types and error codes were what I expected.

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: The "restricted to original host" crawl is more helpful/relevant for getting to the materials from a specific agency. For some of the results that were restricted to host, I was getting external links. Here are some comparisons for each of the crawl settings.

Searched for "villa marina":
LA Dept of City Planning: 6 results
http://cityplanning.lacity.org/EIR/NOPs/ENV-2004-3812-EIR.pdf
http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM
http://cityplanning.lacity.org/EIR/Tocfeir.htm
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/Cwd/GnlPln/HsgElt/HETblFigApVHgSit.htm
LA City Dept of Planning (via): 2 results
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/Cwd/GnlPln/HsgElt/HETblFigApVHgSit.htm

Searched for "eir":
LA Dept of City Planning: 699 results
LA City Dept of Planning (via): 324 results


For both of these searches, the URIs were from cityplanning.lacity.org.

Searched for "transportation":
LA Dept of City Planning: 699 results
LA City Dept of Planning (via): 290 results (most are from external sources and tended to be the index or main page of another agency or organization. Because this just got me to the main page, and none of the links functioned at that level, the via result was less helpful. However, the via results are useful for discovering other agencies or organizations that I should be looking at for materials.)

Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): The list of linked hosts provided at the end of the report is helpful. This will help me target other agencies that might be key places to check for further collection development.

Crawl Frequency: monthly

Questions/Comments about crawl: I want to qualify the frequency for this site: I'd like to do a monthly crawl for 3-4 months. I'd want to reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot of new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation. After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report, there are 8899 documents in this crawl alone; my other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls. I should note that at this point in time, for local documents, I am more interested in individual documents than in capturing an entire website and preserving the functionality. Local agencies (with the exception of perhaps the Mayor's site) tend not to change the design (look and feel) very often, or have a very sophisticated design (i.e., flash, changing images, etc.), and I don't see that this would be of interest to researchers' needs; it is the content (reports, maps, etc.) that is contained/accessed on the websites that is important. Maybe I'm wrong or being short-sighted about that.


Kris Kasianovitz: Southern California Association of Governments

CDL Report to Curator

Curator's original comments: "This is a critical regional agency for Los Angeles, Orange, Ventura, Imperial, Riverside, and San Bernardino counties. Its main areas are Transportation, Housing, and Economic Development. This will provide an analysis of the overall site, which has a lot of content. Publications/reports are typically in pdf - they are presented as full reports and pieces of the report (for easier downloading), so there might be duplication; the full report is really all that would be needed. See http://www.scag.ca.gov/publications. The Resources page contains the pdfs, images, dynamic content, and gis programs, including an interactive atlas: http://www.scag.ca.gov/resources.htm. One part of the Resource site is the Web Accessible Geographic Data Search (WAGS), http://mapsvr.scag.ca.gov/wags/index.cfm?fuseaction=. It requires a user-created login and password (although there is a guest login that allows you to bypass this); I'm not sure what kind of difficulty the harvester will encounter with this portion of the site. The interactive atlas also has a create-an-account/guest login issue: http://mapsvr.scag.ca.gov/atlas/presmap.asp?Cmd=INIT. Since it is a dynamic page, I don't know how this will be handled by the harvester."

Site copyright statement: "© 1999-2005 Southern California Association of Governments"

Crawl Results


Comments from crawl operator: Interesting login problem; Heritrix was unable to retrieve the guest login pages. The ColdFusion and asp pages don't generate new URLs and thus don't get crawled. NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than 1 minute.

Robots.txt file: The site you selected forbids crawlers from gathering certain data. It reads:

User-agent: *
Disallow: /_mm/
Disallow: /_notes/
Disallow: /_baks/
Disallow: /MMWIP/

User-agent: googlebot
Disallow: /*.csi

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 500 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
2517 863231651 www.scag.ca.gov
690 6134101 www.metrolinktrains.com
506 40063 dns
428 1084533 www.cacities.org
397 16161513 www.sce.com
196 581022 bookstore.icma.org
187 4505985 www.icma.org
175 7757737 www.ci.seal-beach.ca.us
158 1504151 www.h2ouse.org
149 940692 www.healthebay.org
137 317748 www.ci.pico-rivera.ca.us
130 18259431 www.ci.ventura.ca.us
123 490154 www.chinohills.org
121 406068 www.lakewoodcity.org
119 203542 www.lavote.net
117 2449995 www.ci.malibu.ca.us
114 744410 www.ci.irvine.ca.us
113 368023 www.whitehouse.gov
109 974674 www.dot.ca.gov
107 892192 www.lacanadaflintridge.com
[list truncated...]

Curator Feedback to CDL (Kasianovitz SCAG)

Crawl Success: mostly effective


Crawl Success Comments: Similar to my comments about the Los Angeles Dept. of City Planning - the crawl brought back a lot of webpages, but not the publications/documents that I would want to collect. Again, the same problem happened with SCAG as happened with the EIR example: the crawl brought back html pages with links to reports (typically in pdf format), but the actual documents were not captured. While the webpage is helpful, as it gives context, the main content that I'd want to capture (the reports) was not captured. See the following for example; none of the webpages linked from these pages are available (they should link to a page that will have the material), and I tried searching for the documents separately and couldn't get to them: http://www.scag.ca.gov/publications/index.htm (the timeline arrows at the top seemed to function; I'm not sure what this is for), http://www.scag.ca.gov/livable/pubs.htm. I was impressed to find that zip files were captured and I was able to download them. Unfortunately, when I opened them there wasn't any content (I did the same search by mistake with the Arizona Dept. of Water Resources and actually found content in the folders); I found 10 with the search type:zip. The gif or jpg images retrieved are not useful - most were just bars or bullets or covers of reports (although this might be helpful to identify titles, I think I would end up discarding these -- after doing more checking of the results).

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: See comments from LA Dept. of City Planning. "Restricted" gets me to the relevant materials for that agency; "via" brings back too many main webpages for other agencies to be useful.

Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): I expected that there would be a problem with the content behind logins. The crawl confirmed that material behind login screens couldn't be captured. Can I get a copy of these crawl results? NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than one minute.

Crawl Frequency: monthly

Questions/Comments about crawl: How to handle the copyright issue? For the login information, I'm not sure what all was blocked by the robots file. Interesting that metrolinktrains.com is the #1 related host for both of my crawled sites.


Linda Kennedy: California Bay Delta Authority
CDL Report to Curator

URL: http://calwater.ca.gov
Curator's original comments: "We are interested in the environmental impact statements and other key documents, and the various news releases and other announcements and archives of CALFED."
Site copyright statement: "© 2001 CALFED Bay-Delta Program"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 519 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls]  [bytes]    [host]
1130    473192247  calwater.ca.gov
741     201538533  www.parks.ca.gov
521     40442      dns
373     51291934   solicitation.calwater.ca.gov
242     78913513   www.calwater.ca.gov
225     410972     cwea.org
209     87556344   www.science.calwater.ca.gov
173     109807146  science.calwater.ca.gov
172     1160607    www.adobe.com
129     517834     www.whitehouse.gov
[list truncated…]

Curator Feedback to CDL (Kennedy CALFED)


Crawl Success: mostly effective
Crawl Success Comments: I looked closely at the CALFED home page (http://calwater.ca.gov), Tribal Home page (http://calwater.ca.gov/Tribal/Tribal_Home.shtml), Key Documents (http://calwater.ca.gov/CALFEDDocuments/CALFEDDocuments.shtml) and Archives page (http://calwater.ca.gov/Archives/Archives.shtml). The crawl did not complete in either the via or non-via search. Nearly all linked pages were retrieved in the non-via search. However, the retrievals from the via search were much less complete than the retrievals from the non-via search. For example, on the Key Documents page there were 3 missing links from the non-via search but 14 missing links from the via search. When Adobe documents were retrieved from either crawl, they came up correctly. Three asp links of tribal maps from the Tribal Home page were retrieved by the non-via search but not the via search. A few of the images were missing from the displays, and this was also affected by the browser used. Usually the same image was missing from both crawls, but sometimes the images were more complete in the non-via crawl retrievals. There were some display problems with the right-hand menu boxes on the http://calwater.ca.gov/Tribal/Tribal_Home.shtml page, for example, that did not display correctly when viewed via Firefox. The same page viewed correctly in Internet Explorer, but when printed out the boxes printed incorrectly, just as viewed in the Firefox browser. Grant Opportunities (http://calwater.ca.gov/GrantOpportunities/GrantInformation.shtml): this link did not work in 2 via and 2 non-via instances (from the Tribal home page and from the Archives page) but did work on one non-via crawl page (the CALFED home page). It could also be searched and retrieved directly from the test crawl search page.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments above. The non-via search was substantially more complete.
Crawl Frequency: monthly


Janet Martorana: Santa Barbara County Department of Planning and Development
CDL Report to Curator

URL: http://www.countyofsb.org/plandev/default.htm
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 487 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls]  [bytes]     [host]
3119    1102414495  www.countyofsb.org
485     34416       dns
428     1083047     www.cacities.org
357     6126453     www.sbcphd.org
320     6203035     icma.org
250     438507      www.sbcourts.org
234     1110744     vortex.accuweather.com
200     593112      bookstore.icma.org
[list truncated…]

Curator Feedback to CDL (Martorana SBCD)

Crawl Success: mostly effective


Crawl Success Comments: Most documents I expected to find were captured, but a number were not. For example, off this page http://www.countyofsb.org/plandev/comp/threeyear/2005-2008/default.html I expected to get to the final work program http://www.countyofsb.org/plandev/pdf/comp/programs/Three_Year_WP/2005-2008_3YrWrkProgram.pdf but got the "Sorry, no documents with the given uri were found" message. Other examples are within the Energy Division, a part of the Planning & Development Dept. Off this page http://www.countyofsb.org/energy/information.asp I could access all links except for two: http://www.countyofsb.org/energy/information/oil&GasFields.asp (Oil and Gas Fields) and http://www.countyofsb.org/energy/information/oil&GasProduction.asp (Oil and Gas Production). The crawler seemed to cut off the URL right before the ampersand; perhaps it has problems with ampersands.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly
Questions / Comments about crawl: I'd like to be notified when I've navigated away from the captured site. I found myself on realtime web sites but couldn't tell what was on the captured site and what was real. Example: When I did a search on "santa barbara" I got 1528 results using the via results. I went to the County of Santa Barbara Online site and then, after clicking on links, I was on the web realtime and not in the crawl results database any longer, yet there were no indications that I had left the crawled database. The WERA uri was still displaying at the top of the screen. I couldn't tell what were the captured sites and what were the current realtime sites. Other observations: the webpage navigation doesn't work, e.g. the Table of Contents doesn't jump to that section on the webpage (www.countyofsb.org/energy/projects/shell.asp and www.countyofsb.org/energy/mitigation/oakProject.asp); links to glossary terms go to the glossary but not to the term itself.


Lucia Orlando: Monterey Bay National Marine Sanctuary
CDL Report to Curator

URL: http://montereybay.noaa.gov
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 795 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls]  [bytes]    [host]
5272    468755541  montereybay.noaa.gov
861     61141      dns
554     20831035   www.wunderground.com
368     4718168    montereybay.nos.noaa.gov
282     3682907    www.oceanfutures.org
273     10146417   www.mbnms-simon.org
260     7159780    www.mbayaq.org
163     61399      bc.us.yahoo.com
152     1273085    www.mbari.org
146     710203     www.monterey.com
119     3474881    www.rsis.com
119     279531     www.steinbeck.org
118     1092484    bonita.mbnms.nos.noaa.gov
109     924184     www.duke.edu
104     336986     www.montereybayaquarium.org
103     595953     icons.wunderground.com
102     339589     www.uncw.edu
[list truncated…]

Curator Feedback to CDL (Orlando - Monterey Bay)

Crawl Success: (rating not provided)
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: This site contains a large mix of government policy, laws, and regulatory information, as well as links to recreational and educational activities available in the MBNMS. I was most interested in links to laws/regs/policy and educational info and organizations. I thought the links restricted to the original host best captured this information succinctly.
Crawl Frequency: unknown


Richard Pearce-Moses: Arizona Department of Water Resources
CDL Report to Curator

URL: http://www.azwater.gov (redirects to http://www.azwater.gov/dwr)
Curator's original comments: "In arid Arizona, water is one of the most important - and most contested - resources. The publications and records of this Department are of critical value to the state. Our spider can get many files from this site (1474 files in 258 directories). We are mostly interested in documents by programs. Although our spider can't get the imaged documents database (http://www.azwater.gov/dwr/Content/ImagedRecords/default.htm), this directory may not be critical if we can get the imaged documents transferred to us. We are not interested in blank forms and applications."
Site copyright statement: "Copyright © 1998 - 2005 Arizona Department of Water Resources and ADWR Network. All Rights Reserved."
Crawl Results

Questions for curator: Did this capture the documents you needed?


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 195 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls]  [bytes]    [host]
2233    988447782  www.azwater.gov
286     2350888    www.water.az.gov
253     4587125    www.groundwater.org
226     3093331    www.azcentral.com
196     15626      dns
178     395216     www.macromedia.com
128     1679057    www.prescott.edu
123     947183     www.azleg.state.az.us
115     792968     www.usda.gov
[List truncated…]

Curator Feedback to CDL (Pearce-Moses AZWater)

Crawl Success: (not provided)
Crawl Success Comments: We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand, wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can. That may be the difference. The crawl is listed as not completing, but it appears to be very close based on the total number of bytes downloaded.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Questions / Comments about crawl: It seemed odd that it took 2h 54m to crawl 7709 documents with linked hosts, but 4h 4m to crawl only 4888 docs when the spider was restricted to the original host.


Richard Pearce-Moses: Citizens Clean Election Commission
CDL Report to Curator

URL: http://www.ccec.state.az.us/ccecscr/home.asp
Curator's original comments: "This commission was established by initiative. Its work is of great historical significance, as it is changing the way the public elects officials. We have not been able to spider this site because links are buried in JavaScript. (We use wget as our spider.) We are primarily interested in acquiring their publications, election data, and things listed under 'popular links'."
Site copyright statement: "Copyright 2004 Arizona Citizens Clean Elections Commission. All Rights Reserved."
Crawl Results

Comments from crawl operator: JavaScript issue; interesting problem; need curator feedback about what we captured.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, the following additional sites were crawled. Total: 15


[urls]  [bytes]   [host]
929     95456563  www.ccec.state.az.us
76      6117977   www.azcleanelections.gov
55      513218    az.gov
49      499337    www.governor.state.az.us
44      174903    www.adobe.com
40      141202    www.azleg.state.az.us
31      18549     www.az.gov
28      202755    www.azsos.gov
23      462603    gita.state.az.us
19      213976    www.benefitoptions.az.gov
17      89612     www.azredistricting.org
14      1385      dns
3       1687      wwwimages.adobe.com
2       1850      www.capitolrideshare.com
2       26438     www.ftc.gov

Curator Feedback to CDL (Pearce-Moses CCEC)

Crawl Success: (not provided)
Crawl Success Comments: We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with wget.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)


Juri Stratford: City of Davis
CDL Report to Curator

URL: http://www.city.davis.ca.us
Curator's original comments: "We are primarily interested in the GIS data produced for the City of Davis. As the GIS data represent the largest part of the City of Davis web site, it may not be much more difficult to archive the site at the top level. Mapping and Geographic Information Systems URL: http://www.city.davis.ca.us/gis"
Site copyright statement: "This web site is Copyright © 2004 by the City of Davis. All Rights Reserved. The City retains the copyright on all text, graphic images and other content of this site. You may not copy, modify and/or re-use text, images or other web content from this web site, distribute the City's web content, mirror content from this web site on a non-City server, or make any other use of the content of this web site that would violate the City's copyright, without written permission from the City of Davis. To the extent allowed by law, commercial use of our web material is prohibited without written permission from the City of Davis. All art work shown on these web pages is protected by US Copyright laws. Limited reproduction for non-commercial purposes can be authorized by the City of Davis provided that requests are approved prior to use. Contact the Community Development Department Cultural Services program staff at (530) 757-5610 for more information. Some content included in this web site may be provided courtesy of third parties who may retain copyright control of the provided material. Any service marks and trademarks contained herein are the property of their respective owners."
Crawl Results


Comments from crawl operator: "GIS. Potential issue: /img is disallowed by robots.txt, e.g. http://www.city.davis.ca.us/img/featured/map-static.jpg can't be retrieved; also some maps on a second server are disallowed. Need feedback about the GIS material that was captured: what was captured that is useful? Much duplication -- pages captured repeatedly."
Robots.txt: The site you selected prohibits crawlers from collecting certain documents. The file reads:

User-agent: *
Disallow: /img
Disallow: /calendar
Disallow: /misc/email.cfm
Disallow: /edbusiness
Disallow: /gis/oldmap
Disallow: /policelog
Disallow: /pcs/grants/sacog
Disallow: /jobs/listings
Disallow: /css
Disallow: /pcs/nutcracker/history.cfm
Disallow: /pcs/nutcracker/pdfs

User-agent: asterias
Disallow: /

User-agent: gigabot
Disallow: /
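For curators who want to confirm which parts of a site rules like these put off limits, Python's standard urllib.robotparser module can answer the question directly. The following is an illustrative sketch only; it uses a small subset of the rules quoted above (paths as reconstructed here) and the image URL the crawl operator cited.

from urllib.robotparser import RobotFileParser

# A subset of the Disallow rules quoted above (paths as reconstructed here)
rules = """\
User-agent: *
Disallow: /img
Disallow: /calendar
Disallow: /gis/oldmap
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Heritrix obeys robots.txt, so anything under /img cannot be collected.
print(parser.can_fetch("*", "http://www.city.davis.ca.us/img/featured/map-static.jpg"))  # False
print(parser.can_fetch("*", "http://www.city.davis.ca.us/gis/library"))                  # True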

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 420 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls]  [bytes]    [host]
16455   947871325  www.city.davis.ca.us
420     29555      dns
332     10377948   www.asucd.ucdavis.edu
305     33270715   selectree.calpoly.edu
279     3815103    www.w3.org
161     2027740    www.cr.nps.gov
139     941939     www.comcast.com
133     951815     www.yolocounty.org
[List truncated…]

Curator Feedback to CDL (Stratford Davis)

Crawl Success: mostly effective
Crawl Success Comments: Looking at the GIS Online Maps page, it's not clear which formats were retrieved and which were not. For example, the Growth Map Flash file downloads fine, but the Flash/ArcIMS files do not download.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In general, restricted to original host works better. The broader search includes pages from businesses, e.g. Comcast, and other local agencies, e.g. other local and state government sites. But restricting the outside sites to the first level seems to be a good compromise.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question): In general it looked like it did a good job pulling down geographic data/images. For example, you can pull down data from http://www.city.davis.ca.us/gis/library. It's difficult for me to get a sense of the level of duplication from the way the search results display.
Crawl Frequency: monthly
Questions / Comments about crawl: I like the ability to navigate within the archive once you have a starting point. This is very nice.


Yvonne Wilson: Orange County Sanitation District
CDL Report to Curator

URL: http://www.ocsd.com
Curator's original comments: "This is an Orange County agency that has small reports and complex planning documents on its web site."
Site copyright statement: "Copyright © 2001-2005 Orange County Sanitation District. Unless a copyright is indicated, the information on this site is freely available for non-commercial, non-profit making use. If a copyright is indicated on any materials displayed on our website, permission to copy these materials must be obtained from the original source. Commercial use of District materials is expressly prohibited without the written permission of the OCSD."
Crawl Results

Comments from crawl operator: In both crawl settings we had to limit the maximum number of retry attempts in order to complete the crawl.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 85 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls]  [bytes]   [host]
755     85943567  www.ocsd.com
164     7635257   www.ci.seal-beach.ca.us
122     809190    www.ci.irvine.ca.us
95      169207    epa.gov
86      7673      dns
85      559125    order.e-arc.com
66      840581    www.ci.huntington-beach.ca.us
62      213476    www.cityoforange.org
57      313579    www.epa.gov
55      4477820   www.villapark.org
50      1843748   www.cityoflapalma.org
50      463285    www.ocbinc.com
[List truncated…]

Curator Feedback to CDL (Wilson OCSD)

Crawl Success: somewhat effective
Crawl Success Comments: Using WERA, I searched inquiries by type and title in the two OCSD collections, plain and via. I received no hits for pdf, only the homepage for html, and three hits for text. There are many pdf sections in the EIRs. I next searched by title in the two collections. I was the most successful in via. By searching the titles 'carbon canyon' and 'Ellis Ave Pumping Station' I found articles but not the EIRs, which are available full text. At this point I presumed the crawl did not drill down far enough. Then I searched for a secondary page entitled 'Ocean Monitoring'; this time the search found only an internal letter and memo, but not all the documents related to this topic. Via collection search seems to be the most productive, but it is not consistent.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: I searched some of the outside links to US Marine Fisheries and EPA Beach Watch and received no hits.
Crawl Frequency: monthly


Crawl Report Key: Web-at-Risk Test Crawls

This document is a guide to the test crawl report, providing further information about some of the statistics and results conveyed there. Your report includes basic information about the site you submitted, your original comments about that site, and any copyright statements CDL found when we examined the site. Your site was crawled using the Heritrix crawler, version 1.5.1. Your report will include the following information about the site:

Crawl Settings: We crawled each site in two different ways: (A) linked hosts included, and (B) restricted to original host. Where linked hosts were included, we set the crawler to gather any outside page that your nominated site linked to, but no further. So if your site linked to a single document from whitehouse.gov, we captured that document but did not crawl any further on the whitehouse.gov site. Your report will show results from both styles of crawling to give you a sense of whether or not the site draws heavily from valuable materials on another site. You will also receive a list of the other hosts that site linked to and how many documents were gathered from those hosts.

Robots.txt file: The presence of a robots.txt file means that the content provider is asking us to refrain from crawling either all or part of the site. This refers only to the host you named in your crawl request. In some cases the site had a robots.txt file but it didn't say anything; we noted when this occurred. We obeyed robots.txt instructions for these crawls, so if the site contained one but we still got a result, that means the robots.txt file only prevented us from crawling certain areas of the site. When robots.txt files were present, we have included the text of that file in your report so you can see which segments of the site the site owner wants to protect.

Crawl duration

Total number of documents: The "Documents" count will include page components (such as images or flash files).

File types (mime types): This area will contain a URL. When you go to that URL you will see a list of the different file types that were retrieved as part of the crawl. IMPORTANT: The Heritrix crawler is currently experiencing difficulty with this report and it is missing a crucial column. Until that is fixed, you can see the different file types retrieved from most common to least, but you cannot yet tell how many files each one included. CDL will contact you when this report has been fixed.

Response code reports: The URL in this column will lead to a list of response codes in order by frequency. This will include "200" for files that were successfully captured and error codes for files that were not captured. The error code list includes some codes specific to Heritrix. The key to interpreting these codes is at http://crawler.archive.org/articles/user_manual.html#statuscodes. Note that this report only gives you quantitative information about response codes; it does not link response codes to specific files. For these details see "hosts report and crawl log" below.

How much data collected (bytes): The file size of the total crawl is reported in bytes. You can use the byte conversion tool at http://www.techtutorials.net/reference/byteconverters.html if you want to recalculate the size of the crawl in another measurement, such as kilobytes or megabytes.

Did crawl complete?: This will say "no" if the crawl results exceeded 1 gigabyte or if the crawler encountered an obstacle to capturing the site that could not be fixed.

Location of hosts report and crawl log: You are welcome to review the page-by-page details of the crawl log. This is a generic report that comes with the Heritrix crawler and is not terribly user friendly, but it provides the most detail about the crawl process. This report will list every file that the crawler attempted to get and provide some information about each file. The Heritrix manual can help you interpret this report: http://crawler.archive.org/articles/user_manual.html#logs (go to section 8.2.1, Crawl Log).

Comments from Crawl Operator: These are observations that the Web Archive Programmer made about the crawl process for your site.

Questions for Curator: This section does not appear in every report. If you have a question listed here, please respond to it in the Test Crawl Feedback form.

Your Collection: Important: The screens you will see do not represent the final user interface for the Web Archiving Service tools. We are using WERA, an open-source search and display tool, only to show you your test crawl results. Because we did not seek the right to redistribute these documents, these pages are available only for the purpose of analyzing crawler effectiveness. You must have a password to view these pages. Your report will include the address of a wiki page and a login and password. Each site was crawled twice: plain crawl = only pages from the original site were collected; via = pages from the original site as well as pages that site links to were collected. Unfortunately you cannot simply browse your site; you must select a collection and type a search. You will be able to navigate throughout your site once you load a page containing links. You will be able to review your colleagues' sites as well. Note that the WERA display tool is not perfect. If the same document was gathered from more than one crawl, it may not display in every collection.

Related Hosts Crawled: This section provides further information about the additional materials that were gathered when we set the crawler to include documents that your site links to. This can be critical in deciding what settings are needed to capture your site. Some sites, for instance, will keep all of their pdf or image files on a separate server. If you don't allow the crawler to move away from the original URI, you won't capture a critical portion of the site's content. In other cases, however, this setting will lead to irrelevant information. This report includes the most commonly linked hosts from your site.
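Because the file-type report is currently missing a column and the response-code report is purely quantitative, the same figures can also be derived directly from the crawl log. The following is a minimal sketch, assuming the whitespace-delimited Heritrix 1.x crawl.log layout (timestamp, status code, size in bytes, URI, discovery path, referrer, MIME type, ...); the log file name is hypothetical, and field positions should be checked against section 8.2.1 of the Heritrix manual.

from collections import Counter

def summarize_crawl_log(path):
    """Tally status codes and MIME types, and total the bytes, from a crawl.log.

    Assumes the whitespace-delimited Heritrix 1.x layout:
    timestamp, status, bytes, URI, discovery path, referrer, MIME type, ...
    """
    statuses, mimes, total_bytes = Counter(), Counter(), 0
    with open(path) as log:
        for line in log:
            fields = line.split()
            if len(fields) < 7:
                continue  # skip malformed or truncated lines
            status, size, mime = fields[1], fields[2], fields[6]
            statuses[status] += 1
            mimes[mime] += 1
            if size.isdigit():
                total_bytes += int(size)
    return statuses, mimes, total_bytes

if __name__ == "__main__":
    statuses, mimes, total = summarize_crawl_log("crawl.log")  # hypothetical file name
    print("Response codes:", statuses.most_common())
    print("File types:", mimes.most_common(10))
    print("Total collected: %.1f MB" % (total / 1_048_576))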


the results of our crawls has been strictly limited to the curators and project staff for the purpose of crawl analysis. In short, our pre-crawl analysis of these 30 sites brought up complex rights issues and exemplified the challenges that lay ahead.

Test Crawl Settings and Process
Although we had originally planned to crawl only one site for each curator, some curators supplied two sites that posed interesting problems; in these cases we crawled both. Each site was crawled with two settings, resulting in 19 test sites and 38 total crawls conducted. We used Heritrix version 1.5.1 to conduct the test crawls. The crawls were conducted using four crawler instances on two servers. Each site was crawled separately, that is, each seed list contained one URI. We kept most default settings except for the following:

Crawl size: Each crawl was set to stop at a maximum of 1 gigabyte (gig) of data. Of the 38 crawls conducted, 18 hit the 1 gig size limit. Note that this limitation was imposed for the purpose of these early tests and will not be applied to future services.

Crawl duration: When crawls took an inordinately long time to complete, we started over again with "max retries" set at three. This setting improved crawler performance when pausing or hanging was an issue.

Politeness [5]: Because we crawled each site individually, we set our crawler for very high politeness values. Politeness pertains to the impact of the crawl on the content owner's server and is determined by combining a few different Heritrix settings that together determine how demanding the crawler is on the remote servers' resources. Our politeness settings were:

• Delay-factor: 5
• Max-delay-ms: 5000
• Min-delay-ms: 500
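As a rough illustration of how these values interact, the sketch below mirrors the way Heritrix 1.x derives its per-host politeness delay: the time the previous fetch took is multiplied by the delay factor and then clamped between the minimum and maximum delays. This is a simplified model for intuition only, not the crawler's actual code.

def politeness_delay_ms(last_fetch_duration_ms,
                        delay_factor=5.0,
                        min_delay_ms=500,
                        max_delay_ms=5000):
    """Approximate the pause Heritrix inserts before revisiting the same host.

    Simplified model: delay = clamp(delay_factor * last fetch duration,
    min_delay_ms, max_delay_ms).
    """
    delay = delay_factor * last_fetch_duration_ms
    return max(min_delay_ms, min(max_delay_ms, delay))

# With these settings, a page that took 200 ms to fetch is followed by a
# 1,000 ms pause on that host; anything slower than 1 second hits the
# 5,000 ms ceiling, and very fast fetches still wait at least 500 ms.
print(politeness_delay_ms(200))   # 1000.0
print(politeness_delay_ms(2000))  # 5000.0
print(politeness_delay_ms(50))    # 500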

Original host only vs. linked hosts included: Each site was crawled with two settings. The first setting restricted results to only the host from the original seed URI. The second setting allowed us to gather any pages to which the site linked directly, but no more. This second setting was constructed to gather pages considered relevant to the original site, and to gather sites in their entirety when an organization relied on more than one host name to provide its web presence.
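A simplified decision rule capturing the difference between the two settings (a sketch of the policy described here, not Heritrix's actual scope classes): a URL on another host is kept only when it was discovered directly from a page on the original host.

from urllib.parse import urlparse

def in_scope(url, discovered_from_host, seed_host, linked_hosts_included):
    """Decide whether a discovered URL falls within the two test-crawl scopes.

    'Original host only': keep URLs on the seed host.
    'Linked hosts included': also keep a URL on another host, but only when it
    was discovered directly from a page on the seed host (one hop out, no further).
    """
    host = urlparse(url).hostname
    if host == seed_host:
        return True
    if linked_hosts_included:
        return discovered_from_host == seed_host
    return False

# A whitehouse.gov document linked from the seed site is kept in the wider
# setting, but links found on that whitehouse.gov page are not followed.
print(in_scope("http://www.whitehouse.gov/doc.pdf", "www.ppic.org", "www.ppic.org", True))           # True
print(in_scope("http://www.whitehouse.gov/other.pdf", "www.whitehouse.gov", "www.ppic.org", True))    # False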

Crawl Scope
A comparison of the two different crawl settings used (original host only vs. linked hosts included) turned up some counterintuitive results.

[5] For further information, see section 6.3.3.1, "Politeness," in the Heritrix User Manual <http://crawler.archive.org/articles/user_manual.html>.


When compared quantitatively, 8 out of 19 crawls took longer to capture the site when limited to "original host only" than with the "linked hosts" setting. It is not clear why this is the case, since the "linked hosts" crawl should be much larger. Indeed, in all cases the linked hosts crawl retrieved more files than the original host crawl. The following two tables compare both the number of files retrieved and the duration of the two types of crawls.

Table 1: Number of files retrieved

          Original Host Only   Linked Hosts Included
Most      46197                70114
Fewest    247                  1343
Median    2423                 9250
Average   6359                 17247

Table 2: Duration of the crawl

          Original Host Only   Linked Hosts Included
Longest   32 hr 21 min         37 hr 11 min
Shortest  18 min               19 min
Median    7 hr 33 min          11 hr 22 min
Average   1 hr 42 min          7 hr 9 min
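As a quick arithmetic check on these medians (the comparison drawn in the next paragraph):

# Median crawl duration: 7 hr 33 min vs. 11 hr 22 min (Table 2)
original_minutes = 7 * 60 + 33    # 453
linked_minutes = 11 * 60 + 22     # 682
print((linked_minutes / original_minutes - 1) * 100)  # ~50.6, the report's "50.5% more time"

# Median files retrieved: 2423 vs. 9250 (Table 1)
print((9250 / 2423 - 1) * 100)    # ~281.8, "over 281% more documents"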

Given that this is a very small sample of crawls, and that the gap between the largest and smallest crawls is fairly noteworthy, perhaps the only telling figure to consider here is the median. According to the median figures, with only 50.5% more time the crawler acquired over 281% more documents. When compared qualitatively, the results also appeared somewhat counterintuitive. Of the 18 curators who responded, 12 stated that they preferred the "original host only" crawl (four were undecided). We would have expected this preference to vary a little more from site to site. Oddly, one of the two curators who preferred the larger crawl scope had a crawl that captured materials from over 2500 other hosts. In some cases a site's links to exterior hosts are critical. The site's value may hinge upon how well it gathers documents from other sources; both of the curators who preferred the broader setting did so for this reason.

For this site it was essential to capture the link hosts (via) because many of the press materials, etc., were on external sites.

More critically, sites are also often composed of content from more than one server. This is particularly likely to be the case if a site is providing a large body of pdf or multimedia files. So a crawl restricted to the original host would be missing critical segments of the site's content. Our test crawls did in fact turn up sites that were composed of more than one host name. For example, in the case of UC Merced, separate host names are used for different areas of the site, such as faculty or admissions. In the case of the Arizona Department of Water Resources, the distinction between host names appears to be accidental, perhaps the result of an attempt to transition to a simpler, more memorable URI. Most pages from this site come from www.azwater.gov, but hundreds of internal links, including critical style sheet files, are still hard-coded to point to www.water.az.gov. Finally, when the site is restricted to the original host, the end user is much more likely to encounter errors when viewing the archived results. When the end user selects a link that was not captured, WERA provides a "Sorry, this URI was not found" message. When the linked hosts are included, the end user browsing the site sees the site closer to its original context and with fewer error messages. Conversely, when end users encounter frequent error screens, they may develop both frustration and a sense of mistrust in the quality of the archive. It is worth noting that the curators are not likely to browse these results in the same way that an end user of their archives might. The curators know what these sites contain, choose them accordingly, and may be less inclined to click on links that would result in a "Sorry, this URI was not found" message. Ultimately, the value of a site's external links would seem likely to vary depending on the nature of the site. Sites with rich internal content and only "frivolous" external links would be best captured with the "original host only" setting. Before we ran these crawls, we asked curators to specify what they hoped a crawl would capture. Many referenced specific pages or directories they hoped to capture, and of those, three specified URIs that were not from the original host. When reviewing the results, two out of those three still preferred the original host crawl, even though that crawl did not capture the materials they specifically hoped to capture. Why was the feedback so consistent on this point? A look at the WERA interface used to display crawl results may provide an answer. WERA does not offer an immediate means of browsing a site; you have to search by keyword to find your way "into" the captured site. Once you have a page displayed you can browse within the site, but you must begin by searching for the right starting point. As this image shows, you select the site you want to search from a dropdown menu, then enter terms to search against:

Because all pages from the more comprehensive crawls are indexed, the search results include pages from all of those other hosts. This sets up a bit of cognitive dissonance: the user specified a search against a particular site, yet results from other hosts vastly outnumber pages from that site.


Underlying the issue of crawl scope is the deeper question of what an archivist hopes to capture when a site is crawled. Is it just a list of particular documents? Or is it a faithful recreation of the site as it existed on that day? It may be that a captured site has content of primary and secondary importance. The primary content is what should be retrieved when searching against the archive, while the secondary content should only be present to avoid error messages and establish the site's full original context. Another approach is suggested by this curator's response:

The results on the multiple hosts crawl is mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains rather than "source + 1."

This suggests the ability to link seed URIs as being related components of a single site.
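A sketch of what such a linked-seed scope might look like: a curator-supplied set of related hosts treated as one collection unit. The host names here are hypothetical examples (echoing the UC Merced case discussed above), not settings from these test crawls.

from urllib.parse import urlparse

# Hypothetical example: several related hosts designated by the curator as
# components of one site ("specific multiple domains rather than source + 1").
RELATED_HOSTS = {
    "www.ucmerced.edu",        # assumed example hosts, for illustration only
    "admissions.ucmerced.edu",
    "faculty.ucmerced.edu",
}

def in_multi_host_scope(url):
    """Keep a URL if it belongs to any of the curator-designated hosts."""
    return urlparse(url).hostname in RELATED_HOSTS

print(in_multi_host_scope("http://admissions.ucmerced.edu/apply.html"))  # True
print(in_multi_host_scope("http://www.cnn.com/"))                        # False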

Communication / Reports
When we reported the test results back to curators, we provided a synopsis of the crawl results, links to particular Heritrix reports, and to the WERA display interface. The Heritrix reports are all plain text, providing tables of MIME type or response code frequency. Attempting to integrate these reports and the display of the archived results is a challenge. One curator, for example, obtained documents from over 200 hosts in the "linked hosts included" crawl, but was only aware of having found 10 additional documents when reviewing these same search results in WERA. Although WERA is helpful for seeing results from an end user's perspective, it does not provide adequate tools for analysis. In some cases this is simply because WERA is a new and occasionally buggy tool. It is possible, for instance, to follow occasional links out of the archive and into "real-time" sites. In some cases it's also possible to browse to a page and display it, but when you search for that same page by its URI, WERA does not find anything. One curator notes:

We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE - the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.

It is also important to note that although WERA was used for the purpose of reporting test crawl results, it is not envisioned as the final display interface for the Web Archiving Service. Even so, the feedback the curators provide about WERA should inform the functionality of the WAS interface. Clearly, it is still quite a struggle for curators to determine exactly what a crawl retrieved. One curator reports:

After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report, there are 8899 documents in this crawl alone. My other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls.

Crawl Frequency
When asked how frequently they wanted to crawl sites, curators responded with a variety of preferences:

[Chart: Desired Crawl Frequency - number of curators preferring Daily, Weekly, Monthly, Once, or Unknown]

Again, it is worth considering precisely what curators hope to capture in a repeated crawl of a site. Some insight is provided by these curators' comments:

We hope the crawler will be able to report when new publication files are posted on the web site

And

The ability to report on new publications is critical to our goal of using the crawler as a discovery tool

As with the other NDIIPP grants, the purpose of the Web Archiving Service tools will be archiving and preservation, not resource discovery. This indicates that we should further investigate what a "weekly" or "monthly" crawl really means to curators. If a site was not updated over the course of a year, would the curator want to continue running weekly crawls of the site? Would the curator want each crawl to appear on an archive timeline for that site, even if the content was no different for each date? One curator comments:

I want to qualify the frequency for this site. I'd like to do a monthly crawl for three-four months. I'd want [to] reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot [of] new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation.
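One way the "what changed since the last crawl" question could be approached is by comparing content digests between successive crawls. The sketch below is an illustration of the idea, not an existing WAS feature; it assumes two Heritrix crawl logs with the URI in the fourth field and the content digest in the tenth, and the file names are hypothetical.

def digests(log_path):
    """Map URI -> content digest for successfully captured pages in a crawl.log.

    Assumes the whitespace-delimited layout where the status code is the 2nd
    field, the URI the 4th, and the content digest the 10th; verify against
    the Heritrix manual.
    """
    table = {}
    with open(log_path) as log:
        for line in log:
            fields = line.split()
            if len(fields) >= 10 and fields[1] == "200":
                table[fields[3]] = fields[9]
    return table

def change_report(old_log, new_log):
    """List URIs that are new, and URIs whose captured content changed."""
    old, new = digests(old_log), digests(new_log)
    added = sorted(set(new) - set(old))
    changed = sorted(u for u in new if u in old and new[u] != old[u])
    return added, changed

# added, changed = change_report("crawl-2005-12.log", "crawl-2006-01.log")  # hypothetical files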


Language and Web Site Models
One of the challenges of communicating and interpreting crawl results is that crawlers don't necessarily work the way people envision them to. Further, the appearance of a web site on a screen and its architecture on a server may be quite different; the directory structure of a site may have no relationship to the way its navigation is organized on a screen. One frequent point of confusion curators encountered while interpreting crawl results is the concept of how many "levels down" the crawler went. One curator requested that we "drill down several levels (at least 3)" in our capture. One challenge with this request is that "levels down" can be interpreted to mean different things. In some cases curators clearly mean the directory structure of the web site on the remote server. In other cases, however, they seem to mean the number of hosts away from the original site:

In general, restricted to original host works better. The broader search includes pages from businesses, e.g. Comcast, and other local agencies, e.g. other local and state government sites. But restricting the outside sites to the first level seems to be a good compromise.

If the curator is referring to a structure of subdirectories, the next challenge is that the Heritrix crawler does not work this way. Heritrix ignores the site's directory structure and instead follows links from the seed URI it is provided. The Heritrix manual specifically defines "depth" in this way:

This scope allows for limiting the depth of a crawl (how many links away Heritrix should crawl) but does not impose any limits on the hosts, domains or URI paths crawled. [6]
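The distinction matters because the two measures can disagree. As a rough illustration (assuming the crawl.log discovery-path notation, where each letter records one hop away from the seed):

from urllib.parse import urlparse

def directory_depth(url):
    """Depth in the server's directory structure (number of path segments)."""
    path = urlparse(url).path
    return len([seg for seg in path.split("/") if seg])

def link_hops(discovery_path):
    """Heritrix-style 'levels down': one character per hop in the discovery
    path recorded in the crawl log (e.g. 'LLE' = 3 hops from the seed)."""
    return len(discovery_path)

# A document buried five directories deep can still be one link away from
# the seed page, and vice versa.
print(directory_depth("http://www.countyofsb.org/plandev/comp/threeyear/2005-2008/default.html"))  # 5
print(link_hops("L"))     # 1 hop from the seed
print(link_hops("LLLE"))  # 4 hops, regardless of directory depth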

Multimedia
Some curators chose sites because of the value of their multimedia resources. The crawler yielded mixed results in capturing these resources. For one site:

• A text search on the log file turns up numerous .ram files, and only one .ppt file.
• .asx files are Windows streaming media redirector files, which generally lead to associated .wmf files. No .asx references appear in the crawl log, nor do any .wmf files.
• .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files.
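Checks like the ones above can be partially automated by scanning the crawl log for the relevant extensions. A minimal sketch, assuming the whitespace-delimited crawl.log layout with the status code in the second field and the URI in the fourth; the log file name is hypothetical:

# Scan a crawl log for the multimedia file types discussed above.
MEDIA_EXTENSIONS = (".ram", ".smil", ".rm", ".asx", ".wmf", ".ppt")

def media_hits(log_path):
    """Count captured URIs per multimedia extension in a Heritrix crawl.log."""
    counts = {ext: 0 for ext in MEDIA_EXTENSIONS}
    with open(log_path) as log:
        for line in log:
            fields = line.split()
            if len(fields) < 4 or fields[1] != "200":
                continue  # only count successful fetches
            uri = fields[3].lower()
            for ext in MEDIA_EXTENSIONS:
                if uri.endswith(ext):
                    counts[ext] += 1
    return counts

# print(media_hits("crawl.log"))  # hypothetical log file name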

As noted, nearly half the sites crawled reached the 1 gig size limit and so did not complete. This makes it difficult to determine whether there were genuine problems with particular types of files, or if the crawler simply did not get to the missing files before the crawl was stopped.

[6] Heritrix User Manual, Section 6.1.1, Crawl Scope: Broad Scope <http://crawler.archive.org/articles/user_manual.html>

Comparison with Other Crawlers
Three of our curators had previously used other crawlers, such as HTTrack and Wget, to capture the same sites we attempted with Heritrix. Because these curators provided strong details when describing their sites, we crawled all of the sites they sent us (six sites). In one case, the "Defense Base Closure and Realignment Commission," the curator had greater success capturing aspx files with HTTrack than we had with Heritrix. Other comparisons:

We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using Wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand, Wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can. That may be the difference.

And

We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with Wget.

Crawl Success
We asked curators to rate the overall success of the test crawls on the following scale:

• Not effective (none of the desired documents were captured)
• Somewhat effective (some of the desired documents were captured)
• Mostly effective (most of the desired documents were captured)
• Effective (all of the desired documents were captured)

[Chart: Crawl Success - number of curators rating the crawl Effective, Mostly Effective, Somewhat Effective, or Not Effective]

Here are some of the comments curators had concerning the overall success of these tests. The full text of all curator reports and their feedback to CDL is available in Appendix C.


Los Angeles Planning Department

The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all pdfs. When searching specifically for EIRs I got a large result list (699 citations); however [sic] when I investigated whether or not the actual file was captured, I found that the main EIR page was captured, typically a htm/html file, but when I clicked on a link to get to the full report all I got was the "Sorry, no documents with the given URI were found" message. This could be that the file was no longer available when the site was harvested. However, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved.

City of San Diego Planning Department
This comment is from a curator who is filling in for the person who originated the test crawl request, so she is also attempting to interpret what that other person was hoping to capture:

Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit.

Defense Base Closure and Realignment Commission

I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1GB, but I think that more documents could have been captured.

Public Policy Institute of California


There are some problems with the functionality of captured pages: 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on "Date" retrieves a message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."

AFL-CIO

I realize the collection interface is a "work in progress" and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g. "Working Families Toolkit," "BushWatch") that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2-million-plus-hits-per-year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses.

Conclusions
The process of running these test crawls has been valuable. It has resulted in a number of lessons learned and further directions for inquiry. It is clear that the tools that help curators analyze the results of these crawls will have to do more than mimic the original browsing context or supply tables of data. We will be looking for any opportunity to improve WAS reporting capabilities as we move forward with the project. The differences encountered in language and visualization raise the importance of a strong, intuitive design for the curator tools and for clear help screens. Each of us may visualize web sites differently, and the crawler may behave differently than we expect. A certain degree of online help will be needed to design crawls effectively, and further guidance should be available to help people interpret crawl results when those results don't match what the person anticipated.


The help and documentation for the Web Archiving Service will also need to address the rights analysis issues raised above. In most cases this is work that is done prior to issuing crawls; it cannot be addressed by the design of the WAS interface alone. When the Web-at-Risk project reaches the point of conducting usability studies, we should be sure to include tests that further uncover how users understand crawl frequency settings. Additionally, the desire for an analysis tool that can convey when a site has changed significantly is not limited to this project. It is a common issue faced by the Internet Archive, members of the International Internet Preservation Consortium, and others. CDL is actively communicating with these organizations as we all work toward a solution for this problem. Similarly, CDL should ensure that future assessment and analysis work with our curators addresses the issues raised by the two crawl scope settings. This inquiry should also extend to participants who exemplify end users, to determine whether people using web archives experience crawl scope differently than people who build them. Finally, a number of lessons were learned via the Katrina crawl, described in Appendix B. The most outstanding finding is that event-based crawls such as Katrina and site-specific crawls such as these have quite different characteristics and require different functionality and analysis tools.

Next Steps
There is certainly more to learn from these crawl results and about Heritrix; in some cases it is still unclear why the crawler failed to retrieve certain documents. The curators' feedback concerning these results has been extremely valuable. They have provided insight as to what was captured and what is still missing, which would have been difficult to determine without their subject expertise in the sites chosen. As we continue working to improve crawler success and performance, we will turn our attention to the Virtual Remote Control site created by Cornell. This site was developed to present particular problems to crawlers and is well documented, enabling the user to gauge a crawler's results. We will be using that site to replicate particular problems raised in our test crawl set and may also reattempt some of the sites crawled for these tests. When we release the first version of the Web Archiving Service to curators in July 2006, we will request that they include their original test site in the crawls they attempt and compare the results with these tests. The test results for these crawls will remain in place as a point of comparison while we continue developing the Web Archiving Service. Additionally, certain issues raised by this feedback, particularly the desired scope and frequency of crawls, merit attention in future usability analysis work.


Web-at-Risk Test Crawl Report Appendix A Sites Submitted

Curator | Site Submitted | Crawled?
Sherry DeDekker | http://ca.water.usgs.gov (California Water Science Center) | Yes
Sherry DeDekker | http://wwwdwr.water.ca.gov (California Department of Water Resources) |
Peter Filardo and Michael Nash | http://www.nycclc.org (New York City Central Labor Council) | Yes
Peter Filardo and Michael Nash | http://www.dsausa.org (Democratic Socialists of America) |
Valerie Glenn and Arelene Weibel | http://www.strengtheningsocialsecurity.gov (Strengthening Social Security) | Yes
Valerie Glenn and Arelene Weibel | http://www.brac.gov (The Defense Base Closure and Realignment Commission) | Yes
Gabriela Gray | http://www.joinarnold.com (Join Arnold) | Yes
Gabriela Gray | http://www.antonio2005.com (Mayor-elect Antonio Villaraigosa) | Yes
Ron Heckart and Nick Robinson | http://www.ppic.org (Public Policy Institute of California) | Yes
Ron Heckart and Nick Robinson | http://www.cbp.org (California Budget Project) |
Terrence Huwe | http://www.aflcio.org (AFL-CIO) | Yes
Terrence Huwe | http://www.seiu.org (Service Employees International Union) |
James Jacobs | http://www.sandiego.gov/planning (City of San Diego Planning Department; analyzed by Megan Dreger) | Yes
James Jacobs | http://www.sandag.org (San Diego Association of Governments) |
Kris Kasianovitz | http://cityplanning.lacity.org (Los Angeles Department of City Planning) | Yes
Kris Kasianovitz | http://www.scag.ca.gov (Southern California Association of Governments) | Yes
Linda Kennedy | http://calwater.ca.gov (California Bay-Delta Authority, CALFED) | Yes
Linda Kennedy | http://www.dfg.ca.gov (California Department of Fish and Game) |
Ann Latta | http://www.ucmerced.edu (UC Merced; analyzed by Elizabeth Cowell) | Yes
Ann Latta | http://www.coastal.ca.gov/web (California Coastal Commission) |
Janet Martorana | http://www.countyofsb.org/plandev/default.htm (Santa Barbara County Department of Planning and Development) | Yes
Janet Martorana | http://www.sbcag.org (Santa Barbara County Association of Governments) |
Lucia Orlando | http://montereybay.noaa.gov (Monterey Bay National Marine Sanctuary) | Yes
Lucia Orlando | http://www.waterboards.ca.gov/centralcoast (Central Coast Regional Water Quality Control Board) |
Richard Pearce-Moses | http://www.azwater.gov (Arizona Department of Water Resources) | Yes
Richard Pearce-Moses | http://www.ccec.state.az.us/ccecscr/home.asp (Citizen's Clean Election Commission) | Yes
Juri Stratford | http://www.city.davis.ca.us (City of Davis, California) | Yes
Juri Stratford | http://www.sacog.org (Sacramento Area Council of Governments) |
Yvonne Wilson | http://www.ocsd.com (The Orange County Sanitation District) | Yes


Web-at-Risk Test Crawl Report Appendix B The Katrina Crawl

The Crawl
During the early Fall of 2005, the California Digital Library was embarking on a series of test crawls as part of the Web-at-Risk project when hurricane Katrina struck. CDL had requested that curators submit their sample URLs by August 25th, and on Friday, August 26th, Louisiana Governor Blanco declared a state of emergency. It was over that weekend that CDL staff reached the conclusion that an event of this historic significance would shift our priorities. On Monday we suspended our initial test crawl plans and began preparing on a number of fronts to capture web-based materials related to Katrina. Specifically, this meant:

• identifying which web sites or sub-sites to collect, which involved:
  o notifying 30+ curators of our emergency crawl plans and requesting their assistance
  o setting up a web-based interface to gather their suggested seed URLs
  o sorting through the resulting seed list and feeding it to our primary and secondary (Stanford) crawl operators
• determining our collection parameters
• getting our crawlers installed and configured
• locating and setting up disk space to store the crawls
• initiating and monitoring the crawls

This crawl activity posed several new challenges. It was the first time CDL staff had used the Heritrix crawler, which was not yet installed on a production server when the hurricane struck. The last week of August saw CDL and UC Berkeley staff scrambling to find server space to set up Heritrix instances, disk partitions, and job parameters for the crawl. This was done with remarkable speed, owing to the urgency of the situation and to everyone's desire to capture a record of the events. Given our limited experience, in order to reduce the risk of losing the historically significant and fleeting materials related to hurricane Katrina, CDL worked with Stanford University to concurrently run the same crawl using a different crawler. After both CDL and Stanford had crawled the same seed list for a month, the task was then taken up by the San Diego Supercomputer Center, who have continued crawling these sites using Heritrix.

Gathering the Seeds
The CDL sent out an initial request to the Web-at-Risk curators to submit URLs related to hurricane Katrina for crawling. We worked collaboratively with a large group of content specialists to identify the sites:

• The 22 curators of the Web-at-Risk project (University of California Libraries, University of North Texas, New York University, Stanford University, and the Arizona State Library)
• The Library of Congress
• Librarians at Louisiana State University and the University of Mississippi


Over the course of the crawl, the list of seed URLs grew to over 700 (just over 500 of which were crawled by CDL). The image of our input form above provides some sense of the range of materials collected. Given the sudden nature of this event, there was no time to investigate rights issues or technical problems each site might have presented. We informed the curators that "Our immediate plan is to simply collect the material before it disappears. We will not make the material immediately available." We also had little time for quality control and were not able to guarantee that sites in our seed list would be comprehensively crawled.

Crawling Specifics
CDL's first crawl was run on September 1, 2005, using a seed list of 89 URLs. The final crawl run by CDL was on October 10, using a seed list of 589 URLs. This final seed list was then sent to the San Diego Supercomputer Center, who are continuing to run twice-weekly crawls.


The CDL crawls were done using Heritrix version 1.5.1. We began with a single instance of Heritrix, eventually dividing the seed list among six instances. We got through the entire seed list 29 times in 40 days. However, the content we were able to collect was limited by our crawler settings. Our crawler was configured to visit one host at a time, collecting content at a conservative rate and never more than three hops away from the seed URL. To ensure that the crawler moved through the seed list in a timely way, it was limited to spending 15 minutes at any given host. So the material collected does not represent the entirety of what was available at each site.

Note that all of the seeds were crawled with the same configuration, no matter how different the structure of these sites might be. Thus the New Orleans version of Craigslist, NASA's information pages, and blog sites were all crawled in the same manner, despite being quite different in context, architecture, and other characteristics. Given our short preparation time, the goal was to find a crawler configuration that would match the broadest range of Katrina materials. In most cases, whether the seed URL was for a personal blog or for a government agency, the seed itself was not centrally about Katrina. The Katrina information was generally situated on the front page and top levels of each site, so CDL expected that this setting, even with the 15-minute time limit, would at least capture the content nearest to the front pages of the widest variety of sites.

In choosing crawler configuration settings we also faced conflicting goals. On one hand, we wanted to capture as much Katrina content as possible. On the other hand, we were very hesitant to start hitting sites that were providing much-needed information at a crucial time (emergency sites, relief sites), thus making those sites hard to reach because of our capture activity. Many of these sites were already being heavily used and perhaps were not running at full capacity; some were also directly impacted by the hurricane geographically. So we had to choose settings that balanced the need to collect with politeness across a wide range of sites.

When the San Diego Supercomputer Center took over the Katrina crawls in October, they revisited the Heritrix configuration settings. The SDSC crawl placed a limit not on the amount of time spent at a site but on the number of documents to be retrieved overall. Their crawls are being conducted "breadth first," in the sense of gathering pages consecutively across the seed list rather than gathering pages from one site at a time.

The Stanford University crawl of the same sites began on September 7th and continued for 30 consecutive days using the WebVac crawler. While we have statistics concerning the size of the Stanford Katrina collection, it is very difficult to compare the configuration settings and effectiveness of WebVac vs. Heritrix. Further, there is no easy way to display materials gathered with WebVac, so the Stanford Katrina content is stored but is not accessible to us at the moment.
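Splitting a large seed list across several crawler instances is a simple mechanical step that can be scripted; the sketch below is illustrative only (the file names and the six-way split are assumptions made for the example, not CDL's actual tooling).

# Illustrative sketch: divide a master seed list into N per-instance seed
# files, round-robin, as one might when running several Heritrix instances
# in parallel. File names and the instance count are assumptions.
from pathlib import Path

def split_seeds(master_file, n_instances=6):
    seeds = [line.strip()
             for line in Path(master_file).read_text().splitlines()
             if line.strip() and not line.lstrip().startswith("#")]
    buckets = [[] for _ in range(n_instances)]
    for i, seed in enumerate(seeds):
        buckets[i % n_instances].append(seed)      # round-robin assignment
    for i, bucket in enumerate(buckets, start=1):
        Path("seeds-instance-%d.txt" % i).write_text("\n".join(bucket) + "\n")

if __name__ == "__main__":
    split_seeds("katrina-seeds.txt")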

Katrina Crawl Results

In terms of creating a collection of the Katrina event on the web, we are not satisfied that CDL's crawling efforts were successful. As mentioned, we had to impose time limits on each site visited and were not able to conduct quality control on the URLs that were captured. The total size of the Katrina capture at CDL is 50 GB. This represents 29 completed crawls of the seed list, which grew to over 500 seeds over the 40 days we ran the crawl. It is clear from the byte size alone that we barely skimmed the surface, and indeed the Stanford crawl proves this point: when the Stanford crawl was based on about 370 seeds, it was pulling in about 35 GB a day, or 1.5 million pages a day.

In spite of this, our attempts to capture Katrina news and events did have some very useful and positive outcomes. In terms of test crawling, the experience was quite successful, opening our eyes both to certain obstacles and to new possibilities. Most importantly, it prompted us to consider the nature and demands of event-based crawling, which had not been part of our original test plan. Here are some of the aspects of event-based crawls that Katrina surfaced.

Site Selection and Classification

This type of event demands a deep, collaborative effort to identify material to collect. CDL had to mobilize a large number of curators to make their best guesses at identifying promising sites. We set up a "crawl seed nomination" web form, accessible to the curators from their home institutions, and invited the curators to use it to enter their suggested seed URLs. CDL staff did not have time to add much in the way of automatic error or duplicate checking, so this cumbersome work was done by hand. On the other hand, it provided us a first-hand trial of what a more general curator user interface might require (a minimal sketch of automated seed checking appears at the end of this section).

The selection and management of seed lists is critical for sudden event-based crawls. The curators contributing the URLs will not necessarily be well versed in the topic; in the case of Katrina, curators in California were not uniformly familiar with the Gulf Coast, the towns, the government agencies, etc. In addition, it is difficult to predict which aspects of the event will be of historic, enduring value. Because disk storage was not a pressing issue, it was better to err on the side of a wider net when selecting sites.

It became clear that there might be a role for "smart crawling," which would spread a wider net than that provided by human-generated seed lists through the use of automated tools that discover relevant materials. We were specifically interested in using tools developed at Emory and Cornell that interface with the Rainbow text classification tool.7 We did not have the resources to investigate this for the Katrina crawl but plan to fold it into future crawling tests.

Finally, it is worth examining why the seed list grew continuously throughout the event. Part of this growth is due to the natural increase in the number of pages and sites devoted to the emergency, and to having the additional time to identify and add them. However, the nature of the event itself changed over time, starting as a hurricane, then becoming a flood, a massive relocation, and a political and social issue. So the range of relevant sites changed as the event itself took on broader implications. This suggests that site selection is an ongoing process, not strictly an activity undertaken at the beginning of an event.

7 Bergmark, Donna. "Heritrix processor for use with rainbow." <http://groups.yahoo.com/group/archive-crawler/message/1905>
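As noted above, the nomination form had no automatic error or duplicate checking, so that work was done by hand. The following is a minimal sketch of what such checking might look like; the normalization rules here are illustrative assumptions, not rules CDL actually applied.

# Minimal sketch of automated checking for curator-nominated seed URLs:
# normalize each entry, reject malformed ones, and drop duplicates.
# The specific normalization rules are illustrative assumptions.
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    url = url.strip()
    if not url:
        return None
    if "://" not in url:
        url = "http://" + url                      # assume http if no scheme was given
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc or " " in parts.netloc:
        return None                                # reject non-web or malformed entries
    # lowercase the host, drop any fragment, keep path and query as submitted
    return urlunsplit((parts.scheme, parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

def clean_seed_list(nominations):
    seen, clean = set(), []
    for raw in nominations:
        url = normalize(raw)
        if url and url not in seen:
            seen.add(url)
            clean.append(url)
    return clean

print(clean_seed_list(["www.example.org", "http://www.Example.org/", "not a url"]))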

Rights, Ownership, and Responsibilities

Our current rights protocol stipulates that when a web page is clearly in the for-profit domain, we must seek permission from the content owners to collect it. This restriction applies most notably to news agency sites, which would likely be a critical part of the web-based fallout of any major event; this was certainly the case with Katrina. However, rights negotiation is a time-consuming process. In the case of a planned event such as an election, one is able to anticipate a certain number of sources and take rights management steps before the event occurs. But with emergencies such as Katrina, there is simply no way to secure rights in advance without missing time-sensitive material. Knowing this would be an issue, CDL notified our curators in advance that the Katrina materials we gathered would not be publicly accessible, but merely captured and preserved. The complex rights issues behind web archiving, as well as the collaborative environment described above, raise a number of questions:

• Who owns the content?
• Who is responsible for the project?
• Who is responsible for responding if a content owner objects?
• When is it permissible to ignore robots exclusion files?
• How much rights management can be done as part of post-processing for an event crawl?
• Can there be standing agreements with major news agencies to allow for this type of crawl in emergency situations? If so, what constitutes an emergency?

Technical Infrastructure

CDL did not have an adequate technical infrastructure in place at the time of the Katrina crawl. As mentioned above, staff at both CDL and UC Berkeley scrambled to find server space and to install and configure the crawlers. This sudden shift in workload left the system vulnerable and somewhat unstable. In order to start the crawling process as soon as possible, CDL began the project on a temporary server. Making the transition to more permanent storage in the midst of this crawling project was neither easy nor flawless, and a certain amount of data had to be recovered from backup sources.

Information Analysis and Display

The challenge of analyzing crawl results has been described in some detail in our Test Crawl Report. That report cites a curator who found the task of reviewing an 8,899-document crawl to be unmanageable. The tools currently available are not at all up to the task of analyzing a large and complex crawl. An event-based crawl is likely to result in massive amounts of data of widely varying quality. The selection of seeds is based on guesses that, given sufficient resources, should be reviewed, refined, and enhanced as the event progresses. Time-series data based on changing input parameters represents a kind of moving target that suggests the need to develop new analysis tools. Andreas Paepcke of Stanford University addressed this problem when considering how a social scientist might want to analyze the Katrina materials. Consider the researcher who is interested in finding out how quickly the notion of race entered the public discourse in the aftermath of Katrina. Paepcke suggests:

"For example, the tool could perform word frequency analysis across successive crawls of Katrina sites, so that the user could identify the emergence of word usage such as 'looter,' 'ninth ward,' etc. The social scientist would then define within the tool clusters of words that in the scientist's judgment are direct or indirect occurrence indicators of the concept 'race' (i.e., ninth ward, looters, lacks, poor). The scientist could subsequently interact with the tool at the level of these well-defined concepts. Example command: count occurrences of 'race' within the first three paragraphs of all pages, and tabulate the differences across 10 days. We could go further and apply well-known topic categorization algorithms on the data to suggest new concepts as they arise in the series of text streams. For example, the tool might identify an increasing frequency of the term 'mismanagement' beginning in week two after the storm, and alert the sociologist to this evidence of a new discourse theme."8

While we don't expect to be able to incorporate this level of analysis into our web archiving tools any time soon, these ideas illustrate a useful direction for archival analysis tools.
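The sketch below is a toy illustration of the kind of tabulation Paepcke describes, not his proposed tool: it counts occurrences of a cluster of indicator terms across successive crawl snapshots. The directory layout, file format, and term list are assumptions made for the example.

# Toy illustration of concept tracking across successive crawl snapshots:
# count occurrences of a cluster of indicator terms in each snapshot's
# extracted text and tabulate the totals per snapshot.
import re
from pathlib import Path

RACE_INDICATORS = ["ninth ward", "looter"]         # example term cluster

def concept_count(text, terms):
    return sum(len(re.findall(re.escape(t), text, flags=re.IGNORECASE))
               for t in terms)

def tabulate(snapshot_dirs):
    # Each directory is assumed to hold the extracted plain text of one crawl.
    totals = {}
    for day in snapshot_dirs:
        totals[day] = sum(concept_count(p.read_text(errors="ignore"), RACE_INDICATORS)
                          for p in Path(day).glob("*.txt"))
    return totals

if __name__ == "__main__":
    for day, count in tabulate(["crawl-2005-09-01", "crawl-2005-09-08"]).items():
        print(day, count)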

Conclusions

At the time Hurricane Katrina hit, the Web-at-Risk service requirements and test crawl plan had been written for a more orderly and considered approach to web crawling. Our attempt to capture the web-based aftermath of Katrina highlighted our need to revisit the service requirements and see how well they support an event-based response. We need to develop a collaborative and robust mechanism to identify and collect seed URLs. When an event happens quickly, it is also important to have in place general crawling guidelines that will result in a well-rounded collection. The technical infrastructure for the Web Archiving Service that we are building for the Web-at-Risk project should be robust enough to handle occasional and sudden bursts of activity; attempting to alter the infrastructure quickly in reaction to emergency events impairs our responsiveness and leaves the altered systems in a vulnerable state.

8 Paepcke, Andreas, Senior Research Scientist and Director of the Digital Library, Stanford University. Email correspondence with Patricia Cruse, October 26, 2005.


Finally, work needs to continue both on the rights management front and on developing improved web archiving analysis tools, so that the material gathered can be used to its greatest potential.


Web-at-Risk Test Crawl Report: Appendix C Individual Crawl Reports

Included below are the crawl reports provided to individual curators, including their analysis of and feedback about those results. Note that a key to interpreting the tables in these crawl reports is provided at the end of this appendix.

Elizabeth Cowell (submitted by Ann Latta): UC Merced

CDL Report to Curator

URL: http://www.ucmerced.edu

Curator's original comments: "UC Merced is the first research university to be built in the 21st century. The educational and land use issues are significant. Of particular interest is http://www.ucmercedplanning.net. This site addresses major issues of land use - the university is being built on agricultural land. Controversy existed re issues of redevelopment of downtown Merced vs. appropriation of agricultural land - there are major environmental issues focused on endangered species - educational issues involving faculty job descriptions, student body, etc. are significant because of the economic, ethnic, and cultural diversity of the region."

Site copyright statement: "© 2004 UC Regents"

Crawl Results


Comments from crawl operator: When we set the crawl to include pages from linked sites, the crawler got "trapped" at the Elsevier site. There is JavaScript on that linked page that causes the crawler to continue looking for additional pages on the Elsevier site, even when you're only trying to capture a single page. Once we set a limit for the maximum number of retry attempts, the crawl completed. This data is from the completed crawl.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 227 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
969 71552369 www.ucmerced.edu
238 2564803 www.ucop.edu
226 14851 dns
197 16583197 www.universityofcalifornia.edu
156 8487817 www.elsevier.com
151 1437436 www.greatvalley.org
112 2354582 faculty.ucmerced.edu
105 5659795 www.pacific.edu
90 111985 k12.ucop.edu
86 255733 www-cms.llnl.gov
85 1178031 admissions.ucmerced.edu
81 297947 uc-industry.berkeley.edu
71 108265 www.mssmfoundation.org
67 349300 www.nps.gov
66 308926 www.usafreedomcorps.gov
54 137085 slugstore.ucsc.edu
52 52202 www.cerrocoso.edu
51 977315 www.universityofcalifornia.com

Curator Feedback to CDL (Cowell Merced)

Crawl Success: mostly effective


Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: 10 more links came up in the via search; 8 of these were not useful for the research of land use issues. The two that were useful were a benefit of the via search.

Crawl Frequency: monthly


Sherry DeDekker: California Water Science Center

CDL Report to Curator

URL: http://ca.water.usgs.gov

Curator's original comments: "We are interested in the CA water reports and databases behind the links on this page. Also in looking at http://waterdata.usgs.gov/ca/nwis/nwis; this section is an interactive interface to multiple data sets. Is it possible to capture this type of site as well as the static reports?"

Site copyright statement: "Information presented on this website is considered public information (unless otherwise noted) and may be distributed or copied. Use of appropriate byline/photo/image credit is requested. We strongly recommend that USGS data be acquired directly from a USGS server, and not through other sources that may change the data in some way."

Crawl Results

Comments from crawl operator: Databases, e.g. http://waterdata.usgs.gov/ca/nwis/nwis, are disallowed by http://waterdata.usgs.gov/robots.txt. Some water data reports are NOT caught by the broader "linked hosts included" settings, e.g. http://pubs.usgs.gov/wdr/2004/wdr-ca-04-1 -- would also want to submit http://ca.water.usgs.gov/waterdata as a seed.

Related hosts crawled:


When the crawl was set to include documents from other sites that the original site linked to, 662 additional sites were crawled. The following list includes the hosts that supplied more than 50 files. Note that the host "pubs.usgs.gov" supplied a higher number of files than the original host itself.

[urls] [bytes] [host]
1963 255912820 pubs.usgs.gov
1153 47066381 ca.water.usgs.gov
698 56570 dns
404 112354772 geopubs.wr.usgs.gov
385 9377715 water.usgs.gov
327 203939163 greenwood.cr.usgs.gov
318 17431487 www.elsevier.com
219 3254794 www.usgs.gov
189 2737159 www.lsu.edu
163 2292905 wrgis.wr.usgs.gov
158 31124201 www.epa.gov
149 921063 www.usda.gov
[list truncated…]

Curator Feedback to CDL (DeDekker CWSC)

Crawl Success: somewhat effective

Crawl Success Comments: Site appears to access water data reports (http://ca.water.usgs.gov/archive/waterdata/index.html), but none are actually available through the links. I expected the site to not be able to access real-time data, but these are archived reports.

Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Frequency: weekly


Megan Dreger (submitted by James Jacobs): City of San Diego Planning Department

CDL Report to Curator

URL: http://www.sandiego.gov/planning

Curator's original comments: I would like to drill down several levels (at least 3) of this site. For example, following the link to City of Villages/general plan update leads to many more important planning documents.

Site copyright statement: This site contains the two following notices on the same page:

"Restrictions on Use of Materials: This site is operated and maintained by the City of San Diego through its Department of Information Technology and Communications (referred to as IT&C). Except as provided herein, no material or information from this site may be copied, reproduced, republished, uploaded, posted, transmitted, or distributed except as authorized in this notice, expressly authorized within this site, or approved in writing by IT&C.

Copyright Notice: Unless a copyright is indicated, information on the City of San Diego Web site is in the public domain and may be reproduced, published or otherwise used with the City of San Diego's permission. We request only that the City of San Diego be cited as the source of the information and that any photo credits, graphics or byline be similarly credited to the photographer, author, or City of San Diego as appropriate. If a copyright is indicated on a photo, graphic, or any other material, permission to copy these materials must be obtained from the original source."

Crawl Results


Comments from crawl operator: Need feedback about whether the desired content was retrieved.

Question for curator: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 1084 additional hosts were crawled. (This figure does not represent a complete crawl, as the size limitation was reached.) The following hosts supplied more than 75 documents to your site:

[urls] [bytes] [host]
3728 556231640 www.sandiego.gov
1247 38685244 genesis.sannet.gov
1085 80905 dns
807 6676252 www.houstontexans.com
428 1079658 www.cacities.org
399 102298888 www.buccaneers.com
259 1797232 granicus.sandiego.gov
258 42666066 clerkdoc.sannet.gov
238 5413894 www.ccdc.com
225 2503591 www.ci.el-cajon.ca.us
223 1387347 www.ipl.org
217 2683826 www.sdcounty.ca.gov
203 11673212 restaurants.sandiego.com
195 2620365 www.sdcommute.com
192 1344523 www.bengals.com
189 2221192 www.kidsdomain.com
176 1333528 www.buffalobills.com
171 685965 www.chumpsoft.com
166 277238 www.proquest.com
[list truncated…]


Curator Feedback to CDL (Dreger San Diego)

Crawl Success: mostly effective

Crawl Success Comments: This crawl was not completed due to size, so that may explain some of my questions. It was pretty effective in terms of getting the Planning Dept. pages, but went out further than I expected. Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit it. It's nice that PDFs and other formats are included.

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: The linked hosts included (via) setting seemed to include more extraneous stuff.

Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)

Q: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
A: You asked for a better definition of what was meant by the request to drill down several levels (at least 3). Unfortunately, I wasn't the original curator and I'm not sure what he meant. The crawl you did was very useful.

Crawl Frequency: monthly

Questions/Comments about crawl: I think that the crawl frequency should be at least monthly for these pages.


Peter Filardo and Michael Nash: New York City Central Labor Council

CDL Report to Curator

URL: http://www.nycclc.org

Curator's original comments: (none)

Site copyright statement: "© 2004 New York City Central Labor Council. No portion of this website may be reproduced in any form without permission from the Central Labor Council. Contact our offices for more information at nycaflcio@aol.com."

Crawl Results

NOTE: Because your Crawl "A" had to be stopped, then resumed, each of your reports for that crawl is in two segments. To browse a list of all reports for that crawl, go to http://voro.cdlib.org:8081/ingest_misc/ndiipp/testcrawls_raw/filardo_labor_via

Comments from crawl operator:

A. Linked hosts included: Crawl complete after recovery, with the addition of max retries. Seemed to hang at http://www.nycclc.org/calendar/event.asp?EventId=501 and http://www.nycclc.org/assets/HLCapplicationmembership.pdf. Ended crawl; it seemed to hang. Recovered from previous job; the recovery was successful. Note for the future that a recovered job is identifiable because the logs directory is called logs-R.


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, x additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
1913 74260017 www.nycclc.org
156 11755 dns
115 710552 www.aflcio.org
73 1477966 www.comptroller.nyc.gov
71 193264 www.empirepage.com
60 570115 www.redcross.org
58 269079 www.afl-cio.org
57 240845 www.campsussex.org
57 113676 www.mssm.edu
56 449473 www.labor-studies.org
53 184605 www.pbbc.org
52 134326 www.senate.gov
[list truncated…]

Curator Feedback to CDL Filardo NYCCLC

None provided


Valerie Glenn and Arelene Weibel: Strengthening Social Security

CDL Report to Curator

URL: http://www.strengtheningsocialsecurity.gov

Curator's original comments: "Contains external links to audio & video that would be essential to completing this site (see press room); some are files, some are links to webcasts, some are on external .gov sites, and some are external .com sites."

Site copyright statement: Copyright info not found

Crawl Results

Comments from crawl operator: Interesting audio/video/webcast content -- need feedback about success in capturing it.

Questions for Curator: How successfully did this crawl capture the multimedia documents you were interested in?

Comments from coordinator: In the media center area of this site (http://www.strengtheningsocialsecurity.gov/press/media_center.shtml) I've found .ram (both video and audio alone), .smil, and .asx files. The site also contains numerous .ppt and .pdf files.

• A text search on the log file turns up numerous .ram files, but only 1 .ppt file.
• .asx files are Windows streaming media redirector files, which generally lead to associated .wmv files. No .asx references appear in the crawl log, nor do any .wmv files.
• Similarly, .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files.

I assume that when displayed, some of the RealMedia files from this site would function, but many of the other multimedia files would not. (A small sketch of this kind of log check appears after the host list below.)

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 388 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
660 10668874 www.chelseapiers.com
562 7334035 www.whitehouse.gov
477 6366197 www.laopinion.com
391 29623 dns
356 3874719 www.wkrc.com
243 12294240 www.strengtheningsocialsecurity.gov
178 1935969 www.xavier.edu
148 237055 image.com.com
127 682069 online.wsj.com
117 898439 www.omaha.com
116 514995 www.npr.org
108 995733 www.nba.com
[list truncated…]
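One quick way to gauge multimedia coverage is to tally file extensions seen in the crawl log. The sketch below is illustrative only; the log path and the extension list are assumptions, and this is not the exact check the coordinator ran.

# Sketch: tally media-related file extensions appearing in a crawl log as a
# rough check on whether streaming-media files were captured. The log path
# and the extension list are assumptions for the example.
import re
from collections import Counter

MEDIA_EXTS = ("ram", "smil", "asx", "rm", "wmv", "ppt", "pdf")

def tally_extensions(log_path):
    pattern = re.compile(r"\.(" + "|".join(MEDIA_EXTS) + r")\b", re.IGNORECASE)
    counts = Counter()
    with open(log_path, errors="ignore") as log:
        for line in log:
            for match in pattern.finditer(line):
                counts[match.group(1).lower()] += 1
    return counts

if __name__ == "__main__":
    for ext, n in tally_extensions("crawl.log").most_common():
        print(ext, n)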

Curator Feedback to CDL (Glenn Strength Soc Sec)

Crawl Success: mostly effective

Crawl Success Comments: My main concern about this site was the multimedia documents included - I've posted those thoughts in the Questions for Curator text box.

Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: For this site it was essential to capture the linked hosts (via), because many of the press materials, etc. were on external sites.

Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)

Q: How successfully did this crawl capture the multimedia documents you were interested in?
A: I'm disappointed that not all of the multimedia files were captured, but there seem to be only a few that aren't included. I found it interesting that broadcasts from the same host (whitehouse.gov) weren't completely captured - some were, some weren't.

Crawl Frequency: once

Questions/Comments about crawl: [note: the curator alludes to adding a comment here, but no comment was received]


Valerie Glenn and Arelene Weibel: Defense Base Closure and Realignment Commission

CDL Report to Curator

URL: http://www.brac.gov

Curator's original comments: "In our previous efforts we have been unable to capture agency databases. The BRAC site includes a document library which has a search feature (http://www.brac.gov/Search.aspx) and a browse feature (http://www.brac.gov/Browse.aspx). We would really like to see how this information can be captured so that we can recreate it on our own servers."

Site copyright statement: "The contents of all material available on this Internet site are in the public domain and are not copyrighted. The content of this site may be freely reproduced, downloaded, disseminated, published, or transferred in any form and by any means. However, in some cases the copyright for certain text or images on this site may be held by other parties."

Crawl Results

Comments from crawl operator:

A. Linked hosts included: http://www.brac.gov/Search.aspx can't be captured by Heritrix; http://www.brac.gov/Browse.aspx seems to only capture the first 25 documents. Tried again with the browse page as the starting point, but stopped after 1005 documents: extracted 20 links from the browse page, and then there were no more URLs in the frontier queue which had been extracted from Browse.aspx -- perhaps need more experimentation.

B. Restricted to original host: again, only the first 25 pages from browse -- can't even successfully pass a seed URL listing the max docs per browse page (50).

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 44 additional hosts were crawled. Note that because your crawl did not complete, this is not an accurate count of how many other sites your original site may link to.

[urls] [bytes] [host]
2034 1064389540 www.brac.gov
555 5874934 www.slu.edu
87 173510 www.cpcc.edu
54 154588 www.wmata.com
47 685158 www.sluhospital.com
44 3501 dns
44 582555 www.c-span.org
43 174467 www.adobe.com
38 178153 www.q-and-a.org
32 127325 slubkstore.com
24 140653 www.c-spanclassroom.org
23 326680 www.capitalnews.org
22 213116 cancercenter.slu.edu
21 196012 www.defenselink.mil
[List truncated…]

Curator Feedback to CDL (Glenn Defense Base Closure)

Crawl Success: somewhat effective

Crawl Success Comments: I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Frequency: once

Questions/Comments about crawl: One of the reasons I submitted this site to be crawled is that I'd already tried to harvest it using HTTrack. That product appeared to capture more of the documents than Heritrix - but we still haven't been able to capture the entire site.


Gabriela Gray: Join Arnold

CDL Report to Curator

URL: http://www.joinarnold.com

Curator's original comments: "California Recovery Team. Non-profit pro-Arnold group not registered as a campaign committee. Critical Aspects: Complex file and directory naming structure. Looping, e.g. contactus.asp and contactadd.asp links."

Site copyright statement: "Copyright 2005"

Crawl Results

Comments from crawl operator:

A. Linked hosts included: Great site for testing -- this loop is really interesting, because a new URL is generated with each loop, so that the duplicate-detection work underway at IA would still not eliminate it. 44332 of the retrieved URLs were contact pages.

B. Restricted to original host: Got into a loop by the end; 999 documents retrieved, 34 minutes.

C. Restricted to original host + regular expression:


Excluding pages that matched the regular expression contactadd.asp?c= did not end the loop. What did end the loop: excluding both contactus and contactadd pages so that they were not retrieved -- a drawback. (IA takes the manual approach of gathering the pages, then having an operator stop the crawl and take out the looping URLs by hand -- not scaleable.)

<newObject name="contact" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
  <string name="decision">REJECT</string>
  <string name="regexp">contact.asp?c=</string>
</newObject>

Related hosts crawled: Because of the looping problems, we were not able to crawl other hosts linked from this site.
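Before re-running a crawl with a revised exclusion filter, one can replay URIs from the earlier crawl log against a candidate pattern offline. The sketch below is illustrative only; the pattern and the sample URIs are assumptions, not the exact filter or pages involved here.

# Sketch: test a candidate crawl-trap exclusion pattern against sample URIs
# (for example, URIs pulled from an earlier crawl log) before re-crawling.
# The pattern and the sample URIs are illustrative assumptions.
import re

EXCLUDE = re.compile(r"contact(us|add)\.asp", re.IGNORECASE)

def keep(uri):
    return EXCLUDE.search(uri) is None

samples = [
    "http://www.joinarnold.com/en/agenda/index.asp",
    "http://www.joinarnold.com/en/contactus.asp?c=1234",
    "http://www.joinarnold.com/en/contactadd.asp?c=5678",
]
for uri in samples:
    print("KEEP" if keep(uri) else "SKIP", uri)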

Curator Feedback to CDL (Gray Arnold)

Crawl Success: somewhat effective

Crawl Success Comments: We spot-checked, and it looks like most files were captured, but the individual pages don't display most of the images. (This may simply be a problem with the WERA interface.) Strangely enough, the Flash files work perfectly, which is exactly the opposite of our own capture experience.

Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: As per the crawl notes, we only checked the original host version, since the via crawl failed.

Crawl Frequency: once

Questions/Comments about crawl: Same comments as on the Villaraigosa crawl. We're inexperienced with this type of capture: archiving files as-is and using a tacked-on script and a special server-side interface to interpret links in the new environment. Our model has been to actually alter the internal links from absolute to relative formats, so that the site works in any environment.


Gabriela Gray: Mayor-Elect Villaraigosa

CDL Report to Curator

URL: http://www.antonio2005.com

Curator's original comments: Critical Aspects: Flash animation; content scattered across multiple servers; maintaining complex internal link structure; JavaScript menus; streaming media.

Site copyright statement: "©2005 Villaraigosa for Mayor 2005"

Crawl Results

Comments from crawl operator:

• (for Linked hosts results) Need feedback on media etc. retrieved -- this site is an ideal example of the need for scope+one.
• (for Restricted to original host) How much was left out due to the domain restriction? Need feedback.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 263 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

[urls] [bytes] [host]
817 10291631 ensim3.interlix.com
805 117538973 www.antonio2005.com
472 6333775 www.laopinion.com
265 21173 dns
110 19355921 www2.dailynews.com
100 16605730 www2.dailybulletin.com
95 1410145 www.americanpresidents.org
86 820148 www.dailynews.com
73 168698 www.chumpsoft.com
72 52321 images.ibsys.com
69 836295 www.laobserved.com
65 137700 www.mysql.com
55 213569 www.ensim.com
55 177141 www.lamayorcn.com
55 296311 www.surveyusa.com
53 495858 abclocal.go.com
52 522324 www.c-span.org
51 244668 gallery.menalto.com
[list truncated…]

Curator Feedback to CDL (Gray Villaraigosa)

Crawl Success: mostly effective

Crawl Success Comments: Doing some spot checks, it looks like all of the pages were captured. Some problems with media files -- WERA shows them when we search, but the files are often size 0. In addition, many files on external servers are listed, and even have some descriptive info, but when we click on Overview it says "Sorry, no documents with the given uri were found," so no idea if they were really captured.

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: Most of the links to other sites go to pages that truly are external to the site, not incorporated into it. The only exception would be the pages from ga3.org and ga4.org.

Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): The results on the multiple hosts crawl are mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains, rather than source + 1. Also, there seem to be a lot of links going to ensim3.interlix.com -- there were more files captured from there than from antonio2005. This is a case where it looks like the entire site (or much of it) was mirrored on two different servers. When we find these, we often try to collapse them into one seamless whole, eliminating the duplication, which is meaningless from the user's viewpoint. Simply capturing both mirrors and leaving the cross-links intact is an option we've used when we can't collapse, but it often leads to problems with links between the two, which seems to be the case here.


Crawl Frequency: once

Questions/Comments about crawl: We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE -- the image files seem to have been captured, but some of the links between the captured HTML pages and the captured images aren't working properly.


Ron Heckart and Nick Robinson: Public Policy Institute of California

CDL Report to Curator

URL: http://www.ppic.org

Curator's original comments: "We are particularly interested in their publications. We hope the crawler will be able to report when new publication files are posted on the website. Our main focus of interest is on their new publications at http://www.ppic.org/main/newpubs.asp."

Site copyright statement: "All Contents © Public Policy Institute of California 2003, 2004, 2005"

Crawl Results

Comments from crawl operator: We can't at the moment use Heritrix to report on new publications posted.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 238 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

[urls] [bytes] [host]
2421 324309107 www.ppic.org
433 1367362 www.cacities.org
238 19286 dns
229 4675065 www.icma.org
200 598505 bookstore.icma.org
151 1437436 www.greatvalley.org
144 517953 www.kff.org
137 5304390 www.rff.org
113 510174 www-hoover.stanford.edu
102 1642991 www.knowledgeplex.org
97 101335 cdn.mapquest.com
81 379020 www.cde.ca.gov
73 184118 www.ilsg.org
68 4539957 caag.state.ca.us
62 246921 www.milkeninstitute.org
[list truncated…]

Curator Feedback to CDL (Heckart PPIC)

Crawl Success: mostly effective

Crawl Success Comments: There are some problems with the functionality of captured pages.
1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links.
2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on Date retrieves the message "Sorry, no documents with the given uri were found."
3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)

Crawl Frequency: weekly

Questions/Comments about crawl: Our crawl report included the following comment from the crawl operator: "We can't at the moment use Heritrix to report on new publications posted." The ability to report on new publications is critical to our goal of using the crawler as a discovery tool. What are the prospects for providing this functionality in the future?


Terry Huwe: AFL-CIO

CDL Report to Curator

URL: http://www.aflcio.org

Curator's original comments: "This site is content-rich and has many files that will be useful in the future. Specific areas that are of special interest follow below. http://www.aflcio.org/corporatewatch: the data related to executive pay watch is especially useful. http://www.aflcio.org/mediacenter: would like to see press stories captured if possible. http://www.aflcio.org/issues: links to newsletters and original content. Also, the "Legislative Action Center" on the home page; this is a useful topic guide to legislative history from a labor perspective."

Site copyright statement: "Copyright © 2005 AFL-CIO"

Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 2572 additional hosts were crawled. The following hosts supplied more than 75 URLs to your site:

[urls] [bytes] [host]
12702 481956063 www.aflcio.org
2657 184477 dns
1375 35611678 www.local237teamsters.com
570 8144650 www.illinois.gov
502 52847039 www.ilo.org
435 3851046 www.cioslorit.org
427 2782314 www.nola.com
401 8414837 www1.paperthin.com
392 15725244 www.statehealthfacts.kff.org
326 4600633 www.dol.gov
288 12303728 searchoxide.com
284 3401275 www.sikids.com
280 3069385 www.washingtonpost.com
272 1480539 www.cdc.gov
235 5455692 www.kff.org
[list truncated…]

Curator Feedback to CDL (Huwe AFL-CIO)

Crawl Success: effective

Crawl Success Comments: I realize the collection interface is a work in progress and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g. Working Families Toolkit, BushWatch) that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2 million-plus hits per year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses of the toolkit. Having said all that, my short answer is: I'm stoked :-)

Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: In the case of this collection there's a lot of original content (and context-making documents) on the linked sites, simply due to the hierarchical/federal nature of the labor movement.

Crawl Frequency: monthly

Questions/Comments about crawl: I may have missed correspondence on this question, as I've been busy with a library renovation here. My question: Is the 1 gig limit a useful one? I couldn't help wondering what the actual figure would be if that limit were set higher. I'm generally interested in the total size of Web collections, as I need to monitor ours (which is content-rich but even so does not exceed 1 gig). So discussion of this at least would be interesting, and maybe, if it's not technically challenging, that 1 gig level might be raised and we could see what happens. Thanks to all involved -- very interesting process.


Kris Kasianovitz: Los Angeles Dept. of City Planning

CDL Report to Curator

URL: http://cityplanning.lacity.org

Curator's original comments: "Website uses frames. Most of the documents will be PDFs. Of particular interest - the EIRs, which are not archived on the page once the project is approved; see http://cityplanning.lacity.org/EIR/TOC_EIR.htm - General and Community Plans: http://cityplanning.lacity.org/complan/gen_plan/genplan2.htm, http://cityplanning.lacity.org/ComPlan/cpbpage.htm"

Site copyright statement: No copyright information found

Crawl Results

Comments from crawl operator: The "linked hosts included" crawl ended because it ground on for 3 days without hitting the data limit; not sure if the URLs at the end of the log are valid/useful.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 119 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
10493 840876945 cityplanning.lacity.org
601 5156252 metrolinktrains.com
183 644377 www.cr.nps.gov
121 11162 dns
90 977850 www.metrolinktrains.com
81 1207859 www.fta.dot.gov
79 263432 www.fypower.org
66 333540 www.adobe.com
64 344638 lacity.org
63 133340 ceres.ca.gov
60 274940 www.amtrak.com
59 389217 www.nhtsa.dot.gov
58 347752 www.unitedweride.gov
52 209082 www.dot.gov
52 288783 www.nationaltrust.org
51 278949 www.portoflosangeles.org
[list truncated…]

Curator Feedback to CDL (Kasianovitz LA City Planning)

Crawl Success: mostly effective

Crawl Success Comments: The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department site is loaded with EIRs, notices, etc. In most cases the documents are all PDFs. When searching specifically for EIRs I got a large result list (699 citations); however, when I investigated whether or not the actual file was captured, I found that the main EIR page was captured (typically a htm/html file), but when I clicked on a link to get to the full report, all I got was the "Sorry, no documents with the given URI were found" message. This could be because the file was no longer available when the site was harvested. However, I tested a few of these and found that I could still access them on the City Planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved. Here are a few specific examples:

Final EIR directory: http://cityplanning.lacity.org/EIR/Tocfeir.htm

Sierra Canyon Secondary School (cover page): http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/SierraCyn_coverpg.htm. Access to the Draft EIR and Final EIR is provided from this cover page. Within the system, the links to both the Draft and Final are broken ("no documents with that given URI"): http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/DEIR/Table of Contents.htm

Villa Marina EIR: http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm

Directory of EIR notices of preparation: http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM


This provides links to EIR preparation notices (all PDFs), a total of 27 links/documents. I encountered the following three issues:

• PDF opened = when clicking on the link to the notice, the PDF opened with no problem: 16 of 27.
• "Sorry, no document with the given uri was found" = no PDF harvested, but I could get to it from the live site: 4 of 27.
• Acrobat "could not open" message (could open the live page outside of WERA) = the following Acrobat message came up when I tried to open the PDF: "Acrobat could not open 'ENV-2005-0881-EIR[1].pdf' because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded). To create an Adobe Acrobat PDF document, go to the source application. Then print the document to Adobe PDF." I copied and pasted the URL into a regular browser and could open the PDF with no problem. Also, in a few cases, if I clicked on the GO button after the first attempt to open the PDF in the system, it seemed to launch: 7 of 27.

Conversely, I found a number of pages that contained full documents in HTML, with links to PDFs that worked with no problem. See the following document: http://cityplanning.lacity.org/cwd/gnlpln/transelt/TET2Bkgrnd.htm. File types and error codes were what I expected.

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: The "restricted to original host" setting is more helpful/relevant for getting to the materials from a specific agency. For some of the results that were restricted to host, I was getting external links. Here are some comparisons for each of the crawl settings.

Searched for "villa marina":
LA Dept of City Planning: 6 results
http://cityplanning.lacity.org/EIR/NOPs/ENV-2004-3812-EIR.pdf
http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM
http://cityplanning.lacity.org/EIR/Tocfeir.htm
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/CwdGnlPln/HsgElt/HETblFigApVHgSit.htm
LA City Dept of Planning (via): 2 results
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/CwdGnlPln/HsgElt/HETblFigApVHgSit.htm

Searched for "eir":
LA Dept of City Planning: 699 results
LA City Dept of Planning (via): 324 results


For both of these searches, the URIs were from cityplanning.lacity.org.

Searched for "transportation":
LA Dept of City Planning: 699 results
LA City Dept of Planning (via): 290 results (most are from external sources and tended to be the index or main page of another agency or organization. Because this just got me to the main page, and none of the links functioned at that level, the via result was less helpful. However, the via results are useful for discovering other agencies or organizations that I should be looking at for materials.)

Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): The list of linked hosts provided at the end of the report is helpful. This will help me target other agencies that might be key places to check for further collection development.

Crawl Frequency: monthly

Questions/Comments about crawl: I want to qualify the frequency for this site. I'd like to do a monthly crawl for 3-4 months; I'd then want to reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and there is not a lot of new content being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation. After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report, there are 8,899 documents in this crawl alone; my other crawl yielded 2,991 documents. I think that the tools that are being developed will help us manage these crawls. I should note that at this point in time, for local documents, I am more interested in individual documents than in capturing an entire website and preserving the functionality. Local agencies (with the exception of perhaps the Mayor's site) tend not to change the design (look and feel) very often, or to have a very sophisticated design (i.e. Flash, changing images, etc.), and I don't see that this would be of interest to researchers' needs; it is the content, reports, maps, etc. that are contained/accessed on the websites that are important. Maybe I'm wrong or being short-sighted about that.


Kris Kasianovitz: Southern California Association of Governments

CDL Report to Curator

Curator's original comments: "This is a critical regional agency for Los Angeles, Orange, Ventura, Imperial, Riverside, and San Bernardino counties. Its main areas are Transportation, Housing, and Economic Development. This will provide an analysis of the overall site, which has a lot of content. Publications/reports are typically in PDF – they are presented as full reports and as pieces of the report (for easier downloading), so there might be duplication; the full report is really all that would be needed. See http://www.scag.ca.gov/publications. The Resources page contains the PDFs, images, dynamic content, and GIS programs, including an interactive atlas: http://www.scag.ca.gov/resources.htm. One part of the Resource site is the Web Accessible Geographic Data Search (WAGS): http://mapsvr.scag.ca.gov/wags/index.cfm?fuseaction= It requires a user-created login and password (although there is a guest login that allows you to bypass this). I'm not sure what kind of difficulty the harvester will encounter with this portion of the site. The interactive atlas also has a create-an-account/guest login issue: http://mapsvr.scag.ca.gov/atlas/presmap.asp?Cmd=INIT. Since it is a dynamic page, I don't know how this will be handled by the harvester."

Site copyright statement: "© 1999-2005 Southern California Association of Governments"

Crawl Results


Comments from crawl operator: Interesting login problem; Heritrix was unable to retrieve the guest login pages. ColdFusion and ASP pages don't generate new URLs and thus don't get crawled. NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than 1 minute.

Robots.txt file: The site you selected forbids crawlers from gathering certain data. It reads:

User-agent: *
Disallow: /_mm/
Disallow: /_notes/
Disallow: /_baks/
Disallow: /MMWIP/

User-agent: googlebot
Disallow: /csi/
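Which areas of a site are off-limits can be checked ahead of a crawl by parsing the robots.txt file; below is a minimal sketch using Python's standard-library parser (the user agent string and the test paths are examples, not part of the actual crawl configuration).

# Sketch: check which paths a site's robots.txt allows a crawler to fetch,
# using Python's standard-library parser. The user agent and the test paths
# are illustrative examples.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.scag.ca.gov/robots.txt")
rp.read()                                  # fetches and parses the live robots.txt

for path in ["/publications/index.htm", "/_baks/report.htm", "/MMWIP/draft.htm"]:
    allowed = rp.can_fetch("*", "http://www.scag.ca.gov" + path)
    print(path, "allowed" if allowed else "disallowed")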

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 500 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
2517 863231651 www.scag.ca.gov
690 6134101 www.metrolinktrains.com
506 40063 dns
428 1084533 www.cacities.org
397 16161513 www.sce.com
196 581022 bookstore.icma.org
187 4505985 www.icma.org
175 7757737 www.ci.seal-beach.ca.us
158 1504151 www.h2ouse.org
149 940692 www.healthebay.org
137 317748 www.ci.pico-rivera.ca.us
130 18259431 www.ci.ventura.ca.us
123 490154 www.chinohills.org
121 406068 www.lakewoodcity.org
119 203542 www.lavote.net
117 2449995 www.ci.malibu.ca.us
114 744410 www.ci.irvine.ca.us
113 368023 www.whitehouse.gov
109 974674 www.dot.ca.gov
107 892192 www.lacanadaflintridge.com
[list truncated…]

Curator Feedback to CDL (Kasianovitz SCAG)

Crawl Success: mostly effective


Crawl Success Comments: Similar to my comments about the Los Angeles Dept. of City Planning - the crawl brought back a lot of webpages, but not the publications/documents that I would want to collect. Again, the same problem happened with SCAG as happened with the EIR example: the crawl brought back HTML pages with links to reports (typically in PDF format) - but the actual documents were not captured. While the webpage is helpful as it gives context, the main content that I'd want to capture (the reports) was not captured. See the following for example; none of the webpages linked from this page are available, though they should link to a page that will have the material. I tried searching for the documents separately and couldn't get to them. See http://www.scag.ca.gov/publications/index.htm (the timeline arrows at the top seemed to function; I'm not sure what this is for) and http://www.scag.ca.gov/livable/pubs.htm. I was impressed to find that zip files were captured and I was able to download them. Unfortunately, when I opened them there wasn't any content (I did the same search by mistake with the Arizona Dept. of Water Resources and actually found content in the folders). I found 10 with the search "type:zip". The gif or jpg images retrieved are not useful - most were just bars or bullets or covers of reports (although this might be helpful to identify titles, I think I would end up discarding these -- after doing more checking of the results).

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: See comments from LA Dept. of City Planning. Restricted gets me to the relevant materials for that agency; via brings back too many main webpages for other agencies to be useful.

Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): I expected that there would be a problem with the content behind logins. The crawl confirmed that material behind login screens couldn't be captured. Can I get a copy of these crawl results? NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than one minute.

Crawl Frequency: monthly

Questions/Comments about crawl: How to handle the copyright issue? For the login information, I'm not sure what all was blocked by the robots file. Interesting that metrolinktrains.com is the #1 related host for both of my crawled sites.


Linda Kennedy: California Bay Delta Authority

CDL Report to Curator

URL: http://calwater.ca.gov

Curator's original comments: "We are interested in the environmental impact statements and other key documents, and the various news releases and other announcements and archives of CALFED."

Site copyright statement: "© 2001 CALFED Bay-Delta Program"

Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 519 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
1130 473192247 calwater.ca.gov
741 201538533 www.parks.ca.gov
521 40442 dns
373 51291934 solicitation.calwater.ca.gov
242 78913513 www.calwater.ca.gov
225 410972 cwea.org
209 87556344 www.science.calwater.ca.gov
173 109807146 science.calwater.ca.gov
172 1160607 www.adobe.com
129 517834 www.whitehouse.gov
[list truncated…]

Curator Feedback to CDL (Kennedy CALFED)


Crawl Success: mostly effective

Crawl Success Comments: I looked closely at the CALFED home page (http://calwater.ca.gov), Tribal Home page (http://calwater.ca.gov/Tribal/Tribal_Home.shtml), Key Documents page (http://calwater.ca.gov/CALFEDDocuments/CALFEDDocuments.shtml), and Archives page (http://calwater.ca.gov/Archives/Archives.shtml). The crawl did not complete in either the via or the non-via search. Nearly all linked pages were retrieved in the non-via search. However, the retrievals from the via search were much less complete than the retrievals from the non-via search. For example, on the Key Documents page there were 3 missing links from the non-via search but 14 missing links from the via search. When Adobe documents were retrieved from either crawl, they came up correctly. Three .asp links of tribal maps from the Tribal Home page were retrieved by the non-via search but not the via search. A few of the images were missing from the displays, and this was also affected by the browser used. Usually the same image was missing from both crawls, but sometimes the images were more complete in the non-via crawl retrievals. There were some display problems with the right-hand menu boxes on the http://calwater.ca.gov/Tribal/Tribal_Home.shtml page, for example, that did not display correctly when viewed in Firefox. The same page viewed correctly in Internet Explorer, but when printed out, the boxes printed incorrectly, just as viewed in the Firefox browser. Grant Opportunities (http://calwater.ca.gov/GrantOpportunities/GrantInformation.shtml): this link did not work in 2 via and 2 non-via instances (from the Tribal home page and from the Archives page), but did work on one non-via crawl page (the CALFED home page). It could also be searched and retrieved directly from the test crawl search page.

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: See comments above. The non-via search was substantially more complete.

Crawl Frequency: monthly


Janet Martorana: Santa Barbara County Department of Planning and Development (CDL Report to Curator)

URL: http://www.countyofsb.org/plandev/default.htm
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 487 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
3119 1102414495 www.countyofsb.org
485 34416 dns
428 1083047 www.cacities.org
357 6126453 www.sbcphd.org
320 6203035 icma.org
250 438507 www.sbcourts.org
234 1110744 vortex.accuweather.com
200 593112 bookstore.icma.org
[list truncated…]

Curator Feedback to CDL (Martorana SBCD)

Crawl Success mostly effective


Crawl Success Comments: Most documents I expected to find were captured, but a number were not. For example, off this page, http://www.countyofsb.org/plandev/comp/threeyear/2005-2008/default.html, I expected to get to the final work program, http://www.countyofsb.org/plandev/pdf/comp/programs/Three_Year_WP/2005-2008_3YrWrkProgram.pdf, but got the "Sorry, no documents with the given uri were found" message. Other examples are within the Energy Division, a part of the Planning & Development Dept. Off this page, http://www.countyofsb.org/energy/information.asp, I could access all links except for two: http://www.countyofsb.org/energy/information/oil&GasFields.asp (Oil and Gas Fields) and http://www.countyofsb.org/energy/information/oil&GasProduction.asp (Oil and Gas Production). The crawler seemed to cut off the URL right before the ampersand; perhaps it has problems with ampersands.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly
Questions / Comments about crawl: I'd like to be notified when I've navigated away from the captured site. I found myself on real-time web sites but couldn't tell what was on the captured site and what was real. Example: When I did a search on "santa barbara" I got 1528 results using the via results. I went to the County of Santa Barbara Online site and then, after clicking on links, I was on the real-time web and not in the crawl results database any longer, yet there were no indications that I had left the crawled database. The WERA uri was still displaying at the top of the screen. I couldn't tell which were the captured sites and which were the current real-time sites. Other observations: the webpage navigation doesn't work, e.g. the Table of Contents doesn't jump to that section on the webpage (www.countyofsb.org/energy/projects/shell.asp and www.countyofsb.org/energy/mitigation/oakProject.asp); links to glossary terms go to the glossary but not to the term itself.


Lucia Orlando: Monterey Bay National Marine Sanctuary (CDL Report to Curator)

URL: http://montereybay.noaa.gov
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 795 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
5272 468755541 montereybay.noaa.gov
861 61141 dns
554 20831035 www.wunderground.com
368 4718168 montereybay.nos.noaa.gov
282 3682907 www.oceanfutures.org
273 10146417 www.mbnms-simon.org
260 7159780 www.mbayaq.org
163 61399 bc.us.yahoo.com
152 1273085 www.mbari.org
146 710203 www.monterey.com
119 3474881 www.rsis.com
119 279531 www.steinbeck.org
118 1092484 bonita.mbnms.nos.noaa.gov
109 924184 www.duke.edu
104 336986 www.montereybayaquarium.org
103 595953 icons.wunderground.com
102 339589 www.uncw.edu
[list truncated…]

Curator Feedback to CDL (Orlando - Monterey Bay)

Crawl Success: (rating not provided)
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: This site contains a large mix of government policy, laws, and regulatory information, as well as links to recreational and educational activities available in the MBNMS. I was most interested in links to laws/regs/policy and educational info and organizations. I thought the links restricted to the original host best captured this information succinctly.
Crawl Frequency: unknown


Richard Pearce-Moses: Arizona Department of Water Resources (CDL Report to Curator)

URL: http://www.azwater.gov
Curator's original comments: (redirects to http://www.azwater.gov/dwr) In arid Arizona, water is one of the most important - and most contested - resources. The publications and records of this Department are of critical value to the state. Our spider can get many files from this site (1474 files in 258 directories). We are mostly interested in documents by programs. Although our spider can't get the imaged documents database (http://www.azwater.gov/dwr/Content/ImagedRecords/default.htm), this directory may not be critical if we can get the imaged documents transferred to us. We are not interested in blank forms and applications.
Site copyright statement: "Copyright © 1998 - 2005 Arizona Department of Water Resources and ADWR Network. All Rights Reserved."
Crawl Results

Questions for curator: Did this capture the documents you needed?


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 195 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
2233 988447782 www.azwater.gov
286 2350888 www.water.az.gov
253 4587125 www.groundwater.org
226 3093331 www.azcentral.com
196 15626 dns
178 395216 www.macromedia.com
128 1679057 www.prescott.edu
123 947183 www.azleg.state.az.us
115 792968 www.usda.gov
[List truncated…]

Curator Feedback to CDL (Pearce-Moses AZWater)

Crawl Success: (not provided)
Crawl Success Comments: We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand, wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can. That may be the difference. The crawl is listed as not completing, but it appears to be very close based on the total number of bytes downloaded.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Questions / Comments about crawl: It seemed odd that it took 2h 54m to crawl 7709 documents with linked hosts, but 4h 4m to crawl only 4888 docs when the spider was restricted to the original host.


Richard Pearce-Moses: Citizens Clean Election Commission (CDL Report to Curator)

URL: http://www.ccec.state.az.us/ccecscr/home.asp
Curator's original comments: This commission was established by initiative. Its work is of great historical significance, as it is changing the way the public elects officials. We have not been able to spider this site because links are buried in JavaScript. (We use wget as our spider.) We are primarily interested in acquiring their publications, election data, and things listed under "popular links."
Site copyright statement: "Copyright 2004 Arizona Citizens Clean Elections Commission. All Rights Reserved."
Crawl Results

Comments from crawl operator: JavaScript issue; interesting problem; need curator feedback about what we captured.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, the following additional sites were crawled. Total: 15


[urls] [bytes] [host]
929 95456563 www.ccec.state.az.us
76 6117977 www.azcleanelections.gov
55 513218 az.gov
49 499337 www.governor.state.az.us
44 174903 www.adobe.com
40 141202 www.azleg.state.az.us
31 18549 www.az.gov
28 202755 www.azsos.gov
23 462603 gita.state.az.us
19 213976 www.benefitoptions.az.gov
17 89612 www.azredistricting.org
14 1385 dns
3 1687 wwwimages.adobe.com
2 1850 www.capitolrideshare.com
2 26438 www.ftc.gov

Curator Feedback to CDL (Pearce-Moses CCEC)

Crawl Success: (not provided)
Crawl Success Comments: We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with wget.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)


Juri Stratford: City of Davis (CDL Report to Curator)

URL: http://www.city.davis.ca.us
Curator's original comments: "We are primarily interested in the GIS data produced for the City of Davis. As the GIS data represent the largest part of the City of Davis web site, it may not be much more difficult to archive the site at the top level. Mapping and Geographic Information Systems URL: http://www.city.davis.ca.us/gis/"
Site copyright statement: "This web site is Copyright © 2004 by the City of Davis. All Rights Reserved. The City retains the copyright on all text, graphic images, and other content of this site. You may not copy, modify, and/or re-use text, images, or other web content from this web site; distribute the City's web content; mirror content from this web site on a non-City server; or make any other use of the content of this web site that would violate the City's copyright without written permission from the City of Davis. To the extent allowed by law, commercial use of our web material is prohibited without written permission from the City of Davis. All art work shown on these web pages is protected by US Copyright laws. Limited reproduction for non-commercial purposes can be authorized by the City of Davis, provided that requests are approved prior to use. Contact the Community Development Department Cultural Services program staff at (530) 757-5610 for more information. Some content included in this web site may be provided courtesy of third parties, who may retain copyright control of the provided material. Any service marks and trademarks contained herein are the property of their respective owners."
Crawl Results

67

Comments from crawl operator: "GIS: Potential issue: /img is disallowed by robots.txt, e.g. http://www.city.davis.ca.us/img/featured/map-static.jpg can't be retrieved; some maps on a second server are also disallowed. Need feedback about the GIS material that was captured - what was captured that is useful? Much duplication -- pages captured repeatedly."
Robots.txt: The site you selected prohibits crawlers from collecting certain documents. The file reads:

User-agent: *
Disallow: /img
Disallow: /calendar
Disallow: /misc/email.cfm
Disallow: /ed/business
Disallow: /gis/oldmap
Disallow: /police/log
Disallow: /pcs/grants/sacog
Disallow: /jobs/listings
Disallow: /css
Disallow: /pcs/nutcracker/history.cfm
Disallow: /pcs/nutcracker/pdfs

User-agent: asterias
Disallow: /

User-agent: gigabot
Disallow: /
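As an illustration of how a crawler that honors these rules behaves, the following is a minimal sketch (not part of the CDL or Heritrix tool set) using Python's standard urllib.robotparser module. Only two of the directives above are reproduced; the first example URL comes from the crawl operator's comments and the second from the curator's feedback below.

    # Minimal sketch: how a robots.txt-honoring crawler interprets the rules above.
    import urllib.robotparser

    rules = [
        "User-agent: *",
        "Disallow: /img",
        "Disallow: /calendar",
    ]

    rp = urllib.robotparser.RobotFileParser()
    rp.parse(rules)

    # The disallowed image noted by the crawl operator cannot be fetched...
    print(rp.can_fetch("*", "http://www.city.davis.ca.us/img/featured/map-static.jpg"))  # False
    # ...but pages elsewhere on the host, such as the GIS library, still can be.
    print(rp.can_fetch("*", "http://www.city.davis.ca.us/gis/library/"))  # True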

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 420 additional hosts were crawled. The following hosts supplied more than 50 files to your site:


[urls] [bytes] [host]
16455 947871325 www.city.davis.ca.us
420 29555 dns
332 10377948 www.asucd.ucdavis.edu
305 33270715 selectree.calpoly.edu
279 3815103 www.w3.org
161 2027740 www.cr.nps.gov
139 941939 www.comcast.com
133 951815 www.yolocounty.org
[List truncated…]

Curator Feedback to CDL (Stratford Davis)

Crawl Success: mostly effective
Crawl Success Comments: Looking at the GIS Online Maps page, it's not clear which formats were retrieved and which were not. For example, the Growth Map Flash file downloads fine, but the Flash/ArcIMS files do not download.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In general, restricted to original host works better. The broader search includes pages from businesses (e.g. Comcast) and other local agencies, e.g. other local and state government sites. But restricting the outside sites to the first level seems to be a good compromise.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report; this is the curator's answer to the question): In general it looked like it did a good job pulling back geographic data/images. For example, you can pull down data from http://www.city.davis.ca.us/gis/library. It's difficult for me to get a sense of the level of duplication from the way the search results display.
Crawl Frequency: monthly
Questions / Comments about crawl: I like the ability to navigate within the archive once you have a starting point. This is very nice.


Yvonne Wilson: Orange County Sanitation District (CDL Report to Curator)

URL: http://www.ocsd.com
Curator's original comments: "This is an Orange County agency that has small reports and complex planning documents on its web site."
Site copyright statement: "Copyright © 2001-2005 Orange County Sanitation District. Unless a copyright is indicated, the information on this site is freely available for non-commercial, non-profit making use. If a copyright is indicated on any materials displayed on our website, permission to copy these materials must be obtained from the original source. Commercial use of District materials is expressly prohibited without the written permission of the OCSD."
Crawl Results

Comments from crawl operator: In both crawl settings we had to limit the maximum number of retry attempts in order to complete the crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 85 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
755 85943567 www.ocsd.com
164 7635257 www.ci.seal-beach.ca.us
122 809190 www.ci.irvine.ca.us
95 169207 epa.gov
86 7673 dns
85 559125 order.e-arc.com
66 840581 www.ci.huntington-beach.ca.us
62 213476 www.cityoforange.org
57 313579 www.epa.gov
55 4477820 www.villapark.org
50 1843748 www.cityoflapalma.org
50 463285 www.ocbinc.com
[List truncated…]

Curator Feedback to CDL (Wilson OCSD)

Crawl Success: somewhat effective
Crawl Success Comments: Using WERA, I searched inquiries by type and title in the two OCSD collections, plain and via. I received no hits for pdf, only the homepage for html, and three hits for text. There are many pdf sections in the EIRs. I next searched by title in the two collections; I was most successful in via. By searching the titles "carbon canyon" and "Ellis Ave Pumping Station" I found articles, but not the EIRs, which are available full text. At this point I presumed the crawl did not drill down far enough. Then I searched for a secondary page entitled "Ocean Monitoring"; this time the search found only an internal letter and memo, but not all the documents related to this topic. The via collection search seems to be the most productive, but it is not consistent.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: I searched some of the outside links to US Marine Fisheries and EPA Beach Watch and received no hits.
Crawl Frequency: monthly


Crawl Report Key: Web-at-Risk Test Crawls

This document is a guide to the test crawl report, providing further information about some of the statistics and results conveyed there. Your report includes basic information about the site you submitted, your original comments about that site, and any copyright statements CDL found when we examined the site. Your site was crawled using the Heritrix crawler, version 1.5.1. Your report will include the following information about the site:

Crawl Settings: We crawled each site in two different ways: (A) linked hosts included, and (B) restricted to original host. Where linked hosts were included, we set the crawler to gather any outside page that your nominated site linked to, but no further. So if your site linked to a single document from whitehouse.gov, we captured that document but did not crawl any further on the whitehouse.gov site. Your report will show results from both styles of crawling, to give you a sense of whether or not the site draws heavily from valuable materials on another site. You will also receive a list of the other hosts that site linked to and how many documents were gathered from those hosts.

Robots.txt file: The presence of a robots.txt file means that the content provider is asking us to refrain from crawling either all or part of the site. This refers only to the host you named in your crawl request. In some cases the site had a robots.txt file but it didn't say anything; we noted when this occurred. We obeyed robots.txt instructions for these crawls, so if the site contained one but we still got a result, that means the robots.txt file only prevented us from crawling certain areas of the site. When robots.txt files were present, we have included the text of that file in your report so you can see which segments of the site the site owner wants to protect.

Crawl duration / Total number of documents: The "Documents" count will include page components (such as images or flash files).

File types (mime types): This area will contain a URL. When you go to that URL, you will see a list of the different file types that were retrieved as part of the crawl. IMPORTANT: The Heritrix crawler is currently experiencing difficulty with this report, and it is missing a crucial column. Until that is fixed, you can see the different file types retrieved, from most common to least, but you cannot yet tell how many files each one included. CDL will contact you when this report has been fixed.

Response code reports: The URL in this column will lead to a list of response codes, in order by frequency. This will include "200" for files that were successfully captured, and error codes for files that were not captured. The error code list includes some codes specific to Heritrix. The key to interpreting these codes is at http://crawler.archive.org/articles/user_manual.html#statuscodes. Note that this report only gives you quantitative information about response codes; it does not link response codes to specific files. For these details, see "Location of hosts report and crawl log" below.

How much data collected (bytes): The file size of the total crawl is reported in bytes. You can use the byte conversion tool at http://www.techtutorials.net/reference/byteconverters.html if you want to recalculate the size of the crawl in another measurement, such as kilobytes or megabytes.

Did crawl complete: This will say "no" if the crawl results exceeded 1 gigabyte, or if the crawler encountered an obstacle to capturing the site that could not be fixed.

Location of hosts report and crawl log: You are welcome to review the page-by-page details of the crawl log. This is a generic report that comes with the Heritrix crawler; it is not terribly user friendly, but it provides the most detail about the crawl process. This report will list every file that the crawler attempted to get and provide some information about each file. The Heritrix manual can help you interpret this report: http://crawler.archive.org/articles/user_manual.html#logs (go to section 8.2.1, Crawl Log).

Comments from Crawl Operator: These are observations that the Web Archive Programmer made about the crawl process for your site.

Questions for Curator: This section does not appear in every report. If you have a question listed here, please respond to it in the Test Crawl Feedback form.

Your Collection: Important: The screens you will see do not represent the final user interface for the Web Archiving Service tools. We are using WERA, an open-source search and display tool, only to show you your test crawl results. Because we did not seek the right to redistribute these documents, these pages are available only for the purpose of analyzing crawler effectiveness. You must have a password to view these pages. Your report will include the address of a wiki page, and a login and password. Each site was crawled twice: "plain" crawl = only pages from the original site were collected; "via" = pages from the original site, as well as pages that site links to, were collected. Unfortunately you cannot simply browse your site; you must select a collection and type a search. You will be able to navigate throughout your site once you load a page containing links. You will be able to review your colleagues' sites as well. Note that the WERA display tool is not perfect: if the same document was gathered from more than one crawl, it may not display in every collection.

Related Hosts Crawled: This section provides further information about the additional materials that were gathered when we set the crawler to include documents that your site links to. This can be critical in deciding what settings are needed to capture your site. Some sites, for instance, will keep all of their pdf or image files on a separate server. If you don't allow the crawler to move away from the original URI, you won't capture a critical portion of the site's content. In other cases, however, this setting will lead to irrelevant information. This report includes the most commonly linked hosts from your site.
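For curators who want a rough quantitative summary without reading the crawl log line by line, the following is a small sketch (not part of the WERA or Heritrix tool set) that tallies response codes from a crawl.log file. It assumes the whitespace-delimited crawl.log layout described in section 8.2.1 of the Heritrix user manual, in which the second field of each line is the fetch status code; the file name used here is only an example.

    # Rough sketch: tally fetch status codes in a Heritrix crawl.log.
    from collections import Counter

    def tally_status_codes(crawl_log_path):
        counts = Counter()
        with open(crawl_log_path) as log:
            for line in log:
                fields = line.split()
                if len(fields) < 2:
                    continue  # skip blank or malformed lines
                counts[fields[1]] += 1  # e.g. "200", "404", or a Heritrix-specific code
        return counts

    if __name__ == "__main__":
        for code, total in tally_status_codes("crawl.log").most_common():
            print(code, total)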

When compared quantitatively, 8 out of 19 crawls took longer to capture the site when limited to "original host only" than with the "linked hosts" setting. It is not clear why this is the case, since the "linked hosts" crawl should be much larger; indeed, in all cases the linked hosts crawl retrieved more files than the original host crawl. The following two tables compare the number of files retrieved and the duration of the two types of crawls.

Table 1: Number of files retrieved
         Original Host Only   Linked Hosts Included
Most     46197                70114
Fewest   247                  1343
Median   2423                 9250
Average  6359                 17247

Table 2: Duration of the crawl
          Original Host Only   Linked Hosts Included
Longest   32 hr 21 min         37 hr 11 min
Shortest  18 min               19 min
Median    7 hr 33 min          11 hr 22 min
Average   1 hr 42 min          7 hr 9 min

Given that this is a very small sample of crawls, and that the gap between the largest and smallest crawls is fairly noteworthy, perhaps the only telling figure to consider here is the median. According to the median figures, with only 50.5% more time (7 hr 33 min vs. 11 hr 22 min) the crawler acquired over 281% more documents (2423 vs. 9250). When compared qualitatively, the results also appeared somewhat counterintuitive. Of the 18 curators who responded, 12 stated that they preferred the "original host only" crawl (four were undecided). We would have expected this preference to vary a little more from site to site. Oddly, one of the two curators who preferred the larger crawl scope had a crawl that captured materials from over 2500 other hosts. In some cases a site's links to exterior hosts are critical; the site's value may hinge upon how well it gathers documents from other sources, and both of the curators who preferred the broader setting did so for this reason:

For this site it was essential to capture the linked hosts (via) because many of the press materials, etc. were on external sites.

More critically, sites are also often composed of content from more than one server. This is particularly likely to be the case if a site is providing a large body of pdf or multimedia files, so a crawl restricted to the original host would be missing critical segments of the site's content. Our test crawls did in fact turn up sites that were composed of more than one host name. For example, in the case of UC Merced, separate host names are used for different areas of the site, such as faculty or admissions. In the case of the Arizona Department of Water Resources, the distinction between host names appears to be accidental, perhaps the result of an attempt to transition to a simpler, more memorable URI. Most pages from this site come from www.azwater.gov, but hundreds of internal links, including critical style sheet files, are still hard-coded to point to www.water.az.gov. Finally, when the crawl is restricted to the original host, the end user is much more likely to encounter errors when viewing the archived results. When the end user selects a link that was not captured, WERA provides a "Sorry, this URI was not found" message. When the linked hosts are included, the end user browsing the site sees the site closer to its original context and with fewer error messages. Conversely, when end users encounter frequent error screens, they may develop both frustration and a sense of mistrust in the quality of the archive. It is worth noting that the curators are not likely to browse these results in the same way that an end user of their archives might. The curators know what these sites contain, choose them accordingly, and may be less inclined to click on links that would result in a "Sorry, this URI was not found" message. Ultimately, the value of a site's external links would seem likely to vary depending on the nature of the site. Sites with rich internal content and only "frivolous" external links would be best captured with the "original host only" setting. Before we ran these crawls, we asked curators to specify what they hoped a crawl would capture. Many referenced specific pages or directories they hoped to capture, and of those, three specified URIs that were not from the original host. When reviewing the results, two out of those three still preferred the original host crawl, even though that crawl did not capture the materials they specifically hoped to capture. Why was the feedback so consistent on this point? A look at the WERA interface used to display crawl results may provide an answer. WERA does not offer an immediate means of browsing a site; you have to search by keyword to find your way "into" the captured site. Once you have a page displayed you can browse within the site, but you must begin by searching for the right starting point. In the WERA search interface, you select the site you want to search from a drop-down menu, then enter terms to search against.

Because all pages from the more comprehensive crawls are indexed, the search results include pages from all of those other hosts. This sets up a bit of cognitive dissonance: the user specified a search against a particular site, yet results from other hosts vastly outnumber pages from that site.


Underlying the issue of crawl scope is the deeper question of what an archivist hopes to capture when a site is crawled. Is it just a list of particular documents? Or is it a faithful recreation of the site as it existed on that day? It may be that a captured site has content of primary and secondary importance: the primary content is what should be retrieved when searching against the archive, while the secondary content should only be present to avoid error messages and establish the site's full original context. Another approach is suggested by this curator's response:

The results on the multiple hosts crawl are mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains rather than "source + 1."

This suggests the need for an ability to link seed URIs together as related components of a single site.

Communication / Reports

When we reported the test results back to curators, we provided a synopsis of the crawl results and links to particular Heritrix reports and to the WERA display interface. The Heritrix reports are all plain text, providing tables of MIME type or response code frequency. Attempting to integrate these reports and the display of the archived results is a challenge. One curator, for example, obtained documents from over 200 hosts in the "linked hosts included" crawl, but was only aware of having found 10 additional documents when reviewing these same search results in WERA. Although WERA is helpful for seeing results from an end user's perspective, it does not provide adequate tools for analysis. In some cases this is simply because WERA is a new and occasionally buggy tool. It is possible, for instance, to follow occasional links out of the archive and into "real-time" sites. In some cases it's also possible to browse to a page and display it, but when you search for that same page by its URI, WERA does not find anything. One curator notes:

We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE; the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.

It is also important to note that although WERA was used for the purpose of reporting test crawl results, it is not envisioned as the final display interface for the Web Archiving Service. Even so, the feedback the curators provide about WERA should inform the functionality of the WAS interface. Clearly, it is still quite a struggle for curators to determine exactly what a crawl retrieved. One curator reports:

After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report there are 8899 documents in this crawl alone. My other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls.

Crawl Frequency

When asked how frequently they wanted to crawl sites, curators responded with a variety of preferences:

[Figure: "Desired Crawl Frequency" bar chart, showing the number of curators requesting each frequency: Daily, Weekly, Monthly, Once, Unknown.]

Again, it is worth considering precisely what curators hope to capture in a repeated crawl of a site. Some insight is provided by these curators' comments:

We hope the crawler will be able to report when new publication files are posted on the web site.

And

The ability to report on new publications is critical to our goal of using the crawler as a discovery tool.

As with the other NDIIPP grants, the purpose of the Web Archiving Service tools will be archiving and preservation, not resource discovery. This indicates that we should further investigate what a "weekly" or "monthly" crawl really means to curators. If a site was not updated over the course of a year, would the curator want to continue running weekly crawls of the site? Would the curator want each crawl to appear on an archive timeline for that site, even if the content was no different for each date? One curator comments:

I want to qualify the frequency for this site. I'd like to do a monthly crawl for three-four months. I'd want [to] reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot [of] new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation.


Language and Web Site Models

One of the challenges of communicating and interpreting crawl results is that crawlers don't necessarily work the way people envision them to. Further, the appearance of a web site on a screen and its architecture on a server may be quite different: the directory structure of a site may have no relationship to the way its navigation is organized on a screen. One frequent point of confusion curators encountered while interpreting crawl results is the concept of how many "levels down" the crawler went. One curator requested that we "drill down several levels (at least 3)" in our capture. One challenge with this request is that "levels down" can be interpreted to mean different things. In some cases curators clearly mean the directory structure of the web site on the remote server. In other cases, however, they seem to mean the number of hosts away from the original site:

In general, restricted to original host works better. The broader search includes pages from businesses (e.g. Comcast) and other local agencies, e.g. other local and state government sites. But restricting the outside sites to the first level seems to be a good compromise.

If the curator is referring to a structure of subdirectories, the next challenge is that the Heritrix crawler does not work this way. Heritrix ignores the site's directory structure and instead follows links from the seed URI it is provided. The Heritrix manual specifically defines "depth" in this way:

This scope allows for limiting the depth of a crawl (how many links away Heritrix should crawl) but does not impose any limits on the hosts, domains, or URI paths crawled.[6]
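To make the distinction concrete, the following small sketch (using hypothetical URLs, not drawn from any of the test sites) shows how a document can sit several directory levels deep on its server while being only one link hop from the seed, which is the measure Heritrix actually limits.

    # Illustration of "levels down": directory depth vs. link hops from the seed.
    from urllib.parse import urlparse

    def directory_depth(url):
        """How many directories deep the document sits on the server."""
        path_parts = [p for p in urlparse(url).path.split("/") if p]
        return max(len(path_parts) - 1, 0)  # exclude the file name itself

    seed = "http://www.example.gov/index.html"
    report = "http://www.example.gov/plans/pdf/programs/report.pdf"  # linked directly from the seed page

    print(directory_depth(report))  # 3 directory levels below the server root
    print(1)                        # but only 1 link hop away, which is what Heritrix counts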

Multimedia

Some curators chose sites because of the value of their multimedia resources. The crawler yielded mixed results in capturing these resources. For one site:

• A text search on the log file turns up numerous .ram files, but only one .ppt file.

• .asx files are Windows streaming media redirector files, which generally lead to associated .wmf files. No .asx references appear in the crawl log, nor do any .wmf files.

• .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files.

As noted, nearly half the sites crawled reached the 1 gigabyte size limit and so did not complete. This makes it difficult to determine whether there were genuine problems with particular types of files, or if the crawler simply did not get to the missing files before the crawl was stopped.

[6] Heritrix User Manual, Section 6.1.1, Crawl Scope: Broad Scope <http://crawler.archive.org/articles/user_manual.html>

Comparison with Other Crawlers

Three of our curators had previously used other crawlers, such as HTTrack and Wget, to capture the same sites we attempted with Heritrix. Because these curators provided strong details when describing their sites, we crawled all of the sites they sent us (six sites). In one case, the "Defense Base Closure and Realignment Commission," the curator had greater success capturing .aspx files with HTTrack than we had with Heritrix. Other comparisons:

We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using Wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand, Wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can. That may be the difference.

And

We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with Wget.

Crawl Success

We asked curators to rate the overall success of the test crawls on the following scale:

• Not effective (none of the desired documents were captured)
• Somewhat effective (some of the desired documents were captured)
• Mostly effective (most of the desired documents were captured)
• Effective (all of the desired documents were captured)

[Figure: "Crawl Success" bar chart, showing the number of curators giving each rating: Effective, Mostly Effective, Somewhat Effective, Not Effective.]

Here are some of the comments curators had concerning the overall success of these tests. The full text of all curator reports and their feedback to CDL is available in Appendix C.


Los Angeles Planning Department

The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all pdfs. When searching specifically for EIRs I got a large result list (699 citations); however [sic], when I investigated whether or not the actual file was captured, I found that the main EIR page was captured, typically a htm/html file, but when I clicked on a link to get to the full report all I got was the "Sorry, no Documents w/the given URI were found" message. This could be that the file was no longer available when the site was harvested; however, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved.

City of San Diego Planning Department

This comment is from a curator who is filling in for the person who originated the test crawl request, so she is also attempting to interpret what that other person was hoping to capture:

Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit.

Defense Base Closure and Realignment Commission

I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured; no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured; none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.

Public Policy Institute of California


There are some problems with the functionality of captured pages:
1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links.
2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on "Date" retrieves the message "Sorry, no documents with the given uri were found."
3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."

AFL-CIO

I realize the collection interface is a "work in progress" and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g. "Working Families Toolkit," "BushWatch") that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2 million-plus hits per year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses.

Conclusions

The process of running these test crawls has been valuable. It has resulted in a number of lessons learned and further directions for inquiry. It is clear that the tools that help curators analyze the results of these crawls will have to do more than mimic the original browsing context or supply tables of data. We will be looking for any opportunity to improve WAS reporting capabilities as we move forward with the project. The differences encountered in language and visualization raise the importance of a strong, intuitive design for the curator tools and of clear help screens. Each of us may visualize web sites differently, and the crawler may behave differently than we expect. A certain degree of online help will be needed to design crawls effectively, and further guidance should be available to help people interpret crawl results when those results don't match what the person anticipated.


The help and documentation for the Web Archiving Service will also need to address the rights analysis issues raised above. In most cases this is work that is done prior to issuing crawls; it cannot be addressed by the design of the WAS interface alone. When the Web-at-Risk project reaches the point of conducting usability studies, we should be sure to include tests that further uncover how users understand crawl frequency settings. Additionally, the desire for an analysis tool that can convey when a site has changed significantly is not limited to this project. It is a common issue faced by the Internet Archive, members of the International Internet Preservation Consortium, and others. CDL is actively communicating with these organizations as we all work toward a solution for this problem. Similarly, CDL should ensure that future assessment and analysis work with our curators addresses the issues raised by the two crawl scope settings. This inquiry should also extend to participants who exemplify end users, to determine whether people using web archives experience crawl scope differently than people who build them. Finally, a number of lessons were learned via the Katrina crawl described in Appendix B. The most outstanding finding is that event-based crawls such as Katrina and site-specific crawls such as these have quite different characteristics and require different functionality and analysis tools.

Next Steps

There is certainly more to learn from these crawl results and about Heritrix; in some cases it is still unclear why the crawler failed to retrieve certain documents. The curators' feedback concerning these results has been extremely valuable. They have provided insight as to what was captured and what is still missing, which would have been difficult to determine without their subject expertise in the sites chosen. As we continue working to improve crawler success and performance, we will turn our attention to the Virtual Remote Control site created by Cornell. This site was developed to present particular problems to crawlers and is well documented, enabling the user to gauge a crawler's results. We will be using that site to replicate particular problems raised in our test crawl set, and may also reattempt some of the sites crawled for these tests. When we release the first version of the Web Archiving Service to curators in July 2006, we will request that they include their original test site in the crawls they attempt and compare the results with these tests. The test results for these crawls will remain in place as a point of comparison while we continue developing the Web Archiving Service. Additionally, certain issues raised by this feedback, particularly the desired scope and frequency of crawls, merit attention in future usability analysis work.


Web-at-Risk Test Crawl Report: Appendix A - Sites Submitted

Curator | Site | Crawled
Sherry DeDekker | http://ca.water.usgs.gov (California Water Science Center) | Yes
Sherry DeDekker | http://www.dwr.water.ca.gov (California Department of Water Resources) | No
Peter Filardo and Michael Nash | http://www.nycclc.org (New York City Central Labor Council) | Yes
Peter Filardo and Michael Nash | http://www.dsausa.org (Democratic Socialists of America) | No
Valerie Glenn and Arelene Weibel | http://www.strengtheningsocialsecurity.gov (Strengthening Social Security) | Yes
Valerie Glenn and Arelene Weibel | http://www.brac.gov (The Defense Base Closure and Realignment Commission) | Yes
Gabriela Gray | http://www.joinarnold.com (Join Arnold) | Yes
Gabriela Gray | http://www.antonio2005.com (Mayor-elect Antonio Villaraigosa) | Yes
Ron Heckart and Nick Robinson | http://www.ppic.org (Public Policy Institute of California) | Yes
Ron Heckart and Nick Robinson | http://www.cbp.org (California Budget Project) | No
Terrence Huwe | http://www.aflcio.org (AFL-CIO) | Yes
Terrence Huwe | http://www.seiu.org (Service Employees International Union) | No
James Jacobs | http://www.sandiego.gov/planning (City of San Diego Planning Department; analyzed by Megan Dreger) | Yes
James Jacobs | http://www.sandag.org (San Diego Association of Governments) | No
Kris Kasianovitz | http://cityplanning.lacity.org (Los Angeles Department of City Planning) | Yes
Kris Kasianovitz | http://www.scag.ca.gov (Southern California Association of Governments) | Yes
Linda Kennedy | http://calwater.ca.gov (California Bay-Delta Authority, CALFED) | Yes
Linda Kennedy | http://www.dfg.ca.gov (California Department of Fish and Game) | No
Ann Latta | http://www.ucmerced.edu (UC Merced; analyzed by Elizabeth Cowell) | Yes
Ann Latta | http://www.coastal.ca.gov/web (California Coastal Commission) | No
Janet Martorana | http://www.countyofsb.org/plandev/default.htm (Santa Barbara County Department of Planning and Development) | Yes
Janet Martorana | http://www.sbcag.org (Santa Barbara County Association of Governments) | No
Lucia Orlando | http://montereybay.noaa.gov (Monterey Bay National Marine Sanctuary) | Yes
Lucia Orlando | http://www.waterboards.ca.gov/centralcoast (Central Coast Regional Water Quality Control Board) | No
Richard Pearce-Moses | http://www.azwater.gov (Arizona Department of Water Resources) | Yes
Richard Pearce-Moses | http://www.ccec.state.az.us/ccecscr/home.asp (Citizen's Clean Election Commission) | Yes
Juri Stratford | http://www.city.davis.ca.us (City of Davis, California) | Yes
Juri Stratford | http://www.sacog.org (Sacramento Area Council of Governments) | No
Yvonne Wilson | http://www.ocsd.com (The Orange County Sanitation District) | Yes

Web-at-Risk Test Crawl Report: Appendix B - The Katrina Crawl

The Crawl

During the early fall of 2005, the California Digital Library was embarking on a series of test crawls as part of the Web-at-Risk project when hurricane Katrina struck. CDL had requested that curators submit their sample URLs by August 25th, and on Friday, August 26th, Louisiana Governor Blanco declared a state of emergency. It was over that weekend that CDL staff reached the conclusion that an event of this historic significance would shift our priorities. On Monday we suspended our initial test crawl plans and began preparing on a number of fronts to capture web-based materials related to Katrina. Specifically, this meant:

• identifying which web sites or sub-sites to collect, which involved:
  o notifying 30+ curators of our emergency crawl plans and requesting their assistance
  o setting up a web-based interface to gather their suggested seed URLs
  o sorting through the resulting seed list and feeding it to our primary and secondary (Stanford) crawl operators
• determining our collection parameters
• getting our crawlers installed and configured
• locating and setting up disk space to store the crawls
• initiating and monitoring the crawls

This crawl activity posed several new challenges. It was the first time CDL staff had used the Heritrix crawler, which was not yet installed on a production server when the hurricane struck. The last week of August saw CDL and UC Berkeley staff scrambling to find server space and to set up Heritrix instances, disk partitions, and job parameters for the crawl. This was done with remarkable speed, owing to the urgency of the situation and to everyone's desire to capture a record of the events. Given our limited experience, in order to reduce the risk of losing the historically significant and fleeting materials related to hurricane Katrina, CDL worked with Stanford University to concurrently run the same crawl using a different crawler. After both CDL and Stanford had crawled the same seed list for a month, the task was then taken up by the San Diego Supercomputer Center, who have continued crawling these sites using Heritrix.

Gathering the Seeds

The CDL sent out an initial request to the Web-at-Risk curators to submit URLs related to hurricane Katrina for crawling. We worked collaboratively with a large group of content specialists to identify the sites:

• The 22 curators of the Web-at-Risk project (University of California Libraries, University of North Texas, New York University, Stanford University, and the Arizona State Library)
• The Library of Congress
• Librarians at Louisiana State University and the University of Mississippi

Over the course of the crawl the list of seed URLs grew to over 700 (just over 500 of which were crawled by CDL). The range of nominations submitted through our input form provides some sense of the range of materials collected. Given the sudden nature of this event, there was no time to investigate rights issues or technical problems each site might have presented. We informed the curators that "Our immediate plan is to simply collect the material before it disappears. We will not make the material immediately available." We also had little time for quality control and were not able to guarantee that sites in our seed list would be comprehensively crawled.

Crawling Specifics

CDL's first crawl was run on September 1, 2005, using a seed list of 89 URLs. The final crawl run by CDL was on October 10, using a seed list of 589 URLs. This final seed list was then sent to the San Diego Supercomputer Center, who are continuing to run twice-weekly crawls.


The CDL crawls were done using Heritrix version 1.5.1. We began with a single instance of Heritrix, eventually dividing the seed list among 6 instances. We got through the entire seed list 29 times in 40 days. However, the content we were able to collect was limited by our crawler settings. Our crawler was configured to visit one host at a time, collecting content at a conservative rate, and never more than three hops away from the seed URL. To ensure that the crawler moved through the seed list in a timely way, it was limited to spending 15 minutes at any given host. So the material collected does not represent the entirety of what was available at each site. Note that all of the seeds were crawled with the same configuration, no matter how different the structure of these sites might be. Thus the New Orleans version of Craigslist, NASA's information pages, and blog sites were all crawled in the same manner, despite being quite different in context, architecture, and other characteristics. Given our short preparation time, the goal was to find a crawler configuration that would match the broadest range of Katrina materials. In most cases, whether the seed URL was for a personal blog or for a government agency, the seed itself was not centrally about Katrina. The Katrina information was generally situated on the front page and top levels of each site, so CDL expected that this setting, even with the 15-minute time limit, would at least capture the content nearest to the front pages of the widest variety of sites we could capture.

In choosing crawler configuration settings we also faced conflicting goals. On one hand, we wanted to capture as much Katrina content as possible. On the other hand, we were very hesitant to start hitting sites that were providing much needed information at a crucial time (emergency sites, relief sites), thus making those sites hard to reach due to our capture activity. Many of these sites were already getting heavily used and perhaps were not running at full capacity; some were also geographically impacted directly by the hurricane. So we had to choose settings that balanced the need to collect with politeness across a wide range of sites.

When the San Diego Supercomputer Center took over the Katrina crawls in October, they revisited the Heritrix configuration settings. The SDSC crawl placed a limit not on the amount of time spent at a site but on the number of documents to be retrieved overall. Their crawls are being conducted "breadth first," in the sense of gathering pages consecutively across the seed list rather than gathering pages from one site at a time. The Stanford University crawl of the same sites began on September 7th and continued for 30 consecutive days using the WebVac crawler. While we have statistics concerning the size of the Stanford Katrina collection, it is very difficult to compare the configuration settings and effectiveness of WebVac vs. Heritrix. Further, there is no easy way to display materials gathered with WebVac, so the Stanford Katrina content is stored but is not accessible to us at the moment.
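The per-host policy described above (one host at a time, links followed no more than three hops from the seed, and a 15-minute budget per host) can be summarized in the following sketch. This is an illustration only, not the actual Heritrix configuration; the fetch and link-extraction functions are stand-ins supplied by the caller.

    # Simplified illustration of the CDL Katrina crawl policy; not Heritrix itself.
    import time
    from collections import deque

    MAX_HOPS = 3                 # never more than three hops from the seed URL
    HOST_TIME_BUDGET = 15 * 60   # at most 15 minutes at any given host (seconds)

    def crawl_host(seed_url, fetch, extract_links):
        """Crawl one seed's host within the hop and time limits."""
        start = time.monotonic()
        queue = deque([(seed_url, 0)])   # (url, hops from seed)
        seen = {seed_url}
        while queue and time.monotonic() - start < HOST_TIME_BUDGET:
            url, hops = queue.popleft()
            page = fetch(url)            # politeness delays would be applied here
            if hops >= MAX_HOPS:
                continue                 # do not follow links any deeper
            for link in extract_links(page):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, hops + 1))
        return seen

    def crawl_seed_list(seeds, fetch, extract_links):
        # Visit one host at a time, in seed-list order.
        return {seed: crawl_host(seed, fetch, extract_links) for seed in seeds}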

Katrina Crawl Results

In terms of creating a collection of the Katrina event on the web, we are not satisfied that CDL's crawling efforts were successful. As mentioned, we had to impose time limits on each site visited and were not able to conduct quality control on the URLs that were captured. The total size of the Katrina capture at CDL is 50 GB. This represents 29 completed crawls of the seed list, which grew to over 500 seeds over the 40 days we ran the crawl. It's clear from the byte size alone that we barely skimmed the surface, and indeed the Stanford crawl proves this point: when the Stanford crawl was based on about 370 seeds, it was pulling in about 35 GB a day, or 1.5 million pages a day. In spite of this, our attempts to capture Katrina news and events did have some very useful and positive outcomes. In terms of test crawling, the experience was quite successful, opening our eyes both to certain obstacles and to new possibilities. Most importantly, it prompted us to consider the nature and demands of event-based crawling, which had not been part of our original test plan. Here are some of the aspects of event-based crawls that Katrina surfaced.

Site Selection and Classification

This type of event demands a deep, collaborative effort to identify material to collect. CDL had to mobilize a large number of curators to make their best guesses at identifying promising sites. We set up a "crawl seed nomination" web form, accessible to the curators from their home institutions, and invited the curators to use it to enter their suggested seed URLs. CDL staff did not have time to add much automatic error or duplicate checking, so this cumbersome work was done by hand (a sketch of such checking appears below). On the other hand, it provided us a first-hand trial of what a more general curator user interface might require.

The selection and management of seed lists is critical for sudden event-based crawls. The curators contributing the URLs will not necessarily be well versed in the topic; in the case of Katrina, curators in California were not uniformly familiar with the Gulf Coast, the towns, the government agencies, etc. In addition, it is difficult to predict which aspects of the event will be of historic, enduring value. Because disk storage was not a pressing issue, it was better to err on the side of a wider net when selecting sites. It became clear that there might be a role for "smart crawling," which would spread a wider net than that provided by human-generated seed lists through the use of automated tools that discover relevant materials. We were specifically interested in using tools developed at Emory and Cornell that interface with the Rainbow text classification tool.7 We did not have the resources to investigate this for the Katrina crawl but plan to fold it into future crawling tests.

Finally, it is worth examining why the seed list grew continuously throughout the event. Part of this growth is due to the natural increase in the number of pages and sites devoted to the emergency, and to having the additional time to identify and add them. However, the nature of the event itself changed over time, starting as a hurricane, then becoming a flood, a massive relocation, and a political and social issue. So the range of relevant sites changed as the event itself took on broader implications. This suggests that site selection is an ongoing process, not strictly an activity undertaken at the beginning of an event.

7 Bergmark, Donna. "Heritrix processor for use with rainbow." <http://groups.yahoo.com/group/archive-crawler/message/1905>
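The sketch below illustrates, under the assumption of a simple list of nominated URLs, the kind of automatic normalization and duplicate checking referred to above; it is hypothetical and was not part of the Katrina nomination form. The function names normalize_seed and dedupe_nominations are invented for the example.

# Hypothetical sketch of seed-list cleanup that otherwise had to be done by hand.
from urllib.parse import urlparse, urlunparse

def normalize_seed(url):
    """Lower-case the host, default the scheme, and strip trailing slashes so that
    trivially different nominations of the same page compare as equal."""
    url = url.strip()
    if "://" not in url:
        url = "http://" + url
    parts = urlparse(url)
    path = parts.path.rstrip("/") or "/"
    return urlunparse((parts.scheme.lower(), parts.netloc.lower(), path, "", parts.query, ""))

def dedupe_nominations(nominations):
    """Return unique, normalized seeds plus a list of rejected (malformed) entries."""
    seen, seeds, rejected = set(), [], []
    for raw in nominations:
        seed = normalize_seed(raw)
        host = urlparse(seed).netloc
        if not host or " " in host:        # no host, or embedded whitespace: likely a typo
            rejected.append(raw)
        elif seed not in seen:
            seen.add(seed)
            seeds.append(seed)
    return seeds, rejected

# Usage with invented nominations:
# seeds, rejected = dedupe_nominations(["www.fema.gov", "http://www.fema.gov/", "not a url"])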

Rights, Ownership, and Responsibilities

Our current rights protocol stipulates that when a web page is clearly in the for-profit domain, we must seek permission from the content owners to collect. This restriction applies most notably to news agency sites, which would likely be a critical part of the web-based fallout of any major event. This was certainly the case with Katrina. However, rights negotiation is a time-consuming process. In the case of a planned event, such as an election, one is able to anticipate a certain number of sources and take rights management steps before the event occurs. But with emergencies such as Katrina, there is simply no way to secure rights in advance without missing time-sensitive material. Knowing this would be an issue, CDL notified our curators in advance that the Katrina materials we gathered would not be publicly accessible, but merely captured and preserved. The complex rights issues behind web archiving, as well as the collaborative environment described above, raise a number of questions:

• Who owns the content?
• Who is responsible for the project?
• Who is responsible for responding if a content owner objects?
• When is it permissible to ignore robots exclusion files?
• How much rights management can be done as part of post-processing for an event crawl?
• Can there be standing agreements with major news agencies to allow for this type of crawl in emergency situations? If so, what constitutes an emergency?

Technical Infrastructure

CDL did not have an adequate technical infrastructure in place at the time of the Katrina crawl. As mentioned above, staff at both CDL and UC Berkeley scrambled to find server space and to install and configure the crawlers. This sudden shift in workload left the system vulnerable and somewhat unstable. In order to start the crawling process as soon as possible, CDL began the project on a temporary server. Making the transition to more permanent storage in the midst of this crawling project was neither easy nor flawless, and a certain amount of data had to be recovered from backup sources.

Information Analysis and Display

The challenge of analyzing crawl results has been described in some detail in our Test Crawl Report. That report cites a curator who found the task of reviewing an 8,899-document crawl to be unmanageable. The tools currently available are not at all up to the task of analyzing a large and complex crawl. An event-based crawl is likely to result in massive amounts of data of widely varying quality. The selection of seeds is based on guesses that, given sufficient resources, should be reviewed, refined, and enhanced as the event progresses. Time series data based on changing input parameters represents a kind of moving target that suggests the need to develop new analysis tools. Andreas Paepcke of Stanford University addressed this problem when considering how a social scientist might want to analyze the Katrina materials. Consider the researcher who is interested in finding out how quickly the notion of race entered the public discourse in the aftermath of Katrina. Paepcke suggests:

"For example, the tool could perform word frequency analysis across successive crawls of Katrina sites, so that the user could identify the emergence of word usage such as 'looter,' 'ninth ward,' etc. The social scientist would then define within the tool clusters of words that in the scientist's judgment are direct or indirect occurrence indicators of the concept 'race' (i.e. ninth ward, looters, blacks, poor, ...). The scientist could subsequently interact with the tool at the level of these well-defined concepts. Example command: count occurrences of 'race' within the first three paragraphs of all pages and tabulate the differences across 10 days. We could go further and apply well-known topic categorization algorithms on the data to suggest new concepts as they arise in the series of text streams. For example, the tool might identify an increasing frequency of the term 'mismanagement' beginning in week two after the storm and alert the sociologist to this evidence of a new discourse theme." 8

While we don't expect to be able to incorporate this level of analysis into our web archiving tools any time soon, these ideas illustrate a useful direction for archival analysis tools.
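As a simple illustration of the direction Paepcke describes, the sketch below tallies researcher-defined concept terms across successive crawls, assuming the captured pages have already been reduced to plain text. It is a minimal, hypothetical example, not a feature of the Web-at-Risk toolkit; all names are invented.

# Hypothetical sketch: track the frequency of concept terms across successive crawls.
from collections import Counter
import re

def term_frequencies(pages):
    """Tally word occurrences across the plain text of one crawl's captured pages."""
    counts = Counter()
    for text in pages:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

def concept_trend(crawls, concept_terms):
    """For each successive crawl, total the occurrences of the single-word terms the
    researcher has grouped under one concept (multi-word phrases such as 'ninth ward'
    would need phrase counting rather than token counting)."""
    return [sum(term_frequencies(pages)[term] for term in concept_terms)
            for pages in crawls]

# Usage with invented data: `crawls` is a chronological list of crawls, each a list
# of page texts; the result is one total per crawl.
# print(concept_trend(crawls, {"looter", "looters", "poor"}))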

Conclusions

At the time Hurricane Katrina hit, the Web-at-Risk service requirements and test crawl plan had been written for a more orderly and considered approach to web crawling. Our attempt to capture the web-based aftermath of Katrina highlighted our need to revisit the service requirements and see how well they supported an event-based response. We need to develop a collaborative and robust mechanism to identify and collect seed URLs. When an event happens quickly, it is also important to have in place general crawling guidelines that will result in a well-rounded collection. The technical infrastructure for the Web Archiving Service that we are building for the Web-at-Risk project should be robust enough to handle occasional and sudden bursts of activity. Attempting to alter the infrastructure quickly in reaction to emergency events impairs our responsiveness and leaves the altered systems in a vulnerable state.

8 Paepcke, Andreas, Senior Research Scientist and Director of the Digital Library, Stanford University. Email correspondence with Patricia Cruse, October 26, 2005.

Finally, work needs to continue on both the rights management front and on developing improved web archiving analysis tools, so that the material gathered can be used to its greatest potential.

Web-at-Risk Test Crawl Report: Appendix C Individual Crawl Reports

Included below are the crawl reports provided to individual curators, including their analysis and feedback about those results. Note that a key to interpreting the tables in these crawl reports is provided at the end of this appendix.

Elizabeth Cowell (submitted by Ann Latta): UC Merced
CDL Report to Curator

URL: http://www.ucmerced.edu
Curator's original comments: "UC Merced is the first research university to be built in the 21st century. The educational and land use issues are significant. Of particular interest is http://www.ucmercedplanning.net. This site addresses major issues of land use - the university is being built on agricultural land. Controversy existed re issues of redevelopment of downtown Merced vs. appropriation of agricultural land - there are major environmental issues focused on endangered species - educational issues involving faculty job descriptions, student body, etc. are significant because of the economic, ethnic and cultural diversity of the region."
Site copyright statement: "© 2004 UC Regents"
Crawl Results

Comments from crawl operator: When we set the crawl to include pages from linked sites, the crawler got "trapped" at the Elsevier site. There is JavaScript on that linked page that causes the crawler to continue looking for additional pages on the Elsevier site, even when you're only trying to capture a single page. Once we set a limit for the maximum number of retry attempts, the crawl completed. This data is from the completed crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 227 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
969 71552369 www.ucmerced.edu
238 2564803 www.ucop.edu
226 14851 dns
197 16583197 www.universityofcalifornia.edu
156 8487817 www.elsevier.com
151 1437436 www.greatvalley.org
112 2354582 faculty.ucmerced.edu
105 5659795 www.pacific.edu
90 111985 k12.ucop.edu
86 255733 www-cms.llnl.gov
85 1178031 admissions.ucmerced.edu
81 297947 uc-industry.berkeley.edu
71 108265 www.mssmfoundation.org
67 349300 www.nps.gov
66 308926 www.usafreedomcorps.gov
54 137085 slugstore.ucsc.edu
52 52202 www.cerrocoso.edu
51 977315 www.universityofcalifornia.com

Curator Feedback to CDL (Cowell, Merced)

Crawl Success: mostly effective
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: 10 more links came up in the via search; 8 of these were not useful for the research of land use issues. The two that were useful were a benefit of the via search.
Crawl Frequency: monthly


Sherry DeDekker: California Water Science Center
CDL Report to Curator

URL: http://ca.water.usgs.gov
Curator's original comments: "We are interested in the CA water reports and databases behind the links on this page. Also in looking at http://waterdata.usgs.gov/ca/nwis/nwis; this section is an interactive interface to multiple data sets. Is it possible to capture this type of site as well as the static reports?"
Site copyright statement: "Information presented on this website is considered public information (unless otherwise noted) and may be distributed or copied. Use of appropriate byline/photo/image credit is requested. We strongly recommend that USGS data be acquired directly from a USGS server and not through other sources that may change the data in some way."
Crawl Results

Comments from crawl operator: Databases, e.g. http://waterdata.usgs.gov/ca/nwis/nwis, are disallowed by http://waterdata.usgs.gov/robots.txt. Some water data reports are NOT caught by broader "linked hosts included" settings, e.g. http://pubs.usgs.gov/wdr/2004/wdr-ca-04-1 -- would also want to submit http://ca.water.usgs.gov/waterdata as a seed.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 662 additional sites were crawled. The following list includes the hosts that supplied more than 50 files. Note that the host "pubs.usgs.gov" supplied a higher number of files than the original host itself.
[urls] [bytes] [host]
1963 255912820 pubs.usgs.gov
1153 47066381 ca.water.usgs.gov
698 56570 dns
404 112354772 geopubs.wr.usgs.gov
385 9377715 water.usgs.gov
327 203939163 greenwood.cr.usgs.gov
318 17431487 www.elsevier.com
219 3254794 www.usgs.gov
189 2737159 www.lsu.edu
163 2292905 wrgis.wr.usgs.gov
158 31124201 www.epa.gov
149 921063 www.usda.gov
[list truncated…]

Curator Feedback to CDL (DeDekker, CWSC)

Crawl Success: somewhat effective
Crawl Success Comments: Site appears to access water data reports (http://ca.water.usgs.gov/archive/waterdata/index.html), but none are actually available through the links. I expected the site to not be able to access real-time data, but these are archived reports.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly

Megan Dreger (submitted by James Jacobs): City of San Diego Planning Department

CDL Report to Curator

URL: http://www.sandiego.gov/planning
Curator's original comments: "I would like to drill down several levels (at least 3) of this site. For example, following the link to City of Villages/general plan update leads to many more important planning documents."
Site copyright statement: This site contains the two following notices on the same page:
"Restrictions on Use of Materials: This site is operated and maintained by the City of San Diego through its Department of Information Technology and Communications (referred to as IT&C). Except as provided herein, no material or information from this site may be copied, reproduced, republished, uploaded, posted, transmitted, or distributed except as authorized in this notice, expressly authorized within this site, or approved in writing by IT&C.
Copyright Notice: Unless a copyright is indicated, information on the City of San Diego Web site is in the public domain and may be reproduced, published or otherwise used with the City of San Diego's permission. We request only that the City of San Diego be cited as the source of the information and that any photo credits, graphics or bylines be similarly credited to the photographer, author or City of San Diego, as appropriate. If a copyright is indicated on a photo, graphic, or any other material, permission to copy these materials must be obtained from the original source."
Crawl Results

Comments from crawl operator: Need feedback about whether the desired content was retrieved. Question for curator: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 1084 additional hosts were crawled. (This figure does not represent a complete crawl, as the size limitation was reached.) The following hosts supplied more than 75 documents to your site:
[urls] [bytes] [host]
3728 556231640 www.sandiego.gov
1247 38685244 genesis.sannet.gov
1085 80905 dns
807 6676252 www.houstontexans.com
428 1079658 www.cacities.org
399 102298888 www.buccaneers.com
259 1797232 granicus.sandiego.gov
258 42666066 clerkdoc.sannet.gov
238 5413894 www.ccdc.com
225 2503591 www.ci.el-cajon.ca.us
223 1387347 www.ipl.org
217 2683826 www.sdcounty.ca.gov
203 11673212 restaurants.sandiego.com
195 2620365 www.sdcommute.com
192 1344523 www.bengals.com
189 2221192 www.kidsdomain.com
176 1333528 www.buffalobills.com
171 685965 www.chumpsoft.com
166 277238 www.proquest.com
[list truncated…]

Curator Feedback to CDL (Dreger, San Diego)

Crawl Success: mostly effective
Crawl Success Comments: This crawl was not completed due to size, so that may explain some of my questions. It was pretty effective in terms of getting the Planning Dept. pages, but went out further than I expected. Due to the vague request to drill down several levels, I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit. It's nice that PDFs and other formats are included.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The "linked hosts included" (via) crawl seemed to include more extraneous stuff.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.):
Q: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
A: You asked for a better definition of what was meant by the request to drill down several levels (at least 3). Unfortunately, I wasn't the original curator and I'm not sure what he meant. The crawl you did was very useful.
Crawl Frequency: monthly
Questions/Comments about crawl: I think that the crawl frequency should be at least monthly for these pages.

Peter Filardo and Michael Nash: New York City Central Labor Council
CDL Report to Curator

URL: http://www.nycclc.org
Curator's original comments: (none)
Site copyright statement: "© 2004 New York City Central Labor Council. No portion of this website may be reproduced in any form without permission from the Central Labor Council. Contact our offices for more information at nycaflcio@aol.com."
Crawl Results
NOTE: Because your Crawl "A" had to be stopped, then resumed, each of your reports for that crawl is in two segments. To browse a list of all reports for that crawl, go to http://voro.cdlib.org:8081/ingest_misc/ndiipp/testcrawls_raw/filardo_labor_via

Comments from crawl operator: A. Linked hosts included: Crawl complete after recovery, with the addition of max retries. Seemed to hang at http://www.nycclc.org/calendar/event.asp?EventId=501 and http://www.nycclc.org/assets/HLCapplicationmembership.pdf. Ended crawl (seemed to hang); recovered from previous job, and the recovery was successful. Note for the future that a recovered job is identifiable because the logs directory is called logs-R.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, x additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
1913 74260017 www.nycclc.org
156 11755 dns
115 710552 www.aflcio.org
73 1477966 www.comptroller.nyc.gov
71 193264 www.empirepage.com
60 570115 www.redcross.org
58 269079 www.afl-cio.org
57 240845 www.campsussex.org
57 113676 www.mssm.edu
56 449473 www.labor-studies.org
53 184605 www.pbbc.org
52 134326 www.senate.gov
[list truncated…]

Curator Feedback to CDL (Filardo, NYCCLC)

None provided


Valerie Glenn and Arelene Weibel: Strengthening Social Security
CDL Report to Curator

URL: http://www.strengtheningsocialsecurity.gov
Curator's original comments: "Contains external links to audio & video that would be essential to completing this site (see press room); some are files, some are links to webcasts, some are on external .gov sites and some are external .com sites."
Site copyright statement: Copyright info not found
Crawl Results

Comments from crawl operator: Interesting audio/video/webcast -- need feedback about success in capturing. Questions for Curator: How successfully did this crawl capture the multimedia documents you were interested in?
Comments from coordinator: In the media center area of this site (http://www.strengtheningsocialsecurity.gov/press/media_center.shtml) I've found .ram (both video and audio alone), .smil, and .asx files. The site also contains numerous .ppt and .pdf files.
• A text search on the log file turns up numerous .ram files, only 1 .ppt file (see the log-scanning sketch below).
• .asx files are Windows streaming media redirector files, which generally lead to associated wmf files. No .asx references appear in the crawl log, nor do any wmf files.
• Similarly, .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files. I assume that when displayed, some of the real media files from this site would function, but many of the other multimedia files would not.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 388 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
660 10668874 www.chelseapiers.com
562 7334035 www.whitehouse.gov
477 6366197 www.laopinion.com
391 29623 dns
356 3874719 www.wkrc.com
243 12294240 www.strengtheningsocialsecurity.gov
178 1935969 www.xavier.edu
148 237055 image.com.com
127 682069 online.wsj.com
117 898439 www.omaha.com
116 514995 www.npr.org
108 995733 www.nba.com
[list truncated…]
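The log check mentioned in the bullets above can be automated. The sketch below is a hypothetical example of tallying captured file types by extension from a Heritrix crawl log; the log file name, and the assumption that the URI is the fourth whitespace-separated field of each log line, are illustrative rather than a description of the actual report workflow.

# Hypothetical sketch: count captured file types by extension from a crawl log.
from collections import Counter
from urllib.parse import urlparse
import os

def extension_counts(log_path):
    """Tally file extensions of the URIs recorded in a crawl log."""
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            fields = line.split()
            if len(fields) > 3:                       # assumption: URI is the 4th field
                path = urlparse(fields[3]).path
                ext = os.path.splitext(path)[1].lower() or "(none)"
                counts[ext] += 1
    return counts

# Usage (hypothetical file name): compare the multimedia formats of interest.
# counts = extension_counts("crawl.log")
# for ext in (".ram", ".smil", ".asx", ".rm", ".ppt", ".pdf"):
#     print(ext, counts[ext])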

Curator Feedback to CDL (Glenn, Strengthening Social Security)

Crawl Success: mostly effective
Crawl Success Comments: My main concern about this site was the multimedia documents included - I've posted those thoughts in the Questions for Curator text box.
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: For this site it was essential to capture the linked hosts (via) because many of the press materials, etc. were on external sites.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.):
Q: How successfully did this crawl capture the multimedia documents you were interested in?
A: I'm disappointed that not all of the multimedia files were captured, but there seem to be only a few that aren't included. I found it interesting that broadcasts from the same host (whitehouse.gov) weren't completely captured - some were, some weren't.
Crawl Frequency: once
Questions/Comments about crawl: [note: the curator alludes to adding a comment here, but no comment was received]


Valerie Glenn and Arelene Weibel: Defense Base Closure and Realignment Commission
CDL Report to Curator

URL: http://www.brac.gov
Curator's original comments: "In our previous efforts we have been unable to capture agency databases. The BRAC site includes a document library which has a search feature (http://www.brac.gov/Search.aspx) and a browse feature (http://www.brac.gov/Browse.aspx). We would really like to see how this information can be captured so that we can recreate it on our own servers."
Site copyright statement: "The contents of all material available on this Internet site are in the public domain and are not copyrighted. The content of this site may be freely reproduced, downloaded, disseminated, published, or transferred in any form and by any means. However, in some cases the copyright for certain text or images on this site may be held by other parties."
Crawl Results

Comments from crawl operator: A. Linked hosts included: http://www.brac.gov/Search.aspx can't be captured by Heritrix. http://www.brac.gov/Browse.aspx seems to only capture the first 25 documents. Tried again with the browse page as the starting point, but stopped after 1005 documents; extracted 20 links from the browse page and then there were no more URLs in the frontier queue which had been extracted from Browse.aspx -- perhaps need more experimentation. B. Restricted to original host: again only the 1st 25 pages from browse -- can't even successfully pass a seed URL listing the max docs per browse page (50).
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 44 additional hosts were crawled. Note that because your crawl did not complete, this is not an accurate count of how many other sites your original site may link to.
[urls] [bytes] [host]
2034 1064389540 www.brac.gov
555 5874934 www.slu.edu
87 173510 www.cpcc.edu
54 154588 www.wmata.com
47 685158 www.sluhospital.com
44 3501 dns
44 582555 www.c-span.org
43 174467 www.adobe.com
38 178153 www.q-and-a.org
32 127325 slubkstore.com
24 140653 www.c-spanclassroom.org
23 326680 www.capitalnews.org
22 213116 cancercenter.slu.edu
21 196012 www.defenselink.mil
[list truncated…]

Curator Feedback to CDL (Glenn, Defense Base Closure)

Crawl Success: somewhat effective
Crawl Success Comments: I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: once
Questions/Comments about crawl: One of the reasons I submitted this site to be crawled is that I'd already tried to harvest it using HTTrack. That product appeared to capture more of the documents than Heritrix - but we still haven't been able to capture the entire site.

Gabriela Gray: Join Arnold
CDL Report to Curator

URL: http://www.joinarnold.com
Curator's original comments: "California Recovery Team. Non-profit pro-Arnold group, not registered as a campaign committee. Critical Aspects: Complex file and directory naming structure. Looping, e.g. contactus.asp and contactadd.asp links."
Site copyright statement: "Copyright 2005"
Crawl Results

Comments from crawl operator: A. Linked hosts included: Great site for testing -- this loop is really interesting because a new URL is generated with each loop, so that the duplicate-detection underway at IA would still not eliminate it. 44332 of the retrieved URLs were contact pages. B. Restricted to original host: Got into a loop by the end; 999 documents retrieved, 34 minutes. C. Restricted to original host + regular expression: Excluding pages that matched the regular expression .*contactadd\.asp\?c=.* did not end the loop. What did end the loop: excluding both contactus and contactadd pages so they were not retrieved -- a drawback. (IA takes the manual approach of gathering the pages, then having an operator stop the crawl and take out the looping URLs by hand -- not scaleable.)
<newObject name="contact" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
  <string name="decision">REJECT</string>
  <string name="regexp">.*contact.*\.asp\?c=.*</string>
</newObject>
(A quick way to test such an exclusion pattern is sketched below.)
Related hosts crawled: Because of looping problems, we were not able to crawl other hosts linked from this site.
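One quick way to sanity-check an exclusion pattern of this kind before re-running a crawl is to test it against URLs taken from the previous crawl log. The sketch below is purely illustrative: the regular expression is a hypothetical reconstruction of the rule described above, and the sample URLs are invented.

# Hypothetical sketch: test a candidate exclusion pattern against sample URLs.
import re

# Invented pattern: reject any contactus/contactadd page regardless of the c= parameter.
REJECT = re.compile(r"contact(us|add)\.asp", re.IGNORECASE)

def would_reject(url):
    """True if the candidate exclusion rule would keep this URL out of the frontier."""
    return bool(REJECT.search(url))

# Invented sample URLs standing in for entries pulled from a prior crawl log.
samples = [
    "http://www.joinarnold.com/en/contactus.asp?c=12345",
    "http://www.joinarnold.com/en/contactadd.asp?c=67890",
    "http://www.joinarnold.com/en/agenda.asp",
]
for url in samples:
    print("REJECT" if would_reject(url) else "ACCEPT", url)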

Curator Feedback to CDL (Gray, Join Arnold)

Crawl Success: somewhat effective
Crawl Success Comments: We spot-checked, and it looks like most files were captured, but the individual pages don't display most of the images. (This may simply be a problem with the WERA interface.) Strangely enough, the Flash files work perfectly, which is exactly the opposite of our own capture experience.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: As per the crawl notes, we only checked the original host version, since the via crawl failed.
Crawl Frequency: once
Questions/Comments about crawl: Same comments as on the Villaraigosa crawl. We're inexperienced with this type of capture: archive files as-is and use a tacked-on script and a special server-side interface to interpret links in the new environment. Our model has been to actually alter the internal links from absolute to relative formats so that it works in any environment.

Gabriela Gray: Mayor-Elect Villaraigosa
CDL Report to Curator

URL: http://www.antonio2005.com
Curator's original comments: Critical Aspects: Flash animation; content scattered across multiple servers; maintaining complex internal link structure; JavaScript menus; streaming media.
Site copyright statement: "©2005 Villaraigosa for Mayor 2005"
Crawl Results

Comments from crawl operator

• (For linked hosts results) Need feedback on media etc. retrieved -- this site is an ideal example of the need for scope+one.

• (For restricted to original host) How much was left out due to the domain restriction? Need feedback.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 263 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:
[urls] [bytes] [host]
817 10291631 ensim3.interlix.com
805 117538973 www.antonio2005.com
472 6333775 www.laopinion.com
265 21173 dns
110 19355921 www2.dailynews.com
100 16605730 www2.dailybulletin.com
95 1410145 www.americanpresidents.org
86 820148 www.dailynews.com
73 168698 www.chumpsoft.com
72 52321 images.ibsys.com
69 836295 www.laobserved.com
65 137700 www.mysql.com
55 213569 www.ensim.com
55 177141 www.lamayorcn.com
55 296311 www.surveyusa.com
53 495858 abclocal.go.com
52 522324 www.c-span.org
51 244668 gallery.menalto.com
[list truncated…]

Curator Feedback to CDL (Gray, Villaraigosa)

Crawl Success: mostly effective
Crawl Success Comments: Doing some spot checks, it looks like all of the pages were captured. Some problems with media files -- WERA shows them when we search, but the files are often size 0. In addition, many files on external servers are listed and even have some descriptive info, but when we click on Overview it says "Sorry, no documents with the given uri were found," so no idea if they were really captured.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: Most of the links to other sites go to pages that truly are external to the site, not incorporated into it. The only exception would be the pages from ga3.org and ga4.org.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): The results on the multiple hosts crawl are mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains rather than source + 1. Also, there seem to be a lot of links going to ensim3.interlix.com -- there were more files captured from there than from antonio2005. This is a case where it looks like the entire site (or much of it) was mirrored on two different servers. When we find these, we often try to collapse them into one seamless whole, eliminating the duplication, which is meaningless from the user's viewpoint. Simply capturing both mirrors and leaving the cross-links intact is an option we've used when we can't collapse, but it often leads to problems with links between the two, which seems to be the case here.

Crawl Frequency: once
Questions/Comments about crawl: We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE -- the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.

Ron Heckart and Nick Robinson: Public Policy Institute of California
CDL Report to Curator

URL: http://www.ppic.org
Curator's original comments: "We are particularly interested in their publications. We hope the crawler will be able to report when new publication files are posted on the website. Our main focus of interest is on their new publications at http://www.ppic.org/main/newpubs.asp."
Site copyright statement: "All Contents © Public Policy Institute of California 2003, 2004, 2005"
Crawl Results

Comments from crawl operator: We can't at the moment use Heritrix to report on new publications posted.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 238 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:
[urls] [bytes] [host]
2421 324309107 www.ppic.org
433 1367362 www.cacities.org
238 19286 dns
229 4675065 www.icma.org
200 598505 bookstore.icma.org
151 1437436 www.greatvalley.org
144 517953 www.kff.org
137 5304390 www.rff.org
113 510174 www-hoover.stanford.edu
102 1642991 www.knowledgeplex.org
97 101335 cdn.mapquest.com
81 379020 www.cde.ca.gov
73 184118 www.ilsg.org
68 4539957 caag.state.ca.us
62 246921 www.milkeninstitute.org
[list truncated…]

Curator Feedback to CDL (Heckart, PPIC)

Crawl Success: mostly effective
Crawl Success Comments: There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on Date retrieves a message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Crawl Frequency: weekly
Questions/Comments about crawl: Our crawl report included the following comment from the crawl operator: "We can't at the moment use Heritrix to report on new publications posted." The ability to report on new publications is critical to our goal of using the crawler as a discovery tool. What are the prospects for providing this functionality in the future?

Terry Huwe: AFL-CIO
CDL Report to Curator

URL: http://www.aflcio.org
Curator's original comments: "This site is content rich and has many files that will be useful in the future. Specific areas that are of special interest follow below. http://www.aflcio.org/corporatewatch: the data related to executive pay watch is especially useful. http://www.aflcio.org/mediacenter: would like to see press stories captured if possible. http://www.aflcio.org/issues: links to newsletters and original content. Also, the "Legislative Action Center" on the home page; this is a useful topic guide to legislative history from a labor perspective."
Site copyright statement: "Copyright © 2005 AFL-CIO"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 2572 additional hosts were crawled. The following hosts supplied more than 75 URLs to your site:
[urls] [bytes] [host]
12702 481956063 www.aflcio.org
2657 184477 dns
1375 35611678 www.local237teamsters.com
570 8144650 www.illinois.gov
502 52847039 www.ilo.org
435 3851046 www.cioslorit.org
427 2782314 www.nola.com
401 8414837 www1.paperthin.com
392 15725244 www.statehealthfacts.kff.org
326 4600633 www.dol.gov
288 12303728 searchoxide.com
284 3401275 www.sikids.com
280 3069385 www.washingtonpost.com
272 1480539 www.cdc.gov
235 5455692 www.kff.org
[list truncated…]

Curator Feedback to CDL (Huwe, AFL-CIO)

Crawl Success: effective
Crawl Success Comments: I realize the collection interface is a work in progress and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g. Working Families Toolkit, BushWatch) that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2-million-plus-hits-per-year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses of the toolkit. Having said all that, my short answer is: I'm stoked :-)
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In the case of this collection, there's a lot of original content (and context-making documents) on the linked sites, simply due to the hierarchical/federal nature of the labor movement.
Crawl Frequency: monthly
Questions/Comments about crawl: I may have missed correspondence on this question, as I've been busy with a library renovation here. My question: Is the 1 gig limit a useful one? I couldn't help wondering what the actual figure would be if that limit were set higher. I'm generally interested in the total size of Web collections, as I need to monitor ours (which is content-rich but even so does not exceed 1 gig). So discussion of this at least would be interesting, and maybe, if it's not technically challenging, that 1 gig level might be raised and we could see what happens. Thanks to all involved--very interesting process.

Kris Kasianovitz: Los Angeles Dept. of City Planning
CDL Report to Curator

URL: http://cityplanning.lacity.org
Curator's original comments: "Website uses frames. Most of the documents will be pdfs. Of particular interest - the EIRs, which are not archived on the page once the project is approved. See http://cityplanning.lacity.org/EIR/TOC_EIR.htm - General and Community Plans: http://cityplanning.lacity.org/complan/gen_plan/genplan2.htm, http://cityplanning.lacity.org/ComPlan/cpbpage.htm"
Site copyright statement: No copyright information found
Crawl Results

Comments from crawl operator: (Linked hosts included crawl) ended because it ground on for 3 days without hitting the data limit; not sure if the URLs at the end of the log are valid/useful.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 119 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
10493 840876945 cityplanning.lacity.org
601 5156252 metrolinktrains.com
183 644377 www.cr.nps.gov
121 11162 dns
90 977850 www.metrolinktrains.com
81 1207859 www.fta.dot.gov
79 263432 www.fypower.org
66 333540 www.adobe.com
64 344638 lacity.org
63 133340 ceres.ca.gov
60 274940 www.amtrak.com
59 389217 www.nhtsa.dot.gov
58 347752 www.unitedweride.gov
52 209082 www.dot.gov
52 288783 www.nationaltrust.org
51 278949 www.portoflosangeles.org
[list truncated…]

Curator Feedback to CDL (Kasianovitz, LA City Planning)

Crawl Success: mostly effective
Crawl Success Comments: The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all pdfs. When searching specifically for EIRs, I got a large result list (699 citations); however, when I investigated whether or not the actual file was captured, I found that the main EIR page was captured (typically a .htm/.html file), but when I clicked on a link to get to the full report, all I got was the "Sorry, no documents with the given URI were found" message. This could be that the file was no longer available when the site was harvested; however, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved. Here are a few specific examples:
Final EIR directory: http://cityplanning.lacity.org/EIR/Tocfeir.htm
Sierra Canyon Secondary School (cover page): http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/SierraCyn_coverpg.htm. Access to the Draft EIR and Final EIR is provided from this cover page. Within the system, the links to both the Draft and Final are broken (no documents with that given URI): http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/DEIR/Table of Contents.htm
Villa Marina EIR: http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
Directory of EIR notices of preparation: http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM
This provides links to EIR preparation notices (all pdfs), a total of 27 links/documents. I encountered the following three issues:

• pdf opened = when clicking on the link to the notice, the pdf opened with no problem: 16 of 27.
• "Sorry, no document with the given uri was found" = no pdf harvested, but I could get to it from the live site: 4 of 27.
• Acrobat "could not open" message (could open live page outside of WERA) = the following Acrobat message came up when I tried to open the pdf: "Acrobat could not open 'ENV-2005-0881-EIR[1].pdf' because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded). To create an Adobe Acrobat PDF document, go to the source application. Then print the document to Adobe PDF." I copied and pasted the URL into a regular browser and could open the pdf with no problem. Also, in a few cases, if I clicked on the GO button after the first attempt to open the pdf in the system, it seemed to launch: 7 of 27.

Conversely, I found a number of pages that contained full documents in html, with links to pdfs that worked with no problem. See the following document: http://cityplanning.lacity.org/cwdgnlpln/transelt/TET2Bkgrnd.htm. File types and error codes were what I expected.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The "restricted to original host" crawl is more helpful/relevant for getting to the materials from a specific agency. For some of the results that were restricted to host, I was getting external links. Here are some comparisons for each of the crawl settings.
Searched for "villa marina":
LA Dept of City Planning: 6 results
http://cityplanning.lacity.org/EIR/NOPs/ENV-2004-3812-EIR.pdf
http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM
http://cityplanning.lacity.org/EIR/Tocfeir.htm
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/CwdGnlPln/HsgElt/HETblFigApVHgSit.htm
LA City Dept of Planning (via): 2 results
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/CwdGnlPln/HsgElt/HETblFigApVHgSit.htm
Searched for "eir":
LA Dept of City Planning: 699 results
LA City Dept of Planning (via): 324 results
For both of these searches the URIs were from cityplanning.lacity.org.
Searched for "transportation":
LA Dept of City Planning: 699 results
LA City Dept of Planning (via): 290 results (most are from external sources and tended to be the index or main page of another agency or organization). Because this just got me to the main page, and none of the links functioned at that level, the via result was less helpful. However, the via results are useful for discovering other agencies or organizations that I should be looking at for materials.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): The list of linked hosts provided at the end of the report is helpful. This will help me target other agencies that might be key places to check for further collection development.
Crawl Frequency: monthly
Questions/Comments about crawl: I want to qualify the frequency for this site. I'd like to do a monthly crawl for 3-4 months; I'd then want to reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot of new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation. After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report there are 8899 documents in this crawl alone; my other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls. I should note that at this point in time, for local documents, I am more interested in individual documents than in capturing an entire website and preserving the functionality. Local agencies (with the exception of perhaps the Mayor's site) tend not to change the design (look and feel) very often or have a very sophisticated design (i.e. flash, changing images, etc.), and I don't see that this would be of interest to researchers' needs; it is the content, reports, maps, etc. that are contained/accessed on the websites that are important. Maybe I'm wrong or being short-sighted about that.


Kris Kasianovitz: Southern California Association of Governments
CDL Report to Curator

Curator's original comments: "This is a critical regional agency for Los Angeles, Orange, Ventura, Imperial, Riverside and San Bernadino counties. Its main areas are Transportation, Housing, Economic Development. This will provide an analysis of the overall site, which has a lot of content. Publications/reports are typically in pdf – they are presented as full reports and pieces of the report (for easier downloading), so there might be duplication. The full report is really all that would be needed. See http://www.scag.ca.gov/publications. The Resources page contains the pdfs, images, dynamic content, gis programs, including an interactive atlas: http://www.scag.ca.gov/resources.htm. One part of the Resource site is the Web Accessible Geographic Data Search (WAGS): http://mapsvr.scag.ca.gov/wags/index.cfm?fuseaction= It requires a user-created login and password (although there is a guest login that allows you to bypass this). I'm not sure what kind of difficulty the harvester will encounter with this portion of the site. The interactive atlas also has a create-an-account/guest login issue: http://mapsvr.scag.ca.gov/atlas/presmap.asp?Cmd=INIT. Since it is a dynamic page, I don't know how this will be handled by the harvester."
Site copyright statement: "© 1999-2005 Southern California Association of Governments"
Crawl Results


Comments from crawl operator: Interesting login problem; Heritrix was unable to retrieve the guest login pages. Cold Fusion and asp pages don't generate new URLs and thus don't get crawled. NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than 1 minute.
Robots.txt file: The site you selected forbids crawlers from gathering certain data. It reads:

User-agent: *
Disallow: /_mm
Disallow: /_notes
Disallow: /_baks
Disallow: /MMWIP
User-agent: googlebot
Disallow: /csi

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 500 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
2517 863231651 www.scag.ca.gov
690 6134101 www.metrolinktrains.com
506 40063 dns
428 1084533 www.cacities.org
397 16161513 www.sce.com
196 581022 bookstore.icma.org
187 4505985 www.icma.org
175 7757737 www.ci.seal-beach.ca.us
158 1504151 www.h2ouse.org
149 940692 www.healthebay.org
137 317748 www.ci.pico-rivera.ca.us
130 18259431 www.ci.ventura.ca.us
123 490154 www.chinohills.org
121 406068 www.lakewoodcity.org
119 203542 www.lavote.net
117 2449995 www.ci.malibu.ca.us
114 744410 www.ci.irvine.ca.us
113 368023 www.whitehouse.gov
109 974674 www.dot.ca.gov
107 892192 www.lacanadaflintridge.com
[list truncated…]

Curator Feedback to CDL (Kasianovitz, SCAG)

Crawl Success: mostly effective
Crawl Success Comments: Similar to my comments about the Los Angeles Dept. of City Planning - the crawl brought back a lot of webpages but not the publications/documents that I would want to collect. Again, the same problem happened with SCAG as happened with the EIR example: the crawl brought back html pages with links to reports (typically in pdf format) - but the actual documents were not captured. While the webpage is helpful, as it gives context, the main content that I'd want to capture (the reports) was not captured. See the following for example; none of the webpages linked from this page are available (they should link to a page that will have the material). I tried searching for the documents separately and couldn't get to them. See http://www.scag.ca.gov/publications/index.htm (the timeline arrows at the top seemed to function; I'm not sure what this is for) and http://www.scag.ca.gov/livable/pubs.htm. I was impressed to find that zip files were captured and I was able to download them. Unfortunately, when I opened them there wasn't any content (I did the same search by mistake with the Arizona Dept. of Water Resources and actually found content in the folders); I found 10 with the search "type:zip". The gif or jpg images retrieved are not useful - most were just bars or bullets or covers of reports (although this might be helpful to identify titles, I think I would end up discarding these -- after doing more checking of the results).
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments from LA Dept. of City Planning. Restricted gets me to the relevant materials for that agency; via brings back too many main webpages for other agencies to be useful.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): I expected that there would be a problem with the content behind logins. The crawl confirmed that material behind login screens couldn't be captured. Can I get a copy of these crawl results?
NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than one minute.
Crawl Frequency: monthly
Questions/Comments about crawl: How to handle the copyright issue? For the login information, I'm not sure what all was blocked by the robots file. Interesting that metrolinktrains.com is the #1 related host for both of my crawled sites.


Linda Kennedy: California Bay Delta Authority
CDL Report to Curator

URL: http://calwater.ca.gov
Curator's original comments: "We are interested in the environmental impact statements and other key documents, and the various news releases and other announcements and archives of CALFED."
Site copyright statement: "© 2001 CALFED Bay-Delta Program"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 519 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
1130 473192247 calwater.ca.gov
741 201538533 www.parks.ca.gov
521 40442 dns
373 51291934 solicitation.calwater.ca.gov
242 78913513 www.calwater.ca.gov
225 410972 cwea.org
209 87556344 www.science.calwater.ca.gov
173 109807146 science.calwater.ca.gov
172 1160607 www.adobe.com
129 517834 www.whitehouse.gov
[list truncated…]

Curator Feedback to CDL (Kennedy, CALFED)


Crawl Success: mostly effective
Crawl Success Comments: I looked closely at the CALFED home page (http://calwater.ca.gov), Tribal home page (http://calwater.ca.gov/Tribal/Tribal_Home.shtml), Key Documents (http://calwater.ca.gov/CALFEDDocuments/CALFEDDocuments.shtml), and Archives page (http://calwater.ca.gov/Archives/Archives.shtml). The crawl did not complete in either the via or non-via search. Nearly all linked pages were retrieved in the non-via search; however, the retrievals from the via search were much less complete than the retrievals from the non-via search. For example, on the Key Documents page there were 3 missing links from the non-via search but 14 missing links from the via search. When Adobe documents were retrieved from either crawl, they came up correctly. Three asp links of tribal maps from the Tribal home page were retrieved by the non-via search but not the via search. A few of the images were missing from the displays, and this was also affected by the browser used. Usually the same image was missing from both crawls, but sometimes the images were more complete in the non-via crawl retrievals. There were some display problems with the right-hand menu boxes on the http://calwater.ca.gov/Tribal/Tribal_Home.shtml page, for example, that did not display correctly when viewed via Firefox. The same page viewed correctly on Internet Explorer, but when printed out, the boxes printed incorrectly, just as viewed on the Firefox browser. Grant Opportunities: http://calwater.ca.gov/GrantOpportunities/GrantInformation.shtml -- this link did not work in 2 via and 2 non-via instances (from the Tribal home page and from the Archives page) but did work on one non-via crawl page (the CALFED home page). It could also be searched and retrieved directly from the test crawl search page.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments above. The non-via search was substantially more complete.
Crawl Frequency: monthly

Janet Martorana: Santa Barbara County Department of Planning and Development
CDL Report to Curator

URL: http://www.countyofsb.org/plandev/default.htm
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 487 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
3119 1102414495 www.countyofsb.org
485 34416 dns
428 1083047 www.cacities.org
357 6126453 www.sbcphd.org
320 6203035 icma.org
250 438507 www.sbcourts.org
234 1110744 vortex.accuweather.com
200 593112 bookstore.icma.org
[list truncated…]

Curator Feedback to CDL (Martorana, SBCD)

Crawl Success: mostly effective
Crawl Success Comments: Most documents I expected to find were captured, but a number were not. For example, off this page, http://www.countyofsb.org/plandev/comp/threeyear/2005-2008/default.html, I expected to get to the final work program, http://www.countyofsb.org/plandev/pdf/comp/programs/Three_Year_WP2005-2008_3YrWrkProgram.pdf, but got the "Sorry, no documents with the given uri were found" message. Other examples within the Energy Division, a part of the Planning & Development Dept.: off this page, http://www.countyofsb.org/energy/information.asp, I could access all links except for two: http://www.countyofsb.org/energy/information/oil&GasFields.asp (Oil and Gas Fields) and http://www.countyofsb.org/energy/information/oil&GasProduction.asp (Oil and Gas Production). The crawler seemed to cut off the URL right before the ampersand; perhaps it has problems with ampersands.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly
Questions/Comments about crawl: I'd like to be notified when I've navigated away from the captured site. I found myself on real-time web sites but couldn't tell what was on the captured site and what was real. Example: when I did a search on "santa barbara" I got 1528 results; using the via results, I went to the County of Santa Barbara Online site, and then, after clicking on links, I was on the web (real time) and not in the crawl results database any longer, yet there were no indications that I had left the crawled database. The WERA URI was still displaying at the top of the screen. I couldn't tell which were the captured sites and which were the current real-time sites. Other observations: the webpage navigation doesn't work, e.g. Table of Contents doesn't jump to that section on the webpage (www.countyofsb.org/energy/projects/shell.asp and www.countyofsb.org/energy/mitigation/oakProject.asp); links to glossary terms go to the glossary but not to the term itself.


Lucia Orlando: Monterey Bay National Marine Sanctuary
CDL Report to Curator

URL: http://montereybay.noaa.gov
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 795 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
5272 468755541 montereybay.noaa.gov
861 61141 dns
554 20831035 www.wunderground.com
368 4718168 montereybay.nos.noaa.gov
282 3682907 www.oceanfutures.org
273 10146417 www.mbnms-simon.org
260 7159780 www.mbayaq.org
163 61399 bc.us.yahoo.com
152 1273085 www.mbari.org
146 710203 www.monterey.com
119 3474881 www.rsis.com
119 279531 www.steinbeck.org
118 1092484 bonita.mbnms.nos.noaa.gov
109 924184 www.duke.edu
104 336986 www.montereybayaquarium.org
103 595953 icons.wunderground.com
102 339589 www.uncw.edu
[list truncated…]

Curator Feedback to CDL (Orlando – Monterey Bay)

Crawl Success: (rating not provided)
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: This site contains a large mix of government policy, laws and regulatory information, as well as links to recreational and educational activities available in the MBNMS. I was most interested in links to laws/regs/policy and educational info and organizations. I thought the links restricted to the original host best captured this information succinctly.
Crawl Frequency: unknown


Richard Pearce-Moses: Arizona Department of Water Resources
CDL Report to Curator

URL: http://www.azwater.gov
Curator's original comments: (redirects to http://www.azwater.gov/dwr) "In arid Arizona, water is one of the most important – and most contested – resources. The publications and records of this Department are of critical value to the state. Our spider can get many files from this site (1474 files in 258 directories). We are mostly interested in documents by programs. Although our spider can't get the imaged documents database (http://www.azwater.gov/dwr/Content/ImagedRecords/default.htm), this directory may not be critical if we can get the imaged documents transferred to us. We are not interested in blank forms and applications."
Site copyright statement: "Copyright © 1998 - 2005 Arizona Department of Water Resources and ADWR Network. All Rights Reserved."
Crawl Results

Questions for curator: Did this capture the documents you needed?

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 195 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
2233 988447782 www.azwater.gov
286 2350888 www.water.az.gov
253 4587125 www.groundwater.org
226 3093331 www.azcentral.com
196 15626 dns
178 395216 www.macromedia.com
128 1679057 www.prescott.edu
123 947183 www.azleg.state.az.us
115 792968 www.usda.gov
[List truncated…]

Curator Feedback to CDL (Pearce-Moses AZWater)

Crawl Success: (not provided)
Crawl Success Comments: We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand, wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can. That may be the difference. The crawl is listed as not completing, but it appears to be very close based on the total number of bytes downloaded.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Questions/Comments about crawl: It seemed odd that it took 2h 54m to crawl 7709 documents with linked hosts, but 4h 4m to crawl only 4888 docs when the spider was restricted to the original host.

Richard Pearce-Moses: Citizens Clean Election Commission
CDL Report to Curator

URL: http://www.ccec.state.az.us/ccec/scr/home.asp
Curator's original comments: This commission was established by initiative. Its work is of great historical significance, as it is changing the way the public elects officials. We have not been able to spider this site because links are buried in JavaScript. (We use wget as our spider.) We are primarily interested in acquiring their publications, election data, and things listed under "popular links."
Site copyright statement: "Copyright 2004 Arizona Citizens Clean Elections Commission. All Rights Reserved."
Crawl Results:

Comments from crawl operator: JavaScript issue; interesting problem; need curator feedback about what we captured.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, the following additional sites were crawled. Total: 15

[urls] [bytes] [host]
929 95456563 www.ccec.state.az.us
76 6117977 www.azcleanelections.gov
55 513218 az.gov
49 499337 www.governor.state.az.us
44 174903 www.adobe.com
40 141202 www.azleg.state.az.us
31 18549 www.az.gov
28 202755 www.azsos.gov
23 462603 gita.state.az.us
19 213976 www.benefitoptions.az.gov
17 89612 www.azredistricting.org
14 1385 dns
3 1687 www.images.adobe.com
2 1850 www.capitolrideshare.com
2 26438 www.ftc.gov

Curator Feedback to CDL (Pearce-Moses CCEC)

Crawl Success: (not provided)
Crawl Success Comments: We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with wget.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Juri Stratford: City of Davis
CDL Report to Curator

URL: http://www.city.davis.ca.us
Curator's original comments: "We are primarily interested in the GIS data produced for the City of Davis. As the GIS data represent the largest part of the City of Davis web site, it may not be much more difficult to archive the site at the top level. Mapping and Geographic Information Systems URL: http://www.city.davis.ca.us/gis"
Site copyright statement: "This web site is Copyright © 2004 by the City of Davis. All Rights Reserved. The City retains the copyright on all text, graphic images and other content of this site. You may not copy, modify and/or re-use text, images or other web content from this web site, distribute the City's web content, mirror content from this web site on a non-City server, or make any other use of the content of this web site that would violate the City's copyright without written permission from the City of Davis. To the extent allowed by law, commercial use of our web material is prohibited without written permission from the City of Davis. All art work shown on these web pages is protected by US Copyright laws. Limited reproduction for non-commercial purposes can be authorized by the City of Davis provided that requests are approved prior to use. Contact the Community Development Department, Cultural Services program staff at (530) 757-5610 for more information. Some content included in this web site may be provided courtesy of third parties who may retain copyright control of the provided material. Any service marks and trademarks contained herein are the property of their respective owners."
Crawl Results:

Comments from crawl operator: "GIS: Potential issue: /img disallowed by robots.txt, e.g. http://www.city.davis.ca.us/img/featured/map-static.jpg can't be retrieved; also some maps on a second server disallowed. Need feedback about GIS material that was captured; what was captured that is useful? Much duplication -- pages captured repeatedly."
Robots.txt: The site you selected prohibits crawlers from collecting certain documents. The file reads:

User-agent: *
Disallow: /img
Disallow: /calendar
Disallow: /misc/email.cfm
Disallow: /edbusiness
Disallow: /gis/oldmap
Disallow: /police/log
Disallow: /pcs/grants/sacog
Disallow: /jobs/listings
Disallow: /css
Disallow: /pcs/nutcracker/history.cfm
Disallow: /pcs/nutcracker/pdfs

User-agent: asterias
Disallow: /

User-agent: gigabot
Disallow: /
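A quick way to see how rules like these affect specific documents is Python's built-in robots.txt parser. The sketch below is illustrative only: it re-keys a few of the reconstructed City of Davis rules (the exact path spellings are assumptions) and tests some of the URLs discussed in this report.

    from urllib.robotparser import RobotFileParser

    # A few of the Disallow rules quoted above, re-keyed by hand
    # (an approximation of the reconstructed robots.txt).
    rules = [
        "User-agent: *",
        "Disallow: /img",
        "Disallow: /calendar",
        "Disallow: /police/log",
    ]

    parser = RobotFileParser()
    parser.parse(rules)

    base = "http://www.city.davis.ca.us"
    for path in ("/img/featured/map-static.jpg", "/gis/library", "/police/log"):
        # Prints False for the blocked image and police log, True for the GIS library.
        print(path, parser.can_fetch("*", base + path))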

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 420 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
16455 947871325 www.city.davis.ca.us
420 29555 dns
332 10377948 www.asucd.ucdavis.edu
305 33270715 selectree.calpoly.edu
279 3815103 www.w3.org
161 2027740 www.cr.nps.gov
139 941939 www.comcast.com
133 951815 www.yolocounty.org
[List truncated…]

Curator Feedback to CDL (Stratford Davis)

Crawl Success: mostly effective
Crawl Success Comments: Looking at the GIS Online Maps page, it's not clear which formats were retrieved and which were not. For example, the Growth Map Flash file downloads fine, but the Flash/ArcIMS files do not download.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In general, restricted to original host works better. The broader search includes pages from businesses, e.g. Comcast, and other local agencies, e.g. other local and state government sites. But restricting the outside sites to the first level seems to be a good compromise.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report; this is the curator's answer to the question): In general, it looked like it did a good job pulling down geographic data/images. For example, you can pull down data from http://www.city.davis.ca.us/gis/library. It's difficult for me to get a sense of the level of duplication from the way the search results display.
Crawl Frequency: monthly
Questions/Comments about crawl: I like the ability to navigate within the archive once you have a starting point. This is very nice.

Yvonne Wilson: Orange County Sanitation District
CDL Report to Curator

URL: http://www.ocsd.com
Curator's original comments: "This is an Orange County agency that has small reports and complex planning documents on its web site."
Site copyright statement: "Copyright © 2001-2005 Orange County Sanitation District. Unless a copyright is indicated, the information on this site is freely available for non-commercial, non-profit making use. If a copyright is indicated on any materials displayed on our website, permission to copy these materials must be obtained from the original source. Commercial use of District materials is expressly prohibited without the written permission of the OCSD."
Crawl Results:

Comments from crawl operator: In both crawl settings we had to limit the maximum number of retry attempts in order to complete the crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 85 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
755 85943567 www.ocsd.com
164 7635257 www.ci.seal-beach.ca.us
122 809190 www.ci.irvine.ca.us
95 169207 epa.gov
86 7673 dns
85 559125 order.e-arc.com
66 840581 www.ci.huntington-beach.ca.us
62 213476 www.cityoforange.org
57 313579 www.epa.gov
55 4477820 www.villapark.org
50 1843748 www.cityoflapalma.org
50 463285 www.ocbinc.com
[List truncated…]

Curator Feedback to CDL (Wilson OCSD)

Crawl Success: somewhat effective
Crawl Success Comments: Using WERA, I searched inquiries by type and title in the two OCSD collections, plain and via. I received no hits for PDF, only the homepage for HTML, and three hits for text; there are many PDF sections in the EIRs. I next searched by title in the two collections. I was the most successful in via. By searching the titles "carbon canyon" and "Ellis Ave Pumping Station" I found articles, but not the EIRs, which are available full text. At this point I presumed the crawl did not drill down far enough. Then I searched for a secondary page entitled "Ocean Monitoring"; this time the search found only an internal letter and memo, but not all the documents related to this topic. The via collection search seems to be the most productive, but it is not consistent.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: I searched some of the outside links, to US Marine Fisheries and EPA Beach Watch, and received no hits.
Crawl Frequency: monthly

Crawl Report Key: Web-at-Risk Test Crawls

This document is a guide to the test crawl report, providing further information about some of the statistics and results conveyed there. Your report includes basic information about the site you submitted, your original comments about that site, and any copyright statements CDL found when we examined the site. Your site was crawled using the Heritrix crawler, version 1.5.1. Your report will include the following information about the site:

Crawl Settings: We crawled each site in two different ways: (A) linked hosts included, and (B) restricted to original host. Where linked hosts were included, we set the crawler to gather any outside page that your nominated site linked to, but no further. So if your site linked to a single document from whitehouse.gov, we captured that document but did not crawl any further on the whitehouse.gov site. Your report will show results from both styles of crawling to give you a sense of whether or not the site draws heavily from valuable materials on another site. You will also receive a list of the other hosts that site linked to and how many documents were gathered from those hosts.

Robots.txt file: The presence of a robots.txt file means that the content provider is asking us to refrain from crawling either all or part of the site. This refers only to the host you named in your crawl request. In some cases the site had a robots.txt file but it didn't say anything; we noted when this occurred. We obeyed robots.txt instructions for these crawls, so if the site contained one but we still got a result, that means the robots.txt file only prevented us from crawling certain areas of the site. When robots.txt files were present, we have included the text of that file in your report so you can see which segments of the site the site owner wants to protect.

Crawl duration

Total number of documents: The "Documents" count will include page components (such as images or Flash files).

File types (MIME types): This area will contain a URL. When you go to that URL you will see a list of the different file types that were retrieved as part of the crawl. IMPORTANT: The Heritrix crawler is currently experiencing difficulty with this report, and it is missing a crucial column. Until that is fixed, you can see the different file types retrieved, from most common to least, but you cannot yet tell how many files each one included. CDL will contact you when this report has been fixed.

Response code reports: The URL in this column will lead to a list of response codes in order by frequency. This will include "200" for files that were successfully captured and error codes for files that were not captured. The error code list includes some codes specific to Heritrix; the key to interpreting these codes is at http://crawler.archive.org/articles/user_manual.html#statuscodes. Note that this report only gives you quantitative information about response codes; it does not link response codes to specific files. For these details see "hosts report and crawl log" below.

How much data collected (bytes): The file size of the total crawl is reported in bytes. You can use the byte conversion tool at http://www.techtutorials.net/reference/byteconverters.html if you want to recalculate the size of the crawl in another measurement, such as kilobytes or megabytes.

Did crawl complete?: This will say "no" if the crawl results exceeded 1 gigabyte or if the crawler encountered an obstacle to capturing the site that could not be fixed.

Location of hosts report and crawl log: You are welcome to review the page-by-page details of the crawl log. This is a generic report that comes with the Heritrix crawler and is not terribly user friendly, but it provides the most detail about the crawl process. This report will list every file that the crawler attempted to get and provide some information about each file. The Heritrix manual can help you interpret this report: http://crawler.archive.org/articles/user_manual.html#logs (go to section 8.2.1, Crawl Log).

Comments from Crawl Operator: These are observations that the Web Archive Programmer made about the crawl process for your site.

Questions for Curator: This section does not appear in every report. If you have a question listed here, please respond to it in the Test Crawl Feedback form.

Your Collection: Important: The screens you will see do not represent the final user interface for the Web Archiving Service tools. We are using WERA, an open-source search and display tool, only to show you your test crawl results. Because we did not seek the right to redistribute these documents, these pages are available only for the purpose of analyzing crawler effectiveness. You must have a password to view these pages. Your report will include the address of a wiki page and a login and password. Each site was crawled twice: plain crawl = only pages from the original site were collected; via = pages from the original site, as well as pages that site links to, were collected. Unfortunately you cannot simply browse your site; you must select a collection and type a search. You will be able to navigate throughout your site once you load a page containing links. You will be able to review your colleagues' sites as well. Note that the WERA display tool is not perfect; if the same document was gathered from more than one crawl, it may not display in every collection.

Related Hosts Crawled: This section provides further information about the additional materials that were gathered when we set the crawler to include documents that your site links to. This can be critical in deciding what settings are needed to capture your site. Some sites, for instance, will keep all of their PDF or image files on a separate server. If you don't allow the crawler to move away from the original URI, you won't capture a critical portion of the site's content. In other cases, however, this setting will lead to irrelevant information. This report includes the most commonly linked hosts from your site.
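Since the file-types report is currently missing a column, one workaround is to tally response codes and MIME types directly from the crawl log. The sketch below is a rough illustration, not part of the CDL toolset; it assumes the whitespace-delimited crawl.log layout described in the Heritrix manual (status code in the second field, MIME type in the seventh), and the file name is hypothetical.

    import collections

    def tally_crawl_log(path):
        """Count fetch status codes and MIME types in a Heritrix crawl.log."""
        statuses = collections.Counter()
        mime_types = collections.Counter()
        with open(path, encoding="utf-8", errors="replace") as log:
            for line in log:
                fields = line.split()
                if len(fields) < 7:
                    continue  # skip blank or truncated lines
                statuses[fields[1]] += 1    # e.g. "200", "404", or a Heritrix-specific code
                mime_types[fields[6]] += 1  # e.g. "text/html", "application/pdf"
        return statuses, mime_types

    statuses, mime_types = tally_crawl_log("crawl.log")
    for code, count in statuses.most_common():
        print(code, count)
    for mime, count in mime_types.most_common():
        print(mime, count)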


admissions. In the case of the Arizona Department of Water Resources, the distinction between host names appears to be accidental, perhaps the result of an attempt to transition to a simpler, more memorable URI. Most pages from this site come from www.azwater.gov, but hundreds of internal links, including critical style sheet files, are still hard-coded to point to www.water.az.gov. Finally, when the site is restricted to the original host, the end user is much more likely to encounter errors when viewing the archived results. When the end user selects a link that was not captured, WERA provides a "Sorry, this URI was not found" message. When the linked hosts are included, the end user browsing the site sees the site closer to its original context and with fewer error messages. Conversely, when end users encounter frequent error screens, they may develop both frustration and a sense of mistrust in the quality of the archive. It is worth noting that the curators are not likely to browse these results in the same way that an end user of their archives might. The curators know what these sites contain, choose them accordingly, and may be less inclined to click on links that would result in a "Sorry, this URI was not found" message. Ultimately, the value of a site's external links would seem likely to vary depending on the nature of the site. Sites with rich internal content and only "frivolous" external links would be best captured with the "original host only" setting. Before we ran these crawls, we asked curators to specify what they hoped a crawl would capture. Many referenced specific pages or directories they hoped to capture, and of those, three specified URIs that were not from the original host. When reviewing the results, two out of those three still preferred the original host crawl, even though that crawl did not capture the materials they specifically hoped to capture. Why was the feedback so consistent on this point? A look at the WERA interface used to display crawl results may provide an answer. WERA does not offer an immediate means of browsing a site; you have to search by keyword to find your way "into" the captured site. Once you have a page displayed you can browse within the site, but you must begin by searching for the right starting point. As this image shows, you select the site you want to search from a drop-down menu, then enter terms to search against.

Because all pages from the more comprehensive crawls are indexed, the search results include pages from all of those other hosts. This sets up a bit of cognitive dissonance: the user specified a search against a particular site, yet results from other hosts vastly outnumber pages from that site.

Underlying the issue of crawl scope is the deeper question of what an archivist hopes to capture when a site is crawled. Is it just a list of particular documents? Or is it a faithful recreation of the site as it existed on that day? It may be that a captured site has content of primary and secondary importance. The primary content is what should be retrieved when searching against the archive, while the secondary content should only be present to avoid error messages and establish the site's full original context. Another approach is suggested by this curator's response:

The results on the multiple hosts crawl is mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains rather than "source + 1."

This suggests the ability to link seed URIs together as related components of a single site.
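A minimal sketch of the kind of multi-host scope rule this curator describes: rather than "source + 1," the crawl is limited to an explicit set of hosts that together make up one logical site. The host names reuse the azwater.gov example discussed above; the function itself is a hypothetical illustration, not an existing Heritrix feature.

    from urllib.parse import urlsplit

    # Hosts the curator considers parts of a single logical site.
    ALLOWED_HOSTS = {"www.azwater.gov", "www.water.az.gov"}

    def in_scope(uri):
        """Return True if a discovered URI belongs to one of the allowed hosts."""
        host = (urlsplit(uri).hostname or "").lower()
        return host in ALLOWED_HOSTS

    print(in_scope("http://www.water.az.gov/dwr/style.css"))     # True: related host kept
    print(in_scope("http://www.azcentral.com/news/index.html"))  # False: linked news site excluded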

Communication / Reports

When we reported the test results back to curators, we provided a synopsis of the crawl results, links to particular Heritrix reports, and to the WERA display interface. The Heritrix reports are all plain text, providing tables of MIME type or response code frequency. Attempting to integrate these reports and the display of the archived results is a challenge. One curator, for example, obtained documents from over 200 hosts in the "linked hosts included" crawl, but was only aware of having found 10 additional documents when reviewing these same search results in WERA. Although WERA is helpful for seeing results from an end user's perspective, it does not provide adequate tools for analysis. In some cases this is simply because WERA is a new and occasionally buggy tool. It is possible, for instance, to follow occasional links out of the archive and into "real-time" sites. In some cases it's also possible to browse to a page and display it, but when you search for that same page by its URI, WERA does not find anything. One curator notes:

We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE; the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.

It is also important to note that although WERA was used for the purpose of reporting test crawl results, it is not envisioned as the final display interface for the Web Archiving Service. Even so, the feedback the curators provide about WERA should inform the functionality of the WAS interface. Clearly it is still quite a struggle for curators to determine exactly what a crawl retrieved. One curator reports:

After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report, there are 8899 documents in this crawl alone. My other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls.

Crawl Frequency

When asked how frequently they wanted to crawl sites, curators responded with a variety of preferences:

[Chart: Desired Crawl Frequency, showing the number of curators requesting Daily, Weekly, Monthly, Once, or Unknown]

Again, it is worth considering precisely what curators hope to capture in a repeated crawl of a site. Some insight is provided by these curators' comments:

We hope the crawler will be able to report when new publication files are posted on the web site

And

The ability to report on new publications is critical to our goal of using the crawler as a discovery tool

As with the other NDIIPP grants, the purpose of the Web Archiving Service tools will be for archiving and preservation, not for resource discovery. This indicates that we should further investigate what a "weekly" or "monthly" crawl really means to curators. If a site was not updated over the course of a year, would the curator want to continue running weekly crawls of the site? Would the curator want each crawl to appear on an archive timeline for that site, even if the content was no different for each date? One curator comments:

I want to qualify the frequency for this site. I'd like to do a monthly crawl for three-four months. I'd want [to] reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot [of] new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation.
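One simple way to judge whether a repeated crawl actually found new or changed content is to compare content digests between two snapshots. The sketch below is a hypothetical illustration that compares two locally stored crawl directories; it is not how WAS or Heritrix reports change, and the directory names are made up.

    import hashlib
    import pathlib

    def digest_tree(root):
        """Map each file (relative path) under root to an MD5 digest of its bytes."""
        root = pathlib.Path(root)
        return {
            str(path.relative_to(root)): hashlib.md5(path.read_bytes()).hexdigest()
            for path in root.rglob("*") if path.is_file()
        }

    def compare_snapshots(old_dir, new_dir):
        """Report which captured files were added, removed, or changed between crawls."""
        old, new = digest_tree(old_dir), digest_tree(new_dir)
        added = new.keys() - old.keys()
        removed = old.keys() - new.keys()
        changed = {p for p in old.keys() & new.keys() if old[p] != new[p]}
        return added, removed, changed

    added, removed, changed = compare_snapshots("crawl-2006-01", "crawl-2006-02")
    print(len(added), "added,", len(removed), "removed,", len(changed), "changed")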


Language and Web Site Models

One of the challenges of communicating and interpreting crawl results is that crawlers don't necessarily work the way people envision them to. Further, the appearance of a web site on a screen and its architecture on a server may be quite different: the directory structure of a site may have no relationship to the way its navigation is organized on a screen. One frequent point of confusion curators encountered while interpreting crawl results is the concept of how many "levels down" the crawler went. One curator requested that we "drill down several levels (at least 3)" in our capture. One challenge with this request is that "levels down" can be interpreted to mean different things. In some cases curators clearly mean the directory structure of the web site on the remote server. In other cases, however, they seem to mean the number of hosts away from the original site:

In general, restricted to original host works better. The broader search includes pages from businesses, e.g. Comcast, and other local agencies, e.g. other local and state government sites. But restricting the outside sites to the first level seems to be a good compromise.

If the curator is referring to a structure of subdirectories, the next challenge is that the Heritrix crawler does not work this way. Heritrix ignores the site's directory structure and instead follows links from the seed URI it is provided. The Heritrix manual specifically defines "depth" in this way:

This scope allows for limiting the depth of a crawl (how many links away Heritrix should crawl) but does not impose any limits on the hosts, domains, or URI paths crawled. [6]
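The two readings of "levels down" can be made concrete. Directory depth counts path segments on the server; Heritrix counts link hops from the seed, which it records in the crawl log as a discovery path of one letter per hop. A small illustrative sketch (the example URL path is hypothetical):

    from urllib.parse import urlsplit

    def directory_depth(uri):
        """'Levels down' read as directory structure: count path segments."""
        return len([seg for seg in urlsplit(uri).path.split("/") if seg])

    def link_hop_depth(discovery_path):
        """'Levels down' as the crawler sees it: one letter per link hop from the seed
        (e.g. 'LLL' for a page reached by following three navigational links)."""
        return len(discovery_path)

    print(directory_depth("http://www.sandiego.gov/planning/reports/example.pdf"))  # 3
    print(link_hop_depth("LLL"))                                                    # 3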

Multimedia

Some curators chose sites because of the value of their multimedia resources. The crawler yielded mixed results in capturing these resources. For one site:

• A text search on the log file turns up numerous .ram files, only one .ppt file.

• .asx files are Windows streaming media redirector files, which generally lead to associated .wmf files. No .asx references appear in the crawl log, nor do any .wmf files.

• .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files.

As noted, nearly half the sites crawled reached the 1 gigabyte size limit and so did not complete. This makes it difficult to determine whether there were genuine problems with particular types of files, or if the crawler simply did not get to the missing files before the crawl was stopped.

[6] Heritrix User Manual, Section 6.1.1, Crawl Scope: Broad Scope <http://crawler.archive.org/articles/user_manual.html>
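The text search of the crawl log described above can be automated. This sketch counts log entries whose URIs end in the streaming-media extensions mentioned in the bullets; it assumes the URI is the fourth whitespace-delimited column of a Heritrix crawl.log, and the file name is hypothetical.

    import collections

    MEDIA_EXTENSIONS = (".ram", ".asx", ".wmf", ".smil", ".rm", ".ppt")

    def count_media_uris(log_path):
        """Tally crawl-log entries whose URIs end in a media extension of interest."""
        counts = collections.Counter()
        with open(log_path, encoding="utf-8", errors="replace") as log:
            for line in log:
                fields = line.split()
                if len(fields) < 4:
                    continue
                uri = fields[3].lower()
                for ext in MEDIA_EXTENSIONS:
                    if uri.endswith(ext):
                        counts[ext] += 1
        return counts

    for ext, count in count_media_uris("crawl.log").items():
        print(ext, count)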

Comparison with Other Crawlers

Three of our curators had previously used other crawlers, such as HTTrack and Wget, to capture the same sites we attempted with Heritrix. Because these curators provided strong details when describing their sites, we crawled all of the sites they sent us (six sites). In one case, the "Defense Base Closure and Realignment Commission," the curator had greater success capturing .aspx files with HTTrack than we had with Heritrix. Other comparisons:

We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using Wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand, Wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can. That may be the difference.

And

We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with Wget.

Crawl Success

We asked curators to rate the overall success of the test crawls on the following scale:

• Not effective (none of the desired documents were captured)
• Somewhat effective (some of the desired documents were captured)
• Mostly effective (most of the desired documents were captured)
• Effective (all of the desired documents were captured)

[Chart: Crawl Success, showing the number of curators rating their crawl Effective, Mostly Effective, Somewhat Effective, or Not Effective]

Here are some of the comments curators had concerning the overall success of these tests. The full text of all curator reports and their feedback to CDL is available in Appendix C.

Los Angeles Planning Department

The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all PDFs. When searching specifically for EIRs I got a large result list (699 citations); however [sic] when I investigated whether or not the actual file was captured, I found that the main EIR page was captured, typically a htm/html file, but when I clicked on a link to get to the full report all I got was the "Sorry, no Documents w/the given URI were found" message. This could be that the file was no longer available when the site was harvested. However, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved.

City of San Diego Planning Department (This comment is from a curator who is filling in for the person who originated the test crawl request, so she is also attempting to interpret what that other person was hoping to capture.)

Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example, http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit.

Defense Base Closure and Realignment Commission

I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.

Public Policy Institute of California

There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on "Date" retrieves a message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."

AFL-CIO

I realize the collection interface is a "work in progress" and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g. "Working Families Toolkit," "BushWatch") that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2 million-plus hits per year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses.

Conclusions

The process of running these test crawls has been valuable. It has resulted in a number of lessons learned and further directions for inquiry. It is clear that the tools that help curators analyze the results of these crawls will have to do more than mimic the original browsing context or supply tables of data; we will be looking for any opportunity to improve WAS reporting capabilities as we move forward with the project. The differences encountered in language and visualization raise the importance of a strong, intuitive design for the curator tools and for clear help screens. Each of us may visualize web sites differently, and the crawler may behave differently than we expect. A certain degree of online help will be needed to design crawls effectively, and further guidance should be available to help people interpret crawl results when those results don't match what the person anticipated.

The help and documentation for the Web Archiving Service will also need to address the rights analysis issues raised above. In most cases this is work that is done prior to issuing crawls; it cannot be addressed by the design of the WAS interface alone. When the Web-at-Risk project reaches the point of conducting usability studies, we should be sure to include tests that further uncover how users understand crawl frequency settings. Additionally, the desire for an analysis tool that can convey when a site has changed significantly is not limited to this project; it is a common issue faced by the Internet Archive, members of the International Internet Preservation Consortium, and others. CDL is actively communicating with these organizations as we all work toward a solution for this problem. Similarly, CDL should ensure that future assessment and analysis work with our curators addresses the issues raised by the two crawl scope settings. This inquiry should also extend to participants who exemplify end users, to determine if people using web archives experience crawl scope differently than people who build them. Finally, a number of lessons were learned via the Katrina crawl, described in Appendix B. The most outstanding finding is that event-based crawls such as Katrina and site-specific crawls such as these have quite different characteristics and require different functionality and analysis tools.

Next Steps

There is certainly more to learn from these crawl results and about Heritrix; in some cases it is still unclear why the crawler failed to retrieve certain documents. The curators' feedback concerning these results has been extremely valuable. They have provided insight as to what was captured and what is still missing, which would have been difficult to determine without their subject expertise in the sites chosen. As we continue working to improve crawler success and performance, we will turn our attention to the Virtual Remote Control site created by Cornell. This site was developed to present particular problems to crawlers and is well documented, enabling the user to gauge a crawler's results. We will be using that site to replicate particular problems raised in our test crawl set and may also reattempt some of the sites crawled for these tests. When we release the first version of the Web Archiving Service to curators in July 2006, we will request that they include their original test site in the crawls they attempt and compare the results with these tests. The test results for these crawls will remain in place as a point of comparison while we continue developing the Web Archiving Service. Additionally, certain issues raised by this feedback, particularly the desired scope and frequency of crawls, merit attention in future usability analysis work.

Web-at-Risk Test Crawl Report: Appendix A: Sites Submitted

Curator | Site Submitted | URL | Crawled?
Sherry DeDekker | California Water Science Center | http://ca.water.usgs.gov | Yes
Sherry DeDekker | California Department of Water Resources | http://www.dwr.water.ca.gov |
Peter Filardo and Michael Nash | New York City Central Labor Council | http://www.nycclc.org | Yes
Peter Filardo and Michael Nash | Democratic Socialists of America | http://www.dsausa.org |
Valerie Glenn and Arelene Weibel | Strengthening Social Security | http://www.strengtheningsocialsecurity.gov | Yes
Valerie Glenn and Arelene Weibel | The Defense Base Closure and Realignment Commission | http://www.brac.gov | Yes
Gabriela Gray | Join Arnold | http://www.joinarnold.com | Yes
Gabriela Gray | Mayor-elect Antonio Villaraigosa | http://www.antonio2005.com | Yes
Ron Heckart and Nick Robinson | Public Policy Institute of California | http://www.ppic.org | Yes
Ron Heckart and Nick Robinson | California Budget Project | http://www.cbp.org |
Terrence Huwe | AFL-CIO | http://www.aflcio.org | Yes
Terrence Huwe | Service Employees International Union | http://www.seiu.org |
James Jacobs | City of San Diego Planning Department (analyzed by Megan Dreger) | http://www.sandiego.gov/planning | Yes
James Jacobs | San Diego Association of Governments | http://www.sandag.org |
Kris Kasianovitz | Los Angeles Department of City Planning | http://cityplanning.lacity.org | Yes
Kris Kasianovitz | Southern California Association of Governments | http://www.scag.ca.gov | Yes
Linda Kennedy | California Bay-Delta Authority (CALFED) | http://calwater.ca.gov | Yes
Linda Kennedy | California Department of Fish and Game | http://www.dfg.ca.gov |
Ann Latta | UC Merced (analyzed by Elizabeth Cowell) | http://www.ucmerced.edu | Yes
Ann Latta | California Coastal Commission | http://www.coastal.ca.gov/web |
Janet Martorana | Santa Barbara County Department of Planning and Development | http://www.countyofsb.org/plandev/default.htm | Yes
Janet Martorana | Santa Barbara County Association of Governments | http://www.sbcag.org |
Lucia Orlando | Monterey Bay National Marine Sanctuary | http://montereybay.noaa.gov | Yes
Lucia Orlando | Central Coast Regional Water Quality Control Board | http://www.waterboards.ca.gov/centralcoast |
Richard Pearce-Moses | Arizona Department of Water Resources | http://www.azwater.gov | Yes
Richard Pearce-Moses | Citizen's Clean Election Commission | http://www.ccec.state.az.us/ccec/scr/home.asp | Yes
Juri Stratford | City of Davis, California | http://www.city.davis.ca.us | Yes
Juri Stratford | Sacramento Area Council of Governments | http://www.sacog.org |
Yvonne Wilson | The Orange County Sanitation District | http://www.ocsd.com | Yes

Web-at-Risk Test Crawl Report: Appendix B: The Katrina Crawl

The Crawl

During the early fall of 2005, the California Digital Library was embarking on a series of test crawls as part of the Web-at-Risk project when hurricane Katrina struck. CDL had requested that curators submit their sample URLs by August 25th, and on Friday, August 26th, Louisiana Governor Blanco declared a state of emergency. It was over that weekend that CDL staff reached the conclusion that an event of this historic significance would shift our priorities. On Monday we suspended our initial test crawl plans and began preparing on a number of fronts to capture web-based materials related to Katrina. Specifically, this meant:

• identifying which web sites or sub-sites to collect, which involved:
  o notifying 30+ curators of our emergency crawl plans and requesting their assistance
  o setting up a web-based interface to gather their suggested seed URLs
  o sorting through the resulting seed list and feeding it to our primary and secondary (Stanford) crawl operators
• determining our collection parameters
• getting our crawlers installed and configured
• locating and setting up disk space to store the crawls
• initiating and monitoring the crawls

This crawl activity posed several new challenges. It was the first time CDL staff had used the Heritrix crawler, which was not yet installed on a production server when the hurricane struck. The last week of August saw CDL and UC Berkeley staff scrambling to find server space to set up Heritrix instances, disk partitions, and job parameters for the crawl. This was done with remarkable speed, owing to the urgency of the situation and to everyone's desire to capture a record of the events. Given our limited experience, in order to reduce the risk of losing the historically significant and fleeting materials related to hurricane Katrina, CDL worked with Stanford University to concurrently run the same crawl using a different crawler. After both CDL and Stanford had crawled the same seed list for a month, the task was then taken up by the San Diego Supercomputer Center, who have continued crawling these sites using Heritrix.

Gathering the Seeds

The CDL sent out an initial request to the Web-at-Risk curators to submit URLs related to hurricane Katrina for crawling. We worked collaboratively with a large group of content specialists to identify the sites:

• The 22 curators of the Web-at-Risk project (University of California Libraries, University of North Texas, New York University, Stanford University, and the Arizona State Library)

• The Library of Congress
• Librarians at Louisiana State University and the University of Mississippi

Over the course of the crawl, the list of seed URLs grew to over 700 (just over 500 of which were crawled by CDL). The image of our input form above provides some sense of the range of materials collected. Given the sudden nature of this event, there was no time to investigate rights issues or technical problems each site might have presented. We informed the curators that "Our immediate plan is to simply collect the material before it disappears. We will not make the material immediately available." We also had little time for quality control and were not able to guarantee that sites in our seed list would be comprehensively crawled.

Crawling Specifics

CDL's first crawl was run on September 1, 2005, using a seed list of 89 URLs. The final crawl run by CDL was on October 10, using a seed list of 589 URLs. This final seed list was then sent to the San Diego Supercomputing Center, who are continuing to run twice-weekly crawls.

The CDL crawls were done using Heritrix version 1.5.1. We began with a single instance of Heritrix, eventually dividing the seed list among 6 instances. We got through the entire seed list 29 times in 40 days. However, the content we were able to collect was limited by our crawler settings. Our crawler was configured to visit one host at a time, collecting content at a conservative rate, and never more than three hops away from the seed URL. To ensure that the crawler moved through the seed list in a timely way, it was limited to spending 15 minutes at any given host. So the material collected does not represent the entirety of what was available at each site. Note that all of the seeds were crawled with the same configuration, no matter how different the structure of these sites might be. Thus the New Orleans version of Craig's List, NASA's information pages, and blog sites were all crawled in the same manner, despite being quite different in context, architecture, and other characteristics. Given our short preparation time, the goal was to find a crawler configuration that would be a match for the broadest range of Katrina materials. In most cases, whether the seed URL was for a personal blog or for a government agency, the seed itself was not centrally about Katrina. The Katrina information was generally situated on the front page and top levels of each site, so CDL expected that this setting, even with the 15-minute time limit, would at least capture the content nearest to the front pages of the widest variety of sites we could capture. In choosing crawler configuration settings we also faced conflicting goals. On one hand, we wanted to capture as much Katrina content as possible. On the other hand, we were very hesitant to start hitting sites that were providing much-needed information at a crucial time (emergency sites, relief sites), thus making sites hard to reach due to our capture activity. Many of these sites were already getting heavily used and perhaps were not running at full capacity; some were also geographically impacted directly by the hurricane. So we had to choose settings that balanced the need to collect with politeness across a wide range of sites. When the San Diego Supercomputer Center took over the Katrina crawls in October, they revisited the Heritrix configuration settings. The SDSC crawl placed a limit not on the amount of time spent at a site but on the number of documents to be retrieved overall. Their crawls are being conducted "breadth first," in the sense of gathering pages consecutively across the seed list rather than gathering pages from one site at a time. The Stanford University crawl of the same sites began on September 7th and continued for 30 consecutive days, using the WebVac crawler. While we have statistics concerning the size of the Stanford Katrina collection, it is very difficult to compare the configuration settings and effectiveness of WebVac vs. Heritrix. Further, there is no easy way to display materials gathered with WebVac, so the Stanford Katrina content is stored but is not accessible to us at the moment.
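Dividing a growing seed list among several crawler instances is easy to script. The sketch below is a hypothetical illustration of the kind of round-robin split used here (six instances); the file names are made up.

    def split_seed_list(seed_file, instances=6):
        """Split one seed list into round-robin buckets, one file per crawler instance."""
        with open(seed_file, encoding="utf-8") as f:
            seeds = [line.strip() for line in f if line.strip() and not line.startswith("#")]
        sizes = []
        for i in range(instances):
            bucket = seeds[i::instances]
            with open(f"seeds-instance-{i + 1}.txt", "w", encoding="utf-8") as out:
                out.write("\n".join(bucket) + "\n")
            sizes.append(len(bucket))
        return sizes

    print(split_seed_list("katrina-seeds.txt", instances=6))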

Katrina Crawl Results

In terms of creating a collection of the Katrina event on the web, we are not satisfied that CDL's crawling efforts were successful. As mentioned, we had to impose time limits on each site visited and were not able to conduct quality control on the URLs that were captured. The total size of the Katrina capture at CDL is 50 GB. This represents 29 completed crawls of the seed list, which grew to over 500 seeds over the 40 days we ran the crawl. It's clear from the byte size alone that we barely skimmed the surface, and indeed the Stanford crawl proves this point: when the Stanford crawl was based on about 370 seeds, it was pulling in about 35 GB a day, or 15 million pages a day. In spite of this, our attempts to capture Katrina news and events did have some very useful and positive outcomes. In terms of test crawling, the experience was quite successful, opening our eyes to both certain obstacles and to new possibilities. Most importantly, it prompted us to consider the nature and demands of event-based crawling, which had not been part of our original test plan. Here are some of the aspects of event-based crawls that Katrina surfaced.

Site Selection and Classification

This type of event demands a deep, collaborative effort to identify material to collect. CDL had to mobilize a large number of curators to make their best guesses at identifying promising sites. We set up a "crawl seed nomination" web form accessible to the curators from their home institutions and invited the curators to use it to enter their suggested seed URLs. CDL staff did not have time to add much automatic error or duplicate checking, so this cumbersome work was done by hand. On the other hand, it provided us a first-hand trial of what a more general curator user interface might require. The selection and management of seed lists is critical for sudden event-based crawls. The curators contributing the URLs will not necessarily be well-versed in the topic; in the case of Katrina, curators in California were not uniformly familiar with the Gulf Coast, the towns, the government agencies, etc. In addition, it is difficult to predict which aspects of the event will be of historic, enduring value. Because disk storage was not a pressing issue, it was better to err on the side of a wider net when selecting sites. It became clear that there might be a role for "smart crawling," which would spread a wider net than that provided by human-generated seed lists through the use of automated tools that discovered relevant materials. We were specifically interested in using tools developed at Emory and Cornell that interface with the Rainbow text classification tool [7]. We did not have the resources to investigate this for the Katrina crawl, but plan to fold it into future crawling tests. Finally, it is worth examining why the seed list grew continuously throughout the event. Part of this growth is due to the natural increase in the number of pages and sites devoted to the emergency, and to having the additional time to identify and add them. However, the nature of the event itself changed over time, starting as a hurricane, then becoming a flood, a massive relocation, and a political and social issue. So the range of relevant sites changed as the event itself took on broader implications. This suggests that site selection is an ongoing process, not strictly an activity undertaken at the beginning of an event.

[7] Bergmark, Donna. "Heritrix processor for use with rainbow" <http://groups.yahoo.com/group/archive-crawler/message/1905>
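The duplicate checking that had to be done by hand is straightforward to automate for the obvious cases. A hypothetical sketch that normalizes nominated seed URLs (lower-cases the host, adds a missing scheme) and drops exact duplicates; the example nominations are invented.

    from urllib.parse import urlsplit, urlunsplit

    def normalize_seed(raw):
        """Normalize one nominated seed URL so trivial variants collapse together."""
        raw = raw.strip()
        if not raw:
            return None
        if "://" not in raw:
            raw = "http://" + raw  # nominations often arrive without a scheme
        scheme, host, path, query, _fragment = urlsplit(raw)
        return urlunsplit((scheme.lower(), host.lower(), path or "/", query, ""))

    def dedupe_seeds(nominations):
        seen, unique = set(), []
        for raw in nominations:
            seed = normalize_seed(raw)
            if seed and seed not in seen:
                seen.add(seed)
                unique.append(seed)
        return unique

    print(dedupe_seeds(["www.fema.gov", "http://WWW.FEMA.GOV/", "http://www.fema.gov"]))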

Rights, Ownership, and Responsibilities

Our current rights protocol stipulates that when a web page is clearly in the for-profit domain, we must seek permission from the content owners to collect. This restriction applies most notably to news agency sites, which would likely be a critical part of the web-based fallout of any major event; this was certainly the case with Katrina. However, rights negotiation is a time-consuming process. In the case of a planned event such as an election, one is able to anticipate a certain number of sources and take rights management steps before the event occurs. But with emergencies such as Katrina, there is simply no way to secure rights in advance without missing time-sensitive material. Knowing this would be an issue, CDL notified our curators in advance that the Katrina materials we gathered would not be publicly accessible but merely captured and preserved. The complex rights issues behind web archiving, as well as the collaborative environment described above, raise a number of questions:

• Who owns the content?
• Who is responsible for the project?
• Who is responsible for responding if a content owner objects?
• When is it permissible to ignore robots exclusion files?
• How much rights management can be done as part of post-processing for an event crawl?
• Can there be standing agreements with major news agencies to allow for this type of crawl in emergency situations? If so, what constitutes an emergency?

Technical Infrastructure

CDL did not have an adequate technical infrastructure in place at the time of the Katrina crawl. As mentioned above, staff at both CDL and UC Berkeley scrambled to find server space and to install and configure the crawlers. This sudden shift in workload left the system vulnerable and somewhat unstable. In order to start the crawling process as soon as possible, CDL began the project on a temporary server. Making the transition to more permanent storage in the midst of this crawling project was neither easy nor flawless, and a certain amount of data had to be recovered from backup sources.

Information Analysis and Display

The challenge of analyzing crawl results has been described in some detail in our Test Crawl Report. That report cites a curator who found the task of reviewing an 8899-document crawl to be unmanageable. The tools currently available are not at all up to the task of analyzing a large and complex crawl. An event-based crawl is likely to result in massive amounts of data of widely varying quality. The selection of seeds is based on guesses that, given sufficient resources, should be reviewed, refined, and enhanced as the event progresses. Time series data based on changing input parameters represents a kind of moving target that suggests the need to develop new analysis tools. Andreas Paepcke of Stanford University addressed this problem when considering how a social scientist might want to analyze the Katrina materials. Consider the researcher who is interested in finding out how quickly the notion of race entered the public discourse in the aftermath of Katrina. Paepcke suggests:

"For example, the tool could perform word frequency analysis across successive crawls of Katrina sites so that the user could identify the emergence of word usage such as looter, ninth ward, etc. The social scientist would then define within the tool clusters of words that in the scientist's judgment are direct or indirect occurrence indicators of the concept race (i.e. ninth ward, looters, lacks, poor). The scientist could subsequently interact with the tool at the level of these well-defined concepts. Example command: count occurrences of race within the first three paragraphs of all pages and tabulate the differences across 10 days. We could go further and apply well-known topic categorization algorithms on the data to suggest new concepts as they arise in the series of text streams. For example, the tool might identify an increasing frequency of the term mismanagement beginning in week two after the storm and alert the sociologist to this evidence of a new discourse theme." [8]

While we don't expect to be able to incorporate this level of analysis into our web archiving tools any time soon, these ideas illustrate a useful direction for archival analysis tools.
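A rough sketch of the word-frequency analysis Paepcke describes, counting a cluster of terms across dated crawl snapshots stored on disk. The directory layout and term cluster are hypothetical, and a real tool would work against the archive itself rather than flat files.

    import collections
    import pathlib

    TERM_CLUSTER = ("ninth ward", "looter", "mismanagement")

    def cluster_counts(snapshot_dir, terms=TERM_CLUSTER):
        """Count how often each term appears in the HTML pages of one crawl snapshot."""
        counts = collections.Counter()
        for page in pathlib.Path(snapshot_dir).rglob("*.html"):
            text = page.read_text(encoding="utf-8", errors="replace").lower()
            for term in terms:
                counts[term] += text.count(term)
        return counts

    for day in ("2005-09-01", "2005-09-08", "2005-09-15"):
        print(day, dict(cluster_counts(f"katrina-crawls/{day}")))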

Conclusions

At the time Hurricane Katrina hit, the Web-at-Risk service requirements and test crawl plan had been written for a more orderly and considered approach to web crawling. Our attempt to capture the web-based aftermath of Katrina highlighted our need to revisit the service requirements and see how well they supported an event-based response. We need to develop a collaborative and robust mechanism to identify and collect seed URLs. When an event happens quickly, it is also important to have in place general crawling guidelines that will result in a well-rounded collection. The technical infrastructure for the Web Archiving Service that we are building for the Web-at-Risk project should be robust enough to handle occasional and sudden bursts of activity. Attempting to alter the infrastructure quickly in reaction to emergency events impairs our responsiveness and leaves the altered systems in a vulnerable state.

[8] Paepcke, Andreas, Senior Research Scientist and Director of the Digital Library, Stanford University. Email correspondence with Patricia Cruse, October 26, 2005.

Finally, work needs to continue both on the rights management front and on developing improved web archiving analysis tools, so that the material gathered can be used to its greatest potential.

Web-at-Risk Test Crawl Report: Appendix C: Individual Crawl Reports

Included below are the crawl reports provided to individual curators, including their analysis and feedback about those results. Note that a key to interpreting the tables in these crawl reports is provided at the end of this appendix.

Elizabeth Cowell (submitted by Ann Latta): UC Merced
CDL Report to Curator

URL httpwwwucmercededu Curatorrsquos original comments ldquoUC Merced is the first research university to be built in the 21st century The educational and land use issues are significant Of particular interest is httpwwwucmercedplanningnet This site addresses major issues of land use - the university is being built on agricultural land Controversy existed re issues of redevelopment of downtown Merced vs appropriation of agricultural land - there are major environmental issues focused on endangered species - Educational issues involving faculty job descriptions student body etc are significant because of the economic ethnic and cultural diversity of the regionrdquo Site copyright statement ldquocopy 2004 UC Regentsrdquo Crawl Results


Comments from crawl operator: When we set the crawl to include pages from linked sites, the crawler got "trapped" at the Elsevier site. There is JavaScript on that linked page that causes the crawler to continue looking for additional pages on the Elsevier site, even when you're only trying to capture a single page. Once we set a limit for the maximum number of retry attempts, the crawl completed. This data is from the completed crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 227 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
969 71552369 www.ucmerced.edu
238 2564803 www.ucop.edu
226 14851 dns
197 16583197 www.universityofcalifornia.edu
156 8487817 www.elsevier.com
151 1437436 www.greatvalley.org
112 2354582 faculty.ucmerced.edu
105 5659795 www.pacific.edu
90 111985 k12.ucop.edu
86 255733 www-cms.llnl.gov
85 1178031 admissions.ucmerced.edu
81 297947 uc-industry.berkeley.edu
71 108265 www.mssmfoundation.org
67 349300 www.nps.gov
66 308926 www.usafreedomcorps.gov
54 137085 slugstore.ucsc.edu
52 52202 www.cerrocoso.edu
51 977315 www.universityofcalifornia.com

Curator Feedback to CDL (Cowell Merced)

Crawl Success: mostly effective


Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: 10 more links came up in the via search; 8 of these were not useful for the research of land use issues. The two that were useful were a benefit of the via search.
Crawl Frequency: monthly


Sherry DeDekker: California Water Science Center
CDL Report to Curator

URL: http://ca.water.usgs.gov
Curator's original comments: "We are interested in the CA water reports and databases behind the links on this page. Also, in looking at http://waterdata.usgs.gov/ca/nwis/nwis, this section is an interactive interface to multiple data sets. Is it possible to capture this type of site as well as the static reports?"
Site copyright statement: "Information presented on this website is considered public information (unless otherwise noted) and may be distributed or copied. Use of appropriate byline/photo/image credit is requested. We strongly recommend that USGS data be acquired directly from a USGS server and not through other sources that may change the data in some way."
Crawl Results

Comments from crawl operator: Databases, e.g. http://waterdata.usgs.gov/ca/nwis/nwis, are disallowed by http://waterdata.usgs.gov/robots.txt. Some water data reports are NOT caught by the broader "linked hosts included" settings, e.g. http://pubs.usgs.gov/wdr/2004/wdr-ca-04-1 -- would also want to submit http://ca.water.usgs.gov/waterdata as a seed.
Related hosts crawled:


When the crawl was set to include documents from other sites that the original site linked to, 662 additional sites were crawled. The following list includes the hosts that supplied more than 50 files. Note that the host "pubs.usgs.gov" supplied a higher number of files than the original host itself.
[urls] [bytes] [host]
1963 255912820 pubs.usgs.gov
1153 47066381 ca.water.usgs.gov
698 56570 dns
404 112354772 geopubs.wr.usgs.gov
385 9377715 water.usgs.gov
327 203939163 greenwood.cr.usgs.gov
318 17431487 www.elsevier.com
219 3254794 www.usgs.gov
189 2737159 www.lsu.edu
163 2292905 wrgis.wr.usgs.gov
158 31124201 www.epa.gov
149 921063 www.usda.gov
[list truncated…]

Curator Feedback to CDL (DeDekker CWSC)

Crawl Success: somewhat effective
Crawl Success Comments: Site appears to access water data reports (http://ca.water.usgs.gov/archive/waterdata/index.html), but none are actually available through the links. I expected the site to not be able to access real-time data, but these are archived reports.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly


Megan Dreger (submitted by James Jacobs): City of San Diego Planning Department

CDL Report to Curator

URL: http://www.sandiego.gov/planning
Curator's original comments: I would like to drill down several levels (at least 3) of this site. For example, following the link to City of Villages/general plan update leads to many more important planning documents.
Site copyright statement: This site contains the two following notices on the same page:
"Restrictions on Use of Materials: This site is operated and maintained by the City of San Diego through its Department of Information Technology and Communications (referred to as IT&C). Except as provided herein, no material or information from this site may be copied, reproduced, republished, uploaded, posted, transmitted, or distributed except as authorized in this notice, expressly authorized within this site, or approved in writing by IT&C.
Copyright Notice: Unless a copyright is indicated, information on the City of San Diego Web site is in the public domain and may be reproduced, published or otherwise used with the City of San Diego's permission. We request only that the City of San Diego be cited as the source of the information and that any photo credits, graphics, or byline be similarly credited to the photographer, author, or City of San Diego, as appropriate. If a copyright is indicated on a photo, graphic, or any other material, permission to copy these materials must be obtained from the original source."
Crawl Results


Comments from crawl operator: Need feedback about whether the desired content was retrieved. Question for curator: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 1084 additional hosts were crawled. (This figure does not represent a complete crawl, as the size limitation was reached.) The following hosts supplied more than 75 documents to your site:
[urls] [bytes] [host]
3728 556231640 www.sandiego.gov
1247 38685244 genesis.sannet.gov
1085 80905 dns
807 6676252 www.houstontexans.com
428 1079658 www.cacities.org
399 102298888 www.buccaneers.com
259 1797232 granicus.sandiego.gov
258 42666066 clerkdoc.sannet.gov
238 5413894 www.ccdc.com
225 2503591 www.ci.el-cajon.ca.us
223 1387347 www.ipl.org
217 2683826 www.sdcounty.ca.gov
203 11673212 restaurants.sandiego.com
195 2620365 www.sdcommute.com
192 1344523 www.bengals.com
189 2221192 www.kidsdomain.com
176 1333528 www.buffalobills.com
171 685965 www.chumpsoft.com
166 277238 www.proquest.com
[list truncated…]


Curator Feedback to CDL (Dreger San Diego)

Crawl Success: mostly effective
Crawl Success Comments: This crawl was not completed due to size, so that may explain some of my questions. It was pretty effective in terms of getting the Planning Dept. pages, but went out further than I expected. Due to the vague request to drill down several levels, I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit. It's nice that PDFs and other formats are included.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The link hosts included (via) seemed to include more extraneous stuff.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
A: You asked for a better definition of what was meant by the request to drill down several levels (at least 3). Unfortunately, I wasn't the original curator and I'm not sure what he meant. The crawl you did was very useful.
Crawl Frequency: monthly
Questions/Comments about crawl: I think that the crawl frequency should be at least monthly for these pages.


Peter Filardo and Michael Nash: New York City Central Labor Council
CDL Report to Curator

URL: http://www.nycclc.org
Curator's original comments: (none)
Site copyright statement: "© 2004 New York City Central Labor Council. No portion of this website may be reproduced in any form without permission from the Central Labor Council. Contact our offices for more information at nycaflcio@aol.com"
Crawl Results
NOTE: Because your Crawl "A" had to be stopped then resumed, each of your reports for that crawl is in two segments. To browse a list of all reports for that crawl, go to http://voro.cdlib.org:8081/ingest_misc/ndiipp/testcrawls_raw/filardo_labor_via

Comments from crawl operator: A. Linked hosts included: Crawl complete after recovery, with the addition of max retries. Seemed to hang at http://www.nycclc.org/calendar/event.asp?EventId=501 and http://www.nycclc.org/assets/HLCapplicationmembership.pdf. Ended crawl; seemed to hang. Recovered from previous job; the recovery was successful. Note for the future that a recovered job is identifiable because the logs directory is called logs-R.


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, x additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
1913 74260017 www.nycclc.org
156 11755 dns
115 710552 www.aflcio.org
73 1477966 www.comptroller.nyc.gov
71 193264 www.empirepage.com
60 570115 www.redcross.org
58 269079 www.afl-cio.org
57 240845 www.campsussex.org
57 113676 www.mssm.edu
56 449473 www.labor-studies.org
53 184605 www.pbbc.org
52 134326 www.senate.gov
[list truncated…]

Curator Feedback to CDL (Filardo, NYCCLC)

None provided


Valerie Glenn and Arelene Weibel: Strengthening Social Security
CDL Report to Curator

URL: http://www.strengtheningsocialsecurity.gov
Curator's original comments: "contains external links to audio & video that would be essential to completing this site (see press room); some are files, some are links to webcasts, some are on external .gov sites, and some are external .com sites"
Site copyright statement: Copyright info not found.
Crawl Results

Comments from crawl operator: Interesting audio/video/webcast content -- need feedback about success in capturing. Questions for Curator: How successfully did this crawl capture the multimedia documents you were interested in?
Comments from coordinator: In the media center area of this site (http://www.strengtheningsocialsecurity.gov/press/media_center.shtml) I've found .ram (both video and audio alone), .smil, and .asx files. The site also contains numerous .ppt and .pdf files.
• A text search on the log file turns up numerous .ram files, only 1 .ppt file.


• .asx files are Windows streaming media redirector files, which generally lead to associated .wmv files. No .asx references appear in the crawl log, nor do any .wmv files.
• Similarly, .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files. I assume that when displayed, some of the Real media files from this site would function, but many of the other multimedia files would not. (A rough sketch of this kind of crawl-log check follows the host list below.)
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 388 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
660 10668874 www.chelseapiers.com
562 7334035 www.whitehouse.gov
477 6366197 www.laopinion.com
391 29623 dns
356 3874719 www.wkrc.com
243 12294240 www.strengtheningsocialsecurity.gov
178 1935969 www.xavier.edu
148 237055 image.com.com
127 682069 online.wsj.com
117 898439 www.omaha.com
116 514995 www.npr.org
108 995733 www.nba.com
[list truncated…]
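The log check described above can be approximated with a small script. This is a rough sketch only, not part of the project's tooling: it scans a crawl log for tokens ending in the media extensions discussed here. The log file name is a placeholder.

from collections import Counter

# Extensions of interest: .asx/.smil are redirector files; .wmv/.rm/.ram are the
# streaming media they point to; .ppt/.pdf are the office documents noted above.
MEDIA_EXTENSIONS = (".ram", ".asx", ".smil", ".rm", ".wmv", ".ppt", ".pdf")

def count_media_references(log_path):
    """Tally tokens in a crawl log that end with one of the media extensions."""
    counts = Counter()
    with open(log_path, encoding="utf-8", errors="ignore") as log:
        for line in log:
            for token in line.split():
                lowered = token.lower()
                for ext in MEDIA_EXTENSIONS:
                    if lowered.endswith(ext):
                        counts[ext] += 1
    return counts

# "crawl.log" is a hypothetical path to the Heritrix log for this job.
for ext, total in sorted(count_media_references("crawl.log").items()):
    print(ext, total)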

Curator Feedback to CDL (Glenn Strength Soc Sec)

Crawl Success: mostly effective
Crawl Success Comments: My main concern about this site was the multimedia documents included - I've posted those thoughts in the Questions for Curator text box.
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: For this site it was essential to capture the link hosts (via), because many of the press materials, etc. were on external sites.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: How successfully did this crawl capture the multimedia documents you were interested in?
A: I'm disappointed that not all of the multimedia files were captured, but there seem to be only a few that aren't included. I found it interesting that


broadcasts from the same host (whitehouse.gov) weren't completely captured - some were, some weren't.
Crawl Frequency: once
Questions/Comments about crawl: [note: the curator alludes to adding a comment here, but no comment was received]


Valerie Glenn and Arelene Weibel: Defense Base Closure and Realignment Commission
CDL Report to Curator

URL: http://www.brac.gov
Curator's original comments: "In our previous efforts we have been unable to capture agency databases. The BRAC site includes a document library which has a search feature (http://www.brac.gov/Search.aspx) and a browse feature (http://www.brac.gov/Browse.aspx). We would really like to see how this information can be captured so that we can recreate it on our own servers."
Site copyright statement: "The contents of all material available on this Internet site are in the public domain and are not copyrighted. The content of this site may be freely reproduced, downloaded, disseminated, published, or transferred in any form and by any means. However, in some cases the copyright for certain text or images on this site may be held by other parties."
Crawl Results

Comments from crawl operator: A. Linked hosts included: http://www.brac.gov/Search.aspx can't be captured by Heritrix. http://www.brac.gov/Browse.aspx seems to only capture the first 25 documents. Tried again with the browse page as the starting point, but stopped after 1005 documents; extracted 20 links from the browse page, and then there were no more URLs in the frontier queue which had been extracted from Browse.aspx -- perhaps need more experimentation. B. Restricted to original host: again, only the first 25 pages from browse -- can't even successfully pass a seed URL listing the max docs per browse page (50).
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 44 additional hosts were crawled. Note that because your crawl did not complete, this is not an accurate count of how many other sites your original site may link to.
[urls] [bytes] [host]
2034 1064389540 www.brac.gov
555 5874934 www.slu.edu
87 173510 www.cpcc.edu
54 154588 www.wmata.com
47 685158 www.sluhospital.com
44 3501 dns
44 582555 www.c-span.org
43 174467 www.adobe.com
38 178153 www.q-and-a.org
32 127325 slubkstore.com
24 140653 www.c-spanclassroom.org
23 326680 www.capitalnews.org
22 213116 cancercenter.slu.edu
21 196012 www.defenselink.mil
[List truncated…]

Curator Feedback to CDL (Glenn Defense Base Closure)

Crawl Success: somewhat effective
Crawl Success Comments: I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: once
Questions/Comments about crawl: One of the reasons I submitted this site to be crawled is that I'd already tried to harvest it using HTTrack. That product appeared to capture more of the documents than Heritrix - but we still haven't been able to capture the entire site.


Gabriela Gray: Join Arnold
CDL Report to Curator

URL: http://www.joinarnold.com
Curator's original comments: "California Recovery Team: Non-profit pro-Arnold group not registered as a campaign committee. Critical Aspects: Complex file and directory naming structure; looping, e.g. contactus.asp and contactadd.asp links."
Site copyright statement: "Copyright 2005"
Crawl Results

Comments from crawl operator: A. Linked hosts included: Great site for testing -- this loop is really interesting because a new URL is generated with each loop, so that the duplicate-detection underway at IA would still not eliminate it. 44332 of the retrieved URLs were contact pages. B. Restricted to original host: Got into a loop by the end; 999 documents retrieved, 34 minutes. C. Restricted to original host + regular expression:


Excluding pages that matched the regular expression contactadd.asp?c= did not end the loop. What did end the loop: excluding both contactus and contactadd pages so they were not retrieved -- a drawback. (IA takes the manual approach of gathering the pages, then having an operator stop the crawl and take out the looping URLs by hand -- not scaleable.)

<newObject name="contact" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
  <string name="decision">REJECT</string>
  <string name="regexp">.*contact.*asp.*c=.*</string>
</newObject>

Related hosts crawled: Because of looping problems, we were not able to crawl other hosts linked from this site.
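The exclusion logic described above can also be illustrated outside of Heritrix. The following is a hedged sketch, not the project's actual configuration: a simple reject filter applied to candidate URIs, where the example URLs and the exact pattern are assumptions chosen to mirror the looping contact pages.

import re

# Reject any candidate URI for the looping contact pages, regardless of the
# ever-changing "c=" parameter that defeats duplicate detection.
REJECT_PATTERN = re.compile(r".*contact(us|add)\.asp(\?c=.*)?$", re.IGNORECASE)

def should_schedule(uri):
    """Return False for URIs that would re-enter the crawler trap."""
    return REJECT_PATTERN.match(uri) is None

candidates = [
    "http://www.joinarnold.com/about.asp",              # kept
    "http://www.joinarnold.com/contactus.asp?c=8f3a",   # hypothetical looping URL, rejected
    "http://www.joinarnold.com/contactadd.asp?c=91bb",  # hypothetical looping URL, rejected
]
for uri in candidates:
    print(uri, "->", "schedule" if should_schedule(uri) else "reject")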

Curator Feedback to CDL (Gray Arnold)

Crawl Success: somewhat effective
Crawl Success Comments: We spot-checked, and it looks like most files were captured, but the individual pages don't display most of the images. (This may simply be a problem with the WERA interface.) Strangely enough, the Flash files work perfectly, which is exactly the opposite of our own capture experience.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: As per the crawl notes, we only checked the original host version, since the via crawl failed.
Crawl Frequency: once
Questions/Comments about crawl: Same comments as on the Villaraigosa crawl. We're inexperienced with this type of capture: archive files as-is and use a tacked-on script and a special server-side interface to interpret links in the new environment. Our model has been to actually alter the internal links from absolute to relative formats so that it works in any environment.


Gabriela Gray: Mayor-Elect Villaraigosa
CDL Report to Curator

URL: http://www.antonio2005.com
Curator's original comments: Critical Aspects: Flash animation; content scattered across multiple servers; maintaining complex internal link structure; JavaScript menus; streaming media.
Site copyright statement: "©2005 Villaraigosa for Mayor 2005"
Crawl Results

Comments from crawl operator

• (for Linked hosts results) Need feedback on media etc. retrieved -- this site is an ideal example of the need for scope+one.

• (for Restricted to original host) How much was left out due to the domain restriction? Need feedback.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 263 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:
[urls] [bytes] [host]
817 10291631 ensim3.interlix.com
805 117538973 www.antonio2005.com
472 6333775 www.laopinion.com
265 21173 dns
110 19355921 www2.dailynews.com
100 16605730 www2.dailybulletin.com
95 1410145 www.americanpresidents.org
86 820148 www.dailynews.com
73 168698 www.chumpsoft.com
72 52321 images.ibsys.com
69 836295 www.laobserved.com
65 137700 www.mysql.com
55 213569 www.ensim.com
55 177141 www.lamayorcn.com
55 296311 www.surveyusa.com
53 495858 abclocal.go.com
52 522324 www.c-span.org
51 244668 gallery.menalto.com
[list truncated…]

Curator Feedback to CDL (Gray Villaraigosa)

Crawl Success: mostly effective
Crawl Success Comments: Doing some spot checks, it looks like all of the pages were captured. Some problems with media files -- WERA shows them when we search, but the files are often size 0. In addition, many files on external servers are listed and even have some descriptive info, but when we click on Overview it says "Sorry, no documents with the given uri were found," so no idea if they were really captured.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: Most of the links to other sites go to pages that truly are external to the site, not incorporated into it. The only exception would be the pages from ga3.org and ga4.org.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
The results on the multiple hosts crawl are mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains, rather than source + 1. Also, there seem to be a lot of links going to ensim3.interlix.com -- there were more files captured from there than from antonio2005. This is a case where it looks like the entire site (or much of it) was mirrored on two different servers. When we find these, we often try to collapse them into one seamless whole, eliminating the duplication, which is meaningless from the user's viewpoint. Simply capturing both mirrors and leaving the cross-links intact is an option we've used when we can't collapse, but it often leads to problems with links between the two, which seems to be the case here.


Crawl Frequency: once
Questions/Comments about crawl: We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE -- the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.


Ron Heckart and Nick Robinson: Public Policy Institute of California
CDL Report to Curator

URL: http://www.ppic.org
Curator's original comments: "We are particularly interested in their publications. We hope the crawler will be able to report when new publication files are posted on the website. Our main focus of interest is on their new publications at http://www.ppic.org/main/newpubs.asp"
Site copyright statement: "All Contents © Public Policy Institute of California 2003, 2004, 2005"
Crawl Results

Comments from crawl operator: We can't at the moment use Heritrix to report on new publications posted.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 238 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:
[urls] [bytes] [host]
2421 324309107 www.ppic.org
433 1367362 www.cacities.org
238 19286 dns
229 4675065 www.icma.org
200 598505 bookstore.icma.org
151 1437436 www.greatvalley.org
144 517953 www.kff.org
137 5304390 www.rff.org
113 510174 www-hoover.stanford.edu
102 1642991 www.knowledgeplex.org
97 101335 cdn.mapquest.com
81 379020 www.cde.ca.gov
73 184118 www.ilsg.org
68 4539957 caag.state.ca.us
62 246921 www.milkeninstitute.org
[list truncated…]

Curator Feedback to CDL (Heckart PPIC)

Crawl Success: mostly effective
Crawl Success Comments: There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on Date retrieves the message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Crawl Frequency: weekly
Questions/Comments about crawl: Our crawl report included the following comment from the crawl operator: "We can't at the moment use Heritrix to report on new publications posted." The ability to report on new publications is critical to our goal of using the crawler as a discovery tool. What are the prospects for providing this functionality in the future?


Terry Huwe: AFL-CIO
CDL Report to Curator

URL: http://www.aflcio.org
Curator's original comments: "This site is content rich and has many files that will be useful in the future. Specific areas that are of special interest follow below. http://www.aflcio.org/corporatewatch: the data related to executive pay watch is especially useful. http://www.aflcio.org/mediacenter: would like to see press stories captured if possible. http://www.aflcio.org/issues: links to newsletters and original content. Also, the "Legislative Action Center" on the home page: this is a useful topic guide to legislative history from a labor perspective."
Site copyright statement: "Copyright © 2005 AFL-CIO"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 2572 additional hosts were crawled. The following hosts supplied more than 75 URLs to your site:
[urls] [bytes] [host]
12702 481956063 www.aflcio.org
2657 184477 dns
1375 35611678 www.local237teamsters.com
570 8144650 www.illinois.gov
502 52847039 www.ilo.org
435 3851046 www.cioslorit.org
427 2782314 www.nola.com
401 8414837 www1.paperthin.com
392 15725244 www.statehealthfacts.kff.org
326 4600633 www.dol.gov
288 12303728 searchoxide.com
284 3401275 www.sikids.com
280 3069385 www.washingtonpost.com
272 1480539 www.cdc.gov
235 5455692 www.kff.org
[list truncated…]

Curator Feedback to CDL (Huwe AFL-CIO)

Crawl Success: effective
Crawl Success Comments: I realize the collection interface is a work in progress and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g., Working Families Toolkit, BushWatch) that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2 million-plus hits per year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses of the toolkit. Having said all that, my short answer is: I'm stoked. :-)
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In the case of this collection there's a lot of original content (and context-making documents) on the linked sites, simply due to the hierarchical/federal nature of the labor movement.
Crawl Frequency: monthly
Questions/Comments about crawl: I may have missed correspondence on this question, as I've been busy with a library renovation here. My question: Is the 1 gig limit a useful one? I couldn't help wondering what the actual figure would be if that limit were set higher. I'm generally interested in the total size of Web collections, as I need to monitor ours (which is content-rich but even so does not exceed 1 gig). So discussion of this at least would be interesting, and maybe, if it's not technically challenging, that 1 gig level might be raised and we could see what happens. Thanks to all involved--very interesting process.


Kris Kasianovitz: Los Angeles Dept. of City Planning
CDL Report to Curator

URL: http://cityplanning.lacity.org
Curator's original comments: "Website uses frames. Most of the documents will be PDFs. Of particular interest - the EIRs, which are not archived on the page once the project is approved; see http://cityplanning.lacity.org/EIR/TOC_EIR.htm - General and Community Plans: http://cityplanning.lacity.org/complan/gen_plan/genplan2.htm, http://cityplanning.lacity.org/ComPlan/cpbpage.htm"
Site copyright statement: No copyright information found.
Crawl Results

Comments from crawl operator: (Linked hosts included crawl) ended because it ground on for 3 days without hitting the data limit; not sure if URLs at the end of the log are valid/useful.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 119 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
10493 840876945 cityplanning.lacity.org
601 5156252 metrolinktrains.com
183 644377 www.cr.nps.gov
121 11162 dns
90 977850 www.metrolinktrains.com
81 1207859 www.fta.dot.gov
79 263432 www.fypower.org
66 333540 www.adobe.com
64 344638 lacity.org
63 133340 ceres.ca.gov
60 274940 www.amtrak.com
59 389217 www.nhtsa.dot.gov
58 347752 www.unitedweride.gov
52 209082 www.dot.gov
52 288783 www.nationaltrust.org
51 278949 www.portoflosangeles.org
[list truncated…]

Curator Feedback to CDL (Kasianovitz LA City Planning)

Crawl Success: mostly effective
Crawl Success Comments: The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all PDFs. When searching specifically for EIRs, I got a large result list (699 citations); however, when I investigated whether or not the actual file was captured, I found that the main EIR page was captured (typically a htm/html file), but when I clicked on a link to get to the full report, all I got was the "Sorry, no documents with the given URI were found" message. This could be that the file was no longer available when the site was harvested. However, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved. Here are a few specific examples:
Final EIR directory: http://cityplanning.lacity.org/EIR/Tocfeir.htm
Sierra Canyon Secondary School (cover page): http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/SierraCyn_coverpg.htm - access to the Draft EIR and Final EIR is provided from this cover page. Within the system, the links to both the Draft and Final are broken: no documents with that given URI. http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/DEIR/Table of Contents.htm
Villa Marina EIR: http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
Directory of EIR notices of preparation: http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM


This provides links to EIR preparation notices (all PDFs), a total of 27 links/documents. I encountered the following three issues:

• PDF opened = when clicking on the link to the notice, the PDF opened with no problem: 16 of 27.

• "Sorry, no document with the given uri was found" = no PDF harvested, but I could get to it from the live site: 4 of 27.

• Acrobat "Could Not Open" message (could open live page outside of WERA) = the following Acrobat message came up when I tried to open the PDF: "Acrobat could not open ENV-2005-0881-EIR[1].pdf because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded). To create an Adobe Acrobat PDF document, go to the source application. Then print the document to Adobe PDF." I copied and pasted the URL into a regular browser and could open the PDF with no problem. Also, in a few cases, if I clicked on the GO button after the first attempt to open the PDF in the system, it seemed to launch: 7 of 27.

Conversely, I found a number of pages that contained full documents in html, with links to PDFs that worked with no problem. See the following document: http://cityplanning.lacity.org/cwd/gnlpln/transelt/TET2Bkgrnd.htm. File types and error codes were what I expected.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The "restricted to original host" setting is more helpful/relevant for getting to the materials from a specific agency. For some of the results that were restricted to host, I was getting external links. Here are some comparisons for each of the crawl settings:
Searched for "villa marina":
LA Dept of City Planning: 6 results
http://cityplanning.lacity.org/EIR/NOPs/ENV-2004-3812-EIR.pdf
http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM
http://cityplanning.lacity.org/EIR/Tocfeir.htm
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/Cwd/GnlPln/HsgElt/HETblFigApVHgSit.htm
LA City Dept of Planning (via): 2 results
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/Cwd/GnlPln/HsgElt/HETblFigApVHgSit.htm
Searched for "eir":
LA Dept of City Planning: 699 results
LA City Dept of Planning (via): 324 results


For both of these searches, the URIs were from cityplanning.lacity.org.
Searched for "transportation":
LA Dept of City Planning: 699 results
LA City Dept of Planning (via): 290 results (most are from external sources and tended to be the index or main page of another agency or organization). Because this just got me to the main page, and none of the links functioned at that level, the via result was less helpful. However, the via results are useful for discovering other agencies or organizations that I should be looking at for materials.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
The list of linked hosts provided at the end of the report is helpful. This will help me target other agencies that might be key places to check for further collection development.
Crawl Frequency: monthly
Questions/Comments about crawl: I want to qualify the frequency for this site: I'd like to do a monthly crawl for 3-4 months. I'd want to reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot of new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation. After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report there are 8899 documents in this crawl alone; my other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls. I should note that at this point in time, for local documents, I am more interested in individual documents, more so than capturing an entire website and preserving the functionality. Local agencies (with the exception of perhaps the Mayor's site) tend not to change the design (look and feel) very often or have a very sophisticated design (i.e., flash, changing images, etc.), and I don't see that this would be of interest to researchers' needs; it is the content - reports, maps, etc. - contained/accessed on the websites that is important. Maybe I'm wrong or being short-sighted about that.


Kris Kasianovitz: Southern California Association of Governments
CDL Report to Curator

Curator's original comments: "This is a critical regional agency for Los Angeles, Orange, Ventura, Imperial, Riverside, and San Bernadino counties. Its main areas are Transportation, Housing, Economic Development. This will provide an analysis of the overall site, which has a lot of content. Publications/reports are typically in PDF - they are presented as full reports and pieces of the report (for easier downloading), so there might be duplication. The full report is really all that would be needed. See http://www.scag.ca.gov/publications. The Resources page contains the PDFs, images, dynamic content, GIS programs, including an interactive atlas: http://www.scag.ca.gov/resources.htm. One part of the Resource site is the Web Accessible Geographic Data Search (WAGS): http://mapsvr.scag.ca.gov/wags/index.cfm?fuseaction= It requires a user-created login and password (although there is a guest login that allows you to bypass this). I'm not sure what kind of difficulty the harvester will encounter with this portion of the site. The interactive atlas also has a create-an-account/guest login issue: http://mapsvr.scag.ca.gov/atlas/presmap.asp?Cmd=INIT. Since it is a dynamic page, I don't know how this will be handled by the harvester."
Site copyright statement: "© 1999-2005 Southern California Association of Governments"
Crawl Results


Comments from crawl operator: Interesting login problem: Heritrix was unable to retrieve the guest login pages. ColdFusion and ASP don't generate new URLs and thus don't get crawled. NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than 1 minute.
Robots.txt file: The site you selected forbids crawlers from gathering certain data. It reads:

User-agent: *
Disallow: /_mm/
Disallow: /_notes/
Disallow: /_baks/
Disallow: /MMWIP/
User-agent: googlebot
Disallow: /*.csi

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 500 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
2517 863231651 www.scag.ca.gov
690 6134101 www.metrolinktrains.com
506 40063 dns
428 1084533 www.cacities.org
397 16161513 www.sce.com
196 581022 bookstore.icma.org
187 4505985 www.icma.org
175 7757737 www.ci.seal-beach.ca.us
158 1504151 www.h2ouse.org
149 940692 www.healthebay.org
137 317748 www.ci.pico-rivera.ca.us
130 18259431 www.ci.ventura.ca.us
123 490154 www.chinohills.org
121 406068 www.lakewoodcity.org
119 203542 www.lavote.net
117 2449995 www.ci.malibu.ca.us
114 744410 www.ci.irvine.ca.us
113 368023 www.whitehouse.gov
109 974674 www.dot.ca.gov
107 892192 www.lacanadaflintridge.com
[list truncated…]

Curator Feedback to CDL (Kasianovitz SCAG)

Crawl Success: mostly effective


Crawl Success Comments: Similar to my comments about the Los Angeles Dept. of City Planning - the crawl brought back a lot of webpages, but not all the publications/documents that I would want to collect. Again, the same problem happened with SCAG as happened with the EIR example: the crawl brought back html pages with links to reports (typically in PDF format) - but the actual documents were not captured. While the webpage is helpful, as it gives context, the main content that I'd want to capture (the reports) was not captured. See the following for example - none of the webpages linked from this page are available; they should link to a page that will have the material. I tried searching for the documents separately and couldn't get to them. See http://www.scag.ca.gov/publications/index.htm (the timeline arrows at the top seemed to function; I'm not sure what this is for) and http://www.scag.ca.gov/livable/pubs.htm. I was impressed to find that zip files were captured, and I was able to download them. Unfortunately, when I opened them there wasn't any content (I did the same search by mistake with the Arizona Dept. of Water Resources and actually found content in the folders); I found 10 with the search type:zip. The gif or jpg images retrieved are not useful - most were just bars or bullets or covers of reports (although this might be helpful to identify titles, I think I would end up discarding these -- after doing more checking of the results).
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments from LA Dept. of City Planning. Restricted gets me to the relevant materials for that agency; via brings back too many main webpages for other agencies to be useful.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
I expected that there would be a problem with the content behind logins. The crawl confirmed that material behind login screens couldn't be captured. Can I get a copy of these crawl results?
NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than one minute.
Crawl Frequency: monthly
Questions/Comments about crawl: How to handle the copyright issue? For the login information, I'm not sure what all was blocked by the robots file. Interesting that metrolinktrains.com is the #1 related host for both of my crawled sites.


Linda Kennedy: California Bay Delta Authority
CDL Report to Curator

URL: http://calwater.ca.gov
Curator's original comments: "We are interested in the environmental impact statements and other key documents, and the various news releases and other announcements and archives of CALFED."
Site copyright statement: "© 2001 CALFED Bay-Delta Program"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 519 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
1130 473192247 calwater.ca.gov
741 201538533 www.parks.ca.gov
521 40442 dns
373 51291934 solicitation.calwater.ca.gov
242 78913513 www.calwater.ca.gov
225 410972 cwea.org
209 87556344 www.science.calwater.ca.gov
173 109807146 science.calwater.ca.gov
172 1160607 www.adobe.com
129 517834 www.whitehouse.gov
[list truncated…]

Curator Feedback to CDL (Kennedy CALFED)


Crawl Success: mostly effective
Crawl Success Comments: I looked closely at the CALFED home page (http://calwater.ca.gov), Tribal Home page (http://calwater.ca.gov/Tribal/Tribal_Home.shtml), Key Documents (http://calwater.ca.gov/CALFEDDocuments/CALFEDDocuments.shtml), and Archives page (http://calwater.ca.gov/Archives/Archives.shtml). The crawl did not complete in either the via or non-via search. Nearly all linked pages were retrieved in the non-via search. However, the retrievals from the via search were much less complete than the retrievals from the non-via search. For example, on the Key Documents page there were 3 missing links from the non-via search but 14 missing links from the via search. When Adobe documents were retrieved from either crawl, they came up correctly. Three .asp links of tribal maps from the Tribal Home page were retrieved by the non-via search but not the via search. A few of the images were missing from the displays, and this was also affected by the browser used. Usually the same image was missing from both crawls, but sometimes the images were more complete in the non-via crawl retrievals. There were some display problems with the right-hand menu boxes on the http://calwater.ca.gov/Tribal/Tribal_Home.shtml page, for example, that did not display correctly when viewed via Firefox. The same page viewed correctly on Internet Explorer, but when printed out the boxes printed out incorrectly, just as viewed on the Firefox browser. Grant Opportunities: http://calwater.ca.gov/GrantOpportunities/GrantInformation.shtml - this link did not work in 2 via and 2 non-via instances (from the Tribal home page and from the Archives page) but did work on one non-via crawl page (the CALFED home page). It could also be searched and retrieved directly from the test crawl search page.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments above. The non-via search was substantially more complete.
Crawl Frequency: monthly


Janet Martorana: Santa Barbara County Department of Planning and Development
CDL Report to Curator

URL: http://www.countyofsb.org/plandev/default.htm
Curator's original comments: None provided.
Site copyright statement: No copyright information found.
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 487 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
3119 1102414495 www.countyofsb.org
485 34416 dns
428 1083047 www.cacities.org
357 6126453 www.sbcphd.org
320 6203035 icma.org
250 438507 www.sbcourts.org
234 1110744 vortex.accuweather.com
200 593112 bookstore.icma.org
[list truncated…]

Curator Feedback to CDL (Martorana SBCD)

Crawl Success: mostly effective


Crawl Success Comments: Most documents I expected to find were captured, but a number were not. For example, off this page, http://www.countyofsb.org/plandev/comp/threeyear/2005-2008/default.html, I expected to get to the final work program, http://www.countyofsb.org/plandev/pdf/comp/programs/Three_Year_WP/2005-2008_3YrWrkProgram.pdf, but got the "Sorry, no documents with the given uri were found" message. Other examples, within the Energy Division, a part of the Planning & Development Dept.: Off this page, http://www.countyofsb.org/energy/information.asp, I could access all links except for two: http://www.countyofsb.org/energy/information/oil&GasFields.asp (Oil and Gas Fields) and http://www.countyofsb.org/energy/information/oil&GasProduction.asp (Oil and Gas Production). The crawler seemed to cut off the URL right before the ampersand; perhaps it has problems with ampersands.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly
Questions/Comments about crawl: I'd like to be notified when I've navigated away from the captured site. I found myself on realtime web sites but couldn't tell what was on the captured site and what was real. Example: When I did a search on "santa barbara" I got 1528 results using the via results. I went to the County of Santa Barbara Online site, and then, after clicking on links, I was on the web realtime and not in the crawl results database any longer, yet there were no indications that I had left the crawled database. The WERA uri was still displaying at the top of the screen. I couldn't tell which were the captured sites and which were the current realtime sites. Other observations: the webpage navigation doesn't work, e.g. Table of Contents doesn't jump to that section on the webpage (www.countyofsb.org/energy/projects/shell.asp and www.countyofsb.org/energy/mitigation/oakProject.asp); links to glossary terms go to the glossary but not to the term itself.


Lucia Orlando: Monterey Bay National Marine Sanctuary
CDL Report to Curator

URL: http://montereybay.noaa.gov
Curator's original comments: None provided.
Site copyright statement: No copyright information found.
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 795 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
5272 468755541 montereybay.noaa.gov
861 61141 dns
554 20831035 www.wunderground.com
368 4718168 montereybay.nos.noaa.gov
282 3682907 www.oceanfutures.org
273 10146417 www.mbnms-simon.org
260 7159780 www.mbayaq.org
163 61399 bcus.yahoo.com
152 1273085 www.mbari.org
146 710203 www.monterey.com
119 3474881 www.rsis.com
119 279531 www.steinbeck.org
118 1092484 bonita.mbnms.nos.noaa.gov
109 924184 www.duke.edu
104 336986 www.montereybayaquarium.org
103 595953 icons.wunderground.com
102 339589 www.uncw.edu
[list truncated…]

Curator Feedback to CDL (Orlando, Monterey Bay)

Crawl Success: (rating not provided)
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: This site contains a large mix of government policy, laws, and regulatory information, as well as links to recreational and educational activities available in the MBNMS. I was most interested in links to laws/regs/policy and educational info and organizations. I thought the links restricted to the original host best captured this information succinctly.
Crawl Frequency: unknown


Richard Pearce-Moses: Arizona Department of Water Resources
CDL Report to Curator

URL: http://www.azwater.gov
Curator's original comments: (redirects to http://www.azwater.gov/dwr) In arid Arizona, water is one of the most important - and most contested - resources. The publications and records of this Department are of critical value to the state. Our spider can get many files from this site (1474 files in 258 directories). We are mostly interested in documents by programs. Although our spider can't get the imaged documents database (http://www.azwater.gov/dwr/Content/ImagedRecords/default.htm), this directory may not be critical if we can get the imaged documents transferred to us. We are not interested in blank forms and applications.
Site copyright statement: "Copyright © 1998 - 2005 Arizona Department of Water Resources and ADWR Network. All Rights Reserved."
Crawl Results

Questions for curator: Did this capture the documents you needed?


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 195 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
2233 988447782 www.azwater.gov
286 2350888 www.water.az.gov
253 4587125 www.groundwater.org
226 3093331 www.azcentral.com
196 15626 dns
178 395216 www.macromedia.com
128 1679057 www.prescott.edu
123 947183 www.azleg.state.az.us
115 792968 www.usda.gov
[List truncated…]

Curator Feedback to CDL (Pearce-Moses AZWater)

Crawl Success: (not provided)
Crawl Success Comments: We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand it, wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can. That may be the difference. The crawl is listed as not completing, but it appears to be very close based on the total number of bytes downloaded.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Questions/Comments about crawl: It seemed odd that it took 2h 54m to crawl 7709 documents with linked hosts, but 4h 4m to crawl only 4888 docs when the spider was restricted to the original host.


Richard Pearce-Moses: Citizens Clean Elections Commission
CDL Report to Curator

URL: http://www.ccec.state.az.us/ccecscr/home.asp
Curator's original comments: This commission was established by initiative. Its work is of great historical significance, as it is changing the way the public elects officials. We have not been able to spider this site because links are buried in JavaScript. (We use wget as our spider.) We are primarily interested in acquiring their publications, election data, and things listed under "popular links."
Site copyright statement: "Copyright 2004 Arizona Citizens Clean Elections Commission. All Rights Reserved."
Crawl Results

Comments from crawl operator: JavaScript issue; interesting problem; need curator feedback about what we captured.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, the following additional sites were crawled. Total: 15


[urls] [bytes] [host]
929 95456563 www.ccec.state.az.us
76 6117977 www.azcleanelections.gov
55 513218 az.gov
49 499337 www.governor.state.az.us
44 174903 www.adobe.com
40 141202 www.azleg.state.az.us
31 18549 www.az.gov
28 202755 www.azsos.gov
23 462603 gita.state.az.us
19 213976 www.benefitoptions.az.gov
17 89612 www.azredistricting.org
14 1385 dns
3 1687 wwwimages.adobe.com
2 1850 www.capitolrideshare.com
2 26438 www.ftc.gov

Curator Feedback to CDL (Pearce-Moses CCEC)

Crawl Success: (not provided)
Crawl Success Comments: We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with wget.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)


Juri Stratford: City of Davis
CDL Report to Curator

URL: http://www.city.davis.ca.us
Curator's original comments: "We are primarily interested in the GIS data produced for the City of Davis. As the GIS data represent the largest part of the City of Davis web site, it may not be much more difficult to archive the site at the top level. Mapping and Geographic Information Systems URL: http://www.city.davis.ca.us/gis/"
Site copyright statement: "This web site is Copyright © 2004 by the City of Davis. All Rights Reserved. The City retains the copyright on all text, graphic images, and other content of this site. You may not copy, modify and/or re-use text, images or other web content from this web site, distribute the City's web content, mirror content from this web site on a non-City server, or make any other use of the content of this web site that would violate the City's copyright without written permission from the City of Davis. To the extent allowed by law, commercial use of our web material is prohibited without written permission from the City of Davis. All art work shown on these web pages is protected by US Copyright laws. Limited reproduction for non-commercial purposes can be authorized by the City of Davis provided that requests are approved prior to use. Contact the Community Development Department Cultural Services program staff at (530) 757-5610 for more information. Some content included in this web site may be provided courtesy of third parties who may retain copyright control of the provided material. Any service marks and trademarks contained herein are the property of their respective owners."
Crawl Results


Comments from crawl operator: "GIS: potential issue: /img disallowed by robots.txt, e.g. http://www.city.davis.ca.us/img/featured/map-static.jpg can't be retrieved; also some maps on a second server are disallowed. Need feedback about the GIS material that was captured: what was captured that is useful? Much duplication -- pages captured repeatedly."
Robots.txt: The site you selected prohibits crawlers from collecting certain documents. The file reads:

User-agent: *
Disallow: /img
Disallow: /calendar
Disallow: /misc/email.cfm
Disallow: /edbusiness
Disallow: /gis/oldmap
Disallow: /police/log
Disallow: /pcs/grants/sacog
Disallow: /jobs/listings
Disallow: /css
Disallow: /pcs/nutcracker/history.cfm
Disallow: /pcs/nutcracker/pdfs
User-agent: asterias
Disallow: /
User-agent: gigabot
Disallow: /

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 420 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
16455 947871325 www.city.davis.ca.us
420 29555 dns
332 10377948 www.asucd.ucdavis.edu
305 33270715 selectree.calpoly.edu
279 3815103 www.w3.org
161 2027740 www.cr.nps.gov
139 941939 www.comcast.com
133 951815 www.yolocounty.org
[List truncated…]

Curator Feedback to CDL (Stratford Davis)

Crawl Success: mostly effective
Crawl Success Comments: Looking at the GIS Online Maps page, it's not clear which formats were retrieved and which were not. For example, the Growth Map Flash file downloads fine, but the Flash/ArcIMS files do not download.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In general, restricted to original host works better. The broader search includes pages from businesses, e.g. Comcast, and other local agencies, e.g. other local and state government sites. But restricting the outside sites to the first level seems to be a good compromise.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
In general, it looked like it did a good job pulling down geographic data/images. For example, you can pull down data from http://www.city.davis.ca.us/gis/library. It's difficult for me to get a sense of the level of duplication from the way the search results display.
Crawl Frequency: monthly
Questions/Comments about crawl: I like the ability to navigate within the archive once you have a starting point. This is very nice.


Yvonne Wilson: Orange County Sanitation District
CDL Report to Curator

URL: http://www.ocsd.com/
Curator's original comments: "This is an Orange County agency that has small reports and complex planning documents on its web site."
Site copyright statement: "Copyright © 2001-2005 Orange County Sanitation District. Unless a copyright is indicated, the information on this site is freely available for non-commercial, non-profit making use. If a copyright is indicated on any materials displayed on our website, permission to copy these materials must be obtained from the original source. Commercial use of District materials is expressly prohibited without the written permission of the OCSD."
Crawl Results

Comments from crawl operator: In both crawl settings we had to limit the maximum number of retry attempts in order to complete the crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 85 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
755 85943567 www.ocsd.com
164 7635257 www.ci.seal-beach.ca.us
122 809190 www.ci.irvine.ca.us
95 169207 epa.gov
86 7673 dns


85 559125 order.e-arc.com
66 840581 www.ci.huntington-beach.ca.us
62 213476 www.cityoforange.org
57 313579 www.epa.gov
55 4477820 www.villapark.org
50 1843748 www.cityoflapalma.org
50 463285 www.ocbinc.com
[List truncated…]

Curator Feedback to CDL (Wilson, OCSD)

Crawl Success: somewhat effective
Crawl Success Comments: Using the WERA, I searched inquiries by type and title in the two OCSD collections, plain and via. I received no hits for pdf, only the homepage for html, and three hits for text. There are many pdf sections in the EIRs. I next searched by title in the two collections; I was the most successful in via. By searching the titles "carbon canyon" and "Ellis Ave Pumping Station" I found articles, but not the EIRs, which are available full text. At this point I presumed the crawl did not drill down far enough. Then I searched for a secondary page entitled "Ocean Monitoring"; this time the search found only an internal letter and memo, but not all the documents related to this topic. Via collection search seems to be the most productive, but it is not consistent.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: I searched some of the outside links to US Marine Fisheries and EPA Beach Watch and received no hits.
Crawl Frequency: monthly


Crawl Report Key: Web-at-Risk Test Crawls
This document is a guide to the test crawl report, providing further information about some of the statistics and results conveyed there. Your report includes basic information about the site you submitted, your original comments about that site, and any copyright statements CDL found when we examined the site. Your site was crawled using the Heritrix crawler, version 1.5.1. Your report will include the following information about the site.

Item: Explanation

Crawl Settings: We crawled each site in two different ways: A) Linked hosts included; B) Restricted to original host. Where linked hosts were included, we set the crawler to gather any outside page that your nominated site linked to, but no further. So if your site linked to a single document from whitehouse.gov, we captured that document but did not crawl any further on the whitehouse.gov site. Your report will show results from both styles of crawling to give you a sense of whether or not the site draws heavily from valuable materials on another site. You will also receive a list of the other hosts that site linked to and how many documents were gathered from those hosts.

Robots.txt file: The presence of a robots.txt file means that the content provider is asking us to refrain from crawling either all or part of the site. This refers only to the host you named in your crawl request. In some cases the site had a robots.txt file but it didn't say anything; we noted when this occurred. We obeyed robots.txt instructions for these crawls, so if the site contained one but we still got a result, that means the robots.txt file only prevented us from crawling certain areas of the site. When robots.txt files were present, we have included the text of that file in your report so you can see which segments of the site the site owner wants to protect.

Crawl duration

Total number of documents: The "Documents" count will include page components (such as images or flash files).

File types (mime types): This area will contain a URL. When you go to that URL you will see a list of the different file types that were retrieved as part of the crawl. IMPORTANT: The Heritrix crawler is currently experiencing difficulty with this report and it is missing a crucial column. Until that is fixed, you can see the different file types retrieved, from most common to least, but you cannot yet tell how many files each one included. CDL will contact you when this report has been fixed.

Response code reports: The URL in this column will lead to a list of response codes in order by frequency. This will include "200" for files that were successfully captured and error codes for files that were not captured. The error code list includes some codes specific to Heritrix; the key to interpreting these codes is at http://crawler.archive.org/articles/user_manual.html#statuscodes. Note that this report only gives you quantitative information about response codes; it does not link response codes to specific files. For these details see "hosts report and crawl log" below.

How much data collected (bytes): The file size of the total crawl is reported in bytes. You can use the byte conversion tool at http://www.techtutorials.net/reference/byteconverters.html if you want to recalculate the size of the crawl in another measurement, such as kilobytes or megabytes.

Did crawl complete: This will say "no" if the crawl results exceeded 1 gigabyte, or if the crawler encountered an obstacle to capturing the site that could not be fixed.

Location of hosts report and crawl log: You are welcome to review the page by page details of the crawl log. This is a generic report that comes with the Heritrix crawler and is not terribly user friendly, but provides the most detail about the crawl process. This report will list every file that the crawler attempted to get and provide some information about each file. The Heritrix manual can help you interpret this report: http://crawler.archive.org/articles/user_manual.html#logs (go to section 8.2.1, Crawl Log).

Comments from Crawl Operator: These are observations that the Web Archive Programmer made about the crawl process for your site.

Questions for Curator: This section does not appear in every report. If you have a question listed here, please respond to it in the Test Crawl Feedback form.

Your Collection: Important: The screens you will see do not represent the final user interface for the Web Archiving Service tools. We are using WERA, an open-source search and display tool, only to show you your test crawl results. Because we did not seek the right to redistribute these documents, these pages are available only for the purpose of analyzing crawler effectiveness. You must have a password to view these pages; your report will include the address of a wiki page and a login and password. Each site was crawled twice: plain crawl = only pages from the original site were collected; via = pages from the original site, as well as pages that site links to, were collected. Unfortunately you cannot simply browse your site; you must select a collection and type a search. You will be able to navigate throughout your site once you load a page containing links. You will be able to review your colleagues' sites as well. Note that the WERA display tool is not perfect: if the same document was gathered from more than one crawl, it may not display in every collection.

Related Hosts Crawled: This section provides further information about the additional materials that were gathered when we set the crawler to include documents that your site links to. This can be critical in deciding what settings are needed to capture your site. Some sites, for instance, will keep all of their pdf or image files on a separate server. If you don't allow the crawler to move away from the original URI, you won't capture a critical portion of the site's content. In other cases, however, this setting will lead to irrelevant information. This report includes the most commonly linked hosts from your site.
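Until the file types report is repaired, the counts described above can also be pulled directly from the crawl log. The sketch below is a minimal example, assuming the whitespace-delimited crawl.log layout described in section 8.2.1 of the Heritrix manual (timestamp, status code, document size, URI, discovery path, referrer, MIME type, ...); it is illustrative and not part of the delivered reports.

    # Sketch: tally response codes and MIME types from a Heritrix crawl.log,
    # assuming the whitespace-delimited layout described in the Heritrix manual
    # (timestamp, status code, size, URI, discovery path, referrer, MIME type, ...).
    from collections import Counter

    codes, mimes, total_bytes = Counter(), Counter(), 0

    with open("crawl.log") as log:
        for line in log:
            fields = line.split()
            if len(fields) < 7:
                continue                   # skip malformed or truncated lines
            status, size, mime = fields[1], fields[2], fields[6]
            codes[status] += 1
            mimes[mime] += 1
            if size.isdigit():             # failed fetches record "-" for size
                total_bytes += int(size)

    print("Response codes:", codes.most_common())
    print("MIME types:", mimes.most_common(10))
    print("Total collected: %.1f MB" % (total_bytes / (1024 * 1024)))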



Underlying the issue of crawl scope is the deeper question of what an archivist hopes to capture when a site is crawled. Is it just a list of particular documents? Or is it a faithful recreation of the site as it existed on that day? It may be that a captured site has content of primary and secondary importance. The primary content is what should be retrieved when searching against the archive, while the secondary content should only be present to avoid error messages and establish the site's full original context. Another approach is suggested by this curator's response:

The results on the multiple hosts crawl is mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains rather than "source + 1".

This suggests a need for the ability to link seed URIs as related components of a single site.
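A hypothetical sketch of how such a grouping might work in scope logic: hosts the curator has declared to be parts of the same site are crawled at any depth, while pages on all other hosts are held to the current "source + 1" behavior. The host names below are illustrative only, not an actual crawl definition.

    # Sketch: a scope rule that treats curator-linked seed hosts as one "site",
    # while any other host is only captured one hop away from an in-scope page.
    # Host names are illustrative.
    from urllib.parse import urlparse

    RELATED_HOSTS = {"www.city.davis.ca.us", "maps.city.davis.ca.us"}  # hypothetical grouping

    def in_scope(url: str, hops_from_in_scope_page: int) -> bool:
        host = urlparse(url).netloc.lower()
        if host in RELATED_HOSTS:
            return True                      # related hosts crawled to any depth
        return hops_from_in_scope_page <= 1  # everything else: "source + 1" only

    print(in_scope("http://maps.city.davis.ca.us/gis/parcels", 5))  # True
    print(in_scope("http://www.comcast.com/some/page", 1))          # True (one hop)
    print(in_scope("http://www.comcast.com/deeper/page", 2))        # False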

Communication / Reports
When we reported the test results back to curators, we provided a synopsis of the crawl results, links to particular Heritrix reports, and to the WERA display interface. The Heritrix reports are all plain text, providing tables of MIME type or response code frequency. Attempting to integrate these reports and the display of the archived results is a challenge. One curator, for example, obtained documents from over 200 hosts in the "linked hosts included" crawl, but was only aware of having found 10 additional documents when reviewing these same search results in WERA. Although WERA is helpful for seeing results from an end user's perspective, it does not provide adequate tools for analysis. In some cases this is simply because WERA is a new and occasionally buggy tool. It is possible, for instance, to follow occasional links out of the archive and into "real-time" sites. In some cases it's also possible to browse to a page and display it, but when you search for that same page by its URI, WERA does not find anything. One curator notes:

We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE -- the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.

It is also important to note that although WERA was used for the purpose of reporting test crawl results, it is not envisioned as the final display interface for the Web Archiving Service. Even so, the feedback the curators provide about WERA should inform the functionality of the WAS interface. Clearly, it is still quite a struggle for curators to determine exactly what a crawl retrieved. One curator reports:

After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report there are 8899 documents in this crawl alone. My other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls.

Crawl Frequency
When asked how frequently they wanted to crawl sites, curators responded with a variety of preferences:

[Chart: Desired Crawl Frequency -- number of curators requesting Daily, Weekly, Monthly, Once, or Unknown]

Again, it is worth considering precisely what curators hope to capture in a repeated crawl of a site. Some insight is provided by these curators' comments:

We hope the crawler will be able to report when new publication files are posted on the web site.

And:

The ability to report on new publications is critical to our goal of using the crawler as a discovery tool.

As with the other NDIIPP grants, the purpose of Web Archiving Service tools will be for archiving and preservation, not for resource discovery. This indicates that we should further investigate what a "weekly" or "monthly" crawl really means to curators. If a site was not updated over the course of a year, would the curator want to continue running weekly crawls of the site? Would the curator want each crawl to appear on an archive timeline for that site, even if the content was no different for each date? One curator comments:

I want to qualify the frequency for this site. I'd like to do a monthly crawl for three-four months. I'd want [to] reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot [of] new content being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation.
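One way to ground that decision would be a report on how much of a site actually changed between two scheduled crawls. A rough sketch follows, assuming per-URI content digests are available for each crawl (for example, from the crawl log); the data shown is invented for illustration.

    # Sketch: estimate how much a site changed between two crawls by comparing
    # per-URI content digests (e.g., the digest values recorded in the crawl log).
    def change_summary(old: dict, new: dict) -> dict:
        """old/new map URI -> content digest for two successive crawls."""
        added    = [u for u in new if u not in old]
        removed  = [u for u in old if u not in new]
        modified = [u for u in new if u in old and new[u] != old[u]]
        return {"added": len(added), "removed": len(removed),
                "modified": len(modified),
                "unchanged": len(new) - len(added) - len(modified)}

    october  = {"/index.html": "sha1:aaa", "/eir/notice1.pdf": "sha1:bbb"}
    november = {"/index.html": "sha1:aaa", "/eir/notice1.pdf": "sha1:ccc",
                "/eir/notice2.pdf": "sha1:ddd"}
    print(change_summary(october, november))
    # {'added': 1, 'removed': 0, 'modified': 1, 'unchanged': 1}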


Language and Web Site Models
One of the challenges of communicating and interpreting crawl results is that crawlers don't necessarily work the way people envision them to. Further, the appearance of a web site on a screen and its architecture on a server may be quite different: the directory structure of a site may have no relationship to the way its navigation is organized on a screen. One frequent point of confusion curators encountered while interpreting crawl results is the concept of how many "levels down" the crawler went. One curator requested that we "drill down several levels (at least 3)" in our capture. One challenge with this request is that "levels down" can be interpreted to mean different things. In some cases curators clearly mean the directory structure of the web site on the remote server. In other cases, however, they seem to mean the number of hosts away from the original site:

In general, restricted to original host works better. The broader search includes pages from businesses (e.g. Comcast) and other local agencies (e.g. other local and state government sites). But restricting the outside sites to the first level seems to be a good compromise.

If the curator is referring to a structure of subdirectories, the next challenge is that the Heritrix crawler does not work this way. Heritrix ignores the site's directory structure and instead follows links from the seed URI it is provided. The Heritrix manual specifically defines "depth" in this way:

This scope allows for limiting the depth of a crawl (how many links away Heritrix should crawl), but does not impose any limits on the hosts, domains, or URI paths crawled.6
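The distinction can be made concrete with a small sketch using hypothetical URLs: the depth Heritrix limits is the number of link hops from the seed, which is independent of how many directory levels deep a document sits on the server.

    # Sketch: "levels down" can mean two different things for the same document.
    from urllib.parse import urlparse

    def directory_depth(url: str) -> int:
        """How many path segments below the host root the document sits."""
        path = urlparse(url).path
        return len([seg for seg in path.split("/") if seg])

    # Hypothetical example: a PDF nested several directories deep that is
    # nonetheless linked directly from the seed's front page.
    doc = "http://www.example.gov/planning/eir/2005/report.pdf"
    link_hops_from_seed = 1          # seed page -> report.pdf

    print("directory depth:", directory_depth(doc))   # 4
    print("link hops:", link_hops_from_seed)          # 1 -- this is what Heritrix limits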

Multimedia
Some curators chose sites because of the value of their multimedia resources. The crawler yielded mixed results in capturing these resources. For one site:

• A text search on the log file turns up numerous .ram files, only one .ppt file.
• .asx files are Windows streaming media redirector files, which generally lead to associated .wmf files. No .asx references appear in the crawl log, nor do any .wmf files.
• .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files.
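One likely reason such files are missed is that the .rm and .wmf documents are referenced only inside the .smil and .asx metafiles rather than in ordinary HTML links. A rough sketch of a post-processing step that pulls those references out of captured metafiles, so the media they point to could be queued in a follow-up crawl, is shown below; the patterns and file names are illustrative.

    # Sketch: pull media URLs out of captured .smil/.asx metafiles so the
    # streaming files they point to (.rm, .wmv, .wmf, ...) can be queued later.
    # File names and patterns are illustrative.
    import re

    def media_references(metafile_text: str) -> list:
        # .smil uses src="..." attributes; .asx uses <ref href="..."/> entries.
        refs = re.findall(r'(?:src|href)\s*=\s*"([^"]+)"', metafile_text, re.IGNORECASE)
        return [r for r in refs
                if r.lower().endswith((".rm", ".ra", ".wmv", ".wmf", ".asf"))]

    sample_smil = '<smil><body><video src="rtsp://media.example.gov/briefing.rm"/></body></smil>'
    sample_asx  = '<asx version="3.0"><entry><ref href="http://media.example.gov/briefing.wmv"/></entry></asx>'

    print(media_references(sample_smil))  # ['rtsp://media.example.gov/briefing.rm']
    print(media_references(sample_asx))   # ['http://media.example.gov/briefing.wmv']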

As noted, nearly half the sites crawled reached the 1 gig size limit and so did not complete. This makes it difficult to determine whether there were genuine problems with particular types of files, or if the crawler simply did not get to the missing files before the crawl was stopped.

6 Heritrix User Manual, Section 6.1.1 Crawl Scope: Broad Scope. <http://crawler.archive.org/articles/user_manual.html>

Comparison with Other Crawlers
Three of our curators had previously used other crawlers, such as HTTrack and Wget, to capture the same sites we attempted with Heritrix. Because these curators provided strong details when describing their sites, we crawled all of the sites they sent us (six sites). In one case, the "Defense Base Closure and Realignment Commission," the curator had greater success capturing .aspx files with HTTrack than we had with Heritrix. Other comparisons:

We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using Wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand, Wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can. That may be the difference.

And

We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with Wget.

Crawl Success
We asked curators to rate the overall success of the test crawls on the following scale:

• Not effective (none of the desired documents were captured)
• Somewhat effective (some of the desired documents were captured)
• Mostly effective (most of the desired documents were captured)
• Effective (all of the desired documents were captured)

[Chart: Crawl Success -- number of curators rating the crawl Effective, Mostly Effective, Somewhat Effective, or Not Effective]

Here are some of the comments curators had concerning the overall success of these tests. The full text of all curator reports and their feedback to CDL is available in Appendix C.


Los Angeles Planning Department

The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all pdfs. When searching specifically for EIRs I got a large result list (699 citations); however [sic], when I investigated whether or not the actual file was captured, I found that the main EIR page was captured, typically a htm/html file, but when I clicked on a link to get to the full report, all I got was the "Sorry, no Documents w/the given URI were found." This could be that the file was no longer available when the site was harvested. However, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved.

City of San Diego Planning Department
This comment is from a curator who is filling in for the person who originated the test crawl request, so she is also attempting to interpret what that other person was hoping to capture.

Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit.

Defense Base Closure and Realignment Commission

I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured; no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured; none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1GB, but I think that more documents could have been captured.

Public Policy Institute of California


There are some problems with the functionality of captured pages: 1) http://www.ppic.org/main/home.asp -- The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp -- The radio button selections are not functional. For example, clicking on "Date" retrieves a message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."

AFL-CIO

I realize the collection interface is a "work in progress" and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g. "Working Families Toolkit," "BushWatch") that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2 million-plus hits per year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weakness.

Conclusions
The process of running these test crawls has been valuable. It has resulted in a number of lessons learned and further directions for inquiry. It is clear that the tools that help curators analyze the results of these crawls will have to do more than mimic the original browsing context or supply tables of data. We will be looking for any opportunity to improve WAS reporting capabilities as we move forward with the project. The differences encountered in language and visualization raise the importance of a strong, intuitive design for the curator tools and for clear help screens. Each of us may visualize web sites differently, and the crawler may behave differently than we expect. A certain degree of online help will be needed to design crawls effectively, and further guidance should be available to help people interpret crawl results when those results don't match what the person anticipated.


The help and documentation for the Web Archiving Service will also need to address the rights analysis issues raised above. In most cases this is work that is done prior to issuing crawls; it cannot be addressed by the design of the WAS interface alone. When the Web-at-Risk project reaches the point of conducting usability studies, we should be sure to include tests that further uncover how users understand crawl frequency settings. Additionally, the desire for an analysis tool that can convey when a site has changed significantly is not limited to this project; it is a common issue faced by the Internet Archive, members of the International Internet Preservation Consortium, and others. CDL is actively communicating with these organizations as we all work toward a solution for this problem. Similarly, CDL should ensure that future assessment and analysis work with our curators addresses the issues raised by the two crawl scope settings. This inquiry should also extend to participants who exemplify end users, to determine if people using web archives experience crawl scope differently than people who build them. Finally, a number of lessons were learned via the Katrina crawl described in Appendix B. The most outstanding finding is that event-based crawls such as Katrina and site-specific crawls such as these have quite different characteristics and require different functionality and analysis tools.

Next Steps
There is certainly more to learn from these crawl results and about Heritrix; in some cases it is still unclear why the crawler failed to retrieve certain documents. The curators' feedback concerning these results has been extremely valuable. They have provided insight as to what was captured and what is still missing, which would have been difficult to determine without their subject expertise in the sites chosen. As we continue working to improve crawler success and performance, we will turn our attention to the Virtual Remote Control site created by Cornell. This site was developed to present particular problems to crawlers and is well documented, enabling the user to gauge a crawler's results. We will be using that site to replicate particular problems raised in our test crawl set and may also reattempt some of the sites crawled for these tests. When we release the first version of the Web Archiving Service to curators in July 2006, we will request that they include their original test site in the crawls they attempt and compare the results with these tests. The test results for these crawls will remain in place as a point of comparison while we continue developing the Web Archiving Service. Additionally, certain issues raised by this feedback, particularly the desired scope and frequency of crawls, merit attention in future usability analysis work.


Web-at-Risk Test Crawl Report: Appendix A Sites Submitted

Curator | Site | Crawled
Sherry DeDekker | http://ca.water.usgs.gov -- California Water Science Center | Yes
Sherry DeDekker | http://www.dwr.water.ca.gov -- California Department of Water Resources |
Peter Filardo and Michael Nash | http://www.nycclc.org -- New York City Central Labor Council | Yes
Peter Filardo and Michael Nash | http://www.dsausa.org -- Democratic Socialists of America |
Valerie Glenn and Arelene Weibel | http://www.strengtheningsocialsecurity.gov -- Strengthening Social Security | Yes
Valerie Glenn and Arelene Weibel | http://www.brac.gov -- The Defense Base Closure and Realignment Commission | Yes
Gabriela Gray | http://www.joinarnold.com -- Join Arnold | Yes
Gabriela Gray | http://www.antonio2005.com -- Mayor-elect Antonio Villaraigosa | Yes
Ron Heckart and Nick Robinson | http://www.ppic.org -- Public Policy Institute of California | Yes
Ron Heckart and Nick Robinson | http://www.cbp.org -- California Budget Project |
Terrence Huwe | http://www.aflcio.org -- AFL-CIO | Yes
Terrence Huwe | http://www.seiu.org -- Service Employees International Union |
James Jacobs | http://www.sandiego.gov/planning -- City of San Diego Planning Department (analyzed by Megan Dreger) | Yes
James Jacobs | http://www.sandag.org -- San Diego Association of Governments |
Kris Kasianovitz | http://cityplanning.lacity.org -- Los Angeles Department of City Planning | Yes
Kris Kasianovitz | http://www.scag.ca.gov -- Southern California Association of Governments | Yes
Linda Kennedy | http://calwater.ca.gov -- California Bay-Delta Authority (CALFED) | Yes
Linda Kennedy | http://www.dfg.ca.gov -- California Department of Fish and Game |
Ann Latta | http://www.ucmerced.edu -- UC Merced (analyzed by Elizabeth Cowell) | Yes
Ann Latta | http://www.coastal.ca.gov/web/ -- California Coastal Commission |
Janet Martorana | http://www.countyofsb.org/plandev/default.htm -- Santa Barbara County Department of Planning and Development | Yes
Janet Martorana | http://www.sbcag.org -- Santa Barbara County Association of Governments |
Lucia Orlando | http://montereybay.noaa.gov -- Monterey Bay National Marine Sanctuary | Yes
Lucia Orlando | http://www.waterboards.ca.gov/centralcoast -- Central Coast Regional Water Quality Control Board |
Richard Pearce-Moses | http://www.azwater.gov -- Arizona Department of Water Resources | Yes
Richard Pearce-Moses | http://www.ccec.state.az.us/ccecscr/home.asp -- Citizen's Clean Election Commission | Yes
Juri Stratford | http://www.city.davis.ca.us -- City of Davis, California | Yes
Juri Stratford | http://www.sacog.org -- Sacramento Area Council of Governments |
Yvonne Wilson | http://www.ocsd.com -- The Orange County Sanitation District | Yes

Web-at-Risk Test Crawl Report: Appendix B The Katrina Crawl

The Crawl
During the early Fall of 2005, the California Digital Library was embarking on a series of test crawls as part of the Web-at-Risk project when hurricane Katrina struck. CDL had requested that curators submit their sample URLs by August 25th, and on Friday, August 26th, Louisiana Governor Blanco declared a state of emergency. It was over that weekend that CDL staff reached the conclusion that an event of this historic significance would shift our priorities. On Monday we suspended our initial test crawl plans and began preparing on a number of fronts to capture web-based materials related to Katrina. Specifically, this meant:

• identifying which web sites or sub-sites to collect, which involved:
  o notifying 30+ curators of our emergency crawl plans and requesting their assistance
  o setting up a web-based interface to gather their suggested seed URLs
  o sorting through the resulting seed list and feeding it to our primary and secondary (Stanford) crawl operators
• determining our collection parameters
• getting our crawlers installed and configured
• locating and setting up disk space to store the crawls
• initiating and monitoring the crawls

This crawl activity posed several new challenges. It was the first time CDL staff had used the Heritrix crawler, which was not yet installed on a production server when the hurricane struck. The last week of August saw CDL and UC Berkeley staff scrambling to find server space to set up Heritrix instances, disk partitions, and job parameters for the crawl. This was done with remarkable speed, owing to the urgency of the situation and to everyone's desire to capture a record of the events. Given our limited experience, in order to reduce the risk of losing the historically significant and fleeting materials related to hurricane Katrina, CDL worked with Stanford University to concurrently run the same crawl using a different crawler. After both CDL and Stanford had crawled the same seed list for a month, the task was then taken up by the San Diego Supercomputer Center, who have continued crawling these sites using Heritrix.

Gathering the Seeds
The CDL sent out an initial request to the Web-at-Risk curators to submit URLs related to hurricane Katrina for crawling. We worked collaboratively with a large group of content specialists to identify the sites:

• The 22 curators of the Web-at-Risk project (University of California Libraries, University of North Texas, New York University, Stanford University, and the Arizona State Library)
• The Library of Congress
• Librarians at Louisiana State University, University of Mississippi


Over the course of the crawl, the list of seed URLs grew to over 700 (just over 500 of which were crawled by CDL). The image of our input form above provides some sense of the range of materials collected. Given the sudden nature of this event, there was no time to investigate rights issues or technical problems each site might have presented. We informed the curators that "Our immediate plan is to simply collect the material before it disappears. We will not make the material immediately available." We also had little time for quality control and were not able to guarantee that sites in our seed list would be comprehensively crawled.

Crawling Specifics
CDL's first crawl was run on September 1, 2005, using a seed list of 89 URLs. The final crawl run by CDL was on October 10, using a seed list of 589 URLs. This final seed list was then sent to San Diego Supercomputing Center, who are continuing to run twice-weekly crawls.


The CDL crawls were done using Heritrix version 1.5.1. We began with a single instance of Heritrix, eventually dividing the seed list among 6 instances. We got through the entire seed list 29 times in 40 days. However, the content we were able to collect was limited by our crawler settings. Our crawler was configured to visit one host at a time, collecting content at a conservative rate, and never more than three hops away from the seed URL. To insure that the crawler moved through the seed list in a timely way, it was limited to spending 15 minutes at any given host. So the material collected does not represent the entirety of what was available at each site. Note that all of the seeds were crawled with the same configuration, no matter how different the structure of these sites might be. Thus the New Orleans version of Craig's List, NASA's information pages, and blog sites were all crawled in the same manner, despite being quite different in context, architecture, and other characteristics. Given our short preparation time, the goal was to find a crawler configuration that would be a match for the broadest range of Katrina materials. In most cases, whether the seed URL was for a personal blog or for a government agency, the seed itself was not centrally about Katrina. The Katrina information was generally situated on the front page and top levels of each site, so CDL expected that this setting, even with the 15-minute time limit, would at least capture the content nearest to the front pages of the widest variety of sites we could capture. In choosing crawler configuration settings we also faced conflicting goals. On one hand, we wanted to capture as much Katrina content as possible. On the other hand, we were very hesitant to start hitting sites that were providing much needed information at a crucial time (emergency sites, relief sites), thus making sites hard to reach due to our capture activity. Many of these sites were already getting heavily used and perhaps were not running at full capacity. Some were also geographically impacted directly by the hurricane. So we had to choose settings that balanced the need to collect with politeness across a wide range of sites. When the San Diego Supercomputer Center took over the Katrina crawls in October, they revisited the Heritrix configuration settings. The SDSC crawl placed a limit not on the amount of time spent at a site, but on the number of documents to be retrieved overall. Their crawls are being conducted "breadth first," in the sense of gathering pages consecutively across the seed list rather than gathering pages from one site at a time. The Stanford University crawl of the same sites began on September 7th and continued for 30 consecutive days, using the WebVac crawler. While we have statistics concerning the size of the Stanford Katrina collection, it is very difficult to compare the configuration settings and effectiveness of WebVac vs. Heritrix. Further, there is no easy way to display materials gathered with WebVac, so the Stanford Katrina content is stored but is not accessible to us at the moment.
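The politeness trade-off described above amounts to a simple per-host budget. The sketch below models that policy as illustrative logic rather than actual Heritrix configuration; the limits are the ones quoted in this report (three hops from the seed, 15 minutes per host).

    # Sketch: the Katrina crawl policy described above, modeled as a simple
    # admission check. Illustrative logic only, not Heritrix configuration.
    MAX_HOPS_FROM_SEED = 3          # never more than three hops from the seed URL
    MAX_SECONDS_PER_HOST = 15 * 60  # at most 15 minutes spent at any one host

    def should_fetch(hops_from_seed: int, seconds_spent_on_host: float) -> bool:
        if hops_from_seed > MAX_HOPS_FROM_SEED:
            return False            # too far from the seed
        if seconds_spent_on_host >= MAX_SECONDS_PER_HOST:
            return False            # host time budget exhausted; move to next seed
        return True

    print(should_fetch(2, 120.0))   # True
    print(should_fetch(4, 10.0))    # False (too many hops)
    print(should_fetch(1, 1000.0))  # False (15-minute budget used up)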

Katrina Crawl Results
In terms of creating a collection of the Katrina event on the web, we are not satisfied that CDL's crawling efforts were successful. As mentioned, we had to impose time limits on each site visited and were not able to conduct quality control on the URLs that were captured. The total size of the Katrina capture at CDL is 50 GB. This represents 29 completed crawls of the seed list, which grew to over 500 seeds over the 40 days we ran the crawl. It's clear from the byte size alone that we barely skimmed the surface, and indeed the Stanford crawl proves this point: when the Stanford crawl was based on about 370 seeds, it was pulling in about 35 GB a day, or 15 million pages a day. In spite of this, our attempts to capture Katrina news and events did have some very useful and positive outcomes. In terms of test crawling, the experience was quite successful, opening our eyes to both certain obstacles and to new possibilities. Most importantly, it prompted us to consider the nature and demands of event-based crawling, which had not been part of our original test plan. Here are some of the aspects of event-based crawls that Katrina surfaced.

Site Selection and Classification
This type of event demands a deep, collaborative effort to identify material to collect. CDL had to mobilize a large number of curators to make their best guesses at identifying promising sites. We set up a 'crawl seed nomination' web form accessible to the curators from their home institutions and invited the curators to use it to enter their suggested seed URLs. CDL staff did not have time to add much automatic error or duplicate checking features, so this cumbersome work was done by hand. On the other hand, it provided us a first-hand trial of what a more general curator user interface might require. The selection and management of seed lists is critical for sudden event-based crawls. The curators contributing the URLs will not necessarily be well-versed in the topic; in the case of Katrina, curators in California were not uniformly familiar with the Gulf Coast, the towns, the government agencies, etc. In addition, it is difficult to predict which aspects of the event will be of historic, enduring value. Because disk storage was not a pressing issue, it was better to err on the side of a wider net when selecting sites. It became clear that there might be a role for "smart crawling," which would spread a wider net than that provided by human-generated seed lists through the use of automated tools that discovered relevant materials. We were specifically interested in using tools developed at Emory and Cornell that interface with the Rainbow text classification tool.7 We did not have the resources to investigate this for the Katrina crawl, but plan to fold it into future crawling tests. Finally, it is worth examining why the seed list grew continuously throughout the event. Part of this growth is due to the natural increase in the number of pages and sites devoted to the emergency, and having the additional time to identify and add them. However, the nature of the event itself changed over time, starting as a hurricane, then becoming a flood, a massive relocation, and a political and social issue. So the range of relevant sites changed as the event itself took on broader implications. This suggests that site selection is an ongoing process, not strictly an activity undertaken at the beginning of an event.

7 Bergmark, Donna. "Heritrix processor for use with rainbow." <http://groups.yahoo.com/group/archive-crawler/message/1905>
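As a stand-in for a trained classifier such as Rainbow, even a crude keyword-based relevance check conveys the idea behind "smart crawling": candidate pages discovered during the crawl are kept only if their text looks topical. The term list below is illustrative.

    # Sketch: crude topical filtering of discovered pages, standing in for a
    # real text classifier such as Rainbow. The keyword list is illustrative.
    KATRINA_TERMS = {"katrina", "hurricane", "levee", "fema", "evacuation",
                     "new orleans", "superdome", "flood"}

    def looks_relevant(page_text: str, min_hits: int = 2) -> bool:
        text = page_text.lower()
        hits = sum(1 for term in KATRINA_TERMS if term in text)
        return hits >= min_hits

    print(looks_relevant("FEMA opens evacuation shelters as levee breaches widen"))  # True
    print(looks_relevant("City council agenda: zoning variance hearings"))           # False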

Rights, Ownership, and Responsibilities
Our current rights protocol stipulates that when a web page is clearly in the for-profit domain, we must seek permission from the content owners to collect. This restriction applies most notably to news agency sites, which would likely be a critical part of the web-based fallout of any major event. This was certainly the case with Katrina. However, rights negotiation is a time-consuming process. In the case of a planned event such as an election, one is able to anticipate a certain number of sources and take rights management steps before the event occurs. But with emergencies such as Katrina, there is simply no way to secure rights in advance without missing time-sensitive material. Knowing this would be an issue, CDL notified our curators in advance that the Katrina materials we gathered would not be publicly accessible but merely captured and preserved. The complex rights issues behind web archiving, as well as the collaborative environment described above, raise a number of questions:

• Who owns the content?
• Who is responsible for the project?
• Who is responsible for responding if a content owner objects?
• When is it permissible to ignore robots exclusion files?
• How much rights management can be done as part of post-processing for an event crawl?
• Can there be standing agreements with major news agencies to allow for this type of crawl in emergency situations? If so, what constitutes an emergency?

Technical Infrastructure
CDL did not have an adequate technical infrastructure in place at the time of the Katrina crawl. As mentioned above, staff at both CDL and UC Berkeley scrambled to find server space and to install and configure the crawlers. This sudden shift in workload left the system vulnerable and somewhat unstable. In order to start the crawling process as soon as possible, CDL began the project on a temporary server. Making the transition to more permanent storage in the midst of this crawling project was neither easy nor flawless, and a certain amount of data had to be recovered from backup sources.

Information Analysis and Display
The challenge of analyzing crawl results has been described in some detail in our Test Crawl Report. That report cites a curator who found the task of reviewing an 8899 document crawl to be unmanageable. The tools currently available are not at all up to the task of analyzing a large and complex crawl. An event-based crawl is likely to result in massive amounts of data of widely varying quality. The selection of seeds is based on guesses that, given sufficient resources, should be reviewed, refined, and enhanced as the event progresses. Time series data based on changing input parameters represents a kind of moving target that suggests the need to develop new analysis tools. Andreas Paepcke of Stanford University addressed this problem when considering how a social scientist might want to analyze the Katrina materials. Consider the researcher who is interested in finding out how quickly the notion of race entered the public discourse in the aftermath of Katrina. Paepcke suggests:

"For example, the tool could perform word frequency analysis across successive crawls of Katrina sites, so that the user could identify the emergence of word usage such as looter, ninth ward, etc. The social scientist would then define within the tool clusters of words that in the scientist's judgment are direct or indirect occurrence indicators of the concept race (i.e. ninth ward, looters, lacks, poor). The scientist could subsequently interact with the tool at the level of these well-defined concepts. Example command: count occurrences of race within the first three paragraphs of all pages and tabulate the differences across 10 days. We could go further and apply well-known topic categorization algorithms on the data to suggest new concepts as they arise in the series of text streams. For example, the tool might identify an increasing frequency of the term mismanagement beginning in week two after the storm and alert the sociologist to this evidence of a new discourse theme."8

While we don't expect to be able to incorporate this level of analysis into our web archiving tools any time soon, these ideas illustrate a useful direction for archival analysis tools.
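A minimal sketch of the kind of analysis Paepcke describes follows, assuming the text of each crawl is available keyed by date; the concept cluster and sample data are illustrative.

    # Sketch: track how often a researcher-defined concept appears across
    # successive crawls, in the spirit of Paepcke's suggestion. Data illustrative.
    import re

    RACE_CLUSTER = {"ninth ward", "looters", "poor"}   # researcher-defined indicators

    def concept_count(text: str, cluster: set) -> int:
        text = text.lower()
        return sum(len(re.findall(re.escape(term), text)) for term in cluster)

    crawls_by_date = {
        "2005-09-02": "Rescue efforts continue across the city ...",
        "2005-09-09": "Reports from the Ninth Ward describe looters and poor residents stranded ...",
    }

    for date, text in sorted(crawls_by_date.items()):
        print(date, concept_count(text, RACE_CLUSTER))   # 0, then 3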

Conclusions
At the time Hurricane Katrina hit, the Web-at-Risk service requirements and test crawl plan had been written for a more orderly and considered approach to web crawling. Our attempt to capture the web-based aftermath of Katrina highlighted our need to revisit the service requirements and see how well they supported an event-based response. We need to develop a collaborative and robust mechanism to identify and collect seed URLs. When an event happens quickly, it is also important to have in place general crawling guidelines that will result in a well-rounded collection. The technical infrastructure for the Web Archiving Service that we are building for the Web-At-Risk project should be robust enough to handle occasional and sudden bursts of activity. Attempting to alter the infrastructure quickly in reaction to emergency events impairs our responsiveness and leaves the altered systems in a vulnerable state.

8 Paepcke, Andreas, Senior Research Scientist and Director of the Digital Library, Stanford University. Email correspondence with Patricia Cruse, October 26, 2005.


Finally, work needs to continue on both the rights management front and on developing improved web archiving analysis tools, so that the material gathered can be used to its greatest potential.


Web-at-Risk Test Crawl Report: Appendix C Individual Crawl Reports
Included below are the crawl reports provided to individual curators, including their analysis and feedback about those results. Note that a key to interpreting the tables in these crawl reports is provided at the end of this appendix.

Elizabeth Cowell (submitted by Ann Latta): UC Merced
CDL Report to Curator

URL: http://www.ucmerced.edu/
Curator's original comments: "UC Merced is the first research university to be built in the 21st century. The educational and land use issues are significant. Of particular interest is http://www.ucmercedplanning.net. This site addresses major issues of land use - the university is being built on agricultural land. Controversy existed re issues of redevelopment of downtown Merced vs. appropriation of agricultural land - there are major environmental issues focused on endangered species - Educational issues involving faculty job descriptions, student body, etc. are significant because of the economic, ethnic, and cultural diversity of the region."
Site copyright statement: "© 2004 UC Regents"
Crawl Results


Comments from crawl operator: When we set the crawl to include pages from linked sites, the crawler got "trapped" at the Elsevier site. There is JavaScript on that linked page that causes the crawler to continue looking for additional pages on the Elsevier site even when you're only trying to capture a single page. Once we set a limit for the maximum number of retry attempts, the crawl completed. This data is from completed crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 227 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
969 71552369 www.ucmerced.edu
238 2564803 www.ucop.edu
226 14851 dns
197 16583197 www.universityofcalifornia.edu
156 8487817 www.elsevier.com
151 1437436 www.greatvalley.org
112 2354582 faculty.ucmerced.edu
105 5659795 www.pacific.edu
90 111985 k12.ucop.edu
86 255733 www-cms.llnl.gov
85 1178031 admissions.ucmerced.edu
81 297947 uc-industry.berkeley.edu
71 108265 www.mssmfoundation.org
67 349300 www.nps.gov
66 308926 www.usafreedomcorps.gov
54 137085 slugstore.ucsc.edu
52 52202 www.cerrocoso.edu
51 977315 www.universityofcalifornia.com

Curator Feedback to CDL (Cowell, Merced)

Crawl Success: mostly effective


Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: 10 more links came up in the via search; 8 of these were not useful for the research of land use issues. The two that were useful were a benefit of the via search.
Crawl Frequency: monthly


Sherry DeDekker: California Water Science Center
CDL Report to Curator

URL: http://ca.water.usgs.gov/
Curator's original comments: "We are interested in the CA water reports and databases behind the links on this page. Also, in looking at http://waterdata.usgs.gov/ca/nwis/nwis, this section is an interactive interface to multiple data sets. Is it possible to capture this type of site as well as the static reports?"
Site copyright statement: "Information presented on this website is considered public information (unless otherwise noted) and may be distributed or copied. Use of appropriate byline/photo/image credit is requested. We strongly recommend that USGS data be acquired directly from a USGS server and not through other sources that may change the data in some way."
Crawl Results

Comments from crawl operator: Databases, e.g. http://waterdata.usgs.gov/ca/nwis/nwis, disallowed by http://waterdata.usgs.gov/robots.txt. Some water data reports are NOT caught by broader "linked hosts included" settings, e.g. http://pubs.usgs.gov/wdr/2004/wdr-ca-04-1/ -- would also want to submit http://ca.water.usgs.gov/waterdata/ as a seed.
Related hosts crawled:


When the crawl was set to include documents from other sites that the original site linked to, 662 additional sites were crawled. The following list includes the hosts that supplied more than 50 files. Note that the host "pubs.usgs.gov" supplied a higher number of files than the original host itself.
[urls] [bytes] [host]
1963 255912820 pubs.usgs.gov
1153 47066381 ca.water.usgs.gov
698 56570 dns
404 112354772 geopubs.wr.usgs.gov
385 9377715 water.usgs.gov
327 203939163 greenwood.cr.usgs.gov
318 17431487 www.elsevier.com
219 3254794 www.usgs.gov
189 2737159 www.lsu.edu
163 2292905 wrgis.wr.usgs.gov
158 31124201 www.epa.gov
149 921063 www.usda.gov
[list truncated…]

Curator Feedback to CDL (DeDekker, CWSC)

Crawl Success: somewhat effective
Crawl Success Comments: Site appears to access water data reports (http://ca.water.usgs.gov/archive/waterdata/index.html) but none are actually available through the links. I expected the site to not be able to access real time data, but these are archived reports.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly


Megan Dreger (submitted by James Jacobs): City of San Diego Planning Department

CDL Report to Curator

URL: http://www.sandiego.gov/planning/
Curator's original comments: I would like to drill down several levels (at least 3) of this site. For example, following the link to City of Villages/general plan update leads to many more important planning documents.
Site copyright statement: This site contains the two following notices on the same page:
Restrictions on Use of Materials: "This site is operated and maintained by the City of San Diego through its Department of Information Technology and Communications (referred to as IT&C). Except as provided herein, no material or information from this site may be copied, reproduced, republished, uploaded, posted, transmitted, or distributed except as authorized in this notice, expressly authorized within this site, or approved in writing by IT&C. Copyright Notice: Unless a copyright is indicated, information on the City of San Diego Web site is in the public domain and may be reproduced, published, or otherwise used with the City of San Diego's permission. We request only that the City of San Diego be cited as the source of the information and that any photo credits, graphics, or byline be similarly credited to the photographer, author, or City of San Diego, as appropriate. If a copyright is indicated on a photo, graphic, or any other material, permission to copy these materials must be obtained from the original source."
Crawl Results


Comments from crawl operator: Need feedback about whether desired content retrieved.
Question for curator: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 1084 additional hosts were crawled. (This figure does not represent a complete crawl, as the size limitation was reached.) The following hosts supplied more than 75 documents to your site:
[urls] [bytes] [host]
3728 556231640 www.sandiego.gov
1247 38685244 genesis.sannet.gov
1085 80905 dns
807 6676252 www.houstontexans.com
428 1079658 www.cacities.org
399 102298888 www.buccaneers.com
259 1797232 granicus.sandiego.gov
258 42666066 clerkdoc.sannet.gov
238 5413894 www.ccdc.com
225 2503591 www.ci.el-cajon.ca.us
223 1387347 www.ipl.org
217 2683826 www.sdcounty.ca.gov
203 11673212 restaurants.sandiego.com
195 2620365 www.sdcommute.com
192 1344523 www.bengals.com
189 2221192 www.kidsdomain.com
176 1333528 www.buffalobills.com
171 685965 www.chumpsoft.com
166 277238 www.proquest.com
[list truncated…]


Curator Feedback to CDL (Dreger, San Diego)

Crawl Success: mostly effective
Crawl Success Comments: This crawl was not completed due to size, so that may explain some of my questions. It was pretty effective in terms of getting the Planning Dept pages, but went out further than I expected. Due to the vague request to drill down several levels, I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit. It's nice that PDFs and other formats are included.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The link hosts included (via) seemed to include more extraneous stuff.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report; this is the curator's answer to the question):
Q: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
A: You asked for a better definition of what was meant by the request to drill down several levels (at least 3). Unfortunately, I wasn't the original curator and I'm not sure what he meant. The crawl you did was very useful.
Crawl Frequency: monthly
Questions/Comments about crawl: I think that the crawl frequency should be at least monthly for these pages.


Peter Filardo and Michael Nash: New York City Central Labor Council
CDL Report to Curator

URL: http://www.nycclc.org/
Curator's original comments: (none)
Site copyright statement: "© 2004 New York City Central Labor Council. No portion of this website may be reproduced in any form without permission from the Central Labor Council. Contact our offices for more information at nycaflcio@aol.com."
Crawl Results
NOTE: Because your Crawl "A" had to be stopped, then resumed, each of your reports for that crawl is in two segments. To browse a list of all reports for that crawl, go to http://voro.cdlib.org:8081/ingest_misc/ndiipp/testcrawls_raw/filardo_labor_via

Comments from crawl operator: A) Linked hosts included: Crawl complete after recovery, with addition of max retries. Seemed to hang at http://www.nycclc.org/calendar/event.asp?EventId=501, http://www.nycclc.org/assets/HLCapplicationmembership.pdf. Ended crawl; seemed to hang. Recovered from previous job; the recovery was successful. Note for future that a recovered job is identifiable because the logs directory is called logs-R.

34

Related hosts crawled When the crawl was set to include documents from other sites that the original site linked to x additional hosts were crawled The following hosts supplied more than 50 documents to your site [urls] [bytes] [host] 1913 74260017 wwwnycclcorg 156 11755 dns 115 710552 wwwaflcioorg 73 1477966 wwwcomptrollernycgov 71 193264 wwwempirepagecom 60 570115 wwwredcrossorg 58 269079 wwwafl-cioorg 57 240845 wwwcampsussexorg 57 113676 wwwmssmedu 56 449473 wwwlabor-studiesorg 53 184605 wwwpbbcorg 52 134326 wwwsenategov [list truncatedhellip]

Curator Feedback to CDL (Filardo, NYCCLC)

None provided


Valerie Glenn and Arelene Weibel: Strengthening Social Security
CDL Report to Curator

URL: http://www.strengtheningsocialsecurity.gov

Curator's original comments: "contains external links to audio & video that would be essential to completing this site (see press room); some are files, some are links to webcasts, some are on external .gov sites and some are external .com sites"

Site copyright statement: Copyright info not found.

Crawl Results

Comments from crawl operator: Interesting audio/video/webcast content -- need feedback about success in capturing.

Questions for Curator: How successfully did this crawl capture the multimedia documents you were interested in?

Comments from coordinator: In the media center area of this site (http://www.strengtheningsocialsecurity.gov/press/media_center.shtml) I've found .ram (both video and audio alone), .smil and .asx files. The site also contains numerous .ppt and .pdf files.

• A text search on the log file turns up numerous .ram files, only 1 .ppt file.

• .asx files are Windows streaming media redirector files, which generally lead to associated .wmf files. No .asx references appear in the crawl log, nor do any .wmf files.

• Similarly, .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files. I assume that when displayed, some of the RealMedia files from this site would function, but many of the other multimedia files would not. (A sketch of this kind of log check follows the host list below.)

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 388 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
660 10668874 www.chelseapiers.com
562 7334035 www.whitehouse.gov
477 6366197 www.laopinion.com
391 29623 dns
356 3874719 www.wkrc.com
243 12294240 www.strengtheningsocialsecurity.gov
178 1935969 www.xavier.edu
148 237055 image.com.com
127 682069 online.wsj.com
117 898439 www.omaha.com
116 514995 www.npr.org
108 995733 www.nba.com
[list truncated…]
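The log check described above can be scripted. The sketch below is illustrative only: it tallies captured URIs by multimedia extension from a Heritrix crawl log, assuming the Heritrix 1.x column layout in which the URI is the fourth whitespace-separated field; the file name crawl.log is also an assumption.

from collections import Counter

# Extensions mentioned in the coordinator's comments above.
EXTENSIONS = (".ram", ".rm", ".smil", ".asx", ".wmf", ".ppt", ".pdf")

def tally_media(crawl_log_path: str) -> Counter:
    """Count captured URIs per extension in a Heritrix crawl log."""
    counts = Counter()
    with open(crawl_log_path) as log:
        for line in log:
            fields = line.split()
            if len(fields) < 4:
                continue
            uri = fields[3].lower()  # assumed: the URI is the 4th field
            for ext in EXTENSIONS:
                if uri.endswith(ext):
                    counts[ext] += 1
    return counts

if __name__ == "__main__":
    for ext, n in tally_media("crawl.log").most_common():
        print(ext, n)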

Curator Feedback to CDL (Glenn Strength Soc Sec)

Crawl Success: mostly effective

Crawl Success Comments: My main concern about this site was the multimedia documents included - I've posted those thoughts in the Questions for Curator text box.

Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: For this site it was essential to capture the linked hosts ("via"), because many of the press materials etc. were on external sites.

Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: How successfully did this crawl capture the multimedia documents you were interested in?
A: I'm disappointed that not all of the multimedia files were captured, but there seem to be only a few that aren't included. I found it interesting that broadcasts from the same host (whitehouse.gov) weren't completely captured - some were, some weren't.

Crawl Frequency: once

Questions / Comments about crawl: [note: the curator alludes to adding a comment here, but no comment was received]


Valerie Glenn and Arelene Weibel: Defense Base Closure and Realignment Commission
CDL Report to Curator

URL: http://www.brac.gov

Curator's original comments: "In our previous efforts we have been unable to capture agency databases. The BRAC site includes a document library which has a search feature (http://www.brac.gov/Search.aspx) and a browse feature (http://www.brac.gov/Browse.aspx). We would really like to see how this information can be captured so that we can recreate it on our own servers."

Site copyright statement: "The contents of all material available on this Internet site are in the public domain and are not copyrighted. The content of this site may be freely reproduced, downloaded, disseminated, published, or transferred in any form and by any means. However, in some cases the copyright for certain text or images on this site may be held by other parties."

Crawl Results

Comments from crawl operator: A. Linked hosts included: http://www.brac.gov/Search.aspx can't be captured by Heritrix. http://www.brac.gov/Browse.aspx seems to only capture the first 25 documents. Tried again with the browse page as the starting point, but the crawl stopped after 1005 documents: extracted 20 links from the browse page, and then there were no more URLs in the frontier queue which had been extracted from Browse.aspx -- perhaps need more experimentation. B. Restricted to original host: again only the first 25 pages from browse -- can't even successfully pass a seed URL listing the max docs per browse page (50).

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 44 additional hosts were crawled. Note that because your crawl did not complete, this is not an accurate count of how many other sites your original site may link to.

[urls] [bytes] [host]
2034 1064389540 www.brac.gov
555 5874934 www.slu.edu
87 173510 www.cpcc.edu
54 154588 www.wmata.com
47 685158 www.sluhospital.com
44 3501 dns
44 582555 www.c-span.org
43 174467 www.adobe.com
38 178153 www.q-and-a.org
32 127325 slubkstore.com
24 140653 www.c-spanclassroom.org
23 326680 www.capitalnews.org
22 213116 cancercenter.slu.edu
21 196012 www.defenselink.mil
[List truncated…]

Curator Feedback to CDL (Glenn Defense Base Closure)

Crawl Success: somewhat effective

Crawl Success Comments: I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Frequency: once

Questions / Comments about crawl: One of the reasons I submitted this site to be crawled is that I'd already tried to harvest it using HTTrack. That product appeared to capture more of the documents than Heritrix - but we still haven't been able to capture the entire site.


Gabriela Gray: Join Arnold
CDL Report to Curator

URL: http://www.joinarnold.com

Curator's original comments: "California Recovery Team. Non-profit pro-Arnold group, not registered as a campaign committee. Critical Aspects: Complex file and directory naming structure. Looping, e.g. contactus.asp and contactadd.asp links."

Site copyright statement: "Copyright 2005"

Crawl Results

Comments from crawl operator: A. Linked hosts included: Great site for testing -- this loop is really interesting because a new URL is generated with each loop, so that the duplicate-detection underway at IA would still not eliminate it. 44332 of the retrieved URLs were contact pages. B. Restricted to original host: Got into a loop by the end; 999 documents retrieved; 34 minutes. C. Restricted to original host + regular expression: Excluding pages that matched the regular expression .*contactadd.*asp.*c=.* did not end the loop. What did end the loop: excluding both contactus and contactadd pages so they were not retrieved -- a drawback. (IA takes the manual approach of gathering the pages, then having an operator stop the crawl and take out the looping URLs by hand -- not scalable.)

<newObject name="contact" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
  <string name="decision">REJECT</string>
  <string name="regexp">.*contact.*asp.*c=.*</string>
</newObject>

Related hosts crawled: Because of looping problems, we were not able to crawl other hosts linked from this site.

Curator Feedback to CDL (Gray Arnold)

Crawl Success: somewhat effective

Crawl Success Comments: We spot-checked and it looks like most files were captured, but the individual pages don't display most of the images. (This may simply be a problem with the WERA interface.) Strangely enough, the Flash files work perfectly, which is exactly the opposite of our own capture experience.

Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: As per the crawl notes, we only checked the original-host version, since the "via" crawl failed.

Crawl Frequency: once

Questions / Comments about crawl: Same comments as on the Villaraigosa crawl. We're inexperienced with this type of capture: archive files as-is and use a tacked-on script and a special server-side interface to interpret links in the new environment. Our model has been to actually alter the internal links from absolute to relative formats so that the site works in any environment.


Gabriela Gray: Mayor-Elect Villaraigosa
CDL Report to Curator

URL: http://www.antonio2005.com

Curator's original comments: Critical Aspects: Flash animation, content scattered across multiple servers, maintaining complex internal link structure, JavaScript menus, streaming media.

Site copyright statement: "©2005 Villaraigosa for Mayor 2005"

Crawl Results

Comments from crawl operator

• (For linked-hosts results) Need feedback on media etc. retrieved -- this site is an ideal example of the need for scope+one.

• (For restricted-to-original-host results) How much was left out due to the domain restriction? Need feedback.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 263 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

[urls] [bytes] [host]
817 10291631 ensim3.interlix.com
805 117538973 www.antonio2005.com
472 6333775 www.laopinion.com
265 21173 dns
110 19355921 www2.dailynews.com
100 16605730 www2.dailybulletin.com
95 1410145 www.americanpresidents.org
86 820148 www.dailynews.com
73 168698 www.chumpsoft.com
72 52321 images.ibsys.com
69 836295 www.laobserved.com
65 137700 www.mysql.com
55 213569 www.ensim.com
55 177141 www.lamayorcn.com
55 296311 www.surveyusa.com
53 495858 abclocal.go.com
52 522324 www.c-span.org
51 244668 gallery.menalto.com
[list truncated…]

Curator Feedback to CDL (Gray Villaraigosa)

Crawl Success: mostly effective

Crawl Success Comments: Doing some spot checks, it looks like all of the pages were captured. Some problems with media files -- WERA shows them when we search, but the files are often size 0. In addition, many files on external servers are listed and even have some descriptive info, but when we click on Overview it says "Sorry, no documents with the given uri were found," so no idea if they were really captured.

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: Most of the links to other sites go to pages that truly are external to the site, not incorporated into it. The only exception would be the pages from ga3.org and ga4.org.

Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) The results on the multiple-hosts crawl are mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains, rather than source + 1. Also, there seem to be a lot of links going to ensim3.interlix.com -- there were more files captured from there than from antonio2005. This is a case where it looks like the entire site (or much of it) was mirrored on two different servers. When we find these we often try to collapse them into one seamless whole, eliminating the duplication, which is meaningless from the user's viewpoint. Simply capturing both mirrors and leaving the cross-links intact is an option we've used when we can't collapse, but it often leads to problems with links between the two, which seems to be the case here.


Crawl Frequency: once

Questions / Comments about crawl: We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE -- the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.


Ron Heckart and Nick Robinson: Public Policy Institute of California
CDL Report to Curator

URL: http://www.ppic.org

Curator's original comments: "We are particularly interested in their publications. We hope the crawler will be able to report when new publication files are posted on the website. Our main focus of interest is on their new publications at http://www.ppic.org/main/newpubs.asp"

Site copyright statement: "All Contents © Public Policy Institute of California 2003, 2004, 2005"

Crawl Results

Comments from crawl operator: We can't at the moment use Heritrix to report on new publications posted.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 238 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

[urls] [bytes] [host]
2421 324309107 www.ppic.org
433 1367362 www.cacities.org
238 19286 dns
229 4675065 www.icma.org
200 598505 bookstore.icma.org
151 1437436 www.greatvalley.org
144 517953 www.kff.org
137 5304390 www.rff.org
113 510174 www-hoover.stanford.edu
102 1642991 www.knowledgeplex.org
97 101335 cdn.mapquest.com
81 379020 www.cde.ca.gov
73 184118 www.ilsg.org
68 4539957 caag.state.ca.us
62 246921 www.milkeninstitute.org
[list truncated…]

Curator Feedback to CDL (Heckart PPIC)

Crawl Success: mostly effective

Crawl Success Comments: There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on "Date" retrieves the message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)

Crawl Frequency: weekly

Questions / Comments about crawl: Our crawl report included the following comment from the crawl operator: "We can't at the moment use Heritrix to report on new publications posted." The ability to report on new publications is critical to our goal of using the crawler as a discovery tool. What are the prospects for providing this functionality in the future?


Terry Huwe: AFL-CIO
CDL Report to Curator

URL: http://www.aflcio.org

Curator's original comments: "This site is content rich and has many files that will be useful in the future. Specific areas that are of special interest follow below: http://www.aflcio.org/corporatewatch - the data related to executive pay watch is especially useful; http://www.aflcio.org/mediacenter - would like to see press stories captured if possible; http://www.aflcio.org/issues - links to newsletters and original content. Also the 'Legislative Action Center' on the home page; this is a useful topic guide to legislative history from a labor perspective."

Site copyright statement: "Copyright © 2005 AFL-CIO"

Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 2572 additional hosts were crawled. The following hosts supplied more than 75 URLs to your site:

[urls] [bytes] [host]
12702 481956063 www.aflcio.org
2657 184477 dns
1375 35611678 www.local237teamsters.com
570 8144650 www.illinois.gov
502 52847039 www.ilo.org
435 3851046 www.cioslorit.org
427 2782314 www.nola.com
401 8414837 www1.paperthin.com
392 15725244 www.statehealthfacts.kff.org
326 4600633 www.dol.gov
288 12303728 search.oxide.com
284 3401275 www.sikids.com
280 3069385 www.washingtonpost.com
272 1480539 www.cdc.gov
235 5455692 www.kff.org
[list truncated…]

Curator Feedback to CDL (Huwe AFL-CIO)

Crawl Success: effective

Crawl Success Comments: I realize the collection interface is a work in progress and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g. Working Families Toolkit, BushWatch) that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2-million-plus-hits-per-year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses of the toolkit. Having said all that, my short answer is: I'm stoked. :-)

Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: In the case of this collection there's a lot of original content (and context-making documents) on the linked sites, simply due to the hierarchical/federal nature of the labor movement.

Crawl Frequency: monthly

Questions / Comments about crawl: I may have missed correspondence on this question, as I've been busy with a library renovation here. My question: Is the 1 gig limit a useful one? I couldn't help wondering what the actual figure would be if that limit were set higher. I'm generally interested in the total size of Web collections, as I need to monitor ours (which is content-rich but even so does not exceed 1 gig). So discussion of this at least would be interesting, and maybe, if it's not technically challenging, that 1 gig level might be raised and we could see what happens. Thanks to all involved -- very interesting process.


Kris Kasianovitz: Los Angeles Dept. of City Planning
CDL Report to Curator

URL: http://cityplanning.lacity.org

Curator's original comments: "Website uses frames. Most of the documents will be PDFs. Of particular interest: the EIRs, which are not archived on the page once the project is approved (see http://cityplanning.lacity.org/EIR/TOC_EIR.htm); General and Community Plans (http://cityplanning.lacity.org/complan/gen_plan/genplan2.htm, http://cityplanning.lacity.org/ComPlan/cpbpage.htm)."

Site copyright statement: No copyright information found.

Crawl Results

Comments from crawl operator: (Linked hosts included) Crawl ended because it ground on for 3 days without hitting the data limit; not sure if the URLs at the end of the log are valid/useful.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 119 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
10493 840876945 cityplanning.lacity.org
601 5156252 metrolinktrains.com
183 644377 www.cr.nps.gov
121 11162 dns
90 977850 www.metrolinktrains.com
81 1207859 www.fta.dot.gov
79 263432 www.fypower.org
66 333540 www.adobe.com
64 344638 lacity.org
63 133340 ceres.ca.gov
60 274940 www.amtrak.com
59 389217 www.nhtsa.dot.gov
58 347752 www.unitedweride.gov
52 209082 www.dot.gov
52 288783 www.nationaltrust.org
51 278949 www.portoflosangeles.org
[list truncated…]

Curator Feedback to CDL (Kasianovitz LA City Planning)

Crawl Success: mostly effective

Crawl Success Comments: The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all PDFs. When searching specifically for EIRs I got a large result list (699 citations); however, when I investigated whether or not the actual file was captured, I found that the main EIR page was captured (typically a .htm/.html file), but when I clicked on a link to get to the full report, all I got was the "Sorry, no documents with the given URI were found" message. This could be that the file was no longer available when the site was harvested. However, I tested a few of these and found that I could still access them on City Planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved. Here are a few specific examples:

Final EIR directory: http://cityplanning.lacity.org/EIR/Tocfeir.htm

Sierra Canyon Secondary School (cover page): http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/SierraCyn_coverpg.htm - Access to the Draft EIR and Final EIR is provided from this cover page. Within the system the links to both the Draft and Final are broken (no documents with that given URI): http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/DEIR/Table of Contents.htm

Villa Marina EIR: http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm

Directory of EIR notices of preparation: http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM

This provides links to EIR preparation notices (all PDFs), a total of 27 links/documents. I encountered the following three issues:

• PDF opened = when clicking on the link to the notice, the PDF opened with no problem: 16 of 27.

• "Sorry, no document with the given uri was found" = no PDF harvested, but I could get to it from the live site: 4 of 27.

• "Acrobat could not open" message (could open the live page outside of WERA) = the following Acrobat message came up when I tried to open the PDF: "Acrobat could not open 'ENV-2005-0881-EIR[1].pdf' because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded). To create an Adobe Acrobat PDF document, go to the source application. Then print the document to Adobe PDF." I copied and pasted the URL into a regular browser and could open the PDF with no problem. Also, in a few cases, if I clicked on the GO button after the first attempt to open the PDF in the system, it seemed to launch: 7 of 27.

Conversely, I found a number of pages that contained full documents in html, with links to PDFs that worked with no problem. See the following document: http://cityplanning.lacity.org/cwd/gnlpln/transelt/TET2Bkgrnd.htm. File types and error codes were what I expected.

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: The "restricted to original host" crawl is more helpful/relevant for getting to the materials from a specific agency. For some of the results that were restricted to host, I was getting external links. Here are some comparisons for each of the crawl settings.

Searched for "villa marina":
LA Dept. of City Planning: 6 results
http://cityplanning.lacity.org/EIR/NOPs/ENV-2004-3812-EIR.pdf
http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM
http://cityplanning.lacity.org/EIR/Tocfeir.htm
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/Cwd/GnlPln/HsgElt/HETblFigApVHgSit.htm
LA City Dept. of Planning (via): 2 results
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/Cwd/GnlPln/HsgElt/HETblFigApVHgSit.htm

Searched for "eir": LA Dept. of City Planning: 699 results. LA City Dept. of Planning (via): 324 results. For both of these searches the URIs were from cityplanning.lacity.org.

Searched for "transportation": LA Dept. of City Planning: 699 results. LA City Dept. of Planning (via): 290 results (most are from external sources and tended to be the index or main page of another agency or organization. Because this just got me to the main page, and none of the links functioned at that level, the "via" result was less helpful. However, the "via" results are useful for discovering other agencies or organizations that I should be looking at for materials.)

Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) The list of linked hosts provided at the end of the report is helpful. This will help me target other agencies that might be key places to check for further collection development.

Crawl Frequency: monthly

Questions / Comments about crawl: I want to qualify the frequency for this site: I'd like to do a monthly crawl for 3-4 months. I'd want to reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot of new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation. After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report there are 8899 documents in this crawl alone; my other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls. I should note that at this point in time, for local documents, I am more interested in individual documents than in capturing an entire website and preserving the functionality. Local agencies (with the exception of perhaps the Mayor's site) tend not to change the design (look and feel) very often or have a very sophisticated design (i.e. flash, changing images, etc.), and I don't see that this would be of interest to researchers' needs; it is the content (reports, maps, etc.) that is contained/accessed on the websites that is important. Maybe I'm wrong or being short-sighted about that.


Kris Kasianovitz: Southern California Association of Governments
CDL Report to Curator

Curator's original comments: "This is a critical regional agency for Los Angeles, Orange, Ventura, Imperial, Riverside, and San Bernardino counties. Its main areas are Transportation, Housing, and Economic Development. This will provide an analysis of the overall site, which has a lot of content. Publications/reports are typically in PDF – they are presented as full reports and pieces of the report (for easier downloading), so there might be duplication. The full report is really all that would be needed. See http://www.scag.ca.gov/publications. The Resources page contains the PDFs, images, dynamic content, and GIS programs, including an interactive atlas: http://www.scag.ca.gov/resources.htm. One part of the Resources site is the Web Accessible Geographic Data Search (WAGS): http://mapsvr.scag.ca.gov/wags/index.cfm?fuseaction= It requires a user-created login and password (although there is a guest login that allows you to bypass this). I'm not sure what kind of difficulty the harvester will encounter with this portion of the site. The interactive atlas also has a create-an-account/guest-login issue: http://mapsvr.scag.ca.gov/atlas/presmap.asp?Cmd=INIT Since it is a dynamic page, I don't know how this will be handled by the harvester."

Site copyright statement: "© 1999-2005 Southern California Association of Governments"

Crawl Results


Comments from crawl operator: Interesting login problem: Heritrix was unable to retrieve the guest login pages. ColdFusion and ASP pages don't generate new URLs and thus don't get crawled. NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than 1 minute.

Robots.txt file: The site you selected forbids crawlers from gathering certain data. It reads:

User-agent: *
Disallow: /_mm
Disallow: /_notes
Disallow: /_baks
Disallow: /MMWIP

User-agent: googlebot
Disallow: /csi

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 500 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
2517 863231651 www.scag.ca.gov
690 6134101 www.metrolinktrains.com
506 40063 dns
428 1084533 www.cacities.org
397 16161513 www.sce.com
196 581022 bookstore.icma.org
187 4505985 www.icma.org
175 7757737 www.ci.seal-beach.ca.us
158 1504151 www.h2ouse.org
149 940692 www.healthebay.org
137 317748 www.ci.pico-rivera.ca.us
130 18259431 www.ci.ventura.ca.us
123 490154 www.chinohills.org
121 406068 www.lakewoodcity.org
119 203542 www.lavote.net
117 2449995 www.ci.malibu.ca.us
114 744410 www.ci.irvine.ca.us
113 368023 www.whitehouse.gov
109 974674 www.dot.ca.gov
107 892192 www.lacanadaflintridge.com
[list truncated…]

Curator Feedback to CDL (Kasianovitz SCAG)

Crawl Success: mostly effective

Crawl Success Comments: Similar to my comments about the Los Angeles Dept. of City Planning - the crawl brought back a lot of webpages, but not the publications/documents that I would want to collect. Again, the same problem happened with SCAG as happened with the EIR example: the crawl brought back html pages with links to reports (typically in PDF format) - but the actual documents were not captured. While the webpage is helpful, as it gives context, the main content that I'd want to capture (the reports) was not captured. See the following for example; none of the webpages linked from these pages are available (they should link to a page that will have the material). I tried searching for the documents separately and couldn't get to them. See http://www.scag.ca.gov/publications/index.htm (the timeline arrows at the top seemed to function; I'm not sure what this is for) and http://www.scag.ca.gov/livable/pubs.htm. I was impressed to find that zip files were captured and I was able to download them. Unfortunately, when I opened them there wasn't any content (I did the same search by mistake with the Arizona Dept. of Water Resources and actually found content in the folders). I found 10 with the search type:zip. The gif or jpg images retrieved are not useful - most were just bars or bullets or covers of reports (although this might be helpful to identify titles, I think I would end up discarding these -- after doing more checking of the results).

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: See comments from LA Dept. of City Planning. "Restricted" gets me to the relevant materials for that agency; "via" brings back too many main webpages for other agencies to be useful.

Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) I expected that there would be a problem with the content behind logins. The crawl confirmed that material behind login screens couldn't be captured. Can I get a copy of these crawl results? NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than one minute.

Crawl Frequency: monthly

Questions / Comments about crawl: How to handle the copyright issue? For the login information, I'm not sure what all was blocked by the robots file. Interesting that metrolinktrains.com is the #1 related host for both of my crawled sites.


Linda Kennedy: California Bay Delta Authority
CDL Report to Curator

URL: http://calwater.ca.gov

Curator's original comments: "We are interested in the environmental impact statements and other key documents, the various news releases and other announcements, and the archives of CALFED."

Site copyright statement: "© 2001 CALFED Bay-Delta Program"

Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 519 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
1130 473192247 calwater.ca.gov
741 201538533 www.parks.ca.gov
521 40442 dns
373 51291934 solicitation.calwater.ca.gov
242 78913513 www.calwater.ca.gov
225 410972 cwea.org
209 87556344 www.science.calwater.ca.gov
173 109807146 science.calwater.ca.gov
172 1160607 www.adobe.com
129 517834 www.whitehouse.gov
[list truncated…]

Curator Feedback to CDL (Kennedy CALFED)


Crawl Success: mostly effective

Crawl Success Comments: I looked closely at the CALFED home page (http://calwater.ca.gov), Tribal home page (http://calwater.ca.gov/Tribal/Tribal_Home.shtml), Key Documents (http://calwater.ca.gov/CALFEDDocuments/CALFEDDocuments.shtml) and Archives page (http://calwater.ca.gov/Archives/Archives.shtml). The crawl did not complete in either the "via" or non-"via" search. Nearly all linked pages were retrieved in the non-"via" search; however, the retrievals from the "via" search were much less complete than the retrievals from the non-"via" search. For example, on the Key Documents page there were 3 missing links from the non-"via" search but 14 missing links from the "via" search. When Adobe documents were retrieved from either crawl, they came up correctly. Three .asp links of tribal maps from the Tribal home page were retrieved by the non-"via" search but not the "via" search. A few of the images were missing from the displays, and this was also affected by the browser used. Usually the same image was missing from both crawls, but sometimes the images were more complete in the non-"via" crawl retrievals. There were some display problems with the right-hand menu boxes on the http://calwater.ca.gov/Tribal/Tribal_Home.shtml page, for example, that did not display correctly when viewed in Firefox. The same page viewed correctly in Internet Explorer, but when printed out the boxes printed incorrectly, just as viewed in the Firefox browser. Grant Opportunities (http://calwater.ca.gov/GrantOpportunities/GrantInformation.shtml): this link did not work in 2 "via" and 2 non-"via" instances (from the Tribal home page and from the Archives page) but did work on one non-"via" crawl page (the CALFED home page). It could also be searched and retrieved directly from the test crawl search page.

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: See comments above. The non-"via" search was substantially more complete.

Crawl Frequency: monthly


Janet Martorana: Santa Barbara County Department of Planning and Development
CDL Report to Curator

URL: http://www.countyofsb.org/plandev/default.htm

Curator's original comments: None provided

Site copyright statement: No copyright information found.

Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 487 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
3119 1102414495 www.countyofsb.org
485 34416 dns
428 1083047 www.cacities.org
357 6126453 www.sbcphd.org
320 6203035 icma.org
250 438507 www.sbcourts.org
234 1110744 vortex.accuweather.com
200 593112 bookstore.icma.org
[list truncated…]

Curator Feedback to CDL (Martorana SBCD)

Crawl Success: mostly effective

Crawl Success Comments: Most documents I expected to find were captured, but a number were not. For example, off this page, http://www.countyofsb.org/plandev/comp/threeyear/2005-2008/default.html, I expected to get to the final work program, http://www.countyofsb.org/plandev/pdf/comp/programs/Three_Year_WP/2005-2008_3YrWrkProgram.pdf, but got the "Sorry, no documents with the given uri were found" message. Other examples, within the Energy Division, a part of the Planning & Development Dept.: off this page, http://www.countyofsb.org/energy/information.asp, I could access all links except for two, http://www.countyofsb.org/energy/information/oil&GasFields.asp (Oil and Gas Fields) and http://www.countyofsb.org/energy/information/oil&GasProduction.asp (Oil and Gas Production). The crawler seemed to cut off the URL right before the ampersand; perhaps it has problems with ampersands. (A sketch of one possible explanation follows this report.)

Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Frequency: weekly

Questions / Comments about crawl: I'd like to be notified when I've navigated away from the captured site. I found myself on realtime web sites but couldn't tell what was on the captured site and what was real. Example: when I did a search on "santa barbara" I got 1528 results using the "via" results. I went to the County of Santa Barbara Online site and then, after clicking on links, I was on the web realtime and not in the crawl results database any longer, yet there were no indications that I had left the crawled database. The WERA URI was still displaying at the top of the screen. I couldn't tell which were the captured sites and which were the current realtime sites. Other observations: the webpage navigation doesn't work, e.g. the Table of Contents doesn't jump to that section of the webpage (www.countyofsb.org/energy/projects/shell.asp and www.countyofsb.org/energy/mitigation/oakProject.asp), and links to glossary terms go to the glossary but not to the term itself.
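One plausible, unconfirmed explanation for the truncation noted above is that the page source encodes the ampersand as the HTML entity &amp;, and a link extractor that fails to unescape it, or that stops at the bare ampersand, produces exactly the cut-off URLs the curator describes. The sketch below shows only the decoding step; the example href and base URL are reconstructions from this report, not verified live paths.

from html import unescape
from urllib.parse import urljoin

# An href as it might appear in the page source, with the ampersand encoded.
href = "information/oil&amp;GasFields.asp"
base = "http://www.countyofsb.org/energy/"

# A link extractor must unescape the entity before queueing the URI;
# truncating at the entity or at the bare "&" loses the rest of the path.
print(urljoin(base, unescape(href)))
# http://www.countyofsb.org/energy/information/oil&GasFields.asp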


Lucia Orlando: Monterey Bay National Marine Sanctuary
CDL Report to Curator

URL: http://montereybay.noaa.gov

Curator's original comments: None provided

Site copyright statement: No copyright information found.

Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 795 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
5272 468755541 montereybay.noaa.gov
861 61141 dns
554 20831035 www.wunderground.com
368 4718168 montereybay.nos.noaa.gov
282 3682907 www.oceanfutures.org
273 10146417 www.mbnms-simon.org
260 7159780 www.mbayaq.org
163 61399 bc.us.yahoo.com
152 1273085 www.mbari.org
146 710203 www.monterey.com
119 3474881 www.rsis.com
119 279531 www.steinbeck.org
118 1092484 bonita.mbnms.nos.noaa.gov
109 924184 www.duke.edu
104 336986 www.montereybayaquarium.org
103 595953 icons.wunderground.com
102 339589 www.uncw.edu
[list truncated…]

Curator Feedback to CDL (Orlando – Monterey Bay)

Crawl Success: (rating not provided)

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: This site contains a large mix of government policy, laws, and regulatory information, as well as links to recreational and educational activities available in the MBNMS. I was most interested in links to laws/regs/policy and educational info and organizations. I thought the links restricted to the original host best captured this information succinctly.

Crawl Frequency: unknown


Richard Pearce-Moses: Arizona Department of Water Resources
CDL Report to Curator

URL: http://www.azwater.gov (redirects to http://www.azwater.gov/dwr)

Curator's original comments: "In arid Arizona, water is one of the most important – and most contested – resources. The publications and records of this Department are of critical value to the state. Our spider can get many files from this site (1474 files in 258 directories). We are mostly interested in documents by programs. Although our spider can't get the imaged documents database (http://www.azwater.gov/dwr/Content/ImagedRecords/default.htm), this directory may not be critical if we can get the imaged documents transferred to us. We are not interested in blank forms and applications."

Site copyright statement: "Copyright © 1998 - 2005 Arizona Department of Water Resources and ADWR Network. All Rights Reserved."

Crawl Results

Questions for curator: Did this capture the documents you needed?

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 195 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
2233 988447782 www.azwater.gov
286 2350888 www.water.az.gov
253 4587125 www.groundwater.org
226 3093331 www.azcentral.com
196 15626 dns
178 395216 www.macromedia.com
128 1679057 www.prescott.edu
123 947183 www.azleg.state.az.us
115 792968 www.usda.gov
[List truncated…]

Curator Feedback to CDL (Pearce-Moses AZWater)

Crawl Success: (not provided)

Crawl Success Comments: We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand it, wget cannot follow links in Flash or JavaScript, while it appears that the Heritrix spider can; that may be the difference. The crawl is listed as not completing, but it appears to be very close, based on the total number of bytes downloaded.

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Questions / Comments about crawl: It seemed odd that it took 2h 54m to crawl 7709 documents with linked hosts, but 4h 4m to crawl only 4888 docs when the spider was restricted to the original host.


Richard Pearce-Moses: Citizens Clean Election Commission
CDL Report to Curator

URL: http://www.ccec.state.az.us/ccec/scr/home.asp

Curator's original comments: This commission was established by initiative. Its work is of great historical significance, as it is changing the way the public elects officials. We have not been able to spider this site because links are buried in JavaScript. (We use wget as our spider.) We are primarily interested in acquiring their publications, election data, and things listed under "popular links."

Site copyright statement: "Copyright 2004 Arizona Citizens Clean Elections Commission. All Rights Reserved."

Crawl Results

Comments from crawl operator: JavaScript issue; interesting problem. Need curator feedback about what we captured.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, the following additional sites were crawled. Total: 15

[urls] [bytes] [host]
929 95456563 www.ccec.state.az.us
76 6117977 www.azcleanelections.gov
55 513218 az.gov
49 499337 www.governor.state.az.us
44 174903 www.adobe.com
40 141202 www.azleg.state.az.us
31 18549 www.az.gov
28 202755 www.azsos.gov
23 462603 gita.state.az.us
19 213976 www.benefitoptions.az.gov
17 89612 www.azredistricting.org
14 1385 dns
3 1687 wwwimages.adobe.com
2 1850 www.capitolrideshare.com
2 26438 www.ftc.gov

Curator Feedback to CDL (Pearce-Moses CCEC)

Crawl Success: (not provided)

Crawl Success Comments: We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in JavaScript. We have not been able to crawl this site with wget.

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)


Juri Stratford: City of Davis
CDL Report to Curator

URL: http://www.city.davis.ca.us

Curator's original comments: "We are primarily interested in the GIS data produced for the City of Davis. As the GIS data represent the largest part of the City of Davis web site, it may not be much more difficult to archive the site at the top level. Mapping and Geographic Information Systems URL: http://www.city.davis.ca.us/gis/"

Site copyright statement: "This web site is Copyright © 2004 by the City of Davis. All Rights Reserved. The City retains the copyright on all text, graphic images and other content of this site. You may not copy, modify and/or re-use text, images or other web content from this web site, distribute the City's web content, mirror content from this web site on a non-City server, or make any other use of the content of this web site that would violate the City's copyright, without written permission from the City of Davis. To the extent allowed by law, commercial use of our web material is prohibited without written permission from the City of Davis. All art work shown on these web pages is protected by US Copyright laws. Limited reproduction for non-commercial purposes can be authorized by the City of Davis, provided that requests are approved prior to use. Contact the Community Development Department Cultural Services program staff at (530) 757-5610 for more information. Some content included in this web site may be provided courtesy of third parties who may retain copyright control of the provided material. Any service marks and trademarks contained herein are the property of their respective owners."

Crawl Results


Comments from crawl operator: "GIS. Potential issue: /img is disallowed by robots.txt, e.g. http://www.city.davis.ca.us/img/featured/map-static.jpg can't be retrieved; also some maps on a second server are disallowed. Need feedback about the GIS material that was captured: what was captured that is useful? Much duplication -- pages captured repeatedly."

Robots.txt: The site you selected prohibits crawlers from collecting certain documents. The file reads:

User-agent: *
Disallow: /img
Disallow: /calendar
Disallow: /misc/email.cfm
Disallow: /ed/business
Disallow: /gis/oldmap
Disallow: /police/log
Disallow: /pcs/grants/sacog
Disallow: /jobs/listings
Disallow: /css
Disallow: /pcs/nutcracker/history.cfm
Disallow: /pcs/nutcracker/pdfs

User-agent: asterias
Disallow:

User-agent: gigabot
Disallow:
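As a quick illustration of what these rules exclude, the sketch below feeds a subset of the reconstructed rules to Python's standard robots.txt parser and checks the map image mentioned by the crawl operator. The user-agent string and the reconstructed paths are assumptions; a robots-obeying crawler such as Heritrix, as configured for these test crawls, would make the same decision.

from urllib import robotparser

# Subset of the rules shown above (paths are best-guess reconstructions).
rules = """\
User-agent: *
Disallow: /img
Disallow: /calendar
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The /img path is excluded, so the map image cannot be collected.
print(rp.can_fetch("heritrix", "http://www.city.davis.ca.us/img/featured/map-static.jpg"))  # False
print(rp.can_fetch("heritrix", "http://www.city.davis.ca.us/gis/library/"))                 # True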

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 420 additional hosts were crawled. The following hosts supplied more than 50 files to your site:


[urls] [bytes] [host]
16455 947871325 www.city.davis.ca.us
420 29555 dns
332 10377948 www.asucd.ucdavis.edu
305 33270715 selectree.calpoly.edu
279 3815103 www.w3.org
161 2027740 www.cr.nps.gov
139 941939 www.comcast.com
133 951815 www.yolocounty.org
[List truncated…]

Curator Feedback to CDL (Stratford Davis)

Crawl Success: mostly effective

Crawl Success Comments: Looking at the GIS Online Maps page, it's not clear which formats were retrieved and which were not. For example, the Growth Map Flash file downloads fine, but the Flash/ArcIMS files do not download.

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: In general, restricted to original host works better. The broader search includes pages from businesses (e.g. Comcast) and other local agencies (e.g. other local and state government sites). But restricting the outside sites to the first level seems to be a good compromise.

Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) In general it looked like it did a good job pulling down geographic data/images. For example, you can pull down data from http://www.city.davis.ca.us/gis/library/. It's difficult for me to get a sense of the level of duplication from the way the search results display.

Crawl Frequency: monthly

Questions / Comments about crawl: I like the ability to navigate within the archive once you have a starting point. This is very nice.


Yvonne Wilson: Orange County Sanitation District
CDL Report to Curator

URL: http://www.ocsd.com

Curator's original comments: "This is an Orange County agency that has small reports and complex planning documents on its web site."

Site copyright statement: "Copyright © 2001-2005 Orange County Sanitation District. Unless a copyright is indicated, the information on this site is freely available for non-commercial, non-profit-making use. If a copyright is indicated on any materials displayed on our website, permission to copy these materials must be obtained from the original source. Commercial use of District materials is expressly prohibited without the written permission of the OCSD."

Crawl Results

Comments from crawl operator: In both crawl settings we had to limit the maximum number of retry attempts in order to complete the crawl.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 85 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
755 85943567 www.ocsd.com
164 7635257 www.ci.seal-beach.ca.us
122 809190 www.ci.irvine.ca.us
95 169207 epa.gov
86 7673 dns
85 559125 order.e-arc.com
66 840581 www.ci.huntington-beach.ca.us
62 213476 www.cityoforange.org
57 313579 www.epa.gov
55 4477820 www.villapark.org
50 1843748 www.cityoflapalma.org
50 463285 www.ocbinc.com
[List truncated…]

Curator Feedback to CDL (Wilson OCSD)

Crawl Success: somewhat effective

Crawl Success Comments: Using WERA, I searched inquiries by type and title in the two OCSD collections, plain and "via". I received no hits for pdf, only the homepage for html, and three hits for text. There are many PDF sections in the EIRs. I next searched by title in the two collections; I was most successful in "via". By searching the titles "carbon canyon" and "Ellis Ave Pumping Station" I found articles but not the EIRs, which are available full text. At this point I presumed the crawl did not drill down far enough. Then I searched for a secondary page entitled "Ocean Monitoring"; this time the search found only an internal letter and memo, but not all the documents related to this topic. The "via" collection search seems to be the most productive, but it is not consistent.

Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: I searched some of the outside links, to US Marine Fisheries and EPA Beach Watch, and received no hits.

Crawl Frequency: monthly


Crawl Report Key: Web-at-Risk Test Crawls

This document is a guide to the test crawl report, providing further information about some of the statistics and results conveyed there. Your report includes basic information about the site you submitted, your original comments about that site, and any copyright statements CDL found when we examined the site. Your site was crawled using the Heritrix crawler, version 1.5.1. Your report will include the following information about the site:

Crawl Settings: We crawled each site in two different ways: A. Linked hosts included; B. Restricted to original host. Where linked hosts were included, we set the crawler to gather any outside page that your nominated site linked to, but no further. So if your site linked to a single document from whitehouse.gov, we captured that document but did not crawl any further on the whitehouse.gov site. Your report will show results from both styles of crawling to give you a sense of whether or not the site draws heavily from valuable materials on another site. You will also receive a list of the other hosts that the site linked to and how many documents were gathered from those hosts.

Robots.txt file: The presence of a robots.txt file means that the content provider is asking us to refrain from crawling either all or part of the site. This refers only to the host you named in your crawl request. In some cases the site had a robots.txt file but it didn't say anything; we noted when this occurred. We obeyed robots.txt instructions for these crawls, so if the site contained one but we still got a result, that means the robots.txt file only prevented us from crawling certain areas of the site. When robots.txt files were present, we have included the text of that file in your report so you can see which segments of the site the site owner wants to protect.

Crawl duration

Total number of documents: The "Documents" count will include page components (such as images or flash files).

File types (mime types): This area will contain a URL. When you go to that URL, you will see a list of the different file types that were retrieved as part of the crawl. IMPORTANT: The Heritrix crawler is currently experiencing difficulty with this report and it is missing a crucial column. Until that is fixed, you can see the different file types retrieved, from most common to least, but you cannot yet tell how many files each one included. CDL will contact you when this report has been fixed.

Response code reports: The URL in this column will lead to a list of response codes in order by frequency. This will include "200" for files that were successfully captured and error codes for files that were not captured. The error code list includes some codes specific to Heritrix. The key to interpreting these codes is at http://crawler.archive.org/articles/user_manual.html#statuscodes. Note that this report only gives you quantitative information about response codes; it does not link response codes to specific files. For these details see "hosts report and crawl log" below.

How much data collected (bytes): The file size of the total crawl is reported in bytes. You can use the byte conversion tool at http://www.techtutorials.net/reference/byteconverters.html if you want to recalculate the size of the crawl in another measurement, such as kilobytes or megabytes.

Did crawl complete: This will say "no" if the crawl results exceeded 1 gigabyte or if the crawler encountered an obstacle to capturing the site that could not be fixed.

Location of hosts report and crawl log: You are welcome to review the page-by-page details of the crawl log. This is a generic report that comes with the Heritrix crawler and is not terribly user friendly, but it provides the most detail about the crawl process. This report will list every file that the crawler attempted to get and provide some information about each file. The Heritrix manual can help you interpret this report: http://crawler.archive.org/articles/user_manual.html#logs (go to section 8.2.1, Crawl Log). A short sketch showing how the crawl log can be summarized appears at the end of this key.

Comments from Crawl Operator: These are observations that the Web Archive Programmer made about the crawl process for your site.

Questions for Curator: This section does not appear in every report. If you have a question listed here, please respond to it in the Test Crawl Feedback form.

Your Collection: Important: The screens you will see do not represent the final user interface for the Web Archiving Service tools. We are using WERA, an open-source search and display tool, only to show you your test crawl results. Because we did not seek the right to redistribute these documents, these pages are available only for the purpose of analyzing crawler effectiveness. You must have a password to view these pages. Your report will include the address of a wiki page and a login and password. Each site was crawled twice: plain crawl = only pages from the original site were collected; via = pages from the original site as well as pages that site links to were collected. Unfortunately, you cannot simply browse your site; you must select a collection and type a search. You will be able to navigate throughout your site once you load a page containing links. You will be able to review your colleagues' sites as well. Note that the WERA display tool is not perfect: if the same document was gathered from more than one crawl, it may not display in every collection.

Related Hosts Crawled: This section provides further information about the additional materials that were gathered when we set the crawler to include documents that your site links to. This can be critical in deciding what settings are needed to capture your site. Some sites, for instance, will keep all of their PDF or image files on a separate server. If you don't allow the crawler to move away from the original URI, you won't capture a critical portion of the site's content. In other cases, however, this setting will lead to irrelevant information. This report includes the most commonly linked hosts from your site.
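For curators who do want to dig into the crawl log described above, a short script can produce the same kinds of summaries as the response code, file type, and byte-count reports. The sketch below is illustrative only: it assumes the Heritrix 1.x crawl.log field order (timestamp, status code, size in bytes, URI, discovery path, referrer, mime type, ...), which should be checked against section 8.2.1 of the user manual referenced above, and the file name crawl.log is an assumption.

from collections import Counter

def summarize(crawl_log_path: str):
    """Tally response codes and mime types and sum downloaded bytes."""
    statuses, mimes, total_bytes = Counter(), Counter(), 0
    with open(crawl_log_path) as log:
        for line in log:
            fields = line.split()
            if len(fields) < 7:
                continue
            status, size, mime = fields[1], fields[2], fields[6]
            statuses[status] += 1
            mimes[mime] += 1
            if size.isdigit():  # size is "-" when nothing was downloaded
                total_bytes += int(size)
    return statuses, mimes, total_bytes

if __name__ == "__main__":
    statuses, mimes, nbytes = summarize("crawl.log")
    print("Total data collected:", nbytes, "bytes (%.1f MB)" % (nbytes / 2**20))
    print("Most common response codes:", statuses.most_common(5))
    print("Most common mime types:", mimes.most_common(5))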



the crawl report; there are 8899 documents in this crawl alone. My other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls.

Crawl Frequency

When asked how frequently they wanted to crawl sites, curators responded with a variety of preferences:

[Figure: Desired Crawl Frequency, showing the number of curators requesting daily, weekly, monthly, once-only, or unknown crawl frequencies]

Again, it is worth considering precisely what curators hope to capture in a repeated crawl of a site. Some insight is provided by these curators' comments:

We hope the crawler will be able to report when new publication files are posted on the web site.

And

The ability to report on new publications is critical to our goal of using the crawler as a discovery tool.

As with the other NDIIPP grants, the purpose of the Web Archiving Service tools will be archiving and preservation, not resource discovery. This indicates that we should further investigate what a "weekly" or "monthly" crawl really means to curators. If a site was not updated over the course of a year, would the curator want to continue running weekly crawls of the site? Would the curator want each crawl to appear on an archive timeline for that site, even if the content was no different for each date? One curator comments:

I want to qualify the frequency for this site. I'd like to do a monthly crawl for three-four months; I'd want [to] reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot [of] new content being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation.
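What curators describe here is essentially change detection between crawls. The report package does not currently provide this, but as a rough illustration of what such a report could look like, the sketch below compares the content digests recorded in two crawl logs and lists URIs that are new or changed. The digest position (assumed here to be the 10th field) and the log file names are placeholders, not features of the current reports.

    # Sketch: compare two Heritrix crawl logs and report new or changed URIs.
    # The content digest is assumed to be the 10th whitespace-delimited field.
    def digests(path):
        table = {}
        with open(path) as log:
            for line in log:
                fields = line.split()
                if len(fields) >= 10 and fields[1] == "200":
                    table[fields[3]] = fields[9]  # URI -> content digest
        return table

    old = digests("crawl-january.log")    # placeholder file names
    new = digests("crawl-february.log")
    for uri, digest in sorted(new.items()):
        if uri not in old:
            print("NEW     ", uri)
        elif old[uri] != digest:
            print("CHANGED ", uri)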


Language and Web Site Models

One of the challenges of communicating and interpreting crawl results is that crawlers don't necessarily work the way people envision them to. Further, the appearance of a web site on a screen and its architecture on a server may be quite different: the directory structure of a site may have no relationship to the way its navigation is organized on a screen. One frequent point of confusion curators encountered while interpreting crawl results is the concept of how many "levels down" the crawler went. One curator requested that we "drill down several levels (at least 3)" in our capture. One challenge with this request is that "levels down" can be interpreted to mean different things. In some cases curators clearly mean the directory structure of the web site on the remote server. In other cases, however, they seem to mean the number of hosts away from the original site:

In general, restricted to original host works better. The broader search includes pages from businesses (e.g. Comcast) and other local agencies (e.g. other local and state government sites). But restricting the outside sites to the first level seems to be a good compromise.

If the curator is referring to a structure of subdirectories, the next challenge is that the Heritrix crawler does not work this way. Heritrix ignores the site's directory structure and instead follows links from the seed URI it is provided. The Heritrix manual specifically defines "depth" in this way:

This scope allows for limiting the depth of a crawl (how many links away Heritrix should crawl), but does not impose any limits on the hosts, domains, or URI paths crawled.6
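In other words, "depth" here is a count of link hops, not of directories. The toy breadth-first walk below (not Heritrix code, and using an invented example.gov link graph) shows why a document buried three directories deep can still be only one "level" away if the front page links to it directly.

    # Toy illustration of hop-based depth: a hop limit counts links followed
    # from the seed, regardless of how deep the target sits in the directory tree.
    from collections import deque

    def crawl(seed, links, max_hops):
        seen, queue = {seed}, deque([(seed, 0)])
        while queue:
            uri, hops = queue.popleft()
            yield uri, hops
            if hops == max_hops:
                continue                      # hop limit reached; stop following links
            for link in links.get(uri, []):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, hops + 1))

    # Invented link graph: a PDF three directories down is one hop from the seed.
    site = {"http://example.gov/": ["http://example.gov/plans/2005/eir/report.pdf"]}
    for uri, hops in crawl("http://example.gov/", site, max_hops=3):
        print(hops, uri)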

Multimedia

Some curators chose sites because of the value of their multimedia resources. The crawler yielded mixed results in capturing these resources. For one site:

• A text search on the log file turns up numerous .ram files, only one .ppt file.
• .asx files are Windows streaming media redirector files, which generally lead to associated .wmf files. No .asx references appear in the crawl log, nor do any .wmf files.
• .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files.
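The checks above were done with text searches on the crawl log; a small sketch of automating that kind of file-type census is below. The extension list, log layout, and file name are all assumptions to adapt to a particular crawl.

    # Sketch: count successfully fetched files by media extension in crawl.log.
    import re
    from collections import Counter

    MEDIA = re.compile(r"\.(ram|rm|smil|asx|wmv|ppt|pdf|mp3)(\?|$)", re.IGNORECASE)

    found = Counter()
    with open("crawl.log") as log:
        for line in log:
            fields = line.split()
            if len(fields) < 4:
                continue
            match = MEDIA.search(fields[3])
            if match and fields[1] == "200":
                found[match.group(1).lower()] += 1

    for ext, count in found.most_common():
        print(ext, count)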

As noted, nearly half the sites crawled reached the 1 GB size limit and so did not complete. This makes it difficult to determine whether there were genuine problems with particular types of files, or if the crawler simply did not get to the missing files before the crawl was stopped.

6 Heritrix User Manual, Section 6.1.1, Crawl Scope: Broad Scope. <http://crawler.archive.org/articles/user_manual.html>

Comparison with Other Crawlers

Three of our curators had previously used other crawlers, such as HTTrack and Wget, to capture the same sites we attempted with Heritrix. Because these curators provided strong details when describing their sites, we crawled all of the sites they sent us (six sites). In one case, the Defense Base Closure and Realignment Commission, the curator had greater success capturing .aspx files with HTTrack than we had with Heritrix. Other comparisons:

We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using Wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand, Wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can. That may be the difference.

And

We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with Wget.

Crawl Success

We asked curators to rate the overall success of the test crawls on the following scale:

• Not effective (none of the desired documents were captured)
• Somewhat effective (some of the desired documents were captured)
• Mostly effective (most of the desired documents were captured)
• Effective (all of the desired documents were captured)

[Figure: Crawl Success, showing the number of curators rating their crawl as effective, mostly effective, somewhat effective, or not effective]

Here are some of the comments curators had concerning the overall success of these tests. The full text of all curator reports and their feedback to CDL is available in Appendix C.


Los Angeles Planning Department

The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all PDFs. When searching specifically for EIRs I got a large result list (699 citations); however [sic], when I investigated whether or not the actual file was captured, I found that the main EIR page was captured, typically a .htm/.html file, but when I clicked on a link to get to the full report, all I got was the "Sorry, no Documents w/the given URI were found" message. This could be that the file was no longer available when the site was harvested. However, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved.

City of San Diego Planning Department

This comment is from a curator who is filling in for the person who originated the test crawl request, so she is also attempting to interpret what that other person was hoping to capture:

Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example, http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit.

Defense Base Closure and Realignment Commission

I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.

Public Policy Institute of California


There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on "Date" retrieves the message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."

AFL-CIO

I realize the collection interface is a "work in progress" and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g. "Working Families Toolkit," "BushWatch") that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2 million-plus hits per year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer, which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses.

Conclusions

The process of running these test crawls has been valuable. It has resulted in a number of lessons learned and further directions for inquiry. It is clear that the tools that help curators analyze the results of these crawls will have to do more than mimic the original browsing context or supply tables of data. We will be looking for any opportunity to improve WAS reporting capabilities as we move forward with the project. The differences encountered in language and visualization raise the importance of a strong, intuitive design for the curator tools and for clear help screens. Each of us may visualize web sites differently, and the crawler may behave differently than we expect. A certain degree of online help will be needed to design crawls effectively, and further guidance should be available to help people interpret crawl results when those results don't match what the person anticipated.


The help and documentation for the Web Archiving Service will also need to address the rights analysis issues raised above. In most cases this is work that is done prior to issuing crawls; it cannot be addressed by the design of the WAS interface alone. When the Web-at-Risk project reaches the point of conducting usability studies, we should be sure to include tests that further uncover how users understand crawl frequency settings. Additionally, the desire for an analysis tool that can convey when a site has changed significantly is not limited to this project; it is a common issue faced by the Internet Archive, members of the International Internet Preservation Consortium, and others. CDL is actively communicating with these organizations as we all work toward a solution for this problem. Similarly, CDL should ensure that future assessment and analysis work with our curators addresses the issues raised by the two crawl scope settings. This inquiry should also extend to participants who exemplify end users, to determine whether people using web archives experience crawl scope differently than people who build them. Finally, a number of lessons were learned via the Katrina crawl, described in Appendix B. The most outstanding finding is that event-based crawls such as Katrina and site-specific crawls such as these have quite different characteristics and require different functionality and analysis tools.

Next Steps

There is certainly more to learn from these crawl results and about Heritrix; in some cases it is still unclear why the crawler failed to retrieve certain documents. The curators' feedback concerning these results has been extremely valuable. They have provided insight as to what was captured and what is still missing, which would have been difficult to determine without their subject expertise in the sites chosen. As we continue working to improve crawler success and performance, we will turn our attention to the Virtual Remote Control site created by Cornell. This site was developed to present particular problems to crawlers and is well documented, enabling the user to gauge a crawler's results. We will be using that site to replicate particular problems raised in our test crawl set, and may also reattempt some of the sites crawled for these tests. When we release the first version of the Web Archiving Service to curators in July 2006, we will request that they include their original test site in the crawls they attempt and compare the results with these tests. The test results for these crawls will remain in place as a point of comparison while we continue developing the Web Archiving Service. Additionally, certain issues raised by this feedback, particularly the desired scope and frequency of crawls, merit attention in future usability analysis work.


Web-at-Risk Test Crawl Report: Appendix A Sites Submitted

Curator | Site | Crawled
Sherry DeDekker | http://cawater.usgs.gov (California Water Science Center) | Yes
Sherry DeDekker | http://www.dwr.water.ca.gov (California Department of Water Resources) |
Peter Filardo and Michael Nash | http://www.nycclc.org (New York City Central Labor Council) | Yes
Peter Filardo and Michael Nash | http://www.dsausa.org (Democratic Socialists of America) |
Valerie Glenn and Arelene Weibel | http://www.strengtheningsocialsecurity.gov (Strengthening Social Security) | Yes
Valerie Glenn and Arelene Weibel | http://www.brac.gov (The Defense Base Closure and Realignment Commission) | Yes
Gabriela Gray | http://www.joinarnold.com (Join Arnold) | Yes
Gabriela Gray | http://www.antonio2005.com (Mayor/Mayor-elect Antonio Villaraigosa) | Yes
Ron Heckart and Nick Robinson | http://www.ppic.org (Public Policy Institute of California) | Yes
Ron Heckart and Nick Robinson | http://www.cbp.org (California Budget Project) |
Terrence Huwe | http://www.aflcio.org (AFL-CIO) | Yes
Terrence Huwe | http://www.seiu.org (Service Employees International Union) |
James Jacobs | http://www.sandiego.gov/planning (City of San Diego Planning Department, analyzed by Megan Dreger) | Yes
James Jacobs | http://www.sandag.org (San Diego Association of Governments) |
Kris Kasianovitz | http://cityplanning.lacity.org (Los Angeles Department of City Planning) | Yes
Kris Kasianovitz | http://www.scag.ca.gov (Southern California Association of Governments) | Yes
Linda Kennedy | http://calwater.ca.gov (California Bay-Delta Authority (CALFED)) | Yes
Linda Kennedy | http://www.dfg.ca.gov (California Department of Fish and Game) |
Ann Latta | http://www.ucmerced.edu (UC Merced, analyzed by Elizabeth Cowell) | Yes
Ann Latta | http://www.coastal.ca.gov/web (California Coastal Commission) |
Janet Martorana | http://www.countyofsb.org/plandev/default.htm (Santa Barbara County Department of Planning and Development) | Yes
Janet Martorana | http://www.sbcag.org (Santa Barbara County Association of Governments) |
Lucia Orlando | http://montereybay.noaa.gov (Monterey Bay National Marine Sanctuary) | Yes
Lucia Orlando | http://www.waterboards.ca.gov/centralcoast (Central Coast Regional Water Quality Control Board) |
Richard Pearce-Moses | http://www.azwater.gov (Arizona Department of Water Resources) | Yes
Richard Pearce-Moses | http://www.ccec.state.az.us/ccec/scr/home.asp (Citizen's Clean Election Commission) | Yes
Juri Stratford | http://www.city.davis.ca.us (City of Davis, California) | Yes
Juri Stratford | http://www.sacog.org (Sacramento Area Council of Governments) |
Yvonne Wilson | http://www.ocsd.com (The Orange County Sanitation District) | Yes

Web-at-Risk Test Crawl Report: Appendix B The Katrina Crawl

The Crawl

During the early fall of 2005, the California Digital Library was embarking on a series of test crawls as part of the Web-at-Risk project when Hurricane Katrina struck. CDL had requested that curators submit their sample URLs by August 25th, and on Friday, August 26th, Louisiana Governor Blanco declared a state of emergency. It was over that weekend that CDL staff reached the conclusion that an event of this historic significance would shift our priorities. On Monday we suspended our initial test crawl plans and began preparing on a number of fronts to capture web-based materials related to Katrina. Specifically, this meant:

• identifying which web sites or sub-sites to collect, which involved:
  o notifying 30+ curators of our emergency crawl plans and requesting their assistance
  o setting up a web-based interface to gather their suggested seed URLs
  o sorting through the resulting seed list and feeding it to our primary and secondary (Stanford) crawl operators
• determining our collection parameters
• getting our crawlers installed and configured
• locating and setting up disk space to store the crawls
• initiating and monitoring the crawls

This crawl activity posed several new challenges. It was the first time CDL staff had used the Heritrix crawler, which was not yet installed on a production server when the hurricane struck. The last week of August saw CDL and UC Berkeley staff scrambling to find server space to set up Heritrix instances, disk partitions, and job parameters for the crawl. This was done with remarkable speed, owing to the urgency of the situation and to everyone's desire to capture a record of the events. Given our limited experience, and in order to reduce the risk of losing the historically significant and fleeting materials related to Hurricane Katrina, CDL worked with Stanford University to concurrently run the same crawl using a different crawler. After both CDL and Stanford had crawled the same seed list for a month, the task was then taken up by the San Diego Supercomputer Center, who have continued crawling these sites using Heritrix.

Gathering the Seeds

The CDL sent out an initial request to the Web-at-Risk curators to submit URLs related to Hurricane Katrina for crawling. We worked collaboratively with a large group of content specialists to identify the sites:

• The 22 curators of the Web-at-Risk project (University of California Libraries, University of North Texas, New York University, Stanford University, and the Arizona State Library)
• The Library of Congress
• Librarians at Louisiana State University and the University of Mississippi


Over the course of the crawl, the list of seed URLs grew to over 700 (just over 500 of which were crawled by CDL). The image of our input form above provides some sense of the range of materials collected. Given the sudden nature of this event, there was no time to investigate rights issues or technical problems each site might have presented. We informed the curators that "Our immediate plan is to simply collect the material before it disappears. We will not make the material immediately available." We also had little time for quality control and were not able to guarantee that sites in our seed list would be comprehensively crawled.

Crawling Specifics

CDL's first crawl was run on September 1, 2005, using a seed list of 89 URLs. The final crawl run by CDL was on October 10, using a seed list of 589 URLs. This final seed list was then sent to the San Diego Supercomputer Center, who are continuing to run twice-weekly crawls.


The CDL crawls were done using Heritrix version 1.5.1. We began with a single instance of Heritrix, eventually dividing the seed list among six instances. We got through the entire seed list 29 times in 40 days. However, the content we were able to collect was limited by our crawler settings. Our crawler was configured to visit one host at a time, collecting content at a conservative rate, and never more than three hops away from the seed URL. To ensure that the crawler moved through the seed list in a timely way, it was limited to spending 15 minutes at any given host. So the material collected does not represent the entirety of what was available at each site.

Note that all of the seeds were crawled with the same configuration, no matter how different the structure of these sites might be. Thus the New Orleans version of Craig's List, NASA's information pages, and blog sites were all crawled in the same manner, despite being quite different in context, architecture, and other characteristics. Given our short preparation time, the goal was to find a crawler configuration that would be a match for the broadest range of Katrina materials. In most cases, whether the seed URL was for a personal blog or for a government agency, the seed itself was not centrally about Katrina. The Katrina information was generally situated on the front page and top levels of each site, so CDL expected that this setting, even with the 15-minute time limit, would at least capture the content nearest to the front pages of the widest variety of sites we could capture.

In choosing crawler configuration settings we also faced conflicting goals. On one hand, we wanted to capture as much Katrina content as possible. On the other hand, we were very hesitant to start hitting sites that were providing much-needed information at a crucial time (emergency sites, relief sites), thus making those sites hard to reach due to our capture activity. Many of these sites were already getting heavily used and perhaps were not running at full capacity; some were also geographically impacted directly by the hurricane. So we had to choose settings that balanced the need to collect with politeness across a wide range of sites.

When the San Diego Supercomputer Center took over the Katrina crawls in October, they revisited the Heritrix configuration settings. The SDSC crawl placed a limit not on the amount of time spent at a site but on the number of documents to be retrieved overall. Their crawls are being conducted "breadth first," in the sense of gathering pages consecutively across the seed list rather than gathering pages from one site at a time.

The Stanford University crawl of the same sites began on September 7th and continued for 30 consecutive days, using the WebVac crawler. While we have statistics concerning the size of the Stanford Katrina collection, it is very difficult to compare the configuration settings and effectiveness of WebVac vs. Heritrix. Further, there is no easy way to display materials gathered with WebVac, so the Stanford Katrina content is stored but is not accessible to us at the moment.

Katrina Crawl Results

In terms of creating a collection of the Katrina event on the web, we are not satisfied that CDL's crawling efforts were successful. As mentioned, we had to impose time limits on each site visited and were not able to conduct quality control on the URLs that were captured. The total size of the Katrina capture at CDL is 50 GB. This represents 29 completed crawls of the seed list, which grew to over 500 seeds over the 40 days we ran the crawl. It is clear from the byte size alone that we barely skimmed the surface, and indeed the Stanford crawl proves this point: when the Stanford crawl was based on about 370 seeds, it was pulling in about 35 GB, or 1.5 million pages, a day.

In spite of this, our attempts to capture Katrina news and events did have some very useful and positive outcomes. In terms of test crawling, the experience was quite successful, opening our eyes both to certain obstacles and to new possibilities. Most importantly, it prompted us to consider the nature and demands of event-based crawling, which had not been part of our original test plan. Here are some of the aspects of event-based crawls that Katrina surfaced.

Site Selection and Classification

This type of event demands a deep, collaborative effort to identify material to collect. CDL had to mobilize a large number of curators to make their best guesses at identifying promising sites. We set up a "crawl seed nomination" web form accessible to the curators from their home institutions and invited the curators to use it to enter their suggested seed URLs. CDL staff did not have time to add many automatic error- or duplicate-checking features, so this cumbersome work was done by hand. On the other hand, it provided us a first-hand trial of what a more general curator user interface might require.

The selection and management of seed lists is critical for sudden, event-based crawls. The curators contributing the URLs will not necessarily be well versed in the topic; in the case of Katrina, curators in California were not uniformly familiar with the Gulf Coast, the towns, the government agencies, etc. In addition, it is difficult to predict which aspects of the event will be of historic, enduring value. Because disk storage was not a pressing issue, it was better to err on the side of a wider net when selecting sites.

It became clear that there might be a role for "smart crawling," which would spread a wider net than that provided by human-generated seed lists through the use of automated tools that discover relevant materials. We were specifically interested in using tools developed at Emory and Cornell that interface with the Rainbow text classification tool.7 We did not have the resources to investigate this for the Katrina crawl but plan to fold it into future crawling tests.

Finally, it is worth examining why the seed list grew continuously throughout the event. Part of this growth is due to the natural increase in the number of pages and sites devoted to the emergency, and to having the additional time to identify and add them. However, the nature of the event itself changed over time: starting as a hurricane, then becoming a flood, a massive relocation, and a political and social issue. So the range of relevant sites changed as the event itself took on broader implications. This suggests that site selection is an ongoing process, not strictly an activity undertaken at the beginning of an event.

7 Bergmark, Donna. "Heritrix processor for use with rainbow." <http://groups.yahoo.com/group/archive-crawler/message/1905>

Rights, Ownership, and Responsibilities

Our current rights protocol stipulates that when a web page is clearly in the for-profit domain, we must seek permission from the content owners to collect. This restriction applies most notably to news agency sites, which would likely be a critical part of the web-based fallout of any major event. This was certainly the case with Katrina. However, rights negotiation is a time-consuming process. In the case of a planned event, such as an election, one is able to anticipate a certain number of sources and take rights management steps before the event occurs. But with emergencies such as Katrina, there is simply no way to secure rights in advance without missing time-sensitive material. Knowing this would be an issue, CDL notified our curators in advance that the Katrina materials we gathered would not be publicly accessible but merely captured and preserved. The complex rights issues behind web archiving, as well as the collaborative environment described above, raise a number of questions:

• Who owns the content?
• Who is responsible for the project?
• Who is responsible for responding if a content owner objects?
• When is it permissible to ignore robots exclusion files?
• How much rights management can be done as part of post-processing for an event crawl?
• Can there be standing agreements with major news agencies to allow for this type of crawl in emergency situations? If so, what constitutes an emergency?

Technical Infrastructure

CDL did not have an adequate technical infrastructure in place at the time of the Katrina crawl. As mentioned above, staff at both CDL and UC Berkeley scrambled to find server space and to install and configure the crawlers. This sudden shift in workload left the system vulnerable and somewhat unstable. In order to start the crawling process as soon as possible, CDL began the project on a temporary server. Making the transition to more permanent storage in the midst of this crawling project was neither easy nor flawless, and a certain amount of data had to be recovered from backup sources.

Information Analysis and Display

The challenge of analyzing crawl results has been described in some detail in our Test Crawl Report. That report cites a curator who found the task of reviewing an 8899-document crawl to be unmanageable. The tools currently available are not at all up to the task of analyzing a large and complex crawl. An event-based crawl is likely to result in massive amounts of data of widely varying quality. The selection of seeds is based on guesses that, given sufficient resources, should be reviewed, refined, and enhanced as the event progresses. Time-series data based on changing input parameters represents a kind of moving target that suggests the need to develop new analysis tools. Andreas Paepcke of Stanford University addressed this problem when considering how a social scientist might want to analyze the Katrina materials. Consider the researcher who is interested in finding out how quickly the notion of race entered the public discourse in the aftermath of Katrina. Paepcke suggests:

"For example, the tool could perform word frequency analysis across successive crawls of Katrina sites, so that the user could identify the emergence of word usage such as 'looter,' 'ninth ward,' etc. The social scientist would then define within the tool clusters of words that in the scientist's judgment are direct or indirect occurrence indicators of the concept 'race' (i.e. ninth ward, looters, lacks, poor). The scientist could subsequently interact with the tool at the level of these well-defined concepts. Example command: count occurrences of 'race' within the first three paragraphs of all pages and tabulate the differences across 10 days. We could go further and apply well-known topic categorization algorithms on the data to suggest new concepts as they arise in the series of text streams. For example, the tool might identify an increasing frequency of the term 'mismanagement' beginning in week two after the storm and alert the sociologist to this evidence of a new discourse theme." 8

While we don't expect to be able to incorporate this level of analysis into our web archiving tools any time soon, these ideas illustrate a useful direction for archival analysis tools.
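As a very reduced sketch of the kind of analysis Paepcke describes, the fragment below counts occurrences of a curator-defined word cluster across successive crawl snapshots. It assumes each crawl has already been dumped to plain-text files under a per-date directory (e.g. katrina/2005-09-08/*.txt), which is not how the archived data is actually stored; text extraction would be a separate step.

    # Sketch: track a concept cluster across dated crawl snapshots.
    import glob
    import re

    RACE_CLUSTER = ["looter", "ninth ward", "poor"]   # terms taken from the quote above

    def cluster_count(directory, terms):
        pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
        total = 0
        for path in glob.glob(directory + "/*.txt"):
            with open(path, errors="ignore") as page:
                total += len(pattern.findall(page.read()))
        return total

    for day in sorted(glob.glob("katrina/2005-09-*")):
        print(day, cluster_count(day, RACE_CLUSTER))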

Conclusions

At the time Hurricane Katrina hit, the Web-at-Risk service requirements and test crawl plan had been written for a more orderly and considered approach to web crawling. Our attempt to capture the web-based aftermath of Katrina highlighted our need to revisit the service requirements and see how well they support an event-based response. We need to develop a collaborative and robust mechanism to identify and collect seed URLs. When an event happens quickly, it is also important to have in place general crawling guidelines that will result in a well-rounded collection. The technical infrastructure for the Web Archiving Service that we are building for the Web-at-Risk project should be robust enough to handle occasional and sudden bursts of activity. Attempting to alter the infrastructure quickly in reaction to emergency events impairs our responsiveness and leaves the altered systems in a vulnerable state.

8 Paepcke, Andreas, Senior Research Scientist and Director of the Digital Library, Stanford University. Email correspondence with Patricia Cruse, October 26, 2005.


Finally, work needs to continue on both the rights management front and on developing improved web archiving analysis tools, so that the material gathered can be used to its greatest potential.


Web-at-Risk Test Crawl Report: Appendix C Individual Crawl Reports

Included below are the crawl reports provided to individual curators, including their analysis and feedback about those results. Note that a key to interpreting the tables in these crawl reports is provided at the end of this appendix.

Elizabeth Cowell (submitted by Ann Latta): UC Merced

CDL Report to Curator

URL: http://www.ucmerced.edu
Curator's original comments: "UC Merced is the first research university to be built in the 21st century. The educational and land use issues are significant. Of particular interest is http://www.ucmercedplanning.net. This site addresses major issues of land use - the university is being built on agricultural land; controversy existed re issues of redevelopment of downtown Merced vs. appropriation of agricultural land - there are major environmental issues focused on endangered species - educational issues involving faculty job descriptions, student body, etc. are significant because of the economic, ethnic, and cultural diversity of the region."
Site copyright statement: "© 2004 UC Regents"
Crawl Results


Comments from crawl operator: When we set the crawl to include pages from linked sites, the crawler got "trapped" at the Elsevier site. There is JavaScript on that linked page that causes the crawler to continue looking for additional pages on the Elsevier site, even when you're only trying to capture a single page. Once we set a limit for the maximum number of retry attempts, the crawl completed. This data is from the completed crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 227 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
969 71552369 www.ucmerced.edu
238 2564803 www.ucop.edu
226 14851 dns
197 16583197 www.universityofcalifornia.edu
156 8487817 www.elsevier.com
151 1437436 www.greatvalley.org
112 2354582 faculty.ucmerced.edu
105 5659795 www.pacific.edu
90 111985 k12.ucop.edu
86 255733 www-cms.llnl.gov
85 1178031 admissions.ucmerced.edu
81 297947 uc-industry.berkeley.edu
71 108265 www.mssmfoundation.org
67 349300 www.nps.gov
66 308926 www.usafreedomcorps.gov
54 137085 slugstore.ucsc.edu
52 52202 www.cerrocoso.edu
51 977315 www.universityofcalifornia.com

Curator Feedback to CDL (Cowell Merced)

Crawl Success: mostly effective
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: 10 more links came up in the via search; 8 of these were not useful for the research of land use issues. The two that were useful were a benefit of the via search.
Crawl Frequency: monthly


Sherry DeDekker: California Water Science Center

CDL Report to Curator

URL: http://cawater.usgs.gov
Curator's original comments: "We are interested in the CA water reports and databases behind the links on this page. Also, in looking at http://waterdata.usgs.gov/ca/nwis/nwis, this section is an interactive interface to multiple data sets. Is it possible to capture this type of site as well as the static reports?"
Site copyright statement: "Information presented on this website is considered public information (unless otherwise noted) and may be distributed or copied. Use of appropriate byline/photo/image credit is requested. We strongly recommend that USGS data be acquired directly from a USGS server and not through other sources that may change the data in some way."
Crawl Results

Comments from crawl operator: Databases, e.g. http://waterdata.usgs.gov/ca/nwis/nwis, are disallowed by http://waterdata.usgs.gov/robots.txt. Some water data reports are NOT caught by the broader "linked hosts included" settings, e.g. http://pubs.usgs.gov/wdr/2004/wdr-ca-04-1; would also want to submit http://cawater.usgs.gov/waterdata as a seed.
Related hosts crawled:


When the crawl was set to include documents from other sites that the original site linked to, 662 additional sites were crawled. The following list includes the hosts that supplied more than 50 files. Note that the host "pubs.usgs.gov" supplied a higher number of files than the original host itself.
[urls] [bytes] [host]
1963 255912820 pubs.usgs.gov
1153 47066381 cawater.usgs.gov
698 56570 dns
404 112354772 geopubs.wr.usgs.gov
385 9377715 water.usgs.gov
327 203939163 greenwood.cr.usgs.gov
318 17431487 www.elsevier.com
219 3254794 www.usgs.gov
189 2737159 www.lsu.edu
163 2292905 wrgis.wr.usgs.gov
158 31124201 www.epa.gov
149 921063 www.usda.gov
[list truncated…]

Curator Feedback to CDL (DeDekker CWSC)

Crawl Success: somewhat effective
Crawl Success Comments: Site appears to access water data reports (http://cawater.usgs.gov/archive/waterdata/index.html) but none are actually available through the links. I expected the site to not be able to access real-time data, but these are archived reports.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly


Megan Dreger (submitted by James Jacobs): City of San Diego Planning Department

CDL Report to Curator

URL: http://www.sandiego.gov/planning
Curator's original comments: "I would like to drill down several levels (at least 3) of this site. For example, following the link to City of Villages/general plan update leads to many more important planning documents."
Site copyright statement: This site contains the two following notices on the same page:
"Restrictions on Use of Materials: This site is operated and maintained by the City of San Diego through its Department of Information Technology and Communications (referred to as IT&C). Except as provided herein, no material or information from this site may be copied, reproduced, republished, uploaded, posted, transmitted, or distributed except as authorized in this notice, expressly authorized within this site, or approved in writing by IT&C."
"Copyright Notice: Unless a copyright is indicated, information on the City of San Diego Web site is in the public domain and may be reproduced, published or otherwise used with the City of San Diego's permission. We request only that the City of San Diego be cited as the source of the information and that any photo credits, graphics or bylines be similarly credited to the photographer, author, or City of San Diego, as appropriate. If a copyright is indicated on a photo, graphic, or any other material, permission to copy these materials must be obtained from the original source."
Crawl Results


Comments from crawl operator: Need feedback about whether the desired content was retrieved.
Question for curator: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 1084 additional hosts were crawled. (This figure does not represent a complete crawl, as the size limitation was reached.) The following hosts supplied more than 75 documents to your site:
[urls] [bytes] [host]
3728 556231640 www.sandiego.gov
1247 38685244 genesis.sannet.gov
1085 80905 dns
807 6676252 www.houstontexans.com
428 1079658 www.cacities.org
399 102298888 www.buccaneers.com
259 1797232 granicus.sandiego.gov
258 42666066 clerkdoc.sannet.gov
238 5413894 www.ccdc.com
225 2503591 www.ci.el-cajon.ca.us
223 1387347 www.ipl.org
217 2683826 www.sdcounty.ca.gov
203 11673212 restaurants.sandiego.com
195 2620365 www.sdcommute.com
192 1344523 www.bengals.com
189 2221192 www.kidsdomain.com
176 1333528 www.buffalobills.com
171 685965 www.chumpsoft.com
166 277238 www.proquest.com
[list truncated…]


Curator Feedback to CDL (Dreger San Diego)

Crawl Success: mostly effective
Crawl Success Comments: This crawl was not completed due to size, so that may explain some of my questions. It was pretty effective in terms of getting the Planning Dept pages, but went out further than I expected. Due to the vague request to drill down several levels, I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example, http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit. It's nice that PDFs and other formats are included.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The linked hosts included (via) crawl seemed to include more extraneous stuff.
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
A: You asked for a better definition of what was meant by the request to drill down several levels (at least 3). Unfortunately, I wasn't the original curator and I'm not sure what he meant. The crawl you did was very useful.
Crawl Frequency: monthly
Questions/Comments about crawl: I think that the crawl frequency should be at least monthly for these pages.


Peter Filardo and Michael Nash: New York City Central Labor Council

CDL Report to Curator

URL: http://www.nycclc.org
Curator's original comments: (none)
Site copyright statement: "© 2004 New York City Central Labor Council. No portion of this website may be reproduced in any form without permission from the Central Labor Council. Contact our offices for more information at nycaflcio@aol.com."
Crawl Results: NOTE: Because your Crawl "A" had to be stopped, then resumed, each of your reports for that crawl is in two segments. To browse a list of all reports for that crawl, go to http://voro.cdlib.org:8081/ingest_misc/ndiipp/testcrawls_raw/filardo_labor_via

Comments from crawl operator: A. Linked hosts included: Crawl complete after recovery, with the addition of max retries. Seemed to hang at http://www.nycclc.org/calendar/event.asp?EventId=501 and http://www.nycclc.org/assets/HLCapplicationmembership.pdf. Ended crawl; it seemed to hang. Recovered from the previous job; the recovery was successful. Note for the future that a recovered job is identifiable because the logs directory is called logs-R.


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, x additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
1913 74260017 www.nycclc.org
156 11755 dns
115 710552 www.aflcio.org
73 1477966 www.comptroller.nyc.gov
71 193264 www.empirepage.com
60 570115 www.redcross.org
58 269079 www.afl-cio.org
57 240845 www.campsussex.org
57 113676 www.mssm.edu
56 449473 www.labor-studies.org
53 184605 www.pbbc.org
52 134326 www.senate.gov
[list truncated…]

Curator Feedback to CDL (Filardo, NYCCLC)

None provided


Valerie Glenn and Arelene Weibel: Strengthening Social Security

CDL Report to Curator

URL: http://www.strengtheningsocialsecurity.gov
Curator's original comments: "Contains external links to audio & video that would be essential to completing this site (see press room); some are files, some are links to webcasts, some are on external .gov sites, and some are external .com sites."
Site copyright statement: Copyright info not found
Crawl Results

Comments from crawl operator: Interesting audio/video/webcast content; need feedback about success in capturing it.
Questions for Curator: How successfully did this crawl capture the multimedia documents you were interested in?
Comments from coordinator: In the media center area of this site (http://www.strengtheningsocialsecurity.gov/press/media_center.shtml) I've found .ram (both video and audio alone), .smil, and .asx files. The site also contains numerous .ppt and .pdf files.
• A text search on the log file turns up numerous .ram files, only one .ppt file.


• .asx files are Windows streaming media redirector files, which generally lead to associated .wmf files. No .asx references appear in the crawl log, nor do any .wmf files.
• Similarly, .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files.
I assume that when displayed, some of the Real media files from this site would function, but many of the other multimedia files would not.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 388 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
660 10668874 www.chelseapiers.com
562 7334035 www.whitehouse.gov
477 6366197 www.laopinion.com
391 29623 dns
356 3874719 www.wkrc.com
243 12294240 www.strengtheningsocialsecurity.gov
178 1935969 www.xavier.edu
148 237055 image.com.com
127 682069 online.wsj.com
117 898439 www.omaha.com
116 514995 www.npr.org
108 995733 www.nba.com
[list truncated…]

Curator Feedback to CDL (Glenn Strength Soc Sec)

Crawl Success: mostly effective
Crawl Success Comments: My main concern about this site was the multimedia documents included; I've posted those thoughts in the Questions for Curator text box.
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: For this site it was essential to capture the linked hosts (via) because many of the press materials, etc. were on external sites.
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: How successfully did this crawl capture the multimedia documents you were interested in?
A: I'm disappointed that not all of the multimedia files were captured, but there seem to be only a few that aren't included. I found it interesting that broadcasts from the same host (whitehouse.gov) weren't completely captured: some were, some weren't.
Crawl Frequency: once
Questions/Comments about crawl: [note: the curator alludes to adding a comment here, but no comment was received]


Valerie Glenn and Arelene Weibel: Defense Base Closure and Realignment Commission

CDL Report to Curator

URL: http://www.brac.gov
Curator's original comments: "In our previous efforts we have been unable to capture agency databases. The BRAC site includes a document library which has a search feature (http://www.brac.gov/Search.aspx) and a browse feature (http://www.brac.gov/Browse.aspx). We would really like to see how this information can be captured so that we can recreate it on our own servers."
Site copyright statement: "The contents of all material available on this Internet site are in the public domain and are not copyrighted. The content of this site may be freely reproduced, downloaded, disseminated, published, or transferred in any form and by any means. However, in some cases the copyright for certain text or images on this site may be held by other parties."
Crawl Results

Comments from crawl operator: A. Linked hosts included: http://www.brac.gov/Search.aspx can't be captured by Heritrix. http://www.brac.gov/Browse.aspx seems to only capture the first 25 documents. Tried again with the browse page as the starting point, but the crawl stopped after 1005 documents; 20 links were extracted from the browse page and then there were no more URLs in the frontier queue which had been extracted from Browse.aspx; perhaps this needs more experimentation. B. Restricted to original host: again, only the first 25 pages from browse; can't even successfully pass a seed URL listing the max docs per browse page (50).
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 44 additional hosts were crawled. Note that because your crawl did not complete, this is not an accurate count of how many other sites your original site may link to.
[urls] [bytes] [host]
2034 1064389540 www.brac.gov
555 5874934 www.slu.edu
87 173510 www.cpcc.edu
54 154588 www.wmata.com
47 685158 www.sluhospital.com
44 3501 dns
44 582555 www.c-span.org
43 174467 www.adobe.com
38 178153 www.q-and-a.org
32 127325 slubkstore.com
24 140653 www.c-spanclassroom.org
23 326680 www.capitalnews.org
22 213116 cancercenter.slu.edu
21 196012 www.defenselink.mil
[list truncated…]

Curator Feedback to CDL (Glenn Defense Base Closure)

Crawl Success: somewhat effective
Crawl Success Comments: I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: once
Questions/Comments about crawl: One of the reasons I submitted this site to be crawled is that I'd already tried to harvest it using HTTrack. That product appeared to capture more of the documents than Heritrix, but we still haven't been able to capture the entire site.


Gabriela Gray: Join Arnold

CDL Report to Curator

URL: http://www.joinarnold.com
Curator's original comments: "California Recovery Team. Non-profit pro-Arnold group, not registered as a campaign committee. Critical aspects: complex file and directory naming structure; looping, e.g. contactus.asp and contactadd.asp links."
Site copyright statement: "Copyright 2005"
Crawl Results

Comments from crawl operator: A. Linked hosts included: Great site for testing -- this loop is really interesting because a new URL is generated with each loop, so that the duplicate detection underway at IA would still not eliminate it. 44332 of the retrieved URLs were contact pages. B. Restricted to original host: Got into a loop; by the end, 999 documents were retrieved in 34 minutes. C. Restricted to original host + regular expression:


Excluding pages that matched a regular expression on contactadd.asp?c= did not end the loop. What did end the loop was excluding both contactus and contactadd pages so that they were not retrieved -- a drawback. (IA takes the manual approach of gathering the pages, then having an operator stop the crawl and take out the looping URLs by hand -- not scaleable.)

<newObject name="contact" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
  <string name="decision">REJECT</string>
  <string name="regexp">.*contact.*\.asp\?c=.*</string>
</newObject>

Related hosts crawled: Because of looping problems, we were not able to crawl other hosts linked from this site.
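One way to avoid re-running a crawl just to test an exclusion rule is to try a candidate REJECT pattern against URLs pulled from the earlier crawl log. The sketch below does this with invented sample URLs and a pattern of the same general form as the rule above; both are illustrative assumptions, not the exact values used in this test.

    # Sketch: check whether a candidate REJECT regexp matches the looping
    # contact URLs before configuring it in Heritrix. Sample URLs are invented.
    import re

    reject = re.compile(r".*contact.*\.asp\?c=.*")

    samples = [
        "http://www.joinarnold.com/contactus.asp?c=8f31a02",
        "http://www.joinarnold.com/contactadd.asp?c=99b7c13",
        "http://www.joinarnold.com/agenda.asp",
    ]
    for url in samples:
        print("REJECT" if reject.match(url) else "ACCEPT", url)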

Curator Feedback to CDL (Gray Arnold)

Crawl Success: somewhat effective
Crawl Success Comments: We spot-checked, and it looks like most files were captured, but the individual pages don't display most of the images. (This may simply be a problem with the WERA interface.) Strangely enough, the Flash files work perfectly, which is exactly the opposite of our own capture experience.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: As per the crawl notes, we only checked the original host version, since the via crawl failed.
Crawl Frequency: once
Questions/Comments about crawl: Same comments as on the Villaraigosa crawl. We're inexperienced with this type of capture: archive files as-is and use a tacked-on script and a special server-side interface to interpret links in the new environment. Our model has been to actually alter the internal links from absolute to relative formats so that it works in any environment.


Gabriela Gray: Mayor-Elect Villaraigosa

CDL Report to Curator

URL: http://www.antonio2005.com
Curator's original comments: Critical aspects: Flash animation; content scattered across multiple servers; maintaining complex internal link structure; JavaScript menus; streaming media.
Site copyright statement: "© 2005 Villaraigosa for Mayor 2005"
Crawl Results

Comments from crawl operator:
• (For linked hosts results) Need feedback on media, etc. retrieved -- this site is an ideal example of the need for scope+one.
• (For restricted to original host) How much was left out due to the domain restriction? Need feedback.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 263 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:
[urls] [bytes] [host]
817 10291631 ensim3.interlix.com
805 117538973 www.antonio2005.com
472 6333775 www.laopinion.com
265 21173 dns
110 19355921 www2.dailynews.com
100 16605730 www2.dailybulletin.com
95 1410145 www.americanpresidents.org
86 820148 www.dailynews.com
73 168698 www.chumpsoft.com
72 52321 images.ibsys.com
69 836295 www.laobserved.com
65 137700 www.mysql.com
55 213569 www.ensim.com
55 177141 www.lamayorcn.com
55 296311 www.surveyusa.com
53 495858 abclocal.go.com
52 522324 www.c-span.org
51 244668 gallery.menalto.com
[list truncated…]

Curator Feedback to CDL (Gray Villaraigosa)

Crawl Success: mostly effective
Crawl Success Comments: Doing some spot checks, it looks like all of the pages were captured. Some problems with media files -- WERA shows them when we search, but the files are often size 0. In addition, many files on external servers are listed and even have some descriptive info, but when we click on Overview it says "Sorry, no documents with the given uri were found," so no idea if they were really captured.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: Most of the links to other sites go to pages that truly are external to the site, not incorporated into it. The only exception would be the pages from ga3.org and ga4.org.
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) The results on the multiple hosts crawl are mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains, rather than source + 1. Also, there seem to be a lot of links going to ensim3.interlix.com -- there were more files captured from there than from antonio2005. This is a case where it looks like the entire site (or much of it) was mirrored on two different servers. When we find these, we often try to collapse them into one seamless whole, eliminating the duplication, which is meaningless from the user's viewpoint. Simply capturing both mirrors and leaving the cross-links intact is an option we've used when we can't collapse, but it often leads to problems with links between the two, which seems to be the case here.


Crawl Frequency: once
Questions / Comments about crawl: We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE -- the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.


Ron Heckart and Nick Robinson: Public Policy Institute of California
CDL Report to Curator

URL: http://www.ppic.org
Curator's original comments: "We are particularly interested in their publications. We hope the crawler will be able to report when new publication files are posted on the website. Our main focus of interest is on their new publications at httpwwwppicorgmainnewpubsasp."
Site copyright statement: "All Contents © Public Policy Institute of California 2003, 2004, 2005"
Crawl Results

Comments from crawl operator: We can't at the moment use Heritrix to report on new publications posted.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 238 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

[urls] [bytes] [host]
2421 324309107 wwwppicorg
433 1367362 wwwcacitiesorg
238 19286 dns
229 4675065 wwwicmaorg
200 598505 bookstoreicmaorg
151 1437436 wwwgreatvalleyorg
144 517953 wwwkfforg
137 5304390 wwwrfforg
113 510174 www-hooverstanfordedu
102 1642991 wwwknowledgeplexorg
97 101335 cdnmapquestcom
81 379020 wwwcdecagov
73 184118 wwwilsgorg
68 4539957 caagstatecaus
62 246921 wwwmilkeninstituteorg
[list truncated…]

Curator Feedback to CDL (Heckart PPIC)

Crawl Success: mostly effective
Crawl Success Comments: There are some problems with the functionality of captured pages. 1) httpwwwppicorgmainhomeasp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) httpwwwppicorgmainallpubsasp: The radio button selections are not functional. For example, clicking on "Date" retrieves the message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report; this is the curator's answer to the question.)
Crawl Frequency: weekly
Questions / Comments about crawl: Our crawl report included the following comment from the crawl operator: "We can't at the moment use Heritrix to report on new publications posted." The ability to report on new publications is critical to our goal of using the crawler as a discovery tool. What are the prospects for providing this functionality in the future?


Terry Huwe: AFL-CIO
CDL Report to Curator

URL: http://www.aflcio.org
Curator's original comments: "This site is content rich and has many files that will be useful in the future. Specific areas that are of special interest follow below: httpwwwaflcioorgcorporatewatch -- the data related to executive pay watch is especially useful; httpwwwaflcioorgmediacenter -- would like to see press stories captured if possible; httpwwwaflcioorgissues -- links to newsletters and original content. Also, the 'Legislative Action Center' on the home page; this is a useful topic guide to legislative history from a labor perspective."
Site copyright statement: "Copyright © 2005 AFL-CIO"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 2572 additional hosts were crawled. The following hosts supplied more than 75 URLs to your site:

[urls] [bytes] [host]
12702 481956063 wwwaflcioorg
2657 184477 dns
1375 35611678 wwwlocal237teamsterscom
570 8144650 wwwillinoisgov
502 52847039 wwwiloorg
435 3851046 wwwciosloritorg
427 2782314 wwwnolacom
401 8414837 www1paperthincom
392 15725244 wwwstatehealthfactskfforg
326 4600633 wwwdolgov
288 12303728 searchoxidecom
284 3401275 wwwsikidscom
280 3069385 wwwwashingtonpostcom
272 1480539 wwwcdcgov
235 5455692 wwwkfforg
[list truncated…]

Curator Feedback to CDL (Huwe AFL-CIO)

Crawl Success: effective
Crawl Success Comments: I realize the collection interface is a work in progress and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g., Working Families Toolkit, BushWatch) that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2-million-plus-hits-per-year Web sites, to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer, which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses of the toolkit. Having said all that, my short answer is: I'm stoked. :-)
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In the case of this collection, there's a lot of original content (and context-making documents) on the linked sites, simply due to the hierarchical/federal nature of the labor movement.
Crawl Frequency: monthly
Questions / Comments about crawl: I may have missed correspondence on this question, as I've been busy with a library renovation here. My question: is the 1-gig limit a useful one? I couldn't help wondering what the actual figure would be if that limit were set higher. I'm generally interested in the total size of Web collections, as I need to monitor ours (which is content-rich, but even so does not exceed 1 gig). So discussion of this at least would be interesting, and maybe, if it's not technically challenging, that 1-gig level might be raised and we could see what happens. Thanks to all involved -- very interesting process.


Kris Kasianovitz: Los Angeles Dept. of City Planning
CDL Report to Curator

URL: http://cityplanning.lacity.org
Curator's original comments: "Website uses frames. Most of the documents will be pdfs. Of particular interest: the EIRs, which are not archived on the page once the project is approved (see httpcityplanninglacityorgEIRTOC_EIRhtm); General and Community Plans (httpcityplanninglacityorgcomplangen_plangenplan2htm, httpcityplanninglacityorgComPlancpbpagehtm)."
Site copyright statement: No copyright information found
Crawl Results

Comments from crawl operator: (Linked hosts included crawl) ended because it ground on for 3 days without hitting the data limit; not sure if URLs at the end of the log are valid/useful.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 119 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
10493 840876945 cityplanninglacityorg
601 5156252 metrolinktrainscom
183 644377 wwwcrnpsgov
121 11162 dns
90 977850 wwwmetrolinktrainscom
81 1207859 wwwftadotgov
79 263432 wwwfypowerorg
66 333540 wwwadobecom
64 344638 lacityorg
63 133340 cerescagov
60 274940 wwwamtrakcom
59 389217 wwwnhtsadotgov
58 347752 wwwunitedweridegov
52 209082 wwwdotgov
52 288783 wwwnationaltrustorg
51 278949 wwwportoflosangelesorg
[list truncated…]

Curator Feedback to CDL (Kasianovitz LA City Planning)

Crawl Success: mostly effective
Crawl Success Comments: The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all pdfs. When searching specifically for EIRs I got a large result list (699 citations); however, when I investigated whether or not the actual file was captured, I found that the main EIR page was captured (typically a htm/html file), but when I clicked on a link to get to the full report, all I got was the "Sorry, no documents with the given URI were found" message. This could be that the file was no longer available when the site was harvested; however, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved. Here are a few specific examples:
Final EIR directory: httpcityplanninglacityorgEIRTocfeirhtm
Sierra Canyon Secondary School (cover page): httpcityplanninglacityorgEIRSierraCyn2ndSchoolSierraCyn_coverpghtm -- Access to the Draft EIR and Final EIR is provided from this cover page. Within the system, the links to both the Draft and Final are broken (no documents with that given URI): httpcityplanninglacityorgEIRSierraCyn2ndSchoolDEIRTable of Contentshtm
Villa Marina EIR: httpcityplanninglacityorgEIRVillaMarinaVillaMarina_coverpghtm
Directory of EIR notices of preparation: httpcityplanninglacityorgEIRNOPsTOCNOPHTM


This provides links to EIR preparation notices (all pdfs), a total of 27 links/documents. I encountered the following three issues:

• pdf opened = when clicking on the link to the notice, the pdf opened with no problem (16 of 27).

• "Sorry, no document with the given uri was found" = no pdf harvested, but I could get to it from the live site (4 of 27).

• Acrobat "could not open" message (could open live page outside of WERA) = the following Acrobat message came up when I tried to open the pdf: "Acrobat could not open ENV-2005-0881-EIR[1].pdf because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded). To create an Adobe Acrobat PDF document, go to the source application. Then print the document to Adobe PDF." I copied and pasted the URL into a regular browser and could open the pdf with no problem. Also, in a few cases, if I clicked on the GO button after the first attempt to open the pdf in the system, it seemed to launch (7 of 27).

Conversely, I found a number of pages that contained full documents in html, with links to pdfs that worked with no problem; see the following document: httpcityplanninglacityorgcwdgnlplntranseltTET2Bkgrndhtm. File types and error codes were what I expected.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The "restricted to original host" setting is more helpful/relevant for getting to the materials from a specific agency. For some of the results that were restricted to host I was getting external links. Here are some comparisons for each of the crawl settings.
Searched for "villa marina":
LA Dept of City Planning -- 6 results:
httpcityplanninglacityorgEIRNOPsENV-2004-3812-EIRpdf
httpcityplanninglacityorgEIRVillaMarinaVillaMarina_coverpghtm
httpcityplanninglacityorgEIRNOPsTOCNOPHTM
httpcityplanninglacityorgEIRTocfeirhtm
httpcityplanninglacityorgcomplanpdfplmcptxtpdf
httpcityplanninglacityorgCwdGnlPlnHsgEltHETblFigApVHgSithtm
LA City Dept of Planning (via) -- 2 results:
httpcityplanninglacityorgcomplanpdfplmcptxtpdf
httpcityplanninglacityorgCwdGnlPlnHsgEltHETblFigApVHgSithtm
Searched for "eir": LA Dept of City Planning -- 699 results; LA City Dept of Planning (via) -- 324 results.


For both of these searches, the URIs were from cityplanninglacityorg. Searched for "transportation": LA Dept of City Planning -- 699 results; LA City Dept of Planning (via) -- 290 results (most are from external sources and tended to be the index or main page of another agency or organization). Because this just got me to the main page, and none of the links functioned at that level, the via result was less helpful. However, the via results are useful for discovering other agencies or organizations that I should be looking at for materials.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report; this is the curator's answer to the question): The list of linked hosts provided at the end of the report is helpful. This will help me target other agencies that might be key places to check for further collection development.
Crawl Frequency: monthly
Questions / Comments about crawl: I want to qualify the frequency for this site. I'd like to do a monthly crawl for 3-4 months; I'd then want to reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot of new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation. After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report there are 8899 documents in this crawl alone; my other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls. I should note that, at this point in time, for local documents I am more interested in individual documents than in capturing an entire website and preserving the functionality. Local agencies (with the exception of perhaps the Mayor's site) tend not to change the design (look and feel) very often, or to have a very sophisticated design (i.e., flash, changing images, etc.), and I don't see that this would be of interest to researchers' needs; it is the content -- reports, maps, etc. -- contained/accessed on the websites that is important. Maybe I'm wrong or being short-sighted about that.


Kris Kasianovitz: Southern California Association of Governments
CDL Report to Curator

Curator's original comments: "This is a critical regional agency for Los Angeles, Orange, Ventura, Imperial, Riverside, and San Bernardino counties. Its main areas are transportation, housing, and economic development. This will provide an analysis of the overall site, which has a lot of content. Publications/reports are typically in pdf; they are presented as full reports and pieces of the report (for easier downloading), so there might be duplication. The full report is really all that would be needed; see httpwwwscagcagovpublications. The Resources page contains the pdfs, images, dynamic content, and gis programs, including an interactive atlas: httpwwwscagcagovresourceshtm. One part of the Resource site is the Web Accessible Geographic Data Search (WAGS): httpmapsvrscagcagovwagsindexcfmfuseaction=. It requires a user-created login and password (although there is a guest login that allows you to bypass this). I'm not sure what kind of difficulty the harvester will encounter with this portion of the site. The interactive atlas also has a create-an-account/guest-login issue: httpmapsvrscagcagovatlaspresmapaspCmd=INIT. Since it is a dynamic page, I don't know how this will be handled by the harvester."
Site copyright statement: "© 1999-2005 Southern California Association of Governments"
Crawl Results


Comments from crawl operator: Interesting login problem; Heritrix was unable to retrieve the guest login pages. ColdFusion and asp don't generate new URLs and thus don't get crawled. NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than 1 minute.
Robots.txt file: The site you selected forbids crawlers from gathering certain data. It reads:

User-agent
Disallow _mm
Disallow _notes
Disallow _baks
Disallow MMWIP

User-agent googlebot
Disallow csi

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 500 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
2517 863231651 wwwscagcagov
690 6134101 wwwmetrolinktrainscom
506 40063 dns
428 1084533 wwwcacitiesorg
397 16161513 wwwscecom
196 581022 bookstoreicmaorg
187 4505985 wwwicmaorg
175 7757737 wwwciseal-beachcaus
158 1504151 wwwh2ouseorg
149 940692 wwwhealthebayorg
137 317748 wwwcipico-riveracaus
130 18259431 wwwciventuracaus
123 490154 wwwchinohillsorg
121 406068 wwwlakewoodcityorg
119 203542 wwwlavotenet
117 2449995 wwwcimalibucaus
114 744410 wwwciirvinecaus
113 368023 wwwwhitehousegov
109 974674 wwwdotcagov
107 892192 wwwlacanadaflintridgecom
[list truncated…]

Curator Feedback to CDL (Kasianovitz SCAG)

Crawl Success: mostly effective


Crawl Success Comments: Similar to my comments about the Los Angeles Dept. of City Planning -- the crawl brought back a lot of webpages, but not the publications/documents that I would want to collect. Again, the same problem happened with SCAG as happened with the EIR example: the crawl brought back html pages with links to reports (typically in pdf format), but the actual documents were not captured. While the webpage is helpful as it gives context, the main content that I'd want to capture (the reports) was not captured. See the following for example; none of the webpages linked from these pages are available (they should link to a page that will have the material), and I tried searching for the documents separately and couldn't get to them: httpwwwscagcagovpublicationsindexhtm (the timeline arrows at the top seemed to function; I'm not sure what this is for), httpwwwscagcagovlivablepubshtm. I was impressed to find that zip files were captured and I was able to download them. Unfortunately, when I opened them there wasn't any content (I did the same search by mistake with the Arizona Dept. of Water Resources and actually found content in the folders); I found 10 with the search "type zip". Gif or jpg images retrieved are not useful -- most were just bars or bullets or covers of reports (although this might be helpful to identify titles, I think I would end up discarding these, after doing more checking of the results).
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments from LA Dept. of City Planning. "Restricted" gets me to the relevant materials for that agency; "via" brings back too many main webpages from other agencies to be useful.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report; this is the curator's answer to the question): I expected that there would be a problem with the content behind logins. The crawl confirmed that material behind login screens couldn't be captured. Can I get a copy of these crawl results? NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than one minute.
Crawl Frequency: monthly
Questions / Comments about crawl: How to handle the copyright issue? For the login information, I'm not sure what all was blocked by the robots file. Interesting that metrolinktrainscom is the #1 related host for both of my crawled sites.


Linda Kennedy: California Bay Delta Authority
CDL Report to Curator

URL: http://calwater.ca.gov
Curator's original comments: "We are interested in the environmental impact statements and other key documents, and the various news releases and other announcements and archives of CALFED."
Site copyright statement: "© 2001 CALFED Bay-Delta Program"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 519 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
1130 473192247 calwatercagov
741 201538533 wwwparkscagov
521 40442 dns
373 51291934 solicitationcalwatercagov
242 78913513 wwwcalwatercagov
225 410972 cweaorg
209 87556344 wwwsciencecalwatercagov
173 109807146 sciencecalwatercagov
172 1160607 wwwadobecom
129 517834 wwwwhitehousegov
[list truncated…]

Curator Feedback to CDL (Kennedy CALFED)


Crawl Success: mostly effective
Crawl Success Comments: I looked closely at the CALFED home page (httpcalwatercagov), Tribal home page (httpcalwatercagovTribalTribal_Homeshtml), Key Documents (httpcalwatercagovCALFEDDocumentsCALFEDDocumentsshtml), and Archives page (httpcalwatercagovArchivesArchivesshtml). The crawl did not complete in either the via or non-via search. Nearly all linked pages were retrieved in the non-via search; however, the retrievals from the via search were much less complete than the retrievals from the non-via search. For example, on the Key Documents page there were 3 missing links from the non-via search but 14 missing links from the via search. When Adobe documents were retrieved from either crawl, they came up correctly. Three asp links of tribal maps from the Tribal home page were retrieved by the non-via search but not the via search. A few of the images were missing from the displays, and this was also affected by the browser used: usually the same image was missing from both crawls, but sometimes the images were more complete in the non-via crawl retrievals. There were some display problems with the right-hand menu boxes on the httpcalwatercagovTribalTribal_Homeshtml page, for example, that did not display correctly when viewed in Firefox. The same page viewed correctly in Internet Explorer, but when printed out the boxes printed incorrectly, just as viewed in the Firefox browser. Grant Opportunities (httpcalwatercagovGrantOpportunitiesGrantInformationshtml): this link did not work in 2 via and 2 non-via instances (from the Tribal home page and from the Archives page), but it did work on one non-via crawl page (the CALFED home page). It could also be searched and retrieved directly from the test crawl search page.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments above. The non-via search was substantially more complete.
Crawl Frequency: monthly


Janet Martorana: Santa Barbara County Department of Planning and Development
CDL Report to Curator

URL: http://www.countyofsb.org/plandev/default.htm
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 487 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
3119 1102414495 wwwcountyofsborg
485 34416 dns
428 1083047 wwwcacitiesorg
357 6126453 wwwsbcphdorg
320 6203035 icmaorg
250 438507 wwwsbcourtsorg
234 1110744 vortexaccuweathercom
200 593112 bookstoreicmaorg
[list truncated…]

Curator Feedback to CDL (Martorana SBCD)

Crawl Success: mostly effective


Crawl Success Comments: Most documents I expected to find were captured, but a number were not. For example, off this page, httpwwwcountyofsborgplandevcompthreeyear2005-2008defaulthtml, I expected to get to the final work program, httpwwwcountyofsborgplandevpdfcompprogramsThree_Year_WP2005-2008_3YrWrkProgrampdf, but got the "Sorry, no documents with the given uri were found" message. Other examples, within the Energy Division, a part of the Planning & Development Dept.: off this page, httpwwwcountyofsborgenergyinformationasp, I could access all links except for two, httpwwwcountyofsborgenergyinformationoilampGasFieldsasp (Oil and Gas Fields) and httpwwwcountyofsborgenergyinformationoilampGasProductionasp (Oil and Gas Production). The crawler seemed to cut off the URL right before the ampersand; perhaps it has problems with ampersands.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly
Questions / Comments about crawl: I'd like to be notified when I've navigated away from the captured site. I found myself on real-time web sites but couldn't tell what was on the captured site and what was real. Example: when I did a search on "santa barbara" I got 1528 results using the via results. I went to the County of Santa Barbara Online site, and then, after clicking on links, I was on the web real-time and no longer in the crawl results database, yet there were no indications that I had left the crawled database. The WERA URI was still displaying at the top of the screen; I couldn't tell which were the captured sites and which were the current, real-time sites. Other observations: the webpage navigation doesn't work, e.g., Table of Contents doesn't jump to that section on the webpage (wwwcountyofsborgenergyprojectsshellasp and wwwcountyofsborgenergymitigationoakProjectasp); links to glossary terms go to the glossary but not to the term itself.


Lucia Orlando: Monterey Bay National Marine Sanctuary
CDL Report to Curator

URL: http://montereybay.noaa.gov
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 795 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
5272 468755541 montereybaynoaagov
861 61141 dns
554 20831035 wwwwundergroundcom
368 4718168 montereybaynosnoaagov
282 3682907 wwwoceanfuturesorg
273 10146417 wwwmbnms-simonorg
260 7159780 wwwmbayaqorg
163 61399 bcusyahoocom
152 1273085 wwwmbariorg
146 710203 wwwmontereycom
119 3474881 wwwrsiscom
119 279531 wwwsteinbeckorg
118 1092484 bonitambnmsnosnoaagov
109 924184 wwwdukeedu
104 336986 wwwmontereybayaquariumorg
103 595953 iconswundergroundcom
102 339589 wwwuncwedu
[list truncated…]

Curator Feedback to CDL (Orlando, Monterey Bay)

Crawl Success: (rating not provided)
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: This site contains a large mix of government policy, laws, and regulatory information, as well as links to recreational and educational activities available in the MBNMS. I was most interested in links to laws/regs/policy and educational info and organizations. I thought the links restricted to the original host best captured this information succinctly.
Crawl Frequency: unknown


Richard Pearce-Moses: Arizona Department of Water Resources
CDL Report to Curator

URL: http://www.azwater.gov
Curator's original comments: (redirects to httpwwwazwatergovdwr) In arid Arizona, water is one of the most important -- and most contested -- resources. The publications and records of this Department are of critical value to the state. Our spider can get many files from this site (1474 files in 258 directories). We are mostly interested in documents by programs. Although our spider can't get the imaged documents database (httpwwwazwatergovdwrContent ImagedRecordsdefaulthtm), this directory may not be critical if we can get the imaged documents transferred to us. We are not interested in blank forms and applications.
Site copyright statement: "Copyright © 1998 - 2005 Arizona Department of Water Resources and ADWR Network. All Rights Reserved."
Crawl Results

Questions for curator: Did this capture the documents you needed?


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 195 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
2233 988447782 wwwazwatergov
286 2350888 wwwwaterazgov
253 4587125 wwwgroundwaterorg
226 3093331 wwwazcentralcom
196 15626 dns
178 395216 wwwmacromediacom
128 1679057 wwwprescottedu
123 947183 wwwazlegstateazus
115 792968 wwwusdagov
[list truncated…]

Curator Feedback to CDL (Pearce-Moses AZWater)

Crawl Success: (not provided)
Crawl Success Comments: We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand it, wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can; that may be the difference. The crawl is listed as not completing, but it appears to be very close, based on the total number of bytes downloaded.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Questions / Comments about crawl: It seemed odd that it took 2h 54m to crawl 7709 documents with linked hosts, but 4h 4m to crawl only 4888 docs when the spider was restricted to the original host.


Richard Pearce-Moses: Citizens Clean Election Commission
CDL Report to Curator

URL: httpwwwccecstateazusccecscrhomeasp
Curator's original comments: This commission was established by initiative. Its work is of great historical significance, as it is changing the way the public elects officials. We have not been able to spider this site because links are buried in JavaScript. (We use wget as our spider.) We are primarily interested in acquiring their publications, election data, and things listed under "popular links."
Site copyright statement: "Copyright 2004 Arizona Citizens Clean Elections Commission. All Rights Reserved."
Crawl Results

Comments from crawl operator: JavaScript issue; interesting problem. Need curator feedback about what we captured.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, the following additional sites were crawled. Total: 15


[urls] [bytes] [host]
929 95456563 wwwccecstateazus
76 6117977 wwwazcleanelectionsgov
55 513218 azgov
49 499337 wwwgovernorstateazus
44 174903 wwwadobecom
40 141202 wwwazlegstateazus
31 18549 wwwazgov
28 202755 wwwazsosgov
23 462603 gitastateazus
19 213976 wwwbenefitoptionsazgov
17 89612 wwwazredistrictingorg
14 1385 dns
3 1687 wwwimagesadobecom
2 1850 wwwcapitolridesharecom
2 26438 wwwftcgov

Curator Feedback to CDL (Pearce-Moses CCEC)

Crawl Success: (not provided)
Crawl Success Comments: We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with wget.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)


Juri Stratford: City of Davis
CDL Report to Curator

URL: http://www.city.davis.ca.us
Curator's original comments: "We are primarily interested in the GIS data produced for the City of Davis. As the GIS data represent the largest part of the City of Davis web site, it may not be much more difficult to archive the site at the top level. Mapping and Geographic Information Systems URL: httpwwwcitydaviscausgis"
Site copyright statement: "This web site is Copyright © 2004 by the City of Davis. All Rights Reserved. The City retains the copyright on all text, graphic images and other content of this site. You may not copy, modify and/or re-use text, images or other web content from this web site, distribute the City's web content, mirror content from this web site on a non-City server, or make any other use of the content of this web site that would violate the City's copyright, without written permission from the City of Davis. To the extent allowed by law, commercial use of our web material is prohibited without written permission from the City of Davis. All art work shown on these web pages is protected by US Copyright laws. Limited reproduction for non-commercial purposes can be authorized by the City of Davis provided that requests are approved prior to use. Contact the Community Development Department Cultural Services program staff at (530) 757-5610 for more information. Some content included in this web site may be provided courtesy of third parties who may retain copyright control of the provided material. Any service marks and trademarks contained herein are the property of their respective owners."
Crawl Results


Comments from crawl operator: "GIS: potential issue -- img is disallowed by robots.txt, e.g., httpwwwcitydaviscausimgfeaturedmap-staticjpg can't be retrieved; also some maps on a second server are disallowed. Need feedback about the gis material that was captured: what was captured that is useful? Much duplication -- pages captured repeatedly."
Robots.txt: The site you selected prohibits crawlers from collecting certain documents. The file reads:

User-agent
Disallow img
Disallow calendar
Disallow miscemailcfm
Disallow edbusiness
Disallow gisoldmap
Disallow policelog
Disallow pcsgrantssacog
Disallow jobslistings
Disallow css
Disallow pcsnutcrackerhistorycfm
Disallow pcsnutcrackerpdfs

User-agent asterias
Disallow

User-agent gigabot
Disallow

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 420 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
16455 947871325 wwwcitydaviscaus
420 29555 dns
332 10377948 wwwasucducdavisedu
305 33270715 selectreecalpolyedu
279 3815103 wwww3org
161 2027740 wwwcrnpsgov
139 941939 wwwcomcastcom
133 951815 wwwyolocountyorg
[list truncated…]

Curator Feedback to CDL (Stratford Davis)

Crawl Success: mostly effective
Crawl Success Comments: Looking at the GIS Online Maps page, it's not clear which formats were retrieved and which were not. For example, the Growth Map Flash file downloads fine, but the Flash/ArcIMS files do not download.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In general, restricted to original host works better. The broader search includes pages from businesses, e.g., Comcast, and other local agencies, e.g., other local and state government sites. But restricting the outside sites to the first level seems to be a good compromise.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report; this is the curator's answer to the question): In general, it looked like it did a good job pulling geographic data/images. For example, you can pull down data from httpwwwcitydaviscausgislibrary. It's difficult for me to get a sense of the level of duplication from the way the search results display.
Crawl Frequency: monthly
Questions / Comments about crawl: I like the ability to navigate within the archive once you have a starting point. This is very nice.


Yvonne Wilson: Orange County Sanitation District
CDL Report to Curator

URL: http://www.ocsd.com
Curator's original comments: "This is an Orange County agency that has small reports and complex planning documents on its web site."
Site copyright statement: "Copyright © 2001-2005 Orange County Sanitation District. Unless a copyright is indicated, the information on this site is freely available for non-commercial, non-profit-making use. If a copyright is indicated on any materials displayed on our website, permission to copy these materials must be obtained from the original source. Commercial use of District materials is expressly prohibited without the written permission of the OCSD."
Crawl Results

Comments from crawl operator: In both crawl settings we had to limit the maximum number of retry attempts in order to complete the crawl.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 85 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
755 85943567 wwwocsdcom
164 7635257 wwwciseal-beachcaus
122 809190 wwwciirvinecaus
95 169207 epagov
86 7673 dns
85 559125 ordere-arccom
66 840581 wwwcihuntington-beachcaus
62 213476 wwwcityoforangeorg
57 313579 wwwepagov
55 4477820 wwwvillaparkorg
50 1843748 wwwcityoflapalmaorg
50 463285 wwwocbinccom
[list truncated…]

Curator Feedback to CDL (Wilson OCSD)

Crawl Success: somewhat effective
Crawl Success Comments: Using WERA, I searched inquiries by type and title in the two OCSD collections, plain and via. I received no hits for pdf, only the homepage for html, and three hits for text; there are many pdf sections in the EIRs. I next searched by title in the two collections. I was the most successful in via. By searching the titles "carbon canyon" and "Ellis Ave Pumping Station" I found articles, but not the EIRs, which are available full text. At this point I presumed the crawl did not drill down far enough. Then I searched for a secondary page entitled "Ocean Monitoring"; this time the search found only an internal letter and memo, but not all the documents related to this topic. The via collection search seems to be the most productive, but it is not consistent.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: I searched some of the outside links to US Marine Fisheries and EPA Beach Watch and received no hits.
Crawl Frequency: monthly


Crawl Report Key: Web-at-Risk Test Crawls

This document is a guide to the test crawl report, providing further information about some of the statistics and results conveyed there. Your report includes basic information about the site you submitted, your original comments about that site, and any copyright statements CDL found when we examined the site. Your site was crawled using the Heritrix crawler, version 1.5.1. Your report will include the following information about the site.

Crawl Settings: We crawled each site in two different ways: (A) linked hosts included, and (B) restricted to original host. Where linked hosts were included, we set the crawler to gather any outside page that your nominated site linked to, but no further. So if your site linked to a single document from whitehouse.gov, we captured that document but did not crawl any further on the whitehouse.gov site. Your report will show results from both styles of crawling, to give you a sense of whether or not the site draws heavily from valuable materials on another site. You will also receive a list of the other hosts that the site linked to, and how many documents were gathered from those hosts.

Robots.txt file: The presence of a robots.txt file means that the content provider is asking us to refrain from crawling either all or part of the site. This refers only to the host you named in your crawl request. In some cases the site had a robots.txt file but it didn't say anything; we noted when this occurred. We obeyed robots.txt instructions for these crawls, so if the site contained one but we still got a result, that means the robots.txt file only prevented us from crawling certain areas of the site. When robots.txt files were present, we have included the text of that file in your report so you can see which segments of the site the site owner wants to protect.

Crawl duration / Total number of documents: The "Documents" count will include page components (such as images or Flash files).

File types (mime types): This area will contain a URL. When you go to that URL you will see a list of the different file types that were retrieved as part of the crawl. IMPORTANT: The Heritrix crawler is currently experiencing difficulty with this report and it is missing a crucial column. Until that is fixed, you can see the different file types retrieved, from most common to least, but you cannot yet tell how many files each one included. CDL will contact you when this report has been fixed.
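The robots.txt handling described above can be illustrated with a short sketch. The following minimal Python example shows the kind of check a polite crawler performs before fetching a URL; Heritrix implements this internally with its own settings, and the host, paths, and user-agent string below are hypothetical.

```python
# Minimal sketch of a robots.txt check before fetching a URL.
# Heritrix performs this internally; the host and paths are hypothetical.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.gov/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

candidates = [
    "http://www.example.gov/publications/report.pdf",  # typically allowed
    "http://www.example.gov/img/banner.jpg",            # blocked if "Disallow: /img" is present
]
for url in candidates:
    if rp.can_fetch("heritrix", url):
        print("would fetch:", url)
    else:
        print("blocked by robots.txt:", url)
```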


Response code reports: The URL in this column will lead to a list of response codes, in order by frequency. This will include "200" for files that were successfully captured and error codes for files that were not captured. The error code list includes some codes specific to Heritrix. The key to interpreting these codes is at http://crawler.archive.org/articles/user_manual.html#statuscodes. Note that this report only gives you quantitative information about response codes; it does not link response codes to specific files. For these details, see "hosts report and crawl log" below.

How much data collected (bytes): The file size of the total crawl is reported in bytes. You can use the byte conversion tool at httpwwwtechtutorialsnetreferencebyteconvertershtml if you want to recalculate the size of the crawl in another measurement, such as kilobytes or megabytes.

Did crawl complete: This will say "no" if the crawl results exceeded 1 gigabyte, or if the crawler encountered an obstacle to capturing the site that could not be fixed.

Location of hosts report and crawl log: You are welcome to review the page-by-page details of the crawl log. This is a generic report that comes with the Heritrix crawler and is not terribly user-friendly, but it provides the most detail about the crawl process. This report will list every file that the crawler attempted to get and provide some information about each file. The Heritrix manual can help you interpret this report: http://crawler.archive.org/articles/user_manual.html#logs (go to section 8.2.1, Crawl Log).

Comments from Crawl Operator: These are observations that the Web Archive Programmer made about the crawl process for your site.

Questions for Curator: This section does not appear in every report. If you have a question listed here, please respond to it in the Test Crawl Feedback form.
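The figures this key describes (response-code counts and total bytes) can also be tallied directly from the crawl log. The sketch below is illustrative only: it assumes the whitespace-separated Heritrix 1.x crawl.log layout, with the fetch status code in the second column and the document size in the third, and a hypothetical log file name.

```python
# Sketch: tally fetch status codes and total bytes from a Heritrix crawl log.
# Assumes the Heritrix 1.x crawl.log layout (status code in column 2,
# document size in column 3). The log path is a hypothetical example.
from collections import Counter

status_counts = Counter()
total_bytes = 0

with open("crawl.log") as log:
    for line in log:
        fields = line.split()
        if len(fields) < 4:
            continue
        status, size = fields[1], fields[2]
        status_counts[status] += 1
        if size.isdigit():
            total_bytes += int(size)

for status, count in status_counts.most_common():
    print(status, count)

# Report the crawl size in megabytes as well as raw bytes.
print("total bytes:", total_bytes, "=", round(total_bytes / 1024.0 / 1024.0, 1), "MB")
```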


Your Collection: Important: the screens you will see do not represent the final user interface for the Web Archiving Service tools. We are using WERA, an open-source search and display tool, only to show you your test crawl results. Because we did not seek the right to redistribute these documents, these pages are available only for the purpose of analyzing crawler effectiveness. You must have a password to view these pages. Your report will include the address of a wiki page and a login and password. Each site was crawled twice: "plain" crawl = only pages from the original site were collected; "via" = pages from the original site, as well as pages that site links to, were collected. Unfortunately, you cannot simply browse your site; you must select a collection and type a search. You will be able to navigate throughout your site once you load a page containing links. You will be able to review your colleagues' sites as well. Note that the WERA display tool is not perfect: if the same document was gathered from more than one crawl, it may not display in every collection.

Related Hosts Crawled: This section provides further information about the additional materials that were gathered when we set the crawler to include documents that your site links to. This can be critical in deciding what settings are needed to capture your site. Some sites, for instance, will keep all of their pdf or image files on a separate server; if you don't allow the crawler to move away from the original URI, you won't capture a critical portion of the site's content. In other cases, however, this setting will lead to irrelevant information. This report includes the most commonly linked hosts from your site.
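A summary of the same shape as the "Related hosts crawled" tables in these reports ([urls] [bytes] [host]) can be derived from the crawl log. The sketch below makes the same assumptions about the Heritrix 1.x log layout as the earlier sketch (URI in the fourth column, size in the third) and is an illustration, not part of the report tooling itself.

```python
# Sketch: per-host [urls] [bytes] [host] summary, similar to the
# "Related hosts crawled" tables in this report. Assumes the same
# hypothetical Heritrix 1.x crawl.log column layout as the earlier sketch.
from collections import defaultdict
from urllib.parse import urlsplit

urls_per_host = defaultdict(int)
bytes_per_host = defaultdict(int)

with open("crawl.log") as log:
    for line in log:
        fields = line.split()
        if len(fields) < 4 or not fields[3].startswith("http"):
            continue
        host = urlsplit(fields[3]).hostname or "unknown"
        urls_per_host[host] += 1
        if fields[2].isdigit():
            bytes_per_host[host] += int(fields[2])

# Print hosts that supplied more than 50 URLs, most productive first.
for host in sorted(urls_per_host, key=urls_per_host.get, reverse=True):
    if urls_per_host[host] > 50:
        print(urls_per_host[host], bytes_per_host[host], host)
```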



Language and Web Site Models

One of the challenges of communicating and interpreting crawl results is that crawlers don't necessarily work the way people envision them to. Further, the appearance of a web site on a screen and its architecture on a server may be quite different: the directory structure of a site may have no relationship to the way its navigation is organized on a screen. One frequent point of confusion curators encountered while interpreting crawl results is the concept of how many "levels down" the crawler went. One curator requested that we "drill down several levels (at least 3)" in our capture. One challenge with this request is that "levels down" can be interpreted to mean different things. In some cases curators clearly mean the directory structure of the web site on the remote server. In other cases, however, they seem to mean the number of hosts away from the original site:

In general, restricted to original host works better. The broader search includes pages from businesses, e.g., Comcast, and other local agencies, e.g., other local and state government sites. But restricting the outside sites to the first level seems to be a good compromise.

If the curator is referring to a structure of subdirectories, the next challenge is that the Heritrix crawler does not work this way. Heritrix ignores the site's directory structure and instead follows links from the seed URI it is provided. The Heritrix manual specifically defines "depth" in this way:

"This scope allows for limiting the depth of a crawl (how many links away Heritrix should crawl) but does not impose any limits on the hosts, domains, or URI paths crawled."6
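The distinction matters in practice: a document buried several directories deep on the server can be a single link away from the seed page, and vice versa. The sketch below uses a made-up two-page site to contrast the two measures; Heritrix's depth setting limits the first (link hops from the seed), not the second (directory depth).

```python
# Sketch: "levels down" can mean two different things. Heritrix limits link
# hops from the seed, not how many directories deep a URL sits on the server.
# The tiny link graph and URLs below are hypothetical examples.
from collections import deque
from urllib.parse import urlsplit

def path_depth(url):
    # Number of directory segments in the URL path, e.g. /plans/2005/reports/x.pdf -> 3.
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return max(len(segments) - 1, 0)

# Hypothetical site: the deeply nested PDF is linked directly from the home page.
links = {
    "http://example.gov/": ["http://example.gov/plans/2005/reports/final.pdf"],
    "http://example.gov/plans/2005/reports/final.pdf": [],
}

def link_hops_from_seed(seed, links):
    # Breadth-first traversal counting hops, the way a crawler measures depth.
    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        for out in links.get(url, []):
            if out not in hops:
                hops[out] = hops[url] + 1
                queue.append(out)
    return hops

for url, hop in link_hops_from_seed("http://example.gov/", links).items():
    print(url, "| link hops:", hop, "| directory depth:", path_depth(url))
```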

Multimedia

Some curators chose sites because of the value of their multimedia resources. The crawler yielded mixed results in capturing these resources. For one site, checks against the crawl log (see the sketch following this list) found:

• A text search on the log file turns up numerous .ram files, but only one .ppt file.

• .asx files are Windows streaming media redirector files, which generally lead to associated wmf files. No .asx references appear in the crawl log, nor do any wmf files.

• .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files.
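A spot check like the ones listed above can be run mechanically against the crawl log. The sketch below counts captured URLs by media file extension; it assumes a whitespace-separated Heritrix 1.x-style crawl.log with the URI in the fourth column, and the file name and extension list are illustrative.

```python
# Sketch: count captured URLs by media extension in a crawl log, the kind of
# spot check described in the bullets above. The log path, column layout,
# and extension list are illustrative assumptions.
from collections import Counter

MEDIA_EXTENSIONS = (".ram", ".rm", ".smil", ".asx", ".wmf", ".ppt", ".mp3", ".mov")

counts = Counter()
with open("crawl.log") as log:
    for line in log:
        fields = line.split()
        if len(fields) < 4:
            continue
        uri = fields[3].lower().split("?")[0]
        for ext in MEDIA_EXTENSIONS:
            if uri.endswith(ext):
                counts[ext] += 1

if counts:
    for ext, count in counts.most_common():
        print(ext, count)
else:
    print("no media files of these types in the log")
```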

As noted, nearly half the sites crawled reached the 1-gig size limit and so did not complete. This makes it difficult to determine whether there were genuine problems with particular types of files, or if the crawler simply did not get to the missing files before the crawl was stopped.

6 Heritrix User Manual, Section 6.1.1, Crawl Scope: Broad Scope. <http://crawler.archive.org/articles/user_manual.html>



Comparison with Other Crawlers

Three of our curators had previously used other crawlers, such as HTTrack and Wget, to capture the same sites we attempted with Heritrix. Because these curators provided strong details when describing their sites, we crawled all of the sites they sent us (six sites). In one case, the "Defense Base Closure and Realignment Commission," the curator had greater success capturing aspx files with HTTrack than we had with Heritrix. Other comparisons:

We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using Wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand it, Wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can. That may be the difference.

And

We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with Wget.

Crawl Success

We asked curators to rate the overall success of the test crawls on the following scale:

• Not effective (none of the desired documents were captured)
• Somewhat effective (some of the desired documents were captured)
• Mostly effective (most of the desired documents were captured)
• Effective (all of the desired documents were captured)

[Figure: bar chart of curator ratings of crawl success (Effective, Mostly Effective, Somewhat Effective, Not Effective); vertical axis 0-12.]

Here are some of the comments curators had concerning the overall success of these tests. The full text of all curator reports and their feedback to CDL is available in Appendix C.


Los Angeles Planning Department

The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all pdfs. When searching specifically for EIRs I got a large result list (699 citations); however [sic], when I investigated whether or not the actual file was captured, I found that the main EIR page was captured, typically a htm/html file, but when I clicked on a link to get to the full report, all I got was the "Sorry, no Documents w/the given URI were found" message. This could be that the file was no longer available when the site was harvested. However, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved.

City of San Diego Planning Department

This comment is from a curator who is filling in for the person who originated the test crawl request, so she is also attempting to interpret what that other person was hoping to capture:

Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (wwwproquestcom, wwwinfopeopleorg) that I believe are included because they are listed on the public library's pages (wwwsandiegogovpublic-library). So the crawl appears to include not just the pages linked from wwwsandiegogovplanning (in the nav bar as well as the content), but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example, httpwwwsandiegogovcityofvillagesoverviewrootsshtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of wwwsandiegogov rather than limit.

Defense Base Closure and Realignment Commission

I don't think this crawl was very successful. None of the documents in the folders on httpwwwbracgovSupplementalaspx were captured, no public comments after the opening page (httpwwwbracgovBrowseCommentsaspx) were captured, and none of the documents linked from the Browse page (httpwwwbracgovBrowseaspx) seem to have been captured. I realize that the crawl was limited to 1GB, but I think that more documents could have been captured.

Public Policy Institute of California


There are some problems with the functionality of captured pages. 1) httpwwwppicorgmainhomeasp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) httpwwwppicorgmainallpubsasp: The radio button selections are not functional. For example, clicking on "Date" retrieves the message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."

AFL-CIO

I realize the collection interface is a "work in progress" and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g., "Working Families Toolkit," "BushWatch") that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2-million-plus-hits-per-year Web sites, to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer, which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses of the toolkit.

Conclusions

The process of running these test crawls has been valuable. It has resulted in a number of lessons learned and further directions for inquiry. It is clear that the tools that help curators analyze the results of these crawls will have to do more than mimic the original browsing context or supply tables of data. We will be looking for any opportunity to improve WAS reporting capabilities as we move forward with the project. The differences encountered in language and visualization raise the importance of a strong, intuitive design for the curator tools and for clear help screens. Each of us may visualize web sites differently, and the crawler may behave differently than we expect. A certain degree of online help will be needed to design crawls effectively, and further guidance should be available to help people interpret crawl results when those results don't match what the person anticipated.


The help and documentation for the Web Archiving Service will also need to address the rights analysis issues raised above. In most cases this is work that is done prior to issuing crawls; it cannot be addressed by the design of the WAS interface alone.

When the Web-at-Risk project reaches the point of conducting usability studies, we should be sure to include tests that further uncover how users understand crawl frequency settings. Additionally, the desire for an analysis tool that can convey when a site has changed significantly is not limited to this project; it is a common issue faced by the Internet Archive, members of the International Internet Preservation Consortium, and others. CDL is actively communicating with these organizations as we all work toward a solution for this problem.

Similarly, CDL should ensure that future assessment and analysis work with our curators addresses the issues raised by the two crawl scope settings. This inquiry should also extend to participants who exemplify end users, to determine whether people using web archives experience crawl scope differently than people who build them.

Finally, a number of lessons were learned via the Katrina crawl described in Appendix B. The most outstanding finding is that event-based crawls such as Katrina and site-specific crawls such as these have quite different characteristics and require different functionality and analysis tools.
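One possible starting point for the change-detection need mentioned above is to compare per-URL content digests (such as the checksums recorded in the crawl log) between two crawls of the same site. The sketch below is a hypothetical illustration, not an existing WAS feature, and the sample data is made up; a real comparison would be built from two crawl logs.

```python
# Sketch: flag pages that changed between two crawls by comparing content
# digests (for example, the checksum recorded in the crawl log).
# The mappings below are hypothetical sample data.

def changed_urls(old_crawl, new_crawl):
    """Return URLs present in both crawls whose content digest differs."""
    return sorted(
        url for url, digest in new_crawl.items()
        if url in old_crawl and old_crawl[url] != digest
    )

january = {"http://example.gov/": "sha1:aaa", "http://example.gov/news.html": "sha1:bbb"}
february = {"http://example.gov/": "sha1:aaa", "http://example.gov/news.html": "sha1:ccc"}

print(changed_urls(january, february))  # -> ['http://example.gov/news.html']
```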

Next Steps

There is certainly more to learn from these crawl results and about Heritrix; in some cases it is still unclear why the crawler failed to retrieve certain documents. The curators' feedback concerning these results has been extremely valuable. They have provided insight as to what was captured and what is still missing, which would have been difficult to determine without their subject expertise in the sites chosen. As we continue working to improve crawler success and performance, we will turn our attention to the Virtual Remote Control site created by Cornell. This site was developed to present particular problems to crawlers and is well documented, enabling the user to gauge a crawler's results. We will be using that site to replicate particular problems raised in our test crawl set, and may also reattempt some of the sites crawled for these tests. When we release the first version of the Web Archiving Service to curators in July 2006, we will request that they include their original test site in the crawls they attempt and compare the results with these tests. The test results for these crawls will remain in place as a point of comparison while we continue developing the Web Archiving Service. Additionally, certain issues raised by this feedback, particularly the desired scope and frequency of crawls, merit attention in future usability analysis work.

Web-at-Risk Test Crawl Report: Appendix A: Sites Submitted

Curator | Site | Crawled
Sherry DeDekker | http://cawater.usgs.gov (California Water Science Center) | Yes
Sherry DeDekker | http://wwwdwr.water.ca.gov (California Department of Water Resources) |
Peter Filardo and Michael Nash | http://www.nycclc.org (New York City Central Labor Council) | Yes
Peter Filardo and Michael Nash | http://www.dsausa.org (Democratic Socialists of America) |
Valerie Glenn and Arelene Weibel | http://www.strengtheningsocialsecurity.gov (Strengthening Social Security) | Yes
Valerie Glenn and Arelene Weibel | http://www.brac.gov (The Defense Base Closure and Realignment Commission) | Yes
Gabriela Gray | http://www.joinarnold.com (Join Arnold) | Yes
Gabriela Gray | http://www.antonio2005.com (Mayor-elect Antonio Villaraigosa) | Yes
Ron Heckart and Nick Robinson | http://www.ppic.org (Public Policy Institute of California) | Yes
Ron Heckart and Nick Robinson | http://www.cbp.org (California Budget Project) |
Terrence Huwe | http://www.aflcio.org (AFL-CIO) | Yes
Terrence Huwe | http://www.seiu.org (Service Employees International Union) |
James Jacobs | http://www.sandiego.gov/planning (City of San Diego Planning Department, analyzed by Megan Dreger) | Yes
James Jacobs | http://www.sandag.org (San Diego Association of Governments) |
Kris Kasianovitz | http://cityplanning.lacity.org (Los Angeles Department of City Planning) | Yes
Kris Kasianovitz | http://www.scag.ca.gov (Southern California Association of Governments) | Yes
Linda Kennedy | http://calwater.ca.gov (California Bay-Delta Authority, CALFED) | Yes
Linda Kennedy | http://www.dfg.ca.gov (California Department of Fish and Game) |
Ann Latta | http://www.ucmerced.edu (UC Merced, analyzed by Elizabeth Cowell) | Yes
Ann Latta | http://www.coastal.ca.gov/web (California Coastal Commission) |
Janet Martorana | http://www.countyofsb.org/plandev/default.htm (Santa Barbara County Department of Planning and Development) | Yes
Janet Martorana | http://www.sbcag.org (Santa Barbara County Association of Governments) |
Lucia Orlando | http://montereybay.noaa.gov (Monterey Bay National Marine Sanctuary) | Yes
Lucia Orlando | http://www.waterboards.ca.gov/centralcoast (Central Coast Regional Water Quality Control Board) |
Richard Pearce-Moses | http://www.azwater.gov (Arizona Department of Water Resources) | Yes
Richard Pearce-Moses | http://www.ccec.state.az.us/ccecscr/home.asp (Citizen's Clean Election Commission) | Yes
Juri Stratford | http://www.city.davis.ca.us (City of Davis, California) | Yes
Juri Stratford | http://www.sacog.org (Sacramento Area Council of Governments) |
Yvonne Wilson | http://www.ocsd.com (The Orange County Sanitation District) | Yes

Web-at-Risk Test Crawl Report: Appendix B: The Katrina Crawl

The Crawl

During the early Fall of 2005, the California Digital Library was embarking on a series of test crawls as part of the Web-at-Risk project when hurricane Katrina struck. CDL had requested that curators submit their sample URLs by August 25th, and on Friday, August 26th, Louisiana Governor Blanco declared a state of emergency. It was over that weekend that CDL staff reached the conclusion that an event of this historic significance would shift our priorities. On Monday we suspended our initial test crawl plans and began preparing on a number of fronts to capture web-based materials related to Katrina. Specifically, this meant:

• identifying which web sites or sub-sites to collect, which involved:
  o notifying 30+ curators of our emergency crawl plans and requesting their assistance
  o setting up a web-based interface to gather their suggested seed URLs
  o sorting through the resulting seed list and feeding it to our primary and secondary (Stanford) crawl operators
• determining our collection parameters
• getting our crawlers installed and configured
• locating and setting up disk space to store the crawls
• initiating and monitoring the crawls

This crawl activity posed several new challenges. It was the first time CDL staff had used the Heritrix crawler, which was not yet installed on a production server when the hurricane struck. The last week of August saw CDL and UC Berkeley staff scrambling to find server space to set up Heritrix instances, disk partitions, and job parameters for the crawl. This was done with remarkable speed, owing to the urgency of the situation and to everyone's desire to capture a record of the events. Given our limited experience, and in order to reduce the risk of losing the historically significant and fleeting materials related to hurricane Katrina, CDL worked with Stanford University to concurrently run the same crawl using a different crawler. After both CDL and Stanford had crawled the same seed list for a month, the task was then taken up by the San Diego Supercomputer Center, who have continued crawling these sites using Heritrix.

Gathering the Seeds

The CDL sent out an initial request to the Web-at-Risk curators to submit URLs related to hurricane Katrina for crawling. We worked collaboratively with a large group of content specialists to identify the sites:

• The 22 curators of the Web-at-Risk project (University of California Libraries, University of North Texas, New York University, Stanford University, and the Arizona State Library)
• The Library of Congress
• Librarians at Louisiana State University and the University of Mississippi

Over the course of the crawl, the list of seed URLs grew to over 700 (just over 500 of which were crawled by CDL). The image of our input form above provides some sense of the range of materials collected. Given the sudden nature of this event, there was no time to investigate rights issues or technical problems each site might have presented. We informed the curators that "Our immediate plan is to simply collect the material before it disappears. We will not make the material immediately available." We also had little time for quality control and were not able to guarantee that sites in our seed list would be comprehensively crawled.

Crawling Specifics

CDL's first crawl was run on September 1, 2005, using a seed list of 89 URLs. The final crawl run by CDL was on October 10, using a seed list of 589 URLs. This final seed list was then sent to the San Diego Supercomputer Center, who are continuing to run twice-weekly crawls.

The CDL crawls were done using Heritrix version 1.5.1. We began with a single instance of Heritrix, eventually dividing the seed list among 6 instances. We got through the entire seed list 29 times in 40 days. However, the content we were able to collect was limited by our crawler settings. Our crawler was configured to visit one host at a time, collecting content at a conservative rate, and never more than three hops away from the seed URL. To ensure that the crawler moved through the seed list in a timely way, it was limited to spending 15 minutes at any given host. So the material collected does not represent the entirety of what was available at each site. Note that all of the seeds were crawled with the same configuration, no matter how different the structure of these sites might be. Thus the New Orleans version of Craigslist, NASA's information pages, and blog sites were all crawled in the same manner, despite being quite different in context, architecture, and other characteristics. Given our short preparation time, the goal was to find a crawler configuration that would match the broadest range of Katrina materials. In most cases, whether the seed URL was for a personal blog or for a government agency, the seed itself was not centrally about Katrina. The Katrina information was generally situated on the front page and top levels of each site, so CDL expected that this setting, even with the 15-minute time limit, would at least capture the content nearest to the front pages of the widest variety of sites we could capture.

In choosing crawler configuration settings we also faced conflicting goals. On one hand, we wanted to capture as much Katrina content as possible. On the other hand, we were very hesitant to start hitting sites that were providing much needed information at a crucial time (emergency sites, relief sites), thus making sites hard to reach due to our capture activity. Many of these sites were already getting heavily used and perhaps were not running at full capacity. Some were also geographically impacted directly by the hurricane. So we had to choose settings that balanced the need to collect with politeness across a wide range of sites.

When the San Diego Supercomputer Center took over the Katrina crawls in October, they revisited the Heritrix configuration settings. The SDSC crawl placed a limit not on the amount of time spent at a site but on the number of documents to be retrieved overall. Their crawls are being conducted "breadth first," in the sense of gathering pages consecutively across the seed list rather than gathering pages from one site at a time.

The Stanford University crawl of the same sites began on September 7th and continued for 30 consecutive days, using the WebVac crawler. While we have statistics concerning the size of the Stanford Katrina collection, it is very difficult to compare the configuration settings and effectiveness of WebVac vs. Heritrix. Further, there is no easy way to display materials gathered with WebVac, so the Stanford Katrina content is stored but is not accessible to us at the moment.
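The scope just described (one host at a time, never more than three hops from the seed, and no more than 15 minutes per host) can be summarized in a small sketch. This is only an approximation of the policy written in Python, not the actual Heritrix 1.5.1 configuration; the function name and the fetch and extract_links stand-ins are invented for the example.

import time
from collections import deque
from urllib.parse import urlparse

MAX_HOPS = 3                 # never more than three hops from the seed URL
HOST_TIME_BUDGET = 15 * 60   # 15 minutes per host, in seconds

def crawl_host(seed, fetch, extract_links):
    # Visit one host breadth-first within the hop and time limits.
    # fetch() and extract_links() stand in for the real crawler's
    # download and link-extraction steps.
    host = urlparse(seed).netloc
    started = time.time()
    queue = deque([(seed, 0)])   # (url, hops from the seed)
    seen = {seed}
    collected = []
    while queue and time.time() - started < HOST_TIME_BUDGET:
        url, hops = queue.popleft()
        page = fetch(url)        # polite fetch: one request at a time
        collected.append(url)
        if hops < MAX_HOPS:
            for link in extract_links(page):
                # stay on the seed's host; linked hosts are out of scope here
                if urlparse(link).netloc == host and link not in seen:
                    seen.add(link)
                    queue.append((link, hops + 1))
    return collected

Under these limits a large site is sampled rather than captured, which is why the material collected does not represent the entirety of what was available at each site.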

Katrina Crawl Results

In terms of creating a collection of the Katrina event on the web, we are not satisfied that CDL's crawling efforts were successful. As mentioned, we had to impose time limits on each site visited and were not able to conduct quality control on the URLs that were captured. The total size of the Katrina capture at CDL is 50 GB. This represents 29 completed crawls of the seed list, which grew to over 500 seeds over the 40 days we ran the crawl. It's clear from the byte size alone that we barely skimmed the surface, and indeed the Stanford crawl proves this point: when the Stanford crawl was based on about 370 seeds, it was pulling in about 35 GB a day, or roughly 1.5 million pages a day.

In spite of this, our attempts to capture Katrina news and events did have some very useful and positive outcomes. In terms of test crawling, the experience was quite successful, opening our eyes both to certain obstacles and to new possibilities. Most importantly, it prompted us to consider the nature and demands of event-based crawling, which had not been part of our original test plan. Here are some of the aspects of event-based crawls that Katrina surfaced.

Site Selection and Classification

This type of event demands a deep collaborative effort to identify material to collect. CDL had to mobilize a large number of curators to make their best guesses at identifying promising sites. We set up a 'crawl seed nomination' web form, accessible to the curators from their home institutions, and invited the curators to use it to enter their suggested seed URLs. CDL staff did not have time to add much automatic error or duplicate checking, so this cumbersome work was done by hand. On the other hand, it provided us a first-hand trial of what a more general curator user interface might require. The selection and management of seed lists is critical for sudden event-based crawls. The curators contributing the URLs will not necessarily be well-versed in the topic; in the case of Katrina, curators in California were not uniformly familiar with the Gulf Coast, the towns, the government agencies, etc. In addition, it is difficult to predict which aspects of the event will be of historic, enduring value. Because disk storage was not a pressing issue, it was better to err on the side of a wider net when selecting sites. It became clear that there might be a role for "smart crawling," which would spread a wider net than that provided by human-generated seed lists through the use of automated tools that discover relevant materials. We were specifically interested in using tools developed at Emory and Cornell that interface with the Rainbow text classification tool.7 We did not have the resources to investigate this for the Katrina crawl, but plan to fold it into future crawling tests. Finally, it is worth examining why the seed list grew continuously throughout the event. Part of this growth is due to the natural increase in the number of pages and sites devoted to the emergency, and to having the additional time to identify and add them. However, the nature of the event itself changed over time: starting as a hurricane, then becoming a flood, a massive relocation, and a political and social issue. So the range of relevant sites changed as the event itself took on broader implications. This suggests that site selection is an ongoing process, not strictly an activity undertaken at the beginning of an event.

7 Bergmark, Donna. "Heritrix processor for use with rainbow." <http://groups.yahoo.com/group/archive-crawler/message/1905>
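As an illustration of the "smart crawling" idea, the sketch below filters nominated seed pages with a generic text classifier. It is not the Rainbow tool or the Emory/Cornell Heritrix processor cited above; the scikit-learn classifier, the labeled training set, and the 0.7 threshold are assumptions made only for the example.

# Sketch: keep only nominated seeds whose fetched text looks relevant to the
# event, using a small hand-labeled training set (label 1 = relevant).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def build_classifier(labeled_pages):
    # labeled_pages: list of (page_text, label) pairs
    texts, labels = zip(*labeled_pages)
    vectorizer = TfidfVectorizer(stop_words="english")
    model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)
    return vectorizer, model

def filter_seeds(candidates, vectorizer, model, threshold=0.7):
    # candidates: list of (url, page_text) pairs nominated by curators
    kept = []
    for url, text in candidates:
        prob_relevant = model.predict_proba(vectorizer.transform([text]))[0][1]
        if prob_relevant >= threshold:
            kept.append(url)
    return kept

A filter like this would not replace curator judgment, but it could widen the net by scoring candidate pages that a human-generated seed list never reached.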


Rights, Ownership, and Responsibilities

Our current rights protocol stipulates that when a web page is clearly in the for-profit domain, we must seek permission from the content owners to collect. This restriction applies most notably to news agency sites, which would likely be a critical part of the web-based fallout of any major event. This was certainly the case with Katrina. However, rights negotiation is a time-consuming process. In the case of a planned event such as an election, one is able to anticipate a certain number of sources and take rights management steps before the event occurs. But with emergencies such as Katrina, there is simply no way to secure rights in advance without missing time-sensitive material. Knowing this would be an issue, CDL notified our curators in advance that the Katrina materials we gathered would not be publicly accessible, but merely captured and preserved. The complex rights issues behind web archiving, as well as the collaborative environment described above, raise a number of questions:

• Who owns the content?
• Who is responsible for the project?
• Who is responsible for responding if a content owner objects?
• When is it permissible to ignore robots exclusion files?
• How much rights management can be done as part of post-processing for an event crawl?
• Can there be standing agreements with major news agencies to allow for this type of crawl in emergency situations? If so, what constitutes an emergency?

Technical Infrastructure

CDL did not have an adequate technical infrastructure in place at the time of the Katrina crawl. As mentioned above, staff at both CDL and UC Berkeley scrambled to find server space and to install and configure the crawlers. This sudden shift in workload left the system vulnerable and somewhat unstable. In order to start the crawling process as soon as possible, CDL began the project on a temporary server. Making the transition to more permanent storage in the midst of this crawling project was neither easy nor flawless, and a certain amount of data had to be recovered from backup sources.

Information Analysis and Display

The challenge of analyzing crawl results has been described in some detail in our Test Crawl Report. That report cites a curator who found the task of reviewing an 8,899-document crawl to be unmanageable. The tools currently available are not at all up to the task of analyzing a large and complex crawl. An event-based crawl is likely to result in massive amounts of data of widely varying quality. The selection of seeds is based on guesses that, given sufficient resources, should be reviewed, refined, and enhanced as the event progresses. Time series data based on changing input parameters represents a kind of moving target that suggests the need to develop new analysis tools. Andreas Paepcke of Stanford University addressed this problem when considering how a social scientist might want to analyze the Katrina materials. Consider the researcher who is interested in finding out how quickly the notion of race entered the public discourse in the aftermath of Katrina. Paepcke suggests:

"For example, the tool could perform word frequency analysis across successive crawls of Katrina sites, so that the user could identify the emergence of word usage such as 'looter,' 'ninth ward,' etc. The social scientist would then define within the tool clusters of words that in the scientist's judgment are direct or indirect occurrence indicators of the concept 'race' (i.e., ninth ward, looters, lacks, poor, ...). The scientist could subsequently interact with the tool at the level of these well-defined concepts. Example command: count occurrences of 'race' within the first three paragraphs of all pages and tabulate the differences across 10 days. We could go further and apply well-known topic categorization algorithms on the data to suggest new concepts as they arise in the series of text streams. For example, the tool might identify an increasing frequency of the term 'mismanagement' beginning in week two after the storm and alert the sociologist to this evidence of a new discourse theme." 8

While we don't expect to be able to incorporate this level of analysis into our web archiving tools any time soon, these ideas illustrate a useful direction for archival analysis tools.
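The first step Paepcke describes, tabulating indicator terms across successive crawls, is straightforward to sketch. The example below assumes the captured pages have already been reduced to plain text and grouped by crawl date; the term cluster and the function names are illustrative, not part of any existing Web-at-Risk tool.

# Sketch: count occurrences of concept-indicator terms in each crawl so a
# researcher can watch a theme emerge over time.
import re
from collections import Counter

INDICATORS = ("looter", "looters", "ninth ward", "poor")  # example cluster

def concept_counts(documents, indicators=INDICATORS):
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        for term in indicators:
            counts[term] += len(re.findall(re.escape(term), lowered))
    return counts

def tabulate_over_time(crawls):
    # crawls: dict mapping crawl date -> list of plain-text documents
    return {date: sum(concept_counts(docs).values())
            for date, docs in sorted(crawls.items())}

# e.g. tabulate_over_time({"2005-09-01": sept1_docs, "2005-09-08": sept8_docs})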

Conclusions

At the time Hurricane Katrina hit, the Web-at-Risk service requirements and test crawl plan had been written for a more orderly and considered approach to web crawling. Our attempt to capture the web-based aftermath of Katrina highlighted our need to revisit the service requirements and see how well they supported an event-based response. We need to develop a collaborative and robust mechanism to identify and collect seed URLs. When an event happens quickly, it is also important to have in place general crawling guidelines that will result in a well-rounded collection. The technical infrastructure for the Web Archiving Service that we are building for the Web-at-Risk project should be robust enough to handle occasional and sudden bursts of activity. Attempting to alter the infrastructure quickly in reaction to emergency events impairs our responsiveness and leaves the altered systems in a vulnerable state.

Finally, work needs to continue on both the rights management front and on developing improved web archiving analysis tools, so that the material gathered can be used to its greatest potential.

8 Paepcke, Andreas, Senior Research Scientist and Director of the Digital Library, Stanford University. Email correspondence with Patricia Cruse, October 26, 2005.


Web-at-Risk Test Crawl Report: Appendix C: Individual Crawl Reports

Included below are the crawl reports provided to individual curators, including their analysis and feedback about those results. Note that a key to interpreting the tables in these crawl reports is provided at the end of this appendix.

Elizabeth Cowell (submitted by Ann Latta): UC Merced

CDL Report to Curator

URL: http://www.ucmerced.edu
Curator's original comments: "UC Merced is the first research university to be built in the 21st century. The educational and land use issues are significant. Of particular interest is http://www.ucmercedplanning.net. This site addresses major issues of land use - the university is being built on agricultural land. Controversy existed re: issues of redevelopment of downtown Merced vs. appropriation of agricultural land - there are major environmental issues focused on endangered species - Educational issues involving faculty, job descriptions, student body, etc. are significant because of the economic, ethnic, and cultural diversity of the region."
Site copyright statement: "© 2004 UC Regents"
Crawl Results

Comments from crawl operator: When we set the crawl to include pages from linked sites, the crawler got "trapped" at the Elsevier site. There is JavaScript on that linked page that causes the crawler to continue looking for additional pages on the Elsevier site even when you're only trying to capture a single page. Once we set a limit for the maximum number of retry attempts, the crawl completed. This data is from the completed crawl.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 227 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
969 71552369 www.ucmerced.edu
238 2564803 www.ucop.edu
226 14851 dns
197 16583197 www.universityofcalifornia.edu
156 8487817 www.elsevier.com
151 1437436 www.greatvalley.org
112 2354582 faculty.ucmerced.edu
105 5659795 www.pacific.edu
90 111985 k12.ucop.edu
86 255733 www-cms.llnl.gov
85 1178031 admissions.ucmerced.edu
81 297947 uc-industry.berkeley.edu
71 108265 www.mssmfoundation.org
67 349300 www.nps.gov
66 308926 www.usafreedomcorps.gov
54 137085 slugstore.ucsc.edu
52 52202 www.cerrocoso.edu
51 977315 www.universityofcalifornia.com

Curator Feedback to CDL (Cowell, Merced)

Crawl Success: mostly effective
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: 10 more links came up in the "via" search; 8 of these were not useful for the research of land use issues. The two that were useful were a benefit of the "via" search.
Crawl Frequency: monthly


Sherry DeDekker: California Water Science Center

CDL Report to Curator

URL: http://cawater.usgs.gov
Curator's original comments: "We are interested in the CA water reports and databases behind the links on this page. Also, in looking at http://waterdata.usgs.gov/ca/nwis/nwis, this section is an interactive interface to multiple data sets. Is it possible to capture this type of site as well as the static reports?"
Site copyright statement: "Information presented on this website is considered public information (unless otherwise noted) and may be distributed or copied. Use of appropriate byline/photo/image credit is requested. We strongly recommend that USGS data be acquired directly from a USGS server and not through other sources that may change the data in some way."
Crawl Results

Comments from crawl operator: Databases, e.g. http://waterdata.usgs.gov/ca/nwis/nwis, are disallowed by http://waterdata.usgs.gov/robots.txt. Some water data reports are NOT caught by the broader "linked hosts included" settings, e.g. http://pubs.usgs.gov/wdr/2004/wdr-ca-04-1 -- would also want to submit http://cawater.usgs.gov/waterdata as a seed.

Related hosts crawled:

When the crawl was set to include documents from other sites that the original site linked to, 662 additional sites were crawled. The following list includes the hosts that supplied more than 50 files. Note that the host "pubs.usgs.gov" supplied a higher number of files than the original host itself.

[urls] [bytes] [host]
1963 255912820 pubs.usgs.gov
1153 47066381 cawater.usgs.gov
698 56570 dns
404 112354772 geopubs.wr.usgs.gov
385 9377715 water.usgs.gov
327 203939163 greenwood.cr.usgs.gov
318 17431487 www.elsevier.com
219 3254794 www.usgs.gov
189 2737159 www.lsu.edu
163 2292905 wrgis.wr.usgs.gov
158 31124201 www.epa.gov
149 921063 www.usda.gov
[list truncated…]

Curator Feedback to CDL (DeDekker, CWSC)

Crawl Success: somewhat effective
Crawl Success Comments: The site appears to access water data reports (http://cawater.usgs.gov/archive/waterdata/index.html), but none are actually available through the links. I expected the site to not be able to access real-time data, but these are archived reports.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly

Megan Dreger (submitted by James Jacobs): City of San Diego Planning Department

CDL Report to Curator

URL: http://www.sandiego.gov/planning
Curator's original comments: "I would like to drill down several levels (at least 3) of this site. For example, following the link to City of Villages/general plan update leads to many more important planning documents."
Site copyright statement: This site contains the two following notices on the same page. "Restrictions on Use of Materials: This site is operated and maintained by the City of San Diego through its Department of Information Technology and Communications (referred to as IT&C). Except as provided herein, no material or information from this site may be copied, reproduced, republished, uploaded, posted, transmitted, or distributed except as authorized in this notice, expressly authorized within this site, or approved in writing by IT&C." "Copyright Notice: Unless a copyright is indicated, information on the City of San Diego Web site is in the public domain and may be reproduced, published or otherwise used with the City of San Diego's permission. We request only that the City of San Diego be cited as the source of the information and that any photo credits, graphics or byline be similarly credited to the photographer, author, or City of San Diego as appropriate. If a copyright is indicated on a photo, graphic, or any other material, permission to copy these materials must be obtained from the original source."
Crawl Results

Comments from crawl operator: Need feedback about whether the desired content was retrieved. Question for curator: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 1084 additional hosts were crawled. (This figure does not represent a complete crawl, as the size limitation was reached.) The following hosts supplied more than 75 documents to your site:

[urls] [bytes] [host]
3728 556231640 www.sandiego.gov
1247 38685244 genesis.sannet.gov
1085 80905 dns
807 6676252 www.houstontexans.com
428 1079658 www.cacities.org
399 102298888 www.buccaneers.com
259 1797232 granicus.sandiego.gov
258 42666066 clerkdoc.sannet.gov
238 5413894 www.ccdc.com
225 2503591 www.ci.el-cajon.ca.us
223 1387347 www.ipl.org
217 2683826 www.sdcounty.ca.gov
203 11673212 restaurants.sandiego.com
195 2620365 www.sdcommute.com
192 1344523 www.bengals.com
189 2221192 www.kidsdomain.com
176 1333528 www.buffalobills.com
171 685965 www.chumpsoft.com
166 277238 www.proquest.com
[list truncated…]

Curator Feedback to CDL (Dreger, San Diego)

Crawl Success: mostly effective
Crawl Success Comments: This crawl was not completed due to size, so that may explain some of my questions. It was pretty effective in terms of getting the Planning Dept. pages, but went out further than I expected. Due to the vague request to drill down several levels, I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit. It's nice that PDFs and other formats are included.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The linked hosts included (via) setting seemed to include more extraneous stuff.
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
A: You asked for a better definition of what was meant by the request to drill down several levels (at least 3). Unfortunately, I wasn't the original curator and I'm not sure what he meant. The crawl you did was very useful.
Crawl Frequency: monthly
Questions / Comments about crawl: I think that the crawl frequency should be at least monthly for these pages.
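The directory-structure issue the curator describes comes down to how crawl scope is expressed: a rule keyed to the host accepts the whole city site, while a rule keyed to a path prefix misses planning documents that live in other directories. The following sketch is only an illustration of that difference; the function names are invented, and the URLs are taken from the discussion above.

from urllib.parse import urlparse

def in_scope_host(url, host="www.sandiego.gov"):
    # host-wide scope: accept anything on the original host
    return urlparse(url).netloc == host

def in_scope_prefix(url, prefix="http://www.sandiego.gov/planning"):
    # path-prefix scope: accept only URLs under one department's directory
    return url.startswith(prefix)

page = "http://www.sandiego.gov/cityofvillages/overview/roots.shtml"
print(in_scope_host(page))    # True  -- captured by a host-wide crawl
print(in_scope_prefix(page))  # False -- missed by a /planning-only crawl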


Peter Filardo and Michael Nash: New York City Central Labor Council

CDL Report to Curator

URL: http://www.nycclc.org
Curator's original comments: (none)
Site copyright statement: "© 2004 New York City Central Labor Council. No portion of this website may be reproduced in any form without permission from the Central Labor Council. Contact our offices for more information at nycaflcio@aol.com."
Crawl Results
NOTE: Because your Crawl "A" had to be stopped, then resumed, each of your reports for that crawl is in two segments. To browse a list of all reports for that crawl, go to http://voro.cdlib.org:8081/ingest_misc/ndiipp/testcrawls_raw/filardo_labor_via

Comments from crawl operator:
A. Linked hosts included: Crawl complete after recovery, with the addition of max retries. Seemed to hang at http://www.nycclc.org/calendar/event.asp?EventId=501 and http://www.nycclc.org/assets/HLCapplicationmembership.pdf. Ended crawl; it seemed to hang. Recovered from the previous job; the recovery was successful. Note for the future that a recovered job is identifiable because the logs directory is called logs-R.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, x additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
1913 74260017 www.nycclc.org
156 11755 dns
115 710552 www.aflcio.org
73 1477966 www.comptroller.nyc.gov
71 193264 www.empirepage.com
60 570115 www.redcross.org
58 269079 www.afl-cio.org
57 240845 www.campsussex.org
57 113676 www.mssm.edu
56 449473 www.labor-studies.org
53 184605 www.pbbc.org
52 134326 www.senate.gov
[list truncated…]

Curator Feedback to CDL (Filardo, NYCCLC)

None provided.

Valerie Glenn and Arelene Weibel: Strengthening Social Security

CDL Report to Curator

URL: http://www.strengtheningsocialsecurity.gov
Curator's original comments: "Contains external links to audio & video that would be essential to completing this site (see press room); some are files, some are links to webcasts, some are on external .gov sites, and some are external .com sites."
Site copyright statement: Copyright info not found.
Crawl Results

Comments from crawl operator: Interesting audio/video/webcast content -- need feedback about success in capturing. Questions for Curator: How successfully did this crawl capture the multimedia documents you were interested in?
Comments from coordinator: In the media center area of this site (http://www.strengtheningsocialsecurity.gov/press/media_center.shtml) I've found .ram (both video and audio alone), .smil, and .asx files. The site also contains numerous .ppt and .pdf files.
• A text search on the log file turns up numerous .ram files, only 1 .ppt file.

• .asx files are Windows streaming media redirector files, which generally lead to associated .wmv files. No .asx references appear in the crawl log, nor do any .wmv files.
• Similarly, .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files. I assume that when displayed, some of the Real media files from this site would function, but many of the other multimedia files would not. (A sketch of how these redirector files point at the underlying media appears after the host list below.)

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 388 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
660 10668874 www.chelseapiers.com
562 7334035 www.whitehouse.gov
477 6366197 www.laopinion.com
391 29623 dns
356 3874719 www.wkrc.com
243 12294240 www.strengtheningsocialsecurity.gov
178 1935969 www.xavier.edu
148 237055 image.com.com
127 682069 online.wsj.com
117 898439 www.omaha.com
116 514995 www.npr.org
108 995733 www.nba.com
[list truncated…]
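One way a post-crawl check could recover the missing streams is to parse the captured redirector files and pull out the media URLs they point to, so those URLs can be added as seeds or fetched separately. The sketch below is illustrative only: it assumes the captured .asx/.smil files are available locally and parse as well-formed XML (real-world .asx files are often looser markup), and the example file name and output URL are hypothetical.

# Sketch: extract referenced media URLs from captured .smil / .asx files.
import xml.etree.ElementTree as ET

def media_urls(path):
    urls = []
    for element in ET.parse(path).iter():
        if not isinstance(element.tag, str):
            continue                               # skip comments, etc.
        tag = element.tag.lower().split("}")[-1]   # ignore any namespace
        if tag in ("ref", "video", "audio"):       # <ref href> in ASX; <video>/<audio src> in SMIL
            for attr in ("href", "src"):
                if element.get(attr):
                    urls.append(element.get(attr))
    return urls

# e.g. media_urls("media_center/address.asx") might return a stream URL such
# as "mms://.../address.wmv" that the page-level crawl never requested.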

Curator Feedback to CDL (Glenn, Strengthening Social Security)

Crawl Success: mostly effective
Crawl Success Comments: My main concern about this site was the multimedia documents included - I've posted those thoughts in the "Questions for Curator" text box.
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: For this site it was essential to capture the linked hosts (via) because many of the press materials etc. were on external sites.
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: How successfully did this crawl capture the multimedia documents you were interested in?
A: I'm disappointed that not all of the multimedia files were captured, but there seem to be only a few that aren't included. I found it interesting that broadcasts from the same host (whitehouse.gov) weren't completely captured - some were, some weren't.
Crawl Frequency: once
Questions / Comments about crawl: [note: the curator alludes to adding a comment here, but no comment was received]


Valerie Glenn and Arelene Weibel: Defense Base Closure and Realignment Commission

CDL Report to Curator

URL: http://www.brac.gov
Curator's original comments: "In our previous efforts we have been unable to capture agency databases. The BRAC site includes a document library which has a search feature (http://www.brac.gov/Search.aspx) and a browse feature (http://www.brac.gov/Browse.aspx). We would really like to see how this information can be captured so that we can recreate it on our own servers."
Site copyright statement: "The contents of all material available on this Internet site are in the public domain and are not copyrighted. The content of this site may be freely reproduced, downloaded, disseminated, published, or transferred in any form and by any means. However, in some cases the copyright for certain text or images on this site may be held by other parties."
Crawl Results

Comments from crawl operator:
A. Linked hosts included: http://www.brac.gov/Search.aspx can't be captured by Heritrix. http://www.brac.gov/Browse.aspx seems to only capture the first 25 documents. Tried again with the browse page as the starting point, but stopped after 1005 documents; extracted 20 links from the browse page, and then there were no more URLs in the frontier queue which had been extracted from Browse.aspx -- perhaps need more experimentation.
B. Restricted to original host: again, only the first 25 pages from browse -- can't even successfully pass a seed URL listing the max docs per browse page (50).

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 44 additional hosts were crawled. Note that because your crawl did not complete, this is not an accurate count of how many other sites your original site may link to.

[urls] [bytes] [host]
2034 1064389540 www.brac.gov
555 5874934 www.slu.edu
87 173510 www.cpcc.edu
54 154588 www.wmata.com
47 685158 www.sluhospital.com
44 3501 dns
44 582555 www.c-span.org
43 174467 www.adobe.com
38 178153 www.q-and-a.org
32 127325 slubkstore.com
24 140653 www.c-spanclassroom.org
23 326680 www.capitalnews.org
22 213116 cancercenter.slu.edu
21 196012 www.defenselink.mil
[list truncated…]
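One workaround the operator hints at (passing browse-page URLs directly as seeds rather than relying on link extraction) can be sketched as follows. The query parameter names are hypothetical; the actual parameters accepted by Browse.aspx are not recorded in this report, so this is only an illustration of the approach.

# Sketch: generate a seed list that walks a paginated browse page directly.
# The "page" and "pagesize" parameter names are invented for the example.
def browse_seeds(base="http://www.brac.gov/Browse.aspx", pages=40, page_size=25):
    return [f"{base}?page={n}&pagesize={page_size}" for n in range(1, pages + 1)]

def write_seed_file(path="seeds.txt"):
    # Heritrix accepts a plain list of seed URLs, one per line.
    with open(path, "w") as f:
        for url in browse_seeds():
            f.write(url + "\n")

# write_seed_file() would produce 40 seeds covering up to 1000 browse entries,
# assuming the site exposes its document list through URL parameters at all.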

Curator Feedback to CDL (Glenn, Defense Base Closure)

Crawl Success: somewhat effective
Crawl Success Comments: I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: once
Questions / Comments about crawl: One of the reasons I submitted this site to be crawled is that I'd already tried to harvest it using HTTrack. That product appeared to capture more of the documents than Heritrix - but we still haven't been able to capture the entire site.

Gabriela Gray: Join Arnold

CDL Report to Curator

URL: http://www.joinarnold.com
Curator's original comments: "California Recovery Team. Non-profit pro-Arnold group, not registered as a campaign committee. Critical aspects: complex file and directory naming structure; looping, e.g. contactus.asp and contactadd.asp links."
Site copyright statement: "Copyright 2005"
Crawl Results

Comments from crawl operator:
A. Linked hosts included: Great site for testing -- this loop is really interesting because a new URL is generated with each loop, so that the duplicate-detection underway at IA would still not eliminate it. 44332 of the retrieved URLs were contact pages.
B. Restricted to original host: Got into a loop; by the end, 999 documents retrieved, 34 minutes.
C. Restricted to original host + regular expression: Excluding pages that matched the regular expression "contactadd.asp?c=" did not end the loop. What did end the loop: excluding both contactus and contactadd pages so they were not retrieved -- a drawback. (IA takes the manual approach of gathering the pages, then having an operator stop the crawl and take out the looping URLs by hand -- not scaleable.)

<newObject name="contact" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
  <string name="decision">REJECT</string>
  <string name="regexp">.*contact.*asp.*c=.*</string>
</newObject>

Related hosts crawled: Because of looping problems, we were not able to crawl other hosts linked from this site.

Curator Feedback to CDL (Gray, Arnold)

Crawl Success: somewhat effective
Crawl Success Comments: We spot-checked, and it looks like most files were captured, but the individual pages don't display most of the images. (This may simply be a problem with the WERA interface.) Strangely enough, the Flash files work perfectly, which is exactly the opposite of our own capture experience.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: As per the crawl notes, we only checked the original host version, since the "via" crawl failed.
Crawl Frequency: once
Questions / Comments about crawl: Same comments as on the Villaraigosa crawl. We're inexperienced with this type of capture: archive files as-is and use a tacked-on script and a special server-side interface to interpret links in the new environment. Our model has been to actually alter the internal links from absolute to relative formats so that the site works in any environment.

Gabriela Gray: Mayor-Elect Villaraigosa

CDL Report to Curator

URL: http://www.antonio2005.com
Curator's original comments: "Critical aspects: Flash animation; content scattered across multiple servers; maintaining complex internal link structure; JavaScript menus; streaming media."
Site copyright statement: "© 2005 Villaraigosa for Mayor 2005"
Crawl Results

Comments from crawl operator:
• (for Linked hosts results) Need feedback on media etc. retrieved -- this site is an ideal example of the need for scope+one.
• (for Restricted to original host) How much was left out due to the domain restriction? Need feedback.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 263 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

[urls] [bytes] [host]
817 10291631 ensim3.interlix.com
805 117538973 www.antonio2005.com
472 6333775 www.laopinion.com
265 21173 dns
110 19355921 www2.dailynews.com
100 16605730 www2.dailybulletin.com
95 1410145 www.americanpresidents.org
86 820148 www.dailynews.com
73 168698 www.chumpsoft.com
72 52321 images.ibsys.com
69 836295 www.laobserved.com
65 137700 www.mysql.com
55 213569 www.ensim.com
55 177141 www.lamayorcnc.com
55 296311 www.surveyusa.com
53 495858 abclocal.go.com
52 522324 www.c-span.org
51 244668 gallery.menalto.com
[list truncated…]

Curator Feedback to CDL (Gray, Villaraigosa)

Crawl Success: mostly effective
Crawl Success Comments: Doing some spot checks, it looks like all of the pages were captured. Some problems with media files -- WERA shows them when we search, but the files are often size 0. In addition, many files on external servers are listed and even have some descriptive info, but when we click on "Overview" it says "Sorry, no documents with the given uri were found," so we have no idea if they were really captured.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: Most of the links to other sites go to pages that truly are external to the site, not incorporated into it. The only exception would be the pages from ga3.org and ga4.org.
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) The results on the multiple hosts crawl are mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains, rather than source + 1. Also, there seem to be a lot of links going to ensim3.interlix.com -- there were more files captured from there than from antonio2005. This is a case where it looks like the entire site (or much of it) was mirrored on two different servers. When we find these we often try to collapse them into one seamless whole, eliminating the duplication, which is meaningless from the user's viewpoint. Simply capturing both mirrors and leaving the cross-links intact is an option we've used when we can't collapse, but it often leads to problems with links between the two, which seems to be the case here.
Crawl Frequency: once
Questions / Comments about crawl: We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE -- the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.


Ron Heckart and Nick Robinson: Public Policy Institute of California

CDL Report to Curator

URL: http://www.ppic.org
Curator's original comments: "We are particularly interested in their publications. We hope the crawler will be able to report when new publication files are posted on the website. Our main focus of interest is on their new publications at http://www.ppic.org/main/newpubs.asp."
Site copyright statement: "All Contents © Public Policy Institute of California 2003, 2004, 2005"
Crawl Results

Comments from crawl operator: We can't at the moment use Heritrix to report on new publications posted.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 238 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

[urls] [bytes] [host]
2421 324309107 www.ppic.org
433 1367362 www.cacities.org
238 19286 dns
229 4675065 www.icma.org
200 598505 bookstore.icma.org
151 1437436 www.greatvalley.org
144 517953 www.kff.org
137 5304390 www.rff.org
113 510174 www-hoover.stanford.edu
102 1642991 www.knowledgeplex.org
97 101335 cdn.mapquest.com
81 379020 www.cde.ca.gov
73 184118 www.ilsg.org
68 4539957 caag.state.ca.us
62 246921 www.milkeninstitute.org
[list truncated…]

Curator Feedback to CDL (Heckart, PPIC)

Crawl Success: mostly effective
Crawl Success Comments: There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on "Date" retrieves the message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Crawl Frequency: weekly
Questions / Comments about crawl: Our crawl report included the following comment from the crawl operator: "We can't at the moment use Heritrix to report on new publications posted." The ability to report on new publications is critical to our goal of using the crawler as a discovery tool. What are the prospects for providing this functionality in the future?

Terry Huwe: AFL-CIO

CDL Report to Curator

URL: http://www.aflcio.org
Curator's original comments: "This site is content rich and has many files that will be useful in the future. Specific areas that are of special interest follow below. http://www.aflcio.org/corporatewatch: the data related to executive pay watch is especially useful. http://www.aflcio.org/mediacenter: would like to see press stories captured if possible. http://www.aflcio.org/issues: links to newsletters and original content. Also, the "Legislative Action Center" on the home page; this is a useful topic guide to legislative history from a labor perspective."
Site copyright statement: "Copyright © 2005 AFL-CIO"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 2572 additional hosts were crawled. The following hosts supplied more than 75 URLs to your site:

[urls] [bytes] [host]
12702 481956063 www.aflcio.org
2657 184477 dns
1375 35611678 www.local237teamsters.com
570 8144650 www.illinois.gov
502 52847039 www.ilo.org
435 3851046 www.cioslorit.org
427 2782314 www.nola.com
401 8414837 www1.paperthin.com
392 15725244 www.statehealthfacts.kff.org
326 4600633 www.dol.gov
288 12303728 search.oxide.com
284 3401275 www.sikids.com
280 3069385 www.washingtonpost.com
272 1480539 www.cdc.gov
235 5455692 www.kff.org
[list truncated…]

Curator Feedback to CDL (Huwe, AFL-CIO)

Crawl Success: effective
Crawl Success Comments: I realize the collection interface is a "work in progress" and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g., "Working Families Toolkit," "BushWatch") that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2 million-plus hits per year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses of the toolkit. Having said all that, my short answer is: I'm stoked. :-)
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In the case of this collection, there's a lot of original content (and context-making documents) on the linked sites, simply due to the hierarchical/federal nature of the labor movement.
Crawl Frequency: monthly
Questions / Comments about crawl: I may have missed correspondence on this question, as I've been busy with a library renovation here. My question: is the 1 gig limit a useful one? I couldn't help wondering what the actual figure would be if that limit were set higher. I'm generally interested in the total size of Web collections, as I need to monitor ours (which is content-rich but even so does not exceed 1 gig). So discussion of this at least would be interesting, and maybe, if it's not technically challenging, that 1 gig level might be raised and we could see what happens. Thanks to all involved--very interesting process.

Kris Kasianovitz: Los Angeles Dept. of City Planning

CDL Report to Curator

URL: http://cityplanning.lacity.org
Curator's original comments: "Website uses frames. Most of the documents will be pdfs. Of particular interest - the EIRs, which are not archived on the page once the project is approved; see http://cityplanning.lacity.org/EIR/TOC_EIR.htm. General and Community Plans: http://cityplanning.lacity.org/complan/gen_plan/genplan2.htm, http://cityplanning.lacity.org/ComPlan/cpbpage.htm"
Site copyright statement: No copyright information found.
Crawl Results

Comments from crawl operator: (Linked hosts included crawl) ended because it ground on for 3 days without hitting the data limit; not sure if the URLs at the end of the log are valid/useful.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 119 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
10493 840876945 cityplanning.lacity.org
601 5156252 metrolinktrains.com
183 644377 www.cr.nps.gov
121 11162 dns
90 977850 www.metrolinktrains.com
81 1207859 www.fta.dot.gov
79 263432 www.fypower.org
66 333540 www.adobe.com
64 344638 lacity.org
63 133340 ceres.ca.gov
60 274940 www.amtrak.com
59 389217 www.nhtsa.dot.gov
58 347752 www.unitedweride.gov
52 209082 www.dot.gov
52 288783 www.nationaltrust.org
51 278949 www.portoflosangeles.org
[list truncated…]

Curator Feedback to CDL (Kasianovitz, LA City Planning)

Crawl Success: mostly effective
Crawl Success Comments: The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all pdfs. When searching specifically for EIRs, I got a large result list (699 citations); however, when I investigated whether or not the actual file was captured, I found that the main EIR page was captured (typically a htm/html file), but when I clicked on a link to get to the full report, all I got was the "Sorry, no documents with the given URI were found" message. This could be because the file was no longer available when the site was harvested. However, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved. Here are a few specific examples:
Final EIR directory: http://cityplanning.lacity.org/EIR/Tocfeir.htm
Sierra Canyon Secondary School (cover page): http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/SierraCyn_coverpg.htm. Access to the Draft EIR and Final EIR is provided from this cover page. Within the system, the links to both the Draft and Final are broken; no documents with that given URI: http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/DEIR/Table of Contents.htm
Villa Marina EIR: http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
Directory of EIR notices of preparation: http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM
This provides links to EIR preparation notices (all pdfs), a total of 27 links/documents. I encountered the following three issues:

• pdf opened = when clicking on the link to the notice, the pdf opened with no problem: 16 of 27.

• "Sorry, no document with the given uri was found" = no pdf harvested, but I could get to it from the live site: 4 of 27.

• Acrobat "could not open" message (could open live page outside of WERA) = the following Acrobat message came up when I tried to open the pdf: "Acrobat could not open 'ENV-2005-0881-EIR[1].pdf' because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded). To create an Adobe Acrobat PDF document, go to the source application. Then print the document to Adobe PDF." I copied and pasted the URL into a regular browser and could open the pdf with no problem. Also, in a few cases, if I clicked on the GO button after the first attempt to open the pdf in the system, it seemed to launch: 7 of 27.

Conversely, I found a number of pages that contained full documents in html, with links to pdfs that worked with no problem. See the following document: http://cityplanning.lacity.org/cwd/gnlpln/transelt/TET2Bkgrnd.htm. File types and error codes were what I expected.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The "restricted to original host" setting is more helpful/relevant for getting to the materials from a specific agency. For some of the results that were restricted to host, I was getting external links. Here are some comparisons for each of the crawl settings.
Searched for "villa marina":
LA Dept of City Planning: 6 results
http://cityplanning.lacity.org/EIR/NOPs/ENV-2004-3812-EIR.pdf
http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM
http://cityplanning.lacity.org/EIR/Tocfeir.htm
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/CwdGnlPln/HsgElt/HETblFig/ApVHgSit.htm
LA City Dept of Planning (via): 2 results
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/CwdGnlPln/HsgElt/HETblFig/ApVHgSit.htm
Searched for "eir":
LA Dept of City Planning: 699 results
LA City Dept of Planning (via): 324 results
For both of these searches the URIs were from cityplanning.lacity.org.
Searched for "transportation":
LA Dept of City Planning: 699 results
LA City Dept of Planning (via): 290 results (most are from external sources and tended to be the index or main page of another agency or organization. Because this just got me to the main page, and none of the links functioned at that level, the "via" result was less helpful. However, the "via" results are useful for discovering other agencies or organizations that I should be looking at for materials.)
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) The list of linked hosts provided at the end of the report is helpful. This will help me target other agencies that might be key places to check for further collection development.
Crawl Frequency: monthly
Questions / Comments about crawl: I want to qualify the frequency for this site: I'd like to do a monthly crawl for 3-4 months. I'd want to reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot of new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation. After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report there are 8899 documents in this crawl alone. My other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls. I should note that at this point in time, for local documents, I am more interested in individual documents, more so than capturing an entire website and preserving the functionality. Local agencies (with the exception of perhaps the Mayor's site) tend not to change the design (look and feel) very often or have a very sophisticated design (i.e., flash, changing images, etc.), and I don't see that this would be of interest to researchers' needs; it is the content, reports, maps, etc. that are contained/accessed on the websites that are important. Maybe I'm wrong or being short-sighted about that.


Kris Kasianovitz: Southern California Association of Governments
CDL Report to Curator

Curator's original comments: "This is a critical regional agency for Los Angeles, Orange, Ventura, Imperial, Riverside, and San Bernardino counties. Its main areas are Transportation, Housing, and Economic Development. This will provide an analysis of the overall site, which has a lot of content. Publications/reports are typically in PDF; they are presented as full reports and as pieces of the report (for easier downloading), so there might be duplication. The full report is really all that would be needed. See http://www.scag.ca.gov/publications. The Resources page contains the PDFs, images, dynamic content, and GIS programs, including an interactive atlas: http://www.scag.ca.gov/resources.htm. One part of the Resources site is the Web Accessible Geographic Data Search (WAGS), http://mapsvr.scag.ca.gov/wags/index.cfm?fuseaction=. It requires a user-created login and password (although there is a guest login that allows you to bypass this); I'm not sure what kind of difficulty the harvester will encounter with this portion of the site. The interactive atlas also has a create-an-account/guest login issue: http://mapsvr.scag.ca.gov/atlas/presmap.asp?Cmd=INIT. Since it is a dynamic page, I don't know how this will be handled by the harvester."
Site copyright statement: "© 1999-2005 Southern California Association of Governments"
Crawl Results


Comments from crawl operator: Interesting login problem; Heritrix was unable to retrieve the guest login pages. ColdFusion and ASP don't generate new URLs and thus don't get crawled. NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than 1 minute.
Robots.txt file: The site you selected forbids crawlers from gathering certain data. It reads:

User-agent: *
Disallow: /_mm
Disallow: /_notes
Disallow: /_baks
Disallow: /MMWIP
User-agent: googlebot
Disallow: /csi
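
To illustrate how a crawler that honors robots.txt interprets rules like these, here is a minimal sketch using Python's standard urllib.robotparser module. It shows only the general mechanism, not Heritrix's actual implementation, and the two example URLs are hypothetical.

    from urllib import robotparser

    # Fetch and parse the site's robots.txt, then check candidate URLs
    # the way a polite crawler would before requesting them.
    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.scag.ca.gov/robots.txt")
    rp.read()

    for url in [
        "http://www.scag.ca.gov/publications/index.htm",  # not covered by a Disallow rule
        "http://www.scag.ca.gov/_notes/somepage.htm",      # under a disallowed directory
    ]:
        print(url, "->", rp.can_fetch("heritrix", url))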

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 500 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
2517 863231651 www.scag.ca.gov
690 6134101 www.metrolinktrains.com
506 40063 dns
428 1084533 www.cacities.org
397 16161513 www.sce.com
196 581022 bookstore.icma.org
187 4505985 www.icma.org
175 7757737 www.ci.seal-beach.ca.us
158 1504151 www.h2ouse.org
149 940692 www.healthebay.org
137 317748 www.ci.pico-rivera.ca.us
130 18259431 www.ci.ventura.ca.us
123 490154 www.chinohills.org
121 406068 www.lakewoodcity.org
119 203542 www.lavote.net
117 2449995 www.ci.malibu.ca.us
114 744410 www.ci.irvine.ca.us
113 368023 www.whitehouse.gov
109 974674 www.dot.ca.gov
107 892192 www.lacanadaflintridge.com
[list truncated…]

Curator Feedback to CDL (Kasianovitz SCAG)

Crawl Success: mostly effective


Crawl Success Comments: Similar to my comments about the Los Angeles Dept. of City Planning - the crawl brought back a lot of webpages but not all the publications/documents that I would want to collect. Again, the same problem happened with SCAG as happened with the EIR example: the crawl brought back HTML pages with links to reports (typically in PDF format), but the actual documents were not captured. While the webpage is helpful, as it gives context, the main content that I'd want to capture (the reports) was not captured. See the following for example; none of the webpages linked from this page are available, and they should link to a page that will have the material. I tried searching for the documents separately and couldn't get to them. See http://www.scag.ca.gov/publications/index.htm (the timeline arrows at the top seemed to function; I'm not sure what this is for) and http://www.scag.ca.gov/livable/pubs.htm. I was impressed to find that zip files were captured and I was able to download them. Unfortunately, when I opened them there wasn't any content (I did the same search by mistake with the Arizona Dept. of Water Resources and actually found content in the folders); I found 10 with the search type:zip. The GIF or JPG images retrieved are not useful - most were just bars or bullets or covers of reports (although this might be helpful to identify titles, I think I would end up discarding these after doing more checking of the results).
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments from LA Dept. of City Planning. Restricted gets me to the relevant materials for that agency; via brings back too many main webpages for other agencies to be useful.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): I expected that there would be a problem with the content behind logins. The crawl confirmed that material behind login screens couldn't be captured. Can I get a copy of these crawl results?
NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than one minute.
Crawl Frequency: monthly
Questions / Comments about crawl: How to handle the copyright issue? For the login information, I'm not sure what all was blocked by the robots file. Interesting that metrolinktrains.com is the #1 related host for both of my crawled sites.


Linda Kennedy: California Bay Delta Authority
CDL Report to Curator

URL: http://calwater.ca.gov
Curator's original comments: "We are interested in the environmental impact statements and other key documents, and the various news releases and other announcements and archives of CALFED."
Site copyright statement: "© 2001 CALFED Bay-Delta Program"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 519 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
1130 473192247 calwater.ca.gov
741 201538533 www.parks.ca.gov
521 40442 dns
373 51291934 solicitation.calwater.ca.gov
242 78913513 www.calwater.ca.gov
225 410972 cwea.org
209 87556344 www.science.calwater.ca.gov
173 109807146 science.calwater.ca.gov
172 1160607 www.adobe.com
129 517834 www.whitehouse.gov
[list truncated…]

Curator Feedback to CDL (Kennedy CALFED)


Crawl Success: mostly effective
Crawl Success Comments: I looked closely at the CALFED home page (http://calwater.ca.gov), Tribal Home page (http://calwater.ca.gov/Tribal/Tribal_Home.shtml), Key Documents (http://calwater.ca.gov/CALFEDDocuments/CALFEDDocuments.shtml), and Archives page (http://calwater.ca.gov/Archives/Archives.shtml). The crawl did not complete in either the via or non-via search. Nearly all linked pages were retrieved in the non-via search; however, the retrievals from the via search were much less complete than the retrievals from the non-via search. For example, on the Key Documents page there were 3 missing links from the non-via search but 14 missing links from the via search. When Adobe documents were retrieved from either crawl, they came up correctly. Three .asp links of tribal maps from the Tribal Home page were retrieved by the non-via search but not the via search. A few of the images were missing from the displays, and this was also affected by the browser used. Usually the same image was missing from both crawls, but sometimes the images were more complete in the non-via crawl retrievals. There were some display problems with the right-hand menu boxes on the http://calwater.ca.gov/Tribal/Tribal_Home.shtml page, for example, that did not display correctly when viewed in Firefox. The same page viewed correctly in Internet Explorer, but when printed out, the boxes printed incorrectly, just as viewed in the Firefox browser. Grant Opportunities (http://calwater.ca.gov/GrantOpportunities/GrantInformation.shtml): this link did not work in 2 via and 2 non-via instances (from the Tribal home page and from the Archives page) but did work on one non-via crawl page (the CALFED home page). It could also be searched and retrieved directly from the test crawl search page.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments above. The non-via search was substantially more complete.
Crawl Frequency: monthly


Janet Martorana: Santa Barbara County Department of Planning and Development
CDL Report to Curator

URL: http://www.countyofsb.org/plandev/default.htm
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 487 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
3119 1102414495 www.countyofsb.org
485 34416 dns
428 1083047 www.cacities.org
357 6126453 www.sbcphd.org
320 6203035 icma.org
250 438507 www.sbcourts.org
234 1110744 vortex.accuweather.com
200 593112 bookstore.icma.org
[list truncated…]

Curator Feedback to CDL (Martorana SBCD)

Crawl Success: mostly effective


Crawl Success Comments: Most documents I expected to find were captured, but a number were not. For example, off this page, http://www.countyofsb.org/plandev/comp/threeyear/2005-2008/default.html, I expected to get to the final work program, http://www.countyofsb.org/plandev/pdf/comp/programs/Three_Year_WP/2005-2008_3YrWrkProgram.pdf, but got the "Sorry, no documents with the given uri were found" message. Other examples are within the Energy Division, a part of the Planning & Development Dept. Off this page, http://www.countyofsb.org/energy/information.asp, I could access all links except for two: http://www.countyofsb.org/energy/information/oil&GasFields.asp (Oil and Gas Fields) and http://www.countyofsb.org/energy/information/oil&GasProduction.asp (Oil and Gas Production). The crawler seemed to cut off the URL right before the ampersand; perhaps it has problems with ampersands.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly
Questions / Comments about crawl: I'd like to be notified when I've navigated away from the captured site. I found myself on real-time web sites but couldn't tell what was on the captured site and what was real. Example: when I did a search on "santa barbara" I got 1,528 results using the via results. I went to the County of Santa Barbara Online site and then, after clicking on links, I was on the web real-time and not in the crawl results database any longer, yet there were no indications that I had left the crawled database. The WERA URI was still displaying at the top of the screen; I couldn't tell which were the captured sites and which were the current real-time sites. Other observations: the webpage navigation doesn't work, e.g., the Table of Contents doesn't jump to that section on the webpage (www.countyofsb.org/energy/projects/shell.asp and www.countyofsb.org/energy/mitigation/oakProject.asp), and links to glossary terms go to the glossary but not to the term itself.
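
One plausible explanation for the ampersand problem noted above is that the live pages write links with the HTML entity "&amp;", and a link extractor that stops at the entity, or never unescapes it, ends up requesting a truncated or wrong URL. The snippet below only illustrates that general pitfall in Python; it is not a description of Heritrix's link extraction, and the example link is hypothetical.

    import html

    # A link as it might appear in the page's HTML source (hypothetical example).
    raw_href = "http://www.countyofsb.org/energy/information/oil&amp;GasFields.asp"

    # Unescaping HTML entities yields the URL the server actually expects.
    print(html.unescape(raw_href))
    # -> http://www.countyofsb.org/energy/information/oil&GasFields.asp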


Lucia Orlando: Monterey Bay National Marine Sanctuary
CDL Report to Curator

URL: http://montereybay.noaa.gov
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 795 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
5272 468755541 montereybay.noaa.gov
861 61141 dns
554 20831035 www.wunderground.com
368 4718168 montereybay.nos.noaa.gov
282 3682907 www.oceanfutures.org
273 10146417 www.mbnms-simon.org
260 7159780 www.mbayaq.org
163 61399 bc.us.yahoo.com
152 1273085 www.mbari.org
146 710203 www.monterey.com
119 3474881 www.rsis.com
119 279531 www.steinbeck.org
118 1092484 bonita.mbnms.nos.noaa.gov
109 924184 www.duke.edu
104 336986 www.montereybayaquarium.org
103 595953 icons.wunderground.com
102 339589 www.uncw.edu
[list truncated…]

Curator Feedback to CDL (Orlando, Monterey Bay)

Crawl Success: (rating not provided)
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: This site contains a large mix of government policy, laws, and regulatory information, as well as links to recreational and educational activities available in the MBNMS. I was most interested in links to laws/regs/policy and educational info and organizations. I thought the links restricted to the original host best captured this information succinctly.
Crawl Frequency: unknown


Richard Pearce-Moses: Arizona Department of Water Resources
CDL Report to Curator

URL: http://www.azwater.gov
Curator's original comments (redirects to http://www.azwater.gov/dwr): In arid Arizona, water is one of the most important, and most contested, resources. The publications and records of this Department are of critical value to the state. Our spider can get many files from this site (1,474 files in 258 directories). We are mostly interested in documents by programs. Although our spider can't get the imaged documents database (http://www.azwater.gov/dwr/Content/ImagedRecords/default.htm), this directory may not be critical if we can get the imaged documents transferred to us. We are not interested in blank forms and applications.
Site copyright statement: "Copyright © 1998 - 2005 Arizona Department of Water Resources and ADWR Network. All Rights Reserved."
Crawl Results

Questions for curator: Did this capture the documents you needed?


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 195 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
2233 988447782 www.azwater.gov
286 2350888 www.water.az.gov
253 4587125 www.groundwater.org
226 3093331 www.azcentral.com
196 15626 dns
178 395216 www.macromedia.com
128 1679057 www.prescott.edu
123 947183 www.azleg.state.az.us
115 792968 www.usda.gov
[List truncated…]

Curator Feedback to CDL (Pearce-Moses AZWater)

Crawl Success: (not provided)
Crawl Success Comments: We were surprised that your crawl found 4,888 documents. Another crawl that we conducted about the same time using wget found only 1,474. However, both spiders found roughly the same number of bytes. As I understand it, wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can; that may be the difference. The crawl is listed as not completing, but it appears to be very close based on the total number of bytes downloaded.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Questions / Comments about crawl: It seemed odd that it took 2h 54m to crawl 7,709 documents with linked hosts but 4h 4m to crawl only 4,888 docs when the spider was restricted to the original host.


Richard Pearce-Moses: Citizens Clean Election Commission
CDL Report to Curator

URL: http://www.ccec.state.az.us/ccecscr/home.asp
Curator's original comments: This commission was established by initiative. Its work is of great historical significance, as it is changing the way the public elects officials. We have not been able to spider this site because links are buried in JavaScript. (We use wget as our spider.) We are primarily interested in acquiring their publications, election data, and things listed under "popular links".
Site copyright statement: "Copyright 2004 Arizona Citizens Clean Elections Commission. All Rights Reserved."
Crawl Results

Comments from crawl operator: JavaScript issue; interesting problem; need curator feedback about what we captured.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, the following additional sites were crawled. Total: 15


[urls] [bytes] [host]
929 95456563 www.ccec.state.az.us
76 6117977 www.azcleanelections.gov
55 513218 az.gov
49 499337 www.governor.state.az.us
44 174903 www.adobe.com
40 141202 www.azleg.state.az.us
31 18549 www.az.gov
28 202755 www.azsos.gov
23 462603 gita.state.az.us
19 213976 www.benefitoptions.az.gov
17 89612 www.azredistricting.org
14 1385 dns
3 1687 wwwimages.adobe.com
2 1850 www.capitolrideshare.com
2 26438 www.ftc.gov

Curator Feedback to CDL (Pearce-Moses CCEC)

Crawl Success: (not provided)
Crawl Success Comments: We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with wget.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)


Juri Stratford: City of Davis
CDL Report to Curator

URL: http://www.city.davis.ca.us
Curator's original comments: "We are primarily interested in the GIS data produced for the City of Davis. As the GIS data represent the largest part of the City of Davis web site, it may not be much more difficult to archive the site at the top level. Mapping and Geographic Information Systems URL: http://www.city.davis.ca.us/gis"
Site copyright statement: "This web site is Copyright © 2004 by the City of Davis. All Rights Reserved. The City retains the copyright on all text, graphic images, and other content of this site. You may not copy, modify, and/or re-use text, images, or other web content from this web site; distribute the City's web content; mirror content from this web site on a non-City server; or make any other use of the content of this web site that would violate the City's copyright, without written permission from the City of Davis. To the extent allowed by law, commercial use of our web material is prohibited without written permission from the City of Davis. All art work shown on these web pages is protected by US Copyright laws. Limited reproduction for non-commercial purposes can be authorized by the City of Davis provided that requests are approved prior to use; contact the Community Development Department Cultural Services program staff at (530) 757-5610 for more information. Some content included in this web site may be provided courtesy of third parties who may retain copyright control of the provided material. Any service marks and trademarks contained herein are the property of their respective owners."
Crawl Results

67

Comments from crawl operator: "GIS: Potential issue: /img is disallowed by robots.txt, e.g., http://www.city.davis.ca.us/img/featured/map-static.jpg can't be retrieved; also some maps on a second server are disallowed. Need feedback about the GIS material that was captured: what was captured that is useful? Much duplication -- pages captured repeatedly."
Robots.txt: The site you selected prohibits crawlers from collecting certain documents. The file reads:

User-agent: *
Disallow: /img
Disallow: /calendar
Disallow: /misc/email.cfm
Disallow: /edbusiness
Disallow: /gis/oldmap
Disallow: /police/log
Disallow: /pcs/grants/sacog
Disallow: /jobs/listings
Disallow: /css
Disallow: /pcs/nutcracker/history.cfm
Disallow: /pcs/nutcracker/pdfs
User-agent: asterias
Disallow:
User-agent: gigabot
Disallow:

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 420 additional hosts were crawled. The following hosts supplied more than 50 files to your site:


[urls] [bytes] [host]
16455 947871325 www.city.davis.ca.us
420 29555 dns
332 10377948 www.asucd.ucdavis.edu
305 33270715 selectree.calpoly.edu
279 3815103 www.w3.org
161 2027740 www.cr.nps.gov
139 941939 www.comcast.com
133 951815 www.yolocounty.org
[List truncated…]

Curator Feedback to CDL (Stratford Davis)

Crawl Success: mostly effective
Crawl Success Comments: Looking at the GIS Online Maps page, it's not clear which formats were retrieved and which were not. For example, the Growth Map Flash file downloads fine, but the Flash/ArcIMS files do not download.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In general, restricted to original host works better. The broader search includes pages from businesses (e.g., Comcast) and other local agencies (e.g., other local and state government sites). But restricting the outside sites to the first level seems to be a good compromise.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): In general, it looked like it did a good job pulling down geographic data/images. For example, you can pull down data from http://www.city.davis.ca.us/gis/library. It's difficult for me to get a sense of the level of duplication from the way the search results display.
Crawl Frequency: monthly
Questions / Comments about crawl: I like the ability to navigate within the archive once you have a starting point. This is very nice.


Yvonne Wilson: Orange County Sanitation District
CDL Report to Curator

URL: http://www.ocsd.com
Curator's original comments: "This is an Orange County agency that has small reports and complex planning documents on its web site."
Site copyright statement: "Copyright © 2001-2005 Orange County Sanitation District. Unless a copyright is indicated, the information on this site is freely available for non-commercial, non-profit-making use. If a copyright is indicated on any materials displayed on our website, permission to copy these materials must be obtained from the original source. Commercial use of District materials is expressly prohibited without the written permission of the OCSD."
Crawl Results

Comments from crawl operator: In both crawl settings we had to limit the maximum number of retry attempts in order to complete the crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 85 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
755 85943567 www.ocsd.com
164 7635257 www.ci.seal-beach.ca.us
122 809190 www.ci.irvine.ca.us
95 169207 epa.gov
86 7673 dns
85 559125 order.e-arc.com
66 840581 www.ci.huntington-beach.ca.us
62 213476 www.cityoforange.org
57 313579 www.epa.gov
55 4477820 www.villapark.org
50 1843748 www.cityoflapalma.org
50 463285 www.ocbinc.com
[List truncated…]

Curator Feedback to CDL (Wilson OCSD)

Crawl Success: somewhat effective
Crawl Success Comments: Using WERA, I searched inquiries by type and title in the two OCSD collections, plain and via. I received no hits for pdf, only the homepage for html, and three hits for text. There are many PDF sections in the EIRs. I next searched by title in the two collections; I was the most successful in via. By searching the titles "carbon canyon" and "Ellis Ave Pumping Station", I found articles but not the EIRs, which are available full text. At this point I presumed the crawl did not drill down far enough. Then I searched for a secondary page entitled "Ocean Monitoring"; this time the search found only an internal letter and memo, but not all the documents related to this topic. The via collection search seems to be the most productive, but it is not consistent.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: I searched some of the outside links to US Marine Fisheries and EPA Beach Watch and received no hits.
Crawl Frequency: monthly


Crawl Report Key: Web-at-Risk Test Crawls
This document is a guide to the test crawl report, providing further information about some of the statistics and results conveyed there. Your report includes basic information about the site you submitted, your original comments about that site, and any copyright statements CDL found when we examined the site. Your site was crawled using the Heritrix crawler, version 1.5.1. Your report will include the following information about the site.

Crawl Settings: We crawled each site in two different ways: (A) linked hosts included, and (B) restricted to original host. Where linked hosts were included, we set the crawler to gather any outside page that your nominated site linked to, but no further. So if your site linked to a single document from whitehouse.gov, we captured that document but did not crawl any further on the whitehouse.gov site. Your report will show results from both styles of crawling to give you a sense of whether or not the site draws heavily from valuable materials on another site. You will also receive a list of the other hosts that site linked to and how many documents were gathered from those hosts.

Robots.txt file: The presence of a robots.txt file means that the content provider is asking us to refrain from crawling either all or part of the site. This refers only to the host you named in your crawl request. In some cases the site had a robots.txt file but it didn't say anything; we noted when this occurred. We obeyed robots.txt instructions for these crawls, so if the site contained one but we still got a result, that means the robots.txt file only prevented us from crawling certain areas of the site. When robots.txt files were present, we have included the text of that file in your report so you can see which segments of the site the site owner wants to protect.

Crawl duration / Total number of documents: The "Documents" count will include page components (such as images or Flash files).

File types (mime types): This area will contain a URL. When you go to that URL, you will see a list of the different file types that were retrieved as part of the crawl. IMPORTANT: The Heritrix crawler is currently experiencing difficulty with this report, and it is missing a crucial column. Until that is fixed, you can see the different file types retrieved, from most common to least, but you cannot yet tell how many files each one included. CDL will contact you when this report has been fixed.

Response code reports: The URL in this column will lead to a list of response codes in order by frequency. This will include "200" for files that were successfully captured and error codes for files that were not captured. The error code list includes some codes specific to Heritrix; the key to interpreting these codes is at http://crawler.archive.org/articles/user_manual.html#statuscodes. Note that this report only gives you quantitative information about response codes; it does not link response codes to specific files. For these details see "hosts report and crawl log" below.

How much data collected (bytes): The file size of the total crawl is reported in bytes. You can use the byte conversion tool at http://www.techtutorials.net/reference/byteconverters.html if you want to recalculate the size of the crawl in another measurement, such as kilobytes or megabytes.

Did crawl complete: This will say "no" if the crawl results exceeded 1 gigabyte or if the crawler encountered an obstacle to capturing the site that could not be fixed.

Location of hosts report and crawl log: You are welcome to review the page-by-page details of the crawl log. This is a generic report that comes with the Heritrix crawler and is not terribly user friendly, but it provides the most detail about the crawl process. This report will list every file that the crawler attempted to get and provide some information about each file. The Heritrix manual can help you interpret this report: http://crawler.archive.org/articles/user_manual.html#logs (go to section 8.2.1, Crawl Log).

Comments from Crawl Operator: These are observations that the Web Archive Programmer made about the crawl process for your site.

Questions for Curator: This section does not appear in every report. If you have a question listed here, please respond to it in the Test Crawl Feedback form.

Your Collection: Important: The screens you will see do not represent the final user interface for the Web Archiving Service tools. We are using WERA, an open-source search and display tool, only to show you your test crawl results.


Because we did not seek the right to redistribute these documents, these pages are available only for the purpose of analyzing crawler effectiveness. You must have a password to view these pages. Your report will include the address of a wiki page and a login and password. Each site was crawled twice: plain crawl = only pages from the original site were collected; via = pages from the original site, as well as pages that site links to, were collected. Unfortunately, you cannot simply browse your site; you must select a collection and type a search. You will be able to navigate throughout your site once you load a page containing links. You will be able to review your colleagues' sites as well. Note that the WERA display tool is not perfect; if the same document was gathered from more than one crawl, it may not display in every collection.

Related Hosts Crawled: This section provides further information about the additional materials that were gathered when we set the crawler to include documents that your site links to. This can be critical in deciding what settings are needed to capture your site. Some sites, for instance, will keep all of their PDF or image files on a separate server; if you don't allow the crawler to move away from the original URI, you won't capture a critical portion of the site's content. In other cases, however, this setting will lead to irrelevant information. This report includes the most commonly linked hosts from your site.
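
For curators comfortable with a little scripting, the response code report and crawl log described above can also be summarized directly. The sketch below is a minimal Python example that tallies status codes from a Heritrix crawl log; it assumes the whitespace-delimited crawl.log layout described in the Heritrix manual (timestamp, status code, size, URI, ...) and a local file named crawl.log, both of which may differ in your setup.

    from collections import Counter

    # Tally fetch status codes from a Heritrix crawl.log
    # (assumed layout: timestamp, status, size, URI, ... separated by whitespace).
    status_counts = Counter()
    with open("crawl.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            fields = line.split()
            if len(fields) >= 4:
                status_counts[fields[1]] += 1

    # Print codes from most to least frequent: "200" for successful captures,
    # HTTP error codes, and Heritrix-specific codes for URIs that were not fetched.
    for code, count in status_counts.most_common():
        print(code, count)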



with particular types of files or if the crawler simply did not get to the missing files before the crawl was stopped

Comparison with Other Crawlers
Three of our curators had previously used other crawlers, such as HTTrack and Wget, to capture the same sites we attempted with Heritrix. Because these curators provided strong details when describing their sites, we crawled all of the sites they sent us (six sites). In one case, the "Defense Base Closure and Realignment Commission," the curator had greater success capturing .aspx files with HTTrack than we had with Heritrix. Other comparisons:

We were surprised that your crawl found 4,888 documents. Another crawl that we conducted about the same time using Wget found only 1,474. However, both spiders found roughly the same number of bytes. As I understand it, Wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can. That may be the difference.

And

We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with Wget.

Crawl Success
We asked curators to rate the overall success of the test crawls on the following scale:

• Not effective (none of the desired documents were captured)
• Somewhat effective (some of the desired documents were captured)
• Mostly effective (most of the desired documents were captured)
• Effective (all of the desired documents were captured)

[Figure: bar chart of curators' overall "Crawl Success" ratings (Effective, Mostly Effective, Somewhat Effective, Not Effective), with counts shown on a 0-12 scale]

Here are some of the comments curators had concerning the overall success of these tests. The full text of all curator reports and their feedback to CDL is available in Appendix C.


Los Angeles Planning Department

The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all PDFs. When searching specifically for EIRs, I got a large result list (699 citations); however [sic], when I investigated whether or not the actual file was captured, I found that the main EIR page was captured, typically a .htm/.html file, but when I clicked on a link to get to the full report, all I got was the "Sorry, no documents with the given URI were found" message. This could be that the file was no longer available when the site was harvested; however, I tested a few of these and found that I could still access them on the City Planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved.

City of San Diego Planning Department
This comment is from a curator who is filling in for the person who originated the test crawl request, so she is also attempting to interpret what that other person was hoping to capture.

Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example, http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit.

Defense Base Closure and Realignment Commission

I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.

Public Policy Institute of California


There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on "Date" retrieves the message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."

AFL-CIO

I realize the collection interface is a "work in progress" and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g., "Working Families Toolkit," "BushWatch") that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2-million-plus-hits-per-year Web sites, to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses.

Conclusions
The process of running these test crawls has been valuable. It has resulted in a number of lessons learned and further directions for inquiry. It is clear that the tools that help curators analyze the results of these crawls will have to do more than mimic the original browsing context or supply tables of data. We will be looking for any opportunity to improve WAS reporting capabilities as we move forward with the project. The differences encountered in language and visualization raise the importance of a strong, intuitive design for the curator tools and for clear help screens. Each of us may visualize web sites differently, and the crawler may behave differently than we expect. A certain degree of online help will be needed to design crawls effectively, and further guidance should be available to help people interpret crawl results when those results don't match what the person anticipated.


The help and documentation for the Web Archiving Service will also need to address the rights analysis issues raised above. In most cases this is work that is done prior to issuing crawls; it cannot be addressed by the design of the WAS interface alone. When the Web-at-Risk project reaches the point of conducting usability studies, we should be sure to include tests that further uncover how users understand crawl frequency settings. Additionally, the desire for an analysis tool that can convey when a site has changed significantly is not limited to this project; it is a common issue faced by the Internet Archive, members of the International Internet Preservation Consortium, and others. CDL is actively communicating with these organizations as we all work toward a solution for this problem. Similarly, CDL should ensure that future assessment and analysis work with our curators addresses the issues raised by the two crawl scope settings. This inquiry should also extend to participants who exemplify end users, to determine if people using web archives experience crawl scope differently than people who build them. Finally, a number of lessons were learned via the Katrina crawl described in Appendix B. The most outstanding finding is that event-based crawls, such as Katrina, and site-specific crawls, such as these, have quite different characteristics and require different functionality and analysis tools.

Next Steps
There is certainly more to learn from these crawl results and about Heritrix; in some cases it is still unclear why the crawler failed to retrieve certain documents. The curators' feedback concerning these results has been extremely valuable. They have provided insight as to what was captured and what is still missing, which would have been difficult to determine without their subject expertise in the sites chosen. As we continue working to improve crawler success and performance, we will turn our attention to the Virtual Remote Control site created by Cornell. This site was developed to present particular problems to crawlers and is well documented, enabling the user to gauge a crawler's results. We will be using that site to replicate particular problems raised in our test crawl set and may also reattempt some of the sites crawled for these tests. When we release the first version of the Web Archiving Service to curators in July 2006, we will request that they include their original test site in the crawls they attempt and compare the results with these tests. The test results for these crawls will remain in place as a point of comparison while we continue developing the Web Archiving Service. Additionally, certain issues raised by this feedback, particularly the desired scope and frequency of crawls, merit attention in future usability analysis work.


Web-at-Risk Test Crawl Report: Appendix A Sites Submitted

Curator | Site | Crawled
Sherry DeDekker | http://ca.water.usgs.gov (California Water Science Center) | Yes
Sherry DeDekker | http://www.dwr.water.ca.gov (California Department of Water Resources) |
Peter Filardo and Michael Nash | http://www.nycclc.org (New York City Central Labor Council) | Yes
Peter Filardo and Michael Nash | http://www.dsausa.org (Democratic Socialists of America) |
Valerie Glenn and Arelene Weibel | http://www.strengtheningsocialsecurity.gov (Strengthening Social Security) | Yes
Valerie Glenn and Arelene Weibel | http://www.brac.gov (The Defense Base Closure and Realignment Commission) | Yes
Gabriela Gray | http://www.joinarnold.com (Join Arnold) | Yes
Gabriela Gray | http://www.antonio2005.com (Mayor-elect Antonio Villaraigosa) | Yes
Ron Heckart and Nick Robinson | http://www.ppic.org (Public Policy Institute of California) | Yes
Ron Heckart and Nick Robinson | http://www.cbp.org (California Budget Project) |
Terrence Huwe | http://www.aflcio.org (AFL-CIO) | Yes
Terrence Huwe | http://www.seiu.org (Service Employees International Union) |
James Jacobs | http://www.sandiego.gov/planning (City of San Diego Planning Department; analyzed by Megan Dreger) | Yes
James Jacobs | http://www.sandag.org (San Diego Association of Governments) |
Kris Kasianovitz | http://cityplanning.lacity.org (Los Angeles Department of City Planning) | Yes
Kris Kasianovitz | http://www.scag.ca.gov (Southern California Association of Governments) | Yes
Linda Kennedy | http://calwater.ca.gov (California Bay-Delta Authority (CALFED)) | Yes
Linda Kennedy | http://www.dfg.ca.gov (California Department of Fish and Game) |
Ann Latta | http://www.ucmerced.edu (UC Merced; analyzed by Elizabeth Cowell) | Yes
Ann Latta | http://www.coastal.ca.gov/web (California Coastal Commission) |
Janet Martorana | http://www.countyofsb.org/plandev/default.htm (Santa Barbara County Department of Planning and Development) | Yes
Janet Martorana | http://www.sbcag.org (Santa Barbara County Association of Governments) |
Lucia Orlando | http://montereybay.noaa.gov (Monterey Bay National Marine Sanctuary) | Yes
Lucia Orlando | http://www.waterboards.ca.gov/centralcoast (Central Coast Regional Water Quality Control Board) |
Richard Pearce-Moses | http://www.azwater.gov (Arizona Department of Water Resources) | Yes
Richard Pearce-Moses | http://www.ccec.state.az.us/ccecscr/home.asp (Citizen's Clean Election Commission) | Yes
Juri Stratford | http://www.city.davis.ca.us (City of Davis, California) | Yes
Juri Stratford | http://www.sacog.org (Sacramento Area Council of Governments) |
Yvonne Wilson | http://www.ocsd.com (The Orange County Sanitation District) | Yes

Web-at-Risk Test Crawl Report: Appendix B The Katrina Crawl

The Crawl
During the early Fall of 2005, the California Digital Library was embarking on a series of test crawls as part of the Web-at-Risk project when hurricane Katrina struck. CDL had requested that curators submit their sample URLs by August 25th, and on Friday, August 26th, Louisiana Governor Blanco declared a state of emergency. It was over that weekend that CDL staff reached the conclusion that an event of this historic significance would shift our priorities. On Monday we suspended our initial test crawl plans and began preparing on a number of fronts to capture web-based materials related to Katrina. Specifically, this meant:

• identifying which web sites or sub-sites to collect, which involved:
  o notifying 30+ curators of our emergency crawl plans and requesting their assistance
  o setting up a web-based interface to gather their suggested seed URLs
  o sorting through the resulting seed list and feeding it to our primary and secondary (Stanford) crawl operators
• determining our collection parameters
• getting our crawlers installed and configured
• locating and setting up disk space to store the crawls
• initiating and monitoring the crawls

This crawl activity posed several new challenges. It was the first time CDL staff had used the Heritrix crawler, which was not yet installed on a production server when the hurricane struck. The last week of August saw CDL and UC Berkeley staff scrambling to find server space to set up Heritrix instances, disk partitions, and job parameters for the crawl. This was done with remarkable speed, owing to the urgency of the situation and to everyone's desire to capture a record of the events. Given our limited experience, in order to reduce the risk of losing the historically significant and fleeting materials related to hurricane Katrina, CDL worked with Stanford University to concurrently run the same crawl using a different crawler. After both CDL and Stanford had crawled the same seed list for a month, the task was then taken up by the San Diego Supercomputer Center, who have continued crawling these sites using Heritrix.

Gathering the Seeds
The CDL sent out an initial request to the Web-at-Risk curators to submit URLs related to hurricane Katrina for crawling. We worked collaboratively with a large group of content specialists to identify the sites:

• The 22 curators of the Web-at-Risk project (University of California Libraries, University of North Texas, New York University, Stanford University, and the Arizona State Library)
• The Library of Congress
• Librarians at Louisiana State University and the University of Mississippi


Over the course of the crawl, the list of seed URLs grew to over 700 (just over 500 of which were crawled by CDL). The image of our input form above provides some sense of the range of materials collected. Given the sudden nature of this event, there was no time to investigate rights issues or technical problems each site might have presented. We informed the curators that "Our immediate plan is to simply collect the material before it disappears. We will not make the material immediately available." We also had little time for quality control and were not able to guarantee that sites in our seed list would be comprehensively crawled.

Crawling Specifics
CDL's first crawl was run on September 1, 2005, using a seed list of 89 URLs. The final crawl run by CDL was on October 10, using a seed list of 589 URLs. This final seed list was then sent to the San Diego Supercomputer Center, who are continuing to run twice-weekly crawls.


The CDL crawls were done using Heritrix version 1.5.1. We began with a single instance of Heritrix, eventually dividing the seed list among 6 instances. We got through the entire seed list 29 times in 40 days. However, the content we were able to collect was limited by our crawler settings. Our crawler was configured to visit one host at a time, collecting content at a conservative rate, and never more than three hops away from the seed URL. To ensure that the crawler moved through the seed list in a timely way, it was limited to spending 15 minutes at any given host. So the material collected does not represent the entirety of what was available at each site.

Note that all of the seeds were crawled with the same configuration, no matter how different the structure of these sites might be. Thus the New Orleans version of Craig's List, NASA's information pages, and blog sites were all crawled in the same manner, despite being quite different in context, architecture, and other characteristics. Given our short preparation time, the goal was to find a crawler configuration that would be a match for the broadest range of Katrina materials. In most cases, whether the seed URL was for a personal blog or for a government agency, the seed itself was not centrally about Katrina. The Katrina information was generally situated on the front page and top levels of each site, so CDL expected that this setting, even with the 15-minute time limit, would at least capture the content nearest to the front pages of the widest variety of sites we could capture.

In choosing crawler configuration settings we also faced conflicting goals. On one hand, we wanted to capture as much Katrina content as possible. On the other hand, we were very hesitant to start hitting sites that were providing much-needed information at a crucial time (emergency sites, relief sites), thus making those sites hard to reach due to our capture activity. Many of these sites were already getting heavily used and perhaps were not running at full capacity; some were also geographically impacted directly by the hurricane. So we had to choose settings that balanced the need to collect with politeness across a wide range of sites.

When the San Diego Supercomputer Center took over the Katrina crawls in October, they revisited the Heritrix configuration settings. The SDSC crawl placed a limit not on the amount of time spent at a site but on the number of documents to be retrieved overall. Their crawls are being conducted "breadth first," in the sense of gathering pages consecutively across the seed list rather than gathering pages from one site at a time. The Stanford University crawl of the same sites began on September 7th and continued for 30 consecutive days using the WebVac crawler. While we have statistics concerning the size of the Stanford Katrina collection, it is very difficult to compare the configuration settings and effectiveness of WebVac vs. Heritrix. Further, there is no easy way to display materials gathered with WebVac, so the Stanford Katrina content is stored but is not accessible to us at the moment.
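
To make the politeness trade-off concrete, the toy sketch below shows how a per-host hop limit and time budget of the kind described above constrain what gets collected. It is only an illustration of the general idea, not Heritrix's implementation; fetch() and extract_links() are hypothetical placeholders supplied by the caller.

    import time
    from collections import deque
    from urllib.parse import urlsplit

    MAX_HOPS = 3                # never more than three hops from the seed
    HOST_TIME_BUDGET = 15 * 60  # seconds spent at any one host (15 minutes)

    def crawl_one_host(seed, fetch, extract_links):
        """Visit one seed's host until the time budget or the frontier is exhausted."""
        host = urlsplit(seed).netloc
        started = time.monotonic()
        queue = deque([(seed, 0)])   # (url, hops from the seed)
        seen = {seed}
        pages = []
        while queue and time.monotonic() - started < HOST_TIME_BUDGET:
            url, hops = queue.popleft()
            page = fetch(url)        # a politeness delay between requests would go here
            pages.append(page)
            if hops < MAX_HOPS:
                for link in extract_links(page):
                    if link not in seen and urlsplit(link).netloc == host:
                        seen.add(link)
                        queue.append((link, hops + 1))
        return pages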

Katrina Crawl Results
In terms of creating a collection of the Katrina event on the web, we are not satisfied that CDL's crawling efforts were successful. As mentioned, we had to impose time limits on each site visited and were not able to conduct quality control on the URLs that were captured. The total size of the Katrina capture at CDL is 50 GB. This represents 29 completed crawls of the seed list, which grew to over 500 seeds over the 40 days we ran the crawl. It's clear from the byte size alone that we barely skimmed the surface, and indeed the Stanford crawl proves this point: when the Stanford crawl was based on about 370 seeds, it was pulling in about 35 GB a day, or 15 million pages a day. In spite of this, our attempts to capture Katrina news and events did have some very useful and positive outcomes. In terms of test crawling, the experience was quite successful, opening our eyes both to certain obstacles and to new possibilities. Most importantly, it prompted us to consider the nature and demands of event-based crawling, which had not been part of our original test plan. Here are some of the aspects of event-based crawls that Katrina surfaced.

Site Selection and Classification
This type of event demands a deep, collaborative effort to identify material to collect. CDL had to mobilize a large number of curators to make their best guesses at identifying promising sites. We set up a 'crawl seed nomination' web form accessible to the curators from their home institutions and invited the curators to use it to enter their suggested seed URLs. CDL staff did not have time to add many automatic error- or duplicate-checking features, so this cumbersome work was done by hand. On the other hand, it provided us a first-hand trial of what a more general curator user interface might require. The selection and management of seed lists is critical for sudden event-based crawls. The curators contributing the URLs will not necessarily be well-versed in the topic; in the case of Katrina, curators in California were not uniformly familiar with the Gulf Coast, the towns, the government agencies, etc. In addition, it is difficult to predict which aspects of the event will be of historic, enduring value. Because disk storage was not a pressing issue, it was better to err on the side of a wider net when selecting sites. It became clear that there might be a role for "smart crawling," which would spread a wider net than that provided by human-generated seed lists through the use of automated tools that discovered relevant materials. We were specifically interested in using tools developed at Emory and Cornell that interface with the Rainbow text classification tool.7 We did not have the resources to investigate this for the Katrina crawl but plan to fold it into future crawling tests. Finally, it is worth examining why the seed list grew continuously throughout the event. Part of this growth is due to the natural increase in the number of pages and sites devoted to the emergency and having the additional time to identify and add them. However, the nature of the event itself changed over time, starting as a hurricane, then becoming a flood, a massive relocation, and a political and social issue. So the range of relevant sites changed as the event itself took on broader implications. This suggests that site selection is an ongoing process, not strictly an activity undertaken at the beginning of an event.

7 Bergmark, Donna. "Heritrix processor for use with rainbow." <http://groups.yahoo.com/group/archive-crawler/message/1905>

Rights, Ownership, and Responsibilities
Our current rights protocol stipulates that when a web page is clearly in the for-profit domain, we must seek permission from the content owners to collect. This restriction applies most notably to news agency sites, which would likely be a critical part of the web-based fallout of any major event. This was certainly the case with Katrina. However, rights negotiation is a time-consuming process. In the case of a planned event, such as an election, one is able to anticipate a certain number of sources and take rights management steps before the event occurs. But with emergencies such as Katrina, there is simply no way to secure rights in advance without missing time-sensitive material. Knowing this would be an issue, CDL notified our curators in advance that the Katrina materials we gathered would not be publicly accessible but merely captured and preserved. The complex rights issues behind web archiving, as well as the collaborative environment described above, raise a number of questions:

• Who owns the content?
• Who is responsible for the project?
• Who is responsible for responding if a content owner objects?
• When is it permissible to ignore robots exclusion files?
• How much rights management can be done as part of post-processing for an event crawl?
• Can there be standing agreements with major news agencies to allow for this type of crawl in emergency situations? If so, what constitutes an emergency?

Technical Infrastructure
CDL did not have an adequate technical infrastructure in place at the time of the Katrina crawl. As mentioned above, staff at both CDL and UC Berkeley scrambled to find server space and to install and configure the crawlers. This sudden shift in workload left the system vulnerable and somewhat unstable. In order to start the crawling process as soon as possible, CDL began the project on a temporary server. Making the transition to more permanent storage in the midst of this crawling project was neither easy nor flawless, and a certain amount of data had to be recovered from backup sources.

Information Analysis and Display The challenge of analyzing crawl results has been described in some detail in our Test Crawl Report That report cites a curator who found the task of reviewing an 8899 document crawl to be unmanageable The tools currently available are not at all up to the task of analyzing a large and complex crawl An event-based crawl is

23

likely to result in massive amounts of data of widely varying quality. The selection of seeds is based on guesses that, given sufficient resources, should be reviewed, refined, and enhanced as the event progresses. Time-series data based on changing input parameters represents a kind of moving target that suggests the need to develop new analysis tools. Andreas Paepcke of Stanford University addressed this problem when considering how a social scientist might want to analyze the Katrina materials. Consider the researcher who is interested in finding out how quickly the notion of race entered the public discourse in the aftermath of Katrina. Paepcke suggests:

"For example, the tool could perform word frequency analysis across successive crawls of Katrina sites so that the user could identify the emergence of word usage such as 'looter,' 'ninth ward,' etc. The social scientist would then define within the tool clusters of words that in the scientist's judgment are direct or indirect occurrence indicators of the concept 'race' (i.e., ninth ward, looters, lacks, poor). The scientist could subsequently interact with the tool at the level of these well-defined concepts. Example command: count occurrences of 'race' within the first three paragraphs of all pages and tabulate the differences across 10 days. We could go further and apply well-known topic categorization algorithms on the data to suggest new concepts as they arise in the series of text streams. For example, the tool might identify an increasing frequency of the term 'mismanagement' beginning in week two after the storm and alert the sociologist to this evidence of a new discourse theme." 8

While we don't expect to be able to incorporate this level of analysis into our web archiving tools any time soon, these ideas illustrate a useful direction for archival analysis tools.
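As a concrete (and deliberately minimal) sketch of the word-frequency idea quoted above, the following Python fragment counts a handful of indicator terms across successive crawl snapshots. It assumes each snapshot has already been reduced to plain-text files under a dated directory; the directory names and term list are illustrative only.

import re
from collections import Counter
from pathlib import Path

INDICATORS = ["looter", "ninth ward", "mismanagement"]

def term_counts(snapshot_dir):
    # Tally indicator-term occurrences across one crawl snapshot.
    counts = Counter()
    for path in Path(snapshot_dir).rglob("*.txt"):
        text = path.read_text(errors="ignore").lower()
        for term in INDICATORS:
            counts[term] += len(re.findall(re.escape(term), text))
    return counts

for snapshot in ["katrina-2005-09-01", "katrina-2005-09-08", "katrina-2005-09-15"]:
    print(snapshot, dict(term_counts(snapshot)))

Charting these counts over time would show when a term such as "mismanagement" begins to surface, which is the alerting behavior Paepcke describes.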

Conclusions

At the time Hurricane Katrina hit, the Web-at-Risk service requirements and test crawl plan had been written for a more orderly and considered approach to web crawling. Our attempt to capture the web-based aftermath of Katrina highlighted our need to revisit the service requirements and see how well they support an event-based response. We need to develop a collaborative and robust mechanism to identify and collect seed URLs. When an event happens quickly, it is also important to have in place general crawling guidelines that will result in a well-rounded collection. The technical infrastructure for the Web Archiving Service that we are building for the Web-at-Risk project should be robust enough to handle occasional and sudden bursts of activity. Attempting to alter the infrastructure quickly in reaction to emergency events impairs our responsiveness and leaves the altered systems in a vulnerable state.

8 Paepcke, Andreas, Senior Research Scientist and Director of the Digital Library, Stanford University. Email correspondence with Patricia Cruse, October 26, 2005.

24

Finally, work needs to continue on both the rights management front and on developing improved web archiving analysis tools, so that the material gathered can be used to its greatest potential.

25

Web-at-Risk Test Crawl Report: Appendix C Individual Crawl Reports

Included below are the crawl reports provided to individual curators, including their analysis and feedback about those results. Note that a key to interpreting the tables in these crawl reports is provided at the end of this appendix.

Elizabeth Cowell (submitted by Ann Latta): UC Merced

CDL Report to Curator

URL: http://www.ucmerced.edu

Curator's original comments: "UC Merced is the first research university to be built in the 21st century. The educational and land use issues are significant. Of particular interest is http://www.ucmercedplanning.net. This site addresses major issues of land use - the university is being built on agricultural land. Controversy existed re: issues of redevelopment of downtown Merced vs. appropriation of agricultural land - there are major environmental issues focused on endangered species - Educational issues involving faculty job descriptions, student body, etc. are significant because of the economic, ethnic, and cultural diversity of the region."

Site copyright statement: "© 2004 UC Regents"

Crawl Results

26

Comments from crawl operator: When we set the crawl to include pages from linked sites, the crawler got "trapped" at the Elsevier site. There is JavaScript on that linked page that causes the crawler to continue looking for additional pages on the Elsevier site, even when you're only trying to capture a single page. Once we set a limit for the maximum number of retry attempts, the crawl completed. This data is from the completed crawl.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 227 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
969 71552369 www.ucmerced.edu
238 2564803 www.ucop.edu
226 14851 dns
197 16583197 www.universityofcalifornia.edu
156 8487817 www.elsevier.com
151 1437436 www.greatvalley.org
112 2354582 faculty.ucmerced.edu
105 5659795 www.pacific.edu
90 111985 k12.ucop.edu
86 255733 www-cms.llnl.gov
85 1178031 admissions.ucmerced.edu
81 297947 uc-industry.berkeley.edu
71 108265 www.mssmfoundation.org
67 349300 www.nps.gov
66 308926 www.usafreedomcorps.gov
54 137085 slugstore.ucsc.edu
52 52202 www.cerrocoso.edu
51 977315 www.universityofcalifornia.com

Curator Feedback to CDL (Cowell Merced)

Crawl Success: mostly effective

27

Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: 10 more links came up in the via search; 8 of these were not useful for the research of land use issues. The two that were useful were a benefit of the via search.

Crawl Frequency: monthly

28

Sherry DeDekker: California Water Science Center

CDL Report to Curator

URL: http://ca.water.usgs.gov

Curator's original comments: "We are interested in the CA water reports and databases behind the links on this page. Also, in looking at http://waterdata.usgs.gov/ca/nwis/nwis, this section is an interactive interface to multiple data sets. Is it possible to capture this type of site as well as the static reports?"

Site copyright statement: "Information presented on this website is considered public information (unless otherwise noted) and may be distributed or copied. Use of appropriate byline/photo/image credit is requested. We strongly recommend that USGS data be acquired directly from a USGS server and not through other sources that may change the data in some way."

Crawl Results

Comments from crawl operator: Databases, e.g. http://waterdata.usgs.gov/ca/nwis/nwis, disallowed by http://waterdata.usgs.gov/robots.txt. Some water data reports are NOT caught by broader "linked hosts included" settings, e.g. http://pubs.usgs.gov/wdr/2004/wdr-ca-04-1 -- would also want to submit http://ca.water.usgs.gov/waterdata as a seed.

Related hosts crawled:

29

When the crawl was set to include documents from other sites that the original site linked to, 662 additional sites were crawled. The following list includes the hosts that supplied more than 50 files. Note that the host "pubs.usgs.gov" supplied a higher number of files than the original host itself.

[urls] [bytes] [host]
1963 255912820 pubs.usgs.gov
1153 47066381 ca.water.usgs.gov
698 56570 dns
404 112354772 geopubs.wr.usgs.gov
385 9377715 water.usgs.gov
327 203939163 greenwood.cr.usgs.gov
318 17431487 www.elsevier.com
219 3254794 www.usgs.gov
189 2737159 www.lsu.edu
163 2292905 wrgis.wr.usgs.gov
158 31124201 www.epa.gov
149 921063 www.usda.gov
[list truncated…]

Curator Feedback to CDL (DeDekker CWSC)

Crawl Success: somewhat effective

Crawl Success Comments: Site appears to access water data reports (http://ca.water.usgs.gov/archive/waterdata/index.html), but none are actually available through the links. I expected the site to not be able to access real-time data, but these are archived reports.

Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Frequency: weekly

30

Megan Dreger (submitted by James Jacobs): City of San Diego Planning Department

CDL Report to Curator

URL: http://www.sandiego.gov/planning

Curator's original comments: "I would like to drill down several levels (at least 3) of this site. For example, following the link to City of Villages/general plan update leads to many more important planning documents."

Site copyright statement: This site contains the two following notices on the same page:

"Restrictions on Use of Materials: This site is operated and maintained by the City of San Diego through its Department of Information Technology and Communications (referred to as IT&C). Except as provided herein, no material or information from this site may be copied, reproduced, republished, uploaded, posted, transmitted, or distributed except as authorized in this notice, expressly authorized within this site, or approved in writing by IT&C.

Copyright Notice: Unless a copyright is indicated, information on the City of San Diego Web site is in the public domain and may be reproduced, published or otherwise used with the City of San Diego's permission. We request only that the City of San Diego be cited as the source of the information and that any photo credits, graphics, or bylines be similarly credited to the photographer, author, or City of San Diego, as appropriate. If a copyright is indicated on a photo, graphic, or any other material, permission to copy these materials must be obtained from the original source."

Crawl Results

31

Comments from crawl operator: Need feedback about whether desired content retrieved. Question for curator: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 1084 additional hosts were crawled. (This figure does not represent a complete crawl, as the size limitation was reached.) The following hosts supplied more than 75 documents to your site:

[urls] [bytes] [host]
3728 556231640 www.sandiego.gov
1247 38685244 genesis.sannet.gov
1085 80905 dns
807 6676252 www.houstontexans.com
428 1079658 www.cacities.org
399 102298888 www.buccaneers.com
259 1797232 granicus.sandiego.gov
258 42666066 clerkdoc.sannet.gov
238 5413894 www.ccdc.com
225 2503591 www.ci.el-cajon.ca.us
223 1387347 www.ipl.org
217 2683826 www.sdcounty.ca.gov
203 11673212 restaurants.sandiego.com
195 2620365 www.sdcommute.com
192 1344523 www.bengals.com
189 2221192 www.kidsdomain.com
176 1333528 www.buffalobills.com
171 685965 www.chumpsoft.com
166 277238 www.proquest.com
[list truncated…]

32

Curator Feedback to CDL (Dreger San Diego)

Crawl Success: mostly effective

Crawl Success Comments: This crawl was not completed due to size, so that may explain some of my questions. It was pretty effective in terms of getting the Planning Dept. pages, but went out further than I expected. Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content), but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example, http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit. It's nice that PDFs and other formats are included.

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: The "linked hosts included" (via) crawl seemed to include more extraneous stuff.

Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
A: You asked for a better definition of what was meant by the request to "drill down several levels (at least 3)." Unfortunately, I wasn't the original curator and I'm not sure what he meant. The crawl you did was very useful.

Crawl Frequency: monthly

Questions / Comments about crawl: I think that the crawl frequency should be at least monthly for these pages.

33

Peter Filardo and Michael Nash: New York City Central Labor Council

CDL Report to Curator

URL: http://www.nycclc.org

Curator's original comments: (none)

Site copyright statement: "© 2004 New York City Central Labor Council. No portion of this website may be reproduced in any form without permission from the Central Labor Council. Contact our offices for more information at nycaflcio@aol.com."

Crawl Results: NOTE: Because your Crawl "A" had to be stopped, then resumed, each of your reports for that crawl is in two segments. To browse a list of all reports for that crawl, go to http://voro.cdlib.org:8081/ingest_misc/ndiipp/testcrawls_raw/filardo_labor_via

Comments from crawl operator: A. Linked hosts included: Crawl complete after recovery, with addition of max retries. Seemed to hang at http://www.nycclc.org/calendar/event.asp?EventId=501 and http://www.nycclc.org/assets/HLC/application/membership.pdf. Ended crawl; seemed to hang. Recovered from previous job; the recovery was successful. Note for future that a recovered job is identifiable because the logs directory is called logs-R.

34

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, x additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
1913 74260017 www.nycclc.org
156 11755 dns
115 710552 www.aflcio.org
73 1477966 www.comptroller.nyc.gov
71 193264 www.empirepage.com
60 570115 www.redcross.org
58 269079 www.afl-cio.org
57 240845 www.campsussex.org
57 113676 www.mssm.edu
56 449473 www.labor-studies.org
53 184605 www.pbbc.org
52 134326 www.senate.gov
[list truncated…]

Curator Feedback to CDL (Filardo, NYCCLC)

None provided

35

Valerie Glenn and Arelene Weibel: Strengthening Social Security

CDL Report to Curator

URL: http://www.strengtheningsocialsecurity.gov

Curator's original comments: "Contains external links to audio & video that would be essential to completing this site (see press room); some are files, some are links to webcasts, some are on external .gov sites, and some are external .com sites."

Site copyright statement: Copyright info not found

Crawl Results

Comments from crawl operator: Interesting audio/video/webcast -- need feedback about success in capturing.

Questions for Curator: How successfully did this crawl capture the multimedia documents you were interested in?

Comments from coordinator: In the media center area of this site (http://www.strengtheningsocialsecurity.gov/press/media_center.shtml) I've found .ram (both video and audio alone), .smil, and .asx files. The site also contains numerous .ppt and .pdf files.

• A text search on the log file turns up numerous .ram files, only 1 .ppt file.

36

• .asx files are Windows streaming media redirector files, which generally lead to associated .wmv files. No .asx references appear in the crawl log, nor do any .wmv files.
• Similarly, .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files.

I assume that when displayed, some of the RealMedia files from this site would function, but many of the other multimedia files would not.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 388 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
660 10668874 www.chelseapiers.com
562 7334035 www.whitehouse.gov
477 6366197 www.laopinion.com
391 29623 dns
356 3874719 www.wkrc.com
243 12294240 www.strengtheningsocialsecurity.gov
178 1935969 www.xavier.edu
148 237055 image.com.com
127 682069 online.wsj.com
117 898439 www.omaha.com
116 514995 www.npr.org
108 995733 www.nba.com
[list truncated…]
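One follow-up worth noting: because .smil and .asx files are pointer formats, the stream URLs they reference can be extracted and fed back to the crawler as targets. The fragment below is a rough sketch of that idea (it is not part of the crawl toolkit used here); both formats are XML-like, though badly formed .asx files may need a more forgiving parser, and the file name is hypothetical.

import xml.etree.ElementTree as ET

def media_refs(pointer_file):
    # Collect src/href attribute values from a SMIL or ASX pointer file.
    refs = []
    for elem in ET.parse(pointer_file).iter():
        for attr, value in elem.attrib.items():
            if attr.lower() in ("src", "href"):
                refs.append(value)
    return refs

# Example (hypothetical path):
# print(media_refs("media_center/briefing.smil"))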

Curator Feedback to CDL (Glenn Strength Soc Sec)

Crawl Success: mostly effective

Crawl Success Comments: My main concern about this site was the multimedia documents included - I've posted those thoughts in the "Questions for Curator" text box.

Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: For this site it was essential to capture the linked hosts (via) because many of the press materials, etc. were on external sites.

Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: How successfully did this crawl capture the multimedia documents you were interested in?
A: I'm disappointed that not all of the multimedia files were captured, but there seem to be only a few that aren't included. I found it interesting that

37

broadcasts from the same host (whitehouse.gov) weren't completely captured - some were, some weren't.

Crawl Frequency: once

Questions / Comments about crawl: [note: the curator alludes to adding a comment here, but no comment was received]

38

Valerie Glenn and Arelene Weibel: Defense Base Closure and Realignment Commission

CDL Report to Curator

URL: http://www.brac.gov

Curator's original comments: "In our previous efforts we have been unable to capture agency databases. The BRAC site includes a document library which has a search feature (http://www.brac.gov/Search.aspx) and a browse feature (http://www.brac.gov/Browse.aspx). We would really like to see how this information can be captured so that we can recreate it on our own servers."

Site copyright statement: "The contents of all material available on this Internet site are in the public domain and are not copyrighted. The content of this site may be freely reproduced, downloaded, disseminated, published, or transferred in any form and by any means. However, in some cases the copyright for certain text or images on this site may be held by other parties."

Crawl Results

Comments from crawl operator: A. Linked hosts included: http://www.brac.gov/Search.aspx can't be captured by Heritrix. http://www.brac.gov/Browse.aspx seems to only capture the first 25 documents. Tried again with the browse page as the starting point, but stopped after 1005 documents; extracted 20 links from the browse page and then there were no more URLs in the frontier queue which had been extracted from Browse.aspx -

39

-perhaps need more experimentation. B. Restricted to original host: again, only the 1st 25 pages from browse -- can't even successfully pass a seed URL listing the max docs per browse page (50).

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 44 additional hosts were crawled. Note that because your crawl did not complete, this is not an accurate count of how many other sites your original site may link to.

[urls] [bytes] [host]
2034 1064389540 www.brac.gov
555 5874934 www.slu.edu
87 173510 www.cpcc.edu
54 154588 www.wmata.com
47 685158 www.sluhospital.com
44 3501 dns
44 582555 www.c-span.org
43 174467 www.adobe.com
38 178153 www.q-and-a.org
32 127325 slubkstore.com
24 140653 www.c-spanclassroom.org
23 326680 www.capitalnews.org
22 213116 cancercenter.slu.edu
21 196012 www.defenselink.mil
[list truncated…]
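The first-25-documents behavior suggests one workaround worth recording, even though the operator's attempt along these lines did not succeed: generate the paginated listing URLs explicitly and submit them as extra seeds. The sketch below is hypothetical -- the "page" and "pagesize" parameter names are placeholders, and if Browse.aspx drives its paging through ASP.NET postbacks rather than query strings, this approach will not help.

def browse_seeds(base="http://www.brac.gov/Browse.aspx", pages=40, pagesize=25):
    # Build one listing URL per page of results (parameter names are invented).
    return ["%s?page=%d&pagesize=%d" % (base, n, pagesize)
            for n in range(1, pages + 1)]

with open("brac_extra_seeds.txt", "w") as f:
    f.write("\n".join(browse_seeds()) + "\n")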

Curator Feedback to CDL (Glenn Defense Base Closure)

Crawl Success: somewhat effective

Crawl Success Comments: I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1GB, but I think that more documents could have been captured.

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Frequency: once

Questions / Comments about crawl: One of the reasons I submitted this site to be crawled is that I'd already tried to harvest it using HTTrack. That product appeared to capture more of the documents than Heritrix - but we still haven't been able to capture the entire site.

40

Gabriela Gray: Join Arnold

CDL Report to Curator

URL: http://www.joinarnold.com

Curator's original comments: "California Recovery Team. Non-profit pro-Arnold group, not registered as a campaign committee. Critical Aspects: Complex file and directory naming structure. Looping, e.g. contactus.asp and contactadd.asp links."

Site copyright statement: "Copyright 2005"

Crawl Results

Comments from crawl operator: A. Linked hosts included: Great site for testing -- this loop is really interesting because a new URL is generated with each loop, so that duplicate-detection underway at IA would still not eliminate it. 44,332 of the retrieved URLs were contact pages. B. Restricted to original host: Got into a loop; by the end, 999 documents retrieved, 34 minutes. C. Restricted to original host + regular expression:

41

Excluding pages that matched the regular expression "contactadd.asp?c=" did not end the loop. What did end the loop: excluding both contactus and contactadd pages so they were not retrieved -- a drawback. (IA takes the manual approach of gathering the pages, then having an operator stop the crawl and take out the looping URLs by hand -- not scaleable.)

<newObject name="contact" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
  <string name="decision">REJECT</string>
  <string name="regexp">contact.asp?c=</string>
</newObject>

Related hosts crawled: Because of looping problems, we were not able to crawl other hosts linked from this site.
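For readers outside the Heritrix configuration, the following standalone Python fragment illustrates the exclusion that finally broke the loop: reject any candidate URL for either contact page before it is queued. The exact production pattern is not recoverable from this report, and the sample URLs are invented.

import re

REJECT = re.compile(r"contact(us|add)\.asp", re.IGNORECASE)

candidates = [
    "http://www.joinarnold.com/en/agenda/reform.asp",
    "http://www.joinarnold.com/en/contactus.asp?c=12345",
    "http://www.joinarnold.com/en/contactadd.asp?c=67890",
]
print([u for u in candidates if not REJECT.search(u)])
# Only the first URL survives; the two looping contact pages are dropped.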

Curator Feedback to CDL (Gray Arnold)

Crawl Success: somewhat effective

Crawl Success Comments: We spot-checked, and it looks like most files were captured, but the individual pages don't display most of the images. (This may simply be a problem with the WERA interface.) Strangely enough, the Flash files work perfectly, which is exactly the opposite of our own capture experience.

Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: As per the crawl notes, we only checked the original host version, since the via crawl failed.

Crawl Frequency: once

Questions / Comments about crawl: Same comments as on the Villaraigosa crawl. We're inexperienced with this type of capture: archive files as-is and use a tacked-on script and a special server-side interface to interpret links in the new environment. Our model has been to actually alter the internal links from absolute to relative formats so that it works in any environment.

42

Gabriela Gray: Mayor-Elect Villaraigosa

CDL Report to Curator

URL: http://www.antonio2005.com

Curator's original comments: "Critical Aspects: Flash animation; content scattered across multiple servers; maintaining complex internal link structure; JavaScript menus; streaming media."

Site copyright statement: "©2005 Villaraigosa for Mayor 2005"

Crawl Results

Comments from crawl operator

• (For linked hosts results) Need feedback on media, etc. retrieved -- this site is an ideal example of the need for scope+one.

• (For restricted to original host) How much was left out due to domain restriction? Need feedback.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 263 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

[urls] [bytes] [host]
817 10291631 ensim3.interlix.com

43

805 117538973 www.antonio2005.com
472 6333775 www.laopinion.com
265 21173 dns
110 19355921 www2.dailynews.com
100 16605730 www2.dailybulletin.com
95 1410145 www.americanpresidents.org
86 820148 www.dailynews.com
73 168698 www.chumpsoft.com
72 52321 images.ibsys.com
69 836295 www.laobserved.com
65 137700 www.mysql.com
55 213569 www.ensim.com
55 177141 www.lamayorcnc.com
55 296311 www.surveyusa.com
53 495858 abclocal.go.com
52 522324 www.c-span.org
51 244668 gallery.menalto.com
[list truncated…]

Curator Feedback to CDL (Gray Villaraigosa)

Crawl Success: mostly effective

Crawl Success Comments: Doing some spot checks, it looks like all of the pages were captured. Some problems with media files -- WERA shows them when we search, but the files are often size 0. In addition, many files on external servers are listed and even have some descriptive info, but when we click on "Overview" it says "Sorry, no documents with the given uri were found," so no idea if they were really captured.

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: Most of the links to other sites go to pages that truly are external to the site, not incorporated into it. The only exception would be the pages from ga3.org and ga4.org.

Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) The results on the multiple hosts crawl are mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains rather than source + 1. Also, there seem to be a lot of links going to ensim3.interlix.com -- there were more files captured from there than from antonio2005. This is a case where it looks like the entire site (or much of it) was mirrored on two different servers. When we find these, we often try to collapse them into one seamless whole, eliminating the duplication, which is meaningless from the user's viewpoint. Simply capturing both mirrors and leaving the cross-links intact is an option we've used when we can't collapse, but it often leads to problems with links between the two, which seems to be the case here.

44

Crawl Frequency: once

Questions / Comments about crawl: We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE -- the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.

45

Ron Heckart and Nick Robinson: Public Policy Institute of California

CDL Report to Curator

URL: http://www.ppic.org

Curator's original comments: "We are particularly interested in their publications. We hope the crawler will be able to report when new publication files are posted on the website. Our main focus of interest is on their new publications at http://www.ppic.org/main/newpubs.asp."

Site copyright statement: "All Contents © Public Policy Institute of California 2003, 2004, 2005"

Crawl Results

Comments from crawl operator: We can't at the moment use Heritrix to report on new publications posted.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 238 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

[urls] [bytes] [host]
2421 324309107 www.ppic.org
433 1367362 www.cacities.org
238 19286 dns
229 4675065 www.icma.org
200 598505 bookstore.icma.org
151 1437436 www.greatvalley.org
144 517953 www.kff.org
137 5304390 www.rff.org
113 510174 www-hoover.stanford.edu
102 1642991 www.knowledgeplex.org
97 101335 cdn.mapquest.com

46

81 379020 www.cde.ca.gov
73 184118 www.ilsg.org
68 4539957 caag.state.ca.us
62 246921 www.milkeninstitute.org
[list truncated…]

Curator Feedback to CDL (Heckart PPIC)

Crawl Success: mostly effective

Crawl Success Comments: There are some problems with the functionality of captured pages.
1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to "Publications" and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links.
2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on "Date" retrieves a message "Sorry, no documents with the given uri were found."
3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)

Crawl Frequency: weekly

Questions / Comments about crawl: Our crawl report included the following comment from the crawl operator: "We can't at the moment use Heritrix to report on new publications posted." The ability to report on new publications is critical to our goal of using the crawler as a discovery tool. What are the prospects for providing this functionality in the future?
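Although Heritrix itself cannot yet report on new publications, a rough approximation is possible outside the crawler by comparing the URL sets captured by two successive crawls. The sketch below assumes each crawl has been exported to a plain list of URLs, one per line; the file names and the "/pubs/" filter are hypothetical.

def urls(path):
    # Read a one-URL-per-line export of a crawl.
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

new_urls = urls("ppic-crawl-2005-12.urls") - urls("ppic-crawl-2005-11.urls")
for u in sorted(new_urls):
    if u.endswith(".pdf") or "/pubs/" in u:
        print("new publication candidate:", u)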

47

Terry Huwe: AFL-CIO

CDL Report to Curator

URL: http://www.aflcio.org

Curator's original comments: "This site is content rich and has many files that will be useful in the future. Specific areas that are of special interest follow below: http://www.aflcio.org/corporatewatch - the data related to executive pay watch is especially useful; http://www.aflcio.org/mediacenter - would like to see press stories captured if possible; http://www.aflcio.org/issues - links to newsletters and original content. Also, the "Legislative Action Center" on the home page; this is a useful topic guide to legislative history from a labor perspective."

Site copyright statement: "Copyright © 2005 AFL-CIO"

Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 2572 additional hosts were crawled. The following hosts supplied more than 75 URLs to your site:

[urls] [bytes] [host]
12702 481956063 www.aflcio.org
2657 184477 dns
1375 35611678 www.local237teamsters.com

48

570 8144650 www.illinois.gov
502 52847039 www.ilo.org
435 3851046 www.cioslorit.org
427 2782314 www.nola.com
401 8414837 www1.paperthin.com
392 15725244 www.statehealthfacts.kff.org
326 4600633 www.dol.gov
288 12303728 searchoxide.com
284 3401275 www.sikids.com
280 3069385 www.washingtonpost.com
272 1480539 www.cdc.gov
235 5455692 www.kff.org
[list truncated…]

Curator Feedback to CDL (Huwe AFL-CIO)

Crawl Success: effective

Crawl Success Comments: I realize the collection interface is a work in progress and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g., Working Families Toolkit, BushWatch) that will have historical value. I'm hard-pressed to find fault with the crawl short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2-million-plus-hits-per-year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses of the toolkit. Having said all that, my short answer is: I'm stoked. :-)

Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: In the case of this collection there's a lot of original content (and context-making documents) on the linked sites, simply due to the hierarchical/federal nature of the labor movement.

Crawl Frequency: monthly

Questions / Comments about crawl: I may have missed correspondence on this question, as I've been busy with a library renovation here. My question: Is the 1 gig limit a useful one? I couldn't help wondering what the actual figure would be if that limit were set higher. I'm generally interested in the total size of Web collections, as I need to monitor ours (which is content-rich but even so does not exceed 1 gig). So discussion of this at least would be interesting, and maybe, if it's not technically challenging, that 1 gig level might be raised and we could see what happens. Thanks to all involved--very interesting process.

49

Kris Kasianovitz: Los Angeles Dept. of City Planning

CDL Report to Curator

URL: http://cityplanning.lacity.org

Curator's original comments: "Website uses frames. Most of the documents will be pdfs. Of particular interest - the EIRs, which are not archived on the page once the project is approved; see http://cityplanning.lacity.org/EIR/TOC_EIR.htm - General and Community Plans: http://cityplanning.lacity.org/complan/gen_plan/genplan2.htm, http://cityplanning.lacity.org/ComPlan/cpbpage.htm"

Site copyright statement: No copyright information found

Crawl Results

Comments from crawl operator: (Linked hosts included crawl) ended because it ground on for 3 days without hitting the data limit; not sure if URLs at end of log are valid/useful.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 119 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]

50

10493 840876945 cityplanning.lacity.org
601 5156252 metrolinktrains.com
183 644377 www.cr.nps.gov
121 11162 dns
90 977850 www.metrolinktrains.com
81 1207859 www.fta.dot.gov
79 263432 www.fypower.org
66 333540 www.adobe.com
64 344638 lacity.org
63 133340 ceres.ca.gov
60 274940 www.amtrak.com
59 389217 www.nhtsa.dot.gov
58 347752 www.unitedweride.gov
52 209082 www.dot.gov
52 288783 www.nationaltrust.org
51 278949 www.portoflosangeles.org
[list truncated…]

Curator Feedback to CDL (Kasianovitz LA City Planning)

Crawl Success: mostly effective

Crawl Success Comments: The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all pdfs. When searching specifically for EIRs, I got a large result list (699 citations); however, when I investigated whether or not the actual file was captured, I found that the main EIR page was captured (typically a htm/html file), but when I clicked on a link to get to the full report, all I got was the "Sorry, no documents with the given URI were found" message. This could be that the file was no longer available when the site was harvested; however, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved. Here are a few specific examples:

Final EIR directory: http://cityplanning.lacity.org/EIR/Tocfeir.htm

Sierra Canyon Secondary School (cover page): http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/SierraCyn_coverpg.htm
Access to the Draft EIR and Final EIR is provided from this cover page. Within the system, the links to both the Draft and Final are broken -- no documents with that given URI: http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/DEIR/Table of Contents.htm

Villa Marina EIR: http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm

Directory of EIR notices of preparation: http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM

51

This provides links to EIR preparation notices (all pdfs), a total of 27 links/documents. I encountered the following three issues:

• pdf opened = when clicking on the link to the notice, the pdf opened with no problem: 16 of 27

• "Sorry, no document with the given uri was found" = no pdf harvested, but I could get to it from the live site: 4 of 27

• Acrobat "could not open" message (could open live page outside of WERA) = the following Acrobat message came up when I tried to open the pdf: "Acrobat could not open 'ENV-2005-0881-EIR[1].pdf' because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded). To create an Adobe Acrobat PDF document, go to the source application. Then print the document to Adobe PDF." I copied and pasted the URL into a regular browser and could open the pdf with no problem. Also, in a few cases, if I clicked on the GO button after the first attempt to open the pdf in the system, it seemed to launch: 7 of 27

Conversely, I found a number of pages that contained full documents in html, with links to pdfs that worked with no problem. See the following document: http://cityplanning.lacity.org/cwd/gnlpln/transelt/TET2Bkgrnd.htm. File types and error codes were what I expected.

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: The "restricted to original host" crawl is more helpful/relevant for getting to the materials from a specific agency. For some of the results that were restricted to host, I was getting external links. Here are some comparisons for each of the crawl settings:

Searched for "villa marina"
LA Dept of City Planning: 6 results
http://cityplanning.lacity.org/EIR/NOPs/ENV-2004-3812-EIR.pdf
http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM
http://cityplanning.lacity.org/EIR/Tocfeir.htm
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/Cwd/GnlPln/HsgElt/HETblFigApVHgSit.htm
LA City Dept of Planning (via): 2 results
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/Cwd/GnlPln/HsgElt/HETblFigApVHgSit.htm

Searched for "eir"
LA Dept of City Planning: 699 results
LA City Dept of Planning (via): 324 results

52

For both of these searches, the URIs were from cityplanning.lacity.org.

Searched for "transportation"
LA Dept of City Planning: 699 results
LA City Dept of Planning (via): 290 results (most are from external sources and tended to be the index or main page of another agency or organization. Because this just got me to the main page and none of the links functioned at that level, the via result was less helpful. However, the via results are useful for discovering other agencies or organizations that I should be looking at for materials.)

Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) The list of linked hosts provided at the end of the report is helpful. This will help me target other agencies that might be key places to check for further collection development.

Crawl Frequency: monthly

Questions / Comments about crawl: I want to qualify the frequency for this site: I'd like to do a monthly crawl for 3-4 months. I'd then want to reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot of new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation. After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report, there are 8899 documents in this crawl alone; my other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls. I should note that at this point in time, for local documents, I am more interested in individual documents, more so than capturing an entire website and preserving the functionality. Local agencies (with the exception of perhaps the Mayor's site) tend not to change the design (look and feel) very often or have a very sophisticated design (i.e., flash, changing images, etc.), and I don't see that this would be of interest to researchers' needs; it is the content (reports, maps, etc.) contained/accessed on the websites that is important. Maybe I'm wrong or being short-sighted about that.

53

Kris Kasianovitz: Southern California Association of Governments

CDL Report to Curator

Curator's original comments: "This is a critical regional agency for Los Angeles, Orange, Ventura, Imperial, Riverside, and San Bernardino counties. Its main areas are Transportation, Housing, and Economic Development. This will provide an analysis of the overall site, which has a lot of content. Publications/reports are typically in pdf - they are presented as full reports and pieces of the report (for easier downloading), so there might be duplication. The full report is really all that would be needed. See http://www.scag.ca.gov/publications. The Resources page contains the pdfs, images, dynamic content, and gis programs, including an interactive atlas: http://www.scag.ca.gov/resources.htm. One part of the Resource site is the Web Accessible Geographic Data Search (WAGS): http://mapsvr.scag.ca.gov/wags/index.cfm?fuseaction=. It requires a user-created login and password (although there is a guest login that allows you to bypass this). I'm not sure what kind of difficulty the harvester will encounter with this portion of the site. The interactive atlas also has a create-an-account/guest login issue: http://mapsvr.scag.ca.gov/atlas/presmap.asp?Cmd=INIT. Since it is a dynamic page, I don't know how this will be handled by the harvester."

Site copyright statement: "© 1999-2005 Southern California Association of Governments"

Crawl Results

54

Comments from crawl operator: Interesting login problem: Heritrix was unable to retrieve the guest login pages. Cold Fusion and asp don't generate new URLs and thus don't get crawled. NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than 1 minute.

Robots.txt file: The site you selected forbids crawlers from gathering certain data. It reads:

User-agent: *
Disallow: /_mm
Disallow: /_notes
Disallow: /_baks
Disallow: /MMWIP

User-agent: googlebot
Disallow: /csi
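A seed list can be screened against exclusions like these before a crawl is launched. The following small sketch uses only the Python standard library; the second URL is a made-up path under one of the disallowed prefixes.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.scag.ca.gov/robots.txt")
rp.read()

for url in ["http://www.scag.ca.gov/publications/",
            "http://www.scag.ca.gov/_mm/somefile.htm"]:
    print(url, rp.can_fetch("*", url))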

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 500 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
2517 863231651 www.scag.ca.gov
690 6134101 www.metrolinktrains.com
506 40063 dns
428 1084533 www.cacities.org
397 16161513 www.sce.com
196 581022 bookstore.icma.org
187 4505985 www.icma.org
175 7757737 www.ci.seal-beach.ca.us
158 1504151 www.h2ouse.org
149 940692 www.healthebay.org
137 317748 www.ci.pico-rivera.ca.us
130 18259431 www.ci.ventura.ca.us
123 490154 www.chinohills.org
121 406068 www.lakewoodcity.org
119 203542 www.lavote.net
117 2449995 www.ci.malibu.ca.us
114 744410 www.ci.irvine.ca.us
113 368023 www.whitehouse.gov
109 974674 www.dot.ca.gov
107 892192 www.lacanadaflintridge.com
[list truncated…]

Curator Feedback to CDL (Kasianovitz SCAG)

Crawl Success: mostly effective

55

Crawl Success Comments: Similar to my comments about the Los Angeles Dept. of City Planning - the crawl brought back a lot of webpages, but not the publications/documents that I would want to collect. Again, the same problem happened with SCAG as happened with the EIR example: the crawl brought back html pages with links to reports (typically in pdf format), but the actual documents were not captured. While the webpage is helpful, as it gives context, the main content that I'd want to capture (the reports) was not captured. See the following for example; none of the webpages linked from this page are available - they should link to a page that will have the material. I tried searching for the documents separately and couldn't get to them. See http://www.scag.ca.gov/publications/index.htm (the timeline arrows at the top seemed to function; I'm not sure what this is for) and http://www.scag.ca.gov/livable/pubs.htm. I was impressed to find that zip files were captured and I was able to download them. Unfortunately, when I opened them there wasn't any content (I did the same search by mistake with the Arizona Dept. of Water Resources and actually found content in the folders); I found 10 with the search "type:zip". The gif or jpg images retrieved are not useful - most were just bars or bullets or covers of reports (although this might be helpful to identify titles, I think I would end up discarding these -- after doing more checking of the results).

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: See comments from LA Dept. of City Planning. Restricted gets me to the relevant materials for that agency; via brings back too many main webpages for other agencies to be useful.

Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) I expected that there would be a problem with the content behind logins. The crawl confirmed that material behind login screens couldn't be captured. Can I get a copy of these crawl results? NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than one minute.

Crawl Frequency: monthly

Questions / Comments about crawl: How to handle the copyright issue? For the login information, I'm not sure what all was blocked by the robots file. Interesting that metrolinktrains.com is the #1 related host for both of my crawled sites.

56

Linda Kennedy: California Bay Delta Authority

CDL Report to Curator

URL: http://calwater.ca.gov

Curator's original comments: "We are interested in the environmental impact statements and other key documents, and the various news releases and other announcements and archives of CALFED."

Site copyright statement: "© 2001 CALFED Bay-Delta Program"

Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 519 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
1130 473192247 calwater.ca.gov
741 201538533 www.parks.ca.gov
521 40442 dns
373 51291934 solicitation.calwater.ca.gov
242 78913513 www.calwater.ca.gov
225 410972 cwea.org
209 87556344 www.science.calwater.ca.gov
173 109807146 science.calwater.ca.gov
172 1160607 www.adobe.com
129 517834 www.whitehouse.gov
[list truncated…]

Curator Feedback to CDL (Kennedy CALFED)

57

Crawl Success: mostly effective

Crawl Success Comments: I looked closely at the CALFED home page (http://calwater.ca.gov), Tribal home page (http://calwater.ca.gov/Tribal/Tribal_Home.shtml), Key Documents (http://calwater.ca.gov/CALFEDDocuments/CALFEDDocuments.shtml), and Archives page (http://calwater.ca.gov/Archives/Archives.shtml). The crawl did not complete in either the via or non-via search. Nearly all linked pages were retrieved in the non-via search; however, the retrievals from the via search were much less complete than the retrievals from the non-via search. For example, on the Key Documents page there were 3 missing links from the non-via search but 14 missing links from the via search. When Adobe documents were retrieved from either crawl, they came up correctly. Three asp links of tribal maps from the Tribal home page were retrieved by the non-via search but not the via search. A few of the images were missing from the displays, and this was also affected by the browser used. Usually the same image was missing from both crawls, but sometimes the images were more complete in the non-via crawl retrievals. There were some display problems with the right-hand menu boxes on the http://calwater.ca.gov/Tribal/Tribal_Home.shtml page, for example, that did not display correctly when viewed via Firefox. The same page viewed correctly in Internet Explorer, but when printed out, the boxes printed incorrectly, just as viewed in the Firefox browser. Grant Opportunities (http://calwater.ca.gov/GrantOpportunities/GrantInformation.shtml): this link did not work in 2 via and 2 non-via instances (from the Tribal home page and from the Archives page) but did work on one non-via crawl page (the CALFED home page). It could also be searched and retrieved directly from the test crawl search page.

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: See comments above. The non-via search was substantially more complete.

Crawl Frequency: monthly

58

Janet Martorana: Santa Barbara County Department of Planning and Development

CDL Report to Curator

URL: http://www.countyofsb.org/plandev/default.htm

Curator's original comments: None provided

Site copyright statement: No copyright information found

Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 487 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
3119 1102414495 www.countyofsb.org
485 34416 dns
428 1083047 www.cacities.org
357 6126453 www.sbcphd.org
320 6203035 icma.org
250 438507 www.sbcourts.org
234 1110744 vortex.accuweather.com
200 593112 bookstore.icma.org
[list truncated…]

Curator Feedback to CDL (Martorana SBCD)

Crawl Success: mostly effective

59

Crawl Success Comments: Most documents I expected to find were captured, but a number were not. For example, off this page, http://www.countyofsb.org/plandev/comp/threeyear/2005-2008/default.html, I expected to get to the final work program, http://www.countyofsb.org/plandev/pdf/comp/programs/Three_Year_WP/2005-2008_3YrWrkProgram.pdf, but got the "Sorry, no documents with the given uri were found" message. Other examples, within the Energy Division, a part of the Planning & Development Dept.: off this page, http://www.countyofsb.org/energy/information.asp, I could access all links except for two: http://www.countyofsb.org/energy/information/oil&GasFields.asp (Oil and Gas Fields) and http://www.countyofsb.org/energy/information/oil&GasProduction.asp (Oil and Gas Production). The crawler seemed to cut off the URL right before the ampersand; perhaps it has problems with ampersands.

Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Frequency: weekly

Questions / Comments about crawl: I'd like to be notified when I've navigated away from the captured site. I found myself on realtime web sites but couldn't tell what was on the captured site and what was real. Example: When I did a search on "santa barbara" I got 1528 results; using the via results, I went to the County of Santa Barbara Online site and then, after clicking on links, I was on the web realtime and not in the crawl results database any longer, yet there were no indications that I had left the crawled database. The WERA uri was still displaying at the top of the screen. I couldn't tell which were the captured sites and which were the current realtime sites. Other observations: the webpage navigation doesn't work, e.g. "Table of Contents" doesn't jump to that section on the webpage (www.countyofsb.org/energy/projects/shell.asp and www.countyofsb.org/energy/mitigation/oakProject.asp); links to glossary terms go to the glossary but not to the term itself.
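One plausible (though unconfirmed) explanation for the ampersand problem is entity handling: if a link is written in the page source as "oil&amp;GasFields.asp", the &amp; must be converted back to a literal "&" before the URL is requested, or the link is effectively mangled. The fragment below simply illustrates that conversion; whether this is what tripped up the crawler here is a guess.

import html

href = "http://www.countyofsb.org/energy/information/oil&amp;GasFields.asp"
print(html.unescape(href))
# -> http://www.countyofsb.org/energy/information/oil&GasFields.asp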

60

Lucia Orlando: Monterey Bay National Marine Sanctuary

CDL Report to Curator

URL: http://montereybay.noaa.gov

Curator's original comments: None provided

Site copyright statement: No copyright information found

Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 795 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
5272 468755541 montereybay.noaa.gov
861 61141 dns
554 20831035 www.wunderground.com
368 4718168 montereybay.nos.noaa.gov
282 3682907 www.oceanfutures.org
273 10146417 www.mbnms-simon.org
260 7159780 www.mbayaq.org
163 61399 bcus.yahoo.com
152 1273085 www.mbari.org
146 710203 www.monterey.com
119 3474881 www.rsis.com
119 279531 www.steinbeck.org
118 1092484 bonita.mbnms.nos.noaa.gov
109 924184 www.duke.edu
104 336986 www.montereybayaquarium.org

61

103 595953 icons.wunderground.com
102 339589 www.uncw.edu
[list truncated…]

Curator Feedback to CDL (Orlando, Monterey Bay)

Crawl Success: (rating not provided)

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Crawl Scope Comments: This site contains a large mix of government policy, laws, and regulatory information, as well as links to recreational and educational activities available in the MBNMS. I was most interested in links to laws/regs/policy and educational info and organizations. I thought the links restricted to the original host best captured this information succinctly.

Crawl Frequency: unknown

62

Richard Pearce-Moses: Arizona Department of Water Resources

CDL Report to Curator

URL: http://www.azwater.gov

Curator's original comments: (redirects to http://www.azwater.gov/dwr) "In arid Arizona, water is one of the most important - and most contested - resources. The publications and records of this Department are of critical value to the state. Our spider can get many files from this site (1474 files in 258 directories). We are mostly interested in documents by programs. Although our spider can't get the imaged documents database (http://www.azwater.gov/dwr/Content/ImagedRecords/default.htm), this directory may not be critical if we can get the imaged documents transferred to us. We are not interested in blank forms and applications."

Site copyright statement: "Copyright © 1998 - 2005 Arizona Department of Water Resources and ADWR Network. All Rights Reserved."

Crawl Results

Questions for curator: Did this capture the documents you needed?

63

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 195 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
2233 988447782 www.azwater.gov
286 2350888 www.water.az.gov
253 4587125 www.groundwater.org
226 3093331 www.azcentral.com
196 15626 dns
178 395216 www.macromedia.com
128 1679057 www.prescott.edu
123 947183 www.azleg.state.az.us
115 792968 www.usda.gov
[list truncated…]

Curator Feedback to CDL (Pearce-Moses AZWater)

Crawl Success: (not provided)

Crawl Success Comments: We were surprised that your crawl found 4888 documents. Another crawl that we conducted at about the same time using wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand it, wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can. That may be the difference. The crawl is listed as not completing, but it appears to be very close, based on the total number of bytes downloaded.

Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Questions / Comments about crawl: It seemed odd that it took 2h 54m to crawl 7709 documents with linked hosts, but 4h 4m to crawl only 4888 docs when the spider was restricted to the original host.

64

Richard Pearce-Moses: Citizens Clean Elections Commission

CDL Report to Curator

URL: http://www.ccec.state.az.us/ccecscr/home.asp

Curator's original comments: "This commission was established by initiative. Its work is of great historical significance, as it is changing the way the public elects officials. We have not been able to spider this site because links are buried in JavaScript. (We use wget as our spider.) We are primarily interested in acquiring their publications, election data, and things listed under 'popular links.'"

Site copyright statement: "Copyright 2004 Arizona Citizens Clean Elections Commission. All Rights Reserved."

Crawl Results

Comments from crawl operator: JavaScript issue; interesting problem. Need curator feedback about what we captured.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, the following additional sites were crawled. Total: 15

[urls] [bytes] [host]
929  95456563  www.ccec.state.az.us
76  6117977  www.azcleanelections.gov
55  513218  az.gov
49  499337  www.governor.state.az.us
44  174903  www.adobe.com
40  141202  www.azleg.state.az.us
31  18549  www.az.gov
28  202755  www.azsos.gov
23  462603  gita.state.az.us
19  213976  www.benefitoptions.az.gov
17  89612  www.azredistricting.org
14  1385  dns
3  1687  wwwimages.adobe.com
2  1850  www.capitolrideshare.com
2  26438  www.ftc.gov

Curator Feedback to CDL (Pearce-Moses CCEC)

Crawl Success: (not provided)
Crawl Success Comments: We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with wget.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Juri Stratford: City of Davis
CDL Report to Curator

URL: http://www.city.davis.ca.us
Curator's original comments: "We are primarily interested in the GIS data produced for the City of Davis. As the GIS data represent the largest part of the City of Davis web site, it may not be much more difficult to archive the site at the top level. Mapping and Geographic Information Systems URL: http://www.city.davis.ca.us/gis/"
Site copyright statement: "This web site is Copyright © 2004 by the City of Davis. All Rights Reserved. The City retains the copyright on all text, graphic images, and other content of this site. You may not copy, modify, and/or re-use text, images, or other web content from this web site; distribute the City's web content; mirror content from this web site on a non-City server; or make any other use of the content of this web site that would violate the City's copyright without written permission from the City of Davis. To the extent allowed by law, commercial use of our web material is prohibited without written permission from the City of Davis. All art work shown on these web pages is protected by US Copyright laws. Limited reproduction for non-commercial purposes can be authorized by the City of Davis, provided that requests are approved prior to use. Contact the Community Development Department, Cultural Services program staff at (530) 757-5610 for more information. Some content included in this web site may be provided courtesy of third parties who may retain copyright control of the provided material. Any service marks and trademarks contained herein are the property of their respective owners."
Crawl Results

Comments from crawl operator: "GIS: Potential issue: /img is disallowed by robots.txt, e.g. http://www.city.davis.ca.us/img/featuredmap-static.jpg can't be retrieved; some maps on a second server are also disallowed. Need feedback about the GIS material that was captured: what was captured that is useful? Much duplication -- pages captured repeatedly."
Robots.txt: The site you selected prohibits crawlers from collecting certain documents. The file reads:

User-agent: *
Disallow: /img
Disallow: /calendar
Disallow: /misc/email.cfm
Disallow: /edbusiness
Disallow: /gis/oldmap
Disallow: /police/log
Disallow: /pcs/grants/sacog
Disallow: /jobs/listings
Disallow: /css
Disallow: /pcs/nutcracker/history.cfm
Disallow: /pcs/nutcracker/pdfs
User-agent: asterias
Disallow: /
User-agent: gigabot
Disallow: /
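As an illustration (not part of the original test infrastructure), a specific URL can be checked against rules like these with Python's standard urllib.robotparser module, which applies the same prefix matching most crawlers use. The sketch below uses only the first two Disallow rules; the example image URL is the one mentioned in the crawl operator's comment above.

# Minimal sketch: test whether a crawler may fetch a URL under these rules.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /img
Disallow: /calendar
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# The featured map image is blocked; the GIS library is not.
print(parser.can_fetch("heritrix", "http://www.city.davis.ca.us/img/featuredmap-static.jpg"))  # False
print(parser.can_fetch("heritrix", "http://www.city.davis.ca.us/gis/library/"))                # True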

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 420 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
16455  947871325  www.city.davis.ca.us
420  29555  dns
332  10377948  www.asucd.ucdavis.edu
305  33270715  selectree.calpoly.edu
279  3815103  www.w3.org
161  2027740  www.cr.nps.gov
139  941939  www.comcast.com
133  951815  www.yolocounty.org
[List truncated…]

Curator Feedback to CDL (Stratford Davis)

Crawl Success: mostly effective
Crawl Success Comments: Looking at the GIS Online Maps page, it's not clear which formats were retrieved and which were not. For example, the Growth Map Flash file downloads fine, but the Flash/ArcIMS files do not download.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In general, restricted to original host works better. The broader search includes pages from businesses (e.g., Comcast) and other agencies (e.g., other local and state government sites). But restricting the outside sites to the first level seems to be a good compromise.
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) In general, it looked like it did a good job pulling geographic data/images. For example, you can pull down data from http://www.city.davis.ca.us/gis/library. It's difficult for me to get a sense of the level of duplication from the way the search results display.
Crawl Frequency: monthly
Questions / Comments about crawl: I like the ability to navigate within the archive once you have a starting point. This is very nice.

Yvonne Wilson: Orange County Sanitation District
CDL Report to Curator

URL: http://www.ocsd.com
Curator's original comments: "This is an Orange County agency that has small reports and complex planning documents on its web site."
Site copyright statement: "Copyright © 2001-2005 Orange County Sanitation District. Unless a copyright is indicated, the information on this site is freely available for non-commercial, non-profit making use. If a copyright is indicated on any materials displayed on our website, permission to copy these materials must be obtained from the original source. Commercial use of District materials is expressly prohibited without the written permission of the OCSD."
Crawl Results

Comments from crawl operator: In both crawl settings we had to limit the maximum number of retry attempts in order to complete the crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 85 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
755  85943567  www.ocsd.com
164  7635257  www.ci.seal-beach.ca.us
122  809190  www.ci.irvine.ca.us
95  169207  epa.gov
86  7673  dns
85  559125  order.e-arc.com
66  840581  www.ci.huntington-beach.ca.us
62  213476  www.cityoforange.org
57  313579  www.epa.gov
55  4477820  www.villapark.org
50  1843748  www.cityoflapalma.org
50  463285  www.ocbinc.com
[List truncated…]

Curator Feedback to CDL (Wilson OCSD)

Crawl Success: somewhat effective
Crawl Success Comments: Using WERA, I searched inquiries by type and title in the two OCSD collections, plain and via. I received no hits for PDF, only the homepage for HTML, and three hits for text. There are many PDF sections in the EIRs. I next searched by title in the two collections; I was the most successful in via. By searching the titles "Carbon Canyon" and "Ellis Ave Pumping Station" I found articles, but not the EIRs, which are available full text. At this point I presumed the crawl did not drill down far enough. Then I searched for a secondary page entitled "Ocean Monitoring"; this time the search found only an internal letter and memo, but not all the documents related to this topic. The via collection search seems to be the most productive, but it is not consistent.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: I searched some of the outside links to US Marine Fisheries and EPA Beach Watch and received no hits.
Crawl Frequency: monthly

Crawl Report Key: Web-at-Risk Test Crawls

This document is a guide to the test crawl report, providing further information about some of the statistics and results conveyed there. Your report includes basic information about the site you submitted, your original comments about that site, and any copyright statements CDL found when we examined the site. Your site was crawled using the Heritrix crawler, version 1.5.1. Your report will include the following information about the site.

Crawl Settings: We crawled each site in two different ways: (A) linked hosts included, and (B) restricted to original host. Where linked hosts were included, we set the crawler to gather any outside page that your nominated site linked to, but no further. So if your site linked to a single document from whitehouse.gov, we captured that document but did not crawl any further on the whitehouse.gov site. Your report will show results from both styles of crawling, to give you a sense of whether or not the site draws heavily from valuable materials on another site. You will also receive a list of the other hosts that site linked to and how many documents were gathered from those hosts.

Robots.txt file: The presence of a robots.txt file means that the content provider is asking us to refrain from crawling either all or part of the site. This refers only to the host you named in your crawl request. In some cases the site had a robots.txt file but it didn't say anything; we noted when this occurred. We obeyed robots.txt instructions for these crawls, so if the site contained one but we still got a result, that means the robots.txt file only prevented us from crawling certain areas of the site. When robots.txt files were present, we have included the text of that file in your report so you can see which segments of the site the site owner wants to protect.

Crawl duration / Total number of documents: The "Documents" count will include page components (such as images or Flash files).

File types (MIME types): This area will contain a URL. When you go to that URL you will see a list of the different file types that were retrieved as part of the crawl. IMPORTANT: The Heritrix crawler is currently experiencing difficulty with this report, and it is missing a crucial column. Until that is fixed, you can see the different file types retrieved, from most common to least, but you cannot yet tell how many files each one included. CDL will contact you when this report has been fixed.

Response code reports: The URL in this column will lead to a list of response codes in order by frequency. This will include "200" for files that were successfully captured and error codes for files that were not captured. The error code list includes some codes specific to Heritrix; the key to interpreting these codes is at http://crawler.archive.org/articles/user_manual.html#statuscodes. Note that this report only gives you quantitative information about response codes; it does not link response codes to specific files. For these details see "hosts report and crawl log" below.

How much data collected (bytes): The file size of the total crawl is reported in bytes. You can use the byte conversion tool at http://www.techtutorials.net/reference/byteconverters.html if you want to recalculate the size of the crawl in another measurement, such as kilobytes or megabytes.

Did crawl complete: This will say "no" if the crawl results exceeded 1 gigabyte, or if the crawler encountered an obstacle to capturing the site that could not be fixed.

Location of hosts report and crawl log: You are welcome to review the page-by-page details of the crawl log. This is a generic report that comes with the Heritrix crawler; it is not terribly user friendly, but it provides the most detail about the crawl process. This report will list every file that the crawler attempted to get and provide some information about each file. The Heritrix manual can help you interpret this report: http://crawler.archive.org/articles/user_manual.html#logs (go to section 8.2.1, Crawl Log).

Comments from Crawl Operator: These are observations that the Web Archive Programmer made about the crawl process for your site.

Questions for Curator: This section does not appear in every report. If you have a question listed here, please respond to it in the Test Crawl Feedback form.

Your Collection: IMPORTANT: The screens you will see do not represent the final user interface for the Web Archiving Service tools. We are using WERA, an open-source search and display tool, only to show you your test crawl results. Because we did not seek the right to redistribute these documents, these pages are available only for the purpose of analyzing crawler effectiveness. You must have a password to view these pages; your report will include the address of a wiki page and a login and password. Each site was crawled twice: plain crawl = only pages from the original site were collected; via = pages from the original site, as well as pages that site links to, were collected. Unfortunately you cannot simply browse your site; you must select a collection and type a search. You will be able to navigate throughout your site once you load a page containing links. You will be able to review your colleagues' sites as well. Note that the WERA display tool is not perfect; if the same document was gathered from more than one crawl, it may not display in every collection.

Related Hosts Crawled: This section provides further information about the additional materials that were gathered when we set the crawler to include documents that your site links to. This can be critical in deciding what settings are needed to capture your site. Some sites, for instance, will keep all of their PDF or image files on a separate server. If you don't allow the crawler to move away from the original URI, you won't capture a critical portion of the site's content. In other cases, however, this setting will lead to irrelevant information. This report includes the most commonly linked hosts from your site.
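For curators comfortable with a little scripting, the response code and byte figures can also be recomputed directly from the crawl log. The following is an illustrative sketch only, assuming the standard space-delimited Heritrix 1.x crawl.log layout in which the second field is the fetch status code and the third is the document size in bytes; the file name crawl.log is a placeholder.

# Tally status codes and restate the crawl size from a Heritrix 1.x crawl.log.
from collections import Counter

def summarize_crawl_log(path="crawl.log"):
    codes = Counter()
    total_bytes = 0
    with open(path, errors="replace") as log:
        for line in log:
            fields = line.split()
            if len(fields) < 4:
                continue  # skip blank or malformed lines
            status, size = fields[1], fields[2]
            codes[status] += 1
            if size.isdigit():
                total_bytes += int(size)
    return codes, total_bytes

if __name__ == "__main__":
    codes, total = summarize_crawl_log()
    for status, count in codes.most_common():
        print(status, count)
    # e.g. the 988447782 bytes reported for the www.azwater.gov crawl is roughly 0.92 GB
    print("total: %.2f GB" % (total / 1024 ** 3))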



Los Angeles Planning Department

The crawl in some cases captured more than I expected, AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all PDFs. When searching specifically for EIRs I got a large result list (699 citations); however [sic], when I investigated whether or not the actual file was captured, I found that the main EIR page was captured (typically a .htm/.html file), but when I clicked on a link to get to the full report, all I got was the "Sorry, no documents with the given URI were found" message. This could be because the file was no longer available when the site was harvested. However, I tested a few of these and found that I could still access them on the City Planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved.

City of San Diego Planning Department
This comment is from a curator who is filling in for the person who originated the test crawl request, so she is also attempting to interpret what that other person was hoping to capture.

Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content), but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages, but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit.

Defense Base Closure and Realignment Commission

I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.

Public Policy Institute of California


There are some problems with the functionality of captured pages: 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on "Date" retrieves the message "Sorry, no documents with the given URI were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given URI were found."

AFL-CIO

I realize the collection interface is a "work in progress" and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g., "Working Families Toolkit," "BushWatch") that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2-million-plus-hits-per-year web sites, to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer, which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait, but I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses.

Conclusions
The process of running these test crawls has been valuable. It has resulted in a number of lessons learned and further directions for inquiry. It is clear that the tools that help curators analyze the results of these crawls will have to do more than mimic the original browsing context or supply tables of data. We will be looking for any opportunity to improve WAS reporting capabilities as we move forward with the project. The differences encountered in language and visualization raise the importance of a strong, intuitive design for the curator tools and of clear help screens. Each of us may visualize web sites differently, and the crawler may behave differently than we expect. A certain degree of online help will be needed to design crawls effectively, and further guidance should be available to help people interpret crawl results when those results don't match what the person anticipated.

The help and documentation for the Web Archiving Service will also need to address the rights analysis issues raised above. In most cases this is work that is done prior to issuing crawls; it cannot be addressed by the design of the WAS interface alone. When the Web-at-Risk project reaches the point of conducting usability studies, we should be sure to include tests that further uncover how users understand crawl frequency settings. Additionally, the desire for an analysis tool that can convey when a site has changed significantly is not limited to this project; it is a common issue faced by the Internet Archive, members of the International Internet Preservation Consortium, and others. CDL is actively communicating with these organizations as we all work toward a solution for this problem. Similarly, CDL should ensure that future assessment and analysis work with our curators addresses the issues raised by the two crawl scope settings. This inquiry should also extend to participants who exemplify end users, to determine whether people using web archives experience crawl scope differently than people who build them. Finally, a number of lessons were learned via the Katrina crawl described in Appendix B. The most outstanding finding is that event-based crawls, such as Katrina, and site-specific crawls, such as these, have quite different characteristics and require different functionality and analysis tools.

Next Steps
There is certainly more to learn from these crawl results and about Heritrix; in some cases it is still unclear why the crawler failed to retrieve certain documents. The curators' feedback concerning these results has been extremely valuable. They have provided insight as to what was captured and what is still missing, which would have been difficult to determine without their subject expertise in the sites chosen. As we continue working to improve crawler success and performance, we will turn our attention to the Virtual Remote Control site created by Cornell. This site was developed to present particular problems to crawlers and is well documented, enabling the user to gauge a crawler's results. We will be using that site to replicate particular problems raised in our test crawl set, and may also reattempt some of the sites crawled for these tests. When we release the first version of the Web Archiving Service to curators in July 2006, we will request that they include their original test site in the crawls they attempt and compare the results with these tests. The test results for these crawls will remain in place as a point of comparison while we continue developing the Web Archiving Service. Additionally, certain issues raised by this feedback, particularly the desired scope and frequency of crawls, merit attention in future usability analysis work.

Web-at-Risk Test Crawl Report: Appendix A Sites Submitted

Curator | Site | Crawled
Sherry DeDekker | http://ca.water.usgs.gov (California Water Science Center) | Yes
Sherry DeDekker | http://wwwdwr.water.ca.gov (California Department of Water Resources) |
Peter Filardo and Michael Nash | http://www.nycclc.org (New York City Central Labor Council) | Yes
Peter Filardo and Michael Nash | http://www.dsausa.org (Democratic Socialists of America) |
Valerie Glenn and Arelene Weibel | http://www.strengtheningsocialsecurity.gov (Strengthening Social Security) | Yes
Valerie Glenn and Arelene Weibel | http://www.brac.gov (The Defense Base Closure and Realignment Commission) | Yes
Gabriela Gray | http://www.joinarnold.com (Join Arnold) | Yes
Gabriela Gray | http://www.antonio2005.com (Mayor-elect Antonio Villaraigosa) | Yes
Ron Heckart and Nick Robinson | http://www.ppic.org (Public Policy Institute of California) | Yes
Ron Heckart and Nick Robinson | http://www.cbp.org (California Budget Project) |
Terrence Huwe | http://www.aflcio.org (AFL-CIO) | Yes
Terrence Huwe | http://www.seiu.org (Service Employees International Union) |
James Jacobs | http://www.sandiego.gov/planning (City of San Diego Planning Department; analyzed by Megan Dreger) | Yes
James Jacobs | http://www.sandag.org (San Diego Association of Governments) |
Kris Kasianovitz | http://cityplanning.lacity.org (Los Angeles Department of City Planning) | Yes
Kris Kasianovitz | http://www.scag.ca.gov (Southern California Association of Governments) | Yes
Linda Kennedy | http://calwater.ca.gov (California Bay-Delta Authority, CALFED) | Yes
Linda Kennedy | http://www.dfg.ca.gov (California Department of Fish and Game) |
Ann Latta | http://www.ucmerced.edu (UC Merced; analyzed by Elizabeth Cowell) | Yes
Ann Latta | http://www.coastal.ca.gov/web (California Coastal Commission) |
Janet Martorana | http://www.countyofsb.org/plandev/default.htm (Santa Barbara County Department of Planning and Development) | Yes
Janet Martorana | http://www.sbcag.org (Santa Barbara County Association of Governments) |
Lucia Orlando | http://montereybay.noaa.gov (Monterey Bay National Marine Sanctuary) | Yes
Lucia Orlando | http://www.waterboards.ca.gov/centralcoast (Central Coast Regional Water Quality Control Board) |
Richard Pearce-Moses | http://www.azwater.gov (Arizona Department of Water Resources) | Yes
Richard Pearce-Moses | http://www.ccec.state.az.us/ccecscr/home.asp (Citizen's Clean Election Commission) | Yes
Juri Stratford | http://www.city.davis.ca.us (City of Davis, California) | Yes
Juri Stratford | http://www.sacog.org (Sacramento Area Council of Governments) |
Yvonne Wilson | http://www.ocsd.com (The Orange County Sanitation District) | Yes

Web-at-Risk Test Crawl Report: Appendix B The Katrina Crawl

The Crawl
During the early fall of 2005, the California Digital Library was embarking on a series of test crawls as part of the Web-at-Risk project when Hurricane Katrina struck. CDL had requested that curators submit their sample URLs by August 25th, and on Friday, August 26th, Louisiana Governor Blanco declared a state of emergency. It was over that weekend that CDL staff reached the conclusion that an event of this historic significance would shift our priorities. On Monday we suspended our initial test crawl plans and began preparing on a number of fronts to capture web-based materials related to Katrina. Specifically, this meant:

• identifying which web sites or sub-sites to collect, which involved:
  o notifying 30+ curators of our emergency crawl plans and requesting their assistance
  o setting up a web-based interface to gather their suggested seed URLs
  o sorting through the resulting seed list and feeding it to our primary and secondary (Stanford) crawl operators
• determining our collection parameters
• getting our crawlers installed and configured
• locating and setting up disk space to store the crawls
• initiating and monitoring the crawls

This crawl activity posed several new challenges. It was the first time CDL staff had used the Heritrix crawler, which was not yet installed on a production server when the hurricane struck. The last week of August saw CDL and UC Berkeley staff scrambling to find server space to set up Heritrix instances, disk partitions, and job parameters for the crawl. This was done with remarkable speed, owing to the urgency of the situation and to everyone's desire to capture a record of the events. Given our limited experience, and in order to reduce the risk of losing the historically significant and fleeting materials related to Hurricane Katrina, CDL worked with Stanford University to concurrently run the same crawl using a different crawler. After both CDL and Stanford had crawled the same seed list for a month, the task was then taken up by the San Diego Supercomputer Center, who have continued crawling these sites using Heritrix.

Gathering the Seeds
CDL sent out an initial request to the Web-at-Risk curators to submit URLs related to Hurricane Katrina for crawling. We worked collaboratively with a large group of content specialists to identify the sites:

• The 22 curators of the Web-at-Risk project (University of California Libraries, University of North Texas, New York University, Stanford University, and the Arizona State Library)
• The Library of Congress
• Librarians at Louisiana State University and the University of Mississippi

Over the course of the crawl, the list of seed URLs grew to over 700 (just over 500 of which were crawled by CDL). The image of our input form above provides some sense of the range of materials collected. Given the sudden nature of this event, there was no time to investigate rights issues or technical problems each site might have presented. We informed the curators that "Our immediate plan is to simply collect the material before it disappears. We will not make the material immediately available." We also had little time for quality control and were not able to guarantee that sites in our seed list would be comprehensively crawled.

Crawling Specifics
CDL's first crawl was run on September 1, 2005, using a seed list of 89 URLs. The final crawl run by CDL was on October 10, using a seed list of 589 URLs. This final seed list was then sent to the San Diego Supercomputer Center, who are continuing to run twice-weekly crawls.

The CDL crawls were done using Heritrix version 1.5.1. We began with a single instance of Heritrix, eventually dividing the seed list among 6 instances. We got through the entire seed list 29 times in 40 days. However, the content we were able to collect was limited by our crawler settings. Our crawler was configured to visit one host at a time, collecting content at a conservative rate, and never more than three hops away from the seed URL. To ensure that the crawler moved through the seed list in a timely way, it was limited to spending 15 minutes at any given host. So the material collected does not represent the entirety of what was available at each site. Note that all of the seeds were crawled with the same configuration, no matter how different the structure of these sites might be. Thus the New Orleans version of Craigslist, NASA's information pages, and blog sites were all crawled in the same manner, despite being quite different in context, architecture, and other characteristics. Given our short preparation time, the goal was to find a crawler configuration that would be a match for the broadest range of Katrina materials. In most cases, whether the seed URL was for a personal blog or for a government agency, the seed itself was not centrally about Katrina. The Katrina information was generally situated on the front page and top levels of each site, so CDL expected that this setting, even with the 15-minute time limit, would at least capture the content nearest to the front pages of the widest variety of sites we could capture. In choosing crawler configuration settings we also faced conflicting goals. On one hand, we wanted to capture as much Katrina content as possible. On the other hand, we were very hesitant to start hitting sites that were providing much-needed information at a crucial time (emergency sites, relief sites), thus making those sites hard to reach due to our capture activity. Many of these sites were already being heavily used and perhaps were not running at full capacity; some were also geographically impacted directly by the hurricane. So we had to choose settings that balanced the need to collect with politeness across a wide range of sites. When the San Diego Supercomputer Center took over the Katrina crawls in October, they revisited the Heritrix configuration settings. The SDSC crawl placed a limit not on the amount of time spent at a site but on the number of documents to be retrieved overall. Their crawls are being conducted "breadth first," in the sense of gathering pages consecutively across the seed list rather than gathering pages from one site at a time. The Stanford University crawl of the same sites began on September 7th and continued for 30 consecutive days, using the WebVac crawler. While we have statistics concerning the size of the Stanford Katrina collection, it is very difficult to compare the configuration settings and effectiveness of WebVac vs. Heritrix. Further, there is no easy way to display materials gathered with WebVac, so the Stanford Katrina content is stored but is not accessible to us at the moment.
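The two constraints described above (a three-hop limit from the seed and a 15-minute budget per host) can be restated compactly. The sketch below is not Heritrix configuration syntax; it is an illustrative Python restatement of the scoping logic, with all function and variable names invented for the example.

# Illustrative sketch of the Katrina test-crawl scope rules, not Heritrix itself.
import time
from urllib.parse import urlparse

MAX_HOPS = 3                # never more than three hops from the seed URL
HOST_TIME_BUDGET = 15 * 60  # at most 15 minutes of fetching per host, in seconds

host_started = {}           # host -> time of first fetch from that host

def in_scope(hops_from_seed):
    """Keep a discovered URL only if it is within the hop limit."""
    return hops_from_seed <= MAX_HOPS

def may_fetch(url, now=None):
    """Allow a fetch only while the host's time budget has not been used up."""
    now = now if now is not None else time.time()
    host = urlparse(url).netloc
    started = host_started.setdefault(host, now)
    return (now - started) <= HOST_TIME_BUDGET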

Katrina Crawl Results
In terms of creating a collection of the Katrina event on the web, we are not satisfied that CDL's crawling efforts were successful. As mentioned, we had to impose time limits on each site visited and were not able to conduct quality control on the URLs that were captured. The total size of the Katrina capture at CDL is 50 GB. This represents 29 completed crawls of the seed list, which grew to over 500 seeds over the 40 days we ran the crawl. It's clear from the byte size alone that we barely skimmed the surface, and indeed the Stanford crawl proves this point: when the Stanford crawl was based on about 370 seeds, it was pulling in about 35 GB a day, or 1.5 million pages a day. In spite of this, our attempts to capture Katrina news and events did have some very useful and positive outcomes. In terms of test crawling, the experience was quite successful, opening our eyes both to certain obstacles and to new possibilities. Most importantly, it prompted us to consider the nature and demands of event-based crawling, which had not been part of our original test plan. Here are some of the aspects of event-based crawls that Katrina surfaced.

Site Selection and Classification
This type of event demands a deep collaborative effort to identify material to collect. CDL had to mobilize a large number of curators to make their best guesses at identifying promising sites. We set up a "crawl seed nomination" web form, accessible to the curators from their home institutions, and invited the curators to use it to enter their suggested seed URLs. CDL staff did not have time to add many automatic error- or duplicate-checking features, so this cumbersome work was done by hand. On the other hand, it provided us a first-hand trial of what a more general curator user interface might require. The selection and management of seed lists is critical for sudden event-based crawls. The curators contributing the URLs will not necessarily be well versed in the topic; in the case of Katrina, curators in California were not uniformly familiar with the Gulf Coast, the towns, the government agencies, etc. In addition, it is difficult to predict which aspects of the event will be of historic, enduring value. Because disk storage was not a pressing issue, it was better to err on the side of a wider net when selecting sites. It became clear that there might be a role for "smart crawling," which would spread a wider net than that provided by human-generated seed lists through the use of automated tools that discover relevant materials. We were specifically interested in using tools developed at Emory and Cornell that interface with the Rainbow text classification tool.7 We did not have the resources to investigate this for the Katrina crawl, but plan to fold it into future crawling tests. Finally, it is worth examining why the seed list grew continuously throughout the event. Part of this growth is due to the natural increase in the number of pages and sites devoted to the emergency, and to having the additional time to identify and add them. However, the nature of the event itself changed over time, starting as a hurricane, then becoming a flood, a massive relocation, and a political and social issue. So the range of relevant sites changed as the event itself took on broader implications. This suggests that site selection is an ongoing process, not strictly an activity undertaken at the beginning of an event.

7 Bergmark, Donna. "Heritrix processor for use with rainbow." <http://groups.yahoo.com/group/archive-crawler/message/1905>
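As a rough illustration of the "smart crawling" idea discussed above, the sketch below scores candidate pages against a hand-built keyword list. The approach actually cited in the footnote uses the Rainbow statistical text classifier rather than fixed keywords, so this should be read only as a stand-in for the concept; the term list, threshold, and function names are invented.

# Much-simplified stand-in for automated relevance filtering of candidate seeds.
KATRINA_TERMS = {"katrina", "hurricane", "levee", "evacuation", "fema",
                 "new orleans", "gulf coast", "flooding"}

def relevance_score(page_text):
    """Count how many indicator terms appear in the page text."""
    text = page_text.lower()
    return sum(1 for term in KATRINA_TERMS if term in text)

def keep_candidate(page_text, threshold=2):
    """Nominate a discovered page as a new seed only if it looks on-topic."""
    return relevance_score(page_text) >= threshold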

Rights, Ownership, and Responsibilities
Our current rights protocol stipulates that when a web page is clearly in the for-profit domain, we must seek permission from the content owners to collect. This restriction applies most notably to news agency sites, which would likely be a critical part of the web-based fallout of any major event; this was certainly the case with Katrina. However, rights negotiation is a time-consuming process. In the case of a planned event such as an election, one is able to anticipate a certain number of sources and take rights management steps before the event occurs. But with emergencies such as Katrina, there is simply no way to secure rights in advance without missing time-sensitive material. Knowing this would be an issue, CDL notified our curators in advance that the Katrina materials we gathered would not be publicly accessible, but merely captured and preserved. The complex rights issues behind web archiving, as well as the collaborative environment described above, raise a number of questions:

• Who owns the content?
• Who is responsible for the project?
• Who is responsible for responding if a content owner objects?
• When is it permissible to ignore robots exclusion files?
• How much rights management can be done as part of post-processing for an event crawl?
• Can there be standing agreements with major news agencies to allow for this type of crawl in emergency situations? If so, what constitutes an emergency?

Technical Infrastructure
CDL did not have an adequate technical infrastructure in place at the time of the Katrina crawl. As mentioned above, staff at both CDL and UC Berkeley scrambled to find server space and to install and configure the crawlers. This sudden shift in workload left the system vulnerable and somewhat unstable. In order to start the crawling process as soon as possible, CDL began the project on a temporary server. Making the transition to more permanent storage in the midst of this crawling project was neither easy nor flawless, and a certain amount of data had to be recovered from backup sources.

Information Analysis and Display
The challenge of analyzing crawl results has been described in some detail in our Test Crawl Report. That report cites a curator who found the task of reviewing an 8899-document crawl to be unmanageable. The tools currently available are not at all up to the task of analyzing a large and complex crawl. An event-based crawl is likely to result in massive amounts of data of widely varying quality. The selection of seeds is based on guesses that, given sufficient resources, should be reviewed, refined, and enhanced as the event progresses. Time-series data based on changing input parameters represents a kind of moving target that suggests the need to develop new analysis tools. Andreas Paepcke of Stanford University addressed this problem when considering how a social scientist might want to analyze the Katrina materials. Consider the researcher who is interested in finding out how quickly the notion of race entered the public discourse in the aftermath of Katrina. Paepcke suggests:

"For example, the tool could perform word frequency analysis across successive crawls of Katrina sites, so that the user could identify the emergence of word usage such as 'looter,' 'ninth ward,' etc. The social scientist would then define within the tool clusters of words that, in the scientist's judgment, are direct or indirect occurrence indicators of the concept 'race' (i.e., ninth ward, looters, lacks, poor, ...). The scientist could subsequently interact with the tool at the level of these well-defined concepts. Example command: count occurrences of 'race' within the first three paragraphs of all pages and tabulate the differences across 10 days. We could go further and apply well-known topic categorization algorithms on the data to suggest new concepts as they arise in the series of text streams. For example, the tool might identify an increasing frequency of the term 'mismanagement' beginning in week two after the storm and alert the sociologist to this evidence of a new discourse theme."8

While we don't expect to be able to incorporate this level of analysis into our web archiving tools any time soon, these ideas illustrate a useful direction for archival analysis tools.
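A very small sketch of the kind of concept tracking Paepcke describes is given below: it counts a cluster of indicator terms in each dated crawl snapshot so that the counts can be tabulated across days. The directory layout, file naming, and term list are all hypothetical and chosen only for illustration.

# Count indicator-term occurrences in a directory of captured pages.
import os, re
from collections import Counter

EXAMPLE_CLUSTER = ["ninth ward", "looter", "poor"]  # example indicator terms

def concept_counts(snapshot_dir, terms=EXAMPLE_CLUSTER):
    counts = Counter()
    for root, _dirs, files in os.walk(snapshot_dir):
        for name in files:
            if not name.endswith((".html", ".htm", ".txt")):
                continue
            with open(os.path.join(root, name), errors="replace") as f:
                text = f.read().lower()
            for term in terms:
                counts[term] += len(re.findall(re.escape(term), text))
    return counts

# e.g., compare snapshots captured on successive crawl dates:
# for day in ("2005-09-01", "2005-09-05", "2005-09-10"):
#     print(day, concept_counts(os.path.join("katrina_crawls", day)))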

Conclusions
At the time Hurricane Katrina hit, the Web-at-Risk service requirements and test crawl plan had been written for a more orderly and considered approach to web crawling. Our attempt to capture the web-based aftermath of Katrina highlighted our need to revisit the service requirements and see how well they support an event-based response. We need to develop a collaborative and robust mechanism to identify and collect seed URLs. When an event happens quickly, it is also important to have in place general crawling guidelines that will result in a well-rounded collection. The technical infrastructure for the Web Archiving Service that we are building for the Web-at-Risk project should be robust enough to handle occasional and sudden bursts of activity. Attempting to alter the infrastructure quickly in reaction to emergency events impairs our responsiveness and leaves the altered systems in a vulnerable state.

8 Paepcke, Andreas, Senior Research Scientist and Director of the Digital Library, Stanford University. Email correspondence with Patricia Cruse, October 26, 2005.

Finally, work needs to continue both on the rights management front and on developing improved web archiving analysis tools, so that the material gathered can be used to its greatest potential.

Web-at-Risk Test Crawl Report: Appendix C Individual Crawl Reports
Included below are the crawl reports provided to individual curators, including their analysis and feedback about those results. Note that a key to interpreting the tables in these crawl reports is provided at the end of this appendix.

Elizabeth Cowell (submitted by Ann Latta): UC Merced
CDL Report to Curator

URL: http://www.ucmerced.edu
Curator's original comments: "UC Merced is the first research university to be built in the 21st century. The educational and land use issues are significant. Of particular interest is http://www.ucmercedplanning.net. This site addresses major issues of land use - the university is being built on agricultural land. Controversy existed re: issues of redevelopment of downtown Merced vs. appropriation of agricultural land; there are major environmental issues focused on endangered species. Educational issues involving faculty job descriptions, student body, etc. are significant because of the economic, ethnic, and cultural diversity of the region."
Site copyright statement: "© 2004 UC Regents"
Crawl Results

Comments from crawl operator: When we set the crawl to include pages from linked sites, the crawler got "trapped" at the Elsevier site. There is JavaScript on that linked page that causes the crawler to continue looking for additional pages on the Elsevier site, even when you're only trying to capture a single page. Once we set a limit for the maximum number of retry attempts, the crawl completed. This data is from the completed crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 227 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
969  71552369  www.ucmerced.edu
238  2564803  www.ucop.edu
226  14851  dns
197  16583197  www.universityofcalifornia.edu
156  8487817  www.elsevier.com
151  1437436  www.greatvalley.org
112  2354582  faculty.ucmerced.edu
105  5659795  www.pacific.edu
90  111985  k12.ucop.edu
86  255733  www-cms.llnl.gov
85  1178031  admissions.ucmerced.edu
81  297947  uc-industry.berkeley.edu
71  108265  www.mssmfoundation.org
67  349300  www.nps.gov
66  308926  www.usafreedomcorps.gov
54  137085  slugstore.ucsc.edu
52  52202  www.cerrocoso.edu
51  977315  www.universityofcalifornia.com

Curator Feedback to CDL (Cowell Merced)

Crawl Success: mostly effective

Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: 10 more links came up in the via search; 8 of these were not useful for the research of land use issues. The two that were useful were a benefit of the via search.
Crawl Frequency: monthly

Sherry DeDekker: California Water Science Center
CDL Report to Curator

URL: http://ca.water.usgs.gov
Curator's original comments: "We are interested in the CA water reports and databases behind the links on this page. Also, in looking at http://waterdata.usgs.gov/ca/nwis/nwis, this section is an interactive interface to multiple data sets. Is it possible to capture this type of site as well as the static reports?"
Site copyright statement: "Information presented on this website is considered public information (unless otherwise noted) and may be distributed or copied. Use of appropriate byline/photo/image credit is requested. We strongly recommend that USGS data be acquired directly from a USGS server and not through other sources that may change the data in some way."
Crawl Results

Comments from crawl operator: Databases, e.g. http://waterdata.usgs.gov/ca/nwis/nwis, are disallowed by http://waterdata.usgs.gov/robots.txt. Some water data reports are NOT caught by the broader "linked hosts included" settings, e.g. http://pubs.usgs.gov/wdr/2004/wdr-ca-04-1 -- would also want to submit http://ca.water.usgs.gov/waterdata as a seed.
Related hosts crawled:

When the crawl was set to include documents from other sites that the original site linked to, 662 additional sites were crawled. The following list includes the hosts that supplied more than 50 files. Note that the host pubs.usgs.gov supplied a higher number of files than the original host itself.
[urls] [bytes] [host]
1963  255912820  pubs.usgs.gov
1153  47066381  ca.water.usgs.gov
698  56570  dns
404  112354772  geopubs.wr.usgs.gov
385  9377715  water.usgs.gov
327  203939163  greenwood.cr.usgs.gov
318  17431487  www.elsevier.com
219  3254794  www.usgs.gov
189  2737159  www.lsu.edu
163  2292905  wrgis.wr.usgs.gov
158  31124201  www.epa.gov
149  921063  www.usda.gov
[List truncated…]

Curator Feedback to CDL (DeDekker CWSC)

Crawl Success: somewhat effective
Crawl Success Comments: The site appears to access water data reports (http://ca.water.usgs.gov/archive/waterdata/index.html), but none are actually available through the links. I expected the site to not be able to access real-time data, but these are archived reports.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly

Megan Dreger (submitted by James Jacobs): City of San Diego Planning Department

CDL Report to Curator

URL: http://www.sandiego.gov/planning
Curator's original comments: "I would like to drill down several levels (at least 3) of this site. For example, following the link to City of Villages/general plan update leads to many more important planning documents."
Site copyright statement: This site contains the two following notices on the same page:
"Restrictions on Use of Materials: This site is operated and maintained by the City of San Diego through its Department of Information Technology and Communications (referred to as IT&C). Except as provided herein, no material or information from this site may be copied, reproduced, republished, uploaded, posted, transmitted, or distributed except as authorized in this notice, expressly authorized within this site, or approved in writing by IT&C."
"Copyright Notice: Unless a copyright is indicated, information on the City of San Diego Web site is in the public domain and may be reproduced, published, or otherwise used with the City of San Diego's permission. We request only that the City of San Diego be cited as the source of the information and that any photo credits, graphics, or bylines be similarly credited to the photographer, author, or City of San Diego, as appropriate. If a copyright is indicated on a photo, graphic, or any other material, permission to copy these materials must be obtained from the original source."
Crawl Results

Comments from crawl operator: Need feedback about whether the desired content was retrieved.
Question for curator: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 1084 additional hosts were crawled. (This figure does not represent a complete crawl, as the size limitation was reached.) The following hosts supplied more than 75 documents to your site:
[urls] [bytes] [host]
3728  556231640  www.sandiego.gov
1247  38685244  genesis.sannet.gov
1085  80905  dns
807  6676252  www.houstontexans.com
428  1079658  www.cacities.org
399  102298888  www.buccaneers.com
259  1797232  granicus.sandiego.gov
258  42666066  clerkdoc.sannet.gov
238  5413894  www.ccdc.com
225  2503591  www.ci.el-cajon.ca.us
223  1387347  www.ipl.org
217  2683826  www.sdcounty.ca.gov
203  11673212  restaurants.sandiego.com
195  2620365  www.sdcommute.com
192  1344523  www.bengals.com
189  2221192  www.kidsdomain.com
176  1333528  www.buffalobills.com
171  685965  www.chumpsoft.com
166  277238  www.proquest.com
[List truncated…]

Curator Feedback to CDL (Dreger San Diego)

Crawl Success: mostly effective
Crawl Success Comments: This crawl was not completed due to size, so that may explain some of my questions. It was pretty effective in terms of getting the Planning Dept pages, but went out further than I expected. Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content), but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages, but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit. It's nice that PDFs and other formats are included.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The linked hosts included (via) crawl seemed to include more extraneous material.
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
A: You asked for a better definition of what was meant by the request to "drill down several levels (at least 3)." Unfortunately, I wasn't the original curator and I'm not sure what he meant. The crawl you did was very useful.
Crawl Frequency: monthly
Questions / Comments about crawl: I think that the crawl frequency should be at least monthly for these pages.

Peter Filardo and Michael Nash: New York City Central Labor Council
CDL Report to Curator

URL: http://www.nycclc.org
Curator's original comments: (none)
Site copyright statement: "© 2004 New York City Central Labor Council. No portion of this website may be reproduced in any form without permission from the Central Labor Council. Contact our offices for more information at nycaflcio@aol.com."
Crawl Results
NOTE: Because your Crawl "A" had to be stopped, then resumed, each of your reports for that crawl is in two segments. To browse a list of all reports for that crawl, go to http://voro.cdlib.org:8081/ingest_misc/ndiipp/testcrawls_raw/filardo_labor_via

Comments from crawl operator: (A) Linked hosts included: Crawl complete after recovery, with the addition of max retries. Seemed to hang at http://www.nycclc.org/calendar/event.asp?EventId=501 and http://www.nycclc.org/assets/HLC/application/membership.pdf. Ended crawl (seemed to hang); recovered from the previous job, and the recovery was successful. Note for the future that a recovered job is identifiable because the logs directory is called logs-R.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, x additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
1913  74260017  www.nycclc.org
156  11755  dns
115  710552  www.aflcio.org
73  1477966  www.comptroller.nyc.gov
71  193264  www.empirepage.com
60  570115  www.redcross.org
58  269079  www.afl-cio.org
57  240845  www.campsussex.org
57  113676  www.mssm.edu
56  449473  www.labor-studies.org
53  184605  www.pbbc.org
52  134326  www.senate.gov
[List truncated…]

Curator Feedback to CDL (Filardo NYCCLC)

None provided


Valerie Glenn and Arelene Weibel: Strengthening Social Security
CDL Report to Curator

URL: http://www.strengtheningsocialsecurity.gov
Curator's original comments: "Contains external links to audio & video that would be essential to completing this site (see press room); some are files, some are links to webcasts, some are on external .gov sites, and some are external .com sites."
Site copyright statement: Copyright information not found.
Crawl Results

Comments from crawl operator: Interesting audio/video/webcast content -- need feedback about success in capturing it.
Questions for Curator: How successfully did this crawl capture the multimedia documents you were interested in?
Comments from coordinator: In the media center area of this site (http://www.strengtheningsocialsecurity.gov/press/media_center.shtml) I've found .ram (both video and audio-only), .smil, and .asx files. The site also contains numerous .ppt and .pdf files.
• A text search on the log file turns up numerous .ram files, and only 1 .ppt file.

• .asx files are Windows streaming media redirector files, which generally lead to associated .wmv files. No .asx references appear in the crawl log, nor do any .wmv files.
• Similarly, .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files. I assume that when displayed, some of the Real Media files from this site would function, but many of the other multimedia files would not.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 388 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
660  10668874  www.chelseapiers.com
562  7334035  www.whitehouse.gov
477  6366197  www.laopinion.com
391  29623  dns
356  3874719  www.wkrc.com
243  12294240  www.strengtheningsocialsecurity.gov
178  1935969  www.xavier.edu
148  237055  image.com.com
127  682069  online.wsj.com
117  898439  www.omaha.com
116  514995  www.npr.org
108  995733  www.nba.com
[List truncated…]
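One way to follow up on the coordinator's observation is to list the stream URLs referenced inside the captured pointer files, i.e. the .wmv or .rm targets the crawler would also need to fetch. The sketch below is illustrative only: the file name is a placeholder, and real-world .asx files are not always well-formed XML, so a production version would need a more tolerant parser.

# List href/src targets referenced by a captured .asx or .smil pointer file.
import xml.etree.ElementTree as ET

def referenced_streams(pointer_file):
    urls = []
    tree = ET.parse(pointer_file)
    for elem in tree.iter():
        for attr in ("href", "src"):
            value = elem.attrib.get(attr) or elem.attrib.get(attr.upper())
            if value:
                urls.append(value)
    return urls

# e.g. referenced_streams("media_center_clip.smil")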

Curator Feedback to CDL (Glenn Strength Soc Sec)

Crawl Success: mostly effective
Crawl Success Comments: My main concern about this site was the multimedia documents included; I've posted those thoughts in the "Questions for Curator" text box.
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: For this site it was essential to capture the linked hosts (via), because many of the press materials, etc. were on external sites.
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: How successfully did this crawl capture the multimedia documents you were interested in?
A: I'm disappointed that not all of the multimedia files were captured, but there seem to be only a few that aren't included. I found it interesting that broadcasts from the same host (whitehouse.gov) weren't completely captured - some were, some weren't.
Crawl Frequency: once
Questions / Comments about crawl: [note: the curator alludes to adding a comment here, but no comment was received]


Valerie Glenn and Arelene Weibel: Defense Base Closure and Realignment Commission - CDL Report to Curator

URL: http://www.brac.gov
Curator's original comments: "In our previous efforts we have been unable to capture agency databases. The BRAC site includes a document library which has a search feature (http://www.brac.gov/Search.aspx) and a browse feature (http://www.brac.gov/Browse.aspx). We would really like to see how this information can be captured so that we can recreate it on our own servers."
Site copyright statement: "The contents of all material available on this Internet site are in the public domain and are not copyrighted. The content of this site may be freely reproduced, downloaded, disseminated, published, or transferred in any form and by any means. However, in some cases the copyright for certain text or images on this site may be held by other parties."
Crawl Results

Comments from crawl operator:
A. Linked hosts included: http://www.brac.gov/Search.aspx can't be captured by Heritrix. http://www.brac.gov/Browse.aspx seems to only capture the first 25 documents. Tried again with the browse page as the starting point, but stopped after 1005 documents; extracted 20 links from the browse page and then there were no more URLs in the frontier queue which had been extracted from Browse.aspx -- perhaps need more experimentation.
B. Restricted to original host: again only the first 25 pages from browse -- can't even successfully pass a seed URL listing the max docs per browse page (50).

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 44 additional hosts were crawled. Note that because your crawl did not complete, this is not an accurate count of how many other sites your original site may link to.

[urls] [bytes] [host]
2034 1064389540 www.brac.gov
555 5874934 www.slu.edu
87 173510 www.cpcc.edu
54 154588 www.wmata.com
47 685158 www.sluhospital.com
44 3501 dns
44 582555 www.c-span.org
43 174467 www.adobe.com
38 178153 www.q-and-a.org
32 127325 slubkstore.com
24 140653 www.c-spanclassroom.org
23 326680 www.capitalnews.org
22 213116 cancercenter.slu.edu
21 196012 www.defenselink.mil
[List truncated…]

Curator Feedback to CDL (Glenn, Defense Base Closure)

Crawl Success: somewhat effective
Crawl Success Comments: I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: once
Questions / Comments about crawl: One of the reasons I submitted this site to be crawled is that I'd already tried to harvest it using HTTrack. That product appeared to capture more of the documents than Heritrix - but we still haven't been able to capture the entire site.


Gabriela Gray: Join Arnold - CDL Report to Curator

URL: http://www.joinarnold.com
Curator's original comments: "California Recovery Team. Non-profit pro-Arnold group, not registered as a campaign committee. Critical Aspects: Complex file and directory naming structure. Looping, e.g. contactus.asp and contactadd.asp links."
Site copyright statement: "Copyright 2005"
Crawl Results

Comments from crawl operator:
A. Linked hosts included: Great site for testing -- this loop is really interesting because a new URL is generated with each loop, so that the duplicate-detection underway at IA would still not eliminate it. 44332 of the retrieved URLs were contact pages.
B. Restricted to original host: Got into a loop by the end; 999 documents retrieved, 34 minutes.
C. Restricted to original host + regular expression: Excluding pages that matched the regular expression contactadd.asp?c= did not end the loop. What did end the loop: excluding both contactus and contactadd pages so they were not retrieved -- a drawback. (IA takes the manual approach of gathering the pages, then having an operator stop the crawl and take out the looping URLs by hand -- not scaleable.)

<newObject name="contact" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
  <string name="decision">REJECT</string>
  <string name="regexp">contactaspc=</string>
</newObject>

Related hosts crawled: Because of looping problems, we were not able to crawl other hosts linked from this site.
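As a quick sanity check on an exclusion pattern like the one above, the candidate regular expression can be run against a few of the looping URLs before it goes into the crawl configuration. The sketch below is illustrative only: the sample URLs and the pattern are assumptions (the punctuation of the original expression did not survive extraction); it simply mirrors the reject-if-matched behavior described above.

    import re

    # Candidate exclusion pattern (an assumption, not the operator's exact expression).
    # Wrapped in ".*" on both ends so it can match against a full URI.
    pattern = re.compile(r".*contact(us|add)\.asp\?c=.*")

    # Hypothetical examples of the looping contact URLs and one normal page
    sample_uris = [
        "http://www.joinarnold.com/en/contactadd.asp?c=a1b2c3",
        "http://www.joinarnold.com/en/contactus.asp?c=d4e5f6",
        "http://www.joinarnold.com/en/about.asp",
    ]

    for uri in sample_uris:
        verdict = "REJECT" if pattern.fullmatch(uri) else "keep"
        print(verdict, uri)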

Curator Feedback to CDL (Gray, Arnold)

Crawl Success: somewhat effective
Crawl Success Comments: We spot-checked and it looks like most files were captured, but the individual pages don't display most of the images. (This may simply be a problem with the WERA interface.) Strangely enough, the Flash files work perfectly, which is exactly the opposite of our own capture experience.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: As per the crawl notes, we only checked the original host version, since the via crawl failed.
Crawl Frequency: once
Questions / Comments about crawl: Same comments as on the Villaraigosa crawl. We're inexperienced with this type of capture: archive files as-is and use a tacked-on script and a special server-side interface to interpret links in the new environment. Our model has been to actually alter the internal links from absolute to relative formats so that the site works in any environment.


Gabriela Gray: Mayor-Elect Villaraigosa - CDL Report to Curator

URL: http://www.antonio2005.com
Curator's original comments: Critical Aspects: Flash animation; content scattered across multiple servers; maintaining complex internal link structure; JavaScript menus; streaming media.
Site copyright statement: "© 2005 Villaraigosa for Mayor 2005"
Crawl Results

Comments from crawl operator:

• (for Linked hosts results) Need feedback on media etc. retrieved -- this site is an ideal example of the need for scope+one.

• (for Restricted to original host) How much was left out due to the domain restriction? Need feedback.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 263 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

[urls] [bytes] [host]
817 10291631 ensim3.interlix.com
805 117538973 www.antonio2005.com
472 6333775 www.laopinion.com
265 21173 dns
110 19355921 www2.dailynews.com
100 16605730 www2.dailybulletin.com
95 1410145 www.americanpresidents.org
86 820148 www.dailynews.com
73 168698 www.chumpsoft.com
72 52321 images.ibsys.com
69 836295 www.laobserved.com
65 137700 www.mysql.com
55 213569 www.ensim.com
55 177141 www.lamayorcn.com
55 296311 www.surveyusa.com
53 495858 abclocal.go.com
52 522324 www.c-span.org
51 244668 gallery.menalto.com
[list truncated…]

Curator Feedback to CDL (Gray, Villaraigosa)

Crawl Success: mostly effective
Crawl Success Comments: Doing some spot checks, it looks like all of the pages were captured. Some problems with media files -- WERA shows them when we search, but the files are often size 0. In addition, many files on external servers are listed and even have some descriptive info, but when we click on Overview it says "Sorry, no documents with the given uri were found," so no idea if they were really captured.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: Most of the links to other sites go to pages that truly are external to the site, not incorporated into it. The only exception would be the pages from ga3.org and ga4.org.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): The results of the multiple hosts crawl are mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains rather than source + 1. Also, there seem to be a lot of links going to ensim3.interlix.com -- there were more files captured from there than from antonio2005. This is a case where it looks like the entire site (or much of it) was mirrored on two different servers. When we find these we often try to collapse them into one seamless whole, eliminating the duplication, which is meaningless from the user's viewpoint. Simply capturing both mirrors and leaving the cross-links intact is an option we've used when we can't collapse, but it often leads to problems with links between the two, which seems to be the case here.
Crawl Frequency: once
Questions / Comments about crawl: We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE -- the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.


Ron Heckart and Nick Robinson: Public Policy Institute of California - CDL Report to Curator

URL: http://www.ppic.org
Curator's original comments: "We are particularly interested in their publications. We hope the crawler will be able to report when new publication files are posted on the website. Our main focus of interest is on their new publications at http://www.ppic.org/main/newpubs.asp"
Site copyright statement: "All Contents © Public Policy Institute of California, 2003, 2004, 2005"
Crawl Results

Comments from crawl operator: We can't at the moment use Heritrix to report on new publications posted.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 238 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

[urls] [bytes] [host]
2421 324309107 www.ppic.org
433 1367362 www.cacities.org
238 19286 dns
229 4675065 www.icma.org
200 598505 bookstore.icma.org
151 1437436 www.greatvalley.org
144 517953 www.kff.org
137 5304390 www.rff.org
113 510174 www-hoover.stanford.edu
102 1642991 www.knowledgeplex.org
97 101335 cdn.mapquest.com
81 379020 www.cde.ca.gov
73 184118 www.ilsg.org
68 4539957 caag.state.ca.us
62 246921 www.milkeninstitute.org
[list truncated…]

Curator Feedback to CDL (Heckart, PPIC)

Crawl Success: mostly effective
Crawl Success Comments: There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on "Date" retrieves the message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Crawl Frequency: weekly
Questions / Comments about crawl: Our crawl report included the following comment from the crawl operator: "We can't at the moment use Heritrix to report on new publications posted." The ability to report on new publications is critical to our goal of using the crawler as a discovery tool. What are the prospects for providing this functionality in the future?


Terry Huwe: AFL-CIO - CDL Report to Curator

URL: http://www.aflcio.org
Curator's original comments: "This site is content rich and has many files that will be useful in the future. Specific areas that are of special interest follow below: http://www.aflcio.org/corporatewatch/ (the data related to executive pay watch is especially useful); http://www.aflcio.org/mediacenter/ (would like to see press stories captured if possible); http://www.aflcio.org/issues/ (links to newsletters and original content). Also, the "Legislative Action Center" on the home page; this is a useful topic guide to legislative history from a labor perspective."
Site copyright statement: "Copyright © 2005 AFL-CIO"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 2572 additional hosts were crawled. The following hosts supplied more than 75 URLs to your site:

[urls] [bytes] [host]
12702 481956063 www.aflcio.org
2657 184477 dns
1375 35611678 www.local237teamsters.com
570 8144650 www.illinois.gov
502 52847039 www.ilo.org
435 3851046 www.cioslorit.org
427 2782314 www.nola.com
401 8414837 www1.paperthin.com
392 15725244 www.statehealthfacts.kff.org
326 4600633 www.dol.gov
288 12303728 searchoxide.com
284 3401275 www.sikids.com
280 3069385 www.washingtonpost.com
272 1480539 www.cdc.gov
235 5455692 www.kff.org
[list truncated…]

Curator Feedback to CDL (Huwe, AFL-CIO)

Crawl Success: effective
Crawl Success Comments: I realize the collection interface is a "work in progress" and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g. "Working Families Toolkit," "BushWatch") that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2 million-plus hits per year Web sites, to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses of the toolkit. Having said all that, my short answer is: I'm stoked. :-)
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In the case of this collection there's a lot of original content (and context-making documents) on the linked sites, simply due to the hierarchical/federal nature of the labor movement.
Crawl Frequency: monthly
Questions / Comments about crawl: I may have missed correspondence on this question, as I've been busy with a library renovation here. My question: Is the 1 gig limit a useful one? I couldn't help wondering what the actual figure would be if that limit were set higher. I'm generally interested in the total size of Web collections, as I need to monitor ours (which is content-rich but even so does not exceed 1 gig). So discussion of this at least would be interesting, and maybe, if it's not technically challenging, that 1 gig level might be raised and we could see what happens. Thanks to all involved--very interesting process.


Kris Kasianovitz: Los Angeles Dept. of City Planning - CDL Report to Curator

URL: http://cityplanning.lacity.org
Curator's original comments: "Website uses frames. Most of the documents will be pdfs. Of particular interest - the EIRs, which are not archived on the page once the project is approved; see http://cityplanning.lacity.org/EIR/TOC_EIR.htm - General and Community Plans: http://cityplanning.lacity.org/complan/gen_plan/genplan2.htm, http://cityplanning.lacity.org/ComPlan/cpbpage.htm"
Site copyright statement: No copyright information found
Crawl Results

Comments from crawl operator: (Linked hosts included crawl) ended because it ground on for 3 days without hitting the data limit; not sure if URLs at end of log are valid/useful.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 119 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
10493 840876945 cityplanning.lacity.org
601 5156252 metrolinktrains.com
183 644377 www.cr.nps.gov
121 11162 dns
90 977850 www.metrolinktrains.com
81 1207859 www.fta.dot.gov
79 263432 www.fypower.org
66 333540 www.adobe.com
64 344638 lacity.org
63 133340 ceres.ca.gov
60 274940 www.amtrak.com
59 389217 www.nhtsa.dot.gov
58 347752 www.unitedweride.gov
52 209082 www.dot.gov
52 288783 www.nationaltrust.org
51 278949 www.portoflosangeles.org
[list truncated…]

Curator Feedback to CDL (Kasianovitz, LA City Planning)

Crawl Success: mostly effective
Crawl Success Comments: The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all pdfs. When searching specifically for EIRs I got a large result list (699 citations); however, when I investigated whether or not the actual file was captured, I found that the main EIR page was captured (typically a htm/html file), but when I clicked on a link to get to the full report, all I got was the "Sorry, no documents with the given URI were found" message. This could be that the file was no longer available when the site was harvested; however, I tested a few of these and found that I could still access them on City Planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved. Here are a few specific examples:

Final EIR directory: http://cityplanning.lacity.org/EIR/Tocfeir.htm
Sierra Canyon Secondary School (cover page): http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/SierraCyn_coverpg.htm. Access to the Draft EIR and Final EIR are provided from this cover page. Within the system the links to both the Draft and Final are broken; no documents with that given URI: http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/DEIR/Table of Contents.htm
Villa Marina EIR: http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
Directory of EIR notices of preparation: http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM

This provides links to EIR preparation notices (all pdfs), a total of 27 links/documents. I encountered the following three issues:

• pdf opened = when clicking on the link to the notice, the pdf opened with no problem: 16 of 27.

• "Sorry, no document with the given uri was found" = no pdf harvested, but I could get to it from the live site: 4 of 27.

• "Acrobat Could Not Open" message (could open live page outside of WERA) = the following Acrobat message came up when I tried to open the pdf: "Acrobat could not open ENV-2005-0881-EIR[1].pdf because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded). To create an Adobe Acrobat PDF document, go to the source application. Then print the document to Adobe PDF." I copied and pasted the url into a regular browser and could open the pdf with no problem. Also, in a few cases, if I clicked on the GO button after the first attempt to open the pdf in the system, it seemed to launch: 7 of 27.

Conversely, I found a number of pages that contained full documents in html with links to pdfs that worked with no problem. See the following document: http://cityplanning.lacity.org/cwd/gnlpln/transelt/TET2Bkgrnd.htm. File types and error codes were what I expected.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The restricted to original host is more helpful/relevant for getting to the materials from a specific agency. For some of the results that were restricted to host, I was getting external links. Here are some comparisons for each of the crawl settings.

Searched for "villa marina": LA Dept. of City Planning, 6 results:
http://cityplanning.lacity.org/EIR/NOPs/ENV-2004-3812-EIR.pdf
http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM
http://cityplanning.lacity.org/EIR/Tocfeir.htm
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/Cwd/GnlPln/HsgElt/HETblFig/ApVHgSit.htm
LA City Dept. of Planning (via), 2 results:
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/Cwd/GnlPln/HsgElt/HETblFig/ApVHgSit.htm

Searched for "eir": LA Dept. of City Planning, 699 results; LA City Dept. of Planning (via), 324 results. For both of these searches the URIs were from cityplanning.lacity.org.

Searched for "transportation": LA Dept. of City Planning, 699 results; LA City Dept. of Planning (via), 290 results (most are from external sources and tended to be the index or main page of another agency or organization). Because this just got me to the main page, and none of the links functioned at that level, the via result was less helpful. However, the via results are useful for discovering other agencies or organizations that I should be looking at for materials.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): The list of linked hosts provided at the end of the report is helpful. This will help me target other agencies that might be key places to check for further collection development.
Crawl Frequency: monthly
Questions / Comments about crawl: I want to qualify the frequency for this site: I'd like to do a monthly crawl for 3-4 months. I'd want to reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot of new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation. After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report there are 8899 documents in this crawl alone; my other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls. I should note that at this point in time, for local documents, I am more interested in individual documents, more so than capturing an entire website and preserving the functionality. Local agencies (with the exception of perhaps the Mayor's site) tend not to change the design (look and feel) very often or have a very sophisticated design (i.e. flash, changing images, etc.), and I don't see that this would be of interest to researchers' needs; it is the content (reports, maps, etc.) that is contained/accessed on the websites that is important. Maybe I'm wrong or being short-sighted about that.


Kris Kasianovitz: Southern California Association of Governments - CDL Report to Curator

Curator's original comments: "This is a critical regional agency for Los Angeles, Orange, Ventura, Imperial, Riverside and San Bernardino counties. Its main areas are Transportation, Housing, Economic Development. This will provide an analysis of the overall site, which has a lot of content. Publications/reports are typically in pdf – they are presented as full reports and pieces of the report (for easier downloading), so there might be duplication; the full report is really all that would be needed. See http://www.scag.ca.gov/publications. The Resources page contains the pdfs, images, dynamic content, gis programs, including an interactive atlas: http://www.scag.ca.gov/resources.htm. One part of the Resource site is the Web Accessible Geographic Data Search (WAGS): http://mapsvr.scag.ca.gov/wags/index.cfm?fuseaction= It requires a user-created login and password (although there is a guest login that allows you to bypass this). I'm not sure what kind of difficulty the harvester will encounter with this portion of the site. The interactive atlas also has a create an account/guest login issue: http://mapsvr.scag.ca.gov/atlas/presmap.asp?Cmd=INIT. Since it is a dynamic page, I don't know how this will be handled by the harvester."
Site copyright statement: "© 1999-2005 Southern California Association of Governments"
Crawl Results

Comments from crawl operator: Interesting login problem; Heritrix was unable to retrieve the guest login pages. ColdFusion and asp don't generate new URLs and thus don't get crawled. NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than 1 minute.

Robots.txt file: The site you selected forbids crawlers from gathering certain data. It reads:

User-agent: *
Disallow: /_mm/
Disallow: /_notes/
Disallow: /_baks/
Disallow: /MMWIP/

User-agent: googlebot
Disallow: /csi/

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 500 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
2517 863231651 www.scag.ca.gov
690 6134101 www.metrolinktrains.com
506 40063 dns
428 1084533 www.cacities.org
397 16161513 www.sce.com
196 581022 bookstore.icma.org
187 4505985 www.icma.org
175 7757737 www.ci.seal-beach.ca.us
158 1504151 www.h2ouse.org
149 940692 www.healthebay.org
137 317748 www.ci.pico-rivera.ca.us
130 18259431 www.ci.ventura.ca.us
123 490154 www.chinohills.org
121 406068 www.lakewoodcity.org
119 203542 www.lavote.net
117 2449995 www.ci.malibu.ca.us
114 744410 www.ci.irvine.ca.us
113 368023 www.whitehouse.gov
109 974674 www.dot.ca.gov
107 892192 www.lacanadaflintridge.com
[list truncated…]

Curator Feedback to CDL (Kasianovitz, SCAG)

Crawl Success: mostly effective

Crawl Success Comments: Similar to my comments about the Los Angeles Dept. of City Planning - the crawl brought back a lot of webpages but not the publications/documents that I would want to collect. Again, the same problem happened with SCAG as happened with the EIR example: the crawl brought back html pages with links to reports (typically in pdf format) - but the actual documents were not captured. While the webpage is helpful as it gives context, the main content that I'd want to capture (the reports) was not captured. See the following for example; none of the webpages linked from these pages are available, though they should link to a page that will have the material. I tried searching for the documents separately and couldn't get to them. See http://www.scag.ca.gov/publications/index.htm (the timeline arrows at the top seemed to function; I'm not sure what this is for) and http://www.scag.ca.gov/livable/pubs.htm. I was impressed to find that zip files were captured and I was able to download them. Unfortunately, when I opened them there wasn't any content (I did the same search by mistake with the Arizona Dept. of Water Resources and actually found content in the folders); I found 10 with the search type:zip. gif or jpg images retrieved are not useful - most were just bars or bullets or covers of reports (although this might be helpful to identify titles, I think I would end up discarding these -- after doing more checking of the results).
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments from LA Dept. of City Planning. Restricted gets me to the relevant materials for that agency; via brings back too many main webpages for other agencies to be useful.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): I expected that there would be a problem with the content behind logins. The crawl confirmed that material behind login screens couldn't be captured. Can I get a copy of these crawl results? NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than one minute.
Crawl Frequency: monthly
Questions / Comments about crawl: How to handle the copyright issue? For the login information, I'm not sure what all was blocked by the robots file. Interesting that metrolinktrains.com is the #1 related host for both of my crawled sites.


Linda Kennedy: California Bay Delta Authority - CDL Report to Curator

URL: http://calwater.ca.gov
Curator's original comments: "We are interested in the environmental impact statements and other key documents, and the various news releases and other announcements and archives of CALFED."
Site copyright statement: "© 2001 CALFED Bay-Delta Program"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 519 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
1130 473192247 calwater.ca.gov
741 201538533 www.parks.ca.gov
521 40442 dns
373 51291934 solicitation.calwater.ca.gov
242 78913513 www.calwater.ca.gov
225 410972 cwea.org
209 87556344 www.science.calwater.ca.gov
173 109807146 science.calwater.ca.gov
172 1160607 www.adobe.com
129 517834 www.whitehouse.gov
[list truncated…]

Curator Feedback to CDL (Kennedy, CALFED)

Crawl Success: mostly effective
Crawl Success Comments: I looked closely at the CALFED home page (http://calwater.ca.gov), Tribal Home page (http://calwater.ca.gov/Tribal/Tribal_Home.shtml), Key Documents (http://calwater.ca.gov/CALFEDDocuments/CALFEDDocuments.shtml) and Archives page (http://calwater.ca.gov/Archives/Archives.shtml). The crawl did not complete in either the via or non-via search. Nearly all linked pages were retrieved in the non-via search. However, the retrievals from the via search were much less complete than the retrievals from the non-via search. For example, on the Key Documents page there were 3 missing links from the non-via search but 14 missing links from the via search. When Adobe documents were retrieved from either crawl, they came up correctly. Three asp links of tribal maps from the Tribal Home page were retrieved by the non-via search but not the via search. A few of the images were missing from the displays, and this was also affected by the browser used. Usually the same image was missing from both crawls, but sometimes the images were more complete in the non-via crawl retrievals. There were some display problems with the right-hand menu boxes on the http://calwater.ca.gov/Tribal/Tribal_Home.shtml page, for example, that did not display correctly when viewed via Firefox. The same page viewed correctly on Internet Explorer, but when printed out the boxes printed incorrectly, just as viewed on the Firefox browser. Grant Opportunities, http://calwater.ca.gov/GrantOpportunities/GrantInformation.shtml: this link did not work in 2 via and 2 non-via instances (from the Tribal home page and from the Archives page) but did work on one non-via crawl page (the CALFED home page). It could also be searched and retrieved directly from the test crawl search page.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments above. The non-via search was substantially more complete.
Crawl Frequency: monthly


Janet Martorana: Santa Barbara County Department of Planning and Development - CDL Report to Curator

URL: http://www.countyofsb.org/plandev/default.htm
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 487 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
3119 1102414495 www.countyofsb.org
485 34416 dns
428 1083047 www.cacities.org
357 6126453 www.sbcphd.org
320 6203035 icma.org
250 438507 www.sbcourts.org
234 1110744 vortex.accuweather.com
200 593112 bookstore.icma.org
[list truncated…]

Curator Feedback to CDL (Martorana, SBCD)

Crawl Success: mostly effective

Crawl Success Comments: Most documents I expected to find were captured, but a number were not. For example, off this page, http://www.countyofsb.org/plandev/comp/threeyear/2005-2008/default.html, I expected to get to the final work program, http://www.countyofsb.org/plandev/pdf/comp/programs/Three_Year_WP/2005-2008_3YrWrkProgram.pdf, but got the "Sorry, no documents with the given uri were found" message. Other examples, within the Energy Division, a part of the Planning & Development Dept.: off this page, http://www.countyofsb.org/energy/information.asp, I could access all links except for two, http://www.countyofsb.org/energy/information/oil&GasFields.asp (Oil and Gas Fields) and http://www.countyofsb.org/energy/information/oil&GasProduction.asp (Oil and Gas Production). The crawler seemed to cut off the URL right before the ampersand; perhaps it has problems with ampersands.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly
Questions / Comments about crawl: I'd like to be notified when I've navigated away from the captured site. I found myself on real-time web sites but couldn't tell what was on the captured site and what was real. Example: When I did a search on "santa barbara" I got 1528 results. Using the via results, I went to the County of Santa Barbara Online site, and then, after clicking on links, I was on the web (real-time) and not in the crawl results database any longer, yet there were no indications that I had left the crawled database. The WERA uri was still displaying at the top of the screen. I couldn't tell which were the captured sites and which were the current real-time sites. Other observations: the webpage navigation doesn't work, e.g. Table of Contents doesn't jump to that section on the webpage (www.countyofsb.org/energy/projects/shell.asp and www.countyofsb.org/energy/mitigation/oakProject.asp); links to glossary terms go to the glossary but not to the term itself.


Lucia Orlando: Monterey Bay National Marine Sanctuary - CDL Report to Curator

URL: http://montereybay.noaa.gov
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 795 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
5272 468755541 montereybay.noaa.gov
861 61141 dns
554 20831035 www.wunderground.com
368 4718168 montereybay.nos.noaa.gov
282 3682907 www.oceanfutures.org
273 10146417 www.mbnms-simon.org
260 7159780 www.mbayaq.org
163 61399 bc.us.yahoo.com
152 1273085 www.mbari.org
146 710203 www.monterey.com
119 3474881 www.rsis.com
119 279531 www.steinbeck.org
118 1092484 bonita.mbnms.nos.noaa.gov
109 924184 www.duke.edu
104 336986 www.montereybayaquarium.org
103 595953 icons.wunderground.com
102 339589 www.uncw.edu
[list truncated…]

Curator Feedback to CDL (Orlando, Monterey Bay)

Crawl Success: (rating not provided)
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: This site contains a large mix of government policy, laws, and regulatory information, as well as links to recreational and educational activities available in the MBNMS. I was most interested in links to laws/regs/policy and educational info and organizations. I thought the links restricted to the original host best captured this information succinctly.
Crawl Frequency: unknown


Richard Pearce-Moses: Arizona Department of Water Resources - CDL Report to Curator

URL: http://www.azwater.gov (redirects to http://www.azwater.gov/dwr)
Curator's original comments: In arid Arizona, water is one of the most important – and most contested – resources. The publications and records of this Department are of critical value to the state. Our spider can get many files from this site (1474 files in 258 directories). We are mostly interested in documents by programs. Although our spider can't get the imaged documents database (http://www.azwater.gov/dwr/Content/ImagedRecords/default.htm), this directory may not be critical if we can get the imaged documents transferred to us. We are not interested in blank forms and applications.
Site copyright statement: "Copyright © 1998 - 2005 Arizona Department of Water Resources and ADWR Network. All Rights Reserved."
Crawl Results

Questions for Curator: Did this capture the documents you needed?

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 195 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
2233 988447782 www.azwater.gov
286 2350888 www.water.az.gov
253 4587125 www.groundwater.org
226 3093331 www.azcentral.com
196 15626 dns
178 395216 www.macromedia.com
128 1679057 www.prescott.edu
123 947183 www.azleg.state.az.us
115 792968 www.usda.gov
[List truncated…]

Curator Feedback to CDL (Pearce-Moses, AZWater)

Crawl Success: (not provided)
Crawl Success Comments: We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand it, wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can; that may be the difference. The crawl is listed as not completing, but it appears to be very close, based on the total number of bytes downloaded.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Questions / Comments about crawl: It seemed odd that it took 2h 54m to crawl 7709 documents with linked hosts, but 4h 4m to crawl only 4888 docs when the spider was restricted to the original host.


Richard Pearce-Moses: Citizens Clean Election Commission - CDL Report to Curator

URL: http://www.ccec.state.az.us/ccecscr/home.asp
Curator's original comments: This commission was established by initiative. Its work is of great historical significance, as it is changing the way the public elects officials. We have not been able to spider this site because links are buried in JavaScript. (We use wget as our spider.) We are primarily interested in acquiring their publications, election data, and things listed under "popular links."
Site copyright statement: "Copyright 2004 Arizona Citizens Clean Elections Commission. All Rights Reserved."
Crawl Results

Comments from crawl operator: JavaScript issue; interesting problem. Need curator feedback about what we captured.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, the following additional sites were crawled. Total: 15

[urls] [bytes] [host]
929 95456563 www.ccec.state.az.us
76 6117977 www.azcleanelections.gov
55 513218 az.gov
49 499337 www.governor.state.az.us
44 174903 www.adobe.com
40 141202 www.azleg.state.az.us
31 18549 www.az.gov
28 202755 www.azsos.gov
23 462603 gita.state.az.us
19 213976 www.benefitoptions.az.gov
17 89612 www.azredistricting.org
14 1385 dns
3 1687 wwwimages.adobe.com
2 1850 www.capitolrideshare.com
2 26438 www.ftc.gov

Curator Feedback to CDL (Pearce-Moses, CCEC)

Crawl Success: (not provided)
Crawl Success Comments: We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with wget.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)


Juri Stratford: City of Davis - CDL Report to Curator

URL: http://www.city.davis.ca.us
Curator's original comments: "We are primarily interested in the GIS data produced for the City of Davis. As the GIS data represent the largest part of the City of Davis web site, it may not be much more difficult to archive the site at the top level. Mapping and Geographic Information Systems URL: http://www.city.davis.ca.us/gis/"
Site copyright statement: "This web site is Copyright © 2004 by the City of Davis. All Rights Reserved. The City retains the copyright on all text, graphic images and other content of this site. You may not copy, modify and/or re-use text, images or other web content from this web site, distribute the City's web content, mirror content from this web site on a non-City server, or make any other use of the content of this web site that would violate the City's copyright, without written permission from the City of Davis. To the extent allowed by law, commercial use of our web material is prohibited without written permission from the City of Davis. All art work shown on these web pages is protected by US Copyright laws. Limited reproduction for non-commercial purposes can be authorized by the City of Davis provided that requests are approved prior to use. Contact the Community Development Department Cultural Services program staff at (530) 757-5610 for more information. Some content included in this web site may be provided courtesy of third parties who may retain copyright control of the provided material. Any service marks and trademarks contained herein are the property of their respective owners."
Crawl Results

Comments from crawl operator: "GIS: Potential issue: /img disallowed by robots.txt, e.g. http://www.city.davis.ca.us/img/featured/map-static.jpg can't be retrieved; also some maps on a second server disallowed. Need feedback about GIS material that was captured: what was captured that is useful? Much duplication -- pages captured repeatedly."
Robots.txt: The site you selected prohibits crawlers from collecting certain documents. The file reads:

User-agent Disallow img Disallow calendar Disallow miscemailcfm Disallow edbusiness Disallow gisoldmap Disallow policelog Disallow pcsgrantssacog Disallow jobslistings Disallow css Disallow pcsnutcrackerhistorycfm Disallow pcsnutcrackerpdfs User-agent asterias Disallow User-agent gigabot Disallow

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 420 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
16455 947871325 www.city.davis.ca.us
420 29555 dns
332 10377948 www.asucd.ucdavis.edu
305 33270715 selectree.calpoly.edu
279 3815103 www.w3.org
161 2027740 www.cr.nps.gov
139 941939 www.comcast.com
133 951815 www.yolocounty.org
[List truncated…]

Curator Feedback to CDL (Stratford, Davis)

Crawl Success: mostly effective
Crawl Success Comments: Looking at the GIS Online Maps page, it's not clear which formats were retrieved and which were not. For example, the Growth Map Flash file downloads fine, but the Flash/ArcIMS files do not download.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In general, restricted to original host works better. The broader search includes pages from businesses (e.g. Comcast) and other local agencies (e.g. other local and state government sites). But restricting the outside sites to the first level seems to be a good compromise.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): In general it looked like it did a good job pulling back geographic data/images. For example, you can pull down data from http://www.city.davis.ca.us/gis/library. It's difficult for me to get a sense of the level of duplication from the way the search results display.
Crawl Frequency: monthly
Questions / Comments about crawl: I like the ability to navigate within the archive once you have a starting point. This is very nice.


Yvonne Wilson: Orange County Sanitation District - CDL Report to Curator

URL: http://www.ocsd.com
Curator's original comments: "This is an Orange County agency that has small reports and complex planning documents on its web site."
Site copyright statement: "Copyright © 2001-2005 Orange County Sanitation District. Unless a copyright is indicated, the information on this site is freely available for non-commercial, non-profit making use. If a copyright is indicated on any materials displayed on our website, permission to copy these materials must be obtained from the original source. Commercial use of District materials is expressly prohibited without the written permission of the OCSD."
Crawl Results

Comments from crawl operator: In both crawl settings we had to limit the maximum number of retry attempts in order to complete the crawl.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 85 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls] [bytes] [host]
755 85943567 www.ocsd.com
164 7635257 www.ci.seal-beach.ca.us
122 809190 www.ci.irvine.ca.us
95 169207 epa.gov
86 7673 dns
85 559125 order.e-arc.com
66 840581 www.ci.huntington-beach.ca.us
62 213476 www.cityoforange.org
57 313579 www.epa.gov
55 4477820 www.villapark.org
50 1843748 www.cityoflapalma.org
50 463285 www.ocbinc.com
[List truncated…]

Curator Feedback to CDL (Wilson, OCSD)

Crawl Success: somewhat effective
Crawl Success Comments: Using WERA, I searched by type and title in the two OCSD collections, plain and via. I received no hits for pdf, only the homepage for html, and three hits for text. There are many pdf sections in the EIRs. I next searched by title in the two collections; I was the most successful in via. By searching the titles "carbon canyon" and "Ellis Ave Pumping Station" I found articles, but not the EIRs, which are available full text. At this point I presumed the crawl did not drill down far enough. Then I searched for a secondary page entitled "Ocean Monitoring"; this time the search found only an internal letter and memo, but not all the documents related to this topic. Via collection search seems to be the most productive, but it is not consistent.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: I searched some of the outside links to US Marine Fisheries and EPA Beach Watch and received no hits.
Crawl Frequency: monthly


Crawl Report Key: Web-at-Risk Test Crawls

This document is a guide to the test crawl report, providing further information about some of the statistics and results conveyed there. Your report includes basic information about the site you submitted, your original comments about that site, and any copyright statements CDL found when we examined the site. Your site was crawled using the Heritrix crawler, version 1.5.1. Your report will include the following information about the site (Item / Explanation):

Crawl Settings: We crawled each site in two different ways: A. Linked hosts included; B. Restricted to original host. Where linked hosts were included, we set the crawler to gather any outside page that your nominated site linked to, but no further. So if your site linked to a single document from whitehouse.gov, we captured that document but did not crawl any further on the whitehouse.gov site. Your report will show results from both styles of crawling to give you a sense of whether or not the site draws heavily from valuable materials on another site. You will also receive a list of the other hosts that site linked to and how many documents were gathered from those hosts.

Robots.txt file: The presence of a robots.txt file means that the content provider is asking us to refrain from crawling either all or part of the site. This refers only to the host you named in your crawl request. In some cases the site had a robots.txt file but it didn't say anything; we noted when this occurred. We obeyed robots.txt instructions for these crawls, so if the site contained one but we still got a result, that means the robots.txt file only prevented us from crawling certain areas of the site. When robots.txt files were present, we have included the text of that file in your report so you can see which segments of the site the site owner wants to protect.

Crawl duration

Total number of documents: The "Documents" count will include page components (such as images or flash files).

File types (mime types): This area will contain a URL. When you go to that URL you will see a list of the different file types that were retrieved as part of the crawl. IMPORTANT: The Heritrix crawler is currently experiencing difficulty with this report and it is missing a crucial column. Until that is fixed, you can see the different file types retrieved, from most common to least, but you cannot yet tell how many files each one included. CDL will contact you when this report has been fixed.

Response code reports: The URL in this column will lead to a list of response codes in order by frequency. This will include "200" for files that were successfully captured and error codes for files that were not captured. The error code list includes some codes specific to Heritrix. The key to interpreting these codes is at http://crawler.archive.org/articles/user_manual.html#statuscodes. Note that this report only gives you quantitative information about response codes; it does not link response codes to specific files. For these details see "hosts report and crawl log" below.

How much data collected (bytes): The file size of the total crawl is reported in bytes. You can use the byte conversion tool at http://www.techtutorials.net/reference/byteconverters.html if you want to recalculate the size of the crawl in another measurement, such as kilobytes or megabytes.

Did crawl complete?: This will say "no" if the crawl results exceeded 1 gigabyte, or if the crawler encountered an obstacle to capturing the site that could not be fixed.

Location of hosts report and crawl log: You are welcome to review the page-by-page details of the crawl log. This is a generic report that comes with the Heritrix crawler and is not terribly user friendly, but it provides the most detail about the crawl process. This report will list every file that the crawler attempted to get and provide some information about each file. The Heritrix manual can help you interpret this report: http://crawler.archive.org/articles/user_manual.html#logs (go to section 8.2.1, Crawl Log).

Comments from Crawl Operator: These are observations that the Web Archive Programmer made about the crawl process for your site.

Questions for Curator: This section does not appear in every report. If you have a question listed here, please respond to it in the Test Crawl Feedback form.

Your Collection: Important: The screens you will see do not represent the final user interface for the Web Archiving Service tools. We are using WERA, an open-source search and display tool, only to show you your test crawl results. Because we did not seek the right to redistribute these documents, these pages are available only for the purpose of analyzing crawler effectiveness. You must have a password to view these pages. Your report will include the address of a wiki page and a login and password. Each site was crawled twice: plain crawl = only pages from the original site were collected; via = pages from the original site, as well as pages that site links to, were collected. Unfortunately you cannot simply browse your site; you must select a collection and type a search. You will be able to navigate throughout your site once you load a page containing links. You will be able to review your colleagues' sites as well. Note that the WERA display tool is not perfect: if the same document was gathered from more than one crawl, it may not display in every collection.

Related Hosts Crawled: This section provides further information about the additional materials that were gathered when we set the crawler to include documents that your site links to. This can be critical in deciding what settings are needed to capture your site. Some sites, for instance, will keep all of their pdf or image files on a separate server. If you don't allow the crawler to move away from the original URI, you won't capture a critical portion of the site's content. In other cases, however, this setting will lead to irrelevant information. This report includes the most commonly linked hosts from your site.
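For readers who want to derive these same summaries (response-code counts, total bytes, documents per host) directly from the crawl log rather than from the generated reports, a small script can tally them. This is an illustrative sketch only: it assumes the usual Heritrix 1.x crawl.log layout, in which each line's whitespace-delimited fields begin with a timestamp, the fetch status code, the document size in bytes, and the URI, and the file name crawl.log is a placeholder.

    from collections import Counter
    from urllib.parse import urlparse

    status_counts = Counter()
    host_counts = Counter()
    total_bytes = 0

    with open("crawl.log") as log:                     # placeholder path
        for line in log:
            fields = line.split()
            if len(fields) < 4:
                continue
            status, size, uri = fields[1], fields[2], fields[3]
            status_counts[status] += 1
            if size.isdigit():                         # size may be "-" for failed fetches
                total_bytes += int(size)
            host = urlparse(uri).netloc or uri         # "dns:" entries have no network location
            host_counts[host] += 1

    print("Response codes:", status_counts.most_common())
    print("Total bytes:", total_bytes, "(%.1f MB)" % (total_bytes / 1e6))
    print("Top hosts:", host_counts.most_common(10))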



There are some problems with the functionality of captured pages: 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on "Date" retrieves the message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."

AFL-CIO

I realize the collection interface is a "work in progress" and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g. "Working Families Toolkit," "BushWatch") that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2 million-plus hits per year Web sites, to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses of the toolkit.

Conclusions The process of running these test crawls has been valuable It has resulted in a number of lessons learned and further directions for inquiry It is clear that the tools that help curators analyze the results of these crawls will have to do more than mimic the original browsing context or supply tables of data We will be looking for any opportunity to improve WAS reporting capabilities as we move forward with the project The differences encountered in language and visualization raise the importance of a strong intuitive design for the curator tools and for clear help screens Each of us may visualize web sites differently and the crawler may behave differently than we expect A certain degree of online help will be needed to design crawls effectively and further guidance should be available to help people interpret crawl results when those results donrsquot match what the person anticipated

15

The help and documentation for the Web Archiving Service will also need to address the rights analysis issues raised above. In most cases this is work that is done prior to issuing crawls; it cannot be addressed by the design of the WAS interface alone. When the Web-at-Risk project reaches the point of conducting usability studies, we should be sure to include tests that further uncover how users understand crawl frequency settings. Additionally, the desire for an analysis tool that can convey when a site has changed significantly is not limited to this project. It is a common issue faced by the Internet Archive, members of the International Internet Preservation Consortium, and others. CDL is actively communicating with these organizations as we all work toward a solution for this problem. Similarly, CDL should ensure that future assessment and analysis work with our curators addresses the issues raised by the two crawl scope settings. This inquiry should also extend to participants who exemplify end users, to determine if people using web archives experience crawl scope differently than people who build them. Finally, a number of lessons were learned via the Katrina crawl described in Appendix B. The most outstanding finding is that event-based crawls such as Katrina and site-specific crawls such as these have quite different characteristics and require different functionality and analysis tools.

Next Steps

There is certainly more to learn from these crawl results and about Heritrix; in some cases it is still unclear why the crawler failed to retrieve certain documents. The curators' feedback concerning these results has been extremely valuable. They have provided insight as to what was captured and what is still missing, which would have been difficult to determine without their subject expertise in the sites chosen. As we continue working to improve crawler success and performance, we will turn our attention to the Virtual Remote Control site created by Cornell. This site was developed to present particular problems to crawlers and is well documented, enabling the user to gauge a crawler's results. We will be using that site to replicate particular problems raised in our test crawl set and may also reattempt some of the sites crawled for these tests. When we release the first version of the Web Archiving Service to curators in July 2006, we will request that they include their original test site in the crawls they attempt and compare the results with these tests. The test results for these crawls will remain in place as a point of comparison while we continue developing the Web Archiving Service. Additionally, certain issues raised by this feedback, particularly the desired scope and frequency of crawls, merit attention in future usability analysis work.

16

Web-at-Risk Test Crawl Report: Appendix A Sites Submitted

Curator | Site | Crawled
Sherry DeDekker | http://ca.water.usgs.gov (California Water Science Center) | Yes
Sherry DeDekker | http://wwwdwr.water.ca.gov (California Department of Water Resources) |
Peter Filardo and Michael Nash | http://www.nycclc.org (New York City Central Labor Council) | Yes
Peter Filardo and Michael Nash | http://www.dsausa.org (Democratic Socialists of America) |
Valerie Glenn and Arelene Weibel | http://www.strengtheningsocialsecurity.gov (Strengthening Social Security) | Yes
Valerie Glenn and Arelene Weibel | http://www.brac.gov (The Defense Base Closure and Realignment Commission) | Yes
Gabriela Gray | http://www.joinarnold.com (Join Arnold) | Yes
Gabriela Gray | http://www.antonio2005.com (Mayor/mayor-elect Antonio Villaraigosa) | Yes
Ron Heckart and Nick Robinson | http://www.ppic.org (Public Policy Institute of California) | Yes
Ron Heckart and Nick Robinson | http://www.cbp.org (California Budget Project) |
Terrence Huwe | http://www.aflcio.org (AFL-CIO) | Yes
Terrence Huwe | http://www.seiu.org (Service Employees International Union) |
James Jacobs | http://www.sandiego.gov/planning (City of San Diego Planning Department; analyzed by Megan Dreger) | Yes
James Jacobs | http://www.sandag.org (San Diego Association of Governments) |
Kris Kasianovitz | http://cityplanning.lacity.org (Los Angeles Department of City Planning) | Yes
Kris Kasianovitz | http://www.scag.ca.gov (Southern California Association of Governments) | Yes
Linda Kennedy | http://calwater.ca.gov (California Bay-Delta Authority, CALFED) | Yes
Linda Kennedy | http://www.dfg.ca.gov (California Department of Fish and Game) |
Ann Latta | http://www.ucmerced.edu (UC Merced; analyzed by Elizabeth Cowell) | Yes
Ann Latta | http://www.coastal.ca.gov/web (California Coastal Commission) |
Janet Martorana | http://www.countyofsb.org/plandev/default.htm (Santa Barbara County Department of Planning and Development) | Yes
Janet Martorana | http://www.sbcag.org (Santa Barbara County Association of Governments) |
Lucia Orlando | http://montereybay.noaa.gov (Monterey Bay National Marine Sanctuary) | Yes
Lucia Orlando | http://www.waterboards.ca.gov/centralcoast (Central Coast Regional Water Quality Control Board) |
Richard Pearce-Moses | http://www.azwater.gov (Arizona Department of Water Resources) | Yes

17

Richard Pearce-Moses | http://www.ccec.state.az.us/ccecscr/home.asp (Citizens Clean Election Commission) | Yes
Juri Stratford | http://www.city.davis.ca.us (City of Davis, California) | Yes
Juri Stratford | http://www.sacog.org (Sacramento Area Council of Governments) |
Yvonne Wilson | http://www.ocsd.com (The Orange County Sanitation District) | Yes

18

Web-at-Risk Test Crawl Report: Appendix B The Katrina Crawl

The Crawl

During the early fall of 2005, the California Digital Library was embarking on a series of test crawls as part of the Web-at-Risk project when Hurricane Katrina struck. CDL had requested that curators submit their sample URLs by August 25th, and on Friday, August 26th, Louisiana Governor Blanco declared a state of emergency. It was over that weekend that CDL staff reached the conclusion that an event of this historic significance would shift our priorities. On Monday we suspended our initial test crawl plans and began preparing on a number of fronts to capture web-based materials related to Katrina. Specifically, this meant:

• identifying which web sites or sub-sites to collect, which involved:
  o notifying 30+ curators of our emergency crawl plans and requesting their assistance
  o setting up a web-based interface to gather their suggested seed URLs
  o sorting through the resulting seed list and feeding it to our primary and secondary (Stanford) crawl operators
• determining our collection parameters
• getting our crawlers installed and configured
• locating and setting up disk space to store the crawls
• initiating and monitoring the crawls

This crawl activity posed several new challenges. It was the first time CDL staff had used the Heritrix crawler, which was not yet installed on a production server when the hurricane struck. The last week of August saw CDL and UC Berkeley staff scrambling to find server space to set up Heritrix instances, disk partitions, and job parameters for the crawl. This was done with remarkable speed, owing to the urgency of the situation and to everyone's desire to capture a record of the events. Given our limited experience, in order to reduce the risk of losing the historically significant and fleeting materials related to Hurricane Katrina, CDL worked with Stanford University to concurrently run the same crawl using a different crawler. After both CDL and Stanford had crawled the same seed list for a month, the task was then taken up by the San Diego Supercomputer Center, who have continued crawling these sites using Heritrix.

Gathering the Seeds

The CDL sent out an initial request to the Web-at-Risk curators to submit URLs related to Hurricane Katrina for crawling. We worked collaboratively with a large group of content specialists to identify the sites:

• The 22 curators of the Web-at-Risk project (University of California Libraries, University of North Texas, New York University, Stanford University, and the Arizona State Library)
• The Library of Congress
• Librarians at Louisiana State University, University of Mississippi

19

Over the course of the crawl, the list of seed URLs grew to over 700 (just over 500 of which were crawled by CDL). The image of our input form above provides some sense of the range of materials collected. Given the sudden nature of this event, there was no time to investigate rights issues or technical problems each site might have presented. We informed the curators that "Our immediate plan is to simply collect the material before it disappears. We will not make the material immediately available." We also had little time for quality control and were not able to guarantee that sites in our seed list would be comprehensively crawled.

Crawling Specifics

CDL's first crawl was run on September 1, 2005, using a seed list of 89 URLs. The final crawl run by CDL was on October 10, using a seed list of 589 URLs. This final seed list was then sent to the San Diego Supercomputer Center, who are continuing to run twice-weekly crawls.

20

The CDL crawls were done using Heritrix version 1.5.1. We began with a single instance of Heritrix, eventually dividing the seed list among 6 instances. We got through the entire seed list 29 times in 40 days. However, the content we were able to collect was limited by our crawler settings. Our crawler was configured to visit one host at a time, collecting content at a conservative rate, and never more than three hops away from the seed URL. To ensure that the crawler moved through the seed list in a timely way, it was limited to spending 15 minutes at any given host. So the material collected does not represent the entirety of what was available at each site. Note that all of the seeds were crawled with the same configuration, no matter how different the structure of these sites might be. Thus the New Orleans version of Craig's List, NASA's information pages, and blog sites were all crawled in the same manner, despite being quite different in context, architecture, and other characteristics. Given our short preparation time, the goal was to find a crawler configuration that would be a match for the broadest range of Katrina materials. In most cases, whether the seed URL was for a personal blog or for a government agency, the seed itself was not centrally about Katrina. The Katrina information was generally situated on the front page and top levels of each site, so CDL expected that this setting, even with the 15-minute time limit, would at least capture the content nearest to the front pages of the widest variety of sites we could capture. In choosing crawler configuration settings we also faced conflicting goals. On one hand, we wanted to capture as much Katrina content as possible. On the other hand, we were very hesitant to start hitting sites that were providing much-needed information at a crucial time (emergency sites, relief sites), thus making sites hard to reach due to our capture activity. Many of these sites were already getting heavily used and perhaps were not running at full capacity. Some were also geographically impacted directly by the hurricane. So we had to choose settings that balanced the need to collect with politeness across a wide range of sites. When the San Diego Supercomputer Center took over the Katrina crawls in October, they revisited the Heritrix configuration settings. The SDSC crawl placed a limit not on the amount of time spent at a site but on the number of documents to be retrieved overall. Their crawls are being conducted "breadth first" in the sense of gathering pages consecutively across the seed list, rather than gathering pages from one site at a time. The Stanford University crawl of the same sites began on September 7th and continued for 30 consecutive days using the WebVac crawler. While we have statistics concerning the size of the Stanford Katrina collection, it is very difficult to compare the configuration settings and effectiveness of WebVac vs. Heritrix. Further, there is no easy way to display materials gathered with WebVac, so the Stanford Katrina content is stored but is not accessible to us at the moment.
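To make the two limits described above concrete (a three-hop horizon from each seed and a 15-minute budget per host), the following is a minimal sketch in Python. It is purely illustrative and is not the Heritrix configuration actually used; the fetch and extract_links helpers are placeholders for whatever HTTP and link-extraction layer is in use.

    # Illustrative sketch only -- not the Heritrix settings used for the Katrina crawl.
    import time
    from collections import deque

    MAX_HOPS = 3                 # never more than three hops from the seed URL
    HOST_TIME_BUDGET = 15 * 60   # at most 15 minutes per host, in seconds

    def crawl_seed(seed, fetch, extract_links):
        """Crawl a single seed, honoring the hop and time limits above.
        fetch(url) returns a page body; extract_links(body) returns its outlinks."""
        started = time.monotonic()
        queue = deque([(seed, 0)])   # (url, hops from seed)
        seen = {seed}
        while queue:
            if time.monotonic() - started > HOST_TIME_BUDGET:
                break                # budget exhausted; move on to the next seed
            url, hops = queue.popleft()
            body = fetch(url)
            if hops >= MAX_HOPS:
                continue             # fetch the page, but follow no deeper links
            for link in extract_links(body):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, hops + 1))

    def crawl(seeds, fetch, extract_links):
        # Visit one seed (host) at a time rather than interleaving requests.
        for seed in seeds:
            crawl_seed(seed, fetch, extract_links)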

Katrina Crawl Results

In terms of creating a collection of the Katrina event on the web, we are not satisfied that CDL's crawling efforts were successful. As mentioned, we had to impose time

21

limits on each site visited and were not able to conduct quality control on the URLs that were captured. The total size of the Katrina capture at CDL is 50 GB. This represents 29 completed crawls of the seed list, which grew to over 500 seeds over the 40 days we ran the crawl. It's clear from the byte size alone that we barely skimmed the surface, and indeed the Stanford crawl proves this point. When the Stanford crawl was based on about 370 seeds, it was pulling in about 35 GB a day, or 1.5 million pages a day. In spite of this, our attempts to capture Katrina news and events did have some very useful and positive outcomes. In terms of test crawling, the experience was quite successful, opening our eyes both to certain obstacles and to new possibilities. Most importantly, it prompted us to consider the nature and demands of event-based crawling, which had not been part of our original test plan. Here are some of the aspects of event-based crawls that Katrina surfaced.

Site Selection and Classification

This type of event demands a deep, collaborative effort to identify material to collect. CDL had to mobilize a large number of curators to make their best guesses at identifying promising sites. We set up a 'crawl seed nomination' web form accessible to the curators from their home institutions and invited the curators to use it to enter their suggested seed URLs. CDL staff did not have time to add much automatic error or duplicate checking features, so this cumbersome work was done by hand (a minimal example of such checking is sketched below). On the other hand, it provided us a first-hand trial of what a more general curator user interface might require. The selection and management of seed lists is critical for sudden event-based crawls. The curators contributing the URLs will not necessarily be well versed in the topic; in the case of Katrina, curators in California were not uniformly familiar with the Gulf Coast, the towns, the government agencies, etc. In addition, it is difficult to predict which aspects of the event will be of historic, enduring value. Because disk storage was not a pressing issue, it was better to err on the side of a wider net when selecting sites. It became clear that there might be a role for "smart crawling," which would spread a wider net than that provided by human-generated seed lists through the use of automated tools that discovered relevant materials. We were specifically interested in using tools developed at Emory and Cornell that interface with the Rainbow text classification tool.7 We did not have the resources to investigate this for the Katrina crawl but plan to fold it into future crawling tests. Finally, it is worth examining why the seed list grew continuously throughout the event. Part of this growth is due to the natural increase in the number of pages and sites devoted to the emergency and having the additional time to identify and add them. However, the nature of the event itself changed over time, starting as a hurricane, then becoming a flood, a massive relocation, and a political and social issue. So the range of relevant sites changed as the event itself took on broader

7 Bergmark, Donna. "Heritrix processor for use with rainbow." <http://groups.yahoo.com/group/archive-crawler/message/1905>

22

implications. This suggests that site selection is an ongoing process, not strictly an activity undertaken at the beginning of an event.
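As an illustration of the automatic error and duplicate checking the nomination form lacked, a hypothetical sketch follows. The function names and normalization rules are assumptions for illustration, not a description of the form CDL actually built.

    # Hypothetical sketch: normalize curator-nominated URLs and drop repeats
    # before they reach the seed list.
    from urllib.parse import urlsplit, urlunsplit

    def normalize_seed(url):
        """Return a canonical form of a nominated URL, or None if it is unusable."""
        url = url.strip()
        if not url:
            return None
        if "://" not in url:
            url = "http://" + url                 # tolerate bare host names
        scheme, netloc, path, query, _ = urlsplit(url)
        if scheme not in ("http", "https") or not netloc:
            return None
        return urlunsplit((scheme.lower(), netloc.lower(), path or "/", query, ""))

    def dedupe_seeds(nominations):
        """Collapse a list of nominations into a unique, ordered seed list."""
        seen, seeds = set(), []
        for raw in nominations:
            seed = normalize_seed(raw)
            if seed and seed not in seen:
                seen.add(seed)
                seeds.append(seed)
        return seeds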

Rights, Ownership, and Responsibilities

Our current rights protocol stipulates that when a web page is clearly in the for-profit domain, we must seek permission from the content owners to collect. This restriction applies most notably to news agency sites, which would likely be a critical part of the web-based fallout of any major event. This was certainly the case with Katrina. However, rights negotiation is a time-consuming process. In the case of a planned event such as an election, one is able to anticipate a certain number of sources and take rights management steps before the event occurs. But with emergencies such as Katrina, there is simply no way to secure rights in advance without missing time-sensitive material. Knowing this would be an issue, CDL notified our curators in advance that the Katrina materials we gathered would not be publicly accessible but merely captured and preserved. The complex rights issues behind web archiving, as well as the collaborative environment described above, raise a number of questions:

• Who owns the content?
• Who is responsible for the project?
• Who is responsible for responding if a content owner objects?
• When is it permissible to ignore robots exclusion files?
• How much rights management can be done as part of post-processing for an event crawl?
• Can there be standing agreements with major news agencies to allow for this type of crawl in emergency situations? If so, what constitutes an emergency?

Technical Infrastructure

CDL did not have an adequate technical infrastructure in place at the time of the Katrina crawl. As mentioned above, staff at both CDL and UC Berkeley scrambled to find server space and to install and configure the crawlers. This sudden shift in workload left the system vulnerable and somewhat unstable. In order to start the crawling process as soon as possible, CDL began the project on a temporary server. Making the transition to more permanent storage in the midst of this crawling project was neither easy nor flawless, and a certain amount of data had to be recovered from backup sources.

Information Analysis and Display

The challenge of analyzing crawl results has been described in some detail in our Test Crawl Report. That report cites a curator who found the task of reviewing an 8,899-document crawl to be unmanageable. The tools currently available are not at all up to the task of analyzing a large and complex crawl. An event-based crawl is

23

likely to result in massive amounts of data of widely varying quality. The selection of seeds is based on guesses that, given sufficient resources, should be reviewed, refined, and enhanced as the event progresses. Time series data based on changing input parameters represents a kind of moving target that suggests the need to develop new analysis tools. Andreas Paepcke of Stanford University addressed this problem when considering how a social scientist might want to analyze the Katrina materials. Consider the researcher who is interested in finding out how quickly the notion of race entered the public discourse in the aftermath of Katrina. Paepcke suggests:

"For example, the tool could perform word frequency analysis across successive crawls of Katrina sites so that the user could identify the emergence of word usage such as 'looter', 'ninth ward', etc. The social scientist would then define within the tool clusters of words that in the scientist's judgment are direct or indirect occurrence indicators of the concept 'race' (i.e. 'ninth ward', 'looters', 'lacks', 'poor', ...). The scientist could subsequently interact with the tool at the level of these well-defined concepts. Example command: count occurrences of 'race' within the first three paragraphs of all pages and tabulate the differences across 10 days. We could go further and apply well-known topic categorization algorithms on the data to suggest new concepts as they arise in the series of text streams. For example, the tool might identify an increasing frequency of the term 'mismanagement' beginning in week two after the storm and alert the sociologist to this evidence of a new discourse theme." 8

While we don't expect to be able to incorporate this level of analysis into our web archiving tools any time soon, these ideas illustrate a useful direction for archival analysis tools.
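A very small sketch of the kind of analysis Paepcke describes is shown below. It assumes crawl snapshots have already been reduced to plain-text files in dated directories (an assumed layout, not a WAS feature) and simply tabulates a concept cluster's term counts per crawl date.

    # Illustrative sketch: count a concept cluster across successive crawl snapshots.
    import os
    import re
    from collections import Counter

    RACE_CLUSTER = {"ninth ward", "looters", "poor"}   # example terms from the quote above

    def count_cluster(text, cluster):
        """Count how many times any term in the cluster occurs in the text."""
        text = text.lower()
        return sum(len(re.findall(re.escape(term), text)) for term in cluster)

    def tabulate(snapshot_dirs, cluster=RACE_CLUSTER):
        """snapshot_dirs maps a crawl date to a directory of extracted text,
        e.g. {'2005-09-01': '/data/katrina/2005-09-01'} (hypothetical layout)."""
        totals = Counter()
        for date, directory in sorted(snapshot_dirs.items()):
            for root, _, files in os.walk(directory):
                for name in files:
                    with open(os.path.join(root, name), errors="ignore") as fh:
                        totals[date] += count_cluster(fh.read(), cluster)
        return totals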

Conclusions

At the time Hurricane Katrina hit, the Web-at-Risk service requirements and test crawl plan had been written for a more orderly and considered approach to web crawling. Our attempt to capture the web-based aftermath of Katrina highlighted our need to revisit the service requirements and see how well they supported an event-based response. We need to develop a collaborative and robust mechanism to identify and collect seed URLs. When an event happens quickly, it is also important to have in place general crawling guidelines that will result in a well-rounded collection. The technical infrastructure for the Web Archiving Service that we are building for the Web-at-Risk project should be robust enough to handle occasional and sudden bursts of activity. Attempting to alter the infrastructure quickly in reaction to emergency events impairs our responsiveness and leaves the altered systems in a vulnerable state.

8 Paepcke, Andreas, Senior Research Scientist and Director of the Digital Library, Stanford University. Email correspondence with Patricia Cruse, October 26, 2005.

24

Finally, work needs to continue on both the rights management front and on developing improved web archiving analysis tools so that the material gathered can be used to its greatest potential.

25

Web-at-Risk Test Crawl Report: Appendix C Individual Crawl Reports

Included below are the crawl reports provided to individual curators, including their analysis and feedback about those results. Note that a key to interpreting the tables in these crawl reports is provided at the end of this appendix.

Elizabeth Cowell (submitted by Ann Latta): UC Merced

CDL Report to Curator

URL: http://www.ucmerced.edu
Curator's original comments: "UC Merced is the first research university to be built in the 21st century. The educational and land use issues are significant. Of particular interest is http://www.ucmercedplanning.net. This site addresses major issues of land use - the university is being built on agricultural land. Controversy existed re: issues of redevelopment of downtown Merced vs. appropriation of agricultural land - there are major environmental issues focused on endangered species - Educational issues involving faculty job descriptions, student body, etc. are significant because of the economic, ethnic and cultural diversity of the region."
Site copyright statement: "© 2004 UC Regents"
Crawl Results

26

Comments from crawl operator: When we set the crawl to include pages from linked sites, the crawler got "trapped" at the Elsevier site. There is JavaScript on that linked page that causes the crawler to continue looking for additional pages on the Elsevier site even when you're only trying to capture a single page. Once we set a limit for the maximum number of retry attempts, the crawl completed. This data is from the completed crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 227 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
969 71552369 www.ucmerced.edu
238 2564803 www.ucop.edu
226 14851 dns
197 16583197 www.universityofcalifornia.edu
156 8487817 www.elsevier.com
151 1437436 www.greatvalley.org
112 2354582 faculty.ucmerced.edu
105 5659795 www.pacific.edu
90 111985 k12.ucop.edu
86 255733 www-cms.llnl.gov
85 1178031 admissions.ucmerced.edu
81 297947 uc-industry.berkeley.edu
71 108265 www.mssmfoundation.org
67 349300 www.nps.gov
66 308926 www.usafreedomcorps.gov
54 137085 slugstore.ucsc.edu
52 52202 www.cerrocoso.edu
51 977315 www.universityofcalifornia.com

Curator Feedback to CDL (Cowell Merced)

Crawl Success: mostly effective

27

Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: 10 more links came up in the "via" search; 8 of these were not useful for the research of land use issues. The two that were useful were a benefit of the "via" search.
Crawl Frequency: monthly

28

Sherry DeDekker: California Water Science Center

CDL Report to Curator

URL: http://ca.water.usgs.gov
Curator's original comments: "We are interested in the CA water reports and databases behind the links on this page. Also in looking at http://waterdata.usgs.gov/ca/nwis/nwis - this section is an interactive interface to multiple data sets. Is it possible to capture this type of site as well as the static reports?"
Site copyright statement: "Information presented on this website is considered public information (unless otherwise noted) and may be distributed or copied. Use of appropriate byline/photo/image credit is requested. We strongly recommend that USGS data be acquired directly from a USGS server and not through other sources that may change the data in some way."
Crawl Results

Comments from crawl operator: Databases, e.g. http://waterdata.usgs.gov/ca/nwis/nwis, are disallowed by http://waterdata.usgs.gov/robots.txt. Some water data reports are NOT caught by the broader "linked hosts included" settings, e.g. http://pubs.usgs.gov/wdr/2004/wdr-ca-04-1 -- would also want to submit http://ca.water.usgs.gov/waterdata as a seed.
Related hosts crawled:

29

When the crawl was set to include documents from other sites that the original site linked to, 662 additional sites were crawled. The following list includes the hosts that supplied more than 50 files. Note that the host "pubs.usgs.gov" supplied a higher number of files than the original host itself.
[urls] [bytes] [host]
1963 255912820 pubs.usgs.gov
1153 47066381 ca.water.usgs.gov
698 56570 dns
404 112354772 geopubs.wr.usgs.gov
385 9377715 water.usgs.gov
327 203939163 greenwood.cr.usgs.gov
318 17431487 www.elsevier.com
219 3254794 www.usgs.gov
189 2737159 www.lsu.edu
163 2292905 wrgis.wr.usgs.gov
158 31124201 www.epa.gov
149 921063 www.usda.gov
[list truncated…]

Curator Feedback to CDL (DeDekker CWSC)

Crawl Success: somewhat effective
Crawl Success Comments: Site appears to access water data reports (http://ca.water.usgs.gov/archive/waterdata/index.html) but none are actually available through the links. I expected the site to not be able to access real-time data, but these are archived reports.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly

30

Megan Dreger (submitted by James Jacobs): City of San Diego Planning Department

CDL Report to Curator

URL: http://www.sandiego.gov/planning
Curator's original comments: I would like to drill down several levels (at least 3) of this site. For example, following the link to City of Villages/general plan update leads to many more important planning documents.
Site copyright statement: This site contains the two following notices on the same page:
Restrictions on Use of Materials: "This site is operated and maintained by the City of San Diego through its Department of Information Technology and Communications (referred to as IT&C). Except as provided herein, no material or information from this site may be copied, reproduced, republished, uploaded, posted, transmitted, or distributed except as authorized in this notice, expressly authorized within this site, or approved in writing by IT&C."
Copyright Notice: "Unless a copyright is indicated, information on the City of San Diego Web site is in the public domain and may be reproduced, published or otherwise used with the City of San Diego's permission. We request only that the City of San Diego be cited as the source of the information and that any photo credits, graphics or byline be similarly credited to the photographer, author or City of San Diego as appropriate. If a copyright is indicated on a photo, graphic, or any other material, permission to copy these materials must be obtained from the original source."
Crawl Results

31

Comments from crawl operator: Need feedback about whether desired content was retrieved. Question for curator: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 1084 additional hosts were crawled. (This figure does not represent a complete crawl, as the size limitation was reached.) The following hosts supplied more than 75 documents to your site:
[urls] [bytes] [host]
3728 556231640 www.sandiego.gov
1247 38685244 genesis.sannet.gov
1085 80905 dns
807 6676252 www.houstontexans.com
428 1079658 www.cacities.org
399 102298888 www.buccaneers.com
259 1797232 granicus.sandiego.gov
258 42666066 clerkdoc.sannet.gov
238 5413894 www.ccdc.com
225 2503591 www.ci.el-cajon.ca.us
223 1387347 www.ipl.org
217 2683826 www.sdcounty.ca.gov
203 11673212 restaurants.sandiego.com
195 2620365 www.sdcommute.com
192 1344523 www.bengals.com
189 2221192 www.kidsdomain.com
176 1333528 www.buffalobills.com
171 685965 www.chumpsoft.com
166 277238 www.proquest.com
[list truncated…]

32

Curator Feedback to CDL (Dreger San Diego)

Crawl Success: mostly effective
Crawl Success Comments: This crawl was not completed due to size, so that may explain some of my questions. It was pretty effective in terms of getting the Planning Dept. pages, but went out further than I expected. Due to the vague request to drill down several levels, I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit. It's nice that PDFs and other formats are included.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The linked hosts included (via) crawl seemed to include more extraneous stuff.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
A: You asked for a better definition of what was meant by the request to drill down several levels (at least 3). Unfortunately, I wasn't the original curator and I'm not sure what he meant. The crawl you did was very useful.
Crawl Frequency: monthly
Questions/Comments about crawl: I think that the crawl frequency should be at least monthly for these pages.

33

Peter Filardo and Michael Nash: New York City Central Labor Council

CDL Report to Curator

URL: http://www.nycclc.org
Curator's original comments: (none)
Site copyright statement: "© 2004 New York City Central Labor Council. No portion of this website may be reproduced in any form without permission from the Central Labor Council. Contact our offices for more information at nycaflcio@aol.com."
Crawl Results
NOTE: Because your Crawl "A" had to be stopped then resumed, each of your reports for that crawl is in two segments. To browse a list of all reports for that crawl, go to http://voro.cdlib.org:8081/ingest_misc/ndiipp/testcrawls_raw/filardo_labor_via

Comments from crawl operator: A. Linked hosts included: Crawl complete after recovery, with addition of max retries. Seemed to hang at http://www.nycclc.org/calendarevent.asp?EventId=501 and http://www.nycclc.org/assets/HLCapplicationmembership.pdf. Ended crawl; seemed to hang. Recovered from previous job; the recovery was successful. Note for future that a recovered job is identifiable because the logs directory is called logs-R.

34

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, x additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
1913 74260017 www.nycclc.org
156 11755 dns
115 710552 www.aflcio.org
73 1477966 www.comptroller.nyc.gov
71 193264 www.empirepage.com
60 570115 www.redcross.org
58 269079 www.afl-cio.org
57 240845 www.campsussex.org
57 113676 www.mssm.edu
56 449473 www.labor-studies.org
53 184605 www.pbbc.org
52 134326 www.senate.gov
[list truncated…]

Curator Feedback to CDL Filardo NYCCLC

None provided

35

Valerie Glenn and Arelene Weibel: Strengthening Social Security

CDL Report to Curator

URL: http://www.strengtheningsocialsecurity.gov
Curator's original comments: "contains external links to audio & video that would be essential to completing this site (see press room); some are files, some are links to webcasts, some are on external .gov sites and some are external .com sites"
Site copyright statement: Copyright info not found
Crawl Results

Comments from crawl operator: Interesting audio/video/webcast content -- need feedback about success in capturing. Questions for Curator: How successfully did this crawl capture the multimedia documents you were interested in?
Comments from coordinator: In the media center area of this site (http://www.strengtheningsocialsecurity.gov/press/media_center.shtml) I've found .ram (both video and audio alone), .smil and .asx files. The site also contains numerous .ppt and .pdf files.
• A text search on the log file turns up numerous .ram files, only 1 .ppt file

36

• .asx files are Windows streaming media redirector files, which generally lead to associated .wmf files. No .asx references appear in the crawl log, nor do any .wmf files.
• Similarly, .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files. I assume that when displayed, some of the real media files from this site would function, but many of the other multimedia files would not.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 388 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
660 10668874 www.chelseapiers.com
562 7334035 www.whitehouse.gov
477 6366197 www.laopinion.com
391 29623 dns
356 3874719 www.wkrc.com
243 12294240 www.strengtheningsocialsecurity.gov
178 1935969 www.xavier.edu
148 237055 image.com.com
127 682069 online.wsj.com
117 898439 www.omaha.com
116 514995 www.npr.org
108 995733 www.nba.com
[list truncated…]

Curator Feedback to CDL (Glenn Strength Soc Sec)

Crawl Success: mostly effective
Crawl Success Comments: My main concern about this site was the multimedia documents included - I've posted those thoughts in the Questions for Curator text box.
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: For this site it was essential to capture the linked hosts (via) because many of the press materials etc. were on external sites.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: How successfully did this crawl capture the multimedia documents you were interested in?
A: I'm disappointed that not all of the multimedia files were captured, but there seem to be only a few that aren't included. I found it interesting that

37

broadcasts from the same host (whitehouse.gov) weren't completely captured - some were, some weren't.
Crawl Frequency: once
Questions/Comments about crawl: [note: the curator alludes to adding a comment here, but no comment was received]

38

Valerie Glenn and Arelene Weibel: Defense Base Closure and Realignment Commission

CDL Report to Curator

URL: http://www.brac.gov
Curator's original comments: "In our previous efforts we have been unable to capture agency databases. The BRAC site includes a document library which has a search feature (http://www.brac.gov/Search.aspx) and a browse feature (http://www.brac.gov/Browse.aspx). We would really like to see how this information can be captured so that we can recreate it on our own servers."
Site copyright statement: "The contents of all material available on this Internet site are in the public domain and are not copyrighted. The content of this site may be freely reproduced, downloaded, disseminated, published, or transferred in any form and by any means. However, in some cases the copyright for certain text or images on this site may be held by other parties."
Crawl Results

Comments from crawl operator: A. Linked hosts included: http://www.brac.gov/Search.aspx can't be captured by Heritrix. http://www.brac.gov/Browse.aspx seems to only capture the first 25 documents. Tried again with the browse page as the starting point, but stopped after 1005 documents; extracted 20 links from the browse page and then there were no more URLs in the frontier queue which had been extracted from Browse.aspx -

39

-perhaps need more experimentation. B. Restricted to original host: again, only the 1st 25 pages from browse -- can't even successfully pass a seed URL listing the max docs per browse page (50).
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 44 additional hosts were crawled. Note that because your crawl did not complete, this is not an accurate count of how many other sites your original site may link to.
[urls] [bytes] [host]
2034 1064389540 www.brac.gov
555 5874934 www.slu.edu
87 173510 www.cpcc.edu
54 154588 www.wmata.com
47 685158 www.sluhospital.com
44 3501 dns
44 582555 www.c-span.org
43 174467 www.adobe.com
38 178153 www.q-and-a.org
32 127325 slubkstore.com
24 140653 www.c-spanclassroom.org
23 326680 www.capitalnews.org
22 213116 cancercenter.slu.edu
21 196012 www.defenselink.mil
[List truncated…]

Curator Feedback to CDL (Glenn Defense Base Closure)

Crawl Success: somewhat effective
Crawl Success Comments: I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: once
Questions/Comments about crawl: One of the reasons I submitted this site to be crawled is that I'd already tried to harvest it using HTTrack. That product appeared to capture more of the documents than Heritrix - but we still haven't been able to capture the entire site.

40

Gabriela Gray: Join Arnold

CDL Report to Curator

URL: http://www.joinarnold.com
Curator's original comments: "California Recovery Team. Non-profit pro-Arnold group, not registered as a campaign committee. Critical Aspects: Complex file and directory naming structure. Looping, e.g. contactus.asp and contactadd.asp links."
Site copyright statement: "Copyright 2005"
Crawl Results

Comments from crawl operator: A. Linked hosts included: Great site for testing -- this loop is really interesting because a new URL is generated with each loop, so that the duplicate-detection underway at IA would still not eliminate it. 44332 of the retrieved URLs were contact pages. B. Restricted to original host: Got into a loop by the end; 999 documents retrieved, 34 minutes. C. Restricted to original host + regular expression:

41

Excluding pages that matched the regular expression contactadd.asp?c= did not end the loop. What did end the loop: excluding both contactus and contactadd pages so they were not retrieved -- a drawback. (IA takes the manual approach of gathering the pages, then having an operator stop the crawl and take out the looping URLs by hand -- not scaleable.)

<newObject name="contact" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
  <string name="decision">REJECT</string>
  <string name="regexp">.*contact.*\.asp\?c=.*</string>
</newObject>

Related hosts crawled: Because of looping problems, we were not able to crawl other hosts linked from this site.

Curator Feedback to CDL (Gray Arnold)

Crawl Success: somewhat effective
Crawl Success Comments: We spot-checked, and it looks like most files were captured, but the individual pages don't display most of the images. (This may simply be a problem with the WERA interface.) Strangely enough, the Flash files work perfectly, which is exactly the opposite of our own capture experience.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: As per the crawl notes, we only checked the original host version, since the via crawl failed.
Crawl Frequency: once
Questions/Comments about crawl: Same comments as on the Villaraigosa crawl. We're inexperienced with this type of capture: archive files as-is and use a tacked-on script and a special server-side interface to interpret links in the new environment. Our model has been to actually alter the internal links from absolute to relative formats so that it works in any environment.

42

Gabriela Gray: Mayor-Elect Villaraigosa

CDL Report to Curator

URL: http://www.antonio2005.com
Curator's original comments: Critical Aspects: Flash animation; content scattered across multiple servers; maintaining complex internal link structure; JavaScript menus; streaming media
Site copyright statement: "©2005 Villaraigosa for Mayor 2005"
Crawl Results

Comments from crawl operator

• (for Linked hosts results) Need feedback on media etc. retrieved -- this site is an ideal example of the need for scope+one.

• (for Restricted to original host) How much was left out due to the domain restriction? Need feedback.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 263 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:
[urls] [bytes] [host]
817 10291631 ensim3.interlix.com

43

805 117538973 www.antonio2005.com
472 6333775 www.laopinion.com
265 21173 dns
110 19355921 www2.dailynews.com
100 16605730 www2.dailybulletin.com
95 1410145 www.americanpresidents.org
86 820148 www.dailynews.com
73 168698 www.chumpsoft.com
72 52321 images.ibsys.com
69 836295 www.laobserved.com
65 137700 www.mysql.com
55 213569 www.ensim.com
55 177141 www.lamayorcnc.com
55 296311 www.surveyusa.com
53 495858 abclocal.go.com
52 522324 www.c-span.org
51 244668 gallery.menalto.com
[list truncated…]

Curator Feedback to CDL (Gray Villaraigosa)

Crawl Success: mostly effective
Crawl Success Comments: Doing some spot checks, it looks like all of the pages were captured. Some problems with media files -- WERA shows them when we search, but the files are often size 0. In addition, many files on external servers are listed and even have some descriptive info, but when we click on Overview it says "Sorry, no documents with the given uri were found", so no idea if they were really captured.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: Most of the links to other sites go to pages that truly are external to the site, not incorporated into it. The only exception would be the pages from ga3.org and ga4.org.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
The results on the multiple hosts crawl are mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains rather than source + 1. Also, there seem to be a lot of links going to ensim3.interlix.com -- there were more files captured from there than from antonio2005. This is a case where it looks like the entire site (or much of it) was mirrored on two different servers. When we find these we often try to collapse them into one seamless whole, eliminating the duplication, which is meaningless from the user's viewpoint. Simply capturing both mirrors and leaving the cross-links intact is an option we've used when we can't collapse, but it often leads to problems with links between the two, which seems to be the case here.

44

Crawl Frequency: once
Questions/Comments about crawl: We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE -- the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.

45

Ron Heckart and Nick Robinson: Public Policy Institute of California

CDL Report to Curator

URL: http://www.ppic.org
Curator's original comments: "We are particularly interested in their publications. We hope the crawler will be able to report when new publication files are posted on the website. Our main focus of interest is on their new publications at http://www.ppic.org/main/newpubs.asp."
Site copyright statement: "All Contents © Public Policy Institute of California 2003, 2004, 2005"
Crawl Results

Comments from crawl operator: We can't at the moment use Heritrix to report on new publications posted.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 238 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:
[urls] [bytes] [host]
2421 324309107 www.ppic.org
433 1367362 www.cacities.org
238 19286 dns
229 4675065 www.icma.org
200 598505 bookstore.icma.org
151 1437436 www.greatvalley.org
144 517953 www.kff.org
137 5304390 www.rff.org
113 510174 www-hoover.stanford.edu
102 1642991 www.knowledgeplex.org
97 101335 cdn.mapquest.com

46

81 379020 www.cde.ca.gov
73 184118 www.ilsg.org
68 4539957 caag.state.ca.us
62 246921 www.milkeninstitute.org
[list truncated…]

Curator Feedback to CDL (Heckart PPIC)

Crawl Success: mostly effective
Crawl Success Comments: There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to Publications and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on "Date" retrieves a message "Sorry, no documents with the given uri were found". 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found".
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name or the wider scope that includes linked pages from other hosts?)
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Crawl Frequency: weekly
Questions/Comments about crawl: Our crawl report included the following comment from the crawl operator: "We can't at the moment use Heritrix to report on new publications posted." The ability to report on new publications is critical to our goal of using the crawler as a discovery tool. What are the prospects for providing this functionality in the future?
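Heritrix itself could not produce such a report at the time, but the basic comparison is straightforward. The sketch below is a hypothetical illustration only: it diffs the URL sets of two successive crawl logs, and both the log column positions and the publications-path filter value are assumptions rather than documented WAS behavior.

    # Hypothetical sketch of a "new publications" report built from two crawl logs.
    def urls_from_crawl_log(path, url_column=3):
        """Collect URLs of successfully fetched documents from a crawl log.
        The status-code and URL column positions are assumptions about the log format."""
        urls = set()
        with open(path, errors="ignore") as fh:
            for line in fh:
                fields = line.split()
                if len(fields) > url_column and fields[1].isdigit() and int(fields[1]) == 200:
                    urls.add(fields[url_column])
        return urls

    def new_publications(old_log, new_log, marker="/main/"):
        """List URLs seen only in the newer crawl, optionally filtered to a
        publications path (the default marker is illustrative)."""
        added = urls_from_crawl_log(new_log) - urls_from_crawl_log(old_log)
        return sorted(u for u in added if marker in u)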

47

Terry Huwe: AFL-CIO

CDL Report to Curator

URL: http://www.aflcio.org
Curator's original comments: "This site is content rich and has many files that will be useful in the future. Specific areas that are of special interest follow below. http://www.aflcio.org/corporatewatch: the data related to executive pay watch is especially useful. http://www.aflcio.org/mediacenter: would like to see press stories captured if possible. http://www.aflcio.org/issues: links to newsletters and original content. Also, the "Legislative Action Center" on the home page; this is a useful topic guide to legislative history from a labor perspective."
Site copyright statement: "Copyright © 2005 AFL-CIO"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 2572 additional hosts were crawled. The following hosts supplied more than 75 URLs to your site:

[urls] [bytes] [host]
12702 481956063 www.aflcio.org
2657 184477 dns
1375 35611678 www.local237teamsters.com

48

570 8144650 www.illinois.gov
502 52847039 www.ilo.org
435 3851046 www.cioslorit.org
427 2782314 www.nola.com
401 8414837 www1.paperthin.com
392 15725244 www.statehealthfacts.kff.org
326 4600633 www.dol.gov
288 12303728 searchoxide.com
284 3401275 www.sikids.com
280 3069385 www.washingtonpost.com
272 1480539 www.cdc.gov
235 5455692 www.kff.org
[list truncated…]

Curator Feedback to CDL (Huwe AFL-CIO)

Crawl Success: effective
Crawl Success Comments: I realize the collection interface is a "work in progress" and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g. "Working Families Toolkit", "BushWatch") that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2 million-plus hits per year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses of the toolkit. Having said all that, my short answer is: I'm stoked. :-)
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In the case of this collection there's a lot of original content (and context-making documents) on the linked sites, simply due to the hierarchical/federal nature of the labor movement.
Crawl Frequency: monthly
Questions/Comments about crawl: I may have missed correspondence on this question, as I've been busy with a library renovation here. My question: Is the 1 gig limit a useful one? I couldn't help wondering what the actual figure would be if that limit were set higher. I'm generally interested in the total size of Web collections, as I need to monitor ours (which is content-rich but even so does not exceed 1 gig). So discussion of this at least would be interesting, and maybe, if it's not technically challenging, that 1 gig level might be raised and we could see what happens. Thanks to all involved--very interesting process.

49

Kris Kasianovitz: Los Angeles Dept. of City Planning

CDL Report to Curator

URL: http://cityplanning.lacity.org
Curator's original comments: "Website uses frames. Most of the documents will be pdfs. Of particular interest - the EIRs, which are not archived on the page once the project is approved. See http://cityplanning.lacity.org/EIR/TOC_EIR.htm - General and Community Plans: http://cityplanning.lacity.org/complan/gen_plan/genplan2.htm, http://cityplanning.lacity.org/ComPlan/cpbpage.htm"
Site copyright statement: No copyright information found
Crawl Results

Comments from crawl operator: (Linked hosts included crawl) ended because it ground on for 3 days without hitting the data limit; not sure if the URLs at the end of the log are valid/useful.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 119 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]

50

10493 840876945 cityplanning.lacity.org
601 5156252 metrolinktrains.com
183 644377 www.cr.nps.gov
121 11162 dns
90 977850 www.metrolinktrains.com
81 1207859 www.fta.dot.gov
79 263432 www.fypower.org
66 333540 www.adobe.com
64 344638 lacity.org
63 133340 ceres.ca.gov
60 274940 www.amtrak.com
59 389217 www.nhtsa.dot.gov
58 347752 www.unitedweride.gov
52 209082 www.dot.gov
52 288783 www.nationaltrust.org
51 278949 www.portoflosangeles.org
[list truncated…]

Curator Feedback to CDL (Kasianovitz LA City Planning)

Crawl Success: mostly effective
Crawl Success Comments: The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all pdfs. When searching specifically for EIRs, I got a large result list (699 citations); however, when I investigated whether or not the actual file was captured, I found that the main EIR page was captured (typically a htm/html file), but when I clicked on a link to get to the full report, all I got was the "Sorry, no documents with the given URI were found" message. This could be that the file was no longer available when the site was harvested. However, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved. Here are a few specific examples:
Final EIR directory: http://cityplanning.lacity.org/EIR/Tocfeir.htm
Sierra Canyon Secondary School (cover page): http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/SierraCyn_coverpg.htm
Access to the Draft EIR and Final EIR is provided from this cover page. Within the system, the links to both the Draft and Final are broken; no documents with that given URI: http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/DEIR/Table of Contents.htm
Villa Marina EIR: http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
Directory of EIR notices of preparation: http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM

51

This provides links to EIR preparation notices (all pdfs), a total of 27 links/documents. I encountered the following three issues:

• pdf opened = when clicking on the link to the notice, the pdf opened with no problem: 16 of 27

• "Sorry, no document with the given uri was found" = no pdf harvested, but I could get to it from the live site: 4 of 27

• Acrobat "Could Not Open" message (could open live page outside of WERA) = the following Acrobat message came up when I tried to open the pdf: "Acrobat could not open 'ENV-2005-0881-EIR[1].pdf' because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded). To create an Adobe Acrobat PDF document, go to the source application. Then print the document to Adobe PDF." I copied and pasted the url into a regular browser and could open the pdf with no problem. Also, in a few cases, if I clicked on the GO button after the first attempt to open the pdf in the system, it seemed to launch: 7 of 27

Conversely, I found a number of pages that contained full documents in html with links to pdfs that worked with no problem. See the following document: http://cityplanning.lacity.org/cwd/gnlpln/transelt/TET2Bkgrnd.htm. File types and error codes were what I expected.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The "restricted to original host" crawl is more helpful/relevant for getting to the materials from a specific agency. For some of the results that were restricted to host, I was getting external links. Here are some comparisons for each of the crawl settings:
Searched for "villa marina":
LA Dept of City Planning: 6 results
http://cityplanning.lacity.org/EIR/NOPs/ENV-2004-3812-EIR.pdf
http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM
http://cityplanning.lacity.org/EIR/Tocfeir.htm
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/CwdGnlPln/HsgElt/HETblFigApVHgSit.htm
LA City Dept of Planning (via): 2 results
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/CwdGnlPln/HsgElt/HETblFigApVHgSit.htm
Searched for "eir":
LA Dept of City Planning: 699 results
LA City Dept of Planning (via): 324 results

52

For both of these searches, the URIs were from cityplanning.lacity.org.
Searched for "transportation":
LA Dept of City Planning: 699 results
LA City Dept of Planning (via): 290 results (most are from external sources and tended to be the index or main page of another agency or organization. Because this just got me to the main page, and none of the links functioned at that level, the via result was less helpful. However, the via results are useful for discovering other agencies or organizations that I should be looking at for materials.)
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): The list of linked hosts provided at the end of the report is helpful. This will help me target other agencies that might be key places to check for further collection development.
Crawl Frequency: monthly
Questions/Comments about crawl: I want to qualify the frequency for this site: I'd like to do a monthly crawl for 3-4 months. I'd want to reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot of new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation. After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report there are 8899 documents in this crawl alone. My other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls. I should note that at this point in time, for local documents, I am more interested in individual documents, more so than capturing an entire website and preserving the functionality. Local agencies (with the exception of perhaps the Mayor's site) tend not to change the design (look and feel) very often or have a very sophisticated design (i.e. flash, changing images, etc.), and I don't see that this would be of interest to researchers' needs; it is the content, reports, maps, etc. that are contained/accessed on the websites that are important. Maybe I'm wrong or being short-sighted about that.


Kris Kasianovitz: Southern California Association of Governments
CDL Report to Curator

Curator's original comments: "This is a critical regional agency for Los Angeles, Orange, Ventura, Imperial, Riverside, and San Bernardino counties. Its main areas are Transportation, Housing, and Economic Development. This will provide an analysis of the overall site, which has a lot of content. Publications/reports are typically in PDF – they are presented as full reports and pieces of the report (for easier downloading), so there might be duplication. The full report is really all that would be needed. See http://www.scag.ca.gov/publications. The Resources page contains the PDFs, images, dynamic content, GIS programs, including an interactive atlas: http://www.scag.ca.gov/resources.htm. One part of the Resource site is the Web Accessible Geographic Data Search (WAGS): http://mapsvr.scag.ca.gov/wags/index.cfm?fuseaction= It requires a user-created login and password (although there is a guest login that allows you to bypass this). I'm not sure what kind of difficulty the harvester will encounter with this portion of the site. The interactive atlas also has a create an account/guest login issue: http://mapsvr.scag.ca.gov/atlas/presmap.asp?Cmd=INIT. Since it is a dynamic page, I don't know how this will be handled by the harvester."
Site copyright statement: "© 1999-2005 Southern California Association of Governments"
Crawl Results


Comments from crawl operator: Interesting login problem: Heritrix was unable to retrieve the guest login pages. Cold Fusion and ASP don't generate new URLs and thus don't get crawled. NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than 1 minute.
Robots.txt file: The site you selected forbids crawlers from gathering certain data. It reads:

User-agent: *
Disallow: /_mm
Disallow: /_notes
Disallow: /_baks
Disallow: /MMWIP

User-agent: googlebot
Disallow: /csi

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 500 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
2517 863231651 www.scag.ca.gov
690 6134101 www.metrolinktrains.com
506 40063 dns
428 1084533 www.cacities.org
397 16161513 www.sce.com
196 581022 bookstore.icma.org
187 4505985 www.icma.org
175 7757737 www.ci.seal-beach.ca.us
158 1504151 www.h2ouse.org
149 940692 www.healthebay.org
137 317748 www.ci.pico-rivera.ca.us
130 18259431 www.ci.ventura.ca.us
123 490154 www.chinohills.org
121 406068 www.lakewoodcity.org
119 203542 www.lavote.net
117 2449995 www.ci.malibu.ca.us
114 744410 www.ci.irvine.ca.us
113 368023 www.whitehouse.gov
109 974674 www.dot.ca.gov
107 892192 www.lacanadaflintridge.com
[list truncated…]

Curator Feedback to CDL (Kasianovitz SCAG)

Crawl Success: mostly effective


Crawl Success Comments: Similar to my comments about the Los Angeles Dept. of City Planning - the crawl brought back a lot of webpages but not all the publications/documents that I would want to collect. Again, the same problem happened with SCAG as happened with the EIR example: the crawl brought back HTML pages with links to reports (typically in PDF format), but the actual documents were not captured. While the webpage is helpful as it gives context, the main content that I'd want to capture (the reports) was not captured. See the following for example; none of the webpages linked from this page are available, and they should link to a page that will have the material. I tried searching for the documents separately and couldn't get to them. See http://www.scag.ca.gov/publications/index.htm (the timeline arrows at the top seemed to function; I'm not sure what this is for) and http://www.scag.ca.gov/livable/pubs.htm. I was impressed to find that zip files were captured and I was able to download them. Unfortunately, when I opened them there wasn't any content. (I did the same search by mistake with the Arizona Dept. of Water Resources and actually found content in the folders.) I found 10 with the search "type:zip". GIF or JPG images retrieved are not useful - most were just bars or bullets or covers of reports (although this might be helpful to identify titles, I think I would end up discarding these -- after doing more checking of the results).
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments from LA Dept. of City Planning. Restricted gets me to the relevant materials for that agency; via brings back too many main webpages for other agencies to be useful.
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) I expected that there would be a problem with the content behind logins. The crawl confirmed that material behind login screens couldn't be captured. Can I get a copy of these crawl results? NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than one minute.
Crawl Frequency: monthly
Questions/Comments about crawl: How to handle the copyright issue? For the login information, I'm not sure what all was blocked by the robots file. Interesting that metrolinktrains.com is the #1 related host for both of my crawled sites.


Linda Kennedy: California Bay Delta Authority
CDL Report to Curator

URL: http://calwater.ca.gov
Curator's original comments: "We are interested in the environmental impact statements and other key documents, and the various news releases and other announcements and archives of CALFED."
Site copyright statement: "© 2001 CALFED Bay-Delta Program"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 519 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
1130 473192247 calwater.ca.gov
741 201538533 www.parks.ca.gov
521 40442 dns
373 51291934 solicitation.calwater.ca.gov
242 78913513 www.calwater.ca.gov
225 410972 cwea.org
209 87556344 www.science.calwater.ca.gov
173 109807146 science.calwater.ca.gov
172 1160607 www.adobe.com
129 517834 www.whitehouse.gov
[list truncated…]

Curator Feedback to CDL (Kennedy CALFED)


Crawl Success: mostly effective
Crawl Success Comments: I looked closely at the CALFED home page (http://calwater.ca.gov), the Tribal home page (http://calwater.ca.gov/Tribal/Tribal_Home.shtml), Key Documents (http://calwater.ca.gov/CALFEDDocuments/CALFEDDocuments.shtml), and the Archives page (http://calwater.ca.gov/Archives/Archives.shtml). The crawl did not complete in either the via or non-via search. Nearly all linked pages were retrieved in the non-via search. However, the retrievals from the via search were much less complete than the retrievals from the non-via search. For example, on the Key Documents page there were 3 missing links from the non-via search, but 14 missing links from the via search. When Adobe documents were retrieved from either crawl, they came up correctly. Three .asp links of tribal maps from the Tribal home page were retrieved by the non-via search but not the via search. A few of the images were missing from the displays, and this was also affected by the browser used. Usually the same image was missing from both crawls, but sometimes the images were more complete in the non-via crawl retrievals. There were some display problems with the right-hand menu boxes on the http://calwater.ca.gov/Tribal/Tribal_Home.shtml page, for example, that did not display correctly when viewed via Firefox. The same page viewed correctly on Internet Explorer, but when printed out the boxes printed out incorrectly, just as viewed on the Firefox browser. Grant Opportunities (http://calwater.ca.gov/GrantOpportunities/GrantInformation.shtml): this link did not work in 2 via and 2 non-via instances (from the Tribal home page and from the Archives page) but did work on one non-via crawl page (the CALFED home page). It could also be searched and retrieved directly from the test crawl search page.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments above. The non-via search was substantially more complete.
Crawl Frequency: monthly


Janet Martorana: Santa Barbara County Department of Planning and Development
CDL Report to Curator

URL: http://www.countyofsb.org/plandev/default.htm
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 487 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
3119 1102414495 www.countyofsb.org
485 34416 dns
428 1083047 www.cacities.org
357 6126453 www.sbcphd.org
320 6203035 icma.org
250 438507 www.sbcourts.org
234 1110744 vortex.accuweather.com
200 593112 bookstore.icma.org
[list truncated…]

Curator Feedback to CDL (Martorana SBCD)

Crawl Success: mostly effective


Crawl Success Comments: Most documents I expected to find were captured, but a number were not. For example, off this page, http://www.countyofsb.org/plandev/comp/threeyear/2005-2008/default.html, I expected to get to the final work program, http://www.countyofsb.org/plandev/pdf/comp/programs/Three_Year_WP/2005-2008_3YrWrkProgram.pdf, but got the "Sorry, no documents with the given URI were found" message. Other examples: within the Energy Division, a part of the Planning & Development Dept., off this page, http://www.countyofsb.org/energy/information.asp, I could access all links except for two: http://www.countyofsb.org/energy/information/oil&GasFields.asp (Oil and Gas Fields) and http://www.countyofsb.org/energy/information/oil&GasProduction.asp (Oil and Gas Production). The crawler seemed to cut off the URL right before the ampersand; perhaps it has problems with ampersands.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly
Questions/Comments about crawl: I'd like to be notified when I've navigated away from the captured site. I found myself on real-time web sites but couldn't tell what was on the captured site and what was real. Example: When I did a search on "santa barbara" I got 1528 results using the via results. I went to the County of Santa Barbara Online site and then, after clicking on links, I was on the web (real time) and not in the crawl results database any longer, yet there were no indications that I had left the crawled database. The WERA URI was still displaying at the top of the screen. I couldn't tell what were the captured sites and what were the current real-time sites. Other observations: the webpage navigation doesn't work, e.g. the Table of Contents doesn't jump to that section on the webpage (www.countyofsb.org/energy/projects/shell.asp and www.countyofsb.org/energy/mitigation/oakProject.asp); links to glossary terms go to the glossary but not to the term itself.
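The ampersand behavior described above is consistent with a link extractor that mishandles the HTML-encoded form of "&" in href values. The snippet below is a hedged illustration only (the page fragment and URL are hypothetical stand-ins, not taken from the Santa Barbara site): a parser that decodes entities before queuing keeps the full URL intact.

from html import unescape
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # HTMLParser already decodes &amp; in attribute values;
                    # unescape() is a harmless safety net for raw strings.
                    self.links.append(unescape(value))

page = '<a href="information/oil&amp;GasFields.asp">Oil and Gas Fields</a>'
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)   # ['information/oil&GasFields.asp']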


Lucia Orlando: Monterey Bay National Marine Sanctuary
CDL Report to Curator

URL: http://montereybay.noaa.gov
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 795 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
5272 468755541 montereybay.noaa.gov
861 61141 dns
554 20831035 www.wunderground.com
368 4718168 montereybay.nos.noaa.gov
282 3682907 www.oceanfutures.org
273 10146417 www.mbnms-simon.org
260 7159780 www.mbayaq.org
163 61399 bc.us.yahoo.com
152 1273085 www.mbari.org
146 710203 www.monterey.com
119 3474881 www.rsis.com
119 279531 www.steinbeck.org
118 1092484 bonita.mbnms.nos.noaa.gov
109 924184 www.duke.edu
104 336986 www.montereybayaquarium.org
103 595953 icons.wunderground.com
102 339589 www.uncw.edu
[list truncated…]

Curator Feedback to CDL (Orlando – Monterey Bay)

Crawl Success: (rating not provided)
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: This site contains a large mix of government policy, laws, and regulatory information, as well as links to recreational and educational activities available in the MBNMS. I was most interested in links to laws/regs/policy and educational info and organizations. I thought the links restricted to the original host best captured this information succinctly.
Crawl Frequency: unknown


Richard Pearce-Moses: Arizona Department of Water Resources
CDL Report to Curator

URL: http://www.azwater.gov
Curator's original comments: (redirects to http://www.azwater.gov/dwr) In arid Arizona, water is one of the most important – and most contested – resources. The publications and records of this Department are of critical value to the state. Our spider can get many files from this site (1474 files in 258 directories). We are mostly interested in documents by programs. Although our spider can't get the imaged documents database (http://www.azwater.gov/dwr/Content/ImagedRecords/default.htm), this directory may not be critical if we can get the imaged documents transferred to us. We are not interested in blank forms and applications.
Site copyright statement: "Copyright © 1998 - 2005 Arizona Department of Water Resources and ADWR Network. All Rights Reserved."
Crawl Results

Questions for curator: Did this capture the documents you needed?


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 195 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
2233 988447782 www.azwater.gov
286 2350888 www.water.az.gov
253 4587125 www.groundwater.org
226 3093331 www.azcentral.com
196 15626 dns
178 395216 www.macromedia.com
128 1679057 www.prescott.edu
123 947183 www.azleg.state.az.us
115 792968 www.usda.gov
[List truncated…]

Curator Feedback to CDL (Pearce-Moses AZWater)

Crawl Success: (not provided)
Crawl Success Comments: We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand it, wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can; that may be the difference. The crawl is listed as not completing, but it appears to be very close based on the total number of bytes downloaded.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Questions/Comments about crawl: It seemed odd that it took 2h 54m to crawl 7709 documents with linked hosts, but 4h 4m to crawl only 4888 docs when the spider was restricted to the original host.
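The curator's observation can be made concrete with a quick throughput calculation using the figures reported above; the short sketch below simply divides documents by elapsed hours for the two settings.

crawls = {
    "linked hosts included": (7709, 2 + 54 / 60),   # documents, hours
    "original host only":    (4888, 4 + 4 / 60),
}
for label, (docs, hours) in crawls.items():
    print(f"{label}: {docs / hours:,.0f} documents per hour")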


Richard Pearce-Moses: Citizens Clean Election Commission
CDL Report to Curator

URL: http://www.ccec.state.az.us/ccecscr/home.asp
Curator's original comments: This commission was established by initiative. Its work is of great historical significance, as it is changing the way the public elects officials. We have not been able to spider this site because links are buried in JavaScript. (We use wget as our spider.) We are primarily interested in acquiring their publications, election data, and things listed under "popular links".
Site copyright statement: "Copyright 2004 Arizona Citizens Clean Elections Commission. All Rights Reserved."
Crawl Results

Comments from crawl operator: JavaScript issue; interesting problem. Need curator feedback about what we captured.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, the following additional sites were crawled. Total: 15


[urls] [bytes] [host]
929 95456563 www.ccec.state.az.us
76 6117977 www.azcleanelections.gov
55 513218 az.gov
49 499337 www.governor.state.az.us
44 174903 www.adobe.com
40 141202 www.azleg.state.az.us
31 18549 www.az.gov
28 202755 www.azsos.gov
23 462603 gita.state.az.us
19 213976 www.benefitoptions.az.gov
17 89612 www.azredistricting.org
14 1385 dns
3 1687 wwwimages.adobe.com
2 1850 www.capitolrideshare.com
2 26438 www.ftc.gov

Curator Feedback to CDL (Pearce-Moses CCEC)

Crawl Success: (not provided)
Crawl Success Comments: We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in JavaScript. We have not been able to crawl this site with wget.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)


Juri Stratford: City of Davis
CDL Report to Curator

URL: http://www.city.davis.ca.us
Curator's original comments: "We are primarily interested in the GIS data produced for the City of Davis. As the GIS data represent the largest part of the City of Davis web site, it may not be much more difficult to archive the site at the top level. Mapping and Geographic Information Systems URL: http://www.city.davis.ca.us/gis"
Site copyright statement: "This web site is Copyright © 2004 by the City of Davis. All Rights Reserved. The City retains the copyright on all text, graphic images, and other content of this site. You may not copy, modify, and/or re-use text, images, or other web content from this web site; distribute the City's web content; mirror content from this web site on a non-City server; or make any other use of the content of this web site that would violate the City's copyright, without written permission from the City of Davis. To the extent allowed by law, commercial use of our web material is prohibited without written permission from the City of Davis. All art work shown on these web pages is protected by US Copyright laws. Limited reproduction for non-commercial purposes can be authorized by the City of Davis, provided that requests are approved prior to use. Contact the Community Development Department Cultural Services program staff at (530) 757-5610 for more information. Some content included in this web site may be provided courtesy of third parties who may retain copyright control of the provided material. Any service marks and trademarks contained herein are the property of their respective owners."
Crawl Results


Comments from crawl operator: "GIS: potential issue: /img is disallowed by robots.txt, e.g. http://www.city.davis.ca.us/img/featured/map-static.jpg can't be retrieved; also some maps on a second server are disallowed. Need feedback about the GIS material that was captured: what was captured that is useful? Much duplication -- pages captured repeatedly."
Robots.txt: The site you selected prohibits crawlers from collecting certain documents. The file reads:

User-agent: *
Disallow: /img
Disallow: /calendar
Disallow: /misc/email.cfm
Disallow: /ed/business
Disallow: /gis/oldmap
Disallow: /police/log
Disallow: /pcs/grants/sacog
Disallow: /jobs/listings
Disallow: /css
Disallow: /pcs/nutcracker/history.cfm
Disallow: /pcs/nutcracker/pdfs

User-agent: asterias
Disallow: /

User-agent: gigabot
Disallow: /
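One way a curator could check in advance which URLs a robots.txt file like this would block is with Python's standard robots.txt parser. The snippet below is illustrative only; it uses a shortened, assumed rule set rather than an authoritative copy of the Davis file.

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /img
Disallow: /calendar
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

for url in ("http://www.city.davis.ca.us/img/featured/map-static.jpg",
            "http://www.city.davis.ca.us/gis/library"):
    print(url, "->", "allowed" if rp.can_fetch("heritrix", url) else "blocked")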

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 420 additional hosts were crawled. The following hosts supplied more than 50 files to your site:


[urls] [bytes] [host]
16455 947871325 www.city.davis.ca.us
420 29555 dns
332 10377948 www.asucd.ucdavis.edu
305 33270715 selectree.calpoly.edu
279 3815103 www.w3.org
161 2027740 www.cr.nps.gov
139 941939 www.comcast.com
133 951815 www.yolocounty.org
[List truncated…]

Curator Feedback to CDL (Stratford Davis)

Crawl Success: mostly effective
Crawl Success Comments: Looking at the GIS Online Maps page, it's not clear which formats were retrieved and which were not. For example, the Growth Map Flash file downloads fine, but the Flash/ArcIMS files do not download.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In general, restricted to original host works better. The broader search includes pages from businesses (e.g. Comcast) and other local agencies (e.g. other local and state government sites). But restricting the outside sites to the first level seems to be a good compromise.
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) In general, it looked like it did a good job pulling back geographic data/images. For example, you can pull down data from http://www.city.davis.ca.us/gis/library. It's difficult for me to get a sense of the level of duplication from the way the search results display.
Crawl Frequency: monthly
Questions/Comments about crawl: I like the ability to navigate within the archive once you have a starting point. This is very nice.


Yvonne Wilson: Orange County Sanitation District
CDL Report to Curator

URL: http://www.ocsd.com
Curator's original comments: "This is an Orange County agency that has small reports and complex planning documents on its web site."
Site copyright statement: "Copyright © 2001-2005 Orange County Sanitation District. Unless a copyright is indicated, the information on this site is freely available for non-commercial, non-profit making use. If a copyright is indicated on any materials displayed on our website, permission to copy these materials must be obtained from the original source. Commercial use of District materials is expressly prohibited without the written permission of the OCSD."
Crawl Results

Comments from crawl operator: In both crawl settings we had to limit the maximum number of retry attempts in order to complete the crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 85 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
755 85943567 www.ocsd.com
164 7635257 www.ci.seal-beach.ca.us
122 809190 www.ci.irvine.ca.us
95 169207 epa.gov
86 7673 dns
85 559125 order.e-arc.com
66 840581 www.ci.huntington-beach.ca.us
62 213476 www.cityoforange.org
57 313579 www.epa.gov
55 4477820 www.villapark.org
50 1843748 www.cityoflapalma.org
50 463285 www.ocbinc.com
[List truncated…]

Curator Feedback to CDL (Wilson OCSD)

Crawl Success: somewhat effective
Crawl Success Comments: Using the WERA, I searched inquiries by type and title in the two OCSD collections, plain and via. I received no hits for PDF, only the homepage for HTML, and three hits for text. There are many PDF sections in the EIRs. I next searched by title in the two collections; I was the most successful in via. By searching the titles "carbon canyon" and "Ellis Ave Pumping Station," I found articles but not the EIRs, which are available full text. At this point I presumed the crawl did not drill down far enough. Then I searched for a secondary page entitled "Ocean Monitoring"; this time the search found only an internal letter and memo, but not all the documents related to this topic. The via collection search seems to be the most productive, but it is not consistent.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: I searched some of the outside links to US Marine Fisheries and EPA Beach Watch and received no hits.
Crawl Frequency: monthly


Crawl Report Key: Web-at-Risk Test Crawls
This document is a guide to the test crawl report, providing further information about some of the statistics and results conveyed there. Your report includes basic information about the site you submitted, your original comments about that site, and any copyright statements CDL found when we examined the site. Your site was crawled using the Heritrix crawler, version 1.5.1. Your report will include the following information about the site (item and explanation):

Crawl Settings: We crawled each site in two different ways: A) linked hosts included; B) restricted to original host. Where linked hosts were included, we set the crawler to gather any outside page that your nominated site linked to, but no further. So if your site linked to a single document from whitehouse.gov, we captured that document but did not crawl any further on the whitehouse.gov site. Your report will show results from both styles of crawling, to give you a sense of whether or not the site draws heavily from valuable materials on another site. You will also receive a list of the other hosts that the site linked to and how many documents were gathered from those hosts.

Robots.txt file: The presence of a robots.txt file means that the content provider is asking us to refrain from crawling either all or part of the site. This refers only to the host you named in your crawl request. In some cases the site had a robots.txt file but it didn't say anything; we noted when this occurred. We obeyed robots.txt instructions for these crawls, so if the site contained one but we still got a result, that means the robots.txt file only prevented us from crawling certain areas of the site. When robots.txt files were present, we have included the text of that file in your report so you can see which segments of the site the site owner wants to protect.

Crawl duration.

Total number of documents: The "Documents" count will include page components (such as images or Flash files).

File types (mime types): This area will contain a URL. When you go to that URL, you will see a list of the different file types that were retrieved as part of the crawl. IMPORTANT: The Heritrix crawler is currently experiencing difficulty with this report and it is missing a crucial column. Until that is fixed, you can see the different file types retrieved, from most common to least, but you cannot yet tell how many files each one included. CDL will contact you when this report has been fixed.

Response code reports: The URL in this column will lead to a list of response codes in order by frequency. This will include "200" for files that were successfully captured and error codes for files that were not captured. The error code list includes some codes specific to Heritrix; the key to interpreting these codes is at http://crawler.archive.org/articles/user_manual.html#statuscodes. Note that this report only gives you quantitative information about response codes; it does not link response codes to specific files. For these details, see "hosts report and crawl log" below.

How much data collected (bytes): The file size of the total crawl is reported in bytes. You can use the byte conversion tool at http://www.techtutorials.net/reference/byteconverters.html if you want to recalculate the size of the crawl in another measurement, such as kilobytes or megabytes.

Did crawl complete?: This will say "no" if the crawl results exceeded 1 gigabyte or if the crawler encountered an obstacle to capturing the site that could not be fixed.

Location of hosts report and crawl log: You are welcome to review the page-by-page details of the crawl log. This is a generic report that comes with the Heritrix crawler and is not terribly user friendly, but it provides the most detail about the crawl process. This report will list every file that the crawler attempted to get and provide some information about each file. The Heritrix manual can help you interpret this report: http://crawler.archive.org/articles/user_manual.html#logs (go to section 8.2.1, Crawl Log).

Comments from Crawl Operator: These are observations that the Web Archive Programmer made about the crawl process for your site.

Questions for Curator: This section does not appear in every report. If you have a question listed here, please respond to it in the Test Crawl Feedback form.

Your Collection: Important: The screens you will see do not represent the final user interface for the Web Archiving Service tools. We are using WERA, an open-source search and display tool, only to show you your test crawl results. Because we did not seek the right to redistribute these documents, these pages are available only for the purpose of analyzing crawler effectiveness. You must have a password to view these pages; your report will include the address of a wiki page and a login and password. Each site was crawled twice: plain crawl = only pages from the original site were collected; via = pages from the original site, as well as pages that site links to, were collected. Unfortunately, you cannot simply browse your site; you must select a collection and type a search. You will be able to navigate throughout your site once you load a page containing links. You will be able to review your colleagues' sites as well. Note that the WERA display tool is not perfect; if the same document was gathered from more than one crawl, it may not display in every collection.

Related Hosts Crawled: This section provides further information about the additional materials that were gathered when we set the crawler to include documents that your site links to. This can be critical in deciding what settings are needed to capture your site. Some sites, for instance, will keep all of their PDF or image files on a separate server. If you don't allow the crawler to move away from the original URI, you won't capture a critical portion of the site's content. In other cases, however, this setting will lead to irrelevant information. This report includes the most commonly linked hosts from your site.
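The byte counts and hosts reports described above can also be summarized offline. The following is a rough sketch, not part of the CDL tools: it assumes a plain-text hosts report with whitespace-separated columns in "[urls] [bytes] [host]" order, like the excerpts shown in this appendix, converts bytes to megabytes, and lists the hosts that supplied more than 50 URLs. The file name is hypothetical.

def summarize(report_path, min_urls=50):
    # Assumes whitespace-separated lines in "[urls] [bytes] [host]" order.
    rows = []
    with open(report_path) as report:
        for line in report:
            fields = line.split()
            if len(fields) != 3 or not fields[0].isdigit():
                continue                      # skip the header or odd lines
            urls, nbytes, host = int(fields[0]), int(fields[1]), fields[2]
            if urls > min_urls:
                rows.append((urls, nbytes / (1024 * 1024), host))
    for urls, megabytes, host in sorted(rows, reverse=True):
        print(f"{urls:6d} urls  {megabytes:10.1f} MB  {host}")

summarize("hosts-report.txt")                 # hypothetical file name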


The help and documentation for the Web Archiving Service will also need to address the rights analysis issues raised above. In most cases this is work that is done prior to issuing crawls; it cannot be addressed by the design of the WAS interface alone. When the Web-at-Risk project reaches the point of conducting usability studies, we should be sure to include tests that further uncover how users understand crawl frequency settings. Additionally, the desire for an analysis tool that can convey when a site has changed significantly is not limited to this project. It is a common issue faced by the Internet Archive, members of the International Internet Preservation Consortium, and others. CDL is actively communicating with these organizations as we all work toward a solution for this problem. Similarly, CDL should ensure that future assessment and analysis work with our curators addresses the issues raised by the two crawl scope settings. This inquiry should also extend to participants who exemplify end users, to determine whether people using web archives experience crawl scope differently than people who build them. Finally, a number of lessons were learned via the Katrina crawl described in Appendix B. The most outstanding finding is that event-based crawls such as Katrina and site-specific crawls such as these have quite different characteristics and require different functionality and analysis tools.

Next Steps
There is certainly more to learn from these crawl results and about Heritrix; in some cases it is still unclear why the crawler failed to retrieve certain documents. The curators' feedback concerning these results has been extremely valuable. They have provided insight as to what was captured and what is still missing, which would have been difficult to determine without their subject expertise in the sites chosen. As we continue working to improve crawler success and performance, we will turn our attention to the Virtual Remote Control site created by Cornell. This site was developed to present particular problems to crawlers and is well documented, enabling the user to gauge a crawler's results. We will be using that site to replicate particular problems raised in our test crawl set and may also reattempt some of the sites crawled for these tests. When we release the first version of the Web Archiving Service to curators in July 2006, we will request that they include their original test site in the crawls they attempt and compare the results with these tests. The test results for these crawls will remain in place as a point of comparison while we continue developing the Web Archiving Service. Additionally, certain issues raised by this feedback, particularly the desired scope and frequency of crawls, merit attention in future usability analysis work.


Web-at-Risk Test Crawl Report: Appendix A Sites Submitted

Curator | Site | Crawled
Sherry DeDekker | http://ca.water.usgs.gov (California Water Science Center) | Yes
Sherry DeDekker | http://wwwdwr.water.ca.gov (California Department of Water Resources) |
Peter Filardo and Michael Nash | http://www.nycclc.org (New York City Central Labor Council) | Yes
Peter Filardo and Michael Nash | http://www.dsausa.org (Democratic Socialists of America) |
Valerie Glenn and Arelene Weibel | http://www.strengtheningsocialsecurity.gov (Strengthening Social Security) | Yes
Valerie Glenn and Arelene Weibel | http://www.brac.gov (The Defense Base Closure and Realignment Commission) | Yes
Gabriela Gray | http://www.joinarnold.com (Join Arnold) | Yes
Gabriela Gray | http://www.antonio2005.com (Mayor-elect Antonio Villaraigosa) | Yes
Ron Heckart and Nick Robinson | http://www.ppic.org (Public Policy Institute of California) | Yes
Ron Heckart and Nick Robinson | http://www.cbp.org (California Budget Project) |
Terrence Huwe | http://www.aflcio.org (AFL-CIO) | Yes
Terrence Huwe | http://www.seiu.org (Service Employees International Union) |
James Jacobs | http://www.sandiego.gov/planning (City of San Diego Planning Department, analyzed by Megan Dreger) | Yes
James Jacobs | http://www.sandag.org (San Diego Association of Governments) |
Kris Kasianovitz | http://cityplanning.lacity.org (Los Angeles Department of City Planning) | Yes
Kris Kasianovitz | http://www.scag.ca.gov (Southern California Association of Governments) | Yes
Linda Kennedy | http://calwater.ca.gov (California Bay-Delta Authority, CALFED) | Yes
Linda Kennedy | http://www.dfg.ca.gov (California Department of Fish and Game) |
Ann Latta | http://www.ucmerced.edu (UC Merced, analyzed by Elizabeth Cowell) | Yes
Ann Latta | http://www.coastal.ca.gov/web (California Coastal Commission) |
Janet Martorana | http://www.countyofsb.org/plandev/default.htm (Santa Barbara County Department of Planning and Development) | Yes
Janet Martorana | http://www.sbcag.org (Santa Barbara County Association of Governments) |
Lucia Orlando | http://montereybay.noaa.gov (Monterey Bay National Marine Sanctuary) | Yes
Lucia Orlando | http://www.waterboards.ca.gov/centralcoast (Central Coast Regional Water Quality Control Board) |
Richard Pearce-Moses | http://www.azwater.gov (Arizona Department of Water Resources) | Yes
Richard Pearce-Moses | http://www.ccec.state.az.us/ccecscr/home.asp (Citizen's Clean Election Commission) | Yes
Juri Stratford | http://www.city.davis.ca.us (City of Davis, California) | Yes
Juri Stratford | http://www.sacog.org (Sacramento Area Council of Governments) |
Yvonne Wilson | http://www.ocsd.com (The Orange County Sanitation District) | Yes


Web-at-Risk Test Crawl Report: Appendix B The Katrina Crawl

The Crawl
During the early fall of 2005, the California Digital Library was embarking on a series of test crawls as part of the Web-at-Risk project when Hurricane Katrina struck. CDL had requested that curators submit their sample URLs by August 25th, and on Friday, August 26th, Louisiana Governor Blanco declared a state of emergency. It was over that weekend that CDL staff reached the conclusion that an event of this historic significance would shift our priorities. On Monday we suspended our initial test crawl plans and began preparing on a number of fronts to capture web-based materials related to Katrina. Specifically, this meant:

• identifying which web sites or sub-sites to collect, which involved:
  o notifying 30+ curators of our emergency crawl plans and requesting their assistance
  o setting up a web-based interface to gather their suggested seed URLs
  o sorting through the resulting seed list and feeding it to our primary and secondary (Stanford) crawl operators
• determining our collection parameters
• getting our crawlers installed and configured
• locating and setting up disk space to store the crawls
• initiating and monitoring the crawls

This crawl activity posed several new challenges. It was the first time CDL staff had used the Heritrix crawler, which was not yet installed on a production server when the hurricane struck. The last week of August saw CDL and UC Berkeley staff scrambling to find server space to set up Heritrix instances, disk partitions, and job parameters for the crawl. This was done with remarkable speed, owing to the urgency of the situation and to everyone's desire to capture a record of the events. Given our limited experience, in order to reduce the risk of losing the historically significant and fleeting materials related to Hurricane Katrina, CDL worked with Stanford University to concurrently run the same crawl using a different crawler. After both CDL and Stanford had crawled the same seed list for a month, the task was then taken up by the San Diego Supercomputer Center, who have continued crawling these sites using Heritrix.

Gathering the Seeds
The CDL sent out an initial request to the Web-at-Risk curators to submit URLs related to Hurricane Katrina for crawling. We worked collaboratively with a large group of content specialists to identify the sites:

• The 22 curators of the Web-at-Risk project (University of California Libraries, University of North Texas, New York University, Stanford University, and the Arizona State Library)

• The Library of Congress
• Librarians at Louisiana State University, University of Mississippi


Over the course of the crawl, the list of seed URLs grew to over 700 (just over 500 of which were crawled by CDL). The image of our input form above provides some sense of the range of materials collected. Given the sudden nature of this event, there was no time to investigate rights issues or technical problems each site might have presented. We informed the curators that "Our immediate plan is to simply collect the material before it disappears. We will not make the material immediately available." We also had little time for quality control and were not able to guarantee that sites in our seed list would be comprehensively crawled.

Crawling Specifics
CDL's first crawl was run on September 1, 2005, using a seed list of 89 URLs. The final crawl run by CDL was on October 10, using a seed list of 589 URLs. This final seed list was then sent to the San Diego Supercomputing Center, who are continuing to run twice-weekly crawls.


The CDL crawls were done using Heritrix version 1.5.1. We began with a single instance of Heritrix, eventually dividing the seed list among 6 instances. We got through the entire seed list 29 times in 40 days. However, the content we were able to collect was limited by our crawler settings. Our crawler was configured to visit one host at a time, collecting content at a conservative rate, and never more than three hops away from the seed URL. To ensure that the crawler moved through the seed list in a timely way, it was limited to spending 15 minutes at any given host. So the material collected does not represent the entirety of what was available at each site. Note that all of the seeds were crawled with the same configuration, no matter how different the structure of these sites might be. Thus the New Orleans version of Craig's List, NASA's information pages, and blog sites were all crawled in the same manner, despite being quite different in context, architecture, and other characteristics. Given our short preparation time, the goal was to find a crawler configuration that would match the broadest range of Katrina materials.

In most cases, whether the seed URL was for a personal blog or for a government agency, the seed itself was not centrally about Katrina. The Katrina information was generally situated on the front page and top levels of each site, so CDL expected that this setting, even with the 15-minute time limit, would at least capture the content nearest to the front pages of the widest variety of sites we could capture.

In choosing crawler configuration settings we also faced conflicting goals. On one hand, we wanted to capture as much Katrina content as possible. On the other hand, we were very hesitant to start hitting sites that were providing much-needed information at a crucial time (emergency sites, relief sites), thus making sites hard to reach due to our capture activity. Many of these sites were already getting heavily used and perhaps were not running at full capacity. Some were also geographically impacted directly by the hurricane. So we had to choose settings that balanced the need to collect with politeness across a wide range of sites.

When the San Diego Supercomputer Center took over the Katrina crawls in October, they revisited the Heritrix configuration settings. The SDSC crawl placed a limit not on the amount of time spent at a site, but on the number of documents to be retrieved overall. Their crawls are being conducted "breadth first," in the sense of gathering pages consecutively across the seed list rather than gathering pages from one site at a time. The Stanford University crawl of the same sites began on September 7th and continued for 30 consecutive days using the WebVac crawler. While we have statistics concerning the size of the Stanford Katrina collection, it is very difficult to compare the configuration settings and effectiveness of WebVac vs. Heritrix. Further, there is no easy way to display materials gathered with WebVac, so the Stanford Katrina content is stored but is not accessible to us at the moment.
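To make the trade-offs above concrete, here is a toy model, in Python, of the crawl policy described: visit one host at a time, stay within three hops of the seed, and spend no more than 15 minutes per host. This is not the actual Heritrix 1.5.1 job configuration; fetch(), extract_links(), and same_host() are placeholder callables standing in for real crawler components.

import time
from collections import deque

MAX_HOPS = 3
HOST_BUDGET_SECONDS = 15 * 60

def crawl_host(seed, fetch, extract_links, same_host):
    # Only the hop- and time-budget logic is modeled here.
    start = time.monotonic()
    queue = deque([(seed, 0)])            # (url, hops from the seed)
    seen = {seed}
    while queue and time.monotonic() - start < HOST_BUDGET_SECONDS:
        url, hops = queue.popleft()
        page = fetch(url)                 # capture the page itself
        if hops >= MAX_HOPS:
            continue                      # keep the page, but go no deeper
        for link in extract_links(page):
            if link not in seen and same_host(link, seed):
                seen.add(link)
                queue.append((link, hops + 1))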

Katrina Crawl Results
In terms of creating a collection of the Katrina event on the web, we are not satisfied that CDL's crawling efforts were successful. As mentioned, we had to impose time limits on each site visited and were not able to conduct quality control on the URLs that were captured. The total size of the Katrina capture at CDL is 50 GB. This represents 29 completed crawls of the seed list, which grew to over 500 seeds over the 40 days we ran the crawl. It's clear from the byte size alone that we barely skimmed the surface, and indeed the Stanford crawl proves this point: when the Stanford crawl was based on about 370 seeds, it was pulling in about 35 GB a day, or 1.5 million pages a day. In spite of this, our attempts to capture Katrina news and events did have some very useful and positive outcomes. In terms of test crawling, the experience was quite successful, opening our eyes both to certain obstacles and to new possibilities. Most importantly, it prompted us to consider the nature and demands of event-based crawling, which had not been part of our original test plan. Here are some of the aspects of event-based crawls that Katrina surfaced.

Site Selection and Classification
This type of event demands a deep collaborative effort to identify material to collect. CDL had to mobilize a large number of curators to make their best guesses at identifying promising sites. We set up a 'crawl seed nomination' web form accessible to the curators from their home institutions and invited the curators to use it to enter their suggested seed URLs. CDL staff did not have time to add much automatic error or duplicate checking features, so this cumbersome work was done by hand. On the other hand, it provided us a first-hand trial of what a more general curator user interface might require.

The selection and management of seed lists is critical for sudden event-based crawls. The curators contributing the URLs will not necessarily be well-versed in the topic; in the case of Katrina, curators in California were not uniformly familiar with the Gulf Coast, the towns, the government agencies, etc. In addition, it is difficult to predict which aspects of the event will be of historic, enduring value. Because disk storage was not a pressing issue, it was better to err on the side of a wider net when selecting sites. It became clear that there might be a role for "smart crawling," which would spread a wider net than that provided by human-generated seed lists through the use of automated tools that discovered relevant materials. We were specifically interested in using tools developed at Emory and Cornell that interface with the Rainbow text classification tool.7 We did not have the resources to investigate this for the Katrina crawl but plan to fold it into future crawling tests.

Finally, it is worth examining why the seed list grew continuously throughout the event. Part of this growth is due to the natural increase in the number of pages and sites devoted to the emergency and having the additional time to identify and add them. However, the nature of the event itself changed over time, starting as a hurricane, then becoming a flood, a massive relocation, and a political and social issue. So the range of relevant sites changed as the event itself took on broader implications. This suggests that site selection is an ongoing process, not strictly an activity undertaken at the beginning of an event.

7 Bergmark, Donna. "Heritrix processor for use with rainbow" <http://groups.yahoo.com/group/archive-crawler/message/1905>
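Duplicate checking of nominated seeds, which the passage above notes had to be done by hand, is straightforward to automate. The sketch below shows one possible normalization step; it is not the form CDL actually built, and the submitted URLs are invented examples.

from urllib.parse import urlsplit, urlunsplit

def normalize(seed):
    parts = urlsplit(seed.strip())
    scheme = (parts.scheme or "http").lower()
    host = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((scheme, host, path, parts.query, ""))

submitted = [                                  # invented example submissions
    "http://www.example-relief.org/",
    "HTTP://WWW.EXAMPLE-RELIEF.ORG",
    "http://www.example-relief.org/katrina/",
]
unique = sorted({normalize(s) for s in submitted})
print(unique)   # the first two collapse into a single canonical seed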

Rights, Ownership, and Responsibilities
Our current rights protocol stipulates that when a web page is clearly in the for-profit domain, we must seek permission from the content owners to collect. This restriction applies most notably to news agency sites, which would likely be a critical part of the web-based fallout of any major event. This was certainly the case with Katrina. However, rights negotiation is a time-consuming process. In the case of a planned event such as an election, one is able to anticipate a certain number of sources and take rights management steps before the event occurs. But with emergencies such as Katrina, there is simply no way to secure rights in advance without missing time-sensitive material. Knowing this would be an issue, CDL notified our curators in advance that the Katrina materials we gathered would not be publicly accessible, but merely captured and preserved. The complex rights issues behind web archiving, as well as the collaborative environment described above, raise a number of questions:

• Who owns the content?
• Who is responsible for the project?
• Who is responsible for responding if a content owner objects?
• When is it permissible to ignore robots exclusion files?
• How much rights management can be done as part of post-processing for an event crawl?
• Can there be standing agreements with major news agencies to allow for this type of crawl in emergency situations? If so, what constitutes an emergency?

Technical Infrastructure
CDL did not have an adequate technical infrastructure in place at the time of the Katrina crawl. As mentioned above, staff at both CDL and UC Berkeley scrambled to find server space and to install and configure the crawlers. This sudden shift in workload left the system vulnerable and somewhat unstable. In order to start the crawling process as soon as possible, CDL began the project on a temporary server. Making the transition to more permanent storage in the midst of this crawling project was neither easy nor flawless, and a certain amount of data had to be recovered from backup sources.

Information Analysis and Display
The challenge of analyzing crawl results has been described in some detail in our Test Crawl Report. That report cites a curator who found the task of reviewing an 8899-document crawl to be unmanageable. The tools currently available are not at all up to the task of analyzing a large and complex crawl. An event-based crawl is likely to result in massive amounts of data of widely varying quality. The selection of seeds is based on guesses that, given sufficient resources, should be reviewed, refined, and enhanced as the event progresses. Time series data based on changing input parameters represents a kind of moving target that suggests the need to develop new analysis tools. Andreas Paepcke of Stanford University addressed this problem when considering how a social scientist might want to analyze the Katrina materials. Consider the researcher who is interested in finding out how quickly the notion of race entered the public discourse in the aftermath of Katrina. Paepcke suggests:

"For example, the tool could perform word frequency analysis across successive crawls of Katrina sites so that the user could identify the emergence of word usage such as 'looter', 'ninth ward', etc. The social scientist would then define within the tool clusters of words that in the scientist's judgment are direct or indirect occurrence indicators of the concept 'race' (i.e. ninth ward, looters, lacks, poor). The scientist could subsequently interact with the tool at the level of these well-defined concepts. Example command: count occurrences of 'race' within the first three paragraphs of all pages and tabulate the differences across 10 days. We could go further and apply well-known topic categorization algorithms on the data to suggest new concepts as they arise in the series of text streams. For example, the tool might identify an increasing frequency of the term 'mismanagement' beginning in week two after the storm and alert the sociologist to this evidence of a new discourse theme."8

While we don't expect to be able to incorporate this level of analysis into our web archiving tools any time soon, these ideas illustrate a useful direction for archival analysis tools.
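A bare-bones version of the word-frequency idea Paepcke describes could be prototyped in a few lines. The snippet below is purely illustrative and is not part of the Web-at-Risk toolset; the concept term list and the snapshot text are invented placeholders.

import re
from collections import Counter

CONCEPT_TERMS = {"race": ["looter", "looters", "ninth ward", "poor"]}

def concept_counts(text):
    text = text.lower()
    counts = Counter()
    for concept, terms in CONCEPT_TERMS.items():
        counts[concept] = sum(
            len(re.findall(r"\b%s\b" % re.escape(term), text)) for term in terms)
    return counts

snapshots = {   # invented stand-ins for successive crawl snapshots
    "2005-09-01": "evacuation ordered for the ninth ward",
    "2005-09-08": "reports of looters in the ninth ward; the poor were left behind",
}
for date in sorted(snapshots):
    print(date, dict(concept_counts(snapshots[date])))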

Conclusions
At the time Hurricane Katrina hit, the Web-at-Risk service requirements and test crawl plan had been written for a more orderly and considered approach to web crawling. Our attempt to capture the web-based aftermath of Katrina highlighted our need to revisit the service requirements and see how well they supported an event-based response. We need to develop a collaborative and robust mechanism to identify and collect seed URLs. When an event happens quickly, it is also important to have in place general crawling guidelines that will result in a well-rounded collection. The technical infrastructure for the Web Archiving Service that we are building for the Web-at-Risk project should be robust enough to handle occasional and sudden bursts of activity. Attempting to alter the infrastructure quickly in reaction to emergency events impairs our responsiveness and leaves the altered systems in a vulnerable state.

8 Paepcke, Andreas. Senior Research Scientist and Director of the Digital Library, Stanford University. Email correspondence with Patricia Cruse, October 26, 2005.


Finally, work needs to continue both on the rights management front and on developing improved web archiving analysis tools, so that the material gathered can be used to its greatest potential.


Web-at-Risk Test Crawl Report: Appendix C Individual Crawl Reports
Included below are the crawl reports provided to individual curators, including their analysis and feedback about those results. Note that a key to interpreting the tables in these crawl reports is provided at the end of this appendix.

Elizabeth Cowell (submitted by Ann Latta): UC Merced
CDL Report to Curator

URL: http://www.ucmerced.edu
Curator's original comments: "UC Merced is the first research university to be built in the 21st century. The educational and land use issues are significant. Of particular interest is http://www.ucmercedplanning.net. This site addresses major issues of land use - the university is being built on agricultural land. Controversy existed re: issues of redevelopment of downtown Merced vs. appropriation of agricultural land - there are major environmental issues focused on endangered species - Educational issues involving faculty job descriptions, student body, etc. are significant because of the economic, ethnic, and cultural diversity of the region."
Site copyright statement: "© 2004 UC Regents"
Crawl Results


Comments from crawl operator: When we set the crawl to include pages from linked sites, the crawler got "trapped" at the Elsevier site. There is JavaScript on that linked page that causes the crawler to continue looking for additional pages on the Elsevier site, even when you're only trying to capture a single page. Once we set a limit for the maximum number of retry attempts, the crawl completed. This data is from the completed crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 227 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
969 71552369 www.ucmerced.edu
238 2564803 www.ucop.edu
226 14851 dns
197 16583197 www.universityofcalifornia.edu
156 8487817 www.elsevier.com
151 1437436 www.greatvalley.org
112 2354582 faculty.ucmerced.edu
105 5659795 www.pacific.edu
90 111985 k12.ucop.edu
86 255733 www-cms.llnl.gov
85 1178031 admissions.ucmerced.edu
81 297947 uc-industry.berkeley.edu
71 108265 www.mssmfoundation.org
67 349300 www.nps.gov
66 308926 www.usafreedomcorps.gov
54 137085 slugstore.ucsc.edu
52 52202 www.cerrocoso.edu
51 977315 www.universityofcalifornia.com

Curator Feedback to CDL (Cowell Merced)

Crawl Success: mostly effective


Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: 10 more links came up in the via search; 8 of these were not useful for the research of land use issues. The two that were useful were a benefit of the via search.
Crawl Frequency: monthly


Sherry DeDekker: California Water Science Center
CDL Report to Curator

URL: http://ca.water.usgs.gov
Curator's original comments: "We are interested in the CA water reports and databases behind the links on this page. Also, in looking at http://waterdata.usgs.gov/ca/nwis/nwis, this section is an interactive interface to multiple data sets. Is it possible to capture this type of site as well as the static reports?"
Site copyright statement: "Information presented on this website is considered public information (unless otherwise noted) and may be distributed or copied. Use of appropriate byline/photo/image credit is requested. We strongly recommend that USGS data be acquired directly from a USGS server and not through other sources that may change the data in some way."
Crawl Results

Comments from crawl operator: Databases, e.g. http://waterdata.usgs.gov/ca/nwis/nwis, are disallowed by http://waterdata.usgs.gov/robots.txt. Some water data reports are NOT caught by the broader "linked hosts included" settings, e.g. http://pubs.usgs.gov/wdr/2004/wdr-ca-04-1 -- would also want to submit http://ca.water.usgs.gov/waterdata as a seed.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 662 additional sites were crawled. The following list includes the hosts that supplied more than 50 files. Note that the host "pubs.usgs.gov" supplied a higher number of files than the original host itself.
[urls] [bytes] [host]
1963 255912820 pubs.usgs.gov
1153 47066381 ca.water.usgs.gov
698 56570 dns
404 112354772 geopubs.wr.usgs.gov
385 9377715 water.usgs.gov
327 203939163 greenwood.cr.usgs.gov
318 17431487 www.elsevier.com
219 3254794 www.usgs.gov
189 2737159 www.lsu.edu
163 2292905 wrgis.wr.usgs.gov
158 31124201 www.epa.gov
149 921063 www.usda.gov
[list truncated…]

Curator Feedback to CDL (DeDekker CWSC)

Crawl Success: somewhat effective
Crawl Success Comments: The site appears to access water data reports (http://ca.water.usgs.gov/archive/waterdata/index.html), but none are actually available through the links. I expected the site not to be able to access real-time data, but these are archived reports.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly

Megan Dreger (submitted by James Jacobs): City of San Diego Planning Department

CDL Report to Curator

URL: http://www.sandiego.gov/planning
Curator's original comments: I would like to drill down several levels (at least 3) of this site. For example, following the link to "City of Villages/general plan update" leads to many more important planning documents.
Site copyright statement: This site contains the two following notices on the same page:
"Restrictions on Use of Materials: This site is operated and maintained by the City of San Diego through its Department of Information Technology and Communications (referred to as IT&C). Except as provided herein, no material or information from this site may be copied, reproduced, republished, uploaded, posted, transmitted, or distributed except as authorized in this notice, expressly authorized within this site, or approved in writing by IT&C.
Copyright Notice: Unless a copyright is indicated, information on the City of San Diego Web site is in the public domain and may be reproduced, published or otherwise used with the City of San Diego's permission. We request only that the City of San Diego be cited as the source of the information and that any photo credits, graphics or bylines be similarly credited to the photographer, author, or City of San Diego, as appropriate. If a copyright is indicated on a photo, graphic, or any other material, permission to copy these materials must be obtained from the original source."
Crawl Results

Comments from crawl operator: Need feedback about whether the desired content was retrieved. Question for curator: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 1084 additional hosts were crawled. (This figure does not represent a complete crawl, as the size limitation was reached.) The following hosts supplied more than 75 documents to your site:
[urls] [bytes] [host]
3728 556231640 www.sandiego.gov
1247 38685244 genesis.sannet.gov
1085 80905 dns
807 6676252 www.houstontexans.com
428 1079658 www.cacities.org
399 102298888 www.buccaneers.com
259 1797232 granicus.sandiego.gov
258 42666066 clerkdoc.sannet.gov
238 5413894 www.ccdc.com
225 2503591 www.ci.el-cajon.ca.us
223 1387347 www.ipl.org
217 2683826 www.sdcounty.ca.gov
203 11673212 restaurants.sandiego.com
195 2620365 www.sdcommute.com
192 1344523 www.bengals.com
189 2221192 www.kidsdomain.com
176 1333528 www.buffalobills.com
171 685965 www.chumpsoft.com
166 277238 www.proquest.com
[list truncated…]

Curator Feedback to CDL (Dreger San Diego)

Crawl Success: mostly effective
Crawl Success Comments: This crawl was not completed due to size, so that may explain some of my questions. It was pretty effective in terms of getting the Planning Dept. pages, but it went out further than I expected. Due to the vague request to "drill down several levels" I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems when trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit. It's nice that PDFs and other formats are included.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The linked hosts included (via) crawl seemed to include more extraneous stuff.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.):
Q: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
A: You asked for a better definition of what was meant by the request to drill down several levels (at least 3). Unfortunately, I wasn't the original curator and I'm not sure what he meant. The crawl you did was very useful.
Crawl Frequency: monthly
Questions / Comments about crawl: I think that the crawl frequency should be at least monthly for these pages.

Peter Filardo and Michael Nash: New York City Central Labor Council
CDL Report to Curator

URL: http://www.nycclc.org
Curator's original comments: (none)
Site copyright statement: "© 2004 New York City Central Labor Council. No portion of this website may be reproduced in any form without permission from the Central Labor Council. Contact our offices for more information at nycaflcio@aol.com."
Crawl Results
NOTE: Because your Crawl "A" had to be stopped, then resumed, each of your reports for that crawl is in two segments. To browse a list of all reports for that crawl, go to http://voro.cdlib.org:8081/ingest_misc/ndiipp/testcrawls_raw/filardo_labor_via

Comments from crawl operator: A. Linked hosts included: Crawl complete after recovery, with the addition of max retries. Seemed to hang at http://www.nycclc.org/calendar/event.asp?EventId=501 and http://www.nycclc.org/assets/HLC/application/membership.pdf; ended the crawl when it seemed to hang. Recovered from the previous job; the recovery was successful. Note for the future that a recovered job is identifiable because the logs directory is called logs-R.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, x additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
1913 74260017 www.nycclc.org
156 11755 dns
115 710552 www.aflcio.org
73 1477966 www.comptroller.nyc.gov
71 193264 www.empirepage.com
60 570115 www.redcross.org
58 269079 www.afl-cio.org
57 240845 www.campsussex.org
57 113676 www.mssm.edu
56 449473 www.labor-studies.org
53 184605 www.pbbc.org
52 134326 www.senate.gov
[list truncated…]

Curator Feedback to CDL (Filardo NYCCLC)

None provided

Valerie Glenn and Arelene Weibel: Strengthening Social Security
CDL Report to Curator

URL: http://www.strengtheningsocialsecurity.gov
Curator's original comments: "contains external links to audio & video that would be essential to completing this site (see press room); some are files, some are links to webcasts, some are on external .gov sites, and some are external .com sites"
Site copyright statement: Copyright info not found.
Crawl Results

Comments from crawl operator: Interesting audio/video/webcast content -- need feedback about success in capturing. Questions for Curator: How successfully did this crawl capture the multimedia documents you were interested in?
Comments from coordinator: In the media center area of this site (http://www.strengtheningsocialsecurity.gov/press/media_center.shtml) I've found .ram (both video and audio alone), .smil, and .asx files. The site also contains numerous .ppt and .pdf files.
• A text search on the log file turns up numerous .ram files, only 1 .ppt file.

• .asx files are Windows streaming media redirector files, which generally lead to associated .wmf files. No .asx references appear in the crawl log, nor do any .wmf files.
• Similarly, .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files. I assume that when displayed, some of the Real media files from this site would function, but many of the other multimedia files would not.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 388 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
660 10668874 www.chelseapiers.com
562 7334035 www.whitehouse.gov
477 6366197 www.laopinion.com
391 29623 dns
356 3874719 www.wkrc.com
243 12294240 www.strengtheningsocialsecurity.gov
178 1935969 www.xavier.edu
148 237055 image.com.com
127 682069 online.wsj.com
117 898439 www.omaha.com
116 514995 www.npr.org
108 995733 www.nba.com
[list truncated…]
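Whether media types such as .ram, .smil, or .asx made it into the capture can be checked by tallying successfully fetched URLs by file extension. A minimal sketch, assuming the standard Heritrix 1.x crawl.log layout (status code in the second field, URI in the fourth) and a hypothetical log file name.

from collections import Counter
from urllib.parse import urlparse
import os

MEDIA_EXTS = {".ram", ".smil", ".asx", ".rm", ".wmf", ".ppt", ".pdf"}
captured = Counter()

with open("crawl.log") as log:              # hypothetical path to the crawl log
    for line in log:
        fields = line.split()
        if len(fields) < 4:
            continue
        status, url = fields[1], fields[3]  # fetch status code and URI
        ext = os.path.splitext(urlparse(url).path)[1].lower()
        if ext in MEDIA_EXTS and status == "200":
            captured[ext] += 1

for ext, count in sorted(captured.items()):
    print(f"{ext}: {count} captured")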

Curator Feedback to CDL (Glenn Strength Soc Sec)

Crawl Success: mostly effective
Crawl Success Comments: My main concern about this site was the multimedia documents included - I've posted those thoughts in the "Questions for Curator" text box.
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: For this site it was essential to capture the linked hosts (via), because many of the press materials, etc. were on external sites.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.):
Q: How successfully did this crawl capture the multimedia documents you were interested in?
A: I'm disappointed that not all of the multimedia files were captured, but there seem to be only a few that aren't included.

I found it interesting that broadcasts from the same host (whitehouse.gov) weren't completely captured - some were, some weren't.
Crawl Frequency: once
Questions / Comments about crawl: [note: the curator alludes to adding a comment here, but no comment was received]

Valerie Glenn and Arelene Weibel: Defense Base Closure and Realignment Commission
CDL Report to Curator

URL: http://www.brac.gov
Curator's original comments: "In our previous efforts we have been unable to capture agency databases. The BRAC site includes a document library which has a search feature (http://www.brac.gov/Search.aspx) and a browse feature (http://www.brac.gov/Browse.aspx). We would really like to see how this information can be captured so that we can recreate it on our own servers."
Site copyright statement: "The contents of all material available on this Internet site are in the public domain and are not copyrighted. The content of this site may be freely reproduced, downloaded, disseminated, published, or transferred in any form and by any means. However, in some cases the copyright for certain text or images on this site may be held by other parties."
Crawl Results

Comments from crawl operator: A. Linked hosts included: http://www.brac.gov/Search.aspx can't be captured by Heritrix. http://www.brac.gov/Browse.aspx seems to yield only the first 25 documents. Tried again with the browse page as the starting point, but the crawl stopped after 1005 documents; 20 links were extracted from the browse page, and then there were no more URLs in the frontier queue that had been extracted from Browse.aspx.

Perhaps this needs more experimentation. B. Restricted to original host: again, only the first 25 pages from browse -- we can't even successfully pass a seed URL listing the max docs per browse page (50).
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 44 additional hosts were crawled. Note that because your crawl did not complete, this is not an accurate count of how many other sites your original site may link to.
[urls] [bytes] [host]
2034 1064389540 www.brac.gov
555 5874934 www.slu.edu
87 173510 www.cpcc.edu
54 154588 www.wmata.com
47 685158 www.sluhospital.com
44 3501 dns
44 582555 www.c-span.org
43 174467 www.adobe.com
38 178153 www.q-and-a.org
32 127325 slubkstore.com
24 140653 www.c-spanclassroom.org
23 326680 www.capitalnews.org
22 213116 cancercenter.slu.edu
21 196012 www.defenselink.mil
[list truncated…]
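One workaround for a browse interface that only yields its first page is to enumerate the paginated browse URLs up front and add them to the seed list. The sketch below is purely illustrative: the page parameter and page count are hypothetical, since the report does not record how Browse.aspx actually pages through its document list.

# Generate candidate seed URLs for a paginated browse interface.
# "page" is a hypothetical query parameter -- the real parameter name
# would have to be confirmed by inspecting the site's paging links.
BASE = "http://www.brac.gov/Browse.aspx"
PAGES = 40  # hypothetical number of browse pages

seeds = [f"{BASE}?page={n}" for n in range(1, PAGES + 1)]

with open("extra-seeds.txt", "w") as out:   # hypothetical seed file to merge into the crawl
    out.write("\n".join(seeds) + "\n")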

Curator Feedback to CDL (Glenn Defense Base Closure)

Crawl Success: somewhat effective
Crawl Success Comments: I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: once
Questions / Comments about crawl: One of the reasons I submitted this site to be crawled is that I'd already tried to harvest it using HTTrack. That product appeared to capture more of the documents than Heritrix - but we still haven't been able to capture the entire site.

Gabriela Gray: Join Arnold
CDL Report to Curator

URL: http://www.joinarnold.com
Curator's original comments: "California Recovery Team - non-profit pro-Arnold group, not registered as a campaign committee. Critical Aspects: Complex file and directory naming structure; Looping, e.g. contactus.asp and contactadd.asp links."
Site copyright statement: "Copyright 2005"
Crawl Results

Comments from crawl operator: A. Linked hosts included: Great site for testing -- this loop is really interesting because a new URL is generated with each loop, so that the duplicate-detection underway at IA would still not eliminate it. 44,332 of the retrieved URLs were contact pages. B. Restricted to original host: Got into a loop; by the end, 999 documents retrieved in 34 minutes. C. Restricted to original host + regular expression:

Excluding pages that matched the regular expression contactadd.asp?c= did not end the loop. What did end the loop: excluding both contactus and contactadd pages so they were not retrieved -- a drawback. (IA takes the manual approach of gathering the pages, then having an operator stop the crawl and take out the looping URLs by hand -- not scaleable.)

    <newObject name="contact" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
      <string name="decision">REJECT</string>
      <string name="regexp">.*contact.*asp.*c=.*</string>
    </newObject>

Related hosts crawled: Because of looping problems, we were not able to crawl other hosts linked from this site.
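The effect of a reject rule like the one above can be previewed offline against sample URLs before re-running a crawl. A small sketch; the URLs below are hypothetical stand-ins for the looping contact pages, and the pattern simply mirrors the rule quoted above.

import re

# Pattern mirroring the MatchesRegExpDecideRule above (the exact original form is uncertain)
reject = re.compile(r".*contact.*asp.*c=.*")

candidates = [
    "http://www.joinarnold.com/en/contactus.asp?c=12345",   # hypothetical loop URL
    "http://www.joinarnold.com/en/contactadd.asp?c=67890",  # hypothetical loop URL
    "http://www.joinarnold.com/en/agenda.asp",              # ordinary page
]

for url in candidates:
    decision = "REJECT" if reject.match(url) else "PASS"
    print(f"{decision}  {url}")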

Curator Feedback to CDL (Gray Arnold)

Crawl Success: somewhat effective
Crawl Success Comments: We spot-checked, and it looks like most files were captured, but the individual pages don't display most of the images. (This may simply be a problem with the WERA interface.) Strangely enough, the Flash files work perfectly, which is exactly the opposite of our own capture experience.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: As per the crawl notes, we only checked the original host version, since the via crawl failed.
Crawl Frequency: once
Questions / Comments about crawl: Same comments as on the Villaraigosa crawl. We're inexperienced with this type of capture: archiving files as-is and using a tacked-on script and a special server-side interface to interpret links in the new environment. Our model has been to actually alter the internal links from absolute to relative formats so that the site works in any environment.

Gabriela Gray: Mayor-Elect Villaraigosa
CDL Report to Curator

URL: http://www.antonio2005.com
Curator's original comments: Critical Aspects: Flash animation; content scattered across multiple servers; maintaining complex internal link structure; JavaScript menus; streaming media.
Site copyright statement: "©2005 Villaraigosa for Mayor 2005"
Crawl Results

Comments from crawl operator

• (for Linked hosts results) Need feedback on media etc. retrieved -- this site is an ideal example of the need for scope+one.

• (for Restricted to original host) How much was left out due to the domain restriction? Need feedback.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 263 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:
[urls] [bytes] [host]
817 10291631 ensim3.interlix.com

805 117538973 www.antonio2005.com
472 6333775 www.laopinion.com
265 21173 dns
110 19355921 www2.dailynews.com
100 16605730 www2.dailybulletin.com
95 1410145 www.americanpresidents.org
86 820148 www.dailynews.com
73 168698 www.chumpsoft.com
72 52321 images.ibsys.com
69 836295 www.laobserved.com
65 137700 www.mysql.com
55 213569 www.ensim.com
55 177141 www.lamayorcnc.com
55 296311 www.surveyusa.com
53 495858 abclocal.go.com
52 522324 www.c-span.org
51 244668 gallery.menalto.com
[list truncated…]

Curator Feedback to CDL (Gray Villaraigosa)

Crawl Success: mostly effective
Crawl Success Comments: Doing some spot checks, it looks like all of the pages were captured. Some problems with media files -- WERA shows them when we search, but the files are often size 0. In addition, many files on external servers are listed and even have some descriptive info, but when we click on "Overview" it says "Sorry, no documents with the given uri were found", so we have no idea if they were really captured.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: Most of the links to other sites go to pages that truly are external to the site, not incorporated into it. The only exception would be the pages from ga3.org and ga4.org.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): The result of the multiple hosts crawl is mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit the crawl to specific multiple domains, rather than source + 1. Also, there seem to be a lot of links going to ensim3.interlix.com -- there were more files captured from there than from antonio2005. This is a case where it looks like the entire site (or much of it) was mirrored on two different servers. When we find these, we often try to collapse them into one seamless whole, eliminating the duplication, which is meaningless from the user's viewpoint. Simply capturing both mirrors and leaving the cross-links intact is an option we've used when we can't collapse, but it often leads to problems with links between the two, which seems to be the case here.

Crawl Frequency: once
Questions / Comments about crawl: We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE -- the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.
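The mirrored-site situation the curator describes above (the same content served from ensim3.interlix.com and www.antonio2005.com) can be prototyped as a simple URL rewrite before indexing or display. A minimal sketch, treating the interlix host as a mirror purely for illustration; a real tool would first verify that the mirrored paths are actually identical.

from urllib.parse import urlparse, urlunparse

# Hypothetical mapping of mirror hosts onto the canonical host
CANONICAL = "www.antonio2005.com"
MIRRORS = {"ensim3.interlix.com"}

def collapse(url: str) -> str:
    """Rewrite URLs on known mirror hosts to the canonical host."""
    parts = urlparse(url)
    if parts.netloc in MIRRORS:
        parts = parts._replace(netloc=CANONICAL)
    return urlunparse(parts)

print(collapse("http://ensim3.interlix.com/press/release1.html"))  # hypothetical path
# -> http://www.antonio2005.com/press/release1.html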

Ron Heckart and Nick Robinson: Public Policy Institute of California
CDL Report to Curator

URL: http://www.ppic.org
Curator's original comments: "We are particularly interested in their publications. We hope the crawler will be able to report when new publication files are posted on the website. Our main focus of interest is on their new publications at http://www.ppic.org/main/newpubs.asp"
Site copyright statement: "All Contents © Public Policy Institute of California 2003, 2004, 2005"
Crawl Results

Comments from crawl operator: We can't at the moment use Heritrix to report on new publications posted.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 238 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:
[urls] [bytes] [host]
2421 324309107 www.ppic.org
433 1367362 www.cacities.org
238 19286 dns
229 4675065 www.icma.org
200 598505 bookstore.icma.org
151 1437436 www.greatvalley.org
144 517953 www.kff.org
137 5304390 www.rff.org
113 510174 www-hoover.stanford.edu
102 1642991 www.knowledgeplex.org
97 101335 cdn.mapquest.com

81 379020 www.cde.ca.gov
73 184118 www.ilsg.org
68 4539957 caag.state.ca.us
62 246921 www.milkeninstitute.org
[list truncated…]

Curator Feedback to CDL (Heckart PPIC)

Crawl Success: mostly effective
Crawl Success Comments: There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to "Publications" and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on "Date" retrieves the message "Sorry, no documents with the given uri were found". 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found".
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.):
Crawl Frequency: weekly
Questions / Comments about crawl: Our crawl report included the following comment from the crawl operator: "We can't at the moment use Heritrix to report on new publications posted." The ability to report on new publications is critical to our goal of using the crawler as a discovery tool. What are the prospects for providing this functionality in the future?

Terry Huwe: AFL-CIO
CDL Report to Curator

URL: http://www.aflcio.org
Curator's original comments: "This site is content rich and has many files that will be useful in the future. Specific areas that are of special interest follow below. http://www.aflcio.org/corporatewatch: the data related to executive pay watch is especially useful. http://www.aflcio.org/mediacenter: would like to see press stories captured if possible. http://www.aflcio.org/issues: links to newsletters and original content. Also, the "Legislative Action Center" on the home page; this is a useful topic guide to legislative history from a labor perspective."
Site copyright statement: "Copyright © 2005 AFL-CIO"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 2572 additional hosts were crawled. The following hosts supplied more than 75 URLs to your site:

[urls] [bytes] [host]
12702 481956063 www.aflcio.org
2657 184477 dns
1375 35611678 www.local237teamsters.com

570 8144650 www.illinois.gov
502 52847039 www.ilo.org
435 3851046 www.cioslorit.org
427 2782314 www.nola.com
401 8414837 www1.paperthin.com
392 15725244 www.statehealthfacts.kff.org
326 4600633 www.dol.gov
288 12303728 searchoxide.com
284 3401275 www.sikids.com
280 3069385 www.washingtonpost.com
272 1480539 www.cdc.gov
235 5455692 www.kff.org
[list truncated…]

Curator Feedback to CDL (Huwe AFL-CIO)

Crawl Success: effective
Crawl Success Comments: I realize the collection interface is a work in progress and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g. Working Families Toolkit, BushWatch) that will have historical value. I'm hard-pressed to find fault with the crawl short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2-million-plus-hits-per-year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait, but I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses of the toolkit. Having said all that, my short answer is: I'm stoked. :-)
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In the case of this collection there's a lot of original content (and context-making documents) on the linked sites, simply due to the hierarchical/federal nature of the labor movement.
Crawl Frequency: monthly
Questions / Comments about crawl: I may have missed correspondence on this question, as I've been busy with a library renovation here. My question: Is the 1-gig limit a useful one? I couldn't help wondering what the actual figure would be if that limit were set higher. I'm generally interested in the total size of Web collections, as I need to monitor ours (which is content-rich but even so does not exceed 1 gig). So discussion of this at least would be interesting, and maybe, if it's not technically challenging, that 1-gig level might be raised and we could see what happens. Thanks to all involved--very interesting process.

Kris Kasianovitz: Los Angeles Dept. of City Planning
CDL Report to Curator

URL: http://cityplanning.lacity.org
Curator's original comments: "Website uses frames. Most of the documents will be PDFs. Of particular interest: the EIRs, which are not archived on the page once the project is approved; see http://cityplanning.lacity.org/EIR/TOC_EIR.htm. General and Community Plans: http://cityplanning.lacity.org/complan/gen_plan/genplan2.htm, http://cityplanning.lacity.org/ComPlan/cpbpage.htm"
Site copyright statement: No copyright information found.
Crawl Results

Comments from crawl operator: (Linked hosts included crawl) ended because it ground on for 3 days without hitting the data limit; not sure if the URLs at the end of the log are valid/useful.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 119 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]

10493 840876945 cityplanning.lacity.org
601 5156252 metrolinktrains.com
183 644377 www.cr.nps.gov
121 11162 dns
90 977850 www.metrolinktrains.com
81 1207859 www.fta.dot.gov
79 263432 www.fypower.org
66 333540 www.adobe.com
64 344638 lacity.org
63 133340 ceres.ca.gov
60 274940 www.amtrak.com
59 389217 www.nhtsa.dot.gov
58 347752 www.unitedweride.gov
52 209082 www.dot.gov
52 288783 www.nationaltrust.org
51 278949 www.portoflosangeles.org
[list truncated…]

Curator Feedback to CDL (Kasianovitz LA City Planning)

Crawl Success: mostly effective
Crawl Success Comments: The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department site is loaded with EIRs, notices, etc. In most cases the documents are all PDFs. When searching specifically for EIRs I got a large result list (699 citations); however, when I investigated whether or not the actual file was captured, I found that the main EIR page was captured (typically a htm/html file), but when I clicked on a link to get to the full report, all I got was the "Sorry, no documents with the given URI were found" message. This could be because the file was no longer available when the site was harvested. However, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels, or something else? These are key documents that I would want to have harvested and preserved. Here are a few specific examples:
Final EIR directory: http://cityplanning.lacity.org/EIR/Tocfeir.htm
Sierra Canyon Secondary School (cover page): http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/SierraCyn_coverpg.htm - Access to the Draft EIR and Final EIR is provided from this cover page. Within the system, the links to both the Draft and Final are broken (no documents with that given URI), e.g. http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/DEIR/Table of Contents.htm
Villa Marina EIR: http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
Directory of EIR notices of preparation: http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM

This provides links to EIR preparation notices (all PDFs), a total of 27 links/documents. I encountered the following three issues:

• PDF opened = when clicking on the link to the notice, the PDF opened with no problem: 16 of 27.

• "Sorry, no document with the given uri was found" = no PDF harvested, but I could get to it from the live site: 4 of 27.

• Acrobat "could not open" message (could open the live page outside of WERA) = the following Acrobat message came up when I tried to open the PDF: "Acrobat could not open ENV-2005-0881-EIR[1].pdf because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded). To create an Adobe Acrobat PDF document, go to the source application. Then print the document to Adobe PDF." I copied and pasted the URL into a regular browser and could open the PDF with no problem. Also, in a few cases, if I clicked on the GO button after the first attempt to open the PDF in the system, it seemed to launch: 7 of 27.

Conversely, I found a number of pages that contained full documents in html, with links to PDFs that worked with no problem. See the following document: http://cityplanning.lacity.org/cwd/gnlpln/transelt/TET2Bkgrnd.htm. File types and error codes were what I expected.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The "restricted to original host" setting is more helpful/relevant for getting to the materials from a specific agency. For some of the results that were restricted to host, I was getting external links. Here are some comparisons for each of the crawl settings.
Searched for "villa marina" - LA Dept of City Planning: 6 results:
http://cityplanning.lacity.org/EIR/NOPs/ENV-2004-3812-EIR.pdf
http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM
http://cityplanning.lacity.org/EIR/Tocfeir.htm
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/Cwd/GnlPln/HsgElt/HETblFigApVHgSit.htm
LA City Dept of Planning (via): 2 results:
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/Cwd/GnlPln/HsgElt/HETblFigApVHgSit.htm
Searched for "eir" - LA Dept of City Planning: 699 results; LA City Dept of Planning (via): 324 results.

For both of these searches the URIs were from cityplanning.lacity.org.
Searched for "transportation" - LA Dept of City Planning: 699 results; LA City Dept of Planning (via): 290 results (most are from external sources and tended to be the index or main page of another agency or organization. Because this just got me to the main page, and none of the links functioned at that level, the via result was less helpful. However, the via results are useful for discovering other agencies or organizations that I should be looking at for materials.)
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): The list of linked hosts provided at the end of the report is helpful. This will help me target other agencies that might be key places to check for further collection development.
Crawl Frequency: monthly
Questions / Comments about crawl: I want to qualify the frequency for this site: I'd like to do a monthly crawl for 3-4 months. I'd then want to reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot of new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation. After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report there are 8899 documents in this crawl alone; my other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls. I should note that at this point in time, for local documents, I am more interested in individual documents than in capturing an entire website and preserving its functionality. Local agencies (with the exception of perhaps the Mayor's site) tend not to change the design (look and feel) very often or have a very sophisticated design (i.e. flash, changing images, etc.), and I don't see that this would be of interest to researchers' needs; it is the content (reports, maps, etc.) contained on and accessed through the websites that is important. Maybe I'm wrong or being short-sighted about that.
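Spot checks like the EIR examples above can be partly automated by looking up a list of expected document URLs in the crawl log. A minimal sketch, again assuming the standard Heritrix 1.x crawl.log layout (status code second, URI fourth); both file names are hypothetical.

# URLs the curator expects to find, one per line, in expected-urls.txt (hypothetical file)
with open("expected-urls.txt") as f:
    expected = {line.strip() for line in f if line.strip()}

captured = {}
with open("crawl.log") as log:                 # hypothetical path to the crawl log
    for line in log:
        fields = line.split()
        if len(fields) >= 4 and fields[3] in expected:
            captured[fields[3]] = fields[1]    # remember the fetch status code

for url in sorted(expected):
    status = captured.get(url, "not attempted")
    print(f"{status:>13}  {url}")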

Kris Kasianovitz: Southern California Association of Governments
CDL Report to Curator

Curator's original comments: "This is a critical regional agency for Los Angeles, Orange, Ventura, Imperial, Riverside, and San Bernardino counties. Its main areas are Transportation, Housing, and Economic Development. This will provide an analysis of the overall site, which has a lot of content. Publications/reports are typically in PDF - they are presented as full reports and as pieces of the report (for easier downloading), so there might be duplication. The full report is really all that would be needed; see http://www.scag.ca.gov/publications. The Resources page contains the PDFs, images, dynamic content, and GIS programs, including an interactive atlas: http://www.scag.ca.gov/resources.htm. One part of the Resource site is the Web Accessible Geographic Data Search (WAGS): http://mapsvr.scag.ca.gov/wags/index.cfm?fuseaction= It requires a user-created login and password (although there is a guest login that allows you to bypass this). I'm not sure what kind of difficulty the harvester will encounter with this portion of the site. The interactive atlas also has a create-an-account/guest-login issue: http://mapsvr.scag.ca.gov/atlas/presmap.asp?Cmd=INIT. Since it is a dynamic page, I don't know how this will be handled by the harvester."
Site copyright statement: "© 1999-2005 Southern California Association of Governments"
Crawl Results

Comments from crawl operator: Interesting login problem: Heritrix was unable to retrieve the guest login pages. The ColdFusion and ASP pages don't generate new URLs and thus don't get crawled. NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than 1 minute.
Robots.txt file: The site you selected forbids crawlers from gathering certain data. It reads:

User-agent: *
Disallow: /_mm
Disallow: /_notes
Disallow: /_baks
Disallow: /MMWIP
User-agent: googlebot
Disallow: /csi

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 500 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
2517 863231651 www.scag.ca.gov
690 6134101 www.metrolinktrains.com
506 40063 dns
428 1084533 www.cacities.org
397 16161513 www.sce.com
196 581022 bookstore.icma.org
187 4505985 www.icma.org
175 7757737 www.ci.seal-beach.ca.us
158 1504151 www.h2ouse.org
149 940692 www.healthebay.org
137 317748 www.ci.pico-rivera.ca.us
130 18259431 www.ci.ventura.ca.us
123 490154 www.chinohills.org
121 406068 www.lakewoodcity.org
119 203542 www.lavote.net
117 2449995 www.ci.malibu.ca.us
114 744410 www.ci.irvine.ca.us
113 368023 www.whitehouse.gov
109 974674 www.dot.ca.gov
107 892192 www.lacanadaflintridge.com
[list truncated…]

Curator Feedback to CDL (Kasianovitz SCAG)

Crawl Success: mostly effective

Crawl Success Comments: Similar to my comments about the Los Angeles Dept. of City Planning - the crawl brought back a lot of webpages, but not the publications/documents that I would want to collect. Again, the same problem happened with SCAG as happened with the EIR example: the crawl brought back html pages with links to reports (typically in PDF format), but the actual documents were not captured. While the webpage is helpful, as it gives context, the main content that I'd want to capture (the reports) was not captured. See the following for examples; none of the webpages linked from this page are available, and they should link to a page that will have the material. I tried searching for the documents separately and couldn't get to them. See http://www.scag.ca.gov/publications/index.htm (the timeline arrows at the top seemed to function; I'm not sure what this is for) and http://www.scag.ca.gov/livable/pubs.htm. I was impressed to find that zip files were captured and I was able to download them. Unfortunately, when I opened them there wasn't any content (I did the same search by mistake with the Arizona Dept. of Water Resources and actually found content in the folders); I found 10 with the search "type:zip". The gif or jpg images retrieved are not useful - most were just bars or bullets or covers of reports (although this might be helpful to identify titles, I think I would end up discarding these after doing more checking of the results).
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments from LA Dept. of City Planning. "Restricted" gets me to the relevant materials for that agency; via brings back too many main webpages of other agencies to be useful.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): I expected that there would be a problem with the content behind logins. The crawl confirmed that material behind login screens couldn't be captured. Can I get a copy of these crawl results? NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than one minute.
Crawl Frequency: monthly
Questions / Comments about crawl: How to handle the copyright issue? For the login information, I'm not sure what all was blocked by the robots file. Interesting that metrolinktrains.com is the top related host for both of my crawled sites.

Linda Kennedy: California Bay Delta Authority
CDL Report to Curator

URL: http://calwater.ca.gov
Curator's original comments: "We are interested in the environmental impact statements and other key documents, and the various news releases and other announcements and archives of CALFED."
Site copyright statement: "© 2001 CALFED Bay-Delta Program"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 519 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
1130 473192247 calwater.ca.gov
741 201538533 www.parks.ca.gov
521 40442 dns
373 51291934 solicitation.calwater.ca.gov
242 78913513 www.calwater.ca.gov
225 410972 cwea.org
209 87556344 www.science.calwater.ca.gov
173 109807146 science.calwater.ca.gov
172 1160607 www.adobe.com
129 517834 www.whitehouse.gov
[list truncated…]

Curator Feedback to CDL (Kennedy CALFED)

Crawl Success: mostly effective
Crawl Success Comments: I looked closely at the CALFED home page (http://calwater.ca.gov), the Tribal Home page (http://calwater.ca.gov/Tribal/Tribal_Home.shtml), Key Documents (http://calwater.ca.gov/CALFEDDocuments/CALFEDDocuments.shtml), and the Archives page (http://calwater.ca.gov/Archives/Archives.shtml). The crawl did not complete in either the via or non-via search. Nearly all linked pages were retrieved in the non-via search. However, the retrievals from the via search were much less complete than the retrievals from the non-via search. For example, on the Key Documents page there were 3 missing links from the non-via search but 14 missing links from the via search. When Adobe documents were retrieved from either crawl, they came up correctly. Three .asp links of tribal maps from the Tribal Home page were retrieved by the non-via search but not the via search. A few of the images were missing from the displays, and this was also affected by the browser used. Usually the same image was missing from both crawls, but sometimes the images were more complete in the non-via crawl retrievals. There were some display problems with the right-hand menu boxes, on the http://calwater.ca.gov/Tribal/Tribal_Home.shtml page for example, which did not display correctly when viewed in Firefox. The same page viewed correctly in Internet Explorer, but when printed out the boxes printed incorrectly, just as viewed in the Firefox browser. Grant Opportunities (http://calwater.ca.gov/GrantOpportunities/GrantInformation.shtml): this link did not work in 2 via and 2 non-via instances (from the Tribal home page and from the Archives page) but did work on one non-via crawl page (the CALFED home page). It could also be searched and retrieved directly from the test crawl search page.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments above. The non-via search was substantially more complete.
Crawl Frequency: monthly
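The via versus non-via comparison the curator carried out by hand can also be scripted from the two crawl logs. A minimal sketch, assuming the standard Heritrix 1.x crawl.log layout (status code second, URI fourth) and hypothetical file names for the two logs.

def captured_urls(log_path):
    """Return the set of URLs fetched with a 200 status in a Heritrix crawl log."""
    urls = set()
    with open(log_path) as log:
        for line in log:
            fields = line.split()
            if len(fields) >= 4 and fields[1] == "200":
                urls.add(fields[3])
    return urls

via = captured_urls("calfed-via.crawl.log")         # hypothetical log from the linked-hosts crawl
non_via = captured_urls("calfed-nonvia.crawl.log")  # hypothetical log from the host-only crawl

print(f"only in non-via crawl: {len(non_via - via)}")
print(f"only in via crawl:     {len(via - non_via)}")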

Janet Martorana: Santa Barbara County Department of Planning and Development
CDL Report to Curator

URL: http://www.countyofsb.org/plandev/default.htm
Curator's original comments: None provided
Site copyright statement: No copyright information found.
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 487 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
3119 1102414495 www.countyofsb.org
485 34416 dns
428 1083047 www.cacities.org
357 6126453 www.sbcphd.org
320 6203035 icma.org
250 438507 www.sbcourts.org
234 1110744 vortex.accuweather.com
200 593112 bookstore.icma.org
[list truncated…]

Curator Feedback to CDL (Martorana SBCD)

Crawl Success: mostly effective

Crawl Success Comments: Most documents I expected to find were captured, but a number were not. For example, off this page, http://www.countyofsb.org/plandev/comp/threeyear/2005-2008/default.html, I expected to get to the final work program, http://www.countyofsb.org/plandev/pdf/comp/programs/Three_Year_WP/2005-2008_3YrWrkProgram.pdf, but got the "Sorry, no documents with the given uri were found" message. Other examples are within the Energy Division, a part of the Planning & Development Dept. Off this page, http://www.countyofsb.org/energy/information.asp, I could access all links except for two: http://www.countyofsb.org/energy/information/oil&GasFields.asp (Oil and Gas Fields) and http://www.countyofsb.org/energy/information/oil&GasProduction.asp (Oil and Gas Production). The crawler seemed to cut off the URL right before the ampersand; perhaps it has problems with ampersands.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly
Questions / Comments about crawl: I'd like to be notified when I've navigated away from the captured site. I found myself on realtime web sites but couldn't tell what was on the captured site and what was real. Example: When I did a search on "santa barbara" I got 1528 results using the via results. I went to the County of Santa Barbara Online site, and then, after clicking on links, I was on the web realtime and not in the crawl results database any longer, yet there was no indication that I had left the crawled database. The WERA uri was still displaying at the top of the screen, and I couldn't tell which were the captured sites and which were the current realtime sites. Other observations: the webpage navigation doesn't work, e.g. "Table of Contents" doesn't jump to that section on the webpage (www.countyofsb.org/energy/projects/shell.asp and www.countyofsb.org/energy/mitigation/oakProject.asp), and links to glossary terms go to the glossary but not to the term itself.
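The ampersand problem described above is typically a link-handling issue: if a raw & in a path is treated as an entity or query-string delimiter, the crawler ends up queuing a truncated URL. A small illustrative sketch of the difference; the markup fragment is invented, with only the path taken from the report.

from html import unescape

# Path as it might appear inside an href, HTML-escaped (invented markup, path from the report)
raw_href = "/energy/information/oil&amp;GasFields.asp"

# Correct handling: unescape the entity, keep the whole path
print(unescape(raw_href))                   # -> /energy/information/oil&GasFields.asp

# Faulty handling: cutting the URL at the first "&" loses the rest of the file name
print(unescape(raw_href).split("&")[0])     # -> /energy/information/oil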

Lucia Orlando: Monterey Bay National Marine Sanctuary
CDL Report to Curator

URL: http://montereybay.noaa.gov
Curator's original comments: None provided
Site copyright statement: No copyright information found.
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 795 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
5272 468755541 montereybay.noaa.gov
861 61141 dns
554 20831035 www.wunderground.com
368 4718168 montereybay.nos.noaa.gov
282 3682907 www.oceanfutures.org
273 10146417 www.mbnms-simon.org
260 7159780 www.mbayaq.org
163 61399 bc.us.yahoo.com
152 1273085 www.mbari.org
146 710203 www.monterey.com
119 3474881 www.rsis.com
119 279531 www.steinbeck.org
118 1092484 bonita.mbnms.nos.noaa.gov
109 924184 www.duke.edu
104 336986 www.montereybayaquarium.org

103 595953 icons.wunderground.com
102 339589 www.uncw.edu
[list truncated…]

Curator Feedback to CDL (Orlando - Monterey Bay)

Crawl Success: (rating not provided)
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: This site contains a large mix of government policy, laws, and regulatory information, as well as links to recreational and educational activities available in the MBNMS. I was most interested in links to laws/regs/policy and educational info and organizations. I thought the links restricted to the original host best captured this information succinctly.
Crawl Frequency: unknown

Richard Pearce-Moses: Arizona Department of Water Resources
CDL Report to Curator

URL: http://www.azwater.gov (redirects to http://www.azwater.gov/dwr)
Curator's original comments: "In arid Arizona, water is one of the most important - and most contested - resources. The publications and records of this Department are of critical value to the state. Our spider can get many files from this site (1474 files in 258 directories). We are mostly interested in documents by programs. Although our spider can't get the imaged documents database (http://www.azwater.gov/dwr/Content/ImagedRecords/default.htm), this directory may not be critical if we can get the imaged documents transferred to us. We are not interested in blank forms and applications."
Site copyright statement: "Copyright © 1998 - 2005 Arizona Department of Water Resources and ADWR Network. All Rights Reserved."
Crawl Results

Questions for curator: Did this capture the documents you needed?

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 195 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
2233 988447782 www.azwater.gov
286 2350888 www.water.az.gov
253 4587125 www.groundwater.org
226 3093331 www.azcentral.com
196 15626 dns
178 395216 www.macromedia.com
128 1679057 www.prescott.edu
123 947183 www.azleg.state.az.us
115 792968 www.usda.gov
[list truncated…]

Curator Feedback to CDL (Pearce-Moses AZWater)

Crawl Success: (not provided)
Crawl Success Comments: We were surprised that your crawl found 4888 documents. Another crawl that we conducted at about the same time using wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand it, wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can; that may be the difference. The crawl is listed as not completing, but it appears to be very close, based on the total number of bytes downloaded.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Questions / Comments about crawl: It seemed odd that it took 2h 54m to crawl 7709 documents with linked hosts, but 4h 4m to crawl only 4888 docs when the spider was restricted to the original host.

Richard Pearce-Moses: Citizens Clean Election Commission
CDL Report to Curator

URL: http://www.ccec.state.az.us/ccec/scr/home.asp
Curator's original comments: This commission was established by initiative. Its work is of great historical significance, as it is changing the way the public elects officials. We have not been able to spider this site because links are buried in JavaScript. (We use wget as our spider.) We are primarily interested in acquiring their publications, election data, and things listed under "popular links".
Site copyright statement: "Copyright 2004 Arizona Citizens Clean Elections Commission. All Rights Reserved."
Crawl Results

Comments from crawl operator: The JavaScript issue is an interesting problem; need curator feedback about what we captured.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, the following additional sites were crawled. Total: 15

[urls] [bytes] [host]
929 95456563 www.ccec.state.az.us
76 6117977 www.azcleanelections.gov
55 513218 az.gov
49 499337 www.governor.state.az.us
44 174903 www.adobe.com
40 141202 www.azleg.state.az.us
31 18549 www.az.gov
28 202755 www.azsos.gov
23 462603 gita.state.az.us
19 213976 www.benefitoptions.az.gov
17 89612 www.azredistricting.org
14 1385 dns
3 1687 www.images.adobe.com
2 1850 www.capitolrideshare.com
2 26438 www.ftc.gov

Curator Feedback to CDL (Pearce-Moses CCEC)

Crawl Success: (not provided)
Crawl Success Comments: We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with wget.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)

Juri Stratford: City of Davis
CDL Report to Curator

URL: http://www.city.davis.ca.us
Curator's original comments: "We are primarily interested in the GIS data produced for the City of Davis. As the GIS data represent the largest part of the City of Davis web site, it may not be much more difficult to archive the site at the top level. Mapping and Geographic Information Systems URL: http://www.city.davis.ca.us/gis/"
Site copyright statement: "This web site is Copyright © 2004 by the City of Davis. All Rights Reserved. The City retains the copyright on all text, graphic images and other content of this site. You may not copy, modify and/or re-use text, images or other web content from this web site, distribute the City's web content, mirror content from this web site on a non-City server, or make any other use of the content of this web site that would violate the City's copyright, without written permission from the City of Davis. To the extent allowed by law, commercial use of our web material is prohibited without written permission from the City of Davis. All art work shown on these web pages is protected by US Copyright laws. Limited reproduction for non-commercial purposes can be authorized by the City of Davis, provided that requests are approved prior to use. Contact the Community Development Department Cultural Services program staff at (530) 757-5610 for more information. Some content included in this web site may be provided courtesy of third parties, who may retain copyright control of the provided material. Any service marks and trademarks contained herein are the property of their respective owners."
Crawl Results

Comments from crawl operator: "GIS: Potential issue: /img is disallowed by robots.txt, e.g. http://www.city.davis.ca.us/img/featured/map-static.jpg can't be retrieved; also, some maps on a second server are disallowed. Need feedback about the GIS material that was captured: what was captured that is useful? Much duplication -- pages captured repeatedly."
Robots.txt: The site you selected prohibits crawlers from collecting certain documents. The file reads:

User-agent: *
Disallow: /img
Disallow: /calendar
Disallow: /misc/email.cfm
Disallow: /ed/business
Disallow: /gis/oldmap
Disallow: /police/log
Disallow: /pcs/grants/sacog
Disallow: /jobs/listings
Disallow: /css
Disallow: /pcs/nutcracker/history.cfm
Disallow: /pcs/nutcracker/pdfs
User-agent: asterias
Disallow: /
User-agent: gigabot
Disallow: /

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 420 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
16455 947871325 www.city.davis.ca.us
420 29555 dns
332 10377948 www.asucd.ucdavis.edu
305 33270715 selectree.calpoly.edu
279 3815103 www.w3.org
161 2027740 www.cr.nps.gov
139 941939 www.comcast.com
133 951815 www.yolocounty.org
[list truncated…]

Curator Feedback to CDL (Stratford Davis)

Crawl Success: mostly effective
Crawl Success Comments: Looking at the GIS Online Maps page, it's not clear which formats were retrieved and which were not. For example, the Growth Map Flash file downloads fine, but the Flash/ArcIMS files do not download.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In general, restricted to original host works better. The broader search includes pages from businesses (e.g. Comcast) and other agencies (e.g. other local and state government sites). But restricting the outside sites to the first level seems to be a good compromise.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): In general, it looked like it did a good job pulling down geographic data/images. For example, you can pull down data from http://www.city.davis.ca.us/gis/library. It's difficult for me to get a sense of the level of duplication from the way the search results display.
Crawl Frequency: monthly
Questions / Comments about crawl: I like the ability to navigate within the archive once you have a starting point. This is very nice.

Yvonne Wilson: Orange County Sanitation District
CDL Report to Curator

URL: http://www.ocsd.com
Curator's original comments: "This is an Orange County agency that has small reports and complex planning documents on its web site."
Site copyright statement: "Copyright © 2001-2005 Orange County Sanitation District. Unless a copyright is indicated, the information on this site is freely available for non-commercial, non-profit-making use. If a copyright is indicated on any materials displayed on our website, permission to copy these materials must be obtained from the original source. Commercial use of District materials is expressly prohibited without the written permission of the OCSD."
Crawl Results

Comments from crawl operator: In both crawl settings we had to limit the maximum number of retry attempts in order to complete the crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 85 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
755 85943567 www.ocsd.com
164 7635257 www.ci.seal-beach.ca.us
122 809190 www.ci.irvine.ca.us
95 169207 epa.gov
86 7673 dns

85 559125 order.e-arc.com
66 840581 www.ci.huntington-beach.ca.us
62 213476 www.cityoforange.org
57 313579 www.epa.gov
55 4477820 www.villapark.org
50 1843748 www.cityoflapalma.org
50 463285 www.ocbinc.com
[list truncated…]

Curator Feedback to CDL (Wilson OCSD)

Crawl Success: somewhat effective
Crawl Success Comments: Using WERA, I searched inquiries by type and title in the two OCSD collections, plain and via. I received no hits for pdf, only the homepage for html, and three hits for text. There are many PDF sections in the EIRs. I next searched by title in the two collections; I was the most successful in via. By searching the titles "carbon canyon" and "Ellis Ave Pumping Station" I found articles but not the EIRs, which are available full text. At this point I presumed the crawl did not drill down far enough. Then I searched for a secondary page entitled "Ocean Monitoring"; this time the search found only an internal letter and memo, but not all the documents related to this topic. The via collection search seems to be the most productive, but it is not consistent.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: I searched some of the outside links, to US Marine Fisheries and EPA Beach Watch, and received no hits.
Crawl Frequency: monthly


Crawl Report Key: Web-at-Risk Test Crawls

This document is a guide to the test crawl report, providing further information about some of the statistics and results conveyed there. Your report includes basic information about the site you submitted, your original comments about that site, and any copyright statements CDL found when we examined the site. Your site was crawled using the Heritrix crawler, version 1.5.1. Your report will include the following information about the site:

Crawl Settings: We crawled each site in two different ways: (A) linked hosts included, and (B) restricted to original host. Where linked hosts were included, we set the crawler to gather any outside page that your nominated site linked to, but no further. So if your site linked to a single document from whitehouse.gov, we captured that document but did not crawl any further on the whitehouse.gov site. Your report will show results from both styles of crawling, to give you a sense of whether or not the site draws heavily from valuable materials on another site. You will also receive a list of the other hosts that site linked to, and how many documents were gathered from those hosts.

Robots.txt file: The presence of a robots.txt file means that the content provider is asking us to refrain from crawling either all or part of the site. This refers only to the host you named in your crawl request. In some cases the site had a robots.txt file but it didn't say anything; we noted when this occurred. We obeyed robots.txt instructions for these crawls, so if the site contained one but we still got a result, that means the robots.txt file only prevented us from crawling certain areas of the site. When robots.txt files were present, we have included the text of that file in your report so you can see which segments of the site the site owner wants to protect.

Crawl duration
Total number of documents: The "Documents" count will include page components (such as images or Flash files).

File types (mime types): This area will contain a URL. When you go to that URL you will see a list of the different file types that were retrieved as part of the crawl. IMPORTANT: The Heritrix crawler is currently experiencing difficulty with this report, and it is missing a crucial column. Until that is fixed, you can see the different file types retrieved, from most common to least, but you cannot yet tell how many files each one included. CDL will contact you when this report has been fixed.

Response code reports: The URL in this column will lead to a list of response codes in order by frequency. This will include "200" for files that were successfully captured, and error codes for files that were not captured. The error code list includes some codes specific to Heritrix; the key to interpreting these codes is at http://crawler.archive.org/articles/user_manual.html#statuscodes. Note that this report only gives you quantitative information about response codes; it does not link response codes to specific files. For these details see "Location of hosts report and crawl log" below.

How much data collected (bytes): The file size of the total crawl is reported in bytes. You can use the byte conversion tool at http://www.techtutorials.net/reference/byteconverters.html if you want to recalculate the size of the crawl in another measurement, such as kilobytes or megabytes.

Did crawl complete?: This will say "no" if the crawl results exceeded 1 gigabyte, or if the crawler encountered an obstacle to capturing the site that could not be fixed.

Location of hosts report and crawl log: You are welcome to review the page-by-page details of the crawl log. This is a generic report that comes with the Heritrix crawler; it is not terribly user friendly, but it provides the most detail about the crawl process. This report will list every file that the crawler attempted to get and provide some information about each file. The Heritrix manual can help you interpret this report: http://crawler.archive.org/articles/user_manual.html#logs (see section 8.2.1, Crawl Log).

Comments from Crawl Operator: These are observations that the Web Archive Programmer made about the crawl process for your site.

Questions for Curator: This section does not appear in every report. If you have a question listed here, please respond to it in the Test Crawl Feedback form.

Your Collection: Important: the screens you will see do not represent the final user interface for the Web Archiving Service tools. We are using WERA, an open-source search and display tool, only to show you your test crawl results. Because we did not seek the right to redistribute these documents, these pages are available only for the purpose of analyzing crawler effectiveness. You must have a password to view these pages; your report will include the address of a wiki page and a login and password. Each site was crawled twice: "plain crawl" = only pages from the original site were collected; "via" = pages from the original site, as well as pages that site links to, were collected. Unfortunately you cannot simply browse your site; you must select a collection and type a search. You will be able to navigate throughout your site once you load a page containing links. You will be able to review your colleagues' sites as well. Note that the WERA display tool is not perfect: if the same document was gathered from more than one crawl, it may not display in every collection.

Related Hosts Crawled: This section provides further information about the additional materials that were gathered when we set the crawler to include documents that your site links to. This can be critical in deciding what settings are needed to capture your site. Some sites, for instance, will keep all of their PDF or image files on a separate server. If you don't allow the crawler to move away from the original URI, you won't capture a critical portion of the site's content. In other cases, however, this setting will lead to irrelevant information. This report includes the most commonly linked hosts from your site.
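The response code and byte figures described above can also be pulled straight from the raw crawl log. As a rough, minimal sketch only (it assumes the whitespace-delimited layout of a Heritrix 1.x crawl.log, with the status code in the second field, the document size in the third, and the URI in the fourth, and a log file named crawl.log in the current directory; check these assumptions against your own log before relying on it), a few lines of Python can tally response codes and report bytes per host in megabytes:

    # Sketch: tally response codes and bytes per host from a Heritrix crawl log.
    # Assumes whitespace-delimited fields: timestamp, status, size, URI, ...
    from collections import Counter
    from urllib.parse import urlparse

    status_counts = Counter()
    bytes_per_host = Counter()

    with open("crawl.log") as log:           # placeholder path
        for line in log:
            fields = line.split()
            if len(fields) < 4:
                continue                     # skip blank or malformed lines
            status, size, uri = fields[1], fields[2], fields[3]
            status_counts[status] += 1
            host = urlparse(uri).netloc
            if size.isdigit() and host:      # size may be "-" for failed fetches
                bytes_per_host[host] += int(size)

    print("Response codes:", dict(status_counts))
    for host, nbytes in bytes_per_host.most_common(10):
        print(f"{host}: {nbytes / 1048576:.1f} MB")   # 1 MB = 1,048,576 bytes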



Web-at-Risk Test Crawl Report: Appendix A, Sites Submitted

Curator | Site | Crawled
Sherry DeDekker | http://ca.water.usgs.gov (California Water Science Center) | Yes
Sherry DeDekker | http://www.dwr.water.ca.gov (California Department of Water Resources) |
Peter Filardo and Michael Nash | http://www.nycclc.org (New York City Central Labor Council) | Yes
Peter Filardo and Michael Nash | http://www.dsausa.org (Democratic Socialists of America) |
Valerie Glenn and Arelene Weibel | http://www.strengtheningsocialsecurity.gov (Strengthening Social Security) | Yes
Valerie Glenn and Arelene Weibel | http://www.brac.gov (The Defense Base Closure and Realignment Commission) | Yes
Gabriela Gray | http://www.joinarnold.com (Join Arnold) | Yes
Gabriela Gray | http://www.antonio2005.com (Mayor-elect Antonio Villaraigosa) | Yes
Ron Heckart and Nick Robinson | http://www.ppic.org (Public Policy Institute of California) | Yes
Ron Heckart and Nick Robinson | http://www.cbp.org (California Budget Project) |
Terrence Huwe | http://www.aflcio.org (AFL-CIO) | Yes
Terrence Huwe | http://www.seiu.org (Service Employees International Union) |
James Jacobs | http://www.sandiego.gov/planning (City of San Diego Planning Department; analyzed by Megan Dreger) | Yes
James Jacobs | http://www.sandag.org (San Diego Association of Governments) |
Kris Kasianovitz | http://cityplanning.lacity.org (Los Angeles Department of City Planning) | Yes
Kris Kasianovitz | http://www.scag.ca.gov (Southern California Association of Governments) | Yes
Linda Kennedy | http://calwater.ca.gov (California Bay-Delta Authority, CALFED) | Yes
Linda Kennedy | http://www.dfg.ca.gov (California Department of Fish and Game) |
Ann Latta | http://www.ucmerced.edu (UC Merced; analyzed by Elizabeth Cowell) | Yes
Ann Latta | http://www.coastal.ca.gov/web/ (California Coastal Commission) |
Janet Martorana | http://www.countyofsb.org/plandev/default.htm (Santa Barbara County Department of Planning and Development) | Yes
Janet Martorana | http://www.sbcag.org (Santa Barbara County Association of Governments) |
Lucia Orlando | http://montereybay.noaa.gov (Monterey Bay National Marine Sanctuary) | Yes
Lucia Orlando | http://www.waterboards.ca.gov/centralcoast (Central Coast Regional Water Quality Control Board) |
Richard Pearce-Moses | http://www.azwater.gov (Arizona Department of Water Resources) | Yes
Richard Pearce-Moses | http://www.ccec.state.az.us/ccecscr/home.asp (Citizens Clean Election Commission) | Yes
Juri Stratford | http://www.city.davis.ca.us (City of Davis, California) | Yes
Juri Stratford | http://www.sacog.org (Sacramento Area Council of Governments) |
Yvonne Wilson | http://www.ocsd.com (The Orange County Sanitation District) | Yes

Web-at-Risk Test Crawl Report: Appendix B, The Katrina Crawl

The Crawl

During the early fall of 2005, the California Digital Library was embarking on a series of test crawls as part of the Web-at-Risk project when Hurricane Katrina struck. CDL had requested that curators submit their sample URLs by August 25th, and on Friday, August 26th, Louisiana Governor Blanco declared a state of emergency. It was over that weekend that CDL staff reached the conclusion that an event of this historic significance would shift our priorities. On Monday we suspended our initial test crawl plans and began preparing on a number of fronts to capture web-based materials related to Katrina. Specifically, this meant:

• identifying which web sites or sub-sites to collect, which involved:
  o notifying 30+ curators of our emergency crawl plans and requesting their assistance
  o setting up a web-based interface to gather their suggested seed URLs
  o sorting through the resulting seed list and feeding it to our primary and secondary (Stanford) crawl operators
• determining our collection parameters
• getting our crawlers installed and configured
• locating and setting up disk space to store the crawls
• initiating and monitoring the crawls

This crawl activity posed several new challenges. It was the first time CDL staff had used the Heritrix crawler, which was not yet installed on a production server when the hurricane struck. The last week of August saw CDL and UC Berkeley staff scrambling to find server space to set up Heritrix instances, disk partitions, and job parameters for the crawl. This was done with remarkable speed, owing to the urgency of the situation and to everyone's desire to capture a record of the events. Given our limited experience, and in order to reduce the risk of losing the historically significant and fleeting materials related to Hurricane Katrina, CDL worked with Stanford University to concurrently run the same crawl using a different crawler. After both CDL and Stanford had crawled the same seed list for a month, the task was then taken up by the San Diego Supercomputer Center, who have continued crawling these sites using Heritrix.

Gathering the Seeds

The CDL sent out an initial request to the Web-at-Risk curators to submit URLs related to Hurricane Katrina for crawling. We worked collaboratively with a large group of content specialists to identify the sites:

• The 22 curators of the Web-at-Risk project (University of California Libraries, University of North Texas, New York University, Stanford University, and the Arizona State Library)
• The Library of Congress
• Librarians at Louisiana State University and the University of Mississippi


Over the course of the crawl, the list of seed URLs grew to over 700 (just over 500 of which were crawled by CDL). The image of our input form above provides some sense of the range of materials collected. Given the sudden nature of this event, there was no time to investigate rights issues or technical problems each site might have presented. We informed the curators that "Our immediate plan is to simply collect the material before it disappears. We will not make the material immediately available." We also had little time for quality control, and were not able to guarantee that sites in our seed list would be comprehensively crawled.

Crawling Specifics

CDL's first crawl was run on September 1, 2005, using a seed list of 89 URLs. The final crawl run by CDL was on October 10, using a seed list of 589 URLs. This final seed list was then sent to the San Diego Supercomputer Center, who are continuing to run twice-weekly crawls.


The CDL crawls were done using Heritrix version 1.5.1. We began with a single instance of Heritrix, eventually dividing the seed list among six instances. We got through the entire seed list 29 times in 40 days. However, the content we were able to collect was limited by our crawler settings. Our crawler was configured to visit one host at a time, collecting content at a conservative rate, and never more than three hops away from the seed URL. To ensure that the crawler moved through the seed list in a timely way, it was limited to spending 15 minutes at any given host. So the material collected does not represent the entirety of what was available at each site.

Note that all of the seeds were crawled with the same configuration, no matter how different the structure of these sites might be. Thus the New Orleans version of Craigslist, NASA's information pages, and blog sites were all crawled in the same manner, despite being quite different in context, architecture, and other characteristics. Given our short preparation time, the goal was to find a crawler configuration that would match the broadest range of Katrina materials. In most cases, whether the seed URL was for a personal blog or for a government agency, the seed itself was not centrally about Katrina. The Katrina information was generally situated on the front page and top levels of each site, so CDL expected that this setting, even with the 15-minute time limit, would at least capture the content nearest to the front pages of the widest variety of sites we could capture.

In choosing crawler configuration settings we also faced conflicting goals. On one hand, we wanted to capture as much Katrina content as possible. On the other hand, we were very hesitant to start hitting sites that were providing much-needed information at a crucial time (emergency sites, relief sites), thus making those sites hard to reach due to our capture activity. Many of these sites were already getting heavily used and perhaps were not running at full capacity; some were also geographically impacted directly by the hurricane. So we had to choose settings that balanced the need to collect with politeness across a wide range of sites.

When the San Diego Supercomputer Center took over the Katrina crawls in October, they revisited the Heritrix configuration settings. The SDSC crawl placed a limit not on the amount of time spent at a site but on the number of documents to be retrieved overall. Their crawls are being conducted "breadth first," in the sense of gathering pages consecutively across the seed list rather than gathering pages from one site at a time.

The Stanford University crawl of the same sites began on September 7th and continued for 30 consecutive days using the WebVac crawler. While we have statistics concerning the size of the Stanford Katrina collection, it is very difficult to compare the configuration settings and effectiveness of WebVac vs. Heritrix. Further, there is no easy way to display materials gathered with WebVac, so the Stanford Katrina content is stored but is not accessible to us at the moment.
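The division of the seed list among six Heritrix instances mentioned above was handled operationally by the crawl operators; purely as an illustration (the file names and the instance count below are placeholders, not the actual files used), splitting a seed list evenly across instances is a one-screen script:

    # Sketch: round-robin a seed list across several crawler instances.
    N_INSTANCES = 6                                    # mirrors the setup described above

    with open("seeds.txt") as f:                       # placeholder seed-list file
        seeds = [s.strip() for s in f if s.strip() and not s.startswith("#")]

    for i in range(N_INSTANCES):
        bucket = seeds[i::N_INSTANCES]                 # every Nth seed, offset by i
        with open(f"seeds-instance-{i + 1}.txt", "w") as out:
            out.write("\n".join(bucket) + "\n")
        print(f"instance {i + 1}: {len(bucket)} seeds")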

Katrina Crawl Results

In terms of creating a collection of the Katrina event on the web, we are not satisfied that CDL's crawling efforts were successful. As mentioned, we had to impose time limits on each site visited and were not able to conduct quality control on the URLs that were captured. The total size of the Katrina capture at CDL is 50 GB. This represents 29 completed crawls of the seed list, which grew to over 500 seeds over the 40 days we ran the crawl. It's clear from the byte size alone that we barely skimmed the surface, and indeed the Stanford crawl proves this point: when the Stanford crawl was based on about 370 seeds, it was pulling in about 35 GB a day, or 1.5 million pages a day.

In spite of this, our attempts to capture Katrina news and events did have some very useful and positive outcomes. In terms of test crawling, the experience was quite successful, opening our eyes both to certain obstacles and to new possibilities. Most importantly, it prompted us to consider the nature and demands of event-based crawling, which had not been part of our original test plan. Here are some of the aspects of event-based crawls that Katrina surfaced.

Site Selection and Classification

This type of event demands a deep, collaborative effort to identify material to collect. CDL had to mobilize a large number of curators to make their best guesses at identifying promising sites. We set up a "crawl seed nomination" web form accessible to the curators from their home institutions and invited the curators to use it to enter their suggested seed URLs. CDL staff did not have time to add many automatic error- or duplicate-checking features, so this cumbersome work was done by hand. On the other hand, it provided us a first-hand trial of what a more general curator user interface might require.

The selection and management of seed lists is critical for sudden, event-based crawls. The curators contributing the URLs will not necessarily be well versed in the topic; in the case of Katrina, curators in California were not uniformly familiar with the Gulf Coast, the towns, the government agencies, etc. In addition, it is difficult to predict which aspects of the event will be of historic, enduring value. Because disk storage was not a pressing issue, it was better to err on the side of a wider net when selecting sites.

It became clear that there might be a role for "smart crawling," which would spread a wider net than that provided by human-generated seed lists through the use of automated tools that discover relevant materials. We were specifically interested in using tools developed at Emory and Cornell that interface with the Rainbow text classification tool.7 We did not have the resources to investigate this for the Katrina crawl, but plan to fold it into future crawling tests.

7. Bergmark, Donna. "Heritrix processor for use with rainbow." <http://groups.yahoo.com/group/archive-crawler/message/1905>

Finally, it is worth examining why the seed list grew continuously throughout the event. Part of this growth is due to the natural increase in the number of pages and sites devoted to the emergency, and to having the additional time to identify and add them. However, the nature of the event itself changed over time, starting as a hurricane, then becoming a flood, a massive relocation, and a political and social issue. So the range of relevant sites changed as the event itself took on broader implications. This suggests that site selection is an ongoing process, not strictly an activity undertaken at the beginning of an event.
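The error- and duplicate-checking that the nomination form lacked is the kind of step that could be automated for future event crawls. The sketch below is illustrative only (the submitted URLs are invented stand-ins, not entries from the actual Katrina seed list); it normalizes curator-submitted seeds, drops duplicates, and sets aside unusable entries for a person to review:

    # Sketch: normalize curator-submitted seed URLs and drop duplicates.
    from urllib.parse import urlparse, urlunparse

    def normalize(seed):
        seed = seed.strip()
        if not seed:
            return None
        if "://" not in seed:
            seed = "http://" + seed                    # curators often omit the scheme
        parts = urlparse(seed)
        if not parts.netloc or " " in parts.netloc or "." not in parts.netloc:
            return None                                # not a usable URL
        path = parts.path if parts.path else "/"
        # lower-case scheme and host, drop any fragment
        return urlunparse((parts.scheme.lower(), parts.netloc.lower(), path,
                           parts.params, parts.query, ""))

    submitted = ["http://www.fema.gov/", "WWW.FEMA.GOV",               # invented examples
                 "http://www.redcross.org", "http://www.redcross.org/#donate",
                 "not a url"]
    seen, cleaned, rejected = set(), [], []
    for s in submitted:
        n = normalize(s)
        if n is None:
            rejected.append(s)
        elif n not in seen:
            seen.add(n)
            cleaned.append(n)

    print(cleaned)    # deduplicated, normalized seeds
    print(rejected)   # entries a curator would need to fix by hand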

Rights, Ownership, and Responsibilities

Our current rights protocol stipulates that when a web page is clearly in the for-profit domain, we must seek permission from the content owners to collect. This restriction applies most notably to news agency sites, which would likely be a critical part of the web-based fallout of any major event; this was certainly the case with Katrina. However, rights negotiation is a time-consuming process. In the case of a planned event such as an election, one is able to anticipate a certain number of sources and take rights management steps before the event occurs. But with emergencies such as Katrina, there is simply no way to secure rights in advance without missing time-sensitive material. Knowing this would be an issue, CDL notified our curators in advance that the Katrina materials we gathered would not be publicly accessible, but merely captured and preserved. The complex rights issues behind web archiving, as well as the collaborative environment described above, raise a number of questions:

• Who owns the content?
• Who is responsible for the project?
• Who is responsible for responding if a content owner objects?
• When is it permissible to ignore robots exclusion files?
• How much rights management can be done as part of post-processing for an event crawl?
• Can there be standing agreements with major news agencies to allow for this type of crawl in emergency situations? If so, what constitutes an emergency?

Technical Infrastructure

CDL did not have an adequate technical infrastructure in place at the time of the Katrina crawl. As mentioned above, staff at both CDL and UC Berkeley scrambled to find server space and to install and configure the crawlers. This sudden shift in workload left the system vulnerable and somewhat unstable. In order to start the crawling process as soon as possible, CDL began the project on a temporary server. Making the transition to more permanent storage in the midst of this crawling project was neither easy nor flawless, and a certain amount of data had to be recovered from backup sources.

Information Analysis and Display

The challenge of analyzing crawl results has been described in some detail in our Test Crawl Report; that report cites a curator who found the task of reviewing an 8,899-document crawl to be unmanageable. The tools currently available are not at all up to the task of analyzing a large and complex crawl. An event-based crawl is likely to result in massive amounts of data of widely varying quality. The selection of seeds is based on guesses that, given sufficient resources, should be reviewed, refined, and enhanced as the event progresses. Time-series data based on changing input parameters represents a kind of moving target that suggests the need to develop new analysis tools. Andreas Paepcke of Stanford University addressed this problem when considering how a social scientist might want to analyze the Katrina materials. Consider the researcher who is interested in finding out how quickly the notion of race entered the public discourse in the aftermath of Katrina. Paepcke suggests:

"For example, the tool could perform word frequency analysis across successive crawls of Katrina sites, so that the user could identify the emergence of word usage such as 'looter,' 'ninth ward,' etc. The social scientist would then define within the tool clusters of words that, in the scientist's judgment, are direct or indirect occurrence indicators of the concept 'race' (i.e., ninth ward, looters, lacks, poor, ...). The scientist could subsequently interact with the tool at the level of these well-defined concepts. Example command: count occurrences of 'race' within the first three paragraphs of all pages, and tabulate the differences across 10 days. We could go further and apply well-known topic categorization algorithms on the data to suggest new concepts as they arise in the series of text streams. For example, the tool might identify an increasing frequency of the term 'mismanagement' beginning in week two after the storm and alert the sociologist to this evidence of a new discourse theme."8

While we don't expect to be able to incorporate this level of analysis into our web archiving tools any time soon, these ideas illustrate a useful direction for archival analysis tools.
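The first step Paepcke describes (tracking a curator-defined cluster of indicator terms across successive crawls) is nonetheless straightforward to prototype. The sketch below is illustrative only: the directory names and the assumption of one folder of extracted page text per crawl date are hypothetical, and the term list simply echoes the example cluster quoted above.

    # Sketch: count a cluster of indicator terms across successive crawl snapshots.
    import pathlib
    import re
    from collections import Counter

    INDICATORS = ["looter", "ninth ward", "poor"]            # curator-defined cluster

    def count_indicators(crawl_dir):
        counts = Counter()
        for page in pathlib.Path(crawl_dir).rglob("*.txt"):  # extracted page text
            text = page.read_text(errors="ignore").lower()
            for term in INDICATORS:
                counts[term] += len(re.findall(re.escape(term), text))
        return counts

    # hypothetical per-crawl directories, one per crawl date
    for crawl_dir in ["katrina-2005-09-01", "katrina-2005-10-10"]:
        print(crawl_dir, dict(count_indicators(crawl_dir)))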

Conclusions

At the time Hurricane Katrina hit, the Web-at-Risk service requirements and test crawl plan had been written for a more orderly and considered approach to web crawling. Our attempt to capture the web-based aftermath of Katrina highlighted our need to revisit the service requirements and see how well they support an event-based response. We need to develop a collaborative and robust mechanism to identify and collect seed URLs. When an event happens quickly, it is also important to have in place general crawling guidelines that will result in a well-rounded collection. The technical infrastructure for the Web Archiving Service that we are building for the Web-at-Risk project should be robust enough to handle occasional and sudden bursts of activity. Attempting to alter the infrastructure quickly in reaction to emergency events impairs our responsiveness and leaves the altered systems in a vulnerable state.

8. Paepcke, Andreas, Senior Research Scientist and Director of the Digital Library, Stanford University. Email correspondence with Patricia Cruse, October 26, 2005.


Finally, work needs to continue both on the rights management front and on developing improved web archiving analysis tools, so that the material gathered can be used to its greatest potential.


Web-at-Risk Test Crawl Report: Appendix C, Individual Crawl Reports

Included below are the crawl reports provided to individual curators, including their analysis and feedback about those results. Note that a key to interpreting the tables in these crawl reports is provided at the end of this appendix.

Elizabeth Cowell (submitted by Ann Latta): UC Merced
CDL Report to Curator

URL: http://www.ucmerced.edu
Curator's original comments: "UC Merced is the first research university to be built in the 21st century. The educational and land use issues are significant. Of particular interest is http://www.ucmercedplanning.net. This site addresses major issues of land use - the university is being built on agricultural land. Controversy existed re: issues of redevelopment of downtown Merced vs. appropriation of agricultural land - there are major environmental issues focused on endangered species - Educational issues involving faculty job descriptions, student body, etc. are significant because of the economic, ethnic, and cultural diversity of the region."
Site copyright statement: "© 2004 UC Regents"
Crawl Results:


Comments from crawl operator: When we set the crawl to include pages from linked sites, the crawler got "trapped" at the Elsevier site. There is JavaScript on that linked page that causes the crawler to continue looking for additional pages on the Elsevier site, even when you're only trying to capture a single page. Once we set a limit for the maximum number of retry attempts, the crawl completed. This data is from the completed crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 227 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
969 71552369 www.ucmerced.edu
238 2564803 www.ucop.edu
226 14851 dns
197 16583197 www.universityofcalifornia.edu
156 8487817 www.elsevier.com
151 1437436 www.greatvalley.org
112 2354582 faculty.ucmerced.edu
105 5659795 www.pacific.edu
90 111985 k12.ucop.edu
86 255733 www-cms.llnl.gov
85 1178031 admissions.ucmerced.edu
81 297947 uc-industry.berkeley.edu
71 108265 www.mssmfoundation.org
67 349300 www.nps.gov
66 308926 www.usafreedomcorps.gov
54 137085 slugstore.ucsc.edu
52 52202 www.cerrocoso.edu
51 977315 www.universityofcalifornia.com

Curator Feedback to CDL (Cowell Merced)

Crawl Success: mostly effective


Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: 10 more links came up in the "via" search; 8 of these were not useful for the research of land use issues. The two that were useful were a benefit of the "via" search.
Crawl Frequency: monthly


Sherry DeDekker: California Water Science Center
CDL Report to Curator

URL: http://ca.water.usgs.gov
Curator's original comments: "We are interested in the CA water reports and databases behind the links on this page. Also, in looking at http://waterdata.usgs.gov/ca/nwis/nwis, this section is an interactive interface to multiple data sets. Is it possible to capture this type of site as well as the static reports?"
Site copyright statement: "Information presented on this website is considered public information (unless otherwise noted) and may be distributed or copied. Use of appropriate byline/photo/image credit is requested. We strongly recommend that USGS data be acquired directly from a USGS server and not through other sources that may change the data in some way."
Crawl Results:

Comments from crawl operator: Databases, e.g. http://waterdata.usgs.gov/ca/nwis/nwis, are disallowed by http://waterdata.usgs.gov/robots.txt. Some water data reports are NOT caught by the broader "linked hosts included" settings, e.g. http://pubs.usgs.gov/wdr/2004/wdr-ca-04-1/ -- we would also want to submit http://ca.water.usgs.gov/waterdata/ as a seed.
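Heritrix made this robots.txt determination itself during the crawl, but the same check can be reproduced with Python's standard library, which is a quick way for a curator to see in advance whether a seed will be blocked. A minimal sketch (it fetches the live robots.txt, so the answer reflects the file as it stands today, not necessarily as it stood at crawl time):

    # Sketch: check whether URLs are excluded by a site's robots.txt.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("http://waterdata.usgs.gov/robots.txt")
    rp.read()                                    # fetch and parse the live file

    for url in ["http://waterdata.usgs.gov/ca/nwis/nwis",
                "http://waterdata.usgs.gov/"]:
        verdict = "allowed" if rp.can_fetch("*", url) else "disallowed by robots.txt"
        print(url, "->", verdict)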


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 662 additional sites were crawled. The following list includes the hosts that supplied more than 50 files. Note that the host "pubs.usgs.gov" supplied a higher number of files than the original host itself.
[urls] [bytes] [host]
1963 255912820 pubs.usgs.gov
1153 47066381 ca.water.usgs.gov
698 56570 dns
404 112354772 geopubs.wr.usgs.gov
385 9377715 water.usgs.gov
327 203939163 greenwood.cr.usgs.gov
318 17431487 www.elsevier.com
219 3254794 www.usgs.gov
189 2737159 www.lsu.edu
163 2292905 wrgis.wr.usgs.gov
158 31124201 www.epa.gov
149 921063 www.usda.gov
[list truncated…]

Curator Feedback to CDL (DeDekker CWSC)

Crawl Success: somewhat effective
Crawl Success Comments: The site appears to access water data reports (http://ca.water.usgs.gov/archive/waterdata/index.html), but none are actually available through the links. I expected the site to not be able to access real-time data, but these are archived reports.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly


Megan Dreger (submitted by James Jacobs): City of San Diego Planning Department

CDL Report to Curator

URL: http://www.sandiego.gov/planning
Curator's original comments: "I would like to drill down several levels (at least 3) of this site. For example, following the link to City of Villages/general plan update leads to many more important planning documents."
Site copyright statement: This site contains the two following notices on the same page:
"Restrictions on Use of Materials: This site is operated and maintained by the City of San Diego through its Department of Information Technology and Communications (referred to as IT&C). Except as provided herein, no material or information from this site may be copied, reproduced, republished, uploaded, posted, transmitted, or distributed except as authorized in this notice, expressly authorized within this site, or approved in writing by IT&C."
"Copyright Notice: Unless a copyright is indicated, information on the City of San Diego Web site is in the public domain and may be reproduced, published or otherwise used with the City of San Diego's permission. We request only that the City of San Diego be cited as the source of the information and that any photo credits, graphics or bylines be similarly credited to the photographer, author or City of San Diego, as appropriate. If a copyright is indicated on a photo, graphic, or any other material, permission to copy these materials must be obtained from the original source."
Crawl Results:


Comments from crawl operator: Need feedback about whether the desired content was retrieved.
Question for curator: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 1084 additional hosts were crawled. (This figure does not represent a complete crawl, as the size limitation was reached.) The following hosts supplied more than 75 documents to your site:
[urls] [bytes] [host]
3728 556231640 www.sandiego.gov
1247 38685244 genesis.sannet.gov
1085 80905 dns
807 6676252 www.houstontexans.com
428 1079658 www.cacities.org
399 102298888 www.buccaneers.com
259 1797232 granicus.sandiego.gov
258 42666066 clerkdoc.sannet.gov
238 5413894 www.ccdc.com
225 2503591 www.ci.el-cajon.ca.us
223 1387347 www.ipl.org
217 2683826 www.sdcounty.ca.gov
203 11673212 restaurants.sandiego.com
195 2620365 www.sdcommute.com
192 1344523 www.bengals.com
189 2221192 www.kidsdomain.com
176 1333528 www.buffalobills.com
171 685965 www.chumpsoft.com
166 277238 www.proquest.com
[list truncated…]


Curator Feedback to CDL (Dreger San Diego)

Crawl Success: mostly effective
Crawl Success Comments: This crawl was not completed due to size, so that may explain some of my questions. It was pretty effective in terms of getting the Planning Dept. pages, but it went out further than I expected. Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content), but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example, http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages, but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency; if that's the case, it may be easier to do all of www.sandiego.gov rather than limit. It's nice that PDFs and other formats are included.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The "linked hosts included" (via) crawl seemed to include more extraneous material.
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
A: You asked for a better definition of what was meant by the request to "drill down several levels (at least 3)." Unfortunately, I wasn't the original curator and I'm not sure what he meant. The crawl you did was very useful.
Crawl Frequency: monthly
Questions/Comments about crawl: I think that the crawl frequency should be at least monthly for these pages.
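The "levels" question above turns on two different measures: directory depth, which is visible in the URL itself, and link hops, which depend on how a page was discovered and are recorded by the crawler rather than by the URL. A small illustration (the second URL is the roots.shtml page cited in the feedback above):

    # Sketch: directory depth is a property of the URL path; link hops are not.
    from urllib.parse import urlparse

    def directory_depth(url):
        path = urlparse(url).path
        return len([p for p in path.split("/") if p])

    for url in ["http://www.sandiego.gov/planning",
                "http://www.sandiego.gov/cityofvillages/overview/roots.shtml"]:
        print(directory_depth(url), url)
    # Link hops, by contrast, count clicks from the seed page; Heritrix records
    # them in the crawl log's discovery path, not in the URL.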


Peter Filardo and Michael Nash: New York City Central Labor Council
CDL Report to Curator

URL: http://www.nycclc.org
Curator's original comments: (none)
Site copyright statement: "© 2004 New York City Central Labor Council. No portion of this website may be reproduced in any form without permission from the Central Labor Council. Contact our offices for more information at nycaflcio@aol.com."
Crawl Results: NOTE: Because your Crawl "A" had to be stopped, then resumed, each of your reports for that crawl is in two segments. To browse a list of all reports for that crawl, go to http://voro.cdlib.org:8081/ingest_misc/ndiipp/testcrawls_raw/filardo_labor_via

Comments from crawl operator: (A: Linked hosts included) Crawl complete after recovery, with the addition of a maximum number of retries. The crawl seemed to hang at http://www.nycclc.org/calendar/event.asp?EventId=501 and http://www.nycclc.org/assets/HLCapplicationmembership.pdf, so we ended it, recovered from the previous job, and the recovery was successful. Note for the future that a recovered job is identifiable because the logs directory is called "logs-R."


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, x additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
1913 74260017 www.nycclc.org
156 11755 dns
115 710552 www.aflcio.org
73 1477966 www.comptroller.nyc.gov
71 193264 www.empirepage.com
60 570115 www.redcross.org
58 269079 www.afl-cio.org
57 240845 www.campsussex.org
57 113676 www.mssm.edu
56 449473 www.labor-studies.org
53 184605 www.pbbc.org
52 134326 www.senate.gov
[list truncated…]

Curator Feedback to CDL Filardo NYCCLC

None provided


Valerie Glenn and Arelene Weibel: Strengthening Social Security
CDL Report to Curator

URL: http://www.strengtheningsocialsecurity.gov
Curator's original comments: "Contains external links to audio & video that would be essential to completing this site (see press room); some are files, some are links to webcasts, some are on external .gov sites, and some are external .com sites."
Site copyright statement: Copyright info not found.
Crawl Results:

Comments from crawl operator: Interesting audio/video/webcast content -- need feedback about success in capturing it.
Questions for Curator: How successfully did this crawl capture the multimedia documents you were interested in?
Comments from coordinator: In the media center area of this site (http://www.strengtheningsocialsecurity.gov/press/media_center.shtml) I've found .ram (both video and audio-only), .smil, and .asx files. The site also contains numerous .ppt and .pdf files.
• A text search on the log file turns up numerous .ram files, but only one .ppt file.


• .asx files are Windows streaming media redirector files, which generally lead to associated Windows Media files. No .asx references appear in the crawl log, nor do any Windows Media files.
• Similarly, .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files.
I assume that when displayed, some of the Real Media files from this site would function, but many of the other multimedia files would not.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 388 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
660 10668874 www.chelseapiers.com
562 7334035 www.whitehouse.gov
477 6366197 www.laopinion.com
391 29623 dns
356 3874719 www.wkrc.com
243 12294240 www.strengtheningsocialsecurity.gov
178 1935969 www.xavier.edu
148 237055 image.com.com
127 682069 online.wsj.com
117 898439 www.omaha.com
116 514995 www.npr.org
108 995733 www.nba.com
[list truncated…]

Curator Feedback to CDL (Glenn Strength Soc Sec)

Crawl Success: mostly effective
Crawl Success Comments: My main concern about this site was the multimedia documents included - I've posted those thoughts in the "Questions for Curator" text box.
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: For this site it was essential to capture the linked hosts (via), because many of the press materials, etc. were on external sites.
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: How successfully did this crawl capture the multimedia documents you were interested in?
A: I'm disappointed that not all of the multimedia files were captured, but there seem to be only a few that aren't included. I found it interesting that broadcasts from the same host (whitehouse.gov) weren't completely captured - some were, some weren't.
Crawl Frequency: once
Questions/Comments about crawl: [note: the curator alludes to adding a comment here, but no comment was received]


Valerie Glenn and Arelene Weibel: Defense Base Closure and Realignment Commission
CDL Report to Curator

URL: http://www.brac.gov
Curator's original comments: "In our previous efforts we have been unable to capture agency databases. The BRAC site includes a document library which has a search feature (http://www.brac.gov/Search.aspx) and a browse feature (http://www.brac.gov/Browse.aspx). We would really like to see how this information can be captured so that we can recreate it on our own servers."
Site copyright statement: "The contents of all material available on this Internet site are in the public domain and are not copyrighted. The content of this site may be freely reproduced, downloaded, disseminated, published, or transferred in any form and by any means. However, in some cases the copyright for certain text or images on this site may be held by other parties."
Crawl Results:

Comments from crawl operator:
A. Linked hosts included: http://www.brac.gov/Search.aspx can't be captured by Heritrix, and http://www.brac.gov/Browse.aspx seems to only capture the first 25 documents. Tried again with the browse page as the starting point, but the crawl stopped after 1005 documents: it extracted 20 links from the browse page, and then there were no more URLs in the frontier queue which had been extracted from Browse.aspx -- perhaps this needs more experimentation.
B. Restricted to original host: again, only the first 25 pages from browse -- we can't even successfully pass a seed URL listing the maximum documents per browse page (50).
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 44 additional hosts were crawled. Note that because your crawl did not complete, this is not an accurate count of how many other sites your original site may link to.
[urls] [bytes] [host]
2034 1064389540 www.brac.gov
555 5874934 www.slu.edu
87 173510 www.cpcc.edu
54 154588 www.wmata.com
47 685158 www.sluhospital.com
44 3501 dns
44 582555 www.c-span.org
43 174467 www.adobe.com
38 178153 www.q-and-a.org
32 127325 slubkstore.com
24 140653 www.c-spanclassroom.org
23 326680 www.capitalnews.org
22 213116 cancercenter.slu.edu
21 196012 www.defenselink.mil
[List truncated…]

Curator Feedback to CDL (Glenn Defense Base Closure)

Crawl Success: somewhat effective
Crawl Success Comments: I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: once
Questions/Comments about crawl: One of the reasons I submitted this site to be crawled is that I'd already tried to harvest it using HTTrack. That product appeared to capture more of the documents than Heritrix - but we still haven't been able to capture the entire site.
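One possible workaround for database-backed browse pages like this, which we have not tested here, is to hand the crawler an explicit list of paginated browse URLs as extra seeds. The sketch below is purely hypothetical: the "page" parameter name and the page count are invented, and the real Browse.aspx parameters would have to be read out of the site's own links or form fields.

    # Hypothetical sketch: generate paginated browse URLs to use as extra seeds.
    BASE = "http://www.brac.gov/Browse.aspx"
    PAGES = 40                                   # e.g. ~1000 documents at 25 per page

    seeds = [f"{BASE}?page={n}" for n in range(1, PAGES + 1)]   # "page" is invented

    with open("brac-browse-seeds.txt", "w") as out:             # placeholder file name
        out.write("\n".join(seeds) + "\n")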


Gabriela Gray: Join Arnold
CDL Report to Curator

URL: http://www.joinarnold.com
Curator's original comments: "California Recovery Team. Non-profit pro-Arnold group, not registered as a campaign committee. Critical aspects: complex file and directory naming structure; looping, e.g. contactus.asp and contactadd.asp links."
Site copyright statement: "Copyright 2005"
Crawl Results:

Comments from crawl operator:
A. Linked hosts included: Great site for testing -- this loop is really interesting, because a new URL is generated with each loop, so that the duplicate detection underway at IA would still not eliminate it. 44,332 of the retrieved URLs were contact pages.
B. Restricted to original host: Got into a loop by the end; 999 documents retrieved, 34 minutes.
C. Restricted to original host + regular expression:


Excluding pages that matched the regular expression "contactadd\.asp\?c=" did not end the loop. What did end the loop: excluding both contactus and contactadd pages so they were not retrieved -- a drawback. (IA takes the manual approach of gathering the pages, then having an operator stop the crawl and take out the looping URLs by hand -- not scalable.)

<newObject name="contact" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
  <string name="decision">REJECT</string>
  <string name="regexp">.*contact.*\.asp\?c=.*</string>
</newObject>

Related hosts crawled: Because of looping problems, we were not able to crawl other hosts linked from this site.
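To show the difference between the two exclusion approaches, the small test below runs both patterns against invented stand-ins for the looping contact URLs (the reconstructed patterns and the sample URLs are assumptions, since the original regular expressions lost their punctuation in this report):

    # Sketch: the narrower pattern misses contactus pages; the broader one rejects both.
    import re

    narrow = re.compile(r"contactadd\.asp\?c=")          # did not end the loop
    broad  = re.compile(r".*contact.*\.asp\?c=.*")       # rejects both page types

    for url in ["http://www.joinarnold.com/contactus.asp?c=1234",    # invented examples
                "http://www.joinarnold.com/contactadd.asp?c=1234"]:
        print(url, "narrow:", bool(narrow.search(url)), "broad:", bool(broad.search(url)))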

Curator Feedback to CDL (Gray Arnold)

Crawl Success: somewhat effective
Crawl Success Comments: We spot-checked, and it looks like most files were captured, but the individual pages don't display most of the images. (This may simply be a problem with the WERA interface.) Strangely enough, the Flash files work perfectly, which is exactly the opposite of our own capture experience.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: As per the crawl notes, we only checked the "original host" version, since the "via" crawl failed.
Crawl Frequency: once
Questions/Comments about crawl: Same comments as on the Villaraigosa crawl. We're inexperienced with this type of capture: archiving files as-is and using a tacked-on script and a special server-side interface to interpret links in the new environment. Our model has been to actually alter the internal links from absolute to relative formats, so that the capture works in any environment.


Gabriela Gray: Mayor-Elect Villaraigosa
CDL Report to Curator

URL: http://www.antonio2005.com
Curator's original comments: "Critical aspects: Flash animation; content scattered across multiple servers; maintaining complex internal link structure; JavaScript menus; streaming media."
Site copyright statement: "©2005 Villaraigosa for Mayor 2005"
Crawl Results:

Comments from crawl operator:
• (for "linked hosts" results) Need feedback on the media, etc. retrieved -- this site is an ideal example of the need for scope+one.
• (for "restricted to original host") How much was left out due to the domain restriction? Need feedback.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 263 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:
[urls] [bytes] [host]
817 10291631 ensim3.interlix.com
805 117538973 www.antonio2005.com
472 6333775 www.laopinion.com
265 21173 dns
110 19355921 www2.dailynews.com
100 16605730 www2.dailybulletin.com
95 1410145 www.americanpresidents.org
86 820148 www.dailynews.com
73 168698 www.chumpsoft.com
72 52321 images.ibsys.com
69 836295 www.laobserved.com
65 137700 www.mysql.com
55 213569 www.ensim.com
55 177141 www.lamayorcn.com
55 296311 www.surveyusa.com
53 495858 abclocal.go.com
52 522324 www.c-span.org
51 244668 gallery.menalto.com
[list truncated…]

Curator Feedback to CDL (Gray Villaraigosa)

Crawl Success: mostly effective
Crawl Success Comments: Doing some spot checks, it looks like all of the pages were captured. There are some problems with media files -- WERA shows them when we search, but the files are often size 0. In addition, many files on external servers are listed and even have some descriptive info, but when we click on "Overview" it says "Sorry, no documents with the given uri were found," so we have no idea if they were really captured.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: Most of the links to other sites go to pages that truly are external to the site, not incorporated into it. The only exception would be the pages from ga3.org and ga4.org.
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) The result on the multiple-hosts crawl is mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains, rather than source + 1. Also, there seem to be a lot of links going to ensim3.interlix.com -- there were more files captured from there than from antonio2005. This is a case where it looks like the entire site (or much of it) was mirrored on two different servers. When we find these, we often try to collapse them into one seamless whole, eliminating the duplication, which is meaningless from the user's viewpoint. Simply capturing both mirrors and leaving the cross-links intact is an option we've used when we can't collapse, but it often leads to problems with links between the two, which seems to be the case here.


Crawl Frequency: once
Questions/Comments about crawl: We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE -- the image files seem to have been captured, but some of the links between the captured HTML pages and the captured images aren't working properly.


Ron Heckart and Nick Robinson: Public Policy Institute of California
CDL Report to Curator

URL: http://www.ppic.org
Curator's original comments: "We are particularly interested in their publications. We hope the crawler will be able to report when new publication files are posted on the website. Our main focus of interest is on their new publications, at http://www.ppic.org/main/newpubs.asp."
Site copyright statement: "All Contents © Public Policy Institute of California 2003, 2004, 2005"
Crawl Results:

Comments from crawl operator: We can't at the moment use Heritrix to report on new publications posted.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 238 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:
[urls] [bytes] [host]
2421 324309107 www.ppic.org
433 1367362 www.cacities.org
238 19286 dns
229 4675065 www.icma.org
200 598505 bookstore.icma.org
151 1437436 www.greatvalley.org
144 517953 www.kff.org
137 5304390 www.rff.org
113 510174 www-hoover.stanford.edu
102 1642991 www.knowledgeplex.org
97 101335 cdn.mapquest.com
81 379020 www.cde.ca.gov
73 184118 www.ilsg.org
68 4539957 caag.state.ca.us
62 246921 www.milkeninstitute.org
[list truncated…]

Curator Feedback to CDL (Heckart PPIC)

Crawl Success: mostly effective
Crawl Success Comments: There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to "Publications" and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on "Date" retrieves the message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Crawl Frequency: weekly
Questions/Comments about crawl: Our crawl report included the following comment from the crawl operator: "We can't at the moment use Heritrix to report on new publications posted." The ability to report on new publications is critical to our goal of using the crawler as a discovery tool. What are the prospects for providing this functionality in the future?
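Heritrix itself does not produce a new-publications report, but one rough way to approximate it, sketched below under the same crawl.log layout assumptions noted in the report key (and with placeholder log file names), is to compare the successfully captured URLs from two successive crawls and flag what is new:

    # Sketch: list URLs captured in the current crawl that were absent from the last one.
    def captured_urls(log_path):
        urls = set()
        with open(log_path) as log:
            for line in log:
                fields = line.split()
                if len(fields) >= 4 and fields[1] == "200":   # successful fetches only
                    urls.add(fields[3])
        return urls

    previous = captured_urls("ppic-previous-crawl.log")      # placeholder names
    current = captured_urls("ppic-current-crawl.log")

    for url in sorted(current - previous):
        if url.lower().endswith(".pdf"):     # e.g. restrict to PDFs, if publications are posted in that format
            print("new since last crawl:", url)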


Terry Huwe: AFL-CIO
CDL Report to Curator

URL: http://www.aflcio.org
Curator's original comments: "This site is content-rich and has many files that will be useful in the future. Specific areas that are of special interest follow below: http://www.aflcio.org/corporatewatch/ - the data related to executive pay watch is especially useful; http://www.aflcio.org/mediacenter/ - would like to see press stories captured if possible; http://www.aflcio.org/issues/ - links to newsletters and original content. Also, the "Legislative Action Center" on the home page; this is a useful topic guide to legislative history from a labor perspective."
Site copyright statement: "Copyright © 2005 AFL-CIO"
Crawl Results:

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 2572 additional hosts were crawled. The following hosts supplied more than 75 URLs to your site:
[urls] [bytes] [host]
12702 481956063 www.aflcio.org
2657 184477 dns
1375 35611678 www.local237teamsters.com
570 8144650 www.illinois.gov
502 52847039 www.ilo.org
435 3851046 www.cioslorit.org
427 2782314 www.nola.com
401 8414837 www1.paperthin.com
392 15725244 www.statehealthfacts.kff.org
326 4600633 www.dol.gov
288 12303728 searchoxide.com
284 3401275 www.sikids.com
280 3069385 www.washingtonpost.com
272 1480539 www.cdc.gov
235 5455692 www.kff.org
[list truncated…]

Curator Feedback to CDL (Huwe AFL-CIO)

Crawl Success: effective
Crawl Success Comments: I realize the collection interface is a work in progress and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g., Working Families Toolkit, BushWatch) that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2-million-plus-hits-per-year web sites, to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait, but I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses of the toolkit. Having said all that, my short answer is: I'm stoked. :-)
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In the case of this collection there's a lot of original content (and context-making documents) on the linked sites, simply due to the hierarchical/federal nature of the labor movement.
Crawl Frequency: monthly
Questions/Comments about crawl: I may have missed correspondence on this question, as I've been busy with a library renovation here. My question: Is the 1-gigabyte limit a useful one? I couldn't help wondering what the actual figure would be if that limit were set higher. I'm generally interested in the total size of web collections, as I need to monitor ours (which is content-rich but even so does not exceed 1 gigabyte). So discussion of this at least would be interesting, and maybe, if it's not technically challenging, that 1-gigabyte level might be raised and we could see what happens. Thanks to all involved--very interesting process.


Kris Kasianovitz: Los Angeles Dept. of City Planning
CDL Report to Curator

URL: http://cityplanning.lacity.org
Curator's original comments: "Website uses frames. Most of the documents will be PDFs. Of particular interest: the EIRs, which are not archived on the page once the project is approved (see http://cityplanning.lacity.org/EIR/TOC_EIR.htm), and the General and Community Plans (http://cityplanning.lacity.org/complan/gen_plan/genplan2.htm, http://cityplanning.lacity.org/ComPlan/cpbpage.htm)."
Site copyright statement: No copyright information found.
Crawl Results:

Comments from crawl operator: (Linked hosts included crawl) ended because it ground on for 3 days without hitting the data limit; not sure if URLs at end of log are valid/useful.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 119 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls]  [bytes]     [host]
10493   840876945   cityplanning.lacity.org
601     5156252     metrolinktrains.com
183     644377      www.cr.nps.gov
121     11162       dns
90      977850      www.metrolinktrains.com
81      1207859     www.fta.dot.gov
79      263432      www.fypower.org
66      333540      www.adobe.com
64      344638      lacity.org
63      133340      ceres.ca.gov
60      274940      www.amtrak.com
59      389217      www.nhtsa.dot.gov
58      347752      www.unitedweride.gov
52      209082      www.dot.gov
52      288783      www.nationaltrust.org
51      278949      www.portoflosangeles.org
[list truncated…]

Curator Feedback to CDL (Kasianovitz LA City Planning)

Crawl Success: mostly effective
Crawl Success Comments: The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all pdfs. When searching specifically for EIRs I got a large result list (699 citations); however, when I investigated whether or not the actual file was captured, I found that the main EIR page was captured (typically a htm/html file), but when I clicked on a link to get to the full report, all I got was the "Sorry, no documents with the given URI were found" message. This could be that the file was no longer available when the site was harvested; however, I tested a few of these and found that I could still access them on the city planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved. Here are a few specific examples:
Final EIR directory: http://cityplanning.lacity.org/EIR/Tocfeir.htm
Sierra Canyon Secondary School (cover page): http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/SierraCyn_coverpg.htm Access to the Draft EIR and Final EIR is provided from this cover page. Within the system the links to both the Draft and Final are broken; no documents with that given URI: http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/DEIR/Table of Contents.htm
Villa Marina EIR: http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
Directory of EIR notices of preparation: http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM


This provides links to EIR preparation notices (all pdfs), a total of 27 links/documents. I encountered the following three issues:

• pdf opened = when clicking on the link to the notice, the pdf opened with no problem: 16 of 27

• "Sorry, no document with the given URI was found" = no pdf harvested, but I could get to it from the live site: 4 of 27

• Acrobat "Could Not Open" message (could open live page outside of WERA) = the following Acrobat message came up when I tried to open the pdf: "Acrobat could not open 'ENV-2005-0881-EIR[1].pdf' because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded). To create an Adobe Acrobat PDF document, go to the source application. Then print the document to Adobe PDF." I copied and pasted the URL into a regular browser and could open the pdf with no problem. Also, in a few cases, if I clicked on the GO button after the first attempt to open the pdf in the system, it seemed to launch: 7 of 27

Conversely, I found a number of pages that contained full documents in html, with links to pdfs that worked with no problem. See the following document: http://cityplanning.lacity.org/cwd/gnlpln/transelt/TET2Bkgrnd.htm File types and error codes were what I expected.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The restricted-to-original-host setting is more helpful/relevant for getting to the materials from a specific agency. For some of the results that were restricted to host, I was getting external links. Here are some comparisons for each of the crawl settings.
Searched for "villa marina":
LA Dept. of City Planning: 6 results
http://cityplanning.lacity.org/EIR/NOPs/ENV-2004-3812-EIR.pdf
http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM
http://cityplanning.lacity.org/EIR/Tocfeir.htm
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/Cwd/GnlPln/HsgElt/HETblFigApVHgSit.htm
LA City Dept. of Planning (via): 2 results
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/Cwd/GnlPln/HsgElt/HETblFigApVHgSit.htm
Searched for "eir":
LA Dept. of City Planning: 699 results
LA City Dept. of Planning (via): 324 results


For both of these searches the URIs were from cityplanning.lacity.org.
Searched for "transportation":
LA Dept. of City Planning: 699 results
LA City Dept. of Planning (via): 290 results (most are from external sources and tended to be the index or main page of another agency or organization. Because this just got me to the main page and none of the links functioned at that level, the via result was less helpful. However, the via results are useful for discovering other agencies or organizations that I should be looking at for materials.)
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): The list of linked hosts provided at the end of the report is helpful. This will help me target other agencies that might be key places to check for further collection development.
Crawl Frequency: monthly
Questions/Comments about crawl: I want to qualify the frequency for this site. I'd like to do a monthly crawl for 3-4 months. I'd then want to reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot of new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation. After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report there are 8899 documents in this crawl alone; my other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls. I should note that at this point in time, for local documents, I am more interested in individual documents than in capturing an entire website and preserving the functionality. Local agencies (with the exception of perhaps the Mayor's site) tend not to change the design (look and feel) very often or have a very sophisticated design (i.e., flash, changing images, etc.), and I don't see that this would be of interest to researchers' needs; it is the content (reports, maps, etc.) that is contained/accessed on the websites that is important. Maybe I'm wrong or being short-sighted about that.


Kris Kasianovitz: Southern California Association of Governments
CDL Report to Curator

Curator's original comments: "This is a critical regional agency for Los Angeles, Orange, Ventura, Imperial, Riverside, and San Bernardino counties. Its main areas are Transportation, Housing, and Economic Development. This will provide an analysis of the overall site, which has a lot of content. Publications/reports are typically in pdf; they are presented as full reports and as pieces of the report (for easier downloading), so there might be duplication. The full report is really all that would be needed. See http://www.scag.ca.gov/publications The Resources page contains the pdfs, images, dynamic content, and gis programs, including an interactive atlas: http://www.scag.ca.gov/resources.htm One part of the Resource site is the Web Accessible Geographic Data Search (WAGS): http://mapsvr.scag.ca.gov/wags/index.cfm?fuseaction= It requires a user-created login and password (although there is a guest login that allows you to bypass this). I'm not sure what kind of difficulty the harvester will encounter with this portion of the site. The interactive atlas also has a create-an-account/guest login issue: http://mapsvr.scag.ca.gov/atlas/presmap.asp?Cmd=INIT Since it is a dynamic page, I don't know how this will be handled by the harvester."
Site copyright statement: "© 1999-2005 Southern California Association of Governments"
Crawl Results


Comments from crawl operator: Interesting login problem; Heritrix was unable to retrieve the guest login pages. Cold Fusion and asp don't generate new URLs and thus don't get crawled. NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than 1 minute.
Robots.txt file: The site you selected forbids crawlers from gathering certain data. It reads:

User-agent: *
Disallow: /_mm/
Disallow: /_notes/
Disallow: /_baks/
Disallow: /MMWIP/

User-agent: googlebot
Disallow: /*.csi

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 500 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls]  [bytes]     [host]
2517    863231651   www.scag.ca.gov
690     6134101     www.metrolinktrains.com
506     40063       dns
428     1084533     www.cacities.org
397     16161513    www.sce.com
196     581022      bookstore.icma.org
187     4505985     www.icma.org
175     7757737     www.ci.seal-beach.ca.us
158     1504151     www.h2ouse.org
149     940692      www.healthebay.org
137     317748      www.ci.pico-rivera.ca.us
130     18259431    www.ci.ventura.ca.us
123     490154      www.chinohills.org
121     406068      www.lakewoodcity.org
119     203542      www.lavote.net
117     2449995     www.ci.malibu.ca.us
114     744410      www.ci.irvine.ca.us
113     368023      www.whitehouse.gov
109     974674      www.dot.ca.gov
107     892192      www.lacanadaflintridge.com
[list truncated…]

Curator Feedback to CDL (Kasianovitz SCAG)

Crawl Success: mostly effective


Crawl Success Comments: Similar to my comments about the Los Angeles Dept. of City Planning - the crawl brought back a lot of webpages but not all the publications/documents that I would want to collect. Again, the same problem happened with SCAG as happened with the EIR example. The crawl brought back html pages with links to reports (typically in pdf format), but the actual documents were not captured. While the webpage is helpful, as it gives context, the main content that I'd want to capture (the reports) was not captured. See the following for example; none of the webpages linked from this page are available; they should link to a page that will have the material. I tried searching for the documents separately and couldn't get to them. See http://www.scag.ca.gov/publications/index.htm (the timeline arrows at the top seemed to function; I'm not sure what this is for) and http://www.scag.ca.gov/livable/pubs.htm I was impressed to find that zip files were captured and I was able to download them. Unfortunately, when I opened them there wasn't any content (I did the same search by mistake with the Arizona Dept. of Water Resources and actually found content in the folders). I found 10 with the search "type:zip". The gif or jpg images retrieved are not useful - most were just bars or bullets or covers of reports (although this might be helpful to identify titles, I think I would end up discarding these -- after doing more checking of the results).
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments from LA Dept. of City Planning. Restricted gets me to the relevant materials for that agency; via brings back too many main webpages for other agencies to be useful.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): I expected that there would be a problem with the content behind logins. The crawl confirmed that material behind login screens couldn't be captured. Can I get a copy of these crawl results? NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than one minute.
Crawl Frequency: monthly
Questions/Comments about crawl: How to handle the copyright issue? For the login information, I'm not sure what all was blocked by the robots file. Interesting that metrolinktrains.com is the #1 related host for both of my crawled sites.


Linda Kennedy: California Bay Delta Authority
CDL Report to Curator

URL: http://calwater.ca.gov
Curator's original comments: "We are interested in the environmental impact statements and other key documents, and the various news releases and other announcements and archives of CALFED."
Site copyright statement: "© 2001 CALFED Bay-Delta Program"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 519 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls]  [bytes]     [host]
1130    473192247   calwater.ca.gov
741     201538533   www.parks.ca.gov
521     40442       dns
373     51291934    solicitation.calwater.ca.gov
242     78913513    www.calwater.ca.gov
225     410972      cwea.org
209     87556344    www.science.calwater.ca.gov
173     109807146   science.calwater.ca.gov
172     1160607     www.adobe.com
129     517834      www.whitehouse.gov
[list truncated…]

Curator Feedback to CDL (Kennedy CALFED)


Crawl Success: mostly effective
Crawl Success Comments: I looked closely at the CALFED home page (http://calwater.ca.gov), Tribal Home page (http://calwater.ca.gov/Tribal/Tribal_Home.shtml), Key Documents (http://calwater.ca.gov/CALFEDDocuments/CALFEDDocuments.shtml), and Archives page (http://calwater.ca.gov/Archives/Archives.shtml). The crawl did not complete in either the via or the non-via search. Nearly all linked pages were retrieved in the non-via search. However, the retrievals from the via search were much less complete than the retrievals from the non-via search. For example, on the Key Documents page there were 3 missing links from the non-via search but 14 missing links from the via search. When Adobe documents were retrieved from either crawl, they came up correctly. Three asp links of tribal maps from the Tribal Home page were retrieved by the non-via search but not the via search. A few of the images were missing from the displays, and this was also affected by the browser used. Usually the same image was missing from both crawls, but sometimes the images were more complete in the non-via crawl retrievals. There were some display problems with the right-hand menu boxes on the http://calwater.ca.gov/Tribal/Tribal_Home.shtml page, for example, that did not display correctly when viewed in Firefox. The same page viewed correctly in Internet Explorer, but when printed out the boxes printed incorrectly, just as viewed in the Firefox browser. Grant Opportunities: http://calwater.ca.gov/GrantOpportunities/GrantInformation.shtml - this link did not work in 2 via and 2 non-via instances (from the Tribal home page and from the Archives page) but did work on one non-via crawl page (the CALFED home page). It could also be searched and retrieved directly from the test crawl search page.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments above. The non-via search was substantially more complete.
Crawl Frequency: monthly


Janet Martorana: Santa Barbara County Department of Planning and Development
CDL Report to Curator

URL: http://www.countyofsb.org/plandev/default.htm
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 487 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls]  [bytes]     [host]
3119    1102414495  www.countyofsb.org
485     34416       dns
428     1083047     www.cacities.org
357     6126453     www.sbcphd.org
320     6203035     icma.org
250     438507      www.sbcourts.org
234     1110744     vortex.accuweather.com
200     593112      bookstore.icma.org
[list truncated…]

Curator Feedback to CDL (Martorana SBCD)

Crawl Success: mostly effective


Crawl Success Comments: Most documents I expected to find were captured, but a number were not. For example, off this page http://www.countyofsb.org/plandev/comp/threeyear/2005-2008/default.html I expected to get to the final work program http://www.countyofsb.org/plandev/pdf/comp/programs/Three_Year_WP/2005-2008_3YrWrkProgram.pdf but got the "Sorry, no documents with the given URI were found" message. Other examples, within the Energy Division, a part of the Planning & Development Dept.: off this page http://www.countyofsb.org/energy/information.asp I could access all links except for two, http://www.countyofsb.org/energy/information/oil&GasFields.asp (Oil and Gas Fields) and http://www.countyofsb.org/energy/information/oil&GasProduction.asp (Oil and Gas Production). The crawler seemed to cut off the URL right before the ampersand; perhaps it has problems with ampersands.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly
Questions/Comments about crawl: I'd like to be notified when I've navigated away from the captured site. I found myself on real-time web sites but couldn't tell what was on the captured site and what was real. Example: When I did a search on "santa barbara" I got 1528 results using the via results. I went to the County of Santa Barbara Online site and then, after clicking on links, I was on the web real-time and not in the crawl results database any longer, yet there were no indications that I had left the crawled database. The WERA URI was still displaying at the top of the screen. I couldn't tell which were the captured sites and which were the current real-time sites. Other observations: the webpage navigation doesn't work, e.g., Table of Contents doesn't jump to that section on the webpage (www.countyofsb.org/energy/projects/shell.asp and www.countyofsb.org/energy/mitigation/oakProject.asp); links to glossary terms go to the glossary but not to the term itself.
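A plausible mechanism for the ampersand problem noted in the feedback above is the handling of HTML entities in links: in page source, an "&" inside a URL is normally written as "&amp;", and a link extractor that fails to decode the entity, or that treats the ampersand as a delimiter, will queue a truncated URL that the server cannot find. The following is an illustrative sketch only, with a hypothetical link; it is not a description of Heritrix's actual extractor.

# Illustrative sketch of how an "&" in a link can trip up link extraction.
# The anchor value below is a hypothetical stand-in, not taken from the crawl.
from html import unescape

anchor_href = "information/oil&amp;GasFields.asp"   # as it appears in HTML source

# A correct extractor decodes the entity before queueing the URL:
decoded = unescape(anchor_href)            # -> "information/oil&GasFields.asp"

# A buggy extractor that truncates at the delimiter would queue a URL the
# server answers with "not found":
truncated = anchor_href.split("&", 1)[0]   # -> "information/oil"

print(decoded)
print(truncated)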


Lucia Orlando: Monterey Bay National Marine Sanctuary
CDL Report to Curator

URL: http://montereybay.noaa.gov
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 795 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls]  [bytes]     [host]
5272    468755541   montereybay.noaa.gov
861     61141       dns
554     20831035    www.wunderground.com
368     4718168     montereybay.nos.noaa.gov
282     3682907     www.oceanfutures.org
273     10146417    www.mbnms-simon.org
260     7159780     www.mbayaq.org
163     61399       bc.us.yahoo.com
152     1273085     www.mbari.org
146     710203      www.monterey.com
119     3474881     www.rsis.com
119     279531      www.steinbeck.org
118     1092484     bonita.mbnms.nos.noaa.gov
109     924184      www.duke.edu
104     336986      www.montereybayaquarium.org
103     595953      icons.wunderground.com
102     339589      www.uncw.edu
[list truncated…]

Curator Feedback to CDL (Orlando - Monterey Bay)

Crawl Success: (rating not provided)
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: This site contains a large mix of government policy, laws, and regulatory information, as well as links to recreational and educational activities available in the MBNMS. I was most interested in links to laws/regs/policy and educational info and organizations. I thought the links restricted to the original host best captured this information succinctly.
Crawl Frequency: unknown


Richard Pearce-Moses: Arizona Department of Water Resources
CDL Report to Curator

URL: http://www.azwater.gov
Curator's original comments (redirects to http://www.azwater.gov/dwr): In arid Arizona, water is one of the most important - and most contested - resources. The publications and records of this Department are of critical value to the state. Our spider can get many files from this site (1474 files in 258 directories). We are mostly interested in documents by programs. Although our spider can't get the imaged documents database (http://www.azwater.gov/dwr/Content/ImagedRecords/default.htm), this directory may not be critical if we can get the imaged documents transferred to us. We are not interested in blank forms and applications.
Site copyright statement: "Copyright © 1998 - 2005 Arizona Department of Water Resources and ADWR Network. All Rights Reserved."
Crawl Results

Questions for curator: Did this capture the documents you needed?


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 195 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls]  [bytes]     [host]
2233    988447782   www.azwater.gov
286     2350888     www.water.az.gov
253     4587125     www.groundwater.org
226     3093331     www.azcentral.com
196     15626       dns
178     395216      www.macromedia.com
128     1679057     www.prescott.edu
123     947183      www.azleg.state.az.us
115     792968      www.usda.gov
[List truncated…]

Curator Feedback to CDL (Pearce-Moses AZWater)

Crawl Success: (not provided)
Crawl Success Comments: We were surprised that your crawl found 4888 documents. Another crawl that we conducted about the same time using wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand it, wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can; that may be the difference. The crawl is listed as not completing, but it appears to be very close based on the total number of bytes downloaded.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name or the wider scope that includes linked pages from other hosts?)
Questions/Comments about crawl: It seemed odd that it took 2h 54m to crawl 7709 documents with linked hosts, but 4h 4m to crawl only 4888 docs when the spider was restricted to the original host.


Richard Pearce-Moses: Citizens Clean Election Commission
CDL Report to Curator

URL: http://www.ccec.state.az.us/ccecscr/home.asp
Curator's original comments: This commission was established by initiative. Its work is of great historical significance, as it is changing the way the public elects officials. We have not been able to spider this site because links are buried in JavaScript. (We use wget as our spider.) We are primarily interested in acquiring their publications, election data, and things listed under "popular links."
Site copyright statement: "Copyright 2004 Arizona Citizens Clean Elections Commission. All Rights Reserved."
Crawl Results

Comments from crawl operator: JavaScript issue; interesting problem; need curator feedback about what we captured.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, the following additional sites were crawled. Total: 15


[urls]  [bytes]     [host]
929     95456563    www.ccec.state.az.us
76      6117977     www.azcleanelections.gov
55      513218      az.gov
49      499337      www.governor.state.az.us
44      174903      www.adobe.com
40      141202      www.azleg.state.az.us
31      18549       www.az.gov
28      202755      www.azsos.gov
23      462603      gita.state.az.us
19      213976      www.benefitoptions.az.gov
17      89612       www.azredistricting.org
14      1385        dns
3       1687        wwwimages.adobe.com
2       1850        www.capitolrideshare.com
2       26438       www.ftc.gov

Curator Feedback to CDL (Pearce-Moses CCEC)

Crawl Success: (not provided)
Crawl Success Comments: We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with wget.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name or the wider scope that includes linked pages from other hosts?)


Juri Stratford: City of Davis
CDL Report to Curator

URL: http://www.city.davis.ca.us
Curator's original comments: "We are primarily interested in the GIS data produced for the City of Davis. As the GIS data represent the largest part of the City of Davis web site, it may not be much more difficult to archive the site at the top level. Mapping and Geographic Information Systems URL: http://www.city.davis.ca.us/gis"
Site copyright statement: "This web site is Copyright © 2004 by the City of Davis. All Rights Reserved. The City retains the copyright on all text, graphic images and other content of this site. You may not copy, modify and/or re-use text, images or other web content from this web site, distribute the City's web content, mirror content from this web site on a non-City server, or make any other use of the content of this web site that would violate the City's copyright, without written permission from the City of Davis. To the extent allowed by law, commercial use of our web material is prohibited without written permission from the City of Davis. All art work shown on these web pages is protected by US Copyright laws. Limited reproduction for non-commercial purposes can be authorized by the City of Davis provided that requests are approved prior to use. Contact the Community Development Department Cultural Services program staff at (530) 757-5610 for more information. Some content included in this web site may be provided courtesy of third parties who may retain copyright control of the provided material. Any service marks and trademarks contained herein are the property of their respective owners."
Crawl Results


Comments from crawl operator: "GIS: Potential issue; /img is disallowed by robots.txt, e.g. http://www.city.davis.ca.us/img/featured/map-static.jpg can't be retrieved; also some maps on a second server are disallowed. Need feedback about the gis material that was captured; what was captured that is useful? Much duplication -- pages captured repeatedly."
Robots.txt: The site you selected prohibits crawlers from collecting certain documents. The file reads:

User-agent: *
Disallow: /img
Disallow: /calendar
Disallow: /misc/email.cfm
Disallow: /ed/business
Disallow: /gis/oldmap
Disallow: /police/log
Disallow: /pcs/grants/sacog
Disallow: /jobs/listings
Disallow: /css
Disallow: /pcs/nutcracker/history.cfm
Disallow: /pcs/nutcracker/pdfs

User-agent: asterias
Disallow: /

User-agent: gigabot
Disallow: /
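As an illustration of how rules like these affect a crawl, the sketch below checks two sample URLs against a subset of the reconstructed rules using Python's standard robots.txt parser; the sample URLs are taken from the crawl operator's comment and curator feedback above, and this is not part of the CDL toolset.

# Minimal sketch: checking URLs against a subset of the robots.txt rules above
# with Python's standard parser. A polite crawler skips anything disallowed.
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /img
Disallow: /calendar
Disallow: /police/log
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "http://www.city.davis.ca.us/img/featured/map-static.jpg"))  # False
print(rp.can_fetch("*", "http://www.city.davis.ca.us/gis/library/"))                 # True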

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 420 additional hosts were crawled. The following hosts supplied more than 50 files to your site:


[urls]  [bytes]     [host]
16455   947871325   www.city.davis.ca.us
420     29555       dns
332     10377948    www.asucd.ucdavis.edu
305     33270715    selectree.calpoly.edu
279     3815103     www.w3.org
161     2027740     www.cr.nps.gov
139     941939      www.comcast.com
133     951815      www.yolocounty.org
[List truncated…]

Curator Feedback to CDL (Stratford Davis)

Crawl Success: mostly effective
Crawl Success Comments: Looking at the GIS Online Maps page, it's not clear which formats were retrieved and which were not. For example, the Growth Map Flash file downloads fine, but the Flash/ArcIMS files do not download.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In general, restricted to original host works better. The broader search includes pages from businesses (e.g., Comcast) and other local agencies (e.g., other local and state government sites). But restricting the outside sites to the first level seems to be a good compromise.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.): In general, it looked like it did a good job pulling down geographic data/images. For example, you can pull down data from http://www.city.davis.ca.us/gis/library It's difficult for me to get a sense of the level of duplication from the way the search results display.
Crawl Frequency: monthly
Questions/Comments about crawl: I like the ability to navigate within the archive once you have a starting point. This is very nice.


Yvonne Wilson: Orange County Sanitation District
CDL Report to Curator

URL: http://www.ocsd.com
Curator's original comments: "This is an Orange County agency that has small reports and complex planning documents on its web site."
Site copyright statement: "Copyright © 2001-2005 Orange County Sanitation District. Unless a copyright is indicated, the information on this site is freely available for non-commercial, non-profit-making use. If a copyright is indicated on any materials displayed on our website, permission to copy these materials must be obtained from the original source. Commercial use of District materials is expressly prohibited without the written permission of the OCSD."
Crawl Results

Comments from crawl operator: In both crawl settings we had to limit the maximum number of retry attempts in order to complete the crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 85 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls]  [bytes]     [host]
755     85943567    www.ocsd.com
164     7635257     www.ci.seal-beach.ca.us
122     809190      www.ci.irvine.ca.us
95      169207      epa.gov
86      7673        dns
85      559125      order.e-arc.com
66      840581      www.ci.huntington-beach.ca.us
62      213476      www.cityoforange.org
57      313579      www.epa.gov
55      4477820     www.villapark.org
50      1843748     www.cityoflapalma.org
50      463285      www.ocbinc.com
[List truncated…]

Curator Feedback to CDL (Wilson OCSD)

Crawl Success: somewhat effective
Crawl Success Comments: Using WERA, I searched inquiries by type and title in the two OCSD collections, plain and via. I received no hits for pdf, only the homepage for html, and three hits for text. There are many pdf sections in the EIRs. I next searched by title in the two collections; I was most successful in via. By searching the titles "carbon canyon" and "Ellis Ave Pumping Station" I found articles but not the EIRs, which are available full text. At this point I presumed the crawl did not drill down far enough. Then I searched for a secondary page entitled "Ocean Monitoring"; this time the search found only an internal letter and memo, but not all the documents related to this topic. The via collection search seems to be the most productive, but it is not consistent.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: I searched some of the outside links to US Marine Fisheries and EPA Beach Watch and received no hits.
Crawl Frequency: monthly


Crawl Report Key: Web-at-Risk Test Crawls

This document is a guide to the test crawl report, providing further information about some of the statistics and results conveyed there. Your report includes basic information about the site you submitted, your original comments about that site, and any copyright statements CDL found when we examined the site. Your site was crawled using the Heritrix crawler, version 1.5.1. Your report will include the following information about the site:

Crawl Settings: We crawled each site in two different ways: A) Linked hosts included; B) Restricted to original host. Where linked hosts were included, we set the crawler to gather any outside page that your nominated site linked to, but no further. So if your site linked to a single document from whitehouse.gov, we captured that document but did not crawl any further on the whitehouse.gov site. Your report will show results from both styles of crawling, to give you a sense of whether or not the site draws heavily from valuable materials on another site. You will also receive a list of the other hosts that site linked to and how many documents were gathered from those hosts.

Robots.txt file: The presence of a robots.txt file means that the content provider is asking us to refrain from crawling either all or part of the site. This refers only to the host you named in your crawl request. In some cases the site had a robots.txt file but it didn't say anything; we noted when this occurred. We obeyed robots.txt instructions for these crawls, so if the site contained one but we still got a result, that means the robots.txt file only prevented us from crawling certain areas of the site. When robots.txt files were present, we have included the text of that file in your report so you can see which segments of the site the site owner wants to protect.

Crawl duration

Total number of documents: The "Documents" count will include page components (such as images or flash files).

File types (mime types): This area will contain a URL. When you go to that URL, you will see a list of the different file types that were retrieved as part of the crawl. IMPORTANT: The Heritrix crawler is currently experiencing difficulty with this report and it is missing a crucial column. Until that is fixed, you can see the different file types retrieved, from most common to least, but you cannot yet tell how many files each one included. CDL will contact you when this report has been fixed.

Response code reports: The URL in this column will lead to a list of response codes in order by frequency. This will include "200" for files that were successfully captured and error codes for files that were not captured. The error code list includes some codes specific to Heritrix; the key to interpreting these codes is at http://crawler.archive.org/articles/user_manual.html#statuscodes Note that this report only gives you quantitative information about response codes; it does not link response codes to specific files. For these details see "hosts report and crawl log" below.

How much data collected (bytes): The file size of the total crawl is reported in bytes. You can use the byte conversion tool at http://www.techtutorials.net/reference/byteconverters.html if you want to recalculate the size of the crawl in another measurement, such as kilobytes or megabytes.

Did crawl complete: This will say "no" if the crawl results exceeded 1 gigabyte or if the crawler encountered an obstacle to capturing the site that could not be fixed.

Location of hosts report and crawl log: You are welcome to review the page-by-page details of the crawl log. This is a generic report that comes with the Heritrix crawler and is not terribly user friendly, but it provides the most detail about the crawl process. This report will list every file that the crawler attempted to get and provide some information about each file. The Heritrix manual can help you interpret this report: http://crawler.archive.org/articles/user_manual.html#logs Go to section 8.2.1, Crawl Log.

Comments from Crawl Operator: These are observations that the Web Archive Programmer made about the crawl process for your site.

Questions for Curator: This section does not appear in every report. If you have a question listed here, please respond to it in the Test Crawl Feedback form.
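For curators who do want to dig into the crawl log, the rough sketch below tallies response codes and converts the total byte count to megabytes. It is not part of the CDL toolset; it assumes the whitespace-delimited Heritrix crawl.log layout in which the second field is the fetch status code and the third is the document size in bytes, and the file name used is hypothetical.

# Rough sketch: summarize a Heritrix-style crawl.log (assumed layout: the
# second whitespace-delimited field is the status code, the third is the size).
from collections import Counter

def summarize_crawl_log(path):
    codes = Counter()
    total_bytes = 0
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            fields = line.split()
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            codes[fields[1]] += 1
            if fields[2].isdigit():  # size is "-" for failed fetches
                total_bytes += int(fields[2])
    return codes, total_bytes

if __name__ == "__main__":
    codes, total_bytes = summarize_crawl_log("crawl.log")  # hypothetical file name
    for code, count in codes.most_common():
        print(code, count)
    print("total size: %.1f MB" % (total_bytes / (1024 * 1024)))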


Your Collection: Important: The screens you will see do not represent the final user interface for the Web Archiving Service tools. We are using WERA, an open-source search and display tool, only to show you your test crawl results. Because we did not seek the right to redistribute these documents, these pages are available only for the purpose of analyzing crawler effectiveness. You must have a password to view these pages. Your report will include the address of a wiki page and a login and password. Each site was crawled twice: plain crawl = only pages from the original site were collected; via = pages from the original site, as well as pages that site links to, were collected. Unfortunately, you cannot simply browse your site; you must select a collection and type a search. You will be able to navigate throughout your site once you load a page containing links. You will be able to review your colleagues' sites as well. Note that the WERA display tool is not perfect; if the same document was gathered from more than one crawl, it may not display in every collection.

Related Hosts Crawled: This section provides further information about the additional materials that were gathered when we set the crawler to include documents that your site links to. This can be critical in deciding what settings are needed to capture your site. Some sites, for instance, will keep all of their pdf or image files on a separate server. If you don't allow the crawler to move away from the original URI, you won't capture a critical portion of the site's content. In other cases, however, this setting will lead to irrelevant information. This report includes the most commonly linked hosts from your site.
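As a small illustration of how the related-hosts list might be put to use, the sketch below reads a report of "urls bytes host" rows, like the ones reproduced in this appendix, and picks out the hosts that supplied more than a threshold number of documents. The file name and threshold are hypothetical, and this is not part of the CDL toolset.

# Illustrative sketch: filter a related-hosts report ("urls bytes host" rows)
# for hosts that supplied more than a threshold number of documents.
def hosts_over_threshold(path, threshold=50):
    selected = []
    with open(path, encoding="utf-8") as report:
        for line in report:
            fields = line.split()
            if len(fields) != 3 or not fields[0].isdigit():
                continue  # skip the "[urls] [bytes] [host]" header and truncation notes
            urls, nbytes, host = int(fields[0]), int(fields[1]), fields[2]
            if urls > threshold:
                selected.append((urls, nbytes, host))
    return selected

for urls, nbytes, host in hosts_over_threshold("related_hosts.txt"):
    print(f"{host}: {urls} URLs, {nbytes} bytes")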



Richard Pearce-Moses   http://www.ccec.state.az.us/ccecscr/home.asp   Citizen's Clean Election Commission   Yes
Juri Stratford         http://www.city.davis.ca.us                    City of Davis, California             Yes
Juri Stratford         http://www.sacog.org                           Sacramento Area Council of Governments
Yvonne Wilson          http://www.ocsd.com                            The Orange County Sanitation District  Yes

Web-at-Risk Test Crawl Report: Appendix B The Katrina Crawl

The Crawl
During the early fall of 2005, the California Digital Library was embarking on a series of test crawls as part of the Web-at-Risk project when Hurricane Katrina struck. CDL had requested that curators submit their sample URLs by August 25th, and on Friday, August 26th, Louisiana Governor Blanco declared a state of emergency. It was over that weekend that CDL staff reached the conclusion that an event of this historic significance would shift our priorities. On Monday we suspended our initial test crawl plans and began preparing on a number of fronts to capture web-based materials related to Katrina. Specifically, this meant:

• identifying which web sites or sub-sites to collect, which involved:
  o notifying 30+ curators of our emergency crawl plans and requesting their assistance
  o setting up a web-based interface to gather their suggested seed URLs
  o sorting through the resulting seed list and feeding it to our primary and secondary (Stanford) crawl operators
• determining our collection parameters
• getting our crawlers installed and configured
• locating and setting up disk space to store the crawls
• initiating and monitoring the crawls

This crawl activity posed several new challenges. It was the first time CDL staff had used the Heritrix crawler, which was not yet installed on a production server when the hurricane struck. The last week of August saw CDL and UC Berkeley staff scrambling to find server space to set up Heritrix instances, disk partitions, and job parameters for the crawl. This was done with remarkable speed, owing to the urgency of the situation and to everyone's desire to capture a record of the events. Given our limited experience, in order to reduce the risk of losing the historically significant and fleeting materials related to Hurricane Katrina, CDL worked with Stanford University to concurrently run the same crawl using a different crawler. After both CDL and Stanford had crawled the same seed list for a month, the task was then taken up by the San Diego Supercomputer Center, who have continued crawling these sites using Heritrix.

Gathering the Seeds
The CDL sent out an initial request to the Web-at-Risk curators to submit URLs related to Hurricane Katrina for crawling. We worked collaboratively with a large group of content specialists to identify the sites:

• The 22 curators of the Web-at-Risk project (University of California libraries, University of North Texas, New York University, Stanford University, and the Arizona State Library)

• The Library of Congress
• Librarians at Louisiana State University and the University of Mississippi


Over the course of the crawl, the list of seed URLs grew to over 700 (just over 500 of which were crawled by CDL). The image of our input form above provides some sense of the range of materials collected. Given the sudden nature of this event, there was no time to investigate rights issues or technical problems each site might have presented. We informed the curators that "Our immediate plan is to simply collect the material before it disappears. We will not make the material immediately available." We also had little time for quality control and were not able to guarantee that sites in our seed list would be comprehensively crawled.

Crawling Specifics
CDL's first crawl was run on September 1, 2005, using a seed list of 89 URLs. The final crawl run by CDL was on October 10, using a seed list of 589 URLs. This final seed list was then sent to the San Diego Supercomputer Center, who are continuing to run twice-weekly crawls.


The CDL crawls were done using Heritrix version 1.5.1. We began with a single instance of Heritrix, eventually dividing the seed list among 6 instances. We got through the entire seed list 29 times in 40 days. However, the content we were able to collect was limited by our crawler settings. Our crawler was configured to visit one host at a time, collecting content at a conservative rate, and never more than three hops away from the seed URL. To ensure that the crawler moved through the seed list in a timely way, it was limited to spending 15 minutes at any given host. So the material collected does not represent the entirety of what was available at each site. Note that all of the seeds were crawled with the same configuration, no matter how different the structure of these sites might be. Thus the New Orleans version of Craig's List, NASA's information pages, and blog sites were all crawled in the same manner, despite being quite different in context, architecture, and other characteristics. Given our short preparation time, the goal was to find a crawler configuration that would match the broadest range of Katrina materials. In most cases, whether the seed URL was for a personal blog or for a government agency, the seed itself was not centrally about Katrina. The Katrina information was generally situated on the front page and top levels of each site, so CDL expected that this setting, even with the 15-minute time limit, would at least capture the content nearest to the front pages of the widest variety of sites we could capture.

In choosing crawler configuration settings we also faced conflicting goals. On one hand, we wanted to capture as much Katrina content as possible. On the other hand, we were very hesitant to start hitting sites that were providing much-needed information at a crucial time (emergency sites, relief sites), thus making sites hard to reach due to our capture activity. Many of these sites were already getting heavily used and perhaps were not running at full capacity. Some were also geographically impacted directly by the hurricane. So we had to choose settings that balanced the need to collect with politeness across a wide range of sites. When the San Diego Supercomputer Center took over the Katrina crawls in October, they revisited the Heritrix configuration settings. The SDSC crawl placed a limit not on the amount of time spent at a site but on the number of documents to be retrieved overall. Their crawls are being conducted "breadth first," in the sense of gathering pages consecutively across the seed list rather than gathering pages from one site at a time. The Stanford University crawl of the same sites began on September 7th and continued for 30 consecutive days using the WebVac crawler. While we have statistics concerning the size of the Stanford Katrina collection, it is very difficult to compare the configuration settings and effectiveness of WebVac vs. Heritrix. Further, there is no easy way to display materials gathered with WebVac, so the Stanford Katrina content is stored but is not accessible to us at the moment.
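The scheduling constraints described above (one host at a time, a limit of three hops from the seed, and a 15-minute budget per host) can be sketched roughly as follows. This is an illustration only, not the actual Heritrix configuration, which is expressed in Heritrix's own settings files; fetch_page() and extract_links() are hypothetical helpers standing in for the crawler's fetch and link-extraction stages.

# Rough illustration of the scheduling constraints described above.
import time

MAX_HOPS = 3                 # never more than three hops from the seed URL
HOST_TIME_BUDGET = 15 * 60   # 15 minutes per seed/host, in seconds

def crawl_seed(seed_url, fetch_page, extract_links):
    started = time.time()
    queue = [(seed_url, 0)]          # (url, hops from seed)
    seen = {seed_url}
    while queue and time.time() - started < HOST_TIME_BUDGET:
        url, hops = queue.pop(0)
        page = fetch_page(url)       # politeness delays would happen inside here
        if page is None or hops >= MAX_HOPS:
            continue                 # fetch at the hop limit, but follow no further
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                queue.append((link, hops + 1))

def crawl_seed_list(seeds, fetch_page, extract_links):
    # Visit one seed (host) at a time, moving through the whole list in order.
    for seed in seeds:
        crawl_seed(seed, fetch_page, extract_links)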

Katrina Crawl Results
In terms of creating a collection of the Katrina event on the web, we are not satisfied that CDL's crawling efforts were successful. As mentioned, we had to impose time


limits on each site visited and were not able to conduct quality control on the URLs that were captured. The total size of the Katrina capture at CDL is 50 GB. This represents 29 completed crawls of the seed list, which grew to over 500 seeds over the 40 days we ran the crawl. It's clear from the byte size alone that we barely skimmed the surface, and indeed the Stanford crawl proves this point: when the Stanford crawl was based on about 370 seeds, it was pulling in about 35 GB a day, or 15 million pages a day. In spite of this, our attempts to capture Katrina news and events did have some very useful and positive outcomes. In terms of test crawling, the experience was quite successful, opening our eyes to both certain obstacles and new possibilities. Most importantly, it prompted us to consider the nature and demands of event-based crawling, which had not been part of our original test plan. Here are some of the aspects of event-based crawls that Katrina surfaced.

Site Selection and Classification
This type of event demands a deep collaborative effort to identify material to collect. CDL had to mobilize a large number of curators to make their best guesses at identifying promising sites. We set up a "crawl seed nomination" web form, accessible to the curators from their home institutions, and invited the curators to use it to enter their suggested seed URLs. CDL staff did not have time to add many automatic error or duplicate checking features, so this cumbersome work was done by hand. On the other hand, it provided us a first-hand trial of what a more general curator user interface might require. The selection and management of seed lists is critical for sudden event-based crawls. The curators contributing the URLs will not necessarily be well-versed in the topic; in the case of Katrina, curators in California were not uniformly familiar with the Gulf Coast, the towns, the government agencies, etc. In addition, it is difficult to predict which aspects of the event will be of historic, enduring value. Because disk storage was not a pressing issue, it was better to err on the side of a wider net when selecting sites. It became clear that there might be a role for "smart crawling," which would spread a wider net than that provided by human-generated seed lists, through the use of automated tools that discovered relevant materials. We were specifically interested in using tools developed at Emory and Cornell that interface with the Rainbow text classification tool.7 We did not have the resources to investigate this for the Katrina crawl, but plan to fold it into future crawling tests. Finally, it is worth examining why the seed list grew continuously throughout the event. Part of this growth is due to the natural increase in the number of pages and sites devoted to the emergency, and to having the additional time to identify and add them. However, the nature of the event itself changed over time, starting as a hurricane, then becoming a flood, a massive relocation, and a political and social issue. So the range of relevant sites changed as the event itself took on broader

7 Bergmark, Donna. "Heritrix processor for use with rainbow." <http://groups.yahoo.com/group/archive-crawler/message/1905>


implications. This suggests that site selection is an ongoing process, not strictly an activity undertaken at the beginning of an event.
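To make the "smart crawling" idea concrete, here is a minimal sketch that uses a plain keyword score as a stand-in for the Rainbow text classifier mentioned above; the keyword list, threshold, and data layout are invented for illustration and are not from the project.

# Rough sketch of "smart crawling" seed expansion: score fetched pages for
# topical relevance and promote the links on relevant pages to candidate seeds.
KEYWORDS = {"katrina", "hurricane", "levee", "evacuation", "fema", "flood"}
THRESHOLD = 3

def relevance_score(page_text):
    words = page_text.lower().split()
    return sum(words.count(k) for k in KEYWORDS)

def expand_seeds(pages):
    """pages: iterable of (url, text, outlinks). Returns suggested new seeds."""
    suggestions = set()
    for url, text, outlinks in pages:
        if relevance_score(text) >= THRESHOLD:
            suggestions.update(outlinks)   # a curator would still review these
    return suggestions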

Rights, Ownership, and Responsibilities
Our current rights protocol stipulates that when a web page is clearly in the for-profit domain, we must seek permission from the content owners to collect. This restriction applies most notably to news agency sites, which would likely be a critical part of the web-based fallout of any major event. This was certainly the case with Katrina. However, rights negotiation is a time-consuming process. In the case of a planned event such as an election, one is able to anticipate a certain number of sources and take rights management steps before the event occurs. But with emergencies such as Katrina, there is simply no way to secure rights in advance without missing time-sensitive material. Knowing this would be an issue, CDL notified our curators in advance that the Katrina materials we gathered would not be publicly accessible but merely captured and preserved. The complex rights issues behind web archiving, as well as the collaborative environment described above, raise a number of questions:

• Who owns the content?
• Who is responsible for the project?
• Who is responsible for responding if a content owner objects?
• When is it permissible to ignore robots exclusion files?
• How much rights management can be done as part of post-processing for an event crawl?
• Can there be standing agreements with major news agencies to allow for this type of crawl in emergency situations? If so, what constitutes an emergency?

Technical Infrastructure
CDL did not have an adequate technical infrastructure in place at the time of the Katrina crawl. As mentioned above, staff at both CDL and UC Berkeley scrambled to find server space and to install and configure the crawlers. This sudden shift in workload left the system vulnerable and somewhat unstable. In order to start the crawling process as soon as possible, CDL began the project on a temporary server. Making the transition to more permanent storage in the midst of this crawling project was neither easy nor flawless, and a certain amount of data had to be recovered from backup sources.

Information Analysis and Display
The challenge of analyzing crawl results has been described in some detail in our Test Crawl Report. That report cites a curator who found the task of reviewing an 8899-document crawl to be unmanageable. The tools currently available are not at all up to the task of analyzing a large and complex crawl. An event-based crawl is


likely to result in massive amounts of data of widely varying quality. The selection of seeds is based on guesses that, given sufficient resources, should be reviewed, refined, and enhanced as the event progresses. Time series data based on changing input parameters represents a kind of moving target that suggests the need to develop new analysis tools. Andreas Paepcke of Stanford University addressed this problem when considering how a social scientist might want to analyze the Katrina materials. Consider the researcher who is interested in finding out how quickly the notion of race entered the public discourse in the aftermath of Katrina. Paepcke suggests:

"For example, the tool could perform word frequency analysis across successive crawls of Katrina sites, so that the user could identify the emergence of word usage such as 'looter,' 'ninth ward,' etc. The social scientist would then define within the tool clusters of words that in the scientist's judgment are direct or indirect occurrence indicators of the concept 'race' (i.e., 'ninth ward,' 'looters,' 'lacks,' 'poor'). The scientist could subsequently interact with the tool at the level of these well-defined concepts. Example command: count occurrences of 'race' within the first three paragraphs of all pages and tabulate the differences across 10 days. We could go further and apply well-known topic categorization algorithms on the data to suggest new concepts as they arise in the series of text streams. For example, the tool might identify an increasing frequency of the term 'mismanagement' beginning in week two after the storm and alert the sociologist to this evidence of a new discourse theme." 8

While we don't expect to be able to incorporate this level of analysis into our web archiving tools any time soon, these ideas illustrate a useful direction for archival analysis tools.
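As a rough sketch of the word-frequency analysis Paepcke describes, the following counts a fixed set of terms across successive crawls of the same sites; the term list and the input layout (per-crawl-date collections of page texts) are hypothetical, and this is not an implementation of any existing project tool.

# Rough sketch: track how often a set of terms appears in successive crawls.
from collections import Counter

TERMS = {"looter", "ninth ward", "race", "mismanagement"}

def term_frequencies_by_crawl(crawls):
    """crawls: dict mapping crawl date -> iterable of page texts."""
    trend = {}
    for crawl_date, pages in crawls.items():
        counts = Counter()
        for text in pages:
            lowered = text.lower()
            for term in TERMS:
                counts[term] += lowered.count(term)
        trend[crawl_date] = counts
    return trend

# A researcher could then compare counts across dates to spot emerging themes,
# e.g. trend["2005-09-01"]["mismanagement"] vs. trend["2005-09-15"]["mismanagement"].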

Conclusions
At the time Hurricane Katrina hit, the Web-at-Risk service requirements and test crawl plan had been written for a more orderly and considered approach to web crawling. Our attempt to capture the web-based aftermath of Katrina highlighted our need to revisit the service requirements and see how well they supported an event-based response. We need to develop a collaborative and robust mechanism to identify and collect seed URLs. When an event happens quickly, it is also important to have in place general crawling guidelines that will result in a well-rounded collection. The technical infrastructure for the Web Archiving Service that we are building for the Web-at-Risk project should be robust enough to handle occasional and sudden bursts of activity. Attempting to alter the infrastructure quickly in reaction to emergency events impairs our responsiveness and leaves the altered systems in a vulnerable state.

8 Paepcke, Andreas, Senior Research Scientist and Director of the Digital Library, Stanford University. Email correspondence with Patricia Cruse, October 26, 2005.


Finally, work needs to continue both on the rights management front and on developing improved web archiving analysis tools, so that the material gathered can be used to its greatest potential.


Web-at-Risk Test Crawl Report: Appendix C Individual Crawl Reports
Included below are the crawl reports provided to individual curators, including their analysis and feedback about those results. Note that a key to interpreting the tables in these crawl reports is provided at the end of this appendix.

Elizabeth Cowell (submitted by Ann Latta): UC Merced
CDL Report to Curator

URL: http://www.ucmerced.edu
Curator's original comments: "UC Merced is the first research university to be built in the 21st century. The educational and land use issues are significant. Of particular interest is http://www.ucmercedplanning.net This site addresses major issues of land use - the university is being built on agricultural land. Controversy existed re issues of redevelopment of downtown Merced vs. appropriation of agricultural land - there are major environmental issues focused on endangered species - Educational issues involving faculty job descriptions, student body, etc. are significant because of the economic, ethnic, and cultural diversity of the region."
Site copyright statement: "© 2004 UC Regents"
Crawl Results


Comments from crawl operator: When we set the crawl to include pages from linked sites, the crawler got "trapped" at the Elsevier site. There is JavaScript on that linked page that causes the crawler to continue looking for additional pages on the Elsevier site, even when you're only trying to capture a single page. Once we set a limit for the maximum number of retry attempts, the crawl completed. This data is from the completed crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 227 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls]  [bytes]    [host]
969     71552369   www.ucmerced.edu
238     2564803    www.ucop.edu
226     14851      dns
197     16583197   www.universityofcalifornia.edu
156     8487817    www.elsevier.com
151     1437436    www.greatvalley.org
112     2354582    faculty.ucmerced.edu
105     5659795    www.pacific.edu
90      111985     k12.ucop.edu
86      255733     www-cms.llnl.gov
85      1178031    admissions.ucmerced.edu
81      297947     uc-industry.berkeley.edu
71      108265     www.mssmfoundation.org
67      349300     www.nps.gov
66      308926     www.usafreedomcorps.gov
54      137085     slugstore.ucsc.edu
52      52202      www.cerrocoso.edu
51      977315     www.universityofcalifornia.com

Curator Feedback to CDL (Cowell Merced)

Crawl Success: mostly effective


Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: 10 more links came up in the "via" search; 8 of these were not useful for the research of land use issues. The two that were useful were a benefit of the "via" search.
Crawl Frequency: monthly


Sherry DeDekker: California Water Science Center
CDL Report to Curator

URL: http://ca.water.usgs.gov
Curator's original comments: "We are interested in the CA water reports and databases behind the links on this page. Also in looking at http://waterdata.usgs.gov/ca/nwis/nwis; this section is an interactive interface to multiple data sets. Is it possible to capture this type of site as well as the static reports?"
Site copyright statement: "Information presented on this website is considered public information (unless otherwise noted) and may be distributed or copied. Use of appropriate byline/photo/image credit is requested. We strongly recommend that USGS data be acquired directly from a USGS server and not through other sources that may change the data in some way."
Crawl Results

Comments from crawl operator: Databases (e.g. http://waterdata.usgs.gov/ca/nwis/nwis) disallowed by http://waterdata.usgs.gov/robots.txt. Some water data reports are NOT caught by the broader "linked hosts included" settings, e.g. http://pubs.usgs.gov/wdr/2004/wdr-ca-04-1 -- would also want to submit http://ca.water.usgs.gov/waterdata as a seed.
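The robots.txt restriction noted above can be checked before a crawl is launched. Here is a minimal sketch using Python's standard robotparser; the user-agent string is an arbitrary example, not the one the project actually used.

    from urllib.robotparser import RobotFileParser

    def is_crawlable(seed_url, robots_url, user_agent="webatrisk-testcrawler"):
        """Return True if robots.txt at robots_url permits fetching seed_url."""
        parser = RobotFileParser()
        parser.set_url(robots_url)
        parser.read()  # fetches and parses the robots.txt file
        return parser.can_fetch(user_agent, seed_url)

    # Example: the NWIS database pages were disallowed for crawlers.
    print(is_crawlable(
        "http://waterdata.usgs.gov/ca/nwis/nwis",
        "http://waterdata.usgs.gov/robots.txt"))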


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 662 additional sites were crawled. The following list includes the hosts that supplied more than 50 files. Note that the host "pubs.usgs.gov" supplied a higher number of files than the original host itself.

[urls]  [bytes]    [host]
1963    255912820  pubs.usgs.gov
1153    47066381   ca.water.usgs.gov
698     56570      dns
404     112354772  geopubs.wr.usgs.gov
385     9377715    water.usgs.gov
327     203939163  greenwood.cr.usgs.gov
318     17431487   www.elsevier.com
219     3254794    www.usgs.gov
189     2737159    www.lsu.edu
163     2292905    wrgis.wr.usgs.gov
158     31124201   www.epa.gov
149     921063     www.usda.gov
[list truncated…]

Curator Feedback to CDL (DeDekker CWSC)

Crawl Success: somewhat effective
Crawl Success Comments: Site appears to access water data reports (http://ca.water.usgs.gov/archive/waterdata/index.html), but none are actually available through the links. I expected the site to not be able to access real-time data, but these are archived reports.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly


Megan Dreger (submitted by James Jacobs): City of San Diego Planning Department

CDL Report to Curator

URL: http://www.sandiego.gov/planning
Curator's original comments: "I would like to drill down several levels (at least 3) of this site. For example, following the link to City of Villages/general plan update leads to many more important planning documents."
Site copyright statement: This site contains the two following notices on the same page:
"Restrictions on Use of Materials: This site is operated and maintained by the City of San Diego through its Department of Information Technology and Communications (referred to as IT&C). Except as provided herein, no material or information from this site may be copied, reproduced, republished, uploaded, posted, transmitted, or distributed except as authorized in this notice, expressly authorized within this site, or approved in writing by IT&C."
"Copyright Notice: Unless a copyright is indicated, information on the City of San Diego Web site is in the public domain and may be reproduced, published or otherwise used with the City of San Diego's permission. We request only that the City of San Diego be cited as the source of the information and that any photo credits, graphics or bylines be similarly credited to the photographer, author or City of San Diego, as appropriate. If a copyright is indicated on a photo, graphic, or any other material, permission to copy these materials must be obtained from the original source."
Crawl Results


Comments from crawl operator: Need feedback about whether the desired content was retrieved. Question for curator: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 1084 additional hosts were crawled. (This figure does not represent a complete crawl, as the size limitation was reached.) The following hosts supplied more than 75 documents to your site:

[urls]  [bytes]    [host]
3728    556231640  www.sandiego.gov
1247    38685244   genesis.sannet.gov
1085    80905      dns
807     6676252    www.houstontexans.com
428     1079658    www.cacities.org
399     102298888  www.buccaneers.com
259     1797232    granicus.sandiego.gov
258     42666066   clerkdoc.sannet.gov
238     5413894    www.ccdc.com
225     2503591    www.ci.el-cajon.ca.us
223     1387347    www.ipl.org
217     2683826    www.sdcounty.ca.gov
203     11673212   restaurants.sandiego.com
195     2620365    www.sdcommute.com
192     1344523    www.bengals.com
189     2221192    www.kidsdomain.com
176     1333528    www.buffalobills.com
171     685965     www.chumpsoft.com
166     277238     www.proquest.com
[list truncated…]
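The operator's question above turns on two different notions of "level": directory depth in the URL path versus the number of link hops from the seed. A minimal Python sketch of the distinction follows; the example URL is illustrative only.

    from urllib.parse import urlparse

    def directory_depth(url):
        """Number of path segments, e.g. /planning/community/profiles.shtml -> 3."""
        path = urlparse(url).path
        return len([segment for segment in path.split("/") if segment])

    # Directory depth is a property of the URL itself.
    print(directory_depth("http://www.sandiego.gov/planning/community/profiles.shtml"))  # 3

    # Link hops, by contrast, are a property of how the crawler reached the page:
    # seed page = hop 0, pages it links to = hop 1, and so on. A document can be
    # many directories deep yet only one hop from the seed, or vice versa, which
    # is why "drill down several levels" is ambiguous without clarification.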


Curator Feedback to CDL (Dreger San Diego)

Crawl Success: mostly effective
Crawl Success Comments: This crawl was not completed due to size, so that may explain some of my questions. It was pretty effective in terms of getting the Planning Dept. pages, but went out further than I expected. Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example, http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit. It's nice that PDFs and other formats are included.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The linked hosts included ("via") crawl seemed to include more extraneous stuff.
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
A: You asked for a better definition of what was meant by the request to "drill down several levels (at least 3)." Unfortunately, I wasn't the original curator and I'm not sure what he meant. The crawl you did was very useful.
Crawl Frequency: monthly
Questions / Comments about crawl: I think that the crawl frequency should be at least monthly for these pages.


Peter Filardo and Michael Nash: New York City Central Labor Council
CDL Report to Curator

URL: http://www.nycclc.org
Curator's original comments: (none)
Site copyright statement: "© 2004 New York City Central Labor Council. No portion of this website may be reproduced in any form without permission from the Central Labor Council. Contact our offices for more information at nycaflcio@aol.com."
Crawl Results
NOTE: Because your Crawl "A" had to be stopped, then resumed, each of your reports for that crawl is in two segments. To browse a list of all reports for that crawl, go to http://voro.cdlib.org:8081/ingest_misc/ndiipp/testcrawls_raw/filardo_labor_via

Comments from crawl operator: A. Linked hosts included: Crawl complete after recovery, with the addition of max retries. Seemed to hang at http://www.nycclc.org/calendar/event.asp?EventId=501 and http://www.nycclc.org/assets/HLC/application/membership.pdf. Ended the crawl when it seemed to hang; recovered from the previous job, and the recovery was successful. Note for the future that a recovered job is identifiable because the logs directory is called logs-R.


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, x additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls]  [bytes]   [host]
1913    74260017  www.nycclc.org
156     11755     dns
115     710552    www.aflcio.org
73      1477966   www.comptroller.nyc.gov
71      193264    www.empirepage.com
60      570115    www.redcross.org
58      269079    www.afl-cio.org
57      240845    www.campsussex.org
57      113676    www.mssm.edu
56      449473    www.labor-studies.org
53      184605    www.pbbc.org
52      134326    www.senate.gov
[list truncated…]

Curator Feedback to CDL (Filardo NYCCLC)

None provided


Valerie Glenn and Arelene Weibel: Strengthening Social Security
CDL Report to Curator

URL: http://www.strengtheningsocialsecurity.gov
Curator's original comments: "Contains external links to audio & video that would be essential to completing this site (see press room); some are files, some are links to webcasts, some are on external .gov sites, and some are external .com sites."
Site copyright statement: Copyright info not found
Crawl Results

Comments from crawl operator: Interesting audio/video/webcast content -- need feedback about success in capturing it. Questions for Curator: How successfully did this crawl capture the multimedia documents you were interested in?
Comments from coordinator: In the media center area of this site (http://www.strengtheningsocialsecurity.gov/press/media_center.shtml) I've found .ram (both video and audio alone), .smil, and .asx files. The site also contains numerous .ppt and .pdf files.
• A text search on the log file turns up numerous .ram files, only 1 .ppt file.


• .asx files are Windows streaming media redirector files, which generally lead to associated wmf files. No .asx references appear in the crawl log, nor do any wmf files.
• Similarly, .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files. I assume that when displayed, some of the real media files from this site would function, but many of the other multimedia files would not.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 388 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls]  [bytes]   [host]
660     10668874  www.chelseapiers.com
562     7334035   www.whitehouse.gov
477     6366197   www.laopinion.com
391     29623     dns
356     3874719   www.wkrc.com
243     12294240  www.strengtheningsocialsecurity.gov
178     1935969   www.xavier.edu
148     237055    image.com.com
127     682069    online.wsj.com
117     898439    www.omaha.com
116     514995    www.npr.org
108     995733    www.nba.com
[list truncated…]
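The coordinator's check above (searching the crawl log for media file types) can be scripted. A minimal Python sketch follows, assuming a Heritrix-style crawl.log in which the fetched URI is a whitespace-delimited field; the field index and log file name are assumptions, not project settings.

    from collections import Counter

    MEDIA_EXTENSIONS = (".ram", ".smil", ".asx", ".rm", ".wmv", ".ppt", ".pdf")

    def count_media_files(log_path="crawl.log", uri_field=3):
        """Tally fetched URIs in a crawl log by media file extension."""
        counts = Counter()
        with open(log_path, encoding="utf-8", errors="replace") as log:
            for line in log:
                fields = line.split()
                if len(fields) <= uri_field:
                    continue
                uri = fields[uri_field].lower()
                for ext in MEDIA_EXTENSIONS:
                    if uri.endswith(ext):
                        counts[ext] += 1
        return counts

    # Example:
    # print(count_media_files("crawl.log"))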

Curator Feedback to CDL (Glenn Strength Soc Sec)

Crawl Success: mostly effective
Crawl Success Comments: My main concern about this site was the multimedia documents included - I've posted those thoughts in the "Questions for Curator" text box.
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: For this site it was essential to capture the linked hosts ("via"), because many of the press materials, etc. were on external sites.
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: How successfully did this crawl capture the multimedia documents you were interested in?
A: I'm disappointed that not all of the multimedia files were captured, but there seem to be only a few that aren't included. I found it interesting that


broadcasts from the same host (whitehouse.gov) weren't completely captured - some were, some weren't.
Crawl Frequency: once
Questions / Comments about crawl: [note: the curator alludes to adding a comment here, but no comment was received]


Valerie Glenn and Arelene Weibel: Defense Base Closure and Realignment Commission
CDL Report to Curator

URL: http://www.brac.gov
Curator's original comments: "In our previous efforts we have been unable to capture agency databases. The BRAC site includes a document library which has a search feature (http://www.brac.gov/Search.aspx) and a browse feature (http://www.brac.gov/Browse.aspx). We would really like to see how this information can be captured so that we can recreate it on our own servers."
Site copyright statement: "The contents of all material available on this Internet site are in the public domain and are not copyrighted. The content of this site may be freely reproduced, downloaded, disseminated, published, or transferred in any form and by any means. However, in some cases the copyright for certain text or images on this site may be held by other parties."
Crawl Results

Comments from crawl operator: A. Linked hosts included: http://www.brac.gov/Search.aspx can't be captured by Heritrix; http://www.brac.gov/Browse.aspx seems to only capture the first 25 documents. Tried again with the browse page as the starting point, but the crawl stopped after 1005 documents; 20 links were extracted from the browse page, and then there were no more URLs in the frontier queue which had been extracted from Browse.aspx -


-perhaps need more experimentation. B. Restricted to original host: again, only the first 25 pages from browse -- can't even successfully pass a seed URL listing the max docs per browse page (50).

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 44 additional hosts were crawled. Note that because your crawl did not complete, this is not an accurate count of how many other sites your original site may link to.

[urls]  [bytes]     [host]
2034    1064389540  www.brac.gov
555     5874934     www.slu.edu
87      173510      www.cpcc.edu
54      154588      www.wmata.com
47      685158      www.sluhospital.com
44      3501        dns
44      582555      www.c-span.org
43      174467      www.adobe.com
38      178153      www.q-and-a.org
32      127325      slubkstore.com
24      140653      www.c-spanclassroom.org
23      326680      www.capitalnews.org
22      213116      cancercenter.slu.edu
21      196012      www.defenselink.mil
[List truncated…]
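One common workaround for a browse interface that only exposes its first page of results is to enumerate the paging URLs explicitly and submit each one as an additional seed. The Python sketch below is purely illustrative: the "page" parameter name, the page count, and the example.gov host are hypothetical, not the BRAC site's actual query interface.

    from urllib.parse import urlencode

    def paged_seed_urls(base_url, pages, extra_params=None):
        """Yield one seed URL per results page of a paginated browse interface."""
        for page_number in range(1, pages + 1):
            params = {"page": page_number}  # hypothetical parameter name
            if extra_params:
                params.update(extra_params)
            yield f"{base_url}?{urlencode(params)}"

    # Example: generate five seeds for a hypothetical browse page.
    for seed in paged_seed_urls("http://www.example.gov/Browse.aspx", pages=5):
        print(seed)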

Curator Feedback to CDL (Glenn Defense Base Closure)

Crawl Success: somewhat effective
Crawl Success Comments: I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: once
Questions / Comments about crawl: One of the reasons I submitted this site to be crawled is that I'd already tried to harvest it using HTTrack. That product appeared to capture more of the documents than Heritrix - but we still haven't been able to capture the entire site.


Gabriela Gray: Join Arnold
CDL Report to Curator

URL: http://www.joinarnold.com
Curator's original comments: "California Recovery Team: non-profit, pro-Arnold group not registered as a campaign committee. Critical aspects: complex file and directory naming structure; looping, e.g. contactus.asp and contactadd.asp links."
Site copyright statement: "Copyright 2005"
Crawl Results

Comments from crawl operator: A. Linked hosts included: Great site for testing -- this loop is really interesting, because a new URL is generated with each loop, so that the duplicate detection underway at IA would still not eliminate it. 44332 of the retrieved URLs were contact pages. B. Restricted to original host: Got into a loop; by the end, 999 documents were retrieved in 34 minutes. C. Restricted to original host + regular expression:


Excluding pages that matched the regular expression ".*contact.*asp.*c=.*" did not end the loop. What did end the loop: excluding both contactus and contactadd pages so they were not retrieved -- a drawback. (IA takes the manual approach of gathering the pages, then having an operator stop the crawl and take out the looping URLs by hand -- not scaleable.)

    <newObject name="contact" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
      <string name="decision">REJECT</string>
      <string name="regexp">.*contact.*asp.*c=.*</string>
    </newObject>

Related hosts crawled: Because of looping problems, we were not able to crawl other hosts linked from this site.
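A quick way to sanity-check an exclusion pattern like the one above before re-running a crawl is to test it against URLs pulled from the previous crawl log. A minimal Python sketch follows; the sample URLs use example.com and are made up for illustration.

    import re

    # Candidate exclusion pattern from the report above; decide rules of this
    # kind require the pattern to match the entire URI.
    EXCLUDE = re.compile(r".*contact.*asp.*c=.*")

    sample_uris = [
        "http://www.example.com/contactus.asp?c=12345",   # looping contact page
        "http://www.example.com/contactadd.asp?c=67890",  # looping contact page
        "http://www.example.com/about.asp",               # normal page
    ]

    for uri in sample_uris:
        decision = "REJECT" if EXCLUDE.fullmatch(uri) else "PASS"
        print(f"{decision}  {uri}")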

Curator Feedback to CDL (Gray Arnold)

Crawl Success: somewhat effective
Crawl Success Comments: We spot-checked, and it looks like most files were captured, but the individual pages don't display most of the images. (This may simply be a problem with the WERA interface.) Strangely enough, the Flash files work perfectly, which is exactly the opposite of our own capture experience.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: As per the crawl notes, we only checked the original-host version, since the "via" crawl failed.
Crawl Frequency: once
Questions / Comments about crawl: Same comments as on the Villaraigosa crawl. We're inexperienced with this type of capture: archive files as-is and use a tacked-on script and a special server-side interface to interpret links in the new environment. Our model has been to actually alter the internal links from absolute to relative formats so that the site works in any environment.


Gabriela Gray: Mayor-Elect Villaraigosa
CDL Report to Curator

URL: http://www.antonio2005.com
Curator's original comments: "Critical aspects: Flash animation; content scattered across multiple servers; maintaining complex internal link structure; JavaScript menus; streaming media."
Site copyright statement: "©2005 Villaraigosa for Mayor 2005"
Crawl Results

Comments from crawl operator

• (for linked hosts results) Need feedback on the media, etc. that was retrieved -- this site is an ideal example of the need for scope + one.

• (for restricted to original host) How much was left out due to the domain restriction? Need feedback.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 263 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

[urls]  [bytes]   [host]
817     10291631  ensim3.interlix.com


805     117538973  www.antonio2005.com
472     6333775    www.laopinion.com
265     21173      dns
110     19355921   www2.dailynews.com
100     16605730   www2.dailybulletin.com
95      1410145    www.americanpresidents.org
86      820148     www.dailynews.com
73      168698     www.chumpsoft.com
72      52321      images.ibsys.com
69      836295     www.laobserved.com
65      137700     www.mysql.com
55      213569     www.ensim.com
55      177141     www.lamayorcn.com
55      296311     www.surveyusa.com
53      495858     abclocal.go.com
52      522324     www.c-span.org
51      244668     gallery.menalto.com
[list truncated…]

Curator Feedback to CDL (Gray Villaraigosa)

Crawl Success: mostly effective
Crawl Success Comments: Doing some spot checks, it looks like all of the pages were captured. Some problems with media files -- WERA shows them when we search, but the files are often size 0. In addition, many files on external servers are listed and even have some descriptive info, but when we click on "Overview" it says "Sorry, no documents with the given uri were found," so no idea if they were really captured.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: Most of the links to other sites go to pages that truly are external to the site, not incorporated into it. The only exception would be the pages from ga3.org and ga4.org.
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) The results on the multiple-hosts crawl are mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains, rather than source + 1. Also, there seem to be a lot of links going to ensim3.interlix.com -- there were more files captured from there than from antonio2005. This is a case where it looks like the entire site (or much of it) was mirrored on two different servers. When we find these we often try to collapse them into one seamless whole, eliminating the duplication, which is meaningless from the user's viewpoint. Simply capturing both mirrors and leaving the cross-links intact is an option we've used when we can't collapse, but it often leads to problems with links between the two, which seems to be the case here.


Crawl Frequency: once
Questions / Comments about crawl: We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE -- the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.


Ron Heckart and Nick Robinson: Public Policy Institute of California
CDL Report to Curator

URL: http://www.ppic.org
Curator's original comments: "We are particularly interested in their publications. We hope the crawler will be able to report when new publication files are posted on the website. Our main focus of interest is on their new publications at http://www.ppic.org/main/newpubs.asp."
Site copyright statement: "All Contents © Public Policy Institute of California 2003, 2004, 2005"
Crawl Results

Comments from crawl operator: We can't at the moment use Heritrix to report on new publications posted.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 238 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

[urls]  [bytes]    [host]
2421    324309107  www.ppic.org
433     1367362    www.cacities.org
238     19286      dns
229     4675065    www.icma.org
200     598505     bookstore.icma.org
151     1437436    www.greatvalley.org
144     517953     www.kff.org
137     5304390    www.rff.org
113     510174     www-hoover.stanford.edu
102     1642991    www.knowledgeplex.org
97      101335     cdn.mapquest.com


81      379020     www.cde.ca.gov
73      184118     www.ilsg.org
68      4539957    caag.state.ca.us
62      246921     www.milkeninstitute.org
[list truncated…]

Curator Feedback to CDL (Heckart PPIC)

Crawl Success: mostly effective
Crawl Success Comments: There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to "Publications" and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on "Date" retrieves the message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Crawl Frequency: weekly
Questions / Comments about crawl: Our crawl report included the following comment from the crawl operator: "We can't at the moment use Heritrix to report on new publications posted." The ability to report on new publications is critical to our goal of using the crawler as a discovery tool. What are the prospects for providing this functionality in the future?


Terry Huwe: AFL-CIO
CDL Report to Curator

URL: http://www.aflcio.org
Curator's original comments: "This site is content rich and has many files that will be useful in the future. Specific areas that are of special interest follow below: http://www.aflcio.org/corporatewatch - the data related to executive pay watch is especially useful; http://www.aflcio.org/mediacenter - would like to see press stories captured if possible; http://www.aflcio.org/issues - links to newsletters and original content. Also, the "Legislative Action Center" on the home page; this is a useful topic guide to legislative history from a labor perspective."
Site copyright statement: "Copyright © 2005 AFL-CIO"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 2572 additional hosts were crawled. The following hosts supplied more than 75 URLs to your site:

[urls]  [bytes]    [host]
12702   481956063  www.aflcio.org
2657    184477     dns
1375    35611678   www.local237teamsters.com


570     8144650    www.illinois.gov
502     52847039   www.ilo.org
435     3851046    www.cioslorit.org
427     2782314    www.nola.com
401     8414837    www1.paperthin.com
392     15725244   www.statehealthfacts.kff.org
326     4600633    www.dol.gov
288     12303728   searchoxide.com
284     3401275    www.sikids.com
280     3069385    www.washingtonpost.com
272     1480539    www.cdc.gov
235     5455692    www.kff.org
[list truncated…]

Curator Feedback to CDL (Huwe AFL-CIO)

Crawl Success: effective
Crawl Success Comments: I realize the collection interface is a work in progress and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g. Working Families Toolkit, BushWatch) that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2-million-plus-hits-per-year Web sites to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses of the toolkit. Having said all that, my short answer is: I'm stoked. :-)
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In the case of this collection there's a lot of original content (and context-making documents) on the linked sites, simply due to the hierarchical/federal nature of the labor movement.
Crawl Frequency: monthly
Questions / Comments about crawl: I may have missed correspondence on this question, as I've been busy with a library renovation here. My question: Is the 1 gig limit a useful one? I couldn't help wondering what the actual figure would be if that limit were set higher. I'm generally interested in the total size of Web collections, as I need to monitor ours (which is content-rich but even so does not exceed 1 gig). So discussion of this at least would be interesting, and maybe, if it's not technically challenging, that 1 gig level might be raised and we could see what happens. Thanks to all involved--very interesting process.


Kris Kasianovitz: Los Angeles Dept. of City Planning
CDL Report to Curator

URL: http://cityplanning.lacity.org
Curator's original comments: "Website uses frames. Most of the documents will be pdfs. Of particular interest: the EIRs, which are not archived on the page once the project is approved; see http://cityplanning.lacity.org/EIR/TOC_EIR.htm. General and Community Plans: http://cityplanning.lacity.org/complan/gen_plan/genplan2.htm, http://cityplanning.lacity.org/ComPlan/cpbpage.htm"
Site copyright statement: No copyright information found
Crawl Results

Comments from crawl operator: (Linked hosts included crawl) Ended because it ground on for 3 days without hitting the data limit; not sure if the URLs at the end of the log are valid/useful.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 119 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls]  [bytes]    [host]


10493   840876945  cityplanning.lacity.org
601     5156252    metrolinktrains.com
183     644377     www.cr.nps.gov
121     11162      dns
90      977850     www.metrolinktrains.com
81      1207859    www.fta.dot.gov
79      263432     www.fypower.org
66      333540     www.adobe.com
64      344638     lacity.org
63      133340     ceres.ca.gov
60      274940     www.amtrak.com
59      389217     www.nhtsa.dot.gov
58      347752     www.unitedweride.gov
52      209082     www.dot.gov
52      288783     www.nationaltrust.org
51      278949     www.portoflosangeles.org
[list truncated…]

Curator Feedback to CDL (Kasianovitz LA City Planning)

Crawl Success: mostly effective
Crawl Success Comments: The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all pdfs. When searching specifically for EIRs I got a large result list (699 citations); however, when I investigated whether or not the actual file was captured, I found that the main EIR page was captured (typically a htm/html file), but when I clicked on a link to get to the full report, all I got was the "Sorry, no documents with the given URI were found" message. This could be that the file was no longer available when the site was harvested; however, I tested a few of these and found that I could still access them on the City Planning's live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels, or something else? These are key documents that I would want to have harvested and preserved. Here are a few specific examples:
Final EIR directory: http://cityplanning.lacity.org/EIR/Tocfeir.htm
Sierra Canyon Secondary School (cover page): http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/SierraCyn_coverpg.htm - Access to the Draft EIR and Final EIR is provided from this cover page. Within the system, the links to both the Draft and Final are broken (no documents with that given URI): http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/DEIR/Table of Contents.htm
Villa Marina EIR: http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
Directory of EIR notices of preparation: http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM


This provides links to EIR preparation notices (all pdfs), a total of 27 links/documents. I encountered the following three issues:

• pdf opened = when clicking on the link to the notice, the pdf opened with no problem: 16 of 27.

• "Sorry, no document with the given uri was found" = no pdf harvested, but I could get to it from the live site: 4 of 27.

• Acrobat "could not open" message (could open the live page outside of WERA) = the following Acrobat message came up when I tried to open the pdf: "Acrobat could not open ENV-2005-0881-EIR[1].pdf because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded). To create an Adobe Acrobat PDF document, go to the source application. Then print the document to Adobe PDF." I copied and pasted the URL into a regular browser and could open the pdf with no problem. Also, in a few cases, if I clicked on the GO button after the first attempt to open the pdf in the system, it seemed to launch: 7 of 27.

Conversely, I found a number of pages that contained full documents in html, with links to pdfs that worked with no problem. See the following document: http://cityplanning.lacity.org/cwd/gnlpln/transelt/TET2Bkgrnd.htm. File types and error codes were what I expected.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The crawl restricted to the original host is more helpful/relevant for getting to the materials from a specific agency. For some of the results that were restricted to host, I was getting external links. Here are some comparisons for each of the crawl settings:
Searched for "villa marina":
LA Dept. of City Planning: 6 results
http://cityplanning.lacity.org/EIR/NOPs/ENV-2004-3812-EIR.pdf
http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM
http://cityplanning.lacity.org/EIR/Tocfeir.htm
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/CwdGnlPln/HsgElt/HETblFigApVHgSit.htm
LA City Dept. of Planning (via): 2 results
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/CwdGnlPln/HsgElt/HETblFigApVHgSit.htm
Searched for "eir":
LA Dept. of City Planning: 699 results
LA City Dept. of Planning (via): 324 results


For both of these searches the URIs were from cityplanning.lacity.org.
Searched for "transportation":
LA Dept. of City Planning: 699 results
LA City Dept. of Planning (via): 290 results (most are from external sources and tended to be the index or main page of another agency or organization. Because this just got me to the main page, and none of the links functioned at that level, the via result was less helpful. However, the via results are useful for discovering other agencies or organizations that I should be looking at for materials.)
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) The list of linked hosts provided at the end of the report is helpful. This will help me target other agencies that might be key places to check for further collection development.
Crawl Frequency: monthly
Questions / Comments about crawl: I want to qualify the frequency for this site. I'd like to do a monthly crawl for 3-4 months; I'd then want to reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot of new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation. After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report there are 8899 documents in this crawl alone; my other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls. I should note that at this point in time, for local documents, I am more interested in individual documents than in capturing an entire website and preserving the functionality. Local agencies (with the exception of perhaps the Mayor's site) tend not to change the design (look and feel) very often or have a very sophisticated design (i.e. flash, changing images, etc.), and I don't see that this would be of interest to researchers' needs; it is the content (reports, maps, etc.) that is contained/accessed on the websites that is important. Maybe I'm wrong or being short-sighted about that.


Kris Kasianovitz: Southern California Association of Governments
CDL Report to Curator

Curator's original comments: "This is a critical regional agency for Los Angeles, Orange, Ventura, Imperial, Riverside, and San Bernardino counties. Its main areas are transportation, housing, and economic development. This will provide an analysis of the overall site, which has a lot of content. Publications/reports are typically in pdf; they are presented as full reports and pieces of the report (for easier downloading), so there might be duplication. The full report is really all that would be needed; see http://www.scag.ca.gov/publications. The Resources page contains pdfs, images, dynamic content, and gis programs, including an interactive atlas: http://www.scag.ca.gov/resources.htm. One part of the Resource site is the Web Accessible Geographic Data Search (WAGS): http://mapsvr.scag.ca.gov/wags/index.cfm?fuseaction= It requires a user-created login and password (although there is a guest login that allows you to bypass this). I'm not sure what kind of difficulty the harvester will encounter with this portion of the site. The interactive atlas also has a create-an-account/guest-login issue: http://mapsvr.scag.ca.gov/atlas/presmap.asp?Cmd=INIT. Since it is a dynamic page, I don't know how this will be handled by the harvester."
Site copyright statement: "© 1999-2005 Southern California Association of Governments"
Crawl Results


Comments from crawl operator: Interesting login problem; Heritrix was unable to retrieve the guest login pages. Cold Fusion and asp pages don't generate new URLs and thus don't get crawled. NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than 1 minute.
Robots.txt file: The site you selected forbids crawlers from gathering certain data. It reads:

    User-agent: *
    Disallow: /_mm
    Disallow: /_notes
    Disallow: /_baks
    Disallow: /MMWIP

    User-agent: googlebot
    Disallow: /csi

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 500 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

[urls]  [bytes]    [host]
2517    863231651  www.scag.ca.gov
690     6134101    www.metrolinktrains.com
506     40063      dns
428     1084533    www.cacities.org
397     16161513   www.sce.com
196     581022     bookstore.icma.org
187     4505985    www.icma.org
175     7757737    www.ci.seal-beach.ca.us
158     1504151    www.h2ouse.org
149     940692     www.healthebay.org
137     317748     www.ci.pico-rivera.ca.us
130     18259431   www.ci.ventura.ca.us
123     490154     www.chinohills.org
121     406068     www.lakewoodcity.org
119     203542     www.lavote.net
117     2449995    www.ci.malibu.ca.us
114     744410     www.ci.irvine.ca.us
113     368023     www.whitehouse.gov
109     974674     www.dot.ca.gov
107     892192     www.lacanadaflintridge.com
[list truncated…]

Curator Feedback to CDL (Kasianovitz SCAG)

Crawl Success: mostly effective


Crawl Success Comments: Similar to my comments about the Los Angeles Dept. of City Planning - the crawl brought back a lot of webpages, but not all the publications/documents that I would want to collect. Again, the same problem happened with SCAG as happened with the EIR example: the crawl brought back html pages with links to reports (typically in pdf format), but the actual documents were not captured. While the webpage is helpful, as it gives context, the main content that I'd want to capture (the reports) was not captured. See the following for example; none of the webpages linked from these pages are available, and they should link to a page that will have the material. I tried searching for the documents separately and couldn't get to them. See http://www.scag.ca.gov/publications/index.htm (the timeline arrows at the top seemed to function; I'm not sure what this is for) and http://www.scag.ca.gov/livable/pubs.htm. I was impressed to find that zip files were captured and I was able to download them. Unfortunately, when I opened them there wasn't any content. (I did the same search by mistake with the Arizona Dept. of Water Resources and actually found content in the folders.) I found 10 with the search "type:zip". Gif or jpg images retrieved are not useful - most were just bars or bullets or covers of reports (although this might be helpful to identify titles, I think I would end up discarding these -- after doing more checking of the results).
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments from LA Dept. of City Planning. Restricted gets me to the relevant materials for that agency; "via" brings back too many main webpages for other agencies to be useful.
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) I expected that there would be a problem with the content behind logins. The crawl confirmed that material behind login screens couldn't be captured. Can I get a copy of these crawl results?
NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than one minute.
Crawl Frequency: monthly
Questions / Comments about crawl: How to handle the copyright issue? For the login information, I'm not sure what all was blocked by the robots file. Interesting that metrolinktrains.com is the #1 related host for both of my crawled sites.


Linda Kennedy: California Bay Delta Authority
CDL Report to Curator

URL: http://calwater.ca.gov
Curator's original comments: "We are interested in the environmental impact statements and other key documents, and the various news releases, other announcements, and archives of CALFED."
Site copyright statement: "© 2001 CALFED Bay-Delta Program"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 519 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls]  [bytes]    [host]
1130    473192247  calwater.ca.gov
741     201538533  www.parks.ca.gov
521     40442      dns
373     51291934   solicitation.calwater.ca.gov
242     78913513   www.calwater.ca.gov
225     410972     cwea.org
209     87556344   www.science.calwater.ca.gov
173     109807146  science.calwater.ca.gov
172     1160607    www.adobe.com
129     517834     www.whitehouse.gov
[list truncated…]

Curator Feedback to CDL (Kennedy CALFED)


Crawl Success: mostly effective
Crawl Success Comments: I looked closely at the CALFED home page (http://calwater.ca.gov), Tribal home page (http://calwater.ca.gov/Tribal/Tribal_Home.shtml), Key Documents (http://calwater.ca.gov/CALFEDDocuments/CALFEDDocuments.shtml), and Archives page (http://calwater.ca.gov/Archives/Archives.shtml). The crawl did not complete in either the via or non-via search. Nearly all linked pages were retrieved in the non-via search; however, the retrievals from the via search were much less complete than the retrievals from the non-via search. For example, on the Key Documents page there were 3 missing links from the non-via search but 14 missing links from the via search. When Adobe documents were retrieved from either crawl, they came up correctly. Three asp links of tribal maps from the Tribal home page were retrieved by the non-via search but not the via search. A few of the images were missing from the displays, and this was also affected by the browser used. Usually the same image was missing from both crawls, but sometimes the images were more complete in the non-via crawl retrievals. There were some display problems with the right-hand menu boxes on the http://calwater.ca.gov/Tribal/Tribal_Home.shtml page, for example, that did not display correctly when viewed in Firefox. The same page viewed correctly in Internet Explorer, but when printed out, the boxes printed incorrectly, just as viewed in the Firefox browser. Grant Opportunities (http://calwater.ca.gov/GrantOpportunities/GrantInformation.shtml): this link did not work in 2 via and 2 non-via instances (from the Tribal home page and from the Archives page) but did work on one non-via crawl page (the CALFED home page). It could also be searched and retrieved directly from the test crawl search page.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments above. The non-via search was substantially more complete.
Crawl Frequency: monthly


Janet Martorana: Santa Barbara County Department of Planning and Development
CDL Report to Curator

URL: http://www.countyofsb.org/plandev/default.htm
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 487 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls]  [bytes]     [host]
3119    1102414495  www.countyofsb.org
485     34416       dns
428     1083047     www.cacities.org
357     6126453     www.sbcphd.org
320     6203035     icma.org
250     438507      www.sbcourts.org
234     1110744     vortex.accuweather.com
200     593112      bookstore.icma.org
[list truncated…]

Curator Feedback to CDL (Martorana SBCD)

Crawl Success: mostly effective


Crawl Success Comments: Most documents I expected to find were captured, but a number were not. For example, off this page, http://www.countyofsb.org/plandev/comp/threeyear/2005-2008/default.html, I expected to get to the final work program, http://www.countyofsb.org/plandev/pdf/comp/programs/Three_Year_WP2005-2008_3YrWrkProgram.pdf, but got the "Sorry, no documents with the given uri were found" message. Other examples, within the Energy Division, a part of the Planning & Development Dept.: off this page, http://www.countyofsb.org/energy/information.asp, I could access all links except for two: http://www.countyofsb.org/energy/information/oil&GasFields.asp (Oil and Gas Fields) and http://www.countyofsb.org/energy/information/oil&GasProduction.asp (Oil and Gas Production). The crawler seemed to cut off the URL right before the ampersand; perhaps it has problems with ampersands.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly
Questions / Comments about crawl: I'd like to be notified when I've navigated away from the captured site. I found myself on real-time web sites but couldn't tell what was on the captured site and what was real. Example: When I did a search on "santa barbara" I got 1528 results using the via results. I went to the County of Santa Barbara Online site, and then, after clicking on links, I was on the web (real-time) and no longer in the crawl results database, yet there were no indications that I had left the crawled database. The WERA uri was still displaying at the top of the screen. I couldn't tell which were the captured sites and which were the current real-time sites. Other observations: the webpage navigation doesn't work, e.g. "Table of Contents" doesn't jump to that section on the webpage (www.countyofsb.org/energy/projects/shell.asp and www.countyofsb.org/energy/mitigation/oakProject.asp); links to glossary terms go to the glossary but not to the term itself.
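The truncation the curator describes is consistent with an unencoded ampersand being treated as a delimiter somewhere in the link-extraction or display chain. A minimal Python sketch of the usual remedy (percent-encoding the ampersand) follows; the path shown is illustrative, not necessarily the county site's exact URL.

    from urllib.parse import quote

    # Hypothetical example of a path containing a literal ampersand.
    raw_path = "/energy/information/oil&GasFields.asp"

    # Percent-encode the ampersand (and anything else unsafe) while keeping "/".
    safe_path = quote(raw_path, safe="/")
    print(safe_path)  # /energy/information/oil%26GasFields.asp

    # A crawler or archive viewer that stores the encoded form avoids having
    # the "&" misread as a query-string separator when the URL is later parsed.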


Lucia Orlando: Monterey Bay National Marine Sanctuary
CDL Report to Curator

URL: http://montereybay.noaa.gov
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 795 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls]  [bytes]    [host]
5272    468755541  montereybay.noaa.gov
861     61141      dns
554     20831035   www.wunderground.com
368     4718168    montereybay.nos.noaa.gov
282     3682907    www.oceanfutures.org
273     10146417   www.mbnms-simon.org
260     7159780    www.mbayaq.org
163     61399      bc.us.yahoo.com
152     1273085    www.mbari.org
146     710203     www.monterey.com
119     3474881    www.rsis.com
119     279531     www.steinbeck.org
118     1092484    bonita.mbnms.nos.noaa.gov
109     924184     www.duke.edu
104     336986     www.montereybayaquarium.org


103     595953     icons.wunderground.com
102     339589     www.uncw.edu
[list truncated…]

Curator Feedback to CDL (Orlando Monterey Bay)

Crawl Success: (rating not provided)
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: This site contains a large mix of government policy, laws, and regulatory information, as well as links to recreational and educational activities available in the MBNMS. I was most interested in links to laws/regs/policy and educational info and organizations. I thought the links restricted to the original host best captured this information succinctly.
Crawl Frequency: unknown


Richard Pearce-Moses: Arizona Department of Water Resources
CDL Report to Curator

URL: http://www.azwater.gov (redirects to http://www.azwater.gov/dwr)
Curator's original comments: "In arid Arizona, water is one of the most important - and most contested - resources. The publications and records of this Department are of critical value to the state. Our spider can get many files from this site (1474 files in 258 directories). We are mostly interested in documents by programs. Although our spider can't get the imaged documents database (http://www.azwater.gov/dwr/Content/ImagedRecords/default.htm), this directory may not be critical if we can get the imaged documents transferred to us. We are not interested in blank forms and applications."
Site copyright statement: "Copyright © 1998 - 2005 Arizona Department of Water Resources and ADWR Network. All Rights Reserved."
Crawl Results

Questions for curator: Did this capture the documents you needed?


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 195 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls]  [bytes]    [host]
2233    988447782  www.azwater.gov
286     2350888    www.water.az.gov
253     4587125    www.groundwater.org
226     3093331    www.azcentral.com
196     15626      dns
178     395216     www.macromedia.com
128     1679057    www.prescott.edu
123     947183     www.azleg.state.az.us
115     792968     www.usda.gov
[List truncated…]

Curator Feedback to CDL (Pearce-Moses AZWater)

Crawl Success: (not provided)
Crawl Success Comments: We were surprised that your crawl found 4888 documents. Another crawl that we conducted at about the same time using wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand it, wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can; that may be the difference. The crawl is listed as not completing, but it appears to be very close, based on the total number of bytes downloaded.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Questions / Comments about crawl: It seemed odd that it took 2h 54m to crawl 7709 documents with linked hosts, but 4h 4m to crawl only 4888 docs when the spider was restricted to the original host.


Richard Pearce-Moses: Citizens Clean Elections Commission
CDL Report to Curator

URL: http://www.ccec.state.az.us/ccecscr/home.asp
Curator's original comments: "This commission was established by initiative. Its work is of great historical significance, as it is changing the way the public elects officials. We have not been able to spider this site because links are buried in JavaScript. (We use wget as our spider.) We are primarily interested in acquiring their publications, election data, and things listed under "popular links"."
Site copyright statement: "Copyright 2004 Arizona Citizens Clean Elections Commission. All Rights Reserved."
Crawl Results

Comments from crawl operator: JavaScript issue; interesting problem. Need curator feedback about what we captured.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, the following additional sites were crawled. Total: 15


[urls]  [bytes]   [host]
929     95456563  www.ccec.state.az.us
76      6117977   www.azcleanelections.gov
55      513218    az.gov
49      499337    www.governor.state.az.us
44      174903    www.adobe.com
40      141202    www.azleg.state.az.us
31      18549     www.az.gov
28      202755    www.azsos.gov
23      462603    gita.state.az.us
19      213976    www.benefitoptions.az.gov
17      89612     www.azredistricting.org
14      1385      dns
3       1687      www.images.adobe.com
2       1850      www.capitolrideshare.com
2       26438     www.ftc.gov

Curator Feedback to CDL (Pearce-Moses CCEC)

Crawl Success: (not provided)
Crawl Success Comments: We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in JavaScript. We have not been able to crawl this site with wget.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)


Juri Stratford: City of Davis
CDL Report to Curator

URL: http://www.city.davis.ca.us
Curator's original comments: "We are primarily interested in the GIS data produced for the City of Davis. As the GIS data represent the largest part of the City of Davis web site, it may not be much more difficult to archive the site at the top level. Mapping and Geographic Information Systems URL: http://www.city.davis.ca.us/gis"
Site copyright statement: "This web site is Copyright © 2004 by the City of Davis. All Rights Reserved. The City retains the copyright on all text, graphic images, and other content of this site. You may not copy, modify and/or re-use text, images or other web content from this web site; distribute the City's web content; mirror content from this web site on a non-City server; or make any other use of the content of this web site that would violate the City's copyright, without written permission from the City of Davis. To the extent allowed by law, commercial use of our web material is prohibited without written permission from the City of Davis. All art work shown on these web pages is protected by US Copyright laws. Limited reproduction for non-commercial purposes can be authorized by the City of Davis provided that requests are approved prior to use. Contact the Community Development Department Cultural Services program staff at (530) 757-5610 for more information. Some content included in this web site may be provided courtesy of third parties who may retain copyright control of the provided material. Any service marks and trademarks contained herein are the property of their respective owners."
Crawl Results


Comments from crawl operator: "GIS: Potential issue: /img is disallowed by robots.txt, e.g. http://www.city.davis.ca.us/img/featured/map-static.jpg can't be retrieved; also, some maps on a second server are disallowed. Need feedback about the GIS material that was captured: what was captured that is useful? Much duplication -- pages captured repeatedly."
Robots.txt: The site you selected prohibits crawlers from collecting certain documents. The file reads:

    User-agent: *
    Disallow: /img
    Disallow: /calendar
    Disallow: /misc/email.cfm
    Disallow: /ed/business
    Disallow: /gis/oldmap
    Disallow: /police/log
    Disallow: /pcs/grants/sacog
    Disallow: /jobs/listings
    Disallow: /css
    Disallow: /pcs/nutcracker/history.cfm
    Disallow: /pcs/nutcracker/pdfs

    User-agent: asterias
    Disallow: /

    User-agent: gigabot
    Disallow: /

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 420 additional hosts were crawled. The following hosts supplied more than 50 files to your site:


[urls]  [bytes]    [host]
16455   947871325  www.city.davis.ca.us
420     29555      dns
332     10377948   www.asucd.ucdavis.edu
305     33270715   selectree.calpoly.edu
279     3815103    www.w3.org
161     2027740    www.cr.nps.gov
139     941939     www.comcast.com
133     951815     www.yolocounty.org
[List truncated…]

Curator Feedback to CDL (Stratford Davis)

Crawl Success: mostly effective
Crawl Success Comments: Looking at the GIS Online Maps page, it's not clear which formats were retrieved and which were not. For example, the Growth Map Flash file downloads fine, but the Flash/ArcIMS files do not download.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In general, restricted to original host works better. The broader search includes pages from businesses (e.g. Comcast) and other local agencies (e.g. other local and state government sites). But restricting the outside sites to the first level seems to be a good compromise.
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) In general, it looked like it did a good job pulling geographic data/images. For example, you can pull down data from http://www.city.davis.ca.us/gis/library. It's difficult for me to get a sense of the level of duplication from the way the search results display.
Crawl Frequency: monthly
Questions / Comments about crawl: I like the ability to navigate within the archive once you have a starting point. This is very nice.


Yvonne Wilson: Orange County Sanitation District
CDL Report to Curator

URL: http://www.ocsd.com
Curator's original comments: "This is an Orange County agency that has small reports and complex planning documents on its web site."
Site copyright statement: "Copyright © 2001-2005 Orange County Sanitation District. Unless a copyright is indicated, the information on this site is freely available for non-commercial, non-profit-making use. If a copyright is indicated on any materials displayed on our website, permission to copy these materials must be obtained from the original source. Commercial use of District materials is expressly prohibited without the written permission of the OCSD."
Crawl Results

Comments from crawl operator: In both crawl settings, we had to limit the maximum number of retry attempts in order to complete the crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 85 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

    [urls]  [bytes]     [host]
    755     85943567    www.ocsd.com
    164     7635257     www.ci.seal-beach.ca.us
    122     809190      www.ci.irvine.ca.us
    95      169207      epa.gov
    86      7673        dns
    85      559125      order.e-arc.com
    66      840581      www.ci.huntington-beach.ca.us
    62      213476      www.cityoforange.org
    57      313579      www.epa.gov
    55      4477820     www.villapark.org
    50      1843748     www.cityoflapalma.org
    50      463285      www.ocbinc.com
    [List truncated…]

Curator Feedback to CDL (Wilson OCSD)

Crawl Success: somewhat effective
Crawl Success Comments: Using WERA, I searched inquiries by type and title in the two OCSD collections, plain and via. I received no hits for PDF, only the homepage for HTML, and three hits for text. There are many PDF sections in the EIRs. I next searched by title in the two collections; I was the most successful in via. By searching the titles "Carbon Canyon" and "Ellis Ave Pumping Station" I found articles, but not the EIRs, which are available full text. At this point I presumed the crawl did not drill down far enough. Then I searched for a secondary page entitled "Ocean Monitoring"; this time the search found only an internal letter and memo, but not all the documents related to this topic. The via collection search seems to be the most productive, but it is not consistent.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: I searched some of the outside links to US Marine Fisheries and EPA Beach Watch and received no hits.
Crawl Frequency: monthly


Crawl Report Key: Web-at-Risk Test Crawls

This document is a guide to the test crawl report, providing further information about some of the statistics and results conveyed there. Your report includes basic information about the site you submitted, your original comments about that site, and any copyright statements CDL found when we examined the site. Your site was crawled using the Heritrix crawler, version 1.5.1. Your report will include the following information about the site:

Crawl Settings: We crawled each site in two different ways: A) linked hosts included; B) restricted to original host. Where linked hosts were included, we set the crawler to gather any outside page that your nominated site linked to, but no further. So if your site linked to a single document from whitehouse.gov, we captured that document but did not crawl any further on the whitehouse.gov site. Your report will show results from both styles of crawling to give you a sense of whether or not the site draws heavily from valuable materials on another site. You will also receive a list of the other hosts that site linked to, and how many documents were gathered from those hosts.

Robots.txt file: The presence of a robots.txt file means that the content provider is asking us to refrain from crawling either all or part of the site. This refers only to the host you named in your crawl request. In some cases the site had a robots.txt file but it didn't say anything; we noted when this occurred. We obeyed robots.txt instructions for these crawls, so if the site contained one but we still got a result, that means the robots.txt file only prevented us from crawling certain areas of the site. When robots.txt files were present, we have included the text of that file in your report so you can see which segments of the site the site owner wants to protect.

Crawl duration

Total number of documents: The "Documents" count will include page components (such as images or Flash files).

File types (mime types): This area will contain a URL. When you go to that URL, you will see a list of the different file types that were retrieved as part of the crawl. IMPORTANT: The Heritrix crawler is currently experiencing difficulty with this report, and it is missing a crucial column. Until that is fixed, you can see the different file types retrieved, from most common to least, but you cannot yet tell how many files each one included. CDL will contact you when this report has been fixed.

Response code reports: The URL in this column will lead to a list of response codes, in order by frequency. This will include "200" for files that were successfully captured, and error codes for files that were not captured. The error code list includes some codes specific to Heritrix; the key to interpreting these codes is at http://crawler.archive.org/articles/user_manual.html#statuscodes. Note that this report only gives you quantitative information about response codes; it does not link response codes to specific files. For these details see "Location of hosts report and crawl log" below.

How much data collected (bytes): The file size of the total crawl is reported in bytes. You can use the byte conversion tool at http://www.techtutorials.net/reference/byteconverters.html if you want to recalculate the size of the crawl in another measurement, such as kilobytes or megabytes.

Did crawl complete: This will say "no" if the crawl results exceeded 1 gigabyte, or if the crawler encountered an obstacle to capturing the site that could not be fixed.

Location of hosts report and crawl log: You are welcome to review the page-by-page details of the crawl log. This is a generic report that comes with the Heritrix crawler and is not terribly user friendly, but it provides the most detail about the crawl process. This report will list every file that the crawler attempted to get and provide some information about each file. The Heritrix manual can help you interpret this report: http://crawler.archive.org/articles/user_manual.html#logs (go to section 8.2.1, Crawl Log).

Comments from Crawl Operator: These are observations that the Web Archive Programmer made about the crawl process for your site.

Questions for Curator: This section does not appear in every report. If you have a question listed here, please respond to it in the Test Crawl Feedback form.

Your Collection: Important: The screens you will see do not represent the final user interface for the Web Archiving Service tools. We are using WERA, an open-source search and display tool, only to show you your test crawl results. Because we did not seek the right to redistribute these documents, these pages are available only for the purpose of analyzing crawler effectiveness. You must have a password to view these pages; your report will include the address of a wiki page and a login and password. Each site was crawled twice: plain crawl = only pages from the original site were collected; via = pages from the original site, as well as pages that site links to, were collected. Unfortunately, you cannot simply browse your site; you must select a collection and type a search. You will be able to navigate throughout your site once you load a page containing links. You will be able to review your colleagues' sites as well. Note that the WERA display tool is not perfect: if the same document was gathered from more than one crawl, it may not display in every collection.

Related Hosts Crawled: This section provides further information about the additional materials that were gathered when we set the crawler to include documents that your site links to. This can be critical in deciding what settings are needed to capture your site. Some sites, for instance, will keep all of their PDF or image files on a separate server. If you don't allow the crawler to move away from the original URI, you won't capture a critical portion of the site's content. In other cases, however, this setting will lead to irrelevant information. This report includes the most commonly linked hosts from your site.
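For curators comfortable with a little scripting, the response code tallies and byte totals described above can also be computed directly from the crawl log. The following is a minimal sketch, assuming the Heritrix 1.x crawl.log layout in which the fetch status code is the second whitespace-separated field and the document size in bytes is the third; the file name is an assumption.

    # Tally response codes and total bytes from a Heritrix crawl.log, and
    # report the crawl size in megabytes. Assumes the status code is field 2
    # and the size in bytes is field 3 of each log line (Heritrix 1.x layout).
    from collections import Counter

    codes = Counter()
    total_bytes = 0
    with open("crawl.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            fields = line.split()
            if len(fields) < 3:
                continue
            status, size = fields[1], fields[2]
            codes[status] += 1
            if size.isdigit():  # size is "-" for failed fetches
                total_bytes += int(size)

    for status, count in codes.most_common():
        print(f"{status}: {count} documents")
    print(f"total: {total_bytes} bytes = {total_bytes / (1024 * 1024):.1f} MB")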


Web-at-Risk Test Crawl Report: Appendix B The Katrina Crawl

The Crawl

During the early fall of 2005, the California Digital Library was embarking on a series of test crawls as part of the Web-at-Risk project when hurricane Katrina struck. CDL had requested that curators submit their sample URLs by August 25th, and on Friday, August 26th, Louisiana Governor Blanco declared a state of emergency. It was over that weekend that CDL staff reached the conclusion that an event of this historic significance would shift our priorities. On Monday we suspended our initial test crawl plans and began preparing on a number of fronts to capture web-based materials related to Katrina. Specifically, this meant:

• identifying which web sites or sub-sites to collect, which involved:
  o notifying 30+ curators of our emergency crawl plans and requesting their assistance
  o setting up a web-based interface to gather their suggested seed URLs
  o sorting through the resulting seed list and feeding it to our primary and secondary (Stanford) crawl operators
• determining our collection parameters
• getting our crawlers installed and configured
• locating and setting up disk space to store the crawls
• initiating and monitoring the crawls

This crawl activity posed several new challenges. It was the first time CDL staff had used the Heritrix crawler, which was not yet installed on a production server when the hurricane struck. The last week of August saw CDL and UC Berkeley staff scrambling to find server space and to set up Heritrix instances, disk partitions, and job parameters for the crawl. This was done with remarkable speed, owing to the urgency of the situation and to everyone's desire to capture a record of the events. Given our limited experience, in order to reduce the risk of losing the historically significant and fleeting materials related to hurricane Katrina, CDL worked with Stanford University to concurrently run the same crawl using a different crawler. After both CDL and Stanford had crawled the same seed list for a month, the task was then taken up by the San Diego Supercomputer Center, who have continued crawling these sites using Heritrix.

Gathering the Seeds

The CDL sent out an initial request to the Web-at-Risk curators to submit URLs related to hurricane Katrina for crawling. We worked collaboratively with a large group of content specialists to identify the sites:

• The 22 curators of the Web-at-Risk project (University of California Libraries, University of North Texas, New York University, Stanford University, and the Arizona State Library)
• The Library of Congress
• Librarians at Louisiana State University and the University of Mississippi


Over the course of the crawl, the list of seed URLs grew to over 700 (just over 500 of which were crawled by CDL). The image of our input form above provides some sense of the range of materials collected. Given the sudden nature of this event, there was no time to investigate rights issues or technical problems each site might have presented. We informed the curators that "Our immediate plan is to simply collect the material before it disappears. We will not make the material immediately available." We also had little time for quality control, and were not able to guarantee that sites in our seed list would be comprehensively crawled.

Crawling Specifics

CDL's first crawl was run on September 1, 2005, using a seed list of 89 URLs. The final crawl run by CDL was on October 10, using a seed list of 589 URLs. This final seed list was then sent to the San Diego Supercomputer Center, who are continuing to run twice-weekly crawls.


The CDL crawls were done using Heritrix version 1.5.1. We began with a single instance of Heritrix, eventually dividing the seed list among 6 instances. We got through the entire seed list 29 times in 40 days. However, the content we were able to collect was limited by our crawler settings. Our crawler was configured to visit one host at a time, collecting content at a conservative rate and never more than three hops away from the seed URL. To ensure that the crawler moved through the seed list in a timely way, it was limited to spending 15 minutes at any given host. So the material collected does not represent the entirety of what was available at each site.

Note that all of the seeds were crawled with the same configuration, no matter how different the structure of these sites might be. Thus the New Orleans version of Craigslist, NASA's information pages, and blog sites were all crawled in the same manner, despite being quite different in context, architecture, and other characteristics. Given our short preparation time, the goal was to find a crawler configuration that would match the broadest range of Katrina materials. In most cases, whether the seed URL was for a personal blog or for a government agency, the seed itself was not centrally about Katrina. The Katrina information was generally situated on the front page and top levels of each site, so CDL expected that this setting, even with the 15-minute time limit, would at least capture the content nearest to the front pages of the widest variety of sites we could capture.

In choosing crawler configuration settings we also faced conflicting goals. On one hand, we wanted to capture as much Katrina content as possible. On the other hand, we were very hesitant to start hitting sites that were providing much-needed information at a crucial time (emergency sites, relief sites), thus making sites hard to reach due to our capture activity. Many of these sites were already getting heavily used and perhaps were not running at full capacity; some were also geographically impacted directly by the hurricane. So we had to choose settings that balanced the need to collect with politeness across a wide range of sites.

When the San Diego Supercomputer Center took over the Katrina crawls in October, they revisited the Heritrix configuration settings. The SDSC crawl placed a limit not on the amount of time spent at a site, but on the number of documents to be retrieved overall. Their crawls are being conducted "breadth first," in the sense of gathering pages consecutively across the seed list rather than gathering pages from one site at a time.

The Stanford University crawl of the same sites began on September 7th and continued for 30 consecutive days, using the WebVac crawler. While we have statistics concerning the size of the Stanford Katrina collection, it is very difficult to compare the configuration settings and effectiveness of WebVac vs. Heritrix. Further, there is no easy way to display materials gathered with WebVac, so the Stanford Katrina content is stored but is not accessible to us at the moment.
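For illustration only (this is not CDL's actual tooling, and the file names are invented), dividing a seed list among six Heritrix instances can be done with a simple round-robin split:

    # Round-robin split of a seed list across six crawler instances.
    # "katrina_seeds.txt" and the output file names are hypothetical.
    def partition_seeds(seeds, n_instances=6):
        buckets = [[] for _ in range(n_instances)]
        for i, seed in enumerate(seeds):
            buckets[i % n_instances].append(seed)
        return buckets

    with open("katrina_seeds.txt", encoding="utf-8") as fh:
        seeds = [line.strip() for line in fh if line.strip()]

    for n, bucket in enumerate(partition_seeds(seeds), start=1):
        with open(f"seeds-instance-{n}.txt", "w", encoding="utf-8") as out:
            out.write("\n".join(bucket) + "\n")
        print(f"instance {n}: {len(bucket)} seeds")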

Katrina Crawl Results

In terms of creating a collection of the Katrina event on the web, we are not satisfied that CDL's crawling efforts were successful. As mentioned, we had to impose time limits on each site visited, and were not able to conduct quality control on the URLs that were captured. The total size of the Katrina capture at CDL is 50 GB. This represents 29 completed crawls of the seed list, which grew to over 500 seeds over the 40 days we ran the crawl. It's clear from the byte size alone that we barely skimmed the surface, and indeed the Stanford crawl proves this point: when the Stanford crawl was based on about 370 seeds, it was pulling in about 35 GB a day, or 15 million pages a day.

In spite of this, our attempts to capture Katrina news and events did have some very useful and positive outcomes. In terms of test crawling, the experience was quite successful, opening our eyes both to certain obstacles and to new possibilities. Most importantly, it prompted us to consider the nature and demands of event-based crawling, which had not been part of our original test plan. Here are some of the aspects of event-based crawls that Katrina surfaced.

Site Selection and Classification

This type of event demands a deep, collaborative effort to identify material to collect. CDL had to mobilize a large number of curators to make their best guesses at identifying promising sites. We set up a "crawl seed nomination" web form, accessible to the curators from their home institutions, and invited the curators to use it to enter their suggested seed URLs. CDL staff did not have time to add many automatic error- or duplicate-checking features, so this cumbersome work was done by hand. On the other hand, it provided us a first-hand trial of what a more general curator user interface might require.

The selection and management of seed lists is critical for sudden event-based crawls. The curators contributing the URLs will not necessarily be well versed in the topic; in the case of Katrina, curators in California were not uniformly familiar with the Gulf Coast, the towns, the government agencies, etc. In addition, it is difficult to predict which aspects of the event will be of historic, enduring value. Because disk storage was not a pressing issue, it was better to err on the side of a wider net when selecting sites.

It became clear that there might be a role for "smart crawling," which would spread a wider net than that provided by human-generated seed lists through the use of automated tools that discovered relevant materials. We were specifically interested in using tools developed at Emory and Cornell that interface with the Rainbow text classification tool.7 We did not have the resources to investigate this for the Katrina crawl, but plan to fold it into future crawling tests.

Finally, it is worth examining why the seed list grew continuously throughout the event. Part of this growth is due to the natural increase in the number of pages and sites devoted to the emergency, and to having the additional time to identify and add them. However, the nature of the event itself changed over time: starting as a hurricane, then becoming a flood, a massive relocation, and a political and social issue. So the range of relevant sites changed as the event itself took on broader implications. This suggests that site selection is an ongoing process, not strictly an activity undertaken at the beginning of an event.

7 Bergmark, Donna. "Heritrix processor for use with rainbow." <http://groups.yahoo.com/group/archive-crawler/message/1905>
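The error- and duplicate-checking that had to be done by hand on the nominated seeds lends itself to simple automation. Below is a minimal sketch of the kind of normalization and de-duplication a seed nomination form could apply; the file names are hypothetical and the rules are illustrative rather than CDL's.

    # Sketch of automatic error- and duplicate-checking for a nominated seed
    # list: normalize URLs and drop exact duplicates. File names are
    # hypothetical.
    from urllib.parse import urlsplit, urlunsplit

    def normalize(url):
        url = url.strip()
        if not url:
            return None
        if "://" not in url:
            url = "http://" + url  # curators often omit the scheme
        parts = urlsplit(url)
        if not parts.netloc or "." not in parts.netloc:
            return None  # reject obviously malformed nominations
        # lower-case the host, drop fragments, keep path/query as submitted
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           parts.path or "/", parts.query, ""))

    seen, clean, rejected = set(), [], []
    with open("nominated_seeds.txt", encoding="utf-8") as fh:
        for line in fh:
            norm = normalize(line)
            if norm is None:
                rejected.append(line.strip())
            elif norm not in seen:
                seen.add(norm)
                clean.append(norm)

    print(f"{len(clean)} unique seeds, {len(rejected)} rejected nominations")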

Rights, Ownership, and Responsibilities

Our current rights protocol stipulates that when a web page is clearly in the for-profit domain, we must seek permission from the content owners to collect. This restriction applies most notably to news agency sites, which would likely be a critical part of the web-based fallout of any major event. This was certainly the case with Katrina. However, rights negotiation is a time-consuming process. In the case of a planned event such as an election, one is able to anticipate a certain number of sources and take rights management steps before the event occurs. But with emergencies such as Katrina, there is simply no way to secure rights in advance without missing time-sensitive material. Knowing this would be an issue, CDL notified our curators in advance that the Katrina materials we gathered would not be publicly accessible, but merely captured and preserved. The complex rights issues behind web archiving, as well as the collaborative environment described above, raise a number of questions:

• Who owns the content?
• Who is responsible for the project?
• Who is responsible for responding if a content owner objects?
• When is it permissible to ignore robots exclusion files?
• How much rights management can be done as part of post-processing for an event crawl?
• Can there be standing agreements with major news agencies to allow for this type of crawl in emergency situations? If so, what constitutes an emergency?

Technical Infrastructure

CDL did not have an adequate technical infrastructure in place at the time of the Katrina crawl. As mentioned above, staff at both CDL and UC Berkeley scrambled to find server space and to install and configure the crawlers. This sudden shift in workload left the system vulnerable and somewhat unstable. In order to start the crawling process as soon as possible, CDL began the project on a temporary server. Making the transition to more permanent storage in the midst of this crawling project was neither easy nor flawless, and a certain amount of data had to be recovered from backup sources.

Information Analysis and Display

The challenge of analyzing crawl results has been described in some detail in our Test Crawl Report. That report cites a curator who found the task of reviewing an 8,899-document crawl to be unmanageable. The tools currently available are not at all up to the task of analyzing a large and complex crawl. An event-based crawl is likely to result in massive amounts of data of widely varying quality. The selection of seeds is based on guesses that, given sufficient resources, should be reviewed, refined, and enhanced as the event progresses. Time-series data based on changing input parameters represents a kind of moving target that suggests the need to develop new analysis tools. Andreas Paepcke of Stanford University addressed this problem when considering how a social scientist might want to analyze the Katrina materials. Consider the researcher who is interested in finding out how quickly the notion of race entered the public discourse in the aftermath of Katrina. Paepcke suggests:

"For example, the tool could perform word frequency analysis across successive crawls of Katrina sites, so that the user could identify the emergence of word usage such as 'looter,' 'ninth ward,' etc. The social scientist would then define within the tool clusters of words that in the scientist's judgment are direct or indirect occurrence indicators of the concept 'race' (i.e. ninth ward, looters, lacks, poor, ...). The scientist could subsequently interact with the tool at the level of these well-defined concepts. Example command: count occurrences of 'race' within the first three paragraphs of all pages, and tabulate the differences across 10 days. We could go further and apply well-known topic categorization algorithms on the data to suggest new concepts as they arise in the series of text streams. For example, the tool might identify an increasing frequency of the term 'mismanagement' beginning in week two after the storm, and alert the sociologist to this evidence of a new discourse theme." 8

While we don't expect to be able to incorporate this level of analysis into our web archiving tools any time soon, these ideas illustrate a useful direction for archival analysis tools.
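Although nothing this ambitious exists in our tools, the core of Paepcke's first suggestion is easy to prototype. The sketch below assumes the text of each crawl snapshot has already been extracted to plain-text files grouped in one directory per crawl date; the directory layout and the word cluster are illustrative only.

    # Count occurrences of a concept cluster (per Paepcke's example) across
    # successive crawl snapshots stored as plain-text dumps in per-date
    # folders, e.g. text/2005-09-01/*.txt. Layout and terms are illustrative.
    import glob
    import os
    import re

    CLUSTER = ["looter", "looters", "ninth ward", "poor"]

    def cluster_count(text):
        text = text.lower()
        return sum(len(re.findall(r"\b" + re.escape(term) + r"\b", text))
                   for term in CLUSTER)

    for crawl_dir in sorted(glob.glob("text/*/")):
        total = 0
        for path in glob.glob(os.path.join(crawl_dir, "*.txt")):
            with open(path, encoding="utf-8", errors="replace") as fh:
                total += cluster_count(fh.read())
        print(os.path.basename(crawl_dir.rstrip("/")), total)

Tabulating this count per crawl date gives exactly the kind of time series Paepcke describes, with the cluster definition left in the researcher's hands.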

Conclusions

At the time Hurricane Katrina hit, the Web-at-Risk service requirements and test crawl plan had been written for a more orderly and considered approach to web crawling. Our attempt to capture the web-based aftermath of Katrina highlighted our need to revisit the service requirements and see how well they supported an event-based response. We need to develop a collaborative and robust mechanism to identify and collect seed URLs. When an event happens quickly, it is also important to have in place general crawling guidelines that will result in a well-rounded collection. The technical infrastructure for the Web Archiving Service that we are building for the Web-at-Risk project should be robust enough to handle occasional and sudden bursts of activity. Attempting to alter the infrastructure quickly in reaction to emergency events impairs our responsiveness and leaves the altered systems in a vulnerable state.

8 Paepcke, Andreas. Senior Research Scientist and Director of the Digital Library, Stanford University. Email correspondence with Patricia Cruse, October 26, 2005.


Finally, work needs to continue on both the rights management front and on developing improved web archiving analysis tools, so that the material gathered can be used to its greatest potential.


Web-at-Risk Test Crawl Report: Appendix C Individual Crawl Reports

Included below are the crawl reports provided to individual curators, including their analysis and feedback about those results. Note that a key to interpreting the tables in these crawl reports is provided at the end of this appendix.

Elizabeth Cowell (submitted by Ann Latta): UC Merced
CDL Report to Curator

URL: http://www.ucmerced.edu
Curator's original comments: "UC Merced is the first research university to be built in the 21st century. The educational and land use issues are significant. Of particular interest is http://www.ucmercedplanning.net. This site addresses major issues of land use - the university is being built on agricultural land. Controversy existed re: issues of redevelopment of downtown Merced vs. appropriation of agricultural land - there are major environmental issues focused on endangered species - Educational issues involving faculty job descriptions, student body, etc. are significant because of the economic, ethnic and cultural diversity of the region."
Site copyright statement: "© 2004 UC Regents"
Crawl Results


Comments from crawl operator: When we set the crawl to include pages from linked sites, the crawler got "trapped" at the Elsevier site. There is JavaScript on that linked page that causes the crawler to continue looking for additional pages on the Elsevier site, even when you're only trying to capture a single page. Once we set a limit for the maximum number of retry attempts, the crawl completed. This data is from the completed crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 227 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

    [urls]  [bytes]     [host]
    969     71552369    www.ucmerced.edu
    238     2564803     www.ucop.edu
    226     14851       dns
    197     16583197    www.universityofcalifornia.edu
    156     8487817     www.elsevier.com
    151     1437436     www.greatvalley.org
    112     2354582     faculty.ucmerced.edu
    105     5659795     www.pacific.edu
    90      111985      k12.ucop.edu
    86      255733      www-cms.llnl.gov
    85      1178031     admissions.ucmerced.edu
    81      297947      uc-industry.berkeley.edu
    71      108265      www.mssmfoundation.org
    67      349300      www.nps.gov
    66      308926      www.usafreedomcorps.gov
    54      137085      slugstore.ucsc.edu
    52      52202       www.cerrocoso.edu
    51      977315      www.universityofcalifornia.com

Curator Feedback to CDL (Cowell Merced)

Crawl Success: mostly effective
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: 10 more links came up in the via search; 8 of these were not useful for the research of land use issues. The two that were useful were a benefit of the via search.
Crawl Frequency: monthly


Sherry DeDekker: California Water Science Center
CDL Report to Curator

URL: http://ca.water.usgs.gov
Curator's original comments: "We are interested in the CA water reports and databases behind the links on this page. Also, in looking at http://waterdata.usgs.gov/ca/nwis/nwis, this section is an interactive interface to multiple data sets. Is it possible to capture this type of site as well as the static reports?"
Site copyright statement: "Information presented on this website is considered public information (unless otherwise noted) and may be distributed or copied. Use of appropriate byline/photo/image credit is requested. We strongly recommend that USGS data be acquired directly from a USGS server and not through other sources that may change the data in some way."
Crawl Results

Comments from crawl operator: Databases, e.g. http://waterdata.usgs.gov/ca/nwis/nwis, are disallowed by http://waterdata.usgs.gov/robots.txt. Some water data reports are NOT caught by the broader "linked hosts included" settings, e.g. http://pubs.usgs.gov/wdr/2004/wdr-ca-04-1 -- would also want to submit http://ca.water.usgs.gov/waterdata as a seed.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 662 additional sites were crawled. The following list includes the hosts that supplied more than 50 files. Note that the host "pubs.usgs.gov" supplied a higher number of files than the original host itself.

    [urls]  [bytes]     [host]
    1963    255912820   pubs.usgs.gov
    1153    47066381    ca.water.usgs.gov
    698     56570       dns
    404     112354772   geopubs.wr.usgs.gov
    385     9377715     water.usgs.gov
    327     203939163   greenwood.cr.usgs.gov
    318     17431487    www.elsevier.com
    219     3254794     www.usgs.gov
    189     2737159     www.lsu.edu
    163     2292905     wrgis.wr.usgs.gov
    158     31124201    www.epa.gov
    149     921063      www.usda.gov
    [list truncated…]

Curator Feedback to CDL (DeDekker CWSC)

Crawl Success: somewhat effective
Crawl Success Comments: The site appears to access water data reports (http://ca.water.usgs.gov/archive/waterdata/index.html), but none are actually available through the links. I expected the site to not be able to access real-time data, but these are archived reports.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly


Megan Dreger (submitted by James Jacobs): City of San Diego Planning Department

CDL Report to Curator

URL: http://www.sandiego.gov/planning
Curator's original comments: "I would like to drill down several levels (at least 3) of this site. For example, following the link to City of Villages/general plan update leads to many more important planning documents."
Site copyright statement: This site contains the two following notices on the same page:
"Restrictions on Use of Materials: This site is operated and maintained by the City of San Diego through its Department of Information Technology and Communications (referred to as IT&C). Except as provided herein, no material or information from this site may be copied, reproduced, republished, uploaded, posted, transmitted, or distributed except as authorized in this notice, expressly authorized within this site, or approved in writing by IT&C."
"Copyright Notice: Unless a copyright is indicated, information on the City of San Diego Web site is in the public domain and may be reproduced, published or otherwise used with the City of San Diego's permission. We request only that the City of San Diego be cited as the source of the information and that any photo credits, graphics or byline be similarly credited to the photographer, author, or City of San Diego, as appropriate. If a copyright is indicated on a photo, graphic, or any other material, permission to copy these materials must be obtained from the original source."
Crawl Results


Comments from crawl operator: Need feedback about whether the desired content was retrieved.
Question for curator: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 1084 additional hosts were crawled. (This figure does not represent a complete crawl, as the size limitation was reached.) The following hosts supplied more than 75 documents to your site:

    [urls]  [bytes]     [host]
    3728    556231640   www.sandiego.gov
    1247    38685244    genesis.sannet.gov
    1085    80905       dns
    807     6676252     www.houstontexans.com
    428     1079658     www.cacities.org
    399     102298888   www.buccaneers.com
    259     1797232     granicus.sandiego.gov
    258     42666066    clerkdoc.sannet.gov
    238     5413894     www.ccdc.com
    225     2503591     www.ci.el-cajon.ca.us
    223     1387347     www.ipl.org
    217     2683826     www.sdcounty.ca.gov
    203     11673212    restaurants.sandiego.com
    195     2620365     www.sdcommute.com
    192     1344523     www.bengals.com
    189     2221192     www.kidsdomain.com
    176     1333528     www.buffalobills.com
    171     685965      www.chumpsoft.com
    166     277238      www.proquest.com
    [list truncated…]


Curator Feedback to CDL (Dreger San Diego)

Crawl Success: mostly effective
Crawl Success Comments: This crawl was not completed due to size, so that may explain some of my questions. It was pretty effective in terms of getting the Planning Dept pages, but went out further than I expected. Due to the vague request to "drill down several levels," I'm not sure how this crawl was set up. It includes many pages that are not related to the City Planning Dept. For example, there were many pages that I didn't expect to appear (www.proquest.com, www.infopeople.org) that I believe are included because they are listed on the public library's pages (www.sandiego.gov/public-library). So the crawl appears to include not just the pages linked from www.sandiego.gov/planning (in the nav bar as well as the content) but also the pages that those secondary pages link to. Some other pages that I expected to be there but weren't (for example http://www.sandiego.gov/cityofvillages/overview/roots.shtml) are linked from the Planning Department pages but are a couple of levels down and in a different directory. So it may be the directory structure that causes problems trying to search only one agency. If that's the case, it may be easier to do all of www.sandiego.gov rather than limit. It's nice that PDFs and other formats are included.
Crawl Scope Preferences: original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The linked hosts included (via) crawl seemed to include more extraneous stuff.
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: Can you possibly define what you mean by "drill down several levels (at least 3)"? It's not clear if you mean directory levels in the site architecture or navigation levels in the web site interface.
A: You asked for a better definition of what was meant by the request to drill down several levels (at least 3). Unfortunately, I wasn't the original curator and I'm not sure what he meant. The crawl you did was very useful.
Crawl Frequency: monthly
Questions / Comments about crawl: I think that the crawl frequency should be at least monthly for these pages.


Peter Filardo and Michael Nash: New York City Central Labor Council
CDL Report to Curator

URL: http://www.nycclc.org
Curator's original comments: (none)
Site copyright statement: "© 2004 New York City Central Labor Council. No portion of this website may be reproduced in any form without permission from the Central Labor Council. Contact our offices for more information at nycaflcio@aol.com."
Crawl Results: NOTE: Because your Crawl "A" had to be stopped, then resumed, each of your reports for that crawl is in two segments. To browse a list of all reports for that crawl, go to http://voro.cdlib.org:8081/ingest_misc/ndiipp/testcrawls_raw/filardo_labor_via

Comments from crawl operator: A) Linked hosts included: Crawl complete after recovery, with the addition of max retries. Seemed to hang at http://www.nycclc.org/calendar/event.asp?EventId=501 and http://www.nycclc.org/assets/HLC/application/membership.pdf. Ended crawl; seemed to hang. Recovered from previous job; the recovery was successful. Note for future reference that a recovered job is identifiable because the logs directory is called logs-R.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, x additional hosts were crawled. The following hosts supplied more than 50 documents to your site:

    [urls]  [bytes]     [host]
    1913    74260017    www.nycclc.org
    156     11755       dns
    115     710552      www.aflcio.org
    73      1477966     www.comptroller.nyc.gov
    71      193264      www.empirepage.com
    60      570115      www.redcross.org
    58      269079      www.afl-cio.org
    57      240845      www.campsussex.org
    57      113676      www.mssm.edu
    56      449473      www.labor-studies.org
    53      184605      www.pbbc.org
    52      134326      www.senate.gov
    [list truncated…]

Curator Feedback to CDL Filardo NYCCLC

None provided


Valerie Glenn and Arelene Weibel: Strengthening Social Security
CDL Report to Curator

URL: http://www.strengtheningsocialsecurity.gov
Curator's original comments: "Contains external links to audio & video that would be essential to completing this site (see press room); some are files, some are links to webcasts, some are on external .gov sites, and some are external .com sites."
Site copyright statement: Copyright info not found
Crawl Results

Comments from crawl operator: Interesting audio/video/webcast content -- need feedback about success in capturing it.
Questions for Curator: How successfully did this crawl capture the multimedia documents you were interested in?
Comments from coordinator: In the media center area of this site (http://www.strengtheningsocialsecurity.gov/press/media_center.shtml), I've found .ram (both video and audio alone), .smil, and .asx files. The site also contains numerous .ppt and .pdf files.

• A text search on the log file turns up numerous .ram files, but only 1 .ppt file.
• .asx files are Windows streaming media redirector files, which generally lead to associated .wmf files. No .asx references appear in the crawl log, nor do any .wmf files.
• Similarly, .smil files are used to control and point to associated media files, in this case .rm files. We are getting the .smil files but not the .rm files.

I assume that when displayed, some of the Real Media files from this site would function, but many of the other multimedia files would not.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 388 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

    [urls]  [bytes]     [host]
    660     10668874    www.chelseapiers.com
    562     7334035     www.whitehouse.gov
    477     6366197     www.laopinion.com
    391     29623       dns
    356     3874719     www.wkrc.com
    243     12294240    www.strengtheningsocialsecurity.gov
    178     1935969     www.xavier.edu
    148     237055      image.com.com
    127     682069      online.wsj.com
    117     898439      www.omaha.com
    116     514995      www.npr.org
    108     995733      www.nba.com
    [list truncated…]
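The gap noted above (pointer files captured, streams missed) can at least be inventoried after the fact. A sketch follows, assuming the captured .ram and .asx files have been extracted to disk under a hypothetical directory named captured_media/; .ram files are plain text listing stream URLs, and .asx playlists name their streams in href attributes.

    # Inventory the stream URLs referenced by captured .ram and .asx pointer
    # files, to see which media a follow-up crawl would still need to fetch.
    # Assumes pointer files were extracted under captured_media/ (hypothetical).
    import glob
    import re

    stream_urls = set()

    for path in glob.glob("captured_media/**/*.ram", recursive=True):
        with open(path, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                line = line.strip()
                if line.lower().startswith(("rtsp://", "pnm://", "http://")):
                    stream_urls.add(line)

    for path in glob.glob("captured_media/**/*.asx", recursive=True):
        with open(path, encoding="utf-8", errors="replace") as fh:
            # .asx playlists are loosely formed XML; pull href attributes directly
            stream_urls.update(re.findall(r'href\s*=\s*"([^"]+)"', fh.read(), re.I))

    for url in sorted(stream_urls):
        print(url)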

Curator Feedback to CDL (Glenn Strength Soc Sec)

Crawl Success: mostly effective
Crawl Success Comments: My main concern about this site was the multimedia documents included - I've posted those thoughts in the "Questions for Curator" text box.
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: For this site it was essential to capture the linked hosts (via) because many of the press materials, etc. were on external sites.
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Q: How successfully did this crawl capture the multimedia documents you were interested in?
A: I'm disappointed that not all of the multimedia files were captured, but there seem to be only a few that aren't included. I found it interesting that broadcasts from the same host (whitehouse.gov) weren't completely captured - some were, some weren't.
Crawl Frequency: once
Questions / Comments about crawl: [note: the curator alludes to adding a comment here, but no comment was received]


Valerie Glenn and Arelene Weibel: Defense Base Closure and Realignment Commission
CDL Report to Curator

URL: http://www.brac.gov
Curator's original comments: "In our previous efforts we have been unable to capture agency databases. The BRAC site includes a document library which has a search feature (http://www.brac.gov/Search.aspx) and a browse feature (http://www.brac.gov/Browse.aspx). We would really like to see how this information can be captured so that we can recreate it on our own servers."
Site copyright statement: "The contents of all material available on this Internet site are in the public domain and are not copyrighted. The content of this site may be freely reproduced, downloaded, disseminated, published, or transferred in any form and by any means. However, in some cases the copyright for certain text or images on this site may be held by other parties."
Crawl Results

Comments from crawl operator:
A) Linked hosts included: http://www.brac.gov/Search.aspx can't be captured by Heritrix. http://www.brac.gov/Browse.aspx seems to only capture the first 25 documents. Tried again with the browse page as the starting point, but stopped after 1005 documents; extracted 20 links from the browse page, and then there were no more URLs in the frontier queue which had been extracted from Browse.aspx -- perhaps need more experimentation.
B) Restricted to original host: again, only the first 25 pages from browse -- can't even successfully pass a seed URL listing the max docs per browse page (50).
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 44 additional hosts were crawled. Note that because your crawl did not complete, this is not an accurate count of how many other sites your original site may link to.

    [urls]  [bytes]     [host]
    2034    1064389540  www.brac.gov
    555     5874934     www.slu.edu
    87      173510      www.cpcc.edu
    54      154588      www.wmata.com
    47      685158      www.sluhospital.com
    44      3501        dns
    44      582555      www.c-span.org
    43      174467      www.adobe.com
    38      178153      www.q-and-a.org
    32      127325      slubkstore.com
    24      140653      www.c-spanclassroom.org
    23      326680      www.capitalnews.org
    22      213116      cancercenter.slu.edu
    21      196012      www.defenselink.mil
    [List truncated…]
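One possible workaround for a listing page that the crawler only follows to its first 25 documents is to feed the crawler explicit seed URLs for every page of the listing. The sketch below is purely hypothetical: it assumes the Browse.aspx page accepts a page-number query parameter, which would have to be confirmed by inspecting the site's own paging links before use.

    # Hypothetical seed generator for a paginated Browse.aspx listing.
    # The "page" parameter is an assumption, not a documented BRAC URL
    # scheme; inspect the real paging links and substitute the actual name.
    BASE = "http://www.brac.gov/Browse.aspx"
    PAGES = 40  # e.g. 40 pages x 25 documents per page

    with open("brac_extra_seeds.txt", "w", encoding="utf-8") as out:
        for n in range(1, PAGES + 1):
            out.write(f"{BASE}?page={n}\n")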

Curator Feedback to CDL (Glenn Defense Base Closure)

Crawl Success: somewhat effective
Crawl Success Comments: I don't think this crawl was very successful. None of the documents in the folders on http://www.brac.gov/Supplemental.aspx were captured, no public comments after the opening page (http://www.brac.gov/BrowseComments.aspx) were captured, and none of the documents linked from the Browse page (http://www.brac.gov/Browse.aspx) seem to have been captured. I realize that the crawl was limited to 1 GB, but I think that more documents could have been captured.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: once
Questions / Comments about crawl: One of the reasons I submitted this site to be crawled is that I'd already tried to harvest it using HTTrack. That product appeared to capture more of the documents than Heritrix - but we still haven't been able to capture the entire site.


Gabriela Gray: Join Arnold
CDL Report to Curator

URL: http://www.joinarnold.com
Curator's original comments: "California Recovery Team. Non-profit, pro-Arnold group not registered as a campaign committee. Critical aspects: complex file and directory naming structure; looping, e.g. contactus.asp and contactadd.asp links."
Site copyright statement: "Copyright 2005"
Crawl Results

Comments from crawl operator:
A) Linked hosts included: Great site for testing -- this loop is really interesting, because a new URL is generated with each loop, so that the duplicate-detection underway at IA would still not eliminate it. 44332 of the retrieved URLs were contact pages.
B) Restricted to original host: Got into a loop by the end of 999 documents retrieved; 34 minutes.
C) Restricted to original host + regular expression: Excluding pages that matched the regular expression contactadd.asp?c= did not end the loop. What did end the loop: excluding both contactus and contactadd pages, so they were not retrieved -- a drawback. (IA takes the manual approach of gathering the pages, then having an operator stop the crawl and take out the looping URLs by hand -- not scaleable.)

    <newObject name="contact" class="org.archive.crawler.deciderules.MatchesRegExpDecideRule">
      <string name="decision">REJECT</string>
      <string name="regexp">.*contact.*asp.*c=.*</string>
    </newObject>

Related hosts crawled: Because of looping problems, we were not able to crawl other hosts linked from this site.
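Loops of this kind can often be flagged from the crawl log before an operator has to intervene by hand: group fetched URLs on host plus path with the query string set aside, and report any group with an implausible number of query-string variants. A minimal sketch follows, assuming a Heritrix-style crawl.log with the URI in the fourth whitespace-separated field; the file name and threshold are illustrative.

    # Flag likely crawler traps: URL paths that occur with a very large number
    # of distinct query strings (e.g. contactadd.asp?c=... in the crawl above).
    # Assumes the fetched URI is the fourth whitespace-separated field.
    from collections import defaultdict
    from urllib.parse import urlsplit

    THRESHOLD = 200  # variants per path before we call it suspicious

    variants = defaultdict(set)
    with open("crawl.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            fields = line.split()
            if len(fields) < 4:
                continue
            parts = urlsplit(fields[3])
            variants[parts.netloc + parts.path].add(parts.query)

    for path, queries in sorted(variants.items(), key=lambda kv: -len(kv[1])):
        if len(queries) >= THRESHOLD:
            print(f"possible trap: {path} ({len(queries)} query-string variants)")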

Curator Feedback to CDL (Gray Arnold)

Crawl Success: somewhat effective
Crawl Success Comments: We spot-checked, and it looks like most files were captured, but the individual pages don't display most of the images. (This may simply be a problem with the WERA interface.) Strangely enough, the Flash files work perfectly, which is exactly the opposite of our own capture experience.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: As per the crawl notes, we only checked the original host version, since the via crawl failed.
Crawl Frequency: once
Questions / Comments about crawl: Same comments as on the Villaraigosa crawl. We're inexperienced with this type of capture: archive files as-is and use a tacked-on script and a special server-side interface to interpret links in the new environment. Our model has been to actually alter the internal links from absolute to relative formats, so that it works in any environment.


Gabriela Gray: Mayor-Elect Villaraigosa
CDL Report to Curator

URL: http://www.antonio2005.com
Curator's original comments: "Critical aspects: Flash animation; content scattered across multiple servers; maintaining complex internal link structure; JavaScript menus; streaming media."
Site copyright statement: "©2005 Villaraigosa for Mayor 2005"
Crawl Results

Comments from crawl operator:

• (For the linked hosts results) Need feedback on media etc. retrieved -- this site is an ideal example of the need for scope+one.
• (For restricted to original host) How much was left out due to the domain restriction? Need feedback.

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 263 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

    [urls]  [bytes]     [host]
    817     10291631    ensim3.interlix.com
    805     117538973   www.antonio2005.com
    472     6333775     www.laopinion.com
    265     21173       dns
    110     19355921    www2.dailynews.com
    100     16605730    www2.dailybulletin.com
    95      1410145     www.americanpresidents.org
    86      820148      www.dailynews.com
    73      168698      www.chumpsoft.com
    72      52321       images.ibsys.com
    69      836295      www.laobserved.com
    65      137700      www.mysql.com
    55      213569      www.ensim.com
    55      177141      www.lamayorcn.com
    55      296311      www.surveyusa.com
    53      495858      abclocal.go.com
    52      522324      www.c-span.org
    51      244668      gallery.menalto.com
    [list truncated…]

Curator Feedback to CDL (Gray Villaraigosa)

Crawl Success: mostly effective
Crawl Success Comments: Doing some spot checks, it looks like all of the pages were captured. Some problems with media files -- WERA shows them when we search, but the files are often size 0. In addition, many files on external servers are listed, and even have some descriptive info, but when we click on "Overview" it says "Sorry, no documents with the given uri were found," so no idea if they were really captured.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: Most of the links to other sites go to pages that truly are external to the site, not incorporated into it. The only exception would be the pages from ga3.org and ga4.org.
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.) The results on the multiple hosts crawl are mixed. As mentioned above, much of it is superfluous. Ideally there would be a way to limit to specific multiple domains, rather than source + 1. Also, there seem to be a lot of links going to ensim3.interlix.com -- there were more files captured from there than from antonio2005. This is a case where it looks like the entire site (or much of it) was mirrored on two different servers. When we find these, we often try to collapse them into one seamless whole, eliminating the duplication, which is meaningless from the user's viewpoint. Simply capturing both mirrors and leaving the cross-links intact is an option we've used when we can't collapse, but it often leads to problems with links between the two, which seems to be the case here.
Crawl Frequency: once
Questions / Comments about crawl: We're very confused by the WERA interface, which makes it hard to see what's going on. We noticed that many of the images don't display properly in IE -- the image files seem to have been captured, but some of the links between the captured html pages and the captured images aren't working properly.


Ron Heckart and Nick Robinson: Public Policy Institute of California
CDL Report to Curator

URL: http://www.ppic.org
Curator's original comments: "We are particularly interested in their publications. We hope the crawler will be able to report when new publication files are posted on the website. Our main focus of interest is on their new publications at http://www.ppic.org/main/newpubs.asp."
Site copyright statement: "All Contents © Public Policy Institute of California 2003, 2004, 2005"
Crawl Results

Comments from crawl operator: We can't at the moment use Heritrix to report on new publications posted.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 238 additional hosts were crawled. The following hosts supplied more than 50 URLs to your site:

    [urls]  [bytes]     [host]
    2421    324309107   www.ppic.org
    433     1367362     www.cacities.org
    238     19286       dns
    229     4675065     www.icma.org
    200     598505      bookstore.icma.org
    151     1437436     www.greatvalley.org
    144     517953      www.kff.org
    137     5304390     www.rff.org
    113     510174      www-hoover.stanford.edu
    102     1642991     www.knowledgeplex.org
    97      101335      cdn.mapquest.com
    81      379020      www.cde.ca.gov
    73      184118      www.ilsg.org
    68      4539957     caag.state.ca.us
    62      246921      www.milkeninstitute.org
    [list truncated…]

Curator Feedback to CDL (Heckart PPIC)

Crawl Success: mostly effective
Crawl Success Comments: There are some problems with the functionality of captured pages. 1) http://www.ppic.org/main/home.asp: The drop-down links from the banner are not functional. For example, if you point to "Publications" and click on any of the drop-down items, you will retrieve an "object not found" message. The pages can be retrieved via the sidebar navigation links. 2) http://www.ppic.org/main/allpubs.asp: The radio button selections are not functional. For example, clicking on "Date" retrieves the message "Sorry, no documents with the given uri were found." 3) The search boxes are not functional; searches retrieve "Sorry, no documents with the given uri were found."
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Response to CDL questions: (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question.)
Crawl Frequency: weekly
Questions / Comments about crawl: Our crawl report included the following comment from the crawl operator: "We can't at the moment use Heritrix to report on new publications posted." The ability to report on new publications is critical to our goal of using the crawler as a discovery tool. What are the prospects for providing this functionality in the future?
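Heritrix itself cannot yet produce the "new publications" report requested above, but successive crawls can be compared outside the crawler. A minimal sketch follows, assuming each crawl's captured URLs have been exported to a plain text file, one URL per line; the file names below are hypothetical.

    # Report URLs present in this month's crawl but absent from the previous
    # one, as a crude "new publications" alert. Input files are hypothetical
    # exports of captured URLs, one per line.
    def load_urls(path):
        with open(path, encoding="utf-8") as fh:
            return {line.strip() for line in fh if line.strip()}

    previous = load_urls("ppic_crawl_2005-11_urls.txt")
    current = load_urls("ppic_crawl_2005-12_urls.txt")

    new_docs = sorted(current - previous)
    print(f"{len(new_docs)} URLs not seen in the previous crawl")
    for url in new_docs:
        if url.lower().endswith(".pdf"):  # PPIC publications are mostly PDFs
            print(url)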


Terry Huwe: AFL-CIO
CDL Report to Curator

URL: http://www.aflcio.org
Curator's original comments: "This site is content rich and has many files that will be useful in the future. Specific areas that are of special interest follow below. http://www.aflcio.org/corporatewatch: the data related to executive pay watch is especially useful. http://www.aflcio.org/mediacenter: would like to see press stories captured if possible. http://www.aflcio.org/issues: links to newsletters and original content. Also, the "Legislative Action Center" on the home page; this is a useful topic guide to legislative history from a labor perspective."
Site copyright statement: "Copyright © 2005 AFL-CIO"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 2572 additional hosts were crawled. The following hosts supplied more than 75 URLs to your site:

    [urls]  [bytes]     [host]
    12702   481956063   www.aflcio.org
    2657    184477      dns
    1375    35611678    www.local237teamsters.com
    570     8144650     www.illinois.gov
    502     52847039    www.ilo.org
    435     3851046     www.cioslorit.org
    427     2782314     www.nola.com
    401     8414837     www1.paperthin.com
    392     15725244    www.statehealthfacts.kff.org
    326     4600633     www.dol.gov
    288     12303728    search.oxide.com
    284     3401275     www.sikids.com
    280     3069385     www.washingtonpost.com
    272     1480539     www.cdc.gov
    235     5455692     www.kff.org
    [list truncated…]

Curator Feedback to CDL (Huwe AFL-CIO)

Crawl Success: effective
Crawl Success Comments: I realize the collection interface is a work in progress and therefore not super user-friendly. Nonetheless, I think the results of this crawl are excellent. Using search and display of collections, I was able to ascertain that a lot of original content was captured (e.g. Working Families Toolkit, BushWatch) that will have historical value. I'm hard-pressed to find fault with the crawl, short of reading through the crawl log in detail (which I don't think you're asking for, but which I _do_ have to do for IIR's 2-million-plus-hits-per-year Web sites, to analyze them for our program units). My feeling is that for the next cycle it might be really helpful for curators to have a more finished viewer, which will at least mimic how the collection might in fact be searched by an average user. That may be a tall order and it may have to wait. But I think searching content in that kind of online environment would improve curators' awareness of the strengths and weaknesses of the toolkit. Having said all that, my short answer is: I'm stoked. :-)
Crawl Scope Preferences: prefer linked hosts (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In the case of this collection there's a lot of original content (and context-making documents) on the linked sites, simply due to the hierarchical/federal nature of the labor movement.
Crawl Frequency: monthly
Questions / Comments about crawl: I may have missed correspondence on this question, as I've been busy with a library renovation here. My question: is the 1 gig limit a useful one? I couldn't help wondering what the actual figure would be if that limit were set higher. I'm generally interested in the total size of Web collections, as I need to monitor ours (which is content-rich but even so does not exceed 1 gig). So discussion of this at least would be interesting, and maybe, if it's not technically challenging, that 1 gig level might be raised and we could see what happens. Thanks to all involved -- very interesting process.


Kris Kasianovitz: Los Angeles Dept. of City Planning
CDL Report to Curator

URL: http://cityplanning.lacity.org
Curator's original comments: "Website uses frames. Most of the documents will be pdfs. Of particular interest - the EIRs, which are not archived on the page once the project is approved; see http://cityplanning.lacity.org/EIR/TOC_EIR.htm - General and Community Plans: http://cityplanning.lacity.org/complan/gen_plan/genplan2.htm, http://cityplanning.lacity.org/ComPlan/cpbpage.htm"
Site copyright statement: No copyright information found
Crawl Results

Comments from crawl operator (Linked hosts included crawl) ended because it ground on for 3 days without hitting data limit not sure if URLs at end of log are validuseful Related hosts crawled When the crawl was set to include documents from other sites that the original site linked to 119 additional hosts were crawled The following hosts supplied more than 50 documents to your site [urls] [bytes] [host]


[urls] [bytes] [host]
10493 840876945 cityplanning.lacity.org
601 5156252 metrolinktrains.com
183 644377 www.cr.nps.gov
121 11162 dns
90 977850 www.metrolinktrains.com
81 1207859 www.fta.dot.gov
79 263432 www.fypower.org
66 333540 www.adobe.com
64 344638 lacity.org
63 133340 ceres.ca.gov
60 274940 www.amtrak.com
59 389217 www.nhtsa.dot.gov
58 347752 www.unitedweride.gov
52 209082 www.dot.gov
52 288783 www.nationaltrust.org
51 278949 www.portoflosangeles.org
[list truncated…]

Curator Feedback to CDL (Kasianovitz LA City Planning)

Crawl Success: mostly effective
Crawl Success Comments: The crawl in some cases captured more than I expected AND then didn't capture items that I thought it would. For example, the City Planning department is loaded with EIRs, notices, etc. In most cases the documents are all PDFs. When searching specifically for EIRs I got a large result list (699 citations); however, when I investigated whether or not the actual file was captured, I found that the main EIR page was captured (typically a .htm/.html file), but when I clicked on a link to get to the full report, all I got was the "Sorry, no documents with the given URI were found" message. This could be because the file was no longer available when the site was harvested; however, I tested a few of these and found that I could still access them on the City Planning live page. Typically this occurred when there was a cover page. Is this an issue of setting the crawler to go down more levels? Or something else? These are key documents that I would want to have harvested and preserved. Here are a few specific examples:
Final EIR directory: http://cityplanning.lacity.org/EIR/Tocfeir.htm
Sierra Canyon Secondary School (cover page): http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/SierraCyn_coverpg.htm. Access to the Draft EIR and Final EIR is provided from this cover page. Within the system, the links to both the Draft and Final are broken; no documents with that given URI: http://cityplanning.lacity.org/EIR/SierraCyn2ndSchool/DEIR/Table of Contents.htm
Villa Marina EIR: http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
Directory of EIR notices of preparation: http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM


This provides links to EIR preparation notices (all PDFs), a total of 27 links/documents. I encountered the following three issues:

• PDF opened = when clicking on the link to the notice, the PDF opened with no problem: 16 of 27

• "Sorry, no document with the given URI was found" = no PDF harvested, but I could get to it from the live site: 4 of 27

• Acrobat "Could Not Open" message (could open the live page outside of WERA) = the following Acrobat message came up when I tried to open the PDF: "Acrobat could not open 'ENV-2005-0881-EIR[1].pdf' because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded). To create an Adobe Acrobat PDF document, go to the source application. Then print the document to Adobe PDF." I copied and pasted the URL into a regular browser and could open the PDF with no problem. Also, in a few cases, if I clicked on the GO button after the first attempt to open the PDF in the system, it seemed to launch: 7 of 27

Conversely, I found a number of pages that contained full documents in HTML, with links to PDFs, that worked with no problem. See the following document: http://cityplanning.lacity.org/cwd/gnlpln/transelt/TET2Bkgrnd.htm. File types and error codes were what I expected.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: The crawl restricted to the original host is more helpful/relevant for getting to the materials from a specific agency. For some of the results that were restricted to host, I was getting external links. Here are some comparisons for each of the crawl settings.
Searched for "villa marina":
LA Dept of City Planning: 6 results
http://cityplanning.lacity.org/EIR/NOPs/ENV-2004-3812-EIR.pdf
http://cityplanning.lacity.org/EIR/VillaMarina/VillaMarina_coverpg.htm
http://cityplanning.lacity.org/EIR/NOPs/TOCNOP.HTM
http://cityplanning.lacity.org/EIR/Tocfeir.htm
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/Cwd/GnlPln/HsgElt/HETblFigApVHgSit.htm
LA City Dept of Planning (via): 2 results
http://cityplanning.lacity.org/complan/pdf/plmcptxt.pdf
http://cityplanning.lacity.org/Cwd/GnlPln/HsgElt/HETblFigApVHgSit.htm
Searched for "eir":
LA Dept of City Planning: 699 results
LA City Dept of Planning (via): 324 results


For both of these searches, the URIs were from cityplanning.lacity.org.
Searched for "transportation":
LA Dept of City Planning: 699 results
LA City Dept of Planning (via): 290 results (most are from external sources and tended to be the index or main page of another agency or organization; because this just got me to the main page, and none of the links functioned at that level, the via result was less helpful). However, the via results are useful for discovering other agencies or organizations that I should be looking at for materials.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question): The list of linked hosts provided at the end of the report is helpful. This will help me target other agencies that might be key places to check for further collection development.
Crawl Frequency: monthly
Questions / Comments about crawl: I want to qualify the frequency for this site: I'd like to do a monthly crawl for 3-4 months. I'd then want to reassess to determine how much content is being added (or lost) and how much is remaining stable. If the loss rate is low, the amount remaining is stable, and not a lot of new content is being added, I'd change the frequency. I do know that notices and EIRs get added monthly, however, so it might be worth keeping the crawl at the monthly designation. After reviewing searches that brought back over 200 results, I am wondering how I would be able to review all of the content and manage it. According to the crawl report there are 8899 documents in this crawl alone; my other crawl yielded 2991 documents. I think that the tools that are being developed will help us manage these crawls. I should note that at this point in time, for local documents, I am more interested in individual documents than in capturing an entire website and preserving the functionality. Local agencies (with the exception of perhaps the Mayor's site) tend not to change the design (look and feel) very often or have a very sophisticated design (i.e., Flash, changing images, etc.), and I don't see that this would be of interest to researchers' needs; it is the content (reports, maps, etc.) contained on and accessed through the websites that is important. Maybe I'm wrong or being short-sighted about that.


Kris Kasianovitz: Southern California Association of Governments
CDL Report to Curator

Curator's original comments: "This is a critical regional agency for Los Angeles, Orange, Ventura, Imperial, Riverside, and San Bernardino counties. Its main areas are Transportation, Housing, and Economic Development. This will provide an analysis of the overall site, which has a lot of content. Publications/reports are typically in PDF; they are presented as full reports and as pieces of the report (for easier downloading), so there might be duplication. The full report is really all that would be needed; see http://www.scag.ca.gov/publications. The Resources page contains the PDFs, images, dynamic content, and GIS programs, including an interactive atlas: http://www.scag.ca.gov/resources.htm. One part of the Resource site is the Web Accessible Geographic Data Search (WAGS): http://mapsvr.scag.ca.gov/wags/index.cfm?fuseaction= . It requires a user-created login and password (although there is a guest login that allows you to bypass this). I'm not sure what kind of difficulty the harvester will encounter with this portion of the site. The interactive atlas also has a create-an-account/guest login issue: http://mapsvr.scag.ca.gov/atlas/presmap.asp?Cmd=INIT. Since it is a dynamic page, I don't know how this will be handled by the harvester."
Site copyright statement: "© 1999-2005 Southern California Association of Governments"
Crawl Results


Comments from crawl operator: Interesting login problem; Heritrix was unable to retrieve the guest login pages. Cold Fusion and ASP don't generate new URLs and thus don't get crawled. NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than 1 minute.
Robots.txt file: The site you selected forbids crawlers from gathering certain data. It reads:

User-agent: *
Disallow: /_mm
Disallow: /_notes
Disallow: /_baks
Disallow: /MMWIP
User-agent: googlebot
Disallow: /csi

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 500 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
2517 863231651 www.scag.ca.gov
690 6134101 www.metrolinktrains.com
506 40063 dns
428 1084533 www.cacities.org
397 16161513 www.sce.com
196 581022 bookstore.icma.org
187 4505985 www.icma.org
175 7757737 www.ci.seal-beach.ca.us
158 1504151 www.h2ouse.org
149 940692 www.healthebay.org
137 317748 www.ci.pico-rivera.ca.us
130 18259431 www.ci.ventura.ca.us
123 490154 www.chinohills.org
121 406068 www.lakewoodcity.org
119 203542 www.lavote.net
117 2449995 www.ci.malibu.ca.us
114 744410 www.ci.irvine.ca.us
113 368023 www.whitehouse.gov
109 974674 www.dot.ca.gov
107 892192 www.lacanadaflintridge.com
[list truncated…]

Curator Feedback to CDL (Kasianovitz SCAG)

Crawl Success: mostly effective


Crawl Success Comments: Similar to my comments about the Los Angeles Dept. of City Planning: the crawl brought back a lot of webpages, but not all the publications/documents that I would want to collect. Again, the same problem happened with SCAG as happened with the EIR example. The crawl brought back HTML pages with links to reports (typically in PDF format), but the actual documents were not captured. While the webpage is helpful, as it gives context, the main content that I'd want to capture (the reports) was not captured. See the following for example; none of the webpages linked from this page are available, and they should link to a page that will have the material. I tried searching for the documents separately and couldn't get to them. See http://www.scag.ca.gov/publications/index.htm (the timeline arrows at the top seemed to function; I'm not sure what this is for) and http://www.scag.ca.gov/livable/pubs.htm. I was impressed to find that zip files were captured and I was able to download them. Unfortunately, when I opened them there wasn't any content (I did the same search by mistake with the Arizona Dept. of Water Resources and actually found content in the folders); I found 10 with the search type:zip. The gif or jpg images retrieved are not useful; most were just bars or bullets or covers of reports (although this might be helpful for identifying titles, I think I would end up discarding these after doing more checking of the results).
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments from LA Dept. of City Planning. Restricted gets me to the relevant materials for that agency; via brings back too many main webpages from other agencies to be useful.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question): I expected that there would be a problem with the content behind logins. The crawl confirmed that material behind login screens couldn't be captured. Can I get a copy of these crawl results? NOTE: A third crawl attempt was made for this site with new settings. This crawl focused on the login pages only, retrieved 28 files, and took less than one minute.
Crawl Frequency: monthly
Questions / Comments about crawl: How to handle the copyright issue? For the login information, I'm not sure what all was blocked by the robots file. Interesting that metrolinktrains.com is the #1 related host for both of my crawled sites.


Linda Kennedy: California Bay Delta Authority
CDL Report to Curator

URL: http://calwater.ca.gov
Curator's original comments: "We are interested in the environmental impact statements and other key documents, and the various news releases and other announcements and archives of CALFED."
Site copyright statement: "© 2001 CALFED Bay-Delta Program"
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 519 additional hosts were crawled. The following hosts supplied more than 50 files to your site:

[urls] [bytes] [host]
1130 473192247 calwater.ca.gov
741 201538533 www.parks.ca.gov
521 40442 dns
373 51291934 solicitation.calwater.ca.gov
242 78913513 www.calwater.ca.gov
225 410972 cwea.org
209 87556344 www.science.calwater.ca.gov
173 109807146 science.calwater.ca.gov
172 1160607 www.adobe.com
129 517834 www.whitehouse.gov
[list truncated…]

Curator Feedback to CDL (Kennedy CALFED)


Crawl Success: mostly effective
Crawl Success Comments: I looked closely at the CALFED home page (http://calwater.ca.gov), Tribal home page (http://calwater.ca.gov/Tribal/Tribal_Home.shtml), Key Documents page (http://calwater.ca.gov/CALFEDDocuments/CALFEDDocuments.shtml), and Archives page (http://calwater.ca.gov/Archives/Archives.shtml). The crawl did not complete in either the via or the non-via search. Nearly all linked pages were retrieved in the non-via search; however, the retrievals from the via search were much less complete than the retrievals from the non-via search. For example, on the Key Documents page there were 3 missing links from the non-via search but 14 missing links from the via search. When Adobe documents were retrieved from either crawl, they came up correctly. Three asp links of tribal maps from the Tribal home page were retrieved by the non-via search but not the via search. A few of the images were missing from the displays, and this was also affected by the browser used. Usually the same image was missing from both crawls, but sometimes the images were more complete in the non-via crawl retrievals. There were some display problems with the right-hand menu boxes on the http://calwater.ca.gov/Tribal/Tribal_Home.shtml page, for example, that did not display correctly when viewed in Firefox. The same page viewed correctly in Internet Explorer, but when printed out the boxes printed incorrectly, just as viewed in the Firefox browser. Grant Opportunities (http://calwater.ca.gov/GrantOpportunities/GrantInformation.shtml): this link did not work in 2 via and 2 non-via instances (from the Tribal home page and from the Archives page), but it did work on one non-via crawl page (the CALFED home page). It could also be searched and retrieved directly from the test crawl search page.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: See comments above. The non-via search was substantially more complete.
Crawl Frequency: monthly


Janet Martorana: Santa Barbara County Department of Planning and Development
CDL Report to Curator

URL: http://www.countyofsb.org/plandev/default.htm
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 487 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
3119 1102414495 www.countyofsb.org
485 34416 dns
428 1083047 www.cacities.org
357 6126453 www.sbcphd.org
320 6203035 icma.org
250 438507 www.sbcourts.org
234 1110744 vortex.accuweather.com
200 593112 bookstore.icma.org
[list truncated…]

Curator Feedback to CDL (Martorana SBCD)

Crawl Success: mostly effective


Crawl Success Comments: Most documents I expected to find were captured, but a number were not. For example, off this page, http://www.countyofsb.org/plandev/comp/threeyear/2005-2008/default.html, I expected to get to the final work program, http://www.countyofsb.org/plandev/pdf/comp/programs/Three_Year_WP/2005-2008_3YrWrkProgram.pdf, but got the "Sorry, no documents with the given URI were found" message. Other examples are within the Energy Division, a part of the Planning & Development Dept. Off this page, http://www.countyofsb.org/energy/information.asp, I could access all links except for two: http://www.countyofsb.org/energy/information/oil&GasFields.asp (Oil and Gas Fields) and http://www.countyofsb.org/energy/information/oil&GasProduction.asp (Oil and Gas Production). The crawler seemed to cut off the URL right before the ampersand; perhaps it has problems with ampersands.
Crawl Scope Preferences: unknown (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Frequency: weekly
Questions / Comments about crawl: I'd like to be notified when I've navigated away from the captured site. I found myself on real-time web sites but couldn't tell what was on the captured site and what was real. Example: When I did a search on "santa barbara" I got 1528 results using the via results. I went to the County of Santa Barbara Online site, and then, after clicking on links, I was on the web in real time and not in the crawl results database any longer, yet there were no indications that I had left the crawled database. The WERA URI was still displaying at the top of the screen, and I couldn't tell which were the captured sites and which were the current real-time sites. Other observations: the webpage navigation doesn't work, e.g., the Table of Contents doesn't jump to that section of the webpage (www.countyofsb.org/energy/projects/shell.asp and www.countyofsb.org/energy/mitigation/oakProject.asp), and links to glossary terms go to the glossary but not to the term itself.
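On the ampersand problem described above: one plausible mechanism, sketched below, is that a link whose path contains "&" (or that appears in the HTML source as the entity "&amp;") gets truncated when an extractor treats the ampersand as a query-string separator or fails to unescape the entity. This is only an illustration of the suspected failure mode, not a description of how Heritrix actually extracts links, and the relative href shown is an assumption about how the page's HTML might be written.

import html
from urllib.parse import urljoin, urlsplit

page_url = "http://www.countyofsb.org/energy/information.asp"  # reconstructed above
raw_href = "information/oil&amp;GasFields.asp"                 # assumed form of the link in the HTML source

# Correct handling: unescape the entity, then resolve it against the page URL.
href = html.unescape(raw_href)            # -> "information/oil&GasFields.asp"
absolute = urljoin(page_url, href)
print(absolute)                           # .../energy/information/oil&GasFields.asp
print(urlsplit(absolute).path)            # "&" is perfectly legal inside a path component

# A buggy extractor that treats "&" as a parameter separator loses the tail of the URL:
print(absolute.split("&")[0])             # .../energy/information/oil (the observed truncation)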


Lucia Orlando: Monterey Bay National Marine Sanctuary
CDL Report to Curator

URL: http://montereybay.noaa.gov
Curator's original comments: None provided
Site copyright statement: No copyright information found
Crawl Results

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 795 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
5272 468755541 montereybay.noaa.gov
861 61141 dns
554 20831035 www.wunderground.com
368 4718168 montereybay.nos.noaa.gov
282 3682907 www.oceanfutures.org
273 10146417 www.mbnms-simon.org
260 7159780 www.mbayaq.org
163 61399 bc.us.yahoo.com
152 1273085 www.mbari.org
146 710203 www.monterey.com
119 3474881 www.rsis.com
119 279531 www.steinbeck.org
118 1092484 bonita.mbnms.nos.noaa.gov
109 924184 www.duke.edu
104 336986 www.montereybayaquarium.org
103 595953 icons.wunderground.com
102 339589 www.uncw.edu
[list truncated…]

Curator Feedback to CDL (Orlando – Monterey Bay)

Crawl Success: (rating not provided)
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: This site contains a large mix of government policy, laws, and regulatory information, as well as links to recreational and educational activities available in the MBNMS. I was most interested in links to laws/regs/policy and educational info and organizations. I thought the links restricted to the original host best captured this information succinctly.
Crawl Frequency: unknown


Richard Pearce-Moses: Arizona Department of Water Resources
CDL Report to Curator

URL: http://www.azwater.gov (redirects to http://www.azwater.gov/dwr)
Curator's original comments: "In arid Arizona, water is one of the most important – and most contested – resources. The publications and records of this Department are of critical value to the state. Our spider can get many files from this site (1474 files in 258 directories). We are mostly interested in documents by programs. Although our spider can't get the imaged documents database (http://www.azwater.gov/dwr/Content/ImagedRecords/default.htm), this directory may not be critical if we can get the imaged documents transferred to us. We are not interested in blank forms and applications."
Site copyright statement: "Copyright © 1998 - 2005 Arizona Department of Water Resources and ADWR Network. All Rights Reserved."
Crawl Results

Questions for curator: Did this capture the documents you needed?


Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 195 additional hosts were crawled. The following hosts supplied more than 50 files to your site:
[urls] [bytes] [host]
2233 988447782 www.azwater.gov
286 2350888 www.water.az.gov
253 4587125 www.groundwater.org
226 3093331 www.azcentral.com
196 15626 dns
178 395216 www.macromedia.com
128 1679057 www.prescott.edu
123 947183 www.azleg.state.az.us
115 792968 www.usda.gov
[List truncated…]

Curator Feedback to CDL (Pearce-Moses AZWater)

Crawl Success: (not provided)
Crawl Success Comments: We were surprised that your crawl found 4888 documents. Another crawl that we conducted at about the same time using wget found only 1474. However, both spiders found roughly the same number of bytes. As I understand it, wget cannot follow links in Flash or Java, while it appears that the Heritrix spider can; that may be the difference. The crawl is listed as not completing, but it appears to be very close, based on the total number of bytes downloaded.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Questions / Comments about crawl: It seemed odd that it took 2h 54m to crawl 7709 documents with linked hosts, but 4h 4m to crawl only 4888 docs when the spider was restricted to the original host.


Richard Pearce-Moses: Citizens Clean Election Commission
CDL Report to Curator

URL: http://www.ccec.state.az.us/ccec/scr/home.asp
Curator's original comments: "This commission was established by initiative. Its work is of great historical significance, as it is changing the way the public elects officials. We have not been able to spider this site because links are buried in JavaScript. (We use wget as our spider.) We are primarily interested in acquiring their publications, election data, and things listed under 'popular links'."
Site copyright statement: "Copyright 2004 Arizona Citizens Clean Elections Commission. All Rights Reserved."
Crawl Results

Comments from crawl operator: JavaScript issue; interesting problem; need curator feedback about what we captured.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, the following additional sites were crawled. Total: 15


[urls] [bytes] [host]
929 95456563 www.ccec.state.az.us
76 6117977 www.azcleanelections.gov
55 513218 az.gov
49 499337 www.governor.state.az.us
44 174903 www.adobe.com
40 141202 www.azleg.state.az.us
31 18549 www.az.gov
28 202755 www.azsos.gov
23 462603 gita.state.az.us
19 213976 www.benefitoptions.az.gov
17 89612 www.azredistricting.org
14 1385 dns
3 1687 wwwimages.adobe.com
2 1850 www.capitolrideshare.com
2 26438 www.ftc.gov

Curator Feedback to CDL (Pearce-Moses CCEC)

Crawl Success: (not provided)
Crawl Success Comments: We were very pleased with this crawl, as it demonstrated that the Heritrix spider could follow links embedded in Java. We have not been able to crawl this site with wget.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
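The difference the curator observes between wget and Heritrix comes down to where each tool looks for links: a crawler that only reads href attributes never sees URLs written inside JavaScript, while a crawler that also scans script text for URL-like strings can queue them. The toy page and regular expressions below are hypothetical and much cruder than either tool's real extractors; they are only meant to illustrate the distinction.

import re

page = """
<a href="/elections/home.asp">Home</a>
<script>
  function popular() { window.location = "/popular/links.asp"; }
</script>
"""  # hypothetical markup in the style of a script-driven menu

href_links = re.findall(r'href="([^"]+)"', page)           # all an href-only spider would see
script_links = re.findall(r'"(/[^"]*\.asp[^"]*)"', page)   # crude scan for URL-like strings anywhere

print(href_links)    # ['/elections/home.asp']
print(script_links)  # ['/elections/home.asp', '/popular/links.asp']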


Juri Stratford: City of Davis
CDL Report to Curator

URL: http://www.city.davis.ca.us
Curator's original comments: "We are primarily interested in the GIS data produced for the City of Davis. As the GIS data represent the largest part of the City of Davis web site, it may not be much more difficult to archive the site at the top level. Mapping and Geographic Information Systems URL: http://www.city.davis.ca.us/gis"
Site copyright statement: "This web site is Copyright © 2004 by the City of Davis. All Rights Reserved. The City retains the copyright on all text, graphic images and other content of this site. You may not copy, modify and/or re-use text, images or other web content from this web site, distribute the City's web content, mirror content from this web site on a non-City server, or make any other use of the content of this web site that would violate the City's copyright, without written permission from the City of Davis. To the extent allowed by law, commercial use of our web material is prohibited without written permission from the City of Davis. All art work shown on these web pages is protected by US Copyright laws. Limited reproduction for non-commercial purposes can be authorized by the City of Davis, provided that requests are approved prior to use. Contact the Community Development Department, Cultural Services program staff at (530) 757-5610 for more information. Some content included in this web site may be provided courtesy of third parties who may retain copyright control of the provided material. Any service marks and trademarks contained herein are the property of their respective owners."
Crawl Results


Comments from crawl operator: "GIS. Potential issue: /img disallowed by robots.txt, e.g. http://www.city.davis.ca.us/img/featuredmap-static.jpg can't be retrieved; also some maps on a second server are disallowed. Need feedback about the GIS material that was captured: what was captured that is useful? Much duplication -- pages captured repeatedly."
Robots.txt: The site you selected prohibits crawlers from collecting certain documents. The file reads:

User-agent: *
Disallow: /img
Disallow: /calendar
Disallow: /misc/email.cfm
Disallow: /edbusiness
Disallow: /gis/oldmap
Disallow: /police/log
Disallow: /pcs/grants/sacog
Disallow: /jobs/listings
Disallow: /css
Disallow: /pcs/nutcracker/history.cfm
Disallow: /pcs/nutcracker/pdfs
User-agent: asterias
Disallow: /
User-agent: gigabot
Disallow: /
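Python's standard library robots.txt parser can be used to check which URLs a file like the one above would block. The sketch below tests the map image mentioned in the crawl operator's comments (as reconstructed above) against an abridged version of the rules; the exact paths and slash placement are assumptions recovered from the garbled source, not a verified copy of the live file.

from urllib.robotparser import RobotFileParser

# Abridged version of the (reconstructed) City of Davis robots.txt shown above.
robots_txt = """\
User-agent: *
Disallow: /img
Disallow: /calendar
Disallow: /gis/oldmap
Disallow: /police/log
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

for url in [
    "http://www.city.davis.ca.us/img/featuredmap-static.jpg",  # blocked by "Disallow: /img"
    "http://www.city.davis.ca.us/gis/library/",                # not covered by any rule
]:
    verdict = "allowed" if rp.can_fetch("heritrix", url) else "blocked by robots.txt"
    print(url, "->", verdict)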

Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 420 additional hosts were crawled. The following hosts supplied more than 50 files to your site:


[urls] [bytes] [host]
16455 947871325 www.city.davis.ca.us
420 29555 dns
332 10377948 www.asucd.ucdavis.edu
305 33270715 selectree.calpoly.edu
279 3815103 www.w3.org
161 2027740 www.cr.nps.gov
139 941939 www.comcast.com
133 951815 www.yolocounty.org
[List truncated…]

Curator Feedback to CDL (Stratford Davis)

Crawl Success: mostly effective
Crawl Success Comments: Looking at the GIS Online Maps page, it's not clear which formats were retrieved and which were not. For example, the Growth Map Flash file downloads fine, but the Flash/ArcIMS files do not download.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: In general, restricted to original host works better. The broader search includes pages from businesses (e.g., Comcast) and other local agencies (e.g., other local and state government sites). But restricting the outside sites to the first level seems to be a good compromise.
Response to CDL questions (In some cases CDL posed specific questions to the curator in the test report. This is the curator's answer to the question): In general it looked like it did a good job pulling geographic data/images. For example, you can pull down data from http://www.city.davis.ca.us/gis/library. It's difficult for me to get a sense of the level of duplication from the way the search results display.
Crawl Frequency: monthly
Questions / Comments about crawl: I like the ability to navigate within the archive once you have a starting point. This is very nice.


Yvonne Wilson: Orange County Sanitation District
CDL Report to Curator

URL: http://www.ocsd.com
Curator's original comments: "This is an Orange County agency that has small reports and complex planning documents on its web site."
Site copyright statement: "Copyright © 2001-2005 Orange County Sanitation District. Unless a copyright is indicated, the information on this site is freely available for non-commercial, non-profit making use. If a copyright is indicated on any materials displayed on our website, permission to copy these materials must be obtained from the original source. Commercial use of District materials is expressly prohibited without the written permission of the OCSD."
Crawl Results

Comments from crawl operator: In both crawl settings we had to limit the maximum number of retry attempts in order to complete the crawl.
Related hosts crawled: When the crawl was set to include documents from other sites that the original site linked to, 85 additional hosts were crawled. The following hosts supplied more than 50 documents to your site:
[urls] [bytes] [host]
755 85943567 www.ocsd.com
164 7635257 www.ci.seal-beach.ca.us
122 809190 www.ci.irvine.ca.us
95 169207 epa.gov
86 7673 dns
85 559125 order.e-arc.com
66 840581 www.ci.huntington-beach.ca.us
62 213476 www.cityoforange.org
57 313579 www.epa.gov
55 4477820 www.villapark.org
50 1843748 www.cityoflapalma.org
50 463285 www.ocbinc.com
[List truncated…]

Curator Feedback to CDL (Wilson OCSD)

Crawl Success: somewhat effective
Crawl Success Comments: Using WERA, I searched by type and title in the two OCSD collections, plain and via. I received no hits for pdf, only the homepage for html, and three hits for text; there are many PDF sections in the EIRs. I next searched by title in the two collections, and was the most successful in via. By searching the titles "carbon canyon" and "Ellis Ave Pumping Station" I found articles, but not the EIRs, which are available full text. At this point I presumed the crawl did not drill down far enough. Then I searched for a secondary page entitled "Ocean Monitoring"; this time the search found only an internal letter and memo, but not all the documents related to this topic. The via collection search seems to be the most productive, but it is not consistent.
Crawl Scope Preferences: Original host only (Does the curator prefer the crawl to be restricted only to the original host name, or the wider scope that includes linked pages from other hosts?)
Crawl Scope Comments: I searched some of the outside links to US Marine Fisheries and EPA Beach Watch and received no hits.
Crawl Frequency: monthly


Crawl Report Key: Web-at-Risk Test Crawls
This document is a guide to the test crawl report, providing further information about some of the statistics and results conveyed there. Your report includes basic information about the site you submitted, your original comments about that site, and any copyright statements CDL found when we examined the site. Your site was crawled using the Heritrix crawler, version 1.5.1. Your report will include the following information about the site.

Crawl Settings: We crawled each site in two different ways: A) linked hosts included; B) restricted to original host. Where linked hosts were included, we set the crawler to gather any outside page that your nominated site linked to, but no further. So if your site linked to a single document from whitehouse.gov, we captured that document but did not crawl any further on the whitehouse.gov site. Your report will show results from both styles of crawling, to give you a sense of whether or not the site draws heavily from valuable materials on another site. You will also receive a list of the other hosts that site linked to and how many documents were gathered from those hosts. (A sketch of this scoping logic appears after this key.)

Robots.txt file: The presence of a robots.txt file means that the content provider is asking us to refrain from crawling either all or part of the site. This refers only to the host you named in your crawl request. In some cases the site had a robots.txt file but it didn't say anything; we noted when this occurred. We obeyed robots.txt instructions for these crawls, so if the site contained one but we still got a result, that means the robots.txt file only prevented us from crawling certain areas of the site. When robots.txt files were present, we have included the text of that file in your report so you can see which segments of the site the site owner wants to protect.

Crawl duration

Total number of documents: The "Documents" count will include page components (such as images or Flash files).

File types (mime types): This area will contain a URL. When you go to that URL you will see a list of the different file types that were retrieved as part of the crawl. IMPORTANT: The Heritrix crawler is currently experiencing difficulty with this report and it is missing a crucial column. Until that is fixed, you can see the different file types retrieved, from most common to least, but you cannot yet tell how many files each one included. CDL will contact you when this report has been fixed.


Response code reports: The URL in this column will lead to a list of response codes, in order by frequency. This will include "200" for files that were successfully captured and error codes for files that were not captured. The error code list includes some codes specific to Heritrix. The key to interpreting these codes is at http://crawler.archive.org/articles/user_manual.html#statuscodes. Note that this report only gives you quantitative information about response codes; it does not link response codes to specific files. For these details, see "hosts report and crawl log" below.

How much data collected (bytes): The file size of the total crawl is reported in bytes. You can use the byte conversion tool at http://www.techtutorials.net/reference/byteconverters.html if you want to recalculate the size of the crawl in another measurement, such as kilobytes or megabytes.

Did crawl complete: This will say "no" if the crawl results exceeded 1 gigabyte or if the crawler encountered an obstacle to capturing the site that could not be fixed.

Location of hosts report and crawl log: You are welcome to review the page-by-page details of the crawl log. This is a generic report that comes with the Heritrix crawler and is not terribly user friendly, but it provides the most detail about the crawl process. This report will list every file that the crawler attempted to get and provide some information about each file. The Heritrix manual can help you interpret this report: http://crawler.archive.org/articles/user_manual.html#logs (go to section 8.2.1, Crawl Log).

Comments from Crawl Operator: These are observations that the Web Archive Programmer made about the crawl process for your site.

Questions for Curator: This section does not appear in every report. If you have a question listed here, please respond to it in the Test Crawl Feedback form.

Your Collection: Important: The screens you will see do not represent the final user interface for the Web Archiving Service tools. We are using WERA, an open-source search and display tool, only to show you your test crawl results.


Because we did not seek the right to redistribute these documents, these pages are available only for the purpose of analyzing crawler effectiveness. You must have a password to view these pages. Your report will include the address of a wiki page and a login and password. Each site was crawled twice: plain crawl = only pages from the original site were collected; via = pages from the original site, as well as pages that site links to, were collected. Unfortunately, you cannot simply browse your site; you must select a collection and type a search. You will be able to navigate throughout your site once you load a page containing links. You will be able to review your colleagues' sites as well. Note that the WERA display tool is not perfect: if the same document was gathered from more than one crawl, it may not display in every collection.

Related Hosts Crawled: This section provides further information about the additional materials that were gathered when we set the crawler to include documents that your site links to. This can be critical in deciding what settings are needed to capture your site. Some sites, for instance, will keep all of their PDF or image files on a separate server. If you don't allow the crawler to move away from the original URI, you won't capture a critical portion of the site's content. In other cases, however, this setting will lead to irrelevant information. This report includes the most commonly linked hosts from your site.
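As a companion to the "Crawl Settings" entry above, the sketch below spells out the two test-crawl scopes in code: under setting B only the nominated host is fetched, while under setting A a page on another host is fetched only when it is linked directly from the nominated site, with nothing beyond that one hop. This is an illustration of the rule as described in this key, not Heritrix's actual scope configuration; the seed is one of the nominated sites in this report, the whitehouse.gov host follows the key's own example, and the document paths are hypothetical.

from urllib.parse import urlsplit

SEED = "http://www.aflcio.org/"          # one of the nominated sites in this report
SEED_HOST = urlsplit(SEED).hostname

def in_scope(url, found_on, linked_hosts_included):
    """Decide whether `url`, discovered on page `found_on`, should be fetched."""
    if urlsplit(url).hostname == SEED_HOST:
        return True                       # pages on the nominated host are always in scope
    if not linked_hosts_included:
        return False                      # setting B: never leave the original host
    # Setting A: capture an off-site page only when it is linked from the nominated
    # host itself, i.e. one hop off the site, with no further crawling from that page.
    return urlsplit(found_on).hostname == SEED_HOST

print(in_scope("http://www.whitehouse.gov/doc.pdf", SEED, True))    # True: one hop off-site
print(in_scope("http://www.whitehouse.gov/more.html",
               "http://www.whitehouse.gov/doc.pdf", True))          # False: two hops away
print(in_scope("http://www.whitehouse.gov/doc.pdf", SEED, False))   # False under setting B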

