Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability Nicholas Taylor (@nullhandle) Web Archiving Service Manager Stanford University Libraries Archives 2016 209 - Balancing Quality of Life and Quality Assurance August 4, 2016


Page 1:

Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability

Nicholas Taylor (@nullhandle)

Web Archiving Service Manager

Stanford University Libraries

Archives 2016

209 - Balancing Quality of Life and Quality Assurance

August 4, 2016

Page 2:

QA panelists

Dory Bower

Government Publishing Office

Lori Donovan

Internet Archive / Archive-It

Dallas Pillen

Bentley Historical Library

Nicholas Taylor

Stanford University Libraries

Alex Thurman

Columbia University Libraries

Page 3:

balancing QA + quality of life?

“Tab Tatham "junk. balance scales."” by ▓▒░ TORLEY ░▒▓ under CC BY-SA 2.0

Page 4:

overheard re: QA @ SAA 2015

we set and forget; I’m just glad we’re doing something

steady, ongoing QA is challenging

occasionally I set aside a lunch hour to do some QA

my strategy right now is to let the big schools figure it out

Page 5:

2015 SAA WebArchRT discussion

• if you could only apply 3 QA practices to your web archives, which 3?

• do you apply different QA practices to web archives created for different use cases?

• how do you ensure that staff time allocated to QA is best spent?

Page 6:

quality assurance in the lifecycle

Archive-It: “The Web Archiving Life Cycle Model”

Page 7:

quality assurance, expansively

typical QA

• parsing robots.txt

• scoping rules

• object count limits

• test crawling

• inspecting archived site

• reviewing reports

• patch crawling

and more

• seed selection

• assessing live site

• capture tool selection

• crawl scheduling

• crawl duration limits

• monitoring crawl

• archivability advocacy

• training
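One of the typical QA steps listed above, parsing robots.txt, can be sketched with Python's standard `urllib.robotparser`. This is only an illustrative sketch, not a description of any crawler's actual behavior: the function name `classify_block`, the user-agent string, and the sample URLs are all hypothetical, and the "partial" label depends entirely on which sample URLs the caller chooses to test.

```python
from typing import List
from urllib.robotparser import RobotFileParser


def classify_block(robots_txt: str, user_agent: str,
                   seed_url: str, sample_urls: List[str]) -> str:
    """Classify how robots.txt affects a seed: 'complete', 'partial', or 'none'.

    'complete' means the seed URL itself is disallowed for this user agent.
    'partial' means the seed is allowed but at least one sample URL is not.
    (Hypothetical helper; the labels are assumptions made for illustration.)
    """
    parser = RobotFileParser()
    # parse() accepts the robots.txt content as a list of lines.
    parser.parse(robots_txt.splitlines())

    if not parser.can_fetch(user_agent, seed_url):
        return "complete"
    if any(not parser.can_fetch(user_agent, url) for url in sample_urls):
        return "partial"
    return "none"
```

A caller would fetch the live site's robots.txt during seed assessment and pass in a handful of URLs that matter for the collection, such as stylesheet or image paths.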

Page 8:

3rd highest desired skill

[Bar chart: vertical axis 0% to 80%]

NDSA: “2015 NDSA Web Archiving Survey”

Page 9:

low perceived programmatic progress

[Bar chart: vertical axis 0% to 60%]

NDSA: “2015 NDSA Web Archiving Survey”

Page 10:

greatest collaboration interest

[Bar chart: vertical axis 0% to 70%; categories: Policy + Risk Management, Capture Configuration, Collaborative Collection Dev, Input on APIs + Standards, Metadata Standards, QA Techniques + Strategies, Tool Dev, Other]

NDSA: “2015 NDSA Web Archiving Survey”

Page 12:

web archiving at Stanford

• 7 Archive-It accounts

• Heritrix, Webrecorder

• local preservation, discovery, access

• program manager, curators, students

• tens of collections

• thousands of seeds

Internet Archive: “Stanford University Homepage”

Page 13:

quality assurance goals

• maximize impact + efficiency of QA efforts

• enable diverse, distributed, + approachable contributions

• calibrate investments in quality based on tool capabilities

“Goals” by Eric Peacock under CC BY-NC-SA 2.0

Page 14:

capture, behavior, appearance

appearance

behavior

capture

NYARC: “I. Introduction - NYARC Documentation”

Page 15:

capture, behavior, appearance

appearance

behavior

capture

NYARC: “I. Introduction - NYARC Documentation”

Page 16:

in practice

care more about…

• report data

• crawl finishing

• 4xx, 5xx, complete robots.txt block

• plausible duration

• plausible object counts

• scoping out extraneous content

• new seeds

care less about…

• visual inspection

• reviewing every capture

• appearance fidelity

• behavior fidelity

• partial content out of scope

• partial content blocked by robots.txt

• ongoing seeds
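The report-driven checks in the "care more about" column could be automated along the following lines. This is a minimal sketch under stated assumptions, not the workflow used at Stanford: the `CrawlReport` fields, the `triage` function, and every threshold value are hypothetical, and a real crawl report would need to be mapped onto this shape first.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CrawlReport:
    """Hypothetical per-seed summary assembled from crawl report data."""
    seed: str
    finished: bool
    status_counts: Dict[int, int]          # HTTP status code -> response count
    duration_s: float
    object_count: int
    prev_object_count: Optional[int] = None  # None for a new seed


def triage(report: CrawlReport,
           max_error_share: float = 0.2,   # assumed threshold
           min_duration_s: float = 60.0,   # assumed threshold
           max_count_change: float = 0.5   # assumed threshold
           ) -> List[str]:
    """Return a list of reasons this seed deserves a closer look."""
    flags = []
    if not report.finished:
        flags.append("crawl did not finish")

    total = sum(report.status_counts.values())
    errors = sum(n for code, n in report.status_counts.items()
                 if 400 <= code < 600)
    if total and errors / total > max_error_share:
        flags.append("high share of 4xx/5xx responses")

    if report.duration_s < min_duration_s:
        flags.append("implausibly short duration")

    if report.prev_object_count:
        change = (abs(report.object_count - report.prev_object_count)
                  / report.prev_object_count)
        if change > max_count_change:
            flags.append("implausible object count")
    return flags
```

Seeds that return an empty list can skip visual inspection, which keeps reviewer time focused on new seeds and on crawls whose reports look wrong.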

Page 17:

more next from Lori, Alex, Dallas, Dory

“Olympic Relay Handoff” by Dr. Mark Kubert under CC BY-NC-ND 2.0