Building Web Archiving Technology, Together


Nicholas Taylor
Web Archiving Service Manager
Stanford University Libraries

Web Archives 2015: Capture, Curate, Analyze
November 13, 2015

overview

• why build together?

• community for collaborative work

• APIs for collaborative work

“LAX on take off” by Doug under CC BY-NC-ND 2.0

not a programmer

“Bug” by Randall Munroe under CC BY-NC 2.5

aspiring OSS contributor

GitHub: “nullhandle (Nicholas Taylor)”

studying the landscape

“2010 Grand Canyon Celebration of Art 172” by Grand Canyon National Park under CC BY 2.0

a centralized enterprise

[Bar chart: External / Local / Both, 2011 vs. 2013; data labels in order of appearance: 60%, 25%, 14%; 63%, 20%, 16%]

NDSA: “Web Archiving in the U.S.: A 2013 Survey”

a centralized enterprise

[Chart by year, 1996 to 2013; series: Number of organizations; Archive-It Partner as of 2013]

NDSA: “Web Archiving in the U.S.: A 2013 Survey”

minimal local preservation

[Bar chart: Transferred / Haven't transferred, 2011 vs. 2013; data labels in order of appearance: 19%, 81%; 20%, 80%]

NDSA: “Web Archiving in the U.S.: A 2013 Survey”

opportunities for research

“Exploring the Canadian Political Interest Group and Political Parties Web Sphere” by Ian Milligan under Standard YouTube License

community analysis

• SAA Web Archiving Roundtable

• Archive-It Partners

• IIPC

NDSA: “Web Archiving in the U.S.: A 2013 Survey”

models of software production

(irrespective of license)

• sole source
  – single developer

• closed source
  – team/corporate dev; no outside contributions

• club source
  – pool resources for solo/team/corporate dev

• community source
  – direct and distributed community participation

• open source
  – grassroots, democratic, meritocratic participation

Tom Cramer: “Collaborative Open Source Software Production & APIs”

club source examples

• Archivematica, AtoM (Artefactual)

• ArchivesSpace (Lyrasis)

• Bitcurator (Educopia)

• Fedora (DuraSpace)

• JHOVE (OPF)

• LOCKSS (Stanford University)

• Omeka (George Mason University)

community source examples

community architecture

• privileges community over code

• recognizes distribution of investment

• embraces community diversity

• models open processes and governance

• encourages varied contributions

• serves community needs

success of a standard

• capture: DeDuplicator, Heritrix, python-heritrix, SiteStory, WAIL, WARCreate, WarcMITMProxy, WarcProxy, Webrecorder, wget, Wpull

• access: OpenWayback, pywb, warc-proxy, WarcManager, Wayback Machine, Web Archive Discovery, WebArchivePlayer

• utilities: JHOVE2, JWAT, Megawarc, pylibwarc, WARCAT, Warcbase, warctools, Web Archive Commons

web archiving lifecycle

Internet Archive: “The Web Archiving Life Cycle Model”

smaller projects do better

[Charts: small projects (<$1 million) and large projects (>$10 million), each split into on time/budget, challenged, failed]

Standish Group: “Chaos Manifesto 2013: Think Big, Act Small”

IIPC community interest in APIs

contribution type                      % of respondents   # of respondents
help define functional requirements    94%                15
contribute use cases                   81%                13
help define technical details          69%                11
help schedule and run meetings         19%                3
implement and test                     6%                 1

Andrea Goethals: “Results of the Web Archiving API Survey of IIPC Members”
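The slide gives percentages and counts but not the total number of respondents. As a rough check, the sketch below searches for totals that reproduce every reported percentage. It assumes each percentage was rounded to the nearest whole number; the function name and the search bound are illustrative and not from the survey.

```python
# Counts and percentages as reported on the slide.
counts = {
    "help define functional requirements": 15,
    "contribute use cases": 13,
    "help define technical details": 11,
    "help schedule and run meetings": 3,
    "implement and test": 1,
}
reported_pct = {
    "help define functional requirements": 94,
    "contribute use cases": 81,
    "help define technical details": 69,
    "help schedule and run meetings": 19,
    "implement and test": 6,
}


def consistent_totals(counts, reported_pct, max_total=100):
    """Return every total n for which each count/n rounds to its reported percentage.

    Assumption (not stated in the source): percentages are rounded to the
    nearest whole number. max_total is an arbitrary search bound.
    """
    totals = []
    for n in range(max(counts.values()), max_total + 1):
        if all(round(100 * c / n) == reported_pct[k] for k, c in counts.items()):
            totals.append(n)
    return totals


print(consistent_totals(counts, reported_pct))
```

Under that rounding assumption the only total that fits is 16, which would mean all but one respondent offered to help define functional requirements, while only one offered to implement and test.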

API candidates

• capture tool/proxy interconnect

• capture tool management

• data import/export

• query + extraction

• integrity audit + repair

• descriptive metadata

• logs + analytics

• renderings/derivative formats

• federated data delivery

• federated replay

• federated full-text search