Negotiating the archives of UK web space

Jane Winters, Professor of Digital Humanities, School of Advanced Study, University of London

Workshop on National Webs, Aarhus, 8-9 December 2016

Jisc Domain Dataset (1996-2013)

Legal Deposit Domain Crawl (2013-2016)

Open UKWA (2004-2016)

UK parliament web archive (2009-2016)

UK Government Web Archive (1996-2016)

Internet Archive (1996-2016)

Common Crawl (1999-2016)

Archive-It

Other national web archives

Facts and figures I

• Jisc historical dataset 1996 to 6 April 2013

– 3,520,628,647 distinct records

– 65 terabytes

• 2014 domain crawl (.uk)

– 56TB data

– 2.5 billion webpages and other assets (including 4.7GB of viruses)

Facts and figures II

• UK Parliament Web Archive

– Three snapshots per year covering 30 sites (37 sites in the archive in total)

– 4.8TB data

• UK Government Web Archive

– 3,000+ websites

– Twitter (65,000 tweets) and video archives

Internal inconsistencies

• UKGWA consists of data provided by IA 2003-4 (plus back catalogue to 1996); and by the Internet Memory Foundation from 2005 onwards (further complicated by membership of UKWAC)

• The BL annual domain crawl has failed differently each time it has run

• The ‘break’ between IA and nationally archived content

[Chart: Number of format types, 1996-1997. Categories: text, image, application, video and file types. Series: 1996 and 1997. Vertical axis: 0 to 300.]

nexbri.demon.co.uk/local.gif 19970823153342 http://nexbri.demon.co.uk:80/local.gif image/gif 200 DFBOHMHZPPQSEAIGZGL5MTATRKVB3FGF - 40806909 DOTUK-HISTORICAL-1996-2010-GROUP-AK-XABCKD-20110428000000-00002.arc.gz

mirex.demon.co.uk/background3.gif 19970824013134 http://mirex.demon.co.uk:80/background3.gif image/* 200 Z2V3V4NZTEYL634PR4VPS7YWIVG7J4B4 - 40832067 DOTUK-HISTORICAL-1996-2010-GROUP-AK-XABCKD-20110428000000-00002.arc.gz

mirex.demon.co.uk/mirex.gif 19970824013153 http://mirex.demon.co.uk:80/mirex.gif image/* 200 KVZHDCQIPPU4T5TA6P4TCP2BAAJNSH6H - 40840076 DOTUK-HISTORICAL-1996-2010-GROUP-AK-XABCKD-20110428000000-00002.arc.gz

mirex.demon.co.uk/ibrowsenowanim.gif 19970824013315 http://mirex.demon.co.uk:80/IBrowseNowAnim.gif image/* 200 CQXESYZG2DMVYDISQDJVQCMDJAHD7YEK - 40860957 DOTUK-HISTORICAL-1996-2010-GROUP-AK-XABCKD-20110428000000-00002.arc.gz
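The index lines above can be split into fields and tallied to make the inconsistency visible: three of the four captures carry the wildcard MIME type `image/*` rather than `image/gif`. The following is a minimal sketch, assuming the lines follow the common nine-field CDX layout (URL key, timestamp, original URL, MIME type, status code, digest, redirect, offset, file name). The field names, `CdxRecord` and `parse_cdx_line` are hypothetical and introduced only for illustration; they are not taken from the slides.

```python
from collections import Counter
from typing import NamedTuple


class CdxRecord(NamedTuple):
    """One capture from a CDX index (field names are illustrative assumptions)."""
    url_key: str
    timestamp: str      # YYYYMMDDhhmmss
    original_url: str
    mime_type: str
    status: str
    digest: str
    redirect: str       # "-" when absent
    offset: int         # offset of the record in the archive file
    filename: str


def parse_cdx_line(line: str) -> CdxRecord:
    """Split one space-delimited CDX line into its nine assumed fields."""
    parts = line.split()
    if len(parts) != 9:
        raise ValueError(f"expected 9 fields, got {len(parts)}: {line!r}")
    return CdxRecord(*parts[:7], int(parts[7]), parts[8])


# The four index lines shown on the slide, copied verbatim.
SAMPLE = """\
nexbri.demon.co.uk/local.gif 19970823153342 http://nexbri.demon.co.uk:80/local.gif image/gif 200 DFBOHMHZPPQSEAIGZGL5MTATRKVB3FGF - 40806909 DOTUK-HISTORICAL-1996-2010-GROUP-AK-XABCKD-20110428000000-00002.arc.gz
mirex.demon.co.uk/background3.gif 19970824013134 http://mirex.demon.co.uk:80/background3.gif image/* 200 Z2V3V4NZTEYL634PR4VPS7YWIVG7J4B4 - 40832067 DOTUK-HISTORICAL-1996-2010-GROUP-AK-XABCKD-20110428000000-00002.arc.gz
mirex.demon.co.uk/mirex.gif 19970824013153 http://mirex.demon.co.uk:80/mirex.gif image/* 200 KVZHDCQIPPU4T5TA6P4TCP2BAAJNSH6H - 40840076 DOTUK-HISTORICAL-1996-2010-GROUP-AK-XABCKD-20110428000000-00002.arc.gz
mirex.demon.co.uk/ibrowsenowanim.gif 19970824013315 http://mirex.demon.co.uk:80/IBrowseNowAnim.gif image/* 200 CQXESYZG2DMVYDISQDJVQCMDJAHD7YEK - 40860957 DOTUK-HISTORICAL-1996-2010-GROUP-AK-XABCKD-20110428000000-00002.arc.gz
"""

records = [parse_cdx_line(line) for line in SAMPLE.splitlines() if line.strip()]

# Count how often each recorded MIME type occurs.
mime_counts = Counter(record.mime_type for record in records)

# The first four digits of the timestamp give the capture year.
capture_years = sorted({record.timestamp[:4] for record in records})

for mime_type, count in mime_counts.most_common():
    print(mime_type, count)
```

Counting the MIME type as recorded, rather than inferring it from the file extension, keeps the sketch faithful to what the index actually stores, which is the point at issue when comparing format types across years.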

[Chart: Number of .uk names registered, 1996-2008 (Nominet). Horizontal axis: years 1996 to 2008. Vertical axis: 0 to 1,800,000.]

Breakdown of domain name registrations, 2000 (Nominet)

• .co.uk: 1,575,655

• .org.uk: 108,711

• .ltd.uk: 4,626

• .plc.uk: 265

• .net.uk: 42

• .sch.uk: 8,830

Acknowledgements

• BUDDAH project team – Jonathan Blaney, Niels Brügger, Josh Cowls, Helen Hockx-Yu, Andrew Jackson, Eric Meyer, Ralph Schroeder, Jason Webber, Peter Webster

• Bursary holders – Rowan Aust, Rona Cran, Richard Deswarte, Saskia Huc-Hepher, Alison Kay, Gareth Millward, Marta Musso, Harry Raffal, Lorna Richardson, Helen Taylor
