Interrogating the archived UK web “RNIB” Gareth Millward – [email protected] – Centre for History in Public Health Improving health worldwide http://history.lshtm.ac.uk


DESCRIPTION

Digital History seminar 4 November 2014 Live Stream: http://ihrdighist.blogs.sas.ac.uk/2014/10/28/tuesday-4-november-interrogating-the-archived-uk-web-historians-and-social-scientists-research-experiences/


Page 1: Gareth Millward, Interrogating the archived UK web

Interrogating the archived UK web

“RNIB”

Gareth Millward – [email protected] – Centre for History in Public Health

Improving health worldwide

http://history.lshtm.ac.uk

Page 2

“The best-laid schemes o’ mice an’ men…

• Original plan to investigate the presence of information for disabled people on the UK web

• Also to look at the accessibility of that info through Web Accessibility Standard 1.0 (1998)

• Search for major organisations and key disability words

• Run sample through validation tools

Pieter Bruegel the Elder - The Tower of Babel (Vienna) - Google Art Project – edited : from Wikipedia

Page 3

… Gang aft agley.”

• Far too much stuff!

• Search terms such as “RADAR”, “SCOPE” and “MIND” obviously… problematic…

• No discernible pattern from code validation

• “Experience” of using screen readers impossible (for now)*

• Defining “information” or “reach” not a simple task

• Still major problems with assessing “importance” and “relevance”

* - At least within design scope of this project… !

Macintosh Performa 5200, a mid-90s Apple computer. From Wikipedia.

Page 4

“RNIB”

• A simple four-letter string

• Played a key role in promoting web standards in Britain

• Just over half a million “hits” – a significant number compared to other disability organisations.

RNIB logo © RNIB – RNIB.org.uk

Page 5

Large number of instances relative to peers…

Search term                                               Instances
RNIB                                                        516,165
MENCAP                                                      218,439
RNID                                                        217,963
"disability alliance"                                        22,421
royal association for disability and rehabilitation          16,072
BCODP                                                        12,501
UKDPC                                                         2,348
"spinal injuries association"                                45,477
"centre for independent living"                              23,185
"disability benefits consortium"                              2,205
disability                                               12,909,868
*.* (all)                                             2,023,288,655
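The table gives raw counts only; dividing a term's count by the *.* row shows how small each share is. A minimal Python sketch using figures copied from the table, and assuming the *.* row is the total for the whole dataset. The names `counts`, `TOTAL_ALL` and `share_of_total` are illustrative and not part of the original analysis:

```python
# Instance counts copied from the table above (a subset of the rows).
counts = {
    "RNIB": 516_165,
    "MENCAP": 218_439,
    "RNID": 217_963,
    "disability": 12_909_868,
}

# The "*.* (all)" row, assumed here to be the total for the whole dataset.
TOTAL_ALL = 2_023_288_655


def share_of_total(instances: int, total: int = TOTAL_ALL) -> float:
    """Return `instances` as a percentage of `total`."""
    return 100.0 * instances / total


# Print the terms from most to least frequent with their percentage share.
for term, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:<12}{n:>14,}{share_of_total(n):>10.4f}%")
```

The chart on this slide plots the same kind of ratio per year rather than over the whole period, so the yearly figures would need per-year counts that the table does not give.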

[Chart: Instances of search terms relative to *.*, 1996 - 2010. Series: RNIB, MENCAP, RNID. Y-axis: instances p.a. as percentage of whole p.a., 0.00% to 0.05%. X-axis: years 1996 to 2010.]

Page 6

… and not all self-referential

[Chart: Instances per domain as percentage of total for "RNIB". Y-axis: 0.00% to 30.00%.]

Page 7

Predominance of .org.uk

[Chart: Domains of instances as percentage of total of "RNIB". Categories: .org.uk, .co.uk, .gov.uk, .ac.uk, .nhs.uk, .parliament.uk. Y-axis: 0.00% to 60.00%.]

Page 8

The trouble begins - links

Links to                  Instances
-> rnib.org.uk              259,421
-> w3.org                    71,798
-> mla.gov.uk                34,435
-> openharmonise.org         32,071
-> facebook.com              31,098

• Disaggregated statistics are basically meaningless

• Second most common link is to W3.org – had virtually nothing to do with the actual activities of RNIB

• openharmonise.org – the CMS for mla.gov.uk. Reflects references on MLA site, not the activity of RNIB
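A tally like the "Links to" table can be produced by reducing each link to its host and counting. The sketch below is only an illustration of that idea in Python: the function names and the sample URLs are hypothetical, and the format of the real link data is not described in the slides.

```python
from collections import Counter
from urllib.parse import urlsplit


def target_host(url: str) -> str:
    """Return the host a link points to, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def count_link_targets(urls):
    """Count how many links point to each host."""
    return Counter(target_host(u) for u in urls)


# Hypothetical sample: the hosts are taken from the table, the paths are made up.
sample_links = [
    "http://www.rnib.org.uk/some/page",
    "http://rnib.org.uk/another/page",
    "http://www.w3.org/WAI/",
]

for host, n in count_link_targets(sample_links).most_common():
    print(f"-> {host} {n}")
```

A count of this kind says nothing about why a page links to a host, which is the problem the bullets above describe: a w3.org link usually marks an accessibility statement rather than anything RNIB did.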

Page 9

The bloody Guardian…

Page 10

Commensurability goes out the window..

• Once you start filtering out the areas that aren’t “really” part of your search, it becomes impossible to compare one search term with another.

• You will lose “useful” information and keep “useless” stuff

• Can begin to build a “human readable” corpus – but what the heck do I actually have, here? Certainly not what I originally intended to look at…

xkcd: Thesis Defence

Page 11

Whittling down

• REMOVED LINKS TO W3.org (usually just a mention of WAI)

• REMOVED RNIB.org.uk (I can browse the main site – more interested in external material)

• REMOVED 2009 & 2010 (made the sample smaller, and these use a different crawling system)

• REMOVED RNIB.co.uk

• REMOVED big-print.co.uk

• REMOVED MLA.gov.uk (mentions RNIB a lot, but becomes noise)

• The result of all this? The corpus is down to 71,112

• (Actually, by reducing the date range further and adding a couple of extra tweaks, now down to 39,270)
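The removals listed above amount to a chain of exclusion rules applied to every instance. A minimal Python sketch of such a filter follows. The `Instance` record, its fields and the sample data are hypothetical; it also assumes that "REMOVED RNIB.org.uk" and similar mean pages hosted on those domains, which the slide does not spell out. It does not reproduce the 71,112 or 39,270 figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical record for one occurrence of "RNIB" in the archive."""
    host: str           # host of the page on which the term appears
    year: int           # year the page was crawled
    links_to_w3: bool   # True if the occurrence is a link to w3.org


# Domains and years named on the slide.
EXCLUDED_HOSTS = {"rnib.org.uk", "rnib.co.uk", "big-print.co.uk", "mla.gov.uk"}
EXCLUDED_YEARS = {2009, 2010}


def keep(inst: Instance) -> bool:
    """Return True if the instance survives every removal rule."""
    host = inst.host.lower()
    # Match the excluded domain itself and any subdomain of it (e.g. www.).
    on_excluded_host = any(
        host == h or host.endswith("." + h) for h in EXCLUDED_HOSTS
    )
    return not (
        on_excluded_host or inst.year in EXCLUDED_YEARS or inst.links_to_w3
    )


# Made-up sample data, for illustration only.
sample = [
    Instance("www.rnib.org.uk", 2003, False),   # removed: main RNIB site
    Instance("www.example.ac.uk", 2004, True),  # removed: link to w3.org
    Instance("www.example.ac.uk", 2010, False), # removed: 2009/2010 crawl
    Instance("www.example.ac.uk", 2004, False), # kept
]

corpus = [inst for inst in sample if keep(inst)]
print(len(corpus), "of", len(sample), "instances kept")
```

Keeping the rules in named sets makes each exclusion easy to report, which matters because every rule applied to "RNIB" makes the result harder to compare with an unfiltered search for another term.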

Page 12

What did we learn today?

• Visible effects of the impact of RNIB on UK web standards

• Sheer presence suggests RNIB was better than its peers at establishing itself on the internet

• Google has made us (me) lazy

• An archive without an archivist or a catalogue is highly problematic for researchers

The British Library – from Wikicommons