Seeing In The Dark: Discovery and data-mining of restricted web archives
Andrew Jackson, Web Archiving Technical Lead
IIPC GENERAL ASSEMBLY | 25-04-2013 | LJUBLJANA
RESTRICTED ARCHIVES
Discovery in the dark
The JISC UK Web Domain Dataset
Internet Archive UK Domain Dataset 1996-2010
Millions of websites
2.5 billion resources
> 35TB
No direct access
No bulk downloads
Open metadata datasets
Analytical access
OPEN DATASETS
GeoIndexes – Discovering Local Web History
http://data.webarchive.org.uk/opendata/ukwa.ds.2/geo/
Format Profiles – HTML Version Analysis
http://data.webarchive.org.uk/opendata/ukwa.ds.2/fmt/
Top-Level Linkage Analysis
http://www.webarchive.org.uk/ukwa/visualisation/ukwa.ds.2/linkage
Host Linkage Dataset
1996|appserver.ed.ac.uk|portico.bl.uk 1
1996|art-www.acorn.co.uk|portico.bl.uk 1
1996|astra.ich.ucl.ac.uk|portico.bl.uk 1
1996|back.niss.ac.uk|portico.bl.uk 1
1996|beta.bids.ac.uk|portico.bl.uk 2
1996|blaiseweb.bl.uk|blaiseweb.bl.uk 4
1996|bonsai.iielr.dmu.ac.uk|portico.bl.uk 4
1996|dominica.lshtm.ac.uk|portico.bl.uk 1
1996|dux.dundee.ac.uk|portico.bl.uk 2
1996|eisv01.lancs.ac.uk|portico.bl.uk 1
http://www.webarchive.org.uk/datasets/ukwa.ds.2/linkage/
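Assuming each record in the host linkage dataset has the shape year|source host|target host followed by whitespace and a link count, as the sample rows suggest, a minimal Python sketch for loading and aggregating it might look like the following. The names `HostLink` and `parse_host_links` and the interpretation of the fields are assumptions made for illustration; the distributed files may differ in detail.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class HostLink(NamedTuple):
    year: int
    source: str   # assumed: host the links come from
    target: str   # assumed: host the links point to
    count: int    # assumed: number of links observed in that year


def parse_host_links(lines: Iterable[str]) -> Iterator[HostLink]:
    """Yield one HostLink per non-empty 'year|source|target count' line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, count = line.rsplit(None, 1)      # split off the trailing count
        year, source, target = key.split("|")
        yield HostLink(int(year), source, target, int(count))


sample = [
    "1996|beta.bids.ac.uk|portico.bl.uk 2",
    "1996|blaiseweb.bl.uk|blaiseweb.bl.uk 4",
    "1996|dux.dundee.ac.uk|portico.bl.uk 2",
]

# Total in-links per target host, ignoring links a host makes to itself.
inlinks = Counter()
for link in parse_host_links(sample):
    if link.source != link.target:
        inlinks[link.target] += link.count
```

Because the parser is a generator, the same loop can be pointed at an open file handle instead of the in-memory sample.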
WATs
Web Archive Transformation (WAT)
https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Transformation+(WAT)+Specification,+Utilities,+and+Usage+Overview
Contains links and anchor text.
Size & distribution:
  6TB of compressed JSON in WARC packaging
  Looking at hosting options
  CC0 licence
Working with the Oxford Internet Institute http://www.oii.ox.ac.uk/research/projects/?id=88
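To make the "links and anchor text" point concrete, here is a rough Python sketch of pulling anchor links out of a single WAT JSON payload. The nested field names (`Envelope`, `Payload-Metadata`, `HTML-Metadata`, `Links`, and the `A@/href` path value) follow the WAT layout as commonly documented, but should be treated as assumptions here, and the sample record is invented and heavily trimmed. Reading the enclosing WARC container is out of scope for this sketch.

```python
import json

# Hypothetical, heavily trimmed WAT payload; real records carry many more fields.
sample_payload = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "http://portico.bl.uk/"},
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Links": [
                        {"path": "A@/href", "url": "http://www.bl.uk/",
                         "text": "British Library"},
                        {"path": "IMG@/src", "url": "logo.gif"},
                    ]
                }
            }
        },
    }
})


def extract_anchor_links(payload: str):
    """Return (source URI, target URL, anchor text) for each <a href> link."""
    envelope = json.loads(payload).get("Envelope", {})
    source = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    html = (envelope.get("Payload-Metadata", {})
                    .get("HTTP-Response-Metadata", {})
                    .get("HTML-Metadata", {}))
    return [
        (source, link["url"], link.get("text", ""))
        for link in html.get("Links", [])
        if link.get("path") == "A@/href" and "url" in link
    ]


links = extract_anchor_links(sample_payload)
```

Every lookup uses `.get` with a default, so records without HTML metadata (images, redirects and so on) simply yield an empty list.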
DATA SERVICES
Full-text Search: Prime Ministers
http://www.webarchive.org.uk/ukwa
Analytical Access to the Domain Dark Archive (AADDA)
http://domaindarkarchive.blogspot.co.uk/
http://www.webarchive.org.uk/aadda-discovery/browse
GLOBAL INTEGRATION
Memento
[Mementos Screenshot]
http://www.webarchive.org.uk/mementos/search
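For readers unfamiliar with Memento, the core idea is datetime negotiation: a client asks a TimeGate for a resource as it existed at some moment by sending an `Accept-Datetime` header in HTTP date format. The Python sketch below only builds such a request and does not send it. The TimeGate base URL is a placeholder, and appending the target URL to it is a common convention rather than something the protocol requires, so both are assumptions.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Placeholder TimeGate endpoint, used purely for illustration.
EXAMPLE_TIMEGATE = "http://example.org/timegate/"


def timegate_request(target_url: str, when: datetime,
                     timegate: str = EXAMPLE_TIMEGATE):
    """Return (request URL, headers) for a Memento datetime negotiation."""
    # Accept-Datetime uses the HTTP date format and is always expressed in GMT.
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return timegate + target_url, {"Accept-Datetime": stamp}


url, headers = timegate_request(
    "http://www.bl.uk/",
    datetime(2013, 4, 25, tzinfo=timezone.utc),
)
```

The resulting pair can be handed to any HTTP client; a TimeGate would normally answer with a redirect to the closest archived copy.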
Integrated, Global Discovery
Exploit existing APIs
  Use item hash values via Wayback to compare our archives or validate independent archives
  Expose more information alongside the Memento API
  Improve prototype Memento browser plugin(s)
Develop new APIs
  Expose link information via Wayback and/or Memento
  Lookup by fields other than host and timestamp, e.g.
    In-links
    Hash values
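The hash-comparison idea can be sketched in a few lines of Python. This assumes that each archive exposes a payload digest in the form often seen in WARC headers and Wayback indexes, namely a SHA-1 hash that is Base32-encoded and prefixed with `sha1:`. How a given archive actually exposes its digests is an assumption, and the function names are illustrative only.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return the SHA-1 of the payload as a 'sha1:<Base32>' string."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def matches_archived_copy(local_payload: bytes, archived_digest: str) -> bool:
    """Check an independently held copy against a digest reported by an archive."""
    return payload_digest(local_payload) == archived_digest.strip()


# Two captures with identical content produce identical digests, so two
# archives can be compared without exchanging the content itself.
ours = payload_digest(b"<html>hello</html>")
theirs = payload_digest(b"<html>hello</html>")
identical = ours == theirs
```

Comparing digests rather than payloads matters for restricted archives, since it allows validation without granting direct access to the content.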
INSIDE-OUT ARCHIVES
Summary: Inside-Out Archives
CC0 open datasets
Analytical access services
Richer APIs
Integrated, contextualized, global discovery