Seeing In The Dark: Discovery and data-mining of restricted web archives

Andrew Jackson, Web Archiving Technical Lead

IIPC GENERAL ASSEMBLY | 25-04-2013 | LJUBLJANA

RESTRICTED ARCHIVES

Discovery in the dark

The JISC UK Web Domain Dataset

Internet Archive UK Domain Dataset 1996-2010
Millions of websites
2.5 billion resources
> 35TB

No direct access
No bulk downloads
Open metadata datasets
Analytical access

OPEN DATASETS

GeoIndexes – Discovering Local Web History

http://data.webarchive.org.uk/opendata/ukwa.ds.2/geo/

Format Profiles – HTML Version Analysis

http://data.webarchive.org.uk/opendata/ukwa.ds.2/fmt/

Top-Level Linkage Analysis

http://www.webarchive.org.uk/ukwa/visualisation/ukwa.ds.2/linkage

Host Linkage Dataset

1996|appserver.ed.ac.uk|portico.bl.uk 1
1996|art-www.acorn.co.uk|portico.bl.uk 1
1996|astra.ich.ucl.ac.uk|portico.bl.uk 1
1996|back.niss.ac.uk|portico.bl.uk 1
1996|beta.bids.ac.uk|portico.bl.uk 2
1996|blaiseweb.bl.uk|blaiseweb.bl.uk 4
1996|bonsai.iielr.dmu.ac.uk|portico.bl.uk 4
1996|dominica.lshtm.ac.uk|portico.bl.uk 1
1996|dux.dundee.ac.uk|portico.bl.uk 2
1996|eisv01.lancs.ac.uk|portico.bl.uk 1

http://www.webarchive.org.uk/datasets/ukwa.ds.2/linkage/
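Each record in the host linkage data appears to take the form year|source host|target host, followed by a link count. A minimal Python sketch of reading such records and totalling in-links per target host might look like the following. The whitespace separator before the count, and the names HostLink, parse_host_link and inlinks_by_target, are assumptions made for illustration and are not part of the published dataset.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class HostLink(NamedTuple):
    year: int
    source: str
    target: str
    count: int


def parse_host_link(line: str) -> HostLink:
    """Parse one 'year|source|target<whitespace>count' record."""
    # Assumption: the count is separated from the key by a space or tab.
    key, count = line.rsplit(None, 1)
    year, source, target = key.split("|")
    return HostLink(int(year), source, target, int(count))


def inlinks_by_target(links: Iterable[HostLink]) -> Counter:
    """Sum the link counts pointing at each target host."""
    totals = Counter()
    for link in links:
        totals[link.target] += link.count
    return totals


# Three records taken from the sample shown on the slide.
sample = """\
1996|beta.bids.ac.uk|portico.bl.uk 2
1996|blaiseweb.bl.uk|blaiseweb.bl.uk 4
1996|dux.dundee.ac.uk|portico.bl.uk 2
"""

links = [parse_host_link(line) for line in sample.splitlines() if line.strip()]
totals = inlinks_by_target(links)
```

Note that a host linking to itself (blaiseweb.bl.uk above) is counted like any other link; an analysis of cross-host linkage would probably want to filter those records out.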

WAT Files

Web Archive Transformation (WAT)
https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Transformation+(WAT)+Specification,+Utilities,+and+Usage+Overview

Contains links and anchor text.

Size & distribution:
6TB of compressed JSON in WARC packaging
Looking at hosting options
CC0 licence

Working with the Oxford Internet Institute http://www.oii.ox.ac.uk/research/projects/?id=88
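To illustrate how links and anchor text might be pulled out of a WAT record, here is a small Python sketch. It assumes the JSON payload follows the commonly documented WAT layout (Envelope, Payload-Metadata, HTTP-Response-Metadata, HTML-Metadata, Links), and that the JSON has already been extracted from its WARC packaging. The sample record is hand-written for illustration and is not real archive data; iter_links is a hypothetical helper name.

```python
import json
from typing import Iterator, Tuple


def iter_links(wat_record_json: str) -> Iterator[Tuple[str, str]]:
    """Yield (url, anchor_text) pairs from one WAT JSON payload."""
    envelope = json.loads(wat_record_json).get("Envelope", {})
    # Assumed nesting; missing levels simply produce no links.
    html_meta = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    for link in html_meta.get("Links", []):
        url = link.get("url")
        if url:
            # Non-anchor links (images, scripts) usually carry no text.
            yield url, link.get("text", "")


# Hand-written illustrative record, not taken from the dataset.
sample_record = json.dumps({
    "Envelope": {
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Links": [
                        {"path": "A@/href", "url": "http://example.org/about", "text": "About us"},
                        {"path": "IMG@/src", "url": "http://example.org/logo.png"},
                    ]
                }
            }
        }
    }
})

extracted = list(iter_links(sample_record))
```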

DATA SERVICES

Full-text Search: Prime Ministers

http://www.webarchive.org.uk/ukwa

Analytical Access to the Dark Domain Archive (AADDA)
http://domaindarkarchive.blogspot.co.uk/

http://www.webarchive.org.uk/aadda-discovery/browse

GLOBAL INTEGRATION

Memento

[Mementos Screenshot]

http://www.webarchive.org.uk/mementos/search
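For readers unfamiliar with Memento, the core idea is datetime negotiation: a client asks a TimeGate for the archived copy of a URL closest to a given moment by sending an Accept-Datetime header. The Python sketch below only builds such a request and does not send it. The TimeGate address is a hypothetical placeholder, and the convention of appending the original URL to the TimeGate URL is an assumption that holds for many, but not necessarily all, Memento services.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Tuple


def timegate_request(
    timegate: str, target_url: str, when: datetime
) -> Tuple[str, Dict[str, str]]:
    """Build the URL and headers for a Memento TimeGate lookup.

    `when` must be a timezone-aware datetime.
    """
    # Accept-Datetime uses the HTTP date format, expressed in GMT.
    when_utc = when.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}
    # Assumption: the TimeGate takes the original URL as a path suffix.
    return timegate + target_url, headers


# Hypothetical TimeGate address, used purely for illustration.
url, headers = timegate_request(
    "http://timegate.example.org/timegate/",
    "http://www.bl.uk/",
    datetime(2013, 4, 25, tzinfo=timezone.utc),
)
```

A real client would issue a GET or HEAD request with these headers and follow the redirect to the selected memento.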

Integrated, Global Discovery

Exploit existing APIs
Use item hash values via Wayback to compare our archives or validate independent archives
Expose more information alongside the Memento API
Improve prototype Memento browser plugin(s)

Develop new APIs
Expose link information via Wayback and/or Memento
Lookup by fields other than host and timestamp, e.g. in-links, hash values
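As a rough illustration of the hash-comparison idea, the Python sketch below computes a payload digest in the "sha1:" plus Base32 style commonly seen in WARC records, then compares the digests two archives hold for the same URLs. The function names and the dictionary-based input are assumptions for illustration only. How digests would actually be obtained from a Wayback instance is not specified in the slides and is left out.

```python
import base64
import hashlib
from typing import Dict, Set, Tuple


def payload_digest(payload: bytes) -> str:
    """SHA-1 of the payload, Base32-encoded, with a 'sha1:' prefix."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def compare_captures(
    ours: Dict[str, str], theirs: Dict[str, str]
) -> Tuple[Set[str], Set[str]]:
    """Return (matching, differing) URLs among those held by both archives."""
    shared = ours.keys() & theirs.keys()
    matching = {url for url in shared if ours[url] == theirs[url]}
    return matching, shared - matching


# Illustrative, made-up captures: URL -> digest of the archived payload.
archive_a = {
    "http://example.org/": payload_digest(b"<html>same</html>"),
    "http://example.org/news": payload_digest(b"<html>version A</html>"),
}
archive_b = {
    "http://example.org/": payload_digest(b"<html>same</html>"),
    "http://example.org/news": payload_digest(b"<html>version B</html>"),
}

matching, differing = compare_captures(archive_a, archive_b)
```

Matching digests suggest two archives hold byte-identical payloads; differing digests may reflect genuine content change between crawls rather than an error, so capture timestamps would need to be considered alongside the hashes.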

INSIDE-OUT ARCHIVES

Summary: Inside-Out Archives

CC0 open datasets

Analytical access services

Richer APIs

Integrated, contextualized, global discovery
