Seeing In The Dark: Discovery and data-mining of restricted web archives
Andrew Jackson, Web Archiving Technical Lead
IIPC GENERAL ASSEMBLY | 25-04-2013 | LJUBLJANA


Page 1: Seeing In The Dark: Discovery and data-mining of restricted web archives

Seeing In The Dark: Discovery and data-mining of restricted web archives

Andrew Jackson, Web Archiving Technical Lead

IIPC GENERAL ASSEMBLY | 25-04-2013 | LJUBLJANA

Page 2

RESTRICTED ARCHIVES

Page 3

Discovery in the dark

Page 4

The JISC UK Web Domain Dataset

Internet Archive UK Domain Dataset, 1996-2010
Millions of websites
2.5 billion resources
> 35TB

No direct access
No bulk downloads
Open metadata datasets
Analytical access

Page 5

OPEN DATASETS

Page 6

GeoIndexes – Discovering Local Web History

http://data.webarchive.org.uk/opendata/ukwa.ds.2/geo/

Page 7

Format Profiles – HTML Version Analysis

http://data.webarchive.org.uk/opendata/ukwa.ds.2/fmt/

Page 8

Top-Level Linkage Analysis

http://www.webarchive.org.uk/ukwa/visualisation/ukwa.ds.2/linkage

Page 9

Host Linkage Dataset

1996|appserver.ed.ac.uk|portico.bl.uk 1
1996|art-www.acorn.co.uk|portico.bl.uk 1
1996|astra.ich.ucl.ac.uk|portico.bl.uk 1
1996|back.niss.ac.uk|portico.bl.uk 1
1996|beta.bids.ac.uk|portico.bl.uk 2
1996|blaiseweb.bl.uk|blaiseweb.bl.uk 4
1996|bonsai.iielr.dmu.ac.uk|portico.bl.uk 4
1996|dominica.lshtm.ac.uk|portico.bl.uk 1
1996|dux.dundee.ac.uk|portico.bl.uk 2
1996|eisv01.lancs.ac.uk|portico.bl.uk 1

http://www.webarchive.org.uk/datasets/ukwa.ds.2/linkage/
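
The sample rows on this slide look like a year, a source host and a target host joined by `|`, followed by a link count. As a minimal sketch under that reading, the Python below parses such rows and totals the in-links per target host. The field meanings, the whitespace before the count, and the function names are assumptions for illustration, not part of the published dataset documentation.

```python
from collections import Counter

# The ten sample rows shown on the slide (assumed layout:
# year|source host|target host, then whitespace, then a link count).
SAMPLE = """\
1996|appserver.ed.ac.uk|portico.bl.uk 1
1996|art-www.acorn.co.uk|portico.bl.uk 1
1996|astra.ich.ucl.ac.uk|portico.bl.uk 1
1996|back.niss.ac.uk|portico.bl.uk 1
1996|beta.bids.ac.uk|portico.bl.uk 2
1996|blaiseweb.bl.uk|blaiseweb.bl.uk 4
1996|bonsai.iielr.dmu.ac.uk|portico.bl.uk 4
1996|dominica.lshtm.ac.uk|portico.bl.uk 1
1996|dux.dundee.ac.uk|portico.bl.uk 2
1996|eisv01.lancs.ac.uk|portico.bl.uk 1
"""


def parse_linkage_line(line):
    """Split one row into (year, source, target, count)."""
    # rsplit on any whitespace so a tab or a space before the count both work.
    key, count = line.rsplit(None, 1)
    year, source, target = key.split("|")
    return int(year), source, target, int(count)


def inlinks_by_target(lines):
    """Sum the link counts pointing at each target host."""
    totals = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _year, _source, target, count = parse_linkage_line(line)
        totals[target] += count
    return totals


if __name__ == "__main__":
    for host, n in inlinks_by_target(SAMPLE.splitlines()).most_common():
        print(host, n)
```

The same loop could be keyed on `(year, target)` instead of `target` to follow how a host's in-links change over time.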

Page 10

WATs

Web Archive Transformation (WAT)
https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Transformation+(WAT)+Specification,+Utilities,+and+Usage+Overview

Contains links and anchor text.

Size & distribution:
  6TB of compressed JSON in WARC packaging
  Looking at hosting options
  CC0 licence

Working with the Oxford Internet Institute http://www.oii.ox.ac.uk/research/projects/?id=88
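
To illustrate what "links and anchor text" in a WAT record might look like in practice, here is a minimal Python sketch that pulls anchor links out of one WAT JSON payload. The field names (`Envelope`, `Payload-Metadata`, `HTML-Metadata`, `Links`, `path`, `url`, `text`) follow one reading of the WAT layout and should be checked against the specification linked above. The sample record is made up for illustration and is not taken from the dataset.

```python
import json

# Hypothetical, hand-written WAT-style payload (not real archive data).
SAMPLE_WAT_PAYLOAD = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "http://portico.bl.uk/"},
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Links": [
                        {"path": "A@/href",
                         "url": "http://blaiseweb.bl.uk/",
                         "text": "Blaise"},
                        {"path": "IMG@/src", "url": "logo.gif"},
                    ]
                }
            }
        },
    }
})


def extract_links(wat_json_text):
    """Return (source URL, target URL, anchor text) for each <a href> link."""
    record = json.loads(wat_json_text)
    envelope = record.get("Envelope", {})
    source = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    html_meta = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    links = []
    for link in html_meta.get("Links", []):
        # Keep only anchor links; images, scripts etc. use other path values.
        if link.get("path") == "A@/href" and "url" in link:
            links.append((source, link["url"], link.get("text", "")))
    return links


if __name__ == "__main__":
    for source, target, text in extract_links(SAMPLE_WAT_PAYLOAD):
        print(source, "->", target, repr(text))
```

Reading the payloads out of the WARC packaging is left out here; any WARC reader that yields each record's body as text could feed `extract_links`.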

Page 11

DATA SERVICES

Page 12

Full-text Search: Prime Ministers

http://www.webarchive.org.uk/ukwa

Page 13

Analytical Access to the Domain Dark Archive (AADDA)
http://domaindarkarchive.blogspot.co.uk/

http://

http://www.webarchive.org.uk/aadda-discovery/browse

Page 14

GLOBAL INTEGRATION

Page 15

Memento

[Mementos Screenshot]

http://www.webarchive.org.uk/mementos/search

Page 16

Integrated, Global Discovery

Exploit existing APIs
  Use item hash values via Wayback to compare our archives or validate independent archives
  Expose more information alongside the Memento API
  Improve prototype Memento browser plugin(s)

Develop new APIs
  Expose link information via Wayback and/or Memento
  Lookup by fields other than host and timestamp, e.g.
    In-links
    Hash values
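
As a rough illustration of comparing archives by hash value, the Python sketch below indexes two lists of captures by content digest and reports the digests both archives hold, regardless of URL or capture time. It assumes the captures have already been obtained as `(url, timestamp, digest)` tuples. How those would be fetched from Wayback is not shown, and the function names, digest strings and sample captures are hypothetical.

```python
from collections import defaultdict


def index_by_digest(captures):
    """Map each content digest to the (url, timestamp) captures that carry it."""
    index = defaultdict(list)
    for url, timestamp, digest in captures:
        index[digest].append((url, timestamp))
    return index


def shared_content(ours, theirs):
    """Return {digest: (our captures, their captures)} for digests held by both."""
    ours_idx = index_by_digest(ours)
    theirs_idx = index_by_digest(theirs)
    common = sorted(ours_idx.keys() & theirs_idx.keys())
    return {d: (ours_idx[d], theirs_idx[d]) for d in common}


# Made-up captures for illustration only; the digests are placeholders.
OURS = [
    ("http://portico.bl.uk/", "19961101000000", "sha1:AAAA"),
    ("http://blaiseweb.bl.uk/", "19961205000000", "sha1:BBBB"),
]
THEIRS = [
    ("http://portico.bl.uk/", "19961103120000", "sha1:AAAA"),
    ("http://portico.bl.uk/", "19970101000000", "sha1:CCCC"),
]

if __name__ == "__main__":
    for digest, (mine, yours) in shared_content(OURS, THEIRS).items():
        print(digest, mine, yours)
```

A digest-keyed index like this is also one way a lookup "by hash value" could be served, as opposed to the usual host-and-timestamp lookup.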

Page 17

INSIDE-OUT ARCHIVES

Page 18

Summary: Inside-Out Archives

CC0 open datasets

Analytical access services

Richer APIs

Integrated, contextualized, global discovery