Temporal Spread In Archived Composite Resources (work in progress)

WADL 2013, July 25–26, 2013, Indianapolis, Indiana USA. Temporal Spread In Archived Composite Resources (work in progress). Scott G. Ainsworth, Michael L. Nelson, Old Dominion University Computer Science. Contents: Motivation, Related work, Preliminary work, Temporal Spread, Future work.

TEMPORAL SPREAD IN ARCHIVED COMPOSITE RESOURCES (WORK IN PROGRESS)

SCOTT G. AINSWORTH, MICHAEL L. NELSON
OLD DOMINION UNIVERSITY, COMPUTER SCIENCE

WADL 2013, JULY 25–26, 2013

INDIANAPOLIS, INDIANA USA

Joint Conference on Digital Libraries (JCDL) 2013

Scott G. Ainsworth • Michael L. Nelson

CONTENTS

• Motivation
• Related work
• Preliminary work
• Temporal Spread
• Future work
• Conclusion

7/26/13

A FABLE FROM WAYBACK

TEMPORAL SPREAD

[Figure: page captured 2005-05-14 01:36:08; callouts on embedded resources: +9 days, +18 days, +18 days, +7 months, +2.1 years]

QUESTIONS

• How much temporal spread exists in composite mementos?
• How can temporal spread be minimized?
• What factors contribute, positively or negatively, to spread?
• Does combining multiple archives produce better results?
• Would users with differing goals benefit from different minimization policies and heuristics?
• How can temporal coherence be displayed to users simply?

RELATED WORK

Control Crawl Data Quality, Future collections
• Spaniol et al. – crawling strategy
• Denev et al. – change rates by MIME type and depth
• Ben Saad et al. – metadata from crawl used to select best results from archive

Our Focus: Existing Data Quality
• Existing collections
• Datetime selection policies

RELATED WORK

Use Patterns
• AlNoamany et al. – Archive Access Patterns
  • Humans vs. Robots
  • Dip, dive, slide, & skim

Identifying Duplicates
• Simple identity – images, other binary formats
  • Direct comparison
  • Hash comparison
• HTML, CSS (text)
  • Shingling, Jaccard distances, etc.
  • SimHash shows the most promise
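
The two duplicate-detection families listed above can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the choice of SHA-256, and the 3-word shingle size are all assumptions.

```python
import hashlib


def same_bytes(a: bytes, b: bytes) -> bool:
    """Hash comparison for binary resources: equal SHA-256 digests are treated as identical content."""
    return hashlib.sha256(a).digest() == hashlib.sha256(b).digest()


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles (overlapping word windows) in the text."""
    words = text.split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_distance(a: str, b: str, k: int = 3) -> float:
    """1 - |A ∩ B| / |A ∪ B| computed over the two shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    return 1.0 - len(sa & sb) / len(sa | sb)
```

A distance of 0.0 means the shingle sets are identical; values near 1.0 mean the texts share almost no word windows.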

RELATED WORK – MEMENTO*

• HTTP extension for datetime negotiation

Request

GET <timegate>/http://www.cs.odu.edu/ HTTP/1.1
…
Accept-Datetime: Tue, 10 May 2005 11:21:00 GMT
…

Response

HTTP/1.1 200 OK
…
Memento-Datetime: Sat, 14 May 2005 01:36:08 GMT
…

*https://datatracker.ietf.org/doc/draft-vandesompel-memento/
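
As a rough illustration of the exchange above, the sketch below formats an Accept-Datetime header and measures how far the returned Memento-Datetime lies from the requested datetime. It sends no request; the helper names are hypothetical, and only the two header names and the datetimes come from the slide.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_headers(target: datetime) -> dict:
    """Request headers asking for the memento closest to `target` (must be UTC-aware)."""
    return {"Accept-Datetime": format_datetime(target, usegmt=True)}


def negotiation_gap(target: datetime, memento_datetime_header: str) -> timedelta:
    """Absolute gap between the requested datetime and the Memento-Datetime returned."""
    return abs(parsedate_to_datetime(memento_datetime_header) - target)


target = datetime(2005, 5, 10, 11, 21, 0, tzinfo=timezone.utc)
headers = accept_datetime_headers(target)
gap = negotiation_gap(target, "Sat, 14 May 2005 01:36:08 GMT")
```

Here the archive's closest capture is a little over three and a half days after the requested datetime.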

CONTENTS

• Motivation
• Related work
• Preliminary work
  • How much of the Web is archived
  • Temporal Drift
• Temporal Spread
• Future work
• Conclusion

HOW MUCH IS ARCHIVED?

• 35 – 90%: at least one archived copy
• 17 – 49%: 2 – 5 copies
• 1 – 8%: 6 – 10 copies
• 8 – 63%: > 10 copies

JCDL’11

[Chart legend: Internet Archive, Search Engine, Other]

TEMPORAL DRIFT

Comparing two policies
• Sliding – target datetime changes
• Sticky – target datetime held steady

SLIDING TARGET

Datetime shown at each step:
• 2005-05-14 01:36:08
• 2005-04-22 00:17:52
• 2005-03-31 09:16:10

TEMPORAL DRIFT

WHAT WE EXPECTED: 2005-05-14 @ 01:36:08

WHAT WE GOT: 2005-03-31 @ 09:16:10

STICKY TARGET

What if the target is held steady?

(Enabled by Memento API)

STICKY TARGET

MementoFox Extension: 2005-05-14

Datetime shown at each step:
• 2005-05-14 01:36:08
• 2005-04-22 00:17:52
• 2005-05-14 01:36:08

DRIFT COMPARISON

Page           Sliding Datetime      Sliding Drift            Sticky Datetime       Sticky Drift
CS Home        2005-05-14 01:36:08   –                        2005-05-14 01:36:08   –
Science Home   2005-04-22 00:17:52   22.1 days                2005-04-22 00:17:52   22.1 days
CS Home        2005-03-31 09:16:10   43.7 days (+21.6 days)   2005-05-14 01:36:08   –
Mean                                 32.9 days                                      11.0 days
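
The drift figures in the comparison can be reproduced with a few lines of arithmetic. The sketch below assumes drift is the absolute distance between each step's Memento-Datetime and the original target (2005-05-14 01:36:08), and that the means are taken over the two steps after the starting page; both assumptions are inferred from the numbers rather than stated on the slide.

```python
from datetime import datetime


def drift_days(target: datetime, memento: datetime) -> float:
    """Absolute distance from the target datetime, in days."""
    return abs((memento - target).total_seconds()) / 86400


target = datetime(2005, 5, 14, 1, 36, 8)

# Memento-Datetimes observed at steps 2 and 3 under each policy
sliding = [datetime(2005, 4, 22, 0, 17, 52), datetime(2005, 3, 31, 9, 16, 10)]
sticky = [datetime(2005, 4, 22, 0, 17, 52), datetime(2005, 5, 14, 1, 36, 8)]

sliding_drift = [drift_days(target, m) for m in sliding]
sticky_drift = [drift_days(target, m) for m in sticky]
mean_sliding = sum(sliding_drift) / len(sliding_drift)
mean_sticky = sum(sticky_drift) / len(sticky_drift)
```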

MEDIAN DRIFT BY STEP

[Chart: median drift (months) by step number; series: Sliding, Sticky. Source: JCDL’13]

TEMPORAL SPREAD

COMPOSITE MEMENTO PRESENTATION STRUCTURE

TEMPORAL SPREAD

[Figure: page captured 2005-05-14 01:36:08; callouts on embedded resources: +9 days, +18 days, +18 days, +7 months, +2.1 years]

EMBEDDED RESOURCES

Resource                          Memento-Datetime      Delta
http://www.cs.odu.edu             2005-05-14 01:36:08
mm_menu.js                        2005-05-23 02:39:12   9.0 d
style.css                         2005-05-23 02:39:39   9.0 d
gfx-logo-odu-crown.gif            2005-05-23 02:39:39   9.0 d
ddmenu_ddown.js                   2005-05-23 02:39:43   9.0 d
university.js                     2005-05-23 02:39:56   9.0 d
rmenu_1st_about.png               2005-06-01 13:40:25   18.5 d
rmenu_bottom_229.gif              2005-06-01 14:07:29   18.5 d
shadow-bl.gif                     2005-06-01 14:55:53   18.6 d
ecsbdg.jpg                        2005-06-01 14:56:17   18.6 d
shadow-br.gif                     2005-06-01 15:18:18   18.6 d
gfx-btn-go-dblue.gif              2005-06-01 15:34:19   18.6 d
shadow-tr.gif                     2005-06-01 15:55:57   18.6 d
header-right1.gif                 2005-06-01 16:06:16   18.6 d
spacer.gif                        2005-06-01 16:23:10   18.6 d
jimcheng.gif                      2005-06-01 16:37:39   18.6 d
jsmith.gif                        2005-06-01 16:58:50   18.6 d
rmenu_1st_featured_alumni.png     2005-06-01 21:21:45   18.8 d
hmenu_college_...-new.png         2005-12-21 20:14:25   7.3 mo
rmenu_1st_upcoming_news.png       2005-12-21 20:15:14   7.3 mo
rmenu_1st_upcoming_events.png     2005-12-21 21:01:12   7.3 mo
lmenu_1st_resources.png           2005-12-28 17:47:41   7.5 mo
bullet_blue_triangle.gif          2005-12-28 19:43:48   7.5 mo
logo-cs.gif                       2005-12-28 19:54:29   7.5 mo
rmenu_1st_featured_student.png    2007-06-12 02:36:07   2.1 years
shadow-b.gif                      2007-06-21 02:35:17   2.1 years
shadow-r.gif                      404 Not Found

Embedded Resources   26
Mean Delta           125.9 days
Standard Deviation   207.7 days
Spread               2.1 years
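
To make the delta and spread columns concrete, the sketch below recomputes them for a handful of rows from the table. It assumes delta is the absolute distance from the root's Memento-Datetime and spread is the gap between the earliest and latest Memento-Datetime in the composite; because only four of the 26 resources are included, the mean here will not match the slide's 125.9 days.

```python
from datetime import datetime
from statistics import mean

root = datetime(2005, 5, 14, 1, 36, 8)  # http://www.cs.odu.edu

# A subset of the embedded resources, copied from the table
embedded = {
    "mm_menu.js": datetime(2005, 5, 23, 2, 39, 12),
    "spacer.gif": datetime(2005, 6, 1, 16, 23, 10),
    "logo-cs.gif": datetime(2005, 12, 28, 19, 54, 29),
    "shadow-b.gif": datetime(2007, 6, 21, 2, 35, 17),
}

# Delta: distance of each embedded memento from the root, in days
deltas = {name: abs((dt - root).total_seconds()) / 86400 for name, dt in embedded.items()}

# Spread: earliest to latest Memento-Datetime across root and embedded resources
all_datetimes = [root, *embedded.values()]
spread_days = (max(all_datetimes) - min(all_datetimes)).total_seconds() / 86400
spread_years = spread_days / 365.25
subset_mean_delta = mean(deltas.values())
```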

REPRESENTING SPREAD

[Figure: composite memento alongside its temporal spread chart. Legend: Root, Embedded, Diff. Domain, Reused]

TEMPORAL SPREAD – ODU CS

FIRST EXPERIMENT

• 1,000 URIs from DMOZ (Open Directory)
• Download all timemaps
• Download all composite mementos
• Download all embedded resources
• Single and Multiple Archives
• Four Heuristics
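
The timemap step above might look something like the following minimal sketch, which assumes timemaps arrive in the link-format serialization used by Memento (entries of the form `<uri>; rel="..."; datetime="..."`). The regular expressions are deliberately simple, and the archive host `archive.example` in the sample is hypothetical.

```python
import re
from email.utils import parsedate_to_datetime

ENTRY = re.compile(r'<([^>]+)>\s*;([^<]*)')  # one "<uri>; params" entry
ATTR = re.compile(r'(\w+)="([^"]*)"')        # one key="value" parameter


def parse_timemap(text: str) -> list:
    """Return (datetime, URI-M) pairs for every memento entry, oldest first."""
    mementos = []
    for uri, params in ENTRY.findall(text):
        attrs = dict(ATTR.findall(params))
        rels = attrs.get("rel", "").split()
        if "memento" in rels and "datetime" in attrs:
            mementos.append((parsedate_to_datetime(attrs["datetime"]), uri))
    return sorted(mementos)


# Hypothetical two-memento timemap for the ODU CS home page
SAMPLE = (
    '<http://www.cs.odu.edu/>; rel="original",\n'
    '<http://archive.example/web/20050514013608/http://www.cs.odu.edu/>; '
    'rel="memento"; datetime="Sat, 14 May 2005 01:36:08 GMT",\n'
    '<http://archive.example/web/20050331091610/http://www.cs.odu.edu/>; '
    'rel="first memento"; datetime="Thu, 31 Mar 2005 09:16:10 GMT"'
)

timemap = parse_timemap(SAMPLE)
```

Entries without a `datetime` attribute, such as the `original` link, are skipped.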

PRELIMINARY RESULTS

Count       Description                          Percent
1,000       Root URI-Rs
910         Root timemaps                        91%
87,847      Root URI-Ms in timemaps
96.5        URI-Ms per Root URI-R
85,570      Root mementos downloaded             97%
1,488,420   Embedded URI-Rs
17.4        Embedded URI-Rs per Root memento
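
The derived rows can be checked against the raw counts. In this sketch the variable names are my own, and the choice of denominators is inferred from the numbers: the 96.5 figure matches only when dividing by the 910 roots that have timemaps, not by all 1,000.

```python
root_uri_rs = 1_000
root_timemaps = 910
root_uri_ms = 87_847
roots_downloaded = 85_570
embedded_uri_rs = 1_488_420

timemap_pct = 100 * root_timemaps / root_uri_rs        # share of roots with a timemap
uri_ms_per_root = root_uri_ms / root_timemaps          # mementos per root with a timemap
downloaded_pct = 100 * roots_downloaded / root_uri_ms  # share of listed mementos downloaded
embedded_per_root = embedded_uri_rs / roots_downloaded # embedded resources per root memento
```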

SINGLE/MULTI & HEURISTICS

Description                    Minimize Distance,   Minimize Distance,   3-Month Window,
                               Single Archive       Multi-Archive        Multi-Archive
Embedded URI-Rs                1,488,440            1,488,420            1,447,351
Embedded URI-Ms in timemaps    1,169,787            1,186,456            500,541
URI-M/Embedded URI-R           0.79                 0.80                 0.35
% Complete                     73.8%                75.4%                33.8%
Mean spread                    200.2                200.1                15.1
Standard Deviation             219.2                219.9                14.3
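
One way to read the three column headings is as selection rules applied to each embedded resource's candidate mementos. The sketch below is an interpretation, not the experiment's code: it assumes the window is measured as an absolute distance of 90 days from the root's datetime, and that multi-archive simply pools candidates from every archive.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Optional


def minimize_distance(target: datetime, candidates: Iterable[datetime]) -> Optional[datetime]:
    """Pick the memento datetime closest to the target, past or future."""
    return min(candidates, key=lambda m: abs(m - target), default=None)


def within_window(target: datetime, candidates: Iterable[datetime],
                  window: timedelta = timedelta(days=90)) -> Optional[datetime]:
    """Closest memento, or None (resource left missing) if it falls outside the window."""
    best = minimize_distance(target, candidates)
    if best is None or abs(best - target) > window:
        return None
    return best


def merge_archives(*timemaps: Iterable[datetime]) -> List[datetime]:
    """Multi-archive: pool the memento datetimes from every archive's timemap."""
    return [m for timemap in timemaps for m in timemap]
```

Under this reading, the window rule trades completeness for spread: resources whose closest memento is far from the root are dropped rather than shown, which is consistent with the lower % Complete and lower mean spread in the third column.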

TEMPORAL COHERENCE

Cases illustrated (one figure per case):
• 1 Memento, Bracketed Root
• 1 Memento, Root Not Bracketed
• 1 Memento, No Last-Modified
• 1 Memento, Before Root
• 2 Mementos, Root Not Bracketed
• 2 Mementos, Use Content – Similarity
• 2 Mementos, Contents Equal or Equivalent
• 2 Mementos, Contents Not Equal or Equivalent
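
The single-memento case names suggest a classification based on where the root's datetime falls relative to an embedded memento's Last-Modified and Memento-Datetime. The sketch below is one plausible reading of those names, not a definition taken from the slides, which show only figures; the function and its ordering of checks are assumptions.

```python
from datetime import datetime
from typing import Optional


def classify_single(root: datetime, memento: datetime,
                    last_modified: Optional[datetime]) -> str:
    """Name the single-memento case for one embedded resource (assumed semantics)."""
    if memento < root:
        return "before root"          # captured before the root was
    if last_modified is None:
        return "no Last-Modified"     # cannot tell how long the content existed
    if last_modified <= root:
        return "bracketed root"       # Last-Modified .. Memento-Datetime spans the root
    return "root not bracketed"       # content changed after the root was captured
```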

CURRENT EXPERIMENT

• 4,000 URIs from JCDL’11 “How Much…” paper
• 1 URI/month vice all
• Temporal coherence patterns
• Target WSDM 2013

FUTURE WORK

Timemaps, Redirection, Missing Mementos
• Timemaps only tell part of the story
• URI-R redirection (302 from source)
• URI-M redirection (Archive action)
• Mementos in timemaps but not accessible
• Policies must consider user needs
  • Leave it missing
  • Show “best” substitute

FUTURE WORK

Similarity & Duplication
• Deltas are currently | root – embedded |
• If bracketing mementos are identical, should delta be zero?
• HTML is usually modified by the archive
  • Can’t check for equality
  • Shingling? SimHash?

[Figure: timeline marked –30d, 0, +30d]
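
Since archive-rewritten HTML rules out byte equality, a fingerprint comparison is one way the "should delta be zero?" question could be explored. The following is a textbook-style SimHash sketch, not the authors' method: the tokenizer, the use of MD5 as the per-token hash, and the 3-bit threshold are all assumptions chosen for illustration.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """SimHash fingerprint over lowercase word tokens (bits: multiple of 8, at most 128)."""
    counts = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is set when more tokens voted 1 than 0
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def adjusted_delta_days(delta_days: float, before_html: str, after_html: str,
                        max_bits: int = 3) -> float:
    """Treat delta as zero when the two bracketing captures look equivalent."""
    if hamming(simhash(before_html), simhash(after_html)) <= max_bits:
        return 0.0
    return delta_days
```

A small Hamming distance suggests the two captures differ only in minor ways, such as archive-inserted banners or rewritten links; the right threshold would need to be tuned on real data.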

FUTURE WORK

Communicating Status

FUTURE WORK

Policies & Heuristics
• Current Spread Heuristics
  • Minimize distance
  • Past only
  • Past preferred
  • Near or within distance
  • Single vs. multi-archive
• Refine to meet user expectations
  • Speed (minimize time)
  • Accuracy (minimize temporal error)
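
The "past only" and "past preferred" heuristics are named but not defined on the slide. The sketch below shows one plausible interpretation, with hypothetical function names: past only never selects a memento newer than the target, while past preferred falls back to the nearest future memento when no past one exists.

```python
from datetime import datetime
from typing import Optional, Sequence


def past_only(target: datetime, candidates: Sequence[datetime]) -> Optional[datetime]:
    """Latest memento at or before the target; None if every candidate is newer."""
    past = [m for m in candidates if m <= target]
    return max(past, default=None)


def past_preferred(target: datetime, candidates: Sequence[datetime]) -> Optional[datetime]:
    """Use the past-only choice when one exists, otherwise the nearest future memento."""
    best = past_only(target, candidates)
    if best is not None:
        return best
    # No past candidates, so the earliest remaining one is the nearest future memento
    return min(candidates, default=None)
```

Which rule suits a user depends on the goal: past only avoids showing content that did not yet exist at the target datetime, at the cost of more missing resources.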

CONCLUSION

Extensive research on improving acquisition exists

Best use of existing collections needs study

We are looking at
• Characterizing existing holdings
• Characterizing temporal coherence
• Policies that minimize impact of temporal incoherence
• Visualizations of temporal coherence

MY QUESTIONS

[Figure label: Coherent]

[Figure label: Violation]

What do these mean to users? [Figure callouts (1) to (4)]

What does this mean to users?