TEMPORAL SPREAD IN ARCHIVED COMPOSITE RESOURCES (WORK IN PROGRESS)
SCOTT G. AINSWORTH
MICHAEL L. NELSON
OLD DOMINION UNIVERSITY
COMPUTER SCIENCE
WADL 2013
JULY 25–26, 2013
INDIANAPOLIS, INDIANA USA
Joint Conference on Digital Libraries (JCDL) 2013
Scott G. Ainsworth • Michael L. Nelson
CONTENTS
Motivation
Related work
Preliminary work
Temporal Spread
Future work
Conclusion
7/26/13
A FABLE FROM WAYBACK
TEMPORAL SPREAD
[Figure: archived page annotated with embedded-resource offsets from the root memento at 2005-05-14 01:36:08: +9 days, +18 days, +18 days, +7 months, +2.1 years]
QUESTIONS
• How much temporal spread exists in composite mementos?
• How can temporal spread be minimized?
• What factors contribute, positively or negatively, to spread?
• Does combining multiple archives produce better results?
• Would users with differing goals benefit from different minimization policies and heuristics?
• How can temporal coherence be displayed to users—simply?
RELATED WORK
Control Crawl Data Quality, Future Collections
• Spaniol et al. – crawling strategy
• Denev et al. – change rates by MIME type and depth
• Ben Saad et al. – metadata from crawl used to select best results from archive
Our Focus: Existing Data Quality
• Existing collections
• Datetime selection policies
RELATED WORK
Use Patterns
• AlNoamany et al. – Archive Access Patterns
  • Humans vs. Robots
  • Dip, dive, slide, & skim
Identifying Duplicates
• Simple identity – images, other binary formats
  • Direct comparison
  • Hash comparison
• HTML, CSS (text)
  • Shingling, Jaccard distances, etc.
  • SimHash shows the most promise
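For text formats, shingling with Jaccard distance can be sketched as follows. This is a minimal illustration, not the method used in the cited work: the shingle size `k = 3`, word-level tokenization, and the two sample strings are all assumptions chosen for the example.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance: 1 - |A & B| / |A | B| (0.0 means identical sets)."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


# Hypothetical page text before and after a one-word edit.
old = "department of computer science old dominion university"
new = "department of computer science at old dominion university"
distance = jaccard_distance(shingles(old), shingles(new))
```

A single inserted word changes every shingle that overlaps it, so small edits yield a noticeably nonzero distance; any threshold for "equivalent" would have to be tuned on real mementos.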
RELATED WORK – MEMENTO*
• HTTP extension for datetime negotiation
Request
GET <timegate>/http://www.cs.odu.edu/ HTTP/1.1
…
Accept-Datetime: Sat, 10 May 2005 11:21:00 GMT
…
Response
HTTP/1.1 200 OK
…
Memento-Datetime: Sat, 14 May 2005 01:36:08 GMT
…
*https://datatracker.ietf.org/doc/draft-vandesompel-memento/
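The two Memento headers use HTTP-date strings, which Python's standard library can format and parse. The sketch below makes no network request; the helper names `accept_datetime_header` and `memento_datetime` are hypothetical, and the response header is hard-coded from the slide.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_header(target: datetime) -> dict[str, str]:
    """Build the Accept-Datetime request header for a target datetime."""
    as_utc = target.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(as_utc, usegmt=True)}


def memento_datetime(headers: dict[str, str]) -> datetime:
    """Parse the Memento-Datetime response header into an aware datetime."""
    return parsedate_to_datetime(headers["Memento-Datetime"])


# Illustrative target; a real client would send this header to a TimeGate.
target = datetime(2005, 5, 14, 1, 36, 8, tzinfo=timezone.utc)
request_headers = accept_datetime_header(target)

# Response header value as shown on the slide.
response_headers = {"Memento-Datetime": "Sat, 14 May 2005 01:36:08 GMT"}
captured = memento_datetime(response_headers)
```

The difference between the requested datetime and `captured` is what the later slides measure as drift and spread.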
CONTENTS
Motivation
Related work
Preliminary work
• How much of the Web is archived
• Temporal Drift
Temporal Spread
Future work
Conclusion
HOW MUCH IS ARCHIVED?
35 – 90%   At least one archived copy
17 – 49%   2 – 5 copies
1 – 8%     6 – 10 copies
8 – 63%    > 10 copies
Source: JCDL’11
Chart legend: Internet Archive, Search Engine, Other
TEMPORAL DRIFT
Comparing two policies
• Sliding – target datetime changes
• Sticky – target datetime held steady
SLIDING TARGET
2005-05-14 01:36:08
SLIDING TARGET
2005-04-22 00:17:52
SLIDING TARGET
2005-03-31 09:16:10
TEMPORAL DRIFT
WHAT WE EXPECTED: 2005-05-14 @ 01:36:08
WHAT WE GOT: 2005-03-31 @ 09:16:10
STICKY TARGET
What if the target is held steady?
(Enabled by Memento API)
STICKY TARGET
[Screenshot labels: MementoFox Extension, 2005-05-14]
2005-05-14 01:36:08
STICKY TARGET
2005-04-22 00:17:52
STICKY TARGET
2005-05-14 01:36:08
DRIFT COMPARISON
               Sliding                            Sticky
Page           Datetime             Drift         Datetime             Drift
CS Home        2005-05-14 01:36:08  –             2005-05-14 01:36:08  –
Science Home   2005-04-22 00:17:52  22.1 days     2005-04-22 00:17:52  22.1 days
CS Home        2005-03-31 09:16:10  43.7 days     2005-05-14 01:36:08  –
                                    (+21.6 days)
Mean                                32.9 days                          11.0 days
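The drift figures in this comparison can be reproduced with plain datetime arithmetic. This is a minimal sketch that assumes drift is the absolute distance from the original target (2005-05-14 01:36:08) and that each mean covers the two pages visited after the first; both assumptions are inferred from the numbers, not stated on the slide.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
target = datetime.strptime("2005-05-14 01:36:08", FMT)

# Memento-Datetimes observed after the first page, per policy (from the table).
sliding = ["2005-04-22 00:17:52", "2005-03-31 09:16:10"]
sticky = ["2005-04-22 00:17:52", "2005-05-14 01:36:08"]


def drift_days(observed: str) -> float:
    """Absolute distance, in days, between a memento and the original target."""
    delta = abs(target - datetime.strptime(observed, FMT))
    return delta.total_seconds() / 86400


def mean_drift(observations: list[str]) -> float:
    """Mean drift in days over a sequence of observed Memento-Datetimes."""
    return sum(drift_days(o) for o in observations) / len(observations)


sliding_mean = mean_drift(sliding)  # about 32.9 days
sticky_mean = mean_drift(sticky)    # about 11.0 days
```

Under the sticky policy the return to CS Home lands on the original memento again, so its drift is zero and the mean is a third of the sliding mean.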
MEDIAN DRIFT BY STEP
[Chart: median drift (months) by step number, Sliding vs. Sticky. Source: JCDL’13]
TEMPORAL SPREAD
COMPOSITE MEMENTO
PRESENTATION STRUCTURE
TEMPORAL SPREAD
[Figure: archived page annotated with embedded-resource offsets from the root memento at 2005-05-14 01:36:08: +9 days, +18 days, +18 days, +7 months, +2.1 years]
EMBEDDED RESOURCES
Resource                         Memento-Datetime     Delta
http://www.cs.odu.edu            2005-05-14 01:36:08
mm_menu.js                       2005-05-23 02:39:12  9.0 d
style.css                        2005-05-23 02:39:39  9.0 d
gfx-logo-odu-crown.gif           2005-05-23 02:39:39  9.0 d
ddmenu_ddown.js                  2005-05-23 02:39:43  9.0 d
university.js                    2005-05-23 02:39:56  9.0 d
rmenu_1st_about.png              2005-06-01 13:40:25  18.5 d
rmenu_bottom_229.gif             2005-06-01 14:07:29  18.5 d
shadow-bl.gif                    2005-06-01 14:55:53  18.6 d
ecsbdg.jpg                       2005-06-01 14:56:17  18.6 d
shadow-br.gif                    2005-06-01 15:18:18  18.6 d
gfx-btn-go-dblue.gif             2005-06-01 15:34:19  18.6 d
shadow-tr.gif                    2005-06-01 15:55:57  18.6 d
header-right1.gif                2005-06-01 16:06:16  18.6 d
spacer.gif                       2005-06-01 16:23:10  18.6 d
jimcheng.gif                     2005-06-01 16:37:39  18.6 d
jsmith.gif                       2005-06-01 16:58:50  18.6 d
rmenu_1st_featured_alumni.png    2005-06-01 21:21:45  18.8 d
hmenu_college_...-new.png        2005-12-21 20:14:25  7.3 mo
rmenu_1st_upcoming_news.png      2005-12-21 20:15:14  7.3 mo
rmenu_1st_upcoming_events.png    2005-12-21 21:01:12  7.3 mo
lmenu_1st_resources.png          2005-12-28 17:47:41  7.5 mo
bullet_blue_triangle.gif         2005-12-28 19:43:48  7.5 mo
logo-cs.gif                      2005-12-28 19:54:29  7.5 mo
rmenu_1st_featured_student.png   2007-06-12 02:36:07  2.1 years
shadow-b.gif                     2007-06-21 02:35:17  2.1 years
shadow-r.gif                     404 Not Found

Embedded Resources    26
Mean Delta            125.9 days
Standard Deviation    207.7 days
Spread                2.1 years
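The per-resource deltas and the overall spread can be recomputed from the Memento-Datetimes. The sketch below uses only four rows from the table, so it does not reproduce the mean or standard deviation. It also assumes that spread is the distance between the earliest and latest capture across root and embedded resources, a definition inferred from the 2.1-year figure rather than stated on the slide.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
root = datetime.strptime("2005-05-14 01:36:08", FMT)

# A few rows copied from the table; the full table has 26 embedded resources.
embedded = {
    "mm_menu.js": "2005-05-23 02:39:12",
    "spacer.gif": "2005-06-01 16:23:10",
    "logo-cs.gif": "2005-12-28 19:54:29",
    "shadow-b.gif": "2007-06-21 02:35:17",
}
captured = {name: datetime.strptime(ts, FMT) for name, ts in embedded.items()}

# Delta: absolute distance of each embedded memento from the root, in days.
delta_days = {
    name: abs(dt - root).total_seconds() / 86400 for name, dt in captured.items()
}

# Spread (assumed definition): earliest to latest capture, root included.
all_times = [root, *captured.values()]
spread_days = (max(all_times) - min(all_times)).total_seconds() / 86400
spread_years = spread_days / 365.25
```

Because spread depends only on the two extreme captures, a single late memento such as `shadow-b.gif` sets it for the whole page.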
REPRESENTING SPREAD
COMPOSITE MEMENTO
TEMPORAL SPREAD CHART
Legend: Root, Embedded, Diff. Domain, Reused
TEMPORAL SPREAD – ODU CS
FIRST EXPERIMENT
• 1,000 URIs from DMOZ (Open Directory)
• Download all timemaps
• Download all composite mementos
• Download all embedded resources
• Single and Multiple Archives
• Four Heuristics
PRELIMINARY RESULTS
Count       Description                        Percent
1,000       Root URI-Rs
910         Root timemaps                      91%
87,847      Root URI-Ms in timemaps
96.5        URI-Ms per Root URI-R
85,570      Root mementos downloaded           97%
1,488,420   Embedded URI-Rs
17.4        Embedded URI-Rs per Root memento
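The derived rows in this table follow from the raw counts. The short check below shows which denominators reproduce the reported values. That "URI-Ms per Root URI-R" is computed over the 910 roots with timemaps (not all 1,000) is an inference from the arithmetic, not something the slide states.

```python
root_uri_rs = 1_000
root_timemaps = 910
root_uri_ms = 87_847
root_mementos_downloaded = 85_570
embedded_uri_rs = 1_488_420

# Share of root URI-Rs that had a timemap.
timemap_pct = 100 * root_timemaps / root_uri_rs

# URI-Ms per root, using only roots that had a timemap (assumed denominator).
uri_ms_per_root = root_uri_ms / root_timemaps

# Share of listed root mementos that were successfully downloaded.
downloaded_pct = 100 * root_mementos_downloaded / root_uri_ms

# Embedded URI-Rs per downloaded root memento.
embedded_per_root = embedded_uri_rs / root_mementos_downloaded
```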
SINGLE/MULTI & HEURISTICS
Description                   Minimize Distance,  Minimize Distance,  3-Month Window,
                              Single Archive      Multi-Archive       Multi-Archive
Embedded URI-Rs               1,488,440           1,488,420           1,447,351
Embedded URI-Ms in timemaps   1,169,787           1,186,456           500,541
URI-M/Embedded URI-R          0.79                0.80                0.35
% Complete                    73.8%               75.4%               33.8%
Mean spread                   200.2               200.1               15.1
Standard Deviation            219.2               219.9               14.3
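The trade-off between completeness and spread in this table can be illustrated with two selection functions. This is a sketch under stated assumptions: "Minimize Distance" is taken to mean picking the memento nearest the root in either direction, and "3-Month Window" is taken to mean accepting that memento only if it lies within 90 days of the root. The function names and the 90-day value are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def minimize_distance(root: datetime,
                      candidates: list[datetime]) -> Optional[datetime]:
    """Pick the candidate memento datetime closest to the root, past or future."""
    return min(candidates, key=lambda c: abs(c - root), default=None)


def within_window(root: datetime, candidates: list[datetime],
                  window: timedelta = timedelta(days=90)) -> Optional[datetime]:
    """Pick the closest candidate, but only if it lies within the window."""
    best = minimize_distance(root, candidates)
    if best is None or abs(best - root) > window:
        return None  # leave the resource missing rather than far out of date
    return best


root = datetime(2005, 5, 14, 1, 36, 8)
# Two captures of one embedded resource, both well after the root.
candidates = [datetime(2005, 12, 28, 19, 54, 29), datetime(2007, 6, 21, 2, 35, 17)]
loose = minimize_distance(root, candidates)   # the December 2005 capture
strict = within_window(root, candidates)      # None: nothing within 90 days
```

Rejecting distant mementos lowers spread but also lowers the share of embedded resources that can be shown, which matches the direction of the last column.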
TEMPORAL COHERENCE
1 Memento, Bracketed Root
TEMPORAL COHERENCE
1 Memento, Root Not Bracketed
TEMPORAL COHERENCE
1 Memento, No Last-Modified
TEMPORAL COHERENCE
1 Memento, Before Root
TEMPORAL COHERENCE
2 Mementos, Root Not Bracketed
TEMPORAL COHERENCE
2 Mementos, Use Content – Similarity
TEMPORAL COHERENCE
2 Mementos, Contents Equal or Equivalent
TEMPORAL COHERENCE
2 Mementos, Contents Not Equal or Equivalent
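One possible reading of the single-memento cases is sketched below. The diagrams for these slides are not reproduced here, so every definition is an assumption: "bracketed" is taken to mean that the root's Memento-Datetime falls between the embedded resource's Last-Modified and its own Memento-Datetime. The function name, the labels, and the order of the checks are hypothetical.

```python
from datetime import datetime
from typing import Optional


def classify(root: datetime, memento: datetime,
             last_modified: Optional[datetime]) -> str:
    """Label a single embedded memento relative to the root memento datetime."""
    if memento < root:
        return "before root"          # captured earlier than the root
    if last_modified is None:
        return "no Last-Modified"     # nothing to bracket the root with
    if last_modified <= root <= memento:
        return "bracketed root"       # unchanged from before root to capture
    return "root not bracketed"       # modified after the root was captured


root = datetime(2005, 5, 14, 1, 36, 8)
captured = datetime(2005, 5, 23, 2, 39, 39)

bracketed = classify(root, captured, datetime(2005, 1, 10))
not_bracketed = classify(root, captured, datetime(2005, 5, 20))
unknown = classify(root, captured, None)
```

Under this reading, a bracketed root is evidence that the embedded resource looked the same when the root was captured, even though it was archived later.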
CURRENT EXPERIMENT
• 4,000 URIs from JCDL’11 “How Much…” paper
• 1 URI/month instead of all
• Temporal coherence patterns
• Target WSDM 2013
FUTURE WORK
Timemaps, Redirection, Missing Mementos
• Timemaps only tell part of the story
• URI-R redirection (302 from source)
• URI-M redirection (Archive action)
• Mementos in timemaps but not accessible
• Policies must consider user needs
  • Leave it missing
  • Show “best” substitute
FUTURE WORK
Similarity & Duplication
• Deltas are currently |root – embedded|
• If bracketing mementos are identical, should delta be zero?
• HTML is usually modified by the archive
  • Can’t check for equality
  • Shingling? SimHash?
[Timeline: –30d, 0, +30d]
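The question of whether identical bracketing mementos should yield a zero delta can be made concrete for byte-identical resources such as images. This is a hypothetical sketch, not the authors' method: `effective_delta` and its fallback rule are assumptions, and it does not handle HTML that the archive has rewritten.

```python
import hashlib
from datetime import datetime, timedelta


def digest(content: bytes) -> str:
    """SHA-256 hex digest used as a simple identity check."""
    return hashlib.sha256(content).hexdigest()


def effective_delta(root: datetime,
                    before: tuple[datetime, bytes],
                    after: tuple[datetime, bytes]) -> timedelta:
    """Delta for an embedded resource whose two mementos bracket the root."""
    if digest(before[1]) == digest(after[1]):
        return timedelta(0)  # unchanged across the root datetime
    # Otherwise fall back to |root - embedded| for the nearer memento.
    return min(abs(root - before[0]), abs(after[0] - root))


root = datetime(2005, 5, 14)
# Same bytes captured 30 days before and 30 days after the root.
same = effective_delta(root,
                       (root - timedelta(days=30), b"GIF89a"),
                       (root + timedelta(days=30), b"GIF89a"))
# Different bytes: the nearer capture (9 days after) sets the delta.
changed = effective_delta(root,
                          (root - timedelta(days=30), b"GIF89a"),
                          (root + timedelta(days=9), b"GIF87a"))
```

For HTML, the exact hash comparison would need to be replaced by a similarity measure, which is where shingling or SimHash would come in.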
FUTURE WORK
Communicating Status
FUTURE WORK
Policies & Heuristics
• Current Spread Heuristics
  • Minimize distance
  • Past only
  • Past preferred
  • Near or within distance
  • Single vs. multi-archive
• Refine to meet user expectations
  • Speed (minimize time)
  • Accuracy (minimize temporal error)
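Two of the listed heuristics, "Past only" and "Past preferred", might be implemented along these lines. The slide names the heuristics without defining them, so the semantics here are assumptions: past-only never selects a memento captured after the target, and past-preferred falls back to the nearest future memento only when no past one exists.

```python
from datetime import datetime
from typing import Optional


def past_only(target: datetime,
              candidates: list[datetime]) -> Optional[datetime]:
    """Closest memento captured at or before the target; None if there is none."""
    past = [c for c in candidates if c <= target]
    return max(past, default=None)


def past_preferred(target: datetime,
                   candidates: list[datetime]) -> Optional[datetime]:
    """Closest past memento if any exists, otherwise the closest future one."""
    chosen = past_only(target, candidates)
    if chosen is not None:
        return chosen
    # No past memento: every candidate is in the future, so the earliest is nearest.
    return min(candidates, default=None)


target = datetime(2005, 5, 14, 1, 36, 8)
future_only = [datetime(2005, 5, 23), datetime(2005, 6, 1)]
mixed = [datetime(2005, 3, 31), datetime(2005, 5, 23)]

strict_choice = past_only(target, future_only)         # None
fallback_choice = past_preferred(target, future_only)  # 2005-05-23
preferred_choice = past_preferred(target, mixed)       # 2005-03-31
```

Note that past-preferred picks the March capture in the mixed case even though the May capture is closer, which is the accuracy cost of preferring the past.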
CONCLUSION
Extensive research on improving acquisition exists
Best use of existing collections needs study
We are looking at
• Characterizing existing holdings
• Characterizing temporal coherence
• Policies that minimize impact of temporal incoherence
• Visualizations of temporal coherence
MY QUESTIONS
Coherent
MY QUESTIONS
Violation
MY QUESTIONS
What do these mean to users?
[Figure callouts: (1), (2), (3), (4)]
MY QUESTIONS
What does this mean to users?