TEMPORAL SPREAD IN ARCHIVED COMPOSITE RESOURCES (WORK IN PROGRESS)
SCOTT G. AINSWORTH
MICHAEL L. NELSON
OLD DOMINION UNIVERSITY
COMPUTER SCIENCE
WADL 2013
JULY 25–26, 2013
INDIANAPOLIS, INDIANA USA
Joint Conference on Digital Libraries (JCDL) 2013
CONTENTS
Motivation
Related work
Preliminary work
Temporal Spread
Future work
Conclusion
7/26/13
A FABLE FROM WAYBACK
TEMPORAL SPREAD
[Figure: composite memento with root Memento-Datetime 2005-05-14 01:36:08; embedded resources annotated +9 days, +18 days, +18 days, +7 months, +2.1 years]
QUESTIONS
• How much temporal spread exists in composite mementos?
• How can temporal spread be minimized?
• What factors contribute, positively or negatively, to spread?
• Does combining multiple archives produce better results?
• Would users with differing goals benefit from different minimization policies and heuristics?
• How can temporal coherence be displayed to users simply?
RELATED WORK
Control Crawl Data Quality, Future collections
• Spaniol et al. – crawling strategy
• Denev et al. – change rates by MIME type and depth
• Ben Saad et al. – metadata from crawl used to select best results from archive
Our Focus: Existing Data Quality
• Existing collections
• Datetime selection policies
RELATED WORK
Use Patterns
• AlNoamany et al. – Archive Access Patterns
  • Humans vs. Robots
  • Dip, dive, slide, & skim
Identifying Duplicates
• Simple identity – images, other binary formats
  • Direct comparison
  • Hash comparison
• HTML, CSS (text)
  • Shingling, Jaccard distances, etc.
  • SimHash shows the most promise
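As a rough illustration of the two families of duplicate checks listed above, the sketch below compares binary resources by hash and text resources by the Jaccard similarity of word shingles. The function names, the SHA-256 choice, and the shingle size of 3 are assumptions made for this example, not details taken from the cited work.

```python
import hashlib


def same_bytes(a: bytes, b: bytes) -> bool:
    """Simple identity check for binary resources via a hash comparison."""
    return hashlib.sha256(a).digest() == hashlib.sha256(b).digest()


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles in the text (k=3 is an assumption)."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


similarity = jaccard("the quick brown fox jumps", "the quick brown fox leaps")
```

A similarity of 1.0 means the shingle sets are identical; lower values indicate more divergent text.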
RELATED WORK – MEMENTO*
• HTTP extension for datetime negotiation
Request
GET <timegate>/http://www.cs.odu.edu/ HTTP/1.1
…
Accept-Datetime: Tue, 10 May 2005 11:21:00 GMT
…
Response
HTTP/1.1 200 OK
…
Memento-Datetime: Sat, 14 May 2005 01:36:08 GMT
…
*https://datatracker.ietf.org/doc/draft-vandesompel-memento/
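A minimal sketch of how a client might build the Accept-Datetime request header and read the Memento-Datetime response header, using only the Python standard library. The helper names are hypothetical and no network request is made; the response headers are supplied as a plain dictionary.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_header(target: datetime) -> dict:
    """Format the target datetime as an HTTP-date for Accept-Datetime."""
    utc = target.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc, usegmt=True)}


def memento_datetime(headers: dict) -> datetime:
    """Parse the Memento-Datetime header of a response into a datetime."""
    return parsedate_to_datetime(headers["Memento-Datetime"])


request_headers = accept_datetime_header(
    datetime(2005, 5, 10, 11, 21, 0, tzinfo=timezone.utc)
)
# Response header value copied from the slide.
returned = memento_datetime({"Memento-Datetime": "Sat, 14 May 2005 01:36:08 GMT"})
```

Comparing `returned` with the requested datetime gives the gap between what was asked for and what the archive could supply.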
CONTENTS
Motivation
Related work
Preliminary work
• How much of the Web is archived
• Temporal Drift
Temporal Spread
Future work
Conclusion
HOW MUCH IS ARCHIVED?
35 – 90%    At least one archived copy
17 – 49%    2 – 5 copies
1 – 8%      6 – 10 copies
8 – 63%     > 10 copies
Source: JCDL’11
[Chart legend: Internet Archive, Search Engine, Other]
TEMPORAL DRIFT
Comparing two policies
• Sliding – target datetime changes
• Sticky – target datetime held steady
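The difference between the two policies can be sketched as a short simulation. It assumes that under the sliding policy each step adopts the previous memento's datetime as the new target, and that the archive always returns the capture nearest the target. The capture lists are illustrative, restricted to the datetimes shown in this deck, and `closest` and `walk` are hypothetical helper names.

```python
from datetime import datetime


def closest(captures, target):
    """Return the capture datetime nearest to the target datetime."""
    return min(captures, key=lambda c: abs(c - target))


def walk(pages, target, sticky):
    """Visit pages in order; return the Memento-Datetime chosen at each step."""
    chosen = []
    for captures in pages:
        memento = closest(captures, target)
        chosen.append(memento)
        if not sticky:
            target = memento  # sliding: the target follows the last memento
    return chosen


CS_HOME = [datetime(2005, 3, 31, 9, 16, 10), datetime(2005, 5, 14, 1, 36, 8)]
SCIENCE_HOME = [datetime(2005, 4, 22, 0, 17, 52)]
PAGES = [CS_HOME, SCIENCE_HOME, CS_HOME]  # CS Home, Science Home, back to CS Home
START = datetime(2005, 5, 14, 1, 36, 8)

sliding = walk(PAGES, START, sticky=False)
sticky = walk(PAGES, START, sticky=True)
```

With these inputs the sliding walk ends on the 2005-03-31 capture, while the sticky walk returns to the 2005-05-14 capture.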
SLIDING TARGET
2005-05-14 01:36:08
2005-04-22 00:17:52
2005-03-31 09:16:10
TEMPORAL DRIFT
WHAT WE EXPECTED: 2005-05-14 @ 01:36:08
WHAT WE GOT: 2005-03-31 @ 09:16:10
STICKY TARGET
What if the target is held steady?
(Enabled by Memento API)
STICKY TARGET
2005-05-14 (MementoFox Extension)
2005-05-14 01:36:08
2005-04-22 00:17:52
2005-05-14 01:36:08
DRIFT COMPARISON
Page            Sliding Datetime       Sliding Drift             Sticky Datetime        Sticky Drift
CS Home         2005-05-14 01:36:08    –                         2005-05-14 01:36:08    –
Science Home    2005-04-22 00:17:52    22.1 days                 2005-04-22 00:17:52    22.1 days
CS Home         2005-03-31 09:16:10    43.7 days (+21.6 days)    2005-05-14 01:36:08    –
Mean                                   32.9 days                                        11.0 days
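The drift values in this comparison can be reproduced with simple datetime arithmetic. The sketch assumes that drift is the absolute difference between each memento's datetime and the original target (2005-05-14 01:36:08), and that the mean is taken over the two steps after the starting page; `drift_days` is a hypothetical helper name.

```python
from datetime import datetime

TARGET = datetime(2005, 5, 14, 1, 36, 8)


def drift_days(memento, target=TARGET):
    """Absolute distance between a memento and the target, in days."""
    return abs(memento - target).total_seconds() / 86400


# Mementos reached on steps 2 and 3 under each policy (values from the table).
sliding = [datetime(2005, 4, 22, 0, 17, 52), datetime(2005, 3, 31, 9, 16, 10)]
sticky = [datetime(2005, 4, 22, 0, 17, 52), datetime(2005, 5, 14, 1, 36, 8)]

sliding_drift = [drift_days(m) for m in sliding]
sticky_drift = [drift_days(m) for m in sticky]
sliding_mean = sum(sliding_drift) / len(sliding_drift)
sticky_mean = sum(sticky_drift) / len(sticky_drift)
```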
MEDIAN DRIFT BY STEP
[Figure: median drift (months) by step number for the Sliding and Sticky policies. Source: JCDL’13]
TEMPORAL SPREAD
COMPOSITE MEMENTO
PRESENTATION STRUCTURE
EMBEDDED RESOURCES
Resource                          Memento-Datetime       Delta
http://www.cs.odu.edu             2005-05-14 01:36:08
mm_menu.js                        2005-05-23 02:39:12    9.0 d
style.css                         2005-05-23 02:39:39    9.0 d
gfx-logo-odu-crown.gif            2005-05-23 02:39:39    9.0 d
ddmenu_ddown.js                   2005-05-23 02:39:43    9.0 d
university.js                     2005-05-23 02:39:56    9.0 d
rmenu_1st_about.png               2005-06-01 13:40:25    18.5 d
rmenu_bottom_229.gif              2005-06-01 14:07:29    18.5 d
shadow-bl.gif                     2005-06-01 14:55:53    18.6 d
ecsbdg.jpg                        2005-06-01 14:56:17    18.6 d
shadow-br.gif                     2005-06-01 15:18:18    18.6 d
gfx-btn-go-dblue.gif              2005-06-01 15:34:19    18.6 d
shadow-tr.gif                     2005-06-01 15:55:57    18.6 d
header-right1.gif                 2005-06-01 16:06:16    18.6 d
spacer.gif                        2005-06-01 16:23:10    18.6 d
jimcheng.gif                      2005-06-01 16:37:39    18.6 d
jsmith.gif                        2005-06-01 16:58:50    18.6 d
rmenu_1st_featured_alumni.png     2005-06-01 21:21:45    18.8 d
hmenu_college_...-new.png         2005-12-21 20:14:25    7.3 mo
rmenu_1st_upcoming_news.png       2005-12-21 20:15:14    7.3 mo
rmenu_1st_upcoming_events.png     2005-12-21 21:01:12    7.3 mo
lmenu_1st_resources.png           2005-12-28 17:47:41    7.5 mo
bullet_blue_triangle.gif          2005-12-28 19:43:48    7.5 mo
logo-cs.gif                       2005-12-28 19:54:29    7.5 mo
rmenu_1st_featured_student.png    2007-06-12 02:36:07    2.1 years
shadow-b.gif                      2007-06-21 02:35:17    2.1 years
shadow-r.gif                      404 Not Found

Embedded Resources    26
Mean Delta            125.9 days
Standard Deviation    207.7 days
Spread                2.1 years
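A small sketch of how the per-resource deltas and the overall spread might be computed. It assumes delta is the absolute difference between an embedded resource's Memento-Datetime and the root's, and that spread is the interval between the earliest and latest Memento-Datetime in the composite memento. Only four of the 26 embedded resources are included, so the mean and standard deviation above are not reproduced.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
root = datetime.strptime("2005-05-14 01:36:08", FMT)

# Subset of the embedded resources listed above.
embedded = {
    "mm_menu.js": "2005-05-23 02:39:12",
    "spacer.gif": "2005-06-01 16:23:10",
    "logo-cs.gif": "2005-12-28 19:54:29",
    "shadow-b.gif": "2007-06-21 02:35:17",
}
captured = {name: datetime.strptime(ts, FMT) for name, ts in embedded.items()}

# Delta in days between each embedded memento and the root memento.
deltas = {
    name: abs(dt - root).total_seconds() / 86400 for name, dt in captured.items()
}

# Spread: distance between the earliest and latest memento, root included.
all_datetimes = [root, *captured.values()]
spread_days = (max(all_datetimes) - min(all_datetimes)).total_seconds() / 86400
spread_years = spread_days / 365.25
```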
REPRESENTING SPREAD
COMPOSITE MEMENTO
TEMPORAL SPREAD CHART
[Chart legend: Root, Embedded, Diff. Domain, Reused]
TEMPORAL SPREAD – ODU CS
FIRST EXPERIMENT
• 1,000 URIs from DMOZ (Open Directory)
• Download all timemaps
• Download all composite mementos
• Download all embedded resources
• Single and Multiple Archives
• Four Heuristics
PRELIMINARY RESULTS
Count        Description                         Percent
1,000        Root URI-Rs
910          Root timemaps                       91%
87,847       Root URI-Ms in timemaps
96.5         URI-Ms per Root URI-R
85,570       Root mementos downloaded            97%
1,488,420    Embedded URI-Rs
17.4         Embedded URI-Rs per Root memento
SINGLE/MULTI & HEURISTICS
Description                    Minimize Distance,    Minimize Distance,    3-Month Window,
                               Single Archive        Multi-Archive         Multi-Archive
Embedded URI-Rs                1,488,440             1,488,420             1,447,351
Embedded URI-Ms in timemaps    1,169,787             1,186,456             500,541
URI-M/Embedded URI-R           0.79                  0.80                  0.35
% Complete                     73.8%                 75.4%                 33.8%
Mean spread                    200.2                 200.1                 15.1
Standard Deviation             219.2                 219.9                 14.3
TEMPORAL COHERENCE
• 1 Memento, Bracketed Root
• 1 Memento, Root Not Bracketed
• 1 Memento, No Last-Modified
• 1 Memento, Before Root
• 2 Mementos, Root Not Bracketed
• 2 Mementos, Use Content – Similarity
• 2 Mementos, Contents Equal or Equivalent
• 2 Mementos, Contents Not Equal or Equivalent
CURRENT EXPERIMENT
• 4,000 URIs from JCDL’11 “How Much…” paper
• 1 URI/month instead of all
• Temporal coherence patterns
• Target WSDM 2013
FUTURE WORK
Timemaps, Redirection, Missing Mementos
• Timemaps only tell part of the story
• URI-R redirection (302 from source)
• URI-M redirection (Archive action)
• Mementos in timemaps but not accessible
• Policies must consider user needs
  • Leave it missing
  • Show “best” substitute
FUTURE WORK
Similarity & Duplication
• Deltas are currently |root – embedded|
• If bracketing mementos are identical, should delta be zero?
• HTML is usually modified by the archive
• Can’t check for equality
• Shingling? SimHash?
[Timeline: –30d, 0, +30d]
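One possible answer to the bracketing question can be sketched as follows. This is an illustration, not the authors' method: it assumes byte-identical content can be detected with a hash (which suits images and other binary formats, but not HTML that the archive has rewritten), and `bracketing_delta` is a hypothetical function name.

```python
import hashlib
from datetime import datetime, timedelta


def digest(content: bytes) -> bytes:
    """Hash used to test two captures for byte-level equality."""
    return hashlib.sha256(content).digest()


def bracketing_delta(root, mementos):
    """Delta in days for one embedded resource.

    mementos is a non-empty list of (datetime, content) pairs. If the captures
    just before and just after the root are identical, the resource is assumed
    unchanged across the root datetime and the delta is zero. Otherwise the
    delta is |root - nearest capture|.
    """
    before = [m for m in mementos if m[0] <= root]
    after = [m for m in mementos if m[0] >= root]
    if before and after:
        prev = max(before, key=lambda m: m[0])
        nxt = min(after, key=lambda m: m[0])
        if digest(prev[1]) == digest(nxt[1]):
            return 0.0
    nearest = min(mementos, key=lambda m: abs(m[0] - root))
    return abs(nearest[0] - root).total_seconds() / 86400


root = datetime(2005, 5, 14, 1, 36, 8)
month = timedelta(days=30)
unchanged = bracketing_delta(root, [(root - month, b"logo"), (root + month, b"logo")])
changed = bracketing_delta(root, [(root - month, b"old"), (root + month, b"new")])
```

Here `unchanged` is zero because both bracketing captures carry the same bytes, while `changed` falls back to the 30-day distance.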
FUTURE WORK
Communicating Status
FUTURE WORK
Policies & Heuristics
• Current Spread Heuristics
  • Minimize distance
  • Past only
  • Past preferred
  • Near or within distance
  • Single vs. multi-archive
• Refine to meet user expectations
  • Speed (minimize time)
  • Accuracy (minimize temporal error)
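The selection heuristics named above might look roughly like this in code. The slides give only the names, so each function body is one plausible reading rather than the authors' definition; the 90-day default stands in for a three-month window and is also an assumption.

```python
from datetime import datetime, timedelta


def minimize_distance(root, candidates):
    """Pick the memento closest to the root datetime, past or future."""
    return min(candidates, key=lambda c: abs(c - root), default=None)


def past_only(root, candidates):
    """Pick the latest memento captured at or before the root datetime."""
    return max((c for c in candidates if c <= root), default=None)


def past_preferred(root, candidates):
    """Prefer a past memento; fall back to the closest one if none exists."""
    choice = past_only(root, candidates)
    return choice if choice is not None else minimize_distance(root, candidates)


def within_distance(root, candidates, window=timedelta(days=90)):
    """Pick the closest memento, but only if it lies inside the window."""
    choice = minimize_distance(root, candidates)
    if choice is not None and abs(choice - root) <= window:
        return choice
    return None


root = datetime(2005, 5, 14, 1, 36, 8)
candidates = [datetime(2005, 3, 1), datetime(2005, 5, 20), datetime(2007, 6, 21)]
```

A multi-archive variant would simply pool the candidate datetimes from several archives before applying the same functions. Returning `None` from the windowed heuristic corresponds to leaving a resource missing rather than substituting a distant capture.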
CONCLUSION
Extensive research on improving acquisition exists
Best use of existing collections needs study
We are looking at
• Characterizing existing holdings
• Characterizing temporal coherence
• Policies that minimize impact of temporal incoherence
• Visualizations of temporal coherence
MY QUESTIONS
Coherent
Violation
What do these mean to users? [Figure labels: (1), (2), (3), (4)]
What does this mean to users?