TEMPORAL SPREAD IN ARCHIVED COMPOSITE RESOURCES (WORK IN PROGRESS)
Scott G. Ainsworth, Michael L. Nelson
Old Dominion University, Computer Science
WADL 2013, July 25–26, 2013
Indianapolis, Indiana USA
Joint Conference on Digital Libraries (JCDL) 2013
Scott G. Ainsworth • Michael L. Nelson
CONTENTS
• Motivation
• Related work
• Preliminary work
• Temporal Spread
• Future work
• Conclusion
7/26/13
A FABLE FROM WAYBACK
TEMPORAL SPREAD
[Figure: composite memento with root datetime 2005-05-14 01:36:08; embedded resources annotated +9 days, +18 days, +18 days, +7 months, +2.1 years]
QUESTIONS
• How much temporal spread exists in composite mementos?
• How can temporal spread be minimized?
• What factors contribute, positively or negatively, to spread?
• Does combining multiple archives produce better results?
• Would users with differing goals benefit from different minimization policies and heuristics?
• How can temporal coherence be displayed to users simply?
RELATED WORK
Control Crawl Data Quality, Future Collections
• Spaniol et al. – crawling strategy
• Denev et al. – change rates by MIME type and depth
• Ben Saad et al. – metadata from crawl used to select best results from archive
Our Focus: Existing Data Quality
• Existing collections
• Datetime selection policies
RELATED WORK
Use Patterns
• AlNoamany et al. – Archive Access Patterns
  • Humans vs. robots
  • Dip, dive, slide, & skim
Identifying Duplicates
• Simple identity – images, other binary formats
  • Direct comparison
  • Hash comparison
• HTML, CSS (text)
  • Shingling, Jaccard distances, etc.
  • SimHash shows the most promise
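The shingling and Jaccard-distance approach named for text resources can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the word-level tokenization, and the shingle size `k = 3` are all assumptions made for the example.

```python
from __future__ import annotations


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles (overlapping word windows) in text.

    Word-level shingles and k=3 are illustrative choices, not from the slides.
    """
    words = text.lower().split()
    if len(words) < k:
        # Too short for a full window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance 1 - |A ∩ B| / |A ∪ B|; 0.0 means identical sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def text_distance(doc_a: str, doc_b: str, k: int = 3) -> float:
    """Distance between two text resources (e.g. HTML or CSS bodies)."""
    return jaccard_distance(shingles(doc_a, k), shingles(doc_b, k))
```

A distance near 0 suggests two captures of a text resource are near-duplicates; any threshold for "equivalent" would have to be chosen empirically.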
RELATED WORK – MEMENTO*
• HTTP extension for datetime negotiation

Request
GET <timegate>/http://www.cs.odu.edu/ HTTP/1.1
…
Accept-Datetime: Tue, 10 May 2005 11:21:00 GMT
…

Response
HTTP/1.1 200 OK
…
Memento-Datetime: Sat, 14 May 2005 01:36:08 GMT
…

*https://datatracker.ietf.org/doc/draft-vandesompel-memento/
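As a rough sketch of the client side of this exchange, the snippet below formats a target datetime for the `Accept-Datetime` request header and parses a `Memento-Datetime` response header, using only the Python standard library. No request is sent; the helper names are hypothetical, and the two datetimes are the ones shown on the slide.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_header(target: datetime) -> str:
    """Format an aware datetime as an HTTP date string ending in GMT."""
    return format_datetime(target.astimezone(timezone.utc), usegmt=True)


def parse_memento_datetime(value: str) -> datetime:
    """Parse a Memento-Datetime header value into an aware UTC datetime."""
    return parsedate_to_datetime(value).astimezone(timezone.utc)


# Target datetime requested by the client (value from the slide).
target = datetime(2005, 5, 10, 11, 21, 0, tzinfo=timezone.utc)
request_headers = {"Accept-Datetime": accept_datetime_header(target)}

# Datetime of the memento the archive actually returned (value from the slide).
memento_dt = parse_memento_datetime("Sat, 14 May 2005 01:36:08 GMT")

# Gap between what was asked for and what was returned.
gap = memento_dt - target
```

The gap between the requested and returned datetimes is the quantity the later drift and spread measurements build on.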
CONTENTS
• Motivation
• Related work
• Preliminary work
  • How much of the Web is archived
  • Temporal Drift
• Temporal Spread
• Future work
• Conclusion
HOW MUCH IS ARCHIVED?
35 – 90%   At least one archived copy
17 – 49%   2 – 5 copies
1 – 8%     6 – 10 copies
8 – 63%    > 10 copies
(JCDL’11)
[Chart legend: Internet Archive, Search Engine, Other]
TEMPORAL DRIFT
Comparing two policies
• Sliding – target datetime changes
• Sticky – target datetime held steady
SLIDING TARGET
2005-05-14 01:36:08
SLIDING TARGET
2005-04-22 00:17:52
SLIDING TARGET
2005-03-31 09:16:10
TEMPORAL DRIFT
WHAT WE EXPECTED: 2005-05-14 @ 01:36:08
WHAT WE GOT: 2005-03-31 @ 09:16:10
STICKY TARGET
What if the target is held steady?
(Enabled by Memento API)
STICKY TARGET
[Screenshot: MementoFox extension, 2005-05-14]
2005-05-14 01:36:08
STICKY TARGET
2005-04-22 00:17:52
STICKY TARGET
2005-05-14 01:36:08
DRIFT COMPARISON
Page           Sliding Datetime      Sliding Drift            Sticky Datetime       Sticky Drift
CS Home        2005-05-14 01:36:08   –                        2005-05-14 01:36:08   –
Science Home   2005-04-22 00:17:52   22.1 days                2005-04-22 00:17:52   22.1 days
CS Home        2005-03-31 09:16:10   43.7 days (+21.6 days)   2005-05-14 01:36:08   –
Mean                                 32.9 days                                      11.0 days
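The drift figures in this comparison can be reproduced with a few lines of Python. The sketch assumes that drift is the absolute difference between the original target datetime and the datetime of the memento actually shown, and that each mean is taken over the two steps after the starting page; both assumptions are consistent with the numbers on the slide, but the slide does not state them.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
SECONDS_PER_DAY = 86400


def drift_days(target: str, observed: str) -> float:
    """Absolute distance in days between the target and an observed memento."""
    delta = datetime.strptime(target, FMT) - datetime.strptime(observed, FMT)
    return abs(delta.total_seconds()) / SECONDS_PER_DAY


# Target datetime from the first page visited (CS Home).
target = "2005-05-14 01:36:08"

# Memento datetimes for the two later steps under each policy (from the table).
sliding = ["2005-04-22 00:17:52", "2005-03-31 09:16:10"]
sticky = ["2005-04-22 00:17:52", "2005-05-14 01:36:08"]

sliding_drift = [drift_days(target, d) for d in sliding]
sticky_drift = [drift_days(target, d) for d in sticky]

mean_sliding = sum(sliding_drift) / len(sliding_drift)
mean_sticky = sum(sticky_drift) / len(sticky_drift)

print([round(d, 1) for d in sliding_drift], round(mean_sliding, 1))
print([round(d, 1) for d in sticky_drift], round(mean_sticky, 1))
```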
MEDIAN DRIFT BY STEP
[Figure: median drift (months) by step number, sliding vs. sticky target. JCDL’13]
TEMPORAL SPREAD
COMPOSITE MEMENTO
[Figure panels: Presentation, Structure]
TEMPORAL SPREAD
[Figure: composite memento with root datetime 2005-05-14 01:36:08; embedded resources annotated +9 days, +18 days, +18 days, +7 months, +2.1 years]
EMBEDDED RESOURCES
Resource                         Memento-Datetime      Delta
http://www.cs.odu.edu            2005-05-14 01:36:08
mm_menu.js                       2005-05-23 02:39:12   9.0 d
style.css                        2005-05-23 02:39:39   9.0 d
gfx-logo-odu-crown.gif           2005-05-23 02:39:39   9.0 d
ddmenu_ddown.js                  2005-05-23 02:39:43   9.0 d
university.js                    2005-05-23 02:39:56   9.0 d
rmenu_1st_about.png              2005-06-01 13:40:25   18.5 d
rmenu_bottom_229.gif             2005-06-01 14:07:29   18.5 d
shadow-bl.gif                    2005-06-01 14:55:53   18.6 d
ecsbdg.jpg                       2005-06-01 14:56:17   18.6 d
shadow-br.gif                    2005-06-01 15:18:18   18.6 d
gfx-btn-go-dblue.gif             2005-06-01 15:34:19   18.6 d
shadow-tr.gif                    2005-06-01 15:55:57   18.6 d
header-right1.gif                2005-06-01 16:06:16   18.6 d
spacer.gif                       2005-06-01 16:23:10   18.6 d
jimcheng.gif                     2005-06-01 16:37:39   18.6 d
jsmith.gif                       2005-06-01 16:58:50   18.6 d
rmenu_1st_featured_alumni.png    2005-06-01 21:21:45   18.8 d
hmenu_college_...-new.png        2005-12-21 20:14:25   7.3 mo
rmenu_1st_upcoming_news.png      2005-12-21 20:15:14   7.3 mo
rmenu_1st_upcoming_events.png    2005-12-21 21:01:12   7.3 mo
lmenu_1st_resources.png          2005-12-28 17:47:41   7.5 mo
bullet_blue_triangle.gif         2005-12-28 19:43:48   7.5 mo
logo-cs.gif                      2005-12-28 19:54:29   7.5 mo
rmenu_1st_featured_student.png   2007-06-12 02:36:07   2.1 years
shadow-b.gif                     2007-06-21 02:35:17   2.1 years
shadow-r.gif                     404 Not Found

Embedded Resources   26
Mean Delta           125.9 days
Standard Deviation   207.7 days
Spread               2.1 years
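To make the Delta and Spread columns concrete, here is a small sketch over four of the rows above. It assumes delta is the absolute distance between the root's Memento-Datetime and each embedded resource's Memento-Datetime, and that spread is the distance between the earliest and latest datetimes in the composite memento. Those definitions match the tabulated values but are inferred rather than stated on the slide, and because only a subset of rows is used, the slide's mean and standard deviation are not reproduced.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
SECONDS_PER_DAY = 86400

# Root memento datetime (http://www.cs.odu.edu) from the table.
root = datetime.strptime("2005-05-14 01:36:08", FMT)

# Illustrative subset of embedded resources from the table.
embedded = {
    "mm_menu.js": "2005-05-23 02:39:12",
    "spacer.gif": "2005-06-01 16:23:10",
    "logo-cs.gif": "2005-12-28 19:54:29",
    "shadow-b.gif": "2007-06-21 02:35:17",
}


def delta_days(root_dt: datetime, embedded_dt: datetime) -> float:
    """Absolute distance in days between the root and an embedded memento."""
    return abs((embedded_dt - root_dt).total_seconds()) / SECONDS_PER_DAY


deltas = {
    name: delta_days(root, datetime.strptime(ts, FMT))
    for name, ts in embedded.items()
}

# Spread: latest minus earliest datetime across root and embedded mementos.
all_datetimes = [root] + [datetime.strptime(ts, FMT) for ts in embedded.values()]
spread_days = (max(all_datetimes) - min(all_datetimes)).total_seconds() / SECONDS_PER_DAY
spread_years = spread_days / 365.25

for name, days in deltas.items():
    print(f"{name}: {days:.1f} d")
print(f"spread: {spread_years:.1f} years")
```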
REPRESENTING SPREAD
[Figure panels: Composite Memento; Temporal Spread Chart]
[Chart legend: Root, Embedded, Diff. Domain, Reused]
TEMPORAL SPREAD – ODU CS
FIRST EXPERIMENT
• 1,000 URIs from DMOZ (Open Directory)
• Download all timemaps
• Download all composite mementos
• Download all embedded resources
• Single and Multiple Archives
• Four Heuristics
PRELIMINARY RESULTS
Count       Description                        Percent
1,000       Root URI-Rs
910         Root timemaps                      91%
87,847      Root URI-Ms in timemaps
96.5        URI-Ms per Root URI-R
85,570      Root mementos downloaded           97%
1,488,420   Embedded URI-Rs
17.4        Embedded URI-Rs per Root memento
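The derived rows in this table follow from the raw counts. The check below assumes the denominators implied by the numbers (timemaps over root URI-Rs, URI-Ms over timemaps, downloads over URI-Ms, embedded URI-Rs over downloaded root mementos); the slide does not spell these out, and the variable names are illustrative.

```python
# Raw counts from the table.
root_uri_rs = 1_000
root_timemaps = 910
root_uri_ms = 87_847
root_mementos_downloaded = 85_570
embedded_uri_rs = 1_488_420

# Derived values, with assumed denominators.
timemap_pct = 100 * root_timemaps / root_uri_rs
uri_ms_per_root = root_uri_ms / root_timemaps
downloaded_pct = 100 * root_mementos_downloaded / root_uri_ms
embedded_per_memento = embedded_uri_rs / root_mementos_downloaded

print(f"{timemap_pct:.0f}% of root URI-Rs have a timemap")
print(f"{uri_ms_per_root:.1f} URI-Ms per root URI-R with a timemap")
print(f"{downloaded_pct:.0f}% of root URI-Ms downloaded")
print(f"{embedded_per_memento:.1f} embedded URI-Rs per root memento")
```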
SINGLE/MULTI & HEURISTICS
Description                   Minimize Distance,   Minimize Distance,   3-Month Window,
                              Single Archive       Multi-Archive        Multi-Archive
Embedded URI-Rs               1,488,440            1,488,420            1,447,351
Embedded URI-Ms in timemaps   1,169,787            1,186,456            500,541
URI-M/Embedded URI-R          0.79                 0.80                 0.35
% Complete                    73.8%                75.4%                33.8%
Mean spread                   200.2                200.1                15.1
Standard Deviation            219.2                219.9                14.3
TEMPORAL COHERENCE
1 Memento, Bracketed Root
TEMPORAL COHERENCE
1 Memento, Root Not Bracketed
TEMPORAL COHERENCE
1 Memento, No Last-Modified
TEMPORAL COHERENCE
1 Memento, Before Root
TEMPORAL COHERENCE
2 Mementos, Root Not Bracketed
TEMPORAL COHERENCE
2 Mementos, Use Content – Similarity
TEMPORAL COHERENCE
2 Mementos, Contents Equal or Equivalent
TEMPORAL COHERENCE
2 Mementos, Contents Not Equal or Equivalent
CURRENT EXPERIMENT
• 4,000 URIs from JCDL’11 “How Much…” paper
• 1 URI/month instead of all
• Temporal coherence patterns
• Target WSDM 2013
CURRENT EXPERIMENT
FUTURE WORK
Timemaps, Redirection, Missing Mementos
• Timemaps only tell part of the story
• URI-R redirection (302 from source)
• URI-M redirection (Archive action)
• Mementos in timemaps but not accessible
• Policies must consider user needs
  • Leave it missing
  • Show “best” substitute
FUTURE WORK
Similarity & Duplication
• Deltas are currently | root – embedded |
• If bracketing mementos are identical, should delta be zero?
• HTML is usually modified by the archive
  • Can’t check for equality
  • Shingling? SimHash?
[Figure: timeline from –30d through 0 to +30d]
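One possible answer to the bracketing question can be sketched for byte-identical resources such as images. The sketch assumes `before` was captured at or before the root datetime and `after` at or after it; if both captures hash to the same value, the resource is treated as unchanged across the root datetime and its delta is set to zero. The `Capture` type and `effective_delta` function are hypothetical names for this example, and the approach does not cover HTML that the archive has rewritten.

```python
from __future__ import annotations

import hashlib
from datetime import datetime, timedelta
from typing import NamedTuple, Optional


class Capture(NamedTuple):
    """One archived copy of an embedded resource (hypothetical structure)."""
    captured: datetime
    content: bytes


def digest(capture: Capture) -> str:
    """SHA-256 digest of the captured bytes, used for equality checks."""
    return hashlib.sha256(capture.content).hexdigest()


def effective_delta(
    root: datetime,
    before: Optional[Capture],
    after: Optional[Capture],
) -> Optional[timedelta]:
    """Delta for an embedded resource relative to the root datetime.

    Assumes before.captured <= root <= after.captured when both are given.
    Identical bracketing captures yield zero; otherwise the nearer capture
    determines the delta. Returns None when no capture exists.
    """
    if before is not None and after is not None and digest(before) == digest(after):
        return timedelta(0)
    candidates = [abs(c.captured - root) for c in (before, after) if c is not None]
    return min(candidates) if candidates else None
```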
FUTURE WORK
Communicating Status
FUTURE WORK
Policies & Heuristics
• Current Spread Heuristics
  • Minimize distance
  • Past only
  • Past preferred
  • Near or within distance
  • Single vs. multi-archive
• Refine to meet user expectations
  • Speed (minimize time)
  • Accuracy (minimize temporal error)
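The heuristic names above come from the slide, but the slide does not define them. The sketch below shows one plausible reading of each as a selection rule over the memento datetimes listed in a timemap; the function names and the exact semantics are assumptions made for illustration.

```python
from __future__ import annotations

from datetime import datetime, timedelta
from typing import Optional, Sequence


def minimize_distance(target: datetime, mementos: Sequence[datetime]) -> Optional[datetime]:
    """Pick the memento closest to the target, past or future."""
    return min(mementos, key=lambda m: abs(m - target), default=None)


def past_only(target: datetime, mementos: Sequence[datetime]) -> Optional[datetime]:
    """Pick the latest memento at or before the target; None if there is none."""
    return max((m for m in mementos if m <= target), default=None)


def past_preferred(target: datetime, mementos: Sequence[datetime]) -> Optional[datetime]:
    """Prefer a past memento; fall back to the earliest future one."""
    choice = past_only(target, mementos)
    if choice is not None:
        return choice
    return min((m for m in mementos if m > target), default=None)


def within_distance(
    target: datetime, mementos: Sequence[datetime], window: timedelta
) -> Optional[datetime]:
    """Pick the closest memento only if it falls within the window."""
    choice = minimize_distance(target, mementos)
    if choice is None or abs(choice - target) > window:
        return None
    return choice
```

Under this reading, a window rule trades completeness for lower spread: resources with no memento inside the window are left missing instead of being filled with a distant capture.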
CONTENTS
• Motivation
• Related work
• Preliminary work
• Future work
• Conclusion
CONCLUSION
Extensive research on improving acquisition exists
Best use of existing collections needs study
We are looking at
• Characterizing existing holdings
• Characterizing temporal coherence
• Policies that minimize impact of temporal incoherence
• Visualizations of temporal coherence
MY QUESTIONS
Coherent
MY QUESTIONS
Violation
MY QUESTIONS
What do these mean to users?
[Figure callouts: (1), (2), (3), (4)]
MY QUESTIONS
What does this mean to users?