EVALUATING SLIDING AND STICKY TARGET POLICIES BY MEASURING TEMPORAL DRIFT IN ACYCLIC WALKS THROUGH A WEB ARCHIVE
SCOTT G. AINSWORTH
MICHAEL L. NELSON
OLD DOMINION UNIVERSITY
COMPUTER SCIENCE
JCDL 2013
JULY 23-25, 2013
INDIANAPOLIS, INDIANA USA
Joint Conference on Digital Libraries (JCDL) 2013
Scott G. Ainsworth • Michael L. Nelson
A FABLE FROM WAYBACK
7/23/13
A long, long time ago…
ODU Computer Science updated its web site…
What did it look like?
May 2005...
WHAT JUST HAPPENED?
WHAT WE EXPECTED: 2005-05-14 @ 01:36:08
WHAT WE GOT: 2005-03-31 @ 09:16:10
SLIDING TARGET
CS Home: 2005-05-14 01:36:08
Science Home: 2005-04-22 00:17:52
CS Home: 2005-03-31 09:16:10
STICKY TARGET
What if the target is held steady?
(Enabled by Memento API)
MEMENTO HTTP EXTENSION*
• Adds ability to request a particular date-time
• Enables Sticky Target

Request
GET <timegate>/http://www.cs.odu.edu/ HTTP/1.1
…
Accept-Datetime: Sat, 10 May 2005 11:21:00 GMT
…

Response
HTTP/1.1 200 OK
…
Memento-Datetime: Sat, 14 May 2005 01:36:08 GMT
…

*https://datatracker.ietf.org/doc/draft-vandesompel-memento/
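The two headers above can be produced and read with the Python standard library. The following is a minimal sketch, not the authors' tooling: the helper names `accept_datetime_header` and `memento_datetime` are hypothetical, the target datetime is an assumption chosen to equal the memento's datetime, and no network request is made.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_header(target: datetime) -> dict:
    """Build the request header asking a TimeGate for a given datetime."""
    utc_target = target.astimezone(timezone.utc)
    # usegmt=True emits the "GMT" suffix used by HTTP date headers.
    return {"Accept-Datetime": format_datetime(utc_target, usegmt=True)}


def memento_datetime(headers: dict) -> datetime:
    """Read the capture datetime the archive reports in its response."""
    return parsedate_to_datetime(headers["Memento-Datetime"])


# Assumed target for illustration only.
target = datetime(2005, 5, 14, 1, 36, 8, tzinfo=timezone.utc)
request_headers = accept_datetime_header(target)

# Response header copied from the slide.
response_headers = {"Memento-Datetime": "Sat, 14 May 2005 01:36:08 GMT"}
received = memento_datetime(response_headers)

# Difference between what was asked for and what was delivered.
drift = abs(received - target)
print(request_headers, drift)
```

Because the sticky policy sends the same Accept-Datetime value on every request, the client only needs to keep `target` unchanged for the whole walk.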
STICKY TARGET
MementoFox Extension (2005-05-14)
CS Home: 2005-05-14 01:36:08
Science Home: 2005-04-22 00:17:52
CS Home: 2005-05-14 01:36:08
DRIFT COMPARISON
Page          Sliding Datetime     Sliding Drift           Sticky Datetime      Sticky Drift
CS Home       2005-05-14 01:36:08  –                       2005-05-14 01:36:08  –
Science Home  2005-04-22 00:17:52  22.1 days               2005-04-22 00:17:52  22.1 days
CS Home       2005-03-31 09:16:10  43.7 days (+21.6 days)  2005-05-14 01:36:08  –
Mean                               32.9 days                                    11.0 days
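The comparison can be reproduced with a small simulation. This is a sketch under stated assumptions rather than the authors' code: `HOLDINGS` is a toy archive containing only the captures named in the table, the archive is assumed to return the capture nearest the requested datetime, and the mean is taken over the steps after the first, which is the reading that matches the figures shown.

```python
from datetime import datetime

# Toy archive holdings (an assumption): only the captures named in the table.
HOLDINGS = {
    "CS Home": [datetime(2005, 3, 31, 9, 16, 10), datetime(2005, 5, 14, 1, 36, 8)],
    "Science Home": [datetime(2005, 4, 22, 0, 17, 52)],
}


def nearest(captures, target):
    """Pick the capture closest in time to the requested target."""
    return min(captures, key=lambda capture: abs(capture - target))


def walk(pages, start, sticky):
    """Return a list of (memento datetime, drift in days), one per step."""
    target, steps = start, []
    for page in pages:
        memento = nearest(HOLDINGS[page], target)
        # Drift is always measured against the datetime the walk started with.
        drift_days = abs(memento - start).total_seconds() / 86400
        steps.append((memento, drift_days))
        if not sticky:
            target = memento  # sliding: the next request targets what we received
    return steps


def mean_drift(steps):
    """Average drift over the steps after the first."""
    later = [drift for _, drift in steps[1:]]
    return sum(later) / len(later)


pages = ["CS Home", "Science Home", "CS Home"]
start = datetime(2005, 5, 14, 1, 36, 8)
sliding = walk(pages, start, sticky=False)
sticky = walk(pages, start, sticky=True)

for name, steps in (("Sliding", sliding), ("Sticky", sticky)):
    print(name, [round(d, 1) for _, d in steps], round(mean_drift(steps), 1))
```

In the sliding walk the third request targets 2005-04-22, and the 2005-03-31 capture of CS Home is slightly closer to that date than the 2005-05-14 capture, which is why the walk lands further in the past.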
QUESTIONS
How much temporal drift is there with the two policies?
Does the sticky policy reduce drift as expected?
If so, by how much?
How do
• Choice (number of links)
• Domains visited
• Walk length
influence drift?
CONTENTS
• Motivation
• Related work
• Measuring drift
• Results
• Future work & conclusions
RELATED WORK
Control Crawl Data Quality, Future collections
• Spaniol et al. – crawling strategy
• Denev et al. – change rates by MIME type and depth
• Ben Saad et al. – metadata from crawl used to select best results from archive
Our Focus: Existing Data Quality
• Existing collections
• Datetime selection policies
DEFINITIONS
Walk Length: Number of successful steps (HTTP 200 response)
Unique Domains: Number of unique domains (jcdl.org, amazon.com, etc.)
Choice: Number of unique links (calculated per page)
Drift: | target-datetime_1 – Memento-Datetime_i |
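As an illustration of how these four measures could be computed for a single walk, here is a minimal sketch. The `Step` record, the `walk_metrics` function, and the two-label `domain` rule are assumptions made for the example, choice is summed across steps as the running totals later in the deck suggest, and the sample steps are illustrative rather than measured data.

```python
from dataclasses import dataclass
from datetime import datetime
from urllib.parse import urlsplit


@dataclass
class Step:
    uri: str           # original URI visited at this step
    status: int        # HTTP status of the memento request
    links: int         # number of unique links found on the page
    memento: datetime  # Memento-Datetime of the page received


def domain(uri: str) -> str:
    """Naive domain rule (an assumption): keep the last two host labels."""
    host = urlsplit(uri).hostname or ""
    return ".".join(host.split(".")[-2:])


def walk_metrics(steps, target):
    """Summarise one walk against the target datetime of its first step."""
    ok = [s for s in steps if s.status == 200]  # only successful steps count
    drifts = [abs(target - s.memento).total_seconds() / 86400 for s in ok]
    return {
        "walk_length": len(ok),
        "unique_domains": len({domain(s.uri) for s in ok}),
        "choice": sum(s.links for s in ok),
        "mean_drift_days": sum(drifts) / len(drifts) if drifts else 0.0,
    }


# Illustrative walk: two successful steps, then a 404 that ends it.
target = datetime(2005, 5, 14, 1, 36, 8)
steps = [
    Step("http://www.cs.odu.edu/", 200, 48, datetime(2005, 5, 14, 1, 36, 8)),
    Step("http://sci.odu.edu/", 200, 36, datetime(2005, 4, 22, 0, 17, 52)),
    Step("http://www.vtext.com", 404, 0, datetime(2005, 5, 14, 1, 36, 8)),
]
metrics = walk_metrics(steps, target)
print(metrics)
```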
PROCESS BY EXAMPLE
Select a URI
• Random selection of 1 out of 4,000
4,000 sample URIs – same as JCDL 2011 paper
• DMOZ – a reference
• Search Engines – best random sampling
• Bitly – does shortening have an impact?
• Delicious – does popularity have an impact?
“How Much of the Web Is Archived?”
http://arxiv.org/abs/1212.6177
PROCESS BY EXAMPLE
First, select a URI
• Random selection of 1 out of 4,000
Second, download timemap
<http://api.wayback.archive.org/memento/20050507093740/http://www.cs.odu.edu/>; rel="memento"; datetime="Sat, 07 May 2005 09:37:40 GMT",
<http://api.wayback.archive.org/memento/20050514013608/http://www.cs.odu.edu/>; rel="memento"; datetime="Sat, 14 May 2005 01:36:08 GMT",
<http://api.wayback.archive.org/memento/20050515002903/http://www.cs.odu.edu/>; rel="memento"; datetime="Sun, 15 May 2005 00:29:03 GMT",
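A timemap in this link format can be parsed with a regular expression and searched for the memento closest to a target datetime. The sketch below is a simplification, not the authors' parser: it only matches entries whose relation is exactly rel="memento", `parse_timemap` and `closest` are hypothetical names, and the target datetime is an assumption chosen for the example.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# The three timemap entries shown on the slide.
TIMEMAP = (
    '<http://api.wayback.archive.org/memento/20050507093740/http://www.cs.odu.edu/>; '
    'rel="memento"; datetime="Sat, 07 May 2005 09:37:40 GMT", '
    '<http://api.wayback.archive.org/memento/20050514013608/http://www.cs.odu.edu/>; '
    'rel="memento"; datetime="Sat, 14 May 2005 01:36:08 GMT", '
    '<http://api.wayback.archive.org/memento/20050515002903/http://www.cs.odu.edu/>; '
    'rel="memento"; datetime="Sun, 15 May 2005 00:29:03 GMT",'
)

# Simplified entry pattern: <uri>; rel="memento"; datetime="HTTP-date"
ENTRY = re.compile(r'<([^>]+)>;\s*rel="memento";\s*datetime="([^"]+)"')


def parse_timemap(text):
    """Return a list of (memento URI, capture datetime) pairs."""
    return [(uri, parsedate_to_datetime(stamp)) for uri, stamp in ENTRY.findall(text)]


def closest(mementos, target):
    """Pick the memento whose capture datetime is nearest the target."""
    return min(mementos, key=lambda entry: abs(entry[1] - target))


mementos = parse_timemap(TIMEMAP)
target = datetime(2005, 5, 14, tzinfo=timezone.utc)  # assumed target
best_uri, best_datetime = closest(mementos, target)
print(best_uri, best_datetime)
```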
PROCESS BY EXAMPLE
Next, download both mementos
and find common links
Wayback Machine | Memento API
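Finding the links common to the two mementos and choosing one for the next step might look like the sketch below. It is an illustration under assumptions: the HTML snippets are made up, `LinkCollector` and `next_step` are hypothetical names, and the normalisation of archive-rewritten links back to original URIs, which a real implementation would need, is omitted.

```python
import random
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href values of anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.add(href)


def links_in(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def next_step(html_a, html_b, rng):
    """Return (randomly chosen common link or None, number of common links)."""
    common = sorted(links_in(html_a) & links_in(html_b))
    chosen = rng.choice(common) if common else None
    return chosen, len(common)


# Made-up page bodies standing in for the two downloaded mementos.
wayback_html = (
    '<a href="http://www.odu.edu">ODU</a>'
    '<a href="http://sci.odu.edu/">Science</a>'
    '<a href="http://example.org/a">A</a>'
)
api_html = (
    '<a href="http://www.odu.edu">ODU</a>'
    '<a href="http://sci.odu.edu/">Science</a>'
    '<a href="http://example.org/b">B</a>'
)

chosen, choice = next_step(wayback_html, api_html, random.Random(0))
print(chosen, choice)
```

When the intersection is empty, `next_step` returns `None`, which corresponds to the "No Common Links" stop cause.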
STATUS SO FAR
Successful Steps 1
Unique Domains 1
Choice 48
Mean Drift (days) 0.0 WB 0.0 API
PROCESS BY EXAMPLE
Find common links
and select one for the next step
Wayback Machine | Memento API
PROCESS BY EXAMPLE
The timemap is downloaded, the best datetimes are selected, and the memento is downloaded…
Wayback Machine | Memento API
Successful Steps 1 + 1 = 2
Unique Domains 1 + 0 = 1
Choice 48 + 36 = 84
Mean Drift (days) 11.0 WB 11.0 API
PROCESS BY EXAMPLE
Again for http://www.odu.edu
Wayback Machine | Memento API
Successful Steps 2 + 1 = 3
Unique Domains 1 + 0 = 1
Choice 84 + 33 = 117
Mean Drift (days) 14.7 WB 7.4 API
PROCESS BY EXAMPLE
And for http://www.odusports.com
Redirected at acquisition time
HTTP Response:
• 302 Redirect
• Location header
Wayback Machine | Memento API
PROCESS BY EXAMPLE
And for http://odusports.collegesports.com
Wayback Machine | Memento API
Successful Steps 3 + 1 = 4
Unique Domains 1 + 1 = 2
Choice 117 + 77 = 194
Mean Drift (days) 18.2 WB 7.3 API
PROCESS BY EXAMPLE
And for http://www.vtext.com
Wayback Machine | Memento API
Successful Steps 4 + 1 = 5
Unique Domains 2 + 1 = 3
Choice 194 + 14 = 208
Mean Drift (days) 20.3 WB 5.8 API
PROCESS BY EXAMPLE
And 404 stops the walk
HTTP Response:
• 404 Not Found
Wayback Machine | Memento API
Successful Steps 4 + 1 = 5
Unique Domains 2 + 1 = 3
Choice 194 + 14 = 208
Mean Drift (days) 20.3 WB 5.8 API
STOP CAUSES
                     First Step         Subsequent Steps
Stop Cause           Count   Percent    Count    Percent
Timemaps
  HTTP 403              74     1.7%      4,803     9.1%
  HTTP 404           1,327    30.1%     15,850    29.9%
  HTTP 503               0     0.0%         43     0.1%
  Other                  2     0.0%        180     0.3%
Mementos
  HTTP 403              52     1.2%        476     0.9%
  HTTP 404             215     4.9%      3,633     6.8%
  HTTP 503           1,957    44.4%     10,535    19.9%
  Download failed      154     3.5%        589     1.1%
  Not HTML             514    11.7%      2,856     5.4%
  No Common Links        0     0.0%     12,957    24.4%
  Other                117     2.7%      1,128     2.1%
Totals               4,412              53,050
WALKS AND STEPS
Status Total
Walks Attempted 200,000
Unique Walks 53,100
Successful Walks 48,685
Pct. Successful 91.7%
Steps 240,439
Successful Steps 187,371
w/drift > 1yr 6,701
w/drift > 5yrs 111
Successful Steps/Walk 3.8
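The derived rows of this table follow from the counts by simple division. The sketch below recomputes them. It assumes that the percentage is taken over unique walks and that the two drift rows are subsets of the successful steps, which is how the table reads but is not stated explicitly.

```python
# Counts copied from the table.
unique_walks = 53_100
successful_walks = 48_685
successful_steps = 187_371
steps_over_1yr = 6_701
steps_over_5yrs = 111

# Derived values.
pct_successful = 100 * successful_walks / unique_walks
steps_per_walk = successful_steps / successful_walks

# Share of successful steps with large drift (assumed to be subsets).
pct_over_1yr = 100 * steps_over_1yr / successful_steps
pct_over_5yrs = 100 * steps_over_5yrs / successful_steps

print(f"Pct. successful:       {pct_successful:.1f}%")
print(f"Successful steps/walk: {steps_per_walk:.1f}")
print(f"Steps w/drift > 1yr:   {pct_over_1yr:.1f}%")
print(f"Steps w/drift > 5yrs:  {pct_over_5yrs:.2f}%")
```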
WALK LENGTHS
[Figure: occurrences (log scale) vs. walk length]
MEDIAN DRIFT BY STEP
[Figure: median drift (months) vs. step number; series: Sliding, Sticky]
DRIFT BY STEP
[Figure: drift (years) vs. step number; panels: SLIDING POLICY, STICKY POLICY]
DRIFT BY CHOICE
[Figure: mean drift (months) vs. choice; series: Sliding, Sticky]
DRIFT BY DOMAINS
[Figure: mean drift (months) vs. domain count; series: Sliding, Sticky]
FUTURE WORK
Integrate real-world walk patterns
• AlNoamany et al. – Internet Archive logs
• Domains users avoid – link farms, etc.
• Domain clusters
• Self-referencing domains – 101celebrities.com
Check other archives
• Other archives now have Memento API
CONCLUSIONS
30 days less drift using Sticky policy.
Sticky policy controls drift; Sliding policy does not.
BACKUP
WALK LENGTHS
Walk Length    DMOZ   S.Eng.  Delicious   Bitly    Total
1             5,355    1,239      7,139   1,289   15,076
2             3,571      924      4,857     817   10,169
3             1,891      598      3,311     623    6,423
4             1,212      381      2,228     415    4,236
5               791      315      1,588     314    3,008
6               583      232      1,168     259    2,242
7               417      178        877     186    1,658
8               258      153        651     136    1,198
9               187      111        498     108      904
10              144       79        337      79      679
…
20               14       10         36       9       76
…
41-45             6        2         14       2       24
46-50             6        3          6       1       16
MEAN DRIFT BY STEP
[Figure: mean drift (months) vs. step number; series: Sliding, Sticky; markers: ● μ, ○ σ]
SLIDING TARGET
⟹ GET …/20050514013608/http://www.cs.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 200 OK

⟹ GET …/20050514013608/http://sci.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 302 FOUND
   Location: …/20050422001752/http://sci.odu.edu/
⟹ GET …/20050422001752/http://sci.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 200 OK
Drift: 22 Days

⟹ GET …/20050422001752/http://www.cs.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 302 FOUND
   Location: …/20050331091610/http://www.cs.odu.edu/
⟹ GET …/20050331091610/http://www.cs.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 200 OK
Drift: 44 Days
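The sliding behaviour in this exchange comes from the datetime embedded in each archive URI: a link followed from a memento is requested at that memento's 14-digit timestamp. The sketch below illustrates the URI handling only. `split_memento_uri` and `sliding_request` are hypothetical names, the URI layout is assumed from the requests shown, and the elided prefix "…" is kept exactly as it appears on the slide.

```python
import re
from datetime import datetime

# Assumed layout: <prefix>/<14-digit timestamp>/<original URI>
MEMENTO_URI = re.compile(r"^(?P<prefix>.*)/(?P<stamp>\d{14})/(?P<original>.+)$")


def split_memento_uri(uri):
    """Split an archive URI into (prefix, capture datetime, original URI)."""
    match = MEMENTO_URI.match(uri)
    if match is None:
        raise ValueError(f"not a timestamped archive URI: {uri!r}")
    stamp = datetime.strptime(match["stamp"], "%Y%m%d%H%M%S")
    return match["prefix"], stamp, match["original"]


def sliding_request(current_memento_uri, link):
    """Request a link at the datetime of the page it was found on."""
    prefix, stamp, _ = split_memento_uri(current_memento_uri)
    return f"{prefix}/{stamp:%Y%m%d%H%M%S}/{link}"


first = "…/20050514013608/http://www.cs.odu.edu/"
last = "…/20050331091610/http://www.cs.odu.edu/"

# Second request of the exchange, rebuilt from the first memento.
second_request = sliding_request(first, "http://sci.odu.edu/")

# Drift between the first and the last memento of the walk, in days.
_, first_datetime, _ = split_memento_uri(first)
_, last_datetime, _ = split_memento_uri(last)
drift_days = abs(first_datetime - last_datetime).total_seconds() / 86400
print(second_request, round(drift_days))
```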
STICKY TARGET (MEMENTO)
⟹ GET <timegate>/http://www.cs.odu.edu/ HTTP/1.1
   Accept-Datetime: Sat, 10 May 2005 11:21:00 GMT
⟸ HTTP/1.1 302 FOUND
   Location: …/20050514013608/http://www.cs.odu.edu/
⟹ GET …/20050514013608/http://www.cs.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 200 OK

⟹ GET <timegate>/http://sci.odu.edu/ HTTP/1.1
   Accept-Datetime: Sat, 10 May 2005 11:21:00 GMT
⟸ HTTP/1.1 302 FOUND
   Location: …/20050422001752/http://sci.odu.edu/
⟹ GET …/20050422001752/http://sci.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 200 OK
Drift: 22 Days

⟹ GET <timegate>/http://www.cs.odu.edu/ HTTP/1.1
   Accept-Datetime: Sat, 10 May 2005 11:21:00 GMT
⟸ HTTP/1.1 302 FOUND
   Location: …/20050514013608/http://www.cs.odu.edu/
⟹ GET …/20050514013608/http://www.cs.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 200 OK
Drift: 0 Days
TWO BROWSING POLICIES
SLIDING TARGET
Target
• Resource datetime
Drift types
• Memento drift
• Target drift
STICKY TARGET
Target
• Original datetime
Drift type
• Only memento drift
TWO TYPES OF DRIFT
Target Drift
• Drift introduced by changing the target datetime
• | received-datetime – original-datetime |
Memento Drift
• Drift introduced by not having the exact datetime requested available.
• | received-datetime – requested-datetime |
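Applied to the last step of the sliding-target example, the two formulas give the 43.7-day and 21.6-day figures from the drift comparison. The sketch below follows the definitions exactly as written on this slide. The function names are hypothetical, and treating the second memento's datetime as the requested datetime of the third step is an assumption based on how the sliding policy is described.

```python
from datetime import datetime


def days(delta):
    """Absolute length of a time difference, in days."""
    return abs(delta).total_seconds() / 86400


def target_drift(received, original):
    """| received-datetime – original-datetime |"""
    return days(received - original)


def memento_drift(received, requested):
    """| received-datetime – requested-datetime |"""
    return days(received - requested)


original = datetime(2005, 5, 14, 1, 36, 8)    # datetime the walk started with
requested = datetime(2005, 4, 22, 0, 17, 52)  # target after sliding (assumed)
received = datetime(2005, 3, 31, 9, 16, 10)   # memento actually delivered

print(round(target_drift(received, original), 1))
print(round(memento_drift(received, requested), 1))
```

Under the sticky policy the requested datetime never changes, so `requested` equals `original` and the two measures coincide.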