
Evaluating Sliding and Sticky Target Policies by Measuring Temporal Drift in Acyclic Walks Through a Web Archive


EVALUATING SLIDING AND STICKY TARGET POLICIES BY MEASURING TEMPORAL DRIFT IN ACYCLIC WALKS THROUGH A WEB ARCHIVE

SCOTT G. AINSWORTH

MICHAEL L. NELSON

OLD DOMINION UNIVERSITY

COMPUTER SCIENCE

JCDL 2013

JULY 23-25, 2013

INDIANAPOLIS, INDIANA USA

Joint Conference on Digital Libraries (JCDL) 2013

Scott G. Ainsworth • Michael L. Nelson

A FABLE FROM WAYBACK

7/23/13

A long, long time ago…

ODU Computer Science updated its web site…

What did it look like?

May 2005...


WHAT JUST HAPPENED?

WHAT WE EXPECTED: 2005-05-14 @ 01:36:08

WHAT WE GOT: 2005-03-31 @ 09:16:10


SLIDING TARGET

2005-05-14 01:36:08

SLIDING TARGET

2005-04-22 00:17:52

SLIDING TARGET

2005-03-31 09:16:10


STICKY TARGET

What if the target is held steady?

(Enabled by Memento API)


MEMENTO HTTP EXTENSION*

• Adds ability to request a particular date-time
• Enables Sticky Target

Request

GET <timegate>/http://www.cs.odu.edu/ HTTP/1.1
…
Accept-Datetime: Sat, 10 May 2005 11:21:00 GMT
…

Response

HTTP/1.1 200 OK
…
Memento-Datetime: Sat, 14 May 2005 01:36:08 GMT
…

*https://datatracker.ietf.org/doc/draft-vandesompel-memento/
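The two headers in this exchange carry HTTP-style dates, so building and reading them can be sketched with the Python standard library alone. This is a minimal sketch, not the authors' tooling: the function names `accept_datetime_header` and `memento_datetime` are hypothetical, the example datetime is chosen only for illustration, and no network request is made.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_header(target: datetime) -> dict:
    """Build the Accept-Datetime request header for an aware target datetime."""
    # HTTP dates are expressed in GMT, so normalise to UTC first.
    target_utc = target.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(target_utc, usegmt=True)}


def memento_datetime(response_headers: dict) -> datetime:
    """Parse the Memento-Datetime response header into an aware datetime."""
    return parsedate_to_datetime(response_headers["Memento-Datetime"])


# Illustrative values: the response header is the one shown on the slide.
headers = accept_datetime_header(datetime(2005, 5, 14, 1, 36, 8, tzinfo=timezone.utc))
received = memento_datetime({"Memento-Datetime": "Sat, 14 May 2005 01:36:08 GMT"})
```

A client following the sticky policy would send the same `Accept-Datetime` value on every step of a walk and read `Memento-Datetime` from each response to measure drift.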


STICKY TARGET

2005-05-14
2005-05-14 01:36:08

MementoFox Extension

STICKY TARGET

2005-04-22 00:17:52

STICKY TARGET

2005-05-14 01:36:08


DRIFT COMPARISON

                Sliding                                         Sticky
Page            Datetime              Drift                     Datetime              Drift
CS Home         2005-05-14 01:36:08   –                         2005-05-14 01:36:08   –
Science Home    2005-04-22 00:17:52   22.1 days                 2005-04-22 00:17:52   22.1 days
CS Home         2005-03-31 09:16:10   43.7 days (+21.6 days)    2005-05-14 01:36:08   –
Mean                                  32.9 days                                       11.0 days
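The figures in this comparison can be checked with a few lines of arithmetic. The sketch below assumes that drift is measured against the datetime of the first CS Home memento and that the mean is taken over the two steps after the starting page, which is the reading that reproduces the slide's numbers; the helper name `drift_days` is hypothetical.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
SECONDS_PER_DAY = 86400


def drift_days(target: str, memento: str) -> float:
    """Absolute difference between two datetimes, in days."""
    delta = datetime.strptime(target, FMT) - datetime.strptime(memento, FMT)
    return abs(delta.total_seconds()) / SECONDS_PER_DAY


target = "2005-05-14 01:36:08"  # datetime of the first CS Home memento
sliding = ["2005-04-22 00:17:52", "2005-03-31 09:16:10"]  # steps 2 and 3
sticky = ["2005-04-22 00:17:52", "2005-05-14 01:36:08"]   # steps 2 and 3

sliding_drift = [drift_days(target, m) for m in sliding]
sticky_drift = [drift_days(target, m) for m in sticky]

# Mean over the two steps that follow the starting page (assumption).
sliding_mean = sum(sliding_drift) / len(sliding_drift)
sticky_mean = sum(sticky_drift) / len(sticky_drift)
```

Rounded to one decimal place, these values agree with the 22.1, 43.7, 32.9 and 11.0 days reported in the comparison.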


QUESTIONS

How much temporal drift is there with the two policies?

Does the sticky policy reduce drift as expected?

If so, by how much?

How do
• Choice (number of links)
• Domains visited
• Walk length
influence drift?


CONTENTS

Motivation

Related work

Measuring Drift

Results

Future work


RELATED WORK

Control Crawl Data Quality, Future collections
• Spaniol et al. – crawling strategy
• Denev et al. – change rates by MIME type and depth
• Ben Saad et al. – metadata from crawl used to select best results from archive

Our Focus: Existing Data Quality
• Existing collections
• Datetime selection policies


DEFINITIONS

Walk Length: Number of successful steps (HTTP 200 response)

Unique Domains: Number of unique domains (jcdl.org, amazon.com, etc.)

Choice: Number of unique links (calculated per page)

Drift: | target-datetime_1 – Memento-Datetime_i |


PROCESS BY EXAMPLE

Select a URI
• Random selection of 1 out of 4,000

4,000 Sample URIs – same as JCDL 2011 paper
• DMOZ – a reference
• Search Engines – best random sampling
• Bitly – does shortening have an impact?
• Delicious – does popularity have an impact?

“How Much of the Web Is Archived?”
http://arxiv.org/abs/1212.6177


PROCESS BY EXAMPLE

First, select a URI
• Random selection of 1 out of 4,000

Second, download timemap

<http://api.wayback.archive.org/memento/20050507093740/http://www.cs.odu.edu/>; rel="memento"; datetime="Sat, 07 May 2005 09:37:40 GMT",
<http://api.wayback.archive.org/memento/20050514013608/http://www.cs.odu.edu/>; rel="memento"; datetime="Sat, 14 May 2005 01:36:08 GMT",
<http://api.wayback.archive.org/memento/20050515002903/http://www.cs.odu.edu/>; rel="memento"; datetime="Sun, 15 May 2005 00:29:03 GMT",

<http://api.wayback.archive.org/memento/20050514013608/http://www.cs.odu.edu/>; rel="memento"; datetime="Sat, 14 May 2005 01:36:08 GMT",
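Reading a timemap like the excerpt above and picking a memento for a target datetime can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the regular expression only handles entries whose relation is exactly `rel="memento"` (real timemaps also carry other relations), "best" is assumed to mean nearest in time, and the target datetime used here is hypothetical.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Matches one rel="memento" entry of a link-format timemap.
MEMENTO_RE = re.compile(r'<([^>]+)>;\s*rel="memento";\s*datetime="([^"]+)"')


def parse_mementos(timemap: str) -> list:
    """Return (datetime, URI) pairs for every rel="memento" entry."""
    return [(parsedate_to_datetime(dt), uri) for uri, dt in MEMENTO_RE.findall(timemap)]


def closest_memento(mementos: list, target: datetime) -> tuple:
    """Pick the memento whose datetime is nearest to the target (assumed policy)."""
    return min(mementos, key=lambda m: abs(m[0] - target))


BASE = "http://api.wayback.archive.org/memento"
TIMEMAP = (
    f'<{BASE}/20050507093740/http://www.cs.odu.edu/>; rel="memento"; '
    'datetime="Sat, 07 May 2005 09:37:40 GMT", '
    f'<{BASE}/20050514013608/http://www.cs.odu.edu/>; rel="memento"; '
    'datetime="Sat, 14 May 2005 01:36:08 GMT", '
    f'<{BASE}/20050515002903/http://www.cs.odu.edu/>; rel="memento"; '
    'datetime="Sun, 15 May 2005 00:29:03 GMT",'
)

mementos = parse_mementos(TIMEMAP)
# Hypothetical target datetime, chosen only to exercise the selection.
best = closest_memento(mementos, datetime(2005, 5, 13, 12, 0, tzinfo=timezone.utc))
```

With this target, the entry selected is the 2005-05-14 01:36:08 memento.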


PROCESS BY EXAMPLE

Next, download both mementos and find common links

Wayback Machine | Memento API


STATUS SO FAR

Successful Steps 1

Unique Domains 1

Choice 48

Mean Drift (days) 0.0 WB 0.0 API


PROCESS BY EXAMPLE

Find common links and select one for the next step

Wayback Machine | Memento API
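Finding the links shared by the two mementos and choosing one at random can be sketched with the standard-library HTML parser. The class and function names are hypothetical, and the sketch assumes the links have already been normalised to original URIs (archived pages usually rewrite them to archive URIs, which a real implementation would have to undo).

```python
import random
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href values of all <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


def extract_links(html: str) -> set:
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def choose_next_link(html_a: str, html_b: str, rng: random.Random):
    """Return a link present in both pages, or None if there is none."""
    common = sorted(extract_links(html_a) & extract_links(html_b))
    return rng.choice(common) if common else None


# Toy pages standing in for the two downloaded mementos.
page_wb = ('<a href="http://sci.odu.edu/">Science</a> '
           '<a href="http://www.odu.edu/">ODU</a> '
           '<a href="http://example.org/a">A</a>')
page_api = ('<a href="http://sci.odu.edu/">Science</a> '
            '<a href="http://www.odu.edu/">ODU</a>')
next_link = choose_next_link(page_wb, page_api, random.Random(0))
```

A `None` result corresponds to a walk that cannot continue because the two mementos share no links.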


PROCESS BY EXAMPLE

The timemap is downloaded, the best datetimes are selected, and the mementos are downloaded…

Wayback Machine | Memento API

Successful Steps 1 + 1 = 2

Unique Domains 1 + 0 = 1

Choice 48 + 36 = 84

Mean Drift (days) 11.0 WB 11.0 API


PROCESS BY EXAMPLE

Again for http://www.odu.edu

Wayback Machine | Memento API

Successful Steps 2 + 1 = 3

Unique Domains 1 + 0 = 1

Choice 84 + 33 = 117

Mean Drift (days) 14.7 WB 7.4 API


PROCESS BY EXAMPLE

And for http://www.odusports.com

Redirected at acquisition time

HTTP Response:
• 302 Redirect
• Location header

Wayback Machine | Memento API


PROCESS BY EXAMPLE

And for http://odusports.collegesports.com

Wayback Machine | Memento API

Successful Steps 3 + 1 = 4

Unique Domains 1 + 1 = 2

Choice 117 + 77 = 194

Mean Drift (days) 18.2 WB 7.3 API


PROCESS BY EXAMPLE

And for http://www.vtext.com

Wayback Machine | Memento API

Successful Steps 4 + 1 = 5

Unique Domains 2 + 1 = 3

Choice 194 + 14 = 208

Mean Drift (days) 20.3 WB 5.8 API


PROCESS BY EXAMPLE

And 404 stops the walk

HTTP Response:
• 404 Not Found

Wayback Machine | Memento API

Successful Steps 4 + 1 = 5

Unique Domains 2 + 1 = 3

Choice 194 + 14 = 208

Mean Drift (days) 20.3 WB 5.8 API
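The running totals in this example (successful steps, unique domains, choice) can be accumulated with a small data structure. This is a sketch under assumptions: `WalkStats` and `registered_domain` are hypothetical names, the second URI of the walk is assumed to be http://sci.odu.edu/, and the domain is approximated by the last two host labels, which is too naive for suffixes such as co.uk.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


def registered_domain(uri: str) -> str:
    """Naive domain extraction: keep the last two host labels."""
    host = urlsplit(uri).hostname or ""
    return ".".join(host.split(".")[-2:])


@dataclass
class WalkStats:
    successful_steps: int = 0
    choice: int = 0
    domains: set = field(default_factory=set)

    def add_step(self, uri: str, links_on_page: int) -> None:
        """Record one successful step (HTTP 200 response)."""
        self.successful_steps += 1
        self.choice += links_on_page
        self.domains.add(registered_domain(uri))


stats = WalkStats()
for uri, links in [
    ("http://www.cs.odu.edu/", 48),
    ("http://sci.odu.edu/", 36),  # assumed second step
    ("http://www.odu.edu", 33),
    ("http://odusports.collegesports.com", 77),
    ("http://www.vtext.com", 14),
]:
    stats.add_step(uri, links)
```

A step that ends in a 404 is simply never passed to `add_step`, so the totals stay at their last successful values.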


STOP CAUSES

                      First Step          Subsequent Steps
Stop Cause            Count    Percent    Count     Percent

Timemaps
  HTTP 403               74      1.7%      4,803      9.1%
  HTTP 404            1,327     30.1%     15,850     29.9%
  HTTP 503                0      0.0%         43      0.1%
  Other                   2      0.0%        180      0.3%

Mementos
  HTTP 403               52      1.2%        476      0.9%
  HTTP 404              215      4.9%      3,633      6.8%
  HTTP 503            1,957     44.4%     10,535     19.9%
  Download failed       154      3.5%        589      1.1%
  Not HTML              514     11.7%      2,856      5.4%
  No Common Links         0      0.0%     12,957     24.4%
  Other                 117      2.7%      1,128      2.1%

Totals                4,412               53,050


WALKS AND STEPS

Status Total

Walks Attempted 200,000

Unique Walks 53,100

Successful Walks 48,685

Pct. Successful 91.7%

Steps 240,439

Successful Steps 187,371

w/drift > 1yr 6,701

w/drift > 5yrs 111

Successful Steps/Walk 3.8


WALK LENGTHS

[Figure: occurrences (log scale) by walk length]


MEDIAN DRIFT BY STEP

[Figure: median drift (months) by step number; series: Sliding, Sticky]


DRIFT BY STEP

[Figure: two panels, SLIDING POLICY and STICKY POLICY; drift (years) by step number]


DRIFT BY CHOICE

[Figure: mean drift (months) by choice; series: Sliding, Sticky]


DRIFT BY DOMAINS

[Figure: mean drift (months) by domain count; series: Sliding, Sticky]


FUTURE WORK

Integrate real-world walk patterns
• AlNoamany et al. – Internet Archive logs
• Domains users avoid – link farms, etc.
• Domain clusters
• Self referencing domains – 101celebrities.com

Check other archives
• Other archives now have Memento API


CONCLUSIONS

30 days less drift using Sticky policy.

Sticky policy controls drift; Sliding policy does not.


BACKUP


WALK LENGTHS

Walk Length     DMOZ   S.Eng.   Delicious   Bitly    Total
1              5,355    1,239       7,139   1,289   15,076
2              3,571      924       4,857     817   10,169
3              1,891      598       3,311     623    6,423
4              1,212      381       2,228     415    4,236
5                791      315       1,588     314    3,008
6                583      232       1,168     259    2,242
7                417      178         877     186    1,658
8                258      153         651     136    1,198
9                187      111         498     108      904
10               144       79         337      79      679
20                14       10          36       9       76
41-45              6        2          14       2       24
46-50              6        3           6       1       16


MEAN DRIFT BY STEP

[Figure: mean drift (months) by step number; series: Sliding, Sticky; markers: ● μ, ○ σ]


SLIDING TARGET

⟹ GET …/20050514013608/http://www.cs.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 200 OK

⟹ GET …/20050514013608/http://sci.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 302 Found
   Location: …/20050422001752/http://sci.odu.edu/
⟹ GET …/20050422001752/http://sci.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 200 OK
   22 Days

⟹ GET …/20050422001752/http://www.cs.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 302 Found
   Location: …/20050331091610/http://www.cs.odu.edu/
⟹ GET …/20050331091610/http://www.cs.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 200 OK
   44 Days


STICKY TARGET (MEMENTO)

⟹ GET <timegate>/http://www.cs.odu.edu/ HTTP/1.1
   Accept-Datetime: Sat, 10 May 2005 11:21:00 GMT
⟸ HTTP/1.1 302 Found
   Location: …/20050514013608/http://www.cs.odu.edu/
⟹ GET …/20050514013608/http://www.cs.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 200 OK

⟹ GET <timegate>/http://sci.odu.edu/ HTTP/1.1
   Accept-Datetime: Sat, 10 May 2005 11:21:00 GMT
⟸ HTTP/1.1 302 Found
   Location: …/20050422001752/http://sci.odu.edu/
⟹ GET …/20050422001752/http://sci.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 200 OK
   22 Days

⟹ GET <timegate>/http://www.cs.odu.edu/ HTTP/1.1
   Accept-Datetime: Sat, 10 May 2005 11:21:00 GMT
⟸ HTTP/1.1 302 Found
   Location: …/20050514013608/http://www.cs.odu.edu/
⟹ GET …/20050514013608/http://www.cs.odu.edu/ HTTP/1.1
⟸ HTTP/1.1 200 OK
   0 Days


TWO BROWSING POLICIES

SLIDING TARGET

Target
• Resource datetime

Drift types
• Memento drift
• Target drift

STICKY TARGET

Target
• Original datetime

Drift type
• Only memento drift
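The difference between the two policies can be sketched as a short simulation. Everything here is illustrative: `ARCHIVE` is a hypothetical in-memory stand-in for an archive that holds only the mementos named in the fable, `nearest` assumes the archive returns the memento closest in time to the requested datetime, and `walk` is not the authors' implementation.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical in-memory archive: URI -> datetimes of its mementos.
ARCHIVE = {
    "http://www.cs.odu.edu/": ["2005-03-31 09:16:10", "2005-05-14 01:36:08"],
    "http://sci.odu.edu/": ["2005-04-22 00:17:52"],
}


def nearest(uri: str, target: datetime) -> datetime:
    """Return the memento datetime of `uri` closest to `target`."""
    candidates = [datetime.strptime(s, FMT) for s in ARCHIVE[uri]]
    return min(candidates, key=lambda d: abs(d - target))


def walk(path: list, start: datetime, sticky: bool) -> list:
    """Follow `path` and return the memento datetime received at each step."""
    target, received = start, []
    for uri in path:
        memento = nearest(uri, target)
        received.append(memento)
        if not sticky:
            # Sliding: the next request targets the datetime just received.
            target = memento
    return received


PATH = ["http://www.cs.odu.edu/", "http://sci.odu.edu/", "http://www.cs.odu.edu/"]
START = datetime(2005, 5, 14, 1, 36, 8)

sliding_walk = [d.strftime(FMT) for d in walk(PATH, START, sticky=False)]
sticky_walk = [d.strftime(FMT) for d in walk(PATH, START, sticky=True)]
```

Under the sliding policy the walk returns to CS Home at 2005-03-31, while under the sticky policy it returns to the 2005-05-14 memento it started from.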


TWO TYPES OF DRIFT

Target Drift
• Drift introduced by changing the target datetime
• | received-datetime – original-datetime |

Memento Drift
• Drift introduced by not having the exact datetime requested available.
• | received-datetime – requested-datetime |
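The two formulas can be applied to a single step as a worked example. The sketch below uses the third step of the sliding walk from the fable and assumes that the requested datetime at that step is the datetime of the previous memento (2005-04-22 00:17:52); the function names are hypothetical.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def days_between(a: str, b: str) -> float:
    """Absolute difference between two datetimes, in days."""
    delta = datetime.strptime(a, FMT) - datetime.strptime(b, FMT)
    return abs(delta.total_seconds()) / 86400


def drift_components(original: str, requested: str, received: str) -> dict:
    """Split the drift of one step into the two types defined on the slide."""
    return {
        "target_drift": days_between(received, original),
        "memento_drift": days_between(received, requested),
    }


# Third step of the sliding walk (requested datetime is an assumption).
step3 = drift_components(
    original="2005-05-14 01:36:08",
    requested="2005-04-22 00:17:52",
    received="2005-03-31 09:16:10",
)
```

Rounded to one decimal place, this gives 43.7 days measured against the original datetime and 21.6 days measured against the requested datetime, matching the "43.7 days (+21.6 days)" entry in the drift comparison.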
