Using the Web Infrastructure for Real Time Recovery of Missing Web Pages. Dissertation Defense. Martin Klein, [email protected]. Old Dominion University, Norfolk, VA, 07/18/2011. Committee: Michael L. Nelson (Advisor), Yaohang Li, Michele C. Weigle, Mohammad Zubair, Robert Sanderson, Herbert Van de Sompel


Martin Klein's dissertation defense slides.


Page 1: Dissertation Defense

Using the Web Infrastructure for Real Time Recovery of Missing Web Pages

Dissertation Defense

Martin Klein
[email protected]

Old Dominion University
Norfolk, VA
07/18/2011

Committee:
Dr. Michael L. Nelson (Advisor)
Dr. Yaohang Li
Dr. Michele C. Weigle
Dr. Mohammad Zubair
Dr. Robert Sanderson
Dr. Herbert Van de Sompel

Page 2: Dissertation Defense

Agenda

Motivation
Background
TC-DF Correlation
DF Estimation Techniques
LSs for Web Pages
Web Page Titles
Web Page Tags
Link Neighborhood LSs
Book of the Dead
Synchronicity

2

Page 3: Dissertation Defense

The Problem

3

Page 4: Dissertation Defense

The Problem - 404 Errors

• Expected lifetime of a web page is 44 days [Kahle1997]

• URIs inaccessible in CS papers: 23%-53% [Lawrence2001]

• Inaccessible web pages: 67% after 4 years [Koehler2002]

• Inaccessible objects in DLs: 3% [Nelson2002]

• URIs inaccessible in high IF journals: 3.8% after 3 months; 13% after 27 months [Dellavalle2003]

• URIs inaccessible in D-Lib Magazine: ~30% [McCown2005]

• URIs inaccessible (and not archived) in scholarly articles: ~25% [Sanderson2011]

4

Page 5: Dissertation Defense

The Problem - 404 Errors

• Are they really gone? Or just relocated?
• Has anybody crawled and indexed it?
• Do Google, Yahoo!, Bing have a copy of the page?
• Has the page been archived by a web archive?
• Information retrieval techniques needed to (re-)discover content

5

Page 6: Dissertation Defense

The Solution?

• Search engines
  • Requires knowledge about content
  • Problem with homographs (jaguar, present, lead, M/mobile, etc.)
  • Problem with very frequent terms/names (Michael Nelson, Eric Miller, etc.)
• Web archives
  • Helps for an apple pie recipe but not, e.g., for the web page of a transferred faculty member

6

Page 7: Dissertation Defense

Content Similarity

JCDL 2005
http://www.jcdl2005.org/

July 2005
http://www.jcdl2005.org/

Today

7

Page 8: Dissertation Defense

Content Similarity

Hypertext 2006
http://www.ht06.org/

August 2006
http://www.ht06.org/

Today

8

Page 9: Dissertation Defense

Content Similarity

PSP 2003
http://www.pspcentral.org/events/annual_meeting_2003.html
http://www.pspcentral.org/events/archive/annual_meeting_2003.html

August 2003 Today

9

Page 10: Dissertation Defense

Content Similarity

ECDL 1999

http://www-rocq.inria.fr/EuroDL99/
http://www.informatik.uni-trier.de/~ley/db/conf/ercimdl/ercimdl99.html

October 1999 Today

10

Page 11: Dissertation Defense

Content Similarity

Greynet 1999
http://www.konbib.nl/infolev/greynet/2.5.htm

1999 Today

? ?

11

Page 12: Dissertation Defense

Research Questions (1)

1. Based on the WI, can we use content- and link structure based methods to (re-)discover missing web pages in real time?

Investigated methods:
a) Lexical signatures
b) Titles
c) Tags
d) Link neighborhood lexical signatures

12

Page 13: Dissertation Defense

Research Questions (2)

2. What are the optimal characteristics of these methods (age, length, etc) with respect to retrieval performance?

3. Can we improve the performance by consolidating two or more methods?

4. Can we have a real-world implementation and evaluation of the above?

13

Page 14: Dissertation Defense

Agenda


Page 15: Dissertation Defense

Memento, Web Infrastructure (WI)

15

Page 16: Dissertation Defense

Lexical Signatures (LSs)

• First introduced by Phelps and Wilensky [Phelps2000]

• Small set of terms capturing “aboutness” of a document, “lightweight” metadata

[Figure: Resource (10,000 terms), Abstract (200 terms), Lexical Signature (5 terms): Removal, Hit, Rate, Proxy, Cache]

16

Page 17: Dissertation Defense

• Following TF-IDF scheme first introduced by Spärck Jones and Robertson [Jones1973]
• Term frequency (TF):
  • “How often does this word appear in this document?”
• Inverse document frequency (IDF):
  • “In how many documents does this word appear?”

Lexical Signature Generation
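The TF and IDF questions above can be combined into a k-term lexical signature roughly as follows. This is a minimal sketch, not the method used in the dissertation: the whitespace tokenizer, the `log(N / DF)` form of IDF, and the names `lexical_signature`, `doc_freq`, and `corpus_size` are assumptions made for illustration.

```python
import math
from collections import Counter


def lexical_signature(text, doc_freq, corpus_size, k=5):
    """Return the k terms of `text` with the highest TF-IDF score.

    doc_freq maps a term to the number of documents containing it
    (hypothetical input; the slides obtain DF from a local collection,
    search engine result counts, or N-Grams).
    """
    terms = text.lower().split()  # naive whitespace tokenizer (assumption)
    tf = Counter(terms)           # term frequency within this document
    scores = {}
    for term, count in tf.items():
        # Terms without a known DF are treated as occurring in one document.
        df = max(doc_freq.get(term, 1), 1)
        scores[term] = count * math.log(corpus_size / df)
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

Rare terms that occur often in the page rise to the top, while very common terms receive an IDF close to zero and drop out of the signature.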

17

Page 18: Dissertation Defense

Rank/Results   URL                       LS
1/1,930        http://www.jcdl2005.org   jcdl2005 libraries conference cyberinfrastructure jcdl
1/24,100       http://www.norfolk.gov    norfolk city council rfps nauticus
1/185          http://library.lanl.gov   lanl library ldrd alamos oppie
2/738,000      http://www.usopen.org     open us ashe tickets usta

Lexical Signatures -- Examples

18

Page 19: Dissertation Defense

Agenda


A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web (WIDM 2008)

Page 20: Dissertation Defense

Accurate IDF Values for LSs

Screen scraping the Google web interface

20

Page 21: Dissertation Defense

The Dataset

Local universe consisting of copies of URIs from the Internet Archive between 1996 and 2007

21

Page 22: Dissertation Defense

• Use IDF values obtained from
  1. Local collection of web pages
  2. “Screen scraping” SE result pages
• Validate both methods against a baseline
  • Google N-Grams

Note: N-Grams provide term count (TC) and not DF values – ask me for details

The Idea

22

Page 23: Dissertation Defense

Based on all 3 methods
URL: http://www.perfect10wines.com
Year: 2007
Union: 12 unique terms

LSs Example

23

Page 24: Dissertation Defense

1. Normalized term overlap
  • Assume term commutativity
  • k-term LSs normalized by k
2. Kendall Tau
  • Modified version since LSs to compare may contain different terms
3. M-Score
  • Penalizes discordance in higher ranks

Comparing LSs
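The first of the three measures, normalized term overlap, can be sketched as below. This is an illustrative reading of the slide, assuming "term commutativity" means that term order is ignored and that k is the length of the longer signature; the function name `normalized_overlap` is hypothetical.

```python
def normalized_overlap(ls_a, ls_b):
    """Share of terms two lexical signatures have in common, ignoring order."""
    k = max(len(ls_a), len(ls_b))  # normalize by k (assumed: the longer LS)
    if k == 0:
        return 0.0
    return len(set(ls_a) & set(ls_b)) / k
```

A value of 1 means both signatures contain the same terms, 0 means they share none. The modified Kendall Tau and the M-Score additionally take rank order into account and are not covered by this sketch.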

24

Page 25: Dissertation Defense

Top 5, 10 and 15 terms

LC – local universe

SC – screen scraping

NG – N-Grams

Comparing LSs

25

Page 26: Dissertation Defense

• Both methods for the computation of IDF values provide accurate results
  • Compared to the Google N-Gram baseline
• Screen scraping method seems preferable
  • Similarity scores are slightly higher
  • Feasible in real time!!!

Contribution: Established well performing IDF estimation technique.

Conclusions

26

Page 27: Dissertation Defense

Agenda


Revisiting Lexical Signatures to (Re-)Discover Web Pages (ECDL 2008)

Page 28: Dissertation Defense

The Idea

Evaluate evolution of LSs over time:
• Generate LSs of URIs (from local universe mentioned above) over time
• Conduct overlap analysis

• Neither Phelps and Wilensky nor Park et al. [Park2004] did that
  • Park et al. just re-confirmed their findings after 6 months

28

Page 29: Dissertation Defense

10-term LSs generated for http://www.perfect10wines.com

LSs Over Time - Example

29

Page 30: Dissertation Defense

LS Overlap Analysis

Rooted: overlap between the LS of the year of the first observation in the IA and all LSs of the consecutive years that URI has been observed

Sliding: overlap between two LSs of consecutive years starting with the first year and ending with the last
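The two overlap schemes can be illustrated with a short sketch. It assumes one LS per year stored in a dictionary and uses simple normalized term overlap as the similarity; the names `rooted_overlap`, `sliding_overlap`, and `ls_by_year` are hypothetical, and the input is expected to contain at least one year.

```python
def overlap(ls_a, ls_b):
    """Normalized term overlap of two lexical signatures (order ignored)."""
    k = max(len(ls_a), len(ls_b))
    return len(set(ls_a) & set(ls_b)) / k if k else 0.0


def rooted_overlap(ls_by_year):
    """Compare the LS of the first observed year with every later year."""
    years = sorted(ls_by_year)
    first = ls_by_year[years[0]]
    return {year: overlap(first, ls_by_year[year]) for year in years[1:]}


def sliding_overlap(ls_by_year):
    """Compare the LSs of each pair of consecutive observed years."""
    years = sorted(ls_by_year)
    return {
        (a, b): overlap(ls_by_year[a], ls_by_year[b])
        for a, b in zip(years, years[1:])
    }
```

Rooted overlap shows how far a page has drifted from its first archived state, while sliding overlap shows how much it changes from one year to the next.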

30

Page 31: Dissertation Defense

Evolution of LSs over Time

Results:
• Little overlap between the early years and more recent ones
• Highest overlap in the first 1-2 years after creation of the LS
• Rarely peaks after that – once terms are gone, they do not return

Rooted

31

Page 32: Dissertation Defense

Evolution of LSs over Time

Results:
• Overlap increases over time
• Seem to reach steady state around 2003

Sliding

32

Page 33: Dissertation Defense

Performance of LSs

Idea:
• Query LSs against Google search API
• Identify URI in result set

• For each URI it is possible that:
  1. URI is returned as the top ranked result
  2. URI is ranked somewhere between 2 and 10
  3. URI is ranked somewhere between 11 and 100
  4. URI is ranked somewhere beyond rank 100 (considered as not returned)

33
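The four retrieval cases map directly onto a small classification function. The sketch below assumes the rank is a 1-based position in the result list, or `None` when the URI does not appear at all; the function name and the case labels are hypothetical.

```python
def retrieval_case(rank):
    """Map the rank of the target URI to one of the four retrieval cases."""
    if rank is None or rank > 100:
        return "undiscovered"  # beyond rank 100 counts as not returned
    if rank == 1:
        return "top"
    if rank <= 10:
        return "top10"
    return "top100"
```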

Page 34: Dissertation Defense

Performance of LSs wrt Length

Results:
• 2-, 3- and 4-term LSs perform poorly
• 5-, 6- and 7-term LSs seem best
  • Top mean rank (MR) value with 5 terms
  • Most top ranked with 7 terms
  • Binary pattern: either in top 10 or undiscovered
• 8 terms and beyond do not show improvement

34

Page 35: Dissertation Defense

nDCG for LSs consisting of 2-15 terms (mean over all years)

Performance of LSs wrt Length

35

Page 36: Dissertation Defense

Performance of LSs over Time

nDCG for LSs consisting of 2, 5, 7 and 10 terms

36

Page 37: Dissertation Defense

• LSs decay over time
  • Rooted: quickly after generation
  • Sliding: seem to stabilize
• LSs older than 5 years perform poorly
• 5-, 6- and 7-term LSs seem to perform best
  • 7 – most top ranked
  • 5 – lowest mean rank
• 2..4 as well as 8+ term LSs are insufficient

Contribution: Determined age and length limits for LSs.

Conclusions

37

Page 38: Dissertation Defense

Agenda


Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure (JCDL 2010)

Page 39: Dissertation Defense

The Problem

Internet Archive - Wayback Machine

www.aircharter-international.com
http://web.archive.org/web/*/http://www.aircharter-international.com

59 copies

Lexical Signature (TF/IDF)
Charter Aircraft Cargo Passenger Jet Air Enquiry

Title
ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International

39

Page 40: Dissertation Defense

The Problem

www.aircharter-international.com

Lexical Signature (TF/IDF)
Charter Aircraft Cargo Passenger Jet Air Enquiry

The Problem

40

Page 41: Dissertation Defense

The Problem

www.aircharter-international.com

Title
ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International

The Problem

41

Page 42: Dissertation Defense

Contributions

• Compare performance of two automated methods to rediscover web pages

1. Lexical signatures (LSs)

2. Titles

• Evaluate performance of combination of methods and suggest workflow for real time web page rediscovery

The Idea

42

Page 43: Dissertation Defense

LS Retrieval Performance

5- and 7-Term LSs

LS Retrieval Performance

• Yahoo! returns most URIs top ranked and leaves least undiscovered

• Binary retrieval pattern, URI either within top 10 or undiscovered

43

Page 44: Dissertation Defense

Title Retrieval Performance

Non-Quoted and Quoted Titles

Title Retrieval Performance

• Results at least as good as for LSs

• Google and Yahoo! return more URIs for non-quoted titles

• Same binary retrieval pattern

44

Page 45: Dissertation Defense

Combination of Methods

Top Results for Combination of Methods

             Google   Yahoo!   MSN Live
LS5-TI       65.0     73.8     71.5
LS7-TI       70.9     75.7     73.8
TI-LS5       73.5     75.7     73.1
TI-LS7       74.1     75.1     74.1
LS5-TI-LS7   65.4     73.8     72.5
LS7-TI-LS5   71.2     76.4     74.4
TI-LS5-LS7   73.8     75.7     74.1
TI-LS7-LS5   74.4     75.7     74.8
LS5-LS7      52.8     68.0     64.4
LS7-LS5      59.9     71.5     66.7

45

Page 46: Dissertation Defense

Conclusions

• LSs and titles are suitable as search engine queries
  • Return 50%-70% URIs top ranked

BUT

• Titles are cheaper to obtain, hence
  • Preferred primary method
  • 5-term LSs secondary method
  • Results in 75% top ranked URIs

Contributions: Provided evidence for suitability of titles and introduced web page discovery framework.

46

Page 47: Dissertation Defense

Agenda


Is This a Good Title? (Hypertext 2010)

Page 48: Dissertation Defense

The Problem

http://www.drbartell.com/

Lexical Signature (TF/IDF)
Plastic Surgeon Reconstructive Dr Bartell Symbol University

???

The Problem

48

Page 49: Dissertation Defense

The Problem

http://www.drbartell.com/

Title
Thomas Bartell MD Board-Certified - Cosmetic Plastic Reconstructive Surgery

The Problem

49

Page 50: Dissertation Defense

The Problem

www.reagan.navy.mil

Lexical Signature (TF/IDF)
Ronald USS MCSN Torrey Naval Sea Commanding

The Problem

50

Page 51: Dissertation Defense

The Problem

Title
Home Page ???

www.reagan.navy.mil

Is This a Good Title?

The Problem

51

Page 52: Dissertation Defense

Contributions

• Display title evolution over time

• Compare to content evolution

• “Normalize” time as fixed size windows

• Provide prediction model for title’s retrieval potential

The Idea

52

Page 53: Dissertation Defense

Title (and LS) Retrieval Performance

Titles 5- and 7-Term LSs

Title and LS Retrieval Performance

• Titles return more than 60% URIs top ranked
• Binary retrieval pattern, URI either within top 10 or undiscovered

53

Page 54: Dissertation Defense

Title Evolution – Example I

www.sun.com/solutions

1998-01-27  Sun Software Products Selector Guides - Solutions Tree
1999-02-20  Sun Software Solutions
2002-02-01  Sun Microsystems Products
2002-06-01  Sun Microsystems - Business & Industry Solutions
2003-08-01  Sun Microsystems - Industry & Infrastructure Solutions Sun Solutions
2004-02-02  Sun Microsystems – Solutions
2004-06-10  Gateway Page - Sun Solutions
2006-01-09  Sun Microsystems Solutions & Services
2007-01-03  Services & Solutions
2007-02-07  Sun Services & Solutions
2008-01-19  Sun Solutions

54

Page 55: Dissertation Defense

Title Evolution – Example II

www.datacity.com/mainf.html

2000-06-19  DataCity of Manassas Park Main Page
2000-10-12  DataCity of Manassas Park sells Custom Built Computers & Removable Hard Drives
2001-08-21  DataCity a computer company in Manassas Park sells Custom Built Computers & Removable Hard Drives
2002-10-16  computer company in Manassas Virginia sells Custom Built Computers with Removable Hard Drives Kits and Iomega 2GB Jaz Drives (jazz drives) October 2002 DataCity 800-326-5051 toll free
2006-03-14  Est 1989 Computer company in Stafford Virginia sells Custom Built Secure Computers with DoD 5200.1-R Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite; lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB

55

Page 56: Dissertation Defense

How much do titles change over time?

Title Evolution Over Time

• Copies from fixed size time windows per year
• Extract available titles of past 14 years
• Compute normalized Levenshtein edit distance between titles of copies and baseline (today) (0 = identical; 1 = completely dissimilar)

56
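A normalized Levenshtein distance with the stated range (0 = identical, 1 = completely dissimilar) can be sketched as follows. Dividing by the length of the longer title is an assumption, since the slides do not state the normalization; the function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer string (assumption)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

Applied to each archived title against today's title, this yields one value per copy that can be binned per time window.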

Page 57: Dissertation Defense

Title Evolution Over Time

Title edit distance frequencies

Title Evolution Over Time

• Half the titles of available copies from recent years are (close to) identical

• Decay from 2005 on (with fewer copies available)

• 4 year old title: 40% chance to be unchanged

57

Page 58: Dissertation Defense

Title Evolution Over Time

Title vs Document

[0,1] - over 1600 times

[0,0] - 122 times

Title Evolution Over Time

• Y: avg shingle value for all copies per URI

• X: avg edit distance of corresponding titles

• overlap indicated by: green: <10, red: >90

• Semi-transparent: total amount of points plotted

58

Page 59: Dissertation Defense

Title Performance Prediction

home, index, home page, welcome, untitled document

The performance of any given title can be predicted as insufficient if 75% or more of it consists of a “Stop Title”!

Title Performance Prediction

• Quality prediction of title by
  • Number of nouns, articles etc.
  • Amount of title terms, characters [Ntoulas2006]

• Observation of re-occurring terms in poorly performing titles - “Stop Titles”
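The 75% rule can be sketched as a simple check. This is one possible reading, assuming the share is measured over whitespace-separated title words and that every word of a listed stop title counts as a stop word; the list holds only the examples from the slide, and the function names are hypothetical.

```python
# Example stop titles taken from the slide; the full list is not given there.
STOP_TITLES = ["home", "index", "home page", "welcome", "untitled document"]


def stop_title_ratio(title):
    """Fraction of title words that belong to a stop title (assumed measure)."""
    words = title.lower().split()
    if not words:
        return 0.0
    stop_words = {word for phrase in STOP_TITLES for word in phrase.split()}
    return sum(word in stop_words for word in words) / len(words)


def predicted_insufficient(title, threshold=0.75):
    """Predict poor retrieval performance if the stop share reaches 75%."""
    return stop_title_ratio(title) >= threshold
```

A title such as "Home Page" is flagged, while a descriptive title with no stop words passes. Punctuation handling is left out of this sketch.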

59

Page 60: Dissertation Defense

Conclusions

• Titles change more slowly and less significantly over time than web page content
• Not all titles equally good
  • If the majority of title terms are Stop Titles, its quality can be predicted to be poor

Contribution: Quantified title evolution and introduced stop titles.

60

Page 61: Dissertation Defense

Agenda


Find, New, Copy, Web, Page - Tagging for the (Re-)Discovery of Web Pages (TPDL 2011)

Page 62: Dissertation Defense

The Problem

We have seen that we have a good chance to rediscover missing pages with

• Lexical signatures
• Titles

BUT

What if no archived/cached copy can be found?

The Problem

62

Page 63: Dissertation Defense

The Solution?

Tags: Conferences, Digitallibraries, Conference, Library, Jcdl2005

63

Page 64: Dissertation Defense

The Idea

• Experimental evaluation of tag based query length cf. 5- or 7-term LSs

• Test combination of methods to improve retrieval performance

• Investigate “descriptive” power of tags

64

Page 65: Dissertation Defense

The Experiment

• Tags queried against the Yahoo! BOSS API
• Same four retrieval cases introduced earlier
• nDCG w/ binary relevance scoring
• Mean Average Precision
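With binary relevance and a single target URI, nDCG reduces to a function of the rank at which the URI is returned. The sketch below assumes the common `1 / log2(rank + 1)` discount and a cutoff of 100 results in line with the four retrieval cases; the slides do not state the exact discount, so treat the formula and the name `ndcg_single_target` as assumptions.

```python
import math


def ndcg_single_target(rank, cutoff=100):
    """nDCG for one relevant URI: 1.0 at rank 1, decaying with the rank.

    The ideal ranking places the target at rank 1 (DCG = 1), so the
    normalized score equals the discounted gain at the observed rank.
    """
    if rank is None or rank > cutoff:
        return 0.0  # undiscovered
    return 1.0 / math.log2(rank + 1)
```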

65

Page 66: Dissertation Defense

The Experiment

Combining methods

66

Page 67: Dissertation Defense

The Problem

• Fact:
  • ~50% of tags do not occur in page [Bischoff2008]
• “Secret”:
  • ~50% of tags do not occur in current version of page
  • ergo: How about previous versions?

The Experiment

67

Page 68: Dissertation Defense

The Problem

• 3,306 URIs w/ older copies
• 66.3% of our tags do not occur in page
• 4.9% of tags occur in previous version of page: Ghost Tags
  • represent a previous version better than the current one
• What kind of tags are these?
  • Important to the document, to the Delicious user?

Ghost Tags
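The notion of a ghost tag, a tag absent from the current page but present in an earlier copy, can be made concrete with a small sketch. It assumes plain-text page versions, whitespace tokenization, and exact lowercase matching; `classify_tags` and its parameters are hypothetical names.

```python
def classify_tags(tags, current_text, previous_texts):
    """Split tags into present, ghost, and absent with respect to a page."""
    current = set(current_text.lower().split())
    previous = set()
    for text in previous_texts:  # e.g. archived copies of the page
        previous.update(text.lower().split())

    result = {"present": [], "ghost": [], "absent": []}
    for tag in tags:
        term = tag.lower()
        if term in current:
            result["present"].append(tag)
        elif term in previous:
            result["ghost"].append(tag)  # only found in an older version
        else:
            result["absent"].append(tag)
    return result
```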

68

Page 69: Dissertation Defense

Ghost Tags

Document importance: TF rank

User importance: Delicious rank

Normalized rank: 0 - top, 1 - bottom

69

Page 70: Dissertation Defense

Conclusions

• Tags can be used for search (if available)
• Combining tags with titles and LSs gains URIs
• Ghost Tags exist!
  • 1/3 of them are important to the page and user

Contributions: Added tags to web page discovery framework and introduced notion of Ghost Tags.

70

Page 71: Dissertation Defense

Agenda


Rediscovering Missing Web Pages Using Link Neighborhood Lexical Signatures (JCDL 2011)

Page 72: Dissertation Defense

The Problem

We have seen that we have a good chance to rediscover missing pages with

• Lexical signatures
• Titles

BUT

What if no archived/cached copy can be found?

Plan A: Tags

The Problem

72

Page 73: Dissertation Defense

Plan B

Link neighborhood Lexical Signatures (LNLSs)

[Figure: terms extracted from the link neighborhood (Computer, Dominion, Norfolk, Monarch) indicate what the missing page is about]

73

Page 74: Dissertation Defense

The Idea

• Determine for well performing LNLS:
  • Length
  • Number of backlinks
  • Backlink levels
  • Radius of terms on backlink page

74

Page 75: Dissertation Defense

The Radius on a Backlink Page

Paragraph

Entire page

Anchor text

75

Page 76: Dissertation Defense

The Dataset

• 309 URIs
  • 28,325 first level
  • 306,700 second level backlinks
• Filter for language, file type, etc.
  • 12% discarded
• Lexical signature generation
  • IDF values from Yahoo!
  • 1..7 and 10 terms
• Query Yahoo! API
• Compute “goodness” (nDCG)

76

Page 77: Dissertation Defense

The Results

[Figure: results by level-radius-rank for 1st and 2nd level backlinks, with the “better” direction marked]

77

Page 78: Dissertation Defense

The Results – Radius

level-radius-rank

All Radii

78

Page 79: Dissertation Defense

The Results – Backlink Rank

level-radius-rank

Ranks: 10, 100, 1000

79

Page 80: Dissertation Defense

The Results – In Numbers

1-anchor-1000

WINNER

1-anchor-10

GOOD

80

Page 81: Dissertation Defense

Conclusions

• Optimal link neighborhood lexical signatures:
  • Contain 4 terms
  • Parsed from top 10 backlink pages
  • Include first backlink level only
  • Consider anchor text only

Contributions: Added LNLS to web page discovery framework.

81

Page 82: Dissertation Defense

Agenda


Synchronicity – Automatically Rediscover Missing Web Pages in Real Time (JCDL 2011)

Page 83: Dissertation Defense

Synchronicity

• Firefox add-on
• Triggers on 404 error
• Rediscover page via:
  • Memento
  • Title
  • Lexical signature
  • Tags
  • Link neighborhood lexical signature
  • URI modification
• http://bit.ly/no-more-404

83

Page 84: Dissertation Defense

Contributions

1. Introduce reliable real-time approach to estimate IDF values
2. Workflow for generation of well performing lexical signatures
3. Performance evaluation of web page titles
4. Investigation of tags for web page discovery
5. Analysis of link neighborhood lexical signatures and their optimal parameters
6. Introduce Synchronicity implementing the entire framework

84

Page 85: Dissertation Defense

Concluding Remarks

85

Page 86: Dissertation Defense

Next Stop… New Mexico

86

Page 87: Dissertation Defense

List of my Relevant Publications

1. M.Klein, M.L.Nelson, “A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web”, WIDM 2008, pp. 39-46
2. M.Klein, M.L.Nelson, “Revisiting Lexical Signatures to (Re-)Discover Web Pages”, ECDL 2008, pp. 371-382
3. M.Klein, M.L.Nelson, “Correlation of Term Count and Document Frequency for Google N-Grams”, ECIR 2009, pp. 620-627
4. M.Klein, M.L.Nelson, “Inter-Search Engine Lexical Signature Performance”, JCDL 2009, pp. 413-414
5. M.Klein, M.L.Nelson, “Investigating the Change of Web Pages Titles Over Time”, InDP 2009
6. M.Klein, J.Shipman, M.L.Nelson, “Is This a Good Title?”, Hypertext 2010, pp. 3-12
7. M.Klein, M.L.Nelson, “Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure”, JCDL 2010, pp. 59-68
8. M.Klein, J.Ware, M.L.Nelson, “Rediscovering Missing Web Pages Using Link Neighborhood Lexical Signatures”, JCDL 2011
9. M.Klein, M.Aly, M.L.Nelson, “Synchronicity - Automatically Rediscover Missing Web Pages in Real Time”, JCDL 2011
10. M.Klein, M.L.Nelson, “Find, New, Copy, Web, Page – Tagging for the (Re-)Discovery of Web Pages”, TPDL 2011, to appear

87

Page 88: Dissertation Defense

References

[Bischoff2008] K.Bischoff, C.Firan, W.Nejdl, R.Paiu, “Can All Tags Be Used for Search?”, Proceedings of CIKM '08, pp. 193-202, 2008
[Dellavalle2003] R.P.Dellavalle, E.J.Hester, L.F.Heilig, A.L.Drake, J.W.Kuntzman, M.Graber, L.M.Schilling, “Information Science: Going, Going, Gone: Lost Internet References”, Science 302(5646), pp. 787-788, 2003
[Jones1973] K.Spärck Jones, “Index Term Weighting”, Information Storage and Retrieval, pp. 619-633, 1973
[Kahle1997] B.Kahle, “Preserving the Internet”, Scientific American 276, pp. 82-83, 1997
[Koehler2002] W.C.Koehler, “Web Page Change and Persistence - A Four-Year Longitudinal Study”, JASIST 53(2), pp. 162-171, 2002
[Lawrence2001] S.Lawrence, D.M.Pennock, G.W.Flake, R.Krovetz, F.M.Coetzee, E.Glover, F.A.Nielsen, A.Kruger, C.L.Giles, “Persistence of Web References in Scientific Research”, Computer 34(2), pp. 26-31, 2001
[McCown2005] F.McCown, S.Chan, M.L.Nelson, J.Bollen, “The Availability and Persistence of Web References in D-Lib Magazine”, Proceedings of IWAW '05, 2005
[Nelson2002] M.L.Nelson, B.D.Allen, “Object Persistence and Availability in Digital Libraries”, D-Lib Magazine 8(1), 2002
[Ntoulas2006] A.Ntoulas, M.Najork, M.Manasse, D.Fetterly, “Detecting Spam Web Pages Through Content Analysis”, Proceedings of WWW '06, pp. 83-92, 2006
[Park2004] S.T.Park, D.M.Pennock, C.L.Giles, R.Krovetz, “Analysis of Lexical Signatures for Improving Information Persistence on the World Wide Web”, TOIS 22(4), pp. 540-572, 2004
[Phelps2000] T.A.Phelps, R.Wilensky, “Robust Hyperlinks Cost Just Five Words Each”, technical report, UC Berkeley, 2000
[Sanderson2011] R.Sanderson, M.Phillips, H.Van de Sompel, “Analyzing the Persistence of Referenced Web Resources with Memento”, Proceedings of OR '11, 2011

88

Page 89: Dissertation Defense

Using the Web Infrastructure for Real Time Recovery of Missing Web Pages

Martin Klein
[email protected]

http://www.cs.odu.edu/~mklein/

Page 90: Dissertation Defense

Backup Slides

Page 91: Dissertation Defense

Future Work

91

• “Story Telling” with Memento
• Find more Stop Titles
• Find more Ghost Tags
• Identify “Stop Anchors”
• Synchronicity 1.0
  • Web service
  • CMD line tool

Page 92: Dissertation Defense

Agenda


Correlation of Term Count and Document Frequency for Google N-Grams (ECIR 2009)

Page 93: Dissertation Defense

• Need for a reliable source to accurately compute IDF values of web pages (in real time)
• Shown that screen scraping works, but
  • validation of the baseline (Google N-Grams) is missing
  • N-Grams seem suitable (recently created, based on web pages) but provide TC and not DF. What is their relationship?

The Problem

93

Page 94: Dissertation Defense

94

Background

• Google N-grams provide term count (TC) values

D1 = “Please, Please Me”
D2 = “Can’t Buy Me Love”
D3 = “All You Need Is Love”
D4 = “Long, Long, Long”

Term   All   Buy   Can’t   Is   Love   Me   Need   Please   You   Long
TC     1     1     1       1    2      2    1      2        1     3
DF     1     1     1       1    2      2    1      1        1     1

TC >= DF, but is there a correlation?
Can we use TC to estimate DF?
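The difference between term count and document frequency in the four-document example can be reproduced with a few lines. This is an illustrative sketch: the regular-expression tokenizer and the function name `term_count_and_doc_freq` are assumptions, and the apostrophe in "Can't" is written as a plain ASCII character.

```python
import re
from collections import Counter


def term_count_and_doc_freq(documents):
    """Return (TC, DF) counters over a list of document strings."""
    tc, df = Counter(), Counter()
    for doc in documents:
        terms = re.findall(r"[a-z']+", doc.lower())
        tc.update(terms)       # every occurrence counts towards TC
        df.update(set(terms))  # each document counts once towards DF
    return tc, df


docs = [
    "Please, Please Me",
    "Can't Buy Me Love",
    "All You Need Is Love",
    "Long, Long, Long",
]
tc, df = term_count_and_doc_freq(docs)
```

Here "long" and "please" have a higher TC than DF because they repeat within a single document, while "love" and "me" reach the same value for both because they occur once in each of two documents.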

Page 95: Dissertation Defense

95

Experiment Results

Investigate correlation between TC and DFwithin “Web as Corpus” (WaC)

Rank similarity of all terms

Page 96: Dissertation Defense

96

Experiment Results

Investigate correlation between TC and DFwithin “Web as Corpus” (WaC)

Spearman’s ρ and Kendall τ

Page 97: Dissertation Defense

97

Experiment Results

Top 10 terms in decreasing order of their TF/IDF values taken from http://ecir09.irit.fr

Rank   WaC-DF       WaC-TC       Google       N-Grams
1      IR           IR           IR           IR
2      RETRIEVAL    RETRIEVAL    RETRIEVAL    IRSG
3      IRSG         IRSG         IRSG         RETRIEVAL
4      BCS          IRIT         CONFERENCE   BCS
5      IRIT         BCS          BCS          EUROPEAN
6      CONFERENCE   2009         GRANT        CONFERENCE
7      GOOGLE       FILTERING    IRIT         IRIT
8      2009         GOOGLE       FILTERING    GOOGLE
9      FILTERING    CONFERENCE   EUROPEAN     ACM
10     GRANT        ARIA         PAPERS       GRANT

Google: screen scraping DF values from the Google web interface

∪ = 14, ∩ = 6

Strong indicator that TC can be used to estimate DF for web pages!

Page 98: Dissertation Defense

98

Experiment Results

Show similarity between WaC based TC and Google N-Gram based TC

TC frequencies

N-Grams have a threshold of 200

Page 99: Dissertation Defense

Frequency of TC/DF Ratio Within the WaC

[Figure panels: Integer Values, Two Decimals, One Decimal]

Experiment Results

99

Page 100: Dissertation Defense

• TC and DF Ranks within the WaC show strong correlation

• TC frequencies of WaC and Google N-Grams are very similar

• N-Grams are suitable for accurate IDF estimation for web pages

Does not mean everything correlated to TC can be used as DF substitute!

Conclusions

100

Page 101: Dissertation Defense

Agenda


Inter-Search Engine Lexical Signature Performance (JCDL 2009)

Page 102: Dissertation Defense

Inter-Search Engine Lexical Signature Performance

Martin Klein, Michael L. Nelson

{mklein,mln}@cs.odu.edu

http://en.wikipedia.org/wiki/Elephant

Elephant, Tusks, Trunk, African, Loxodonta

Elephant, Asian, African, Species, Trunk

Elephant, African, Tusks, Asian, Trunk

Page 103: Dissertation Defense

103

Page 104: Dissertation Defense

Agenda


Synchronicity – Automatically Rediscover Missing Web Pages in Real Time (JCDL 2011)

Page 105: Dissertation Defense

Synchro…What?

Synchronicity
• Experience of causally unrelated events occurring together in a meaningful manner
• Events reveal underlying pattern, framework bigger than any of the synchronous systems
• Carl Gustav Jung (1875-1961)
  • “meaningful coincidence”
  • Deschamps – de Fontgibu plum pudding example

picture from http://www.crystalinks.com/jung.html

105

Page 106: Dissertation Defense

Synchro…What?

http://www.youtube.com/watch?v=X4HQyqc-aVU

Repo Man (1984)http://www.imdb.com/title/tt0087995/

106

Page 107: Dissertation Defense

Agenda


(Not yet published)

Page 108: Dissertation Defense

Book of the Dead

• Corpus of missing web pages
• 233 URIs returning status 404
• Mechanical Turk to determine “aboutness”
  • Guess from URI string
• Mementos for 161 URIs
  • Apply lexical signatures and title

108

Page 109: Dissertation Defense

5-term LSs Titles

109

Experiment Results

Dice Similarity Coefficient of Top 100 Results

Legend: D = 0; 0.0 < D ≤ 0.3; 0.3 < D ≤ 0.6; 0.6 < D ≤ 1.0
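For reference, the Dice similarity coefficient of two sets A and B is 2|A ∩ B| / (|A| + |B|). The sketch below applies it to two collections of terms; which items are compared in the experiment (for example terms of a result page and of a Memento) is not stated on the slide, so the inputs and the name `dice_coefficient` are assumptions.

```python
def dice_coefficient(a, b):
    """Dice similarity of two collections, treated as sets; range [0, 1]."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty inputs
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))
```

The resulting value can then be assigned to one of the four bins shown in the legend.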

Page 110: Dissertation Defense

5-term LSs Titles

110

Experiment Results

Jaro Distance of Top 100 Results

Legend: J = 0; 0.0 < J ≤ 0.3; 0.3 < J ≤ 0.6; 0.6 < J ≤ 1.0

Page 111: Dissertation Defense

Book of the Dead

• Mechanical Turk to determine relevance of results
  • Top 10 only
  • Relevant
  • Somewhat relevant
  • Not relevant
  • Broken URI
• nDCG of top 10 results

111

Page 112: Dissertation Defense

5-term LSs Titles

112

Experiment Results

Relevance of Top 10 Results

Page 113: Dissertation Defense

113

Experiment Results

nDCG of Top 10 Results