Dissertation Defense

Martin Klein's dissertation defense slides.

Using the Web Infrastructure for Real Time Recovery of Missing Web Pages

Dissertation Defense

Martin Klein
mklein@cs.odu.edu

Old Dominion University
Norfolk, VA
07/18/2011

Committee:
Dr. Michael L. Nelson (Advisor)
Dr. Yaohang Li
Dr. Michele C. Weigle
Dr. Mohammad Zubair
Dr. Robert Sanderson
Dr. Herbert Van de Sompel

Agenda

• Motivation
• Background
• TC-DF Correlation
• DF Estimation Techniques
• LSs for Web Pages
• Web Page Titles
• Web Page Tags
• Link Neighborhood LSs
• Book of the Dead
• Synchronicity

2

The Problem

3

The Problem - 404 Errors

• Expected lifetime of a web page is 44 days [Kahle1997]

• URIs inaccessible in CS papers: 23%-53% [Lawrence2001]

• Inaccessible web pages: 67% after 4 years [Koehler2002]

• Inaccessible objects in DLs: 3% [Nelson2002]

• URIs inaccessible in high IF journals: 3.8% after 3 months; 13% after 27 months [Dellavalle2003]

• URIs inaccessible in D-Lib Magazine: ~30% [McCown2005]

• URIs inaccessible (and not archived) in scholarly articles: ~25% [Sanderson2011]

4

The Problem - 404 Errors

• Are they really gone? Or just relocated?
• Has anybody crawled and indexed it?
• Do Google, Yahoo!, Bing have a copy of the page?
• Has the page been archived by a web archive?
• Information retrieval techniques needed to (re-)discover content

5

The Solution?

• Search engines
  • Requires knowledge about content
  • Problem with homographs (jaguar, present, lead, M/mobile, etc.)
  • Problem with very frequent terms/names (Michael Nelson, Eric Miller, etc.)
• Web archives
  • Helps for apple pie recipe but not for web page of transferred faculty, e.g.

6

Content Similarity

JCDL 2005: http://www.jcdl2005.org/

July 2005: http://www.jcdl2005.org/

Today

7

Content Similarity

Hypertext 2006: http://www.ht06.org/

August 2006: http://www.ht06.org/

Today

8

Content Similarity

PSP 2003
http://www.pspcentral.org/events/annual_meeting_2003.html
http://www.pspcentral.org/events/archive/annual_meeting_2003.html

August 2003 Today

9

Content Similarity

ECDL 1999

http://www-rocq.inria.fr/EuroDL99/
http://www.informatik.uni-trier.de/~ley/db/conf/ercimdl/ercimdl99.html

October 1999 Today

10

Content Similarity

Greynet 1999: http://www.konbib.nl/infolev/greynet/2.5.htm

1999 Today

? ?

11

Research Questions (1)

1. Based on the Web Infrastructure (WI), can we use content- and link-structure-based methods to (re-)discover missing web pages in real time?

Investigated methods:
a) Lexical signatures
b) Titles
c) Tags
d) Link neighborhood lexical signatures

12

Research Questions (2)

2. What are the optimal characteristics of these methods (age, length, etc) with respect to retrieval performance?

3. Can we improve the performance by consolidating two or more methods?

4. Can we have a real-world implementation and evaluation of the above?

13

Memento, Web Infrastructure (WI)

15

Lexical Signatures (LSs)

• First introduced by Phelps and Wilensky [Phelps2000]

• Small set of terms capturing “aboutness” of a document, “lightweight” metadata

Resource: 10,000 terms

Abstract: 200 terms

Lexical Signature: 5 terms (Removal, Hit, Rate, Proxy, Cache)

16

Lexical Signature Generation

• Following TF-IDF scheme first introduced by Spärck Jones and Robertson [Jones1973]

• Term frequency (TF):
  • “How often does this word appear in this document?”
• Inverse document frequency (IDF):
  • “In how many documents does this word appear?”
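The TF-IDF ranking behind a lexical signature can be sketched as follows. This is a minimal illustration, not the implementation used in the dissertation: the tokenizer, the toy corpus, the plain `log(N / DF)` weighting, and the function names are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_signature(document, corpus, k=5):
    """Return the k terms of `document` with the highest TF-IDF score.

    `corpus` is a list of document strings used only to count document
    frequencies (DF); the document itself is counted as part of it.
    """
    docs = [set(tokenize(d)) for d in corpus] + [set(tokenize(document))]
    n_docs = len(docs)
    tf = Counter(tokenize(document))
    scores = {}
    for term, freq in tf.items():
        df = sum(1 for d in docs if term in d)
        scores[term] = freq * math.log(n_docs / df)
    # Highest score first; ties are broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Toy collection, invented for illustration.
corpus = [
    "the city council meets in the city hall",
    "the conference on digital libraries",
    "the library of the city",
]
page = "jcdl conference on digital libraries jcdl program and the conference venue"
signature = lexical_signature(page, corpus, k=3)
```

A term that is frequent in the page but rare in the collection ("jcdl") ends up first, while a term that occurs in every document ("the") scores zero and never enters the signature.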

17

Lexical Signatures -- Examples

Rank/Results   URL                       LS
1/1,930        http://www.jcdl2005.org   jcdl2005 libraries conference cyberinfrastructure jcdl
1/24,100       http://www.norfolk.gov    norfolk city council rfps nauticus
1/185          http://library.lanl.gov   lanl library ldrd alamos oppie
2/738,000      http://www.usopen.org     open us ashe tickets usta

18

A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web (WIDM 2008)

Accurate IDF Values for LSs

Screen scraping the Google web interface

20

The Dataset

Local universe consisting of copies of URIs from the Internet Archive between 1996 and 2007

21

The Idea

• Use IDF values obtained from
  1. Local collection of web pages
  2. “Screen scraping” SE result pages
• Validate both methods against a baseline
  • Google N-Grams

Note: N-Grams provide term count (TC) and not DF values – ask me for details
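Whichever source supplies the document frequency (a local collection, a result count scraped from a search engine, or an N-gram count used as a stand-in), the IDF step itself is the same. A minimal sketch, assuming the common `log(N / DF)` form; the index size and the DF values below are invented numbers, not measurements from the study.

```python
import math


def idf_from_df(df, collection_size):
    """IDF = log(N / DF); a term never seen (DF = 0) is treated as DF = 1."""
    return math.log(collection_size / max(df, 1))


# Hypothetical values for illustration only.
ASSUMED_INDEX_SIZE = 10_000_000_000
estimated_df = {
    "wines": 50_000_000,
    "perfect10": 20_000,
    "the": 9_000_000_000,
}

idf = {term: idf_from_df(df, ASSUMED_INDEX_SIZE) for term, df in estimated_df.items()}
ranked_by_idf = sorted(idf, key=idf.get, reverse=True)
```

Rare terms receive the largest IDF, so `ranked_by_idf` starts with "perfect10" and ends with "the".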

22

LSs Example

Based on all 3 methods
URL: http://www.perfect10wines.com
Year: 2007
Union: 12 unique terms

23

Comparing LSs

1. Normalized term overlap
  • Assume term commutativity
  • k-term LSs normalized by k
2. Kendall Tau
  • Modified version since LSs to compare may contain different terms
3. M-Score
  • Penalizes discordance in higher ranks
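The first measure, normalized term overlap, is simple enough to sketch directly: term order is ignored (commutativity) and the number of shared terms is divided by k. The function name is an assumption made for this example.

```python
def normalized_overlap(ls_a, ls_b):
    """Share of terms two k-term LSs have in common, ignoring term order."""
    if len(ls_a) != len(ls_b):
        raise ValueError("both lexical signatures must have the same length k")
    k = len(ls_a)
    return len(set(ls_a) & set(ls_b)) / k


example = normalized_overlap(
    ["wine", "cellar", "perfect", "vintage", "tasting"],
    ["tasting", "wine", "shop", "order", "gift"],
)
```

Here two of the five terms are shared, so `example` is 0.4.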

24

Top 5, 10 and 15 terms

LC – local universe

SC – screen scraping

NG – N-Grams

Comparing LSs

25

Conclusions

• Both methods for the computation of IDF values provide accurate results
  • Compared to the Google N-Gram baseline
• Screen scraping method seems preferable
  • Similarity scores are slightly higher
  • Feasible in real time!

Contribution: Established a well-performing IDF estimation technique.

26

Revisiting Lexical Signatures to (Re-)Discover Web Pages (ECDL 2008)

The Idea

Evaluate the evolution of LSs over time:
• Generate LSs of URIs (from the local universe mentioned above) over time
• Conduct overlap analysis

• Neither Phelps and Wilensky nor Park et al. [Park2004] did that
  • Park et al. just re-confirmed their findings after 6 months

28

10-term LSs generated for http://www.perfect10wines.com

LSs Over Time - Example

29

LS Overlap Analysis

Rooted: overlap between the LS of the year of the first observation in the IA and the LSs of all subsequent years in which the URI has been observed

Sliding: overlap between the LSs of each pair of consecutive years, starting with the first year and ending with the last
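The two overlap modes can be sketched as below. The yearly signatures are invented, and the overlap is taken as the share of common terms; both are assumptions made for illustration.

```python
def overlap(ls_a, ls_b):
    """Share of common terms between two (non-empty) lexical signatures."""
    return len(set(ls_a) & set(ls_b)) / max(len(ls_a), len(ls_b))


def rooted_overlap(ls_by_year):
    """Compare the LS of the first observed year with every later year."""
    years = sorted(ls_by_year)
    root = ls_by_year[years[0]]
    return {year: overlap(root, ls_by_year[year]) for year in years[1:]}


def sliding_overlap(ls_by_year):
    """Compare the LSs of each pair of consecutive observed years."""
    years = sorted(ls_by_year)
    return {
        (a, b): overlap(ls_by_year[a], ls_by_year[b])
        for a, b in zip(years, years[1:])
    }


# Hypothetical 4-term signatures of one URI over three years.
ls_by_year = {
    2005: ["wine", "cellar", "perfect", "vintage"],
    2006: ["wine", "cellar", "perfect", "tasting"],
    2007: ["wine", "tasting", "shop", "order"],
}
rooted = rooted_overlap(ls_by_year)
sliding = sliding_overlap(ls_by_year)
```

With this toy data the rooted overlap drops from 0.75 to 0.25, while the sliding overlap only drops from 0.75 to 0.5, because each year is compared with its direct predecessor rather than with the first year.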

30

Evolution of LSs over Time

Rooted

Results:
• Little overlap between the early years and more recent ones
• Highest overlap in the first 1-2 years after creation of the LS
• Rarely peaks after that: once terms are gone, they do not return

31

Evolution of LSs over Time

Sliding

Results:
• Overlap increases over time
• Seems to reach a steady state around 2003

32

Performance of LSs

Idea:
• Query LSs against Google search API
• Identify URI in result set

• For each URI it is possible that:
  1. URI is returned as the top ranked result
  2. URI is ranked somewhere between 2 and 10
  3. URI is ranked somewhere between 11 and 100
  4. URI is ranked somewhere beyond rank 100 (considered as not returned)

33
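The four retrieval cases map directly onto a small classification function. This sketch assumes the result list is already available as a ranked list of URI strings and that URIs are compared by exact string match; real result sets would need URI normalization first.

```python
def retrieval_case(target_uri, result_uris):
    """Return the retrieval case (1-4) of `target_uri` in a ranked result list."""
    try:
        rank = result_uris.index(target_uri) + 1  # 1-based rank
    except ValueError:
        return 4  # not in the result set at all
    if rank == 1:
        return 1
    if rank <= 10:
        return 2
    if rank <= 100:
        return 3
    return 4  # beyond rank 100 counts as not returned


# Hypothetical ranked result list with 150 entries.
results = [f"http://example.org/{i}" for i in range(1, 151)]
cases = [
    retrieval_case("http://example.org/1", results),
    retrieval_case("http://example.org/7", results),
    retrieval_case("http://example.org/42", results),
    retrieval_case("http://example.org/120", results),
    retrieval_case("http://example.org/missing", results),
]
```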

Performance of LSs wrt Length

Results:
• 2-, 3- and 4-term LSs perform poorly
• 5-, 6- and 7-term LSs seem best
  • Top mean rank (MR) value with 5 terms
  • Most top ranked with 7 terms
  • Binary pattern: either in top 10 or undiscovered
• 8 terms and beyond do not show improvement

34

nDCG for LSs consisting of 2-15 terms (mean over all years)

Performance of LSs wrt Length

35

Performance of LSs over Time

nDCG for LSs consisting of 2, 5, 7 and 10 terms

36

Conclusions

• LSs decay over time
  • Rooted: quickly after generation
  • Sliding: seem to stabilize
• LSs older than 5 years perform poorly
• 5-, 6- and 7-term LSs seem to perform best
  • 7 – most top ranked
  • 5 – lowest mean rank
• 2..4 as well as 8+ term LSs are insufficient

Contribution: Determined age and length limits for LSs.

37

Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure (JCDL 2010)

The Problem

Internet Archive - Wayback Machine

www.aircharter-international.com
http://web.archive.org/web/*/http://www.aircharter-international.com

59 copies

Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry

Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International

39

The Problem

www.aircharter-international.com

Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry

40

The Problem

www.aircharter-international.com

Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International

41

Contributions

• Compare performance of two automated methods to rediscover web pages

1. Lexical signatures (LSs)

2. Titles

• Evaluate performance of combination of methods and suggest workflow for real time web page rediscovery

The Idea

42

LS Retrieval Performance

5- and 7-Term LSs

• Yahoo! returns most URIs top ranked and leaves least undiscovered

• Binary retrieval pattern, URI either within top 10 or undiscovered

43

Title Retrieval Performance

Non-Quoted and Quoted Titles

• Results at least as good as for LSs

• Google and Yahoo! return more URIs for non-quoted titles

• Same binary retrieval pattern

44

Combination of Methods

Top Results for Combination of Methods

Combination   Google   Yahoo!   MSN Live
LS5-TI        65.0     73.8     71.5
LS7-TI        70.9     75.7     73.8
TI-LS5        73.5     75.7     73.1
TI-LS7        74.1     75.1     74.1
LS5-TI-LS7    65.4     73.8     72.5
LS7-TI-LS5    71.2     76.4     74.4
TI-LS5-LS7    73.8     75.7     74.1
TI-LS7-LS5    74.4     75.7     74.8
LS5-LS7       52.8     68.0     64.4
LS7-LS5       59.9     71.5     66.7

45

Conclusions

• LSs and titles are suitable as search engine queries
  • Return 50%-70% URIs top ranked

BUT
• Titles are cheaper to obtain, hence
  • Preferred primary method
  • 5-term LSs secondary method
  • Results in 75% top ranked URIs

Contributions: Provided evidence for suitability of titles and introduced web page discovery framework.
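The "title first, 5-term LS second" ordering can be sketched as a simple fallback chain. Everything here is hypothetical: `search` stands in for any search engine client, `is_match` for whatever decides that a result is the page being looked for, and the toy index replaces real search results.

```python
def rediscover(queries, search, is_match, top_n=10):
    """Try each (label, query) pair in order and stop at the first hit.

    Returns (label, uri) for the first matching result within the top_n
    results of a query, or None if no query finds the page.
    """
    for label, query in queries:
        for uri in search(query)[:top_n]:
            if is_match(uri):
                return label, uri
    return None


# Toy stand-in for a search engine: maps a query string to ranked URIs.
FAKE_INDEX = {
    "air charter international": ["http://example.org/other"],
    "charter aircraft cargo passenger jet": [
        "http://example.org/other",
        "http://example.com/new-home",
    ],
}


def fake_search(query):
    return FAKE_INDEX.get(query, [])


result = rediscover(
    queries=[
        ("title", "air charter international"),
        ("LS5", "charter aircraft cargo passenger jet"),
    ],
    search=fake_search,
    is_match=lambda uri: uri == "http://example.com/new-home",
)
```

In this toy run the title query misses, so the 5-term LS is tried next and returns the page; the cheaper method is always attempted first.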

46

Is This a Good Title? (Hypertext 2010)

The Problem

http://www.drbartell.com/

Lexical Signature (TF/IDF): Plastic Surgeon Reconstructive Dr Bartell Symbol University

???

48

The Problem

http://www.drbartell.com/

Title: Thomas Bartell MD Board-Certified - Cosmetic Plastic Reconstructive Surgery

49

The Problem

www.reagan.navy.mil

Lexical Signature (TF/IDF): Ronald USS MCSN Torrey Naval Sea Commanding

50

The Problem

www.reagan.navy.mil

Title: Home Page ???

Is This a Good Title?

51

Contributions

• Display title evolution over time

• Compare to content evolution

• “Normalize” time as fixed size windows

• Provide prediction model for title’s retrieval potential

The Idea

52

Title (and LS) Retrieval Performance

Titles, 5- and 7-Term LSs

• Titles return more than 60% URIs top ranked
• Binary retrieval pattern, URI either within top 10 or undiscovered

53

Title Evolution – Example I

www.sun.com/solutions

1998-01-27: Sun Software Products Selector Guides - Solutions Tree
1999-02-20: Sun Software Solutions
2002-02-01: Sun Microsystems Products
2002-06-01: Sun Microsystems - Business & Industry Solutions
2003-08-01: Sun Microsystems - Industry & Infrastructure Solutions Sun Solutions
2004-02-02: Sun Microsystems – Solutions
2004-06-10: Gateway Page - Sun Solutions
2006-01-09: Sun Microsystems Solutions & Services
2007-01-03: Services & Solutions
2007-02-07: Sun Services & Solutions
2008-01-19: Sun Solutions

54

Title Evolution – Example II

www.datacity.com/mainf.html

2000-06-19: DataCity of Manassas Park Main Page

2000-10-12: DataCity of Manassas Park sells Custom Built Computers & Removable Hard Drives

2001-08-21: DataCity a computer company in Manassas Park sells Custom Built Computers & Removable Hard Drives

2002-10-16: computer company in Manassas Virginia sells Custom Built Computers with Removable Hard Drives Kits and Iomega 2GB Jaz Drives (jazz drives) October 2002 DataCity 800-326-5051 toll free

2006-03-14: Est 1989 Computer company in Stafford Virginia sells Custom Built Secure Computers with DoD 5200.1-R Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite; lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB

55

Title Evolution Over Time

How much do titles change over time?

• Copies from fixed size time windows per year
• Extract available titles of past 14 years
• Compute normalized Levenshtein edit distance between titles of copies and baseline (today)
  (0 = identical; 1 = completely dissimilar)

56
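A normalized Levenshtein distance can be sketched as below. Dividing by the length of the longer string is an assumption (the slides do not state the normalization used); it yields 0 for identical titles and 1 for titles with no character in a matching position.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance scaled to [0, 1] by the length of the longer string."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))


unchanged = normalized_edit_distance("Sun Solutions", "Sun Solutions")
changed = normalized_edit_distance("Sun Software Solutions", "Sun Solutions")
```

In the example, `unchanged` is 0.0, and `changed` is 9/22 because the nine characters of "Software " have to be deleted from a 22-character title.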

Title Evolution Over Time

Title edit distance frequencies

• Half the titles of available copies from recent years are (close to) identical
• Decay from 2005 on (with fewer copies available)
• 4 year old title: 40% chance to be unchanged

57

Title Evolution Over Time

Title vs Document

• Y: avg shingle value for all copies per URI
• X: avg edit distance of corresponding titles
• Overlap indicated by:
  green: <10
  red: >90
• Semi-transparent: total amount of points plotted

[0,1] - over 1600 times
[0,0] - 122 times

58

Title Performance Prediction

• Quality prediction of title by
  • Number of nouns, articles etc.
  • Amount of title terms, characters [Ntoulas2006]
• Observation of re-occurring terms in poorly performing titles - “Stop Titles”
  • home, index, home page, welcome, untitled document

The performance of any given title can be predicted as insufficient if 75% or more of it consists of a “Stop Title”!
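The 75% rule can be sketched as a simple check. Two assumptions are made here that the slides do not settle: the share is measured in title terms rather than characters, and the stop-title list is just the five examples given above.

```python
import re

# The five example stop titles from the slide; a real list would be longer.
STOP_TITLES = ["home", "index", "home page", "welcome", "untitled document"]
STOP_TERMS = {word for phrase in STOP_TITLES for word in phrase.split()}


def stop_title_ratio(title):
    """Share of title terms that belong to a stop title (assumed term-based)."""
    terms = re.findall(r"[a-z0-9]+", title.lower())
    if not terms:
        return 1.0  # an empty title carries no information at all
    return sum(term in STOP_TERMS for term in terms) / len(terms)


def predicted_insufficient(title, threshold=0.75):
    """Predict poor retrieval performance when the stop-title share is high."""
    return stop_title_ratio(title) >= threshold


poor = predicted_insufficient("Home Page")
good = predicted_insufficient(
    "Thomas Bartell MD Board-Certified - Cosmetic Plastic Reconstructive Surgery"
)
```

"Home Page" consists entirely of stop-title terms and is flagged, while the descriptive surgeon title contains none and passes.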

59

Conclusions

• Titles change more slowly and less significantly over time than web page content
• Not all titles equally good
  • If the majority of title terms are Stop Titles, its quality can be predicted to be poor

Contribution: Quantified title evolution and introduced stop titles.

60

Find, New, Copy, Web, Page - Tagging for the (Re-)Discovery of Web Pages (TPDL 2011)

The Problem

We have seen that we have a good chance to rediscover missing pages with

• Lexical signatures
• Titles

BUT

What if no archived/cached copy can be found?

62

The Solution?

Tags: Conferences, Digitallibraries, Conference, Library, Jcdl2005

63

The Idea

• Experimental evaluation of tag based query length cf. 5- or 7-term LSs

• Test combination of methods to improve retrieval performance

• Investigate “descriptive” power of tags

64

The Experiment

• Tags queried against the Yahoo! BOSS API
• Same four retrieval cases introduced earlier
• nDCG w/ binary relevance scoring
• Mean Average Precision
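When exactly one result is relevant (the URI being looked for), both measures reduce to simple functions of its rank. The sketch below assumes the widely used `1 / log2(rank + 1)` discount for nDCG; the slides do not say which DCG variant was used, so treat the formula as an assumption.

```python
import math


def ndcg_single_target(rank):
    """nDCG with binary relevance and one relevant result.

    `rank` is the 1-based rank of the target URI, or None if it was not
    returned. The ideal ranking puts the target first, so the ideal DCG is 1.
    """
    if rank is None:
        return 0.0
    return 1.0 / math.log2(rank + 1)


def average_precision_single_target(rank):
    """Average precision with one relevant result is 1 / rank."""
    if rank is None:
        return 0.0
    return 1.0 / rank


def mean(values):
    return sum(values) / len(values)


# Hypothetical ranks of three target URIs (None = not returned).
ranks = [1, 2, None]
mean_ndcg = mean([ndcg_single_target(r) for r in ranks])
mean_average_precision = mean([average_precision_single_target(r) for r in ranks])
```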

65

The Experiment

Combining methods

66

The Experiment

• Fact:
  • ~50% of tags do not occur in page [Bischoff2008]
• “Secret”:
  • ~50% of tags do not occur in current version of page
  • ergo: How about previous versions?

67

Ghost Tags

• 3,306 URIs w/ older copies
• 66.3% of our tags do not occur in page
• 4.9% of tags occur in previous version of page: Ghost Tags
  • represent a previous version better than the current one

• What kind of tags are these?
  • Important to the document, to the Delicious user?
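The ghost tag definition (a tag that is absent from the current page but present in an earlier copy) can be sketched as a set comparison. The tokenizer and the sample texts are assumptions made for the example; matching tags against page text in practice would also need stemming and multi-word handling.

```python
import re


def terms(text):
    """Set of lowercase alphanumeric terms occurring in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def ghost_tags(tags, current_text, previous_texts):
    """Tags missing from the current version but found in a previous one."""
    current = terms(current_text)
    previous = set().union(*(terms(text) for text in previous_texts))
    return [
        tag for tag in tags
        if tag.lower() not in current and tag.lower() in previous
    ]


# Hypothetical tags and page versions.
found = ghost_tags(
    tags=["jcdl", "Conference", "denver", "libraries"],
    current_text="Digital libraries conference",
    previous_texts=["JCDL 2005 in Denver"],
)
```

Only "jcdl" and "denver" qualify here: the other two tags still occur in the current version of the page.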

68

Ghost Tags

Document importance: TF rank

User importance: Delicious rank

Normalized rank: 0 - top, 1 - bottom

69

Conclusions

• Tags can be used for search (if available)
• Combining tags with titles and LSs gains URIs
• Ghost Tags exist!
  • 1/3 of them are important to the page and user

Contributions: Added tags to web page discovery framework and introduced notion of Ghost Tags.

70

Rediscovering Missing Web Pages Using Link Neighborhood Lexical Signatures (JCDL 2011)

The Problem

We have seen that we have a good chance to rediscover missing pages with

• Lexical signatures
• Titles

BUT

What if no archived/cached copy can be found?

Plan A: Tags

72

Plan B

Link neighborhood Lexical Signatures (LNLSs)

[Figure: terms extracted from the link neighborhood (Computer, Dominion, Norfolk, Monarch) indicate what the linked page is about]

73

The Idea

• Determine for well performing LNLS:
  • Length
  • Number of backlinks
  • Backlink levels
  • Radius of terms on backlink page

74

The Radius on a Backlink Page

Paragraph

Entire page

Anchor text
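The smallest radius, anchor text only, can be sketched with the standard library HTML parser: collect the text of every link on a backlink page that points at the target URI. The class name and the exact-match comparison of `href` values are assumptions; real backlink pages would need relative URIs resolved and normalized first.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect the anchor text of all links pointing at one target URI."""

    def __init__(self, target_uri):
        super().__init__()
        self.target_uri = target_uri
        self.anchor_texts = []
        self._inside_target_link = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.target_uri:
            self._inside_target_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside_target_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside_target_link:
            # Join the pieces and collapse runs of whitespace.
            self.anchor_texts.append(" ".join("".join(self._buffer).split()))
            self._inside_target_link = False


# Hypothetical backlink page.
backlink_html = (
    '<p>Visit <a href="http://www.cs.odu.edu/">ODU <b>Computer</b> Science</a> '
    'or <a href="http://example.org/">something else</a>.</p>'
)
collector = AnchorTextCollector("http://www.cs.odu.edu/")
collector.feed(backlink_html)
anchor_texts = collector.anchor_texts
```

The collected anchor texts would then be fed into the same TF-IDF ranking as page content; widening the radius to the paragraph or the entire page only changes which text is collected.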

75

The Dataset

• 309 URIs
  • 28,325 first level backlinks
  • 306,700 second level backlinks
• Filter for language, file type, etc.
  • 12% discarded
• Lexical signature generation
  • IDF values from Yahoo!
  • 1..7 and 10 terms
• Query Yahoo! API
• Compute “goodness” (nDCG)

76

The Results

[Figure: results by level-radius-rank for 1st and 2nd level backlinks; arrow labeled “better”]

77

The Results – Radius

level-radius-rank

All Radii

78

The Results – Backlink Rank

level-radius-rank

Ranks: 10, 100, 1000

79

The Results – In Numbers

1-anchor-1000

WINNER

1-anchor-10

GOOD

80

Conclusions

• Optimal link neighborhood lexical signatures:
  • Contain 4 terms
  • Parsed from top 10 backlink pages
  • Include first backlink level only
  • Consider anchor text only

Contributions: Added LNLS to web page discovery framework.

81

Synchronicity – Automatically Rediscover Missing Web Pages in Real Time (JCDL 2011)

Synchronicity

• Firefox add-on
• Triggers on 404 error
• Rediscover page via:
  • Memento
  • Title
  • Lexical signature
  • Tags
  • Link neighborhood lexical signature
  • URI modification
• http://bit.ly/no-more-404

83

Contributions

1. Introduce reliable real-time approach to estimate IDF values
2. Workflow for generation of well performing lexical signatures
3. Performance evaluation of web page titles
4. Investigation of tags for web page discovery
5. Analysis of link neighborhood lexical signatures and their optimal parameters
6. Introduce Synchronicity implementing the entire framework

84

Concluding Remarks

85

Next Stop… New Mexico

86

List of my Relevant Publications

1. M.Klein, M.L.Nelson, “A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web”, WIDM 2008, pp. 39-46
2. M.Klein, M.L.Nelson, “Revisiting Lexical Signatures to (Re-)Discover Web Pages”, ECDL 2008, pp. 371-382
3. M.Klein, M.L.Nelson, “Correlation of Term Count and Document Frequency for Google N-Grams”, ECIR 2009, pp. 620-627
4. M.Klein, M.L.Nelson, “Inter-Search Engine Lexical Signature Performance”, JCDL 2009, pp. 413-414
5. M.Klein, M.L.Nelson, “Investigating the Change of Web Pages Titles Over Time”, InDP 2009
6. M.Klein, J.Shipman, M.L.Nelson, “Is This a Good Title”, Hypertext 2010, pp. 3-12
7. M.Klein, M.L.Nelson, “Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure”, JCDL 2010, pp. 59-68
8. M.Klein, J.Ware, M.L.Nelson, “Rediscovering Missing Web Pages Using Link Neighborhood Lexical Signatures”, JCDL 2011
9. M.Klein, M.Aly, M.L.Nelson, “Synchronicity - Automatically Rediscover Missing Web Pages in Real Time”, JCDL 2011
10. M.Klein, M.L.Nelson, “Find, New, Copy, Web, Page – Tagging for the (Re-)Discovery of Web Pages”, TPDL 2011, to appear

87

References

[Bischoff2008] K.Bischoff, C.Firan, W.Nejdl, R.Paiu, “Can All Tags Be Used for Search?”, Proceedings of CIKM '08, pp. 193-202, 2008
[Dellavalle2003] R.P.Dellavalle, E.J.Hester, L.F.Heilig, A.L.Drake, J.W.Kuntzman, M.Graber, L.M.Schilling, “Information Science: Going, Going, Gone: Lost Internet References”, Science 302(5646), pp. 787-788, 2003
[Jones1973] K.Spärck Jones, “Index Term Weighting”, Information Storage and Retrieval, pp. 619-633, 1973
[Kahle1997] B.Kahle, “Preserving the Internet”, Scientific American 276, pp. 82-83, 1997
[Koehler2002] W.C.Koehler, “Web Page Change and Persistence - A Four-Year Longitudinal Study”, JASIST 53(2), pp. 162-171, 2002
[Lawrence2001] S.Lawrence, D.M.Pennock, G.W.Flake, R.Krovetz, F.M.Coetzee, E.Glover, F.A.Nielsen, A.Kruger, C.L.Giles, “Persistence of Web References in Scientific Research”, Computer 34(2), pp. 26-31, 2001
[McCown2005] F.McCown, S.Chan, M.L.Nelson, J.Bollen, “The Availability and Persistence of Web References in D-Lib Magazine”, Proceedings of IWAW '05, 2005
[Nelson2002] M.L.Nelson, B.D.Allen, “Object Persistence and Availability in Digital Libraries”, D-Lib Magazine 8(1), 2002
[Ntoulas2006] A.Ntoulas, M.Najork, M.Manasse, D.Fetterly, “Detecting Spam Web Pages Through Content Analysis”, Proceedings of WWW '06, pp. 83-92, 2006
[Park2004] S.T.Park, D.M.Pennock, C.L.Giles, R.Krovetz, “Analysis of Lexical Signatures for Improving Information Persistence on the World Wide Web”, TOIS 22(4), pp. 540-572, 2004
[Phelps2000] T.A.Phelps, R.Wilensky, “Robust Hyperlinks Cost Just Five Words Each”, technical report, UC Berkeley, 2000
[Sanderson2011] R.Sanderson, M.Phillips, H.Van de Sompel, “Analyzing the Persistence of Referenced Web Resources with Memento”, Proceedings of OR '11, 2011

88

Using the Web Infrastructure for Real Time Recovery of Missing Web Pages

Martin Klein
mklein@cs.odu.edu

http://www.cs.odu.edu/~mklein/

Backup Slides

Future Work

• “Story Telling” with Memento
• Find more Stop Titles
• Find more Ghost Tags
• Identify “Stop Anchors”
• Synchronicity 1.0
  • Web service
  • CMD line tool

91

Correlation of Term Count and Document Frequency for Google N-Grams (ECIR 2009)

The Problem

• Need of a reliable source to accurately compute IDF values of web pages (in real time)
• Shown, screen scraping works but
  • missing validation of baseline (Google N-Grams)
  • N-Grams seem suitable (recently created, based on web pages) but provide TC and not DF: what is their relationship?

93

94

Background

• Google N-grams provide term count (TC) values

D1 = “Please, Please Me”
D2 = “Can’t Buy Me Love”
D3 = “All You Need Is Love”
D4 = “Long, Long, Long”

Term   All   Buy   Can’t   Is   Love   Me   Need   Please   You   Long
TC     1     1     1       1    2      2    1      2        1     3
DF     1     1     1       1    2      2    1      1        1     1

TC >= DF, but is there a correlation?
Can we use TC to estimate DF?
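The difference between the two counts can be reproduced for the four song titles with a few lines of code. The tokenizer is an assumption made for this example: TC counts every occurrence of a term across all documents, whereas DF counts each document at most once.

```python
import re
from collections import Counter

titles = {
    "D1": "Please, Please Me",
    "D2": "Can't Buy Me Love",
    "D3": "All You Need Is Love",
    "D4": "Long, Long, Long",
}


def tokenize(text):
    """Lowercase terms; the apostrophe is kept so that "can't" stays whole."""
    return re.findall(r"[a-z']+", text.lower())


term_count = Counter()
document_frequency = Counter()
for text in titles.values():
    tokens = tokenize(text)
    term_count.update(tokens)               # every occurrence counts
    document_frequency.update(set(tokens))  # at most once per document
```

"long" occurs three times but in a single document (TC = 3, DF = 1), while "love" occurs once in each of two documents (TC = DF = 2).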

95

Experiment Results

Investigate correlation between TC and DF within “Web as Corpus” (WaC)

Rank similarity of all terms

96

Experiment Results

Investigate correlation between TC and DF within “Web as Corpus” (WaC)

Spearman’s ρ and Kendall τ
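Both rank correlation coefficients can be sketched for the simple case of two rankings over the same terms without ties. These are the textbook formulas, not the modified variants needed when the rankings contain different terms; the example rankings are invented.

```python
from itertools import combinations


def rank_positions(ranking):
    """Map each term to its 1-based position in a ranking."""
    return {term: pos for pos, term in enumerate(ranking, start=1)}


def spearman_rho(ranking_a, ranking_b):
    """Spearman's rho for two tie-free rankings of the same terms."""
    pos_a, pos_b = rank_positions(ranking_a), rank_positions(ranking_b)
    n = len(ranking_a)
    d_squared = sum((pos_a[t] - pos_b[t]) ** 2 for t in ranking_a)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau: (concordant - discordant) pairs over all pairs."""
    pos_a, pos_b = rank_positions(ranking_a), rank_positions(ranking_b)
    concordant = discordant = 0
    for s, t in combinations(ranking_a, 2):
        if (pos_a[s] - pos_a[t]) * (pos_b[s] - pos_b[t]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical rankings of the same four terms by TC and by DF.
by_tc = ["ir", "retrieval", "irsg", "bcs"]
by_df = ["ir", "irsg", "retrieval", "bcs"]
rho = spearman_rho(by_tc, by_df)
tau = kendall_tau(by_tc, by_df)
```

One swapped pair out of six gives a rho of 0.8 and a tau of about 0.67; identical rankings would give 1.0 for both.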

97

Experiment Results

Top 10 terms in decreasing order of their TF/IDF values, taken from http://ecir09.irit.fr

Rank   WaC-DF       WaC-TC       Google       N-Grams
1      IR           IR           IR           IR
2      RETRIEVAL    RETRIEVAL    RETRIEVAL    IRSG
3      IRSG         IRSG         IRSG         RETRIEVAL
4      BCS          IRIT         CONFERENCE   BCS
5      IRIT         BCS          BCS          EUROPEAN
6      CONFERENCE   2009         GRANT        CONFERENCE
7      GOOGLE       FILTERING    IRIT         IRIT
8      2009         GOOGLE       FILTERING    GOOGLE
9      FILTERING    CONFERENCE   EUROPEAN     ACM
10     GRANT        ARIA         PAPERS       GRANT

Google: screen scraping DF values from the Google web interface

U = 14
∩ = 6

Strong indicator that TC can be used to estimate DF for web pages!

98

Experiment Results

Show similarity between WaC based TC and Google N-Gram based TC

TC frequencies

N-Grams have a threshold of 200

Frequency of TC/DF Ratio Within the WaC

Panels: Integer Values, Two Decimals, One Decimal

99

• TC and DF Ranks within the WaC show strong correlation

• TC frequencies of WaC and Google N-Grams are very similar

• N-Grams are suitable for accurate IDF estimation for web pages

Does not mean everything correlated to TC can be used as DF substitute!

Conclusions

100

Inter-Search Engine Lexical Signature Performance (JCDL 2009)

Martin Klein, Michael L. Nelson

{mklein,mln}@cs.odu.edu

http://en.wikipedia.org/wiki/Elephant

Elephant, Tusks, Trunk, African, Loxodonta

Elephant, Asian, African, Species, Trunk

Elephant, African, Tusks, Asian, Trunk

103

Synchronicity – Automatically Rediscover Missing Web Pages in Real Time (JCDL 2011)

Synchro…What?

Synchronicity
• Experience of causally unrelated events occurring together in a meaningful manner
• Events reveal underlying pattern, framework bigger than any of the synchronous systems
• Carl Gustav Jung (1875-1961)
  • “meaningful coincidence”
  • Deschamps – de Fontgibu plum pudding example

picture from http://www.crystalinks.com/jung.html

105

Synchro…What?

http://www.youtube.com/watch?v=X4HQyqc-aVU

Repo Man (1984)
http://www.imdb.com/title/tt0087995/

106

(Not yet published)

Book of the Dead

• Corpus of missing web pages
• 233 URIs returning status 404
• Mechanical Turk to determine “aboutness”
• Guess from URI string
• Mementos for 161 URIs
• Apply lexical signatures and title

108

5-term LSs Titles

109

Experiment Results

Dice Similarity Coefficient of Top 100 Results

Legend: D = 0; 0.0 < D ≤ 0.3; 0.3 < D ≤ 0.6; 0.6 < D ≤ 1.0
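For reference, the Dice coefficient of two term sets is twice the size of their intersection divided by the sum of their sizes. The sketch below applies it to plain sets of terms; which texts were compared for the plot is not stated on the slide, so the inputs here are invented.

```python
def dice_coefficient(terms_a, terms_b):
    """Dice similarity of two term sets: 2 * |A ∩ B| / (|A| + |B|)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


d = dice_coefficient(
    ["elephant", "tusks", "trunk"],
    ["tusks", "trunk", "asian"],
)
```

Two shared terms out of three on each side give D = 4/6, which would fall into the top bucket (0.6 < D ≤ 1.0).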

5-term LSs Titles

110

Experiment Results

Jaro Distance of Top 100 Results

Legend: J = 0; 0.0 < J ≤ 0.3; 0.3 < J ≤ 0.6; 0.6 < J ≤ 1.0
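The Jaro measure compares two strings by counting characters that match within a sliding window and the transpositions among them; 1 means identical and 0 means no matching characters. This is a sketch of the standard definition, applied to plain strings; what exactly was compared for the plot is not stated on the slide.

```python
def jaro(s1, s2):
    """Standard Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    seq1 = [s1[i] for i in range(len1) if matched1[i]]
    seq2 = [s2[j] for j in range(len2) if matched2[j]]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


j = jaro("MARTHA", "MARHTA")
```

For the classic example pair "MARTHA" and "MARHTA", all six characters match with one transposition, giving a value of about 0.944.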

Book of the Dead

• Mechanical Turk to determine relevance of results
  • Top 10 only
  • Relevant
  • Somewhat relevant
  • Not relevant
  • Broken URI
• nDCG of top 10 results

111

5-term LSs Titles

112

Experiment Results

Relevance of Top 10 Results

113

Experiment Results

nDCG of Top 10 Results