Synchronicity Real Time Recovery of Missing Web Pages Martin Klein [email protected] Introduction to Digital Libraries Week 14 CS 751 Spring 2011 04/12/2011


Who are you again?
Ph.D. student w/ MLN since 2005
Diagnostic exam in 2006, dissertation proposal in 2008
17 publications to date
Outstanding RA award, CS dept
CoS dissertation fellowship
3 ACM SIGWEB + 2 misc travel grants
CS595 (S10) & CS518 (F10)

The Problem

http://www.jcdl2007.org

http://www.jcdl2007.org/JCDL2007_Program.pdf

The Problem
Web users experience 404 errors
  expected lifetime of a web page is 44 days [Kahle97]
  2% of the web disappears every week [Fetterly03]
Are they really gone? Or just relocated?
  has anybody crawled and indexed it?
  do Google, Yahoo!, Bing or the IA have a copy of that page?
Information retrieval techniques needed to (re-)discover content

Web Infrastructure (WI) [McCown07]
  Web search engines (Google, Yahoo!, Bing) and their caches
  Web archives (Internet Archive)
  Research projects (CiteSeer)

The Environment

Digital preservation happens in the WI

Refreshing and Migration in the WI
Google Scholar
CiteSeerX
Internet Archive
http://waybackmachine.org/*/http:/techreports.larc.nasa.gov/ltrs/PDF/tm109025.pdf

URI Content Mapping Problem
1. same URI maps to same or very similar content at a later time
2. same URI maps to different content at a later time
3. different URI maps to same or very similar content at the same or at a later time
4. the content cannot be found at any URI
[Figure: four timelines from A to B, one per case: U1/C1 then U1/C1; U1/C1 then U1/C2; U1/C1 then U1/404 with U2/C1; U1/C1 then U1/???]

Content Similarity
JCDL 2005
http://www.jcdl2005.org/ (July 2005)
http://www.jcdl2005.org/ (Today)

Content Similarity
Hypertext 2006
http://www.ht06.org/ (August 2006)
http://www.ht06.org/ (Today)

Content Similarity
PSP 2003
http://www.pspcentral.org/events/annual_meeting_2003.html (August 2003)
http://www.pspcentral.org/events/archive/annual_meeting_2003.html (Today)

Content Similarity
ECDL 1999
http://www-rocq.inria.fr/EuroDL99/ (October 1999)
http://www.informatik.uni-trier.de/~ley/db/conf/ercimdl/ercimdl99.html (Today)

Content Similarity
Greynet 1999
http://www.konbib.nl/infolev/greynet/2.5.htm (1999)
Today: ??

Lexical Signatures (LSs)
First introduced by Phelps and Wilensky [Phelps00]
Small set of terms capturing "aboutness" of a document, "lightweight" metadata
[Figure labels: LS, Removal, Hit Rate, Proxy, Cache, Google, Yahoo, Bing, Resource, Abstract]

Generation of Lexical Signatures
Following TF-IDF scheme first introduced by Spärck Jones and Robertson [Jones88]
Term frequency (TF): How often does this word appear in this document?
Inverse document frequency (IDF): In how many documents does this word appear?
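A minimal sketch of how a lexical signature could be computed from the TF and IDF definitions above. The function name, the tokenizer, the logarithmic IDF form and the document-frequency table are all assumptions made for illustration; the slides do not prescribe a particular formula or data source.

```python
import math
import re
from collections import Counter

def lexical_signature(text, doc_freq, corpus_size, k=5):
    """Return the k terms of `text` with the highest TF-IDF score."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    tf = Counter(terms)
    scores = {}
    for term, count in tf.items():
        df = doc_freq.get(term, 1)  # assumption: unseen terms are very rare
        scores[term] = count * math.log(corpus_size / df)
    # Highest score first; ties broken alphabetically for determinism
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]

# Hypothetical document frequencies for a corpus of 1,000,000 documents
DOC_FREQ = {"the": 900_000, "and": 850_000, "on": 800_000,
            "program": 100_000, "digital": 60_000, "conference": 40_000,
            "libraries": 20_000, "jcdl": 150}

page = "JCDL the conference on digital libraries and the JCDL program"
print(lexical_signature(page, DOC_FREQ, 1_000_000, k=3))
# ['jcdl', 'libraries', 'conference']
```

Frequent words such as "the" score low despite a high TF because their IDF is close to zero, which is why the rare term "jcdl" leads the signature.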

LS as Proposed by Phelps and Wilensky
Robust Hyperlink
5 terms are suitable
Append LS to URL
http://www.cs.berkeley.edu/~wilensky/NLP.html?lexical-signature=texttiling+wilensky+disambiguation+subtopic+iago
Limitations:
  Applications (browsers) need to be modified to exploit LSs
  LSs need to be computed a priori
  Works well with most URLs but not with all of them

Generation of Lexical Signatures
Park et al. [Park03] investigated performance of various LS generation algorithms
Evaluated "tunability" of TF and IDF component
Weight on TF increases recall (completeness)
Weight on IDF improves precision (exactness)

Lexical Signatures -- Examples
Rank/Results  URL                                LS
1/243         http://endeavour.cs.berkeley.edu/  endeavour 94720-1776 achieve inter-endeavour amplifies
1/1,930       http://www.jcdl2005.org            jcdl2005 libraries conference cyberinfrastructure jcdl
1/25,900      http://www.loc.gov                 celebrate knowledge webcasts kluge library

Synchronicity

404 error occurs while browsing
look for same or older page in WI (1)
if user satisfied
    return page (2)
else
    generate LS from retrieved page (3)
    query SEs with LS
    if result sufficient
        return "good enough" alternative page (4)
    else
        get more input about desired content (5)
        (link neighborhood, user input, ...)
        re-generate LS && query SEs ...
        return pages (6)

The system may not return any results at all
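The numbered steps above can be sketched as a single function. This is only an illustration: every callable passed in (`find_copy`, `user_satisfied`, `make_ls`, `search`, `more_input`) is a hypothetical placeholder, and treating a non-empty result list as "sufficient" is an assumption, not something the slides define.

```python
from typing import Callable, List, Optional

def synchronicity(uri: str,
                  find_copy: Callable[[str], Optional[str]],
                  user_satisfied: Callable[[str], bool],
                  make_ls: Callable[[str], List[str]],
                  search: Callable[[List[str]], List[str]],
                  more_input: Callable[[str], str]) -> List[str]:
    """Follow steps (1)-(6); all callables are hypothetical placeholders."""
    copy = find_copy(uri)                    # (1) same or older page in the WI
    if copy is not None and user_satisfied(copy):
        return [copy]                        # (2) the old copy is enough
    if copy is not None:
        results = search(make_ls(copy))      # (3) LS from the retrieved page
        if results:                          # assumption: non-empty = sufficient
            return results                   # (4) "good enough" alternatives
    extra = more_input(uri)                  # (5) link neighborhood, user input, ...
    return search(make_ls(extra))            # (6) may be empty

# Stub components, for demonstration only
pages = synchronicity(
    "http://example.org/missing",
    find_copy=lambda u: "archived text about jcdl conference",
    user_satisfied=lambda page: False,
    make_ls=lambda text: text.split()[:5],
    search=lambda ls: ["http://example.org/new-home"] if "jcdl" in ls else [],
    more_input=lambda u: "",
)
print(pages)  # ['http://example.org/new-home']
```

An empty list is a valid outcome, matching the remark that the system may not return any results at all.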

SynchroWhat?
Synchronicity
Experience of causally unrelated events occurring together in a meaningful manner
Events reveal underlying pattern, framework bigger than any of the synchronous systems
Carl Gustav Jung (1875-1961)
"meaningful coincidence"
Deschamps, de Fontgibu plum pudding example
picture from http://www.crystalinks.com/jung.html

404 Errors

Soft 404 Errors

A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web (WIDM 2008)

The Problem
LSs are usually generated following the TF-IDF scheme
TF rather trivial to compute
IDF requires knowledge about:
  overall size of the corpus (# of documents)
  # of documents a term occurs in
Not complicated to compute for bounded corpora (such as TREC)
If the web is the corpus, values can only be estimated

The Idea
Use IDF values obtained from
  Local collection of web pages
  "screen scraping" SE result pages
Validate both methods through comparison to baseline
Use Google N-Grams as baseline
Note: N-Grams provide term count (TC) and not DF values (details to come)

Accurate IDF Values for LSs
Screen scraping the Google web interface

The Dataset
Local universe consisting of copies of URLs from the IA between 1996 and 2007
Same as above, follows Zipf distribution

The Dataset
10,493 observations
254,384 total terms
16,791 unique terms

The Dataset
Total terms vs new terms

LSs Example
Based on all 3 methods
URL: http://www.perfect10wines.com
Year: 2007
Union: 12 unique terms

Comparing LSs
Normalized term overlap
  Assume term commutativity
  k-term LSs normalized by k
Kendall Tau
  Modified version since LSs to compare may contain different terms
M-Score
  Penalizes discordance in higher ranks
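A small sketch of the first of the three measures, normalized term overlap, assuming it means the number of shared terms divided by k with term order ignored ("term commutativity"). The function name and the two example signatures are made up for illustration; the modified Kendall Tau and the M-Score are not shown because the slides do not define them.

```python
def normalized_overlap(ls_a, ls_b):
    """Share of terms two k-term LSs have in common, ignoring term order."""
    if len(ls_a) != len(ls_b) or not ls_a:
        raise ValueError("both LSs must have the same non-zero length k")
    k = len(ls_a)
    return len(set(ls_a) & set(ls_b)) / k

# Hypothetical 5-term LSs for one page, built from two different IDF sources
local = ["wines", "perfect", "cellar", "vintage", "napa"]
scraped = ["perfect", "wines", "vintage", "tasting", "shipping"]
print(normalized_overlap(local, scraped))  # 0.6
```

Because order is ignored, two signatures with the same terms in a different ranking still score 1.0, which is the gap the rank-aware measures are meant to cover.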

Comparing LSs
Top 5, 10 and 15 terms
LC = local universe, SC = screen scraping, NG = N-Grams

Conclusions
Both methods for the computation of IDF values provide accurate results compared to the Google N-Gram baseline
Screen scraping method seems preferable since
  similarity scores slightly higher
  feasible in real time

Correlation of Term Count and Document Frequency for Google N-Grams (ECIR 2009)

The Problem
Need of a reliable source to accurately compute IDF values of web pages (in real time)
Shown, screen scraping works but missing validation of baseline (Google N-Grams)
N-Grams seem suitable (recently created, based on web pages) but provide TC and not DF: what is their relationship?

Background & Motivation
Term frequency (TF) - inverse document frequency (IDF) is a well known term weighting concept
Used (among others) to generate lexical signatures (LSs)
TF is not hard to compute, IDF is, since it depends on global knowledge about the corpus
When the entire web is the corpus, IDF can only be estimated!
Most text corpora provide term count values (TC)
  D1 = "Please, Please Me"
  D2 = "Can't Buy Me Love"
  D3 = "All You Need Is Love"
  D4 = "Long, Long, Long"

  Term  All  Buy  Can't  Is  Love  Me  Need  Please  You  Long
  TC      1    1      1   1     2   2     1       2    1     3
  DF      1    1      1   1     2   2     1       1    1     1

TC >= DF but is there a correlation? Can we use TC to estimate DF?

The Idea
Investigate relationship between:
  TC and DF within the "Web as Corpus" (WaC)
  WaC based TC and Google N-Gram based TC
TREC, BNC could be used but:
  they are not free
  TREC has been shown to be somewhat dated [Chiang05]

The Experiment
Analyze correlation of list of terms ordered by their TC and DF rank by computing:
  Spearman's Rho
  Kendall Tau
Display frequency of TC/DF ratio for all terms
Compare TC (WaC) and TC (N-Grams) frequencies

Experiment Results
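The TC and DF values for the four example documents D1 to D4 can be recomputed with a few lines. This is a sketch for the toy example only; the tokenizer is an assumption, and the per-term TC/DF ratio is printed because it is the quantity whose frequency the experiment displays.

```python
import re
from collections import Counter

DOCS = {
    "D1": "Please, Please Me",
    "D2": "Can't Buy Me Love",
    "D3": "All You Need Is Love",
    "D4": "Long, Long, Long",
}

def tokenize(text):
    # Lower-case and keep apostrophes so "Can't" stays one term
    return re.findall(r"[a-z']+", text.lower())

tc = Counter()  # term count: every occurrence counts
df = Counter()  # document frequency: at most one count per document
for text in DOCS.values():
    terms = tokenize(text)
    tc.update(terms)
    df.update(set(terms))

for term in sorted(tc):
    print(f"{term:8} TC={tc[term]} DF={df[term]} TC/DF={tc[term] / df[term]:.1f}")
```

Only "please" and "long" have a ratio above 1, because they repeat within a single document; for every other term TC and DF coincide.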

Investigate correlation between TC and DF within "Web as Corpus" (WaC)
Rank similarity of all terms

Experiment Results
Investigate correlation between TC and DF within "Web as Corpus" (WaC)
Spearman's Rho and Kendall Tau

Experiment Results
Rank  WaC-DF      WaC-TC      Google      N-Grams
1     IR          IR          IR          IR
2     RETRIEVAL   RETRIEVAL   RETRIEVAL   IRSG
3     IRSG        IRSG        IRSG        RETRIEVAL
4     BCS         IRIT        CONFERENCE  BCS
5     IRIT        BCS         BCS         EUROPEAN
6     CONFERENCE  2009        GRANT       CONFERENCE
7     GOOGLE      FILTERING   IRIT        IRIT
8     2009        GOOGLE      FILTERING   GOOGLE
9     FILTERING   CONFERENCE  EUROPEAN    ACM
10    GRANT       ARIA        PAPERS      GRANT
Google: screen scraping DF values from the Google web interface
Top 10 terms in decreasing order of their TF/IDF values, taken from http://ecir09.irit.fr
Union of the four lists: 14 terms, intersection: 6 terms

Experiment Results
Frequency of TC/DF Ratio Within the WaC
[Figure panels: Integer Values, One Decimal, Two Decimals]
Strong indicator that TC can be used to estimate DF for web pages!

Experiment Results
Show similarity between WaC based TC and Google N-Gram based TC
TC frequencies
N-Grams have a threshold of 200

Conclusions
TC and DF ranks within the WaC show strong correlation
TC frequencies of WaC and Google N-Grams are very similar
Together with results shown earlier (high correlation between baseline and two other methods), N-Grams seem suitable for accurate IDF estimation for web pages
Does not mean everything correlated to TC can be used as a DF substitute!

Inter-Search Engine Lexical Signature Performance (JCDL 2009)

Martin Klein, Michael L. Nelson
{mklein,mln}@cs.odu.edu

http://en.wikipedia.org/wiki/Elephant
Elephant, Tusks, Trunk, African, Loxodonta
Elephant, Asian, African, Species, Trunk
Elephant, African, Tusks, Asian, Trunk

Revisiting Lexical Signatures to (Re-)Discover Web Pages (ECDL 2008)

How to Evaluate the Evolution of LSs over Time
Idea: Conduct overlap analysis of LSs generated over time
LSs based on local universe mentioned above
Neither Phelps and Wilensky nor Park et al. did that
Park et al. just re-confirmed their findings after 6 months

Dataset
Local universe consisting of copies of URLs from the IA between 1996 and 2007

LSs Over Time - Example
10-term LSs generated for http://www.perfect10wines.com

LS Overlap Analysis
Rooted: overlap between the LS of the year of the first observation in the IA and all LSs of the consecutive years that URL has been observed
Sliding: overlap between two LSs of consecutive years starting with the first year and ending with the last

Evolution of LSs over Time
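The rooted and sliding definitions can be written down directly. The sketch below assumes overlap means normalized term overlap (shared terms divided by LS length) and uses a made-up three-year history; function names and data are illustrative only.

```python
def overlap(ls_a, ls_b):
    """Normalized term overlap of two equally long LSs (order ignored)."""
    return len(set(ls_a) & set(ls_b)) / len(ls_a)

def rooted_overlap(ls_by_year):
    """Overlap of the first year's LS with the LS of every later year."""
    years = sorted(ls_by_year)
    root = ls_by_year[years[0]]
    return {y: overlap(root, ls_by_year[y]) for y in years[1:]}

def sliding_overlap(ls_by_year):
    """Overlap of each year's LS with the LS of the next observed year."""
    years = sorted(ls_by_year)
    return {(a, b): overlap(ls_by_year[a], ls_by_year[b])
            for a, b in zip(years, years[1:])}

# Hypothetical 4-term LSs of one URL over three years
history = {
    2005: ["wine", "cellar", "napa", "vintage"],
    2006: ["wine", "cellar", "napa", "tasting"],
    2007: ["wine", "tasting", "shipping", "order"],
}
print(rooted_overlap(history))   # {2006: 0.75, 2007: 0.25}
print(sliding_overlap(history))  # {(2005, 2006): 0.75, (2006, 2007): 0.5}
```

Rooted overlap always compares against the same starting point, so it can only capture drift away from the original page, while sliding overlap shows how much the page changes from one year to the next.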

Results (Rooted):
  Little overlap between the early years and more recent ones
  Highest overlap in the first 1-2 years after creation of the LS
  Rarely peaks after that; once terms are gone, they do not return

Evolution of LSs over Time
Results (Sliding):
  Overlap increases over time
  Seems to reach steady state around 2003

Performance of LSs
Idea:
  Query Google search API with LSs
  LSs based on local universe mentioned above
  Identify URL in result set
For each URL it is possible that:
  URL is returned as the top ranked result
  URL is ranked somewhere between 2 and 10
  URL is ranked somewhere between 11 and 100
  URL is ranked somewhere beyond rank 100 (considered as not returned)

Performance of LSs wrt Number of Terms
Results:
  2-, 3- and 4-term LSs perform poorly
  5-, 6- and 7-term LSs seem best
    Top mean rank (MR) value with 5 terms
    Most top ranked with 7 terms
    Binary pattern: either in top 10 or undiscovered
  8 terms and beyond do not show improvement

Performance of LSs wrt Number of Terms
Rank distribution of 5 term LSs
  Lightest gray = rank 1
  Black = rank 101 and beyond
  Ranks 11-20, 21-30, ... colored proportionally
  50% top ranked, 20% in top 10, 30% black

Performance of LSs
Scoring:
  normalized Discounted Cumulative Gain (nDCG)
  Binary relevance: 1 for match, 0 otherwise
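With binary relevance and exactly one relevant URL per query, nDCG reduces to a function of the rank at which that URL is found. The sketch below assumes the common log2(rank + 1) discount and a cutoff of 100 (ranks beyond 100 count as not returned); the slides name the measure but not the exact discount, so both choices are assumptions.

```python
import math

def ndcg_single_target(rank, cutoff=100):
    """nDCG for one relevant URL found at `rank` (1-based); None = not returned."""
    if rank is None or rank > cutoff:
        return 0.0
    dcg = 1.0 / math.log2(rank + 1)  # gain of 1, discounted by its rank
    ideal = 1.0 / math.log2(1 + 1)   # best case: the URL is at rank 1
    return dcg / ideal

for r in (1, 3, 10, 101, None):
    print(r, round(ndcg_single_target(r), 3))
```

A top-ranked URL scores 1.0 and the score falls off quickly with rank, so the measure rewards the "either top 10 or undiscovered" pattern far more than a plain found/not-found count would.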

Performance of LSs wrt Number of Terms
nDCG for LSs consisting of 2-15 terms (mean over all years)

Performance of LSs over Time
Score for LSs consisting of 2, 5, 7 and 10 terms

Conclusions
LSs decay over time
  Rooted: quickly after generation
  Sliding: seem to stabilize
5-, 6- and 7-term LSs seem to perform best
  7: most top ranked
  5: fewest undiscovered
  5: lowest mean rank
2..4 as well as 8+ terms insufficient

Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure (JCDL 2010)

The Problem
Internet Archive - Wayback Machine
www.aircharter-international.com
http://web.archive.org/web/*/http://www.aircharter-international.com
59 copies
Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry
Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International

The Problem
If no archived/cached copy can be found...
  Tags
  Link Neighborhood (LNLS)
[Figure labels: A, B, C, ?]

Contributions
Compare performance of four automated methods to rediscover web pages
  1. Lexical signatures (LSs)
  2. Titles
  3. Tags
  4. LNLS
Analysis of title characteristics wrt their retrieval performance
Evaluate performance of combination of methods and suggest workflow for real time web page rediscovery

Experiment - Data Gathering
500 URIs randomly sampled from DMOZ
Applied filters
  .com, .org, .net, .edu domains
  English Language
  min. of 50 terms [Park]
Results in 309 URIs to download and parse

Experiment - Data Gathering
Extract title...
Generate 3 LSs per page
  IDF values obtained from Google, Yahoo!, MSN Live
Obtain tags from delicious.com API (only 15%)
Obtain link neighborhood from Yahoo! API (max. 50 URIs)
Generate LNLS
  TF from "bucket" of words per neighborhood
  IDF obtained from Yahoo! API

LS Retrieval Performance
5- and 7-Term LSs
Yahoo! returns most URIs top ranked and leaves least undiscovered
Binary retrieval pattern, URI either within top 10 or undiscovered

Title Retrieval Performance
Non-Quoted and Quoted Titles
Results at least as good as for LSs
Google and Yahoo! return more URIs for non-quoted titles
Same binary retrieval pattern

Tags Retrieval Performance
API returns up to top 10 tags - distinguish between # of tags queried
Low # of URIs
More later

LNLS Retrieval Performance
5- and 7-term LNLSs
less than 5% top ranked
More later

Combination of Methods
Can we achieve better retrieval performance if we combine 2 or more methods?
[Flowchart labels: Query LS, Query Title, Query Tags, Query LNLS, Done]

Combination of Methods
Google
       Top    Top10  Undis
LS5    50.8   12.6   32.4
LS7    57.3    9.1   31.1
TI     69.3    8.1   19.7
TA      2.1   10.6   75.5

Yahoo!
       Top    Top10  Undis
LS5    67.6    7.8   22.3
LS7    66.7    4.5   26.9
TI     63.8    8.1   27.5
TA      6.4   17.0   63.8

MSN Live
       Top    Top10  Undis
LS5    63.1    8.1   27.2
LS7    62.8    5.8   29.8
TI     61.5    6.8   30.7
TA      0      8.5   80.9

Combination of Methods
Top Results for Combination of Methods
             Google  Yahoo!  MSN Live
LS5-TI       65.0    73.8    71.5
LS7-TI       70.9    75.7    73.8
TI-LS5       73.5    75.7    73.1
TI-LS7       74.1    75.1    74.1
LS5-TI-LS7   65.4    73.8    72.5
LS7-TI-LS5   71.2    76.4    74.4
TI-LS5-LS7   73.8    75.7    74.1
TI-LS7-LS5   74.4    75.7    74.8
LS5-LS7      52.8    68.0    64.4
LS7-LS5      59.9    71.5    66.7

Title Characteristics
Length in # of Terms
Length varies between 1 and 43 terms
Length between 3 and 6 terms occurs most frequently and performs well [Ntoulas]

Title Characteristics
Length varies between 4 and 294 characters
Short titles ( 12% discarded
Lexical signature generation
IDF values from Yahoo!
1..7 and 10 terms

The Results

level-radius-rank
Anchor text

The Results: Backlink Level
level-radius-rank
Anchor text
5 words

The Results: Radius
level-radius-rank
All Radii

The Results: Backlink Rank
level-radius-rank
Anchor, Ranks 10, 100, 1000

The Results: In Numbers
1-anchor-1000
1-anchor-10
WINNER
  4 terms
  first backlink level only
  top 10 backlinks only
  anchor text only

Concluding Remarks
Link neighborhood based lexical signatures can help rediscover missing pages.
It is a feasible "Plan C" due to the high success rate of cheaper methods (titles, tags, lexical signatures).
Fortunately, the smallest parameters perform best (anchor, 10 backlinks, 1st level backlinks).
Can we find an optimum for the number of backlinks? (10/100/1000 leaves a big margin)
Can we identify "stop anchors", e.g. "click here", "acrobat", etc.?
