LSIR
All rights reserved. © 2003, Jie Wu <[email protected]>, EPFL-I&C-IIF-LSIR, http://lsirwww.epfl.ch/ Laboratoire de systèmes d'informations répartis, Swiss Federal Institute of Technology, Lausanne
Incrementally Ranking Ephemeral Web Documents in Search Engines
Road Map
• What are ephemeral documents?
• What is the problem to be solved?
• Experiments with Google
• Generations of rankings
• Properties of ephemeral documents
• Solution to rank computation
• Future work in a larger framework
Jie Wu, 1.8.2003, Fri., Toronto, Canada
Ephemeral Documents
What Are Ephemeral Web Documents?
Definition: (Highly demanded) documents that newly appear (and possibly die) between two consecutive crawls.
Significance of the study: It addresses the freshness, similarity, accuracy, personalization, etc. (semantic issues) of search engines.
Cause of the problem: Latency of crawling cycles, for example ca. 1 month for Google, 2 weeks for MSN (1/3 to 1/2 the size of Google), 3 weeks for Alltheweb.
Examples: Everyday news pages (not really ephemeral), web sites for events (e.g. the Olympics, projects like Alvis, short-term programs, unexpected big events like a war), the deep web, etc.
Question: How can ephemeral documents be made available in a search engine (SE) as soon as possible?
Search for "sars" on Google: Top 3
Google Example (done at ca. 21:45, 1.5.2003, Thu.)
Search for "sars" on Google: No. 4-6
Google Example cont.
Content of No. 2
Content of No. 4
Results from Google News (ca. 10:15, 2.5.2003, Fri.)
Results from MSN (ca. 23:05, 1.5.2003, Thu.)
Google vs MSN
Result Analysis
1. All of MSN's top 15 results are actually about the disease SARS.
2. MSN's collection size is only a bit more than 1/3 of Google's.
3. MSN might adjust the weights of SARS-related documents.
4. How can that be done in a systematic and uniform way for an SE with a huge collection of documents like Google?
Google's Problems
1. Ephemeral documents are not included in the collection.
2. Public information needs are reflected only with a delay.
3. The weights given to ephemeral documents are not high enough.
My Notions
3 Generations of Rankings
Generation 1:
Factors: on-page ones, such as keywords/terms
Algorithm: Boolean model, vector space similarity, latent semantic indexing, fuzzy set model, probabilistic models, etc.
Generation 2:
Factors: on-page ones + link structure
Algorithm: G1 + link structure analysis, e.g. PageRank (importance of a page in a general sense), HITS
Generation 3:
Factors: on-page ones + link structure + semantic factors
Algorithm: G1 + G2 + Alvis
Ranking Life Cycle of Normal Documents
Normal vs. Ephemeral Web Documents I
[Figure: ranking value (vertical axis) over life time (horizontal axis), from the viewpoints of PageRank and Human-Mind. Stages marked: birth; crawled; pointed to by more and more incoming links; entering into a more or less stable status; other perturbations.]
Ranking Life Cycle of Ephemeral Documents
Normal vs. Ephemeral Web Documents II
[Figure: ranking value (vertical axis) over life time (horizontal axis), with one curve labeled PageRank and one labeled Human-Mind. Stages marked: birth; crawled; pointed to by more and more incoming links; entering into a more or less stable status; other perturbations.]
Current Work on Ephemeral Web Documents
Basically nothing.
1. Google continues its trilogy of roughly monthly crawling of the whole web, PageRank computation, and adding in other factors.
2. People may not consider it really important to solve this problem. The current centralized, colossal and complete strategy is regarded as good enough.
3. Separate solutions and systems are provided to address the problem, for example news.google.com.
Analysis by Matrix Computation
G: the previous Web graph. N: newly emerged Web pages of a news Web site.
[Figure: the Web graph drawn as two blocks, G and N, with the link sets between them labeled G2N and N2G.]
$P = cG + (1 - c)E$, $A = P^{T}$
The principal eigenvector of $A$.
$G' = G + N + \mathrm{G2N} + \mathrm{N2G}$, $P' = cG' + (1 - c)E'$, $A' = (P')^{T}$
The principal eigenvector of $A'$.
Continuously compute the new eigenvectors given the old ones and the minor change.
Heavier weights have to be given to the links pointing to the new ephemeral documents.
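The idea of computing the new eigenvector from the old one can be sketched with a plain power iteration that is warm-started from the previous ranking. This is only an illustration under assumptions not stated on the slide: E is taken to be the uniform matrix, c = 0.85, pages without out-links spread their rank uniformly, and the five-page graph is made up. The heavier weights for links into new documents are not modeled here. The function name `pagerank` and all variable names are hypothetical.

```python
def pagerank(links, n, c=0.85, start=None, tol=1e-10, max_iter=1000):
    """Power iteration for the principal eigenvector of A = P^T,
    with P = cG + (1 - c)E and E assumed uniform (an assumption).

    links maps a page index to the list of pages it links to.
    Returns (rank vector, number of iterations used).
    """
    x = list(start) if start is not None else [1.0 / n] * n
    total = sum(x)
    x = [v / total for v in x]  # normalize the start vector to sum to 1
    for iteration in range(1, max_iter + 1):
        nxt = [(1.0 - c) / n] * n  # teleportation part, (1 - c)E
        for i in range(n):
            out = links.get(i, [])
            if out:
                share = c * x[i] / len(out)
                for j in out:
                    nxt[j] += share
            else:
                # Assumed handling of pages without out-links: spread uniformly.
                share = c * x[i] / n
                for j in range(n):
                    nxt[j] += share
        diff = sum(abs(a - b) for a, b in zip(nxt, x))
        x = nxt
        if diff < tol:
            return x, iteration
    return x, max_iter


# Old graph G: four pages (made-up example).
G = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
old_rank, _ = pagerank(G, 4)

# Extended graph G': one new page 4 (block N), one link from G into N (0 -> 4)
# and one link from N back into G (4 -> 1).
G_new = {0: [1, 2, 4], 1: [2], 2: [0], 3: [0, 2], 4: [1]}

# Warm start: reuse the old ranking and give the new page an initial guess.
warm_start = old_rank + [1.0 / 5]
warm_rank, warm_iters = pagerank(G_new, 5, start=warm_start)

# Cold start for comparison: uniform initial vector.
cold_rank, cold_iters = pagerank(G_new, 5)

print("iterations (warm, cold):", warm_iters, cold_iters)
print("new ranking:", [round(v, 4) for v in warm_rank])
```

Both runs approach the same stationary vector. Whether the warm start actually saves iterations depends on how small the change to the graph is.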
New Matrix
After including ephemeral documents
Computation Based on the New Matrix
1. Aperiodic: the matrix is induced by the web graph.
2. Irreducible: strongly connected.
Ergodic Theorem applies: the Markov chain defined by Q has a unique stationary probability distribution.
The Computation Converges.
How to Compute
1. Adaptive methods for PageRank computation.
2. k = 400 × (4,500 to 35,000) = 1,800,000 to 14,000,000, i.e. 0.06% to 0.47% of 3 billion.
3. Make use of the block structure.
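The size estimate for k can be checked with a few lines of arithmetic. The numbers are taken from the slide as given; the variable names are hypothetical, and the slide does not say what the factors 400 and 4,500 to 35,000 stand for, so the code does not interpret them.

```python
WEB_SIZE = 3_000_000_000      # "3 billion" from the slide
FACTOR = 400                  # first factor from the slide
LOW, HIGH = 4_500, 35_000     # range of the second factor from the slide

k_low = FACTOR * LOW
k_high = FACTOR * HIGH

print(f"k ranges from {k_low:,} to {k_high:,}")
print(f"share of the collection: {k_low / WEB_SIZE:.2%} to {k_high / WEB_SIZE:.2%}")
```

The point of the estimate is that k stays well below one percent of the collection, which is what makes an incremental update attractive compared with recomputing everything.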
After including ranking of ephemeral documents
Applications in Search Engines
1. Ranking of normal and ephemeral documents can be unified seamlessly.
2. Strong support for a decentralized architecture for Web and peer-to-peer search engines.
3. No contradiction with using separate solutions. For example, news.google.com can easily be built upon a unified ranking scheme.
Revisiting the Challenges by Dr. Andrei Broder
3 Challenges
• A web graph model that takes into account information content.
• A method to compare graph-derived query-independent factors.
• Methods to create graphs where none exists.
Future Work
Real Computation on a Web-Scale Data Set
• Where is the data set?
Taking Into Account More Semantic Information
• Semantic information of the documents and the content
Questions?