
LSIR

All rights reserved. © 2003, Jie Wu <[email protected]>, EPFL-I&C-IIF-LSIR, http://lsirwww.epfl.ch/ Laboratoire de systèmes d'informations répartis, Swiss Federal Institute of Technology, Lausanne

Incrementally Ranking Ephemeral Web Documents in Search Engines

Road Map

• What are ephemeral documents?
• What is the problem to be solved?
• Experiments with Google
• Generations of rankings
• Properties of ephemeral documents
• Solution to rank computation
• Future work in a big framework

Jie Wu, 1.8.2003, Fri., Toronto, Canada


Ephemeral Documents

What Are Ephemeral Web Documents?

Definition: (Highly demanded) documents that newly appear (and possibly die) between two consecutive crawls.

Significance of the study: Addressing the aspects of freshness, similarity, accuracy, personalization, etc. (semantic issues) of search engines.

Cause of the problem: Latency of crawling cycles. For example, ca. 1 month for Google, 2 weeks for MSN (1/3 to 1/2 the size of Google), 3 weeks for Alltheweb.

Examples: Everyday news pages (not really ephemeral), web sites for events (e.g. the Olympics, projects like Alvis, short-term programs, unexpected big events like a war, etc.), the deep web, etc.

Question: How can ephemeral documents be made available in a search engine (SE) as soon as possible?
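The definition above can be made concrete with a small check. This is only a sketch under stated assumptions: times are plain day numbers, the function name `is_missed_by_crawler` and the 30-day crawl schedule are hypothetical, and "demand" for the document is not modeled.

```python
def is_missed_by_crawler(born, died, crawl_times):
    """Return True if no crawl happens while the document is alive.

    born: day the document appears.
    died: day the document disappears, or None if it is still online.
    crawl_times: days on which a crawl takes place (hypothetical schedule).
    """
    for t in crawl_times:
        alive_at_t = t >= born and (died is None or t < died)
        if alive_at_t:
            return False  # at least one crawl sees the document
    return True


# Hypothetical monthly crawl schedule (days 0, 30, 60).
crawls = [0, 30, 60]

lives_between_crawls = is_missed_by_crawler(5, 20, crawls)    # born and died between two crawls
survives_until_crawl = is_missed_by_crawler(5, 40, crawls)    # still alive at the crawl on day 30
born_after_last_crawl = is_missed_by_crawler(62, None, crawls)  # not yet crawled
```

A document for which the check is true is invisible to the search engine for its whole life (or until the next crawl), which is the situation the question on this slide asks about.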


Search for „sars“ on Google: Top 3

Google Example (done at ca. 21:45, 1.5.2003, Thu.)


Search for „sars“ on Google: No. 4-6

Google Example cont.


Content of No. 2


Content of No. 4


Results from Google News (ca. 10:15, 2.5.2003, Fri.)


Results from MSN (ca. 23:05, 1.5.2003, Thu.)


Google vs MSN

Result Analysis

1. Actually, all top 15 results of MSN are about the disease SARS.
2. MSN's collection size is only a bit more than 1/3 of that of Google.
3. MSN might adjust the weights of SARS-related documents.
4. How can that be done in a systematic and uniform way for an SE with a huge collection of documents, like Google?

Google's Problems

1. Ephemeral documents are not included in the collection.
2. Public information needs are reflected with a delay.
3. The weights given to ephemeral documents are not high enough.


My Notions

3 Generations of Rankings

Generation 1:
Factors: on-page ones, such as keywords/terms
Algorithm: boolean model, vector space similarity, latent semantic indexing, fuzzy set model, probabilistic models, etc.

Generation 2:
Factors: on-page ones + link structure
Algorithm: G1 + link structure analysis, e.g. PageRank (importance of a page in a general sense), HITS

Generation 3:
Factors: on-page ones + link structure + semantic factors
Algorithm: G1 + G2 + Alvis


Normal vs. Ephemeral Web Documents I

Ranking Life Cycle of Normal Documents

[Figure: ranking value (y-axis) over life time (x-axis), from the viewpoints of PageRank and Human-Mind. Stages marked along the curve: birth; crawled; pointed to by more and more incoming links; entering into a more or less stable status; other perturbations.]


Normal vs. Ephemeral Web Documents II

Ranking Life Cycle of Ephemeral Documents

[Figure: ranking value (y-axis) over life time (x-axis), with one curve for PageRank and one for Human-Mind. Stages marked: birth; crawled; pointed to by more and more incoming links; entering into a more or less stable status; other perturbations.]


Current Work on Ephemeral Web Documents

Nothing, basically.

1. Google continues its trilogy of roughly monthly crawling of the whole web, PageRank computation, adding other factors in.

2. People may not consider it really important to solve this problem. The current centralized, colossal and complete strategy is considered good enough.

3. Separate solutions and systems are provided to address the problem, for example, news.google.com.


Analysis by Matrix Computation

$$P = cG + (1-c)E, \qquad A = P^{T}$$

The principal eigenvector of $A$.

$$G' = G + N + \mathrm{G2N} + \mathrm{N2G}, \qquad P' = cG' + (1-c)E', \qquad A' = (P')^{T}$$

The principal eigenvector of $A'$.

Continuously compute the new eigenvectors given the old ones and the minor change.

[Figure: the previous Web graph G and the newly emerged pages N, connected by the link sets G2N and N2G.]

G: the previous Web graph
N: newly emerged Web pages of a news Web site

Heavier weights have to be given to the links pointing to the new ephemeral documents.
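The idea of computing the new eigenvector from the old one can be illustrated with a warm-started power iteration. This is a minimal sketch, not the method of the talk: it assumes a uniform matrix E, a damping value c = 0.85, uniform redistribution of rank from pages without out-links, and a tiny hypothetical graph. The function name `pagerank` and the page names are made up for illustration, and the heavier link weights mentioned above are not modeled.

```python
def pagerank(links, c=0.85, start=None, tol=1e-12, max_iter=10_000):
    """Power iteration for the principal eigenvector of A = P^T, P = cG + (1-c)E.

    links maps each page to the list of pages it links to.
    E is assumed uniform; start is an optional previous rank vector.
    Returns (rank dict, number of iterations used).
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    if start is None:
        rank = {p: 1.0 / n for p in pages}
    else:
        # Warm start: reuse old values, give unseen pages a uniform share,
        # then renormalize so the vector sums to 1.
        rank = {p: start.get(p, 1.0 / n) for p in pages}
        total = sum(rank.values())
        rank = {p: r / total for p, r in rank.items()}
    for iteration in range(1, max_iter + 1):
        # Rank held by pages without out-links is spread uniformly (assumption).
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - c) / n + c * dangling / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                new[q] += c * rank[p] / len(outs)
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            return rank, iteration
    return rank, max_iter


# G: the previous Web graph (hypothetical toy example).
old_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
old_rank, _ = pagerank(old_graph)

# G' = G + N + G2N + N2G: add new pages n1, n2 and the links between G and N.
new_graph = {p: list(outs) for p, outs in old_graph.items()}
new_graph["a"].append("n1")    # G2N: an old page links to a new page
new_graph["n1"] = ["n2", "a"]  # a link inside N and an N2G link
new_graph["n2"] = ["n1"]       # a link inside N

warm_rank, warm_iters = pagerank(new_graph, start=old_rank)
cold_rank, cold_iters = pagerank(new_graph)
```

Both runs approach the same vector, since the stationary distribution does not depend on the starting point; the warm start only changes where the iteration begins, and how many iterations it saves depends on how small the change N is relative to G.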


New Matrix

After including ephemeral documents


Computation Based on the New Matrix

1. Aperiodic: the matrix is induced by the web graph.
2. Irreducible: strongly connected.

Ergodic Theorem applies: the Markov chain defined by Q has a unique stationary probability distribution.

The Computation Converges.

How to Compute

1. Adaptive methods for PageRank computation.
2. k = 400 × (4,500 to 35,000) = 1,800,000 to 14,000,000, i.e. 0.06% to 0.47% of 3 billion.
3. Make use of the block structure.
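The arithmetic behind item 2 can be checked directly. The slide does not say what the two factors stand for, so they are used here only as given; $k_{\min}$ and $k_{\max}$ are names introduced for this check.

$$k_{\min} = 400 \times 4{,}500 = 1{,}800{,}000, \qquad \frac{k_{\min}}{3 \times 10^{9}} = 0.0006 = 0.06\%$$

$$k_{\max} = 400 \times 35{,}000 = 14{,}000{,}000, \qquad \frac{k_{\max}}{3 \times 10^{9}} \approx 0.00467 \approx 0.47\%$$

In both cases the change touches well under 1% of a 3-billion-page collection.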


After including ranking of ephemeral documents

Applications in Search Engines

1. Ranking of normal and ephemeral documents can be unified seamlessly.

2. Strong support of a decentralized architecture for Web and peer-to-peer search engines.

3. No contradiction to using separate solutions. For example news.google.com can be easily built upon a unified ranking scheme.


Revisiting the Challenges by Dr. Andrei Broder

3 Challenges

• A web graph model that takes into account information content.
• A method to compare graph-derived, query-independent factors.
• Methods to create graphs where none exists.


Future Work

Real Computation on a Web-Scale Data Set

• Where is the data set?

Taking Into Account More Semantic Information

• Semantic information of the documents and the content


Questions?