A Unified Probabilistic Framework for Web Page Scoring Systems

Michelangelo Diligenti, Marco Gori, Fellow, IEEE, and Marco Maggini, Member, IEEE Computer Society

Abstract—The definition of efficient page ranking algorithms is becoming an important issue in the design of the query interface of Web search engines. Information flooding is a common experience, especially when broad-topic queries are issued. Queries containing only one or two keywords usually match a huge number of documents, while users can only afford to visit the first positions of the returned list, which do not necessarily refer to the most appropriate answers. Some successful approaches to page ranking in a hyperlinked environment, like the Web, are based on link analysis. In this paper, we propose a general probabilistic framework for Web Page Scoring Systems (WPSS), which incorporates and extends many of the relevant models proposed in the literature. In particular, we introduce scoring systems for both generic (horizontal) and focused (vertical) search engines. Whereas horizontal scoring algorithms are based only on the topology of the Web graph, vertical ranking also takes the page contents into account and is the basis for focused and user-adapted search interfaces. Experimental results are reported to show the properties of some of the proposed scoring systems, with special emphasis on vertical search.

Index Terms—Web page scoring systems, random walks, HITS, PageRank, focused PageRank.

1 INTRODUCTION

THE Web, with its popularity, fast and uncontrollable growth, and heterogeneity, poses serious challenges to search engine designers, even if most of the techniques required for searching a collection of documents had already been studied in the related fields of databases and information retrieval. The new scenario is quite different from traditional information retrieval applications, which deal with more controlled environments. The Web is extremely dynamic, its rate of growth is impressive, and one of the basic issues is that there is no central control over which (where and when) documents are published. Nowadays, almost any user of the Web can publish his/her own pages, with any contents, being a good author and expert on the topic he/she is writing on or being just a spammer. Thus, there are three main challenges which a search engine has to face. The first problem is how to find and index the documents on the Web. Search engines are de facto the only central indexes of the information available on the Web, which otherwise would be accessible only by navigating through hyperlinks. Unfortunately, search engines are not able to track the publication of new pages, and the only way they have to build their indexes is to collect the documents by crawling the Web graph. Crawling must be performed continuously and, nowadays, a complete crawl of the whole Web is not feasible. Both the size and the structure of the Web, as well as freshness requirements, force search engines to cover only a fraction of the whole Web [1], [2].

The second issue concerns the size and the heterogeneity of the information available on the Web. The different document formats, the various authoring styles used to write Web documents, and the huge quantity of information require accurate processing to create reliable and efficient indexes.

Finally, the user search interface is probably one of the key success factors for a search engine. Even if most search engines offer advanced search interfaces, most users just use the simple keyword-based interface. The user issues his/her query as a list of words, looking for documents which contain all (or some) of these words. While the techniques to retrieve the documents which match the query are relatively simple using inverted indexes, it is difficult to provide high quality and relevant results to the user. Typical queries are based only on a few words (often just one or two) and, thus, they can be described as "broad-topic" queries. When thousands of documents match a query, the user is really flooded by information and can typically only afford to check a very small fraction of the returned answers. Thus, the definition of good document ranking functions turns out to be a crucial issue in search engine design. Proper criteria must be devised to automatically compute a score which evaluates both the relevance of the document with respect to the query and the "quality" of its contents.

The analysis of the hyperlinks on the Web [3] has been proposed as a way to derive a quality measure for the information on the Web. The structure of the hyperlinks on the Web is the result of the collaborative activity of the community of Web authors. Authors usually like to link resources they consider authoritative, and authority emerges from the dynamics of popularity of the resources on the Web. Sophisticated algorithms have been studied to compute reliable measures of authority from the topological structure of interconnections among the Web pages. Simply counting the number of links to a page does not take into account the fact that not all the citations have the same authority. PageRank [4], used by the Google search engine, is a noticeable example of a topology-based ranking criterion. The authority of a page is computed recursively as a function of the authorities of the pages that link the target page. HITS [5] is another well-known algorithm which computes two values related to topological properties of the Web pages, the authority and the hubness. The HITS scheme is query-dependent. User queries are issued to a search engine in order to create a set of seed pages. Crawling the Web forward and backward from that seed is performed to mirror the Web portion containing the information which is likely to be useful. A ranking criterion based on topological analyses can then be applied to the pages belonging to the selected Web portion. Very interesting results in this direction have been proposed in [6], [7], [8]. In [9], a Bayesian approach is used to compute hubs and authorities, whereas in [10], both topological information and information about the page content are included in the distillation of information sources performed by a Bayesian approach. Recently, other approaches which also include the page contents in the score computation have been proposed to define focused measures of document quality [11], [12].

In this paper, we propose a general probabilistic framework for Web Page Scoring Systems (WPSS) which incorporates and extends many of the relevant models proposed in the literature. A first report of the research described in this paper can be found in [13]. Here, we propose a further extension of WPSS and provide additional experimental results. The general Web page scoring model proposed in this paper extends both PageRank [4] and the HITS scheme [5]. In addition, the proposed model exhibits a number of novel features, which turn out to be very useful, especially for focused (vertical) search. The content of the pages is combined with the graphical structure of the Web, giving rise to scoring mechanisms which are focused on a specific topic. Moreover, in the proposed model, vertical search schemes can take into account the mutual relationship among different topics. In so doing, the discovery of pages with a high score for a given topic affects the score of pages with related topics. Experiments were carried out to assess the features of the proposed scoring systems with special emphasis on vertical search. The very promising experimental results reported in the paper provide a clear validation of the proposed general scheme for Web page scoring systems.

The paper is organized as follows: The next section introduces the general probabilistic framework based on random walks which can be used to describe the different WPSSs. Section 3 describes the horizontal ranking schemes using the proposed framework. Horizontal rankings are based only on the graph topology and do not consider the page contents. In particular, the well-known PageRank and HITS algorithms and some extensions are derived from the common framework. Vertical scoring systems are described in Section 4. Vertical ranking functions are useful for focused search interfaces and for user-adapted applications. Some different models are described as examples of focused ranking functions which can be derived in the proposed framework. Finally, Section 5 presents a set of experimental evaluations of both horizontal and vertical WPSSs and, in Section 6, the conclusions are drawn.

2 RANDOM WALKS AND PAGE RANKING

The Web can be viewed as a graph $G$ whose nodes correspond to the pages and whose arcs are defined by the hyperlinks between the pages. If $p$ and $q$ are the nodes corresponding to the pages $D_p$ and $D_q$, then there is an arc $(p,q)$ in $G$ if the page $D_p$ contains a hyperlink to the page $D_q$. The topology of the Web graph is quite complex and is the result of the behavior of the community of Web authors. Thus, the graph topology carries much information related to the cooperative interaction of many agents. One of the emerging properties of the resulting graph is that high quality resources tend to be referred to by many Web authors.

The idea of using the collaborative judgments on Web resources hidden in the structure of the Web topology has been proposed as the basis to define page ranking criteria. In particular, random walk theory has been proposed as a framework to define models to compute the absolute relevance of a page [4], [8]. The relevance $x_p$ of the page $p$ is computed as the probability of visiting that page in a random walk on the Web graph. The most popular pages (i.e., the most linked ones) are the most likely to be visited during the random walk on the Web.

2.1 The Single-Surfer Walk

In order to define a general probabilistic framework for random walks, we model the actions of a generic Web surfer. At each step of the walk, the surfer can perform one of the following atomic actions: jump to any node of the graph (action $j$), follow a hyperlink from the current page (action $l$), follow a hyperlink in the inverse direction (action $b$), and stay in the current node (action $s$). Thus, the set of the atomic actions is $O = \{j, l, b, s\}$.

At each step, the behavior of the surfer depends on the current page. For example, if the surfer considers the current page relevant, a hyperlink contained in that page will likely be followed, whereas, if the page is not interesting, the surfer is likely to jump to a page not linked by the current one. Thus, the surfer's behavior can be modeled by a set of conditional probabilities which depend on the current page $q$:

. $x(l|q)$: the probability of following a hyperlink from $q$,

. $x(b|q)$: the probability of following a back-link from $q$,

. $x(j|q)$: the probability of jumping from $q$, and

. $x(s|q)$: the probability of remaining in $q$.

These values must satisfy the normalization constraint $\sum_{o \in O} x(o|q) = 1$. Most of these actions need to specify their targets. We assume that the surfer's behavior is time-invariant and that the model can assign a specific weight to each link of a page (as in [14]). Thus, we can model the targets for jumps, hyperlinks, or back-links by using the following parameters:

. $x(p|q,j)$: the probability of jumping from page $q$ to page $p$,

. $x(p|q,l)$: the probability of selecting a hyperlink from page $q$ to page $p$; this value is not null only for those pages $p$ linked directly by page $q$, i.e., $p \in ch(q)$, $ch(q)$ being the set of the children of node $q$ in the graph $G$, and

. $x(p|q,b)$: the probability of going back from page $q$ to page $p$; this value is not null only for the pages $p$ which link directly to page $q$, i.e., $p \in pa(q)$, $pa(q)$ being the set of the parents of node $q$ in the graph $G$.

These sets of values must satisfy the following probability normalization constraints for each page $q \in G$:

$$\sum_{p \in G} x(p|q,j) = 1, \qquad \sum_{p \in ch(q)} x(p|q,l) = 1, \qquad \sum_{p \in pa(q)} x(p|q,b) = 1.$$

The random walk is defined by a sequence of actions performed by the surfer. The probabilistic model can be used to compute the probability that the surfer is located in each page $p$ at time $t$, $x_p(t)$. The probability distribution on all the pages is represented by the vector $\mathbf{x}(t) = [x_1(t), \ldots, x_N(t)]'$, $N$ being the total number of pages. The probabilities $x_p(t)$ are updated at each step of the random walk taking into account the surfer model and, in particular, the probabilities associated to the actions that can be taken, using the following equation:

$$x_p(t+1) = \sum_{q \in G} x(p|q) \cdot x_q(t) = \sum_{q \in G} x(p|q,j) \cdot x(j|q) \cdot x_q(t) + \sum_{q \in pa(p)} x(p|q,l) \cdot x(l|q) \cdot x_q(t) + \sum_{q \in ch(p)} x(p|q,b) \cdot x(b|q) \cdot x_q(t) + x(s|p) \cdot x_p(t), \qquad (1)$$

where the probability $x(p|q)$ of moving from page $q$ to page $p$ is expanded by considering the user's actions.

The probabilities defining the surfer model can be organized in the following $N \times N$ matrices:

. the forward matrix $\Delta$, whose element $(p,q)$ is the probability $x(p|q,l)$;

. the backward matrix $\Gamma$, collecting the probabilities $x(p|q,b)$; and

. the jump matrix $\Sigma$, which is defined by the jump probabilities $x(p|q,j)$.

The forward and backward matrices are related to the Web adjacency matrix $W$, whose entry $(p,q)$ is one if page $p$ links page $q$, and zero otherwise. In particular, the forward matrix $\Delta$ has nonnull entries only in the positions corresponding to the entries equal to 1 in the matrix $W$, and the backward matrix $\Gamma$ has nonnull entries only in the positions corresponding to the entries equal to 1 in the matrix $W'$. We can also define the set of action matrices which collect the probabilities of taking one of the possible actions from a given page $q$. These are $N \times N$ diagonal matrices defined as follows: $D_j$, whose diagonal values $(q,q)$ are the probabilities $x(j|q)$; $D_l$, collecting the probabilities $x(l|q)$; $D_b$, containing the values $x(b|q)$; and $D_s$, having the probabilities $x(s|q)$ on its diagonal. Hence, (1) can be written in matrix form as

$$\mathbf{x}(t+1) = (\Sigma \cdot D_j)'\,\mathbf{x}(t) + (\Delta \cdot D_l)'\,\mathbf{x}(t) + (\Gamma \cdot D_b)'\,\mathbf{x}(t) + (D_s)'\,\mathbf{x}(t). \qquad (2)$$

By defining the transition matrix as

$$T = \left(\Sigma \cdot D_j + \Delta \cdot D_l + \Gamma \cdot D_b + D_s\right)',$$

(2) can be written as

$$\mathbf{x}(t+1) = T \cdot \mathbf{x}(t). \qquad (3)$$

Given the initial distribution $\mathbf{x}(0)$, (3) can be applied recursively to compute the probability distribution at a given time step $t$, yielding

$$\mathbf{x}(t) = T^t \cdot \mathbf{x}(0). \qquad (4)$$

The absolute page rank for the pages on the Web is obtained by considering the stationary distribution of the Markov chain defined by the previous equation. $T'$ is the state transition matrix of the Markov chain. $T'$ is stable since it is a stochastic matrix having its maximum eigenvalue equal to 1. Since the state vector $\mathbf{x}(t)$ evolves following the equation of a Markov chain, it is guaranteed that, if $\sum_{q \in G} x_q(0) = 1$, then $\sum_{q \in G} x_q(t) = 1$, $t = 1, 2, \ldots$.

Proposition 1. If $x(j|q) \neq 0$ and $x(p|q,j) \neq 0$, $\forall p, q \in G$, then there exists $\mathbf{x}^\star$ such that $\lim_{t \to \infty} \mathbf{x}(t) = \mathbf{x}^\star$ and $\mathbf{x}^\star$ does not depend on the initial state vector $\mathbf{x}(0)$.

Proof. Because of the hypotheses, $\Sigma \cdot D_j$ is strictly positive, i.e., all its entries are greater than 0. Since the transition matrix $T$ of the Markov chain is obtained by adding nonnegative matrices, the transition matrix $T$ is also strictly positive. Thus, the resulting Markov chain is irreducible and, consequently, it has a unique stationary distribution given by the solution of the equation $\mathbf{x}^\star = T\mathbf{x}^\star$, where $\mathbf{x}^\star$ satisfies $(\mathbf{x}^\star)'\mathbf{1} = 1$, $\mathbf{1}$ being the $N$-dimensional vector whose entries are all equal to 1 (see, e.g., [15]).
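To make the single-surfer walk concrete, the following sketch builds the transition matrix of (3) for a toy four-page graph and iterates (4) until the score distribution settles. The graph, the action probabilities, and the uniform target distributions are illustrative assumptions, not values prescribed by the paper; the matrices are stored with rows holding the outgoing distributions, so the transpose of the combined matrix plays the role of $T$.

import numpy as np

# Toy adjacency matrix: W[p, q] = 1 if page p links page q (the paper's
# convention). Every page here has both in-links and out-links, so the
# uniform normalizations below are well-defined.
W = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
N = W.shape[0]

# Row q of each matrix is the target distribution of the corresponding action:
# forward (uniform over children), backward (uniform over parents), jump (uniform).
Delta = W / W.sum(axis=1, keepdims=True)
Gamma = W.T / W.T.sum(axis=1, keepdims=True)
Sigma = np.full((N, N), 1.0 / N)

# Action probabilities x(j|q), x(l|q), x(b|q), x(s|q): constant over pages in
# this toy example and summing to 1, as the normalization constraint requires.
pj, pl, pb, ps = 0.2, 0.5, 0.2, 0.1

# Transition matrix of (3): the transpose makes it column-stochastic, so that
# x(t+1) = T @ x(t) redistributes the probability mass exactly as in (1).
T = (pj * Sigma + pl * Delta + pb * Gamma + ps * np.eye(N)).T

x = np.full(N, 1.0 / N)       # any initial distribution works (Proposition 1)
for _ in range(100):          # power iteration, cf. (4)
    x = T @ x
print(x)                      # approximates the stationary score vector x*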

2.2 The Multisurfer Walk

A model based on a single variable may not be able to capture the complex relationships and dependencies among Web pages needed to compute their absolute relevance. Ranking schemes based on multiple variables have been proposed in [5], [8], where two variables are used to measure the hubness and the authority of each page.

The random walk framework can be extended by considering a pool of Web surfers having different behaviors in order to capture different properties of the Web. Each surfer can be modeled by using different values for the parameters in the random walk (2) in order to define different policies for evaluating the absolute importance of the pages. Moreover, the surfers may interact by accepting suggestions from each other.

The multisurfer model considers $M$ different surfers. For each surfer $i$, $i = 1, \ldots, M$, $x^{(i)}_q(t)$ represents the probability of surfer $i$ visiting page $q$ at time $t$. Each surfer may accept the suggestion of another surfer before taking an action. The interaction among the surfers is modeled by a set of parameters $b(i|k)$ which define the probability of surfer $k$ jumping to the page currently visited by surfer $i$. Thus, in this model, we hypothesize that the interaction does not depend on the pages which the surfers are currently visiting, but only on how much the surfers trust each other. The values $b(i|k)$ must satisfy the probability normalization constraint $\sum_{s=1}^{M} b(s|k) = 1$, $\forall k = 1, \ldots, M$.

Hence, before taking any action, surfer $i$ moves to page $p$ with probability $v^{(i)}_p(t)$ due to the suggestions of the other surfers. This probability is computed as

$$v^{(i)}_p(t) = \sum_{s=1}^{M} b(s|i)\, x^{(s)}_p(t). \qquad (5)$$

The intermediate distribution $v^{(i)}_p$ is computed before taking the action that generates the new probability distribution $x^{(i)}_p$ at the following time step. This intermediate step is introduced to synchronize the pool of surfers. We can organize the probability distributions on the pages for the $M$ surfers, $\mathbf{x}^{(i)}(t)$, as the columns of an $N \times M$ matrix $X(t)$. Moreover, we can define the $M \times M$ matrix $A$ whose $(i,k)$ element is $b(i|k)$. The matrix $A$ will be referred to as the interaction matrix. The modified probability distributions $v^{(i)}_p(t)$, due to the interaction among the surfers, can be collected in an $N \times M$ matrix $V(t)$, which is obtained as $V(t) = X(t) \cdot A$. Finally, the behavior of surfer $i$ is modeled by the set of forward, backward, and jump matrices $\Delta^{(i)}$, $\Gamma^{(i)}$, $\Sigma^{(i)}$, and by the action matrices $D^{(i)}_j$, $D^{(i)}_l$, $D^{(i)}_b$, $D^{(i)}_s$, as in (2). Thus, the transition matrix for the Markov chain associated to surfer $i$ is

$$T^{(i)} = \left(\Sigma^{(i)} \cdot D^{(i)}_j + \Delta^{(i)} \cdot D^{(i)}_l + \Gamma^{(i)} \cdot D^{(i)}_b + D^{(i)}_s\right)'.$$

The set of the $M$ interacting surfers can be described by the following equations:

$$\begin{cases} \mathbf{x}^{(1)}(t+1) = T^{(1)} \cdot X(t) \cdot A^{(1)} \\ \quad\vdots \\ \mathbf{x}^{(M)}(t+1) = T^{(M)} \cdot X(t) \cdot A^{(M)}, \end{cases} \qquad (6)$$

where $A^{(k)}$ denotes the $k$th column of the matrix $A$. When the surfers are independent of each other (i.e., $b(i|i) = 1$, $i = 1, \ldots, M$, and $b(i|j) = 0$, $i = 1, \ldots, M$, $j = 1, \ldots, M$, $j \neq i$), the model reduces to a set of $M$ independent models as described by (4).
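The following sketch runs the synchronous update of (6) for a pool of $M = 2$ surfers; the two transition matrices and the interaction matrix are random, hypothetical stand-ins, since the actual matrices depend on the surfer models being combined.

import numpy as np

def multisurfer_step(X, T_list, A):
    # One step of (6). X holds one column per surfer and A[s, k] = b(s|k),
    # so V(t) = X(t) A collects the suggestion-adjusted distributions of (5).
    V = X @ A
    return np.column_stack([T_list[i] @ V[:, i] for i in range(len(T_list))])

N, M = 4, 2
rng = np.random.default_rng(0)
# Random column-stochastic matrices standing in for T(1) and T(2).
T_list = [m / m.sum(axis=0) for m in rng.random((M, N, N))]
A = np.array([[0.8, 0.3],     # each column sums to 1:
              [0.2, 0.7]])    # sum_s b(s|k) = 1 for every surfer k
X = np.full((N, M), 1.0 / N)  # initial distribution of each surfer
for _ in range(200):
    X = multisurfer_step(X, T_list, A)
print(X)                      # one stationary score vector per surfer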

The general model described herein gives rise to many different ranking schemes, some of which are classified in Table 1 and will be analyzed in detail in the remainder of the paper.

3 HORIZONTAL WPSS

Horizontal WPSSs define an absolute ranking on a set of Web pages using only the topological information represented by the Web graph. These approaches are supported by the idea that, in a hyperlinked environment, the structure of the interconnections should reflect the quality of the resources, i.e., scarcely linked pages are low quality pages, whereas highly referenced pages are relevant sources of information. Different criteria can be defined by refining this simple idea to define the authority of a page in the hyperlinked environment. In the proposed probabilistic framework, horizontal WPSSs are characterized by the fact that the parameters used in the probability computations are independent of the page contents. In particular, in this section, we show how the two most popular WPSSs, PageRank and HITS, can be described as special cases of the random walk framework, even if the original HITS algorithm violates the probabilistic assumptions. We also introduce a hybrid version of these two algorithms.

3.1 PageRank

The computation of PageRank [16] can be modeled by a single-surfer random walk by choosing a surfer model based on only two actions: the surfer jumps to a new random page with probability $x(j|p) = 1 - d$, or he follows one link from the current page with probability $x(l|p) = d$. The probabilities of the other two actions considered in the general model are null, i.e., $x(b|p) = 0$ and $x(s|p) = 0$. All these values are clearly independent of the page $p$. Given that a jump is taken, its target is selected using a uniform probability distribution over all the $N$ Web pages, i.e., $x(p|j) = 1/N$, $\forall p \in G$. Finally, the probability of following the hyperlink from page $q$ to page $p$ does not depend on the page $p$, i.e., $x(p|q,l) = \delta_q$. In order to meet the normalization constraint, $\delta_q = 1/h_q$, where the hubness of page $q$, $h_q = |ch(q)|$, is the number of links exiting from page $q$ (the number of children of the node $q$ in $G$). This requirement cannot be met by sink pages, i.e., the pages which do not contain any link to other pages. In order to keep the probabilistic interpretation of PageRank, all sink nodes must be removed, unless the computation is slightly modified as described further on.

TABLE 1: Main Features of the Proposed Ranking Functions. The H (V) labels refer to functions for, respectively, horizontal (vertical) scoring systems. The S and M labels indicate whether the ranking function is underlaid by, respectively, a single surfer or a pool of collaborative surfers. The jump, back, and forward columns indicate whether the corresponding parameter, describing a surfer behavior, is focused (F) or uniform (U) for each proposed ranking function. This table is not exhaustive: other ranking functions (with specific features) could be derived from the proposed general framework by choosing appropriate settings.

By using the PageRank surfer model, (1) can be rewritten as

$$x_p(t+1) = \frac{1-d}{N} \sum_{q \in G} x_q(t) + d \sum_{q \in pa(p)} \delta_q \cdot x_q(t) = \frac{1-d}{N} + d \sum_{q \in pa(p)} \delta_q \cdot x_q(t), \qquad (7)$$

where $\sum_{q \in G} x_q(t) = 1$ because the probabilistic interpretation is valid. The fact that $0 < d < 1$ and, thus, $x(j|p) = 1 - d > 0$, guarantees that the PageRank vector converges to a distribution of page scores that does not depend on the initial distribution.

The matrix form of the PageRank equation is

$$\mathbf{x}(t+1) = \frac{1-d}{N}\,\mathbf{1} + d \cdot W' H \cdot \mathbf{x}(t), \qquad (8)$$

where $W$ is the adjacency matrix of the Web graph and $H$ is the diagonal matrix whose $(p,p)$ element is the inverse of the hubness $h_p$ of page $p$. Thus, because of the hypothesis of independence of the parameters $x(p|q,l)$ from the page $p$, it follows that $(\Delta \cdot D_l)' = d \cdot W' H$, i.e., the matrix $\Delta$ can be factorized into the product of the adjacency matrix of the graph $W$ and the hubness diagonal matrix $H$.

Sink nodes violate the probabilistic constraints since no links can actually be followed from a sink node, while the surfer model considers this possibility as a valid action (i.e., $x(l|q) = d \neq 0$). In order to overcome this problem, it should be $x(l|q) = 0$ and, consequently, $x(j|q) = 1$ for any sink node $q$. Thus, in order to also take the sink nodes into account, the PageRank computation should be modified by using

$$\begin{cases} x(j|q) = 1 - d & \text{if } ch(q) \neq \emptyset \\ x(j|q) = 1 & \text{if } ch(q) = \emptyset. \end{cases} \qquad (9)$$

In this case, the contribution of the jump probabilities does not sum to a constant term as happens in (7), but the value $x(p|j,t) = \frac{1}{N} \sum_{q \in G} x(j|q)\, x_q(t)$, which represents the probability of jumping to page $p$ at time $t$, must be computed at the beginning of each iteration. This is the computational scheme we used in our experiments.
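A minimal sketch of this scheme, assuming a toy three-page graph in which the last page is a sink: the jump mass is recomputed at every iteration as prescribed by (9), instead of being the constant term $(1-d)/N$ of (7).

import numpy as np

def pagerank(W, d=0.85, iters=100):
    N = W.shape[0]
    h = W.sum(axis=1)                          # hubness h_q = |ch(q)|
    # Column q of P holds delta_q = 1/h_q over the children of q (zero for sinks).
    P = np.divide(W, h[:, None], out=np.zeros_like(W), where=h[:, None] > 0).T
    jump = np.where(h > 0, 1.0 - d, 1.0)       # x(j|q) as in (9)
    x = np.full(N, 1.0 / N)
    for _ in range(iters):
        jump_mass = (jump @ x) / N             # probability of reaching any page via a jump
        x = jump_mass + d * (P @ x)            # the total mass stays equal to 1
    return x

W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 0, 0]], dtype=float)         # page 2 is a sink
print(pagerank(W))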

3.2 HITS

The HITS algorithm was proposed to model authoritative documents relying only on the information hidden in the connections among them due to cocitations or Web hyperlinks [5]. The algorithm assigns two values to each page $p$: the authority is a measure of the page relevance as an information source, while the hubness refers to the quality of a page as a link to authoritative resources. Thus, the two values computed by the HITS algorithm allow us to distinguish between pages which are authorities and pages which are hubs. In the original formulation, these values are computed by iteratively applying the following equations:

$$\begin{cases} a_q(t+1) = \sum_{p \in pa(q)} h_p(t) \\ h_q(t+1) = \sum_{p \in ch(q)} a_p(t), \end{cases} \qquad (10)$$

where $a_q$ indicates the authority of page $q$ and $h_q$ its hubness. If $\mathbf{a}(t)$ is the vector collecting all the authorities at step $t$, and $\mathbf{h}(t)$ is the hubness vector at step $t$, the previous equation can be rewritten in matrix form as

$$\begin{cases} \mathbf{a}(t+1) = W' \cdot \mathbf{h}(t) \\ \mathbf{h}(t+1) = W \cdot \mathbf{a}(t), \end{cases} \qquad (11)$$

where $W$ is the adjacency matrix of the Web graph. At each time step, the HITS algorithm requires normalizing the two vectors $\mathbf{a}(t)$ and $\mathbf{h}(t)$ to unit length. It can be demonstrated that, as $t$ tends to infinity, the direction of the authority vector tends to be parallel to the main eigenvector of the $W' \cdot W$ matrix (cocitation matrix²), whereas the hubness vector tends to be parallel to the main eigenvector of the $W \cdot W'$ matrix (bibliographic coupling matrix¹). See [5] for further details.

The HITS ranking scheme can be represented in the general Web surfer framework, even if some of the assumptions violate the probabilistic interpretation. Since HITS uses two state variables, the hubness and the authority of a page, the corresponding random walk model is a multisurfer scheme based on the activity of two surfers. Surfer 1 is associated to the hubness of pages, whereas surfer 2 is associated to the authority of pages. For both surfers, the probabilities of remaining in the same page, $x^{(i)}(s|p)$, and of jumping to a random page, $x^{(i)}(j|p)$, are null. Surfer 1 never follows a link, i.e., $x^{(1)}(l|p) = 0$, $\forall p \in G$, whereas he always follows a back-link, i.e., $x^{(1)}(b|p) = 1$, $\forall p \in G$. In order to obtain the original HITS computation, we must set $x^{(1)}(p|q,b) = 1$ for each page $p$ linking page $q$. This assumption violates the probability normalization constraints since $\sum_{p \in pa(q)} x^{(1)}(p|q,b) = |pa(q)| \geq 1$. Surfer 2 has the opposite behavior with respect to surfer 1. He always follows a link, i.e., $x^{(2)}(l|p) = 1$, $\forall p \in G$, and he never follows a back-link, i.e., $x^{(2)}(b|p) = 0$. In this case, the normalization constraint is violated for the values of $x^{(2)}(p|q,l)$ because the HITS scheme defines $x^{(2)}(p|q,l) = 1$ for each page $p$ linked by page $q$ and, thus, $\sum_{p \in ch(q)} x^{(2)}(p|q,l) = |ch(q)| \geq 1$. The HITS equations can easily be modified in order to obtain a probabilistically coherent model. We just need to choose $x^{(1)}(p|q,b) = \frac{1}{|pa(q)|}$ and $x^{(2)}(p|q,l) = \frac{1}{|ch(q)|}$. This model is analyzed in [8]. Thus, the action matrices describing the HITS surfers are $D^{(1)}_b = I$ and $D^{(2)}_l = I$, $I$ being the identity matrix, whereas $D^{(1)}_j$, $D^{(1)}_l$, $D^{(1)}_s$, $D^{(2)}_j$, $D^{(2)}_b$, $D^{(2)}_s$ are all equal to the null matrix. Moreover, the interaction between the surfers is described by the matrix

$$A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}. \qquad (12)$$

The interpretation of the interactions represented by this matrix is that surfer 1 considers surfer 2 an expert in discovering authorities and always moves to the position suggested by that surfer before taking his own action. On the other hand, surfer 2 considers surfer 1 an expert in finding hubs and then he always moves to the position suggested by that surfer before choosing the next action.

¹ The entry $(p,q)$ of the bibliographic coupling matrix is the number of pages jointly linked by the pages $q$ and $p$ [17].

² The entry $(p,q)$ of the cocitation matrix is the number of pages which jointly link the pages $q$ and $p$ [17].

In this case, (6) is

$$\begin{cases} \mathbf{x}^{(1)}(t+1) = \left(\Gamma^{(1)}\right)' \cdot X(t) \cdot A^{(1)} \\ \mathbf{x}^{(2)}(t+1) = \left(\Delta^{(2)}\right)' \cdot X(t) \cdot A^{(2)}. \end{cases} \qquad (13)$$

Using (12) and the HITS assumptions $(\Gamma^{(1)})' = W$ and $(\Delta^{(2)})' = W'$, we obtain

$$\begin{cases} \mathbf{x}^{(1)}(t+1) = W \cdot \mathbf{x}^{(2)}(t) \\ \mathbf{x}^{(2)}(t+1) = W' \cdot \mathbf{x}^{(1)}(t), \end{cases} \qquad (14)$$

which, redefining $\mathbf{h}(t) = \mathbf{x}^{(1)}(t)$ and $\mathbf{a}(t) = \mathbf{x}^{(2)}(t)$, is equivalent to the HITS computation of (11).

The HITS model violates the probabilistic interpretation and this makes the computation unstable, since the $W \cdot W'$ matrix has a principal eigenvalue larger than 1. Hence, the HITS algorithm requires the score vector to be normalized at the end of each iteration. Finally, the HITS scheme suffers from other drawbacks. In particular, large, highly connected communities of Web pages tend to attract the principal eigenvector of $W \cdot W'$, thus pushing the relevance of all other pages to zero. As a result, the page scores tend to decrease rapidly to zero for pages outside those communities. In [8], this effect is analyzed in detail and referred to as the Tightly Knit Community Effect. Because of this, the HITS algorithm can be reliably applied only on small subgraphs of the whole Web after an accurate pruning of the links. In fact, the HITS computation has been proposed as a scoring algorithm to be applied on the result set of a query (root set), augmented by the pages which link and are linked by those in the root set, and not on the whole Web. Recently, some heuristics have been proposed to reduce the problems affecting the original HITS algorithm, even if such behavior cannot be generally avoided because of the properties of the dynamic system associated to the HITS computation [18].
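For reference, here is a compact sketch of the iteration (10)-(11), including the unit-length normalization that the unstable dynamics make necessary; the four-page graph is an illustrative assumption.

import numpy as np

def hits(W, iters=50):
    # W[p, q] = 1 if page p links page q.
    N = W.shape[0]
    a = np.ones(N) / np.sqrt(N)
    h = np.ones(N) / np.sqrt(N)
    for _ in range(iters):
        a_new = W.T @ h                        # a_q = sum of the hubness of pa(q)
        h_new = W @ a                          # h_q = sum of the authority of ch(q)
        a = a_new / np.linalg.norm(a_new)      # renormalize to unit length: the
        h = h_new / np.linalg.norm(h_new)      # principal eigenvalue exceeds 1
    return a, h

W = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
a, h = hits(W)
print("authority:", a.round(3), "hubness:", h.round(3))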

3.3 PageRank-HITS

The multisurfer model allows us to combine the properties of the PageRank and HITS algorithms. Both of these algorithms have benefits and limitations, and the aim of the PageRank-HITS model is to combine the positive characteristics of the two techniques. In fact, the computation of the PageRank is stable and has a well-defined behavior because of its probabilistic interpretation. Moreover, it can be applied to large page collections since small communities are not overwhelmed by larger ones but still influence the ranking. On the other hand, PageRank is too simple to take into account the complex relationships of Web page citations. The HITS algorithm is not stable, only the largest Web community influences the ranking, and, thus, it cannot be applied to large page collections. On the other hand, the hub and authority model can capture the relationships among Web pages in more detail than PageRank.

The PageRank-HITS model employs two surfers: Surfer 1 follows a back-link with probability $x^{(1)}(b|q) = d^{(1)}$ or jumps to a random page with probability $x^{(1)}(j|q) = 1 - d^{(1)}$, $\forall q \in G$. In both cases, the target page $p$ is selected using a uniform probability distribution, i.e., $x^{(1)}(p|q,b) = \frac{1}{|pa(q)|}$ and $x^{(1)}(p|q,j) = \frac{1}{N}$. Surfer 2 follows a forward link with probability $x^{(2)}(l|q) = d^{(2)}$ or jumps to a random page with probability $x^{(2)}(j|q) = 1 - d^{(2)}$, $\forall q \in G$. In both cases, the target page $p$ is selected using a uniform probability distribution, i.e., $x^{(2)}(p|q,l) = \frac{1}{|ch(q)|}$ and $x^{(2)}(p|q,j) = \frac{1}{N}$. Thus, surfer 2 implements the PageRank model, while surfer 1 can be seen to follow a backward PageRank computation.³ As in the HITS scheme, the interaction between the surfers is described by the matrix

$$A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$

³ As shown in Section 3.1, sink nodes violate the probabilistic constraints for surfer 2. In this model, supersource nodes (i.e., the nodes $q$ having $|pa(q)| = 0$) also violate the probabilistic constraints for surfer 1. The equations can be modified straightforwardly to eliminate these problems.

In this case, (6) becomes

$$\begin{cases} \mathbf{x}^{(1)}(t+1) = \frac{1-d^{(1)}}{N}\,\mathbf{1} + d^{(1)}\, W P \cdot \mathbf{x}^{(2)}(t) \\ \mathbf{x}^{(2)}(t+1) = \frac{1-d^{(2)}}{N}\,\mathbf{1} + d^{(2)}\, W' H \cdot \mathbf{x}^{(1)}(t), \end{cases} \qquad (15)$$

where $(\Gamma^{(1)})' = W P$ and $(\Delta^{(2)})' = W' H$, $P$ being the diagonal matrix with element $(p,p)$ equal to $1/|pa(p)|$ and $H$ the diagonal matrix with element $(p,p)$ equal to $1/|ch(p)|$.

This page rank is stable, the scores sum up to 1, and no normalization is required at the end of each iteration. Moreover, the two state variables can capture and process more complex relationships among pages. In particular, setting $d^{(1)} = d^{(2)} = 1$ yields a normalized version of HITS, which has been proposed in [6].
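The coupled update of (15) can be sketched directly; the toy graph below has no sinks or supersources, so the corrections mentioned in footnote 3 are not needed here.

import numpy as np

W = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
N = W.shape[0]
H = np.diag(1.0 / W.sum(axis=1))    # H[p, p] = 1/|ch(p)|
P = np.diag(1.0 / W.sum(axis=0))    # P[p, p] = 1/|pa(p)|
d1 = d2 = 0.85

x1 = np.full(N, 1.0 / N)            # surfer 1: backward PageRank
x2 = np.full(N, 1.0 / N)            # surfer 2: forward PageRank
for _ in range(100):
    x1, x2 = ((1 - d1) / N + d1 * (W @ P @ x2),
              (1 - d2) / N + d2 * (W.T @ H @ x1))
print("hub-like scores:", x1.round(3))
print("authority-like scores:", x2.round(3))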

4 VERTICAL WPSS

Vertical WPSSs compute a relative ranking of pages when focusing on a specific topic. When applying scoring techniques to focused search, the page contents should be taken into account besides the graph topology. A vertical WPSS uses a set of features (e.g., a set of keywords) representing the page contents and a classifier which assigns to each page its degree of relevance with respect to the topic of interest. The general random walk framework for WPSSs proposed in this paper can be used to define a vertical approach to page scoring. Several models can be derived which combine the ideas of the topology-based criteria and the topic relevance measure provided by the text classifier. In particular, the text classifier can be used to compute the values of the probabilities needed by the random walk model. As shown by the experimental results, vertical WPSSs produce much more accurate results in ranking topic-specific pages.

4.1 Focused PageRank

In the PageRank framework, when choosing to follow a link from a page $q$, each link has the same probability $1/|ch(q)|$ of being followed. In the focused domain, we can consider the model of a surfer who follows the links according to the suggestions provided by a text classifier. Thus, this approach removes the assumption of complete randomness in the movements of the Web surfer. In this case, the surfer is aware of what he is searching for and will trust the classifier suggestions, following the links with a probability proportional to the topic relevance of the page which is the target of the link. This allows us to derive a topic-specific page ranking. For example, the "Microsoft" home page is highly authoritative according to the topic-generic PageRank, whereas it is not highly authoritative when searching for "Perl" language tutorials. In fact, even if that page is highly linked, most of the links are scarcely related to the target topic and their contribution will be negligible.

If the surfer is located at page $q$ and the pages linked by page $q$ are assigned the scores $s(ch_1(q)), \ldots, s(ch_{h_q}(q))$ by the classifier, the probability of the surfer following the $i$th link is defined as

$$x(ch_i(q)|q,l) = \frac{s(ch_i(q))}{\sum_{j=1}^{h_q} s(ch_j(q))}. \qquad (16)$$

Thus, the forward matrix $\Delta$ depends on the classifier outputs on the pages in the data set. Hence, the modified equation to compute the combined page scores using a PageRank-like scheme is

$$x_p(t+1) = \frac{1-d}{N} + d \sum_{q \in pa(p)} x(p|q,l) \cdot x_q(t), \qquad (17)$$

where $x(p|q,l)$ is computed as in (16).
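A sketch of this focused PageRank follows: the columns of the forward matrix are built from classifier scores as in (16) and then (17) is iterated. The graph and the score vector s are toy assumptions standing in for a real classifier's output.

import numpy as np

W = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
s = np.array([0.9, 0.1, 0.8, 0.3])   # topic relevance of each page
N, d = W.shape[0], 0.85

# Column q of Delta holds x(p|q, l) as in (16): the classifier scores of the
# children of q, normalized so that each column sums to 1.
scored = W.T * s[:, None]            # entry (p, q) is s(p) if q links p
Delta = scored / scored.sum(axis=0)

x = np.full(N, 1.0 / N)
for _ in range(100):                 # iterate (17)
    x = (1 - d) / N + d * (Delta @ x)
print(x)                             # focused page scores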

4.2 Double Focused PageRank

The focused PageRank surfer, described in the previous section, uses a topic-specific distribution for selecting the link to follow, but the decision on which action to take does not depend on the contents of the current page. A more accurate model should consider that the decision about which action to take usually depends on the contents of the current page. For example, let us suppose that two surfers are searching for a "Perl language tutorial" and that the first one is located at the page www.perl.com, while the second is located at the page www.cnn.com. Clearly, it is more likely that the first surfer will decide to follow a link from the current page, while the second one will prefer to jump to another page related to the topic he is interested in.

We can model this behavior by adapting the action probabilities using the contents of the current page, thus modeling a focused choice of the surfer's actions. In particular, the probability of following a hyperlink can be chosen to be proportional to the degree of relevance $s(p)$ of the current page with respect to the target topic, i.e.,

$$x(l|p) = d \cdot \frac{s(p)}{\max_{q \in G} s(q)}, \qquad (18)$$

where $s(p)$ is computed by the text classifier. On the other hand, the probability of jumping away from a page decreases proportionally to $s(p)$, i.e.,

$$x(j|p) = 1 - d \cdot \frac{s(p)}{\max_{q \in G} s(q)}. \qquad (19)$$

Finally, we assume that the probability of landing on a page after a jump is proportional to its relevance $s(p)$, i.e.,

$$x(p|j) = \frac{s(p)}{\sum_{q \in G} s(q)}. \qquad (20)$$
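Combining (16) with (18)-(20) gives the double focused surfer; the sketch below makes the action choice depend on the relevance of the current page and redistributes the jump mass by relevance. Graph and scores are, again, toy assumptions.

import numpy as np

W = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
s = np.array([0.9, 0.1, 0.8, 0.3])
N, d = W.shape[0], 0.85

x_l = d * s / s.max()                # x(l|p), eq. (18): relevant pages keep the surfer
x_j = 1.0 - x_l                      # x(j|p), eq. (19): irrelevant pages push him away
jump_target = s / s.sum()            # x(p|j), eq. (20): jumps land on relevant pages

scored = W.T * s[:, None]            # focused link choice, as in (16)
Delta = scored / scored.sum(axis=0)

x = np.full(N, 1.0 / N)
for _ in range(100):
    # jump mass collected from every page, then redistributed by relevance,
    # plus the focused-link contribution weighted by the per-page x(l|q)
    x = jump_target * (x_j @ x) + Delta @ (x_l * x)
print(x)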

4.3 Focused HITS

The multisurfer model may be used to derive a modification of the HITS algorithm similar to that proposed in [19], [20]. This model also takes textual information into account, in order to enforce the influence of the links pointing to on-topic pages and to filter the noise introduced by links to off-topic pages. The model is based on the two coupled surfers implementing the HITS algorithm, as shown in Section 3.2. In the focused version, each surfer selects the links (back-links) using the scores assigned by the text classifier to the target pages. Given the scores $s(ch_1(q)), \ldots, s(ch_{h_q}(q))$ assigned by the text classifier to the pages $ch_i(q)$ linked by page $q$, the first surfer selects the link $i$ to follow from page $q$ by using the probability distribution defined by

$$x^{(1)}(ch_i(q)|q,l) = \frac{s(ch_i(q))}{\sum_{j=1}^{h_q} s(ch_j(q))}. \qquad (21)$$

Likewise, given the scores $s(pa_1(q)), \ldots, s(pa_m(q))$ assigned by the text classifier to the pages $pa_i(q)$ linking page $q$, the second surfer selects the back-link $i$ to follow from page $q$ using the probability distribution

$$x^{(2)}(pa_i(q)|q,b) = \frac{s(pa_i(q))}{\sum_{j=1}^{m} s(pa_j(q))}. \qquad (22)$$

The focused HITS model is thus represented by the two equations

$$\begin{cases} a_p(t+1) = \sum_{q \in pa(p)} h_q(t) \cdot x^{(1)}(p|q,l) \\ h_p(t+1) = \sum_{q \in ch(p)} a_q(t) \cdot x^{(2)}(p|q,b), \end{cases} \qquad (23)$$

where $a_p(t) = x^{(1)}_p(t)$ represents the focused authority computed by the first surfer and $h_p(t) = x^{(2)}_p(t)$ is the focused hubness measured by the second surfer, when the probabilities $x^{(1)}(p|q,l)$ and $x^{(2)}(p|q,b)$ are computed as in (21) and (22).
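A sketch of this focused variant: both surfers' selection matrices are built by normalizing classifier scores over the links and back-links, as in (21)-(22), and then (23) is iterated. Graph and scores are toy assumptions.

import numpy as np

W = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
s = np.array([0.9, 0.1, 0.8, 0.3])
N = W.shape[0]

fwd = W.T * s[:, None]               # entry (p, q): s(p) if q links p
X1 = fwd / fwd.sum(axis=0)           # x(1)(p|q, l), eq. (21)
bwd = W * s[:, None]                 # entry (p, q): s(p) if p links q
X2 = bwd / bwd.sum(axis=0)           # x(2)(p|q, b), eq. (22)

a = np.full(N, 1.0 / N)              # focused authority
h = np.full(N, 1.0 / N)              # focused hubness
for _ in range(50):
    a, h = X1 @ h, X2 @ a            # eq. (23); the columns of X1 and X2 sum
                                     # to 1, so the total mass is preserved
print("focused authority:", a.round(3))
print("focused hubness:", h.round(3))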

4.4 Multitopic Rank

Topic hierarchies and topic correlations are of fundamental importance to perform focused search on the Web. Typically, pages on a specific topic may be reached through a path of pages belonging to correlated topics. In particular, Fig. 1 shows a typical scenario on the Web, where a set of "Researcher Home Pages" can be reached following a path through pages belonging to different categories, like, for example, home pages of a university department, home pages of a university faculty, and home pages of a university.

In order to enhance the ranking functions for focused information, a model based on a multisurfer walk can be devised. This model can capture the correlations among topics and reveal more complex properties of the pages due to the topological structure of the topics on the Web.

By analyzing Web portions densely populated by interesting documents, we can identify a set of topics related to the target one. For example, the set of correlated topics can be discovered automatically using a clustering algorithm. This is the approach we used in the experiments described further on. Once a set of $T$ topics is defined, the probability of the transition between each pair of topics can be estimated from a sample of the Web. $P(\tau'|\tau)$ indicates the probability that a page on topic $\tau$ links a page on topic $\tau'$.

We can use these probability values, which reflect the correlations among the $T$ topics, to define the interaction matrix of a pool of $T$ surfers where the $\tau$th surfer is focused on the $\tau$th topic. Thus, if topic $\tau'$ is highly correlated to topic $\tau$, then the surfer $\tau'$ will be strongly influenced by the activity of the surfer $\tau$. Formally, the probability $v^{(\tau')}_p(t)$ that surfer $\tau'$ moves to page $p$ due to the other surfers' suggestions is

$$v^{(\tau')}_p(t) = \sum_{\tau=1}^{|T|} P(\tau'|\tau) \cdot x^{(\tau)}_p(t). \qquad (24)$$

Thus, the multitopic rank considers an interaction matrix $A$ whose element $(\tau', \tau)$ is equal to $P(\tau'|\tau)$. This choice allows each surfer to move to a position which is more likely to lead to a page on his topic of interest. Finally, each surfer can be modeled using one of the focused approaches described in the previous sections.
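The following sketch instantiates this interaction step: the topic-transition estimates play the role of the interaction matrix in (24), while random column-stochastic matrices stand in for the per-topic focused surfer models; the renormalization of the suggestion-adjusted distributions is a sketch-level choice, since the rows of the transition matrix need not sum to one.

import numpy as np

N, T = 5, 3
rng = np.random.default_rng(1)
# A[t2, t1] = P(t2|t1): probability that a page on topic t1 links a page on
# topic t2; each column sums to 1.
A = rng.random((T, T))
A /= A.sum(axis=0)
# One column-stochastic transition matrix per topic-focused surfer.
T_list = [m / m.sum(axis=0) for m in rng.random((T, N, N))]

X = np.full((N, T), 1.0 / N)         # column t: score vector of surfer t
for _ in range(200):
    V = X @ A.T                      # V[:, t2] = sum_t1 P(t2|t1) x(t1), eq. (24)
    V /= V.sum(axis=0)               # keep each surfer's vector a distribution
    X = np.column_stack([T_list[t] @ V[:, t] for t in range(T)])
print(X)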

5 EXPERIMENTAL RESULTS

We performed a set of experiments in order to analyze the properties of some of the proposed scoring systems and to compare the different rankings. Since we were mainly interested in evaluating the performance of scoring systems for vertical (topic-specific) applications, we based our tests on a set of single-topic data sets. Each data set was collected using the focus crawler described in [21]. In particular, the focus crawler employs a Naive Bayes classifier ([22, chapter 6]), which computes the correlation between the content of each downloaded page and the considered topic. The classifier directs the crawl to the most promising Web regions by selecting the links starting from the pages having the highest scores.

About 150,000 pages were downloaded for each single crawl. The topics of the page collections were selected not to be too specific in order to cover many different subtopics in each data set. The selected topics were: pages on the operating system Linux (data set "Linux"), pages on cooking recipes (data set "cooking recipes"), pages concerning the sport of golf (data set "golf"), and documents related to wines (data set "wine").

For each selected topic, a relevance score was assigned to each page by the Naive Bayes classifiers which were previously used to focus-crawl the Web. The scores produced by the models were stored to be used in the computation of the vertical page ranks. Considering the hyperlinks contained in each page, a Web subgraph was created from each data set to perform the evaluation of the different WPSSs proposed in the previous sections.

Besides the ranking systems described in the previous sections, we also report the results for the "In-link" surfer. Such a surfer is located in a page with probability proportional to the number of in-links of that page. For all the PageRank surfers (focused or not), we set the $d$ parameter to 0.85.

[Fig. 1: Example of topic transitions among connected pages on the Web. In particular, we consider "Researcher Home Pages," which are likely to be connected to pages of different categories, among which we consider "Department Home Pages," "Faculty Home Pages," and "University Home Pages." (a) An example of the neighborhood of a set of researcher home pages. (b) Transition probabilities estimated from the sample Web portion shown in (a), taking into account the four considered categories. Each table row collects the probabilities that a page in the corresponding category points to pages belonging to each considered category.]

[Fig. 2: The distribution of the page scores for two different topics: (a) "Linux" and (b) "cooking recipes."]

5.1 Score Distributions

We performed an analysis of the distribution of page scores using the different algorithms proposed in this paper. For each ranking function, we normalized the score values by their maximum over all the pages (thus yielding values in [0,1]). Then, we sorted the pages according to their ranks and plotted the distribution of the normalized rank values. Fig. 2 reports the plots for the two categories "Linux" and "cooking recipes."

In both cases, the HITS surfer assigns a score value significantly greater than zero only to the small set of pages associated to the principal community of the subgraph. On the other hand, PageRank yields a smoother curve for the score distribution. This is the effect of the homogeneous term $1 - d$ in (7). The focused versions of PageRank are still smooth but concentrate the scores on a smaller set of authoritative pages which are more specific to the considered topic. This reflects the fact that the vertical WPSSs are able to discriminate the authorities on the specific topic, whereas the classical PageRank scheme considers the authoritative pages regardless of their topic.

5.2 Top Lists

Figs. 3 and 4 show the eight top-score pages for four different WPSSs on the data sets "Linux" and "cooking recipes," respectively. For the HITS surfer pool, we report the pages with the top authority values.

[Fig. 3: The eight top-score pages for the data set "Linux."]

[Fig. 4: The eight top-score pages for the data set "cooking recipes."]

As shown in Fig. 3, all pages selected by the HITS algorithm are from the same site. This is due to the well-known property of the HITS algorithm of producing a score vector "in the direction" of the most interconnected communities. In many cases, it is difficult to reduce this undesirable behavior of the HITS ranking by properly pruning the links among pages. For example, in order to reduce the "nepotism" of Web pages, for the data set "cooking recipes," the connectivity map of pages was pruned by removing all the intrasite links. However, as shown in the HITS section of Fig. 4, the Web site "www.allrecipe.com," which is subdivided into a collection of strongly interconnected Web sites ("www.seafoodrecipes.com," "www.cookierecipes.com," etc.), occupies all the top positions in the ranking list. In [18], the content of pages is considered in order to propagate relevance scores only over the subset of links pointing to pages on a specific topic. However, in this case, the performance cannot be improved even by using this approach, since all the sites in the community are effectively on-topic and, thus, the interconnections are semantically coherent.

The PageRank algorithm is not topic dependent and, consequently, highly interconnected pages turn out to be authoritative regardless of the topic of interest. For example, pages like "www.yahoo.com," "www.google.com," etc., are shown in the top list even if they are not closely related to the specific topic. The focused versions of PageRank can filter many off-topic authoritative pages from the top list. In particular, the "Double Focused PageRank" WPSS pushes all the authorities on the relevant topic to the top positions.

5.3 Comparison of the WPSSs

In this section, we compare the results obtained by the In-link surfer, the PageRank surfer, the Focused PageRank scheme, the Double Focused PageRank scheme, and the HITS surfer pool. We follow a methodology similar to the one presented in [23]. For each topic, we created a collection of pages which were evaluated by a pool of 10 human experts. The experts independently labeled each page in the collection as "authoritative" or "not authoritative" for the specific topic. In particular, the top 15 pages for each ranking function were shown to a set of experts. Each expert provided either a positive, negative, or null feedback on each single page.

The labeled pages were used to measure the percentage of positive (negative) results in the top list returned by each ranking function. The length of the top list was varied between 1 and 300. The evaluation was performed on the two data sets "Linux" and "Golf."

Fig. 5 reports the percentage of all pages labeled as "authoritative" by the experts among the first $N$ pages in the top list produced by five different WPSSs for the two data sets. In both cases, the HITS algorithm produces the worst ranking. This result confirms the fact that HITS can only be used as a query-dependent ranking scheme [3]. As previously reported in [23], in spite of its simplicity, the In-link algorithm has a performance similar to PageRank. In our experiments, PageRank outperformed the In-link algorithm on the category "Golf," whereas it was outperformed on the category "Linux." However, in both cases, the gap is small. The two focused ranking functions clearly outperformed all the nonfocused ones, demonstrating that, when searching for focused authorities, a higher accuracy is obtained by taking the page contents into account. In both cases, more than 60 percent of the authoritative pages are in the top 50 pages suggested by the Double Focused PageRank.

[Fig. 5: The percentage of the authoritative pages, as defined by a set of 10 users, in the best N pages returned by the various WPSSs, respectively, for the topics (a) "Linux" and (b) "Golf" (the higher the better). Vice versa, in (c), the percentage of the nonauthoritative pages for the topic "Golf" in the best N pages returned by the various WPSSs (the lower the better) is shown. Similar results hold for the "Linux" topic.]

5.4 The Multitopic WPSS

We evaluated the scoring model proposed in Section 4.4, which considers the correlation among different topics. Each surfer was associated, by using a text classifier, to a subtopic correlated to the main topic. The set of subtopics was determined automatically by the following procedure. First, a set of seed pages for the topic of interest was selected. Then, the context graph of each of these pages was built by back-crawling the Web up to three levels (i.e., the pages one, two, and three clicks away from the seed ones). A hierarchical clustering algorithm on the bag-of-words vectors representing the documents was used to split the set of the pages in the context graph into subsets corresponding to the subtopics. In the experiments, we fixed the maximum number of clusters to 10.

Each cluster obtained in the previous step is associated with a surfer. In order to facilitate the integration with the probabilistic model used to compute the page scores, a set of naive Bayes classifiers (see, e.g., [22, Chapter 6]) was trained using the documents in each cluster. Finally, the matrix of topic-transition probabilities was estimated from the context graph by counting the number of pages in cluster j which link a page in cluster i and by normalizing this value by the total number of links to the pages in cluster i. The estimated topic-transition matrix was used as the interaction matrix for the multiple-surfer model used to compute the page scores. Each surfer was focused on the particular subtopic corresponding to the associated cluster, and the surfer behavior was chosen to be focused in the choice of the links to follow, in the jumps to take, and in the bias between these two actions (the value of the parameter d).
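A minimal sketch of this estimate, assuming the context graph is given as an edge list and a page-to-cluster map (names ours, and counting links rather than distinct source pages):

```python
def topic_transition_matrix(edges, cluster_of, k):
    """Estimate the k x k topic-transition (interaction) matrix.

    Entry [i][j] accumulates the links from pages in cluster j to pages
    in cluster i and is normalized by the total number of links entering
    cluster i, mirroring the estimate described above.
    """
    counts = [[0.0] * k for _ in range(k)]
    for src, dst in edges:
        counts[cluster_of[dst]][cluster_of[src]] += 1.0
    for i in range(k):
        total = sum(counts[i])        # all links pointing into cluster i
        if total:
            counts[i] = [c / total for c in counts[i]]
    return counts
```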

We performed a set of experiments on the three topics“wine,” “golf,” and “cooking recipes.”

Fig. 6 shows the plots of the scores assigned to the pages by each of the 10 surfers for the three data sets. Surfer 0 is the one associated with the topic of interest as defined by the seed pages. Each curve is normalized with respect to the maximum score assigned to a page. For the topic “wine” (plot c), the curve corresponding to surfer 0 is mostly flat: only one page in the data set (winelibrary.com) receives a high score from surfer 0, while many pages are assigned similar scores. The scores assigned by the other surfers correspond to the context subtopics and show a less uniform distribution. For the other data sets, the distribution of the scores assigned by surfer 0 is less uniform.

Fig. 6. The distribution of page scores for each surfer when using a multisurfer model with 10 surfers. (a) Data set “golf.” (b) Data set “cooking recipes.” (c) Data set “wine.”

6 CONCLUSIONS

In this paper, we have proposed a general probabilistic framework based on random walks for the definition of ranking functions on a set of hyperlinked documents. The proposed framework allows the definition of both horizontal (topology-based) and vertical (topic- and topology-based) rankings. The proposed scheme incorporates many relevant scoring models proposed in the literature. Moreover, it contains novel features which look very appropriate especially for vertical (focused) search engines. In particular, in some of the proposed ranking algorithms, the topological structure of the Web and the content of the Web pages jointly play a crucial role in the computation of the scores. The experimental results support the effectiveness of the proposal, whose benefits clearly emerge especially for focused search. Finally, it is worth mentioning that the model described in this paper is very well-suited for the construction of learning-based WPSSs, which can, in principle, incorporate the information provided by users while surfing the Web.

ACKNOWLEDGMENTS

The authors would like to thank Ottavio Calzone and Francesco Scali (DII, University of Siena), who performed some of the experimental evaluations of the scoring systems. Some fruitful discussions with Nicola Baldini concerning the focuseek project (www.focuseek.com) were also very stimulating and useful for the development of the general framework described in the paper. Finally, the authors would like to thank the anonymous reviewers for the useful suggestions.

REFERENCES

[1] S. Lawrence and C.L. Giles, “Searching the Web,” Science, vol. 281, no. 5374, p. 175, 1998.

[2] S. Lawrence and C.L. Giles, “Accessibility of Information on the Web,” Nature, vol. 400, no. 8, pp. 107-109, 1999.

[3] M. Henzinger, “Hyperlink Analysis for the Web,” IEEE Internet Computing, vol. 1, no. 5, pp. 45-50, 2001.

[4] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank Citation Ranking: Bringing Order to the Web,” technical report, Computer Science Dept., Stanford Univ., 1998.

[5] J.M. Kleinberg, “Authoritative Sources in a Hyperlinked Environment,” J. ACM, vol. 46, no. 5, pp. 604-632, 1999.

[6] K. Bharat and M.R. Henzinger, “Improved Algorithms for Topic Distillation in a Hyperlinked Environment,” Proc. 21st Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 104-111, 1998.

[7] R. Lempel and S. Moran, “The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect,” Proc. Ninth World Wide Web Conf. (WWW9), pp. 387-401, 2000.

[8] R. Lempel and S. Moran, “SALSA: The Stochastic Approach for Link-Structure Analysis,” ACM Trans. Information Systems, vol. 19, no. 2, pp. 131-160, 2001.

[9] D. Cohn and H. Chang, “Learning to Probabilistically Identify Authoritative Documents,” Proc. 17th Int’l Conf. Machine Learning (ICML), pp. 167-174, 2000.

[10] D. Cohn and T. Hofmann, “The Missing Link: A Probabilistic Model of Document Content and Hypertext Connectivity,” Advances in Neural Information Processing Systems 13, pp. 430-436, 2000.

[11] M. Richardson and P. Domingos, “The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank,” Advances in Neural Information Processing Systems 14, pp. 1441-1448, 2002.

[12] T.H. Haveliwala, “Topic-Sensitive PageRank,” Proc. 11th World Wide Web Conf. (WWW2002), pp. 517-526, 2002.

[13] M. Diligenti, M. Gori, and M. Maggini, “Web Page Scoring Systems for Horizontal and Vertical Search,” Proc. 11th World Wide Web Conf. (WWW2002), pp. 508-516, 2002.

[14] G. Greco, S. Greco, and E. Zumpano, “A Probabilistic Approach for Distillation and Ranking of Web Pages,” World Wide Web, vol. 4, no. 3, pp. 189-207, 2001.


[15] E. Seneta, Non-Negative Matrices and Markov Chains. Springer-Verlag, 1981.

[16] S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” Proc. Seventh World Wide Web Conf. (WWW7), pp. 107-117, 1998.

[17] M.M. Kessler, “Bibliographic Coupling between Scientific Papers,” Am. Documentation, vol. 14, pp. 10-25, 1963.

[18] S. Chakrabarti, M. Joshi, and V. Tawde, “Enhanced Topic Distillation Using Text, Markup Tags, and Hyperlinks,” Proc. 24th Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 208-216, 2001.

[19] S. Chakrabarti, M. Van der Berg, and B. Dom, “Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery,” Proc. Eighth Int’l World Wide Web Conf. (WWW8), pp. 545-562, 1999.

[20] S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J. Kleinberg, “Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text,” Proc. Seventh World Wide Web Conf. (WWW7), pp. 65-74, 1998.

[21] M. Diligenti, F. Coetzee, S. Lawrence, L. Giles, and M. Gori, “Focused Crawling Using Context Graphs,” Proc. 26th Int’l Conf. Very Large Databases (VLDB 2000), pp. 527-534, 2000.

[22] T.M. Mitchell, Machine Learning. McGraw-Hill, 1997.

[23] B. Amento, L. Terveen, and W. Hill, “Does Authority Mean Quality? Predicting Expert Quality Ratings of Web Documents,” Proc. 23rd Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 296-303, 2000.

Michelangelo Diligenti received the PhD degree in computer science and system engineering in 2002 from the University of Florence, Italy. Currently, he is a research associate at the University of Siena, Italy. He has collaborated with the University of Wollongong and the NEC Research Institute, Princeton, New Jersey. His main research interests are pattern recognition, text categorization, visual databases, and machine learning applied to the World Wide Web.

Marco Gori received the Laurea degree in electronic engineering from the Università di Firenze, Italy, in 1984, and the PhD degree in 1990 from the Università di Bologna, Italy. From October 1988 to June 1989, he was a visiting student at the School of Computer Science, McGill University, Montreal. In 1992, he became an associate professor of computer science at the Università di Firenze and, in November 1995, he joined the University of Siena, where he is currently a full professor. His main research interests are in neural networks, pattern recognition, and applications of machine learning to information retrieval on the Internet. He has led a number of research projects on these themes with either national or international partners and has been involved in the organization of many scientific events, including the IEEE-INNS International Joint Conference on Neural Networks, for which he acted as the program chair (2000). Dr. Gori serves (served) as an associate editor of a number of technical journals related to his areas of expertise, including Pattern Recognition, the IEEE Transactions on Neural Networks, Neurocomputing, and the International Journal on Pattern Recognition and Artificial Intelligence. He is the Italian chairman of the IEEE Neural Network Council (R.I.G.), is acting as the cochair of the TC3 technical committee of the IAPR (International Association for Pattern Recognition) on Neural Networks, and is the president of the Italian Association for Artificial Intelligence. Dr. Gori is a fellow of the IEEE.

Marco Maggini received the Laurea degree (cum laude) in electronic engineering and the PhD degree in computer science and control systems from the University of Firenze in 1991 and 1995, respectively. In February 1996, he became an assistant professor of computer engineering in the School of Engineering at the University of Siena, where, since March 2001, he has been an associate professor. His main research interests are machine learning, neural networks, human-machine interaction, technologies for distributing and searching information on the Internet, and nonstructured databases. He has been collaborating with the NEC Research Institute, Princeton, New Jersey, on parallel processing, neural networks, and financial time series prediction. He is a member of the editorial board of the Electronic Letters on Computer Vision and Image Analysis and an associate editor of the ACM Transactions on Internet Technology. He has been a guest editor of a special issue of the ACM Transactions on Internet Technology on machine learning for the Internet. He contributed to the organization of international and national scientific events. He is a member of the IAPR-IC and the IEEE Computer Society.

