What’s HOT on the Web
John Tomlin
February 2003
Expanded from a talk given at INFORMS, San Jose, CA, November 2002
Motivation

There are obvious similarities between the behavior of web surfers and the behavior of road traffic, as well as obvious differences. Can we adapt some of the ideas from Traffic Theory / Transportation Analysis to study and understand traffic on the WWW?
Transportation Study Steps
1. Land use planning (work out the urban zones)
2. Traffic Distribution * (figure out interzonal transfers)
3. Traffic Assignment (map interzonal transfers onto the network)

* This is the step that is most relevant here.
Traffic Distribution model
[Figure: origin zones of trips (a_i) connected to destination zones of trips (b_j) by the possible assignments]
Simple Linear Programming Formulation

i = 1,..., m  Origins
j = 1,..., n  Destinations

Data:
  a_i    Number of travelers at origin i
  b_j    Number traveling to destination j
  c_ij   Travel cost for route (i,j)

Variables:
  x_ij (>= 0)   Number of travelers from i to j

Constraints:
  Σ_j x_ij = a_i   (i = 1,..., m)
  Σ_i x_ij = b_j   (j = 1,..., n)

Objective:
  Minimize Σ_ij c_ij x_ij
Simple LP Formulation (continued)

It is often convenient to normalize the model so that the variables are probabilities: p_ij = x_ij / X (where X is the sum of all the x_ij). The equations then become:

Constraints:
  Σ_j p_ij = a_i / X   (i = 1,..., m)
  Σ_i p_ij = b_j / X   (j = 1,..., n)

Objective:
  Minimize Σ_ij c_ij p_ij
Advantages
- Simplicity
- Very fast algorithms available
- Relatively modest data requirements

Disadvantages
- Deterministic
- Unrealistic solutions: of the (m × n) possible O-D pairings, only (m + n) will have a nonzero weight x_ij. Most O-D pairings will never occur.
- The simple fix (placing lower bounds on the variables) doesn't work.
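The sparsity complaint can be seen on a tiny instance. Below is a minimal sketch of the transportation LP on made-up data (the a_i, b_j, and c_ij values are hypothetical), using scipy's HiGHS dual simplex so that a basic optimal solution is returned: of the m × n = 12 possible O-D pairs, at most m + n − 1 = 6 carry flow.

```python
# Transportation LP sketch on hypothetical data; a basic optimal
# solution uses at most m + n - 1 of the m * n possible O-D pairs.
import numpy as np
from scipy.optimize import linprog

a = np.array([30.0, 50.0, 20.0])            # travelers at each origin i
b = np.array([20.0, 40.0, 30.0, 10.0])      # travelers to each destination j
c = np.array([[4.0, 6.0, 9.0, 2.0],
              [5.0, 3.0, 7.0, 8.0],
              [6.0, 5.0, 2.0, 5.0]])        # travel costs c_ij (made up)
m, n = c.shape

# Equality constraints: row sums = a_i, column sums = b_j.
A_eq = []
for i in range(m):
    row = np.zeros(m * n)
    row[i * n:(i + 1) * n] = 1.0            # sum_j x_ij = a_i
    A_eq.append(row)
for j in range(n):
    col = np.zeros(m * n)
    col[j::n] = 1.0                         # sum_i x_ij = b_j
    A_eq.append(col)

res = linprog(c.ravel(), A_eq=np.array(A_eq),
              b_eq=np.concatenate([a, b]),
              bounds=(0, None), method="highs-ds")
x = res.x.reshape(m, n)
nonzeros = int(np.sum(x > 1e-9))            # most O-D pairs never occur
```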
Entropy Maximization And Information
Let x be a random variable which can take values x_1, x_2, x_3, ..., x_n with probabilities p_1, p_2, p_3, ..., p_n. These probabilities are not known. All we know are the expectations of some functions f_r(x):

  E[f_r(x)] = Σ_i p_i f_r(x_i)   (r = 1,...,m)   (1)

where of course

  Σ_i p_i = 1   (2)
Jaynes – "our problem is that of finding a probability assignment that avoids bias, while agreeing with whatever information is given. The great advance provided by information theory lies in the discovery that there is a unique, unambiguous criterion for the 'amount of uncertainty' represented by a discrete probability distribution."
This measure of the uncertainty of a probability distribution was given by Shannon as:

  S(p_1, p_2, p_3, ..., p_n) = −K Σ_i p_i ln p_i   (3)

and this is called the "entropy".
Jaynes again – “It is now evident how to solve our problem; in makinginferences on the basis of partial information we must use the probabilitydistribution which has maximum entropy subject to whatever is known.”
Using the method of Lagrange multipliers to maximize (3) subject to (1) and (2), with multipliers λ_r on equations (1) and λ_0 on (2), the unique solution can be shown to be:

  p_i = exp[ −λ_0 − Σ_r λ_r f_r(x_i) ]   (4)

where

  Z(λ) = e^{λ_0} = Σ_i exp{ −Σ_r λ_r f_r(x_i) }   (5)

is the partition function of the distribution.
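As a concrete instance of (4) and (5), consider a single constraint function f_1(x) = x on the faces of a die with a prescribed mean; the target E[x] = 4.5 below is our own choice, used purely for illustration. Solving for the single multiplier λ by bisection gives the maximum entropy distribution:

```python
# Max entropy on x = 1..6 given one moment constraint E[x] = 4.5
# (the target mean is a made-up example value).
import math

xs = list(range(1, 7))
target = 4.5

def mean(lam):
    # E[x] under p_i proportional to exp(-lam * x_i)
    w = [math.exp(-lam * x) for x in xs]
    return sum(x * wi for x, wi in zip(xs, w)) / sum(w)

# mean(lam) is strictly decreasing in lam, so bisect for the multiplier.
lo, hi = -5.0, 5.0
for _ in range(100):
    mid = (lo + hi) / 2.0
    if mean(mid) > target:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2.0

Z = sum(math.exp(-lam * x) for x in xs)     # partition function, eq. (5)
p = [math.exp(-lam * x) / Z for x in xs]    # max entropy distribution, eq. (4)
```

Since the target mean exceeds the uniform mean 3.5, the multiplier comes out negative and the distribution tilts toward the larger faces.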
Back to our Model

OK, so we want a maximum entropy solution for the traffic distribution problem. Switching from the p_i to p_ij, we modify the treatment just given so that:

  p_i → p_ij

  f_r^a(x_i) → f_r^a(x_ij) = 1 if r = i, 0 otherwise
  f_r^b(x_i) → f_r^b(x_ij) = 1 if r = j, 0 otherwise

Then:

  p_ij = exp[ −λ_0 − λ_i^a − λ_j^b ]   for (i,j) ∈ E

and

  Z = e^{λ_0} = Σ_{(i,j)∈E} exp{ −λ_i^a − λ_j^b }
Entropy Model (continued)

A purely random assignment of ads to users corresponds to solving:

  Maximize  − Σ_ij x_ij ln x_ij

  Subject to:
    Σ_j x_ij = a_i   (i = 1,..., m)
    Σ_i x_ij = b_j   (j = 1,..., n)

The objective here is (only) the entropy function.

We can then add the travel cost constraint:

  Σ_ij c_ij x_ij = C

for some reasonable value of C. Suppose we assign this a Lagrange multiplier 1/μ. This in turn is equivalent to solving:

  Maximize  − Σ_ij c_ij x_ij − μ ( Σ_ij x_ij ln x_ij )

  Subject to:
    Σ_j x_ij = a_i   (i = 1,..., m)
    Σ_i x_ij = b_j   (j = 1,..., n)

where μ is a constant, which can be related to the average of the c_ij values.

Even though this is a nonlinear problem it can be solved by a fast (iterative scaling) algorithm and has solutions of the form:

  x_ij = A_i B_j exp( −c_ij / μ )

Note that because of the logarithm term, the x_ij values must necessarily be positive.
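The fast iterative scaling algorithm can be sketched as alternately rescaling the row factors A_i and column factors B_j (a Sinkhorn-style iteration). The trip totals and costs below are made-up illustration data, and taking μ as the average cost is an assumption:

```python
# Iterative scaling sketch for the entropy solution
# x_ij = A_i * B_j * exp(-c_ij / mu), on hypothetical data.
import numpy as np

a = np.array([30.0, 50.0, 20.0])            # origin totals
b = np.array([20.0, 40.0, 30.0, 10.0])      # destination totals
c = np.array([[4.0, 6.0, 9.0, 2.0],
              [5.0, 3.0, 7.0, 8.0],
              [6.0, 5.0, 2.0, 5.0]])        # travel costs (made up)
mu = c.mean()                  # "related to the average of the c_ij values"
K = np.exp(-c / mu)

A = np.ones(a.size)
B = np.ones(b.size)
for _ in range(500):           # alternately enforce row and column totals
    A = a / (K @ B)
    B = b / (K.T @ A)
x = A[:, None] * K * B[None, :]   # every x_ij is strictly positive
```

Unlike the LP solution, every O-D pair gets a positive weight, with cheap routes favored exponentially.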
The World Wide Web

1. Over 1 billion pages today
2. More tomorrow – growing fast
3. About ~10 links per page

[Figure: a small directed graph with pages i, j, k, l, m, n]

Graph model of the web: G = (V, E), where V is the set of vertices (i, j, k, ...) or nodes or pages, and E is the set of edges (i,j).
PageRank

PageRank can be described in a number of ways: in terms of Markov chains, equations, or eigensystems. Intuitively, each page is assigned a rank in terms of the ranks of the pages that point to it.

For simplicity let us initially assume that G is strongly connected (i.e. there is a directed path between every pair of pages in the set V).

Let π_ij = probability that a surfer at page i will click through to page j.

The standard PageRank computation assumes that these probabilities are uniform, i.e. if d_i = out-degree (number of outlinks) of page i, then π_ij = 1/d_i for all j such that (i,j) ∈ E.

M = [π_ij] defines a Markov chain. The stochastic vector w = <w_1, ..., w_m>, where w_1 + ... + w_m = 1, is a stationary state of the Markov chain if:

  w^T = w^T M

Let matrix A = M^T, i.e.

  a_ij = 1/d_j   for (j,i) ∈ E

where d_j is the out-degree of node (page) j; then the ideal PageRank x_i for page i is computed from:

  x_i = Σ_{(j,i)∈E} a_ij x_j

Note the very strong assumption that the a_ij are fixed, and fixed with values 1/d_j.
[Figure: the example graph with pages i, j, k, l, m, n]

PageRank of i is  x_i = (1/2) x_j + (1/3) x_k + x_l
PageRank of j is  x_j = (1/3) x_m + (1/2) x_k    etc.
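For a strongly connected graph, the ideal PageRank equations x_i = Σ a_ij x_j can be solved by power iteration; a minimal sketch, on a made-up four-page edge list:

```python
# Power-iteration sketch of ideal PageRank x = A x on a small
# strongly connected toy graph (the edge list is made up).
import numpy as np

edges = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 3), (3, 0)]
n = 4
out_deg = [sum(1 for (i, j) in edges if i == k) for k in range(n)]
A = np.zeros((n, n))
for (i, j) in edges:
    A[j, i] = 1.0 / out_deg[i]    # a_ji = 1/d_i for (i,j) in E

x = np.full(n, 1.0 / n)           # start from the uniform vector
for _ in range(200):
    x = A @ x
    x /= x.sum()                  # keep x a probability vector
```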
Beyond the Valley of the Markov Chains – A Network Flow Model

Let us maintain the graph G as our basic model, and the assumption that each user clicks through to a page at each tick of the "clock", but define flow variables:

  y_ij = flow per unit time from page i to page j

where

  Σ_{(i,j)∈E} y_ij = Y (const)   (*)

(We will usually find it convenient to work with the normalized values p_ij = y_ij / Y (probabilities).)

The flows must satisfy conservation equations (Kirchhoff conditions):

  Σ_{(h,i)∈E} y_hi − Σ_{(i,j)∈E} y_ij = 0   for all i ∈ V

as well as y_ij ≥ 0 and (*).
The y_ij can take any values which satisfy these constraints. What should we estimate them to be?

Noting that we can write the number of "hits" per unit time on page i as:

  H_i = Σ_{(h,i)∈E} y_hi

the PageRank assumption is that:

  y_ij = H_i / d_i   for all (i,j) ∈ E

(where d_i is the out-degree of page i). It is easy to see, by direct substitution, that these flows, suitably scaled to satisfy (*), satisfy the conservation equations.
Now, what alternatives do we have?
Max Entropy for the Web

OK, so we want a maximum entropy solution for the network flow problem. Switching from the y_ij to p_ij, we modify the entropy treatment given so that:

  p_i → p_ij

  f_r(x_i) → f_r(x_ij) =  +1 for i = r, (r,j) ∈ E
                          −1 for j = r, (i,r) ∈ E
                           0 otherwise

  E[f_r(x)] = 0   for r ∈ V

Then:

  p_ij = exp[ −λ_0 − λ_i + λ_j ]   for (i,j) ∈ E

and

  Z = e^{λ_0} = Σ_{(i,j)∈E} exp{ −λ_i + λ_j }

Letting θ_i = e^{−λ_i}, A = diag(θ_1, ..., θ_m), and

  c_ij = Z^{−1} for (i,j) ∈ E, 0 otherwise,

then

  P = A C A^{−1}

subject to

  Σ_{(h,i)∈E} p_hi − Σ_{(i,j)∈E} p_ij = 0   for all i ∈ V

  Σ_{(i,j)∈E} p_ij = 1

This is a matrix balancing (iterative scaling) problem, which can be solved relatively efficiently.
General idea:

0. Guess the value of Z^{−1}. Start with initial values for the θ_i (e.g. 1), denoted θ_i^(0), and let p_ij^(k) = Z^{−1} θ_i^(k) / θ_j^(k). At each iteration:
1. Compute ρ_i^(k) = Σ_j p_ij^(k)  and  σ_i^(k) = Σ_j p_ji^(k)
2. Let γ_i^(k) = ( σ_i^(k) / ρ_i^(k) )^{1/2}
3. Update θ_i^(k+1) ← θ_i^(k) γ_i^(k), for some or all i
4. Stop if 1 − ε ≤ γ_i^(k) ≤ 1 + ε for all i; else step k and go to 1.
5. Check if the sum of the final p_ij is 1.0. If not, adjust Z and go to 1.

Note the work per inner iteration is about twice an iteration of power iteration (or Gauss–Seidel).
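A minimal sketch of this balancing iteration on a made-up 3-page graph, applying the update one node at a time; a single normalization at the end stands in for the Z adjustment of steps 0 and 5:

```python
# Matrix-balancing sketch: find theta_i so that p_ij proportional to
# theta_i / theta_j on the edges of E satisfies flow conservation
# (outflow = inflow at each node).  The graph is a made-up example.
import numpy as np

edges = [(0, 1), (1, 2), (2, 0), (2, 1)]
n = 3
mask = np.zeros((n, n))
for (i, j) in edges:
    mask[i, j] = 1.0

theta = np.ones(n)
for _ in range(300):                        # outer sweeps
    for i in range(n):
        P = mask * np.outer(theta, 1.0 / theta)
        rho = P[i, :].sum()                 # outflow of node i  (step 1)
        sigma = P[:, i].sum()               # inflow of node i   (step 1)
        theta[i] *= np.sqrt(sigma / rho)    # steps 2-3
P = mask * np.outer(theta, 1.0 / theta)
P /= P.sum()                                # normalize so the p_ij sum to 1
```

The one-node-at-a-time update makes each step balance its node exactly, which is the "for some ... i" variant of step 3.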
However, in practice many pages have no in-links or no out-links. The "random surfer" can never reach the former (the green pages) or escape from the latter (the red pages): G is not strongly connected.

[Figure: graph with pages i, j, k, l, m, n; unreachable pages shown in green, absorbing pages in red]
In matrix terms this means that the Markov chain matrix is reducible, i.e. M can be arranged in the form:

  M = [ M11   0
        M21  M22 ]

In this case all the w's associated with M11 will be zero and hence the rank of those pages will also be zero. Also, the pages with zero in-degree have no rank to confer.
Solution idea – fudge in some "seed" rank:

  x_i = const + Σ_{(j,i)∈E} a_ij x_j
More sophisticated idea:

Assume that every few clicks the surfer makes a "random jump" to anywhere on the web, with frequency (1 − α), i.e. modify the Markov chain so that:

  M* = (1 − α)/n · e e^T + α M

where e = <1,1,...,1>, and 0 < α < 1 (e.g. α = 0.9).

This corresponds to a modified PageRank calculation:

  x_i = (1 − α)/n · e^T x + α Σ_{(j,i)∈E} a_ij x_j

or

  x = [ (1 − α)/n · e e^T + α A ] x
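The modified calculation can be sketched with the same power iteration, now including the random-jump term. The graph below is made up and deliberately reducible: page 0 has no in-links, so its rank comes entirely from the (1 − α)/n seed:

```python
# Sketch of the modified PageRank x = [(1-alpha)/n * e e^T + alpha*A] x
# on a made-up reducible graph (page 0 has no in-links).
import numpy as np

alpha = 0.9
edges = [(0, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 2)]
n = 4
out_deg = [sum(1 for (i, j) in edges if i == k) for k in range(n)]
A = np.zeros((n, n))
for (i, j) in edges:
    A[j, i] = 1.0 / out_deg[i]    # a_ji = 1/d_i for (i,j) in E

x = np.full(n, 1.0 / n)
for _ in range(200):
    # e e^T x = e, since the components of x always sum to 1
    x = (1.0 - alpha) / n + alpha * (A @ x)
```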
A Modified Network Formulation

We adopt the same idea – that with frequency (1 − α) surfers make a random jump. However, instead of fixing this frequency to be the same for surfers at each page, we fix an overall fraction (1 − α) for the entire network.

Define an additional node n+1 and add it to V to get V'. We then construct edges from every node in V to n+1 and from n+1 to every other node. The total traffic through this node is required to be:

  Σ_i p_i,n+1 = (1 − α) = Σ_j p_n+1,j

The new edge set E' is E plus the new edges and also a self loop at each dead-end page.
[Figure: the example graph with pages i, j, k, l, m, n plus the added node n+1; all nodes should be linked to and from n+1, but not all such links are drawn]
The extended model is now:

  Maximize  S = − Σ_{(i,j)∈E'} p_ij ln p_ij

  Subject to:

    Σ_{(h,i)∈E'} p_hi − Σ_{(i,j)∈E'} p_ij = 0   for all i ∈ V

    Σ_i p_i,n+1 = (1 − α)

    Σ_j p_n+1,j = (1 − α)

    Σ_{(i,j)∈E'} p_ij = 1
The extended model still corresponds to a (hybrid) matrix balancing problem, but with some row and column sums required to take particular values – a simple extension.

Note that if we know the flow through any node we can write down the equations, just as we did for the "artificial" node, and treat them in the same way. In other words we can reduce the uncertainty in the model by applying additional information.
The solution algorithm requires only minor modification, and produces not only the "traffic" primal variables (p_ij), but as an essential by-product the exponentials of the Lagrange multipliers:

  θ_i = e^{−λ_i}
Traffic Ranking
It is natural to consider the estimated traffic through a page as a measure of its "importance". We therefore compute:

  H_i = Σ_{(h,i)∈E} p_hi

normalize these values so that the largest is 1.0, and sort in decreasing order to obtain a "traffic ranking" from the primal solution values.
What about the dual values (Lagrange multipliers)?
Entropy and Temperature
We noted earlier that the maximum entropy value is a function of the optimal Lagrange multipliers:

  S = λ_0 + Σ_r λ_r E[f_r(x)]   (6)

Now, varying the functions f_r(x) in an arbitrary way, so that δf_r(x_ij) may be specified independently for each r and (i,j), and letting the expectations of the f_r change in a manner which is also independent, we obtain from (5):

  δλ_0 = δ log Z = − Σ_r { λ_r E[δf_r] + E[f_r] δλ_r }

and so from (6):

  δS = Σ_r λ_r { δE[f_r] − E[δf_r] }   (7)

  δS = Σ_r λ_r δQ_r   (8)

where (7) defines δQ_r as the r-th form of "heat".
Why? Because in classical thermodynamics one has the following relation between entropy and heat:

  δS = δQ / T

where δQ is heat added (this defines the absolute temperature T). Since we have:

  δS = Σ_r λ_r δQ_r

the λ_r play the role of the inverse of temperature. Thus we may hypothesize a page temperature:

  T_r ∝ 1 / λ_r

and rank the pages by any order-preserving function of 1/λ_r. We currently use θ_i = e^{−λ_i}, since these fall out of the algorithm. These values form the Hyperlinked Object Temperature Scale (HOTS).
Computational results
1. IBM Intranet crawls made in 2002:
   (a) 19 million pages, 206 million links
   (b) 17 million pages
2. Partial internet crawl made in 2001: 173 million pages (constraints), 2 billion links (variables)
Intranet Results
The results given are for Intranet crawls carried out in 2002, covering 19 and 17 million pages respectively.

Some attempt was made to avoid duplicate pages and to canonicalize Domino URLs (to avoid having many views of the same base document treated as distinct URLs). However, there are still some duplicates and bogus URLs in the crawl.
There are about 200,000,000 links in the resulting graph.
What’s the Quality Like?
To measure "quality" of the ranking, we took a list of management-blessed queries and the corresponding top URLs they expected a good search engine to return (in late 2001). There were about 100 of these found in the first crawl and about 200 for the second.
We then computed the average rank of the ones that were there, as our measure of quality. (Note that a rank of 1 is the highest, so a small average is good.)
Intranet Results
Average Scores (× 10^6) (possible scale 1 to n)

Test   PageRank   Traffic   HOTness
1      0.644      2.275     0.461
2      1.242      1.417     1.160

Traffic < PageRank < HOTness
Internet Result Quality
Again use "human expert" ranking. In this case, the pages chosen by the Open Directory Project (ODP, http://dmoz.org). We compute the average rank at each level of the hierarchy (including pages at the specified level, or above), e.g.
Level 1: /Top/Computers
Level 2: /Top/Computers/Education, plus level 1
Level 3: /Top/Computers/Education/Internet, plus level 2
etc.
Internet Result Quality (cont.)
Average Ranks (× 10^7)

Level   Number   PageRank   TrafficRank   HOTness
1       27       0.753      6.404         1.656
2       4258     3.143      2.862         2.614
3       65343    4.448      4.385         3.949
4       228943   4.686      4.887         4.286
5       427578   4.817      5.127         4.438
All     990354   5.236      5.677         4.812
Except at level 1 we again have: Traffic < PageRank < HOTness
Rank Aggregation
How do we use multiple static ranks (especially when they appear to rank by tenuously related criteria)?

For intranet crawl 1 we tried taking just the lowest of the 3 ranks for each URL in the list. This gives a new average score of 85,113 (as opposed to ~461,000). However, the scale is now compressed by at most 3.
Aggregation is covered in another paper (Fagin et al.)
Generalized Model

  Maximize  S = Σ_{(i,j)∈E'} { μ_ij p_ij − p_ij ln p_ij }

  Subject to:

    Σ_{(h,i)∈E'} p_hi − Σ_{(i,j)∈E'} p_ij = 0   for i ∈ U

    Σ_{(h,i)∈E'} p_hi = H_i   for i ∈ V − U

    Σ_{(i,j)∈E'} p_ij = H_i   for i ∈ V − U

    Σ_{(i,j)∈E'} c_ij p_ij = C

    Σ_{(i,j)∈E'} p_ij = 1
Future Research
1. Efficient numerical methods for the hybrid matrix balancing problem
2. Rank aggregation methods
3. Use in an actual search engine (especially for Intranet)
4. Obtaining and using traffic, a priori probability andimpedance data.
5. Vulnerability to spam?
6. Non-equilibrium (dynamic) extensions – web sensitivity.