
Estimating the Global PageRank of Web Communities Paper by Jason V. Davis & Inderjit S. Dhillon Dept. of Computer Sciences University of Texas at Austin


Estimating the Global PageRank of Web Communities

Paper by Jason V. Davis & Inderjit S. Dhillon
Dept. of Computer Sciences, University of Texas at Austin

Presentation given by Scott J. McCallen
Dept. of Computer Science, Kent State University

December 4th, 2006

Localized Search Engines

What are they?
- Focus on a particular community
- Examples: www.cs.kent.edu (site specific) or all computer science related websites (topic specific)

Advantages
- Searching for particular terms with several meanings
- Relatively inexpensive to build and use
- Use less bandwidth, space and time
- Local domains are orders of magnitude smaller than the global domain

Localized Search Engines (con't)

Disadvantages
- Lack of global information, i.e. only local PageRanks are available
- Why is this a problem? Only pages that are highly regarded within that community will have high PageRanks
- There is a need for a global PageRank for pages within a local domain
- Traditionally, this can only be obtained by crawling the entire global domain

Some Global Facts

2003 study by Lyman on the global domain:
- 8.9 billion pages on the internet (static pages)
- Approximately 18.7 kilobytes each
- 167 terabytes needed to download and crawl the entire web
- These resources are only available to major corporations

Local domains:
- May only contain a few hundred thousand pages
- May already be contained on a local web server (www.cs.kent.edu)
- There is much less restriction on accessing the entire dataset

The advantages of localized search engines become clear.
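As a quick sanity check on those figures: $8.9 \times 10^9$ pages $\times$ 18.7 KB/page $\approx 1.66 \times 10^{14}$ bytes, which is indeed roughly 167 terabytes.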

Global (N) vs. Local (n)

[Venn diagram: Environmental, EDU, and Political websites shown as overlapping regions among other websites in the global domain. Some parts overlap, but others don't; overlap represents links to other domains.]

- Each local domain isn't aware of the rest of the global domain
- Excluding the overlap from other domains gives a very poor estimate of global rank
- How is it possible to extract global information when only the local domain is available?

Proposed Solution

- Find a good approximation to the global PageRank values without crawling the entire global domain
- Find a superdomain of the local domain that closely approximates the PageRank
- Find this superdomain by crawling as few as n or 2n additional pages, given a local domain of n pages
- Essentially, add as few pages to the local domain as possible until we find a very good approximation of the PageRanks in the local domain

PageRank - Description

- Defines the importance of pages based on the hyperlinks from one page to another (the web graph)
- Computes the stationary distribution of a Markov chain created from the web graph
- Uses the "random surfer" model to create a "random walk" over the chain

PageRank Matrix

Given the m x m adjacency matrix U for the web graph, define the PageRank matrix as

$$P_U = \alpha U D_U^{-1} + (1 - \alpha)\, v e^T$$

where:
- $D_U$ is the diagonal matrix such that $U D_U^{-1}$ is column stochastic
- $0 \le \alpha \le 1$
- $e$ is the vector of all 1's
- $v$ is the random surfer vector

PageRank Vector

- The PageRank vector r represents the PageRank of every node in the web graph
- It is defined as the dominant eigenvector of the PageRank matrix
- Computed using the power method from a random starting vector
- Each iteration can take as much as O(m^2) time for a dense graph, but in practice is normally O(km), k being the average number of links per page

Algorithm 1

Computing the PageRank vector based on the adjacency matrix U of the given web graph.

Algorithm 1 (Explanation)

- Input: adjacency matrix U
- Output: PageRank vector r
- Method (see the sketch below):
  - Choose a random initial value for r^(0)
  - Continue to iterate using the random surfer probability and vector until reaching the convergence threshold
  - Return the last iterate as the dominant eigenvector of the adjacency matrix U
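In Python, each pass of this loop is a matrix-vector product plus the random-surfer correction. A minimal sketch, assuming a NumPy adjacency matrix with U[i, j] = 1 when page j links to page i (the function and parameter names are illustrative, not the paper's code):

```python
import numpy as np

def pagerank(U, alpha=0.85, tol=1e-8, max_iter=1000):
    """Power iteration for the dominant eigenvector of
    P_U = alpha * U * D_U^{-1} + (1 - alpha) * v * e^T."""
    m = U.shape[0]
    out_deg = U.sum(axis=0).astype(float)  # column sums = outlink counts
    out_deg[out_deg == 0] = 1.0            # crude guard for dangling pages
    P = U / out_deg                        # column-stochastic U * D_U^{-1}
    v = np.full(m, 1.0 / m)                # uniform random-surfer vector
    r = np.random.rand(m)
    r /= r.sum()                           # random starting vector r^(0)
    for _ in range(max_iter):
        r_next = alpha * (P @ r) + (1 - alpha) * v  # one PageRank iteration
        done = np.abs(r_next - r).sum() < tol       # L1 convergence threshold
        r = r_next
        if done:
            break
    return r
```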

Defining the Problem (G vs. L)

- For a local domain L, we have G as the entire global domain with an N x N adjacency matrix
- Define G as follows, i.e. partition G into separate sections so that L is contained:

$$G = \begin{pmatrix} L & G_{out} \\ L_{out} & G_{within} \end{pmatrix}$$

- Assume that L has already been crawled and L_out is known

Defining the Problem (p* in g)

- If we partition G as such, we can denote the actual PageRank vector of L, with respect to g (the global PageRank vector), as

$$p^* = \frac{E_L^T\, g}{\lVert E_L^T\, g \rVert_1}$$

- Note: E_L selects only the nodes that correspond to L from g

Defining the Problem (n << N)

- We define p as the PageRank vector computed by crawling only the local domain L
- Note that p will be much different from p*
- Crawling ever more nodes of the global domain would make the difference smaller; however, this is not feasible
- Instead, find the supergraph F of L that minimizes the difference between p and p*

Defining the Problem (finding F)

- We need to find the F that gives us the best approximation of p*
- i.e. minimize the difference between the actual global PageRank and the estimated PageRank
- F is found with a greedy strategy, using Algorithm 2
- Essentially, start with L and add the nodes in F_out that minimize our objective, continuing for a total of T iterations

Algorithm 2

Algorithm 2 (Explanation)

- Input: L (local domain), L_out (outlinks from L), T (number of iterations), k (pages to crawl per iteration)
- Output: p (an improved estimated PageRank vector)
- Method (see the sketch below):
  - First set F (supergraph) and F_out equal to L and L_out
  - Compute the PageRank vector f of F
  - While T has not been exceeded:
    - Select k new nodes to crawl based on F, F_out, f
    - Expand F to include those new nodes and modify F_out
    - Compute the new PageRank vector f for F
  - Select the elements from f that correspond to L and return p
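A schematic Python rendering of that loop, reusing the `pagerank` sketch above. The crawl machinery is supplied by the caller: `select_nodes` stands in for Algorithm 3, and `expand` crawls a batch of pages and splices them into the adjacency matrix (with the frontier bookkeeping for F_out folded inside); both are placeholders, not functions from the paper:

```python
def estimate_local_pagerank(L, T, k, select_nodes, expand, alpha=0.85):
    """Algorithm 2, schematically: greedily grow a supergraph F around the
    local domain L, recomputing PageRank after each batch of crawls.

    select_nodes(F, f, k) -> k frontier pages to crawl next (Algorithm 3).
    expand(F, pages)      -> enlarged adjacency matrix with the new pages
                             appended after the first n local ones.
    """
    n = L.shape[0]                      # local pages stay at indices 0..n-1
    F = L                               # start the supergraph at the local domain
    f = pagerank(F, alpha)              # PageRank of the current supergraph
    for _ in range(T):
        pages = select_nodes(F, f, k)   # pick k frontier pages to crawl
        F = expand(F, pages)            # crawl them and splice them into F
        f = pagerank(F, alpha)          # recompute PageRank on the expanded graph
    p = f[:n]                           # keep only the entries for pages in L
    return p / p.sum()                  # renormalized local estimate
```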

Global (N) vs. Local (n) (Again)

[Same Venn diagram: Environmental, EDU, Political, and other websites.]

- Using the power method on only the local domain gives very inaccurate estimates of the PageRank
- We know how to create the PageRank vector using the power method
- How can we select nodes from other domains (i.e. expanding the current domain) to improve accuracy?
- How far can selecting more nodes proceed without crawling the entire global domain?

Selecting Nodes

- Select nodes to expand L to F
- Selected nodes must bring us closer to the actual PageRank vector
- Some nodes will greatly influence the current PageRank
- We only want to select at most O(n) more pages than those already in L

Finding the Best Nodes

- For a page j in the global domain on the frontier of F (F_out), the addition of page j to F is formed as shown below
- u_j is the vector of outlinks from F to j
- s is the vector of estimated inlinks from j into F (j has not yet been crawled)
- s is estimated from the expected inlink counts of the pages already crawled
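The matrix on the slide did not survive extraction; a bordered form consistent with the definitions above (treating columns as link sources, so u_j enters as a row and s as a column) would be:

$$F_j = \begin{pmatrix} F & s \\ u_j^T & 0 \end{pmatrix}$$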

Finding the Best Nodes (con't)

- We defined the PageRank of F to be f
- The PageRank of F_j is f_j^+
- x_j is the PageRank of node j (appended to the current PageRank vector)
- Directly optimizing requires us to know the global PageRank p*
- How can we minimize the objective without knowing p*?

Node Influence

- Find the nodes in F_out that will have the greatest influence on the local domain L
- Done by attaching an influence score to each node j
- The score is the summation, over all pages in L, of the difference that adding page j makes to the PageRank vector
- The influence score correlates strongly with the minimization of the GlobalDiff(f_j) objective (as compared to a baseline, for instance, the total outlink count from F to node j)
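Written out (a reconstruction from the description above; the paper's exact notation may differ), the score for a candidate page j is

$$\text{influence}(j) = \sum_{k \in L} \left| f_j^+[k] - f[k] \right|$$

i.e. the total change that adding j induces on the PageRanks of the local pages.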

Node Influence Results

[Figure: node influence vs. outlink count on a crawl of conservative web sites.]

Finding the Influence

- Influence must be calculated for each node j considered in the frontier of F
- We are considering O(n) pages and each calculation is O(n), leaving an O(n^2) computation
- To reduce this complexity, approximating the influence of j may be acceptable, but how?
- The power method for computing PageRank may lead us to a good approximation
- However, using Algorithm 1 requires having a good starting vector

PageRank Vector (again)

- The PageRank algorithm converges at a rate equal to the random surfer probability α
- With a starting vector x^(0), the complexity of the algorithm depends on how far x^(0) is from the answer: the more accurate the vector must become, the more iterations the process takes
- Saving grace: find a very good starting vector for x^(0), in which case we only need to perform one iteration of Algorithm 1
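The slide's complexity expression was lost, but the standard power-method bound (a reconstruction, not a quote from the slides) makes the point: since the error contracts by a factor of α each iteration, reaching an L1 tolerance of ε from x^(0) takes about

$$\left\lceil \log_\alpha \frac{\epsilon}{\lVert x^{(0)} - f \rVert_1} \right\rceil$$

iterations, each costing O(km). A starting vector already close to f makes the ratio near one and the iteration count small, ideally just one.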

Finding the Best x^(0)

Partition the PageRank matrix for F_j.

Finding the Best x^(0) (con't)

Simple approach:
- Use the current PageRank vector f as the starting vector
- Perform one PageRank iteration
- Remove the element that corresponds to the added node

Issues:
- The estimate of f_j^+ will have an error of at least 2αx_j
- So if the PageRank of j is very high, the estimate is very bad

Stochastic Complement

- In an expanded form, the PageRank f_j^+ can be written as a linear system and solved for the entries corresponding to F
- Observation: the matrix that results is the stochastic complement of the PageRank matrix of F_j
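The equations themselves did not survive extraction, but the identity behind them is standard: if the PageRank matrix of F_j is partitioned so that P_11 covers the original pages of F and P_22 is the entry for the added node j, the stochastic complement of P_11 is

$$S = P_{11} + P_{12}\,(I - P_{22})^{-1}\,P_{21}$$

and the stationary vector of S recovers f_j^+ restricted to the pages of F.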

Stochastic Complement (Observations)

- The stochastic complement of an irreducible matrix is unique
- The stochastic complement is also irreducible and therefore has a unique stationary distribution
- With regard to the matrix S: the subdominant eigenvalue is bounded, and for large l the bound is very close to α

The New PageRank Approximation

- Estimate the vector f_j of length l by performing one PageRank iteration over S, starting at f
- Advantages:
  - Starts and ends with a vector of length l
  - Gives the error a lower bound of zero
  - Example: consider adding a node k to F that has no influence over the PageRank of F; using the stochastic complement yields the exact solution

The Details

- Begin by expanding the difference between the two PageRank vectors
- Substitute P_F into the equation
- Summarize the result into the vectors x, y, and z used by Algorithm 3

Algorithm 3 (Explanation)

- Input: F (the current local subgraph), F_out (outlinks of F), f (current PageRank of F), k (number of pages to return)
- Output: k new pages to crawl
- Method (sketched in code below):
  - Compute the outlink sums for each page in F
  - Compute a scalar for every known global page j (how many pages link to j)
  - Compute y and z as formulated
  - For each of the pages in F_out:
    - Compute x as formulated
    - Compute the score of the page using x, y and z
  - Return the k pages with the highest scores
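The x, y, z formulas were lost with the slide images, so the Python sketch below shows only the selection scaffolding, with the scoring function pluggable. The default stand-in scorer is the outlink-count baseline described later under Experiments, not the paper's influence approximation:

```python
import heapq

def outlink_count_score(F, f, j, frontier_links):
    """Baseline scorer from the experiments: total links from F into page j."""
    return len(frontier_links[j])

def select_frontier_pages(F, f, k, frontier_links, score=outlink_count_score):
    """Algorithm 3, schematically: score every known-but-uncrawled page and
    return the k highest-scoring ones.

    frontier_links maps each frontier page j (in F_out) to the list of pages
    in F that link to it; a scorer implementing the paper's x/y/z computation
    could be passed in place of the baseline.
    """
    scores = {j: score(F, f, j, frontier_links) for j in frontier_links}
    return heapq.nlargest(k, scores, key=scores.get)  # k pages with top scores
```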

PageRank Leaks and Flows

- The change in PageRank caused by adding a node j to F can be described in terms of leaks and flows
- A flow is the increase in local PageRanks; it is represented by a scalar (the total amount j has to distribute) and a vector (determining how it will be distributed)
- A leak is the decrease in local PageRanks; leaks come from the non-positive vectors x and y
  - x is proportional to the weighted sum of sibling PageRanks
  - y is an artifact of the random surfer vector

Leaks and Flows

[Diagram: PageRank flows from node j into the local pages, while it leaks out to j's siblings and to the random surfer.]

Experiments

Methodology
- Resources are limited, so the global graph is approximated
- Baseline algorithms:
  - Random: nodes chosen uniformly at random from the known global nodes
  - Outlink count: the nodes chosen have the highest number of outlinks from the current local domain

Results (Data Sets)

Data sets are restricted to http pages that do not contain the characters ?, *, @, or =.

EDU data set:
- Crawl of the top 100 computer science universities
- Yielded 4.7 million pages, 22.9 million links

Politics data set:
- Crawl of the pages under politics in the dmoz directory
- Yielded 4.4 million pages, 17.2 million links

Results (EDU Data Set)

[Results chart. The normalization metrics measure difference; Kendall's tau measures similarity.]

Results (Politics Data Set)

[Results chart.]

Result Summary

- Stochastic complement outperformed the other methods in nearly every trial
- The results are significantly better than the random-walk approach, with minimal computation

Conclusion

Accurate estimates of the global PageRank can be obtained by using local results:
- Expand the local graph based on influence
- Crawl at most O(n) more pages
- Use the stochastic complement to accurately estimate the new PageRank vector
- Not computationally or storage intensive

Estimating the Global PageRank of Web Communities

The End

Thank You