Search Engines & Question Answering
Giuseppe Attardi, Università di Pisa

Page 1: Search Engines & Question Answering

Search Engines & Question Answering

Giuseppe Attardi
Università di Pisa

Page 2: Search Engines & Question Answering

Topics

Web Search
– Search engines
– Architecture
– Crawling: parallel/distributed, focused
– Link analysis (Google PageRank)
– Scaling

Page 3: Search Engines & Question Answering

Top Online Activities

[Bar chart, values in extraction order: Product Info. Search 72%, Web Search 88%, Email 96%]

Source: Jupiter Communications, 2000

Page 4: Search Engines & Question Answering

Pew Study (US users, July 2002)

Total Internet users = 111 M
Do a search on any given day = 33 M
Have used Internet to search = 85%

//www.pewinternet.org/reports/toc.asp?Report=64

Page 5: Search Engines & Question Answering

Search on the Web

Corpus: The publicly accessible Web: static + dynamic

Goal: Retrieve high quality results relevant to the user’s need
– (not docs!)

Need
– Informational – want to learn about something (~40%)
– Navigational – want to go to that page (~25%)
– Transactional – want to do something (web-mediated) (~35%)
  • Access a service
  • Downloads
  • Shop
– Gray areas
  • Find a good hub
  • Exploratory search “see what’s there”

Example queries (slide callouts): Low hemoglobin; United Airlines; Tampere weather; Mars surface images; Nikon CoolPix; Car rental Finland

Page 6: Search Engines & Question Answering

Results

Static pages (documents)
– text, mp3, images, video, ...

Dynamic pages = generated on request
– database access
– “the invisible web”
– proprietary content, etc.

Page 7: Search Engines & Question Answering

Terminology

URL = Uniform Resource Locator

http://www.cism.it/cism/hotels_2001.htm
– Access method: http
– Host name: www.cism.it
– Page name: /cism/hotels_2001.htm
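As a small illustration of the three parts named on this slide, the sketch below splits the example URL with Python's standard `urllib.parse`. The variable names simply mirror the slide's labels and are not part of the original deck.

```python
from urllib.parse import urlsplit

# Split the slide's example URL into the parts it labels.
parts = urlsplit("http://www.cism.it/cism/hotels_2001.htm")

access_method = parts.scheme   # the protocol used to fetch the page
host_name = parts.netloc       # the server that holds the page
page_name = parts.path         # the page's location on that server

print(access_method, host_name, page_name)
```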

Page 8: Search Engines & Question Answering

Scale

Immense amount of content
– 2-10B static pages, doubling every 8-12 months
– Lexicon size: 10s-100s of millions of words

Authors galore (1 in 4 hosts run a web server)

http://www.netcraft.com/Survey

Page 9: Search Engines & Question Answering

Diversity

Languages/Encodings
– Hundreds (thousands?) of languages, W3C encodings: 55 (Jul01) [W3C01]
– Home pages (1997): English 82%, Next 15: 13% [Babe97]
– Google (mid 2001): English: 53%, JGCFSKRIP: 30%

Document & query topic

Popular Query Topics (from 1 million Google queries, Apr 2000)

Topic        Share    Subtopic                  Share
Arts         14.6%    Arts: Music               6.1%
Computers    13.8%    Regional: North America   5.3%
Regional     10.3%    Adult: Image Galleries    4.4%
Society       8.7%    Computers: Software       3.4%
Adult         8%      Computers: Internet       3.2%
Recreation    7.3%    Business: Industries      2.3%
Business      7.2%    Regional: Europe          1.8%
…             …       …                         …

Page 10: Search Engines & Question Answering

Rate of change

[Cho00] 720K pages from 270 popular sites sampled daily from Feb 17 – Jun 14, 1999

Mathematically, what does this seem to be?

Page 11: Search Engines & Question Answering

Web idiosyncrasies

Distributed authorship
– Millions of people creating pages with their own style, grammar, vocabulary, opinions, facts, falsehoods …
– Not all have the purest motives in providing high-quality information: commercial motives drive “spamming” (100s of millions of pages).
– The open web is largely a marketing tool.
  • IBM’s home page does not contain the word “computer”.

Page 12: Search Engines & Question Answering

Other characteristics

Significant duplication
– Syntactic: 30%-40% (near) duplicates [Brod97, Shiv99b]
– Semantic: ???

High linkage
– ~8 links/page on average

Complex graph topology
– Not a small world; bow-tie structure [Brod00]

More on these corpus characteristics later
– how do we measure them?

Page 13: Search Engines & Question Answering

Web search users

Ill-defined queries
– Short
  • AV 2001: 2.54 terms avg, 80% < 3 words
– Imprecise terms
– Sub-optimal syntax (80% of queries without operator)
– Low effort

Wide variance in
– Needs
– Expectations
– Knowledge
– Bandwidth

Specific behavior
– 85% look over one result screen only (mostly above the fold)
– 78% of queries are not modified (one query/session)
– Follow links – “the scent of information” ...

Page 14: Search Engines & Question Answering

Evolution of search engines

First generation -- use only “on page”, text data
– Word frequency, language
– 1995-1997: AV, Excite, Lycos, etc.

Second generation -- use off-page, web-specific data
– Link (or connectivity) analysis
– Click-through data (what results people click on)
– Anchor-text (how people refer to this page)
– From 1998. Made popular by Google, but used by everyone now

Third generation -- answer “the need behind the query”
– Semantic analysis -- what is this about?
– Focus on user need, rather than on query
– Context determination
– Helping the user
– Integration of search and text analysis
– Still experimental

Page 15: Search Engines & Question Answering

Third generation search engine: answering “the need behind the query”

Query language determination

Different ranking
– (if the query is in Japanese, do not return English)

Hard & soft matches
– Personalities (triggered on names)
– Cities (travel info, maps)
– Medical info (triggered on names and/or results)
– Stock quotes, news (triggered on stock symbol)
– Company info, …

Integration of Search and Text Analysis

Page 16: Search Engines & Question Answering

Answering “the need behind the query”: context determination

Context determination
– spatial (user location/target location)
– query stream (previous queries)
– personal (user profile)
– explicit (vertical search, family friendly)
– implicit (use AltaVista from AltaVista France)

Context use
– Result restriction
– Ranking modulation

Page 17: Search Engines & Question Answering

The spatial context - geo-search

Two aspects
– Geo-coding
  • encode geographic coordinates to make search effective
– Geo-parsing
  • the process of identifying geographic context

Geo-coding
– Geometrical hierarchy (squares)
– Natural hierarchy (country, state, county, city, zip-codes, etc.)

Geo-parsing
– Pages (infer from phone nos, zip, etc.). About 10% feasible.
– Queries (use dictionary of place names)
– Users
  • From IP data
– Mobile phones
  • In its infancy, many issues (display size, privacy, etc.)

Page 18: Search Engines & Question Answering

[Screenshot: AV results for the query “barry bonds”]

Page 19: Search Engines & Question Answering

[Screenshot: Lycos results for the query “palo alto”]

Page 20: Search Engines & Question Answering

Geo-search example - Northern Light (Now Divine Inc)

Page 21: Search Engines & Question Answering

Helping the user

– UI
– spell checking
– query refinement
– query suggestion
– context transfer …

Page 22: Search Engines & Question Answering

Context sensitive spell check

Page 23: Search Engines & Question Answering

Search Engine Architecture

[Architecture diagram. Components: Crawl Control, Crawlers, Page Repository, Document Store, Indexer, Link Analysis, Text and Structure indexes, Query Engine, Ranking, Snippet Extraction. Inputs and outputs: Queries, Results.]

Page 24: Search Engines & Question Answering

Terms

– Crawler
– Crawler control
– Indexes – text, structure, utility
– Page repository
– Indexer
– Collection analysis module
– Query engine
– Ranking module

Page 25: Search Engines & Question Answering

Repository

“Hidden Treasures”

Page 26: Search Engines & Question Answering

Storage

– The page repository is a scalable storage system for web pages
– Allows the Crawler to store pages
– Allows the Indexer and Collection Analysis to retrieve them
– Similar to other data storage systems – DB or file systems
– Does not have to provide some of the other systems’ features: transactions, logging, directory

Page 27: Search Engines & Question Answering

Storage Issues

Scalability and seamless load distribution

Dual access modes
– Random access (used by the query engine for cached pages)
– Streaming access (used by the Indexer and Collection Analysis)

Large bulk update – reclaim old space, avoid access/update conflicts

Obsolete pages – remove pages no longer on the web

Page 28: Search Engines & Question Answering

Designing a Distributed Web Repository

– Repository designed to work over a cluster of interconnected nodes
– Page distribution across nodes
– Physical organization within a node
– Update strategy

Page 29: Search Engines & Question Answering

Page Distribution

How to choose a node to store a page

– Uniform distribution – any page can be sent to any node
– Hash distribution policy – hash page ID space into node ID space
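A minimal sketch of the hash distribution policy, assuming the page ID is the page's URL and that nodes are numbered 0 to `num_nodes - 1`. The function name `node_for_page` and the choice of SHA-1 are illustrative assumptions, not something the slides specify.

```python
import hashlib

def node_for_page(page_id: str, num_nodes: int) -> int:
    """Map a page ID into the node ID space (hypothetical helper)."""
    # A stable hash, so every crawler process picks the same node for a page.
    digest = hashlib.sha1(page_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

node = node_for_page("http://www.cism.it/cism/hotels_2001.htm", 4)
```

The node can be recomputed from the page ID alone, so no lookup table is needed. A uniform policy, by contrast, would have to record where each page was sent.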

Page 30: Search Engines & Question Answering

Organization Within a Node

Several operations required
– Add / remove a page
– High speed streaming
– Random page access

Hashed organization
– Treat each disk as a hash bucket
– Assign according to a page’s ID

Log organization
– Treat the disk as one file, and add the page at the end
– Support random access using a B-tree

Hybrid
– Hash map a page to an extent and use log structure within an extent.
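The log organization can be sketched in a few lines. This is an in-memory illustration under stated assumptions: a `bytearray` stands in for the disk, and a plain dictionary stands in for the B-tree index. The class name `LogStore` is hypothetical.

```python
class LogStore:
    """Append-only page log with an index for random access (illustrative)."""

    def __init__(self):
        self.log = bytearray()   # stands in for "the disk as one file"
        self.index = {}          # page_id -> (offset, length); B-tree stand-in

    def add(self, page_id, content: bytes):
        # New pages (and new versions) are always appended at the end.
        self.index[page_id] = (len(self.log), len(content))
        self.log += content

    def get(self, page_id) -> bytes:
        # Random access: one index lookup, then one contiguous read.
        offset, length = self.index[page_id]
        return bytes(self.log[offset:offset + length])

    def stream(self):
        # Streaming access: current pages in log order.
        for page_id, (offset, length) in sorted(
                self.index.items(), key=lambda kv: kv[1][0]):
            yield page_id, bytes(self.log[offset:offset + length])

store = LogStore()
store.add("a", b"first page")
store.add("b", b"second page")
store.add("a", b"first page, refreshed")
```

In this sketch, superseded versions stay in the log until space is reclaimed, which is the bulk-update issue the storage slides mention.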

Page 31: Search Engines & Question Answering

Distribution Performance

                             Log    Hashed    Hashed Log
Streaming performance        ++     -         +
Random access performance    +-     ++        +-
Page addition                ++     -         +

Page 32: Search Engines & Question Answering

Update Strategies

Updates are generated by the crawler

Several characteristics
– Time in which the crawl occurs and the repository receives information
– Whether the crawl’s information replaces the entire database or modifies parts of it

Page 33: Search Engines & Question Answering

Batch vs. Steady

Batch mode
– Periodically executed
– Allocated a certain amount of time

Steady mode
– Run all the time
– Always send results back to the repository

Page 34: Search Engines & Question Answering

Partial vs. Complete Crawls

A batch mode crawler can
– Do a complete crawl every run, and replace the entire collection
– Recrawl only a specific subset, and apply updates to the existing collection – partial crawl

The repository can implement
– In place update
  • Quickly refresh pages
– Shadowing, update as another stage
  • Avoid refresh-access conflicts

Page 35: Search Engines & Question Answering

Partial vs. Complete Crawls

– Shadowing resolves the conflicts between updates and reads for the queries
– Batch mode fits well with shadowing
– A steady crawler fits well with in-place updates

Page 36: Search Engines & Question Answering

Indexing

Page 37: Search Engines & Question Answering

The Indexer Module

Creates two indexes:
– Text (content) index: uses “traditional” indexing methods like inverted indexing
– Structure (links) index: uses a directed graph of pages and links. Sometimes also creates an inverted graph

Page 38: Search Engines & Question Answering

The Link Analysis Module

Uses the 2 basic indexes created by the indexer module in order to assemble “Utility Indexes”, e.g., a site index.

Page 39: Search Engines & Question Answering

Inverted Index

– A set of inverted lists, one per index term (word)
– Inverted list of a term: a sorted list of locations in which the term appeared
– Posting: a pair (w, l) where w is a word and l is one of its locations
– Lexicon: holds all the index’s terms with statistics about the term (not the posting)
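The definitions above can be made concrete with a toy builder. This is a sketch under simplifying assumptions: whitespace tokenization, lowercasing, and a location represented as a `(doc_id, word position)` pair. The function name and the statistics kept in the lexicon are illustrative choices.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping doc_id -> text. Returns (inverted lists, lexicon)."""
    inverted = defaultdict(list)
    for doc_id in sorted(docs):                      # keeps each list sorted
        for pos, word in enumerate(docs[doc_id].lower().split()):
            inverted[word].append((doc_id, pos))     # one posting (w, l)
    # The lexicon keeps per-term statistics, not the postings themselves.
    lexicon = {
        word: {"postings": len(locs), "docs": len({d for d, _ in locs})}
        for word, locs in inverted.items()
    }
    return dict(inverted), lexicon

docs = {1: "web search engines", 2: "search the web"}
inverted, lexicon = build_inverted_index(docs)
```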

Page 40: Search Engines & Question Answering

Challenges

Index build must be:
– Fast
– Economic
(unlike traditional index building)

Incremental indexing must be supported

Storage: compression vs. speed

Page 41: Search Engines & Question Answering

Index Partitioning

Distributed text indexing can be done by:

Local inverted file (IF_L)
– Each node contains a disjoint random subset of pages
– Query is broadcast
– Result is the merged query answers

Global inverted file (IF_G)
– Each node is responsible only for a subset of the terms in the collection
– Query is sent only to the appropriate node
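The two partitioning schemes can be contrasted with a small sketch for single-term queries. The assumptions are illustrative only: integer document IDs, assignment of pages to nodes by `doc_id % num_nodes` rather than at random, and CRC32 for the term-to-node mapping. All function names are hypothetical.

```python
import zlib

def local_partition(docs, num_nodes):
    """IF_L: each node indexes only its own disjoint subset of pages."""
    nodes = [dict() for _ in range(num_nodes)]
    for doc_id, text in docs.items():
        node = nodes[doc_id % num_nodes]
        for word in set(text.lower().split()):
            node.setdefault(word, set()).add(doc_id)
    return nodes

def query_local(nodes, term):
    # Broadcast the query to every node and merge the partial answers.
    result = set()
    for node in nodes:
        result |= node.get(term, set())
    return result

def term_node(term, num_nodes):
    return zlib.crc32(term.encode("utf-8")) % num_nodes

def global_partition(docs, num_nodes):
    """IF_G: each node holds the complete lists for a subset of terms."""
    nodes = [dict() for _ in range(num_nodes)]
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            nodes[term_node(word, num_nodes)].setdefault(word, set()).add(doc_id)
    return nodes

def query_global(nodes, term):
    # Only the node responsible for the term is contacted.
    return nodes[term_node(term, len(nodes))].get(term, set())

docs = {1: "web search engines", 2: "search the web", 3: "question answering"}
local_nodes = local_partition(docs, 2)
global_nodes = global_partition(docs, 2)
```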

Page 42: Search Engines & Question Answering

Indexing, Conclusion

– Web page indexing is complicated due to its scale (millions of pages, hundreds of gigabytes)
– Challenges: incremental indexing and personalization

Page 43: Search Engines & Question Answering

Scaling

Page 44: Search Engines & Question Answering

Scaling

Google (Nov 2002):
– Number of pages: 3 billion
– Refresh interval: 1 month (1200 pages/sec)
– Queries/day: 150 million = 1700 q/s

Avg page size: 10 KB
Avg query size: 40 B
Avg result size: 5 KB
Avg links/page: 8
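The two per-second rates on this slide follow from the daily and monthly figures. A quick check, assuming a 30-day month:

```python
SECONDS_PER_DAY = 24 * 60 * 60

pages = 3_000_000_000
refresh_days = 30                  # "1 month", taken here as 30 days
queries_per_day = 150_000_000

crawl_rate = pages / (refresh_days * SECONDS_PER_DAY)   # pages per second
query_rate = queries_per_day / SECONDS_PER_DAY          # queries per second

print(f"{crawl_rate:.0f} pages/s, {query_rate:.0f} q/s")
```

Both values round to the slide's figures of roughly 1200 pages/sec and 1700 q/s.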

Page 45: Search Engines & Question Answering

Size of Dataset

Total raw HTML data size:
3G x 10 KB = 30 TB

Inverted index ~= corpus = 30 TB

Using compression 3:1
20 TB data on disk

Page 46: Search Engines & Question Answering

Single copy of index

Index
– (10 TB) / (100 GB per disk) = 100 disks

Document
– (10 TB) / (100 GB per disk) = 100 disks

Page 47: Search Engines & Question Answering

Query Load

1700 queries/sec

Rule of thumb: 20 q/s per CPU
– 85 clusters to answer queries

Cluster: 100 machines

Total = 85 x 100 = 8500

Document server
– Snippet search: 1000 snippets/s
– (1700 * 10 / 1000) * 100 = 1700
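The machine counts can be reproduced step by step. The slide does not label the factors in its last line, so two readings below are assumptions: the factor 10 is taken as results (snippets) per query, and the factor 100 as the number of machines holding one copy of the document store.

```python
queries_per_sec = 1700
qps_per_cpu = 20
machines_per_cluster = 100       # one full copy of the index

# Index side: each 100-machine cluster sustains about 20 q/s.
index_clusters = queries_per_sec // qps_per_cpu            # 85
index_machines = index_clusters * machines_per_cluster     # 8500

# Document side (assumed reading of the slide's last line).
results_per_query = 10           # assumption: snippets needed per query
snippets_per_sec = 1000          # per document-store copy
doc_copies = queries_per_sec * results_per_query / snippets_per_sec   # 17
doc_machines = doc_copies * machines_per_cluster                      # 1700

print(index_clusters, index_machines, doc_machines)
```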

Page 48: Search Engines & Question Answering

Limits

– Redirector: 4000 req/sec
– Bandwidth: 1100 req/sec
– Server: 22 q/s each
– Cluster: 50 nodes = 1100 q/s = 95 million q/day

Page 49: Search Engines & Question Answering

Scaling the Index

[Diagram. Components: Queries, hardware-based load balancer, Google Web Servers, Index servers, Document servers, Spell Checker, Ad server.]

Page 50: Search Engines & Question Answering

Pooled Shard Architecture

[Diagram. Components: Web Server; Index load balancer; Index Servers 1…K on the Index Server Network; intermediate load balancers 1…N, each in front of a pool for shard 1…N (nodes labelled S1 … SN and SIS) on the Pool Network. Link speeds shown: 1 Gb/s and 100 Mb/s.]

Page 51: Search Engines & Question Answering

Replicated Index Architecture

[Diagram. Components: Web Server; Index load balancer; Index Servers 1…M on the Index Server Network, each in front of a full index 1…M (nodes labelled S1 … SN and SIS). Link speeds shown: 1 Gb/s and 100 Mb/s.]

Page 52: Search Engines & Question Answering

Index Replication

– 100 Mb/s bandwidth
– 20 TB x 8 / 100
– 20 TB require one full day

Page 53: Search Engines & Question Answering

Ranking

Page 54: Search Engines & Question Answering

First generation ranking

Extended Boolean model
– Matches: exact, prefix, phrase, …
– Operators: AND, OR, AND NOT, NEAR, …
– Fields: TITLE:, URL:, HOST:, …
– AND is somewhat easier to implement, maybe preferable as default for short queries

Ranking
– TF like factors: TF, explicit keywords, words in title, explicit emphasis (headers), etc.
– IDF factors: IDF, total word count in corpus, frequency in query log, frequency in language

Page 55: Search Engines & Question Answering

Second generation search engine

Ranking -- use off-page, web-specific data
– Link (or connectivity) analysis
– Click-through data (what results people click on)
– Anchor-text (how people refer to this page)

Crawling
– Algorithms to create the best possible corpus

Page 56: Search Engines & Question Answering

Connectivity analysis

Idea: mine hyperlink information in the Web

Assumptions:
– Links often connect related pages
– A link between pages is a recommendation: “people vote with their links”

Page 57: Search Engines & Question Answering

Citation Analysis

Citation frequency

Co-citation coupling frequency
– Cocitations with a given author measure “impact”
– Cocitation analysis [Mcca90]

Bibliographic coupling frequency
– Articles that co-cite the same articles are related

Citation indexing
– Who is a given author cited by? (Garfield [Garf72])

Pinski and Narin

Page 58: Search Engines & Question Answering

Query-independent ordering

First generation: using link counts as simple measures of popularity

Two basic suggestions:
– Undirected popularity:
  • Each page gets a score = the number of in-links plus the number of out-links (3+2=5)
– Directed popularity:
  • Score of a page = number of its in-links (3)
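Both link-count scores are easy to compute from an edge list. The sketch below uses a made-up five-link graph in which page `p` has 3 in-links and 2 out-links, chosen to match the numbers quoted on the slide. The page names are hypothetical.

```python
def link_popularity(links):
    """links: iterable of (source, target) pairs. Returns two score dicts."""
    in_links, out_links = {}, {}
    for src, dst in links:
        out_links[src] = out_links.get(src, 0) + 1
        in_links[dst] = in_links.get(dst, 0) + 1
    pages = set(in_links) | set(out_links)
    undirected = {p: in_links.get(p, 0) + out_links.get(p, 0) for p in pages}
    directed = {p: in_links.get(p, 0) for p in pages}
    return undirected, directed

links = [("a", "p"), ("b", "p"), ("c", "p"), ("p", "d"), ("p", "e")]
undirected, directed = link_popularity(links)
```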

Page 59: Search Engines & Question Answering

Query processing

– First retrieve all pages meeting the text query (say, venture capital)
– Order these by their link popularity (either variant on the previous page)

Page 60: Search Engines & Question Answering

Spamming simple popularity

Exercise: How do you spam each of the following heuristics so your page gets a high score?
– Each page gets a score = the number of in-links plus the number of out-links
– Score of a page = number of its in-links

Page 61: Search Engines & Question Answering

PageRank scoring

Imagine a browser doing a random walk on web pages:
– Start at a random page
– At each step, go out of the current page along one of the links on that page, equiprobably

“In the steady state” each page has a long-term visit rate - use this as the page’s score

[Figure: a page with three out-links, each followed with probability 1/3]

Page 62: Search Engines & Question Answering

Not quite enough

The web is full of dead-ends
– Random walk can get stuck in dead-ends.
– Makes no sense to talk about long-term visit rates.

[Figure: a random walk reaching a dead end, marked “??”]

Page 63: Search Engines & Question Answering

Teleporting

– At each step, with probability 10%, jump to a random web page
– With remaining probability (90%), go out on a random link
  • If no out-link, stay put in this case
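The teleporting walk can be written down as a transition matrix, one row per page. This is a sketch assuming pages are numbered 0 to n-1 and that "jump to a random web page" means a uniform choice over all n pages. The function name and the three-page graph are illustrative.

```python
def teleport_matrix(out_links, n, teleport=0.10):
    """out_links: dict page -> list of linked pages. Returns an n x n matrix."""
    # With probability `teleport`, jump to any of the n pages uniformly.
    P = [[teleport / n] * n for _ in range(n)]
    for i in range(n):
        targets = out_links.get(i, [])
        if targets:
            # Otherwise follow one of the page's out-links, equiprobably.
            for j in targets:
                P[i][j] += (1 - teleport) / len(targets)
        else:
            # Dead end: stay put with the remaining probability.
            P[i][i] += 1 - teleport
    return P

# Page 0 links to 1 and 2, page 1 links to 2, page 2 is a dead end.
P = teleport_matrix({0: [1, 2], 1: [2]}, n=3)
```

Every row sums to 1 and every entry is positive, so the walk can no longer get stuck at page 2.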

Page 64: Search Engines & Question Answering

Result of teleporting

– Now cannot get stuck locally
– There is a long-term rate at which any page is visited (not obvious, will show this)
– How do we compute this visit rate?

Page 65: Search Engines & Question Answering

PageRank

– Tries to capture the notion of “Importance of a page”
– Uses backlinks for ranking
– Avoids trivial spamming: distributes pages’ “voting power” among the pages they are linking to
– An “Important” page linking to a page will raise its rank more than a “Not Important” one

Page 66: Search Engines & Question Answering

Simple PageRank

Given by:
\[ r(i) = \sum_{j \in B(i)} \frac{r(j)}{N(j)} \]
where
– B(i): set of pages that link to i
– N(j): number of outgoing links from j

Well defined if the link graph is strongly connected

Based on the “Random Surfer Model”: the rank of a page equals the probability of being on this page

Page 67: Search Engines & Question Answering

Computation of PageRank (1)

\[ r = [r(1), r(2), \ldots, r(m)] \]
\[ a_{i,j} = \begin{cases} 1/N(i) & \text{if } i \text{ points to } j \\ 0 & \text{otherwise} \end{cases} \]
\[ r = A^{t} r \]

Page 68: Search Engines & Question Answering

Computation of PageRank (2)

– Given a matrix $A$, an eigenvalue $c$ and the corresponding eigenvector $v$ are defined by $Av = cv$
– Hence $r$ is an eigenvector of $A^{t}$ for eigenvalue 1
– If $G$ is strongly connected then $r$ is unique

Page 69: Search Engines & Question Answering

Computation of PageRank (3)

Simple PageRank can be computed by:

1. $s \leftarrow$ any random vector
2. $r \leftarrow A^{t} s$
3. if $\lVert r - s \rVert < \varepsilon$ goto 5
4. $s \leftarrow r$, goto 2
5. $r$ is the PageRank vector
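The iteration on this slide can be sketched directly on an adjacency list, without building the matrix. The sketch rests on a few assumptions: the walk starts from the uniform vector rather than a random one, the L1 norm is used for the convergence test, and every page has at least one out-link. The three-page graph is made up for illustration.

```python
def simple_pagerank(out_links, eps=1e-10, max_iter=1000):
    """out_links: dict page -> non-empty list of pages it links to."""
    pages = sorted(out_links)
    s = {p: 1.0 / len(pages) for p in pages}       # step 1 (uniform start)
    r = s
    for _ in range(max_iter):
        r = {p: 0.0 for p in pages}                # step 2: r = A^t s
        for j in pages:
            share = s[j] / len(out_links[j])       # r(j) / N(j)
            for i in out_links[j]:
                r[i] += share
        if sum(abs(r[p] - s[p]) for p in pages) < eps:   # step 3
            break
        s = r                                      # step 4
    return r                                       # step 5

# Strongly connected example: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1
ranks = simple_pagerank({1: [2, 3], 2: [3], 3: [1]})
```

For this graph the ranks settle at 0.4, 0.2 and 0.4. Page 2 receives only half of page 1's vote, while page 3 collects the other half plus all of page 2's.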

Page 70: Search Engines & Question Answering

PageRank Example

Page 71: Search Engines & Question Answering

Practical PageRank: Problem

Web is not a strongly connected graph. It contains:
– “Rank Sinks”: cluster of pages without outgoing links. Pages outside the cluster will be ranked 0.
– “Rank Leaks”: a page without outgoing links. All pages will be ranked 0.

Page 72: Search Engines & Question Answering

Practical PageRank: Solution

– Remove all Rank Leaks
– Add decay factor $d$ to Simple PageRank
– Based on the “Bored Surfer Model”
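The formula for this slide did not survive extraction. As an assumption, one commonly cited way to add a decay factor $d$ to Simple PageRank, written with the symbols $B(i)$, $N(j)$ and $m$ used on the earlier slides, is:

```latex
\[
  r(i) \;=\; d \sum_{j \in B(i)} \frac{r(j)}{N(j)} \;+\; \frac{1 - d}{m}
\]
```

Here $m$ is the total number of pages. With $d = 0.9$ this matches the teleporting walk described earlier: the surfer follows a link with probability 90% and jumps to a random page with probability 10%. The deck's own formula may differ in detail.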

Page 73: Search Engines & Question Answering

HITS: Hypertext Induced Topic Search

A query dependent technique

Produces two scores:
– Authority: a page most likely to be relevant to a given query
– Hub: points to many Authorities

Contains two parts:
– Identifying the focused subgraph
– Link analysis

Page 74: Search Engines & Question Answering

HITS: Identifying The Focused Subgraph

Subgraph creation from t-sized page set:

1. $R \leftarrow$ t initial pages
2. $S \leftarrow R$
3. for each page $p \in R$
   (a) Include all the pages that $p$ points to in $S$
   (b) Include (up to a maximum $d$) all pages that point to $p$ in $S$
4. $S$ holds the focused graph

(d reduces the influence of extremely popular pages like yahoo.com)
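A short sketch of the subgraph construction, assuming forward and backward link tables are available as dictionaries. Which `d` in-linking pages are kept when a page has more than `d` is not stated on the slide; taking the first `d` in list order is an arbitrary choice made here. The names are hypothetical.

```python
def focused_subgraph(root_pages, out_links, in_links, d):
    """Return the page set S built from the t initial pages in root_pages."""
    R = set(root_pages)                        # step 1
    S = set(R)                                 # step 2
    for p in R:                                # step 3
        S.update(out_links.get(p, []))         # (a) pages that p points to
        S.update(in_links.get(p, [])[:d])      # (b) at most d pages pointing to p
    return S                                   # step 4

out_links = {"r": ["a", "b"]}
in_links = {"r": ["x", "y", "z"]}
S = focused_subgraph(["r"], out_links, in_links, d=2)
```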

Page 75: Search Engines & Question Answering

HITS: Link Analysis

Calculates Authorities & Hubs scores ($a_i$ & $h_i$) for each page in $S$

1. Initialize $a_i$, $h_i$ $(1 \le i \le n)$ arbitrarily.
2. Repeat until convergence:
   (a) $a_i = \sum_{j \in B(i)} h_j$ (for all pages)
   (b) $h_i = \sum_{j \in F(i)} a_j$ (for all pages)

Here $B(i)$ is the set of pages that link to $i$ and $F(i)$ is the set of pages that $i$ links to.
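The two update rules can be sketched as follows. Two details are assumptions not visible on the slide: scores are rescaled to unit Euclidean length after every round, so they stay bounded, and a fixed number of rounds replaces the convergence test. The small graph is made up.

```python
import math

def normalize(scores):
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {p: v / norm for p, v in scores.items()} if norm else scores

def hits(out_links, iterations=50):
    """out_links: dict page -> list of pages it links to (the sets F(i))."""
    pages = set(out_links) | {q for ts in out_links.values() for q in ts}
    a = {p: 1.0 for p in pages}
    h = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # (a) authority of i = sum of hub scores of the pages linking to i
        new_a = {p: 0.0 for p in pages}
        for j, targets in out_links.items():
            for i in targets:
                new_a[i] += h[j]
        # (b) hub score of i = sum of authority scores of the pages i links to
        new_h = {p: sum(new_a[i] for i in out_links.get(p, [])) for p in pages}
        a, h = normalize(new_a), normalize(new_h)
    return a, h

a, h = hits({"h1": ["a1", "a2"], "h2": ["a1"]})
```

In this example `a1` ends up with the highest authority score, since both hubs point to it, and `h1` is the stronger hub, since it points to both authorities.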

Page 76: Search Engines & Question Answering

HITS: Link Analysis Computation

Eigenvector computation can be used:
\[ a = A^{tr} h, \qquad h = A a \]
\[ a = A^{tr} A\, a, \qquad h = A A^{tr} h \]

Where
– a: vector of Authorities’ scores
– h: vector of Hubs’ scores
– A: adjacency matrix in which $a_{i,j} = 1$ if $i$ points to $j$

Page 77: Search Engines & Question Answering

Markov chains

– A Markov chain consists of $n$ states, plus an $n \times n$ transition probability matrix $P$
– At each step, we are in exactly one of the states
– For $1 \le i, j \le n$, the matrix entry $P_{ij}$ tells us the probability of $j$ being the next state, given we are currently in state $i$

[Figure: transition from state i to state j with probability $P_{ij}$. $P_{ii} > 0$ is OK.]

Page 78: Search Engines & Question Answering

Markov chains

Clearly, for all i,
\[ \sum_{j=1}^{n} P_{ij} = 1. \]

Markov chains are abstractions of random walks

Exercise: represent the teleporting random walk from 3 slides ago as a Markov chain, for this case: [figure not preserved]

Page 79: Search Engines & Question Answering

Ergodic Markov chains

A Markov chain is ergodic if
– you have a path from any state to any other
– you can be in any state at every time step, with non-zero probability

[Figure: a chain that is not ergodic (even/odd).]

Page 80: Search Engines & Question Answering

Ergodic Markov chains

For any ergodic Markov chain, there is a unique long-term visit rate for each state
– Steady-state distribution

Over a long time-period, we visit each state in proportion to this rate

It doesn’t matter where we start

Page 81: Search Engines & Question Answering

Probability vectors

A probability (row) vector $x = (x_1, \ldots, x_n)$ tells us where the walk is at any point

E.g., (0 0 0 … 1 … 0 0 0), with the 1 in position i out of n, means we’re in state i

More generally, the vector $x = (x_1, \ldots, x_n)$ means the walk is in state i with probability $x_i$, where
\[ \sum_{i=1}^{n} x_i = 1. \]

Page 82: Search Engines & Question Answering

Change in probability vector

– If the probability vector is $x = (x_1, \ldots, x_n)$ at this step, what is it at the next step?
– Recall that row $i$ of the transition probability matrix $P$ tells us where we go next from state $i$
– So from $x$, our next state is distributed as $xP$

Page 83: Search Engines & Question Answering

Computing the visit rate

The steady state looks like a vector of probabilities $a = (a_1, \ldots, a_n)$:
– $a_i$ is the probability that we are in state i

[Figure: two states, 1 and 2, with transition probabilities 1/4 and 3/4]

For this example, $a_1 = 1/4$ and $a_2 = 3/4$

Page 84: Search Engines & Question Answering

How do we compute this vector?

– Let $a = (a_1, \ldots, a_n)$ denote the row vector of steady-state probabilities
– If our current position is described by $a$, then the next step is distributed as $aP$
– But $a$ is the steady state, so $a = aP$
– Solving this matrix equation gives us $a$
  • So $a$ is the (left) eigenvector for $P$
  • (Corresponds to the “principal” eigenvector of $P$ with the largest eigenvalue)

Page 85: Search Engines & Question Answering

One way of computing a

– Recall, regardless of where we start, we eventually reach the steady state $a$
– Start with any distribution (say $x = (1\,0 \ldots 0)$)
– After one step, we’re at $xP$
– after two steps at $xP^2$, then $xP^3$ and so on
– “Eventually” means for “large” $k$, $xP^k = a$
– Algorithm: multiply $x$ by increasing powers of $P$ until the product looks stable
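A minimal sketch of this algorithm using plain Python lists. The stopping tolerance and step limit are illustrative assumptions. The two-state matrix is chosen to be consistent with the values $a_1 = 1/4$, $a_2 = 3/4$ quoted earlier, but the exact edge labels of the original figure are an assumption.

```python
def steady_state(P, tol=1e-12, max_steps=10_000):
    """Repeat x <- xP from x = (1, 0, ..., 0) until x stops changing."""
    n = len(P)
    x = [1.0] + [0.0] * (n - 1)
    for _ in range(max_steps):
        # Row vector times matrix: nxt[j] = sum_i x[i] * P[i][j]
        nxt = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(u - v) for u, v in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x

# From either state: go to state 1 with prob. 1/4, to state 2 with prob. 3/4.
P = [[0.25, 0.75],
     [0.25, 0.75]]
a = steady_state(P)
```

The result satisfies `a = aP` up to the tolerance, which is the steady-state condition from the previous slide.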

Page 86: Search Engines & Question Answering

Lempel: SALSA

By applying the ergodic theorem, Lempel has proved that:
– $a_i$ is proportional to the number of incoming links

Page 87: Search Engines & Question Answering

PageRank summary

Preprocessing:
– Given graph of links, build matrix $P$
– From it compute $a$
– The entry $a_i$ is a number between 0 and 1: the PageRank of page i

Query processing:
– Retrieve pages meeting query
– Rank them by their PageRank
– Order is query-independent

Page 88: Search Engines & Question Answering

The reality

PageRank is used in Google, but so are many other clever heuristics