



Page 1

National & Kapodistrian University of Athens
Dept. of Informatics & Telecommunications

MSc. in Computer Systems Technology
Distributed Systems

Searching the Web
By A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan

Giorgos Matrozos

M 414
[email protected]

Page 2

This paper is about …

Search Engines

Generic Architecture
Each Component's Architecture
Each Component's Design and Implementation

Techniques
Crawling
Page Storage
Indexing
Link Analysis

Page 3

A Quick Look

Why Use Search Engines - Why Is Their Work Hard?

Ans: Over a billion pages

Great growth rate

About 23% of the pages update daily

Linking between pages is very complicated

What about Information Retrieval?

Ans: It is used, but it is unsuitable on its own because it was designed for small, coherent collections. The Web, on the other hand, is massive, incoherent, distributed and rapidly changing.

Page 4

Search Engine Components

A search engine consists of

a Crawler module
a Crawler Control module
a Page Repository
an Indexer module
a Collection Analysis module
a Utility Index
a Query Engine module
a Ranking module

Page 5

General Search Engine Architecture

Page 6

The Crawler module

Starts with a set of URLs S0

It has a prioritized queue from which it retrieves the URLs

Then the Crawler downloads the pages, extracts any new URLs and places them in the queue

This is done until it decides to stop

But some questions arise.

What pages should the Crawler download?

Ans: Page Selection methods

How should the Crawler refresh pages?

Ans: Page Refresh methods

Page 7

Page Selection

The Crawler may want to download important pages first, so that the collection is of good quality

But what is important?

How does the Crawler operate?

How does the Crawler guess good pages?

Hints

Importance Metrics

Crawler Models

Ordering Metrics

Page 8

Importance Metrics I

Interest Driven
Given a query Q, the importance of the page P is defined as the textual similarity between P and Q.

P and Q are represented as vectors <w1, …, wn>, where wi represents the i-th word of the vocabulary.

wi = (number of appearances of the word in the page) * idf (inverse document frequency). idf = 1 / (number of appearances of the word in the whole collection).
The similarity between P and Q, IS(P), is the cosine product of the P and Q vectors. Idf was not used here because it relies on global information.

But if we want to use idf factors, they must be estimated using reference idf values from other times. The similarity is then IS'(P); it is an estimate, because we have not yet seen the entire collection needed to compute the actual IS(P).
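A minimal sketch of this interest-driven metric in Python, assuming a dictionary est_idf of reference idf values (est_idf, weight_vector and similarity are illustrative names, not from the paper):

```python
import math
from collections import Counter

def weight_vector(words, est_idf):
    """tf * idf weights for one document; est_idf holds the estimated reference idf values."""
    tf = Counter(words)
    return {w: tf[w] * est_idf.get(w, 1.0) for w in tf}

def similarity(page_words, query_words, est_idf):
    """IS'(P): cosine of the weight vectors of page P and query Q."""
    p = weight_vector(page_words, est_idf)
    q = weight_vector(query_words, est_idf)
    dot = sum(p[w] * q[w] for w in p if w in q)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

print(similarity(["web", "search", "engines"], ["web", "search"], est_idf={"web": 1.2, "search": 1.5}))
```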

Page 9

Importance Metrics II

Popularity Driven

A way to define popularity is to use a page's backlink count, that is, the number of links that point to this page. This number determines its popularity IB(P).

Note also that the Crawler estimates IB'(P), because the actual metric needs information about the whole web. The estimate may be inaccurate early in the crawl.

A more sophisticated but similar technique is also used in Page Rank.

Page 10

Importance Metrics III

Location Driven

IL(P) is a function of the page's location, not its contents. If URL u leads to P, then IL(P) is a function of u.

This is a way to evaluate the location of the page and, through it, its importance.

Another cue used is the number of slashes that appear in the address: URLs with fewer slashes are considered more useful.

FINALLY: IC(P) = k1 * IS(P) + k2 * IB(P) + k3 * IL(P)
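A toy sketch of how the three metrics could be combined; the weights k1, k2, k3 and the 1/(1+slashes) location score are illustrative assumptions, not values from the paper:

```python
def location_score(url):
    """IL(P) stand-in: URLs with fewer slashes after the host score higher."""
    path = url.split("://", 1)[-1]
    return 1.0 / (1.0 + path.count("/"))

def combined_importance(is_p, ib_p, url, k1=0.5, k2=0.3, k3=0.2):
    """IC(P) = k1*IS(P) + k2*IB(P) + k3*IL(P)."""
    return k1 * is_p + k2 * ib_p + k3 * location_score(url)

# is_p and ib_p would come from the similarity and backlink estimates above
print(combined_importance(is_p=0.8, ib_p=0.4, url="http://example.com/a/b.html"))
```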

Page 11

Crawler Models I

Now, for a given importance metric, we need a quality metric to evaluate how well the crawler guesses important pages.

Crawl and Stop

Starts with an initial page P0 and stops after K pages. K is fixed: it is the number of pages downloaded in one crawl.

A perfect crawler would have visited the pages R1…RK, ordered according to the importance metric (the K "hot" pages). BUT a real crawler visits only M ≤ K of these ordered pages.

So the performance of the Crawler C is PCS(C) = M*100/K

A crawler that visits pages at random would have a performance of K*100/T, where T is the number of pages in the entire Web: each page visited is a hot page with probability K/T, so the expected number of desired pages when the crawler stops is K²/T.
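A small sketch of the crawl-and-stop performance measure; the page ids and the example crawl are made up for illustration:

```python
def crawl_and_stop_performance(crawled, hot_pages):
    """P_CS(C) = M*100/K: the share of the K hottest pages that the crawl actually visited."""
    k = len(hot_pages)
    m = len(set(crawled) & set(hot_pages))
    return 100.0 * m / k

# a crawl of 4 pages that found 3 of the 4 hot pages scores 75.0
print(crawl_and_stop_performance(["a", "b", "c", "x"], ["a", "b", "c", "d"]))
```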

Page 12

Crawler Models II

Crawl and Stop with Threshold

In this technique there is an importance target G, and only pages with importance higher than G are considered. Let us assume that their number is H.

The performance PST(C) is the percentage of the H hot pages that have been visited when the crawler stops.

If K < H, then the ideal crawler's performance is K*100/H

If K ≥ H, then the ideal crawler achieves 100%

A random crawler is expected to visit (H/T)*K hot pages when it stops. Thus its performance is K*100/T

Page 13

Ordering Metrics

The Crawler selects the next URL from the queue according to this metric. The ordering metric can only use information already seen by the crawler, and it should be designed with an importance metric in mind.

For example, if the crawler searches for high-popularity pages, then the ordering metric is IB'(P). Location metrics can also be used.

It is hard to devise the ordering metric from the similarity metric, since we have not seen P yet.

Page 14

Page Refresh

After downloading, the Crawler has to refresh pages periodically.

Two strategies:

Uniform Refresh Policy: Revisits all pages at the same frequency f, regardless of how often they change.

Proportional Refresh Policy: Assume λi is the change frequency of ei and fi is the crawler's revisiting frequency of ei. Then the frequency ratio λi/fi is the same for every i.
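A small sketch contrasting the two policies, assuming a crawler with a fixed refresh budget per day; the budget value and page change rates below are invented for illustration:

```python
def uniform_frequencies(change_rates, budget):
    """Uniform policy: every page gets the same revisit frequency f."""
    f = budget / len(change_rates)
    return [f] * len(change_rates)

def proportional_frequencies(change_rates, budget):
    """Proportional policy: f_i is proportional to lambda_i, so lambda_i / f_i is constant."""
    total = sum(change_rates)
    return [budget * lam / total for lam in change_rates]

# two pages changing 9 times and once per day, 10 refreshes per day available
print(uniform_frequencies([9, 1], budget=10))       # [5.0, 5.0]
print(proportional_frequencies([9, 1], budget=10))  # [9.0, 1.0]
```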

Page 15

Freshness and Age Metrics
Some definitions

Freshness of local page ei at time t.

Freshness of the local collection S at time t.

Age of local page ei at time t.

Age of the local collection.

We define the time average of the freshness of ei and of S.

The time average of age is defined similarly. All the above are approximations.
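The formulas behind these definitions were shown as images on the original slide; a rough reconstruction, following the definitions used in the underlying paper, is:

```latex
F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}
\qquad
F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)

A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - \text{(time of last change of } e_i) & \text{otherwise} \end{cases}
\qquad
A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)

\bar{F}(e_i) = \lim_{t \to \infty} \frac{1}{t} \int_{0}^{t} F(e_i; t)\, dt
```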

Page 16

Refresh Strategy I

Note that crawlers can download/update only a limited number of pages within a period, because they have limited resources.

Consider a simple example: a collection of 2 pages e1 and e2. e1 changes 9 times per day and e2 once a day. For e1, a day is split into 9 intervals and e1 changes once and only once in each interval, but we do not know precisely when. e2 changes once and only once each day, but again we do not know precisely when.

Assume that our crawler can refresh one page per day. But which page? If e2 changes in the middle of the day and we refresh right after, e2 will be up-to-date for the remaining half of the day. The probability that the change happens before the middle of the day is 1/2, so the expected benefit is 1/4 of a day, and so on.
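Written out, the expected freshness benefit of refreshing e2 right after midday is:

```latex
E[\text{benefit of refreshing } e_2] = \Pr[\text{change before midday}] \times \tfrac{1}{2}\,\text{day} = \tfrac{1}{2} \times \tfrac{1}{2} = \tfrac{1}{4}\,\text{day}
```

Carrying the same reasoning through for e1, which changes in every 1/9-day interval, gives a much smaller expected benefit; this is the direction the example takes to argue that the slowly changing e2 is the better page to refresh.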

Page 17

Refresh Strategy II

It can be mathematically proved that the uniform refresh policy is always superior or equal to the proportional one, for any number of pages, change frequencies and refresh rates, and for both the freshness and the age metric.

Best solution: assume that pages change following a Poisson process and that their change frequency is static.

The mathematical proof and the idea behind the above statement are described in "Cho, Garcia-Molina, Synchronizing a database to improve freshness, International Conference on Management of Data, 2000".

Page 18

Storage

The page repository must manage a large collection of web pages. There are 4 challenges.

Scalability. It must be possible to distribute the repository across a cluster of computers and disks, to cope with the size of the Web.
Dual access modes. Random access is used to quickly retrieve a specific web page; streaming access is used to receive the entire collection. The first is used by the Query Engine and the second by the Indexer and Analysis modules.
Large bulk updates. As new versions of pages are stored, the space occupied by the old ones must be reclaimed through compaction and reorganization.
Obsolete pages. A mechanism for detecting and removing obsolete pages is needed.

Page 19

Page Distribution Policies

Assumption: The repository is designed to function over a cluster of interconnected storage nodes.

Uniform distribution. A page can be stored at any node, independently of its identifier.

Hash distribution. The page id is hashed to yield a node id, and the page is stored at the corresponding node.
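A minimal sketch of the hash distribution policy; the choice of MD5 and of 8 nodes is illustrative, the paper does not prescribe a particular hash function:

```python
import hashlib

def node_for_page(page_id, num_nodes):
    """Hash distribution: hash the page identifier and map it onto one of the storage nodes."""
    digest = hashlib.md5(page_id.encode()).hexdigest()
    return int(digest, 16) % num_nodes

print(node_for_page("http://example.com/index.html", num_nodes=8))
```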

Page 20

Physical Page Organization Methods

Within a node, there are 3 possible operations: page addition/insertion, high-speed streaming, and random page access.

Methods

Hash-based

Log-structured

Hashed-log

Page 21

Update Strategies

Batch Mode or Steady Crawler

A batch-mode crawler is a periodic crawler that crawls for a certain amount of time; the repository then receives updates only on certain days of the month. In contrast, a steady crawler crawls without any pause and updates the repository continuously.

Partial or Complete crawls

Depending on the crawl, the update can be:

In place, that is, the pages are directly integrated into the repository's existing collection, possibly replacing older versions.

Shadowing, that is, the pages are stored separately and the update is done in a separate step.

Page 22

The Stanford WebBase repository

It is a distributed storage system that works with the Stanford WebCrawler.

The repository employs a node manager to monitor the nodes and collect status information.

Since the Stanford crawler is a batch crawler, the repository applies a shadowing technique.

The URLs are first normalized to yield a canonical representation. The page id is computed as a signature of this normalized URL.
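A rough sketch of this step; the normalization rules and the SHA-1 signature below are plausible assumptions for illustration, not the exact rules used by WebBase:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Hypothetical canonicalization: lower-case scheme and host, drop default port and fragment."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    netloc = host if parts.port in (None, 80) else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme.lower(), netloc, parts.path or "/", parts.query, ""))

def page_id(url):
    """Page id computed as a signature (hash) of the normalized URL."""
    return hashlib.sha1(normalize(url).encode()).hexdigest()

print(page_id("HTTP://Example.COM:80/index.html#top"))
```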

Page 23

Indexing

Structure (or link) index
The Web is modeled as a graph: the nodes are pages and the edges are hyperlinks from one page to another. The index uses neighborhood information: given a page P, retrieve the pages pointed to by P or the pages pointing to P.

Text (or content) index
Text-based retrieval continues to be the primary method for identifying pages relevant to a query. Indices to support this retrieval can be implemented with suffix arrays, inverted files, inverted indices and signature files.

Utility indices
Special indices, such as site indices for searching within one domain only.

Page 24

WebBase text-indexing system I

3 types of nodes:
Distributors, which store the pages to be indexed
Indexers, which execute the core of the index-building engine
Query servers, across which the final inverted index is partitioned

The inverted index is built in 2 stages:
Each distributor runs a process that disseminates the pages to the indexers; the page subsets sent to the indexers are mutually disjoint. The indexers extract postings, sort them and flush them to intermediate structures on disk.
These intermediate structures are merged to create an inverted file and its lexicon. These (inverted file, lexicon) pairs are then transferred to the query servers.

Page 25

WebBase text-indexing system II

The core of the indexing is the index-builder process. This process can be parallelized into 3 phases: Loading, Processing and Flushing.

Loading: pages are read and stored in memory.
Processing: pages are parsed and stored as a set of postings in a memory buffer. The postings are then sorted, first by term and then by location.
Flushing: the sorted postings are saved to disk as a sorted run.
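A toy sketch of one index-builder pass over a handful of pages; the posting layout and file name are invented for illustration:

```python
from collections import namedtuple

Posting = namedtuple("Posting", "term page_id location")

def process(pages):
    """Processing phase: parse pages into postings, then sort by term and by location."""
    postings = []
    for page_id, text in pages:                      # the loading phase would read these from disk
        for location, term in enumerate(text.lower().split()):
            postings.append(Posting(term, page_id, location))
    return sorted(postings, key=lambda p: (p.term, p.page_id, p.location))

def flush(postings, run_path):
    """Flushing phase: write the sorted postings to disk as one sorted run."""
    with open(run_path, "w") as f:
        for p in postings:
            f.write(f"{p.term}\t{p.page_id}\t{p.location}\n")

flush(process([(1, "web search engines"), (2, "search the web")]), "run_0.txt")
```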

Page 26

WebBase Indexing System Statistics I

One of the most commonly used statistics is idf. The idf of a term w is log(N/dfw), where N is the total number of pages in the collection and dfw is the number of pages that contain at least one occurrence of w.

To avoid the query-time overhead, WebBase computes and stores statistics as part of index creation.
Avoiding explicit I/O for statistics: To avoid additional I/O, the local data are sent to the statistician only when they are already available in memory. 2 strategies:
ME, FL: send the local information during merging or during flushing.
Local aggregation: multiple postings for a term pass through memory in groups, e.g. 1000 postings for "cat"; the pair ("cat", 1000) can be sent to the statistician.
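A small sketch of the idf statistic itself, computed from per-page term sets (the function and variable names are illustrative):

```python
import math
from collections import Counter

def idf_table(page_terms, total_pages):
    """idf(w) = log(N / df_w), with df_w counted once per page that contains w."""
    df = Counter()
    for terms in page_terms:
        df.update(set(terms))            # each page contributes at most once per term
    return {w: math.log(total_pages / df[w]) for w in df}

print(idf_table([{"cat", "web"}, {"web"}], total_pages=2))   # "cat": log 2, "web": log 1 = 0
```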

Page 27

Page Rank I

Page Rank extends the basic idea of citation counting by taking into consideration the importance of the pages pointing to a given page. Thus a page receives more importance if YAHOO points to it than if an unknown page does. Note that the definition of Page Rank is recursive.

Simple Page Rank

Let 1…m be the pages of the web, N(i) the number of outgoing links from page i, and B(i) the set of pages that point to i. Then we have
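The formula itself appears as an image on the original slide; reconstructed from these definitions, the simple Page Rank is:

```latex
r(i) = \sum_{j \in B(i)} \frac{r(j)}{N(j)}
```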

The above definition leads to the idea of random walks, called the Random Surfer Model. It can be proved that the Page Rank of a page is proportional to the frequency with which a random surfer would visit it.

Page 28

Page Rank II

Practical Page Rank

Simple Page Rank is well defined only if the graph is strongly connected, which is not the case for the Web. A rank sink is a connected cluster of pages that has no outgoing links. A rank leak is a single page with no outgoing links.

Thus there are two solutions: removal of all the leak nodes with out-degree 0, and introduction of a decay factor d to solve the problem of sinks. So the modified Page Rank is
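The modified formula is again an image on the original slide; reconstructed from the decay-factor description, it reads:

```latex
r(i) = \frac{d}{m} + (1 - d) \sum_{j \in B(i)} \frac{r(j)}{N(j)}
```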

where m is the number of nodes in the graph.
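A compact iterative sketch of this modified Page Rank; the decay value d = 0.15, the fixed iteration count and the tiny example graph are illustrative assumptions:

```python
def pagerank(links, d=0.15, iterations=50):
    """Modified Page Rank with decay factor d; links maps each page to the pages it points to.
    Leak nodes (out-degree 0) are assumed to have been removed beforehand, as the slide suggests."""
    pages = list(links)
    m = len(pages)
    rank = {p: 1.0 / m for p in pages}
    for _ in range(iterations):
        new = {p: d / m for p in pages}
        for j, outgoing in links.items():
            share = (1 - d) * rank[j] / len(outgoing)
            for i in outgoing:
                new[i] += share
        rank = new
    return rank

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```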

Page 29

HITS I

Link-based search algorithm: Hypertext Induced Topic Search (HITS)

Instead of producing a single ranking score, HITS produces two: an Authority score and a Hub score. Authority pages are those most likely to be relevant to a query; Hub pages are not necessarily authorities themselves, but point to several of them.

The HITS algorithm

The basic idea is to identify a small subgraph of the web and apply link analysis to it in order to locate the Authorities and the Hubs for a given query.

Page 30

HITS II

Identifying the focused subgraph

Link Analysis

Two kinds of operations are applied in each step, I and O.
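The two operations appear only as figures in the original slides; in the standard HITS formulation they are the update rules below, where E is the edge set of the focused subgraph, a_p the authority score and h_p the hub score of page p:

```latex
I: \quad a_p = \sum_{q : (q, p) \in E} h_q
\qquad\qquad
O: \quad h_p = \sum_{q : (p, q) \in E} a_q
```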

Page 31

HITS III

The algorithm iteratively repeats the I and O steps, with normalization, until the hub and authority scores converge.
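A small sketch of that iteration; the fixed iteration count (rather than an explicit convergence test) and the toy graph are simplifications:

```python
import math

def hits(links, iterations=50):
    """HITS sketch: alternate I and O steps over the focused subgraph, normalizing each time.
    links maps each page to the pages it points to."""
    pages = list(links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # I step: authority of p = sum of hub scores of the pages pointing to p
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # O step: hub of p = sum of authority scores of the pages p points to
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        # normalize both score vectors
        for vec in (auth, hub):
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            for p in vec:
                vec[p] /= norm
    return auth, hub

print(hits({"a": ["b", "c"], "b": ["c"], "c": []}))
```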

Page 32

Other Link Based Techniques

Identifying Communities

An interesting problem is to identify communities in the web.

See refs [30] and [40]

Finding Related Pages

Companion and Cocitation algorithms.

See refs [22], [32] and [38]

Classification and Resource Compilation

The problem of automatically classifying documents.

See refs [13], [14] and [15]

Page 33