
Page 1: Part II - Basic Techniques: Search engine architecture Web crawling basics: following links, crawl courtesy,.. Storage Text indexing Querying and term-based

Part II - Basic Techniques:

• Search engine architecture
• Web crawling basics: following links, crawl courtesy, ...
• Storage
• Text indexing
• Querying and term-based ranking
• Basic link-based ranking

Page 2:

1 - Search Engine Architecture:

[Diagram: the Crawler fetches pages to disks; indexing builds the Index; a query (“computer”) at Search.com triggers a look-up in the Index]

Page 3:

Crawling

[Diagram: the Crawler writes fetched pages to disks]

• fetches pages from the web
• starts at a set of “seed pages”
• parses fetched pages for hyperlinks
• then follows those links (e.g., BFS)
• variations:
  - recrawling
  - focused crawling
  - random walks

Page 4:

Indexing

[Diagram: pages on disks feed the indexing process, which produces an “inverted index”:]

aardvark   3452, 11437, …
…
arm        4, 19, 29, 98, 143, …
armada     145, 457, 789, …
armadillo  678, 2134, 3970, …
armani     90, 256, 372, 511, …
…
zebra      602, 1189, 3209, …

• parse, build the lexicon, and build the index
• the index is very large: I/O-efficient techniques needed

Page 5:

Querying

Boolean queries: (zebra AND armadillo) OR armani

compute unions/intersections of lists

Ranked queries: zebra, armadillo

give scores to all docs in union

look up

aardvark   3452, 11437, …
…
arm        4, 19, 29, 98, 143, …
armada     145, 457, 789, …
armadillo  678, 2134, 3970, …
armani     90, 256, 372, 511, …
…
zebra      602, 1189, 3209, …
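The union/intersection step can be sketched as a linear merge of sorted docID lists (a minimal sketch; the postings below are the toy lists from the slide, truncated):

```python
def intersect(a, b):
    """Linear merge of two sorted docID lists (AND query)."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """Merge two sorted docID lists, dropping duplicates (OR query)."""
    return sorted(set(a) | set(b))

# (zebra AND armadillo) OR armani:
zebra     = [602, 1189, 3209]
armadillo = [678, 2134, 3970]
armani    = [90, 256, 372, 511]
result = union(intersect(zebra, armadillo), armani)
```

Real engines use exactly this kind of merge, but over compressed on-disk lists with skip structures rather than Python lists.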

Page 6:

Ranked Querying:
• return the best pages first
• term- vs. link-based approaches
• also add meaningful “snippets”

Page 8:

Engine Architecture: (Poly prototype)

[Architecture diagram: Data Acquisition components (GenericCrawler, BFS-Crawler, FocusedCrawler) fetch pages from the WWW; Scalable Server Components (StorageServer, IndexServer, GraphServer) store and index the data; User Interfaces (UserInterface, UserTools, AdminInterface) handle queries]

Page 9:

Hardware architecture: (e.g., Inktomi)

[Diagram: Sun Ultras with several disks each, connected by a high-speed LAN or SAN]

• network of workstations/servers (Sun or Linux, Myrinet SAN)
• BASE vs. ACID (Basically Available, Soft-state, Eventual consistency)
• data and index partitioned over the machines
• each node responsible for part of the web (horizontal partitioning)

Page 10:

Cluster-Based Architectures:

• clusters of more than 100,000 Linux servers (Google)

• > 4 billion web pages and 10 million web sites

• need to crawl, store, and process terabytes of data

• 10000 queries / second (Google)

• “giant-scale” or “planetary-scale” web service (Google, Hotmail, Yahoo, ...)

• proprietary code and secret recipes

Major search engines are based on scalable clusters of low-cost servers connected by LANs

Page 11:

Page 12:

Query Processing in Cluster-Based Engines

[Diagram: a query integrator connected by a LAN to several nodes, each holding a subset of the pages plus a local index over them; the integrator broadcasts each query and combines the results — a cluster with local index organization]

• local index: every node stores and indexes a subset of the pages
• every query is broadcast to all nodes by the query integrator (QI)
• every node supplies its top-10, and the QI computes the global top-10
• note: we don’t really need the top-10 from all nodes, maybe only 2 (does that help?)

• low-cost cluster architecture (usually with additional replication)
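The QI’s combining step can be sketched as a merge of per-node top-k lists (a sketch; the node results below are made up for illustration):

```python
import heapq

def global_top_k(per_node_results, k=10):
    """Query integrator: merge each node's local top-k
    (lists of (score, docid) pairs) into a global top-k."""
    merged = [hit for node in per_node_results for hit in node]
    return heapq.nlargest(k, merged)  # highest scores first

# hypothetical local top lists from three nodes:
node_results = [
    [(0.9, "d12"), (0.5, "d7")],
    [(0.8, "d40"), (0.6, "d33")],
    [(0.7, "d2")],
]
top3 = global_top_k(node_results, k=3)  # [(0.9, 'd12'), (0.8, 'd40'), (0.7, 'd2')]
```

In practice the nodes stream their results back over the LAN; the merge itself is trivially cheap compared to the per-node list traversals.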

Page 13:

2 - Crawling the Web:

• Basic idea:
  - start at a set of known URLs
  - explore the web in “concentric circles” around these URLs

[Diagram: start pages, surrounded by distance-one pages, surrounded by distance-two pages]

Page 14:

Simple Breadth-First Search Crawler:

This will eventually download all pages reachable from the start set (we also need to remember which pages have already been downloaded):

insert set of initial URLs into a queue Q
while Q is not empty:
    currentURL = dequeue(Q)
    download page from currentURL
    for each hyperlink found in the page:
        if the hyperlink points to a new page:
            enqueue the hyperlink URL into Q
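The pseudocode above translates into a minimal Python sketch using only the standard library (link extraction via `html.parser`; a `seen` set remembers already-enqueued pages; `max_pages` is an illustrative safety limit, not part of the slide):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def bfs_crawl(seeds, max_pages=100):
    """Breadth-first crawl starting from the seed URLs."""
    queue = deque(seeds)
    seen = set(seeds)            # pages already enqueued
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue             # skip unreachable pages
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

A real crawler would add the courtesy and robustness machinery discussed on the following slides (robots.txt, per-site delays, MIME-type filtering, trap detection).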

Page 15:

Traversal strategies: (why BFS?)

• the crawl will quickly spread all over the web
• load-balancing between servers
• in reality, more refined strategies (but still BFS-ish)

• many other strategies (focused crawls, recrawls, site crawls)

Tools/languages for implementation:

• scripting languages (Python, Perl)
• Java (performance tuning tricky)
• C/C++ with sockets (low-level)
• available crawling tools (usually not completely scalable)

Page 16:

Details: (lots of ‘em)

• handling file types (exclude some extensions, and use MIME types)

• URL extensions and CGI scripts (to strip or not to strip? ignore?)

• frames, imagemaps, base tags
• black holes (robot traps, spam bots) (limit the maximum depth of a site)

• different names for the same site (could check IP addresses, but there is no perfect solution)

• duplicates, mirrors

(see this paper for details)

Performance considerations: later!

Page 17:

Robot Exclusion Protocol

• file robots.txt in the root directory
• allows the webmaster to “exclude” crawlers (crawlers do not have to obey)

• may exclude only certain robots or certain parts of the site
  - to “protect proprietary data” (e.g., the eBay case)
  - to prevent crawlers from getting lost
  - to avoid load due to crawling
  - to avoid crashes (protect the CGI bin)

• if at all possible, follow robot exclusion protocol!

(see Web Robots Pages)

Page 18:

Robot exclusion - example:
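A typical robots.txt (all paths and agent names here are made up for illustration) looks like:

```
# exclude all robots from the CGI bin and a private area
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/

# exclude one specific robot from the entire site
User-agent: BadBot
Disallow: /
```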

Page 19:

Robot exclusion - example:

Page 20:

Robot META Tags

• allow page owners to restrict access to pages

• does not require access to root directory

• excludes all robots

• not yet supported by all crawlers

• “noindex” and “nofollow”

(see Web Robots Pages)
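The tag goes into a page’s `<head>`; the variant that excludes all robots from both indexing the page and following its links is:

```html
<meta name="robots" content="noindex,nofollow">
```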

Page 21:

Crawling courtesy

• minimize load on crawled server

• no more than one outstanding request per site

• better: wait 30 seconds between accesses to site (this number is not fixed)

• problems:
  - one server may host many sites (use domain-based load-balancing)
  - one site may have many pages (at 30 s per page, a 3-million-page site takes about 3 years to crawl)
  - intervals between requests should depend on the site

• give contact info for large crawls (email or URL)

• expect to be contacted ...
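The per-site waiting rule can be sketched as a small rate limiter the crawler consults before each request (a sketch; the class name and the 30-second default are illustrative, not from the slide):

```python
import time
from urllib.parse import urlparse

class CourtesyLimiter:
    """Tracks the last access time per host and enforces a minimum delay."""
    def __init__(self, min_interval=30.0):
        self.min_interval = min_interval
        self.last_access = {}          # host -> timestamp of last request

    def wait_time(self, url, now=None):
        """Seconds the crawler must still wait before hitting this host."""
        host = urlparse(url).netloc
        now = time.monotonic() if now is None else now
        last = self.last_access.get(host)
        if last is None:
            return 0.0                 # never visited: go ahead
        return max(0.0, self.min_interval - (now - last))

    def record(self, url, now=None):
        """Remember that we just requested something from this host."""
        host = urlparse(url).netloc
        self.last_access[host] = time.monotonic() if now is None else now
```

A production crawler keys this on the server (or IP) rather than the hostname, and adapts the interval to the site’s observed response times.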

Page 22:

Crawling challenges

• crawler may have to run for several weeks or months

• will interact with millions of web servers

• some of them will be odd:
  - noncompliant server responses
  - unfamiliarity with the robot exclusion protocol
  - robot traps
  - CGI and unintended consequences
  - network security tools
  - weird webmasters

• unclear legal situation

Page 23:

3 - Storage:

• average HTML page size: ~15 KB (plus 20-30 KB of images)
• 4 billion pages = 60 TB of HTML
• compression with gzip/zlib: 15 TB (3-4 KB per page)
• or about 3 KB of text per page after stripping tags (according to the Stanford WebBase group)
• 1-2 KB per page if stripping and compressing
• 1-3 KB compressed index size per page (depends on whether we store positions in documents)
• 4-12 TB index size for 4 billion pages
• page and index compression are important
• Google: index in memory! (thousands of 2-4 GB nodes)

Page 24:

Low cost storage:

[Diagram: nodes with several disks each, connected by a LAN or SAN]

• Linux PCs connected by Ethernet or Myrinet SAN (system-area network)
• 2-8 disks per node (160 GB IDE for $90)
• Stanford WebBase, Internet Archive (and here at Poly)
• parallel processing (active/intelligent disks paradigm?)
• separate data and index, or not?

Page 25:

Storage system options: (for pages)

• store pages in a standard DBMS (Oracle, DB2, mySQL)
• use the file system
  - many pages per file (due to file system limits and bottlenecks)
  - done by the Internet Archive
• use a specialized storage system
  - hash-partitioned: Stanford WebBase, Berkeley DDS
  - range-partitioned: Polytechnic (Alex Okulov 2002)
  - option: use Berkeley DB or Wisconsin Shore as the storage manager on the nodes
• operations: write, read, and scan a range of pages

Page 26:

4 - Indexing

[Diagram: disks with pages feed the indexing process, which produces the inverted index:]

aardvark   3452, 11437, …
…
arm        4, 19, 29, 98, 143, …
armada     145, 457, 789, …
armadillo  678, 2134, 3970, …
armani     90, 256, 372, 511, …
…
zebra      602, 1189, 3209, …

• how to build an index
  - in an I/O-efficient manner
  - in place (no extra space)
  - in parallel (later)
• closely related to I/O-efficient sorting
• how to compress an index (while building it in place)
• goal: intermediate size not much larger than the final size

Page 27:

Basic concepts and choices:

• lexicon: the set of all “words” encountered (millions in the case of the web, mostly non-words)
• for each word occurrence: store the index of the document where it occurs
• also store the position in the document? (probably yes)
  - increases index space significantly!
  - allows efficient search for phrases
  - relative positions of words may be important for ranking
• also store additional context? (in title, bold, in anchor text)
• stop words: common words such as “is”, “a”, “the”
• ignore stop words? (maybe better not)
  - saves space in the index
  - but then we cannot search for “to be or not to be”
  - performance more important than space!
• stemming: “runs = run = running” (depends on language)

Page 28:

Indexing: (simplified approach)

(1) scan through all documents

(2) for every word encountered, generate an entry (word, doc#, pos)

(3) sort the entries by (word, doc#, pos)

(4) now transform into the final form

(see Witten/Moffat/Bell for details)

doc1: “Bob reads a book”
doc2: “Alice likes Bob”
doc3: “book”

entries generated in step (2):
bob, 1, 1   reads, 1, 2   a, 1, 3   book, 1, 4
alice, 2, 1   likes, 2, 2   bob, 2, 3   book, 3, 1

after sorting in step (3):
a, 1, 3   alice, 2, 1   bob, 1, 1   bob, 2, 3
book, 1, 4   book, 3, 1   likes, 2, 2   reads, 1, 2

final form:
a: (1, 3)
alice: (2, 1)
bob: (1, 1), (2, 3)
book: (1, 4), (3, 1)
likes: (2, 2)
reads: (1, 2)
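Steps (1)-(4) can be sketched in a few lines; an in-memory sort stands in here for the I/O-efficient external sort a real indexer would need:

```python
def build_index(docs):
    """docs: list of document strings; doc numbers start at 1."""
    # (1)+(2): scan the documents, generating (word, doc#, pos) entries
    entries = []
    for docnum, text in enumerate(docs, start=1):
        for pos, word in enumerate(text.lower().split(), start=1):
            entries.append((word, docnum, pos))
    # (3): sort the entries by (word, doc#, pos)
    entries.sort()
    # (4): transform into the final inverted-list form
    index = {}
    for word, docnum, pos in entries:
        index.setdefault(word, []).append((docnum, pos))
    return index

docs = ["Bob reads a book", "Alice likes Bob", "book"]
index = build_index(docs)
# index["bob"] == [(1, 1), (2, 3)], index["book"] == [(1, 4), (3, 1)]
```

The tokenizer here is deliberately crude (`lower().split()`); real indexers also handle punctuation, stemming, and stop words as discussed on the previous slide.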

Page 29:

A Naive Approach (does not work)

a) create an empty dynamic dictionary data structure (e.g., a hash table) in main memory;

b) scan through all the documents; for every word encountered:

   i. create an entry in the dictionary if the word does not exist;

   ii. insert (doc id, pos) into the inverted list corresponding to the word;

c) traverse the dictionary and dump the inverted index to disk.

(This fails at web scale: the dictionary and inverted lists far exceed main memory.)

doc1: “Bob reads a book”
doc2: “Alice likes Bob”
doc3: “book”

in-memory dictionary after the scan:
bob   → (1, 1), (2, 3)
reads → (1, 2)
a     → (1, 3)
book  → (1, 4), (3, 1)
alice → (2, 1)
likes → (2, 2)

Page 30:

Improvements

• encode sorted runs by their gaps: significant compression for frequent words!

• less effective if we also store positions (adds incompressible lower-order bits)

• many highly optimized schemes have been studied (see book)

• but there is a trade-off with CPU overhead

original lists:
arm        4, 19, 29, 98, 143, …
armada     145, 457, 789, …
armadillo  678, 2134, 3970, …
armani     90, 256, 372, 511, …

gap-encoded lists:
arm        4, 15, 10, 69, 45, …
armada     145, 312, 332, …
armadillo  678, 1456, 1836, …
armani     90, 166, 116, 139, …
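Gap encoding pairs naturally with a variable-byte code: small gaps (frequent words) then take a single byte each. A sketch (one common varbyte convention; not the specific scheme the slides have in mind):

```python
def gaps(postings):
    """Sorted docIDs -> gap list (the first value is kept as-is)."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def vbyte_encode(numbers):
    """Variable-byte code: 7 data bits per byte; high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunk[0] |= 0x80                  # flag the final (lowest-order) byte
        out.extend(reversed(chunk))
    return bytes(out)

def vbyte_decode(data):
    out, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:                   # last byte of this number
            out.append(n)
            n = 0
    return out

arm = [4, 19, 29, 98, 143]
encoded = vbyte_encode(gaps(arm))         # every gap fits in one byte here
```

This illustrates the CPU/space trade-off mentioned above: varbyte decodes very fast, while bit-level codes (Golomb, gamma) compress better.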

Page 31:

Additional issues:

• keep data compressed during index construction
• try to keep the index in main memory? (AltaVista?, Google)
• keep important parts in memory?
• use a database to store the lists? (e.g., Berkeley DB): use BLOBs for the compressed lists; rely on the DB for caching
• or use the text indexes provided by databases?

Alternatives to the inverted index:
• signature files: false positives
• bitmaps
• better to stick with inverted files!

Page 32:

Some indexing numbers: (Long/Suel 2002)

• 140 million pages, 1.8 TB
• 7 nodes: 800 MHz P-III with 512 MB RAM and 2×80 GB disks
• 130 GB uncompressed, 35 GB compressed per disk
• build one index structure per disk
• indexing performance: 4 MB/s per node (not the best possible)
• 9 hours per disk, 18 hours for a parallel index run
• index size: 1.6 KB per page = 12% of the original size (including positions in documents)

Page 33:

Partitioning Inverted Indexes

• more than 4 billion pages indexed by Google
• the index could be 4-12 TB
• must be partitioned onto many nodes, even if on disk
• horizontal vs. vertical partitioning

doc1: “Bob reads a book”
doc2: “Alice likes Bob”
doc3: “book”

horizontal partitioning:
index 1 (doc1):
  a: {(1, 3)}
  bob: {(1, 1)}
  book: {(1, 4)}
  reads: {(1, 2)}
index 2 (doc2 and doc3):
  alice: {(2, 1)}
  bob: {(2, 3)}
  book: {(3, 1)}
  likes: {(2, 2)}

vertical partitioning:
index 1:
  a: {(1, 3)}
  alice: {(2, 1)}
  bob: {(1, 1), (2, 3)}
index 2:
  book: {(1, 4), (3, 1)}
  likes: {(2, 2)}
  reads: {(1, 2)}

• performance trade-off: CPU vs. network
• all major engines use horizontal partitioning

Page 34:

Updating index structures:

• we have only discussed bulk-building of indexes
• updates can be challenging:
  - assume you are adding one new document (new page)
  - the document consists of 500 words
  - this requires 500 insertions into the on-disk index!
• many indexers do not support updates (efficiently)
• solutions:
  - semi-dynamic solution: build a separate index, and merge
  - buffer insertions in memory
  - exploit the Zipf distribution of word occurrences
  - or buy lots of fast disks …
• need to decide whether update performance is important

Page 35:

5 – Querying and Ranking

Boolean queries: (zebra AND armadillo) OR armani

compute unions/intersections of lists

Ranked queries: zebra, armadillo

give scores to all docs in union

look up

aardvark   3452, 11437, …
…
arm        4, 19, 29, 98, 143, …
armada     145, 457, 789, …
armadillo  678, 2134, 3970, …
armani     90, 256, 372, 511, …
…
zebra      602, 1189, 3209, …

Page 36:

Boolean queries vs. ranking

• most web queries involve one or two common words; Boolean querying returns thousands of hits

• we would like to rank the results by …
  - importance?
  - relevance?
  - accuracy?

• in general, arbitrary score function: “return pages with highest score relative to query”

• use the inverted index as the access path for pages
  - start with a (possibly expanded) Boolean query
  - only rank the Boolean results
  - in fact, try to avoid computing the complete Boolean results (pruning methods, later)

Page 37:

Ranking in search engines:

• scoring function: assigns a score to each document with respect to a given query
• top-k queries: return the k documents with the highest scores
• example: the cosine measure for a query with terms t_0 to t_(m-1)
• can be implemented by computing the score for all documents that contain any of the query words (union of the inverted lists)
• in the case of search engines: often an intersection instead of a union
• in large collections, the lists are many MB long for average queries
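In the notation of Witten/Moffat/Bell, a common textbook form of the cosine measure (not necessarily the exact variant shown on the original slide) is:

```latex
\cos(q,d) \;=\; \frac{\sum_{i=0}^{m-1} w_{q,t_i}\, w_{d,t_i}}{|q|\,|d|},
\qquad w_{d,t} \;=\; f_{d,t}\cdot \log\frac{N}{f_t}
```

where f_{d,t} is the frequency of term t in document d, f_t the number of documents containing t, N the collection size, and |q|, |d| the Euclidean norms of the weight vectors.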

Page 38:

Ranking continued:

• vast amount of vector space work in IR (see Witten/Moffat/Bell and Baeza-Yates/Ribeiro-Neto for intro & pointers)

• not all results directly applicable to search engines

• additional factors in ranking:

- distance between terms in the text
- titles, headings, and font size
- use of meta tags?
- user feedback or browsing behavior?
- link structure!

• efficiency extremely important! (Google: 10000 queries/sec)

Page 39:

6 - Link-Based Ranking Techniques

• Basic idea: exploit the judgments embodied in millions of web pages (their links)

• a page that is highly referenced is often better or more important

• Pagerank (Brin & Page / Google): “the significance of a page depends on the significance of those referencing it”

• HITS (Kleinberg / IBM): “hubs and authorities”

manipulation is a huge problem!

Page 40:

Pagerank

• initialize the rank value of each node to 1/n (0.2 for 5 nodes)

• a node with k outgoing links transmits a 1/k fraction of its current rank value over each outgoing edge to its neighbor

• iterate this process many times until it converges

• NOTE: this is a random walk on the link graph

• Pagerank: the stationary distribution of this random walk

[Diagram: a 5-node link graph; every node starts with rank value 0.2, and each edge is labeled with the fraction (1/2 or 1) of its source's rank transmitted along it]

Page 41:

[Diagram: the same 5-node graph across successive iterations; rank values per iteration:
(1) 0.2, 0.2, 0.2, 0.2, 0.2
(2) 0.1, 0.2, 0.2, 0.3, 0.2
(3) 0.1, 0.3, 0.15, 0.25, 0.2
…
(n) 0.143, 0.286, 0.143, 0.286, 0.143]
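The iteration can be sketched as a power-iteration loop. The 5-node graph below is hypothetical (the slide's figure, with its exact edges, is not reproduced), and there is no damping factor, matching the plain random walk described above:

```python
def pagerank(out_links, iterations=50):
    """Plain power iteration: each node splits its current rank equally
    among its out-links (no damping, as in the slide's random walk)."""
    nodes = list(out_links)
    rank = {v: 1.0 / len(nodes) for v in nodes}   # start at 1/n
    for _ in range(iterations):
        new = {v: 0.0 for v in nodes}
        for v, targets in out_links.items():
            share = rank[v] / len(targets)        # 1/k fraction per edge
            for t in targets:
                new[t] += share
        rank = new
    return rank

# hypothetical 5-node link graph:
graph = {
    "A": ["B"],
    "B": ["C", "D"],
    "C": ["D"],
    "D": ["A", "E"],
    "E": ["B"],
}
rank = pagerank(graph)
```

On the real web the graph has dangling nodes and disconnected pieces, which is why production Pagerank adds a damping/teleportation term to guarantee convergence.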

Page 42:

Other iterative techniques for link analysis

• “HITS” (Jon Kleinberg)

• query-dependent: first get 100 docs using term-based techniques

• build a subgraph on these nodes and their neighbors

• run an iterative process on this subgraph

• each node has a hub score and an authority score

Page 43:

Combining Term- and Link-Based Methods

• recall the cosine measure

• Pagerank assigns a fixed score to each page (independent of the query)
• naïve way of integrating the Pagerank value

• works for any global ordering of page scores (e.g., based on traffic)

• but some more details remain
• HITS would be somewhat different
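One naïve integration (a reconstruction of the omitted formula; the weight λ is my notation) is a simple linear combination of the two scores:

```latex
\mathrm{score}(q,d) \;=\; \lambda \cdot \cos(q,d) \;+\; (1-\lambda)\cdot \mathrm{pr}(d),
\qquad 0 \le \lambda \le 1
```

where pr(d) is the Pagerank value of d, normalized so that it is comparable in magnitude to the cosine scores. Any other global, query-independent page score (e.g., traffic) can be substituted for pr(d).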
