
Page 1

Web Search

Dr. Yingwu Zhu

Page 2

Overview

• History
• Search Engine Architecture
• Web Spam

Page 3

Search Engine Early History

• By late 1980s, many files were available by anonymous FTP.
• In 1990, Alan Emtage of McGill Univ. developed Archie (short for “archives”)
  – Assembled lists of files available on many FTP servers.
  – Allowed regex search of these file names.
• In 1993, Veronica and Jughead were developed to search names of text files available through Gopher servers.

Page 4

Web Search History

• In 1993, early web robots (spiders) were built to collect URLs:
  – Wanderer
  – ALIWEB (Archie-Like Index of the WEB)
  – WWW Worm (indexed URLs and titles for regex search)
• In 1994, Stanford grad students David Filo and Jerry Yang started manually collecting popular web sites into a topical hierarchy called Yahoo.

Page 5

Web Search History (cont)

• In early 1994, Brian Pinkerton developed WebCrawler as a class project at U Wash. (eventually became part of Excite and AOL).
• A few months later, Michael “Fuzzy” Mauldin, a grad student at CMU, developed Lycos. First to use a standard IR system as developed for the DARPA Tipster project. First to index a large set of pages.
• In late 1995, DEC developed AltaVista. Used a large farm of Alpha machines to quickly process large numbers of queries. Supported Boolean operators, phrases, and “reverse pointer” queries.

Page 6

Web Search Recent History

• In 1998, Larry Page and Sergey Brin, Ph.D. students at Stanford, started Google. Main advance is the use of link analysis to rank results partially based on authority.
  – PageRank

Page 7

History

1. FTP
2. FTP + Search
3. Web crawlers: URLs
4. Yahoo
5. Indexing + Search
6. Ranking: Google

Page 8

Search Landscape 2005

• Four major “Mainframes”
  – Google, Yahoo, MSN, and Ask
• >450M searches daily
  – 60% international
  – Thousands of machines
• $8+B in Paid Search Revenues
• Large indices
  – Billions of documents
  – Terabytes of data
• Excellent relevance
  – For some tasks

Source: Search Engine Watch

Page 9

Overview

• History
• Search Engine Architecture
• Web Spam

Page 10

Characteristics of Web search

• Huge amounts of text to search through
• Pages are linked
• Pages differ greatly in quality
• A single search may return many pages
  – A user will not look at all result pages
  – Result pages need to be ranked
  – Complete result set may be unnecessary

Page 11

Slide adapted from Lew & Davis

How Search Engines Work

Do you know how it works? What is the architecture?

Page 12

Slide adapted from Lew & Davis

How Search Engines Work

Three main parts:

1. Gather the contents of all web pages (using a program called a crawler or spider).
2. Organize the contents of the pages in a way that allows efficient retrieval (indexing).
3. Take in a query, determine which pages match, and show the results (ranking and display of results).

Page 13

Standard Web Search Engine Architecture

[Diagram: crawler machines crawl the web → check for duplicates, store the documents → create an inverted index (keyed by DocIds) → search engine servers]

Page 14

Standard Web Search Engine Architecture

[Diagram: crawler machines crawl the web → check for duplicates, store the documents → create an inverted index (keyed by DocIds) → search engine servers; a user query goes to the search engine servers, which show results to the user]

Page 15

Search Engine Architecture

[Diagram: WWW → crawl → snapshot → indexer → web map and metadata → web index → query serving. Quality dimensions: ranking and presentation; comprehensiveness and freshness]

Page 16

Comprehensiveness

• Problem:
  – Make accessible all useful Web pages
• Issues:
  – Web has an infinite number of pages
  – Finite resources available: bandwidth, disk capacity
• Selection problems:
  – Which pages to visit: crawl policy
  – Which pages to index: index selection policy

Page 17

Freshness

• Problem:
  – Ensure that what is indexed correctly reflects the current state of the web
• Impossible to achieve exactly
  – Revisit vs. discovery
• Divide and conquer
  – A few pages change continually
  – Most pages are relatively static

Page 18

Ranking

• Problem:
  – Given a well-formed query, place the most relevant pages in the first few positions
• Issues:
  – Scale: many candidate matches
    • Response in < 100 ms
  – Evaluation:
    • Editorial
    • User behavior

Page 19

Overview

• History
• Search Engine Architecture
  – Crawler or Spider
  – Indexing
  – Ranking
• Web Spam

Page 20

Crawler

• How does a crawler work?
• How to design a crawler?
• What needs to be considered in the design?

Page 21

Web Crawlers

• How do the web search engines get all of the items they index?
• Main idea:
  – Start with known sites
  – Record information for these sites
  – Follow the links from each site
  – Record information found at new sites
  – Repeat

Page 22

What is a Crawler?

[Diagram: init loads initial URLs into a queue of URLs to visit; the crawler repeatedly gets the next URL, fetches the page from the web, extracts its URLs into the queue, and records visited URLs and web pages]

Page 23

Web Crawling Algorithm

More precisely:
– Put a set of known sites on a queue
– Repeat the following until the queue is empty:
  • Take the first page off of the queue
  • If this page has not yet been processed:
    – Record the information found on this page (positions of words, links going out, etc.)
    – Add each link on the current page to the queue
    – Record that this page has been processed
• Rule of thumb: 1 doc per minute per crawling server
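A minimal sketch of this loop in Python, using only the standard library; the seed URLs and page limit are illustrative, and a real crawler adds politeness delays, robots.txt checks, and parallelism:

import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    queue = deque(seed_urls)       # put a set of known sites on a queue
    processed = set()
    store = {}                     # url -> page text (the recorded information)
    while queue and len(store) < max_pages:
        url = queue.popleft()      # take the first page off of the queue
        if url in processed:       # skip pages already processed
            continue
        processed.add(url)         # record that this page has been processed
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue               # servers are often down or slow
        store[url] = html          # record the information found on this page
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:  # add each link on the current page to the queue
            queue.append(urljoin(url, link))
    return store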

Page 24

Crawl Policy

• Pages found by following links
  – From an initial root set
• Basic iteration:
  – Visit pages and extract links
  – Prioritize next pages to visit (or revisit)
• Framework
  – Visit pages
    • most likely to be viewed
    • most likely to contain links to pages that will be viewed
  – Prioritization by query-independent quality

Page 25

Slide adapted from Lew & Davis

Crawler behavior varies in:

• Parts of a web page that are indexed
• How deeply a site is indexed
• Types of files indexed
• How frequently the site is spidered

Page 26

The behavior of a web crawler is the outcome of a combination of policies:

• A selection policy that states which pages to download.
• A re-visit policy that states when to check for changes to the pages.
• A politeness policy that states how to avoid overloading websites.
• A parallelization policy that states how to coordinate distributed web crawlers.

Page 27

Four Laws of Crawling

• A crawler must show identification
  – A crawler must identify itself using the User-Agent field of an HTTP request
• A crawler must obey the robots exclusion standard
  – http://www.robotstxt.org/wc/norobots.html
• A crawler must not hog resources
• A crawler must report errors
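The first two laws can be sketched with Python's standard library, whose urllib.robotparser module implements the robots exclusion standard; the crawler identity below is a made-up example:

import urllib.request
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "ExampleCrawler/1.0 (+http://example.com/bot.html)"  # hypothetical identity

def allowed(url):
    # Fetch and consult the site's robots.txt before crawling the URL
    # (error handling omitted for brevity).
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(USER_AGENT, url)

def polite_fetch(url):
    # Identify the crawler via the User-Agent field of the HTTP request.
    if not allowed(url):
        return None
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    return urllib.request.urlopen(req, timeout=10).read()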

Page 28

Lots of tricky aspects

• Servers are often down or slow
• Hyperlinks can get the crawler into cycles
• Some websites have junk in the web pages
• Now many pages have dynamic content
  – The “hidden” web
  – E.g., schedule.xxx.edu
    • You don’t see the course schedules until you run a query.
• The web is HUGE

Page 29

Web Crawling Issues

• Keep-out signs
  – A file called robots.txt lists “off-limits” directories
• Freshness: figure out which pages change often, and recrawl these often
• Duplicates, virtual hosts, etc.
  – Convert page contents with a hash function
  – Compare new pages to the hash table
• Lots of problems
  – Server unavailable; incorrect HTML; missing links; attempts to “fool” the search engine by giving the crawler a version of the page with lots of spurious terms added ...
• Web crawling is difficult to do robustly!
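The duplicate check above can be sketched with a content hash; note that exact hashing only catches byte-identical copies (near-duplicates need techniques like shingling):

import hashlib

seen_hashes = set()          # the hash table of already-stored page contents

def is_duplicate(page_text):
    # Convert page contents with a hash function, then compare to the table.
    digest = hashlib.sha1(page_text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False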

Page 30

Crawling order

• Want to visit best pages first.
• Need a measure of quality (in-degree, PageRank).
• Possible orderings:
  – Breadth-first search (FIFO)
  – In-degree (so far)
  – PageRank (so far)
  – Random
• Experiments suggest breadth-first search finds pages with high PageRank early (removes need for computation).
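A sketch of a quality-prioritized frontier, assuming some score (in-degree so far, partial PageRank) is available per URL; Python's heapq is a min-heap, so scores are negated:

import heapq

class Frontier:
    # URL frontier ordered by a query-independent quality score.
    def __init__(self):
        self.heap = []
        self.enqueued = set()

    def push(self, url, score):
        if url not in self.enqueued:
            self.enqueued.add(url)
            heapq.heappush(self.heap, (-score, url))   # highest score pops first

    def pop(self):
        neg_score, url = heapq.heappop(self.heap)
        return url

Replacing the heap with a plain FIFO queue gives the breadth-first order that the experiments above already favor.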

Page 31

Overview

• History
• Search Engine Architecture
  – Crawler or Spider
  – Indexing
  – Ranking
• Web Spam

Page 32

Indexing

• Indexing using IR techniques, producing inverted files for web pages

• Vector space model (VSM)

Page 33

How Inverted Files Are Created

• Periodically rebuilt, static otherwise.
• Documents are parsed to extract tokens. These are saved with the document ID.

Doc 1: “Now is the time for all good men to come to the aid of their country”
Doc 2: “It was a dark and stormy night in the country manor. The time was past midnight”

Resulting (term, doc #) pairs, in order of appearance:
(now,1) (is,1) (the,1) (time,1) (for,1) (all,1) (good,1) (men,1) (to,1) (come,1) (to,1) (the,1) (aid,1) (of,1) (their,1) (country,1) (it,2) (was,2) (a,2) (dark,2) (and,2) (stormy,2) (night,2) (in,2) (the,2) (country,2) (manor,2) (the,2) (time,2) (was,2) (past,2) (midnight,2)

Page 34

How Inverted Files are Created

• After all documents have been parsed, the inverted file is sorted alphabetically:

(a,2) (aid,1) (all,1) (and,2) (come,1) (country,1) (country,2) (dark,2) (for,1) (good,1) (in,2) (is,1) (it,2) (manor,2) (men,1) (midnight,2) (night,2) (now,1) (of,1) (past,2) (stormy,2) (the,1) (the,1) (the,2) (the,2) (their,1) (time,1) (time,2) (to,1) (to,1) (was,2) (was,2)

Page 35

How Inverted Files are Created

• Multiple term entries for a single document are merged.
• Within-document term frequency information is compiled:

(term, doc #, freq): (a,2,1) (aid,1,1) (all,1,1) (and,2,1) (come,1,1) (country,1,1) (country,2,1) (dark,2,1) (for,1,1) (good,1,1) (in,2,1) (is,1,1) (it,2,1) (manor,2,1) (men,1,1) (midnight,2,1) (night,2,1) (now,1,1) (of,1,1) (past,2,1) (stormy,2,1) (the,1,2) (the,2,2) (their,1,1) (time,1,1) (time,2,1) (to,1,2) (was,2,2)

Page 36

How Inverted Files are Created

• Finally, the file can be split into:
  – A dictionary or lexicon file, and
  – A postings file

Page 37

How Inverted Files are Created

Dictionary/Lexicon (term, n docs, tot freq), each entry pointing to its block in the postings file:

(a,1,1) (aid,1,1) (all,1,1) (and,1,1) (come,1,1) (country,2,2) (dark,1,1) (for,1,1) (good,1,1) (in,1,1) (is,1,1) (it,1,1) (manor,1,1) (men,1,1) (midnight,1,1) (night,1,1) (now,1,1) (of,1,1) (past,1,1) (stormy,1,1) (the,2,4) (their,1,1) (time,2,2) (to,1,2) (was,1,2)

Postings (doc #, freq), in dictionary order:

(2,1) (1,1) (1,1) (2,1) (1,1) (1,1) (2,1) (2,1) (1,1) (1,1) (2,1) (1,1) (2,1) (2,1) (1,1) (2,1) (2,1) (1,1) (1,1) (2,1) (2,1) (1,2) (2,2) (1,1) (1,1) (2,1) (1,2) (2,2)
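The whole pipeline on the two example documents can be sketched in a few lines of Python: parse into (term, doc) pairs, sort, merge duplicates into frequencies, then split into a lexicon and a postings list:

from collections import Counter
from itertools import groupby

docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}

# 1. Parse documents into (term, doc_id) pairs, then sort alphabetically.
pairs = sorted((term, doc_id) for doc_id, text in docs.items()
               for term in text.split())

# 2. Merge multiple entries per document into within-document frequencies.
merged = Counter(pairs)                        # (term, doc_id) -> freq

# 3. Split into a lexicon (term -> n docs, tot freq, offset) and postings.
postings = []
lexicon = {}
for term, group in groupby(sorted(merged.items()), key=lambda kv: kv[0][0]):
    entries = [(doc_id, freq) for (_, doc_id), freq in group]
    lexicon[term] = (len(entries), sum(f for _, f in entries), len(postings))
    postings.extend(entries)                   # lexicon stores an offset into postings

print(lexicon["the"])                          # (2, 4, offset): 2 docs, 4 occurrences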

Page 38

Inverted File

• A vector of terms
• Stop words removed
• All words are stemmed

Page 39

Inverted indexes

• Permit fast search for individual terms
• For each term, you get a list consisting of:
  – document ID
  – frequency of term in doc (optional)
  – position of term in doc (optional)
  – font size (optional)
  – capitalization (optional)
  – descriptor type, e.g. title, anchor, etc. (optional)
• These lists can be used to solve Boolean queries:
  – country -> d1, d2
  – manor -> d2
  – country AND manor -> d2
• Also used for statistical ranking algorithms
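Answering a Boolean AND query is a postings-list intersection; a sketch assuming each term maps to a sorted list of document IDs:

def intersect(p1, p2):
    # Merge-intersect two sorted docID lists in O(len(p1) + len(p2)).
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

index = {"country": [1, 2], "manor": [2]}
print(intersect(index["country"], index["manor"]))   # -> [2], as on the slide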

Page 40

Inverted Indexes for Web Search Engines

• Inverted indexes are still used, even though the web is so huge.

• Some systems partition the indexes across different machines. Each machine handles different parts of the data.

• Other systems duplicate the data across many machines; queries are distributed among the machines.

• Most do a combination of these.

Page 41

Standard Web Search Engine Architecture

[Diagram, repeating the earlier architecture: crawler machines crawl the web → check for duplicates, store the documents → create an inverted index (keyed by DocIds) → search engine servers; a user query goes to the search engine servers, which show results to the user]

Page 42

Query Serving Architecture

• Index divided into segments, each served by a node
• Each row of nodes replicated for query load
• Query integrator distributes query and merges results
• Front end creates an HTML page with the query results

[Diagram: a load balancer routes the query “travel” to a front end (FE1 … FE8); the front end passes it to a query integrator (QI1 … QI8), which broadcasts it to every segment node in one replicated row (Node1,1 … Node4,N) and merges their results]
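The integrator's scatter/gather step can be sketched as follows; the per-segment lookup functions stand in for RPCs to index-serving nodes and are not a real API:

import heapq

def query_integrator(query, segment_nodes, k=10):
    # Scatter: send the query to one replica of every index segment.
    partial = [node(query) for node in segment_nodes]
    # Gather: merge the per-segment ranked lists, keep the global top k.
    return heapq.nlargest(k, (hit for hits in partial for hit in hits))

# Toy usage: two segments, each serving its own slice of the index.
seg1 = lambda q: [(0.9, "doc3"), (0.4, "doc7")]
seg2 = lambda q: [(0.8, "doc12"), (0.1, "doc9")]
print(query_integrator("travel", [seg1, seg2], k=3))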

Page 43

Overview

• History
• Search Engine Architecture
  – Crawler or Spider
  – Indexing
  – Ranking
• Web Spam

Page 44

How to do ranking?

Page 45

Ranking result pages

• Based on content
  – Number of occurrences of the search terms
  – Similarity to the query text
• Based on link structure
  – Backlink count
  – PageRank
  – Hub and authority scores (HITS)

Page 46

Problems with Content-based Ranking?

Page 47

Problems with content-based ranking

• Many pages containing the search terms may be of poor quality or irrelevant
  – Example: a page with just the line “search engine”.
• Many high-quality or relevant pages do not even contain the search terms
  – Example: the Google homepage
• Pages containing more occurrences of the search terms are ranked higher, so spamming is easy
  – Example: a page with the line “search engine” repeated many times

Page 48

Backlink

• A backlink of a page p is a link that points to p
• A page with more backlinks is ranked higher
• Intuition: each backlink is a “vote” for the page’s importance
• Based on local link structure; still easy to spam
  – Create lots of pages that point to a particular page
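Backlink counting is simply in-degree over the crawled link graph; a tiny sketch over (source, target) link pairs:

from collections import Counter

links = [("a", "p"), ("b", "p"), ("c", "q")]        # (source, target) pairs
backlinks = Counter(target for _, target in links)
print(backlinks["p"])                               # 2 "votes" for page p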

Page 49

PageRank and HITS

• Page et al., “The PageRank Citation Ranking: Bringing Order to the Web.” 1998
• Kleinberg, “Authoritative Sources in a Hyperlinked Environment.” Journal of the ACM, 1999
• Main idea: pages pointed to by high-ranking pages are ranked higher
• Definition is recursive by design
• Based on global link structure; hard to spam
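A minimal power-iteration sketch of PageRank on an adjacency list; the damping factor 0.85 is the value commonly cited from the original paper, and convergence checking is omitted for brevity:

def pagerank(graph, damping=0.85, iterations=50):
    # graph: {page: [pages it links to]}; returns {page: rank}.
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, out_links in graph.items():
            if not out_links:                  # dangling page: spread rank evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:                              # pass rank to linked pages
                for q in out_links:
                    new_rank[q] += damping * rank[p] / len(out_links)
        rank = new_rank
    return rank

# Pages linked to by high-ranking pages end up ranked higher.
print(pagerank({"a": ["b"], "b": ["c"], "c": ["a", "b"]}))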

Page 50

Slide adapted from Manning, Raghavan, & Schuetze

Manipulating Ranking

• Motives
  – Commercial, political, religious, lobbies
  – Promotion funded by advertising budget
• Operators
  – Contractors (search engine optimizers) for lobbies, companies
  – Web masters
  – Hosting services
• Forum
  – WebmasterWorld (www.webmasterworld.com)

Page 51

Overview

• History
• Search Engine Architecture
  – Crawler or Spider
  – Indexing
  – Ranking
• Web Spam

Page 52

Slide adapted from Manning, Raghavan, & Schuetze

A few spam technologies

• Cloaking
  – Serve fake content to the search engine robot
  – DNS cloaking: switch IP address; impersonate
• Doorway pages
  – Pages optimized for a single keyword that redirect to the real target page
• Keyword spam
  – Misleading meta-keywords, excessive repetition of a term, fake “anchor text”
  – Hidden text with colors, CSS tricks, etc.
• Link spamming
  – Mutual admiration societies, hidden links, awards
  – Domain flooding: numerous domains that point or redirect to a target page
• Robots
  – Fake click stream
  – Fake query stream
  – Millions of submissions via Add-URL

[Diagram: cloaking. “Is this a search engine spider?” If yes, serve the SPAM page; if no, serve the real doc]

Example spam meta-keywords: “… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …”

Page 53

Cloaking

• Content presented to the search engine spider is different from that presented to the user’s browser.
• This is done by delivering content based on the IP address or the User-Agent HTTP header of the client requesting the page.
• When the client is identified as a search engine spider, a server-side script delivers a different version of the web page, one that contains content not present on the visible page.
• The purpose of cloaking is to deceive search engines so they display the page when it would not otherwise be displayed.
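The server-side check amounts to a conditional on the request's User-Agent (or source IP); a hypothetical sketch of what such a spammer script does, with illustrative spider names:

SPIDER_SIGNATURES = ("googlebot", "slurp", "msnbot")   # illustrative spider UAs

def page_for(user_agent):
    # Serve keyword-stuffed content to spiders, the real page to everyone else.
    if any(sig in user_agent.lower() for sig in SPIDER_SIGNATURES):
        return "spam_optimized_page.html"   # what the search engine indexes
    return "real_page.html"                 # what human visitors see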

Page 54

Cloaking

• Cloaking is often used as a spamdexing technique, to try to trick search engines into giving the site a higher ranking.
• It can also be used to trick search engine users into visiting a site based on its search engine description, when the site turns out to have substantially different or even pornographic content.
• Search engines delist sites when deceptive cloaking is reported.
• Cloaking is a form of the doorway page technique.

Page 55

Redirection

• Simple approach: take advantage of the refresh meta tag in the header of an HTML document. By setting the refresh time to zero and naming a target page, spammers achieve redirection as soon as the page gets loaded into the browser:

<meta http-equiv="refresh" content="0;url=target.html">

• Search engines can easily detect it!
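Detection is straightforward because the tag is static HTML; a sketch using a regular expression over the crawled page (a robust engine would parse the HTML and allow attribute reordering):

import re

META_REFRESH = re.compile(
    r'<meta\s+http-equiv=["\']refresh["\']\s+content=["\']0\s*;\s*url=',
    re.IGNORECASE)

def is_instant_redirect(html):
    # Flag pages that redirect the browser the moment they load.
    return bool(META_REFRESH.search(html))

print(is_instant_redirect('<meta http-equiv="refresh" content="0;url=target.html">'))  # True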

Page 56

Redirection

• Using scripts, which are not executed by web crawlers:

<script language="javascript">
<!--
location.replace("target.html")
//-->
</script>

Page 57

Doorway Pages

• Creating low-quality web pages that contain very little content but are instead stuffed with very similar keywords and phrases.

• They are designed to rank highly within the search results, but serve no purpose to visitors looking for information. A doorway page will generally have "click here to enter" on the page.

• Once they are reported, search engines delist the sites!

Page 58

Keyword Spam

• Keyword spam is the excessive repetition of keywords on a page.
• It is usually done using hidden elements that are indexed by search engines but are not visible to users, including Title, Meta, and Alt.
  – Black-hatters have found that they can disguise keywords in the contents of the page by making the text the same color as the background and tucking it away at the bottom of the page.
  – Press Ctrl-A to highlight all the text on a page, and they get caught!
  – MSN Search claims to automatically penalize these pages.
• E.g., <meta name="keywords" content="wikipedia,encyclopedia"/> specifies that the document is relevant to wikipedia, encyclopedia.

Page 59

Keyword Spam

• An extension of the hidden-text idea is to hide the keyword spam using style sheets (CSS).
  – This gives the spammer great scope for stuffing keywords into important elements such as headings without them being noticed. The following style will format all Heading 1 text as 1pt-high white text:

H1 { font-size: 1pt; color: white; }

• There are many other ways of hiding content from users, such as layers and iframes, while still having it visible to search engines.
• Search engines can detect them, at the cost of slowdown, by parsing style sheets and other structures!
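A simplified sketch of that detection idea: scan declared styles and flag rules that would render text invisible; real engines handle far more CSS (hex colors, computed styles, layers) than this:

import re

def suspicious_style(css_rule, background="white"):
    # Flag styles whose text color matches the background or whose font is tiny.
    color = re.search(r"color\s*:\s*(\w+)", css_rule)
    size = re.search(r"font-size\s*:\s*(\d+)\s*pt", css_rule)
    hidden_color = color is not None and color.group(1).lower() == background
    tiny_font = size is not None and int(size.group(1)) <= 2
    return hidden_color or tiny_font

print(suspicious_style("H1 { font-size: 1pt; color: white; }"))   # True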

Page 60

Link Spam

• Takes advantage of link-based ranking algorithms, such as Google’s PageRank and the HITS algorithm, which give a higher ranking to a website the more other highly ranked websites link to it.
• Link farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.
• Page hijacking: achieved by creating a rogue copy of a popular website which shows contents similar to the original to a web crawler, but redirects web surfers to unrelated or malicious websites.