Search Engine Survey Hongfei Yan 2/15/2007




Outline

Background Information: definition, history, how search engines work

General Search Engines: interface, databases, features; Google, Yahoo!, Baidu, Live

Open Source Search Engines: Lucene, SWISH-E

Metasearch, Visual, and Answer Search Engines


Definition of Search Engine

A search engine is an information retrieval system designed to help find information stored on a computer system, such as on the Web, inside a corporate or proprietary network, or in a personal computer.

The search engine allows one to ask for content meeting specific criteria (typically those containing a given word or phrase) and retrieves a list of items that match those criteria.

This list is often sorted with respect to some measure of relevance of the results.

Search engines use regularly updated indexes to operate quickly and efficiently.

The term "search engine" usually refers to a Web search engine, which searches for information on the public Web.


Timeline of Search Engines

[Figure: timeline of search engines, with eras labeled "Full text" crawler-based, and Link popularity and PageRank]


How search engines work

Web crawling: A crawler is an automated Web browser that follows every link it sees. Exclusions can be made by the use of robots.txt.

Indexing: The contents of each page are analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags).

Searching: When a user comes to the search engine and makes a query, the engine looks up the index and provides a listing of best-matching web pages according to its criteria.
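The three stages above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any production engine is built: the "Web" is a hypothetical in-memory dictionary (PAGES), the robots.txt rules are made up, and the search returns unranked matches rather than a best-first listing. Only the standard library is used.

```python
import re
from collections import defaultdict, deque
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
PAGES = {
    "http://example.test/": (
        "Search engine survey",
        ["http://example.test/a", "http://example.test/private/x"],
    ),
    "http://example.test/a": (
        "How a search engine builds an index",
        ["http://example.test/"],
    ),
    "http://example.test/private/x": ("Hidden engine notes", []),
}
# Made-up exclusion rules, in robots.txt syntax.
ROBOTS_TXT = ["User-agent: *", "Disallow: /private/"]


def crawl(seed):
    """Follow every link breadth-first, skipping URLs excluded by robots.txt."""
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT)
    seen, queue, fetched = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        if url not in PAGES or not robots.can_fetch("*", url):
            continue
        text, links = PAGES[url]
        fetched[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


def build_index(fetched):
    """Inverted index: lowercased word -> set of URLs containing it."""
    index = defaultdict(set)
    for url, text in fetched.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every query word (no ranking)."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)


fetched = crawl("http://example.test/")
index = build_index(fetched)
print(search(index, "engine index"))
```

The page under /private/ is never fetched, so its words never reach the index. A real engine would add a relevance score at the search step.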


Storage costs and crawling time

Storage costs are not the limiting resource in search engine implementation. Simply storing 10 billion pages of 10 kbytes each (compressed) requires 100TB, and another 100TB or so for indexes, giving a total hardware cost of under $200k: 100 cheap PCs, each with four 500GB disk drives.

However, a public search engine requires considerably more resources than this to calculate query results and to provide high availability.

Also, the costs of operating a large server farm are not trivial.

Crawling 10B pages with 100 machines, each crawling at 100 pages/second, would take 1M seconds, or 11.6 days, on a very high capacity Internet connection.
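The figures above can be checked with a short back-of-the-envelope calculation. The sketch below only restates the slide's numbers; it assumes decimal units (1 TB = 10^12 bytes, 1 kbyte = 1,000 bytes), and the variable names are chosen here for illustration.

```python
# Back-of-the-envelope check of the storage and crawl-time estimates.
# Assumes decimal units: 1 kbyte = 1,000 bytes, 1 TB = 10**12 bytes.
PAGE_COUNT = 10_000_000_000      # 10 billion pages
PAGE_SIZE_BYTES = 10_000         # 10 kbytes each, compressed
TB = 10**12

page_storage_tb = PAGE_COUNT * PAGE_SIZE_BYTES / TB   # storage for the pages
index_storage_tb = 100                                # "another 100TB or so"
total_storage_tb = page_storage_tb + index_storage_tb

# Hardware: 100 PCs, each with four 500GB drives.
cluster_capacity_tb = 100 * 4 * 500 / 1000

# Crawl: 100 machines, each fetching 100 pages per second.
crawl_seconds = PAGE_COUNT / (100 * 100)
crawl_days = crawl_seconds / 86_400

print(f"pages: {page_storage_tb:.0f} TB, total: {total_storage_tb:.0f} TB")
print(f"cluster capacity: {cluster_capacity_tb:.0f} TB")
print(f"crawl time: {crawl_seconds:.0f} s = {crawl_days:.1f} days")
```

The cluster capacity (200 TB) exactly covers pages plus indexes, and a total under $200k for 100 PCs implies a budget of under $2,000 per machine.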




General Search Engine

Primary Search Engines: These are well-known or well-used, and they can potentially generate a great deal of traffic.

* Google
* Yahoo!
* Baidu
* Live

Secondary Web Search Engines: These are either smaller or not the primary search engine for access to the databases of the search providers.

* Exalead
* Gigablast
* WiseNut

Dead Search Engines: These search engines used to offer their own database or unique search features. They have all abandoned their position in search, although they still may have some kind of search functionality.

* AlltheWeb
* AltaVista
* Excite
* Infoseek
* Inktomi


GSE: Minimalist User Interface


GSE: Databases

Web: Indexed Web pages (also includes URLs that it has not fully indexed). Additional file types in the Web database include PDF, .ps, .doc, .xls, .txt, .ppt, .rtf, .asp and more.

Ads: Paid advertisements, usually shown on the right side (or top) under a "Sponsored Links" heading.


GSE: Google Database Components

Component                    In millions    Percent
Indexed Web pages            1,465          73.1%
Unindexed URLs               500            25%
Other file types             35             1.75%
Daily reindexed Web pages    3              0.15%


GSE: Features

A large, unique search engine database.

Includes cached copies of pages.

Uses not only PageRank but more than 150 criteria to determine relevancy.

Default operation: Multiple search terms are processed as an AND operation by default. Phrase matches are ranked higher (proximity searching).

No truncation is available.

Case sensitivity: Using either lower or upper case results in the same hits.
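The default-AND, phrase-boost, and case-insensitive behaviour listed above can be illustrated with a small sketch. It is a toy model only: the two-level score (2 for a phrase match, 1 otherwise) is an assumption made here for illustration and stands in for the many criteria a real engine combines. The function names and sample documents are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase and split into words, so upper and lower case match alike."""
    return re.findall(r"\w+", text.lower())


def contains_phrase(tokens, terms):
    """True if the query terms appear consecutively, in order, in tokens."""
    n = len(terms)
    return any(tokens[i:i + n] == terms for i in range(len(tokens) - n + 1))


def search(docs, query):
    """Implicit AND over all query terms; phrase matches are ranked first."""
    terms = tokenize(query)
    results = []
    for name, text in docs.items():
        tokens = tokenize(text)
        if not terms or not all(t in tokens for t in terms):
            continue  # AND semantics: every term must be present
        score = 2 if contains_phrase(tokens, terms) else 1  # assumed scoring
        results.append((score, name))
    results.sort(key=lambda r: (-r[0], r[1]))
    return [name for _, name in results]


docs = {
    "a": "Open source search engine",
    "b": "Engine room: search the open manual",
    "c": "Search tips",
}
print(search(docs, "SEARCH engine"))
```

Document "c" is dropped because it lacks one of the terms, and "a" ranks ahead of "b" because it contains the two words as a phrase.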


GSE: Features contd.

Field searching.

Language limits: Default is all languages. 30+ language limits are available.

Stop words: Searches almost all words except for operators like AND.

Display: The display includes the title, URL, a brief extract showing text near the search terms, the file size, and for many hits, a link to a cached copy of the page.


Review of Google

In Feb. 1999 Google moved from Alpha test version to Beta, and it officially launched Sept. 21, 1999.

Since that time it has made its mark with its relevance ranking based on link analysis, cached pages, and aggressive growth.

Since its beta release, it has had phrase searching and the - for NOT, but it did not add an OR operation until Oct. 2000. In Dec. 2000, it added title searching.

In June 2000 it announced a database of over 560 million pages, which grew to over 600 million by the end of 2000 and then 1.5 billion in Dec. 2001.

The 2+ billion reported on their home page as of April 2002 includes indexed pages, unindexed URLs, and other file formats. By Nov. 2002, they moved their claim up to 3 billion, and in Feb. 2004 it went to 4 billion.

While no official claim is given, 20+ billion is one current estimate.


Review of Yahoo!

The two founders of Yahoo!, David Filo and Jerry Yang, Ph.D. candidates in Electrical Engineering at Stanford University, started their guide in a campus trailer in February 1994 as a way to keep track of their personal interests on the Internet. Before long they were spending more time on their home-brewed lists of favourite links than on their doctoral dissertations. Eventually, Jerry and David's lists became too long and unwieldy, and they broke them out into categories. When the categories became too full, they developed subcategories ... and the core concept behind Yahoo! was born.

In 2002, Yahoo! acquired Inktomi, and in 2003, Yahoo! acquired Overture, which owned AlltheWeb and AltaVista.

In 2004, Yahoo! launched its own search engine based on the combined technologies of its acquisitions, providing a service that gave pre-eminence to the Web search engine over the directory.


Review of Live

Live Search is the successor to MSN Search. This is the Microsoft Web search engine. Launched in September 2006, it uses its own, unique database.

In 2004 it debuted a beta version of its own results, powered by its own web crawler (called msnbot).

In early 2005 it started showing its own results live. At the same time, Microsoft ceased using results from Inktomi, now owned by Yahoo!.

In 2006, Microsoft migrated to a new search platform, Windows Live Search, retiring the "MSN Search" name in the process.


Review of Baidu

Baidu (Chinese: 百度; pinyin: bǎi dù) is a popular Chinese search engine which launched in 2000 and can search text and images. As of January 2007, and since at least May 2006, it has ranked fourth in Alexa's internet rankings, with a market share of 52 percent.

Baidu provides an index of over 1 billion web pages.




Lucene, lucene.apache.org

Lucene is a free and open source information retrieval API, originally implemented in Java by Doug Cutting. Lucene has been ported to programming languages including Perl, C#, C++, Python, Ruby and PHP.

It is suitable for any application that requires full-text indexing and searching capability.

At the core of Lucene's logical architecture is a notion of a document containing fields of text. This flexibility allows Lucene's API to be agnostic of file format. Text from PDFs, HTML, Microsoft Word documents, as well as many others can all be indexed so long as their textual information can be extracted.
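The "document containing fields of text" idea can be modelled in a few lines. The sketch below is a toy written for this survey, not Lucene's API: the class FieldIndex and its methods are hypothetical names, and it assumes the text has already been extracted from the original file, whatever its format.

```python
import re
from collections import defaultdict


class FieldIndex:
    """Toy index over documents made of named text fields (not Lucene's API)."""

    def __init__(self):
        self.docs = []                     # doc id -> original field dict
        self.postings = defaultdict(set)   # (field, term) -> set of doc ids

    def add(self, fields):
        """Index one document given as {field name: extracted text}."""
        doc_id = len(self.docs)
        self.docs.append(fields)
        for field, text in fields.items():
            for term in re.findall(r"\w+", text.lower()):
                self.postings[(field, term)].add(doc_id)
        return doc_id

    def search(self, field, term):
        """Return ids of documents whose given field contains the term."""
        return sorted(self.postings.get((field, term.lower()), set()))


idx = FieldIndex()
# The source format no longer matters once the text has been extracted.
idx.add({"title": "Lucene overview", "body": "text extracted from a PDF"})
idx.add({"title": "Crawler notes", "body": "text extracted from an HTML page about Lucene"})
print(idx.search("title", "lucene"), idx.search("body", "lucene"))
```

Because the index only ever sees field names and text, the same word can be searched in the title separately from the body, regardless of whether the document started as a PDF or an HTML page.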


SWISH-E, swish-e.org

Swish-e stands for Simple Web Indexing System for Humans - Enhanced. It is used to index collections of documents ranging up to one million documents in size and includes import filters for many document types.

Many sites use Swish-e




Visual Search Engine

A search returns both a list of search results and a tag cloud. The tag cloud contains the original search terms surrounded by related tags. The closer a suggested keyword is to the search terms, and the larger it appears (in both font size and boldness), the more relevant it is deemed. Holding the mouse over a term will display a new set of results in the bottom window and will also show another keyword cloud overlaying the original.


VSE: Quintura.com


Metasearch Engines

Unlike search engines, metacrawlers don't crawl the web themselves to build listings. Instead, they allow searches to be sent to several search engines all at once. The results are then blended together onto one page.
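A minimal sketch of the blending step follows. The slide does not say how results are merged, so the sketch uses reciprocal rank fusion, one common way to combine ranked lists, purely as an assumption. The two engines are hypothetical stand-in functions returning fixed lists, and they are queried one after another rather than all at once.

```python
from collections import defaultdict


def engine_a(query):
    """Hypothetical engine: returns URLs best-first (fixed for illustration)."""
    return ["http://a.test/1", "http://a.test/2", "http://a.test/3"]


def engine_b(query):
    """Second hypothetical engine with a partly overlapping ranking."""
    return ["http://a.test/2", "http://a.test/4", "http://a.test/1"]


def metasearch(query, engines, k=60):
    """Send the query to every engine and blend the rankings.

    Uses reciprocal rank fusion: each URL gets 1 / (k + rank) from every
    engine that returns it, and URLs are sorted by their summed score.
    """
    scores = defaultdict(float)
    for engine in engines:
        for rank, url in enumerate(engine(query), start=1):
            scores[url] += 1.0 / (k + rank)
    return sorted(scores, key=lambda url: (-scores[url], url))


print(metasearch("search engine survey", [engine_a, engine_b]))
```

URLs returned by both engines accumulate two contributions and so rise above those returned by only one, which gives a single blended page without the metasearch engine crawling anything itself.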


MSE: Vivisimo


MSE: Kartoo.com


Answer-based search engines

Answers.com: presents reference content in over four million entries, collected from multiple sources.

