Slide 1 ECT 7010 Fundamentals of E-Commerce Technologies Edited by Christopher C. Yang Web Searching

Web Searching - Chinese University of Hong Kong · ect7010/Materials/Lecture/Lec7.pdf · 2006-11-23



Needs for Search Engines

• We have lots of work waiting for us every day
• Information is needed for many kinds of work
  – E-Commerce
    • Microsoft and Oracle web sites have over 1 million pages
    • Multimedia databases
  – E-Governance
    • businesses, citizens: rules, regulations
  – E-Employee
    • rules, regulations, human resources
  – E-Business
    • product specifications, catalogs, contract details, availability
• Information Overload
  – Too much information is available on the Internet as the World Wide Web becomes popular
• Distributed and non-homogeneous data
  – different people come together on the WWW without centralized rules and guidance
  – documents on the WWW have different structures
  – data is non-homogeneous because of different human association models and value schemes
• Most computer systems are not able to cope with this complexity


History of Search Engines

• Anonymous FTP sites
• Archie
  • for searching FTP sites
  • Collects the site listings of FTP archives into a database for users to query
• Gopher servers for text documents
• Veronica
  • for searching Gopher sites
• Spiders
  • Start at a starter page on a popular site and move on to other pages by following the hyperlinks
  • Collect URLs, titles and web headings so that they can be searched
  • The collected data is compiled into a database that can be searched through query strings


How do Search Engines Work?

• Gathering
  • visit web sites and bring back copies of the documents
  • Revisit sites at some interval of time
  • Most search engines claim that they cover the entire Web, but they cover only a certain percentage of it
  • At an early stage, search engines collected URLs, titles, and headings
  • Later, they collected the first hundred kilobytes of each document
  • Now, they collect entire documents
• Indexing
  • collect key words and phrases in each document and index them against URLs
  • For each word and phrase, the database contains the URL of the page as well as the location in the document
• Searching
  • search the index of words/phrases
  • Order the list of URLs according to relevance
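The indexing and searching steps can be illustrated with a small positional inverted index. This is only a sketch under stated assumptions: the page texts, the example.org URLs, and the function names `build_index` and `search` are hypothetical, and ranking by raw word frequency is a stand-in for whatever relevance measure a real engine uses.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to a list of (url, position) pairs."""
    index = defaultdict(list)
    for url, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((url, position))
    return index

def search(index, word):
    """Return URLs containing the word, most frequent first (crude relevance)."""
    counts = defaultdict(int)
    for url, _ in index.get(word.lower(), []):
        counts[url] += 1
    return sorted(counts, key=lambda u: (-counts[u], u))

# Hypothetical gathered documents (url -> text).
pages = {
    "http://example.org/a": "acid rain damages forests and acid soils",
    "http://example.org/b": "rain falls on the river bank",
}
index = build_index(pages)
acid_results = search(index, "acid")
rain_results = search(index, "rain")
```

Storing the position alongside the URL is what later makes phrase and proximity queries possible.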


Common Search Features

• Boolean Searching
  • AND
  • OR
  • NOT
    • In case of polysemy (words carry different meanings)
    • E.g., bank AND NOT river
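Boolean operators map naturally onto set operations over the documents that contain each term. The sketch below assumes three made-up documents and a hypothetical helper `postings`; it is an illustration, not how any particular engine evaluates queries.

```python
# Hypothetical mini-collection (document id -> text).
docs = {
    "doc1": "the bank raised its interest rate",
    "doc2": "we walked along the river bank",
    "doc3": "the river flooded the valley",
}

def postings(term):
    """Set of document ids whose text contains the term."""
    return {doc_id for doc_id, text in docs.items() if term in text.split()}

bank_and_not_river = postings("bank") - postings("river")  # bank AND NOT river
bank_or_river = postings("bank") | postings("river")       # bank OR river
bank_and_river = postings("bank") & postings("river")      # bank AND river
```

Here `bank AND NOT river` keeps only the financial sense of "bank", which is the polysemy case the slide mentions.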


Common Search Features

• Phrase Search
  – “acid rain”
• Proximity Search
  – Searching for “the warming of the American continent”
  – Query: America NEAR warming
  – Words appear within N words of each other (N = 10 or 25 or …)
• Wild card Search
  – * ?
• Concept Search
• Natural Language Search
  – Ask Jeeves
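A NEAR query can be checked by comparing word positions. The function `near` below is a hypothetical sketch; it assumes prefix matching so that the query term "America" also matches "American", which is an illustrative choice rather than a documented behaviour of any search engine.

```python
def near(text, term_a, term_b, n=10):
    """True if term_a and term_b occur within n words of each other."""
    words = text.lower().split()
    # Prefix matching is an assumption made for this example.
    pos_a = [i for i, w in enumerate(words) if w.startswith(term_a.lower())]
    pos_b = [i for i, w in enumerate(words) if w.startswith(term_b.lower())]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)

phrase = "the warming of the American continent"
match_default = near(phrase, "America", "warming")       # N = 10
match_tight = near(phrase, "America", "warming", n=2)    # N = 2
```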


Common Display Features

• Relevance Ranking
• How to arrange the output of the search engine?
  – Word frequency
  – Popularity of the site
  – Location of the words
  – Connectivity of the page and site
    • Pages with many links pointing to them are taken as more important
  – Date the page was created


Directories and Databases

• Searchable subject/topic directory
  • Yahoo.com
  • A hierarchy of topics
  • Entries are submitted to the site, verified and included in the directory.
  • The entire process is manual but gives good quality control
• Virtual Libraries
  • Maintained by professionals such as librarians
  • Access to reference sources such as handbooks, dictionaries, and encyclopedias
  • http://www.lii.org
• Specialized databases
  • Maintained by for-profit companies; access is charged
  • Lexis-Nexis, Medline, Dialog


Meta-Search Engines

Coordinate the services of several search-engines

Dogpile www.dogpile.com

Metacrawler www.metacrawler.com


Search Engine Evaluation

• Recall
  – Ratio of relevant pages returned to the total number of relevant pages on the World Wide Web
• Precision
  – Ratio of relevant pages returned to the total number of pages returned
• Conflicting nature of the above two criteria
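Using the standard information-retrieval definitions (precision is the relevant fraction of what was returned, recall is the returned fraction of what is relevant), both measures can be computed from two sets. The page identifiers and the function name `precision_recall` below are made up for illustration.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) for a result list against the relevant set."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 pages returned, 5 pages actually relevant, 2 in common.
precision, recall = precision_recall(
    ["p1", "p2", "p3", "p4"],
    ["p1", "p2", "p5", "p6", "p7"],
)
```

Returning more pages can only raise recall but tends to lower precision, which is the conflict noted above.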


Search Engine Evaluation

• Directories like Yahoo are high in precision, poor in recall

• General sites such as Altavista are high in recall and poor in precision

• The design challenge in a search engine is to achieve both high precision and high recall


Search Engine Evaluation

• Search life cycle

• Initial phase: exploration phase, high recall is required

• Later phase: exploitation phase, high precision is required


Popular Search Engines (Online Database)

• Yahoo!
  – Database of Web addresses in the manner of the Yellow Pages
  – Offers a subject hierarchy 4 to 5 levels deep (most others offer only 2 levels)
• Alta Vista
  – Had the largest copy of the Web in the early days
  – Scooter, a spider that roams the Web once every month to update the database
  – Similar subject directory to Yahoo! but smaller
  – “refine” search: suggests likely topics and keywords to tighten the search criteria


Popular Search Engines (continued)

• Lycos
  – “Lycos Top 5%”: manual listing on the basis of site popularity and content quality
  – Site selection is reliable and authoritative
  – Multimedia search, such as pictures, clip art, video clips, sound and music clips
• Excite
  – Concept-based approach instead of keyword-based approach for searching
  – Looks for Web pages that have related ideas and concepts instead of keyword matching
  – Searches a smaller portion of the Web
  – Allows “query by example” using an entry in the result
• HotBot
  – Powered by Inktomi
• Google
• Baidu


Flexible Search Strategy

• Determine specificity of search required

• In exploration or exploitation phase?

• Choose a search engine that suits the specificity of search
  • meta-search engines for initial phase of exploration

• Build a search query
  • Boolean statements


Client-based spider (agents technology)

– machine learning techniques
  • content-based
  • collaborative


• TueMosaic
  – DeBra and Post, Eindhoven University of Technology (TUE)
  – P. DeBra and R. Post, “Information Retrieval in the World Wide Web: Making Client-based Searching Feasible,” Proceedings of the First International World Wide Web Conference, Geneva, Switzerland, 1994
  – users enter keywords and specify the depth and width of the search for links contained in the homepage currently displayed
  – Fish search algorithm, a modified Best First Search
    • Each URL corresponds to a fish
    • After the document is retrieved, the fish spawns a number of children (URLs)
    • How many children are produced depends on whether the document is relevant and how many URLs are embedded in it
    • The URLs will be removed if no relevant documents are found after following several links
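The fish-search idea can be sketched on an in-memory toy web. This is a deliberately simplified illustration, not the algorithm as published by DeBra and Post: the `web` dictionary, the function name `fish_search`, the exact keyword test for relevance, and the way the depth budget is reset are all assumptions made for the example.

```python
from collections import deque

def fish_search(web, start, keyword, depth=2, width=3):
    """Follow links from start; a branch dies after `depth` irrelevant hops."""
    queue = deque([(start, depth)])
    visited, found = set(), []
    while queue:
        url, budget = queue.popleft()
        if url in visited or url not in web:
            continue
        visited.add(url)
        text, links = web[url]
        if keyword in text.lower().split():
            found.append(url)
            # Relevant page: children get a fresh budget and are explored first.
            children = [(link, depth) for link in links[:width]]
            queue.extendleft(reversed(children))
        elif budget > 1:
            # Irrelevant page: children inherit a reduced budget.
            queue.extend((link, budget - 1) for link in links[:width])
    return found

# Hypothetical toy web: url -> (text, outgoing links).
web = {
    "home": ("welcome page", ["a", "b"]),
    "a": ("spider crawling notes", ["a1"]),
    "a1": ("more spider notes", []),
    "b": ("cooking recipes", ["b1"]),
    "b1": ("baking bread", ["b2"]),
    "b2": ("spider hidden deep", []),
}
shallow = fish_search(web, "home", "spider", depth=2)
deep = fish_search(web, "home", "spider", depth=4)
```

With a small depth the branch through the irrelevant pages `b` and `b1` is abandoned before `b2` is reached; a larger depth lets the search survive long enough to find it.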


• WebCrawler
  – Pinkerton
  – B. Pinkerton, “Finding What People Want: Experiences with the WebCrawler,” Proceedings of the Second International World Wide Web Conference, Chicago, IL, October 17-20, 1994
  – first appeared in April of 1994
  – purchased by America Online in January of 1995
  – WebCrawler extends the Fish Search algorithm
    • initiates the search using an index
    • follows links in an intelligent order
      – evaluates the relevance of a link based on the similarity of its anchor text to the user’s query


• TkWWW
  – Spetka
  – S. Spetka, “The TkWWW Robot: Beyond Browsing,” Proceedings of the Second International World Wide Web Conference, 1994
  – funded by Air Force Rome Laboratory
  – finds logically related homepages and returns a list of links
    • only one or two hops from the original homepages
    • runs in the background to build HTML indexes, compile WWW statistics, collect a portfolio of pictures, etc.


• WebAnts
  – Leavitt
  – investigates the distribution of information collection tasks to a number of cooperating processors
  – creates cooperating explorers (ants)
    • share the searching results and the indexing load without repeating each other’s effort


• RBSE (Repository Based Software Engineering)
  – David Eichmann
  – funded by NASA
  – first spider to index documents by content
  – four searching algorithms are used:
    • breadth first search from a given URL
    • limited depth first search from a given URL
    • breadth first search from unvisited URLs in the database
    • limited depth first search from unvisited URLs in the database


• Lira System
  – Stanford
  – M. Balabanovic and Y. Shoham, “Learning Information Retrieval Agents: Experiments with Automated Web Browsing,” AAAI 1995 Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, Menlo Park, 1995
  – browses the Internet on users’ behalf
    • searches the Web for a bounded amount of time, selects the best pages and receives an evaluation from the user


• Musag System
  – Hebrew University
  – C. V. Goldman, A. Langer, and J. S. Rosenschein, “Musag: An Agent That Learns What You Mean,” Applied AI, vol.11, no.5, 1997, pp.331-339
  – takes keywords from the users and searches the Web for relevant documents
  – system generates a kind of thesaurus that relates concepts that are semantically similar to each other


• Letizia
  – MIT
  – H. Lieberman, “Letizia: An Agent that Assists Web Browsing,” Proceedings of the 14th International Joint Conference on AI (IJCAI95), Menlo Park, 1995
  – a user-interface agent for assisting Web browsing
    • does not require any keywords or ratings from the user
    • infers users’ interests from browser behavior
    • performs depth-first search


• Personal WebWatcher
  – CMU
  – R. Armstrong et al., “WebWatcher: A Learning Apprentice for the World Wide Web,” AAAI 1995 Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, Menlo Park, CA, 1995
  – personal assistant that accompanies the user from page to page and highlights interesting hyperlinks
  – generates a user profile based on the content analysis of the requested pages
    • without requesting keywords or ratings


• Syskill & Webert
  – UCI
  – M. Pazzani, J. Muramatsu, and D. Billsus, “Syskill & Webert: Identifying Interesting Web Sites,” Proceedings of the 13th National Conference AI, AAAI’96, Menlo Park, 1996, pp.54-61
  – M. Pazzani and D. Billsus, “Learning and Revising User Profiles: The Identification of Interesting Web Sites,” Machine Learning, 27, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1997, pp.313-331
  – collects ratings of the explored Web pages from the user and learns a user profile from them
  – separates pages according to their topics and learns a separate profile for each topic
  – applies a naive Bayesian classifier for learning and revising the user profile from a set of positive and negative examples


• WAWA
  – Wisconsin
  – J. Shavlik and T. Eliassi-Rad, “Building Intelligent Agents for Web-based Tasks: A Theory-Refinement Approach,” Working Notes of Learning from Text and the Web, Conf. Automated Learning and Discovery (CONALD’98), Carnegie Mellon University, Pittsburgh, 1998; http://www.cs.cmu.edu/~conald/conald.shtml
  – lets users input personal interests and preferences
  – stores them in a neural network, and uses theory revision to refine the obtained knowledge


• Others ….
• Anatagonomy
  – NEC
  – T. Kamba, H. Sakagami, and Y. Koseki, “Anatagonomy: A Personalized Newspaper on the World Wide Web,” International Journal of Human-Computer Studies, vol.46, no.6, June, 1997, pp.789-803
  – Learns user preferences based on both explicit feedback and implicit feedback for retrieving WWW-based newspaper articles
  – Explicit feedback: user rating
  – Implicit feedback: scrolling and enlarging operations
  – User profile: user registered keywords


• CiteSeer
  – Tx, NEC, NMIACS
  – K. Bollacker, S. Lawrence, and L. Giles, “CiteSeer: An Autonomous System for Processing and Organizing Scientific Literature on the Web,” Working Notes of Learning from Text and the Web, Conf. Automated Learning and Discovery (CONALD’98), Carnegie Mellon University, Pittsburgh, 1998, http://www.cs.cmu.edu/~conald/conald.shtml


• NewsWeeder
  – CMU
  – K. Lang, “NewsWeeder: Learning to Filter Netnews,” Proceedings of the 12th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, 1995, pp.331-339
• Itsy Bitsy Spider
  – University of Arizona
  – H. Chen, Y. Chung, M. Ramsey, and C. C. Yang, "A Smart Itsy Bitsy Spider for the Web," Journal of the American Society for Information Science, vol.49, no.7, 15 May 1998, pp.604-618.


Hyperlink Analysis


Document Retrieval vs Web Information Retrieval

• Document retrieval
  – Find all documents relevant to a user query in a given collection of documents
  – Exclusively based on analysis of the words in the documents
• Web Retrieval
  – Find all documents (Web pages) relevant to a user query in the World Wide Web
  – Based on analysis of the words in the Web pages and of the hyperlink structure of the Web or markup language tags


Why are hyperlinks useful?

• Web page authors use hyperlinks to provide valuable information content that is useful to readers
  – Navigational aids
    • Point to other pages in the Web site
  – Access to documents that augment the content of the current page
    • Usually point to high quality pages
• Hyperlink analysis improves the relevance of search results
  – Hyperlink analysis is adopted in the ranking mechanism of most Internet search engines


Hyperlink Analysis

• Assumption 1
  – A hyperlink from page A to page B is a recommendation of page B by the author of page A
• Assumption 2
  – If page A and page B are connected by a hyperlink, they might be on the same topic
• Hyperlink analysis is used in two tasks in Web information retrieval
  – Crawling
  – Ranking


Crawling

• Crawling is the process of collecting Web pages
  – In typical document retrieval, the collection is given
  – In Web information retrieval, the search engine needs to find documents
• Crawling starts from a set of source Web pages
  – It follows the source page hyperlinks to find more Web pages
  – It repeats on each new set of pages and continues until no more new pages are discovered or until a predetermined number of pages have been collected
  – It uses the metaphor of a spider “crawling” along the Web
• The crawler decides in which order to collect hyperlinked pages that have not yet been crawled
  – Hyperlink analysis provides a means for judging the quality of pages, which is used to prioritize crawling
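The crawling loop can be sketched as a breadth-first traversal with a page limit. The example runs over an in-memory link graph instead of fetching real pages; `link_graph`, `crawl`, and `max_pages` are hypothetical names introduced for illustration.

```python
from collections import deque

def crawl(link_graph, seeds, max_pages=100):
    """Collect pages breadth-first until none are left or max_pages is reached."""
    frontier = deque(seeds)   # pages discovered but not yet collected
    seen = set(seeds)
    collected = []
    while frontier and len(collected) < max_pages:
        page = frontier.popleft()
        collected.append(page)
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return collected

# Hypothetical four-page web: page -> pages it links to.
link_graph = {"A": ["B", "C"], "B": ["C", "D"], "C": ["D"], "D": ["A"]}
all_pages = crawl(link_graph, ["A"])
first_two = crawl(link_graph, ["A"], max_pages=2)
```

Replacing the FIFO frontier with a priority queue keyed by a link-based quality score would give the prioritized ordering described in the last bullet.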


Ranking

• Ranking is the process of ordering the returned documents in descending order of relevance
• Ranking uses connectivity-based ranking in hyperlink analysis
• Typical document retrieval uses only the words in the documents
  – Vector space model by Salton
• In Web information retrieval, using only word occurrence analysis is not reliable.
  – Some Web page authors may add invisible text to manipulate the ranking algorithm due to commercial interests
• Hyperlink analysis uses the content of other pages to rank the current page


Connectivity-based Ranking

• Two classes of schemes
  • Query-independent schemes
    – Assign a score to a page independent of a given query
  • Query-dependent schemes
    – Assign a score to a page in the context of a given query
• Directed Graph representation
  – Link graph
    • Each Web page is modeled by a node
    • If page A contains a hyperlink to page B, there exists a directed edge (A,B)
    • Used for ranking
  – Undirected co-citation graph
    • Nodes A and B are connected by an undirected edge if and only if there exists a third page C hyperlinking to both A and B
      – A and B are co-cited by C
    • Used for categorization
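Both graph representations can be derived from the same table of outgoing links. The sketch below uses a made-up `links` dictionary and a hypothetical helper `cocitation_edges`; undirected edges are stored as alphabetically ordered pairs.

```python
from itertools import combinations

# Hypothetical pages: page -> pages it hyperlinks to.
links = {"C": ["A", "B"], "E": ["A", "B", "D"], "A": ["D"]}

# Link graph: one directed edge (src, dst) per hyperlink.
link_edges = {(src, dst) for src, targets in links.items() for dst in targets}

def cocitation_edges(links):
    """Undirected edges between pages that share a common citing page."""
    pairs = set()
    for src, targets in links.items():
        cited = sorted(set(targets) - {src})  # the citing page must be a third page
        pairs.update(combinations(cited, 2))
    return pairs

cocited = cocitation_edges(links)
```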


Query-independent Ranking

• Measure the intrinsic quality of a page
• Assume that the more hyperlinks point to a page, the better the page
  – However, this does not distinguish between a page pointed to by a number of low-quality pages and a page pointed to by the same number of high-quality pages
  – PageRank resolves this problem
• PageRank computes the score by weighting each hyperlink to the page proportionally to the quality of the page containing the hyperlink
  – To determine the quality, PageRank runs recursively with an arbitrary initial setting


PageRank

• PageRank P(i) of a page i is defined as

  $P(i) = (1 - d) + d \sum_{(j,i) \in E} \frac{P(j)}{O_j}$

  – d is a constant usually set between 0.8 and 0.9
  – $O_j$ is the number of edges leaving page j (the number of hyperlinks in page j)
  – E is the set of directed edges of the link graph
• PageRank of a page i depends on the PageRank of each page j pointing to i
• PageRank is used by Google


PageRank

[Figure: link graph of pages A, B, C and D, annotated with their PageRank scores]

Iter  A (out=2)  B (out=2)  C (out=1)  D (out=1)
0     1.0        1.0        1.0        1.0
1     1.0        0.575      1.0        1.425
2     1.36125    0.575      0.819375   1.244375
3     1.207719   0.728531   0.972906   1.090844
4     1.077217   0.66328    0.972906   1.286596
5     1.243607   0.607817   0.889712   1.258865
6     1.220035   0.678533   0.936855   1.164577
7     1.139891   0.668515   0.956891   1.234703
8     1.199498   0.634453   0.918572   1.247476
9     1.210355   0.659787   0.929429   1.200429
10    1.170365   0.664401   0.94481    1.220424
11    1.187361   0.647405   0.929775   1.235459
12    1.20014    0.654628   0.929775   1.215456
13    1.183138   0.66006    0.938277   1.218526
14    1.185747   0.652834   0.933359   1.22806

First iteration, with d = 0.85:
PR(A) = (1-0.85) + 0.85 * {PR(D) / outdegree(D)} = 0.15 + 0.85 * 1.0 / 1 = 1
PR(B) = (1-0.85) + 0.85 * {PR(A) / outdegree(A)} = 0.15 + 0.85 * 1.0 / 2 = 0.575
PR(C) = (1-0.85) + 0.85 * {PR(A) / outdegree(A) + PR(B) / outdegree(B)} = 0.15 + 0.85 * (1.0 / 2 + 1.0 / 2) = 1
PR(D) = (1-0.85) + 0.85 * {PR(B) / outdegree(B) + PR(C) / outdegree(C)} = 0.15 + 0.85 * (1.0 / 2 + 1.0 / 1) = 1.425
After 14 iterations, the PageRank scores tend to converge …
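The iteration in this example can be reproduced with a few lines of code. The sketch assumes the link structure implied by the four PR formulas (A links to B and C, B links to C and D, C links to D, D links to A) and updates all scores simultaneously from the previous iteration's values, which is what the table rows suggest; the names `pagerank` and `out_links` are hypothetical.

```python
def pagerank(out_links, d=0.85, iterations=14):
    """Iterate P(i) = (1 - d) + d * sum(P(j) / O_j) over pages j linking to i."""
    scores = {page: 1.0 for page in out_links}  # arbitrary initial setting
    for _ in range(iterations):
        new_scores = {}
        for page in out_links:
            incoming = sum(
                scores[src] / len(targets)
                for src, targets in out_links.items()
                if page in targets
            )
            new_scores[page] = (1 - d) + d * incoming
        scores = new_scores  # simultaneous update
    return scores

# Link graph inferred from the worked formulas (an assumption).
out_links = {"A": ["B", "C"], "B": ["C", "D"], "C": ["D"], "D": ["A"]}
after_one = pagerank(out_links, iterations=1)
after_two = pagerank(out_links, iterations=2)
after_fourteen = pagerank(out_links, iterations=14)
```

Updating the scores in place instead of from a copy would still converge, but the intermediate rows would differ from the table.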


Query-dependent Ranking

• Build a query-specific graph, called a neighborhood graph
• Carriere and Kazman propose the following approach to build a neighborhood graph
  – A start set of documents matching the query is fetched from a search engine (e.g. top 200 matches)
  – The start set is augmented by its neighborhood (the set of documents that either hyperlink to or are hyperlinked to by documents in the start set)
  – Each document in both the start set and the neighborhood is modeled by a node
  – Hyperlinks between pages on the same Web host can be omitted since the authors might be affiliated
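The construction steps can be sketched over a small table of hyperlinks. Everything concrete here is made up for illustration: the `.example` URLs, the `links` table, and the function name `neighborhood_graph`. A real system would obtain the start set from a search engine rather than from a hard-coded list.

```python
from urllib.parse import urlsplit

def neighborhood_graph(links, start_set):
    """Return (nodes, edges) for the start set plus its one-hop neighborhood."""
    start = set(start_set)
    nodes = set(start)
    for src, targets in links.items():
        for dst in targets:
            if src in start or dst in start:
                nodes.update((src, dst))  # linked to or from the start set
    edges = {
        (src, dst)
        for src, targets in links.items()
        for dst in targets
        if src in nodes and dst in nodes
        and urlsplit(src).netloc != urlsplit(dst).netloc  # omit same-host links
    }
    return nodes, edges

# Hypothetical hyperlink table: page -> pages it links to.
links = {
    "http://a.example/1": ["http://a.example/2", "http://b.example/x"],
    "http://c.example/p": ["http://a.example/1"],
    "http://d.example/q": ["http://e.example/r"],
}
nodes, edges = neighborhood_graph(links, ["http://a.example/1"])
```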


Query-dependent Ranking

• Neighborhood graphs consist of thousands of nodes, relatively small compared with the link graph in PageRank
• PageRank is an indegree-based ranking
  – it produces a similar ranking in a neighborhood graph
• In query-dependent ranking, another approach divides pages into two classes in a neighborhood graph
  – Authorities – pages with good content on the topic
  – Hubs – pages with many hyperlinks to pages on the topic
• Kleinberg developed the hyperlink-induced topic search (HITS) algorithm
  – It determines the good hubs and authorities


HITS

• Given a user query,
  – HITS iteratively computes hub and authority scores for each node in the neighborhood graph
  – Ranks the nodes by hub and authority scores

1. Let N be the set of nodes in the neighborhood graph
2. For every node A in N, let Hub[A] be its hub score and Aut[A] be its authority score
3. Initialize Hub[A] to 1 for all A in N
4. While the vectors Hub and Aut have not converged
   1. For all A in N, $\mathrm{Aut}[A] = \sum_{(B,A) \in N} \mathrm{Hub}[B]$ (sum over hyperlinks from B to A in the neighborhood graph)
   2. For all A in N, $\mathrm{Hub}[A] = \sum_{(A,B) \in N} \mathrm{Aut}[B]$ (sum over hyperlinks from A to B in the neighborhood graph)
   3. Normalize the Hub and Aut vectors

• Elementary linear algebra shows that the Hub and Aut vectors will eventually converge
  – No bound on the number of iterations is known
  – In practice, the vectors converge quickly


[Figure: example graph G with four nodes A, B, C and D]

$\mathrm{Authority}(A) = \sum_{(B,A) \in G} \mathrm{Hub}(B)$
$\mathrm{Hub}(A) = \sum_{(A,B) \in G} \mathrm{Authority}(B)$

Normalization:
$\sum_{\forall A} \mathrm{Authority}(A) = 1$
$\sum_{\forall A} \mathrm{Hub}(A) = 1$

Iteration  Authority(A)  Hub(A)    Authority(B)  Hub(B)    Authority(C)  Hub(C)    Authority(D)  Hub(D)
0          0.25          0.25      0.25          0.25      0.25          0.25      0.25          0.25
1          0.166667      0.333333  0.333333      0.166667  0.166667      0.333333  0.333333      0.166667
2          0.1           0.3       0.4           0.2       0.2           0.4       0.3           0.1
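The table values above can be reproduced with a short script. The edge list is not given in the extracted text; the one used here (A→B, A→C, B→D, C→B, C→D, D→A) is inferred from the table and should be read as an assumption. The sketch starts every score at 0.25, computes both vectors from the previous iteration's values, and normalizes each to sum to 1, as the table suggests; the function name `hits` is hypothetical.

```python
def hits(nodes, edges, iterations=2):
    """Return (authority, hub) scores after the given number of iterations."""
    aut = {n: 1.0 / len(nodes) for n in nodes}
    hub = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Authority(A): sum of hub scores of pages linking to A.
        new_aut = {n: sum(hub[s] for s, t in edges if t == n) for n in nodes}
        # Hub(A): sum of authority scores of pages A links to.
        new_hub = {n: sum(aut[t] for s, t in edges if s == n) for n in nodes}
        aut_total, hub_total = sum(new_aut.values()), sum(new_hub.values())
        aut = {n: v / aut_total for n, v in new_aut.items()}
        hub = {n: v / hub_total for n, v in new_hub.items()}
    return aut, hub

nodes = ["A", "B", "C", "D"]
# Edge list inferred from the table values (an assumption).
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "B"), ("C", "D"), ("D", "A")]
aut1, hub1 = hits(nodes, edges, iterations=1)
aut2, hub2 = hits(nodes, edges, iterations=2)
```

The numbered steps on the previous slide instead compute Hub from the freshly updated Aut within the same pass; that variant converges to the same ranking but gives different intermediate rows.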


Applications of Hyperlink Analysis in Web Information Retrieval

• Search-by-example
• Mirrored Hosts
• Web Page Classification
• Geographical Scope


Search-by-example

• Looks for pages related to a given page
• HITS and co-citation graph perform well in this problem
  – Frequent co-citation indicates relatedness


Mirrored Hosts

• The path of a Web page is the part of the URL following the host
• E.g. in http://www.apple.com/ipod/index.html
  – www.apple.com is the host
  – /ipod/index.html is the path

• Two hosts, H1 and H2, are mirrors if and only if for every document on H1, there is a highly similar document on H2 with the same path, and vice versa

• Mirrors exhibit a very similar hyperlink structure both within the host and among the mirror host and other hosts

• Mirror Web hosts waste space in the index data structure and lead to duplicate results in Web search engines

• Hyperlink analysis with IP address analysis and URL pattern analysis can detect many near-mirrors
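The host/path split and the mirror definition can be illustrated with the standard library. This is a content-based sketch only: the helper names, the 0.9 similarity threshold, and the use of `difflib.SequenceMatcher` as the "highly similar" test are assumptions, and it does not implement the hyperlink, IP address, or URL pattern analysis mentioned in the last bullet.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit

def host_and_path(url):
    """Split a URL into its host and path components."""
    parts = urlsplit(url)
    return parts.netloc, parts.path

def are_mirrors(docs_h1, docs_h2, threshold=0.9):
    """docs_h1 and docs_h2 map path -> document text for two hosts."""
    if docs_h1.keys() != docs_h2.keys():
        return False  # some path exists on only one of the hosts
    return all(
        SequenceMatcher(None, docs_h1[path], docs_h2[path]).ratio() >= threshold
        for path in docs_h1
    )

host, path = host_and_path("http://www.apple.com/ipod/index.html")
same = are_mirrors({"/a": "hello world"}, {"/a": "hello world"})
different_paths = are_mirrors({"/a": "hello world"}, {"/b": "hello world"})
different_text = are_mirrors({"/a": "hello world"}, {"/a": "zzzzzz"})
```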


Web page Categorization

• Hyperlink analysis can compute statistics about groups of Web pages
  – E.g. average length and the percentage that are in a specific language

• PageRank-like random walks can sample Web pages in an almost uniform distribution
  – It can be used to measure various properties of Web pages


Geographical Scope

• Some Web pages are of interest only to people in a given region
  – E.g. a weather-forecasting page is interesting only to the region it covers
  – E.g. the Internal Revenue Service page is of interest to US taxpayers

• A page’s hyperlink structure reflects its range of interest
  – Local pages are mostly hyperlinked to by pages from the same region
  – Hyperlinks to pages of nationwide interest are roughly uniform throughout the country
• Such analysis helps search engines tailor query results to the region the user is in