Online Search Scope Reconstruction by Connectivity Inference

Michael Chan, Stephen Chi-fai Chan and Cane Wing-ki Leung
Department of Computing
The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong SAR
{csmkchan, csschan}@comp.polyu.edu.hk, [email protected]

Abstract

To cope with the continuing growth of the web, improvements should be made to the brute-force techniques commonly used by robot-driven search engines. We propose a model that strikes a balance between robot-based and directory-based search engines by expanding the search scope of conventional directories to automatically include related categories. Our model makes use of a knowledge-rich and well-structured corpus to infer relationships between documents and topic categories. We show that the hyperlink structure of Wikipedia articles can be effectively exploited to identify relations among topic categories. In our experiments, the average recall achieved is 91% of Google's, and the precision is between 85% and 215% of Google's.

1. Introduction

An alternative web searching approach is desirable because brute-force techniques are becoming increasingly inefficient as the size of the web grows. Each of the two most common types of search services, robot-based and directory-based search engines, has its own strengths and weaknesses. For example, robot-based search engines, such as Google (www.google.com), adopt brute-force techniques and achieve high recall but low precision; these services typically return a wide variety of documents and search the entire universe in the process. With directory-based search engines, such as Yahoo! Directory (dir.yahoo.com) and ODP (dmoz.org), searching is confined to a single topic selected by the user, so the search process is relatively more efficient; however, the documents retrieved are confined to the selected category, which leads to higher precision but lower recall.

Our goal is to broaden the search scope, which determines the collection of documents to be processed for a given query, of the conventional directory-based approach by discovering semantically related topics. Rather than taking an extensive content-based approach, we adopt one that exploits the most basic and essential elements of web documents, namely word tokens and hyperlinks. To reconstruct the search scope, our Search Scope Reconstruction (SSR) model discovers the underlying related topics of documents by exploiting a knowledge base containing real-world knowledge. One such source is Wikipedia, an online encyclopaedia covering a wide range of topics. Although it is not an exhaustive source of real-world knowledge, it contains over 1.7 million articles and is sufficient for our purpose. Wikipedia articles also exhibit good hyperlink structures, in the sense that most hyperlinks within an article, links to other articles (internal links) in particular, are usually semantically meaningful and salient.

Based on the observation that the semantics of web documents is not conveyed solely by textual content, our approach uses Wikipedia articles to infer the related topics of a given document: we first identify semantically salient word tokens in the document, then identify the topic and the related topics of those tokens based on the internal links in the corresponding Wikipedia articles. A ranking of the document's semantically related topics can then be produced by accumulating those links.
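
For illustration, the following minimal Python sketch outlines this pipeline; the dictionaries are toy stand-ins for the real Wikipedia corpus and directory, not our actual data structures.

    from collections import Counter

    # Toy stand-ins for the real resources (illustrative data only):
    TITLE_TO_ARTICLE = {"hillary": "Hillary Rodham Clinton"}           # salient word -> article title
    INTERNAL_LINKS = {"Hillary Rodham Clinton":
                      ["Bill Clinton", "United States Senate"]}        # article -> linked articles
    ARTICLE_CATEGORIES = {"Bill Clinton": {"Bill Clinton"},
                          "United States Senate": {"U.S. Senate"}}     # article -> directory categories

    def rank_related_topics(salient_words):
        """Accumulate 'virtual links' from matched articles to categories
        and rank the categories by link count (descending)."""
        scores = Counter()
        for w in salient_words:
            article = TITLE_TO_ARTICLE.get(w)       # exact title match
            if article is None:
                continue
            for dest in INTERNAL_LINKS.get(article, []):
                for cat in ARTICLE_CATEGORIES.get(dest, set()):
                    scores[cat] += 1                # one virtual link
        return [c for c, _ in scores.most_common()]

    print(rank_related_topics(["hillary"]))         # -> ['Bill Clinton', 'U.S. Senate']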

2. Related work

Although the issue of reconstructing search scope has not been widely explored, our previous work [3] presents a probabilistic approach that reconstructs the search scope by tracing only the out-links from the documents contained in the Yahoo! Directory. However, we believe plain hyperlink analysis may not be sufficient when the given documents exhibit poor link structures, e.g., news articles and blogs.

There are various approaches to identifying related topics in web documents, and many research efforts argue that the most realistic and accurate approach is to incorporate natural language processing techniques [1] to discover thematic roles, grammatical relations, and so on. However, producing high-quality NLP results often demands expensive computations.


Fortunately, extracting the semantics of words using bag-of-words approaches has also been widely explored. An example of a successful linear-algebra-based technique is LSI [5], which applies a noise reduction technique to extract synonyms and related words.

The key observation that has brought great improvements in search engine rankings is that web documents and the hyperlinks between them form a directed graph. For example, PageRank [8] treats the web graph with a random-walk approach and has produced excellent authoritative rankings. Kleinberg [7] also developed a connectivity analysis algorithm, the crux of which is a mutual reinforcement definition for identifying authoritative documents. Moreover, Topic-sensitive PageRank [6] makes PageRank more context-sensitive by computing a ranking vector per topic.
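
To make the random-walk idea concrete, a textbook power-iteration sketch of PageRank follows; this is a standard simplification for illustration, not the exact formulation of [8].

    def pagerank(out_links, d=0.85, iters=100):
        """out_links[i] lists the pages that page i links to.
        Returns the rank vector of the d-damped random walk."""
        n = len(out_links)
        ranks = [1.0 / n] * n
        for _ in range(iters):
            new = [(1.0 - d) / n] * n
            for i, dests in enumerate(out_links):
                if dests:                               # distribute rank over out-links
                    share = d * ranks[i] / len(dests)
                    for j in dests:
                        new[j] += share
                else:                                   # dangling page: jump anywhere
                    for j in range(n):
                        new[j] += d * ranks[i] / n
            ranks = new
        return ranks

    # Tiny 3-page web graph: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1.
    print(pagerank([[1], [2], [0, 1]]))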

3. Related categories identification

Connectivity analysis can be useful for discovering relationships between documents exhibiting link structures; we therefore perform a connectivity analysis on Wikipedia articles in order to identify categories within a web directory, e.g., Yahoo! or ODP, that are related to a given topic. The topics within a directory are hereafter referred to as categories and the topics within Wikipedia simply as topics. In addition, Wikipedia articles are simply referred to as articles.

Web connectivity analysis exploits the linkage information between documents, based on the assumption that a link between two documents implies that they contain related content. The power of connectivity analysis therefore depends on the quality of the hyperlink structure, which in turn depends on the ability of the authors. It is known that Wikipedia exhibits a rich and meaningful hyperlink structure, in which citations are usually present for salient topics.

3.1. The SSR Model

Technically, a well-compiled directory, which maps all Wikipedia articles to its categories, is required for an accurate connectivity analysis. We hereafter denote by D the set of all documents, by A the set of all Wikipedia articles, and by C the set of all categories in a directory.

Definition 1. For a well-compiled directory, there exists a mapping m : A → 2^C, where A is the article set and C is the category set of the directory, such that for all a in A, m(a) returns a set of categories Φ ⊆ C.

In practice, these mappings are usually dependent on the directory provider.

Definition 2. The set of related categories of an article a is: ∀a : ArtToCats_a = {x | ∃y ∈ O_a : x ∈ m(y)}, where ∀a ∈ A : O_a ⊆ A contains all destination articles of the internal links within a.

In other words, ArtToCats_a contains the categories of the articles that are directly linked from a.

Definition 3. Let W_d be the set of salient words contained in a document d. There exists a mapping f : W_d → A such that for any w in W_d, f(w) returns the article in A with w as its title.

Because we ignore all ambiguous salient words at this stage of our work, f can be seen as a one-to-one function. We regard salient words as words that together represent the context of a document.

Definition 4. The set of all related categories to a document d is: ∀d : DocToCats_d = {ArtToCats_a | ∃w ∈ W_d : a = f(w)}.

Definition 5. For any category c in C, there exists a set Sub_c ⊆ D containing all documents subsumed by c, such that ∀x : x ∈ Sub_c iff c ∈ m(x).

Definition 6. The set of all related categories to a category c ∈ C is: ∀c : CatToCats_c = {x | ∃d ∈ Sub_c : x ∈ DocToCats_d}.

Proposition 1. ∀φ, ϕ ∈ C, φ ≠ ϕ : [(∃Z : Z ∈ CatToCats_ϕ, φ ∈ Z) ↔ (∃x, ∃d, ∃w, ∃y : d ∈ Sub_ϕ, w ∈ W_d, y = f(w), x ∈ Sub_φ, x ∈ O_y)]

We state Proposition 1 without proof due to space limitations.

Using Definition 4, we define a total preorder ⪯_c for any category c: it orders the related categories of c by relevancy score, i.e., by the accumulated number of "virtual links" to each related category of c, based on the idea that the more virtual links there are to a category, the stronger the relationship with it.
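
The following Python sketch illustrates Definitions 2, 4, and 6 together with the virtual-link ranking; O, m, and f are toy instances of the mappings defined above, for illustration only.

    from collections import Counter

    O = {"Hillary Rodham Clinton": ["Bill Clinton", "United States Senate"]}          # internal links
    m = {"Bill Clinton": {"Bill Clinton"}, "United States Senate": {"U.S. Senate"}}   # article -> categories
    f = {"hillary": "Hillary Rodham Clinton"}                                         # salient word -> article

    def art_to_cats(a):                         # Definition 2
        return {c for dest in O.get(a, []) for c in m.get(dest, set())}

    def doc_to_cats(W_d):                       # Definition 4 (a collection of category sets)
        return [art_to_cats(f[w]) for w in W_d if w in f]

    def cat_to_cats_ranked(sub_c):              # Definition 6 plus the preorder
        """sub_c: the salient-word sets of the documents subsumed by c.
        Returns related categories ranked by accumulated virtual links."""
        votes = Counter()
        for W_d in sub_c:
            for cats in doc_to_cats(W_d):
                votes.update(cats)
        return votes.most_common()

    print(cat_to_cats_ranked([{"hillary"}]))    # -> [('Bill Clinton', 1), ('U.S. Senate', 1)]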

4. Evaluation

4.1. Implementation

The work reported here is based on the Yahoo! Directory and Wikipedia's English corpus as of April 20, 2006, which contains more than 1.7 million articles. In our implementation, the mapping f relies on directly matching a given word with the title of an article. Further, we give higher weights to topics that are named entities themselves. To this end, we incorporated the named entity recognizer from GATE [4], modified to include our own rules, named entity types, and gazetteer lists.
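
A sketch of how this matching step might look; the title index is a toy stand-in, and the named-entity weight of 2.0 is an arbitrary illustrative value, not the weight actually used.

    ARTICLE_TITLES = {"hillary rodham clinton", "holy grail"}    # toy title index

    def match_article(word, is_named_entity):
        """Map a salient word to an article by exact title match
        (the mapping f), weighting named entities more heavily."""
        key = word.lower()
        if key not in ARTICLE_TITLES:
            return None
        weight = 2.0 if is_named_entity else 1.0     # illustrative weight only
        return key, weight

    print(match_article("Holy Grail", is_named_entity=True))     # -> ('holy grail', 2.0)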


Figure 1. Scope coverage for query "hillary clinton" excluding regional topics (accumulated coverage, 40% to 100%, vs. top K pages returned, K = 20 to 400, for scope levels Lvl1 to Lvl5).

Figure 2. Scope coverage for query "da vinci code" excluding regional topics (accumulated coverage vs. top K pages returned, K = 20 to 400, for scope levels Lvl1 to Lvl5).

Because there is no standard way to evaluate objective measures of document-topic relevancy, one reasonable approach is to compare against a state-of-the-art algorithm; we chose Google search. However, Google does not have a directory for its index, so the implicit categories of the documents returned are not known to us. One solution is to perform text categorization to assign Yahoo! categories to the documents returned. To this end, we adopted our own algorithm, which is based on Definitions 2, 4, and 6. Each document is assigned at most five related categories, those ranked highest according to the preorder described. For each relevant document d returned for some query in the context of a category c, we define c to cover d if c is one of the five selected categories.
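
In code, the coverage test can be sketched as follows; the documents' category lists are toy data standing in for the output of our categorization algorithm.

    def covers(c, doc_categories):
        """A category c covers a document d iff c is among the (at most)
        five highest-ranked categories assigned to d."""
        return c in doc_categories[:5]

    def coverage(c, returned_docs):
        """Fraction of the returned documents that c covers."""
        return sum(covers(c, cats) for cats in returned_docs) / len(returned_docs)

    # Three returned documents with their assigned categories (toy data):
    docs = [["U.S. Senate", "Bill Clinton"], ["U.S. Senate"], ["Holy Grail"]]
    print(coverage("U.S. Senate", docs))        # -> 0.666...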

4.2. Results

To help judge whether our implementation can identify relevant categories, three experiments were conducted. First, Table 1 lists the six most related categories found by the SSR model for three selected categories, in descending order of the scores determined by the preorder ⪯_c, where c is shown in the left column. The resulting related categories can be regarded as reasonably relevant. However, the relative ordering derived does not always correctly reflect the relative degree of relevancy; e.g., Leonardo Da Vinci can be considered as being more relevant to Dan Brown > The Da Vinci Code than to Jesus.

Figure 3. Scope coverage for query "hillary clinton" AND "homeland security" excluding regional topics (accumulated coverage vs. top K pages returned, K = 20 to 400, for scope levels Lvl1 to Lvl5).

The scope coverage tests performed, with results shown in Figs. 1 to 3, aim at analyzing the scope coverage for different queries. Scope coverage is determined by the number of documents inside the search scope. Each chart depicts the accumulated coverage for scope levels one to five, where level one represents the selected category itself and the following levels represent the related categories in descending order of scores. The results shown ignore all regional categories (e.g., United States) because these categories usually contain a large number of documents and are excessively dominant over specialized categories.
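
The level-wise accumulation can be sketched as below, assuming for simplicity that each level beyond the first adds exactly one related category; the data are illustrative.

    def accumulated_coverage(selected, related_ranked, doc_cats, levels=5):
        """Coverage of the returned documents as the scope widens level by
        level: level 1 is the selected category alone; each later level
        adds the next related category (regional categories pre-filtered)."""
        scope, results = {selected}, []
        for lvl in range(1, levels + 1):
            if lvl > 1 and lvl - 2 < len(related_ranked):
                scope.add(related_ranked[lvl - 2])          # widen the scope
            covered = sum(bool(scope & set(cats)) for cats in doc_cats)
            results.append(covered / len(doc_cats))
        return results

    docs = [["A", "B"], ["B"], ["C"], ["D"]]                # toy top-K documents
    print(accumulated_coverage("A", ["B", "C", "D"], docs))
    # -> [0.25, 0.5, 0.75, 1.0, 1.0]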

We captured the top 400 documents returned by Google for queries on the topics of Senator Hillary Rodham Clinton and Dan Brown > The Da Vinci Code. Although several other works (e.g., [7] and [2]) have based their evaluations on the top 200 results returned by other search engines, we believe it more appropriate to use Google's top 400 here due to its larger index. Because Google's algorithm is reasonably successful at sorting documents by relevancy, we believe it reasonable to assume that the higher a document is ranked, the more relevant it is to the given query; documents ranked lower than 400 were assumed to be irrelevant. The charts shown, along with our other experiments not shown due to space limitations, suggest that our model requires only five levels to achieve at least 90% coverage of the top 20 documents returned by Google. Moreover, our implementation achieves a mean coverage of 89.3% of the top 200 returned. An interesting and important observation is that the scope coverage gradually narrows as the ranks of the documents decrease. This trend may suggest that our model, along with the categorization algorithm used, culls irrelevant documents. From these results, we estimate the average recall of SSR to be about 91% of Google's (based on the 60 samples shown in Figs. 1 to 3).

We evaluate the retrieval performance of the SSR model against Google by comparing the number of documents within the coverage containing some selected words with the number of documents Google returns when given those words as queries.


Category                        | Related categories identified
Senator Hillary Rodham Clinton  | United States; Bill Clinton; U.S. Government Executive Branch; U.S. Senate; Chelsea Clinton; President George W. Bush
Dan Brown > The Da Vinci Code   | Jesus; Catholic; United States; Holy Grail; Movie Titles > The Da Vinci Code (2006); Pope John Paul II
Film Directors > Ang Lee        | Academy Awards; United States; Brokeback Mountain; English Language; Golden Globe Awards; Heath Ledger

Table 1. Related categories identified for three Yahoo! categories

                                            SSR      Google
Scope coverage (mil. docs)                  550.0    20,000+
Docs within coverage (mil.) containing:
  hillary                                   30.0     28.1
  clinton                                   58.3     72.7
  homeland                                  18.2     43.0
  "hillary clinton"                         1.7      3.1

Table 2. Scope coverage test results for the Senator Hillary Rodham Clinton category

Given our limited resources, categorizing all documents in the "universe" is infeasible, and sampling from this collection would represent an extremely small portion. We therefore took a more accurate approach by sampling from a smaller collection based on Proposition 1, which we call a candidate pool. However, in our experiment the use of a candidate pool was not effective, as it contains over 21 billion documents. We therefore sampled only the top 400 documents returned by Google. Nonetheless, since Google has an excellent ranking algorithm, we believe that the top 400 are a sufficient representation. In addition, as already discussed, scope coverage generally narrows towards low-ranked documents, so we estimate the upper bound of the overall coverage to be the average coverage for the top 400.

The precision of SSR can be estimated using the results in Table 2, which are based on the category Senator Hillary Rodham Clinton. For example, if a user submits the query "hillary clinton" to search for documents about Hillary Clinton, the senator and former First Lady of the U.S., our implementation is estimated to return at most 1.7 million non-unique documents, while Google returns about 3.1 million presumably unique documents. This suggests that the precision of our model is between 85% (worst-case scenario, matching hillary) and 215% (best-case scenario, matching homeland) of Google's. We believe the overall results are very encouraging; however, since robot-based search engines today typically suffer from extremely low precision, the estimated increase in precision may provide only an insignificant practical improvement.

5. Conclusion

We have described a model for reconstructing the search scope in web directory-based searching. Our model relies on an existing web directory and exploits the wealth of knowledge and hyperlink structure of Wikipedia articles. In our experiments, the average recall achieved using the Yahoo! Directory is 91% of Google's, whereas the precision can be doubled. An interesting observation is that the scope coverage narrows as the ranks of the documents decrease, which might imply that the number of rejected documents increases for lower-ranked documents. Our work has shown not only that inference techniques can produce promising results for related category identification, but also that the hyperlink structures of Wikipedia articles can be useful for semantic analysis.

References

[1] J. Allen. Natural Language Processing. Benjamin/Cummings Publ. Co., Menlo Park, CA, 1987.
[2] F. Can, R. Nuray, and A. B. Sevdik. Automatic performance evaluation of web search engines. Information Processing & Management, 40(3):495–514, May 2004.
[3] S. Chan, T. Lai, E. Dang, M. Chan, and C. Leung. Increasing relevance of internet search results using a topic network. In Proc. of the IADIS e-Society 2006 Conference, 2006.
[4] H. Cunningham, R. Gaizauskas, K. Humphreys, and Y. Wilks. Experience with a language engineering architecture: Three years of GATE, 1999.
[5] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
[6] T. H. Haveliwala. Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search. IEEE Transactions on Knowledge and Data Engineering, 15(4):784–796, 2003.
[7] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
[8] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.
