Online Search Scope Reconstruction by Connectivity Inference
Michael Chan, Stephen Chi-fai Chan and Cane Wing-ki Leung
Department of Computing, The Hong Kong Polytechnic University
Hung Hom, Kowloon, Hong Kong SAR
{csmkchan, csschan}@comp.polyu.edu.hk, [email protected]
Abstract
To cope with the continuing growth of the web, improvements should be made to the brute-force techniques commonly used by robot-driven search engines. We propose a model that strikes a balance between robot- and directory-based search engines by expanding the search scope of conventional directories to automatically include related categories. Our model makes use of a knowledge-rich and well-structured corpus to infer relationships between documents and topic categories. We show that the hyperlink structure of Wikipedia articles can be effectively exploited to identify relations among topic categories. Our experiments show that the average recall and precision rates achieved are 91% and between 85% and 215% of Google's, respectively.
1. Introduction
An alternative web searching approach is desirable be-
cause brute-force techniques are becoming increasingly in-
efficient as the size of the web grows. Each of the two
most common types of search services, robot-based and
directory-based search engines, has its own strengths and
weaknesses. For example, robot-based search engines,
such as Google (www.google.com), adopt brute-force tech-
niques and achieve high recall but low precision rates;
these services typically return a wide variety of documents
and search the entire universe in the process. As for
directory-based search engines, such as Yahoo! Directory
(dir.yahoo.com) and ODP (dmoz.org), searching is confined
to only a single topic selected by the user, and therefore
the search process is relatively more efficient; however, the
documents retrieved are confined to the selected category,
which leads to higher precision and lower recall rates.
Our goal is to broaden the search scope of the conventional
directory-based approach, i.e., the collection of documents to
be processed for a given query, by discovering semantically
related topics. Rather than an extensive content-based
approach, we adopt one that exploits the most basic and
essential elements of web documents, e.g., word tokens and
hyperlinks. To reconstruct the
search scope, we try to discover the underlying related top-
ics of documents by exploiting a knowledgebase containing
real world knowledge in our Search Scope Reconstruction
(SSR) model. One such source of knowledge is Wikipedia,
an online encyclopaedia covering a wide range of topics.
Although it is not an exhaustive source of real world knowl-
edge, it contains over 1.7 million articles and is sufficient for
our purpose. Wikipedia articles also exhibit good hyperlink
structures, in the sense that most hyperlinks within an arti-
cle, links to other articles (internal links) in particular, are
usually semantically meaningful and salient.
Based on the observation that the semantics of web doc-
uments is not solely conveyed by textual content, our ap-
proach uses Wikipedia articles to infer related topics of
a given document by first identifying semantically salient
word tokens in the document and then identifying the topic
and the related topics of those tokens based on the internal
links on the corresponding Wikipedia articles. A ranking of
semantically related topics of the document by accumulat-
ing those links can then be produced.
2. Related work
Although the issue of reconstructing search scope has
not been widely explored, our previous work [3] presents a
probabilistic approach that reconstructs the search scope by
tracing only the out-links from the documents contained in
the Yahoo! Directory. However, we believe plain hyperlink
analysis may not be sufficient when the given documents
exhibit poor link structures, e.g., news articles and blogs.
There are various approaches to identifying related top-
ics in web documents, and in many research efforts it is
argued that the most realistic and accurate approach is to
incorporate natural language processing techniques [1] to
discover thematic roles, grammatical relations, and so on.
2007 IEEE/WIC/ACM International Conference on Web Intelligence
0-7695-3026-5/07 $25.00 © 2007 IEEE, DOI 10.1109/WI.2007.39

However, producing high-quality NLP results often demands
expensive computations. Fortunately, extracting the semantics
of words using bag-of-words approaches has also been widely
explored. An example of a successful linear-algebra-based
technique is LSI [5], which applies a noise reduction
technique to extract synonyms and related words.
The key observation that has brought great improve-
ments in search engine rankings is that web documents
and the hyperlinks between them form a directed graph.
For example, PageRank [8] takes on the web graph using
a random walk approach and has produced excellent au-
thoritative rankings. Kleinberg [7] also developed a con-
nectivity analysis algorithm, the crux of which is a mu-
tual reinforcement definition for identifying authoritative
documents. Moreover, Topic-sensitive PageRank [6] en-
ables PageRank to be more context sensitive by computing
a ranking vector per topic.
3. Related categories identification
Connectivity analysis can be useful for discovering rela-
tionships between documents exhibiting link structures; we
therefore perform a connectivity analysis on Wikipedia ar-
ticles in order to identify related categories within a web
directory, e.g., Yahoo! or ODP, to a given topic. The topics
within a directory are hereafter referred to as categories and
the topics within Wikipedia simply as topics. In addition,
Wikipedia articles are simply referred to as articles.
Web connectivity analysis exploits the linkage informa-
tion between documents, based on the assumption that a
link between two documents implies that they contain re-
lated content. The power of connectivity analysis therefore
depends on the quality of the hyperlink structure, which in
turn depends on the ability of the author. Wikipedia is known
to exhibit a rich and meaningful hyperlink structure, in which
citations are usually present for salient topics.
3.1. The SSR Model
Technically, a well-compiled directory, which maps all
Wikipedia articles to its categories, is required for an
accurate connectivity analysis. We hereafter denote by D the
set of all documents, A the set of all Wikipedia articles,
and C the set of all categories in a directory.
Definition 1 For a well-compiled directory, there exists a mapping m : A −→ C, where A is the article set and C is the category set of the directory, such that for all a in A, m(a) returns some categories Φ ⊆ C.
In practice, these mappings are usually dependent on the
directory provider.
Definition 2 The set of related categories of an article a is: ∀a : ArtToCats_a = {x | ∃y ∈ O_a : x ∈ m(y)}, where ∀a ∈ A : O_a ⊆ A contains all destination articles of the internal links within a.
In other words, ArtToCats_a contains the categories of the articles that are directly linked from a.
Definition 3 Let W_d be the set of salient words contained in a document d. There exists a mapping f : W_d −→ A, such that for any w in W_d, f(w) returns the article in A with w as its title.
Because we ignore all ambiguous salient words at this
stage of our work, f can be seen as a one-to-one function.
We regard salient words as words that together represent the
context of a document.
Definition 4 The set of all related categories to a document d is: ∀d : DocToCats_d = {x | ∃w ∈ W_d : x ∈ ArtToCats_f(w)}.
Definition 5 For any category c in C, there exists a set Sub_c ⊆ D containing all documents subsumed by c such that ∀x : x ∈ Sub_c iff c ∈ m(x).
Definition 6 The set of all related categories to a category c ∈ C is: ∀c : CatToCats_c = {x | ∃d ∈ Sub_c : x ∈ DocToCats_d}.
Proposition 1 ∀φ, ϕ ∈ C, φ ≠ ϕ : [(φ ∈ CatToCats_ϕ) ←→ (∃x, ∃d, ∃w, ∃y : d ∈ Sub_ϕ, w ∈ W_d, y = f(w), x ∈ Sub_φ, x ∈ O_y)].
We state Proposition 1 without proof due to space limitations.
Using Definition 4, we define a total preorder ⪯_c for any
category c that orders its related categories by relevancy
score, namely the accumulated number of "virtual links" to
each related category of c. The underlying idea is that the
more virtual links there are to a category, the stronger its
relationship with c.
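Definitions 2, 4, and 6 and the virtual-link ranking can be sketched in Python. The toy dictionaries `internal_links` (the sets O_a) and `article_categories` (the mapping m) are hypothetical stand-ins for the real Wikipedia corpus and directory mapping, not data from the paper:

```python
from collections import Counter

# Hypothetical toy data standing in for the Wikipedia corpus and directory.
internal_links = {                 # O_a: destination articles of internal links in a
    "The Da Vinci Code": ["Jesus", "Holy Grail", "Dan Brown"],
    "Hillary Clinton": ["Bill Clinton", "U.S. Senate"],
}
article_categories = {             # m: article -> set of directory categories
    "Jesus": {"Religion"}, "Holy Grail": {"Religion", "Mythology"},
    "Dan Brown": {"Authors"}, "Bill Clinton": {"U.S. Presidents"},
    "U.S. Senate": {"U.S. Government"},
}

def art_to_cats(a):
    """Definition 2: categories of the articles directly linked from a."""
    cats = set()
    for y in internal_links.get(a, []):
        cats |= article_categories.get(y, set())
    return cats

def doc_to_cat_counts(salient_words, title_of):
    """Definition 4, kept with multiplicities: accumulate 'virtual links'
    from a document's salient words to related categories."""
    counts = Counter()
    for w in salient_words:
        a = title_of(w)            # Definition 3: word -> article titled w
        if a is not None:
            counts.update(art_to_cats(a))
    return counts

# Rank related categories by accumulated virtual links (the preorder ⪯_c).
words = ["The Da Vinci Code", "Hillary Clinton"]
ranking = doc_to_cat_counts(words, lambda w: w if w in internal_links else None)
print(ranking.most_common())       # categories with more virtual links rank higher
```

Keeping the counts (rather than the plain set of Definition 4) is what supplies the relevancy scores that the preorder compares.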
4. Evaluation
4.1. Implementation
The work reported here is based on the Yahoo! directory
and Wikipedia’s English corpus as of April 20, 2006, which
contains more than 1.7 million articles. In our implemen-
tation, the mapping f relies on directly matching a given
word with the title of an article. Further, we give higher
weights to topics that are named entities themselves. To
this end, we incorporated the named entity recognizer from
GATE [4], but modified to include our own rules, named
entity types, and gazetteer lists.
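The title-matching mapping f with a weight boost for named entities might look like the following sketch. `articles_by_title`, `is_named_entity`, and `NE_BOOST` are all illustrative assumptions; the paper uses a modified GATE recognizer for named entities and does not state the exact weights:

```python
# Hypothetical article index keyed by lowercased title.
articles_by_title = {"hillary clinton": "Hillary Clinton",
                     "holy grail": "Holy Grail"}

def is_named_entity(word):
    """Toy stand-in for the modified GATE named-entity recognizer."""
    return word in {"hillary clinton"}

NE_BOOST = 2.0   # assumed weighting; the paper does not give the value

def match_article(word):
    """The mapping f: direct title match of a salient word, returning
    (article, weight), or None when no article title matches exactly."""
    article = articles_by_title.get(word.lower())
    if article is None:
        return None
    weight = NE_BOOST if is_named_entity(word.lower()) else 1.0
    return article, weight

print(match_article("Hillary Clinton"))  # ('Hillary Clinton', 2.0)
print(match_article("Holy Grail"))       # ('Holy Grail', 1.0)
```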
[Figure 1. Scope coverage for query "hillary clinton", excluding regional topics. Accumulated coverage (40%-100%) for scope levels Lvl1-Lvl5 against the top K pages returned (K = 20 to 400).]
[Figure 2. Scope coverage for query "da vinci code", excluding regional topics. Accumulated coverage (40%-100%) for scope levels Lvl1-Lvl5 against the top K pages returned (K = 20 to 400).]
Because there is no standard way to evaluate objective
measures of document-topic relevancy, one reasonable approach
is to compare against a state-of-the-art algorithm; we chose
Google search. However, Google does not have a directory for
its index, so the implicit categories of the documents
returned are not known to us. One solution is to perform text
categorization to assign Yahoo! categories to the documents
returned. To this end, we adopted our own algorithm, which is
based on Definitions 2, 4, and 6. Each document is assigned
at most five related categories, those ranked highest
according to the preorder described. For each relevant
document d returned for some query in the context of a
category c, we define c to cover d if c is one of the five
selected categories.
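The covering criterion and the resulting scope coverage can be phrased as a small predicate and a ratio. The `assignments` dictionary below is a hypothetical stand-in for the top-five category assignments produced by the categorization step:

```python
def covers(c, doc_top5):
    """A category c covers a returned document d iff c is among the
    (at most) five highest-ranked categories assigned to d."""
    return c in doc_top5

def scope_coverage(category, returned_docs, top5_of):
    """Fraction of the returned documents covered by `category`.
    `top5_of` maps a document to its top-five related categories."""
    if not returned_docs:
        return 0.0
    hits = sum(1 for d in returned_docs if covers(category, top5_of(d)))
    return hits / len(returned_docs)

# Toy illustration with hypothetical category assignments:
assignments = {"doc1": ["A", "B"], "doc2": ["C"], "doc3": ["A"]}
print(scope_coverage("A", ["doc1", "doc2", "doc3"], assignments.get))  # ~0.667
```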
4.2. Results
To help judge whether our implementation can identify
relevant categories, three experiments were conducted. First,
Table 1 lists the six most related categories found by the
SSR model for three selected categories, in descending order
of the scores determined by the preorder ⪯_c, where c is
shown in the left column. The resulting related categories
can be regarded as reasonably relevant. However, the relative
ordering derived does not correctly reflect the relative
degree of relevancy; e.g., Leonardo Da Vinci can be
considered more relevant to Dan Brown > The Da Vinci
Code than Jesus.
[Figure 3. Scope coverage for query "hillary clinton" AND "homeland security", excluding regional topics. Accumulated coverage (40%-100%) for scope levels Lvl1-Lvl5 against the top K pages returned (K = 20 to 400).]
The scope coverage tests performed, results shown in
Fig. 1 to Fig. 3, aim at analyzing the scope coverages for
different queries. Scope coverage is determined by the num-
ber of documents inside the search scope. Each chart de-
picts the accumulated coverage for scope levels one to five,
where level one and the following levels respectively repre-
sent the selected category itself and the related categories in
descending order of scores. The results shown ignore all re-
gional categories (e.g., United States) because these
categories usually contain a large number of documents and
are excessively dominant over specialized categories.
We captured the top 400 documents returned by Google
based on queries on the topics of Senator Hillary
Rodham Clinton and Dan Brown > The Da
Vinci Code. Although several other works (e.g., [7]
and [2]) have based their evaluations on the top 200
returned by other search engines, we believe it would be
more appropriate to use Google’s top 400 here due to its
larger index. Because Google’s algorithm is reasonably
successful at sorting documents by relevancy, we believe
it would be reasonable to assume the higher a document is
ranked the more relevant it is to the given query; documents
ranked lower than 400 were assumed to be irrelevant. The
charts shown, along with our other experiments not shown
due to space limitations, suggest that our model requires
only five levels to achieve at least 90% coverage of the top
20 documents returned by Google. Moreover, our imple-
mentation achieves a mean coverage of 89.3% of the top
200 returned. An interesting and important observation is
that the scope coverage gradually narrows as document
ranks decrease. This trend may suggest
that our model, along with the categorization algorithm
used, culls irrelevant documents. By these results, we
estimate the average recall of SSR is about 91% (based on
the 60 samples shown in Figs. 1 to 3) of Google’s.
We evaluate the retrieval performance of the SSR model
against Google by comparing the number of documents
within the coverage containing some selected words with
the number of documents Google returns when given those
words as queries. Given our limited resources, categorizing
Category                          | Related categories identified
----------------------------------|------------------------------------------------------------
Senator Hillary Rodham Clinton    | United States; Bill Clinton; U.S. Government Executive Branch; U.S. Senate; Chelsea Clinton; President George W. Bush
Dan Brown > The Da Vinci Code     | Jesus; Catholic; United States; Holy Grail; Movie Titles > The Da Vinci Code (2006); Pope John Paul II
Film Directors > Ang Lee          | Academy Awards; United States; Brokeback Mountain; English Language; Golden Globe Awards; Heath Ledger

Table 1. Related categories identified for three Yahoo! categories
                                          SSR     Google
Scope coverage (mil.)                     550.0   20,000+
Num. of docs (mil.)     hillary           30.0    28.1
within coverage         clinton           58.3    72.7
containing:             homeland          18.2    43.0
                        "hillary clinton" 1.7     3.1

Table 2. Scope coverage test results for the Senator Hillary Rodham Clinton category
all documents in the “universe” is infeasible and sampling
from this collection would represent an extremely small
portion. We therefore took a more accurate approach by
sampling from a smaller collection based on Proposition 1,
which we call a candidate pool. However, in our experi-
ment the use of a candidate pool was not effective as it con-
tains over 21 billion documents. We therefore sampled only
the top 400 documents returned by Google. Nonetheless,
since Google has an excellent ranking algorithm, we be-
lieve that the top 400 would be a sufficient representation.
In addition, as already discussed, scope coverage generally
narrows towards low ranked documents, so we estimate the
upper bound of the overall coverage to be the average cov-
erage for the top 400.
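As a rough check, the per-query count ratios underlying Table 2 can be tabulated directly. The counts below are the Table 2 figures in millions of documents; how they combine into the quoted precision bounds involves further assumptions not spelled out here, so this sketch only computes raw count ratios:

```python
# Counts (millions of documents) from Table 2 for the
# "Senator Hillary Rodham Clinton" category.
ssr    = {"hillary": 30.0, "clinton": 58.3, "homeland": 18.2,
          '"hillary clinton"': 1.7}
google = {"hillary": 28.1, "clinton": 72.7, "homeland": 43.0,
          '"hillary clinton"': 3.1}

# Ratio of matching-document counts, Google relative to SSR, per query.
for q in ssr:
    print(f"{q:20s} Google/SSR = {google[q] / ssr[q]:.2f}")
```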
The precision of SSR can be estimated by using the
results from Table 2, which is based on the category
Senator Hillary Rodham Clinton. For exam-
ple, if a user submits a query “hillary clinton” to
search for documents about Hillary Clinton, the senator and
the former First Lady of the U.S., our implementation is
estimated to return at most 1.7 million non-unique documents,
while Google returns about 3.1 million presumably unique
documents. This suggests that the precision of our model
can be estimated to be between 85% (worst-case scenario
by matching hillary) and 215% (best-case scenario by
matching homeland) of Google’s. We believe the overall
results are very encouraging; however, since robot-based
search engines today typically suffer from extremely low
precision rates, the estimated increase in precision may only
provide an insignificant practical improvement.
5. Conclusion
We have described a model for reconstructing the search
scope in web directory-based searching. Our
model relies on an existing web directory and exploits the
wealth of knowledge and hyperlink structure of Wikipedia
articles. In our experiments, the average recall rate achieved
by using the Yahoo! directory is 91% of Google’s, whereas
the precision can be doubled. An interesting observation is
that the scope coverage becomes narrower as the ranks of
the documents become lower, which might imply that the
number of rejected documents increases for lower ranked
documents. Our work has not only shown that inference
techniques can produce promising results for related cate-
gory identification, but also that Wikipedia articles’ hyper-
link structures can be useful for semantic analysis.
References
[1] J. Allen. Natural Language Processing. Benjamin/Cummings Publ. Co., Menlo Park, CA, 1987.
[2] F. Can, R. Nuray, and A. B. Sevdik. Automatic performance evaluation of web search engines. Information Processing & Management, 40(3):495–514, May 2004.
[3] S. Chan, T. Lai, E. Dang, M. Chan, and C. Leung. Increasing relevance of internet search results using a topic network. In Proc. of the IADIS e-Society 2006 Conference, 2006.
[4] H. Cunningham, R. Gaizauskas, K. Humphreys, and Y. Wilks. Experience with a language engineering architecture: Three years of GATE, 1999.
[5] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
[6] T. H. Haveliwala. Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search. IEEE Transactions on Knowledge and Data Engineering, 15(4):784–796, 2003.
[7] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
[8] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.