Improving Web Search Result Using
Cluster Analysis
A thesis submitted by
Biswapratap Singh Sahoo
in partial fulfillment for the award of the degree of
MASTER OF SCIENCE
IN
COMPUTER SCIENCE
Supervisor
Dr. R.C Balabantaray
UTKAL UNIVERSITY: ODISHA
JUNE 2010
Copyright © by
Biswapratap Singh Sahoo
June 2010
Contents

Declaration
Abstract
Dedication
Acknowledgment
List of Figures
Chapter 1 Introduction
1.1 Motivation
1.1.1 From organic to mineral memory
1.1.2 The problem of abundance
1.1.3 Information retrieval and Web search
1.1.4 Web search and Web crawling
1.1.5 Why is the Web so popular now?
1.1.6 Search Engine System Architecture
1.1.7 Overview of Information Retrieval
1.1.8 Evaluation in IR
1.1.9 Methods for IR
Chapter 2 Related works
2.1 Search Engine Tools
2.1.1 Web Crawlers
2.1.2 How the Web Crawler Works
2.1.3 Overview of data clustering
2.2 An example information retrieval problem
Chapter 3 Implementation Details
3.1 Determining the user terms
3.1.1 Tokenization
3.1.2 Processing Boolean queries
3.1.3 Schematic Representation of Our Approach
3.1.4 Methodology of Our Proposed Model
3.2 Our Proposed Model Tool
3.2.1 Cluster Processor
3.2.2 DB/Cluster
3.3 Working Methodology
Chapter 4 Future Work and Conclusions
4.1 Future Work
4.2 Conclusion
Appendices
References and Bibliography
Annexure
A Modern Approach to Search Engine Using Cluster Analysis,
Biswapratap Singh Sahoo. National Seminar on Computer
Security: Issues and Challenges on 13th & 14th February 2010 held
at PJ College of Management & Technology, Bhubaneswar,
sponsored by All India Council for Technical Education, New Delhi.
Page No - 27
DECLARATION

I, Sri Biswapratap Singh Sahoo, do hereby declare that this thesis entitled "Improving Web Search Result Using Cluster Analysis", submitted to Utkal University, Bhubaneswar for the award of the degree of Master of Science in Computer Science, is an original piece of work done by me and has not been submitted for the award of any degree or diploma at any other university. Any help or source of information availed of in this connection is duly acknowledged.

Date:
Place:

Biswapratap Singh Sahoo
Researcher
Abstract

The key factors in the success of the World Wide Web are its large size and the lack of centralized control over its contents. Both issues
are also the most important source of problems for locating information.
The Web is a context in which traditional Information Retrieval methods
are challenged, and given the volume of the Web and its speed of change,
the coverage of modern search engines is relatively small. Moreover, the
distribution of quality is highly skewed, and interesting pages are scarce
in comparison with the rest of the content.
Search engines have changed the way people access and discover knowledge, allowing information on almost any subject to be retrieved quickly and easily, within seconds. As more and more material becomes available electronically, the influence of search engines on our lives will continue to grow. Engineering a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms, and they answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Building on an overview of current web search engine design, we propose a model based on cluster analysis: a new meta-search engine that dynamically groups the search results into clusters labeled by phrases extracted from the snippets.
There has been little research on cluster analysis using user terminology rather than document keywords. Until the log files of web sites were made available, it was difficult to accumulate enough actual user searches to make a cluster analysis feasible. Another limitation in using searcher terms is that most users of the Internet employ short (one- to two-word) queries (Jansen et al., 1998). Wu et al. (2001) used queries as a basis for clustering documents selected by searchers in response to similar queries. This thesis reports on an experimental search engine based on a cluster analysis of user text for quality information.
To my parents
and to all my teachers
both formal and informal
Serendipity is too important to be left to chance.
Acknowledgements

What you are is a consequence of the people you interact with, but just saying "thanks everyone for everything" would be wasting this opportunity. I have been very lucky to interact with truly great people, even if at times I am prepared to understand just a small fraction of what they have to teach me.
I am sincerely grateful for the support given by my advisor Dr. R. C. Balabantaray throughout this thesis. The comments received from my advisor during the review process were also very helpful and detailed. It is my pleasure and good fortune to express my profound sense of gratitude and indebtedness towards him for his inspiring guidance, unceasing encouragement, and above all the critical insight that has gone into the eventual fruition of this work. His blessings and inspiration helped me to stand in this new and most challenging field of Information Retrieval.
This thesis is just a step on a very long road. I want to thank the
professors I met during this study; especially I take this opportunity to
extend my thanks to Prof. S. Prasad and Er. S.K Naik for their
continuous encouragement throughout the entire course of the work.
I am also thankful to every staff member of Spintronic Technology & Advance Research, Bhubaneswar, for their cooperation and help from time to time.
I would say at the end that I owe everything to my parents, but that would imply that they also owe everything to their parents and so on, creating an infinite recursion that is outside the scope of this work. Therefore, I thank Dr. Balabantaray for being with me even from before the beginning, sometimes giving everything he has and more, and I need no calculation to say that he has given me the best guidance. Thank you.
List of Figures

Figure 1.1: Cyclic architecture for search engines, showing how different
components can use the information generated by the other
components.
Figure 1.2 Architecture of a simple Web Search Engine
Figure 2.1 A term-document incidence matrix. Matrix element (t, d) is 1
if the play in column d contains the word in row t, and is 0
otherwise.
Figure 2.2 Results from Shakespeare for the query Brutus AND Caesar
AND NOT Calpurnia.
Figure 2.3 The two parts of an inverted index.
Chapter 1 Introduction
The World Wide Web (WWW) has seen a tremendous increase in size over the last two decades, as well as in the number of new users inexperienced in the art of web search [1]. The amount of information and resources available on the WWW today has grown exponentially, and almost any kind of information is present if the user looks long enough.
In order to find relevant pages, a user has to browse through many WWW sites that may contain the information. Users may either browse the pages through entry points such as the popular portals (Google, Yahoo, MSN, AOL, etc.) or use a search engine to look for specific information. Beginning the search from one of the entry points is not always the best approach, since there is no particular organized structure to the WWW, and not all pages are reachable from others. In the case of using a search engine, a user
submits a query, typically a list of keywords, and the search engine
returns a list of the web pages that may be relevant according to the
keywords. In order to achieve this, the search engine has to search its
already existing index of all web pages for the relevant ones. Such search
engines rely on massive collections of web pages that are acquired with
the help of web crawlers, which traverse the web by following hyperlinks
and storing downloaded pages in a large database that is later indexed
for efficient execution of user queries. Many researchers have looked at web search technology over the last few years, but very little academic research has been done on search engines themselves. Search engines are constantly engaged
in the task of crawling through the WWW for the purpose of indexing.
When a user submits keywords for search, the search engine selects and
ranks the documents from its index. The task of ranking the documents,
according to some predefined criteria, falls under the responsibilities of
the ranking algorithms. A good search engine should present relevant
documents higher in the ranking, with less relevant documents following
them. A crawler for a large search engine has to address two issues.
First, it has to have a good crawling strategy, i.e., a strategy for deciding
which pages to download next. Second, it needs to have a highly
optimized system architecture that can download a large number of
pages per second from WWW.
1.1 Motivation

1.1.1 From organic to mineral memory
As mentioned before, finding relevant information among mixed results is a time-consuming task. In this context we introduce a simple, high-precision information retrieval system that clusters and re-ranks retrieval results with the intention of eliminating these shortcomings. The proposed architecture has some key features:
• Simple and high performance. Our experimental results (Section 4) show that it performs almost 79 percent better than the best known standard Persian retrieval systems [1, 2, 18].
• Independent of the initial system architecture. It can be embedded in any information retrieval system, which makes the proposed architecture a very good fit for web search engines.
• High precision. Relevant documents appear at the top of the result list.
We have three types of memory. The first one is organic: the memory made of flesh and blood, the one administered by our brain. The second is mineral, and in this sense mankind has known two kinds of mineral memory. Millennia ago, this was the memory represented by clay tablets and obelisks, pretty well known in this country, on which people carved their texts; the second kind is the electronic memory of today's computers, based upon silicon. We have also known another kind of memory, the vegetal one, represented first by papyruses, again well known in this country, and then by books, made of paper.
The World Wide Web, a vast mineral memory, has become in a few
years the largest cultural endeavor of all times, equivalent in importance
to the first Library of Alexandria. How was the ancient library created?
This is one version of the story:
“By decree of Ptolemy III of Egypt, all visitors to the city were
required to surrender all books and scrolls in their possession;
these writings were then swiftly copied by official scribes. The
originals were put into the Library, and the copies were delivered to
the previous owners. While encroaching on the rights of the
traveler or merchant, it also helped to create a reservoir of books in
the relatively new city.”
The main difference between the Library of Alexandria and the Web
is not that one was vegetal, made of scrolls and ink, and the other one is
mineral, made of cables and digital signals. The main difference is that
while in the Library books were copied by hand, most of the information
on the Web has been reviewed only once, by its author, at the time of
writing.
Also, modern mineral memory allows fast reproduction of the work,
with no human effort. The cost of disseminating content is lower due to
new technologies, and has been decreasing substantially from oral
tradition to writing, and then from printing and the press to electronic
communications. This has generated much more information than we
can handle.
1.1.2 The problem of abundance
The signal-to-noise ratio of the products of human culture is remarkably low: mass media, including the press, radio and cable networks, provide strong evidence of this phenomenon every day, as do smaller-scale activities such as browsing a book store or having a conversation. The average modern working day consists of dealing with
46 phone calls, 15 internal memos, 19 items of external post and 22 e-
mails.
We live in an era of information explosion, with information being measured in exabytes (10^18 bytes): "Print, film, magnetic, and optical
storage media produced about 5 exabytes of new information in 2002.
We estimate that new stored information grew about 30% a year between
1999 and 2002. Information flows through electronic channels –
telephone, radio, TV, and the Internet – contained almost 18 exabytes of
new information in 2002, three and a half times more than is recorded in
storage media. The World Wide Web contains about 170 terabytes of
information on its surface.” On the dawn of the World Wide Web, finding
information was done mainly by scanning through lists of links collected
and sorted by humans according to some criteria. Automated Web search
engines were not needed when Web pages were counted only by
thousands, and most directories of the Web included a prominent button
to “add a new Web page”. Web site administrators were encouraged to
submit their sites. Today, URLs of new pages are no longer a scarce
resource, as there are thousands of millions of Web pages. The main
problem search engines have to deal with is the size and rate of change
of the Web, with no search engine indexing more than one third of the
publicly available Web. As the number of pages grows, it will be increasingly important to focus on the most "valuable" pages, as no search engine will be able to index the complete Web. Moreover, in this thesis we state that the number of Web pages is essentially infinite, which makes this area even more relevant.
1.1.3 Information retrieval and Web search
Information Retrieval (IR) is the area of computer science
concerned with retrieving information about a subject from a collection of
data objects. This is not the same as Data Retrieval, which in the context of documents consists mainly of determining which documents of a collection contain the keywords of a user query. Information Retrieval
deals with satisfying a user need:
“... the IR system must somehow ’interpret’ the contents of the
information items (documents) in a collection and rank them
according to a degree of relevance to the user query. This
‘interpretation’ of document content involves extracting syntactic
and semantic information from the document text ...”
Although an important body of Information Retrieval techniques was published before the invention of the World Wide Web, the Web has unique characteristics that make them unsuitable or insufficient. A survey by Arasu et al. on searching the Web notes that:
“IR algorithms were developed for relatively small and coherent
collections such as newspaper articles or book catalogs in a
(physical) library. The Web, on the other hand, is massive, much
less coherent, changes more rapidly, and is spread over
geographically distributed computers ...”
This idea is also present in a survey about Web search by Brooks, which states that a distinction can be made between the "closed Web", comprising high-quality controlled collections that a search engine can fully trust, and the "open Web", which includes the vast majority of web pages and on which traditional IR techniques, concepts and methods are challenged.
One of the main challenges the open Web poses to search engines
is “search engine spamming”, i.e.: malicious attempts to get an
undeserved high ranking in the results. This has created a whole branch
of Information Retrieval called “adversarial IR”, which is related to
retrieving information from collections in which a subset of the collection
has been manipulated to influence the algorithms. For instance, the vector space model for documents and the TF-IDF similarity measure are useful for identifying which documents in a collection are relevant in terms of a set of keywords provided by the user. However, this scheme can be easily defeated in the "open Web" by just adding frequently-asked query terms to Web pages.
A solution to this problem is to use the hypertext structure of the
Web, using links between pages as citations are used in academic
literature to find the most important papers in an area. Link analysis,
which is often not possible in traditional information repositories but is
quite natural on the Web, can be used to exploit links and extract useful
information from them, but this has to be done carefully, as in the case
of Pagerank:
“Unlike academic papers which are scrupulously reviewed, web
pages proliferate free of quality control or publishing costs. With a
simple program, huge numbers of pages can be created easily,
artificially inflating citation counts. Because the Web environment
contains profit seeking ventures, attention getting strategies evolve
in response to search engine algorithms. For this reason, any
evaluation strategy which counts replicable features of web pages
is prone to manipulation”.
The low cost of publishing in the "open Web" is a key part of its success, but it implies that searching for information on the Web will always be inherently more difficult than searching in traditional, closed repositories.
1.1.4 Web search and Web crawling
The typical design of search engines is a “cascade”, in which a Web
crawler creates a collection which is indexed and searched. Most of the
designs of search engines consider the Web crawler as just a first stage
in Web search, with little feedback from the ranking algorithms to the
crawling process. This is a cascade model, in which operations are
executed in strict order: first crawling, then indexing, and then
searching. Our approach is to provide the crawler with access to all the
information about the collection to guide the crawling process effectively.
This can be taken one step further, as there are tools available for
dealing with all the possible interactions between the modules of a
search engine, as shown in Figure 1.1.
Figure 1.1: Cyclic architecture for search engines, showing how different
components can use the information generated by the other components.
The typical cascade model is depicted with thick arrows. The indexing
module can help the Web crawler by providing information about the
ranking of pages, so the crawler can be more selective and try to collect
important pages first. The searching process, through log file analysis or
other techniques, is a source of optimizations for the index, and can also
help the crawler by determining the “active set” of pages which are
actually seen by users. Finally, the Web crawler could provide on-
demand crawling services for search engines. All of these interactions are
possible if we conceive the search engine as a whole from the very
beginning.
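The crawling stage described above can be illustrated with a minimal sketch. The fragment below is purely illustrative: the toy pages, the `crawl` function and its parameters are our own hypothetical names, not part of any real search engine. It performs a breadth-first traversal over a small in-memory "web" instead of using real HTTP, which keeps the example self-contained:

```python
from collections import deque

# A toy in-memory "web": page URL -> (page text, outgoing links).
# A dict stands in for real HTTP fetching to keep the sketch self-contained.
TOY_WEB = {
    "a.html": ("start page", ["b.html", "c.html"]),
    "b.html": ("about crawling", ["c.html", "d.html"]),
    "c.html": ("about indexing", ["a.html"]),
    "d.html": ("about ranking", []),
}

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl: download pages, follow links, store text."""
    frontier = deque([seed])      # URL server: URLs waiting to be fetched
    seen = {seed}                 # avoid re-downloading the same URL
    repository = {}               # store server: URL -> page text
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        text, links = fetch(url)
        repository[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository

repo = crawl("a.html", lambda url: TOY_WEB[url])
# All four pages are reachable from the seed and end up in the repository.
```

A production crawler replaces the dictionary lookup with an HTTP fetch, adds politeness delays and robots.txt handling, and orders the frontier by a crawling strategy rather than plain FIFO.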
1.1.5 Why is the Web so popular now?
Commercial developers noticed the potential of the web as a
communications and marketing tool when graphical Web browsers broke
onto the Internet scene (Mosaic, the precursor to Netscape Navigator,
was the first popular web browser) making the Internet, and specifically
the Web, "user friendly." As more sites were developed, the browser became more popular as an interface for the Web, which spurred more Web use, more Web development, and so on. Graphical web browsers are now powerful, easy and fun to use, and incorporate many "extra" features such as news and mail readers. The nature of the Web itself invites user interaction: web sites are composed of hypertext documents, which means they are linked to one another. The user can choose his or her own path by selecting predefined "links". Since hypertext documents are not organized in an arrangement that requires the user to access the pages sequentially, users appreciate the ability to choose what they will see next and the chance to interact with the site contents.
1.1.6 Search Engine System Architecture
This section provides an overview of how the whole system of a search engine works. The major functions of a search engine (crawling, indexing and searching) are covered in detail in the later sections. Before a search engine can tell you where a file or document is, it must
be found. To find information on the hundreds of millions of Web pages
that exist, a typical search engine employs special software robots, called
spiders, to build lists of the words found on Web sites. When a spider is
building its lists, the process is called Web crawling. A Web crawler is a
program, which automatically traverses the web by downloading
documents and following links from page to page. They are mainly used
by web search engines to gather data for indexing. Other possible
applications include page validation, structural analysis and
visualization; update notification, mirroring and personal web
assistants/agents etc. Web crawlers are also known as spiders, robots,
worms etc. Crawlers are automated programs that follow the links found
on the web pages. There is a URL Server that sends lists of URLs to be
fetched to the crawlers. The web pages that are fetched are then sent to
the store server. The store server then compresses and stores the web
pages into a repository. Every web page has an associated ID number
called a doc ID, which is assigned whenever a new URL is parsed out of a
web page. The indexer and the sorter perform the indexing function. The
indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is
converted into a set of word occurrences called hits. The hits record the
word, position in document, an approximation of font size, and
capitalization. The indexer distributes these hits into a set of "barrels",
creating a partially sorted forward index. The indexer performs another
important function. It parses out all the links in every web page and
stores important information about them in an anchors file. This file
contains enough information to determine where each link points from
and to, and the text of the link. The URL Resolver reads the anchors file
and converts relative URLs into absolute URLs and in turn into doc IDs. It
puts the anchor text into the forward index, associated with the doc ID
that the anchor points to. It also generates a database of links, which are
pairs of doc IDs. The links database is used to compute Page Ranks for
all the documents. The sorter takes the barrels, which are sorted by doc
ID and resorts them by word ID to generate the inverted index. This is
done in place so that little temporary space is needed for this operation.
The sorter also produces a list of word IDs and offsets into the inverted
index. A program called Dump Lexicon takes this list together with the
lexicon produced by the indexer and generates a new lexicon to be used
by the searcher. A lexicon lists all the terms occurring in the index along
with some term-level statistics (e.g., the total number of documents in which a term occurs) that are used by the ranking algorithms. The searcher is
run by a web server and uses the lexicon built by Dump Lexicon together
with the inverted index and the Page Ranks to answer queries. (Brin and
Page 1998)
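The indexer and sorter steps described above can be sketched in a few lines. The fragment below is a simplified illustration, not the actual system of Brin and Page: real indexers use compressed "barrels" and record richer hits (font size, capitalization), while here a hit is just a (term, position) pair. It builds a forward index and then regroups the hits by term to obtain the inverted index:

```python
from collections import defaultdict

def build_forward_index(repository):
    """Forward index: doc ID -> list of (term, position) 'hits'."""
    forward = {}
    for doc_id, text in enumerate(repository):
        forward[doc_id] = [(term, pos) for pos, term in enumerate(text.lower().split())]
    return forward

def invert(forward):
    """Sorter step: regroup hits by term, producing the inverted index
    (term -> postings list of (doc ID, position) pairs)."""
    inverted = defaultdict(list)
    for doc_id, hits in forward.items():
        for term, pos in hits:
            inverted[term].append((doc_id, pos))
    return dict(inverted)

docs = ["web crawlers gather pages",
        "the indexer parses pages",
        "the sorter builds the inverted index"]
index = invert(build_forward_index(docs))
# The postings list for "pages" covers documents 0 and 1.
```

The set of distinct keys of the inverted index plays the role of the lexicon produced by Dump Lexicon.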
Figure 1.2 Architecture of a simple Web Search Engine
Figure 1.2 illustrates the architecture of a simple WWW search engine. In
general, a search engine usually consists of three major modules:
a) Information gathering
b) Data extraction and indexing
c) Document ranking
Retrieval systems generally treat each document as a unique entity when assigning a page rank. If a document is instead viewed as a combination of other related documents in the query area, we can obtain better results. The conjecture that relevant documents tend to cluster was made by [26]. Irrelevant documents may share many terms with relevant documents while being about completely different topics, so they may demonstrate some patterns. On the other hand, an irrelevant cluster can be viewed as the retrieval result for a different query that shares many terms with the original query. Xu et al. believe that document clustering can make mistakes, and when this happens it adds more noise to the query expansion process. But as we discuss, document clustering is a good tool for high-precision information retrieval systems. In this context we propose an architecture (Fig. 3.1) to cluster search results and re-rank them based on cluster analysis. Although our benchmark is in the Persian language, we believe that the same results should be exhibited on other benchmarks.

1.1.7 Overview of Information Retrieval
People have the ability to understand abstract meanings that are
conveyed by natural language. This is why reference librarians are
useful; they can talk to a library patron about her information needs and
then find the documents that are relevant. The challenge of information
retrieval is to mimic this interaction, replacing the librarian with an
automated system. This task is difficult because the machine
understanding of natural language is, in the general case, still an open
research problem. More formally, the field of Information Retrieval (IR) is
concerned with the retrieval of information content that is relevant to a
user’s information needs (Frakes 1992).
Information Retrieval is often regarded as synonymous with
document retrieval and text retrieval, though many IR systems also
retrieve pictures, audio or other types of non-textual information. The
word “document” is used here to include not just text documents, but
any clump of information. Document retrieval subsumes two related
activities: indexing and searching (Sparck Jones 1997). Indexing refers
to the way documents, i.e. information to be retrieved, and queries, i.e.
statements of a user’s information needs, are represented for retrieval
purposes. Searching refers to the process whereby queries are used to
produce a set of documents that are relevant to the query. Relevance
here means simply that the documents are about the same topic as the
query, as would be determined by a human judge. Relevance is an
inherently fuzzy concept, and documents can be more or less relevant to
a given query. This fuzziness puts IR in opposition to Data Retrieval,
which uses deductive and boolean logic to find documents that
completely match a query (van Rijsbergen 1979).
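The contrast with Data Retrieval can be made concrete. The sketch below (the term sets and the function name are our own, chosen to echo the Brutus AND Caesar AND NOT Calpurnia query of Figure 2.2) answers a boolean query by exact set operations, with no notion of graded relevance:

```python
def boolean_retrieve(index, must, must_not=()):
    """Data-retrieval-style boolean matching: a document qualifies only
    if it contains every 'must' term and none of the 'must_not' terms."""
    docs = set.intersection(*(index[t] for t in must))
    for term in must_not:
        docs -= index[term]
    return docs

# Toy term -> document-ID sets (hypothetical postings):
index = {"brutus": {1, 2, 4}, "caesar": {1, 2, 3, 5}, "calpurnia": {2}}
hits = boolean_retrieve(index, must=["brutus", "caesar"], must_not=["calpurnia"])
# Only document 1 contains Brutus and Caesar but not Calpurnia.
```

A document either matches completely or not at all; there is no ranking by degree of relevance, which is precisely the fuzziness that separates IR from Data Retrieval.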
1.1.8 Evaluation in IR
Information retrieval algorithms are usually evaluated in terms of
relevance to a given query, which is an arduous task considering that
relevance judgements must be made by a human for each document
retrieved. The Text REtrieval Conference (TREC) provides a forum for
pooling resources to evaluate text retrieval algorithms. Document corpora
are chosen from naturally occurring collections such as the
Congressional Record and the Wall Street Journal. Queries are created by
searching corpora for topics of interest, and then selecting queries that
have a decent number of documents relevant to that topic. Queries and
corpora are distributed to participants, who use their algorithms to
return ranked lists of documents related to the given queries. These
documents are then evaluated for relevance by the same person who
wrote the query (Voorhees 1999).
This evaluation method is based on two assumptions. First, it
assumes that relevance to a query is the right criterion on which to judge
a retrieval system. Other factors such as the quality of the document
returned, whether the document was already known, the effort required
to find a document, and whether the query actually represented the
user’s true information needs are not considered. This assumption is
controversial in the field. One alternative that has been proposed is to determine the overall utility of documents retrieved during normal use (Cooper 1973).
Users would be asked how many dollars (or other units of utility)
each contact with a document was worth. The answer could be positive,
zero, or negative depending on the experience.
Utility would therefore be defined as any subjective value a document
gives the user, regardless of why the document is valuable. The second
assumption inherent in the evaluation method used in TREC is that
queries tested are representative of queries that will be performed during
actual use. This is not necessarily a valid assumption, since queries that
are not well represented by documents in the corpus are explicitly
removed from consideration. These two assumptions can be summarized
as follows: if a retrieval system returns no documents that meet a user's information needs, it is not considered the fault of the system so long as the failure is due either to poor query construction or to poor documents in the corpus.
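For illustration, set-based precision and recall, the standard measures behind such relevance-based evaluations, can be computed as follows (the relevance judgements here are made up for the example):

```python
def precision_recall(ranked, relevant):
    """Precision and recall for a ranked result list against a set of
    human relevance judgements (TREC-style ground truth)."""
    retrieved_relevant = [d for d in ranked if d in relevant]
    precision = len(retrieved_relevant) / len(ranked)
    recall = len(retrieved_relevant) / len(relevant)
    return precision, recall

# Hypothetical judgements: documents 1, 3 and 7 are relevant to the query.
p, r = precision_recall(ranked=[3, 5, 1, 8], relevant={1, 3, 7})
# p = 2/4: half of the returned documents are relevant.
# r = 2/3: two of the three relevant documents were found.
```

Note that both measures take the judgements as given; neither captures document quality, novelty, or user effort, which is exactly the first assumption criticized above.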
1.1.9 Methods for IR
There are many different methods for both indexing and retrieval,
and a full description is out of the scope of this thesis. However, a few
broad categories will be described to give a feel for the range of methods
that exist.
Vector-space model. The vector-space model represents queries
and documents as vectors, where indexing terms are regarded as the
coordinates of a multidimensional information space (Salton 1975).
Terms can be words from the document or query itself or picked from a
controlled list of topics. Relevance is represented by the distance of a
query vector to a document vector within this information space.
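As an illustrative sketch of the vector-space model (using raw term counts rather than full TF-IDF weighting, for brevity), relevance can be scored as the cosine of the angle between the query vector and each document vector:

```python
import math
from collections import Counter

def cosine(query, document):
    """Vector-space relevance: terms are coordinates, term counts are
    the vector components, similarity is the cosine of the angle."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

docs = ["cluster analysis of search results",
        "a history of the library of alexandria"]
scores = [cosine("search result cluster", d) for d in docs]
# The first document shares terms with the query and scores higher;
# the second shares none and scores zero.
```

Replacing the raw counts with TF-IDF weights keeps the same geometry while discounting terms that occur in many documents.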
Probabilistic model. The probabilistic model views IR as the
attempt to rank documents in order of the probability that, given a
query, the document will be useful (van Rijsbergen 1979). These models
rely on relevance feedback: a list of documents that have already been
annotated by the user as relevant or non-relevant to the query. With this
information and the simplifying assumption that terms in a document
are independent, an assessment can be made about which terms make a
document more or less likely to be useful.
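A bare-bones illustration of this idea follows; the add-one smoothing and the toy feedback sets are our own assumptions for the sketch, not a specific published weighting scheme. Under the term-independence assumption, each term receives a log-odds weight from how often it occurs in the relevant versus the non-relevant documents:

```python
import math

def term_weights(relevant_docs, nonrelevant_docs, vocabulary):
    """Probabilistic-model term weights from relevance feedback: a term
    seen mostly in relevant documents gets a positive log-odds weight.
    Add-one smoothing avoids division by zero on small feedback sets."""
    weights = {}
    for term in vocabulary:
        p = (sum(term in d for d in relevant_docs) + 1) / (len(relevant_docs) + 2)
        q = (sum(term in d for d in nonrelevant_docs) + 1) / (len(nonrelevant_docs) + 2)
        weights[term] = math.log((p * (1 - q)) / (q * (1 - p)))
    return weights

# Hypothetical feedback: each document is its set of terms.
rel = [{"web", "search", "cluster"}, {"search", "cluster"}]
nonrel = [{"library", "history"}, {"library", "web"}]
w = term_weights(rel, nonrel, {"search", "library"})
# "search" occurs only in relevant documents and gets a positive weight;
# "library" occurs only in non-relevant ones and gets a negative weight.
```

Documents are then ranked by the sum of the weights of the query terms they contain.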
Natural language processing model. Most of the other
approaches described are tricks to retrieve relevant documents without
requiring the computer to understand the contents of a document in any
deep way. Natural Language Processing (NLP) does not shirk this job,
and attempts to parse naturally occurring language into representations
of abstract meanings. The conceptual models of queries and documents
can then be compared directly (Rau 1988).
Knowledge-based approaches. Sometimes knowledge about a
particular domain can be used to aid retrieval. For example, an expert
system might retrieve documents on diseases based on a list of
symptoms. Such a system would rely on knowledge from the medical
domain to make a diagnosis and retrieve the appropriate documents.
Other domains may have additional structure that can be leveraged. For
example, links between web pages have been used to identify authorities
on a particular topic (Chakrabarti 1999).

Data fusion. Data fusion is a meta-technique whereby several algorithms, indexing methods and search methods are used to produce different sets of relevant documents. The results are then combined in some form of voting to produce an overall best set of documents (Lee 1995). The Savant system described in Chapter 2.7 is an example of a data fusion IR system.
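As an illustration of such voting, the sketch below combines several hypothetical ranked lists by summing reciprocal-rank votes; this is a simple voting scheme of our own for the example, not the specific combination method studied by Lee (1995):

```python
from collections import defaultdict

def fuse(rankings):
    """Data fusion by voting: each retrieval system contributes a vote
    weighted by the rank it gave a document; documents are re-ranked
    by their total vote."""
    votes = defaultdict(float)
    for ranked in rankings:
        for position, doc in enumerate(ranked):
            votes[doc] += 1.0 / (position + 1)   # higher ranks, larger votes
    return sorted(votes, key=votes.get, reverse=True)

# Three hypothetical retrieval algorithms, each returning a ranked list:
fused = fuse([["d1", "d2", "d3"],
              ["d2", "d1", "d4"],
              ["d2", "d3", "d1"]])
# "d2" wins with two first places and one second place.
```

The appeal of fusion is that documents retrieved by several independent methods are more likely to be relevant than those retrieved by only one.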
Chapter 2 Related works
Using some kind of document clustering technique to improve retrieval results is not new, although we believe we are the first to explicitly present and deal with the low-precision problem in terms of clustering search results. Many research efforts such as [10] have been
made on how to solve the keyword barrier which exists because there is
no perfect correlation between matching words and intended meaning.
[9] presents TermRank, a variation of the PageRank algorithm based on a
relational graph representation of the content of web document
collections. Search result clustering has successfully served this purpose
in both commercial and scientific systems [30, 10, 23, 16, 25, 33]. The
proposed methods focus on separating search results into meaningful
groups that the user can browse to inspect the retrieval results. One of the first
approaches to search results clustering, called Suffix Tree Clustering
(STC), groups documents according to common phrases [13]. STC has
two key features: the use of phrases and a simple cluster definition. This
is very important when attempting to describe the contents of a cluster.
[12] proposes a new approach for web search result clustering to improve
the performance of approaches that use the previous STC algorithms.
Search results clustering has a few interesting characteristics, one
of which is that it is based only on document snippets. Document
snippets returned by search engines are usually very short and noisy.
Another shortcoming of these systems is the cluster name: a cluster's
name must accurately and concisely describe the contents of the
cluster, so that the user can quickly decide whether the cluster is
interesting or not. This aspect of these systems is difficult and
sometimes neglected [7, 12]. In this context, our aim is to provide a
very simple high-precision system based on the cluster hypothesis [16]
without any user feedback. Document clustering can be performed, in
advance, on the
collection as a whole (static clustering) [7, 15], but post-retrieval
document clustering (dynamic clustering) has been shown to produce
superior results [10, 8]. Tombros et al. [14] conducted a number of
experiments using five document collections and four hierarchic
clustering methods to show that if hierarchic clustering is applied to
search results (query-specific clustering), then it has the potential to
increase the retrieval effectiveness compared both to that of static
clustering and of conventional inverted file search. The actual
effectiveness of hierarchic clustering can be gauged through
cluster-based retrieval strategies, which rank clusters instead of
individual documents in response to each query [13]. The generation of
precision-recall graphs is thus not possible in such systems, and in
order to derive an evaluation function for clustering systems, an
effectiveness function was proposed in [13]. In this thesis, firstly,
we propose a simple
architecture which uses local cluster analysis to improve the
effectiveness of retrieval and yet utilize traditional precision-recall
evaluation. Secondly, this thesis is devoted to high-precision
retrieval. Thirdly, we use a large standard Persian test collection,
created according to TREC specifications, which validates our findings
in a wider context.
Query expansion is another approach to improve the effectiveness of
information retrieval. These techniques can be categorized as either
global or local. While global techniques rely on analysis of a whole
collection to discover word relationships, local techniques emphasize
analysis of the top-ranked documents retrieved for a query [28]. Local
techniques have been shown to be more effective than global techniques
in general [29, 2]. In this thesis we do not expand a query based on
the information in the set of top-ranked documents retrieved for the
query; instead, we use a very simple and more efficient re-ranking
approach to improve the effectiveness of search results and build a
high-precision system that places more relevant documents at the top of
the result list, helping users satisfy their information needs
efficiently.
2.1 Search Engine Tools
2.1.1 Web Crawlers
To find information from the hundreds of millions of Web pages
that exist, a typical search engine employs special software robots, called
spiders, to build lists of the words found on Web sites [6]. When a spider
is building its lists, the process is called Web crawling. A Web crawler is
a program, which automatically traverses the web by downloading
documents and following links from page to page [8]. They are mainly
used by web search engines to gather data for indexing. Web crawlers are
also known as spiders, robots, worms etc. Crawlers are automated
programs that follow the links found on the web pages [10].
There are a number of different scenarios in which crawlers
are used for data acquisition, differing in the crawling strategy
employed: breadth-first crawling, recrawling pages for updates, focused
crawling, random walking and sampling, and crawling the "hidden
Web" [11].
2.1.2 How the Web Crawler Works
Following is the process by which Web crawlers work [3]:
1. Download the Web page.
2. Parse through the downloaded page and retrieve all the links.
3. For each link retrieved, repeat the process.
In the first step, a Web crawler takes a URL and downloads the
page from the Internet at the given URL. Oftentimes the downloaded page
is saved to a file on disk or put in a database [3].
In the second step, a Web crawler parses through the downloaded
page and retrieves the links to other pages. After the crawler has
retrieved the links from the page, each link is added to a list of links to
be crawled [3].
The third step of Web crawling repeats the process. All crawlers
work in a recursive or loop fashion, but there are two different ways to
handle it. Links can be crawled in a depth-first or breadth-first manner.
[3]
Web pages and links between them can be modeled by a directed
graph called the web graph. Web pages are represented by vertices and
links are represented by directed edges [7].
Using depth-first search, an initial web page is selected, a link is
followed to a second web page (if such a link exists), a link on the
second web page is followed to a third web page, if there is such a
link, and so on, until a page with no new links is found. Backtracking
is then used to examine links at the previous level to look for new
links, and so on. (Because of practical limitations, web spiders limit
the depth to which they search when crawling depth-first.)
Using breadth-first search, an initial web page is selected and a
link on this page is followed to a second web page, then a second link
on the initial page is followed (if it exists), and so on, until all
links on the initial page have been followed. Then links on the pages
one level down are followed, page by page, and so on.
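The breadth-first strategy above can be sketched with a FIFO queue of URLs to visit and a set of already-visited pages. The `fetchLinks` helper below is a hypothetical stand-in for downloading a page and extracting its links; here it just reads from an in-memory web graph.

```java
import java.util.*;

/** Breadth-first crawl sketch: a FIFO frontier queue plus a visited set. */
public class BfsCrawler {
    /** Stand-in for "download the page at url and parse out its links". */
    static List<String> fetchLinks(String url, Map<String, List<String>> webGraph) {
        return webGraph.getOrDefault(url, Collections.emptyList());
    }

    static List<String> crawl(String seed, Map<String, List<String>> webGraph, int maxPages) {
        List<String> order = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        visited.add(seed);
        while (!frontier.isEmpty() && order.size() < maxPages) {
            String url = frontier.poll();   // FIFO queue => breadth-first order
            order.add(url);
            for (String link : fetchLinks(url, webGraph)) {
                if (visited.add(link)) {    // enqueue each discovered URL only once
                    frontier.add(link);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new HashMap<>();
        web.put("A", Arrays.asList("B", "C"));
        web.put("B", Arrays.asList("D"));
        web.put("C", Arrays.asList("D"));
        System.out.println(crawl("A", web, 10)); // [A, B, C, D]
    }
}
```

Swapping the `ArrayDeque` for a stack (push/pop at the same end) would turn this into the depth-first variant described earlier.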
2.1.3 Overview of data clustering
Data clustering, as a class of data mining techniques, partitions
a given data set into separate clusters, with each cluster
composed of the data objects with similar characteristics. Most existing
clustering methods can be broadly classified into two categories:
partitioning methods and hierarchical methods. Partitioning algorithms,
such as k-means, k-medoid and EM, attempt to partition a data set into k
clusters such that a previously given evaluation function can be
optimized. The basic idea of hierarchical clustering methods is to first
construct a hierarchy by decomposing the given data set, and then use
agglomerative or divisive operations to form clusters. In general, an
agglomeration-based hierarchical method starts with a disjoint set of
clusters, placing each data object into an individual cluster, and then
merges pairs of clusters until the number of clusters is reduced to a
given number k. On the other hand, the division-based hierarchical
method treats the whole data set as one cluster at the beginning, and
divides it iteratively until the number of clusters is increased to k. See
[11] for more information. Although [17, 20, 23, 31, 33] have developed
special algorithms for clustering search results, we prefer to use
traditional methods in this thesis. We will show that our method
with basic clustering algorithms such as k-means and Principal Direction
Divisive Partitioning achieves significant improvement over the methods
based on similarity search ranking alone.
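As a concrete illustration of the partitioning family, here is a minimal sketch of k-means (Lloyd's iteration) on one-dimensional points. The data and starting centroids are invented for the example; real implementations add convergence checks, multi-dimensional distances, and better seeding.

```java
import java.util.*;

/** Minimal k-means (Lloyd's algorithm) on 1-D points, for illustration only. */
public class KMeans {
    static int[] cluster(double[] points, double[] centroids, int iterations) {
        int[] assign = new int[points.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: each point joins its nearest centroid.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++)
                    if (Math.abs(points[i] - centroids[c]) < Math.abs(points[i] - centroids[best]))
                        best = c;
                assign[i] = best;
            }
            // Update step: each centroid moves to the mean of its assigned points.
            for (int c = 0; c < centroids.length; c++) {
                double sum = 0;
                int n = 0;
                for (int i = 0; i < points.length; i++)
                    if (assign[i] == c) { sum += points[i]; n++; }
                if (n > 0) centroids[c] = sum / n;
            }
        }
        return assign;
    }

    public static void main(String[] args) {
        double[] pts = {1.0, 1.2, 0.8, 9.0, 9.5, 8.7};
        int[] assign = cluster(pts, new double[]{0.0, 10.0}, 5);
        System.out.println(Arrays.toString(assign)); // [0, 0, 0, 1, 1, 1]
    }
}
```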
2.2 An example information retrieval problem
A fat book which many people own is Shakespeare’s Collected
Works. Suppose you wanted to determine which plays of Shakespeare
contain the words Brutus AND Caesar AND NOT Calpurnia. One way to
do that is to start at the beginning and to read through all the text,
noting for each play whether it contains Brutus and Caesar and
excluding it from consideration if it contains Calpurnia. The simplest
form of document retrieval is for a computer to do this sort of linear scan
through documents. This process is commonly referred to as grepping
through text, after the Unix command grep, which performs this
process. Grepping through text can be a very effective process, especially
given the speed of modern computers, and often allows useful
possibilities for wildcard pattern matching through the use of regular
expressions. With modern computers, for simple querying of modest
collections (the size of Shakespeare’s Collected Works is a bit under one
million words of text in total), you really need nothing more. But for
many purposes, you do need more:
1. To process large document collections quickly. The amount of
online data has grown at least as quickly as the speed of computers, and
we would now like to be able to search collections that total in the order
of billions to trillions of words.
2. To allow more flexible matching operations. For example, it is
impractical to perform the query Romans NEAR countrymen with grep,
where NEAR might be defined as “within 5 words” or “within the same
sentence”.
3. To allow ranked retrieval: in many cases you want the best answer to
an information need among many documents that contain certain words.
The way to avoid linearly scanning the texts for each query is to index the
documents in advance. Let us stick with Shakespeare’s Collected Works,
and use it to introduce the basics of the Boolean retrieval model.
Suppose we record for each document – here a play of Shakespeare’s –
whether it contains each word out of all the words Shakespeare used
(Shakespeare used about 32,000 different words). The result is a binary
term-document incidence matrix, as in Figure 2.1. Terms are the indexed
units (further discussed in Section 2.2); they are usually words, and for
the moment you can think of them as words, but the information
retrieval literature normally speaks of terms because some of them, such
as perhaps I-9 or Hong Kong, are not usually thought of as words. Now,
depending on whether we look at the matrix rows or columns, we can have
a vector for each term, which shows the documents it appears in, or a
vector for each document, showing the terms that occur in it.
Figure 2.1: A term-document incidence matrix. Matrix element (t, d) is 1
if the play in column d contains the word in row t, and is 0 otherwise.
To answer the query Brutus AND Caesar AND NOT Calpurnia, we take
the vectors for Brutus, Caesar and Calpurnia, complement the last, and
then do a bitwise AND:
The answers for this query are thus Antony and Cleopatra and Hamlet
(Figure 2.2).
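The bitwise evaluation can be illustrated with the incidence rows encoded as bit masks. The bit values below follow the standard Shakespeare example (leftmost bit = Antony and Cleopatra, then Julius Caesar, The Tempest, Hamlet, Othello, Macbeth); they are stated here as an assumption and should be checked against Figure 2.1.

```java
/** Boolean query evaluation over term-document incidence rows as bitmasks. */
public class IncidenceQuery {
    // One bit per play; bit set => the term occurs in that play.
    // Values assume the standard Shakespeare example from Figure 2.1.
    static final int BRUTUS    = 0b110100;
    static final int CAESAR    = 0b110111;
    static final int CALPURNIA = 0b010000;
    static final int ALL       = 0b111111; // mask covering all six documents

    static int and(int a, int b) { return a & b; }
    static int not(int a) { return ALL & ~a; } // complement within the collection

    public static void main(String[] args) {
        int result = and(and(BRUTUS, CAESAR), not(CALPURNIA));
        // 100100: first and fourth documents, i.e. Antony and Cleopatra, Hamlet
        System.out.println(Integer.toBinaryString(result)); // 100100
    }
}
```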
The Boolean retrieval model is a model for information retrieval in which
we can pose any query which is in the form of a Boolean expression of
terms, that is, in which terms are combined with the operators AND, OR,
and NOT. The model views each document as just a set of words.
Figure 2.2: Results from Shakespeare for the query Brutus AND Caesar
AND NOT Calpurnia.
Let us now consider a more realistic scenario, simultaneously
using the opportunity to introduce some terminology and notation.
Suppose we have N = 1 million documents. By documents we mean whatever units we
have decided to build a retrieval system over. We will refer to the group of
documents over which we perform retrieval as the (document) collection.
It is sometimes also referred to as a corpus (a body of texts). Suppose
each document is about 1000 words long (2-3 book pages). If we assume
an average of 6 bytes per word including spaces and punctuation, then
this is a document collection about 6 GB in size. Typically, there might
be about 500,000 distinct terms in these documents. There is nothing special
about the numbers we have chosen, and they might vary by an order of
magnitude or more, but they give us some idea of the dimensions of the
kinds of problems we need to handle.
Our goal is to develop a system to address the ad hoc retrieval
task. This is the most standard IR task. In it, a system aims to provide
documents from within the collection that are relevant to an arbitrary
user information need, communicated to the system by means of a one-
off, user-initiated query. An information need is the topic about which the
user desires to know more, and is differentiated from a query, which is
what the user conveys to the computer in an attempt to communicate
the information need. A document is relevant if it is one that the user
perceives as containing information of value with respect to their
personal information need. Our example above was rather artificial in
that the information need was defined in terms of particular words,
whereas usually a user is interested in a topic like ``pipeline leaks'' and
would like to find relevant documents regardless of whether they
precisely use those words or express the concept with other words such
as pipeline rupture. To assess the effectiveness of an IR system (i.e., the
quality of its search results), a user will usually want to know two key
statistics about the system's returned results for a query:
Precision : What fraction of the returned results are relevant to the
information need?
Recall : What fraction of the relevant documents in the collection
were returned by the system?
A 500,000 × 1,000,000 term-document matrix has half-a-trillion 0's and 1's - too many to fit in a
computer's memory. But the crucial observation is that the matrix is
extremely sparse, that is, it has few non-zero entries. Because each
document is 1000 words long, the matrix has no more than one billion
1's, so a minimum of 99.8% of the cells are zero. A much better
representation is to record only the things that do occur, that is,
the 1 positions.
This idea is central to the first major concept in information
retrieval, the inverted index. The name is actually redundant: an index
always maps back from terms to the parts of a document where they
occur. Nevertheless, inverted index, or sometimes inverted file, has
become the standard term in information retrieval. We keep a dictionary
of terms (sometimes also referred to as a vocabulary or lexicon; in this
book, we use dictionary for the data structure and vocabulary for the set
of terms). Then for each term, we have a list that records which
documents the term occurs in. Each item in the list - which records that
a term appeared in a document (and, later, often, the positions in the
document) - is conventionally called a posting. The list is then called a
postings list (or inverted list), and all the postings lists taken together are referred to
as the postings.
Figure 2.3
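A minimal sketch of building such a dictionary-plus-postings structure follows. Terms are lowercased and postings are kept sorted by docID; real indexers add token normalization, positional postings, and on-disk storage.

```java
import java.util.*;

/** Building an inverted index: a dictionary mapping each term to its postings list. */
public class InvertedIndex {
    // term -> sorted set of docIDs containing the term (the postings list)
    final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Postings list for a term; empty if the term is not in the dictionary. */
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addDocument(1, "I did enact Julius Caesar");
        idx.addDocument(2, "So let it be with Caesar");
        System.out.println(idx.lookup("Caesar")); // [1, 2]
        System.out.println(idx.lookup("enact"));  // [1]
    }
}
```

Keeping each postings list sorted by docID is what makes the linear merge described in Section 3.1.2 possible.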
Chapter 3 Implementation Details
3.1 Determining the user terms
3.1.1 Tokenization
Given a character sequence and a defined document unit,
tokenization is the task of chopping it up into pieces, called tokens,
perhaps at the same time throwing away certain characters, such as
punctuation. Here is an example of tokenization:
These tokens are often loosely referred to as terms or words, but it is
sometimes important to make a type/token distinction. A token is an
instance of a sequence of characters in some particular document that
are grouped together as a useful semantic unit for processing. A type is
the class of all tokens containing the same character sequence. A term is
a (perhaps normalized) type that is included in the IR system’s
dictionary. The set of index terms could be entirely distinct from the
tokens, for instance, they could be semantic identifiers in a taxonomy,
but in practice in modern IR systems they are strongly related to the
tokens in the document. However, rather than being exactly the tokens
that appear in the document, they are usually derived from them by
various normalization processes. For example, if the document to be
indexed is to sleep perchance to dream, then there are 5 tokens, but only
4 types (since there are 2 instances of to). However, if to is omitted from
the index, then there will be only 3 terms: sleep, perchance, and dream.
The major question of the tokenization phase is: what are the correct
tokens to use? In this example, it looks fairly trivial: you chop on
whitespace and throw away punctuation characters. This is a starting
point, but even for English there are a number of tricky cases. For
example, what do you do about the various uses of the apostrophe for
possession and contractions? Mr. O’Neill thinks that the boys’ stories
about Chile’s capital aren’t amusing.
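The whitespace-and-punctuation starting point, together with the token/type/term distinction, can be sketched as follows. The stop-word list containing only "to" is an assumption made for this example.

```java
import java.util.*;

/** Tokens vs. types vs. terms, using "to sleep perchance to dream". */
public class Tokenize {
    static List<String> tokens(String text) {
        // Chop on whitespace, strip surrounding punctuation, lowercase.
        List<String> out = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            t = t.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "").toLowerCase();
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> toks = tokens("to sleep perchance to dream");
        Set<String> types = new LinkedHashSet<>(toks); // distinct character sequences
        Set<String> terms = new LinkedHashSet<>(types);
        terms.remove("to");                            // drop a stop word => index terms
        System.out.println(toks.size());  // 5 tokens
        System.out.println(types.size()); // 4 types
        System.out.println(terms);        // [sleep, perchance, dream]
    }
}
```

This naive rule already fails on the tricky apostrophe cases above (O'Neill, aren't), which is why real tokenizers are language- and corpus-specific.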
3.1.2 Processing Boolean queries
How do we process a query using an inverted index and the basic
Boolean retrieval model? Consider processing the simple conjunctive
query:
1.1 Brutus AND Calpurnia
The intersection operation is the crucial one: we need to efficiently
intersect postings lists so as to be able to quickly find documents that
contain both terms. (This operation is sometimes referred to as merging
postings lists: this slightly counterintuitive name reflects using the term
merge algorithm for a general family of algorithms that combine multiple
sorted lists by interleaved advancing of pointers through each; here we
are merging the lists with a logical AND operation.)
There is a simple and effective method of intersecting postings lists using
the merge algorithm: we maintain pointers into both lists
and walk through the two postings lists simultaneously, in time linear in
the total number of postings entries. At each step, we compare the docID
pointed to by both pointers. If they are the same, we put that docID in
the results list, and advance both pointers. Otherwise we advance the
pointer pointing to the smaller docID. If the lengths of the postings lists
are x and y, the intersection takes O(x + y) operations. Formally, the
complexity of querying is Θ(N), where N is the number of documents in
the collection. Our indexing methods gain us just a constant, not a
difference in Θ time complexity compared to a linear scan, but in practice
the constant is huge. To use this algorithm, it is crucial that postings be
sorted by a single global ordering. Using a numeric sort by docID is one
simple way to achieve this. We can extend the intersection operation to
process more complicated queries like:
1.2 (Brutus OR Caesar) AND NOT Calpurnia
1.3 Brutus AND Caesar AND Calpurnia
1.4 (Calpurnia AND Brutus) AND Caesar
1.5 (madding OR crowd) AND (ignoble OR strife) AND (killed OR slain)
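The linear merge described above can be sketched directly; the docID lists below are illustrative postings for Brutus and Calpurnia.

```java
import java.util.*;

/** Linear-time intersection of two docID-sorted postings lists (the "merge"). */
public class Intersect {
    static List<Integer> intersect(int[] p1, int[] p2) {
        List<Integer> answer = new ArrayList<>();
        int i = 0, j = 0;
        while (i < p1.length && j < p2.length) {
            if (p1[i] == p2[j]) {       // same docID: the document is in both lists
                answer.add(p1[i]);
                i++;
                j++;
            } else if (p1[i] < p2[j]) { // advance the pointer at the smaller docID
                i++;
            } else {
                j++;
            }
        }
        return answer;                  // O(x + y) total pointer advances
    }

    public static void main(String[] args) {
        int[] brutus = {1, 2, 4, 11, 31, 45, 173, 174};
        int[] calpurnia = {2, 31, 54, 101};
        System.out.println(intersect(brutus, calpurnia)); // [2, 31]
    }
}
```

For AND NOT, the equal-docID branch skips the document instead of emitting it; OR emits from whichever pointer advances. All three stay linear in the list lengths.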
3.1.3 Schematic Representation of Our Approach
Figure 3.1: A Schematic Model of Our Approach
3.1.4 Methodology of Our Proposed Model
We have followed the existing process to get the DB/Indexes. Then
we will group or cluster the existing index_database by analyzing the
popularity of the page, the position and size of the search terms within
the page, and the proximity of the search terms to one another on the
page, and each cluster is associated with a set of keywords, which is
assumed to represent a concept e.g. technology, science, arts, film,
medical, music, sex, photo and so on.
3.2 Our Proposed Model Tool
3.2.1 Cluster Processor
The Cluster Processor improves its performance automatically by
learning relationships and associations within the stored data and
builds the clusters; a statistical technique is used for identifying
patterns and associations in complex data. Since it is difficult to
accumulate enough exact user searches to build a cluster, the
clustering process fully depends on fuzzy methods.
3.2.2. DB/Cluster
This is the second major module of our approach. It stores the
patterns, or clusters, of complex data present on the web together with
the corresponding URLs. The content of the DB/Cluster is similar to the
DB/Indexes, except that the terms, strings, or keywords related to the
same pattern or concept are placed together in the same cluster,
whereas the DB/Indexes is sorted alphabetically by search term, with
each index entry storing a list of documents in which the term appears
and the locations within the text where it occurs. The data structure
used in the DB/Cluster allows rapid access to documents that contain
user query terms.
3.3 Working Methodology
When the user gives a query string through the entry point of
the search engine [12], the query engine filters those strings or
keywords by analyzing them; this is also done through a learning
process. Next, the query engine detects which clusters the searched
string is associated with. The query engine then retrieves the string
from the relevant cluster only, without searching the entire DB/Indexes
as in the previous architecture. In this way our methodology can give
fast and relevant results.
One potential problem with this system is that a single string
may be present in many clusters. For example, Ferrari is both a laptop
model from Acer and a car brand; how will the query engine know which
Ferrari the user is looking for? In this study we therefore store the
frequency of each string in a file in the DB/Cluster, so that the query
engine can compare the clusters matching the searched string and return
results from the cluster with the higher number of occurrences.
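The frequency-based disambiguation just described might be sketched as follows. The cluster names, the `record` method, and the counts are hypothetical illustrations, not actual data or interfaces from our system.

```java
import java.util.*;

/** Sketch of the disambiguation step: route an ambiguous query term to the
 *  cluster where it occurs most frequently. All names and counts are
 *  illustrative assumptions, not real data. */
public class ClusterRouter {
    // cluster name -> (term -> frequency of that term inside the cluster)
    final Map<String, Map<String, Integer>> clusterFreq = new HashMap<>();

    void record(String cluster, String term, int count) {
        clusterFreq.computeIfAbsent(cluster, c -> new HashMap<>()).put(term, count);
    }

    /** Choose the cluster with the highest stored frequency for the term. */
    String bestCluster(String term) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Map<String, Integer>> e : clusterFreq.entrySet()) {
            int count = e.getValue().getOrDefault(term, 0);
            if (count > bestCount) {
                bestCount = count;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        ClusterRouter router = new ClusterRouter();
        router.record("technology", "ferrari", 12); // e.g. Acer Ferrari laptops
        router.record("automobile", "ferrari", 87); // e.g. Ferrari cars
        System.out.println(router.bestCluster("ferrari")); // automobile
    }
}
```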
Chapter 4 Future Work and Conclusions
4.1 Future Work
There are several directions in which this research can proceed. In
this thesis, we proposed a model for retrieval systems that is based on
a simple document re-ranking method using Local Cluster Analysis.
Experimental results show that it is more effective than existing
techniques. To exhibit the efficiency of the proposed architecture, we
used a single clustering method (PDDP) to produce clusters that are
tailored to the information need represented by the query. Afterwards,
we utilized K-means with the PDDP clusters as the initial configuration
(a hybrid approach) and showed that PDDP has the potential to improve
results individually.
In our approach, the context of a document is considered in the
retrieved results through the combination of information search and
local cluster analysis: first, a relevant cluster tailored to the
user's information need improves the search results efficiently;
second, it makes a high-precision system that places more relevant
documents at the top of the result list. As was shown, even for the
worst query, where average precision decreased by 0.1982, our system
remains high-precision.
4.2 Conclusion
We will pursue this work in several directions. Firstly, the current
methods for clustering search results are PDDP and hybrid K-means; our
experimental results have shown that PDDP has great efficiency for our
purpose, but since the total size of the input in search-results
clustering is small, we can afford some more complex processing, which
could possibly let us achieve better results. Unlike previous
clustering techniques that use some proximity measure between
documents, STC tries to discover meaningful phrases that can become
cluster descriptions and only then assigns documents to those phrases
to form clusters. Using such concept-driven clustering approaches may
be useful future work.
Secondly, we assumed that search results contain two clusters
(relevant and irrelevant). In some cases the irrelevant cluster can be
split into further sub-clusters by semantic relations. Obtaining the
optimal sub-clusters semantically could produce better results.
Thirdly, we re-ranked results based on both clusters and then chose
the better one manually. As mentioned before, we conjecture that the
relevant cluster's centroid must be nearer to the query than the
irrelevant cluster's centroid, so we could automatically choose the
cluster whose centroid is closest to the query (the relevant cluster).
Lastly, we evaluated the proposed architecture on ad hoc retrieval.
As mentioned before, our approach is independent of the initial system
architecture, so it can be embedded in any search engine. Web search
engines are among the systems most in need of high precision, so
evaluating this approach on Web search engines would be a prominent
piece of future work.
Appendices
import javax.swing.*;
import javax.swing.JScrollPane;
import java.awt.*;
import java.awt.event.*;
import java.util.*;
import java.io.*;
public class GDemo
{
public static void main(String args[])
{
SimpleFrame frame = new SimpleFrame();
frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
frame.setVisible(true);
}
}
class SimpleFrame extends JFrame implements ActionListener
{
public static HashMap<String,ArrayList> result= new HashMap<String, ArrayList>();
public static String token;
public static String op;
public static String searchstring;
public static final int DEFAULT_WIDTH = 600;
public static final int DEFAULT_HEIGHT = 400;
final JTextArea textArea;
final JTextField textField;
final JButton button;
public SimpleFrame()
{
setTitle("Information Retrieval System");
setSize(DEFAULT_WIDTH, DEFAULT_HEIGHT);
textField = new JTextField(30);
Font f = new Font("Old Book Style", Font.BOLD, 12);
textField.setFont(f);
textArea = new JTextArea(20, 50);
JScrollPane scrollPane = new JScrollPane(textArea);
add(scrollPane,BorderLayout.CENTER);
textArea.setWrapStyleWord(true);
Font f1 = new Font("Old Book Style", Font.BOLD, 12);
textArea.setFont(f1);
JPanel panel = new JPanel();
JLabel label = new JLabel("Input Text: ");
panel.setLayout(new FlowLayout(FlowLayout.CENTER));
button = new JButton("Click Here");
panel.add(label);
panel.add(textField);
panel.add(button);
panel.add(scrollPane); // add the scroll pane (which already contains textArea); a component can have only one parent
button.addActionListener(this);
Container cp=getContentPane();
cp.add(panel,BorderLayout.CENTER);
}//SimpleFrame()
public void actionPerformed(ActionEvent event)
{
Object sr=event.getSource();
if(sr==button)
{
textArea.setText("");
searchstring=textField.getText();
String tokens[]=searchstring.split(" ");
if(tokens.length > 2)
{
op=tokens[1];
//ArrayList list=searchText(tokens[1]);
result.put(tokens[0], searchText(tokens[0]));
result.put(tokens[2], searchText(tokens[2]));
if(op.equals("AND"))
{
HashSet<String> hs1= new
HashSet<String>(result.get(tokens[0]));
HashSet <String> hs2= new
HashSet<String>(result.get(tokens[2]));
hs1.retainAll(hs2);
//System.out.println("And "+hs1);
textArea.setText("");
//textArea.setText(hs1.toString());
for(String fileName: hs1)
textArea.append(fileName+"\n");
}
else if(op.equals("OR"))
{
HashSet<String> hs1= new
HashSet<String>(result.get(tokens[0]));
HashSet <String> hs2= new
HashSet<String>(result.get(tokens[2]));
hs1.addAll(hs2);
//System.out.println("OR" + hs1);
textArea.setText("");
//textArea.setText(hs1.toString());
for(String fileName: hs1)
textArea.append(fileName+"\n");
}
}
else
{
ArrayList list=searchText(searchstring);
textArea.setText("");
//textArea.setText(list.toString());
Iterator fileName=list.iterator();
while(fileName.hasNext())
{
//System.out.println(it.next());
textArea.append(fileName.next()+"\n");
}
}
//textArea.append(textField.getText()+"\n");
}
}//actionPerformed()
public ArrayList searchText(String args1)
{
//String args1=textField.getText();
//System.out.println("token="+args1);
String args[]=args1.split(" ");
for(int i=0;i<args.length;i++)
args[i]=args[i].toUpperCase();
ArrayList<String> filefound= new ArrayList<String>();
File f= new File("D:\\program\\Java\\Test");
File[] files=f.listFiles();
for(File s: files)
{
for(int i=0;i<args.length;i++)
{
try
{
if(search(s.getPath(),args[i]) && !filefound.contains(s.getPath()))
{
filefound.add(s.getPath()); // avoid listing the same file once per matching term
}
}
catch(Exception e)
{
System.err.println(e); // report unreadable files rather than silently discarding the exception
}
}
}
textArea.append(filefound+"\n");
return filefound;
}//searchText()
public boolean search(String file,String token) throws Exception
{
StringTokenizer st= null;
HashSet<String> set= new HashSet<String>();
BufferedReader br= new BufferedReader(new FileReader(file));
String line=null;
while((line=br.readLine())!=null)
{
st=new StringTokenizer(line," ,.");
while(st.hasMoreElements())
{
set.add((st.nextToken()).toUpperCase());
}
}
//System.out.println(set+"\n");
if(set.contains(token))
return true;
else
return false;
}//search()
}//class SimpleFrame
Results
REFERENCES AND BIBLIOGRAPHY
1. Brin, Sergey and Page, Lawrence. The anatomy of a large-scale hypertextual Web
search engine. Computer Networks and ISDN Systems, April 1998
2. A Novel Page Ranking Algorithm for Search Engines Using Implicit Feedback by
Shahram Rahimi, Bidyut Gupta, Kaushik Adya, Southern Illinois University, USA,
Engineering Letters, 13:3, EL_13_3_20 (Advance online publication: 4 November
2006)
3. Crawling the Web with Java by James Holmes, Chapter 6, Page: 2 & 3
4. Breadth-First Search Crawling Yields High-Quality Pages by Marc Najork and Janet
L. Wiener, Compaq Systems Research Center, USA
5. HOW SEARCH ENGINES WORK AND A WEB CRAWLER APPLICATION by Monica
Peshave, Department of Computer Science, University of Illinois at Springfield,
Springfield
6. Search Engines for Intranets by K.T. Anuradha, National Centre for Science
Information (NCSI), Indian Institute of Science, Bangalore
7. Searching the Web by Arvind Arasu Junghoo Cho Hector Garcia-Molina Andreas
Paepcke Sriram Raghavan, Computer Science Department, Stanford University
8. Franklin, Curt. How Internet Search Engines Work, 2002. www.howstuffworks.com
9. Garcia-Molina, Hector. Searching the Web, August 2001
http://oak.cs.ucla.edu/~cho/papers/cho-toit01.pdf
10. Pant, Gautam, Padmini Srinivasan and Filippo Menczer: Crawling the Web, 2003.
http://dollar.biz.uiowa.edu/~pant/Papers/crawling.pdf
11. Retriever: Improving Web Search Engine Results Using Clustering by Anupam
Joshi, University of Maryland, USA and Zhihua Jiang, American Management
Systems, Inc., USA
12. Effective Web Crawling, PhD thesis by Carlos Castillo, Dept. of Computer Science -
University of Chile, November 2004
13. Design and Implementation of a High-Performance Distributed Web Crawler,
Vladislav Shkapenyuk Torsten Suel, CIS Department, Polytechnic University,
Brooklyn, New York 11201
14. R. Burke, K. Hammond, V. Kulyukin, S. Lytinen, N. Tomuro, and S. Schoenberg.
Natural language processing in the faq finder system: Results and prospects, 1997.
15. T. Calishain and R. Dornfest. Google Hacks: 100 Industrial-Strength Tips & Tools.
O’Reilly, ISBN 0596004478, 2003.
16. David Carmel, Eitan Farchi, Yael Petruschka, and Aya Soffer. Automatic query
refinement using lexical affinities with maximal information gain. In Proceedings of
the 25th annual international ACM SIGIR conference on Research and development in
information retrieval, pages 283–290. ACM Press, 2002.
17. Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data.
Morgan- Kauffman, 2002.
18. Soumen Chakrabarti, Martin van den Berg, and Byron Dom. Focused crawling: a
new approach to topic-specific Web resource discovery. Computer Networks
(Amsterdam, Netherlands: 1999), 31(11–16):1623–1640, 1999.
19. Michael Chau, Hsinchun Chen, Jailun Qin, Yilu Zhou, Yi Qin, Wai-Ki Sung, and
Daniel Mc- Donald. Comparison of two approaches to building a vertical search tool:
A case study in the nanotechnology domain. In Proceedings Joint Conference on
Digital Libraries, Portland, OR., 2002.
20. M. Keen C.W. Cleverdon, J. Mills. Factors determining the performance of indexing
systems. Volume I - Design, Volume II - Test Results, ASLIB Cranfield Project,
Reprinted in Sparck Jones & Willett, Readings in Information Retrieval, 1966.
21. B. D. Davison, D. G. Deschenes, and D. B. Lewanda. Finding relevant website
queries. In Proceedings of the twelfth international World Wide Web conference, 2003.
22. Daniel Dreilinger and Adele E. Howe. Experiences with selecting search engines
using metasearch. ACM Transactions on Information Systems, 15(3):195–222, 1997.
23. Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar. Rank aggregation
methods for the web. In Proceedings of the tenth international conference on World
Wide Web, pages 613–622. ACM Press, 2001.
24. B. Efron. Bootstrap methods: Another look at the jackknife. Annals of Statistics,
7(1):1–26, 1979.
25. Tina Eliassi-Rad and Jude Shavlik. Intelligent Web agents that learn to retrieve and
extract information. Physica-Verlag GmbH, 2003.
26. Oren Etzioni. Moving up the information food chain: Deploying softbots on the world
wide web. In Proceedings of the Thirteenth National Conference on Artificial
Intelligence and the Eighth Innovative Applications of Artificial Intelligence Conference,
pages 1322–1326, Menlo Park, CA, 4–8 1996. AAAI Press / MIT Press.
27. Ronald Fagin, Ravi Kumar, Kevin S. McCurley, Jasmine Novak, D. Sivakumar, John
A. Tomlin, and David P. Williamson. Searching the workplace web. In WWW ’03:
Proceedings of the twelfth international conference on World Wide Web, pages 366–
375. ACM Press, 2003.
28. Ronald Fagin, Ravi Kumar, and D. Sivakumar. Efficient similarity search and
classification via rank aggregation. In Proceedings of the 2003 ACM SIGMOD
international conference on Management of data, pages 301–312. ACM Press,
2003.
29. A. Finn, N. Kushmerick, and B. Smyth. Genre classification and domain transfer for
information filtering. In Proc. 24th European Colloquium on Information Retrieval
Research, Glasgow, pages 353–362, 2002.
30. Aidan Finn and Nicholas Kushmerick. Learning to classify documents according to
genre. In IJCAI-03 Workshop on Computational Approaches to Style Analysis and
Synthesis, 2003.
31. C. Lee Giles, Kurt Bollacker, and Steve Lawrence. CiteSeer: An automatic citation
indexing system. In Ian Witten, Rob Akscyn, and Frank M. Shipman III, editors,
Digital Libraries 98 – The Third ACM Conference on Digital Libraries, pages 89–98,
Pittsburgh, PA, June 23–26 1998. ACM Press.
32. Eric Glover, Gary Flake, Steve Lawrence, William P. Birmingham, Andries Kruger, C.
Lee Giles, and David Pennock. Improving category specific web search by learning
query modifications. In Symposium on Applications and the Internet, SAINT, pages
23–31, San Diego, CA, January 8–12 2001. IEEE Computer Society, Los Alamitos,
CA.
33. Eric J. Glover, Steve Lawrence, William P. Birmingham, and C. Lee Giles.
Architecture of a metasearch engine that supports user information needs. In
Proceedings of the eighth international conference on Information and knowledge
management, pages 210–216. ACM Press, 1999.
34. Ayse Goker. Capturing information need by learning user context. In Sixteenth
International Joint Conference in Artificial Intelligence: Learning About Users
Workshop, pages 21–27, 1999.
35. Ayse Goker, Stuart Watt, Hans I. Myrhaug, Nik Whitehead, Murat Yakici, Ralf
Bierig, Sree Kanth Nuti, and Hannah Cumming. User context learning for intelligent
information retrieval. In EUSAI ’04: Proceedings of the 2nd European Union
symposium on Ambient intelligence, pages 19–24. ACM Press, 2004.
36. Google Web APIs. http://www.google.com/apis/.
37. Luis Gravano, Chen-Chuan K. Chang, Hector Garcia-Molina, and Andreas Paepcke.
STARTS: Stanford proposal for Internet meta-searching. In Proceedings of the 1997
ACM SIGMOD international conference on Management of data, pages 207–218. ACM
Press, 1997.
38. Robert H. Guttmann and Pattie Maes. Agent-mediated integrative negotiation for
retail electronic commerce. Lecture Notes in Computer Science, pages 70–90, 1999.
39. Monika Henzinger, Bay-Wei Chang, Brian Milch, and Sergey Brin. Query-free news
search. In Twelfth international World Wide Web Conference (WWW-2003), Budapest,
Hungary, May 20-24 2003.
40. Adele E. Howe and Daniel Dreilinger. SAVVYSEARCH: A metasearch engine that
learns which search engines to query. AI Magazine, 18(2):19–25, 1997.
41. Jianying Hu, Ramanujan Kashi, and Gordon T. Wilfong. Document classification
using layout analysis. In DEXA Workshop, pages 556–560, 1999.
42. David Hull. Using statistical testing in the evaluation of retrieval experiments. In
SIGIR ’93: Proceedings of the 16th annual international ACM SIGIR conference on
Research and development in information retrieval, pages 329–338. ACM Press,
1993.
43. Thorsten Joachims. Text categorization with support vector machines: Learning with
many relevant features. In Proceedings of the 10th European Conference on Machine
Learning, pages 137–142. Springer-Verlag, 1998.
44. Thorsten Joachims. Text categorization with support vector machines: learning with
many relevant features. In Claire Nédellec and Céline Rouveirol, editors,
Proceedings of ECML-98, 10th European Conference on Machine Learning, pages 137–
142, Chemnitz, DE, 1998. Springer Verlag, Heidelberg, DE.
45. George H. John, Ron Kohavi, and Karl Pfleger. Irrelevant features and the subset
selection problem. In International Conference on Machine Learning, pages 121–129,
1994.