
Web Clustering Engines

SEMINAR REPORT 2009-2011

In partial fulfillment of the requirements for the Degree of Master of Technology

In COMPUTER & INFORMATION SCIENCE

SUBMITTED BY

DEEPTHI THERESA K.K.

DEPARTMENT OF COMPUTER SCIENCE
COCHIN UNIVERSITY OF SCIENCE AND TECHNOLOGY

KOCHI – 682 022

COCHIN UNIVERSITY OF SCIENCE AND TECHNOLOGY
KOCHI – 682 022

DEPARTMENT OF COMPUTER SCIENCE

CERTIFICATE

This is to certify that the seminar report entitled “Web Clustering Engines”, submitted by Deepthi Theresa K.K. in partial fulfillment of the requirements for the award of M.Tech in Computer & Information Science, is a bonafide record of the seminar presented by her during the academic year 2010.

Mr. G. Santhosh Kumar            Prof. Dr. K. Poulose Jacob
Lecturer                         Director
Dept. of Computer Science        Dept. of Computer Science

ACKNOWLEDGEMENT

First of all, let me thank our Director, Prof. Dr. K. Poulose Jacob, Dept. of Computer Science, CUSAT, who provided the necessary facilities and advice. I am also thankful to Mr. G. Santhosh Kumar, Lecturer, Dept. of Computer Science, CUSAT, for his valuable suggestions and support for the completion of this seminar. With great pleasure I remember Dr. Sumam Mary Idicula, Reader, Dept. of Computer Science, CUSAT, for her sincere guidance. I am also thankful to all the teaching and non-teaching staff in the department and to my friends for extending their warm kindness and help.

I would like to thank my parents, without whose blessings and support I would not have been able to accomplish my goal. I also extend my thanks to all my well-wishers. Finally, I thank the Almighty for giving guidance and blessings.

ABSTRACT

Web clustering engines are an emerging trend in the field of information retrieval. They organize search results by topic, thus offering a complementary view to the flat ranked list returned by conventional search engines. In a traditional search engine, results on different subtopics or meanings of a query are mixed together in the list, so the user may have to sift through a large number of irrelevant items to locate those of interest. Web clustering engines categorize the search results into hierarchical groups (clusters) and display the cluster labels, so the user can locate the desired document quickly.

In this seminar we discuss the phases in the implementation of web clustering engines in detail, and present some web clustering algorithms together with their advantages and issues. We also introduce some web clustering engines currently in use. Some future research directions are also presented.

Additional Key Words and Phrases: Web Clustering Engines, Information retrieval, meta

search engines, search results clustering, Search results acquisition, Preprocessing, Cluster

construction and labeling, Vector Space model, data centric clustering algorithms, description

aware algorithms

Contents

1. Introduction
   1.1 Motivation
   1.2 Goal of web clustering engines
   1.3 Issues in the implementation of clusters
2. Architecture and techniques of web clustering engines
   2.1 Architecture of web clustering engines
      2.1.1 Search results acquisition
      2.1.2 Preprocessing of search results
      2.1.3 Cluster construction and labeling
         2.1.3.1 Data centric clustering algorithms
         2.1.3.2 Description aware algorithms
      2.1.4 Visualization of clustered results
3. Efficiency and future works
   3.1 Search results clustering efficiency factors
   3.2 Improve efficiency of clustering
   3.3 Performance evaluation
   3.4 Research directions and future works
4. Conclusion
5. References
6. Appendix

Seminar Report 2010 1 Web Clustering Engine

Dept. Of Computer Science CUSAT

1. INTRODUCTION

1.1 MOTIVATION

Search engines are an invaluable tool for retrieving information from the Web. In

response to a user query, they return a list of results ranked in order of relevance to the

query. The user starts at the top of the list and follows it down examining one result at a

time, until the sought information has been found.

Nowadays efficient search engines such as Google and Yahoo! are available. Although they are good at navigational and transactional searches, they are less effective for ambiguous queries, that is, queries that have different meanings in different contexts. The results returned by conventional search engines on different subtopics or meanings of a query are mixed together in the list, so the user may have to sift through a large number of irrelevant items to locate those of interest. This is where clustering of search results comes into the picture.

Clustering is the act of grouping similar objects into sets. The distance between objects in the same cluster (intra-cluster variation) should be minimal, and the distance between objects in different clusters (inter-cluster variation) should be maximal. In the web search context, clustering means organizing web pages (search results) into groups so that different groups correspond to different user needs.

In 1979 Van Rijsbergen introduced the Cluster Hypothesis to the field of information retrieval. It states that “closely related documents tend to be relevant to the same requests.”

Web Clustering Engines are systems that perform clustering of web search results. These systems group the results returned by a search engine into a hierarchy of labeled clusters (also called categories).



To illustrate, Figure 1 in the appendix shows the clustered results returned for the query “tiger” by Vivisimo, a very popular web clustering engine (as of March 5, 2010). Like many queries on the Web, “tiger” has multiple meanings: the feline, the Mac OS X operating system, the golf champion, and so on. These different meanings are well represented in Figure 1. By contrast, if we submit the query “tiger” to Google or Yahoo! (Figure 2), the items for each meaning are scattered through the ranked list of search results, often across a large number of result pages.

The first commercial clustering engine was Northern Light, at the end of the 1990s. It was based on a predefined set of categories, to which the search results were assigned. A major breakthrough was then made by Vivisimo, whose clusters and cluster labels were dynamically generated from the search results. Some other available clustering engines are Clusty, Grokker, KartOO, Lingo3G, and CREDO.

1.2 GOAL OF WEB CLUSTERING ENGINES

Web Clustering Engines organize search results by topic, thus offering a

complementary view to the flat ranked list returned by the conventional search engines.

The main advantages of the cluster hierarchy are:

- It provides shortcuts to the items that relate to the same meaning. Since Web Clustering Engines group the search results having the same meaning within the same cluster, it is easy for the user to find similar documents, and the search time is reduced.
- It allows better topic understanding. Since Web Clustering Engines give a high-level view of the query, they are useful for informational searches in unknown or dynamic domains.
- It favors systematic exploration of search results. Because a clustering engine summarizes the content of many search results in a single view on the first result page, the user may review hundreds of potentially relevant results without the need to download and scroll through subsequent pages.

A clustering engine tries to address the limitations of current search engines by

providing clustered results as an added feature to their standard user interface.

1.3 ISSUES IN THE IMPLEMENTATION OF CLUSTERS

Unlike conventional document clustering, Web search results clustering has to deal with billions of constantly changing pages. The data are mainly unstructured and heterogeneous, and there is additional information to consider (e.g., links and click-through data).

This dynamic nature of the data, together with the interactive use of clustered results, poses new requirements and challenges to clustering technology:

- Short input data description. For computational reasons, the data available to the clustering algorithm for each search result are usually limited to a URL, an optional title, and a short excerpt of the document’s text (the snippet).
- Meaningful labels. Each cluster label should indicate the contents of the items within that cluster.
- Selection of similarity measure. Many methods are known for measuring the similarity or dissimilarity between two items, such as Euclidean distance and Manhattan distance.
- Grouping of objects into clusters. Many approaches are available for grouping the objects, such as agglomerative clustering, suffix tree clustering, and k-means clustering.
- Computational efficiency. Search results clustering is performed online, within an application that requires overall subsecond response times. The critical step is the acquisition of search results, whereas the efficiency of the cluster construction algorithm is less important due to the low number of input results.
- Overlapping clusters. Since the same result may apply to different themes, we may allow overlapping clusters. Handling overlapping clusters in a dynamic environment is an open issue.
- Unknown number of clusters. In search results clustering, both the number and the size of clusters cannot be predetermined because they vary with the query.
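As a minimal sketch of the similarity-measure issue listed above, the two distances named there can be written in a few lines of Python. The function names and the sample vectors are illustrative assumptions, not part of any particular clustering engine.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan_distance(u, v):
    """Sum of absolute coordinate differences ("city block" distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Two hypothetical term-count vectors over the same vocabulary.
doc_a = [2, 0, 1, 3]
doc_b = [0, 0, 4, 3]

d_euclid = euclidean_distance(doc_a, doc_b)
d_manhattan = manhattan_distance(doc_a, doc_b)
print(d_euclid, d_manhattan)
```

Both functions treat a search result as a point in feature space; which measure groups results best depends on the features used, which is why the choice is listed as an open issue.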



2. ARCHITECTURE AND TECHNIQUES OF WEB CLUSTERING ENGINES

2.1 ARCHITECTURE OF WEB CLUSTERING ENGINES

Practical implementations of Web search clustering engines will usually consist of

four general components: search results acquisition, input preprocessing, cluster

construction, and visualization of clustered results, all arranged in a processing pipeline.

2.1.1 SEARCH RESULTS ACQUISITION

The task of the search results acquisition component is to provide input for the

rest of the system. Based on the query, the acquisition component must deliver 50 to 500

results, each of which should contain a title, a contextual snippet, and the URL pointing

to the full text being referred to.

The source of search results can be any public search engine, such as Google or Yahoo!. Clustering is applied to the smaller set of documents returned by the conventional search engine in response to the query. The most elegant way of fetching results from such search engines is to use the application programming interfaces (APIs) these engines provide.

2.1.2 PREPROCESSING OF SEARCH RESULTS

Input preprocessing is a step that is common to all search results clustering

systems. Its primary aim is to convert the contents of search results (output by the

acquisition component) into a sequence of features used by the actual clustering

algorithm.

The steps of feature extraction are language identification, tokenization, stemming, and selection of features.

Clustering engines that support multilingual content must perform initial

language recognition on each search result in the input.

During the tokenization step, the text of each search result is split into a sequence of basic independent units called tokens, which usually represent single words, numbers, symbols and so on. Tokenization becomes much more complex for languages where white spaces are not present (such as Chinese) or where the text may switch direction (such as an Arabic text within which English phrases are quoted).

The aim of stemming is to remove the inflectional prefixes and suffixes of each word and thus reduce different grammatical forms of the word to a common base form called a stem. For example, the words connected, connecting and interconnection would be transformed to the word connect, which is the stem.
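The tokenization and stemming steps can be sketched roughly as follows. This is a deliberately naive illustration, not a real stemmer such as Porter's: the prefix and suffix lists are assumptions chosen only to reproduce the "connect" example above, and the tokenizer handles whitespace-delimited Latin text only.

```python
import re

# Illustrative affix lists (assumptions); a real stemmer uses far richer rules.
PREFIXES = ("inter",)
SUFFIXES = ("ing", "ion", "ed", "s")
MIN_STEM_LENGTH = 4  # avoid stripping short words down to nothing

def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def naive_stem(word):
    """Strip at most one known prefix and one known suffix."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LENGTH:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            word = word[:-len(suffix)]
            break
    return word

tokens = tokenize("Connected, connecting and interconnection")
stems = [naive_stem(token) for token in tokens]
print(stems)
```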

Last but not least, the preprocessing step needs to extract features for each search

result present in the input. Features are atomic entities by which we can describe an

object and represent its most important characteristic to an algorithm. When looking at



text, the most intuitive set of features would be simply words of a given language. But

this is not the only possibility. The features can vary from single words and fixed-length

tuples of words (n-grams) to frequent phrases (variable-length sequences of words), and

very algorithm-specific data structures, such as approximate sentences.

One method for representing a text is the Vector Space Model (VSM). A document d is represented in the VSM as a vector [w_t0, w_t1, ..., w_tn], where t0, t1, ..., tn is a global set of words (features) and w_ti expresses the weight (importance) of feature ti to document d. Weights in a document vector typically reflect the distribution of occurrences of features in that document. For example, a term vector for the phrase “Polly had a dog and the dog had Polly” could appear as shown below (weights are simply counts of words; articles are rarely specific to any document and normally would be omitted).

    a    and    dog    had    Polly    the
    1    1      2      2      2        1
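A small sketch of how such a count-based term vector could be computed is given here. The stop-word set and the function name are assumptions made for illustration; in line with the remark above, the articles are dropped before counting.

```python
from collections import Counter

STOP_WORDS = {"a", "the"}  # articles, omitted as the text suggests

def term_vector(text, vocabulary=None):
    """Return (vocabulary, weights) where each weight is a raw term count."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    counts = Counter(tokens)
    if vocabulary is None:
        vocabulary = sorted(counts)
    # Counter returns 0 for terms that do not occur in this document.
    return vocabulary, [counts[term] for term in vocabulary]

vocabulary, weights = term_vector("Polly had a dog and the dog had Polly")
print(vocabulary, weights)
```

Passing the same `vocabulary` for every search result yields vectors of equal length, which is what distance-based clustering algorithms expect.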

2.1.3 CLUSTER CONSTRUCTION AND LABELLING

The set of search results along with their features, extracted in the preprocessing

step, are given as input to the clustering algorithm, which is responsible for building the

clusters and labeling them. There are a number of algorithms available for clustering. We

can classify them into two different categories, Data centric and Description aware.

In search results clustering, users are the ultimate consumers of clusters. Hence the created clusters should be aptly labeled. The labels should be unique, unambiguous, comprehensive, and consistent with the content. A poorly labeled cluster is useless even though it contains closely related, relevant documents.



2.1.3.1 DATA CENTRIC CLUSTERING ALGORITHMS

The representatives of this group are conventional data clustering algorithms such as Agglomerative Hierarchical Clustering (AHC) and K-means. Scatter/Gather is a landmark example of a data-centric system. Developed in 1992 at Xerox PARC, it is commonly perceived as a predecessor and conceptual parent of all clustering systems that appeared later. The system uses the VSM for text representation and agglomerative hierarchical clustering (AHC) with an average-link merge criterion as its clustering technique. It starts with an initial clustering of a collection of documents into a set of k clusters (scatter). At query time the user selects the clusters of interest (gather) and the system re-clusters the documents they contain. This process repeats until a small cluster with relevant documents is found. The following figure depicts the operation of a Scatter/Gather system.

Agglomerative Hierarchical Clustering (AHC) is a typical example of a data-centric clustering algorithm. It is a bottom-up approach. Initially each document is in its own cluster. A distance (dissimilarity) matrix is built for every pair of clusters. The two closest clusters are merged, and the distance matrix is rebuilt with the merged pair replaced by a single cluster. This process continues until the desired number of clusters, k, is reached. Because of the distance matrix, the complexity of the algorithm is at least O(n^2), where n is the number of documents to be clustered.
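The AHC procedure described above can be sketched as follows, under stated assumptions: documents are already numeric vectors, Euclidean distance is the dissimilarity, and the average-link criterion (the one attributed to Scatter/Gather earlier) decides which pair to merge. For brevity this sketch recomputes pairwise distances at every step instead of maintaining a distance matrix, so it is slower than a matrix-based implementation.

```python
import math
from itertools import combinations

def average_link(cluster_a, cluster_b, points):
    """Mean distance over all cross-cluster pairs of documents."""
    total = sum(math.dist(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerative(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every document starts alone
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_link(clusters[pair[0]], clusters[pair[1]], points),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters

# Hypothetical 2-D document vectors.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
result = agglomerative(points, k=3)
print(result)
```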

Another data-centric algorithm is K-means clustering. K is a predefined number of clusters, and the representative (centroid) of each cluster is the mean of its members, hence the name. First, choose the number of clusters k. Randomly generate k clusters and determine the centroid of each. Calculate the distance between each cluster centroid and each document, and assign each document to the nearest centroid. Recompute the cluster centroids. Repeat these steps until some convergence criterion is met. The complexity is O(knT), where k is the number of clusters, n is the number of documents, and T is the number of iterations the algorithm needs to reach a stable state (one in which no document changes its cluster membership).
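A minimal K-means sketch following the steps above is given here. It assumes documents are numeric tuples, uses Euclidean distance, and seeds the centroids by sampling k documents at random; the fixed seed and the `max_iter` cap are assumptions added so that the example is repeatable and always terminates.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Return (assignment, centroids); assignment[i] is the cluster of points[i]."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial cluster representatives
    assignment = None
    for _ in range(max_iter):
        # Assign every document to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # converged: no membership changed
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assignment, centroids

# Two hypothetical, well-separated groups of documents.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids = kmeans(points, k=2)
print(assignment)
```

Each pass over the data costs on the order of k times n distance computations, which is where the O(knT) bound comes from.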

Data-centric algorithms borrow their strengths from well-known and proven techniques targeted at clustering numeric data. Even though they use simple keyword-based features, they are still powerful methods.

But these algorithms also have some difficulties. None of them is incremental in nature, where ‘incremental’ means that as each document arrives from the web, we “clean” it and add it to the available model.

Another difficulty with data-centric approaches concerns meaningful labels. In these algorithms cluster labels are created by selecting frequent keywords from the documents of the cluster. This keyword-based representation is often insufficient from the user's perspective. Once a text is converted to a document vector we can hardly speak of the text’s meaning, because the vector is basically a collection of unrelated terms. With labels built from such extracted keywords, the content of a cluster is not very readable.



To support this argument, refer to Figure 3 in the appendix. The query used there is:

Retrieve the top 250 documents that contain the word star.

We ask Scatter/Gather to place the 250 documents into 5 groups. The figure contains only the first scattered clusters. Shown are the clusters' sizes (how many documents they contain), a list of topical terms, and a list of document titles.

One can see from the topical terms of Cluster 1 that this cluster contains documents that involve stars as symbols, as in military rank and patriotic songs. Cluster 2 has 68 documents that appear mainly to be about movie and TV stars. Cluster 3 contains 97 documents that have to do with aspects of astrophysics. Cluster 4 contains 67 documents, also about astronomy and astrophysics; this cluster contains many articles about people who are astronomers. Cluster 5 contains all the articles that discuss animals or plants and that happen to contain the word star, for example, star fish.

But looking at the keyword lists of these clusters alone, we could hardly arrive at these descriptions of their contents. To obtain more descriptive cluster labels we can use description-aware algorithms.

2.1.3.2 DESCRIPTION AWARE ALGORITHMS

Description-aware algorithms are aware of this labeling problem and try to ensure that the construction of cluster descriptions is feasible and yields results interpretable to a human. One way to achieve this goal is to use a monothetic clustering algorithm (i.e., one in which objects are assigned to clusters based on a single feature) and to select the features carefully so that they are immediately recognizable to the user as something meaningful. If features are meaningful and precise, then they can be used to describe the output clusters accurately and sufficiently. The algorithm that first implemented this idea was Suffix Tree Clustering (STC), described in a few seminal papers by Zamir and Etzioni in 1998 and 1999 and implemented in a system called Grouper. In practice, STC was a major breakthrough in search results clustering.

Suffix Tree Clustering (STC) uses a data structure called a suffix tree. It uses phrases (ordered sequences of words) rather than keywords as its atomic features. Suffix tree clustering is performed in three steps: data cleaning, identifying base clusters, and combining base clusters. We define a base cluster to be a set of documents that share a common phrase.

A suffix tree - Definition

1. A suffix tree of a string S is a compact trie containing all suffixes of S.
2. It is a rooted tree.
3. Each internal node has at least two children.
4. Each edge is labeled with a non-empty substring of S. The label of a node is the concatenation of the edge labels on the path from the root to that node.
5. No two edges out of the same node can have edge labels that begin with the same word.

For example the suffixes of a sentence “mouse ate cheese too” are:

Suffix no. Suffixes

1. mouse ate cheese too

2. ate cheese too

3. cheese too

4. too
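The word-level suffixes listed in the table above can be enumerated with a short helper. This is only a sketch of what "suffix" means in STC (suffixes are taken over words, not characters); the function name is an assumption.

```python
def word_suffixes(sentence):
    """Return every suffix of the sentence, treating words as the atomic units."""
    words = sentence.split()
    return [" ".join(words[i:]) for i in range(len(words))]

suffixes = word_suffixes("mouse ate cheese too")
for number, suffix in enumerate(suffixes, start=1):
    print(number, suffix)
```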



A Generalized Suffix Tree (GST) is a suffix tree that contains all the suffixes of two or more sentences.

Step 1 - Data Cleaning

In this step, the string of text representing each document is transformed using a

light stemming algorithm (deleting word prefixes and suffixes and reducing plural to

singular). Sentence boundaries (identified via punctuation and HTML tags) are marked

and non-word tokens (such as numbers, HTML tags and most punctuation) are stripped.

Step 2 - Identifying Base Clusters

The following picture is an example of a Generalized Suffix Tree for the set of strings 1) "cat ate cheese", 2) "mouse ate cheese too" and 3) "cat ate mouse too". The nodes of the suffix tree are drawn as circles. Each suffix-node has one or more boxes attached to it designating the string(s) it originated from. The first number in each box designates the string of origin (1-3 in our example, by the order the strings appear above); the second number designates which suffix of that string labels that suffix-node.



Each node of the suffix tree represents a group of documents and a phrase that is

common to all of them. The label of the node represents the common phrase; the set of

documents tagging the suffix-nodes that are descendants of the node make up the

document group. Therefore, each node represents a base cluster.

The following table lists the six marked nodes (a-f) from the example shown above and their corresponding base clusters:
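The base clusters of the three example strings can also be derived programmatically. The sketch below does not build a real suffix tree: as a stand-in, it enumerates every contiguous phrase by brute force and keeps those that would be internal nodes of the tree, i.e. phrases that occur in at least two documents and are followed by at least two different continuations (the end of each document counts as its own continuation). The function name and this shortcut are assumptions; an actual STC implementation would use a linear-time suffix tree construction.

```python
from collections import defaultdict

def base_clusters(documents):
    """Map each shared phrase to the set of document numbers that contain it."""
    docs_of = defaultdict(set)  # phrase -> ids of documents containing it
    next_of = defaultdict(set)  # phrase -> distinct continuations seen after it
    for doc_id, text in enumerate(documents, start=1):
        words = text.split()
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                phrase = " ".join(words[i:j])
                docs_of[phrase].add(doc_id)
                # Each document has its own end marker, as in a generalized suffix tree.
                follower = words[j] if j < len(words) else ("<end>", doc_id)
                next_of[phrase].add(follower)
    return {
        phrase: docs
        for phrase, docs in docs_of.items()
        if len(docs) >= 2 and len(next_of[phrase]) >= 2
    }

documents = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
clusters = base_clusters(documents)
for phrase, docs in sorted(clusters.items()):
    print(phrase, sorted(docs))
```

The phrase "cat" is shared by documents 1 and 3 but is always followed by "ate", so it is absorbed into the node "cat ate", just as the compact trie would do.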

Each base cluster is assigned a score that is a function of the number of documents it contains and the words that make up its phrase. The score s(B) of base cluster B with phrase P is given by:

    s(B) = |B| · f(|P|)

where |B| is the number of documents in base cluster B, and |P| is the number of words in P that have a non-zero score (i.e., the effective length of the phrase). The function f penalizes single-word phrases, is linear for phrases that are two to six words long, and becomes constant for longer phrases.

Step 3 - Combining Base Clusters

This step of the algorithm merges base clusters that have a high overlap in their document sets. For this we use a base cluster graph, whose nodes are the base clusters. Base clusters are combined according to a similarity measure.

The following figure is a base cluster graph of the previous example.



We define a binary similarity measure. Given two base clusters Bm and Bn with sizes |Bm| and |Bn| respectively, let |Bm ∩ Bn| be the number of documents common to both base clusters. The similarity between Bm and Bn is defined to be 1 iff:

    |Bm ∩ Bn| / |Bm| > 0.5 and
    |Bm ∩ Bn| / |Bn| > 0.5

Otherwise the similarity is 0.

If similarity between base clusters is equal to 1 then draw an edge connecting

those base clusters. A cluster is defined as being a connected component in the base

cluster graph. Each cluster contains the union of the documents of all its base clusters. In

the above base cluster example there is one connected component, therefore one cluster.
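The merging step can be sketched as follows, using the six base clusters of the running example as input. The binary similarity is the one defined above; connected components are found with a small union-find structure, which is an implementation choice of this sketch rather than something prescribed by STC.

```python
def similar(docs_m, docs_n):
    """Binary similarity: True iff the overlap exceeds half of both clusters."""
    common = len(docs_m & docs_n)
    return common / len(docs_m) > 0.5 and common / len(docs_n) > 0.5

def merge_base_clusters(base):
    """Return the final clusters: unions of documents per connected component."""
    names = list(base)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    # Draw an edge (union) between every pair of similar base clusters.
    for i, m in enumerate(names):
        for n in names[i + 1:]:
            if similar(base[m], base[n]):
                parent[find(m)] = find(n)

    merged = {}
    for name in names:
        merged.setdefault(find(name), set()).update(base[name])
    return list(merged.values())

base = {
    "cat ate": {1, 3},
    "ate": {1, 2, 3},
    "cheese": {1, 2},
    "mouse": {2, 3},
    "too": {2, 3},
    "ate cheese": {1, 2},
}
final_clusters = merge_base_clusters(base)
print(final_clusters)
```

Every base cluster here overlaps the "ate" cluster by more than half on both sides, so all six end up in one connected component, in agreement with the single cluster stated above.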

STC has several advantages over data-centric algorithms. The suffix tree it relies on can be constructed in linear time. The algorithm is incremental in nature. It focuses attention on the descriptiveness of cluster labels, so the labels are more effective. STC also supports overlapping clusters.

The following picture gives an overview of the clusters created by the Suffix Tree Clustering method:

The query used here is ‘salsa’. Only the first 5 clusters are shown. The words in bold are the shared phrases found in the clusters. Note the descriptive power of phrases such as "Puerto Rico", "Latin Music" and "York Salsa Dancers".

2.1.4 VISUALIZATION OF CLUSTERED RESULTS

Powerful visualizations are now available for Web Clustering Engines. One prominent approach is based on hierarchical folders; Web Clustering Engines such as Clusty, CREDO, and Lingo3G use this approach. Grokker, another well-known clustering engine, uses a nesting and zooming approach. Some engines use graph-based interfaces; KartOO is one such system.



Some Clustering Engines and their visualizations are mentioned below:

Clusty

Clusty is a clustering engine developed by the company Vivisimo. Vivisimo won the “best meta-search engine award” assigned by SearchEngineWatch.com from 2001 to 2003. Vivisimo means lively, bright, or clever in Spanish; the founders picked the name to express their vision of optimizing and giving life to our information. Clusty is a meta search engine, meaning that it combines results from a variety of different sources. It uses an algorithm to cluster content based on textual similarity. For every search, Clusty pulls together data from other engines such as Ask, MSN, and Wisenut. It then organizes the search results in a way that helps us navigate away from ambiguity towards a specific cluster of results.

Clusty uses a hierarchical folder approach, a very simple method that is familiar to everyone. Figure 1 in the appendix is a screenshot of Clusty (taken on March 5, 2010). The hierarchical folders are placed on the left side of the screen, so the user can quickly choose any cluster he or she needs.

CREDO

CREDO (Conceptual REorganization of DOcuments) was developed at Fondazione Ugo Bordoni by Claudio Carpineto and Gianni Romano. CREDO groups the results of a web search (currently obtained through the Yahoo! search APIs) in a lattice of conceptual clusters that highlight the contents of the retrieved documents. CREDO is based on a mathematical data representation termed a concept lattice. Compared to other systems for clustering Web results, the clusters produced by CREDO are more justifiable, are easier to navigate because they are organized in a lattice rather than a strict hierarchy, and allow discovery of causal associations between the words contained in the results. CREDO is an interesting example of a system that attempts to build the taxonomy of topics and their descriptions simultaneously. Even though CREDO does not follow a strict hierarchical organization, it can still use a tree-based visualization. Figure 4 in the appendix (taken on March 6, 2010) shows the visualization of CREDO.

Versions of CREDO for PDAs (Credino) and for cellular phones (SmartCREDO) have been developed in collaboration with Stefano Mizzaro and Andrea Della Pietra (University of Udine).

Grokker

Grokker was developed by a company called Groxis, a tech company based in San Francisco, California. The name Grokker is inspired by Robert A. Heinlein's 1961 science fiction classic Stranger in a Strange Land, in which grok is a Martian word meaning literally ‘to drink’ and metaphorically ‘to be one with.’ To grok something is to understand it so well that it is fully absorbed into oneself; it is to look at every problem, opportunity, action, and point of view from any and all perspectives. Grokker sits on top of multiple sources. After Grokker retrieves the information, it "federates" it, meaning that it meshes it all together. Finally, it clusters the returns into categories. End users most frequently look at fewer than three screens out of the thousands of returned search results. Using Grokker, users immediately see the cluster(s) of greatest relevance and drill down only within the cluster(s) that matter to them.

Grokker uses a nesting and zooming approach. A screenshot of Grokker is shown in Figure 5 of the appendix. This Map View is a visual representation of the returned hits. When the user clicks on one of the circles, its subcategories are shown in turn. By clicking on Search Options the user can change the number of hits returned and choose which sites to search: Yahoo, Wikipedia and/or Amazon. Several sites can be searched simultaneously. Finally, the results can be narrowed using the tools on the left side of the screen.

Some universities used Grokker as their search tool. Stanford University was one of the first customers of Grokker. The platform provided faculty and students with a single point of access to multiple resources, including library catalogs, proprietary subscription databases, and the Web. It helped Stanford users to be more efficient in their research and navigation among the numerous available resources. The desktop version of Stanford Grokker is no longer supported and is not available for download. In March 2009, Groxis ceased operations.

KartOO

KartOO was a meta search engine with a visual interface. It operated from 2001 to early 2010. Instead of a text-based list of results, KartOO had an advanced Adobe Flash GUI that used a graph-based approach. Its color scheme was to a degree reminiscent of Apple Computer's Aqua interface. Search results were presented as a "map", with blob-like masses of varying color connecting the items. The shape of each blob depended on the relevance to the query of the keyword corresponding to that blob. If one began a search with a general topic, KartOO sometimes helped to narrow it down: every "blob" clicked added another word to the search query. The map would often succeed in presenting keywords or subtopics that defined the topic one was searching on. Figure 6 in the appendix shows the visualization of KartOO.

KartOO was co-founded in France by two cousins, Laurent and Nicholas Baleydier, and launched in 2001. In 2004, KartOO launched a new version called UJIKO. In January 2010 KartOO closed down, removing all content from the KartOO and UJIKO websites but leaving a short message in French thanking its users for their support.



3. EFFICIENCY AND FUTURE WORKS

3.1 SEARCH RESULTS CLUSTERING EFFICIENCY FACTORS

The most critical tasks involve the first three components presented above, namely search results acquisition, preprocessing, and clustering. The visualization component is not likely to affect the overall system efficiency in a significant manner.

Search Results Acquisition

The number of search results required for clustering usually cannot be fetched in one remote request. The Yahoo! API allows up to 50 search results to be retrieved in one request, while the Google SOAP API returns a mere 10 results per remote call. The time needed for acquisition depends on network congestion, on the capability of the local equipment used, and on the specific server processing the request on the search engine side.
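As a rough worked example of why several remote calls are needed, the number of requests is the number of wanted results divided by the per-request limit, rounded up. The target of 200 results is an assumption chosen from within the 50 to 500 range given earlier, and the helper name is hypothetical; the per-request limits are the ones quoted above.

```python
import math

def requests_needed(total_results, results_per_request):
    """Number of remote calls required to fetch total_results items."""
    return math.ceil(total_results / results_per_request)

wanted = 200  # assumed number of results to cluster

yahoo_calls = requests_needed(wanted, 50)   # Yahoo! API: up to 50 per request
google_calls = requests_needed(wanted, 10)  # Google SOAP API: 10 per request
print(yahoo_calls, google_calls)
```

Under these assumptions the same input takes five times as many calls through the smaller page size, which is one reason acquisition dominates the response time.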

Preprocessing

The performance of tokenization is a critical concern in the case of

preprocessing of search results. Tokenizers will have a different performance

characteristic depending on whether they were hand-written or automatically generated.

Tokenization becomes much more complex for languages where white spaces are not

present (such as Chinese) or where the text may switch direction (such as an Arabic text,

within which English phrases are quoted).

Clustering

Depending on the specific algorithm used, the clustering phase can significantly

contribute to the overall processing time. Search results clustering systems must be

optimized to handle smaller instances and process them as fast as possible.



3.2 IMPROVE EFFICIENCY OF CLUSTERING

There are a number of techniques that can be used to improve the computational

performance of a search results clustering engine.

Client side processing

The majority of currently available search clustering engines perform all processing on the server side. One possible problem with this approach is that during periods of high query rates the response times can increase significantly and thus degrade the user experience. To avoid this, some of the processing can be delegated to client-side resources. In this way, scalability issues and the resulting problems could be avoided.

Incremental processing

One desirable feature of search results clustering would be incremental processing: as each document arrives from the Web, we “clean” it and add it to the model built so far, instead of waiting for the complete result list.
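
A minimal sketch of this idea follows. It assumes that the "model" is a simple mapping from each term to the documents containing it and that "cleaning" means lower-casing and word extraction. Both are assumptions made for illustration, as are the class name `IncrementalModel` and the sample snippets.

```python
import re
from collections import defaultdict

class IncrementalModel:
    """Term -> set of document ids, updated one document at a time."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.doc_count = 0

    @staticmethod
    def clean(snippet: str) -> list[str]:
        # "Cleaning" here is only lower-casing and word extraction (an assumption).
        return re.findall(r"\w+", snippet.lower())

    def add(self, snippet: str) -> int:
        """Clean one arriving snippet, add it to the model, and return its id."""
        doc_id = self.doc_count
        for term in self.clean(snippet):
            self.postings[term].add(doc_id)
        self.doc_count += 1
        return doc_id

model = IncrementalModel()
# Hypothetical snippets arriving one by one from the search engine.
for snippet in ["Jaguar car dealers", "Jaguar the big cat", "Big cat rescue"]:
    model.add(snippet)  # the model is usable after every single call

print(sorted(model.postings["jaguar"]))  # [0, 1]
```

Because each call to `add` leaves the model in a consistent state, processing can overlap with the network delay of fetching the next batch of results.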

Pretokenized documents

The input to a Web clustering engine is the set of search results returned by conventional search engines. These search engines already apply some preprocessing to their results before they are returned. If the clustering engine could reuse the resulting tokens for its own work, it would avoid repeating the tokenization step.



3.3 PERFORMANCE EVALUATION

Clustering engines are designed to overcome the limitations of plain search engines. We therefore need to evaluate whether the use of clustered results does yield a gain in retrieval performance over flat ranked lists. Some methods are explained below.

The first method is based on the conventional notions of recall and precision. To apply them, the retrieved results must form a linear list rather than a set of clusters. One obvious way to perform such a linearization would be to preserve the order in which clusters are presented and simply expand their content, but this would amount to ignoring the role played by the user in choosing which clusters to expand. One of the earliest and simplest linearization techniques is to assume that the user can choose the cluster with the highest density of relevant documents and to consider only the documents contained in it, ranked in order of relevance.
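
The best-cluster linearization just described can be sketched as follows. The cluster contents and relevance judgments are hypothetical toy data, and the function names are chosen only for illustration. Density is taken to be the fraction of a cluster's documents that are relevant.

```python
def best_cluster_linearization(clusters, relevant):
    """Return the label and documents of the cluster with the highest
    density of relevant documents, keeping the in-cluster rank order."""
    def density(docs):
        return sum(d in relevant for d in docs) / len(docs) if docs else 0.0
    best_label = max(clusters, key=lambda label: density(clusters[label]))
    return best_label, clusters[best_label]

def precision_recall(ranked, relevant):
    """Precision and recall of a linear list against the relevant set."""
    hits = sum(d in relevant for d in ranked)
    precision = hits / len(ranked) if ranked else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical toy data: two clusters and a set of relevant documents.
clusters = {
    "cars": ["d1", "d4", "d6"],
    "animals": ["d2", "d3", "d5", "d7"],
}
relevant = {"d2", "d3", "d5", "d8"}

label, ranked = best_cluster_linearization(clusters, relevant)
print(label, ranked, precision_recall(ranked, relevant))
```

Note that this technique assumes an ideal user who always picks the best cluster, so the resulting figures are an upper bound rather than a measurement of real behaviour.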

A more analytic approach is based on the reach time: a model of the time taken to locate a relevant document in the hierarchy.

Another method is to analyze user logs: the logs of a search engine are compared with those of a clustering engine by computing several metrics, such as the number of documents followed, the time spent, and the click distance. The interpretation of user logs is, however, difficult.

To date, the evaluation issue has probably not received sufficient attention, and it remains open. Nevertheless, some experimental findings suggest that Web clustering engines may be more effective than plain search engines. Owing to the lack of an effective method for evaluating their performance, clustering engines have still not attracted wide attention.



3.4 RESEARCH DIRECTIONS AND FUTURE WORKS

The most important research issue is how to improve the quality and usability of the output hierarchies. To improve cluster efficiency, more powerful features should be extracted. Developers should also adopt methods for generating more expressive and effective descriptions of clusters.

Finding optimal cluster representatives is another approach to increasing the efficiency of the clustering phase. If a better cluster representative can be found, fewer iterations are needed to reach a stable clustering, which means a shorter response time. A combination of existing clustering algorithms can also be used to obtain better clusters.

One advanced concept is personalized clustering. We speak of personalization when the clustering process depends not only on the search results but is also influenced by the characteristics of the user. Instead of optimizing the construction of the hierarchy structure, one can try to reorganize a given structure based on user actions. The proposed techniques exploit user feedback to filter out parts of the hierarchy that are presumably of no interest to the user.
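
One very simple form of such filtering is to drop every subtree whose label the user has marked as uninteresting. The sketch below assumes a hierarchy stored as nested dictionaries with `label`, `docs` and `children` keys and an explicit set of rejected labels. This representation, the function name `prune` and the sample data are all hypothetical and do not describe any particular system.

```python
def prune(node, rejected):
    """Return a copy of the hierarchy without subtrees whose label is in
    `rejected`; return None if the node itself is rejected."""
    if node["label"] in rejected:
        return None
    kept = []
    for child in node.get("children", []):
        pruned_child = prune(child, rejected)
        if pruned_child is not None:
            kept.append(pruned_child)
    return {"label": node["label"], "docs": list(node.get("docs", [])), "children": kept}

# Hypothetical cluster hierarchy for an ambiguous query.
hierarchy = {
    "label": "jaguar",
    "children": [
        {"label": "cars", "docs": ["d1", "d4"], "children": [
            {"label": "dealers", "docs": ["d4"]},
        ]},
        {"label": "animals", "docs": ["d2", "d3"]},
    ],
}

# The user has indicated no interest in the "cars" branch.
pruned = prune(hierarchy, rejected={"cars"})
print([child["label"] for child in pruned["children"]])  # ['animals']
```

The original hierarchy is left untouched, so the filtered view can be recomputed whenever the user's feedback changes.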

One of the recent topics in the field of search result clustering is the growing market of mobile search. Two mobile versions of CREDO, suitable for personal digital assistants and cellular phones, have been developed. The systems, termed Credino (small CREDO, in Italian) and SmartCREDO, are exclusively based on the search results and are freely available online. Screenshots of Credino are available in the appendix as Figure 7, Figure 8 and Figure 9 (taken on March 6, 2010).

The Semantic Web is a recent research topic. In the Semantic Web, the meaning (semantics) of information on the Web is defined, making it possible for machines to process it. Google has provided a good example of Semantic Web technology with its "rich snippets". Swoogle is a Semantic Web search engine. In future, clustering can also be applied to Semantic Web search engines.



4. CONCLUSION

Web clustering engines organize search results by topic, thus offering a complementary view to the flat-ranked list returned by conventional search engines. Web clustering engines have reached a level at which research has been developed and commercial systems are being deployed. A number of advances must still be made in cluster labels, coherence of the cluster structure, performance evaluation studies, and advanced visualization techniques. Only then will Web clustering engines entirely fulfill the promise of being the PageRank of the future.



5. REFERENCES

Journal/Paper:

Claudio Carpineto, Stanisław Osiński, Giovanni Romano and Dawid Weiss, "A Survey of Web Clustering Engines", ACM Computing Surveys, Vol. 41, No. 3, Article 17, July 2009.

Oren Zamir and Oren Etzioni, "Web Document Clustering: A Feasibility Demonstration", In Proc. 21st Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 46-54, 1998.

Books:

C. J. van Rijsbergen, Information Retrieval, Butterworth, 1979.

Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison Wesley Longman Publishing Co. Inc., 1999.

Websites:

http://clusty.com/ March 5, 2010

http://credo.fub.it March 8, 2010

http://www2.parc.com/istl/projects/ia/sg-example1.html March 4, 2010

http://credino.dimi.uniud.it/ March 10, 2010

http://smartcredo.dimi.uniud.it March 10, 2010




6. APPENDIX

Figure 1




Figure 2



Figure 3



Figure 4



Figure 5



Figure 6



Figure 7

Figure 8



Figure 9
