General Graduate Exams


Exploration and Visualization of Information in Search Engines

by

Panagiotis Papadakos

Presented to Graduate Studies Committee of

the Computer Science Department of

the University of Crete

Heraklion, May 2009



Contents

Page

Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 Interaction Paradigms and Visualization in Information Retrieval (IR) . . . . . . . . . . . . . . . 3

2.1 Information Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2 Information Space and User Information Needs . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2.1 Micro and Macro Level of Information . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2.2 Information Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2.3 User Information Needs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.3 Interaction Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.3.1 Query Searching vs Browsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.3.2 Differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.3.3 Three Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.4 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.4.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.4.2 Scientific and Information Visualization . . . . . . . . . . . . . . . . . . . . . . . . . 7

3 Interaction Paradigms and Related Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3.1 Results Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3.1.1 Clustering Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.1.2 Clustering Algorithms Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

Hierarchical and Non-Hierarchical Approaches . . . . . . . . . . . . . . . . . . . . . 10

Document-based and Snippet-based Approaches . . . . . . . . . . . . . . . . . . . . 12

3.1.3 Cluster Presentation & User Interaction . . . . . . . . . . . . . . . . . . . . . . . . . 14


3.2 Facets and Dynamic Taxonomies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.2.2 Taxonomy Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.2.3 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.2.4 User Interface Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4 Visualization Models and Metaphors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.1 Multiple Reference Points Based Models (MRPBM) . . . . . . . . . . . . . . . . . . . . . . 27

4.2 Euclidean Spatial Characteristic Based Model (ESCBM) . . . . . . . . . . . . . . . . . . . . 30

4.3 Pathfinder Associative Network (PFNET) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.4 Multidimensional Scaling Models (MDS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.5 Self-organizing Map Model (SOM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.6 Metaphors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.6.1 Metaphors for the Semantic Framework Presentation . . . . . . . . . . . . . . . . . . 39

4.6.2 Metaphors for Information Retrieval Interaction . . . . . . . . . . . . . . . . . . . . 41

5 Vision and Research Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5.1 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5.2 Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

5.3 Proposed Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

5.3.1 Information Visualization Framework . . . . . . . . . . . . . . . . . . . . . . . . . . 48

5.3.2 Metrics for Exploratory Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.3.3 Interaction Models for Exploratory Search . . . . . . . . . . . . . . . . . . . . . . . . 49

5.3.4 Exploratory Search and Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . 50

5.3.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.3.6 Evaluation and Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.4 Work Done . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.4.1 ODBMS Index Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.4.2 FleXplorer, A Framework for Providing Faceted and Dynamic Taxonomy-based Information Exploration . . . . . . . . . . . . . . . . . . 52

5.4.3 Exploratory Web Searching with Dynamic Taxonomies and Results Clustering . . . 52

6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

Appendices

A Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57


List of Figures

3.1 Clusty, a Snippet-based Clustering Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.2 Quintura Word Cloud . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.3 grokker Generates an Euler Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.4 Top-200 Web Search Results Clustering Displayed Using Two-level TreeMaps . . . . . . . . 17

3.5 Kartoo Generates a Thematic Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.6 Example of a Materialized Faceted Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.7 ContentLandscape Applies Collapsible Panel Pattern for Zooming . . . . . . . . . . . . . . . 21

3.8 FacetZoom Combines Ideas from Zoomable User Interfaces (UIs) With Faceted Search . . . 22

3.9 Faceted Search for Small Screens in the FaThumb Prototype . . . . . . . . . . . . . . . . . . 23

3.10 Flamenco Allows Choosing Between a Search Over All Results or Within Current Focus . . 23

4.1 Display of 4 Reference Points in a Fixed Reference Point Environment . . . . . . . . . . . . 28

4.2 VIBE Using 5 Reference Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.3 WebStar Using 4 Reference Points (RPs). Snapshots During a Full Rotation of international Reference Point . . . . . . . . . . . . . . . . . . . . 31

4.4 Display of the Projected Cosine Model, in Distance-Angle DARE Model . . . . . . . . . . . 32

4.5 Display of the Projected Cosine Model, in the Angle-Angle TOFIR Model . . . . . . . . . . 33

4.6 Display of the Projected Distance Model, in the Distance-Distance GUIDO Model . . . . . 34

4.7 Display of Original Network (left) and Final PFNET Network (right) . . . . . . . . . . . . . 35

4.8 Display of ThemeScape and Galaxy Visualizations of IN-SPIRE Visualization Program . . 37

4.9 A SOM Feature Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.10 A 3D Cone Tree (left) and a Basic Hyperbolic Tree (right) . . . . . . . . . . . . . . . . . . . 41

4.11 Perspective Wall (left) and ThemeRiver (right) . . . . . . . . . . . . . . . . . . . . . . . . 42

4.12 DataLens, a 3D Pyramid Lens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.13 Gridl Prototype Displays Search Results Along Two Axes . . . . . . . . . . . . . . . . . . . 43


4.14 HotMaps, a 2D Visualization of How Query Terms Relate to Search Results . . . . . . . . . 44


List of Tables

3.1 Basic Notions and Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.2 Interaction Notions and Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26


Chapter 1

Motivation

The daily use of computers as tools for work, education, communication and entertainment produces a huge volume of data. As recent surveys state¹, the world produces between 1 and 2 exabytes (2^60 bytes) of unique information per year, 90% of which is digital, with a 50% annual growth rate. In addition, these new data are more complex and more dynamic. The interaction paradigm adopted by current Information Retrieval (IR) systems and Web Search Engines (WSEs), a simple rectangular text box where the user usually enters one or two terms and the system returns a ranked list of results, has proven very useful for finding specific information and is simple and intuitive to use.

However, such systems do not provide adequate support for information needs that have an exploratory nature and/or aim at decision making. User studies have shown that casual users usually inspect only the first page of results and do not exploit any of the query language operators (not even Boolean queries) that are offered. Instead, they issue very short queries which they reformulate in an iterative process based on the returned results [73, 62]. On the other hand, the powerful and expressive query languages that are usually offered for structured information (e.g. for the Semantic Web) are not fully utilized, in the sense that the formulation of queries is a laborious and difficult task for end users.

In this highly demanding and growing information environment, new, more intuitive and user-friendly user interfaces (UIs) have to be created, providing effective and efficient services for retrieving and exploring the available information and supporting users in their various decision-making tasks and processes. The fields of IR, Information Visualization (IV) and Human Computer Interaction (HCI) have to collaborate in order to provide new intuitive and interactive UIs, where the information is presented, organized, and analyzed, giving the user the ability to recognize patterns and relations.

¹ http://www.sims.berkeley.edu/research/projects/how-much-info-2003/

For example, to select a hotel or a product to buy, it is not enough to return the list of choices that satisfy user-provided criteria. The available choices also have to be ranked according to user-based (i.e. preference) or statistical criteria. Furthermore, exploration services are required that provide users with comprehensive summaries of the available choices, enabling them to quickly grasp the information landscape, restrict their focus, and thus gradually approach the most desired choices.

For this reason, efforts to exploit the above query languages in models of exploration/navigation have started to emerge [56, 49, 30, 4]. In summary, the constantly increasing volume and requirements of our digital economy call for intuitive modes of interaction, involving flexible and efficient navigation and advanced visualization.


Chapter 2

Interaction Paradigms and Visualization in IR

2.1 Information Retrieval

Information Retrieval (IR) is the domain focusing on searching, exploring and discovering information, either from organized textual and data repositories or from the World Wide Web (WWW), in order to satisfy users' information needs. However, since the information environment is constantly growing, another important aspect of IR systems is their ability to organize this information. This organization can facilitate the creation of innovative, more intuitive and user-friendly UIs, which will provide users with efficient ways of mapping, organizing and grouping the available information. This can enable users to discover new patterns and relationships in the available information and to satisfy their information needs faster and more accurately.

2.2 Information Space and User Information Needs

2.2.1 Micro and Macro Level of Information

Information from an infobase can be divided into two levels. The first is the micro level, which refers to individual objects or documents, such as contents, snippets or full text. This is the direct and obvious information. The other level is the macro level, which refers to aggregated information about objects or documents from the collection. This information is not direct but is generated from the individual objects of the collection, and relies on the way the information is organized and presented. Such information can provide object connections, rhythms, trends, patterns and relationships, explaining information at the micro level. For the same data, the aggregate information at the macro level can vary with the information organization method and the information presentation. By navigating information at the macro level the user can gain a better understanding of the provided collection and find unexpected insights [92]. An IR system should provide access to both levels of information, by browsing and query searching.

2.2.2 Information Space

The information space can be conceived as an abstract and multidimensional space. Its structure is based on the semantic characteristics and relationships derived from the organization of the collection data set, which enables users to explore and discover information from the data collection. An information space can be constituted by intrinsic attributes such as keywords, citations, hyperlinks, and authors, or by extrinsic structures like a subject directory, a thesaurus system, or an organized search result list. Combinations of intrinsic attributes and extrinsic structures can also form an information space. Since information does not by itself constitute a space, to describe its spatial characteristics we have to define basic topological properties like distance, direction and angle. For instance, the distance between two objects can be the length of the shortest path in a hyperlink, citation or hierarchical structure, or the Euclidean distance in the Vector Space Model (VSM). Direction has a special meaning in hyperlink and citation based systems: if an object links to or cites another object, the former points to the latter. In a multidimensional vector-space based IR system, the angle between vectors serves as the basis of the retrieval model. Finally, the information space has to be reduced from N dimensions to 1, 2 or 3 in order to be perceived by humans, which can lead to user disorientation and ambiguity [92].
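As a minimal sketch of the spatial notions discussed here, the following Python fragment computes a link-based distance (the length of the shortest path over directed hyperlinks or citations), the Euclidean distance, and the angle between two vectors of a vector space. The function names, the adjacency-dictionary representation of links and the toy data are illustrative assumptions and do not come from any system cited in this report.

```python
import math
from collections import deque


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length term-weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def angle_between(u, v):
    """Angle in radians between two non-zero vectors (smaller = more similar)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating point drift before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


def link_distance(links, source, target):
    """Length of the shortest directed path of links, or None if unreachable.

    `links` maps an object to the objects it links to (or cites), so the
    direction of each link is respected by the breadth-first search.
    """
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in links.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical toy data: three pages linked in a chain a -> b -> c.
toy_links = {"a": ["b"], "b": ["c"]}
print(link_distance(toy_links, "a", "c"))   # follows the link direction
print(link_distance(toy_links, "c", "a"))   # no path against the direction
```

Note that the link-based distance is not symmetric, which reflects the special meaning of direction in hyperlink and citation based systems.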

2.2.3 User Information Needs

User studies have shown that almost 60% of search tasks are exploratory [62]. The user does not know his information need precisely and provides only 2-5 words, and focalized search very commonly leads to inadequate interactions and poor results. Unfortunately, the available user interfaces (UIs) do not aid the user in formulating his query. Furthermore, such systems do not provide adequate support for information needs that have an exploratory nature and/or aim at decision making. The answers returned are simple ranked lists of results, with no organization and no information on the macro level of the infobase. Casual users usually inspect only the first page of results and do not exploit any of the query language operators (not even Boolean queries) that are offered. Instead, they issue very short queries which they reformulate in an iterative process based on the returned results [73, 62]. On the other hand, the powerful and expressive query languages that are usually offered for structured information (e.g. for the Semantic Web) are not fully utilized, in the sense that the formulation of queries is a laborious and difficult task.

2.3 Interaction Paradigms

2.3.1 Query Searching vs Browsing

In IR two paradigms are widely recognized: query searching and browsing. Query searching is the paradigm where the user tries to describe his information needs with a group of relevant and important terms. The query is then analyzed by the IR engine and a list of related documents (ordered by the ranking model in use) is returned. Most IR engines also display snippets of relevant parts of the returned documents. Since matching is based on the query terms, for ambiguous words, which can have many meanings, the system might return non-relevant results, which the user might perceive as a search failure.

On the other hand, browsing refers to UIs which allow the user to view, search and scan either the

whole information or part of it. This enables the user to explore and discover information, along with data

relationships and patterns. A UI for browsing should provide smooth and structured browsing. Methods

for information browsing include hyperlink and hierarchical structures. However, huge volumes of data

require the appropriate usage of automatic data analysis techniques, prior to visualization. According to

[81], browsing is useful when (a) there is good underlying structure, so items close to one another are

similar, (b) users are unfamiliar with the contents of the collection, (c) users have a limited understanding

of the organization of a system and prefer a less cognitively loaded method of exploration, (d) it is difficult

to verbalize the underlying information need and (e) the information is easier to recognize than describe.

2.3.2 Differences

According to [92], the differences between query searching and browsing include:

Judgment of Relevance

Query searching is based on keyword matching between query terms and surrogates of documents in a database, at a lexical level. On the other hand, the relevance judgment of browsing is completed by users and is a concept matching process.

Continuity

The retrieval process is continuous for browsing, while it is discrete for query searching. Selecting a browsing path, examining a context, and judging relevance are continuous and controlled by the user during browsing, while after executing a query, the internal query process and ranking of the results is a black box for the users.

Cost in Time and Effort

Browsing is a time and effort consuming action, since the user must remember the browsing path,

search the contents and make decisions, while query searching involves only term selection and query

formulation.

Information Seeking Behavior

Browsing is a system-based seeking behavior (i.e. driven by what the system can offer), while query searching is a seeking behavior based on what the user wants.

Iteration

Browsing is completed by a series of iterative acts, like getting an overview of available information, fixing on a target and examining it more closely, and then moving on and starting the cycle again. Query searching, on the other hand, requires the definition of the query terms, the formulation of the query and the examination of the results. Query searching might also be iterative, since the results might not fulfill the information needs of the user.

Granularity

Using browsing the user can evaluate one relevant item at a time, while query search provides a group

of retrieved documents.

Clarity of Information Need

When a user starts an information seeking process, he might not have defined a clear information need. In such a case, browsing is more appropriate, since it does not require a definite target, while query searching requires a relatively well-conceived information need, for which keywords can be chosen and a query can be formulated.

Interactivity

Browsing is an interactive process by nature, which makes it more complicated and challenging, while query searching has fewer steps and less interaction.

Retrieval Results

Results of browsing are richer and more diverse, since they can lead to a wide range of retrieval

results (i.e. from contextual information, structural information, relational information to individual

objects), while query searching only retrieves a ranked list of documents.




2.3.3 Three Paradigms

Although query searching and browsing are different ways to seek information, they can be synthesized.

There are three basic paradigms:

Querying and Browsing (QB) In this paradigm, an initial query is submitted to the system to restrict the infobase. Then the results are displayed in a visualization environment and browsed by users.

Browsing and Querying (BQ) Information at the macro level is presented and browsed, and then information at the micro level is searched and highlighted in the visualization contexts.

Browsing Only (BO) Information at the macro level is displayed and browsed. This paradigm does not integrate any query searching components.

Query searching on its own is not categorized as a paradigm here, because it is a traditional IR paradigm that does not require a visual space.

2.4 Visualization

2.4.1 Definition

According to [51], visualization is a method of computing which transforms the symbolic into the geometric, enables researchers to observe their simulations and computations, offers a method for seeing the unseen, enriches the process of scientific discovery, and fosters profound and unexpected insights. Visualization is the process of transforming data, information, and knowledge into graphic presentations to support tasks such as data analysis, information exploration, information explanation, trend prediction, pattern detection and rhythm discovery. Without the assistance of visualization, people perceive or comprehend less of the data, information, or knowledge, for a variety of reasons. Such reasons may include the limitations of human vision, or the invisibility and abstractness of the data, information and knowledge. Visualization requires certain methods or algorithms to convert raw data into a meaningful, interpretable, and displayable form that visually conveys information to users.

2.4.2 Scientific and Information Visualization

Visualization can be classified into two categories: scientific visualization and information visualization. Scientific visualization is most often used to show things that are either too fast or too slow for the eye to perceive, structures much smaller or larger than human scale, or phenomena that people cannot directly see, like x-ray or infrared radiation [52]. Examples include shapes of molecules, missile tracking, astrophysics, fluid dynamics, medical images, etc.

On the other hand, information visualization is generally used to view abstract information. Examples include visual reasoning, visual data modeling, visual programming, information retrieval visualization, visualization of program execution, visual languages, spatial reasoning, and visualization of systems [82].

Although their fundamental design principles, implementation means, and issues are common, information visualization, contrary to scientific visualization, does not have an inherent spatial structure or geometry of data to display. For information visualization, a spatial structure or framework for semantic relationships among data must be created. Finding or defining such a spatial structure is challenging, because data in an information space may be multifaceted and relationships among data are interwoven and complicated. Furthermore, data may be of diverse nature. Defining such a spatial structure is therefore a complicated and creative process. Salient and displayable attributes must be extracted from objects, a semantic framework for displayable objects must be established, information must be organized, and objects must be projected onto the structure, in such a way that the user will be able to search and find objects and object relationships [92].




Chapter 3

Interaction Paradigms and Related Techniques

In this chapter we discuss the interaction paradigms of Results Clustering and of Faceted and Dynamic Taxonomies, in that order.

3.1 Results Clustering

Results clustering is a type of data analysis method that can organize a dataset into categorical groups (clusters), based on certain data association criteria. Different similarity measures can lead to different clustering results. Items or objects within the same group/cluster are more similar to each other than items in two distinct groups/clusters. Clustering is considered an unsupervised learning process because it can automatically reveal intrinsic categorical patterns from a dataset. The categories produced by a clustering algorithm rely on the nature of the dataset, the association criteria of the clustering, and the distribution of data items in the dataset.

The advantage of clustering is that it can be easily applied to any collection, revealing interesting and unexpected associations and trends. Disadvantages of clustering are its lack of predictability, its conflation of many dimensions simultaneously, the difficulty of labeling groups and the counterintuitiveness of cluster hierarchies [25].


3.1.1 Clustering Requirements

Results clustering algorithms should satisfy several requirements. First of all, the generated clusters should be characterized by high intra-cluster similarity. Moreover, results clustering algorithms should be efficient and scalable, since clustering is an online task and the size of the retrieved document set can vary. Usually only the top C documents are clustered in order to increase performance. In addition, the presentation of each cluster should be concise and accurate, to allow users to detect what they need quickly. Cluster labeling is the task of deriving readable and meaningful (single-word or multiple-word) names for clusters, in order to help the user recognize the clusters/topics he is interested in. Such labels must be predictive, descriptive, concise and syntactically correct. Finally, it should be possible to provide high quality clusters based on small document snippets rather than the whole documents.

3.1.2 Clustering Algorithms Classification

We can categorize the clustering algorithms, using two different classification schemes, based on either the

structure of the clusters or the infobase that these algorithms are applied to.

Hierarchical and Non-Hierarchical Approaches

The first scheme classifies the clustering algorithms as either non-hierarchical (partitioning clustering algorithms) or hierarchical [61]. The major difference between these two clustering types is that the latter generates a hierarchy of clustered items while the former partitions the items in a single-level structure.

Non-Hierarchical Approaches

This kind of clustering algorithm partitions N items into K categories (K must be predefined). One of the most popular non-hierarchical algorithms is K-means [48] and its variants [36, 11], which is based on a simple iterative scheme for finding a locally minimal solution. The algorithm starts with a guess about the solution, and then readjusts the cluster centroids until reaching a local optimum. A centroid is a special, artificially created item in a cluster which is used to represent that cluster for various purposes. It is defined as the average of the coordinates of all items in the cluster which it represents. A cluster membership function is a method for judging whether or not an item is assigned to a cluster during the clustering process. The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.
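The iterative scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the initial guess is obtained by sampling K input points as centroids (one common initialization among several), the membership function assigns each item to its nearest centroid by squared Euclidean distance, and the names `kmeans` and `sq_dist` are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100, seed=0):
    """Partition `points` into `k` clusters; returns (centroids, assignment)."""
    rng = random.Random(seed)
    # Initial guess (assumption of this sketch): k distinct input points.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        # Membership function: each item goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing changed, a local optimum has been reached
        assignment = new_assignment
        # Readjust each centroid to the average coordinates of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return centroids, assignment


# Hypothetical toy data: two well-separated groups of 2D points.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(data, k=2)
print(labels)
```

Changing `seed` changes the initial centroids, which is exactly the source of the run-to-run variability mentioned above; on less clearly separated data, different seeds may end in different local optima.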

Another non-hierarchical algorithm is Fuzzy c-means [32]. In fuzzy clustering, each point has a degree of belonging to clusters, as in fuzzy logic, rather than belonging completely to just one cluster. Thus, points on the edge of a cluster may belong to the cluster to a lesser degree than points in the center of the cluster. The algorithm minimizes intra-cluster variance as well, but has the same problems as k-means: the minimum is a local minimum, and the results depend on the initial choice of weights.
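To make the notion of a degree of belonging concrete, the sketch below implements the commonly used fuzzy c-means update rules: each center is the average of all points weighted by their membership raised to a fuzzifier m, and each membership is recomputed from the relative distances to all centers. The fuzzifier value m = 2, the fixed number of iterations, the optional `initial_weights` parameter and the function name are assumptions made for illustration.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, iterations=100, initial_weights=None, seed=0):
    """Return (centers, weights); weights[i][j] is the degree to which
    point i belongs to cluster j, and every row sums to 1."""
    eps = 1e-12  # avoids division by zero when a point sits on a center
    if initial_weights is None:
        rng = random.Random(seed)
        initial_weights = [[rng.random() + eps for _ in range(c)] for _ in points]
    # Normalize so that the membership degrees of each point sum to 1.
    weights = [[w / sum(row) for w in row] for row in initial_weights]
    dims = len(points[0])
    exponent = 2.0 / (m - 1.0)
    centers = []
    for _ in range(iterations):
        # Centers: averages of all points, weighted by membership ** m.
        centers = []
        for j in range(c):
            powered = [row[j] ** m for row in weights]
            total = sum(powered)
            centers.append(tuple(
                sum(w * p[d] for w, p in zip(powered, points)) / total
                for d in range(dims)
            ))
        # Memberships: closer centers receive a higher degree of belonging.
        for i, p in enumerate(points):
            dists = [max(math.dist(p, center), eps) for center in centers]
            weights[i] = [
                1.0 / sum((dists[j] / dists[k]) ** exponent for k in range(c))
                for j in range(c)
            ]
    return centers, weights


# Hypothetical toy data; the initial weights are passed explicitly because,
# as noted above, the result depends on them.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
start = [[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3
centers, weights = fuzzy_c_means(data, c=2, initial_weights=start)
for point, row in zip(data, weights):
    print(point, [round(w, 3) for w in row])
```

A hard partition can be derived from the result by assigning each point to the cluster with its highest weight.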

QT (quality threshold) clustering [29] is an alternative method of partitioning data, invented for gene clustering. It requires more computing power than k-means, but does not require specifying the number of clusters a priori, and always returns the same result when run several times. The user chooses a maximum diameter for clusters, and the algorithm builds a candidate cluster for each point by including the closest point, the next closest, and so on, until the diameter of the cluster surpasses the threshold. The candidate cluster with the most points is the first true cluster. The algorithm then recurses on the reduced set of points to find the rest of the clusters.
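The procedure can be sketched as follows. The sketch makes two assumptions that the description leaves open: the "closest" point is taken to be the one whose farthest distance to the current candidate members is smallest (so that the candidate diameter grows as little as possible), and ties between equally large candidates are broken in favour of the first one found. The function name is hypothetical and the recursion is written as a loop.

```python
import math


def qt_clustering(points, max_diameter):
    """Quality-threshold clustering sketch; returns a list of clusters,
    largest candidate first within each round."""
    remaining = list(range(len(points)))  # work on indices to allow duplicates
    clusters = []
    while remaining:
        best = []
        for seed in remaining:
            candidate = [seed]
            others = [i for i in remaining if i != seed]
            while others:
                def reach(i):
                    # Distance from point i to its farthest candidate member.
                    return max(math.dist(points[i], points[j]) for j in candidate)

                nearest = min(others, key=reach)
                if reach(nearest) > max_diameter:
                    break  # adding any further point would exceed the threshold
                candidate.append(nearest)
                others.remove(nearest)
            if len(candidate) > len(best):
                best = candidate
        # The largest candidate becomes a true cluster; repeat on the rest.
        clusters.append([points[i] for i in best])
        remaining = [i for i in remaining if i not in best]
    return clusters


# Hypothetical toy data with one isolated point.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
for cluster in qt_clustering(data, max_diameter=2):
    print(cluster)
```

No random choice is involved, which is why repeated runs on the same input give the same clusters. The cost is the large number of distance evaluations, since a candidate cluster is built for every remaining point in every round.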

Suffix Tree Clustering (STC) and its variations, described later in the Section Snippet-based Approaches, are also non-hierarchical approaches.

Hierarchical Approaches

The hierarchical clustering algorithm yields a tree structure, which is also called a dendrogram. In such a structure, a child sub-cluster has to overlap with its parent cluster. The clustering process in such algorithms is recursive, meaning that successive sub-clusters are generated from an existing cluster, and so on. There are two basic strategies for creating this structure: agglomerative (bottom-up) algorithms and divisive (top-down) algorithms. An agglomerative algorithm first clusters input items, forming a set of clusters, and then merges close clusters from the existing cluster set to form a parent cluster, based on a similarity measure. The algorithm ends when all clusters have been merged into one parent cluster, the root of the tree [36]. Different variations may employ different similarity measuring schemes [94]. A divisive algorithm takes the opposite direction. It starts with the root of the tree, and breaks down one large cluster into several smaller clusters. The recursion stops when certain criteria are met. Agglomerative clustering algorithms are more popular than divisive clustering algorithms.
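A minimal sketch of the agglomerative strategy is given below. It starts from singleton clusters and repeatedly merges the two closest clusters until only the root remains. The choice of single linkage (distance between the two closest members) as the similarity measure is an assumption of this sketch, as are the function name and the representation of the dendrogram as nested tuples of item labels.

```python
import math
from itertools import combinations


def agglomerative(items):
    """Build a dendrogram bottom-up from a non-empty dict {label: vector}.

    Returns nested tuples of labels; each tuple is a parent cluster whose
    two elements are the sub-clusters that were merged to create it.
    """
    # Every cluster is a pair (tree, member vectors); start with singletons.
    clusters = [(label, [vector]) for label, vector in items.items()]
    while len(clusters) > 1:
        def linkage(pair):
            # Single linkage: distance between the two closest members.
            members_a = clusters[pair[0]][1]
            members_b = clusters[pair[1]][1]
            return min(math.dist(p, q) for p in members_a for q in members_b)

        i, j = min(combinations(range(len(clusters)), 2), key=linkage)
        (tree_i, members_i), (tree_j, members_j) = clusters[i], clusters[j]
        parent = ((tree_i, tree_j), members_i + members_j)
        # Replace the two merged clusters by their parent cluster.
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(parent)
    return clusters[0][0]


# Hypothetical toy data: two pairs of nearby documents.
docs = {"a": (0, 0), "b": (0, 1), "c": (5, 5), "d": (5, 6.5)}
print(agglomerative(docs))
```

Every round compares all pairs of current clusters, and a merge is never undone once it has been made.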

The above methods usually suffer from their inability to perform adjustments once a merge or split has been performed. This inflexibility often lowers the clustering accuracy. Furthermore, due to the complexity of computing the similarity between every pair of clusters, such algorithms do not scale to large data sets in document clustering.

Another approach is the Hierarchical Frequent Term-based Clustering (HFTC) method, proposed in

[1]. This algorithm exploits the notion of frequent itemsets1 used in data mining. HFTC greedily

selects the next frequent itemset, which represents the next cluster, minimizing the overlap of clusters

in terms of shared documents. Experiments have shown that this algorithm is not scalable [21].

1A frequent itemset is a set of words which occur together in some minimum fraction of documents in a cluster

A different approach based on the idea of frequent itemsets is the Frequent Itemset Hierarchical

Clustering (FIHC). FIHC uses global frequent itemsets2 to construct clusters, which reduces the

dimensionality of the document set, making this algorithm more efficient and scalable.
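The notion of a frequent itemset on which both HFTC and FIHC build can be illustrated with a small sketch. It is not an implementation of either algorithm: it only enumerates, by brute force, the term sets that occur together in at least a minimum fraction of a document set. The function name `frequent_itemsets`, the `max_size` limit and the toy documents are assumptions made for illustration.

```python
from itertools import combinations

def frequent_itemsets(docs, min_support, max_size=2):
    """Return {term set: ids of the documents containing all its terms}
    for every term set whose document fraction is at least min_support.

    Brute-force enumeration, suitable only for tiny examples.
    """
    n = len(docs)
    term_sets = [set(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*term_sets))
    result = {}
    for size in range(1, max_size + 1):
        for items in combinations(vocab, size):
            cover = {i for i, terms in enumerate(term_sets)
                     if terms.issuperset(items)}
            if len(cover) / n >= min_support:
                result[frozenset(items)] = cover
    return result

docs = [
    "jaguar car speed",
    "jaguar car engine",
    "jaguar cat jungle",
    "cat jungle habitat",
]
itemsets = frequent_itemsets(docs, min_support=0.5)
```

Each frequent itemset comes with the documents that cover it, so it can be read directly as a candidate cluster together with its label.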

Document-based and Snippet-based Approaches

Clustering can be applied either to the original documents (like in [11, 27, 21]), or to their (query-dependent)

snippets (as in [86, 79, 71, 19, 88, 23, 77]). For instance, clustering Meta Web Search Engines (MWSEs)

(e.g. clusty.com) use the results of one or more search engines (e.g. Google, Yahoo!), in order to increase

coverage/relevance. Therefore, meta-search engines have direct access only to the snippets returned by

the queried search engines. Clustering the snippets rather than the whole documents makes clustering

algorithms faster. Some clustering algorithms [19, 15, 84] use internal or external sources of knowledge

like Web directories3 (e.g. DMoz4, Yahoo! Directory), dictionaries (e.g. WordNet) and thesauri, online

encyclopedias (e.g. Wikipedia5) and other online knowledge bases. These external sources are exploited

to identify key phrases that represent the contents of the retrieved documents or to enrich the extracted

words/phrases in order to optimize the clustering and improve the quality of cluster labels.

Document Vector-based Approaches

The above traditional clustering algorithms, either flat (like K-means) or hierarchical (agglomerative

or divisive), are not based on snippets but on the original document vectors and on the similarity

measure. Another such approach is ESTC (Extended STC) [10], which is an extension of STC

(described later in the Section Snippet-based Approaches), appropriate for application over the full

texts (not snippets). To reduce the (roughly two orders of magnitude) increased number of clusters, a

different scoring function and cluster selection algorithm is adopted. The cluster selection algorithm

is based on a greedy search algorithm aiming at reducing the overlap and at increasing the coverage

of the final clusters.

In brief, such approaches can be applied only on a stand-alone engine (since they require access to the

entire vectors of the documents) and they are computationally expensive. Furthermore, clustering over

full text is not appropriate for a (Meta) WSE, since the full text may be unavailable or too expensive

to process.

2Frequent itemsets that appear together in more than a minimum fraction of the whole document set

3A web directory is a listing of websites organized in a hierarchy or interconnected list of categories

4www.dmoz.org

5www.wikipedia.org





Snippet-based Approaches

Figure 3.1: Clusty, a Snippet-based Clustering Approach

Snippet-based approaches rely on snippets and there are already a few engines that provide such

clustering services. Clusty6 is probably the most famous one, shown in Figure 3.1. Suffix Tree

Clustering (STC) [86] is a key algorithm in this domain and is used by Grouper [87] and Carrot2

[79, 71] MWSEs. It treats each snippet as an ordered sequence of words, it identifies the phrases

(ordered sequences of one or more words) that are common to groups of documents by building

a suffix tree structure, and it returns a flat set of clusters that are naturally overlapping. Several

variations of STC have been proposed. For instance, the trie can be constructed with the N-grams

instead of the original suffixes. The resulting trie has lower memory requirements (since suffixes are

no longer than N words) and its building time is reduced, but less common phrases are discovered

and this may hurt the quality of the final clusters. Specifically, when N is smaller than the length

of true common phrases the cluster labels can be unreadable. To overcome this shortcoming [33]

proposed a join operation. A variant of STC with N-gram is STC with X gram [77] where X is

an adaptive variable. It has lower memory requirements and is faster than both STC with N-gram

and the original STC since it maintains fewer words. It is claimed that it generates more readable

labels than STC with N-gram as it inserts in the suffix tree more true common phrases and joins

partial phrases to construct true common phrases, but no user study results have been reported in

the literature, and the performance improvements reported are small.
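The idea of grouping snippets by the phrases they share can be sketched as below. This is a simplification under stated assumptions rather than STC itself: a plain dictionary of word N-grams replaces the suffix tree (closer in spirit to the N-gram variant mentioned above), no cluster merging or scoring is performed, and the function name `shared_phrases` and its parameters are hypothetical.

```python
from collections import defaultdict

def shared_phrases(snippets, max_len=3, min_docs=2):
    """Map each phrase of up to max_len words to the ids of the snippets
    that contain it, keeping only phrases found in at least min_docs
    snippets. Each surviving phrase is a candidate cluster and label."""
    index = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        for start in range(len(words)):
            last = min(start + max_len, len(words))
            for end in range(start + 1, last + 1):
                index[" ".join(words[start:end])].add(doc_id)
    return {p: ids for p, ids in index.items() if len(ids) >= min_docs}

snippets = [
    "suffix tree clustering of web results",
    "web results via suffix tree clustering",
    "clustering web pages",
]
base = shared_phrases(snippets)
```

Because a snippet can contain several shared phrases, the resulting groups overlap naturally. Lowering `max_len` reduces the size of the index but truncates longer common phrases, which mirrors the label readability problem described for small N.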

6www.clusty.com





Another snippet-based clustering approach is TermRank [23]. TermRank succeeds in ranking discrim-

inative terms higher than ambiguous terms, and ambiguous terms higher than common terms. The

top T terms can then be used as feature vectors in K-means or any other Document Vector-based

clustering algorithm. This approach requires knowing TF, it does not work on phrases (but on single

words) and no evaluation results over snippets have been reported in the literature.

Another approach is Findex [34], a statistical algorithm that extracts candidate phrases by moving a

window with a length of 1..|P| words across the sentences (P), and fKWIC which extracts the most

frequent keyword contexts which must be phrases that contain at least one of the query words. In

contrast to STC, Findex does not merge clusters on the basis of the common documents but on the

similarity of the extracted phrases. However, no comparative results regarding cluster label quality

have been reported in the literature.

Finally, there are snippet-based approaches that use external resources (lexical or training data). For

instance, SNAKET7 [19] (a MWSE) uses DMoz web directory for ranking the gapped sentences8

which are extracted from the snippets. Deep Classifier [84] trims the large hierarchy, returned by an

online Web directory, into a narrow one and combines it with the results of a search engine making

use of a discriminative naive Bayesian Classifier. Another (supervised) machine learning technique

is the Salient Phrases Extraction [88]. It extracts salient phrases as candidate cluster names from the

list of titles and snippets of the answer, and ranks them using a regression model over five different

properties, learned from human training data. Another approach that uses several external resources,

such as WordNet and Wikipedia, in order to identify useful terms and to organize them hierarchically

is described in [15]. Other extensions of STC for oriental languages and for cases where external

resources are available are described in [89, 78].

3.1.3 Cluster Presentation & User Interaction

Although cluster presentation and user interaction approaches are somewhat orthogonal to the clustering

algorithms employed, they are crucial for providing flexible and effective access services to the end users.

In most cases, clusters are presented using lists or trees. Some variations are described next. A well known

interaction paradigm that involves clustering is Scatter/Gather [11, 27] which provides an interactive

interface allowing the users to select clusters, then the documents of the selected clusters are clustered

again, the new clusters are presented, and so on.

7SNippet Aggregation for Knowledge ExtracTion

8Gapped sentences are sequences of terms occurring non-contiguously in the snippets





Figure 3.2: Quintura Word Cloud

Clusty9 is an extension of Vivisimo that offers a new feature, called remix clustering, which clusters

again the same search results but ignoring the topics that the user has seen. Another approach for the

presentation layer is provided by Quintura10, shown in Figure 3.2. It extracts keywords from search results

and builds a word cloud (visual map). The name of each cluster is placed in a 2D area. The positions of

the names are based on their distance, while font size indicates the size of each cluster. By clicking words

in the cloud, the user query is refined. SNAKET's [19] interface offers a feature of personalization that is

performed at the client side: the user can select a set of labels and then ask SNAKET to filter out (from

the ranked list) all those snippets that do not belong to the folders labeled by the selected labels.

SOMs have been used to support exploration of a document space to search for patterns and gain

overviews of available documents and relationships between documents [42] (Figure 3.4). Another information

visualization alternative, Citiviz, displays the clusters in search results using a hyperbolic tree and

a scatterplot. Several (M)WSE incorporate visualizations similar to both treemaps and hyperbolic trees.

grokker11, shown in Figure 3.3, clusters documents into a hierarchy and produces an Euler diagram, a

coloured circle for each top-level cluster with sub-clusters nested recursively, where the user can zoom-in.

9www.clusty.com

10www.quintura.com

11www.grokker.com





Figure 3.3: grokker Generates an Euler Diagram

Another example is Kartoo12, shown in Figure 3.5, which generates a thematic map from the top dozen

search results for a query, laying out small icons representing results onto the map, with which the user

can interact.

3.2 Facets and Dynamic Taxonomies

Dynamic taxonomies (also known as faceted search systems) [64] are a general knowledge management model

based on a multidimensional classification of heterogeneous data objects and are used to explore and browse

complex information bases in a guided, yet unconstrained way through a visual interface. Features of

faceted metadata search include (a) display of current results in multiple categorization schemes (facets)

(e.g. based on metadata terms, such as size, price or date), (b) display categories leading to non-empty

results, and (c) display of the count of the indexed objects of each category (i.e. the number of results the

user will get if he selects this category).

12www.kartoo.com




Figure 3.4: Top-200 Web Search Results Clustering Displayed Using Two-level TreeMaps

Figure 3.5: Kartoo Generates a Thematic Map

3.2.1 Introduction

Static taxonomies (such as Yahoo!'s), based on a hierarchy of concepts, can be used to select areas of

interest and restrict the portion of the retrieved infobase. The creation of such taxonomies is usually a





manual process although automatic and semi-automatic techniques have been proposed. However, static

taxonomies are not scalable for large information bases [65], and the number of documents becomes rapidly

too large for manual inspection.

On the other hand, dynamic taxonomies [63, 64, 76] (also known as faceted search systems) are a general

knowledge management model based on a multidimensional classification of heterogeneous data objects and

are used to explore/browse complex information bases in a guided, yet unconstrained way through a visual

interface. Features of faceted metadata search include:

• display of current results in multiple categorization schemes (facets) (e.g. based on metadata terms,

  such as size, price or date)

• display of categories leading to non-empty results (Poka-Yoke 13)

• display of the count of the indexed objects of each category (i.e. the number of results the user will

  get if he selects this category)

Such systems focus on user-centered interactive exploratory access, and propose a holistic approach in

which modeling, interface and interaction issues are considered together. One of the key factors of this

model is simplicity, in order to make it easily understandable and usable by end-users. The user always deals

with a single conceptual representation of the infobase. The conceptual schema of a dynamic taxonomy

is a plain taxonomy. It is a hierarchy going from the most general to the most specific concepts based on

subsumptions. Directed acyclic graph taxonomies modelling multiple inheritance are supported but rarely

required.

The user is guided to reach his goal, because at each stage he has a complete list of all the concepts related

to the current focus, which can be used to further refine his exploration. Furthermore as in traditional

search methods, the infobase can be restricted and a reduced taxonomy can be created. The user is in

charge of interaction and he can freely explore the infobase, discovering unexpected relationships. By

construction, no empty results can occur, because they are automatically pruned. Usability studies [26, 85]

show that despite slow response times, dynamic taxonomies produce a faster overall interaction and a

significantly better recall (both actual and perceived) than access through text retrieval.

Dynamic taxonomies have a very fast convergence to small result sets, as described in [65]. For

example, 3 zoom operations on terminal concepts are sufficient to reduce a 10,000,000 object infobase

described by a compact taxonomy with 1,000 concepts to an average 10 objects. Finally, the conceptual

organization of dynamic taxonomies allows user interests to be gathered at a precise conceptual level by simply

monitoring the zoom operations issued and the concepts the user focuses on.
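The convergence figures quoted above can be reproduced with a back-of-the-envelope calculation. The text does not state the underlying model, so the following is only one reading that is consistent with the numbers: it assumes that each object is classified under about 10 of the 1,000 concepts, uniformly and independently, so that a zoom on a terminal concept retains on average 10/1,000 = 1% of the current result set. The function name and the parameter `concepts_per_object` are hypothetical.

```python
from fractions import Fraction

def expected_result_size(n_objects, n_concepts, concepts_per_object, zooms):
    """Expected number of objects left after a number of zoom operations,
    assuming every zoom keeps the same average fraction of the objects."""
    keep = Fraction(concepts_per_object, n_concepts)  # assumed 10/1000
    return n_objects * keep ** zooms

sizes = [expected_result_size(10_000_000, 1_000, 10, z) for z in range(4)]
```

Under this assumption the result set shrinks by two orders of magnitude per zoom: 10,000,000, then 100,000, then 1,000, then 10 objects after the third zoom.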

13Poka-Yoke is a Japanese term that means fail-safing or mistake-proofing.





Examples of applications of faceted metadata-search include: e-commerce (e.g. ebay), library and bibliographic

portals (e.g. DBLP), museum portals (e.g. [49] and Europeana 14), mobile phone browsers (e.g.

[35]), specialized search engines and portals (e.g. [50]), Semantic Web (e.g. [30, 49, 56]), general purpose

web search engines (e.g. Google Base), and other frameworks (e.g. mSpace [67]).

3.2.2 Taxonomy Design

The most accurate way to create a taxonomy is to build categories by hand. Unfortunately, manual

classification is expensive and infeasible for many practical document collections, and especially for a WSE

document collection. Automatic clustering techniques generate clusters that are typically labeled using a

set of keywords, which leads to unpredictable and unintuitive labels. An alternative approach to clustering

is to generate hierarchies of terms for browsing the database. [66] introduced the subsumption hierarchies

and [45] showed experimentally that subsumption hierarchies outperform lexical hierarchies [60]. Another

approach is to use the hierarchical structure of WordNet15 16 to offer a hierarchy view over the topics [40].

WordNet together with a tree-minimization algorithm to create an appropriate concept hierarchy for a

database is also used in [72].

All these techniques generate a single hierarchy for browsing the database. A supervised approach for

extracting useful facets from a collection of text or text-annotated data is described in [14], which relies on

WordNet hypernyms17 and on a Support Vector Machine (SVM) classifier to assign new keywords to facets.

More recent works [15, 13] provide an unsupervised technique to extract useful facet terms, by expanding

a database using WordNet and Wikipedia to identify important terms.

3.2.3 Framework

Table 3.1 defines formally and introduces notations for terms, terminologies, taxonomies, faceted tax-

onomies, interpretations, descriptions and materialized faceted taxonomies as described in [76]. In brief,

Obj is a set of objects (the set of all documents indexed by the WSE), T is a set of terms, and the elements

of Obj can be described with respect to one or more aspects (facets), where each aspect is associated with

a value domain, finite or infinite, which may be ordered (in the general case we could have a partial order

(T, ≤)). The description of an object with respect to one facet consists of assigning to the object one or

14http://www.europeana.eu

15WordNet is a lexical database, which groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets

16http://wordnet.princeton.edu/

17Hypernym is a word whose meaning includes the meanings of other words, as the meaning of the term animal includes the meaning of cat, dog, parrot





more terms from the taxonomy that corresponds to that facet.

Table 3.2 defines the required notions and notations regarding user interaction. The user explores or

navigates the information space by setting and changing his focus. The notion of focus can be intensional

or extensional. Specifically, any set of terms, i.e. any conjunction of terms (or any boolean expression of

terms) is a possible focus. For example, the initial focus can be the empty compound term, or the top term

of a facet. However, the user can also start from an arbitrary set of objects, and this is the common case

in the context of a WSE. In that case the focus is defined extensionally. Specifically, if A is the result of a

free text query q, then the interaction is based on the restriction of the materialized faceted taxonomy on

A (as defined at the bottom part of Table 3.2).

At any point during the interaction, the immediate zoom-in/out/side points along with count information

are computed and provided to the user. When the user selects one of these points then the selected term

is added to the focus, and so on. An example of a materialized faceted taxonomy is shown in Figure 3.6.

Figure 3.6: Example of a Materialized Faceted Taxonomy

Foci are considered to be redundancy free. A focus ctx (i.e. ctx ⊆ T) is redundancy free if ctx =

min(ctx). For example, ctx = {Greece, Crete} is not redundancy free because min(ctx) = {Crete}.

The contents (or extension) of a focus ctx is the set of objects Ī(ctx). This notion can be refined, in order

to distinguish the shallow contents I(ctx) from the deep contents Ī(ctx).
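The notions of a redundancy-free focus and of shallow versus deep contents can be made concrete with a small sketch. The taxonomy, the object identifiers and the function names below are hypothetical and chosen only to reproduce the {Greece, Crete} example; contents are shown for a single term, under the assumption that the deep contents of a term also include the objects of all its narrower terms.

```python
# Hypothetical taxonomy: term -> set of direct broader terms.
BROADER = {
    "Location": set(),
    "Greece": {"Location"},
    "Crete": {"Greece"},
    "Athens": {"Greece"},
}
# Hypothetical interpretation: objects assigned directly to each term.
INTERP = {
    "Location": set(),
    "Greece": {"d2"},
    "Crete": {"d1"},
    "Athens": {"d3"},
}

def broaders_plus(t):
    """All terms strictly broader than t."""
    seen, stack = set(), list(BROADER[t])
    while stack:
        b = stack.pop()
        if b not in seen:
            seen.add(b)
            stack.extend(BROADER[b])
    return seen

def minimal(ctx):
    """Drop every term of ctx that is broader than another term of ctx."""
    return {t for t in ctx if not any(t in broaders_plus(u) for u in ctx)}

def shallow_contents(t):
    """Objects assigned directly to t."""
    return set(INTERP[t])

def deep_contents(t):
    """Objects assigned to t or to any narrower term of t."""
    return set().union(*(objs for u, objs in INTERP.items()
                         if u == t or t in broaders_plus(u)))
```

With this data, `minimal({"Greece", "Crete"})` keeps only `Crete`, while the shallow contents of `Greece` hold one object and its deep contents hold three.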

3.2.4 User Interface Design

System implementations for dynamic taxonomies and faceted search allow a wide range of query possibilities

on the data. Only when these are made accessible by appropriate UIs can the resulting applications

support a variety of search, browsing and analysis tasks. Such systems should provide support at least for

the three basic characteristics of faceted and dynamic taxonomies. They should display non-empty results,

in multiple categorization schemes (facets), along with the count of the indexed objects of each category.

Additional UI functionality is usually accompanied by additional complexity and visual clutter.

Selection and de-selection of zoom-points is of central importance in faceted search. If only one concept

should be selectable at a time within a facet, traditional single-select controls such as radio buttons,

dropdown list controls or simple links can be used. On the other hand, the standard multi-select elements,

are check boxes. For instance, the yelp18 web application provides check buttons for multi-select facets and

simple links for facets with exclusive selection. Alternatives for allowing both modes in a facet would be

dedicated controls, or modifier keys (such as pressing shift while clicking). For range selection navigation

mode, slider controls can allow the specification of upper and lower bounds on the result set. De-selection

should be as easy as concept selection. Additionally, if breadcrumbs or a similar filter summary, indicating

summaries of single or all facets are present, these should include the option to clear individual filters as

well. Also, buttons for resetting single facets or all filter options can help users to zoom-out quickly.

Figure 3.7: ContentLandscape Applies Collapsible Panel Pattern for Zooming

For flat facets, i.e. not featuring a hierarchical relation between the concepts, simple list widgets are

18http://www.yelp.com





usually used. List sorting can either be alphabetical, or dynamically updated by the number of assigned

items in the current result set. For navigating hierarchies, a number of different presentation and navigation

options exist, which include: Explorer Tree (not very space efficient), Zoom and Replace which replaces

the facet widget content with the level below (used in Flamenco19 [85]), Collapsible panels, hierarchical

widgets based on the accordion pattern20 (used in the ContentLandscape application [70], Figure 3.7),

and Continuous Zooming, where hierarchical facets are displayed as space-filling widgets, which allow a

fast traversal across all levels, while simultaneously maintaining context (used in the FacetZoom prototype

[12], Figure 3.8). The number of indexed items for each facet and zoom-points, can be shown by numbers

(after the labels), bar charts, height of facets, colour, etc. Visgets [18] extends this principle by featuring a

number of visualizations. FaThumb [35] enables faceted search on mobile devices (Figure 3.9). The

filter area is grouped in nine zones, corresponding to the nine digit keys on mobile phones. The middle

zone serves as a spatial overview during navigation. The surrounding eight zones allow the user to select

hierarchy branches and repeatedly zoom in on subtrees. The left shortcut key adds the currently

selected concept to the query, while the right one allows the user to jump back quickly to the top.

Figure 3.8: FacetZoom Combines Ideas from Zoomable UIs With Faceted Search

Query searching can be done either over all results or within the current focus, as shown in Figure 3.10.

Moreover, in order to quickly locate zoom-points in a facet, and avoid having to navigate large hierarchies,

even though the target concept may already be known by name, direct access to facet items can be achieved

with a keyword search over the concept labels (/facet [30]). Since the number of available facets can be

very big, ways to reduce their usage space are discussed in [24], and include collapsible facet widgets (such

as used by Getty images faceted navigation interface21) and expandable filter areas (i.e. More button).

Furthermore, systems should be able to determine which facet-value pairs the interface should provide

19Online demos available at http://flamenco.berkeley.edu/

20http://www.welie.com/patterns/showPattern.php?patternID=accordion

21http://gettyimages.com





Figure 3.9: Faceted Search for Small Screens in the FaThumb Prototype

Figure 3.10: Flamenco Allows Choosing Between a Search Over All Results or Within Current Focus

to a user. Personalization allows the system to present the facet-value pairs that can help the user quickly

find the documents that he is most interested in. Existing approaches include content-based personalization,

where a recommendation system monitors the user's actions and pushes documents that match his user

profile, collaborative based faceted search personalization, where the system recommends items to a user

by leveraging information from other users with similar tastes and preferences, and finally an ontological





approach, which uses the distance between values of an ontology, to measure the relevance to users [75].





Name                  Notation   Definition
terminology           T          a set of names, called terms (they may capture both categorical and numeric values)
subsumption           ≤          a partial order (reflexive, transitive and antisymmetric)
taxonomy              (T, ≤)     T is a terminology, ≤ a subsumption relation over T
broaders of t         B+(t)      { t' | t < t' }
narrowers of t        N+(t)      { t' | t' < t }
direct broaders of t  B(t)       minimal




Name                            Notation     Definition
focus                           ctx          any subset of T such that ctx = minimal(ctx)
focus projection on a facet i   ctx_i        ctx_i = ctx ∩ T_i

Kinds of zoom points w.r.t. a
facet i while being at ctx      Notation     Definition(s)
zoom points                     AZ_i(ctx)    = { t ∈ T_i | Ī(ctx) ∩ Ī(t) ≠ ∅ }
zoom-in points                  Z+_i(ctx)    = AZ_i(ctx) ∩ N+(ctx_i)
immediate zoom-in points        Z_i(ctx)     = maximal(Z+_i(ctx))
                                             = AZ_i(ctx) ∩ N(ctx_i)
zoom-side points                ZR+_i(ctx)   = AZ_i(ctx) \ {ctx_i ∪ N+(ctx_i) ∪ B+(ctx_i)}
immediate zoom-side points      ZR_i(ctx)    = maximal(ZR+_i(ctx))

Restriction over an object set  Notation     Definition(s)
restricted object set           A            any subset of Obj
reduced interpretation          I_A          I_A(t) = I(t) ∩ A
reduced terminology             T_A          = { t ∈ T | Ī_A(t) ≠ ∅ }
                                             = { t ∈ T | Ī(t) ∩ A ≠ ∅ }
                                             = ∪_{o ∈ A} B+(D_I(o))

Table 3.2: Interaction Notions and Notations
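The interaction notions of Table 3.2 can be illustrated with a small sketch that computes immediate zoom-in points together with their counts. It is a simplified illustration, not the implementation of any cited system: the taxonomy is assumed to be a tree, at most one focus term per facet is assumed, the top term of a facet is used when the focus has no term in that facet, and all terms, objects and function names are hypothetical.

```python
# Hypothetical toy data: term -> direct broader term
# (None marks the top term of a facet).
PARENT = {
    "Location": None, "Greece": "Location", "Crete": "Greece",
    "Athens": "Greece", "Italy": "Location",
    "Type": None, "Hotel": "Type", "Museum": "Type",
}
# Descriptions: the terms directly assigned to each object.
DESCR = {
    "o1": {"Crete", "Hotel"}, "o2": {"Athens", "Museum"},
    "o3": {"Italy", "Hotel"}, "o4": {"Crete", "Museum"},
}

def broaders(t):
    """Every term strictly above t."""
    result = set()
    while PARENT[t] is not None:
        t = PARENT[t]
        result.add(t)
    return result

def deep(t, objs):
    """Objects of objs described by t or by any narrower term of t."""
    return {o for o in objs
            if any(d == t or t in broaders(d) for d in DESCR[o])}

def contents(ctx, objs):
    """Objects of objs satisfying the conjunction of all focus terms."""
    result = set(objs)
    for t in ctx:
        result &= deep(t, objs)
    return result

def immediate_zoom_in(ctx, facet_top, objs=None):
    """Direct narrower terms of the focus within one facet that lead to
    non-empty results, with the number of objects each would return.
    Passing objs restricts the computation to an object set A."""
    objs = set(DESCR) if objs is None else set(objs)
    in_facet = [t for t in ctx if t == facet_top or facet_top in broaders(t)]
    current = in_facet[0] if in_facet else facet_top
    here = contents(ctx, objs)
    counts = {}
    for term, parent in PARENT.items():
        if parent == current:
            n = len(here & deep(term, objs))
            if n:  # terms leading to empty results are pruned
                counts[term] = n
    return counts
```

Calling `immediate_zoom_in(set(), "Type", objs={"o1", "o3"})` mimics the case in which the focus is defined extensionally by the answer of a free text query: only `Hotel` is offered, because `Museum` would lead to an empty result.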




Chapter 4

Visualization Models and Metaphors

In this chapter we will discuss five different visualization models. Initially, we will discuss MRPBM, which is

based on RPs, and then we will analyze ESCBM, which is based on the VSM ranking model and its spatial

characteristics. The next one is PFNET, which uses associative networks, and the fourth one is MDS,

a group of methods used to discover empirical relationships among investigated objects. Finally, we will

discuss SOM, which is a nonlinear topology-preserving projection method that converts a high-dimensional

space into a low-dimensional grid, and different visualization metaphors.

4.1 Multiple Reference Points Based Models (MRPBM)

MRPBM models are visualization algorithms to display the results of a search not in the classical linear

order, but by projecting them on a low dimensional visual space. They can effectively handle complex

information needs by using multiple RPs. RP or Point of Interest (POI), is a search criterion against which

documents or surrogates are matched and search results are generated and presented to the users. In a

broad sense, a RP represents users' information needs and any information related to users' needs, from

user preferences and search history, to query terms or browsed documents. Multiple RPs can form a low

dimensional visual space and documents can be mapped onto the space, based upon their attraction to the

RPs.

Visualization models based on multiple RPs can be classified into three categories:

Fixed Multiple RPs Models


These models use multiple RPs, with a fixed position, and can be used for both vector-based and

Boolean based IR systems. The representative model is InfoCrystal [69]. In the Boolean context,

each RP is equivalent to a term or a sub-Boolean logic expression from a Boolean query. The visual

space is a polygon, where RPs constitute vertices of the polygon and visual results are displayed.

The side lengths of the polygon are equal so that the RPs are evenly configured in the visual space.

The retrieved results are displayed inside the polygon. The polygon is partitioned by N exclusive

tiers, represented as concentric rings, where N is the number of RPs. The first tier, displays results

related to only one RP, the second results related to two RPs, etc. Figure 4.1 shows a fixed multiple

RPs model.
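The layout of a fixed multiple RPs model can be sketched as follows. This is an illustration of the description above, not InfoCrystal's actual implementation: RPs are placed on the vertices of a regular polygon, and documents are grouped into tiers by the number of RPs they relate to. The function names and the toy data are hypothetical, and the placement of documents within a tier is left out.

```python
from math import cos, sin, pi

def polygon_vertices(n, radius=1.0):
    """Positions of n RPs on the vertices of a regular polygon, so that
    all sides are equal and the RPs are evenly spread."""
    return [(radius * cos(2 * pi * k / n), radius * sin(2 * pi * k / n))
            for k in range(n)]

def tiers(doc_matches, n_rps):
    """Group documents by the number of RPs they relate to:
    tier 1 holds documents related to exactly one RP, tier 2 to
    exactly two, and so on. Unrelated documents are not displayed."""
    result = {k: [] for k in range(1, n_rps + 1)}
    for doc, rps in doc_matches.items():
        if rps:
            result[len(rps)].append(doc)
    return result

# Hypothetical matches between documents and four RPs A, B, C, D.
doc_matches = {
    "d1": {"A"},
    "d2": {"A", "B"},
    "d3": {"A", "B", "C", "D"},
    "d4": set(),
}
vertices = polygon_vertices(4)
by_tier = tiers(doc_matches, 4)
```

The tiers are exclusive by construction, since each document relates to exactly one number of RPs.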

Figure 4.1: Display of 4 Reference Points in a Fixed Reference Point Environment

Movable Multiple RPs Models

These models use multiple RPs, which can be manipulated by the user, while semantic connections

of displayed objects are still maintained in the visual space. VIBE [55] and its variations, VR-VIBE [2]

and LyberWorld [28], are such models. The primary benefit of this approach is that the

user may arbitrarily place a RP in any interesting area, such as another RP, document or cluster of

documents, and observe the impact of the RP to that area. According to the algorithm, the position

of a document is strongly related to the similarities between the document and a group of predefined

RPs. The positions of all related RPs in the visual space, play a very important role in positioning

a projected document. In addition, taking into consideration the relevance between a document and




related RPs, the ultimate position of a document is calculated. Initially the first two related RPs are

selected in order to calculate the position of the document. The new position of the document serves

as an intermediate RP for further consideration, and the process continues until all related RPs are

considered. If the user adds, removes, or changes the position of any RP, the whole algorithm must be

executed again.
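The pairwise positioning procedure described above can be sketched as follows. The exact formula is not given in the text, so this is an assumption: each pair is combined by placing the intermediate point on the segment between the two points, closer to the one with the larger similarity, and the intermediate point carries the accumulated similarity as its weight. The function name `vibe_position` and the toy data are hypothetical.

```python
def vibe_position(rp_positions, similarities):
    """Position of a document given 2D RP positions and the document's
    similarity to each RP. RPs with zero similarity are ignored."""
    related = [(rp_positions[rp], s)
               for rp, s in similarities.items() if s > 0]
    if not related:
        return None  # the document is related to no RP
    (x, y), weight = related[0]
    for (rx, ry), s in related[1:]:
        total = weight + s
        # The intermediate point acts as an RP for the next step.
        x = (weight * x + s * rx) / total
        y = (weight * y + s * ry) / total
        weight = total
    return (x, y)

rps = {"a": (0.0, 0.0), "b": (4.0, 0.0), "c": (0.0, 4.0)}
position = vibe_position(rps, {"a": 1, "b": 1, "c": 2})
```

Under this assumption the final position equals the similarity-weighted centroid of the related RPs, here (1.0, 2.0), so the document sits closest to the RP it resembles most. Moving any RP changes the inputs, which is why the positions must be recomputed.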

Figure 4.2 shows a snapshot of VIBE. In this example 5 RPs (circles) are used and documents are

represented as rectangles. Those documents that contain at least one of the descriptors indicated by

the user when initiating the search are considered relevant. The documents with greater coincidence

in their descriptors with those of the RP are placed closer to that RP. The user can also expand the

icons of one or more documents of interest by simply drawing a box around them,

after which a list of the chosen selection is shown. Clicking with the mouse

on any of the documents on the list will open another window with the complete document. One

characteristic that makes the system interactive is that the user may add, change or remove the RPs

from the screen. On carrying out any of these changes the system automatically launches the search

query and re-orders the found documents to present the relationships between documents and those

between POIs.

Automatic RPs Rotation Models

This model is a similarity ratio based model and was introduced with WebStar [93] to visualize link

structures. The uniqueness of this model is that it adds a new feature, automatic rotation of a RP, to

the 2D visual space. The visual space is built on a polar coordinate system, where the origin of the

visual space is a central document (focus point), specified or selected by users, and RPs are evenly

distributed on a sphere with the focus point as center. All of the relevant documents are scattered

within the visual space based on their projection angle (which is similarity based) and distance (which

is not). By selecting a RP, it automatically rotates around the sphere. As a consequence, related

documents are attracted and also rotated.

Figure 4.3 shows the WebStar system. The central document (focus point) is denoted with a blue

square at the center of the circle, while the four RPs (sport, research, international, library),

are represented with the yellow squares, evenly distributed outside the circle. Documents are the

pink squares scattered inside the circle. In this example the user has selected the international RP,

coloured in red, which is rotated around the circle. Notice how documents change position as the RP

rotates.

Both the models for fixed and movable multiple RPs require at least three RPs to project documents in





Figure 4.2: VIBE Using 5 Reference Points

their visual spaces, while the model for automatic RPs rotation requires at least one RP in conjunction

with the focus point. Furthermore, visualization models for multiple RPs can be 2D or 3D and can be

applied to either Boolean or vector based information systems. The position of any RP can be controlled

and manipulated by users at will. It is the flexibility of manipulation that enables users to compare and

analyze the impact of two reference points on documents, and identify good/poor discriminative terms.

Such models can be used to visualize Internet hyperlinks, search results from an information retrieval

system, a full-text, and term discriminative analysis.

4.2 Euclidean Spatial Characteristic Based Model (ESCBM)

These visualization models are based on the VSM model and its spatial characteristics. The basic Euclidean

spatial elements such as point, distance, and angle may have a special connection to information retrieval in

the contexts of the vector-based space. For instance, a document or RP in a vector based space corresponds

to a spatial point in the Euclidean space. Euclidean distance between documents and RPs can be used

as an indicator of their similarity. Their visual spaces are 2D and in order to construct them, they use




Figure 4.3: WebStar Using 4 RPs. Snapshots During a Full Rotation of international Reference Point

two RPs, which serve as view points, one major (KV P), and one minor (AV P). These RPs, the reference

axis that they form and the distance between them, are all selected by the user and affect the relevant

documents placement.

The projection conversion equation for an IR evaluation model is crucial for visually displaying it in

the visual space. The complexity of a conversion equation depends upon multiple factors such as the

definition of the visual space and nature of the retrieval evaluation model. Some equations are simple and

straightforward while others may be complicated. The significance of visualizing an IR evaluation model

is not only to make the invisible internal retrieval process transparent to users but also to allow them to

manipulate the model in the visual space at will.

In this context, three visualization models have been proposed.





Distance-angle Based Model

In this model the visual projection distance and angle are defined for any document Di. The projection

distance is the distance from the document Di to the KVP, and the projection angle is the angle formed

by the lines KVP-Di and KVP-AVP, both in the vector space. The valid display area of this model is a

half-infinite plank, where the X-axis and Y-axis are defined as the visual projection angle and

distance respectively. The width of the X-axis is always equal to π and the width of the Y-axis is

infinite. KVP is always mapped onto the origin of the visual space, because its visual projection

distance is 0 and its angle is defined as 0. AVP is mapped onto the Y-axis, because its visual

projection distance is the distance between the two reference points and its visual projection angle

is 0. The distance between the two reference points does not affect this model.

DARE [90] is such a model, and Figure 4.4 shows the display of the projected cosine model using DARE.

The angle α is the retrieval threshold, while R2 is AVP. D1 is a document situated within the retrieval

area defined by the angle α, and D2 is any document located on one boundary of the angle α. Users may

drag the vertical retrieval line to any place within the valid display area, to increase or decrease

the retrieval area.

Figure 4.4: Display of the Projected Cosine Model, in Distance-Angle DARE Model
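The distance-angle projection described above can be illustrated with a minimal sketch. It assumes that documents and the two view points are given as plain coordinate tuples in the vector space; the function name distance_angle_projection is hypothetical and is not taken from DARE itself.

```python
import math


def distance_angle_projection(doc, kvp, avp):
    """Return (angle, distance) of `doc` as seen from the KVP.

    angle    : angle at KVP between the lines KVP-doc and KVP-AVP, in [0, pi]
    distance : Euclidean distance from doc to KVP
    """
    to_doc = [d - k for d, k in zip(doc, kvp)]
    to_avp = [a - k for a, k in zip(avp, kvp)]
    dist = math.sqrt(sum(x * x for x in to_doc))
    ref = math.sqrt(sum(x * x for x in to_avp))
    if dist == 0.0 or ref == 0.0:
        # The text defines the angle as 0 when the document coincides with KVP.
        return 0.0, dist
    cos_angle = sum(x * y for x, y in zip(to_doc, to_avp)) / (dist * ref)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    return math.acos(cos_angle), dist


# X-axis = projection angle, Y-axis = projection distance
x, y = distance_angle_projection((0.0, 3.0, 0.0), (0.0, 0.0, 0.0), (2.0, 0.0, 0.0))
```

Under these assumptions a document is inside the retrieval area of the projected cosine model when its X coordinate does not exceed the threshold angle α, which corresponds to the vertical retrieval line mentioned above.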

Angle-angle Based Model

In this model two visual projection angles are defined for any document Di. The first angle (α) is

the angle formed by the lines KVP-Di and KVP-AVP, and the second one (β) is the angle formed

by the lines AVP-Di and KVP-AVP, both of them in the vector space. The two angles α and β

are assigned to the X-axis and Y-axis. The minimum value and maximum value for the two angles

α and β are 0 and π respectively. The valid display area is a triangle and the two reference points

are projected at (π/2, 0) and (0, π/2) respectively. This model again is not affected by the distance

between the two RPs.

TOFIR [91] is an example of such a model, shown in Figure 4.5. The angle is the retrieval threshold,

the origin of the vector space is KVP, while R2 is AVP. In the figure O is the projected origin of

the vector space. The horizontal line defines the retrieval area, and can be manipulated by the users.

Figure 4.5: Display of the Projected Cosine Model, in the Angle-Angle TOFIR Model
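As a rough illustration of the angle-angle projection, the sketch below computes the two projection angles for a document given as a coordinate tuple. The names angle_at and angle_angle_projection are hypothetical, and the degenerate case of a document coinciding with a view point is simply rejected here rather than mapped to a fixed position.

```python
import math


def angle_at(vertex, p, q):
    """Angle at `vertex` between the lines vertex-p and vertex-q, in [0, pi]."""
    u = [a - v for a, v in zip(p, vertex)]
    w = [a - v for a, v in zip(q, vertex)]
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_w = math.sqrt(sum(x * x for x in w))
    if norm_u == 0.0 or norm_w == 0.0:
        raise ValueError("angle is undefined when two points coincide")
    cos_angle = sum(x * y for x, y in zip(u, w)) / (norm_u * norm_w)
    return math.acos(max(-1.0, min(1.0, cos_angle)))


def angle_angle_projection(doc, kvp, avp):
    """Return the two projection angles of `doc`.

    The first is measured at KVP (lines KVP-doc and KVP-AVP), the second
    at AVP (lines AVP-doc and AVP-KVP).
    """
    return angle_at(kvp, doc, avp), angle_at(avp, doc, kvp)


first, second = angle_angle_projection((1.0, 1.0), (0.0, 0.0), (2.0, 0.0))
```

Because the two angles are interior angles of the triangle formed by the document and the two view points, their sum never exceeds π, which is why the valid display area is a triangle.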

Distance-distance Based Model

In this model two visual projection distances are defined for any document Di. The first is

the distance from the document Di to the KVP, and the second is the distance from the document

Di to the AVP, both of them in the vector space. The two projection distances are assigned to the

X-axis and Y-axis. The valid display area is a half-infinite plank. It forms a π/4 angle against

the X-axis or the Y-axis, its two corners are connected to the X-axis and Y-axis respectively, and

its width is dynamic and determined by the distance between the two RPs. GUIDO [54] is such a model.

Figure 4.6 shows the distance model in GUIDO.
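The shape of the valid display area follows from the triangle inequality, which the following minimal sketch makes explicit. The function names are hypothetical and the tolerance eps is an assumption added only to absorb rounding error.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distance_distance_projection(doc, kvp, avp):
    """Return (x, y) = (distance doc-KVP, distance doc-AVP)."""
    return euclidean(doc, kvp), euclidean(doc, avp)


def in_valid_area(x, y, rp_distance, eps=1e-12):
    """Check the triangle inequality |x - y| <= d(KVP, AVP) <= x + y.

    The left part bounds a band at 45 degrees to the axes whose width
    depends on the distance between the two RPs; the right part cuts off
    the corner near the origin.
    """
    return abs(x - y) <= rp_distance + eps and rp_distance <= x + y + eps


kvp, avp = (0.0, 0.0), (3.0, 0.0)
x, y = distance_distance_projection((0.0, 4.0), kvp, avp)
```

Every document of the vector space is projected inside this band, so, unlike the two angle-based models, moving the RPs closer together or further apart changes the width of the display area.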

Figure 4.6: Display of the Projected Distance Model, in the Distance-Distance GUIDO Model

One of the distinguishing characteristics of these visualization models is their capacity to visualize

traditional IR evaluation models in addition to visualizing relationships among documents. Document

distributions in these visual spaces change accordingly when the RPs change. This implies that the

displayed document configurations in the visual spaces can be customized based upon users' dynamic

information needs.

4.3 Pathfinder Associative Network (PFNET)

The Pathfinder associative network (PFNET) is a structural and procedural modeling technique that extracts

underlying connection patterns in proximity data and represents them spatially in a class of networks

[8]. The power of the Pathfinder associative network is its ability to discard insignificant links in the

original network while it preserves the salient semantic structure of the network. The simplified network

still maintains the proximity connections and fundamental characteristics of the original network.

The main idea of the Pathfinder associative network is to discard the redundant paths and keep the

significant ones in a network. PFNET uses the triangle inequality to identify paths with the lowest weights

in the network, eliminate redundant ones, and make the network more economical. Figure 4.7 displays the

original network and the final PFNET network. Moreover, the principle of the triangle inequality can be

extended to an abstract space. In that case, connection proximity between two points may be measured

in other forms such as invisible semantic similarity between two objects rather than distance.
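The pruning idea can be sketched as follows. This is only an illustration under stated assumptions: link weights are treated as distances (lower means closer), the weight of a path is taken to be its largest link weight, and paths of any length are considered (the setting usually written r = ∞, q = n − 1, which the text above does not fix). The function name pfnet is hypothetical.

```python
import math


def pfnet(weights):
    """Return the set of links (i, j), i < j, kept after pruning.

    `weights` is a symmetric matrix of link distances with 0 on the
    diagonal and math.inf where two nodes are not linked.  A direct link
    is dropped when some alternative path has a smaller weight, i.e. when
    the link violates the (generalized) triangle inequality.
    """
    n = len(weights)
    best = [row[:] for row in weights]  # best[i][j]: lowest path weight found
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(best[i][k], best[k][j])
                if via_k < best[i][j]:
                    best[i][j] = via_k
    kept = set()
    for i in range(n):
        for j in range(i + 1, n):
            if weights[i][j] != math.inf and weights[i][j] <= best[i][j]:
                kept.add((i, j))
    return kept


# Link 0-2 (weight 5) is redundant: the path 0-1-2 has weight max(1, 2) = 2.
network = [
    [0, 1, 5],
    [1, 0, 2],
    [5, 2, 0],
]
links = pfnet(network)
```

The triple loop is the reason for the cubic cost in the number of nodes of this particular sketch, which is in line with the computational concerns raised for large datasets.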

Application of a PFNET to a domain problem requires identifying two basic elements: the first is the

objects which are used as nodes in the network, and the second is the proximity relationship between the

two objects, which is used to form a link between the two objects. Proximity can be obtained by either a

human-intervention method or an automatic computation method. Different objects and proximity methods

can lead to different Pathfinder associative networks.

Figure 4.7: Display of Original Network (left) and Final PFNET Network (right)

The Pathfinder network technique is very effective and efficient for display of complex relationships

among objects such as sophisticated semantic networks. As an IV means, it can be applied to a wide

spectrum of IR environments, ranging from information searches [7, 20], author co-citation analysis¹ [80],

and term co-occurrence analysis² [16], to Internet information representation [6].

Specifically for query searching, after a query is submitted to the network, the relevance between the

query and a document is calculated using the Pearson correlation coefficient, and the relevance is indicated

by the height of a spike rising from the document [7]. In another case [20, 16], the query and a

document are each converted into a Pathfinder associative network, and the similarity between a query and

a document is the similarity between the two Pathfinder networks. The proximity algorithm consists of

two parts. The first part is defined as the ratio of common terms in both a query and a document to the

number of all terms in the query. The second part measures the network structure similarity between the

query network and a document network. The value of this part increases when nodes (terms) connected

in the query network also appear closely connected in the document network. Finally, the two parts are

weighted and integrated into a final similarity value.
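The two-part similarity can be sketched roughly as below. Only the first part (ratio of common terms to query terms) is specified in the text; the structural part used here, the share of query links that also appear as direct links in the document network, is a simplified stand-in chosen for illustration, and the weight w_terms is likewise an assumption.

```python
def term_overlap(query_terms, doc_terms):
    """Ratio of terms common to query and document to all query terms."""
    query = set(query_terms)
    if not query:
        return 0.0
    return len(query & set(doc_terms)) / len(query)


def structure_overlap(query_links, doc_links):
    """Hypothetical structural part: share of query links found in the document.

    Links are undirected pairs of terms, so (a, b) and (b, a) are equal.
    """
    query = {frozenset(link) for link in query_links}
    doc = {frozenset(link) for link in doc_links}
    if not query:
        return 0.0
    return len(query & doc) / len(query)


def network_similarity(query_terms, query_links, doc_terms, doc_links, w_terms=0.5):
    """Weighted combination of the two parts; w_terms is an assumed weight."""
    return (w_terms * term_overlap(query_terms, doc_terms)
            + (1.0 - w_terms) * structure_overlap(query_links, doc_links))


score = network_similarity(
    ["a", "b", "c"], [("a", "b"), ("b", "c")],
    ["a", "b", "d"], [("b", "a"), ("a", "d")],
)
```

A measure closer to the description above would also reward query terms that are near each other, but not directly linked, in the document network.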

The weaknesses of the Pathfinder associative network include its computational complexity, which may

prevent PFNET from visualizing a large dataset and from dynamically modifying a network in response to

interactions between users and the network. Another disadvantage of PFNETs in the present state of

development is that people have no way of knowing the features upon which similarity judgments are made,

so the semantic content of links is not easily discernible. Finally, PFNET cannot generate a local visual

configuration based on users' individual information needs; it only produces a global overview of a

data collection.

¹ Phenomenon occurring when the authors of two different papers both co-cite the same paper(s) in their work.

² Keywords appearing together in a predefined length of text in the same document.

4.4 Multidimensional Scaling Models (MDS)

The MDS technique consists of a group of methods used to discover empirical relationships among

investigated objects, by visualizing them and presenting their geometric representation in a low

dimensional display space. It can be used to reveal and illustrate hidden patterns for a set of proximity

measures among objects for multivariate, exploratory, and visual data analysis. An MDS algorithm starts

with a matrix of item-item similarities, and then assigns a location to each item in N-dimensional space

(N is specified a priori), where users may perceive and analyze the relationships among the displayed

objects. For sufficiently small N, the resulting locations may be displayed in a graph or 3D

visualisation. The more similar two objects are, the closer to each other they are placed, and vice versa.

One advantage of the MDS technique is the diversity of its algorithms, each of which handles

different situations. They can be classified into metric and non-metric MDS algorithms, based upon the

type of input proximity data. The non-metric MDS algorithm is applied to qualitative³ proximity data,

while metric MDS is applied to quantitative⁴ proximity data. Another category of MDS technique is the

classical MDS algorithm, which is used with quantitative proximity data.

Applications of MDS in IR can be roughly categorized into two groups, based on the proximity definition:

one is to use a co-citation method to define the proximity metric, and the other is to use a non-co-

citation method such as traditional distance-based or angle-based similarity measures. However, applying

traditional MDS to a very large data set may be prohibitively slow, since it uses a linear algebra solution

for the problem, which is computationally costly and makes heavy demands on storage. On the other

hand, the non-metric (metric) MDS method looks for the best match between the original proximity of

two objects and their Euclidean distance in a low dimensional space, using an iterative process, starting

with a random initial configuration. The Kruskal algorithm, which is used for the minimization, is

iterative and simple, and its computational complexity is in practice almost O(N).
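The iterative process can be illustrated with a small sketch that starts from a random configuration and repeatedly moves the points to reduce the mismatch (stress) between the target dissimilarities and the Euclidean distances in the display space. This is a plain gradient descent on the raw metric stress, not Kruskal's algorithm itself (which additionally applies a monotone regression for ordinal data); the step size lr, the number of steps, and the seed are assumptions.

```python
import math
import random


def _dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def stress(delta, coords):
    """Raw stress: sum of squared differences between distances and targets."""
    n = len(coords)
    return sum((_dist(coords[i], coords[j]) - delta[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def mds(delta, dim=2, steps=300, lr=0.05, seed=0):
    """Place len(delta) items in `dim` dimensions by gradient descent on stress."""
    rng = random.Random(seed)
    n = len(delta)
    coords = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    for _ in range(steps):
        grad = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = _dist(coords[i], coords[j])
                if d == 0.0:
                    continue  # coincident points give no usable direction
                coef = 2.0 * (d - delta[i][j]) / d
                for k in range(dim):
                    grad[i][k] += coef * (coords[i][k] - coords[j][k])
        for i in range(n):
            for k in range(dim):
                coords[i][k] -= lr * grad[i][k]
    return coords


# Target dissimilarities: the four corners of a unit square.
r2 = math.sqrt(2.0)
square = [[0, 1, r2, 1], [1, 0, 1, r2], [r2, 1, 0, 1], [1, r2, 1, 0]]
layout = mds(square)
```

Each step of this sketch visits every pair of items, so its cost per iteration grows quadratically with the number of items, and the result depends on the random starting configuration.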

Furthermore, the huge number of displayed objects in a low dimensional space raises concerns in terms of

efficient system implementation and information representation in the MDS display space, for interactive

systems. To solve the problem, people use the supernode method [68], which visualizes object clusters and

objects at different levels respectively. In the MDS visual display space, documents are clustered first,

so that documents that are highly related in terms of co-citation form new supernodes. So instead

of individual documents, the system displays these supernodes. Documents within a supernode can be

visualized, at a lower level, if users zoom in on a selected cluster.

³ Qualitative proximity data refers to ordinal data.

⁴ Quantitative proximity data refers to ratio-scaled data.

Figure 4.8: Display of ThemeScape and Galaxy Visualizations of IN-SPIRE Visualization Program

Another potential problem is the intuitive representation of projected objects in a low dimensional MDS

space. It is extremely important for users to easily understand and meaningfully interpret the graphic

presentation. Towards that aim, the MDS approach was combined with the so-called ecological approach, in

order to take advantage of natural display formats that humans are used to [83]. The ecological landscape

is an MDS display space, which consists of a group of ecologically connected local landscapes. Each

landscape represents an object cluster. The size of each local landscape is related to the number of

documents containing the thematic term which defines the local landscape. A document is positioned based

upon its

indexing terms, the thematic term, and the category assigned to the document. Figure 4.8 shows the

ThemeScape and Galaxy visualizations using the IN-SPIRE software.

4.5 Self-organizing Map Model (SOM)

The SOM, a neural network, is a nonlinear topology-preserving projection method that converts a high

dimensional space into a low dimensional (1D, 2D, or 3D) grid (feature map), as shown in Figure 4.9.

Three spaces are involved in SOM: the high dimensional document vector space (associated with

objects), the high dimensional weight vector space (associated with the nodes of the display grid), and

the low dimensional visual space (the display grid). During the learning process, input vectors are

picked at random and each is assigned to the closest neuron, that is, the one whose weight vector is

most similar to it. After the training process, the documents are projected onto the feature map and

labels are assigned to the feature map areas (a step that in most cases is weight-based).
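The learning process can be sketched as follows for a 2D grid. The text only states that each randomly picked input vector is assigned to its closest neuron; the weight update, the Gaussian neighbourhood, and the linearly decaying learning rate and radius used below are the usual SOM ingredients and are assumptions of this sketch, as are all function and parameter names.

```python
import math
import random


def best_matching_unit(weights, vec):
    """Grid position of the node whose weight vector is closest to `vec`."""
    return min(weights,
               key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], vec)))


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols feature map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    total = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for vec in rng.sample(data, len(data)):  # random presentation order
            frac = step / total
            lr = lr0 * (1.0 - frac)                     # decaying learning rate
            radius = max(radius0 * (1.0 - frac), 0.5)   # shrinking neighbourhood
            bmu = best_matching_unit(weights, vec)
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * radius * radius))
                for k in range(dim):
                    w[k] += lr * h * (vec[k] - w[k])  # pull node towards input
            step += 1
    return weights


docs = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
som = train_som(docs, rows=1, cols=4)
position = best_matching_unit(som, [0.0, 0.0])  # where a document lands on the map
```

After training, projecting a document onto the feature map amounts to looking up its best matching unit, so documents with similar vectors tend to land on the same or neighbouring nodes.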

Figure 4.9: A SOM Feature Map

Each partitioned area in the map clearly represents one or more concepts and the documents associated with these concepts. The size of each area in the map indicates term occurrence frequencies or the possible size of the

projected documents. After term labeling processing, semantically related areas are also