8/23/2019 General Graduate Exams
1/72
General Graduate Exams
Exploration and Visualization of Information in Search Engines
by
Panagiotis Papadakos
Presented to Graduate Studies Committee of
the Computer Science Department of
the University of Crete
Heraklion, May 2009
Contents
Table of Contents
List of Figures
List of Tables
1 Motivation
2 Interaction Paradigms and Visualization in Information Retrieval (IR)
2.1 Information Retrieval
2.2 Information Space and User Information Needs
2.2.1 Micro and Macro Level of Information
2.2.2 Information Space
2.2.3 User Information Needs
2.3 Interaction Paradigms
2.3.1 Query Searching vs Browsing
2.3.2 Differences
2.3.3 Three Paradigms
2.4 Visualization
2.4.1 Definition
2.4.2 Scientific and Information Visualization
3 Interaction Paradigms and Related Techniques
3.1 Results Clustering
3.1.1 Clustering Requirements
3.1.2 Clustering Algorithms Classification
Hierarchical and Non-Hierarchical Approaches
Document-based and Snippet-based Approaches
3.1.3 Cluster Presentation & User Interaction
3.2 Facets and Dynamic Taxonomies
3.2.1 Introduction
3.2.2 Taxonomy Design
3.2.3 Framework
3.2.4 User Interface Design
4 Visualization Models and Metaphors
4.1 Multiple Reference Points Based Models (MRPBM)
4.2 Euclidean Spatial Characteristic Based Model (ESCBM)
4.3 Pathfinder Associative Network (PFNET)
4.4 Multidimensional Scaling Models (MDS)
4.5 Self-organizing Map Model (SOM)
4.6 Metaphors
4.6.1 Metaphors for the Semantic Framework Presentation
4.6.2 Metaphors for Information Retrieval Interaction
5 Vision and Research Methodology
5.1 Challenges
5.2 Vision
5.3 Proposed Methodology
5.3.1 Information Visualization Framework
5.3.2 Metrics for Exploratory Search
5.3.3 Interaction Models for Exploratory Search
5.3.4 Exploratory Search and Visualization
5.3.5 Implementation
5.3.6 Evaluation and Improvements
5.4 Work Done
5.4.1 ODBMS Index Representations
5.4.2 FleXplorer, A Framework for Providing Faceted and Dynamic Taxonomy-based Information Exploration
5.4.3 Exploratory Web Searching with Dynamic Taxonomies and Results Clustering
6 Conclusion
Appendices
A Acronyms
References
List of Figures
3.1 Clusty, a Snippet-based Clustering Approach
3.2 Quintura Word Cloud
3.3 Grokker Generates an Euler Diagram
3.4 Top-200 Web Search Results Clustering Displayed Using Two-level TreeMaps
3.5 Kartoo Generates a Thematic Map
3.6 Example of a Materialized Faceted Taxonomy
3.7 ContentLandscape Applies Collapsible Panel Pattern for Zooming
3.8 FacetZoom Combines Ideas from Zoomable User Interfaces (UIs) With Faceted Search
3.9 Faceted Search for Small Screens in the FaThumb Prototype
3.10 Flamenco Allows Choosing Between a Search Over All Results or Within Current Focus
4.1 Display of 4 Reference Points in a Fixed Reference Point Environment
4.2 VIBE Using 5 Reference Points
4.3 WebStar Using 4 Reference Points (RPs). Snapshots During a Full Rotation of international Reference Point
4.4 Display of the Projected Cosine Model, in Distance-Angle DARE Model
4.5 Display of the Projected Cosine Model, in the Angle-Angle TOFIR Model
4.6 Display of the Projected Distance Model, in the Distance-Distance GUIDO Model
4.7 Display of Original Network (left) and Final PFNET Network (right)
4.8 Display of ThemeScape and Galaxy Visualizations of IN-SPIRE Visualization Program
4.9 A SOM Feature Map
4.10 A 3D Cone Tree (left) and a Basic Hyperbolic Tree (right)
4.11 Perspective Wall (left) and ThemeRiver (right)
4.12 DataLens, a 3D Pyramid Lens
4.13 Gridl Prototype Displays Search Results Along Two Axes
4.14 HotMaps, a 2D Visualization of How Query Terms Relate to Search Results
List of Tables
3.1 Basic Notions and Notations
3.2 Interaction Notions and Notations
Chapter 1
Motivation
The daily use of computers as tools for work, education, communication and entertainment produces a huge
volume of data. As recent surveys state¹, the world produces between 1 and 2 exabytes (1 exabyte = 2^60 bytes) of
unique information per year, 90% of which is digital, with a 50% annual growth rate. In addition, these
new data are more complex and more dynamic. The interaction paradigm adopted by current IR systems
and Web Search Engines (WSEs), a simple rectangular text box where the user typically enters
one or two terms and the system returns a ranked list of results, has proven very useful for finding specific
information and is simple and intuitive to use.
However, such systems do not provide adequate support for information needs that have an exploratory
nature and/or aim at decision making. User studies have shown that casual users usually inspect only the
first page of results and do not exploit any of the query language operators (not even Boolean queries)
that are offered. Instead, they issue very short queries, which they reformulate in an iterative process based
on the returned results [73, 62]. On the other hand, the powerful and expressive query languages that are
usually offered for structured information (e.g. for the Semantic Web) are not fully utilized, in the sense
that the formulation of queries is a laborious and difficult task for end users.
In the highly demanding and growing information environment described above, new intuitive and
more user-friendly UIs have to be created, providing effective and efficient services for retrieving and
exploring the available information and supporting users in the various decision-making tasks and processes. The
fields of IR, Information Visualization (IV) and Human Computer Interaction (HCI) have to collaborate
in order to provide new intuitive and interactive UIs, where the information is presented, organized, and
analyzed, giving the user the ability to recognize patterns and relations.
¹ http://www.sims.berkeley.edu/research/projects/how-much-info-2003/
For example, to select a hotel or a product to buy, it is not enough to return the list of choices that
satisfy user-provided criteria. Ranking the available choices according to user-based (i.e. preference)
or statistical criteria is also required. Furthermore, exploration services are required that provide users with
comprehensive summaries of the available choices, enabling them to quickly grasp the information
landscape, restrict their focus, and thus gradually approach the most desired choices.
For this reason, efforts to exploit the query languages mentioned above in models of exploration/navigation
have started to emerge [56, 49, 30, 4]. In summary, the constantly increasing volume and requirements
of our digital economy require intuitive modes of interaction, involving flexible and efficient
navigation and advanced visualization.
Chapter 2
Interaction Paradigms and Visualization in IR
2.1 Information Retrieval
Information Retrieval (IR) is the domain focusing on searching, exploring and discovering information, either from organized
textual and data repositories or from the World Wide Web (WWW), in order to satisfy users' information
needs. However, since the information environment is constantly growing, another important aspect of
IR systems is their ability to organize this information. This organization can facilitate the creation
of innovative, more intuitive and user-friendly UIs, which will provide users with efficient ways of mapping,
organizing and grouping the available information. This can enable users to discover new patterns and
relationships in the available information and to satisfy their information needs faster and more
accurately.
2.2 Information Space and User Information Needs
2.2.1 Micro and Macro Level of Information
Information from an infobase can be divided into two levels. The first is the micro level, which refers to
individual objects or documents, such as contents, snippets or full text. This is the direct and obvious
information. The other is the macro level, which refers to aggregated information about objects or
documents from the collection. This information is not direct but is generated from the collection of
individual objects, and relies on the way the information is organized and presented. Such information can reveal
object connections, rhythms, trends, patterns and relationships, explaining information at the micro level.
The aggregate information at the macro level can vary with the information organization methods and information
presentations used for the same data. By navigating information at the macro level the user can gain a better
understanding of the provided collection and find unexpected insights [92]. An IR system should provide
access to both levels of information, by browsing and query searching.
2.2.2 Information Space
The information space can be conceived as an abstract and multidimensional space. Its structure is based
on the semantic characteristics and relationships derived from the organization of the collection data set,
which enables users to explore and discover information from the data collection. An information space
can be constituted by intrinsic attributes such as keywords, citations, hyperlinks, and authors, or by extrinsic
structures like a subject directory, a thesaurus system, or an organized search result list. Combinations of
intrinsic attributes and extrinsic structures can also form an information space. Since information is not
inherently spatial, describing its spatial characteristics requires defining basic topological properties like
distance, direction and angle. For instance, the distance between two objects can be the length of the shortest path in a
hyperlink, citation or hierarchical structure, or the Euclidean distance in the Vector Space Model (VSM).
Direction has a special meaning in hyperlink and citation based systems: if an object links to or cites
another object, the first object directs to the second. In a multidimensional vector-space based
IR system, the angle between vectors is used for retrieval. Finally, the information space has to be reduced from N
dimensions to 1, 2 or 3 in order to be perceived by humans, which can lead to user disorientation and
ambiguity [92].
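To make the notions of distance and angle concrete, the following minimal Python sketch computes both for two document vectors in a vector space. The vectors and their term weights are hypothetical values chosen only for illustration; they do not come from any system discussed here.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two document vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def angle(a, b):
    """Angle (in radians) between two non-zero document vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)

# Hypothetical weights of two index terms in a document and in a query.
doc = [3.0, 4.0]
query = [4.0, 3.0]

print(euclidean_distance(doc, query))  # distance-based view
print(angle(doc, query))               # angle-based view
```

A small angle means that the two vectors point in nearly the same direction regardless of their lengths, whereas the Euclidean distance is also sensitive to vector length.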
2.2.3 User Information Needs
User studies have shown that almost 60% of search tasks are exploratory [62]. The user does not know
his information need precisely and provides only 2-5 words, and focalized search very commonly leads to
inadequate interactions and poor results. Unfortunately, the available user interfaces (UIs) do not aid the
user in formulating his query. Furthermore, such systems do not provide adequate support for information
needs that have an exploratory nature and/or aim at decision making. The answers returned are simple
ranked lists of results, with no organization and no information on the macro level of the infobase. Casual
users usually inspect only the first page of results and do not exploit any of the query language
operators (not even Boolean queries) that are offered. Instead, they issue very short queries, which they
reformulate in an iterative process based on the returned results [73, 62]. On the other hand, the powerful
and expressive query languages that are usually offered for structured information (e.g. for the Semantic
Web) are not fully utilized, in the sense that the formulation of queries is a laborious and difficult task.
2.3 Interaction Paradigms
2.3.1 Query Searching vs Browsing
In IR two paradigms are widely recognized: query searching and browsing. Query
searching is the paradigm where the user tries to describe his information need with a group of relevant
and important terms. The query is then analyzed by the IR engine and a list of related documents, ordered
by the ranking model in use, is returned. Most IR engines also display snippets of relevant parts of the
returned documents. Because matching is based on the query terms, for ambiguous words that have many meanings
the system might return non-relevant results, which the user might perceive as a search failure.
On the other hand, browsing refers to UIs which allow the user to view, search and scan either the
whole information or part of it. This enables the user to explore and discover information, along with data
relationships and patterns. A UI for browsing should provide smooth and structured browsing. Methods
for information browsing include hyperlink and hierarchical structures. However, huge volumes of data
require the appropriate usage of automatic data analysis techniques, prior to visualization. According to
[81], browsing is useful when (a) there is good underlying structure, so items close to one another are
similar, (b) users are unfamiliar with the contents of the collection, (c) users have a limited understanding
of the organization of a system and prefer a less cognitively loaded method of exploration, (d) it is difficult
to verbalize the underlying information need and (e) the information is easier to recognize than describe.
2.3.2 Differences
According to [92], the differences between query searching and browsing include:
Judgment of Relevance
Query searching is based on keyword matching between query terms and surrogates of documents in a
database, at a lexical level. On the other hand, the relevance judgment in browsing is made by
users and is a concept matching process.
Continuity
The retrieval process is continuous for browsing, while it is discrete for query searching.
Selecting a browsing path, examining a context, and judging relevance are continuous and
controlled by the user during browsing, while after executing a query, the internal query processing and
ranking of the results are a black box for the user.
Cost in Time and Effort
Browsing is a time and effort consuming action, since the user must remember the browsing path,
search the contents and make decisions, while query searching involves only term selection and query
formulation.
Information Seeking Behavior
Browsing is a system-based seeking behavior (i.e. driven by what the system can offer), while query searching
is a seeking behavior based on what the user wants.
Iteration
Browsing is carried out through a series of iterative acts, like getting an overview of available information,
fixing on a target and examining it more closely, and then moving on and starting the cycle again.
Query searching, on the other hand, requires the definition of the query terms, the formulation of the
query and the examination of the results. Query searching might also be iterative, since the results might
not fulfill the information needs of the user.
Granularity
Using browsing the user can evaluate one relevant item at a time, while query search provides a group
of retrieved documents.
Clarity of information need
When a user starts an information seeking process, he might not have defined a clear information
need. In such a case, browsing is more appropriate, since it does not require a definite target, while
query searching requires a relatively well-conceived information need, for which keywords can be
chosen and a query can be formulated.
Interactivity
Browsing is an interactive process by nature, which makes it more complicated and challenging, while
query searching has fewer steps and less interaction.
Retrieval Results
Results of browsing are richer and more diverse, since they can lead to a wide range of retrieval
results (i.e. from contextual information, structural information, relational information to individual
objects), while query searching only retrieves a ranked list of documents.
2.3.3 Three Paradigms
Although query searching and browsing are different ways to seek information, they can be synthesized.
There are three basic paradigms:
Querying and Browsing (QB) In this paradigm, an initial query is submitted to the system to
restrict the infobase. Then the results are visualized in a visualization environment and browsed by
users.
Browsing and Querying (BQ) Information at the macro level is presented and browsed, and then
information at the micro level is searched and highlighted in the visualization context.
Browsing Only (BO) Information at the macro level is displayed and browsed. It does not integrate
any query searching components.
Query searching alone is not included among these paradigms, because it is the traditional IR paradigm and
does not require a visual space.
2.4 Visualization
2.4.1 Definition
According to [51], visualization is a method of computing, which transforms the symbolic into the geometric,
enables researchers to observe their simulations and computations, offers a method for seeing the unseen,
enriches the process of scientific discovery, and fosters profound and unexpected insights. Visualization
is the process of transforming data, information, and knowledge into graphic presentations to support
tasks such as data analysis, information exploration, information explanation, trend prediction, pattern
detection, rhythm discovery, and so on. Without visualization, people perceive or
comprehend less of the data, information, or knowledge, for a variety of reasons. Such reasons may
include the limitations of human vision, or the invisibility and abstractness of the data, information and
knowledge. Visualization requires certain methods or algorithms to convert raw data into a meaningful,
interpretable, and displayable form to visually convey information to users.
2.4.2 Scientific and Information Visualization
Visualization can be classified into two categories: scientific visualization and information visualization.
Scientific visualization is mostly used to show things that are either too fast or too slow for the
eye to perceive, structures much smaller or larger than human scale, or phenomena that people
cannot see directly, like X-ray or infrared radiation [52]. Examples include shapes of molecules, missile
tracking, astrophysics, fluid dynamics, medical images, etc.
On the other hand, information visualization is generally used to view abstract information. Examples
include visual reasoning, visual data modeling, visual programming, information retrieval visualization,
visualization of program execution, visual languages, spatial reasoning, and visualization of systems [82].
Although the two categories share fundamental design principles, implementation means, and issues,
information visualization, contrary to scientific visualization, does not have an inherent spatial structure
or geometry of data to display. For information visualization, a spatial structure or framework for semantic relationships
among data must be created. Finding or defining such a spatial structure is
challenging because data in an information space may be multifaceted, and relationships among data are interwoven
and complicated. Furthermore, data may be of diverse nature. Defining a spatial structure for
information visualization is therefore a complicated and creative process. Salient and displayable attributes of
objects must be extracted, a semantic framework for displayable objects must be established, information
must be organized, and objects must be projected onto the structure in such a way that the user will be
able to search for and find objects and object relationships [92].
Chapter 3
Interaction Paradigms and Related
Techniques
In this chapter we discuss, in turn, the interaction paradigms of Results Clustering and of Faceted and Dynamic
Taxonomies.
3.1 Results Clustering
Results clustering is a data analysis method that organizes a dataset into categorical groups, or
clusters, based on certain data association criteria. Different similarity measures can lead to different
clustering results. Items or objects within the same group/cluster are more similar to each other than to items in
other groups/clusters. Clustering is considered an unsupervised learning process because it can
automatically reveal intrinsic categorical patterns in a dataset. The categories produced by a clustering algorithm
depend on the nature of the dataset, the association criteria used, and the distribution of data items in the
dataset.
The advantage of clustering is that it can be easily applied to any collection, revealing interesting
and unexpected associations and trends. Its disadvantages are the lack of predictability, the
conflation of many dimensions simultaneously, the difficulty of labeling the groups, and the counterintuitiveness
of cluster hierarchies [25].
3.1.1 Clustering Requirements
Results clustering algorithms should satisfy several requirements. First of all, the generated clusters should
be characterized by high intra-cluster similarity. Moreover, results clustering algorithms should be
efficient and scalable, since clustering is an online task and the size of the retrieved document set can vary.
Usually only the top C documents are clustered in order to improve performance. In addition, the
presentation of each cluster should be concise and accurate, to allow users to detect what they need quickly.
Cluster labeling is the task of deriving readable and meaningful (single-word or multiple-word) names for
clusters, in order to help the user recognize the clusters/topics he is interested in. Such labels must
be predictive, descriptive, concise and syntactically correct. Finally, it should be possible to provide high
quality clusters based on small document snippets rather than the whole documents.
3.1.2 Clustering Algorithms Classification
We can categorize the clustering algorithms using two different classification schemes, based on either the
structure of the clusters or the infobase that these algorithms are applied to.
Hierarchical and Non-Hierarchical Approaches
The first scheme classifies the clustering algorithms as either non-hierarchical (partitioning
clustering algorithms) or hierarchical [61]. The major difference between these two clustering types is
that the former partition the items in a single-level structure, while the latter generate a hierarchy of
clustered items.
Non-Hierarchical Approaches
Algorithms of this kind partition N items into K categories (K must be predefined).
One of the most popular non-hierarchical algorithms is K-means [48], together with its variants [36, 11],
which is based on a simple iterative scheme for finding a locally optimal solution. The algorithm
starts with a guess about the solution, and then readjusts the cluster centroids until it reaches a local
optimum. A centroid is a special, artificially created item in a cluster, which is used to represent that
cluster for various purposes. It is defined as the average of the coordinates of all items in the cluster it
represents. A cluster membership function is a method for judging whether an item is assigned
to a cluster or not during the clustering process. The main advantages of this algorithm are its simplicity
and speed, which allow it to run on large datasets. Its disadvantage is that it does not yield the same
result with each run, since the resulting clusters depend on the initial random assignments.
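The iterative scheme can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of any cited system: it assumes Euclidean distance, takes the initial guess to be K items sampled at random from the data, and the function name kmeans and its parameters are our own.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Partition `points` (tuples of numbers) into k clusters."""
    rng = random.Random(seed)
    # Initial guess (an assumption of this sketch): k items picked at random.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        # Membership step: each item joins the cluster of its nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        if new_assignment == assignment:
            break  # nothing changed, so a local optimum has been reached
        assignment = new_assignment
        # Readjustment step: a centroid is the average of its cluster's items.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, k=2)
print(centroids)
print(assignment)
```

Changing the seed changes the initial guess, which is the source of the run-to-run variation mentioned above; on well-separated toy data such as this the final partition is the same, but in general it need not be.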
Another non-hierarchical algorithm is Fuzzy c-means [32]. In fuzzy clustering, each point has a
degree of belonging to each cluster, as in fuzzy logic, rather than belonging completely to just one cluster.
Thus, points on the edge of a cluster may be in the cluster to a lesser degree than points in the
center of the cluster. Like k-means, the algorithm minimizes intra-cluster variance, and it has the same
problems: the minimum found is a local minimum, and the results depend on the initial choice of weights.
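The degree of belonging can be illustrated with the membership formula commonly used in fuzzy c-means, which weights each cluster by the inverse of the relative distance to its centroid. The sketch below computes memberships for fixed centroids only; the fuzzifier parameter m (set to 2 here) and the function name are assumptions of this example and are not taken from the text above.

```python
def memberships(point, centroids, m=2.0):
    """Degrees of belonging of `point` to each centroid; the values sum to 1."""
    dists = [
        sum((p - c) ** 2 for p, c in zip(point, centroid)) ** 0.5
        for centroid in centroids
    ]
    # A point lying exactly on a centroid belongs fully to that cluster
    # (shared evenly if several centroids coincide with the point).
    zeros = dists.count(0.0)
    if zeros:
        return [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
        for d_i in dists
    ]

centroids = [(0.0, 0.0), (4.0, 0.0)]
print(memberships((1.0, 0.0), centroids))  # close to the first centroid
print(memberships((2.0, 0.0), centroids))  # halfway between the two
```

A point near a centroid receives a membership close to 1 for that cluster, while a point halfway between two centroids is shared equally between them.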
QT (quality threshold) clustering [29] is an alternative method of partitioning data, invented for gene
clustering. It requires more computing power than k-means, but does not require specifying the
number of clusters a priori, and always returns the same result when run several times. The user
chooses a maximum diameter for clusters and the algorithm builds a candidate cluster for each point
by including the closest point, the next closest, and so on, until the diameter of the cluster surpasses
the threshold. The candidate cluster with the most points becomes the first true cluster. The algorithm then
recurses on the reduced set of points to find the rest of the clusters.
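The procedure can be sketched as follows. This is an illustrative reading of the description above, not the reference implementation of [29]: we assume Euclidean distance, interpret "closest" as the point whose addition keeps the cluster diameter smallest, stop growing a candidate before the threshold would be exceeded, and use an iterative loop in place of recursion.

```python
import math

def qt_cluster(points, max_diameter):
    """Return clusters (lists of points), largest candidate first."""
    remaining = list(range(len(points)))  # work on indices so duplicates are safe
    clusters = []
    while remaining:
        best = []
        for seed in remaining:
            # Grow one candidate cluster around every remaining point.
            candidate = [seed]
            others = [i for i in remaining if i != seed]
            while others:
                def farthest(i):
                    # Diameter contribution of i: distance to its farthest member.
                    return max(math.dist(points[i], points[j]) for j in candidate)
                nxt = min(others, key=farthest)
                if farthest(nxt) > max_diameter:
                    break  # adding any further point would exceed the threshold
                candidate.append(nxt)
                others.remove(nxt)
            if len(candidate) > len(best):
                best = candidate
        clusters.append([points[i] for i in best])
        # Continue with the reduced set of points.
        remaining = [i for i in remaining if i not in best]
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
for cluster in qt_cluster(points, max_diameter=2.0):
    print(cluster)
```

No random choice is involved, so repeated runs on the same input give the same clusters. The price is that a candidate cluster is rebuilt for every remaining point in every round, which is why the method needs more computing power than k-means.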
STC and its variations, described later in the Section Snippet-based Approaches, are also non-
hierarchical approaches.
Hierarchical Approaches
A hierarchical clustering algorithm yields a tree structure, also called a dendrogram. In
such a structure, a child sub-cluster has to overlap with its parent cluster. The clustering process
in such algorithms is recursive, meaning that successive sub-clusters are generated from an existing
cluster. There are two basic strategies for creating this structure: agglomerative (bottom-up)
algorithms and divisive (top-down) algorithms. The former first cluster the input
items, forming a set of clusters, and then merge close clusters from the existing cluster set to form a
parent cluster, based on a similarity measure. The algorithm ends when all clusters have been merged
into one parent cluster, the root of the tree [36]. Different variations may employ different similarity
measuring schemes [94]. The latter take the opposite direction: they start with the root
of the tree and break down one large cluster into several smaller clusters. The recursion stops
when certain criteria are met. Agglomerative clustering algorithms are more popular than divisive
clustering algorithms.
The above methods usually suffer from their inability to perform adjustments once a merge or split
has been performed. This inflexibility often lowers the clustering accuracy. Furthermore, due to
the complexity of computing the similarity between every pair of clusters, such algorithms are not
scalable for handling large data sets in document clustering.
Another approach is the Hierarchical Frequent Term-based Clustering (HFTC) method, proposed in
[1]. This algorithm exploits the notion of frequent itemsets¹ used in data mining. HFTC greedily
selects the next frequent itemset, which represents the next cluster, minimizing the overlap of clusters
in terms of shared documents. Experiments have shown that this algorithm is not scalable [21].
¹ A frequent itemset is a set of words that occur together in some minimum fraction of the documents in a cluster.
A different approach based on the idea of frequent itemsets is the Frequent Itemset Hierarchical
Clustering (FIHC). FIHC uses global frequent itemsets² to construct clusters, which reduces the
dimensionality of the document set, making this algorithm more efficient and scalable.
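To make the notion of a global frequent itemset concrete, the following sketch enumerates word sets that occur together in at least a given fraction of the whole document set. It is a brute-force illustration only: the function name, the whitespace tokenization and the `max_size` cut-off are assumptions, and neither HFTC nor FIHC mines itemsets this way in practice.

```python
from itertools import combinations

def global_frequent_itemsets(docs, min_fraction, max_size=2):
    """Return {itemset: support} for word sets found in enough documents."""
    doc_sets = [set(d.lower().split()) for d in docs]
    min_count = min_fraction * len(doc_sets)
    vocabulary = sorted(set().union(*doc_sets))
    frequent = {}
    for size in range(1, max_size + 1):
        for itemset in combinations(vocabulary, size):
            # Support = number of documents containing every word of the set.
            support = sum(1 for d in doc_sets if d.issuperset(itemset))
            if support >= min_count:
                frequent[frozenset(itemset)] = support
    return frequent
```

Each frequent itemset can then act as a cluster candidate whose documents are those that contain all of its words.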
Document-based and Snippet-based Approaches
Clustering can be applied either to the original documents (as in [11, 27, 21]), or to their (query-dependent)
snippets (as in [86, 79, 71, 19, 88, 23, 77]). For instance, clustering Meta Web Search Engines (MWSEs)
(e.g. clusty.com) use the results of one or more search engines (e.g. Google, Yahoo!), in order to increase
coverage/relevance. Meta-search engines therefore have direct access only to the snippets returned by
the queried search engines. Clustering the snippets rather than the whole documents makes clustering
algorithms faster. Some clustering algorithms [19, 15, 84] use internal or external sources of knowledge
like Web directories³ (e.g. DMoz⁴, Yahoo! Directory), dictionaries (e.g. WordNet) and thesauri, online
encyclopedias (e.g. Wikipedia⁵) and other online knowledge bases. These external sources are exploited
to identify key phrases that represent the contents of the retrieved documents or to enrich the extracted
words/phrases in order to optimize the clustering and improve the quality of cluster labels.
Document Vector-based Approaches
The above traditional clustering algorithms, either flat (like K-means) or hierarchical (agglomerative
or divisive), are not based on snippets but on the original document vectors and on a similarity
measure. Another such approach is ESTC (Extended STC) [10], which is an extension of STC
(described later in the Section Snippet-based Approaches), appropriate for application over the full
texts (not snippets). To reduce the (roughly two orders of magnitude) increased number of clusters, a
different scoring function and cluster selection algorithm is adopted. The cluster selection algorithm
is based on a greedy search algorithm aiming at reducing the overlap and at increasing the coverage
of the final clusters.
In brief, such approaches can be applied only on a stand-alone engine (since they require access to the
entire vectors of the documents) and they are computationally expensive. Furthermore, clustering over
full text is not appropriate for a (Meta) WSE, since the full text may not be available or may be too expensive
to process.
² Frequent itemsets that appear together in more than a minimum fraction of the whole document set.
³ A web directory is a listing of websites organized in a hierarchy or interconnected list of categories.
⁴ www.dmoz.org
⁵ www.wikipedia.org
Snippet-based Approaches
Figure 3.1: Clusty, a Snippet-based Clustering Approach
Snippet-based approaches rely on snippets and there are already a few engines that provide such
clustering services. Clusty⁶, shown in Figure 3.1, is probably the most famous one. Suffix Tree
Clustering (STC) [86] is a key algorithm in this domain and is used by the Grouper [87] and Carrot2
[79, 71] MWSEs. It treats each snippet as an ordered sequence of words, identifies the phrases
(ordered sequences of one or more words) that are common to groups of documents by building
a suffix tree structure, and returns a flat set of clusters that are naturally overlapping. Several
variations of STC have been proposed. For instance, the trie can be constructed with the N-grams
instead of the original suffixes. The resulting trie has lower memory requirements (since suffixes are
no longer than N words) and its building time is reduced, but less common phrases are discovered
and this may hurt the quality of the final clusters. Specifically, when N is smaller than the length
of true common phrases, the cluster labels can be unreadable. To overcome this shortcoming, [33]
proposed a join operation. A variant of STC with N-gram is STC with X-gram [77], where X is
an adaptive variable. It has lower memory requirements and is faster than both STC with N-gram
and the original STC since it maintains fewer words. It is claimed that it generates more readable
labels than STC with N-gram as it inserts in the suffix tree more true common phrases and joins
partial phrases to construct true common phrases, but no user study results have been reported in
the literature, and the performance improvements reported are small.
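The core idea shared by STC and its N-gram variants, grouping snippets by the phrases they have in common, can be illustrated as follows. This sketch deliberately swaps the suffix tree (or trie) for a plain dictionary of word N-grams, so it shows the resulting base clusters rather than the data structure; the function name and the parameters `max_n` and `min_docs` are assumptions made for the example.

```python
from collections import defaultdict

def phrase_clusters(snippets, max_n=3, min_docs=2):
    """Map each shared phrase (up to max_n words) to the snippet ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        for n in range(1, max_n + 1):
            for start in range(len(words) - n + 1):
                # Every contiguous run of n words is a candidate phrase.
                index[" ".join(words[start:start + n])].add(doc_id)
    # Keep only phrases that are common to at least min_docs snippets.
    return {phrase: ids for phrase, ids in index.items() if len(ids) >= min_docs}
```

A snippet may appear under several phrases, so the resulting clusters overlap naturally. Limiting `max_n` mirrors the trade-off described above: less memory, but longer common phrases are only seen in fragments.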
⁶ www.clusty.com
Another snippet-based clustering approach is TermRank [23]. TermRank succeeds in ranking discrim-
inative terms higher than ambiguous terms, and ambiguous terms higher than common terms. The
top T terms can then be used as feature vectors in K-means or any other document vector-based
clustering algorithm. This approach requires knowing the term frequency (TF), works only on single
words (not on phrases), and no evaluation results over snippets have been reported in the literature.
Another approach is Findex [34], a statistical algorithm that extracts candidate phrases by moving a
window with a length of 1..|P| words across the sentences (P), together with fKWIC, which extracts the most
frequent keyword contexts, which must be phrases that contain at least one of the query words. In
contrast to STC, Findex does not merge clusters on the basis of the common documents but on the
similarity of the extracted phrases. However, no comparative results regarding cluster label quality
have been reported in the literature.
Finally, there are snippet-based approaches that use external resources (lexical or training data). For
instance, SNAKET⁷ [19] (a MWSE) uses the DMoz web directory for ranking the gapped sentences⁸
which are extracted from the snippets. Deep Classifier [84] trims the large hierarchy, returned by an
online Web directory, into a narrow one and combines it with the results of a search engine, making
use of a discriminative naive Bayesian classifier. Another (supervised) machine learning technique
is Salient Phrases Extraction [88]. It extracts salient phrases as candidate cluster names from the
list of titles and snippets of the answer, and ranks them using a regression model over five different
properties, learned from human training data. Another approach that uses several external resources,
such as WordNet and Wikipedia, in order to identify useful terms and to organize them hierarchically
is described in [15]. Other extensions of STC for oriental languages and for cases where external
resources are available are described in [89, 78].
3.1.3 Cluster Presentation & User Interaction
Although cluster presentation and user interaction approaches are somewhat orthogonal to the clustering
algorithms employed, they are crucial for providing flexible and effective access services to the end users.
In most cases, clusters are presented using lists or trees. Some variations are described next. A well known
interaction paradigm that involves clustering is Scatter/Gather [11, 27], which provides an interactive
interface allowing the users to select clusters; the documents of the selected clusters are then clustered
again, the new clusters are presented, and so on.
⁷ SNippet Aggregation for Knowledge ExtracTion
⁸ Gapped sentences are sequences of terms occurring non-contiguously in the snippets.
Figure 3.2: Quintura Word Cloud
Clusty⁹ is an extension of Vivisimo that offers a new feature, called remix clustering, which clusters
the same search results again while ignoring the topics that the user has already seen. Another approach for the
presentation layer is provided by Quintura¹⁰, shown in Figure 3.2. It extracts keywords from search results
and builds a word cloud (visual map). The name of each cluster is placed in a 2D area. The positions of
the names are based on their distance, while font size indicates the size of each cluster. By clicking words
in the cloud, the user refines the query. SNAKET's [19] interface offers a personalization feature that is
performed at the client side: the user can select a set of labels and then ask SNAKET to filter out (from
the ranked list) all those snippets that do not belong to the folders labeled by the selected labels.
SOMs have been used to support exploration of a document space to search for patterns and gain
overviews of available documents and relationships between documents [42] (Figure 3.4). Another
information visualization alternative, Citiviz, displays the clusters in search results using a hyperbolic tree and
a scatterplot. Several (M)WSEs incorporate visualizations similar to both treemaps and hyperbolic trees.
grokker¹¹, shown in Figure 3.3, clusters documents into a hierarchy and produces an Euler diagram, a
coloured circle for each top-level cluster with sub-clusters nested recursively, where the user can zoom in.
⁹ www.clusty.com
¹⁰ www.quintura.com
¹¹ www.grokker.com
Figure 3.3: grokker Generates an Euler Diagram
Another example is Kartoo¹², shown in Figure 3.5, which generates a thematic map from the top dozen
search results for a query, laying out small icons representing results onto the map, with which the user
can interact.
3.2 Facets and Dynamic Taxonomies
Dynamic taxonomies (also known as faceted search systems) [64] are a general knowledge management model
based on a multidimensional classification of heterogeneous data objects and are used to explore and browse
complex information bases in a guided, yet unconstrained way through a visual interface. Features of
faceted metadata search include (a) display of current results in multiple categorization schemes (facets)
(e.g. based on metadata terms, such as size, price or date), (b) display categories leading to non-empty
results, and (c) display of the count of the indexed objects of each category (i.e. the number of results the
user will get if he selects this category).
¹² www.kartoo.com
Figure 3.4: Top-200 Web Search Results Clustering Displayed Using Two-level TreeMaps
Figure 3.5: Kartoo Generates a Thematic Map
3.2.1 Introduction
Static taxonomies (such as Yahoo!'s), based on a hierarchy of concepts, can be used to select areas of
interest and restrict the portion of the retrieved infobase. The creation of such taxonomies is usually a
manual process, although automatic and semi-automatic techniques have been proposed. However, static
taxonomies are not scalable for large information bases [65], and the number of documents rapidly becomes
too large for manual inspection.
On the other hand, dynamic taxonomies [63, 64, 76] (also known as faceted search systems) are a general
knowledge management model based on a multidimensional classification of heterogeneous data objects and
are used to explore/browse complex information bases in a guided, yet unconstrained way through a visual
interface. Features of faceted metadata search include:
• display of current results in multiple categorization schemes (facets) (e.g. based on metadata terms,
such as size, price or date)
• display of categories leading to non-empty results (Poka-Yoke¹³)
• display of the count of the indexed objects of each category (i.e. the number of results the user will
get if he selects this category)
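The three features listed above can be illustrated with a small counting routine. This is a minimal sketch under assumed data structures: each result is taken to be a dictionary mapping a facet name to the list of categories assigned to the object, and the function name `facet_counts` is hypothetical.

```python
from collections import Counter

def facet_counts(results, facets):
    """For each facet, count the current results per category."""
    counts = {facet: Counter() for facet in facets}
    for obj in results:
        for facet in facets:
            # Only categories that actually occur in the results get an entry,
            # so categories leading to empty results are never offered.
            for value in obj.get(facet, []):
                counts[facet][value] += 1
    return counts
```

The count shown next to a category is exactly the number of results the user obtains by selecting it, which is why no selection can lead to an empty result.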
Such systems focus on user-centered interactive exploratory access, and propose a holistic approach in
which modeling, interface and interaction issues are considered together. One of the key factors of this
model is simplicity, in order to make it easily understandable and usable by end-users. The user always deals
with a single conceptual representation of the infobase. The conceptual schema of a dynamic taxonomy
is a plain taxonomy. It is a hierarchy going from the most general to the most specific concepts based on
subsumptions. Directed acyclic graph taxonomies modelling multiple inheritance are supported but rarely
required.
The user is guided to reach his goal, because at each stage he has a complete list of all the concepts related
to the current focus, which can be used to further refine his exploration. Furthermore as in traditional
search methods, the infobase can be restricted and a reduced taxonomy can be created. The user is in
charge of interaction and he can freely explore the infobase, discovering unexpected relationships. By
construction, no empty results can occur, because they are automatically pruned. Usability studies [26, 85]
show that despite slow response times, dynamic taxonomies produce a faster overall interaction and a
significantly better recall (both actual and perceived) than access through text retrieval.
Dynamic taxonomies converge very quickly to small result sets, as described in [65]. For
example, 3 zoom operations on terminal concepts are sufficient to reduce a 10,000,000 object infobase
described by a compact taxonomy with 1,000 concepts to an average of 10 objects. Finally, the conceptual
organization of dynamic taxonomies makes it possible to gather user interests at a precise conceptual level by simply
monitoring the zoom operations issued and the concepts the user focuses on.
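The convergence figure quoted above can be reproduced with simple arithmetic under one assumption that is not stated in the text: that each zoom on a terminal concept keeps, on average, one object in a hundred (for instance, if the 1,000 concepts were arranged as 10 facets of 100 equally populated terminal concepts, with each object classified once per facet; this layout is hypothetical). The function below only illustrates that calculation.

```python
def expected_result_size(num_objects, reduction_per_zoom, zooms):
    """Average result size after a number of zooms.

    reduction_per_zoom is the assumed factor by which one zoom on a
    terminal concept shrinks the current result set (an assumption of
    this sketch, not a value given in the text).
    """
    return num_objects / reduction_per_zoom ** zooms

# 10,000,000 objects, each zoom keeping 1 object in 100, three zooms
average = expected_result_size(10_000_000, 100, 3)
```

Under this assumption the result set shrinks from 10,000,000 to 100,000, then 1,000, then 10 objects.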
¹³ Poka-Yoke is a Japanese term that means fail-safing or mistake-proofing.
Examples of applications of faceted metadata search include: e-commerce (e.g. eBay), library and
bibliographic portals (e.g. DBLP), museum portals (e.g. [49] and Europeana¹⁴), mobile phone browsers (e.g.
[35]), specialized search engines and portals (e.g. [50]), Semantic Web (e.g. [30, 49, 56]), general purpose
web search engines (e.g. Google Base), and other frameworks (e.g. mSpace [67]).
3.2.2 Taxonomy Design
The most accurate way to create a taxonomy is to build categories by hand. Unfortunately, manual
classification is expensive and infeasible for many practical document collections, and especially for a WSE
document collection. Automatic clustering techniques generate clusters that are typically labeled using a
set of keywords, which leads to unpredictable and unintuitive labels. An alternative approach to clustering
is to generate hierarchies of terms for browsing the database. [66] introduced subsumption hierarchies
and [45] showed experimentally that subsumption hierarchies outperform lexical hierarchies [60]. Another
approach is to use the hierarchical structure of WordNet¹⁵,¹⁶ to offer a hierarchy view over the topics [40].
WordNet, together with a tree-minimization algorithm that creates an appropriate concept hierarchy for a
database, is also used in [72].
All these techniques generate a single hierarchy for browsing the database. A supervised approach for
extracting useful facets from a collection of text or text-annotated data is described in [14], which relies on
WordNet hypernyms¹⁷ and on a Support Vector Machine (SVM) classifier to assign new keywords to facets.
More recent work [15, 13] provides an unsupervised technique to extract useful facet terms, by expanding
a database using WordNet and Wikipedia to identify important terms.
3.2.3 Framework
Table 3.1 defines formally and introduces notations for terms, terminologies, taxonomies, faceted tax-
onomies, interpretations, descriptions and materialized faceted taxonomies as described in [76]. In brief,
Obj is a set of objects (the set of all documents indexed by the WSE), T is a set of terms, and the elements
of Obj can be described with respect to one or more aspects (facets), where each aspect is associated with
a value domain, finite or infinite, which may be ordered (in the general case we could have a partial order
$(T, \le)$). The description of an object with respect to one facet consists of assigning to the object one or
¹⁴ http://www.europeana.eu
¹⁵ WordNet is a lexical database which groups English words into sets of synonyms called synsets, provides short, general
definitions, and records the various semantic relations between these synonym sets.
¹⁶ http://wordnet.princeton.edu/
¹⁷ A hypernym is a word whose meaning includes the meanings of other words, as the meaning of the term animal includes
the meaning of cat, dog, parrot.
more terms from the taxonomy that corresponds to that facet.
Table 3.2 defines the required notions and notations regarding user interaction. The user explores or
navigates the information space by setting and changing his focus. The notion of focus can be intensional
or extensional. Specifically, any set of terms, i.e. any conjunction of terms (or any boolean expression of
terms) is a possible focus. For example, the initial focus can be the empty compound term, or the top term
of a facet. However, the user can also start from an arbitrary set of objects, and this is the common case
in the context of a WSE. In that case the focus is defined extensionally. Specifically, if A is the result of a
free text query q, then the interaction is based on the restriction of the materialized faceted taxonomy on
A (as defined at the bottom part of Table 3.2).
At any point during the interaction, the immediate zoom-in/out/side points along with count information
are computed and provided to the user. When the user selects one of these points then the selected term
is added to the focus, and so on. An example of a materialized faceted taxonomy is shown in Figure 3.6.
Figure 3.6: Example of a Materialized Faceted Taxonomy
Foci are considered to be redundancy free. A focus ctx (i.e. $ctx \subseteq T$) is redundancy free if ctx = min(ctx).
For example, ctx = {Greece, Crete} is not redundancy free because min(ctx) = {Crete}.
The contents (or extension) of a focus ctx is the set of objects I(ctx). This notion can be refined in order
to distinguish the shallow contents $I(ctx)$ from the deep contents $\bar{I}(ctx)$.
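The redundancy-free condition can be checked mechanically. The sketch below assumes the taxonomy is given as a dictionary `parents` that maps each term to its direct broaders; this representation and the function names are assumptions made for illustration and are not part of the framework of [76].

```python
def broaders(term, parents):
    """All strictly broader terms of `term`, following direct-broader links."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen

def minimal(ctx, parents):
    """min(ctx): drop every term that is strictly broader than another term of ctx."""
    covered = set()
    for t in ctx:
        covered |= broaders(t, parents)
    return set(ctx) - covered

def is_redundancy_free(ctx, parents):
    return set(ctx) == minimal(ctx, parents)
```

With Crete placed under Greece, the focus {Greece, Crete} reduces to {Crete}, matching the example in the text.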
3.2.4 User Interface Design
System implementations for dynamic taxonomies and faceted search allow a wide range of query possibilities
on the data. Only when these are made accessible through appropriate UIs can the resulting applications
support a variety of search, browsing and analysis tasks. Such systems should provide support at least for
the three basic characteristics of faceted and dynamic taxonomies. They should display non-empty results,
in multiple categorization schemes (facets), along with the count of the indexed objects of each category.
Additional UI functionality is usually accompanied by additional complexity and visual clutter.
Selection and de-selection of zoom-points is of central importance in faceted search. If only one concept
should be selectable at a time within a facet, traditional single-select controls such as radio buttons,
dropdown list controls or simple links can be used. On the other hand, the standard multi-select elements
are check boxes. For instance, the yelp¹⁸ web application provides check boxes for multi-select facets and
simple links for facets with exclusive selection. Alternatives for allowing both modes in a facet would be
dedicated controls, or modifier keys (such as pressing shift while clicking). For the range selection navigation
mode, slider controls can allow the specification of upper and lower bounds on the result set. De-selection
should be as easy as concept selection. Additionally, if breadcrumbs or a similar filter summary, indicating
summaries of single or all facets, are present, these should include the option to clear individual filters as
well. Also, buttons for resetting single facets or all filter options can help users to zoom out quickly.
Figure 3.7: ContentLandscape Applies Collapsible Panel Pattern for Zooming
For flat facets, i.e. not featuring a hierarchical relation between the concepts, simple list widgets are
¹⁸ http://www.yelp.com
usually used. List sorting can either be alphabetical, or dynamically updated by the number of assigned
items in the current result set. For navigating hierarchies, a number of different presentation and navigation
options exist, which include: Explorer Tree (not very space efficient); Zoom and Replace, which replaces
the facet widget content with the level below (used in Flamenco¹⁹ [85]); Collapsible Panels, hierarchical
widgets based on the accordion pattern²⁰ (used in the ContentLandscape application [70], Figure 3.7);
and Continuous Zooming, where hierarchical facets are displayed as space-filling widgets, which allow a
fast traversal across all levels, while simultaneously maintaining context (used in the FacetZoom prototype
[12], Figure 3.8). The number of indexed items for each facet and zoom-point can be shown by numbers
(after the labels), bar charts, height of facets, colour, etc. Visgets [18] extends this principle by featuring a
number of visualizations. FaThumb [35] enables faceted search on mobile devices (Figure 3.9). The
filter area is grouped in nine zones, corresponding to the nine digit keys on mobile phones. The middle
zone serves as a spatial overview during navigation. The surrounding eight zones allow the user to select
hierarchy branches and repeatedly zoom in on subtrees. The left shortcut key adds the currently
selected concept to the query, while the right one allows the user to quickly jump back to the top.
Figure 3.8: FacetZoom Combines Ideas from Zoomable UIs With Faceted Search
Query searching can be done either over all results or within the current focus, as shown in Figure 3.10.
Moreover, in order to quickly locate zoom-points in a facet and avoid having to navigate large hierarchies
when the target concept is already known by name, direct access to facet items can be achieved
with a keyword search over the concept labels (/facet [30]). Since the number of available facets can be
very large, ways to reduce the screen space they use are discussed in [24], and include collapsible facet widgets (such
as those used by the Getty Images faceted navigation interface²¹) and expandable filter areas (i.e. a More button).
Furthermore, systems should be able to determine which facet-value pairs the interface should provide
¹⁹ Online demos available at http://flamenco.berkeley.edu/
²⁰ http://www.welie.com/patterns/showPattern.php?patternID=accordion
²¹ http://gettyimages.com
Figure 3.9: Faceted Search for Small Screens in the FaThumb Prototype
Figure 3.10: Flamenco Allows Choosing Between a Search Over All Results or Within Current Focus
to a user. Personalization allows the system to present the facet-value pairs that can help the user quickly
find the documents that he is most interested in. Existing approaches include content-based
personalization, where a recommendation system monitors the user's actions and pushes documents that match his user
profile; collaborative faceted search personalization, where the system recommends items to a user
by leveraging information from other users with similar tastes and preferences; and finally an ontological
approach, which uses the distance between values of an ontology, to measure the relevance to users [75].
Name                  Notation      Definition
terminology           $T$           a set of names, called terms (they may capture both categorical and numeric values)
subsumption           $\le$         a partial order (reflexive, transitive and antisymmetric)
taxonomy              $(T, \le)$    $T$ is a terminology, $\le$ a subsumption relation over $T$
broaders of t         $B^{+}(t)$    $\{\, t' \mid t < t' \,\}$
narrowers of t        $N^{+}(t)$    $\{\, t' \mid t' < t \,\}$
direct broaders of t  $B(t)$        minimal
Name                              Notation    Definition
focus                             $ctx$       any subset of $T$ such that $ctx = minimal(ctx)$
focus projection on a facet $i$   $ctx_i$     $ctx_i = ctx \cap T_i$

Kinds of zoom points w.r.t. a facet $i$ while being at $ctx$: Notation and Definition(s)
zoom points                   $AZ_i(ctx) = \{\, t \in T_i \mid \bar{I}(ctx) \cap \bar{I}(t) \neq \emptyset \,\}$
zoom-in points                $Z^{+}_i(ctx) = AZ_i(ctx) \cap N^{+}(ctx_i)$
immediate zoom-in points      $Z_i(ctx) = maximal(Z^{+}_i(ctx)) = AZ_i(ctx) \cap N(ctx_i)$
zoom-side points              $ZR^{+}_i(ctx) = AZ_i(ctx) \setminus \{ctx_i \cup N^{+}(ctx_i) \cup B^{+}(ctx_i)\}$
immediate zoom-side points    $ZR_i(ctx) = maximal(ZR^{+}_i(ctx))$

Restriction over an object set: Notation and Definition(s)
restricted object set         $A$: any subset of $Obj$
reduced interpretation        $I_A$: $I_A(t) = I(t) \cap A$
reduced terminology           $T_A = \{\, t \in T \mid \bar{I}_A(t) \neq \emptyset \,\} = \{\, t \in T \mid \bar{I}(t) \cap A \neq \emptyset \,\} = \bigcup_{o \in A} B^{+}(D_I(o))$
Table 3.2: Interaction Notions and Notations
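The zoom-point computation over a restricted object set can be sketched as follows. The representation is an assumption made for illustration: `parents` maps a term to its direct broaders, `I` maps a term to the objects directly indexed under it, and the deep interpretation of a term collects the objects indexed under the term or under any of its narrowers. The function names are hypothetical.

```python
def broaders(term, parents):
    """All strictly broader terms of `term`."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen

def deep_interpretation(I, parents):
    """Objects indexed under each term or under any of its narrower terms."""
    deep = {}
    for term, objs in I.items():
        for t in {term} | broaders(term, parents):
            deep.setdefault(t, set()).update(objs)
    return deep

def zoom_points(ctx, facet_terms, deep, A):
    """Terms of one facet that still lead to results, with their counts, restricted to A."""
    focus_objs = set(A)
    for t in ctx:
        # A focus is read as a conjunction of its terms.
        focus_objs &= deep.get(t, set())
    return {t: len(focus_objs & deep[t]) for t in facet_terms
            if t in deep and focus_objs & deep[t]}
```

Here `A` plays the role of the answer of a free text query: shrinking `A` shrinks the set of zoom points and their counts, and terms with no matching object disappear.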
Chapter 4
Visualization Models and Metaphors
In this chapter we will discuss five different visualization models. Initially, we will discuss MRPBM, which is
based on RPs, and then we will analyze ESCBM, which is based on the VSM ranking model and its spatial
characteristics. The next one is PFNET, which uses associative networks, and the fourth one is MDS,
a group of methods used to discover empirical relationships among investigated objects. Finally, we will
discuss SOM, a nonlinear topology-preserving projection method that converts a high-dimensional
space into a low-dimensional grid, as well as different visualization metaphors.
4.1 Multiple Reference Points Based Models (MRPBM)
MRPBM models are visualization algorithms that display the results of a search not in the classical linear
order, but by projecting them on a low-dimensional visual space. They can effectively handle complex
information needs by using multiple reference points (RPs). An RP, or Point of Interest (POI), is a search criterion against which
documents or surrogates are matched and search results are generated and presented to the users. In a
broad sense, an RP represents the users' information needs and any information related to them, from
user preferences and search history to query terms or browsed documents. Multiple RPs can form a
low-dimensional visual space and documents can be mapped onto the space, based upon their attraction to the
RPs.
Visualization models based on multiple RPs can be classified into three categories:
Fixed Multiple RPs Models
These models use multiple RPs with fixed positions, and can be used for both vector-based and
Boolean-based IR systems. The representative model is InfoCrystal [69]. In the Boolean context,
each RP is equivalent to a term or a Boolean sub-expression from a Boolean query. The visual
space is a polygon whose vertices are the RPs.
The side lengths of the polygon are equal so that the RPs are evenly configured in the visual space.
The retrieved results are displayed inside the polygon. The polygon is partitioned into N exclusive
tiers, represented as concentric rings, where N is the number of RPs. The first tier displays results
related to only one RP, the second tier results related to two RPs, etc. Figure 4.1 shows a fixed multiple
RPs model.
Figure 4.1: Display of 4 Reference Points in a Fixed Reference Point Environment
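The tier structure of a fixed multiple RPs model can be illustrated with a small grouping routine. This is a sketch under simplifying assumptions, not InfoCrystal's implementation: each RP is taken to be a single query term, a document is represented by its set of terms, and "related to an RP" means that the document contains that term.

```python
from collections import defaultdict

def infocrystal_tiers(docs, reference_points):
    """Group documents by the exact set of RPs they match; tier k holds sets of k RPs."""
    tiers = defaultdict(lambda: defaultdict(list))
    for doc_id, terms in docs.items():
        matched = frozenset(rp for rp in reference_points if rp in terms)
        if matched:
            # Documents matching exactly k RPs belong to the k-th tier.
            tiers[len(matched)][matched].append(doc_id)
    return tiers
```

With N reference points there are at most N tiers, and each document lands in exactly one of them, which is what makes the tiers exclusive.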
Movable Multiple RPs Models
These models use multiple RPs, which can be manipulated by the user, while semantic connections
of displayed objects are still maintained in the visual space. VIBE [55] and its variations, VR-VIBE [2]
and LyberWorld [28], are such models. The primary benefit of this approach is that the
user may arbitrarily place an RP in any interesting area, such as near another RP, document or cluster of
documents, and observe the impact of the RP on that area. According to the algorithm, the position
of a document is strongly related to the similarities between the document and a group of predefined
RPs. The positions of all related RPs in the visual space play a very important role in positioning
a projected document. In addition, taking into consideration the relevance between a document and
related RPs, the ultimate position of a document is calculated. Initially the first two related RPs are
selected in order to calculate the position of the document. The new position of the document serves
as an intermediate RP for further consideration, and the process continues until all related RPs are
considered. If the user adds, removes, or changes the position of any RP, the whole algorithm must be
executed again.
Figure 4.2 shows a snapshot of VIBE. In this example 5 RPs (circles) are used and documents are
represented as rectangles. Those documents that contain at least one of the descriptors indicated by
the user when initiating the search are considered relevant. The documents whose descriptors coincide
more with those of an RP are placed closer to that RP. The user can also expand the
icons of useful documents by simply drawing a box around the document or
documents of interest, and a list of the selection is shown. Clicking with the mouse
on any of the documents on the list will open another window with the complete document. One
characteristic that makes the system interactive is that the user may add, change or remove the RPs
on the screen. On carrying out any of these changes, the system automatically relaunches the search
query and re-orders the found documents to present the relationships between documents and those
between POIs.
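The iterative positioning described for movable RPs models can be sketched as follows. The text does not say how the intermediate RP is weighted, so this sketch assumes that it carries the sum of the relevance scores folded in so far and that the document is placed on the line between two points in proportion to their weights; the function name and data layout are likewise assumptions.

```python
def vibe_position(rp_positions, scores):
    """Place a document by folding in its related RPs one at a time.

    rp_positions: {rp_name: (x, y)}; scores: {rp_name: relevance to the document}.
    Returns None when the document is related to no RP.
    """
    related = [(rp_positions[rp], s) for rp, s in scores.items() if s > 0]
    if not related:
        return None
    (x, y), weight = related[0]
    for (rx, ry), s in related[1:]:
        # The current position acts as an intermediate RP that carries the
        # accumulated weight (assumption of this sketch).
        total = weight + s
        x = (x * weight + rx * s) / total
        y = (y * weight + ry * s) / total
        weight = total
    return (x, y)
```

Under these assumptions the result does not depend on the order in which the RPs are considered, and moving a single RP changes the position of every document related to it, which is why the whole computation is repeated after each manipulation.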
Automatic RPs Rotation Models
This model is a similarity ratio based model and was introduced with WebStar [93] to visualize link
structures. The uniqueness of this model is that it adds a new feature, the automatic rotation of RPs, to
the 2D visual space. The visual space is built on a polar coordinate system, where the origin of the
visual space is a central document (focus point), specified or selected by users, and RPs are evenly
distributed on a sphere with the focus point as center. All of the relevant documents are scattered
within the visual space based on their projection angle (which is similarity based) and distance (which
is not). When the user selects an RP, it automatically rotates around the sphere. As a consequence, related
documents are attracted and also rotated.
Figure 4.3 shows the WebStar system. The central document (focus point) is denoted by a blue
square at the center of the circle, while the four RPs (sport, research, international, library)
are represented by yellow squares, evenly distributed outside the circle. Documents are the
pink squares scattered inside the circle. In this example the user has selected the international RP,
coloured in red, which is rotated around the circle. Notice how documents change position as the RP
rotates.
Both the models for fixed and movable multiple RPs require at least three RPs to project documents in
their visual spaces, while the model for automatic RP rotation requires at least one RP in conjunction
with the focus point. Furthermore, visualization models for multiple RPs can be 2D or 3D and can be
applied to either Boolean or vector based information systems. The position of any RP can be controlled
and manipulated by users at will. This flexibility enables users to compare and
analyze the impact of two reference points on documents, and to identify good and poor discriminative terms.
Such models can be used to visualize Internet hyperlinks, search results from an information retrieval
system, full texts, and term discrimination analysis.
Figure 4.2: VIBE Using 5 Reference Points
4.2 Euclidean Spatial Characteristic Based Model (ESCBM)
These visualization models are based on the VSM and its spatial characteristics. The basic Euclidean
spatial elements such as point, distance, and angle may have a special connection to information retrieval in
the context of a vector-based space. For instance, a document or RP in a vector based space corresponds
to a point in the Euclidean space, and the Euclidean distance between documents and RPs can be used
as an indicator of their similarity. The visual spaces of these models are 2D and are constructed from
two RPs that serve as view points, one major (KVP) and one minor (AVP). These RPs, the reference
axis that they form, and the distance between them are all selected by the user and affect the
placement of the relevant documents.
Figure 4.3: WebStar Using 4 RPs. Snapshots During a Full Rotation of the international Reference Point
The projection conversion equation for an IR evaluation model is crucial for visually displaying it in
the visual space. The complexity of a conversion equation depends upon multiple factors such as the
definition of the visual space and nature of the retrieval evaluation model. Some equations are simple and
straightforward while others may be complicated. The significance of visualizing an IR evaluation model
is not only to make the invisible internal retrieval process transparent to users but also to allow them to
manipulate the model in the visual space at will.
In this context, three visualization models have been proposed.
Distance-angle Based Model
In this model a visual projection distance and a visual projection angle are defined for any document Di.
The projection distance is the distance from the document Di to the KVP, and the projection angle is the
angle formed by the lines KVP-Di and KVP-AVP, both in the vector space. The valid display area of
this model is a half-infinite plank, where the X-axis and Y-axis are defined as the visual projection
angle and distance respectively. The width of the X-axis is always equal to π and the width of the Y-axis is
infinite. KVP is always mapped onto the origin of the visual space, because its visual projection distance
is 0 and its angle is defined as 0. AVP is mapped onto the Y-axis, because its visual
projection distance is the distance between the two reference points and its visual
projection angle is defined as 0. The distance between the two reference points does not affect this
model. DARE [90] is such a model, and
Figure 4.4 shows the display of the projected cosine model using DARE. The angle α is the retrieval
threshold, while R2 is AVP. D1 is a document situated within the retrieval area defined by the angle
α, and D2 is any document located on one boundary of the angle α. Users may drag the vertical
retrieval line to any place within the valid display area to increase or decrease the retrieval area.
Figure 4.4: Display of the Projected Cosine Model, in Distance-Angle DARE Model
Angle-angle Based Model
In this model two visual projection angles are defined for any document Di. The first angle (α) is
the angle formed by the lines KVP-Di and KVP-AVP, and the second one (β) is the angle formed
by the lines AVP-Di and KVP-AVP, both of them in the vector space. The two angles α and β
are assigned to the X-axis and Y-axis. The minimum and maximum values of the two angles
α and β are 0 and π respectively. The valid display area is a triangle and the two reference points
are projected at (π/2, 0) and (0, π/2) respectively. This model again is not affected by the distance
between the two RPs.
TOFIR [91] is an example of such a model, shown in Figure 4.5. The angle α is the retrieval threshold,
the origin of the vector space is KVP, while R2 is AVP. In the figure O is the projected origin of
the vector space. The horizontal line defines the retrieval area and can be manipulated by the users.
Figure 4.5: Display of the Projected Cosine Model, in the Angle-Angle TOFIR Model
Distance-distance Based Model
In this model two visual projection distances are defined for any document Di. The first is
the distance from the document Di to the KVP, and the second is the distance from the document
Di to the AVP, both of them in the vector space. The two projection distances are assigned to the
X-axis and Y-axis. The valid display area is a half-infinite plank that forms a π/4 angle with the
X-axis and the Y-axis; its two corners lie on the X-axis and Y-axis respectively, and its width is dynamic and
determined by the distance between the two RPs. GUIDO [54] is such a model. Figure 4.6 shows the
distance model in GUIDO.
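The three projections above can be sketched in a few lines. This is only an illustration under stated assumptions: documents and RPs are given as plain coordinate lists in the vector space, the function names are hypothetical and are not taken from DARE, TOFIR, or GUIDO, and the placement of a document that coincides with an RP is handled only where the text defines it (the distance-angle case for KVP).

```python
import math

def _sub(a, b):
    return [x - y for x, y in zip(a, b)]

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def _angle_at(vertex, p, q):
    """Angle in radians at `vertex` between the lines vertex-p and vertex-q."""
    u, w = _sub(p, vertex), _sub(q, vertex)
    nu, nw = _norm(u), _norm(w)
    if nu == 0 or nw == 0:
        raise ValueError("angle is undefined when a point coincides with the vertex")
    cos = sum(a * b for a, b in zip(u, w)) / (nu * nw)
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding error

def distance_angle(doc, kvp, avp):
    """(X, Y) = (projection angle, projection distance), both relative to KVP."""
    dist = _norm(_sub(doc, kvp))
    if dist == 0:
        return (0.0, 0.0)  # the text defines the angle of KVP itself as 0
    return (_angle_at(kvp, doc, avp), dist)

def angle_angle(doc, kvp, avp):
    """(X, Y) = (angle at KVP, angle at AVP); a document lying on an RP is not handled."""
    return (_angle_at(kvp, doc, avp), _angle_at(avp, doc, kvp))

def distance_distance(doc, kvp, avp):
    """(X, Y) = (distance to KVP, distance to AVP)."""
    return (_norm(_sub(doc, kvp)), _norm(_sub(doc, avp)))
```

For KVP at (0, 0), AVP at (4, 0), and a document at (0, 3), `distance_distance` returns (3.0, 5.0) and `distance_angle` returns an angle of π/2 with distance 3.0. Moving either RP changes every projected position, which is what allows users to reshape the display.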
One of the distinguishing characteristics of these visualization models is their capacity to visualize
traditional IR evaluation models in addition to visualizing relationships among documents. Document
distributions in these visual spaces change accordingly when the RPs change. This implies that the displayed
document configurations in the visual spaces can be customized based upon users' dynamic information
needs.
Figure 4.6: Display of the Projected Distance Model, in the Distance-Distance GUIDO Model
4.3 Pathfinder Associative Network (PFNET)
The Pathfinder associative network (PFNET) is a structural and procedural modeling technique that extracts
underlying connection patterns in proximity data and represents them spatially in a class of networks
[8]. The power of the Pathfinder associative network is its ability to discard insignificant links in the
original network while preserving the salient semantic structure of the network. The simplified network
still maintains the proximity connections and fundamental characteristics of the original network.
The main idea of the Pathfinder associative network is to discard the redundant paths and keep the
significant ones in a network. PFNET uses the triangle inequality to identify the paths with the lowest weights
in the network and eliminate redundant ones, making the network more economical. Figure 4.7 displays the
original network and the final PFNET network. Moreover, the principle of the triangle inequality can be
extended to an abstract space. In that case, connection proximity between two points may be measured
in other forms such as invisible semantic similarity between two objects rather than distance.
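As a rough illustration of the pruning idea, the sketch below removes every link for which some alternative path is "shorter". It assumes one particular reading that the text does not specify: link weights are distances (lower means closer), the length of a path is the largest weight along it, and paths of any length are considered. The function name `pathfinder` is hypothetical.

```python
import math

def pathfinder(weights):
    """Return the set of links (i, j), i < j, that survive pruning.

    weights: symmetric matrix of link distances, with math.inf where two
    nodes are not linked. Assumption: a path is as long as its heaviest
    link, and paths with any number of links are allowed.
    """
    n = len(weights)
    best = [row[:] for row in weights]
    # Floyd-Warshall style pass computing the minimal path length between all pairs.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(best[i][k], best[k][j])
                if via_k < best[i][j]:
                    best[i][j] = via_k
    # A link is kept only if no alternative path beats it (triangle inequality holds).
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if weights[i][j] < math.inf and weights[i][j] <= best[i][j]}
```

For three nodes with distances 1 (nodes 0 and 1), 2 (nodes 1 and 2), and 5 (nodes 0 and 2), the direct link between 0 and 2 is dropped because the path through node 1 is shorter, leaving {(0, 1), (1, 2)}. The triple loop also hints at why the computational cost grows quickly with the number of nodes.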
Application of a PFNET to a domain problem requires identifying two basic elements: the first is the
objects that are used as nodes in the network, and the second is the proximity relationship between
two objects, which is used to form a link between them. Proximity can be obtained either by
human judgment or by automatic computation. Different objects and proximity methods
can lead to different Pathfinder associative networks.
Figure 4.7: Display of Original Network (left) and Final PFNET Network (right)
The Pathfinder network technique is very effective and efficient for displaying complex relationships
among objects, such as sophisticated semantic networks. As an IV technique, it can be applied to a wide
spectrum of IR environments, ranging from information searches [7, 20], author co-citation analysis¹ [80],
and term co-occurrence analysis² [16] to Internet information representation [6].
Specifically for query searching, after a query is submitted to the network, the relevance between the
query and a document is calculated using the Pearson correlation coefficient, and the relevance is indicated
by the height of a spike rising from the document [7]. In another approach [20, 16], the query and the
document are each converted into a Pathfinder associative network, and the similarity between the query and
the document is the similarity between the two networks. The proximity algorithm consists of
two parts. The first part is defined as the ratio of the number of terms common to the query and the
document to the number of all terms in the query. The second part measures the structural similarity between the
query network and the document network. The value of this part increases when nodes (terms) connected
in the query network also appear closely connected in the document network. Finally, the two parts are
weighted and combined into a final similarity value.
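The two-part similarity can be sketched as below. Only the first part follows the definition in the text directly. The structural part is a deliberately simple stand-in (the share of query links that are also direct links in the document network), and the weight `alpha` and all function names are hypothetical; the cited systems may combine the parts differently.

```python
def term_overlap(query_terms, doc_terms):
    """Part 1: common terms divided by the number of terms in the query."""
    q = set(query_terms)
    return len(q & set(doc_terms)) / len(q) if q else 0.0

def structure_overlap(query_links, doc_links):
    """Part 2 (assumed form): share of query links that are also document links."""
    q = {frozenset(link) for link in query_links}
    d = {frozenset(link) for link in doc_links}
    return len(q & d) / len(q) if q else 0.0

def network_similarity(query_terms, query_links, doc_terms, doc_links, alpha=0.5):
    """Weighted combination of the two parts; alpha is an illustrative weight."""
    return (alpha * term_overlap(query_terms, doc_terms)
            + (1 - alpha) * structure_overlap(query_links, doc_links))
```

Links are treated as undirected pairs of terms, so ("a", "b") and ("b", "a") count as the same link. A richer structural measure could also reward terms that are connected through short paths rather than only through direct links.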
The weaknesses of the Pathfinder associative network include its computational complexity, which may
prevent PFNET from visualizing a large dataset and from dynamically updating a network in response to
interactions between users and the network. Another disadvantage of PFNETs in the present state of development
is that people have no way of knowing the features upon which similarity judgments are made, so
the semantic content of links is not easily discernible. Moreover, PFNET cannot generate a local visual
configuration based on users' individual information needs; it only produces a global overview of a
data collection.
¹ Phenomenon occurring when the authors of two different papers both co-cite the same paper(s) in their work.
² Keywords appearing together within a predefined length of text in the same document.
4.4 Multidimensional Scaling Models (MDS)
The MDS technique consists of a group of methods used to discover empirical relationships among the
investigated objects by visualizing them, presenting their geometric representation in a low dimensional
display space. It can be used to reveal and illustrate hidden patterns in a set of proximity measures
among objects for multivariate, exploratory, and visual data analysis. An MDS algorithm starts with a
matrix of item-item similarities, and then assigns a location to each item in N-dimensional space (N is
specified a priori), where users may perceive and analyze the relationships among the displayed objects.
For sufficiently small N, the resulting locations may be displayed in a graph or 3D visualisation. The more
similar two objects are, the closer to each other they are placed, and vice versa.
One advantage of the MDS technique is the diversity of its algorithms, each of which handles
different situations. They can be classified into metric and non-metric MDS algorithms, based upon the
type of input proximity data. The non-metric MDS algorithm is applied to qualitative³ proximity data,
while metric MDS is applied to quantitative⁴ proximity data. Another category of MDS technique is the
classical MDS algorithm, which is used with quantitative proximity data.
Applications of MDS in IR can be roughly categorized into two groups, based on the proximity definition:
one uses a co-citation method to define the proximity metric, and the other uses a non-co-citation
method such as traditional distance-based or angle-based similarity measures. However, applying
traditional MDS to a very large data set may be prohibitively slow, since it uses a linear algebra solution
to the problem, which is computationally costly and makes heavy demands on storage. On the other
hand, the non-metric (metric) MDS method looks for the best match between the original proximity of
two objects and their Euclidean distance in a low dimensional space, using an iterative process that starts with
a random initial configuration. The Kruskal algorithm, which is used for the minimization, is iterative and
simple, and its computational complexity is in practice almost O(N).
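The iterative idea can be illustrated with a small sketch. It is a simplified metric variant written for this text, not Kruskal's full procedure: it compares display distances with the target dissimilarities directly (no monotone regression step), uses a stress measure normalised by the display distances, and accepts a move only when the stress drops. The function names, step size, and number of rounds are illustrative assumptions.

```python
import math

def distances(points):
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def stress(points, target):
    """Mismatch between display distances and target dissimilarities (0 = perfect)."""
    d = distances(points)
    n = len(points)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    num = sum((d[i][j] - target[i][j]) ** 2 for i, j in pairs)
    den = sum(d[i][j] ** 2 for i, j in pairs)
    if den == 0:
        return 0.0 if num == 0 else math.inf
    return math.sqrt(num / den)

def improve(points, target, step=0.1, rounds=50):
    """Iteratively nudge the configuration, keeping a move only if stress drops."""
    points = [list(p) for p in points]
    current = stress(points, target)
    for _ in range(rounds):
        d = distances(points)
        trial = [p[:] for p in points]
        for i in range(len(points)):
            for j in range(len(points)):
                if i != j and d[i][j] > 0:
                    # Move i towards j if they are too far apart, away if too close.
                    pull = step * (d[i][j] - target[i][j]) / d[i][j]
                    for k in range(len(points[i])):
                        trial[i][k] += pull * (points[j][k] - points[i][k])
        new = stress(trial, target)
        if new < current:
            points, current = trial, new
        else:
            step /= 2  # the move overshot, so retry with a smaller step
    return points, current
```

A configuration that reproduces the target dissimilarities exactly has stress 0. In practice the starting configuration would be random, as described above; a fixed one is easier to reason about when testing.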
Furthermore, the huge number of displayed objects in a low dimensional space raises concerns about
efficient system implementation and information representation in the MDS display space of interactive
systems. To solve this problem, the supernode method [68] visualizes object clusters and
objects at different levels. In the MDS visual display space, documents are clustered first
so that documents that are highly related in terms of co-citation form new supernodes. Instead
of individual documents, the system displays these supernodes. Documents within a supernode can be
visualized at a lower level if users zoom in on a selected cluster.
³ Qualitative proximity data refers to ordinal data.
⁴ Quantitative proximity data refers to ratio-scaled data.
Figure 4.8: Display of ThemeScape and Galaxy Visualizations of IN-SPIRE Visualization Program
Another potential problem is the intuitive representation of projected objects in a low dimensional MDS
space. It is extremely important for users to easily understand and meaningfully interpret the graphic
presentation. Towards that aim, the MDS approach was combined with the so-called ecological approach, in
order to take advantage of natural display formats that humans are used to [83]. The ecological landscape is
an MDS display space that consists of a group of ecologically connected local landscapes. Each landscape
represents an object cluster. The size of each local landscape is related to the number of documents
containing the thematic term that defines the local landscape. A document is positioned based upon its
indexing terms, the thematic term, and the category assigned to the document. Figure 4.8 shows the
ThemeScape and Galaxy visualizations using the IN-SPIRE software.
4.5 Self-organizing Map Model (SOM)
The SOM, a neural network, is a nonlinear topology-preserving projection method that converts a high
dimensional space into a low dimensional (1D, 2D, or 3D) grid (feature map), as shown in Figure 4.9. Three
spaces are involved in a SOM: the high dimensional document vector space (associated with the
objects), the high dimensional weight vector space (associated with the nodes of the display grid), and the
low dimensional visual space (the display grid). During the learning process, input vectors are picked at random
and each is assigned to the closest neuron, that is, the one whose weight vector is most similar to it. After the
training process, the documents are projected onto the feature map and labels are assigned to the feature
map areas (labeling is most often weight-based).
Figure 4.9: A SOM Feature Map
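The learning and projection steps can be sketched as follows. The text only states that each input vector is assigned to its closest neuron; the weight update used here (pulling the winning neuron and its grid neighbours towards the input, with a rate that decays over time) is the standard SOM rule and is filled in as an assumption. The function names, grid size, learning rate, and radius are illustrative.

```python
import math
import random

def best_matching_unit(weights, vec):
    """Grid node whose weight vector is closest to the input vector."""
    return min(weights, key=lambda node: math.dist(weights[node], vec))

def train_som(data, rows, cols, epochs=20, rate=0.5, radius=1.0, seed=0):
    """Train a rows x cols feature map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs  # learning slows down over time
        for vec in rng.sample(data, len(data)):  # inputs in random order
            bmu = best_matching_unit(weights, vec)
            for node, w in weights.items():
                grid_dist = math.dist(node, bmu)
                if grid_dist <= radius:
                    # Neighbours on the grid move less than the winner itself.
                    h = rate * decay * math.exp(-grid_dist ** 2 / (2 * radius ** 2))
                    for k in range(dim):
                        w[k] += h * (vec[k] - w[k])
    return weights

def project(weights, data):
    """Map each document vector onto its grid node in the feature map."""
    return [best_matching_unit(weights, v) for v in data]
```

The three spaces named above appear directly in the sketch: `data` holds the document vectors, the values of `weights` are the weight vectors, and the keys of `weights` are positions on the display grid. Labeling of map areas is not covered here.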
Each partitioned area in the map clearly represents a concept(s) and documents associated with the
concepts. The size of each area in the map indicates term occurrence frequencies or the possible size of the
projected documents. After term labeling processing, semantically related areas are also