General Graduate Exams

  • 8/23/2019 General Graduate Exams

    1/72

    General Graduate Exams

    Exploration and Visualization of Information in Search Engines

    by

    Panagiotis Papadakos

    Presented to Graduate Studies Committee of

    the Computer Science Department of

    the University of Crete

    Heraklion, May 2009

    Contents

    Table of Contents

    List of Figures

    List of Tables

    1 Motivation

    2 Interaction Paradigms and Visualization in Information Retrieval (IR)

    2.1 Information Retrieval

    2.2 Information Space and User Information Needs

    2.2.1 Micro and Macro Level of Information

    2.2.2 Information Space

    2.2.3 User Information Needs

    2.3 Interaction Paradigms

    2.3.1 Query Searching vs Browsing

    2.3.2 Differences

    2.3.3 Three Paradigms

    2.4 Visualization

    2.4.1 Definition

    2.4.2 Scientific and Information Visualization

    3 Interaction Paradigms and Related Techniques

    3.1 Results Clustering

    3.1.1 Clustering Requirements

    3.1.2 Clustering Algorithms Classification

    Hierarchical and Non-Hierarchical Approaches

    Document-based and Snippet-based Approaches

    3.1.3 Cluster Presentation & User Interaction

    3.2 Facets and Dynamic Taxonomies

    3.2.1 Introduction

    3.2.2 Taxonomy Design

    3.2.3 Framework

    3.2.4 User Interface Design

    4 Visualization Models and Metaphors

    4.1 Multiple Reference Points Based Models (MRPBM)

    4.2 Euclidean Spatial Characteristic Based Model (ESCBM)

    4.3 Pathfinder Associative Network (PFNET)

    4.4 Multidimensional Scaling Models (MDS)

    4.5 Self-organizing Map Model (SOM)

    4.6 Metaphors

    4.6.1 Metaphors for the Semantic Framework Presentation

    4.6.2 Metaphors for Information Retrieval Interaction

    5 Vision and Research Methodology

    5.1 Challenges

    5.2 Vision

    5.3 Proposed Methodology

    5.3.1 Information Visualization Framework

    5.3.2 Metrics for Exploratory Search

    5.3.3 Interaction Models for Exploratory Search

    5.3.4 Exploratory Search and Visualization

    5.3.5 Implementation

    5.3.6 Evaluation and Improvements

    5.4 Work Done

    5.4.1 ODBMS Index Representations

    5.4.2 FleXplorer, A Framework for Providing Faceted and Dynamic Taxonomy-based Information Exploration

    5.4.3 Exploratory Web Searching with Dynamic Taxonomies and Results Clustering

    6 Conclusion

    Appendices

    A Acronyms

    References

    List of Figures

    3.1 Clusty, a Snippet-based Clustering Approach

    3.2 Quintura Word Cloud

    3.3 grokker Generates an Euler Diagram

    3.4 Top-200 Web Search Results Clustering Displayed Using Two-level TreeMaps

    3.5 Kartoo Generates a Thematic Map

    3.6 Example of a Materialized Faceted Taxonomy

    3.7 ContentLandscape Applies Collapsible Panel Pattern for Zooming

    3.8 FacetZoom Combines Ideas from Zoomable User Interfaces (UIs) With Faceted Search

    3.9 Faceted Search for Small Screens in the FaThumb Prototype

    3.10 Flamenco Allows Choosing Between a Search Over All Results or Within Current Focus

    4.1 Display of 4 Reference Points in a Fixed Reference Point Environment

    4.2 VIBE Using 5 Reference Points

    4.3 WebStar Using 4 Reference Points (RPs). Snapshots During a Full Rotation of international Reference Point

    4.4 Display of the Projected Cosine Model, in Distance-Angle DARE Model

    4.5 Display of the Projected Cosine Model, in the Angle-Angle TOFIR Model

    4.6 Display of the Projected Distance Model, in the Distance-Distance GUIDO Model

    4.7 Display of Original Network (left) and Final PFNET Network (right)

    4.8 Display of ThemeScape and Galaxy Visualizations of IN-SPIRE Visualization Program

    4.9 A SOM Feature Map

    4.10 A 3D Cone Tree (left) and a Basic Hyperbolic Tree (right)

    4.11 Perspective Wall (left) and ThemeRiver (right)

    4.12 DataLens, a 3D Pyramid Lens

    4.13 Gridl Prototype Displays Search Results Along Two Axes

    4.14 HotMaps, a 2D Visualization of How Query Terms Relate to Search Results

    List of Tables

    3.1 Basic Notions and Notations

    3.2 Interaction Notions and Notations

    Chapter 1

    Motivation

    The daily use of computers as tools of work, education, communication and entertainment produces a huge

    volume of data. As recent surveys state¹, the world produces between 1 and 2 exabytes (2^60 bytes) of

    unique information per year, 90% of which is digital, with a 50% annual growth rate. In addition, these

    new data are more complex and more dynamic. The interaction paradigm adopted by current IR systems

    and Web Search Engines (WSEs), a simple rectangular text box in which the user typically enters

    one or two terms and the system returns a ranked list of results, has proven very useful for finding specific

    information and is very simple and intuitive to use.

    However, such systems do not provide adequate support for information needs that have an exploratory

    nature and/or aim at decision making. User studies have shown that casual users usually inspect only the

    first page of results and do not exploit any of the query language operators (not even Boolean queries)

    that are offered. Instead, they issue very short queries which they reformulate in an iterative process based

    on the returned results [73, 62]. On the other hand, the powerful and expressive query languages that are

    usually offered for structured information (e.g. for the Semantic Web) are not fully utilized, in the sense

    that the formulation of queries is a laborious and difficult task for end users.

    In the highly demanding and growing information environment described above, new intuitive and

    more user-friendly UIs have to be created, providing effective and efficient services for retrieving and

    exploring the available information and supporting users in their various decision-making tasks and

    processes. The fields of IR, Information Visualization (IV) and Human Computer Interaction (HCI) have

    to collaborate in order to provide new intuitive and interactive UIs, where the information is presented,

    organized, and analyzed, giving the user the ability to recognize patterns and relations.

    ¹ http://www.sims.berkeley.edu/research/projects/how-much-info-2003/

    For example, to select a hotel or a product to buy, it is not enough to return the list of choices that

    satisfy user-provided criteria. The ranking of the available choices according to user-based (i.e. preference),

    or statistical-based criteria is also required. Furthermore, exploration services, that provide users with

    comprehensive summaries of the available choices which enable them to grasp quickly the information

    landscape and allow them to restrict their focus, and thus approach gradually the most desired choices,

    are required.

    For this reason, efforts for the exploitation of the above languages in models of exploration/navigation

    have started to emerge [56, 49, 30, 4]. In summary, the constantly increasing volume and requirements

    of our digital economy call for intuitive modes of interaction, involving flexible and efficient

    navigation, and advanced visualization.

    Computer Science Department University of Crete

    Chapter 2

    Interaction Paradigms and Visualization in IR

    2.1 Information Retrieval

    Information Retrieval (IR) is the domain focusing on searching, exploring and discovering information,

    either from organized textual and data repositories or from the World Wide Web (WWW), in order to

    satisfy users' information needs. However, since the information environment is constantly growing,

    another important aspect of IR systems is their ability to organize this information. This organization

    can facilitate the creation of innovative, more intuitive and user-friendly UIs, which will provide users

    with efficient ways of mapping, organizing and grouping available information. This can enable users

    to discover new patterns and relationships in the available information and satisfy their information

    needs faster and more accurately.

    2.2 Information Space and User Information Needs

    2.2.1 Micro and Macro Level of Information

    Information from an infobase can be divided into two levels. The first is the micro level, which refers to

    individual objects or documents, such as contents, snippets or full text. This is the direct and obvious

    information. The other level is the macro level which refers to aggregated information of objects or

    documents from the collection. This information is not direct but is generated from the individual collection

    of objects, and relies on the way the information is organized and presented. Such information can provide

    object connections, rhythms, trends, patterns and relationships, explaining information at the micro level.

    For the same data, the aggregate information at the macro level can vary with the organization method and

    presentation used. By navigating information at the macro level the user can obtain a better

    understanding of the provided collection and reach unexpected insights [92]. An IR system should provide

    access to both levels of information, through browsing and query searching.

    2.2.2 Information Space

    The information space can be conceived as an abstract and multidimensional space. Its structure is based

    on the semantic characteristics and relationships, derived from the organization of the collection data set,

    which enables users to explore and discover information from the data collection. An information space

    can be constituted by intrinsic attributes such as keywords, citations, hyperlinks, and authors or extrinsic

    structures like a subject directory, a thesaurus system, or an organized search result list. Combinations of

    intrinsic attributes and extrinsic structures can also form an information space. Since information does not

    itself constitute a space, to describe its spatial characteristics we have to define basic topological properties like

    distance, direction and angle. For instance, the distance between two objects can be the length of the shortest

    path between them in a hyperlink, citation or hierarchical structure, or the Euclidean distance in the Vector

    Space Model (VSM). Direction has a special meaning in hyperlink- and citation-based systems: if an object

    links to or cites another object, the first object points to the second. In a multidimensional vector-space based

    IR system, the angle between vectors is used by the retrieval model. Finally, the information space has to be reduced from N

    dimensions to 1, 2 or 3, in order to be perceived by humans, which can lead to user disorientation and

    ambiguity [92].
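
    The spatial properties named above can be made concrete with a minimal sketch. It assumes documents are
    represented as plain term-weight vectors and that links form a simple adjacency mapping; the function names
    and the toy data are illustrative assumptions, not part of any system discussed in this report.

```python
import math
from collections import deque

def euclidean_distance(a, b):
    """Straight-line distance between two term-weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def angle(a, b):
    """Angle in radians between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Clamp the cosine to [-1, 1] to guard against floating-point drift.
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def link_distance(links, source, target):
    """Length of the shortest directed link path, or None if unreachable."""
    queue, seen = deque([(source, 0)]), {source}
    while queue:
        node, hops = queue.popleft()
        if node == target:
            return hops
        for nxt in links.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None

# Hypothetical toy data: two documents over three terms, and a small link graph.
doc_a, doc_b = [1.0, 0.0, 2.0], [2.0, 0.0, 4.0]
links = {"p1": ["p2"], "p2": ["p3"], "p3": []}

print(euclidean_distance(doc_a, doc_b))   # the vectors differ in length...
print(angle(doc_a, doc_b))                # ...but point in (almost exactly) the same direction
print(link_distance(links, "p1", "p3"), link_distance(links, "p3", "p1"))
```

    The last line illustrates why direction matters in link-based spaces: p3 is two hops from p1, but p1 cannot
    be reached from p3 at all.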

    2.2.3 User Information Needs

    User studies have shown that almost 60% of search tasks are exploratory [62]. The user does not know

    his information need precisely and provides only 2-5 words, and focalized search very commonly leads to

    inadequate interactions and poor results. Unfortunately, the available user interfaces (UIs) do not aid the

    user in formulating his query. Furthermore, such systems do not provide adequate support for information

    needs that have an exploratory nature and/or aim at decision making. The answers returned are simple

    ranked lists of results, with no organization and no information on the macro level of the infobase. Casual

    users usually inspect only the first page of results and do not exploit any of the query language

    operators (not even Boolean queries) that are offered. Instead, they issue very short queries which they

    reformulate in an iterative process based on the returned results [73, 62]. On the other hand, the powerful

    and expressive query languages that are usually offered for structured information (e.g. for the Semantic

    Web) are not fully utilized, in the sense that the formulation of queries is a laborious and difficult task.

    2.3 Interaction Paradigms

    2.3.1 Query Searching vs Browsing

    In IR two paradigms are widely recognized: the first is query searching and the second is browsing. Query

    searching is the paradigm where the user tries to describe his information needs with a group of relevant

    and important terms. The query is then analyzed by the IR engine and a list of related documents (based

    on the ranking model used) is returned. Most IR engines also display snippets of relevant parts of the

    returned documents. Because queries are matched on terms, for ambiguous words, where each word can have

    many meanings, the system might return non-relevant results, which the user might perceive as a search failure.
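
    The effect of ambiguity can be illustrated with a minimal sketch. It assumes a toy ranking model that scores
    each document by how often the query terms occur in it; the `rank` function and the three sample documents
    are hypothetical, and real engines use far richer ranking models.

```python
from collections import Counter

def rank(query, documents):
    """Return (doc_id, score) pairs sorted by descending term-frequency score."""
    terms = query.lower().split()
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        score = sum(counts[t] for t in terms)
        if score > 0:  # documents sharing no term with the query are not returned
            scored.append((doc_id, score))
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical collection in which "jaguar" is ambiguous (animal vs. car brand).
documents = {
    "d1": "the jaguar is a large cat native to the americas",
    "d2": "jaguar unveiled a new jaguar sports car",
    "d3": "python is a programming language",
}

print(rank("jaguar", documents))
```

    A user interested in the animal sees the car document ranked first, because lexical matching cannot tell
    the two senses of the query term apart.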

    On the other hand, browsing refers to UIs which allow the user to view, search and scan either the

    whole information or part of it. This enables the user to explore and discover information, along with data

    relationships and patterns. A UI for browsing should provide smooth and structured browsing. Methods

    for information browsing include hyperlink and hierarchical structures. However, huge volumes of data

    require the appropriate usage of automatic data analysis techniques, prior to visualization. According to

    [81], browsing is useful when (a) there is good underlying structure, so items close to one another are

    similar, (b) users are unfamiliar with the contents of the collection, (c) users have a limited understanding

    of the organization of a system and prefer a less cognitively demanding method of exploration, (d) it is difficult

    to verbalize the underlying information need and (e) the information is easier to recognize than to describe.

    2.3.2 Differences

    According to [92], the differences between query searching and browsing include:

    Judgment of Relevance

    Query searching is based on keyword matching of query terms and surrogates of documents in a

    database, at a lexical level. On the other hand, the relevance judgement of browsing is completed by

    users and it is a concept matching process.

    Continuity

    The retrieval process is continuous for browsing, while it is discrete for query searching.

    Selecting a browsing path, examining a context, and judging relevance are continuous and

    controlled by the user during browsing, while after executing a query, the internal query process and

    ranking of the results are a black box for the users.

    Cost in Time and Effort

    Browsing is a time- and effort-consuming activity, since the user must remember the browsing path,

    search the contents and make decisions, while query searching involves only term selection and query

    formulation.

    Information Seeking Behavior

    Browsing is a system-based seeking behavior (i.e. based on what the system can offer), while query searching

    is a seeking behavior based on what the user wants.

    Iteration

    Browsing is completed by a series of iterative acts, like getting an overview of available information,

    fixing on a target and examining it more closely, and then moving on and starting the cycle again.

    Query searching on the other hand requires the definition of the query terms, the formulation of the

    query and examination of the results. Query searching might also be iterative, since the results might

    not fulfill the information needs of the user.

    Granularity

    Using browsing the user can evaluate one relevant item at a time, while query search provides a group

    of retrieved documents.

    Clarity of Information Need

    When a user starts an information seeking process, he might not have defined a clear information

    need. In such a case, browsing is more appropriate, since it does not require a definite target, while

    query searching requires a relatively well-conceived information need, for which keywords can be

    chosen and a query can be formulated.

    Interactivity

    Browsing is an interactive process by nature, which makes it more complicated and challenging, while

    query searching has fewer steps and less interaction.

    Retrieval Results

    Results of browsing are richer and more diverse, since they can lead to a wide range of retrieval

    results (i.e. from contextual information, structural information, relational information to individual

    objects), while query searching only retrieves a ranked list of documents.

    2.3.3 Three Paradigms

    Although query searching and browsing are different ways to seek information, they can be synthesized.

    There are three basic paradigms:

    Querying and Browsing (QB) In this paradigm, an initial query is submitted to the system to

    restrict the infobase. Then the results are visualized in a visualization environment and browsed by

    users.

    Browsing and Querying (BQ) Information at the macro level is presented and browsed, and then

    information at the micro level is searched and highlighted in the visualization contexts.

    Browsing Only (BO) Information at the macro level is displayed and browsed. It does not integrate

    any query searching components.

    Query searching alone is not categorized as one of these paradigms, because it is the traditional IR paradigm and

    does not require a visual space.

    2.4 Visualization

    2.4.1 Definition

    According to [51], visualization is a method of computing, which transforms the symbolic into the geometric,

    enables researchers to observe their simulations and computations, offers a method for seeing the unseen,

    enriches the process of scientific discovery, and fosters profound and unexpected insights. Visualization

    is the process of transforming data, information, and knowledge into graphic presentations to support

    tasks such as data analysis, information exploration, information explanation, trend prediction, pattern

    detection, rhythm discovery, and so on. Without the assistance of visualization, people perceive and

    comprehend less of the data, information, or knowledge, for a variety of reasons. Such reasons may

    include the limitations of human vision, or the invisibility and abstractness of the data, information and

    knowledge. Visualization requires certain methods or algorithms to convert raw data into a meaningful,

    interpretable, and displayable form to visually convey information to users.

    2.4.2 Scientific and Information Visualization

    Visualization can be classified into two categories: scientific visualization and information visualization.

    Scientific visualization is used most of the times to show things that are either too fast or too slow for the

    eye to perceive, or structures much smaller or larger than human scale, or phenomena that people

    cannot directly see, like X-ray or infrared radiation [52]. Examples include shapes of molecules, missile

    tracking, astrophysics, fluid dynamics, medical images, etc.

    On the other hand, information visualization is generally used to view abstract information. Examples

    include visual reasoning, visual data modeling, visual programming, information retrieval visualization,

    visualization of program execution, visual languages, spatial reasoning, and visualization of systems [82].

    Although their fundamental design principles, implementation means, and issues are common, information

    visualization, contrary to scientific visualization, does not have an inherent spatial structure or geometry

    of data to display. Instead, a spatial structure or framework for semantic relationships

    among data must be created. Finding or defining a spatial structure for information visualization is

    challenging because data in an information space may be multifaceted and of diverse nature, and

    relationships among data are interwoven and complicated. Defining such a spatial structure

    is a complicated and creative process. Salient and displayable attributes of

    objects must be extracted, a semantic framework for displayable objects must be established, information

    must be organized, and objects must be projected onto the structure, in such a way that the user will be

    able to search and find objects and object relationships [92].

    Chapter 3

    Interaction Paradigms and Related

    Techniques

    In this chapter we discuss the interaction paradigms of Results Clustering and of Faceted and Dynamic

    Taxonomies, in that order.

    3.1 Results Clustering

    Results clustering is a data analysis method that organizes a dataset into categorical groups

    (clusters), based on certain data association criteria. Different similarity measures can result in different

    clusterings. Items or objects within the same group/cluster are more similar to each other than to items in

    other groups/clusters. Clustering is considered an unsupervised learning process because it can

    automatically reveal intrinsic categorical patterns from a dataset. The categories produced by a clustering algorithm

    rely on the nature of the dataset, the association criteria of the clustering, and the distribution of data items in the

    dataset.

    The advantage of clustering is that it can be easily applied to any collection, revealing interesting

    and unexpected associations and trends. Disadvantages of clustering are the lack of predictability, the

    conflation of many dimensions simultaneously, the difficulty of labeling groups and the counterintuitiveness

    of cluster hierarchies [25].

    3.1.1 Clustering Requirements

    Results clustering algorithms should satisfy several requirements. First of all, the generated clusters should

    be characterized by high intra-cluster similarity. Moreover, results clustering algorithms should be

    efficient and scalable since clustering is an online task and the size of the retrieved document set can vary.

    Usually only the top C documents are clustered in order to increase performance. In addition, the

    presentation of each cluster should be concise and accurate, to allow users to detect what they need quickly.

    Cluster labeling is the task of deriving readable and meaningful (single-word or multiple-word) names for

    clusters, in order to help the user to recognize the clusters/topics he is interested in. Such labels must

    be predictive, descriptive, concise and syntactically correct. Finally, it should be possible to provide high

    quality clusters based on small document snippets rather than the whole documents.

    3.1.2 Clustering Algorithms Classification

    We can categorize the clustering algorithms, using two different classification schemes, based on either the

    structure of the clusters or the infobase that these algorithms are applied to.

    Hierarchical and Non-Hierarchical Approaches

    The first scheme classifies clustering algorithms as either non-hierarchical (partitioning

    clustering algorithms) or hierarchical [61]. The major difference between these two clustering types is

    that the latter generate a hierarchy of clustered items while the former partition the items into a single-level

    structure.

    Non-Hierarchical Approaches

    Algorithms of this kind partition N items into K clusters (K must be predefined).

    One of the most popular non-hierarchical algorithms is K-means [48], together with its variants [36, 11].

    It is based on a simple iterative scheme for finding a locally optimal solution. The algorithm

    starts with a guess about the solution, and then readjusts the cluster centroids until reaching a local

    optimum. A centroid is a special, artificially created item in a cluster which is used to represent that

    cluster for various purposes. It is defined as the average of the coordinates of all items in the cluster it

    represents. A cluster membership function is a method for judging whether or not an item is assigned

    to a cluster during the clustering process. The main advantages of this algorithm are its simplicity

    and speed, which allow it to run on large datasets. Its disadvantage is that it does not yield the same

    result with each run, since the resulting clusters depend on the initial random assignments.
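
    The iterative scheme can be sketched as follows. This is a minimal illustration, assuming points are
    numeric tuples, squared Euclidean distance as the membership criterion, and initial centroids drawn at
    random from the input; the function names and the `seed` parameter are illustrative choices, not taken
    from [48].

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    """Centroid: the coordinate-wise average of the items in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def kmeans(points, k, max_iterations=100, seed=None):
    rng = random.Random(seed)
    # Initial guess: k input points chosen at random.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iterations):
        # Membership function: each item joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Readjust the centroids; an empty cluster keeps its previous centroid.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:  # nothing moved: a local optimum has been reached
            break
        centroids = updated
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2, seed=0)
print(sorted(centroids))
```

    Fixing `seed` makes a run repeatable; leaving it unset exposes the dependence on the random initial
    guess described above.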

    Another non-hierarchical algorithm is Fuzzy c-means [32]. In fuzzy clustering, each point has a

    degree of belonging to each cluster, as in fuzzy logic, rather than belonging completely to just one cluster.

    Thus, points on the edge of a cluster may be in the cluster to a lesser degree than points in the

    center of the cluster. The algorithm minimizes intra-cluster variance as well, but has the same problems

    as k-means: the minimum is a local minimum, and the results depend on the initial choice of weights.
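
    The notion of a degree of belonging can be sketched with the commonly used fuzzy c-means update rules.
    The sketch assumes a fuzzifier m = 2, Euclidean distance, and a fixed number of iterations instead of a
    convergence test; the function name and parameters are illustrative and not taken from [32].

```python
import math
import random

def fuzzy_c_means(points, c, m=2.0, iterations=50, seed=None):
    rng = random.Random(seed)
    n, dims = len(points), len(points[0])
    # Initial choice of weights: random degrees, normalised to sum to 1 per point.
    weights = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        weights.append([w / total for w in row])
    centroids = []
    for _ in range(iterations):
        # Each centroid is the mean of all points, weighted by membership ** m.
        centroids = []
        for i in range(c):
            wm = [weights[j][i] ** m for j in range(n)]
            centroids.append(tuple(
                sum(w * p[d] for w, p in zip(wm, points)) / sum(wm)
                for d in range(dims)))
        # Membership update: the closer a centroid, the higher the degree of belonging.
        for j, p in enumerate(points):
            dists = [max(math.dist(p, cen), 1e-12) for cen in centroids]
            weights[j] = [
                1.0 / sum((dists[i] / dk) ** (2.0 / (m - 1.0)) for dk in dists)
                for i in range(c)]
    return centroids, weights

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, weights = fuzzy_c_means(points, c=2, seed=1)
for p, row in zip(points, weights):
    print(p, [round(w, 3) for w in row])
```

    Each row of `weights` sums to 1, so a point near the boundary between two clusters would receive
    comparable degrees for both instead of a hard assignment.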

    QT (quality threshold) clustering [29] is an alternative method of partitioning data, invented for gene

    clustering. It requires more computing power than k-means, but does not require specifying the

    number of clusters a priori, and always returns the same result when run several times. The user

    chooses a maximum diameter for clusters and the algorithm builds a candidate cluster for each point

    by including the closest point, the next closest, and so on, until the diameter of the cluster surpasses

    the threshold. The candidate cluster with the most points becomes the first true cluster. The algorithm then
    recurses on the reduced set of points to find the rest of the clusters.
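    The procedure can be sketched as below. This is an illustrative reading of the description, not the reference
    implementation of [29]: it assumes distinct points, a caller-supplied distance function dist, and that the
    "closest" point is the one that keeps the diameter of the candidate cluster smallest.

```python
def qt_cluster(points, max_diameter, dist):
    """Quality-threshold clustering sketch; assumes all points are distinct."""
    remaining = list(points)
    clusters = []
    while remaining:
        best = []
        for seed in remaining:
            # Grow one candidate cluster around every remaining point.
            candidate = [seed]
            others = [p for p in remaining if p != seed]
            while others:
                # Next point: the one that keeps the candidate's diameter smallest.
                nxt = min(others, key=lambda p: max(dist(p, q) for q in candidate))
                if max(dist(nxt, q) for q in candidate) > max_diameter:
                    break  # adding it would exceed the threshold
                candidate.append(nxt)
                others.remove(nxt)
            if len(candidate) > len(best):
                best = candidate
        # The largest candidate becomes a true cluster; repeat on the rest.
        clusters.append(best)
        remaining = [p for p in remaining if p not in best]
    return clusters
```

    No random initialization is involved, which is why repeated runs give the same result. The nested loops
    also show where the extra computing cost comes from.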

    STC and its variations, described later in the Section Snippet-based Approaches, are also non-

    hierarchical approaches.

    Hierarchical Approaches

    The hierarchical clustering algorithm yields a tree structure, which is also called a dendrogram. In

    such a structure, a child sub-cluster has to overlap with its parent cluster. The clustering process

    in such algorithms is recursive, meaning that successive sub-clusters are generated from an existing

    cluster, etc. There are two basic strategies for creating this structure: agglomerative (bottom-up)
    algorithms and divisive (top-down) algorithms. The former first clusters the input
    items, forming a set of clusters, and then merges close clusters from the existing cluster set to form a
    parent cluster, based on a similarity measure. The algorithm ends when all clusters have been merged
    into one parent cluster, the root of the tree [36]. Different variations may employ different similarity
    measuring schemes [94]. The latter takes the opposite direction. It starts with the root

    of the tree, and breaks down one large cluster into several smaller clusters. The recursion stops

    when certain criteria are met. Agglomerative clustering algorithms are more popular than divisive

    clustering algorithms.
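    A minimal sketch of the agglomerative strategy follows. It assumes that every input item starts as its own
    cluster and that cluster closeness is measured by single-link distance (the smallest distance between any two
    members); both choices, and the names agglomerative and target, are assumptions of this example.

```python
def agglomerative(items, dist, target=1):
    """Merge the two closest clusters until `target` clusters remain."""
    clusters = [[x] for x in items]  # every item starts as its own cluster
    merges = []                      # merge history, i.e. the dendrogram bottom-up
    while len(clusters) > target:
        # Compare every pair of clusters under single-link distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: min(dist(x, y)
                               for x in clusters[ab[0]] for y in clusters[ab[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters, merges
```

    With target=1 the loop stops at the root of the tree. A merge is never revisited once it has been made, and
    every step compares all pairs of clusters.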

    The above methods usually suffer from their inability to make adjustments once a merge or split
    has been performed. This inflexibility often lowers the clustering accuracy. Furthermore, due to

    the complexity of computing the similarity between every pair of clusters, such algorithms are not

    scalable for handling large data sets in document clustering.

    Another approach is the Hierarchical Frequent Term-based Clustering (HFTC) method, proposed in

    [1]. This algorithm exploits the notion of frequent itemsets¹ used in data mining. HFTC greedily
    selects the next frequent itemset, which represents the next cluster, minimizing the overlap of clusters
    in terms of shared documents. Experiments have shown that this algorithm is not scalable [21].

    ¹ A frequent itemset is a set of words that occur together in some minimum fraction of the documents in a cluster.

    A different approach based on the idea of frequent itemsets is the Frequent Itemset Hierarchical

    Clustering (FIHC). FIHC uses global frequent itemsets² to construct clusters, which reduces the

    dimensionality of the document set, making this algorithm more efficient and scalable.

    Document-based and Snippet-based Approaches

    Clustering can be applied either to the original documents (as in [11, 27, 21]) or to their (query-dependent)
    snippets (as in [86, 79, 71, 19, 88, 23, 77]). For instance, clustering Meta Web Search Engines (MWSEs)
    (e.g. clusty.com) use the results of one or more search engines (e.g. Google, Yahoo!) in order to increase
    coverage/relevance. Such meta-search engines therefore have direct access only to the snippets returned by
    the queried search engines. Clustering the snippets rather than the whole documents also makes clustering
    algorithms faster. Some clustering algorithms [19, 15, 84] use internal or external sources of knowledge
    like Web directories³ (e.g. DMoz⁴, Yahoo! Directory), dictionaries (e.g. WordNet) and thesauri, online
    encyclopedias (e.g. Wikipedia⁵) and other online knowledge bases. These external sources are exploited

    to identify key phrases that represent the contents of the retrieved documents or to enrich the extracted

    words/phrases in order to optimize the clustering and improve the quality of cluster labels.

    Document Vector-based Approaches

    The traditional clustering algorithms described above, either flat (like K-means) or hierarchical (agglomerative
    or divisive), are not based on snippets but on the original document vectors and on a similarity
    measure. Another such approach is ESTC (Extended STC) [10], an extension of STC
    (described later in the Section Snippet-based Approaches) that is appropriate for application over full
    texts (not snippets). To reduce the increased number of clusters (roughly two orders of magnitude more), a
    different scoring function and cluster selection algorithm are adopted. The cluster selection algorithm
    is based on a greedy search that aims at reducing the overlap and increasing the coverage
    of the final clusters.

    In brief, such approaches can be applied only on a stand-alone engine (since they require access to the
    entire vectors of the documents) and they are computationally expensive. Furthermore, clustering over
    full text is not appropriate for a (Meta) WSE, since the full text may be unavailable or too expensive
    to process.

    ² Frequent itemsets that appear together in more than a minimum fraction of the whole document set
    ³ A web directory is a listing of websites organized in a hierarchy or interconnected list of categories
    ⁴ www.dmoz.org
    ⁵ www.wikipedia.org


    Snippet-based Approaches

    Figure 3.1: Clusty, a Snippet-based Clustering Approach

    Snippet-based approaches rely on snippets, and there are already a few engines that provide such
    clustering services. Clusty⁶, shown in Figure 3.1, is probably the most famous one. Suffix Tree
    Clustering (STC) [86] is a key algorithm in this domain and is used by the Grouper [87] and Carrot2
    [79, 71] MWSEs. It treats each snippet as an ordered sequence of words, identifies the phrases
    (ordered sequences of one or more words) that are common to groups of documents by building
    a suffix tree structure, and returns a flat set of clusters that are naturally overlapping. Several
    variations of STC have been proposed. For instance, the trie can be constructed with N-grams
    instead of the original suffixes. The resulting trie has lower memory requirements (since suffixes are
    no longer than N words) and its building time is reduced, but fewer common phrases are discovered
    and this may hurt the quality of the final clusters. Specifically, when N is smaller than the length
    of the true common phrases, the cluster labels can be unreadable. To overcome this shortcoming, [33]
    proposed a join operation. A variant of STC with N-gram is STC with X-gram [77], where X is
    an adaptive variable. It has lower memory requirements and is faster than both STC with N-gram
    and the original STC, since it maintains fewer words. It is claimed to generate more readable
    labels than STC with N-gram, as it inserts more true common phrases into the suffix tree and joins
    partial phrases to construct true common phrases, but no user study results have been reported in
    the literature, and the reported performance improvements are small.
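    The idea of grouping snippets by shared phrases can be illustrated with a deliberately simplified sketch. It
    is not STC itself: instead of a suffix tree it indexes word N-grams in a dictionary (closer in spirit to the
    N-gram variation mentioned above), and the names base_clusters, max_n and min_docs are assumptions of this
    example.

```python
from collections import defaultdict

def base_clusters(snippets, max_n=3, min_docs=2):
    """Map each phrase of up to `max_n` words to the snippets that contain it."""
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()  # a snippet is an ordered sequence of words
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                phrase_docs[" ".join(words[i:i + n])].add(doc_id)
    # Keep only phrases shared by a group of snippets; each one labels a cluster.
    return {p: docs for p, docs in phrase_docs.items() if len(docs) >= min_docs}
```

    The resulting clusters overlap naturally, because a snippet appears under every phrase it shares with other
    snippets. Phrases longer than max_n words are never found, which is the label-quality problem noted above.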

    ⁶ www.clusty.com


    Another snippet-based clustering approach is TermRank [23]. TermRank succeeds in ranking discriminative
    terms higher than ambiguous terms, and ambiguous terms higher than common terms. The
    top T terms can then be used as feature vectors in K-means or any other Document Vector-based
    clustering algorithm. This approach requires knowing TF, works only on single words (not on
    phrases), and no evaluation results over snippets have been reported in the literature.

    Another approach is Findex [34], a statistical algorithm that extracts candidate phrases by moving a

    window with a length of 1..|P| words across the sentences (P), and fKWIC which extracts the most

    frequent keyword contexts which must be phrases that contain at least one of the query words. In

    contrast to STC, Findex does not merge clusters on the basis of the common documents but on the

    similarity of the extracted phrases. However, no comparative results regarding cluster label quality

    have been reported in the literature.

    Finally, there are snippet-based approaches that use external resources (lexical or training data). For
    instance, SNAKET⁷ [19] (a MWSE) uses the DMoz web directory for ranking the gapped sentences⁸
    that are extracted from the snippets. Deep Classifier [84] trims the large hierarchy returned by an
    online Web directory into a narrow one and combines it with the results of a search engine, making
    use of a discriminative naive Bayesian classifier. Another (supervised) machine learning technique
    is Salient Phrases Extraction [88]. It extracts salient phrases as candidate cluster names from the
    list of titles and snippets of the answer, and ranks them using a regression model over five different
    properties, learned from human training data. Another approach, which uses several external resources
    such as WordNet and Wikipedia in order to identify useful terms and to organize them hierarchically,
    is described in [15]. Other extensions of STC, for oriental languages and for cases where external
    resources are available, are described in [89, 78].

    3.1.3 Cluster Presentation & User Interaction

    Although cluster presentation and user interaction approaches are somewhat orthogonal to the clustering
    algorithms employed, they are crucial for providing flexible and effective access services to the end users.
    In most cases, clusters are presented using lists or trees. Some variations are described next. A well-known
    interaction paradigm that involves clustering is Scatter/Gather [11, 27], which provides an interactive
    interface that allows users to select clusters; the documents of the selected clusters are then clustered
    again, the new clusters are presented, and so on.

    ⁷ SNippet Aggregation for Knowledge ExtracTion
    ⁸ Gapped sentences are sequences of terms that occur non-contiguously in the snippets


    Figure 3.2: Quintura Word Cloud

    Clusty⁹ is an extension of Vivisimo that offers a new feature, called remix clustering, which clusters
    the same search results again while ignoring the topics that the user has already seen. Another approach for the
    presentation layer is provided by Quintura¹⁰, shown in Figure 3.2. It extracts keywords from search results
    and builds a word cloud (visual map). The name of each cluster is placed in a 2D area. The positions of
    the names are based on their distance, while font size indicates the size of each cluster. By clicking words
    in the cloud, the user refines the query. SNAKET's [19] interface offers a personalization feature that is
    performed at the client side: the user can select a set of labels and then ask SNAKET to filter out (from
    the ranked list) all those snippets that do not belong to the folders labeled by the selected labels.

    SOMs have been used to support exploration of a document space to search for patterns and gain
    overviews of available documents and relationships between documents [42] (Figure 3.4). Another information
    visualization alternative, Citiviz, displays the clusters in search results using a hyperbolic tree and
    a scatterplot. Several (M)WSEs incorporate visualizations similar to both treemaps and hyperbolic trees.
    grokker¹¹, shown in Figure 3.3, clusters documents into a hierarchy and produces an Euler diagram, a
    coloured circle for each top-level cluster with sub-clusters nested recursively, in which the user can zoom in.

    ⁹ www.clusty.com
    ¹⁰ www.quintura.com
    ¹¹ www.grokker.com


    Figure 3.3: grokker Generates an Euler Diagram

    Another example is Kartoo¹², shown in Figure 3.5, which generates a thematic map from the top dozen
    search results for a query, laying out small icons representing results onto the map, with which the user
    can interact.

    3.2 Facets and Dynamic Taxonomies

    Dynamic taxonomies (also known as faceted search systems) [64] are a general knowledge management model
    based on a multidimensional classification of heterogeneous data objects and are used to explore and browse
    complex information bases in a guided, yet unconstrained way through a visual interface. Features of
    faceted metadata search include (a) display of current results in multiple categorization schemes (facets)
    (e.g. based on metadata terms, such as size, price or date), (b) display of categories leading to non-empty
    results, and (c) display of the count of the indexed objects of each category (i.e. the number of results the
    user will get if he selects this category).

    ¹² www.kartoo.com


    Figure 3.4: Top-200 Web Search Results Clustering Displayed Using Two-level TreeMaps

    Figure 3.5: Kartoo Generates a Thematic Map

    3.2.1 Introduction

    Static taxonomies (such as Yahoo!'s), based on a hierarchy of concepts, can be used to select areas of

    interest and restrict the portion of the retrieved infobase. The creation of such taxonomies is usually a


    manual process, although automatic and semi-automatic techniques have been proposed. However, static
    taxonomies do not scale to large information bases [65], since the number of documents rapidly becomes
    too large for manual inspection.

    On the other hand, dynamic taxonomies [63, 64, 76] (also known as faceted search systems) are a general

    knowledge management model based on a multidimensional classification of heterogeneous data objects and

    are used to explore/browse complex information bases in a guided, yet unconstrained way through a visual

    interface. Features of faceted metadata search include:

    - display of current results in multiple categorization schemes (facets) (e.g. based on metadata terms,
      such as size, price or date)
    - display of categories leading to non-empty results (Poka-Yoke¹³)
    - display of the count of the indexed objects of each category (i.e. the number of results the user will
      get if he selects this category)
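    The three features listed above can be illustrated with a small sketch over flat facets. The data layout
    (each object is a dictionary that maps a facet name to a list of values) and the function names facet_counts
    and zoom are assumptions made for this example and are not taken from any of the cited systems.

```python
from collections import Counter

def facet_counts(results, facets):
    """For each facet, count how many of the current results carry each value."""
    counts = {facet: Counter() for facet in facets}
    for obj in results:
        for facet in facets:
            for value in obj.get(facet, ()):
                counts[facet][value] += 1
    return counts

def zoom(results, facet, value):
    """Restrict the current results to the objects classified under `value`."""
    return [obj for obj in results if value in obj.get(facet, ())]
```

    Because the counts are computed from the current results only, values that would lead to an empty result
    never appear, and the count shown next to a value is the number of results the user gets after selecting it.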

    Such systems focus on user-centered interactive exploratory access, and propose a holistic approach in

    which modeling, interface and interaction issues are considered together. One of the key factors of this

    model is simplicity, in order to make it easily understandable and usable by end-users. The user always deals

    with a single conceptual representation of the infobase. The conceptual schema of a dynamic taxonomy

    is a plain taxonomy. It is a hierarchy going from the most general to the most specific concepts based on

    subsumptions. Directed acyclic graph taxonomies modelling multiple inheritance are supported but rarely

    required.

    The user is guided to reach his goal, because at each stage he has a complete list of all the concepts related

    to the current focus, which can be used to further refine his exploration. Furthermore as in traditional

    search methods, the infobase can be restricted and a reduced taxonomy can be created. The user is in

    charge of interaction and he can freely explore the infobase, discovering unexpected relationships. By

    construction, no empty results can occur, because they are automatically pruned. Usability studies [26, 85]
    show that despite slow response times, dynamic taxonomies produce a faster overall interaction and a
    significantly better recall (both actual and perceived) than access through text retrieval.

    Dynamic taxonomies converge very fast to small result sets, as described in [65]. For
    example, 3 zoom operations on terminal concepts are sufficient to reduce a 10,000,000 object infobase
    described by a compact taxonomy with 1,000 concepts to an average of 10 objects. Finally, the conceptual
    organization of dynamic taxonomies makes it possible to gather user interests at a precise conceptual level by simply
    monitoring the zoom operations issued and the concepts the user focuses on.

    ¹³ Poka-Yoke is a Japanese term that means fail-safing or mistake-proofing.


    Examples of applications of faceted metadata search include: e-commerce (e.g. eBay), library and bibliographic
    portals (e.g. DBLP), museum portals (e.g. [49] and Europeana¹⁴), mobile phone browsers (e.g.
    [35]), specialized search engines and portals (e.g. [50]), the Semantic Web (e.g. [30, 49, 56]), general purpose
    web search engines (e.g. Google Base), and other frameworks (e.g. mSpace [67]).

    3.2.2 Taxonomy Design

    The most accurate way to create a taxonomy is to build categories by hand. Unfortunately, manual

    classification is expensive and infeasible for many practical document collections, and especially for a WSE

    document collection. Automatic clustering techniques generate clusters that are typically labeled using a
    set of keywords, which leads to labels that are neither predictive nor intuitive. An alternative approach to clustering
    is to generate hierarchies of terms for browsing the database. [66] introduced subsumption hierarchies,
    and [45] showed experimentally that subsumption hierarchies outperform lexical hierarchies [60]. Another
    approach is to use the hierarchical structure of WordNet¹⁵ ¹⁶ to offer a hierarchy view over the topics [40].
    WordNet, together with a tree-minimization algorithm that creates an appropriate concept hierarchy for a
    database, is also used in [72].

    All these techniques generate a single hierarchy for browsing the database. A supervised approach for
    extracting useful facets from a collection of text or text-annotated data is described in [14]; it relies on
    WordNet hypernyms¹⁷ and on a Support Vector Machine (SVM) classifier to assign new keywords to facets.
    More recent work [15, 13] provides an unsupervised technique to extract useful facet terms, by expanding
    a database using WordNet and Wikipedia to identify important terms.

    3.2.3 Framework

    Table 3.1 formally defines and introduces notations for terms, terminologies, taxonomies, faceted taxonomies,
    interpretations, descriptions and materialized faceted taxonomies, as described in [76]. In brief,
    Obj is a set of objects (the set of all documents indexed by the WSE), T is a set of terms, and the elements
    of Obj can be described with respect to one or more aspects (facets), where each aspect is associated with
    a value domain, finite or infinite, which may be ordered (in the general case we could have a partial order
    $(T, \leq)$). The description of an object with respect to one facet consists of assigning to the object one or
    more terms from the taxonomy that corresponds to that facet.

    ¹⁴ http://www.europeana.eu
    ¹⁵ WordNet is a lexical database, which groups English words into sets of synonyms called synsets, provides short, general
    definitions, and records the various semantic relations between these synonym sets
    ¹⁶ http://wordnet.princeton.edu/
    ¹⁷ A hypernym is a word whose meaning includes the meanings of other words, as the meaning of the term animal includes
    the meaning of cat, dog, parrot

    Table 3.2 defines the required notions and notations regarding user interaction. The user explores or

    navigates the information space by setting and changing his focus. The notion of focus can be intensional

    or extensional. Specifically, any set of terms, i.e. any conjunction of terms (or any boolean expression of

    terms) is a possible focus. For example, the initial focus can be the empty compound term, or the top term

    of a facet. However, the user can also start from an arbitrary set of objects, and this is the common case

    in the context of a WSE. In that case the focus is defined extensionally. Specifically, if A is the result of a

    free text query q, then the interaction is based on the restriction of the materialized faceted taxonomy on

    A (as defined at the bottom part of Table 3.2).

    At any point during the interaction, the immediate zoom-in/out/side points along with count information
    are computed and provided to the user. When the user selects one of these points, the selected term
    is added to the focus, and so on. An example of a materialized faceted taxonomy is shown in Figure 3.6.

    Figure 3.6: Example of a Materialized Faceted Taxonomy

    Foci are considered to be redundancy free. A focus ctx (i.e. $ctx \subseteq T$) is redundancy free if ctx =


    min(ctx). For example, ctx = {Greece, Crete} is not redundancy free because min(ctx) = {Crete}.
    The contents (or extension) of a focus ctx is the set of objects I(ctx). This notion can be refined in order
    to distinguish the shallow contents I(ctx) from the deep contents $\bar{I}(ctx)$.
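    The redundancy-free condition can be checked mechanically. The sketch below assumes that the taxonomy is
    given as a dictionary that maps each term to the set of its direct broader terms; the dictionary layout and
    the function names broaders and minimal are assumptions made for this example.

```python
def broaders(term, parents):
    """All terms strictly broader than `term` (transitive closure of `parents`)."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen

def minimal(ctx, parents):
    """Keep only the terms of `ctx` that are not broader than another term of `ctx`."""
    covered = set()
    for t in ctx:
        covered |= broaders(t, parents)
    return {t for t in ctx if t not in covered}

def is_redundancy_free(ctx, parents):
    return set(ctx) == minimal(ctx, parents)
```

    With Crete placed under Greece, the focus {Greece, Crete} reduces to {Crete}, since every object under
    Crete is already under Greece.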

    3.2.4 User Interface Design

    System implementations for dynamic taxonomies and faceted search allow a wide range of query possibilities
    on the data. Only when these are made accessible by appropriate UIs can the resulting applications
    support a variety of search, browsing and analysis tasks. Such systems should provide support at least for
    the three basic characteristics of faceted and dynamic taxonomies: they should display non-empty results,
    in multiple categorization schemes (facets), along with the count of the indexed objects of each category.
    Additional UI functionality is usually accompanied by additional complexity and visual clutter.

    Selection and de-selection of zoom-points is of central importance in faceted search. If only one concept
    should be selectable at a time within a facet, traditional single-select controls such as radio buttons,
    dropdown list controls or simple links can be used. On the other hand, the standard multi-select elements
    are check boxes. For instance, the yelp¹⁸ web application provides check buttons for multi-select facets and
    simple links for facets with exclusive selection. Alternatives for allowing both modes in a facet would be
    dedicated controls or modifier keys (such as pressing shift while clicking). For range selection navigation
    mode, slider controls can allow the specification of upper and lower bounds on the result set. De-selection
    should be as easy as concept selection. Additionally, if breadcrumbs or a similar filter summary, indicating
    summaries of single or all facets, are present, these should include the option to clear individual filters as
    well. Also, buttons for resetting single facets or all filter options can help users to zoom out quickly.

    Figure 3.7: ContentLandscape Applies Collapsible Panel Pattern for Zooming

    For flat facets, i.e. facets without a hierarchical relation between the concepts, simple list widgets are
    usually used. List sorting can either be alphabetical or dynamically updated by the number of assigned
    items in the current result set. For navigating hierarchies, a number of different presentation and navigation
    options exist, which include: Explorer Tree (not very space efficient); Zoom and Replace, which replaces
    the facet widget content with the level below (used in Flamenco¹⁹ [85]); Collapsible panels, hierarchical
    widgets based on the accordion pattern²⁰ (used in the ContentLandscape application [70], Figure 3.7);
    and Continuous Zooming, where hierarchical facets are displayed as space-filling widgets, which allow a
    fast traversal across all levels while simultaneously maintaining context (used in the FacetZoom prototype
    [12], Figure 3.8). The number of indexed items for each facet and zoom-point can be shown by numbers
    (after the labels), bar charts, height of facets, colour, etc. Visgets [18] extends this principle by featuring a
    whole number of visualizations. FaThumb [35] enables faceted search on mobile devices (Figure 3.9). The
    filter area is grouped in nine zones, corresponding to the nine digit keys on mobile phones. The middle
    zone serves as a spatial overview during navigation. The surrounding eight zones allow the user to select
    hierarchy branches and repeatedly zoom in on subtrees. The left shortcut key adds the currently
    selected concept to the query, while the right one allows the user to quickly jump back to the top.

    ¹⁸ http://www.yelp.com

    Figure 3.8: FacetZoom Combines Ideas from Zoomable UIs With Faceted Search

    Query searching can be done either over all results or within the current focus, as shown in Figure 3.10.
    Moreover, in order to quickly locate zoom-points in a facet and avoid having to navigate large hierarchies
    when the target concept is already known by name, direct access to facet items can be achieved
    with a keyword search over the concept labels (/facet [30]). Since the number of available facets can be
    very large, ways to reduce the space they occupy are discussed in [24], and include collapsible facet widgets (such
    as those used by the Getty Images faceted navigation interface²¹) and expandable filter areas (i.e. a More button).

    Furthermore, systems should be able to determine which facet-value pairs the interface should provide
    to a user. Personalization allows the system to present the facet-value pairs that can help the user quickly
    find the documents that he is most interested in. Existing approaches include content-based personalization,
    where a recommendation system monitors the user's actions and pushes documents that match his user
    profile; collaborative-based faceted search personalization, where the system recommends items to a user
    by leveraging information from other users with similar tastes and preferences; and finally an ontological
    approach, which uses the distance between values of an ontology to measure the relevance to users [75].

    Figure 3.9: Faceted Search for Small Screens in the FaThumb Prototype

    Figure 3.10: Flamenco Allows Choosing Between a Search Over All Results or Within Current Focus

    ¹⁹ Online demos available at http://flamenco.berkeley.edu/
    ²⁰ http://www.welie.com/patterns/showPattern.php?patternID=accordion
    ²¹ http://gettyimages.com


    Name                      Notation       Definition
    terminology               $T$            a set of names, called terms (they may capture both categorical and numeric values)
    subsumption               $\leq$         a partial order (reflexive, transitive and antisymmetric)
    taxonomy                  $(T, \leq)$    $T$ is a terminology, $\leq$ a subsumption relation over $T$
    broaders of $t$           $B^+(t)$       $\{ t' \mid t < t' \}$
    narrowers of $t$          $N^+(t)$       $\{ t' \mid t' < t \}$
    direct broaders of $t$    $B(t)$         minimal


    Name                              Notation        Definition
    focus                             $ctx$           any subset of $T$ such that $ctx = minimal(ctx)$
    focus projection on a facet $i$   $ctx_i$         $ctx_i = ctx \cap T_i$

    Kinds of zoom points w.r.t. a
    facet $i$ while being at $ctx$    Notation        Definition(s)
    zoom points                       $AZ_i(ctx)$     $= \{ t \in T_i \mid \bar{I}(ctx) \cap \bar{I}(t) \neq \emptyset \}$
    zoom-in points                    $Z^+_i(ctx)$    $= AZ_i(ctx) \cap N^+(ctx_i)$
    immediate zoom-in points          $Z_i(ctx)$      $= maximal(Z^+_i(ctx))$
                                                      $= AZ_i(ctx) \cap N(ctx_i)$
    zoom-side points                  $ZR^+_i(ctx)$   $= AZ_i(ctx) \setminus \{ ctx_i \cup N^+(ctx_i) \cup B^+(ctx_i) \}$
    immediate zoom-side points        $ZR_i(ctx)$     $= maximal(ZR^+_i(ctx))$

    Restriction over an object set    Notation        Definition(s)
    restricted object set             $A$             any subset of $Obj$
    reduced interpretation            $I_A$           $I_A(t) = I(t) \cap A$
    reduced terminology               $T_A$           $= \{ t \in T \mid \bar{I}_A(t) \neq \emptyset \}$
                                                      $= \{ t \in T \mid \bar{I}(t) \cap A \neq \emptyset \}$
                                                      $= \bigcup_{o \in A} B^+(D_I(o))$

    Table 3.2: Interaction Notions and Notations
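    The notions of Table 3.2 can be made concrete with a small sketch that computes the zoom points of a facet
    together with their counts. It assumes that the taxonomy is a dictionary mapping a term to its direct narrower
    terms, that the interpretation is a dictionary mapping a term to the objects directly classified under it, and
    that the deep contents of a term include the objects of all its narrower terms. These data layouts and all
    function names are assumptions made for this example.

```python
def narrowers(term, children):
    """All terms strictly narrower than `term`."""
    seen, stack = set(), list(children.get(term, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(children.get(t, ()))
    return seen

def deep_contents(term, interp, children):
    """Objects classified under `term` or under any of its narrower terms."""
    objs = set(interp.get(term, ()))
    for t in narrowers(term, children):
        objs |= set(interp.get(t, ()))
    return objs

def extension(ctx, interp, children, objects):
    """Objects of `objects` that fall under every term of the focus `ctx`."""
    objs = set(objects)
    for t in ctx:
        objs &= deep_contents(t, interp, children)
    return objs

def zoom_points(ctx, facet_terms, interp, children, objects):
    """Terms of a facet that lead to a non-empty result, with their counts."""
    focus = extension(ctx, interp, children, objects)
    counts = {t: len(focus & deep_contents(t, interp, children)) for t in facet_terms}
    return {t: n for t, n in counts.items() if n > 0}
```

    Passing the answer A of a free text query as objects, with an empty focus, corresponds to the restriction at
    the bottom of the table: only terms with at least one object in A are kept.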


    Chapter 4

    Visualization Models and Metaphors

    In this chapter we discuss five different visualization models. First, we discuss MRPBM, which is
    based on RPs, and then we analyze ESCBM, which is based on the VSM ranking model and its spatial
    characteristics. The next one is PFNET, which uses associative networks, and the fourth one is MDS,
    a group of methods used to discover empirical relationships among investigated objects. Finally, we
    discuss SOM, a nonlinear topology-preserving projection method that converts a high-dimensional
    space into a low-dimensional grid, as well as different visualization metaphors.

    4.1 Multiple Reference Points Based Models (MRPBM)

    MRPBM models are visualization algorithms that display the results of a search not in the classical linear
    order, but by projecting them onto a low-dimensional visual space. They can effectively handle complex
    information needs by using multiple RPs. An RP, or Point of Interest (POI), is a search criterion against which
    documents or surrogates are matched and search results are generated and presented to the users. In a
    broad sense, an RP represents the users' information needs and any information related to those needs, from
    user preferences and search history to query terms or browsed documents. Multiple RPs can form a
    low-dimensional visual space, and documents can be mapped onto this space based upon their attraction to the
    RPs.

    Visualization models based on multiple RPs can be classified into three categories:

    Fixed Multiple RPs Models


    These models use multiple RPs with fixed positions, and can be used for both vector-based and

    Boolean-based IR systems. The representative model is InfoCrystal [69]. In the Boolean context,

    each RP is equivalent to a term or a Boolean sub-expression from a Boolean query. The visual

    space is a polygon whose vertices are the RPs.

    The side lengths of the polygon are equal, so that the RPs are evenly configured in the visual space.

    The retrieved results are displayed inside the polygon. The polygon is partitioned into N exclusive

    tiers, represented as concentric rings, where N is the number of RPs. The first tier displays results

    related to only one RP, the second results related to two RPs, etc. Figure 4.1 shows a fixed multiple

    RPs model.
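
    The tier rule described above can be illustrated with a minimal sketch. It assumes, as a
    simplification that is not part of InfoCrystal itself, that every RP is a single term and that a
    document is related to an RP when it contains that term. The function name assign_tiers and the
    sample data are hypothetical.

```python
from collections import defaultdict


def assign_tiers(documents, reference_points):
    """Group documents by the number of reference points (RPs) they match.

    documents: dict mapping a document id to the set of terms it contains.
    reference_points: list of terms, one per RP (assumption: each RP is a
    single term rather than a Boolean sub-expression).
    Returns a dict mapping tier number -> list of (doc_id, matched RPs).
    """
    tiers = defaultdict(list)
    for doc_id, terms in documents.items():
        matched = tuple(rp for rp in reference_points if rp in terms)
        if matched:  # documents related to no RP are not displayed
            tiers[len(matched)].append((doc_id, matched))
    return dict(tiers)


docs = {
    "d1": {"search", "engine"},
    "d2": {"search", "visualization", "engine"},
    "d3": {"cooking"},
}
tiers = assign_tiers(docs, ["search", "engine", "visualization", "browsing"])
```

    In this example d1 lands in the second tier and d2 in the third, while d3 matches no RP and is
    left out of the display.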

    Figure 4.1: Display of 4 Reference Points in a Fixed Reference Point Environment

    Movable Multiple RPs Models

    These models use multiple RPs, which can be manipulated by the user, while semantic connections

    of displayed objects are still maintained in the visual space. VIBE [55] and its variations, VR-VIBE [2] and

    LyberWorld [28], are such models. The primary benefit of this approach is that the

    user may arbitrarily place an RP in any area of interest, such as another RP, document or cluster of

    documents, and observe the impact of the RP on that area. According to the algorithm, the position

    of a document is strongly related to the similarities between the document and a group of predefined

    RPs. The positions of all related RPs in the visual space, play a very important role in positioning

    a projected document. In addition, taking into consideration the relevance between a document and


    related RPs, the ultimate position of a document is calculated. Initially the first two related RPs are

    selected in order to calculate the position of the document. The new position of the document serves

    as an intermediate RP for further consideration, and the process continues until all related RPs are

    considered. If the user adds, removes, or changes the position of any RP, the whole algorithm must be

    executed again.
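
    The pairwise procedure can be sketched as follows. The text does not give the exact interpolation
    rule, so this sketch assumes that two points are combined by dividing the segment between them in
    proportion to their similarity weights, and that the intermediate RP carries the combined weight
    of the RPs already processed. The function name vibe_position is hypothetical.

```python
def vibe_position(rp_positions, similarities):
    """Place one document among the RPs it is related to.

    rp_positions: list of (x, y) screen positions of the RPs.
    similarities: non-negative document-RP similarity scores, same order.
    Returns the (x, y) position of the document, or None if the document
    is related to no RP.
    """
    related = [(p, s) for p, s in zip(rp_positions, similarities) if s > 0]
    if not related:
        return None
    # Start from the first related RP; its similarity is the running weight.
    (x, y), weight = related[0]
    for (rx, ry), s in related[1:]:
        total = weight + s
        # Assumption: the new point divides the segment between the
        # intermediate RP and the next RP in proportion to their weights.
        x = (weight * x + s * rx) / total
        y = (weight * y + s * ry) / total
        weight = total  # the intermediate RP carries the combined weight
    return (x, y)


position = vibe_position([(0, 0), (4, 0), (0, 4)], [1, 1, 2])
```

    Under these assumptions the final position is the similarity-weighted centroid of the related
    RPs, which is why moving a single RP changes the position of every document related to it.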

    Figure 4.2 shows a snapshot of VIBE. In this example 5 RPs (circles) are used and documents are

    represented as rectangles. Those documents that contain at least one of the descriptors indicated by

    the user when initiating the search are considered relevant. The documents with greater coincidence

    in their descriptors with those of the RP are placed closer to that RP. The user can also expand the

    icons of useful documents by simply drawing a box around one or more

    documents of interest, after which a list of the selected documents is shown. Clicking with the mouse

    on any of the documents on the list will open another window with the complete document. One

    characteristic that makes the system interactive is that the user may add, change or remove the RPs

    on the screen. After any of these changes, the system automatically relaunches the search

    query and re-orders the retrieved documents to present the relationships between documents and those

    between POIs.

    Automatic RPs Rotation Models

    This model is a similarity ratio based model and was introduced with WebStar [93] to visualize link

    structures. What makes this model unique is that it adds a new feature, the automatic rotation of an RP, to

    the 2D visual space. The visual space is built on a polar coordinate system, where the origin of the

    visual space is a central document (focus point), specified or selected by users, and RPs are evenly

    distributed on a sphere with the focus point as center. All of the relevant documents are scattered

    within the visual space based on their projection angle (which is similarity based) and distance (which

    is not). When the user selects an RP, it automatically rotates around the sphere. As a consequence, related

    documents are attracted and also rotated.

    Figure 4.3 shows the WebStar system. The central document (focus point) is denoted with a blue

    square at the center of the circle, while the four RPs (sport, research, international, library),

    are represented with the yellow squares, evenly distributed outside the circle. Documents are the

    pink squares scattered inside the circle. In this example the user has selected the international RP,

    coloured in red, which is rotated around the circle. Notice how documents change position as the RP

    rotates.

    Figure 4.2: VIBE Using 5 Reference Points

    Both the models for fixed and movable multiple RPs require at least three RPs to project documents in

    their visual spaces, while the model for automatic RPs rotation requires at least one RP in conjunction

    with the focus point. Furthermore, visualization models for multiple RPs can be 2D or 3D and can be

    applied to either Boolean or vector-based information systems. The position of any RP can be controlled

    and manipulated by users at will. It is the flexibility of manipulation that enables users to compare and

    analyze the impact of two reference points on documents, and identify good/poor discriminative terms.

    Such models can be used to visualize Internet hyperlinks, search results from an information retrieval

    system, a full text, and term discriminative analysis.

    4.2 Euclidean Spatial Characteristic Based Model (ESCBM)

    These visualization models are based on the VSM model and its spatial characteristics. The basic Euclidean

    spatial elements such as point, distance, and angle may have a special connection to information retrieval in

    the context of the vector-based space. For instance, a document or RP in a vector-based space corresponds

    to a spatial point in the Euclidean space. Euclidean distance between documents and RPs can be used

    as an indicator of their similarity. The visual spaces of these models are 2D, and in order to construct them, they use

    two RPs, which serve as view points, one major (KVP), and one minor (AVP). These RPs, the reference

    axis that they form and the distance between them, are all selected by the user and affect the placement of the relevant

    documents.

    Figure 4.3: WebStar Using 4 RPs. Snapshots During a Full Rotation of international Reference Point

    The projection conversion equation for an IR evaluation model is crucial for visually displaying it in

    the visual space. The complexity of a conversion equation depends upon multiple factors such as the

    definition of the visual space and nature of the retrieval evaluation model. Some equations are simple and

    straightforward while others may be complicated. The significance of visualizing an IR evaluation model

    is not only to make the invisible internal retrieval process transparent to users but also to allow them to

    manipulate the model in the visual space at will.

    In this context, three visualization models have been proposed.


    Distance-angle Based Model

    In this model the visual projection distance and angle are defined for any document Di. The projection

    distance is the distance from the document Di to the KVP, and the projection angle is the

    angle formed by the lines KVP-Di and KVP-AVP, in the vector space. The valid display area of

    this model is a half-infinite plank, where the X-axis and Y-axis are defined as the visual projection

    angle and distance respectively. The width of the X-axis is always equal to π and the width of the Y-axis is

    infinite. KVP is always mapped onto the origin of the visual space, because its visual projection distance

    is 0 and its angle is defined as 0. AVP is mapped onto the Y-axis, because its visual

    projection distance is the distance between the two reference points and its visual

    projection angle is defined as 0. The distance between the two reference points does not affect this

    model. DARE [90] is such a model.

    Figure 4.4 shows the display of the projected cosine model using DARE. The angle α is the retrieval

    threshold, while R2 is AVP. D1 is a document situated within the retrieval area defined by the angle

    α, and D2 is any document located on one boundary of the angle α. Users may drag the vertical

    retrieval line to any place within the valid display area, to increase or decrease the retrieval area.
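
    The distance-angle projection can be written down directly from the definitions above. The
    following is a minimal sketch rather than the DARE implementation: the function name
    dare_projection is hypothetical, and documents and view points are assumed to be plain lists of
    term weights of equal length.

```python
import math


def dare_projection(doc, kvp, avp):
    """Return (angle, distance) for a document in a distance-angle display.

    angle: angle at KVP between the lines KVP-doc and KVP-AVP (X-axis value,
           between 0 and pi).
    distance: Euclidean distance from the document to KVP (Y-axis value).
    """
    to_doc = [d - k for d, k in zip(doc, kvp)]
    to_avp = [a - k for a, k in zip(avp, kvp)]
    distance = math.sqrt(sum(c * c for c in to_doc))
    axis_length = math.sqrt(sum(c * c for c in to_avp))
    if axis_length == 0:
        raise ValueError("KVP and AVP must be different points")
    if distance == 0:
        return (0.0, 0.0)  # KVP itself: its angle is defined as 0
    cosine = sum(p * q for p, q in zip(to_doc, to_avp)) / (distance * axis_length)
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding errors
    return (math.acos(cosine), distance)


kvp, avp = [0.0, 0.0, 0.0], [0.0, 0.0, 2.0]
angle, distance = dare_projection([0.0, 3.0, 0.0], kvp, avp)
```

    With these inputs KVP is projected onto the origin and AVP onto the Y-axis at height 2, matching
    the description of the model.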

    Figure 4.4: Display of the Projected Cosine Model, in Distance-Angle DARE Model

    Angle-angle Based Model

    In this model two visual projection angles are defined for any document Di. The first angle is

    the angle formed by the lines KVP-Di and KVP-AVP, and the second one is the angle formed

    by the lines AVP-Di and KVP-AVP, both of them in the vector space. The two angles

    are assigned to the X-axis and Y-axis. The minimum and maximum values of the two angles

    are 0 and π respectively. The valid display area is a triangle and the two reference points

    are projected at (π/2, 0) and (0, π/2) respectively. This model again is not affected by the distance

    between the two RPs.

    TOFIR [91] is an example of such a model, shown in Figure 4.5. The angle is the retrieval threshold,

    the origin of the vector space is KVP, while R2 is AVP. In the figure O is the projected origin of

    the vector space. The horizontal line defines the retrieval area, and can be manipulated by the users.

    Figure 4.5: Display of the Projected Cosine Model, in the Angle-Angle TOFIR Model

    Distance-distance Based Model

    In this model two visual projection distances are defined for any document Di. The first distance is

    the distance from the document Di to the KVP, and the second is the distance from the document

    Di to the AVP, both of them in the vector space. The two projection distances are assigned to the

    X-axis and Y-axis. The valid display area is a half-infinite plank, where both the X-axis and Y-axis

    are assigned as the visual projection distances. It forms a π/4 angle against the X-axis or the Y-axis,

    its two corners are connected to the X-axis and Y-axis respectively, and its width is dynamic and

    determined by the distance between the two RPs. GUIDO [54] is such a model. Figure 4.6 shows the

    distance model in GUIDO.
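
    The shape of the valid display area follows from the triangle inequality. As a short worked
    derivation, write x for the distance from Di to KVP, y for the distance from Di to AVP, and L for
    the distance between KVP and AVP. These three symbols are introduced here for illustration only,
    and the distance is assumed to be a metric such as the Euclidean distance.

```latex
\[
  |x - y| \;\le\; L \;\le\; x + y, \qquad x \ge 0,\; y \ge 0 .
\]
\[
  \text{Band: } x - L \le y \le x + L, \qquad
  \text{lower boundary: } x + y = L, \qquad
  \text{corners: } (L, 0) \text{ and } (0, L), \qquad
  \text{width: } \sqrt{2}\, L .
\]
```

    The band runs parallel to the line y = x, which gives the π/4 angle against both axes. KVP is
    projected at (0, L) and AVP at (L, 0), and the band becomes wider as the two RPs move apart.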

    Figure 4.6: Display of the Projected Distance Model, in the Distance-Distance GUIDO Model

    One of the distinguishing characteristics of these visualization models is their capacity to visualize

    traditional IR evaluation models in addition to visualizing relationships among documents. Document

    distributions in these visual spaces change accordingly when the RPs change. This implies that the displayed

    document configurations in the visual spaces can be customized based upon users' dynamic information

    needs.

    4.3 Pathfinder Associative Network (PFNET)

    The Pathfinder associative network (PFNET) is a structural and procedural modeling technique that extracts

    underlying connection patterns in proximity data and represents them spatially in a class of networks

    [8]. The power of the Pathfinder associative network is its ability to discard insignificant links in the

    original network while preserving the salient semantic structure of the network. The simplified network

    still maintains the proximity connections and fundamental characteristics of the original network.

    The main idea of the Pathfinder associative network is to discard the redundant paths and keep the

    significant ones in a network. PFNET uses the triangle inequality to identify paths with the lowest weights

    in the network, eliminate redundant ones, and make the network more economical. Figure 4.7 displays the

    original network and the final PFNET network. Moreover, the principle of the triangle inequality can be

    extended to an abstract space. In that case, connection proximity between two points may be measured

    in other forms such as invisible semantic similarity between two objects rather than distance.
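
    The pruning idea can be sketched in a few lines. The sketch makes assumptions that are not stated
    in the text: link weights are distances (lower means more similar), the weight of a path is the
    largest link weight along it, and paths of any length are allowed. In the Pathfinder literature
    this corresponds to the parameter choice r = infinity and q = n - 1. The function name pfnet is
    hypothetical.

```python
import math


def pfnet(dist):
    """Prune a symmetric distance matrix to its Pathfinder links.

    dist: square list of lists; dist[i][j] is the link weight between nodes
          i and j (math.inf if there is no link, 0 on the diagonal).
    Returns the list of retained links as (i, j) pairs with i < j.
    """
    n = len(dist)
    # minimax[i][j]: the smallest possible "largest link weight" over all
    # paths from i to j, computed with a Floyd-Warshall style update.
    minimax = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(minimax[i][k], minimax[k][j])
                if via_k < minimax[i][j]:
                    minimax[i][j] = via_k
    # A direct link survives only if no indirect path is cheaper, i.e. the
    # generalized triangle inequality is not violated by keeping it.
    return [(i, j)
            for i in range(n) for j in range(i + 1, n)
            if dist[i][j] < math.inf and dist[i][j] <= minimax[i][j]]


links = pfnet([[0, 1, 5],
               [1, 0, 2],
               [5, 2, 0]])
```

    In this three-node example the direct link between nodes 0 and 2 (weight 5) is discarded, because
    the path through node 1 never uses a link heavier than 2.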

    Application of a PFNET to a domain problem requires identifying two basic elements: the first is the

    objects which are used as nodes in the network, and the second is the proximity relationship between the

    two objects, which is used to form a link between the two objects. Proximity can be obtained by either a

    human-intervention method or an automatic computation method. Different objects and proximity methods

    can lead to different Pathfinder associative networks.

    Figure 4.7: Display of Original Network (left) and Final PFNET Network (right)

    The Pathfinder network technique is very effective and efficient for display of complex relationships

    among objects such as sophisticated semantic networks. As an IV means, it can be applied to a wide

    spectrum of IR environments, ranging from information searches [7, 20], author co-citation analysis¹ [80],

    and term co-occurrence analysis² [16], to Internet information representation [6].

    Specifically for query searching, after a query is submitted to the network, the relevance between the

    query and a document is calculated using the Pearson correlation coefficient, and the relevance is indicated

    by the height of a rising spike from the document [7]. In another case [20, 16], both the query and a

    document are converted into two Pathfinder associative networks, and the similarity between a query and

    a document is the similarity between the two Pathfinder networks. The proximity algorithm consists of

    two parts. The first part is defined as the ratio of common terms in both a query and a document to the

    number of all terms in the query. The second part measures the network structure similarity between the

    query network and a document network. The value of this part increases when nodes (terms) connected

    in the query network also appear closely connected in the document network. Finally, the two parts are

    weighted and integrated into a final similarity value.
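
    The two-part similarity can be illustrated with a deliberately simplified sketch. The first part
    follows the text (common terms divided by the number of query terms). The second part is an
    assumption made only for illustration: it counts the fraction of query-network links whose two
    terms are also directly linked in the document network, which is a crude stand-in for "closely
    connected". The function name network_similarity and the weight parameter are hypothetical.

```python
def network_similarity(query_terms, query_links, doc_terms, doc_links, weight=0.5):
    """Combine term overlap and link overlap into one similarity value.

    query_terms, doc_terms: sets of terms (the nodes of the two networks).
    query_links, doc_links: iterables of (term, term) pairs (undirected links).
    weight: share given to the term-overlap part (assumed value).
    """
    if not query_terms:
        return 0.0
    # Part 1: ratio of common terms to the number of terms in the query.
    term_part = len(query_terms & doc_terms) / len(query_terms)
    # Part 2 (simplified assumption): share of query links that also exist
    # as direct links in the document network.
    q_links = {frozenset(link) for link in query_links}
    d_links = {frozenset(link) for link in doc_links}
    structure_part = len(q_links & d_links) / len(q_links) if q_links else 0.0
    return weight * term_part + (1 - weight) * structure_part


similarity = network_similarity(
    {"a", "b", "c"}, [("a", "b"), ("b", "c")],
    {"a", "b", "d"}, [("b", "a"), ("a", "d")],
)
```

    Here two of the three query terms occur in the document and one of the two query links is
    present, so the value is 0.5 * 2/3 + 0.5 * 1/2.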

    The weaknesses of the Pathfinder associative network include its computational complexity, which may

    prevent PFNET from visualizing a large dataset and from being modified dynamically in response to

    interactions between users and the network. Another disadvantage of PFNETs in the present state of development

    is that people have no way of knowing the features upon which similarity judgments are made, so

    the semantic content of links is not easily discernible. Finally, PFNET cannot generate a local visual

    configuration based on users' individual information needs; it only produces a global overview of a

    data collection.

    ¹ A phenomenon occurring when the authors of two different papers both co-cite the same paper(s) in their work.

    ² Keywords appearing together in a predefined length of text in the same document.

    4.4 Multidimensional Scaling Models (MDS)

    The MDS technique consists of a group of methods used to discover empirical relationships among investigated

    objects, by visualizing them and presenting their geometric representation in a low-dimensional

    display space. It can be used to reveal and illustrate hidden patterns for a set of proximity measures

    among objects for multivariate, exploratory, and visual data analysis. An MDS algorithm starts with a

    matrix of item-item similarities, and then assigns a location to each item in an N-dimensional space (N is

    specified a priori), where users may perceive and analyze the relationships among the displayed objects.

    For sufficiently small N, the resulting locations may be displayed in a graph or 3D visualisation. The more

    similar two objects are, the closer to each other they are placed, and vice versa.

    One advantage of the MDS technique is the diversity of its algorithms, each of which handles

    different situations. They can be classified into metric and non-metric MDS algorithms, based upon the

    types of input proximity data. The non-metric MDS algorithm is applied to qualitative³ proximity data,

    while metric MDS is applied to quantitative⁴ proximity data. Another category of MDS technique is

    the classical MDS algorithm, which is used with quantitative proximity data.

    Applications of MDS in IR can be roughly categorized into two groups, based on the proximity definition:

    one is to use a co-citation method to define the proximity metric, and the other is to use a non-co-

    citation method such as traditional distance-based or angle-based similarity measures. However, applying

    traditional MDS to a very large data set may be prohibitively slow, since it uses a linear algebra solution

    for the problem, which is computationally costly and makes heavy demands on storage. On the other

    hand, the non-metric (metric) MDS method looks for the best match between the original proximity of

    two objects and their Euclidean distance in a low-dimensional space, using an iterative process that starts with

    a random initial configuration. The Kruskal algorithm, which is used for the minimization, is iterative and

    simple, and its computational complexity is in practice almost O(N).
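
    The iterative process can be sketched as plain gradient descent on the squared mismatch between
    the target dissimilarities and the distances in the display space. This is a simplified metric
    variant written for illustration, not Kruskal's algorithm, which additionally fits a monotone
    transformation of the proximities. The function names and the default step size are assumptions.

```python
import math
import random


def stress(points, target):
    """Sum of squared differences between display and target distances."""
    n = len(points)
    return sum((math.dist(points[i], points[j]) - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def mds(target, dims=2, steps=2000, rate=0.05, seed=0):
    """Iteratively place items so that distances approximate `target`.

    target: symmetric matrix of dissimilarities between the items.
    Returns a list of coordinates, one list of `dims` numbers per item.
    """
    rng = random.Random(seed)
    n = len(target)
    # Start from a random initial configuration.
    points = [[rng.uniform(-1, 1) for _ in range(dims)] for _ in range(n)]
    for _ in range(steps):
        grads = [[0.0] * dims for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(points[i], points[j]) or 1e-12
                coeff = 2 * (d - target[i][j]) / d
                for k in range(dims):
                    grads[i][k] += coeff * (points[i][k] - points[j][k])
        # Move every point a small step against its stress gradient.
        for i in range(n):
            for k in range(dims):
                points[i][k] -= rate * grads[i][k]
    return points


# Dissimilarities of a 3-4-5 triangle, which fits exactly in two dimensions.
target = [[0, 3, 4],
          [3, 0, 5],
          [4, 5, 0]]
start = mds(target, steps=0)   # the random initial configuration
final = mds(target)            # same seed, after the iterations
```

    Comparing stress(start, target) with stress(final, target) shows how much the iterations improve
    the match for a given step size and number of steps.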

    Furthermore, the huge number of displayed objects in a low-dimensional space raises concerns in terms of

    efficient system implementation and information representation in the MDS display space, for interactive

    systems. To solve the problem, people use the supernode method [68], which visualizes object clusters and

    objects at different levels respectively. In the MDS visual display space, documents are clustered first,

    so that documents that are highly related in terms of co-citation form new supernodes. So instead

    of individual documents, the system displays these supernodes. Documents within a supernode can be

    visualized, at a lower level, if users zoom in on a selected cluster.

    ³ Qualitative proximity data refers to ordinal data.

    ⁴ Quantitative proximity data refers to ratio-scaled data.

    Figure 4.8: Display of ThemeScape and Galaxy Visualizations of IN-SPIRE Visualization Program

    Another potential problem is the intuitive representation of projected objects in a low dimensional MDS

    space. It is extremely important for users to easily understand and meaningfully interpret the graphic

    presentation. Towards that aim, the MDS approach was combined with the so-called ecological approach, in

    order to take advantage of natural display formats that humans are used to [83]. The ecological landscape is

    an MDS display space, which consists of a group of ecologically connected local landscapes. Each landscape

    represents an object cluster. The size of each local landscape is related to the number of documents

    containing the thematic term that defines the local landscape. A document is positioned based upon its

    indexing terms, the thematic term, and the category assigned to the document. Figure 4.8 shows the

    ThemeScape and Galaxy visualizations using the IN-SPIRE software.

    4.5 Self-organizing Map Model (SOM)

    The SOM (a neural network) is a nonlinear topology-preserving projection method that converts a high-dimensional

    space into a low-dimensional (1D, 2D, or 3D) grid (feature map), as shown in Figure 4.9. There

    are three spaces which are involved in SOM: the high-dimensional document vector space (associated with

    objects), the high-dimensional weight vector space (associated with the nodes of the display grid), and the

    low-dimensional visual space (the display grid). During the learning process, each input vector is picked at random

    and assigned to the closest neuron, that is, the one whose weight vector is most similar to it. After the

    training process, the documents are projected onto the feature map and labels are assigned to the feature

    map areas (most of the time based on the weights).
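
    The learning process can be sketched as follows. This is a minimal illustration under stated
    assumptions rather than a reference implementation: the grid is 2D, closeness is measured with
    the Euclidean distance, the learning rate and neighbourhood radius decay linearly, and the
    neighbourhood function is Gaussian. All names and default values are hypothetical.

```python
import math
import random


def best_matching_unit(weights, vector):
    """Index of the node whose weight vector is closest to the input."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], vector))


def train_som(data, rows, cols, epochs=50, rate0=0.5, seed=0):
    """Train a rows x cols feature map; returns one weight vector per node."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    radius0 = max(rows, cols) / 2
    total = epochs * len(data)
    for step in range(total):
        decay = 1 - step / total
        rate = rate0 * decay                    # shrinking learning rate
        radius = max(radius0 * decay, 0.5)      # shrinking neighbourhood
        vector = rng.choice(data)               # input picked at random
        bmu = best_matching_unit(weights, vector)
        b_row, b_col = divmod(bmu, cols)
        for node, w in enumerate(weights):
            row, col = divmod(node, cols)
            grid_dist2 = (row - b_row) ** 2 + (col - b_col) ** 2
            influence = math.exp(-grid_dist2 / (2 * radius ** 2))
            # Pull the node towards the input; grid neighbours of the
            # winning node move less the farther away they are.
            for k in range(dim):
                w[k] += rate * influence * (vector[k] - w[k])
    return weights


data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
weights = train_som(data, rows=2, cols=2)
cell = divmod(best_matching_unit(weights, [0.0, 0.0]), 2)  # (row, col) on the map
```

    After training, each document is projected onto the grid cell of its best matching unit, which
    is the step that places documents on the feature map.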

    Figure 4.9: A SOM Feature Map

    Each partitioned area in the map clearly represents one or more concepts and the documents associated with those

    concepts. The size of each area in the map indicates term occurrence frequencies or the possible size of the

    projected documents. After term labeling processing, semantically related areas are also