Mediated Information Retrieval – The WebCluster and MIR Projects – Gheorghe Muresan School of Communication, Information and Library Sciences Rutgers University


Structure of the talk

The WebCluster project

Design decisions in WebCluster

The MIR project

Integrated approach to interaction modeling, logging and analysis

WebCluster - Motivation

[Figure: an information need (within some subject domain) is turned into a query for a WWW search engine]

Gulfs:
– between the information need and the query
– between the structured subject domain and the unstructured target collection (WWW)

Interaction in the library

1. Select library
2. Consult catalog
3. Browse shelves
4. Use inter-library scheme

[Figure labels: Information need; Information Need Formulation]

Can we simulate the library interaction?

1. Select source collection
2. Explore source collection with ClusterBook
3. Search WWW

[Figure labels: Information need; Information Need Formulation; Structured source collections; Results]

The mediated access interaction

[Figure labels: Information need; Query; WebCluster; Web search engine; Specialised source; Target collection (WWW); Topical documents]

Interaction model vs. prototype

Structuring the source collection
– Document clustering
– Supervised classification
– Manual (intellectual) classification

Exploring the structured source collection
– Metaphor: library, book, encyclopaedia
– Visualization tool: folder metaphor, hyperbolic tree, themescape, cone trees, thematic maps
– Search strategies supported: best match or cluster-based searching, browsing

Model vs. prototype

Interaction model
– Explicit (the user marks relevant documents) vs. implicit (cues on relevance are derived from user behavior/actions)
– Transparent (the user is aware of the mediation) vs. opaque (the user is happy to see the effect of ‘magic’)
– Automatic vs. manual/intellectual generation of the mediated query

Query model
– Language models (generative, Kullback-Leibler)
– Probabilistic models
– Rocchio or other RF-specific formulae

ClusterBook - Source collection

ClusterBook - Target collection

Informal experiments - Objectives -

Test the users’ reaction to the mediated access concept

Test the user satisfaction regarding the functionality of the system, and the relevance of the documents retrieved

Formative usability testing - some volunteers were not only experienced searchers, but also had experience in evaluating IR systems

Comparison of user generated queries vs. system generated queries

Note. These experiments were run at different stages of the development

Informal experiments - Experimental procedure -

Subjects received introduction to the system

Task assigned: “You are a trainee in a newspaper. You support the journalists by providing information for the topic of their articles.”

Sample topics:
– The history of the Brazilian debt crisis
– How are the quotas for growing coffee set and controlled on a world-wide basis?

Source collection: a sub-collection of Reuters (newspaper articles)

Steps followed by users (explicit scenario):
1. Formulate a query and record it
2. Browse the source collection, select the ‘best’ cluster, edit the query generated by the system, and submit it to the search engine
3. Submit the initial, self-generated query to the same search engine
4. Compare the results of the two searches

Informal experiments - Results -

Users found the mediation useful for unfamiliar topics

The system nearly always proposed new, good query terms

Users were not always good at recognizing ‘good’ query terms

The system also proposed bad query terms (not specific to the topic), so the opaque scenario is not viable unless the query formulation is improved

The two-step process was questioned when:
– the query formulation was considered easy, for a familiar topic
– the documents of the source collection were considered sufficient to cover the information need

Clustering methods: complete link and group average were OK; single link was bad

Overall, the system is usable
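
The three clustering methods compared in these results differ only in how the similarity between two clusters is derived from the similarities of their member documents. The sketch below illustrates that difference under stated assumptions: the function name `cluster_similarity`, the toy similarity `toy_sim` and the example data are hypothetical and are not taken from WebCluster.

```python
from itertools import product


def cluster_similarity(cluster_a, cluster_b, sim, method):
    """Similarity between two clusters under a given linkage criterion."""
    pair_sims = [sim(a, b) for a, b in product(cluster_a, cluster_b)]
    if method == "single":    # the most similar cross-cluster pair decides
        return max(pair_sims)
    if method == "complete":  # the least similar cross-cluster pair decides
        return min(pair_sims)
    if method == "average":   # mean over all cross-cluster pairs
        return sum(pair_sims) / len(pair_sims)
    raise ValueError(f"unknown linkage method: {method}")


def toy_sim(x, y):
    """Hypothetical similarity: 'documents' are points on a line."""
    return 1.0 / (1.0 + abs(x - y))


cluster_a, cluster_b = [0.0, 1.0], [2.0, 5.0]
single = cluster_similarity(cluster_a, cluster_b, toy_sim, "single")
complete = cluster_similarity(cluster_a, cluster_b, toy_sim, "complete")
average = cluster_similarity(cluster_a, cluster_b, toy_sim, "average")
```

Single link merges clusters as soon as one pair of documents is close, which is the usual explanation for its tendency to produce long, loosely connected chains; whether that is what happened in these experiments is not stated on the slide.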

Consequences of informal experiments

Formal experiments are needed to verify the main assumptions:
– The Cluster Hypothesis holds for a specialized collection
– Good clusters can be found with the search strategies provided
– Mediated queries can improve retrieval effectiveness

The effect of various parameters on retrieval performance should be compared:
– Weighting schemes
– Clustering methods
– Search strategies

[Figure: example cluster hierarchy with the labels Wind Energy; Power Generation; Propulsion; Fixed Plants; Portable Generators; Wind generators for yachts; Coastal Wind Farms; Inland Wind Farms; Desert Wind Farms; Pacific Rim Wind Farms; Design of Coastal Wind Farms; Design of…]

Critical issue: the label generation
– Document representatives: searching
– Cluster representatives: browsing, searching, mediation
– Collection representatives: collection selection

Mediation experiment - simulations

Objectives:
– Test the potential of mediation to increase retrieval effectiveness
– Test the effect on effectiveness of a variety of parameters

[Figure labels: Source collection; Target collection; Search engine; Simple query generator (baseline); Topic-based mediator (upper bound); Cluster-based mediation (realistic mediation)]

Experimental setup

Interactive track of TREC-8
– Offers relevance judgments for complex topics, with a multitude of aspects
– Offers the experimental design for the user experiment

Six topics with 12 to 56 aspects each

Target collection: FT 1991-4, with 210,158 articles

Source collection built from the relevance judgments: half of the relevant documents, their nearest neighbors, plus the documents judged non-relevant

Results – the cluster hypothesis

Aspectual cluster hypothesis confirmed by an extended version of the van Rijsbergen – Sparck Jones separation test
– Similarity between pairs of docs covering the same aspect is higher than between pairs of docs covering the same topic, which in turn is higher than between pairs of docs in the collection

Consequence confirmed: clustering groups documents in pockets of relevance

Results – retrieval effectiveness

Tf-Idf > KL > RelFreq as weighting schemes for document representation

Adding disambiguation terms to the query increases recall, but decreases precision

Nearest-neighbor mediation (“more like this”) highly significantly improves both recall and precision, even if just one exemplary document is offered for each topic aspect

Cosine and Dice perform similarly
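
The two similarity measures and the "more like this" mediation named in these results can be sketched as follows. This is an illustration only: documents are assumed to be sparse term-weight dictionaries, the Dice coefficient is taken in its usual weighted-vector form, and all function names, document ids and weights are hypothetical.

```python
import math


def _dot(u, v):
    """Dot product of two sparse term-weight vectors (dicts)."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def cosine(u, v):
    norm_u = math.sqrt(_dot(u, u))
    norm_v = math.sqrt(_dot(v, v))
    return _dot(u, v) / (norm_u * norm_v) if norm_u and norm_v else 0.0


def dice(u, v):
    denom = _dot(u, u) + _dot(v, v)
    return 2.0 * _dot(u, v) / denom if denom else 0.0


def nearest_neighbors(example, collection, sim, k=1):
    """Ids of the k documents most similar to one exemplary document."""
    ranked = sorted(collection.items(),
                    key=lambda item: sim(example, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical toy data: one exemplary document and a three-document target.
example = {"coffee": 1.0, "quota": 1.0}
target = {
    "d1": {"coffee": 1.0, "quota": 1.0, "export": 1.0},
    "d2": {"debt": 1.0, "crisis": 1.0},
    "d3": {"coffee": 1.0},
}
by_cosine = nearest_neighbors(example, target, cosine, k=2)
by_dice = nearest_neighbors(example, target, dice, k=2)
```

On this toy data both measures produce the same ranking, which is only meant to show how the two formulas differ in normalization, not to reproduce the reported finding.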

Mediation results

Upper-bound experiment (all relevant docs known in source)
– Both recall and precision increase with query length
– Query term weights strongly affect performance
– No evidence that uniformity of term frequency affects performance

Clustered source mediation
– Best cluster mediation increases P, decreases R
– “Fuse and search”: strong increase in R and P
– “Search and fuse”: good R, terrible P!

Contributions of WebCluster

Proposes and explores system-based mediated access to very large heterogeneous document collections

Explores the use of clustering for capturing the topical, semantic structure of a problem domain (as represented by a specialized collection)

Explores the use of language models for building cluster and document representatives

Offers a framework for building structured portals on the WWW

Offers a framework for building collaborative environments




The MIR project







User experiment – effectiveness of mediated information retrieval for Web searches

Experimental design (rows: between-subjects factor; columns: within-subjects factor)

Display | Non-mediated | Mediated | Total subjects
Linear ranking list | NL: Linear (Web) | ML: Linear (NJEDL and Web) | 16
Combination of linear ranking and classified display of search results | NC: Combination (Web) | MC: Combination (NJEDL and Web) | 16
Total subjects | 32 | 32 | 32

NL: Non-mediated and Linear; NC: Non-mediated and Combination; ML: Mediated and Linear; MC: Mediated and Combination

Research Hypotheses

The mediated system is conducive to higher effectiveness than the non-mediated system

The combination of linear/ranked display with a hierarchic/clustered display is conducive to higher effectiveness than simple ranked display

Mediation assumptions

Relevant documents tend to be clustered together in the source collection
– The cluster hypotheses

Subjects can identify relevant documents
– Subjects spend time exploring the source collection and some relevant documents

Queries submitted after mediation are better
– Longer, higher clarity, wider vocabulary

Other areas of exploration

Interaction models Transitions between states and activities

The effect on search behavior of subject expertise or familiarity with a topic

The subjects’ ability to recognize good documents and clusters based on snippets or labels

Compare user-generated queries with system-generated queries in terms of performance

User experiment – no mediation

User experiment – mediated access

User experiment – mediated access






Motivation

Interest in studying Human Information Behavior and Interactive Information Retrieval
– Qualitative aspects: patterns of behavior → user models → predictions of future behavior
– Quantitative aspects: # queries, # query terms, # documents viewed / opened / saved, # errors / corrections, time spent → conclusions regarding retrieval effectiveness & efficiency

Typical tools
– Think-aloud protocols, video recording, questionnaires, interviews, activity logging

Motivation

Logging – options:
– Commercial tools (Morae, uLog, Camtasia, etc.): expensive, less control over what is logged, format usually proprietary
– DIY (log events related to the research questions): rather inflexible; what if new research ideas come to light?

Idea #1: log all semantic events
– Identified during interaction / interface design → integrate interface design, logger design and log analyzer design

Motivation - practical

Frustration with existing practices in IR research
– Rutgers participation in Interactive TREC 2002: user interface, logging

Idea #2: use state-based design, logging and analysis

Advantages:
– Design tools are plentiful
– The entire research team can participate in design
– Once the design is completed, the procedures to generate the logging software and the log analyzer are deterministic

Typical interactive IR experiment

Research Hypotheses

Prepare experimental system

Design system

Build system

Add functionality for logging interactions

Run experiment

Generate experimental data, including logs of interactions

Analyze experimental data

Draw conclusions

Problems with this experimental model (based on anecdotal evidence)

– The system is built, and the experimental data is extracted from the logs, by “those who can program” in the research group
– Most researchers are not involved in specifying system requirements
– The design stage is often skipped
– The system may display usability problems, which affect the exploration of the research problems
– The logging functionality is added ad hoc, in an unsystematic fashion
– There is no standard format for interaction logs; additional software is needed for parsing logs and extracting useful data

Proposed integrated approach (Keywords: UML, DTD, XML)

Research Hypotheses

Conceptual design of the system (UML model of interactions)

DTD model of interactions

Experimental system

Build system

XML logger

Run experiment

Generate experimental data, including logs of interactions

XML log format

XML parser

Analyze experimental data

Data visualization

Summary

Model the interaction using UML diagrams
– This allows the whole research team to contribute to the design of the user interfaces, and supports the documentation of the interface.

Derive a DTD coding for the states of the user interface, the valid user actions in every state, and the state transitions that take place based on user actions

Use XML to log the user actions and their outcomes based on the interaction DTD

Based on the DTD, generate a log parser for the log analysis

The log analysis provides interaction information and can also be used for generating a visual / graphic representation of the interaction

MIR – state diagram

[Figure: UML state diagram. States: Idle; UserState; Thinking; EditingQuery; ViewResults; ExploreTarget; ViewTargetHitList; ViewTargetDoc; SavingDoc; ViewSavedDoc; ExploreSource; ViewSourceHierarchy; ViewSourceDoc; ViewSourceHitList; H. Events: evStartTask; evQueryEdit; evNotQueryEdit; evSelectPane; evSelectDoc; evDisplayDoc; evHierarchy; evList; evStartSavingDoc; evSaveDoc; evSelectSavedDoc; evUnsave]

MIR – DTD

MIR – XML log

MIR – logger class diagram

UserState
  # seenWarning : boolean = false; # startMillis : long = 0; # stopMillis : long = 0
  + setSeenWarning(); + getSeenWarning(); + setStartMillis(); + setStopMillis(); + getDuration(); + getDuration(); + toString(); + handleSubmitQuery()

Think
  + Think(); + handle()

EditQuery
  - query : String; - collection : String
  + EditQuery(); + handleSubmitQuery(); + getQuery(); + setQuery(); + getCollection(); + setCollection()

ViewResults
  + ViewResults(); + handle()

ExploreSource

ExploreTarget

ViewSourceHierarchy

ViewSourceHitList

ViewSourceDoc
  + ViewSourceDoc(); + handleViewSourceDoc(); + getDoc(); + setDoc()

ViewTargetHitList

ViewTargetDoc
  + ViewTargetDoc(); + handleViewTargetDoc(); + getDoc(); + setDoc()

SavingDoc
  - confirmed : boolean; - option : int; - aspects : String
  + SavingDoc(); + setDoc(); + getDoc(); + getAspects(); + isSaveConfirmed(); + getOption(); + setAspects(); + setSaveConfirmed(); + setOption()

ViewSavedDoc
  - unsaved : boolean = false
  + ViewSavedDoc(); + getDoc(); + setDoc(); + isDocUnsaved(); + setDocUnsaved(); + handleViewingSavedDoc(); + handleUnsaveDoc()

Doc
  - intId : String = null; - extId : String = null
  + Doc(); + getId(); + getExtId(); + toString()

LogAnalyzer

XPathLogAnalyzer

DOMLogAnalyzer
  - cSource : int = 2; - cTarget : int = 1; - collId : int = -1; - millis : long
  + DOMLogAnalyzer(); + changeState(); + getUserState(); + analyze(); + setMillis(); + getMillis(); + handleStartSession(); + handleEndSession(); ~ handleEditQuery(); ~ handleSubmitQuery(); ~ handleSelectPane(); ~ handleDoc(); ~ handleDisplayDoc(); ~ handleResult(); ~ handleSearchResults(); ~ handleStartSaveDoc(); ~ handleSaveDoc(); ~ handleUnsaveDoc(); ~ handleCluster(); ~ handleTouchCluster(); ~ handleShowMessage(); + summarize(); + report(); + getAspects()

LogScanner
  + LogScanner(); + visitDocument(); ~ visitElement_log(); ~ visitElement_record(); ~ visitElement_date(); ~ visitElement_millis(); ~ visitElement_message(); ~ visitElement_StartSession(); ~ visitElement_EndSession(); ~ visitElement_EditQuery(); ~ visitElement_SubmitQuery(); ~ visitElement_SearchResults(); ~ visitElement_Result(); ~ visitElement_Doc(); ~ visitElement_SelectPane(); ~ visitElement_DisplayDoc(); ~ visitElement_StartSaveDoc(); ~ visitElement_SaveDoc(); ~ visitElement_UnsaveDoc(); ~ visitElement_TouchCluster(); ~ visitElement_Cluster(); ~ visitElement_ShowMessage()

Association role names: state; doc; analyzer; logScanner

Design decisions

Design patterns
– State: each state of the interface/system is modeled by a class
– Inheritance (class hierarchy) is used to model sub-states (states at different levels of granularity)
– Composition is used to model orthogonal states
– Visitor: decouples the strategy chosen for parsing / visiting the log from the actions taken in each node
– Strategy: supports different analysis strategies
– DOMLogAnalyzer: visits the entire log tree, for a comprehensive analysis
– XPathLogAnalyzer: visits only a selection of nodes, relevant for a certain research hypothesis (“log/record/message/SaveDoc”)
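
The difference between the two analysis strategies can be sketched with Python's standard `xml.etree.ElementTree`. The element names (`log`, `record`, `millis`, `message`, `SaveDoc`, `Doc`, ...) are the ones that appear in the logger class diagram and in the path quoted above; the nesting of `Doc` inside `SaveDoc`, the element contents and the sample values are assumptions made for illustration, not the actual MIR log format.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical log excerpt; only the element names come from the slides.
SAMPLE_LOG = """<log>
  <record><millis>1000</millis><message><StartSession/></message></record>
  <record><millis>5000</millis><message><SubmitQuery>wind farms</SubmitQuery></message></record>
  <record><millis>9000</millis><message><SaveDoc><Doc>d17</Doc></SaveDoc></message></record>
  <record><millis>15000</millis><message><SaveDoc><Doc>d42</Doc></SaveDoc></message></record>
  <record><millis>20000</millis><message><EndSession/></message></record>
</log>"""

root = ET.fromstring(SAMPLE_LOG)  # root is the <log> element

# Selective analysis: visit only the nodes relevant to one hypothesis.
# The path is relative to <log>, so "log/record/message/SaveDoc" loses its
# first step.
saved_ids = [node.findtext("Doc")
             for node in root.findall("record/message/SaveDoc")]

# Comprehensive analysis: walk every record and tally the event types.
event_counts = Counter(record.find("message")[0].tag
                       for record in root.findall("record"))
```

The selective query answers one question with one line, while the full traversal is the natural place to hang per-event handlers such as those listed for `DOMLogAnalyzer`.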

Design decisions

Singletons vs. multiple objects for states
– Singleton: one object for each class
– Adv: simple
– Disadv: only appropriate for cumulative data or summaries
– Multiple objects
– Adv: supports accurate, detailed analysis

Explicit vs. implicit logging of states
– Explicit: allows a human reader to interpret the logs; redundancy; problems capturing orthogonal states
– Implicit: only events are captured, states are re-created
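
With implicit logging, the analyzer has to rebuild the sequence of states from the logged events. The sketch below shows one way this could work, using one record per state visit (the "multiple objects" option). The event-to-state table is a hypothetical simplification invented for this example; the real transitions are defined by the MIR state diagram, and the state and event names are borrowed from the slides only as labels.

```python
# Hypothetical mapping from a logged event to the state it leads to.
EVENT_TO_STATE = {
    "StartSession": "Think",
    "EditQuery": "EditQuery",
    "SearchResults": "ViewTargetHitList",
    "DisplayDoc": "ViewTargetDoc",
    "StartSaveDoc": "SavingDoc",
    "SaveDoc": "ViewTargetHitList",
}


def recreate_states(events, end_millis):
    """Turn time-ordered (millis, event) pairs into (state, start, stop) visits."""
    visits = []
    current, start = None, None
    for millis, name in events:
        new_state = EVENT_TO_STATE.get(name)
        if new_state is None:
            continue  # this event does not change the state
        if current is not None:
            visits.append((current, start, millis))
        current, start = new_state, millis
    if current is not None:
        visits.append((current, start, end_millis))
    return visits


events = [
    (0, "StartSession"),
    (4000, "EditQuery"),
    (13000, "SearchResults"),
    (28000, "DisplayDoc"),
    (30000, "ShowMessage"),   # ignored: not in the transition table
    (106000, "StartSaveDoc"),
]
visits = recreate_states(events, end_millis=122000)
durations = [(state, (stop - start) // 1000) for state, start, stop in visits]
```

Keeping one visit record per state entry preserves the order and duration of every visit, whereas a singleton per state class could only accumulate totals.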

Types of analysis supported

State transitions, user behavior
– Average user vs. individual user
– Levels of state granularity (Think, EditQuery, ViewResults (ExploreSource (ViewSourceHierarchy, ViewSourceHitList, ViewSourceDoc), ExploreTarget (ViewTargetHitList, ViewTargetDoc, ViewSavedDoc)))

Statistical analysis on qual data
– ANOVA shows no significant difference in number of saved docs between non-mediated condition (m=3.94, sd=1.76) and mediated condition (m=3.13, sd=1.62)

Think 4, EditQuery 9, ViewTargetHitList 15, ViewTargetDoc 78, SavingDoc 16, ViewTargetHitList 6, ViewTargetDoc 31, ViewTargetDoc 9, ViewTargetDoc 35, SavingDoc 11, ViewTargetHitList 3, ViewTargetDoc 173, SavingDoc 16, ViewTargetHitList 14, EditQuery 7, ViewTargetHitList 4, ViewTargetDoc 17, ViewTargetDoc 59, ViewTargetDoc 51, ViewTargetDoc 39, EditQuery 13, ViewTargetHitList 25, ViewTargetDoc 38, SavingDoc 15, …
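
A state sequence like the one above lends itself to simple aggregate analyses. The sketch below uses the first ten entries of that sequence and computes a per-state total and the state-transition counts. The slide does not say what the number attached to each state measures; the code only sums it, and the variable names are illustrative.

```python
from collections import Counter, defaultdict

# First ten (state, value) pairs of the sequence shown on the slide.
SEQUENCE = [
    ("Think", 4), ("EditQuery", 9), ("ViewTargetHitList", 15),
    ("ViewTargetDoc", 78), ("SavingDoc", 16), ("ViewTargetHitList", 6),
    ("ViewTargetDoc", 31), ("ViewTargetDoc", 9), ("ViewTargetDoc", 35),
    ("SavingDoc", 11),
]

# Total of the logged values per state.
totals = defaultdict(int)
for state, value in SEQUENCE:
    totals[state] += value

# How often each state is followed by each other state.
states = [state for state, _ in SEQUENCE]
transitions = Counter(zip(states, states[1:]))

# Collapse to a coarser level of granularity (mapping taken from the
# hierarchy listed above; SavingDoc is left unmapped on purpose).
COARSE = {"ViewTargetHitList": "ExploreTarget", "ViewTargetDoc": "ExploreTarget"}
coarse_totals = defaultdict(int)
for state, value in SEQUENCE:
    coarse_totals[COARSE.get(state, state)] += value
```

The same tallies computed per subject and then averaged would give the "average user" view, while the raw sequence supports the "individual user" view.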

Conclusions

Advantages of the proposed method
– Better teamwork: all members contribute and are responsible for the design
– More accurate experimental results: increased usability of the experimental system; accurate data, due to accurate logging of events
– Less effort in testing and debugging, as well as in parsing and analyzing results: the DTD offers the interaction template; XML logs support debugging; open-source XML parsers are available

Future work

Automatic (or semi-automatic) generation of the DTD model from the UML model
– Conceptual problem: designing a transition scheme between the two models
– Practical problem: interpreting the format that various modeling packages use to store UML models

Visualization of the interaction
– Model: timeline of the interaction vs. summary
– Format: HTML vs. SVG vs. …
– Automatic generation: programming language (Java) vs. transformation template (XSLT)

Questions ?

Query formulation problems

Vague information need

Vocabulary mismatch

Difficulty of query language syntax

Lack of context, ambiguity of terms

Lack of a search strategy

No understanding of the underlying indexing/searching model

Note. TREC experiments have shown that the quality of the query has a higher impact on retrieval effectiveness than weighting schemes or search algorithms.

Role of structure

[Figure: example concept hierarchy with the labels Science; Computing, Mathematics, Physics; Computing; Computer; Screen, Keyboard; Programming language; C++, Pascal; Mathematics; Algebra; ...]

– Reveals the semantic structure of the domain & its concepts
– Groups (semantically ?) similar documents
– Supports exploration and concept formation
– Supports term disambiguation (context)
– (Has potential for efficient retrieval)
– (Has potential for effective retrieval)

Browsing label (relative cluster representative)

[Figure: the Wind Energy cluster hierarchy, with a cluster marked against its parent]

$$\mathrm{KLR}(\mathrm{cluster}, \mathrm{parent}) = \sum_i p_{\mathrm{cluster},i} \log \frac{p_{\mathrm{cluster},i}}{p_{\mathrm{parent},i}}$$

Searching label (absolute cluster representative)

[Figure: the Wind Energy cluster hierarchy, with a cluster marked against the whole collection]

$$\mathrm{KLA}(\mathrm{cluster}, \mathrm{collection}) = \sum_i p_{\mathrm{cluster},i} \log \frac{p_{\mathrm{cluster},i}}{p_{\mathrm{collection},i}}$$
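
The browsing and searching labels are described as relative and absolute cluster representatives built from Kullback-Leibler-style weights. A minimal sketch of that idea follows, assuming each term is weighted by its contribution p_cluster * log(p_cluster / p_reference), with the parent cluster as reference for the relative representative and the whole collection as reference for the absolute one. The function names and all probabilities are made up for illustration, and the exact weighting used in WebCluster may differ.

```python
import math


def kl_term_weights(cluster_probs, reference_probs):
    """Per-term contributions p_c * log(p_c / p_ref) to the KL divergence."""
    weights = {}
    for term, p_c in cluster_probs.items():
        p_ref = reference_probs.get(term, 0.0)
        if p_c > 0.0 and p_ref > 0.0:  # skip terms the reference never uses
            weights[term] = p_c * math.log(p_c / p_ref)
    return weights


def top_terms(cluster_probs, reference_probs, k=1):
    """The k highest-weighted terms, usable as a cluster label."""
    weights = kl_term_weights(cluster_probs, reference_probs)
    return sorted(weights, key=weights.get, reverse=True)[:k]


# Hypothetical term distributions for a "Coastal Wind Farms" cluster.
cluster = {"wind": 0.4, "coastal": 0.3, "farm": 0.3}
parent = {"wind": 0.4, "coastal": 0.1, "farm": 0.3, "inland": 0.2}
collection = {"wind": 0.05, "coastal": 0.02, "farm": 0.03, "energy": 0.9}

browsing_label = top_terms(cluster, parent)      # relative to the parent
searching_label = top_terms(cluster, collection)  # relative to the collection
relative_kl = sum(kl_term_weights(cluster, parent).values())
```

In this toy case the relative label picks the term that separates the cluster from its siblings ("coastal"), while the absolute label picks the term that best characterizes it against the whole collection ("wind"), which matches the browsing versus searching roles named on the slides.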

Mediation label (expanded cluster representative)

[Figure: the Wind Energy cluster hierarchy, with a cluster and the levels around it marked]

$$E_i = A_{0,i} + (1-\alpha) A_{1,i} + (1-\alpha)^2 A_{2,i} + \dots + (1-\alpha)^{r-1} A_{r-1,i} + (1-\alpha)^r A_{r,i}$$

Topic model representations

Exemplary representation

Statistical representation

Statistical analysis

Language model

Context analysis

Typical terms, weighted

Thresholding

Mediated query

Keyword representation

The cluster hypothesis

Reminder: the original cluster hypothesis
– “Closely associated documents tend to be relevant to the same requests” (van Rijsbergen)

Aspectual cluster hypothesis: Highly similar documents tend to be relevant to the same topic. However, documents relevant to the same topic may be quite dissimilar if they cover distinct aspects of the topic.

Consequence: Clustering algorithms tend to group together documents that cover highly focused topics, or aspects of a complex topic. Documents covering distinct aspects of complex topics tend to be spread over the cluster structure.

Aspects of relevance in the mediated access process

Distribution of relevant documents in clusters

[Chart: Clustering vs. Random; x-axis: Clusters; y-axis: Recall (0% to 25%); series: Clustering, Random]

WebCluster scenario #1

[Figure labels: WebCluster; Web Search Engine; WWW; clusters c0, c1, c2, c3, c4, c5 and c’0, c’2, c’3, c’5. Legend: document from the source collection; document from the target collection (WWW)]

Name: Transparent mediated access

Targeted users: Experienced searchers

Specific:
– The users are aware of the mediation process, and of the separation between the source and target collections
– The users have the option to edit the query generated (proposed) by the system. They understand the indexing / searching model.

WebCluster scenario #2

[Figure labels: WebCluster; Web Search Engine; WWW; clusters c0, c1, c2, c3, c4, c5 and c’0, c’2, c’3, c’5. Legend: document from the source collection; document from the target collection (WWW)]

Name: Opaque mediated access

Targeted users: Naive / beginner searchers

Specific:
– The users explore the structure of the domain, which contains sample documents, and have the option of asking for similar documents
– The users are unaware of the mediation process: the query generation and target search are not visible

Initial user interface (Java AWT)

History of the IR research group at RGU

“Systemic” approach with an interest in building software frameworks for IR
– Eclair - An Extensible Class Library for Information Retrieval (Harper et al, ’92)
– Flair – A Flexible Architecture for Information Retrieval (Jose et al, ’96)
– Epic - A Photographic Retrieval System Based on Evidence Combination Approach
– Fireworks - An Architecture for Implementing Extensible Information-Seeking Environments (Hendry et al, ’96)
– SketchTrieve - An Informal Information-Seeking Environment

Initial plan of work for WebCluster

Produce a flexible Clustering Framework (CF) that can apply a variety of clustering algorithms on:
– Static and dynamic (on the fly) document collections
– User profiles
– Sources of information

“Play” with the CF to understand how clustering works on various collections

Use the CF for structuring source collections in view of mediation

Design, build and test a few user interfaces for mediation

The Clustering Framework

[Figure labels: User; Application; Clustering Framework Kernel; CF-User Interface; CF-Document Collection Interface; CF-Web Search Engine Interface; CF-ECLAIR Interface; CF-ObjectStore Interface; CF-File System Interface; WEB; ECLAIR IRS; ObjectStore; File System]

Requirements of the CF

Generality: not devoted to a particular document collection, nor to a particular IRS.

Document independence: the documents are not managed by the CF, which handles only their representatives.

Flexibility and reusability: large variety of clustering methods, inter-document similarity measures, halt conditions, ...

Storage management independence: we can use a file system, an OODBMS, ..., to make the result persistent.

Adaptability and extensibility: users can add their own clustering methods.

Transparency: for reuse as a toolbox in other applications.

Original design of the CF

Example of clustering parameter file

GroupAverageCM CosineSimilarityMeasure InqueryTfIdfWeighting 0 6 0.0001 0.001

1. Clustering method. Possible values: CompleteLinkCM, GroupAverageCM, SingleLinkCM
2. Similarity measure. Possible values: CosineSimilarityMeasure, DiceCoefficientSimilarityMeasure
3. Weighting measure (indexing). Possible values: FreqWeighting, RelFreqWeighting, KLWeighting, KLRelWeighting, InqueryTfIdfWeighting
4. Cluster size threshold (don't agglomerate clusters with sufficient docs)
5. Halt condition (stop when down to a certain number of clusters). Note: value 0 means no restriction; stop when you can't cluster anymore.
6. Similarity measure threshold
7. Cluster cleaning threshold
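
A parameter file of this kind can be read and validated in a few lines. The sketch below assumes the seven values appear in the order of the descriptions above and are separated by whitespace (spaces or newlines); the class name `ClusteringParams`, its field names and the function `parse_params` are hypothetical and do not come from the Clustering Framework.

```python
from dataclasses import dataclass

ALLOWED = {
    "clustering_method": {"CompleteLinkCM", "GroupAverageCM", "SingleLinkCM"},
    "similarity_measure": {"CosineSimilarityMeasure",
                           "DiceCoefficientSimilarityMeasure"},
    "weighting": {"FreqWeighting", "RelFreqWeighting", "KLWeighting",
                  "KLRelWeighting", "InqueryTfIdfWeighting"},
}


@dataclass
class ClusteringParams:
    clustering_method: str
    similarity_measure: str
    weighting: str
    cluster_size_threshold: int   # assumed to be an integer count
    halt_condition: int           # 0 means no restriction
    similarity_threshold: float
    cleaning_threshold: float


def parse_params(text):
    """Parse seven whitespace-separated values into a ClusteringParams."""
    tokens = text.split()
    if len(tokens) != 7:
        raise ValueError(f"expected 7 values, found {len(tokens)}")
    for field, value in zip(ALLOWED, tokens[:3]):
        if value not in ALLOWED[field]:
            raise ValueError(f"unknown {field}: {value}")
    return ClusteringParams(tokens[0], tokens[1], tokens[2],
                            int(tokens[3]), int(tokens[4]),
                            float(tokens[5]), float(tokens[6]))


params = parse_params(
    "GroupAverageCM CosineSimilarityMeasure InqueryTfIdfWeighting "
    "0 6 0.0001 0.001"
)
```

Naming the method, measure and weighting as strings in a file fits the pluggable, pattern-based design of the framework: adding a new clustering method only requires a new allowed name and a class behind it.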

Design patterns

Used extensively, as they provide
– Flexibility and extensibility: for a research system, playing with parameters and plugging in more modules was more important than performance

Combined
– Most operations (such as ClusteringMethod) combined Strategy, Singleton, Factory Method, Product-Trader, sometimes Template Method
– For storage management we combined Strategy, Bridge and Serializer

Adapted, rather than applied blindly
– For the cluster structure, simply applying Composite (Cluster, SimpleCluster, ComplexCluster) was very inefficient; we combined it with a Mediator that indexes documents and clusters (Clustering)

The client-server architecture

[Figure: Client (ClusterBook) and Server, with interactions numbered 1 to 7. Component labels: Document collection; Indexer; Vocabulary, index, inverted file; Clustering Framework; Cluster hierarchy; Cluster-based searcher; Ranked-based searcher; (Informia) meta-search engine. Request labels: GetCollections; GetClustering; SearchClusters; SearchDocuments; GetHits; GetClusteredHits. Client fields: Query; Mediated query]

Client-server communication

[Figure labels: Client side: Client, Server_Proxy, Comms_Proxy; Server side: Server, Client_Proxy, TCP_Proxy]

Architecture / design decisions

Good ones
– Software framework in the server
– Java for the user interface
– Refactoring as language support improved
– AWT implementation replaced by Swing: MVC, native tree representation, CellRenderers
– STL instead of own library (String, List, Map, Iterator)
– Tight coupling with Informia replaced by loose coupling
– Data-centered rather than software-centered

Questionable ones
– The client-server connection (CGI, TCP, HTTP)
– Alternatives: RMI, CORBA, servlet & JNI?

Bad example: Rutgers Interactive TREC 2002 (a)

Bad example: Rutgers Interactive TREC 2002 (b)

Bad example – Logging in Interactive TREC 2002

TREC-2002 START: 2002-08-15 17:58:57

QUERY: geneticly engineered foods safety

QUERY: geneticly engineered foods safety

SAVE DOCUMENT: [G13-84-2041245] Food Safety and Biotechnology: Are They Related?

QUERY: problems genetically engineered foods

SAVE DOCUMENT: [G40-01-0459199] International Information Programs, U.S. Department of State, Economic Perspectives, October 1999

FINALLY SAVED DOCUMENTS: [G40-01-0459199] International Information Programs, U.S. Department of State, Economic Perspectives, October 1999; [G13-84-2041245] Food Safety and Biotechnology: Are They Related?

NUMBER OF VIEWED DOCUMENTS: 12

NUMBER OF UNIQUE VIEWED DOCUMENTS: 8

TREC-2002 STOP: 2002-08-15 18:03:50
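
To see why this log format is limiting, it helps to try to analyze it. The sketch below parses an excerpt of the log shown above with hand-written string handling; the variable names are illustrative. It can recover the queries, the saved document ids and the session length, but because individual events carry no timestamps, nothing can be said about how long the searcher spent in any particular activity.

```python
from datetime import datetime

# Excerpt of the log shown above, copied verbatim (including the typo
# in the first query, which is part of the logged data).
LOG = """TREC-2002 START: 2002-08-15 17:58:57
QUERY: geneticly engineered foods safety
QUERY: geneticly engineered foods safety
SAVE DOCUMENT: [G13-84-2041245] Food Safety and Biotechnology: Are They Related?
QUERY: problems genetically engineered foods
SAVE DOCUMENT: [G40-01-0459199] International Information Programs, U.S. Department of State, Economic Perspectives, October 1999
NUMBER OF VIEWED DOCUMENTS: 12
NUMBER OF UNIQUE VIEWED DOCUMENTS: 8
TREC-2002 STOP: 2002-08-15 18:03:50"""

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
queries, saved_ids, timestamps = [], [], {}

for line in LOG.splitlines():
    # Split at the first ": " only; document titles may contain colons.
    key, _, value = line.partition(": ")
    if key == "QUERY":
        queries.append(value)
    elif key == "SAVE DOCUMENT":
        saved_ids.append(value[1:value.index("]")])  # id between brackets
    elif key in ("TREC-2002 START", "TREC-2002 STOP"):
        timestamps[key] = datetime.strptime(value, TIME_FORMAT)

session_seconds = int(
    (timestamps["TREC-2002 STOP"] - timestamps["TREC-2002 START"])
    .total_seconds()
)
```

Every new question about the session would need another special case in this loop, which is the kind of ad hoc parsing effort the state-based XML logging is meant to avoid.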