Mediated Information Retrieval – The WebCluster and MIR Projects –
Gheorghe Muresan, School of Communication, Information and Library Sciences
Rutgers University
Structure of the talk
The WebCluster project
Design decisions in WebCluster
The MIR project
Integrated approach to interaction modeling, logging and analysis
WebCluster - Motivation
Information need
Query Search engine
(within some subject domain)
WWW_SearchEngine
Domain
Gulfs
– information need vs. query
– structured subject domain vs. unstructured target collection (WWW)
Information need
1. Select library
2. Consult catalog
3. Browse shelves
4. Use inter-library scheme
Information Need Formulation
Interaction in the library
1. Select source collection
Information Need Formulation
2. Explore source collection with ClusterBook
Results
Results
Information need
3. Search WWW
Can we simulate the library interaction ?
Structured source collections
The mediated access interaction
Information need
Web search engine
WebCluster
Query
Specialised source
Target collection (WWW)
Topical documents
Interaction model vs. prototype
Structuring the source collection
– Document clustering
– Supervised classification
– Manual (intellectual) classification
Exploring the structured source collection
– Metaphor: library, book, encyclopaedia
– Visualization tool: folder metaphor, hyperbolic tree, themescape, cone trees, thematic maps
– Search strategies supported: best match or cluster-based searching, browsing
Model vs. prototype
Interaction model
– Explicit (the user marks relevant documents) vs. implicit (cues on relevance are derived from user behavior/actions)
– Transparent (the user is aware) vs. opaque (the user is happy to see the effect of ‘magic’)
– Automatic vs. manual/intellectual generation of the mediated query
Query model
– Language models (generative, Kullback-Leibler)
– Probabilistic models
– Rocchio or other RF-specific formulae
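As an illustration of the Rocchio option listed above, the following is a minimal sketch of the textbook Rocchio relevance-feedback update over sparse term vectors. The function name, the dictionary representation, and the default coefficients are assumptions made for this example; the slides do not specify how WebCluster implements its query model.

```python
from collections import defaultdict


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a re-weighted query as a dict mapping term -> weight.

    query, and each document in relevant / non_relevant, is a dict
    mapping term -> weight. The coefficients are illustrative defaults.
    """
    new_query = defaultdict(float)
    # Keep the original query terms, scaled by alpha.
    for term, weight in query.items():
        new_query[term] += alpha * weight
    # Add the centroid of relevant docs, subtract the centroid of non-relevant docs.
    for docs, coeff in ((relevant, beta), (non_relevant, -gamma)):
        if not docs:
            continue
        for doc in docs:
            for term, weight in doc.items():
                new_query[term] += coeff * weight / len(docs)
    # Terms whose weight ends up non-positive are dropped.
    return {t: w for t, w in new_query.items() if w > 0}


mediated = rocchio(
    {"wind": 1.0},
    relevant=[{"wind": 1.0, "farm": 1.0}],
    non_relevant=[{"yacht": 1.0}],
)
```

In a mediated setting the "relevant" documents would be those of the selected cluster (or the documents the user marked), and the highest-weighted terms would form the mediated query.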
ClusterBook - Source collection
ClusterBook - Target collection
Informal experiments - Objectives -
Test the users’ reaction to the mediated access concept
Test the user satisfaction regarding the functionality of the system, and the relevance of the documents retrieved
Formative usability testing - some volunteers were not only experienced searchers, but also had experience in evaluating IR systems
Comparison of user generated queries vs. system generated queries
Note. These experiments were run at different stages of the development
Informal experiments - Experimental procedure -
Subjects received introduction to the system
Task assigned: “You are a trainee in a newspaper. You support the journalists by providing information for the topic of their articles.”
Sample topics:
– The history of the Brazilian debt crisis
– How are the quotas for growing coffee set and controlled on a world-wide basis?
Source collection: a sub-collection of Reuters (newspaper articles)
Steps followed by users (explicit scenario):
– Formulate a query and record it
– Browse the source collection, select the ‘best’ cluster, edit the query generated by the system, and submit it to the search engine
– Submit the initial, self-generated query to the same search engine
– Compare the results of the two searches
Informal experiments - Results -
Users found the mediation useful for unfamiliar topics
The system nearly always proposed new, good query terms
Users were not always good at recognizing ‘good’ query terms
The system proposed bad query terms (not specific to the topic), so the opaque scenario is not viable unless the query formulation is improved
The two-step process was questioned when:
– the query formulation was considered easy, for a familiar topic
– the documents of the source collection were considered sufficient to cover the information need
Complete link, group average – OK; single link – bad
Overall, the system is usable
Consequences of informal experiments
Formal experiments are needed to verify the main assumptions:
– The Cluster Hypothesis holds for a specialized collection
– Good clusters can be found with the search strategies provided
– Mediated queries can improve retrieval effectiveness
The effect on retrieval performance of various parameters should be compared:
– Weighting schemes
– Clustering methods
– Search strategies
[Figure: example cluster hierarchy for Wind Energy; node labels: Power Generation, Propulsion, Fixed Plants, Portable Generators, Coastal Wind Farms, Inland Wind Farms, Desert Wind Farms, Pacific Rim Wind Farms, Design of Coastal Wind Farms, Design of …, Wind generators for yachts]
Critical issue: the label generation
– Document representatives: searching
– Cluster representatives: browsing, searching, mediation
– Collection representatives: collection selection
Mediation experiment - simulations
Objectives:
– Test the potential of mediation to increase retrieval effectiveness
– Test the effect on effectiveness of a variety of parameters
[Figure: simulation setup; labels: Source collection, Target collection, Search engine, Simple query generator (baseline), Topic-based mediator (upper bound), Cluster-based mediation (realistic mediation)]
Experimental setup
Interactive track of TREC-8
– Offers relevance judgments for complex topics, with a multitude of aspects
– Offers the experimental design for the user experiment
Six topics with 12 to 56 aspects each
Target collection: FT 1991-4, with 210,158 articles
Source collection built based on relevance judgments: half of the relevant documents, their nearest neighbors, plus the documents judged non-relevant
Results – the cluster hypothesis
Aspectual cluster hypothesis confirmed by an extended version of the van Rijsbergen – Sparck Jones separation test
– Similarity between pairs of docs covering the same aspect is higher than between pairs of docs covering the same topic, which is higher than between pairs of docs in the collection
Consequence confirmed: clustering groups documents in pockets of relevance
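The ordering claimed by the aspectual cluster hypothesis can be illustrated with a simplified sketch that compares the mean pairwise similarity of same-aspect pairs, same-topic pairs, and all pairs. The toy documents, the field names, and the use of cosine similarity over raw term counts are assumptions for illustration only; the experiment itself used the TREC-8 relevance judgments and a full separation test rather than a comparison of means.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse term vectors (dict term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pair_similarity(docs, keep_pair):
    """Mean cosine similarity over the document pairs accepted by keep_pair."""
    sims = [cosine(a["vec"], b["vec"])
            for a, b in combinations(docs, 2) if keep_pair(a, b)]
    return sum(sims) / len(sims) if sims else 0.0


# Hypothetical toy collection: two topics, one of them with two aspects.
DOCS = [
    {"topic": 1, "aspect": "a", "vec": {"wind": 1, "coast": 1}},
    {"topic": 1, "aspect": "a", "vec": {"wind": 1, "coast": 1}},
    {"topic": 1, "aspect": "b", "vec": {"wind": 1, "desert": 1}},
    {"topic": 2, "aspect": "c", "vec": {"coffee": 1, "quota": 1}},
]

same_aspect = mean_pair_similarity(
    DOCS, lambda a, b: a["topic"] == b["topic"] and a["aspect"] == b["aspect"])
same_topic = mean_pair_similarity(DOCS, lambda a, b: a["topic"] == b["topic"])
any_pair = mean_pair_similarity(DOCS, lambda a, b: True)
```

On this toy data the three means decrease in the order same aspect, same topic, any pair, which is the pattern the hypothesis predicts.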
Results – retrieval effectiveness
Tf-Idf > KL > RelFreq as weighting schemes for document representation
Adding disambiguation terms to the query increases recall, but decreases precision
Nearest-neighbor mediation (“more like this”) highly significantly improves both recall and precision, even if just one exemplary document is offered for each topic aspect
Cosine and Dice perform similarly
Mediation results
Upper-bound experiment (all relevant docs known in source)
– Both recall and precision increase with query length
– Query term weights strongly affect performance
– No evidence that uniformity of term frequency affects performance
Clustered source mediation
– Best cluster mediation increases P, decreases R
– “Fuse and search”: strong increase in R and P
– “Search and fuse”: good R, terrible P!
Contributions of WebCluster
Proposes and explores system-based mediated access to very large heterogeneous document collections
Explores the use of clustering for capturing the topical, semantic structure of a problem domain (as represented by a specialized collection)
Explores the use of language models for building cluster and document representatives
Offers a framework for building structured portals on the WWW
Offers a framework for building collaborative environments
Structure of the talk
The WebCluster project
Design decisions in WebCluster
The MIR project
Integrated approach to interaction modeling, logging and analysis
User experiment – effectiveness of mediated information retrieval for Web searches
Between-subjects (rows) / Within-subjects (columns) | Non-mediated | Mediated | Total subjects
Linear ranking list | NL: Linear (Web) | ML: Linear (NJEDL and Web) | 16
Combination of linear ranking and classified display of search results | NC: Combination (Web) | MC: Combination (NJEDL and Web) | 16
Total subjects | 32 | 32 | Total 32
NL: non-mediated and linear; NC: non-mediated and combination; ML: mediated and linear; MC: mediated and combination
Research Hypotheses
The mediated system is conducive to higher effectiveness than the non-mediated system
The combination of linear/ranked display with a hierarchic/clustered display is conducive to higher effectiveness than simple ranked display
Mediation assumptions
Relevant documents tend to be clustered together in the source collection
– The cluster hypothesis
Subjects can identify relevant documents
Subjects spend time exploring the source collection and some relevant documents
Queries submitted after mediation are better
– Longer, higher clarity, wider vocabulary
Other areas of exploration
Interaction models
– Transitions between states and activities
The effect on search behavior of subject expertise or familiarity with a topic
The subjects’ ability to recognize good documents and clusters based on snippets or labels
Compare user-generated queries with system-generated queries in terms of performance
User experiment – no mediation
User experiment – mediated access
Structure of the talk
The WebCluster project
Design decisions in WebCluster
The MIR project
Integrated approach to interaction modeling, logging and analysis
Motivation
Interest in studying Human Information Behavior and Interactive Information Retrieval
Qualitative aspects
– Patterns of behavior → User models → Predictions of future behavior
Quantitative aspects
– # queries, # query terms, # documents viewed / opened / saved, # errors / corrections, time spent → Conclusions regarding retrieval effectiveness & efficiency
Typical tools
– Think-aloud protocols, video recording, questionnaires, interviews, activity logging
Motivation
Logging – options:
Commercial tools (Morae, uLog, Camtasia, etc.)
– Expensive, less control over what is logged, format usually proprietary
DIY – log events related to the research questions
– Rather inflexible – what if new research ideas come to light?
Idea #1: log all semantic events
– Identified during interaction / interface design → integrate: interface design, logger design, log analyzer design
Motivation - practical
Frustration with existing practices in IR research
– Rutgers participation in Interactive TREC 2002: user interface, logging
Idea #2: use state-based design, logging and analysis
Advantages:
– Design tools are plentiful
– The entire research team can participate in design
– Once the design is completed, the procedures to generate the logging software and the log analyzer are deterministic
Typical interactive IR experiment
Research Hypotheses
Prepare experimental system
Design system
Build system
Add functionality for logging interactions
Run experiment
Generate experimental data, including logs of interactions
Analyze experimental data
Draw conclusions
Problems with this experimental model (based on anecdotal evidence)
The system is built and the extraction of experimental data from logs is done by “those who can program” in the research group
– Most researchers are not involved in specifying system requirements
The design stage is often skipped
– The system may display usability problems, which affects the exploration of the research problems
The logging functionality is added ad hoc, in unsystematic fashion
– There is no standard format for interaction logs; additional software is needed for parsing logs and extracting useful data
Proposed integrated approach (Keywords: UML, DTD, XML)
Research Hypotheses
Conceptual design of the system (UML model of interactions)
DTD model of interactions
Experimental system
Build system
XML logger
Run experiment
Generate experimental data, including logs of interactions
XML log format
XML parser
Analyze experimental data
Data visualization
Summary
Model the interaction using UML diagrams
– This allows the whole research team to contribute to the design of the user interfaces, and supports the documentation of the interface
Derive a DTD coding for the states of the user interface, the valid user actions in every state and the state transitions that take place based on user actions
Use XML to log the user actions and their outcomes based on the interaction DTD
Based on the DTD, generate a log parser for the log analysis
The log analysis provides interaction information and can also be used for generating a visual / graphic representation of the interaction
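To make the XML logging step concrete, here is a minimal sketch of an event logger built on Python's standard xml.etree.ElementTree. The element names (log, record, millis, message, SubmitQuery, SaveDoc) follow the names that appear in the MIR class diagram and in the path “log/record/message/SaveDoc”; the class name, the attribute names (query, collection, doc), and the overall layout are assumptions, not the actual MIR DTD.

```python
import time
import xml.etree.ElementTree as ET


class XmlLogger:
    """Collects user-interface events as <record> elements under a <log> root."""

    def __init__(self):
        self.root = ET.Element("log")

    def log(self, event, **attrs):
        """Append one record; `event` becomes the element name inside <message>."""
        record = ET.SubElement(self.root, "record")
        # Timestamp in milliseconds since the epoch.
        ET.SubElement(record, "millis").text = str(int(time.time() * 1000))
        message = ET.SubElement(record, "message")
        # Hypothetical attributes; a real DTD would constrain them per event type.
        ET.SubElement(message, event, {k: str(v) for k, v in attrs.items()})
        return record

    def to_xml(self):
        """Serialize the whole log as an XML string."""
        return ET.tostring(self.root, encoding="unicode")


logger = XmlLogger()
logger.log("SubmitQuery", query="wind farms", collection="source")
logger.log("SaveDoc", doc="d42")
xml_text = logger.to_xml()
```

Because every event goes through one method, adding a new event type means adding one element declaration to the DTD and one call site in the interface, which is the kind of systematic logging the approach argues for.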
MIR – state diagram
[Figure: UML state diagram. States: Idle, UserState, Thinking, EditingQuery, ViewResults, ExploreTarget, ViewTargetHitList, ViewTargetDoc, SavingDoc, ViewSavedDoc, ExploreSource, ViewSourceHierarchy, ViewSourceDoc, ViewSourceHitList, plus a history pseudo-state (H). Events: evStartTask, evQueryEdit, evNotQueryEdit, evDisplayDoc, evSelectPane, evSelectDoc, evSelectSavedDoc, evStartSavingDoc, evSaveDoc, evUnsave, evHierarchy, evList]
MIR – DTD
MIR – XML log
MIR – logger class diagram
UserState
– Attributes: # seenWarning : boolean = false; # startMillis : long = 0; # stopMillis : long = 0
– Operations: + setSeenWarning(); + getSeenWarning(); + setStartMillis(); + setStopMillis(); + getDuration(); + getDuration(); + toString(); + handleSubmitQuery()
ViewResults
– Operations: + ViewResults(); + handle()
ExploreSource
ExploreTarget
ViewSourceHierarchy
ViewSourceDoc
– Operations: + ViewSourceDoc(); + handleViewSourceDoc(); + getDoc(); + setDoc()
ViewTargetHitList
ViewTargetDoc
– Operations: + ViewTargetDoc(); + handleViewTargetDoc(); + getDoc(); + setDoc()
SavingDoc
– Attributes: - confirmed : boolean; - option : int; - aspects : String
– Operations: + SavingDoc(); + setDoc(); + getDoc(); + getAspects(); + isSaveConfirmed(); + getOption(); + setAspects(); + setSaveConfirmed(); + setOption()
ViewSavedDoc
– Attributes: - unsaved : boolean = false
– Operations: + ViewSavedDoc(); + getDoc(); + setDoc(); + isDocUnsaved(); + setDocUnsaved(); + handleViewingSavedDoc(); + handleUnsaveDoc()
ViewSourceHitList
LogAnalyzer
XPathLogAnalyzer
DOMLogAnalyzer
– Attributes: - cSource : int = 2; - cTarget : int = 1; - collId : int = -1; - millis : long
– Operations: + DOMLogAnalyzer(); + changeState(); + getUserState(); + analyze(); + setMillis(); + getMillis(); + handleStartSession(); + handleEndSession(); ~ handleEditQuery(); ~ handleSubmitQuery(); ~ handleSelectPane(); ~ handleDoc(); ~ handleDisplayDoc(); ~ handleResult(); ~ handleSearchResults(); ~ handleStartSaveDoc(); ~ handleSaveDoc(); ~ handleUnsaveDoc(); ~ handleCluster(); ~ handleTouchCluster(); ~ handleShowMessage(); + summarize(); + report(); + getAspects()
Doc
– Attributes: - intId : String = null; - extId : String = null
– Operations: + Doc(); + getId(); + getExtId(); + toString()
LogScanner
– Operations: + LogScanner(); + visitDocument(); ~ visitElement_log(); ~ visitElement_record(); ~ visitElement_date(); ~ visitElement_millis(); ~ visitElement_message(); ~ visitElement_StartSession(); ~ visitElement_EndSession(); ~ visitElement_EditQuery(); ~ visitElement_SubmitQuery(); ~ visitElement_SearchResults(); ~ visitElement_Result(); ~ visitElement_Doc(); ~ visitElement_SelectPane(); ~ visitElement_DisplayDoc(); ~ visitElement_StartSaveDoc(); ~ visitElement_SaveDoc(); ~ visitElement_UnsaveDoc(); ~ visitElement_TouchCluster(); ~ visitElement_Cluster(); ~ visitElement_ShowMessage()
Think
– Operations: + handle(); + Think()
EditQuery
– Attributes: - query : String; - collection : String
– Operations: + EditQuery(); + handleSubmitQuery(); + getQuery(); + setQuery(); + getCollection(); + setCollection()
Association roles: state, doc, analyzer, logScanner
Design decisions
Design patterns
State – each state of the interface/system is modeled by a class
– Inheritance (class hierarchy) is used to model sub-states (states at different levels of granularity)
– Composition is used to model orthogonal states
Visitor – decouples the strategy chosen for parsing / visiting the log from the actions taken in each node
Strategy – supports different analysis strategies
– DOMLogAnalyzer – visits the entire log tree, for a comprehensive analysis
– XPathLogAnalyzer – visits only a selection of nodes, relevant for a certain research hypothesis (“log/record/message/SaveDoc”)
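The Strategy idea above can be sketched as two interchangeable analyzers over the same XML log: one walks the whole tree, the other selects only the nodes matching a path. The sample log, the class names, and the attribute names are hypothetical; only the element nesting log/record/message/SaveDoc comes from the slides. Note that xml.etree.ElementTree evaluates paths relative to the root element, so the path is written without the leading "log".

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical log in the shape log/record/message/<event>.
LOG = """<log>
  <record><millis>1000</millis><message><SubmitQuery query="wind farms"/></message></record>
  <record><millis>5000</millis><message><DisplayDoc doc="d1"/></message></record>
  <record><millis>9000</millis><message><SaveDoc doc="d1"/></message></record>
</log>"""


class FullTreeAnalyzer:
    """Visits every message in the log and counts all event types."""

    def analyze(self, root):
        return Counter(event.tag
                       for message in root.iter("message")
                       for event in message)


class PathAnalyzer:
    """Visits only the nodes selected by a path expression."""

    def __init__(self, path):
        self.path = path

    def analyze(self, root):
        return Counter(element.tag for element in root.findall(self.path))


def run(analyzer, xml_text):
    """The caller depends only on the analyze() interface, not on the strategy."""
    return analyzer.analyze(ET.fromstring(xml_text))


all_events = run(FullTreeAnalyzer(), LOG)
saves_only = run(PathAnalyzer("record/message/SaveDoc"), LOG)
```

A research question about saving behaviour needs only the second analyzer, while an exploratory analysis can use the first without any change to the calling code.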
Design decisions
Singletons vs. multiple objects for states
– Singleton – one object for each class. Adv: simple. Disadv: only appropriate for cumulative data or summaries
– Multiple objects. Adv: supports accurate, detailed analysis
Explicit vs. implicit logging of states
– Explicit: allows a human reader to interpret the logs; redundancy; problems capturing orthogonal states
– Implicit: only events are captured, states are re-created
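With implicit logging, the analyzer has to re-create the states from the logged events. The sketch below shows one simple way to do that with a transition table. The table itself is hypothetical (the real transitions are defined by the MIR state diagram), and the sketch ignores orthogonal states and does not split repeated visits to the same state.

```python
# Hypothetical mapping from a logged event to the state it leads to.
NEXT_STATE = {
    "EditQuery": "EditQuery",
    "SearchResults": "ViewTargetHitList",
    "DisplayDoc": "ViewTargetDoc",
    "StartSaveDoc": "SavingDoc",
    "SaveDoc": "ViewTargetHitList",
}


def recreate_states(events, initial="Think"):
    """Turn time-ordered (millis, event) pairs into (state, duration_ms) visits.

    Events that are not in the table leave the current state unchanged.
    The state that is still open after the last event is not reported.
    """
    if not events:
        return []
    visits = []
    state, entered = initial, events[0][0]
    for millis, name in events:
        new_state = NEXT_STATE.get(name, state)
        if new_state != state:
            visits.append((state, millis - entered))
            state, entered = new_state, millis
    return visits


visits = recreate_states([
    (0, "StartSession"),
    (4000, "EditQuery"),
    (13000, "SearchResults"),
    (28000, "DisplayDoc"),
])
```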
Types of analysis supported
State transitions, user behavior
– Average user vs. individual user
– Levels of state granularity (Think, EditQuery, ViewResults (ExploreSource (ViewSourceHierarchy, ViewSourceHitList, ViewSourceDoc), ExploreTarget (ViewTargetHitList, ViewTargetDoc, ViewSavedDoc)))
Statistical analysis on qual data
– ANOVA shows no significant difference in number of saved docs between non-mediated condition (m=3.94, sd=1.76) and mediated condition (m=3.13, sd=1.62)
Think 4, EditQuery 9, ViewTargetHitList 15, ViewTargetDoc 78, SavingDoc 16, ViewTargetHitList 6, ViewTargetDoc 31, ViewTargetDoc 9, ViewTargetDoc 35, SavingDoc 11, ViewTargetHitList 3, ViewTargetDoc 173, SavingDoc 16, ViewTargetHitList 14, EditQuery 7, ViewTargetHitList 4, ViewTargetDoc 17, ViewTargetDoc 59, ViewTargetDoc 51, ViewTargetDoc 39, EditQuery 13, ViewTargetHitList 25, ViewTargetDoc 38, SavingDoc 15, …
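A state sequence like the one above lends itself to simple summaries. The sketch below computes the number of visits and the total per state, and counts state-to-state transitions, using the first eight entries of the sequence. It assumes that the number after each state name is the time spent in that visit; the slide does not state the unit, and the function names are illustrative.

```python
from collections import Counter

# First eight (state, value) pairs of the sequence shown on the slide.
SEQUENCE = [
    ("Think", 4), ("EditQuery", 9), ("ViewTargetHitList", 15),
    ("ViewTargetDoc", 78), ("SavingDoc", 16), ("ViewTargetHitList", 6),
    ("ViewTargetDoc", 31), ("ViewTargetDoc", 9),
]


def summarize(sequence):
    """Return {state: (number_of_visits, total_value)}."""
    visits, totals = Counter(), Counter()
    for state, value in sequence:
        visits[state] += 1
        totals[state] += value
    return {state: (visits[state], totals[state]) for state in visits}


def transitions(sequence):
    """Count how often each state is directly followed by each other state."""
    states = [state for state, _ in sequence]
    return Counter(zip(states, states[1:]))


summary = summarize(SEQUENCE)
moves = transitions(SEQUENCE)
```

Summaries of this kind support the "average user" analyses, while the raw sequence remains available for studying an individual user.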
Conclusions
Advantages of the proposed method
Better teamwork
– All members contribute and are responsible for the design
More accurate experimental results
– Increased usability of the experimental system
– Accurate data, due to accurate logging of events
Less effort in testing and debugging, as well as in parsing and analyzing results
– DTD offers the interaction template
– XML logs support debugging
– Available open-source XML parsers
Future work
Automatic (or semi-automatic) generation of the DTD model from the UML model
– Conceptual problem: designing a transition scheme between the two models
– Practical problem: interpreting the format that various modeling packages use to store UML models
Visualization of the interaction
– Model: timeline of the interaction vs. summary
– Format: HTML vs. SVG vs. …
– Automatic generation: programming language (Java) vs. transformation template (XSLT)
Questions ?
Query formulation problems
Vague information need
Vocabulary mismatch
Difficulty of query language syntax
Lack of context, ambiguity of terms
Lack of a search strategy
No understanding of the underlying indexing/searching model
Note. TREC experiments have shown that the quality of the query has a higher impact on retrieval effectiveness than weighting schemes or search algorithms.
Role of structure
[Figure: example concept hierarchy; node labels: Science, Computing, Mathematics, Physics, Computer, Programming language, Screen, Keyboard, C++, Pascal, Algebra]
Reveals the semantic structure of the domain & its concepts
Groups (semantically?) similar documents
Supports exploration and concept formation
Supports term disambiguation (context)
(Has potential for efficient retrieval)
(Has potential for effective retrieval)
Browsing label (relative cluster representative)
[Figure: Wind Energy cluster hierarchy (Power Generation, Propulsion, Fixed Plants, Portable Generators, ...), with a cluster compared against its parent]
$$KLR_i(cluster, parent) = p_{i,cluster} \log \frac{p_{i,cluster}}{p_{i,parent}}$$
Searching label (absolute cluster representative)
[Figure: Wind Energy cluster hierarchy (Power Generation, Propulsion, Fixed Plants, Portable Generators, ...), with a cluster compared against the whole collection]
$$KLA_i(cluster, collection) = p_{i,cluster} \log \frac{p_{i,cluster}}{p_{i,collection}}$$
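A minimal sketch of Kullback-Leibler term weighting of the kind used for cluster labels: each term is weighted by its contribution to the KL divergence between the cluster's term distribution and a reference distribution. Using the parent cluster as the reference gives a relative (browsing) label, and using the whole collection gives an absolute (searching) label. The function names, the maximum-likelihood estimates without smoothing, and the toy data are assumptions for illustration.

```python
from collections import Counter
from math import log


def term_probs(docs):
    """Maximum-likelihood term distribution over a list of tokenized docs."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def kl_weights(cluster_docs, reference_docs):
    """Per-term weight p_cluster * log(p_cluster / p_reference).

    Assumes the cluster's documents are a subset of the reference set,
    so every cluster term has a non-zero reference probability.
    """
    p_cluster = term_probs(cluster_docs)
    p_reference = term_probs(reference_docs)
    return {term: p * log(p / p_reference[term])
            for term, p in p_cluster.items()}


# Toy example: one cluster inside a two-document collection.
collection = [["wind", "farm"], ["wind", "yacht"]]
cluster = [["wind", "farm"]]
weights = kl_weights(cluster, collection)
label = sorted(weights, key=weights.get, reverse=True)
```

In the toy example "wind" is equally frequent in the cluster and in the collection, so it gets weight zero, while "farm" is over-represented in the cluster and ranks first in the label.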
Mediation label (expanded cluster representative)
[Figure: Wind Energy cluster hierarchy (Power Generation, Propulsion, Fixed Plants, Portable Generators, ...)]
[Equation: the expanded representative E_i, computed from the absolute representatives A_{i,0}, A_{i,1}, A_{i,2}, ..., A_{i,r} with (1 − ·) weighting factors]
Topic model representations
Exemplary representation
Statistical representation
Statistical analysis
Language model
Context analysis
Typical terms, weighted
Thresholding
Mediated query
Keyword representation
The cluster hypothesis
Reminder: the original cluster hypothesis
– “Closely associated documents tend to be relevant to the same requests” (van Rijsbergen)
Aspectual cluster hypothesis: Highly similar documents tend to be relevant to the same topic. However, documents relevant to the same topic may be quite dissimilar if they cover distinct aspects of the topic.
– Consequence: Clustering algorithms tend to group together documents that cover highly focused topics, or aspects of complex topics. Documents covering distinct aspects of complex topics tend to be spread over the cluster structure.
Aspects of relevance in the mediated access process
Distribution of relevant documents in clusters
[Chart: Clustering vs. Random; x-axis: Clusters; y-axis: Recall (0% to 25%); series: Clustering, Random]
WebCluster scenario #1
[Figure: labels: WebCluster, Web Search Engine, WWW; clusters c0, c1, c2, c3, c4, c5 and c’0, c’2, c’3, c’5; legend: document from the source collection, document from the target collection (WWW)]
Name: Transparent mediated access
Targeted users: Experienced searchers
Specific:
– The users are aware of the mediation process, of the separation between the source and target collections
– The users have the option to edit the query generated (proposed) by the system. They understand the indexing / searching model.
WebCluster scenario #2
[Figure: labels: WebCluster, Web Search Engine, WWW; clusters c0, c1, c2, c3, c4, c5 and c’0, c’2, c’3, c’5; legend: document from the source collection, document from the target collection (WWW)]
Name: Opaque mediated access
Targeted users: Naive / beginner searchers
Specific:
– The users explore the structure of the domain, which contains sample documents, and have the option of asking for similar documents
– The users are unaware of the mediation process - the query generation and target search are not visible
Initial user interface (Java AWT)
History of the IR research group at RGU
“Systemic” approach with an interest in building software frameworks for IR
– Eclair - An Extensible Class Library for Information Retrieval (Harper et al, ’92)
– Flair – A Flexible Architecture for Information Retrieval (Jose et al, ’96)
– Epic - A Photographic Retrieval System Based on Evidence Combination Approach
– Fireworks - An Architecture for Implementing Extensible Information-Seeking Environments (Hendry et al, ’96)
– SketchTrieve - An Informal Information-Seeking Environment
Initial plan of work for WebCluster
Produce a flexible Clustering Framework (CF) that can apply a variety of clustering algorithms on:
– Static and dynamic (on the fly) document collections
– User profiles
– Sources of information
“Play” with the CF to understand how clustering works on various collections
Use the CF for structuring source collections in view of mediation
Design, build and test a few user interfaces for mediation
The Clustering Framework
[Figure: architecture of the Clustering Framework; labels: User, Application, Clustering Framework Kernel, CF-User Interface, CF-Document Collection Interface, CF-Web Search Engine Interface, CF-ECLAIR Interface, CF-ObjectStore Interface, CF-File System Interface, ECLAIR IRS, ObjectStore, File System, WEB]
Requirements of the CF
Generality
– Not devoted to a particular document collection, nor to a particular IRS.
Document independence
– The documents are not managed by the CF, which handles only their representatives.
Flexibility and reusability
– Large variety of clustering methods, inter-document similarity measures, halt conditions, ...
Storage management independence
– We can use file system, OODBMS, ..., to make the result persistent.
Adaptability and extensibility
– Users can add their own clustering methods.
Transparency
– For reuse as a toolbox in other applications.
Original design of the CF
Example of clustering parameter file
GroupAverageCM CosineSimilarityMeasure
InqueryTfIdfWeighting 0 6 0.0001 0.001
1. Clustering method. Possible values: CompleteLinkCM, GroupAverageCM, SingleLinkCM
2. Similarity measure. Possible values: CosineSimilarityMeasure, DiceCoefficientSimilarityMeasure
3. Weighting measure (indexing). Possible values: FreqWeighting, RelFreqWeighting, KLWeighting, KLRelWeighting, InqueryTfIdfWeighting
4. Cluster size threshold (don't agglomerate clusters with sufficient docs)
5. Halt condition (stop when I'm down to a certain number of clusters).
Note. Value 0 means no restriction; stop when you can't cluster anymore.
6. Similarity measure threshold
7. Cluster cleaning threshold
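A parameter file of this shape can be read with a few lines of code. The sketch below assumes the file holds seven whitespace-separated values in the order in which the parameters are described above; that ordering is inferred from the example, and the class, field, and function names are hypothetical rather than part of the Clustering Framework.

```python
from dataclasses import dataclass

CLUSTERING_METHODS = {"CompleteLinkCM", "GroupAverageCM", "SingleLinkCM"}
SIMILARITY_MEASURES = {"CosineSimilarityMeasure", "DiceCoefficientSimilarityMeasure"}
WEIGHTINGS = {"FreqWeighting", "RelFreqWeighting", "KLWeighting",
              "KLRelWeighting", "InqueryTfIdfWeighting"}


@dataclass
class ClusteringParams:
    clustering_method: str
    similarity_measure: str
    weighting: str
    cluster_size_threshold: int   # 0 means no restriction
    halt_condition: int           # target number of clusters
    similarity_threshold: float
    cleaning_threshold: float


def parse_params(text):
    """Parse seven whitespace-separated values into a ClusteringParams."""
    tokens = text.split()
    if len(tokens) != 7:
        raise ValueError(f"expected 7 values, got {len(tokens)}")
    method, similarity, weighting = tokens[:3]
    if method not in CLUSTERING_METHODS:
        raise ValueError(f"unknown clustering method: {method}")
    if similarity not in SIMILARITY_MEASURES:
        raise ValueError(f"unknown similarity measure: {similarity}")
    if weighting not in WEIGHTINGS:
        raise ValueError(f"unknown weighting: {weighting}")
    return ClusteringParams(method, similarity, weighting,
                            int(tokens[3]), int(tokens[4]),
                            float(tokens[5]), float(tokens[6]))


params = parse_params(
    "GroupAverageCM CosineSimilarityMeasure\n"
    "InqueryTfIdfWeighting 0 6 0.0001 0.001"
)
```

Validating the three symbolic values against the lists of possible values catches typos before a long clustering run starts.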
Design patterns
Used extensively, as they provide
– Flexibility and extensibility – for a research system, playing with parameters and plugging in more modules was more important than performance
Combined
– Most operations (such as ClusteringMethod) combined Strategy, Singleton, Factory Method, Product-Trader, sometimes Template Method
– For storage management we combined Strategy, Bridge and Serializer
Adapted, rather than applied blindly
– For the cluster structure simply applying Composite (Cluster, SimpleCluster, ComplexCluster) was very inefficient; we combined it with a Mediator that indexes documents and clusters (Clustering)
The client-server architecture
[Figure: Client (ClusterBook) and Server; labels: Document collection; Indexer; Vocabulary, index, inverted file; Clustering Framework; Cluster hierarchy; Cluster-based searcher; Ranked-based searcher; (Informia) meta-search engine; numbered steps 1 to 7 with labels GetCollections, GetClustering, SearchClusters, SearchDocuments, GetHits, GetClusteredHits, Query, Mediated query]
Client Server_Proxy Comms_Proxy
Client side
Server side
Server Client_Proxy TCP_Proxy
Client-server communication
Architecture / design decisions
Good ones
– Software framework in the server
– Java for the user interface
– Refactoring as language support improved: AWT implementation replaced by Swing (MVC, native tree representation, CellRenderers); STL instead of own library (String, List, Map, Iterator)
– Tight coupling with Informia replaced by loose coupling
– Data-centered rather than software-centered
Questionable ones
– The client-server connection (CGI, TCP, HTTP)
– Alternatives: RMI, CORBA, servlet & JNI?
Bad example: Rutgers Interactive TREC 2002 (a)
Bad example: Rutgers Interactive TREC 2002 (b)
Bad example – Logging in Interactive TREC 2002
TREC-2002 START: 2002-08-15 17:58:57
QUERY: geneticly engineered foods safety
QUERY: geneticly engineered foods safety
SAVE DOCUMENT: [G13-84-2041245] Food Safety and Biotechnology: Are They Related?
QUERY: problems genetically engineered foods
SAVE DOCUMENT: [G40-01-0459199] International Information Programs, U.S. Department of State, Economic Perspectives, October 1999
FINALLY SAVED DOCUMENTS: [G40-01-0459199] International Information Programs, U.S. Department of State, Economic Perspectives, October 1999; [G13-84-2041245] Food Safety and Biotechnology: Are They Related?
NUMBER OF VIEWED DOCUMENTS: 12
NUMBER OF UNIQUE VIEWED DOCUMENTS: 8
TREC-2002 STOP: 2002-08-15 18:03:50