WWW Search and Navigation
Mark Levene
SCIS, Birkbeck College
University of London
www.dcs.bbk.ac.uk/~mark/
2
Talk Overview
• Hypertext and the navigation problem
• NavigationZone’s solution
• Problems being researched
• A Demonstration
3
Hypertext and Navigation
• Long history
– Bush 1945, memex – trail blazing
– Nelson 1965, Xanadu – network of documents
• Problem of “getting lost in hyperspace”
• Navigation aids
– Bookmarks
– History
– Overview diagrams
– Recommendations
4
State-of-the-Art Navigation Aids
• Novel User-Interfaces to visualise web sites
• Clustering (e.g. Self-Organising Maps)
• Web data mining – finding user patterns
• Semi-automated navigation, BestTrail algorithm – motivation to follow …
5
Typical corporate search
6
A typical search scenario
1) Submit a query to a search engine
• Is it too broad / too specific?
• Does it capture my information needs?
2) Select a URL from the result set
• Have I made the right choice?
3) Start manual navigation
• Where am I? Where have I come from? Where am I going to?
4) Goto (1) to reformulate the query
7
Content centric approach
[Figure: graph of pages a, b, c, d* and e]
8
Problems with standard Search
• Page level relevance scoring – sensitive to query terms
• No look ahead
– ‘Click and discover’
• No context
– Results are totally isolated
• No navigation support
– Users are left on their own to find their way
9
Possible solutions (information retrieval)
• Improve basic IR
• Link analysis, e.g. PageRank and HITS
• Meta data tagging
– Keywords and taxonomies (semantic web)
• Natural language
– Q&A, sentence analysis, synonyms
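As a rough illustration of the link-analysis idea, the sketch below computes PageRank by power iteration. The damping factor of 0.85, the fixed iteration count and the five-page toy graph are assumptions chosen for illustration; they are not taken from the talk.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                # Each page passes an equal share of its rank along every outgoing link.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical five-page site (labels a-e are illustrative only).
toy_links = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": ["a"], "e": []}
ranks = pagerank(toy_links)
```

The scores sum to 1, so each can be read as the long-run share of visits a random surfer would give that page.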
10
Possible solutions (information seeking)
• Suggestion engines
– Link and content generation
• Categories and directories
– Explicit manual construction
• Automatic classification
– Machine learning techniques
11
Are these feasible?
• Re-architecting corporate information infrastructure is extremely expensive
• Sophisticated approaches are not always intuitive and are yet to be proven
• Same problem every couple of years
• Mergers and acquisitions
12
There is, actually, a better way!
• Treat sequence of pages, or trails, as first-class citizens for search
• Consider the topology of the area in which you are searching
• Employ navigational aids
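To make "trails as first-class citizens" concrete, here is a minimal sketch that enumerates short, loop-free trails from a starting page and ranks them by the summed relevance of their pages. This is not the BestTrail algorithm, whose details are not given in these slides; the scoring rule, the function name `best_trails`, the toy graph and the relevance values are all assumptions made for illustration.

```python
import heapq


def best_trails(links, relevance, start, max_len=3, top_k=3):
    """Return the top_k loop-free trails from `start` with at most max_len pages.

    A trail's score is the sum of its pages' relevance (an assumed scoring rule).
    """
    def score(trail):
        return sum(relevance.get(page, 0.0) for page in trail)

    results = []
    stack = [[start]]
    while stack:
        trail = stack.pop()
        results.append((score(trail), trail))
        if len(trail) < max_len:
            for nxt in links.get(trail[-1], []):
                if nxt not in trail:  # do not revisit a page within one trail
                    stack.append(trail + [nxt])
    return heapq.nlargest(top_k, results)


# Hypothetical link graph and per-page relevance scores for some query.
toy_links = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": ["a"], "e": []}
toy_relevance = {"a": 0.1, "b": 0.2, "c": 0.1, "d": 0.9, "e": 0.3}
top = best_trails(toy_links, toy_relevance, start="a")
```

In this toy setting the result is a ranked list of (score, trail) pairs, so the user sees a path through the site rather than an isolated page.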
13
Context centric approach
[Figure: graphs over pages a, b, c, d* and e]
14
The information value of a trail is higher than the sum of its parts!
15
Our approach
• Provide information retrieval of the highest quality and in addition,
• Find out what is beyond the most relevant pages by ‘exploring the area’
• Present users with precise and relevant trails
• Provide navigation assistance within the UI
16
NavZone user interface
17
NavZone Usability Study
First Monday paper
Task – find answers to 5 types of questions
1) Fact Finding – What are the term dates?
2) Judgement – Is CSIS a “good” place to do research?
3) Fact Comparison – Which train station is closest to the college?
4) Judgement Comparison – Is the research in deptA better than that in deptB?
5) General Navigational – How do you get to the checkout?
18
NavZone vs. Google and Compass
% of subjects with 4+ questions correct
• Google – 59%
• Compass – 75%
• NavZone – 83%
19
Average # clicks to complete task
• Google – 44
• Compass – 40
• NavZone – 27
NavZone is bandwidth “green”!
20
Average time taken per task (min)
• Compass – 18
• Google – 17
• NavZone – 13
Wilcoxon Test – Statistically Significant
22
The main ingredients
[Diagram: system components]
• Robot and crawler
• Parser (HTML, XML, PDF, PostScript, Word, other) and generic format
• Indexer and inverted file
• Web graph
• Trail engine with BestTrail
• Post-processor
• User interface
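The inverted file built by the indexer can be sketched as a mapping from each term to the set of pages that contain it. The tokenisation rule, the function names and the sample pages below are assumptions for illustration, not a description of the actual system.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every term of the query (boolean AND)."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical pages and text, for illustration only.
toy_docs = {
    "/terms": "Term dates for the academic year",
    "/research": "Research groups and research dates",
    "/travel": "Nearest train station to the college",
}
toy_index = build_inverted_index(toy_docs)
hits = search(toy_index, "dates research")
```

A trail engine could use such a lookup to find candidate starting pages before exploring the web graph around them.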
26
Under Development
• Alternative User-Interfaces
• Seamless integration with relational databases and file systems
• Data mining and personalisation
• Mobile/PDA support
27
Open Problem
• How do we make use of statistical regularities that are present in the web to improve search and navigation?
• See Levene et al., “A stochastic model for the evolution of the web”, Condensed Matter Archive, cond-mat/0110016, 2001
– Many distributions related to the web graph follow a power law
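One simple way to check for a power law in observed data is to estimate its exponent. The sketch below uses the standard maximum-likelihood estimator for a continuous power law, alpha = 1 + n / Σ ln(x_i / x_min). It is not the stochastic model from the cited paper, and the continuous form is only an approximation for discrete quantities such as page in-degree. The function names and the synthetic sample are assumptions for illustration.

```python
import math
import random


def estimate_power_law_exponent(values, x_min):
    """Maximum-likelihood exponent for a continuous power law p(x) ~ x^(-alpha), x >= x_min."""
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum


def sample_power_law(alpha, x_min, n, seed=0):
    """Draw n synthetic values from a continuous power law by inverse-transform sampling."""
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


# Synthetic data only: generate a sample with a known exponent, then recover it.
synthetic = sample_power_law(alpha=2.5, x_min=1.0, n=20000)
alpha_hat = estimate_power_law_exponent(synthetic, x_min=1.0)
```

On real web data the choice of x_min matters, since the power-law form usually holds only in the tail of the distribution.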