Improving the effectiveness of Web searching: Methodological issues
Barry Eaglestone
Department of Information Studies
University of Sheffield
B.Eaglestone@shef.ac.uk
Overview
• An inductive study to build evidence-based meta-cognitive models of web searching by the general public.
• Data modelling issues
– A temporal data modelling solution
• Discussion & Final thoughts
An inductive study of how the general public search on the web.
Setting the scene – the database approach and state of the art.
Motivation
• Need to develop new models for searching: update outdated usage paradigms.
– Improve training methods
– Develop automated assistance systems
Previous studies of search logs
• Web search is shallow + promiscuous
• Low use of advanced features
• Global statistics
– Number of queries/search
– Pages viewed / user
– Query reformulation (change in no. of terms)
– Most users enter few terms
– Little to be gained by increasing complexity
The Team
[Diagram: the team's expertise spans Information Seeking, Chemoinformatics and Database research.]
Spectrum of Research Perspective
[Diagram: a spectrum running from Soft to Hard.
Soft end: People world (IS); Discovery; Human / organisational issues; Qualitative / quantitative data analysis / modelling.
Hard end: Computer world (CS); Invention; Formally defined problems; Computer world formalisations; Hardware / software solutions.
Across the spectrum: Modelling / engineering / empirical; Problem solving formalism.]
How will we use it?
Effectiveness?
Meta-cognitive knowledge about web searching?
How do they search?
Who are the searchers? What are they searching for?
Infer effectiveness from
• search transformation patterns
• subject’s narrative
Context
The GENERAL PUBLIC
Volunteers (c. 500 searches):
• ICT courses
• University evening classes
• City Learning Centre courses
• Citizens’ forum
• Personal contacts
• Library
• Advertising
• Students and academics
+ over 1,000,000 search logs from anonymous searchers
• Self-selected searches explained through interview and think-aloud protocols
• 2-3 set searches
Observe and record
• Over 1,000,000 anonymous search engine transaction logs
• c. 500 observed and recorded searches; talk to searchers
Analysis steps
• Determine query similarity
• Delimit searches
• Code query transformations
• Model searches as transformation graphs
• Data mine for stereotypical search strategies
• Correlate with who, why and effectiveness
Thus, establish evidence-based models of search strategy, related to user and problem characteristics and likelihood of success
Outcomes
• Evidence-based meta-cognitive training
• Intelligent interfaces
Why Meta-cognition?
“Meta-cognition refers to higher order thinking which involves active control over the cognitive processes engaged in learning. ….”
Livingston (1997)
• Meta-cognitive knowledge
– “…knowledge of personal variables to general knowledge about how human beings learn and process information, as well as individual knowledge of one’s own learning processes…” e.g. “I have a bad memory!”
• Meta-cognitive regulation
– “… activities used to ensure that a cognitive goal has been met….”, e.g., question yourself about the text and then re-read.
Livingston (1997)
Cognitive Styles Analysis
http://www.memletics.com/manual/default.asp?ref=ga&data=999+learning+styles+free+test
Holist Analyst
Verbalizer
Imager
Syntactical/quantitative: Excite search logs, ~10^6 searches
Semantic/qualitative: Holistic search logs, supplemented with qualitative data
Preliminary work
• Analysis of search logs
• Development of descriptive codes
• Aim is to form a basis for the analysis of our experimental data
Strengths / Limitations
• Large sample
• Definitely general public.
• No enquiry context – what are they looking for? What are they thinking?
• No measure of success.
• Are they searching or just browsing?
• Where does one enquiry end and another begin?
• Limited to one search engine – what did they do during a delay?
Excite Database Sample
qid uid time rank query querymore totwords
343 000000000000006a 192141 0 alco fence company ohio No 4
344 000000000000006a 192219 0 alco fence company ohio No 4
345 000000000000006a 192228 10 alco fence company ohio No 4
346 000000000000006a 192243 20 alco fence company ohio No 4
347 000000000000006a 192328 0 lifetime fence company ohio No 4
348 000000000000006a 192359 10 lifetime fence company ohio No 4
349 000000000000006a 192455 0 lifetime wire fence No 3
350 000000000000006a 192634 0 high tensile wire fence No 4
351 000000000000006b 161906 0 sickle cell anemia No 3
352 000000000000006b 162006 10 sickle cell anemia No 3
353 000000000000006b 162130 0 sickle cell anemia No 3
354 000000000000006c 144303 0 Hilton Garden Inn No 3
355 000000000000006c 144331 0 Hilton Garden Inn Jacksonville No 4
356 000000000000006c 144433 0 Hotel Search No 2
357 000000000000006c 144541 0 Jacksonvill Hotel No 2
358 000000000000006c 144728 0 www.hilton.com No 1
~10^6 queries
[Annotation: the sample rows above are bracketed into Sessions 1, 2 and 3.]
Query Transformations
• Changes in search strategy
– Conceptual, e.g. changes in type of search: broad / specific, text / image
– Linguistic: syntactic, query structure.
• Examples
Q1: shakespeare hamlet
Q2: shakespeare hamlet quotes
Q3: to be or not to be
Q4: “to be or not to be”
Q5: “to be or not to be” +shakespeare
Our Preliminary Analysis
• To look at textual (syntactic) changes.
• Link queries by text similarity.
• Infer enquiry change from textual dissimilarity.
• Use these elements to develop a machine-readable codification of QTs.
• To mine for characteristic patterns.
Code  Transformation
N  New query
R  A repeated query / same page rank – relevance feedback.
P  Page ranking (seek more)
p  Page ranking (earlier pages)
I(k)  Identical
C(k)  Conjoint
S(k)  Sub-phrase in common
s(k)  Sub-phrase + words in common
M(k)  Other textual similarity
Example Transformations
QT graphs
[Figure: query transformation graph for uid 74, running from Start (N) to END. Nodes are numbered queries; edges are labelled with QT codes (M, C, S, s, R, P).]
uid 74: NM(1)C(2)C(3)S(4)s(5)PPRPRRRRPPRRppI(5)s(6)s(22)s(22)s(23)s(25)s(26)s(22)R
Queries shown: "nursing careers"; "paid undergraduate nursing schools in baltimore city maryland"
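QT graphs
[Figure: query transformation graph for uid 342, running from Start (N) to END. Nodes are numbered queries; edges are labelled with QT codes (M, C, QJ, QD, P), with one edge marked Delay.]
uid 342: NM(1)C(2)QJ(3)_C(2)PI(2)PPPPPPPC(2)PPPQJ(15)QD(15)
Queries shown: molsworth; "us army"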
Preliminary Conclusions
• We have developed a rich set of codes describing the syntactic part of QTs
• These can be used to develop a graph-based description
• Correlations between the codes are meaningful/interesting
• They form part of the analysis for our current experimental study.
…and if you want to read about our preliminary results….
• Whittle M, Eaglestone B, Ford N, Madden A (2007), Data Mining of Search Logs, Journal of the American Society for Information Science and Technology (in press)
• Whittle M, Eaglestone B, Ford N, Gillet V.J., Madden A (2006), Query Transformations And Their Role In Web Searching By The General Public, Information Research, 12(1) October 2006
• Whittle M, Eaglestone B, Ford N, Gillet V, Madden A (2006), Query transformations and their role in web searching by the general public. Information Seeking in Context Conference 2006 (ISIC), Australia
• Madden A, Eaglestone B, Ford N, Whittle M (2006), Search engines: a first step to finding information: preliminary findings from a study of observed searches, Information Seeking in Context Conference 2006 (ISIC), Australia.
Sheffield Experimental Study
[Diagram labels: Screens; Audio; Keystrokes; Queries; Web page titles; Transcribing; Pre-processing; Qualitative analysis; Quantitative analysis; Temporal database; Model development]
Data modelling issues
Evolution of databases
The database approach – A database should be a natural representation of information as data, suitable for all relevant applications without duplication, including the ones you have not yet thought of.
“A well designed database system will mirror its users’ perceptions of the problem space, and thus allows them to address the problem in hand without complexities and distractions of computer world implementation details… Implicit is the notion that users should work within the bounds of ‘good practice’”
The semantic gap
[ER diagram: Customer (1) – Placed_by – (n) Sales_Order (m) – Take_by – (1) Salesperson]

Customer
C# Name …
C1 Dr. Eaglestone
C2 Ms Smith

Salesperson
SP# Names …
S5 Mr. Chan …
S8 Dr. Shao

SalesOrder
C# SP# Product Quantity
C1 S5 P99 120
C1 S5 P2 10

The gap between what you wish to represent and what you can represent.
….. & Data Independence
Applications/Users
External Model
Logical Model
Internal Model
Principles of database technology…
A Ready-made Temporal Data Modelling Solution
GENREG – a ready-made solution that has also been proposed for healthcare?
The Organisation: National Museum of Denmark
• Multimedia – pictures as well as descriptions
• Distributed – each department ran their own database system for their collection (ownership!)
• Object-oriented design – entities, not just values
• Relational implementation
Database Research
Science
Technology
Application
Praxis
Theory
Topology
[Diagram: departmental databases connected by a LAN: Danish Pre-history; Department of Antiquity; Ethnographic Department; Coin Collection]
1,000,000 artefacts; 200,000 images
Design / Abstractions
• Design
– Object oriented
– Based on a curator’s perspective
– “Curators apply scientific training to determine the history of artefacts…creating knowledge about past and present societies by determining relationships which group artefacts within certain times and places in history”
• Abstractions
– Artefact
– Event
– Relationship: relates artefacts which participate in common events
Example: Mould, used to fabricate Brooches
GENREG data model
ARTIFACT
EVENT/ARTIFACT
One (or more) artifacts participate in one or more events.
[Diagram: a Burial site containing Graves, each containing Artefacts; nodes labelled A to L.]
[Diagram: a Merchant’s House and a Manor House, each with Rooms and Furniture, linked by a Purchase event.]
Integrated Care Pathways Application
[Procter, P., Eaglestone, B.M. & Burdis, C. “A unified model to support an information intensive healthcare environment”, MIE '99]
[Diagram: care pathway nodes P1 to P6 linked through states It, It+1 and It+2, with branches labelled Treatment, Alternative diagnoses and Alternative prognoses.]
A formal GENREG Model

type Genreg = abs [tuple[ Collection : Artifacts, Events : set[Event]]
    new : () → Genreg,
    = : (Genreg × Genreg) → boolean,
    events : (Genreg) → set[Event],
    collection : (Genreg) → Artifacts]

type Artifacts = graph[Artifact]

type Event = abs[ tuple [id : E_Id, type : Event_Type, t : Time, place : Location, actors : set[Actor_Type], edge : set[Edge]]
    = : (Event × Event) → boolean,
    id : (Event) → E_Id,
    type : (Event) → Event_Type,
    t : (Event) → Time,
    place : (Event) → Location,
    actors : (Event) → set[Actor_Type],
    edgeset : (Event) → set[Edge]]
…

type Time = abs[tuple[ lower, upper : T]
    new : () → Time,
    = : (Time × Time) → boolean,
    before : (Time × Time) → boolean,
    meets : (Time × Time) → boolean,
    overlaps : (Time × Time) → boolean,
    during : (Time × Time) → boolean,
    starts : (Time × Time) → boolean,
    finishes : (Time × Time) → boolean,
    …

Update operations
• add_artifact / delete_artifact (D, a)
• add_event / delete_event (D, e)
• merge (D,F,E)

Query operations
• select_artefacts (D,p)
• select_events (D,p)
• related_to (D,n)
• related_by (D,e,n)
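The Time type above lists its comparison operations by name only. The sketch below shows one way they could behave, assuming they follow the usual interval relations (before, meets, overlaps, during, starts, finishes). The slides do not define their semantics, so the class layout, the integer bounds and the example years are illustrative assumptions rather than the GENREG implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Time:
    """An interval [lower, upper], mirroring tuple[lower, upper : T].

    Assumption: T is modelled as int (e.g. a year); GENREG's T is not
    specified in the slides.
    """
    lower: int
    upper: int

    def __post_init__(self):
        if self.lower > self.upper:
            raise ValueError("lower must not exceed upper")

    def before(self, other):
        # self ends strictly before other starts
        return self.upper < other.lower

    def meets(self, other):
        # self ends exactly where other starts
        return self.upper == other.lower

    def overlaps(self, other):
        # self starts first and ends inside other
        return self.lower < other.lower < self.upper < other.upper

    def during(self, other):
        # self lies strictly inside other
        return other.lower < self.lower and self.upper < other.upper

    def starts(self, other):
        # same start, self ends first
        return self.lower == other.lower and self.upper < other.upper

    def finishes(self, other):
        # same end, self starts later
        return self.upper == other.upper and self.lower > other.lower


# Hypothetical intervals, chosen only to exercise the relations.
whole = Time(1950, 2004)
inner = Time(1960, 1970)
later = Time(1970, 1980)
```

With these values, `inner.during(whole)` and `inner.meets(later)` hold, while `inner.before(later)` does not because the two intervals share an end point.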
Temporal Data Models (see also SQL/Temporal)
[Diagram: a cube with axes Entity, Attribute and Time]
Time: 2004 – Entity: Barry; Height: 5’ 10’’
Time: 1950 – Entity: Barry; Height: 2’ 3’’
• Artefact histories are created retrospectively
• Multiple orthogonal time dimensions can be represented (using specialised events), e.g., discovery and historic time.
• Relationships between events and states are modelled.
• Multiple objects can represent different states and interpretations of an entity.
QT graphs
[Figure: the QT graph for uid 342, repeated here with the annotations Q3, Q4 and QJt.]
uid 342: NM(1)C(2)QJ(3)_C(2)PI(2)PPPPPPPC(2)PPPQJ(15)QD(15)
Some final thoughts…
• The Database Approach?
• Semantic gap?
• Data independence?
• Temporal modelling?
• Query language?
• So, what’s happening?
IR & DB?
IR – collections of artefacts are available for ad hoc querying (any relevant problem). The problem is modelled by the query.
DB – collections of artefacts are structured to model the problem space.
[Diagram:
Server(s): Internet-accessible repositories of artefacts
Client(s): users are researchers who derive knowledge from retrieved artefacts
Client to server: problem-related query
Server to client: problem-relevant artefacts
Researcher’s workspace: developed to model the problem space; artefact collection]
…final thoughts…
• Knowledge of research methodology is important (qualitative and quantitative)
• Nudist, Atlas, SPSS don’t support mixed methods
• Database approach allows integration of qualitative and quantitative data, and organisation of data to evolve to model emerging theory
• Temporal data models are key to modelling evolving strategy…
Acknowledgments
• The project team – Nigel Ford, Andrew Madden, Martin Whittle
• Arts and Humanities Research Council (formerly Board) for funding
• Mark Sanderson and Amanda Spink for making the Excite logs available
• Val Gillet and Eleanor Gardiner for help with graphs.
Summary
• Feedback can lead to semantic changes
• Complexity can be a hindrance
• Searches don’t necessarily end when a searcher leaves a search engine.
Algorithm
Loop over session queries: for i = 1 to n
    Loop over previous queries: for j = 1 to i-1
        Compare query i with j
    Choose most similar pair i,j
    Analyse to assign QT type
[Diagram: time line of queries 1 to n, with the current query i compared against each earlier query j.]
Some Preliminary Observations
• Quote marks are likely to be used with a new query.
• Delay is strongly associated with N (New query): these are successful single queries within a session.
• B (Include Boolean) & C (Conjoint) are positively associated
• B & D (Disjoint) are negatively associated
[Figure: Number of words/query, Excite 2001. x-axis: terms/query (log scale, 1 to 100); y-axis: normalised frequency (0 to 0.35).]
Classification of textual QTs
• Word order, addition, subtraction.
• Inclusion or removal of
– Boolean terms
– “quotes”
• Detection of new enquiries.
• We use similarity methods to compare words and queries.
Self-selected searches
Prompts:
• Think about the last time you had trouble finding something you were looking for on the Internet.
• Do you have any hobbies or interests for which the Internet might provide useful information?
Hölscher & Strube (2000): Graphical Representation
Close-up of direct interaction with a search engine: numbers show transition probabilities.
Experts and novices doing specific search tasks
Set searches
Heads:
• What was written on Neville Chamberlain’s piece of paper?
• You’ve won a holiday to Saga. What can you find out about the place that interests you?
Tails:
• You’ve received a postcard from friends who say they’re visiting Map. Where are they?
• There are many opportunities to win things on the Internet. Can you find some that relate to your interests?
Additional search:
• Find the postcode of the tallest building in the UK outside of London.
All searches recorded using
Spector pro (key stroke recorder) and My Screen Recorder (which records voice + activities on PC).
Annotated transcripts
[Annotations: the first column is the time at which the stated action takes place; the bracketed figure is the browse time preceding the action.]
Search 1
00.50 “I might as well go with what I know best”
01.20 (enters ‘CD albums collection’)
01.27 (6s browse) Selects 2nd link (CD universe)
01.53 (31s browse) – selects Dance = 7 of ? (>24) (on LHS). “See this is the trouble, cos I don’t really know what category it would go into. It was a mixed CD so it’s got all sorts of different things on, and there’s not really a category for that, I don’t think.”
01.56 (8s browse) – Selects Dance Collections = 7 of 12 (top of page)
Search dimensions
Volunteer  Search no.  On  Off  On/(On+Off)  Depth  Intensity: mean (s.d.)
1          1           2   1    0.67         1      43.33 (24.66)
1          2           10  8    0.56         2      14.72 (15.1)
1          3           6   5    0.55         3      12.27 (11.26)
1          4           3   1    0.75         1      7.5 (6.45)
2          1           30  14   0.68         6      4.55 (6.36)
2          2           22  8    0.73         2      7.67 (9.8)
2          3           8   1    0.89         1      13.33 (16.96)
2          4           24  2    0.92         1      6.73 (12.88)
Progress
c. 54 volunteers observed since Oct 2005 (representing c. 200 searches).
cf Transaction Logs
Internet searches are often regarded as being ‘shallow and promiscuous’ (= many short, simple searches). This idea supports the perception of searches viewed from search engine transaction logs. A useful summary of search engine use, but not of Web search behaviour viewed as a whole.
Feedback loops
Learn from previous searches
E.g. semantic shifts
Sheffield Pals Battalion
Richard Sparling
Complex search ≠ good search
Familiarity with search engine facilities (Boolean, “”, etc) does not always indicate competence. E.g.: postcode "tallest building outside london" –london.
Use the general to find the specialist
Search engine used to find a more focussed search tool. E.g. – searcher looking for info on B&B in York finds a directory of holiday accommodation.
• Jansen ref re complexity
• Findings title
• Search dimensions slide
• Database side – modelling.
Transaction logs
• Strengths
– Large sample.
– Natural environment.
– Definitely general public.
• Limitations
– No enquiry context – what are they looking for? What are they thinking?
– No measure of success.
– Are they searching or just browsing?
– Where does one enquiry end and another begin?
– Limited to one search engine – what did they do during a delay?
Experimental Study
• Strengths
– Very detailed information.
– Searching not surfing.
– Comparison of identical enquiries.
• Limitations
– Small sample of queries.
– Limited public sample – volunteers.
This work
• Development of quantitative analysis
• Analysis of search logs (Excite 2001)
• Development of descriptive codes
• Aim is to form a basis for the analysis of our experimental data
Aims of Quantitative Analysis
• To look at textual (syntactic) changes.
• Link queries by text similarity.
• Infer enquiry change from textual dissimilarity.
• Use these elements to develop a machine-readable codification of QTs.
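Word similarity
Dice coefficient: $S_W = \frac{2c}{a + b}$, where a and b are the lengths of the two words and c is the number of letters matched when one word is shifted against the other.
Example, "elected" against "election": $S_W = \frac{2 \times 5}{7 + 8} = \frac{10}{15} = 0.667$
[Figure: "elected" aligned letter by letter against "election" after the shift, with a bit string marking the matched letters.]
Drawback: on this measure "doing" and "going" are very similar (0.8), while "bug" and "debugging" have $S_W = 0.5$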
Word Similarity Threshold
ding / ping: $S_W = \frac{6}{8} = 0.75$
bring / thing: $S_W = \frac{6}{10} = 0.6$
trying / string: $S_W = \frac{6}{12} = 0.5$
nursing / training: $S_W = \frac{6}{15} = 0.4$
• Partial solution: introduce threshold WST = 0.4
• Anything less similar than WST is given $S_W = 0$
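The word similarity measure and its threshold can be sketched as follows. This is a minimal reading of the slides, not the project's code: it assumes that "shifting" means trying every offset of one word against the other, that a match is the same letter at the same aligned position, and that comparison is case sensitive. The function name `word_similarity` is hypothetical.

```python
def word_similarity(w1, w2, wst=0.4):
    """Dice-style similarity 2c / (a + b), zeroed below the threshold wst.

    c is the largest number of position-wise letter matches found while
    sliding w1 along w2 (assumed interpretation of the "shift" step).
    """
    a, b = len(w1), len(w2)
    if a == 0 or b == 0:
        return 0.0
    best = 0
    # shift is the offset of w1[0] relative to w2[0]
    for shift in range(-(a - 1), b):
        matches = sum(
            1
            for i, letter in enumerate(w1)
            if 0 <= i + shift < b and w2[i + shift] == letter
        )
        best = max(best, matches)
    score = 2 * best / (a + b)
    # Anything less similar than WST counts as no similarity at all.
    return score if score >= wst else 0.0


examples = [("elected", "election"), ("doing", "going"),
            ("bug", "debugging"), ("nursing", "training")]
scores = {pair: round(word_similarity(*pair), 3) for pair in examples}
```

Under these assumptions the sketch reproduces the values quoted on the slides for these four pairs (0.667, 0.8, 0.5 and 0.4).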
Query Similarity
• For each word in query 1 find the most similar word in query 2 and combine results
• Accommodates repeated words (in query 2) without weighting
• Main point of WST is to avoid the accumulation of many small contributions to the query similarity
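Query Similarity Example
Query 1: leaf gelatin supplier barcelona
Query 2: gelatine supplies in spain
Word scores (best match in query 2): leaf = 0; gelatin = 0.93; supplier = 0.88; barcelona = 0
$S_Q = \frac{\text{sum of scores}}{\max(\text{number of words})}$, which evaluates to 0.453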
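The query similarity example can be checked with a short sketch. It assumes the denominator is the larger of the two word counts and that words are split on whitespace; the helper repeats a compact version of the shifted Dice measure so that the block runs on its own. All function names are hypothetical.

```python
def word_similarity(w1, w2, wst=0.4):
    """Compact shifted Dice measure (see the word similarity slides)."""
    a, b = len(w1), len(w2)
    best = max(
        sum(1 for i, ch in enumerate(w1)
            if 0 <= i + s < b and w2[i + s] == ch)
        for s in range(-(a - 1), b)
    )
    score = 2 * best / (a + b)
    return score if score >= wst else 0.0


def query_similarity(q1, q2, wst=0.4):
    """Sum of best per-word scores divided by the larger word count."""
    words1, words2 = q1.split(), q2.split()
    if not words1 or not words2:
        return 0.0
    total = sum(
        max(word_similarity(w, v, wst) for v in words2)
        for w in words1
    )
    return total / max(len(words1), len(words2))


s_q = query_similarity("leaf gelatin supplier barcelona",
                       "gelatine supplies in spain")
```

Only "gelatin" and "supplier" contribute here; the other two words fall below WST and score zero, so `s_q` comes out at roughly 0.45, in line with the slide's 0.453 (which sums the rounded word scores).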
Query Similarity Threshold
We are looking for the most similar previous query to i
[Diagram: time line with the current query i and each earlier query j.]
If none are similar, maybe i is a new enquiry
Set QST = 0.3 as lowest acceptable similarity for a valid query connection
Setting WST and QST
• Result narrowed down by close inspection
• In the first 300 queries the set with WST = 0.4 and QST = 0.3 agreed with a human analysis of the best categorisation in all cases bar one, which was in any case an unusual entry.
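Algorithm
Loop over session queries: for i = 1 to n
    Loop over previous queries: for j = 1 to i-1
        Compare query i with j
    Choose most similar pair i,j; assign k = j
    Analyse to assign QT type (k to i)
[Diagram: time line of queries 1 to n, with the current query i compared against each earlier query j.]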
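The linking loop can be sketched as below. The similarity function is passed in as a parameter; the demo uses a simple word-overlap Dice score as a stand-in, which is not the shifted letter measure from the slides. Indices are 0-based here, whereas the slides number queries from 1, and ties are resolved in favour of the earliest query, which the slides do not specify.

```python
def link_queries(queries, similarity, qst=0.3):
    """For each query i, return the index k of its most similar earlier
    query, or None when nothing reaches the threshold qst (a candidate
    new enquiry)."""
    links = []
    for i, query in enumerate(queries):          # loop over session queries
        best_k, best_score = None, 0.0
        for j in range(i):                       # loop over previous queries
            score = similarity(query, queries[j])
            if score > best_score:               # strict: earliest wins ties
                best_k, best_score = j, score
        links.append(best_k if best_score >= qst else None)
    return links


def word_overlap(q1, q2):
    """Stand-in similarity: Dice coefficient on lower-cased word sets."""
    s1, s2 = set(q1.lower().split()), set(q2.lower().split())
    if not s1 or not s2:
        return 0.0
    return 2 * len(s1 & s2) / (len(s1) + len(s2))


# Queries taken from the Excite sample (uid ...6c).
session = ["Hilton Garden Inn", "Hilton Garden Inn Jacksonville",
           "Hotel Search", "Jacksonvill Hotel"]
links = link_queries(session, word_overlap)
```

Each entry of `links` is the k that a substantive code such as C(k) or W(k) would refer to; assigning the code itself is a separate step.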
“Trivial” Transformations
Code  Transformation
U  Unique
N  New query
R  Repeated query
P  Page viewing (seek more)
p  Page viewing (earlier pages)
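A minimal sketch of how the page-viewing and repeat codes could be read off consecutive log rows. It assumes that the `rank` column is the offset of the first result shown (0, 10, 20, ...) and that only the immediately preceding row is consulted; both are assumptions, and the function name is hypothetical. Rows whose query text changes are left for the similarity analysis.

```python
def trivial_code(prev, curr):
    """Return N, R, P or p for a (query, rank) row, or None when the
    query text changed and a substantive code is needed instead."""
    if prev is None:
        return "N"            # first query of the session
    prev_query, prev_rank = prev
    curr_query, curr_rank = curr
    if curr_query != prev_query:
        return None           # needs similarity analysis
    if curr_rank > prev_rank:
        return "P"            # seeking more results
    if curr_rank < prev_rank:
        return "p"            # going back to earlier pages
    return "R"                # same query, same page


# First six rows of the Excite sample for uid ...6a.
rows = [("alco fence company ohio", 0),
        ("alco fence company ohio", 0),
        ("alco fence company ohio", 10),
        ("alco fence company ohio", 20),
        ("lifetime fence company ohio", 0),
        ("lifetime fence company ohio", 10)]
codes = [trivial_code(prev, curr)
         for prev, curr in zip([None] + rows[:-1], rows)]
```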
Substantive Transformations I
Code  Transformation (relative to k)
I(k)  Identical
J(k)  Identical apart from Quotes/Boolean
C(k)  Conjoint
D(k)  Disjoint
S(k)  Sub-phrase in common
s(k)  Sub-phrase + words in common

Substantive Transformations II
Code  Transformation (relative to k)
W(k)  Single word in common
w(k)  Separated single words in common
M(k)  Other textual similarity

Below Threshold Similarity
Z(k)  Not similar but word in common
z(k)  Not similar but words in common
Target: one two three (written 123)
Comparison  Symbol  Type
Basic transformations
1234     C  Conjunction
12       D  Disjunction
Common sub-phrase
124      S  Replacement
231      s  Reordering
1243     s  Insertion/removal
Common word
145      W  Replacement
132      w  Reordering
143      w  Replacement/insertion
Below threshold similarity
1456     Z  Common word
1245678  z  Common phrase
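The two basic transformations in the table can be illustrated with a small sketch. It covers only I, C and D, and it assumes that C means the target appears as an unbroken run of words inside a longer query and D means the reverse. That is one possible reading of the 1234 and 12 rows; the slides do not state the exact rules, and the function names are hypothetical.

```python
def contains_run(longer, shorter):
    """True if the word list `shorter` occurs contiguously in `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter
               for i in range(len(longer) - n + 1))


def basic_transformation(target, comparison):
    """Return I, C or D, or None when another code would be needed."""
    t, c = target.split(), comparison.split()
    if not t or not c:
        return None
    if c == t:
        return "I"    # identical
    if len(c) > len(t) and contains_run(c, t):
        return "C"    # conjoint: words added around the target
    if len(c) < len(t) and contains_run(t, c):
        return "D"    # disjoint: words removed from the target
    return None       # S, s, W, w, M, Z or z: not handled in this sketch


target = "one two three"
results = {q: basic_transformation(target, q)
           for q in ["one two three four", "one two",
                     "one two four", "two three one"]}
```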
Supplementary Transformations
Code  Transformation
B  Include Boolean term
b  Remove Boolean term
Q  Include quote marks
q  Remove quote marks
_  Delay > 1 hour
Example full transformation
May include up to 4 terms, e.g. BQC(4)_
B: Boolean; Q: Quote marks; C(4): Substantive; _: Delay
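A code string for a whole session is a concatenation of such transformations, so it can be split back into its parts with a regular expression. The sketch below assumes the order Boolean, quote, main code, optional (k), optional delay, as in BQC(4)_; the pattern and the dictionary keys are illustrative choices, not the project's format.

```python
import re

QT_PATTERN = re.compile(
    r"(?P<boolean>[Bb]?)"                 # B include / b remove Boolean term
    r"(?P<quote>[Qq]?)"                   # Q include / q remove quote marks
    r"(?P<code>[UNRPpIJCDSsWwMZz])"       # trivial or substantive code
    r"(?:\((?P<k>\d+)\))?"                # optional reference query k
    r"(?P<delay>_?)"                      # optional delay marker
)


def parse_qt_string(text):
    """Split a session code string into one dict per query."""
    tokens, pos = [], 0
    while pos < len(text):
        match = QT_PATTERN.match(text, pos)
        if match is None:
            raise ValueError(
                f"unrecognised code at position {pos}: {text[pos:]!r}")
        k = match.group("k")
        tokens.append({
            "boolean": match.group("boolean") or None,
            "quote": match.group("quote") or None,
            "code": match.group("code"),
            "k": int(k) if k is not None else None,
            "delay": match.group("delay") == "_",
        })
        pos = match.end()
    return tokens


full = parse_qt_string("BQC(4)_")
session = parse_qt_string("N_NC(2)PPPPW(3)NC(9)P")
```

The second string is the 11-query session listed under uid 26 in the Excite output, and the parse yields 11 tokens for it.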
Some examples
Code: Query1 → Query2
QJ(k): bargain music → “bargain music”
QC(k): Bacteremia → “Pneumoccol Bacteremia”
qJ(k): “university of texas” “alternative medicine” → university of texas” “alternative medicine”
qw(k): "tax law_depreciation system" → tax law/depreciation system
BC(k): "the sopranos" → "the sopranos" +scripts
BJ(k): +"Complaint form letters" Insurance → +"Complaint form letters" +Insurance
BS(k): doppler effect labs → doppler effect +lab
More examples
Code: Query1 → Query2
Bs(k): conferences image processing → +image +processing +conferences +finland
BqW(k): "Craig Larman" → +Larman +Valtech
BqZ(k): +"lbp 1000" +review → +canon +review +laser +printer
BqW(k): Hevia AND bagpipe → "Spanish bagpipe"
bQs(k): +used +horse +trailer → +arndt +"horse trailer" used
bqW(k): +arndt +"horse trailer" used → +Arndt trailer
bqs(k): +Moby +southside +"Gwen Stefani" +mp3 → +Moby +southside +mp3
Output for the first 100 Excite queries
Source file: excite.txt
word modification threshold : 0.400000
query modification level : 0.300000
sub-session delay/s : 3600
qid0 uid nq Modification list
1 1 ** 1 U
2 2 ** 5 NW(1)_NPP
7 3 ** 4 NS(1)PP
11 4 ** 1 U
12 5 ** 1 U
13 6 ** 1 U
14 7 ** 5 N_QNPPP
19 8 ** 4 NPPP
23 9 ** 1 U
24 10 ** 4 NQJ(1)NQN
28 11 ** 5 N_NN_NP
33 12 ** 2 N_N
35 13 ** 3 NR_R
38 14 ** 1 U
39 15 ** 1 U
40 16 ** 4 NM(1)RN
44 17 ** 21 N_N_NC(1)PPPPNW(9)PPPPC(10)PPPPPP
65 18 ** 2 NP
67 19 ** 10 NRPC(1)RP_NS(7)D(7)I(7)
77 20 ** 1 QU
78 21 ** 1 U
79 22 ** 1 U
80 23 ** 1 U
81 24 ** 1 U
82 25 ** 1 QU
83 26 ** 11 N_NC(2)PPPPW(3)NC(9)P
94 27 ** 5 NNW(2)RR
99 28 ** 1 U
100 29 ** 3 NW(1)_M(1)
Highlighted (uid 26): N_NC(2)PPPPW(3)NC(9)P
One session - 3 sub-sessions
qid uid time rank query querymore totwords
83 000000000000001a 083122 0 chicago sun times No 3
84 000000000000001a 105439 0 f8 No 1
85 000000000000001a 105453 0 f8 airplane No 2
86 000000000000001a 105536 10 f8 airplane No 2
87 000000000000001a 105614 20 f8 airplane No 2
88 000000000000001a 105630 30 f8 airplane No 2
89 000000000000001a 105731 40 f8 airplane No 2
90 000000000000001a 105740 0 airplanes f8 No 2
91 000000000000001a 113441 0 ceo compensation No 2
92 000000000000001a 113633 0 2000 ceo compensation No 3
93 000000000000001a 113752 10 2000 ceo compensation No 3
1 N_
2 N
3 C(2)
4 P
5 P
6 P
7 P
8 W(3)
9 N
10 C(9)
11 P
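The delay marker on the first query of this session can be reproduced from the `time` column. The sketch assumes the column is an hhmmss time of day, that all rows fall on the same day, and that the marker is attached to the query that precedes a gap of more than 3600 s (as in "1 N_"). The function names are hypothetical.

```python
def to_seconds(hhmmss):
    """Convert an hhmmss string such as '083122' to seconds since midnight."""
    return int(hhmmss[0:2]) * 3600 + int(hhmmss[2:4]) * 60 + int(hhmmss[4:6])


def delay_flags(times, threshold=3600):
    """Flag each query that is followed by a gap longer than threshold."""
    secs = [to_seconds(t) for t in times]
    flags = [later - earlier > threshold
             for earlier, later in zip(secs, secs[1:])]
    return flags + [False]      # the last query has no following gap


# Time column of the session shown above (qid 83 to 93).
times = ["083122", "105439", "105453", "105536", "105614", "105630",
         "105731", "105740", "113441", "113633", "113752"]
flags = delay_flags(times)
```

Only the first query is flagged: the gap from 08:31:22 to 10:54:39 exceeds an hour, while the gap before "ceo compensation" is about 37 minutes.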
[Figure: Query lengths. Log-log plot of frequency (1 to 1,000,000) against length in queries (1 to 100), with one series for sessions and one for sub-sessions.]
10% of sub-sessions are at least 7 queries in length
[Figure: QT relative frequencies. Bar chart of percentage frequency (0 to 35) for each query transformation: U N P p R I J C D S s W w M Z z B b Q q _]
Terminal QTs
[Figure: bar chart of terminal QT ratio (axis 0 to 1.2) for each query transformation: U N P p R I J C D S s W w M Z z B b Q q _]
$\text{Ratio} = \frac{\text{FinalFreq}(QT)}{\text{Freq}(QT)}$
i.e. the last queries in a sub-session
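The terminal ratio can be computed directly from coded sub-sessions, as in the sketch below. The input is assumed to be a list of sub-sessions, each already split into single-letter codes; how sub-sessions are delimited is not spelled out on the slides, so the three lists used here are an illustrative split of the session N_NC(2)PPPPW(3)NC(9)P, not project output.

```python
from collections import Counter


def terminal_ratios(sub_sessions):
    """Ratio FinalFreq(QT) / Freq(QT) for every code that occurs."""
    total, final = Counter(), Counter()
    for codes in sub_sessions:
        if not codes:
            continue
        total.update(codes)        # every occurrence of each code
        final[codes[-1]] += 1      # the code that ends the sub-session
    return {qt: final[qt] / total[qt] for qt in total}


# Illustrative split into three sub-sessions (assumption).
sub_sessions = [["N"],
                ["N", "C", "P", "P", "P", "P", "W"],
                ["N", "C", "P"]]
ratios = terminal_ratios(sub_sessions)
```

A ratio near 1 means the code almost always ends a sub-session; a ratio near 0 means it is nearly always followed by another query.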
[Figure: Frequency of nodes with k connections. ln(f) (0 to 12) plotted against k (0 to 10) for query lengths 10 and 20. Annotations: slope = -1; exponential scaling.]
Intra-QT correlations
• f(A,B): measured coincident frequency of codes A and B
• E{}: expected value
• V{}: variance
$D(A_i, A_j) = \frac{f(A_i, A_j) - E\{f(A_i, A_j)\}}{\sqrt{V\{f(A_i, A_j)\}}}$
Correlations within a transform e.g. [BQC(3)_]
Intra-QT correlations (_ = delay; – = no value)
Type  B       b       Q       q       _
U     20.60   –       1.32    –       –
N     -1.48   –       23.26   –       78.27
P     –       –       –       –       -66.16
p     –       –       –       –       -9.63
R     –       –       –       –       10.53
I     –       –       –       –       4.45
J     61.85   47.37   136.42  78.37   -5.74
C     46.02   -42.81  -15.14  -19.22  -4.70
D     -34.07  62.20   -15.09  13.45   -4.79
S     -24.52  -11.14  -20.69  -7.63   -5.65
s     -2.62   9.93    -7.05   3.65    -8.04
W     -35.00  -10.35  -32.99  -6.81   -6.05
w     -2.63   9.14    -11.51  -0.98   -8.18
M     -21.05  -12.98  -37.31  -13.28  -1.97
Z     -2.26   14.11   -10.06  2.23    -0.90
z     1.78    2.82    0.55    1.45    0.95
B     0.00    –       1.16    76.78   -15.01
b     –       0.00    74.95   10.05   -11.07
Q     1.16    74.95   0.00    –       -0.28
q     76.78   10.05   –       0.00    -7.77
_     -15.01  -11.07  -0.28   -7.77   0.00
Example: [BQC(3)_]
Some Observations
• Quote marks are likely to be used with a new query.
• Delay is strongly associated with N: these are successful single queries within a session.
• B & C are positively associated
• B & D are negatively associated
Application to Experimental Results
Query Transforms
qid SS Query QM(similarity) QM(preceding)
1 * CD albums collection N N
2 CD albums collection R R
3 * Autotrader N N
4 * atlas N N
5 * place names N N
6 place names R R
7 * map N N
8 * online competitions N N
9 * Tall British buildings N N
10 Tall buildings w(9) w(9)
11 Tall buildings R R
12 Tall buildings R R
13 Tall buildings in Britain w(9) C(12)
14 Tallest building outside London M(9) M(13)
Temporal Database
• A repository of all data for each session
• Accessible to SQL
• Used to build evidence-based models for searching
[Diagram: record types linked by uid and qid: background details, web experience, cognitive style scores; subject’s appraisal of searches; search queries, web page titles; key stroke record, activity timings; query modification codes; qualitative analysis.]
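To make "accessible to SQL" concrete, the sketch below builds a tiny in-memory database in which subjects, queries and query modification codes are linked by uid and qid, then joins them. The table and column names are hypothetical, and the schema is far smaller than the repository described above. The two queries and their codes are rows 9 and 10 of the Query Transforms table; the uid value is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subject (
        uid INTEGER PRIMARY KEY
    );
    CREATE TABLE query (
        qid  INTEGER PRIMARY KEY,
        uid  INTEGER NOT NULL REFERENCES subject(uid),
        text TEXT NOT NULL
    );
    CREATE TABLE query_code (
        qid  INTEGER NOT NULL REFERENCES query(qid),
        code TEXT NOT NULL
    );
""")

conn.execute("INSERT INTO subject (uid) VALUES (1)")   # hypothetical uid
conn.executemany(
    "INSERT INTO query (qid, uid, text) VALUES (?, ?, ?)",
    [(9, 1, "Tall British buildings"), (10, 1, "Tall buildings")],
)
conn.executemany(
    "INSERT INTO query_code (qid, code) VALUES (?, ?)",
    [(9, "N"), (10, "w(9)")],
)

# All coded queries for one subject, in query order.
rows = conn.execute(
    """
    SELECT q.qid, q.text, c.code
    FROM query AS q
    JOIN query_code AS c ON c.qid = q.qid
    WHERE q.uid = ?
    ORDER BY q.qid
    """,
    (1,),
).fetchall()
```

Further tables for keystrokes, timings or qualitative annotations would join on the same two keys.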
Questions ?
[Figure: Setting WST and QST (Excite, WST = 0.4). Frequency (0 to 350,000) plotted against a threshold axis running from 0 to 1, with series Tot New, Tot Mod and z+Z.]
Inter-QT correlations
• f(A | B): measured frequency of codes B following A
• E{}: expected value
• V{}: variance
$D(A_i \mid B_j) = \frac{f(A_i \mid B_j) - E\{f(A_i \mid B_j)\}}{\sqrt{V\{f(A_i \mid B_j)\}}}$
Correlations of one transform with the next.
Inter-QT correlations
Columns: prior transformation type. Rows: posterior transformation. (_ = delay)
Type N P p R I J C D S s W w M Z z B b Q q _
N 82.40 -39.20 -2.92 13.26 22.95 2.22 10.77 9.92 5.92 -2.37 23.22 2.37 30.55 11.24 3.99 22.86 8.45 17.84 6.60 102.17
P -42.39 323.03 9.91 -15.98 -17.58 -9.12 -4.10 -6.83 -12.90 -5.45 -19.89 -5.75 -32.02 -8.76 -2.25 -25.01 -19.35 -18.59 -7.81 -71.47
p -50.08 79.89 154.30 17.11 4.96 -8.42 -18.06 -10.74 -15.30 -10.98 -21.52 -11.35 -18.32 -8.35 -2.35 -21.79 -10.70 -17.30 -7.25 21.57
R 125.10 -85.27 3.73 198.05 23.30 -2.83 0.55 -2.51 -3.93 -6.24 1.94 -3.17 14.86 1.31 -0.46 -16.30 -12.24 -0.72 -6.71 89.80
I -8.96 -39.39 7.11 25.19 152.36 23.27 35.60 20.45 19.44 10.92 33.41 15.91 61.29 5.88 1.04 0.33 6.43 -0.72 4.76 61.21
J 31.31 -28.13 0.42 -1.56 -2.36 45.43 29.05 12.92 21.68 19.21 15.47 15.55 10.37 7.08 4.06 66.72 37.31 70.63 46.88 -5.89
C 98.65 -27.61 -2.25 -7.92 -3.51 9.43 50.98 -1.42 2.57 -5.27 11.76 -2.43 7.80 10.78 1.98 33.37 6.34 25.51 3.16 -8.53
D 39.12 -24.03 -2.58 -3.66 -0.82 23.95 14.41 21.89 32.39 29.83 26.52 21.93 -4.62 11.31 4.55 45.21 24.60 57.86 14.67 5.58
S 35.67 -30.46 -3.62 -7.55 0.35 12.88 31.20 28.48 108.55 44.56 27.07 25.89 -6.91 26.54 5.79 56.24 35.14 39.28 17.90 6.90
s 8.44 -18.69 -2.58 -6.79 -1.78 15.49 43.13 15.71 59.83 117.15 1.57 34.34 -12.48 30.55 21.59 46.67 34.77 33.33 22.27 1.00
W 79.54 -43.79 -5.10 -9.05 4.91 15.72 16.39 32.98 10.95 -0.93 117.56 23.20 24.22 14.02 -0.47 70.07 38.85 46.57 17.34 27.82
w 17.74 -17.47 -2.16 -5.35 2.10 12.61 23.19 16.82 22.55 23.51 44.17 66.50 -2.25 18.13 3.57 39.50 35.21 26.21 14.42 6.21
M 109.09 -57.39 -6.00 0.68 8.81 4.55 -5.14 7.04 -11.05 -11.98 4.69 -7.25 160.36 -3.45 -2.86 31.61 14.40 9.17 4.19 31.52
Z 37.56 -13.24 -3.22 -0.98 1.32 6.09 9.11 5.53 17.10 13.88 5.76 5.96 -2.27 19.33 3.01 29.60 10.64 12.79 6.22 30.99
z 9.83 -4.61 0.69 0.25 -0.56 2.35 2.28 -0.82 7.06 8.53 -0.52 3.29 -2.42 8.85 20.34 12.08 4.22 4.48 2.57 4.33
B 61.06 -42.37 3.02 -0.11 -3.05 56.39 36.12 14.63 22.86 19.43 33.25 19.98 23.90 14.39 4.25 204.51 70.57 72.24 51.54 0.67
b 38.59 -32.48 -8.39 -14.33 -4.12 50.59 17.99 24.07 35.23 41.38 27.86 27.48 12.74 19.47 9.57 247.85 145.67 44.35 48.16 4.51
Q 35.97 -24.81 -5.29 -9.96 -3.80 112.76 21.46 12.99 19.11 17.75 23.45 15.62 7.47 8.74 2.70 81.08 67.37 126.97 50.84 5.15
q 18.26 -22.93 -2.71 -5.39 -0.10 54.20 17.40 22.01 23.42 28.37 23.34 14.49 6.52 7.45 3.91 41.28 40.34 173.97 135.55 5.06
_ 54.44 -16.60 0.96 28.90 14.56 0.59 9.51 3.69 7.01 0.35 9.44 1.59 11.49 3.65 0.14 0.87 -1.84 4.33 -0.49 65.46
Example: [BQC(3)_][bqD(5)]
Some Observations
• Self-correlations suggest habitual tendencies
• Substantive QT’s rarely follow or precede page-viewing. They are associated with active searching.
• Delay is followed by N, a new query or R or I – suggesting memory refresh.
Word Similarity
• Shift the word along until the best match is found
• Logical AND: same letter
[Figure: "elected" aligned letter by letter against "election", with a bit string marking where the letters match.]
Context
• How do the general public search the web?
• Experimental study
– general public volunteers
– record sound, screens, keystrokes
• Goal: evidence-based model of effective searching
Previous studies of search logs• Web search is shallow + promiscuous• Low use of advanced features• Global statistics
– number of queries/search– Pages viewed / user– query reformulation (change in no of terms)– Most users enter few terms– Little to be gained by increasing complexity
This work• Development of quantitative analysis
• Analysis of search logs (Excite 2001)
• Development of descriptive codes
• Aim is to form a basis for the analysis of our experimental data
Aims of Quantitative Analysis
• To look at textual (syntactic) changes.• Link queries by text similarity.• Infer enquiry change from textual
dissimilarity.• Use these elements to develop a
machine-readable codification of QT’s.
Target: one two three
Target: 123 Comparison Symbol Type
Basic transfomations 1234 C Conjunction 12 D Disjunction
Common sub-phrase 124 S Replacement 231 s Reordering 1243 s Insertion/removal
Common word 145 W Replacement 132 w Reordering 143 w Repacement/insertion
Below threshold similarity 1456 Z Common word 1245678 z Common phrase
Supplementary Transformations
Code  Transformation
B     Include Boolean term
b     Remove Boolean term
Q     Include quote marks
q     Remove quote marks
_     Delay > 1 hour
Example full transformation
May include up to 4 terms, e.g. BQC(4)_
B     Boolean
Q     Quote Marks
C(4)  Substantive
_     Delay
Some examples
Code   Query1                                        Query2
QJ(k)  bargain music                                 "bargain music"
QC(k)  Bacteremia                                    "Pneumoccol Bacteremia"
qJ(k)  "university of texas" "alternative medicine"  university of texas" "alternative medicine"
qw(k)  "tax law_depreciation system"                 tax law/depreciation system
BC(k)  "the sopranos"                                "the sopranos" +scripts
BJ(k)  +"Complaint form letters" Insurance           +"Complaint form letters" +Insurance
BS(k)  doppler effect labs                           doppler effect +lab
More examples
Code    Query1                                 Query2
Bs(k)   conferences image processing           +image +processing +conferences +finland
BqW(k)  "Craig Larman"                         +Larman +Valtech
BqZ(k)  +"lbp 1000" +review +canon             +review +laser +printer
BqW(k)  Hevia AND bagpipe                      "Spanish bagpipe"
bQs(k)  +used +horse +trailer +arndt           +"horse trailer" used
bqW(k)  +arndt +"horse trailer" used           +Arndt trailer
bqs(k)  +Moby +southside +"Gwen Stefani" +mp3  +Moby +southside +mp3
Output for the first 100 Excite queries
Source file: excite.txt
word modification threshold : 0.400000
query modification level : 0.300000
sub-session delay/s : 3600
qid0  uid      nq  Modification list
   1    1  **   1  U
   2    2  **   5  NW(1)_NPP
   7    3  **   4  NS(1)PP
  11    4  **   1  U
  12    5  **   1  U
  13    6  **   1  U
  14    7  **   5  N_QNPPP
  19    8  **   4  NPPP
  23    9  **   1  U
  24   10  **   4  NQJ(1)NQN
  28   11  **   5  N_NN_NP
  33   12  **   2  N_N
  35   13  **   3  NR_R
  38   14  **   1  U
  39   15  **   1  U
  40   16  **   4  NM(1)RN
  44   17  **  21  N_N_NC(1)PPPPNW(9)PPPPC(10)PPPPPP
  65   18  **   2  NP
  67   19  **  10  NRPC(1)RP_NS(7)D(7)I(7)
  77   20  **   1  QU
  78   21  **   1  U
  79   22  **   1  U
  80   23  **   1  U
  81   24  **   1  U
  82   25  **   1  QU
  83   26  **  11  N_NC(2)PPPPW(3)NC(9)P
  94   27  **   5  NNW(2)RR
  99   28  **   1  U
 100   29  **   3  NW(1)_M(1)
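The modification lists are compact enough to be split back into one code per query. The sketch below is an illustration only: the grammar (optional B/b/Q/q prefixes, one base letter, an optional `(k)` reference, an optional trailing `_`) is inferred from the listings in these slides, and the function names are hypothetical. Starting a new sub-session at every N or U is likewise an assumption, chosen because it reproduces the three sub-sessions the slides report for uid 26.

```python
import re

# Inferred grammar: [B b Q q]* base-letter [(k)] [_]
TOKEN = re.compile(r"[BbQq]*[UNPpRIJCDSsWwMZz](?:\(\d+\))?_?")


def parse_modification_list(text: str) -> list[str]:
    """Split a modification list such as 'NW(1)_NPP' into one code per query."""
    tokens = TOKEN.findall(text)
    if "".join(tokens) != text:
        raise ValueError(f"unrecognised code in {text!r}")
    return tokens


def split_sub_sessions(tokens: list[str]) -> list[list[str]]:
    """Assumption: a sub-session starts at every N (new query) or U."""
    subs: list[list[str]] = []
    for tok in tokens:
        base = tok.lstrip("BbQq")[0]  # base letter after any B/b/Q/q prefixes
        if base in "NU" or not subs:
            subs.append([])
        subs[-1].append(tok)
    return subs


tokens = parse_modification_list("N_NC(2)PPPPW(3)NC(9)P")
print(len(tokens))                 # should equal the nq column for uid 26
print(split_sub_sessions(tokens))
```

Checking that the token count equals the nq column is a quick sanity test of the inferred grammar against the listing.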
N_NC(2)PPPPW(3)NC(9)P
One session - 3 sub-sessions
qid uid time rank query querymore totwords
83 000000000000001a 083122 0 chicago sun times No 3
84 000000000000001a 105439 0 f8 No 1
85 000000000000001a 105453 0 f8 airplane No 2
86 000000000000001a 105536 10 f8 airplane No 2
87 000000000000001a 105614 20 f8 airplane No 2
88 000000000000001a 105630 30 f8 airplane No 2
89 000000000000001a 105731 40 f8 airplane No 2
90 000000000000001a 105740 0 airplanes f8 No 2
91 000000000000001a 113441 0 ceo compensation No 2
92 000000000000001a 113633 0 2000 ceo compensation No 3
93 000000000000001a 113752 10 2000 ceo compensation No 3
1 N_
2 N
3 C(2)
4 P
5 P
6 P
7 P
8 W(3)
9 N
10 C(9)
11 P
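The time stamps in this session are enough to reproduce the delay marker. A minimal sketch, assuming the time column is HHMMSS within a single day and that `_` is attached to the query that precedes a gap longer than the sub-session delay (3600 s in the Excite output); `to_seconds` and `delay_flags` are illustrative names, not part of the original tool.

```python
def to_seconds(hhmmss: str) -> int:
    """Convert an HHMMSS time stamp (assumed format) to seconds since midnight."""
    return int(hhmmss[0:2]) * 3600 + int(hhmmss[2:4]) * 60 + int(hhmmss[4:6])


def delay_flags(times: list[str], threshold_s: int = 3600) -> list[bool]:
    """True for each query that is followed by a gap longer than threshold_s."""
    secs = [to_seconds(t) for t in times]
    flags = [later - earlier > threshold_s for earlier, later in zip(secs, secs[1:])]
    flags.append(False)  # nothing follows the last query
    return flags


# Time column for qid 83 to 93, copied from the session listing
times = ["083122", "105439", "105453", "105536", "105614", "105630",
         "105731", "105740", "113441", "113633", "113752"]
print(delay_flags(times))
```

Only the first query is flagged, which agrees with the single `N_` in the listing; the 37-minute gap before "ceo compensation" stays below the one-hour threshold.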
Query lengths
[Figure: log-log plot of frequency (1 to 1,000,000) against length in queries (1 to 100), for sessions and sub-sessions.]
10% of sub-sessions are at least 7 queries in length
QT relative frequencies
[Figure: percentage frequency (0 to 35) of each query transformation: U N P p R I J C D S s W w M Z z B b Q q _]
Terminal QT’s
i.e. the last queries in a sub-session
$$\text{Ratio} = \frac{\text{Freq}(\text{Final QT})}{\text{Freq}(\text{QT})}$$
[Figure: terminal QT ratio (0 to 1.2) for each query transformation: U N P p R I J C D S s W w M Z z B b Q q _]
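The terminal ratio can be computed directly from coded sub-sessions. The sketch below assumes each sub-session has already been reduced to a list of base code letters (prefixes, `(k)` references and the delay marker dropped); `terminal_qt_ratio` is a hypothetical name used only for illustration.

```python
from collections import Counter


def terminal_qt_ratio(sub_sessions: list[list[str]]) -> dict[str, float]:
    """For each code: (times it ends a sub-session) / (times it occurs at all)."""
    total: Counter = Counter()
    final: Counter = Counter()
    for codes in sub_sessions:
        if not codes:
            continue
        total.update(codes)
        final[codes[-1]] += 1  # the last QT of the sub-session
    return {code: final[code] / total[code] for code in total}


# The three sub-sessions of the uid 26 example, base letters only
example = [["N"], ["N", "C", "P", "P", "P", "P", "W"], ["N", "C", "P"]]
print(terminal_qt_ratio(example))
```

A ratio near 1 means the code almost always ends a sub-session; a ratio near 0 means searching usually continues after it.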
QT graphs
[Figure: QT graph for one session, running from Start to END; nodes are numbered queries and edges are labelled with QT codes (M, C, S, s, R, RP(14)).]
uid 74: NM(1)C(2)C(3)S(4)s(5)PPRPRRRRPPRRppI(5)s(6)s(22)s(22)s(23)s(25)s(26)s(22)R
Queries shown in the figure:
nursing careers
paid undergraduate nursing schools in baltimore city maryland
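QT graphs
[Figure: QT graph for one session, running from Start to END; edges are labelled with QT codes (M, C, QJ, QD, P, P(3), P(7)) and one edge is marked Delay.]
uid 342: NM(1)C(2)QJ(3)_C(2)PI(2)PPPPPPPC(2)PPPQJ(15)QD(15)
Queries shown in the figure:
molsworth
"us army"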
Frequency of nodes with k connections
[Figure: ln(f) (0 to 12) against k (0 to 10) for query lengths 10 and 20; annotated "Slope = -1".]
Exponential scaling
Intra-QT correlations
• f(A,B): measured coincident frequency of codes A and B
• E{}: Expected value
• V{}: Variance
$$D(A_i, A_j) = \frac{f(A_i, A_j) - E\{f(A_i, A_j)\}}{\sqrt{V\{f(A_i, A_j)\}}}$$
Correlations within a transform e.g. [BQC(3)_]
Intra-QT correlations
Type        B        b        Q        q        _
U       20.60        –     1.32        –        –
N       -1.48        –    23.26        –    78.27
P           –        –        –        –   -66.16
p           –        –        –        –    -9.63
R           –        –        –        –    10.53
I           –        –        –        –     4.45
J       61.85    47.37   136.42    78.37    -5.74
C       46.02   -42.81   -15.14   -19.22    -4.70
D      -34.07    62.20   -15.09    13.45    -4.79
S      -24.52   -11.14   -20.69    -7.63    -5.65
s       -2.62     9.93    -7.05     3.65    -8.04
W      -35.00   -10.35   -32.99    -6.81    -6.05
w       -2.63     9.14   -11.51    -0.98    -8.18
M      -21.05   -12.98   -37.31   -13.28    -1.97
Z       -2.26    14.11   -10.06     2.23    -0.90
z        1.78     2.82     0.55     1.45     0.95
B        0.00        –     1.16    76.78   -15.01
b           –     0.00    74.95    10.05   -11.07
Q        1.16    74.95     0.00        –    -0.28
q       76.78    10.05        –     0.00    -7.77
_      -15.01   -11.07    -0.28    -7.77     0.00
Example:
[BQC(3)_]
Application to Experimental Results
Query Transforms
qid  SS  Query                            QM(similarity)  QM(preceding)
1    *   CD albums collection             N               N
2        CD albums collection             R               R
3    *   Autotrader                       N               N
4    *   atlas                            N               N
5    *   place names                      N               N
6        place names                      R               R
7    *   map                              N               N
8    *   online competitions              N               N
9    *   Tall British buildings           N               N
10       Tall buildings                   w(9)            w(9)
11       Tall buildings                   R               R
12       Tall buildings                   R               R
13       Tall buildings in Britain        w(9)            C(12)
14       Tallest building outside London  M(9)            M(13)
Temporal Database
• A repository of all data for each session
• Accessible to SQL
• Used to build evidence-based models for searching
[Diagram: linked data sets, with links labelled uid and qid. The boxes are: Background details, Web experience and Cognitive style scores; Subject's appraisal of searches; Search queries and Web page titles; Key stroke record and Activity timings; Query modification codes; Qualitative analysis.]
Conclusions
• We have developed a rich set of codes describing the syntactic part of QT’s
• These can be used to develop a graph-based description
• Correlations between the codes are meaningful/interesting
• They will form part of the analysis for our experimental study.
Acknowledgments
• Arts and Humanities Research Council (formerly Board) for funding
• Mark Sanderson and Amanda Spink for making the Excite logs available
Questions ?
Setting WST and QST
excite: WST = 0.4
[Figure: frequency (0 to 350,000) against an axis running from 0 to 1 (titled "Query Transformation" in the chart); series: Tot New, Tot Mod and z+Z.]
Inter-QT correlations
• f(A|B): measured frequency of codes B following A
• E{}: Expected value
• V{}: Variance
$$D(A_i \mid B_j) = \frac{f(A_i \mid B_j) - E\{f(A_i \mid B_j)\}}{\sqrt{V\{f(A_i \mid B_j)\}}}$$
Correlations of one transform with the next.
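A score of this kind can be illustrated by counting how often one code follows another and comparing the count with what independence would predict. The sketch below is not the authors' method: the slides do not say how the expected value and variance are estimated, so the independence model, the binomial variance and the name `transition_scores` are all assumptions.

```python
from collections import Counter
from math import sqrt


def transition_scores(sessions: list[list[str]]) -> dict[tuple[str, str], float]:
    """Standardised deviation of each (prior, posterior) transition count.

    Assumed model: expected = n * p, variance = n * p * (1 - p),
    with p = P(prior) * P(posterior) under independence.
    """
    pairs: Counter = Counter()
    priors: Counter = Counter()
    posts: Counter = Counter()
    for codes in sessions:
        for prior, post in zip(codes, codes[1:]):
            pairs[(prior, post)] += 1
            priors[prior] += 1
            posts[post] += 1
    n = sum(pairs.values())
    scores: dict[tuple[str, str], float] = {}
    if n == 0:
        return scores
    for prior in priors:
        for post in posts:
            p = (priors[prior] / n) * (posts[post] / n)
            variance = n * p * (1 - p)
            if variance > 0:
                scores[(prior, post)] = (pairs[(prior, post)] - n * p) / sqrt(variance)
    return scores


# Toy input: two sessions of base code letters
print(transition_scores([list("NPPPP"), list("NPPNPP")]))
```

Positive scores mark transitions that occur more often than chance under the assumed model, negative scores those that occur less often.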
Inter-QT correlations
Columns: prior transformation. Rows: posterior transformation.
Type N P p R I J C D S s W w M Z z B b Q q _
N 82.40 -39.20 -2.92 13.26 22.95 2.22 10.77 9.92 5.92 -2.37 23.22 2.37 30.55 11.24 3.99 22.86 8.45 17.84 6.60 102.17
P -42.39 323.03 9.91 -15.98 -17.58 -9.12 -4.10 -6.83 -12.90 -5.45 -19.89 -5.75 -32.02 -8.76 -2.25 -25.01 -19.35 -18.59 -7.81 -71.47
p -50.08 79.89 154.30 17.11 4.96 -8.42 -18.06 -10.74 -15.30 -10.98 -21.52 -11.35 -18.32 -8.35 -2.35 -21.79 -10.70 -17.30 -7.25 21.57
R 125.10 -85.27 3.73 198.05 23.30 -2.83 0.55 -2.51 -3.93 -6.24 1.94 -3.17 14.86 1.31 -0.46 -16.30 -12.24 -0.72 -6.71 89.80
I -8.96 -39.39 7.11 25.19 152.36 23.27 35.60 20.45 19.44 10.92 33.41 15.91 61.29 5.88 1.04 0.33 6.43 -0.72 4.76 61.21
J 31.31 -28.13 0.42 -1.56 -2.36 45.43 29.05 12.92 21.68 19.21 15.47 15.55 10.37 7.08 4.06 66.72 37.31 70.63 46.88 -5.89
C 98.65 -27.61 -2.25 -7.92 -3.51 9.43 50.98 -1.42 2.57 -5.27 11.76 -2.43 7.80 10.78 1.98 33.37 6.34 25.51 3.16 -8.53
D 39.12 -24.03 -2.58 -3.66 -0.82 23.95 14.41 21.89 32.39 29.83 26.52 21.93 -4.62 11.31 4.55 45.21 24.60 57.86 14.67 5.58
S 35.67 -30.46 -3.62 -7.55 0.35 12.88 31.20 28.48 108.55 44.56 27.07 25.89 -6.91 26.54 5.79 56.24 35.14 39.28 17.90 6.90
s 8.44 -18.69 -2.58 -6.79 -1.78 15.49 43.13 15.71 59.83 117.15 1.57 34.34 -12.48 30.55 21.59 46.67 34.77 33.33 22.27 1.00
W 79.54 -43.79 -5.10 -9.05 4.91 15.72 16.39 32.98 10.95 -0.93 117.56 23.20 24.22 14.02 -0.47 70.07 38.85 46.57 17.34 27.82
w 17.74 -17.47 -2.16 -5.35 2.10 12.61 23.19 16.82 22.55 23.51 44.17 66.50 -2.25 18.13 3.57 39.50 35.21 26.21 14.42 6.21
M 109.09 -57.39 -6.00 0.68 8.81 4.55 -5.14 7.04 -11.05 -11.98 4.69 -7.25 160.36 -3.45 -2.86 31.61 14.40 9.17 4.19 31.52
Z 37.56 -13.24 -3.22 -0.98 1.32 6.09 9.11 5.53 17.10 13.88 5.76 5.96 -2.27 19.33 3.01 29.60 10.64 12.79 6.22 30.99
z 9.83 -4.61 0.69 0.25 -0.56 2.35 2.28 -0.82 7.06 8.53 -0.52 3.29 -2.42 8.85 20.34 12.08 4.22 4.48 2.57 4.33
B 61.06 -42.37 3.02 -0.11 -3.05 56.39 36.12 14.63 22.86 19.43 33.25 19.98 23.90 14.39 4.25 204.51 70.57 72.24 51.54 0.67
b 38.59 -32.48 -8.39 -14.33 -4.12 50.59 17.99 24.07 35.23 41.38 27.86 27.48 12.74 19.47 9.57 247.85 145.67 44.35 48.16 4.51
Q 35.97 -24.81 -5.29 -9.96 -3.80 112.76 21.46 12.99 19.11 17.75 23.45 15.62 7.47 8.74 2.70 81.08 67.37 126.97 50.84 5.15
q 18.26 -22.93 -2.71 -5.39 -0.10 54.20 17.40 22.01 23.42 28.37 23.34 14.49 6.52 7.45 3.91 41.28 40.34 173.97 135.55 5.06
_ 54.44 -16.60 0.96 28.90 14.56 0.59 9.51 3.69 7.01 0.35 9.44 1.59 11.49 3.65 0.14 0.87 -1.84 4.33 -0.49 65.46
Example: [BQC(3)_][bqD(5)]
Some Observations
• Self-correlations suggest habitual tendencies
• Substantive QT’s rarely follow or precede page-viewing. They are associated with active searching.
• Delay is followed by N (a new query), R or I, suggesting a memory refresh.
Number of words/query: Excite 2001
[Figure: normalised frequency (0 to 0.35) against terms/query (log scale, 1 to 100).]