Improving the effectiveness of Web searching: Methodological issues

Barry Eaglestone
Department of Information Studies
University of Sheffield
B.Eaglestone@shef.ac.uk

Overview

• An inductive study to build evidence-based meta-cognitive models of web searching by the general public.

• Data modelling issues
– A temporal data modelling solution

• Discussion & Final thoughts

An inductive study of how the general public search on the web.

Setting the scene – the database approach and state of the art.

Motivation

• Need to develop new models for searching: update outdated usage paradigms.
– Improve training methods
– Develop automated assistance systems

Previous studies of search logs

• Web search is shallow + promiscuous
• Low use of advanced features
• Global statistics
– Number of queries / search
– Pages viewed / user
– Query reformulation (change in number of terms)
– Most users enter few terms
– Little to be gained by increasing complexity

The Team

[Diagram: the team spans Information Seeking, Chemoinformatics and Database.]

Spectrum of Research Perspective

[Diagram: a spectrum from Soft to Hard. Labels: People world (IS); Computer world (CS); Discovery; Invention; Problem solving; Formalism; Human / organisational issues; Qualitative / quantitative data analysis / modelling; Modelling / engineering / empirical; Formally defined problems; Computer world formalisations; Hardware / software solutions.]

How will we use it?

Effectiveness?

Meta-cognitive knowledge about web searching?

How do they search?

Who are the searchers? What are they searching for?

Infer effectiveness from
• search transformation patterns
• subject’s narrative

Context

The GENERAL PUBLIC

Volunteers (c. 500 searches):
• ICT courses
• University evening classes
• City Learning Centre courses
• Citizens’ forum
• Personal contacts
• Library
• Advertising
• Students and academics

+ over 1,000,000 search logs from anonymous searchers

• Self-selected searches explained through interview and think-aloud protocols
• 2-3 set searches

Observe and record
• Over 1,000,000 anonymous search engine transaction logs
• c. 500 observed and recorded searches; talk to searchers

• Determine query similarity
• Delimit searches
• Code query transformations
• Model searches as transformation graphs
• Data mine for stereotypical search strategies
• Correlate with who, why and effectiveness

Thus, establish evidence-based models of search strategy, related to user and problem characteristics and likelihood of success.

• Evidence-based meta-cognitive training
• Intelligent interfaces

Why Meta-cognition?

“Meta-cognition refers to higher order thinking which involves active control over the cognitive processes engaged in learning. ….” Livingston (1997)

• Meta-cognitive knowledge
– “…knowledge of personal variables to general knowledge about how human beings learn and process information, as well as individual knowledge of one’s own learning processes…”, e.g. “I have a bad memory!”

• Meta-cognitive regulation
– “… activities used to ensure that a cognitive goal has been met….”, e.g., question yourself about the text and then re-read.

Livingston (1997)

Cognitive Styles Analysis
http://www.memletics.com/manual/default.asp?ref=ga&data=999+learning+styles+free+test

[Diagram: two cognitive style dimensions, Holist – Analyst and Verbalizer – Imager.]

Syntactical/quantitative vs. semantic/qualitative

• Excite search logs: ~10^6 searches
• Holistic search logs: supplemented with qualitative data

Preliminary work

• Analysis of search logs

• Development of descriptive codes

• Aim is to form a basis for the analysis of our experimental data

Strengths / Limitations
• Large sample
• Definitely general public.
• No enquiry context – what are they looking for? What are they thinking?
• No measure of success.
• Are they searching or just browsing?
• Where does one enquiry end and another begin?
• Limited to one search engine – what did they do during a delay?

Excite Database Sample

qid  uid               time    rank  query                           querymore  totwords
343  000000000000006a  192141  0     alco fence company ohio         No         4
347  000000000000006a  192328  0     lifetime fence company ohio     No         4
348  000000000000006a  192359  10    lifetime fence company ohio     No         4
349  000000000000006a  192455  0     lifetime wire fence             No         3
350  000000000000006a  192634  0     high tensile wire fence         No         4
351  000000000000006b  161906  0     sickle cell anemia              No         3
354  000000000000006c  144303  0     Hilton Garden Inn               No         3
355  000000000000006c  144331  0     Hilton Garden Inn Jacksonville  No         4
356  000000000000006c  144433  0     Hotel Search                    No         2
357  000000000000006c  144541  0     Jacksonvill Hotel               No         2
358  000000000000006c  144728  0     www.hilton.com                  No         1

~10^6 queries
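As an illustration of how rows like those in the sample might be read and grouped by user, here is a minimal Python sketch. It assumes whitespace-separated fields in the order shown (qid, uid, time, rank, query, querymore, totwords), with the query being everything between the first four and last two fields. The names `LogRecord`, `parse_record` and `group_by_user` are hypothetical and are not taken from the project's software.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LogRecord:
    qid: int
    uid: str
    time: str        # HHMMSS, as printed in the sample
    rank: int        # value logged in the rank column (0, 10, 20, ...)
    query: str
    querymore: str
    totwords: int


def parse_record(line: str) -> LogRecord:
    """Split one log row; the query may contain several words."""
    parts = line.split()
    qid, uid, time, rank = parts[:4]
    querymore, totwords = parts[-2:]
    query = " ".join(parts[4:-2])
    return LogRecord(int(qid), uid, time, int(rank), query, querymore, int(totwords))


def group_by_user(lines):
    """Collect records per uid, keeping the order in which they appear."""
    sessions = defaultdict(list)
    for line in lines:
        if line.strip():
            record = parse_record(line)
            sessions[record.uid].append(record)
    return dict(sessions)


SAMPLE = """\
350 000000000000006a 192634 0 high tensile wire fence No 4
351 000000000000006b 161906 0 sickle cell anemia No 3
354 000000000000006c 144303 0 Hilton Garden Inn No 3
355 000000000000006c 144331 0 Hilton Garden Inn Jacksonville No 4
"""

sessions = group_by_user(SAMPLE.splitlines())
for uid, records in sessions.items():
    print(uid, [r.query for r in records])
```

Grouping by uid gives one list of queries per anonymous user, which is the unit the later session analysis works on.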

Sessions

Query Transformations
• Changes in search strategy
– Conceptual, e.g. changes in type of search: broad to specific, text to image
– Linguistic: syntactic, query structure.
• Examples
Q1: shakespeare hamlet
Q2: shakespeare hamlet quotes
Q3: to be or not to be
Q4: “to be or not to be”
Q5: “to be or not to be” +shakespeare

Our Preliminary Analysis

• To look at textual (syntactic) changes.
• Link queries by text similarity.
• Infer enquiry change from textual dissimilarity.
• Use these elements to develop a machine-readable codification of QT’s.
• To mine for characteristic patterns.

Code   Transformation
N      New query
R      A repeated query / same page rank – relevance feedback.
P      Page ranking (seek more)
p      Page ranking (earlier pages)
I(k)   Identical
C(k)   Conjoint
S(k)   Sub-phrase in common
s(k)   Sub-phrase + words in common
M(k)   Other textual similarity

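Example Transformations

QT graphs

[Figure: QT graph for uid 74. Numbered query nodes follow on from Start (N); edges are labelled with QT codes (M, C, C, s); a run of repeat and page-view steps is shown as RP(14).]

uid 74: NM(1)C(2)C(3)S(4)s(5)PPRPRRRRPPRRppI(5)s(6)s(22)s(22)s(23)s(25)s(26)s(22)R

Queries shown on the graph:
• nursing careers
• paid undergraduate nursing schools in baltimore city maryland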
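The coded string for uid 74 can be turned into the edges of a QT graph by reading one code per query and following each (k) back to the query it refers to. The sketch below is only an illustration under that reading; it handles just the codes in the table above (N, R, P, p, I, C, S, s, M), and `qt_steps` and `qt_edges` are hypothetical names.

```python
import re

# One code letter per query, optionally followed by the linked query number (k).
CODE = re.compile(r"([NRPpICSsM])(?:\((\d+)\))?")


def qt_steps(session_code):
    """Return (query_no, code, k) per query; k is None for codes without a link."""
    steps = []
    pos = 0
    while pos < len(session_code):
        match = CODE.match(session_code, pos)
        if match is None:
            raise ValueError(f"unrecognised code at position {pos}")
        k = int(match.group(2)) if match.group(2) else None
        steps.append((len(steps) + 1, match.group(1), k))
        pos = match.end()
    return steps


def qt_edges(session_code):
    """Edges (k, query_no, code) linking each query to the earlier query it modifies."""
    return [(k, i, code) for i, code, k in qt_steps(session_code) if k is not None]


UID_74 = "NM(1)C(2)C(3)S(4)s(5)PPRPRRRRPPRRppI(5)s(6)s(22)s(22)s(23)s(25)s(26)s(22)R"

steps = qt_steps(UID_74)
edges = qt_edges(UID_74)
print(len(steps), "queries;", len(edges), "linked transformations")
print(edges[:5])
```

Every edge points from a lower to a higher query number, so the result is a directed acyclic graph that can be passed to any graph library for drawing or mining.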


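QT graphs

[Figure: QT graph for uid 342, running from Start through edges labelled M, C and C; node numbers shown include 6, 15 and 19.]

uid 342: NM(1)C(2)QJ(3)_C(2)PI(2)PPPPPPPC(2)PPPQJ(15)QD(15)

Queries shown on the graph:
• molsworth
• "us army"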

Preliminary Conclusions
• We have developed a rich set of codes describing the syntactic part of QT’s
• These can be used to develop a graph-based description
• Correlations between the codes are meaningful/interesting
• They form part of the analysis for our current experimental study.

…and if you want to read about our preliminary results….

• Whittle M, Eaglestone B, Ford N, Madden A (2007). Data Mining of Search Logs. Journal of the American Society for Information Science and Technology (in press).

• Whittle M, Eaglestone B, Ford N, Gillet VJ, Madden A (2006). Query Transformations And Their Role In Web Searching By The General Public. Information Research, 12(1), October 2006.

• Whittle M, Eaglestone B, Ford N, Gillet V, Madden A (2006). Query transformations and their role in web searching by the general public. Information Seeking in Context Conference (ISIC) 2006, Australia.

• Madden A, Eaglestone B, Ford N, Whittle M (2006). Search engines: a first step to finding information: preliminary findings from a study of observed searches. Information Seeking in Context Conference (ISIC) 2006, Australia.

Sheffield Experimental Study

[Diagram: data flow. Inputs: screens and audio; keystrokes, queries and Web page titles. Steps: transcribing; pre-processing. Outputs: qualitative analysis; quantitative analysis; temporal database; model development.]

Data modelling issues

Evolution of databases

The database approach – A database should be a natural representation of information as data, suitable for all relevant applications without duplication, including the ones you have not yet thought of.

“A well designed database system will mirror its users’ perceptions of the problem space, and thus allows them to address the problem in hand without complexities and distractions of computer world implementation details… Implicit is the notion that users should work within the bounds of ‘good practice’”

The semantic gap

The gap between what you wish to represent and what you can represent.

[Diagram: entities Customer and Salesperson are linked to Sales_Order by the relationships Placed_by and Take_by.]

Customer
C#   Name             …
C1   Dr. Eaglestone
C2   Ms Smith

Salesperson
SP#  Names      …
S5   Mr. Chan   …
S8   Dr. Shao

SalesOrder
C#   SP#  Product  Quantity
C1   S5   P99      120
C1   S5   P2       10


….. & Data Independence

Applications/Users

External Model

Logical Model

Internal Model

Principles of database technology…


A Ready-made Temporal data modelling solution

GENREG – A ready-made solution that has also been proposed for healthcare?

The Organisation: National Museum of Denmark

• Multimedia
– Pictures as well as descriptions
• Distributed
– Each department ran their own database system for their collection (ownership!)
• Object-oriented design
– Entities, not just values
• Relational implementation

[Diagram: Database Research, relating Science, Technology, Application, Praxis and Theory.]

Topology
• Danish Pre-history
• Department of Antiquity
• Ethnographic Department
• Coin Collection

1,000,000 artefacts, 200,000 images

Design / Abstractions
• Design
– Object oriented
– Based on a curator’s perspective
– “Curators apply scientific training to determine the history of artefacts…creating knowledge about past and present societies by determining relationships which group artefacts within certain times and places in history”
• Abstractions
– Artefact
– Event
– Relationship: relate artefacts which participate in common events

[Diagram: example relationship “used to fabricate”, attached to Brooches.]

GENREG data model

ARTIFACT and EVENT/ARTIFACT: one (or more) artifacts participate in one or more events.

[Diagram: examples. A burial site contains graves, and each grave contains artefacts. Furniture is linked to a Merchant’s House and a Manor House through a purchase event.]

Integrated Care Pathways Application

[Procter, P., Eaglestone, B.M. & Burdis, C. “A unified model to support an information intensive healthcare environment, MIE

Treatment

Alternative diagnoses

Alternative prognoses

A formal GENREG Model

type Genreg = abs [
    tuple[ Collection : Artifacts, Events : set[Event] ]
    new : () → Genreg,
    = : (Genreg × Genreg) → boolean,
    events : (Genreg) → set[Event],
    collection : (Genreg) → Artifacts ]

type Artifacts = graph[Artifact]

type Event = abs [
    tuple[ id : E_Id, type : Event_Type, t : Time, place : Location, actors : set[Actor_Type], edge : set[Edge] ]
    = : (Event × Event) → boolean,
    id : (Event) → E_Id,
    type : (Event) → Event_Type,
    t : (Event) → Time,
    place : (Event) → Location,
    actors : (Event) → set[Actor_Type],
    edgeset : (Event) → set[Edges] ]
…

type Time = abs [
    tuple[ lower, upper : T ]
    new : () → Time,
    = : (Time × Time) → boolean,
    before : (Time × Time) → boolean,
    meets : (Time × Time) → boolean,
    overlaps : (Time × Time) → boolean,
    during : (Time × Time) → boolean,
    starts : (Time × Time) → boolean,
    finishes : (Time × Time) → boolean,
    …

Operations on a GENREG database
• add_artifact / delete_artifact (D, a)
• add_event / delete_event (D, e)
• merge (D,F,E)
• select_artefacts (D,p)
• select_events (D,p)
• related_to (D,n)
• related_by (D,e,n)
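The Time type above lists interval predicates (before, meets, overlaps, during, starts, finishes) without giving their definitions. The sketch below assumes they carry their usual meaning from Allen's interval algebra, which the slides do not state explicitly; the Python class and its field types are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Time:
    """A closed interval [lower, upper]; semantics assumed from Allen's interval algebra."""
    lower: float
    upper: float

    def __post_init__(self):
        if self.lower > self.upper:
            raise ValueError("lower bound must not exceed upper bound")

    def before(self, other: "Time") -> bool:
        # self ends strictly before other begins
        return self.upper < other.lower

    def meets(self, other: "Time") -> bool:
        # self ends exactly where other begins
        return self.upper == other.lower

    def overlaps(self, other: "Time") -> bool:
        # self starts first and ends inside other
        return self.lower < other.lower < self.upper < other.upper

    def during(self, other: "Time") -> bool:
        # self lies strictly inside other
        return other.lower < self.lower and self.upper < other.upper

    def starts(self, other: "Time") -> bool:
        # same start, self ends earlier
        return self.lower == other.lower and self.upper < other.upper

    def finishes(self, other: "Time") -> bool:
        # same end, self starts later
        return self.upper == other.upper and self.lower > other.lower


bronze_age_find = Time(1950, 1960)
excavation = Time(1955, 1970)
print(bronze_age_find.overlaps(excavation))
```

Representing event times as intervals rather than points is what lets retrospective, imprecise dating ("some time between lower and upper") sit alongside precisely timed events.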

Temporal Data Models (See also SQL/Temporal)

[Diagram: states of one entity along a time axis.]
• Time: 1950. Entity: Barry; Height: 2’ 3’’
• Time: 2004. Entity: Barry; Height: 5’ 10’’

• Artefact histories are created retrospectively

• Multiple orthogonal time dimensions can be represented (using specialised events), e.g., discovery and historic time.

• Relationships between events and states are modelled.

• Multiple objects can represent different states and interpretations of an entity.


Some final thoughts…

• The Database Approach?
• Semantic gap?
• Data independence?
• Temporal modelling?
• Query language?
• So, what’s happening?

IR & DB?

IR – collections of artefacts are available for ad hoc querying (any relevant problem). The problem is modelled by the query.

DB – collections of artefacts are structured to model the problem space.

[Diagram: Server(s): Internet-accessible repositories of artefacts. Client(s): users are researchers who derive knowledge from retrieved artefacts. A problem-related query goes to the server; problem-relevant artefacts come back. Researcher’s workspace: an artefact collection developed to model the problem space.]

…final thoughts…
• Knowledge of research methodology is important (qualitative and quantitative)
• Nudist, Atlas, SPSS don’t support mixed methods
• Database approach allows integration of qualitative and quantitative data, and organisation of data to evolve to model emerging theory
• Temporal data models are key to modelling evolving strategy…

Acknowledgments

• The project team – Nigel Ford, Andrew Madden, Martin Whittle

• Arts and Humanities Research Council (formerly Board) for funding

• Mark Sanderson and Amanda Spink for making the Excite logs available

• Val Gillet and Eleanor Gardiner for help with graphs.

Summary

• Feedback can lead to semantic changes
• Complexity can be a hindrance
• Searches don’t necessarily end when a searcher leaves a search engine.

Algorithm

Loop over session queries: for i = 1 to n
    Loop over previous queries: for j = 1 to i-1
        Compare query i with j
    Choose most similar pair i,j
    Analyse to assign QT type

Some Preliminary Observations

• Quote marks are likely to be used with a new query.

• Delay is strongly associated with N (New query): these are successful single queries within a session.

• B (Include Boolean) & C (Conjoint) are positively associated

• B & D (Disjoint) are negatively associated

[Figure: Number of words/query, Excite 2001; x-axis: terms/query, ticks at 1, 10, 100.]

Classification of textual QT’s

• Word order, addition, subtraction.
• Inclusion or removal of
– Boolean terms
– “quotes”
• Detection of new enquiries.
• We use similarity methods to compare words and queries.

Self-selected searches

Prompts:
• Think about the last time you had trouble finding something you were looking for on the Internet.
• Do you have any hobbies or interests for which the Internet might provide useful information?

Hölscher & Strube (2000): Graphical Representation

[Figure: close-up of direct interaction with a search engine; numbers show transition probabilities. Experts and novices doing specific search tasks.]

Set searches

Heads:
• What was written on Neville Chamberlain’s piece of paper?
• You’ve won a holiday to Saga. What can you find out about the place that interests you?

Tails:
• You’ve received a postcard from friends who say they’re visiting Map. Where are they?
• There are many opportunities to win things on the Internet. Can you find some that relate to your interests?

Additional search:
• Find the postcode of the tallest building in the UK outside of London.

All searches recorded using Spector Pro (keystroke recorder) and My Screen Recorder (which records voice + activities on PC).

Annotated transcripts

Each entry gives the time at which the stated action takes place and the browse time preceding the action.

Search 1
00.50 “I might as well go with what I know best”
01.20 (enters ‘CD albums collection’)
01.27 (6s browse) Selects 2nd link (CD universe)
01.53 (31s browse) – selects Dance = 7 of ? (>24) (on LHS). “See this is the trouble, cos I don’t really know what category it would go into. It was a mixed CD so it’s got all sorts of different things on, and there’s not really a category for that, I don’t think.”
01.56 (8s browse) – Selects Dance Collections = 7 of 12 (top of page)

Search dimensions

Volunteer  Search no.  On  Off  On/(On+Off)  Depth  Intensity: mean (s.d.)
1          1           2   1    0.67         1      43.33 (24.66)
           2           10  8    0.56         2      14.72 (15.1)
           3           6   5    0.55         3      12.27 (11.26)
           4           3   1    0.75         1      7.5 (6.45)
2          1           30  14   0.68         6      4.55 (6.36)
           2           22  8    0.73         2      7.67 (9.8)
           3           8   1    0.89         1      13.33 (16.96)
           4           24  2    0.92         1      6.73 (12.88)

Progress

c. 54 volunteers observed since Oct 2005 (representing c. 200 searches).

cf Transaction Logs

Internet searches are often regarded as being ‘shallow and promiscuous’ (= many short, simple searches). This idea is supported by the view of searches obtained from search engine transaction logs, which give a useful summary of search engine use, but not of Web search behaviour viewed as a whole.

Feedback loops

Learn from previous searches
E.g. semantic shifts

Sheffield Pals Battalion

Richard Sparling

Complex search ≠ good search

Familiarity with search engine facilities (Boolean, “”, etc) does not always indicate competence. E.g.: postcode "tallest building outside london" –london.

Use the general to find the specialist

Search engine used to find a more focussed search tool. E.g. – searcher looking for info on B&B in York finds a directory of holiday accommodation.

• Jansen ref re complexity
• Findings title
• Search dimensions slide
• Database side – modelling.

Previous studies of search logs
• Web search is shallow + promiscuous
• Low use of advanced features
• Global statistics

Strengths
• Large sample.
• Natural environment.
• Definitely general public.

Limitations
• No enquiry context – what are they looking for? What are they thinking?
• No measure of success.
• Are they searching or just browsing?
• Where does one enquiry end and another begin?
• Limited to one search engine – what did they do during a delay?

Experimental Study

• Strengths
– Very detailed information.
– Searching not surfing.
– Comparison of identical enquiries.
• Limitations
– Small sample of queries.
– Limited public sample – volunteers.

This work
• Development of quantitative analysis
• Analysis of search logs (Excite 2001)

Aims of Quantitative Analysis
• Machine-readable codification of QT’s.

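Word similarity

Dice Coefficient: twice the number of matching letters, divided by the total number of letters in the two words.

Example: elected / election

e l e c t e d
e l e c t i o n
1 1 1 1 1 0 0 0

$S_W = \frac{2 \times 5}{7 + 8} = 0.667$

Drawback: on this measure doing and going are very similar (0.8), while bug and debugging have $S_W = 0.5$.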

Word Similarity Threshold

ding / ping: $S_W = 0.75$
bring / thing: $S_W = 6/10 = 0.6$
trying / string: $S_W = 6/12 = 0.5$
nursing / training: $S_W = 0.4$

• Partial solution: introduce threshold WST = 0.4
• Anything less similar than WST is given $S_W = 0$
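The word-similarity values quoted on these slides can be reproduced with a short function. The sketch below assumes the measure counts letters that coincide position by position at the best relative shift of the two words, then applies the Dice formula and the WST cut-off. That reading matches the quoted numbers, but the function names and the exact matching rule are assumptions rather than the project's code.

```python
def best_shift_matches(a: str, b: str) -> int:
    """Largest number of position-wise letter matches over all shifts of a along b."""
    best = 0
    for shift in range(-(len(a) - 1), len(b)):
        count = sum(
            1
            for i, ch in enumerate(a)
            if 0 <= i + shift < len(b) and b[i + shift] == ch
        )
        best = max(best, count)
    return best


def word_similarity(a: str, b: str, wst: float = 0.4) -> float:
    """Dice coefficient on matching letters; values below the threshold wst become 0."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    sw = 2 * best_shift_matches(a, b) / (len(a) + len(b))
    return sw if sw >= wst else 0.0


for pair in [("elected", "election"), ("doing", "going"), ("bug", "debugging"),
             ("ding", "ping"), ("leaf", "spain")]:
    print(pair, round(word_similarity(*pair), 3))
```

Passing `wst=0` shows the raw coefficient, which is useful when choosing the threshold.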

Query Similarity
• For each word in query 1 find the most similar word in query 2 and combine results
• Accommodates repeated words (in query 2) without weighting
• Main point of WST is to avoid the accumulation of many small contributions to the query similarity

Query Similarity Example

Query 1: leaf gelatin supplier barcelona
Query 2: gelatine supplies in spain

leaf: Score = 0
gelatin (gelatine): Score = 0.93
supplier (supplies): Score = 0.88
barcelona: Score = 0

$S_Q = \frac{\text{sum of scores}}{\text{max number of words}}$

Evaluate: $S_Q = 0.453$
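The worked example can be checked with a small script. It is a sketch only: the word measure is the same assumed best-shift Dice coefficient used for the word examples, "max number of words" is read as the larger of the two query lengths, and all names are hypothetical.

```python
def word_similarity(a: str, b: str, wst: float = 0.4) -> float:
    """Dice coefficient on letters matched at the best shift; 0 below the threshold."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    best = max(
        sum(1 for i, ch in enumerate(a) if 0 <= i + s < len(b) and b[i + s] == ch)
        for s in range(-(len(a) - 1), len(b))
    )
    sw = 2 * best / (len(a) + len(b))
    return sw if sw >= wst else 0.0


def query_similarity(q1: str, q2: str, wst: float = 0.4) -> float:
    """Sum, over the words of q1, of the best match in q2, divided by the longer query length."""
    words1, words2 = q1.split(), q2.split()
    if not words1 or not words2:
        return 0.0
    total = sum(max(word_similarity(a, b, wst) for b in words2) for a in words1)
    return total / max(len(words1), len(words2))


sq = query_similarity("leaf gelatin supplier barcelona", "gelatine supplies in spain")
print(round(sq, 3))
```

With unrounded word scores the result is about 0.452; the slide's 0.453 comes from adding the rounded scores 0.93 and 0.88.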

Query Similarity Threshold

We are looking for the most similar previous query to i.

[Diagram: query i compared with each earlier query j along the time axis.]

If none are similar, maybe i is a new enquiry.

Set QST = 0.3 as lowest acceptable similarity for a valid query connection.

Setting WST and QST

• Result narrowed down by close inspection

• In the first 300 queries, the setting WST = 0.4 and QST = 0.3 agreed with a human analysis of the best categorisation in all cases bar one, which was in any case an unusual entry.

Algorithm

Loop over session queries: for i = 1 to n
    Loop over previous queries: for j = 1 to i-1
        Compare query i with j
    Choose most similar pair i,j; assign k = j
    Analyse to assign QT type (k to i)
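The double loop above can be sketched in a few lines of Python. To keep the example self-contained, the similarity function is a plain word-overlap (Jaccard) score standing in for the letter-based query similarity of the slides; that stand-in, the tie-breaking rule (earliest query wins) and the names are all assumptions.

```python
def jaccard(q1: str, q2: str) -> float:
    """Stand-in similarity: shared words / all words. Not the slides' measure."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0


def link_queries(queries, similarity=jaccard, qst=0.3):
    """For each query i return k, the 1-based number of the most similar earlier query,
    or None when no earlier query reaches the threshold qst (a candidate new enquiry)."""
    links = []
    for i, query in enumerate(queries):
        best_k, best_score = None, 0.0
        for j in range(i):                      # loop over previous queries
            score = similarity(query, queries[j])
            if score > best_score:              # strict '>' keeps the earliest on ties
                best_k, best_score = j + 1, score
        links.append(best_k if best_score >= qst else None)
    return links


session = [
    "Hilton Garden Inn",
    "Hilton Garden Inn Jacksonville",
    "Hotel Search",
    "Jacksonvill Hotel",
    "www.hilton.com",
]
links = link_queries(session)
print(links)
```

Because `similarity` is a parameter, the letter-based query similarity can be dropped in without changing the loop; it would, for instance, also relate the misspelt "Jacksonvill" to "Jacksonville", which the word-overlap stand-in cannot.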

“Trivial” Transformations

Code  Transformation
U     Unique
N     New query
R     Repeated query
P     Page viewing (seek more)
p     Page viewing (earlier pages)

Substantive Transformations I

Code   Transformation (relative to k)
I(k)   Identical
J(k)   Identical apart from Quotes/Boolean
C(k)   Conjoint
D(k)   Disjoint
S(k)   Sub-phrase in common
s(k)   Sub-phrase + words in common

Substantive Transformations II

Code   Transformation (relative to k)
W(k)   Single word in common
w(k)   Separated single words in common
M(k)   Other textual similarity

Below Threshold Similarity
Z(k)   Not similar but word in common
z(k)   Not similar but words in common

Target: one two three (written as 123)

Comparison  Symbol  Type

Basic transformations
1234        C       Conjunction
12          D       Disjunction

Common sub-phrase
124         S       Replacement
231         s       Reordering
1243        s       Insertion/removal

Common word
145         W       Replacement
132         w       Reordering
143         w       Replacement/insertion

Below threshold similarity
1456        Z       Common word
1245678     z       Common phrase

Supplementary Transformations

Code  Transformation
B     Include Boolean term
b     Remove Boolean term
Q     Include quote marks
q     Remove quote marks
_     Delay > 1 hour

Example full transformation

May include up to 4 terms, e.g. BQC(4)_
• B: Boolean
• Q: Quote marks
• C(4): Substantive
• _: Delay
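A full transformation such as BQC(4)_ can be split into its parts with a regular expression. The sketch below assumes the component order shown in the example (Boolean, quote marks, main code with optional (k), delay); the `Transformation` class and `decode` function are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Optional B/b, optional Q/q, one main code, optional (k), optional delay marker.
PATTERN = re.compile(r"([Bb]?)([Qq]?)([UNRPpIJCDSsWwMZz])(?:\((\d+)\))?(_?)")


@dataclass
class Transformation:
    boolean: Optional[str]   # 'B' include, 'b' remove, None if unchanged
    quotes: Optional[str]    # 'Q' include, 'q' remove, None if unchanged
    code: str                # trivial or substantive code letter
    k: Optional[int]         # query the transformation is relative to
    delay: bool              # True when the delay marker '_' is present


def decode(text: str) -> Transformation:
    match = PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognised transformation: {text!r}")
    boolean, quotes, code, k, delay = match.groups()
    return Transformation(boolean or None, quotes or None, code,
                          int(k) if k else None, delay == "_")


print(decode("BQC(4)_"))
print(decode("qJ(1)"))
```

Keeping the four components as separate fields makes it straightforward to count how often, say, quote marks coincide with a conjoint change.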

Some examples

Code    Query1                                          Query2
QJ(k)   bargain music                                   “bargain music”
QC(k)   Bacteremia                                      “Pneumoccol Bacteremia”
qJ(k)   “university of texas” “alternative medicine”    university of texas” “alternative medicine”
qw(k)   "tax law_depreciation system"                   tax law/depreciation system
BC(k)   "the sopranos"                                  "the sopranos" +scripts
BJ(k)   +"Complaint form letters" Insurance             +"Complaint form letters" +Insurance
BS(k)   doppler effect labs                             doppler effect +lab

More examples

Code    Query1                                          Query2
Bs(k)   conferences image processing                    +image +processing +conferences +finland
BqW(k)  "Craig Larman"                                  +Larman +Valtech
BqZ(k)  +"lbp 1000" +review                             +canon +review +laser +printer
BqW(k)  Hevia AND bagpipe                               "Spanish bagpipe"
bQs(k)  +used +horse +trailer                           +arndt +"horse trailer" used
bqW(k)  +arndt +"horse trailer" used                    +Arndt trailer
bqs(k)  +Moby +southside +"Gwen Stefani" +mp3           +Moby +southside +mp3

Output for the first 100 Excite queries

Source file: excite.txt
word modification threshold : 0.400000
query modification level : 0.300000
sub-session delay/s : 3600

qid0  uid     nq  Modification list
1     1   **  1   U
2     2   **  5   NW(1)_NPP
7     3   **  4   NS(1)PP
11    4   **  1   U
12    5   **  1   U
13    6   **  1   U
14    7   **  5   N_QNPPP
19    8   **  4   NPPP
23    9   **  1   U
24    10  **  4   NQJ(1)NQN
28    11  **  5   N_NN_NP
33    12  **  2   N_N
35    13  **  3   NR_R
38    14  **  1   U
39    15  **  1   U
40    16  **  4   NM(1)RN
44    17  **  21  N_N_NC(1)PPPPNW(9)PPPPC(10)PPPPPP
65    18  **  2   NP
67    19  **  10  NRPC(1)RP_NS(7)D(7)I(7)
77    20  **  1   QU
78    21  **  1   U
79    22  **  1   U
80    23  **  1   U
81    24  **  1   U
82    25  **  1   QU
83    26  **  11  N_NC(2)PPPPW(3)NC(9)P
94    27  **  5   NNW(2)RR
99    28  **  1   U
100   29  **  3   NW(1)_M(1)

N_NC(2)PPPPW(3)NC(9)P

One session - 3 sub-sessions

qid  uid               time    rank  query                  querymore  totwords
83   000000000000001a  083122  0     chicago sun times      No         3
84   000000000000001a  105439  0     f8                     No         1
85   000000000000001a  105453  0     f8 airplane            No         2
86   000000000000001a  105536  10    f8 airplane            No         2
87   000000000000001a  105614  20    f8 airplane            No         2
88   000000000000001a  105630  30    f8 airplane            No         2
89   000000000000001a  105731  40    f8 airplane            No         2
90   000000000000001a  105740  0     airplanes f8           No         2
91   000000000000001a  113441  0     ceo compensation       No         2
92   000000000000001a  113633  0     2000 ceo compensation  No         3

Annotated transformations: query 3 is C(2), query 8 is W(3), query 10 is C(9).
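The delay marker in this session's code can be recovered from the time column. The following sketch assumes the times are HHMMSS strings from a single day (no midnight wrap-around) and uses the 3600 s threshold printed in the program output; `to_seconds` and `delay_flags` are hypothetical helper names.

```python
def to_seconds(hhmmss: str) -> int:
    """Convert an HHMMSS string such as '105439' to seconds since midnight."""
    hours, minutes, seconds = int(hhmmss[0:2]), int(hhmmss[2:4]), int(hhmmss[4:6])
    return 3600 * hours + 60 * minutes + seconds


def delay_flags(times, threshold_s: int = 3600):
    """One flag per gap between consecutive queries: True when the gap exceeds the threshold."""
    secs = [to_seconds(t) for t in times]
    return [later - earlier > threshold_s for earlier, later in zip(secs, secs[1:])]


# Times of queries 83 to 92 from the table above.
times = ["083122", "105439", "105453", "105536", "105614",
         "105630", "105731", "105740", "113441", "113633"]
flags = delay_flags(times)
print(flags)
```

Only the first gap (08:31:22 to 10:54:39) exceeds one hour, which agrees with the single "_" in the session code; the 37-minute gap before "ceo compensation" does not count as a delay.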

Query lengths

[Figure: distribution of length (in queries) for sessions and sub-sessions; x-axis ticks at 1, 10, 100; y-axis ticks at 100000 and 1000000.]

10% of sub-sessions are at least 7 queries in length.

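QT relative frequencies

[Figure: relative frequency of each query transformation: U N P p R I J C D S s W w M Z z B b Q q _]

Terminal QT’s

[Figure: ratio for each query transformation: U N P p R I J C D S s W w M Z z B b Q q _]

$\text{Ratio} = \frac{\text{Freq}(\text{Final QT})}{\text{Freq}(\text{QT})}$

i.e. the last queries in a sub-session.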


Frequency of nodes with k connections

[Figure: frequency of nodes against k, for k = 0 to 10; series for query length 10 and query length 20; reference line with slope = -1; annotation: exponential scaling.]

Intra-QT correlations

• f(A,B): measured coincident frequency of codes A and B
• E{}: expected value
• V{}: variance

$D(A_i, A_j) = \frac{f(A_i, A_j) - E\{f(A_i, A_j)\}}{\sqrt{V\{f(A_i, A_j)\}}}$

Correlations within a transform, e.g. [BQC(3)_]

Type  B       b       Q       q       _
U     20.60   –       1.32    –       –
N     -1.48   –       23.26   –       78.27
P     –       –       –       –       -66.16
p     –       –       –       –       -9.63
R     –       –       –       –       10.53
I     –       –       –       –       4.45
J     61.85   47.37   136.42  78.37   -5.74
C     46.02   -42.81  -15.14  -19.22  -4.70
D     -34.07  62.20   -15.09  13.45   -4.79
S     -24.52  -11.14  -20.69  -7.63   -5.65
s     -2.62   9.93    -7.05   3.65    -8.04
W     -35.00  -10.35  -32.99  -6.81   -6.05
w     -2.63   9.14    -11.51  -0.98   -8.18
M     -21.05  -12.98  -37.31  -13.28  -1.97
Z     -2.26   14.11   -10.06  2.23    -0.90
z     1.78    2.82    0.55    1.45    0.95
B     0.00    –       1.16    76.78   -15.01
b     –       0.00    74.95   10.05   -11.07
Q     1.16    74.95   0.00    –       -0.28
q     76.78   10.05   –       0.00    -7.77
_     -15.01  -11.07  -0.28   -7.77   0.00

Example: [BQC(3)_]

Some Observations

• Quote marks are likely to be used with a new query.

• Delay is strongly associated with N: these are successful single queries within a session.

• B & C are positively associated
• B & D are negatively associated

Application to Experimental Results

Query Transforms

qid  SS  Query                            QM(similarity)  QM(preceding)
1    *   CD albums collection             N               N
2        CD albums collection             R               R
3    *   Autotrader                       N               N
4    *   atlas                            N               N
5    *   place names                      N               N
6        place names                      R               R
7    *   map                              N               N
8    *   online competitions              N               N
9    *   Tall British buildings           N               N
10       Tall buildings                   w(9)            w(9)
11       Tall buildings                   R               R
12       Tall buildings                   R               R
13       Tall buildings in Britain        w(9)            C(12)
14       Tallest building outside London  M(9)            M(13)

Temporal Database
• A repository of all data for each session
• Accessible to SQL
• Used to build evidence-based models for searching

[Diagram: data held in the temporal database, linked by qid: background details, Web experience and cognitive style scores; subjects’ appraisal of searches; search queries and Web page titles; keystroke record and activity timings; query modification codes; qualitative analysis.]

Acknowledgments

Questions ?

Setting WST and QST

[Figure: Excite logs with WST = 0.4. Series: Tot New, Tot Mod, z+Z; x-axis from 0 to 1; y-axis from 100000 to 350000.]

Inter-QT correlations

• f(A | B): measured frequency of codes B following A

$D(A_i \mid B_j) = \frac{f(A_i \mid B_j) - E\{f(A_i \mid B_j)\}}{\sqrt{V\{f(A_i \mid B_j)\}}}$

Correlations of one transform with the next. Columns give the prior transformation; rows give the transformation type.

Type  N  P  p  R  I  J  C  D  S  s  W  w  M  Z  z  B  b  Q  q  _
N  82.40  -39.20  -2.92  13.26  22.95  2.22  10.77  9.92  5.92  -2.37  23.22  2.37  30.55  11.24  3.99  22.86  8.45  17.84  6.60  102.17
P  -42.39  323.03  9.91  -15.98  -17.58  -9.12  -4.10  -6.83  -12.90  -5.45  -19.89  -5.75  -32.02  -8.76  -2.25  -25.01  -19.35  -18.59  -7.81  -71.47
p  -50.08  79.89  154.30  17.11  4.96  -8.42  -18.06  -10.74  -15.30  -10.98  -21.52  -11.35  -18.32  -8.35  -2.35  -21.79  -10.70  -17.30  -7.25  21.57
R  125.10  -85.27  3.73  198.05  23.30  -2.83  0.55  -2.51  -3.93  -6.24  1.94  -3.17  14.86  1.31  -0.46  -16.30  -12.24  -0.72  -6.71  89.80
I  -8.96  -39.39  7.11  25.19  152.36  23.27  35.60  20.45  19.44  10.92  33.41  15.91  61.29  5.88  1.04  0.33  6.43  -0.72  4.76  61.21
J  31.31  -28.13  0.42  -1.56  -2.36  45.43  29.05  12.92  21.68  19.21  15.47  15.55  10.37  7.08  4.06  66.72  37.31  70.63  46.88  -5.89
C  98.65  -27.61  -2.25  -7.92  -3.51  9.43  50.98  -1.42  2.57  -5.27  11.76  -2.43  7.80  10.78  1.98  33.37  6.34  25.51  3.16  -8.53
D  39.12  -24.03  -2.58  -3.66  -0.82  23.95  14.41  21.89  32.39  29.83  26.52  21.93  -4.62  11.31  4.55  45.21  24.60  57.86  14.67  5.58
S  35.67  -30.46  -3.62  -7.55  0.35  12.88  31.20  28.48  108.55  44.56  27.07  25.89  -6.91  26.54  5.79  56.24  35.14  39.28  17.90  6.90
s  8.44  -18.69  -2.58  -6.79  -1.78  15.49  43.13  15.71  59.83  117.15  1.57  34.34  -12.48  30.55  21.59  46.67  34.77  33.33  22.27  1.00
W  79.54  -43.79  -5.10  -9.05  4.91  15.72  16.39  32.98  10.95  -0.93  117.56  23.20  24.22  14.02  -0.47  70.07  38.85  46.57  17.34  27.82
w  17.74  -17.47  -2.16  -5.35  2.10  12.61  23.19  16.82  22.55  23.51  44.17  66.50  -2.25  18.13  3.57  39.50  35.21  26.21  14.42  6.21
M  109.09  -57.39  -6.00  0.68  8.81  4.55  -5.14  7.04  -11.05  -11.98  4.69  -7.25  160.36  -3.45  -2.86  31.61  14.40  9.17  4.19  31.52
Z  37.56  -13.24  -3.22  -0.98  1.32  6.09  9.11  5.53  17.10  13.88  5.76  5.96  -2.27  19.33  3.01  29.60  10.64  12.79  6.22  30.99
z  9.83  -4.61  0.69  0.25  -0.56  2.35  2.28  -0.82  7.06  8.53  -0.52  3.29  -2.42  8.85  20.34  12.08  4.22  4.48  2.57  4.33
B  61.06  -42.37  3.02  -0.11  -3.05  56.39  36.12  14.63  22.86  19.43  33.25  19.98  23.90  14.39  4.25  204.51  70.57  72.24  51.54  0.67
b  38.59  -32.48  -8.39  -14.33  -4.12  50.59  17.99  24.07  35.23  41.38  27.86  27.48  12.74  19.47  9.57  247.85  145.67  44.35  48.16  4.51
Q  35.97  -24.81  -5.29  -9.96  -3.80  112.76  21.46  12.99  19.11  17.75  23.45  15.62  7.47  8.74  2.70  81.08  67.37  126.97  50.84  5.15
q  18.26  -22.93  -2.71  -5.39  -0.10  54.20  17.40  22.01  23.42  28.37  23.34  14.49  6.52  7.45  3.91  41.28  40.34  173.97  135.55  5.06
_  54.44  -16.60  0.96  28.90  14.56  0.59  9.51  3.69  7.01  0.35  9.44  1.59  11.49  3.65  0.14  0.87  -1.84  4.33  -0.49  65.46

Example: [BQC(3)_][bqD(5)]

Some Observations

• Self-correlations suggest habitual tendencies

• Substantive QT’s rarely follow or precede page-viewing. They are associated with active searching.

• Delay is followed by N, a new query or R or I – suggesting memory refresh.


Word Similarity

• Shift one word along the other until the best match is found
• Logical AND at each position: 1 where the letters are the same


Context

• How do the general public search the web?

• Experimental study
– general public volunteers
– record sound, screens, keystrokes

• Goal: evidence-based model of effective searching

Previous studies of search logs• Web search is shallow + promiscuous• Low use of advanced features• Global statistics

This work• Development of quantitative analysis

• Analysis of search logs (Excite 2001)

Aims of Quantitative Analysis

machine-readable codification of QT’s.

Target: one two three

Target: 123 Comparison Symbol Type

Basic transfomations 1234 C Conjunction 12 D Disjunction

Common sub-phrase 124 S Replacement 231 s Reordering 1243 s Insertion/removal

Common word 145 W Replacement 132 w Reordering 143 w Repacement/insertion

Below threshold similarity 1456 Z Common word 1245678 z Common phrase

Code Transformation

B Include Boolean term

b Remove Boolean term

Q Include quote marks

q Remove quote marks

_ Delay > 1 hour

Supplementary Transformations

Example full transformationMay include up to 4 terms e.g.

BQC(4)_Boolean

Quote MarksSubstantive Delay

Some examples Code Query1 Query2 QJ(k) bargain music “bargain music” QC(k) Bacteremia “Pneumoccol Bacteremia” qJ(k) “university of texas”

“alternative medicine” university of texas” “alternative medicine”

qw(k) "tax law_depreciation system"

tax law/depreciation system

BC(k) "the sopranos" "the sopranos" +scripts BJ(k) +"Complaint form letters"

Insurance +"Complaint form letters" +Insurance

BS(k) doppler effect labs doppler effect +lab

More examples Code Query1 Query2 Bs(k) conferences image processing +image +processing

+conferences +finland BqW(k) "Craig Larman" +Larman +Valtech BqZ(k) +"lbp 1000" +review +canon +review +laser

+printer BqW(k) Hevia AND bagpipe "Spanish bagpipe" bQs(k) +used +horse +trailer +arndt +"horse trailer" used bqW(k) +arndt +"horse trailer" used +Arndt trailer bqs(k) +Moby +southside +"Gwen

Stefani" +mp3 +Moby +southside +mp3

Output for thefirst 100

Excite queries

Source file: excite.txt word modification threshold : 0.400000 query modification level : 0.300000 sub-session delay/s : 3600 qid0 uid nq Modification list 1 1 ** 1 U 2 2 ** 5 NW(1)_NPP 7 3 ** 4 NS(1)PP 11 4 ** 1 U 12 5 ** 1 U 13 6 ** 1 U 14 7 ** 5 N_QNPPP 19 8 ** 4 NPPP 23 9 ** 1 U 24 10 ** 4 NQJ(1)NQN 28 11 ** 5 N_NN_NP 33 12 ** 2 N_N 35 13 ** 3 NR_R 38 14 ** 1 U 39 15 ** 1 U 40 16 ** 4 NM(1)RN 44 17 ** 21 N_N_NC(1)PPPPNW(9)PPPPC(10)PPPPPP 65 18 ** 2 NP 67 19 ** 10 NRPC(1)RP_NS(7)D(7)I(7) 77 20 ** 1 QU 78 21 ** 1 U 79 22 ** 1 U 80 23 ** 1 U 81 24 ** 1 U 82 25 ** 1 QU 83 26 ** 11 N_NC(2)PPPPW(3)NC(9)P 94 27 ** 5 NNW(2)RR 99 28 ** 1 U 100 29 ** 3 NW(1)_M(1)

N_NC(2)PPPPW(3)NC(9)P

One session - 3 sub-sessions

qid uid time rank query querymore totwords

83 000000000000001a 083122 0 chicago sun times No 3

84 000000000000001a 105439 0 f8 No 1

85 000000000000001a 105453 0 f8 airplane No 2

86 000000000000001a 105536 10 f8 airplane No 2

87 000000000000001a 105614 20 f8 airplane No 2

88 000000000000001a 105630 30 f8 airplane No 2

89 000000000000001a 105731 40 f8 airplane No 2

90 000000000000001a 105740 0 airplanes f8 No 2

91 000000000000001a 113441 0 ceo compensation No 2

3 C(2)

8 W(3)

10 C(9)

Query lengths

100000

1000000

1 10 100

Length/Queries

sessions sub-session

10% of sub-sessionsare at least 7 queries in length

QT relative frequencies

U N P p R I J C D S s W w M Z z B b Q q _Query Transformation

Terminal QT’s

i.e. the last queries in a sub-session

$$\mathrm{Ratio} = \frac{\mathrm{FinalFreq}(QT)}{\mathrm{Freq}(QT)}$$

[Figure: Ratio for each code: U N P p R I J C D S s W w M Z z B b Q q _]
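The terminal ratio can be illustrated with a small count-based sketch. It assumes raw counts are used for both frequencies (the slide does not say how they are normalised), and the function name `terminal_ratios` is hypothetical. The input is the example session from the Excite output, split into three sub-sessions at each N.

```python
from collections import Counter


def terminal_ratios(sub_sessions):
    """Return FinalFreq(QT) / Freq(QT) for every code that occurs."""
    freq = Counter(code for sub in sub_sessions for code in sub)
    final = Counter(sub[-1] for sub in sub_sessions if sub)
    # Counter returns 0 for codes that never end a sub-session.
    return {code: final[code] / count for code, count in freq.items()}


# Codes of N_NC(2)PPPPW(3)NC(9)P, grouped into sub-sessions (assumed split at N).
sub_sessions = [
    ["N"],
    ["N", "C", "P", "P", "P", "P", "W"],
    ["N", "C", "P"],
]

ratios = terminal_ratios(sub_sessions)
for code, ratio in sorted(ratios.items()):
    print(code, round(ratio, 2))
```

A ratio near 1 means the code almost always ends a sub-session; a ratio near 0 means it is nearly always followed by another query.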

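QT graphs

[Figure: graph of one search. Nodes: Start and queries 1 to 6; edge labels: N, M, C, C, s, RP(14). Queries shown: "nursing careers"; "paid undergraduate nursing schools in baltimore city maryland".]

QT graphs

[Figure: second graph. Nodes shown: Start, 6, 15, 19; edge labels: M, C, C. Queries shown: molsworth; "us army".]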

Frequency of nodes with k connections

[Figure: frequency against k (0 to 10) for Query length 10 and Query length 20; annotated "Slope = -1".]

Exponential scaling

• f(A,B): measured coincident frequency of codes A and B

$$D(A_i, A_j) = \frac{f(A_i, A_j) - E[f(A_i, A_j)]}{\sqrt{V[f(A_i, A_j)]}}$$

Correlations within a transform, e.g. [BQC(3)_]

Type        B        b        Q        q        —
U       20.60        –     1.32        –        –
N       -1.48        –    23.26        –    78.27
P           –        –        –        –   -66.16
p           –        –        –        –    -9.63
R           –        –        –        –    10.53
I           –        –        –        –     4.45
J       61.85    47.37   136.42    78.37    -5.74
C       46.02   -42.81   -15.14   -19.22    -4.70
D      -34.07    62.20   -15.09    13.45    -4.79
S      -24.52   -11.14   -20.69    -7.63    -5.65
s       -2.62     9.93    -7.05     3.65    -8.04
W      -35.00   -10.35   -32.99    -6.81    -6.05
w       -2.63     9.14   -11.51    -0.98    -8.18
M      -21.05   -12.98   -37.31   -13.28    -1.97
Z       -2.26    14.11   -10.06     2.23    -0.90
z        1.78     2.82     0.55     1.45     0.95
B        0.00        –     1.16    76.78   -15.01
b           –     0.00    74.95    10.05   -11.07
Q        1.16    74.95     0.00        –    -0.28
q       76.78    10.05        –     0.00    -7.77
—      -15.01   -11.07    -0.28    -7.77     0.00

Example:

[BQC(3)_]

Application to Experimental Results

Query Transforms

qid  SS  Query                            QM(similarity)  QM(preceding)
1    *   CD albums collection             N               N
2        CD albums collection             R               R
3    *   Autotrader                       N               N
4    *   atlas                            N               N
5    *   place names                      N               N
6        place names                      R               R
7    *   map                              N               N
8    *   online competitions              N               N
9    *   Tall British buildings           N               N
10       Tall buildings                   w(9)            w(9)
11       Tall buildings                   R               R
12       Tall buildings                   R               R
13       Tall buildings in Britain        w(9)            C(12)
14       Tallest building outside London  M(9)            M(13)
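One way to picture the graph-based description is to treat each query as a node and each code that carries a reference number as an edge from the referenced query. The sketch below does this for the QM(similarity) column and counts how many nodes have k connections. It is an illustration under assumptions, not the authors' construction: codes without a reference (N, R) add no edge here, and the function names are hypothetical.

```python
import re
from collections import Counter

# qid -> QM(similarity) code, copied from the experimental results table.
qm_similarity = {
    1: "N", 2: "R", 3: "N", 4: "N", 5: "N", 6: "R", 7: "N",
    8: "N", 9: "N", 10: "w(9)", 11: "R", 12: "R", 13: "w(9)", 14: "M(9)",
}

REF = re.compile(r"^([A-Za-z]+)\((\d+)\)$")


def build_edges(codes):
    """Return (referenced qid, qid, code letter) for codes with a reference."""
    edges = []
    for qid, code in codes.items():
        m = REF.match(code)
        if m:
            edges.append((int(m.group(2)), qid, m.group(1)))
    return edges


def degree_counts(codes):
    """Count how many nodes have k connections, for each k that occurs."""
    degree = {qid: 0 for qid in codes}
    for source, target, _label in build_edges(codes):
        degree[source] += 1
        degree[target] += 1
    return Counter(degree.values())


print(build_edges(qm_similarity))
print(sorted(degree_counts(qm_similarity).items()))
```

In this small sample query 9 ("Tall British buildings") is the only hub: three later queries are coded relative to it.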

Temporal Database
• A repository of all data for each session
• Accessible to SQL
• Used to build evidence-based models for searching

[Diagram: data held in the temporal database, linked by qid]
• Background details, Web experience, Cognitive style scores
• Subjects' appraisal of searches
• Search queries, Web page titles
• Key stroke record, Activity timings
• Query modification codes
• Qualitative analysis
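To give a feel for what "accessible to SQL" could mean in practice, the sketch below loads a few rows from the experimental results slide into an in-memory SQLite database and joins queries to their modification codes on qid. The schema (tables `queries` and `qm_codes`) is purely hypothetical; the slides do not describe the actual database design or product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical two-table schema keyed on qid; not the project's real design.
conn.executescript("""
    CREATE TABLE queries  (qid INTEGER PRIMARY KEY, query TEXT NOT NULL);
    CREATE TABLE qm_codes (qid INTEGER REFERENCES queries(qid),
                           code TEXT NOT NULL);
""")

# Sample rows copied from the experimental results (QM(similarity) column).
conn.executemany("INSERT INTO queries VALUES (?, ?)", [
    (9, "Tall British buildings"), (10, "Tall buildings"),
    (11, "Tall buildings"), (12, "Tall buildings"),
    (13, "Tall buildings in Britain"),
    (14, "Tallest building outside London"),
])
conn.executemany("INSERT INTO qm_codes VALUES (?, ?)", [
    (9, "N"), (10, "w(9)"), (11, "R"), (12, "R"), (13, "w(9)"), (14, "M(9)"),
])

# Find every query coded with a lowercase w transformation.
# substr(...) = 'w' is case-sensitive, unlike SQLite's default LIKE.
word_changes = conn.execute("""
    SELECT q.qid, q.query, m.code
    FROM queries AS q
    JOIN qm_codes AS m ON m.qid = q.qid
    WHERE substr(m.code, 1, 1) = 'w'
    ORDER BY q.qid
""").fetchall()
print(word_changes)
```

Other data groups from the diagram (timings, keystrokes, appraisals) could hang off the same qid key in further tables.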

Conclusions
• We have developed a rich set of codes describing the syntactic part of QT’s
• These can be used to develop a graph-based description
• Correlations between the codes are meaningful/interesting
• They will form part of the analysis for our experimental study.

Acknowledgments

Questions ?

Setting WST and QST

[Figure: excite, WST = 0.4. Counts (y-axis ticks 100000 to 350000) against an x-axis from 0 to 1; series: Tot New, Tot Mod, z+Z.]

• f(A|B): measured frequency of codes B following A

$$D(A_i \mid B_j) = \frac{f(A_i \mid B_j) - E[f(A_i \mid B_j)]}{\sqrt{V[f(A_i \mid B_j)]}}$$

Correlations of one transform with the next.

Prior Transformation

Type        N        P        p        R        I        J        C        D        S        s        W        w        M        Z        z        B        b        Q        q        —
N       82.40   -39.20    -2.92    13.26    22.95     2.22    10.77     9.92     5.92    -2.37    23.22     2.37    30.55    11.24     3.99    22.86     8.45    17.84     6.60   102.17
P      -42.39   323.03     9.91   -15.98   -17.58    -9.12    -4.10    -6.83   -12.90    -5.45   -19.89    -5.75   -32.02    -8.76    -2.25   -25.01   -19.35   -18.59    -7.81   -71.47
p      -50.08    79.89   154.30    17.11     4.96    -8.42   -18.06   -10.74   -15.30   -10.98   -21.52   -11.35   -18.32    -8.35    -2.35   -21.79   -10.70   -17.30    -7.25    21.57
R      125.10   -85.27     3.73   198.05    23.30    -2.83     0.55    -2.51    -3.93    -6.24     1.94    -3.17    14.86     1.31    -0.46   -16.30   -12.24    -0.72    -6.71    89.80
I       -8.96   -39.39     7.11    25.19   152.36    23.27    35.60    20.45    19.44    10.92    33.41    15.91    61.29     5.88     1.04     0.33     6.43    -0.72     4.76    61.21
J       31.31   -28.13     0.42    -1.56    -2.36    45.43    29.05    12.92    21.68    19.21    15.47    15.55    10.37     7.08     4.06    66.72    37.31    70.63    46.88    -5.89
C       98.65   -27.61    -2.25    -7.92    -3.51     9.43    50.98    -1.42     2.57    -5.27    11.76    -2.43     7.80    10.78     1.98    33.37     6.34    25.51     3.16    -8.53
D       39.12   -24.03    -2.58    -3.66    -0.82    23.95    14.41    21.89    32.39    29.83    26.52    21.93    -4.62    11.31     4.55    45.21    24.60    57.86    14.67     5.58
S       35.67   -30.46    -3.62    -7.55     0.35    12.88    31.20    28.48   108.55    44.56    27.07    25.89    -6.91    26.54     5.79    56.24    35.14    39.28    17.90     6.90
s        8.44   -18.69    -2.58    -6.79    -1.78    15.49    43.13    15.71    59.83   117.15     1.57    34.34   -12.48    30.55    21.59    46.67    34.77    33.33    22.27     1.00
W       79.54   -43.79    -5.10    -9.05     4.91    15.72    16.39    32.98    10.95    -0.93   117.56    23.20    24.22    14.02    -0.47    70.07    38.85    46.57    17.34    27.82
w       17.74   -17.47    -2.16    -5.35     2.10    12.61    23.19    16.82    22.55    23.51    44.17    66.50    -2.25    18.13     3.57    39.50    35.21    26.21    14.42     6.21
M      109.09   -57.39    -6.00     0.68     8.81     4.55    -5.14     7.04   -11.05   -11.98     4.69    -7.25   160.36    -3.45    -2.86    31.61    14.40     9.17     4.19    31.52
Z       37.56   -13.24    -3.22    -0.98     1.32     6.09     9.11     5.53    17.10    13.88     5.76     5.96    -2.27    19.33     3.01    29.60    10.64    12.79     6.22    30.99
z        9.83    -4.61     0.69     0.25    -0.56     2.35     2.28    -0.82     7.06     8.53    -0.52     3.29    -2.42     8.85    20.34    12.08     4.22     4.48     2.57     4.33
B       61.06   -42.37     3.02    -0.11    -3.05    56.39    36.12    14.63    22.86    19.43    33.25    19.98    23.90    14.39     4.25   204.51    70.57    72.24    51.54     0.67
b       38.59   -32.48    -8.39   -14.33    -4.12    50.59    17.99    24.07    35.23    41.38    27.86    27.48    12.74    19.47     9.57   247.85   145.67    44.35    48.16     4.51
Q       35.97   -24.81    -5.29    -9.96    -3.80   112.76    21.46    12.99    19.11    17.75    23.45    15.62     7.47     8.74     2.70    81.08    67.37   126.97    50.84     5.15
q       18.26   -22.93    -2.71    -5.39    -0.10    54.20    17.40    22.01    23.42    28.37    23.34    14.49     6.52     7.45     3.91    41.28    40.34   173.97   135.55     5.06
—       54.44   -16.60     0.96    28.90    14.56     0.59     9.51     3.69     7.01     0.35     9.44     1.59    11.49     3.65     0.14     0.87    -1.84     4.33    -0.49    65.46

Example: [BQC(3)_][bqD(5)]
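The D statistic on this slide appears to compare a measured frequency f with its expected value E, scaled by a variance term V. The slides do not say how E and V were obtained, so the sketch below makes an explicit assumption: transitions are treated as independent draws, giving a binomial expectation n·p and variance n·p·(1−p), with p the product of the marginal proportions. The function name and the toy sequences are illustrative only.

```python
import math
from collections import Counter


def transition_scores(sequences):
    """Standardised deviation of each code-to-code transition count.

    Assumes a binomial null model (independent first and second codes);
    this is an illustrative choice, not the method stated on the slides.
    """
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    n = sum(pairs.values())
    if n == 0:
        return {}

    first, second = Counter(), Counter()
    for (a, b), count in pairs.items():
        first[a] += count
        second[b] += count

    scores = {}
    for a in first:
        for b in second:
            p = (first[a] / n) * (second[b] / n)
            expected = n * p
            variance = n * p * (1 - p)
            if variance == 0:
                continue  # degenerate case: only one kind of transition
            scores[(a, b)] = (pairs[(a, b)] - expected) / math.sqrt(variance)
    return scores


# Toy code sequences, one per sub-session (illustrative data only).
sequences = [["N", "P", "P", "P"], ["N", "C", "P", "P"], ["N", "N", "P"]]
scores = transition_scores(sequences)
print(round(scores[("P", "P")], 3), round(scores[("N", "P")], 3))
```

A positive score means the transition occurs more often than independence would predict, as with the large diagonal entries in the table; a negative score means it occurs less often.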

Some Observations

• Self-correlations suggest habitual tendencies

• Substantive QT’s rarely follow or precede page-viewing. They are associated with active searching.

• Delay is followed by N (a new query), R or I, suggesting memory refresh.

[Figure axis: terms/query (ticks 1, 10, 100)]
