Usability Evaluation and Retrieval Performance of Google, AltaVista and Yahoo using TERS platform
A study submitted in partial fulfilment of the requirements for the degree of Master of Science
In Information Management
at
The University of Sheffield
by
Kessopoulou Eftychia
September 2005
Abstract
This research sets out to compare three popular search engines, namely Google,
AltaVista and Yahoo, by evaluating their retrieval performance and usability. The end
result sought is to determine which search engine best satisfies both the retrieval
performance and the usability criteria. The objectives of this study are a) to examine
whether the user is capable of successfully formulating queries and evaluating results,
by measuring effectivity, efficiency and user satisfaction, b) to examine which system is able to
retrieve relevant documents and rank more relevant documents higher in the list of
retrieved results, using the recall and precision measures, and c) to examine the correlation
between retrieval performance and usability testing.
In order to conduct the appropriate experiment, the researcher employed a
specialized platform called TERS (Testbed for the Evaluation of Retrieval Systems),
designed by Dr. Roelof van Zwol. Thirty (30) participants were recruited and a test
collection of thirty (30) topics was prepared. After each participant conducted fifteen
(15) different searching tasks, the researcher evaluated their searching results.
The conclusions of the study indicate that there is a conflict in results between
usability and retrieval performance. Thus, despite the fact that Google performed better
than the other search engines when usability was tested, Yahoo seemed to have a better
retrieval and ranking algorithm.
Table of Contents

1. Introduction 6
2. Literature Review 9
2.1 Information Retrieval Evaluation 9
2.2 Experimental Web Search Engine Evaluation Background 13
2.3 Criteria of Evaluation 20
3. Methodology 24
3.1 Introduction 24
3.2 Search Engine Evaluation Methodology 24
3.3 Data collection, types of data and data instruments 26
3.3.1 Data collection and types of data 26
3.3.2 Data instruments 26
a) Questionnaire 26
b) Observation checklist 27
c) Transaction logs 27
3.4 Experimental Design 28
3.4.1 Search Engine Features 28
3.4.2 Experimental Scenario 30
3.4.3 Test Environment 31
3.4.4 Participant Sample 31
3.4.5 Information needs (Topics) 31
3.5 Experimental Procedure 33
3.5.1 Ethics Approval 33
3.5.2 Pilot experiment 33
3.5.3 Configuration of TERS platform 34
a) General Specifications 34
i) Retrieval system identification 34
ii) Definition of the ‘topics’ 34
iii) Creation of user accounts 34
b) Usability experiment specifications 35
i) Survey questions set-up 35
ii) Participant workload in the usability experiment 35
c) Retrieval Performance experiment specifications 36
i) Participant workload in the retrieval performance experiment 36
ii) Run set-up 36
3.5.4 Presentation of TERS, Pre-session Questionnaires, Practice 36
3.5.5 Usability experiment section 37
3.5.6 Post-session 38
3.5.7 Judgment of Usability relevance assessments 38
3.5.8 Retrieval performance experiment section 38
3.6 Result Analysis 40
3.6.1 Retrieval Performance measures 40
a) Precision and Recall measures 40
3.6.2 Usability measures 40
a) Effectivity 40
b) Efficiency 41
c) Satisfaction 41
4 Results and analysis 42
4.1 Introduction 42
4.2 Participant’s profile 42
4.3 Retrieval performance experiment results and analysis 46
4.3.1 Recall - Precision results 46
a) Interpolated Recall-Precision 46
4.4 Usability experiment results 49
4.4.1 Overall analysis 49
4.4.2 Search engines usability testing results 51
4.4.3 Topics analysis results 54
4.4.4 User results analysis 57
5. Limitation of the experimental study 62
5.1 Search engines 62
5.2 Participants 63
5.3 Test collection 63
5.4 Observation data 63
5.5 Researcher’s interference 64
6. Discussion 65
7. Conclusion 66
References 68
Appendix A 73
Appendix B 76
Appendix C 86
Appendix D 87
Appendix E 93
1. Introduction
The World Wide Web (WWW) is a popular Internet application which first appeared in
1991. It has two major functions: the publishing of information by web authors and its
retrieval via the process of searching by web users. In the beginning, locating
information in online resources was not very complicated, as only a modest number of
resources was available, indexed in “an alphabetized listing of links to pages”
(Schwartz, 1998:974). But as time passed, the bulk of resources in this “hyperlinked
collection” (Kumar, 2005:3) increased enormously, and an information retrieval
system to find this information became essential. It was not until 1994 that
search engines came into sight and started developing.
A search engine mainly comprises a robot, a database and an agent. The
robot, or spider, is a program that crawls the WWW and “determines the quality and
quantity of information it accesses and retrieves for its database” (Scales & Felt quoted
by Dong and Su, 1997:69). The database is an indexed list of all the information that the
robot gathers. The agent is the intermediary between the user who requests the
information and the search engine’s database. When the user formulates a search query,
the agent searches the search engine’s database and retrieves the relevant documents for
the user, presenting them in a list with the most popular results appearing first. An
interface allows the web user to communicate with the retrieval system, giving him/her
the opportunity to formulate a query and receive a presentation of the results. Thus,
the more effective the search engine is, the higher it rates in users’ choices and
sponsors’ preferences, which are the basic source of revenue for web search engine
authors. Consequently, the comparison and evaluation of search engines is, according to
Dong and Su (1997:67), of “great importance for system developers and
information professionals, as well as end-users, for the improvement and development of
better tools”.
The evaluation of information retrieval systems has a history of 40 years and is
one of the most challenging fields in the information retrieval domain. Information retrieval,
according to Tague-Sutcliffe (1996), “is a process where sets of records are searched to
find items which may help to satisfy the information need or interest of an individual or
a group”. Evaluation of retrieval systems is concerned with the ability of a system to meet
the information needs of its users (Voorhees, 2002).
There are two major approaches to the evaluation of information retrieval
system performance. One is the traditional system-oriented approach, which focuses
on “improving the match between the query and the document as well as the
specifications of the computationally effective and efficient retrieval algorithms” (Pors,
2000). Thus, in order to evaluate how well a system can rank relevant documents,
precision and recall measures, or other variations of these notions, were used exclusively
by researchers. The other approach is user-centered and takes into account the
interactivity between the user and the information retrieval system, with respect to user
satisfaction. Because this perspective is relatively new, the measures used are
still in doubt. Several measures have been employed in many studies concerning the
comparison and evaluation of search engines from this point of view. According to Su
(1992:503), though, “there is a lack of agreement of which are the best existing
evaluation measures”.
In previous years both approaches were applied in IR experimental
environments for the performance evaluation of search engines, but not
simultaneously. Each perspective produced results measuring only the variables from
its point of view, and there was no attempt at any correlation between the two approaches,
as there was no experiment combining them. In 2004, van Zwol (2004)
devised an evaluation platform that makes it possible to conduct both
usability and retrieval performance evaluation studies simultaneously. This motivates the
research topic, which is a combination of both evaluation perspectives using Dr. Roelof
van Zwol’s platform, the so-called TERS (Testbed for the Evaluation of Retrieval Systems). This tool
will be used for the comparison of three popular search engines, namely Google,
AltaVista and Yahoo.
In brief, the objectives of this experiment are:
• to examine whether the user is capable of successfully formulating queries and
evaluating results, by measuring effectivity, efficiency and user satisfaction
• to examine which system is able to retrieve relevant documents and rank more
relevant documents higher in the list of retrieved results, using the recall and
precision measures
• to examine the correlation between retrieval performance and usability testing.
The remainder of this research comprises: a) a literature review of relevant experimental
studies, b) an introduction to the methodology applied, c) the experimental design, d) the
report and analysis of the results, e) the limitations of the experimental study, f) the
discussion and g) the conclusion.
2. Literature Review
This section is dedicated to the review of the two major perspectives, system-
and user-oriented, that are used for information retrieval evaluation and subsequently for
evaluation of web search engines. The following paragraphs include an analysis of the
aforementioned methodological approaches, an overview of the criteria used, as well as
search-engine-related comparative studies that have taken place in the past.
2.1 Information Retrieval Evaluation
In information retrieval (IR), evaluation is an important notion that requires
further development. Van House et al., quoted by Dong and Su (1997:68), provide a
rather wide definition of the word evaluation as “the process of identifying and
collecting data about specific services or activities, establishing criteria by which their
success can be assessed and determining both the quality of the service or the activity
and the degree to which the service or activity accomplishes stated goals and
objectives”.
The fundamental goal of an information retrieval system is to locate relevant
documents with respect to users’ requests. Thus, the evaluation of an IR system focuses
on two major aspects: to improve the retrieval performance of the system and to
maximize the user satisfaction during the process of retrieval.
IR evaluation started in the 1950s at the Cranfield College of Aeronautics in
England, and it was basically concerned with the measurement of the retrieval
performance of a system, namely the system-centered approach. During the 1950s, the
idea that “the system knows better” ruled. The evaluation of an IR system was based on
improving the match between the query and the documents retrieved from the
system’s database, by developing effective and efficient algorithms. The typical example
of this approach is the Cranfield paradigm.
The system-oriented methodology is in general grounded on an experimental basis,
taking place in a laboratory environment. According to Voorhees (2002:355), the choice
of location depended on the researchers’ ability to easily “control some of the
variables that affect the performance of the systems and thus increase the power of the
comparative experiments”. The experimental scenario involved “a collection of
documents, a set of example information requests and a set of relevant documents for
each example information request” (Baeza-Yates, 1999:74). The ‘information requests’
known also as ‘queries’ were “considered to be representations of the information
needs” (Pors, 2000:60) of the user. The ‘set of relevant documents for each example
information request’ known also as ‘relevance judgments’ were not made by the user,
but instead were provided by the experimenters and were “objective and static, which is
the reason for their often-binary role” (Pors, 2000:60).
According to Pors (2000), criticism of this approach is based on the set of
documents, the queries and the concept of relevance. He states that traditional
experiments are based on a test collection with a relatively small sample size, as it is
based on only one type of document. For instance, the Cranfield paradigm was based
only on aeronautic topics. Moreover, Pors expresses his doubts that the queries represent
real information needs, and he maintains that the “methodology of relevance might be
skewed” (Pors, 2000:62).
This traditional approach is too narrow to be adopted for the evaluation of IR
systems because it does not take into account the variable of interaction. Thus,
researchers during the 1990s paid much more attention to the user-centered approach,
which defines the IR system more broadly, even though it provides less “diagnostic
information regarding the system behavior” (Voorhees, 2002:355). According to Pors
(2000), the interaction between the mechanisms of information retrieval (database, user,
representation language, IR algorithm, interface, user language and requests) is of great
interest to information retrieval research.
The point of the user-oriented evaluation approach is to investigate “how well the
user, the retrieval mechanisms and the database interact in extracting information under
real life situations” (Borlund, 2000:74). Thus, according to Pors (2000), this approach
has three dimensions: relevance, cognition and process orientation.
Concerning relevance, the user-oriented approach escapes and overcomes the
narrow assumption adopted in the system-centered approach, which equates the stated
request with the information need. Robertson and Beaulieu (1992:458) suggest that
relevance should be judged according to users’ real information needs. Furthermore,
some “self-reported mental activities or external observations” should be employed in
order to approach this concept. Schamber, quoted in Borlund (2000:72), investigated the
elements that influence users’ relevance judgments and constructed a list of
eight variables. Moreover, the findings suggest that relevance is a
multi-dimensional concept and that static binary assumptions are not acceptable.
From a cognitive viewpoint, during the process of information searching and
retrieval the user is regarded as having a “certain anomalous state of knowledge” which
indicates his/her information need (Belkin, quoted by Robertson and Beaulieu,
1992:459). According to Borlund (2000:72), “this means that an information need, from
the users’ perspective, is a personal and individual perception of a given information
requirement and that an information need for the same user can change over time”.
In the process-oriented dimension, information retrieval is regarded as an
interactive process. The user has a problem to solve (information need) and therefore
he/she tries several mechanisms as his/her state of knowledge develops or changes.
Thus, the process-oriented perspective focuses on the whole process of information
searching. Moreover, it tries to analyze the user’s behavior and find the elements that
affect it.
Evidently, this modern user-oriented approach is completely different from the
static ‘black box’ perspective applied in the Cranfield paradigm, where the user was of
no importance. The experiments adopting the user-oriented approach are mainly
operational, and they differ from a laboratory setting in the lesser control applied over the
variables. Operational experiments are closer to real life and include users in the
evaluation of IR systems. They are more comprehensive, even though the difficulty of
measuring qualitative variables, as well as the cost of such experimentation, is higher
compared to the traditional approach. This is the main reason that the system-
centered approach has not been abandoned but is still used in many experiments.
Moreover, the user-centered approach is not so easy to conduct. Jones and
Willett, quoted by Voorhees (2002:355), mention that “a properly designed user-based
evaluation must use a sufficiently large, representative sample of actual users of the
retrieval system; each of the systems to be compared must be equally well developed
and complete with an appropriate user interface; each subject must be equally well
trained on all systems and care must be taken to control for the learning effect”.
Saracevic (1995) supported early on the idea of using both methodological
approaches instead of adopting only one. He considers information retrieval evaluation
an amalgam of “a system, a criterion or criteria, measures, measuring instruments and
methodology” (Saracevic, 1995:142). Moreover, he divides evaluation into six categories:
“engineering (dealing with hardware and software), input (investigating inputs and
contents of the system), processing (how the inputs are processed), output (interactions
with the system), use and user (questions of applications and given problems and tasks
raised) and social (issues of environmental impact)” (Saracevic, 1995:140-141). He
locates the system-centered approach at the processing level, and he expresses the
need to focus evaluation on all six aforementioned categories. In his paper, he
severely criticizes Dervin and Nilan, who suggested a methodological change from
system-based to user-oriented evaluation.
Borlund (2000) accepts the importance of both approaches for the evaluation of IR
systems and concurs with the suggestion of Beaulieu et al. (1996) that there is a need to
“simulate a realistic interactive searching task within a laboratory environment”. She
proposes an experimental evaluation of IR systems that combines both “realism and
control” and consists of three basic elements:
• the involvement of potential users as test persons
• the application of dynamic and individual information needs
• the use of multidimensional and dynamic relevance judgments (Borlund,
2000:76)
It is evident that both system-centered and user-oriented approaches
need to be taken into account in order to provide a broader framework for the evaluation
of retrieval systems. Especially, given the current enormous expansion of the World Wide
Web, search engines, as specific, representative and interactive information retrieval
systems, are treated with great concern.
2.2 Experimental Web Search Engine Evaluation Background
Before reviewing modern experiments based on search engine evaluation, it is
worth referring to the cornerstone of IR evaluation experiments. This is the well-known
Cranfield paradigm, and more specifically the second run of tests conducted by
Cleverdon et al. in the 1960s, which is a typical example of a laboratory experiment.
The researchers investigated which of 33 indexing languages performed
better with respect to retrieval. The experimental scenario involved a ‘test collection’
comprised of 1400 documents relating to aeronautical engineering, a set of 331 ‘queries’
representing user needs and a set of ‘relevance judgments’, “where each document was
judged relevant or not relevant to each query” (Harter and Hert, 1997:8). Relevance
was based on topical similarity, and the judgments were made by domain experts.
Cleverdon et al. devised the measures of recall, precision and fallout in order to measure
the effectiveness for each indexing language.
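These three measures can be expressed in simple set terms. The sketch below is illustrative only; the function and variable names are mine, not part of the Cranfield tests:

```python
def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved)

def fallout(retrieved, relevant, collection):
    """Fraction of non-relevant documents that were retrieved."""
    non_relevant = collection - relevant
    return len(retrieved & non_relevant) / len(non_relevant)

# Toy collection of 10 documents, 4 of them relevant to a query
collection = set(range(10))
relevant = {0, 1, 2, 3}
retrieved = {0, 1, 4, 5}

print(recall(retrieved, relevant))     # 2/4 = 0.5
print(precision(retrieved, relevant))  # 2/4 = 0.5
print(round(fallout(retrieved, relevant, collection), 3))  # 2/6 ≈ 0.333
```

A perfect system would score 1.0 on recall and precision and 0.0 on fallout; in practice the measures trade off against one another.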
These traditional measures, and especially recall and precision, received a lot of criticism.
Voorhees (2002:356), criticizing three hypotheses on which the Cranfield paradigm is
based, expresses concern:
• Relevance can be approximated by topical similarity. Thus, all relevant
documents are equally desirable, the relevance of one document is independent
of the relevance of any other document, and the user information need is static
• A single set of judgments for a topic is representative of the user population
• The list of relevant documents for each topic is complete
TREC experiments bear many similarities to the Cranfield model but move a step
closer to interactive information retrieval with the adoption of the ‘interactive track’ in
TREC. TREC experiments use a larger test collection, taking into account that the size
of the collection plays an important role in IR performance evaluation. The methodology
adopted is based on the system-driven approach, and the queries used, as in the Cranfield
model, are formed in a standard way. On the other hand, the relevance judgments are
“created using the pooling method” (Harter and Hert, 1997:25), but relevance is
measured in the same way as in the Cranfield model, using the binary scale of
relevant and non-relevant.
Borlund (2000) criticizes many aspects of the interactive TREC. She argues that
the queries, namely the topics used, are “too static, limited and unrealistic”, and
consequently the test persons are unable to develop their own interpretations of an
information request. Moreover, she comments more generally on the concept of
interaction, stating that insofar as TREC adopts a system-oriented approach, the
notion of interaction is disregarded. She concludes
that TREC is not adequate for the evaluation of IIR systems because it
adopts the framework of IR system evaluation.
Although TREC experiments have received much criticism of the methodology
and criteria adopted and of their unsuitability for evaluating interactive information retrieval
systems, they remain a seriously influential example for later interactive information
retrieval evaluation experiments.
Chu and Rosenthal in 1996 conducted an experiment to evaluate the performance of
three web search engines: AltaVista, Excite and Lycos. They had a sample of
search queries adequate for statistical analysis, constructed using variations of query
syntax. The reason was their interest in studying the search engines’ ability to handle
different types of queries. For instance, some of the queries were single words
while others required the use of Boolean operators (Chu and Rosenthal, 1996). The
evaluation criteria that they used were those proposed by Lancaster and Fayen (1973).
Thus they measured coverage, precision, response time, user effort and form of output
but they omitted recall because they considered its measurement problematic in an
environment such as the World Wide Web. They grounded their opinion on the fact that
the number of relevant documents on the World Wide Web is indefinable regarding the
dynamic and unstable nature of this environment. As far as precision is concerned, they
calculated it taking into account the first ten hits after assigning score (1) for highly
relevant documents, (0.5) for fairly relevant and (0) for irrelevant. Leighton and
Srivastava (1997) criticize the fact that they studied only three search engines and
did not use any significance test for their precision means. Moreover, the relevance
judgments were made by the researchers themselves, which may have introduced
bias.
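Their graded first-ten precision can be sketched as follows (an illustrative reconstruction; the names and example judgments are mine, not Chu and Rosenthal's):

```python
def graded_precision(scores, cutoff=10):
    """Mean relevance score over the first `cutoff` hits, where each hit
    scores 1 (highly relevant), 0.5 (fairly relevant) or 0 (irrelevant)."""
    top = scores[:cutoff]
    return sum(top) / len(top)

# Ten hits with hypothetical relevance judgments
hits = [1, 1, 0.5, 0, 1, 0.5, 0, 0, 1, 0.5]
print(graded_precision(hits))  # 5.5 / 10 = 0.55
```

With binary judgments this reduces to ordinary precision at ten; the 0.5 score simply lets fairly relevant pages contribute half credit.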
Ding and Marchionini (1996) evaluated the performance of three search engines,
taking into account the impact of hyperlinks. Since the World Wide Web is a
hyperlinked environment, the researchers considered it appropriate to include in
their study “not only primary resources but also secondary” (Ding and Marchionini,
1996:139), because links to other relevant documents are of great importance to search
engine evaluation.
Ding and Marchionini used a small test suite of five queries which, according to
Leighton and Srivastava (1997), is statistically insignificant. Moreover, the latter
(1997:4) comment on the queries as “narrowly defined academic topics with the
use of multiple words”. The researchers of the study also accepted that the query
analysis and formulation, as well as the relevance judgments, were subjective.
As far as the evaluation criteria are concerned, Ding and Marchionini found it
interesting to measure: precision of the first twenty hits, duplication in the retrieved sets,
validation of links and the degree of overlap. The measures used for these criteria are
precision and recall in three types, salience and relevance concentration. Dong (1999)
comments positively on the precision measured in this study as being based on the
“standardized formula” (Dong, 1999:148), and Gordon and Pathak regard this
comparative evaluation as performing sufficient statistical tests for comparing search
engines.
Clarke and Willett (1997) conducted a study to compare the effectiveness of three
search engines using thirty queries based on library and information science topics. The
researchers calculated first ten precision and developed an algorithm to measure
approximate recall. They adopted Chu and Rosenthal relevance criteria and they
assigned (1) for relevant documents retrieved, (0.5) for partially relevant, and (0) for
irrelevant. Additionally, the authors extended the concept of relevance by assigning (0.5)
for a page that led to one or more relevant pages (taking into account the effect of
hyperlinking) and (0) for duplicate sites. According to Oppenheim et al. (2000:197), the
study of Clarke and Willett is essential both because “of the critical evaluation of earlier
research that it provides and because it offers a realistic and achievable methodology for
evaluating search engines”.
According to Nicholson, Tomaiuolo and Parker conducted the “first large scale”
study. They evaluated four search engines and two evaluative search tools, which had
the ability to review and evaluate the web pages that they index. For their study they
used two hundred query topics that they gathered from undergraduates. The precision
was calculated for the first ten hits, and the relevance judgments were made by the
researchers themselves. Gordon and Pathak (1999) mention that the relevance judgments
were frequently based on “the short summary descriptions of the web pages that search
engines provide”. Moreover, Leighton and Srivastava (1997:4) state that the
researchers did not consistently check whether all visited links were active, and they
criticize Tomaiuolo and Parker for their lack of defined criteria for relevance.
Hawking et al. (1999) compared five search engines popular at that time, using
a TREC-like methodology. The 54 queries used were gathered from
transaction logs by AltaVista and Electric Monk. They used scripts to
present each of the queries to each search engine, and they calculated precision on the
first twenty hits. The relevance judgments were made by four judges, each of whom
judged all the documents retrieved for one query, using binary relevance. According to
Su (2003:1177), the methodology applied by Hawking et al. excels in providing
“reproducible results and blind testing”. Moreover, they did not take into account
inactive links when measuring precision.
Leighton and Srivastava (1999) compared five web search engines based on first-twenty
precision, assigning different weights to the retrieved results. They divided these twenty
hits into three groups, weighting differently the first three, next seven and last ten of the
twenty hits, though the weighting was arbitrary. Moreover, they used four categories
of relevance, taking into account inactive and duplicate links and penalizing them.
They used fifteen queries, of which ten were real reference questions and the remaining
five were taken from a previous study conducted by them. Evaluators different from the
users made the relevance judgments, and they were unaware of which search engine
returned the results judged; this was done to minimize any bias. They used statistical
techniques for non-normally distributed data and analyzed the
data several times using different relevance assessments.
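Such group-weighted precision over the first twenty hits can be sketched as follows (the 3/2/1 weights are illustrative placeholders, since the original weighting was, as noted, arbitrary):

```python
def weighted_precision(scores, groups=((3, 3.0), (7, 2.0), (10, 1.0))):
    """Score the first 3, next 7 and last 10 of twenty hits with
    decreasing weights, normalised by the maximum attainable score."""
    total = maximum = 0.0
    start = 0
    for size, weight in groups:
        total += weight * sum(scores[start:start + size])
        maximum += weight * size
        start += size
    return total / maximum

# Twenty hits: 1 = relevant, 0 = not (inactive/duplicate links score 0)
hits = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
print(round(weighted_precision(hits), 3))  # 12/33 ≈ 0.364
```

The normalisation ensures the score stays in [0, 1], so an engine that concentrates its relevant hits in the first three positions scores higher than one with the same hits scattered lower down.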
User-oriented experiments
User-oriented experimental approaches, conducted in later years in an attempt to
form some kind of common methodology, involved real users.
One of the most up-to-date experimental studies is that of Gordon and Pathak
(1999). In their experiment they compared the effectiveness of eight web search engines.
They used thirty-three real information requests, which they gathered from students of a
university business school. The information requests were thus based on one type of
document, limiting the scope of evaluation. Once they had gathered the requests,
they employed expert searchers to conduct the searches on behalf of the users. Using
expert searchers other than the user “seems somewhat inconsistent” (Hawking
et al. 2001:37) with the requirement that relevance judgments be made by the
person who originally has the information need. During the search task, the search
intermediaries conducted the search repeatedly and interactively. They were instructed
to apply query optimization and accept the best results retrieved from each search engine
for a topic before saving them. The idea of “near-optimal” queries, according to
Hawking (2001), is interesting, especially when they provide results of better
performance in comparison to simple queries. But Gordon and Pathak did not make any
such comparisons. The top twenty hits for each search engine, with respect to the
queries submitted, were printed and sent to the users to judge their relevance on a
four-point relevance scale. Since the users were provided only with the results,
‘blind’ judging was used to minimize bias. Moreover, they calculated recall, precision
and overlap at various cutoff levels. They also used statistical tests to make their results
more reliable.
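The overlap between two engines' ranked lists at a given cutoff can be sketched as follows (an illustration only; Gordon and Pathak's exact formulation may differ):

```python
def overlap_at(results_a, results_b, cutoff):
    """Share of documents that two ranked result lists have in common
    within their first `cutoff` hits."""
    top_a = set(results_a[:cutoff])
    top_b = set(results_b[:cutoff])
    return len(top_a & top_b) / cutoff

# Hypothetical top-five result lists from two engines
engine_a = ["d1", "d2", "d3", "d4", "d5"]
engine_b = ["d3", "d9", "d1", "d7", "d8"]
print(overlap_at(engine_a, engine_b, 5))  # 2 shared of 5 = 0.4
```

Low overlap between engines was itself a notable finding of such studies, since it implies that no single engine covers the whole relevant portion of the Web.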
Su (1997) proposed a methodology which was adopted by Su, Chen and Dong
(1998). The researchers presented a pilot study evaluation of four popular search engines
using real users who searched for their personal information needs. The users also
judged the relevance, the features of the search engines, interaction, search results and
overall performance. The criteria employed in their study are relevance, efficiency, user
satisfaction, utility and connectivity. Later on, Su and Chen (1999) performed the same
experiment using a larger test suite, based on the same methodology proposed by Su
(1997) but with some differences. The most important aspect of this experiment is that
each participant searched the same topic in each of the four search engines. The purpose
was to minimize the between-subject impact (Su, 2003).
Simulated experiments
Up until now, according to Roelof van Zwol (2004), there has been no experimental
attempt to combine the two approaches. Borlund (2000) proposed an experimental
setting combining both the system-centered and user-centered perspectives. She built on
Borlund and Ingwersen’s (1997) earlier methodological approach, which introduced the
concept of the “simulated work task situation”. Borlund investigated the possibility of
potential differences between real information needs and simulated ones. Based on the
empirical findings of a meta-evaluation, she concluded that “simulated work task
situations” are suitable for IIR evaluation.
Roelof van Zwol (2004) conducted an experimental study combining the two major
approaches using a purpose-built platform, namely TERS. For the user-centered
approach, widely known as a usability study, he employed the measures of effectivity,
efficiency and satisfaction. The measure of effectivity illustrates the ability of the users
to find relevant information. Efficiency concerns the time that users need to
successfully complete a topic. Satisfaction measures the degree of satisfaction of the
users with respect to the information retrieval interface. On the other hand, in order to
measure the retrieval performance of the search engines, he used the traditional recall
and precision measures. The experimental scenario involved twenty users, who participated
in both experiments (usability and retrieval performance). The participants were divided
in three categories: beginners, novices and experts. The queries used numbered 37 and
they were divided in 18 specific topics and 19 more generic. The researcher was based
on Rouet’s (2003) criteria for the construction of the topics, in which he applied
statistical tests for differences between the two categories of topics. Moreover, he
19
calculated precision in different cutoff levels in order for his findings to be more
confident.
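The three usability measures described above reduce to simple averages over per-task session data. The following is a minimal sketch, not taken from the study itself: the record structure and the field names `correct`, `seconds` and `satisfaction` are invented for illustration.

```python
# Illustrative sketch: computing effectivity, efficiency and satisfaction
# from hypothetical per-task session records (field names are invented).

def usability_measures(records):
    """records: list of dicts, one per completed search task."""
    n = len(records)
    effectivity = sum(r["correct"] for r in records) / n        # share of tasks answered correctly
    efficiency = sum(r["seconds"] for r in records) / n         # mean time to complete a task
    satisfaction = sum(r["satisfaction"] for r in records) / n  # mean questionnaire rating (1-7)
    return effectivity, efficiency, satisfaction

records = [
    {"correct": 1, "seconds": 120, "satisfaction": 6},
    {"correct": 0, "seconds": 300, "satisfaction": 3},
    {"correct": 1, "seconds": 180, "satisfaction": 5},
]
eff, secs, sat = usability_measures(records)
```

In this toy example the measures would be aggregated per search engine, so that the three systems can be compared on each dimension separately.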
2.3 Criteria of Evaluation
‘Batch’ evaluation experiments applying the recall and precision measures were
used exclusively in the Cranfield paradigm. These standard measures have attracted both
praise and objections in the information retrieval field with regard to evaluation.
Hersh (2000:17) is in favor of these measures and argues that “this is an effective and
realistic approach to determining the systems performance”. On the other hand,
objections to this view have been raised concerning the deficiencies of these
measures, the questionable validity of recall, the practical drawbacks in measuring recall,
the presence of interactivity, and the ambiguous definition of relevance.
First of all, Large, Tedd and Hartley (2001) characterize these measures as
incomplete when used to evaluate information retrieval systems. They introduce further
factors such as “expense”, “time” and “the ease of conducting the search via the
system interface”. In addition, Lancaster and Warner, quoted by Large, Tedd and
Hartley (2001:282), argue that “accessibility” and “ease of use” are the most important
criteria for the user in choosing a source from which to retrieve information.
Moreover, the validity of the recall measure is controversial. The Cranfield tests
were based on the hypothesis that a user is in favor of “finding as many relevant records
as possible” (Large, Tedd and Hartley, 2001:282). This is an arbitrary assumption that
does not necessarily describe average user behavior. There are many cases where the
user is not interested in retrieving the greatest possible number of records but simply
in obtaining a specific record. When this is the case, recall is useless and precision
alone can serve for the evaluation of the retrieval system.
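For reference, the two measures can be stated concretely as set operations over the retrieved and relevant document sets. The sketch below uses invented document identifiers:

```python
# Illustrative sketch: precision and recall over sets of document identifiers.

def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved)  # fraction of retrieved docs that are relevant
    recall = len(hits) / len(relevant)      # fraction of relevant docs that were retrieved
    return precision, recall

p, r = precision_recall(retrieved=["d1", "d2", "d3", "d4"], relevant=["d2", "d4", "d7"])
```

Here precision is 0.5 (two of four retrieved documents are relevant) and recall is 2/3 (two of three relevant documents were found), which makes concrete why the two measures can diverge.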
Furthermore, another criticism of recall lies in its measurement when
applied to large databases or interactive retrieval environments such as the World
Wide Web (Large, Tedd and Hartley, 2001). The main reason is that in these settings the
number of available records cannot be gauged, only estimated. It is widely
recognized that every day numerous new web pages come to light while others are
removed. Counting all the records found in these pages would therefore require
practically infinite time and effort. Recall can only be estimated approximately, even
when enhanced with additional characteristics such as “absolute retrieval” (Large, Tedd
and Hartley, 2001). On the other hand, this measurement is perhaps not crucial for
measuring the performance of interactive retrieval systems, as Ralph, quoted by Jones
and Willett (1997), suggests.
Additionally, another drawback of these measures concerns the concept of
relevance. Two main issues are raised in this area, relating to the subjectivity and the
judgment of relevance. Jones and Willett (1997:168), quoting Cooper, concede that
“the property sought of retrieved documents is subjective utility to the
user rather than topic relevance, which is objective property of documents in
themselves”. Thus, the searcher determines what is relevant or not with reference to the
information enquiry posed. As for the second issue, the user’s opinion about a record
may be biased by other records scanned earlier. Despite all this, Large, Tedd and
Hartley (2001) conclude that the measures of recall and precision remain valid even if
it is difficult to estimate them exactly within experimental environments.
Moving further, interactivity constitutes another challenge for the recall
and precision measures. These measures are naturally suited to static
environments such as the Cranfield experiments. Nowadays, however, there is a great
deal of interaction between the system and the user, and evaluation methods that leave
the user out tend to be pushed out of the limelight (Beaulieu, quoted by Jones and
Willett, 1997). Consequently, the need for a broader evaluation approach cannot be
disregarded. This is the only way to simulate real circumstances, because ‘batch’
searching results give misleading evaluations and thus lead to poorly designed
retrieval systems.
In the light of all the above, the evaluation of retrieval systems has turned
towards the users as well, and new measures have been added for usability evaluation.
The most representative measures are those proposed by ISO’s broad definition (Frokjaer,
Hertzum and Hornbæk, 2000): “effectiveness”, “efficiency” and “user satisfaction”.
Frokjaer, Hertzum and Hornbæk (2000) emphasize the importance of accounting for all
three measures. They conducted an experiment which found no correlation between the
three terms, and concluded that all of them need to be calculated in order to measure
usability. Their results contradicted previous studies that “refrain from accounting for
effectiveness and settle for measures of the efficiency of the interaction process”
(Frokjaer, Hertzum and Hornbæk, 2000:346).
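The “no correlation” finding can be checked with a standard statistic such as Pearson’s r, computed across users for each pair of measures. The sketch below is illustrative only: the per-user sample values are invented, not drawn from the cited study.

```python
# Illustrative sketch: Pearson correlation between two usability measures
# across users. Sample values are invented for the example.
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Per-user effectiveness (share of tasks solved) vs. efficiency (mean seconds per task):
effectiveness = [0.9, 0.6, 0.8, 0.5, 0.7]
seconds = [150, 140, 155, 160, 145]
r = pearson_r(effectiveness, seconds)  # a value near zero would support "no correlation"
```

A value of r near zero for every pair of measures is what would justify reporting all three separately rather than substituting one for another.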
On the other hand, even if evaluating interactive systems through usability is
an up-to-date approach, it faces important challenges that need to be taken into
consideration. The most serious of them is the interference of the experimenter
(Jones and Willett, 1997). For example, when the investigator asks questions about what
the users are doing, the latter often become disoriented, especially if they are
unfamiliar with the procedure. Moreover, the subjectivity of users repeatedly resurfaces,
since each user values the searches according to different characteristics. A further
issue is “nonreplicability”: Jones and Willett (1997:171) note that “once a user has done a
search seeking to satisfy some information need with some particular system or utility,
they cannot by definition, do another search for the same need with a different system”.
They attribute this to the knowledge gained previously, and argue that in-depth
experiments therefore require large samples of individuals.
Recently, another important notion that has raised debate in the field of
information retrieval evaluation concerns the correlation between the two kinds of
experiment. In his paper, Hersh (2001) explains why retrieval performance and usability
studies do not give the same results, based on an experiment he conducted. One point he
makes is that users browse down the list of documents that the system retrieves. By
contrast, Roelof van Zwol (2004:388) argues, with regard to his own experiment, that
“users are very well capable of finding the relevant information within the top ranked
documents. If they are not then they just refine their search rather than browse through
lower ranked documents”. In general, however, Roelof van Zwol (2004) accepts the
finding of Hersh (2000) that there is a disconnection between the traditional measures of
recall and precision and user satisfaction.

1 “Effectiveness is the accuracy and completeness with which users achieve certain goals. Indicators of effectiveness include quality of solution and error rates” (Frokjaer, Hertzum and Hornbæk, 2000:345)
2 “Efficiency is the relation between the accuracy and completeness with which users achieve certain goals and the resources expended in achieving them. Indicators of efficiency include task completion time and learning time” (Frokjaer, Hertzum and Hornbæk, 2000:345)
3 “Satisfaction is the users’ comfort with and positive attitudes towards the use of the system. It can be measured by attitude rating scales” (Frokjaer, Hertzum and Hornbæk, 2000:345)
3. Methodology
3.1 Introduction
This chapter describes the search engine evaluation methodology adopted by the
researcher, as well as the methods of data collection and the data types and
instruments employed for this purpose.
3.2 Search Engine Evaluation Methodology
Taking into account the dynamic nature of search engines and the objectives
of this experiment, the two methodologies, system-centered and user-oriented, were
combined for the search engine evaluation. The system-centered approach is used
exclusively for the retrieval performance of the web search engines. This approach makes
it possible to compare the algorithms of different search engines in an attempt to
find out which of them can provide the most relevant documents. Thus, a collection of
documents that represent the information problem, a set of example information requests
(or, more simply, in search engine terms, a set of queries) and a set of relevant documents
for each example query are regarded as adequate ingredients for measuring information
retrieval performance.
This had been the dominant approach until the 1990s, when a new approach
undermined the utility of the traditional system-centered one and moved closer to
the user than to the system. This new approach is diversified. From one viewpoint it tries to
identify the notion of relevance. From another perspective it is concerned with user
cognition. From a third dimension it emphasizes process orientation, which is attributed
to the interaction between the user and the system.
Experimental studies, though, continued to apply the traditional approach because
it is much simpler than the new one, which requires the presence of real users. The
reason is that experimental studies, owing to their laboratory character, require control
over some of the variables that are tested, while user-centered approaches are operational
in nature and involve qualitative data that are difficult to measure. Moreover, the cost of
the latter, as a consequence of its nature, is considerably higher in comparison with the
laboratory approach.
A combination of both approaches, one measuring usability and the other
measuring retrieval performance, was not attempted until 2004, when Dr. Roelof van
Zwol built an experimental testbed to facilitate the conduct of both experiments in an
attempt to evaluate information retrieval systems from a complete perspective.
This research is based on this idea and, by using the testbed (TERS), will attempt
to evaluate the retrieval performance and the usability of three web search engines.
Additionally, this experimental study will investigate further the correlation between the
two experiments. Thus, different types of users were recruited to take part in an
experiment which combined realism and control.
The researcher used both quantitative and qualitative methods for the purposes of
this experiment, in order to gather data about the user-system interaction as well as more
system-oriented data. Thus, the testbed was configured, and questionnaire and
observation methods were used. Both quantitative and qualitative data helped to answer
the research questions about the users’ ability to successfully formulate queries and
evaluate results, while only quantitative data helped to answer the research questions
concerning the systems’ ability to retrieve relevant documents and to rank more relevant
documents higher in the list of retrieved documents.
3.3 Data collection, types of data and data instruments
3.3.1 Data collection and types of data
In order to extract information about the system, the user and their interaction,
both quantitative and qualitative data collection methods were used. Quantitative data
helped to describe the participants’ profiles and to measure the retrieval performance of
the search engines. Qualitative data, on the other hand, focused on the interaction between
the system and the user. More specifically, the types of data used are questionnaire data,
transaction log data and observation data.
3.3.2 Data instruments
a) Questionnaire
Two kinds of questionnaire were filled in by the participants: pre-session and
post-session. The pre-session questionnaire was divided into two parts, because its first
part could not be accommodated on the TERS platform. Thus, the first part of the
pre-session questionnaire (Pre-session Questionnaire I) took a paper-and-pencil form and
was constructed using checklists. This questionnaire was used to gather data on the
participants’ characteristics, experiences and preferences. The second part of the
questionnaire (Pre-session Questionnaire II) was uploaded to the TERS platform and was
displayed before the participants started the experiment. This questionnaire included a
seven-point Likert scale and was used to map the participants’ searching experiences and
skills when using web search engines to satisfy their information needs. The post-session
questionnaire, which was also uploaded to the platform, was filled in after the
participants finished the usability experiment. This questionnaire gave the users the
ability to judge the usability of the search engines in terms of search options, result
presentation, relevance, response time and the overall satisfaction that the participants
get when using them. Using the post-session questionnaire, the researcher was able to
map the attitudes of the participants towards each web search engine. Three types of
questions were included in this questionnaire: 7-point Likert-scaled questions, Boolean
questions (a checklist that gives two options, from which the user has to check one or
the other), and open questions (which were optional).
b) Observation checklist
In the quantitative part of this project, the researcher used observation, even
though it is regarded as a qualitative method. The purpose of using observation in this
way was to focus on a specific aspect of the participants’ behavior: the searching
strategy followed when using web search engines. An observation checklist was therefore
constructed by the researcher for this purpose. The checklist included general searching
strategy characteristics that participants used in order to satisfy their information needs,
which were not easy to gather with other data instruments. The checklist helped to
quantify the behavior of the users to some extent.
c) Transaction logs
“Every SQL Server database has at least two files associated with it: one data file
that houses the actual data and one transaction log file. The transaction log is a
fundamental component of a database management system. All changes to application
data in the database are recorded serially in the transaction log. Using this information,
the DBMS can track which transaction made which changes to SQL Server data”
(http://www.dbazine.com/sql/sql-articles/mullins-sqlserver). Thus, with the help of
transaction logs the researcher was able to gather data regarding the searches and the
relevance judgments provided by users. All the data were saved in the TERS platform,
to which the researcher had access via the internet.
3.4 Experimental Design
3.4.1 Search Engine Features
Because search engine companies do not reveal to the public the exact
algorithms they use for indexing and ranking, the researcher provides only information
gathered from the three search engines’ web sites and from personal experience.
i) Google
The Google search engine supports both a simple and an advanced search mode,
each with its own interface. The simple interface (Appendix A) has a single box where
the user can type the query terms. There is the option of either displaying a list of
results or limiting the result to just one, the most relevant that Google finds. The
advanced mode (Appendix A) provides options for finding pages that include all the
search terms, an exact phrase, at least one of the words, or none of the words; that are
written in a certain language; created in a certain file format; updated within a certain
period; containing numbers within a certain range; within a certain domain or website;
or that do not contain “adult” material. Google ignores common words and characters
such as “where” and “how”, as well as certain single digits and single letters, which slow
down the search without improving the results. Moreover, it is not case sensitive, it
automatically adds the “AND” operator between query terms, it uses stemming
technology, and it does not support truncation. (www.google.com)
ii) AltaVista
The AltaVista search engine offers a simple and an advanced search mode
(Appendix A). The simple search mode provides a single box where the user can type a
query. The advanced search mode makes it possible to build queries that include all the
words, an exact phrase, any of the words, or none of the words, and to return results in
any of 36 languages, within a specific time range, or from specific locations or domains.
It also provides the option of displaying either a maximum of two results per site or all
the results, including results from the same site. Moreover, there is an additional
free-form Boolean query box where expert users can build a query using Boolean
operators. Additionally, AltaVista indexes all of the words on each web page, offers the
chance of translating a web page, does not support truncation, and is case sensitive.
(www.altavista.com)
iii) Yahoo
The Yahoo search engine provides two types of search, simple and advanced
(Appendix A). Simple search is the first option here, as in the previous search engines.
The advanced search mode gives the user the chance to limit the search to results with
all of the words, at least one of the words, an exact phrase, or none of the words; that
have been recently updated; from a certain site or domain; containing “adult” content;
from a specific country; in a specific language; with a specific number of results per
page; related to a particular site; and in a specific file format. Additionally, it supports
truncation and is case sensitive. (http://search.yahoo.com/)
3.4.2 Experimental Scenario
The experimental scenario involves the usability and the retrieval performance
experiments using the TERS platform. Three popular search engines were used for
comparison: Google, AltaVista and Yahoo. Thirty (30) volunteers were recruited as
testing searchers and relevance judges, which means that they participated in both
experiments. The number of topics (information needs) was also thirty. In the
usability experiment, fifteen (15) topics were assigned to each participant. Each
participant searched these topics in the search engines previously set by the researcher;
a different search engine opened each time the participant assessed a topic. None of the
participating students had previous knowledge of the search tasks that they assessed in
the different search engines. The participants were to conduct the searches as they
would normally do in everyday life. They had the option of following the links to satisfy
their information need, or of altering the search if they were not satisfied by the results.
When they found a result that answered the information problem, they copied the URL,
the position of the link in the ranked list, the search words they used, and a relevant
fragment from the web page that led to the decision that their information need was
satisfied. Observation notes were taken by the researcher during the usability
experiment. After the assessment of all topics by all participants, the researcher judged
the results for their correctness in order to investigate the ability of the searchers to find
relevant information.
In the second part of the experiment, the retrieval performance evaluation was
based on the first fifty hits retrieved by the search engines. The researcher optimized the
queries for each topic, and the participants were requested to provide relevance
judgments for these results using a 5-point scale. Each participant chose one topic of
his/her preference and provided relevance judgments for all fifty results retrieved for
that topic.
The experiment took place in the “Microlab” lecture theater of the Department of
Information Studies at the University of Sheffield from 15th July until 19th August 2005.
3.4.3 Test Environment
The “Microlab” lecture theater is equipped with desktop computers with internet
facilities. This was a prerequisite, so that the TERS toolkit could be loaded in the
browser for each session. There were ten sessions, with three students taking part in
each, so that the researcher was able to observe the participants.
3.4.4 Participant Sample
For the purposes of this study, thirty (30) participants were recruited. All of them
were students at the University of Sheffield: four (4) of them were research students and
the remaining twenty-six (26) were postgraduates. The academic backgrounds of the
postgraduate participants varied: three (3) information studies, five (5) computer
science, five (5) engineering, three (3) medicine, two (2) political studies, one (1)
human geography, four (4) business and management, one (1) law and two (2)
linguistics. The research students’ backgrounds were: one (1) architecture, one (1)
engineering, and two (2) medicine. All participants in the experimental study were
volunteers. In order to find out whether there are differences between the users’
performances, the students were divided into three categories, as suggested by Roelof
van Zwol (2004): beginners, novices and experts.
3.4.5 Information needs (Topics)
In the information retrieval field there is much debate about the nature of
the information needs used in the experimental design of various studies. According to
Borlund and Ingwersen (1997), when information needs are simulated, their dynamic
nature emerges during experimentation. TREC adopts an interesting approach to
generating topics. The TREC test collections include examples of indicative information
needs which are used when testing search engine performance. Each indicative
information need is referred to as a “topic”. The “topics” are written by experienced
users of real systems and represent real information needs. In this experiment, the
TREC test collections that refer to the World Wide Web were used, because they include
a variety of topics. The researcher preferred to avoid constructing the information needs,
so as to avoid any kind of bias. More specifically, the information needs, namely
“topics”, used in the experiment were chosen from three test collections provided by
TREC: the TREC 9 web track, TREC 6, and the TREC 8 web track ad hoc and small
web topics test collections (http://trec.nist.gov/data.html). The choice of “topics” from
these collections was based on the preferences of 10 students with academic
backgrounds different from those of the participants who volunteered to take part. The
list of “topics”, numbering thirty (30) plus an additional “topic” for practice, is
presented in Appendix B, together with the description, which provides specific
information on the information need, and the narrative, which gives further information
about what is regarded as relevant to each “topic’s” specific information need.
For the purposes of this study, in order to investigate any differences between the
topics, the researcher divided the “topics” into two categories based on Rouet’s (2002)
criteria. One category comprised topics of a generic nature, while the other included
“topics” that were more specific. The whole list with the two categories is provided in
Appendix B.
3.5 Experimental Procedure
3.5.1 Ethics Approval
Prior to implementing this experiment, ethical approval was obtained from the
Research Ethics Committee at the University of Sheffield. The researcher then
contacted the participants to explain the project, provided an information sheet with
further information about the experiment and asked for their consent to participate.
Consent forms were provided to each student. Once this process was completed, the
research design was implemented.
3.5.2 Pilot experiment
As in every experimental project, a pilot experiment was conducted. The purpose
of this preliminary run was to verify the feasibility of the experimental procedure by
testing all the features of the TERS platform with the use of a demo. A secondary
objective was to estimate the time required for every session, for both the usability and
the retrieval performance parts. The pilot experiment took place in the “Microlab”
lecture theater in the Department of Information Studies of the University of Sheffield,
and two potential participants, both university students, one expert and one novice, took
part. Five ‘topics’ were used for the retrieval performance part and one for the usability
part. After the pilot study, considering the time required for the completion of both
stages of the experiment, the researcher adjusted the number of ‘topics’ assessed by
each participant.
3.5.3 Configuration of TERS platform
The configuration of the TERS toolkit involved the following steps: general
specifications, retrieval performance experiment specifications and usability experiment
specifications. The general specifications include: retrieval system identification,
definition of the ‘topics’ (search assessments), and creation of user accounts. The
usability experiment specifications include: setting up the survey questions and
identifying the participants’ workload. The retrieval performance specifications
involve: setting up the runs for each ‘topic’ and identifying the workload for each
participant. In order to configure the TERS platform, the researcher constructed files for
all the aforementioned characteristics and loaded them onto the platform.
a) General Specifications
i) Retrieval system identification
In this section the researcher identified the three search engines that would be
subject to evaluation, by constructing a file and loading it onto the platform. The
contents of the file are illustrated in Appendix D.
ii) Definition of the ‘topics’
In this part the researcher defined the ‘topics’ and uploaded them to the platform.
All the topics were based on the TREC test collections. A sample part of the file that the
author constructed is available in Appendix D.
iii) Creation of user accounts
The thirty (30) participants who took part in the experiment were each given a
username and a password to access the experimental platform. The participants used the
same identification details (username and password) for both the usability and the
retrieval performance experiments. In accordance with the ethics arrangements for this
study, further information about the participants cannot be provided.
b) Usability experiment specifications
i) Survey questions set-up
In this section, the questions asked of the participants in the usability
experiment are defined. There are two types of questions: general questions and system
survey questions. The general questions are asked before the participant carries out the
usability experiment (Pre-session Questionnaire II) and concern the participant’s search
behavior. The system survey questions are asked at the end of the usability experiment
and investigate the participant’s satisfaction with respect to a particular retrieval system.
Three types of questions were used: 7-point Likert-scaled questions, Boolean questions,
and open questions. For the scaled and Boolean questions, a low and a high value are
specified, indicating how the scale should be interpreted. The open questions are
optional. A list of all survey questions is presented in Appendix D.
ii) Participant workload in the usability experiment
In this section the workload for each participant in the usability experiment is
defined. So that all ‘topics’ would be assessed in all search engines, the researcher
designed the workload by spreading the ‘topics’ equally among search engines and
participants. An example of how the workload is divided among search engines and
participants can be found in Appendix D.
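One simple way to spread ‘topics’ equally among search engines and participants is a rotation (Latin-square-style) assignment. The sketch below is an illustrative assumption, not necessarily the scheme used in the study:

```python
# Illustrative sketch: rotating topics over search engines so that, across
# participants, every topic is assessed in every engine equally often.
# The exact scheme used in the study is not documented; this is one option.

ENGINES = ["Google", "AltaVista", "Yahoo"]

def workload(participant, topics):
    """Assign each of the participant's topics to an engine by rotation."""
    return {t: ENGINES[(participant + i) % len(ENGINES)] for i, t in enumerate(topics)}

# Participant 0 and participant 1 see the same topics under shifted engines:
w0 = workload(0, ["topic1", "topic2", "topic3"])
w1 = workload(1, ["topic1", "topic2", "topic3"])
```

With this rotation, each participant uses all three engines, and over any three consecutive participants each topic is assessed once in each engine, which keeps engine and topic effects balanced.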
c) Retrieval Performance experiment specifications
i) Participant Workload in the retrieval performance experiment
Taking into account the pilot study conducted earlier and the excessive time
required for the retrieval performance experiment, the researcher assigned only one
‘topic’ to each participant. The list of the participants and the topics assigned to them is
provided in Appendix D.
ii) Run set-up
Due to time limitations, each topic was assessed in only one search engine.
Each run consisted of the top 50 results retrieved by a particular search engine. The
researcher gathered the first 50 results for each topic and uploaded them to the platform.
A typical example of one run is available in Appendix D. Before the experiment, all
links were checked.
3.5.4 Presentation of TERS, Pre-session Questionnaires, Practice
The experiment started with a 10-minute PowerPoint presentation, so that the
participants could become accustomed to the use of the TERS platform. During this
time, the volunteers had the chance to ask questions and to withdraw if they were not
interested. The remaining participants continued by filling in Pre-session Questionnaire I
(Appendix D), which gathered general information about the testers, and then logged
into the platform. Pre-session Questionnaire II collected more specific information
relating to the participants’ searching experience and skills. There was an example
‘topic’ for practice in the usability experiment before they started the actual searches.
3.5.5 Usability experiment section
Fifteen (15) ‘topics’ were assigned to each participant. The testers had no previous
knowledge of the ‘topics’. The usability experiment had to be conducted first, because
the topics used in both experiments were the same. Each participant assessed each
‘topic’ in one of the three (3) search engines already set by the researcher on the
platform. After understanding the indicative information need, the participant formed a
query in the search engine that opened each time via the platform. From the ranked list
that the search engine returned, the participant chose a link that satisfied the search
request. Having found a document appropriate to his/her information need, he/she
copied the following information to the platform:

• the URL of the link satisfying the need
• the search words used when forming the query in the search engine interface
• the position of the link satisfying the need in the ranked list that the search engine
provided
• a relevant fragment of the document that made him/her decide to choose this
specific document

After finishing one ‘topic’, the participant could move on to the next, although this
was not obligatory: the testers could postpone a difficult search and return to it later.
3.5.6 Post-session
After finishing the searches for all 15 ‘topics’, the participants had to judge each
system by filling in a post-session questionnaire (system survey). After completing
the usability session, testers could take a 15-minute break before continuing with the
retrieval performance experiment.
3.5.7 Judgment of Usability relevance assessments
Before the statistics of the usability experiment could be calculated, the
correctness of the answers found by the participants had to be determined. Thus, the
researcher checked the correctness of each of the fifteen (15) answers provided by the
participants for each ‘topic’. The examination was based on a two-point scale, where 0
meant that the answer given by the participant was not relevant and 1 that it was
relevant. The judge could see only the ‘topic’ together with the narrative giving further
information about what is relevant and what is not. Both the user and the system
information were coded.
3.5.8 Retrieval performance experiment section
In this part the user was asked to provide relevance judgments for one ‘topic’
(indicative information need). The relevance judgments were based on the ‘topic’
description and on a brief narrative, which provided further information about which
documents actually satisfy the information need and which do not. A list of the first
50 hits for each ‘topic’ was displayed on the platform, and the participant had to
judge all 50 hits for one ‘topic’ using a 5-point scale:
• highly relevant, when the whole document is relevant
• relevant, when only a fragment of the document is relevant
• irrelevant
• highly irrelevant, when the document is a complete miss; this category also
includes dead links
• not sure, when he or she was unable to judge the relevance
3.6 Result Analysis
3.6.1 Retrieval Performance measures
a) Precision and Recall measures
For measuring retrieval performance, the measures of recall and precision are
normally employed. Since the answer set for each information request is ranked,
however, a measurement cannot be based on these two figures alone: the retrieval
(ranking) strategy plays an important role in interactive environments such as search
engines. The curve of precision versus recall expresses the impact of the retrieval
strategy; it is calculated at 11 standard recall levels (0.0, 0.1, …, 1.0), at each of
which precision is measured. Alternatively, the average precision at a given cut-off
level can be measured, an approach that also gives information about the ranking
algorithm. Thus, if an information request has 40 relevant documents, precision is
measured after 40 documents, which avoids some of the averaging problems of
"precision at X documents". If a cut-off level is greater than the number of documents
retrieved for an information request, the non-retrieved documents are all assumed to
be non-relevant. These are the guidelines that the trec_eval tool (version 7.0) uses;
the TERS platform applies this tool for the measurement of recall and precision.
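The 11-point interpolation procedure described above can be sketched in a few lines of
Python. This is an illustrative reimplementation, not the actual trec_eval code, and
the function and variable names are the author's own:

```python
def eleven_point_interpolated_precision(ranked_relevance, total_relevant):
    """Interpolated precision at the 11 standard recall levels (0.0 to 1.0).

    ranked_relevance: booleans in rank order, True where the document is relevant.
    total_relevant: number of relevant documents for the information request.
    """
    # Compute (recall, precision) after each retrieved document.
    points = []
    relevant_seen = 0
    for rank, is_rel in enumerate(ranked_relevance, start=1):
        if is_rel:
            relevant_seen += 1
        points.append((relevant_seen / total_relevant, relevant_seen / rank))
    # Interpolated precision at recall r is the maximum precision observed
    # at any recall level greater than or equal to r.
    levels = [i / 10 for i in range(11)]
    interpolated = []
    for r in levels:
        candidates = [p for rec, p in points if rec >= r]
        interpolated.append(max(candidates) if candidates else 0.0)
    return interpolated
```

Because each interpolated value is a maximum over all higher recall levels, the
resulting curves are monotonically non-increasing, which is the shape of the
recall-precision curves reported in chapter 4.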
3.6.2 Usability measures
a) Effectivity
Within TERS, two measures are used to establish whether the participants can find
relevant information (effectivity): correctness and positioning. Correctness is the
degree to which the answers to topics are correct and is calculated as: number of
correct answers / total number of answers. Its scale runs from 0 to 1. Positioning
specifies the ability to rank correct answers in a high position. For each answer the
position in the ranking is recorded; if a document's position is greater than 10, it
is assigned position 11. The formula used is: sum(11 − position of a correct answer)
/ total number of answers. Its scale runs from 0 to 10.
b) Efficiency
Efficiency measures the length of time that users need to successfully complete an
information problem (topic). It is calculated as: correctness / total amount of
seconds. Its scale runs from 0 to 1.
c) Satisfaction
The TERS platform measures user satisfaction by taking into account the participants'
answers to the survey questions (post-session questionnaire). An overall satisfaction
score is calculated by averaging the scores of the questions associated with a search
engine. The formula used for this measurement is: sum(scores) / total number of
related questions. The measure has a scale from 1 to 7.
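The four usability formulas above can be expressed directly in code. The following is
an illustrative Python sketch of the formulas as stated, not the TERS platform's own
implementation:

```python
def correctness(answers):
    """answers: list of 0/1 judgments of the participants' answers (1 = relevant)."""
    return sum(answers) / len(answers)          # scale 0 to 1

def positioning(correct_positions, total_answers):
    """correct_positions: ranked-list positions of the correct answers.
    Positions beyond 10 are capped at 11, so they contribute nothing."""
    capped = [min(p, 11) for p in correct_positions]
    return sum(11 - p for p in capped) / total_answers   # scale 0 to 10

def efficiency(correctness_score, total_seconds):
    """Correct answers per second of searching."""
    return correctness_score / total_seconds

def satisfaction(scores):
    """scores: the 1-7 survey answers associated with one search engine."""
    return sum(scores) / len(scores)            # scale 1 to 7
```

For example, a participant with 12 correct answers out of 15 has a correctness of
0.80, and a correct answer found at rank 1 contributes the full 10 points to the
positioning sum.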
4. Results and analysis
4.1 Introduction
All thirty (30) participants took part in both parts of the experiment, usability and
retrieval performance, and returned the questionnaires completed. In this chapter the
analysis of the gathered data begins with a presentation of the participants'
profile; the results of the retrieval performance and usability experiments are then
presented and analyzed.
4.2 Participants’ profile
Of the thirty participants, sixteen are aged between twenty-one and twenty-five.
Nineteen are male and eleven female. Twenty-six are postgraduate students and four
are research students. The participants' academic backgrounds vary widely, covering
information studies, computer science, engineering, medicine, political studies,
human geography, business and management, law and linguistics.
Table 1
                                                                             avg (mean)   std
1. How many years have you been using search engines?                           5.10      1.24
2. What kind of search engine user are you? (beginner – expert)                 4.86      1.59
3. How often do you use the Internet to find information? (monthly – daily)     6.66      0.60
4. Does your search for information in that case always lead to satisfying
   results? (always – never)                                                    5.53      0.93
5. How often do you use the advanced options when you search using the web
   search engines? (never – always)                                             3.93      2.08
6. I know what "Boolean search" is (strongly disagree – strongly agree)         3.36      2.68
7. I use "Boolean search" (strongly disagree – strongly agree)                  2.63      2.22
8. I know what a "ranked list" is (strongly disagree – strongly agree)          4.10      2.59
9. The "advanced" search mode in the web search engines is an effective tool
   (strongly disagree – strongly agree)                                         5.10      1.42
All tables of analytical results are available in Appendix E
Table 1 shows the following participant characteristics: a) most of them started
using search engines five to six years ago; b) they regard themselves as neither
beginners nor experts; c) they use the internet quite often to find information;
d) their searches for information usually lead to satisfying results; e) although
they accept that the "advanced" search mode is an effective tool, they do not seem to
use it; and f) although some of them know how to make a search more effective, they
do not seem to do so.
Taking into account the answers given by the participants, it seems that most of them
have been using search engines for five to six years. Comparing their ages with their
years of web search engine use, it can be inferred that a great number of them
started using search engines during their studies at university, as Graph 1 below
illustrates.
Graph 1: Number of participants by years of search engine use (1, 1-2, 3-4, 5-6, 7-8,
9-10, over 10), grouped by age (21-25, 26-30, over 35)
Moreover, regarding search engine preference, the result was striking: all thirty
(30) participants named Google as their preferred search engine, while the other
search engines, AltaVista, Yahoo, Alltheweb, EntireWeb and Search.com, did not
interest the sample at all.
Graph 2: Number of participants (0-30) preferring each search engine (Google,
AltaVista, Yahoo, Alltheweb, EntireWeb, Search.com)
Table 2: How the participants alter their search when not satisfied with the results

                                                              Answers   Percent
1. do nothing / give up                                           0        0
2. ask somebody for help                                          0        0
3. use the “help” option that the web search engine provides      1        3.3
4. change the web search engine                                   2        6.7
5. use the “advanced” search mode                                 5       16.7
6. change the query slightly                                     22       73.3
7. change the query completely                                    0        0
TOTAL                                                            30      100
Taking into account the answers given by the participants, a large majority of them
(73.3%) prefer to change the query slightly as a first option, while 16.7% switch
from simple to advanced search mode. Only 6.7% change the web search engine, and 3.3%
use the “help” option.
4.3 Retrieval performance experiment results and analysis
4.3.1 Recall - Precision results

Table 3
                     Google   AltaVista   Yahoo
Retrieved              500       500        499
Relevant               242       170        258
Relevant Retrieved     242       170        260
Recall                   1         1          1
Precision             0.4861    0.4622     0.5266
From the table it can be concluded that Yahoo had the best precision, with Google following and AltaVista coming last.
a) Interpolated Recall-Precision

Table 4: Interpolated Recall-Precision Averages
Recall   Google   AltaVista   Yahoo
0.0      0.6737    0.9117     0.8533
0.1      0.5917    0.8442     0.7217
0.2      0.5877    0.6030     0.6636
0.3      0.5669    0.5411     0.6143
0.4      0.5251    0.4980     0.5934
0.5      0.5251    0.4884     0.5779
0.6      0.5185    0.4800     0.5762
0.7      0.5100    0.4800     0.5688
0.8      0.5022    0.4800     0.5627
0.9      0.4896    0.4662     0.5330
1.0      0.4861    0.4627     0.5266
In order to compare the retrieval performance of the three search engines' algorithms over all test queries, the average precision at each recall level was used. According to Yates (1999:77), "since recall levels for each query might be distinct from the 11 standard recall levels, utilization of an interpolation procedure is often necessary". The curve of precision versus recall resulting from the data in the table above is illustrated below.
Graph 3: Recall-Precision curves for Google, AltaVista and Yahoo at the 11 standard
recall levels
Moreover, the average precision at different cut-off levels is shown below:

Table 5
Documents   Google    AltaVista   Yahoo
1           0.3000     0.8000     0.7000
2           0.3000     0.6000     0.6000
3           0.3000     0.4667     0.6000
4           0.3500     0.4500     0.6000
5           0.3600     0.4400     0.6000
6           0.4167     0.4000     0.5667
7           0.4286     0.4000     0.5429
8           0.4125     0.4000     0.5375
9           0.4000     0.3667     0.5333
10          0.4000     0.3600     0.5300
15          0.4200     0.3400     0.5267
20          0.4350     0.3400     0.5150
Graph 4: Precision at document cutoff levels (1-20) for Google, AltaVista and Yahoo
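The cut-off figures above correspond to precision after the first k retrieved
documents, averaged over the test queries. A minimal Python sketch (illustrative
names, not the trec_eval implementation):

```python
def precision_at_k(ranked_relevance, k):
    """Precision after the first k retrieved documents.

    ranked_relevance: booleans in rank order, True where the document is relevant.
    """
    top = ranked_relevance[:k]
    return sum(top) / k

def average_precision_at_k(per_query_relevance, k):
    """Mean of precision@k over a set of queries, as reported per cut-off level."""
    return sum(precision_at_k(r, k) for r in per_query_relevance) / len(per_query_relevance)
```

Unlike the interpolated recall-precision averages, this measure ignores how many
relevant documents exist in total, which is why it characterizes the quality of the
top of the ranking rather than overall recall.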
4.4 Usability experiment results
4.4.1 Overall analysis
Table 6 (the second row refers to the correctly answered topics only)

         Topics  Time total  avg  stddev  min  max   Corr.  Pos. (std)   Eff. (corr./pos.)  Satisf. (std)
All      450     120609      268  281     41   3154  0.78   7.83 (2.64)  0.0029/0.0286      5.01 (1.55)
Correct  354      96941      273  303     41   3154
Overall, of the four hundred and fifty (450) topics assessed by the participants, the
researcher inspected all of them and found that three hundred and fifty-four (354)
were answered correctly. This means that 78.66% of the answers given were correct;
presumably, the participants were competent enough to search for and retrieve
relevant documents. In total, the participants needed about thirty-three and a half
(33.5) hours to search for these topics on the web search engines, an average of
about four and a half minutes (268 seconds) per topic.
In order to measure the first element of usability, the researcher took into account
correctness and positioning, as described in section 3.6 on result analysis. Average
correctness was found to be 0.78; given that the scale of correctness runs from 0 to
1, this result is adequately high. Moreover, the average positioning was found to be
7.83; as the scale for this measure runs from 0 to 10, the positioning level is also
high. Presumably, effectivity, one of the measures used to estimate usability,
answers positively the question of whether users are able to find relevant
information.
Additionally, the other important element of usability is efficiency, which refers to
the length of time users require to successfully complete a topic. With the correctly
answered topics already known, efficiency works out at 0.0029. On a scale from 0 to 1
this score looks low in absolute terms, but since the measure divides correctness by
the total number of seconds, small values are to be expected given that topics took
minutes rather than seconds to complete.
Finally, the third component of usability, satisfaction, was found on average to be
very high: 5.01 on a scale from 1 to 7. Presumably, taking all the above calculations
into account, the overall usability of these search engines, that is, the quality of
the interaction between the user and the systems, is positive.
4.4.2 Search engines usability testing results
Table 7 (in each slash pair the first figure refers to all answers and the second to
correct answers only)

id  Name       Topics   Time total    avg      stddev   min    max        Corr.  Pos. (std)   Eff. (corr./pos.)  Satisf. (std)
1   Google     150/115  35651/27001   237/234  187/183  50/50  1081/1081  0.76   8.37 (2.47)  0.0032/0.0356      5 (1)
2   AltaVista  150/119  43728/36065   291/303  279/301  41/41  1985/1985  0.79   7.01 (2.70)  0.0027/0.0231      4 (1)
3   Yahoo      150/120  41230/33875   274/282  351/385  46/46  3154/3154  0.80   8.12 (2.56)  0.0029/0.0287      5 (1)
With reference to this study, the researcher compared the three search engines on the
usability measures to estimate their performance.

Of the hundred and fifty (150) topics assessed on each search engine, the researcher
found that a hundred and fifteen (115) were correctly answered by the participants
when using Google, a hundred and nineteen (119) when using AltaVista and a hundred
and twenty (120) when using Yahoo. On average, users spent 237 seconds (just under
four minutes) finding an answer to a topic on Google, 291 seconds on AltaVista and
274 seconds on Yahoo. Thus, when searching on Google the participants spent
relatively less time than on the other search engines, but their answers were
slightly less often correct. The differences found, though, are not significant, as
correctness is 0.76 for Google, 0.79 for AltaVista and 0.80 for Yahoo. Moreover, when
measuring positioning, the researcher found that Google ranked the correct documents
higher than the other systems; Yahoo follows with an average positioning of 8.12, and
AltaVista comes last with a noticeably lower positioning of only 7.01.
As far as efficiency is concerned, Google scored highest (0.0032), followed by Yahoo
(0.0029) and AltaVista (0.0027): relative to the number of correct answers given,
participants needed the least time when using Google. In terms of satisfaction,
Google and Yahoo rate the same on the basis of the answers to the survey questions,
while AltaVista follows with a score of 4 out of 7.
Overall, participants were 35.71% satisfied by Google, 37.71% by Yahoo and 28.58% by
AltaVista. They spent on average the least time searching for a topic on Google, more
on Yahoo and the most on AltaVista. They provided the fewest correct answers when
using Google, more when using AltaVista and the most when using Yahoo. The average
time per correctly completed task was lowest on Google, higher on Yahoo and highest
of all on AltaVista. The correct answers given, though, were positioned higher in
Google's ranked list, then in Yahoo's and finally in AltaVista's.
Summary Statistics
Table 8
Google AltaVista Yahoo
Effectivity
Correctness 0.76 0.79 0.80
Positioning 8.37 7.01 8.12
Efficiency 0.0032 0.0027 0.0029
Satisfaction 5 4 5
Looking at the table above, Google performed better in terms of effectivity (positioning), efficiency and satisfaction, with Yahoo following and AltaVista coming last. In terms of usability, then, Google rates highest, facilitating the interaction between the user and the system via its interface.
4.4.3 Topics analysis results

Table 9 (in each slash pair the first figure refers to all answers and the second to
correct answers only)

id  Title                          Topics  Time        avg      stddev   min      max        Corr.  Pos. (std)   Eff. (corr./pos.)
1   Bengals cat                    15/15   3305/3305   220/220  133/133  100/100  564/564    1.00   8.00 (2.44)  0.0045/0.0363
2   Chevrolet Trucks               15/8    4575/1843   305/230  242/168  111/111  1049/625   0.53   8.37 (2.72)  0.0017/0.0363
3   Fasting                        15/12   2161/1667   144/138  54/58    67/67    285/285    0.80   8.66 (1.87)  0.0055/0.0623
4   Lava lamps                     15/13   2867/2573   191/197  81/83    81/81    334/334    0.86   7.38 (2.66)  0.0045/0.0373
5   Tartin                         15/13   4230/3975   282/305  254/266  89/89    1043/1043  0.86   7.30 (3.01)  0.0030/0.0238
6   Deer                           15/9    4112/2418   274/268  220/217  67/67    741/741    0.60   9.00 (1.11)  0.0021/0.0334
7   Incandescent light bulb        15/14   3077/2321   205/165  167/72   53/53    756/289    0.93   9.00 (1.79)  0.0045/0.0542
8   Mexican food culture           15/11   3400/1803   226/163  164/92   46/46    644/276    0.73   8.90 (1.30)  0.0032/0.0543
9   Jennifer Aniston               15/14   3175/3043   211/217  146/150  68/68    586/586    0.93   8.57 (1.39)  0.0044/0.0394
10  Pine tree                      15/8    3818/1848   254/231  130/129  77/77    465/461    0.53   8.25 (2.54)  0.0020/0.0357
11  Auto skoda                     15/10   7984/6802   532/680  796/950  89/89    3154/3154  0.66   8.80 (2.04)  0.0012/0.0129
12  Nirvana                        15/15   3751/3751   250/250  179/179  87/87    683/683    1.00   7.40 (2.94)  0.0039/0.0295
13  Decade of the 1920's           15/15   2806/2806   187/187  112/112  75/75    398/398    1.00   6.93 (3.21)  0.0053/0.0370
14  DNA Testing                    15/7    6426/4218   428/602  497/688  93/93    1985/1985  0.46   6.14 (2.85)  0.0010/0.0101
15  Behavioral Genetics            15/14   3280/3205   218/228  124/122  70/70    484/484    0.93   7.78 (2.19)  0.0042/0.0340
16  Cosmic Events                  15/11   7472/6601   498/600  555/622  119/120  2132/2132  0.73   6.63 (3.10)  0.0014/0.0110
17  Tropical Storms                15/11   3332/2683   222/243  126/131  76/105   551/551    0.73   8.63 (1.74)  0.0033/0.0354
18  Carbon Monoxide Poisoning      15/11   6153/5313   410/483  314/334  64/64    1124/1124  0.73   8.45 (1.91)  0.0017/0.0175
19  UV damage, eyes                15/14   2777/2715   185/193  101/99   62/74    393/393    0.93   8.00 (2.68)  0.0050/0.0412
20  Greek, philosophy, stoicism    15/8    6657/4630   443/578  483/629  67/67    1808/1808  0.53   5.50 (4.10)  0.0012/0.0095
21  Antibiotics Ineffectiveness    15/12   3178/2575   211/214  148/132  53/60    531/531    0.80   8.75 (1.91)  0.0037/0.0407
22  Drugs in Golden Triangle       15/12   3636/3267   242/272  178/187  81/81    660/660    0.80   8.41 (1.88)  0.0033/0.0309
23  Legionnaires' disease          15/15   1652/1652   110/110  45/45    49/49    192/192    1.00   9.06 (1.33)  0.0090/0.0823
24  Killer Bee Attacks             15/13   2799/2629   186/202  129/131  53/54    460/460    0.86   8.30 (2.62)  0.0046/0.0410
25  Radio Waves and Brain Cancer   15/14   3144/3057   209/218  102/100  56/56    415/415    0.93   6.07 (3.14)  0.0044/0.0278
26  Undersea Fiber Optic Cable     15/9    4212/2996   280/332  182/217  88/88    692/692    0.60   6.11 (3.65)  0.0021/0.0183
27  Risk of Aspirin                15/9    3590/2491   239/276  197/235  41/41    680/680    0.60   5.88 (4.28)  0.0025/0.0212
28  Metabolism                     15/11   3903/2282   260/207  194/173  47/47    614/614    0.73   8.00 (2.82)  0.0028/0.0385
29  Health and Computer Terminals  15/13   4252/3871   283/297  228/243  78/78    822/822    0.86   7.23 (3.16)  0.0030/0.0242
30  New Hydroelectric Projects     15/13   4885/4601   325/353  273/283  66/66    1081/1081  0.86   7.92 (2.25)  0.0026/0.0223
Based on the evidence of the table above, the topics that the participants found
rather difficult are: 2. “Chevrolet trucks”, 6. “Deer”, 10. “Pine tree”, 14. “DNA
testing”, 20. “Greek philosophy, stoicism”, 26. “Undersea Fiber Optic Cable” and
27. “Risk of aspirin”. Taking into account the list the researcher drew up of generic
and specific topics (Appendix B), it is understandable that topics 2, 6, 10, 14 and
20, which are generic, could be interpreted in many ways; their nature thus explains
their scores. On the contrary, topics such as 26 and 27, which are specific,
performed surprisingly badly.

The easiest topic according to the data is topic 23, which received 15 correct
answers and on which the participants spent the least time. In terms of correctness,
topics 1, 12, 13 and 23 had the best scores, while topics 2, 10, 14 and 20 had the
worst.

Moreover, the topics whose correct answers were ranked highest in the results lists
are topics 23, 6 and 7, with an average positioning of about 9 out of 10. On the
other hand, the correct answers for topics 14, 20, 25, 26 and 27 were found lower in
the ranked lists of results. Additionally, a high efficiency score was attained for
topics 7, 8 and 19.
4.4.4 User results analysis

Table 10 (in each slash pair the first figure refers to all answers and the second to
correct answers only)

User Topics  Time        avg      stddev   min      max        Corr.  Pos. (std)   Eff. (corr./pos.)  Satisf. (std)
1    15/12   1702/1380   113/115  38/42    70/70    215/215    0.80   8.41 (2.10)  0.0070/0.0731      5.12 (1.04)
2    15/11   2412/1610   160/146  100/80   68/68    416/265    0.73   6.63 (2.97)  0.0045/0.0453      5.10 (2.16)
3    15/12   5582/4109   372/342  258/194  67/67    1049/683   0.80   7.08 (2.96)  0.0021/0.0206      6.41 (0.91)
4    15/13   7216/6439   481/495  411/435  134/134  1808/1808  0.86   6.76 (3.72)  0.0018/0.0136      5.18 (1.55)
5    15/9    2578/1288   171/143  82/48    81/81    422/224    0.60   8.88 (1.61)  0.0034/0.0621      4.95 (1.23)
6    15/12   3034/2492   202/207  101/110  78/78    425/425    0.80   9.16 (0.93)  0.0039/0.0441      5.93 (0.69)
7    15/14   5043/4365   336/311  337/336  106/106  1419/1419  0.93   6.50 (2.82)  0.0027/0.0208      4.85 (1.33)
8    15/11   2296/1651   153/150  78/85    69/80    364/364    0.73   8.27 (2.28)  0.0047/0.0551      5.58 (1.86)
9    15/10   2987/2001   199/200  81/90    89/89    384/384    0.66   7.60 (2.79)  0.0033/0.0379      5.60 (1.19)
10   15/14   6902/6370   460/455  256/265  138/138  1124/1124  0.93   6.28 (2.70)  0.0020/0.0138      5.18 (1.87)
11   15/11   2666/1983   177/180  134/156  60/60    521/521    0.73   7.09 (3.36)  0.0041/0.0393      2.62 (0.95)
12   15/11   2933/1864   195/169  114/61   76/76    480/254    0.73   7.45 (3.32)  0.0037/0.0439      5.02 (1.61)
13   15/13   2800/2533   186/194  133/141  60/60    564/564    0.86   8.69 (1.60)  0.0046/0.0446      4.58 (1.48)
14   15/14   8341/8029   556/573  789/816  53/53    3154/3154  0.93   8.85 (2.17)  0.0016/0.0154      4.79 (1.45)
15   15/10   3594/2353   239/235  105/113  115/115  437/437    0.66   6.70 (2.75)  0.0027/0.0284      2.93 (1.22)
16   15/11   2096/1624   139/147  113/128  53/54    498/498    0.73   8.90 (1.44)  0.0052/0.0603      5.75 (1.08)
17   15/10   3885/2321   259/232  124/121  90/90    476/448    0.66   8.40 (2.36)  0.0025/0.0361      5.66 (0.63)
18   15/13   4878/4320   325/332  119/126  139/139  520/520    0.86   6.53 (3.38)  0.0026/0.0196      4.60 (1.59)
19   15/12   3571/2805   238/233  139/129  102/102  566/566    0.80   8.58 (2.57)  0.0033/0.0367      4.54 (1.45)
20   15/12   4255/2755   283/229  150/97   116/116  642/414    0.80   9.25 (1.28)  0.0028/0.0402      5.93 (0.31)
21   15/12   2282/1783   152/148  45/44    77/77    243/243    0.80   8.08 (2.27)  0.0052/0.0544      4.95 (0.65)
22   15/13   4034/3791   268/291  213/221  101/106  822/822    0.86   7.23 (3.21)  0.0032/0.0247      4.43 (0.84)
23   15/14   8078/7909   538/564  438/442  60/60    1401/1401  0.93   8.07 (2.86)  0.0017/0.0142      5.56 (1.33)
24   15/7    1166/468    77/66    30/26    41/41    155/120    0.46   7.71 (1.79)  0.0060/0.1153      3.89 (1.58)
25   15/11   5539/4274   369/388  460/538  96/96    1985/1985  0.73   8.00 (2.32)  0.0019/0.0205      5.22 (2.04)
26   15/12   4481/2946   298/245  199/162  46/46    756/625    0.80   8.66 (1.96)  0.0026/0.0353      5.27 (1.60)
27   15/12   4082/2984   272/248  177/156  74/74    644/539    0.80   8.66 (1.72)  0.0029/0.0348      4.62 (1.14)
28   15/14   3196/3120   213/222  151/152  49/49    547/547    0.93   7.78 (2.86)  0.0043/0.0349      5.10 (1.32)
29   15/10   4668/3179   311/317  192/214  145/151  865/865    0.66   7.60 (3.77)  0.0021/0.0239      5.00 (1.14)
30   15/14   4312/4195   287/299  519/536  59/59    2132/2132  0.93   7.57 (2.56)  0.0032/0.0252      6.06 (0.80)
In general terms, all users succeeded in assessing all 15 of their topics across the
different search engines. None of them found correct answers for every topic
searched; 6 out of 30 participants managed to find 14 out of 15 correct answers, and
only two participants had fewer than 10 correct answers. Participant 23 spent the
most time conducting the searches but succeeded in finding 14 out of 15 correct
answers, spending on average about 9 minutes (538 seconds) per topic. Participant 24,
on the other hand, as extracted from the data, spent the least time (about 77 seconds
per topic) and had a rather poor score for correct answers. Moreover, it is
interesting to note that most participants found their answers high in the ranked
list of retrieved records.
Satisfaction

Of all the measurements defined to measure search engine usability, the most
important is satisfaction. The table below collects the results given by the users
when judging the three (3) web search engines.
Table 11
                                      Google       AltaVista    Yahoo
                                      avg (mean)   avg (mean)   avg (mean)
1. Search options                     5.40         4.29         5.02
2. Presentation of results            4.87         3.92         4.56
3. Relevance of retrieved documents   4.93         4.03         4.66
4. Response time                      6.33         5.30         5.40
5. Satisfaction overall               5.86         4.12         5.01
Total                                 5.48         4.33         4.93
As extracted from the table, in terms of satisfaction Google performed better in all
areas of usability. Yahoo came second with an average of 4.93 and AltaVista third
with an average of 4.33. (The data from the participants' answers are provided in
Appendix E.)
5. Limitations of the experimental study
5.1 Search engines
The description of the search engines compared in this experiment may be
insufficient. Only data gathered from the search engines' web sites were included,
and no information is provided about the search engines' ranking algorithms. When
comparing these kinds of information retrieval systems on their ability to retrieve
relevant documents and to rank more relevant documents higher in the list of
retrieved results, it is important for the researcher to acquire more specific
information. It is understandable that this kind of information is regarded as a
business secret; on the other hand, it is essential to realize that only through
evaluation can the design of these systems be improved.
5.2 Participants
When comparing search engines, especially their interfaces, it is important to have
an adequate sample. The initial target was 60 participants, but it was not easy to
meet. By the time the experiment took place, after ethical approval from the
university authorities had been obtained, most of the university's students were not
available to take part. Moreover, the fact that the participants were volunteers
raises concerns about the care with which they conducted the relevance evaluations
during the retrieval performance experiment.
5.3 Test collection
Due to time limitations, it was not possible for the researcher to construct a test
collection from scratch. The author therefore thought it adequate to build a list
based on the TREC test collections that are used for such purposes. Notwithstanding
the considerable debate concerning TREC test collections and their use for comparing
interactive information retrieval systems, the researcher selected 30 topics from
three (3) test collections based on students' preferences. These students did not
take part in the experiment, since having no previous knowledge of the topics was a
prerequisite for the usability testing. Additionally, the construction of the test
collection list can itself be questioned. Moreover, the assumption that the
participants had no previous knowledge cannot be guaranteed, as the researcher gave
them the opportunity to choose topics of their preference from the list.
5.4 Observation data

Although observational data were gathered from all the participants, they proved
difficult to analyze because of the time limitations.
5.5 Researcher’s interference
Taking into account the literature concerning researcher interference when similar
experiments have been conducted, the influence this can have on the participants
cannot be disregarded. The results gathered when participants are supervised differ
from those gathered when participants are merely observed.
6. Discussion
Even though the results were presented and analyzed extensively in section four (4),
satisfying the major objectives of this study, one more interesting point must be
taken into consideration: the correlation between the two experiments.

As the data analysis shows, in the retrieval performance experiment Yahoo performed
better than the other two systems. This means that Yahoo's retrieval algorithm is
more efficient than Google's and AltaVista's: Yahoo can retrieve more relevant
documents, and it can rank more relevant documents higher in the list of retrieved
results. The differences between Yahoo and Google, though, are small, and the results
of this experiment can be questioned, as discussed in the previous section
(5. Limitations of the experimental study). On the other hand, the results of the
usability testing showed that Google's interface comes much closer to the users'
information needs; specifically, Google satisfied more of the criteria used for the
measurement of usability than the other search engines. Presumably, there is a
negative correlation between usability and retrieval performance.
The findings of this experiment are in line with Turpin and Hersh's (2000)
investigations of the correlation between the two kinds of experiment and with the
results of a case study conducted by Dr. Roelof van Zwol (2004). In contrast to this
study, Turpin and Hersh (2000) also tried to explain the reasons for such differences
between ‘batch’ evaluations and user-oriented experiments. One of their most
important explanations for this conflict was that “users do not issue queries that
take advantage of the increased performance offered by the batch systems…as users of
the improved system issue queries that result in more relevant documents being ranked
higher in the output” (Turpin and Hersh, 2000:230).
7. Conclusion
The fundamental reason for the development of web search engines is the digitization
of information resources. In order for search engines to justify their existence,
usage and efficiency, extensive evaluation of these information retrieval systems has
become essential.
In earlier days the importance of such evaluation lay in the ability to develop
better retrieval algorithms so as to retrieve relevant documents. Researchers thus
focused more on the systems' mechanisms and tried to analyze them with a view to
improving them. For this purpose they needed a laboratory environment, a test
collection of documents, a set of queries and a set of relevant documents;
by formulating queries, the researchers tested the algorithms' retrieval performance.
This approach is representative of the Cranfield paradigm established by Cleverdon et
al. in the 1950's. Though this perspective was adequate for the evaluation of
information retrieval systems at the time, it cannot be applied on its own in today's
highly interactive environments, such as the World Wide Web. Presumably, new
approaches had to be developed for the evaluation of modern web search engines.
Therefore, the focus turned to a more user-centered approach. This approach's major
objective is, according to Borlund (2000), the investigation of the interaction
between the user, the retrieval mechanism and the database when extracting
information under real-life conditions. The supporters of this methodology severely
criticized the criteria used by the traditional system-centered approach and
highlighted their inadequacy for evaluating interactive information retrieval systems
such as search engines.
Although this approach is more modern and undermines the narrow system-centered
logic, it is not easy to conduct. Jones and Willett, quoted by Voorhees (2002:355),
mention that “a properly designed user-based evaluation must use a sufficiently
large, representative sample of actual users of the retrieval system; each of the
systems to be compared must be equally well developed and complete with an
appropriate user interface; each subject must be equally well trained on all systems
and care must be taken to control for the learning effect”.
By the same token, another perspective has to be adopted, one that emphasizes the
need for experimentation in a laboratory environment with real-life situations. As a
result, the horizons are broadening in order to satisfy this need.
Roelof van Zwol (2004) supported this idea and devised an evaluation platform that
combines the user-centered and system-oriented approaches. For the usability testing
he used the measures of “effectivity”, “efficiency” and “satisfaction”; for the
retrieval performance evaluation he remained focused on the traditional measures of
recall and precision, changing them slightly.

The researcher agrees with the idea of bringing realism into a laboratory environment
and has therefore used the platform built by Roelof van Zwol (2004) to conduct this
study comparing the usability and retrieval performance of Google, AltaVista and
Yahoo.
The results of the first part of the experiment, the usability testing, revealed the
superiority of Google over the other search engines. Google performed better in the
“positioning” measure (a sub-measure of “effectivity”, which specifies the ability of
the search engine to rank the correct answers given by the participants in a high
position), scoring 8.37 out of 10, whereas the scores for AltaVista and Yahoo were
7.01 and 8.12 respectively. For the other sub-measure of “effectivity”, “correctness”
(the degree to which the answers to the topics are correct), Yahoo scored 0.80 out of
1, whereas Google gained 0.76 and AltaVista 0.79. Moreover, in terms of “efficiency”
(the ability to successfully complete a topic relative to the time needed) and
“satisfaction” (the users' satisfaction, measured with the survey questionnaire),
Google performed better in both, with scores of 0.0032 out of 1 and 5 out of 7
respectively; for the same measures, AltaVista had means of 0.0027 and 4, and Yahoo
0.0029 and 5. Presumably, out of the four criteria (including sub-criteria), Google
scored highest on three. Consequently, in the usability evaluation Google is ranked
first, Yahoo second and AltaVista third.
On the other hand, when retrieval performance is compared, Yahoo achieved the
highest precision (0.5266), a rather good performance, followed by Google (0.4861) and
AltaVista (0.4622). The conflict between usability and retrieval performance is thus
confirmed, and the findings are in line with the results of Turpin and Hersh (2001) and
Roelof van Zwol (2004).

As for future work and further development, similar experiments with a larger
sample of participants are recommended, in order to investigate the reasons for the
conflict between the usability and retrieval performance results.
References
• Beaulieu, M., Robertson, S. and Rasmussen, E. (1996). “Evaluating interactive
systems in TREC”, Journal of the American Society for Information Science,
47(1), 85-94.
• Borlund, P. (2000) “Experimental components for the evaluation of IIR
systems”. Journal of Documentation. 56 (1). 71 –90.
• Chu, H. and Rosenthal, M. (1996). “Search engines for the World Wide Web: a
comparative study and evaluation methodology”, Proceedings of the ASIS
Annual Meeting, 33, 127-135.
• Courtois, M.P., Baer, W. and Stark, M. (1995). “Cool tools for searching the
Web: a performance evaluation”, Online, 19(6), 14-32.
• Ding, W. and Marchionini, G. (1996). “A comparative study of Web search
service performance”, Proceedings of the ASIS Annual Meeting, 33, 136-142.
• Dong, X. and Su, L.T. (1997). “Search engines on the World Wide Web and
information retrieval from the Internet: a review and evaluation”, Online &
CDROM Review, 21(2), 67-68.
• Frokjaer, E., Hertzum, M. & Hornbeak, K. (2000). “Measuring Usability: Are
effectiveness, efficiency, and satisfaction really correlated?”. In: SIGCHI 2000
Hague [Online]. Proceedings of the SIGCHI conference on Human factors in
computing systems. 01 – 06 April 2000, Hague, The Netherlands. New York:
ACM Press. http://muse7.shef.ac.uk/mirror/delivery.acm.org/ [Accessed 14 May
2005].
• Gauch, S. and Guijun Wang (1996). “Information Fusion with Profusion”,
Webnet 96 Conference, San Francisco, CA, October 15-19.
• Gordon, M. and Pathak, P. (1999). “Finding information on the World Wide
Web: The retrieval effectiveness of search engines”, Information Processing and
Management, 35(2), 141-180.
• Griesbaum, J. (2004). “Evaluation of three German search engines: Altavista.de,
Google.de and Lycos.de”. Information Research, 9(4), paper 189. [Available at
http://InformationR.net/ir/9-4/paper189.html]
• Hawking, D., Craswell, N., Bailey, P. and Griffiths, K. (2001). “Measuring
search engine quality”, Information Retrieval, 4, 33-59.
• Hersh, W. & Turpin, A. (2000). “Do Batch and User Evaluations give the same
results?”. In: SIGIR 2000 Athens [Online]. Proceedings of the 23rd annual
international ACM SIGIR conference on Research and development in
information retrieval. 24 – 28 July, 2000, Athens, Greece. New York: ACM
Press. http://muse7.shef.ac.uk/mirror/delivery.acm.org/ [Accessed 14 May 2005].
• http://help.yahoo.com/help/us/ysearch/
• http://trec.nist.gov/data.html
• http://www.altavista.com/help/
• http://www.google.com/help/basics.html
• Jones, S. & Willett, P. (1997). “Readings in information retrieval”. San
Francisco: Morgan Kaufmann Publishers.
• Large, A., Tedd, L. & Hurtley, R. (2001). “Information seeking in the online
age: principles and practice”. München : K. G. Saur.
• Leighton, H. V. and Srivastava, J. (1997). “Precision among World Wide Web
Search Services (Search Engines): Alta Vista, Excite, HotBot, InfoSeek, Lycos”.
cybermetrics.cindoc.csic.es [Site visited 18/04/02].
• Leighton, H. V. and Srivastava, J. (1999). “First 20 Precision among World
Wide Web Search Services (Search Engines)” Journal of the American Society
for Information Science, 50 (10), 870-881
• Mullins, C. (2005). “Transaction Log Guidelines”. dbazine [Online] 25 April.
http://www.dbazine.com/sql/sql-articles/mullins-sqlserver [Accessed 5
September 2005].
• Over, P. (2001). “The TREC interactive track: an annotated bibliography”.
Information Processing and Management. 37, 369-381.
• Pors, N. (2000). “Information retrieval, experimental models and statistical
analysis”, Journal of Documentation, 56(1), 55-70.
• Rouet, J.F. (2003). “What was I looking for? The influence of task specificity
and prior knowledge on students search strategies in hypertext”. Interacting with
Computers, 15 (3), 409-428.
• Saracevic, T. (1995). “Evaluation of Evaluation in information retrieval”. In:
SIGIR 1995 Seattle [Online]. Proceedings of the 18th annual international ACM
SIGIR conference on Research and development in information retrieval. 9-13
July 1995, Seattle, USA. New York: ACM Press.
http://muse7.shef.ac.uk/mirror/delivery.acm.org/ [Accessed 14 May 2005].
• Schwartz, C. (1998). “Web search engines”. Journal of the American Society for
Information Science, 49(11), 973-982.
• Spink, A. (2002). “A user-centered approach to evaluating human interaction
with Web search engines: an exploratory study”. Information Processing and
Management, 38(3), 401-424.
• Su, L. (1992). “Evaluation measures for interactive information retrieval”.
Information Processing and Management, 28(4), 503-516.
• Su, L. (1994). “The Relevance of Recall and Precision in User Evaluation”.
Journal of the American Society for Information Science, 45(3), 207-217.
• Su, L. (1998). “Value of search results as a whole as the best single measure of
information retrieval performance”. Information Processing and Management,
34(5), 557-579.
• Turpin, A. & Hersh, W. (2001). “Why Batch and User Evaluations do not give
the same results”. In: SIGIR 2001 New Orleans. [Online]. Proceedings of the
24th annual international ACM SIGIR conference on Research and development
in information retrieval. 9-12 September 2001, New Orleans, USA. New York:
ACM Press. http://muse7.shef.ac.uk/mirror/delivery.acm.org/ [Accessed 14 May
2005].
• Van Zwol, R. (2004). “Google’s “I’m Feeling Lucky”, truly a gamble?”. In: Web
Information Systems – WISE 2004. Lecture Notes in Computer Science 3306,
378-389.
• Voorhees, E.M. and Harman, D. (2000). “Overview of the sixth Text Retrieval
Conference (TREC-6)”, Information Processing and Management, 36(1), 3-36.
• Zorn, P., Emanoil, M., Marshall, L. and Panek, M. (1996). “Advanced Web
searching: tricks of the trade”, Online, 20(3), 15-28.
Appendix A Google Simple Search Interface
Google Advanced Search Interface
AltaVista Simple Search Interface
AltaVista Advanced Search Interface
Yahoo Simple Search Interface
Yahoo Advanced Search Interface
Appendix B
TREC-9

Topic 1
<num> Number: 451
<title> What is a Bengals cat?
<desc> Description: Provide information on the Bengal cat breed.
<narr> Narrative: Item should include any information on the Bengal cat breed, including description, origin, characteristics, breeding program, names of breeders and catteries carrying bengals. References which discuss bengal clubs only are not relevant. Discussions of bengal tigers are not relevant.

Topic 2
<num> Number: 457
<title> Chevrolet Trucks
<desc> Description: Find documents that address the types of Chevrolet trucks available.
<narr> Narrative: Relevant documents must contain information such as: the length, weight, cargo size, wheelbase, horsepower, cost, etc.

Topic 3
<num> Number: 458
<title> fasting
<desc> Description: Find documents that discuss fasting for religious reasons.
<narr> Narrative: A relevant document discusses fasting as related to periods of religious significance. Relevant documents should state the reason for fasting and the benefits to be derived.

Topic 4
<num> Number: 461
<title> lava lamps
<desc> Description: Find documents that discuss the origin or operation of lava lamps.
<narr> Narrative: A relevant document must contain information on the origin or the operation of the lava lamp.

Topic 5
<num> Number: 463
<title> tartin
<desc> Description: Find information on Scottish tartans: their history, current use, how they are made, and how to wear them.
<narr> Narrative: Simple listings of clan/tartan names or price lists are not relevant. Pictures or descriptions of individual plaids are not relevant unless accompanied by history of their development.

Topic 6
<num> Number: 465
<title> deer
<desc> Description: What kinds of diseases can infect humans due to contact with deer or consumption of deer meat?
<narr> Narrative: Documents explaining the transference of Lyme disease to humans from deer ticks are relevant.

Topic 7
<num> Number: 468
<title> incandescent light bulb
<desc> Description: Find documents that address the history of the incandescent light bulb.
<narr> Narrative: A relevant document must provide information on who worked on the development of the incandescent light bulb. Relevant documents should include locations and dates of the development efforts. Documents that discuss unsuccessful development attempts and non-commercial use of incandescent light bulbs are considered relevant.

Topic 8
<num> Number: 471
<title> mexican food culture
<desc> Description: Find documents that discuss the popularity or appeal of Mexican food outside of the United States.
<narr> Narrative: Documents that discuss the popularity of Mexican food in the United States, Central and South America are not relevant. Relevant documents discuss the extent to which Mexican food is enjoyed or used in Europe, Asia, Africa, or Australia.

Topic 9
<num> Number: 476
<title> Jennifer Aniston
<desc> Description: Find documents that identify movies and/or television programs that Jennifer Aniston has appeared in.
<narr> Narrative: Relevant documents include movies and/or television programs that Jennifer Aniston has appeared in.

Topic 10
<num> Number: 482
<title> where can i find growth rates for the pine tree?
<desc> Description: Find documents that give growth rates of pine trees.
<narr> Narrative: Document that give heights of trees but not the rate of growth are not relevant.

Topic 11
<num> Number: 484
<title> auto skoda
<desc> Description: Skoda is a heavy industrial complex in Czechoslovakia. Does it manufacture vehicles?
<narr> Narrative: Relevant documents would include references to historic and contemporary automobile and truck production. Non-relevant documents would pertain to armament production.
Topic 12
<num> Number: 494
<title> nirvana
<desc> Description: Find information on members of the rock group Nirvana.
<narr> Narrative: Descriptions of members' behavior at various concerts and their performing style is relevant. Information on who wrote certain songs or a band member's role in producing a song is relevant. Biographical information on members is also relevant.

Topic 13
<num> Number: 495
<title> Where can I find information on the decade of the 1920's?
<desc> Description: Find information on the decade of the 1920's, known also as the Roaring Twenties.
<narr> Narrative: Information on life or happenings during the 1920's decade anywhere in the world is relevant. Simple dates of birth or death in the 1920's are not relevant unless they have broader significance.

Topic 14
<num> Number: 500
<title> DNA Testing
<desc> Description: This search seeks information on the state of the art of DNA testing; what it is and what its goals are.
<narr> Narrative: Relevant documents may discuss those things which are essentially steps in the DNA testing procedure, such as: sequencing, analysis, fingerprinting, and profiling. Documents that provide descriptions of elaborate scientific DNA testing are not relevant.

TREC-8

Topic 15
<num> Number: 402
<title> behavioral genetics
<desc> Description: What is happening in the field of behavioral genetics, the study of the relative influence of genetic and environmental factors on an individual's behavior or
personality?
<narr> Narrative: Documents describing genetic or environmental factors relating to understanding and preventing substance abuse and addictions are relevant. Documents pertaining to attention deficit disorders tied in with genetics are also relevant, as are genetic disorders affecting hearing or muscles. The genome project is relevant when tied in with behavior disorders (i.e., mood disorders, Alzheimer's disease).

Topic 16
<num> Number: 405
<title> cosmic events
<desc> Description: What unexpected or unexplained cosmic events or celestial phenomena, such as radiation and supernova outbursts or new comets, have been detected?
<narr> Narrative: New theories or new interpretations concerning known celestial objects made as a result of new technology are not relevant.

Topic 17
<num> Number: 408
<title> tropical storms
<desc> Description: What tropical storms (hurricanes and typhoons) have caused significant property damage and loss of life?
<narr> Narrative: The date of the storm, the area affected, and the extent of damage/casualties are all of interest. Documents that describe the damage caused by a tropical storm as "slight", "limited", or "small" are not relevant.

Practice Topic
<num> Number: 417
<title> creativity
<desc> Description: Find ways of measuring creativity.
<narr> Narrative: Relevant items include definitions of creativity, descriptions of characteristics associated with creativity, and factors linked to creativity.
Topic 18
<num> Number: 420
<title> carbon monoxide poisoning
<desc> Description: How widespread is carbon monoxide poisoning on a global scale?
<narr> Narrative: Relevant documents will contain data on what carbon monoxide poisoning is, symptoms, causes, and/or prevention. Advertisements for carbon monoxide protection products or services are not relevant. Discussions of auto emissions and air pollution are not relevant even though they can contain carbon monoxide.

Topic 19
<num> Number: 427
<title> UV damage, eyes
<desc> Description: Find documents that discuss the damage ultraviolet (UV) light from the sun can do to eyes.
<narr> Narrative: A relevant document will discuss diseases that result from exposure of the eyes to UV light, treatments for the damage, and/or education programs that help prevent damage. Documents discussing treatment methods for cataracts and ocular melanoma are relevant even when a specific cause is not mentioned. However, documents that discuss radiation damage from nuclear sources or lasers are not relevant.

Topic 20
<num> Number: 433
<title> Greek, philosophy, stoicism
<desc> Description: Is there contemporary interest in the Greek philosophy of stoicism?
<narr> Narrative: Actual references to the philosophy or philosophers, productions of Greek stoic plays, and new "stoic" artistic productions are all relevant.

Topic 21
<num> Number: 449
<title> antibiotics ineffectiveness
<desc> Description: What has caused the current ineffectiveness of antibiotics against infections and what is the prognosis for new drugs?
<narr> Narrative: To be relevant, a document must discuss the reasons or causes for the ineffectiveness of current antibiotics. Relevant documents may also include efforts by pharmaceutical companies and federal government agencies to find new cures, updating current testing phases, new drugs being tested, and the prognosis for the availability of new and effective antibiotics.

Topic 22
<num> Number: 415
<title> drugs, Golden Triangle
<desc> Description: What is known about drug trafficking in the "Golden Triangle", the area where Burma, Thailand and Laos meet?
<narr> Narrative: A relevant document will discuss drug trafficking in the Golden Triangle, including organizations that produce or distribute the drugs; international efforts to combat the traffic; or the quantities of drugs produced in the area.

Topic 23
<num> Number: 429
<title> Legionnaires' disease
<desc> Description: Identify outbreaks of Legionnaires' disease.
<narr> Narrative: To be relevant, a document must discuss a specific outbreak of Legionnaires' disease. Documents that address prevention of or cures for the disease without citing a specific case are not relevant.

Topic 24
<num> Number: 430
<title> killer bee attacks
<desc> Description: Identify instances of attacks on humans by Africanized (killer) bees.
<narr> Narrative: Relevant documents must cite a specific instance of a human attacked by killer bees. Documents that note migration patterns or report attacks on other animals are not relevant unless they also cite an attack on a human.
TREC-6
Topic 25
<num> Number: 310
<title> Radio Waves and Brain Cancer
<desc> Description: Evidence that radio waves from radio towers or car phones affect brain cancer occurrence.
<narr> Narrative: Persons living near radio towers and more recently persons using car phones have been diagnosed with brain cancer. The argument rages regarding the direct association of one with the other. The incidence of cancer among the groups cited is considered, by some, to be higher than that found in the normal population. A relevant document includes any experiment with animals, statistical study, articles, news items which report on the incidence of brain cancer being higher/lower/same as those persons who live near a radio tower and those using car phones as compared to those in the general population.

Topic 26
<num> Number: 320
<title> Undersea Fiber Optic Cable
<desc> Description: Fiber optic link around the globe (Flag) will be the world's longest undersea fiber optic cable. Who's involved and how extensive is the technology on this system. What problems exist?
<narr> Narrative: Relevant documents will reference companies involved in building the system or the technology needed for such an endeavor. Of relevance also would be information on the link up points of FLAG or landing sites or interconnection with other telecommunication cables. Relevant documents may reference any regulatory problems with the system once constructed. A non-relevant document would contain information on other fiber optic systems currently in place.

Topic 27
<num> Number: 338
<title> Risk of Aspirin
<desc> Description: What adverse effects have people experienced while taking aspirin repeatedly?
<narr> Narrative: A relevant document should identify any adverse effects experienced from the repeated use of aspirin. Possible effects might include intestinal bleeding,
inflammation of the stomach, or various forms of ulcers. The purpose of the individual's repeated aspirin use should also be stated.

Topic 28
<num> Number: 349
<title> Metabolism
<desc> Description: Document will discuss the chemical reactions necessary to keep living cells healthy and/or producing energy.
<narr> Narrative: A relevant document will contain specific information on the catabolic and anabolic reactions of the metabolic process. Relevant information includes, but is not limited to, the reactions occurring in metabolism, biochemical processes (Glycolysis or Krebs cycle for production of energy), and disorders associated with the metabolic rate.

Topic 29
<num> Number: 350
<title> Health and Computer Terminals
<desc> Description: Is it hazardous to the health of individuals to work with computer terminals on a daily basis?
<narr> Narrative: Relevant documents would contain any information that expands on any physical disorder/problems that may be associated with the daily working with computer terminals. Such things as carpel tunnel, cataracts, and fatigue have been said to be associated, but how widespread are these or other problems and what is being done to alleviate any health problems.

Topic 30
<num> Number: 307
<title> New Hydroelectric Projects
<desc> Description: Identify hydroelectric projects proposed or under construction by country and location. Detailed description of nature, extent, purpose, problems, and consequences is desirable.
<narr> Narrative: Relevant documents would contain as a minimum a clear statement that a hydroelectric project is planned or construction is under way and the location of the project. Renovation of existing facilities would be judged not relevant
unless plans call for a significant increase in acre-feet or reservoir or a marked change in the environmental impact of the project. Arguments for and against proposed projects are relevant as long as they are supported by specifics, including as a minimum the name or location of the project. A statement that an individual or organization is for or against such projects in general would not be relevant. Proposals or projects underway to dismantle existing facilities or drain existing reservoirs are not relevant, nor are articles reporting a decision to drop a proposed plan.
Topic Number Generic Specific
1 √ 2 √ 3 √ 4 √ 5 √ 6 √ 7 √ 8 √ 9 √ 10 √ 11 √ 12 √ 13 √ 14 √ 15 √ 16 √ 17 √ 18 √ 19 √ 20 √ 21 √ 22 √ 23 √ 24 √ 25 √ 26 √ 27 √ 28 √ 29 √ 30 √
Appendix C
Model Participant Consent Form

Title of Project: Usability Evaluation and Retrieval Performance of 3 web search engines using TERS platform
Name of Researcher: Kessopoulou Eftychia
Participant Identification Number for this project:

Please initial box
1. I confirm that I have read and understand the information sheet dated 20 May 2005 for the above project and have had the opportunity to ask questions.
2. I understand that my participation is voluntary and that I am free to withdraw at any time without giving any reason.
3. I understand that my responses will be anonymised before analysis. I give permission for members of the research team to have access to my anonymised responses.
4. I agree to take part in the above project.

________________________ ________________ ____________________
Name of Participant Date Signature

_________________________ ________________ ____________________
Name of Person taking consent Date Signature
(if different from researcher)

_________________________ ________________ ____________________
Researcher Date Signature
Copies: One copy for the participant and one copy for the Principal Investigator / Supervisor.
Appendix D

Retrieval system identification
The file has the following format: <system name> <TAB> <url> <NEWLINE>

Search engine links loaded on TERS platform:
Google	http://www.google.com
AltaVista	http://www.altavista.com
Yahoo	http://www.yahoo.com

Topic construction (sample of the original .txt file)
The file has the following format: <topic id> <TAB> <title> <TAB> <description> <TAB> <narrative> <TAB> <search phrase> <NEWLINE>

0	Creativity	Find ways of measuring creativity	Relevant items include definitions of creativity, descriptions of characteristics associated with creativity, and factors linked to creativity	measurement of creativity
1	Bengals cat	Provide information on the Bengal cat breed	Item should include any information on the Bengal cat breed, including description, origin, characteristics, breeding program, names of breeders and catteries carrying bengals. References which discuss bengal clubs only are not relevant. Discussions of bengal tigers are not relevant	bengal cat breeding
2	Chevrolet Trucks	Find documents that address the types of Chevrolet trucks available	Relevant documents must contain information such as: the length, weight, cargo size, wheelbase, horsepower, cost, etc	types of Chevrolet trucks

Survey questions
General questions (Pre-session Questionnaire II)

ID	Type	Question	(Low -> High)
1	scaled	What kind of search engine user are you?	Beginner --> Expert
2	scaled	How often do you use the Internet to find information?	Monthly --> Daily
3	scaled	Does your search for information in that case always lead to satisfying results?	Never --> Always
4	scaled	How often do you use the advanced options when you search using the web search engines?	Never --> Always
5 scaled I know what "Boolean search" is. Strongly Disagree --> Strongly Agree
6 scaled I use "Boolean search". Strongly Disagree --> Strongly Agree
7 scaled I know what a "ranked list" is. Strongly Disagree --> Strongly Agree
8 scaled The "advanced" search mode in the web search engines is an effective tool. Strongly Disagree --> Strongly Agree
Survey questions assessed in all three search engines (Post-session Questionnaire)
9 scaled Are you satisfied with the search options (simple/advanced search) offered by the web search engine? Not satisfied --> Satisfied
10 scaled Are you satisfied with the flexibility of the web search engine to assist the user by filling in the search terms? Not satisfied --> Satisfied
11 scaled Are you satisfied when the web search engine refines the search (provide relevant results that narrow down the search)? Not satisfied --> Satisfied
12 scaled Are you satisfied with the searching of phrases in this search engine? Not satisfied --> Satisfied
13 scaled Are you satisfied with the presentation of the results by this search engine? Not satisfied --> Satisfied
14 scaled What do you think about the reading characters on the screen? Not satisfied --> Satisfied
15 scaled What do you think about the organization of the information on the screen? Not satisfied --> Satisfied
16 Boolean Do you prefer the search terms being highlighted and/or in bold in the search results? Yes --> No
17 scaled Do you think that the search terms highlighted or in bold help you to filter out the most useful results? Strongly disagree --> Strongly agree
18 scaled Are you satisfied when there is information helping you to know where you are? Not satisfied --> Satisfied
19 scaled Are you satisfied with the ranking of the results? Not satisfied --> Satisfied
20 scaled Are you satisfied with the relevance of the retrieved documents? Not satisfied --> Satisfied
21 scaled Are you satisfied with the response time of the search engine? Not satisfied --> Satisfied
22 scaled Are you satisfied with using this web search engine for your search as a whole? Not satisfied --> Satisfied
23 scaled Are you satisfied with the design of this web search engine? Not satisfied --> Satisfied
24 scaled Are you satisfied when the web search engine informs you about its progress? (e.g. Results 1 - 10 of about 16,000,000 for air pollution - 0.09 sec) Not satisfied --> Satisfied
25 scaled Are you satisfied with the ease-of-use of the search engine? Not satisfied --> Satisfied
26 open Do you have any additional remarks regarding the usability of the web search engine? - --> -
27 Boolean Did you use the advanced search? Yes --> No
28 open Are you interrupted while participating in the experiment? (e.g. phone rang, connection failed) - --> -
(Search options: Questions 9-12; Result Presentation: Questions 13-19; Relevance: Question 20; Response time: Question 21; Satisfaction Overall: Questions 22-28)

Usability Workload

Participant ID	Topic ID	System ID
1	0	1
1	1	1
1	2	2
2	0	1
2	1	2
2	2	3
3	0	2
3	1	3
3	2	1
4	0	1
4	16	1
4	17	2
5	0	1
5	16	2
5	17	3
6	0	2
6	16	3
6	17	1
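The usability workload rotates topics and search engines across participants so that no engine is systematically favoured by topic order or learning effects. One way such a counterbalanced assignment could be generated is sketched below; the rotation scheme here is purely illustrative and is not claimed to be the exact scheme used by TERS.

```python
def usability_workload(n_participants, topics_per_user, n_systems=3):
    """Assign each participant a block of topics, rotating the system id
    (1..n_systems) per row so systems are counterbalanced across participants."""
    workload = []  # rows of (participant_id, topic_id, system_id)
    for p in range(1, n_participants + 1):
        for t in range(topics_per_user):
            topic_id = ((p - 1) * topics_per_user + t) % 30  # 30 topics in the collection
            system_id = ((p - 1 + t) % n_systems) + 1        # shift the rotation per participant
            workload.append((p, topic_id, system_id))
    return workload

for row in usability_workload(3, 3):
    print(row)
```

With `topics_per_user` equal to the number of systems, every participant uses each search engine exactly once per block, which is the property the workload table above is designed to have.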
Retrieval Performance Workload

Participant ID	Topic ID
1	18
2	4
3	9
4	22
5	24
6	29
7	25
8	17
9	2
10	30
11	12
12	20
13	23
14	28
15	13
16	21
17	1
18	19
19	6
20	8
21	10
22	27
23	15
24	7
25	14
26	3
27	5
28	16
29	11
30	26

Run set-up
The file uploaded had the following format: <system> <TAB> <topic> <TAB> <rank> <TAB> <uri> <NEWLINE>

System	Topic	Rank	Uri
1	18	1	http://www.carbonmonoxidekills.com/
1	18	2	http://www.carbon-monoxide-poisoning.com/
1	18	3	http://www.nlm.nih.gov/medlineplus/carbonmonoxidepoisoning.html
1	18	4	http://www.epa.gov/iaq/pubs/coftsht.html
1	18	5	http://www.emedicinehealth.com/articles/13442-1.asp
1	18	6	http://www.cdc.gov/nceh/airpollution/carbonmonoxide/default.htm
1	18	7	http://www.cdc.gov/nceh/airpollution/carbonmonoxide/checklist.htm
1	18	8	http://www.osha.gov/OshDoc/data_General_Facts/carbonmonoxide-factsheet.pdf
1	18	9	http://www.postgradmed.com/issues/1999/01_99/tomaszewski.htm
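Since the run file is plain tab-separated text with one result per line, it is straightforward to parse programmatically. A minimal sketch of a parser for this format follows; the field names mirror the template above, and the function itself is the author's illustration rather than part of TERS.

```python
def parse_run_line(line):
    """Split one '<system>\t<topic>\t<rank>\t<uri>' record into typed fields."""
    system, topic, rank, uri = line.rstrip("\n").split("\t")
    return {"system": int(system), "topic": int(topic),
            "rank": int(rank), "uri": uri}

record = parse_run_line("1\t18\t1\thttp://www.carbonmonoxidekills.com/")
print(record["rank"], record["uri"])  # 1 http://www.carbonmonoxidekills.com/
```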
Pre-session questionnaire (I)
1. Age: □ below 20 □ 21-25 □ 26-30 □ 31-35 □ over 35
2. Gender: □ Male □ Female
3. You are: □ Undergraduate □ Postgraduate □ Research Student
4. Academic program that you are attending:
5. How many years have you been using web search engines? □ below 1 □ 1 □ 1-2 □ 3-4 □ 5-6 □ 7-8 □ 9-10 □ over 10
6. Which search engine do you use more often? (please choose one option only) □ Google □ Altavista □ Yahoo □ Alltheweb □ EntireWeb □ Search.com
7. How do you alter your search if you are not satisfied with the results you get?
(please choose one option only) □ do nothing/give up □ ask somebody for help □ use the “help” option that the web search engines provide □ change the web search engine □ use the “advanced” search mode □ change the query slightly □ change the query completely
Appendix E
Age
	Frequency	Percent	Valid Percent	Cumulative Percent
21-25	16	53.3	53.3	53.3
26-30	13	43.3	43.3	96.7
over 35	1	3.3	3.3	100.0
Total	30	100.0	100.0

Gender
	Frequency	Percent	Valid Percent	Cumulative Percent
Male	19	63.3	63.3	63.3
Female	11	36.7	36.7	100.0
Total	30	100.0	100.0

Postgraduate/Research Students
	Frequency	Percent	Valid Percent	Cumulative Percent
Postgraduate	26	86.7	86.7	86.7
Research Student	4	13.3	13.3	100.0
Total	30	100.0	100.0

Frequency of internet use to find information
	Frequency	Percent	Valid Percent	Cumulative Percent
5	2	6.7	6.7	6.7
6	6	20.0	20.0	26.7
7	22	73.3	73.3	100.0
Total	30	100.0	100.0
Scale 1-7; High: Daily, Low: Monthly
Search and Satisfying Results
	Frequency	Percent	Valid Percent	Cumulative Percent
3	1	3.3	3.3	3.3
4	2	6.7	6.7	10.0
5	11	36.7	36.7	46.7
6	12	40.0	40.0	86.7
7	4	13.3	13.3	100.0
Total	30	100.0	100.0
Scale 1-7; High: Always, Low: Never

Frequency of Advanced Option Use
	Frequency	Percent	Valid Percent	Cumulative Percent
1	4	13.3	13.3	13.3
2	7	23.3	23.3	36.7
3	3	10.0	10.0	46.7
4	2	6.7	6.7	53.3
5	4	13.3	13.3	66.7
6	7	23.3	23.3	90.0
7	3	10.0	10.0	100.0
Total	30	100.0	100.0
Scale 1-7; High: Always, Low: Never

Knowledge of Boolean Search
	Frequency	Percent	Valid Percent	Cumulative Percent
1	16	53.3	53.3	53.3
4	2	6.7	6.7	60.0
5	2	6.7	6.7	66.7
6	3	10.0	10.0	76.7
7	7	23.3	23.3	100.0
Total	30	100.0	100.0
Scale 1-7; High: Strongly agree, Low: Strongly disagree
Use of Boolean Search
	Frequency	Percent	Valid Percent	Cumulative Percent
1	18	60.0	60.0	60.0
3	2	6.7	6.7	66.7
4	2	6.7	6.7	73.3
5	4	13.3	13.3	86.7
6	1	3.3	3.3	90.0
7	3	10.0	10.0	100.0
Total	30	100.0	100.0
Scale 1-7; High: Strongly agree, Low: Strongly disagree

Knowledge of Ranked List
	Frequency	Percent	Valid Percent	Cumulative Percent
1	10	33.3	33.3	33.3
2	1	3.3	3.3	36.7
3	2	6.7	6.7	43.3
4	2	6.7	6.7	50.0
5	2	6.7	6.7	56.7
6	4	13.3	13.3	70.0
7	9	30.0	30.0	100.0
Total	30	100.0	100.0
Scale 1-7; High: Strongly agree, Low: Strongly disagree

Utility of Advanced Search
	Frequency	Percent	Valid Percent	Cumulative Percent
3	5	16.7	16.7	16.7
4	7	23.3	23.3	40.0
5	4	13.3	13.3	53.3
6	8	26.7	26.7	80.0
7	6	20.0	20.0	100.0
Total	30	100.0	100.0
Scale 1-7; High: Strongly agree, Low: Strongly disagree
Survey Questions (Post-session Questionnaire)

Question 9
system id	system name	# of answers	average	std dev
	OVERALL	90	4.96	1.40
3	Yahoo	30	4.96	1.35
1	Google	30	5.50	1.25
2	AltaVista	30	4.43	1.43

Question 10
system id	system name	# of answers	average	std dev
	OVERALL	90	4.92	1.34
1	Google	30	5.46	1.22
2	AltaVista	30	4.43	1.40
3	Yahoo	30	4.86	1.22

Question 11
system id	system name	# of answers	average	std dev
	OVERALL	90	4.94	1.43
3	Yahoo	30	5.30	1.26
1	Google	30	5.33	1.39
2	AltaVista	30	4.20	1.37

Question 12
system id	system name	# of answers	average	std dev
	OVERALL	90	4.86	1.45
1	Google	30	5.53	1.22
2	AltaVista	30	4.10	1.47
3	Yahoo	30	4.96	1.32

Question 13
system id	system name	# of answers	average	std dev
	OVERALL	90	4.87	1.68
3	Yahoo	30	5.36	1.47
1	Google	30	5.46	1.54
2	AltaVista	30	3.80	1.54
Question 14
system id	system name	# of answers	average	std dev
	OVERALL	90	5.13	1.55
1	Google	30	5.86	1.10
2	AltaVista	30	4.23	1.56
3	Yahoo	30	5.30	1.51

Question 15
system id	system name	# of answers	average	std dev
	OVERALL	90	4.93	1.74
3	Yahoo	30	5.16	1.59
1	Google	30	5.53	1.61
2	AltaVista	30	4.10	1.74
Question 16
system id	system name	# of answers	average	std dev
	OVERALL	90	0.82	0.57
1	Google	30	0.86	0.50
2	AltaVista	30	0.80	0.61
3	Yahoo	30	0.80	0.61

Question 17
system id	system name	# of answers	average	std dev
	OVERALL	90	5.32	1.57
3	Yahoo	30	5.16	1.74
1	Google	30	5.66	1.47
2	AltaVista	30	5.13	1.50

Question 18
system id	system name	# of answers	average	std dev
	OVERALL	90	5.56	1.32
1	Google	30	5.73	1.33
2	AltaVista	30	5.50	1.38
3	Yahoo	30	5.46	1.27
Question 19
system id	system name	# of answers	average	std dev
	OVERALL	90	4.54	1.55
3	Yahoo	30	4.70	1.48
1	Google	30	5.03	1.32
2	AltaVista	30	3.90	1.64

Question 20
system id	system name	# of answers	average	std dev
	OVERALL	90	4.54	1.45
1	Google	30	4.93	1.33
2	AltaVista	30	4.03	1.49
3	Yahoo	30	4.66	1.42

Question 21
system id	system name	# of answers	average	std dev
	OVERALL	90	5.67	1.39
3	Yahoo	30	5.40	1.49
1	Google	30	6.33	0.92
2	AltaVista	30	5.30	1.48
Question 22
system id	system name	# of answers	average	std dev
	OVERALL	90	4.77	1.71
1	Google	30	5.83	1.34
2	AltaVista	30	3.76	1.79
3	Yahoo	30	4.73	1.33

Question 23
system id	system name	# of answers	average	std dev
	OVERALL	90	4.95	1.66
3	Yahoo	30	5.00	1.38
1	Google	30	6.00	1.14
2	AltaVista	30	3.86	1.69

Question 24
system id	system name	# of answers	average	std dev
	OVERALL	90	5.14	1.59
1	Google	30	5.43	1.56
2	AltaVista	30	4.83	1.70
3	Yahoo	30	5.16	1.51
Question 25
system id	system name	# of answers	average	std dev
	OVERALL	90	5.12	1.54
3	Yahoo	30	5.16	1.39
1	Google	30	6.16	0.87
2	AltaVista	30	4.03	1.49